Browsing by Author "Strydom, Irene Francesca"
- Natural language processing for characterising the COVID-19 infodemic on South African twitter. (Stellenbosch : Stellenbosch University, 2024-03) Strydom, Irene Francesca; Grobler, Jacomine; Stellenbosch University. Faculty of Engineering. Dept. of Industrial Engineering. ENGLISH ABSTRACT: The novel coronavirus disease of 2019 (COVID-19) was first detected in the city of Wuhan, China, and quickly spread to countries all around the globe. On the advice of the World Health Organization (WHO), governments imposed lockdowns, social distancing, mask mandates, and other preventive measures that completely disrupted the daily lives of billions of people. Along with the disruption of daily life came fear, confusion, and anxiety, as news about the virus began circulating. Despite the attempts of the WHO and national governments to provide accurate information about the virus and prevent panic, rumours about its origin, effects, and cures surfaced on websites and social media. COVID-19 rumours became so prominent during the height of the pandemic that their spread became known as an “infodemic”, and social media has been identified as a major contributing factor. The COVID-19 pandemic has exposed the potential harm that can be caused by misinformation and disinformation spread on social media. Scholars have responded by analysing content on social media to identify different kinds of misleading information about COVID-19 and to quantify how far it has spread. These studies make use of automated machine learning (ML) and natural language processing (NLP) techniques to analyse the large amounts of data present on social media. South Africa has, unfortunately, escaped neither the pandemic nor the infodemic. The full extent of the infodemic on South African social media, in contrast with that of other countries, is still unknown. ML and NLP techniques provide an opportunity to address this gap in research and characterise mis-/disinformation on South African social media.
In this dissertation, two approaches were followed to characterise misleading information on South African Twitter. The first is a supervised ML approach that made use of a combination of transformer-based embedding models and feedforward neural network classifiers. The models were trained, optimised, and evaluated on publicly available, labelled COVID-19 Twitter misinformation datasets. The best performing model, LAMBERT, was then applied to unlabelled South African Tweets about the COVID-19 pandemic. Although the model performed well on the labelled test data (obtaining an F1-score of 89.9%), the model failed to reliably distinguish between mis-/disinformation Tweets and general Tweets in the unlabelled South African dataset. The second approach made use of an unsupervised topic modelling algorithm, BERTopic, to divide the unlabelled South African Tweets into coherent topics. The BERTopic model was trained and optimised on the unlabelled South African Tweets and produced 34 topics. By inspecting the representative terms and Tweets assigned to each topic, instances of mis-/disinformation were identified. The unsupervised approach was then refined by defining three novel procedures, namely discrete dynamic topic modelling (DDTM), topic evolution network formation (TENF), and topic characterisation (TC), to model the development of topics over time and characterise the extracted topics in terms of their textual, spatial, temporal and community facets. Using these procedures, networks of topics (including mis-/disinformation topics) were identified in the collected Twitter data. Lastly, these procedures were abstracted and combined to form a novel, generalised topic characterisation framework.
This dissertation presents the first large-scale analysis of South African Twitter specifically aimed at characterising and mapping information disorder in the context of COVID-19, helping to better define the information disorder landscape on social media in the Global South, and in South Africa in particular. The results described in the dissertation are a valuable departure point for future research, and the proposed framework provides a comprehensive yet flexible guide to characterising large corpora of text for domain experts and researchers alike.
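For readers unfamiliar with the F1-score reported for the supervised model, the following is a minimal sketch of how the metric is commonly computed for a binary classifier. The abstract does not state which variant (binary, macro, or weighted) was used, so the binary form is an assumption here, and the function name `f1_score` and the toy labels are purely illustrative and not taken from the dissertation.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = mis-/disinformation Tweet, 0 = general Tweet.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(f"F1 = {f1_score(y_true, y_pred):.3f}")
```

A high F1-score on a labelled test set only describes performance on data resembling that set, which is consistent with the abstract's observation that the model did not transfer reliably to the unlabelled South African Tweets.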