2. Background
Fake news has been defined both in broad and narrow terms and can be characterised by authenticity, intention and whether it is news at all
[28]. The broad definition includes non-factual content that misleads the public (e.g., deceptive and false news, disinformation and misinformation), rumour and satire, amongst others
[28]. The narrow definition focuses on intentionally false news published by a recognised news outlet
[28]. Extant research focuses on differentiating between fake news and true news, and on the types of actors that propagate fake news.
This paper is focused on the former, i.e., the attributes of the fake news itself. As such, it is concerned with identifying fake news based on characteristics such as writing style and quality
[29], word counts
[30], sentiment
[31] and topic-agnostic features (e.g., a large number of ads or a frequency of morphological patterns in text)
[32].
As discussed in the Introduction, the Internet, and in particular social media, is transforming public health promotion, surveillance, public response to health crises, as well as tracking disease outbreaks, monitoring the spread of misinformation and identifying intervention opportunities
[33,34]. The public benefits from improved and convenient access to easily available and tailored information, in addition to the opportunity to potentially influence health policy
[33,35]. It has had a liberating effect on individuals, enabling users to search for both health and vaccine-related content and exchange information, opinions and support
[36,37]. Notwithstanding this, research suggests that there are significant concerns about information inaccuracy and potential risks associated with the use of inaccurate health information, amongst others
[38,39,40]. Misinformation, disinformation and misinterpretation of health information can interfere with attempts to mitigate disease outbreaks, delay or result in failure to seek or continue legitimate medical treatment, and interfere with sound public health policy and attempts to disseminate public health messages by undermining trust in health institutions
[23,41].
Historically, the news media has played a significant role in Brazilian society
[42]. However, traditional media has been in steady decline in the last decade against the backdrop of media distrust (due to perceived media bias and corruption) and the rise of the Internet and social media
[43]. According to the Reuters Institute Digital News Report 2020
[44], the Internet (including social media) is the main source of news in Brazil. It is noteworthy that Brazil is one of a handful of countries where across all media sources the public prefers partial news, a factor that can create a false sense of uniformity and validity and foster the propagation of misinformation
[44]. While Facebook is a source of misinformation concern in most countries worldwide, Brazil is relatively unique in that WhatsApp is a significant channel of news and misinformation
[44]. This preference for partial news sources and social media in Brazil has led to significant issues in the context of COVID-19.
From the beginning of the COVID-19 pandemic, the WHO has reported on a wide variety of misinformation related to COVID-19
[11]. These include unsubstantiated claims and conspiracy theories related to hydroxychloroquine, reduced risk of infection, 5G mobile networks and sunny and hot weather, amongst others
[11]. What differs in the Brazilian context is that the Brazilian public has been exposed to statements from the political elite, including the Brazilian President, that have contradicted the Brazilian Ministry of Health, pharmaceutical companies and health experts. Indeed, the political elite in Brazil have actively promoted many of the misleading claims identified by the WHO. This has included statements promoting erroneous information on the effects of COVID-19, “cures” and treatments unsupported by scientific evidence and an end to social distancing, amongst others
[45]. These statements by government officials become news, which lends them legitimacy. As vaccines and vaccination programmes to mitigate COVID-19 become available, such statements not only sow mistrust in health systems but also provide additional legitimacy to anti-vaccination movements that focus on similar messaging strategies, e.g., questioning the safety and effectiveness of vaccines, sharing conspiracy theories, publishing general misinformation and rumours, claiming that Big Pharma and scientific experts are not to be trusted, stating that civil liberties and individual freedom of choice are endangered, questioning whether vaccinated individuals spread diseases and promoting alternative medicine
[46,47,48].
While vaccines and vaccinations are a central building block of efforts to control and reduce the impact of COVID-19, vaccination denial and misinformation propagated by the anti-vaccination movement represent a tension between freedom of speech and public health. Social network platforms have been reluctant to intervene on this topic and on misinformation in general
[49]; however, there are indications that this attitude is changing, particularly in the context of COVID-19
[50]. Even where platforms wish to curb misinformation, identifying fake news and misinformation in general is labour intensive, and moderation is particularly difficult on closed networks such as WhatsApp. Scaling such monitoring requires automation. While over 282 million people speak Portuguese worldwide, commercial tools and research have overwhelmingly focused on the most popular languages, namely English and Chinese. This may be due to the concentration of Portuguese speakers in a relatively small number of countries. Over 73% of native Portuguese speakers are located in Brazil and a further 24% in just three other countries: Angola, Mozambique and Portugal
[51]. As discussed earlier, it is important to note that Portuguese as a language is pluricentric and Brazilian Portuguese is highly diglossic, thus requiring native language datasets for accurate classification.
3. Current Works
Research on automated fake news detection typically falls into two main categories: approaches based on knowledge and those based on style
[20]. Style-based fake news detection, the focus of this article, attempts to analyse the writing style of the target article to identify whether there is an attempt to mislead the reader. These approaches typically rely on binary classification techniques to classify news as fake or not based on general textual features (lexicon, syntax, discourse and semantic), latent textual features (word, sentence and document) and associated images
[20]. These are typically based on data mining and information retrieval, natural language processing (NLP) and machine learning techniques, amongst others
[20,52].
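To make the notion of general textual features more concrete, the following is a minimal sketch of how a handful of surface-level style features might be extracted from a single news item before being passed to a binary classifier. The function name style_features and the particular features chosen (word count, sentence count, average word length, type-token ratio, exclamation marks and share of upper-case characters) are illustrative assumptions and are not drawn from the studies cited above.

```python
import re


def style_features(text):
    """Return a few illustrative surface-level style features for one document.

    The feature set is hypothetical; it only shows the kind of lexical
    signals a style-based detector could use.
    """
    # Split on runs of sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware in Python 3, so accented Portuguese letters are kept.
    words = re.findall(r"\w+", text.lower())
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "exclamation_count": text.count("!"),
        "uppercase_ratio": sum(c.isupper() for c in text) / len(text) if text else 0.0,
    }


# Toy usage on an invented headline-like string.
example = style_features("Cura milagrosa! Compartilhe agora!")
```

In practice, such hand-crafted features would typically be combined with richer syntactic, semantic or latent representations rather than used alone.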
This study compares machine learning and deep learning techniques for fake news detection.
There is well-established literature on the use of traditional machine learning for both knowledge-based and style-based detection. For example, naive Bayes
[53,54], support vector machine (SVM)
[54,55,56,57,58], Random Forest
[59,60] and XGBoost
[59,61] are widely cited in the literature. Similarly, a wide variety of deep learning techniques have been used, including convolutional neural networks (CNNs)
[62,63,64,65], long short-term memory (LSTM)
[66,67], recurrent neural networks (RNN) and gated recurrent unit (GRU) models
[66,67,68], other deep neural network architectures
[69,70,71] and ensemble approaches
[63,72,73].
While automated fake news detection has been explored in health and disease contexts, the volume of research has expanded rapidly since the commencement of the COVID-19 pandemic. While a comprehensive review of the literature is beyond the scope of this article, four significant trends are worthy of mention. Firstly, although some studies use a variety of news sources (e.g.,
[74]) and multi-source datasets such as CoAID
[75], the majority of studies focus on data sets comprising social media data and specifically Twitter data, e.g.,
[76,77]. This is not wholly surprising, as the Twitter API is easily accessible and public data sets on the COVID-19 discourse have been made available, e.g.,
[78,79,80]. Secondly, though a wide range of machine learning and deep learning techniques feature in studies, including CNNs, LSTMs and others, there is a notable increase in the use of bidirectional encoder representations from transformers (BERT)
[74,76,77]. This can be explained by the relative recency and availability of BERT as a technique and early performance indicators. Thirdly, and related to the previous points, few of the datasets or studies identified use a Brazilian Portuguese language corpus and a Brazilian empirical context. For example, the COVID-19 Twitter Chatter dataset features English, French, Spanish and German language data
[79]. CoAID does not identify its language, but all sources and search queries identified are English language only. The Real Worry Dataset is English language only
[80]. The dataset described in
[78] does feature a significant portion of Portuguese tweets; however, none of the keywords used are in the Portuguese language and the data is Twitter only. Similarly, the MM-COVID dataset features 3981 fake news items and 7192 trustworthy items in six languages including Portuguese
[81]. While Brazilian Portuguese is included, it would appear both European and Brazilian Portuguese are labelled as one homogeneous language, and the total number of fake Portuguese language items is relatively small (371).
Notwithstanding the foregoing, there has been a small number of studies that explore fake news in the Brazilian context. Galhardi et al.
[82] used data collected from Eu Fiscalizo, a crowdsourcing tool where users can send content that they believe is inappropriate or fake. Analysis suggests that fake news about COVID-19 is primarily related to homemade methods of COVID-19 prevention or cure (85%), largely disseminated via WhatsApp
[82]. While this study is consistent with other reports, e.g.,
[44], it comprises a small sample (154 items) and classification is based on self-reports. In line with
[83,84], Garcia Filho et al.
[85] examined temporal trends in COVID-19. Using Google Health Trends, they identified a sudden increase in interest in issues related to COVID-19 from March 2020 after the adoption of the first social distancing measures. Of specific interest to this paper is the suggestion by Garcia Filho et al. that unclear messaging between the President, State Governors and the Minister of Health may have resulted in a reduction in search volumes. Ceron et al.
[86] proposed a new Markov-inspired method for clustering COVID-19 topics based on evolution across a time series. Using a dataset of 5115 tweets published by two Brazilian fact-checking organisations, Aos Fatos and Agência Lupa, their analysis also revealed a complex intertwining between politics and the health crisis during the period under study.
Fake news detection is a relatively new phenomenon. Monteiro et al.
[87] presented the first reference corpus in Portuguese focused on fake news, the Fake.Br corpus, in 2018. The Fake.Br corpus comprises 7200 true and fake news items and was used to evaluate an SVM approach to automatically classify fake news messages. The SVM model achieved 89% accuracy using five-fold cross-validation. Subsequently, the Fake.Br corpus was used to evaluate other techniques to detect fake news. For example, Silva et al.
[88] compare the performance of six techniques to detect fake news, i.e., logistic regression, SVM, decision tree, Random Forest, bootstrap aggregating (bagging) and adaptive boosting (AdaBoost). The best F1 score, 97.1%, was achieved by logistic regression when stop words were not removed and the traditional bag-of-words (BoW) was applied to represent the text. Souza et al.
[89] proposed a linguistic-based method based on grammatical classification, sentiment analysis and emotions analysis, and evaluated five classifiers, i.e., naive Bayes, AdaBoost, SVM, gradient boost (GB) and K-nearest neighbours (KNN) using the Fake.Br corpus. GB presented the best accuracy, 92.53%, when using emotion lexicons as complementary information for classification. Faustini et al.
[90] also used the Fake.Br corpus and two other datasets, one comprising fake news disseminated via WhatsApp, as well as a dataset comprising tweets, to compare four different techniques in one-class classification (OCC)—SVM, document-class distance (DCD), EcoOCC (an algorithm based on k-means) and naive Bayes classifier for OCC. All algorithms performed similarly with the exception of the one-class SVM, which showed greater F-score variance.
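Two of the building blocks mentioned in this paragraph, the bag-of-words (BoW) representation and five-fold cross-validation, can be illustrated with a short sketch. The code below is a simplified illustration only and is not the pipeline used in any of the cited studies. The function names bag_of_words and k_fold_indices, the whitespace tokenisation and the toy sentences are all assumptions, and the folds are built without shuffling or stratification, which a real evaluation would normally add.

```python
from collections import Counter


def bag_of_words(docs, stop_words=frozenset()):
    """Map each document to a term-count vector over a shared vocabulary.

    Passing an empty stop_words set keeps every token, mirroring the
    setting in which stop words are not removed.
    """
    tokenised = [[t for t in d.lower().split() if t not in stop_words] for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; each item lands in exactly one test fold."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy, invented sentences; not taken from any corpus discussed here.
docs = ["a vacina é segura", "a cura é falsa"]
vocab_all, vectors_all = bag_of_words(docs)
vocab_nostop, vectors_nostop = bag_of_words(docs, stop_words={"a", "é"})
folds = list(k_fold_indices(12, k=5))
```

Comparing vectors_all with vectors_nostop shows how removing stop words shrinks the vocabulary, which is the design choice that the comparison of BoW settings turns on.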
More recently, the Digital Lighthouse project at the Universidade Federal do Ceara in Brazil has published a number of studies and datasets relating to misinformation on WhatsApp in Brazil. These include FakeWhatsApp.BR
[91] and COVID19.BR
[92,93]. The FakeWhatsApp.BR dataset contains 282,601 WhatsApp messages from users and groups from all Brazilian states, collected from 59 groups from July 2018 to November 2018
[91]. The FakeWhatsApp.BR corpus contains 2193 messages labelled misinformation and 3091 messages labelled non-misinformation
[91]. The COVID19.BR dataset contains messages from 236 open WhatsApp groups with at least 100 members, collected from April 2020 to June 2020. The corpus contains 2043 messages, 865 labelled as misinformation and 1178 labelled as non-misinformation. Both datasets contain similar data, i.e., message text, time and date, phone number, Brazilian state, word count, character count and whether the message contained media
[91,93]. Cabral et al.
[91] combined classic natural language processing approaches for feature extraction with nine different machine learning classification algorithms to detect fake news on WhatsApp, i.e., logistic regression, Bernoulli and complement naive Bayes, SVM with a linear kernel (LSVM), SVM trained with stochastic gradient descent (SGD), SVM trained with an RBF kernel, K-nearest neighbours, Random Forest (RF), gradient boosting and a multilayer perceptron neural network (MLP). The best performing results were generated by MLP, LSVM and SGD, with a best F1 score of 0.73; however, when short messages were removed, the best performing F1 score rose to 0.87. Using the COVID19.BR dataset, Martins et al.
[92] compared machine learning classifiers to detect COVID-19 misinformation on WhatsApp. Similar to their earlier work
[91], they tested LSVM and MLP models to detect misinformation in WhatsApp messages, in this case related to COVID-19. Here, they achieved a highest F1 score of 0.778; an analysis of errors indicated errors occurred primarily due to short message length. In Martins et al.
[93], they extend their work to detect COVID-19 misinformation in Brazilian Portuguese WhatsApp messages using bidirectional long short-term memory (BiLSTM) neural networks, pooling operations and an attention mechanism. This solution, called MIDeepBR, outperformed their previous proposal as reported in
[92] with an F1 score of 0.834.
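Since the studies above are compared mainly on their F1 scores, a brief sketch of how a binary F1 score is computed may be useful. It is the harmonic mean of precision and recall for the positive class, taken here to be misinformation. The function name f1_score and the toy labels are illustrative assumptions; the cited studies may report macro-averaged or weighted variants, which this sketch does not cover.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the chosen positive class; returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = misinformation, 0 = non-misinformation); invented for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

Because F1 ignores true negatives, it is less flattering than accuracy when the misinformation class is the minority, which is one reason it is commonly preferred for imbalanced corpora.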
In contrast with previous research, we build and present a new dataset comprising fake news in the Brazilian Portuguese language relating exclusively to COVID-19 in Brazil. In contrast with Martins et al. [92] and Cabral et al. [91], we do not use a WhatsApp dataset, which may, due to its nature, be dominated by L-variant Brazilian Portuguese. In addition, the dataset used in this study covers a longer period (12 months) than those of Martins et al. [92,93] and Cabral et al. [91]. Furthermore, unlike Li et al. [81], we specifically focus on the Brazilian Portuguese language as distinct from European or African variants, and the number of items in our dataset is significantly larger than in, for example, the MM-COVID dataset. In addition to an exploratory data analysis of the content, we evaluate and compare machine learning and deep learning approaches for detecting fake news. In contrast with Martins et al. [93], we include gated recurrent units (GRUs) and evaluate both unidirectional and bidirectional GRUs and LSTMs, as well as machine learning classifiers.