Multilingual Evidence for Fake News Detection

Please note this is a comparison between Version 2 by Rita Xu and Version 1 by Mikhail Kuimov.

The rapid spread of deceptive information on the internet can have severe and irreparable consequences. As a result, it is important to develop technology that can detect fake news. Although significant progress has been made in this area, current methods are limited because they focus on only one language and do not incorporate multilingual information. Multiverse is a new feature based on multilingual evidence that can be used for fake news detection and can improve existing approaches.

fake news detection
multilinguality
news similarity

1. Introduction

The fast consumption of information from social media and news websites has become a daily routine for millions of users. Many readers have neither the time nor the interest (and/or skills) to fact-check every announced event. This opens up a wide range of opportunities to manipulate the opinions of citizens, one of which is fake news, which contains information about events that never happened in real life (or representations of real events in extremely narrow and biased ways). The effects of fake news can range from damaging the reputation of a person, organization, or country to inciting immediate emotional reactions that lead to destructive actions in the physical world.

Since the exploitation of Facebook to influence public opinion during the 2016 U.S. presidential election ^[1], there has been significant interest in fake news. However, the dissemination of false information not only misinforms readers but can also result in much more serious consequences. For instance, the spreading of a baseless rumor alleging that Hillary Clinton was involved in child sex trafficking led to a dangerous situation at a Washington D.C. pizzeria ^[2]. The global pandemic in 2020 led to the rise of an infodemic ^[3], which could have even more severe consequences by exacerbating the epidemiological situation and endangering people’s health. Furthermore, the events of 2022 also showed how politics and global events could be dramatically influenced by the spread of fake news. The Russia–Ukraine conflict was accompanied by an intense information war ^[4] featuring an enormous amount of fake stories. Beyond the political world, the World Cup 2022 was surrounded by rumors from both organizers and visitors that had an impact on security during the competition ^[5].

The issue of fake news has garnered significant public attention and has also become a subject of growing interest in academic circles. With the proliferation of online content, there is a great deal of optimism about the potential of automated methods for detecting fake news. Numerous studies have been conducted on fake news detection, utilizing a variety of information from diverse sources. While misinformation mitigation is represented in the artificial intelligence field via different tasks (e.g., stance detection, fact-checking, and source credibility classification), researchers focus on the supervised fake news classification task.

2. User Behavior for Fake News Detection

Before discussing automatic fake news detection methods, researchers analyze how real-life users react to fake information and how they check the veracity of information. In ^[6], a very broad analysis of users’ behavior was presented. The authors discovered that when people attempt to check information credibility, they rely on a limited set of features, such as:

Is this information compatible with other things that I believe to be true?
Is this information internally coherent? Do the pieces form a plausible story?
Does it come from a credible source?
Do other people believe it?

Thus, people can rely on the news text, its source, and their judgment. However, if they have enough internal motivation, they can also refer to some external sources for evidence. These external sources can be knowledgeable sources or other people. The conclusions from ^[7] repeat the previous results: individuals rely on both their judgment of the source and the message. When these factors do not provide a definitive answer, people turn to external resources to authenticate the news. In intentional and institutional acts of authentication, respondents sought confirmation from institutional sources, with some answering simply “Google”. Moreover, several works have explored methods to combat fake information received by users and convince them with facts. In ^[8], it was shown that explicitly emphasizing the myth and even repeating it alongside its refutation can help users pay attention and remember the truth. Additionally, participants who received messages across different media platforms ^[9] and different perspectives on the information ^[10] showed greater awareness of news evidence. Consequently, information obtained from external searches is an important feature for evaluating news authenticity and seeking evidence. Furthermore, obtaining different perspectives from different media sources adds more confidence to the decision-making process.

3. Fake News Detection Datasets

To support automatic fake news detection, several news datasets focused on misinformation have been created, each with a different labeling strategy. A comparison of all discussed datasets is presented in Table 1. The Fake News Challenge (http://www.fakenewschallenge.org, accessed on 14 December 2022), launched in 2016, was a big step in identifying fake news. The objective of FNC-1 was a stance detection task ^[11]. The dataset includes 300 topics, with 5–20 news articles each. In total, it consists of 50,000 labeled claim-article pairs. The dataset was derived from the Emergent project ^[12]. Another publicly available dataset is LIAR ^[13]. It contains 12,800 manually labeled short statements in various contexts collected from PolitiFact.com (https://www.politifact.com, accessed on 14 December 2022). They cover contexts such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained, with multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true. The Fact Extraction and VERification dataset (FEVER) ^[14] is also related to claim verification; 185,445 claims were manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED, or NOTENOUGHINFO. For the first two classes, the annotators also recorded the sentences forming the necessary evidence for their judgments. FakeNewsNet ^[15] contains two comprehensive datasets that include news content, social context, and dynamic information. Moreover, unlike all of the datasets described above, it stores a visual component in addition to the textual information. All news was collected via PolitiFact and GossipCop (https://www.gossipcop.com, accessed on 31 August 2021) crawlers. In total, 187,014 fake and 415,645 real news items were crawled.
Another dataset collected for supervised learning is the FakeNewsDataset ^[16]. The authors conducted a lot of manual work to collect and verify the data. As a result, they managed to collect 240 fake and 240 legitimate news items in 6 different domains: sports, business, entertainment, politics, technology, and education. All of the news articles in the dataset are from the year 2018. One large dataset is NELA-GT-2018 ^[17]. In this dataset, the authors attempted to overcome some limitations observed in previous works: (1) engagement-driven collection: the majority of the datasets, for news articles and claims, contained only data that were highly engaged with on social media or received attention from fact-checking organizations; (2) lack of ground truth labels: existing large-scale news article datasets did not have any form of labeling for misinformation research. To overcome these limitations, they gathered a wide variety of news sources with varying levels of veracity and scraped article data from the gathered sources’ RSS feeds twice a day for 10 months in 2018. As a result, a new dataset was created consisting of 713,534 articles from 194 news and media producers.
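
As a minimal sketch of how the fine-grained LIAR labels described above might be handled in code, the snippet below orders the six labels from least to most truthful and collapses them into a binary fake/real target. The cut-off (treating half-true and above as real) and the function name `to_binary` are assumptions made for illustration; the dataset itself does not prescribe a binary mapping.

```python
# The six LIAR truthfulness labels, ordered from least to most truthful.
LIAR_LABELS = [
    "pants-fire",
    "false",
    "barely-true",
    "half-true",
    "mostly-true",
    "true",
]


def to_binary(label: str, threshold: str = "half-true") -> int:
    """Return 1 (real) if `label` is at or above `threshold`, else 0 (fake).

    The default threshold is an illustrative assumption, not part of LIAR.
    """
    rank = LIAR_LABELS.index(label)  # raises ValueError for unknown labels
    return int(rank >= LIAR_LABELS.index(threshold))
```

Moving the threshold changes the class balance of the resulting binary task, so any comparison across papers should state which mapping was used.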

Table 1. The datasets covered in related work. The majority of datasets for fake news detection tasks are in English.

Dataset	Task	Language
FNC-1 ^[11]	Stance Detection	English
Arabic Claims Dataset ^[18]	Stance Detection	Arabic
FEVER ^[14]	Fact-Checking	English
DanFEVER ^[19]	Fact-Checking	Danish
LIAR ^[13]	Fake News Classification	English
FakeNewsNet ^[15]	Fake News Classification	English
FakeNewsDataset ^[16]	Fake News Classification	English
NELA-GT-2018 ^[17]	Fake News Classification	English
ReCOVery ^[20]	Fake News Classification	English
GermanFakeNC ^[21]	Fake News Classification	German
The Spanish Fake News Corpus ^[22]	Fake News Classification	Spanish

Due to the events of 2020, there has been ongoing work toward creating COVID-19 fake news detection datasets. The COVID-19 Fake News dataset ^[23] is based on information from public fact-verification websites and social media. It consists of 10,700 tweets (5600 real and 5100 fake posts) connected to the COVID-19 topic. In addition, the ReCOVery ^[20] multimodal dataset was created. It incorporates 140,820 labeled tweets and 2029 news articles on coronavirus collected from reliable and unreliable resources. However, all of the above datasets have one main limitation: they are monolingual and dedicated only to the English language. For languages other than English, the following datasets can be mentioned: the French satiric dataset ^[24], GermanFakeNC ^[21], The Spanish Fake News Corpus ^[22], and the Arabic Claims Dataset ^[18]. These datasets do not fully cover the multilingualism gap in fake news detection: they are monolingual as well and mostly cover fake news classification tasks, missing, for instance, fact verification and evidence generation problems.

4. Fake News Classification Methods

Based on the previously described datasets, multiple methods have been developed to tackle the problem of fake news classification. The feature sets used in existing methods can be divided into two categories: (1) internal features that can be obtained from different preprocessing strategies and a linguistic analysis of the input text; (2) external features that are extracted from a knowledge base, the internet, or social networks, and give additional information about the facts from the news, its propagation in social media, and users’ reactions. In other words, internal methods rely on the text itself while external methods rely on meta-information about the text.

4.1. Methods Based on Internal Features

Linguistic and psycholinguistic features are helpful in fake news classification tasks. In ^[16], a strong baseline model using such a feature set was built on the FakeNewsDataset. The set of features used in this work is as follows:

Ngrams: tf–idf values of unigrams and bigrams from a bag-of-words representation of the input text.
Punctuation such as periods, commas, dashes, question marks, and exclamation marks.
Psycholinguistic features extracted with the LIWC lexicon. Alongside some statistical information, LIWC also provides emotional and psychological analysis.
Readability that estimates the complexity of a text. The authors use content features such as the number of characters, complex words, long words, the number of syllables, word types, and others. In addition, they used several readability metrics, including the Flesch–Kincaid, Flesch Reading Ease, Gunning Fog, and Automatic Readability Index.
Syntax is a set of features derived from production rules based on context-free grammar (CFG) trees.
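
Three of the feature groups listed above (n-grams, punctuation, and readability) can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions, not the implementation used in the cited work: the tokenizer, the plain `log(N / df)` idf weighting, the choice of the Automated Readability Index as the single readability metric, and all function names are hypothetical. The LIWC and CFG-based syntax features are omitted because they need external resources.

```python
import math
import re
from collections import Counter

PUNCTUATION = ".,-?!"  # periods, commas, dashes, question and exclamation marks


def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """tf-idf over unigrams and bigrams; returns one {term: weight} dict per doc."""
    bags = []
    for doc in docs:
        tokens = tokenize(doc)
        bags.append(Counter(ngrams(tokens, 1) + ngrams(tokens, 2)))
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {t: (c / total) * math.log(len(docs) / df[t]) for t, c in bag.items()}
        )
    return weights


def punctuation_counts(text):
    """Count each punctuation mark of interest."""
    return {mark: text.count(mark) for mark in PUNCTUATION}


def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    chars = sum(len(w.replace("'", "")) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43
```

In practice, the per-document outputs of these functions would be concatenated into a single feature vector before being passed to a classifier.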

Using this feature set, the system yields strong results. For this reason, researchers rely on it as a baseline, further extending this set with the newly developed features. Based on such features, different statistical machine learning models can be trained. In ^[16], the authors trained an SVM classifier on the presented feature set. Naïve Bayes, Random Forest, KNN, and AdaBoost were also frequently used as fake news classification models ^[25][26][27]. In ^[28], the authors explore the potential of using emotional signals extracted from text to detect fake news. They analyzed the set of emotions present in true and fake news to test the hypothesis that trusted news sources do not use emotions to affect the reader’s opinion while fake news does. They discovered that emotions such as negative emotions, disgust, and surprise tend to appear in fake news and can give a strong signal for fake news classification. In addition to linguistic features, feature extraction strategies based on deep learning architectures were also explored. In ^[29], the classical CNN-based architecture for text classification was successfully applied to the fake news detection task. Given the recent surge in the use of Transformer architectures in natural language processing, models like BERT ^[30][31] and RoBERTa ^[32] have achieved high results in classifying general-topic fake news, as well as in detecting COVID-19-related fake news. In addition to text features, images mentioned in news articles can serve as strong indicators for veracity identification. Visual content can be manipulated, for instance, via deepfakes ^[33] or by combining images from different contexts in a misleading format ^[34]. While multimodal fake news detection is a developing field, several approaches have already been presented in ^[35][36].
It is evident that models based on internal feature sets have a significant advantage in their ease of use, as they do not require extensive additional time for feature extraction. Furthermore, such models can be highly efficient in terms of inference time and memory usage, as they rely solely on internal information from the input news. However, if researchers take into account the aspect of explainability for end users, the evidence generated from such internal features is unlikely to be sufficient to persuade the user of the model’s accuracy and to justify the label assigned to the news.
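
To make the idea of a statistical classifier over internal text features concrete, the following is a minimal sketch of a multinomial Naïve Bayes model with Laplace smoothing, written from scratch. It is not the model from any cited work: the whitespace tokenizer, the class name `NaiveBayes`, and the four toy training headlines are all hypothetical and exist only to show the mechanics.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        # Log prior of each class from its share of the training labels.
        self.prior = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        # Per-class token counts.
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + self.alpha * len(self.vocab)
            score = self.prior[c]
            for token in doc.lower().split():
                if token in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.counts[c][token] + self.alpha) / total)
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical toy data, invented for illustration only.
train_docs = [
    "shocking secret cure doctors hate",
    "miracle cure shocking truth revealed",
    "council approves new budget for schools",
    "city council announces road repairs budget",
]
train_labels = ["fake", "fake", "legit", "legit"]

model = NaiveBayes().fit(train_docs, train_labels)
```

A model like this is cheap to train and run, which matches the efficiency argument above, but its per-token log-probabilities offer little that an end user would accept as evidence for a label.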

4.2. Methods Based on External Features

Although internal feature-based models can achieve high scores in the fake news classification task, the decisions of such models are hard to interpret. As a result, additional signals from external sources can add more confidence to the reasoning behind a model’s decision. If the news appears on a social network, information about the users who liked or reposted the item and the resulting propagation can serve as valuable features for fake news classification. It was shown in ^[37] that fake news tends to spread more quickly over social networks than true news. As a result, to combat fake news in the early stages of its appearance, several methods have been created to detect anomalous behavior in reposts or retweets ^[38][39]. In ^[40], different data about specific users were explored. The authors extracted locations, profile images, and political biases to create a feature set. User comments related to a news article can also serve as a valuable source of information for detecting fake news, and this approach was explored in ^[41]. The dEFEND system was created to explain fake news detection. The information from users’ comments was used to find related evidence and validate the facts from the original news. The Factual News Graph (FANG) system from ^[42] was presented to connect the content of news, news sources, and user interactions to create a complete social picture of the inspected news.
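
As a rough illustration of propagation-based external features, the sketch below derives a few summary statistics from a repost cascade. The edge format `(parent, child, hours_since_root)`, the function name `cascade_features`, and the specific statistics (size, depth, maximum breadth, early reshares) are assumptions chosen for illustration; they are not the feature sets used in the cited systems.

```python
from collections import Counter, defaultdict, deque


def cascade_features(root, edges, early_window_hours=1.0):
    """Summarize a repost cascade.

    edges: iterable of (parent_post, child_post, hours_since_root) tuples.
    """
    children = defaultdict(list)
    for parent, child, _ in edges:
        children[parent].append(child)

    # Breadth-first search from the original post to get each node's depth.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)

    breadth = Counter(depth.values())  # number of posts at each depth level
    return {
        "size": len(depth),                    # posts reachable from the root
        "depth": max(depth.values()),          # longest repost chain
        "max_breadth": max(breadth.values()),  # widest level of the cascade
        "early_reshares": sum(1 for _, _, t in edges if t <= early_window_hours),
    }


# Hypothetical cascade: r is the original post; a and b repost it within the
# first hour, c reposts a, and d reposts c.
example_edges = [("r", "a", 0.2), ("r", "b", 0.5), ("a", "c", 1.5), ("c", "d", 3.0)]
features = cascade_features("r", example_edges)
```

Statistics of this kind could be appended to text-based features, although they only become available once a post has started to spread.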

References

Allcott, H.; Gentzkow, M. Social media and fake news in the 2016 election. J. Econ. Perspect. 2017, 31, 211–236.
Kang, C.; Goldman, A. In Washington Pizzeria Attack, Fake News Brought Real Guns. New York Times, 5 December 2016; 5.
Alam, F.; Dalvi, F.; Shaar, S.; Durrani, N.; Mubarak, H.; Nikolov, A.; Da San Martino, G.; Abdelali, A.; Sajjad, H.; Darwish, K.; et al. Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms. arXiv 2021, arXiv:2007.07996.
Park, C.Y.; Mendelsohn, J.; Field, A.; Tsvetkov, Y. Challenges and Opportunities in Information Manipulation Detection: An Examination of Wartime Russian Media. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5209–5235.
Atalayar. Misinformation Confuses Qatar 2022 World Cup Fans. 2022. Available online: https://atalayar.com/en/content/misinformation-confuses-qatar-2022-world-cup-fans (accessed on 14 February 2023).
Lewandowsky, S.; Ecker, U.K.; Seifert, C.M.; Schwarz, N.; Cook, J. Misinformation and its correction: Continued influence and successful debiasing. Psychol. Sci. Public Interest. 2012, 13, 106–131.
Tandoc, E.C., Jr.; Ling, R.; Westlund, O.; Duffy, A.; Goh, D.; Wei, L.Z. Audiences’ acts of authentication in the age of fake news: A conceptual framework. New Media Soc. 2018, 20, 2745–2763.
Ecker, U.K.; Hogan, J.L.; Lewandowsky, S. Reminders and repetition of misinformation: Helping or hindering its retraction? J. Appl. Res. Mem. Cogn. 2017, 6, 185–192.
Zhao, W. Misinformation Correction across Social Media Platforms. In Proceedings of the 2019 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 5–7 December 2019; pp. 1371–1376.
Geeng, C.; Yee, S.; Roesner, F. Fake News on Facebook and Twitter: Investigating How People (Don’t) Investigate. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–14.
Hanselowski, A.; PVS, A.; Schiller, B.; Caspelherr, F.; Chaudhuri, D.; Meyer, C.M.; Gurevych, I. A Retrospective Analysis of the Fake News Challenge Stance-Detection Task. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Association for Computational Linguistics: Santa Fe, NM, USA, 2018; pp. 1859–1874.
Silverman, C. Emergent: A Real-Time Rumor Tracker. 2017; pp. 12–13. Available online: http://www.emergent.info/ (accessed on 31 August 2021).
Wang, W.Y. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv 2017, arXiv:1705.00648.
Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; Volume 1, pp. 809–819.
Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media. arXiv 2018, arXiv:1809.01286.
Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 3391–3401.
Nørregaard, J.; Horne, B.D.; Adalı, S. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media, Munich, Germany, 11–14 June 2019; Volume 13, pp. 630–638.
Hasanain, M.; Suwaileh, R.; Elsayed, T.; Barrón-Cedeno, A.; Nakov, P. Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality. In Proceedings of the CLEF, Lugano, Switzerland, 9–12 September 2019.
Nørregaard, J.; Derczynski, L. DanFEVER: Claim verification dataset for Danish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland, 31 May–2 June 2021; Linköping University Electronic Press: Reykjavik, Iceland, 2021; pp. 422–428.
Zhou, X.; Mulay, A.; Ferrara, E.; Zafarani, R. Recovery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 20 October 2020; pp. 3205–3212.
Vogel, I.; Jiang, P. Fake News Detection with the New German Dataset “GermanFakeNC”. In Proceedings of the International Conference on Theory and Practice of Digital Libraries, Paphos, Cyprus, 23–27 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 288–295.
Posadas-Durán, J.P.; Gomez-Adorno, H.; Sidorov, G.; Escobar, J.J.M. Detection of fake news in a new corpus for the Spanish language. J. Intell. Fuzzy Syst. 2019, 36, 4869–4876.
Patwa, P.; Sharma, S.; PYKL, S.; Guptha, V.; Kumari, G.; Shad Akhtar, M.; Ekbal, A.; Das, A.; Chakraborty, T. Fighting an Infodemic: COVID-19 Fake News Dataset. arXiv 2020, arXiv:2011.03327.
Liu, Z.; Shabani, S.; Balet, N.G.; Sokhn, M. Detection of satiric news on social media: Analysis of the phenomenon with a French dataset. In Proceedings of the 2019 28th International Conference on Computer Communication and Networks (ICCCN), Valencia, Spain, 29 July–1 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
Choudhary, A.; Arora, A. Linguistic feature based learning model for fake news detection and classification. Expert Syst. Appl. 2021, 169, 114171.
Sharma, K.; Qian, F.; Jiang, H.; Ruchansky, N.; Zhang, M.; Liu, Y. Combating fake news: A survey on identification and mitigation techniques. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–42.
Gravanis, G.; Vakali, A.; Diamantaras, K.; Karadais, P. Behind the cues: A benchmarking study for fake news detection. Expert Syst. Appl. 2019, 128, 201–213.
Ghanem, B.; Rosso, P.; Rangel, F. An emotional analysis of false information in social media and news articles. ACM Trans. Internet Technol. (TOIT) 2020, 20, 1–18.
Kaliyar, R.K.; Goswami, A.; Narang, P.; Sinha, S. FNDNet—A deep convolutional neural network for fake news detection. Cogn. Syst. Res. 2020, 61, 32–44.
Kaliyar, R.K.; Goswami, A.; Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 1–24.
Jwa, H.; Oh, D.; Park, K.; Kang, J.M.; Lim, H. exBAKE: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT). Appl. Sci. 2019, 9, 4062.
Glazkova, A.; Glazkov, M.; Trifonov, T. g2tmn at Constraint@ AAAI2021: Exploiting CT-BERT and Ensembling Learning for COVID-19 Fake News Detection. arXiv 2020, arXiv:2012.11967.
Agarwal, S.; Farid, H.; Gu, Y.; He, M.; Nagano, K.; Li, H. Protecting World Leaders Against Deep Fakes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Piscataway, NJ, USA, 2019; pp. 38–45.
Abdelnabi, S.; Hasan, R.; Fritz, M. Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 14920–14929.
La, T.; Tran, Q.; Tran, T.; Tran, A.; Dang-Nguyen, D.; Dao, M. Multimodal Cheapfakes Detection by Utilizing Image Captioning for Global Context. In Proceedings of the ICDAR@ICMR 2022: Proceedings of the 3rd ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, Newark, NJ, USA, 27–30 June 2022; Dao, M., Dang-Nguyen, D., Riegler, M., Eds.; ACM: New York, NY, USA, 2022; pp. 9–16.
Patwa, P.; Mishra, S.; Suryavardan, S.; Bhaskar, A.; Chopra, P.; Reganti, A.; Das, A.; Chakraborty, T.; Sheth, A.P.; Ekbal, A.; et al. Benchmarking Multi-Modal Entailment for Fact Verification (short paper). In CEUR Workshop Proceedings, Proceedings of the Workshop on Multi-Modal Fake News and Hate-Speech Detection (DE-FACTIFY 2022) Co-Located with the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), Virtual Event, Vancouver, BC, Canada, 27 February 2022; Das, A., Chakraborty, T., Ekbal, A., Sheth, A.P., Eds.; CEUR-WS.org: Aachen, Germany, 2022; Volume 3199, Available online: CEUR-WS.org (accessed on 31 August 2021).
Zhao, Z.; Zhao, J.; Sano, Y.; Levy, O.; Takayasu, H.; Takayasu, M.; Li, D.; Wu, J.; Havlin, S. Fake news propagates differently from real news even at early stages of spreading. EPJ Data Sci. 2020, 9, 7.
Liu, Y.; Wu, Y.F. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
Shu, K.; Wang, S.; Liu, H. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, 11–15 February 2019; pp. 312–320.
Shu, K.; Zhou, X.; Wang, S.; Zafarani, R.; Liu, H. The role of user profiles for fake news detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 436–439.
Shu, K.; Cui, L.; Wang, S.; Lee, D.; Liu, H. defend: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 395–405.
Nguyen, V.H.; Sugiyama, K.; Nakov, P.; Kan, M.Y. FANG: Leveraging social context for fake news detection using graph representation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 1165–1174.