TChecker: Fake News Detection on Social Media: Comparison
Please note this is a comparison between Version 1 by Nada Ayman GabAllah and Version 2 by Lindsay Dong.

The spread of fake news on social media continues to be one of the main challenges facing internet users, prohibiting them from discerning authentic from fabricated pieces of information. Detecting fake news is a problem tackled through different approaches that can be categorized mainly into a content-based approach and a social-based approach. In the content-based approach, the textual features are the main features, whereas in the social-based approach other features, including users’ engagements, users’ profile features, and network propagation features, are considered.

  • fake news
  • social media
  • news
  • BERT
  • BiLSTM

1. Introduction

In the era of Web 2.0, our interactions as well as our perceptions of information are changing. The ways of communication have evolved in recent decades in very fast and ground-breaking ways that are pushing some of the traditional ways of acquiring information into obsolescence. One of the most observable changes is in the media: in the near past, television, radio, and newspapers were the primary credible sources of news and information for everyone. Reporters used to race to the locations of events to get the first scoop. Live TV coverage of important events could get millions of people watching their TVs at the same time. Headlines in newspapers could change stock market values. Nowadays, it is rare to find someone reading a newspaper or watching the news on TV; instead, social media has become the most prominent source of information about news events.
Extracting information from social media has become a very rich area of research, as social media has become one of the fastest-growing sources of data about almost everything. As Web 2.0 and social media enabled internet users to contribute through their use, anyone could post and share data about anything, which in turn created a huge repository of data for everyone to access. However, with the power given to everyone to post and share anything on social media comes great responsibility for the content being posted or shared. Unfortunately, social media users do not usually validate or fact-check their posts before sharing them, as they tend to believe what is shared many times within their circle of friends. This phenomenon has been studied in the literature and is known as the validity effect [1]: people tend to place more belief in what is shared through their close circles, as this reinforces their feeling of validation. In addition, users tend to share posts that are aligned with their ideas and previous knowledge regardless of their truthfulness, which is known as confirmation bias [2]. These combined phenomena give individuals a false sense of credibility about any piece of news shared in their circle of acquaintance, leading them to share it with other circles in turn, and ultimately reducing the credibility of news sources themselves, which most often rely on social media as well to gather information [3].
A study presented in [4] on Twitter covering the period between 2006 and 2017 showed that fake information spreads faster and wider than true information. The authors found that fake news is 70% more likely to be retweeted, therefore reaching people much faster. The effect of spreading fake news across different social media platforms can be disastrous in many aspects of life. It can bias political campaigns and decisions, as happened in the “Brexit” referendum [5], the 2016 US elections [6], and the recent 2020 US elections as well. A single rumor can cause the stock market to lose millions of dollars; for example, a rumor about former US president Barack Obama being injured in an explosion cost the stock market millions of dollars [7]. With the emergence of the COVID-19 pandemic, many rumors spread over social networks about home remedies, off-the-shelf chemicals, and deadly side effects of new vaccines, putting at risk the lives of those who believed them [8].

2. Fake News Detection on Social Media

2.1. Content-Based Approach

The most adopted approach in the task of identifying fake news is the content-based approach. In this approach, the textual features of the news are used in different models to identify the veracity of the news. This approach has been widely applied in detecting fake news from news posts and social posts.

Fake News Detection from News Articles

Verifying the truthfulness of news is a crucial step in the domain of publishing news, and checking the credibility of the source of information is an undeniable part of the publishing process. In the media domain, verifying the news and its sources is the job of journalists and others who work in the field. Journalists usually check information against credible sources and verify that it is true before publishing it; that is, they perform manual fact-checking.
With the increase in the volume of data roaming the internet every second, automatic techniques stepped in to help with the fact-checking process. Natural language processing and information retrieval techniques are applied to automatically identify fake news. A binary translating embedding (B-TransE) model was introduced in [9] to detect fake news based on a knowledge base graph; the authors evaluated their model by applying it to check the news in the “Getting real about fake news” dataset provided by Kaggle. CompareNet [10] is an end-to-end graph neural model that compares fake news against a knowledge base using entities.
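Translating-embedding models of this family score a (head, relation, tail) triple by how well the head embedding plus the relation embedding lands near the tail embedding; a low distance suggests the claim is consistent with the knowledge base. The following is a minimal sketch of that scoring idea only (the 4-dimensional vectors are illustrative values, not trained embeddings, and the full B-TransE training procedure is not shown):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility score for the triple (h, r, t).
    Lower is better: a consistent triple satisfies h + r ≈ t."""
    return np.linalg.norm(h + r - t)

# Toy embeddings (illustrative values only).
h = np.array([0.1, 0.2, 0.3, 0.4])        # head entity
r = np.array([0.5, 0.1, 0.0, 0.1])        # relation
t_true = np.array([0.6, 0.3, 0.3, 0.5])   # tail consistent with h + r
t_fake = np.array([0.9, 0.9, 0.9, 0.2])   # inconsistent tail

# The consistent triple scores lower (more plausible).
assert transe_score(h, r, t_true) < transe_score(h, r, t_fake)
```

In a fact-checking setting, triples extracted from a news article would be scored against embeddings learned from a trusted knowledge base, and articles containing many high-distance triples would be flagged as suspect.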
The style-based approach relies on the content of the news post to detect its truthfulness based on the style of writing in the post. The style of writing can reveal the user’s intention to post false or true information. The style of the writing is represented as features to be fed to the model for detecting the truthfulness of the post. This approach was used by [11][12][13] in deception detection, where deception is defined as the bad intention of authors to post intentionally false information. Such features were used in detecting fake news from news articles in [14][15].
Different machine learning algorithms have been applied to detect fake news from news posts by applying classification models to the textual content of the news. An analysis of different classifiers’ performance on the LIAR dataset is presented in [16]; a comparison between the performance of Naïve Bayes, SVM, Random Forest, logistic regression, and a stochastic gradient classifier showed that the classifiers achieved comparable results, except for the stochastic gradient classifier, which performed worse than the others. Another comparison, between Naïve Bayes, Random Forest, passive aggressive, and LSTM models, is presented in [17]; the authors applied the four models to a dataset consisting of 11,000 English articles labeled as fake or real. They showed that the passive aggressive classifier with TF-IDF representation achieved the highest accuracy and F1 score, while the LSTM model achieved almost the same accuracy but a higher precision than the passive aggressive classifier.
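The TF-IDF plus passive aggressive combination mentioned above is straightforward to sketch with scikit-learn. The four-document corpus below is a toy stand-in for a labeled news dataset, not the 11,000-article dataset of the cited study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for labeled news articles (0 = real, 1 = fake).
texts = [
    "officials confirm the election results after a full recount",
    "scientists publish peer reviewed study on vaccine safety",
    "shocking secret cure that doctors do not want you to know",
    "celebrity reveals miracle trick to get rich overnight",
]
labels = [0, 0, 1, 1]

# TF-IDF features feed an online max-margin linear classifier.
clf = make_pipeline(TfidfVectorizer(),
                    PassiveAggressiveClassifier(random_state=0))
clf.fit(texts, labels)

pred = clf.predict(["miracle cure that doctors hide from you"])
```

The same pipeline shape works for any of the compared linear models; only the final estimator changes.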
A classification model based on BiLSTM and self-attention layers is presented in [18] and applied to a dataset provided by Kaggle that consists of news articles labeled as fake or not. The articles are represented using GloVe [19] embeddings, then fed to the BiLSTM layer, followed by the self-attention layer, and finally the classification layer. The authors compared their model to other models using different text representations, such as TF-IDF and BOW, and different neural networks, such as GRU, LSTM, and CNN.
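The self-attention layer in such a model pools the BiLSTM’s per-time-step hidden states into a single document vector by weighting the informative steps. A minimal NumPy sketch of that pooling step (dimensions and the random inputs are illustrative, and the BiLSTM itself is assumed to have produced `H`):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, w):
    """H: (T, d) hidden states from a BiLSTM over T time steps;
    w: (d,) learned scoring vector.
    Returns one (d,) document vector, a weighted sum over time."""
    scores = softmax(H @ w)   # (T,) attention weights summing to 1
    return scores @ H         # weighted combination of hidden states

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))  # 20 time steps, 8-dim BiLSTM states
w = rng.normal(size=8)
doc_vec = attention_pool(H, w)
assert doc_vec.shape == (8,)
```

The pooled vector then goes to the final classification layer, replacing the cruder choice of taking only the last hidden state.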
Upon the introduction of BERT in 2019 as a pre-trained language model using deep bidirectional transformers, a major change in performance on NLP tasks occurred. BERT differs from other deep learning embeddings, such as static word embeddings and sentence/document embeddings, in that it produces contextual representations: the same word is encoded differently depending on the sentence it appears in.
An evaluation of different language models, including BERT, RoBERTa [20], and DistilBERT [21], is presented in [22]; the authors also compared different architectures of neural network models, including a simple fully connected network, a CNN, and a combined CNN and RNN. They applied the models to different datasets of long and short text, including Twitter datasets. Their study showed that simple neural network models can perform better than sophisticated models and that, among the language models, RoBERTa performed slightly better on most of the datasets; however, all the language models’ performances were close to each other. Following the content-based approach, a BERT model followed by LSTM and fully connected layers is presented in [23]; the authors showed that vanilla BERT models could perform better than other content-based models. Moreover, adding the LSTM layer improved the model’s performance on the news titles of the PolitiFact dataset from the FakeNewsNet collection [24]. Korean fake news was detected by [25] using a BERT-based model trained on the Korean language and a BiLSTM for classification. A combination of three parallel blocks of 1D convolutional neural networks and a BERT model, applied to news articles from a dataset collected during the 2016 US elections and provided by Kaggle, is presented in [26]. The model showed better performance than models using GloVe representations and other architectures based on LSTM and CNN individually.
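The BERT-plus-LSTM design can be sketched as a small PyTorch head placed on top of the encoder. The BERT encoder itself is omitted here; a random tensor stands in for its (batch, sequence, 768) hidden states, and the layer sizes are illustrative rather than those of the cited paper:

```python
import torch
import torch.nn as nn

class BertLstmHead(nn.Module):
    """Sketch of an LSTM + fully connected head over BERT hidden
    states, in the spirit of the BERT+LSTM design described above."""
    def __init__(self, bert_dim=768, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)  # fake vs. real logits

    def forward(self, bert_states):
        # bert_states: (batch, seq_len, bert_dim)
        _, (h_n, _) = self.lstm(bert_states)    # final state: (1, B, H)
        return self.fc(h_n.squeeze(0))          # logits: (B, n_classes)

head = BertLstmHead()
fake_bert_output = torch.randn(4, 32, 768)  # stand-in for real BERT states
logits = head(fake_bert_output)
assert logits.shape == (4, 2)
```

In practice, `bert_states` would come from a pre-trained encoder’s last hidden layer, and the head would be fine-tuned jointly with or on top of it.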

2.2. Fake News Detection from Social Media

Research on detecting fake news from social media has been increasing. Different approaches are applied to detect fake news from social media posts in order to mitigate the harm caused by its spread. Early in 2016, a dataset was collected from Twitter by [27] at the time of five major events reported by journalists. A model combining a convolutional neural network and an LSTM was applied to this dataset, and the results are presented in [28]. The LSTM model achieved better results in terms of accuracy and F1 score than the CNN alone and than the LSTM and CNN combined.
With the emergence of the COVID-19 pandemic, researchers from different disciplines tried to find ways to reduce the effect of the pandemic worldwide. One way was to detect false information roaming social media. A dataset discussing different topics about the COVID-19 virus was collected and annotated from Twitter by [29]. They used Bag of Words (BOW) and n-grams to represent the text of the tweets and applied an ensemble of Naïve Bayes (NB), K-Nearest Neighbors (KNN), Random Forest, sequential minimal optimization (SMO), and voted perceptron (VP). They found that the highest F1 score was achieved by the vote ensemble classifier. Different machine learning models applied to classify a set of collected tweets about COVID-19 are presented in [30]. The results showed that Random Forest could outperform other models such as SVM, linear regression, and LSTM.
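A voting ensemble over BOW features, as used in the cited study, can be sketched with scikit-learn. Only three of the five cited classifiers are included here (SMO and the voted perceptron are Weka algorithms without direct scikit-learn equivalents), and the four-tweet corpus is a toy stand-in for the annotated dataset:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy tweets standing in for the annotated dataset (1 = misinformation).
tweets = [
    "drinking hot water kills the virus instantly",
    "garlic cures covid according to my neighbor",
    "health ministry opens new vaccination centers today",
    "who publishes updated guidance on mask use",
]
labels = [1, 1, 0, 0]

# BOW features feed a hard-voting ensemble: majority class wins.
vote = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("knn", KNeighborsClassifier(n_neighbors=1)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        voting="hard",
    ),
)
vote.fit(tweets, labels)
```

Hard voting takes the majority label across the base classifiers; `voting="soft"` would instead average the predicted class probabilities.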
Different BERT-based models have been used to detect misinformation from tweets, especially regarding COVID-19. COVID-Twitter-BERT, introduced in [31], is a BERT model pre-trained on a large corpus of tweets about COVID-19. The model is very topic-specific and can be useful in various NLP tasks related to representing tweets about COVID-19, such as detecting the sentiment of tweets in [32]. Another COVID-19 dataset was collected from Twitter, and ensemble machine learning was applied to it, in [33].
BERTweet followed by an output classification layer was used in [34] to detect misinformation from tweets. The model outperformed other text representation models such as GloVe. BERTweet was also compared with BERT cased and uncased pre-trained models for detecting fake tweets about COVID-19 in [35], where it showed the best performance among the BERT models. For Arabic tweets, Ref. [36] presented a deep learning model based on AraBERT, a BERT model trained on Modern Standard Arabic, to represent the tweets. The model uses the tweets’ text and user features to detect the veracity of the tweets. BiLSTM and CNN networks were used for classification, showing close performance.

2.3. Social-Based Approach

In an attempt to better understand fake news through users’ comments, an explainable fake news detection model is presented in [37]. The model relies on identifying comments that explain the core parts of a news article and how they are fake or not, and is based on a co-attention network between the news articles and their users’ comments. The authors also applied a ranking method to pick the most explanatory comments. They compared their work on the PolitiFact and GossipCop datasets to content-based models using news articles only and to models that consider users’ comments, and showed that their model could achieve better results than the models in comparison.
Capturing features from comments and using them along with a two-level convolutional neural network that learns representations from news content is presented in [38] as TCNN-URG. The model uses a conditional variational autoencoder for user comment generation to assist the news content classifier when user comments do not exist.
Integrating features of the text of the news article, users’ comments on it, and the source of the news is presented in [39] as the CSI model. The model consists of three parts: the first uses an LSTM to capture the temporal representation of the article, the second represents the user features, and the third concatenates the results of the earlier parts into a classification model. The experiments were performed on Twitter and Weibo datasets [40] and showed better results than content-based models.
TriFN is a tri-relationship embedding framework proposed by [41] that represents publisher–news relations and user–news interactions as embeddings and uses them together to detect fake news. The authors applied their model on the PolitiFact and BuzzFeed datasets.
Incorporating the article’s textual content, along with its creator and the subject of the news article, into a deep diffusive network model is proposed in [42]. Latent features are extracted from the articles’ text, their creators, and their subjects. The model uses a gated diffusive unit that accepts multiple inputs from different sources at the same time.