Leveraging Reddit for Suicidal Ideation Detection

Suicide is a major public-health problem that exists in virtually every part of the world. Hundreds of thousands of people die by suicide every year. The early detection of suicidal ideation is critical for suicide prevention. However, there are challenges associated with conventional suicide-risk screening methods. At the same time, individuals contemplating suicide are increasingly turning to social media and online forums, such as Reddit, to express their feelings and share their struggles with suicidal thoughts.

  • suicidal ideation detection
  • machine learning
  • natural language processing
  • text mining

1. Introduction

Suicide is a global public-health problem. According to the World Health Organization, approximately 703,000 people die by suicide every year [1]. It is the fourth leading cause of death worldwide among people aged 15 to 29. Moreover, it is estimated that there are more than 20 attempts for every completed suicide [2].
The causes of suicide are complex and result from the interaction of multiple factors that can be grouped into three categories: health factors, environmental factors, and factors related to personal history, such as childhood abuse or previous suicide attempts [3][4]. Other examples of suicide risk factors include mental disorders, physical illness, substance abuse, domestic violence, bullying, relationship problems, and other stressful life events. Due to the complexity of the problem, no single risk factor can reliably predict suicide [5]. For instance, despite the strong association between suicide and depression, a depression diagnosis alone has limited ability to predict suicide. More recently, the issue of suicide has been further exacerbated by the impact of the COVID-19 pandemic [6]. In particular, social isolation, which resulted from measures imposed to curb the spread of the virus, was linked to increased suicide risk.
People at risk of suicide fall into two classes: ideators and attempters [7]. Suicidal ideation is a broad term that describes thoughts and behaviors ranging from preoccupation with death to planning a suicide attempt [8]. Suicidal ideation can be passive or active. Passive suicidal ideation involves thinking about suicide and wishing to be dead, whereas active suicidal ideation implies intending and planning an attempt to take one's own life [8]. While passive suicidal ideation is believed to pose a lower risk, both types need to be carefully assessed by mental health professionals, since passive suicidal ideation can rapidly transform into the active form when a person's circumstances or health condition worsen [9].
The early detection of suicidal ideation expressed by an at-risk individual is key to effective prevention, as it facilitates timely intervention by mental health professionals [10]. However, there are several challenges associated with suicide prevention. They include (1) social stigma, (2) limited access to professional help, and (3) inadequate training of clinicians in dealing with suicidal patients [11]. The combination of these factors creates a new challenge—(4) fragmented professional care, which entails having large time gaps between mental health assessments [11].
At the same time, an increasing number of at-risk individuals are turning to online communication channels to express their feelings and discuss their suicidal thoughts [12][13][14]. This tendency prompted research that focuses on detecting suicide risk and other mental health issues on social networks and online forums by applying machine learning (ML) and natural language processing (NLP) techniques [10][13][15]. The quantifiable signals in user-generated online data help researchers gain insight into an individual's emotional state and detect cues indicative of suicidality [16][17]. The feasibility of such an approach has been demonstrated by numerous studies on different mental health disorders. For example, the authors of [18] used textual data from the Facebook posts of consenting study participants to predict, with high accuracy, depression diagnoses recorded in their electronic medical records using a logistic regression model. In [19], researchers used pre-trained machine learning models to detect negative changes in Twitter users' sentiment, stress, anxiety, and loneliness measures after the declaration of emergency in the US due to the COVID-19 pandemic.

2. Leveraging Reddit for Suicidal Ideation Detection

2.1. Detection of Suicidal Ideation on Social Media

The social stigma attached to suicidal ideation has a particularly significant effect. The fear of social stigma has been shown to discourage individuals at risk of suicide from discussing their experiences in person and seeking support [20][21][22][23]. Further, it undermines existing suicide-risk screening methods, such as questionnaires and interviews, since these require patients to explicitly disclose their suicidal intentions [24]. According to a meta-analysis of 71 studies, on average, nearly 80% of people in non-psychiatric settings (primary healthcare patients, the general population, military personnel, and incarcerated individuals) who died by suicide did not reveal their suicidal intentions when surveyed before their suicide attempt [25]. Thus, there is a need for novel suicidality detection methods that do not require face-to-face interactions [21]. In this regard, detecting suicidal ideation on online platforms can be more effective, since the anonymity of social media and forums enables people to openly share their struggles with suicidal thoughts without fear of judgment [11][16][26][27].
Although the Columbia-Suicide Severity Rating Scale (C-SSRS) has been widely used as a screening instrument, administering the C-SSRS may place a burden on health-care providers [28]. Therefore, another motivation for detecting suicidal ideation on online platforms is to reduce the load on the health-care system. The goal is to create a tool that automatically and instantaneously detects whether a user is exhibiting signs of suicidality based on their online activity, before any engagement with providers. Ideally, these screening tools should be highly scalable and adaptable so that they can be used with a variety of data sources and be readily integrated into existing health-care IT systems [10][28]. The adoption of such suicidal ideation detection tools can assist mental health professionals, and even those without specialized training (e.g., primary-care physicians and social workers), in quickly identifying individuals at risk and making informed decisions [23].
Studying online activity for suicidal ideation detection can also help address the challenges of fragmented care for existing patients [29]. Given that about 70% of psychiatric patients are active on social media, mental health professionals can monitor their online activity to obtain information relevant to patients' mental state during gaps in patient–clinician interactions [11]. In this scenario, suicidality detection tools can be employed to automatically detect signals of a deteriorating mental condition and alert health-care providers, prompting them to attend to a patient under their care [28].

2.2. Reddit as a Source for Suicidal Ideation Detection

Reddit has generated particular interest among researchers due to its distinctive characteristics. Reddit is a popular online forum covering a wide range of topics through subcommunities called subreddits [30]. Currently, there are over 13 billion posts and comments distributed across more than 100,000 active communities [31], and more than 50 million unique users interact with the platform in a single day. Researchers choose Reddit over other platforms as a data source for several reasons.

First, Reddit posts have a character limit of 40,000 characters, far higher than Twitter's 280 characters [22]. This gives users more space to express their suicidal thoughts and describe their emotional state in detail, and longer posts provide better insight into the author's mental state [23]. By analyzing long passages of text, researchers can capture and extract textual features that sufficiently indicate suicidal ideation [10][24].

Second, Reddit facilitates better anonymity [22][23]. As per Reddit's privacy policy, users are not required to provide any identifying personal information or an email address when creating an account [32]. The platform only requires a username and a password, and the former does not have to relate to an actual name. This is unlike other social media sites: for instance, Facebook requires either a phone number or an email address during sign-up, in addition to implementing a real-name policy that requires users to display their real names on their profiles [33]. Reddit users normally do not include their names and choose non-identifying, ambiguous usernames. This level of anonymity allows people at risk of suicide to express themselves in an uninhibited fashion, without fear of social stigma [10][23][24]. This is valuable for researchers, since an unconstrained account of one's experiences and feelings builds a genuine picture of the user's psychological state.

Third, Reddit has numerous specialized support forums dedicated to various mental health topics [23]. For example, the r/SuicideWatch subreddit is a subcommunity of 366,000 members where people share their suicidal thoughts, seek help, and provide support to others dealing with suicidal ideation [34][35]. This subreddit is extensively used by researchers as a source of suicidal posts to serve as positive samples in their datasets [10]. What further supports the validity of r/SuicideWatch as a source of genuine suicide-related posts is that the subreddit is monitored by moderators [22], who remove irrelevant posts and posts that violate the community rules, e.g., abuse, criticism, and spam [35].

2.3. Machine Learning Approach for Suicidal Ideation Detection

2.3.1. Data Collection

The first step in the process of building a classifier is obtaining a dataset containing sufficient posts for each class label. Having an accurate dataset with labeled examples is critical for the success of the ML model. The dataset is used to train and then test the model. The model’s predictive performance and its generalizability strongly depend on the quality and amount of training data. There are two broad data collection approaches adopted by the studies: collecting data directly from Reddit and using datasets created by other researchers.
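As an illustrative sketch of the first approach, posts can be pulled through Reddit's official API using the PRAW library. The credentials, subreddit choices, and post limit below are placeholder assumptions, not values used by the surveyed studies; in practice, data collection must also comply with ethics approval and platform terms.

```python
import praw  # third-party library: pip install praw

# Placeholder credentials: register a "script" app at
# https://www.reddit.com/prefs/apps to obtain real values.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="suicidal-ideation-research/0.1 (academic study)",
)

posts = []
# Posts from r/SuicideWatch serve as candidate positive samples and
# posts from an unrelated subreddit as candidate negative samples.
for subreddit_name, label in [("SuicideWatch", 1), ("CasualConversation", 0)]:
    for submission in reddit.subreddit(subreddit_name).new(limit=500):
        posts.append({
            "id": submission.id,
            "text": f"{submission.title} {submission.selftext}",
            "label": label,
        })
```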

2.3.2. Data Annotation

Supervised ML algorithms require annotated datasets. During the training stage, the algorithm learns a function that maps the input features to the target variable. To train a model to detect posts with suicidal ideation, researchers need examples of posts annotated as suicidal and non-suicidal, as in the sketch below. For the multiclass classification problem, posts annotated with different suicide risk levels are required.
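When multiple human annotators label each post, their individual judgments must be aggregated into a single label. Below is a minimal sketch of majority-vote aggregation; the label names and the tie-handling policy are illustrative assumptions, not a scheme prescribed by the surveyed studies.

```python
from collections import Counter

def majority_label(annotations):
    """Aggregate labels from several annotators by majority vote.

    Ties are returned as None so the post can be sent back for
    adjudication by an additional annotator."""
    top = Counter(annotations).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

# Hypothetical example: three annotators rate the same post.
print(majority_label(["suicidal", "suicidal", "non-suicidal"]))  # suicidal
print(majority_label(["suicidal", "non-suicidal"]))              # None (tie)
```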

2.3.3. Data Preprocessing

The data collected from Reddit consist of raw, unstructured text and contain noise that can negatively impact the predictive performance of the model. The noise includes punctuation, special characters, URLs, emails, etc. The raw text needs to be converted into a numerical representation before it can be fed into a classifier. During the preprocessing stage, the input data are cleaned and standardized. Therefore, it is an important step that lays the foundation for feature extraction and classification.
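A minimal cleaning routine along these lines might look as follows. The exact pipeline varies across studies; some additionally apply stop-word removal, stemming, or lemmatization.

```python
import re
import string

def preprocess(text: str) -> str:
    """Clean a raw Reddit post: lowercase, strip URLs, emails,
    punctuation, and special characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # emails
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("I can't sleep... see https://example.com :("))
# -> "i cant sleep see"
```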

2.3.4. Feature Engineering

To use ML algorithms, researchers need to extract features from the data. These features then serve as input to a classifier. Therefore, the quality of the extracted features is one of the factors that significantly affects the predictive performance of the model. Most studies combined techniques to extract different types of features. The researchers primarily focused on extracting features from the textual content of posts, as illustrated below. However, several studies also considered statistical metadata, such as the number of posts per user, the frequency of posting, and the number of votes [13][23].
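As a sketch of one widely used textual feature type, the snippet below extracts TF-IDF weighted unigrams and bigrams with scikit-learn; the toy corpus and parameter values are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "i feel hopeless and do not want to wake up tomorrow",
    "looking for recommendations for a good sci fi novel",
]

# Word-level TF-IDF over unigrams and bigrams, a common textual
# feature representation in this line of work.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (2, number_of_retained_ngrams)
```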

2.3.5. Model Development

All the studies in the corpus frame their contributions as building a predictive model that detects suicidal ideation from Reddit data. They tested multiple algorithms with different sets of features and proposed the best-performing models. In total, 21 supervised ML algorithms were explored by the researchers. Most studies (18 out of 26) included deep learning techniques. The researchers chose deep learning because, when used in conjunction with word embeddings, deep-learning-based models can effectively detect suicidal ideation without the need for manual feature engineering.
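As a hedged illustration of the general workflow rather than any specific model proposed in the surveyed studies, the sketch below trains a classical TF-IDF plus logistic regression baseline; the texts and labels are synthetic placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic placeholders; real inputs would be the preprocessed,
# annotated posts from the earlier stages.
texts = ["i cannot go on like this anymore", "great hike this weekend"] * 50
labels = [1, 0] * 50  # 1 = suicidal, 0 = non-suicidal

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# A classical baseline; the deep models mentioned above would instead
# feed word embeddings to, e.g., an LSTM or transformer.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```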

2.3.6. Model Validation

Once the predictive model is trained, its performance is evaluated. The most common evaluation metrics are accuracy, precision, recall, and F1-score; two studies also calculated the area under the curve (AUC) metric [11][23]. For the suicidality detection task, a true positive (TP) is a post correctly classified as suicidal, and a true negative (TN) is a post correctly classified as non-suicidal. A false positive (FP), also known as a Type I error, is a non-suicidal post misclassified as suicidal, while a false negative (FN), also known as a Type II error, is a suicidal post misclassified as non-suicidal. Accuracy measures the overall proportion of correct predictions, i.e., the ratio of correctly classified posts to the total number of posts [36]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Precision is the ratio of correctly classified suicidal posts to the total number of posts classified as suicidal (both correctly and incorrectly) [36]:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall, also called sensitivity or the true-positive rate, is the ratio of correctly classified suicidal posts to the total number of actual suicidal posts, i.e., those correctly classified plus those misclassified as non-suicidal [36]:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Recall is especially useful for selecting the best model when false-negative predictions carry a high cost [37]. In suicidal ideation detection, false positives are more tolerable than false negatives [38]. In other words, it is better to raise a false alarm by incorrectly flagging someone as suicidal than to miss someone who is genuinely at risk of suicide.
F1-score is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
For multiclass classification problems, the macro-averaged F1-score can be determined by calculating individual F1-scores for each class and taking their unweighted mean. The receiver operating characteristic (ROC) curve is a graph that plots the true-positive rate (recall) against the false-positive rate at different classification thresholds [39], where

$$\mathrm{False\ Positive\ Rate} = \frac{FP}{FP + TN}$$

It provides a graphical representation of the classifier's performance, and a larger area under the curve indicates better performance.
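A short sketch of how these metrics can be computed with scikit-learn; the gold labels, predictions, and scores below are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical gold labels and model outputs (1 = suicidal).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
# Macro-averaged F1, as used for multiclass risk-level problems.
print("Macro F1 :", f1_score(y_true, y_pred, average="macro"))
# AUC is computed from continuous scores rather than hard predictions.
print("AUC      :", roc_auc_score(y_true, y_score))
```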