Mining social web text has been at the heart of the Natural Language Processing and Data Mining research community in the last 15 years. Though most of the reported work is on widely spoken languages, such as English, the significance of approaches that deal with less commonly spoken languages, such as Greek, is evident for reasons of preserving and documenting minority languages, cultural and ethnic diversity, and identifying intercultural similarities and differences.
1. Introduction
Over recent years, social web text (also known as social text) processing and mining has attracted the focus of the Natural Language Processing (NLP), Machine Learning (ML) and Data Mining research communities. The increasing number of users connecting through social networks and web platforms, such as Facebook and Twitter, as well as numerous blogs and wikis, continuously creates a significant volume of written communication on the Web [1,2,3,4,5,6,7]. The amount and quality of information and knowledge extracted from social text has been considered crucial to studying and analyzing public opinion [1,3,5,8,9], as well as linguistic [2,7,10,11,12,13,14,15] and behavioral [4,6,16,17,18] patterns. In its typical form, social text is often short, low in readability, informal and syntactically unstructured, and it is characterized by great morphological diversity, features of oral speech, misspellings and slang vocabulary, which present major challenges for NLP and Data Mining tasks [2,4,7,10,11,13,14,15,16,19]. Therefore, several works have attempted to develop tools to extract meaningful information from this type of text, with applications in numerous fields, such as offensive behavior detection, opinion-mining, politics analysis, marketing and business intelligence. Capturing public sentiment on matters related to social events, political movements, marketing campaigns and product preferences relies on emotion processing methodologies, which are being developed in the inter-compatible Web. In this respect, the combination of several academic disciplines (interdisciplinarity) allows experts to develop “affect-sensitive” systems through syntax-oriented techniques (e.g., NLP) [20].
ML tools and techniques have been significant in NLP and Data Mining tasks on social text, due to their adaptability to the data, as well as their ability to efficiently handle vast volumes of data. “ML is programming computers to optimize a performance criterion using example data or past experience” [21]. During the learning phase, parameters of a general model are adjusted according to the training data. During the testing phase, the specialized model is tested with new, previously unseen data, and its performance regarding a target task is evaluated [21,22]. The objective of supervised learning is to map the provided input to an output, where the true values are provided by a supervisor [21,22]. The objective of unsupervised learning is to detect the regularities in the provided input and its underlying structure, without true output values being provided by a supervisor [21,22]. Semi-supervised learning includes training with both labeled and unlabeled data [21]. In reinforcement learning, an agent learns behavior through trial-and-error in a dynamic environment [23]. It is applied when the target task results from a sequence of actions [21,22,24,25].
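The learning and testing phases of supervised learning described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes classifier with add-one smoothing, chosen here only because Naive Bayes appears repeatedly in the surveyed works; the function names and the toy Greek data are invented for illustration and do not come from any of the cited studies.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Learning phase: estimate class counts and per-class token counts."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the most probable label under add-one (Laplace) smoothing."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for tok in tokens:
            if tok in vocabulary:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_docs):
    """Testing phase: fraction of held-out documents labeled correctly."""
    hits = sum(predict(model, tokens) == label for tokens, label in labeled_docs)
    return hits / len(labeled_docs)


# Toy, invented data (not taken from any surveyed corpus).
train = [
    (["καλό", "προϊόν"], "pos"),
    (["πολύ", "καλό"], "pos"),
    (["κακό", "προϊόν"], "neg"),
    (["πολύ", "κακό"], "neg"),
]
test = [(["καλό"], "pos"), (["κακό"], "neg")]

model = train_naive_bayes(train)
print(accuracy(model, test))
```

The supervisor's role is played by the labels in `train`; the held-out `test` pairs are never seen during training, which is what makes the reported accuracy an estimate of performance on new data.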
2. Linguistic and Behavioral Patterns
The identification of patterns in data has been a demanding task in the context of social text, mainly due to its unstructured nature, rich morphology and increasing volume [2,4,7,10,11,13,14,15,16]. Several researchers have focused on the identification of linguistic and/or behavioral patterns of interest in social text data. The most commonly used process is the following. At first, the data are collected from the social web, usually by a web scraper or through an Application Programming Interface (API). Then, they are preprocessed, including normalization and transformation, and encoded into a data set with a form and structure suitable for the processing stage, that is, the application of Data Mining and NLP techniques. At the next stage, experiments with the data set and several ML algorithms are conducted. Finally, the results are interpreted and the performance of the algorithms is evaluated.
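The preprocessing and encoding stages of this process can be sketched as follows. This is an illustrative outline only: the stop-word list is a hypothetical placeholder, the sample post is invented, and the specific steps (URL removal, lowercasing, tone-mark removal, tokenization, n-grams, bag-of-words) are a selection of the methods listed in the tables of this section rather than the pipeline of any particular study.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"και", "το", "η", "ο", "να", "με"}


def strip_tone_marks(text):
    """Remove combining diacritics (e.g., the Greek tonos) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Normalize a raw post and return its content tokens."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = strip_tone_marks(text.lower())
    tokens = re.findall(r"\w+", text)  # simple word tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the list of contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Encode tokens as a term -> count mapping."""
    return Counter(tokens)


# Invented sample post.
post = "η ταινία ήταν πολύ καλή και το τέλος καλό! https://example.com"
tokens = preprocess(post)
print(tokens)
print(ngrams(tokens, 2))
print(bag_of_words(tokens))
```

Because tone marks are stripped before stop-word filtering, the stop-word list has to be stored without tone marks as well; the order of normalization steps matters in this kind of pipeline.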
2.1. Linguistic Patterns Analysis
There are several approaches that have attempted to identify, analyze and extract linguistic patterns by developing and using various NLP tools [2,12]. Other work focuses on the creation of corpora from various linguistic contexts to apply either classification [7] or machine translation [37]. Additionally, there are certain approaches that have explored argument extraction and detection from text corpora [13,14,15]. Other approaches attempted authorship attribution and author’s gender identification for bloggers [10,11]. An overview of the recent literature regarding linguistic patterns analysis, which is discussed in this subsection, is shown in Table 1 and Table 2.
Table 1. Overview of the literature (linguistic patterns analysis). Social media, data sets and corpora, methods applied on data, and the resulting tool.
| Paper | Social Media | Data | Methods | Tool |
|---|---|---|---|---|
| [2] | Twitter | 2405 tweets; 31,697 tokens (April 2019) | tokenization, normalization, encoding, annotation | POS tagger |
| [12] | Twitter | 4,373,197 tweets; 30,778 users; 54,354 hashtags (April 2008–November 2014) | automated & manual rating, removal: stop words & tone marks, stemming, uppercase | Sentiment analysis lexicon |
| [7] | Twitter, Facebook, forums, blogs | 1039 sentences; 7026 words (Cypriot Greek); 7100 words (Modern Greek) (March–April 2018) | anonymization, manual annotation, removal: tabs, newlines, duplicate punctuation, insertion: spaces, n-grams, encoding, tokenization | Bidialectal classifier |
| [37] | MOOC | multilingual corpus; course forum text; quiz assessment text; subtitles of online video lectures | conversion into plain text, removal: special characters, non-content lines, multiple whitespaces, tokenization, sentence segmentation, special elements markup | - |
| [13] | Twitter, news, blogs, sites | 204 documents; 16,000 sentences: 760 argumentative | manual annotation, tokenization, sentence splitting, POS tagging, feature selection, gazetteer lists, lexica, TF-IDF | - |
| [14] | Twitter, news, blogs, sites | 204 documents; 16,000 sentences: 760 argumentative; comparison with NOMAD data set | manual annotation, tokenization, sentence splitting, POS tagging, feature selection, gazetteer lists, lexica, TF-IDF | - |
| [15] | Twitter, Facebook, news, blogs | 1st: 77 million documents; 2nd: 300 news articles, 1191 argumentative segments | POS tagging, cue words, distributed representations of words, feature extraction, sentiment analysis, lowercase | - |
| [10] | Blogs | 1000 blog posts; 406,460 words (September 2010–August 2011) | stylometric variables, character & word uni-grams, bi-grams, tri-grams, feature extraction | Authorship attribution & author’s gender identification |
| [11] | Twitter | 45,848 tweets | removal: stop words, encoding: Bag-of-Words, TF-IDF | Author’s gender identification |
Table 2. Overview of the literature (linguistic patterns analysis). Machine learning and other algorithms, experimental results, contribution, and open issues.
| Paper | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|
| [2] | Naive Bayes, ID3 | accuracy up to 99.87% | 1st data set for Greek social text; 1st tag set; 1st supervised POS tagger | larger data sets; data from different social media; syntactic & semantic analysis tools; linguistic diversity by region; tracking controversial events & mapping connections with users |
| [12] | Pearson & Kendall correlation | sentiment correlation | public benchmark data set; set of intensity rated tweets; automated method for detecting intensity (tweets & hashtags); temporal changes in intensity (hashtags) | lexicon for social text; more linguistic data; larger data set & number of raters |
| [7] | Naive Bayes, SVM, LR | 95% mean accuracy | 1st classifying Greek dialects in social text; bidialectal corpus & classifier; most informative features | applications in social media moderation and academic research; larger corpus including POS; detecting dialects prior to online translation; extension with Greeklish, Pontic & Cretan Greek; distinction between Katharevousa & Ancient Greek |
| [37] | - | - | multilingual parallel corpus to train, tune, test machine translation engines; translation crowd-sourcing experiment; examination of difficulties: text genre, language pairs, large data volume, quality assurance, crowd-sourcing workflow | - |
| [13] | LR, RF, SVM, CRF | accuracy up to 77.4% | 2-step argument extraction; novel corpus; most determinant features | more features & algorithms; testing of Markov models |
| [14] | LR, RF, SVM, CRF | accuracy up to 77.4% | 2-step argument extraction; novel corpus; most determinant features | more features & algorithms; testing of Markov models; comparing performance with approaches for English; experiments with unsampled data |
| [15] | word2vec, CRF | up to 39.7% precision; 27.59% recall; 32.53% F1 score | semi-supervised multi-domain method; argument extraction; novel corpus | extending the gazetteer lists; bootstrapping on CRF; more algorithms; patterns based on verbs and POS; grammatical inference algorithm |
| [10] | SVM | accuracy 85.4% & 82.6% | tool for authorship attribution & author’s gender identification with many candidates; novel social text corpus; 10 most determinant features | - |
| [11] | SVM | accuracy up to 70% | novel, manually annotated, corpus; NLP framework for gender identification of the author | more features combining gender & age; neural networks |
2.2. Offensive Behavior and Language Detection
There are several approaches that have attempted to detect and analyze bullying and aggressive behavior in Virtual Learning Communities (VLCs) [4,16,17]. Other work focuses on offensive language identification and analysis in tweets [6,18]. An overview of the recent literature regarding offensive behavior and language detection, which is discussed in this subsection, is shown in Table 3 and Table 4.
Table 3. Overview of the literature (offensive behavior and language detection). Social media, data sets and corpora, methods applied on data, and the resulting tool.
| Paper | Social Media | Data | Methods | Tool |
|---|---|---|---|---|
| [4] | VLCs, Wikispaces | 500 dialogue segments (VLC-1); 83 dialogue segments (VLC-2) | anonymization, segmentation in periods, manual annotation, lowercase, tokenization, n-grams, removal: stop words, stemming, pruning of low/high-frequency terms, length filtering | Detection of bullying behavior |
| [16] | VLCs, Wikispaces | 126 dialogue segments; 1167 dialogue segments | anonymization, segmentation in periods | Detection of bullying behavior |
| [17] | VLCs, Google Docs | activity log files, dialogue text, questionnaires, interviews | semantic segmentation, annotation | Discourse & artifacts analysis |
| [6] | Twitter | 4779 tweets (May–June 2019) | keyword search, removal: emoticons, URLs, accentuation, normalization, lowercase, manual annotation, TF-IDF, n-grams, POS tags, word embeddings, LSTM | - |
| [18] | Twitter | 4,490,572 tweets (2013–2016) | keyword search, knowledge representation, computational analysis, data visualization, tokenization, sentence splitting, POS tagging, lemmatization | - |
Table 4. Overview of the literature (offensive behavior and language detection). Machine learning and other algorithms, experimental results, contribution, and open issues.
| Paper | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|
| [4] | Naive Bayes, Naive Bayes Kernel, ID3, Decision Tree, Feed-forward NN, Rule induction, Gradient boosted trees | accuracy up to 94.2% | 1st study of the influence of VLCs on behavior modification regarding bullying; NLP & ML framework for automatic detection of aggressive behavior & bullying; authentic humanistic data collected under real conditions | - |
| [16] | Text analysis & annotation, t-test | - | authentic humanistic data collected under real conditions | - |
| [17] | Struggle Analysis Framework | - | collaboration assessment; action analysis; interaction analysis; evaluation of presentations & dialogues | - |
| [6] | SVM, Stochastic Gradient Descent, Naive Bayes, 6 deep learning models | F1 score 89% | 1st Greek annotated data set for offensive language identification | - |
| [18] | - | - | framework for verbal aggression analysis; verbal attacks against target groups; xenophobic attitudes during the Greek financial crisis | extending to other types of attacks; including other languages for cross-country & cross-cultural comparisons |
3. Opinion-Mining
Taking this work a step further, we focus on a well-known fact: millions of content creators worldwide produce a wealth of unstructured opinion data online, generated when people share their opinions on various matters, such as consumer experience. These data are obtainable through popular crawling tools (e.g., Scrapy) or through readily available platforms. In principle, commenting is voluntary, so it tends to provide an honest view and opinion on a particular topic. Under this notion, the term opinion-mining arises, since the analysis and summarization of large-scale data has led to a specific type of concept-based analysis [38]. In general, understanding public sentiment is the core task of opinion-mining. Many useful and easily minable sources on the web describe current opinion on politics, social matters, user reviews and more, and they provide a volunteered source of highly valued user opinion. Although people express positive or negative feelings on a given topic (sentiment analysis), researchers also need to understand the reasoning behind a given sentiment (opinion-mining), as individual opinions are often reflective of a broader view. Given the large minable data sets, research groups need to develop new interpretation methods with the help of AI to extract opinion from textual data. Nevertheless, such large data sets produce complex tasks that require arduous and tedious work on behalf of data scientists. Mining techniques for identifying sentiment on the social web typically proceed as follows. Initially, texts are collected in the form of raw data and are then preprocessed into specific data sets through ML and NLP approaches. Afterwards, researchers deploy various types of ML algorithms to detect web sentiment in a specific data set, followed by analytical interpretation and assessment of the methodology in place. Recent research work has indicated that Greek social media presents a platform for users to express their opinion related to many aspects of private and social life and their experience with services and products. This section presents recent literature on the political footprint along with voting patterns (Section 3.1 [8,14,39,40,41,42,43,44]) and introduces the reader to work related to Marketing and Business Analysis (Section 3.2 [3,5]) that employs state-of-the-art opinion-mining ML techniques.
3.1. Politics and Voting Analysis
Greece has witnessed major political events during the last decade and, consequently, Greek citizens, and voters in particular, have very often been forced to reconsider their political preferences in light of broader developments [45]. In this context, multidisciplinary scientific communities have made many attempts to recognize the underlying patterns of social events. The aim of this section is to explore whether Greek media and social media discourses can provide a discursive reconstruction of politics through state-of-the-art analytical methods. Table 5 presents a summary of works related to Greek text mining on Politics and Voting Analysis, which are discussed in this subsection.
Table 5. Overview of the literature. Opinion-mining on Politics and Voting Analysis.
| Paper | Social Media | Data | Methods | Tool | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|---|---|---|---|
| [39] | Twitter | 57,424 tweets (April to May 2012) | sentiment analysis, TF distribution | - | - | - | confirmation of the alignment between actual and social web-based political sentiment | implementation of more sophisticated text analysis techniques |
| [41] | Twitter | 61,427 tweets (May 2012) divided into Parties & Leaders; 44,438 tweets (after cleanup) | text classification, semantic analysis | OMW | NLTK | precision 82.4% | real-world application of irony detection | use of stemmer/lemmatizer, tool unavailability, small manually trained data set |
| [40] | Twitter | 61,427 tweets (May 2012) divided into Parties & Leaders; 44,438 tweets (after cleanup) | collective classification | OMW | J48, Naive Bayes, Functional Trees, K-Star, RF, SVM, Neural Networks | Supervised: Functional Trees 82.4%; Semi-supervised: RF 83.1% | - | application with Word Vector or Deep Learning |
| [44] | Twitter | 48,000 tweets in two data sets (July & September 2015) | data collection and entity identification, volume analysis, entity co-occurrence, sentiment analysis and topic modeling | SentiStrength | - | highlight the societal and political trends | political domain analysis | bot recognition |
| [8] | Twitter | 14.62M tweets, 283 Greek “stopwords” | convolutional kernels | User Voting intention modeling | SVM, LR, FF, RF | MCKL = 0.02% | real time systematic study on nowcasting the voting intention | annotating a random sample of Twitter users for increased performance |
| [42] | Twitter & Digital news media | 540,989 articles (1996–2014) | PEA & NERC | NLP, NERC, EAU and FST | - | quantitative and qualitative | - | enrichment of sociopolitical event categories |
| [43] | Twitter & Digital news media | 540,989 articles & 166,100,543 tweets (1996–2014) | PEA & NERC | NLP, NERC, EAU and FST | - | quantitative and qualitative | - | enrichment of sociopolitical event categories |
One of the first complete approaches to Greek text mining on political events was that of Kermanidis & Maragoudakis [39], who proposed a method for assessing political tweets before and after election day, focusing on the difference in web sentiment. This study indicated the degree of alignment between actual and social web-based political belief, related to electoral sentiment on major political events. The authors studied the impact of the acquired web sentiment before and after the Greek parliamentary elections of 2012 by implementing sentiment identification and Term Frequency (TF) distributions. Furthermore, this work addresses the two-way alignment of actual political and web sentiment while using minimal linguistic resources.
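The idea of comparing Term Frequency (TF) distributions before and after a reference day can be sketched in a few lines. This is not the method of [39]; it is a generic illustration under stated assumptions: tweets are already tokenized, each carries a date, the pivot date is arbitrary, and all data and function names are invented.

```python
from collections import Counter
from datetime import date


def term_frequencies(token_lists):
    """Relative term frequency over a collection of tokenized tweets."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def split_by_date(tweets, pivot):
    """Partition (date, tokens) pairs into before-pivot and on/after-pivot lists."""
    before = [tokens for day, tokens in tweets if day < pivot]
    after = [tokens for day, tokens in tweets if day >= pivot]
    return before, after


def frequency_shift(tf_before, tf_after):
    """Per-term change in relative frequency (after minus before)."""
    terms = set(tf_before) | set(tf_after)
    return {t: tf_after.get(t, 0.0) - tf_before.get(t, 0.0) for t in terms}


# Invented placeholder data; the pivot date is arbitrary.
pivot = date(2020, 1, 10)
tweets = [
    (date(2020, 1, 8), ["party_a", "hope"]),
    (date(2020, 1, 9), ["party_a", "party_b"]),
    (date(2020, 1, 11), ["party_b", "hope"]),
    (date(2020, 1, 12), ["party_b", "party_b"]),
]

before, after = split_by_date(tweets, pivot)
shift = frequency_shift(term_frequencies(before), term_frequencies(after))
for term, delta in sorted(shift.items()):
    print(f"{term}: {delta:+.2f}")
```

Relative rather than raw frequencies are used so that the two periods remain comparable even when they contain different numbers of tweets.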
3.2. Marketing and Business Analysis
Sentiment analysis is an artificial intelligence technique that employs ML and NLP text analysis techniques to track the polarity of opinion (positive to negative). With the right tools, a corporation can gain insights from social media conversations, online reviews, emails, customer service tickets, and more. It has become an essential tool for marketing campaigns because it allows the researcher to automatically analyze data on a scale far beyond what manual human analysis could do, with high accuracy, and in real time. Furthermore, it offers insight into the mindset of a specific group of customers and of the public at large, supporting data-driven decisions. More specifically, a corporation can analyze customer sentiment and compare it against that of its competitors, follow emerging topics and check brand perception in new potential markets. The public offers millions of opinions about brands, services and products daily, on social media and across the World Wide Web. Table 6 presents an overview of the literature on opinion-mining on Marketing and Business Analysis, which is discussed in this subsection.
Table 6. Overview of the literature. Opinion-mining on Marketing and Business Analysis.
| Paper | Social Media | Data | Methods | Tool | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|---|---|---|---|
| [5] | PaloPro | Blogs, Twitter and Facebook posts | sentiment analysis, reputation management, brand monitoring | OpinionBuster | NLP, CRFs | performance > 93% | sentiment and polarity detection of a word in its context | further optimization |
| [3] |  |  |  |  | SVM classifier | - | effectiveness of TF-IDF for automatic sentiment classifier for hotel reviews | further use of contextual valence shifters |
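To make the notion of polarity tracking discussed in this subsection more concrete, the following is a minimal lexicon-based sketch in which a negation word acts as a simple valence shifter. It is an assumption-laden toy: the lexicon, the negator list and the one-token negation scope are all hypothetical, English words are used only for readability, and none of this reproduces the classifiers of [3] or [5].

```python
# Hypothetical mini-lexicon; real lexica are far larger and curated.
POLARITY = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never", "no"}


def polarity_score(tokens):
    """Sum word polarities, flipping the sign of the word right after a negator."""
    score = 0.0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = POLARITY.get(tok, 0.0)
        score += -value if negate else value
        negate = False  # a negator only affects the next token
    return score


def label(score):
    """Map a numeric score to a coarse polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented example reviews.
reviews = [
    "the room was great".split(),
    "the breakfast was not good".split(),
    "the location was fine".split(),
]
for tokens in reviews:
    print(" ".join(tokens), "->", label(polarity_score(tokens)))
```

A one-token negation scope is the simplest possible choice; handling longer scopes, intensifiers or irony would require the more contextual treatment that the surveyed works list among their open issues.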