Alzheimer’s disease (AD) is one of the most devastating brain diseases in the world, especially in the more advanced age groups
[1]. It is a progressive neurological disease that results in irreversible loss of neurons, particularly in the cortex and hippocampus, which leads to characteristic memory loss and behavioral changes in humans
[2].
Although the nature of AD is not fully understood and the disease likely has multiple causes, it has been observed that its onset is insidious, appearing in adulthood and causing cognitive and behavioral disability in advanced stages
[3].
As the disease progresses, the quality of life of patients is deeply affected in different ways. As they lose cognitive abilities and functional skills, individuals with this dementia become unable to perform many of the activities that were usually part of their daily lives. Behavior and social skills may also deteriorate, precipitating interpersonal conflicts that lead to the individual with AD being socially isolated. This, in turn, has an impact on their emotional state
[4]. In these syndromes, amnesic symptoms may not be the first evidence; other aspects, such as language problems, visual dysfunction, or difficulties with praxis, may be more prominent initially
[5].
Mild cognitive impairment (MCI) is known to be one of the first detectable indicators of cognitive decline. It is a heterogeneous syndrome of great clinical importance for the early detection of AD
[6]. At this stage, the symptoms related to the ability to think begin to be noticed by the individual and by those closest to them, but there are no functional changes in daily life. Not all patients diagnosed with MCI develop AD; in fact, only 10 to 15% convert per year. There are two types of MCI, the amnesic and the non-amnesic. Patients with the first type are thought to have a greater tendency to develop AD; in cases where they do, MCI is considered the second phase of AD
[7]. In general, MCI captures the point in the spectrum of cognitive function between non-demented aging and dementia, a characterization that applies mainly to the amnesic type
[8].
The general diagnosis of neurodegenerative diseases is usually compromised by the fact that the symptoms that trigger it correspond to an advanced stage of the disease, causing it to be made late. Therefore, the assessment of dementia should be based on four key questions: (1) whether there is a subjective disability detected by the individual or observed by someone close; (2) whether there is objective evidence of cognitive disability in the tests performed; (3) whether there is a functional decline; (4) whether the symptoms are caused by something other than dementia (e.g., delirium, substances, or other medical, neurological, or psychiatric disorders). To answer these questions, a medical history is taken, appropriate physical examinations and laboratory studies are performed, and cognitive screenings are applied, complemented by neuroimaging techniques
[8]. Among cognitive tests, the Mini-Mental State Exam (MMSE), the Clock-Drawing Test, and the Alzheimer's Disease Assessment Scale stand out
[5][9][10]. The main exams using imaging techniques are Computed Axial Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Single-Photon Emission Computed Tomography (SPECT)
[8]. Although a wide range of diagnostic methods is currently applied to AD, there is still a pressing need for new methods that can detect dementia earlier while being simple and cost-effective.
Alzheimer's disease is characterized by a progressive worsening of deficits in several cognitive fields, including language. Aphasia and dysarthria are common symptoms, and language impairment in AD occurs mainly due to a decline at the semantic and pragmatic levels of language processing
[11]. From a physiological perspective, the superior parietal, posterior temporal, and occipital cortical areas are interconnected by the posterior corpus callosum, while the superior longitudinal fasciculus surrounds the putamen and connects all four cerebral lobes; these areas are known to be affected in MCI and AD and play a central role in language processing
[12][13]. Language difficulties are a major problem for most patients with dementia, especially as the disease progresses. The first signs that communication is being affected are difficulties in finding words, especially when it comes to naming familiar people or objects. Words are replaced by incorrect or meaningless ones, and pauses during speech increase as well
[14]. In the early stages of AD, language impairment involves problems of lexical retrieval, loss of verbal fluency, and a breakdown in higher-order written and spoken language comprehension. In the moderate and severe phases of AD, the loss of verbal fluency is profound, with loss of understanding and prominent literal and semantic paraphasias. In the very severe phases of AD, speech is often restricted to echolalia and verbal stereotypies. In
Table 1, it is possible to see the association of the mentioned speech impairments with the stage of the disease
[11][15]. Communicative difficulties (speech and language) constitute one of the groups of symptoms that most often accompany dementia and, therefore, should be recognized as a central object of study. This recognition aims to enable earlier diagnosis, resulting in greater effectiveness in delaying the evolution of the disease.
Table 1. Language changes in AD (adapted from Ferris and Farlow [11] and Greta et al. [16]).
| Function | Early Stages | Moderate to Severe Stages |
|---|---|---|
| Spontaneous speech | Fluent, grammatical | Non-fluent, echolalic |
| Paraphasic errors | Semantic | Semantic and phonetic |
| Repetition | Intact | Very affected |
| Naming objects | Slightly affected | Very affected |
| Understanding the words | Intact | Very affected |
| Syntactical understanding | Intact | Very affected |
| Reading | Intact | Very affected |
| Writing | ±Intact | Very affected |
| Semantic knowledge of words and objects | Difficulties with less used words and objects | Very affected |
2.3. Language and Speech Features
As mentioned in
Table 1, the most evident problems early on in AD, as far as speech is concerned, are related to difficulties in general semantics, that is, in finding words to name objects. In this sense, temporal cycles during spontaneous speech production (speech fluency) are affected and, therefore, can be detected in the patient's hesitations and pronunciation
[44]. Other speech characteristics affected in AD patients seem to be those related to articulation (speed in language processing), prosody in terms of temporal and acoustic measurements, and eventually, in later phases, phonological fluency
[45].
Considering their linearity, features can be classified as linear or non-linear, the linear ones being more conventionally used. Linear features can be subdivided into several groups, but these are always closely interconnected. Thus, we chose to divide them into two groups, linguistic and acoustic, and present them in
Table 3 and
Table 4.
Table 3. Linguistic features that have been used for AD detection. The features are organized by type. For each feature name, the number of occurrences/usages is provided in parentheses.

| Feature Type | Feature Name |
|---|---|
| - | Words (3); Verbs (2); Nouns, Predicates (1); Coordinate and Subordinate Phrases (2); Reduced phrases (2); Incomplete Phrases/Ideas (3); Filling words (1); Unique words (2); Revisions/Repetitions (1); Word Replacement (2) |
| Parts of speech ratio | Nouns/Verbs (2); Pronouns/Substantives (1); Determinants/Substantives (2); Type/Token (2); Silence/Speaking (4); Hesitation/Speaking (3). |
| Semantic density | The density of the idea (1); Efficiency of the idea (1); Density of information (2); Density of the sentences (1). |
| POS (Parts-of-Speech) | Text tags (4). |
| Complexity | The entropy of words (1); Honore's Statistics (1). |
| Lexical Variation | Variation: nominal (2), adjective (1), modifier (1), adverb (1), verbal (1), word (1); Brunet's Index (1). |
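To make the measures in Table 3 more concrete, the snippet below is a minimal sketch of how a few of them (word counts, unique words, type/token ratio, filler words, and mean sentence length) could be computed from a plain-text transcript. The tokenization rules, the filler-word list, and the function name are illustrative assumptions and do not reproduce the procedures of the reviewed studies.

```python
import re

# Hypothetical filler words; real studies define these per language and protocol.
FILLERS = {"uh", "um", "er", "hmm", "well", "like"}

def linguistic_features(transcript: str) -> dict:
    """Compute a few simple, Table 3 style measures from a transcript."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    tokens = re.findall(r"[a-zA-Z']+", transcript.lower())
    types = set(tokens)
    return {
        "n_words": len(tokens),                                  # Words
        "n_unique_words": len(types),                            # Unique words
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "n_fillers": sum(1 for t in tokens if t in FILLERS),     # Filling words
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }

# Example:
# feats = linguistic_features("Well... the boy is, um, taking the cookie.")
```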
Table 4. Acoustic features that have been used for AD detection. The features are organized by type. For each feature name, the number of occurrences/usages is provided in parentheses.

| Feature Type | Feature Name |
|---|---|
| Hesitations | Filled Pauses (2); Silent Pauses (4); Long Pauses (3); Short Pauses (3); Voice Breaks (5). |
| Time/Duration | Total speech (3); Speech Rate (3); Speech time (2); Average of syllables (2); Pauses (4); Maximum pause (2). |
| Voice Segments | Period (4); Average duration (4); Accentuation (2). |
| Frequency | Fundamental frequency (8); Short term energy (7); Spectral centroid (1); Autocorrelation (2); Variation of voice frequencies (2). |
| Regularity | Jitter (11); Shimmer (11); Intensity (6); Square Energy Operator (1); Teager-Kaiser Energy Operator (1); Root Mean Square Amplitude (2). |
| Intensity | From the voice segments (1); From the pause segments (1). |
| Timbre | Formant's Structure (6); Formant's Frequency (8). |
| Noise | Harmonic-Noise ratio (3); Noise-Harmonic ratio (2). |
| Phonetics | Articulation dynamics (1); Rate of articulation (1); Pause rate (5). |
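Several of the acoustic measures in Table 4, namely pause-related timing and fundamental frequency statistics, can be approximated directly from a recording. The sketch below assumes a mono WAV file and the librosa library; the file name, silence threshold, and F0 search range are hypothetical choices, and measures such as jitter and shimmer (typically computed with dedicated phonetic tools) are omitted.

```python
import numpy as np
import librosa

def acoustic_features(path: str, top_db: float = 30.0) -> dict:
    """Approximate a few Table 4 style measures from a speech recording."""
    y, sr = librosa.load(path, sr=16000)              # load and resample to 16 kHz
    total_dur = len(y) / sr

    # Segments whose energy is within `top_db` dB of the peak are treated as speech.
    intervals = librosa.effects.split(y, top_db=top_db)
    speech_dur = sum(end - start for start, end in intervals) / sr
    n_pauses = max(len(intervals) - 1, 0)

    # Fundamental frequency (F0) via probabilistic YIN; NaN where unvoiced.
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=300, sr=sr)
    f0 = f0[~np.isnan(f0)]

    return {
        "total_duration_s": total_dur,                            # Time/Duration
        "speech_time_s": speech_dur,
        "pause_time_s": total_dur - speech_dur,                   # Hesitations
        "pause_rate_per_s": n_pauses / total_dur if total_dur else 0.0,
        "f0_mean_hz": float(np.mean(f0)) if f0.size else 0.0,     # Frequency
        "f0_std_hz": float(np.std(f0)) if f0.size else 0.0,
    }

# Example: feats = acoustic_features("patient_recording.wav")
```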
2. Speech- and Language-Based Classification of Alzheimer’s Disease
2.1. Machine Learning Pipeline
The use of speech analysis is potentially a useful, non-invasive, and simple method for early diagnosis of AD. The automation of this process allows a fast, accurate, and economical follow-up over time. Initially, speech-based tests for AD detection were performed by linguists. These tests were designed to extract linguistic characteristics from speech or writing samples. However, more recent studies seek to optimize this task by automating the process of speech recognition through audio recordings
[17]. Thus, and in sequence, the process can be described in four crucial steps (a minimal code sketch illustrating the first two steps follows Figure 1):
- Data Preparation: In this step, the extraction, optimization, and normalization of features occur. This consists of selecting the most significant features (by removing the non-dominant ones) and transforming their ranges to similar limits, which reduces training time and the complexity of the classification models. Metadata are "the data about the data", more specifically, structured and organized information on a given object (in this case, voice recordings) that allows certain characteristics of it to be known. This metadata, together with the results of the pre-processing of the recordings, makes up the final database. Incorrect or poor-quality data (e.g., outliers, wrong labels, noise, …), if not properly handled, will lead to under-optimized models and unsatisfactory results. If data are not sufficient, for example when deep learning algorithms are used, then data augmentation techniques can be useful.
- Training and Validation: The supporting database is divided into subsets, usually 70–90% for training and 30–10% for testing. The subsets can be randomly generated several times and the results averaged for additional confidence, a procedure designated as cross-validation. The data model is trained, i.e., the involved parameters are adjusted, by one or more optimizers, and the performance is calculated using the test subset. This step allows categorizing and organizing the data to promote better analysis [18]. When data are not sufficient, transfer learning approaches can be used.
- Optimization: After model evaluation, it is possible to conclude which parameters need to be improved, as well as to proceed more effectively to the selection of the most interesting and relevant features, so that a new extraction and, consequently, a new iteration of Training and Validation can be performed.
- Run-Time: Having concluded the previous points, the system is ready to be deployed and to classify new, unseen inputs; more specifically, from the recording of a patient's voice, to classify it as a possible healthy subject or a possible Alzheimer's patient.
In
Figure 1 we can observe the described methodology in detail.
Figure 1. Flowchart of a general machine learning pipeline to process acoustic/prosodic correlates of disease. Adapted from Braga et al.
[19].
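As a rough illustration of the Data Preparation and Training and Validation steps above, the sketch below scales a feature table, selects the most discriminative features, and evaluates a classifier on a held-out split. scikit-learn, the synthetic data, and all parameter values are assumptions made for illustration, not the setup of any particular study reviewed here.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder data: 60 speakers x 20 acoustic/linguistic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = rng.integers(0, 2, size=60)                  # 0 = healthy control, 1 = AD (toy labels)

pipeline = Pipeline([
    ("scale", StandardScaler()),                 # bring features to similar ranges
    ("select", SelectKBest(f_classif, k=10)),    # keep the 10 most discriminative features
    ("clf", SVC(kernel="rbf")),                  # example classifier
])

# Hold-out evaluation with an 80-20% split of the supporting database.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
pipeline.fit(X_tr, y_tr)
print("hold-out accuracy:", pipeline.score(X_te, y_te))
```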
2.2. Speech and Language Resources
Table 2 presents the main databases referred to in the scientific literature, accompanied by a summary of their characteristics. These resources are crucial for supporting the development of new systems, in particular when deep learning approaches are used. The use of similar databases in different studies, by different researchers, also provides a common ground for evaluation and performance comparison.
Table 2. List of databases of speech recordings from Alzheimer's patients, with related specifications. (Table contents are sorted by language, first column, and database name, second column).
| Language | Database Name | Task | HC M/F | MCI M/F | AD M/F | Availability | Refs. |
|---|---|---|---|---|---|---|---|
| English | DementiaBank (TalkBank) | DF | 99 | - | 169 | Upon request | [20] |
| English | Pitt Corpus | PD | 75/142 | 27/16 | 87/170 | Upon request | [21] |
| English | WRAP | PD | 59/141 | 28/36 | - | Upon request | [22] |
| English | - | PD | 112 | - | 98 | Undefined | [23] |
| French | - | Mixed | 6/9 | 11/12 | 13/13 | Undefined | [24] |
| French | - | VF, PD, SS Counting | - | 19/25 | 12/15 | Undefined | [25] |
| French | - | VF, Semantics | 5/19 | 23/24 | 8/16 | Undefined | [26] |
| French | - | Reading | 16 | 16 | 16 | Undefined | [27] |
| Greek | - | PD | 16/14 | - | 13/17 | Undefined | [28] |
| Hungarian | BEA | SS | 13/23 | 16/32 | - | Upon request | [6] |
| Hungarian | BEA | SS | 25 | 25 | 25 | Upon request | [29] |
| Italian | - | Mixture | 48 | 48 | - | Undefined | [30] |
| Mandarin | Lu Corpus | PD/SS | 4/6 | - | 6/4 | Upon request | [31] |
| Mandarin | - | PD/SS | 24 | 20 | 20 | Undefined | [32] |
| Portuguese | Cinderella | SS | 20 | 20 | 20 | Undefined | [33] |
| Spanish | AZTITXIKI (AZTIAHO) | SS | 5 | - | 5 | Undefined | [34] |
| Spanish | AZTIAHORE (AZTIAHO) | SS | 11/9 | - | 8/12 | Undefined | [35][36] |
| Spanish | PGA-OREKA | VF | 26/36 | 17/21 | - | Upon request | [35] |
| Spanish | Mini-PGA | PD | 4/8 | - | 1/5 | Upon request | [35] |
| Spanish | - | Reading | 30/68 | - | 14/33 | Undefined | [37] |
| Swedish | Gothenburg | PD | 13/23 | 15/16 | - | Undefined | [38] |
| Swedish | - | Mixed | 12/14 | 8/21 | - | Upon request | [39] |
| Swedish | - | Reading | 11/19 | 12/13 | - | Undefined | [40] |
| Turkish | - | SS/Interview | 31/20 | - | 18/10 | Undefined | [41] |
| Turkish | - | SS/Interview | 12/15 | - | 17/10 | Undefined | [42] |
| Turkish | - | SS | 12/15 | - | 17/10 | Undefined | [43] |
Legend: M: Males; F: Females; HC: Healthy Controls; MCI: Mild Cognitive Impairment; AD: Alzheimer’s Disease; SS: Spontaneous Speech; VF: Verbal Fluency; PD: Picture Description; PGA: Gipuzkoa Alzheimer Project; WRAP: Wisconsin Registry for Alzheimer’s Prevention.
2.4. Classification Models
The process of classification lies in identifying to which of a given set of categories a new observation belongs, based on a training set of observations whose categories have already been assigned
[46]. Thus, after the extraction and selection of the most significant features, it is necessary to classify them so that the groups of data under study can also be classified.
When the data distribution or patterns are known, a compatible model (linear, polynomial, exponential, or other) will lead to optimal results. However, machine learning has gained special relevance due to its ability to provide good estimates even when facing unstructured, high-dimensionality data. In this context, deep neural networks (DNN) can excel. These are flexible models where elements, inspired by the anatomy and physiology of the human brain, are combined in large structures, with several sequential layers, to provide the output. The number of elements per layer, the number of layers, and the behavior of each layer (fully connected, convolutional, recurrent, …) are some of the parameters that can be adjusted to fit the network to the data/problem.
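As a toy illustration of these adjustable elements (number of layers, units per layer, and layer type), the following sketch defines a very small fully connected network for a binary HC/AD decision over a fixed-length feature vector. Keras/TensorFlow and the chosen layer sizes are assumptions for illustration only, not an architecture taken from the reviewed studies.

```python
import tensorflow as tf

def build_dnn(n_features: int) -> tf.keras.Model:
    """Small fully connected network; layer count and widths are arbitrary."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),    # first hidden layer
        tf.keras.layers.Dropout(0.3),                    # regularization
        tf.keras.layers.Dense(32, activation="relu"),    # second hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of AD
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Usage (X_train, y_train assumed to be a feature matrix and binary labels):
# model = build_dnn(n_features=40)
# model.fit(X_train, y_train, epochs=50, batch_size=16, validation_split=0.2)
```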
In
Table 5, some of the most commonly used models are summarized and defined in general terms.
Table 5. Most significantly used classification models. NB: Naive Bayes; RF: Random Forest; LDA: Linear Discriminant Analysis; SVM: Support Vector Machine; DT: Decision Trees; ANN: Artificial Neural Networks; RNN: Recurrent Neural Network; CNN: Convolutional Neural Networks; MLP: Multilayer Perceptron; KNN: k-Nearest Neighbors; DNN: Deep Neural Networks; LR: Logistic Regression.

| Model | Characterization | References |
|---|---|---|
| NB | Consists of a network, composed of a main node with other associated descending nodes, that follows Bayes' theorem [47]. | [6][23][28][41] |
| SVM | Consists of building the hyperplane with maximum margin capable of optimally separating two classes of a data set [47]. | [6][25] |
| RF | Relies on the creation of a large number of uncorrelated decision trees based on the average random selection of predictor variables [50]. | [6][48] |
| DT | Consists of building a decision tree where each node in the tree specifies a test on an attribute, each branch descending from that node corresponds to one of the possible values for that attribute, and each leaf represents class labels associated with the instance. The instances of the training set are classified following the path from the root to a leaf, according to the result of the tests along the path [51]. | [27][41][42][43] |
| KNN | Based on the memory principle in the sense that it stores all cases and classifies new cases based on similarity measures [47]. | [30][34] |
| LDA | A discriminant approach based on the differences between samples of certain groups, where the objective is to maximize the ratio of the variance between groups to the variance within the same group [53]. | [42][43] |
| LR | A model capable of finding an equation that predicts an outcome for a binary variable from one or more response variables [52]. | [30] |
| ANN (DNN, CNN, RNN, MLP) | Naturally inspired models; supervised learning approach based on a theory of association (pattern recognition) between cognitive elements [54]. There are many possibilities with different elements, structures, and layers; the larger the number of parameters, the larger the dataset must be. | [30][31][34][35][36][40][41] |
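To show how the models listed in Table 5 map onto a common toolkit, the sketch below instantiates them with scikit-learn and compares them under the same cross-validation. The library choice, the hyperparameters, and the synthetic stand-in data are assumptions; the reviewed studies tune each model on their own feature sets.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Toy feature matrix and labels standing in for extracted speech features.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 15))
y = rng.integers(0, 2, size=80)   # 0 = HC, 1 = AD (placeholder labels)

models = {
    "NB":  GaussianNB(),
    "SVM": SVC(),
    "RF":  RandomForestClassifier(n_estimators=200),
    "DT":  DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
    "LR":  LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```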
2.5. Testing and Performance Indicators
To conclude on the efficiency and viability of the adopted classification model, it is necessary to evaluate it. To compare the performance of a given system against other reported systems, it is important to choose a common metric with a well-defined testing method/setup; otherwise, it will be impossible to understand how a system stands against its competitors. In this sense,
Table 6 presents the evaluation models applied in the literature search.
Table 6. Evaluation models for classification models.

| Model | Method | Reference |
|---|---|---|
| Cross Validation | k-Fold | [28][29][31][34][35][36][40][48] |
| Cross Validation | Leave-one-out | [6][26][38][41][42] |
| Cross Validation | Leave-pair-out | [39][49] |
| Split Evaluation | 90–10% | [40] |
| Split Evaluation | 80–20% | [36] |
| Random Sub-Sampling | - | [25] |
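The evaluation protocols in Table 6 can be reproduced with standard tooling. The sketch below contrasts k-fold cross-validation, leave-one-out, random sub-sampling, and a simple 90–10% split using scikit-learn on placeholder data; leave-pair-out is omitted for brevity, and all values and parameters shown are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_score, train_test_split)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))          # placeholder feature matrix
y = rng.integers(0, 2, size=50)        # placeholder HC/AD labels
clf = LogisticRegression(max_iter=1000)

# k-fold cross-validation (here k = 10)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold:", cross_val_score(clf, X, y, cv=kfold).mean())

# leave-one-out cross-validation
print("leave-one-out:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())

# random sub-sampling (repeated random splits)
subsampling = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
print("random sub-sampling:", cross_val_score(clf, X, y, cv=subsampling).mean())

# simple split evaluation, e.g., 90-10%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)
print("90-10% split:", clf.fit(X_tr, y_tr).score(X_te, y_te))
```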
3. Future Work
As technology evolves, the methods of diagnosis and analysis evolve as well. Thus, more and better ways of detecting diseases, or even new diagnostic processes, are appearing. The detection and classification of Alzheimer's disease, usually performed via neurological tests and neuroimaging, is now possible through less invasive and equally efficient methods. The existing models for the detection of AD through speech have been increasing in quantity and quality, though improvements are still needed. At present, the biggest barriers in the methods created for the automatic detection of AD are that: (a) most systems are language dependent; (b) the number of samples used per study is very small, so the amount of data on which each system is based is too limited for it to achieve optimal performance; (c) system components are not always integrated and may require human intervention; and (d) feature sets are not yet fully established, although temporal aspects (total duration, speech rate, articulation rate, among others), pitch, voice periods, and interruptions, when combined with language or linguistic features, can lead to very good results. Additional research is needed to find the optimal combination of parameters and to determine which tasks the (potential) patient should be invited to perform. Thus, future work should include the implementation of multilingual or language-independent systems, supported by extensive and diverse databases (which still have to be gathered, with a balanced distribution of males/females, ages, and disease severity), as well as the automation of feature selection and extraction. Better, task-oriented decision models are also required.