Machine Learning Algorithms in Developing CNS Biomarkers

Machine Learning Algorithms in Developing CNS Biomarkers: Comparison

Please note this is a comparison between Version 1 by Ahnjili Zhuparris and Version 2 by Sirius Huang.

Drawing from an extensive review of 66 publications, we present a comprehensive overview of the diverse approaches to creating mHealth-based biomarkers using machine learning is presented herein. By exploring the current landscape of biomarker development using mHealth technologies and machine learning, researchers aimthe review aims to provide valuable insights into this rapidly evolving field. By doing so, researcherswe reflect on current challenges in this field and propose recommendations for ensuring the development of accurate, reliable, and interpretable biomarkers.

machine learning
biomarker
wearables
smartphones
mHealth
remote monitoring
central nervous system
clinical trials

1. Introduction

1.1. Motivation

Disorders that are affected by the Central Nervous System (CNS), such as Parkinson’s Disease (PD) and Alzheimer’s Disease (AD), have a significant impact on the quality of life of patients. These disorders are often progressive and chronic, making long-term monitoring essential for assessing disease progression and treatment effects. However, the current methods for monitoring disease activity are often limited by accessibility, cost, and patient compliance ^[1][2][1,2]. Limited accessibility to clinics or disease monitoring devices may hinder the regular and consistent monitoring of a patient’s condition, especially for patients living in remote areas or for those who have mobility limitations. Clinical trials incur costs related to personnel, infrastructure, and equipment. A qualified healthcare team, including clinical raters, physicians, and nurses, contributes to personnel costs through salaries, training, and administrative support. Trials involving specialized equipment for measuring biomarkers can significantly impact the budget due to costs associated with procurement, maintenance, calibration, and upgrades. Furthermore, infrastructure costs may increase as suitable facilities are required for data collection during patient visits and equipment storage. Patient compliance poses challenges for disease monitoring, as some methods require patients to adhere to strict protocols, collect data at specific time intervals, or perform certain tasks that can be challenging for patients to execute. Low or no compliance can lead to incomplete or unreliable monitoring results, which in turn can hinder the reliability of the assessments. Given these limitations, there is a growing interest in exploring alternative approaches to monitoring CNS disorders that can overcome these challenges. The increasing adoption of smartphones and wearables among patients and researchers offers a promising avenue for remote monitoring.

Patient-generated data from smartphones, wearables, and other remote monitoring devices can potentially complement or supplement clinical visits by providing data during evidence gaps between visits. As the promise of mobile Health (mHealth) technologies is to provide more sensitive, ecologically valid, and frequent measures of disease activity, the data collected may enable the development and validation of novel biomarkers. The development of novel ‘digital biomarkers’ using data collected from electronic Health (eHealth) and mHealth device sensors (such as accelerometers, GPS, and microphones) offers a scalable opportunity for the continuous collection of data regarding behavioral and physiological activity under free-living conditions. Previous clinical studies have demonstrated the benefits of smartphone and wearable sensors to monitor and estimate symptom severity associated with a wide range of diseases and disorders, including cardiovascular diseases [3], mood disorders [4], and neurodegenerative disorders ^[5][6][5,6]. These sensors can capture a range of physiological and behavioral data, including movement, heart rate, sleep, and cognitive function, providing a wealth of information that can be used to develop biomarkers for CNS disorders in particular. These longitudinal and unobtrusive measurements are highly valuable for clinical research, providing a scalable opportunity for measuring behavioral and physiological activity in real-time. However, these approaches may carry potential pitfalls as the data sourced from these devices can be large, complex, and highly variable in terms of availability, quality, and synchronicity, which can therefore complicate analysis and interpretation ^[7][8][7,8]. Machine Learning (ML) may provide a solution to processing heterogenous and large datasets, identifying meaningful patterns within the datasets, and predicting complex clinical outcomes from the data. However, the complexities involved in developing biomarkers using these new technologies need to be addressed. While these tools can aid the discovery of novel and important digital biomarkers, the lack of standardization, validation, and transparency of the ML pipelines used can pose challenges for clinical, scientific, and regulatory committees.

1.2. What Is Machine Learning

In clinical research, one of the primary objectives is to understand the relationship between a set of observable variables (features) and one or more outcomes. Building a statistical model that captures the relationship between these variables and the corresponding outputs facilitates the attainment of this understanding [9]. Once this model is built, it can be used to predict the value of an output based on the features.

ML is a powerful tool for clinical research as it can be used to build statistical models. A ML model consists of a set of tunable parameters and a ML algorithm that enables the generation of outputs based on given inputs and selected parameters. Although ML algorithms are fundamentally statistical learning algorithms, ML and traditional statistical learning algorithms can differ in their objectives. Traditional statistical learning aims to create a statistical model that represents causal inference from a sample, while ML aims to build generalizable predictive models that can be used to make accurate predictions on previously unseen data ^[10][11][10,11]. However, it is essential to recognize that while ML models can identify relationships between variables and outcomes, they may not necessarily identify a causal link between them. This is because even though these models may achieve good performances, it is crucial to ensure that their predictions are based on relevant features rather than spurious correlations. This enables the researchers to gain meaningful insights from ML models while also being aware of their inherent limitations.

While ML is not a substitute for the clinical evaluation of patients, it can provide valuable insights into a patient’s clinical profile. ML can help to identify relevant features that clinicians may not have considered, leading to better diagnosis, treatment, and patient outcomes. Additionally, ML can help to avoid common pitfalls observed in clinical decision making by removing bias, reducing human error, and improving the accuracy of predictions ^{[12][13][14][15]}[12,13,14,15]. As the volume of data generated for clinical trials and outside clinical settings continues to grow, ML’s support in processing data and informing the decision-making process becomes necessary. ML can help to uncover insights from large and complex datasets that would be difficult or impossible to identify manually.

To develop an effective ML model, it is necessary to follow a rigorous and standardized procedure. This is where ML pipelines come in. Table 1 showcases an exemplary ML pipeline, which serves as a systematic framework for automating and standardizing the model generation process. The pipeline encompasses multiple stages to ensure an organized and efficient approach to model development. First, defining the study objective guides the subsequent stages and ensures the final model meets the desired goals. Second, raw data must be preprocessed to remove errors, inconsistencies, missing data, or outliers. Third, feature extraction and selection identifies quantifiable characteristics of the data relevant to the study objective and extracts them for use in the ML model. Fourth, ML algorithms are applied to learn patterns and relationships between features, with optimal configurations identified through iterative processes until desired performance metrics are achieved. Finally, the model is validated against a new dataset that is not used in training to ensure generalizability. Effective reporting and assessment of ML procedures must be established to ensure transparency, reliability, and reproducibility.

Table 1.

Representation of a standard machine learning pipeline.

Objective
Study Design	The ML pipeline is provided with a study objective in which the features and corresponding outputs are defined. The ML model aims to identify the associations between the features and outputs.	The study objective is to classify Parkinson’s Disease patients and control groups using smartphone-based features.
2. Data Preprocessing	Data preprocessing filters and transforms raw data to guarantee or enhance the ML training process.	To improve the model performance, one may identify and exclude any missing or outlier data.
3. Feature Engineering and Selection	Feature engineering uses raw data to create new features that are not readily available in the dataset. Feature selection selects the most relevant features for the model objective by removing redundant or noisy features. Together, the goal is to simplify and accelerate the computational process while also improving the model process. For deep learning methods, the concept of “feature engineering” is typically embedded within the model architecture and training process, although substantial preprocessing steps may occur prior to that.	An interaction of two or more predictors (such as a ratio or product) or re-representation of a predictor are examples of feature engineering. Removing highly correlated or non-informative features are examples of feature selection. Note: The feature selection step can occur during model training
4. Model Training and Validation	During training, the ML model(s) iterates through all the examples in the training dataset and optimizes the parameters of the mathematical function to minimize the prediction error. To evaluate the performance of the trained ML model, the predictions of an unseen test set are compared with a known ground truth label.	Cross-validation can be used to optimize and evaluate model performance. Classification models may be evaluated based on their prediction accuracy, sensitivity, and specificity, while regression models may be evaluated using variance explained (R2) and Mean Absolute Error.

2. Machine Learning Algorithms

ML algorithms build a statistical model based on a training dataset, which can subsequently be used to make predictions about a new, unseen dataset. ML algorithms have been used in a wide variety of clinical trial applications, such as the classification of a diagnoses, classification of physical or mental state (such as a seizure or mood), and the estimation of symptom severity. Within the realm of clinical research, ML algorithms can be broadly divided into two learning paradigms: supervised and unsupervised learning ^[16][122].

Supervised ML algorithms use labeled data to map the patterns within a dataset to a known label, while unsupervised ML algorithms do not ^[17][123]. Rather, the unsupervised ML algorithms learn the structure present within a dataset without relying on annotations. Supervised learning can be used to automate the labelling process, detect disease cases, or predict clinical outcomes (such as treatment outcomes). There are scenarios when experts or participants can provide labelled data; however, it can become labor-intensive or time-consuming to label every observation. For example, a supervised learning algorithm trained to classify human sounds can be used to automatically annotate and quantify hours of coughs ^[18][124] and instances of crying ^[19][125]. These algorithms can also be used to differentiate between clinical populations and control participants ^[20][95] to identify known clinical population subtypes ^[21][23] or classify a clinical event (such as a seizure or tremor) ^[22][126]. The majority of the selected studies (N = 38) used a clinician to provide the label data. Some studies (N = 22) used a combination of a clinician and self-reported label data, and six studies solely relied on self-reported assessments. Unsupervised ML algorithms can be used to investigate the similarities and differences within a dataset without human intervention. This makes it the ideal solution for exploratory data analysis, subgroup phenotype identification, and anomaly detection. Among digital phenotyping studies, unsupervised learning has been used to identify location patterns ^[23][81] and classify sleep disturbance subtypes using wrist-worn accelerometer data ^[24][127].

It is important to recognize that unsupervised and supervised methods are not mutually exclusive, and they can be effectively combined. For instance, unsupervised methods can be employed to extract a meaningful latent representation of the input data. Subsequently, these latent vectors, along with the original inputs, can be used as inputs for a supervised model. This type of approach is commonly observed when applying techniques such as PCA, clustering, or other dimensionality reduction methods ^{[25][26][27][28]}[29,73,74,128]. By combining unsupervised and supervised methods, valuable information can be extracted from the data and used to enhance the performance and interpretability of the overall model.

In clinical research, supervised ML algorithms have been used to classify class labels or estimate scores. Classification algorithms learn to map a new observation to a pre-defined class label. These algorithms can be used to classify patient populations and patient population subtypes and identify clinical events. Regression algorithms learn to map an observation to a continuous output. These algorithms are commonly used to estimate symptom severity ^[29][129], quantify physical activity, and forecast future events ^[30][130]. Among the selected papers that were focused on the classification of a diagnosis or state, the four most common algorithms were Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest Neighbors (Figure 1). Some additional classification algorithm families identified were Naïve Bayes, Ensemble-based methods (including Decision Trees, Bagging, and Gradient Boosting), and Neural Networks (such as Convolutional, Artificial, and Recurring Neural Networks). The three most common algorithms for the regression-focused papers were Linear Regression (including linear mixed effects models), Support Vector Machine, and k-Nearest Neighbors (Figure 1). Most studies only considered or reported a single ML algorithm (N = 32). Additionally, 29 of the studies considered or reported two to five ML algorithms, and the remaining 5 studies considered six or more. The following section provides an overview of the most widely used machine learning models, their properties, advantages, and disadvantages. In addition, some notable off-the-shelf ML approaches and some custom-built ML methods such as transfer learning, multi-task learning, and generalized and personalized models, are also discussed.

Figure 1.

Machine learning algorithms and their respective objectives in the selected studies.

2.1. Tree-Based Models

A Decision Tree (DT) is a supervised non-parametric algorithm that is used for both classification and regression. A DT algorithm has a hierarchical structure in which each node represents a test of a feature, each branch represents the result of that test, and each leaf represents the class label or class distribution ^[31][32][131,132]. A Random Forest (RF) algorithm is a supervised ensemble learning algorithm consisting of multiple DTs that aims to predict a class or value ^[33][133]. Ensemble learning algorithms use multiple ML algorithms to obtain a prediction ^[34][134]. Tree-based models have several benefits. As each tree is only based on a subset of features and data and because they make no assumptions about the relationship between the features and distribution, they are not sensitive to collinearity between features, can ignore missing data, and are less susceptible to overfitting (for multiple trees), making the model more generalizable ^[35][135]. Another advantage of RF and DT models is that they can support linear and nonlinear relationships between the dependent and independent variables ^[36][136]. Further, as the design of the RF models can be interpreted in terms of feature importance and proximity plots, the interpretability of the RF model is feasible. However, a limitation of using tree-based models is that small changes in the data can lead to drastically different models. Additionally, the more complicated a tree-based model becomes, the less explainable a model becomes. However, pruning the trees can help to reduce the complexity of the model.

According to the selected studies, RF is a versatile and powerful model used for classification and regression tasks across multiple datatypes and populations. RF models have been used for the classification of diagnoses among PD patients ^[37][38][107,110], Multiple Sclerosis ^[39][40][34,118], and BD and unipolar depressed patients ^[41][42][45,61]. It is also a popular classification model for the classification of states or episodes, such as the detection of flares among Rheumatoid Arthritis or Axial Spondylarthritis patients ^[43][32] and tremor detection among PD patients ^[44][137], to quantify physical activity among cerebral palsy patients ^[45][138] and detect the moods of BD patients ^[46][47][69,139]. RF regression algorithms have also been used to predict anxiety deterioration among patients who suffer with anxiety ^[48][140].

2.2. Support Vector Machines

A Support Vector Machine (SVM) is a supervised algorithm that is used for classification and regression tasks. The objective of a SVM is to identify the optimal hyperplane based on the individual observations, also known as the support vectors. For SVM regression, the optimal hyperplane represents the minimal distance between the hyperplane and the support vectors. Whereas for SVM classification, the objective is to find the hyperplane that represents the maximum distance between two classes ^[49][141]. The hyperplanes can separate the classes in either a linear or non-linear fashion ^[36][136]. Given that SVMs are influenced by the support vectors closest to the hyperplanes, SVMs are less influenced by outliers, making them more suitable for extreme case binary classification. The performance of a SVM can be relatively poor when the classes are overlapping or do not have clear decision boundaries. This makes SVMs less appealing for classification tasks as inter class similarity is low. SVMs are computationally demanding models as they compute the distance between each support vector; hence, SVMs do not scale well for large datasets ^[50][142].

SVM classifiers have been used to classify clinical populations (e.g., facial nerve palsy and their control participants) ^[51][143]. SVM classifiers have also been used to classify events or states, such as detecting gait among PD patients ^[52][104] and classifying seizures among epileptic children ^[53][144]. Researchers identified studies that used SVM regression to estimate motor fluctuations and gait speed among PD and Multiple Sclerosis patients, respectively ^[27][54][74,145].

2.3. k-Nearest Neighbors

A k-Nearest Neighbor (k-NN) algorithm is a non-parametric supervised learning approach that can be used for multi-class classification and regression tasks. Classification k-NN algorithms determine class membership by the plurality vote of its nearest neighbors. They can estimate the continuous value of an output by calculating the average value of its nearest neighbors ^[36][136]. Given this, the quality of predictions is not only dependent on the amount of data but also on the density of the data (the number of points per unit). K-NN is simple to implement, intuitive to understand, and robust to noisy training data. However, the disadvantage is that k-NN is computationally slow when it is faced with large multi-dimensional datasets. Further, k-NN does not work well with imbalanced datasets, as under- or over-represented datapoints will influence the classification ^[55][146].

The most popular application for k-NN algorithms is for wearable-based time series data. K-NN classification models have been used to classify PD and healthy controls ^[56][24], classify tremor severity ^[57][147], predict acute exacerbations of chronic obstructive pulmonary disease (AECOPD) ^[58][44], and identify mood stability among BD and MDD patients ^[46][59][60][33,69,148]. Using wearable data, k-NN regression models have been used to predict the deterioration of symptoms associated with anxiety disorder ^[48][140].

2.4. Naïve Bayes

A Naïve Bayes (NB) classifier is a supervised multi-class classification algorithm. NB classifiers calculate the class conditional probability—the probability that a datapoint belongs to a given class in the data ^[49][61][141,149]. NB classifiers are computational efficient algorithms; thus, they are suitable for real-time predictions, scale well for larger datasets, and can handle missing values. A limitation of NB is that it assumes that all features are conditionally independent; hence, it is recommended that collinear features are removed in advance. Another limitation is that when new feature-observation pairs do not resemble the data in the training data, the NB assigns a probability of zero to that observation. This approach is particularly harsh, especially when dealing with a smaller dataset. Hence, the training data should represent the entire population.

As NB classifiers help form classification models, it is found that NB classifiers have been used for the classification of tremors or for freezing gait among PD patients ^[62][52], as well as to classify flares among Rheumatoid Arthritis and Axial Spondylarthritis patients ^[43][32] and classify bipolar episodes and mood stability among BD and MDD patients ^[46][59][60][33,69,148].

2.5. Linear and Logistic Regression

A Linear Regression model is a supervised regression model that predicts a continuous output. It finds the optimal hyperplane that minimizes the sum of squared difference between the true data points and the hyperplane. A Logistic Regression model is a supervised classification model that can be used for binomial, multinominal, and ordinal classification tasks. Logistic Regression classifies observations by examining the outcome variables on the extreme ends and determines a logistic line that divides two or more classes ^[36][136]. Linear and Logistic Regression are popular in algorithms as they are easy to implement, efficient to train, and easy to interpret. However, a limitation of both models is that they make multiple assumptions, e.g., that a solution is linear, the input residuals are normally distributed, and that all features are mutually independent ^[63][150]. Multicollinearity, the correlation between multiple features, and outliers will inflate the standard error of the model and may undermine the significance of significant features ^[64][151]. Further, outliers that deviate from the expected range of the data can skew the extreme bounds of the probability, making both algorithms sensitive to outliers in the dataset ^[63][150].

Linear Regression has been used to quantify tremors among Essential Tremor (ET) patients ^[65][116] and to estimate motor-related symptom severity among PD patients ^[66][67][31,93]. It has also been used to forecast convergence between body sides for Hemiparetic patients ^[30][130]. Logistic Regression was a popular approach for classifying PD diagnosis ^[37][38][107,110], Post-Traumatic Stress Disorder ^[68][109], and distinguishing fallers and non-fallers ^[69][152]. Logistic Regression has been used to classify drug effects, such as predicting the pre- and post-medication states among PD patients ^[70][22].

2.6. Neural Networks

Neural Networks (NN), also known as Artificial Neural Networks (ANN), can be used for unsupervised and supervised classification and regression tasks ^[71][153]. NN consists of a collection of artificial neurons (or nodes). Each artificial neuron receives, processes, and sends the signal to the artificial neuron connected to it. The neurons are aggregated into multiple layers, and each layer performs different transformations on the signal. The signal first travels from the input layer into the output layer while possibly traversing multiple hidden layers in between. NN offer several advantages, such as the ability to detect complex non-linear relationships between features and outcomes and work with missing data, while it also requires less preprocessing of the data and offers the availability of multiple training algorithms. However, the disadvantages of NN include increased computational burden, reduced explainability and interpretability (as NN are ‘black box’ in nature), and the fact that NN are prone to overfitting ^[72][154]. However, it is important to highlight the growing number of studies that specifically explore explainable deep learning approaches for biomarker discovery and development. Studies utilizing methodologies such as LIME (Lime Tabular Explainer), SHAP (Shapley Additive exPlanations), and other visual inspections of feature distribution and importance have aided clinicians in understanding the model mechanisms. These approaches also provide patient-specific insights by describing the importance of each feature, which may, in turn, facilitate personalized treatment opportunities ^{[73][74][75][76]}[90,155,156,157].

The most popular applications for neural networks were for the classification of a diagnosis or classification of a state or event. The most popular application is the detection of tremors among PD patients ^{[21][44][62][77][78]}[23,52,86,137,158]. NN have been used to classify unipolar and bipolar depressed patients based on motor activity ^[41][79][45,159], estimate depression severity ^[79][159], forecast seizures ^[80][160], and classify a treatment response using keyboard patterns among PD patients ^[81][161].

2.7. Transfer Learning

Transfer learning (also known as domain adaption) refers to the act of deriving the representations of a previously trained ML model to extract meaningful features from another dataset for an inter-related task ^[82][162]. One applicable scenario is the training of a supervised ML model on data collected in a controlled setting (such as in a lab or clinic). The performance of the model may suffer when applied to a dataset collected under free-living conditions. Rather than developing a new model trained solely on a free-living condition dataset, transfer learning can use patterns learned from the controlled setting dataset to improve the learning of the patterns from the free-living conditions dataset.

Transfer learning can also be a valuable technique for enhancing the utilization of limited or rare data ^[83][163]. One practical application is to employ pretraining on abundant control data and subsequently finetune the model on the specific population of interest to improve the model’s performance ^[83][84][85][163,164,165]. This approach not only optimizes the efficiency of utilizing scarce data but also facilitates model personalization. By adapting a pretrained model to individual characteristics or preferences, it becomes possible to create personalized models that better cater to unique needs or circumstances. Transfer learning thus offers a powerful means to leverage existing knowledge and make the most of available data resources, enhancing both the efficiency and personalization of biomarkers.

Given its application, transfer learning reduces the amount of labeled data and computational resources required to train new ML models ^[82][162], thus making this method advantageous when the sensor modalities, sensor placements, and populations differ between studies. While researchers only identified two studies that applied transfer learning to estimate PD disease severity using movement sensor data ^[86][87][166,167], researchers predict that the application of transfer learning will enable future researchers to overcome the challenges of a limited dataset and develop more sensitive and effective ML models.

2.8. Multi-Task Learning

Multi-task learning (MTL) enables the learning of multiple tasks simultaneously ^[88][168]. Learning the commonalities and differences between multiple tasks can improve both the learning efficiency and the prediction accuracy of the ML models ^[88][168]. A traditional single-task ML model can have a performance ceiling effect, given the limitations of the dataset size and the model’s ability to learn meaningful representations. MTL uses all available data across multiple datasets and can learn to develop generalized models that are applicable to multiple tasks. To use MTL, there should be some degree of information shared between or across all tasks. The correlation allows MTL to exploit the underlying shared information or principles within tasks. Sometimes MTL models can perform worse than single-task models because of ‘negative transfers’. This occurs when different tasks share no mutual information or if the information of tasks are contradictory ^[89][169]. MTL models have been used to simultaneously model data sourced from two separate sources or to model multiple outcomes ^[90][91][170,171]. For example, Lu et al. explored the use of MTL to jointly model data collected from two different smartphone platforms (iPhone and Android) to jointly predict two different types of depression assessments (QIDS and a DSM-5 survey) ^[92][79]. They illustrated that the classification accuracy of the MTL approach outperformed the single-task learning approach by 48%; thus, the classification model benefited from learning from observations sourced from multiple devices.

2.9. Generalized vs. Personalized

ML algorithms can be trained on population data or individual subject data. Generalized models, which are trained on population data, are fed data from all participants for the purpose of general knowledge learning. Conversely, personalized models are trained on an individual’s data and take into consideration individual factors such as biological or lifestyle-related variations ^[93][172]. The researchers have adopted these terms from Kahdemi et al.’s study, in which they developed generalized and personalized models for sleep-wake prediction ^[94][173]. The heterogenous nature of each population or individual can be a potential hinderance for generalizable models. A single individual’s deviation from the ‘norm’ may be viewed as a source of ‘noise’ in a generalized model. For example, patients with mood disorders such as MDD and BD have large inter-individual symptom variability. Abdullah et al., reliably predicted the social rhythms of BD patients with personalized models using smartphone activity data ^[95][30]. Cho et al. compared the mood prediction accuracy of personalized and generalized models based on the circadian rhythms of MDD and BD participants ^[96][38]. Their studies illustrated that their personalized model predictions were, on average, 24% more accurate than the generalized models. These studies lay the groundwork for developing personalized models that are more sensitive to individual differences.