The current polythetic and operational criteria for major depression inevitably contribute to the heterogeneity of depressive syndromes. The heterogeneity of depressive syndrome has been criticized using the concept of the language game in Wittgensteinian philosophy. Moreover, “a symptom- or endophenotype-based approach, rather than a diagnosis-based approach, has been proposed” as the “next-generation treatment for mental disorders” by Thomas Insel. Understanding this heterogeneity holds promise for personalized medicine in treating cases of depressive syndrome, in terms of both defining symptom clusters and selecting antidepressants. Machine learning algorithms have emerged as a tool for personalized medicine by handling clinical big data that can be used as predictors for subtype classification and treatment outcome prediction. The large clinical cohort data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D), Combining Medications to Enhance Depression Outcomes (CO-MED), and the German Research Network on Depression (GRND) have recently begun to be acknowledged as useful sources for machine learning-based depression research with regard to cost-effectiveness and generalizability. In addition, noninvasive biological tools such as functional and resting-state magnetic resonance imaging techniques are widely combined with machine learning methods to detect intrinsic endophenotypes of depression.
Depression is one of the most burdensome disorders worldwide, with a lifetime prevalence of approximately 20% of the global population. The remission rate after the first antidepressant trial is only 30%. This low remission rate arises partly because a diagnosis of depression does not account for heterogeneous symptom subtypes. Consequently, the concept that depression is characterized by symptomatic heterogeneity, such as atypical, melancholic, and anxious subtypes, has gained considerable attention. In addition, it has been reported that the heterogeneity of depressive syndrome can theoretically result from the polythetic and operational criteria of major depression. According to the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5), a confirmed diagnosis of major depressive disorder requires both the presence of five or more of nine symptoms (depressed mood, diminished interest or pleasure, weight loss or gain, insomnia or hypersomnia, psychomotor retardation or agitation, fatigue or loss of energy, feelings of worthlessness or excessive guilt, diminished thinking ability or indecisiveness, and recurrent thoughts of death or recurrent suicidal ideation) and the presence of either depressed mood or diminished interest or pleasure. The theoretical number of different symptom combinations that meet these polythetic and operational criteria can be calculated with the binomial coefficient (nCk), that is, the number of subsets of k draws from n distinguishable objects without replacement and without regard to order. Counting all subsets of five to nine of the nine symptoms (256) and excluding those that contain neither of the two core symptoms (29) gives 227 different symptom combinations that can fulfill the DSM-5 diagnostic criteria for major depressive disorder. In terms of psychiatric taxonomy, the heterogeneity of depressive syndrome has been criticized using the concept of a language game in Wittgensteinian philosophy. Wittgenstein suggested the analogy as follows:
Consider for example the proceedings that we call games. I mean board-games, card-games, ball-games, Olympic games, and so on. What is common to them all?—Don’t say: “There must be something common, or they would not be called games”—but look and see whether there is anything common to all.—For if you look at them you will not see something that is common to all, but similarities, relationships, and a whole series of them at that. To repeat: don’t think, but look!—the concept game is a concept with blurred edges.—“But is a blurred concept a concept at all?”—Is an indistinct photograph a picture of a person at all? Is it even always an advantage to replace an indistinct picture by a sharp one? Isn’t the indistinct one often exactly what we need? (Wittgenstein, 2001).
It has also been proposed that cases of depressive syndrome are conceptually related by “family resemblance” rather than by an “essence.” It has therefore been concluded that the heterogeneity of depressive syndrome is consistent with Wittgenstein’s analogy. Accordingly, in the context of the heterogeneity of major depressive disorder, the nomenclature of depressive syndrome may be more consistent with a dimensional approach than with a categorical one. Furthermore, based on the shift in theoretical construct from chemical imbalance to dysfunctional circuitry, Thomas Insel has emphasized a symptom-based approach, rather than a diagnosis-based approach, in his work on the next generation of treatments for mental disorders. Along with the heterogeneity concept, the therapeutic approach is also shifting toward selecting antidepressants according to specific symptom clusters. Each cluster of depressive symptoms may respond to specific antidepressants, which could improve the current low remission rates. So far, the theory supporting depression heterogeneity has not generated notable clinical utility, in that theory-driven classification of symptom clusters and subsequent antidepressant selection have produced only low accuracies in treatment outcome prediction. However, the clinical utility of the depression heterogeneity concept in diagnostics and therapeutics is increasingly acknowledged with the use of data-driven machine learning approaches.
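The figure of 227 symptom combinations quoted earlier in this section can be checked directly. The following is a minimal sketch, assuming only the DSM-5 rule as summarized above: nine symptoms, at least five present, and at least one of the two core symptoms (depressed mood or diminished interest or pleasure). The function names and the symptom indices are illustrative and do not come from the cited work.

```python
from itertools import combinations
from math import comb

N_SYMPTOMS = 9      # DSM-5 criterion symptoms for major depressive disorder
MIN_PRESENT = 5     # at least five symptoms must be present
CORE = {0, 1}       # illustrative indices: depressed mood, diminished interest or pleasure


def count_closed_form() -> int:
    """Subsets with at least five of nine symptoms, minus those lacking both core symptoms."""
    at_least_five = sum(comb(N_SYMPTOMS, k) for k in range(MIN_PRESENT, N_SYMPTOMS + 1))
    n_noncore = N_SYMPTOMS - len(CORE)
    without_core = sum(comb(n_noncore, k) for k in range(MIN_PRESENT, n_noncore + 1))
    return at_least_five - without_core


def count_by_enumeration() -> int:
    """Brute-force check: enumerate every qualifying symptom subset."""
    return sum(
        1
        for k in range(MIN_PRESENT, N_SYMPTOMS + 1)
        for subset in combinations(range(N_SYMPTOMS), k)
        if CORE.intersection(subset)
    )


print(count_closed_form(), count_by_enumeration())  # 227 227
```

Both routes agree: 256 subsets contain at least five symptoms, 29 of those contain neither core symptom, and 256 - 29 = 227.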
Machine learning approaches can be more beneficial in the study of depression than traditional methods. Factor analysis, for instance, may generate complicated combinations of heterogeneous symptoms within specific dimensions. Such analytic approaches can also be vulnerable to experimenter bias in that a researcher has to choose the number of components or clusters in the data, as in the k-means clustering method. Hierarchical clustering, a type of machine learning method, is an easy-to-implement, deterministic approach in which each symptom is assigned to a single cluster without predetermining the desired number of clusters.
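To make the contrast with k-means concrete, the following is a minimal sketch of agglomerative (hierarchical) clustering of symptoms, written with the standard library only. It assumes average linkage and a distance of 1 minus the Pearson correlation between symptom scores; both choices, the toy scores, and the names `cluster_symptoms` and `max_distance` are illustrative assumptions rather than the method of any cited study. Note that a cut-off distance still has to be chosen to obtain flat clusters, even though the number of clusters is not fixed in advance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def cluster_symptoms(scores, max_distance):
    """Average-linkage agglomerative clustering on 1 - correlation.

    Merging stops once the two closest clusters are farther apart than
    max_distance, so the number of clusters is not specified beforehand.
    """
    names = list(scores)
    dist = {(a, b): 1.0 - pearson(scores[a], scores[b]) for a in names for b in names}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if height > max_distance:
            break
        merged = clusters[i] + clusters[j]
        del clusters[j]          # j > i, so index i is unaffected
        clusters[i] = merged
    return clusters


# Toy item scores for six hypothetical patients (not real data).
toy_scores = {
    "insomnia": [3, 2, 3, 0, 1, 0],
    "fatigue": [3, 3, 2, 1, 0, 0],
    "guilt": [0, 1, 0, 3, 2, 3],
    "worthlessness": [0, 0, 1, 3, 3, 2],
}
symptom_clusters = cluster_symptoms(toy_scores, max_distance=1.0)
print(symptom_clusters)
```

With a cut-off of 1.0, only positively correlated symptoms are merged, so the toy data separate into a sleep and energy cluster and a guilt and worthlessness cluster. The procedure involves no random initialization, so repeated runs give the same result.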
2. Brain Imaging Techniques and Machine Learning in Depression
Machine learning algorithms are widely applicable to diverse patient data for elucidating the complex nature of depression. In particular, in addition to the aforementioned clinical cohort data, there has been growing attention toward using brain imaging methods to detect endophenotypes of depression that are clinically significant and feasible to translate into diagnosis. Brain magnetic resonance imaging (MRI) is one of the most widely used techniques for identifying potential biological markers of depression. Importantly, MRI techniques combined with machine learning algorithms can classify depression on the basis of brain networks and predict treatment response. First, some representative studies adopted graph theory approaches to detect defective functional and structural brain networks in depressed patients. Gong et al. enumerated diverse brain network features, such as alterations in regional and connectivity patterns across different MRI modalities in depression, including regional betweenness and degree centrality in structural MRI, region-of-interest-based analysis in functional MRI, and white matter structural connectivity in diffusion tensor imaging. Zeng et al. examined whole-brain functional connectivity at resting state to distinguish depressed patients from controls, which yielded 100% sensitivity. The most discriminative functional connectivity was found within or across the affective network, default mode network, and visual cortical areas, which seem to play a critical role in the neural mechanisms of depression. Second, other representative studies sought alterations in resting-state brain network activity as potential endophenotypes for predicting therapeutic outcomes. Drysdale et al. suggested that four different patterns of fronto-striatal and limbic functional connectivity, derived from functional MRI analyses, be defined as depression biomarkers. The biomarkers were also related to distinctive profiles of clinical symptoms. For instance, biomarker 1 was associated with severe fatigue and anhedonia and showed the best response to repetitive transcranial magnetic stimulation. Redlich et al. examined whether changes in gray matter volume predict response to electroconvulsive therapy. Support vector regression, accompanied by univariate analysis of the Hamilton Depression Rating Scale (HDRS) score, successfully predicted the response to electroconvulsive therapy and the reduction in HDRS score. Jiang et al. predicted remission after electroconvulsive therapy using the gray matter of depressed patients, and six different gray matter networks were suggested as predictors of response to electroconvulsive therapy. Thus, connectome-based endophenotypes may yield novel opportunities to define the diagnosis of depression and improve therapeutic response.
3. Future Research into Treatment Selection Models
We address the major steps involved in building antidepressant selection models from a clinical database that contains, for each patient, values on variables representing clinical and demographic characteristics, the therapeutics applied to the patient, and the observed outcomes of those therapeutics. Understanding these sequential steps is crucial for interpreting and evaluating the utility of findings from antidepressant selection studies.
The first step is to establish candidate predictor variables. Appropriate candidate predictor variables are those that are acquired prior to treatment assignment and that could credibly be related to outcome, either generally or differentially between treatments. If a previous study has suggested that a variable can predict an outcome, then it should be included as a potential predictor variable. However, as the literature on predictors in psychiatric disorders is still relatively scarce, considering other putative variables is recommended.
Variables should be free of substantial missingness, and systematic missingness should be examined to ensure the appropriateness of imputation. Variables should also show sufficient variability. For instance, it does not make sense to include sex if 90% of the sample is male. Variable selection for prediction is also sensitive to situations in which predictors exhibit high collinearity. Therefore, it is advisable to examine the covariance structure of the putative predictors and take measures to reduce high collinearity. Other suggestions for preparing putative predictors include addressing outliers, making categorical variables binary, and transforming variables for theoretical reasons or to handle highly skewed distributions.
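The screening checks described above (missingness, variability, and collinearity) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended pipeline: the thresholds (`max_missing`, `max_dominant`, `max_abs_r`), the function names, and the toy cohort with its variable names are all hypothetical. The 90% rule for a dominant category mirrors the sex example in the text. Collinear pairs are only flagged, because deciding how to reduce collinearity is left to the analyst.

```python
from itertools import combinations
from math import sqrt


def missing_fraction(values):
    return sum(v is None for v in values) / len(values)


def dominant_share(values):
    """Share of observed values taken by the most common value (near 1 means low variability)."""
    observed = [v for v in values if v is not None]
    return max(observed.count(v) for v in set(observed)) / len(observed)


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def screen_predictors(data, max_missing=0.2, max_dominant=0.9, max_abs_r=0.8):
    """Return (kept names, {dropped name: reason}, collinear pairs among complete numeric variables)."""
    kept, dropped = [], {}
    for name, values in data.items():
        if missing_fraction(values) > max_missing:
            dropped[name] = "too much missingness"
        elif dominant_share(values) >= max_dominant:
            dropped[name] = "too little variability"
        else:
            kept.append(name)
    # Correlations are computed only for kept variables that are numeric and fully observed.
    numeric = [n for n in kept if all(isinstance(v, (int, float)) for v in data[n])]
    collinear = [
        (a, b)
        for a, b in combinations(numeric, 2)
        if abs(pearson(data[a], data[b])) > max_abs_r
    ]
    return kept, dropped, collinear


# Toy cohort of ten hypothetical patients (not real data).
cohort = {
    "age": [52, 29, 45, 34, 40, 61, 47, 25, 38, 55],
    "sex": ["M"] * 9 + ["F"],
    "baseline_hdrs": [18, 22, 25, 30, 27, 21, 19, 24, 28, 23],
    "baseline_qids": [9, 11, 13, 15, 14, 10, 10, 12, 14, 12],
    "prior_episodes": [2, None, None, 1, None, 3, None, None, 0, None],
}
kept, dropped, collinear = screen_predictors(cohort)
print(kept, dropped, collinear)
```

In the toy cohort, sex is dropped because 90% of the sample is male, the number of prior episodes is dropped for missingness, and the two baseline severity scores are flagged as a collinear pair.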
Once putative predictor variables are selected, the next step is to construct the prediction model. This is typically a two-step procedure that includes variable selection and model specification. Many different variable selection approaches have been suggested for treatment selection, all of which attempt to identify which variables, among the putative predictors, contribute significantly to the prediction of outcome. Gillan and Whelan presented an outstanding discussion of data-driven versus theory-driven approaches to model specification. Typical approaches depend on parametric regression models that select only variables with statistically significant contributions to the outcome. Another approach applies penalties with the goal of limiting the number of selected variables. Others utilize bootstrapping procedures that help maximize the generalizability of the models. Progress in statistical modeling has led to feature selection methods, largely based on machine learning algorithms, that can flexibly model and identify predictors, even with higher-order interactions. Gillan and Whelan provided an in-depth discussion of the merits of machine learning in the field of psychiatric disorders.
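The penalty-based and bootstrap-based ideas mentioned above can be combined in one small sketch. It assumes a linear lasso (an L1 penalty fitted by coordinate descent) on a continuous outcome and counts how often each predictor receives a nonzero coefficient across bootstrap resamples. The synthetic data, the penalty value of 0.3, and all function names are illustrative assumptions. A real study might instead use penalized logistic regression for a binary outcome such as remission, and would tune the penalty by cross-validation.

```python
import random


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, penalty):
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(columns, y, penalty, n_sweeps=30):
    """Coordinate-descent lasso minimizing (1/2n) * RSS + penalty * sum(|beta|).

    `columns` is a list of predictor columns; predictors and outcome are centered here.
    """
    n = len(y)
    cols = [center(c) for c in columns]
    resid = center(y)                      # residual with all coefficients at zero
    beta = [0.0] * len(cols)
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            # Add predictor j's contribution back to obtain the partial residual.
            partial = [r + beta[j] * x for r, x in zip(resid, col)]
            rho = sum(x * r for x, r in zip(col, partial)) / n
            scale = sum(x * x for x in col) / n
            beta[j] = soft_threshold(rho, penalty) / scale
            resid = [r - beta[j] * x for r, x in zip(partial, col)]
    return beta


def selection_frequency(columns, y, penalty, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each predictor gets a nonzero coefficient."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(columns)
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)   # resample patients with replacement
        boot_cols = [[c[i] for i in idx] for c in columns]
        boot_y = [y[i] for i in idx]
        for j, b in enumerate(lasso(boot_cols, boot_y, penalty)):
            counts[j] += b != 0.0
    return [c / n_boot for c in counts]


# Synthetic data: only the first of four predictors truly drives the outcome.
data_rng = random.Random(1)
n_patients = 60
predictors = [[data_rng.gauss(0, 1) for _ in range(n_patients)] for _ in range(4)]
outcome = [2.0 * x + data_rng.gauss(0, 1) for x in predictors[0]]
frequencies = selection_frequency(predictors, outcome, penalty=0.3)
print(frequencies)
```

Predictors that are selected in nearly every resample are more likely to generalize than those that appear only occasionally, which is the rationale for pairing a sparsity penalty with bootstrapping.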