Amyotrophic Lateral Sclerosis (ALS) is the most common late-onset motor neuron disorder, but the molecular mechanisms and pathways underlying this disease remain poorly understood. Genome-Wide Association Studies (GWAS) aim to identify Single Nucleotide Polymorphisms (SNPs) and other types of genetic variation that are more frequent in patients than in people without the disease, using a variety of statistical tests. Despite rapid recent technological advances and great efforts in the GWAS field that have led to the genomic profiling of large ALS cohorts, the identified associations explain only a very small fraction of ALS heritability and aetiology. Here, we outline ALS-specific GWAS challenges, explaining the limitations of traditional GWAS analyses in light of known features of the ALS genetic architecture and hypotheses about ALS pathology (e.g., multilocus interactions, rare variants with low effect size). Future advances in the genomic and machine learning fields may bring about a better understanding of ALS genetic architecture and enable improved personalized approaches to this and other devastating and complex diseases.
Amyotrophic Lateral Sclerosis (ALS) is a progressive, fatal, late-onset motor neuron disorder that is predominantly characterised by the loss of upper and lower motor neurons. Progressive muscle atrophy in ALS patients leads to swallowing difficulties, paralysis and ultimately to death from neuromuscular respiratory failure. ALS is the most common type of motor neuron disorder, and has peak onset at 54–67 years old, although it can affect individuals of any age. Patients typically survive 2–5 years after the first symptoms occur, with 5–10% surviving more than 10 years. A population-based study of estimated ALS incidence in 10 countries found that prevalence could increase more than 31% from 2015 to 2040. Thus, there is an increasing need to understand ALS pathology and the molecular pathways involved, towards prevention or successful therapeutic intervention.
There are two major classifications among ALS patients, based on family history: 5–10% of cases are genetically linked and are classified as familial, with one or more relatives who also suffer from ALS, while 90% are classified as sporadic, in which a familial history is not established and a genetic cause is usually not identified. However, the distinction between the two categories is not always simple, as familial ALS-associated mutations are also present among sporadic ALS cases. The extent and form of the genetic contribution to sporadic ALS remain unclear, but genetic factors are considered to play an important role in the disease pathology. Further investigation of the genetic architecture of both familial and sporadic cases is necessary.
Our current knowledge of the aetiology and the genetic architecture of ALS is still incomplete. Genetic mutations, environmental contributions, epigenetic changes and DNA damage are hypothesized as potential causal factors that ultimately lead to motor neuron death. Variants in more than 30 genes are recognized as monogenic causes of ALS. The most frequent monogenic cause in European populations is the intronic hexanucleotide GGGGCC (G4C2) repeat expansion (HRE) in the C9orf72 gene. Other genes linked to ALS with high reproducibility include Cu/Zn superoxide dismutase 1 (SOD1), fused in sarcoma (FUS), and transactive response DNA-binding protein of 43 kDa (TARDBP/TDP-43). The discovery of risk gene mutations has helped to unravel the molecular mechanisms of ALS, and may ultimately lead to targeted therapy and stratified drug discovery.
Numerous published studies have aimed to explain motor neuron death by investigating the functional effects of specific mutations in known risk-associated genes such as C9orf72, FUS, SOD1 and TDP-43. Recent systematic reviews from our group have summarised the molecular pathways and biomarkers for which there is strong supporting evidence in ALS. The molecular pathways affected in ALS can be grouped as follows (see  for detailed review):
Mitochondrial dysfunction, as a direct or indirect consequence of mutations in the ALS-associated genes CHCHD10, FUS, SOD1, C9orf72 and TDP-43, can lead to an increase in oxidative stress, an increase in cytosolic calcium, ATP deficiency and/or stimulation of pro-apoptotic pathways.
Oxidative stress can also arise from stimulation of NADPH oxidase, as observed with ATXN2 mutations, or from deficient elimination of Reactive Oxygen Species (ROS), as observed with some SOD1 mutations in familial cases. Oxidative stress may in turn contribute to DNA damage. Interestingly, mutations in the ALS-associated genes NEK1, SETX and C21orf2 are suspected to alter the DNA repair machinery, leading to an accumulation of oxidative damage over time. Consequently, these events could ultimately lead to motor neuron death.
Disrupted axonal transport has been directly linked to a mutation in the C-terminal region of the ALS-associated gene KIF5A, and to mutations in genes encoding neurofilaments (NEFH), microtubules and motor proteins (PFN1, TUBA4A, DCTN1). Consequently, organelle transport, protein degradation, and RNA transport are affected, disrupting cellular homeostasis. Similarly, axonal transport disruptions have been observed in familial ALS patients harbouring mutations in non-cytoskeletal-related genes such as SOD1.
Protein degradation is suspected to be a key pathway that is defective in ALS. This can be a direct consequence of mutations in ALS-associated genes involved in proteasome activity and the autophagy pathway, such as UBQLN2, VCP, SQSTM1/P62, OPTN, FIG4, SPG11, or TBK1, and may lead to an accumulation of misfolded and non-functional proteins. It can also be an indirect consequence of other mutations that lead to the formation of protein aggregates (such as SOD1, FUS, TDP-43 and C9orf72-derived DPR aggregates), which in turn impair the proteasome and autophagic degradation pathways, thus exacerbating the accumulation of misfolded proteins. Consequently, the blockade of autophagy pathways may affect vesicle secretion. Interestingly, some ALS-associated genes are known to be directly or indirectly involved in exosome biogenesis, such as CHMP2B or C9orf72, respectively.
Glutamate-mediated excitotoxicity has been suggested to cause motor neuron deterioration, and could be an indirect consequence of mutations in ALS-associated genes such as SOD1 or C9orf72, resulting in an elevated level of glutamate in the cerebrospinal fluid of patients.
RNA processing and metabolism is another key pathway affected in ALS. For example, mutations affecting the RNA-binding proteins encoded by FUS, TDP-43, hnRNPA1, hnRNPA2B1, and MATR3 result in altered mRNA splicing, RNA nucleocytoplasmic transport and translation, as well as in the generation and accumulation of toxic stress granules. Similarly, accumulation of toxic RNA foci can be observed in motor neurons in the context of C9orf72 mutations, and may lead to the sequestration of splicing proteins, thus affecting RNA maturation and translation. The biogenesis of microRNA is also directly affected by mutated FUS, TDP-43, or C9orf72-mediated DPRs, which has an impact on the expression of genes involved in motor neuron survival.
Understanding the functional processes that drive ALS pathology has proven to be a difficult and complex task, compounded by the heterogeneity that characterises the disease. The gene products of the 30 or more known ALS-associated genes interact with each other, are implicated in multiple molecular pathways, and result in multiple disease phenotypes, making functional curation and interpretation complex. In addition, these monogenic causes in ALS occur only in ∼15% of sporadic ALS and ∼66% of familial ALS patients, so that more than 80% of the ALS population do not currently have any known ALS-associated mutations. Nonetheless, acquiring an in-depth understanding of the molecular mechanisms and the genetic architecture of ALS could potentially lead to the identification of multiple patient strata and therefore targeted therapies to be applied to different subgroups of ALS patients.
In recent years, advances in high-throughput technologies have enabled the discovery of multiple Single Nucleotide Polymorphisms (SNPs) that are associated with ALS, mainly by the application of the Genome-Wide Association Study (GWAS) approach. GWAS aims to identify SNPs and other types of genetic variation (such as structural variants, copy number variations and multiple nucleotide polymorphisms) that are more frequent in patients than in people without the disease. Statistical tests for disease association are carried out across genetic markers numbering from hundreds of thousands up to millions, depending on the genomic analytical platform. The most popular genotype-phenotype association studies use statistical models such as logistic or linear regression, depending on whether the trait is binary (i.e., case-control studies, such as ALS versus healthy controls) or quantitative (e.g., height). GWAS has been successful in discovering tens of thousands of significant genotype-phenotype associations in a large spectrum of diseases and traits, such as schizophrenia, anorexia nervosa, body-mass index (BMI), type 2 diabetes, and ALS. Over the past decade, the discovery of significant genotype-phenotype associations has provided new insights into disease susceptibility, pathology, prevention, drug design and personalized medical approaches.
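As a rough illustration of what a single-marker case-control test looks like, the sketch below runs an allelic chi-square test on a 2×2 table of allele counts. This is a deliberately simpler stand-in for the logistic regression models mentioned above (it cannot adjust for covariates such as ancestry), and the function name and the allele counts are hypothetical, chosen only for illustration.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic chi-square test (1 degree of freedom) for one variant.

    Arguments are allele counts (not genotype counts) in cases and controls.
    Returns (chi2, p_value, odds_ratio).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 100 cases and 100 controls (200 alleles per group).
chi2, p_value, odds_ratio = allelic_association(60, 140, 40, 160)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, OR={odds_ratio:.3f}")
```

In a genome-wide setting the same test is repeated independently for every marker, so a nominally small p-value for one variant is not, on its own, evidence of association.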
So far, numerous ALS GWAS have been published, aiming to identify novel ALS-associated variants through standard genotype-phenotype analyses. The first was published in 2007, providing genomic data for 276 cases and 271 controls. Rapid recent technological advances and great efforts in the field have led to the genomic profiling of large ALS cohorts, providing new insights into the pathology of ALS. The largest release of ALS genomic data was published in 2018 by Nicolas et al., and identified KIF5A as a novel ALS-associated gene; the study included a large, publicly available meta-analysis dataset of 10,031,630 imputed SNPs across 20,806 ALS cases and 59,804 controls, and also provided controlled access to “raw” genomic data including SNP-arrays of 12,188 cases and 3,292 controls. Initiatives such as Project MinE and dbGaP have contributed to the systematic release of ALS GWAS data. ALSoD, a publicly available database of genes implicated in ALS, records 126 genes, a subset of which have been reproduced in multiple studies. As of July 2020, the GWAS Catalog has published 317 variants and risk allele associations with ALS.
The genetic contribution to familial and sporadic ALS has not been fully explained by genotype-phenotype discoveries, and the known Mendelian causes of ALS represent only a small proportion of the ALS population. Nonetheless, estimates of heritability are high in sporadic ALS patients (for example, 61% in a twin meta-analysis study), suggesting that genetic factors are strongly represented in sporadic ALS and that further investigation may yet identify novel causal variants and/or multilocus interactions that could account for this high estimated heritability.
So far, evidence supports a model implicating rare variants (minor allele frequency <1%) along with non-genetic causes, such as environmental factors. Large GWAS efforts suggest a genetic architecture for ALS that falls somewhere in the middle of the spectrum of genetic pathology in terms of effect size and prevalence of risk variants, i.e., an intermediate genetic architecture, lying between conditions such as schizophrenia, which have many common variants each imparting a small increase to disease risk, and conditions such as Huntington’s disease, which are caused by rare large-effect variants located in a single gene.
Many ALS-associated variants, particularly in C9orf72, also contribute to other conditions such as frontotemporal dementia (FTD) and cerebellar disease, suggesting that ALS is a multi-system syndrome. ALS has an established overlap with other neurodegenerative and neuropsychiatric disorders, investigation of which could lead to insights into its pathology. An example of this is the degree of overlap between familial ALS (∼40%) and familial FTD (∼25%) patients that carry the G4C2 expansion of C9orf72. The C9orf72 hexanucleotide expansion has been associated with multiple traits including Alzheimer’s and Parkinson’s diseases, ataxia, chorea and schizophrenia. A population-based GWAS study reported a higher prevalence of psychosis, suicidal behaviour, and schizophrenia in Irish ALS kindreds, which was associated with the C9orf72 repeat expansion, based on an aggregation analysis. Further evidence for a shared susceptibility to ALS was provided by the greater occurrence of dementia among first-degree relatives of ALS patients. Several studies have suggested that the genetic overlap between ALS and other neurodegenerative and neuropsychiatric disorders could also be explained by the presence of ALS-associated pleiotropic variants that influence multiple, and in some cases quite distinct, phenotypic traits. One study that supports this hypothesis is that of O’Brien et al., which shows that first-degree and second-degree relatives of Irish ALS patients have a significantly higher prevalence of schizophrenia and neuropsychiatric diseases than healthy controls, including obsessive-compulsive disorder, psychotic illness, and autism; the authors performed k-means clustering and calculated the relative risk to estimate aggregation.
Although hundreds of ALS-associated variants have been recorded in public databases such as the GWAS Catalog, these associations show very little reproducibility across different studies and have not been able to explain a large percentage of ALS heritability, a phenomenon generally known as the “missing heritability” paradox. It has been proposed that SNPs contribute ∼8.5% of the overall heritability of ALS, although it should be noted that such estimates consider only linear single-marker effects of SNPs. Here we outline some general GWAS limitations in the context of ALS, as well as potential reasons why standard GWAS genotype-phenotype analysis is unlikely to fully explain the genetic architecture of ALS.
A first general challenge in large-scale genomic analyses is to ensure a high quality of the genotype data, so that the downstream results of the experimental design reflect true biology and not artifacts. Therefore, the collected genomic data first need to pass a comprehensive Quality Control (QC) pipeline including multiple sample and variant QC steps. One challenge is that each dataset has its own specific features, so there are no fixed thresholds for each quality-control step. For this reason, each study needs to follow a data-driven approach, taking into consideration the distribution of each data metric. However, there are some good practices in QC that are generally applicable to most studies. For example, it is typical to first filter out low-quality samples and then remove poor-quality markers, an order that ensures that as many genetic markers as possible are kept in the final dataset. However, overly strict thresholds can lead to the loss of a substantial proportion of samples, reducing study power. Another challenge is to ensure homogeneity of the collected samples in terms of ancestry. This QC step is carried out by analysing the population structure to remove ancestry outliers, and by accounting for confounding factors such as residual population sub-structure in later stages of the analysis, usually by including the first few Principal Components from a Principal Component Analysis of the homogeneous sample cohort. It is also very important to check for duplicated samples and, in non-family GWAS analyses, to ensure that all samples are unrelated so that specific genotypes are not over-represented (and thereby bias the subsequent analysis). Identity-by-descent (IBD) estimates, which are based on the number of variants that a pair of individuals share, are used to detect such duplicated or related samples so that they can be removed.
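The sample-then-marker ordering described above can be sketched as follows. This is a minimal example under stated assumptions: the toy genotype matrix, the 0.75 call-rate thresholds and the function names are all hypothetical, and a real pipeline would choose thresholds from the data and apply further checks (for example relatedness and ancestry).

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)


def qc_samples_then_variants(genotypes, sample_min=0.75, variant_min=0.75):
    """Filter samples first, then variants, using call-rate thresholds.

    genotypes: list of rows (one per sample), each a list of 0/1/2 or None.
    Returns (kept_sample_indices, kept_variant_indices).
    """
    kept_samples = [i for i, row in enumerate(genotypes)
                    if call_rate(row) >= sample_min]
    if not kept_samples:
        return [], []
    n_variants = len(genotypes[0])
    kept_variants = [
        j for j in range(n_variants)
        if call_rate([genotypes[i][j] for i in kept_samples]) >= variant_min
    ]
    return kept_samples, kept_variants


def variants_passing_without_sample_qc(genotypes, variant_min=0.75):
    """For comparison only: variant filter applied before any sample filter."""
    return [j for j in range(len(genotypes[0]))
            if call_rate([row[j] for row in genotypes]) >= variant_min]


# Hypothetical toy data: 5 samples x 4 variants; sample 1 is of poor quality.
toy = [
    [0, 1, 2, 0],
    [None, None, None, 1],
    [1, 0, 1, 1],
    [2, 1, 0, 1],
    [None, 1, 1, 0],
]
samples, variants = qc_samples_then_variants(toy)
print("samples kept:", samples)
print("variants kept (samples first):", variants)
print("variants kept (variants first):", variants_passing_without_sample_qc(toy))
```

In this toy matrix, removing the poor-quality sample first lets the first variant pass the marker filter, whereas filtering variants first would discard it; this is the rationale for the ordering.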
GWAS is a single-marker analysis that treats each variant association as an independent event contributing to the phenotype. For this reason, it is standard practice to correct results using the strict multiple-testing threshold (p < 5 × 10⁻⁸) of the Bonferroni correction in order to control for false positive discoveries (family-wise type I errors). This threshold derives from the assumption that 1,000,000 independent markers are tested at a significance level of 5% (0.05/1,000,000 = 5 × 10⁻⁸). Particularly in studies with a low sample size, this correction can reduce the power of the analysis, which may then fail to capture a portion of potential risk variants that do not pass the significance threshold (false negatives, or type II errors).
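The arithmetic behind the genome-wide threshold can be written out in a few lines. The sketch below is illustrative only: the function names and the example p-values are hypothetical, and the default of 1,000,000 independent tests is the assumption stated in the text rather than a property of any particular dataset.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def genome_wide_hits(p_values, alpha=0.05, n_tests=1_000_000):
    """Return the markers whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {marker: p for marker, p in p_values.items() if p < threshold}


# Hypothetical p-values for three markers.
example_p_values = {"snp_a": 3e-9, "snp_b": 2e-7, "snp_c": 0.04}

print("threshold:", bonferroni_threshold(0.05, 1_000_000))
print("hits:", genome_wide_hits(example_p_values))
```

Note that the marker with p = 2 × 10⁻⁷ is discarded even though it would be considered strong evidence in a single-test setting, which illustrates how true risk variants can be lost in underpowered studies.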
Univariate analyses such as GWAS, which test trait association for one locus at a time, are not able to capture multilocus interactions (a phenomenon called epistasis) or the interaction of the environment with the genome, both of which could potentially account for the missing heritability of ALS and explain the disease pathology. The term epistasis was introduced in genetics over a century ago by Bateson et al., and genetic and evolutionary biology studies have highlighted the importance of gene-gene interactions not only in the genetic architecture of an organism but also in evolution. Epistasis represents non-additive events in the genome, including interactions among two or more loci that have an effect on the phenotype. Several studies have highlighted the role of epistasis in pathology, showing that SNP interactions provide a stronger association with the disease than the participating SNPs do individually. To understand pathology in a complex disease such as ALS, it may be necessary to identify complex genetic interactions, including epistatic interactions. Nevertheless, the study of multilocus interactions poses a number of challenges, in particular the need for high computational power, as the number of tested interactions is extremely high even for pairwise combinations. Multivariate computational approaches and appropriate machine learning methods may therefore be better placed to capture the potentially complex relationships among risk variants in ALS.
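To give a sense of scale for the computational burden, the sketch below counts the number of distinct SNP combinations in an exhaustive interaction scan and the Bonferroni threshold that such a scan would imply. The panel size of 1,000,000 SNPs is an assumed round number and the function names are hypothetical.

```python
import math


def n_interactions(n_snps, order=2):
    """Number of distinct unordered SNP combinations of the given order."""
    return math.comb(n_snps, order)


def interaction_bonferroni(n_snps, order=2, alpha=0.05):
    """Bonferroni threshold if every combination were tested once."""
    return alpha / n_interactions(n_snps, order)


n_snps = 1_000_000  # assumed panel size, for illustration only
print("pairwise tests:", n_interactions(n_snps, 2))
print("three-way tests:", n_interactions(n_snps, 3))
print("pairwise Bonferroni threshold:", interaction_bonferroni(n_snps, 2))
```

Even at the pairwise level the number of tests is roughly 5 × 10¹¹, so an exhaustive scan is both computationally expensive and subject to a far stricter significance threshold than a single-marker GWAS.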
GWAS is more successfully employed under a “common disease-common variant” hypothesis, being of particular use in common diseases such as schizophrenia, which are driven by many risk alleles each with high frequency. In contrast, ALS is a heterogeneous disease likely comprising multiple strata, each resulting from combinations of different rare mutations and other factors. As a result, stratum-specific mutations may each have very small effects that are diluted across the whole cohort and thus not captured by GWAS. The majority of GWAS analyses have used SNP-arrays, as until recently these had a lower experimental cost than sequencing of the exome or the whole genome. SNP-array analyses can typically capture the effect of only common variants on the phenotype, whereas sequencing analyses identify both common and rare variants. In most SNP-array GWAS studies, variants with a Minor Allele Frequency (MAF) of <1–5% are removed from subsequent analysis, as they are generally more difficult to genotype and are therefore considered potential false positives. Nevertheless, whole genome sequencing, custom-designed exome sequencing arrays, rare variant burden analyses and imputation approaches using large reference panels (such as the Haplotype Reference Consortium, which contains 64,976 haplotypes) address this challenge by recovering both rare (down to 0.1% MAF) and common variants that SNP-array platforms do not usually contain. However, there is still a proportion of low-frequency minor allele effects on the phenotype that cannot yet be detected by GWAS approaches and that could also potentially explain some of the missing heritability in ALS.
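The MAF filter mentioned above can be illustrated with a short sketch. It assumes genotypes coded as alternate-allele dosages (0, 1 or 2, with None for a missing call); the 1% threshold, the variant names and the toy data are hypothetical.

```python
def minor_allele_frequency(dosages):
    """MAF of one variant from 0/1/2 alternate-allele dosages (None = missing)."""
    called = [g for g in dosages if g is not None]
    if not called:
        raise ValueError("no called genotypes for this variant")
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return min(alt_freq, 1.0 - alt_freq)


def split_by_maf(variants, threshold=0.01):
    """Split variants into (common, rare) dictionaries keyed by variant name."""
    common, rare = {}, {}
    for name, dosages in variants.items():
        maf = minor_allele_frequency(dosages)
        (common if maf >= threshold else rare)[name] = maf
    return common, rare


# Hypothetical toy variants: one common, one seen once in 100 samples.
toy_variants = {
    "var_common": [0, 1, 2, 1, 0, 1, 0, 0, 1, 0],
    "var_rare": [0] * 99 + [1],
}
common, rare = split_by_maf(toy_variants)
print("common:", common)
print("rare:", rare)
```

A variant carried by a single heterozygous individual in 100 samples has a MAF of 0.5% and would be dropped by a 1% filter, which is how rare risk alleles can disappear from a standard SNP-array analysis before any association test is run.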
Lastly, another common GWAS challenge in complex diseases is the difficulty of distinguishing causal variants from non-causal variants that are in high linkage disequilibrium with them. Linkage disequilibrium describes the phenomenon where an allele of one variant is inherited together with alleles of other variants more often than expected by chance. The alleles of these other variants are highly correlated with the truly causal SNP and will show very similar GWAS signals. The majority of disease-related variants are located in cis-regulatory regions of the genome, and given our limited knowledge of non-coding genomic loci, it is even more challenging to discern causal SNPs from the noise in these regions. The difficulty of identifying the causal variants in complex diseases among a pool of statistically significant associated variants adds to the challenge of identifying molecular processes that could have a significant impact on the disease.
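Linkage disequilibrium between two biallelic variants is commonly summarised by the squared correlation r², which can be computed from haplotype frequencies as sketched below. The example assumes phased haplotypes coded 0/1 at each variant (real genotype data are usually unphased and need an estimation step), and the function name and toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic variants from phased haplotypes.

    haplotypes: list of (a, b) pairs, where a and b are 0 or 1 and give the
    allele carried at the first and second variant on the same chromosome.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both variants must be polymorphic")
    d = p_ab - p_a * p_b  # deviation from independent assortment
    return d * d / denominator


# Hypothetical haplotypes: perfectly correlated alleles vs. independent alleles.
perfect_ld = [(1, 1)] * 3 + [(0, 0)] * 5
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
print("perfect LD r^2:", ld_r_squared(perfect_ld))
print("no LD r^2:", ld_r_squared(no_ld))
```

When r² is close to 1, the two variants carry essentially the same information in the sample, so an association test cannot tell which of them, if either, is causal.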
Advanced machine learning prediction models trained on ALS genomic data could overcome the aforementioned challenges, moving towards better insights into disease causality and ultimately to a personalized understanding of ALS. In Figure 1, we describe the basic steps of an ALS machine learning experimental design for discovering novel ALS-associated loci or combinations of loci, as well as the main challenges of each step. Each of the main challenges is addressed in successive chapters of our review, as we describe and compare the experimental designs of the collected gene prioritization studies. Some of the challenges in Figure 1 have already been mentioned, such as the need for a large sample size to increase the power of the study, a comprehensive quality control pipeline to ensure high-quality genomic data, and the curse of dimensionality, which is a very common problem in genomic studies that include an extremely high number of features, especially in studies that focus on multilocus interactions.
Figure 1. The main challenges of an ALS machine learning experimental design.