3.1. Genome-Wide Association Studies (GWAS) and Polygenic Risk Scores (PRSs)
GWA studies are not hypothesis-driven, unlike candidate gene association studies, which are designed with specific questions in mind and interrogate particular genes or genomic loci implicated in specific molecular pathways or biological processes hypothesized to be involved. Nevertheless, GWAS proved less successful than originally expected in helping to pinpoint SLD susceptibility loci, partly owing to the heterogeneous dyslexia phenotype and the diagnostic/recruitment criteria used, and partly to the small samples analyzed compared to other neurodevelopmental/psychiatric phenotypes. Small sample sizes confer low detection power for common variants with small effect sizes, especially considering the stringent statistical correction for multiple testing over hundreds of thousands or millions of variants that needs to be taken into account. To compensate, genome-wide screening of the general population for DNA variants associated with reading, arithmetic and language abilities as heritable traits attracted intense research interest; these were viewed as “intermediate phenotypes”, or quantitative traits acting as endophenotypes, determined by a genetic background that potentially also underlies SLD etiology.
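To make the multiple-testing burden mentioned above concrete, the following minimal sketch computes a Bonferroni-corrected per-variant significance threshold. It assumes a family-wise error rate of 0.05 and treats every tested variant as an independent test; both are simplifying assumptions for illustration, and the function name is hypothetical rather than taken from any of the cited studies.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha.

    Assumes the tests are independent, which overstates the burden for
    variants in linkage disequilibrium.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Illustrative marker counts spanning "hundreds of thousands or millions".
for n_variants in (100_000, 300_000, 1_000_000):
    threshold = bonferroni_threshold(0.05, n_variants)
    print(f"{n_variants:>9} variants -> p < {threshold:.1e}")
```

With one million tests the threshold falls to 5e-8, which illustrates why small samples rarely detect common variants of small effect.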
Reading skill as a quantitative trait was explored for the first time by applying a GWAS approach using the extremes of its continuous distribution. Two groups, low versus high reading ability, comprising a total sample of 1500 children, were genotyped using a low-density SNP microarray (~100 k). Top candidate SNPs showing the largest allele frequency differences between extreme-ends groups were validated in an independent sample of 900 age-matched children. Of those, ten SNPs showed nominally significant association with continuous variation in reading ability
[117]. Since this seminal effort, a significant number of studies have been conducted, several of which focused on variants with pleiotropic effects on both reading and language traits
[103][102][105]. We believe that the most recent one deserves highlighting for two reasons. First, the authors studied reading disability predictors, namely RAN and rapid alternating stimulus, in a sample of more than 1300 Hispanic-American and African-American young individuals. Second, they found, for the first time in a GWAS design, genome-wide significance for a variant located in the upstream region of a long non-coding RNA (lncRNA) gene, namely RPL7P34, 30 kb upstream of RNLS (10q23.31). It was suggested that this variant resides in an enhancer element that potentially interacts with an active RNLS transcription start site in the hippocampus, owing to chromatin’s three-dimensional structure. The variant was further associated with structural variation (cortical volume) in the right inferior parietal lobule of an independent multi-ethnic sample
[118]. Currently, it remains largely unknown how non-coding regions of the genome may impact reading traits; the identification of variants in gene regulatory regions, as recently demonstrated for
ARHGEF39 in SLI
[119], or the role of post-transcriptional (e.g., miRNA-based) regulation of gene expression, is undoubtedly an exciting new field of research.
Turning to dyslexia, one of the first GWAS, albeit very small by current standards (200 cases for discovery and 186 for replication, tested for a limited number of markers (300 k)), identified rs4234898 on chromosome 4 as a trans-acting regulatory variant for
SLC2A3 which resides on chromosome 12.
SLC2A3 codes for a glucose transporter in neurons, and its reduced expression in lymphoblastoid cell lines was shown to be significantly associated with the minor rs4234898 allele. It was suggested that
SLC2A3 might act as a susceptibility gene for an electrophysiological endophenotype in dyslexic children with glucose transport deficits, namely mismatch negativity (MMN) or mismatch response. MMN serves as a measure of speech perception and automatic speech deviance detection, which has been found to be impaired in dyslexic children
[90]. This mismatch response endophenotype was later shown to associate with common variants in
DYX1C1 [120], unlike common variants in
DCDC2 and
KIAA0319 [121].
The largest GWAS for dyslexia-specific traits was recently published, with data generated for almost 3500 reading-impaired and typically developing children of European ancestry from nine countries speaking six different languages. Genome-wide significance was observed with RAN for four variants on 18q12.2, within
MIR924HG (rs17663182), and a suggestive association on 8q12.3 within
NKAIN3. It is of note that
MIR924 is predicted to regulate candidate dyslexia susceptibility genes like
MRPL19 and
KIAA0319L, as observed via in silico analysis of putative miR-924 binding sites
[111]. The same group performed a polygenic risk score (PRS) analysis between eight reading traits and different neuropsychiatric disorders (ADHD, ASD, major depressive disorder and schizophrenia), educational attainment, and neuroimaging phenotypes (seven brain areas) and found a significant genetic overlap between some of these reading traits and educational attainment and, to a lesser extent, with ADHD
[111]. This initiative led to an even larger dyslexia case-control GWAS of almost 2300 cases and 6300 controls, a subset of which overlapped with the same authors’ 2019 paper
[26]. No novel genome-wide significant associations emerged at single-marker level; gene-based analysis from the top SNP association signals revealed
VEPH1 (3q25) as a top candidate gene, but no specific pathways showed significant enrichment
[26].
Actually, the first study assessing the reading ability of non-dyslexic children and adolescents with the use of PRS analysis was published in 2017. The authors utilized GWAS data from >5800 cases and used educational attainment (years of education completed) to predict reading performance in English. They calculated a PRS-heritability estimate of reading ability of almost 5%, based only on common variants. This estimate represents approximately 7% of the total heritability for reading ability (h² = 70%; 5%/70%) evaluated through twin studies
However, when the PRS-heritability estimate is instead set against the SNP-heritability estimate, which was shown to account for 22% of the total genetic variance
[123], it explains a significant 23% (5%/22%) of the genetic variance observed for reading ability, an estimate that remained significant after accounting for age-specific cognitive ability and family socioeconomic status
[122].
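The two ratios quoted above can be checked with a few lines of arithmetic. The sketch below simply divides the PRS-based estimate by each reference value given in the text; the variable and function names are illustrative and do not come from the cited studies.

```python
def share(part: float, whole: float) -> float:
    """Fraction of `whole` that is accounted for by `part`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


prs_estimate = 0.05   # PRS-heritability estimate ("almost 5%")
twin_h2 = 0.70        # twin-based heritability of reading ability
snp_estimate = 0.22   # SNP-based figure quoted in the text

print(f"vs twin heritability: {share(prs_estimate, twin_h2):.1%}")      # 7.1%
print(f"vs SNP-based figure:  {share(prs_estimate, snp_estimate):.1%}")  # 22.7%
```

Rounded to whole percentages, these give the approximately 7% and 23% figures reported in the text.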
The use of PRSs is a rather young addition to the armory of (statistical) tools for evaluating the genetic component of complex traits, even more so for complex cognitive skills like reading performance; yet, we can already foresee its potential. Because DNA variants do not change with age, knowing the individual genetic differences in reading ability may prove useful in the early prediction of reading problems like dyslexia. This will require large multicenter initiatives with tens of thousands of participants. However, because language transparency is an important issue in assessing dyslexia, perhaps large GWAS with participants using the same language would be powerful enough to explore the applicability of PRS further, an approach already tested by Gialluisi et al. in their 2019 analysis
[111].
The first GWAS conducted exclusively to assess mathematical ability and disability was published ten years ago; two groups of children from the Twins Early Development Study, with high versus low mathematical ability (600 individuals per group), served as the discovery cohort, and 2356 individuals, spanning the entire distribution of mathematical ability, were used for validation purposes. Out of 10 top candidate SNPs, rs11225308 (
MMP7), rs363449 (
GRIK1), and rs17278234 (
DNAH5) were the variants most significantly associated with mathematical ability. Because the effect sizes of these 10 SNPs were small, the authors created an ‘SNP-set score’ for each of the 2356 individuals, which accounted for 2.9% of the variance in their sample
[68]. In fact, by using this SNP-set score, it was shown that the one third of children who harbored ≥50% of the identified risk alleles were nearly twice as likely to be in the lowest-performing 15% of the mathematical ability distribution
[68]. This score was later correlated with certain environmental factors, demonstrating likely gene × environment interactions
[124].
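As a rough illustration of how an SNP-set score of this kind can be built, the sketch below sums risk-allele counts across a set of SNPs and flags carriers of at least half of the possible risk alleles. It assumes an unweighted additive score with genotypes coded as 0, 1 or 2 risk alleles per SNP; the text does not specify the exact scoring scheme, and the genotype vector is hypothetical.

```python
from typing import Sequence


def snp_set_score(risk_allele_counts: Sequence[int]) -> int:
    """Unweighted score: total number of risk alleles carried across the SNP set."""
    for count in risk_allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("each genotype must be coded as 0, 1 or 2 risk alleles")
    return sum(risk_allele_counts)


def carries_at_least_half(risk_allele_counts: Sequence[int]) -> bool:
    """True if the individual carries >= 50% of all possible risk alleles.

    A diploid individual can carry at most 2 risk alleles per SNP, so half
    of the maximum equals the number of SNPs in the set.
    """
    return snp_set_score(risk_allele_counts) >= len(risk_allele_counts)


# Hypothetical individual genotyped at 10 SNPs (maximum of 20 risk alleles).
genotypes = [1, 0, 2, 1, 1, 0, 2, 1, 1, 1]
print(snp_set_score(genotypes))          # 10
print(carries_at_least_half(genotypes))  # True
```

A weighted variant, in which each count is multiplied by its estimated effect size before summing, follows the same pattern and is closer to how polygenic risk scores are usually computed.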
Subsequently, in a sample of almost 700 dyslexic cases and more than 1400 controls, available GWAS data were reanalyzed to associate genetic variation specifically with dyscalculia. The authors found rs133885 in
MYO18B to be strongly correlated with mathematical abilities in the dyslexia sample and, to a lesser extent, the general population. A significantly lower depth of the right intraparietal sulcus, an anatomical brain region involved in numerical processing in humans, was associated with rs133885
[114]. However, this association was not supported in the subsequent analysis of a much larger collection of 5144 individuals from four cohorts of European ancestry, 329 of which were diagnosed with dyslexia
[125]. A third GWAS aiming to explore the genetic contributions to mathematical ability was conducted in a general population sample of 602 adolescents/young adults with excellent verbal ability but either high or low mathematical ability. The marker with the largest effect size was rs789859, located in the promoter of
FAM43A and in high linkage disequilibrium with two SNPs in the adjacent
LSG1 gene (3q29), a region previously linked to learning difficulties and autism
[115]. Although the encoded protein’s function remains obscure,
FAM43A was found expressed in the brain, cerebellum and spinal cord
[115].
One GWAS was conducted in 2017 for the sole purpose of assessing mathematical ability in the general population of Chinese elementary school students. Two discovery groups and one replication group were used, totaling almost 1600 individuals. Meta-analysis of the samples revealed four linked SNPs in
SPOCK1 associated at a genome-wide significance level with a decrease in math scores in two examination periods
[116]. Interestingly, mutations in
SPOCK1, which encodes the extracellular proteoglycan testican-1, have been associated with ID and microcephaly in humans, whereas
Spock1 mouse models have demonstrated strong gene expression in the brain as well as a role for the gene in neurogenesis
[116].
By now, it has become clear that because GWAS are designed to target common variants, often in non-coding, regulatory or even intergenic regions, they do not necessarily reveal the true effect of likely pathogenic variants directly, as would be expected in the case of rare coding variants. In addition, initial genome-wide genotyping platforms were designed based on Caucasian genome frequencies, and most of what we currently know about reading and mathematical abilities and disabilities originates from studies of individuals of Caucasian ancestry, despite the fact that SLD affects populations globally and irrespective of language. Thus, we are largely unaware of the genetic architecture of SLD across populations and ethnic ancestries. GWAS, despite setting the grounds for unbiased genome-wide interrogations, have more often than not returned results that could hardly be replicated. This has been attributed to the small effect sizes of common variants, especially for quantitative traits such as reading-associated traits, to sample sizes too small to reveal statistically robust associations, or even to a lack of consensus in SLD diagnosis. Hence, alternative yet complementary methods, such as those described in the next paragraphs, have contributed significantly to the delineation of the genetic architecture of SLD in recent years.
3.2. Copy-Number Variants (CNVs)
Part of the missing heritability of SLD may also be caused by structural variants. CNVs have been extensively explored in other neurodevelopmental disorders, such as ASD, ID
[126][127][128], Tourette Syndrome
[129][130], and SLI
[131]; results for SLD have been inconclusive. On the one hand, recent analyses of dyslexia cohorts indicate that rare, large CNVs may not confer a significant burden
[126][132]. On the other hand, rare de novo or inherited deletions or duplications, such as the Xq21.3 region bearing
PCDH11X [104], 17q21.31 harboring
NSF [106], and 15q11.2(BP1-BP2) harboring four highly conserved genes
[43][44], have been reported in cases with SLD. Earlier, a father and his three affected sons were found to carry a submicroscopic deletion (at least ~176 kb) on 21q22.3, encompassing the 3′ region of
PCNT, genes
DIP2A and
S100B and the 5′ upstream sequence of
PRMT2. The deletion segregated perfectly with dyslexia and with standard scores for phonological decoding and single-word reading below −1.5 to −2 standard deviations
[65]. As described later (
Section 3.3), a non-coding variant in
S100B was also associated with spelling performance in a German family set
[96].
Different loci have been found to harbor deletions and duplications in patients with various clinical presentations and comorbid math comprehension difficulties. Children with the 22q11.2 deletion syndrome show considerable difficulties in procedural calculation and word problem solving due to difficulties in understanding and representing numerical quantities, despite relatively normal reading performance
[133]. A 22q11.2 deletion spanning the LCR22-4 to LCR22-5 interval was found in an 11-year-old girl with normal intelligence, number sense deficit, normal results in spelling and reading tests and social contact difficulties
[134]. A severely affected girl with X-linked myotubular myopathy and math difficulties was found to carry an inherited 661 kb Xq28 microduplication with a skewed X chromosome inactivation pattern
[135]. If we exclude syndromic cases, reports on individuals presenting exclusively with mathematical impairments who bear rare or novel de novo or inherited CNVs are truly scarce. An increase of CNVs of the Olduvai protein domain on 1q21 (
NBPF15), previously known as DUF1220, appears to be involved in human brain size and evolution and may determine the mathematical aptitude of both sexes
[136]. This genetic locus is highly expressed in brain regions with high cognitive function
[137], but it has not been studied in the context of mathematical disabilities.
Last but not least, a recent study from the Icelandic population investigated the effect of the 15q11.2(BP1-BP2) deletion on cognitive, structural and functional correlates of dyslexia and mathematical disabilities. This CNV was previously associated with cognitive deficits in non-neuropsychiatric cases with a history of SLD
[43]. Later, Ulfarsson et al. showed that the deletion conferred high risk for either dyslexia or dyscalculia, but the risk was even higher for the combined dyslexia plus dyscalculia phenotype; all deletion carriers performed worse on a battery of tests assessing reading and mathematical abilities. In the same sample, structural magnetic resonance imaging (sMRI) and functional MRI (fMRI) were performed, demonstrating that a smaller left fusiform gyrus and altered activation in the left fusiform and left angular gyri were also associated with the 15q11.2 deletion
[44]. These brain areas are involved in the retrieval of mathematical facts, the usage of learned facts and the performance of arithmetic operations
[138][139][140]. This anatomical and functional brain differentiation could be one cause of the greater risk observed for the combined phenotype in deletion carriers.
Either de novo or transmitted, these structural variations may produce a yet unknown spectrum of disturbances at the genomic, transcriptomic and proteomic levels, for instance haploinsufficiency in the case of deletion or overexpression in the case of duplication
[141][142], consequently also affecting subsequent protein-protein interactions; these are hypotheses that warrant further investigation. Interestingly, the 15q11.2(BP1-BP2) duplication carriers do not show significant cognitive impairments, compared to 15q11.2(BP1-BP2) deletion carriers, and are comparable to no-CNV controls
[44]. This observation supports the role of haploinsufficiency for the genes mapped to this region, particularly
CYFIP1, which was shown to be involved in neuronal development
[143].
3.3. Next-Generation Sequencing
It is unclear how much of the missing heritability of SLD could be attributed to rare or de novo variants of moderate or high effect, even though this issue has been extensively studied with respect to ID, ASD and developmental delay
[144][145][146]. With the emergence of NGS technology, the identification of rare variants could help fill in some of the missing pieces of the puzzle. Sequencing data have only recently begun to emerge for SLD, supporting the influence of certain genomic regions on reading performance and related disabilities. As expected, the first efforts and resources were concentrated on the validation of previously established or suspected dyslexia genes in various populations.
Originally mapped through a submicroscopic deletion on 21q22.3 in a dyslexia family
[65],
S100B was one of 11 genes to be scrutinized for rare variants using targeted NGS in more than 900 dyslexia cases from Finland and Germany; a 3′ UTR variant (rs9722), located on or adjacent to in silico predicted miRNA target sites, was associated with spelling performance in the German family set. Moreover, a nonsynonymous variant in
DCDC2 (rs2274305) was associated with severe spelling deficiency in the same sample set
[96]. A similar approach was applied to a subsequent next-generation targeted sequencing effort by Adams et al., who selected dyslexia-associated candidate genes to be screened in 96 affected, unrelated subjects of European ancestry from the Colorado Learning Disability Research Center (CLDRC). These cases were selected based on a CLDRC-derived discriminant score indicating impairment in reading ability
[108]. The authors searched for rare, likely disruptive variants and calculated a statistically significant increase in the frequency of observed mutations in dyslexia cases, compared to data from the 1000 Genomes Project, in two loci: 7q32.1 harboring the adjacent genes
CCDC136 and
FLNC (19 missense variants) and 6p22 harboring
DCDC2 and
KIAA0319 (74 missense variants). The data indicate that these regions must have an influence on reading performance, even though not all of the above-mentioned genes show detectable expression in the brain
[108].
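A comparison of variant frequencies between cases and a reference panel, as described above, is often framed as a test on a 2×2 table of allele counts. The sketch below implements a one-sided Fisher's exact test using only the standard library. The source does not state which test the authors used, so this is offered purely as an illustration of the general idea, and all counts shown are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: cases / reference panel. Columns: variant / non-variant alleles.
    Returns the probability, under fixed margins, of observing at least `a`
    variant alleles among the cases.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    case_total = a + b
    variant_total = a + c
    grand_total = a + b + c + d
    denominator = comb(grand_total, variant_total)
    upper = min(case_total, variant_total)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    numerator = sum(
        comb(case_total, k) * comb(grand_total - case_total, variant_total - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


# Hypothetical allele counts: 8 variant alleles among 192 case chromosomes
# versus 40 among 5000 reference chromosomes.
p_value = fisher_exact_greater(8, 184, 40, 4960)
print(f"one-sided p = {p_value:.3g}")
```

In practice, rare-variant analyses usually aggregate variants per gene or region and adjust for ancestry and sequencing coverage, which this sketch deliberately leaves out.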
The first whole-exome sequencing (WES) study was published in 2015 by Einarsdottir et al. in an effort to identify the genetic basis of a familial form of dyslexia with likely complete penetrance in an extended three-generation pedigree with 12 confirmed dyslexic and four uncertain cases. Through several filtering steps on WES data, a small heterozygous indel variant was identified in
CEP63, namely c.686–687delGCinsTT; its transmission was compatible with autosomal dominant inheritance. This rare variant codes for a non-synonymous change in a highly evolutionarily conserved amino acid (p.R229L), which was in silico predicted to alter the protein’s tertiary structure
[107]. As discussed later (
Section 6), CEP63 is a centrosomal protein involved in microtubule organization and, even though it is ubiquitously expressed, brain-specific isoforms may be affected by such rare variants. It remains to be seen whether
CEP63 variants are linked to dyslexia in additional cases.
Several other reports have also demonstrated that dyslexia-associated genes encode proteins with structural and functional roles in cilia
[147][148][149][150][151][152][153]. Recently, rare variants were identified in two genes related to motile cilia structure and function, namely dynein axonemal heavy chain 5 (
DNAH5) and dynein axonemal heavy chain 11 (
DNAH11). This represents the first whole-genome sequencing (WGS) analysis in the literature of two unrelated dyslexia cases with situs inversus and ADHD symptomatology
[154]. Even though direct links between visceral and functional brain asymmetry are lacking, visceral asymmetry (e.g., situs inversus) is comorbid, at least in some cases, with psychiatric and neurodevelopmental disorders
[155]. Although it could not be proven unequivocally that the identified variants in
DNAH5 and
DNAH11 cause susceptibility to dyslexia, these two genes represent good candidates for further studies.
Overall, the most recent studies that have used state-of-the-art methodology to look for either likely pathogenic CNVs or rare variants in isolated families have provided clues for the implication of novel genes. Family-based studies continue to be a powerful method to unravel the genetic basis of dyslexia
[107]. However, variations in the reported loci so far explain only a small percentage of the genetic component of SLD. Consequently, much of the heritability of learning-related disorders remains unaccounted for. Perhaps the answer is not “hiding” exclusively in single, rare variants that are yet to be identified, but also in gene × gene and higher-order chromatin interactions or epigenetic regulatory mechanisms and the ways in which the environment can shape the (epi)genome
[156]. It is of note that epigenome-wide association studies have not been reported yet.