Dissecting Polygenic Etiology of Ischemic Stroke: Comparison
Please note this is a comparison between Version 3 by Jiang Li and Version 2 by Jason Zhu.

Ischemic stroke (IS), a leading cause of death and disability worldwide, is caused by many modifiable and non-modifiable risk factors. This complex disease has multiple etiologies and moderate heritability. Polygenic risk scores (PRSs), which have been used to establish a common genetic basis for IS, may contribute to IS risk stratification for disease/outcome prediction and personalized management. Statistical modeling and machine learning algorithms have contributed significantly to this field: multiple algorithms have been successfully applied to PRS construction and to the integration of genetic and non-genetic features for outcome prediction, which aids risk stratification for personalized management and prevention. A PRS built from variants whose effect sizes are estimated from the summary statistics of a specific subtype shows a stronger association with the matched subtype. Disruption of the extracellular matrix and amyloidosis account for the pathogenesis of cerebral small vessel disease (CSVD). Pathway-specific PRS analyses confirm known etiologies of IS and identify novel ones.

  • genome-wide association study
  • ischemic stroke
  • stroke subtypes
  • cerebral small vessel disease
  • polygenic risk score
  • machine learning
  • electronic health records
  • gene ontology
  • least absolute shrinkage and selection operator (LASSO)
  • survival analysis

1. Polygenic Nature of Ischemic Stroke

Ischemic stroke (IS) is a highly complex and heterogeneous disorder caused by multiple etiologies with moderate heritability. Monogenic forms of IS are rare. Some studies report that common genetic variation explains 30% to 40% of the phenotypic variability [1]. All main classification methods stratify IS into five major categories: large artery atherosclerosis (LAS), cardiac embolism (CES), small artery occlusion (SVS), uncommon causes, and undetermined causes [2]. The focus of this article is to dissect the etiology of IS through pathway analyses and to highlight how statistical methods and machine learning algorithms have contributed to the integration of genetic information into risk models.

 

2. Pioneer Studies on Monogenic Diseases

Genetic studies contribute significantly to our understanding of the causality of IS and its subtypes. Building on previous linkage studies, several distinct single-gene variants have been discovered among patients with lacunar stroke and CSVD. CSVD is a common cause of stroke and cognitive impairment in the elderly and affects the small vessels of the brain, including small arteries, arterioles, capillaries, and small veins. So-called monogenic cerebrovascular diseases include: (1) cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), which is the most prevalent monogenic CSVD and is caused by a cysteine-altering mutation in one of the 34 epidermal growth factor-like repeat (EGFr) domains of the NOTCH3 gene at 19p13 [3,4,5]; (2) cerebral autosomal recessive arteriopathy with subcortical infarcts and leukoencephalopathy (CARASIL), which is caused by missense mutations in HTRA1, a gene at 10q26.13 that encodes a serine protease [6]; (3) Fabry disease (FD), a rare X-linked inborn error of glycosphingolipid metabolism in which reduced production of lysosomal α-galactosidase A (α-Gal A) leads to the accumulation of glycosphingolipids [7] in various cellular compartments, causing structural damage and cellular dysfunction and triggering a secondary inflammatory response that results in progressive organ dysfunction [8]; (4) retinal vasculopathy with cerebral leukodystrophy, an autosomal dominant disorder caused by C-terminal frameshift mutations in the Three Prime Repair Exonuclease 1 (TREX1) gene located at 3p21.31 [9]; (5) COL4A1/COL4A2-related angiopathies: COL4A1 and COL4A2, located at 13q34, encode type IV collagen, the most abundant protein in the basement membrane of all tissues, including the cerebral vasculature; type IV collagen helps the basement membrane interact with other cells and plays a role in cell migration, proliferation, differentiation, and survival; and (6) hereditary cerebral amyloid angiopathy (CAA), characterized by cerebrovascular amyloid deposition, mainly observed in leptomeningeal and cortical vessels; it can be classified by the accumulated amyloid protein, such as amyloid β (APP), cystatin C (CST3), integral membrane protein 2B (ITM2B), prion protein, transthyretin (TTR), and others [10].
Understanding the genetics of monogenic CSVD and lacunar stroke [11] can lead to precise diagnosis and prognosis, aid in the development of targeted treatment plans, and ultimately improve phenotype definition. Monogenic diseases are rare, and the causal variants have a minor allele frequency (MAF) of less than 0.005 (ultra-rare) in the stroke population. Sporadic IS, which dominates the disease population, cannot be explained by these rare inherited variants, although genome-wide association studies (GWAS) have had some success in identifying common risk loci in the same genes (e.g., COL4A2 and HTRA1) [11,12,13,14,15].

3. Low-Frequency Variants Explain More Phenotypic Variation

Previously identified IS risk loci with genome-wide significant association are enriched with low-frequency variants [31]. Partitioning SNPs by MAF can provide deep insight into the mechanisms of heritability. If a genetic variant is associated with fitness, selection drives one allele to low frequency [32]. This holds even for traits without any obvious connection to fitness. The functional architecture of low-frequency variants (0.5% < MAF < 5%) highlights the strength of negative selection across coding and non-coding variants; this effect is also evident for many cardiometabolic traits [33]. Low-frequency variants bridge the gap between rare variants with putatively larger effect sizes and common variants with smaller effect sizes. Because loci for cardiovascular diseases are significantly enriched among loci that natural selection has linked to lifetime reproductive success [34], and because identified IS subtype-specific loci are more likely to have a low MAF [24,31], we propose that genetic variants with lower MAF may contribute more to the phenotypic variation in IS. When we partitioned the variants by MAF threshold (≤ 0.01, ≤ 0.05, ≤ 0.1, ≤ 0.2, or all variants), PRSLAS, PRSCES, and PRSSVS derived from low-frequency common variants (0.01 < MAF < 0.05) provided the best model fit for our IS cohort, suggesting that low-frequency common variants, taken together, could contribute more to the risk for matched IS subtypes.
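The MAF partitioning described above can be illustrated with a minimal Python sketch. It assumes the common additive form of a PRS (a sum of risk-allele dosages weighted by GWAS effect sizes) restricted to variants inside a MAF window; the function name `prs_in_maf_bin`, the variant tuples, and all numbers are hypothetical and are not taken from the cited studies.

```python
from typing import List, Tuple

# Hypothetical variant record: (GWAS effect size beta, minor allele frequency).
Variant = Tuple[float, float]


def prs_in_maf_bin(dosages: List[float], variants: List[Variant],
                   maf_low: float, maf_high: float) -> float:
    """Additive PRS over variants with maf_low < MAF <= maf_high.

    dosages[i] is the number of risk alleles (0, 1, or 2) a person carries
    at variants[i]; the score is the beta-weighted sum of those dosages.
    """
    score = 0.0
    for dosage, (beta, maf) in zip(dosages, variants):
        if maf_low < maf <= maf_high:
            score += beta * dosage
    return score


# Toy data (illustrative values only).
variants = [(0.30, 0.005), (0.20, 0.03), (0.10, 0.04), (0.05, 0.25)]
dosages = [1, 2, 0, 1]

# Score restricted to low-frequency common variants (0.01 < MAF <= 0.05).
low_freq_score = prs_in_maf_bin(dosages, variants, 0.01, 0.05)
# Score using every variant regardless of MAF.
all_score = prs_in_maf_bin(dosages, variants, 0.0, 0.5)
print(low_freq_score, all_score)
```

Scores computed per MAF window in this way can then be compared by how well each one fits the outcome in a regression model.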

4. Polygenic Risk Scores (PRSs) Augment IS Subtyping

PRSs derived from stroke subtypes may augment the predictive power for patients with a similar etiology. PRSs for atrial fibrillation can significantly explain cardioembolic stroke (CES) risk, independent of other clinical risk factors [46].
We previously showed that PRSLAS, PRSCES, and PRSSVS, which were constructed from variants with effect sizes estimated according to the MEGASTROKE IS subtypes (LAS, CES, or SVS), explained the most variance of the corresponding IS subtype, as judged by significance level and Nagelkerke pseudo-R2 using variants from the base file with p < 0.1. To determine the robustness of these subtype-specific PRSs, a synthesized group (ASL) with more LAS cases (n = 120) than SVS cases (n = 70) was created. We observed that the predictive power (R2) and significance were highest using PRSSVS, suggesting that there is no clear boundary between LAS and SVS. However, PRSCES differentiated LAS from CES and SVS from CES, suggesting that CES has a unique polygenic architecture that separates it from the other subtypes. Furthermore, none of the PRSs could significantly explain the phenotypic variation of our ‘Undetermined’ subtype. In summary, some clinical IS subtypes may have distinct or shared polygenic architecture. The effect sizes of low-frequency variants estimated from the summary statistics of GWAS on clinical subtypes contribute more to the polygenic inheritance of the matched subtype.
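For readers unfamiliar with the Nagelkerke pseudo-R2 mentioned above, the following Python sketch shows how it can be computed for a single PRS predictor. It is a simplified illustration: the logistic model has only an intercept and one predictor (no covariates such as age, sex, or principal components), it is fitted with a hand-written Newton-Raphson loop, and the data are invented.

```python
import math
from typing import List, Tuple


def fit_logistic(x: List[float], y: List[int], iters: int = 25) -> Tuple[float, float]:
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix X'WX.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(X'WX) @ gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def nagelkerke_r2(x: List[float], y: List[int]) -> float:
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    n = len(y)
    ybar = sum(y) / n
    # Log-likelihood of the intercept-only (null) model.
    ll0 = n * (ybar * math.log(ybar) + (1 - ybar) * math.log(1 - ybar))
    b0, b1 = fit_logistic(x, y)
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    ll1 = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p))
    cox_snell = 1.0 - math.exp(2.0 * (ll0 - ll1) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll0 / n)
    return cox_snell / max_r2


# Toy standardized PRS values and case (1) / control (0) labels.
prs = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5]
status = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
slope = fit_logistic(prs, status)[1]
r2 = nagelkerke_r2(prs, status)
print(round(slope, 3), round(r2, 3))
```

In practice a statistics package would fit the model and the PRS contribution would usually be reported as the gain in pseudo-R2 over a covariate-only model; the hand-written fit here only keeps the example dependency-free.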

5. A Modified Paradigm of IS Risk Stratification beyond TOAST Subtyping

The primary goal of diagnostic stroke evaluation is to identify the underlying etiology so that targeted treatments can be designed and implemented to prevent recurrence [2]. Several classification systems stratify stroke etiologies into discrete clinical, radiographic, and prognostic categories. Despite a decade of GWAS on IS and its subtypes, genetic evidence is currently considered only under certain circumstances: prothrombotic abnormalities should be considered as a cause of stroke exclusively in patients with a history of unexplained thromboembolic events or in young stroke patients who have no other explanation for their stroke [107,108,109]. There is an unmet need to classify strokes with multiple potential mechanisms into specific etiologic classes when evidence-based strategies (such as risk factors, family history, and medication) are lacking, and to better quantify multiple competing causes in a given patient [110,111]. How genetic information from GWAS contributes to this etiologic classification, and whether it may assist in identifying the etiology of strokes of unknown origin (cryptogenic strokes), is still unclear. Mechanism-targeted treatments are not available for cryptogenic strokes, which represent 25% to 30% of IS, and this increases the likelihood of recurrent events. The quality of etiologic classification depends on the ability to generate homogeneous subtypes with discrete outcomes (discriminative validity) and on the clarity of the classification rules, which ensures utility in different settings with different investigators (reliability) [2]. It is necessary to further categorize IS into more homogeneous groups stratified by risk factors, including PRS, and to refine the current diagnostic system for subtyping.
Whether PRS can augment the newer clinical classification systems (e.g., ASCO and CCS) should be determined, as these newer schemes may better stratify the stroke etiology, at least in some patients.

6. Improved Predictability of Pathway-Specific PRS for Post-IS Mortality Using an Integrated Cox Proportional Hazards Model

Improved predictability can be achieved by better interrogating the data and by using methodologies that are carefully aligned with the data characteristics. Owing to the hierarchical nature of Gene Ontology (GO) biological process terms, multicollinearity among PRSs is common. There are also extensive correlations within and between PRSs and clinical risk factors. All these factors can inflate the regression coefficients of predictive variables in a multivariate regression model. L1 penalization, i.e., the least absolute shrinkage and selection operator (LASSO) [118], handles this situation by forcing some regression coefficient estimates to be exactly zero, thus achieving variable selection, while shrinking the remaining coefficients toward zero to avoid the overfitting and overestimation caused by data-driven model selection. LASSO was applied within a multivariate Cox proportional hazards (Coxph) model for feature selection of prognostic pathway-specific PRSs [95]. A prediction model including an additional 16 disease-associated pathway-specific PRSs outperformed the base model (8 clinical risk factors), as demonstrated by a higher concordance index in the holdout sample (0.754, 95% CI: 0.693–0.814 versus 0.729, 95% CI: 0.676–0.782; p < 0.001 for the median improvement). Compared to the base model, the integrated PRS prediction model differentiated not only the high-risk from the intermediate-risk group (p = 0.006) but also the intermediate-risk from the low-risk group (p = 0.001). The PRS derived from the GO term for negative regulation of the endothelial apoptotic pathway was an independent predictor of 3-year post-IS mortality (HR = 1.203) [95].
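The concordance index used above to compare the two models can be sketched in a few lines of Python. This is a minimal illustration of Harrell's C for right-censored data, not the procedure used in the cited study [95]; the function name and the toy follow-up data are hypothetical, and tied event times are simply skipped.

```python
from typing import List


def concordance_index(times: List[float], events: List[int], risk: List[float]) -> float:
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had the event (events[i] == 1)
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when i also has the higher predicted risk; ties in risk
    count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy holdout sample: follow-up in months, death indicator, model risk score.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk = [2.1, 1.5, 1.7, 0.4, 0.2]

c_index = concordance_index(times, events, risk)
print(c_index)  # 7 of the 8 comparable pairs are concordant -> 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a rise from 0.729 to 0.754 is read as a modest gain in discrimination.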