Histology-Based Detection of Microsatellite Instability: Comparison

Microsatellite instability (MSI) is a molecular marker of deficient DNA mismatch repair (dMMR) that is found in approximately 15% of colorectal cancer (CRC) patients.

  • colorectal cancer
  • microsatellite instability
  • DNA mismatch repair
  • tumor immunology

1. Introduction

Colorectal cancer (CRC) is the third most common and second most deadly cancer worldwide, causing an estimated 880,000 deaths in 2018 [1]. Mortality rates for CRC have been declining in many countries due to improved screening efforts and therapeutic advances [2], but both the incidence and mortality of CRC have been increasing in patients under the age of 50 in high-income countries [3]. CRC is a heterogeneous group of diseases (subtypes) with differences in epidemiology, anatomy, histology, genomics, transcriptomics, and host immune response [4][5][6][7]. This heterogeneity leads to disparate clinical presentation, survival, and response to therapy [8][9][10][11].

One of the clinically relevant subtypes of CRC is DNA mismatch repair deficient (dMMR) CRC. dMMR occurs due to pathogenic alterations in genes involved in the MMR system (MLH1, MSH2/EPCAM, MSH6, and PMS2) [12][13][14]. Several mechanisms can lead to dMMR, the most common being somatic hypermethylation of the MLH1 gene promoter [14]. In patients with Lynch Syndrome, who carry a germline pathogenic mutation in one of the MMR genes, an additional somatic alteration can occur (“second hit”), leading to the phenotype of dMMR tumors. Sporadic bi-allelic somatic mutations in the MMR genes can also occur [13]. Deficiencies in MMR cause high rates of mutations throughout the DNA, especially in microsatellites, regions of DNA in which short sequences of nucleotides are repeated in tandem [12]. Thus, dMMR results in microsatellite instability (MSI), which is a highly sensitive marker of dMMR [13].

MSI has diagnostic, prognostic, and therapeutic implications in CRC and other cancers. The detection of dMMR/MSI is recommended as a screening test for Lynch syndrome for every case of CRC [14]. Confirmation requires germline testing and can inform surveillance and treatment decisions for both the patient and their relatives [15]. CRC patients with MSI generally have a better prognosis [13][16][17], which may be explained by a robust host immune response to tumor neoantigens [10][11][18]. Lastly, MSI status can inform treatment decisions, as patients with MSI tumors may be eligible for immune checkpoint inhibitor (ICI) therapy [19], and the benefit of fluorouracil-based chemotherapy regimens for tumors with MSI has been questioned [20][21][22][23].

Current testing for dMMR/MSI requires either an immunohistochemical analysis of MMR protein expression or a PCR-based assay of microsatellite markers [14]. While guidelines set forth by multiple professional societies recommend universal testing for dMMR/MSI [24], these methods require additional resources and are not available at all medical facilities, so many CRC patients are not currently tested [25]. Recently, artificial intelligence has been evaluated as a method to predict MSI directly from hematoxylin and eosin (H&E) stained slides (Figure 1). If successful, this approach could have significant benefits, including reducing cost and resource-utilization and increasing the proportion of CRC patients that are tested for MSI.

Figure 1. Detection of microsatellite instability (MSI) or mismatch repair (MMR) deficiency is performed by (A1) Immunohistochemistry of the mismatch repair proteins or (A2) PCR amplification of consensus microsatellite repeats that are analyzed with capillary electrophoresis. Inference of MSI/MMR status from next generation sequencing (NGS) is not presented. (B) MSI/MMR status can be predicted from hematoxylin and eosin (H&E) stained slides, without requiring molecular analyses (see Figure 2). Detection of MSI/dMMR has implications for Lynch Syndrome screening and determining eligibility for immune checkpoint blockade in advanced disease. MSS: microsatellite stable. MSI-H: high microsatellite instability. pMMR: proficient mismatch repair. dMMR: deficient mismatch repair.

2. Histological and Clinical Predictors of Microsatellite Instability

With the significant cost and non-universal availability of the molecular testing required to determine MMR/MSI status, studies have sought to predict MSI based on routinely available data, such as clinical information and histopathology [26]. CRC tumors with MSI are associated with certain histological features, detectable via standard H&E staining, and clinical data, such as patient age and tumor location [26][27][28]. Similar observations have been made in other tumors enriched for MSI, such as endometrial cancer [29]. These associations may present a means of identifying the tumors most likely to have the dMMR phenotype, and therefore the patients most likely to benefit from additional testing. They may also help to identify those at low risk who would be less likely to benefit. The targeted deployment of MMR/MSI testing could reduce costs and save resources [26]. Inferring MSI status may be considered in settings where MSI testing is not performed but is unlikely to be adopted in resource-rich settings unless the prediction accuracy is near-perfect.

Several clinicopathologic predictors of MSI have been discovered and several groups have proposed models for MSI prediction (Table 1). Histological features such as signet ring cells, mucinous or medullary morphology, and poor differentiation are significantly associated with MSI status, but show poor sensitivity for MSI prediction on their own [27][30]. Correlations between MSI and immunological features of tumor pathology, such as measurements of tumor infiltrating lymphocytes (TILs) [11][26][28][31] and specific histological structures such as the Crohn’s-like lymphoid reaction (CLR), are well established in the literature [18][26][28]. CLR represents CRC-specific tertiary lymphoid aggregates [18]. The host response to MSI tumors is attributed to the high tumor mutational burden (TMB) and the abundance of immunogenic mutations, including insertion-deletion mutations, but other factors may contribute [32][33][34]. The Revised Bethesda Guidelines for MSI testing in CRC suggested testing tumors with “MSI histology” in patients younger than 60 years of age [35]. MSI histology was defined as the presence of TILs, CLR, mucinous/signet-ring differentiation, or medullary growth pattern. One of the histopathological features most strongly associated with MSI is the density of TILs [26][27][30]. When TIL density was assessed as a potential predictor of MSI, the area under the receiver operating characteristic curve (AUC) was 0.73. With a cutoff value of 40 lymphocytes/0.94 mm2, MSI status could be predicted with a sensitivity of 75% and a specificity of 67% [30]. However, given that TIL density can vary across tumor area, this study using surgical specimens likely yielded a greater AUC than would be achieved with smaller biopsy specimens, such as those typically available from sites of metastasis.
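As a toy illustration of how such a single-feature cutoff classifier is evaluated, the sketch below computes sensitivity and specificity for a TIL-density threshold. The 40 lymphocytes/0.94 mm2 cutoff follows the cited study; the sample data and function names are invented for illustration.

```python
# Hypothetical sketch: evaluating a TIL-density cutoff as an MSI classifier.
# The cutoff mirrors the cited study's design; the sample data are invented.

def confusion_at_cutoff(til_densities, msi_labels, cutoff=40):
    """Classify a tumor as predicted-MSI when TIL density >= cutoff,
    then compute sensitivity and specificity against the true labels."""
    tp = fp = tn = fn = 0
    for density, is_msi in zip(til_densities, msi_labels):
        predicted_msi = density >= cutoff
        if predicted_msi and is_msi:
            tp += 1
        elif predicted_msi and not is_msi:
            fp += 1
        elif not predicted_msi and is_msi:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Invented example data: (TIL density per 0.94 mm^2, true MSI status).
densities = [10, 55, 42, 8, 70, 35, 90, 20]
labels = [False, True, True, False, True, True, True, False]
sens, spec = confusion_at_cutoff(densities, labels)
```

Sweeping the cutoff and plotting sensitivity against 1 − specificity at each value would trace the ROC curve whose area (AUC) is reported in the study.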

Table 1. Histological predictors of microsatellite instability.

All CNNs in the study by Kather et al. had been pretrained on the ImageNet database, and only the last ten layers of the CNNs were trainable. After assessing the performance of five CNNs in differentiating tumor tissue from healthy tissue, ResNet-18 (a ResNet with 18 layers) was selected for further evaluation based on its strong performance and smaller number of parameters. The advantage of a CNN with fewer parameters is a decreased risk of overfitting the data and an increased likelihood of maintaining performance when applied to a validation cohort. ResNet-18 was trained with two sets of CRC slides (fresh frozen and FFPE) and one gastric cancer dataset (FFPE) from TCGA (Table 2). Tumor tissue was divided into smaller tiles, each of which was separately analyzed and assigned a predicted MSI score. The predicted MSI status for each slide was determined by the predicted MSI status of the majority of its constituent tiles.
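The tile-to-slide aggregation described above can be sketched as a simple majority vote over binarized tile predictions. The tile probabilities and the 0.5 threshold below are invented stand-ins for the CNN's actual outputs.

```python
# Minimal sketch of slide-level MSI prediction by majority vote over tiles.
# Tile scores and the binarization threshold are invented for illustration.

def slide_msi_by_majority(tile_scores, threshold=0.5):
    """Return True (predicted MSI) if most tiles score above the threshold."""
    msi_votes = sum(score > threshold for score in tile_scores)
    return msi_votes > len(tile_scores) / 2

tile_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.6]  # per-tile MSI probabilities
is_msi = slide_msi_by_majority(tile_scores)   # 4 of 6 tiles vote MSI
```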

Table 2. Deep learning for prediction of MSI from digital pathology.

Multiple histological and clinical variables have been incorporated into algorithms designed to predict MSI status. The MsPath score was developed to predict MSI in patients under the age of 60 [27]. Using a scoring system incorporating age, anatomical site of the primary tumor, histologic type, tumor grade, and the presence or absence of TILs and CLR, an AUC of 0.89 was achieved when the model was tested against a separate validation cohort (Table 1). Validation of the MsPath score in a population-based cohort showed that its accuracy was insufficient for the selection of patients for Lynch Syndrome germline testing, misclassifying 18% (2/11) of patients with a pathogenic mutation in MLH1/MSH2 [39]. Another scoring scheme by Greenson et al. incorporated similar variables but included lack of dirty necrosis in the model and was derived from a population that included patients of all ages [26]. The features associated with MSI all had a negative predictive value >90%. This model yielded an AUC of 0.85 based on the study cohort alone (no validation cohort was tested) (Table 1). Over half of the tumors analyzed had less than a 5% chance of harboring MSI, presenting the potential for significant cost savings [26]. In another cohort, the model by Greenson et al. detected 93% of tumors with MSI and outperformed MsPath [40].
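The general shape of these clinicopathologic scoring models can be sketched as a weighted sum of points for features present. The features below follow the text, but the point values are invented, not the published MsPath or Greenson weights.

```python
# Illustrative sketch of a clinicopathologic MSI risk score: weighted points
# for MSI-associated features, summed into a risk score. Feature names follow
# the text; the weights are invented, not published model coefficients.

def msi_risk_score(features):
    """Sum invented point values for each MSI-associated feature present."""
    weights = {
        "age_under_60": 1.0,
        "right_sided": 1.5,
        "mucinous": 2.0,
        "tils_present": 2.5,
        "clr_present": 1.5,
    }
    return sum(weights[name] for name, present in features.items() if present)

score = msi_risk_score({
    "age_under_60": True,
    "right_sided": True,
    "mucinous": False,
    "tils_present": True,
    "clr_present": False,
})
```

In a real model the weights come from a regression fit, and a cutoff on the score selects patients for confirmatory dMMR/MSI testing.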

The PREDICT score was developed to improve on MsPath and other models [36]. It included variables significantly associated with MSI in a multivariable regression model: age <50, right-sided location, TILs, a peritumoral lymphocytic reaction, any mucinous component, and increased stromal plasma cells [36]. PREDICT reported a sensitivity of 97% for the detection of MSI, with an AUC of 0.924 in the validation cohort (Table 1). The RERtest6 model was developed to maximize the negative predictive value and included tumor location, growth pattern, solid and mucinous patterns, TILs, and CLR [38]. The model had an accuracy of 92% in the global cohort and a negative predictive value of 97.9% (Table 1). The prevalence of MSI in this study was 8.5%. If this model were applied as screening for MSI in this study population, only 10% of patients would need confirmatory testing [38].

Another large study of MSI prediction from commonly available clinicopathologic data included over three thousand patients over 50 years of age in Japan [37]. Female sex, proximal location, tumor size larger than 60 mm, mucinous component, and BRAF mutation were associated with MSI and were included in a composite score used for prediction. CLR and TILs were not evaluated. In the validation cohort, the AUC was 0.856. Patients with MLH1 promoter hypermethylation had higher scores than patients with Lynch Syndrome, as a result of the known association between BRAF mutations and MLH1 hypermethylation and the high score given to BRAF mutations in the model. Overall, the performance of the model was disappointing, with approximately 25% of MSI tumors misclassified at the proposed threshold [37].

The encouraging performance of certain histology-based prediction models has not been sufficient to supersede universal testing for MSI/dMMR. Measurement of the variables for MSI prediction requires significant effort and expertise by pathologists, and inter-rater differences may affect the perceived reliability of histology-based scoring systems [41][42]. However, this work is fundamental to the premise that MSI can be predicted from histology, which has now been proposed as a task for deep learning from digital pathology [43] (Figure 1).

3. Predicting MSI Status with Deep Learning

Recently, several studies have investigated the potential for CNNs to predict MSI from H&E stained histological samples. Kather et al. trained and tested CNNs on gastric, endometrial, and colorectal samples that were snap-frozen or formalin-fixed paraffin-embedded (FFPE) [43]. FFPE slides are routinely used for histological diagnosis and immunohistochemistry. Fixation with formalin and embedding with paraffin are performed to maintain tissue architecture and morphology, and to allow long-term preservation at room temperature. The process of generating an FFPE slide requires many hours, and the fixation process results in the cross-linking of DNA and proteins that can impair the performance of molecular analyses. Snap-frozen tissue is not routinely obtained but can be used for intraoperative diagnoses because it can be rapidly reviewed by a pathologist. Snap-frozen tissue can also be used for extensive molecular analyses [44][45]. The morphological quality of snap-frozen tissue is not considered sufficient to render a definitive diagnosis, and confirmation using FFPE slides is typically required [46][47][48].

Using this process, the CNN was able to detect MSI in snap-frozen and FFPE TCGA samples with similar AUCs to those achieved with previous pathology-based scoring systems such as MsPath and the model by Greenson et al. (0.84 for snap-frozen CRC samples, 0.77 for FFPE CRC samples, and 0.81 in FFPE gastric adenocarcinoma samples). This level of performance was maintained when the CNN trained on FFPE CRC samples was tested on an external validation cohort from the DACHS (Darmkrebs: Chancen der Verhütung durch Screening) study (Table 2), which consisted of FFPE CRC samples from Germany (AUC 0.84). The authors also tested the classification performance of the ResNet when applied to slides with limited tissue, finding that performance plateaued with a quantity of tissue that is available from standard needle biopsies [43].

To attempt to identify what pathological features the ResNet used to make its classifications, tumor regions that were assigned high or low MSI scores were visually inspected. Areas predicted by the CNN to represent MSI often showed characteristics consistent with known pathological correlates of MSI, such as poor differentiation and lymphocytic infiltration. PD-L1 expression and an interferon gamma transcriptomic signature were correlated with the proportion of a sample’s tiles predicted to have MSI. This finding is consistent with previous data showing high expression of PD-L1 and interferon gamma in CRC with MSI [55][56].

Despite encouraging performance for MSI classification in similar cohorts, testing against different cohorts revealed some limitations. CNNs trained on snap-frozen CRC samples or gastric adenocarcinoma samples did not perform as well as the CNN both trained and tested on FFPE CRC samples. When the CNN was trained to detect MSI in endometrial cancers, its performance was significantly reduced to an AUC of 0.75, raising the possibility that the CNN is learning tissue-specific features associated with MSI. Additionally, the CNN trained on TCGA gastric adenocarcinomas did not perform as well when tested on a Japanese gastric adenocarcinoma cohort (AUC 0.69), possibly due to distinctive histological patterns seen in gastric adenocarcinomas in this cohort [43].

Other studies have attempted to improve upon these results using other CNNs and machine learning techniques (Table 2). In a follow-up study by Kather et al., the prediction of MSI was performed as a benchmark task by various CNNs, which were pretrained on the ImageNet database [49]. The ResNet and Inception CNNs were outperformed by the DenseNet [57] and ShuffleNet [58] architectures. ShuffleNet, a CNN optimized for mobile devices, was able to achieve an AUC of 0.89 when trained on a CRC cohort from TCGA and validated on the DACHS CRC cohort (Table 2). The ResNet used for the previous study by Kather et al. achieved an AUC of 0.84 [43][49].

Another group reported improvement upon the results of Kather et al. in terms of overall predictive accuracy and generalizability to different cohorts [50]. This study also used ResNet-18 to assign each tile within the tumor area an MSI likelihood. However, multiple instance learning was used to train the CNN to classify the whole slide image. Multiple instance learning assumes that not all tumor regions contribute the same amount of information to the task of classifying the tumor as a whole [59]. Certain regions or patterns found in limited areas of a sample may be more important to determining the likelihood of the tumor being MSI. For example, any mucinous differentiation increases the likelihood of a tumor harboring dMMR/MSI [26][28]; this may be focal and not seen in the majority of tumor areas. Two different multiple instance learning methods were used in this study, and their input was integrated into a final ensemble predictor (Table 2). This ensemble classifier achieved an AUC of 0.885 [50], which was better than the performance reported by Kather et al. [43].
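One common formulation of the multiple-instance idea, attention-based pooling, can be sketched as follows. This is not necessarily the exact method used in the study: tiles contribute unequally to the slide-level score via softmax weights, and the attention values here are fixed stand-ins for learned parameters.

```python
# Toy sketch of multiple instance learning via attention pooling: tiles
# contribute unequally to the slide-level MSI score. Attention logits are
# fixed stand-ins for learned values; this is not the study's exact method.
import math

def attention_pool(tile_probs, attention_logits):
    """Weight each tile's MSI probability by a softmax over attention logits."""
    exps = [math.exp(a) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * p for w, p in zip(weights, tile_probs))

# A focal high-attention region (e.g., a mucinous area) can dominate the
# slide-level score even when most tiles look microsatellite stable.
tile_probs = [0.2, 0.1, 0.95, 0.15]
attention = [0.0, 0.0, 3.0, 0.0]
slide_score = attention_pool(tile_probs, attention)
```

Contrast this with majority voting, where the same slide would be called microsatellite stable because three of four tiles score low.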

This group also found a significant reduction in AUC (0.650) when the TCGA-trained ensemble classifier was tested on a cohort of Asian patients with samples acquired with a different slide preparation protocol [50]. They were able to overcome this reduction in performance by transfer learning. By adding increasing proportions of data from the Asian cohort to the training set, they achieved an AUC of 0.850 with 10% of samples from the Asian cohort, with continued improvement up to an AUC of 0.926 with 70% of samples from the Asian cohort (Table 2) [50]. Pathologic signatures were derived from the model and were associated with known features of MSI, including TMB and insertion-deletion mutational burden, as well as transcription signatures of immune activation.

A conference paper by Wang et al. also assessed an alternative technique, Patch Likelihood Histogram (PALHI), for integrating tile-level MSI predictions into patient-level predictions using whole slide images from a TCGA endometrial cancer cohort [60]. First, a ResNet-18 pre-trained on ImageNet was trained to predict MSI for individual tiles on a subset of the TCGA cohort. PALHI then generated a histogram of the patch-level estimated MSI likelihoods, which was used to train a machine learning classifier called XGBoost to make patient-level predictions. The performance of a pipeline using PALHI to make patient-level predictions was compared to pipelines using another machine learning method, Bag of Words (BoW), and the “majority voting” method, using another subset of the TCGA cohort as a testing set. The three methods were each trained on both patches assigned binary “hard labels” and patches assigned “soft labels,” or MSI probabilities. The PALHI method trained using “soft labels” yielded the best performance on the test set, with an AUC of 0.75. By comparison, the AUCs for BoW and the majority method using “soft labels” were 0.71 and 0.56, respectively [60].
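The histogram step of PALHI can be sketched as binning per-tile likelihoods into a fixed-length feature vector for a downstream classifier such as XGBoost (not shown). The bin count and sample probabilities below are assumptions for illustration.

```python
# Hedged sketch of the PALHI feature step: bin per-tile MSI likelihoods into
# a fixed-length normalized histogram summarizing one whole slide. The bin
# count and tile probabilities are invented for illustration.

def patch_likelihood_histogram(tile_probs, n_bins=10):
    """Return a normalized histogram of tile-level MSI probabilities."""
    counts = [0] * n_bins
    for p in tile_probs:
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        counts[idx] += 1
    total = len(tile_probs)
    return [c / total for c in counts]

features = patch_likelihood_histogram([0.05, 0.12, 0.55, 0.58, 0.93, 1.0])
```

Unlike majority voting, the histogram preserves the full distribution of tile confidences, which is what lets a downstream classifier exploit "soft label" information.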

Transcriptomic prediction from H&E slides has also been used to improve MSI prediction when limited training data are available [51]. First, features were extracted from each tissue tile using the ResNet-50, pretrained on the ImageNet database. These features served as the input for a custom multilayer perceptron, which was trained to predict gene expression from RNA-Seq data. Multilayer perceptrons are neural networks composed of fully connected layers, typically without convolutional layers. This neural network was trained on pan-cancer and tissue-specific TCGA datasets and was able to predict several expression signatures, including adaptive immune response signatures [51]. For MSI prediction, the authors simulated a situation where a limited number of training slides are available at two sites. They showed that, using the transcriptomic representation trained at one site, they could improve MSI prediction at the second site. However, when increasing proportions of data at the second site were used for MSI prediction without integrating transcriptomic representation, this advantage was largely lost. Neither method achieved an AUC > 0.85 and no external validation set was used (Table 2) [51]. It is unclear if this approach would be applicable in real-life settings.

In a conference paper submitted to the 1st Conference on Medical Imaging with Deep Learning (MIDL 2018) [52] and a related patent [61], adversarial learning was used to improve the generalizability of CNNs for MSI prediction across different cancers. The Inception-V3, ResNet-50 and VGG-19 CNNs were compared; Inception-V3 was chosen for downstream analysis. TCGA samples were used for both testing and training; this study did not use an external validation dataset. MSI status was categorized as stable, low instability, or high instability. Inception-V3 was trained on CRC samples and achieved a slide-level accuracy of 98.3% with an internal validation set of 10% of TCGA slides. It is unclear if this level of accuracy represents overfitting. Accuracy was poor (54%) when applied to endometrial carcinoma samples, whereas training the CNN on both CRC and endometrial carcinoma decreased the accuracy of MSI prediction for CRC to 72% (Table 2). This CNN also performed poorly at classifying MSI in gastric adenocarcinoma, with a slide-level accuracy of 35%. Next, a tumor type classifier was added to the CNN with an adversarial objective: to decrease the ability of the model to predict tumor type. The rationale for this adversarial objective is to remove tissue-specific features learned by the CNN, so that the model better recognizes the features associated with MSI itself. Adversarial training improved MSI classification across the three cancer types, but accuracy remained poor for gastric adenocarcinoma at 57% [52].
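A standard way to implement such an adversarial objective is a gradient reversal step, sketched numerically below; the paper's exact mechanism may differ. The single shared weight and all numbers are invented to keep the illustration self-contained.

```python
# Conceptual sketch of adversarial training via gradient reversal: the
# tumor-type head's gradient is sign-flipped before reaching the shared
# features, pushing them to become tissue-agnostic. The one-weight "network"
# and all numbers are invented; the paper's exact mechanism may differ.

def gradient_reversal(grad, lam=1.0):
    """Identity in the forward pass; flips the gradient's sign going back."""
    return -lam * grad

# One toy gradient-descent step on a shared feature weight: because the
# tumor-type gradient is reversed, the update moves the weight in the
# direction that *increases* tumor-type loss (removing tissue-specific cues).
shared_weight = 0.5
tumor_type_grad = 0.2  # d(tumor-type loss)/d(shared_weight)
lr = 0.1
shared_weight -= lr * gradient_reversal(tumor_type_grad)
```

In a full model, the MSI head's gradient passes through unreversed, so the shared features are trained to be informative for MSI while uninformative for tumor type.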

Focusing on endometrial cancer, a recent study available as a preprint generated CNNs that had three branches of an InceptionResNet architecture, each analyzing tiles at a different resolution [53]. An optional fully connected layer incorporating clinical features was also evaluated as a fourth branch. This structure, termed Panoptes, allowed the model to take into account both tissue-level and cellular-level structures, as would a human pathologist using a microscope. MSI classification was one of several tasks that the CNNs were trained to do. While the complex architecture showed strong performance in predicting many histological and molecular features, MSI was best predicted by the existing InceptionResnetV1 architecture, with an AUC of 0.827 (Table 2), which outperformed Kather’s previously described ResNet-18 architecture (AUC 0.75). The inclusion of clinical data (patient age and BMI) did not significantly improve the model’s performance [53]. Predicted MSI was correlated with certain histological features, including intratumoral and peritumoral lymphocytic infiltrates.

The strongest-performing model for MSI prediction was developed by Echle et al. by training a CNN on a large cohort of H&E-stained CRC samples from the MSIDETECT consortium, which comprises whole slide images from TCGA, DACHS, the United Kingdom-based Quick and Simple and Reliable trial (QUASAR), and the Netherlands Cohort Study (NLCS) [54]. A modified version of the CNN ShuffleNet that was pre-trained on ImageNet was trained on whole slide images from MSIDETECT with known MSI or dMMR status and externally validated on a separate population-based cohort, the Yorkshire Cancer Research Bowel Cancer Improvement Programme (YCR-BCIP). For each slide, tumor tissue was manually outlined and the slide was divided into smaller tiles. The patient-level prediction of MSI/dMMR was based on the average tile-level prediction for each patient. The CNN was first trained and tested on individual sub-cohorts. As in earlier-described studies [43][50][52], when a CNN trained on a single sub-cohort was tested on another sub-cohort, performance usually suffered. A positive correlation between the size of the training cohort and the performance of the model was noted. The CNN was then trained on increasing numbers of patients randomly selected from the MSIDETECT cohort. The model showed better performance with greater numbers of patients up until about 5000 patients, after which performance plateaued. After training with samples from 5500 patients, the model attained an AUC of 0.92 when tested on a separate set of patients from MSIDETECT. When tested on the external validation cohort (YCR-BCIP), the model attained a similarly impressive AUC of 0.95. Additionally, when slides were subjected to color normalization, the specificity at given levels of sensitivity increased and a slight improvement in AUC to 0.96 was demonstrated [54].
Though these results are encouraging, it is worth noting that the samples used to train and test this model were derived mostly from European patients. Further validation with more diverse cohorts and prospective studies will be necessary before this model can be applied in a broad clinical context.
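The patient-level aggregation used by this model, averaging tile-level predictions rather than taking a majority vote, can be sketched in a few lines. The tile scores below are invented for illustration.

```python
# Minimal sketch of the patient-level aggregation described above: the
# slide score is the mean of tile-level MSI probabilities, not a majority
# vote. Tile scores are invented for illustration.

def patient_msi_score(tile_scores):
    """Average tile-level MSI probabilities into one patient-level score."""
    return sum(tile_scores) / len(tile_scores)

score = patient_msi_score([0.9, 0.7, 0.4, 0.8])
```

Averaging keeps the prediction continuous, so a single operating threshold on the score can be tuned for the desired sensitivity/specificity trade-off.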

Subgroup analysis did reveal some variation in the model’s performance for certain tumor characteristics. While the performance was consistent for tumors at stages I-III (AUCs 0.91–0.93), the AUC for stage IV tumors was lower (0.83). The authors do not discuss potential explanations for this discrepancy, but there was a similar reduction in AUC for tumors with high histologic grade (AUC for high grade tumors was 0.83). The relatively low prevalence of MSI/dMMR in stage IV colorectal cancers would have decreased the number of images from this subgroup available for training, as would the fact that stage IV tumors are more likely to come from biopsy specimens than complete resection samples. This lower performance for stage IV tumors is unfortunate given that ICI therapy is currently primarily used in late-stage colorectal cancer, reducing the model’s potential utility for guiding treatment decisions. Additionally, the model predicted MSI more effectively for colon cancer (AUC 0.91) than for rectal cancer (AUC 0.83). Performance did not vary significantly by tumor molecular characteristics (e.g., mutation status) [54].

As noted above, a previous study demonstrated that the performance of ResNet-18 in classifying MSI status plateaued with a quantity of tissue that can be obtained by needle biopsy [43]. However, Echle et al. found a significant decrease in AUC when the CNN trained on surgical specimens was tested on YCR-BCIP biopsy specimens as compared to YCR-BCIP surgical specimens (0.78 vs. 0.96). Though size of specimen may be a factor here, artifacts from specimen acquisition and the fact that samples were derived only from luminal tumor tissue may also affect performance. When the authors performed a 3-fold cross-validated experiment using YCR-BCIP biopsy specimens to both train and test, the AUC improved to 0.89 [54]. However, the model was not tested on samples from sites of metastasis, which are commonly biopsied in the clinical setting. Thus, machine learning models may be effective in classifying the MSI status of biopsy specimens, but will likely perform best when trained on similarly derived specimens.

Taken together, these studies demonstrate that multiple CNNs and machine learning techniques are being evaluated for MSI prediction from histology, with no clear consensus yet on the optimal network architecture. Training on large and diverse datasets may mitigate the drop in classification accuracy that many models show when applied to datasets with differing characteristics, as can occur across different health systems, regions, and populations. With continued experimentation, improvement, and validation of existing models, machine learning prediction of MSI may reach a level of accuracy sufficient for clinical application in the future.
