Lung Cancer: Genotype Prediction in Computer-Aided Decision Systems

Please note this is a comparison between Version 2 by Conner Chen and Version 1 by Tania Pereira.

Computer-aided Decision
Genotype Prediction
Lung Cancer
Personalized Medicine

1. Genotype Prediction

Genotype studies are fundamental to the development of personalized medicine in lung cancer, and they enable the progress of targeted therapies. Furthermore, gene analysis allows the identification of biomarkers that can be used to detect cancer early, predict the prognosis and the response to treatment plans, and monitor disease progression. The most recent ambition for computer-aided decision (CAD) systems has been to correlate the phenotype captured by radiological images with the associated genotype. Recent studies have focused on predicting the EGFR mutation status using CT imaging, since targeted therapies for this gene already exist.

In total, twenty studies were found after employing the query (“Gene Mutation Status”) AND (“Prediction”) AND (“Lung Cancer”) in the research databases IEEE Xplore and PubMed, and excluding the ones that were not based on CT scans. These studies included semantic, radiomic, and deep features, which were the inputs of statistical, machine learning, or deep learning models. All of these studies were from 2017 to 2021, which shows how novel the investigation of this area is.

2. Centered on Nodule

Thus far, twelve studies were found that explore gene mutation status prediction through CT scan analysis using features related to the nodule. Table 1 provides detailed information on each work dedicated to genotype prediction using nodule features.

Table 1. Overview of published studies regarding predictive models for gene mutation status based on nodule features (2017–2021).

Authors	Year	Dataset	Methods	Performance Results (%)
Zou et al. [177]^[1]	2017	Private (209 patients)	Multivariable analyses	EGFR: AUC = 73.7
Cheng et al. [176]^[2]	2017	Private (2146 patients)	Weighted mean difference, inverse variance	EGFR: OR = 49.0
Li et al. [179]^[3]	2018	Private (1010 patients)	Random forest/CNNs	EGFR: AUC = 83.4
Koyasu et al. [178]^[4]	2019	NSCLC-radiogenomics	XGBoost/random forest	EGFR: AUC = 65.9
Wang et al. [180]^[5]	2019	Private (844 patients)	CNNs	EGFR: AUC = 85.0
Zhao et al. [181]^[6]	2019	TCIA and private (879 patients)	3D DenseNet	EGFR: AUC = 75.8
Moreno et al. [183]^[7]	2021	NSCLC-radiogenomics	SCAV with ML/CNN	EGFR: AUC = 82.0 (CNN); KRAS: AUC = 73.9 (CNN)
Zhang et al. [182]^[8]	2021	Private (914 patients)	Machine learning (SVM/RF/MLP); deep learning (SE-CNN/CNN/1D-CNN/AlexNet/fine-tuned VGG16/fine-tuned VGG19)	EGFR: AUC = 91.0 (SE-CNN); AUC = 83.6 (SVM)
Le et al. [184]^[9]	2021	NSCLC-radiogenomics	LR/KNN/RF/XGBoost	EGFR: ACC = 77.8; KRAS: ACC = 83.3
Cheng et al. [187]^[10]	2021	Private (670 patients)	Pre-trained 3D DenseNet	EGFR: AUC = 76.0; ACC = 72.5; F-score = 71.3
Zhang et al. [186]^[11]	2021	Private (134 patients)	Logistic regression	EGFR: AUC = 78.0; KRAS: AUC = 81.0; ERBB2: AUC = 87.0; TP53: AUC = 84.0
Han et al. [185]^[12]	2021	Private (827 patients)	Logistic regression	EGFR: AUC = 75.8; ALK: AUC = 73.9

ACC: accuracy; AUC: area under the ROC curve; KNN: K-nearest neighbors; LR: logistic regression; MLP: multilayer perceptron; OR: odds ratio; RF: random forest; SCAV: selective class average voting; SE-CNN: squeeze-and-excitation convolutional neural network; SVM: support vector machine; XGBoost: extreme gradient boosting.

EGFR is the most relevant oncogene due to its frequency of occurrence and the targeted therapies available for clinical use. For these reasons, several CAD systems have been developed to detect the mutation status of this gene. Correlations between CT morphological features and the presence of EGFR mutations were studied, showing that EGFR mutations tended to occur in tumors with part-solid ground-glass opacity (GGO) [176]^[2]. Approaches based on ML methods were extensively used and showed promising results [177,178]^[1][4]. A different approach extracted high-level deep features to predict EGFR mutation status [179,180,181,182]^[3][5][6][8]; a CNN showed the best classification performance with an AUC of 0.85. A few other works were dedicated to identifying the mutation statuses of other oncogenes, including KRAS [183,184]^[7][9], ALK [185]^[12], or even other genes (erb-b2 receptor tyrosine kinase 2 (ERBB2) and tumor protein 53 (TP53)) [186]^[11]. Those predictions were performed using ML-based approaches and radiomic features [183,184,185,186]^[7][9][11][12]. The best results achieved an AUC of 0.81 for KRAS, 0.87 for ERBB2, and 0.84 for TP53 [186]^[11].

3. More Comprehensive Approaches

Thus far, seven studies were found that took into account at least one feature related to structures or disease external to the nodule. Table 2 presents an overview of each work that used a more comprehensive approach for genotype prediction.

Table 2. Overview of published studies regarding predictive models for gene mutation status based on nodule and extra-nodule features (2017–2021).

Authors	Year	Dataset	Methods	Performance Results (%)
Gevaert et al. [71]^[13]	2017	Private (186 patients)	Decision tree	EGFR: AUC = 89.0
Cao et al. [188]^[14]	2018	Private (156 patients)	Principal component analysis	EGFR: TPR = 72.3; TNR = 78.5
Rizzo et al. [189]^[15]	2019	Private (122 patients)	Univariate analysis	EGFR: AUC = 82.0; KRAS: AUC = 67.0
Pinheiro et al. [50]^[16]	2019	NSCLC-radiogenomics	Gradient tree boosting	EGFR: AUC = 74.6
Xiong et al. [190]^[17]	2019	Private (1010 patients)	ResNet 101	EGFR: AUC = 83.8
Silva et al. [191]^[18]	2021	LIDC-IDRI, NSCLC-radiogenomics	Convolutional autoencoder	EGFR: AUC = 68.0
Morgado et al. [192]^[19]	2021	NSCLC-radiogenomics	LR, Elastic Net, Linear SVM, RBF SVM, RF, and XGBoost	EGFR: AUC = 73.7 (Linear SVM); AUC = 73.3 (Elastic Net); AUC = 72.5 (LR)

AUC: area under the ROC curve; LR: logistic regression; RF: random forest; SVM: support vector machine; TNR: true negative rate; TPR: true positive rate.
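As a reading aid for the performance columns, the following minimal Python sketch shows one common way to compute TPR, TNR, ACC, and AUC for a binary mutant versus wild-type classifier. The labels and scores are made-up toy values, not data from any of the cited studies, and the function names are illustrative assumptions rather than code used by those authors.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels: 1 = mutant, 0 = wild-type."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def tpr_tnr_acc(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """True positive rate (sensitivity), true negative rate (specificity), accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + fp + tn + fn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (mutant, wild-type) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 4 mutant and 4 wild-type cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an arbitrary choice

tpr, tnr, acc = tpr_tnr_acc(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"TPR={tpr:.3f} TNR={tnr:.3f} ACC={acc:.3f} AUC={auc:.4f}")
```

TPR, TNR, and ACC depend on the chosen cut-off, whereas AUC summarizes ranking quality over all cut-offs, which is one reason values reported with different metrics in the tables are not directly comparable.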

A comprehensive approach is based on the combination of information from nodule features, other lung structures, and a possible fusion with clinical data. The use of all this knowledge allows a deep characterization of the pathophysiological changes that occurred, which could benefit the prediction of the mutational status of the oncogenes. Some of the models developed for a more comprehensive analysis employed semantic imaging data annotated by thoracic radiologists that captured extensive regions of the lung and patient conditions, instead of focusing only on the nodule region. These approaches were based on qualitative radiological features [71,188,189]^[13][14][15]. On the other hand, the features from the CT images can be objective and automatically extracted, such as radiomic or high-level deep features [190,191,192]^[17][18][19]. Additionally, both types of features (semantic and automatically extracted) can be used together by the learning models [50]^[16].

The studies that combined semantic features with the simplest classification models allowed the assessment of the most relevant lung and nodule features for mutation status prediction. The wild-type status for EGFR was predicted by the presence of emphysema and airway abnormality, while the presence of any ground-glass component indicated EGFR mutations [71]^[13]. Moreover, gender, smoking history, emphysema, and diameter in the mediastinal, TDR, and GGO showed statistical differences between the EGFR wild-type and mutated groups [188]^[14]. An association was found between EGFR mutation and internal air bronchogram, pleural retraction, emphysema, and lack of smoking [189]^[15]. Combining nodule-related features with features from other lung structures was shown to benefit EGFR mutational status prediction [50]^[16].

KRAS mutation status prediction showed inconsistent results even in these more comprehensive studies, and in some studies this oncogene's status was not associated with image features [50,71]^[13][16].

4. Discussion and Future Work: Genotype Prediction

Radiogenomic approaches used to classify the mutation status of oncogenes for lung cancer patients have shown that there are radiomic signatures in CT images that can be used to distinguish mutant from wild-type statuses. Previous studies have also demonstrated that radiological features, corresponding to descriptive features more familiar to radiologists, may be associated with tumor biology. Subsequent studies further demonstrated that the combination of radiomic features and the inclusion of clinical information strengthens the robustness of predictive models. Furthermore, recent studies that have taken into consideration features from a larger region of analysis that contained other structures from the lung appear to have more accurate predictive performances compared to traditional nodule-based approaches. Since lung cancer development is related to multiple physiological changes not restricted to the nodule region, it is expected that the studies that employ comprehensive approaches and consider extra-tumor features from the lung with the tumor obtain a significant increase in predictive performance. It is crucial to highlight these results and further investigate the importance of holistic lung cancer characterization studies, as many complex combinations of morphological, molecular, and genetic alterations occur during lung cancer development that, when taken into account, would allow the development of more accurate predictive models.

Image analysis can reveal biological information, but it will not completely replace the need for tissue biopsy or liquid biopsy. However, image-driven studies can provide additional information that is complementary to biopsies. For example, if the biopsy of a tumor indicates EGFR wild-type, the result may be a false negative because of intra-tumor heterogeneity. In this case, the learning model can be seen as an alternative validation tool, as CT imaging provides biological information that can describe the genotype and phenotype of the whole tumor and project this information onto each pixel of the images to reflect intra-tumor heterogeneity. If the model predicts the tumor to be EGFR-mutant, clinicians may need to re-biopsy the tissue. In addition, predicting mutation status from CT imaging helps to choose the most suspicious tumor for biopsy if multiple tumors are present in a patient. Finally, the predictive model requires only routinely used CT imaging, which is non-invasive and easy to acquire throughout the course of treatment. The CT scan can be performed multiple times along the treatment plan, allowing the assessment of the patient's treatment response. Such repeated assessments may not be possible by biopsy due to its invasive nature. Therefore, it is worthwhile to develop image analysis methods that complement tissue biopsy and liquid biopsy for more precise systemic treatment and local therapy.

The radiogenomics field presents a small number of publications that are strongly limited by the small sizes of the available databases, which are hardly a good representation of the population affected by lung cancer. In addition, there is a larger number of benign nodules than malignant ones in the available public databases, which hinders the ability to extract useful features related to malignant cases only. Furthermore, performance comparisons between models trained and tested with different data do not allow clear and objective conclusions, and image acquisition protocols and performance validation methods (e.g., cross-validation) differ from study to study. Still, direct quantitative comparisons of prediction results are crucial for a clearer understanding of the research evolution, increasing the need for a large and heterogeneous cohort of patients affected by lung cancer, as well as methods capable of coping with data heterogeneity. Accordingly, the sharing of image data among different clinical institutions, under a uniform protocol to avoid any inconsistency during data recording, is valuable for obtaining a unique, reliable dataset.
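Because validation protocols vary between studies, stating the exact splitting scheme helps make results comparable. The sketch below is a minimal, dependency-free illustration of stratified k-fold cross-validation, which keeps the mutant to wild-type ratio roughly constant across folds. The function name `stratified_kfold`, the seed, and the 20/30 class split are assumptions chosen for illustration, not the protocol of any cited study.

```python
import random
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def stratified_kfold(
    labels: Sequence[Hashable], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; each fold keeps roughly the class ratio of `labels`."""
    rng = random.Random(seed)  # fixed seed so the split can be reported and reproduced
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds: List[List[int]] = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled indices of each class round-robin into the k folds.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical cohort: 20 EGFR-mutant (1) and 30 wild-type (0) patients.
labels = [1] * 20 + [0] * 30
splits = list(stratified_kfold(labels, k=5, seed=42))
for train, test in splits:
    n_mutant = sum(labels[i] for i in test)
    print(f"test size={len(test)}, mutant cases in test fold={n_mutant}")
```

In practice, the split should be made at the patient level, so that scans from the same patient never appear in both the training and test folds.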

Before translation into clinical practice, multisite trials are also needed to validate the results obtained in training cohorts on separate, independent groups of patients. Since a model's fit is optimal on the training set used to build it, it is crucial to validate the model in a large external cohort of patients to obtain more reliable performance estimates. External validation will determine the transportability of the model to different locations with plausibly similar individuals.
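The gap between training performance and external performance can be illustrated with a deliberately simple sketch: a one-feature threshold classifier is fitted on a development cohort, frozen, and then evaluated on an independent cohort. All feature values below are invented toy numbers chosen to show the effect, and the helper names are hypothetical; this is not a result or method from the cited studies.

```python
from statistics import mean
from typing import Sequence


def fit_threshold(values: Sequence[float], labels: Sequence[int]) -> float:
    """Fit on the development cohort only: cut-off halfway between the two class means."""
    mutant = [v for v, y in zip(values, labels) if y == 1]
    wild = [v for v, y in zip(values, labels) if y == 0]
    return (mean(mutant) + mean(wild)) / 2


def accuracy(values: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of cases classified correctly with a frozen threshold."""
    preds = [1 if v >= threshold else 0 for v in values]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(values: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (mutant, wild-type) pairs ranked correctly; ties count 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy development cohort (hypothetical radiomic feature; 1 = mutant, 0 = wild-type).
dev_x = [0.8, 0.9, 0.7, 0.6, 0.2, 0.3, 0.1, 0.4]
dev_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Toy external cohort from another site, with more class overlap.
ext_x = [0.6, 0.45, 0.7, 0.4, 0.3, 0.55, 0.2, 0.35]
ext_y = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = fit_threshold(dev_x, dev_y)  # the model is frozen after this step

dev_acc, dev_auc = accuracy(dev_x, dev_y, threshold), roc_auc(dev_x, dev_y)
ext_acc, ext_auc = accuracy(ext_x, ext_y, threshold), roc_auc(ext_x, ext_y)
print(f"development: ACC={dev_acc:.3f} AUC={dev_auc:.3f}")
print(f"external:    ACC={ext_acc:.3f} AUC={ext_auc:.3f}")
```

The key design point is that nothing is re-fitted on the external cohort: the threshold learned during development is applied unchanged, so the external figures estimate how the model would transport to a new site.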

Studying inter-radiologist variability in multi-institutional cohorts is required in the near future to assess the robustness of semantic feature annotation. Moreover, explainable AI is a field that should be further explored in radiogenomics studies, as it is important to consider not only black-box models but also interpretable models whose predictive decisions can be understood by human observers.