Artificial intelligence (AI), boosted by deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large, complex datasets such as genome-wide crop genetic markers and their associations with crop phenotypes. AI techniques can be applied to analyze large amounts of genomic data and identify patterns that are difficult for humans to detect. These patterns can then be used to develop more accurate predictive models.
Disease resistance can be qualitative or quantitative, a distinction that can often be attributed to differences in the plant genome. Qualitative disease resistance is generally conditioned by a single resistance (R) gene recognizing avirulence factors in a classic gene-for-gene mechanism, and the inheritance is said to be qualitative or Mendelian. In contrast, quantitative resistance is usually conditioned by many genes of small effect, and the inheritance is said to be quantitative or polygenic [1]. For qualitative traits that are controlled by single genes, DNA markers are often used to screen for and select the desired gene in breeding programs through marker-assisted selection (MAS). Identifying accurate markers that are strongly associated with the trait is the key to the successful application of MAS. However, this is more challenging for complex traits controlled by many genes. An alternative method for quantitative traits is genomic selection or genomic prediction (GP). GP takes advantage of all molecular markers, regardless of whether they pass a significance threshold, to determine the breeding and/or genetic potential of a candidate individual for selection. Compared to MAS, GP has several advantages. GP allows breeders to select individuals with desirable traits at an earlier stage of the breeding process, which can save time and resources. At the same time, GP can reduce the cost of selection by reducing the need for phenotyping, which can be expensive and time-consuming. More importantly, GP can be used to select for complex traits that are difficult to measure directly, such as disease resistance and yield.
The key element in GP is to build a robust and accurate statistical model based on available individuals with both phenotypic and genotypic data. Statistical models have been developed to improve robustness and prediction accuracy. One of the most frequently used models is genomic best linear unbiased prediction (GBLUP). It is built on the assumption that all SNPs contribute to the heritability of breeding traits and that their effects arise from the same normal distribution [2, 3]. However, this assumption could reduce the prediction ability of the linear mixed model (LMM) when the trait under study is controlled by several dominant genes. To fit such effects, Bayesian-based methods have been developed, including BayesA, BayesB, BayesLASSO, etc. [4-8]. The Bayesian-based methods assume that SNPs belong to different groups that have their own independent variances and specific distributions, such as the inverse chi-squared distribution. Although these traditional LMM or Bayesian-based approaches have been used in plant breeding, they were developed on the assumption that genotype random effects follow a prior distribution and that each genotype contributes to the associated phenotype independently. Such assumptions require a large number of samples to dilute the effects of population structure. At the same time, the individual genotype effect may not follow a specific distribution perfectly. Additionally, these approaches are all based on a linear mapping from genotype to phenotype, so they are less powerful at capturing non-linear effects such as dominance and epistasis, which are common and important in complex traits [9, 10].
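As a rough illustration of the GBLUP idea described above, the following Python sketch builds a genomic relationship matrix from SNP allele counts and uses it to predict individuals without phenotypes. It is a minimal sketch under stated assumptions, not the implementation used in the cited studies: the relationship matrix follows the commonly used VanRaden form, the variance ratio `lam` is fixed by hand rather than estimated (for example by REML), the only fixed effect is the plain training mean, and all function names and the simulated data are hypothetical.

```python
import random

def vanraden_g(genotypes):
    """Genomic relationship matrix from 0/1/2 allele counts (rows = individuals)."""
    n, m = len(genotypes), len(genotypes[0])
    # Frequency of the counted allele at each marker.
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Centre each marker by twice its allele frequency.
    w = [[row[j] - 2.0 * p[j] for j in range(m)] for row in genotypes]
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(a * b for a, b in zip(w[i], w[k])) / scale for k in range(n)]
            for i in range(n)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def gblup_predict(g, y, train, test, lam=1.0):
    """Predict genomic values of `test` individuals from `train` phenotypes.

    lam is the ASSUMED ratio sigma_e^2 / sigma_u^2; it is fixed here,
    whereas in practice it would be estimated from the data.
    """
    mu = sum(y[i] for i in train) / len(train)  # simplification: plain mean
    # (G_train,train + lam * I)
    k = [[g[i][j] + (lam if i == j else 0.0) for j in train] for i in train]
    alpha = solve(k, [y[i] - mu for i in train])
    # u_hat_test = G_test,train (G_train,train + lam I)^-1 (y_train - mu)
    return [mu + sum(g[t][i] * a for i, a in zip(train, alpha)) for t in test]

# Hypothetical simulated data: 20 individuals, 50 SNPs, additive effects.
random.seed(0)
geno = [[random.choice((0, 1, 2)) for _ in range(50)] for _ in range(20)]
effects = [random.gauss(0.0, 1.0) for _ in range(50)]
pheno = [sum(x * b for x, b in zip(row, effects)) + random.gauss(0.0, 1.0)
         for row in geno]

G = vanraden_g(geno)
preds = gblup_predict(G, pheno, train=list(range(15)), test=list(range(15, 20)))
```

Because every marker enters the relationship matrix with the same weight, the sketch reflects the assumption that all SNP effects come from one common normal distribution; the Bayesian alternatives relax exactly this point by giving markers their own variances.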
To overcome the limitations imposed by assumptions about the genetic architecture and by linear effects, machine learning (ML) approaches were developed. These methods do not require prior assumptions and are capable of extracting non-linear features. Many of these methods have been applied to GP problems, including but not limited to support vector machines (SVM) with non-linear kernels (i.e., radial basis function, SVRrbf, and polynomial, SVRpoly) [11, 12], reproducing kernel Hilbert spaces (RKHS) regression [13, 14], gradient tree boosting (GTB) [15], and random forests (RF) [15, 16].
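To make the role of a non-linear kernel concrete, the sketch below computes a radial basis function (RBF) similarity matrix between genotype vectors. It is only an illustration under assumptions: the function names, the `gamma` value, and the toy genotypes are hypothetical, and the regression step that SVR or RKHS regression would perform on this matrix is omitted.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) similarity between two genotype vectors (0/1/2 codes)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(genotypes, gamma):
    """Pairwise RBF similarities for all individuals."""
    return [[rbf_kernel(x, z, gamma) for z in genotypes] for x in genotypes]

# Hypothetical toy genotypes: three individuals, four SNPs.
geno = [[0, 1, 2, 0],
        [0, 1, 2, 1],   # differs from individual 0 at one SNP
        [2, 1, 0, 2]]   # differs from individual 0 at three SNPs
K = gram_matrix(geno, gamma=0.25)  # gamma is an arbitrary illustrative value
```

Here `K[0][1]` is larger than `K[0][2]` because the first two individuals are closer in marker space. Unlike a dot-product (linear) kernel, which yields a model that is additive in the markers, the RBF similarity depends on the whole multi-locus distance, which is what allows kernel methods to represent interactions between loci.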
Deep learning (DL) is regarded as an efficient method in several studies of GP [15, 17-24] because of its capability in handling a diversity of high-dimensional tasks [19, 25]. After major innovations in recent years, advanced DL architectures have been developed to conduct complex trait predictions in several crops [26, 27]. Jubair et al. (2021) developed a transformer-based DL model, GPTransformer, to conduct GP for barley resistance to Fusarium head blight (FHB), which is caused by Fusarium graminearum Schwabe [28]. Two pathogen inoculation methods were used to fully explore the possible pathogen–plant interactions. The first method inoculated the barley plants with the microbe communities on maize kernels that were infected with two strains of F. graminearum. In GP, the pre-screened essential genomic markers were fed into GPTransformer to predict FHB and deoxynivalenol (DON). The results indicated that GPTransformer performed similarly to the GBLUP model, with only a 1% improvement over BLUP for DON and similar performance for FHB. This study suggests the potential of DL methods for understanding pathogen–plant interactions or predicting plant disease phenotypes when compared to the popular BLUP model.
However, ML models did not always outperform other methods for all traits and species. Linear models tend to perform consistently across predictions, while the performance of ML models varied substantially from trait to trait. Montesinos-López et al. (2018) compared three DL architectures (ANN, CNN, and RNN) against the commonly used linear GBLUP model on nine datasets [29]. Generally, GBLUP achieved the best performance in eight out of nine datasets when the interaction between genotype and environment was considered. Interestingly, DL outperformed GBLUP in six out of the nine datasets when the interaction was ignored. Taking a broader view, Montesinos-López et al. (2021) surveyed 23 papers and found no relevant differences in prediction performance between DL methods and the conventional linear models [30]. Specifically, DL performed better in 11 out of the 23 studies when the genotype × environment interaction was taken into account, while 13 of these studies observed a better performance of DL when the genotype × environment interaction was ignored.
One of the reasons for the modest performance of DL could be that the number of training samples for most GP tasks was not sufficient for DL to learn non-linear interactions when the number of SNPs (or background SNPs) is too large. This is particularly so when a flawed experimental design fails to screen out noisy SNPs and when the traits have major-effect loci. With datasets containing different numbers of SNP markers from six plant species, Azodi et al. (2019) found that non-linear methods performed better in predicting traits in the datasets containing fewer markers. Reasonably reducing background information could improve DL performance. Pook et al. (2020) added a convolutional layer to intensify information, which was then fed into the ANN layers (referred to as LCNNs) [31]. In this way, the model performance was improved significantly compared to ANNs, regardless of data size. It is interesting to note that adding a convolutional layer to intensify information does not involve human screening of the markers, which is an advantage of DL models over statistical methods in refining marker information.
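The difference between an ordinary convolution and a local convolution of the kind used in LCNNs can be sketched in a few lines of Python. This is a simplified illustration, not the layer from the cited study: it assumes that "local" means each genomic window gets its own filter (region-specific weights) instead of one filter shared across all windows, and the function names, window size, stride, and toy values are hypothetical.

```python
def conv1d_shared(x, kernel, stride):
    """Standard 1D convolution: one kernel is reused at every window."""
    k = len(kernel)
    return [sum(w * v for w, v in zip(kernel, x[s:s + k]))
            for s in range(0, len(x) - k + 1, stride)]

def conv1d_local(x, kernels, stride):
    """Locally connected 1D layer: window i is filtered by its own kernels[i]."""
    k = len(kernels[0])
    starts = list(range(0, len(x) - k + 1, stride))
    if len(kernels) != len(starts):
        raise ValueError("need exactly one kernel per window")
    return [sum(w * v for w, v in zip(kernels[i], x[s:s + k]))
            for i, s in enumerate(starts)]

# Hypothetical SNP vector (0/1/2 allele counts) split into three windows of two SNPs.
snps = [0, 1, 2, 2, 1, 0]
shared = conv1d_shared(snps, kernel=[1, 1], stride=2)
local = conv1d_local(snps, kernels=[[1, 0], [0, 1], [0.5, 0.5]], stride=2)
```

With a window of two SNPs and a stride of two, both layers compress six markers into three values before any fully connected layer sees them. The shared version applies the same weights everywhere, while the local version can weight each region differently, which suits markers whose position in the window carries no common meaning across the genome.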
DL's ability to handle GP tasks can also be improved by adopting advanced DL architectures. In most cases, CNN-based models are better at capturing spatial information and can therefore outperform the relatively simple ANN models when the two are compared. For example, using the International Maize and Wheat Improvement Center (CIMMYT) datasets, Pérez-Enciso and Zingaretti (2019) benchmarked several ANNs and CNNs [32]. It was found that CNNs always outperformed ANNs in GP. Similarly, in the study conducted by Ma et al. (2018), which investigated 2000 wheat lines and 33,000 markers, CNNs showed a much better prediction performance than ANN models [33]. Further, it seems that LSTM is more appropriate for handling sequential data and for exploring SNP dependencies. Maldonado et al. (2020) exploited the potential of the LSTM architecture in conducting GP on Zea mays L. and Eucalyptus globulus Labill. [34]. A significant increase in prediction performance was observed for LSTM compared to the other ML method, the linear GBLUP model, and different types of Bayesian regression models. In contrast, when the CNN architecture was compared against linear Bayesian models, its GP performance was less attractive on the polyploid outcrossing species strawberry and blueberry [35]. The different performances of CNN and LSTM relative to conventional methods in these two studies may be attributed to genome differences, the presence of interactions, sample size, or model tuning. However, the differences in architectural strengths between CNN and LSTM cannot be ignored, although it is still hard to draw a conclusion because of the limited number of LSTM applications in GP tasks. As we introduced before, CNN is better at feature extraction from 2D data, while RNNs (or LSTMs) are more advantageous for sequential data. SNPs are a series of mutations along a genome sequence, and their sequential nature and dependencies might be more easily captured by LSTM models.
A deeper understanding of both the DL architectures and the biological questions is also important in constructing DL networks. For example, given that adjacent SNPs usually have no underlying direct functional relation, region-specific filters were introduced by adding a local CNN layer to reduce the background noise, and the GP performance was improved significantly [31]. As DL models are not outstanding for all applications, they can be integrated with conventional ML and/or linear models. For example, Jeong et al. (2020) integrated four types of models (CNN, RF, DNN, and RRB) into the GMStool to conduct GP tasks [36]. As these individual models could capture SNP features from different aspects, the GMStool achieved the best prediction performance on the testing dataset. In addition, the microbes associated with plant growth environments are critical to disease development, and GP performance is difficult to improve if the variance and composition of microbe communities are ignored. Unfortunately, to our knowledge, no study has taken phytomicrobiome information into account when conducting GP. With the accumulation of phytomicrobiome data and the improvement of DL methods, the prospects for improving crop trait prediction are very promising.
Reference: