AI Applications in Plant Genomic Prediction

Artificial intelligence (AI), boosted through deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large, complex datasets such as genome-wide crop genetic markers and their associations with crop phenotypes. AI techniques can be applied to analyze large amounts of genomic data and identify patterns that are difficult for humans to detect. These patterns can then be used to develop more accurate predictive models.

  • microbe–plant association
  • artificial intelligence
  • machine learning

1. AI Applications in Plant Disease Resistance and Genomic Prediction against Pathogens

Disease resistance can be qualitative or quantitative, and the distinction can often be attributed to differences in the plant genome. Qualitative disease resistance is generally conditioned by a single resistance (R) gene that recognizes avirulence factors in a classic gene-for-gene mechanism, and the inheritance is said to be qualitative or Mendelian. In contrast, quantitative resistance is usually conditioned by many genes of small effect, and the inheritance is said to be quantitative or polygenic [1]. For qualitative traits that are controlled by single genes, DNA markers are often used to screen for and select the desired gene in breeding programs through marker-assisted selection (MAS). Identifying accurate markers that are strongly associated with the trait is the key to the successful application of MAS. However, this is more challenging for complex traits controlled by many genes. An alternative method for quantitative traits is genomic selection or genomic prediction (GP). GP takes advantage of all molecular markers, regardless of whether they pass a significance threshold, to determine the breeding and/or genetic potential of a candidate individual for selection. Compared to MAS, GP has several advantages. GP allows breeders to select individuals with desirable traits at an earlier stage of the breeding process, which can save time and resources. At the same time, GP can reduce the cost of selection by reducing the need for phenotyping, which can be expensive and time-consuming. More importantly, GP can be used to select for complex traits that are difficult to measure directly, such as disease resistance and yield.
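
The contrast between MAS and GP described above can be illustrated with a minimal sketch. The genotypes, marker effects, p-values, and the threshold `alpha` are hypothetical toy values chosen only for illustration; real pipelines estimate marker effects from a training population.

```python
# Hypothetical toy data: rows are candidate lines, columns are SNP markers
# coded as 0/1/2 copies of the alternative allele.
genotypes = [
    [2, 0, 1, 1, 0],
    [0, 2, 1, 0, 2],
    [1, 1, 2, 2, 1],
]
# Hypothetical estimated marker effects and their association p-values.
effects = [0.80, 0.05, -0.10, 0.15, 0.07]
p_values = [1e-6, 0.40, 0.20, 0.08, 0.35]


def mas_score(row, effects, p_values, alpha=0.001):
    """Score a candidate using only markers that pass the significance threshold."""
    return sum(g * b for g, b, p in zip(row, effects, p_values) if p < alpha)


def gp_score(row, effects):
    """Score a candidate using every marker, regardless of significance."""
    return sum(g * b for g, b in zip(row, effects))


mas = [mas_score(row, effects, p_values) for row in genotypes]
gp = [gp_score(row, effects) for row in genotypes]

for line_id, (m, g) in enumerate(zip(mas, gp), start=1):
    print(f"line {line_id}: MAS score = {m:.2f}, GP score = {g:.2f}")
```

In this toy case the second line receives a MAS score of zero because it lacks the single significant marker, while its GP score still reflects the small contributions of the remaining markers.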

2. Methods for Conducting Genomic Prediction

The key element in GP is to build a robust and accurate statistical model based on available individuals with both phenotypic and genotypic data. Statistical models have been developed to improve the robustness and prediction accuracy. One of the frequently used models is the genomic best linear unbiased prediction (GBLUP). It is built on the assumption that all SNPs contribute to the heritability of breeding traits and that their effects arise from the same normal distribution [2][3]. However, this assumption could reduce the prediction ability of the linear mixed model (LMM) when the trait under study is controlled by several dominant genes. To fit such effects, Bayesian-based methods have been developed, including BayesA, BayesB, BayesLASSO, etc. [4][5][6][7][8]. The Bayesian-based methods assume that SNPs belong to different groups that have their own independent variances and specific distributions, such as the inverse chi-squared distribution. Although these traditional LMM or Bayesian-based approaches have been used in plant breeding, they are developed based on the assumption that genotype random effects follow a prior distribution and that each genotype contributes to the associated phenotype independently. Such assumptions require a large number of samples to dilute the effects of population structure. At the same time, the individual genotype effect may not follow a specific distribution perfectly. Additionally, these approaches are all based on a linear mapping from genotype to phenotype, and they are less powerful in capturing non-linear effects such as dominance and epistasis, which are common and important in complex traits [9][10].
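
The linear, single-variance assumption can be made concrete with a small sketch. It uses ridge regression on marker effects (often called RR-BLUP), which shrinks all SNP effects with one shared penalty and is a closely related marker-level formulation of the GBLUP idea. The simulated data, the penalty `lam`, and all function names are assumptions for illustration; in practice the penalty would be derived from estimated variance components.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_marker_effects(X, y, lam):
    """Estimate marker effects b from (X'X + lam*I) b = X'y on centered data."""
    n, p = len(X), len(X[0])
    col_means = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[X[i][j] - col_means[j] for j in range(p)] for i in range(n)]
    yc = [v - y_mean for v in y]
    A = [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return solve(A, rhs), col_means, y_mean


def predict(x_row, effects, col_means, y_mean):
    """Genomic estimated value: mean plus the sum of centered genotype * effect."""
    return y_mean + sum((g - m) * b for g, m, b in zip(x_row, col_means, effects))


# Simulated toy population (hypothetical sizes and effect distribution).
rng = random.Random(0)
n_lines, n_snps = 60, 10
X = [[rng.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n_lines)]
true_effects = [rng.gauss(0, 0.5) for _ in range(n_snps)]
y = [sum(g * b for g, b in zip(row, true_effects)) + rng.gauss(0, 0.3) for row in X]

effects, col_means, y_mean = ridge_marker_effects(X, y, lam=1.0)
gebv = [predict(row, effects, col_means, y_mean) for row in X]
```

Because every marker shares the same penalty, a few large-effect loci are shrunk as strongly as the many small ones, which is the limitation that motivates the Bayesian alternatives with marker-specific variances.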

To overcome the limitations of assumptions about the genetic architecture and the linear effects, machine learning (ML) approaches were developed. These methods do not require prior assumptions and are capable of extracting non-linear features. Many of these methods have been applied to GP problems, including but not limited to support vector machines (SVM) with non-linear kernels (i.e., radial basis function, SVRrbf, and polynomial, SVRpoly) [11][12], reproducing kernel Hilbert spaces (RKHS) regression [13][14], gradient tree boosting (GTB) [15], and random forest (RF) [15][16].
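
To show why a non-linear kernel helps, the following sketch fits kernel ridge regression with a Gaussian (RBF) kernel, one simple form of RKHS regression, to a purely epistatic toy trait. The two-locus genotypes, the XOR-like phenotype, the bandwidth `h`, and the penalty `lam` are hypothetical and chosen only for illustration.

```python
import math


def rbf_kernel(a, b, h=1.0):
    """Gaussian kernel exp(-h * squared Euclidean distance) between two genotypes."""
    d2 = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-h * d2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_ridge(X, y, lam=0.01, h=1.0):
    """Solve (K + lam*I) alpha = y - mean(y) for the kernel weights alpha."""
    n = len(X)
    y_mean = sum(y) / n
    K = [[rbf_kernel(X[i], X[j], h) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(K, [v - y_mean for v in y])
    return alpha, y_mean


def predict(x_new, X, alpha, y_mean, h=1.0):
    """Predict as the mean plus a kernel-weighted sum over training genotypes."""
    return y_mean + sum(a * rbf_kernel(x_new, xi, h) for a, xi in zip(alpha, X))


# Two loci coded 0/1; the trait is expressed only when the loci differ,
# so neither locus has an additive main effect.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0.0, 1.0, 1.0, 0.0]

alpha, y_mean = fit_kernel_ridge(X, y)
fitted = [predict(x, X, alpha, y_mean) for x in X]
print([round(v, 3) for v in fitted])  # fitted values lie close to 0, 1, 1, 0
```

An additive least-squares model fitted to the same four points would estimate both marker effects as zero and predict 0.5 for every genotype, whereas the kernel model recovers the interaction pattern.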

3. Applying Artificial Intelligence in Genomic Prediction

Deep learning (DL) is regarded as an efficient method in several studies of GP [15][17][18][19][20][21][22][23][24] because of its capability in handling a diversity of high-dimensional tasks [19][25]. After major innovations in recent years, advanced DL architectures have been developed to conduct complex trait predictions in several crops [26][27]. Jubair et al. (2021) developed a transformer-based DL model, GPTransformer, to conduct GP for barley resistance to Fusarium head blight (FHB), which is caused by Fusarium graminearum Schwabe [28]. Two pathogen inoculation methods were used to fully explore the possible pathogen–plant interactions. The first method inoculated the barley plants with the microbe communities on maize kernels that were infected with two strains of F. graminearum. For GP, the pre-screened essential genomic markers were fed into GPTransformer to predict FHB and deoxynivalenol (DON). The results indicated that GPTransformer performed similarly to the GBLUP model, with only 1% improvement over BLUP for DON and the performance for FHB. This study suggests the potential of DL methods in understanding pathogen–plant interactions or predicting plant disease phenotypes when compared to the popular BLUP model.

However, ML models did not always outperform other methods for all traits and species. Linear models tend to perform consistently across predictions, while the performance of ML models varied substantially from trait to trait. Montesinos-López et al. (2018) compared three DL architectures (ANN, CNN, and RNN) against the commonly used linear GBLUP model on nine datasets [29]. Generally, GBLUP achieved the best performance in eight out of nine datasets when considering the interaction between the genotype and the environment. Interestingly, DL outperformed GBLUP in six out of the nine datasets when ignoring the interactions. From a larger view, Montesinos-López et al. (2021) surveyed 23 papers and found no relevant differences in prediction performance between DL methods and the conventional linear models [30]. Specifically, DL performed better in 11 out of the 23 studies when taking into account the genotype × environment interaction, while 13 of these studies observed a better performance of DL when ignoring the genotype × environment interaction.

One of the reasons for the modest performance of DL could be that the number of training samples for most GP tasks was not sufficient for DL to learn non-linear interactions when the number of SNPs (or background SNPs) is too large. This is particularly so when a flawed experimental design fails to screen out noisy SNPs and the traits have major-effect loci. With datasets containing different numbers of SNP markers from six plant species, Azodi et al. (2019) found that non-linear methods showed better performances in predicting traits in the datasets containing fewer markers. Reasonably reducing background information could improve DL performance. Pook et al. (2020) added a convolutional layer to intensify information, which was then fed into the ANN layers (referred to as LCNNs) [31]. In this way, the model performance was improved significantly compared to ANNs, regardless of data size. It is interesting to note that adding a convolutional layer to intensify information does not involve human screening of the markers, which is an advantage of the DL models in refining marker information over statistical methods.
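
The idea of condensing marker information before the dense layers can be sketched as a locally connected 1D layer, assuming that the local layer gives each genomic window its own weights instead of sharing one filter across the genome. This is not the published LCNN implementation; the window size, weights, and biases below are hypothetical.

```python
def local_conv1d(snps, filters, window):
    """Locally connected 1D layer: each non-overlapping window has its own weights.

    snps    : list of marker codes (0/1/2) for one individual
    filters : one (weights, bias) pair per window, so nothing is shared
    window  : number of adjacent markers summarized by each output unit
    """
    n_windows = len(snps) // window
    if len(filters) != n_windows:
        raise ValueError("need exactly one filter per window")
    out = []
    for w in range(n_windows):
        region = snps[w * window:(w + 1) * window]
        weights, bias = filters[w]
        z = sum(g * v for g, v in zip(region, weights)) + bias
        out.append(max(0.0, z))  # ReLU activation
    return out


# Nine markers are condensed into three region-level features.
snps = [0, 1, 2, 2, 0, 1, 1, 1, 0]
filters = [
    ([0.5, 0.5, 0.5], 0.0),   # hypothetical weights for region 1
    ([1.0, -1.0, 0.0], 0.0),  # hypothetical weights for region 2
    ([0.2, 0.2, 0.2], -1.0),  # hypothetical weights for region 3
]
features = local_conv1d(snps, filters, window=3)
print(features)  # [1.5, 2.0, 0.0]
```

The resulting region-level features, rather than the raw markers, would then be passed to the fully connected layers, which reduces the number of inputs the dense part of the network has to learn from.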

DL’s ability in handling GP tasks can also be improved by adopting advanced DL architectures. In most cases, CNN-based models are more advanced in capturing spatial information and can therefore outperform the relatively simple ANN models when they are compared together. For example, using the International Maize and Wheat Improvement Center (CIMMYT) datasets, Pérez-Enciso and Zingaretti (2019) benchmarked several ANNs and CNNs [32]. It was found that CNNs always outperformed ANNs in GP. Similarly, in the study conducted by Ma et al. (2018), which investigated 2000 wheat lines and 33,000 markers, CNNs showed a much better prediction performance than ANN models [33]. Further, it seems that LSTM is more appropriate for handling sequential data and for exploring SNP dependencies. Maldonado et al. (2020) exploited the potential of the LSTM architecture in conducting GP on Zea mays L. and Eucalyptus globulus Labill. [34]. A significant increase in prediction performance was observed for LSTM compared to the other ML methods, the linear GBLUP model, and different types of Bayesian regression models. In contrast, when the CNN architecture was compared against linear Bayesian models, its GP performance was less attractive on the polyploid outcrossing species strawberry and blueberry [35]. The different performances of CNN and LSTM compared to conventional methods in the two above-mentioned studies may be attributed to genome differences, the presence of interactions, sample size, or model tuning. However, the differences in architecture strength between CNN and LSTM cannot be ignored, although it is still hard to draw a conclusion because of the limited LSTM applications in GP tasks. As introduced before, CNN is better at feature extraction from 2D data, while RNNs (or LSTMs) are more advantageous for sequential data. SNPs are by nature a series of mutations along a genome sequence, and their sequential property and SNP dependencies might be more easily captured by LSTM models.
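
The architectural difference discussed above can be sketched in a few lines: a 1D convolution applies one shared filter to every local window, whereas an LSTM carries a state along the marker sequence. The scalar LSTM cell below ties all four gates to the same hypothetical weights to stay short; real models learn separate weight matrices per gate, and none of the values correspond to the cited studies.

```python
import math


def conv1d(snps, kernel):
    """Slide one shared filter over the marker vector (stride 1, no padding)."""
    k = len(kernel)
    return [sum(snps[i + j] * kernel[j] for j in range(k))
            for i in range(len(snps) - k + 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_scan(snps, w_x=1.0, w_h=0.5, bias=0.0):
    """Run a scalar LSTM cell along the markers and return the final hidden state.

    Simplifying assumption: the input, forget, and output gates and the
    candidate value all use the same weights (w_x, w_h, bias).
    """
    h, c = 0.0, 0.0
    for x in snps:
        pre = w_x * x + w_h * h + bias
        i_gate = sigmoid(pre)      # how much new information to write
        f_gate = sigmoid(pre)      # how much of the old cell state to keep
        o_gate = sigmoid(pre)      # how much of the cell state to expose
        candidate = math.tanh(pre)
        c = f_gate * c + i_gate * candidate
        h = o_gate * math.tanh(c)
    return h


# The shared filter responds to the same local pattern wherever it occurs.
local_features = conv1d([0, 1, 2, 2, 0, 1], [1, -1])
print(local_features)  # [-1, -1, 0, 2, -1]

# The recurrent state depends on the order in which markers are read.
h_forward = lstm_scan([2, 0])
h_reverse = lstm_scan([0, 2])
print(h_forward != h_reverse)  # True
```

The final hidden state summarizes everything read so far, which is one way dependencies between distant SNPs could in principle be carried through the sequence.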

A deeper understanding of both the DL architectures and the biological questions is also important in constructing DL networks. For example, given that adjacent SNPs usually have no underlying direct functional relation, region-specific filters were introduced by adding a local CNN layer to reduce the background noise, and the GP performance was improved significantly [31]. As DL models are not outstanding for all applications, they can be integrated with conventional ML and/or linear models. For example, Jeong et al. (2020) integrated four types of models (CNN, RF, DNN, and RRB) into the GMStool to conduct GP tasks [36]. As these individual models could capture SNP features from different aspects, the GMStool achieved the best prediction performance on the testing dataset. In addition, the microbes associated with plant growth environments are critical to disease development, and GP performance is difficult to improve if the variance and composition of microbe communities are ignored. Unfortunately, no study has taken phytomicrobiome information into consideration when conducting GP. As phytomicrobiome data accumulate and DL methods improve, combining the two is a very promising way to improve crop trait prediction.
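
One generic way to combine models that capture different aspects of the markers is to average their predictions. The sketch below shows plain and weighted averaging only; it is not a description of how GMStool combines its models, and the prediction values and weights are hypothetical.

```python
def average_predictions(predictions, weights=None):
    """Combine per-model predictions for the same individuals by (weighted) mean.

    predictions : dict mapping a model name to its list of predicted values
    weights     : optional dict of non-negative model weights (default: equal)
    """
    names = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n = len(predictions[names[0]])
    return [sum(weights[name] * predictions[name][i] for name in names) / total
            for i in range(n)]


# Hypothetical predicted phenotypes for three test lines from four models.
predictions = {
    "CNN": [1.0, 2.0, 3.0],
    "RF": [1.2, 1.8, 3.3],
    "DNN": [0.8, 2.2, 2.7],
    "RRB": [1.0, 2.0, 3.0],
}
ensemble = average_predictions(predictions)
print([round(v, 2) for v in ensemble])  # [1.0, 2.0, 3.0]
```

Weights could, for instance, be set from each model's accuracy on a validation set, so that models that predict a given trait poorly contribute less.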

References

  1. Parlevliet, E. Durability of resistance against fungal, bacterial and viral pathogens: Present situation. Euphytica 2002, 124, 147–156.
  2. Bernardo, R.; Yu, J. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 2007, 47, 1082–1090.
  3. Heffner, E.L.; Lorenz, A.J.; Jannink, J.L.; Sorrells, M.E. Plant breeding with genomic selection: Gain per unit time and cost. Crop Sci. 2010, 50, 1681–1690.
  4. Bernardo, R. A model for marker-assisted selection among single crosses with multiple genetic markers. Appl. Genet. 1998, 97, 473–478.
  5. Hayes, B.J.; Bowman, P.J.; Chamberlain, A.J.; Goddard, M.E. Genomic selection in dairy cattle: Progress and challenges. Dairy Sci. 2009, 92, 433–443.
  6. Endelman, J.B. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 2011, 4, 250–255.
  7. Pérez, P.; de Los Campos, G.; Crossa, J.; Gianola, D. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 2010, 3, 106.
  8. Meuwissen, T.H.; Hayes, B.J.; Goddard, M.E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 2001, 157, 1819–1829.
  9. Monir, M.M.; Zhu, J. Dominance and epistasis interactions revealed as important variants for leaf traits of maize NAM population. Plant Sci. 2018, 9, 627.
  10. Holland, J.B. Genetic architecture of complex traits in plants. Opin. Plant Biol. 2007, 10, 156–161.
  11. Kasnavi, S.A.; Afshar, M.A.; Shariati, M.M.; Kashan, N.E.J.; Honarvar, M. Performance evaluation of support vector machine (SVM)-based predictors in genomic selection. Indian J. Anim. Sci. 2017, 87, 1226–1231.
  12. Long, N.; Gianola, D.; Rosa, G.J.; Weigel, K.A. Application of support vector regression to genome-assisted prediction of quantitative traits. Appl. Genet. 2011, 123, 1065–1074.
  13. De Los Campos, G.; Gianola, D.; Rosa, G.J.; Weigel, K.A.; Crossa, J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Res. 2010, 92, 295–308.
  14. Gianola, D.; Fernando, R.L.; Stella, A. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 2006, 173, 1761–1776.
  15. González-Recio, O.; Forni, S. Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Sel. 2011, 43, 7.
  16. Spindel, J.; Begum, H.; Akdemir, D.; Virk, P.; Collard, B.; Redona, E.; Atlin, G.; Jannink, J.L.; McCouch, S.R. Genomic selection and association mapping in rice (Oryza sativa): Effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015, 11, e1004982.
  17. Drummond, S.T.; Sudduth, K.A.; Joshi, A.; Birrell, S.J.; Kitchen, N.R. Statistical and neural methods for site-specific yield prediction. ASAE 2003, 46, 5.
  18. Gianola, D.; Okut, H.; Weigel, K.A.; Rosa, G.J. Predicting complex quantitative traits with Bayesian neural networks: A case study with Jersey cows and wheat. BMC Genet. 2011, 12, 87.
  19. González-Recio, O.; Rosa, G.J.; Gianola, D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Sci. 2014, 166, 217–231.
  20. Leung, M.K.; Delong, A.; Alipanahi, B.; Frey, B.J. Machine learning in genomic medicine: A review of computational problems and data sets. IEEE 2015, 104, 176–197.
  21. Glória, L.S.; Cruz, C.D.; Vieira, R.A.M.; de Resende, M.D.V.; Lopes, P.S.; de Siqueira, O.H.D.; e Silva, F.F. Accessing marker effects and heritability estimates from genome prediction by Bayesian regularized neural networks. Sci. 2016, 191, 91–96.
  22. Romagnoni, A.; Jégou, S.; Van Steen, K.; Wainrib, G.; Hugot, J.P. Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. Rep. 2019, 9, 10351.
  23. Yin, B.; Balvert, M.; van der Spek, R.A.; Dutilh, B.E.; Bohté, S.; Veldink, J.; Schönhuth, A. Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype. Bioinformatics 2019, 35, i538–i547.
  24. Grinberg, N.F.; Orhobor, O.I.; King, R.D. An evaluation of machine-learning for predicting phenotype: Studies in yeast, rice, and wheat. Learn. 2020, 109, 251–277.
  25. Ranganathan, S.; Nakai, K.; Schonbach, C. Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics; Elsevier: Amsterdam, The Netherlands, 2018.
  26. Ma, W.; Qiu, Z.; Song, J.; Cheng, Q.; Ma, C. DeepGS: Predicting phenotypes from genotypes using deep learning. bioRxiv 2017, bioRxiv:
  27. Khaki, S.; Wang, L. Crop yield prediction using deep neural networks. Plant Sci. 2019, 10, 621.
  28. Jubair, S.; Tucker, J.R.; Henderson, N.; Hiebert, C.W.; Badea, A.; Domaratzki, M.; Fernando, W. GPTransformer: A transformer-based deep learning method for predicting Fusarium related traits in barley. Plant Sci. 2021, 12, 2984.
  29. Montesinos-López, O.A.; Montesinos-López, A.; Crossa, J.; Gianola, D.; Hernández-Suárez, C.M.; Martín-Vallejo, J. Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 Genes Genomes Genet. 2018, 8, 3829–3840.
  30. Montesinos-López, O.A.; Montesinos-López, A.; Pérez-Rodríguez, P.; Barrón-López, J.A.; Martini, J.W.; Fajardo-Flores, S.B.; Gaytan-Lugo, L.S.; Santana-Mancilla, P.C.; Crossa, J. A review of deep learning applications for genomic selection. BMC Genom. 2021, 22, 19.
  31. Pook, T.; Freudenthal, J.; Korte, A.; Simianer, H. Using local convolutional neural networks for genomic prediction. Genet. 2020, 11, 561497.
  32. Pérez-Enciso, M.; Zingaretti, L.M. A guide on deep learning for complex trait genomic prediction. Genes 2019, 10, 553.
  33. Ma, W.; Qiu, Z.; Song, J.; Li, J.; Cheng, Q.; Zhai, J.; Ma, C. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 2018, 248, 1307–1318.
  34. Maldonado, C.; Mora, F.; Contreras-Soto, R.; Ahmar, S.; Chen, J.T.; do Amaral Júnior, A.T.; Scapim, C.A. Genome-wide prediction of complex traits in two outcrossing plant species through deep learning and Bayesian regularized neural network. Plant Sci. 2020, 11, 1734.
  35. Zingaretti, L.M.; Gezan, S.A.; Ferrão, L.F.V.; Osorio, L.F.; Monfort, A.; Muñoz, P.R.; Whitaker, V.M.; Pérez-Enciso, M. Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species. Plant Sci. 2020, 11, 25.
  36. Jeong, S.; Kim, J.Y.; Kim, N. GMStool: GWAS-based marker selection tool for genomic prediction from genomic data. Rep. 2020, 10, 19653.