You're using an outdated browser. Please upgrade to a modern browser for the best experience.
Genomic Selection of Complex Traits in Alfalfa
Edit

Genomic selection (GS) is an alternative approach in which genome-wide markers are used to determine the genomic estimated breeding value (GEBV) of individuals in a population. In alfalfa (Medicago sativa L.), previous results indicated that low to moderate prediction accuracy values (<70%) were obtained in complex traits, such as yield and abiotic stress resistance. There is a need to increase the prediction value in order to employ GS in breeding programs.

genomic selection Medicago sativa complex trait polyploid

1. Introduction

Alfalfa (Medicago sativa L.) is an autotetraploid (2n = 4x = 32) perennial forage crop with a genome size of 800–1000 Mb [1]. However, alfalfa breeding is complicated by its high heterozygosity, polysomic inheritance, and out-crossing nature, which hinder the creation of inbred lines. Alfalfa breeding goals target improvement of forage yield, quality, and tolerance to biotic and abiotic stresses. This process requires the selection of perennial plants that can maintain biomass productivity and quality over several years. Therefore, traits must be evaluated over multiple harvests each year for several years. Consequently, genetic gain is slower compared to annual crops. In addition, alfalfa breeding programs have largely focused on recurrent phenotypic selection (PS) in field environments to improve quantitative traits of interest. However, this approach is constrained by breeding population size, genotype × environment interactions, or low heritability of the trait, thus hindering the development of superior varieties.
One promising alternative to recurrent PS is indirect selection based on the use of molecular markers generated, for example, via genotyping by sequencing (GBS) [2]. Markers closely linked to quantitative trait loci (QTL) can then be used for marker-assisted selection (MAS) in breeding programs. Initially, QTLs are detected through genetic mapping or genome-wide association studies (GWAS), where marker-trait associations that exceed specific thresholds are declared statistically significant (Figure 1a). However, MAS is primarily effective for traits controlled by relatively few genes with large effects. For complex traits (e.g., stress tolerance or yield) in elite populations, it can be difficult to clearly identify QTL with major effect because the trait is often controlled by multiple loci possessing small effects.
Figure 1. Indirect selection based on molecular markers. (a) Generalized Manhattan plots illustrating a comparison of GWAS effectiveness in simple (left) vs. complex traits (right). Note: Bold dashed line indicates minimum threshold to select significant markers. A significant signal (i.e., QTL) was identified in the simple trait (left panel), while no defined QTL was identified for the complex trait. Therefore, genomic selection (GS) is more appropriate and practical for complex traits. (b) Common parametric and non-parametric models used in GS and their computational requirements. GBLUP, genomic best linear unbiased prediction; RRBLUP, ridge-regression BLUP; RF, random forest; SVM, support vector machine; MLP, multilayer perceptron; CNN, convolutional neural network; RNN, recurrent neural network.
Soil salinity is one of the major environmental factors limiting the productivity of alfalfa. Genomic tools have been applied to identify important genetic factors that influence salt tolerance in alfalfa. Genetic loci associated with salt stress tolerance have been identified from different studies. Yu et al. (2016) identified 23 markers and 14 functional genes associated with germination under salt stress [3]. Liu et al. (2017) identified 42 markers associated with forage yield, plant height, leaf chlorophyll content, and stomatal conductance under salt stress [4]. Liu et al. (2019) identified 49 markers associated with forage yield, plant height, leaf relative water content, and stomatal conductance under salt stress [5]. Most recently, Medina et al. (2020) identified 27 markers associated with biomass yield and plant vigor under salt stress [6]. Those results highlighted the genetic complexity of salt stress response in alfalfa.
Conventional breeding strategy for improving salt tolerance of alfalfa is time consuming and less effective. Genomic selection (GS) offers the potential to shorten alfalfa breeding and selection cycles. GS is a promising alternative to determine the genetic potential or breeding value of an individual based on whole-genome markers (Figure 1a). This method follows the infinitesimal model, which assumes that a quantitative trait is determined by an infinite number of unlinked and non-epistatic loci, each one with a very small effect that satisfies normality and linearity [7]. This technique uses both parametric and non-parametric statistical models to determine associations of phenotypic trait values with genome-wide molecular markers. This information is subsequently used to predict future breeding values (i.e., genomic-estimated breeding values, GEBVs) for each individual in a population based on their genome-wide marker profile/genotype [8]

2. Statistical Methods in GS

There is a wide repertoire of parametric and non-parametric models to obtain GEBVs that differ in complexity, accuracy, and computational requirements (Figure 1b). Some phenotypic traits are highly complex and more difficult to predict using their genetic information, therefore, accuracy in GS modeling is a cornerstone.

2.1. Machine Learning Models

Machine learning is a field that involves the application of computer algorithms and statistical models to interpret and predict large datasets. Algorithmic modeling is a rapidly developing discipline with strong potential to provide accurate and informative analyses or predictions using large and complex data sets [9]. These models are widely used to solve problems across different disciplines, such as medicine, genomics, natural language processing, and stock market forecasting. Compared with classical statistical models, ML models have fewer assumptions about normality and distribution of data. One important remark is that ML models are being developed much faster than their interpretability, developing a new field to be explored. The most common problem with ML algorithms is data overfitting, which results in models that poorly predict the behavior of future data. To avoid this problem, it is necessary to use a robust validation method, such as cross-validation, which provides an indication of performance on new data. Support vector machine (SVM) and random forest (RF) are the most common ML models for classification and regression in GS [10][11].

2.2. Other Models

Klápště et al. (2020) presented a strategy to generate a marker-based relationship matrix that prioritized markers using Partial Least Squares (PLS). This approach downweighs noisy predictors, but does not remove them from the model. The advantage of PLS is that it deals with multicollinearity and can handle several response variables at a time. The authors used PLS-Canonical Analysis (PLS-CA) for constructing marker-based relationship matrices with different numbers of markers. This strategy attempts to improve the accuracy of traits with low heritability by taking advantage of the genetic covariance common across all investigated traits. In order to perform the marker selection by PLS-CA, all individuals in the training population must be phenotyped for all traits that will be included in the analysis [12].

3. Genomic Selection in Polyploids

GS requires high-quality genome-wide markers to determine GEBVs. Two types of high-throughput genotyping methods can be employed: SNP arrays and GBS. There are SNPs arrays with different marker densities in potato [13] and wheat [14]. Alfalfa also has an array with 9277 SNPs [15]. However, its use has not been widely adopted, and GBS is currently the best option to obtain genome-wide markers. During the genotyping process by GBS, different types of markers, such as single nucleotide polymorphisms (SNPs), insertions/deletions (indels), or short tandem repeats (STRs) can be obtained. Genome-wide markers can then be arrayed in a genotypic matrix of m samples and n markers. The genotypic matrix can be filtered to retain only biallelic SNPs, which are the most abundant and stable markers for identifying QTLs associated with traits of interest [16].
Allele dosage counts alternative allele frequency for each biallelic SNP. In diploid species the genotypic matrix is coded as {0, 1, 2}, reflecting if a given marker is present in the homozygous reference (AA), heterozygous (AB), or homozygous alternate (BB) allelic state. For biallelic SNPs in polyploid species with ploidy N, the biallelic dosage is N+1 and the genotypic matrix is coded as {0,,N}. Genotype calling in autotetraploids requires bioinformatics tools to distinguish among five possible genotypes (AAAA, AAAB, AABB, ABBB, BBBB) with biallelic SNPs coded as {0, 1, 2, 3, 4}. There are several R packages, such as polyRAD [17], superMASSA [18], FitTetra 2.0 [19] or Updog [20], with which to obtain allele dosage in numeric format from variant call format file (vcf) format. Some of these R packages, such as Updog, require users to specify genotype priors [20] to accurately calculate the allele dosage and distinguish between all possible genotypes. However, the most common option is to use high depth sequence reads (e.g., ~60×) which leads to 98.4% accuracy in genotypic calls [21] The effects of marker allele dosage on phenotype for genomic selection have been reported previously. Slater et al. (2016) described three different models for GS in autopolyploids: additive autotetraploid, pseudodiploid, and full autotetraploid. In the additive autotetraploid model, the allele dosage has an additive effect, and 0, 1, 2, 3, 4 corresponds to AAAA, AAAB, AABB, ABBB, BBBB, respectively. In the pseudodiploid model, all heterozygous genotypes (AAAB, AABB, ABBB) have the same effect of 1 on the genotypic variation, while the two homozygotes AAAA and BBBB have an effect of 0 and 2, respectively. Finally, the full autotetraploid model assumes that each genotype has its own effect with five possible effects per marker, assuming that the markers are fitted as random effects [22]. In addition, Rosyara et al. (2016) developed GWASpoly, a software for genome-wide association studies in autopolyploids. GWASpoly has different assumptions over allele dosages and conducts the hypothesis tests for each marker using six models (general, diploidized general, diploidized additive, additive, simplex dominant, and duplex dominant models) (Table 1).
Table 1. Coding effect assumptions of GWASpoly models according to allele dosage in biallelic SNPs.
Allele Dosage  AAAA AAAB AABB ABBB BBBB
Numerical Code 0 1 2 3 4
GWASpoly Models Phenotypic Effect §
Diplo-additive 0.00 0.50 1.00
Diplo-general  0.00 0.00 < x <1.00 1.00
Additive 0.00 0.25 0.50 0.75 1.00
1-dom-ref (A > B simplex) 1.00 1.00 1.00 1.00 0.00
2-dom-ref (A > B duplex) 1.00 1.00 1.00 0.00 0.00
1-dom-alt (B > A simplex) 0.00 1.00 1.00 1.00 1.00
2-dom-alt (B > A duplex) 0.00 0.00 1.00 1.00 1.00
General  No restrictions
, allele dosage A is coded as the reference allele and B is coded as the alternative allele; §, phenotypic effects are scaled from 0.00 to 1.00; , for the diplo-general model all heterozygotes have the same effect (x), but x is not constrained to be halfway between the homozygous effects; , the general model has no restrictions on the effects of the different dosage levels.

4. Conclusions

Genomic selection is a breeding strategy that predicts the genomic estimated breeding value (GEBV) of individuals in a population using genomic-wide genetic markers. A significant advantage of GS is the ability to select superior individuals in a population at very early stages in a breeding cycle based on their genotype, versus conducting lengthy/expensive phenotyping trials prior to each selection cycle. For instance, in a simulated wheat breeding program, selection based on GEBVs for grain yield tripled the genetic gain compared with PS [23]. In alfalfa breeding programs, GS can be implemented in elite large germplasm panels with genotyped individuals to decrease PS efforts, thus reducing overall selection cycle time and accelerating variety development. However, determining correct GEBVs relies on prediction accuracy, which can vary according to a combination of phenotypic trait, genotypic information, and the statistical model. Consequently, GS research efforts have focused heavily on evaluating prediction accuracies of multiple parametric and non-parametric models to develop robust strategies that can be used in testing populations. Once robust modeling strategies are developed, GS has the potential to accumulate thousands of favorable alleles to develop climate-resilient crops with high yield potential. Additionally, as genotyping is required only once for a given population, multiple traits can be associated with the same genotypic matrix to determine GEBVs for each trait, thus making GS a valuable approach in multi-trait selection [24].
There is a need to explore new methodologies to improve molecular and bioinformatic tools for the application of GS in polyploid crops. The development of new approaches to obtain high-quality genome-wide markers will help to resolve the genetic architecture of complex traits. For example, the PRINCESS platform uses long-read sequencing to detect SNPs, indels, or methylation sites with high accuracy [25]. In GWAS and GS, the number of individuals in a population is crucial to maximize statistical power. Therefore, researchers search for genotyping methods that optimize the balance between cost, sample size, and the number of SNPs. In this regard, GBS is a relatively affordable genotyping methodology. However, for polyploid crops, there is a need for high-coverage sequencing (i.e., read-depth) to accurately estimate allele dosage, which increases genotyping cost. Biallelic SNPs are commonly used in polyploid GS because they are the most abundant type and are easier to transform into a numerical format for developing a genotypic matrix. However, during the construction of the vcf file, ~20% of high-quality SNP markers are discarded because they are not biallelic. Additionally, indels or simple sequence repeats (SSR) could add important information to GS model to increase prediction accuracy.

References

  1. Blondon, F.; Marie, D.; Brown, S.; Kondorosi, A. Genome size and base composition in Medicago sativa and M. truncatula species. Genome 1994, 37, 264–270.
  2. Elshire, R.J.; Glaubitz, J.C.; Sun, Q.; Poland, J.A.; Kawamoto, K.; Buckler, E.S.; Mitchell, S.E. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS ONE 2011, 6, e19379.
  3. Yu, L.-X.; Liu, X.; Boge, W.; Liu, X.-P. Genome-Wide Association Study Identifies Loci for Salt Tolerance during Germination in Autotetraploid Alfalfa (Medicago sativa L.) Using Genotyping-by-Sequencing. Front. Plant Sci. 2016, 7, 956.
  4. Liu, X.-P.; Yu, L.-X. Genome-Wide Association Mapping of Loci Associated with Plant Growth and Forage Production under Salt Stress in Alfalfa (Medicago sativa L.). Front. Plant Sci. 2017, 8, 853.
  5. Liu, X.; Hawkins, C.; Peel, M.D.; Yu, L. Genetic Loci Associated with Salt Tolerance in Advanced Breeding Populations of Tetraploid Alfalfa Using Genome-Wide Association Studies. Plant Genome 2019, 12, 180026.
  6. Medina, C.A.; Hawkins, C.; Liu, X.-P.; Peel, M.; Yu, L.-X. Genome-Wide Association and Prediction of Traits Related to Salt Tolerance in Autotetraploid Alfalfa (Medicago sativa L.). Int. J. Mol. Sci. 2020, 21, 3361.
  7. Bulmer, M.G. The Effect of Selection on Genetic Variability. Am. Nat. 1971, 105, 201–211.
  8. Meuwissen, T.H.E.; Hayes, B.J.; Goddard, M.E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 2001, 157, 1819–1829.
  9. Breiman, L. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Stat. Sci. 2001, 16, 199–231.
  10. Drucker, H.; Surges, C.J.C.; Kaufman, L.; Smola, A.; Vapnik, V. Support vector regression machines. In Advances in Neural Information Processing Systems; Mozer, M.C., Jordan, M.I., Petsche, T., Eds.; MIT Press: Cambridge, MA, USA, 1997; pp. 155–161.
  11. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  12. Klápště, J.; Dungey, H.S.; Telfer, E.J.; Suontama, M.; Graham, N.J.; Li, Y.; McKinley, R. Marker Selection in Multivariate Genomic Prediction Improves Accuracy of Low Heritability Traits. Front. Genet. 2020, 11, 499094.
  13. Vos, P.G.; Uitdewilligen, J.G.A.M.L.; Voorrips, R.E.; Visser, R.G.F.; Van Eck, H.J. Development and analysis of a 20K SNP array for potato (Solanum tuberosum): An insight into the breeding history. Theor. Appl. Genet. 2015, 128, 2387–2401.
  14. Winfield, M.O.; Allen, A.M.; Burridge, A.; Barker, G.L.A.; Benbow, H.R.; Wilkinson, P.A.; Coghill, J.; Waterfall, C.; Davassi, A.; Scopes, G.; et al. High-density SNP genotyping array for hexaploid wheat and its secondary and tertiary gene pool. Plant Biotechnol. J. 2016, 14, 1195–1206.
  15. Li, X.; Han, Y.; Wei, Y.; Acharya, A.; Farmer, A.D.; Ho, J.; Monteros, M.; Brummer, E.C. Development of an Alfalfa SNP Array and Its Use to Evaluate Patterns of Population Structure and Linkage Disequilibrium. PLoS ONE 2014, 9, e84329.
  16. Perkel, J. SNP genotyping: Six technologies that keyed a revolution. Nat. Chem. Biol. 2008, 5, 447–453.
  17. Clark, L.V.; Lipka, A.E.; Sacks, E.J. polyRAD: Genotype Calling with Uncertainty from Sequencing Data in Polyploids and Diploids. G3 Genes Genomes Genet. 2019, 9, 663–673.
  18. Pereira, G.S.; Garcia, A.A.F.; Margarido, G.R.A. A fully automated pipeline for quantitative genotype calling from next generation sequencing data in autopolyploids. BMC Bioinform. 2018, 19, 398.
  19. Zych, K.; Gort, G.; Maliepaard, C.A.; Jansen, R.C.; Voorrips, R.E. FitTetra 2.0—Improved genotype calling for tetraploids with multiple population and parental data support. BMC Bioinform. 2019, 20, 148.
  20. Gerard, D.; Ferrão, L.F.V.; Garcia, A.A.F.; Stephens, M. Genotyping Polyploids from Messy Sequencing Data. Genetics 2018, 210, 789–807.
  21. Uitdewilligen, J.G.A.M.L.; Wolters, A.-M.; D’Hoop, B.B.; Borm, T.J.A.; Visser, R.G.F.; Van Eck, H.J. A Next-Generation Sequencing Method for Genotyping-by-Sequencing of Highly Heterozygous Autotetraploid Potato. PLoS ONE 2013, 8, e62355.
  22. Slater, A.T.; Cogan, N.O.; Forster, J.W.; Hayes, B.; Daetwyler, H.D. Improving Genetic Gain with Genomic Selection in Autotetraploid Potato. Plant Genome 2016, 9, 1–15.
  23. Tessema, B.B.; Liu, H.; Sørensen, A.C.; Andersen, J.R.; Jensen, J. Strategies Using Genomic Selection to Increase Genetic Gain in Breeding Programs for Wheat. Front. Genet. 2020, 11, 578123.
  24. Moeinizade, S.; Kusmec, A.; Hu, G.; Wang, L.; Schnable, P. Multi-trait Genomic Selection Methods for Crop Improvement. Genetics 2020, 215, 931–945.
  25. Mahmoud, M.; Doddapaneni, H.; Timp, W.; Sedlazeck, F.J. PRINCESS: Comprehensive detection of haplotype resolved SNVs, SVs, and methylation. Genome Biol. 2021, 22, 268.
More
Upload a video for this entry
Information
Contributor MDPI registered users' name will be linked to their SciProfiles pages. To register with us, please refer to https://encyclopedia.pub/register :
View Times: 1.1K
Revisions: 2 times (View History)
Update Date: 09 Dec 2021
Academic Video Service