Genomic selection (GS) is an alternative approach in which genome-wide markers are used to determine the genomic estimated breeding value (GEBV) of individuals in a population. In alfalfa (Medicago sativa L.), previous results indicated that low to moderate prediction accuracy values (<70%) were obtained in complex traits, such as yield and abiotic stress resistance. There is a need to increase the prediction value in order to employ GS in breeding programs.
1. Introduction
Alfalfa (
Medicago sativa L.) is an autotetraploid (2
n = 4
x = 32) perennial forage crop with a genome size of 800–1000 Mb
[1]. However, alfalfa breeding is complicated by its high heterozygosity, polysomic inheritance, and out-crossing nature, which hinder the creation of inbred lines. Alfalfa breeding goals target improvement of forage yield, quality, and tolerance to biotic and abiotic stresses. This process requires the selection of perennial plants that can maintain biomass productivity and quality over several years. Therefore, traits must be evaluated over multiple harvests each year for several years. Consequently, genetic gain is slower compared to annual crops. In addition, alfalfa breeding programs have largely focused on recurrent phenotypic selection (PS) in field environments to improve quantitative traits of interest. However, this approach is constrained by breeding population size, genotype × environment interactions, or low heritability of the trait, thus hindering the development of superior varieties.
One promising alternative to recurrent PS is indirect selection based on the use of molecular markers generated, for example, via genotyping by sequencing (GBS)
[2]. Markers closely linked to quantitative trait loci (QTL) can then be used for marker-assisted selection (MAS) in breeding programs. Initially, QTLs are detected through genetic mapping or genome-wide association studies (GWAS), where marker-trait associations that exceed specific thresholds are declared statistically significant (
Figure 1a). However, MAS is primarily effective for traits controlled by relatively few genes with large effects. For complex traits (e.g., stress tolerance or yield) in elite populations, it can be difficult to clearly identify QTL with major effect because the trait is often controlled by multiple loci possessing small effects.
Figure 1. Indirect selection based on molecular markers. (a) Generalized Manhattan plots illustrating a comparison of GWAS effectiveness in simple (left) vs. complex traits (right). Note: Bold dashed line indicates minimum threshold to select significant markers. A significant signal (i.e., QTL) was identified in the simple trait (left panel), while no defined QTL was identified for the complex trait. Therefore, genomic selection (GS) is more appropriate and practical for complex traits. (b) Common parametric and non-parametric models used in GS and their computational requirements. GBLUP, genomic best linear unbiased prediction; RRBLUP, ridge-regression BLUP; RF, random forest; SVM, support vector machine; MLP, multilayer perceptron; CNN, convolutional neural network; RNN, recurrent neural network.
Soil salinity is one of the major environmental factors limiting the productivity of alfalfa. Genomic tools have been applied to identify important genetic factors that influence salt tolerance in alfalfa. Genetic loci associated with salt stress tolerance have been identified from different studies. Yu et al. (2016) identified 23 markers and 14 functional genes associated with germination under salt stress
[3]. Liu et al. (2017) identified 42 markers associated with forage yield, plant height, leaf chlorophyll content, and stomatal conductance under salt stress
[4]. Liu et al. (2019) identified 49 markers associated with forage yield, plant height, leaf relative water content, and stomatal conductance under salt stress
[5]. Most recently, Medina et al. (2020) identified 27 markers associated with biomass yield and plant vigor under salt stress
[6]. Those results highlighted the genetic complexity of salt stress response in alfalfa.
Conventional breeding strategy for improving salt tolerance of alfalfa is time consuming and less effective. Genomic selection (GS) offers the potential to shorten alfalfa breeding and selection cycles. GS is a promising alternative to determine the genetic potential or breeding value of an individual based on whole-genome markers (
Figure 1a). This method follows the infinitesimal model, which assumes that a quantitative trait is determined by an infinite number of unlinked and non-epistatic loci, each one with a very small effect that satisfies normality and linearity
[7]. This technique uses both parametric and non-parametric statistical models to determine associations of phenotypic trait values with genome-wide molecular markers. This information is subsequently used to predict future breeding values (i.e., genomic-estimated breeding values, GEBVs) for each individual in a population based on their genome-wide marker profile/genotype
[8].
2. Statistical Methods in GS
There is a wide repertoire of parametric and non-parametric models to obtain GEBVs that differ in complexity, accuracy, and computational requirements (Figure 1b). Some phenotypic traits are highly complex and more difficult to predict using their genetic information, therefore, accuracy in GS modeling is a cornerstone.
2.1. Machine Learning Models
Machine learning is a field that involves the application of computer algorithms and statistical models to interpret and predict large datasets. Algorithmic modeling is a rapidly developing discipline with strong potential to provide accurate and informative analyses or predictions using large and complex data sets
[9]. These models are widely used to solve problems across different disciplines, such as medicine, genomics, natural language processing, and stock market forecasting. Compared with classical statistical models, ML models have fewer assumptions about normality and distribution of data. One important remark is that ML models are being developed much faster than their interpretability, developing a new field to be explored. The most common problem with ML algorithms is data overfitting, which results in models that poorly predict the behavior of future data. To avoid this problem, it is necessary to use a robust validation method, such as cross-validation, which provides an indication of performance on new data. Support vector machine (SVM) and random forest (RF) are the most common ML models for classification and regression in GS
[10][11].
2.2. Other Models
Klápště et al. (2020) presented a strategy to generate a marker-based relationship matrix that prioritized markers using Partial Least Squares (PLS). This approach downweighs noisy predictors, but does not remove them from the model. The advantage of PLS is that it deals with multicollinearity and can handle several response variables at a time. The authors used PLS-Canonical Analysis (PLS-CA) for constructing marker-based relationship matrices with different numbers of markers. This strategy attempts to improve the accuracy of traits with low heritability by taking advantage of the genetic covariance common across all investigated traits. In order to perform the marker selection by PLS-CA, all individuals in the training population must be phenotyped for all traits that will be included in the analysis
[12].
3. Genomic Selection in Polyploids
GS requires high-quality genome-wide markers to determine
GEBVs. Two types of high-throughput genotyping methods can be employed: SNP arrays and GBS. There are SNPs arrays with different marker densities in potato
[13] and wheat
[14]. Alfalfa also has an array with 9277 SNPs
[15]. However, its use has not been widely adopted, and GBS is currently the best option to obtain genome-wide markers. During the genotyping process by GBS, different types of markers, such as single nucleotide polymorphisms (SNPs), insertions/deletions (indels), or short tandem repeats (STRs) can be obtained. Genome-wide markers can then be arrayed in a genotypic matrix of
m samples and
n markers. The genotypic matrix can be filtered to retain only biallelic SNPs, which are the most abundant and stable markers for identifying QTLs associated with traits of interest
[16].
Allele dosage counts alternative allele frequency for each biallelic SNP. In diploid species the genotypic matrix is coded as
{0, 1, 2}, reflecting if a given marker is present in the homozygous reference (AA), heterozygous (AB), or homozygous alternate (BB) allelic state. For biallelic SNPs in polyploid species with ploidy
N, the biallelic dosage is
N+1 and the genotypic matrix is coded as
{0,…,N}. Genotype calling in autotetraploids requires bioinformatics tools to distinguish among five possible genotypes (AAAA, AAAB, AABB, ABBB, BBBB) with biallelic SNPs coded as {0, 1, 2, 3, 4}. There are several R packages, such as polyRAD
[17], superMASSA
[18], FitTetra 2.0
[19] or Updog
[20], with which to obtain allele dosage in numeric format from variant call format file (vcf) format. Some of these R packages, such as Updog, require users to specify genotype priors
[20] to accurately calculate the allele dosage and distinguish between all possible genotypes. However, the most common option is to use high depth sequence reads (e.g., ~60×) which leads to 98.4% accuracy in genotypic calls
[21] The effects of marker allele dosage on phenotype for genomic selection have been reported previously. Slater et al. (2016) described three different models for GS in autopolyploids: additive autotetraploid, pseudodiploid, and full autotetraploid. In the additive autotetraploid model, the allele dosage has an additive effect, and 0, 1, 2, 3, 4 corresponds to AAAA, AAAB, AABB, ABBB, BBBB, respectively. In the pseudodiploid model, all heterozygous genotypes (AAAB, AABB, ABBB) have the same effect of 1 on the genotypic variation, while the two homozygotes AAAA and BBBB have an effect of 0 and 2, respectively. Finally, the full autotetraploid model assumes that each genotype has its own effect with five possible effects per marker, assuming that the markers are fitted as random effects
[22]. In addition, Rosyara et al. (2016) developed GWASpoly, a software for genome-wide association studies in autopolyploids. GWASpoly has different assumptions over allele dosages and conducts the hypothesis tests for each marker using six models (general, diploidized general, diploidized additive, additive, simplex dominant, and duplex dominant models) (
Table 1).
Table 1. Coding effect assumptions of GWASpoly models according to allele dosage in biallelic SNPs.