Machine Learning Basis and Function

Please note this is a comparison between Version 1 by Mohsen Yoosefzadeh Najafabadi and Version 2 by Camila Xu.

Machine learning, as an important subfield of AI, has been widely used in different aspects of our lives, such as communication and agriculture, among many others. In agriculture, ML algorithms can be used for crop-yield prediction, crop-growth monitoring, precision agriculture, and automated irrigation.

data-integration strategies
deep learning

1. Machine Learning: Basis

Machine learning (ML), as an important subfield of AI, has been widely used in different aspects of our lives, such as communication and agriculture, among many others ^[1][2][53,54]. In agriculture, ML algorithms can be used for crop-yield prediction, crop-growth monitoring, precision agriculture, and automated irrigation ^[3][8]. ML algorithms are typically divided into three subgroups: supervised learning, unsupervised learning, and reinforcement learning. These are extensively reviewed in Hesami et al. ^[4][14]; therefore, only a brief explanation of these subgroups is provided in this entry. In supervised learning, the algorithm is trained on a labeled dataset to make predictions based on the data ^[5][55]. The model learns by being given a set of inputs and associated outputs and then adjusting its internal parameters to produce the desired output. Supervised learning is the most common subgroup of ML algorithms and is frequently used in plant breeding to predict complex traits at an early growth stage ^[6][56], detect genomic regions associated with a specific trait ^[7][40], and select superior genotypes via genomic selection ^[8][4].
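As a minimal sketch of the supervised-learning idea described above (inputs paired with known outputs, and internal parameters adjusted to reduce the prediction error), the following fits a straight line by gradient descent. The data values and the names `fit_linear`, `early_trait`, and `final_yield` are hypothetical and chosen only for illustration; they do not come from any cited study.

```python
def fit_linear(xs, ys, lr=0.01, epochs=5000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Adjust the internal parameters in the direction that lowers the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical labeled dataset: an early-season measurement and the final yield.
early_trait = [1.0, 2.0, 3.0, 4.0, 5.0]
final_yield = [3.1, 4.9, 7.2, 8.8, 11.0]

w, b = fit_linear(early_trait, final_yield)
predicted_yield = w * 6.0 + b  # prediction for a new, unseen input
```

With enough iterations the learned slope and intercept approach the ordinary least-squares solution for these five points; real breeding datasets would involve many more features and a proper validation scheme.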

2. Function

Unsupervised learning is used when the data are not labeled, and the algorithm uses the data to find patterns and similarities in the dataset on its own ^[9][57]. The model learns by identifying patterns in the data, such as clusters or groups. In plant breeding, unsupervised learning is usually implemented to find possible associations among genotypes within a breeding population, design kinship matrices, and categorize unstructured datasets ^[9][57]. Reinforcement learning is the third subgroup of ML algorithms, in which the model is exposed to an environment and receives feedback in the form of rewards or penalties based on its actions ^[10][58]. The model learns by taking actions and adjusting its parameters to maximize the total rewards received. Reinforcement learning is still a relatively new area in plant breeding, and its applications need further exploration.
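The clustering behavior attributed to unsupervised learning above can be illustrated with a small k-means sketch. This is only one of many clustering methods and is not claimed to be the one used in the cited work; the two-trait `genotypes` data are hypothetical, and seeding the centroids with the first `k` points is a simplifying assumption (practical implementations usually use randomized or smarter initialization).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Hypothetical unlabeled data: two measured traits per genotype.
genotypes = [(1.0, 1.1), (5.0, 5.2), (1.2, 0.9), (4.8, 5.1), (0.9, 1.0), (5.1, 4.9)]
labels, centroids = kmeans(genotypes, k=2)
```

No labels are supplied; the grouping emerges from the distances between observations alone.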

Several important factors need to be taken into account for the successful use of ML algorithms in predicting a given complex trait. These factors include, but are not limited to, data collection, pre-processing, feature extraction, model training, model evaluation, hyperparameter tuning, model deployment, and model monitoring ^[11][12][59,60]. They are extensively reviewed in several studies and review papers ^[11][12][13][59,60,61]. In brief, (1) data collection is the process of gathering data from different sources (environments, genotypes, etc.) in different formats, such as images, text, numerical/categorical datasets, or video, for use in model training ^[14][28]; (2) the pre-processing step is defined as the cleaning, transforming, and organizing of data to make it more suitable for ML algorithms ^[11][59]; (3) feature extraction is the process in which features/variables are extracted from the data to be represented in a form that is more suitable for ML algorithms ^[15][18]; (4) model training uses different ML algorithms to fit models to the data ^[7][40]; (5) model evaluation is the process of assessing the accuracy and errors of the algorithm against unseen data ^[16][27]; (6) the hyperparameter tuning step involves a second round of adjusting the parameters of the tested ML algorithms to achieve the best performance ^[4][17][14,45]; (7) model deployment is summarized as the process of deploying a developed model in production, usually in the form of an application ^[13][61]; and (8) model monitoring is the process of tracking model performance over time to ensure it remains accurate ^[7][40].
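Steps (2), (4), and (5) above can be sketched together in a few lines: the feature is standardized using statistics from the training portion only, a simple model is fitted, and the error is measured on held-out observations. The numbers and helper names (`standardize`, `fit_ols`, `rmse`) are hypothetical and serve only as an illustration under these assumptions.

```python
import math


def standardize(values, mean, std):
    """Pre-processing: center and scale values with the given statistics."""
    return [(v - mean) / std for v in values]


def fit_ols(xs, ys):
    """Model training: closed-form least-squares line for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Model evaluation: root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical feature and trait values; the last two rows are held out.
feature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
trait = [21.2, 24.9, 29.1, 33.0, 36.8, 41.1, 45.0, 48.9]
x_train, x_test = feature[:6], feature[6:]
y_train, y_test = trait[:6], trait[6:]

# Scaling statistics come from the training data only, to avoid leakage.
train_mean = sum(x_train) / len(x_train)
train_std = math.sqrt(sum((x - train_mean) ** 2 for x in x_train) / len(x_train))

slope, intercept = fit_ols(standardize(x_train, train_mean, train_std), y_train)
predictions = [slope * x + intercept for x in standardize(x_test, train_mean, train_std)]
test_rmse = rmse(y_test, predictions)
```

Evaluating on rows the model never saw during fitting is what makes the reported error an estimate of performance on unseen data.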

In plant breeding, data collection is an essential step involving the collection of data for target traits from a wide range of environments, trials, and plant populations. Plant breeders often work in different environmental settings in order to gain an accurate understanding of the genotype-by-environment interaction in different trials within each environment. Additionally, they measure different traits in order to establish accurate multi-trait breeding strategies, such as tandem selection, independent culling levels, and selection index. As such, any collected data must be precise, accurate, and pre-processed using various packages and software in order to be suitable for plant breeding programs. Recently, the AllInOne R-shiny package was introduced as an open-source, breeder-friendly, analytical R package for pre-processing phenotypic data ^[18][62]. AllInOne builds on various R packages to provide a pipeline for pre-processing phenotypic datasets in an accurate, easy, and timely manner without requiring any coding skills. A brief introduction to AllInOne is available at https://github.com/MohsenYN/AllInOne/wiki (accessed on 15 February 2023). Feature extraction is another critical step in determining the most relevant variables for further analysis. The recursive feature elimination of 250 spectral properties of a soybean population revealed the significance of the 395 nm band, in addition to four other bands in the blue, green, red, and near-infrared regions, in predicting soybean yield ^[19][63]. This spectral band can be used to complement other important bands to enhance the accuracy of soybean-yield prediction at an early stage. Furthermore, another study investigated the potential of 34 commonly used spectral indices in predicting the soybean yield and biomass of a Canadian soybean panel, in which the Normalized Difference Vegetation Index (NDVI) was identified as the most pivotal index in predicting soybean yield and biomass concurrently ^[6][56].
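For readers unfamiliar with NDVI, the index is computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The sketch below applies this standard formula to hypothetical plot-level reflectance values; returning 0.0 when both bands are zero is an assumption made here to avoid division by zero, not a convention taken from the cited study.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        # Assumed fallback for empty pixels or plots; not part of the NDVI definition.
        return 0.0
    return (nir - red) / total


# Hypothetical (NIR, red) reflectance pairs for three plots.
plots = [(0.50, 0.10), (0.45, 0.15), (0.30, 0.30)]
ndvi_values = [ndvi(nir, red) for nir, red in plots]
```

Dense, healthy canopies reflect strongly in the near-infrared and absorb red light, so their NDVI values approach 1, while bare soil gives values near 0.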

Plant breeding involves a series of tasks and data analyses that are carried out over multiple years, and, therefore, repeatability and reproducibility are two important factors to consider when establishing a plant breeding program. Plant breeders may be reluctant to use sophisticated algorithms, such as ML algorithms, for analyzing their trials because it is unclear whether the results will be reproducible and repeatable. Therefore, it is of the utmost importance to ensure proper model training, evaluation, hyperparameter tuning, deployment, and monitoring when developing an algorithm. To further improve model training in plant breeding, larger datasets from different locations and years, as well as plant populations with different genetic backgrounds, should be collected ^[20][64]. Automated tuning methods can be used to optimize hyperparameters in plant breeding datasets. As an example, grid search is a popular automated tuning method, which is based on an exhaustive search for optimal parameter values ^[21][65]. Grid search works by training and evaluating a model for each combination of parameter values specified in a grid. It then selects the combination with the best results ^[21][65]. Bayesian optimization is another automated tuning method that uses Bayesian probability theory to determine the best set of parameters for a given problem ^[22][66]. Bayesian optimization works by constructing a probabilistic model of an objective function based on previously evaluated values. This model is then used to predict the optimal set of parameters for the given problem ^[22][66]. It then evaluates the performance of the system with the predicted parameters and updates the model with the new information. This process is repeated to optimize the model's performance for the given dataset. Bayesian optimization is useful for optimizing complex problems with many variables or where the cost of evaluating the objective function is high ^[22][66]. As plant breeders work with different omics datasets, all of which fall into the big-data category, the developed algorithms can be hosted on cloud-based services such as the Google Cloud Platform to deploy models at scale ^[23][67]. To ensure optimal performance, model performance should be monitored over time and analyzed with metrics such as accuracy and precision, along with anomaly detection, to identify areas of improvement ^[24][68].
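The grid-search procedure described above (train and evaluate a model for every combination in a grid, then keep the best) can be sketched as follows. The model is a small k-nearest-neighbors regressor chosen purely for illustration, the error is estimated by leave-one-out cross-validation, and the data, the grid values, and all function names are hypothetical.

```python
import itertools


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training pairs (x_i, y_i)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    # Inverse-distance weights; the small constant guards against zero distance.
    weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def loo_error(data, k, weighted):
    """Leave-one-out mean squared error for one hyperparameter combination."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        total += (knn_predict(train, x, k, weighted) - y) ** 2
    return total / len(data)


def grid_search(data, grid):
    """Evaluate every combination in the grid and return the best one."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = loo_error(data, **params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical observations and hyperparameter grid (3 x 2 = 6 combinations).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.2)]
grid = {"k": [1, 3, 5], "weighted": [False, True]}
best_params, best_score = grid_search(data, grid)
```

The cost grows multiplicatively with each added hyperparameter, which is one reason model-based approaches such as Bayesian optimization become attractive when each evaluation is expensive.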

There are other components/methods that are important in reducing possible errors and increasing the ultimate accuracy of ML algorithms, including transfer learning, feature engineering, dimensionality reduction, and ensemble learning. Transfer learning is an ML technique in which a model pre-trained on one task is reused as the starting point for a model on a second task ^[25][69]. Transfer learning reduces the amount of data and computation needed to train a model, and it is particularly helpful for improving the model's performance when the amount of training data for the second task is small ^[25][69]. Feature engineering is the process of using domain knowledge of the data to create features (variables) for the ML pipeline. Feature engineering is an informal topic, but it is considered essential in applied machine learning ^[26][70]. It can help increase the accuracy of machine-learning models by creating features from raw data that help the model learn more effectively and accurately. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables ^[27][71]. It can be divided into feature selection and feature extraction. Feature selection is the process of selecting a subset of relevant features for use in model construction ^[28][9]. Feature extraction is the process of combining or transforming existing features into more informative representations that are more useful for a given task ^[29][72]. Ensemble learning is an ML technique that combines multiple models to improve the accuracy and robustness of predictions ^[28][9]. It combines multiple weak learners to form a strong learner that can make more accurate predictions than any single model. The most common ensemble-learning techniques are the bagging, boosting, and stacking algorithms ^[28][9].
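Of the ensemble techniques named above, bagging (bootstrap aggregating) is the simplest to sketch: several base models are fitted to bootstrap resamples of the training data and their predictions are averaged. The example below uses a one-feature line fit as the base learner, which is an illustrative assumption rather than a typical choice (decision trees are more common); the data, the seed, and the function names are hypothetical.

```python
import random


def fit_line(sample):
    """Least-squares line through (x, y) pairs; flat line if all x are equal."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    if sxx == 0:
        # Degenerate resample with a single distinct x: predict the mean of y.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in sample) / sxx
    return slope, mean_y - slope * mean_x


def bagging_fit(data, n_models=25, seed=0):
    """Fit one base model per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(data, k=len(data))) for _ in range(n_models)]


def bagging_predict(models, x):
    """Aggregate by averaging the predictions of all base models."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


# Hypothetical training data following roughly y = 2x with small noise.
data = [(1.0, 2.2), (2.0, 3.9), (3.0, 6.1), (4.0, 7.8),
        (5.0, 10.2), (6.0, 11.9), (7.0, 14.1), (8.0, 15.8)]
models = bagging_fit(data)
ensemble_prediction = bagging_predict(models, 4.5)
```

Averaging over resamples mainly reduces the variance of the prediction; boosting and stacking combine models differently and are not shown here.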