Bioinformatics in Plant Breeding and Disease Resistance
Subjects: Biology

In plant breeding, bioinformatics supports genetic and genomic selection by identifying the combination of genotypes most likely to produce a desired phenotype, which helps expedite the isolation of new varieties. Bioinformatics is also instrumental in collecting and processing plant phenotype data, which facilitates plant breeding. Automated robots that use digital technologies to collect and analyze different types of information can monitor the environment in which plants grow, analyze the environmental stresses they face, and promptly correct suboptimal or adverse growth conditions; they have advanced plant research and saved human resources.

  • plant resistance
  • plant breeding
  • bioinformatics
  • automated robots
  • precision fertilization

1. Introduction

Broadly speaking, bioinformatics covers the interdisciplinary study of biological objects (including genes, proteins, and physiological indices) using informatics methods, such as algorithms and statistical methods. Specifically, complex biological data can be processed with computational tools, which is common practice in dedicated databases, such as nucleic acid databases, protein databases, and custom functional databases [1]. Bioinformatics tools reduce the cost of complex analyses, thus enhancing research into topics such as sustainable agriculture [2]. Understanding how bioinformatics can be applied to plant biology research is therefore important for researchers in the life sciences. This entry describes these tools and their applications, focusing on plant breeding and research on disease resistance. For example, from the VPg gene sequences of potato virus Y (PVY) isolated from potato, combined with all published sequences in GenBank, a Bayesian phylodynamic framework can be used to infer two things: the rate of evolution of PVY and the time to its most recent common ancestor. Both estimates help advance disease resistance studies in potato [3,4,5]. Multifactorial traits involved in resistance and quality are extremely difficult to improve, especially in combination, and some of the genomes of major forage crops, such as maize, rice, wheat, sorghum, and barley, and of the forage legumes soybean and alfalfa, are too large to be analyzed readily using whole-genome sequencing. Attention has therefore focused on comparative genomic approaches in order to produce seeds with desirable shapes [6,7,8].
The typical datasets generated by plant researchers contain morphological, physiological, molecular, and genetic information that describes the entire plant life cycle. Bioinformatics processes the collected data and extracts key indices and trends to quickly and accurately generate hypotheses and then offer solutions. For example, phenotypes and genotypes can be combined to reveal underlying mechanisms, as in the study of plant rejuvenation [9]; future growth patterns can be predicted from past growth trends, as in a plant growth pattern prediction system developed with deep learning [10]; and multiple genomes can be compared to predict evolutionary relationships, as in the study of Amphicarpaea edgeworthii [11].
In agricultural applications, the wide use of bioinformatics can assist with efficient crop breeding and the improvement of plant resistance against pathogens [12]. In particular, scientists are committed to breeding and modifying crop species to improve yield and quality, as well as to creating new varieties with qualities that benefit human nutrition and health. Bioinformatics accelerates the generation and deployment of these new varieties. Indeed, genes associated with specific traits can be analyzed in silico before being introduced into a plant, and the results can guide which genes to introduce next for a precise phenotypic analysis. Maize (Zea mays L.) kernels rich in lysine [13], lettuce (Lactuca sativa) high in vitamin C [14], and the recently developed vitamin D-rich tomato [15] are examples of the implementation of such pipelines.
Bioinformatics plays a critical role in data integration, analysis, and model prediction, as well as in managing the massive amounts of data resulting from new, high-throughput approaches [16]. Classical biological experiments, such as the visualization of mitosis, meiosis, and pollen tube growth, are being explored in greater depth and at higher throughput thanks to bioinformatics and time-lapse microscopy [17,18]. Plant growth can be predicted from the available wealth of physiological and phenotypic data, enabling the generation of a virtual plant that can accurately predict growth patterns and the consequences of interactions with diseases or pests [19]. Bioinformatics also has wide applications in the analysis of plant resistance to various stresses [20]. The molecular mechanisms underlying plant responses to abiotic stress have been studied in depth, and they can open new avenues in agriculture when combined with the predictive power of bioinformatics [21]. In addition, bioinformatics has been applied in plant pathology, for example to identify and predict the “effector” proteins that plant pathogens produce in order to manipulate their host plants. Functional annotation of pathogen sequences to predict virulence is a critical step in translating sequence data into potential applications in plant pathology [22]. A bioinformatics framework has also been proposed to enable stakeholders to make more informed decisions. In this way, a shared biosecurity infrastructure can be established to support sustainable global food and fiber production in the context of global climate change and the increased risk of accidental disease introductions through the global plant trade [23].

2. Databases Provide Abundant Gene and Pathway Information to Study Plant Biology

Thanks to large-scale sequencing technologies, vast amounts of data are released continuously and are often uploaded to a specific database. Depending on the species they represent, databases can be formally classified as general or species-specific databases.
General databases include those that integrate information about genomes, proteins, and metabolic pathways (Table 1). Genome databases represent a centralized and public collection of all published data, so researchers can easily obtain information concerning their gene or protein of interest. For example, UniProt offers a comprehensive resource for protein sequences and functional annotation. The database can be queried with a specific gene/protein name or with keywords of interest to sort through the catalogued data, but it is also possible to perform a protein BLAST (basic local alignment search tool) and download the sequence of the new protein of interest [25]. In addition, general databases compile various biological pathways, such as those represented in Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), EuKaryotic Orthologous Groups (KOG), and metabolic pathways, which can be used to determine if a candidate protein belongs to one of many known pathways.
Table 1. General databases used for data integration and presentation.
As one example, a bioinformatic analysis of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) from multiple C3 plant species included an in silico characterization of RuBisCO and its interacting proteins, whose structures and functions were predicted with the ProtParam, SOPMA, Predotar 1.03, SignalP 4.1, TargetP 1.1, and TMHMM 2.0 tools, which are all accessible from the ExPASy database. A MEME and MAST analysis of RuBisCO from all C3 plants, combined with a phylogenetic tree constructed with MEGA 6.06 software from a sequence alignment obtained with the ClustalW algorithm, illustrated the high sequence identity shared by RuBisCO from different C3 plant species, supporting the notion that they originated from a common ancestor [26]. A list of these databases and how they are used is provided in Table 1.
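The notion of sequence identity in an alignment can be made concrete with a small calculation. The following is a minimal sketch, assuming the sequences have already been aligned (for instance with ClustalW) and that identity is counted only over columns where neither sequence has a gap, which is one of several conventions in use. The function names and the three short sequences are hypothetical and are not real RuBisCO data.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def pairwise_identities(alignment):
    """Return {(name_1, name_2): percent identity} for every pair of sequences."""
    return {
        (x, y): percent_identity(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned fragments, for illustration only.
toy_alignment = {
    "species_A": "MSPQTETKAS",
    "species_B": "MSPQTETKAG",
    "species_C": "MSPKTE-KAS",
}

for pair, value in pairwise_identities(toy_alignment).items():
    print(pair, f"{value:.1f}")
```

High pairwise values across all species are the kind of observation that the cited study interprets as evidence of a common origin; a phylogenetic tree summarizes the same alignment in a different way.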
The model plant Arabidopsis thaliana is one of a few plant species with its own databases due to its widespread use in plant research (Table 2). These databases are rich in resources and can help researchers quickly obtain the latest Arabidopsis genome information. For example, The Arabidopsis Information Resource (TAIR) database allows users to download gene sequences in bulk, while the SeqViewer in TAIR also provides a simple tool to visualize the genes. In addition, TAIR has a powerful function for displaying various expression maps, each representing expression data during Arabidopsis development or under different growth conditions [27].
Table 2. Databases specific for Arabidopsis.
Most major crops have dedicated databases, including rice (Oryza sativa), wheat (Triticum aestivum), barley (Hordeum vulgare), maize, soybean (Glycine max), cotton (Gossypium hirsutum), and sorghum (Sorghum bicolor) (Table 3). For example, the Rice Mutant Database (RMD) includes mutants for the identification of new genes and regulatory elements and includes a list of lines for the ectopic expression of target genes in specific tissues or at specific growth stages, providing rich data resources for the study of different rice mutants [28]. The Wheat Genomic Variation Database (WGVD) compiles all published single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (InDels), and selection sweeps, together with a BLAST search tool for wheat [29]. Researchers at Huazhong Agricultural University have developed the ZEAMAP database for corn, which includes multiple omics data resources, such as genomes, transcriptomes and genetic variation, phenotypic data, metabolome studies, and genetic maps. The database also provides access to a variety of data on complex traits and boasts rich online capacities for data retrieval, analysis, and visualization [30].
Table 3. Databases used for major crops.

3. Various Algorithms Create Possibilities for Customized Analysis

Bioinformatics tools or websites can be used to predict protein structure, to look for conserved domains in a protein, or to annotate genes (Table 4). Data visualization and presentation are an integral part of bioinformatics analysis [31]. The biggest advantages of TBtools are batch processing and the visualization of data, and the interactive graphics generated with TBtools are rich with editable features that provide maximum flexibility for users [32]. Protter supports protein data analysis and protein prediction by visualizing the characteristics of an annotated sequence and associated experimental proteomic data in a protein topological environment. Protter is of great use for comprehensive visualization of membrane proteins and the selection of targeted proteomic peptides [33].
Table 4. Bioinformatics tools and websites that can be used in plant research.
Many tools are also tailored for specific applications (Table 5), including the prediction of transcription factor binding sites and the exploration of large-scale genomic variation data. For example, the PlantPAN database hosts a comprehensive list of transcription factors and their cognate binding sites. TRANSFAC and PlnTFDB are comprehensive databases of plant transcription factors, and AGRIS contains a database of Arabidopsis transcription factors, which can be used to predict transcription factor binding sites in plant promoter regions [34]. SnpHub can be used to retrieve, analyze, and visualize large-scale genomic variation data by specifying samples and lists of specific genomic regions [35].
Table 5. Bioinformatics tools specific to applications in plant research.
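Predicting transcription factor binding sites in a promoter can be illustrated, in its simplest form, as a search for a known consensus motif on both DNA strands. The sketch below is an assumption-laden illustration and not the method used by PlantPAN or any other tool named above: it scans a made-up promoter for the G-box (CACGTG) and the W-box (TTGACY), two well-known plant cis-elements, using IUPAC codes translated into a regular expression. The function names and the promoter sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(promoter, consensus):
    """Return sorted (start, strand, site) hits; start is 0-based on the forward strand."""
    promoter = promoter.upper()
    k = len(consensus)
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in consensus.upper()) + "))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(promoter)]
    rc = reverse_complement(promoter)
    for m in pattern.finditer(rc):
        # Convert the reverse-strand coordinate back to the forward strand.
        hits.append((len(promoter) - m.start() - k, "-", m.group(1)))
    return sorted(hits)


# Hypothetical promoter fragment, for illustration only.
toy_promoter = "AATTGACCGGCACGTGTTAGTCAATT"

print(find_sites(toy_promoter, "TTGACY"))  # W-box consensus
print(find_sites(toy_promoter, "CACGTG"))  # G-box; palindromic, so it is found on both strands
```

A consensus scan of this kind reports every match regardless of biological context, so dedicated resources typically add position weight matrices, conservation, or expression evidence to rank candidate sites.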
Outside of dedicated web tools, programming languages such as Python, R, and Perl can be used to empower data integration and analysis. Python and R are perhaps more widely used in bioinformatics than Perl. R has powerful statistical functions, which are very useful for processing large experimental datasets, together with graphics facilities for data exploration [36]. Python is better suited for building databases and web applications and for developing utilities [37]. While introductory programming in R relies on functions provided by user-written packages, Python’s programming paradigm is based on design flow. Although R code might not be as human-readable as Python’s, R is overall better suited to biologists with no strong programming background. Based on these programming languages, various scripts have been developed to analyze data efficiently. For example, R provides a k-means function for clustering analysis and can draw Manhattan plots from the results of genome-wide association studies (GWASs) with the qqman package [38].
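The clustering analysis mentioned above can be sketched in a few lines. The example below implements plain k-means (Lloyd's algorithm) in standard-library Python rather than R, so it illustrates the idea and is not the R implementation. The function name, the seed, and the two-dimensional measurements are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical measurements (e.g., two traits per plant), for illustration only.
toy_points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]

labels, centroids = kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centroids, which is why a fixed seed is used here; in practice, several random starts are usually compared.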

4. Application of Bioinformatics in Plant Breeding

Plant breeding aims to produce new plant varieties. This long-term activity begins with basic research and often takes many years, thus necessitating a significant financial investment [39]. Genomics-assisted breeding is an effective and economical strategy and is thus widely applied in crop breeding. Genomics may help to understand the organization and function of biological systems and has the potential to track molecular changes during development under different conditions, such as changes in plant physiology, pathogen pressure, or the environment [40]. Samples for genomics studies can be collected from the same or different individuals of one species or from different species [41]. In addition, comparative genomics allows the study of specific traits in related plants by capitalizing on sequence conservation between species with small genomes (easier to study) and those with large and complex genomes (more difficult to study, but including most current crop species). For example, in Chrysanthemum, GWASs have been used to explore genetic patterns and identify favorable alleles for several ornamental and resistance traits, including plant structural and inflorescence traits, waterlogging tolerance, aphid resistance, and drought tolerance [42]. Su et al. converted a major SNP co-segregating with waterlogging tolerance in Chrysanthemum into a PCR-based derived cleaved amplified polymorphic sequence (dCAPS) marker with an accuracy of 78.9%, which was verified in 52 cultivars or progenitors [43]. Chong et al. developed two dCAPS markers associated with the flowering stage and the flower head diameter in Chrysanthemum. These dCAPS markers have potential applications in molecular breeding and provide powerful new tools for future Chrysanthemum breeding [44].
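One way to read the accuracy of a marker is as the fraction of tested plants in which the marker genotype agrees with the observed phenotype. The sketch below computes that concordance under this assumption, which may differ from how the cited study defined accuracy. The function name, the allele labels, and the ten records are hypothetical and do not reproduce the published data.

```python
def marker_concordance(records, favorable_allele):
    """Fraction of plants whose marker allele agrees with the observed phenotype.

    records: iterable of (marker_allele, is_tolerant) pairs.
    A plant counts as concordant when carrying the favorable allele coincides
    with being tolerant, or lacking it coincides with being susceptible.
    """
    records = list(records)
    if not records:
        raise ValueError("at least one record is required")
    agree = 0
    for allele, is_tolerant in records:
        predicted_tolerant = allele == favorable_allele
        if predicted_tolerant == is_tolerant:
            agree += 1
    return agree / len(records)


# Hypothetical genotype/phenotype calls for ten cultivars, for illustration only.
toy_records = [
    ("A", True), ("A", True), ("A", True), ("A", False), ("G", False),
    ("G", False), ("G", False), ("G", True), ("A", True), ("G", False),
]

print(f"{marker_concordance(toy_records, favorable_allele='A'):.1%}")
```

A marker with high concordance in an independent panel of cultivars is a reasonable candidate for marker-assisted selection, although the threshold that counts as useful depends on the trait and the breeding program.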

This entry is adapted from the peer-reviewed paper 10.3390/plants11223118
