Bioinformatics in Plant Breeding and Disease Resistance

Subjects: Biology

Contributor: Huiying Mu, Baoshan Wang, Fang Yuan

In the context of plant breeding, bioinformatics can empower genetic and genomic selection to determine the optimal combination of genotypes that will produce a desired phenotype, and it can help expedite the isolation of new varieties. Bioinformatics is also instrumental in collecting and processing plant phenotypes, which facilitates plant breeding. Robots that use automated and digital technologies to collect and analyze different types of information can monitor the environment in which plants grow, analyze the environmental stresses they face, and promptly correct suboptimal or adverse growth conditions. These systems have advanced plant research and reduced the labor it requires.

plant resistance
plant breeding
bioinformatics
automated robots
precision fertilization

1. Introduction

Broadly speaking, bioinformatics covers the interdisciplinary study of biological objects (including genes, proteins, and physiological indices) using informatics methods, such as algorithms and statistical methods. Specifically, complex biological data can be processed with computational tools, which is common practice in dedicated resources such as nucleic acid databases, protein databases, and custom functional databases [1]. Bioinformatics tools reduce the cost of complex analyses and thereby support research into topics such as sustainable agriculture [2]. Understanding how bioinformatics can be applied to plant biology research is therefore important for researchers in the life sciences. This entry describes these tools and their applications, focusing on plant breeding and research on disease resistance. For example, by combining the VPg gene sequences of potato virus Y (PVY) isolated from potato with all published sequences in GenBank, two things can be inferred within a Bayesian phylodynamic framework: the rate of evolution of PVY and the time to its most recent common ancestor. Both estimates advance disease resistance studies in potato [3,4,5]. Multifactorial traits involved in resistance and quality are extremely difficult to improve, especially in combination, and the genomes of some major forage crops, such as maize, rice, wheat, sorghum, and barley, and of the forage legumes soybean and alfalfa, are too large to be analyzed using whole-genome sequencing. Attention has therefore focused on comparative genomic approaches in order to produce seeds with desirable shapes [6,7,8].

The typical datasets generated by plant researchers contain morphological, physiological, molecular, and genetic information that describes the entire plant life cycle. Bioinformatics processes the collected data and extracts key indices and trends in order to generate hypotheses quickly and accurately and then offer solutions. For example, phenotypes and genotypes can be combined to reveal an underlying mechanism, as in the study of plant rejuvenation [9]. The future growth pattern of plants can be predicted from their past growth trends, as in a plant growth pattern prediction system developed with deep learning [10]. The comparison of multiple genomes can also be used to predict evolutionary relationships, as in the study of Amphicarpaea edgeworthii [11].

In agricultural applications, the wide use of bioinformatics can assist with efficient crop breeding and the improvement of plant resistance against pathogens [12]. In particular, scientists are committed to breeding and modifying crop species to improve yield and quality, as well as to creating new varieties with qualities that benefit human nutrition and health. Bioinformatics accelerates the generation and deployment of these new varieties. Indeed, genes associated with specific traits can be analyzed on a computer before being introduced into a plant, and the results can guide what to introduce next for a precise phenotypic analysis. Maize (Zea mays L.) kernels rich in lysine [13], lettuce (Lactuca sativa) high in vitamin C [14], and the recently developed vitamin D-rich tomato [15] are examples of the implementation of such pipelines.

Bioinformatics plays a critical role in data integration, analysis, and model prediction, as well as in managing the massive amounts of data generated by new high-throughput approaches [16]. Classical biological experiments, such as the visualization of mitosis, meiosis, and pollen tube growth, are being explored more deeply and at higher throughput thanks to bioinformatics and time-lapse microscopy [17,18]. Plant growth can be predicted from the available wealth of physiological and phenotypic data, enabling the generation of a virtual plant that can accurately predict growth patterns and the consequences of interactions with diseases or pests [19]. Bioinformatics also has wide applications in the analysis of plant resistance to various stresses [20]. The molecular mechanisms underlying plant responses to abiotic stress have been studied in depth, and they can open new avenues in agriculture when combined with the predictive power of bioinformatics [21]. In addition, bioinformatics has been applied in plant pathology, for example to identify and predict the “effector” proteins that plant pathogens produce in order to manipulate their host plants. Functional annotation of pathogen sequences to predict virulence is a critical step in translating sequence data into potential applications in plant pathology [22]. A bioinformatics framework has also been proposed to help stakeholders make more informed decisions. In this way, a shared biosecurity infrastructure can be established to support sustainable global food and fiber production in the context of global climate change and the increased risk of accidental disease invasions through the global plant trade [23].

2. Databases Provide Abundant Gene and Pathway Information to Study Plant Biology

Thanks to large-scale sequencing technologies, vast amounts of data are released continuously and are often uploaded to a specific database. Depending on the species they represent, databases can be formally classified as general or species-specific databases.

General databases include those that integrate information about genomes, proteins, and metabolic pathways (Table 1). Genome databases represent a centralized and public collection of all published data, so researchers can easily obtain information concerning their gene or protein of interest. For example, UniProt offers a comprehensive resource for protein sequences and functional annotation. The database can be queried with a specific gene/protein name or with keywords of interest to sort through the catalogued data, but it is also possible to perform a protein BLAST (basic local alignment search tool) and download the sequence of the new protein of interest [25]. In addition, general databases compile various biological pathways, such as those represented in Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), EuKaryotic Orthologous Groups (KOG), and metabolic pathways, which can be used to determine if a candidate protein belongs to one of many known pathways.
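As a rough illustration of the keyword query and sequence download described above, the following minimal Python sketch builds a search URL and parses FASTA-formatted text. The endpoint and the `query`, `format`, and `size` parameters are assumptions based on UniProt's public REST interface and should be checked against the current UniProt documentation; the helper names `build_uniprot_query` and `parse_fasta` are hypothetical and were chosen for this example only.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST search service (verify before use).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_query(keyword, organism_id=None, size=5):
    """Return a search URL for a gene/protein name or keyword.

    organism_id is an optional NCBI taxonomy identifier (assumed query field).
    """
    query = keyword
    if organism_id is not None:
        query = f"({keyword}) AND (organism_id:{organism_id})"
    params = {"query": query, "format": "fasta", "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


if __name__ == "__main__":
    # 3702 is the NCBI taxonomy identifier of Arabidopsis thaliana.
    print(build_uniprot_query("rbcL", organism_id=3702, size=2))
    demo = ">demo_protein_1\nMSPQ\nTETK\n>demo_protein_2\nAAA\n"
    print(parse_fasta(demo))
```

The URL is only constructed here, not fetched, so the sketch runs without network access; the returned FASTA text of a real query could be passed to `parse_fasta` in the same way as the toy string.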

Table 1. General databases used for data integration and presentation.

URL	Note	Description
http://bigd.big.ac.cn/databasecommons/ Accessed on 4 May 2022.	Comprehensive publicly available data repository covering a wide range of organisms	Database Commons consolidates the information collected about biological databases. Each database is classified by data type, category, subject, and location, so that users can easily find a specific collection of databases of interest.
https://www.expasy.org/ Accessed on 4 May 2022.	Covers a wide range of biological research databases and software tools	Expasy is divided into several areas: DNA, RNA, protein, population, cell, etc. By omics type, its resources are grouped into proteome, genome, transcriptome, structure analysis, population genetics, and so on.
https://www.agbiodata.org/ Accessed on 4 May 2022.	Integrated platform of agricultural biological databases and related resources	It is a consortium of agricultural biological databases that integrate standards and best practices for the acquisition, display, and reuse of genomic, genetic, and breeding data.
https://phytozome.jgi.doe.gov/pz/portal.html Accessed on 4 May 2022.	Plant Comparative Genomics Repository	Phytozome database, the Plant Comparative Genomics portal of the Department of Energy’s Joint Genome Institute, provides a hub for accessing, visualizing, and analyzing JGI-sequenced plant genomes, as well as selected genomes and datasets sequenced elsewhere.
http://harvest.ucr.edu/ Accessed on 4 May 2022.	Platform for Crop EST sequences and related molecular information	HarvEST database includes various functions, such as microarray content design, SNP identification, genotyping platform design, comparative genomics, and the coupling of physical and genetic profiles.
https://www.uniprot.org/ Accessed on 4 May 2022.	Protein sequence and functional information resource database and analysis platform	UniProt is the world’s leading resource for high-quality, comprehensive, and freely accessible protein sequence and functional information.
http://www.plantgdb.org/ Accessed on 4 May 2022.	Plant Genome Sequence Database	PlantGDB includes software, visualization, and data access portals that implement novel prediction algorithms, as well as a network infrastructure of development tools for distributed computing, protocol sharing, and analysis of source records.
https://mpss.danforthcenter.org/index.php Accessed on 4 May 2022.	NGS database, including small RNAs and genome resources for plants	The Meyers Lab database focuses on many aspects of plant small RNAs, including their major roles in gene and transposable element regulation, as well as their biogenesis and evolution. It includes small RNA sequencing data, sequencing data for cleaved target RNAs, and a variety of informatics tools.
http://metacrop.ipk-gatersleben.de Accessed on 4 May 2022.	Crop Metabolism Pathway Database	MetaCrop summarizes various information about metabolic pathways in crops and allows the automatic export of information to create detailed metabolic models.

As one example, a bioinformatic analysis of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) from multiple C3 plant species included an in silico characterization of RuBisCO and its interacting proteins. Their structures and functions were predicted with the ProtParam, SOPMA, Predotar 1.03, SignalP 4.1, TargetP 1.1, and TMHMM 2.0 tools, which are all accessible from the ExPASy database. A MEME and MAST analysis of RuBisCO from all C3 plants was combined with a phylogenetic tree constructed with MEGA 6.06 software from a sequence alignment obtained with the ClustalW algorithm. Together, these analyses illustrated the high sequence identity shared by RuBisCO from different C3 plant species, supporting the notion that they originated from a common ancestor [26]. A list of these databases and how they are used is provided in Table 1.
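The notion of sequence identity used above can be made concrete with a small calculation. The following Python sketch computes percent identity between two already-aligned sequences (for instance, two rows of a ClustalW alignment). It assumes one common convention, namely that columns containing a gap in either sequence are skipped; other tools may define identity differently, and the toy sequences are invented for illustration and are not real RuBisCO sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free alignment columns.

    Both inputs must come from the same alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Invented toy fragments; one substitution in ten compared columns.
    seq_1 = "MSPQTETKAS"
    seq_2 = "MSPQTETRAS"
    print(percent_identity(seq_1, seq_2))
```

Applying the function to every pair of rows in an alignment yields a pairwise identity matrix, which is one simple way to summarize how conserved a protein is across species.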

The model plant Arabidopsis thaliana is one of a few plant species with its own databases due to its widespread use in plant research (Table 2). These databases are rich in resources and can help researchers quickly obtain the latest Arabidopsis genome information. For example, The Arabidopsis Information Resource (TAIR) database allows users to download gene sequences in bulk, while the SeqViewer in TAIR also provides a simple tool to visualize the genes. In addition, TAIR has a powerful function for displaying various expression maps, each representing expression data during Arabidopsis development or under different growth conditions [27].

Table 2. Databases specific for Arabidopsis.

URL	Note
http://www.arabidopsis.org Accessed on 4 May 2022.	The most commonly used repository of Arabidopsis genetic and molecular biology data
http://rarge.gsc.riken.jp/ Accessed on 4 May 2022.	Arabidopsis cDNA, mutant, and microarray database
http://www.athamap.de/ Accessed on 4 May 2022.	A genome-wide database of putative transcription factor binding sites in Arabidopsis
http://www.plprot.ethz.ch/ Accessed on 4 May 2022.	Arabidopsis plastid protein database
http://seedgenes.org/ Accessed on 4 May 2022.	Database of key Arabidopsis developmental genes
http://suba.live/ Accessed on 4 May 2022.	Subcellular localization database for Arabidopsis proteins
http://atrm.cbi.pku.edu.cn/ Accessed on 4 May 2022.	Arabidopsis transcriptional regulatory mapping database
http://wanglab.sippe.ac.cn/rootatlas/ Accessed on 4 May 2022.	Arabidopsis root single-cell RNA-seq database
http://ipf.sustech.edu.cn/pub/athrna/ Accessed on 4 May 2022.	Arabidopsis RNA-seq data resources
http://signal.salk.edu/ Accessed on 4 May 2022.	A database showing all T-DNA insertions and methylome data

Most major crops have dedicated databases, including rice (Oryza sativa), wheat (Triticum aestivum), barley (Hordeum vulgare), maize, soybean (Glycine max), cotton (Gossypium hirsutum), and sorghum (Sorghum bicolor) (Table 3). For example, the Rice Mutant Database (RMD) includes mutants for the identification of new genes and regulatory elements, together with a list of lines for the ectopic expression of target genes in specific tissues or at specific growth stages, providing rich data resources for the study of different rice mutants [28]. The Wheat Genomic Variation Database (WGVD) compiles all published single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (InDels), and selective sweeps, together with a BLAST search tool for wheat [29]. Researchers at Huazhong Agricultural University have developed the ZEAMAP database for maize, which includes multiple omics data resources, such as genomes, transcriptomes, genetic variation, phenotypic data, metabolome studies, and genetic maps. The database also provides access to a variety of data on complex traits and offers rich online capacities for data retrieval, analysis, and visualization [30].
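To make the distinction between SNPs and InDels concrete, the short Python sketch below labels a variant from its reference and alternate alleles. It assumes VCF-style allele strings as input; the storage format actually used by WGVD is not described here, and the function names and records are hypothetical.

```python
def classify_variant(ref, alt):
    """Label a variant from reference and alternate allele strings."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"          # single-base substitution
    if len(ref) < len(alt):
        return "insertion"    # alternate allele is longer
    if len(ref) > len(alt):
        return "deletion"     # alternate allele is shorter
    return "other"            # equal-length multi-base substitution


def count_variants(records):
    """Count variant classes in an iterable of (ref, alt) pairs."""
    counts = {}
    for ref, alt in records:
        label = classify_variant(ref, alt)
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    # Invented example records, not taken from any database.
    demo = [("A", "G"), ("C", "CTT"), ("GAT", "G"), ("T", "C")]
    print(count_variants(demo))
```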

Table 3. Databases used for major crops.

URL	Note
http://www.ricedata.cn/index.htm Accessed on 4 May 2022.	National Rice Data Center
http://signal.salk.edu/cgi-bin/RiceGE Accessed on 4 May 2022.	Rice functional genome expression database
https://shigen.nig.ac.jp/rice/oryzabase/ Accessed on 4 May 2022.	Rice genetics and genomics database
http://www.wheatgenome.org/ Accessed on 4 May 2022.	Wheat genome information database
http://earth.nig.ac.jp/~dclust/cgi-bin/index.cgi Accessed on 4 May 2022.	Barley germplasm resources and genome analysis database
http://maize.jcvi.org/ Accessed on 4 May 2022.	Maize genome database
https://www.maizegdb.org/ Accessed on 4 May 2022.	Maize genome and genetic analysis platform
https://soybase.org/ Accessed on 4 May 2022.	Soybean genomics and molecular biology database
http://www.ildis.org/LegumeWeb/ Accessed on 4 May 2022.	International legume database and information service
https://www.cottongen.org/ Accessed on 4 May 2022.	Cotton genomics, genetics, and breeding database
http://ted.bti.cornell.edu/ Accessed on 4 May 2022.	Tomato functional genome database
http://ted.bti.cornell.edu/epigenome/ Accessed on 4 May 2022.	Tomato epigenome database
http://tea.solgenomics.net/ Accessed on 4 May 2022.	High-resolution mapping and search tool for tomato genes and their products
http://tomexpress.toulouse.inra.fr/ Accessed on 4 May 2022.	Tomato transcriptome data visualization and analysis platform
https://solgenomics.net/ Accessed on 4 May 2022.	Genome sequencing database of Solanaceae species
http://gabipd.org/projects/Pomamo/ Accessed on 4 May 2022.	Potato bioinformatics database

3. Various Algorithms Create Possibilities for Customized Analysis

Bioinformatics tools or websites can be used to predict protein structure, to look for conserved domains in a protein, or to annotate genes (Table 4). Data visualization and presentation are an integral part of bioinformatics analysis [31]. The biggest advantages of TBtools are batch processing and the visualization of data, and the interactive graphics generated with TBtools are rich with editable features that provide maximum flexibility for users [32]. Protter supports protein data analysis and protein prediction by visualizing the characteristics of an annotated sequence and associated experimental proteomic data in a protein topological environment. Protter is of great use for comprehensive visualization of membrane proteins and the selection of targeted proteomic peptides [33].

Table 4. Bioinformatics tools and websites that can be used in plant research.

Tool Name	URL	Note
TBtools	https://github.com/CJ-Chen/TBtools Accessed on 4 May 2022.	An integrated toolkit for interactive analysis of big biological data
SMART	http://smart.embl-heidelberg.de/ Accessed on 4 May 2022.	Protein conserved domain prediction tool
STAG-CNS	https://github.com/srbehera11/stag-cns Accessed on 4 May 2022.	A sequentially conserved non-coding sequence discovery tool for an arbitrary number of species
FED	http://www.hi-tom.net/FED Accessed on 4 May 2022.	Genome editing exogenous component detection platform
MAFFT	https://mafft.cbrc.jp/alignment/server/ Accessed on 4 May 2022.	Online multiple sequence alignment tool
Protter	http://wlab.ethz.ch/protter/start/ Accessed on 4 May 2022.	Online tool for visualizing protein topology and sequence features
EvolView	https://www.evolgenius.info/evolview/ Accessed on 4 May 2022.	Web-based tool for visualizing, annotating, and managing phylogenetic trees
iTOL	https://itol.embl.de/ Accessed on 4 May 2022.	Online tool for displaying, annotating, and managing phylogenetic trees

Many tools are also tailored for specific applications (Table 5), including the prediction of transcription factor binding sites and the exploration of large-scale genomic variation data. For example, the PlantPAN database hosts a comprehensive list of transcription factors and their cognate binding sites. TRANSFAC and PlnTFDB are comprehensive databases of plant transcription factors, and AGRIS contains a database of Arabidopsis transcription factors, which can be used to predict transcription factor binding sites in plant promoter regions [34]. SnpHub can be used to retrieve, analyze, and visualize large-scale genomic variation data by specifying samples and lists of specific genomic regions [35].
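The simplest form of binding-site prediction is a consensus search of a promoter sequence. The Python sketch below is a minimal illustration of that idea, not the method implemented by PlantPAN or the other resources named above, which rely on curated matrices and additional evidence. It expands an IUPAC consensus into a regular expression and reports every match on the forward strand. The W-box (TTGACY) and G-box (CACGTG) consensus strings are commonly cited plant cis-elements used here only as examples, and the promoter fragment is invented.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(promoter, consensus):
    """Return (0-based start, matched site) for each consensus match.

    Only the forward strand is scanned; a lookahead is used so that
    overlapping sites are also reported.
    """
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(promoter.upper())]


if __name__ == "__main__":
    promoter = "AATTGACCGGTTGACTCACGTGA"  # invented fragment
    print("W-box:", find_motif(promoter, "TTGACY"))
    print("G-box:", find_motif(promoter, "CACGTG"))
```

A consensus search of this kind produces many chance matches in long promoters, which is one reason dedicated tools combine sequence matches with further information before predicting a functional site.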

Table 5. Bioinformatics tools specific to applications in plant research.

Tool Name	URL	Note
BAR	http://www.bar.utoronto.ca/welcome.htm Accessed on 4 May 2022.	Plant biology analysis tools platform
CRISPR-P	http://crispr.hzau.edu.cn/CRISPR2/ Accessed on 4 May 2022.	Improved CRISPR/Cas9 toolkit for plant genome editing
ACT	https://www.michalopoulos.net/act/ Accessed on 4 May 2022.	Arabidopsis co-expression analysis tool
OryGenesDB	http://orygenesdb.cirad.fr/ Accessed on 4 May 2022.	An interactive tool for reverse genetics studies in rice
T-DNA Express	http://signal.salk.edu/cgi-bin/tdnaexpress Accessed on 4 May 2022.	Arabidopsis gene mapping tool
Plant MetGenMAP	http://bioinfo.bti.cornell.edu/cgi-bin/MetGenMAP/home.cgi Accessed on 4 May 2022.	Web-based tool for comprehensive mining and integration of gene expression and metabolite changes in the context of biochemical pathways
iTAK	http://itak.feilab.net/cgi-bin/itak/index.cgi Accessed on 4 May 2022.	Software package for identifying and classifying plant transcription factors and protein kinases
PlantPAN	http://plantpan2.itps.ncku.edu.tw/ Accessed on 4 May 2022.	Tool for detecting transcription factor binding sites in plants
SnpHub	http://guoweilong.github.io/SnpHub/ Accessed on 4 May 2022.	A unified web server framework for exploring large-scale genomic variation data

Outside of dedicated web tools, various programming languages can be used to empower data integration and analysis, such as Python, R, and Perl. Python and R are perhaps more widely used in bioinformatics than Perl. R has powerful statistical functions, which are very useful for processing large experimental datasets, together with a graphics solution for data exploration [36]. Python is better suited for building databases and web applications and for developing utilities [37]. While the basic introductory programming paradigm in R relies on functions provided by user-written packages, Python’s programming paradigm is based on design flow. Although R code might not be as human-readable as Python code, R is overall better suited to biologists without a strong programming background. Based on these programming languages, various scripts have been developed to analyze data efficiently. For example, R provides the kmeans function for clustering analysis, and Manhattan plots produced from genome-wide association studies (GWASs) can be drawn with the qqman package [38].
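To show what a k-means clustering step does, and what a Manhattan plot displays on its vertical axis, the following minimal Python sketch uses only the standard library. It is not R's kmeans function or the qqman package: the initialization (the first k points serve as starting centroids) is a deliberately naive assumption made for reproducibility, and the expression values are invented.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption: the first k points are used as the initial centroids.
    Returns (labels, centroids).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so the centroids are final
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


def neg_log10(p_values):
    """Transform GWAS p-values to -log10(p), the y-axis of a Manhattan plot."""
    return [-math.log10(p) for p in p_values]


if __name__ == "__main__":
    # Invented (control, stress) expression values for six genes.
    genes = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8),
             (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
    labels, centroids = kmeans(genes, k=2)
    print(labels, centroids)
    print(neg_log10([0.05, 1e-8]))
```

In practice, production implementations use better initialization and multiple restarts, so results from this sketch should be treated as illustrative only.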

4. Application of Bioinformatics in Plant Breeding

Plant breeding aims to produce new plant varieties. This long-term activity begins with basic research and often takes many years, thus requiring a significant financial investment [39]. Genomics-assisted breeding is an effective and economical strategy and is thus widely applied in crop breeding. Genomics may help to understand the organization and function of biological systems and has the potential to track molecular changes during development under different conditions, such as changes in plant physiology, pathogen pressure, or the environment [40]. Samples for genomics studies can be collected from the same or different individuals of one species or from different species [41]. In addition, comparative genomics allows the study of specific traits in related plants by capitalizing on sequence conservation between species with small genomes (easier to study) and those with large and complex genomes (more difficult to study, but including most current crop species). For example, in Chrysanthemum, GWASs have been used to explore genetic patterns and identify favorable alleles for several ornamental and resistance traits, including plant structural and inflorescence traits, waterlogging tolerance, aphid resistance, and drought tolerance [42]. Su et al. converted a major SNP co-segregating with waterlogging tolerance in Chrysanthemum into a PCR-based derived cleaved amplified polymorphic sequence (dCAPS) marker with an accuracy of 78.9%, which was verified in 52 cultivars or progenitors [43]. Chong et al. developed two dCAPS markers associated with the flowering stage and the diameter of the flower head in Chrysanthemum, and these markers have potential applications in the molecular breeding of Chrysanthemum [44]. Such techniques will provide powerful new tools for future Chrysanthemum breeding.
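One plausible way to express the accuracy of a diagnostic marker is the share of tested cultivars whose marker genotype predicts the observed phenotype class. The Python sketch below computes that share. This definition is an assumption made for illustration, since the exact calculation used in the cited study is not given here, and the genotype calls and phenotype classes are invented.

```python
def marker_accuracy(predicted, observed):
    """Percent of individuals whose marker-based call matches the phenotype.

    predicted: phenotype class implied by the marker genotype, per cultivar
    observed:  phenotype class scored in the field or greenhouse, per cultivar
    """
    if len(predicted) != len(observed):
        raise ValueError("both lists must describe the same cultivars")
    if not predicted:
        raise ValueError("at least one cultivar is required")
    agree = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * agree / len(predicted)


if __name__ == "__main__":
    # Invented calls for five cultivars: T = tolerant, S = susceptible.
    marker_calls = ["T", "T", "S", "S", "T"]
    field_scores = ["T", "S", "S", "S", "T"]
    print(marker_accuracy(marker_calls, field_scores))
```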

This entry is adapted from the peer-reviewed paper 10.3390/plants11223118

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.