Engineering Microbial Phenotypes: Comparison
Please note this is a comparison between Version 2 by Arjun Patel and Version 3 by Vicky Zhou.

Microbial strains are being engineered for an increasingly diverse array of applications, from chemical production to human health. While traditional engineering disciplines are driven by predictive design tools, these tools have been difficult to build for biological design due to the complexity of biological systems and many unknowns of their quantitative behavior. However, due to many recent advances, the gap between design in biology and other engineering fields is closing.

  • synthetic biology
  • metabolic modeling
  • machine learning
  • metabolic engineering

1. Introduction

Microbes have been engineered for a broad number of applications. As cell factories, cells have been designed to convert low-value substrates into valuable chemical products, including biofuels [1], commodity chemicals [2], bioactive compounds [3], and foods [4]. To benefit the environment, microbes have been engineered for bioremediation [5] and biosensing [6] of toxic compounds and pollutants. As engineered tools, microbes have been programmed using cell circuits to exhibit elaborate behaviors, from synchronized fluorescence [7] to hunting down tumors to deliver chemotherapeutics [8]. Finally, as cellular products, microbes themselves are increasingly of interest for probiotic and nutritional supplements [9].

The experimental workflow to engineer a new microbial strain has a number of common steps, although the order may vary (Figure 1A) [10]. First, a background organism and strain is chosen for the application of interest. Genes may be knocked out, introduced, knocked down, or overexpressed for a variety of purposes, such as control of transcriptional regulation, redirection of metabolic flux to desired pathways, or removal of unwanted or wasteful processes. Bioprocess conditions can be optimized through control of various factors including media, feed rate, growth rate, pH, and temperature. Specific sequence variants can be introduced through rational design or selected through screens and adaptive laboratory evolution to control expression, alter enzyme activity, or remove regulatory sites from proteins. The typical strain design workflow thus requires a large number of decisions on how to improve strain behavior. Left to a trial and error approach, the complexity of biological systems makes efficient engineering of strains a daunting task.

Figure 1. Challenges and computational solutions for a typical strain design workflow. (A) Typical experimental steps in the development of a new strain design. (B) Common challenges encountered at each strain design step. (C) Computational tools that may be used to meet the strain design challenges. Note that the design steps, challenges, and computational tools highlighted here are intended to be exemplative rather than comprehensive. 1, Modeling organism capabilities [11][27]; 2, Network reconstruction [12][18]; 3, Top-down data-driven regulons [13][28]; 4, Kinetic and COBRA modeling [14][26]; 5, Kinetic and thermodynamic models [15][29]; 6, COBRA modeling of gene knockouts [16][30]; 7, Overflow models [17][31]; 8, Expression tuning ML models [18][32]; 9, Kinetic models including regulation [19][33]; 10, Protein structural analysis [20][34]; 11, Models using enzyme kinetics [21][35]; 12, Bioprocess models [22][36]; 13, StressME models [23][24][25][37,38,39]; 14, Analysis of bioreactor omics data [26][40].

2. Computational Tools

To aid strain design efforts, computational tools have been integrated from various fields into the strain design workflow [27][11]. These tools offer the promise of restricting the experimental search space by either identifying modifications that are more likely to improve strain performance or proposing entirely new designs through mathematical modeling of cell behavior. However, many steps in the strain design process are still driven by rational approaches, rules of thumb, and extensive experimental screening and trial and error. Workflows driven purely by predictive tools would have the advantage of efficiency of execution through fewer experimental steps, reduced time, and ultimately improved performance through careful guidance toward an optimal desired phenotype. We describe two approaches that show promise as systematic tools for cell design: genetic circuits and genome-scale modeling.

One strategy for constructing synthetic strains has been to engineer desired behaviors through the use of genetic circuits [28][12]. The key concept is to carefully characterize and often mathematically model the behavior of a ‘circuit’, typically a small transcriptional regulatory network, to control a cell phenotype. As greater numbers of these small circuits are characterized, they begin to comprise a ‘parts list’ of available phenotypes from which an engineer can choose or can be assembled automatically by an algorithm [29][13]. Larger and larger circuits can then be constructed of well-characterized smaller circuits to engineer more complex phenotypes. This strategy has been employed for a number of promising applications [30][31][14,15].

Another successful paradigm for computational design of cells is genome-scale network modeling [32][16]. While genetic circuits approaches utilize highly controllable systems of limited scope, genome-scale models seek to predict cell phenotype by comprehensively modeling all known functions of the cell. As part of the Constraint-based Reconstruction and Analysis (COBRA) framework, genome-scale models of metabolism utilize a metabolic network reconstruction to predict metabolic phenotypes and analyze genome-scale datasets [33][17]. These models deal with the large scope of the system by utilizing the constraint-based modeling framework, which requires few parameters to generate predictions. The challenge of managing these large-scale models is achieved through community enforcement of rigid requirements, testing, and data standards [12][34][18,19]. Although these models were originally developed for metabolism, they have recently been extended to include transcription and translation machinery [35][36][37][20,21,22] and even further to whole-cell kinetic simulations [38][23].

Although computational methods have undoubtedly augmented rational strain design efforts, there are a number of challenges in a strain design workflow that still cannot be effectively addressed by existing computational tools [39][24] (Figure 1B). For example: (1) Organisms are often chosen for a strain design project due to historical knowledge and convenience, rather than fundamental benefits provided by the organism that could be calculated computationally a priori, (2) Gaps in gene annotation make choosing non-model organisms a risk, (3) The difficulty in accounting for enzyme kinetics makes the understanding of metabolic and allosteric regulation a challenge, (4) A lack of understanding of regulatory networks impedes the understanding and control of gene expression, and (5) Insufficient annotation of the organism genome makes it difficult to interpret the functional implications of sequence variation. Challenges such as these present major barriers to interpreting data and predicting strain phenotype.

There are many methods currently being developed that may directly meet these challenges to enable fully predictive strain design workflows (Figure 1C). For example, advances in metabolic modeling could enable the optimization of bioprocess conditions or the identification of optimal expression levels of pathway genes [40][14][25,26]. However, these models are still in development and have not yet been shown to enable accurate predictions at scale. In this perspective, we describe five frontiers consisting of promising developments in computational strain design that may pave the way toward achieving comprehensive and integrated strain design workflows (Figure 2).

Figure 2. Overview of frontiers in the computational design of synthetic organisms. Frontier 1: Constraint-based Reconstruction and Modeling, consisting of tools for analyzing pan-genomes, microbial communities, gap-filling metabolic networks, and modeling proteome allocation. Frontier 2: Kinetics and Thermodynamics, consisting of tools for parameterizing and simulating kinetic and thermodynamic models. Parameterization can utilize the Michaelis-Menten equation where [A] is the substrate concentration, whereas simulation uses dynamic mass balance equations where S is the stoichiometric matrix. Frontier 3: 3D Structures, consisting of methods for the reconstruction of 3D metabolic networks with protein structural information and subsequent applications of these 3D reconstructions. Frontier 4: Genome Sequence and Phenotype Prediction, consisting of workflows for analyzing strain variations in genome sequence as well as building machine learning models based on genome sequence to predict strain phenotype. Frontier 5: Regulatory Networks, consisting of methods for the determination of transcriptional regulatory networks and subsequence models of gene expression and strain phenotype utilizing regulatory network information.

3. Outlook for Synthetic Genome Design

Although there has been substantial progress in each of the individual fields discussed above, there are additional challenges with integrating these tools into an effective strain design workflow. Workflow: While we discussed many tools as they relate to individual strain design tasks, these tasks must be synthesized into a coherent end-to-end design workflow. The decisions of the order of operations in the development of a strain could greatly benefit from computational predictions, but much work is yet to be done to identify a strain design workflow that maximizes efficiency and minimizes cost and risk. Expertise: Any workflow that integrates many different computational tools will require domain expertise in each tool to decide details of implementation, from parameters to valid use cases. Thus, strain designers will be required to have broad computational skillsets that exceed what is taught by most current training programs. Software: The practical difficulty of implementing many separate computational tools can become a substantial burden, spanning various details from licensing issues to file formats. However, the number of software packages enabling these workflows continues to increase, and we mention many examples in this work (Figure 3). Thanks to these efforts, finding compatible tools for easily integrated workflows is becoming easier. Validation: Tools must be validated to clearly established accuracy metrics under physiological conditions. Validation of tools on individual datasets, for example on a single wild type strain background, is likely to be insufficient as the strain is engineered further from the wild type. To meet these challenges, it is critical to take a systematic approach that includes dedicated training, effective documentation of tools, and extensive validation of tools in real applications. There will be a significant challenge reaching a standard where strain design researchers can effectively conduct analyses and understand results from multiple tools across a typical workflow.

Figure 3. A selection of actively maintained software for computational design and analysis of microbial phenotypes. We focus on Python tools due to the popularity of the language as well as potential for integration in a single strain design workflow, but also include important packages in other languages and standalone applications. Frontier 1: Packages for constraint-based reconstruction and modeling, proteome allocation modeling, and strain design optimization. Frontier 2: Kinetics and thermodynamics packages for model parameterization, simulation, and thermodynamics constrained modeling. Frontier 3: Software for annotating and visualizing structures as well as integrating 3D structural information with systems biology approaches. Frontier 4: Python package for storing, organizing, and analyzing genome sequences. Frontier 5: Online knowledgebase and software for determining transcriptional regulatory networks using ICA decomposition methods. 1 COBRApy [41][42]; 2 COBRA Toolbox [33][17]; 3 COBRAme [35][20]; 4 GECKO [42][52]; 5 CAMEO [43]; 6 pyTFA [44][95]; 7 COPASI [45][96]; 8 MASSpy [46][94]; 9 eQuilibrator [47][89]; 10 SSBIO [48][106]; 11 Amber [49][111]; 12 I-TASSER [50][107]; 13 Bitome [51][120]; 14 MEME [52][117]; 15 iModulonDB [53][125]; 16 PRECISE [13][28].

The field is nearing an important milestone in synthetic biology, that of the comprehensive and computationally-driven strain design workflow. We may soon enter an era of ‘computational genome design’, where rational approaches finally give way to biological design algorithms dominated by computational predictions. Thus, one of the early promises of the field of systems biology may finally be nearing its realization. The practical applications of such a cell design workflow are endless, from the chemical industry to the environment to human health.