Statistics-based model for prediction of chemical biosynthesis yield from Saccharomyces cerevisiae

Background: The robustness of Saccharomyces cerevisiae in industrial-scale ethanol production has extended its use as a platform for synthesizing other metabolites. Metabolic engineering strategies, typically pathway overexpression and gene deletion, continue to play a key role in optimizing the conversion of substrates into desired products. However, production titer and yield remain difficult to predict from reaction stoichiometry and mass balance alone. We sampled a large body of published data on chemical production from S. cerevisiae and developed a statistics-based model that calculates production yield from input variables representing the number of enzymatic steps in the key biosynthetic pathway of interest, metabolic modifications, cultivation mode, and nutrient and oxygen availability.

Results: Based on production data for about 40 chemicals produced from S. cerevisiae, together with the metabolic engineering methods, nutrient supplementation, and fermentation conditions described therein, we generated mathematical models with numerical and categorical variables to predict production yield. Statistically, the models showed that: 1. Chemical production from central metabolic precursors decreased exponentially with increasing number of enzymatic steps for biosynthesis (>30% loss of yield per enzymatic step, P-value = 0); 2. Gene overexpression and knockout (categorical variables) improved product yield 2- to 4-fold (P-value < 0.1); 3. Addition of a notable amount of intermediate precursors or nutrients improved product yield more than five-fold (P-value < 0.05); 4. Cultivation in a well-controlled bioreactor enhanced product yield three-fold (P-value < 0.05); 5. The contribution of oxygen to product yield was not statistically significant. Yields calculated for various chemicals with the linear model were in fairly good agreement with the experimental values. The model generally underestimated ethanol production relative to other chemicals, which supports the notion that the metabolism of S. cerevisiae has evolved for robust alcohol fermentation.

Conclusions: We generated simple mathematical models for a first-order approximation of chemical production yield from S. cerevisiae. These linear models provide empirical insight into the effects of strain engineering and cultivation conditions on biosynthetic efficiency. The models may not only give metabolic engineers guidelines for synthesizing desired products, but may also be useful for comparing biosynthesis performance across research papers.


Background
Producing small-molecule chemicals with microbial biocatalysts offers several advantages. Unlike conventional chemical synthesis, which is heavily dependent on petroleum-derived substrates, microbes are able to use renewable materials to synthesize many commodity chemicals and fuels [1] (Figure 1). Because their cultivation is scalable, microorganisms are also suitable platforms for synthesizing pharmaceutical molecules that are conventionally obtained by extraction from large amounts of natural resources. Among the many industrial microorganisms, baker's yeast, S. cerevisiae, continues to emerge as a preferred production platform [2]. S. cerevisiae is best known for its robustness in fermenting sugars into alcohol. In the recent past, it has also gained importance as a heterologous platform for synthesizing many precursors of commodity chemicals and pharmaceuticals [1].
In general, chemical production using whole-cell biocatalysts is achieved by genetic engineering, either to extend the substrate range of an existing biosynthetic pathway or to introduce new biosynthetic pathways (derived from other organisms, or completely novel). Rational metabolic engineering approaches then analyze the cellular metabolism and improve production titer by overexpressing rate-limiting enzymes or deleting competing pathways. However, the actual yield of chemical production is not easily predicted, owing to the complexity of biological systems and the dependence on cultivation conditions. Biological complexities include not only intrinsic enzyme properties (such as kinetics and substrate specificity), but also enzyme compartmentalization, intracellular signaling, and metabolite transport between eukaryotic cell organelles. Therefore, strain engineering requires multiple rounds of trial-and-error experiments to find the optimum combination of genetic manipulations. In the present work, we sought to develop mathematical models that could provide an a priori estimate of chemical production yield from engineered S. cerevisiae when given a set of parameters, namely the number of steps in the biosynthetic pathway of interest, genetic modifications, cultivation conditions, and nutrient and oxygen availability. The coefficients of these parameters were obtained by regression of the yields and production conditions reported in the recent literature. The resulting model predicts empirical yields, which are lower than the theoretical productivities under "ideal" conditions. The model results could give metabolic engineers guidelines for increasing the yield of desired products and for reducing futile attempts.

Model development
The model defined several parameters that influence the efficiency of chemical production from microbial hosts. The first group of parameters accounted for the number of enzymatic steps in the biosynthetic pathway of interest, since this number has been shown to be often inversely correlated with microbial product yield [3]. To count the enzymatic steps, we introduced two numerical variables, PRI and SEC. The variable PRI specified the number of enzymatic steps in primary metabolism (Figure 1), e.g., the glycolytic steps required to convert sugar (glucose or galactose) to pyruvate. The variable SEC specified the number of enzymatic steps in the subsequent pathway (typically belonging to secondary metabolism), which converts the central carbon intermediate into the final product of interest. The next group of variables captured the effects of genetic modification. Various genetic strategies have been used to implement metabolic engineering [4,5]; for example, promoters of different strengths influence production level. However, to keep the model simple, variations in the genetic components used in metabolic engineering strategies were lumped into two ordinal variables, OVE and KNO. OVE signified the introduction of multiple copies of genes of native or heterologous origin to improve production level. KNO signified the alteration of branch pathways that might compete with the pathway of interest [6,7]. We further sub-categorized OVE by the number of modified genes into OVE C1 (without "pushing" pathway flux), OVE C2 (enhancing 1 to 2 enzyme activities), and OVE C3 (improving a number of key enzyme functions). KNO was categorized into KNO C1 and KNO C2 (without and with knockout, respectively). Table 1 explains the specifications for each sub-category.
The yield of metabolite production is also a function of cultivation conditions and nutrient availability. For instance, production of metabolites in a bioreactor is often higher than in a shaking flask, owing to more efficient mass transfer of oxygen, substrates, and nutrients. Moreover, culture acidification, which often causes cytotoxicity and a maintenance burden for the microbial host, can be mitigated in a bioreactor by automated pH control. Based on these properties, we introduced the variable CUL to represent the general type of cultivation condition. We also introduced the variables OXY and NUT to capture the effects of oxygen availability and nutrient supplementation, respectively [8][9][10]. Finally, the variable INT captured the effect of adding a secondary carbon source that served as a precursor or an intermediate metabolite of the pathway of interest.
Several assumptions were made to simplify model development. A) When multiple nutrient sources were supplemented, the yield calculation was based on the conversion of the major carbon substrate to the final product (e.g., yeast extract was not treated as a carbon source). B) Yields were calculated from two quantities: the carbon substrate initially added to the culture and the final measured product. Unused carbon substrate remaining at the end of the production was neglected. C) To count enzymatic steps from the carbon source, the model considered only the key route from the major substrate (mostly glucose) to the final product (enzymatic steps for cofactor or ATP synthesis were neglected). D) For product synthesis promoted by the addition of an intermediate, we had no means of distinguishing carbon derived from the added precursor from carbon derived from the primary substrate (i.e., glucose). To account for the contribution of both carbon sources, the yield was taken as the arithmetic mean of two yields (one based on the substrate, e.g., glucose, and the other estimated from the intermediate). Likewise, the number of primary or secondary steps was taken as the arithmetic mean of two counts (one counted from the substrate, the other counted from the intermediate).
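The yield bookkeeping in assumptions B and D can be illustrated with a minimal sketch. The function names and all quantities below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def carbon_yield(product_mol, product_carbons, substrate_mol, substrate_carbons):
    """Yield as mol C in product per mol C in the substrate initially added.

    Substrate left unused at the end of the run is ignored (assumption B).
    """
    return (product_mol * product_carbons) / (substrate_mol * substrate_carbons)


def arithmetic_mean(from_substrate, from_intermediate):
    """Average two estimates, as described in assumption D."""
    return (from_substrate + from_intermediate) / 2.0


# Hypothetical run: 1.0 mol glucose (6 C) plus 0.2 mol of a 5-carbon
# intermediate, giving 0.05 mol of a 10-carbon product.
y_substrate = carbon_yield(0.05, 10, 1.0, 6)
y_intermediate = carbon_yield(0.05, 10, 0.2, 5)
y_reported = arithmetic_mean(y_substrate, y_intermediate)

# Step counts are averaged the same way (hypothetical counts: 8 steps when
# counted from the substrate, 3 when counted from the intermediate).
sec_steps = arithmetic_mean(8, 3)

print(f"yield from glucose:      {y_substrate:.3f}")
print(f"yield from intermediate: {y_intermediate:.3f}")
print(f"averaged yield:          {y_reported:.3f}")
print(f"averaged step count:     {sec_steps}")
```

Averaging can yield a non-integer step count, which is acceptable here because the step variables enter the model as numerical values.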
Biochemical systems theory [5] states that reaction rates ($v_i$) can be described by a general power-law expression of the type:

$$v_i = \alpha_i \prod_{j} X_j^{g_{ij}} \qquad (1)$$

where $X_j$ represents the system variables and the parameters $\alpha_i$ and $g_{ij}$ are constants. Equation 1 yields a linear form in logarithmic coordinates. Based on similar assumptions, our model for yield prediction used system variables (i.e., numerical or categorical variables related to yeast biosynthesis) to describe the relative carbon flux to the final products:

$$\log_{10} Y = \beta_0 + \beta_{PRI}\,PRI + \beta_{SEC}\,SEC + \beta_{OVE,C2}\,OVE_{C2} + \beta_{OVE,C3}\,OVE_{C3} + \beta_{KNO,C2}\,KNO_{C2} + \beta_{NUT,C2}\,NUT_{C2} + \beta_{INT,C2}\,INT_{C2} + \beta_{CUL,C2}\,CUL_{C2} + \beta_{OXY,C2}\,OXY_{C2} \qquad (2)$$

In Equation 2, $\log_{10} Y$ was the dependent variable, where $Y$ is the production yield (mol C in product/mol C in primary substrate), and the $\beta_i$ were the coefficients of the independent variables [11]. We defined $\beta_0$ as the intercept in Equation 2, which represented the combined contribution of Category 1 of all ordinal variables. $\beta_0$ was defined as:

$$\beta_0 = \beta_{OVE,C1} + \beta_{KNO,C1} + \beta_{NUT,C1} + \beta_{INT,C1} + \beta_{CUL,C1} + \beta_{OXY,C1} \qquad (3)$$

The ordinal variables (using a binary system) were assigned a value of 1 if and only if the condition fitted the category in Table 1. Otherwise, they were assigned a value of 0 [12]. To obtain the coefficients in Equations 2 and 3, we compiled data from 40 publications that described the production of chemicals by S. cerevisiae under various experimental conditions. Table 2 summarizes the categories assigned to these experimental conditions and the product yields, estimated to the best of our judgment. Using these data, we performed regression analysis with the software package R [13] to fit the model and obtain the regression coefficients and P-values. For this study, a variable was considered statistically significant (at the 90% level) if its P-value was below 0.1.
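As a rough illustration of how a model of this form can be set up, the sketch below encodes each experiment as a row of numerical and 0/1 dummy variables and fits the coefficients by ordinary least squares. It is a sketch under stated assumptions, not the authors' procedure: the original analysis was done in R, whereas this example uses NumPy; the data are synthetic; the coefficient values in `made_up_beta` are invented; the names `encode`, `fit_ols`, and `predict_yield` are hypothetical; and P-values are not computed.

```python
import numpy as np

# Column order of the design matrix: intercept, two numerical variables,
# then one 0/1 dummy per non-reference category (Category 1 is the reference).
COLUMNS = ["intercept", "PRI", "SEC", "OVE_C2", "OVE_C3", "KNO_C2",
           "NUT_C2", "INT_C2", "CUL_C2", "OXY_C2"]


def encode(pri, sec, ove="C1", kno="C1", nut="C1", inter="C1", cul="C1", oxy="C1"):
    """Turn one experiment into a design-matrix row in COLUMNS order."""
    return np.array([
        1.0, float(pri), float(sec),
        float(ove == "C2"), float(ove == "C3"),
        float(kno == "C2"), float(nut == "C2"),
        float(inter == "C2"), float(cul == "C2"), float(oxy == "C2"),
    ])


def fit_ols(rows, log10_yields):
    """Least-squares estimate of the coefficients for log10(yield)."""
    X = np.vstack(rows)
    y = np.asarray(log10_yields, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def predict_yield(beta, row):
    """Predicted yield (not log-transformed) for one encoded experiment."""
    return 10.0 ** float(row @ beta)


# Synthetic demonstration: invented coefficients, random experiments, no noise.
rng = np.random.default_rng(0)
made_up_beta = np.array([-1.0, 0.0, -0.15, 0.0, 0.5, 0.3, 0.7, 0.8, 0.5, 0.1])
two, three = ("C1", "C2"), ("C1", "C2", "C3")
rows = [
    encode(pri=rng.integers(5, 13), sec=rng.integers(1, 11),
           ove=three[rng.integers(3)], kno=two[rng.integers(2)],
           nut=two[rng.integers(2)], inter=two[rng.integers(2)],
           cul=two[rng.integers(2)], oxy=two[rng.integers(2)])
    for _ in range(60)
]
log10_y = np.vstack(rows) @ made_up_beta
beta_hat = fit_ols(rows, log10_y)

for name, value in zip(COLUMNS, beta_hat):
    print(f"{name:10s} {value: .3f}")
```

Because Category 1 of every ordinal variable is absorbed into the intercept, only the Category 2 and Category 3 dummies appear as columns, which keeps the design matrix full rank.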

Results and Discussion
We constructed simple models that linked several numerical and ordinal variables affecting the yield of chemical production from S. cerevisiae. The accuracy of the coefficients obtained in Equation 4 was evaluated based on R² and the P-values. Here, we used a P-value of 0.1 as the limit below which a result was considered significant [14]. Of the eight variables specified in our model, SEC, OVE, KNO, NUT, INT and CUL had P-values below 0.1. The P-value of each variable is listed in Table 3. Figure 2A shows a plot of the production yields obtained experimentally against those obtained from model prediction for the corresponding conditions. The fit of this model to the dataset had an R² value of 0.55, which reflected a moderate discrepancy between the reported and the model-predicted yields. Figure 2B plots the residuals of the model fit. The residuals appeared to scatter randomly around zero, so a linear model was appropriate for describing the experimental data.
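The two diagnostics mentioned here, R² and the residuals, can be computed as in the following minimal sketch. The observed and predicted values are made up for illustration and are not the values behind Figure 2; the function names are hypothetical.

```python
import numpy as np


def residuals(observed_log10, predicted_log10):
    """Observed minus predicted values, both on the log10(yield) scale."""
    return (np.asarray(observed_log10, dtype=float)
            - np.asarray(predicted_log10, dtype=float))


def r_squared(observed_log10, predicted_log10):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs = np.asarray(observed_log10, dtype=float)
    res = residuals(obs, predicted_log10)
    ss_res = float(np.sum(res ** 2))
    ss_tot = float(np.sum((obs - obs.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Made-up log10 yields for four hypothetical experiments.
obs = [-2.0, -1.5, -1.0, -0.5]
pred = [-1.8, -1.6, -1.1, -0.4]

print("residuals:", residuals(obs, pred))
print("R^2:", round(r_squared(obs, pred), 3))
```

A residual plot like Figure 2B is then simply these residuals plotted against the predicted values; a roughly symmetric scatter around zero is what supports the linear form.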
Interestingly, the number of enzymes in the primary pathway (PRI) did not significantly affect production yield (P-value = 0.76) (Table 3). This suggested that the rate-limiting steps for increasing chemical production flux often lay in the pathway downstream of central metabolism. The coefficient of SEC was negative, which suggested that the length of the pathway downstream of central metabolism negatively affected production yield.

Table 1 (fragment). OXY (oxygen conditions): Category 1, fermentation occurred in aerobic conditions; Category 2, fermentation occurred under oxygen-limited conditions (anaerobic or microaerobic).
Note: the input of ordinal variables was specified using a binary system, 1 and 0. When a category (e.g., overexpression Category 2) applied, the value 1 was assigned to OVE C2; otherwise, the value 0 was assigned.

Table 2. Dataset used for the linear regression.

Since the number of steps in central metabolism (PRI) did not significantly affect production yield, we also performed the regression with PRI excluded. As shown in Table 3, regression using Equation 2 with the exclusion of the variable PRI did not change the R² value. This result indicated that the number of enzymatic steps in primary metabolism did not significantly affect product yield. Presumably, fluxes in central metabolic pathways are typically high and robust [16] compared to those in downstream secondary pathways. It has been demonstrated recently that production of chemicals was significantly improved only when the capacity of a downstream pathway was increased [17].
Metabolic engineering typically involves pathway modification [16-22] to shift metabolic fluxes toward a desired product or to permit the use of an alternative carbon source. We defined the variables OVE and KNO in Equation 2 to capture the effects of pathway overexpression and deletion, respectively. The regression of experimental data using Equation 2 showed that the coefficients of OVE C2 and OVE C3 had positive values (Table 3). The model thus captured the contribution of both pathway overexpression and gene deletion to increased product yield in S. cerevisiae. The high P-value of OVE C2 (0.98) indicated that, statistically, overexpressing a small number of genes (1 to 2) could not be relied on to improve production yield. However, the coefficient of OVE C3 (= 0.52; P-value = 0.07) indicated that modifying multiple genes was effective in resolving bottleneck steps. This observation is consistent with the fact that metabolic fluxes generally do not respond sensitively to changes in the activity of a single enzyme, but are controlled by all key enzymes along the biosynthesis pathway. The regression coefficient of KNO C2 was also positive (= 0.31, P-value = 0.08), and thus the removal of competing pathways could be effective in increasing production yield.
It is well known that bioprocess conditions affect cellular viability and product yield. Our model suggested that fermentation in a well-controlled bioreactor improved production yield by 3.2 times (CUL C2: $10^{\beta_{CUL,C2}} = 10^{0.51}$). The model further suggested that fermentation under anaerobic or microaerobic conditions could enhance yield compared to aerobic fermentation. However, this enhancement was not statistically significant (P-value = 0.32). This observation could be explained by the fact that S. cerevisiae produces fermentative products (ethanol and glycerol) even under aerobic conditions in glucose-sufficient medium (the Crabtree effect) [18,19]. Therefore, aerobic metabolism in S. cerevisiae could operate similarly to metabolism under oxygen-limited conditions. The coefficient for the variable INT was 0.77, meaning that supplementation with a precursor metabolite translated to an approximately six-fold increase in product yield (P-value = 0.02). Similarly, the addition of nutrients (such as yeast extract) also significantly increased production yield (the coefficient of NUT C2 was 0.73). The contributions of INT and NUT to product formation indicated that intermediates and nutrients provided building blocks or energy sources that relieved rate-limiting steps in biosynthetic pathways.
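Because the model is linear in log10 of the yield, each coefficient converts to a multiplicative fold change of 10 raised to that coefficient. The short sketch below applies this to the coefficients quoted in the text; the dictionary keys and the helper name `fold_change` are illustrative labels, not names from the original analysis.

```python
def fold_change(beta):
    """Multiplicative effect on yield of a coefficient on the log10 scale."""
    return 10.0 ** beta


# Coefficients as quoted in the text (log10 scale).
reported = {
    "OVE_C3": 0.52,  # multiple-gene overexpression
    "KNO_C2": 0.31,  # knockout of competing pathways
    "CUL_C2": 0.51,  # well-controlled bioreactor
    "INT_C2": 0.77,  # precursor/intermediate supplementation
    "NUT_C2": 0.73,  # nutrient supplementation
}

folds = {name: fold_change(beta) for name, beta in reported.items()}
for name, fold in folds.items():
    print(f"{name}: about {fold:.1f}-fold")

# With no interaction terms in the model, effects combine by adding
# coefficients, i.e. by multiplying fold changes.
combined = fold_change(reported["CUL_C2"] + reported["INT_C2"])
print(f"bioreactor plus precursor: about {combined:.0f}-fold")
```

The combined figure assumes the effects are independent, as the additive form of Equation 2 implies; the model itself does not test for interactions between variables.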
We used Equation 2 to compute the production yield of chemicals according to the specifications listed in Table 2. For ethanol production, the experimental values were generally higher than the model predictions. Indeed, the reported maximum ethanol yield could reach 0.5 mol C-ethanol/mol C-glucose [20], which could be several-fold higher than the model prediction.

[Figure axis label: Log of Experimental Yield]
Regression of the data using Equation 6 improved the R² value from 0.55 to 0.58, suggesting that ethanol is better treated as a central metabolite of S. cerevisiae. Using Equation 6, we predicted ethanol production for a recent reference [21] by specifying PRI = 11, SEC = 1 (cellulose degradation step), OVE = C3, KNO = C1, NUT = C2, INT = C1, CUL = C1, and OXY = C2. The ethanol production yield calculated by Equation 6 was 0.31. This value was in good agreement with the reported value of ~0.4 [21].

Model Applications and Limitations
The main application of the model is to predict biosynthesis yield from S. cerevisiae. The model was validated with "unseen data" (Figure 2C) from randomly selected new publications (2010 to 2011). The model predicted the yields from the experimental conditions described in these papers [22][23][24][25][26]. Most yield data were close to the model predictions, and the predictive power of the model was consistent with the model quality described in Table 3. Furthermore, the model can reveal metabolic features of S. cerevisiae. For example, the modified model (Equation 6) showed that it was better to treat the ethanol pathway as a primary route in cell metabolism, reflecting the strong capacity of yeast for ethanol fermentation, possibly the result of the long selection of yeast as an alcohol producer through human history. The model can also be useful for comparing productivity with other yeast species (Figure 3). For example, the riboflavin producer Candida famata exhibits a high riboflavin productivity (2 to 3 orders of magnitude higher than the model prediction) [27]. Pichia pastoris, a common species for protein expression, shows high S-adenosyl-L-methionine productivity if a large amount of the intermediate methionine is repeatedly added to the medium [28]. Pichia stipitis also has high yields of L-lactic acid and ethanol from glucose and xylose [29]. Figure 3 demonstrates that some yeast species are able to exploit their native pathways for biosynthesis of certain products with extraordinary efficiency (better than S. cerevisiae); therefore, these species may be alternative hosts for certain biotechnology applications.
The accuracy of the model predictions could be poor for some products because of several limitations in model development. First, the categories were rough estimates of experimental conditions, especially for variables related to gene modification (OVE and KNO), and yields could differ widely even within the same category. Second, some products, despite high synthesis rates, were either not very stable or difficult to accumulate in large quantities because of consumption by downstream pathways or product degradation (e.g., glycerol 3-phosphate [30]). Their yields could be significantly lower than the model predictions even though the actual flux to the product was high. Third, the coefficient $\beta_{SEC}$ from model regression could not account for the large variance in biosynthesis efficiency, or for potential feedback inhibition, in secondary pathways. For example, butanol synthesis is significantly improved via non-fermentative amino acid pathways compared to traditional acetyl-CoA routes [31], because amino acid synthesis pathways in microorganisms are more effective than other heterologous pathways. Fourth, because of limited information in the references, the yield calculation could not precisely include CO2 fixation (e.g., overexpression of the native carboxylase pathway: pyruvate + CO2 → oxaloacetate) [32] or the utilization of nutrients in rich medium. Fifth, the model neglected enzymatic steps related to energy metabolism (such as ATP and NADPH synthesis), although cofactor imbalance can also affect product yields.
Comparison to the previously published E. coli model [33]
Recently, we constructed an E. coli model using the same modeling approach. Compared to the E. coli model, the S. cerevisiae model shows several differences: 1. Oxygen conditions had a more significant impact on biosynthesis yield in E. coli than in S. cerevisiae; 2. Genetic modification in E. coli had higher uncertainty in its metabolic outcomes; 3. For metabolic pathways from precursors to final products, the loss of yield per biosynthesis step in S. cerevisiae (~30%) is higher than that in E. coli (10~20%). Interestingly, the E. coli model indicates that primary metabolism influences product yield (a relatively small P-value of 0.06), which suggests that the balance of precursor production from central metabolism is also an important consideration for metabolic engineering of E. coli. For example, it has been demonstrated that lycopene production in E. coli was enhanced by redirecting the carbon flux from pyruvate to G3P [34], but feeding other central metabolite precursors (such as pyruvate) could not improve lycopene production. In contrast, the S. cerevisiae model indicates that the number of steps in central metabolism is less likely to be a bottleneck in the production of metabolites derived from it; the bottlenecks are more likely in the secondary pathways (from central precursors to the final product). Therefore, metabolic strategies should focus on the secondary pathways to have a better chance of increasing the final yield. Although modification of central metabolism may affect microbial physiology, a few studies indicate that central metabolism in S. cerevisiae is robust because of its importance to cell vitality. For example, S. cerevisiae may maintain central metabolic fluxes via gene duplication and alternative pathways under different environmental and physiological conditions [16,35]. Therefore, the inflexibility of central pathways in S. cerevisiae is likely to render metabolic engineering strategies ineffective when they target enzymes in central metabolism. In general, the distinct metabolic features of yeast and bacteria can be an important consideration when choosing a production host.
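The per-step losses quoted above compound along a pathway, which is why the difference between the two hosts grows quickly with pathway length. As a worked illustration, assume the loss per secondary step is exactly 30% for S. cerevisiae and 10% to 20% for E. coli, and consider a hypothetical secondary pathway of n = 5 steps. The symbols $\ell$ (fractional loss per step) and $Y_0$ (yield with no secondary steps, all else equal) are notation introduced here, not taken from either model.

```latex
\frac{Y_n}{Y_0} = 10^{\beta_{SEC}\, n} = (1-\ell)^{n},
\qquad \ell = 1 - 10^{\beta_{SEC}}

\ell = 0.30:\quad \beta_{SEC} = \log_{10} 0.7 \approx -0.155,
\qquad \frac{Y_5}{Y_0} = 0.7^{5} \approx 0.17

\ell = 0.20:\quad \frac{Y_5}{Y_0} = 0.8^{5} \approx 0.33
\qquad\qquad
\ell = 0.10:\quad \frac{Y_5}{Y_0} = 0.9^{5} \approx 0.59
```

Under these assumptions, a five-step secondary pathway retains roughly 17% of the yield in S. cerevisiae, compared with roughly 33% to 59% in E. coli.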

Conclusions
Although S. cerevisiae has been widely used as a robust industrial organism for metabolic engineering applications, many of its metabolic features for biosynthesis under various conditions remain unknown. In this study, the statistical model for yeast biosynthesis permits a priori calculation of the final product yield achievable with current biotechnology. Unlike other in silico models based on mass balance or thermodynamics (such as FBA models) [36,37], our model is based on a statistical analysis of published data using numerical and ordinal variables (categorized experimental conditions). The model has three applications. 1. The yield prediction takes into account the genetic design of the microbial host system and the "suboptimal" conditions under which the fermentation process occurs. 2. The model may identify effective metabolic strategies and, at the same time, quantify their degree of uncertainty (i.e., the possibility of failure). For example, statistical analysis shows that, for S. cerevisiae, metabolic bottlenecks are more likely to lie in the secondary metabolic pathways than in the primary pathways, which can narrow down the genetic targets and avoid futile work. 3. The model may be used to qualitatively benchmark yields of different engineered production platforms.