Genome, secretome and glucose transport highlight unique features of the protein production host Pichia pastoris

Background Pichia pastoris is widely used as a production platform for heterologous proteins and model organism for organelle proliferation. Without a published genome sequence available, strain and process development relied mainly on analogies to other, well studied yeasts like Saccharomyces cerevisiae. Results To investigate specific features of growth and protein secretion, we have sequenced the 9.4 Mb genome of the type strain DSMZ 70382 and analyzed the secretome and the sugar transporters. The computationally predicted secretome consists of 88 ORFs. When grown on glucose, only 20 proteins were actually secreted at detectable levels. These data highlight one major feature of P. pastoris, namely the low contamination of heterologous proteins with host cell protein, when applying glucose based expression systems. Putative sugar transporters were identified and compared to those of related yeast species. The genome comprises 2 homologs to S. cerevisiae low affinity transporters and 2 to high affinity transporters of other Crabtree negative yeasts. Contrary to other yeasts, P. pastoris possesses 4 H+/glycerol transporters. Conclusion This work highlights significant advantages of using the P. pastoris system with glucose based expression and fermentation strategies. As only few proteins and no proteases are actually secreted on glucose, it becomes evident that cell lysis is the relevant cause of proteolytic degradation of secreted proteins. The endowment with hexose transporters, dominantly of the high affinity type, limits glucose uptake rates and thus overflow metabolism as observed in S. cerevisiae. The presence of 4 genes for glycerol transporters explains the high specific growth rates on this substrate and underlines the suitability of a glycerol/glucose based fermentation strategy. Furthermore, we present an open access web based genome browser .


Background
Yeasts have attracted renewed interest in the last few decades as production hosts for biopharmaceutical proteins as well as for bulk chemicals. The methylotrophic yeast Pichia pastoris (Guillermond) Phaff (1956) is well reputed for efficient secretion of heterologous proteins [1], and has come into focus for metabolic engineering applications recently. Especially reengineering of the N-glycosylation pathway has enabled the production of heterologous proteins with human-like N-glycan structures [2][3][4]. While protein production is the major application of P. pastoris, production of metabolites has come into research focus recently too [5,6]. Apart from these biotechnological applications, it is widely used as a model for peroxisome [7] and secretory organelle research [8]. P. pastoris has recently been reclassified into a new genus, Komagataella [9], and split into three species, K. pastoris, K. phaffii, and K. pseudopastoris [10]. Strains used for biotechnological applications belong to two proposed species, K. pastoris and K. phaffii. The strains GS115 and X-33 are K. phaffii, while the SMD series of protease deficient strains (most popular SMD1168) is classified into the type species, K. pastoris. Apart from these strains which have been made available by Invitrogen, research labs and industry use different other strains belonging to either of these two species, and no trend towards a superior expression level of one of the two species has been observed. In order to provide a common information basis across the different strains, we have performed this work with the type strain (DSMZ 70382) of the type species K. pastoris, which is the reference strain for all the available P. pastoris strains. In coherence with the published literature, we name all strains P. pastoris, which thus stands for the entire genus Komagataella. As other strains, DSMZ 70382 was isolated from tree exudate, in this case from the chestnut tree.
The majority of P. pastoris processes described so far utilize methanol as substrate and inducer for heterologous protein production. While tight gene regulation and high product titers can be achieved with this strategy, the disadvantages as large scale use of a flammable substrate, high heat production and oxygen consumption, and significant cell lysis have been reported. Apart from technological challenges in large scale fermentation, this leads to significant contamination of culture supernatants with intracellular proteins including proteases [11]. P. pastoris has been described to secrete some heterologous proteins like human serum albumin [12] or as recently reported glycoengineered antibodies [13] in the g L -1 range, while naturally secreted proteins account only for low amounts [14], which supports the easy production of highly pure proteins. However, several secreted P. pastoris proteins are observed as contaminants in culture supernatants, requiring elaborate product purification and analytical effort. A detailed characterization of the secretome would signifi-cantly improve production and quality control of biopharmaceuticals produced with this expression system. The secretomes of few yeasts and filamentous fungi have been analyzed experimentally. Computational analyses of yeast genomes predicted approximately 200 potentially secreted proteins [15,16]. Secretomes of filamentous fungi contain numerous enzymes for degradation of starch, cellulose, lignin and similar plant polymers [17][18][19]. However, these predictions suffer from some limitations. As certain targeting sequences are not recognized, the predictions may contain proteins which are retained in cellular organelles. Most cell wall associated proteins can be predicted, but due to the fluctuating nature of the cell wall during growth and budding a fraction of these may be released from the cell wall structure and add to the secretome. Finally the actual composition of the secretome will depend on growth conditions and the actual expression of the genes encoding potentially secreted proteins. Therefore the extracellular proteome of P. pastoris was analyzed here and compared to the predicted secretome.
Substrate uptake kinetics determine growth kinetics and the characteristics of biotechnological processes. P. pastoris is described as a Crabtree-negative yeast, featuring respiratory metabolism under glucose surplus [20]. A major reason for the easy growth to high biomass concentrations is assumed in the endowment with hexose transporters and their features. We report here the determination and analysis of the P. pastoris draft genome sequence and its application in correlating in silico and mass spectrometric analysis of the extracellular proteome. Furthermore, a comparative analysis of hexose transporters allows drawing conclusions towards glucose uptake kinetics, a major determinant of growth and bioprocess characteristics in relation to substrate supply. Additionally, a web based database with search functions and annotation data for analysis of the genome sequence is reported.

Sequencing
The genome of P. pastoris was sequenced using two next generation sequencing methods. First a Roche GS-FLX run was used to take advantage of the longer reads (400 nts) of this method, which was then complemented by a paired end run with the short read method of Illumina Genome Analyzer (36 nts) to improve the quality of the sequence. The combined result of both methods was a draft genome of 326 assembled contigs of which 93 were larger than 10 kb and 60 between 1 and 10 kb. The longest contig comprised 419,475 nts and the shortest 128 nts. 125 of the 326 contigs could be aggregated into 38 supercontigs. Overall 9,405,451 bases were sequenced with a coverage of 22× with Roche GS-FLX and 60× with Illu-mina GA. Key statistical data of the draft genome are presented in table 1.

Gene prediction
We initially predicted 7,935 open reading frames using two different gene finders. Manual curation reduced this number to 5,450 ORFs. The eukaryotic gene finder Augustus has been pre-trained on a number of datasets including various yeast species. Of these, Candida guilliermondii, Debariomyces hansenii and Pichia stipitis were selected for their relatively close relation to P. pastoris (based on sequence similarity), and Saccharomyces cerevisiae as a reference yeast species with the best sequence annotation. In addition the prokaryotic gene finder Glimmer3 was applied since many eukaryotic gene finders overpredict intron containing genes. As yeast genomes are generally compact a large amount of intron containing genes was not expected. All putative ORFs < 100 nts or comprising a starting codon other than ATG were excluded from the set except for genes on contig borders. 194 of the predicted genes are truncated because they crossed contig borders. Ribosomal RNAs were annotated by homology to S. cerevisiae rRNAs. Contrary to S. cerevisiae, the 5S rRNA is not part of the cluster containing 18S, 26S and 5.8S rRNA but spread across the genome. 149 transfer RNAs were identified using tRNA Scan, which is lower than the average number of tRNAs identified in other yeasts (216 on average).

Functional Annotation
Functional annotation was performed computationally with a reciprocal best hit (RBH) strategy, using BLAST [21] searches against a selected dataset of the subphylum Saccharomycotina to which P. pastoris also belongs, and the Uniprot database. All P. pastoris genes and proteins that were publicly available at the NCBI (National Center for Biotechnology Information) were manually compared against our predictions. The native genes and proteins were present in our set. The average identity between these genes deposited in NCBI and their homologs in the present genome sequence was 95%. For all proteins that were predicted to be secreted and all others that are discussed here the functional annotation was manually curated. The distribution in GO functional terms of all functionally annotated ORFs was compared to S. cerevisiae (figure 1). The distribution is rather similar with differences observed mainly in the groups organelle organization, protein modification, lipid, amino acid and cofactor metabolism.

Secretome
To validate the secretome prediction pipeline (see Materials and Methods) used for P. pastoris, it was applied to the S. cerevisiae proteome beforehand. The majority of proteins which were described to be extracellular in the Saccharomyces genome database SGD [22] were found in the secreted dataset, for the rest a GPI-anchor signal was predicted. Due to the good performance of the prediction pipeline with S. cerevisiae and the successful application of similar methods for K. lactis [15] and C. albicans [16] respectively, a high accuracy for the secretome predictions was expected for P. pastoris as well. The predicted secretome of P. pastoris comprises 88 putative proteins of which 55 could be functionally annotated. Additionally, 172 ORFs were predicted to encode proteins entering the general secretion pathway but being localized in different cellular compartments (for the complete list see additional file 1). Obviously the secretome prediction cannot easily discriminate between ER/Golgi localized and secreted proteins, as the chaperone Kar2 and protein disulfide isomerise (Pdi1) appear among the predictions. Therefore the experimental determination of the extracellular proteins is essential for an assessment.
To identify the extracellular secretome of P. pastoris, the strain DSMZ 70382 was grown in chemostat culture on glucose as limiting carbon source, reaching 26.4 ± 0.1 g L -1 dry biomass (YDM). The supernatants contained 407 mg L -1 total protein. Analysis by SDS-PAGE indicated that approximately 15 distinct protein bands, ranging from 12 kDa to 170 kDa, were present in the culture supernatant (figure 2a). On 2D gels, 28 protein spots were visible at higher abundance, at least 7 thereof being obviously isoforms of other protein spots with identical MW but differ- Categorization of the P. pastoris annotated genome compared to S. cerevisiae  ent pI (figure 2b). Almost all highly abundant proteins ran at low pI values between 3 and 5.5. As the cellular viability was 99% throughout the cultivation, and total DNA content of the supernatants was 1.12 ± 0.03 μg mL -1 , a maximum of 1% lysed cells was estimated, accounting for maximally 10% of total protein in the supernatant. Therefore, the potential contamination by intracellular protein was assumed to be minor. A 1D SDS PAGE gel was cut into 21 slices and analyzed by LC-ESI-MS/MS. Detailed data on protein identification are found in additional file 2. Twenty different proteins were identified (table 2), 12 of which appeared in more than one gel slice (additional file 2). Nine proteins ran at higher molecular weight than predicted from the sequence. Eight out of these proteins contained potential N-glycosylation sites (table 2 and additional file 2) and corresponded to detected glycoproteins (figure 2a). Apparently 6 of these proteins were subject to proteolysis. However, the proteolytic activity in the supernatants was very low (equivalent to 11 ± 0.9 ng mL -1 trypsin), and in contrast to other yeast secretomes, no protein with putative proteolytic activity was identified.
Fourteen of the proteins identified by homology are obviously secreted or cell wall bound, 6 of them with homology to glucanases. The other proteins with extracellular localization comprise 7 cell wall modifying enzymes and 1 secreted protein of unknown function. Four proteins are homologous to intracellular proteins (including glyceraldehyde phosphate dehydrogenase which has been described to be also located at the cell wall in S. cerevisiae [23], and for 2 no similarity was found. The putative intracellular proteins mainly comprise glycolytic enzymes and ribosomal proteins which are highly abundant on glucose [24]. A comparison of predicted to identified secretome reveals a good correlation of prediction, putative function, and experimentally determined localization (table 2). All proteins homologous to intracellular proteins were predicted to be intracellular, and only for 2 of the 14 putatively secreted proteins the prediction was unclear or slightly below threshold.

Hexose transporters
Fourteen putative sugar transporters all belonging to the major facilitator superfamily (MFS) were identified by sequence similarity. All P. pastoris sugar transporters feature the classical 12 transmembrane domains, and contain the PESP motif and at least one of the two sugar transporter signature sequences. Contrary to S. cerevisiae, which comprises 20 isogenes for low and high affinity hexose transport, only two putative transporters with sequence similarity to S. cerevisiae transporters are present in the P. pastoris genome. While PIPA00236 possesses more than 60% identity to S. cerevisiae HXT-family proteins, and the low-affinity transporters of Kluyveromyces lactis Rag1 [25] and Hansenula polymorpha Hxt1 [26] on the amino acid level, PIPA08653 shows only low similar-ity (max. 37% identity/58% positives) to these proteins as well as to other P. pastoris sugar transporters. Although all 5 conserved amino acids that have been postulated to be required for high affinity transporters in S. cerevisiae Hxt2 [27] are present also in the respective translated protein sequence of P. pastoris gene PIPA00236, disruption of the gene led to impaired growth on high concentrations of glucose (2%). Disruption of PIPA08653 did not show a distinct growth phenotype. This indicates that PIPA00236 encodes the major low affinity glucose transporter in P. pastoris.
For high affinity transport, two P. pastoris proteins (PIPA02561 and PIPA00372) with high sequence similarity (>65% identity) to K. lactis high affinity glucose transporter Hgt1 were identified (see figure 3). The potential transporter-like hexose sensor is encoded by PIPA01691, and lacks the C-terminal "glucose sensor domain" as do the respective orthologous sensors in H. polymorpha (Hxt1) and Candida albicans [26]. Additionally a gene with similarity to quinate permease of P. stipitis and filamentous fungi was identified, which has putative orthologs in many other yeast species, but is missing in S. cerevisiae. According to Barnett et al. [28] P. pastoris cannot utilize quinate as a carbon source, although some of the genes required for the utilization of quinate are part of the shikimate pathway leading to the production of aromatic amino acids, and are present as part of the pentafunctional AROM protein. However, regulatory proteins of the quinate pathway are missing in the genome of P. pastoris. Interestingly, P. pastoris possesses four transporters that are highly similar to putative glycerol transporters from K. lactis (KLLA0A03223g) and Yarrowia lipolytica (YALI0F06776g), and weakly similar to the S. cerevisiae glycerol transporter Slt1. Sequence similarities of the proteins discussed above to their respective orthologs in S. cerevisiae, P. stipitis, H. polymorpha, K. lactis, and Emericella nidulans are illustrated in figure 3.

Database, genome browser
To make the genomic data accessible it was loaded into a relational database. For visualization a genome browser was installed on a web server and connected to the database.
The draft genome sequence data are deposited at EMBL-EBI, accession number CABH01000001 -CABH01000326.

Discussion
The predicted size of the haploid genome of P. pastoris [30] was confirmed here to comprise 9.4 Mb, which is Branch length dendrogram of sugar transporters and related proteins of different yeasts smaller than the genomes of other yeasts, spanning from 10-20 Mb [31]. Nevertheless the number of functionally annotated genes is comparable to other yeasts, which can be attributed to the fact that P. pastoris contains fewer genome redundancies compared e.g. to S. cerevisiae and D. hansenii, which have undergone genome duplications followed by partial genome losses during evolution [32]. While P. pastoris contains specific subclasses of genes for methanol metabolism and peroxisome synthesis, structure and degradation which are present only in methylotrophic yeasts, most metabolic enzymes are present only in single copies, and the number of secreted proteins is low. To verify the quality of gene prediction, all 173 P. pastoris genes and 245 proteins currently deposited in NCBI were BLAST searched among the predicted gene list. All of the P. pastoris specific genes were present, indicating a high quality of gene prediction.
The secretomes of K. lactis and C. albicans have been predicted computationally [15,16], yielding 178 ORFs of K. lactis and 283 of C. albicans. The C. albicans secretome apparently is more complex and contains numerous lipases, proteases and agglutinin-like proteins, while both for K. lactis and P. pastoris only few enzymes apart from glucanases and chitin modifying enzymes appear. As P. pastoris utilizes only few carbon sources [28] it appears obvious that neither proteolytic, lipolytic or saccharolytic activities are secreted for substrate utilization. Yeast glucanases and chitinases are required for cell wall plasticity during cell growth and division [33]. While these enzymes are commonly regarded to be cell wall associated, it is plausible that they reach the culture supernatant during cell wall remodelling, indicating that a distinct border cannot be drawn between cell wall and the exterior space.
Fourteen of the 20 proteins identified in the culture supernatant of P. pastoris were homologous to proteins implicated in cell wall or extracellular functions. No other secretory enzyme homologs were identified, further indicating that cell wall associated proteins are the essential constitutents of the secretome of glucose grown P. pastoris. The computationally predicted secretome contains all secreted proteins plus mainly soluble cellular proteins containing a signal peptide but no transmembrane domains. Thus these predictions obviously overestimate secretory proteomes ( figure 4). The culture supernatant of K. lactis contained significantly more (82) proteins [15] of which 34 were predicted to be secreted or cell wall bound, and the rest were assumed to be localized either to the ER or the cytosol. The latter group of proteins indicates a significant release of intracellular proteins in this study, probably by cell lysis due to the culture conditions. The low concentration, together with the small number of actually secreted proteins from P. pastoris highlights a major advantage of this protein production system, as secreted products are much less contaminated with host cell protein. Jahic et al. [34] have shown that host cell protein released from P. pastoris grown on methanol mainly derives from cell lysis, which occurs to a much lower extent upon growth on glucose. Combined with the fact that strong promoters for use on glucose are available [34,35], these data provide convincing arguments for a reconsideration of methanol based protein production with P. pastoris. The toxicity of methanol and several of its metabolites is the main reason for cell lysis and consequently also protease leakage to the culture supernatant. Additionally other host cell proteins are released, leading to significant contamination of protein products. A common approach to reduce product proteolysis is the knock out of protease genes. However, multiple protease knockout strains tend to be growth retarded, so that it appears reasonable to employ a production strategy based on glucose media which avoids the detrimental effects of methanol at all. Detailed knowledge of the secreted host cell proteins, as presented here, can have a strong positive effect on product purification and quality control, as specific assays can be developed. Additionally a knock out of major secreted proteins can reduce the host cell protein load significantly [36].
Substrate uptake kinetics determines growth kinetics and the characteristics of biotechnological processes. The fermentative (Crabtree-positive) yeast S. cerevisiae consumes glucose at high rates when supplied with high concentrations. This exceptionally high glucose uptake rate is attributed to high abundance of hexose transporters, encoded by more than 10 isogenes [37]. Respiratory (Crabtree-negative) yeasts limit glucose uptake, as they contain few hexose transporter genes, encoding energy dependent symporters with high affinity to glucose [38]. The endowment of P. pastoris with hexose transporters is in good accordance to other respiratory yeasts such as K. lactis, H. polymorpha and P. stipitis, all having a reduced number of hexose transporters in comparison to S. cerevisiae. Moreover, Crabtree-negative yeasts usually exhibit K m values in the micromolar range for glucose [37], due to their very high-affinity transporters such as K. lactis Hgt1, which is an ortholog of P. pastoris PIPA02561 and PIPA00372. While K m values for P. pastoris specific transporters remain to be determined in future, conclusions to glucose uptake behavior can be drawn. Accordingly, specific glucose uptake rate is limited to q Smax = 0.35 g g -1 YDM h -1 (at growth rates near μ max = 0.193 h -1 ) in P. pastoris chemostat cultivations [39], in comparison to q Smax = 2.88 g g -1 YDM h -1 in fully aerobic S. cerevisiae [40]. The limited glucose uptake prevents Crabtree-negative yeasts such as P. pastoris from extensive overflow metabolism, which leads to the aerobic formation of ethanol and a reduced biomass yield at high external glucose concentrations in S. cerevisiae.  This difference is also reflected in the very high biomass concentrations (more than 100 g l -1 ) that can be achieved in P. pastoris cultivations. For heterologous protein production, aerobic ethanol formation is a substantial problem, because it lowers the yield of the desired product due to a lower biomass concentration.

Categorization of P. pastoris secretome
Interestingly, P. pastoris contains four genes encoding putative H + /glycerol symporters, contrary to all other sequenced yeasts up-to-date. Consequently, the maximum glycerol uptake rate of P. pastoris is q Glycerol_max = 0.37 g g -1 YDM h -1 . This is substantially higher than the uptake rates reported for S. cerevisiae (q Glycerol_max = 0.046 g g -1 YDM h -1 ) and many other yeast species [41]. The ability to grow on glycerol as a single carbon and energy source -a mode of cultivation widely applied for generation of biomass with P. pastoris prior to methanol induction or glucose fed batch -is dependent on the activity of a constitutive salt-independent active glycerol transport by the H + /glycerol symport and has also been reported for Pichia sorbitophila and Pichia jadinii [41]. Specific growth rates of these yeasts on glycerol are similar to the specific growth rates that can be obtained on glucose (e.g. for P. pastoris on mineral media μ Glycerol_max = 0.26 h -1 , μ Glucose_max = 0.19 h -1 ), whereas yeasts lacking the activity of such a type of carrier have significantly reduced growth rates on glycerol. The high specific glycerol uptake rate, enabled by the exceptional endowment with specific transporters emphasizes the suitability of glycerol as a substrate for biomass growth.

Conclusion
The availability of genome data has become an essential tool for cell and metabolic engineering of biotechnological production organisms. This work highlights major advantages of P. pastoris as a protein production platform and the benefits of glycerol/glucose based production technology. Apart from lower heat production and oxygen demand compared to methanol based processes, glucose grown cultures display higher viability and essentially no protease release to the culture supernatant. Furthermore detailed insights into the sugar transport will enable rational modulation of substrate fluxes, especially for efficient metabolite production.

Strain
The P. pastoris type strain (DSMZ 70382 = CBS704) was selected as the source of genomic DNA, and used for all experimental work. Genomic DNA was prepared as described in Hohenblum et al. using the Qiagen Genomic G-20 kit [42].

Sequencing
Genomic DNA was sequenced by GATC Biotech AG, Konstanz (Germany) with a Roche GS FLX-Titanium Series complemented by an Illumina Genome Analyzer paired end run. The reads were assembled with SeqMan NGen by DNASTAR. To verify the sequencing quality all P. pastoris gene and protein sequences available at NCBI were downloaded and the sequences were compared using BLAST searches.

Gene prediction and annotation
Gene prediction was performed with the eukaryotic gene finder Augustus [43] using the option for overlapping genes as well as the prokaryotic gene finder Glimmer3 [44]. Predicted open reading frames were kept if they were longer than 100 nucleotides and started with ATG, except for genes predicted on contig boarders. The ORF sets were merged and made non redundant using the clustering program cd-hit-est [45] with a similarity cut-off of 95%.
Annotation was done by a reciprocal protein BLAST against a dataset consisting of the publicly available Saccharomycotina species and the UNIPROT protein database with an E-value threshold of 10 -10 . All P. pastoris proteins and genes available at NCBI, all proteins that were predicted to be secreted and all sugar transporters were manually curated. Gene Ontology annotation was done for all proteins with a homolog in S. cerevisiae.
Ribosomal RNA annotation was done through homology with S. cerevisiae using nucleotide BLAST against the P. pastoris contigs, and the results were manually analyzed. tRNAs were localized using the program tRNAscan-SE [46]. Gene predictions were manually curated using BLASTx.

In silico secretome prediction
A similar method was used as described to predict the secretomes of K. lactis [15] and C. albicans [16], respectively. Proteins were considered to be secreted when an N-terminal signal peptide existed but neither a transmembrane domain (except one within the first 40 residues), nor a GPI-anchor, nor any localization signal to other organelles were identified.