Structural genomics of human proteins – target selection and generation of a public catalogue of expression clones

Background The availability of suitable recombinant protein is still a major bottleneck in protein structure analysis. The Protein Structure Factory, part of the international structural genomics initiative, targets human proteins for structure determination. It has implemented high throughput procedures for all steps from cloning to structure calculation. This article describes the selection of human target proteins for structure analysis, our high throughput cloning strategy, and the expression of human proteins in Escherichia coli host cells. Results and Conclusion Protein expression and sequence data of 1414 E. coli expression clones representing 537 different proteins are presented. 139 human proteins (18%) could be expressed and purified in soluble form and with the expected size. All E. coli expression clones are publicly available to facilitate further functional characterisation of this set of human proteins.


Background
The Protein Structure Factory
The Protein Structure Factory (PSF) is a joint endeavour of universities, research institutes and companies from the Berlin area [1,2]. It takes part in the international structural genomics initiative [3,4] and aims at the determination of human protein structures by X-ray diffraction methods and NMR spectroscopy using standardised high-throughput procedures. A complete pipeline has been established for this purpose that comprises cloning, protein expression in small and large scale, biophysical protein characterisation, crystallisation, X-ray diffraction and structure calculation.
It is known that eukaryotic proteins are often difficult to express in Escherichia coli [5]. Only a certain fraction of these proteins can be overproduced in E. coli in sufficient yield without formation of inclusion body aggregates or proteolytic degradation. Alternative expression systems include cell cultures of various eukaryotic organisms.
Affinity tags allow for standardised protein purification procedures. The first vector that was used routinely in the PSF, pQStrep2 (GenBank AY028642, Figure 1), is based on pQE-30 (Qiagen) and adds an N-terminal His-tag [13] for immobilised metal chelate affinity chromatography (IMAC) and a C-terminal Strep-tag II [14,15] to the expression product. pQStrep2 allows for an efficient two-step affinity purification of the encoded protein, as demonstrated in a study of an SH3 domain [16]. The eluate of the initial IMAC is loaded directly onto a Streptactin column. In this way, only full-length expression products are purified and degradation products are removed. However, the two tags, which are flexible unfolded peptides, remain on the protein and may interfere with protein crystallisation, although we could show that crystal growth is possible in their presence even for small proteins [16]. To exclude any negative influence of the affinity tags, another vector, pQTEV (GenBank AY243506, Figure 1), was constructed. pQTEV allows for expression of N-terminal His-tag fusion proteins that contain a recognition site for the tobacco etch virus (TEV) protease, so that the tag can be removed proteolytically.
Codon usage has a major influence on protein expression levels in E. coli [17], and eukaryotic sequences often contain codons that are rare in E. coli. The arginine codons AGA and AGG in particular lead to low protein yield [18]. This can be alleviated by introducing genes for overexpression of the corresponding tRNAs into the E. coli host cells. We have used the plasmid pSE111 (Figure 1), which carries the argU gene, for this purpose. pSE111 is compatible with pQTEV and other common expression vectors. It also carries the lacIq gene for overexpression of the Lac repressor, which is required when using promoters regulated by lac operators. pSE111 was used at the PSF in combination with the expression vectors pQStrep2 and pGEX-6P-1. Strains for overexpression of rare tRNAs are available from Invitrogen (BL21 Codon Plus) and Novagen (Rosetta). The Rosetta strain contains the pRARE plasmid, which confers chloramphenicol resistance and supplies tRNAs for the codons AUA, AGG, AGA, CUA, CCC and GGA [19]. This plasmid is used at the PSF in combination with pQTEV and pGEX-6P-2.
Figure 1. Vector maps.
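The effect of rare arginine codons can be anticipated before cloning by counting them in the coding sequence. The following is a minimal sketch, not part of the PSF pipeline: the function name and the example sequence are hypothetical, and only in-frame AGA and AGG codons are counted.

```python
# Rare arginine codons named in the text (given as DNA codons).
RARE_ARG_CODONS = {"AGA", "AGG"}


def count_rare_arg_codons(cds: str) -> int:
    """Count in-frame AGA/AGG codons in a coding sequence.

    The sequence is read 5'->3' starting at its first base; RNA input
    (with U) is accepted and converted to DNA letters. Trailing bases
    that do not form a complete codon are ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return sum(codon in RARE_ARG_CODONS for codon in codons)


# Hypothetical example: ATG AGA AGG CGT TAA contains two rare Arg codons.
print(count_rare_arg_codons("ATGAGAAGGCGTTAA"))
```

A high count would suggest using a host that carries pSE111 or pRARE, although the text does not state a numerical threshold for that decision.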
Additional file 1 (psfClones.xml) lists the vector and the helper plasmid for overexpression of rare tRNAs that were used for each individual clone.

Selection of target proteins
We selected target proteins with higher-than-average chances of successful expression in E. coli and crystallisation [1]. Proteins for which sequence analysis predicted that structure determination would be difficult were excluded. Starting from the complete set of known human proteins, potentially difficult target proteins and proteins of known structure were excluded according to the following criteria:
• Membrane proteins are known to be complicated targets for structure determination and were excluded. Membrane proteins were identified with the program TMHMM [20,21].
• Since very large proteins are often difficult to express, the maximal length of target proteins was set to 500 amino acids.
• Protein regions that are unstructured or only partially structured [22] may lead to difficulties during protein expression and purification. Unstructured regions are susceptible to proteolytic attack and represent an obstacle to protein crystallisation. A large proportion of intrinsically unstructured protein sequences are characterised by sequence stretches of low complexity and tandem repeats [23]. Proteins with a low-complexity region of more than 20 amino acids in length, as detected by the SEG program, or with more than one low-complexity region were excluded [24,25].
• Coiled-coil proteins were excluded from our target list, since this fold is not novel, and structural analysis of coiled coils requires special attention. Many coiled-coil proteins form hetero-complexes with other coiled-coil proteins and cannot be studied without their binding partner. Coiled-coil domains are long, extended structures which can usually only be crystallised as isolated domains, i.e. expression constructs lacking other domains have to be prepared. To identify coiled-coil proteins, the program COILS was used [26][27][28].
• The cellular localisation of target proteins was assigned with the Meta_A(nnotator) [29,30]. Target proteins annotated to be localised in the extracellular space, endoplasmic reticulum, Golgi stack, peroxisome or mitochondria were excluded from expression in E. coli. Many of these proteins require formation of disulphide bonds for correct folding, but these are generally not formed in the reducing environment of the E. coli cytoplasm. Therefore, these proteins were allocated only for extracellular expression by yeast host cells. Proteins with predicted intracellular localisation, or to which Meta_A(nnotator) assigned no localisation, were expressed in the cytosol of E. coli.
• Potential target proteins were matched to the sequences of proteins with known structure in the Protein Data Bank (PDB) [31]. PSI-BLAST [32] was used to detect even very distant homologies to PDB entries in order to rule out proteins with known folds. This filter was later replaced by a less stringent one, which considers the sequence identity and 'coverage' of matches to PDB sequences. The coverage is the length of the sequence match divided by the protein length. According to the less stringent filter, proteins with 60% or more sequence identity over 90% or more of the sequence length were excluded. As a result, target proteins of which only a part, e.g. a single domain, has a known structure could be included.
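Two of the criteria above are purely numerical: the maximal length of 500 amino acids and the less stringent PDB filter based on sequence identity and coverage. They can be sketched as follows. This is an illustration under stated assumptions rather than the software used by the PSF: the class `PdbMatch`, the function names and the example values are hypothetical, and the other criteria (TMHMM, SEG, COILS, localisation) are not modelled.

```python
from dataclasses import dataclass
from typing import Iterable

MAX_LENGTH = 500       # maximal target length in amino acids (from the text)
MIN_IDENTITY = 60.0    # percent sequence identity threshold (from the text)
MIN_COVERAGE = 0.90    # coverage threshold (from the text)


@dataclass
class PdbMatch:
    """Hypothetical record of one sequence match to a PDB entry."""
    identity_percent: float
    match_length: int


def coverage(match_length: int, protein_length: int) -> float:
    """Coverage = length of the sequence match divided by the protein length."""
    return match_length / protein_length


def has_known_structure(protein_length: int, matches: Iterable[PdbMatch]) -> bool:
    """True if any match reaches both the identity and the coverage threshold."""
    return any(
        m.identity_percent >= MIN_IDENTITY
        and coverage(m.match_length, protein_length) >= MIN_COVERAGE
        for m in matches
    )


def passes_filters(protein_length: int, matches: Iterable[PdbMatch]) -> bool:
    """Apply the length limit and the less stringent PDB filter."""
    return protein_length <= MAX_LENGTH and not has_known_structure(
        protein_length, matches
    )


# A 400-residue protein with one domain of known structure (coverage 0.3) is kept.
print(passes_filters(400, [PdbMatch(identity_percent=95.0, match_length=120)]))
```

The first example shows the intended effect of the coverage criterion: a protein stays on the target list when only a single domain matches a PDB entry, even at high sequence identity.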

Target lists
The first target list was generated in 1999, when the PSF project started, from the set of all human proteins known at that time. This set was filtered as described above. PSI-BLAST was used to match potential target proteins to the PDB and to include only proteins of presumably novel folds. To enable high-throughput cloning, we selected target proteins for which full-length cDNA clones were available. In 1999, cDNA clones of the IMAGE consortium represented the main public source of sequenced cDNA clones [33]. However, only partial sequence information existed for these clones (the EST sequences of the dbEST database [34]), and only a small proportion contained complete open reading frames (full-ORF clones). We selected 490 proteins that met the selection criteria, had no match to the PDB detected by PSI-BLAST, and for which full-ORF IMAGE cDNA clones were available.
In 2003, a second target list was compiled from novel full-length cDNAs discovered by the German cDNA consortium [35,36]. The same filter criteria as for the first target list were applied, except that cellular localisation was not taken into account. Proteins with a PDB match of 60% or more sequence identity and 90% or more sequence coverage were excluded, resulting in a target list of 259 proteins.
A third set of target proteins was selected from a human cDNA expression library (hEx1), which was cloned in a bacterial expression vector. This library was screened for expression clones on high-density arrays and by high-throughput protein expression and purification experiments [37][38][39][40]. This identified 2,700 clones expressing soluble His-tag fusion proteins that could be purified by affinity chromatography [39]. The cDNA inserts of these clones were sequenced and assigned to sequences of the Ensembl database [41]. 141 proteins represented by clones of the expression library were selected as targets for structural analysis [39]. These clones express soluble, full-length proteins whose three-dimensional structure was unknown.
The numbers of targets and success rates grouped by the type of target cDNA clone are summarised in Table 1.

Generation and characterisation of expression clones
We established a common cloning strategy that allows for easy shuttling of cDNA fragments between different E. coli and yeast vectors. We adopted a cloning system that adds only a minimal number of extra amino acids to the protein of interest and therefore decided to clone with restriction enzymes instead of using alternative systems, such as Invitrogen's Gateway system or ligation independent cloning [42].
The PSF has been working with more than a thousand target proteins to date. Suitable cDNA clones were selected and subcloned into the E. coli expression vectors pQTEV (GenBank AY243506) and pQStrep2 (GenBank AY028642) [16]. These vectors provide an N-terminal His-tag; pQStrep2 also encodes a C-terminal Strep-tag II. Some proteins have also been expressed as GST fusion proteins using the vectors pGEX-4T2 or pGEX-6P1 (Amersham Biosciences).
Expression of protein-coding genes from multiple transformants per target was tested under multiple conditions. Standardisation and automation were introduced to achieve this throughput. Expression clones were characterised by small-scale protein synthesis at different temperatures (37°C, 30°C and 25°C) in 1 ml volumes in deep-96-well microplates. Proteins were purified in parallel by a pipetting robot, as described previously [43]. 10% of the purified protein eluate from a 1 ml culture was analysed by SDS-PAGE. For each protein expression experiment, the size of the expression product was recorded and the amount of protein was classified into four categories: none, weak, moderate and strong expression. This classification is arbitrary to a certain degree, but we found it sufficient to select suitable clones for protein production scale-up.
1414 clones for 537 target proteins were successfully cloned in E. coli expression vectors, with 473, 191 and 94 target proteins corresponding to target lists one (IMAGE clones), two (DKFZ clones) and three (hEx1 clones), respectively. Clones for 139 different target proteins were found to be expressed in soluble form by E. coli. Figure 2 and Table 2 show the result of small-scale expression and purification of these proteins. The yield varied significantly among different target proteins. Additional file 1 (psfClones.xml) contains further details on the expression clones, such as vector, strain and helper plasmid for overexpression of rare tRNAs.
Biophysical properties of proteins which could be expressed in soluble form in E. coli were compared against all tested proteins. We found no significant correlation between expression success and either protein length or mean net charge (data not shown). However, when analysing the mean hydrophobicity, we found that hydrophobic proteins are less likely to be expressed in soluble form. Only one of 139 well expressed proteins has a mean hydrophobicity of more than 0.2, while 8% of the other proteins are above this value. This group of proteins does not contain transmembrane helices according to TMHMM, and therefore may represent peripheral membrane proteins with hydrophobic surface regions.
Figure 2. SDS-PAGE of purified human proteins. 15% SDS-PAGE (Coomassie-stained) of proteins expressed in small scale in E. coli and purified by automated immobilised metal chelate affinity chromatography as described in [43]. The identities of the purified proteins are indicated in Table 2. Protein expression was induced at the temperature that is optimal for the individual clone.
The E. coli expression clones of the PSF are publicly available from the RZPD German Resource Center [44]. Additional file 1 is an XML list of these clones. It can be viewed in a web browser (Figure 3). The Additional file contains, for each clone:
• Gene ID and name,
• Accession number,
• Cloning details,
• Strain and vector,
• Expected sequence,
• Protein expression results,
• Sequence verification.

Solved structures
As a result of the target selection and cloning described in this paper, ten novel X-ray structures of human proteins were determined (Table 3). The structure of one protein, TRAPPC3/BET-3, was determined after protein expression in S. cerevisiae, while the other proteins were produced in E. coli.

Discussion
We describe here the strategies and experiments of our structural genomics project on human proteins. In addition to the expression of full-length proteins, the Protein Structure Factory has also studied protein domains by NMR spectroscopy, which has been described elsewhere [45][46][47]. Our selection of full-length target proteins was mainly determined by the availability of full-length cDNA clones. In addition, biophysical and bioinformatic criteria were applied, leading to a biased selection of target proteins from the human proteome. Therefore, we expect that the percentage of proteins that we could express and purify in soluble form, 18%, is higher than it would be in a randomly selected set. The low proportion of successfully expressed proteins indicates that E. coli is not the appropriate expression host for many full-length human proteins. High-throughput protein expression in alternative systems such as yeast [9][10][11] or insect cells/baculovirus [48] has been established and will lead to better success rates in future projects.
Generally, clones that expressed a soluble protein were verified by DNA sequencing, while clones that did not express or expressed an insoluble product were usually not sequence verified. It cannot be ruled out that some of the unsuccessful clones contain sequence errors introduced during cloning. Since template cDNA clones of the IMAGE consortium with only partial sequence information were used for most cloning experiments, expression clones that were not sequence verified might represent splice variants or isoforms of the original target. The distributions of mean net charge and length were similar for the successfully expressed proteins and for all tested proteins, while very hydrophobic proteins were generally not expressed well in our E. coli expression system.
Future efforts in structural genomics of mammalian proteins will benefit from a much better supply of full length cDNA clones. Clones prepared for protein expression by resource centres and commercial suppliers are becoming available now. With such resources, alternative target selection strategies will become feasible that will not be restricted by the availability of cDNA clones. Instead, all potential target proteins, including splice variants, could be clustered by similarity and the most suitable members of each cluster could be selected by appropriate criteria as outlined in the Background section.
In our approach, we have excluded certain types of proteins such as membrane proteins and very large proteins. A structural genomics approach that includes membrane proteins would require standard protocols to optimise expression conditions and detergents [49]. The best strategy to study large proteins is to divide them into domains and smaller regions. However, such smaller constructs usually have to be designed manually.
All clones listed in the supplementary file (Additional file 1) and Table 2 are available to the research community. Thereby we hope to facilitate further functional characterisation of this set of human proteins.

Materials and Methods
Cloning with restriction enzymes
cDNA inserts were amplified by PCR with primers carrying tails with BamHI and NotI sites and cloned into the respective sites of one of the expression vectors pQTEV, pQStrep2 or pGEX-6P-1. This approach has the drawback that the restriction sites chosen for cloning might occur in the insert. In such cases, compatible overhangs were produced by alternative enzymes or by the hetero-stagger cloning method [50]. Alternative enzymes are BglII for BamHI and the type IIs enzymes BpiI, Eco31I and Esp3I, which can replace both BamHI and NotI. Type IIs enzymes cut outside their recognition sequence and can produce arbitrary overhangs.
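The check for internal restriction sites and the fallback to BglII can be sketched as follows. This is an illustrative sketch only: the function names and example sequences are hypothetical, and the type IIs and hetero-stagger options are represented simply by returning `None`. The recognition sequences used (BamHI GGATCC, NotI GCGGCCGC, BglII AGATCT) are palindromic, so scanning one strand is sufficient.

```python
from typing import List, Optional, Sequence

# Recognition sequences of the enzymes named in the text.
SITES = {
    "BamHI": "GGATCC",
    "NotI": "GCGGCCGC",
    "BglII": "AGATCT",  # produces a GATC overhang compatible with BamHI
}


def internal_sites(insert: str, enzymes: Sequence[str] = ("BamHI", "NotI")) -> List[str]:
    """Return the enzymes whose recognition site occurs within the insert."""
    seq = insert.upper()
    return [enzyme for enzyme in enzymes if SITES[enzyme] in seq]


def choose_five_prime_enzyme(insert: str) -> Optional[str]:
    """Pick BamHI if it does not cut the insert, otherwise BglII.

    Returns None when both enzymes cut the insert; such a case would need
    a type IIs enzyme or hetero-stagger cloning (not modelled here).
    """
    for enzyme in ("BamHI", "BglII"):
        if not internal_sites(insert, (enzyme,)):
            return enzyme
    return None


# Hypothetical insert containing an internal BamHI site.
print(choose_five_prime_enzyme("ATGGGATCCAAATAA"))
```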

PCR Primer design
PCR primers with tails carrying restriction enzyme cleavage sites were designed automatically by a Perl program. The primer design program adjusts the length of the primers to achieve a melting temperature close to a common default. Then, restriction enzymes that do not cut within the respective cDNA sequence are selected by the program and restriction enzyme sites are attached to the primer sequences. Finally, since restriction enzymes do not cut well at the very end of a DNA molecule, an additional short nucleotide tail is automatically attached to the primers. The sequence of this tail is optimised to minimise formation of secondary structure, hairpins or dimerisation. A Java version of the primer design software, 'ORFprimer', is publicly available [51].
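The primer design steps described above can be sketched in a few lines. This is not the Perl program or ORFprimer. It is a simplified illustration under several assumptions: the melting temperature is estimated with the Wallace rule (2 °C per A/T, 4 °C per G/C), which the text does not specify; the target temperature, length limits and the fixed tail `GC` are hypothetical defaults; and the tail is not optimised against secondary structure as in the original software.

```python
def wallace_tm(primer: str) -> int:
    """Estimate the melting temperature with the Wallace rule (an assumption)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def forward_primer(
    cds: str,
    site: str,
    tail: str = "GC",      # hypothetical fixed tail; the original optimises it
    target_tm: int = 60,   # hypothetical common default
    min_len: int = 15,
    max_len: int = 30,
) -> str:
    """Build a forward primer: 5'-tail + restriction site + gene-specific part.

    The gene-specific part is extended base by base from the start of the
    coding sequence until its estimated Tm reaches the target or the
    maximal length is reached.
    """
    cds = cds.upper()
    length = min_len
    while length < min(max_len, len(cds)) and wallace_tm(cds[:length]) < target_tm:
        length += 1
    return tail + site + cds[:length]


# Hypothetical coding sequence start, cloned via a BamHI site (GGATCC).
example_cds = "ATGGCTAGCAAGGAGCTGACCGATCTGAAG"
print(forward_primer(example_cds, "GGATCC"))
```

A reverse primer would be built analogously from the reverse complement of the 3' end of the coding sequence.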
Automated high-throughput cloning, protein expression and purification
PCR primers and cDNA clones were delivered in 96-well microplate format. Upon delivery, plates with PCR primers and template clones were reformatted to obtain corresponding plate positions by a Zinsser Speedy pipetting robot. The PCR master mix (Roche Expand) and cDNA primers (10 µM stocks) were pipetted into a PCR microplate with a multichannel pipette. Template clone bacteria were added with a 96-pin steel replicating device from overnight cultures in microtitre plates. PCR product size and yield were determined by agarose gel electrophoresis and the software Phoretix 1D Quantifier (Nonlinear Dynamics). PCR products were purified with magnetic beads on the pipetting robot with a system that has been developed at the Max Planck Institute of Molecular Genetics in collaboration with Bruker Daltonics (Bruker genopure kit). The correct restriction enzyme master mixes were automatically added and the digested fragments were purified again, analysed by agarose gel electrophoresis and quantified. The robot then adjusted DNA concentrations to a common default by dilution. Ligations were set up manually with a multichannel pipette, and SCS1 E. coli cells carrying pRARE were transformed in a PCR microplate by chemical transformation on a PCR machine [52]. Transformed cells were manually plated on individual agar plates. Four clones were picked per transformation and were checked by PCR using vector primers. E. coli expression clones were ready for protein expression at this stage.
Figure 3. The supplementary XML file (Additional file 1) displayed in a web browser.
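The adjustment of DNA concentrations to a common default by dilution, mentioned in the cloning procedure, follows from C1·V1 = C2·V2. The sketch below is illustrative only: the function name, the units and the example values are assumptions, since the text does not give the target concentration or the volumes used by the robot.

```python
def diluent_volume_ul(conc_ng_per_ul: float, volume_ul: float, target_ng_per_ul: float) -> float:
    """Volume of diluent to add so that a sample reaches the target concentration.

    From C1 * V1 = C2 * (V1 + V_add). Samples already at or below the
    target cannot be adjusted by dilution and get 0.0.
    """
    if conc_ng_per_ul <= target_ng_per_ul:
        return 0.0
    return volume_ul * (conc_ng_per_ul / target_ng_per_ul - 1.0)


# Hypothetical well: 20 µl at 50 ng/µl, adjusted to 10 ng/µl.
print(diluent_volume_ul(50.0, 20.0, 10.0))
```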
Primer sequences and template clones for cloning of the target cDNAs are listed in the supplementary XML file, Additional file 1.
The characterisation of expression clones by parallel expression and protein purification is described in reference [43].

Sequence analysis
Sequence analysis software was run with default settings unless indicated otherwise. The mean net charge of a protein was calculated as the difference between the numbers of positively and negatively charged amino acids (Lys, Arg and Glu, Asp, respectively) divided by the protein length. The mean hydrophobicity was calculated with the Kyte and Doolittle hydropathy index [53], obtained from the EMBOSS package [54]. The index values were added up for a given protein and divided by the protein length.
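Both quantities can be computed directly from the amino acid sequence. The following is a minimal sketch of the two definitions given above, not the code used in this study: the function names are hypothetical, and the Kyte and Doolittle index values are written out in the code instead of being read from the EMBOSS data files.

```python
# Kyte and Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_net_charge(sequence: str) -> float:
    """(Lys + Arg) minus (Glu + Asp), divided by the protein length."""
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("E") + seq.count("D")
    return (positive - negative) / len(seq)


def mean_hydrophobicity(sequence: str) -> float:
    """Sum of Kyte-Doolittle index values divided by the protein length.

    Raises KeyError for non-standard residue letters.
    """
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


# Hypothetical short peptide used only to illustrate the calculation.
peptide = "MKKDALIV"
print(mean_net_charge(peptide), mean_hydrophobicity(peptide))
```

With these definitions, the threshold discussed in the results corresponds to `mean_hydrophobicity(sequence) > 0.2`.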