Interactive visualization of clusters in microarray data: an efficient tool for improved metabolic analysis of E. coli
© Scharl et al. 2009
Received: 29 April 2009
Accepted: 15 July 2009
Published: 15 July 2009
Interpretation of comprehensive DNA microarray data sets is a challenging task for biologists and process engineers, for whom support from statistics and bioinformatics is essential. Interdisciplinary cooperation and concerted development of software tools for simplified and accelerated data analysis and interpretation are the key to overcoming the bottleneck in data-analysis workflows. This approach is exemplified by gcExplorer, an interactive visualization toolbox based on cluster analysis. Clustering is an important tool in gene expression data analysis to find groups of co-expressed genes, which can in turn suggest functional pathways and interactions between genes. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results.
In this study the interactive visualization toolbox gcExplorer is applied to the interpretation of E. coli microarray data. The data sets derive from two fed-batch experiments conducted in order to investigate the impact of different induction strategies on the host metabolism and product yield. The software enables direct graphical comparison of these two experiments. The identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased.
It was shown that gcExplorer is a very helpful tool to gain a general overview of microarray experiments. Interesting gene expression patterns can easily be found, compared among different experiments and combined with information about gene function from publicly available databases.
The implementation of comprehensive analysis tools from systems biology into bioprocess development concepts enables the change from empirical to rational, knowledge-based approaches in host engineering and process design. DNA microarrays are powerful, state-of-the-art tools for monitoring cellular systems at the transcriptome level, providing insight into the cellular response to defined changes in cultivation conditions, e.g., induction of recombinant protein production. The successful application of microarrays as a monitoring tool in bioprocess development strongly depends on concerted design of cultivation experiments as well as array experiments and on systematic data analysis. To enable interpretation of results, the most significant information must be extracted from the acquired microarray data by using optimally suited methods of statistics and bioinformatics. Comparative analysis of data sets from independent experiments provides additional information and contributes to the optimal exploitation of microarray data. Cluster analysis is frequently used in gene expression data analysis to find groups of co-expressed genes, which can in turn suggest functional pathways and interactions between genes. Clusters of co-expressed genes can help to discover potentially co-regulated genes or genes associated with the conditions under investigation, i.e., the induction strategies. Usually cluster analysis provides a good initial investigation of microarray data before actually focusing on smaller gene groups of interest. In the literature numerous algorithms for clustering gene expression data have been proposed. Besides traditional methods like hierarchical clustering, K-means, partitioning around medoids (PAM, K-medoids) or self-organizing maps, there are several algorithms dealing with time-course gene expression data (e.g., [2–5]).
Clustering is commonly used to reduce the complexity of the data from multidimensional space to a single nominal variable, the cluster membership. In the analysis of microarray data clustering is used as vector quantization because no clear density clusters exist in the data. Genetic interactions are so complex that the definition of gene clusters is not clear. Additionally microarray data are very noisy and co-expressed genes can end up in different clusters. Therefore the set of genes is divided into artificial subsets where relationships between clusters play an important role. Depending on the purpose of the cluster analysis different numbers of clusters can be appropriate. Few large clusters are typically used for a broad overview of a data set and many small clusters are more suitable to detect co-regulated genes (e.g., over 25 clusters in ).
The display of cluster solutions, particularly for a large number of clusters, is very important in exploratory data analysis. Visualization methods are necessary in order to make cluster analysis useful for practitioners. They give an understanding of the relationships between segments of a partition and make it easier to interpret the cluster results. In this work neighborhood graphs are used for visual assessment of the cluster structure of partitioning cluster solutions.
All cluster algorithms and visualization methods used are implemented in the statistical computing environment R (http://www.R-project.org). The R package flexclust contains extensible implementations of the K-centroids and QT-Clust algorithms. The new interactive visualization toolbox gcExplorer uses the non-linear graph layout algorithms implemented in the open-source graph visualization software Graphviz (http://www.graphviz.org) for the arrangement of nodes. The Bioconductor packages graph and Rgraphviz provide tools for creating, manipulating, and visualizing graphs in R as well as an interface to Graphviz. gcExplorer offers several ways to investigate gene clusters. A detailed view of a single cluster is obtained by clicking on the corresponding node of the graph, where various panel functions can be used to show the genes it contains, e.g., matrix plots for gene expression profiles over time or HTML tables giving detailed information about differential expression as well as links to databases. Properties of the clusters can be included in the display of the neighborhood graph, e.g., cluster size or cluster tightness. Additionally, external knowledge from differential expression analysis or functional grouping is used to investigate the data. Finally, different experiments can easily be compared by visualizing groups of genes with a common expression pattern in one experiment and a potentially different expression pattern in the other experiment. The latest release of gcExplorer is always available at the Comprehensive R Archive Network CRAN: http://cran.R-project.org/package=gcExplorer.
In this paper the utility of the interactive visualization toolbox gcExplorer is demonstrated for the interpretation of E. coli microarray data. The data sets used derive from two independent fed-batch experiments conducted in order to investigate the impact of different induction strategies on the host metabolism and product yield. The goal of the comparison is to identify genes and pathways that act similarly in both settings and, more importantly, to identify groups of genes that react differently to the two induction strategies. For this reason cluster analysis followed by comparative graphical investigation of the different groups of genes is performed. The graphical exploration of clusterings is applicable to arbitrary partitioning cluster solutions. In this case the stochastic quality cluster algorithm QT-Clust is used. In the Methods Section this cluster algorithm and the concept of neighborhood graphs are reviewed for completeness. The data sets used are described in the Data Section. In the Results Section several steps of the analysis of the given data sets are presented, including the visualization of the cluster structure and the direct graphical comparison of these two experiments. Further, a method for including external knowledge about gene function in the display of cluster solutions is presented. It is shown that the identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased.
1. Start with a randomly chosen centroid.
2. Iteratively add the gene that minimizes the increase in cluster diameter.
3. Continue until no gene can be added without surpassing the diameter threshold.
4. Repeat from 1. for ntry - 1 further centroids.
5. Select the largest candidate cluster and remove the genes it contains from further consideration.
6. Goto 1. on the smaller data set.
7. Stop when the largest remaining cluster has fewer than some prespecified number of elements.
If ntry is equal to the number of genes G, the original QT-Clust algorithm is obtained. Stochastic QT-Clust speeds up the procedure and yields different local maxima of the objective function. The original algorithm will always converge to the same local optimum.
In order to gain maximum information, the cluster diameter and the minimum number of points have to be chosen carefully, as both have a large impact on the resulting clustering and its interpretation. A small diameter will yield a cluster solution with many small clusters containing genes with very similar expression patterns, whereas a larger diameter will result in a smaller number of less tight clusters. Additionally, if the diameter is chosen too small, many genes cannot be added to a cluster and will be treated as outliers. The minimum number of points also has a big influence on the number of clusters and the number of outliers. If small clusters are allowed (e.g., the minimum number of points is 2) there will be fewer outliers than in the case of a larger minimum number of points. There is a tradeoff between the number of clusters, the size of the clusters and the number of outliers. Therefore it is necessary to fine-tune these parameters for each data set to obtain a cluster solution that fits the needs of the current experiment. In order to use the neighborhood graph for the visualization of a cluster solution obtained from QT-Clust, the corresponding cluster centroids are computed. However, neighborhood graphs are generally applicable to various partitioning cluster algorithms like the well-known K-means or PAM.
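The listed steps can be illustrated with a minimal Python sketch. It is not the flexclust implementation: the function names, the use of Euclidean distance, and the definition of the diameter as the largest pairwise distance within a cluster are assumptions made for illustration only.

```python
import numpy as np

def grow_candidate(dist, seed, active, max_diameter):
    """Grow one candidate cluster around `seed` (steps 1 to 3)."""
    members = [seed]
    # far[j]: largest distance from point j to any current member, i.e. the
    # diameter the cluster would have at least if j were added.
    far = dist[seed].copy()
    available = active.copy()
    available[seed] = False
    while available.any():
        cand = np.where(available)[0]
        best = cand[np.argmin(far[cand])]   # smallest increase in diameter
        if far[best] > max_diameter:        # threshold would be exceeded
            break
        members.append(best)
        available[best] = False
        far = np.maximum(far, dist[best])
    return members

def stochastic_qtclust(x, max_diameter, ntry=5, min_size=2, seed=0):
    """Return one label per row of x; -1 marks outliers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Assumption: Euclidean distance between expression profiles.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    labels = np.full(n, -1)
    active = np.ones(n, dtype=bool)
    k = 0
    while active.sum() >= min_size:
        idx = np.where(active)[0]
        # Step 4: try ntry random starting points (all points if ntry >= G).
        seeds = rng.choice(idx, size=min(ntry, len(idx)), replace=False)
        candidates = [grow_candidate(dist, s, active, max_diameter) for s in seeds]
        largest = max(candidates, key=len)  # step 5
        if len(largest) < min_size:         # step 7
            break
        labels[largest] = k
        active[largest] = False             # step 6: continue on the rest
        k += 1
    return labels
```

Setting ntry to the number of rows makes every remaining point a starting point in each round, which corresponds to the non-stochastic variant described above.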
Neighborhood graphs use the mean relative distances between points as edge weights in order to measure how separated pairs of clusters are. Hence they display the distance between clusters. In the graph each node corresponds to a cluster centroid and two nodes are connected by an edge if there exists at least one point that has these two as closest and second-closest centroids.
|A_i| is used in the denominator instead of |A_ij| to make sure that a small set A_ij consisting only of badly clustered points with large s-values does not induce large cluster similarity.
Neighborhood graphs are a useful tool for the visualization of the structure of a cluster solution. Additionally they can be used as an exploratory tool to determine the quality of a given clustering and to validate the number of clusters.
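As a rough illustration of how such edge weights could be computed, the following Python sketch makes several assumptions that are not stated in the text above: A_i is taken to be the set of points whose closest centroid is c_i, A_ij the subset of A_i whose second-closest centroid is c_j, the s-value of a point x is taken to be s(x) = 2 d(x, c(x)) / (d(x, c(x)) + d(x, c~(x))) with c(x) and c~(x) the closest and second-closest centroids, and the edge weight is the sum of s(x) over A_ij divided by |A_i|. Euclidean distance and the function name are also assumptions.

```python
import numpy as np

def neighborhood_weights(x, centroids):
    """Matrix of edge weights s_ij for a partition defined by `centroids`."""
    # Distances from every point to every centroid (assumed Euclidean).
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    first, second = order[:, 0], order[:, 1]   # closest, second-closest
    rows = np.arange(len(x))
    d1, d2 = d[rows, first], d[rows, second]
    # s-value: 0 for a point on its centroid, 1 for a point halfway
    # between its two closest centroids.
    s = 2 * d1 / (d1 + d2)
    k = len(centroids)
    weights = np.zeros((k, k))
    np.add.at(weights, (first, second), s)     # sum of s(x) over A_ij
    sizes = np.bincount(first, minlength=k)    # |A_i|
    nonempty = sizes > 0
    weights[nonempty] /= sizes[nonempty, None] # divide by |A_i|, not |A_ij|
    return weights
```

Under these assumptions an edge between nodes i and j would be drawn wherever A_ij or A_ji is non-empty, and dividing by |A_i| keeps a handful of badly clustered points from dominating the weight.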
In order to analyze the cellular response to different induction strategies on the transcription level, two independent DNA microarray experiments were performed. A dye-swap design was used and the cells in the non-induced state of each experiment were compared to samples taken after induction. Since the production period of the fully induced system was limited to approximately one generation (7 h at a growth rate of 0.1 h⁻¹), samples were drawn at a frequency of 1 h⁻¹. To cover the production period of the process with limited induction, the sampling frequency was reduced to one sample every two hours. The microarrays used were epoxide-coated slides (Corning® Epoxide Coated Slides) with selective probes (50-mer oligos) for all 4289 open reading frames of the E. coli K12 genome (MWG E. coli K12 V2 oligo set; MWG Biotech AG, Germany) spotted in duplicates. The two experiments (including all processing protocols) have been loaded into ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae/). The ArrayExpress accession number of the array design is A-MARS-10. The experiment with the fully induced E. coli expression system (experiment A) has accession number E-MARS-16 and the experiment with the partially induced system (experiment B) has accession number E-MARS-17. For standard low-level analysis the data were preprocessed using print-tip loess normalization. Differential expression estimates were calculated using the Bioconductor (http://www.bioconductor.org) package limma. The two data sets were filtered by excluding genes expressed at a very low level (average log2 intensity smaller than 8), genes not showing differential expression at any time point (absolute log-ratio |M| smaller than 1.5 at all time points) and genes for which the p-value of the corresponding F-statistic was not smaller than 0.05. After filtering, the data acquired from the experiment with a fully induced E. coli expression system (experiment A) consist of 733 genes and the data acquired from the process with limited induction (experiment B) consist of 429 genes, where 311 genes are differentially expressed in both experiments. The filtered data sets were clustered using stochastic QT-Clust, and further analysis and visualization were conducted using gcExplorer.
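The three filtering criteria can be sketched in Python as follows. This is an illustration only, not the limma-based workflow used in the study: the function and argument names are hypothetical, the inputs are assumed to be NumPy arrays with one entry (or one row) per gene, and the p-value criterion is read as keeping genes with a significant F-statistic (p < 0.05).

```python
import numpy as np

def filter_genes(avg_log2_intensity, log_ratios, f_pvalues,
                 min_intensity=8.0, min_abs_m=1.5, max_p=0.05):
    """Boolean mask of genes that pass all three filters.

    avg_log2_intensity: shape (genes,), average log2 intensity per gene
    log_ratios:         shape (genes, time points), log-ratios M
    f_pvalues:          shape (genes,), p-value of the F-statistic
    """
    # Drop genes expressed at a very low level.
    expressed = avg_log2_intensity >= min_intensity
    # Keep genes with |M| >= threshold at one or more time points.
    changed = (np.abs(log_ratios) >= min_abs_m).any(axis=1)
    # Assumption: genes are kept when the F-test is significant.
    significant = f_pvalues < max_p
    return expressed & changed & significant
```

Applying the mask to the rows of the expression matrix yields the reduced data set that is passed on to clustering.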
The major goal of this study is to identify differences between two independent microarray experiments which cannot be compared directly. For this purpose the two data sets are clustered into small and tight subgroups of genes with a common expression pattern which can easily be investigated. The cluster diameter is tuned to yield roughly 15 clusters and 10 outliers. The minimum number of points that form a single cluster is set to 2. These parameter settings lead to reasonable cluster solutions that can be interpreted directly. The data sets of experiments A and B were separated into 19 and 15 clusters, respectively, with 20 and 9 outliers. Next these two cluster solutions are investigated independently and combined in the following section. In the case of very similar clusters, the neighborhood graph can be used to combine them after verifying their similarity. In this exploratory approach it is more advantageous to merge similar clusters than to split large ones.
The cluster profiles with immediate and strong up- or down-regulation followed by constant values for the rest of the process clearly reflect the macroscopic outcome of the experiment with full induction. The irreversibility of the cellular response to the applied load level is mirrored in the transcriptome data. The only exceptions are the transcription profiles of genes related to phage shock grouped in cluster 15, which show continuously increasing gene expression until the end of the process.
Cluster analysis is used to find groups of co-regulated genes in the microarray data without prior knowledge about the gene functions. However, by clustering expression profiles of co-expressed genes, groups of genes with similar function are found. External information about the annotation of genes to functional groups can easily be included in the neighborhood graph, e.g., the accumulation of gene ontology (GO) classifications in certain gene clusters can be highlighted in the node representation. For E. coli, the GO classifications for biological process (GOBP), molecular function (GOMF) and cellular component (GOCC), the GenProtEC (http://genprotec.mbl.edu/) classification system for cellular and physiological roles of E. coli gene products, and RegulonDB (http://regulondb.ccg.unam.mx/), which provides information about operons and regulatory networks, were implemented. These knowledge-based functional mappings can be used to study cellular functions in individual clusters.
The interactive visualization tool gcExplorer was developed in order to make cluster analysis useful for practitioners. It not only visualizes the cluster structure but also plots the gene clusters or shows them in HTML tables with links to databases. Additional properties of the clusters like cluster size or cluster tightness can be highlighted, as well as external information like functional grouping. Furthermore, gcExplorer provides functions for comparative graphical analysis of different microarray experiments. gcExplorer is a user-friendly software tool for the analysis of gene expression data and helps practitioners to get an overview of the output of microarray experiments.
In this study microarray data from two processes with a strong recombinant E. coli expression system were analyzed. Neighborhood graphs enable the investigation of the underlying cluster structure and of the relationships between clusters. The implemented features for functional grouping allowed the assignment of cellular functions to clusters and provided hints about the functionality of other genes belonging to a certain cluster. Comparative graphical analysis of these two experiments resulted in the identification of differences in the cellular response and of a number of interesting gene candidates involved. It was shown that the cellular strategies are different in the two DNA microarray experiments. Useful information was extracted for the further advancement of the expression system by means of genetic engineering or process engineering.
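One simple way to quantify the accumulation of a functional category in each cluster is the share of a cluster's genes that carry the annotation. The Python sketch below is only an illustration of that idea and does not reproduce gcExplorer's own functions; the function name, the input layout (a gene-to-cluster dictionary and a set of annotated genes) and the gene labels in the usage example are hypothetical.

```python
from collections import Counter

def category_share(cluster_of, genes_in_category):
    """Fraction of each cluster's genes annotated to one functional category.

    cluster_of:        dict mapping gene identifier -> cluster label
    genes_in_category: set of gene identifiers carrying the annotation
    """
    sizes = Counter(cluster_of.values())
    hits = Counter(c for g, c in cluster_of.items() if g in genes_in_category)
    return {c: hits[c] / n for c, n in sizes.items()}

# Toy usage with made-up gene labels and cluster numbers.
shares = category_share(
    {"g1": 1, "g2": 1, "g3": 2, "g4": 2},
    {"g1", "g2", "g4"},
)
```

A share per cluster of this kind could then be mapped to a node property such as color intensity when drawing the graph.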
This work was supported by the Austrian Kind/Knet Center of Biopharmaceutical Technology (ACBT). Technical assistance of Jürgen Kern and Franz Clementschitsch is highly appreciated.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.