The Cellulose Synthase Superfamily in Fully Sequenced Plants and Algae
Yanbin Yin and Ying Xu, Dept. of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia; BioEnergy Science Center
We have identified, classified, and evolutionarily analyzed the cellulose synthase-like (Csl) superfamily in 19 fully sequenced plant and algal genomes. Among the known Csl families, CslA and CslC are the most widely distributed, being identified in all six Chlorophyta green algae and in land plants; the CesA (cellulose synthase catalytic gene) and CslD families are found only in land plants, including mosses, while all the other families, namely CslB, E, F, G, and H, are observed only in the five angiosperms. Estimation of evolutionary rates across Csl families suggests that the widely distributed CesA, CslA, CslC, and CslD are under stronger selective constraints than the narrowly distributed CslB, E, F, G, and H families, adding further evidence that the latter evolved later and took on new functions in plants, such as vascular tissue formation. A new putative Csl family, CslI, closely related to CslG, was identified, consisting of genes from poplar, grape, and sorghum but not from Arabidopsis or rice. Our analyses confirmed that CslA and CslC have a different origin from the other Csl families; these two families are also encoded by a single-copy gene in each of the green algal genomes, suggesting that they diverged by duplication only in land plants, after the split from green algae. We also showed that cyanobacterial CesA genes are closer to plant CesA/CslD/CslF than to CslE/B/H/G/I, suggesting that the former branched out earlier in evolution.
Mining the Pharmacogenomic Literature to Create Gene-Drug Interaction Networks
Yael Garten, Biomedical Informatics, Stanford University
With over 17 million citations in PubMed, the scientific literature has become unmanageably large for individual researchers to remain informed of all publications relevant to their field. Natural Language Processing (NLP) and text mining techniques have been used to assist in this task. In my research, I am creating a resource that will identify and gather pharmacogenomics-relevant literature, automatically extract facts from the text, and allow large-scale analysis of the collective pharmacogenomic knowledge published across journals. This knowledge, in the form of interactions reported by researchers between genes, gene variants, drugs, and diseases, will then be analyzed at a systems level. By creating networks of these interactions, we can move forward to infer and predict additional connections that may exist. Analyses of the network may suggest new drug targets, new applications for existing drugs, and areas of pharmacogenomics that appear under-investigated.
WISE: a Keyword Search Engine for Workflows
Yi Chen, Arizona State University
With advances in experimental technology, the number of scientific workflows is growing dramatically, and there is an increasing need for scientists to search a workflow repository using keywords and retrieve the workflows relevant to their interests. A workflow structure is a three-dimensional object containing multiple abstraction views of different granularity on the same workflow. This unique structure poses a new set of challenges compared to keyword search on documents. Existing workflow search engines either retrieve individual tasks or retrieve whole workflow structures that match the input keywords, and are thus not effective.
In this work, we have developed WISE (a Workflow Information Search Engine), http://wise.asu.edu/, which dynamically extracts and synthesizes the most relevant information in a workflow repository according to the user keyword query. To achieve this, we define a keyword search result as a most specific workflow that contains tasks matching keywords as well as the dataflow among those tasks. Such a query result can be understood as a perspective projection of a three dimensional workflow structure in the repository on a two dimensional query-driven viewing plane. To efficiently retrieve query results, we have developed indexes and labeling schemes and leveraged a database backend for performance speedup.
A user study with students in a biology department shows that the query results retrieved by WISE are more informative and yet more concise than those returned by existing approaches. A performance evaluation demonstrates the efficiency of WISE.
Biological Data Mining and Visualization
Association study of SNPs and schizophrenia via classification rule mining and ranking
Qian Xu, Dept. of Bioengineering at Hong Kong University of Science and Technology
Materials: We utilized two datasets, each containing 17 SNPs in intron 8 of the type A γ-aminobutyric acid (GABAA) receptor β2 subunit gene (GABRB2), which were identified and genotyped by resequencing a 1839 base pair (bp) region of GABRB2 in Japanese (JP) and German (GE) samples. The 511 JP samples comprised 304 unrelated schizophrenia patients and 207 unrelated healthy individuals, while the 365 GE samples consisted of 175 patients and 190 healthy individuals.
Methods: Our two-step algorithm for studying associations between SNPs and complex diseases first finds class association rules and then aggregates the rules to extract interacting SNPs associated with disease. The subset of rules achieving optimal classification accuracy is discovered by classification rule mining, which, unlike conventional association rule mining, treats the class as a predetermined target. The rules produced by classification rule mining are readily interpretable, whereas a decision tree may fail to discover interesting rules because of the limitations of the model, and it cannot generate all the rules. Furthermore, our algorithm can rank a large number of rules according to an interestingness measure and scales up easily.
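As an illustration of the first step, a minimal class-association-rule miner can enumerate small SNP-genotype itemsets and score rules of the form "itemset → class" by support and confidence. This is only a sketch under invented data, thresholds, and ranking measure (the abstract does not specify its interestingness measure):

```python
from itertools import combinations
from collections import Counter

def mine_class_rules(samples, labels, min_support=0.2, min_conf=0.6, max_len=2):
    """Enumerate SNP-genotype itemsets up to max_len and return rules
    'itemset -> class' passing support/confidence thresholds, ranked by
    confidence then support (a stand-in interestingness measure)."""
    n = len(samples)
    rules = []
    for size in range(1, max_len + 1):
        for idx in combinations(range(len(samples[0])), size):
            counts = Counter()        # itemset -> number of samples containing it
            class_counts = Counter()  # (itemset, class) -> joint occurrences
            for geno, cls in zip(samples, labels):
                item = tuple((i, geno[i]) for i in idx)
                counts[item] += 1
                class_counts[(item, cls)] += 1
            for (item, cls), c in class_counts.items():
                support, conf = c / n, c / counts[item]
                if support >= min_support and conf >= min_conf:
                    rules.append((item, cls, support, conf))
    return sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)
```

On a toy dataset of genotype tuples and case/control labels, the top-ranked rule is the single-SNP genotype most confidently predicting a class; aggregating the highest-ranked rules then surfaces candidate interacting SNP sets.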
Results: We extracted and ranked 1000 rules each for JP and GE. The combinations SNP13-SNP3, SNP13-SNP1, and SNP17-SNP16(SNP14)-SNP1 were ranked as the first, second, and third most strongly associated SNP sets in JP. The SNP10-SNP9, SNP11-SNP9, and SNP15-SNP8 combinations were the top-ranked associated SNP sets in GE. In general, SNP interactions associated with schizophrenia vary across populations and genders.
Power and Sample Size Calculations for Multivariate Regression in High-Dimensional Data
Yu Guo, BG Medicine Inc.
Data from high-throughput profiling platforms measuring transcripts, metabolites, or proteins usually contain a small proportion of signal relative to noise, and the number of features measured is usually much larger than the sample size. Traditional power and sample size estimation methods do not apply well in such situations, and the current literature addressing power estimation in high-throughput data deals only with categorical outcomes [Dobbin et al., Biostatistics, 2005].
In the recent design of a human biomarker study, we estimated power for detecting biomarkers associated with a continuous outcome. We derived theoretical formulas and performed simulation studies for power estimation using Elastic Net, a multivariate regression method suited to situations where the number of predictors is much larger than the number of observations. In our simulation study, we used external cross-validation for minimum fingerprinting, and we permuted outcome labels to assess model significance. We automated and parallelized the computationally intensive functions to generate power curves in R in a cluster environment, reducing computing time significantly.
Assuming variability in the outcome and in-house profiling data based on historical information, and assuming 10% of the 100 measured analytes to be signal, we derived a power curve and recommended a sample size of 100 to achieve 80% power when the correlation coefficient between the observed and theoretical outcome is 0.75, consistent with the sample size of 70-140 recommended under similar assumptions by Dobbin et al.
Set-Up for Superposition of the Signals Obtained From Single-Channel Ionometers
Torgom Seferyan, Department of Biophysics, Yerevan State University
In studies of biological systems, one usually needs to monitor several different ion concentrations simultaneously. The set-up presented here allows simultaneous recording of up to five different ion concentration and temperature kinetics, and is also equipped with a temperature regulation system.
Methods: The software for the set-up was developed in the NI LabVIEW 7.0 graphical programming environment, with an NI USB-6008 card used for data acquisition. Five IM-22P ionometers with their ion-selective electrodes (for Na, K, Ca, H, and Cu ions) were used as single-channel ionometers.
To test the operation of the set-up, the kinetics of the concentration change of a given ion were monitored simultaneously through an IM-22P ionometer. Control experiments were carried out separately for each ion.
Results: Comparison of the data showed a 0.1% shift from the control values. Data were recorded as numerical arrays at 50 samples/sec; in this case, the time response of the set-up is limited solely by the response of the electrodes.
Conclusion: The results show that the presented set-up is a cost-effective alternative to expensive equipment for such measurements, without compromising either the wealth of information or the accuracy of analysis.
Visualizing Three Bioinformatics Algorithms
Philip Heller, San Jose State University
I have developed visualization programs for three algorithms that play important roles in bioinformatics. In addition to an actual poster, my presentation will consist of demonstrations of these algorithms, given on my laptop.
The algorithms are:
* UPGMA: a procedure for building distance-based phylogenetic trees. The visualization shows how input distance relationships affect the resulting tree.
* Fragment assembly using Eulerian graphs. Brute-force assembly algorithms run in worse than polynomial time; converting the assembly problem to an Eulerian path problem reduces the complexity to polynomial time. The approach is shown graphically and raced against a brute-force algorithm.
* The Nussinov algorithm. This dynamic-programming algorithm provides a first approximation of RNA secondary structure. The software shows how to derive the DP grid, and then illustrates the relationship between the traceback and the RNA shape.
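As a sketch of the last algorithm, the Nussinov dynamic program fills a grid N[i][j] with the maximum number of nested base pairs in the subsequence from i to j. The version below is deliberately minimal: it counts pairs only (no traceback, no minimum loop length), with Watson-Crick plus G-U wobble pairs assumed:

```python
def nussinov_pairs(seq):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov DP).
    Simplified: no minimum hairpin-loop length and no traceback."""
    pairs = {('A', 'U'), ('U', 'A'), ('G', 'C'), ('C', 'G'), ('G', 'U'), ('U', 'G')}
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):                 # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])      # i or j unpaired
            if (seq[i], seq[j]) in pairs:             # i pairs with j
                inner = N[i + 1][j - 1] if j - i > 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):                 # bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1] if n else 0
```

A full visualization like the one described would keep the matrix N and trace back through it to recover which positions pair, relating the traceback path to the RNA shape.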
These algorithms solve fundamental bioinformatics problems. Phylogenetic trees reveal evolutionary relationships. Fragment assembly is vital to genome projects, where large data sets absolutely require polynomial running time. Secondary structure is the first step in predicting overall 3D structure, which is essential for understanding function. Visualization software facilitates communication and instruction.
A Bayesian network approach to transcription factor binding site prediction and its application to the identification of genes linked to the interferon-gamma response to tuberculosis
Sue Jones, University of Sussex, United Kingdom
Genes sharing transcription factors (TFs) have a greater probability of being co-expressed and linked to a common regulatory pathway. Transcription factor binding sites (TFBSs) can be predicted computationally, but such methods have low sensitivity, meaning functional sites cannot be distinguished from non-functional sites. In the current study, a Bayesian network classifier is developed that integrates scores from phylogenetic footprinting with physical characteristics of the TFBSs, including distance from the transcription start site and nucleosome occupancy. The classifier was trained and tested on a dataset of human genes from the TRANSFAC database to evaluate its ability to differentiate functional from non-functional TFBSs.
The Bayesian classifier was then used to identify genes with shared TFs in data derived from a genome-wide linkage scan. The linkage scan had identified 3 large chromosomal regions linked to interferon-gamma production in mycobacterial infection. The classifier was used to predict candidate genes by identifying TFs shared by genes from all 3 chromosomal regions. A number of genes were predicted to share TFs, including SP1 and STAT1; and some were themselves transcription factors or involved in transcriptional regulation. A number of the shared TFs have been connected experimentally, in other studies, to the production of interferon-gamma upon mycobacterial infection.
The TFBS Bayesian classifier allows more reliable TFBS predictions to be made for large gene datasets. The classifier has been used to identify genes linked to a single regulatory network, leading to candidate genes for further testing in a model system.
Identifying phenotype-dependent modules in signaling pathways
Yunquan Bao#, Ye Liu#, Tao Ma, Yuanlie Lin, Shao Li, MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST / Dept. Automation and Mathematics, Tsinghua University, China
# Joint first authors
Uncovering phenotype-genotype relationships from pathways and networks has attracted wide interest in recent years. However, a single pathway, such as MAPK, may be associated with many disease phenotypes. The present work establishes a method to identify phenotype-dependent pathway modules from the perspective of ZHENG (syndrome in Traditional Chinese Medicine), which places more emphasis on empirical phenotype profiles.
We analyzed phenotypes related to a pair of ZHENGs, Cold and Hot, in OMIM with a text-mining method and found that 88 and 128 diseases, including the gastrointestinal class, show Cold ZHENG- and Hot ZHENG-related phenotype profiles, respectively; 136 Cold ZHENG-related genes and 170 Hot ZHENG-related genes correspond to two modular organizations in both the PPI network and signaling pathways. Next, we developed a network-based significance analysis approach, integrating a random 0-1 table procedure and a specific weight-setting method, to infer ZHENG-dependent modules in pathways such as MAPK. Using microarray data from different stages of chronic gastritis (CG), we found that, although the single phenotype-related genes MAPK9 and ILR2 and the whole pathway are not significant under either condition, the module centered at MAPK9 is tightly related to intestinal gastritis and the module centered at ILR2 to intestinal metaplasia, indicating that phenotype-dependent modules may play a role in the progression from gastritis to gastric cancer.
Systems Analysis of Common Multigenetic Human Diseases
Supriyo De1, Yonqing Zhang1, John R. Garner1, S. Alex Wang2, and Kevin G. Becker1, 1Gene Expression and Genomics Unit, National Institute on Aging, National Institutes of Health, Baltimore, MD; 2Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD
Complex multi-genetic diseases such as cardiovascular disease, autoimmune disorders, neurological disorders, and metabolic diseases account for a majority of mortality and morbidity in developed countries. The Genetic Association Database (GAD) (Becker et al. 2004) is a public repository of information from genetic association studies, archiving published human disease association studies of all kinds with an emphasis on non-Mendelian common disease. It currently contains approximately 40,000 disease- and gene-specific records, including information on 3,397 unique genes and 6,932 unique disease phenotypic descriptions, including Alzheimer’s disease, autoimmune disease, infection, sepsis, cardiovascular disorders, and stroke, among many others.
In this study, relationships between diseases were identified by a unique method similar to phylogenetic classification. First, the distances between diseases were calculated by pairwise comparison of the genes associated with each disease. The disease relationships were then computed from the distance matrix using the Fitch program, which is based on the Fitch and Margoliash method of constructing phylogenetic trees. The Neighbor-Joining method of Saitou and Nei (1987) was also used to visualize the larger sets. Although Fitch performed more consistently on randomized inputs, Neighbor gave very similar results most of the time. The more traditional method of hierarchical clustering was also used to supplement the phylogenetic trees. This analysis identified major groupings of diseases, placing related disorders in appropriate general categories and positioning highly related disorders closer in space. In the future, this approach can be developed further to make predictions regarding the risk of developing related common complex disorders.
This research was supported by the Intramural Research Program of the NIH, National Institute on Aging, and the NIH Center for Information Technology, National Institutes of Health.
Graph-theoretic tools for reducing the size of the DCJ Median Problem
Wei Xu, University of Ottawa
At the heart of rearrangement-based phylogenetics is the "median problem": given a set of genomes G and a genomic distance measure d, find the genome q minimizing the total distance Σ_{g∈G} d(q, g). The median problem with the double-cut-and-join (DCJ) distance measure is NP-hard. Exact algorithms implementing branch-and-bound are slow and severely limited in problem size, while heuristics are inaccurate to an unpredictable degree. We need analytic criteria to see whether a median problem instance is tractable, and fast algorithms that work for such cases.
For |G| = 3, we use the "multiple breakpoint graph" (MBG), in which three colors represent the adjacencies in the three given genomes and a fourth color represents the median genome. We are interested in finding non-crossing subgraphs, i.e., subgraphs with no 4-colored edges connecting them to the rest of the graph. The optimal solution can always be found by combining the solutions for such subgraphs with the solution for the remaining graph.
Adequate subgraphs contain at least (3/2)m cycles, where m is half the number of vertices. Our main result is that every adequate subgraph is non-crossing. Finding adequate subgraphs can thus systematically reduce the size of the median problem. We have already found all the small adequate subgraphs and are developing methods to search for larger ones. These are being incorporated into increasingly efficient software for the median problem.
We further conjecture that the total distance satisfies Σ_{g∈G} d(q, g) ≤ (3/2)n, where n is the number of genes; this is supported by computational experiments.
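For readers unfamiliar with the distance underlying the median problem, the pairwise DCJ distance itself is easy to compute: for two circular genomes over the same signed gene set, d = n − c, where c is the number of cycles in the adjacency graph formed by the two genomes' adjacencies. The following is a minimal sketch under those assumptions (circular, single-chromosome genomes, identical gene content):

```python
def extremities(g):
    """Signed gene -> (left, right) extremities; +g reads tail->head, -g head->tail."""
    return ((g, 't'), (g, 'h')) if g > 0 else ((-g, 'h'), (-g, 't'))

def adjacencies(genome):
    """Circular genome (list of signed genes) -> map pairing adjacent extremities."""
    adj, n = {}, len(genome)
    for i in range(n):
        _, right = extremities(genome[i])
        left, _ = extremities(genome[(i + 1) % n])
        adj[right] = left
        adj[left] = right
    return adj

def dcj_distance(g1, g2):
    """DCJ distance d = n - c: n genes minus c adjacency-graph cycles."""
    a1, a2 = adjacencies(g1), adjacencies(g2)
    seen, cycles = set(), 0
    for start in a1:
        if start in seen:
            continue
        cycles += 1
        x = start
        while True:                 # alternate edges of the two genomes
            seen.add(x)
            x = a1[x]
            seen.add(x)
            x = a2[x]
            if x == start:
                break
    return len(g1) - cycles
```

Inverting gene 2 in (1, 2, 3), for example, is a single DCJ operation, so the distance between [1, 2, 3] and [1, -2, 3] is 1. The median problem asks for the genome minimizing the sum of three such distances.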
Evolution and Phylogenetics
Genetic Variation of RAPD Markers for Indian Jatropha curcas Collections and Cultivars
Naresh, B. and Prathibha Devi, Biotechnology and Molecular Genetics Laboratory, Department of Botany, Osmania University, Hyderabad, India
Biodiesel, an alternative diesel fuel, is made from vegetable oils and animal fats through a process of transesterification. It is biodegradable and non-toxic and has low emission profiles, making it environmentally beneficial. The hardy Jatropha is resistant to drought and pests and produces seeds containing up to 40% oil. After conversion to biodiesel, Jatropha seed oil can be used in a standard diesel engine, while the residue can be processed into biomass for various purposes. However, despite its abundance and its use as an oil and reclamation plant, none of the Jatropha species has been properly domesticated. The goal of germplasm conservation in genebanks is to maximize genetic variation, and collecting explorations would be more efficient if factors predicting the areas and habitats associated with greater genetic differences and diversity could be identified. The objective of this research on Jatropha curcas, the biodiesel plant, was therefore to investigate whether ecogeographical variables are significantly associated with patterns of genetic variation in wild populations of Jatropha curcas.
Here, combined morphological and molecular techniques were used to characterize variation in Jatropha curcas populations collected from four different locations in India. We studied the morphological variability of J. curcas in relation to the degree of its genetic differentiation, in order to unravel the causes of its conspicuous intraspecific morphological variation. The four genotypes could be distinguished by their morphological traits, viz. plant height (m), number of primary branches, number of secondary branches, leaf area index (measured with a CI-203 laser digital leaf area index meter, CID Instruments), number of pods per tree, seed length (cm), seed diameter (cm), seed yield (g), and percentage of seed oil. The genetic variability among accessions of Jatropha curcas was determined using randomly amplified polymorphic DNA (RAPD) profiles. The SAHN clustering program of the NTSYSpc package (Rohlf, 2000) was used to construct a UPGMA (Unweighted Pair Group Method with Arithmetic Mean) dendrogram. Similarity coefficients from the squared data matrix were used for the principal dendrograms obtained on the basis of the primers; these were in accordance with existing taxonomy, confirming the usefulness of RAPD analysis for taxonomic studies. Our observations suggest that RAPD analysis could help in identifying genetic variation among different accessions of Jatropha.
The results of the study indicated that patterns of different ecogeographical structures were not associated with genetic differences, except in a few instances. The surprising genetic similarity of three populations from different climates and geographic regions of the country may indicate a common origin for much of the naturalized Jatropha population. Remarkably, then, geographical separation of populations, a parameter usually considered important when collecting germplasm, did not reveal any genetic differences. Therefore, one should collect many populations and incorporate a manageable subset into the genebank on the basis of empirical measurements of genetic diversity.
Inferring Phylogenetic Relationships between Organisms for Y-Family Polymerases Using HMMER 2.0 and the Neighbor-Joining Method
Wendy Lee*, Nancy Fong*, Magdalena Franco*, Robert Fowler, Sami Khuri, Department of Biological Sciences, San Jose State University
The increasing number of protein sequences in databases has made it an arduous task to infer phylogenetic relationships between organisms. To alleviate some of these complications, relationships between organisms are being derived within individual protein families. In 2004, Tamura et al. suggested that the neighbor-joining (N-J) algorithm is one of the best algorithms for inferring large phylogenies. In the present study, we constructed a phylogenetic tree for the Y-family polymerases using the N-J algorithm. First, homologous Y-family polymerase protein sequences were found with HMMSearch from HMMER 2.0, and hypothetical protein sequences were eliminated from the search results. ClustalW was used to create a multiple sequence alignment of the 146 qualifying sequences. A distance matrix was computed from this alignment using Protdist with the Henikoff/Tillier PMB matrix distance model, and bootstrapping was performed for statistical inference and resampling. Finally, the neighbor-joining method was used to construct a phylogenetic tree to infer the molecular evolution of the Y-family polymerases. Our results show that the Y-family polymerases have a common bacterial ancestor and then diverged into three subtrees, each containing genetically similar organisms. The three clades represent the major groups of Y-family polymerases of Homo sapiens and their homologs: Pol kappa, Pol iota, and Pol eta. Pol kappa proteins are closely related to the bacterial polymerases, while Pol iota and Pol eta resulted from a further divergence from the common ancestor of Pol kappa. Thus, the Y-family polymerases are conserved across several kingdoms of life.
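The distance-matrix step in pipelines like this one can be illustrated with the simplest possible measure, the p-distance (fraction of differing aligned positions). This is only a stand-in for Protdist's model-based distances (the PMB matrix corrects for multiple substitutions, which the sketch below does not):

```python
def p_distance_matrix(seqs):
    """Pairwise p-distances for equal-length aligned sequences:
    the fraction of positions at which two sequences differ."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            D[i][j] = D[j][i] = diffs / len(seqs[i])
    return D
```

A matrix like this (or its model-corrected counterpart) is the input that neighbor-joining consumes to build the tree.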
Probability Analysis and Modeling of Influenza Type A Virus Hemagglutinin Gene with a novel Markov Model
Ham Ching Lam, Dept. of Computer Science, University of Minnesota
We have developed a novel Markov model of the genetic distance of the hemagglutinin (HA) gene, a major surface antigen of the avian influenza virus. Through this model we estimate the probability of finding highly similar virus sequences separated by long time gaps. Our biological assumption is based on neutral evolutionary theory, which has previously been applied to this virus [Gojobori, Moriyama, and Kimura. PNAS Vol 87. 1990]. Our working hypothesis is that, after a long enough time gap under this theory, many site mutations should accumulate, leading to distinct modern variants. We obtained 3439 HA protein sequences isolated from 1918 to 2006 from around the globe, aligned them to a consensus sequence using the NCBI alignment tool, and used a Hamming distance metric on the aligned sequences. We test our hypothesis by combining a standard Poisson process with the Markov model: the Poisson process models the occurrence of mutations in a given time interval, and the Markov model estimates the probabilities of changes to the genetic distances due to mutations. By coalescing all sequences at a given genetic distance into a single state, we obtain a tractable Markov chain with a number of states equal to the length of the base peptide sequence. The model predicts that the probability of finding a highly similar virus after several decades is extremely small. The existence of recent viruses that are very similar to older viruses therefore suggests that some reservoir preserves viruses over long periods.
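The Poisson component of such a model already conveys the key intuition: the probability that a sequence stays within a small Hamming distance of its ancestor shrinks rapidly with elapsed time. A minimal sketch, with the per-site substitution rate chosen purely for illustration:

```python
import math

def prob_within_distance(d, mu, length, years):
    """P(at most d substitutions) when substitutions accumulate as a
    Poisson process with per-site rate mu (per year) over a sequence
    of the given length observed for the given number of years."""
    lam = mu * length * years  # expected number of substitutions in the interval
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(d + 1))
```

With any realistic rate, this probability after several decades is tiny, which is what makes the observation of near-identical old and recent isolates point toward a preserving reservoir.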
Modeling and Simulating Biomolecular Interactions in Signal Transduction Systems with Molecular State Machines
Jin Yang, CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Simulating the dynamics of complex biomolecular interaction systems, whether by solving deterministic ordinary differential equations or by integrating the chemical master equation with a stochastic simulation algorithm, requires specifying a set of biochemical species and their reactions. However, due to significant representational limitations of biochemical reaction models, quantitative study of large-scale cell signaling systems is inefficient both computationally and analytically. We propose here a computer-science formalism that describes individual proteins as computing machines, referred to as molecular finite state machines (MFSMs). In this framework, biomolecules are modeled as computing machines that react to environmental signals by local computation, and an entire biomolecular system is recapitulated as the result of interactions among MFSMs under prescribed protocols that can be used to simulate its stochastic dynamics. Models specified with such formal structures explicitly represent the internal state transitions of individual biomolecules in response to environmental information. This approach integrates the biochemistry of molecular interactions with the well-established computer-science theory of finite state machines, and provides an executable, reusable, and extensible framework for constructing, simulating, and analyzing biomolecular interaction models of signal transduction systems.
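The core idea of a molecule as a state machine can be sketched in a few lines. The receptor states and signals below are invented for illustration and are not the authors' actual formalism (which also prescribes interaction protocols and stochastic simulation):

```python
class MolecularFSM:
    """A toy molecular finite state machine: a biomolecule holds an internal
    state and updates it via a transition table when it receives a signal."""
    def __init__(self, state, transitions):
        self.state = state
        self.transitions = transitions  # (state, signal) -> next state
    def react(self, signal):
        # Signals with no matching transition leave the state unchanged.
        self.state = self.transitions.get((self.state, signal), self.state)
        return self.state

# Hypothetical receptor cycling through ligand binding and (de)phosphorylation
receptor = MolecularFSM('inactive', {
    ('inactive', 'ligand'): 'bound',
    ('bound', 'kinase'): 'active',
    ('active', 'phosphatase'): 'inactive',
})
```

A full MFSM system would couple many such machines, letting one machine's state change emit the signals that drive its interaction partners.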
Gene Function Inference
ClueGene: An Online Search Engine For Querying Gene Regulation
Joshua M. Stuart, Department of Biomolecular Engineering, University of California, Santa Cruz
Biologists often have incomplete knowledge of the full set of genes that constitute a pathway or molecular complex they are studying. Researchers can use existing Web-based gene recommendation search engines to obtain recommendations for additional genes to include in their study, but such search engines base their results on expression data alone. The ClueGene online search engine allows biologists to explore multiple aspects of gene regulation for their genes of interest via a suite of integrated search tools. The following types of search are available: coexpression search for recommended genes (using the ClueGene method), search for known transcription factor binding sites (using TRANSFAC data), search for novel transcription factor binding sites (using the BioProspector method), and search for enriched Gene Ontology terms. Exploratory analysis is supported via iterative searching, with the user selecting query genes at each search iteration. Search results are displayed as a table with rows for genes and columns for the results of each search type. When a search is performed for known or novel transcription factor binding sites or for enriched GO terms, the new results are added to the table along with the result columns from any searches previously performed. Because ClueGene provides a variety of search types, the user can make a more informed choice of potentially relevant genes for the next search iteration. The ClueGene Web server is free and open to all users and is available at: http://sysbio.soe.ucsc.edu/cluegene/cg4/.
High Performance Bio-Computing
A Bioinformatics Approach to Modeling RNA Structure
Magdalena A. Jonikas, Stanford University
Functional RNA molecules such as ribozymes have complex three-dimensional structures that enable them to play catalytic or structural roles in the cell. Knowing the structure of these molecules is critical to understanding their functions; however, predicting RNA structure from primary sequence remains a significant challenge. We have developed the Nucleic Acid Simulation Tool (NAST), a software package that builds coarse-grained models of RNA structures in a fully automated fashion. We use an RNA-specific knowledge-based potential in a coarse-grained molecular dynamics engine to generate large numbers of plausible 3D structures. We then filter these structures using surface measurements to identify those most compatible with the experimental data.
NAST requires no special RNA modeling expertise, uses available information about the secondary and tertiary structure, and can run on either a single computer or a cluster.
We have used NAST to address two classes of structure modeling problems for the Tetrahymena thermophila group I intron: modeling missing structural elements in RNA crystal structures and modeling folding intermediate structures.
A ProGenGrid Service for the Protein Structure Prediction
Giovanni Aloisio, University of Salento
Protein structure prediction is of great importance in medicine and biotechnology. One important aspect concerns the function of mitochondrial metabolite carriers, a family of intrinsic proteins that mediate the flux of several metabolites, in relation to the role such proteins play in certain mitochondrial pathologies. Along with rapid progress in the identification of novel mitochondrial carriers over the last few years, an increasing number of genes encoding mitochondrial carriers have been identified whose defects cause various inherited diseases. Sequence studies have shown that all carrier proteins share a highly conserved sequence motif, the carrier signature. To determine the possible role of this motif in the function of a carrier protein, we used the dicarboxylate carrier (DIC) of Saccharomyces cerevisiae as a model protein. In this study we model the structure of DIC using the only carrier so far solved at atomic resolution in the PDB data bank: the ADP/ATP carrier protein of Bos taurus heart mitochondria. We have integrated a routinely expert-dependent strategy into an automatic tool that facilitates the generation of low-resolution carrier models, good enough to serve as template models on different occasions, including site-directed mutagenesis to prove or disprove the involvement of a computed topological structure in a specific functional process. We therefore developed a new service that uses applications deployed in a Grid environment to automate the prediction procedure and integrate the data produced by in vitro and in silico analyses.
Principal Component Analysis-Based Linear Combinations of Oligonucleotide Frequencies for Metagenomic DNA Fragment Binning
Hongwei Wu, Georgia Institute of Technology
We have investigated linear combinations of oligonucleotide (k-mer) frequencies for binning the metagenomic DNA fragments of short-to-moderate lengths. The k-mer frequencies have been widely used for gene prediction, phylogenetic tree construction, and metagenomic binning. However, the k-mer frequencies will lead to a high dimensional feature space even for a modest value of k. Existing methods to reduce the dimensionality of the feature space focus on particular oligonucleotide patterns or rather small values of k. We have applied the principal component analysis (PCA) on the oligonucleotide frequencies, based on which we can not only achieve a reduction of the feature dimensionality at a ratio higher than five, but can also retain the most informative features. Our experiments on simulated metagenomic data sets with four types of classifiers have shown that (i) the PCA-based linear combinations of k-mer frequencies are capable of capturing the intrinsic characteristics of DNA fragments and can therefore adequately serve as the binning features; (ii) the PCA-based linear combinations of k-mer frequencies tend to be more effective and stable as the DNA fragment length increases; and (iii) the rather simple linear classifiers can achieve high accuracy for the metagenomic DNA fragment binning at various taxonomic levels, even at a level as specific as species.
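As a minimal sketch of the overall idea (not the authors' pipeline), the k-mer feature extraction and PCA projection can be written with NumPy alone; the fragment sequences below are synthetic, and k=3 is chosen only to keep the 4^k feature space small:

```python
import numpy as np
from itertools import product

def kmer_freqs(seq, k=3):
    """Frequency vector over all 4^k k-mers of a DNA fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        m = seq[i:i + k]
        if m in index:
            v[index[m]] += 1
    total = v.sum()
    return v / total if total else v

def pca_project(X, n_components=10):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# toy "fragments": two compositionally distinct groups of sequences
rng = np.random.default_rng(0)
frags = ["".join(rng.choice(list("AACCT"), 300)) for _ in range(5)] + \
        ["".join(rng.choice(list("GGTTA"), 300)) for _ in range(5)]
X = np.vstack([kmer_freqs(s, k=3) for s in frags])
Z = pca_project(X, n_components=2)   # 64-dim -> 2-dim features for binning
```

The low-dimensional rows of `Z` would then be fed to a (linear) classifier for binning, as the abstract describes.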
Microarray Data Analysis
Cancer Classification using the 1-D Wavelet Transform and Gaussian Mixture Modeling
Adarsh Jose, University of Akron
Classifying a tumor sample into its sub-class is imperative in the diagnosis and treatment of cancer. DNA microarray technology enables classification of cancer based only on the gene expression profiles of the cancer samples. One of the most important problems in classification is the limited availability of tumor samples, so choosing the features that will optimize the classification becomes very important. The feature selection methods in current use are heuristic approaches that cannot be generalized to all available datasets.
A feature extraction method based on the 1-D wavelet transform has been suggested for training classifiers as an alternative to the standard methods. We intend to explore the potential of the 1-D wavelet transform as a feature extraction tool for classification of gene expression data, and later for clustering, to develop a class discovery tool.
The classification is based on Gaussian mixture model methods. Models for different numbers, shapes, sizes and distributions of the clusters are evaluated using the Bayesian Information Criterion (BIC). Different combinations of wavelets, noise thresholds and feature sizes should be explored to obtain optimal classification results.
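The model-selection loop described above can be sketched with scikit-learn's GaussianMixture, scoring each candidate number of components and covariance shape by BIC; the feature matrix here is synthetic stand-in data, not wavelet features of real expression profiles:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# hypothetical feature matrix: two sample groups, 4 wavelet-derived features
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])

# evaluate models over cluster count and covariance shape, keep lowest BIC
best, best_bic = None, np.inf
for n in range(1, 5):
    for cov in ("full", "diag", "spherical"):
        gm = GaussianMixture(n_components=n, covariance_type=cov,
                             random_state=0).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best, best_bic = gm, bic

labels = best.predict(X)   # cluster assignments under the BIC-best model
```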
Exploring High-Dimensional Data with Feature Selection and Regression for Microarray Data Analysis
Brian Quanz, University of Kansas
Microarray technology allows exploration of the relationship between disease conditions and expression levels for many genes simultaneously. The motivation for this work is the goal of predicting the susceptibility of different brain cells to a common disease condition, called oxidative stress, based on the cells' gene expression patterns derived from microarray experiments. We address the problem of predicting how neurons respond to oxidative stress based on their GeneChip (microarray) data, by identifying important genes and using those genes' expression levels to predict a response score for a given sample. This consists of feature selection with high dimensionality (15923 features) and small sample size (12 samples), and regression as opposed to classification. We combine statistical techniques to filter the genes, then select key genes using multiple iterative machine learning methods. To predict response scores for a sample using the selected genes, we develop a new method of regression, Two-Stage Regression. Two-Stage Regression uses separate regression models for different classes, combining classification and regression. We compared our method of Two-Stage Regression to several existing regression methods. Our experimental study using leave-one-out cross-validation suggests our method offers improved performance for regression with such microarray data. We believe we have identified possible key genes for further exploration and developed a suitable method for response prediction.
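A toy version of the two-stage idea, with a nearest-centroid classifier for stage one and per-class least-squares fits for stage two (the abstract does not specify the actual classifiers and regressors used; all data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical data: two classes with different linear response surfaces
X0, X1 = rng.normal(0, 1, (20, 3)), rng.normal(4, 1, (20, 3))
y0 = X0 @ np.array([1.0, 2.0, -1.0]) + 0.5
y1 = X1 @ np.array([-2.0, 0.5, 1.0]) + 3.0

# Stage 1: nearest-centroid classifier assigns a sample to a class
centroids = np.vstack([X0.mean(axis=0), X1.mean(axis=0)])
def classify(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Stage 2: one least-squares regression model per class (bias column added)
def fit(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]
models = [fit(X0, y0), fit(X1, y1)]

def predict(x):
    """Classify first, then apply that class's regression model."""
    w = models[classify(x)]
    return np.append(x, 1.0) @ w

x_new = np.array([4.2, 3.8, 4.1])   # lies near the class-1 centroid
score = predict(x_new)
```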
Gene Module Regulatory Network Analysis of Follicular Lymphoma Transformation
Andrew J. Gentles, Stanford University
The transformation of follicular lymphoma (FL) to diffuse large B cell lymphoma (DLBCL) is common and associated with worse prognosis. Mechanisms underlying transformation are poorly understood and implicate multiple pathways. To better understand the transformation process, we constructed a gene module regulatory network from microarray data on FL and DLBCL samples. Modules that significantly discriminate between FL and DLBCL were identified by supervised classification, in addition to modules that discriminate between FL that transform and FL that are not seen to transform. A network of regulatory modules was constructed with a directed edge between pairs of modules if a gene in one module served as a regulator of the other module. Known aspects of FL/DLBCL transformation, such as changes attributable to infiltrating T-cell populations, were identified and served as internal validation. Core discriminant modules associated with transformation show expression signatures of cellular differentiation states, proliferative drive, deregulation of mitochondrial function, and increased proteasome activity impacting the cell cycle. Our network of gene modules can be used to generate hypotheses regarding processes driving transformation, and suggests a potential role for Pax5. In addition, it reveals that Bortezomib (a proteasome inhibitor) and Ecteinascidin-743 may have therapeutic benefits for treatment of FL and transformed DLBCL.
Methodology and data handling of expression profiles for alternative promoters in human tissues
Edwin Jacox, NIH
Alternative promoters often direct tissue specific gene expression. Additionally, they may be species-specific to allow diversification of gene expression patterns within a biological niche. To date, few quantitative studies have looked across the entire human genome to compare expression patterns that distinguish alternative promoters of the same gene. We developed a unique computational approach to systematically investigate the gene expression profiles associated with alternative promoters using data for eleven tissues (heart, lung, etc.), generated from high-throughput exon-specific microarray experiments. One problem in using array data to examine promoter-specific expression has been the mixture of heterogeneous microarray signals generated by alternative transcripts of the same gene. To circumvent this problem we used arrays with probes for all annotated exons in the human genome. We mapped alternative transcripts having unique first exons, which allowed us to address questions regarding the expression levels and tissue profiles. Our software pipeline identified genes that utilize alternative promoters in diverse ways, producing widely divergent combinations of expression data. Even with the limited accuracy of microarrays, we were able to establish general trends of expression patterns and tissue specificity. This platform illustrates an approach to unraveling the complexities of gene regulation that have not been previously explored on a genomic scale.
Searching for Temporal Gene Expression Profiles in Databases
Guenter Tusch, Swadeep Malgireddy, Chris Bretl, Grand Valley State University;
Martin O'Connor, Amar Das, Stanford University
The analysis of time series expression experiments helps in understanding biological systems and their responses to a particular stimulus. Traditional approaches to clustering temporal gene expression data include PCA, Pearson correlation, or software packages like GQL, CAGED, or STEM. Assume that a researcher has obtained a typical fold change profile and tries to retrieve similar profiles from microarray databases or clinical databases (which contain microarray data in increasing numbers). One approach is to search for highly correlated profiles. We could show for an example data set (GEO GDS656) that this is a reasonable approach (sensitivity and specificity between 90% and 98%). However, this approach assumes that the pattern of time points is identical, or at least very similar, to the original experimental design. It is more realistic to look, for instance, for a peak in the profile instead of correlating the entire profile. In Knowledge-based Temporal Abstraction, time-stamped data points are transformed into an interval-based representation. We extended this framework by creating an open-source platform, SPOT. It supports the R statistical package and knowledge representation standards (OWL, SWRL) using the open-source Semantic Web tool Protégé-OWL.
The user selects one of the different time representations in the system. The program generates R macros and OWL/SWRL code. SWRL allows users to write rules that can reason about OWL individuals. The Protégé-OWL plug-in makes it easy to build ontologies. The researcher (user) can define gene expression profile peaks, e.g., "Early" or "Late", and search for similar episodes in the database.
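A minimal illustration of the peak-based temporal abstraction (not SPOT's OWL/SWRL machinery): time-stamped fold changes are turned into above-threshold intervals and labeled "Early" or "Late"; the threshold and midpoint values are arbitrary choices for the example:

```python
def peak_intervals(times, values, threshold=2.0):
    """Abstract time-stamped fold changes into above-threshold intervals."""
    intervals, start, prev = [], None, None
    for t, v in zip(times, values):
        if v >= threshold and start is None:
            start = t                        # interval opens
        elif v < threshold and start is not None:
            intervals.append((start, prev))  # interval closes at previous point
            start = None
        prev = t
    if start is not None:
        intervals.append((start, prev))      # still open at the last time point
    return intervals

def label(interval, midpoint=12.0):
    """Qualitative abstraction of a peak interval by its start time."""
    return "Early" if interval[0] < midpoint else "Late"

times  = [0, 2, 4, 8, 12, 16, 24]            # hours
values = [1.0, 2.5, 3.1, 1.2, 0.9, 2.4, 1.1]  # fold changes
peaks = peak_intervals(times, values)
labels = [label(p) for p in peaks]
```

A query such as "genes with an Early peak" would then match interval abstractions rather than raw time points, which is what makes the search robust to differing experimental designs.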
Stationary Wavelet Packet Transform and Dependent Laplacian Bivariate Shrinkage Estimator for Array-CGH Data Smoothing
Nha Nguyen, University of Texas at Arlington
Array-based comparative genomic hybridization (aCGH) has emerged as a highly efficient technique for the detection of chromosomal imbalances. Characteristics of these DNA copy number aberrations provide insights into cancer and are useful for diagnostic and therapeutic strategies. In this paper, we propose a statistical bivariate model for array CGH data in the stationary wavelet packet transform (SWPT) domain and apply this bivariate shrinkage estimator to array CGH smoothing. In our experiments, we use both synthetic data and real data. In synthetic data generation, we use two different noise assumptions: Gaussian noise and real array CGH noise. The Root Mean Squared Error (RMSE) and Receiver Operating Characteristic (ROC) curve results demonstrate that our methods outperform existing methods such as Lowess (2003), Quantreg (2005), Smoothseg (2007), SWTi (2007), and DTCWTi-bi (2007).
Method: aCGH data is a finite signal, so applying a wavelet smoothing method directly may introduce errors at the borders of the denoised signal; an extension step is therefore a very important preprocessing step before denoising. Moreover, aCGH data contain many step functions whose information lies in both the low and high frequencies, and previous wavelet methods could not offer enough high-frequency subbands for the smoothing operation. In this paper, the SWPT is used to overcome these problems because it keeps the shift-invariance property and decomposes aCGH data into many subbands in both low and high frequencies. Several methods have been proposed for selecting thresholding values, such as hard universal and non-universal thresholding; however, these methods do not exploit the dependency between wavelet coefficients. Thus, we propose the use of the shift-invariant SWPT together with a dependent Laplacian bivariate shrinkage estimator, which takes advantage of the dependency between a wavelet coefficient and its cousin, for aCGH data denoising.
Our method can be summarized as follows:
Step 1: Extend data by using symmetric extension method and decompose new data by the SWPT.
Step 2: Calculate the noise variance and the marginal variance for wavelet coefficient.
Step 3: Estimate the child coefficients and the cousin coefficients.
Step 4: Reconstruct data from the denoised coefficients by taking the inverse SWPT.
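The four steps can be illustrated with a one-level undecimated Haar transform and plain soft thresholding, a deliberately simplified stand-in for the SWPT and the bivariate Laplacian estimator (the test signal and threshold are arbitrary):

```python
import numpy as np

def swt_haar_denoise(x, thresh):
    """One-level undecimated Haar transform with soft thresholding.

    A minimal stand-in for the paper's SWPT + bivariate shrinkage:
    symmetric extension, shift-invariant decomposition, thresholding of
    the detail band, then reconstruction averaging the shifted branches.
    """
    n = len(x)
    xe = np.pad(x, (0, 1), mode="symmetric")       # Step 1: extend the border
    cA = (xe[:-1] + xe[1:]) / 2.0                  # approximation band
    cD = (xe[:-1] - xe[1:]) / 2.0                  # detail band
    cD = np.sign(cD) * np.maximum(np.abs(cD) - thresh, 0.0)  # soft threshold
    # Step 4: invert both shifted branches and average for shift invariance
    r1 = cA + cD                                   # reconstructs sample i
    r2 = np.roll(cA - cD, 1)                       # reconstructs sample i+1
    r2[0] = r1[0]                                  # boundary: one valid branch
    return (r1 + r2)[:n] / 2.0

rng = np.random.default_rng(3)
truth = np.repeat([0.0, 1.0, 0.0], 50)             # step-like copy-number profile
noisy = truth + rng.normal(0, 0.3, truth.size)
smooth = swt_haar_denoise(noisy, thresh=0.25)
```

The actual method decomposes into many SWPT subbands and estimates each coefficient jointly with its cousin; this sketch only shows the extend/decompose/threshold/reconstruct skeleton on a step signal.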
Result and conclusion: The denoising results of our method are much better than those of previous methods in terms of the root mean squared error (a 5%-59.3% improvement) and the ROC curve at different Gaussian noise levels. Furthermore, simulated aCGH data with real noise is also used in the evaluation, and our method still outperforms the others (a 7.9%-51.8% improvement). We also use real aCGH data to demonstrate that our approach is better than the most commonly used smoothing methods.
Towards Early Detection of Human Cancer: A Systems Biology Approach to Identify Serum Biomarker for Gastric Cancer
Juan Cui1, Kun Xu1, David Puett1, Ying Xu1,2, 1Dept. of Biochemistry and Molecular Biology and 2Institute of Bioinformatics, University of Georgia
Gastric cancer is the second leading cause of cancer deaths worldwide, occurring with an especially high incidence in Asia. The precise mechanism underlying gastric carcinogenesis is not fully understood, and detection often fails at an early stage. In this study, we present a systematic approach for identifying serum protein biomarkers for early detection of gastric cancer. We developed two key computational techniques that are essential to the success of the detection of cancer. The first identifies alternatively spliced variants that are differentially expressed in stomach cancer versus normal stomach tissue, as well as at different developmental stages of the cancer. For this purpose, an exon array study was designed based on 80 paired gastric tumor and normal samples. Statistical analysis and sophisticated data mining techniques, such as ANOVA, the splicing index, and machine learning methods, have been applied to identify around 20 discriminative gene signatures associated with gastric cancer. The second tool focuses on predicting which proteins from the highly expressed genes in gastric cancer can be secreted into the bloodstream, suggesting possible marker proteins for follow-up serum proteomic studies. We have identified a list of features, such as signal peptides, transmembrane domains, glycosylation sites, disordered regions, secondary structural content, and hydrophobicity and polarity measures, that show relevance to protein secretion. Using these features, we have trained an SVM-based classifier to predict protein secretion into the bloodstream, which shows improved performance (~90% accuracy) over other secretion prediction tools. A set of evaluation results demonstrates that our method can provide highly useful information linking genomic and proteomic studies for cancer biomarker discovery.
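A hedged sketch of the secretion-classifier setup: an RBF-kernel SVM cross-validated on a feature matrix whose columns mimic the listed feature types. Every value below is synthetic; the real classifier's features and training data are those described in the abstract, not these:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
# hypothetical per-protein features: [signal-peptide score, #TM domains,
#  #glycosylation sites, disorder fraction, mean hydrophobicity]
secreted = np.column_stack([
    rng.normal(0.8, 0.1, 40), rng.poisson(0, 40), rng.poisson(3, 40),
    rng.normal(0.4, 0.1, 40), rng.normal(-0.5, 0.2, 40)])
intracellular = np.column_stack([
    rng.normal(0.1, 0.1, 40), rng.poisson(1, 40), rng.poisson(1, 40),
    rng.normal(0.2, 0.1, 40), rng.normal(0.1, 0.2, 40)])
X = np.vstack([secreted, intracellular])
y = np.array([1] * 40 + [0] * 40)          # 1 = secreted into bloodstream

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
acc = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
```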
Global and local architecture of the mammalian microRNA-transcription factor regulatory network
Reut Shalgi, Weizmann Institute of Science
MicroRNAs (miRs) are small RNAs that regulate gene expression at the posttranscriptional level. It is anticipated that, in combination with transcription factors (TFs), they span a regulatory network that controls thousands of mammalian genes. Here we set out to uncover local and global architectural features of the mammalian miR regulatory network. Using evolutionarily conserved potential binding sites of miRs in human targets, and conserved binding sites of TFs in promoters, we uncovered two regulation networks. The first depicts combinatorial interactions between pairs of miRs with many shared targets. The network reveals several levels of hierarchy, whereby a few miRs interact with many other lowly connected miR partners. We revealed hundreds of "target hubs" genes, each potentially subject to massive regulation by dozens of miRs. Interestingly, many of these target hub genes are transcription regulators and they are often related to various developmental processes. The second network consists of miR-TF pairs that coregulate large sets of common targets. We discovered that the network consists of several recurring motifs. Most notably, in a significant fraction of the miR-TF coregulators the TF appears to regulate the miR, or to be regulated by the miR, forming a diversity of feed-forward loops. Together these findings provide new insights on the architecture of the combined transcriptional-post transcriptional regulatory network.
Pathways, Networks, Systems Biology
A Bayesian Approach to Improve Genome-Scale Metabolic Models by the Integration of Biological and Topological Evidences
Xinghua Shi, University of Chicago
With the rapid availability of hundreds to thousands of sequenced genomes, the construction of genome-scale metabolic models for these organisms has attracted much attention. Although current genome/pathway databases provide a large proportion of the metabolic information that can be used directly to build metabolic models, a number of problems still introduce network holes and thus make these models incomplete. Network holes occur when the network is disconnected and certain metabolites cannot be produced or consumed. Many efforts have been made to fill these network holes. Due to the huge amount of data and the large size of genome-scale metabolic networks, computational techniques based on both topological properties of the network and various biological features are desirable to generate feasible candidate reactions. Up to now, manual search for candidates to fill network holes still dominates the construction of genome-scale metabolic models, and only a handful of such models are available due to this time-consuming and labor-intensive manual work. Towards the automatic reconstruction of metabolic models, we propose a set of Bayesian methods that integrate topological and biological evidence from different databases to fill network holes and improve metabolic model reconstructions. Seven individual Bayesian predictors are built by combining network topology and various biological evidence extracted from a published metabolic model repository, the Biochemical Genetic and Genomic Database of Large-Scale Metabolic Reconstructions (BIGG), and two genome/pathway databases, the SEED and the Kyoto Encyclopedia of Genes and Genomes (KEGG). After the individual predictors are trained, two mechanisms, majority vote and Bayesian integration, are used to produce unified predictions from the results of the individual predictors.
This set of computational tools is tested on data sets including one hundred synthetic models with ten reactions, and two new genome-scale metabolic models for C. acetobutylicum and C. tepidum by improving their draft models from the SEED and comparing with published genome-scale models of E. coli and S. aureus.
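The two integration mechanisms can be sketched as follows, assuming each individual predictor emits a posterior probability for a candidate hole-filling reaction; the conditional-independence assumption and the example scores are illustrative only:

```python
import math

def naive_bayes_combine(posteriors, prior=0.5):
    """Combine independent predictors' P(reaction | evidence_i) in log-odds."""
    log_odds = math.log(prior / (1 - prior))
    for p in posteriors:
        p = min(max(p, 1e-6), 1 - 1e-6)        # clip to avoid log(0)
        # each predictor contributes its log-odds relative to the prior
        log_odds += math.log(p / (1 - p)) - math.log(prior / (1 - prior))
    return 1 / (1 + math.exp(-log_odds))

def majority_vote(posteriors, cutoff=0.5):
    """Accept the reaction if most predictors score it above the cutoff."""
    return sum(p > cutoff for p in posteriors) > len(posteriors) / 2

# hypothetical outputs of the individual predictors for one candidate reaction
scores = [0.9, 0.7, 0.6, 0.4, 0.8]
combined = naive_bayes_combine(scores)
vote = majority_vote(scores)
```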
Cell Network Modelling System Using Petri Nets
Thomas Muscarello, DePaul University
A major challenge in the field of Integrative Systems Biology lies in the representation of cell (and subcell) network activity, and in utilizing that representation to build simulation systems that cover all network aspects and activity. Specifically, tools are needed to allow for ease in the simulation of mechanisms in the study of cellular function and activity, gene expression, protein synthesis, and drug intervention. The key to mapping the knowledge space of any biological domain into a simulation system lies in the choice of an appropriate representational formalism.
This presentation describes work that has been done to develop a Petri Net based simulation/modeling system, which allows for the representation of the structure, function, and causality expressed across the layers in biological systems (from the organism level down to the molecular/atomic).
The SFC representation is built using a hierarchical knowledge structure that follows the biological and chemical structures of the organism at all levels. It incorporates the structural (S) and functional (F) relationships of biology and physiology and is modeled after a biomedical researcher's mental models of genetic information and expression, pathophysiologic function, pathology, and intervention planning. It also captures the causality (C), which underlies changes in structure and function within and across levels of complexity.
This work has been developed and proven successful in research projects at Chicago medical centers, in dissertation projects at the University of Illinois at Chicago, and in expert treatment planning systems, over a period of 15 years.
Comparative and functional genomic analysis of a novel type of versatile ABC transporters with shared energizing modules
Dmitry Rodionov, Burnham Institute for Biomedical Research
Members of the ATP-binding cassette (ABC) transporter superfamily mediate translocation of diverse substrates across membranes in all organisms. Comparative genomics predicted the existence of a new and widespread class of microbial ABC importers, frequently of vitamins, that lack traditional, soluble substrate-binding domains. Instead, they have substrate-specific integral membrane components, unrelated to any classical ABC transporter domains, plus energizing modules comprising a conventional ATP binding cassette and an integral membrane domain. The predicted new transporters are of two types, in which the energizing module is either (a) dedicated to one substrate-specific component, or (b) shared by multiple substrate-specific components. Such shared use was shown experimentally by reconstituting lactobacterial thiamine- and folate-transport systems, and by demonstrating physical interaction between the energizing module and various substrate-specific components.
Genome-Wide Discovery of Missing Pathway Genes
Yong Chen1,3,4, Fenglou Mao1, Guojun Li1,3, and Ying Xu1,2, 1Dept. of Biochemistry and Molecular Biology, and 2Institute of Bioinformatics, University of Georgia; 3School of Mathematics and System Sciences, Shandong University, Jinan, Shandong, P.R China; 4School of Sciences, Jinan University, Jinan, Shandong, P.R.China
Currently, pathway reconstruction often reduces to mapping a known pathway in a well-studied organism such as E. coli onto a target organism; specifically, pathway genes are simply replaced with their orthologs to reconstruct the pathway in the target organism. With this category of methods, the newly constructed pathway will contain "missing genes" or "pathway holes" whenever an ortholog cannot be identified. Most ortholog identification methods (and their derivatives), such as COG and Pfam, use only homology information, which makes it impossible for them to fill pathway holes, let alone tackle the even more challenging task of recruiting new genes into an existing pathway.
We developed a new method that utilizes not only homology information but also operon information and phylogenetic profiles to identify missing genes in a pathway and to recruit new genes into it. 185 genomes were carefully selected based on their evolutionary relationships and genome sizes; operons were predicted for all the selected genomes, and homologies were calculated for every pair of homologous genes. A large graph is then constructed: the vertices are genes from the 185 genomes, and an edge is created between two genes if they are in the same operon or are homologs, with edge weights derived from how the edges were created. We call this graph the "reference graph". For the target genome, we do the same to add its genes to the reference graph. For a specific pathway P in the target genome, we assume that part of its genes are known (normally identified by ortholog-based methods). We then start from the known genes and calculate the shortest paths between these genes and all other genes in the target genome. Genes with shorter paths from the known genes are ranked higher as candidates for membership in pathway P.
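The ranking step reduces to a multi-source shortest-path computation on the weighted reference graph; a small self-contained Dijkstra sketch (gene names and edge weights hypothetical):

```python
import heapq

def dijkstra(graph, sources):
    """Shortest weighted distance from any source gene to every reachable gene."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# toy reference graph: edge weights encode operon / homology evidence strength
graph = {
    "known1": [("candA", 1.0), ("candB", 3.0)],
    "candA":  [("known1", 1.0), ("candC", 1.0)],
    "candB":  [("known1", 3.0)],
    "candC":  [("candA", 1.0)],
}
dist = dijkstra(graph, ["known1"])
# genes with shorter paths from the known pathway genes rank higher
ranked = sorted((g for g in dist if g != "known1"), key=dist.get)
```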
The KEGG pathways for E. coli are used to validate our method. Our method achieves a positive predictive value (PPV) of 60% in the top 10 candidates (out of 4131) when the number of genes in the reference pathway is five or more, and the PPV can reach up to 90% as the number of genes in the pathway increases. Parameter analysis shows that our method is very robust, and some of our negative predictions are validated by the most recent release of KEGG. Further analysis shows that many negative predictions often belong to the same other pathway, which reveals new insights into how pathways are defined.
Sepal Development in Arabidopsis: The Division of Space Hypothesis
Adrienne Roeder, Vijay Chickarmane, Elliot Meyerowitz, Division of Biology, California Institute of Technology
In the plant Arabidopsis thaliana, the sepal, which is the outermost green leaf-like floral organ, is covered with cells in a wide range of sizes. The largest cells reach a third of the length of the sepal while the smallest cells span only 1%. We are interested in the question of how this pattern of diverse cell sizes is generated during development. In all organisms, cell size is related to the number of copies of the chromosomes, or ploidy level, of the cell, which is controlled by the number of rounds of endoreduplication (growth and DNA replication without division) that the cell has undergone. In the sepal we find a distribution of ploidy levels from 2C to 16C. We have generated a multicellular model, in which cell division is driven autonomously, to test the hypothesis that the cell size distribution is generated by a timing mechanism. The earlier a cell stops dividing and starts endoreduplicating, the higher the ploidy level it can achieve and the larger it becomes. In the model, stochastic entry of the cells into endoreduplication at different time points is sufficient to recreate the pattern of cell sizes observed in the sepal. Furthermore, we can recapitulate the sepals of mutants with altered cell size distributions in silico by changing the probability of entry into endoreduplication in our model. We conclude that the pattern of cell sizes is generated through a division of space in the growing sepal between dividing and endoreduplicating cells.
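A toy stochastic simulation of the timing mechanism, assuming a fixed per-cycle probability of entering endoreduplication (parameter values are illustrative, not fitted to sepal data; the authors' multicellular model is far richer):

```python
import random

def simulate_sepal(n_start=10, cycles=3, p_endo=0.3, seed=0):
    """Each cycle, a dividing (2C) cell either divides or permanently
    switches to endoreduplication; endoreduplicating cells double their
    ploidy every cycle. Earlier entry => higher final ploidy."""
    rng = random.Random(seed)
    dividing, endo = [2] * n_start, []
    for _ in range(cycles):
        endo = [2 * c for c in endo]          # endocycle: replicate, no division
        next_dividing = []
        for c in dividing:
            if rng.random() < p_endo:
                endo.append(2 * c)            # stochastic entry into endocycle
            else:
                next_dividing.extend([c, c])  # normal division: two 2C daughters
        dividing = next_dividing
    return sorted(set(dividing + endo))       # distinct ploidy levels reached

ploidies = simulate_sepal()
```

With three cycles, the earliest entrants can reach 16C while late entrants stay at 4C and never-entering cells remain 2C, mirroring the 2C-16C distribution described in the abstract.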
Quantitative models for HER signaling pathways
Yi Zhang, Pacific Northwest National Lab
The human epidermal growth factor receptor (HER) is arguably the most important receptor family in the context of growth and development. Ligand binding quickly activates these receptors and induces receptor dimerization and phosphorylation. Activated receptors recruit signaling proteins and initiate downstream signaling pathways including Erk and PI3K/Akt activation. It has been hypothesized that the HER dimers have different signaling potencies and that dimer forms are the determinants of the signaling by the HER family. Using a library of cell lines co-expressing varying levels of EGFR (HER1), HER2 and HER3, we examined the temporal relationships between the activation levels of these receptors and the Erk and Akt phosphorylation. As it is not possible to measure the contribution of receptor dimers to Erk/Akt signaling directly, we constructed a kinetic model that describes receptor interactions and activation, and used it to predict the abundances of phosphorylated receptor dimers from the collected experimental data. These predictions were then defined as the input of our signal transduction module, with Erk and Akt phosphorylation levels as module output. Transfer functions were introduced to characterize the observed input and output relationships of the module. By examining the obtained transfer functions, we confirmed the hypothesis that receptor dimer types are indeed the determinants of downstream signaling. Our analysis also indicated that both Erk and Akt activation were driven by fast processes mediated by these receptor dimers.
Protein Structures and Complexes
Investigating the Phenotypic Impacts of Structural Properties of Collagen
Nina Gonzaludo, Biomedical Informatics, Stanford University
Collagen is the most abundant protein in animals and is a major structural component of the human body. Mutations in collagen can have potentially lethal consequences, such as Osteogenesis Imperfecta (OI), a genetic disease characterized by brittle bones. Although clinical criteria have been developed to diagnose collagen-related disorders and evaluate mutations, many of the connections between molecular structure and clinical phenotype are still being researched. Our research is based on the idea that the ability to assess the structural impact of non-synonymous mutations is important in understanding the basis for disease. Past research of type I collagen has revealed the potentials of using molecular dynamics simulation data in studying structural effects of mutations on the collagen molecule and their associated clinical phenotype. The acquisition of such data for the collagen molecule, combined with OI-related mutation data, drives our analysis. In our work, we identify and extract three structural properties of collagen from the data: intra-chain linearity, alpha carbon movement, and cross-sectional helix area. We then apply machine learning methods and determine that while these three features are weak predictors of lethality of mutation sites along the helix, at least one feature is significantly correlated with a lethal clinical phenotype. Finally, we develop a Java-based visualization tool that when used with PyMol, allows simultaneous viewing of multiple data types on the collagen structure. These empirical results, combined with the visualization tool, provide a solid foundation for the continued investigation of collagen structural properties and their effects on clinical phenotypes.
Molecular modeling of structural interaction between PfHslU and PfHslV subunits of P. falciparum and identification of interface residues:
Implications for targeting the protein-protein interface.
Sangeetha, International Centre for Genetic Engineering and Biotechnology
The PfHslUV system, a proteasome homolog of prokaryotic HslUV, has been identified as unique to P. falciparum. The formation of the complete complex machinery of PfHslUV is essential for the proteasome to carry out its biochemical and physiological role in the parasite. PfHslV, which is a potential drug target, exhibits increased proteolytic activity in the presence of PfHslU. Understanding the interactions between PfHslU and PfHslV can open new possibilities for targeting the proteolytic machinery of the parasite. In the present work, computational methods have been employed to simulate the interaction of PfHslU and PfHslV chains. The key residues participating in the protein-protein interface of the PfHslUV complex of P. falciparum have been identified using in silico methods, namely homology modeling, protein-protein docking and computational alanine scanning. We have generated a reliable three-dimensional PfHslUV complex model that shares protein-protein interface characteristics comparable to those of the crystal structures of prokaryotic HslUV. The binding free energy changes (Delta Gbind) calculated by computational alanine scanning have helped in the identification of key residues that may be crucial to PfHslU-PfHslV interactions. The existence of protein hot-spots and the presence of druggable pockets at macromolecular interfaces have opened up new possibilities for targeting the proteolytic machinery. Drug discovery strategies have mostly focused on identifying small molecules that inhibit catalytic sites; however, targeting the protein-protein interface could be a viable alternative. We propose that the protein-protein interface in PfHslUV can be an attractive target to disrupt the proteolytic machinery.
Properties of the Residue Contact Networks of Delaunay Tessellated Protein Structures
Todd J. Taylor, NIST
The authors have subjected several sets of real and simplified model protein structures to Delaunay tessellation. The system of contacts defined by residues joined with simplex edges in the tessellation can be thought of as a graph or network. Properties like the graph distances between residues and clustering coefficients can be computed to investigate the nature of such contact networks. Using metric multi-dimensional scaling, the dimensionality d of the space in which such networks live can also be computed from a matrix of all inter-residue graph distances. We find that protein contact networks are not strictly small world networks, as has previously been asserted, and the variation of d with simplex edge length cutoff gives a set of natural distance scales for proteins. These two results are related: d determines how the characteristic path length of the contact graph scales with the number of residues N.
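The contact-network measures can be sketched as follows, using a simple distance-cutoff network on a toy C-alpha trace as a stand-in for the Delaunay-derived edge set:

```python
import numpy as np
from collections import deque

def contact_network(coords, cutoff=8.0):
    """Adjacency list of residues closer than a distance cutoff (a simple
    stand-in for the edge set of a Delaunay tessellation)."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return {i: [j for j in range(n) if j != i and d[i, j] < cutoff]
            for i in range(n)}

def clustering_coefficient(adj, i):
    """Fraction of residue i's neighbor pairs that are themselves in contact."""
    nbrs = adj[i]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

def characteristic_path_length(adj):
    """Mean BFS graph distance over all reachable residue pairs."""
    total, count = 0, 0
    for s in adj:
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

# toy "C-alpha trace": residues spaced 3.8 A along a straight line
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(10)])
adj = contact_network(coords, cutoff=8.0)
L = characteristic_path_length(adj)
```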
Protein dynamics and network of dynamically-conserved residues
Suryani Lukman, University of Cambridge
Protein conformational changes and dynamics are important for biological function, such as enzymatic activity. Integrating normal mode analysis (NMA) with the analysis of residue-specific evolutionary information, using perturbation as a mimic of point mutation, is useful for understanding protein dynamics. NMA is based on the harmonic approximation to the potential energy function, in which the first derivative is zero at a minimum-energy conformation and terms higher than quadratic are ignored. NMA for large biomolecules can be performed with a simplified representation of the potential energy. We use an elastic network model of NMA to calculate pairwise interactions between all atoms within a cut-off distance. The analysis of residue-specific evolutionary information using perturbation as a mimic of point mutation is based on previous work on protein folding (Socolich et al., Nature 437, 512-518, 2005).
We adapt the method, together with NMA, to study conformational changes of our case protein, the maltose transporter. It is a member of the ATP-binding cassette (ABC) transporter family, involved in maltose/maltodextrin import. The maltose transporter comprises a periplasmic maltose-binding protein, two integral membrane proteins, MalF and MalG, and two copies of the cytoplasmic ABC protein MalK. MalK is an ATP hydrolase responsible for energy coupling to the transport system. We identify clusters of residues in MalK that are both highly conserved and dynamically important, as obtained from sequence analysis and NMA, respectively. This approach is potentially applicable to other proteins.
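The elastic-network step above, connecting all residue pairs within a cut-off distance, can be sketched as construction of the Kirchhoff (connectivity) matrix used in the Gaussian network model. The coordinates below are hypothetical points, not MalK atoms, and the 7.0 A cutoff is an assumed, commonly used value; the normal modes would then follow from an eigendecomposition of this matrix (e.g. with numpy.linalg.eigh):

```python
# Sketch: Kirchhoff matrix of an elastic network model from C-alpha
# coordinates. Off-diagonal entries are -1 for pairs within the cutoff;
# the diagonal holds the contact count, so every row sums to zero.
import math

CUTOFF = 7.0  # Angstroms; assumed typical cutoff, not from the abstract

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]


def kirchhoff(xyz, cutoff=CUTOFF):
    n = len(xyz)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(xyz[i], xyz[j]) <= cutoff:
                g[i][j] = g[j][i] = -1.0
    for i in range(n):
        g[i][i] = -sum(g[i])  # degree of residue i
    return g


G = kirchhoff(coords)
print([row[i] for i, row in enumerate(G)])  # per-residue contact counts
```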
Tracking DNA Routes by Basic Residues in Predicted Functional Regions of Protein Structures
Chien-Yu Chen, Dept. of Bio-industrial Mechatronics Engineering, National Taiwan University
DNA-binding proteins reveal their functions through specific or non-specific protein-DNA recognition. Identifying DNA-binding residues with computational tools facilitates validating protein functions at a high-throughput rate. The protein-DNA complexes available in the Protein Data Bank (PDB) further unveil how a DNA-binding protein recognizes its partners. Such information greatly helps biologists determine or predict binding elements in DNA sequences, such as transcription factor binding sites (TFBSs). In this way, accurate regulatory networks at the whole-genome scale can be constructed more efficiently in the near future. While it remains a challenging task to identify the exact mode of protein-DNA interaction without a crystal complex structure, this study proposes a simple way to identify the protein-DNA interacting surface based on predicted functional regions in protein structures. First, the functional regions of the query protein are predicted by MAGIIC-PRO (http://biominer.bime.ntu.edu.tw/magiicpro/), which employs sequential pattern mining to discover concurrent conserved regions among related protein sequences. Based on a benchmark of 62 protein-DNA complexes, it has been shown that the predicted sequential blocks usually cluster together in space and that more than 60% of the conserved regions provide physical contacts to DNA molecules. After the functional regions are predicted, the coordinates of basic residues within the conserved regions are extracted from existing structure files. Then, the proposed method conducts an exhaustive search over those basic residues to identify potential binding surfaces as well as the orientation of DNA molecules. Our preliminary results demonstrate that the proposed method can successfully identify the routes of DNA molecules on protein surfaces.
Catalytic site prediction using E1DS server
Ting-Ying Chien, Dept. of Computer Science and Information Engineering, National Taiwan University
E1DS (http://e1ds.csbb.ntu.edu.tw/) identifies potential catalytic residues from a protein sequence. The prediction is based on a pattern database in which each pattern is derived from a group of enzymes sharing the same 4-digit Enzyme Commission (EC) number. The patterns, called 1D signatures in our recently published paper, are constructed by exploiting the concurrent conservation observed within the functional sites of homologous proteins. Such sequence signatures are considered functional motifs and can be used to characterize novel protein sequences and to discover functionally important residues. Here, we use a dataset containing 367 annotated catalytic sites to demonstrate that the predictive power of E1DS is highly competitive with that of a structure-based approach named THEMATICS. The 367 test cases comprise the original 177 sites from the THEMATICS paper and an additional 190 sites randomly selected from the dataset used in our recent publication. Both E1DS and THEMATICS made predictions for all test cases except the nine that were excluded in the THEMATICS study. Sensitivity and specificity are used to measure the performance of both predictors, where sensitivity (specificity) is defined as the ratio of the number of correctly predicted (rejected) residues to the number of annotated catalytic (non-catalytic) residues. The sensitivity (specificity) of E1DS and THEMATICS is 34.7% (96.2%) and 36.1% (98.0%), respectively. Though E1DS does not outperform THEMATICS in terms of accuracy, the efficiency of a sequence-based predictor is still largely favorable when large-scale automated annotation is desired.
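The two performance measures defined above can be written out directly. The counts used in the example call are hypothetical, not the actual E1DS/THEMATICS benchmark tallies; they merely illustrate how a sensitivity near 34.7% and a specificity near 96.2% would arise:

```python
# Sketch: sensitivity and specificity as defined in the abstract.

def sensitivity(correctly_predicted, annotated_catalytic):
    """Correctly predicted catalytic residues / all annotated catalytic residues."""
    return correctly_predicted / annotated_catalytic


def specificity(correctly_rejected, annotated_noncatalytic):
    """Correctly rejected residues / all annotated non-catalytic residues."""
    return correctly_rejected / annotated_noncatalytic


# Hypothetical counts: 52 of 150 catalytic residues predicted,
# 4810 of 5000 non-catalytic residues rejected.
print(round(sensitivity(52, 150), 3), round(specificity(4810, 5000), 3))
```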
Comparative sequence analysis of chicken and yeast nucleosomal DNA:
Implications for linker histone binding
Feng Cui, Lab of Cell Biology, NCI, NIH
Linker histones (LHs) bind to the entry and exit points of nucleosomal DNA and protect approximately 20 bp of linker DNA. We sought to determine if any sequence features at the ends of nucleosomal DNA facilitate this LH binding. One would expect these features, if they exist, to be more pronounced in nucleosomes of species with abundant LHs, compared with species with few LHs. To test this hypothesis, we analyzed two sets of nucleosome core particle (NCP) sequences from chicken and yeast, two species with substantial differences in LH abundance: the ratio of LHs to nucleosomes in chicken is 1:1, whereas in yeast it is 1:4 or greater. The flanking sequences of the NCP fragments were extracted from the corresponding genomes and appended to the ends of the fragments. We analyzed the positioning of various AT-rich fragments, from the dimers to the 10-bp long A-tracts, along the 'extended' NCP sequences in the 'default' center-alignment and found that the known 10-11 bp periodic oscillation in the occurrence of the AT-rich elements goes beyond the end points of the yeast nucleosomes. By contrast, this oscillation is distorted in the chicken nucleosomes — the 'out-of-phase' peaks appear just at the NCP ends. More importantly, the observed difference in the positioning of AT-rich fragments is not sensitive to the way the sequences are centrally aligned, reflecting an inherent difference in sequence organization at the ends of nucleosomal DNA in the two species. Since LHs bind to the 'entry/exit' points of nucleosomes and exhibit a general preference for AT-rich DNA, such a difference may be associated with LH binding. We therefore propose a new structural model for LH binding to nucleosomes, postulating that the LH globular domain (GH1/GH5) recognizes the AT-rich fragments at the NCP ends. 
We suggest that the highly conserved non-polar 'wing' region of the GH1/GH5 domain (containing tetrapeptide GVGA) can favorably interact with the hydrophobic 'patches' in the major groove formed by the thymine methyl groups. Thus, DNA at the ends of chicken nucleosomes is likely to be deformed due to LH-DNA interactions, while in yeast, DNA at the 'entry/exit' points may follow its 'natural extension' trajectory because of the scarcity of the H1-analog protein Hho1p.
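The positional analysis described above, tracking where AT-rich elements fall along center-aligned nucleosomal sequences, can be sketched as a per-position dimer frequency profile. The two sequences below are toy examples, not chicken or yeast NCP data; a real analysis would run this over thousands of extended NCP sequences and inspect the ~10-11 bp oscillation in the resulting profile:

```python
# Sketch: fraction of aligned sequences carrying an AT-rich dimer
# (AA/TT/AT/TA) at each position.

AT_DIMERS = {"AA", "TT", "AT", "TA"}


def dimer_profile(seqs):
    """Per-position frequency of AT-rich dimers across aligned sequences."""
    length = min(len(s) for s in seqs) - 1  # number of dimer start positions
    counts = [0] * length
    for s in seqs:
        for i in range(length):
            if s[i:i + 2] in AT_DIMERS:
                counts[i] += 1
    return [c / len(seqs) for c in counts]


seqs = ["AATGCGCGAATG", "TTAGCGCGTTAG"]  # toy center-aligned sequences
profile = dimer_profile(seqs)
print(profile)
```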
Detection of Remote Protein Homology by Comparing Profile Hidden Markov Models for the Y-Family Polymerase Family
Wendy Lee*, Nancy Fong*, Magdalena Franco*, Sami Khuri, Robert Fowler, Department of Biological Sciences, San Jose State University
Profile hidden Markov models (HMMs) are widely used for detecting remote protein homology. Two popular profile HMM packages are HMMER and SAM-T06. In 2002, Madera and Gough compared HMMER and SAM-T99 and found that SAM-T99 performed better under their parameters. The most important factor in homology detection is the quality of the multiple sequence alignment: HMMER requires a manually refined multiple sequence alignment, while SAM-T06 creates this alignment automatically. In this study, we compared the two profile HMM methods using the default settings of both packages. Additionally, we utilized several programs to confirm the outputs generated by the two packages, ensuring that the outputs fall within the Y-family polymerase protein family. We found that HMMER performed better than SAM-T06 in many areas, such as the time needed to produce the outputs. The protein sequences returned by HMMER and SAM were used to query the protein family database Pfam; all of the proteins belong to the Y-family of polymerases. However, the scores and E-values for the HMMER proteins are better than those for the SAM proteins. To further confirm the results from HMMER and SAM-T06, we searched protein databases with several other sequences from the results, and all searches concurred that HMMER performed better. Although HMMER outperformed SAM for the Y-family polymerase family, other protein families may yield different results.
Sequence and Function Analysis of Immunoreceptor Tyrosine-Based Activation Motifs (ITAMs)
Marco D. Sorani, John G. Monroe, Genentech, Inc.
OBJECTIVE: Data support the ability of certain multi-subunit receptor complexes such as the B-cell antigen receptor to signal independently of ligand engagement. This "tonic signaling" has been shown to be dependent on immunoreceptor tyrosine-based activation motif (ITAM)-containing proteins associated with these receptors. ITAM motifs are highly degenerate, displaying a consensus sequence of [DE]-X(0,2)-Y-X-X-[LI]-X(6,12)-Y-X-X-[LI]. Here, we characterized the sequence preferences of ITAM motifs and the proteins that contain them.
METHODS: We performed a motif search using ScanProsite to identify proteins in Swiss-Prot that contain consensus ITAM motifs. We characterized these motifs in terms of the frequencies of residues at restricted positions, we identified features of sequence length and composition, and we analyzed Gene Ontology (GO) annotations for human membrane proteins.
RESULTS: We found 6333 hits in 6167 sequences, including 261 unique human proteins. Of the human proteins, 79 contained transmembrane domains, most of them predicted. The ITAMs identified most commonly contained Glu-Leu-Leu and least commonly contained Asp-Ile-Ile at the restricted positions (p<0.001). The variable X(6,12) spacer domain was most often 9 or 10 residues in length. Sequence analysis of the variable regions confirmed high degeneracy with the exception of frequent acidic residues at the third and third-to-last positions in the X(6,12) domain. The 79 membrane proteins had 218 GO annotations, many involved in cell signaling and B-cell or T-cell processes.
CONCLUSIONS: We characterized the nature of ITAM motifs and the proteins that contain them and found both to be more diverse than previously appreciated.
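The consensus pattern in the objective section above translates directly into a regular expression, which is essentially what a ScanProsite-style motif search performs. The peptide below is a made-up example, not one of the Swiss-Prot hits:

```python
# Sketch: scanning a sequence for the ITAM consensus
# [DE]-X(0,2)-Y-X-X-[LI]-X(6,12)-Y-X-X-[LI].
import re

# PROSITE-style pattern rendered as a regex; the lookahead lets finditer
# report overlapping candidate motifs.
ITAM = re.compile(r"(?=([DE].{0,2}Y..[LI].{6,12}Y..[LI]))")


def find_itams(seq):
    """Return (start, matched_motif) for each ITAM candidate in seq."""
    return [(m.start(), m.group(1)) for m in ITAM.finditer(seq)]


peptide = "GGDAGYTTLNSEQSAAVYSQLGGK"  # hypothetical example sequence
print(find_itams(peptide))
```

Here the variable X(6,12) spacer of the single match is 8 residues long, in the 9-10 residue neighborhood the abstract reports as most common.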