Identification and evolutionary analysis of alternative splicing events using cross-species EST-to-genome comparisons in human, mouse and rat [Comparative Genomics]
Feng-Chi Chen1, Chuang-Jong Chen1, Sheng-Shun Wang2, Jar-Yi Ho1, Wen-Hsiung Li1,3, and Trees-Juen Chuang1*
1 Genomics Research Center
2 Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
3 Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA.
Corresponding author: email@example.com
Alternative splicing (AS) is important for evolution and major biological functions in complex organisms. However, the extent of AS in mammals other than human and mouse is largely unknown, making it difficult to study AS evolution in mammals and its biomedical implications. To address this question, we developed an AS detection algorithm (ENACE) based on cross-species EST-to-genome comparisons. A genome-wide AS analysis in human, mouse, and rat using ENACE revealed 758 novel cassette-on exons and 167 novel retained introns; the ENACE-identified cassette-on exons were also novel exons. In RT-PCR-sequencing experiments, approximately 50-80% of the tested exons were validated, indicating that many ENACE predictions can be confirmed experimentally. Moreover, ENACE can distinguish conserved from lineage-specific AS events and can be applied to AS prediction in organisms for which EST information is limited. In an associated study, we also probed evolutionary forces in AS. We compared the synonymous (KS) and nonsynonymous (KA) substitution rates of alternatively spliced exons (ASEs) and constitutively spliced exons (CSEs) among human, mouse, and rat. We showed that ASEs have higher KA values and KA/KS ratios but lower KS values than CSEs regardless of different molecular clocks. With reference to the substitution rate in introns, we demonstrated that the KS values in ASEs are close to neutral while those of CSEs are accelerated. The elevated synonymous rate in CSEs is not related to CpG dinucleotides or low-complexity regions of protein. Our results indicated that CSEs and ASEs are subject to different evolutionary forces.
A likelihood ratio test to identify fast-evolving sites in protein sequences, and its application to phylogenomic analysis [Comparative Genomics]
Yu Liu, Franz B. Lang
Département de biochimie, Université de Montréal, Montréal, Québec, Canada
Corresponding author: firstname.lastname@example.org
A common artifact in phylogenetic analyses is Long Branch Attraction (LBA), which leads to the regrouping of species with high evolutionary rates irrespective of their true phylogenetic position. Likelihood inference is least sensitive to this problem, but doesn’t eliminate it. Also the use of large datasets won’t prevent LBA artifacts as phylogenetic signal and LBA increase concomitantly. Here we present a method to reduce LBA based on the elimination of fast-evolving sites by a Likelihood Ratio Test (LRT). These sites contain little if any phylogenetic signal. In our procedure, they are eliminated only in those subgroups that appear to be affected by LBA.
A previously published dataset is used to demonstrate the potential of our method. It contains members of major metazoan lineages (Arthropoda, Choanoflagellata, Deuterostomia, Nematoda, and Platyhelminthes), and Fungi as an outgroup. The analysis of the original dataset incorrectly regroups the fast-evolving Nematoda with Platyhelminthes, with 90% bootstrap support. After removing fast-evolving sites in these two groups, the support for Nematoda + Arthropoda (the expected ‘true’ topology) increases up to ~ 97%.
In summary, phylogenomic analyses may be positively misleading when the evolutionary rates of species vary considerably across a phylogenetic tree. To increase the ratio of phylogenetic signal to misleading noise, unreliable information should be removed from the dataset. We present a method that efficiently removes fast-evolving sites. Because only unreliable data are removed, and only for those species with known high evolutionary rates, the phylogenetic resolution potential of the dataset remains virtually unchanged.
Virtual-CGH: Prediction of Novel Regions of DNA Segmental Alterations from Microarray Gene Expression in Natural Killer Cell Lymphoma [Comparative Genomics]
Huimin Geng1,2, Javeed Iqbal1, Wing C. Chan1 and Hesham Ali2
1Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE, USA and 2Department of Computer Science, University of Nebraska at Omaha, NE, USA.
Corresponding author: email@example.com
Natural Killer (NK) cell lymphomas/leukemias are highly aggressive lymphoid malignancies with poor prognosis. Because these malignancies were not well characterized until recently, the identification of genetic alterations using whole-genome array Comparative Genomic Hybridization (CGH) would provide important insights into the mechanisms of NK lymphomagenesis. We have performed high-resolution array CGH on seven cell lines and a number of samples of NK-cell lymphoma. By aligning all regions of gains and losses from multiple cell lines and samples, we have identified several minimal common regions of gains and losses which may contain potential oncogenes and tumor suppressors. Because high-resolution array CGH is very expensive and requires a separate sample, we developed a computational method for predicting gains and losses of DNA segments based on gene expression profiles, and call this method a virtual-CGH predictor. Virtual-CGH is performed through a novel multiple-criteria clustering algorithm in which both the expression of genes (up or down and the fold difference) and chromosomal locations of the genes are used as the clustering criteria. We propose to use virtual-CGH to guide further experiments, or to combine it with experimental array CGH using a Bayesian framework, to obtain reliable and accurate minimal recurrent regions of DNA. Compared with other gene selection methods such as fold-change, FDR and SAM, our preliminary results showed improved sensitivity and specificity in predicting chromosomal gain and loss regions containing oncogenes and tumor suppressor genes.
Regulatory Element Prediction in Mammalian Genomes [Comparative Genomics]
Bilenky M.1*, Robertson G.1, Dagpinar M.1, He A.1, Lin K.1, Yuen, W.1, Bainbridge M.1, Varhol R.1, Teague K.1, Griffith O.L.1, Zhang X.1, Pan, Y.1, Quayle A.1, Hassel M.1, Sleumer M.C.1, Pan, W.1, Pleasance E.D.1, Chuang, M.1, Hao H.1, Li Y.Y.1, Tsang E.1
1 Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada 2 Department of Computing Science, University of Alberta, Edmonton, AB, Canada
Corresponding author: firstname.lastname@example.org
We have applied comparative genomics approaches to understanding the underlying mechanisms in temporal and spatial gene expression, which remains a fundamental goal in genomics and molecular biology. As much of the specificity of gene expression can be assigned to how proteins (transcription factors) bind to regulatory DNA sequences and facilitate or repress the assembly of the transcriptional machinery, we have developed a genome-scale method to identify conserved DNA sequence motifs and the subset of these motifs that are transcription factor binding sites (TFBS).
The pipeline utilizes multiple published motif discovery methods and a novel calculation of motif significance through the creation of random sequences generated under rules of neutral evolution. The pipeline runs on a compute cluster of over 400 nodes and takes approximately 3 weeks to complete a full run on a mammalian genome.
The pipeline has discovered over 200,000 putative regulatory motifs for more than 18,000 human genes. Regulatory element prediction has also been performed for mouse and rat. Results are available in the cisRED web database (www.cisred.org), the UCSC genome browser, and as a native data type in the Ensembl genome browser. Motifs are annotated with experimentally known TFBS, and are filtered by genome-scale properties like similarity and co-occurrence. The predicted regulatory modules are found to correlate with large scale expression, protein-protein interaction (PPI) data and the Gene Ontology at levels above that of randomly constructed modules.
New tools for large-scale genome comparison analysis [Comparative Genomics]
Heidi J. Sofia, Grant C. Nakamura, Joel M. Malard, and Chris S. Oehmen
Pacific Northwest National Lab, Richland, WA
Corresponding author: Heidi.Sofia@pnl.gov
High throughput sequencing is rapidly producing complete genomes for a long list of organisms. Currently there are over 400 microbial genomes and this number is likely to increase to 10,000 within just a few years. The rapid growth of genomic data is outpacing the ability of scientists to analyze it. The scale of the data is a significant problem but it carries within it the seed of a solution. New genome comparison methods are being developed to detect complex patterns in genome data, such as positional clustering, phylogenetic profiles, and regulatory sites. With larger data, these methods can potentially gain sensitivity and resolution. We are developing genome comparison tools designed for parallel, high performance computing. The basic engine for this system is the ScalaBLAST software, which provides massively parallel BLAST searches in a truly scaleable fashion. At the other end is the Similarity Box software implemented in Java, which provides biologists a powerful interface for exploring results using interactive visualization. Similarity Box is an analysis tool as well and has enabled us to design new genome comparison methods with improved accuracy and reliability. Currently we are implementing these methods for high performance computing to provide the middle tier of our system. One key element in this system is a parallel hierarchical clustering algorithm which can be run on both shared memory machines and clusters. We are applying these methods to metagenomics data which provides a snapshot of microbial communities in the environment.
Detecting uber-operons in microbes [Comparative Genomics]
Fenglou Mao, Guojun Li, Dongsheng Che, Hongwei Wu, Ying Xu
University of Georgia
Corresponding author: email@example.com
We present a study on computational identification of uber-operons in a prokaryotic genome, each of which represents a group of operons that are evolutionarily or functionally associated through operons in other (reference) genomes. Uber-operons represent a rich set of footprints of operon evolution, whose full utilization could lead to new and more powerful tools for elucidation of biological pathways and networks than operons have provided, and to a better understanding of prokaryotic genome structures and evolution. Our algorithm predicts uber-operons by identifying groups of functionally or transcriptionally related operons. Using this algorithm, we have predicted uber-operons for each of a group of 91 genomes, using the other 90 genomes as references. In particular, we predicted 158 uber-operons in E. coli K12, and found that many of the uber-operons correspond to parts of known regulons or biological pathways or are involved in highly related biological processes based on their Gene Ontology assignments. For some of the predicted uber-operons that are not parts of known regulons or pathways, our analyses indicate that their genes are highly likely to work together in the same biological processes, suggesting the possibility of new regulons and pathways. We believe that our uber-operon prediction provides a highly useful capability and a rich information source for elucidation of complex biological processes such as pathways in microbes. All the prediction results are available at our Uber-Operon Database: http://csbl.bmb.uga.edu/uber, the first of its kind.
Operon conservation and its function diversity [Comparative Genomics]
Ping Wan, Fenglou Mao, Victor Olman, Ying Xu
University of Georgia, Capital Normal University
Corresponding author: firstname.lastname@example.org
Operons are important structural units in prokaryotic genomes. Many studies have focused on the evolution of single operons and found that the majority of operons are not conserved during evolution, while some operons are highly conserved across many genomes. The questions we intend to address are (a) why are some operons conserved while others are not, and (b) what are the intrinsic mechanisms and evolutionary rules behind these observations? These questions remain unanswered, and no thorough study has been conducted in this area. We have observed that there may be a statistically significant relationship between the conservation of an operon and the diversity of the biological function(s) it carries. For example, the ATP operon is more conserved among prokaryotic genomes than the fts operon; the fts operon is involved in five pathways, while the ATP operon is involved in only one. Another example is the ribosome operon, a large operon with 11 genes, all of which are involved in only one pathway. In this study, we present a method to measure the conservation of an operon among a large group of genomes, which we have used to calculate the conservation of 57 operons of E. coli K12 across 224 prokaryotic genomes. We have also calculated the functional diversity of these operons (in terms of the number of KEGG pathways in which each is involved). We found that, generally, the more conserved an operon is, the less diversified its function is.
Hierarchical Clustering of Bacterial Genes for Functional Annotations at Multiple Resolution Levels [Comparative Genomics]
Hongwei Wu, Fenglou Mao, Victor Olman, Ying Xu
University of Georgia
Corresponding author: email@example.com
We develop a hierarchical clustering system of homologous genes (HCG) for prokaryotic genomes by (1) integrating the multi-dimensional information provided by the Smith-Waterman algorithm to obtain a comprehensive measure that better quantifies the sequence similarity between genes; (2) incorporating two different types of information into the homology measure, one about genes’ sequence similarity and the other about genes’ functional relevance; and (3) applying a minimum spanning tree-based clustering algorithm to the graph representations of genes and their homology relationship measures to obtain a hierarchical system of clusters of homologous genes. We analyze the HCG system by referring to the Clusters of Orthologous Groups of proteins (COG) system and the hierarchical system of taxonomic lineages of prokaryotic genomes. Through the comparisons, we demonstrate that (1) the information that is conveyable via the COG system, including the clusterability of homologous genes and functional difference between different COGs, can also be faithfully conveyed via the HCG system; (2) the functional commonality shared by the genes belonging to the same HCG cluster tends to be more and more specific; and (3) the HCG system and the hierarchy of taxonomic lineages of prokaryotic genomes are consistent to a certain extent. The HCG system makes it feasible to infer genes’ functional rules at different resolution levels; and, when combined with a more accurate phylogenetic/taxonomic model of prokaryotic genomes, it can also be used to establish accurate correspondence between genes for studies that trace the evolutionary history of genes and/or genomes.
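As a rough illustration of step (3) above, the following sketch builds a minimum spanning tree over a small gene graph and then cuts it at two distance thresholds to obtain clusters at two resolution levels. It is a minimal sketch under stated assumptions, not the HCG implementation: the function names, the toy distances (imagined here as one minus a homology score), and the thresholds are all hypothetical.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def minimum_spanning_tree(n_genes, edges):
    """Kruskal's algorithm; edges are (distance, gene_u, gene_v) tuples."""
    parent = list(range(n_genes))
    tree = []
    for dist, u, v in sorted(edges):
        root_u, root_v = _find(parent, u), _find(parent, v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((dist, u, v))
    return tree


def clusters_at(n_genes, tree, threshold):
    """Cut tree edges longer than `threshold`; return the remaining components."""
    parent = list(range(n_genes))
    for dist, u, v in tree:
        if dist <= threshold:
            parent[_find(parent, u)] = _find(parent, v)
    groups = {}
    for gene in range(n_genes):
        groups.setdefault(_find(parent, gene), []).append(gene)
    return sorted(groups.values())


# Hypothetical distances (assumed: 1 - homology score) between five genes.
toy_edges = [(0.10, 0, 1), (0.20, 1, 2), (0.30, 0, 2),
             (0.70, 2, 3), (0.15, 3, 4), (0.90, 0, 4)]
toy_tree = minimum_spanning_tree(5, toy_edges)

# A loose and a strict threshold give two levels of the hierarchy.
coarse = clusters_at(5, toy_tree, 0.5)
fine = clusters_at(5, toy_tree, 0.12)
```

Sweeping the threshold from large to small yields nested clusters, which is one simple way a spanning tree can produce a hierarchy.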
Classifier Fusion for Poorly-Differentiated Tumor Classification using Both Messenger RNA and MicroRNA Expression Profiles [Data Integration]
Yuhang Wang (a), Margaret H. Dunham (a), James A. Waddle (b), Monnie McGee (c)
(a) Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75275; (b) Department of Biology, Southern Methodist University, Dallas, Texas 75275; (c) Department of Statistical Science, Southern Methodist University, Da
Corresponding author: firstname.lastname@example.org
MicroRNAs (miRNAs) are an important class of small non-coding RNAs that regulate diverse biological processes. MiRNAs are thought to regulate gene expression by degrading or repressing target messenger RNAs (mRNAs) at the post-transcriptional level. Recent studies suggest that miRNAs are implicated in human cancers. In a recent Letter to Nature, Lu et al. showed that the expression profile of 217 mammalian miRNAs could be used to classify poorly differentiated tumor samples with an accuracy of 70.6%, whereas the same classifier using mRNA profiles reached an accuracy of only 5.9%. Because miRNAs regulate gene expression at the post-transcriptional level, we hypothesize that miRNA expression profiles can provide information that is complementary to mRNA expression profiles. Therefore, a data fusion approach could lead to improved classification accuracy. As a proof of concept, we re-analyzed the data in the paper by Lu et al. using a classifier fusion approach that utilizes both mRNA and miRNA expression data. We built a meta-classifier from two bagged k-nearest-neighbor classifiers. Experimental results showed that our meta-classifier was able to classify the same set of poorly differentiated tumor samples with an improved accuracy of 76.5%, when trained only with the expression profiles of more-differentiated tumor samples.
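The idea of fusing two bagged k-nearest-neighbor classifiers can be sketched as follows. This is only an illustration: the abstract does not state how the two classifiers are combined, so pooling the bootstrap votes from the mRNA and miRNA views is an assumption, and every function name and default parameter below is hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """train: list of (feature_vector, label); majority label of the k nearest."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def bagged_knn_votes(train, query, k, n_bags, rng):
    """Train kNN on bootstrap resamples and count the label each one predicts."""
    votes = Counter()
    for _ in range(n_bags):
        sample = [rng.choice(train) for _ in train]  # sample with replacement
        votes[knn_predict(sample, query, k)] += 1
    return votes


def fused_predict(mrna_train, mirna_train, mrna_query, mirna_query,
                  k=3, n_bags=25, seed=0):
    """Assumed fusion rule: pool the votes of both bagged classifiers."""
    rng = random.Random(seed)
    votes = bagged_knn_votes(mrna_train, mrna_query, k, n_bags, rng)
    votes += bagged_knn_votes(mirna_train, mirna_query, k, n_bags, rng)
    return votes.most_common(1)[0][0]
```

Each sample is described twice, once per data type, and the two training lists must carry the same labels for the same samples. Other fusion rules, such as weighting one view more heavily, would fit the same structure.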
Modeling and storing scientific protocols [Data Integration]
Zoe Lacroix, Yi Chen, and Natalia Kwasnikowska
Arizona State University, Hasselt University, and transnational University of Limburg, Belgium
Corresponding author: email@example.com
Scientific discovery relies on the adequate expression, execution, and analysis of scientific protocols. Although datasets are properly stored, the protocols themselves are often recorded on paper or remain in digital form only as the scripts developed to implement them. Once the scientist or bioinformatician who implemented a scientific protocol leaves the laboratory, the record of that protocol may be lost. Collected datasets without a description of the process followed to produce them may become meaningless. Moreover, to support scientific discovery, anyone should be able to reproduce the experiment. A detailed description of the protocol is necessary to reproduce the experiment; therefore, the protocol together with the collected datasets constitutes the complete description of an experiment.
We present an abstract model for scientific protocols, where several atomic operators are proposed for protocol definition and composition. We distinguish two different layers associated with scientific protocols: design and implementation, and discuss the mapping among them. We illustrate the protocol model with representative examples. Then we discuss the opportunities and challenges of realizing such a protocol model in several classical database models, including relational, nested relational, XML-based and object-oriented model. We demonstrate the approach with an XML database.
Our approach benefits scientists by allowing the archiving of all scientific protocols with the collected datasets to constitute a scientific portfolio for the laboratory to query, compare and revise protocols, and express cross protocol-data queries.
Querying Consolidated miRNA data: A Comparative Study [Data Integration]
S. Lopez, Z. Chen
University of Nebraska at Omaha
Corresponding author: firstname.lastname@example.org
Non-coding RNA (ncRNA) genes produce RNA that does not encode a protein but instead produces a strand of RNA that performs regulatory and control processes.
Even within the sub-subject of non-coding RNA (ncRNA), the diversity displayed is immense. Different types of ncRNA have been identified, each with a unique role, including microRNA (miRNA), rRNA, siRNA, etc. Several database projects have been conducted in the bioinformatics community, storing ncRNA or more specific types of ncRNA, such as miRNA. However, although the importance of building such databases is well understood, a systematic study of the related methodological aspects is still lacking. Recently we have investigated data modeling issues in consolidating miRNA data from different resources (such as ASRP and the miRNA Base) into a repository. We have constructed databases based on three modeling approaches, namely, E-R based relational, XML shredded relational and eXist NXD (Native XML Database). In addition, a number of queries were conducted to compare the performance of these different models. For the same original source data, we measured several factors such as overall storage size, speed for query answering, etc., for each implementation. Based on our experiments, we conclude that a native XML database does the best job of providing geneticists with a flexible and extensible storage solution for sequence data. We have also obtained other useful observations. Since efficient query processing and data mining tasks depend on effective data modeling, the experience and lessons learned from this research provide important insight for our future research.
Benefits and costs of bioinformatics pipelines : an experiment [Data Integration]
B. A. Eckman, T. Gaasterland, Z. Lacroix, L. Raschid, B. Snyder, and M.-E. Vidal
IBM Healthcare and Life Science, University of California San Diego, Arizona State University, University of Maryland, Universidad Simon Bolivar
Corresponding author: email@example.com
The implementation of a bioinformatics pipeline (BIP) raises multiple challenges. The first is the selection of resources (data sources and applications) to collect information and construct datasets. Different sources may provide similar information, but the number of collected entries and the quality of their characterization may differ significantly. Two applications may perform the same service, but with different execution cost. Furthermore, the design and implementation of a BIP is an iterative task: a scientist typically designs a succession of steps, selects the resources to implement each step, selects some parameters, implements the steps (often using a script), evaluates the BIP, then examines the results, and modifies the BIP.
We report on our experiences in implementing a framework that allows the evaluation of alternative BIP solutions, and the costs and benefits of each solution for a BIP to detect alternative splicing of an organism’s genes, based on the IBM Websphere Information Integrator platform. We consider several options to select candidate transcripts from Entrez Nucleotide to populate the pipeline and we report on the costs for downloading the transcripts and aligning to the genome. For each choice, we report on the costs and benefits in clustering the transcripts to find alternate splice forms. The benefits metrics include the percentage of transcripts that are successfully aligned, metrics that describe the number of clusters, and variant clusters (with alternative splice forms). In addition to the valuable results obtained from this experiment, our approach is generic and may be used to evaluate multiple BIPs.
Pathway Knowledge Base: Integrating and querying pathways using BioPAX and RDF [Data Integration]
Kyle Bruck, Nikesh Kotecha, and William Lu
Department of Biomedical Informatics, Stanford University, Stanford CA
Corresponding author: firstname.lastname@example.org
The role of proteins and their function in pathways is crucial to understanding complex biological processes and their failures that lead to disease. With over 200 pathway databases in existence, it is not possible for biologists to examine a pathway in all of them. The emergence and adoption of Biological Pathways Exchange (BioPAX), a standardized format for pathway representation, provides a unique opportunity to integrate knowledge from multiple pathway databases. We have taken advantage of this opportunity to create the Pathway Knowledge Base (PKB). PKB integrates biological pathway data from KEGG, BioCyc, and Reactome over the set of species: human, E. coli, and yeast. Users can now query for proteins, pathways and reactions across multiple species and pathway databases.
This poster outlines our approach for integrating and querying pathways using BioPAX and RDF. We discuss the benefits and limitations of using RDF and evaluate two approaches for storing and querying RDF data: Oracle's RDF store and a Sesame layer built on MySQL.
The Pathway Knowledge Base is accessible via the following URL: http://pkb.stanford.edu
Information Mining over Heterogeneous Microarray and Clinical Data [Data Mining]
Fatih Altiparmak(1), Ozgur Ozturk(1), Selnur Erdal(1), Hakan Ferhatosmanoglu(1), Donald C. Trost(2)
(1)The Ohio State University, Columbus, OH (2)Pfizer Global Research and Development
Corresponding author: email@example.com
Current time series data mining methods generally assume that the series are statistically sufficiently long, collected at equal-length time intervals, and/or extend over equal-length time periods. However, these assumptions are not valid for many real data sets, e.g., biomedical databases. We propose a mining framework to gather high-quality information from heterogeneous and high-dimensional time series data. The framework has two steps: (1) significant and homogeneous subsets of data (e.g., data generated by similar sources) are selected and analyzed using the mining algorithm of interest; (2) the information gathered in the first step is joined by identifying common (or distinctive) patterns. We applied the proposed framework to two important classes of biomedical data applications: Clinical Trials and Microarray Gene Expressions. In the first application, the time series of blood ingredients (analytes) of each patient were clustered as the first step. The common patterns across the clusters were then identified as highly correlated analyte groups and validated by experts. The patterns can also be utilized to identify a global panel of analytes, which contains a member from each biological group. In the latter application, the time series of gene expression from heterogeneous sources of microarray data and/or the results of different distance metrics on the same dataset are grouped, and the common patterns over these clusters are mined to extract strong rules for gene expression. The quality of the results improves with the number of data sets and/or metrics used for mining.
In silico identification of mitochondrial proteins using EST data [Data Mining]
Yaoqing Shen, Gertraud Burger
Biochemistry Department, University of Montreal
Corresponding author: firstname.lastname@example.org
Knowledge about the makeup of the mitochondrial proteome from primitive eukaryotes is paramount for the understanding of mitochondrial function and evolution. The newly generated 47,000 Expressed Sequence Tags (ESTs) from jakobids provide a rich source for in silico inference of mitochondrial proteins (mit-proteins). Since the currently available bioinformatics tools do not perform well in the prediction of a protein’s subcellular localization based on ESTs, a new predictor was developed for this purpose. The EST-derived proteins from Arabidopsis and other plants with experimentally verified localization are used as training data, due to their phylogenetic distances to jakobids. The physicochemical properties and up to 6-order amino acid composition (the frequency of 1 to 6 consecutive amino acids) are used to encode the training ESTs. By encoding the EST-derived proteins with 3-order amino acid composition, and using a Support Vector Machine (SVM) as the computational method, the new predictor can identify mit-proteins deduced from ESTs with high accuracy: 87% for Arabidopsis ESTs, 71% for all tested plants and 81% for jakobid ESTs. The prediction does not rely on sequence similarity or on the existence of an N-terminal targeting peptide. Therefore, it could be applied to incomplete sequences such as cDNA or genomic sequences with an ambiguous N-terminus. Future work will focus on interpreting the biological reason behind the SVM prediction.
Feature filtering to improve predictions of RNA interference activities by support vector machine regression [Data Mining]
Andrew S. Peek
Integrated DNA Technologies, Inc
Corresponding author: APeek@idtdna.com
RNA interference is a naturally occurring phenomenon that results in the suppression of a target RNA sequence by several possible pathways. To further dissect the factors that result in effective RNA interference sequences, a regression kernel Support Vector Machine (SVM) approach was used to quantitatively model RNA interference activities. Eight feature mapping methods are compared in their ability to build SVM regression models that predict published siRNA activities. Six feature types come from the guide strand: (1) position-specific base composition, (2) thermodynamics, (3) Shannon entropy, (4) secondary structure, (5) motif composition, and (6) sequence-structure features. Two feature types come from the target strand: (7) secondary structure and (8) multiple guide strand binding location energetics. Combining these methods can yield thousands of modeling features. The primary factors in predictive SVM models are position-specific nucleotides. Secondarily important are motifs and thermodynamics. Finally, the least contributory factors, but still predictive of efficacy, are measures of intramolecular guide strand and target strand secondary structures, where the 5’ most base of the guide strand is most informative. The large number of potential features can be reduced by feature filtering methods, resulting in improvements over models developed from all features. For example, using 525 features from correlation filtering, an SVM regression method can yield predictive models with correlation squared (R2) values between predicted and observed activities of R2=0.715 overall and R2=0.501 in 10-fold cross-validation. C++ software to perform these analyses is available under the Creative Commons license.
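To make the correlation filtering and the reported R2 statistic concrete, the sketch below ranks features by the absolute Pearson correlation between each feature and the observed activity, keeps the top few, and computes R2 as the squared correlation between predicted and observed values. It is an illustration only and is unrelated to the C++ software mentioned above; the exact filtering criterion used in the study is not given in the abstract, and all names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_filter(features, activity, keep):
    """features: dict of name -> values; keep the `keep` names with largest |r|."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], activity)),
        reverse=True,
    )
    return ranked[:keep]


def r_squared(predicted, observed):
    """Correlation squared between predicted and observed activities."""
    return pearson(predicted, observed) ** 2
```

In a cross-validated setting the filter would need to be run inside each training fold; filtering on the full data set first would leak information into the held-out fold.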
Web-Based Tools for Gene Comparison and Clustering [Data Mining]
James Z. Wang, Rapeeporn Payattakool, Chin-fu Chen
Corresponding author: email@example.com
Microarray and other genomic-scale studies usually involve a large number of genes and gene products. Although many tools allow scientists to obtain gene annotation information from public genomic databases such as NCBI and Gene Ontology (GO) and to use this information to understand the global patterns of gene-gene interaction and biological pathways, there is no efficient tool for measuring the similarity of genes based on such existing information. In this poster, we investigate the complex problem of measuring the similarity of GO terms and, in turn, of determining the similarity of genes based on the similarity of their associated GO terms and other information from heterogeneous data sources. Specifically, we propose a novel method for encoding the semantics of GO terms based on their inter-relationships (“is-a” or “part-of”) defined by the Directed Acyclic Graph (DAG), thereby converting the descriptive semantics into measurable numeric values. Based on the similarity measurement of GO terms, a new algorithm for measuring the similarity of genes is derived. Using this gene comparison algorithm, a gene clustering algorithm is developed to discover similarity patterns in a group of genes. Based on these algorithms and methods, web-based tools, comprising a GO term comparison tool, a gene comparison tool, and a gene clustering tool, are designed and implemented. The results of clustering a group of well-known genes using our gene clustering tool are consistent with the results of manually clustering them based on their biological functions.
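One way to turn “is-a” and “part-of” relationships in a DAG into numeric values is sketched below. The abstract does not give the encoding, so this should be read as one plausible scheme rather than the authors' method: each ancestor of a term receives a contribution equal to the best product of edge weights along a path from the term, and two terms are compared through the contributions of their shared ancestors. The edge weights and the term names in any usage are hypothetical.

```python
def ancestor_contributions(term, parents, weights):
    """Map the term and each ancestor to its best path-product contribution.

    parents: dict of term -> list of (parent_term, relation) pairs.
    weights: dict of relation -> weight in (0, 1); the values are assumptions.
    """
    values = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            candidate = values[child] * weights[relation]
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate  # found a stronger path
                stack.append(parent)
    return values


def term_similarity(term_a, term_b, parents, weights):
    """Shared-ancestor contributions divided by the total for both terms."""
    contrib_a = ancestor_contributions(term_a, parents, weights)
    contrib_b = ancestor_contributions(term_b, parents, weights)
    shared = set(contrib_a) & set(contrib_b)
    numerator = sum(contrib_a[t] + contrib_b[t] for t in shared)
    return numerator / (sum(contrib_a.values()) + sum(contrib_b.values()))
```

Under this scheme a term compared with itself scores 1, and two terms that share only distant ancestors score close to 0. Gene-level similarity could then be built by aggregating term-level scores over the GO terms annotated to each gene.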
A Two-Phase Hybrid Method for Biological Sequence Clustering [Data Mining]
Wei-Bang Chen, Chengcui Zhang
Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL 35294, USA
Corresponding author: firstname.lastname@example.org
We propose a two-phase hybrid method for biological sequence clustering, which combines the strengths of hierarchical methods and partitioning methods. In Phase I, the proposed method uses a hierarchical clustering algorithm to pre-cluster the aligned sequences. Phase II takes the pre-clustered result as the initial partition for a k-means clustering method based on profile Hidden Markov Models (HMMs). In contrast to random initial partitions, the initial partitions generated by hierarchical clustering avoid the inconsistency problem of partitioning methods (e.g. k-means) that start from random partitions. In addition, the inaccuracy of hierarchical agglomerative clustering methods can be compensated by the profile HMM based k-means clustering, since the latter is model-based and can better describe the dynamic properties of the data in a cluster. The sequence dataset used in our experiments contains 429 protein sequences from 65 families of cytochrome P450. The performance of the proposed hybrid method was compared with that of the hierarchical method and the k-means partitioning method, evaluated by the F-measure. The average F-measure of the hybrid method is 0.723, while the average F-measure values of the hierarchical method and the k-means method are 0.621 and 0.291, respectively. The results demonstrate that pre-clustering the sequences with the hierarchical method can greatly improve the accuracy and the robustness of the partitioning method. Furthermore, the inaccuracy of the hierarchical method can be compensated by the model-based partitioning method.
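The abstract does not say which variant of the clustering F-measure was used. The sketch below assumes one common definition: each true family is matched to the cluster that gives it the best F-score, and the per-family scores are averaged with weights proportional to family size. The function name and the toy labels are illustrative.

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure (one common definition)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue
            precision = shared / n_clu  # fraction of the cluster from this family
            recall = shared / n_cls     # fraction of the family in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_cls / n * best
    return total

families = ["a", "a", "b", "b"]
perfect = clustering_f_measure(families, [0, 0, 1, 1])
merged = clustering_f_measure(families, [0, 0, 0, 0])
```

A clustering that recovers every family scores 1.0, while lumping all sequences into one cluster is penalized through precision.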
Learning SNP Dependencies using Embedded Bayesian Networks [Data Mining]
Ara V. Nefian
Corresponding author: email@example.com
This paper describes an efficient, scalable method for learning nucleotide dependencies around SNP sites using Bayesian networks. The complexity of the network is reduced by introducing a set of hidden nodes that do not appear in the original model. The resulting model is an embedded Bayesian network that encodes both local and global dependencies between the observed variables and allows for structure and parameter learning on parallel machines from distributed data. The results of the proposed learning method are reported on both synthetic data and DNA sequences of the human
FOCUS Discovery: An Asynchronous, Parallel, Non-Linear Software Program for Gene Association Studies [Data Mining]
Matthew Frome, Gerri Shaw and Jim Shaw
FOCUS Biology, Inc.
Corresponding author: firstname.lastname@example.org
Accumulating evidence indicates that epistasis, or complex gene-gene interaction through nonlinear mechanisms, plays a significant role in both an individual's susceptibility to common diseases and the individual's response to pharmaceutical drugs. In light of recent advancements in genotyping technology, investigating these interactions in the context of whole genome SNP association studies poses formidable discovery and computational challenges for which there is no current solution. To address this problem, we have developed FOCUS Discovery, an asynchronous, parallel, non-linear software solution to identify these interactions. In this application we have implemented a three-step analysis process, where each step reduces the impact of the “curse of dimensionality” by identifying the most likely SNPs of interest. These steps comprise the Substantial SNP Evaluator, Epistatic SNP Evaluator, and Comprehensive SNP Associator engines, and each step is implemented as a highly efficient, asynchronous parallel algorithm. Results of an analysis of a SNP dataset will be shown, outlining processing performance, analytical methods, and results identifying interesting SNP/phenotypic complexes.
Phylo-mLogo: An interactive multiple-logo tool for visualization of large-number sequence alignment [Data Visualization]
Arthur Chun-Chieh Shih, D.T. Lee, Chin-Lin Peng, and Yu-Wei Wu
Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
Corresponding author: email@example.com
When aligning hundreds or thousands of sequences, such as those of HIV, dengue virus, and influenza viruses, to reconstruct the epidemiological history or to understand the mechanisms of epidemic virus evolution, analyzing and visualizing the large-number alignment results has become a new challenge for computational scientists. Although there are several tools available for visualization of very long sequence alignments, few of them are applicable to large-number alignments. In this paper, we present a multiple-logo alignment visualization tool, called Phylo-mLogo, which allows the user to visualize the global profile of the whole multiple sequence alignment and to hierarchically visualize homologous logos of each clade simultaneously. Phylo-mLogo calculates the variabilities and homogeneities of alignment sequences by base frequencies or entropies. Unlike traditional representations of sequence logos, Phylo-mLogo not only displays the global logo patterns of the whole alignment but also shows the local logos for each clade. In addition, Phylo-mLogo allows the user to focus the analysis on important structurally or functionally constrained sites in the alignment, selected by the user or by built-in automatic calculation. With Phylo-mLogo, the user can symbolically and hierarchically visualize hundreds of aligned sequences simultaneously and easily check the sites potentially under selective sweep, negative selection, neutrality, or positive selection when analyzing large-number human or avian influenza virus sequences.
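The per-column entropy that underlies a sequence logo, computed once for the whole alignment and once per clade, can be sketched as follows. This is an illustration rather than Phylo-mLogo's code: the function names, the toy alignment, and the clade assignments are assumptions, and gap characters would simply be counted as another symbol.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def clade_profiles(alignment, clades):
    """alignment: name -> aligned sequence; clades: clade name -> sequence names."""
    profiles = {}
    for clade, names in clades.items():
        seqs = [alignment[name] for name in names]
        profiles[clade] = [column_entropy(col) for col in zip(*seqs)]
    return profiles

# Toy alignment (hypothetical sequences and clades).
alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGT", "s4": "ATGC"}
clades = {
    "all": ["s1", "s2", "s3", "s4"],
    "cladeA": ["s1", "s2"],
    "cladeB": ["s3", "s4"],
}
profiles = clade_profiles(alignment, clades)
```

The second column is variable across the whole alignment (1 bit) but fixed within each clade (0 bits), which is the kind of clade-specific site that a global logo alone would hide.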
3D Phylogeny Explorer: Distinguishing paralogs, lateral transfer, and violations of the “molecular clock” assumption with 3D visualization [Data Visualization]
Namshin Kim and Christopher Lee
Molecular Biology Institute, Center for Genomics and Proteomics, Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
Corresponding author: firstname.lastname@example.org
We have developed 3D Phylogeny Explorer, a novel phylogeny tree viewer that maps trees onto three spatial axes (species on the X-axis; paralogs on Z; evolutionary distance on Y), enabling one to distinguish at a glance evolutionary features such as speciation; gene duplication and paralog evolution; lateral gene transfer; and violations of the “molecular clock” assumption. To illustrate the value of this visualization approach for microbial genomes, we generated 3D Phylogeny data for all clusters from COG, made available as “live” 3D views using VRML2 at http://bioinfo.mbi.ucla.edu/3DPhylogenyExplorer. We constructed tree views using well-established phylogeny methods and graph algorithms. We used CLUSTALW/PHYLIP to generate the traditional 1D phylogeny. The 3D tree layout is generated on the fly based on user queries: after finding best-hit relationships from the 1D phylogeny tree, orthologous groups are identified as fully connected cliques of reciprocal best hits. Trees are then reoriented by evolutionary order, recent events first and old events last. While walking the tree in postorder traversal, the 2D gene layout (species and orthologous group) is generated by order of appearance. We used Scientific Python to generate VRML2 3D views viewable in any web browser. Views can be scrolled, rotated, rescaled, and explored interactively, making it easy to see evolutionary events such as speciation, gene duplication and lateral gene transfer. All objects in 3D Phylogeny Explorer are clickable to display subtrees, connectivity path highlighting, sequence alignments, and gene summary views.
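The rule that an orthologous group is a fully connected clique of reciprocal best hits can be sketched in a few lines. The data layout (genes as `(species, name)` tuples, a `best_hit` dictionary keyed by gene and target species) and the toy hits are assumptions for illustration, not the tool's actual data structures.

```python
from itertools import combinations

def reciprocal_best_hits(best_hit):
    """best_hit[(gene, target_species)] = best-matching gene in that species.

    Genes are (species, name) tuples. Returns the set of unordered gene
    pairs that are each other's best hit.
    """
    pairs = set()
    for (gene, _target), hit in best_hit.items():
        if best_hit.get((hit, gene[0])) == gene:
            pairs.add(frozenset((gene, hit)))
    return pairs

def is_orthologous_group(genes, pairs):
    """True when every pair of genes in the group is a reciprocal best hit."""
    return all(frozenset((a, b)) in pairs for a, b in combinations(genes, 2))

# Toy best-hit table (hypothetical genes from three species).
eco, sty, vch, vch2 = ("eco", "a1"), ("sty", "a1"), ("vch", "a1"), ("vch", "a2")
best_hit = {
    (eco, "sty"): sty, (sty, "eco"): eco,
    (eco, "vch"): vch, (vch, "eco"): eco,
    (sty, "vch"): vch2, (vch2, "sty"): sty,
    (vch, "sty"): sty,  # not reciprocated: sty prefers vch2
}
pairs = reciprocal_best_hits(best_hit)
```

Here `eco` and `sty` form a valid group, but adding `vch` breaks the clique because `sty` and `vch` are not reciprocal best hits, which hints at a paralog in the third species.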
Tools for Integrated Sequence-Structure Analysis with UCSF Chimera [Data Visualization]
Elaine C. Meng, Eric F. Pettersen, Conrad C. Huang, John "Scooter" Morris, and Thomas E. Ferrin
Corresponding author: email@example.com
UCSF Chimera is an extensible molecular graphics program with a broad set of features. Chimera includes tools for the combined analysis of sequence and structure:
Multalign Viewer displays sequence alignments (generated in Chimera or externally) along with associated structures. Structures can be superimposed using the sequence alignment, information from the sequences can be shown on the structures, and information from the structures can be shown on the sequences.
MatchMaker superimposes structures in the absence of a pre-existing sequence alignment. Secondary structure information is used in addition to residue similarity, allowing correct matches of even proteins with very low sequence identity.
Match->Align constructs a multiple sequence alignment from an already superimposed set of structures.
A bioinformatics approach to identify recoding events of A-to-I RNA editing [Gene Expression]
Stefan Maas1, Daniel Lopresti2, Derek Drake3, Rikhi Kaushal1, Steven Hookway2, Walter Scheirer2, Mark Strohmaier2, and Christopher Wojciechowski2
1 Department of Biological Sciences, Lehigh University, Bethlehem, PA; 2 Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA; 3 Department of Computer Science, Purdue University, West Lafayette, IN
Corresponding author: firstname.lastname@example.org
RNA editing by adenosine deamination is a posttranscriptional mechanism for the regulation of gene expression and regulates important functional properties of neurotransmitter receptors in the brain by changing single codons in pre-mRNA. It has also been implicated in the regulation of splicing, siRNA mediated gene silencing and miRNA biogenesis. We have recently identified repeat elements in the human genome as a major target for RNA editing affecting mainly non-coding mRNA sequences using a combined bioinformatics and experimental screening strategy (Athanasiadis et al., PLoS Biology 2004).
Here we present our ongoing work to develop a comprehensive yet flexible end-to-end system to identify recoding editing events in human mRNAs. Our software implements a reconfigurable pipeline with several input parameters that can be set interactively by the user to encourage experimentation with “what if” scenarios. Special care has been taken regarding run-time efficiency and space requirements.
At the first stage of our computational screen, the locations of base discrepancies between genomic DNA and its corresponding mRNAs are determined. Subsequently, known SNPs are removed and the sequences are ranked based on cross-species conservation and local sequence determinants. Finally, each potential editing site is further ranked by computing a foldback score using a sliding-window algorithm.
Applied to a subset of human mRNAs, we have identified known editing targets as well as several strong novel candidates for which we are experimentally validating the occurrence of editing in vivo. Next, we are planning to comprehensively screen the human transcriptome for novel editing targets.
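The first two stages of the screen (finding base discrepancies, then discarding known SNPs) can be sketched as below. Inosine is read as guanosine during sequencing, so an edited site appears as a genomic A paired with a G in the mRNA. The sketch assumes the two sequences are already aligned without gaps and given on the same strand. The function name and toy sequences are hypothetical, and the conservation ranking and foldback scoring are not shown.

```python
def candidate_editing_sites(genomic, mrna, known_snps):
    """0-based positions where the genome has A and the mRNA reads G.

    `genomic` and `mrna` are assumed to be pre-aligned, ungapped, and of
    equal length; `known_snps` is a set of positions to exclude.
    """
    sites = []
    for pos, (g, m) in enumerate(zip(genomic, mrna)):
        if g == "A" and m == "G" and pos not in known_snps:
            sites.append(pos)
    return sites

# Toy sequences: A-to-G discrepancies at positions 2 and 4; 4 is a known SNP.
sites = candidate_editing_sites("ACAGATA", "ACGGGTA", known_snps={4})
```

Only position 2 survives as a candidate, since the discrepancy at position 4 is explained by a known polymorphism.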
Use generalized Procrustes analysis for between-slide normalization of microarray data [Gene Expression]
Huiling Xiong, Dapeng Zhang, Christopher J. Martyniuk, Vance L. Trudeau, Xuhua Xia
Biology Department, University of Ottawa
Corresponding author: email@example.com
Normalization is an essential process in dual-labelled microarray data analysis to remove sources of non-biological variation and systematic bias in the experiments. It can be divided into two steps, within-slide normalization and between-slide normalization. Few studies address between-slide normalization. To tackle this problem, we have developed a novel between-slide normalization method based on the generalized Procrustes analysis (GPA) algorithm.
Both publicly available data and simulated data were used in comparing our GPA method with the popular scale method. Three different empirical criteria (between-replicate variability, the Kolmogorov-Smirnov statistic, and mean square error) were used to evaluate the performance of these two methods in removing bias without compromising underlying information. The results show that, for all three criteria, our proposed GPA method performs better than the scale method in between-slide normalization. In addition, GPA is free from the assumptions inherent in the scale method that are difficult to validate. This makes the GPA method particularly suitable for analyzing boutique arrays, where the majority of genes may be differentially expressed.
An Automated Pipeline for Regulatory Motif Detection Tool Assessment [Gene Expression]
Daniel Quest, Kathryn Dempsey, Dhundy Bastola, and Hesham Ali
University of Nebraska Medical Center Pathology and Microbiology Department Center for Bioinformatics
Corresponding author: firstname.lastname@example.org
Accurate regulatory motif detection is a challenging problem in bioinformatics. Each year, many competing tools are reported that have differing advantages and disadvantages. Because of the lack of good data sets and the difficulty of tool assessment, these tools are evaluated on proprietary data sets using measures that are not universal. Consequently, it is unclear which methods are most sensitive and specific in detecting regulatory regions. Preliminary studies have compared some of the most popular software, but such studies are limited because of the many manual steps required to assess each tool. Recently, universal statistics for evaluating motif prediction tools were proposed based on statistics for evaluating gene prediction programs. In this work, we propose a universal framework and accompanying software that is capable of running motif detection tools in batch mode and that calculates the important statistics for tool assessment automatically. Because of its ease of use and automation, our tool allows for rapid construction of large transcription factor binding site data sets, and rapid and accurate assessment of motif detection tools for a domain of interest. As a case study we have evaluated our framework in prokaryotes, benchmarking many popular motif sampling tools against known motifs backed by experimental evidence.
Extension of SVM-RFE for multiclass gene selection on DNA microarray data [Gene Expression]
Xin Zhou and David P. Tuck
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798; Departments of Pathology, Yale University School of Medicine, New Haven, CT 06510
Corresponding author: email@example.com
Gene selection has emerged as a major issue in microarray data analysis. It can enable precise cancer classification and provide useful information for cancer studies. RFE (Recursive Feature Elimination) is a state-of-the-art gene selection algorithm related to the SVM (Support Vector Machine). Like the SVM, RFE was initially designed to solve the binary gene selection problem. However, the multiclass gene selection problem is common and intrinsically more difficult, and hence worthy of further investigation. Normally, the k-class problem is divided into k binary gene selection problems, each of which considers one class as positive and the rest as negative (one-versus-rest). In the literature, the classical SVM-RFE is applied in this way to solve the multiclass problem. However, the genes selected from one binary problem may deteriorate the classification performance in other binary problems. In the present study, we propose another extension of SVM-RFE to solve the multiclass problem. Although the extension is still based on OVR (one-versus-rest) SVM classification, it takes all k classes into consideration simultaneously at the gene selection stage. The genes with minimal discriminant power across all k trained classifiers are recursively removed until a small subset of genes is obtained. The proposed multiclass gene selection algorithm was tested on two benchmark microarray datasets: the NCI dataset and the leukaemia dataset. Compared with the traditional extension of SVM-RFE (in the one-versus-rest manner), our proposed extension provides genes leading to more accurate classification performance.
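The elimination loop can be sketched once a scoring rule is chosen. The abstract does not give the formula for "discriminant power", so the sketch assumes the sum of squared weights across the k one-versus-rest classifiers, by analogy with the squared-weight criterion of binary SVM-RFE. The `train` callback is a hypothetical stand-in for retraining k linear SVMs on the remaining genes, and the toy version below returns fixed weights instead of training anything.

```python
def rank_scores(weight_vectors):
    """Assumed criterion: per-gene sum of squared weights over all k classifiers."""
    n_genes = len(weight_vectors[0])
    return [sum(w[j] ** 2 for w in weight_vectors) for j in range(n_genes)]

def recursive_elimination(genes, train, n_keep):
    """Retrain, score, and drop the weakest gene until `n_keep` remain."""
    remaining = list(genes)
    while len(remaining) > n_keep:
        weights = train(remaining)  # k weight vectors over `remaining`
        scores = rank_scores(weights)
        worst = min(range(len(remaining)), key=scores.__getitem__)
        del remaining[worst]
    return remaining

# Hypothetical fixed weights per gene for k = 3 classes (no real SVM here).
FIXED = {
    "g1": [2.0, 0.1, 0.0],
    "g2": [0.1, 0.1, 0.1],
    "g3": [0.0, 1.5, 0.2],
    "g4": [0.3, 0.0, 1.0],
}

def toy_train(remaining):
    return [[FIXED[g][c] for g in remaining] for c in range(3)]

kept = recursive_elimination(["g1", "g2", "g3", "g4"], toy_train, n_keep=2)
```

Because every gene is scored against all classifiers at once, a gene that matters for only one class (such as `g1` here) is not discarded merely because it is uninformative in the other binary problems.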
DNA Aberration and RNA Expression Pattern Detection Using Affymetrix Gene Expression Arrays [Gene Expression]
Guoliang (Leon) Xing1, Cristina R. Antonescu2, Kai Wu1, Manqiu Cao1, Yaron Turpaz1, Earl Hubbell1, Nicholas Socci2, Robert G. Maki3, C. Garrett Miyada1, and Raji Pillai1.
Affymetrix, Inc.1, 3420 Central Expressway, Santa Clara, CA 95051, USA. Department of Pathology2 and Medicine3, Memorial Sloan-Kettering Cancer Center, New York, NY, USA.
Corresponding author: firstname.lastname@example.org
Cancer and inherited diseases are commonly associated with genome abnormality. Detecting DNA copy number changes is important for cancer diagnosis and in fundamental life science research. It’s interesting to infer the impact of underlying DNA alteration on gene expression patterns.
We have developed a high density microarray-based approach on Human Genome U133 Plus 2.0 arrays to detect genome alterations at the gene level, which can be compared with gene expression patterns of the same sample using the same probesets on the same or similar type of Affymetrix arrays.
We took a maximum likelihood approach to build a probeset-level computational model based on training data, and derived copy number distributions from samples with known 1X to 5X copies of X-chromosomes. We validated our method on other cell lines with defined chromosomal deletions and amplifications.
We applied our computational method to examine DNA samples from tumor or tumor cell lines and detected known chromosomal aberrations, in concordance with previously published results using conventional or BAC-array CGH. Furthermore, gene expression analysis on the same tumor samples allowed us to examine the impact of DNA alteration on RNA expression patterns in a coherent fashion. Pathway analysis provided insights into associated and co-regulated genes.
In conclusion, our method for detecting chromosomal copy number aberrations using expression microarrays provides an interesting way to detect genome abnormality.
A Noncoding RNA Gene Finder for Bacteria [Genomic Annotation]
Corresponding author: email@example.com
Noncoding RNAs are genes for which RNA, rather than protein, is the functional end product. While many structural or catalytic RNA genes are well known, such as ribosomal RNAs and transfer RNAs, the number and diversity of other noncoding RNAs remain poorly understood. The majority of small, noncoding RNAs in bacteria appear to act as post-transcriptional regulators by basepairing with target mRNAs. Only a handful of these RNA genes have been identified although it is estimated that they compose between 1% and 4% of bacterial genes.
Here, we present a gene-finder, sRNAFinder, for identifying noncoding RNAs in bacteria. The gene-finder incorporates three heterogeneous sources of information for characterizing RNA genes. The three sources of information include promoter signals and transcription termination signals, transcript expression data as determined from whole genome microarrays, and comparative genome analyses which evince compensatory base changes that conserve RNA secondary structure. The gene-finder is based on a variable duration hidden Markov model, also known as a semi-Markov model or a general Markov model.
The performance of sRNAFinder is evaluated using a set of documented RNA genes in Escherichia coli and Vibrio cholerae. With our approach, we find that evidence of conserved RNA secondary structure is the single best data source for noncoding gene prediction in bacteria, but in all cases, integrating multiple data sources offers improved performance over using any single data source. On the test set of documented RNA genes, the sensitivity and specificity of sRNAFinder are estimated to be 76% and 78%, respectively.
Splign: a Tool for cDNA-to-Genome Compartmentization and Alignment [Genomic Annotation]
Yu.Kapustin, A.Souvorov, T.Tatusova
National Center for Biotechnology Information
Corresponding author: firstname.lastname@example.org
A critical step in eukaryotic genome annotation is computing spliced gene product alignments as they provide best evidence for gene models. Although several tools have been developed to address the problem, a variety of possibilities including small exons, non-conventional splice signals, exon and gene duplications as well as sequencing errors are difficult to account for using heuristic approaches. We developed a tool called Splign which relies on formally defined models when searching for gene copies and splices which proved to be robust when dealing with the above mentioned factors.
The tool uses a complete set of local alignments between the query and genome to recognize gene duplications in a process called compartmentization. While compartments on different subject sequences or of different strands are trivially separated, those on same subject strands can be obtained via solving an optimization problem in terms of query coverage. The solution is delivered with a dynamic programming algorithm.
Precise intron-exon structures within compartments are obtained via global alignment specifically formulated to account for introns and different splice types. A version of the algorithm discriminates further between more and less conserved splice signals when aligning against low quality genomes. As a performance measure, a set of selected local alignments is used to split the dynamic programming space.
In a series of comparisons with other tools, Splign was more accurate and tolerant to sequence errors while not sacrificing in speed. Splign has been used to compute same- and cross-species alignments in all recent genome builds at NCBI.
goldMINER: an efficient automatic sequence annotation platform [Genomic Annotation]
Xuhua Xia, Pingchao Ma, Christopher Martyniuk, Kate Werry, Huiling Xiong, Vance L. Trudeau
Biology Department, University of Ottawa
Corresponding author: email@example.com
Genomic sequencing or large-scale characterization of expressed sequence tags (ESTs) generates many sequences of unknown function and calls for automatic sequence annotation to assign functions to the sequence fragments. Sequence annotation depends heavily on databases for protein functional classification. The Conserved-Domain Database (CDD) hosted at NCBI integrates popularly used databases of protein functional classification such as pFAM, SMART and COG to form a centralized data center augmented by additional protein annotations curated at NCBI. This significantly increases the chance of retrieving functional annotations for the unknown sequences and alleviates the frustrating experience of a protein misclassified into multiple protein families. Here we present goldMINER, the first software package that takes advantage of the power of CDD to automate the functional annotation of unknown sequences such as ESTs. The software has been successfully used to annotate many EST sequences from a large-scale gene expression study of the goldfish brain in response to chemicals that functionally mimic oestrogens. The automatic installation package of goldMINER for the Windows platform is freely and publicly available at http://dambe.bio.uottawa.ca/goldminer.asp.
Computational pipeline for the analysis of genome-wide experimental data [Genomic Annotation]
Gabriel Renaud (1), Ingeborg Holt(1), James Malley(2), Tyra Wolfsberg(1)
(1) National Human Genome Research Institute, NIH, (2) Center for Information Technology, NIH
Corresponding author: firstname.lastname@example.org
With the completion of the human genome sequence and the development of high-throughput experimental techniques, laboratory researchers are performing large-scale, genome-wide analyses that generate hundreds of thousands of sequence tags. We have developed a pipeline to computationally characterize these experimentally identified sequences by comparing them to publicly available genome sequences and annotations. We align each sequence to the reference human genome assembly to determine its genomic location, and then compare the coordinates of this sequence to the coordinates of a variety of genome annotations. Using this approach, we can assign putative functions to the experimentally-identified sequences based on their proximity to known sequence features, such as genes. In order to provide statistical rigor for the analysis, we also characterize sequences that we picked at random from the genome. This step, the most intricate part of the process and focus of this poster, is custom-designed to emulate the experimental protocol used to generate the original sequence tags. We use these control sequences to comprehensively evaluate null hypotheses relating to the question of whether the experimental sequences derive from specific genomic regions, such as those near genes. This general method has been applied to sequences generated during the course of a variety of biomedical experiments, including the mapping of DNAse hypersensitive sites to identify potential gene regulatory regions, and the characterization of retroviral integration sites in patients treated in a retroviral gene therapy trial.
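The comparison against random control sequences amounts to an empirical test of a null hypothesis. The sketch below assumes the simplest possible control, uniform sampling of positions along a single linear genome, which is a deliberate simplification of the protocol-matched sampling the abstract describes. All names, coordinates, and the window size are hypothetical.

```python
import random
from bisect import bisect_left

def distance_to_nearest(pos, gene_starts):
    """Distance from `pos` to the closest gene start; `gene_starts` must be sorted."""
    i = bisect_left(gene_starts, pos)
    candidates = gene_starts[max(i - 1, 0):i + 1]
    return min(abs(pos - g) for g in candidates)

def fraction_near(positions, gene_starts, window):
    """Fraction of positions lying within `window` bases of a gene start."""
    near = sum(distance_to_nearest(p, gene_starts) <= window for p in positions)
    return near / len(positions)

def empirical_p(observed, gene_starts, genome_length, window, n_sets=1000, seed=0):
    """Share of random control sets at least as gene-proximal as the observed set."""
    rng = random.Random(seed)
    obs = fraction_near(observed, gene_starts, window)
    hits = 0
    for _ in range(n_sets):
        control = [rng.randrange(genome_length) for _ in observed]
        if fraction_near(control, gene_starts, window) >= obs:
            hits += 1
    return (hits + 1) / (n_sets + 1)  # add-one correction avoids p = 0

# Toy example: five tags that all fall next to one of three gene starts.
gene_starts = [1000, 5000, 9000]
observed = [1010, 4990, 9005, 1001, 5020]
p = empirical_p(observed, gene_starts, 1_000_000, window=50, n_sets=200)
```

A small p-value indicates that random tags rarely cluster near genes as strongly as the experimental tags do. A faithful control would also need to reproduce protocol-specific biases such as restriction sites or mappability.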
Integrating Multiple Genome Builds into an Automated Sequence Annotation Pipeline [Genomic Annotation]
Conrad Huang, John "Scooter" Morris, Susan Johns, Michiko Kawamoto, Doug Stryke, Courtney Harper, Thomas Ferrin, Patricia Babbitt
Corresponding author: email@example.com
The International Gene Trap Consortium (IGTC) pipeline annotates gene trap sequence tags with the identities of the trapped genes (transcripts) and their genomic locations. The pipeline is entirely automated, allowing the annotation of large numbers of sequence tags according to systematic and reproducible procedures. Two protocols process sequences independently. The first protocol identifies the transcript associated with a cell line sequence using BLAST to search the Genbank non-redundant database. The second protocol localizes the cell line sequence onto the mouse genome using the BLAT program from UC Santa Cruz. When both protocols produce results, those results must be reconciled, requiring that the two protocols use a single mouse genome build. The IGTC uses the current publicly released build from NCBI.
Next, the IGTC pipeline maps cell line sequences to genes at Entrez and Ensembl based on genomic coordinate overlap. Unfortunately, a lag can occur between NCBI's release of a genome build and release of the Ensembl web site based on that build. When the IGTC pipeline detects that Ensembl is using a mouse genome build other than NCBI's current release, the genomic coordinates of the cell line and any AutoIdent transcripts are re-localized to the Ensembl genome build using BLAT. These results are then reconciled in the same manner as the original reconciliation. The resulting genomic coordinates are used to find the overlapping Ensembl gene. The result is an internally consistent set of gene annotations obtained from multiple data providers using multiple mouse genome builds.
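Mapping a localized sequence tag to genes by coordinate overlap reduces to an interval test, provided both sets of coordinates refer to the same genome build. A minimal sketch, assuming closed intervals and hypothetical gene names and coordinates:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def genes_for_tag(tag, genes):
    """tag: (chrom, start, end); genes: name -> (chrom, start, end), same build."""
    chrom, start, end = tag
    return [
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and overlaps(start, end, g_start, g_end)
    ]

# Hypothetical gene coordinates on a single build.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 500),
}
matches = genes_for_tag(("chr1", 480, 520), genes)
```

The test is only meaningful when tag and gene coordinates come from one build, which is why the pipeline re-localizes sequences with BLAT whenever the two providers disagree.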
Graphic Models: Approaches for Mining Genetic Epidemiology Data in Complex Trait Analysis [Genotyping and SNPs]
Lu-yong Wang, Dorin Comaniciu, Daniel Fasulo
Siemens Corporate Research
Corresponding author: Luyong.Wang@siemens.com
Although there has been great success in identifying disease genes in simple, monogenic Mendelian traits, the understanding of genetic mechanisms in most complex diseases remains challenging. A central goal is to identify single nucleotide polymorphisms (SNPs) and their interactions that confer susceptibility to the disease. Traditional methods, such as the multiple dimensional reduction method and the combinatorial partitioning method, provide good tools to decipher such interactions in the absence of genetic heterogeneity in the population. However, these traditional methods have not managed to solve the genetic heterogeneity problem common to diseases. As the causes of genetic heterogeneity are rarely known in advance, these methods, which are based on estimation over the entire population, are unlikely to succeed in identifying the genetic causes of disease.
Thus, we are motivated to propose a novel boosted generative modeling approach for structure-modeling the interactions leading to diseases in the presence of genetic heterogeneity. This method innovatively bridges the ensemble method and generative modeling in the genetic association study. Generative modeling is to model interaction network configuration and the causal relationship, while boosting is used to address the genetic heterogeneity problem, a common problem in genetic epidemiology study. We perform our method on simulation data of complex diseases. The results indicate that our method is capable of structure-modeling of interaction networks among disease-susceptible loci and addressing genetic heterogeneity issues concurrently, where the traditional methods, such as multiple dimensional reduction method, fail to apply. It provides an exploratory tool for potential SNPs that are likely to contribute to the diseases.
Designing a Rapid SNP-Based Genomic Screening Tool for Common Diseases [Genotyping and SNPs]
Lu-yong Wang, Dorin Comaniciu, Daniel Fasulo
Siemens Corporate Research
Corresponding author: Luyong.Wang@siemens.com
Genome-wide association studies of complex diseases generate massive amounts of single nucleotide polymorphism data. Traditionally, a univariate statistical test is used to first screen out non-associated SNPs, retaining only those meeting some criterion. The Fisher exact test can provide a p-value for each SNP, and those with the lowest p-values are retained. However, the disease-susceptible SNPs may have little marginal effect in the population and are unlikely to be retained after the univariate tests. Also, model-based methods are impractical for large-scale datasets with thousands of SNPs. Moreover, genetic heterogeneity makes it harder for the traditional methods to identify the genetic causes of diseases. A more recent method based on random forests provides a more robust way of screening SNPs at the scale of thousands. However, for larger-scale data, e.g., Affymetrix Human Mapping 100K GeneChip data, a faster screening method is required to screen SNPs in whole-genome association analysis with genetic heterogeneity.
We are motivated to propose a boosting–based method for rapid screening in large scale analysis of complex traits in the presence of genetic heterogeneity. This method is a quick screening alternative for the rapid and accurate identification and screening of the candidate SNPs. It provides an assistant tool for further delicate modeling task. Boosting-based variable selection is used to solve the genetic heterogeneity problem. It enables fast detection of the SNPs with little marginal effects. More importantly, its computational efficiency makes it a good candidate for screening large-scale association data.
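For reference, the traditional univariate screen that the boosting-based method is meant to improve on can be sketched as a two-sided Fisher exact test per SNP followed by ranking. This illustrates the baseline only, not the proposed method. The table layout (cases with and without the allele, then controls with and without), the SNP names, and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

def screen(tables, keep):
    """Names of the `keep` SNPs with the smallest p-values."""
    ranked = sorted(tables, key=lambda snp: fisher_exact_two_sided(*tables[snp]))
    return ranked[:keep]

# Hypothetical counts: (case with allele, case without, control with, control without).
tables = {
    "snp1": (10, 0, 0, 10),
    "snp2": (3, 1, 1, 3),
    "snp3": (5, 5, 5, 5),
}
top = screen(tables, keep=1)
```

A SNP that acts only in combination with others, or only in a subgroup of patients, can look like `snp3` in such a table and be discarded, which is the weakness of marginal screening that motivates the boosting approach.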
Morphometric Analysis of Imaging Genetics in Mild Cognitive Impairment [Genotyping and SNPs]
Li Shen, Andrew Saykin, Heng Huang, Moo K. Chung, James Ford, Fillia Makedon
UMass Dartmouth, Dartmouth College, UW Madison
Corresponding author: firstname.lastname@example.org
Brain imaging methods for identifying medial temporal morphological abnormalities have been studied for diagnosis of mild cognitive impairment (MCI) and Alzheimer's disease (AD). However, the connection between the genotype and imaging phenotype has yet to be established in order to identify possible genetic risk factors for the diseases. The Interleukin-6 (IL-6) gene encodes a proinflammatory cytokine involved in neuronal signaling that appears to reduce hippocampal neurogenesis. This MCI study aims to identify hippocampal shape changes related to the G allele of a common SNP of the IL-6 gene in the -174 promoter region. We present a new computational framework that integrates a set of powerful surface modeling and processing techniques, including spherical harmonic surface modeling, quaternion-based 3D shape registration, a novel surface signal extraction method, heat kernel smoothing for increasing signal-to-noise ratio, and random field theory for statistical inference on the surface. The participants include 40 healthy matched older controls, 39 older adults with cognitive complaints (CC) but normal memory performance, and 37 patients with amnestic MCI. The analysis shows that the G/G MCI patients are the most abnormal in shape relative to controls, while the G/C and C/C genotype MCI patients and the CC group are intermediate. The shape changes of the G/G MCI group are most pronounced in the posterior part of the right hippocampus. These findings suggest that combining imaging phenotypes and genetic profiles has the potential to elucidate the biological pathways for better understanding MCI and AD.
SIDACS: An integrated computational system for SNP/indel discovery and classification [Genotyping and SNPs]
Stephen M. Beckstrom-Sternberg1,2 and Raymond K. Auerbach2
1. Translational Genomics Research Institute 2. Northern Arizona University
Corresponding author: email@example.com
We have created a general-purpose, integrated, in silico solution, SIDACS, for single nucleotide polymorphism (SNP) and insertion/deletion (indel) discovery, classification, and visualization in microorganisms, which can be applied to any combination of characterized and uncharacterized sequences. Analysis of characterized sequences uses BLAST to align specific regions of interest among homologous regions in other strains or organisms. Analysis of uncharacterized sequences, such as draft genome sequences, utilizes MUMmer for SNP/indel discovery on a genome-wide scale between a combination of annotated, finished genomes and recently sequenced genomes. PERL programs were designed to enable a variety of different alignment and gene prediction tools to work in concert regardless of a tool’s output format. Further PERL programs were developed to filter for quality and perform SNP/indel classification for all sequences. SNP classification can be performed by strain-specificity and/or by the SNP’s corresponding effect upon the amino-acid chain. To our knowledge, this is the first time that a program has been designed to handle SNP classification by amino acid chain effect on both single-gene and whole-genome bases in a semi-automated manner. SIDACS has been used successfully to discover and classify SNPs and indels for a number of different bacterial genomes, including Francisella, Burkholderia, Brucella, Staphylococcus, Streptococcus, and Yersinia. SIDACS was then used to visualize genomic distribution of these SNPs/indels and prioritize them for assay/DNA microarray design. In addition, these SNPs/indels were used to clarify phylogenetic positions of each organism within their respective phylogenies and to create rapid typing systems.
A Computational RNA Evolution Simulator based on thermodynamic stability, mutational robustness, and linguistic complexity [Molecular Simulation]
Nir Dromi, Assaf Avihoo, and Danny Barash
Department of Computer Science, Ben-Gurion University, Israel
Corresponding author: firstname.lastname@example.org
It is possible to analyze the sequence and structure of RNA molecules using various quantifiable measures. In prior works that used secondary structure predictions to model RNA evolution, it has been shown that natural RNAs differ significantly from artificial RNAs in some of these measures. In particular, two of these measures are the secondary structure minimum free energy and the robustness of the structure to single point mutations. Here, our goal is to analyze the relationships and mutual influence of thermodynamic stability and mutational robustness, and in addition the linguistic complexity measures of both sequence and structure of the RNA molecule. In order to study their effect on the evolution of natural RNAs, we have developed an optimization-based RNA sequence and structure evolution simulator. Using evolutionary optimization, the goal of our computer tool is to simulate the emergence of natural RNAs from a collection of random nucleic acid sequences, given generic objective functions that take into consideration thermodynamic stability, mutational robustness, and linguistic complexity.
Chemical molecular similarity analysis and its applications in structure-activity visualization [Other]
Weiguo Fan, Xin Lin, Yu-wei Hsieh, Boren Lin, Paul Durand, Johnnie Baker and Chun-Che Tsai
Kent State University
Corresponding author: email@example.com
We describe an efficient algorithm for finding the maximal common substructure (MaCS) of a pair of molecules, each represented as a two-dimensional (2D) labeled graph. The size of the MaCS, expressed as the total number of non-hydrogen atoms and bonds (NAB), is used as the basis for calculating a molecular similarity index (MSI) and a topological distance (TD). The algorithm uses a subgraph isomorphism approach to find the maximum common subgraph (MCSG). This study presents the development and application of a topological method for analyzing molecular similarity in the context of structure-activity relationships (SAR): structure-activity maps (SAMs). SAMs are graphic maps in which chemical structures, quantified by molecular descriptors such as NAB or MSI, are plotted against other activity measures of the molecules. We used flavonoids with antioxidant activity to exemplify how SAMs can be used to discover correlations between the structures and activities of compounds and how activity can be improved. TOPSIM software was designed to find the maximum common substructures of molecules and to compute TD and MSI for building SAMs. Results indicated that SAMs greatly facilitated and, most importantly, visualized the process of discovering important features of flavonoids with antioxidant activity and of determining important trends in activity and sites of modification.
A Comparison Study of Signal Extensions Methods for Wavelet Denoising of Array CGH Data [Other]
Department of Computer Science and Engineering, Southern Methodist University, Dallas, TX 75205
Corresponding author: firstname.lastname@example.org
Array-based comparative genome hybridization (array CGH) is a recently developed high-throughput technique to detect DNA copy number aberrations. Typically, array CGH data are noisy. Wavelet denoising was previously shown to have superior performance for denoising array CGH data. However, the effect of different signal extension methods on the performance of wavelet denoising in this particular application has not been previously studied. In this paper, we performed a comparison study of three signal extension methods (zero-padding, periodic extension, and symmetrization) for wavelet denoising of array CGH data using realistically generated synthetic data. Empirical results suggest that the zero-padding method outperforms the other two methods by 0.9-1.2% in terms of the overall root mean squared error. The difference is statistically significant (P<0.05) in at least 80% of all test cases.
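The three signal extension methods named above differ only in how samples beyond the ends of the signal are filled in before the wavelet transform is applied. The following is a minimal sketch of the three schemes on a plain list, not the authors' implementation; the function name extend_signal, the pad parameter, and the half-point convention used for symmetrization are assumptions made for illustration.

```python
def extend_signal(signal, pad, mode):
    """Return `signal` extended by `pad` samples on each side.

    mode is one of "zero", "periodic", or "symmetric" (hypothetical names).
    """
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    if mode == "zero":
        # Zero-padding: everything outside the signal is treated as 0.
        left = [0.0] * pad
        right = [0.0] * pad
    elif mode == "periodic":
        # Periodic extension: the signal is one period of a repeating sequence.
        left = [signal[(i - pad) % n] for i in range(pad)]
        right = [signal[i % n] for i in range(pad)]
    elif mode == "symmetric":
        # Symmetrization (half-point convention): mirror the signal at each
        # boundary, so the edge sample appears twice.
        if pad > n:
            raise ValueError("pad must not exceed the signal length")
        left = [signal[pad - 1 - i] for i in range(pad)]
        right = [signal[n - 1 - i] for i in range(pad)]
    else:
        raise ValueError("unknown mode: " + mode)
    return left + list(signal) + right


log_ratios = [1.0, 2.0, 3.0, 4.0]  # toy values, not real array CGH data
for mode in ("zero", "periodic", "symmetric"):
    print(mode, extend_signal(log_ratios, 2, mode))
```

Zero-padding introduces an artificial jump at the boundary whenever the edge values are far from zero, whereas the other two schemes reuse observed values; which behaves best for a given data type is the empirical question the abstract addresses.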
Grid-based Secure Web Service Framework for Bioinformatics [Other]
Dawei Sun and Xiaoyu Zhang
California State University San Marcos
Corresponding author: email@example.com
Although web-based bioinformatics has become very popular after decades of growth, biologists find it inconvenient because they must access many web sites manually to perform a single task. Web-service-based bioinformatics was proposed to provide well-defined interfaces accessible to programs. However, security, a very important issue for web services, is not addressed in most web-service-based bioinformatics systems. We developed a Grid-based Secure Web Service Framework for Bioinformatics (GSWSF). The GSWSF is designed based on the Open Grid Service Architecture (OGSA) and Grid Security Infrastructure (GSI), which provide two security mechanisms: transport level security and message level security. Secure and easy-to-use bioinformatics services can be built using this framework. This paper covers the architecture and some design and implementation details of the framework. A preliminary implementation of the framework can be found at http://bioinfo.csusm.edu.
OptiPNAFinder: A Sequence Designing Tool for Peptide Nucleic Acid (PNA) with Minimized Off-target Effect [Other]
Seungpyo Hong1, Hosang Jeon2, Seongjo Kim2, Hyon Chang Kim1, Dong Soon Choi3, Han Jip Kim1, Hyun Joo4, Churl K. Min3
1Department of Biological Sciences, 2Division of Information & Computer Engineering, 3Department of Molecular Science & Technology, Ajou University, Suwon, Korea 4Department of Physiology and Biophysics, College of Medicine, Inje University, Busan Kore
Corresponding author: firstname.lastname@example.org
Peptide nucleic acids (PNAs) are nucleic acid analogues with good properties for therapeutic purposes. PNAs can bind to DNA and RNA in a complementary way and act as antisense molecules. As with other antisense techniques, such as RNAi, the choice of sequence is important when using PNAs as antisense molecules. A PNA is relatively short, about 11-mer, in order to secure proper transportation into bacteria. This short length increases the probability of off-target effects, that is, the knocking down of non-target genes. We have developed Opti-PNA Finder, a computational tool that finds the sequence with the smallest off-target effect at the lowest PNA concentration. We used the model described by Ratilainen et al. [Biophysical Journal 81 (2001) 2876]. Sequences derived from the target sequence were compared with the whole E. coli genome. Exact-match sequences as well as one- and two-mismatch sequences were recorded, and their binding constants with PNA were then evaluated. Finally, the off-target effects and the concentration of PNA required to knock down a gene were evaluated using the model of Ratilainen et al.
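The genome comparison step described above (recording exact matches and one- and two-mismatch sites) can be illustrated with a simple sliding-window scan. This is a minimal sketch under stated assumptions, not the Opti-PNA Finder code: the function names are hypothetical, only one strand is scanned, the candidate is given as the site sequence as it appears on that strand, and the binding-constant model of Ratilainen et al. is not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_sites_by_mismatch(genome, site, max_mismatches=2):
    """Count genome windows that match `site` with 0..max_mismatches mismatches.

    Returns a dict mapping mismatch count -> number of windows.
    """
    counts = {k: 0 for k in range(max_mismatches + 1)}
    m = len(site)
    for start in range(len(genome) - m + 1):
        d = hamming(genome[start:start + m], site)
        if d <= max_mismatches:
            counts[d] += 1
    return counts


toy_genome = "ACGTACGTTCGT"  # toy string, not real E. coli sequence
print(count_sites_by_mismatch(toy_genome, "ACGT"))
```

One of the exact matches found in a real scan is the intended target itself; the remaining exact and near-exact sites are the off-target candidates whose binding would then be evaluated.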
This work was supported by grant No. RTI04-03-05 from the Regional Technology Innovation Program of the Korean Ministry of Commerce, Industry and Energy (MOCIE).
Using Pathway Logic to Integrate Signal Transduction and Gene Expression Data [Other]
Linda Briesemeister (1), Joe Gray (2), Laura Heiser (2), Merrill Knapp (1), Keith Laderoute (1), Andy Poggio (1), Paul Spellman (1), Carolyn Talcott (1)
(1) SRI International, (2) Lawrence Berkeley National Laboratory
Corresponding author: email@example.com
Pathway Logic (http://pl.csl.sri.com/) is an approach to the modeling and analysis of molecular and cellular processes based on rewriting logic. A Pathway Logic knowledge base includes data types representing cellular components such as proteins, small molecules, complexes, compartments/locations, protein state, and post-translational modifications. Rewrite rules describe the behavior of proteins and other components depending on modification state and biological context. Each rule represents a step in a biological process such as metabolism or intra/inter-cellular signaling. A collection of such facts forms a formal knowledge base. Logical inference and analysis techniques are used for simulation to study possible ways a system could evolve, to assemble pathways as answers to queries, and to reason about dynamic assembly of complexes, cascading transmission of signals, feedback loops, cross talk between subsystems, and larger pathways.
The poster will illustrate the use of the Pathway Logic knowledge base in combination with statistical methods to analyze gene expression data obtained from a collection of breast cancer cell lines. From the gene expression data for a cell line, a network of potentially reachable signaling reactions (rules) is extracted from the knowledge base. These networks of rules are clustered to find small signaling modules whose elements appear together or are absent together in the networks for the different cell lines. In addition, the subnet of the Egf signaling pathway controlled by the genes whose presence/absence differs across the cell lines will be shown.
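The idea of extracting the rules that are potentially reachable from the components present in a cell line can be illustrated by simple forward chaining. This is only a sketch of the general idea and does not reflect how Pathway Logic itself is implemented; the rule names, component names other than Egf, and the function reachable_rules are hypothetical.

```python
def reachable_rules(rules, initial_components):
    """Forward-chain over rules until no new rule can fire.

    rules: dict mapping rule name -> (set of inputs, set of outputs).
    Returns (list of fired rule names in firing order, set of components).
    """
    present = set(initial_components)
    fired = []
    changed = True
    while changed:
        changed = False
        for name, (inputs, outputs) in rules.items():
            # A rule is reachable once all of its inputs are available.
            if name not in fired and set(inputs) <= present:
                fired.append(name)
                present |= set(outputs)
                changed = True
    return fired, present


# Hypothetical toy knowledge base; only "Egf" is a name taken from the abstract.
toy_rules = {
    "r1": ({"Egf", "EgfR"}, {"EgfR-act"}),
    "r2": ({"EgfR-act", "Grb2"}, {"Grb2-bound"}),
    "r3": ({"NotExpressed"}, {"Other"}),
}
fired, components = reachable_rules(toy_rules, {"Egf", "EgfR", "Grb2"})
print(fired)
```

Running this once per cell line, with the initial components chosen from that line's expression data, would give one rule network per line; those networks are the objects that are then clustered.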
Prediction of catalytic residues in proteins using machine-learning techniques [Predictive Methods]
Natalia V. Petrova, Cathy H. Wu
Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, Washington DC, 20007
Corresponding author: firstname.lastname@example.org
The gap between the number of proteins with experimentally characterized and unknown function is growing exponentially each year. This necessitates the development of new computational methods for functional prediction. Knowledge of the location of catalytic residues provides a valuable insight into protein function. Although computational methods to predict active sites are rapidly developing, their accuracy remains low (60 - 70%) with a significant number of false positives.
We present a novel method for the prediction of catalytic sites, using a machine-learning approach, and analyze the results in a case study of a large evolutionarily diverse group of proteins – a/b hydrolases. We used a dataset of 79 enzymes with experimentally identified catalytic sites from the CATRES database as our benchmarking dataset. Each residue of the benchmarking dataset was represented by a set of 24 residue properties.
Our tasks were to determine the best performing algorithm among 26 machine-learning techniques currently built in WEKA, a JAVA-software package, and to select an optimal subset of features using our benchmarking dataset and an attribute selection algorithm (Wrapper Subset Selection). In the 10-fold cross-validation analysis, the best result was achieved with a support vector machine (SVM) algorithm and 7 out of 24 attributes.
For all 17 enzymes from the case study, the method correctly predicted the catalytic triad, with 3 false positives (1.06%) out of 282 residues on average.
Our method can identify catalytic residues for proteins with known structure but unknown function with an accuracy of at least 86%.
DISTINCTIVE COMPOSITIONAL FEATURES FOR EXTRACELLULAR LEVANSUCRASES OF PROTEOBACTERIA [Predictive Methods]
INARA ANDERSONE, MARTINA BALTKALNE and PETERIS ZIKMANIS
Institute of Microbiology & Biotechnology, University of Latvia, Kronvalda blvd. 4, LV-1586, Riga, Latvia
Corresponding author: email@example.com
Our recent studies (Zikmanis et al., 2006) revealed definite frequencies of amino acid residues (R, E, G, I, M, P, S, Y, V), together with selected propensities for β-sheets and the polarity of C-terminal fragments, as independent sets of strong predictor variables to discriminate between the sequences of annotated type I and type III secreted proteins of proteobacteria. Here we applied the corresponding discriminant functions to 12 sequences (Swiss-Prot/TrEMBL) of levansucrases (EC 2.4.1.10) that are released in the absence of a cleavable signal peptide motif from diverse species of α- and γ-proteobacteria by a hitherto undefined type (I or III) of secretion.
The results of discriminant analysis pointed to the levansucrases of Zymomonas mobilis, Acetobacter xylinus, Rahnella aquatilis, Erwinia amylovora, Sphingomonas chungbukensis and Pseudomonas syringae (pv. phaseolicola, glycinea, tomato, syringae) as being completely concordant with the group of annotated type I secreted proteins of proteobacteria. In turn, the levansucrases of Ps. aurantiaca, Novosphingobium aromaticivorans, and Erythrobacter litoralis were attributed to the separate group of type III secreted proteins. These group memberships were further supported as most probable by comprehensive comparisons between more extended sequence attributes of annotated extracellular proteins and levansucrases of proteobacteria.
Reference: P. Zikmanis, I. Andersone, M. Baltkalne. Discriminative features of type I and type III secreted proteins from Gram-negative bacteria. Central Eur. J. Biol., 2006, v. 1 (in press).
DomainDiscovery: A Novel Algorithm for Protein Domain Boundary Assignment Using Support Vector Machine [Predictive Methods]
Abdur R. Sikder, Stella Veretnik, Albert Y. Zomaya and Philip E. Bourne
Advanced Networks Research Group, School of Information Technologies, University of Sydney, NSW 2006, Australia, San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA, Department of Pharmacology,
Corresponding author: firstname.lastname@example.org
Knowledge of protein domain boundaries is critical for the characterisation and understanding of protein function, specifically in the post-genomic era. The ability to identify structural domains without knowledge of the structure, using sequence information only, is an essential step in many types of protein analysis. We present a novel method for domain identification from sequence-based information. DomainDiscovery uses a Support Vector Machine (SVM) approach and a unique training dataset built on the principle of consensus among experts of protein structure.
The DomainDiscovery method is tested and compared with others on a structurally non-redundant dataset, as well as on CASP5 targets. DomainDiscovery achieves above 52% accuracy in domain boundary identification for multi-domain protein chains from sequence information. DomainDiscovery is a machine learning approach to domain boundary prediction. We trained a Support Vector Machine (SVM) using PSSM (Position Specific Scoring Matrix), secondary structure, and solvent accessibility information to detect possible domain boundaries for a target sequence.
We have presented a new protein domain boundary prediction method, DomainDiscovery, based on a support vector machine (SVM) and training with structurally defined domains based on consensus among experts.
In six-fold cross-validation using the Benchmark_2 dataset, we achieve 53% accuracy for data that include single-domain and multi-domain chains. The performance of DomainDiscovery is comparable to or better than that of other recent sequence-based methods, particularly with regard to its performance on multi-domain chains.
Additionally, a new evaluation method, Precision of Boundary Placement (PBP) is introduced and applied.
On the Prediction of Biomineralization Proteins in the Absence of Sequence Homologies [Predictive Methods]
Xiaoyu Zhang and Betsy Read
California State University San Marcos
Corresponding author: email@example.com
The major goal of the bioinformatics research at Cal State San Marcos is to identify gene and protein sequences of Coccolithophorids involved in biomineralization. However, most known biomineralization genes lack sequence homology, which makes it more challenging to identify such genes. On the other hand, known biomineralization proteins share some common biochemical and biophysical characteristics: relatively small molecular weights, high acidity, repeat sequences, little or no secondary structure, propensity for trans-membrane helices, and high percentages of highly acidic amino acids (Asp, Glu, and Ser) in protein composition, etc. These features can be applied to predict candidate biomineralization proteins. We have developed a prediction scheme based on a statistical profile of features of known biomineralization proteins compared to randomly selected counterparts. Features in the predictor are combined, and weights for the features are computed so as to minimize the relative entropy of wrong predictions in the training set. Given a new gene sequence, its features are calculated using tools from the ExPASy Proteomics Server. A biomineralization protein probability index is then computed by the predictor as a number between 0 and 1. We have tested the predictor on assembled EST and cDNA sequences of Emiliania huxleyi (E. Hux) and ranked them by their probability indices. The results reveal some interesting candidates whose involvement in biomineralization is currently being examined using microarray analysis and in vitro biomineralization studies.
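One of the composition features listed above, the percentage of Asp, Glu, and Ser residues, is straightforward to compute from a protein sequence. The sketch below shows that single feature only; the function name and the toy sequence are hypothetical, and the weighting scheme that combines features into the probability index is not reproduced here.

```python
def residue_fraction(protein, residues="DES"):
    """Fraction of positions in `protein` occupied by any of `residues`.

    Uses one-letter amino acid codes: D = Asp, E = Glu, S = Ser.
    """
    protein = protein.upper()
    if not protein:
        return 0.0
    hits = sum(protein.count(r) for r in residues)
    return hits / len(protein)


toy_protein = "DDEESSAAAA"  # toy sequence, not a real biomineralization protein
print(residue_fraction(toy_protein))
```

A feature such as this would be computed for both the known biomineralization proteins and the randomly selected counterparts, so that the two distributions can be compared when building the statistical profile.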
Semi-Supervised Learning for Protein Subcellular Localization Prediction [Predictive Methods]
Leon French, Martin Ester, Fiona Brinkman
Simon Fraser University
Corresponding author: firstname.lastname@example.org
The protein localization problem, which consists of determining the subcellular localization of a protein sequence, is very important and challenging. The majority of past work has focused on feature extraction, ranging from amino acid frequencies and frequent subsequences to annotations. Results from PSORTb v.2.0 (Gardy et al., 2005), the most precise method in the field, have been computed for each protein of many bacterial genomes and stored in the computational PSORT database (cPSORTdb, Rey et al., 2005). In this report we apply a semi-supervised learning approach by using cPSORTdb to extend the training dataset of a support vector machine. This approach makes use of the large number of sequences with localizations that have not been experimentally verified. The method is based on the co-training semi-supervised learning strategy, in which one classifier labels sequences for a second (Blum and Mitchell, 1998). Several database selection criteria were tested for precision and recall. We show that the additional training sequences increase the recall accuracy of the PSORTb v.2.0 SVM modules from 10% to 30%, while keeping a high level of precision.
A statistical approach using network structure in the prediction of protein characteristics [Predictive Methods]
Pao-Yang Chen, Charlotte M. Deane, Gesine Reinert
Department of Statistics, Oxford University
Corresponding author: email@example.com
Protein structure and function are two characteristics of proteins that are known to affect protein-protein interactions. Computational approaches have been proposed which use protein-protein interactions to predict structure or function. We propose a statistical approach based on lines and triangles. It predicts protein structure or function by analysing the network structure of each protein through counting the pattern frequencies of lines and triangles. The results show high accuracy in the prediction of structures
Our model is the first one to include network structure in protein-protein interaction prediction. Furthermore, we included additional biological information in the model. The results show not only highly accurate prediction, but also that the integration of biological information are
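As a rough illustration of the network patterns mentioned above, the sketch below counts, for each protein in an interaction network, the lines (interactions) it takes part in and the triangles (three mutually interacting proteins) that pass through it. This is the simplest reading of "lines and triangles" and an assumption on our part; the abstract counts pattern frequencies in a way that is not fully specified here, and the function name and toy network are hypothetical.

```python
from itertools import combinations


def line_and_triangle_counts(edges):
    """Map each node to (number of lines, number of triangles) it is part of.

    edges: iterable of (a, b) pairs describing an undirected network.
    """
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    counts = {}
    for node, nbrs in neighbours.items():
        lines = len(nbrs)
        # A triangle through `node` is a pair of its neighbours that also interact.
        triangles = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in neighbours[u]
        )
        counts[node] = (lines, triangles)
    return counts


toy_network = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(line_and_triangle_counts(toy_network))
```

In a prediction setting, counts like these would be tallied separately for each combination of characteristics (for example, structural classes) carried by the proteins at the ends of a line or the corners of a triangle.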
Classification and Prediction of Antisense Oligonucleotide Efficiency using global structure information with support vector machines [Predictive Methods]
Roger Craig and Li Liao
University of Delaware
Corresponding author: firstname.lastname@example.org
Designing antisense oligonucleotides with high efficiency is of great interest both for its usefulness in the study of gene regulation and for its potential therapeutic effects. The high cost of experimental approaches has motivated the development of computational methods. Essentially, these computational methods rely on various sequential and structural features to differentiate high-efficiency antisense oligonucleotides from low-efficiency ones. So far, however, most of the features used are either local motifs in sequences or secondary structures, or global attributes such as compositional frequencies. We propose a novel approach to profiling antisense oligonucleotides and the target RNA to reflect global structural features such as hairpins. Such profiles are then used for classification and prediction of high-efficiency oligonucleotides using support vector machines. The classification and prediction were carried out on a set of 348. The performance was evaluated using ROC scores on multiple runs of cross-validation experiments. The ROC scores show that our method significantly improves the prediction accuracy compared to similar methods that use only local features. To help further pinpoint the features responsible for high (or low) efficiency, we use profiles on oligos and RNA target sites separately, and also in concatenation. The results indicate that information on global structure helps classification and prediction of high-activity oligonucleotides.
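The ROC score used for evaluation is commonly taken to be the area under the ROC curve, which equals the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The sketch below computes it by direct pair counting; it assumes this standard definition rather than the authors' exact evaluation code, and the function name and toy values are hypothetical.

```python
def roc_score(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5).

    scores: classifier outputs, higher meaning "more likely high efficiency".
    labels: truthy for positive examples, falsy for negative ones.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_scores = [0.9, 0.8, 0.4, 0.3]  # toy SVM outputs
toy_labels = [1, 0, 1, 0]          # 1 = high efficiency, 0 = low efficiency
print(roc_score(toy_scores, toy_labels))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the measure is convenient for comparing feature sets across cross-validation runs.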
Prediction of disulfide patterns from protein sequences [Predictive Methods]
Yu-Ching Chen and Jenn-Kang Hwang
Institute of Bioinformatics, National Chiao Tung University, Hsinchu 30050, Taiwan
Corresponding author: email@example.com
Disulfide bonds play important structural roles in both stabilizing protein conformations and regulating protein functions. The ability to infer disulfide patterns directly from protein sequences will provide a valuable tool to biologists investigating the structure-function relationship of proteins. However, the prediction of disulfide connectivity from protein sequences presents a major challenge to computational biologists due to the nonlocal nature of disulfide connectivity in terms of linear sequence, i.e., spatial proximity of a cysteine pair does not necessarily imply sequential closeness. In this report we treat each distinct disulfide pattern as a distinct class and solve the problem as a multi-class classification problem. We use support vector machines based on sequence features such as the coupling between the local sequence environments of cysteine pairs, the sequence separations of the cysteines, and global sequence descriptors such as amino acid content. Our approach is able to predict 55% of the disulfide patterns of proteins with two to five disulfide bridges.
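To give a sense of how many classes the multi-class formulation involves, the sketch below counts and enumerates the possible disulfide patterns for a protein in which all 2n cysteines are paired into n bridges. The number of ways to pair 2n items is (2n-1)!! = 1 x 3 x ... x (2n-1). The assumption that every cysteine is bonded, and the function names, are ours.

```python
def count_disulfide_patterns(n_bridges):
    """Number of ways to pair 2 * n_bridges cysteines: (2n - 1)!!."""
    count = 1
    for k in range(1, n_bridges + 1):
        count *= 2 * k - 1
    return count


def enumerate_patterns(cysteines):
    """List every pairing of an even-length list of cysteine positions."""
    if not cysteines:
        return [[]]
    first, rest = cysteines[0], cysteines[1:]
    patterns = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub_pattern in enumerate_patterns(remaining):
            patterns.append([(first, partner)] + sub_pattern)
    return patterns


for n in range(2, 6):
    print(n, "bridges:", count_disulfide_patterns(n), "patterns")
```

The rapid growth in the number of possible patterns, from 3 for two bridges to 945 for five, is one reason the classification becomes harder as the number of bridges increases.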
A Base Stacking Energetics Centered Dynamic Programming Algorithm for the Computational Prediction of Thermodynamic Interaction Potentials between Nucleic Acid Sequences [Predictive Methods]
Yuan Lin, Jeffrey A. Manthey, Andrew S. Peek
Integrated DNA Technologies, Inc
Corresponding author: firstname.lastname@example.org
Predicting the hybridization energetics between two non-complementary nucleic acids is a very important part of designing sequences for various functions, including PCR primers, microarray probes and RNA interference. We have developed an algorithm for finding stable interaction energies based on a) continuous segments of base pairing (nodes) and b) unpaired connections between nodes (edges). A dynamic programming algorithm can then be used to predict the minimally stable structural conformation of nucleic acid sequences, as a model composed of nodes and edges. The algorithm is base-stacking centric, rooted in the finding that stable (negative) free energies are mostly contributed by base stacking at nodes; once nodes are determined, the connection between nodes, through the generally destabilizing energetics of edges, can proceed. Theoretically, the algorithm has order O((mn)^2) for predicting the energetics of two sequences of length m and n. However, simulation tests showed the average computation time to be proportional to O((mn)^1.5). For comparison, 13,615 sequence pairs were used to predict the interaction thermodynamics from both the DNAMelt software and the present algorithm. DNAMelt and the present algorithm resulted in identical thermodynamic predictions for 90% of these pairs, and of the remaining disparities, 1,308 of the 1,314 cases resulted in the prediction from the present algorithm being more stable than that of DNAMelt, suggesting that the present algorithm finds overall stable regions in structure space. The software source code is implemented in C and available under the Creative Commons License, and a SOAP XML web service is available for network utilization.
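The first stage described above, locating nodes as continuous segments of base pairing between two strands, can be sketched as a search for maximal diagonal runs in a pairing matrix. This is an illustration only, not the authors' C implementation: it considers Watson-Crick DNA pairs only, assumes both sequences are written 5' to 3' and pair in antiparallel orientation, and assigns no stacking or edge energies. The function name, the min_len parameter, and the toy sequences are hypothetical.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def find_nodes(seq_a, seq_b, min_len=2):
    """Find maximal runs of consecutive base pairs between two strands.

    Both sequences are given 5'->3'. seq_b is reversed so that a run along a
    diagonal of the pairing matrix is an antiparallel duplex segment.
    Returns a list of (start_in_a, start_in_reversed_b, length).
    """
    rev_b = seq_b[::-1]
    m, n = len(seq_a), len(rev_b)
    nodes = []
    prev = [0] * (n + 1)  # prev[k + 1]: run length ending at (i - 1, k)
    for i in range(m):
        cur = [0] * (n + 1)
        for k in range(n):
            if (seq_a[i], rev_b[k]) in WATSON_CRICK:
                cur[k + 1] = prev[k] + 1
                # The run is maximal if it cannot be extended one more step.
                at_end = (
                    i + 1 == m
                    or k + 1 == n
                    or (seq_a[i + 1], rev_b[k + 1]) not in WATSON_CRICK
                )
                if at_end and cur[k + 1] >= min_len:
                    length = cur[k + 1]
                    nodes.append((i - length + 1, k - length + 1, length))
        prev = cur
    return nodes


print(find_nodes("GGAACC", "TTCC"))
```

This scan visits each of the m x n cells once; the dynamic programming over combinations of nodes and connecting edges is the more expensive stage, and it is not sketched here.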
Modelling eukaryotic promoters with an evolving HMM method [Predictive Methods]
Kyoung Jae Won, Troels Torben Marstrand, Anders Krogh
Bioinformatics Centre, University of Copenhagen
Corresponding author: email@example.com
We present a method to characterise a set of sequences by their recurrent motifs using hidden Markov models (HMMs). Motifs often co-occur, but the pattern of occurrence is not deterministic in biological sequences. In higher eukaryotes especially, the transcription factor binding sites vary greatly in the number of sites and the distances between the sites. To search for an HMM structure which represents the probabilistic distributions of motifs, we used Genetic Algorithms (GA).
An evolving HMM method is designed to search for an HMM structure automatically for complex data. We used a pool of position-specific scoring matrices (PSSMs), a number of which are selected to build an HMM. The selected PSSMs are represented with HMM blocks composed of HMM states. By linking those blocks with each other and with other HMM blocks, the proposed method constructs a complete HMM structure for the given sequences. We hybridised GA with the Baum-Welch algorithm to train the HMM structure as well as the HMM parameters. Mutation and crossover operators were designed to explore the space of topologies.
The proposed method is used to model the muscle-specific regions of the human and mouse genomes. This shows how the Block-HMM discriminates the suitable PSSMs for the given sequences. The method is also applied to model the core promoter in the Drosophila genome. We used 10 PSSMs that characterise the core promoter sequences. The resulting HMM structure is applied to the promoter recognition problem in the Drosophila genome.
Prediction of Protein Subcellular Localization [Predictive Methods]
Chin-Sheng Yu and Jenn-Kang Hwang
Department of Biological Science & Technology, National Chiao Tung University, Hsinchu, Taiwan
Corresponding author: firstname.lastname@example.org
Recent years have seen a surging interest in developing computational approaches to predict subcellular localization. These methods, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. Here, we developed an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vector derived from sequences; the second-level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches on two often-used benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. We found that the homology search approach performs surprisingly well for sequence identity as low as 25%, but its performance deteriorates considerably for lower sequence identity. Data sets with high homology levels will obviously lead to biased assessment of the performance of predictive approaches, especially those relying on homology search or sequence annotations. Since our two-level classification system based on SVM does not rely on homology search, its performance remains relatively unaffected by sequence homology. Furthermore, we also develop a practical hybrid method that pipelines the two-level SVM classifier and the homology search method in sequential order as a general tool for the sequence annotation of subcellular localization.
Recommending pathway genes using a compendium of clustering solutions [Predictive Methods]
David M. Ng, Marcos H. Woehrmann, and Josh Stuart
Department of Biomolecular Engineering, University of California, Santa Cruz
Corresponding author: email@example.com
Motivation. A common approach for identifying pathways from gene expression data is to cluster the genes, which often finds only the dominant coexpression groups without using prior information about a pathway. Recommender systems are well suited for using the known genes of a pathway to identify the appropriate experiments to use for predicting new members. However, existing systems, such as the GeneRecommender, do not implement a true collaborative filtering approach because they ignore how genes naturally group together within specific experiments.
Methods. Our approach uses the pattern of how genes cluster together in different experiments to recommend new genes in a pathway. We identify clusters within a single experiment series. We then scan for informative clusters where the user-supplied query genes are co-clustered significantly. Finally, we identify new genes to recommend as those that are clustered in a significant fraction of the informative clusters identified in the previous step.
Results. We implemented a prototype of our system and assessed its accuracy on three positive control pathways: genes encoding proteins participating in the ribosome, the proteasome, and the cell cycle. As expected, the areas under the recall-precision curves are significantly better than those of negative controls.
Conclusions. Our recommendation system, which uses clusters analogously to e-commerce shopping carts, efficiently and accurately predicts gene pathways in our small pilot study. Future work includes testing our system on more pathways, improving the algorithm (for example, down-weighting genes that appear in clusters with a diverse set of other genes), and extending the method to additional species.
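The three steps in the Methods paragraph above can be sketched as follows. This is a simplified illustration, not the authors' prototype: clusters are assumed to have been computed already, a plain count of query genes stands in for the significance test on co-clustering, and a fixed fraction stands in for the "significant fraction" criterion. The function name, both thresholds, and the gene names are hypothetical.

```python
def recommend_genes(clusters, query, min_query_hits=2, min_fraction=0.5):
    """Recommend genes that co-cluster with the query genes.

    clusters: list of gene collections, pooled over experiment series.
    query: user-supplied pathway genes.
    """
    query = set(query)
    # Step 2: keep "informative" clusters, i.e. those holding enough query genes.
    informative = [
        set(c) for c in clusters if len(query & set(c)) >= min_query_hits
    ]
    if not informative:
        return []
    # Step 3: tally how often each non-query gene appears in informative clusters.
    tally = {}
    for cluster in informative:
        for gene in cluster - query:
            tally[gene] = tally.get(gene, 0) + 1
    threshold = min_fraction * len(informative)
    return sorted(g for g, n in tally.items() if n >= threshold)


toy_clusters = [
    {"RPL1", "RPL2", "RPL3"},
    {"RPL1", "RPL2", "RPL3", "X"},
    {"RPL1", "Y"},
    {"A", "B"},
]
print(recommend_genes(toy_clusters, {"RPL1", "RPL2"}, min_fraction=0.75))
```

The analogy to shopping carts is that each cluster plays the role of a cart: genes that repeatedly share a cart with the query genes are recommended, regardless of which experiment series the cart came from.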
The Evolutionary Origins of Phosphorylation [Predictive Methods]
Samuel M. Pearlman, Zach Serber, James E. Ferrell, Jr.
Stanford University Departments of Bioinformatics, Molecular Pharmacology, and Molecular Pharmacology
Corresponding author: firstname.lastname@example.org
Molecular biologists can sometimes mimic the phosphorylated state of a protein by substituting an acidic residue (aspartate or glutamate) for the phosphosite, a seemingly fortuitous trick that works despite differences in side-chain geometry and charge. To test whether nature might employ the same trick, we built a database of proteins with several thousand experimentally verified phosphorylation sites and BLASTed each of the proteins against SwissProt. We then generated high-quality multiple sequence alignments around each phosphosite and counted the number of times every serine, threonine, or tyrosine was replaced by each of the other amino acids. Interestingly, we observe enrichment in the replacement of phosphoserines and phosphothreonines by aspartate or glutamate. Additional phylogenetic and structural analyses further support our hypothesis that phosphorylation may have first evolved to conditionally mimic negatively charged amino acids in crucial locations and conferred an advantage by converting a protein that was previously regulated only by synthesis and destruction into one that was switchable by a fast, reversible enzymatic reaction.
Most current methods for predicting phosphosites make exclusive use of primary sequence data to determine phosphorylation motifs, generating many false positives. By incorporating hitherto ignored evolutionary information, we intend to predict and experimentally verify phosphorylation sites in the human ORFeome. FLAG-tagged proteins containing serines and threonines with replacement profiles similar to those observed at known phosphosites are being transfected into HeLa cells, immunoprecipitated, and analyzed by mass spectrometry. Our prediction algorithm will simultaneously test our evolutionary hypothesis experimentally and provide an extremely useful tool for uncovering novel phosphorylation sites.
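The replacement counting described above can be illustrated with a minimal Python sketch. It assumes the multiple sequence alignment is already available as equal-length strings with the phosphoprotein first and "-" for gaps; the function names `replacement_counts` and `acidic_fraction` and the input layout are hypothetical, and the sketch makes no attempt at the statistical enrichment test.

```python
from collections import Counter


def replacement_counts(alignment, site_columns, residues="STY"):
    """Tally residues aligned to S/T/Y sites of the reference sequence.

    alignment:    equal-length aligned sequences; the first one is the
                  reference protein that carries the phosphosites.
    site_columns: 0-based alignment columns of the phosphosites.
    Returns {reference residue: Counter of replacing residues}.
    """
    reference, homologs = alignment[0], alignment[1:]
    counts = {}
    for col in site_columns:
        ref_res = reference[col]
        if ref_res not in residues:
            continue  # not a serine, threonine, or tyrosine
        tally = counts.setdefault(ref_res, Counter())
        for seq in homologs:
            res = seq[col]
            # Gaps and conserved residues are not replacements.
            if res != "-" and res != ref_res:
                tally[res] += 1
    return counts


def acidic_fraction(tally):
    """Fraction of replacements that are aspartate (D) or glutamate (E)."""
    total = sum(tally.values())
    return (tally["D"] + tally["E"]) / total if total else 0.0
```

Comparing `acidic_fraction` at known phosphosites with the same quantity at non-phosphorylated serines and threonines would be one simple way to look for the enrichment the abstract reports.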
Modeling Protein Complexes Using Comparative Patch Analysis [Proteomics]
Dmitry Korkin, Fred P. Davis, Frank Alber, and Andrej Sali
Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA, U.S.A.
Corresponding author: email@example.com
We present comparative patch analysis for modeling the structures of multi-domain proteins and protein complexes. Comparative patch analysis is a hybrid of comparative modeling based on a template complex and protein docking, with a greater applicability than comparative modeling and a higher accuracy than docking. It relies on structurally defined interactions of each of the complex components or their homologs with any other subunit, irrespective of its fold. For each component, its known binding modes with other subunits of any fold are collected and expanded by the known binding modes of its homologs. These modes are then used to restrain conventional molecular docking, resulting in a set of binary domain complexes that are subsequently ranked by geometric complementarity and a statistical potential. The method is evaluated by predicting 20 binary complexes of known structure. It is able to correctly identify the binding mode in 70% of complexes, predicting the overall structure with an average improvement in all-atom RMS error of 13.4 Å, compared to protein docking which identifies the binding mode in 30% of the complexes. We apply comparative patch analysis to model the core fragment of the PSD-95 protein, whose structure is unknown. As a result, we predict two alternate configurations that, we suggest, correspond to the active and inactive forms of PSD-95. In general, we expect that comparative patch analysis will provide useful spatial restraints for the structural characterization of an increasing number of binary and higher order protein complexes.
Mathematical Models and Extensions of Ramachandran Plots [Proteomics]
Kiran B. Chilakamarri, Nathaniel Dean, and Charu Malhotra
Dept. of Mathematical Sciences, Texas Southern University
Corresponding author: firstname.lastname@example.org
To understand the 3-dimensional structure of proteins it is necessary to first understand the conformational constraints that must be satisfied by their primary constituents, the amino acids. Because of the planarity of the peptide bond, there are only two degrees of freedom, namely, the rotation of an amide plane about the bond linking the α-carbon Cα and the carbon C of the peptide bond, and the rotation of the adjacent amide plane about the bond linking the same α-carbon and the nitrogen N in the adjacent amide plane. The angle φ of rotation about the Cα-N bond and the angle ψ of rotation about the Cα-C bond are used to construct the usual Ramachandran plot. When viewed as a contour map it shows the areas of relative stability. We construct a simple 3-dimensional geometric model to provide a better understanding of Ramachandran plots and the stability of conformations of amino acids.
Secondly, in the amino acid side chain, three amide planes are shared by any two consecutive residues. Letting t and s be the angles between the consecutive pairs of amide planes we have constructed a (t, s) plot similar to that of the usual (ψ,φ) Ramachandran plot. Thus, we have a single diagram for the tight turns, and we have a mathematical model for the construction of this diagram. Finally, we have also developed the same diagram empirically using data from the Protein Data Bank (PDB).
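For readers who want to reproduce the usual (ψ, φ) plot from coordinates, the two angles are ordinary dihedral angles over four consecutive backbone atoms. The sketch below uses the standard atan2 formulation; the function names are illustrative, and it does not implement the authors' geometric model or their (t, s) plot.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    Returns 0 for a cis (eclipsed) arrangement and +/-180 for trans.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone angles of one residue from five atom positions.

    phi is the dihedral C(i-1)-N-CA-C and psi is N-CA-C-N(i+1).
    """
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)
```

Applying `phi_psi` to every residue of a PDB entry and plotting the resulting pairs gives the empirical Ramachandran scatter against which a geometric model can be compared.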
An Assessment of Phylogenetic Capacity for Bacterial Levansucrases [Sequence Comparison]
Martina Baltkalne, Ieva Minicha, Inara Andersone and Peteris Zikmanis
Institute of Microbiology & Biotechnology, University of Latvia, Riga, LV 1586, Latvia
Corresponding author: email@example.com
This study presents estimates of congruence between the distance matrix derived from the 16S rDNA-based phylogenetic tree and those derived from bacterial extracellular levansucrases of 20 diverse species. It was shown that full amino acid sequences of levansucrases, as well as their N- and C-terminal fragments (50 amino acids), contain a significant phylogenetic signal, as confirmed by the congruence of the corresponding matrices with the 16S rDNA distance matrix (34.8%, 20.3% and 16%, respectively), and provide a reasonable approximation of phylogenetic trees.
Matrices derived from both terminal fragments of levansucrases displayed a significant increase in congruence (14.8%-78.2%) with the matrix of full sequences in proportion to fragment length (5-50 amino acids).
The positional variability of amino acids was shown to be a determinant of congruence by phylogenetic comparisons of original and randomized fragments of invariable composition (Shannon’s entropies, Kullback-Leibler distances).
The phylogenetic capacity of restricted fragments (50 amino acids) to approximate the full levansucrases did not depend on their position within the sequences. The same position of the fragment within the protein sequence for each organism was found to be a prerequisite for high levels (63.5%-78.6%) of congruence between distance matrices. The observed sharp decline in congruence for the matrix derived from fragments of randomly chosen position for each organism suggests the presence of quasi-periodic repeats in bacterial levansucrases and supports the view of a possible modular arrangement of functional protein sequences.
A Web-Based Tool for Protein Thermostability [Sequence Comparison]
Chia-Mao Huang, Ming-Tat Ko, Jenn-Kang Hwang
Corresponding author: firstname.lastname@example.org
When no homologous gene from a thermophilic organism is available and structural data are lacking, knowledge-based tools for improving thermostability from sequence currently cannot be applied. Most industrial mesophilic enzymes are made thermophilic by random mutagenesis. We propose a web-based tool that uses knowledge-based methods to suggest more thermostable sequence variants of an enzyme with the same function; it can be applied to any enzyme, even when no thermophilic homolog is known. Using the optimal growth temperature (OGT), we separate thermophilic genomes from mesophilic ones, compute the temperature-sensitive amino acid composition (J. Biochem. 133, 507, 2003) and the amino acid-coupling preference (Proteins 59, 58, 2005), and combine them into propensity profiles. Users only need to supply a protein sequence in FASTA format to quickly obtain thermostability preferences based on thermophile genomes. Users who plan to mutate an enzyme by site-directed mutagenesis obtain candidate combinations of thermostabilizing substitutions. The interface supports detailed analysis of the protein sequence and remains usable when structural information is lacking. When structural information is available, it also reduces the number of combinations that need to be analyzed.
(The test version is available at http://220.127.116.11/TPS/)
Key words: amino acid-coupling patterns; thermophilic proteins; mesophilic proteins; thermostability
Scoring Alignment Gaps in the Twilight Zone [Sequence Comparison]
Barbara S. Chapman, Ph.D
Interdisciplinary Studies, Sonoma State University
Corresponding author: email@example.com
Genome projects produce thousands of new protein sequences. Many are orphans with no sufficiently close relatives identifiable by sequence alignment methods. They remain "hypothetical" until 3D structural models are built. However, sequence alignment methods are unreliable for modeling. The question addressed here: Can sequence alignments be improved? Sequence alignment algorithms model two kinds of evolutionary occurrences: amino acid replacements (substitutions) and insertion-deletion events (indels). Whereas good substitution models exist (e.g., the BLOSUM matrices) for amino acid replacements, the traditional affine gap penalty inadequately models indels by assessing a fixed cost to open a gap and a fixed cost to extend the gap (add more unmatched residues). Arbitrary gap costs ignore structural context, the primary determinant of the probability of an indel at a point of alignment. Modeling of indels can be improved by sensing their structural context through the distinct distribution of amino acids in indels and their anchors in structural superpositions. The proposed cost function uses residue propensities calculated from structural assignments for all 20 amino acids, with separate tables for anchors and insertions. The resulting gap cost function improves sensitivity in detecting homologs having <35% identical residues (the twilight zone), doubles raw scores, decreases alignment lengths for false pairs, and dramatically reduces alignment errors relative to structural superposition. The structure-sensing gap cost increases the time to construct a Smith-Waterman dynamic programming cell by a constant factor, leaving the time complexity at O(nm).
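The contrast between a fixed affine penalty and a context-sensitive one can be made concrete with a small sketch. The first function is the conventional affine cost (conventions differ on whether the first gap residue also pays the extension cost; here it does). The second is purely illustrative: the name `propensity_gap_cost`, the log-propensity form, and the default constants are assumptions, not the cost function of the abstract, whose propensity tables are not given here.

```python
import math


def affine_gap_cost(length, gap_open=10.0, gap_extend=1.0):
    """Traditional affine penalty: one opening cost plus a fixed cost per
    gap residue, independent of which residues are involved."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def propensity_gap_cost(left_anchor, inserted, right_anchor,
                        anchor_propensity, insert_propensity,
                        base_open=10.0, base_extend=1.0):
    """Hypothetical context-sensitive gap cost.

    Residues with propensity > 1 (over-represented at indel anchors or
    inside insertions) make the gap cheaper; propensity < 1 makes it more
    expensive. Residues missing from a table count as neutral (1.0).
    """
    cost = base_open
    for res in (left_anchor, right_anchor):
        cost -= math.log(anchor_propensity.get(res, 1.0))
    for res in inserted:
        cost += base_extend - math.log(insert_propensity.get(res, 1.0))
    return max(cost, 0.0)
```

Because each cell of the dynamic programming matrix only needs a few table lookups to evaluate such a cost, a scheme of this kind adds a constant factor per cell, which is consistent with the O(nm) complexity stated above.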
Tools for Identification of Sequence and Structurally Conserved Environments [Sequence Comparison]
Brandon Peters1, Eunseog Youn1, Charlie Moad2, Randy Heiland2, Sean D. Mooney1
1 Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202 USA, 2 Scientific Data Analysis Lab, Pervasive Technology Labs, Indiana University, Indianapolis, IN 46202, USA
Corresponding author: firstname.lastname@example.org
We have developed a website, http://www.sblest.org, and a suite of web services that enable users to submit protein structures and identify the sequence and structurally conserved environments in the query. To do this, we integrated several sequence- and structure-based analysis tools, such as S-BLEST, PSI-BLAST, and HMMer, to identify sites that are associated with SCOP families, GO terms and EC terms. The results page shows the HMMer-predicted SCOP superfamily and a list of hits found.
These hits are annotated with a Z-score from S-BLEST, a PSI-BLAST e-value, GO annotations, EC annotations, and SCOP annotations. Each hit has a corresponding results page which shows a list of structurally significant matched residues, a Jmol window for both the query PDB and the matched PDB, and several links to other databases. In addition to the website, we have built a suite of web services for accessing this resource, and have extended UCSF Chimera and DeLano Scientific PyMOL to use it. Overall, this method enables researchers to identify structurally conserved sites.
Quantum Based Evolutionary Algorithm for Multiple Sequence Alignments [Sequence Comparison]
Hongwei Huo, Vojislav Stojkovic, and Qiuli Lin
Xidian University, School of Computer Science and Technology; Morgan State University, Computer Science Department
Corresponding author: email@example.com
Kuk-Hyun Han (KHH) was the first researcher to analyze the characteristics of quantum evolution algorithms (QEAs) and showed that a QEA can successfully solve the knapsack problem. Using KHH's work as a foundation, we redesign QEA to solve the multiple sequence alignment (MSA) problem. We present a new Quantum Based Evolution Algorithm for multiple sequence ALIGNments called QEAlign. QEAlign is implemented in the C programming language and runs on a quantum computer simulator. QEAlign uses a probabilistic representation, the qubit, that can represent a linear superposition of individual solutions, so it does not need a large population of individuals to search for the best alignment. QEAlign employs two variation operators, a quantum gate and a mutation operator, to drive individuals toward better alignments. Global migration and local migration in QEAlign automatically balance global search and local search. The probabilistic representation makes QEAlign converge quickly. The performance of QEAlign is evaluated on the standard alignment benchmark data set BAliBASE, where its results were compared with those of seven well-known multiple alignment programs, including CLUSTALX and SAGA. The sum-of-pairs score is used to evaluate the quality of an alignment for each category. The experimental results show that in many cases QEAlign is efficient.
Our results are important and directly influence several fields, including sequence assembly, sequence annotation, structural and proteins, phylogeny and evolutionary analysis.
Our future short-term research will be focused on the parameter optimization of QEAlign.
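The qubit representation and the quantum-gate operator mentioned above can be illustrated on a plain bit-string problem. The sketch below is a simplified, generic quantum-inspired loop, not QEAlign: it keeps a single individual, stores each qubit as an angle θ with P(bit = 1) = sin²θ, replaces the usual rotation lookup table with a rule that always rotates toward the best solution seen so far, and omits mutation and migration. All names and default parameters are assumptions.

```python
import math
import random


def observe(thetas, rng):
    """Collapse each qubit to a classical bit; P(bit = 1) = sin(theta)^2."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in thetas]


def rotate(thetas, candidate, best, delta=0.05 * math.pi):
    """Rotation-gate update: where the observed bit differs from the best
    solution, turn the qubit toward the best bit by a fixed angle."""
    updated = []
    for t, x, b in zip(thetas, candidate, best):
        if x != b:
            t += delta if b == 1 else -delta
        # Keep the angle in [0, pi/2] so sin^2 stays a valid probability map.
        updated.append(min(max(t, 0.0), math.pi / 2))
    return updated


def qea_maximize(fitness, n_bits, generations=200, seed=0):
    """Maximize fitness(bits) over bit lists of length n_bits."""
    rng = random.Random(seed)
    thetas = [math.pi / 4] * n_bits  # equal superposition: P(1) = 0.5
    best = observe(thetas, rng)
    for _ in range(generations):
        candidate = observe(thetas, rng)
        if fitness(candidate) > fitness(best):
            best = candidate
        thetas = rotate(thetas, candidate, best)
    return best
```

For example, `qea_maximize(sum, 8)` searches for the 8-bit string with the most ones. An alignment application would additionally need an encoding of gap positions as bits and an alignment score as the fitness, neither of which is specified in the abstract.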
DNA-Based Addition and Subtraction of Two Unsigned Integer Numbers Inspired by Unrestricted Grammars Implemented in Prolog Programming Language [Strings, Graphs, and Algorithms]
CSU, Baltimore, MD 21216
Corresponding author: E-Britto@hotmail.com
DNA computers cannot be considered important computing machines until they are used to solve complex numerical calculation problems. The challenge is to discover how DNA computers can represent numbers and perform arithmetic operations. Guarnieri, Fliss, and Bancroft (GFB) were the first researchers to find a way for a DNA computer to add two unsigned binary integer numbers. Using GFB's work as a foundation, we extended their results by connecting Unrestricted Grammars, Automata-Machines and DNAs. Our approach has accomplished the following objectives: i) transform DNA computers into numerical computing machines; ii) extend DNA computing to numerical computing.
The main contributions of our work are: i) Unrestricted Grammars to add/subtract two unsigned integer numbers; ii) DNA representation of digits of an unsigned integer number; iii) Translation Scheme for translating rules of an Unrestricted Grammar into DNA sequences; iv) DNA based addition and subtraction of two unsigned integer numbers; v) DNA representation of unsigned integer numbers.
The results are theoretically founded (Unrestricted Grammars), general (base independent), shorter (does not make difference between the last and other digits), efficient (smaller number of rules), more elegant than GFB’s, and implemented (in Prolog programming language).
The algorithms presented are not technically demanding and involve simple biochemical procedures that require just a few days' work in a DNA computing lab.
Our results directly influence several disciplines, including: i) computer science (parallel algorithms, formal languages, grammars and automata-machines, computability and complexity, parallel programming languages); ii) cryptography (key algorithms, coding-decoding algorithms); iii) computer security & information assurance.
A Novel Method for Clustering Internal Transcribed Spacer (ITS) Sequences from Fungi using Computer-simulated Restriction Enzyme Cut Patterns [Strings, Graphs, and Algorithms]
Rajib Sengupta*, Dhundy Bastola# and Hesham Ali*
*College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68102-0116, #Department of Pediatrics, University of Nebraska Medical Center, Omaha, NE 68198-6495
Corresponding author: firstname.lastname@example.org
Restriction fragment length polymorphism (RFLP) of chromosomal DNA is a powerful tool for fingerprinting microorganisms. Although the use of multiple restriction enzymes gives better fingerprints, such digestions are impractical in the laboratory. Additionally, they generate many small fragments which cannot be resolved by gel electrophoresis. To overcome this limitation, we developed a computational tool to simulate RFLP on large numbers of Internal Transcribed Spacer (ITS) regions of the rDNA gene complex from fungi. These restriction-fragment data were clustered to assess the phylogenetic resolution and evaluate the limitations of the method. Unlike previous approaches, which depend on sequence alignment (pairwise or multiple), our approach requires no alignment. This alignment-free approach uses the pairwise Longest Common Subsequences (LCS) of restriction enzyme cut pattern data for computation.
To validate our approach, an exclusive maximum gap clustering technique was applied to a smaller data set comprising the ITS sequences from the genera Aspergillus and Candida. The phylogenetic resolution measured was 83% for the smaller data set and 81% for a data set with 3000+ sequences from multiple genera. Alternatively, a similarity-based hierarchical clustering algorithm accurately grouped 91% of the fungal taxa (916 of 1049 taxa). Additionally, the accuracy of the result correlated directly with the number of restriction enzymes included in the analysis. The analyses with 217, 57 and 4 enzymes showed accuracies of 91%, 90% and 77%, respectively. An optimal algorithm to obtain the best set of restriction enzymes for analysis is currently being explored.
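A simulated digest followed by an LCS comparison of cut patterns can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the abstract: each enzyme is reduced to its recognition string and is taken to cut at the start of the site, the cut pattern is the ordered list of fragment lengths, and the similarity score normalizes the LCS length by the mean pattern length. Function names are illustrative.

```python
def cut_positions(sequence, site):
    """Start positions of every occurrence of a recognition site."""
    positions, start = [], sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, sites):
    """Ordered fragment lengths after cutting at every site occurrence.

    Simplification: the cut is placed at the start of the recognition
    site rather than at the enzyme's true cleavage offset.
    """
    cuts = sorted({p for s in sites for p in cut_positions(sequence, s)} - {0})
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pattern_similarity(seq1, seq2, sites):
    """Alignment-free similarity in [0, 1] from the LCS of cut patterns."""
    f1, f2 = fragment_lengths(seq1, sites), fragment_lengths(seq2, sites)
    return 2 * lcs_length(f1, f2) / (len(f1) + len(f2))
```

A matrix of `1 - pattern_similarity(...)` values over all sequence pairs could then be passed to any standard hierarchical clustering routine.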
A Linear-Time Algorithm for Suffix Sorting and Its Applications [Strings, Graphs, and Algorithms]
Fei Nan and Don Adjeroh
West Virginia University
Corresponding author: email@example.com
The suffix tree is a data structure that has found applications in various important problems, such as genetic sequencing, pattern matching and computational biology. Its derivative data structure, the suffix array, is another practically attractive representation of data with the added advantages of easier implementation and a small memory footprint. The problem of constructing the suffix array is called the suffix sorting problem. We propose an O(n)-time divide-and-conquer, sort-and-merge algorithm for solving it. Recursively, we divide the input sequence into two groups, sort one group, and merge the unsorted group into the sorted group to construct the suffix array. Given the suffix array, the array of Longest Common Prefixes (LCP) can be constructed in O(n) time. Our proposed algorithm distinguishes itself from existing suffix array algorithms by meeting the customized partition demand for different application requirements. The proposed algorithm uses a simpler merging step and introduces a new approach for non-symmetric treatment of the two subgroups at each divide-and-conquer step. We discuss applications of the proposed suffix sorting algorithm to different problems in computational biology, such as analysis of repetition structures and identification of
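To make the suffix array and LCP array concrete, here is a short Python sketch. It deliberately does not implement the linear-time construction proposed in the abstract: the suffix array is built by directly sorting suffixes, which is simple but slower, while the LCP array uses the well-known linear-time method of Kasai et al. The helper `longest_repeat` illustrates one repetition-structure application.

```python
def suffix_array(text):
    """Suffix array by direct sorting of suffixes (not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP array in O(n) time given the suffix array (Kasai et al.).

    lcp[i] is the length of the longest common prefix of the suffixes
    starting at sa[i - 1] and sa[i]; lcp[0] is 0.
    """
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeat(text):
    """Longest substring that occurs at least twice in text."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp:
        return ""
    i = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[i]:sa[i] + lcp[i]]
```

Swapping `suffix_array` for a linear-time construction would leave `lcp_array` and `longest_repeat` unchanged, since both depend only on the finished array.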
Functional modules from protein interaction network via local optimization [Strings, Graphs, and Algorithms]
Feng Luo1, *, Xiu-Feng Wan2, Chin-Fu Chen3, Richard Scheuermann4
1Department of Computer Science, 100 McAdams Hall, Clemson University, Clemson, SC 29634-0974, USA. 2Department of Microbiology, Miami University, Oxford, Ohio 45056, USA. 3Department of Genetics and Biochemistry, 316 Biosystems Research Complex, 51 New C
Corresponding author: firstname.lastname@example.org
Accumulating evidence suggests that biological systems are composed of interacting functional modules. Identification of these modules is essential for understanding the structure, function and evolution of biological systems. In this paper, we present a local optimization algorithm for detection of functional modules from protein interaction networks. This algorithm has been implemented in a Java program called MoNet (Module of Network). By applying our local optimization algorithm to the yeast core protein interaction network from the Database of Interacting Proteins (DIP), we identified 139 modules consisting of at least 5 proteins. All modules are significantly enriched in proteins with similar biological processes as defined by Gene Ontology (GO) terms. By comparing with modules obtained by a global optimization method, we found that modules identified by MoNet display a much higher purity (frequency) and higher statistical significance (lower p-value). Furthermore, the overlapping vertices between adjacent MoNet modules allow the construction of an interconnecting web of modules that gives insight into the high-level relationships among modules.
On the approximation of optimal structures for RNA-RNA interaction [Strings, Graphs, and Algorithms]
Hunter College of CUNY, New York, NY
Corresponding author: email@example.com
The interaction of two RNA molecules involves an interplay between the folding of individual molecules on one hand, and the binding of the two molecules on the other. Recently, there have been several concurrent yet independent efforts (including our own) to mathematically formulate RNA-RNA interactions and develop algorithms that predict the structure of the RNA complex thus formed. Most of the proposed algorithms are based on dynamic programming principles – apparently a “hard-to-avoid” influence from extensive RNA folding literature. Since most RNA-RNA interaction formulations are NP-complete problems, proposed algorithms are not guaranteed to always produce optimal structures. Our goal is to characterize this sub-optimality.
We demonstrate for the first time the existence of constant factor approximation algorithms that are based on dynamic programming. In particular, we develop 2/3 factor approximation algorithms (for various optimality criteria). Furthermore, we achieve our stated goal of characterizing the sub-optimality by proving theoretically that 2/3 is an upper bound on the approximation factor for all dynamic-programming-based algorithms. We establish this result by introducing the concept of an “entangler”: a special molecular sub-structure that may exist in the formed RNA complex. We prove that (1) any algorithm based on dynamic programming cannot produce an entangler, and (2) for some instances, any entangler-free solution is at best a 2/3 factor approximation. However, despite the theoretical sub-optimality of these algorithms, they are able to predict some known RNA complexes. In particular, our algorithms predict to a great degree of satisfaction the fhlA-OxyS and the CopA-CopT complexes in E. coli.
Evaluation of Features for Catalytic Residue Prediction in Novel Folds [Structural Biology]
Eunseog Youn1, Brandon Peters1, Predrag Radivojac2, and Sean D. Mooney1
1Center for Computational Biology and Bioinformatics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, 2School of Informatics, Indiana University, Bloomington, IN 47408
Corresponding author: firstname.lastname@example.org
Structural genomics projects are determining the three dimensional structure of proteins without full characterization of their function. A critical part of the annotation process involves appropriate knowledge representation and prediction of functionally important residue environments. We have developed a method to extract features from sequence, sequence alignments, 3-D structure, and structural environment conservation, and used support vector machines to annotate homologous and non-homologous residue positions based on a specific training set of residue functions. In order to evaluate this pipeline for automated protein annotation, we applied it to the challenging problem of classification of catalytic residues in enzymes. We also ranked the features based on their ability to discriminate catalytic from non-catalytic residues. When applying our method to a well-annotated set of protein structures, we found that top ranked features were a measure of sequence conservation, a measure of structural conservation, solvent accessibility, and residue hydrophobicity. We also found that features based on structural conservation were complementary to those based on sequence conservation and that they were capable of increasing predictor performance. Using a family non-redundant version of the ASTRAL 40 v1.65 dataset, we estimated that the true catalytic residues were correctly predicted in 57.0% of the cases with a precision of 18.5%. When testing on proteins containing novel folds not used in training, the best features were highly correlated with the training on families, thus validating the approach to non-homologous catalytic residue classification in general.
The fragment transformation method to detect the protein structural motifs [Structural Biology]
Chih Hao, Lu
Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
Corresponding author: email@example.com
Identifying functional structural motifs in protein structures of unknown function has become increasingly important in recent years due to the progress of the structural genomics initiatives. Although certain structural patterns such as the Asp-His-Ser catalytic triad are easy to detect because of their conserved residues and stringently constrained geometry, it is usually more challenging to detect general structural motifs such as the ββα-metal binding motif, which has a much more variable conformation and sequence. At present, the identification of these motifs usually relies on manual procedures based on different structure and sequence analysis tools. In this study, we develop a structural alignment algorithm that combines structural and sequence information to identify local structural motifs. We applied our method to the following examples: the ββα-metal binding motif and the treble clef motif. The ββα-metal binding motif plays an important role in nonspecific DNA interactions and cleavage in host defense and apoptosis. The treble clef motif is a zinc-binding motif adaptable to diverse functions such as the binding of nucleic acids and the hydrolysis of phosphodiester bonds. Our results are encouraging, indicating that we can effectively identify these structural motifs in an automatic fashion. Our method may provide a useful means for automatic functional annotation through detecting structural motifs associated with particular functions.
A Semantic Map to select Structural Bioinformatics services [Structural Biology]
Zoé Lacroix, Hervé Ménager, and Pierre Tufféry
Arizona State University (USA), Institut Pasteur (France), Université Denis Diderot (France)
Corresponding author: firstname.lastname@example.org
Structural Bioinformatics covers the prediction and analysis in-silico of biological molecular structures with the goal to understand functional mechanisms at a molecular level. Due to the significant effort of the scientific community, this field has dramatically evolved over the recent years, in particular for proteins. The techniques available to predict and analyze protein structures are continuously improving both in their focus and their performances while new algorithms are developed. In such a context, the scientists face the increasing difficulty of identifying and accessing existing tools and understanding how the results should be interpreted and contribute to scientific discovery.
We propose a semantic map of services for structural bioinformatics, applied to proteins including:
· A semantic description of each service that captures an abstraction of the service rather than a low level syntactic description. This description is expressed in terms of an ontology of the services, linking items of the structural bioinformatics concepts ontology.
· Service identification, detailing for each their purpose, the type of the data on which they are effective, and the type of result they provide.
· Exploration of available services, navigating through the graph composed of the possible interconnections between the services.
Our approach addresses the problem of semantic interoperability of scientific resources publicly available on the web. It exploits an ontology to represent the resources made available to the scientists, so that scientists express a query that captures their scientific aim, and are guided by the system to identify the resources best meeting their needs.
Computation of Conformational Entropy from Protein Sequences [Structural Biology]
Shao-Wei Huang and Jenn-Kang Hwang
Institute of Bioinformatics, National Chiao Tung University, Taiwan, Republic of China
Corresponding author: email@example.com
A complete protein sequence can usually determine a unique conformation; however, the situation is different for shorter subsequences: some of them are able to adopt unique conformations, independent of context, while others assume diverse conformations in different contexts. The conformations of subsequences are determined by the interplay between local and nonlocal interactions. A quantitative measure of such structural conservation or variability will be useful in the understanding of the sequence–structure relationship. In this report, we developed an approach using the support vector machine method to compute the conformational variability directly from sequences, which is referred to as the sequence structural entropy. As a practical application, we studied the relationship between sequence structural entropy and the hydrogen exchange for a set of well-studied proteins. We found that the slowest exchange cores usually comprise amino acids of the lowest sequence structural entropy. Our results indicate that structural conservation is closely related to the local structural stability. This relationship may have interesting implications in the protein folding processes, and may be useful in the study of the sequence–
Finding all plausible pairing partners for a stem-forming RNA region with the PRPVS algorithm [Structural Biology]
Xiaolu Huang, William Tapprich and Hesham Ali
Department of Computer Science, University of Nebraska at Omaha, Omaha, NE 68182, USA; Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE 68198, USA
Corresponding author: firstname.lastname@example.org
Often in RNA structure studies, one encounters a region that is known to be a stem-forming region (the K region) but whose pairing partner (the P region) is not known. Most RNA secondary structure prediction methods will, in many cases, miss the actual P region, especially if the P and K regions form a pseudoknot stem. Phylogenetic alignments also may not be helpful because they require significant sequence similarity. RNA researchers have asked for a tool that provides a group of base-pairing candidates with high sensitivity, rather than a prediction method that provides only one base-pairing option with low sensitivity. We have built the Pairing enRiched Parikh Vectors Searching (PRPVS) algorithm, which employs a partner dynamic pairing approach and the Parikh vector data structure in searching for the P region. Due to its neighboring-region-interference-free style, PRPVS has high sensitivity. The web application RNApair, implemented with PRPVS, has been tested on three RNA sequences with sizes from 360 nt to 550 nt, each containing a known H-type pseudoknot structure. The results show that when the stem 2 5’ regions are taken as the K regions, the true P region (the stem 2 3’ region) for each sequence is in a candidate group of size 14 or less. We believe that PRPVS provides a manageably low number of P region candidates given the RNA sequence and the K region. RNApair is available at http://bioinformatics.ist.unomaha.edu:8080/x/RNApair.html.
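As an illustration of the Parikh vector data structure mentioned above, the following minimal Python sketch counts the bases in every window of an RNA sequence and keeps the windows whose composition is complementary to that of the K region. This is not the PRPVS algorithm itself: the function names, the strict Watson-Crick matching (no G-U wobble pairs) and the exact-match filter are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGU"  # fixed symbol order for the vector

def parikh_vector(seq):
    """Return the counts of A, C, G and U in seq (its Parikh vector)."""
    counts = Counter(seq)
    return tuple(counts[b] for b in BASES)

def sliding_parikh_vectors(seq, width):
    """Yield (start, vector) for each window, updating counts incrementally."""
    if width <= 0 or width > len(seq):
        return
    counts = Counter(seq[:width])
    yield 0, tuple(counts[b] for b in BASES)
    for start in range(1, len(seq) - width + 1):
        counts[seq[start - 1]] -= 1          # base leaving the window
        counts[seq[start + width - 1]] += 1  # base entering the window
        yield start, tuple(counts[b] for b in BASES)

def complement_vector(vec):
    """Vector of the Watson-Crick complement: A/U and C/G counts swap."""
    a, c, g, u = vec
    return (u, g, c, a)

def candidate_partner_starts(seq, k_start, k_len):
    """Hypothetical filter: windows that do not overlap K and whose
    composition could pair with K base for base."""
    target = complement_vector(parikh_vector(seq[k_start:k_start + k_len]))
    k_end = k_start + k_len
    return [s for s, v in sliding_parikh_vectors(seq, k_len)
            if v == target and (s + k_len <= k_start or s >= k_end)]

print(candidate_partner_starts("GGACAAAAGUCC", 0, 4))
```

Because a Parikh vector ignores the order of symbols, a filter of this kind can only narrow down candidate windows; any real pairing check would still have to compare the bases position by position.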
Accurate prediction of protein complex structures through machine learning [Structural Biology]
Andrew J. Bordner and Andrey Gorin
Oak Ridge National Laboratory
Corresponding author: email@example.com
Structures of protein complexes provide valuable mechanistic insights into the protein interactions that mediate biological processes. Although many experimental structures of isolated proteins exist, there are comparatively fewer structures of complexes. Protein docking is well suited to addressing this discrepancy by providing experimentally verifiable structural models. A new docking method is presented that efficiently samples conformations by matching surface normal vectors, employs fast filtering for shape complementarity, and scores the conformations using a machine learning approach. This is the first time that machine learning has been used for discriminating docked protein complexes. The docking solutions are selected using a Random Forest classifier trained on contacting residue pair frequencies, residue propensities, evolutionary conservation, and shape complementarity. Prediction performance is assessed by cross-validation using a non-redundant set of X-ray structures for 93 heterodimers and 733 homodimers. The single highest rank docking solution is the correct (near-native) structure for just over one third of the complexes and the fraction of high ranked correct structures is significantly enhanced for almost all complexes. A detailed study of the remaining difficult to predict complexes reveals that the majority of homodimer cases, in fact, have a different oligomeric state, due to annotation errors. Also, the Random Forest classifier is shown to outperform an empirical residue contact potential. Evolutionary conservation and shape complementarity, as well as both underrepresented and overrepresented interface residues and residue pairs, were found to make the largest contributions to prediction accuracy.
Severe supersecondary regularities in beta sandwich proteins [Structural Biology]
Y-S. Chiang and A.E. Kister
Department of Health Informatics, SHRP, University of Medicine and Dentistry of New Jersey, Newark, NJ 07107, USA
Corresponding author: firstname.lastname@example.org
To explain how proteins with dissimilar sequences may share similar architecture, we put forth the hypothesis that strict rules govern the spatial organization of strands and helices in tertiary structures.
In this research we focus on the arrangement of strands in a large group of beta proteins, the so-called sandwich-like proteins (SPs). Spatial structures of SPs are composed of β-strands, which form two main β-sheets that pack face-to-face. This type of architecture unites 93 superfamilies that have no detectable sequence homology.
For supersecondary analysis we introduced a new unit: a set of consecutive strands connected by hydrogen bonds in a beta-sheet. We call these sets strandons. Based on this idea, we describe a structure as a set of strandons, which we call a supermotif. This simple formal description of protein structures in terms of supermotifs allowed us to create a rational supersecondary classification: sandwich-like beta proteins are hierarchically divided into supermotifs and then subdivided into supersecondary motifs, the specific arrangements of strands in the two beta sheets.
There are three main conclusions from this classification. First, the large diversity of the sandwich proteins is described by 6 different supermotifs in almost all (96%) of SPs. Second, strict rules describe the regularities of the disposition of strandons in the supermotifs. Third, the arrangement of strands in the strandons follows certain regularities and depends on the position of a strandon in the supermotif.
Protein Structure Prediction using Integer Programming and new Residue Pair Energy Functions [Structural Biology]
Kyle Ellrott, Jun-Tao Guo, Victor Olman, Ying Xu
Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia
Corresponding author: email@example.com
Protein structure prediction by threading is recognized as one of the best techniques for comparing a new protein sequence to known protein structures to identify native-like structural folds. It can find fold matches with much less sequence similarity than sequence-based approaches can. At the same time, it is much less computationally intensive than ab initio methods.
While the optimization of a sequence-structure alignment when using two-body energies and gaps has been proven to be NP-hard, model constraints and recent improvements in the application of combinatorial optimization techniques allow for a relatively fast solution to the problem. Original two-body energy descriptions and model descriptions were written with a simpler framework in mind. Given these new algorithmic techniques, it is also important to reevaluate the models and energy functions used.
OpenProspect, our open source implementation of Prospect, utilizes a more complex two-body energy description, called Dfire, based on a distance-dependent description of the problem, and employs a new core deletion model to allow the search for more distant homologues.
Initial results have shown that the distance-dependent energy function improves fold-level recognition over the original distance-cutoff-based energy.
Global Computational Regulatory Analysis of Anti-endotoxin Effect of LL-37 [Systems Biology]
Simon Chan1, Gregory Doho1, Neeloffer Mookherjee1, Kelly Brown1, Fiona Roche2, Fiona S.L. Brinkman2, and Robert E.W. Hancock1
1. Pathogenomics of Innate Immunity, Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, CANADA. 2. Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbi
Corresponding author: firstname.lastname@example.org
A major result of the binding of the bacterial signature molecule LPS to the Toll-like Receptor (TLR)-4 on immune cells is the translocation of transcription factors (TF) like NF-kB into the cell nucleus, where pathogen response genes are induced. These genes include certain chemokines, which attract phagocytes to the infection site, and pro-inflammatory cytokines, which, if overstimulated, can lead to endotoxic shock. The human host defence peptide LL-37 is a modulator of innate immunity and has been shown to induce expression of chemokines and suppress the expression of pro-inflammatory cytokines. Therefore, to understand how LL-37 works, microarray experiments were performed on THP-1 cells treated with LL-37. Differentially expressed genes were determined using a custom analysis pipeline called ArrayPipe. These genes were then clustered based on the temporal patterns of fold change values using various clustering algorithms. Next, Z-scores defining the overabundance of particular transcription factor binding sites (TFBS) were calculated for each cluster and compared to a background data set of randomly chosen TFBS. The results indicate that LL-37 selectively neutralized the LPS-induced expression of genes with promoters containing the NF-kB TFBS and TFs downstream of Mitogen-Activated Protein Kinase (MAPK) pathways. These results provide novel insight into the mechanism of the LL-37 anti-endotoxin effect. To automate the above bioinformatics analyses, we have developed a user-friendly online tool. It is our goal to incorporate this tool into ArrayPipe to assist in the discovery of novel innate immunity pathways.
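The Z-score step described above can be sketched in a few lines of Python. This is a generic illustration rather than the authors' pipeline: it assumes that the background is available as a list of TFBS counts from randomly drawn gene sets of the same size as the cluster, and the function name and the numbers are hypothetical.

```python
from statistics import mean, stdev

def overrepresentation_z_score(observed, background_counts):
    """Z-score of an observed TFBS count against background counts.

    observed: number of binding sites for one TF found in a gene cluster.
    background_counts: counts of the same TFBS in randomly drawn gene sets
    (an assumed form of the background data set).
    """
    mu = mean(background_counts)
    sigma = stdev(background_counts)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background counts have zero variance")
    return (observed - mu) / sigma

# Hypothetical numbers, for illustration only.
background = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(overrepresentation_z_score(9, background), 3))
```

A large positive score means the binding site occurs in the cluster more often than in comparable random gene sets; the cut-off used to call a site overabundant is a separate choice that the abstract does not state.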
S. pombe Regulatory Network Construction Using the Fuzzy Logic Network [Systems Biology]
Yingjun Cao, Paul P. Wang, Alade Tokuta
Duke University (Yingjun Cao, Paul P. Wang), North Carolina Central University (Alade Tokuta)
Corresponding author: email@example.com
In this poster, a novel gene regulation data processing algorithm based on Fuzzy Logic Network (FLN) theory is proposed and tested. The key motivation for this algorithm is that genes with regulatory relationships can be modeled via fuzzy logic, and the degree of regulation can be represented as the accumulated distance over a series of time intervals. We have deduced the dynamic properties of the FLN using the annealed approximation, and the dynamic equations of a dynamical FLN have been analyzed. Based on previous findings that the distribution of connectivity follows Zipf's law in yeast protein-protein networks, as well as in the Internet and social networks, the criteria for quantifying the algorithm's parameters have been established. One unique feature of this algorithm is that it makes very limited a priori assumptions concerning the modeling; hence the algorithm is categorized as a data-driven algorithm. Using the guidelines obtained from the theoretical deductions, the algorithm was applied to the Schizosaccharomyces pombe time-series dataset. We chose the 407 genes that have been proposed to be cell cycle regulated, and the algorithm inferred 57 previously verified regulations, 48 unknown regulations, and 20 dubious regulations. The 125 regulatory pairs involve 108 genes, and the average connectivity of the inferred network confirms the theoretical assumptions of Zipf’s law.
Assessing Hierarchical Modularity in Protein Interaction Networks [Systems Biology]
Young-Rae Cho, Woochang Hwang, Aidong Zhang and Murali Ramanathan
State University of New York at Buffalo
Corresponding author: firstname.lastname@example.org
The complete and systematic analysis of protein-protein interactions is one of the most fundamental challenges in understanding cellular organization, processes and functions. The interactions between two proteins provide clues to identify functional modules. However, previous studies of protein interaction networks have suffered from the complexity of the networks and large amounts of noisy data. In this work, we present a novel approach for modularization of protein interaction networks based on the hierarchical modularity in scale-free networks. First, we introduce two metrics to accurately measure the likelihood of bridging two modules for each node and edge in a network, using connectivity and betweenness centrality. Next, we propose an efficient algorithm to detect modules by collapsing the bridging nodes and edges. To assess how well our metric identifies the bridging nodes, we compute the clustering coefficients of networks that are built by successive deletion of the node with the highest score in our metric. The pattern of change in the clustering coefficients can approximate the number of bridging nodes in a network. We also investigate the biological importance of bridging nodes based on the lethality of proteins. Using the resulting modules, we demonstrate that our approach can discover functional associations of proteins. Furthermore, we show that our algorithm outperforms previous clustering methods in terms of accuracy and efficiency. Finally, we apply our approach to predict the biological functions of uncharacterized proteins.
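The assessment step above, in which clustering coefficients are recomputed after a high-scoring node is deleted, can be illustrated with a small pure-Python sketch. The bridging metrics themselves are not specified in this abstract, so the example simply deletes a node that is assumed to have been chosen already; the toy graph and all function names are hypothetical.

```python
def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def local_clustering(adj, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

def remove_node(adj, node):
    """Return a copy of the graph without node and its edges."""
    return {n: nbrs - {node} for n, nbrs in adj.items() if n != node}

# Two triangles (modules) joined only through the node "X".
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"),
                     ("D", "E"), ("E", "F"), ("D", "F"),
                     ("C", "X"), ("X", "D")])
before = average_clustering(graph)
after = average_clustering(remove_node(graph, "X"))
print(before, after)
```

In this toy network the average clustering coefficient rises once the connecting node is removed, which is the kind of change the assessment looks for when bridging nodes are deleted one after another.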
Simulation of Leukocyte Interactions with a P-Selectin Coated Substrate [Systems Biology]
Jon Tang and C. Anthony Hunt
UCSF & UCB Joint Graduate Group in Bioengineering and The Biosystems Group, Department of Biopharmaceutical Sciences, The University of California, San Francisco, CA 94143, USA
Corresponding author: email@example.com
Leukocyte transendothelial migration is a key process in the pathogenesis of a number of inflammatory disease states such as asthma, rheumatoid arthritis, multiple sclerosis, and atherosclerosis. Such diseases can be characterized by excess or inappropriate leukocyte transmigration and the misdirected actions of leukocytes towards healthy host tissue. Discovery of new and better therapeutic interventions will be facilitated by having a better understanding of how the process of leukocyte transendothelial migration works and by being able to predict the consequences of interventions. Making such predictions requires having an in silico analogue capable of exhibiting a large, explorable behavior space that significantly overlaps the observed behavior space of the in vitro models used for research. The envisioned analogue will be capable of exhibiting multiple levels of resolution. We have constructed an in silico model for representing the dynamics of rolling, activation, and adhesion of individual leukocytes prior to transmigration. We use the synthetic modeling method. Object-oriented software components are designed, verified, plugged together logically, and then operated in ways that represent the mechanisms and processes that are believed to influence leukocyte adhesion (or lack thereof) and rolling. Our first objective has been to refine the analogue’s ability to mimic the essential jerky characteristics of leukocyte rolling. We report simulation results that compare well to data from flow chamber experiments of leukocyte rolling on P-selectin. These results provide a necessary foundation for simulation studies of leukocyte rolling, activation by KC chemokines, and adhesion on P-selectin and VCAM-1 substrate.
A local clustering algorithm for discovering functional modules from the synthetic lethal interaction network in yeast [Systems Biology]
Ping Ye, Joel Bader
Johns Hopkins University, Department of Biomedical Engineering, High Throughput Biology Center
Corresponding author: firstname.lastname@example.org
One powerful tool for dissecting biological pathways is to identify synthetic lethality relationships, in which two genes cause cell death when mutated concurrently while neither mutation by itself is lethal to the cell. Genome-wide approaches to assessing synthetic lethality have been recently conducted in Saccharomyces cerevisiae. Analyses have revealed that synthetic lethal interacting genes mostly function in compensating pathways while genes sharing synthetic lethal partners belong to the same pathway. Accordingly, global clustering methods have grouped genes into functional modules based on their global synthetic lethal interaction patterns. However, multi-task genes cannot be classified correctly using these global methods, because they share one set of partners with genes in one module and another set of partners with genes in another module. Therefore, it is essential to identify correct pathway membership by clustering genes according to their local synthetic lethal interaction pattern. We have developed a local clustering algorithm, the bi-clustering algorithm, to identify statistically significant modules according to local synthetic lethal interaction patterns. By applying this method to a DNA integrity network, we have successfully identified all functional modules previously determined through manual curation. Our method outperforms global clustering methods by identifying overlapping modules containing multi-task genes. Therefore, this method can predict functional gene modules and has general application to genomic synthetic lethality screens.
Agent-Directed DEVS Modeling of Adaptive Cellular Immunity [Systems Biology]
Sunwoo Park, Sean H. J. Kim, and C. Anthony Hunt
Joint UCSF/UCB Bioengineering Graduate Group and The Biosystems Group, Department of Biopharmaceutical Sciences, The University of California, San Francisco
Corresponding author: email@example.com
We present a new Discrete Event System Specification (DEVS) agent modeling formalism and simulation processors for simulation-driven systems biology, and report on their application in representing aspects of the adaptive cellular immune system. Adaptive immune responses are emergent phenomena arising from dynamic spatiotemporal interactions between immunogens and components of a host immune system. To represent system-level biological behaviors, we created agent-directed analogues of key components of the cellular immune system, and described each using the extended DEVS. The formalism integrates descriptions of the key features of agent-based and system-oriented models into a unified formal specification, including a universal coupling mechanism. The mechanism enables dynamic creation and permutation of coupling relations between analogue entities, thus making it possible to formally describe adaptive system-to-system and system-to-environment interactions. Antigen recognition, activation, and effector actions of T cells are represented as discrete events. T cells are represented as atomic DEVS agents. Temporal trajectories of inputs, outputs, and states are represented concisely and consistently, facilitating systematic analysis of simulation dynamics. The simulation processors are entities that execute the DEVS agent model and allow exploration of spatiotemporal spaces generated from the model. Together the new formalism and simulation processors enable unified formal descriptions of DEVS agent models and their adaptive interactions.
Agent-Based Simulations of In Vitro Epithelial Morphogenesis in Multiple Environments [Systems Biology]
Mark R. Grant, Sean H. J. Kim, and C. Anthony Hunt
Joint UCSF/UCB Bioengineering Graduate Group and The Biosystems Group, Department of Biopharmaceutical Sciences, The University of California, San Francisco
Corresponding author: firstname.lastname@example.org
In vitro studies of epithelial morphogenesis have demonstrated the influence of environment composition and orientation in the development of multicellular epithelial structures such as tubules and cysts. We have constructed discrete event, agent-based analogues of epithelial cells, and report on their use for experimentation to explore how the morphogenetic phenomena observed in wet-lab experiments under four growth conditions might be generated and controlled. The analogues were constructed using the agent-based modeling package MASON and comprise 2D grids and three component types that represent free space, matrix, and epithelial cells in an in vitro system. Cell agent axioms capture posited processes involved in epithelial morphogenesis, and each action taken by an analogue is strictly determined by environment-focused axioms. Actions are mandated in response to type and location in the local environment of any combination of the three components. The key features of simulation outcomes under four different growth conditions are remarkably similar to observations of epithelial cell growth under corresponding in vitro conditions: surface, embedded, suspension, and overlay. We identified in silico attributes that may have in vitro counterparts that lead to normal growth patterns. We also studied how changes in the logic governing simulated epithelial cell behavior might cause abnormal growth. Simulation results confirm the importance of a polarized response to the environment to the generation of a normal epithelial phenotype and show how disruptions of tight mechanistic control lead to aberrant, sometimes cancer-like growth characteristics.
Stochastic, Agent-Based Analogues of Primary Rat Hepatocytes: Prediction of In Vitro Biliary Excretion [Systems Biology]
Shahab Sheikh-Bahaei and C. Anthony Hunt
University of California, Berkeley and San Francisco
Corresponding author: email@example.com
Using an agent-directed modeling method, we constructed analogues of individual hepatocytes that can simulate the biliary excretion of compounds. The in silico experimental system mimics an in vitro experimental system that uses sandwich-cultured hepatocytes. We represent hepatocytes as fixed agents in a 2D grid. Objects representing drug can move around stochastically. Simulated hepatocytes are container objects equipped with binders and "enzymes" on the inside and "transporters" on their surface. When a drug object is encountered, it can partition or get transported in based on the drug's physicochemical properties. Once inside, it can attach to (and detach from) binders and enzymes. In the latter case, it may be "metabolized," excreted to "bile," or transported out.
We tuned the parameters to match data for salicylate, taurocholate, and methotrexate. We then predicted the biliary excretion of enkephalin. We classified the four compounds into two and three clusters using a Fuzzy c-Means algorithm that used each drug's physicochemical properties. The simulation results are minimally acceptable. Our plan is to increase the number of compounds for which descendants of this model will produce acceptably similar uptake and excretion data. Both analogues and approach are fundamentally different from the current predictive approaches. We leverage mechanistic knowledge by representing within the analogue our understanding of the generative relationships that develop between analogue components and compounds. Our expectation is that the approach will significantly improve our ability to anticipate the biological properties of compounds of interest.
Axiom Based in Silico Analogues of In Vitro Tumor Spheroids [Systems Biology]
Jesse Engelberg, C. Anthony Hunt
UCSF / UC Berkeley Joint Graduate Group in Bioengineering. BioSystems Group, Department of Biopharmaceutical Sciences, UCSF.
Corresponding author: firstname.lastname@example.org
In vitro multicellular tumor spheroids are models for avascular tumor growth. We describe and present agent-based analogues of those in vitro model systems that can represent their key spatial, temporal, and morphological phenotypic attributes. The analogues, developed using JAS and MASON, use four rectangular grids (spaces): one each for tumor cells, nutrient, oxygen, and the necrotic factor. Square and hexagonal grid subdivisions are used, and each cell object occupies a single grid space. The other three spaces represent the relative quantity of the referent components in a space comparable to that occupied by a cell object. Events and processes that would be evident at a higher resolution are conflated into axioms and the aforementioned components. Cell objects move within their grid and interact with each other across grids. The oxygen and nutrient spaces can be replenished. The behaviors of simulated tumor cells are dictated by a small set of axioms; events include change state, consume resources, move, reproduce, be shed, and die. The growth curves of analogues in different growth conditions can be tuned to match growth curves of the in vitro tumor spheroid in corresponding different growth conditions. The concentric layered morphology of the analogue spheroids is an acceptable match to in vitro morphology. In those simulations, the simulated toxic factors play a key role; the in vitro mechanism, however, remains unclear. In silico, a method of volume loss, other than shedding, is necessary for the simulated spheroid to reach and maintain a relatively constant size.
Global Analysis of Protein Translation Networks in Yeast [Systems Biology]
Daniel D. Wu, Xiaohua Hu
College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, U.S.A.
Corresponding author: email@example.com
Protein translation is a vital cellular process for any living organism. The availability of interaction databases provides an opportunity for researchers to exploit the immense amount of data in silico, for example by studying biological systems using network analysis. There has been an extensive effort to decipher transcriptional regulatory networks using computational methods. However, research on translation regulatory networks has attracted little attention in the bioinformatics and computational biology community, probably due to the nature of the available data and the bias of conventional wisdom. In this paper, we present a global network analysis of protein translation networks in yeast, a first step towards elucidating the structures and properties of translation networks. We extract the translation proteome using the MIPS functional categories and analyze it in the context of the full protein-protein interaction network. We further derive the individual translation networks from the full interaction network using the extracted proteome. We show that the protein translation networks do not exhibit power law degree distributions, in contrast to the full network. In addition, we demonstrate the close relationship between the translation networks and other cellular processes, especially transcription and metabolism. We also examine the essentiality of proteins in the translation networks and its correlation with connectivity, the cellular localization of these proteins, and the mapping of these proteins to the kinase-substrate system. These results have potential implications for understanding mechanisms of translational control from a systems perspective.
Proteins’ large indel mutations tend to reside in the internal regions [Systems Biology]
Fengfeng Zhou, Ying Xu
Computational Systems Biology Laboratory, Department of Biochemical and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
Corresponding author: firstname.lastname@example.org
Average protein/gene lengths (APLs) are reported to be highly conserved within each of the three domains of life, and eukaryotes have longer APLs than prokaryotes. We have previously found that the APLs of proteins with large indel mutations, denoted as group Glid, are almost always longer than those of all the proteins, suggesting that the most radical mutations tend to take place in long proteins. In this work, we have carried out an analysis of the positions of the indel mutations. We found that most of the indel mutations in proteins from group Glid take place in the internal regions of the protein sequences, and the average lengths of such mutations are only ~20 amino acids across 61 genomes. Mobile elements, e.g. transposons, could introduce much longer indel mutations into proteins, and they usually occur at the two termini. These observations suggest that, although transpositions of mobile elements could greatly increase the genomes’ evolutionary tempo, most of the mutations are unlikely to have been caused by this mechanism.