PAPER ABSTRACTS


SESSIONS
I. Structural Bioinformatics
II. Genomics 1
III. Transcriptomes 1
IV. Evolution and Phylogeny
V. Proteomics
VI. Transcriptomes 2
VII. Applied Bioinformatics
VIII. Data Mining and Ontology 1
X. Database and Ontology
XI. Pathways and Networks 1
XII. Pathways and Networks 2
XIII. Genomics 2
XIV. Transcriptomes 3


SESSION I - STRUCTURAL BIOINFORMATICS

Multiple RNA Structure Alignment
Zhuozhi Wang - Experimental Therapeutics, Research, University Health Network, Canada
Kaizhong Zhang - Department of Computer Science, University of Western Ontario, Canada

RNA structures can be viewed as a special kind of string in which some characters are bonded to each other. The problem of aligning two RNA structures has been studied for some time, and several successful algorithms based on different models exist. In this paper, adopting the model introduced in [18], we propose two algorithms that attack the problem of aligning multiple RNA structures by reducing it to the problem of aligning two RNA structure alignments.



Weighting Features to Recognize 3D Patterns of Electron Density in X-ray Protein Crystallography
Kreshna Gopal [1], Tod D. Romo [2], James C. Sacchettini [2], Thomas R. Ioerger [1]
[1] Department of Computer Science, Texas A&M University
[2] Department of Biochemistry & Biophysics, Texas A&M University


Feature selection and weighting are central problems in pattern recognition and instance-based learning. In this work, we discuss the challenges of constructing and weighting features to recognize 3D patterns of electron density in order to determine protein structures. We present SLIDER, a feature-weighting algorithm that adjusts weights iteratively such that patterns matching query instances are ranked higher than mismatching ones. Moreover, SLIDER makes judicious choices of the weight values to be considered in each iteration, by examining the specific weights at which matching and mismatching patterns switch as nearest neighbors to query instances. This approach reduces the space of weight vectors to be searched. We make two main observations: (1) SLIDER efficiently generates weights that contribute significantly to the retrieval of matching electron density patterns; (2) the optimum weight vector is sensitive to the distance metric, i.e., feature relevance can be, to a certain extent, sensitive to the underlying metric used to compare patterns.



Reasoning About Molecular Similarity and Properties
Rahul Singh - Department of Computer Science, San Francisco State University

Ascertaining the similarity amongst molecules is a fundamental problem in biology and drug discovery. Since similar molecules tend to have similar biological properties, the notion of molecular similarity plays an important role in exploration of molecular structural space, query-retrieval in molecular databases, and in structure-activity modeling. This problem is related to the issue of molecular representation. Currently, approaches with high descriptive power like 3D surface-based representations are available. However, most techniques tend to focus on 2D graph-based molecular similarity due to the complexity that accompanies reasoning with more elaborate representations. This paper addresses the problem of determining similarity when molecules are described using complex surface-based representations. It proposes an intrinsic, spherical representation that systematically maps points on a molecular surface to points on a standard coordinate system (a sphere). Molecular geometry, molecular fields, and effects due to field super-positioning can then be captured as distributions on the surface of the sphere. Molecular similarity is obtained by computing the similarity of the corresponding property distributions using a novel formulation of histogram-intersection. This method is robust to noise, obviates molecular pose-optimization, can incorporate conformational variations, and facilitates highly efficient determination of similarity. Retrieval performance, applications in structure-activity modeling of complex biological properties, and comparisons with existing research and commercial methods demonstrate the validity and effectiveness of the approach.
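The paper's histogram-intersection formulation is novel, but the classical histogram intersection it builds on can be sketched in a few lines (a minimal illustration with made-up five-bin property distributions, not the authors' spherical-mapping implementation):

```python
def histogram_intersection(p, q):
    """Histogram intersection similarity of two property distributions:
    normalize each histogram to sum to 1, then sum the bin-wise minima.
    Returns 1.0 for identical distributions, 0.0 for disjoint ones."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    return sum(min(a, b) for a, b in zip(p, q))

# Two molecules' property counts over five (toy) surface bins.
mol_a = [4, 1, 0, 3, 2]
mol_b = [3, 2, 1, 2, 2]
print(round(histogram_intersection(mol_a, mol_b), 3))  # 0.8
```

Because the score depends only on the two distributions, no pose optimization of the molecules is needed, which is one motivation the abstract cites.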



High-throughput 3D Structural Homology Detection via NMR Resonance Assignment
Christopher James Langmead - Carnegie Mellon Dept. of Computer Science
Bruce Randall Donald - Computer Science Department, Chemistry Department, Biological Sciences Department, Dartmouth Center for Structural Biology and Computational Chemistry, Dartmouth College

One goal of the structural genomics initiative is the identification of new protein folds. Sequence-based structural homology prediction methods are an important means for prioritizing unknown proteins for structure determination. However, an important challenge remains: two highly dissimilar sequences can have similar folds; how can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure, called HD, for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies 3D models in a protein structural database whose geometries best fit the unassigned experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or homology modelling. The algorithm runs in O(pn + pn^(5/2) log(cn) + p log p) time, where p is the number of proteins in the database, n is the number of residues in the target protein, and c is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR data from 3 different proteins against a database of 4,500 representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of larger proteins, with as little as 10-30% sequence homology between the target protein (or sub-domain) and the computed model. In particular, we report no false negatives or false positives despite significant percentages of missing experimental data.

Abbreviations used: NMR, nuclear magnetic resonance; RDC, residual dipolar coupling; 3D, three-dimensional; HSQC, heteronuclear single-quantum coherence; HN, amide proton; SAR, structure activity relation; SO(3), special orthogonal (rotation) group in 3D.



Pair Stochastic Tree Adjoining Grammars for Aligning and Predicting Pseudoknot RNA Structures
Hiroshi Matsui, Kengo Sato, Yasubumi Sakakibara - Department of Biosciences and Informatics, Keio University

Motivation: Since whole genome sequences for many species are now available, computational prediction of RNA secondary structures and computational identification of non-coding RNA regions by comparative genomics have become important, and require more advanced alignment methods. Recently, an approach of structural alignment for RNA sequences has been introduced to solve these problems. By structural alignment, we mean a pairwise alignment that aligns an unfolded RNA sequence to a folded RNA sequence of known secondary structure. Pair HMMs on tree structures (PHMMTSs), proposed by Sakakibara, are efficient automata-theoretic models for structural alignments of RNA secondary structures, but are incapable of handling pseudoknots. On the other hand, tree adjoining grammars (TAGs) are a subclass of context-sensitive grammars suitable for modeling pseudoknots. Our goal is to extend PHMMTSs by incorporating TAGs so that they can handle pseudoknots.

Results: We propose pair stochastic tree adjoining grammars (PSTAGs) for modeling RNA secondary structures including pseudoknots, and show strong experimental evidence that modeling pseudoknot structures significantly improves the prediction accuracy of RNA secondary structures. First, we extend the notion of PHMMTSs defined on alignments of "trees" to PSTAGs defined on alignments of "TAG (derivation) trees", which represent a top-down parsing process of TAGs and are functionally equivalent to derived trees of TAGs. Second, we modify PSTAGs so that they take as input a pair consisting of a linear sequence and a TAG tree representing a pseudoknot structure of RNA, and produce a structural alignment. Then, we develop a polynomial-time algorithm for obtaining an optimal structural alignment by PSTAGs, based on a dynamic programming parser. We have carried out several computational experiments on predicting pseudoknots by PSTAGs, and they suggest that prediction of RNA pseudoknot structures by our method is more efficient and biologically plausible than by other conventional methods. The binary code for the PSTAG method is freely available from our website at http://www.dna.bio.keio.ac.jp/pstag/.



An Algorithm for Detecting Homologues of Known Structured RNAs in Genomes
Shu-Yun Le, Jacob V. Maizel, Jr. - Laboratory of Experimental and Computational Biology, NCI Center for Cancer Research, National Cancer Institute, NIH
Kaizhong Zhang - Department of Computer Science, University of Western Ontario; School of Computing Science, Simon Fraser University

Distinct RNA structures are frequently involved in a wide range of functions in various biological mechanisms. The three-dimensional RNA structures solved by X-ray crystallography and various well-established RNA phylogenetic structures indicate that functional RNAs have characteristic structural motifs represented by specific combinations of base pairings and conserved nucleotides in the loop regions. Discovery of well-ordered RNA structures and their homologues in genome-wide searches will enhance our ability to detect RNA structural motifs and help us to highlight their association with functional and regulatory RNA elements. We present here a novel computer algorithm, HomoStRscan, that takes a single RNA sequence with its secondary structure and searches for homologous RNAs in complete genomes. This algorithm completely differs from other currently used search algorithms for homologous structures or structural motifs. For an arbitrary segment (or window) of the target sequence that has a size similar to the query sequence, HomoStRscan finds the structure most similar to the input query structure and computes the maximal similarity score (MSS) between the two structures. Homologous RNA structures are then statistically inferred from the MSS distribution computed over the target genome. The method provides a flexible, robust and fine-grained search tool for any homologous structural RNAs.



Inverse Protein Folding in 2D HP Model
Arvind Gupta - School of Computing Science, Simon Fraser University, Canada
Ján Mañuch - School of Computing Science and PIMS, Simon Fraser University, Canada
Ladislav Stacho - Department of Mathematics, Simon Fraser University, Canada

The inverse protein folding problem is that of designing an amino acid sequence which has a particular native protein fold. This problem arises in drug design where a particular structure is necessary to ensure proper protein-protein interactions. In this paper we show that in the 2D HP model of Dill it is possible to solve this problem for a broad class of structures. These structures can be used to closely approximate any given structure. One of the most important properties of a good protein is its stability: the aptitude not to fold simultaneously into other structures. We show that for a number of basic structures, our sequences have a unique fold.



Analysis of a Systematic Search-Based Algorithm for Determining Protein Backbone Structure from a Minimum Number of Residual Dipolar Couplings
Lincong Wang - Dartmouth Computer Science Department, Dartmouth College
Bruce Randall Donald - Dartmouth Computer Science Department, Dartmouth Chemistry Department, Dartmouth Department of Biological Sciences, Dartmouth College

We have developed an ab initio algorithm for determining a protein backbone structure using global orientational restraints on internuclear vectors derived from residual dipolar couplings (RDCs) measured in one or two different aligning media by solution nuclear magnetic resonance (NMR) spectroscopy [14, 15]. Specifically, the conformations and global orientations of individual secondary structure elements are computed, independently, by an exact, systematic search-based minimization algorithm using only 2 RDCs per residue. The systematic search is built upon a quartic equation for computing, exactly and in constant time, the directions of an internuclear vector from RDCs, and linear or quadratic equations for computing the sines and cosines of the backbone dihedral (φ, ψ) angles from two vectors in consecutive peptide planes. In contrast to heuristic searches, such as the simulated annealing (SA) or Monte Carlo (MC) methods used by other NMR structure determination algorithms, our minimization algorithm can be analyzed rigorously in terms of expected algorithmic complexity and the coordinate precision of the protein structure as a function of error in the input data. The algorithm has been successfully applied to compute the backbone structures of three proteins using real NMR data.



SESSION II - GENOMICS 1

Shannon Information in Complete Genomes
Chang-Heng Chang [1], Li-Ching Hsieh [1], Ta-Yuan Chen [1], Hong-Da Chen [1], Liaofu Luo [3], Hoong-Chien Lee [1,2,4]
[1] Department of Physics, National Central University, Taiwan
[2] Department of Life Sciences, National Central University, Taiwan
[3] Physics Department, Inner Mongolia University, China
[4] National Center for Theoretical Sciences, Taiwan


Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured for word lengths of two to ten letters. It is found that, in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences: thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral, which suggests an independent corroboration of Kimura's neutral theory of evolution.
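The scale-dependent excess of Shannon information over a random sequence can be illustrated with a toy k-mer calculation (a simplified sketch: real genomes and the paper's matching random sequences are far longer, and finite-sample corrections matter at large word lengths):

```python
import math
import random
from collections import Counter

def kmer_shannon_information(seq, k):
    """Shannon information of a 4-letter sequence at word length k, in
    bits: the drop of the observed k-mer entropy below the 2k-bit
    maximum attained by a uniform random sequence (finite-sample
    effects are ignored in this sketch)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2 * k - entropy

random.seed(0)
duplicated = "ACGT" * 25                           # duplication-rich toy genome
shuffled = "".join(random.choices("ACGT", k=100))  # matching random control
print(kmer_shannon_information(duplicated, 2) >
      kmer_shannon_information(shuffled, 2))       # True
```

The duplication-rich sequence uses only 4 of the 16 possible 2-mers, so its word entropy sits far below the random control's, mirroring the excess information the abstract attributes to segmental duplication.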



Segmental Duplications Containing Tandem Repeated Genes Encoding Putative Deubiquitinating Enzymes
Hong Liu, Li Li, Asher Zilberstein, Chang S. Hahn - Aventis Pharmaceuticals Inc.

Both inter- and intra-chromosomal segmental duplications are known to have occurred in the human genome during evolution, but few cases of such segments involving functional genes have been reported. While searching for the human orthologs of murine hematopoietic deubiquitinating enzymes (DUBs), we identified four clusters of DUB-like genes on chromosome 4p15 and chromosome 8p22-23 that are over 90% identical to each other at the DNA level. These genes are expressed in a cell type- and activation-specific manner, with different clusters possessing potentially distinct expression profiles. Examining the sequences surrounding these gene duplication events, we have identified previously unreported conserved sequence elements, as large as 35 to 74 kb, encircling the gene clusters. Traces of these elements are also found on chromosome 12p13 and chromosome 11q13. The coding and immediate upstream sequences for the DUB-like genes, as well as the surrounding conserved elements, are present in the chimpanzee trace database, but not in the rodent genomes. We hypothesize that the segments containing these DUB clusters and surrounding elements arose relatively recently in evolution through inter- and intra-chromosomal duplicative transpositions, following the divergence of primates and rodents. A genome-wide systematic search for segmental duplications containing duplicated gene clusters has been performed.



Compressed Pattern Matching in DNA Sequences
Lei Chen, Shiyong Lu, Jeffrey Ram - Wayne State University

We propose derivative Boyer-Moore (d-BM), a new compressed pattern matching algorithm in DNA sequences. This algorithm is based on the Boyer-Moore method, which is one of the most popular string matching algorithms. In this approach, we compress both DNA sequences and patterns by using two bits to represent each A, T, C, G character. Experiments indicate that this compressed pattern matching algorithm searches long DNA patterns (length > 50) more than 10 times faster than the exact match routine of the software package Agrep, which is known as the fastest pattern matching tool. Moreover, compression of DNA sequences by this method gives a guaranteed space saving of 75%. In part the enhanced speed of the algorithm is due to the increased efficiency of the Boyer-Moore method resulting from an increase in alphabet size from 4 to 256.
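A minimal sketch of the 2-bit encoding idea follows (hypothetical code, not the d-BM implementation: for brevity it searches only byte-aligned positions and uses Python's built-in byte search in place of Boyer-Moore; the real algorithm also covers the three unaligned phases, e.g. with shifted copies of the pattern):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(dna):
    """Pack a DNA string into bytes, four bases per byte (2 bits each);
    for brevity this sketch requires the length to be a multiple of 4."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        b = 0
        for ch in dna[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)

def find_packed(packed_text, packed_pattern):
    """Search the packed text at byte-aligned positions only, over the
    256-letter byte alphabet; returns a base offset or -1."""
    idx = packed_text.find(packed_pattern)
    return -1 if idx < 0 else idx * 4

genome = pack("TTGACACG" + "ACGTACGT" + "CAGT")
query = pack("ACGTACGT")
print(find_packed(genome, query))  # 8  (match starts at base 8)
```

Packing four bases per byte gives the guaranteed 75% space saving the abstract cites, and matching over bytes enlarges the effective alphabet from 4 to 256, which is what lengthens Boyer-Moore's shifts.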



A Self-tuning Method for One-Chip SNP Identification
Michael Molla [1,2], Jude Shavlik [1], Thomas Albert [2], Todd Richmond [2], Steven Smith [2]
[1] University of Wisconsin-Madison
[2] NimbleGen Systems, Inc.


Current methods for interpreting oligonucleotide-based SNP-detection microarrays (SNP chips) are based on statistics and require extensive parameter tuning as well as extremely high-resolution images of the chip being processed. We present a method, based on a simple data-classification technique called nearest neighbors, that, on haploid organisms, produces results comparable to the published results of the leading statistical methods and requires very little parameter tuning. Furthermore, it can interpret SNP chips using lower-resolution scanners of the type more typically used in current microarray experiments.

Along with our algorithm, we present the results of a SNP-detection experiment where, when independently applying this algorithm to six identical SARS SNP chips, we correctly identify all 24 SNPs in a particular strain of the SARS virus, with between 6 and 13 false positives across the six experiments.
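The nearest-neighbor classification step can be sketched as follows (toy intensity profiles invented for illustration; a real SNP chip uses many probes per queried position):

```python
def nearest_neighbor_call(query, references):
    """Call the base whose reference intensity profile lies nearest
    (Euclidean distance) to the observed probe intensities.

    references: {base: intensity_vector} prototypes, e.g. taken from
    control probes of known genotype (hypothetical toy profiles here)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return min(references, key=lambda base: dist(query, references[base]))

refs = {
    "A": [0.9, 0.1, 0.1, 0.1],
    "C": [0.1, 0.9, 0.1, 0.1],
    "G": [0.1, 0.1, 0.9, 0.1],
    "T": [0.1, 0.1, 0.1, 0.9],
}
# Observed intensities favor the second probe, so the call is "C".
print(nearest_neighbor_call([0.2, 0.7, 0.2, 0.1], refs))  # C
```

Unlike a parametric statistical model, the only tunable element here is the distance metric, which is the parameter-light behavior the abstract emphasizes.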



Space-conserving Optimal DNA-Protein Alignment
Pang Ko, Mahesh Narayanan, Anantharaman Kalyanaraman, Srinivas Aluru - Department of Electrical and Computer Engineering, Iowa State University

DNA-protein alignment algorithms can be used to discover coding sequences in a genomic sequence, if the corresponding protein derivatives are known. They can also be used to identify potential coding sequences of a newly sequenced genome, by using proteins from related species. Previously known algorithms either solve a simplified formulation, or sacrifice optimality to achieve practical implementation. In this paper, we present a comprehensive formulation of the DNA-protein alignment problem, and an algorithm to compute the optimal alignment in O(mn) time using only four tables of size (m + 1) × (n + 1), where m and n are the lengths of the DNA and protein sequences, respectively. We also developed a Protein and DNA Alignment program, PanDA, that implements the proposed solution. Experimental results indicate that our algorithm produces high-quality alignments.



SESSION III - TRANSCRIPTOMES 1

A Theoretical Analysis of Gene Selection
Sach Mukherjee, Stephen J. Roberts - Department of Engineering Science, University of Oxford, UK

A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data ("gene selection"). Numerous gene selection algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like small sample sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. In this paper we propose a theoretical analysis of gene selection, in which the probability of successfully selecting relevant genes, using a given gene ranking function, is explicitly calculated in terms of population parameters. The theory developed is applicable to any ranking function which has a known sampling distribution, or one which can be approximated analytically. In contrast to empirical methods, the analysis can easily be used to examine the behaviour of gene selection algorithms under a wide variety of conditions, even when the number of genes involved runs into the tens of thousands. The utility of our approach is illustrated by comparing three well-known gene ranking functions.



Improved Fourier Transform Method for Unsupervised Cell-cycle Regulated Gene Prediction
Karuturi R. Krishna Murthy, Liu Jian Hua - Genome Institute of Singapore, Republic of Singapore

Motivation: Cell-cycle regulated gene prediction using microarray time-course measurements of the mRNA expression levels of genes has been used by several researchers. The popularly employed approach is the Fourier transform (FT) method in conjunction with a set of known cell-cycle regulated genes. In the absence of training data, the Fourier transform method is sensitive to noise, to the additive monotonic component arising from cell population growth, and to deviations from a strictly sinusoidal form of expression. Known cell-cycle regulated genes may not be available for certain organisms, and using them for training may bias the prediction.

Results: In this paper we propose an Improved Fourier Transform (IFT) method which accounts for several factors such as the monotonic additive component of cell-cycle expression and irregular or partial-cycle sampling of gene expression. The proposed algorithm does not need any known cell-cycle regulated genes for prediction. Apart from alleviating the need for a training set, it also removes bias towards genes similar to the training set. We have evaluated the method on two publicly available datasets: yeast cell-cycle data and HeLa cell-cycle data. The proposed algorithm performed competitively with the supervised Fourier transform method on both datasets, and outperformed other unsupervised methods such as Partial Least Squares (PLS) and Single Pulse Modeling (SPM). The method is easy to comprehend and implement, and runs faster.

Software: http://giscompute.gis.nus.edu.sg/cdcAnal
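The FT periodicity score that such methods build on can be sketched for a single gene's time course (a crude illustration: mean removal here is only a stand-in for IFT's explicit modeling of the monotonic growth component):

```python
import math

def periodicity_score(expr, period, dt=1.0):
    """Score one gene's expression time course by the magnitude of its
    discrete Fourier component at the cell-cycle period. The mean is
    subtracted first (a crude stand-in for the monotonic growth trend)."""
    n = len(expr)
    mean = sum(expr) / n
    w = 2 * math.pi * dt / period
    re = sum((x - mean) * math.cos(w * t) for t, x in enumerate(expr))
    im = sum((x - mean) * math.sin(w * t) for t, x in enumerate(expr))
    return math.hypot(re, im) / n

# A cycling gene (8-sample period plus growth drift) vs. growth drift alone.
cycling = [math.sin(2 * math.pi * t / 8) + 0.05 * t for t in range(16)]
flat = [0.05 * t for t in range(16)]
print(periodicity_score(cycling, 8) > periodicity_score(flat, 8))  # True
```

The residual leakage that the pure drift profile still produces at the cell-cycle frequency is exactly the kind of artifact the abstract says the FT method is sensitive to without training data.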



SESSION IV - EVOLUTION AND PHYLOGENY

Algorithms for Association Study Design Using a Generalized Model of Haplotype Conservation
Russell Schwartz - Department of Biological Sciences and School of Computer Science, Carnegie Mellon University

There is considerable interest in computational methods to assist in the use of genetic polymorphism data for locating disease-related genes. Haplotypes, contiguous sets of correlated variants, may provide a means of reducing the difficulty of the data analysis problems involved. The field to date has been dominated by methods based on the "haplotype block" hypothesis, which assumes discrete population-wide boundaries between conserved genetic segments, but there is strong reason to believe that haplotype blocks do not fully capture true haplotype conservation patterns. In this paper, we address the computational challenges of using a more flexible, block-free representation of haplotype structure called the "haplotype motif" model for downstream analysis problems. We develop algorithms for htSNP selection and missing data inference using this more generalized model of sequence conservation. Application to a dataset from the literature demonstrates the practical value of these block-free methods.



Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees
Usman W. Roshan - Department of Computer Science, University of Texas at Austin
Bernard M. E. Moret - Department of Computer Science, University of New Mexico
Tandy Warnow - Department of Computer Science, University of Texas at Austin; Radcliffe Institute for Advanced Study
Tiffani L. Williams - Department of Computer Science, University of New Mexico

Phylogenetic trees are commonly reconstructed based on hard optimization problems such as maximum parsimony (MP) and maximum likelihood (ML). Conventional MP heuristics for producing phylogenetic trees produce good solutions within reasonable time on small datasets (up to a few thousand sequences), while ML heuristics are limited to smaller datasets (up to a few hundred sequences). However, since MP (and presumably ML) is NP-hard, such approaches do not scale when applied to large datasets. In this paper, we present a new technique called Recursive-Iterative-DCM3 (Rec-I-DCM3), which belongs to our family of Disk-Covering Methods (DCMs). We tested this new technique on ten large biological datasets ranging from 1,322 to 13,921 sequences and obtained dramatic speedups as well as significant improvements in accuracy (better than 99.99%) in comparison to existing approaches. Thus, high-quality reconstructions can be obtained for datasets at least ten times larger than was previously possible.



MinPD: Distance-based Phylogenetic Analysis and Recombination Detection of Serially-Sampled HIV Quasispecies
Patricia Buendia, Giri Narasimhan - Bioinformatics Research Group (BioRG), School of Computer Science, Florida International University

A new computational method to study within-host viral evolution is explored to better understand the evolution and pathogenesis of viruses. Traditional phylogenetic tree methods are better suited to study relationships between contemporaneous species, which appear as leaves of a phylogenetic tree. However, viral sequences are often sampled serially from a single host. Consequently, data may be available at the leaves as well as the internal nodes of a phylogenetic tree. Recombination may further complicate the analysis. Such relationships are not easily expressed by traditional phylogenetic methods. We propose a new algorithm, called MinPD, based on minimum pairwise distances. Our algorithm uses multiple distance matrices and correlation rules to output a MinPD tree or network. We test our algorithm using extensive simulations and apply it to a set of HIV sequence data isolated from one patient over a period of ten years. The proposed visualization of the phylogenetic tree/network further enhances the benefits of our methods.



SESSION V - PROTEOMICS

SPIDER: Software for Protein Identification from Sequence Tags with De Novo Sequencing Error

Yonghua Han, Bin Ma, Kaizhong Zhang - Department of Computer Science, University of Western Ontario, Canada

For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach, and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software.
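The characteristic de novo error, one residue segment replaced by another of nearly equal mass, is easy to demonstrate with standard monoisotopic residue masses (an illustrative check, not SPIDER's alignment algorithm):

```python
# Monoisotopic residue masses in daltons (standard values, abridged).
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
        "N": 114.04293, "Q": 128.05858, "K": 128.09496, "E": 129.04259}

def same_mass(seg_a, seg_b, tol=0.01):
    """True if two residue segments could be confused in a de novo tag,
    i.e. their total residue masses agree within the mass tolerance."""
    total = lambda seg: sum(MASS[aa] for aa in seg)
    return abs(total(seg_a) - total(seg_b)) < tol

print(same_mass("GG", "N"))  # True: 2 x 57.02146 vs 114.04293
print(same_mass("Q", "K"))   # False: 128.05858 vs 128.09496
```

Pairs such as GG/N or GA/Q are indistinguishable by total mass alone, which is why a tag-matching algorithm must treat such segment swaps as a distinct error class rather than as mutations.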



Estimating and Improving Protein Interaction Error Rates
Patrik D'haeseleer, George M. Church - Lipper Center for Computational Genetics, Harvard Medical School

High throughput protein interaction data sets have proven to be notoriously noisy. Although it is possible to focus on interactions with higher reliability by using only those that are backed up by two or more lines of evidence, this approach invariably throws out the majority of available data. A more optimal use could be achieved by incorporating the probabilities associated with all available interactions into the analysis.

We present a novel method for estimating error rates associated with specific protein interaction data sets, as well as with individual interactions given the data sets in which they appear. As a bonus, we also get an estimate for the total number of protein interactions in yeast. Certain types of false positive results can be identified and removed, resulting in a significant improvement in quality of the data set. For co-purification data sets, we show how we can reach a tradeoff between the "spoke" and "matrix" representation of interactions within co-purified groups of proteins to achieve an optimal false positive error rate.
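The difference between the "spoke" and "matrix" representations of a co-purified group can be made concrete (hypothetical complex members used purely for illustration):

```python
from itertools import combinations

def spoke_pairs(bait, preys):
    """Spoke representation: infer only bait-prey interactions."""
    return {frozenset((bait, p)) for p in preys}

def matrix_pairs(bait, preys):
    """Matrix representation: infer every pair in the purified group."""
    return {frozenset(pair) for pair in combinations([bait] + preys, 2)}

# Hypothetical pull-down: bait Cdc28 co-purifies with three preys.
bait, preys = "Cdc28", ["Cln2", "Clb5", "Cks1"]
print(len(spoke_pairs(bait, preys)), len(matrix_pairs(bait, preys)))  # 3 6
```

For a group of size k the spoke model yields k-1 pairs and the matrix model k(k-1)/2, so interpolating between the two trades coverage against false positives, the tradeoff the abstract optimizes.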



Automated Protein Classification Using Consensus Decision
Tolga Can*, Orhan Çamoglu*, Ambuj K. Singh, Yuan-Fang Wang - Department of Computer Science, University of California, Santa Barbara
* These authors contributed equally to this work.

We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. High accuracy is achieved by combining the decisions of multiple methods using the consensus of a committee (or ensemble) classifier. Our technique is rooted in machine learning, which shows that by judiciously employing component classifiers, an ensemble classifier can be constructed to outperform its components. We use two sequence- and three structure-comparison tools as component classifiers. Given a protein structure, using the joint hypothesis, we first determine if the protein belongs to an existing category (family, superfamily, fold) in the SCOP hierarchy. For the proteins that are predicted as members of the existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that we can significantly improve the classification accuracy compared to the individual component classifiers. In particular, we achieve error rates that are 3-12 times lower than the individual classifiers' error rates at the family level, 1.5-4.5 times lower at the superfamily level, and 1.1-2.4 times lower at the fold level.



Separation of Ion Types in Tandem Mass Spectrometry Data Interpretation—A Graph-Theoretic Approach
Bo Yan #[1,2], Chongle Pan #[2,4], Victor N. Olman [1,2], Robert L Hettich [3], Ying Xu [1,2]
[1] Department of Biochemical and Molecular Biology, University of Georgia
[2] Computational Biology Institute, Oak Ridge National Laboratory
[3] Chemical Sciences Division, Oak Ridge National Laboratory
[4] Genome Science and Technology Graduate School, University of Tennessee
#Both authors contributed equally to this work


Mass spectrometry is one of the most popular analytical techniques for identification of individual proteins in a protein mixture, one of the basic problems in proteomics. It identifies a protein through its unique mass spectral pattern. While the problem is theoretically solvable, it remains computationally challenging. One of the key challenges comes from the difficulty in distinguishing the N- and C-terminus ions, mostly b- and y-ions respectively. In this paper, we present a graph algorithm for separating b- from y-ions in a set of mass spectra. We represent each spectral peak as a node and consider two types of edges: a type-1 edge connects two peaks possibly of the same ion type, and a type-2 edge connects two peaks possibly of different ion types, predicted based on local information. The ion-separation problem is then formulated and solved as a graph partition problem: partition the graph into three subgraphs, corresponding to b-ions, y-ions and others respectively, so as to maximize the total weight of type-1 edges while minimizing the total weight of type-2 edges within each subgraph. We have developed a dynamic programming algorithm that rigorously solves this graph partition problem and implemented it as a computer program, PRIME. We have tested PRIME on 18 datasets of highly accurate FT-ICR tandem mass spectra and found that it achieved ~90% accuracy in separating b- and y-ions.
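One plausible local rule for predicting type-2 edges, that a b-ion and its complementary y-ion sum to the precursor mass plus two protons, can be sketched as follows (toy peak list and a hedged sketch; the paper's actual edge predictors and weights are more involved):

```python
def complementary_pairs(peaks, precursor_mass, tol=0.5):
    """Candidate type-2 edges: peak pairs whose m/z values sum to the
    neutral precursor mass plus two protons, as a singly charged b-ion
    and its complementary y-ion do. A rough sketch of one local rule."""
    PROTON = 1.00728  # proton mass in daltons
    target = precursor_mass + 2 * PROTON
    return [(a, b) for i, a in enumerate(peaks) for b in peaks[i + 1:]
            if abs(a + b - target) < tol]

# Toy spectrum: 175.1 and 827.0 are complementary for a 1000.08 Da peptide.
print(complementary_pairs([147.1, 175.1, 504.3, 827.0], 1000.08))
```

Peaks linked by such an edge are pushed into different subgraphs by the partition objective, while peaks consistent with the same ion ladder would be linked by type-1 edges and kept together.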



SESSION VI - TRANSCRIPTOMES 2

Minimum Entropy Clustering and Applications to Gene Expression Analysis
Haifeng Li - University of California Riverside
Keshu Zhang - Rensselaer Polytechnic Institute
Tao Jiang - University of California Riverside

Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy criterion (measured on a posteriori probabilities), which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it can be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvát's structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy equals the probability of error of the nearest neighbor method when α = 2. This is further evidence that the proposed criterion is well suited to clustering. With a nonparametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, whereas most clustering algorithms produce poor partitions in the presence of outliers, our method can correctly reveal the structure of the data and effectively identify outliers at the same time.
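The criterion itself is easy to sketch (this is only the scoring function on assumed posterior estimates, not the paper's iterative minimizer): average the entropy of each observation's posterior over clusters, using either Shannon entropy or the structural α-entropy.

```python
import math

def shannon(p):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def havrda_charvat(p, alpha):
    """Havrda-Charvat structural alpha-entropy; alpha -> 1 recovers Shannon."""
    if alpha == 1:
        return shannon(p)
    return (sum(q ** alpha for q in p) - 1) / (2 ** (1 - alpha) - 1)

def min_entropy_criterion(posteriors, alpha=1):
    """Conditional entropy of clusters given observations, estimated as the
    average entropy of the per-observation posterior vectors."""
    return sum(havrda_charvat(p, alpha) for p in posteriors) / len(posteriors)

# Confident cluster assignments (peaked posteriors) score lower (better)
# than ambiguous ones.
confident = [[0.99, 0.01], [0.98, 0.02]]
ambiguous = [[0.5, 0.5], [0.6, 0.4]]
print(min_entropy_criterion(confident), min_entropy_criterion(ambiguous))
```

At α = 2 the structural entropy of a posterior vector is 2(1 − Σp²), which is where the link to the nearest-neighbor error rate mentioned in the abstract comes from.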



A Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene Expression Data
Ryo Yoshida - The Graduate University for Advanced Studies, Japan
Tomoyuki Higuchi - Institute of Statistical Mathematics, Japan
Seiya Imoto - Human Genome Center, Institute of Medical Science, University of Tokyo, Japan

When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during density estimation. To overcome this difficulty, we develop a methodological extension of factor analysis. Our approach not only prevents overfitting, but also handles clustering, data compression, and the extraction of a set of genes relevant to explaining the group structure. Its potential usefulness is demonstrated by application to a leukemia dataset.



Gridding and Compression of Microarray Images
Stefano Lonardi, Yu Luo - Department of Computer Science and Engineering, University of California Riverside

With the recent explosion of interest in microarray technology, massive amounts of microarray images are being produced. The storage and transmission of this type of data are becoming increasingly challenging. Here we propose lossless and lossy compression algorithms for microarray images originally digitized at 16 bpp (bits per pixel) that achieve an average of 9.5-11.5 bpp (lossless) and 4.6-6.7 bpp (lossy, with a PSNR of 63 dB). The lossy compression is applied only to the background of the image, thereby preserving the regions of interest. The methods are based on a completely automatic gridding procedure for the image.
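For reference, the PSNR figure quoted above is the standard peak signal-to-noise ratio for 16-bit data. A minimal stdlib sketch of just this metric (not the paper's gridding or compression algorithms), on flat lists of pixel values:

```python
import math

def psnr_16bit(orig, recon):
    """PSNR in dB for 16-bit images: 10*log10(MAX^2 / MSE), MAX = 65535."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    if mse == 0:
        return float("inf")   # identical images
    return 10 * math.log10((65535 ** 2) / mse)

# Tiny hypothetical background patch before/after lossy coding.
orig = [1000, 2000, 3000, 4000]
recon = [1001, 1999, 3002, 3998]
print(psnr_16bit(orig, recon))
```

A reconstruction off by only a few grey levels out of 65536 yields a PSNR well above the 63 dB reported for the background coder.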



SESSION VII - APPLIED BIOINFORMATICS

PoPS: A Computational Tool for Modeling and Predicting Protease Specificity
Sarah E. Boyd, Maria Garcia de la Banda - School of Computer Science & Software Engineering and Victorian Bioinformatics Consortium, Monash University, Australia
Robert N. Pike, James C. Whisstock - Department of Biochemistry & Molecular Biology and Victorian Bioinformatics Consortium, Monash University, Australia
George B. Rudy - Genetics & Bioinformatics Division, Walter & Eliza Hall Institute of Medical Research, Australia

Proteases play a fundamental role in the control of intra- and extracellular processes by binding and cleaving specific amino acid sequences. Identifying these targets is extremely challenging. Current computational attempts to predict cleavage sites are limited, typically representing these amino acid sequences as patterns or frequency matrices. Here we present PoPS, a publicly accessible bioinformatics tool (http://pops.csse.monash.edu.au/) that provides a novel method for building computational models of protease specificity which, while still being based on these amino acid sequences, can be built from any experimental data or expert knowledge available to the user. PoPS specificity models can be used to predict and rank likely cleavages within a single substrate, and within entire proteomes. Other factors, such as the secondary or tertiary structure of the substrate, can be used to screen out unlikely sites. Furthermore, the tool also provides facilities to infer, compare and test models, and to store them in a publicly accessible database.
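As a point of comparison, a minimal sketch of the frequency-matrix style of specificity model that the abstract contrasts with PoPS: build a position-specific log-odds matrix from (hypothetical) known cleavage-site sequences, then score every window of a substrate. All sequences and scores here are invented for illustration.

```python
import math

def build_matrix(sites, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudo=0.5):
    """Position-specific log-odds matrix (vs. uniform background) from
    aligned known cleavage-site sequences, with pseudocounts."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        counts = {aa: pseudo for aa in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({aa: math.log2(counts[aa] / total * len(alphabet))
                       for aa in alphabet})
    return matrix

def scan(substrate, matrix):
    """Score every window of the substrate against the matrix."""
    w = len(matrix)
    return [(i, sum(matrix[j][substrate[i + j]] for j in range(w)))
            for i in range(len(substrate) - w + 1)]

# Hypothetical specificity data: sites with a basic residue at position 3.
matrix = build_matrix(["AAKS", "GAKS", "AARS"])
hits = scan("MGAAKSTT", matrix)
best = max(hits, key=lambda h: h[1])
print(best)
```

The top-scoring window is the one matching the training sites; PoPS generalizes beyond exactly this kind of model by admitting arbitrary experimental data and expert rules.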



Selection of Patient Samples and Genes for Outcome Prediction
Huiqing Liu, Jinyan Li, Limsoon Wong - Institute for Infocomm Research, Singapore

Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of the original training samples. This subset consists only of short-term survivors who died within a short period and long-term survivors who were still alive after a long follow-up time. These extreme training samples yield a clear platform for identifying genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods, an entropy measure and the Wilcoxon rank sum test, so that a set of sharply discriminating features is identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a survival/relapse risk score for drawing Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in the high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores with clinical factors such as the International Prognostic Index score for the DLBCL analysis and tumor stage information for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
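A minimal stdlib sketch (assumed details; the paper's entropy measure is omitted) of the Wilcoxon rank sum statistic used to rank genes by how differently they are expressed in the two extreme survivor groups, with average ranks for ties:

```python
def rank_sum(group_a, group_b):
    """Wilcoxon rank sum statistic: sum of the (1-based, tie-averaged)
    pooled ranks of the values in group_a."""
    pooled = sorted(group_a + group_b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return sum(ranks[v] for v in group_a)

# Hypothetical expression values of one gene in the two extreme groups;
# an extreme rank sum flags the gene as survival-related.
short_term = [8.1, 7.9, 8.4]
long_term = [2.0, 2.2, 1.9]
print(rank_sum(short_term, long_term))
```

Here the short-term group holds the three largest pooled ranks (4, 5, 6), the most extreme possible statistic for these sample sizes.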



SESSION VIII - DATA MINING AND ONTOLOGY 1

Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering
Ying Liu - College of Computing, Georgia Institute of Technology
Brian J. Ciliax - Department of Neurology, Emory University School of Medicine
Karin Borges - Department of Pharmacology, Emory University School of Medicine
Venu Dasigi - Southern Polytechnic State University
Ashwin Ram - College of Computing, Georgia Institute of Technology
Shamkant B. Navathe - College of Computing, Georgia Institute of Technology
Ray Dingledine - Department of Pharmacology, Emory University School of Medicine

One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from the biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted with the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
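The TFIDF scheme compared above can be sketched in a few lines (stop-listing, stemming, and the z-score alternative are omitted; the toy "documents" stand in for per-gene keyword sets extracted from MEDLINE):

```python
import math

def tfidf(docs):
    """Per-document TF-IDF weights: term frequency within the document
    times log(N / document frequency) across the collection."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical keyword documents for three genes.
docs = [["kinase", "cycle", "kinase"],
        ["cycle", "repair"],
        ["repair", "polymerase"]]
w = tfidf(docs)
print(w[0])
```

"cycle" appears in two of the three documents and is therefore down-weighted relative to "kinase", which is frequent in document 0 and unique to it; this is exactly the discriminative effect the comparison in the abstract measures.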



Calculation, Visualization, and Manipulation of MASTs (Maximum Agreement Subtrees)
Shiming Dong, Eileen Kraemer - Computer Science Department, The University of Georgia

Phylogenetic trees are used to represent the evolutionary history of a set of species. Comparison of multiple phylogenetic trees can help researchers find the common classification of a tree group, compare tree construction inferences or obtain distances between trees. We present TreeAnalyzer, a freely available package for phylogenetic tree comparison. A MAST (Maximum Agreement Subtree) algorithm is implemented to compare the trees. Additional features of this software include tree comparison, visualization, manipulation, labeling, and printing.

Availability: http://www.cs.uga.edu/~eileen/TreeAnalyzer



AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names

Raf M. Podowski - AstraZeneca R&D Boston and Karolinska Institutet
John G. Cleary - Reel Two, Ltd. and University of Waikato
Nicholas T. Goncharoff - Reel Two, Inc.
Gregory Amoutzias - AstraZeneca
William S. Hayes - AstraZeneca R&D Boston

Researchers, hindered by a lack of standard gene- and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described which automatically assigns gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high-quality gene disambiguation using scalable automated techniques.
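The model-quality gate in the abstract (keep a per-LLID model only if F-measure > 0.7) is simple to make concrete; the precision/recall numbers and LLID keys below are hypothetical:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-LLID validation results: (precision, recall).
models = {"LLID:672": (0.95, 0.90), "LLID:999": (0.80, 0.40)}
kept = {llid for llid, (p, r) in models.items() if f_measure(p, r) > 0.7}
print(kept)
```

A model with balanced high precision and recall passes the gate, while a high-precision/low-recall model (typical when too few training documents exist) is rejected, mirroring the 7,344 vs. 13,202 split reported above.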



SESSION X - DATA BASE AND ONTOLOGY 2

Comparative Analysis of Gene Sets in the Gene Ontology Space under the Multiple Hypothesis Testing Framework
Sheng Zhong [1], Lu Tian [1], Cheng Li [1,3], Kai-Florian Storch [4], Wing H. Wong [1,2]
[1] Department of Biostatistics, Harvard University
[2] Department of Statistics, Harvard University
[3] Department of Biostatistical Sciences, Dana Farber Cancer Institute
[4] Department of Neurobiology, Harvard Medical School


The Gene Ontology (GO) resource can be used as a powerful tool to uncover the properties shared among, and specific to, a list of genes produced by high-throughput functional genomics studies, such as microarray studies. In the comparative analysis of several gene lists, researchers may be interested in knowing which GO terms are enriched in one list of genes but relatively depleted in another. Statistical tests such as Fisher's exact test or the Chi-square test can be performed to search for such GO terms. However, because multiple GO terms are tested simultaneously, individual p-values from the individual tests do not serve as good indicators for picking GO terms. Furthermore, because these multiple tests are highly correlated, the usual multiple testing procedures that work under an independence assumption are not applicable. In this paper we introduce a procedure, based on the False Discovery Rate (FDR), to treat this correlated multiple testing problem. The procedure calculates a moderately conservative estimate of the q-value for every GO term. We identify the GO terms with q-values below a desired level as the significant GO terms. This procedure has been implemented in the GoSurfer software, a Windows-based graphical data-mining tool freely available at http://www.gosurfer.org.
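For orientation, a stdlib sketch of a Benjamini-Hochberg style q-value computation, one standard way of turning per-GO-term p-values into FDR-controlled q-values. The paper's estimator is adapted to correlated tests; this version assumes the textbook step-up procedure, for illustration only.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values: q_(i) = min over j >= i of p_(j)*m/j,
    computed by walking from the largest p-value down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)   # enforce monotonicity
        qvals[i] = prev
    return qvals

# Hypothetical p-values for five GO terms.
pvals = [0.001, 0.008, 0.04, 0.2, 0.9]
print(bh_qvalues(pvals))
```

Selecting all terms with q-value below, say, 0.05 then controls the expected proportion of false discoveries among the reported GO terms at that level (under independence).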



Gene Ontology Friendly Biclustering of Expression Profiles
Jinze Liu [1], Wei Wang [1], and Jiong Yang [2]
[1] Department of Computer Science, University of North Carolina, Chapel Hill
[2] Department of Computer Science, University of Illinois, Urbana-Champaign


The soundness of clustering in the analysis of gene expression profiles and gene function prediction rests on the hypothesis that genes with similar expression profiles imply strong correlations in their functions in biological activities. Gene Ontology (GO) has become a well-accepted standard for organizing gene function categories. Different gene function categories in GO can have sophisticated relationships, such as "part of" and "overlapping". Until now, no clustering algorithm could generate gene clusters within which the relationships naturally reflect those of the gene function categories in the GO hierarchy. Failure to reflect these relationships may reduce confidence in clustering for gene function prediction. In this paper, we present a new clustering technique, Smart Hierarchical Tendency Preserving clustering (SHTP-clustering), based on a bicluster model, the Tendency Preserving cluster (TP-Cluster). By directly incorporating Gene Ontology information into the clustering process, the SHTP-clustering algorithm yields a TP-Cluster tree within which any subtree can be well mapped to a part of the GO hierarchy. Our experiments on yeast cell cycle data demonstrate that this method is efficient and effective in generating biologically relevant TP-Clusters.



SESSION XI - PATHWAYS AND NETWORKS 1

Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks
Zheng Yun - BIRC, School of Comp. Eng., Nanyang Technological University, Singapore
Kwoh Chee Keong - School of Comp. Eng., Nanyang Technological University, Singapore

It is still an open problem to identify functional relations in o(N·n^k) time for any domain [2], where N is the number of learning instances, n is the number of genes (or variables) in the Gene Regulatory Network (GRN) model and k is the indegree of the genes. To address the problem, we introduce a novel algorithm, DFL (Discrete Function Learning), for reconstructing qualitative models of GRNs from gene expression data. We analyze its average-case complexity of O(k·N·n^2) and its data requirements. We also perform experiments on both synthetic data and the Cho et al. [7] yeast cell cycle gene expression data to validate the efficiency and prediction performance of the DFL algorithm. The experiments on synthetic Boolean networks show that the DFL algorithm is more efficient than current algorithms without loss of prediction performance. The results on the yeast cell cycle gene expression data show that the DFL algorithm can identify biologically significant models with reasonable accuracy, sensitivity and high precision with respect to the literature evidence. We further introduce a method, called the ε function, to deal with noise in data sets. The experimental results show that the ε function method is a good supplement to the DFL algorithm.



Mapping of Microbial Pathways through Constrained Mapping of Orthologous Genes
Victor Olman [1], Hanchuan Peng [1,2], Zhengchang Su [1,2] and Ying Xu [1,2]
[1] Computational Systems Biology Laboratory, Biochemistry and Molecular Biology Department, University of Georgia
[2] Computational Biology Institute, Oak Ridge National Laboratory


We present a novel computer algorithm for mapping biological pathways from one prokaryotic genome to another. The algorithm maps genes in a known pathway to their homologous genes (if any) in a target genome in the way most consistent with (a) predicted orthologous gene relationships, (b) predicted operon structures, and (c) predicted co-regulation relationships of operons. Mathematically, we have formulated this problem as a constrained minimum spanning tree problem (called a Steiner network problem), and demonstrated through applications that this formulation has the desired properties. We have solved this mapping problem using a combinatorial optimization algorithm with guaranteed global optimality, and implemented it as a computer program called P-MAP. Our test results on pathway mapping are highly encouraging: using P-MAP, we have mapped to E. coli a number of pathways of H. influenzae, B. subtilis, H. pylori, and M. tuberculosis whose homologous pathways in E. coli are known, so the mapping accuracy could be checked. We have then mapped known E. coli pathways in the EcoCyc database to the newly sequenced organism Synechococcus sp. WH8102, and predicted 158 Synechococcus pathways. Detailed analyses of the predicted pathways indicate that P-MAP's mapping results are consistent with our general knowledge of (local) pathways. We believe that P-MAP will be a useful tool for microbial genome annotation projects and for inference of individual microbial pathways.



SESSION XII - PATHWAYS AND NETWORKS 2

Multi-knockout Genetic Network Analysis: The Rad6 Example
Alon Kaufman - Center of Neural Computation, Hebrew University, Israel
Martin Kupiec - Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Israel
Eytan Ruppin - School of Computer Science and School of Medicine, Tel Aviv University, Israel

A novel and rigorous Multi-perturbation Shapley Value Analysis (MSA) method has been recently presented [12]. The method addresses the challenge of defining and calculating the functional causal contributions of elements of a biological system. This paper presents the first study applying MSA to the analysis of gene knockout data. The MSA identifies the importance of genes in the Rad6 DNA repair pathway of the yeast S. cerevisiae, quantifying their contributions and characterizing their functional interactions. Incorporating additional biological knowledge, a new functional description of the Rad6 pathway is provided, predicting the existence of additional DNA polymerase and RFC-like complexes. The MSA is the first method for rigorously analyzing multi-knockout experiments, which are likely to soon become a standard and necessary tool for analyzing complex biological systems.
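The Shapley value at the heart of MSA can be sketched exactly for tiny systems (a toy, with a hypothetical characteristic function; in MSA the values come from multi-knockout phenotype measurements): each gene's contribution is its marginal effect on the system's performance, averaged over all orders in which the genes could be added.

```python
from itertools import permutations

def shapley(players, value):
    """Exact Shapley values: average marginal contribution of each player
    over all orderings.  'value' maps a frozenset of players to a score.
    Exponential-time; fine for a handful of genes."""
    contrib = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            contrib[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: c / len(perms) for p, c in contrib.items()}

# Hypothetical game: repair works only if both "rad6" and "rad18" are
# present; "dummy" never matters.
def value(s):
    return 1.0 if {"rad6", "rad18"} <= s else 0.0

print(shapley(["rad6", "rad18", "dummy"], value))
```

The two essential genes share the credit equally (0.5 each) and the irrelevant gene gets zero, illustrating how MSA quantifies functional contributions and interactions from knockout data.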



A Hierarchical Mixture of Markov Models for Finding Biologically Active Metabolic Paths Using Gene Expression and Protein Classes
Hiroshi Mamitsuka - Institute for Chemical Research, Kyoto University, Japan
Yasushi Okuno - Graduate School of Pharmaceutical Sciences, Kyoto University, Japan

With the recent development of high-throughput experimental techniques, the type and volume of accumulating biological data have increased enormously in recent years. Mining different types of data together may lead to new biological insights. We present a new methodology for systematically combining three different datasets to find biologically active metabolic paths/patterns. The method consists of two steps: first, it efficiently synthesizes metabolic paths from a given set of chemical reactions that are already known and whose enzymes are co-expressed. It then represents the obtained metabolic paths in a more comprehensible way by estimating the parameters of a probabilistic model on these synthesized paths. This model is built on the assumption that the entire set of chemical reactions corresponds to a Markov state transition diagram. Furthermore, the model is a hierarchical latent variable model, containing a set of protein classes as a latent variable, for clustering input paths in terms of existing knowledge of protein classes. We tested the performance of our method on a main pathway of glycolysis, and found that it achieved higher predictive performance in classifying gene expressions than other unsupervised methods. Further analysis of the estimated parameters of our probabilistic models showed that biologically active paths cluster into only two or three patterns for each expression experiment type, and each pattern suggests new long-range relations in the glycolysis pathway.



SESSION XIII - GENOMICS 2

FastR: Fast Database Search Tool for Non-coding RNA
Vineet Bafna, Shaojie Zhang - Department of Computer Science and Engineering, University of California San Diego

The discovery of novel non-coding RNAs has been among the most exciting recent developments in biology. Yet many more remain undiscovered. It has been hypothesized that there is in fact an abundance of functional non-coding RNA (ncRNA) with various catalytic and regulatory functions. Computational methods tailored specifically for ncRNA are being actively developed. As the inherent signal for ncRNA is weaker than that for protein-coding genes, comparative methods offer the most promising approach, and are the subject of our research. We consider the following problem: given an RNA sequence with a known secondary structure, efficiently compute all structural homologs (scored as a function of sequence and structural similarity) in a genomic database. Our approach, based on structural filters that eliminate a large portion of the database while retaining the true homologs, allows us to search a typical bacterial database in minutes on a standard PC, with high sensitivity and specificity. This is two orders of magnitude faster than currently available software for the problem.



Recurrence Time Statistics: Versatile Tools for Genomic DNA Sequence Analysis
Yinhe Cao [1,3], Wen-wen Tung [2], J.B. Gao [3]
[1] 1026 Springfield Drive, Campbell, CA
[2] National Center for Atmospheric Research
[3] Department of Electrical and Computer Engineering, University of Florida


With the completion of the human genome and the genomes of a few model organisms, and with the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools capable of identifying structures and extracting features from DNA sequences. Among the more important structures in a DNA sequence are repeat-related ones. Often they have to be masked before protein-coding regions along a DNA sequence can be identified or redundant expressed sequence tags (ESTs) sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features in a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics; it has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient; it allows us to carry out whole-genome analyses on a PC.
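The basic statistic is easy to illustrate (a sketch, not the paper's tool): for every k-mer occurrence, record the distance back to its previous occurrence. Peaks in the resulting histogram expose periodic and repeat-related structure, e.g. the period-3 signal of coding regions that underlies codon indices.

```python
def recurrence_times(seq, k):
    """Histogram of recurrence times: for each k-mer occurrence, the
    distance to the previous occurrence of the same k-mer."""
    last_seen = {}
    times = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            t = i - last_seen[kmer]
            times[t] = times.get(t, 0) + 1
        last_seen[kmer] = i
    return times

# A perfectly period-3 toy sequence: every 3-mer recurs at distance 3.
print(recurrence_times("ACGACGACGACG", 3))
```

On real genomic sequence the histogram is of course mixed, but enrichment at particular recurrence times flags tandem repeats and reading-frame periodicity.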



SESSION XIV - TRANSCRIPTOMES 3

Profile-based String Kernels for Remote Homology Detection and Motif Extraction
Rui Kuang [1], Eugene Ie [1,3], Ke Wang [1], Kai Wang [2], Mahira Siddiqi [2], Yoav Freund [1,3,4], Christina Leslie [1,3,4]
[1] Department of Computer Science, Columbia University
[2] Department of Biomedical Informatics, Columbia University
[3] Center for Computational Learning Systems, Columbia University
[4] Center for Computational Biology and Bioinformatics, Columbia University


We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained; in fact, the time needed to run PSI-BLAST to build the profiles significantly exceeds both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how the learned SVM classifier can be used to extract "discriminative sequence motifs", short regions of the original profile that contribute almost all of the weight of the SVM classification score, and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to those of cluster kernels while providing much better scalability to large datasets.
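For context, a sketch of the plain k-mer spectrum kernel, the profile-free ancestor of the kernels above. The profile kernel replaces exact k-mer counts with counts over position-dependent mutation neighborhoods derived from PSI-BLAST profiles; that machinery is omitted here, and the sequences are invented.

```python
def kmer_counts(seq, k):
    """Count every contiguous k-length subsequence of seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(c * b.get(kmer, 0) for kmer, c in a.items())

# Related sequences share many k-mers; unrelated ones share few or none.
print(spectrum_kernel("HEAGAWGHEE", "HEAGAWGHEA", 3),
      spectrum_kernel("HEAGAWGHEE", "PAWPAWPAWP", 3))
```

An SVM trained with such a kernel scores sequences by shared subsequence content; the profile-based version extends the match from exact k-mers to their mutation neighborhoods, which is what buys sensitivity to remote homologs.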



MISAE: A New Approach for Regulatory Motif Extraction
Zhaohui Sun, Jingyi Yang, Jitender S. Deogun - Department of Computer Science and Engineering, University of Nebraska Lincoln

The recognition of regulatory motifs of co-regulated genes is essential for understanding regulatory mechanisms. However, automatic extraction of regulatory motifs from a given data set of upstream non-coding DNA sequences of a family of co-regulated genes is difficult, because regulatory motifs are often subtle and inexact. The problem is further complicated by corruption of the data sets. In this paper, a new approach called Mismatch-allowed Probabilistic Suffix Tree Motif Extraction (MISAE) is proposed. It combines the mismatch-allowed probabilistic suffix tree, a probabilistic model, with local prediction for the extraction of regulatory motifs. The proposed approach is tested on 15 co-regulated gene families and compares favorably with other state-of-the-art approaches. Moreover, MISAE performs well on "corrupted" data sets: it is able to extract the motif from a "corrupted" data set in which fewer than one fourth of the sequences contain the real motif.



Biclustering in Gene Expression Data by Tendency
Jinze Liu [1], Jiong Yang [2], and Wei Wang [1]
[1] Department of Computer Science, University of North Carolina, Chapel Hill
[2] Department of Computer Science, University of Illinois, Urbana-Champaign


The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Clustering is the most popular approach to analyzing gene expression data and has indeed proven successful in many applications. Our work focuses on discovering subsets of genes that exhibit similar expression patterns along a subset of conditions in the gene expression matrix. Specifically, we look for Order Preserving Clusters (OP-Clusters), in each of which a subset of genes induces a similar linear ordering along a subset of conditions. The pioneering OPSM model [3], which enforces the strict order shared by the genes in a cluster, is included in our model as a special case. Our model is more robust than OPSM because similarly expressed conditions are allowed to form order-equivalent groups and no restriction is placed on the order within a group. Guided by our model, we design and implement a deterministic algorithm, OPC-Tree, to discover OP-Clusters. Experimental study on two real datasets demonstrates the effectiveness of the algorithm for tissue classification and cell cycle identification. In addition, a large percentage of OP-Clusters exhibit significant enrichment of one or more function categories, which implies that OP-Clusters indeed carry significant biological relevance.
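The membership condition behind an OP-Cluster can be sketched for a single pair of genes (a toy with hypothetical expression values; discovering maximal gene/condition subsets is what the paper's algorithm actually does): two genes belong together on a subset of conditions if both rank those conditions in the same order by expression value.

```python
def induced_order(expr, conditions):
    """The ordering of the given conditions induced by expression value."""
    return tuple(sorted(conditions, key=lambda c: expr[c]))

def order_preserving(expr_a, expr_b, conditions):
    """True if both genes induce the same ordering on these conditions."""
    return induced_order(expr_a, conditions) == induced_order(expr_b, conditions)

# Hypothetical expression values for two genes over four conditions.
gene1 = {"c0": 0.1, "c1": 0.9, "c2": 0.5, "c3": 2.0}
gene2 = {"c0": 1.0, "c1": 6.0, "c2": 3.0, "c3": 0.5}

# Same rising tendency c0 < c2 < c1, despite very different raw values...
print(order_preserving(gene1, gene2, ["c0", "c1", "c2"]))
# ...but condition c3 breaks the agreement.
print(order_preserving(gene1, gene2, ["c0", "c1", "c2", "c3"]))
```

This is why the model captures coherent tendencies rather than coherent magnitudes; the order-equivalent groups mentioned above further relax the comparison for near-tied conditions.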




© 2004 IEEE Computational Systems Bioinformatics