–
Room P3.10, Mathematics Building
CCC-biclustering: a linear time biclustering algorithm for time-series gene expression data
Several non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated behaviors. These correlated behaviors correspond to expression patterns and can be used to identify relevant biological processes possibly involved in regulatory networks. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are usually used to obtain sub-optimal solutions using reasonable computational resources. In this talk, we examine a particular setting of the problem, where we are concerned with finding biclusters in time-series expression data. In this context, we are interested in finding biclusters with consecutive columns. The motivation for this problem is the fact that biological processes start and finish in an identifiable contiguous period of time, leading to increased (or decreased) activity of sets of genes that can be identified because they form biclusters with contiguous columns. The relevance of biclusters that span consecutive columns and their importance in the identification of gene regulatory processes is thus evident. For this particular setting of the problem, we present an algorithm based on string processing techniques that finds and reports all relevant biclusters in time linear on the size of the expression matrix.