
Gene Ontology Based Prediction and Analysis of Microarray
Data, GO-PAM
Narinder Singh Sahni
Centre for Computational Biology and Bioinformatics, School of Information Technology,
Jawaharlal Nehru University
DNA microarray technology permits us to analyze the behaviour of several thousand
genes simultaneously. The routine case of analysis involves looking for differentially
expressed genes in two-class problems, e.g. {disease/non-disease}, {stress/control},
{knock-out/wild type} etc. The “traditional” approach to analyzing gene expression data
is to use data mining algorithms for detecting differentially expressed genes, and then
relate these genes to biological pathways.
The biggest drawback of this approach is that the hypothesis follows the actual
analysis of the data. In other words, one first mines the data and then forms some kind of
biological hypothesis. Also, different data mining methods provide different lists
of differentially expressed genes, leading to the discovery of different GO terms for the
same data set.
The sets of differentially expressed genes may have different biological interpretations,
which are hard to decipher. Therefore, it becomes pertinent to infer differences in gene
expression based on the biological background as well. Gene Ontology (GO) provides a
structured vocabulary, organized as a hierarchy in the form of a directed acyclic graph
(DAG), for the annotation of genes and proteins.
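The DAG structure matters when collecting the genes that belong to a GO term, because a term can have more than one parent. The following is a minimal sketch, not the method used in this work. It assumes that a gene annotated to a term also counts for every ancestor of that term, and the term identifiers, gene names and the function genes_for_term are hypothetical placeholders.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:B", "GO:C"],  # two parents, so the structure is a DAG, not a tree
}

# Hypothetical direct annotations: term -> genes annotated to exactly that term.
ANNOTATIONS = {
    "GO:A": {"gene1"},
    "GO:B": {"gene2"},
    "GO:C": {"gene3"},
    "GO:D": {"gene4", "gene5"},
}


def children_of(parents):
    """Invert the child -> parents map into a parent -> children map."""
    children = {}
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            children.setdefault(parent, []).append(child)
    return children


def genes_for_term(term, parents, annotations):
    """Return genes annotated to `term` or to any of its descendant terms."""
    children = children_of(parents)
    seen = {term}            # guards against visiting a term twice via two parents
    queue = deque([term])
    genes = set()
    while queue:
        current = queue.popleft()
        genes |= annotations.get(current, set())
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return genes
```

For example, genes_for_term("GO:B", PARENTS, ANNOTATIONS) collects the gene annotated directly to GO:B together with the genes of its descendant GO:D.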
In this presentation, we present a hypothesis-based approach where we first select the
biological attributes (as described in the GO database) that are of interest. Next, we select
only those genes that are related to the biological attribute of interest. Finally, we build a
model using only the chosen genes to validate whether the chosen biological attribute
shows a significant difference in the condition under study, either accepting or refuting the
hypothesis. One can easily go through the entire list of over 22,000 attributes as described
in the GO database. This approach has several advantages over the traditional method
described above. Aspects such as signal-to-noise ratio due to the sheer number of genes
involved, validation (both statistical and biological), and more importantly gene selection
become much more manageable.
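The three steps described above (choose an attribute, keep only its genes, build and validate a model) can be sketched as follows. This is an illustration under stated assumptions, not the model used in this work: the abstract does not name a classifier, so a nearest-centroid classifier with leave-one-out cross-validation stands in for it, and the function names and toy expression values are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length expression vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the class whose training centroid is closest to sample x."""
    best_label, best_dist = None, math.inf
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        dist = math.dist(x, centroid(rows))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error(expr, labels, gene_names, selected):
    """Leave-one-out error rate using only the genes in `selected`.

    expr is a samples-by-genes matrix; `selected` is the gene set tied to the
    GO attribute under test (assumed to be supplied by the caller).
    """
    idx = [i for i, g in enumerate(gene_names) if g in selected]
    if not idx:
        raise ValueError("no selected gene is present on the array")
    x = [[row[i] for i in idx] for row in expr]
    errors = 0
    for k in range(len(x)):
        train_x = x[:k] + x[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        if nearest_centroid_predict(train_x, train_y, x[k]) != labels[k]:
            errors += 1
    return errors / len(x)


# Hypothetical toy data: 6 samples, 2 genes; only g1 separates the classes.
GENES = ["g1", "g2"]
LABELS = ["disease"] * 3 + ["control"] * 3
EXPR = [
    [5.0, 2.0], [5.2, 3.0], [4.9, 2.5],
    [1.0, 2.4], [1.1, 3.1], [0.9, 2.1],
]

err_informative = loo_error(EXPR, LABELS, GENES, {"g1"})
err_uninformative = loo_error(EXPR, LABELS, GENES, {"g2"})
```

On this toy data the gene set {"g1"} gives a lower error rate than {"g2"}, which is the kind of contrast the approach relies on. Deciding whether a difference is significant would need an additional step, such as a permutation test on the class labels, which this sketch omits.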
Our results on re-analyzing several published human cancer data sets show a
significant improvement in error rates in comparison to what has been reported in the
original articles. The proposed methodology also provides a direct link to relevant
biological attributes and pathways, thus reducing the overall effort required in analyzing
gene expression data.