National University of Ireland, Galway -- Ciência sem Fronteiras (Science without Borders)
Haixuan Yang
Will be at NUIG in January 2013
School of Mathematics, Statistics and Applied Mathematics
Research Centre / group
Bioinformatics and Statistical Modelling
A cell can be viewed as a complex network of interrelated proteins, nucleic acids, and other biomolecules. I work on learning and mining valuable information from biomolecular networks
via statistical approaches.
Optimal co-expression graph building by microarray data re-weighting for improving Guilt-by-Association Analyses
The Guilt-by-Association (GBA) principle, according to which genes with similar expression profiles
are functionally associated, is widely applied for functional analyses using large heterogeneous
collections of transcriptomics data. However, GBA has many perceived limitations: it has been
observed that some genes known to be involved in a particular pathway are invariably missed,
whereas other apparently unrelated genes exhibit expression profiles that are strikingly similar to
bona fide pathway components [1]. Although these limitations may have many causes, we focus on
one: the data are not fully exploited. We therefore want to investigate the effect of re-weighting
microarray data. The intuition is that different microarray datasets may play different roles in
revealing pathway activation, so re-weighting them may result in better functional analyses.
In our previous work [2], we found that simply using large collections can hamper GBA
functional analysis for genes whose expression is condition-specific, and that a smaller set of
condition-related experiments, selected for their relevance to a given biological function category,
can improve GBA. That work can be regarded as a re-weighting of microarray data: microarrays in
the selected experiments are weighted by one and the remaining ones by zero. From this point of
view, the proposed project is a generalization of the work in [2].
GBA-based analyses often begin with the calculation of similarity between gene expression profiles
using a metric such as Pearson’s correlation. In this project, we aim to build a better co-expression
graph based on re-weighted microarray data in order to improve GBA. This project has three stages.
1. Investigate data from a single experiment
Within a single experiment, some treatments, time points, or samples may be more informative
than others in reflecting a change related to a given gene function. When building the
correlation graph, it is therefore desirable to assign higher weights to those more critical treatments,
time points, or samples. We plan to use a semi-supervised algorithm that finds the unknown
weights by optimizing a criterion such as maximizing the ratio of (a) the average correlation between
gene pairs in which both genes belong to the functional category of interest to (b) the average
correlation between gene pairs in which only one gene belongs to that category. Formulated in
terms of a co-expression graph, this amounts to minimizing the Cheeger ratio of the set of genes
that belong to the functional category. Once the weights are found, we build the co-expression graph
using the weighted transcriptomics data. The graph construction method in [12] may be used in this stage.
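As an illustration of the stage-1 criterion, the following is a minimal Python sketch, assuming NumPy and an expression matrix with genes in rows and samples in columns. The function names weighted_correlation and within_between_ratio are hypothetical and are not taken from [2] or [12]; the search for the weights themselves is left out.

```python
import numpy as np

def weighted_correlation(X, w):
    """Weighted Pearson correlation between the rows (genes) of X.

    X : array of shape (n_genes, n_samples)
    w : non-negative sample weights of length n_samples (assumed input)
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # normalise the weights to sum to 1
    mean = X @ w                       # weighted mean of each gene profile
    Xc = X - mean[:, None]             # centre each gene profile
    cov = (Xc * w) @ Xc.T              # weighted covariance matrix
    sd = np.sqrt(np.diag(cov))         # weighted standard deviations
    return cov / np.outer(sd, sd)

def within_between_ratio(C, in_category):
    """Average correlation over gene pairs inside the category, divided by
    the average correlation over pairs with exactly one gene in the category."""
    m = np.asarray(in_category, dtype=bool)
    k = m.sum()
    inside = C[np.ix_(m, m)]
    within = (inside.sum() - np.trace(inside)) / (k * (k - 1))  # skip the diagonal
    between = C[np.ix_(m, ~m)].mean()
    return within / between
```

Setting the weights to one for selected samples and zero for the rest gives the plain correlation on the selected samples, which corresponds to the experiment selection described above as a special case. The ratio is unstable when the average between-group correlation is close to zero or negative, so a practical version would probably work with absolute or otherwise transformed correlations; that choice is an assumption, not part of the proposal.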
2. Investigate a large collection of heterogeneous transcriptomics data
Often, GBA-based analyses have been performed over large heterogeneous collections of
experiments. One reason behind this approach is that transcriptional profiling is currently the most
abundant amongst the various high-throughput data types available. Another reason is that
experiments on different platforms under different lab conditions are often complementary.
It is therefore desirable to perform our analyses over large heterogeneous collections of
experiments. The idea is the same as before. However, the number of parameters (weights) becomes
larger, so a simple algorithm that minimizes the Cheeger ratio may suffer from over-fitting. If we observe over-fitting, some form of regularization should be imposed on the
weights in the algorithm. Finally, we can build a graph by calculating correlations on an integrated
dataset obtained by concatenating the weighted transcriptomics data.
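One way to write such a regularized problem is sketched below, under assumptions not stated in the proposal. Let w be the vector of weights, S the set of genes annotated with the functional category, A_w a non-negative adjacency matrix derived from the correlations of the data weighted by w, and lambda >= 0 a hypothetical tuning parameter. The simplex constraint and the squared-norm penalty are illustrative choices; other penalties could serve the same purpose.

```latex
\[
h_w(S) = \frac{\sum_{i \in S,\, j \notin S} A_w(i,j)}
              {\min\bigl(\mathrm{vol}_w(S),\, \mathrm{vol}_w(\bar{S})\bigr)},
\qquad
\mathrm{vol}_w(S) = \sum_{i \in S} \sum_{j} A_w(i,j),
\]
\[
\hat{w} = \arg\min_{w \ge 0,\ \sum_k w_k = 1}
          \; h_w(S) + \lambda \lVert w \rVert_2^2 .
\]
```

With the weights constrained to sum to one, the penalty is smallest for uniform weights, so a large lambda pulls the solution towards the unweighted analysis and lambda = 0 recovers the unregularized minimization of the Cheeger ratio.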
3. Compare the performance of two different graph-building methods
After the previous two stages, we have two options: combining the correlation graphs induced by
individual transcriptomics datasets using the method found in stage 1, or building the graph directly by
the method found in stage 2. We should compare which one is better. However, the first option is
not straightforward, because there are many methods for combining correlation graphs: linear
combination, assuming that the data are complementary [3]; combination assuming that the data are
independent [4]; and maximum combination [5].
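The three combination rules listed above can be sketched in simplified form as follows, assuming NumPy and that every graph is given as an adjacency matrix over the same genes with edge scores already mapped to the interval [0, 1]. The function names are hypothetical, and the cited papers [3], [4], [5] may include additional steps (for example, learning the coefficients or calibrating the scores) that are not reproduced here.

```python
import numpy as np

def combine_linear(graphs, coeffs):
    """Linear combination: weighted sum of the adjacency matrices."""
    return sum(c * G for c, G in zip(coeffs, graphs))

def combine_independent(graphs):
    """Treat each edge score as independent evidence:
    combined = 1 - prod_d (1 - score_d)."""
    return 1.0 - np.prod([1.0 - G for G in graphs], axis=0)

def combine_max(graphs):
    """Maximum combination: keep the strongest score for each edge."""
    return np.maximum.reduce(graphs)

# Two toy graphs over the same two genes (illustrative values only).
A = np.array([[0.0, 0.5], [0.5, 0.0]])
B = np.array([[0.0, 0.2], [0.2, 0.0]])
linear = combine_linear([A, B], [0.5, 0.5])
independent = combine_independent([A, B])
maximum = combine_max([A, B])
```

For the single edge in this toy example the linear rule gives 0.35, the independence rule gives 1 - 0.5 * 0.8 = 0.6, and the maximum rule gives 0.5, which shows how differently the rules treat agreeing evidence.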
In this project, we shall investigate model organisms such as yeast, Arabidopsis, and human. The
yeast microarray data can be downloaded from the Many Microbe Microarrays Database [6]. The Arabidopsis
microarray data can be downloaded from the NASCArrays database [7]. The human dataset can be
downloaded from the Stanford Microarray Database and the Gene Expression Omnibus, as in [8]. The gene
functions mentioned above can be GO terms [11].
This is a semi-open project. The PhD student is encouraged to develop his/her own ideas, such as
investigating more datasets and more organisms, or extending the scope of this project, as long as the
PI approves them. For example, it is possible to employ semantic similarity [9] to guide the
learning of the weights.
For related work, see the PI’s research [2,9,10].
[1] Quackenbush, J. (2003). Microarrays--Guilt by Association. Science, 302(5643), 240-241.
[2] Bhat, P., et al. (2012). Computational Selection of Transcriptomics Experiments Improves Guilt-by-Association Analyses. PLoS ONE, 7(8): e39681.
[3] Mostafavi, S., and Morris, Q. (2010). Fast integration of heterogeneous data sources for
predicting gene function with limited annotation. Bioinformatics, 26 (14): 1759-1765.
[4] von Mering, C., et al. (2005). STRING: known and predicted protein–protein associations,
integrated and transferred across organisms. Nucl. Acids Res., 33 (suppl 1): D433-D437.
[5] Michaut, M., and Bader, G. D. (2012). Multiple Genetic Interaction Experiments Provide
Complementary Information Useful for Gene Function Prediction. PLoS Comput Biol, 8(6):
[6] Faith, J. J., et al. (2007). Many Microbe Microarrays Database: uniformly normalized Affymetrix
compendia with structured experimental metadata. Nucl. Acids Res., 36 (suppl 1): D866-D870.
[7] Craigon, D. J., et al. (2004). NASCArrays: a repository for microarray data generated by NASC’s
transcriptomics service. Nucl. Acids Res., 32 (suppl 1): D575-D577.
[8] Lee, H. K., et al. (2004). Coexpression Analysis of Human Genes Across Many Microarray Data
Sets. Genome Res., 14: 1085-1094.
[9] Yang, H., et al. (2012). Improving GO semantic similarity measures by exploring the
ontology beneath the terms and modelling uncertainty. Bioinformatics, 28, 1383 –
[10] Yang, H., et al. A graph-theoretic approach to predict protein function by
integrating large scale heterogeneous data. In preparation.
[11] Ashburner, M., et al. (2000). Gene ontology: Tool for the unification of biology.
Nature Genetics, 25, 25-29.
[12] Zhang, B. and Horvath, S. (2005). A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular
Biology, 4(1), 1544-6115.
This is a multi-disciplinary project. It requires interaction with people who have knowledge of
statistics, biology, bioinformatics, and machine learning. Our research group is an ideal place
for such interaction. See, for example, the research profiles of
Prof. John Hinde, Prof. Cathal Seoighe, Dr. Tim Downing, and me.
Mathematics, Statistics, Computer Science, Biology
