Universidade Nacional da Irlanda em Galway -- Ciência sem Fronteiras

PhD Project Template

Use one form per project. Please complete & submit to international@nuigalway.ie as soon as possible, and by 27/11/2012. In your email, begin the subject line with [SWB] (be sure to use square brackets) to ensure that your email is filed correctly. Emails will be automatically filed.

PI name & contact details: Haixuan Yang (will be at NUIG in January 2013), hx.yang@gmail.com

School: School of Mathematics, Statistics and Applied Mathematics

Has project been agreed with head (or nominee) of proposed registration school?

Research Centre / group affiliation: Bioinformatics and Statistical Modelling

Research group / centre website: http://www.maths.nuigalway.ie/

PI website / link to CV: http://www.cs.rhul.ac.uk/home/haixuan/
http://scholar.google.com/citations?user=8l9RrysAAAAJ&hl=en

Brief summary of PI research / research group / centre activity (2 or 3 lines max): If you are already involved in relevant scholarly activities relating to Brazil, please mention them here.

A cell can be viewed as a complex network of inter-relating proteins, nucleic acids and other biomolecules. I am working on learning/mining valuable information from bio-molecular networks via statistical approaches.

Title & brief description of PhD project (suitable for publication on web):

Optimal co-expression graph building by microarray data re-weighting for improving Guilt-by-Association Analyses

The Guilt-by-Association (GBA) principle, according to which genes with similar expression profiles are functionally associated, is widely applied for functional analyses using large heterogeneous collections of transcriptomics data.
However, GBA has many perceived limitations: it has been observed that some genes known to be involved in a particular pathway are invariably missed, whereas other apparently unrelated genes exhibit expression profiles that are strikingly similar to those of bona fide pathway components [1]. Although many factors may contribute to these limitations, we focus on one: the data are not fully exploited. We therefore want to investigate the effect of re-weighting microarray data. The intuition is that different microarray data may play different roles in revealing pathway activation, so re-weighting them may result in better functional analyses.

In our previous work [2], we found that the simple use of large collections can hamper GBA functional analysis for genes whose expression is condition specific, and that a smaller set of condition-related experiments, selected for relevance to a given biological function category, can improve GBA. That work can be considered a re-weighting of microarray data: microarrays in the selected experiments are weighted by one, while the remaining ones are weighted by zero. From this point of view, the proposed project is a generalization of the work in [2].

GBA-based analyses often begin with the calculation of similarity between gene expression profiles using a metric such as Pearson’s correlation. In this project, we aim to build a better co-expression graph based on re-weighted microarray data in order to improve GBA. This project has three stages:

1. Investigate data from a single experiment

Within a single experiment, some treatments, time points or samples may be more informative than others in reflecting a change in response to a given gene function. When building the correlation graph, it is therefore desirable to assign higher weights to those more critical treatments, time points or samples.
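To make the weighting idea concrete, the following is a minimal sketch of a sample-weighted Pearson correlation between gene expression profiles. It is an illustration under stated assumptions, not the method the project will necessarily adopt: the function name `weighted_corr_matrix`, the genes-by-samples array layout, and the normalization of the weights to sum to one are all hypothetical choices made here. With equal weights it reduces to the ordinary Pearson correlation, and with 0/1 weights it reduces to the correlation over the selected samples only, which is the special case described for [2].

```python
import numpy as np


def weighted_corr_matrix(expr, weights):
    """Sample-weighted Pearson correlation between gene profiles.

    expr    : array of shape (n_genes, n_samples); layout is an assumption.
    weights : non-negative array of shape (n_samples,), one weight per sample.
    Returns an (n_genes, n_genes) correlation matrix.
    """
    expr = np.asarray(expr, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or w.shape[0] != expr.shape[1]:
        raise ValueError("weights must have one entry per sample (column)")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")

    w = w / w.sum()                        # normalize so the weights sum to one
    mean = expr @ w                        # weighted mean of each gene
    centered = expr - mean[:, None]        # center each gene profile
    cov = (centered * w) @ centered.T      # weighted covariance matrix
    sd = np.sqrt(np.diag(cov))             # weighted standard deviations
    if np.any(sd == 0):
        raise ValueError("a gene has zero weighted variance")
    return cov / np.outer(sd, sd)


if __name__ == "__main__":
    # Synthetic data for illustration only: 4 genes, 6 samples.
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(4, 6))
    # Hypothetical weights that emphasize the first three samples.
    weights = np.array([2.0, 2.0, 2.0, 0.5, 0.5, 0.5])
    print(np.round(weighted_corr_matrix(expr, weights), 3))
```

A co-expression graph could then be obtained from this matrix, for example by thresholding or soft-thresholding its entries. That step is left out because the choice of construction method is part of the project itself.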
We plan to use a semi-supervised algorithm that finds the unknown weights by optimizing a criterion such as the ratio of two averages: the average correlation between gene pairs in which both genes belong to the functional category of interest, and the average correlation between gene pairs in which only one gene belongs to that category. Maximizing this ratio corresponds, when formulated in terms of a co-expression graph, to minimizing the Cheeger ratio of the set of genes that belong to the functional category. Once the weights are found, we build the co-expression graph using the weighted transcriptomics data. The graph construction method in [12] may be used in this project.

2. Investigate a large collection of heterogeneous transcriptomics data

Often, GBA-based analyses are performed over large heterogeneous collections of experiments. One reason is that transcriptional profiling data are currently the most abundant of the high-throughput data types available. Another is that experiments on different platforms under different lab conditions are often complementary. It is therefore desirable to perform our analyses over large heterogeneous collections of experiments. The idea is the same as in stage 1. However, the number of parameters (weights) becomes larger, so a simple algorithm that minimizes the Cheeger ratio may overfit. If overfitting is observed, some form of regularization should be imposed on the weights in the algorithm. Finally, we can build a graph by calculating the correlation on an integrated dataset obtained by concatenating the weighted transcriptomics data.

3. Compare the performance of the two graph-building methods
After the previous two stages, we have two options: combining the correlation graphs induced by the individual transcriptomics datasets using the method found in stage 1, or building the graph directly using the method found in stage 2. We will compare the two to see which performs better. The first option, however, is not straightforward, because there are many methods for combining correlation graphs: linear combination, assuming that the data are complementary [3]; combination assuming that the data are independent [4]; and maximum combination [5].

In this project, we shall investigate model organisms such as yeast, Arabidopsis, and human. The yeast microarray data can be downloaded from the Many Microbe Microarrays Database [6]. The Arabidopsis microarray data can be downloaded from the NASCArrays database [7]. The human dataset can be downloaded from the Stanford Microarray Database and the Gene Expression Omnibus, as in [8]. The gene functions mentioned above can be defined by GO terms [11].

This is a semi-open project. The PhD student is encouraged to develop his/her own ideas, such as investigating more datasets and more organisms or extending the scope of the project, as long as the PI approves them. For example, it is possible to employ semantic similarity [9] to guide the learning of the weights. For related work by the PI, see [2,9,10].

References

[1] Quackenbush, J. (2003). Microarrays--Guilt by Association. Science, 302(5643), 240-241.
[2] Bhat, P., et al. (2012). Computational Selection of Transcriptomics Experiments Improves Guilt-by-Association Analyses. PLoS ONE, 7(8): e39681.
[3] Mostafavi, S., and Morris, Q. (2010). Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 26(14): 1759-1765.
[4] von Mering, C., et al. (2005). STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucl. Acids Res., 33(suppl 1): D433-D437.
[5] Michaut, M., and Bader, G. D. (2012).
Multiple Genetic Interaction Experiments Provide Complementary Information Useful for Gene Function Prediction. PLoS Comput Biol, 8(6): e1002559.
[6] Faith, J. J., et al. (2007). Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucl. Acids Res., 36(suppl 1): D866-D870.
[7] Craigon, D. J., et al. (2004). NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service. Nucl. Acids Res., 32(suppl 1): D575-D577.
[8] Lee, H. K., et al. (2004). Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Res., 14: 1085-1094.
[9] Yang, H., et al. (2012). Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics, 28, 1383-1389.
[10] Yang, H., et al. A graph-theoretic approach to predict protein function by integrating large scale heterogeneous data. In preparation.
[11] Ashburner, M., et al. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.
[12] Zhang, B. and Horvath, S. (2005). A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), 1544-6115.

Unique selling points of PhD project in NUI Galway: NUI Galway projects should emphasise features that are not typically available in Brazil – specific equipment, multi-disciplinarity, aspects of structured programme, links with industry, placements, links with other research groups etc.

This is a multi-disciplinary project. It needs interaction with people who have knowledge in statistics, biology, bioinformatics, and machine learning.
Our research group is an ideal place for this interaction to happen. See, for example, the research profiles of Prof. John Hinde, Prof. Cathal Seoighe, Dr. Tim Downing, and me.

Name & contact details for project queries, if different from PI named above:

Please indicate the disciplines whose graduates should apply: Mathematics, Statistics, Computer Science, Biology

Ciência sem Fronteiras / Science Without Borders Priority Area: Please indicate the specific programme priority area under which the proposed PhD project fits; choose only one (tick box):

Engineering and other technological areas
Pure and Natural Sciences (e.g. mathematics, physics, chemistry) √
Health and Biomedical Sciences
Information and Communication Technologies (ICTs)
Aerospace
Pharmaceuticals
Oil, Gas and Coal
Renewable Energy
Minerals
Biotechnology
Nanotechnology and New Materials
Technology of prevention and remediation of natural disasters
Biodiversity and Bioprospection
Marine Sciences
Creative Industry
New technologies in constructive engineering

Please indicate which of the following applies to this project (referring to Science Without Borders arrangements):

Suitable only as a Full PhD (Y/N): Y
Available to candidates seeking a Sandwich PhD arrangement (Y/N): N
Suitable for either/Don’t know: _____