Politecnico di Milano
School of Information Engineering
Master Degree in Information Engineering
Course “Bioinformatics and Computational Biology for Medicine”

Ontologizer Tutorial and Exercises

Arif Canakoglu
canakoglu@elet.polimi.it

(This tutorial was taken from the Ontologizer web application help.)

Ontologizer Tutorial

You can run the application from the link below. Start it by clicking the Java Web Start button on that page:
http://compbio.charite.de/contao/index.php/ontologizer2.html

Setting Up the Ontologizer

1. Download the file http://compbio.charite.de/tl_files/ontologizer/examples/yeastSampleFiles.zip and unpack it into a directory of your choice, e.g., the desktop. The archive contains sets of genes up- or down-regulated following treatment with sulfometuron methyl, which is an inhibitor of amino acid biosynthesis. The data were gathered from Jia et al. (2000), Global expression profiling of yeast treated with an inhibitor of amino acid biosynthesis, sulfometuron methyl.
2. If you need a proxy to access the internet, open the Preferences window via the Window > Preferences... menu entry within Ontologizer.
3. Enter your proxy configuration in the appropriate line and then press Ok.

File Sets

Each File Set contains a definition file and an association file. Each project uses a File Set to perform the analysis. This lets users manage different file configurations easily. For instance, different projects can use different versions of the definition file, since it can be useful to keep working with the same version of the definition file (which is frequently updated at the GO website) throughout development.

User-Supplied Files

To analyze your experimental data, you need to prepare one file for each of the groups of interest in your experiment (for instance, a list of genes differentially expressed at different time points). We refer to such groups as study sets. Additionally, you need to indicate the population set.
In general, this will be a list of all genes that were tested (for instance, all genes represented on a microarray). The genes should be listed one per line in plain text. (Alternatively, FASTA files can be used if the name of the gene directly follows the '>' sign.) For this tutorial, you can download the yeast study and population files from the Ontologizer website: http://www.charite.de/ch/medgen/ontologizer/howto/index.html. Unpack these files before use.

Creating a New Project

1. To create the new project, press the New Project button in the toolbar of Ontologizer's main window or select the Project > New > Project... menu entry.
2. This brings up the New Project Wizard. First enter a name for the project. For the tutorial, enter sulfometuronMethyl into the Project Name text field, then press the Next button to proceed to the next page.
3. Here you need to indicate the definition file (via the Ontology text field) and the association file (via the Association text field). The Ontologizer comes with predefined File Sets for frequently used species that can be downloaded automatically. We have downloaded the File Set for Yeast above. If we hadn't, the Ontologizer would now download these files automatically in the background. For this tutorial, click on the File Set combo box and choose Yeast. Then press Next, which brings you to the Population Edit page.
4. Now enter the genes of the population set. Use the study set/population set example files downloaded from the Ontologizer homepage as described above. Drag & Drop the file called population.txt into the gene editor field, or use a file selection dialog by clicking on the Append Set... button. Notice that the names of genes with GO annotations are highlighted (you may have to wait for downloads or parsing to complete before the highlighting appears). You can hover the mouse over these entries to see more information about the gene's annotation. Proceed by clicking on the Next button.
5. 
Drag & Drop a study file into the editor area (alternatively, you can again use a file selection dialog by clicking on the Append Set... button). Press Next and repeat the procedure for each study set (file).
6. Press Finish when you have added the last study set. The New Project Wizard window closes and you should now see your new project sulfometuronMethyl in the main window.

Performing the Analysis

The Ontologizer offers multiple methods for searching for GO term overrepresentation and for multiple testing correction. For more information on these topics, please consult the Ontologizer homepage, where you will also find links to publications describing the Ontologizer. For the purposes of this tutorial, we will use the Parent-Child-Union method with a Bonferroni multiple testing correction.

1. Within the main window, select our project, sulfometuronMethyl.
2. From the combo boxes in the toolbar, choose Parent-Child-Union as the calculation method (first combo box) and Bonferroni as the multiple testing correction (second combo box). Then press Analyze.

Exploring the Results

1. The Results Window now appears. Depending on the size and number of the study sets and the type of multiple testing correction chosen, the analysis should complete in a few seconds to a few minutes. As individual study sets are completed, new tabs appear with the results. If you have used all the files of this tutorial, you should see seven tab folders, named after the study sets, once the analysis is completed. The first study set is activated, and within its tab folder the results are presented in the form of a table.
2. Notice that the background of terms whose adjusted p-value falls below the significance level (as given by the widget below the table) is colored according to the sub-ontology and the rank.
(Note that the significant terms are marked in color: terms from biological process are shown in green, terms from molecular function in yellow, and terms from cellular component in magenta.)
3. Now click on one of the terms, e.g., amino acid and derivative. This refreshes the browser in the bottom part of the window to show information about the term, including its parents (more general terms), its children (more specific terms), and the names of the genes to which the term is annotated.
4. To get a graphical overview, press the Preview Graph button (the third from the left in the toolbar).
5. The graph consists of all active terms, as defined by the little checkboxes before every term; by default these are all significant terms.
6. The parameters of the graph view (i.e., the zoom factor and which extent is displayed) can be altered with the buttons or the context menu commands.

Question 1: Try the same analysis without any correction and compare the results.
Answer 1: The results for the study set 15minSMinduced are as follows. With Term-for-Term as the calculation method (first combo box) and Bonferroni as the multiple testing correction, 47 ontology terms were found (with the default threshold of 0.10). Without any correction method, 345 ontology terms are found for the same test.

Question 2: Try the same analysis with different correction methods and explain why the number of terms found differs between the cases.
Answer 2: Again for the study set “15minSMinduced”, with Term-for-Term as the statistical analysis method, running the analysis as shown in class gives the following numbers of terms:

Bonferroni: 47
Bonferroni-Holm: 47
Westfall-Young-Single-Step: 59
Westfall-Young-Step-Down: 59
Benjamini-Hochberg: 163
None: 345

The results are as expected: the stricter the correction (fewer false positives at the cost of more false negatives), the fewer terms are found, and the more permissive the correction (more false positives), the more terms are found.

Question 3: Use different Statistical Analysis methods and compare them.
Answer 3: With Parent-Child-Union as the calculation method (first combo box) and Bonferroni as the multiple testing correction, 24 ontology terms were found (with the default threshold of 0.10). Without any correction method, 272 ontology terms are found for the same test. Term-for-Term therefore yields more terms than Parent-Child-Union (47 and 345, respectively, in Answer 1). The reason is that, as explained in the note below, Parent-Child-Union also uses the ontology information of the terms, that is, their parent terms. Terms can thus be combined toward the root instead of each being reported separately.

Note: Statistical Analysis controls the method by which the annotated genes or gene products in the study set are analyzed for GO term overrepresentation with respect to the population set. The standard method has been to calculate the upper tail of the hypergeometric distribution (one-sided Fisher exact test) for each term separately. The Ontologizer also provides analysis by means of the parent-child approach, which has several advantages compared to the standard approach (see the Ontologizer homepage for further details and references).
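
The term-for-term test described in the note above, and the effect of the correction methods compared in Answer 2, can be illustrated with a short Python sketch. This is not the Ontologizer's code. The function names are hypothetical and the p-values in the example are invented for illustration. The sketch covers only the upper tail of the hypergeometric distribution and the Bonferroni, Bonferroni-Holm and Benjamini-Hochberg adjustments. The resampling-based Westfall-Young methods and the parent-child approach are not reproduced here.

```python
from math import comb


def hypergeom_upper_tail(pop_size, pop_annotated, study_size, study_annotated):
    """P(X >= study_annotated) when study_size genes are drawn without
    replacement from pop_size genes, pop_annotated of which carry the term."""
    total = comb(pop_size, study_size)
    max_k = min(pop_annotated, study_size)
    tail = sum(
        comb(pop_annotated, k) * comb(pop_size - pop_annotated, study_size - k)
        for k in range(study_annotated, max_k + 1)
    )
    return tail / total


def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Bonferroni-Holm step-down adjustment."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


def count_significant(adjusted, alpha=0.10):
    """Number of adjusted p-values below the threshold (0.10 as in the tutorial)."""
    return sum(p < alpha for p in adjusted)


# Invented raw p-values, one per GO term (illustration only).
toy_pvalues = [0.001, 0.012, 0.015, 0.03, 0.06, 0.09, 0.5, 0.9]

for name, adjust in [
    ("Bonferroni", bonferroni),
    ("Bonferroni-Holm", holm),
    ("Benjamini-Hochberg", benjamini_hochberg),
    ("None", lambda ps: list(ps)),
]:
    print(name, count_significant(adjust(toy_pvalues)))
```

Running the sketch prints one count per method. For the invented p-values the counts never decrease from Bonferroni to None, which is the same ordering as in the table of Answer 2.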