Gene Expression Explorer Tutorial

An Exploration and Visualization Tool
For Use with
Microsoft Windows 2000
Microsoft Windows XP

Copyright 2007 Qlucore AB

Contents
Preface
What is Gene Expression Explorer?
File formats
Example file
Some background on microarray data
A work through of how to use Gene Expression Explorer
    Starting Gene Expression Explorer
    Working with Samples
    Using Graphs
    Modifying annotations
    Multiple Plot Windows
    Working with Variables
    Exporting images, animations and complete state
    Further Help
Bibliography
Disclaimer
Trademark List

Preface
Gene Expression Explorer supplies new technology for data analysis and data mining. It is built on state-of-the-art mathematical and statistical methods. Its main feature is the ease with which you can explore your data sets. You can rapidly visualize and work interactively with your data in real time, directly on the computer screen, and you can import and export different types of data, images and animations. You are not expected to have in-depth knowledge of mathematical or statistical methods, or to possess a powerful supercomputer. With an ordinary laptop or desktop PC you will be able to explore your high-dimensional data and rapidly find relevant structures.
This tutorial is meant to enable first-time users to understand and use the basic capabilities and features of Gene Expression Explorer. A comprehensive description of all functions can be found in the Reference Manual, which is supplied in the Help Menu of Gene Expression Explorer. In this tutorial, a simple step-by-step example leads you through the basic functionality of the Gene Expression Explorer interactive environment.
Welcome to an interactive and explorative journey!

What is Gene Expression Explorer?
Gene Expression Explorer is a powerful interactive visualization environment. It can be used to analyze many different types of data sets, but the user interface is developed for gene expression data exploration. Gene Expression Explorer helps you to uncover hidden structures and find patterns in large data sets, taking full advantage of all annotations and other links that are connected with your data. You can rapidly import and export many different types of data, images and animations for use in further research, reports and presentations.
The user interface displays all data sets in 3D so that you can visualize them and work with them interactively, in real time, directly on the computer screen. When using Gene Expression Explorer, you take full advantage of the most powerful pattern recognizer that exists, the human brain, to find relevant structures in your data set. Gene Expression Explorer enables non-statisticians and non-mathematicians alike to explore their own data sets without relying on expert support.
Gene Expression Explorer has a lot of built-in functionality that helps you to rapidly uncover significant structures and relations hidden in high-dimensional and complex data sets. The user interface is designed to be intuitive and easy to use, giving you great freedom when searching for structure in your data set, while at the same time providing built-in functionality for interactive hypothesis testing.
The basic operation used in the Gene Expression Explorer environment is principal component analysis (PCA), which makes it possible to visualize high-dimensional data in a lower dimension. In Gene Expression Explorer, all data is projected to three dimensions, and this three-dimensional representation is then plotted on the computer screen. The unique feature of Gene Expression Explorer is that you can manipulate the different PCA plots interactively and in real time, directly on the computer screen, and at the same time work with all annotations and other links in a fully integrated way.

File formats
Gene Expression Explorer supports the following file formats:
• Gedata file: an easy-to-use format supplied by Gene Expression Explorer.
• Base file: a format developed by the theoretical physics research group for complex systems at Lund University.

Example file
The example file Acute Lymphoblastic Leukemia.gedata is supplied with the program. The data frame consists of gene expression profiles from 132 different patients, all suffering from some type of pediatric acute lymphoblastic leukemia (ALL). For each patient the expression level of 22282 genes has been measured. The data set comes from a study by Ross et al. [Ross2003].

Some background on microarray data
A microarray data set is a set of quantitative measurements of the amounts of mRNA, corresponding to the expression levels of the different genes being transcribed at a specific moment in a sample of cells. A microarray data sample can thus be thought of as a picture of the state of the given cells, in terms of the expression level (or activity level) of every gene present in the cells at the time the sample was taken. The availability of large data sets measuring the expression of practically all genes within a cell has created the need for new mathematical and statistical tools for data exploration.
We shall examine an example data set. It originates from an article published by Ross et al. [Ross2003] in Blood in 2003. The data set consists of gene expression profiles from 132 different patients, all suffering from some type of pediatric acute lymphoblastic leukemia (ALL), a malignancy of the hematopoietic system affecting precursors of lymphatic cells. The data set thus consists of 132 diagnostic samples, and for each sample we are given measurements of the expression level of 22282 genes. We will show how the prognostic subtypes of ALL (as described in [Ross2003]) can be identified by using the Gene Expression Explorer interactive exploration environment.

A work through of how to use Gene Expression Explorer
Starting Gene Expression Explorer
To begin the exploration process, the first step is to
• start Gene Expression Explorer
To start Gene Expression Explorer, double-click the Gene Expression Explorer icon (shortcut) on the desktop, or start it from the program menu found in the Start menu. The Gene Expression Explorer Main Window will appear.
We begin with a quick orientation of what you see on the screen. Functionality governing the samples is found to the left in the Sample Panel, and functionality governing the variables is found to the right in the Variable Name List Panel. In between is the Work Space, where all plots are displayed in Plot Windows. Directly under the Menu Bar are the Dock Windows, where you select the function of the mouse and many of the operations you wish to perform. Finally, at the bottom is the Status Bar. It shows, for instance, the total number of samples and variables in your data set and how many of them currently take part in the analysis.
[Figure: the Main Window, with the Menu Bar, Dock Windows, Sample Panel, Work Space, Variable Name List Panel and Status Bar labeled]
To begin, we first restore the default settings. (This is not something you have to do every time. We do it in the example to avoid discrepancies between settings.)
• Select the Edit > Restore Default Settings menu item in the Menu Bar.
• Select OK when you are asked if you want to restore default settings.
Note that when you later exit Gene Expression Explorer, the current settings are saved and used the next time the program is started, so you can directly start working with the settings that best suit your data.
We are now ready to open the example data frame in Gene Expression Explorer.
• Select Help > Example Files > Acute Lymphoblastic Leukemia.gedata in the Menu Bar.
The next time you want to open a file you have recently explored with Gene Expression Explorer, you will find it under File > Recent Data Files in the Menu Bar. To open any other file, use File > Open and select the file you want to open by double-clicking on it.
The example gedata file is now open in Gene Expression Explorer, and you have the starting position for analyzing the data. In the Plot Window in the Work Space you see a plot of the 132 samples. Recall that functionality corresponding to the samples (patients) is found to the left of the Work Space in the Sample Panel, and functionality having to do with Variable Name Lists is found to the right of the Work Space in the Variable Name List Panel.
[Figure: the Main Window with the Dock Windows, Sample Panel and Variable Name List Panel labeled]
Above the Work Space and the Panels you find the Dock Windows, containing commands for some of the basic functionality of Gene Expression Explorer. By selecting View > Tool bars in the Menu Bar, you can select which Dock Windows are displayed. By default six of them are shown (Mouse, Data, Normalize, Plot, Graph and Play).
What you see in the Work Space is a principal component projection of the 132 samples from 22282 dimensions (corresponding to the 22282 different gene expression levels measured for each patient) down to the three-dimensional space spanned by the first three principal components. All commands given in Gene Expression Explorer immediately affect the projections displayed in the Plot Windows.
By selecting and clearing the checkbox labeled Axes in the Plot Properties Dock Window you can show or hide the three principal component axes.
• Leave the Axes checkbox checked.
In the Mouse Function Dock Window you can select the function of the mouse (the radio button Rotate in the upper left corner is checked by default). When Rotate is selected, you can rotate the image by holding down the left mouse button and dragging the image with the mouse in the Work Space.
• You can continuously rotate the image by clicking the play button in the Play Dock Window.
When selected, the play button is replaced by a stop button in the Play Dock Window. Here you can also select the rotation speed (preset to 15). Pressing the stop button stops the motion. Try to both continuously rotate (by clicking the play button) and then drag the image (with the mouse) to see how you can force the image to rotate in any desired direction.
• Select the Center radio button in the Mouse Function Dock Window and then left-click on a sample.
The selected sample is then placed in the center of the Plot Window. If you left-click another sample, the newly selected sample is placed in the center instead. If you finally left-click anywhere else in the Plot Window, the original plot is restored.
• Select the Rotate radio button.
In the lower left corner of the Main Window you find the Status Bar. Here you see the text 132/132 Samples, corresponding to the 132 small spheres you see plotted on the screen. Every sphere corresponds to a patient.
• Double-click with the left mouse button on one of the spheres.
The annotations available in the data frame for that sample are shown in an Annotation dialog window.
Note: What you see on your screen can differ from the images by a rotation, because we have been practicing rotation above.
This should not cause confusion; in fact, we recommend that you continuously rotate the images during the work-through of the example to get a good feeling for the three-dimensional structure of the projection.
• Close the Annotation dialog window.

Working with Samples
We shall now begin to actively look for interesting structures and start to interact with our data. In the Attribute combo box in the Sample Panel you can choose which of the attributes that come with the data frame you want to work with.
• Select Leukemia Subtype from the list of Sample attributes.
This makes the different diagnostic subtypes of ALL appear in the Value Table located in the Value Window.
[Figure: the Value Window, with the Value Toolbar and the Value Table labeled]
In the Value Toolbar you see some of the button icons used in Gene Expression Explorer. In general, Sample Color buttons give an instruction to color samples according to the attribute chosen, and Variable Color buttons give an instruction to color variables. The Mark buttons give an instruction to mark a selected sample or variable.
Remark: The function of a button appears if you leave the mouse pointer over it for an instant.
In the Value Table you see the seven different subtypes of ALL listed. These are the a priori given clinical diagnoses that come with the data frame.
By selecting the different rows in the Value Table you see the corresponding samples displayed in the Sample Window below the Value Table. By choosing, for instance, the E2A-PBX1 subtype you get a Sample Window listing the 18 samples in this subtype group.
• Select the Mark button and left-click on a sample number in the Value Table.
The corresponding sample is then marked in the Plot Window. Select none to clear all marks.
• Select the Sample Color Button in the Value Window.
The samples in the Work Space are then colored according to the attribute chosen, in this case Leukemia Subtype. You now see the different subtypes of ALL colored according to subtype in the Work Space.
Even when we rotate this picture it is very hard to discern any structure or pattern in the plot. The reason is that all 22282 genes (variables) take part in the analysis. Most of them possibly have very little to do with the genetic disturbances we are interested in, but all of them contribute to the noise in the data through small random fluctuations. We can remedy this by selecting the genes that contribute most to the variation over the data set and discarding the genes that only exhibit small (possibly random) fluctuations.
• Move the slider in the Filter by Variance Window to 0.05. This can be done either by dragging the slider with the mouse, or by typing 0.05 in the textbox and pressing Return.
Only the genes whose variance over the samples is at least 5% of the largest variance found for any gene now take part in the analysis. This happens to be precisely 1929 genes, as can be seen in the Status Bar, which indicates that 1929 out of 22282 genes currently participate in the analysis.
Clear patterns are now visible in the Plot Window. Principal Component Analysis (PCA) places patients with similar gene expression profiles close to each other in the plot and, as can be seen in the picture, this correlates well with clinical leukemia subtype. In particular, the T-ALL subgroup clearly distinguishes itself and also accounts for a lot of the variance.
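The variance filter and the principal component projection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Gene Expression Explorer. All function names (`filter_by_variance`, `center`, `covariance`, `principal_components`, `project`) are hypothetical, population variance is assumed (the choice does not change the ratio to the largest variance), and the components are found by plain power iteration on the covariance matrix, which is only practical for small examples.

```python
import math
from statistics import pvariance


def filter_by_variance(rows, fraction):
    """Keep genes (columns) whose variance is at least `fraction` times
    the largest gene variance. `rows` holds one list of values per sample."""
    variances = [pvariance(column) for column in zip(*rows)]
    threshold = fraction * max(variances)
    keep = [j for j, v in enumerate(variances) if v >= threshold]
    return [[row[j] for j in keep] for row in rows], keep


def center(rows):
    """Subtract each gene's mean so that every column has mean zero."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Covariance matrix (genes x genes) of already centered data."""
    n = len(centered)
    columns = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]


def principal_components(cov, k, iterations=200):
    """Top-k (variance, direction) pairs via power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]               # work on a copy
    components = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]   # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                    # no variance left to capture
                break
            v = [x / norm for x in w]
        variance = sum(v[i] * cov[i][j] * v[j]
                       for i in range(d) for j in range(d))
        components.append((variance, v))
        for i in range(d):                      # deflate: remove this component
            for j in range(d):
                cov[i][j] -= variance * v[i] * v[j]
    return components


def project(centered, components):
    """Coordinates of each sample along the given principal directions."""
    return [[sum(x * c for x, c in zip(row, direction))
             for _, direction in components] for row in centered]
```

Calling `filter_by_variance(rows, 0.05)` mirrors the slider setting of 0.05. The share of the total variance carried by a component is its variance divided by the sum of the diagonal of the covariance matrix.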
As can be seen in the plot, the first principal component contains 13% of the total variance and clearly sets the T-ALL group apart from the rest of the subtypes. In order to discern structure more clearly when plotting the other subtypes, we shall now remove the T-ALL samples from the PCA.
• Select the Active Samples Button in the Value Window and uncheck the checkbox corresponding to T-ALL in the Value Table.
The PCA is recalculated interactively and, after a possible rotation, a new plot appears.
Note that we now have 118 out of 132 samples present in the plot. The text 118/132 Samples is displayed in the Status Bar along with the text 1682/22282 Variables. Several subgroups are now clearly discernible. To try to separate the different subgroups even more, we shall now select the genes that are most responsible for differentiating between the a priori chosen diagnostic subtypes.
• Click the Analyzed Variables button, marked with a funnel symbol.
• Drag the slider to 30%, or type 30 directly into the textbox and press Return.
We have now selected the genes that have more than 30% of their variance over the active samples explained by the diagnostic subtype of the samples. The technique used to select the active variables is called ANOVA, and the amount of variation explained by diagnostic subtype is called the R2-value. How much you should filter depends on the structure of your data; you should visually search for interesting patterns while filtering and stop when you see them. You must be a little cautious, though, because ANOVA filtering has its traps. In short, you might create patterns that have no statistical significance. The statistical significance can be checked in Gene Expression Explorer by using randomization and permutation of your data set.
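To make the R2-value and the permutation idea concrete, here is a small Python sketch. It assumes the usual one-way ANOVA definition of R2 (between-group sum of squares divided by total sum of squares) and a plain label-shuffling test. The function names `r_squared`, `select_by_r_squared` and `permutation_p_value` are hypothetical, and the procedure actually used by Gene Expression Explorer may differ; see the Reference Manual.

```python
import random


def r_squared(values, labels):
    """Fraction of a gene's variance explained by the group labels
    (between-group sum of squares divided by total sum of squares)."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0                      # constant gene: nothing to explain
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return ss_between / ss_total


def select_by_r_squared(rows, labels, threshold):
    """Indices of genes (columns) whose R2 exceeds `threshold`.
    `rows` holds one list of expression values per sample."""
    return [j for j, column in enumerate(zip(*rows))
            if r_squared(column, labels) > threshold]


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Share of random relabelings whose R2 is at least the observed one."""
    rng = random.Random(seed)
    observed = r_squared(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if r_squared(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

A threshold of 0.3 in `select_by_r_squared` corresponds to the 30% slider setting. A gene that passes the filter but has a large permutation p-value is a candidate for exactly the kind of pattern without statistical significance that the text warns about.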
You can also use cross-validation to check the stability of the patterns you see on the screen. For more information on this, see the Reference Manual in the Help Menu.
In the Status Bar you can now see that 844 out of the total 22282 variables are active, i.e. participate in the analysis.
What we see is a three-dimensional projection from (in this example) 844 dimensions. Even though we have projected the data set in the "best possible fashion", i.e. keeping as much variance as possible, there will be features that we cannot see in a three-dimensional projection. To remedy this to some extent, we shall now create a graph on the set of samples.

Using Graphs
In Gene Expression Explorer you can easily create graphs connecting samples or variables. When creating graphs, the distances involved are always the Euclidean distances in the full space of all active samples or variables. The graphs give you an opportunity to, in a sense, look into higher dimensions, but they can also be used in analysis, for instance when using the multidimensional scaling algorithm Isomap (see the Reference Manual) in Gene Expression Explorer.
Using the Graph Dock Window you can create a graph in many different ways.
• Uncheck the Axes checkbox in the Plot Properties Dock Window to see the graph you will create more clearly.
In the Min textbox you can select the number of nearest neighbors that each sample is to be joined to by the graph. The Min value is 0 by default and the Max value is 10 by default. Both can be changed to any value between 0 and 20. The Max value plays a role when you use the slider in the Graph Dock Window to create a graph. By dragging the slider you create a graph that joins all neighbors within the selected distance. Try it.
• Put the distance slider at 0, leave the Max value at 10 and change the Min value to 6. You can do this either by using the selection buttons or by typing directly in the textbox and pressing Return.
Note again that when the nearest neighbors are selected to create the graph, the distances are computed in the full space of all active variables taking part in the analysis (i.e. in 844 dimensions in the case at hand).
An interesting fact manifests itself in the resulting projection. In the diagnostic subtype called Other, two different subgroups are clearly discernible. The two subgroups do not share any of their 6 closest neighbors. These samples should perhaps not be identically classified. We shall now reclassify one of them.
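The nearest-neighbor graph described in this section can be sketched as follows. This is an illustrative Python example, not the program's own code. The names `knn_graph` and `connected_components` are hypothetical, and `k` plays the role of the Min value in the Graph Dock Window. Distances are Euclidean and computed on the full coordinate vectors, as the text specifies, rather than on the three plotted coordinates.

```python
import math


def knn_graph(points, k):
    """Undirected edges joining every point to its k nearest neighbors,
    with Euclidean distances measured in the full space."""
    edges = set()
    for i, p in enumerate(points):
        by_distance = sorted((math.dist(p, q), j)
                             for j, q in enumerate(points) if j != i)
        for _, j in by_distance[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def connected_components(n_points, edges):
    """Group point indices that are linked, directly or indirectly, by edges."""
    neighbors = {i: set() for i in range(n_points)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, components = set(), []
    for start in range(n_points):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(sorted(component))
    return components
```

Two subgroups that share none of their nearest neighbors, like the two parts of Other above, show up as separate connected components of the graph.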
Modifying annotations
It is sometimes convenient to modify the data set you work with, for instance by reclassifying samples, in order to go on and find interesting information. We shall begin by reclassifying some of the samples, thereby splitting one of the diagnostic subgroups into two different subgroups. We do this in order to later be able to find, for instance, which variables (genes) best discriminate between the two new subgroups.
• Set the Min value in the Graph Dock Window to 0 to clear the graph and thus discern the samples more clearly.
• Choose Classify in the Mouse Function Dock Window.
• Select the New Value button in the Value Toolbar. A New Value appears in the Value Table and is automatically selected.
Note that you can change the name New Value that appears in the Value Table by double-clicking in the corresponding text box and entering the preferred name.
• Hold down the left mouse button and drag the mouse over the clearly discernible subgroup of Other that is closest to the green subgroup TEL-AML1.
The samples are reclassified as New Value. If you happen to classify a sample that should not have been reclassified, you can undo your last command by selecting the Undo Button in the Value Toolbar.
By selecting Rotate in the Mouse Function Dock Window, you can rotate the plot and check that you have marked all samples. If not, select Classify again and complete the operation.
Note: The number of variables taking part in the analysis changes when you reclassify the subgroups. This is because the R2-value is set to 30%, and the set of active variables that meet this threshold depends on the subgroups that we choose to discern.
• Select Rotate in the Mouse Function Dock Window.
We shall now open up another Plot Window in the Work Space.

Multiple Plot Windows
You can at any point in the analysis open up a new Plot Window in the Work Space. A new Plot Window can be either synchronized with the active (highlighted) Plot Window or not. Synchronized Plot Windows always share the same active samples and/or variables and are displayed in the same way, but they can, for instance, be colored according to different annotations, which can be very useful. You activate (select) a Plot Window by clicking anywhere in it. The frame of the currently active Plot Window is always highlighted.
• Select Window > New Synchronized Plot in the Menu Bar.
Note that you now have two different Plot Windows open in Gene Expression Explorer. You can find them listed under Window in the Menu Bar. You can select which window to display, or you can display all windows by choosing Window > Tile in the Menu Bar.
• Select Window > Tile in the Menu Bar.
• Make sure that the new Plot Window is active and select Novel Group in the Attribute combo box in the Sample Panel. This Novel Group is a new diagnostic subgroup discovered in the study by Ross et al. [Ross2003].
• Select the Sample Color Button in the Value Toolbar to color the samples in the active window according to the Novel Group attribute.
We can now see that our New Value group in the left window corresponds to the (green) Novel Group in the right window, discovered by Ross et al.
We shall now have a look at the 951 variables that currently participate in the analysis.

Working with Variables
Although we have, strictly speaking, been working with the variables all the time (when filtering, for instance), we shall now have a look at the variables themselves. In the Data Mode Dock Window you can select whether to display variables or samples in the active Plot Window.
• Make sure that the new Plot Window to the right is active and select the radio button Variables in the Data Mode Dock Window.
The right Plot Window now shows a PCA plot of the 951 active variables currently participating in the analysis.
• Select the left window by clicking anywhere in it to activate it.
• Make sure that New Value is selected in the Value Table.
• Select the Variable Color Button in the Value Toolbar.
The variables are now colored according to their mean expression level in the subtype group selected in the Value Table for the samples. Red means highly expressed, i.e. up-regulated in the chosen sample group, and green corresponds to down-regulated genes.
In the right Plot Window you see a three-dimensional PCA plot of the 951 variables currently taking part in the analysis, colored according to their mean expression level in the sample subtype group New Value. Notice that, since we have chosen synchronized plots, the highly expressed variables in the New Value group are found in the same direction as the New Value group itself.
By selecting the other subtype groups in the Value Table we can see the mean expression levels of the variables for all the different sample subtype groups.
• Select the E2A-PBX1 subtype.
We shall now create a list of the genes that are most up-regulated in the E2A-PBX1 group.
• Select the right window by clicking the Window icon.
• Select the New button in the Variable Name List Toolbar.
• Select List in the Mouse Function Dock Window.
• Draw a closed curve clockwise around some of the genes that are most expressed for the group E2A-PBX1. You do this by holding down the left mouse button while moving the mouse pointer clockwise around the selected genes to create a closed curve.
A variable name list appears in the Variable Name Window, displaying the selected genes. Note that your list will most probably differ from the one shown in the tutorial images, since it is unlikely that you have selected precisely the same variables.
[Figure: the Variable Name List Panel, with the Variable Name List and the Annotations labeled]
Note that annotations for the gene selected in the variable name list are found in the Variable Window.
• Select the Mark button in the Variable Name Window.
You now see the selected gene marked in the variable plot.
• Select Window > Tile in the Menu Bar.
• Select the Sample Color button in the Variable Name Window.
The samples are now colored according to how active the selected gene is in each particular sample. Red means high activity and green corresponds to low activity.
By selecting the different variables in the variable name list, one sees the activity of the selected genes for each patient. If the gene PBX1 was included in your list, you can select it to create a plot in which this particular gene is up-regulated for all samples in the group E2A-PBX1 and down-regulated for all other active samples. In the Variable Window we see that this gene actually is precisely PBX1.
We shall now find genes that are correlated with PBX1.
• Select Corr. in the Mouse Function Dock Window.
• Left-click on PBX1 in the Work Space.
A Correlated Variables dialog box appears. You can drag the slider to decide how highly correlated a variable has to be in order to be selected in the plot.
• Drag the slider to 60%.
All the variables whose correlation coefficient with the selected PBX1 is greater than 60% are now connected to it by a graph.

Exporting images, animations and complete state
At any time during an ongoing analysis you can export an image or an animation from Gene Expression Explorer. You do this by selecting File > Export > Image or File > Export > Video, and then supplying the name and other characteristics of the exported file.
It is also possible, at any point in an analysis, to save the complete current state of Gene Expression Explorer by selecting File > Export > Complete State. The complete state is then saved as a Qlucore complete state file (file.qcstate). You can later return to the analysis, at the point where it was saved, by opening that file in Gene Expression Explorer.

Further Help
This tutorial has only covered a part of the functionality in Gene Expression Explorer. For more comprehensive coverage, see the Reference Manual supplied in the Help Menu of Gene Expression Explorer.

Bibliography
[Ross2003] M.E. Ross et al. Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood, 15 October 2003, Vol. 102, No. 8, pp. 2951-2959.

Disclaimer
The contents of this document are subject to revision without notice due to continued progress in methodology, design, and manufacturing. Qlucore shall have no liability for any error or damages of any kind resulting from the use of this document.

Trademark List
Windows 2000 and Windows XP are trademarks of Microsoft.