4th year

1.
a. Describe the basic procedure for the acquisition and warehousing of gene expression data.

b. List two of the commonly used methods for analysing gene expression data and describe their main functions.

c.
i. Explain how hierarchical clustering algorithms work. Make sure your answer describes what is meant by a linkage method and how it is used.
ii. Based on a Euclidean similarity measure, calculate the similarity matrix between the following observations:

    Gene ID   T1   T2
    G0001      1    1
    G0002      1    4
    G0003      5    1
    G0004      5    4

d. Given a gene expression profile of two populations (gene expressions in untreated tissue samples vs. gene expressions in the corresponding treated tissue samples), the volcano plot provides a visualisation of the significance of changes in expression values between the two populations using the hypothesis test method.
i. Draw an analysis workflow diagram for the required data analysis.
ii. Explain the steps in each of the stages in your workflow.

The four parts carry, respectively, 15%, 15%, 30% and 30%.

© University of London 2003 Paper

3rd year

a. Give a brief description of the concept of gene expression and its commonly used analysis methods.

b. Describe the application of supervised learning (classification) and unsupervised learning (clustering) in gene expression analysis.

c. Given a gene expression profile of a drug treatment measured at 8 different time points:
i. Design a workflow to cluster the co-expressed genes.
ii. Explain the steps in each of the stages in your workflow.

d. The following table shows the distance matrix between five genes:

          G1   G2   G3   G4   G5
    G1     0
    G2     9    0
    G3     3    7    0
    G4     6    5    9    0
    G5    11   10    2    8    0

i. Based on a complete linkage method, show the distance matrix between the first formed cluster and the other data points.
ii. Draw a dendrogram showing the full tree for the five points based on complete linkage.
iii. Draw a dendrogram showing the full tree for the five points based on single linkage.
The four parts carry, respectively, 20%, 20%, 30% and 30% of the marks.

Sample Answers (4th Year)

1.
a. Model answer should mention micro-array chips and experiments, where each dot on a chip measures the expression of one gene under a given environmental condition. The output data is stored for each gene together with the experiment context and reference information, usually from public sources. Commonly used analysis methods include statistical analysis (hypothesis testing), visualisation techniques and data mining. [3 marks]

b.
i. Clustering, to find groups of genes with similar or correlated behaviour under different environmental conditions.
ii. Differential gene expression analysis, to study gene behaviour in different states (e.g. diseased and normal states). [3 marks]

c.
i. Standard Book Question: A similarity matrix is constructed to calculate the distance between the pairs of points. The pair with the shortest distance is merged into one cluster (or new point). The process is repeated, resulting in a dendrogram or tree shape. [2 marks]
Different linkage methods define how distances between clusters are measured (e.g. single linkage, complete linkage, average linkage). [2 marks]
ii.

             G0001   G0002            G0003              G0004
    G0001    -       Sqrt(0+3*3)=3    Sqrt(4*4+0)=4      Sqrt(4*4+3*3)=5
    G0002            -                Sqrt(4*4+3*3)=5    Sqrt(4*4+0*0)=4
    G0003                             -                  Sqrt(0+3*3)=3
    G0004                                                -

[4 marks]

d.
i. Workflow diagram:

    Cleaned Gene Exp Table
      -> Calc. Fold Changes for each gene
      -> Calc. t-test between groups
      -> Calc. Significance & Effect
      -> Draw scatter plot
      -> Highlight Genes with high Significance & Effect

[3 marks]

ii.
a) Starting with a table of gene expression values for two populations, we first calculate the fold changes for each gene between every two time points in the time series as (ln t2 - ln t1).
b) Based on the newly calculated fold change table, we apply a t-test between the two different populations, from which we can calculate the significance (p-value) of the changes between both populations.
c) We calculate the effect of the change as the difference between the logged means for a gene.
d) Genes that have both a high effect and a high significance are deemed to be interesting genes.
[3 marks]

Sample Answers (3rd Year)

a. In each and every organism, different genes are expressed in different cell and tissue types (spatial differences) and at different developmental stages (temporal differences). Analysis of these variations in gene expression can lead to a better understanding of disease states, targeting of drugs to specific cells, tissues or individuals, development of agricultural products, etc. Model answer should mention micro-array chips and experiments, where each dot on a chip measures the expression of one gene under a given environmental condition. [4 marks]

b. Model answer should describe using clustering to find groups of genes with similar behaviour under different environmental conditions, and differential gene expression analysis to study gene behaviour in different states (e.g. diseased and normal states). [4 marks]

c. Workflow diagram:

    Gene Exp Table
      -> Clean data from noise (e.g. by flooring), scaling etc.
      -> Cluster Data Points
      -> Validate Meaning of Clusters for functions, pathways etc.

[3 marks]

The data is presented in a table, where each row contains a gene id and a time series of measurements. The data is then cleaned, e.g. using floor functions to remove noise. A clustering algorithm using an appropriate distance measure is applied, where the time series is treated as a vector of points in the feature space. The generated clusters are then examined to interpret the significance/meaning of the genes assigned to the same cluster; this can be based on accessing remote databases to check their function. [3 marks]

d.
i. The first cluster will be formed from G3 and G5 since they have the minimum distance.
[1 mark]

           G35   G1   G2   G4
    G35      0
    G1      11    0
    G2      10    9    0
    G4       9    6    5    0

[1 mark]

ii, iii. [Figure: two dendrograms, labelled Single Linkage and Complete Linkage] [4 marks]
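
The distance calculations and merge steps in the clustering answers can be checked with a short sketch. The Python below is an illustration only: the helper names `euclidean_matrix` and `agglomerate` are hypothetical and do not come from the paper, and the data are the two tables given in the questions (the four observations of 4th-year question c.ii and the five-gene distance matrix of 3rd-year question d). It uses a plain agglomerative loop in which the cluster-to-cluster distance is the minimum (single linkage) or maximum (complete linkage) over all cross pairs.

```python
import math


def euclidean_matrix(points):
    """Pairwise Euclidean distances for a dict of name -> coordinate tuple."""
    return {a: {b: math.dist(pa, pb) for b, pb in points.items()}
            for a, pa in points.items()}


def agglomerate(dist, linkage):
    """Agglomerative clustering on a symmetric dict-of-dicts distance matrix.

    linkage is the built-in min (single linkage) or max (complete linkage).
    Returns the merges in order as (cluster_a, cluster_b, merge_distance).
    """
    clusters = [(name,) for name in dist]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Distance between two clusters: linkage over all cross pairs.
                d = linkage(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Observations (T1, T2) from 4th-year question c.ii.
points = {"G0001": (1, 1), "G0002": (1, 4), "G0003": (5, 1), "G0004": (5, 4)}
euclid = euclidean_matrix(points)

# Lower-triangular distance matrix from 3rd-year question d.
lower = {
    "G1": [0],
    "G2": [9, 0],
    "G3": [3, 7, 0],
    "G4": [6, 5, 9, 0],
    "G5": [11, 10, 2, 8, 0],
}
names = list(lower)
dist = {a: {} for a in names}
for i, a in enumerate(names):
    for j in range(i + 1):
        # Mirror each entry so the matrix is symmetric.
        dist[a][names[j]] = dist[names[j]][a] = lower[a][j]

for label, fn in (("single", min), ("complete", max)):
    print(label, "linkage:")
    for left, right, height in agglomerate(dist, fn):
        print("  merge", "+".join(left), "with", "+".join(right), "at", height)
```

With this data the sketch merges G3 and G5 first (distance 2) under both methods. Complete linkage then merges at heights 5, 9 and 11, and single linkage at heights 3, 5 and 6, which are the heights the two dendrograms would show.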