BMMB 551 Assignment on Transcriptomes (Lesson 8)

This assignment is an opportunity to explore some of the topics covered in class. The report should emphasize insights or information that were interesting to you. Note that the following are options; you only have to do one (and option 3 is to develop your own exercise).

Option 1. Use Gene Expression Omnibus (GEO) to explore microarray expression data

This will give you some hands-on experience with microarray-based data on transcription. I encourage you to do the analysis on cellular systems and genes that are of interest to you. You are free to follow a path different from the one outlined here if that is where your curiosity and interest take you. Please make it clear in your report what you did.

Go to GEO, a server at NCBI: http://www.ncbi.nlm.nih.gov/geo/

(1) GEO currently holds data for over 36,000 “series”, each of which comprises a study with multiple samples. Over 3000 of these are curated as GEO “DataSets”, which provide not only the data but also tools for finding differentially expressed genes and for cluster analysis. You should choose one of the curated DataSets. The query engine supports text queries like “breast cancer”, so try it to find something you are interested in. I’m working with the data in record GDS568, “Erythroid differentiation: G1E model”, so if you have no other preference, choose it. All subsequent instructions refer to this dataset, but the same features should be available for any comparable dataset. If you use a different dataset, substitute equivalent biological questions in the steps that follow.

(2) Under “Data Analysis Tools”, look at “Experiment design and value distribution” to see what the samples are and the distribution of expression levels in each.

(3) Under “Data Analysis Tools”, use “Cluster heatmaps” to generate at least two views of the clusters in the data. Include a hierarchical clustering and a k-means clustering.
You can control the colors in the heat map; if you change them, please make it clear in the report what the colors mean.

(4) Now find out something about genes that show interesting patterns of expression in the clusters (e.g. an up-regulated or down-regulated set). There are multiple ways to focus on an interesting subset; use any or all of the following.

(4.1) To get a subset of genes with an interesting expression pattern, you can zoom in on a cluster. There is a cropping tool (reddish-yellow dashed line box) to specify a cluster of interest. Detailed instructions on using this interface can be obtained by clicking on “How To” at the upper right, above the diagram.

(4.2) Alternatively, you can click on “Find genes” to choose genes that are up- or down-regulated for a condition (time, disease, etc.).

(4.3) You can also “Compare 2 sets of samples” to find an interesting set of genes.

After choosing a small subset of genes (4-5) with interesting expression patterns, follow the hyperlinks to get quantitative information on their expression levels and to find out more about the genes (e.g. links to Entrez Gene). In your report, tell me what you learned.

Option 2. Map and analyze RNA-seq data

Galaxy contains tools for mapping and analyzing RNA-seq data, and it looks like some datasets of tractable size are available in the Data Libraries (look under the “Shared Data” pull-down menu). This is an opportunity to get some experience in handling these data. A serious effort in running the mapping and some analyses, and a description of your experiences and what you learned, would constitute a good report. I expect each student’s experience will vary depending on background, and you may be inspired to dig more deeply. If you are feeling adventurous, go for it!
At the Galaxy web site, go to:
“Shared Data” (top black row)
Select “Data Libraries”
Click on “Demonstration Datasets”
Click on the blue triangle for “Human RNA-seq: CHB ENCODE Exercise”
Import the data (do at least one cell type) into your history. It looks like the data may be filtered so it is only chr19.
Click on “Analyze Data” (top black row), and return to your history.
Under NGS: RNA-seq, run TopHat to map the reads, and then explore Cufflinks and Cuffcompare for further analysis. I expect it will take a while to finish the initial mapping and maybe other functions as well.

Option 3. Develop your own assignment

Develop your own exercise using the data from the technologies we are discussing. Note that substantial data from modENCODE projects on flies and worms are available.
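If you develop your own exercise, or simply want to see what the two clustering methods named in Option 1, step (3) are doing, the following is a minimal sketch in Python using only the standard library. The gene names and expression values are invented for illustration and are not taken from GDS568. The average-linkage rule and the farthest-point choice of starting centers for k-means are simplifying assumptions made here, not a description of how the GEO tools are implemented.

```python
import math

# Toy log2 expression profiles: one row per gene, one column per time point.
# Hypothetical names and values, for illustration only (not from GDS568).
PROFILES = {
    "geneA": [0.0, 1.0, 2.0, 3.0],  # induced over time
    "geneB": [0.1, 1.2, 2.1, 2.9],  # induced over time
    "geneC": [3.0, 2.0, 1.0, 0.0],  # repressed over time
    "geneD": [2.8, 2.1, 0.9, 0.2],  # repressed over time
    "geneE": [1.5, 1.4, 1.6, 1.5],  # roughly flat
}


def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_profile(rows):
    """Column-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def hierarchical(profiles, n_clusters):
    """Agglomerative clustering with average linkage (an assumed choice).

    Start with every gene in its own cluster and repeatedly merge the two
    closest clusters until only n_clusters remain.
    """
    clusters = [[name] for name in sorted(profiles)]

    def linkage(c1, c2):
        total = sum(distance(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def kmeans(profiles, k, iterations=20):
    """k-means with a deterministic farthest-point start (an assumed choice)."""
    names = sorted(profiles)
    # First center is the first gene; each further center is the gene
    # farthest from all centers chosen so far.
    centers = [profiles[names[0]]]
    while len(centers) < k:
        farthest = max(
            names,
            key=lambda n: min(distance(profiles[n], c) for c in centers),
        )
        centers.append(profiles[farthest])

    groups = []
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest center.
        groups = [[] for _ in range(k)]
        for n in names:
            nearest = min(range(k),
                          key=lambda c: distance(profiles[n], centers[c]))
            groups[nearest].append(n)
        # Update step: move each center to the mean of its genes
        # (an empty group keeps its previous center).
        centers = [
            mean_profile([profiles[n] for n in g]) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return sorted(sorted(g) for g in groups if g)


if __name__ == "__main__":
    print("hierarchical:", hierarchical(PROFILES, 3))
    print("k-means:     ", kmeans(PROFILES, 3))
```

On this toy input both methods separate the induced, repressed, and flat genes into three groups. On real data the two views often disagree, and k-means in particular depends on how many clusters you request and on the starting centers, which is one reason the assignment asks for both views.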