BI201C dChip for Gene Expression and SNP Microarray Data Analysis
Cheng Li
Jan 26, 2009

This file’s address: http://www.dchip.org/dchip_expression.doc
This course is jointly conducted by Cheng Li and Shailender Nagpal, through Bioinformatics.Org: http://wiki.bioinformatics.org/BI201C_dChip_for_Gene_Expression_and_SNP_Genotyping
dChip website: www.dchip.org

Session 3: Gene Expression Analysis: Case Study on Filtering, Comparison, Clustering, Enrichment

1. Obtain dChip and demo data
Download and unzip the files to a directory: http://biosun1.harvard.edu/complab/dchip/dchip_demo.zip (69 MB)
You can follow along to explore this dataset using dChip in today’s session. Alternatively, the full data can be obtained by following steps 1.1 – 1.5 (in your own time):
1.1 dChip software: http://biosun1.harvard.edu/~cli/dchip.exe (download to a directory, e.g. "c:\dchip")
1.2 Download and unzip the example data CEL files
-- The paper describing this dataset: Armstrong et al. Nature Genetics 30, 41 – 47, 2001
-- Download and unzip these files: scaling_factors_and_fig_key.txt and the ALL1, ALL2, MLL1, MLL2 zipped files. You may need to rename the extension to “.gz” before using WinZip to unzip them.
1.3 Download and unzip the CDF file: HG_U95A.zip
1.4 Download and unzip the gene information file: HG-U95Av2 gene info2.zip
1.5 Download the sample information file made from “scaling_factors_and_fig_key.txt”: ALL sample info.xls

2. Basic steps to open expression data (covered in session 1-2)
2.1 Data extraction, normalization and expression computation
-- “Analysis/Open group”: specify data directory, working directory (in “Options”), sample information file, gene information file
-- “Analysis/Normalize & Model”: are the arrays being normalized to have similar median intensity?
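
The median-intensity question in step 2.1 can also be checked by hand outside dChip. The sketch below is not dChip’s own normalization procedure; it is a minimal Python illustration of simple median scaling, in which the intensity values, the array names, and the helper names (`array_medians`, `scale_to_baseline`) are all made up for the example. It compares per-array medians before and after rescaling each array to the median of a chosen baseline array.

```python
from statistics import median


def array_medians(arrays):
    """Return the median intensity of each array (dict: name -> list of values)."""
    return {name: median(values) for name, values in arrays.items()}


def scale_to_baseline(arrays, baseline):
    """Rescale every array so that its median equals the baseline array's median.

    This is plain median scaling, used for illustration only.
    """
    target = median(arrays[baseline])
    scaled = {}
    for name, values in arrays.items():
        factor = target / median(values)
        scaled[name] = [v * factor for v in values]
    return scaled


# Made-up probe intensities for three hypothetical arrays.
arrays = {
    "ALL_1": [100.0, 200.0, 300.0, 400.0, 500.0],
    "ALL_2": [150.0, 300.0, 450.0, 600.0, 750.0],
    "MLL_1": [50.0, 100.0, 150.0, 200.0, 250.0],
}

before = array_medians(arrays)
after = array_medians(scale_to_baseline(arrays, baseline="ALL_1"))
print("medians before:", before)
print("medians after: ", after)
```

If the medians printed before scaling differ widely between arrays while those printed after agree, the arrays needed normalization; the same comparison can be made on the values reported in dChip’s array summary file.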
2.2 Check probe level data
-- Click the “PM/MM” data on the left; use the Home and End keys (go to another probe set) and the PageUp and PageDown keys (go to another array) to look at the probe level data and the model fitted for the current probe set.
2.3 Check outlier arrays
-- After step 2.1, look at the “array summary file” for any outlying arrays.
-- Also check array images for single outliers marked in pink; press key “O” to toggle displaying array outliers; toggle back and forth between two array images to see if these outliers are identified reasonably.
-- Use “Image/Normalization plot” to view the scatterplot of outlier arrays and baseline arrays.

3. Filter genes and clustering
3.1 “Analysis/Filter genes”: usually it’s good to obtain < 1000 genes to look at in clustering
3.2 “Analysis/Hierarchical clustering”: Check both sample and gene clustering
-- Are samples of similar types clustering together? Is there anything special about mis-clustered samples?
-- Enlarge the image; which genes are highly expressed in particular groups of samples? Are replicate probe sets for the same gene selected and clustered closely?
-- Redo gene filtering using different criteria to get gene lists of different sizes, and then do clustering. Is the sample clustering similar?
-- Redo “Analysis/Open group” with “Options/Log 2 transform expression values” checked; redo filtering and clustering. Is the result similar to that on the original scale?
-- Click a gene name on the right to go to the online database
3.3 The “Clustering” menu

4. Gene function enrichment analysis
4.1 What are the known genes or functionally significant gene clusters in step 3.2?
4.2 “Tools/Gene function enrichment”
-- Clicking a gene branch before this step restricts the classification to the genes in that branch

5. Compare samples for supervised gene selection
5.1 “Analysis/Compare samples”
5.2 “Analysis/Hierarchical clustering”
-- Use “Tools/Array list file” to order samples.
-- Sample clustering is not necessary, but may identify outlying samples
5.3 Combine comparisons
5.4 Permute samples to assess FDR

Session 4: Gene Expression Analysis: Case Study on ANOVA & Correlation, Classification, Genome, Chromosome, Automating dChip

6. Use ANOVA or correlation analysis for supervised gene selection
6.1 “Analysis/ANOVA & Correlation”
-- Use “type” for ANOVA filtering, similar to “Compare samples” for two groups
-- Use “fake_class” for ANOVA filtering
-- Use “fake_response” for correlation filtering
-- Use a gene or gene branch for correlation filtering

7. Classifying samples
7.1 “Tool/Classify samples”, using “fake_class”, specifying an ANOVA gene list, or filtered gene list
-- LDA requires the R software to be installed
-- PCA (principal component analysis) may be performed; it doesn’t use the class information
-- The “Cross-validation” option

8. Use chromosome and genome information
8.1 Obtain genome information and cytobands files: http://biosun1.harvard.edu/complab/dchip/chromosome.htm
Download the genome information file: hg_u95av2 genome info2.xls (hg11)
8.2 “Analysis/Genome”
-- Color gene branches in the clustering figure and then do this
8.3 “Analysis/Chromosome”

9. Automate dChip functions
“Tools/Automate” menu

Homework
1. Use your own dataset, or find an expression dataset from GEO: http://www.ncbi.nlm.nih.gov/geo/ (e.g. search “lung cancer” at “Query/DataSets”, click a GSE reference series, and download the CEL files from the bottom of the page).
2. Analyze the dataset using dChip. You may need to obtain the CDF file and the gene information file for the specific array type, and create your own sample info file based on the dataset’s annotations. Follow analyses similar to those in the dataset’s original paper, or use the functions covered in today’s sessions. Do you reach conclusions similar to those of the original paper? Note any analysis questions for discussion.
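
When comparing homework results with the original paper, it can help to see what step 5.4 (“Permute samples to assess FDR”) estimates. The sketch below is a hedged illustration in Python, not dChip’s implementation: the filter (absolute difference of group means of at least `min_diff`), the toy expression values, the group labels "A" and "B", and the function names are all assumptions made for the example. It counts the genes passing the filter with the true labels, repeats the count with shuffled labels, and takes the median permuted count divided by the observed count as a rough FDR estimate.

```python
import random
from statistics import mean, median


def count_selected(expr, labels, min_diff):
    """Count genes whose group means differ by at least min_diff."""
    count = 0
    for values in expr:
        group_a = [v for v, g in zip(values, labels) if g == "A"]
        group_b = [v for v, g in zip(values, labels) if g == "B"]
        if abs(mean(group_a) - mean(group_b)) >= min_diff:
            count += 1
    return count


def permutation_fdr(expr, labels, min_diff, n_perm=200, seed=0):
    """Estimate FDR as (median count under shuffled labels) / (observed count)."""
    rng = random.Random(seed)
    observed = count_selected(expr, labels, min_diff)
    null_counts = []
    for _ in range(n_perm):
        shuffled = labels[:]      # copy so the true labels stay untouched
        rng.shuffle(shuffled)     # break the link between samples and groups
        null_counts.append(count_selected(expr, shuffled, min_diff))
    expected_false = median(null_counts)
    fdr = min(1.0, expected_false / observed) if observed else 0.0
    return observed, expected_false, fdr


# Toy data: rows are genes, columns are six samples (three per group).
labels = ["A", "A", "A", "B", "B", "B"]
expr = [
    [10, 11, 9, 20, 21, 19],   # clearly different between groups
    [5, 6, 5, 15, 14, 16],     # clearly different between groups
    [8, 8, 9, 8, 9, 8],        # no real difference
    [7, 12, 9, 10, 8, 11],     # no real difference
    [3, 4, 3, 4, 3, 4],        # no real difference
]

observed, expected_false, fdr = permutation_fdr(expr, labels, min_diff=5)
print("genes selected:", observed)
print("median selected under permutation:", expected_false)
print("estimated FDR:", fdr)
```

A low estimated FDR suggests that few of the selected genes would pass the filter by chance; with real data the number of permutations and the filter itself should match the criteria actually used in the comparison.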