Using GEO2R and R-Bioconductor to extract expression datasets from GEO and perform statistical analysis of microarray samples

Luca Zammataro
luca.zammataro@iit.it

Aim of practice:
1. Download a curated microarray experiment published in GEO, by means of GEO2R.
2. Create comparisons between groups of Samples and apply a statistical filter using GEO2R, by means of an R-Bioconductor script.
3. Obtain an expression dataset (an exportable text file) of all the values from the selected platform, by means of the script, which can be run in R-Bioconductor.
4. Filter and single out the most strongly regulated genes among groups, together with their function descriptions.

The following tools and their instructions are available from their own web sites. Please go through the instructions before carrying out the exercise.

The practice is divided into two sections. The first is based on the GEO2R GUI, which lets you download GSE microarray experiments and perform some basic statistical analysis on the resulting expression dataset. The second gives you the main R-Bioconductor code to 1) manually download microarray data; 2) obtain an expression dataset file that you can save to disk; 3) select the most strongly regulated genes across experimental groups (by means of Excel).

GEO:
The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to data storage, a collection of web-based interfaces and applications is available to help users query and download the studies and gene expression patterns stored in GEO. For more information about various aspects of GEO, please see the GEO documentation listings and publications.
http://www.ncbi.nlm.nih.gov/geo/browse/?view=series&display=20

Remember that:
GEO Platform (GPL): These files describe a particular type of microarray.
They are annotation files.
GEO Sample (GSM): Files that contain all the data from the use of a single chip. For each gene there will be multiple scores, including the main one, held in the VALUE column.
GEO Series (GSE): Lists of GSM files that together form a single experiment.
GEO Dataset (GDS): These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).

Further information on GEO data organization can be found here:
http://www.ncbi.nlm.nih.gov/geo/info/overview.html
http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/
http://lifesciencedb.jp/geo-e/?keyword=HuGene-1_1-st-v1&action=ListPlatform

GEO2R:
GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions. GEO2R runs in the R-Bioconductor environment and produces a standard R script that can be loaded into R-Bioconductor for local analysis of microarray data. R-Bioconductor users may also be interested in the GEOquery package, which parses GEO SOFT files for integration with Bioconductor analysis resources (see the GEOquery publication).
GEO2R is available from: http://www.ncbi.nlm.nih.gov/geo/geo2r/
GEO2R instructions are available from: www.ncbi.nlm.nih.gov/geo/info/geo2r.html

R-Bioconductor:
Bioconductor is an open source software project based on the R programming language that provides tools for the analysis of high-throughput genomic data. The GEOquery R package parses GEO data into R data structures that can be used by other R packages. The limma (Linear Models for Microarray Analysis) R package has emerged as one of the most widely used statistical tools for identifying differentially expressed genes.
It handles a wide range of experimental designs and data types and applies multiple-testing corrections on P-values to help correct for the occurrence of false positives. Thus, GEO2R provides a simple interface that allows users to perform R statistical analysis without command line expertise.

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, as well as Windows and macOS.
R is available from “The Comprehensive R Archive Network”: cran.r-project.org
Bioconductor is available from: www.bioconductor.org

Practice:
1. On the GEO2R web site (http://www.ncbi.nlm.nih.gov/geo/geo2r/), enter a Series accession number in the box, e.g. GSE32164, and select the platform of interest.
   a. How many samples, replicates and groups can you identify within the selected experiment? (The question is meant to check that you understand the experimental design and which groups you should consider for the analysis.)
      i. There are 9 Samples (3 groups of 3 replicates each) in this experiment.
2. Define the groups and enter their names by means of the GEO2R web GUI, in order to generate statistics. You should obtain a view like the one in the following picture:
3. Calculate the distribution of value data for the Samples you have selected by clicking on the “Value distribution” button, then analyze the resulting boxplot. A boxplot will appear:
   a. Determine whether the value data are median-centered across Samples, and thus suitable for cross-comparison. Report the median expression value around which the boxes are centered. In this case it is 5 (log2 values), so report this value as the answer to this exercise.
   b. Click on “export” and copy and paste the report table of the boxplot into an MS Word document. The table will appear in the following format:
4. Click on the “Options” button. A section for adjusting statistical parameters will appear: apply p-value adjustment by choosing Benjamini & Hochberg (False discovery rate). limma provides several p-value adjustment options (multiple-testing corrections) that attempt to correct for the occurrence of false positive results. After you have selected the adjustment, go back to the GEO2R tab and click on the “Top250” button to obtain the result of the calculation after applying the Benjamini & Hochberg correction: a list of the resulting 250 genes will appear in table format. The adjusted p-values are listed in the Adj P-value column of the resulting table. Genes with the smallest P-values will be the most reliable. The table will appear in the following format:
   a. Click on the first result to obtain the expression graph.
   b. Note from the table how many genes have an adjusted p-value < 0.05.
5. Open the R-Bioconductor terminal and copy and paste into it the R code produced by GEO2R (the script loads series and platform data directly from GEO, e.g. GSE32164). The code below is only an example of the structure and functions that GEO2R generates automatically; users should copy and paste all the code directly from the GEO2R interface, as indicated in this screenshot:

Code Explanation:
a. Load the required libraries using the R library() function:
      library(Biobase)
      library(GEOquery)
      library(limma)
b. Create a gset object using the getGEO function, and wait for the data to download:
      gset <- getGEO("GSEXXXXX", GSEMatrix = TRUE)
      if (length(gset) > 1) idx <- grep("GPL6244", attr(gset, "names")) else idx <- 1
      gset <- gset[[idx]]
   1. # Type ls() to list the variables defined in R.
      The result is:
      [1] "gset" "idx"
      # Type gset to see the number of features in assayData for this Affymetrix platform (33297).
c. Make syntactically valid names from the labels of the corresponding platform:
      fvarLabels(gset) <- make.names(fvarLabels(gset))
   1. # Type fvarLabels(gset) and report the variable content.
      This displays all the labels for this platform (the features available in the platform annotation). Refer back to this list when you produce the 250TopList.txt file.
      [1] "ID" "GB_LIST" "SPOT_ID" "seqname"
      [5] "RANGE_GB" "RANGE_STRAND" "RANGE_START" "RANGE_STOP"
      [9] "total_probes" "gene_assignment" "mrna_assignment" "category"
d. Create a variable with group names for the statistical comparison (one group label per Sample, in Sample order):
      sml <- c("G0", "G1", "G2", …, "Gn")
e. Calculate a log2 transform of the values:
      exprs(gset) <- log2(exprs(gset))
      log_ex <- exprs(gset)
f. Set up the data and proceed with the analysis:
      fl <- as.factor(sml)
      gset$description <- fl
      # Create a model matrix for the contrast calculation
      design <- model.matrix(~ description + 0, gset)
      colnames(design) <- levels(fl)
      fit <- lmFit(gset, design)  # this is the limma function
      cont.matrix <- makeContrasts(G2-G0, G1-G0, G2-G1, levels=design)
      fit2 <- contrasts.fit(fit, cont.matrix)
      fit2 <- eBayes(fit2, 0.01)
g. Compute the statistics with FDR correction:
      tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
      # No FDR
      # tT <- topTable(fit2, adjust="none", sort.by="B", number=250)
h. Load the NCBI platform annotation:
      gpl <- annotation(gset)
      platf <- getGEO(gpl, AnnotGPL=TRUE)
      ncbifd <- data.frame(attr(dataTable(platf), "table"))
i. Replace the original platform annotation:
      tT <- tT[setdiff(colnames(tT), setdiff(fvarLabels(gset), "ID"))]
      tT <- merge(tT, ncbifd, by="ID")
      tT <- tT[order(tT$P.Value), ]  # restore correct order
j. Define a variable FinalOutput containing all the fields we want to print, chosen from the column names of tT (after the annotation merge) and from the contrasts defined in cont.matrix:
      # Type cont.matrix to show all the possible contrasts (see the Levels).
      FinalOutput <- subset(tT, select=c("ID","Gene.symbol","Gene.title","GO.Function","adj.P.Val","P.Value","G2...G0","G1...G0"))
   Indicate each contrast with three dots between the group names, e.g. "G2...G1".
k. Write a table containing the whole expression dataset:
      write.table(log_ex, file="Eset.txt", row.names=T, sep="\t")
l. Write a table containing the top list of 250 genes, corrected by FDR, with contrasts and annotations:
      write.table(FinalOutput, file="250TopList.txt", row.names=F, sep="\t")

6. The R script is intended to create comparison groups (Control vs Treatment). We have 3 groups: G0, G1 and G2. The 250TopList.txt output file also contains the log2 ratios between these groups, in the columns named after the selected contrasts (e.g. G2...G0 and G1...G0). Now open the 250TopList.txt file with Excel (following the formatting guidelines suggested by Excel) and add a column containing the formula “SIGN(cell_xx)*2^ABS(cell_xx)” to transform each log2 ratio into a fold-change value. Then define thresholds of 1.3 and -1.3 (the Excel cells can also be formatted in red and green to better visualize the values).
7. a. Use the Excel “sort” function to order the values, and note how many genes are downregulated (< -1.3) and how many are upregulated (> 1.3) in the G2/G0 comparison.
      i. e.g. G2...G0: Down: 100. Up: 17.
   b. How many downregulated (< -1.3) genes in the G2/G0 comparison have actin binding activity?
      Result: DIAPH3, ANTXR1
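
The fold-change conversion and counting described for the spreadsheet can also be sketched directly in R. The following is a minimal illustration, not part of the GEO2R-generated script: the function name signed_fold_change, the toy data frame and its values are all hypothetical, invented only to show the arithmetic of SIGN(x)*2^ABS(x) and the ±1.3 thresholds.

```r
# Convert log2 ratios into signed fold changes: sign(x) * 2^abs(x),
# the same transformation as the spreadsheet formula SIGN(x)*2^ABS(x).
# (Hypothetical helper name, not a limma or GEOquery function.)
signed_fold_change <- function(log_ratio) {
  sign(log_ratio) * 2^abs(log_ratio)
}

# Hypothetical toy table standing in for 250TopList.txt; the gene names and
# values are invented for illustration and are not results from GSE32164.
toy <- data.frame(Gene.symbol = c("geneA", "geneB", "geneC", "geneD"))
toy[["G2...G0"]] <- c(-1.0, 0.2, 0.5, -0.1)

# Add the fold-change column for the G2 vs G0 contrast.
toy$FC <- signed_fold_change(toy[["G2...G0"]])

# Count genes beyond the thresholds used in the exercise.
threshold <- 1.3
n_down <- sum(toy$FC < -threshold)  # downregulated in G2 vs G0
n_up   <- sum(toy$FC >  threshold)  # upregulated in G2 vs G0

cat("Down:", n_down, "Up:", n_up, "\n")
```

On real output, the same function could presumably be applied to the G2...G0 column after reading 250TopList.txt with read.delim(). Note that a fold-change threshold of 1.3 is equivalent to an absolute log2 ratio above log2(1.3), roughly 0.38, and that a log2 ratio of exactly 0 maps to 0 rather than 1 under this formula, in R as in Excel.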