functional_enrichment_new - Baliga Lab at Institute for Systems

advertisement
Introduction to Systems Biology 2012
Functional Enrichment Analysis:
making sense of big data
Aaron Brooks & Fang Yin Lo
8/21/2012
Insight
Experimental
Design
Data
analysis
Data
collection
Mischel et al, 2004
From knots to knowledge
• What is functional enrichment?
- Tools and caveats (e.g. DAVID and pvals)
• How can you apply these tools to large,
complex analysis problems (i.e. automation)?
Why functional enrichment analysis?
Interesting glioblastoma study
Common
Pathways
Gene
e.g. Glycolysis
Gene
Urn problem:
Gene
Gene Gene
Gene Gene
Function
Common
Functions
e.g. sugar metabolism
Draw 6 marbles (k)
How likely am I to draw 3 or more marbles?
N m
m
( )( )
P(X  q) 
()
q
Total: 15 marbles (N):
5 red (m)
10 black (N-m)
kq
N
k
N: population size
m: # of positives in the population
k: # of draws
q: # of positives
The workflow of typical enrichment tools
e.g. Gene Ontology
(GO), KEGG Pathway,
etc
e.g. GO Terms that are
enriched in the input
gene list
Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
(the database for annotation, visualization and
integrated discovery)
• Diverse, web-based functional analysis tool
• Integrates a suite of databases and statistical tools
(GO, KEGG, Interpro, Disease)
• User-friendly,
• Problematic for large analysis problems (many
independent sets)
staRt your engines
http://baliga.systemsbiology.net/events/sysbio/content/bicluster-307
http://baliga.systemsbiology.net/events/sysbio/content/bicluster-353
What if you had
many sets to
analyze?
AUTOMATION!!!
topGO
• An R package that facilitates semi-automated enrichment
analysis for Gene Ontology
• Three main Steps:
1. Prepare data
- create a topGO object with list of genes identiers, gene-to-GO
annotations)
2. Run enrichment tests
3. Display the results
Structured controlled vocabularies (ontologies) that describe relationships
between gene products and their associated biological roles
• cellular components : the parts of a cell or its extracellular
environment
• molecular functions: activities, such as catalytic or binding
activities, that occur at the molecular level (e.g. catalytic activity, Toll
receptor binding)
• biological processes: series of events accomplished by one or
more ordered assemblies of molecular functions (e.g. signal
transduction, pyrimidine metabolic process )
GO structure
• Directed Acyclic Graph(DAG)
• Child terms are more specialized
• Child can have more than one parent
Data preparation
# Install topGO and Affymetrix Human Genome U133 Plus 2.0 Array
annotation data
> source("http://bioconductor.org/biocLite.R")
> biocLite("topGO")
> biocLite("hgu133plus2.db")
> geneSets # Input a list a genes
#Boot the gaggle
> library(gaggle)
> gaggleInit()
#load the library
> library(topGO)
> library(hgu133plus2.db)
Data preparation
### Initializing the analysis ###
# hgu133plus2ACCNUM: an R object that contains mappings
between the manufacturers identifiers and gene names of
Affymetrix Human Genome U133 Plus 2.0 Array
# all.genes: all background genes ( gene universe )
> all.genes <- ls(hgu133plus2ACCNUM)
Other annotation packages at
Bioconductor
Other annotation packages can be found at:
http://www.bioconductor.org/packages/release/data/annotation/
Data preparation: Input gene lists
### Make gene lists ###
# We will make a list that includes two sets of genes of interest
# Initialize the list:
>glioblastoma.genes = list()
http://baliga.systemsbiology.net/events/sysbio/content/bicluster-307
# broadcast 'bicluster 307 genes' to R
>glioblastoma.genes[["bc307"]] = sapply(getNameList(),tolower)
Do the same for the other gene list:
http://baliga.systemsbiology.net/events/sysbio/content/bicluster-353
>glioblastoma.genes[["bc353"]] = sapply(getNameList(),tolower)
Data preparation: make topGO object
## Analyze genes in bc353 first
> relevant.genes <- factor(as.integer(all.genes %in% glioblastoma.genes[["bc353"]]))
> names(relevant.genes) <- all.genes
# Construct the topGOdata object for automated analysis
>GOdata.BP <- new("topGOdata", ontology='BP', allGenes = relevant.genes,
annotationFun = annFUN.db, affyLib = 'hgu133plus2.db')
# ontology:'BP','MF; or 'CC'
# allGenes: named vector of type numeric or factor. The names attribute
contains the genes identifiers. The genes listed in this object define the gene
universe.
# annotationFun: function that maps gene identifiers to GO terms.
# annFUN.db extracts the gene-to-GO mappings from the affyLib object
# affyLib: character string containing the name of the Bioconductor annotaion
package for a specific microarray chip.
Run Enrichment Analysis
> results <- runTest(GOdata.BP, algorithm = 'classic', statistic = 'fisher’)
Analysis of results: summary
# generate a summary of the enrichment analysis
> results.table <- GenTable(GOdata.BP, results, topNodes =
length(results@score))
# How many GO terms were tested?
> dim(results.table)[1]
# reduce results to GO terms passing Benjamini-Hochberg multiple hypothesis
corrected pval <= 0.05, FDR <= 5%
>results.table.bh <results.table[which(p.adjust(results.table[,"result1"],method="BH")<=0.05),]
Analysis of results: get significant GO terms
# reduce results to GO terms passing Benjamini-Hochberg multiple hypothesis
corrected pval <= 0.05, FDR <= 5%
>results.table.bh <results.table[which(p.adjust(results.table[,"result1"],method="BH")<=0.05),]
# How many terms are enriched?
>dim(results.table.bh)[1]
# What are the top ten terms?
>results.table.bh[1:10,]
Analysis of results: get genes in top GO terms
# Get all the genes annotated to a specific GO term of interest:
>GOid.of.interest = results.table.bh[1,"GO.ID"]
>all.term.genes = genesInTerm(GOdata.BP,GOid.of.interest)[[1]]
# Which of these genes is in the bicluster?
>genes.of.interest <- intersect(glioblastoma.genes[["bc353"]],all.term.genes)
# print table with probe ID and gene symbol
>gene.symbol= toTable(hgu133plus2SYMBOL[genes.of.interest])
# print table with probe ID and gene names
>gene.name= toTable(hgu133plus2GENENAME[genes.of.interest])
# Combine the information of the genes, output to csvfile:
>cbind(gene.symbol,gene.name[,2])
>write.csv(cbind(gene.symbol,gene.name[,2]), file =
"glioblastoma.genes_bc353_in_immune response.csv“)
Automation
results <- list()
for( bc in names(glioblastoma.genes) ) {
cat(paste("Computing functional enrichment for...",bc,"\n"))
relevant.genes <- factor(as.integer(all.genes %in%
glioblastoma.genes[[bc]]))
names(relevant.genes) <- all.genes
GOdata.BP <- new("topGOdata", ontology='BP', allGenes =
relevant.genes, annotationFun = annFUN.db, affyLib =
'hgu133plus2.db')
results[[bc]] <- GenTable(GOdata.BP,runTest(GOdata.BP, algorithm =
'classic', statistic = 'fisher'),topNodes = length(results@score))
}
Questions?
Other algorithms supported by topGO
• Standard implementations of GO testing compute the
significance of a node independent of the significance of the
neighboring nodes (‘classic’)
• Other algorithms take into considerations of the GO structure
and try to find more specific GO terms (e.g.’elim’, ‘weight’,
Alexa et al. (2006). Bioinformatics (2006) 22 (13): 1600-1607.
‘weight01’,etc)
Other algorithms supported by topGO
# Try running other algorithms and compare the results:
>r1.BP.elim = runTest(GOdata.BP, algorithm = 'elim', statistic = 'fisher')
>r1.BP.weight = runTest(GOdata.BP, algorithm = 'weight', statistic = 'fisher‘)
# This will take a while…
# After the runs are done, visually compare resulting p values from different algorithms:
>pValue.classic <- score(r1.BP)
>pValue.elim <- score(r1.BP.elim)[names(pValue.classic)]
>pValue.weight <- score(r1.BP.weight)[names(pValue.classic)]
>gstat <- termStat(GOdata.BP, names(pValue.classic))
>gSize <- gstat$Annotated / max(gstat$Annotated) * 4
>colMap <- function(x) {
.col <- rep(rev(heat.colors(length(unique(x)))), time = table(x))
return(.col[match(1:length(x), order(x))])
}
>gCol <- colMap(gstat$Significant)
>plot(pValue.classic, pValue.elim, xlab = "p-value classic", ylab = "p-value elim",pch = 19,
cex = gSize, col = gCol)
Broadcasting gene list to DAVID
1. In R:
> broadcast(geneSetGenes)
2.
3. Select target:
DAVID
4. Broadcast to
DAVID
Broadcasting gene list to DAVID
Broadcasting gene list to DAVID
Broadcasting gene list to DAVID
Gene list and population
background being analyzed
Clustering options and contingency
A group of terms having similar
biological meaning due to
sharing similar gene members
Original database/resources
where the terms orient
Enriched terms associate
with input gene list
Modified Fisher Corrected p-values
Exact p-values
Some remaining challenges
Realistically positioning the role of enrichment P-values in the current datamining environment:
•
•
•
•
•
•
•
high-throughput enrichment data-mining environment is extremely complicated
Variations of the user gene list size
deviation of the number of genes associated with each annotation
gene overlap between annotation
incompleteness of annotation content
strong connectivity/dependency among genes
unbalanced distributions of annotation content
Limitation of multiple testing correction on enrichment P-values
•
common multiple testing correction techniques maybe overly conservative approaches if there are thousands or
even more annotation terms involved in the analysis
Genome Inform. 2005;16:106-115.; Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
Cross-comparing enrichment analysis results derived from multiple gene lists
• the size of the gene list impacts the absolute enrichment P-values, therefore
difficult to directly compare the absolute enrichment P-values across gene lists
Some remaining challenges
some may treat the resulting enrichment P-values as a scoring system that
plays a advisory role
more of an exploratory procedure, with the aid of enrichment P-value, rather
than a pure statistical solution.
Nat. Protoc. 2008. doi: 10.1038/nprot.2008.211
the specificity of enrichment analysis is more impacted by non-statistical
layers than it is by statistical methods alone
Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
Working with topGO data object
# work with Godata.BP
# obtaining all genes
> a = genes(GOdata.BP)
> str(a)
chr [1:31777] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at"
"1294_at" ...
# number of genes
> numGenes(GOdata.BP)
[1] 31777
# The list of significant genes can be accessed using the method sigGenes()
> sg = sigGenes(GOdata.BP)
> str(sg)
chr [1:22] "201291_s_at" "202095_s_at" "202589_at" "202705_at"
"203213_at" ...
Working with topGO data object
# accessing information related to GO and its structure
# which GO terms are available for analysis:
> ug = usedGO(GOdata.BP)
> str(ug)
chr [1:10921] "GO:0000002" "GO:0000003" "GO:0000012" "GO:0000018" "GO:0000019" ...
# select some random GO terms: (1). count the number of annotated genes and (2) obtain their annotation
> sel.terms <- sample(usedGO(GOdata.BP), 10)
> sel.terms
[1] "GO:0032913" "GO:0043372" "GO:0044259" "GO:0032700" "GO:0043122" "GO:2001141" "GO:0060587"
[8] "GO:0071352" "GO:0007256" "GO:0051343“
# Check what are the genes annotated to a specific GO term:
> genesInTerm(GOdata.BP, "GO:0032913" )
$`GO:0032913`
[1] "208650_s_at" "208651_x_at" "209771_x_at" "209772_s_at" "216379_x_at" "266_s_at"
# Number of genes annotated to the selected GO terms:
> num.ann.genes <- countGenesInTerm(GOdata.BP, sel.terms)
> num.ann.genes
GO:0032913 GO:0043372 GO:0044259 GO:0032700 GO:0043122 GO:2001141 GO:0060587 GO:0071352
GO:0007256
6
36
111
12
454
6809
2
5
11
GO:0051343
2
Working with topGO data object
> ann.genes <- genesInTerm(GOdata.BP, sel.terms)
> str(ann.genes)
List of 10
$ GO:0032913: chr [1:6] "208650_s_at" "208651_x_at" "209771_x_at" "209772_s_at" ...
$ GO:0043372: chr [1:36] "1554519_at" "1555689_at" "1565358_at" "1569748_at" ...
$ GO:0044259: chr [1:111] "1554383_a_at" "1555540_at" "1555896_a_at" "1556499_s_at" ...
$ GO:0032700: chr [1:12] "1552798_a_at" "1556190_s_at" "201300_s_at" "207160_at" ...
$ GO:0043122: chr [1:454] "1552360_a_at" "1552703_s_at" "1552798_a_at" "1552804_a_at" ...
$ GO:2001141: chr [1:6809] "121_at" "1316_at" "1405_i_at" "1487_at" ...
$ GO:0060587: chr [1:2] "201525_at" "207092_at"
$ GO:0071352: chr [1:5] "201940_at" "201941_at" "201942_s_at" "201943_s_at" ...
$ GO:0007256: chr [1:11] "1558984_at" "203652_at" "206362_x_at" "207347_at" ...
$ GO:0051343: chr [1:2] "207514_s_at" "214286_at"
Analysis of results: get genes in top GO terms
#We can also look at multiple GO terms at the same time:
> GOids.of.interest = results.table.bh[c(1:10),"GO.ID"]
> all.term.genes = genesInTerm(GOdata.BP,GOids.of.interest)
# Which of these genes is in the bicluster?
> genes.of.interest <sapply(names(all.term.genes),function(x){intersect(all.term.genes[[x]],gliobla
stoma.genes[["bc353"]])})
# print table with probe ID and gene symbol:
> geneSynmol.of.interest <lapply(names(genes.of.interest),function(x){toTable(hgu133plus2SYMBOL[ge
nes.of.interest[[x]]])})
> names(geneSynmol.of.interest)<- GOids.of.interest
Download