by Kyrylo Bessonov
(kbessonov@ulg.ac.be)
9 Oct 2012
• Is the repository with extensions and libraries for R-language found at http://www.bioconductor.org/
• Bioconductor libraries cover
• Micorarray analysis
• Genetic variants analysis (SNPs)
• Sequence analysis (FASTA, RNA-seq)
• Annotation (pathways, genes)
• High-troughput assays (Mass Spec)
• All libraries are free to use and contain documentation (i.e. vignettes )
• Vignettes are short and require some previous knowledge of R and/or other defendant libraries
2
• To install Bioconductor libraries the R environment needs to be configured source("http://bioconductor.org/biocLite.R")
• To download any Bioconductor library use biocLite( " package_name " ) function biocLite("GenomeGraphs")
• Load the downloaded library functions via the library("lib_name") function library("GenomeGraphs")
• Read help file(s) via the
??
or browseVignettes()
??
library_name (e.g. ??GenomeGraphs) browseVignettes (package = "GenomeGraphs")
3
The GenomeGraphs library
Authors: Steen Durinck and James Bullard
(Bioconductor library)
• Allows to retrieve data from
• Ensembl comprehensive DB on genomes
• intron/exon locations
• sequence genetic variation data
• protein properties (pI, domains, motifs)
• GenomeGraphs Bioc library allows to display data on:
• Gene Expression (expression / location)
• Comparative genomic hybridization (CGH)
• Sequencing (e.g. variations/introns/exons)
5
• gdPlot main plotting function gdPlot(gdObjects, minBase, maxBase ...)
• gdObjects = any objects created by GenomeGraphs a) BaseTrack ; b) Gene ; c) GenericArray ; d) RectangleOverlay
• minBase the lowest nt position to be plotted ( optional )
• maxBase the largest nt to display ( optional )
•
Objects are data structures that hold/organize a set of variables
1) Create an object to plot using makeBaseTrack() function
ObjBaseT = makeBaseTrack (1:100, rnorm(1:100),strand = "+")
2) Plot the newly created object gdPlot (ObjBaseT)
• Display object structure / properties (e.g. of BaseTrack ) attributes (ObjBaseT)
• $ sign next to the name represents object variables
• Change or view individual object variables attr (ObjBaseT, “strand")
6
• Changing values of variables (classical way) attr(Object, "variable/parameter") = value attr(ObjBaseT, "strand") = "-“
• If an array, to access individual elements use [element number 1..n] attr(ObjBaseT, "variable/parameter")[number] = new_value attr(ObjBaseT, "base")[1] = 0
• Changing graphical parameters such as color showDisplayOptions (ObjBaseT) alpha = 1 lty = solid color = orange size = 5 lwd = 1 type = p getPar (ObjBaseT, "color")
[1] "orange" setPar (ObjBaseT, "color", "blue")
7
• gdPlot(…) can take any number of gdObjects to plot
• Let’s plot two BaseTrack objects a and b a = makeBaseTrack(1:100, rlnorm(100), strand = "+") b = makeBaseTrack(1:100, rnorm(100), strand = "-") ab=list(a,b) gdplot(ab)
8
• ab is list of objects a and b
• To access individual object within the list use [ ] ab[1] will display object a ab[2] will display object b
• To modify the grouped object use double [[ ]]
• To change color of b to red getPar(ab [[ 2 ]] , "color") setPar(ab[[2]], "color", " red ")
-OR(re-create object) b = makeBaseTrack(1:100, rnorm(100), strand = "-", dp=DisplayPars(color=" red ")) ab=list(a,b) gdPlot(ab)
9
• gdPlot does not provide functions to label axis
• Trick = “use labeled / tagged” objects
"label" = GenomeGraph object
"+ strand"= makeBaseTrack(1:100, rlnorm(100), strand = "+")
• To display legend use makeLegend("text","color") ab=list("+"=a, "-"=b, makeGenomeAxis (), makeLegend (c("+", "-"),c("orange", "red")) )
10
Combining capabilities of biomaRt and GenomeGraph libraries
• biomaRt library allows to retrieve data from public DBs ensembl snp unimart bacteria_mart_14
ENSEMBL GENES 68 (SANGER UK)
ENSEMBL VARIATION 68 (SANGER UK)
UNIPROT (EBI UK)
ENSEMBL BACTERIA 14 (EBI UK)
***Use listMarts() to see all available databases***
• Let’s retrieve gene data of the Bacillus subtilis strain
• useMart(database,dataset) allows to connect to specified database and dataset within this database db=useMart("bacteria_mart_14") listDatasets(db)
Dataset
…
Description
… bac_6_gene Bacillus subtilis genes (EB 2 b_subtilis) version
…
EB 2 b_subtilis db=useMart("bacteria_mart_14", "bac_6_gene")
12
• listAttributes() shows all prop. of the biomaRt obj.
• The db object has total of 4175 genes
• Use getBM(attribute, filter, value, biomaRt_obj) to extract values belonging to specified attributes
• attribute: general term such as gene name/chromosome # / strand (+ or - )
• filter: parameter applied on attribute such as genomic region to consider (i.e. start and end in nt)
• value: actual values of the applied filter(s) getBM(c("external_gene_id", "description","start_position",
"end_position", "strand"), filters = c("start", "end"), values = list(1,10000), db)
ID description metS Methionyl-tRNA synthetase ftsH Cell division protease ftsH homolog hslO 33 kDa chaperonin
Dgk Deoxyguanosine kinase start(nt)
45633
76984
79880
23146 end(nt) strand
47627 1
78897
80755
23769
1
1
-1 13
• Create an object with makeGeneRegion() function makeGeneRegion(start,end,chromosome name, strand, biomaRt object, plotting options)
• Find notation used for the chromosome naming getBM("chromosome_name","","",db) chromosome_name
1 Chromosome gRegion = makeGeneRegion(1, 10000, chr = " Chromosome ", strand = "+", biomart = db, dp =
DisplayPars( plotId = TRUE , idRotation = 90, cex = 0.8, idColor = "black")) gdPlot(list(gRegion, makeGenomeAxis(), makeTitle("Position(nt)",cex=3,"black",0.1)))
14
ensembl_gene_id
15
GenomeArray()
• HT sequencing technologies allow to sequence mRNA in a series of short contigs of 50-200 bp
• In addition to gene expression analysis it possible to
• Map location of introns (UTRs) / exons
• Principal: one searches for a rapid changes in abundance of the RNA-Seq signal (contigs)
• Integration of sequence + expression information
• Ugrappa Nagalakshmi et. al. 2008 had used this strategy to accurately map yeast genome 1
• Task: Display part of the seqDataEx dataset having both
• abundance of mRNA (cDNA) transcripts and
• annotated yeast genome
1
Ugrappa Nagalakshmi et. al. The transcriptional landscape of the yeast genome defined by RNA sequencing .
Science , 2008
17
• Need to get mRNA contig abundance data data("seqDataEx") rnaSeqAb=seqDataEx$david
• Create GeneticArray object with makeGenericArray (intensity, probeStart, probeEnd, trackOverlay, dp = NULL)
• intensity either microaray or RNAseq transcript abundance signal
• probStart start position of the probe (location in nt)
• Plot contig abundance w.r.t. to genomic location gdPlot(list("abun"=makeGenericArray(rnaSeqAb[, "expr", drop = FALSE], rnaSeqAb [, "location"]) , makeGenomeAxis() ) )
18
• Need to get annotated yeast genome data annotGen = useMart("ensembl", "scerevisiae_gene_ensembl")
• Create mRNA contig abundance mRNAabun = makeGenericArray(rnaSeqAb[, "expr", drop = FALSE], rnaSeqAb [, "location"])
• Create annotated seq. covering the mRNA contigs location annotSeq= makeGeneRegion(start = min(rnaSeqAb[, "location"]), end = max(rnaSeqAb[, "location"]), chr = "IV", strand = "+", biomart = annotGen, dp = DisplayPars(plotId = TRUE, idRotation = 0, cex = 0.85, idColor = "black", size=0.5))
• Combine objects and plot them gdPlot( list("abund" = mRNAabun,makeGenomeAxis(), "+" = annotSeq),
1299000,1312000)
19
20
• To overlay rectangle use makeRectangleOverlay() makeRectangleOverlay(start, end, region = NULL, coords = c("genomic", "absolute"), dp) rectOver = makeRectangleOverlay( 1301500, 1302200, region=c(1,2), "genomic", DisplayPars(alpha = 0.5))
• To overlay text use makeTextOverlay() tOver = makeTextOverlay("Ribosomal Large subunit",
1302000, 0.95, region = c(1,1), dp = DisplayPars(color = "red"))
• Combine all overlay objects into one vector with c(v1,v2) gdPlot( list("abund" = mRNAabun,makeGenomeAxis(),
"+" = annotSeq), overlays= c (rectOver, tOver) )
21
22
makeTranscript(id, type, biomart, dp = NULL)
• mRNA coming from the same ORF could be spliced in many ways
• E.g. case of VDR genes of IgG
• Given biomaRt object, the makeTranscript() will extract splicing information for given id (i.e. gene)
• Download human genome databse hGenome <- useMart("ensembl", "hsapiens_gene_ensembl")
• Select Ensembl ID to look at head(getBM(c("ensembl_gene_id", "description"),"","", hGenome))
• Get splicing data from biomaRt object (hGenome) spliceObj = makeTranscript("ENSG00000168309",
"ensembl_gene_id" ,hGenome)
• Plot the object with gdPlot gdPlot(list(makeTitle("Transcript ID:
ENSG00000168309"),splicingObj, makeGenomeAxis()) )
24
25
• GeneGraphs provides a wide range to plot genomic data
• Can use external databases through biomaRt
• Main useful features
• identifies exons/introns
• allows to cross-reference expression / genome data
• flexible albeit complex plotting capabilities
• allows to overlay graphical objects and text
• ability to create custom legends
• annotation capabilities provided by powerful biomaRt
26