Analyzing CAGE data

Unlike microarray data, tag data has no standard file formats. The data we
will work with is based on data from Japanese collaborators. They sequenced tags
from 7 mouse tissues (liver, cerebellum, embryo, lung, macrophages, somatosensory cortex and
visual cortex), then mapped these tags and clustered them into tag
clusters. We will only look at clusters with more than 30 tags and, to make this less
computationally demanding, at only 1000 tag clusters (the real dataset has
close to 19000 tag clusters). Each row is one promoter (or really a cluster of CAGE
tags), with its tags broken up by tissue of origin. The values are normalized to tags per
million (TPM).
Unlike previous exercises, this one will not spell out every command, because you have
done all of these things before. The harder steps will be shown, and hints will be given.
Download the file htbinf_cage_tpms.txt from the web directory
Read this file into R
d <- read.table("htbinf_cage_tpms.txt", header = TRUE)
Try to figure out what is in the columns.
Pretty self-explanatory. Tag cluster ids, locations, strand, and then TPM values for the
various tissues, which are the tissues above but abbreviated to three letters
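One way to inspect the columns is sketched below on a small made-up data frame. The object toy, its column names and all of its values are hypothetical stand-ins, not the contents of the real file; on the real data you would call the same functions on d.

```r
# Toy stand-in for d; every name and value here is invented for illustration
toy <- data.frame(
  id       = c("TC1", "TC2", "TC3"),
  location = c("chr1:100-120", "chr2:500-540", "chr3:900-910"),
  strand   = c("F", "R", "F"),
  cer      = c(1.5, 0.0, 7.2),
  emb      = c(2.0, 3.1, 0.4),
  liv      = c(30.4, 0.2, 1.1),
  stringsAsFactors = FALSE
)

dim(toy)       # number of rows (tag clusters) and columns
names(toy)     # column names
str(toy)       # type of each column plus the first few values
head(toy, 2)   # first two rows

# Names of the columns that hold numbers (the TPM values)
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)
```

str() is usually the quickest of these, since it shows at a glance which columns are strings and which are numeric.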
A tag cluster can here be viewed as a promoter together with its expression in the tissues of
interest.
Some of the tools we will use prefer matrices instead of data frames, so we’ll make an
alternative object, m, which holds almost the same things (skipping the columns with
strings):
m <- as.matrix(d[, 4:10])
Use the heatmap.2 function on the matrix m.
library(gplots)
heatmap.2(m)
A heatmap tries to group together the rows and columns that look similar: in this case,
promoters on the rows and tissues on the columns.
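The grouping itself is a hierarchical clustering. To my knowledge heatmap.2 uses dist() and hclust() by default, so the sketch below reproduces that step by hand with base R only. The matrix toy_m and its tissue names are invented for illustration.

```r
# Toy matrix: 4 promoters (rows) x 3 tissues (columns); values are invented
toy_m <- matrix(
  c(10, 11,  1,
     5,  6, 20,
     0,  1, 15,
     8,  8,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("p", 1:4), c("tisA", "tisB", "tisC"))
)

# dist() compares rows, so transpose to compare tissues (columns)
col_tree <- hclust(dist(t(toy_m)))

# Promoters (rows) are clustered the same way, without transposing
row_tree <- hclust(dist(toy_m))

# Cut the tissue tree into two groups to see which tissues sit together
col_groups <- cutree(col_tree, k = 2)
print(col_groups)
```

Here tisA and tisB have nearly identical profiles, so they end up in the same group. plot(col_tree) draws the same kind of tree that appears along the top of the heatmap.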
How can this image be interpreted? What is it that we see? What tissues are most similar
in terms of promoter usage? What tissue has the most “tissue-specific promoters”?
Rows are individual tag clusters, columns are tissues. White means high TPMs, red low
TPMs (normalized in each row only if you pass scale = "row"; heatmap.2 does not scale by
default). Just counting white areas, there are many promoters
that are used mostly in macrophages, although all tissues have more or less exclusive
promoters.
The “tree” at the top shows that cerebellum and macrophages cluster together. This is also
true for visual and somatosensory cortex (in fact, these two are more similar than any
other pair).
What most experimentalists want are the extremes: for instance, which promoters are
the most “tissue-specific”. Let us limit this question to liver:
What are the three most liver-specific promoters in this set, and what genes do they
belong to?
We first need to define what we mean by liver specificity. It makes sense to calculate
this, for each promoter, as
(liver TPM in promoter) / (sum of all TPMs in promoter)
This essentially gives the fraction of the promoter's expression contributed by liver. If this is
high, the promoter is biased towards liver.
There are many ways of doing this, but a simple way is
liver_spec <- m[, 3] / apply(m, 1, sum)  # if you do not understand this, you should look up what is happening
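If the line above is unclear, a small worked example may help. The matrix toy_m, its column names and its values are invented for illustration; the real command uses column 3 of m in the same way.

```r
# Toy matrix: 3 promoters x 3 tissues; values are invented
toy_m <- matrix(
  c( 2,  2, 16,
    10, 10, 10,
     0,  9,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("p", 1:3), c("cer", "emb", "liv"))
)

# apply(toy_m, 1, sum) calls sum() once per row (margin 1 means rows)
row_totals <- apply(toy_m, 1, sum)

# Dividing the liver column by the row totals works element by element
toy_spec <- toy_m[, "liv"] / row_totals

# rowSums() gives the same totals in a single call
toy_spec2 <- toy_m[, "liv"] / rowSums(toy_m)

print(round(toy_spec, 3))
```

Promoter p1 has 16 of its 20 TPMs in liver, so its specificity is 0.8. Promoter p2 is spread evenly over the three tissues and gets 1/3.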
Given this vector, which promoter(s) has/have the highest liver specificity, and where are they
located in the genome (what genes do they belong to)?
The liver_spec vector can be used to sort the data frame d (which also holds locations):
d[sort.list(liver_spec), ]
This will sort d according to liver_spec. The last rows will be the ones with the highest
values:
…
142 C9F2BE3351  chr9:46019409-46019413   F 3.94892 1.58179 30.359900 0.770272 5.88516 4.78599 4.29756
143 C6R53B58AA  chr6:87775402-87775477   R 3.94892 1.58179 30.359900 0.770272 5.88516 4.78599 4.29756
144 C17F2174305 chr17:35078917-35078966  F 3.94892 1.58179 30.359900 0.770272 5.88516 4.78599 4.29756
145 C10R177717A chr10:24605050-24605196  R 3.94892 1.58179 30.359900 0.770272 5.88516 4.78599 4.29756
146 C13F43FFEA  chr13:4456426-4456455    F 3.94892 1.58179 30.359900 0.770272 5.88516 4.78599 4.29756
I then take the last promoter and look it up in the genome browser: it turns out that this
promoter somewhat surprisingly resides in a 3’ UTR(!).
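To pull out the top three directly instead of reading them off the end of the sorted output, one option is to sort in decreasing order and keep the first three rows. This is only a sketch: the data frame toy and the vector toy_spec are invented stand-ins for d and liver_spec, and it uses order(), which behaves like sort.list() for a single vector.

```r
# Toy stand-ins for d and liver_spec; all values are invented
toy <- data.frame(
  id       = c("TC1", "TC2", "TC3", "TC4", "TC5"),
  location = c("chr1:100-120", "chr2:500-540", "chr3:900-910",
               "chr4:50-75", "chr5:300-330"),
  stringsAsFactors = FALSE
)
toy_spec <- c(0.10, 0.85, 0.40, 0.95, 0.60)

# order() returns the row indices that would sort toy_spec;
# decreasing = TRUE puts the most specific promoter first
idx <- order(toy_spec, decreasing = TRUE)

# Keep the three top-ranked rows and attach their scores
top3 <- cbind(toy[idx[1:3], ], spec = toy_spec[idx[1:3]])
print(top3)
```

On the real data, tail(d[sort.list(liver_spec), ], 3) gives the same three rows with the highest value last.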
Hints:
1) we have done this type of sorting previously, in the R lectures, using the
sort.list() function (look it up!).
2) the tags are mapped to the mm8 assembly