Analyzing CAGE data

Unlike microarray data, tag data has no standard formats. The data we will work with comes from Japanese collaborators. They have sequenced tags from 7 mouse tissues (liver, cerebellum, embryo, lung, macrophages, somatosensory cortex and visual cortex), and then mapped these tags and clustered them into tag clusters. We will only look at clusters with > 30 tags, and to make this less computationally demanding, we will only look at 1000 tag clusters (the real dataset has close to 19000 tag clusters). Each row is one promoter (or really a cluster of CAGE tags), broken up by which tissue the tags come from. The values are normalized to tags per million (TPM).

This exercise will not spell out all the commands as previous exercises did, because you have done all of these things before. Harder things will be shown, and hints will be given.

Download the file htbinf_cage_tpms.txt from the web directory.

Read this file into R:

    d <- read.table("htbinf_cage_tpms.txt", header = TRUE)

Try to figure out what is in the columns.

Pretty self-explanatory: tag cluster ids, locations, strand, and then TPM values for the various tissues, which are the tissues above but abbreviated to three letters. A tag cluster can here be viewed as a promoter and its expression in the tissues of interest.

Some of the tools we will use prefer matrices instead of data frames, so we’ll make an alternative object, m, which holds almost the same things (skipping the columns with strings):

    m <- as.matrix(d[, 4:10])

Use the heatmap.2 function on the matrix m.

    library(gplots)
    heatmap.2(m)

A heatmap tries to group together columns and rows that look similar, in this case promoters on the rows and tissues on the columns. How can this image be interpreted? What is it that we see? What tissues are most similar in terms of promoter usage? What tissue has the most “tissue-specific promoters”?

Rows are individual tag clusters, columns are tissues.
White means high TPMs, red low TPMs (normalized in each row). By just counting white areas, there are many promoters that are used mostly in macrophages, although all tissues have more or less exclusive promoters. The “tree” at the top shows that cerebellum and macrophages cluster together; this is also true for visual and somatosensory cortex (in fact, these two are more similar to each other than any other pair).

What experimentalists often want are the extremes: for instance, which promoters are the most “tissue-specific”. Let us limit this question to: what are the three most liver-specific promoters in this set, and what genes do they belong to?

We first need to define what we mean by liver specificity. It makes sense to calculate this, for each promoter, as

    (liver TPM in promoter) / (sum of all TPMs in promoter)

This essentially gives the fraction that liver contributes to each promoter: if this is high, the promoter is biased towards liver. There are many ways of doing this, but a simple way is

    liver_spec <- m[, 3] / apply(m, 1, sum)   # if you do not understand this, you should look up what is happening

Given this vector, which promoter(s) have the highest liver specificity, and where are they located on the genome (what genes do they belong to)?

The liver_spec vector can be used to sort the data frame d (which also holds the locations):

    d[sort.list(liver_spec), ]

This will sort d according to liver_spec.
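If the specificity calculation and the sorting step are unclear, it can help to try them on a tiny matrix first. The following is only a sketch: the promoter names, the three-letter tissue names and all values are made up for illustration and are not taken from htbinf_cage_tpms.txt.

```r
# Toy TPM matrix: 3 hypothetical promoters (rows) x 3 tissues (columns).
# Names and values are invented for illustration only.
toy <- matrix(c(10, 0, 30,
                 5, 5,  0,
                 1, 8,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("p1", "p2", "p3"), c("cer", "emb", "liv")))

# Fraction of each promoter's total TPM that comes from liver.
# apply(toy, 1, sum) computes one sum per row; rowSums(toy) gives the same numbers.
toy_spec <- toy[, "liv"] / apply(toy, 1, sum)

# sort.list() returns the row indices that put toy_spec in increasing order,
# so indexing with it reorders the rows from least to most liver-specific.
toy_sorted <- toy[sort.list(toy_spec), ]

print(round(toy_spec, 2))
print(toy_sorted)
```

Here p1 gets 30/40 = 0.75, p2 gets 0 and p3 gets 0.1, so the sorted matrix lists p2, p3, p1: the most liver-specific promoter ends up at the bottom.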
The last rows will be the ones with the highest values:

    …
    142  C9F2BE3351   chr9:46019409-46019413    F  3.94892  1.58179  30.359900  0.770272  5.88516  4.78599  4.29756
    143  C6R53B58AA   chr6:87775402-87775477    R  3.94892  1.58179  30.359900  0.770272  5.88516  4.78599  4.29756
    144  C17F2174305  chr17:35078917-35078966   F  3.94892  1.58179  30.359900  0.770272  5.88516  4.78599  4.29756
    145  C10R177717A  chr10:24605050-24605196   R  3.94892  1.58179  30.359900  0.770272  5.88516  4.78599  4.29756
    146  C13F43FFEA   chr13:4456426-4456455     F  3.94892  1.58179  30.359900  0.770272  5.88516  4.78599  4.29756

I then take the last promoter and look it up in the genome browser: it turns out that this promoter, somewhat surprisingly, resides in a 3’ UTR(!).

Hints:
1) We have done this type of sorting previously, in the R lectures, involving the sort.list() function (look it up!).
2) The tags are mapped to the mm8 assembly.
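Instead of sorting the whole data frame and reading off the bottom rows, one could also pull out the top three rows directly. The sketch below uses a small stand-in data frame; its column names (id, location, strand, cer, liv) and all values are hypothetical, chosen only to mimic the layout described above, and the real file has seven tissue columns.

```r
# Hypothetical stand-in for d: ids, locations, strand, then TPM columns.
toy_d <- data.frame(
  id       = c("A", "B", "C", "D"),
  location = c("chr1:100-120", "chr2:200-230", "chr3:300-310", "chr4:400-450"),
  strand   = c("F", "R", "F", "R"),
  cer      = c(10, 5, 1, 2),
  liv      = c(30, 0, 1, 18),
  stringsAsFactors = FALSE
)

# Numeric part only (columns 4 and 5 in this toy layout).
toy_m <- as.matrix(toy_d[, 4:5])

# Store the liver fraction next to the locations so they stay together.
toy_d$liver_spec <- toy_m[, "liv"] / rowSums(toy_m)

# order() with decreasing = TRUE puts the most liver-specific rows first;
# head(..., 3) then keeps the top three.
top3 <- head(toy_d[order(toy_d$liver_spec, decreasing = TRUE), ], 3)
print(top3[, c("id", "location", "strand", "liver_spec")])
```

The location column of the top rows is what you would then paste into the genome browser (remember to select the mm8 assembly).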