Final project of the course “Introduction to Matlab and Data Analysis”

During the past month you have been measuring promoter activity (PA, the rate of
change in protein levels per cell per time) of 20 genes. Your PI comes to you and asks
you to present your results in the next group meeting. The 20 genes are all part of a
shared pathway, and they are listed in order of their distance from the initial gene in
the pathway.
You thought about it quite a lot, and finally decided to present the following graphs
(detailed instructions follow below):
1. You want to calculate and display the correlation coefficients of the measured
PA, and to find a linear relation between each gene’s average PA and its GC
content.
Note:
A. The PA calculated should be the average PA for each gene.
B. The GC content should be expressed in units of STD from the mean GC content of the
20 genes. Thus, a GC content that equals the mean GC content of all 20
genes will count as 0, and a GC content that is one STD above the mean will count as 1.
You should get values between -0.2 and 0.15. (This procedure is quite common
when the values of a certain variable appear to be crowded around a certain
value, which is the case with the GC content of the 20 genes.) You should
smooth the data before plotting it (see instructions below).
2. For the 4 highest-correlation genes, you wish to show how inter-gene correlation
decays with distance from the initial gene in the pathway.
You have the following data files:
1. genetrial.txt – this is a part (10,000 rows) of a GenBank file which contains data
on E. coli genes. For each gene it lists its synonyms, its position in the
DNA (start and stop codon), its functions and its GO terms, which you already know
and love.
2. sequences.fasta – a file containing the DNA sequence of E. coli.
3. FinalProjectData.mat – contains the following variables:
a. genesList – a cell array with the gene names.
b. Measurements – a matrix with 1001 time measurements of the PA of each
gene. Each row holds the PA measurements of one gene.
Steps towards submitting the project:
1. Get the sequences of the genes from the GenBank and FASTA files and calculate GC
content:
a. Read the sequences.fasta file into a cell array (note: it might be
beneficial to parse it one character at a time with %c instead of the
regular %s).
b. Read the genetrial.txt data into another cell array.
c. Find all rows where the word CDS appears and take the position of the gene
in the fasta file (both the beginning and the end). Note: if the word
‘complement’ appears before the start..end positions, then when you
calculate the GC content you should actually look for [TA] content.
d. Convert the start..end positions that you found from strings
to numbers.
e. Extract the list of gene names; you can assume that the gene name is always
located one row below the CDS line.
f. Find where the genes of your experiment (geneList variable) appear in
the GeneNames taken from the genetrial.txt file.
g. For each gene in geneList, calculate the GC content of its sequence. You
should take the relevant gene sequence and look for G or C in it. If it is a
complement gene, you should look for T and A.
h. Build a structure with all the following fields:
i. Gene Name
ii. Is it a complement or not
iii. Start position
iv. End position
v. Gene sequence
vi. GC content
No loops are allowed when building the structure. Structures can be
created with the function struct() by passing an entire cell array for
each field.
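A minimal sketch of steps 1c to 1h on toy data may help here. Everything in it is an assumption made for illustration: the short genome string, the four GenBank-like lines, the /gene="..." pattern for the name row, and the field names of the structure are hypothetical, and real GenBank location strings can be more complicated than start..end. It follows the instruction above to count T and A for complement genes.

```matlab
% Toy stand-ins for the parsed files (hypothetical data, not the real E. coli files).
genome   = 'ATGCGCGATATTTACGGCCGATATCGCGTA';       % plays the role of sequences.fasta
gbLines  = {'     CDS             1..9'; ...
            '                     /gene="geneA"'; ...
            '     CDS             complement(10..21)'; ...
            '                     /gene="geneB"'};  % plays the role of genetrial.txt
geneList = {'geneB'; 'geneA'};                     % genes of the toy "experiment"

% 1c: rows containing the word CDS, and whether each one is a complement entry.
isCDS    = ~cellfun(@isempty, strfind(gbLines, 'CDS'));
cdsLines = gbLines(isCDS);
isComp   = ~cellfun(@isempty, strfind(cdsLines, 'complement'));

% 1c-1d: extract start..end as strings, then convert them to numbers.
tok      = regexp(cdsLines, '(\d+)\.\.(\d+)', 'tokens', 'once');
tok      = vertcat(tok{:});                        % N-by-2 cell array of strings
startPos = str2double(tok(:, 1));
endPos   = str2double(tok(:, 2));

% 1e: the gene name is assumed to sit one row below each CDS row.
nameTok   = regexp(gbLines(find(isCDS) + 1), '/gene="([^"]+)"', 'tokens', 'once');
geneNames = vertcat(nameTok{:});                   % N-by-1 cell array of strings

% 1f: where each experiment gene appears in the parsed names.
[~, loc] = ismember(geneList, geneNames);

% 1g: cut out each sequence and compute the fraction of G/C (or T/A for complement).
seqs    = arrayfun(@(s, e) genome(s:e), startPos(loc), endPos(loc), ...
                   'UniformOutput', false);
fracOf  = @(seq, letters) sum(ismember(seq, letters)) / numel(seq);
useTA   = isComp(loc);
content = cellfun(@(seq) fracOf(seq, 'GC'), seqs);
content(useTA) = cellfun(@(seq) fracOf(seq, 'TA'), seqs(useTA));

% 1h: one struct() call, each field given as a whole cell array (no explicit loop).
genes = struct('name',         geneList, ...
               'isComplement', num2cell(useTA), ...
               'startPos',     num2cell(startPos(loc)), ...
               'endPos',       num2cell(endPos(loc)), ...
               'sequence',     seqs, ...
               'gcContent',    num2cell(content));
```

Passing cell arrays of equal size to struct() yields a struct array with one element per gene, which is what lets the structure be built without a loop.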
2. Plot GC content vs. mean PA and PA correlation among the genes:
a. Scale the GC content data to be distance in STD from the mean of the 20
genes’ GC content.
b. Calculate the mean of the PA for each gene.
c. Calculate the correlation between the scaled GC content and the average
PA (use corrcoef(x,y)). Plot the points on a graph (only the points).
d. Create a linear fit to the points and plot it on the same graph (use:
fitPA=polyfit(xdata,ydata,1) and then yfit=polyval(fitPA,xdata) to get the fit,
and then plot).
e. Smooth the data using the method ‘rloess’. Recall that smoothing with ‘r’
methods allows for robust smoothing in a way that outliers get less
weight in the smoothing procedure. Note that when using ‘rloess’ you
should specify not the number of points in the window to smooth, but
rather the data span fraction (which fraction of the data each window
averages over). Here, you should use 0.25 as the span fraction to
smooth with.
f. Plot the smoothed data and the linear fit of that data on the same graph.
Write next to each linear fit its equation.
Note the difference between the two techniques: the smoothing lowered
the effect of the ‘outliers’ in our data, producing a better fit.
g. Calculate the PA correlation matrix of the 20 genes. (use: corrcoef(x) with
a single matrix. Note that measurements is a matrix where each gene is a
row while corrcoef() calculates correlation between columns).
h. Display the correlation matrix with the ‘hot’ colormap. Show also the
colorbar and use as xticklabels and yticklabels the gene names. Have the
xticklabels on the top part of the figure and not at the bottom (use:
‘XAxisLocation’ parameter).
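The steps of part 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are random toy values rather than the contents of FinalProjectData.mat, the gene names are made up, and smooth() is assumed to be available from the Curve Fitting Toolbox. The sketch also assumes that the quantity to smooth is the mean PA as a function of the scaled GC content, and it sorts the points by x first so that the smoothing windows cover neighbouring points.

```matlab
rng(1);                                            % reproducible toy data
nGenes = 20;  nTimes = 1001;
geneNames    = arrayfun(@(k) sprintf('g%02d', k), 1:nGenes, 'UniformOutput', false);
gcContent    = 0.5 + 0.02 * randn(nGenes, 1);      % hypothetical GC fractions
Measurements = cumsum(randn(nGenes, nTimes), 2);   % hypothetical PA, one gene per row

% 2a-2b: GC content as distance from the mean in STD units, and mean PA per gene.
gcScaled = (gcContent - mean(gcContent)) / std(gcContent);
meanPA   = mean(Measurements, 2);

% 2c: corrcoef(x,y) returns a 2-by-2 matrix; the off-diagonal entry is the correlation.
R = corrcoef(gcScaled, meanPA);
r = R(1, 2);

% 2d: linear fit to the raw points (sorted by x for plotting and smoothing).
[xs, order] = sort(gcScaled);
ys    = meanPA(order);
fitPA = polyfit(xs, ys, 1);

% 2e: robust loess smoothing with a span of 0.25 (fraction of the data), then refit.
ysSmooth  = smooth(xs, ys, 0.25, 'rloess');
fitSmooth = polyfit(xs, ysSmooth, 1);

% 2c-2f: points, smoothed points, both fits, and the equation next to each fit.
figure;
plot(xs, ys, 'o', xs, polyval(fitPA, xs), '-', ...
     xs, ysSmooth, 's', xs, polyval(fitSmooth, xs), '--');
text(xs(1), polyval(fitPA, xs(1)), ...
     sprintf('y = %.3f x + %.3f', fitPA(1), fitPA(2)));
text(xs(end), polyval(fitSmooth, xs(end)), ...
     sprintf('y = %.3f x + %.3f', fitSmooth(1), fitSmooth(2)));

% 2g: corrcoef works on columns, so transpose to get gene-by-gene correlations.
paCorr = corrcoef(Measurements');

% 2h: correlation matrix with the 'hot' colormap and gene names on both axes.
figure;
imagesc(paCorr); colormap(hot); colorbar;
set(gca, 'XAxisLocation', 'top', ...
         'XTick', 1:nGenes, 'XTickLabel', geneNames, ...
         'YTick', 1:nGenes, 'YTickLabel', geneNames);
```

With random toy data the two fits will not show any meaningful relation; the point is only the order of the calls and the transpose before corrcoef().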
3. For the 4 highest-correlation genes, find how the correlation between the genes
decays with ‘distance’ from the initial gene in the pathway. Here, assume the genes in
geneList are ordered by their distance from the initial gene. Thus, a gene at place
geneList{3} is ‘distanced’ 2 units from the initial gene in the pathway.
a. For each gene take its correlation with other genes (i.e. for the first gene
you should take the first row from place 2:end. For the second gene, the
second row from place 3:end and so on…)
b. Define an anonymous function that is a decaying exponent with 2
parameters: myfunc(c,xdata), where c holds the parameters we’re looking for
and xdata is the x data. Another parameter appearing in the function should be called
‘initDis’, which will be the initial displacement. The fit should be to the
function: yfit = initDis + c(1)*exp(c(2)*xdata)
This is an example of how to use anonymous functions to handle extra
parameters (in our case, the function lsqcurvefit knows how to deal with only
2 inputs (c,xdata), but the function is actually given 3 parameters (c,xdata,initDis)).
Note that the extra parameter should be assigned before you define the
anonymous function.
c. For each of the genes find the best fit using the function lsqcurvefit() (see
example file attached). initDis should be assigned to the value -0.1.
d. Use subplot() to display the data of the 4 genes together with each fit
(using the fit parameters you already found in the previous section). For
the data use markers only, and for the fit a solid line. The x axis
should show the names of the genes that appear in the specific fit. The title of
each subplot should be: ‘GeneName Correlation data, Fit parameters are:
c1=c(1), c2=c(2), Displacement=initDis’, where GeneName, c(1), c(2) and initDis
are variables whose values you should insert.
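To illustrate the extra-parameter idea from part 3, here is a small sketch for a single gene. It is not the attached example file: the data are generated from made-up parameters (cTrue), the starting guess c0 and the gene name 'geneA' are assumptions, and lsqcurvefit() requires the Optimization Toolbox. In the real project the y data for gene k would be row k of the correlation matrix from place k+1 to the end, as described in step a.

```matlab
initDis = -0.1;                                    % assigned BEFORE the handle is created
myfunc  = @(c, xdata) initDis + c(1) * exp(c(2) * xdata);

% Hypothetical toy "correlation vs. distance" data, generated from known parameters.
xdata = 1:10;
cTrue = [0.9, -0.4];
ydata = myfunc(cTrue, xdata);

% lsqcurvefit only ever calls myfunc(c, xdata); initDis travels inside the handle.
c0   = [1, -1];                                    % assumed starting guess
cFit = lsqcurvefit(myfunc, c0, xdata, ydata);

% One of the four panels: markers for the data, a solid line for the fit.
subplot(2, 2, 1);
plot(xdata, ydata, 'o', xdata, myfunc(cFit, xdata), '-');
title(sprintf(['%s Correlation data, Fit parameters are: ' ...
               'c1=%.2f, c2=%.2f, Displacement=%.1f'], ...
              'geneA', cFit(1), cFit(2), initDis));

% The handle captured the value of initDis at definition time, so changing
% the variable afterwards does not change what myfunc computes.
initDis  = 5;
captured = myfunc([0, 0], 0);                      % still -0.1, not 5
```

This is why the extra parameter has to be assigned before the anonymous function is defined: the handle stores a copy of its value at that moment.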
4. Attached to the project are the m files I presented in the last tutorial. Use them
wisely. Also, the figures should look like the ones in the presentation, which is also
attached.
5. Each plotted graph is worth 40 points. Each loop you use decreases the grade by
3 points. For nested loops the penalties are multiplied: 2
nested loops = 9 points, 3 nested loops = 27 points and so on…
6. I’m here for questions/comments/problems and so on: uvhart@weizmann.ac.il
or better use the forum at:
https://groups.google.com/group/weizmannmatlabfall2012a?hl=iw&ie=UTF-8&pli=1
Each question should start with the title: Final Project: Q#.# (question numbering
should follow the suggested steps numbering).
Best of luck in the group meeting,
yuval