Final project - “Introduction to Matlab and Data Analysis”

During the past month you have been measuring the promoter activity (PA, the rate of
change in protein levels per cell per time) of 10 genes. Your PI asks you to present your
results at the next group meeting. The 10 genes are all part of a shared pathway, and
they are listed in order of their distance from the initial gene in the pathway.
You thought about it quite a lot, and finally decided to present the following graphs
(detailed instructions follow below):
1. You want to calculate and display the correlation coefficients of the measured
PA and find a linear relation between average PA and the genes’ GC content for
each gene.
Note:
A. The PA calculated should be the average PA for each gene.
B. The GC content should be expressed in STD units from the mean GC content of the
10 genes. Thus, a GC content that equals the mean GC content of all 10
genes will equal 0, and a GC content that is one STD above the mean will count as 1.
(This procedure is quite common when the values of a variable are
crowded around a certain value, which is the case with the GC content of
the 10 genes.) You should smooth the data before plotting it (see instructions
below).
2. For a certain gene (rbsK), you wish to show how inter-gene correlation decays
with distance from the initial gene in the pathway.
You have the following data files:
1. genebank2013.txt – a part (~17,000 rows) of a GenBank file that
contains data on E. coli genes. For each gene it lists its synonyms,
its place in the DNA (start and stop codon), its functions, and its GO terms, which
you already know and love.
2. sequences.fasta – a file containing the DNA sequence of E. coli.
3. FinalProjectData2013.mat – contains the following variables:
a. genesList2013 – a cell array with the genes names.
b. Measurements2013 – a matrix with 1001 time measurements of PA of 10
genes. Each gene PA measurement is a row.
Steps towards submitting the project:
1. Get the sequences of the genes from the GenBank and FASTA files and calculate GC
content:
a. Read the sequences.fasta file into a cell array (note: it might be
beneficial to parse it one character at a time with %c instead of the
regular %s).
b. Read the GenBank data into another cell array (one line per cell).
c. Find all rows where the word CDS appears and take the position of the gene
in the FASTA file (both the beginning and the end). Note: if the word
‘complement’ appears before the start..end positions, then when you
calculate the GC content you should actually look for [TA] content.
d. Convert the ‘start’..‘end’ positions that you found from strings
to numbers.
e. Take the list of gene names. You can assume that the gene name is always
located one row below the CDS line.
f. Find where the genes of your experiment (the genesList2013 variable) appear
in the GeneNames taken from the genebank2013.txt file.
g. For each gene in genesList2013, calculate the GC content of its sequence.
Take the relevant gene sequence and look for G or C in it. If it
is a complement gene, look for T and A instead.
h. Build a structure with all the following fields:
i. Gene Name
ii. Is it a complement or not
iii. Start position
iv. End position
v. Gene sequence
vi. GC content
No loops are allowed for building the structure (structures can be
assigned using the function struct() and inserting entire cell arrays for
each field).
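Steps c to h can be sketched as follows. This is only a minimal sketch on toy data: the four GenBank-style lines, the short genome string, the variable names (gbLines, genome, genes and so on) and the field names are all assumptions made for illustration, and the real files may be formatted differently. File reading (steps a and b) is left out. Note that cellfun() and arrayfun() are used in place of explicit for loops.

```matlab
% Toy stand-ins for the real data (assumed formats, not the actual files).
genome  = 'ATGCGCGTTAACCGGATATCGCGCTA';      % would come from sequences.fasta
gbLines = {'     CDS             3..10'; ...
           '                     /gene="abcA"'; ...
           '     CDS             complement(12..20)'; ...
           '                     /gene="xyzB"'};

% c. Rows that contain the word CDS, and whether they are complement genes.
isCDS   = ~cellfun(@isempty, strfind(gbLines, 'CDS'));
cdsRows = gbLines(isCDS);
isComp  = ~cellfun(@isempty, strfind(cdsRows, 'complement'));

% c-d. Extract start..end as strings, then convert them to numbers.
tok = regexp(cdsRows, '(\d+)\.\.(\d+)', 'tokens', 'once');
pos = str2double(vertcat(tok{:}));           % N-by-2 matrix: [start end]

% e. The gene name is assumed to sit one row below each CDS line.
nameTok = regexp(gbLines(find(isCDS) + 1), '/gene="([^"]+)"', 'tokens', 'once');
names   = vertcat(nameTok{:});               % N-by-1 cell array of names

% f. Locate the experiment's genes (hypothetical list) among the parsed names.
[found, idx] = ismember({'xyzB'}, names);

% g. Cut out each sequence and compute the fraction of G/C letters,
%    or of T/A letters for complement genes, as the instructions ask.
seqs    = arrayfun(@(s, e) genome(s:e), pos(:, 1), pos(:, 2), ...
                   'UniformOutput', false);
letters = repmat({'GC'}, size(isComp));
letters(isComp) = {'TA'};
gc      = cellfun(@(seq, L) mean(ismember(seq, L)), seqs, letters);

% h. Build the struct array in one call: each cell array supplies one
%    field value per gene, so no loop is needed.
genes = struct('name',         names, ...
               'isComplement', num2cell(isComp), ...
               'startPos',     num2cell(pos(:, 1)), ...
               'endPos',       num2cell(pos(:, 2)), ...
               'sequence',     seqs, ...
               'gcContent',    num2cell(gc));
```

Passing a cell array as a field value makes struct() return a struct array with one element per cell, which is why the numeric and logical vectors are wrapped with num2cell() first.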
2. Plot mean PA vs. GC content and PA correlation among the genes:
a. Scale the GC content data (x axis data) to be distance in STD from the
mean of the 10 genes’ GC content (either calculate (X-mean(x))/std(x) or
use the function zscore()).
b. Calculate the PA mean for each gene (y axis data).
c. Calculate the correlation between the scaled GC content and the average
PA (use corrcoef(x,y)). Plot the points on a graph with red markers (only
the points).
d. Smooth the data using the method ‘moving average’ with 5 data points in
a window [Recall that smoothing with ‘r’ methods allows for robust
smoothing in a way that outliers get less weight in the smoothing
procedure]. Plot the smoothed data on the same graph, with blue
markers.
e. Plot the linear fit of that data on the same graph with blue solid line.
Write next to the linear fit its equation (use: fitPA=polyfit(xdata, ydata,1)
and then yfit=polyval(fitPA,x) to get the fit, and then plot).
f. Calculate the PA correlation matrix of the 10 genes. (use: corrcoef(x) with
a single matrix. Note that measurements is a matrix where each gene is a
row while corrcoef() calculates correlation between columns).
g. Display the correlation matrix with the ‘hot’ colormap. Also show the
colorbar, and use the gene names as the xticklabels and yticklabels. Place the
xticklabels at the top of the figure rather than at the bottom (use the
‘XAxisLocation’ parameter).
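The steps of section 2 might look roughly like this. It is a sketch on synthetic data, not a reference solution: the variable names and the made-up numbers are assumptions, smooth() requires the Curve Fitting Toolbox, and fitting the line to the smoothed points (rather than to the raw ones) is one possible reading of step e.

```matlab
% Synthetic stand-ins (assumed shapes: 10 genes, each gene is a row).
gcContent    = [0.48 0.51 0.47 0.50 0.52 0.49 0.46 0.53 0.50 0.49]';
t            = linspace(0, 10, 1001);
measurements = 1 + sin((1:10)' * t / 5);     % 10-by-1001 PA matrix
geneNames    = arrayfun(@(k) sprintf('gene%d', k), 1:10, ...
                        'UniformOutput', false);

% a-b. x: GC content in STD units from the mean; y: mean PA of each gene.
x = (gcContent - mean(gcContent)) / std(gcContent);
y = mean(measurements, 2);                   % average along each row

% c. Correlation coefficient between scaled GC content and mean PA.
R   = corrcoef(x, y);
rXY = R(1, 2);

% Sort by x so that the smoothing window runs over neighbouring points.
[xs, order] = sort(x);
ys = y(order);

% d. Moving-average smoothing with a 5-point window.
ySmooth = smooth(ys, 5, 'moving');

% e. Linear fit (here: of the smoothed data) and its values along x.
fitPA = polyfit(xs, ySmooth, 1);
yFit  = polyval(fitPA, xs);

figure; hold on;
plot(xs, ys, 'r.');                          % raw points, red markers only
plot(xs, ySmooth, 'b.');                     % smoothed points, blue markers
plot(xs, yFit, 'b-');                        % linear fit, blue solid line
text(xs(2), yFit(2), sprintf('y = %.3f x + %.3f', fitPA(1), fitPA(2)));
xlabel('GC content (STD from mean)'); ylabel('Mean PA');

% f. corrcoef() correlates columns, so transpose to put genes in columns.
C = corrcoef(measurements');

% g. Show the matrix with the 'hot' colormap and gene names as tick labels.
figure; imagesc(C); colormap(hot); colorbar;
set(gca, 'XTick', 1:10, 'XTickLabel', geneNames, ...
         'YTick', 1:10, 'YTickLabel', geneNames, ...
         'XAxisLocation', 'top');
```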
3. You are given a figure (rbsKdata2013.jpg) that describes how correlation
between the genes decays with ‘distance’ from the initial gene in the pathway.
a. Using ginput(), take the values of the points in the graph. Take also the X
axis values and Y axis values in order to convert the points you’ve taken
to the “real” scale of the data.
b. Define an anonymous function which is a decaying exponent with 2
parameters: myfunc(c,xdata) where c are the parameters we’re looking
for. Another parameter appearing in the function should be
called ‘initDis’, which will be the initial displacement. The fit should be to the
function: yfit = initDis + c(1) * exp(c(2) * xdata)
This is an example of how to use anonymous functions to handle extra
parameters. In our case, the function lsqcurvefit knows how to deal with only
2 inputs (c,xdata), but the fit function actually depends on 3 (c,xdata,initDis).
Note that the extra parameter should be assigned before you define the
anonymous function.
c. Find the best fit using the function lsqcurvefit() (see example file
attached). initDis should be assigned to the value -0.1.
d. Plot the genes data and the fit (using the fit parameters you already
found in the previous section). For the data use markers only; for the
fit use a solid line. The X axis tick labels should be the names of the genes that appear in
the specific fit. The title should be: ‘rbsK Correlation data, Fit parameters
are: c1=c(1), c2=c(2), Displacement=initDis’, where c(1), c(2) and initDis are variables
whose values you should insert (by using sprintf()).
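A possible sketch of section 3 follows. The clicked points are replaced here by noiseless synthetic data, and the helper pix2data, the reference pixel values, the starting guess c0 and the gene labels are all hypothetical. lsqcurvefit() requires the Optimization Toolbox.

```matlab
% a. Convert a clicked pixel coordinate p to data units, given two clicked
%    reference pixels pRef on an axis and their known data values dRef.
pix2data = @(p, pRef, dRef) dRef(1) + ...
           (p - pRef(1)) * (dRef(2) - dRef(1)) / (pRef(2) - pRef(1));
xExample = pix2data(300, [100 500], [1 9]);  % pixel 300 lies halfway: 5

% b. The extra parameter is assigned BEFORE the anonymous function is
%    defined; its value is captured at definition time.
initDis = -0.1;
myfunc  = @(c, xdata) initDis + c(1) * exp(c(2) * xdata);

% Synthetic data standing in for the points taken with ginput().
xdata = 1:9;
ydata = myfunc([1.0, -0.5], xdata);

% c. Least-squares fit of the two parameters c(1) and c(2).
c0   = [0.5, -0.1];                          % assumed starting guess
opts = optimset('Display', 'off');
cFit = lsqcurvefit(myfunc, c0, xdata, ydata, [], [], opts);

% d. Data as markers, fit as a solid line, gene names on the X axis.
labels = arrayfun(@(k) sprintf('g%d', k), xdata, 'UniformOutput', false);
figure; hold on;
plot(xdata, ydata, 'o');
plot(xdata, myfunc(cFit, xdata), '-');
set(gca, 'XTick', xdata, 'XTickLabel', labels);
title(sprintf(['rbsK Correlation data, Fit parameters are: ' ...
               'c1=%.2f, c2=%.2f, Displacement=%.1f'], ...
               cFit(1), cFit(2), initDis));
```

Because the anonymous function stores the value initDis had when it was created, changing initDis afterwards does not affect myfunc; the function has to be redefined for a new value to take effect.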
4. The last figure should contain all 3 plots so far in the following way –
a. Open a new figure and by using the axes() command create three
subplots – the first one should be small in the middle of the top row and
display the MeanPA vs. GCcontent plot.
b. Axes #2 and #3 should each be 1.5 times as big as the first one and sit in
the bottom row – plot in the left graph the correlation matrix and in the
right graph the correlation data for rbsK and its fitted exponential decay
function. The general scheme is: axes 1 centered in the top row, axes 2 at the bottom left, and axes 3 at the bottom right.
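One way to lay out the three axes is sketched below. The concrete position values are assumptions chosen only so that the axes do not overlap, and "1.5 times as big" is read here as 1.5 times the width and height of the first axes.

```matlab
% Positions are [left bottom width height] in normalized figure units.
w = 0.25;  h = 0.25;                         % size of axes 1 (assumed)

figure;
ax1 = axes('Position', [0.5 - w/2, 0.65, w,       h]);        % top, centered
ax2 = axes('Position', [0.08,      0.10, 1.5 * w, 1.5 * h]);  % bottom left
ax3 = axes('Position', [0.55,      0.10, 1.5 * w, 1.5 * h]);  % bottom right

% Draw into a specific axes by passing its handle to the plotting command.
plot(ax1, 1:10, (1:10).^2, 'r.');
imagesc(magic(10), 'Parent', ax2);
plot(ax3, 1:9, exp(-0.5 * (1:9)), 'o');
```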
5. Attached to the project are the m files I presented in the last tutorial. Use them
wisely. Also, the figures should look like the ones appearing in the presentation,
also attached.
6. The first 3 graphs are worth 33 points. The last graph is worth 10 more points. Each
loop you use decreases the grade by 3 points. Nested loops decrease the grade by
the product of the per-loop penalties: 2 nested loops = 9 points, 3 nested loops = 27
points, and so on…
7. I’m here for questions/comments/problems and so on: uvhart@weizmann.ac.il
or better use the forum at:
Each question should start with the title: Final Project: Q#.# (question numbering
should follow the suggested steps numbering).
Best of luck in the group meeting,
Yuval