Final project of the course “Introduction to Matlab and Data Analysis”

During the past month you have been measuring the promoter activity (PA, the rate of change in protein levels per cell per unit time) of 20 genes. Your PI asks you to present your results at the next group meeting. The 20 genes are all part of a shared pathway, and they are listed in order of their distance from the initial gene in the pathway. After some thought, you decide to present the following graphs (detailed instructions follow below):

1. You want to calculate and display the correlation coefficients of the measured PA, and to find a linear relation between the average PA and the GC content of each gene. Note:
   A. The PA used should be the average PA for each gene.
   B. The GC content should be expressed in STDs from the mean GC content of the 20 genes. Thus, a GC content equal to the mean GC content of all 20 genes counts as 0, and a GC content one STD from the mean counts as 1. You should get values between -0.2 and 0.15. (This procedure is quite common when the values of a variable are crowded around a certain value, as is the case with the GC content of the 20 genes.)
   You should smooth the data before plotting it (see instructions below).
2. For the 4 highest-correlation genes, you wish to show how inter-gene correlation decays with distance from the initial gene in the pathway.

You have the following data files:

1. genetrial.txt – a part (10,000 rows) of a GenBank file containing data on E. coli genes. For each gene it lists its synonyms, its position in the DNA (start and stop codon), its functions, and its GO terms, which you already know and love.
2. sequences.fasta – a file containing the DNA sequence of E. coli.
3. FinalProjectData.mat – contains the following variables:
   a. genesList – a cell array with the gene names.
   b. Measurements – a matrix with 1001 time measurements of the PA of each gene.
      Each gene’s PA measurements form a row.

Steps towards submitting the project:

1. Get the sequences of the genes from the GenBank and FASTA files and calculate the GC content:
   a. Read the sequences.fasta file into a cell array (note: it might be beneficial to parse it one character at a time with %c instead of the regular %s).
   b. Read the genetrial.txt data into another cell array.
   c. Find all rows where the word CDS appears and take the position of the gene in the fasta file (both start and end). Note: if the word ‘complement’ appears before the start..end positions, then when you calculate the GC content you should actually look for [TA] content.
   d. Convert the ‘start’..‘end’ positions that you found from strings to numbers.
   e. Extract the list of gene names; you can assume that the gene name is always located one row below the CDS line.
   f. Find where the genes of your experiment (geneList variable) appear in the GeneNames taken from the genetrial.txt file.
   g. For each gene in geneList, calculate the GC content of its sequence: take the relevant gene sequence and look for G or C in it. If it is a complement gene, look for T and A.
   h. Build a structure with all of the following fields:
      i. Gene name
      ii. Whether it is a complement or not
      iii. Start position
      iv. End position
      v. Gene sequence
      vi. GC content
      No loops are allowed for building the structure. Structures can be assigned using the function struct(), passing an entire cell array for each field.
2. Plot GC content vs. mean PA and the PA correlation among the genes:
   a. Scale the GC content data to be the distance, in STDs, from the mean of the 20 genes’ GC content.
   b. Calculate the mean PA for each gene.
   c. Calculate the correlation between the scaled GC content and the average PA (use corrcoef(x,y)). Plot the points on a graph (only the points).
   d.
      Create a linear fit to the points and plot it on the same graph (use fitPA=polyfit(xdata,ydata,1) and then yfit=polyval(fitPA,x) to get the fit, and then plot).
   e. Smooth the data using the ‘rloess’ method. Recall that the ‘r’ (robust) methods give outliers less weight in the smoothing procedure. Note that with ‘rloess’ you should specify not the number of points in the smoothing window but the span as a fraction of the data (the fraction of the data that each window covers). Here, use a span of 0.25.
   f. Plot the smoothed data and the linear fit of that data on the same graph. Write the equation of each linear fit next to it. Note the difference between the two techniques: the smoothing reduced the effect of the ‘outliers’ in our data, producing a better fit.
   g. Calculate the PA correlation matrix of the 20 genes (use corrcoef(x) with a single matrix; note that Measurements is a matrix in which each gene is a row, while corrcoef() calculates the correlation between columns).
   h. Display the correlation matrix with the ‘hot’ colormap. Also show the colorbar, and use the gene names as xticklabels and yticklabels. Place the xticklabels at the top of the figure rather than at the bottom (use the ‘XAxisLocation’ parameter).
3. For the 4 highest-correlation genes, find how the correlation between genes decays with ‘distance’ from the initial gene in the pathway. Here, assume the genes in geneList are ordered by their distance from the initial gene. Thus, the gene at geneList{3} is ‘distanced’ 2 units from the initial gene in the pathway.
   a. For each gene, take its correlation with the other genes further along the list (i.e. for the first gene take the first row at positions 2:end, for the second gene the second row at positions 3:end, and so on).
   b. Define an anonymous function that is a decaying exponent with 2 parameters: myfunc(c,xdata), where c holds the parameters we’re looking for and xdata holds the x values.
      Another parameter appearing in the function should be called ‘initDis’, the initial displacement. The fit should be to the function:
      $\mathrm{yfit} = \mathrm{initDis} + c(1)\, e^{c(2)\cdot \mathrm{xdata}}$
      This is an example of how to use anonymous functions to handle extra parameters: in our case, lsqcurvefit knows how to deal with only 2 inputs (c,xdata), but the function actually depends on 3 parameters (c,xdata,initDis). Note that the extra parameter should be assigned before you define the anonymous function.
   c. For each of the genes, find the best fit using the function lsqcurvefit() (see the attached example file). initDis should be set to -0.1.
   d. Use subplot() to display the data of the 4 genes together with each fit (using the fit parameters you found in the previous step). Plot the data with markers only and the fit as a solid line. The x axis should show the names of the genes that appear in the specific fit. The title of each subplot should be: ‘GeneName Correlation data, Fit parameters are: c1=c(1), c2=c(2), Displacement=initDis’, where the variable names (GeneName, c(1), c(2), initDis) stand for values that you should insert.
4. The m-files I presented in the last tutorial are attached to the project. Use them wisely. Also, the figures should look like the ones in the presentation, which is also attached.
5. Each plotted graph is worth 40 points. Each loop you use decreases the grade by 3 points. Nested loops decrease the grade by the product of the loops: 2 nested loops = 9 points, 3 nested loops = 27 points, and so on.
6. I’m here for questions, comments, problems and so on: uvhart@weizmann.ac.il, or better, use the forum at: https://groups.google.com/group/weizmannmatlabfall2012a?hl=iw&ie=UTF-8&pli=1
   Each question should start with the title: Final Project: Q#.# (question numbering should follow the numbering of the steps above).

Best of luck in the group meeting,
yuval
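
Illustration for step 1h: a minimal sketch of how struct() can build a whole structure array without loops when each field is given as a cell array. All gene names, positions and sequences below are made up, the field names are hypothetical, and the GC content is assumed here to be a fraction of G or C characters; the special handling of complement genes described in step 1g is not included.

```matlab
% Hypothetical per-gene data (made up for illustration only).
names    = {'geneA', 'geneB', 'geneC'};
isComp   = [false, true, false];
startPos = [10, 50, 120];
endPos   = [13, 53, 123];
seqs     = {'ATGC', 'GGCC', 'ATAT'};

% Fraction of G or C in each sequence, computed without an explicit loop.
gcFrac = cellfun(@(s) mean(s == 'G' | s == 'C'), seqs);

% Passing a 1x3 cell array for every field yields a 1x3 struct array,
% one element per gene. num2cell turns numeric vectors into cell arrays.
genes = struct('Name',         names, ...
               'IsComplement', num2cell(isComp), ...
               'Start',        num2cell(startPos), ...
               'End',          num2cell(endPos), ...
               'Sequence',     seqs, ...
               'GCContent',    num2cell(gcFrac));
```

Each element can then be read as, for example, genes(2).Name or genes(2).GCContent.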
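
Illustration for steps 2a to 2d and 2g: a rough sketch of the scaling, correlation and linear-fit calls on made-up numbers. The variable gc and the small Measurements matrix here are stand-ins and not the real project data, so the resulting values say nothing about what the real data should give. Note that std() normalizes by N-1 by default.

```matlab
% Made-up stand-ins for the real variables (illustration only).
gc = [0.48 0.50 0.52 0.51 0.49];          % hypothetical GC fractions, one per gene
Measurements = [1 2 3 4;                  % each row is one gene,
                2 4 6 8;                  % each column one time point
                4 3 2 1;
                1 3 2 4;
                5 5 6 7];

% Step 2a: distance from the mean in units of STD.
gcScaled = (gc - mean(gc)) ./ std(gc);

% Step 2b: mean PA per gene (mean along the time dimension).
meanPA = mean(Measurements, 2)';

% Step 2c: corrcoef(x,y) returns a 2x2 matrix; the off-diagonal entry
% is the correlation between x and y.
r = corrcoef(gcScaled, meanPA);
rGcPa = r(1, 2);

% Step 2d: first-order polynomial fit and its values at the data points.
fitPA = polyfit(gcScaled, meanPA, 1);
yfit  = polyval(fitPA, gcScaled);

% Step 2g: corrcoef works on columns, so transpose to make genes columns.
R = corrcoef(Measurements');
```

With 5 genes, R is 5-by-5; with the real 20-gene matrix the same call would give a 20-by-20 matrix.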
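
Illustration for steps 3b and 3c: a small sketch of why initDis has to be assigned before the anonymous function is defined. The function captures the value of initDis at the moment it is created, so lsqcurvefit only ever sees the two inputs (c, xdata). The data, the "true" parameters cTrue and the initial guess c0 are invented for the example; lsqcurvefit requires the Optimization Toolbox.

```matlab
% The extra parameter must exist BEFORE the anonymous function is created,
% because its value is captured at definition time.
initDis = -0.1;

% Decaying exponent with fit parameters c(1) and c(2).
myfunc = @(c, xdata) initDis + c(1) .* exp(c(2) .* xdata);

% Synthetic, noise-free data (made up for illustration only).
xdata = 1:10;
cTrue = [0.9, -0.3];
ydata = myfunc(cTrue, xdata);

% Changing initDis now does NOT affect myfunc; it keeps the captured -0.1.
initDis = 5;
yCheck = myfunc([2, 0], [0, 1]);   % -0.1 + 2*exp(0) = 1.9 at both points

% Fit the two parameters; c0 is an arbitrary initial guess.
c0 = [1, -1];
cFit = lsqcurvefit(myfunc, c0, xdata, ydata);
```

The same pattern works for any number of extra parameters, as long as they are all assigned before the @(c, xdata) line.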