Final project - “Introduction to Matlab and Data Analysis”

During the past month you have been measuring promoter activity (PA, the rate of change in protein levels per cell per unit time) of 10 genes. Your PI asks you to present your results in the next group meeting. The 10 genes are all part of a shared pathway, and they are arranged in a list by their distance from the initial gene in the pathway. You thought about it quite a lot, and finally decided to present the following graphs (detailed instructions follow below):

1. You want to calculate and display the correlation coefficients of the measured PA, and find a linear relation between the average PA and the GC content of each gene. Note:
   A. The PA used should be the average PA of each gene.
   B. The GC content should be displayed as the distance, in STD, from the mean GC content of the 10 genes. Thus, a GC content that equals the mean GC content of all 10 genes will equal 0, and a GC content that is one STD above the mean will count as 1. (This procedure is quite common when the values of a variable are crowded around a certain value, which is the case with the GC content of the 10 genes.)
   You should smooth the data before plotting it (see instructions below).
2. For a certain gene (rbsK), you wish to show how inter-gene correlation decays with distance from the initial gene in the pathway.

You have the following data files:

1. genebank2013.txt: a part (~17,000 rows) of a GenBank file that contains data on E. coli genes. For each gene it lists its synonyms, its position in the DNA (start and stop codons), its functions, and its GO terms, which you already know and love.
2. sequences.fasta: a file containing the DNA sequence of E. coli.
3. FinalProjectData2013.mat: contains the following variables:
   a. genesList2013: a cell array with the gene names.
   b. Measurements2013: a matrix with 1001 time measurements of the PA of 10 genes. Each row holds the PA measurements of one gene.

Steps towards submitting the project:

1. Get the sequences of the genes from the GenBank and FASTA files and calculate the GC content:
   a. Read the sequences.fasta file into a cell array (note: it might be beneficial to parse it one character at a time with %c instead of the regular %s).
   b. Read the GenBank data into another cell array (one line per cell).
   c. Find all rows where the word CDS appears and extract the position of the gene in the FASTA file (both the beginning and the end). Note: if the word ‘complement’ appears before the start..end positions, then when you calculate the GC content you should actually look for [TA] content.
   d. Convert the ‘start’..‘end’ positions that you found from strings to numbers.
   e. Extract the list of gene names. You can assume that the gene name is always located one row below the CDS line.
   f. Find where the genes of your experiment (the genesList2013 variable) appear among the gene names taken from the genebank2013.txt file.
   g. For each gene in genesList2013, calculate the GC content of its sequence. You should take the relevant gene sequence and look for G or C in it. If it is a complement gene, you should look for T and A.
   h. Build a structure with all of the following fields:
      i. Gene name
      ii. Whether it is a complement or not
      iii. Start position
      iv. End position
      v. Gene sequence
      vi. GC content
      No loops are allowed for building the structure (structures can be created using the function struct(), passing an entire cell array for each field).
2. Plot mean PA vs. GC content and the PA correlation among the genes:
   a. Scale the GC content data (x axis data) to be the distance, in STD, from the mean GC content of the 10 genes (either calculate (x-mean(x))/std(x) or use the function zscore()).
   b. Calculate the mean PA of each gene (y axis data).
   c. Calculate the correlation between the scaled GC content and the average PA (use corrcoef(x,y)). Plot the points on a graph with red markers (only the points, no line).
   d. Smooth the data using the ‘moving average’ method with a window of 5 data points [recall that smoothing with the ‘r’ methods gives robust smoothing, in which outliers get less weight in the smoothing procedure]. Plot the smoothed data on the same graph, with blue markers.
   e. Plot the linear fit of that data on the same graph with a blue solid line. Write the equation of the linear fit next to it (use fitPA=polyfit(xdata,ydata,1) and then yfit=polyval(fitPA,x) to get the fit, and then plot).
   f. Calculate the PA correlation matrix of the 10 genes (use corrcoef(x) with a single matrix; note that Measurements2013 is a matrix where each gene is a row, while corrcoef() calculates the correlation between columns).
   g. Display the correlation matrix with the ‘hot’ colormap. Also show the colorbar, and use the gene names as xticklabels and yticklabels. Place the xticklabels at the top of the figure and not at the bottom (use the ‘XAxisLocation’ parameter).
3. You are given a figure (rbsKdata2013.jpg) that describes how the correlation between the genes decays with ‘distance’ from the initial gene in the pathway.
   a. Using ginput(), take the values of the points in the graph. Also take X axis values and Y axis values, in order to convert the points you have taken to the “real” scale of the data.
   b. Define an anonymous function that is a decaying exponent with 2 parameters: myfunc(c,xdata), where c holds the parameters we are looking for. Another parameter appearing in the function should be called ‘initDis’, which is the initial displacement. The fit should be to the function:
      $y_{fit} = initDis + c(1)\, e^{c(2)\, x_{data}}$
      This is an example of how to use anonymous functions to handle extra parameters: in our case, the function lsqcurvefit can deal with only 2 inputs (c,xdata), but the model actually depends on 3 (c,xdata,initDis). Note that the extra parameter should be assigned before you define the anonymous function.
   c. Find the best fit using the function lsqcurvefit() (see the attached example file). initDis should be assigned the value -0.1.
   d. Plot the gene data and the fit (using the fit parameters you found in the previous section). Use markers only for the data and a solid line for the fit. The X axis labels should be the names of the genes that appear in this specific fit. The title should be: ‘rbsK Correlation data, Fit parameters are: c1=c(1), c2=c(2), Displacement=initDis’, where c(1), c(2) and initDis are variables whose values you should insert (by using sprintf()).
4. The last figure should contain all 3 plots so far, arranged in the following way:
   a. Open a new figure and, using the axes() command, create three subplots. The first one should be small, placed in the middle of the top row, and display the mean PA vs. GC content plot.
   b. Axes #2 and #3 should be 1.5 times as big as the first one and placed in the bottom row. Plot the correlation matrix in the left graph, and the correlation data for rbsK with its fitted exponential decay function in the right graph.
   This is the general scheme:
          1
      2       3
5. Attached to the project are the m files I presented in the last tutorial. Use them wisely. Also, the figures should look like the ones appearing in the presentation, which is also attached.
6. The first 3 graphs are worth 33 points. The last graph is worth 10 more points. Each loop you use decreases the grade by 3 points. Nested loops decrease the grade by the product of the loop penalties: 2 nested loops = 9 points, 3 nested loops = 27 points, and so on.
7. I’m here for questions, comments, problems and so on: uvhart@weizmann.ac.il, or, better, use the forum. Each question should start with the title: Final Project: Q#.# (question numbering should follow the numbering of the steps above).

Best of luck in the group meeting,
Yuval
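
A note on step 1h: the loop-free structure can be sketched as below. This is only a minimal illustration on toy data; the field names, gene names and sequences are hypothetical and not taken from the project files, only four of the six required fields are shown, and the special handling of complement genes is left out. Whether cellfun() counts as a loop for grading purposes is an assumption you should check.

```matlab
% Toy data (hypothetical values, not taken from the project files)
names  = {'geneA', 'geneB', 'geneC'};
isComp = {false, true, false};
seqs   = {'ATGGCC', 'ATTTAA', 'GGGCAT'};

% Fraction of G/C characters in each sequence, without an explicit loop.
% 'UniformOutput', false keeps the result as a cell array for struct().
gcFrac = cellfun(@(s) mean(ismember(s, 'GC')), seqs, 'UniformOutput', false);

% Passing cell arrays of equal size to struct() creates a struct array
% with one element per cell (here 1x3).
genes = struct('name', names, 'isComplement', isComp, ...
               'sequence', seqs, 'gcContent', gcFrac);

disp(numel(genes))     % number of elements in the struct array
disp(genes(2).name)    % name stored in the second element
```

Because every field receives a cell array of the same size, struct() distributes the cells across the elements of the array; a field that should hold the same value in all elements can be passed as a plain (non-cell) value.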
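
A note on steps 2a and 2f: the scaling and the transpose before corrcoef() can be checked on a small example. The numbers below are toy values chosen so the results are easy to verify by hand; they are not project data, and the variable names are hypothetical.

```matlab
% Toy GC content values (hypothetical)
x = [2 4 6];

% Distance from the mean in units of STD; equivalent to zscore(x),
% which needs the Statistics Toolbox
xScaled = (x - mean(x)) / std(x);

% Toy measurements: 3 "genes" (rows) x 4 time points (columns)
M = [1 2 3 4;
     2 4 6 8;
     4 3 2 1];

% corrcoef() correlates columns, so transpose to make each gene a column
R = corrcoef(M');

disp(xScaled)
disp(size(R))
```

Here row 2 of M is exactly twice row 1 and row 3 is row 1 reversed, so the corresponding entries of R should come out as +1 and -1. Calling corrcoef(M) without the transpose would instead give a 4x4 matrix of correlations between time points.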
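
A note on step 3b: the reason the extra parameter must be assigned first is that an anonymous function captures the value of initDis at the moment it is defined. The sketch below shows this with hypothetical parameter values (c = [1, -0.5]) and toy x values; it does not perform the fit itself.

```matlab
% initDis must exist before the anonymous function is created,
% because its current value is captured at definition time
initDis = -0.1;
myfunc = @(c, xdata) initDis + c(1) * exp(c(2) * xdata);

% Changing initDis afterwards does not affect myfunc
initDis = 5;

% Evaluate the model with hypothetical parameters c = [1, -0.5]
xdata = 0:3;
yfit = myfunc([1, -0.5], xdata);

disp(yfit(1))   % value at xdata = 0, i.e. -0.1 + 1*exp(0)
```

Since myfunc now takes only (c, xdata), it has the two-input form that lsqcurvefit (Optimization Toolbox) expects, for example c = lsqcurvefit(myfunc, c0, xdata, ydata), where c0 is an initial guess you would have to choose yourself.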