Introduction to Matlab & Data Analysis
Final Project: That's all, Folks!
Yuval Hart, Weizmann 2010©

Outline
- Parsing files
- Efficient programming - vectorization
- Correlation coefficients
- Passing extra parameters
- Image plotting
- Curve Fitting & Optimization
- Figure handling

Rotation in 60 minutes
During the past month you've measured the promoter activity (PA) of 20 genes. Your PI wants you to present your results at the next group meeting.

To Do List
- Get the sequences of the genes from GenBank and FASTA files and calculate their GC content
- Display all correlation coefficients of the measured PA and their relation to GC content
- For the highest 4 genes, find how correlation decays with distance from the initial gene in the pathway

GenBank file format

Step 3: Attach every gene name with its DNA sequence
Build the structure with all needed fields:

    % Build the structure Genes with the desired genes and their data:
    % name, startPosition, endPosition, sequence, complement (1/0), GCcontent
    % This is also the way to preallocate for structures:
    Genes(1,sum(indGeneList))=struct('name',[],'complement',[],'sequence',[],...
        'StartPosition',[],'EndPosition',[],'GCcontent',1);
    Genes=struct('name',geneNames(indGeneList),...
        'complement',num2cell(indComplement(indGeneList)'),...
        'StartPosition',CDSpositionStartEndCelled(indGeneList,1)',...
        'EndPosition',CDSpositionStartEndCelled(indGeneList,2)',...
        'sequence',seq,'GCcontent',GCcontent);
    a=Genes;

Note: struct assigns field values to the array elements one by one only when the values are passed as cell arrays.

Calculate and plot Correlation Matrix

    % Load the list of genes and measurements
    % Input: measurement mat file contains:
    % geneList - a cell array of the gene names
    % measurements - a matrix of 20 genes' measurements at 1001 time points
    % GenesGCcontent - a vector of the genes' GC content values
    % measurements has a row for each gene containing its measurements through
    % 1001 time points and the geneList names
    load measurements

Plot GC content and mean PA dependence
Plot the fit results on top of the previous graph.
Note: Smoothing the data can lower the effect of outliers.

Calculate and plot Correlation Matrix
Calculate and display the correlation matrix.

Step 2: Fit correlations to the desired function
Use an anonymous function to pass extra parameters, and fit using lsqcurvefit:

    initDis=-0.1;
    c0=[.7 0.1]; % initial values for the fit search
    paramfunc = @(c,x)FittingCurveExpGuess(c,x,initDis); % definition of the anonymous function
    ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);

Arguments of lsqcurvefit, in order: function (paramfunc), initial guess (c0), X data (XdataPoints), Y data (correl), lower bound ([0 -1]), upper bound ([1 1]).

    function y_hat=FittingCurveExpGuess(c,x,init)
    % This assumes an exponentially decreasing curve
    y_hat=init+c(1)*exp(c(2).*x);

Step 3: Plot the correlation data and fit

Best of Luck in the Group Meeting!

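The three to-do items covered in these slides (GC content, correlation matrix, exponential fit with an extra fixed parameter) can be sketched end to end on toy data. This is only a minimal sketch under stated assumptions: the sequences, the measurements and the names seqs, expModel, sse and cFit are hypothetical stand-ins, not the course files or the course solution. It also swaps lsqcurvefit for the unconstrained fminsearch so that it runs without the Optimization Toolbox, which means the bounds from the slides are not applied.

```matlab
% --- Hypothetical stand-in data (not the course files) ---
seqs = {'ATGCGC', 'AATT', 'GGCC'};       % toy gene sequences
t = 0:5;                                 % toy time points
measurements = [t; 2*t + 1; 5 - t];      % one row per gene

% GC content: fraction of G or C characters in each sequence
gcContent = cellfun(@(s) mean(upper(s) == 'G' | upper(s) == 'C'), seqs);

% Correlation matrix: corrcoef treats columns as variables, so transpose
% the gene-by-time matrix to correlate genes with each other
R = corrcoef(measurements');

% Exponential model with an extra fixed parameter (same form as
% FittingCurveExpGuess in the slides)
expModel = @(c, x, init) init + c(1) * exp(c(2) .* x);

% Synthetic, noise-free "correlation vs. distance" data from known parameters
initDis = -0.1;
xData = 0:5;
yData = expModel([0.7 -0.5], xData, initDis);

% The anonymous function captures initDis and the data, so the optimizer
% only sees the parameter vector c
sse = @(c) sum((expModel(c, xData, initDis) - yData).^2);

% Assumption: fminsearch (unconstrained) stands in for lsqcurvefit here
opts = optimset('TolX', 1e-8, 'TolFun', 1e-10, ...
    'MaxFunEvals', 5000, 'MaxIter', 5000);
cFit = fminsearch(sse, [0.5 -0.1], opts);
```

To display the result as an image, imagesc(R) followed by colorbar is one option. When the Optimization Toolbox is available, lsqcurvefit as used in the slides has the advantage of accepting lower and upper bounds on the parameters.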
This is the end, my friend, the end

"Louis, I think this is the beginning of a beautiful friendship."