Introduction to Matlab & Data analysis

Final Project:
That’s all, Folks!
Yuval Hart, Weizmann 2010©
Outline

- Parsing files
- Efficient programming - vectorization
- Correlation coefficients
- Passing extra parameters
- Image plotting
- Curve Fitting & Optimization
- Figure handling
“Rotation in 60 minutes”
Rotation in 60 minutes:

During the past month you’ve measured promoter activity of 20 genes. Your PI wants you to present your results at the next group meeting.
To Do List

- Get the sequences of the genes from GenBank and FASTA files and calculate their GC content
- Display all correlation coefficients of the measured promoter activity (PA) and their relation to GC content
- For the highest 4 genes, find how the correlation decays with distance from the initial gene in the pathway
GenBank file format
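
In a GenBank feature table, each coding sequence is listed on a CDS line with its start and end positions, wrapped in complement(...) when the gene lies on the reverse strand. The block below is a minimal sketch of extracting those fields with regexp. The sample lines, the gene names geneA and geneB, and the assumption that the /gene qualifier sits on the line right after each CDS line are illustrative and not taken from the course files.

```matlab
% Hypothetical excerpt of a GenBank feature table (not real course data)
sampleLines = { ...
    '     CDS             190..255', ...
    '                     /gene="geneA"', ...
    '     CDS             complement(5683..6459)', ...
    '                     /gene="geneB"'};

% Logical index of the lines that open a CDS feature
isCDS = ~cellfun(@isempty, regexp(sampleLines, '^\s+CDS\s', 'once'));
cdsLines = sampleLines(isCDS);

% true if the gene is on the complementary strand, false otherwise
isComplement = ~cellfun(@isempty, strfind(cdsLines, 'complement('));

% Start and end positions as an N-by-2 numeric matrix
posTokens = regexp(cdsLines, '(\d+)\.\.(\d+)', 'tokens', 'once');
positions = cellfun(@str2double, posTokens, 'UniformOutput', false);
positions = vertcat(positions{:});

% Gene names, assuming /gene="..." is on the line right after each CDS line
nameTokens = regexp(sampleLines(find(isCDS) + 1), '/gene="([^"]+)"', 'tokens', 'once');
geneNames = cellfun(@(t) t{1}, nameTokens, 'UniformOutput', false);
```

Real GenBank files can split a location over several lines or use join(...), so this sketch only covers the simple single-range case.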
Step 3: Attach every gene name with its DNA sequence

Build the structure with all needed fields:

% Build the structure Genes with the desired genes and their data:
% name, StartPosition, EndPosition, sequence, complement (1/0), GCcontent
% This is also the way to preallocate for structures:
Genes(1,sum(indGeneList))=struct('name',[],'complement',[],'sequence',[],...
    'StartPosition',[],'EndPosition',[],'GCcontent',1);
Genes=struct('name',geneNames(indGeneList),...
    'complement',num2cell(indComplement(indGeneList)'),...
    'StartPosition',CDSpositionStartEndCelled(indGeneList,1)',...
    'EndPosition',CDSpositionStartEndCelled(indGeneList,2)',...
    'sequence',seq,'GCcontent',GCcontent);
a=Genes;
Note: struct assigns values one per element of the structure array only when the field values are passed as cell arrays.
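
The role of cell arrays in struct can be seen in a small self-contained sketch. The gene names and sequences below are made-up values, and GC content is taken here to mean the fraction of G or C bases in a sequence.

```matlab
% Hypothetical gene names and sequences (illustrative values only)
geneNames = {'geneA', 'geneB', 'geneC'};
seq = {'ATGCGC', 'ATATAT', 'GGCCAT'};

% GC content = fraction of G or C bases in each sequence
GCcontent = cellfun(@(s) mean(upper(s) == 'G' | upper(s) == 'C'), seq);

% Cell-array field values yield a 1-by-3 struct array, one element per cell;
% num2cell turns the numeric vector into a cell array so it is distributed too
Genes = struct('name', geneNames, 'sequence', seq, ...
    'GCcontent', num2cell(GCcontent));

% A non-cell value is copied into every element instead
Tagged = struct('name', geneNames, 'organism', 'unknown');
```

Passing GCcontent as a plain numeric vector would store the whole vector in every element, which is why num2cell is used.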
Calculate and plot Correlation Matrix

Load the list of genes and measurements:

% Input:
% measurement mat file contains:
% geneList - a cell array of the gene names
% measurements - a matrix of 20 genes' measurements at 1001 time points
% GenesGCcontent - a vector of the genes' GC content values
% measurements has a row for each gene containing its measurements through
% 1001 time points and the geneList names
load measurements
Plot GC content and mean PA dependence

Plot the fit results on top of the previous graph.

Note: Smoothed data can lower the effect of outliers.
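
A minimal sketch of this step on synthetic data follows. The data values, the 51-point moving-average window, and the choice of a straight-line fit with polyfit are assumptions made for illustration, since the slide does not show which smoothing or fit was used. movmean is assumed to be available (base MATLAB R2016a or later).

```matlab
% Synthetic stand-ins for the course data (values are made up)
nGenes = 20;
nTime = 1001;
GenesGCcontent = linspace(0.35, 0.65, nGenes);
t = linspace(0, 1, nTime);
% One row per gene: a level proportional to GC content plus a fast ripple
measurements = (2 * GenesGCcontent') * ones(1, nTime) ...
    + 0.1 * ones(nGenes, 1) * sin(40 * pi * t);

% Moving average along time (dimension 2) to damp short-lived outliers
smoothed = movmean(measurements, 51, 2);
meanPA = mean(smoothed, 2)';

% Straight-line fit of mean PA against GC content
p = polyfit(GenesGCcontent, meanPA, 1);

figure;
plot(GenesGCcontent, meanPA, 'o');
hold on;                                  % keep the data points on the axes
plot(GenesGCcontent, polyval(p, GenesGCcontent), '-');
hold off;
xlabel('GC content');
ylabel('Mean PA');
legend('data', 'linear fit');
```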
Calculate and plot Correlation Matrix

Calculate and display the correlation matrix:
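
One way to do this, sketched on synthetic data, is corrcoef followed by imagesc. The four signals and the gene names below are hypothetical; in the course data, measurements would have one row per gene.

```matlab
% Synthetic stand-in: 4 genes (rows) measured at 200 time points (columns)
t = linspace(0, 2*pi, 200);
measurements = [sin(t); 2*sin(t) + 1; -sin(t); cos(t)];
geneList = {'geneA', 'geneB', 'geneC', 'geneD'};   % hypothetical names

% corrcoef treats columns as variables, so transpose to get gene-by-gene values
R = corrcoef(measurements');

figure;
imagesc(R, [-1 1]);          % fix the color scale to the full correlation range
colorbar;
axis square;
set(gca, 'XTick', 1:numel(geneList), 'XTickLabel', geneList, ...
         'YTick', 1:numel(geneList), 'YTickLabel', geneList);
title('Correlation matrix of promoter activity');
```

Fixing the color limits to [-1 1] keeps the colors comparable between figures, which is a design choice of this sketch rather than something shown on the slide.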
Step 2: Fit correlations to the desired function

Use an anonymous function to pass extra parameters, then fit with lsqcurvefit:

initDis=-0.1;
c0=[.7 0.1]; % initial values for the fit search
paramfunc = @(c,x)FittingCurveExpGuess(c,x,initDis); % anonymous function that fixes initDis
ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);

Arguments of lsqcurvefit, in order: function, initial guess, X data, Y data, lower bound, upper bound, options.

function y_hat=FittingCurveExpGuess(c,x,init)
% This assumes an exponentially decreasing curve
y_hat=init+c(1)*exp(c(2).*x);
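
The same pattern can be tried end to end on synthetic data. In this sketch the model is written as an anonymous function instead of a separate function file, the data are generated noise-free from known parameters, and the options value is an assumption because the slide does not show how options was defined. lsqcurvefit requires the Optimization Toolbox.

```matlab
% Synthetic, noise-free data from known parameters (illustrative only)
initDis = -0.1;                       % fixed offset passed as the extra parameter
trueC = [0.8 -0.3];
XdataPoints = 0:10;
model = @(c, x, init) init + c(1) * exp(c(2) .* x);
correl = model(trueC, XdataPoints, initDis);

% The anonymous function captures initDis, leaving the (c, x) signature
% that lsqcurvefit expects
paramfunc = @(c, x) model(c, x, initDis);

c0 = [0.5 0];                         % initial guess
lb = [0 -1];                          % lower bounds, as on the slide
ub = [1 1];                           % upper bounds, as on the slide
options = optimset('Display', 'off'); % assumed settings, not from the slide
ExpParam = lsqcurvefit(paramfunc, c0, XdataPoints, correl, lb, ub, options);
```

Because initDis is captured when paramfunc is created, changing initDis afterwards does not affect the fit unless paramfunc is defined again.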
14
Step 3: Plot the correlation data and fit
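
A minimal sketch of such a plot, using graphics handles to adjust the fitted line, is given here. The correlation values and fitted parameters are hypothetical, and the axis labels are assumptions based on the task description.

```matlab
% Hypothetical correlation values and fitted parameters (illustrative only)
XdataPoints = 0:10;
initDis = -0.1;
ExpParam = [0.8 -0.3];
correl = initDis + ExpParam(1) * exp(ExpParam(2) .* XdataPoints);
correl(4) = correl(4) + 0.05;         % perturb one point so data and fit differ

% Evaluate the fitted curve on a fine grid for a smooth line
xFine = linspace(min(XdataPoints), max(XdataPoints), 200);
yFit = initDis + ExpParam(1) * exp(ExpParam(2) .* xFine);

hFig = figure;
hData = plot(XdataPoints, correl, 'o');
hold on;
hFit = plot(xFine, yFit, '-');
hold off;
set(hFit, 'LineWidth', 2);            % adjust a property through the line handle
xlabel('Distance from initial gene in pathway');
ylabel('Correlation');
legend([hData hFit], {'data', 'exponential fit'});
```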
Best of Luck in the Group Meeting!
This is the end, my friend, the end
"Louis, I think this is the beginning of a beautiful friendship."