Introduction to Matlab & Data analysis

Tutorial 13:
That’s all, Folks!
Please change to the directory E:\Matlab (cd E:\Matlab;)
From the course website (http://www.weizmann.ac.il/midrasha/courses/MatlabIntro//course_outline.htm), download tFinal.zip.
Yuval Hart, Weizmann 2010©
Outline
Parsing files
Efficient programming - vectorization (profiling)
Correlation coefficients
Passing extra parameters
Image plotting
Curve Fitting & Optimization
Figure handling
“Rotation in 60 minutes”
During the past month you’ve measured the promoter activity of 20 genes. Your PI wants you to present your results at the next group meeting.
To Do List
Get the sequences of the genes from the GenBank and FASTA files and calculate their GC content
Display all correlation coefficients of the measured promoter activity (PA) and its relation to GC content
For the highest 4 genes, find how the correlation decays with distance from the initial gene in the pathway
GenBank file format
Step 1: get data from files
Get the DNA sequence from the FASTA file:
% Extract gene sequences from the fasta file
fid_fasta_data = fopen(fnamefasta,'r');
% check that file was opened correctly
if fid_fasta_data<0
    error('FASTA file name is not correct, please issue file name again');
end
celledFasta=textscan(fid_fasta_data, '%c'); % '%c' reads a single character
fclose(fid_fasta_data);
fasta=celledFasta{1}'; % fasta is a char array of the sequence
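FASTA files commonly start with a header line beginning with '>'. Whether the course file does is not stated on the slide, so the following is only a sketch under that assumption: it writes a small made-up file (the file name and contents are hypothetical) and reads back the sequence while skipping header lines.

```matlab
% Create a hypothetical demo FASTA file (name and contents are made up)
fnameDemo = [tempname '.fasta'];
fid = fopen(fnameDemo,'w');
fprintf(fid,'>demo sequence\nATGC\nGCTA\n');
fclose(fid);

% Read it back line by line, keeping only the sequence lines
fid = fopen(fnameDemo,'r');
if fid < 0
    error('Could not open FASTA file %s', fnameDemo);
end
seqDemo = '';
tline = fgetl(fid);
while ischar(tline)              % fgetl returns -1 at end of file
    % skip empty lines and header lines (those starting with '>')
    if ~isempty(tline) && tline(1) ~= '>'
        seqDemo = [seqDemo strtrim(tline)]; %#ok<AGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fnameDemo);               % clean up the demo file
```

For a sequence-only file, the single textscan call on the slide is shorter; the loop is useful mainly when header lines must not end up in the sequence.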
Read the entire file into a cell array, one file row per cell:
% open the gene file
fid_gene_input=fopen(fnamegene,'r');
% check that file was opened correctly
if fid_gene_input<0
    error('GenBank File name is not correct, please issue file name again');
end
% parse the file such that every row of the file is inside a cell element
celledData=textscan(fid_gene_input,'%s','delimiter','\n');
fclose(fid_gene_input);
% remove all white spaces from beginning and end of rows
celledDataTrim=strtrim(celledData{1});
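To see what the row-wise parsing produces without needing the course files, textscan can also be given a character array instead of a file identifier. This is a small sketch on made-up text (demoText is a hypothetical stand-in for a GenBank file):

```matlab
% Three made-up rows, separated by newlines, with leading white space
demoText = sprintf('LOCUS       demo\n     CDS             1..5\n     /gene="abc"');
% one cell element per row, as in the file-based call
rowsDemo = textscan(demoText,'%s','delimiter','\n');
% textscan wraps its output in a cell; take its content and trim the rows
rowsDemo = strtrim(rowsDemo{1});
```

After this, rowsDemo is a column cell array of strings, which is the shape the regexp and cellfun calls in the next step expect.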
Step 2: Get gene names and sequence positions from GenBank data
Get the gene names and sequence locations from the GenBank file:
% line format is 'CDS   pos1..pos2' or 'CDS   complement(pos1..pos2)'
% so CDSposition holds the tokens (pos1 and pos2) from each matching row
CDSposition=regexp(celledDataTrim,'^CDS\s+(?:complement\()*(\d+)\.\.(\d+)','tokens');
% indCDS marks all occurrences of CDS
indCDS=~cellfun('isempty',CDSposition);
% gene name is one row below the CDS info, so shift the index down by one row
indGene=circshift(indCDS,[1 1]);
% since the pattern was already matched, only need to check whether there is a
% complement or not. indComplement indicates if it is a complement or
% regular sequence
indComplement=~cellfun('isempty',regexp(celledDataTrim(indCDS),'complement'));
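The token search and the index shift can be tried on a few made-up rows. The sketch below uses hypothetical data (rowsDemo) in the trimmed format described above, with the same regular expression as the slide:

```matlab
% Hypothetical trimmed GenBank rows
rowsDemo = {'gene            190..255'
            'CDS             190..255'
            '/gene="thrL"'
            'CDS             complement(337..2799)'
            '/gene="thrA"'};
% tokens are the two captured groups: start and end position
tokDemo = regexp(rowsDemo,'^CDS\s+(?:complement\()*(\d+)\.\.(\d+)','tokens');
isCDS = ~cellfun('isempty',tokDemo);     % rows that matched the CDS pattern
isGeneRow = circshift(isCDS,1);          % the row right below each CDS row
isComplement = ~cellfun('isempty',regexp(rowsDemo(isCDS),'complement'));
firstStart = str2double(tokDemo{2}{1}{1});  % start position of the first CDS
```

Note that circshift wraps around: if the last row of a file were a CDS row, the first row would be flagged as a gene row.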
% List of genes corresponding to the CDS found
geneNames=regexp(celledDataTrim(indGene),'gene="(\w+)"','tokens');
% flatten the nested token cells twice to get a cell array of names
geneNames=[geneNames{:}];
geneNames=[geneNames{:}];
% Consider only cell elements that had 'CDS' in them
onlyCDSposition=CDSposition(indCDS);
% Flatten the tokens cell array such that onlyCDSposition will have odd
% elements as position 1 (start of gene) and even elements as position 2
% (end of gene)
onlyCDSposition=[onlyCDSposition{:}];
% concatenates as two columns and not in a single row
% (try cat(2,onlyCDSposition{:}))
CDSpositionStartEndCelled=cat(1,onlyCDSposition{:});
CDSpositionStartEndNum=cellfun(@str2num,CDSpositionStartEndCelled);
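The difference between cat(1,...) and cat(2,...) when flattening the tokens is easiest to see on a tiny example. This sketch uses made-up tokens (tokDemo) with the same nesting that regexp returns; it converts with str2double, which is a substitution for the slide's cellfun(@str2num,...) and accepts a cell array directly.

```matlab
% Two hypothetical matches, nested the way regexp(...,'tokens') returns them
tokDemo = {{{'190','255'}}; {{'337','2799'}}};
flatDemo = [tokDemo{:}];          % 1x2 cell, each element a 1x2 cell of strings
byRows = cat(1,flatDemo{:});      % 2x2 cell: one row per gene (start, end)
byCols = cat(2,flatDemo{:});      % 1x4 cell: everything in a single row
posDemo = str2double(byRows);     % numeric matrix of start/end positions
```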
Get the index of only the genes we are interested in (found in genePool):
% indGeneList specifies all occurrences of genes in the file that are in the
% "pool"/desired list
indGeneList=ismember(geneNames,genePool);
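As a quick illustration of the mask that ismember returns for cell arrays of strings, here is a sketch with made-up names (namesDemo and poolDemo are hypothetical):

```matlab
namesDemo = {'thrL','thrA','thrB','thrC'};   % genes found in the file
poolDemo = {'thrA','thrC','lacZ'};           % genes we are interested in
inPool = ismember(namesDemo,poolDemo);       % logical mask over namesDemo
selected = namesDemo(inPool);                % keep only the pooled genes
```

Names in the pool that do not appear in the file (here 'lacZ') are silently ignored, so it can be worth comparing sum(inPool) with the pool size.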
Step 3: Attach every gene name with its DNA sequence
Use indices to build an array of gene sequences and calculate GC content:
% Initialize gene list index
j=0;
% Note: i is the index of the vector searched (serial number of the gene in
% the genBank list), j is the index of the specified genes, e.g. there could
% be only 2 genes but their serial numbers in the genBank file are 151 and 352,
% therefore i=[151 352] but j=[1 2]
seq=cell(1,sum(indGeneList));
GCcontent=cell(1,sum(indGeneList));
for i=find(indGeneList==1)
    j=j+1;
    % get the sequence from the fasta data by the start and end positions
    seq{j}=fasta(CDSpositionStartEndNum(i,1):CDSpositionStartEndNum(i,2));
    % GCcontent is the fraction of G or C in the sequence
    GCcontent{j}=length(regexp(seq{j},'[GC]'))/length(seq{j});
end
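Since vectorization is on the outline, here is a hedged sketch of the same GC calculation done with logical comparisons instead of regexp. The sequences are made up, and upper-case letters are assumed, as in the '[GC]' pattern above.

```matlab
seqDemo = 'ATGCGCTA';
% the regexp version from the slide: count matches, divide by length
gcRegexp = length(regexp(seqDemo,'[GC]'))/length(seqDemo);
% vectorized version: the mean of a logical mask is the fraction of hits
gcVector = mean(seqDemo=='G' | seqDemo=='C');
% applied to several sequences at once with cellfun
seqsDemo = {'ATGC','GGCC','ATAT'};
gcAll = cellfun(@(s) mean(s=='G' | s=='C'), seqsDemo);
```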
Build the structure with all needed fields:
% Build the structure Genes with the desired genes and their data:
% name, startPosition, endPosition, sequence, complement (1/0), GCcontent
% This is also the way to preallocate for structures:
Genes(1,sum(indGeneList))=struct( 'name', [], 'complement', [], 'sequence',[],...
'StartPosition',[],'EndPosition',[],'GCcontent',1);
Genes=struct('name',geneNames(indGeneList),...
'complement', num2cell(indComplement(indGeneList)'),...
'StartPosition',CDSpositionStartEndCelled(indGeneList,1)',...
'EndPosition',CDSpositionStartEndCelled(indGeneList,2)',...
'sequence',seq,'GCcontent',GCcontent);
a=Genes;
Note: struct assigns the values one by one, one per structure element, only when they are passed as cell arrays.
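The note about cell arrays can be checked on a small made-up example (all names and values below are hypothetical): a cell array argument is spread over the elements of a struct array, while a cell wrapped in an extra pair of braces is stored whole in a single structure.

```matlab
namesDemo = {'thrL','thrA'};
gcDemo = {0.52, 0.47};
% cell arrays of equal size: one structure element per cell
GenesDemo = struct('name',namesDemo,'GCcontent',gcDemo);   % 1x2 struct array
% extra braces: a single structure whose field holds the whole cell array
oneStruct = struct('name',{namesDemo});                    % 1x1 struct
```

This is why numeric vectors such as indComplement are passed through num2cell on the slide before being handed to struct.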
Calculate and plot Correlation Matrix
Load the list of genes and measurements:
% Input:
% measurement mat file contains:
% geneList - a cell array of the gene names
% measurements - a matrix of 20 genes measurements at 1001 time points
% GenesGCcontent - a vector of the genes GCcontent values
% measurements has a row for each gene containing its measurements through
% 1001 time points; geneList holds the gene names
load measurements
Plot GC content and mean PA dependence
Plot mean PA vs. GC content with the correlation coefficient:
figure(1);
corrGCvsPA=corrcoef(ScaledGCcontent,MeanPA);
plot(ScaledGCcontent,MeanPA,'or','MarkerSize',8,'LineWidth',2);
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set the plot to full screen
title(sprintf('Mean Promoter Activity vs. GCcontent, Correlation is %2.4f',...
corrGCvsPA(1,2)),'FontSize',14);
xlabel('Scaled GC content [% deviation from 0.5]','FontSize',14);
ylabel('Mean Promoter Activity [a.u.]','FontSize',14);
hold on;
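For two vectors, corrcoef returns a 2x2 matrix, which is why the title above reads element (1,2). The sketch below uses made-up data (xDemo, yDemo) and compares the result with the Pearson formula written out by hand:

```matlab
xDemo = 1:5;
yDemo = [1 3 2 5 4];
R = corrcoef(xDemo,yDemo);        % 2x2 matrix, ones on the diagonal
% Pearson correlation written out explicitly
dx = xDemo - mean(xDemo);
dy = yDemo - mean(yDemo);
rManual = sum(dx.*dy) / sqrt(sum(dx.^2)*sum(dy.^2));
```

For this data both R(1,2) and rManual come out as 0.8.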
Plot fit results upon the previous graph:
% Check for a linear fit to the curve
fittedfunc=polyfit(ScaledGCcontent,MeanPA',1);
plot(ScaledGCcontent,polyval(fittedfunc,ScaledGCcontent),'r','LineWidth',2);
% Smooth the data with robust smoothing ('rloess') and then fit to a polynomial:
SmoothPA=smooth(ScaledGCcontent,MeanPA,0.25,'rloess');
% plot the smoothed data set
plot(ScaledGCcontent,SmoothPA,'ob','MarkerSize',8,'LineWidth',2);
Smofittedfunc=polyfit(ScaledGCcontent,SmoothPA',1);
plot(ScaledGCcontent,polyval(Smofittedfunc,ScaledGCcontent),'b','LineWidth',2);
text(0.05,2.1,['\leftarrow', sprintf('y= %2.2f x+%2.2f', fittedfunc(1),fittedfunc(2))], ...
'HorizontalAlignment','left','FontSize',18,'Color',[1 0 0]); %See text properties
text(-0.11,4,['\leftarrow',sprintf('y= %2.2f x+%2.2f',...
Smofittedfunc(1),Smofittedfunc(2))], 'HorizontalAlignment','left','FontSize',18,...
'Color',[0 0 1]); %See text properties
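The polyfit/polyval pair can be tried on made-up data where the answer is known in advance. The sketch below also adds one artificial outlier to show how much a single point can pull a plain least-squares line, which is the motivation for smoothing first. It does not call smooth, which belongs to the Curve Fitting Toolbox; all variable names are hypothetical.

```matlab
xDemo = 0:4;
yDemo = 2*xDemo + 1;               % exact line: slope 2, intercept 1
pDemo = polyfit(xDemo,yDemo,1);    % returns [slope intercept]
yFit = polyval(pDemo,xDemo);       % evaluate the fitted line
labelDemo = sprintf('y= %2.2f x+%2.2f',pDemo(1),pDemo(2));

% the same data with one artificial outlier in the last point
yOut = yDemo;
yOut(end) = 20;
pOut = polyfit(xDemo,yOut,1);      % slope is pulled up by the outlier
```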
Note: Smoothed data can lower the effect of outliers.
Calculate and plot Correlation Matrix
Calculate and display the correlation matrix:
figure(2);
% note that corrcoef works on columns so we need to transpose measurements
% calculate the correlation matrix of all genes measurements
corrMat=corrcoef(measurements');
colormap('hot'); % set color scheme, popular choices are also: 'jet','hsv'
imagesc(corrMat); % creates the image, data is scaled to the full range of the colormap
colorbar; % also plots the color bar in the figure
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set the plot to full screen
% set the ticks to be the gene names and present them at the top of the figure
set(gca,'XTick',1:20,'XTickLabel',geneList,'FontSize',12,'XAxisLocation','top')
% set the ticks to be the gene names
set(gca,'YTick',1:20,'YTickLabel',geneList,'FontSize',12)
title('Gene correlations','FontSize',16);
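Why the transpose is needed can be seen on a tiny made-up matrix with one row per gene (measDemo is hypothetical). corrcoef treats columns as variables, so the rows must become columns before the call:

```matlab
% 3 hypothetical genes (rows) measured at 5 time points (columns)
measDemo = [1 2 3 4 5;
            2 4 6 8 10;
            5 4 3 2 1];
corrDemo = corrcoef(measDemo');    % 3x3: one row and column per gene
% imagesc(corrDemo); colorbar;     % uncomment to display as an image
```

Here gene 2 is a multiple of gene 1 (correlation 1) and gene 3 runs in the opposite direction (correlation -1). Without the transpose the result would be a 5x5 matrix over time points instead.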
Step 1: initialize and set parameters
Set figure parameters and external fit parameters of the curves:
figure(3);
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set plot to full screen
% want to check if a vertical displacement helps, so added variable: initDis
% which is part of the fitting function formula
initDis=-0.1;
GenesAmount=size(measurements,1);
Step 2: Fit correlations to the desired function
Using an anonymous function to pass extra parameters, and fitting with lsqcurvefit:
for i=1:numGenesToPlot
    % assign the current correlation matrix values,
    % from row i and columns after the diagonal
    correl=corrMat(i,(1+i):end);
    % definition of the anonymous function: the function passed to lsqcurvefit
    % can have only two inputs, yet we use three: fitting parameters,
    % x values and initial displacement
    paramfunc = @(c,x)FittingCurveExpGuess(c,x,initDis);
    c0=[.7 0.1]; % assigning the initial values for the fit search
    XdataPoints=(1+i):GenesAmount;
    options = optimset('TolFun',1e-8,'GradObj','on'); % default=1e-6
    % lsqcurvefit(function name,init guess,xdata,ydata,lower bound,upper bound,options)
    ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);
end
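One detail of passing extra parameters this way is that the anonymous function captures the value of the extra variable at the moment it is created. The sketch below illustrates this with a stand-in model written as an anonymous function (modelDemo is hypothetical and only mimics the exponential form of FittingCurveExpGuess):

```matlab
% stand-in for a three-input model function
modelDemo = @(c,x,init) init + c(1)*exp(c(2).*x);
initDisDemo = -0.1;
% two-input wrapper: initDisDemo is captured here, at creation time
paramDemo = @(c,x) modelDemo(c,x,initDisDemo);
before = paramDemo([1 0],0);    % -0.1 + 1*exp(0)
initDisDemo = 5;                % changing the variable afterwards...
after = paramDemo([1 0],0);     % ...does not change the captured value
```

So if the displacement is changed between fits, the wrapper has to be defined again, as the loop above does on every iteration.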
The arguments of the lsqcurvefit call are, in order: function name, initial guess, X data, Y data, lower bound, upper bound, options:
initDis=-0.1;
c0=[.7 0.1]; % assigning the initial values for the fit search
paramfunc = @(c,x)FittingCurveExpGuess(c,x,initDis); % def. of the anonymous function
ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);
The fitted function:
function y_hat=FittingCurveExpGuess(c,x,init)
% This assumes an exponential decreasing curve
y_hat=init+c(1)*exp(c(2).*x);
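lsqcurvefit requires the Optimization Toolbox. Where it is not available, a rough alternative is to minimize the sum of squared errors with fminsearch, which ships with base MATLAB. This is a different technique from the one on the slide (no bounds, simplex search), shown here only as a sketch on synthetic, noise-free data with made-up parameter values:

```matlab
initDisDemo = -0.1;
% same exponential form as FittingCurveExpGuess, with the displacement captured
modelDemo = @(c,x) initDisDemo + c(1)*exp(c(2).*x);
xDemo = 1:10;
yDemo = modelDemo([0.7 -0.3],xDemo);             % synthetic data, known parameters
% objective: sum of squared differences between model and data
sse = @(c) sum((modelDemo(c,xDemo) - yDemo).^2);
cFit = fminsearch(sse,[0.5 -0.1]);               % start from a rough initial guess
```

With noise-free data cFit should land close to the generating values [0.7 -0.3]; with real measurements the result depends more strongly on the initial guess.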
Step 3: Plot the correlation data and fit
Plotting with dots, each subplot with its own gene names and curve-fit parameters:
for i=1:numGenesToPlot
    % missing parts on previous slides...
    % Plotting the correlation graph with the found parameters:
    subplot(numGenesToPlot,1,i);
    plot(XdataPoints,correl,'ob',...
    XdataPoints,initDis+ExpParam(1)*exp((XdataPoints).*ExpParam(2)),'r','LineWidth',2);
    set(gca,'XTick',XdataPoints,'XTickLabel',geneList(XdataPoints),'FontSize',12);
    set(gca,'YLim',[0 max(correl)+0.1]);
    title(sprintf(['%s Correlation Data, Fit parameters: c1=%2.2f , c2=%2.2f, ',...
    'Displacement=%2.2f '],geneList{i},ExpParam(1),ExpParam(2),initDis),'FontSize',14);
end
Best of Luck in the Group Meeting! (and exam)
What did we learn?
Matlab syntax
Array manipulation, Cells, Structures
Programming:
Functions
Writing efficient code
Files & strings manipulation
Data analysis and Signal Processing
This is the end, my friend, the end
"Louis, I think this is the beginning of a beautiful friendship."