CkHybridizationArray - Laboratory of Computational Biology

Project in Bioinformatics:
Applying Ck Hybridization Arrays in Expression Profiling – a Theoretical Study
By Yoav Pinsky (I.D. 031986789) and Idan Rubin (I.D. 034039107)
May 2003
Table of Contents
1 Introduction
1.1 Basic terms
1.2 The Vision
1.3 Goal of project
1.4 Theoretical Idea
2 Description of project
2.1 Work stages
2.1.1 Stage I – Synthetic data, binary signatures
2.1.1.1 Data sets
2.1.1.2 Hybridization signature calculation
2.1.1.3 Noise modeling
2.1.1.4 Pseudo-inverse
2.1.1.5 Solution quality
2.1.2 Stage II – Yeast sequences data, quantitative signature
2.1.2.1 Data sets
2.1.2.2 Hybridization signature calculation
3 Results
3.1 Synthetic (Random) data
3.2 Yeast data
3.2.1 Data set
3.2.2 Building the signature matrix
3.2.3 Calculating the pseudo-inverse of each matrix
3.2.4 Performing the experiments
4 Analyzing the results
4.1.1 SQ1
4.1.2 SQ2
4.1.3 SQ3
4.1.4 SQ4
5 Dealing with under-determined matrices
5.1 The problem
5.2 The solution
References
Appendix A - Work Stages summary
Appendix B – Future work
Appendix C – Hardware & Software specifications
Appendix D – CD with C code, MATLAB scripts, graphs and data
1 Introduction
In any living cell that undergoes a biological process, different subsets of the
total set of genes encoded in the organism’s genome are expressed in different
stages of the process. The particular subset expressed at a given stage and its
quantitative composition is of extreme importance. Being able to measure subsets
of genes that express themselves in different stages, different cells, and different
organisms is instrumental in understanding biological processes. Such information
can help the characterization of sequence-to-function relationship and the
determination of effects (and side effects) of experimental treatment. The most
successful and most widely used techniques for measuring expression profiles
utilize specifically designed surface-bound probes in an assay based on
hybridization arrays. One example of an existing generic method that doesn’t
require prior determination of the RNA to be measured is SAGE (ref. 1).
In this work we study theoretical and feasibility aspects of a generic micro-array
based approach to expression profiling, from the computational point of view.
1.1 Basic terms
Gene expression profiling assay – is an assay that measures the expression
levels of a set of genes in a mixture.

Hybridization array – is a set of oligonucleotides immobilized on a surface.
When a fluorescent-labeled target DNA (or RNA) mixture is introduced to
such an array, the resulting fluorescence pattern is indicative of the mixture’s
content.

Generic hybridization array – as opposed to a specific or custom array, which is
used by today's applications, is a hybridization array that contains probes that
do not depend on the assay's specific target. For example, a Ck array is an
array that contains all 4^k k-mers (DNA/RNA oligonucleotides of length k). In
this work we focus on Ck arrays, and when we write hybridization array, we
refer to a Ck generic array, unless otherwise indicated.

Hybridization signature – The fluorescence-pattern induced by introducing a
solution of RNA/DNA sequences to the generic hybridization array.

Hybridization signature matrix – a matrix of some hybridization signatures.
Each column in the matrix is a hybridization signature of one RNA sequence.

Concentration vector - a vector containing the concentrations of each type of
RNA sequence in a mixture.

Melting curve – a graph / function that indicates the annealing factor of two
specific RNA / DNA strands for any given temperature.
1.2 The Vision
Given a mixture of many different RNA strands (with known sequences), we
want to determine the expression levels of each sequence in the mixture, using a
generic array based hybridization assay, and our knowledge of the hybridization
signatures of each component of the mixture.
1.3 Goal of project
In this project we study the theoretical feasibility of applying generic array
designs to expression profiling. We examine the following question: what is the
quantitative effect of the noise variance on the hybridization array's performance?
To be more specific: how large can random (Gaussian) noise in the fluorescence
pattern get, and still be tolerated by a generic hybridization array? (Tolerated noise
here means that the array still yields the right answer, with high probability,
measured according to some reasonable probability measures on the input space.)
1.4 Theoretical Idea
Consider a mixture of known RNA sequences. We try to determine the expression
levels of each RNA molecule in the mixture by performing the following:
1. We get the hybridization signature of the mixture, b, by performing a simple
hybridization assay.
2. We calculate the hybridization signature of each RNA sequence that might be
present in the mixture, based on its sequence, and construct a hybridization
matrix A. Each column in the matrix is a hybridization signature of one
sequence.
3. To find the concentration vector, we use the pseudo-inverse of the
hybridization matrix, and find the vector, x, that gives us the best approximate
solution to the equation system: Ax = b
This will work if the hybridization signature is linear in the relative concentration
of the different RNA molecules in the mixture. In reality this is not the case, but
we assume that it is approximately linear. If we had an "ideal" system, under the
linearity assumption, and the matrix A had full column rank (the counterpart of
being non-singular for a non-square matrix), the process described above would
give us the exact and unique concentration vector x.
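The idea of step 3 can be sketched on a toy system. The following is a minimal illustration, assuming the linear model Ax = b and a matrix with full column rank, in which case the pseudo-inverse solution coincides with the solution of the normal equations (AᵀA)x = Aᵀb. The helper names and the 3 x 2 matrix are hypothetical and are not taken from the project code, which used MATLAB's pinv.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(m, rhs):
    """Solve the square system m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def least_squares(a, b):
    """Least-squares solution of a x = b via the normal equations (a^T a) x = a^T b."""
    at = transpose(a)
    ata = matmul(at, a)
    atb = [sum(x * y for x, y in zip(row, b)) for row in at]
    return solve(ata, atb)

# Toy signature matrix: 3 array cells (rows), 2 sequences (columns).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x_true = [0.3, 0.7]                      # "real" concentrations
b = [sum(a_ij * x_j for a_ij, x_j in zip(row, x_true)) for row in A]
x_est = least_squares(A, b)              # recovers x_true when b carries no noise
```

The normal equations are used here only to keep the sketch dependency-free; for large or ill-conditioned matrices an SVD-based pseudo-inverse is the more robust choice.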
Unfortunately we have some factors that can cause an error in our results:
1. The accuracy of the calculated hybridization signature of each gene.
2. The accuracy of our instruments - when we measure the hybridization
signature of the mixture.
3. The hybridization kinetics of each sequence in the mixture can be slightly
different than the hybridization in a pure solution.
We treat all these factors as noise, and want to find out how this noise affects
the accuracy of our calculated concentration vector.
2 Description of project
We start by describing the general flow that we used for finding the relationship
between the standard deviation of the noise and the quality of the calculated
concentration vector:
- We choose a set of genes and compute the hybridization signature of each
member. To compute this signature we use a simple approximate thermodynamic
model.
- We randomly draw a mixture of genes (a concentration vector) in a uniform
(in [0, 1]) and independent manner and compute the corresponding linear
combination of the signatures of the pure analytes (mRNA molecules). Under the
linearity assumption, in the absence of experimental error and under the
assumed thermodynamic model, this should represent the hybridization
signature of the mixture.
- We add randomly drawn noise and try to solve (in the least square error
sense) the resulting equation.
- We iterate over various noise variances and study the quality of the solution as
a function of the noise variance.
2.1 Work stages
2.1.1 Stage I – Synthetic data, binary signatures
2.1.1.1 Data sets
As the RNA components of the simulated mixture we used sets of 500 random
sequences of random length from 500 to 1000 bases.
2.1.1.2 Hybridization signature calculation
We started with a very simple (and unrealistic) model, placing 1 in the array cell if
there was a perfect match between the oligonucleotide in the cell and the sequence
(meaning the sequence contains the oligonucleotide), and 0 otherwise (the
sequence does not contain the oligonucleotide). In stage II we study a more
quantitative model.
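The binary model can be sketched in a few lines. This is an illustration only, assuming (as in the text) that a cell "matches" when the sequence contains the cell's oligonucleotide, and assuming the cells are ordered lexicographically over A, C, G, T. The function name and the cell ordering are hypothetical choices, not the layout used by the project's C program.

```python
from itertools import product

def binary_signature(sequence, k):
    """Return a 0/1 list with one entry per k-mer cell of a Ck array."""
    # Collect every k-mer that occurs somewhere in the sequence.
    present = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    # One cell per possible k-mer, 4**k cells in total.
    return [1 if "".join(p) in present else 0 for p in product("ACGT", repeat=k)]

sig = binary_signature("ACGTAC", 2)   # contains AC, CG, GT and TA
```

For k = 2 the signature has 16 cells, of which 4 are set for this toy sequence.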
2.1.1.3 Noise modeling
We modeled the noise in 2 different ways:
1. Additive noise - By adding a noise factor to each cell in the "ideal" signature
of the mixture (calculated by multiplying the hybridization matrix by a
concentration vector). The noise factor we used is a random variable X with
normal distribution, $X \sim \mathrm{Normal}(0, \sigma^2)$.
2. Multiplicative noise - By multiplying each cell of the "ideal" signature of the
mixture by $e^X$, where X is again a random variable with normal distribution,
$X \sim \mathrm{Normal}(0, \sigma^2)$.
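The two noise models can be written down directly. The sketch below assumes the signature is a plain list of cell values; the function names, the seed and the toy values are illustrative and do not come from the project's MATLAB scripts.

```python
import math
import random

def additive_noise(signature, sigma, rng):
    """Add an independent Normal(0, sigma^2) draw to every cell."""
    return [s + rng.gauss(0.0, sigma) for s in signature]

def multiplicative_noise(signature, sigma, rng):
    """Multiply every cell by e^X, with X ~ Normal(0, sigma^2) drawn per cell."""
    return [s * math.exp(rng.gauss(0.0, sigma)) for s in signature]

rng = random.Random(0)            # fixed seed, only to make the sketch repeatable
ideal = [0.0, 0.5, 1.0, 2.0]      # toy "ideal" signature of a mixture
noisy_add = additive_noise(ideal, 0.01, rng)
noisy_mul = multiplicative_noise(ideal, 0.01, rng)
```

One practical difference between the two models: multiplicative noise leaves empty cells at exactly zero and never produces negative intensities, while additive noise can do both.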
2.1.1.4 Pseudo-inverse
We used MATLAB's pinv function to calculate the pseudo-inverse of the
hybridization matrix.
2.1.1.5 Solution quality
We used three different ways to measure the difference between the calculated
concentration vector and the "real" concentration vector (unknown in a real
experiment, randomly generated by us in our simulations).
1. Euclidean: $\sum_i (\mathrm{result}_i - \mathrm{real}_i)^2$
This is the most popular measure, but it gives much more weight to large errors
and does not take the original concentration into account (it sums squared
absolute errors). We used the pseudo-inverse matrix to calculate the original
concentration vector back from the signature of the mixture. The pseudo-inverse
gives us the optimal solution vector in the Euclidean metric: the concentration
vector we get is the one that gives a hybridization signature (as obtained from
multiplying by A) with the smallest Euclidean distance from the signature we
started from (with the added noise).
2. Log division: $\sum_i |\log(\mathrm{result}_i / \mathrm{real}_i)|$ (if result_i is negative we treat it as zero)
This gives equal weight to errors in high and low concentration sequences. This
function measures multiplicative rather than additive error.
3. Relative error: $\sum_i \left((\mathrm{result}_i - \mathrm{real}_i) / \mathrm{real}_i\right)^2$
This gives equal weight to errors in high and low concentration sequences. It
measures the error relative to the original concentration.
2.1.2 Stage II – Yeast sequences data, quantitative signature
2.1.2.1 Data sets
We used real gene sequences (Yeast ORF sequences) taken from the Yeast
genome database.
SGD ftp site:
ftp://genome-ftp.stanford.edu/pub/yeast/
SGD web site:
http://www.yeastgenome.org/
2.1.2.2 Hybridization signature calculation
We used a temperature-dependent hybridization model. This model is not
completely realistic, but a rough approximation designed to understand the
phenomenon in general but not the details:
In case of mismatch - we write 0 (meaning no hybridization).
In case of match - we write the result of the following function:
$f(t) = \dfrac{1}{1 + e^{a(t - T_m)}}$
(this is the hybridization function listed in Appendix A).
f(t) is the hybridization affinity of two sequences at temperature t.
That means that in a mixture of two complementary sequences at temperature t,
we expect a fraction f(t) of the strands to be in the double-strand form, and the
rest in a single-strand form.
The parameter Tm is defined to be the temperature where 50% of the RNA/DNA
strands are connected to the complementary oligonucleotides in the array cell, and
the parameter a controls the slope of the function. This function is 1 at -∞,
0.5 at Tm and 0 at +∞.
In the figure we can see the melting curves of some pairs of sequences.
[Figure: Melting curves for some 5-mers. Horizontal axis: Temperature (5 to 25).
Vertical axis: Percent of hybrid strands (0 to 1). Curves: TTTTT-AAAAA (Tm=10),
AAACT-TTTGA (Tm=12), ACACA-TGTGT (Tm=14), CCAAA-GGTTT (Tm=16),
GGGGT-CCCCA (Tm=18), GGGGC-CCCCG (Tm=20).]
If we perform a hybridization experiment with a mixture of 5-mers at a
temperature of 15 degrees (dashed line), for example, each pair of 5-mers will
connect according to the value where its melting curve crosses the 15 degrees
line. The difference in the melting curves arises from the different Tm each pair
of strands has.
To determine the Tm parameter we used the following simplistic rough estimate:
Tm(seq) = 4*[number of C/G bases] + 2*[number of A/T bases]
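As a rough sketch, the Tm estimate and the melting curve can be combined as follows. The functional form is the one listed in Appendix A, 1 / (exp(a(x-Tm))+1); the function names and the default slope a = 1.0 are assumptions made for illustration, since the report does not state the value of a that was used.

```python
import math

def melting_temperature(oligo):
    """Rough estimate from the text: 4 per C/G base plus 2 per A/T base."""
    gc = sum(base in "CG" for base in oligo)
    at = sum(base in "AT" for base in oligo)
    return 4 * gc + 2 * at

def hybridization_affinity(t, tm, a=1.0):
    """Fraction of strands expected in double-strand form at temperature t.

    a is the slope parameter; 1.0 is an arbitrary illustrative value.
    """
    return 1.0 / (1.0 + math.exp(a * (t - tm)))

tm_weak = melting_temperature("TTTTT")      # 5 A/T bases
tm_strong = melting_temperature("GGGGC")    # 5 C/G bases
low = hybridization_affinity(15, tm_weak)   # below one half: 15 is above Tm
high = hybridization_affinity(15, tm_strong)  # above one half: 15 is below Tm
```

At a fixed assay temperature, probes with a higher Tm therefore contribute a larger value to the signature than probes with a lower Tm.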
3 Results
3.1 Synthetic (Random) data
Figure 3.1 below shows the relation between the standard deviation (STD) of the
noise random variable X, and the error in the calculated concentration vector.
This experiment was done on 500 random genes of random lengths (500-1000
bases).
The hybridization signature was calculated with the melting curve function at a
temperature of 14 degrees.
The noise function is the exponential one. The quality of the solution is assessed
by least squares, normalized by number of genes (see equation 6.1).
Each point in the graph is a result of 30 experiments with random noise and
random concentration vectors. The top line is the maximum error out of the 30
experiments, the bottom line is the minimum error and the blue line is the average
error.
The average line is approximately linear as depicted, and we will use this fact later
in our experiments.
[Figure 3.1: Error (axis scale x 10^-4) versus exp(STD of noise): average, linear
fit for average, max and min. Data: 500 random sequences, each one 500-1000
bases long. Signature matrix: melting curve function at 14 degrees. Additive
noise STD: 0.001, 0.003, 0.005, …, 0.1. 30 repeats for each STD value.]
3.2 Yeast data
The final stage of our project was to try and use the methods we developed on
actual genomic data sequences.
3.2.1 Data set
We decided to use the Yeast ORF database as a model for our experiments
because it is widely used by many researchers, it is small enough for our
computing resources (about 6000 ORFs), and it is used by biologists as a model
for the Human genome.
We downloaded the Yeast ORF sequences from the SGD ftp site. We wrote a
simple parser and used it to convert the data from the original FASTA format to a
plain text format: each sequence in a single line.
3.2.2 Building the signature matrix
We wrote a C program to calculate the hybridization signature of the sequences.
This program reads the sequences from the text file one by one, and calculates the
signature of each sequence for a defined Ck array at a given temperature.
The program is generic and can use any user-defined function to calculate the
expected hybridization in each cell of the Ck array. This user-defined function gets
two DNA strands as C-style strings, and returns a number between 0 and 1, which
reflects the percent of hybridization between the two strands.
The output of the signature program is a text file that contains the hybridization
signature matrix earlier referred to as A.
We used the generic signature program to get the hybridization matrices of 6-mers
array and 7-mers array, using both the zero-one hybridization model and the
melting-curve hybridization model in two different temperatures.
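The structure of such a generic signature builder can be sketched as below. This is a Python illustration of the idea, not the project's C program: the function names are hypothetical, the cells are assumed to be ordered lexicographically over A, C, G, T, and the plug-in function shown is the zero-one model only.

```python
from itertools import product

def zero_one(probe, sequence):
    """User-supplied model: 1.0 if the sequence contains the probe, else 0.0."""
    return 1.0 if probe in sequence else 0.0

def signature(sequence, k, affinity):
    """One value per cell of the Ck array, computed by the plug-in affinity function."""
    return [affinity("".join(p), sequence) for p in product("ACGT", repeat=k)]

def signature_matrix(sequences, k, affinity):
    """Rows are array cells and columns are sequences (the matrix called A in the text)."""
    columns = [signature(seq, k, affinity) for seq in sequences]
    return [list(row) for row in zip(*columns)]

A = signature_matrix(["ACGT", "TTTT"], 2, zero_one)   # 16 cells x 2 sequences
```

Swapping `zero_one` for a temperature-dependent function changes the model without touching the code that walks over the array cells, which is the design the text describes.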
3.2.3 Calculating the pseudo-inverse of each matrix
We loaded each signature matrix in MATLAB and calculated the pseudo-inverse
using the built-in pinv function.
To avoid redundant columns in the matrix (duplicates of the same signature),
caused by duplicated genes in the DB, we used MATLAB's built-in unique
function before calculating the pinv.
At this stage we encountered a space complexity problem - MATLAB could not
perform the pinv function on our 7-mer signature matrix. We used a
Pentium-4m computer with 512 MB of RAM, but the size of such a matrix is
4^7 x genes = 16384 x 6000 = 98,304,000 double-precision cells = 750 MB of
memory. Moreover, this calculation demands twice this size, because the pinv
result is also a matrix of the same size.
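The memory estimate above is simple arithmetic and can be checked with a short sketch. It assumes a dense matrix of 8-byte doubles and reports sizes in units of 2^20 bytes; the function name is illustrative.

```python
def dense_matrix_mib(k, genes, bytes_per_cell=8):
    """Memory for a dense 4**k x genes matrix of doubles, in units of 2**20 bytes."""
    cells = 4 ** k * genes
    return cells * bytes_per_cell / 2 ** 20

seven_mer = dense_matrix_mib(7, 6000)   # 16384 x 6000 cells
six_mer = dense_matrix_mib(6, 3000)     # 4096 x 3000 cells
```

With these assumptions the 7-mer matrix for 6000 genes needs 750 of these units, before counting the equally large pinv result, while a 6-mer matrix for 3000 genes needs under 100.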
Thus we were limited to performing our experiments on 6-mer matrices, which
do not allow us to use all 6000 genes, because such a matrix (4096 x 6000) would
be under-determined. We chose to do our experiments on a maximum of 3000 genes,
which we hope will also serve as a good model for the whole 6000.
3.2.4 Performing the experiments
In this stage we simulated a gene expression experiment using a Ck array, and
measured the effect of noise on the calculated concentration vector.
We start with a hybridization signature matrix A and its calculated
pseudo-inverse pA = pinv(A).
We used a 6-mer Ck array on 2991 genes, so the size of A was 4096 x 2991.
1. We draw a concentration vector for the genes:
$v = (e_1, e_2, \dots, e_{2991})$, where $e_i \sim U(0,1) + 0.000001$
We added 0.000001 to prevent a case of non-expressed genes.
2. We calculate the hybridization signature of the gene mixture on the array:
$s = Av$
3. We choose a standard deviation value σ, and draw a noise factor for each cell
in the signature array:
$\mathrm{Noise} = (n_1, \dots, n_{4096})$, where $n_i \sim \mathrm{Normal}(0, \sigma^2)$
4. We get a noisy hybridization signature by multiplying each cell by the
exponent of a random noise value:
$s'_i = s_i \cdot e^{n_i}, \quad i = 1, \dots, 4096$
5. We calculate the concentration vector from the noisy signature, using the
pseudo-inverse of the original signature matrix:
$v' = \mathrm{pinv}(A) \cdot s'$
6. Now we measure the quality of our solution with 4 different functions
(i = 1…n, n = number of genes):
6.1 $\mathrm{SQ1} = \dfrac{1}{\sum_i v_i} \sum_i |v_i - v'_i|$
This is the average delta between the real and the calculated concentration
of each gene, normalized by the average concentration.
6.2 $\mathrm{SQ2} = \dfrac{1}{\sum_i v_i^2} \sum_i (v_i - v'_i)^2$
This is the squared Euclidean distance between the real and calculated
concentration vectors, normalized by the squared norm of the real vector.
6.3 $\mathrm{SQ3} = \dfrac{1}{n} \sum_i \dfrac{|v_i - v'_i|}{v_i}$
This is the average relative error of the gene concentrations.
6.4 $\mathrm{SQ4} = \dfrac{1}{n} \sum_i \left|\log(v'_i / v_i)\right|$
This is the average absolute log of the ratio between the calculated and the
real concentration of each gene.
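The four quality scores of step 6 translate directly into code. The sketch below assumes `v` is the real concentration vector and `v_est` the calculated one, both as plain lists; the function names and the toy vectors are illustrative. The SQ4 function assumes every estimate is positive, which the report does not guarantee for a pseudo-inverse solution.

```python
import math

def sq1(v, v_est):
    """Sum of absolute errors, normalized by the sum of real concentrations."""
    return sum(abs(a - b) for a, b in zip(v, v_est)) / sum(v)

def sq2(v, v_est):
    """Sum of squared errors, normalized by the squared norm of the real vector."""
    return sum((a - b) ** 2 for a, b in zip(v, v_est)) / sum(a * a for a in v)

def sq3(v, v_est):
    """Average relative error per gene."""
    return sum(abs(a - b) / a for a, b in zip(v, v_est)) / len(v)

def sq4(v, v_est):
    """Average absolute log-ratio per gene (assumes all estimates are positive)."""
    return sum(abs(math.log(b / a)) for a, b in zip(v, v_est)) / len(v)

v = [0.5, 0.25, 1.0]       # toy "real" concentrations
v_est = [0.5, 0.5, 0.5]    # toy calculated concentrations
scores = (sq1(v, v_est), sq2(v, v_est), sq3(v, v_est), sq4(v, v_est))
```

In this toy case the second gene is off by a factor of two upward and the third by a factor of two downward; SQ4 treats the two errors identically, while SQ3 counts the first as 100% and the second as 50%.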
4 Analyzing the results
4.1.1 SQ1
$\mathrm{SQ1} = \dfrac{1}{\sum_i v_i} \sum_i |v_i - v'_i|$
Let's look at an experiment on 1000 Yeast genes, with a noise standard deviation
of 0.01. The vector delta = v - v' is our prediction error. The values of this vector
are distributed normally with average = 0 and std = 0.0765, as shown in figure 4.1.
Figure 4.1 is a histogram of the values in the delta vector.
[Figure 4.1: Histogram of the values in the delta vector (horizontal axis from
-0.4 to 0.6).]
The mean of the absolute values of delta was 0.0559 in this case. This is the
average concentration error we have for each gene.
To normalize this value, we divide this number by the average concentration value:
$\frac{1}{n} \sum_i v_i = 0.5075$
so the SQ1 score in this case is 0.0559 / 0.5075 = 0.1101.
This gives us the SQ1 value:
$\dfrac{1}{\frac{1}{n} \sum_i v_i} \cdot \dfrac{1}{n} \sum_i |v_i - v'_i| = \dfrac{1}{\sum_i v_i} \sum_i |v_i - v'_i| = \mathrm{SQ1}$
so SQ1 is the average concentration error, normalized by the average of the
concentration values.
In figure 4.2 we see a graph of SQ1 values for different standard deviation of the
noise. For each noise STD, we ran 50 experiments. The graph shows the
maximum, average and minimum SQ1 values, out of the 50 experiments.
[Figure 4.2: Error (SQ1) versus STD of noise for 1000 genes, 50 experiments:
average, linear fit (slope = 11.79), max and min. Data: 1000 Yeast genes.
Signature matrix: 6-mer array, melting curve function calculated at 18 degrees.
Noise STD: 0.001, 0.003, 0.005, …, 0.1. 50 repeats for each STD value.]
We see that there is a linear relation between the STD of the noise and the
average SQ1 score. In the 1000 genes example shown in figure 4.2, the slope of the
linear fit to the average line is 11.79.
Based on this result, we can say that when we perform a gene expression assay
with our model on 1000 genes and a 6-mer array, the expected average error is:
expected average error = 11.79 * STD of noise
So if we want, for example, to calculate the expression vector of 1000 genes, using
a 6-mer Ck array, with an average error of 0.1 (in SQ1 normalized units), we should
lower the STD of our noise factors to at most 0.1 / 11.79 = 0.00848.
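The same back-of-the-envelope calculation, written as a sketch. It assumes the fitted line passes through the origin, as the formula in the text does; the function names are illustrative, and the slope is the fitted value reported for 1000 genes.

```python
def expected_sq1(slope, noise_std):
    """Expected average SQ1 error under the fitted linear relation."""
    return slope * noise_std

def max_noise_std(slope, target_error):
    """Largest noise STD that keeps the expected SQ1 error at or below the target."""
    return target_error / slope

budget_1000 = max_noise_std(11.79, 0.1)      # about 0.0085 for 1000 genes
check = expected_sq1(11.79, budget_1000)     # back to the 0.1 target
```

Because the relation is linear, the tolerable noise shrinks in direct proportion to the slope, so any increase in slope for larger gene sets tightens the noise budget by the same factor.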
Figure 4.3 shows the results of the same experiments but for 3000 genes.
[Figure 4.3: SQ1 error versus STD of noise for 3000 genes: average, linear fit
(slope = 83.25), max and min. Data: 3000 Yeast genes. Signature matrix: 6-mer
array, melting curve function calculated at 18 degrees. Noise STD: 0.001, 0.003,
0.005, …, 0.1. 50 repeats for each STD value.]
The relation is still linear but with a much steeper slope of 83.25.
This result is probably related to the larger condition number of the signature
matrix (see appendix B – future work).
Figure 4.4 shows the relation between the number of genes in the concentration
vector (and signatures matrix), and the slope of average SQ1 score. These are the
results from experiments made on a 6-mer Ck array.
[Figure 4.4: Slope of SQ1 average score versus number of genes (500 to 3000).
Data: 500 - 3000 Yeast genes. Signature matrix: 6-mer array, melting curve
function calculated at 18 degrees. Noise STD: 0.001, 0.003, 0.005, …, 0.1.
50 repeats for each STD value.]
4.1.2 SQ2
$\mathrm{SQ2} = \dfrac{1}{\sum_i v_i^2} \sum_i (v_i - v'_i)^2$
Using the SQ2 score we get similar results. In figure 4.5 we show the results on
1000 genes. Figure 4.6 shows the slopes for different sizes of gene sets.
The SQ2 results are a little different from the SQ1 results, maybe because the SQ2
function gives more weight to larger errors. We should remember that in order to
get the concentration vector back from the hybridization signature vector, we used
the pinv method, which guarantees an optimal solution in the least squares sense.
So the SQ2 method is the natural way of measuring the quality of our results.
[Figure 4.5: SQ2 versus STD of noise, 1000 genes, 50 experiments per STD:
average, linear fit (slope = 14.32), max and min. Data: 500 - 3000 Yeast genes.
Signature matrix: 6-mer array, melting curve function calculated at 18 degrees.
Noise STD: 0.001…0.1. 50 repeats for each STD value.]
[Figure 4.6: Slope of SQ2 average score versus number of genes (500 to 3000).
Data: 500 - 3000 Yeast genes. Signature matrix: 6-mer array, melting curve
function calculated at 18 degrees. Noise STD: 0.001, 0.003, 0.005, …, 0.1.
50 repeats for each STD value.]
4.1.3 SQ3
$\mathrm{SQ3} = \dfrac{1}{n} \sum_i \dfrac{|v_i - v'_i|}{v_i}$
We are now interested in measuring the relative error of our solution.
$v_i - v'_i$ is the delta between the real concentration of the i-th gene and its
calculated concentration. So $|v_i - v'_i| / v_i$ is this error divided by the real
concentration, in other words: the relative error of the calculated concentration.
The sum of all relative errors divided by the number of genes (n) is the average
relative error.
So the SQ3 score represents the average relative error of a calculated
concentration vector.
Note that $|v_i - v'_i| / v_i = |1 - v'_i / v_i|$ (when $v_i$ and $v'_i$ are positive), so
$R_i = v'_i / v_i$ is the value we are interested in.
$R_i \in [0, \infty)$, and $R_i = 1$ when $v'_i = v_i$.
Let's look at another example of an experiment, calculating the concentration
vector of 1000 Yeast genes using a 6-mer signature matrix, built using the
melting curve function at 18 degrees, with noise STD of 0.01.
[Figure 4.7: Histogram of vi / v'i values, grouped by logarithmic centers
(horizontal axis from 10^-0.2 to 10^0.2). Data: 1000 Yeast genes. Signature
matrix: 6-mer array, melting curve function calculated at 18 degrees.
Noise STD: 0.01.]
We see that on a logarithmic scale, the distribution of the R values is
approximately normal, centered at R = 1. About 13% of the values of R are smaller
than 2/3 or larger than 3/2 = 1.5. We consider these large relative errors, and it is
interesting to look at the concentrations in which these errors occurred.
The average concentration in the concentration vector is 0.4829.
But when we calculate the average concentration only on the places in the vector
that have a large relative error (Ri < 2/3 or Ri > 3/2) we get a mean concentration
of 0.1385. This points to the fact that the small concentrations are those that get
a large relative error.
In our example, the SQ3 score, which is the average relative error, is 0.74 (74%).
The SQ3 score of the concentrations larger than 0.1 (which make up 88% of the
concentration vector) was 0.14 (14%).
The problem of large relative error in small concentrations is bound to happen
when we use the least squares pseudo-inverse in our calculations, because this
norm gives more weight to large absolute errors and ignores the relative errors.
It is possible to create a pseudo-inverse matrix based on a different norm that
will consider the relative error (see appendix B – future work).
Figure 4.8 shows SQ3 scores for different noise STD on 3000 genes.
[Figure 4.8: SQ3 scores vs. noise STD: average, linear fit (slope = 403.25), max
and min. Data: 3000 Yeast genes. Signature matrix: 6-mer array, melting curve
function calculated at 18 degrees. Noise STD: 0.001...0.07. 50 repeats for each
STD value.]
Figure 4.9 shows the slopes for different numbers of genes.
[Figure 4.9: Slope of SQ3 average score versus number of genes (500 to 3000).
Data: 500-3000 Yeast genes. Signature matrix: 6-mer array, melting curve function
calculated at 18 degrees. Noise STD: 0.01. 50 repeats for each STD value.]
4.1.4 SQ4
$\mathrm{SQ4} = \dfrac{1}{n} \sum_i \left|\log(v'_i / v_i)\right|$
Another way of exploring the relative error of our solution is to look at the log
of R:
$\log R_i = \log |v'_i / v_i|$
As in the example from the previous section, we see that the distribution of the
values of log R is approximately normal, with mean 0. The maximum value in
this example is 4.6 and the minimum is -5.7. This gives us another perspective on
the data, in which we can measure the relative error.
A histogram of log R values is shown in figure 4.10.
[Figure 4.10: Histogram of Log(CV'i / CVi) values (horizontal axis from -1 to 1).
Data: 1000 Yeast genes. Signature matrix: 6-mer array, melting curve function
calculated at 18 degrees. Noise STD: 0.01.]
[Figure 4.11: "ESQ4 score" versus STD of noise: average, max and min. Panel title:
res-file: sign-6-orf-coding-hyb-18-3000.mat genes:2991 repeats:50. Data: 3000
Yeast genes. Signature matrix: 6-mer array, melting curve function calculated at
18 degrees. Noise STD: 0.001 … 0.1. 50 repeats for each STD value.]
5 Dealing with under-determined matrices
5.1 The problem
When we try to get a concentration vector of m genes, using a hybridization array
of n = 4^k cells, and m > n, we face the problem of an under-determined matrix. It
is like trying to solve an algebraic problem of 3 variables with only 2 equations.
The solution space for such a problem is infinitely large.
5.2 The solution
The proposed solution to this problem is to produce another set of “equations” by
building a second hybridization signature matrix, at a different temperature.
For example, let's say we have a mixture of 6000 genes. We want to use a 6-mer
array to get the concentration vector of our genes, so we calculate the
hybridization signature matrix at a temperature of 18 degrees. We get an
under-determined matrix A18 of size N x M = 4096 x 6000. Then we produce another
matrix A15 at a temperature of 15 degrees. To get our concentration vector, we use
the vertical concatenation of the two matrices:
$A_{15,18} = \begin{bmatrix} A_{15} \\ A_{18} \end{bmatrix}$
The size of A15,18 is 8192 x 6000, and we can use this matrix to get our
concentration vector.
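The concatenation can be illustrated with toy numbers. In the sketch below the matrices are lists of rows, and the two measured signatures are stacked in the same order as the matrices so that the combined system still reads A15,18 v = s. All values and helper names are made up for illustration.

```python
def vstack(upper, lower):
    """Stack two matrices (lists of rows) that share the same number of columns."""
    if len(upper[0]) != len(lower[0]):
        raise ValueError("both matrices need one column per gene")
    return [list(row) for row in upper] + [list(row) for row in lower]

def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Toy numbers only: 2 array cells, 3 genes, two temperatures.
A15 = [[0.9, 0.0, 0.4],
       [0.0, 0.8, 0.7]]
A18 = [[0.6, 0.0, 0.1],
       [0.0, 0.3, 0.5]]
v = [0.2, 0.5, 0.9]

A_both = vstack(A15, A18)                    # 4 rows, 3 columns
s_both = matvec(A15, v) + matvec(A18, v)     # the two signatures, stacked in the same order
```

Each temperature alone gives 2 equations for 3 unknowns; stacked, the system has 4 equations. The extra rows only help if the second temperature yields signatures that are not simple multiples of the first.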
We tested this idea on concentration vectors of 2000 genes, using a 6-mers array.
This way we could compare our results to the previous experiments with 2000
genes. The reason for not using more genes was again the space complexity
limitation: MATLAB could not load two 4096 x 6000 matrices, and compute their
pinv, and could not even perform this operation on two 4096 x 3000 matrices.
Figure 5.1 shows a comparison between a regular 18 degrees hybridization matrix
SQ1 scores shown before, and the 2 temperatures (15, 18 degrees) matrix SQ1
scores. We can see a nice improvement in the quality of the results when using the
2 temperatures matrix.
[Figure 5.1: SQ1 score versus STD of noise: Normal Average (slope = 33.5) and
2 temp Average (slope = 23.3). Data: 2000 Yeast genes. Signature matrix: 6-mer
array, melting curve function calculated at 18 degrees, and 15 degrees.
Noise STD: 0.001…0.01.]
Figure 5.2 shows the relative error comparison using SQ4 (log) scores.
It is evident that the use of the 2 temperatures matrix improved the result quality
also in the relative error sense.
[Figure 5.2: SQ4 score versus STD of noise: Normal Average and 2 temp Average.
Data: 2000 Yeast genes. Signature matrix: 6-mer array, melting curve function
calculated at 18 degrees, and 15 degrees. Noise STD: 0.001…0.01.]
References
1. Serial analysis of gene expression (SAGE) is a tool that allows the analysis of
overall gene expression patterns with digital analysis. Because SAGE does not
require a preexisting clone, it can be used to identify and quantitate new genes
as well as known genes.
http://www.sagenet.org/
2. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D. & Levine,
A. J. (1999), ‘Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide arrays’,
Proc. Nat. Acad. Sci. USA 96, 6745–6750.
3. DeRisi, J., Iyer, V. & Brown, P. (1997), ‘Exploring the metabolic and genetic
control of gene expression on a genomic scale’, Science 282, 699–705.
4. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee,
M. S., Mittmann, M., Want, C., Kobayashi, M., Horton, H. & Brown, E. L.
(1996), ‘DNA expression monitoring by hybridization of high density
oligonucleotide arrays’, Nature Biotechnology 14, 1675–1680.
Appendix A - Work Stages summary
(** marks items proposed as future work)
Data Sets:
- Random sequences. Length 500-1000.
- Yeast ORF sequences. ~6000 sequences.
Signature Calculation:
- match / mismatch (1 / 0)
- Hybridization function: 1 / (exp(a(x-Tm))+1)
Noise modeling:
Let X be a random variable with normal distribution: X ~ Normal(0, σ^2).
Let W be the real hybridization signature of a mixture; then W’ is the
experimental hybridization signature, which is the real signature with some noise.
- noise = X, W’ = W + noise
- noise = e^X, W’ = W * noise
Pseudo-inverse norm:
- Least squares: sum (Wi – Wi’)^2
- ** Percentage: sum ((Wi – Wi’) / Wi)^2
- ** Logarithmic: sum | log |Wi / Wi’| |
Result quality measurement function:
- Least squares: sum (Wi – Wi’)^2
- Normalized least squares: sum (Wi – Wi’)^2 / number of genes
- Percentage: sum ((Wi – Wi’) / Wi)^2
- Logarithmic: sum | log |Wi / Wi’| |
Results
Draw graphs for all experiments.
Compare results with different parameters:
- Number of genes
- Ck size (3-mer = 64, 4-mer = 256, 5-mer = 1024, 6-mer = 4096, 7-mer =
16384, 8-mer = 65536)
- Random / Yeast data.
Appendix B – Future work
Solve under-determined matrix equation by using 2 temperatures on same set of
genes, and getting a double length array. Compare results of this method to the
regular method.
Pseudo-inverse norms:
We saw that using the basic pseudo-inverse norm (L2) does not give us good
results on relative error measurements. If we are interested in lowering the relative
error, it is possible to define a different norm that will give more weight to larger
relative errors. This norm would look something like | vi – v’i | / vi which is
actually a family of norms. We can use these norms to create a pseudo-inverse
matrix that will minimize the relative error of the result.
More research can lead to a norm that will lower the values of |log (v’i / vi)|.
Differential expression
Compare expression vector delta (can we discover the genes that were over/under
expressed)
Condition number
Explore the connection between the signature matrix condition number and the
results.
Extending the hybridization model
Improve the hybridization model by taking mismatches into account and refine the
melting curve function.
Appendix C – Hardware & Software specifications
Hardware
We used an HP mobile computer with a Pentium-4m processor running at 1.6
GHz and 512 MB of RAM.
Software
We used MATLAB 6.1 for all the experiments.
We wrote some C programs to produce the signature matrices, and to convert the
Yeast ORF files from FASTA format to plain text files.
Space requirements
Here are some examples:
File                                      Size
7-mers signatures matrix of 6000 genes    915 MB (zipped: 15 MB)
6-mers signatures matrix of 6000 genes    228 MB (zipped: 6 MB)
4096 x 3000 matrix loaded in MATLAB       96 MB of RAM
Time
It took MATLAB about 5 minutes to calculate the pinv of a 4096 x 6000 matrix.
C Programs
Signature maker
C program used to build signature matrices.
Input: the program reads a plain text file that contains one gene in each line.
Output: a text file with each gene’s hybridization signature, based on the generic
definitions of Ck array (K-mer), hybridization function and temperature.
Fasta parser
C program used to convert FASTA files to the format readable by the Signature
maker.
input: a text file in FASTA format.
Output: a text file with one gene in each line.
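The conversion done by the Fasta parser can be sketched as follows. This is a Python illustration of the same idea, not the project's C program: it assumes standard FASTA input, where each record starts with a ">" header line followed by one or more sequence lines, and the record names in the example are made up.

```python
def fasta_to_lines(fasta_text):
    """Return one string per FASTA record, with the sequence lines joined together."""
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # header: close the previous record
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line.upper())
    if current:                           # close the last record
        sequences.append("".join(current))
    return sequences

example = ">orf1 illustrative header\nATGG\nccta\n>orf2\nTTGA\n"
genes = fasta_to_lines(example)
```

Writing `"\n".join(genes)` to a file gives the one-gene-per-line format that the Signature maker expects as input.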
MATLAB scripts
Analyze signature
run the concentration experiment on some noise STD.
Input: signature matrix, pinv of that matrix, STD vector (the script performs the
experiments on each value in the vector), repeats (how many experiments to
perform for each noise STD)
Output: a res construct that contains the results of all the experiments.
Plot results
Plot the results of a series of experiments.
Input: res construct and number of quality measurement function.
Output: a graph of the average, max and min results vs. the STD vector.