A Whole Genome Phylogeny Using Truncated Pivoted QR Decomposition Shakhina Pulatova

advertisement
A Whole Genome Phylogeny
Using Truncated Pivoted QR
Decomposition
Shakhina Pulatova
Thesis Defense, July 14, 2004
5/29/2016
Michael Berry (Chair)
Robert Ward
Kwai Wong
1
Outline

Introduction







5/29/2016
SVD-based Approach
QR-based Approach
SPQR Algorithm
Whole Genome Phylogeny
Experimental Results
Conclusions
Future Work
2
Introduction

Whole genome sequences in public databases
accumulating rapidly

Use sequence information to define and understand
ancestral relationships between organisms

Need efficient techniques to automatically compare and
categorize genes and species within extremely large
genomic datasets

Goal: automatically generate phylogenetic trees using
protein sequences
5/29/2016
3
Introduction (cont.)

Phylogenetic Tree – branched dendrograms used to
represent or model the evolutionary history of a group of
species/organisms






Organisms placed at the leaves
Binary split branches
Branch lengths proportional to predicted evolutionary time
between species
Rooted – unique most recent common ancestor
Unrooted – common ancestor unknown branch
Mouse
Useful in
Rat



5/29/2016
Pharmaceutical R&D
branchlength
root
Human
Medicine
Designing enhanced organisms, forensics, linguistics, etc.
4
Introduction (cont.)

Standard approaches to phylogeny




Character-by-character analysis of whole genomes (Maximum
Likelihood, Bayesian Inference)  computationally intractable!
Most methods use incomplete subsets of data
Methods based on sequence alignments  do not account for
insertions and deletions of arbitrary size
SVD-based approach by Berry and Stuart


5/29/2016
Uses complete genome sequences
Relatively unaffected by insertions or deletions
5
SVD-based Approach

Exhaustive similarity analysis using identified significant
independent characteristics within protein sequences

Constructs peptide × protein matrix A from whole genomes


5/29/2016
All possible overlapping short strings of peptides considered
Each protein defined as a linear combination of those peptide
frequencies
6
SVD-based Approach (cont.)


Factor the matrix A using truncated Singular Value
Decomposition (SVD) to get a low rank approximation
Ak = UkΣkVkT






5/29/2016
Uk – new vector definitions for peptides (“peptide” matrix)
Vk – new vector definitions for proteins (“protein” matrix)
Orthonormal basis vectors of Uk – correlated peptide motifs
(independent characteristics) as particular linear combinations
of peptide strings
Orthonormal basis vectors of Vk – corresponding protein
families as particular linear combinations of proteins
Species vectors = sum of protein vectors for each species
Angles between species vectors  estimates of species
similarities
7
SVD-based Approach (cont.)

Advantages of SVD:



Reduced rank approximations most optimal
Minimal norm deviations from original matrix
Disadvantages:



5/29/2016
Expensive to compute
Storage of dense factors
No easy way to increase the size of the approximation
step-by-step
8
QR-based Approach

Truncated Pivoted QR Decomposition – alternative to SVD

AP = QR




5/29/2016
A is m×n
P – permutation matrix that reorders columns of A
Q – m×n orthonormal matrix
R – n×n upper triangular matrix
9
QR-based Approach (cont.)

Let B = AP, then B can be written as:
B
(k )
1
B2
(k )
  Q
(k )
Q2
1
(k )

 R11( k )

 0

(k )
R12 

(k ) 
R22 

Rank k approximation to B after kth step:
~
(k )
B ( k )  Q1 R11( k ) R12( k )  , where R11(k) is k×k

Objective: minimize
k
~
(k )
B  B ( k )  R22
k
n-k
k
n-k
k
(k)
m B1
5/29/2016
(k)
B2
=
m
(k)
Q1
(k)
Q2
* n-k
(k)
R11
n-k
(k)
R12
(k)
0
R22
10
SPQR Algorithm




Semi-Pivoted QR (SPQR) Algorithm by Stewart
At each iteration column of A used to compute an
additional column of Q and row of R
Column of A chosen so that ||R22(k)|| becomes small
Does not form dense factor Q1, since
Q1(k) = B1(k) (R11(k))-1


5/29/2016
Uses Quasi-Gram-Schmidt algorithm (based on classical
Gram-Schmidt) with reorthogonalization
Only involves sparse matrix-vector products and triangular
system solutions
11
SPQR Algorithm (cont.)
Initialization and I/O
Input:
m×n matrix A
k
tol
//original matrix
//number of iterations to perform
//desired accuracy
Output:
R
P
nrmR22
//kxk triangular R11
//Permutation matrix
//norm of R22
Initialization:
k = min (m, n, k)
R[1:k, 1:k] = 0
colnrm[j] = ||A[:, j]||,
P[j] = j, j = 1, … , n
5/29/2016
j = 1, … , n
//norm of the columns of R22
12
Determine the
pivot column
and swap it
with column k
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
13
Special action
for 1st iteration
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
14
Quasi-GramSchmidt step
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
15
Reorthogonalization
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
16
Update R
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
17
Compute
kth column
of Q
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
18
Compute kth
row of R12
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
19
Downdate
column norms,
compute ||R22||
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
20
Stopping
criterion
5/29/2016
for i = 1 to k {
determine an index p such that conrm[p] is maximal
P[k] <-> P[p]
colnrm[k] <-> colnrm[p]
a = A[:, P[k]]
if k = 1 {
R[1, 1] = ||a||
q = a / R[1, 1]
} else { b = aT * A[:, P[1: k-1]]
Solve the system R[1:k-1, 1:k-1]T * R[1:k-1, k] = bT for R[1:k-1, k]
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = a – A[:, P[1: k-1]] * c
b = qT * A[:, P[1:k-1]]
Solve the system R[1:k-1, 1:k-1]T * r = bT for r
Solve the system R[1:k-1, 1:k-1] * c = r for c
q = q – A[:, P[1:k-1]] * c
R[1:k-1, k] = R[1:k-1, k] + r
R[k,k] = ||q||
Solve the system R[1:k-1, 1:k-1] * c = R[1:k-1, k] for c
q = (a – A[:, P[1: k-1]] * c) / R[k, k]
}
if k+1 <= n { r[k+1:n] = qT * A[:, P[k+1, n]] }
colnrm[j] = colnrm[j]2 – r[j]2, j = k+1, … , n, set negative downdates to 0
nrmR22 = sqrt(sum(colnrm[j])), j = k+1, … , n
colnrm[j] = sqrt(colnrm[j]), j = k+1, … , n
if nR22 < tol { leave k }
}
21
SPQR Algorithm (cont.)

Apply SPQR algorithm to A to get



Matrix X from columns of A
Upper triangular matrix R
 P  XR-1 used as a “peptide” matrix
Apply SPQR algorithm to AT to get


5/29/2016
Matrix Y from columns of AT
Upper triangular matrix S
 Q  YS-1 used as a “protein” matrix
22
SPQR Algorithm (cont.)

C++ implementation for efficiency and scalability

Sparse matrix A in Harwell-Boeing format

Sparse matrix-vector products



A [:, P[start:end]] * x
xT * A[:, P[start:end]]
LAPACK DTRTRS module used to solve triangular
systems
5/29/2016
23
Whole Genome Phylogeny
1. Compile protein sequences for whole genomes;
Construct peptide-by-protein (sparse) frequency matrix
A [m, n] (AACODE3)

Peptides

m = # all possible overlapping x-gram peptides:
 4-gram peptides  160,000 overlapping tetrapeptides
 5-gram peptides  3,200,000 overlapping
pentapeptides
Proteins
n = # of proteins
1
2
. . . n
5/29/2016
1 f11
f12
. . .
2 f21
f22
f2n
3 f31
f32
f3n
m fm1
fm2
. . .
f1n
fmn
24
Whole Genome Phylogeny (cont.)
2. Apply SPQR algorithm to AT to get Q matrix

Store Q matrix
 each column appended to file at each iteration
 for k iterations, storage = n*k*sizeof(double)
3. Construct species vectors by summing corresponding
protein vectors (COSDIST)
4. Build evolutionary pairwise distance matrix by calculating
cosine values for each pair of species vectors
(COSDIST)
5. Generate phylogenetic trees (NEIGHBOR)
5/29/2016
25
Whole Genome Phylogeny (cont.)
6. Construct consensus tree using subspaces of different
dimensions (CONSENSE)
7. Analyze effect of dimensions by comparing trees and
computing distances between them (TREEDIST)
8. Visualize and edit trees (TREE EXPLORER)
5/29/2016
26
Experimental Results
Peptides
Proteins
Factors
(SPQR)
Factors
(SVD)
4
160,000
1,675
620
534
Land Plants



Chloroplast
genome
17 species,
1,675 proteins
Input matrix
log-entropy
transformed
30
Symmetric Distance

n-grams
25
20
15
10
5
0
10
110
210
310
410
510
610
Number of Factors
20 per. Mov. Avg. (Adjacent tree distances)
20 per. Mov. Avg. (Dist. from SPQR 10-620 consensus tree)
20 per. Mov. Avg. (Dist. from SVD 10-534 consensus tree)
5/29/2016
27
Experimental Results (cont.)


SPQR-based
consensus tree
Abel
602/611
SVD-based
consensus tree
498/525
Ntab
374/611
356/525
Sole
220/611
499/611
Atha
380/525
Lcor
372/525
559/611
429/525
Cfer
470/525
Oela
555/611
611/611
276/611
421/525
448/525
Taes
525/525
Acap
Mpol
Cglo
5/29/2016
Taes
Pnud
222/525
Acap
Afor
Afor
277/611
Zmay
Pthu
392/525
Pnud
390/611
Oela
Osat
Zmay
Pthu
Cfer
525/525
525/525
Osat
569/611
513/611
Lcor
Atri
Atri
515/611
Ntab
Sole
323/525
Atha
280/611
Abel
453/525
Mpol
Cglo
28
Experimental Results (cont.)
Vertebrates



Mitochondrial
genomes
64 species, 13
proteins each
Input matrix
log-entropy
transformed
n-grams
Peptides
Proteins
Factors
(SPQR)
Factors
(SVD)
5
3,200,000
832
400
121
120
Symmetric Distance

100
80
60
40
20
0
10
110
210
310
Number of Factors
20 per. Mov. Avg. (Dist. from SVD 10-121 consensus tree)
20 per. Mov. Avg. (Dist. from SPQR 10-400 consensus tree)
20 per. Mov. Avg. (Adjacent tree distances)
5/29/2016
29
Experimental Results (cont.)
391
179
SPQR-based
consensus tree

89
350
366
313
118
319
137
205
382
389
86
323
149
144
51
32
104
90
78
32
391
391
362
165
169
122
391
369
346
47
165
138
230
391
391
198
391
177
204
148
144
268
272
80
360
374
205
122
250
50
70
391
258
150
74
325
46
206
210
391
374
243
391
391
343
5/29/2016
271
345
Easi
Ecab
Csim
Runi
Hgry
Pvit
Cfam
Fcat
Oari
Btau
Sscr
Bphy
Bmus
Hamp
Ajam
Teur
Dnov
Oafe
Svul
Mgli
Cpor
Ocun
Mmus
Rnor
Eeur
Ppya
Hsap
Lafr
Dvir
Mrob
Oana
Amis
Dsem
Cboy
Ccic
Fper
Scam
Rame
Ssha
Aame
Ggal
Vcha
Cfru
Cmyd
Cpic
Psub
Eegr
Gmor
Sfon
Salp
Ssal
Omyk
Poli
Ccar
Caur
Drer
Clac
Porn
Lcha
Pdol
Scan
Mman
Saca
Rrad
103
Perrisodactyls
Carnivores

64
SVD-based
consensus tree
97
82
100
62
72
75
101
Cetartiodactyls
63
108
63
Edentata
60
Rodents
99
110
Primates
89
71
Non-eutherians
82
71
109
59
63
67
88
Birds &
Reptiles
57
79
108
83
88
85
58
75
105
90
Bony Fish
110
98
93
88
87
Cartilagenous
Fish
81
98
Easi
Ecab
Csim
Runi
Hgry
Pvit
Cfam
Fcat
Teur
Oari
Btau
Sscr
Bphy
Bmus
Hamp
Ajam
Dnov
Oafe
Ocun
Svul
Mgli
Cpor
Mmus
Rnor
Lafr
Ppya
Hsap
Eeur
Dvir
Mrob
Oana
Dsem
Cboy
Ccic
Scam
Rame
Aame
Ggal
Fper
Vcha
Cfru
Ssha
Amis
Cmyd
Cpic
Psub
Eegr
Pdol
Porn
Sfon
Salp
Omyk
Ssal
Poli
Gmor
Ccar
Caur
Clac
Drer
Lcha
Scan
Mman
Saca
Rrad
30
Experimental Results (cont.)
Eukaryotes


9 whole
genomes,
175,559
proteins
Input matrix
log-entropy
transformed,
and columns
normalized
n-grams
Peptides
Proteins
Factors
(SPQR)
Factors
(SVD)
4
160,000
175,559
1400
437
5
Symmetric Distance

4
3
2
1
0
10
110
210
310
410
510
610
710
810
910 1010 1110 1210 1310
Number of Factors
20 per. Mov. Avg. (Dist. from SVD 10-437 and SPQR 1198-1400 consensus trees)
20 per. Mov. Avg. (Adjacent Tree Distances)
5/29/2016
31
Experimental Results (cont.)

SPQR-based
consensus tree

SVD-based
consensus tree
Mmus
192/203
203/203
Mmus
426/428
Rnov
421/428
Hsap
203/203
203/203
5/29/2016
Hsap
428/428
Frub
203/203
Rnov
203/203
Frub
428/428
Agam
Dmel
428/428
Agam
409/428
Dmel
Cele
Cele
Scer
Scer
Pfal
Pfal
32
Experimental Results (cont.)

Performance Analysis
Plant Dateset Times
742.66
800
725.54
700
2400
600
2100
500
1800
400
300
132.62
200
1500
857.33
1200
900
600
47.76
100
2443
2447
2700
Seconds
Seconds
Vertebrate Dataset Times
20.7
300
0
0
SPQR(A^T)
SVD
SPQR
SPQR(A^T)
5/29/2016
SVD
SPQR(A)
SPQR(A^T)
SVD
SVD
SPQR
SVD
COSDIST
33
Experimental Results (cont.)


Eukaryote Dataset Times
12.52
14
Observations:
12.35
12

Hours
10
8
5.05
6
3.94
4
2

0
SPQR(A^T)
SPQR(A^T)
5/29/2016
SVD
SPQR(A)
SPQR
SVD
SVD
As the # dimensions
increase, COSDIST time
increases
If m (rows) > n (cols) in orig.
matrix, protein matrix
calculated faster than
peptide matrix
If n > m, peptide matrix
computed faster than
protein matrix
COSDIST
34
Experimental Results (cont.)

Memory Usage
Analysis
35
Megabytes
30
25
20
15
10
5
0
Plants
Vertebrates
SPQR(A^T)
5/29/2016
SPQR(A)
Eukaryotes
SVD
35
Conclusions

Advantages of SPQR-based approach for whole genome
phylogeny analysis





Fast
Memory efficient
Storage can be conserved from Q factor, if needed
Scalable alternative to SVD for comparing whole genomes in
a phylogenetic context
Disadvantages:

5/29/2016
Need both A and AT if both motif analysis and phylogenetic
trees are desired
36
Future Work

For experiments conducted



Motif analysis can be performed if needed
Better consensus trees may be obtained by constructing gene
trees
Algorithm


5/29/2016
Transposing the matrix in Harwell-Boeing format:
Examine tradeoffs between storage and additional computation
Implement the algorithm in parallel (compute peptide and
protein factor matrices simultaneously)
37
Questions
5/29/2016
38
Download