A Whole Genome Phylogeny Using Truncated Pivoted QR Decomposition
Shakhina Pulatova
Thesis Defense, July 14, 2004
5/29/2016
Michael Berry (Chair), Robert Ward, Kwai Wong

Outline
- Introduction
- SVD-based Approach
- QR-based Approach
- SPQR Algorithm
- Whole Genome Phylogeny
- Experimental Results
- Conclusions
- Future Work

Introduction
- Whole genome sequences are accumulating rapidly in public databases
- Use sequence information to define and understand ancestral relationships between organisms
- Need efficient techniques to automatically compare and categorize genes and species within extremely large genomic datasets
- Goal: automatically generate phylogenetic trees using protein sequences

Introduction (cont.)
Phylogenetic tree: a branched dendrogram used to represent or model the evolutionary history of a group of species/organisms
- Organisms placed at the leaves
- Binary split branches
- Branch lengths proportional to predicted evolutionary time between species
- Rooted: unique most recent common ancestor
- Unrooted: common ancestor unknown
[Figure: example rooted tree with leaves Mouse, Rat, and Human; labels mark a branch, a branch length, and the root]
Useful in
- Pharmaceutical R&D
- Medicine
- Designing enhanced organisms, forensics, linguistics, etc.

Introduction (cont.)
Standard approaches to phylogeny
- Character-by-character analysis of whole genomes (Maximum Likelihood, Bayesian Inference) is computationally intractable!
- Most methods use incomplete subsets of data
- Methods based on sequence alignments do not account for insertions and deletions of arbitrary size
SVD-based approach by Berry and Stuart
- Uses complete genome sequences
- Relatively unaffected by insertions or deletions

SVD-based Approach
- Exhaustive similarity analysis using identified significant independent characteristics within protein sequences
- Constructs peptide × protein matrix A from whole genomes
  - All possible overlapping short strings of peptides considered
  - Each protein defined as a linear combination of those peptide frequencies

SVD-based Approach (cont.)
- Factor the matrix A using truncated Singular Value Decomposition (SVD) to get a low-rank approximation $A_k = U_k \Sigma_k V_k^T$
  - $U_k$: new vector definitions for peptides ("peptide" matrix)
  - $V_k$: new vector definitions for proteins ("protein" matrix)
- Orthonormal basis vectors of $U_k$: correlated peptide motifs (independent characteristics) as particular linear combinations of peptide strings
- Orthonormal basis vectors of $V_k$: corresponding protein families as particular linear combinations of proteins
- Species vectors = sum of protein vectors for each species
- Angles between species vectors give estimates of species similarities

SVD-based Approach (cont.)
Advantages of SVD:
- Reduced-rank approximations are optimal
- Minimal norm deviations from original matrix
Disadvantages:
- Expensive to compute
- Storage of dense factors
- No easy way to increase the size of the approximation step-by-step

QR-based Approach
Truncated Pivoted QR Decomposition: an alternative to SVD
$AP = QR$
- A is m×n
- P: permutation matrix that reorders columns of A
- Q: m×n matrix with orthonormal columns
- R: n×n upper triangular matrix

QR-based Approach (cont.)
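To make the factorization AP = QR concrete, here is a minimal dense sketch in Python. It uses modified Gram-Schmidt with column pivoting on plain lists, which is an assumption chosen for readability and is not the method used in the thesis (SPQR, described later, avoids forming Q). The function name `pivoted_qr` and its return layout are hypothetical.

```python
import math

def pivoted_qr(A, k):
    """Truncated column-pivoted QR by modified Gram-Schmidt (dense, illustrative).

    A is a list of m rows, each a list of n floats. Returns (perm, Q, R):
    perm lists column indices of A in pivot order, Q holds the computed
    orthonormal columns (each of length m), and R has one row of length n
    per computed column, so that A[:, perm] ~= Q R.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # working copy, by column
    perm = list(range(n))
    Q, R = [], []
    for step in range(min(k, m, n)):
        # Pivot: move the remaining column with the largest norm to position `step`.
        norms = [math.sqrt(sum(x * x for x in cols[j])) for j in range(step, n)]
        p = step + norms.index(max(norms))
        cols[step], cols[p] = cols[p], cols[step]
        perm[step], perm[p] = perm[p], perm[step]
        for row in R:  # keep earlier rows of R consistent with the column swap
            row[step], row[p] = row[p], row[step]
        nrm = norms[p - step]
        if nrm == 0.0:
            break  # all remaining columns are zero
        q = [x / nrm for x in cols[step]]
        r = [0.0] * n
        r[step] = nrm
        # Remove the component along q from every remaining column.
        for j in range(step + 1, n):
            r[j] = sum(qi * x for qi, x in zip(q, cols[j]))
            cols[j] = [x - r[j] * qi for x, qi in zip(cols[j], q)]
        Q.append(q)
        R.append(r)
    return perm, Q, R
```

Calling `pivoted_qr(A, k)` with k smaller than the number of columns stops after k pivots; the columns left in the working copy then play the role of the unresolved trailing block whose norm the truncated approach tries to keep small.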
Let B = AP; then B can be written as:

$$B = \begin{pmatrix} B_1^{(k)} & B_2^{(k)} \end{pmatrix} = \begin{pmatrix} Q_1^{(k)} & Q_2^{(k)} \end{pmatrix} \begin{pmatrix} R_{11}^{(k)} & R_{12}^{(k)} \\ 0 & R_{22}^{(k)} \end{pmatrix}$$

Rank-k approximation to B after the kth step:

$$\tilde{B}^{(k)} = Q_1^{(k)} \begin{pmatrix} R_{11}^{(k)} & R_{12}^{(k)} \end{pmatrix}, \quad \text{where } R_{11}^{(k)} \text{ is } k \times k$$

Objective: minimize

$$\| B - \tilde{B}^{(k)} \| = \| R_{22}^{(k)} \|$$

Block sizes: $B_1^{(k)}$ and $Q_1^{(k)}$ are m×k; $B_2^{(k)}$ and $Q_2^{(k)}$ are m×(n-k); $R_{11}^{(k)}$ is k×k; $R_{12}^{(k)}$ is k×(n-k); $R_{22}^{(k)}$ is (n-k)×(n-k).

SPQR Algorithm
- Semi-Pivoted QR (SPQR) Algorithm by Stewart
- At each iteration, a column of A is used to compute an additional column of Q and row of R
- Column of A chosen so that $\|R_{22}^{(k)}\|$ becomes small
- Does not form the dense factor $Q_1$, since $Q_1^{(k)} = B_1^{(k)} (R_{11}^{(k)})^{-1}$
- Uses Quasi-Gram-Schmidt algorithm (based on classical Gram-Schmidt) with reorthogonalization
- Only involves sparse matrix-vector products and triangular system solutions

SPQR Algorithm (cont.)
Initialization and I/O
Input:
  A      // original m×n matrix
  k      // number of iterations to perform
  tol    // desired accuracy
Output:
  R      // k×k triangular R11
  P      // permutation matrix
  nrmR22 // norm of R22
Initialization:
  k = min(m, n, k)
  R[1:k, 1:k] = 0
  colnrm[j] = ||A[:, j]||, j = 1, …, n   // norms of the columns of R22
  P[j] = j, j = 1, …, n

Main loop (i is the iteration index):
  for i = 1 to k {
    // Determine the pivot column and swap it with column i
    determine an index p in i..n such that colnrm[p] is maximal
    P[i] <-> P[p]
    colnrm[i] <-> colnrm[p]
    a = A[:, P[i]]
    if i = 1 {
      // Special action for 1st iteration
      R[1, 1] = ||a||
      q = a / R[1, 1]
    } else {
      // Quasi-Gram-Schmidt step
      b = a^T * A[:, P[1:i-1]]
      Solve the system R[1:i-1, 1:i-1]^T * R[1:i-1, i] = b^T for R[1:i-1, i]
      Solve the system R[1:i-1, 1:i-1] * c = R[1:i-1, i] for c
      q = a - A[:, P[1:i-1]] * c
      // Reorthogonalization
      b = q^T * A[:, P[1:i-1]]
      Solve the system R[1:i-1, 1:i-1]^T * r = b^T for r
      Solve the system R[1:i-1, 1:i-1] * c = r for c
      q = q - A[:, P[1:i-1]] * c
      // Update R
      R[1:i-1, i] = R[1:i-1, i] + r
      R[i, i] = ||q||
      // Compute column i of Q
      Solve the system R[1:i-1, 1:i-1] * c = R[1:i-1, i] for c
      q = (a - A[:, P[1:i-1]] * c) / R[i, i]
    }
    // Compute row i of R12
    if i+1 <= n { r[i+1:n] = q^T * A[:, P[i+1:n]] }
    // Downdate column norms, compute ||R22||
    colnrm[j] = colnrm[j]^2 - r[j]^2, j = i+1, …, n   (set negative downdates to 0)
    nrmR22 = sqrt(sum(colnrm[j])), j = i+1, …, n
    colnrm[j] = sqrt(colnrm[j]), j = i+1, …, n
    // Stopping criterion
    if nrmR22 < tol { leave the loop }
  }

SPQR Algorithm (cont.)
Apply SPQR algorithm to A to get
- Matrix X from columns of A
- Upper triangular matrix R
- $P = X R^{-1}$ used as a "peptide" matrix
Apply SPQR algorithm to $A^T$ to get
- Matrix Y from columns of $A^T$
- Upper triangular matrix S
- $Q = Y S^{-1}$ used as a "protein" matrix

SPQR Algorithm (cont.)
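The SPQR loop described in this section can be traced with a small dense sketch in Python. This is an illustration under stated assumptions, not the thesis implementation: columns are plain lists rather than Harwell-Boeing sparse storage, squared column norms are kept instead of squaring and re-rooting them each pass, and the names `spqr`, `solve_upper`, `solve_upper_t`, and `project_out` are hypothetical. As in the algorithm, Q is never stored; each q is rebuilt from selected columns of A and triangular solves with R.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def solve_upper(R, b, k):
    """Back substitution with the leading k x k block of upper triangular R."""
    x = [0.0] * k
    for i in range(k - 1, -1, -1):
        x[i] = (b[i] - sum(R[i][j] * x[j] for j in range(i + 1, k))) / R[i][i]
    return x

def solve_upper_t(R, b, k):
    """Forward substitution with the transpose of the leading k x k block of R."""
    x = [0.0] * k
    for i in range(k):
        x[i] = (b[i] - sum(R[j][i] * x[j] for j in range(i))) / R[i][i]
    return x

def spqr(cols, kmax, tol):
    """cols: the n columns of A, each a list of m floats.
    Returns (R11, perm, nrm_r22); R11 is steps x steps upper triangular."""
    n, m = len(cols), len(cols[0])
    kmax = min(m, n, kmax)
    R = [[0.0] * kmax for _ in range(kmax)]
    perm = list(range(n))
    colsq = [dot(c, c) for c in cols]  # squared column norms of R22
    nrm_r22 = math.sqrt(sum(colsq))
    steps = 0

    def project_out(v, basis, k):
        # r = Q1^T v from R^T r = A1^T v, then w = v - A1 R^{-1} r = v - Q1 r.
        r = solve_upper_t(R, [dot(v, c) for c in basis], k)
        c = solve_upper(R, r, k)
        w = [vt - sum(cj * col[t] for cj, col in zip(c, basis))
             for t, vt in enumerate(v)]
        return r, w

    for i in range(kmax):
        # Pivot: the remaining column with the largest downdated norm.
        p = max(range(i, n), key=lambda j: colsq[j])
        perm[i], perm[p] = perm[p], perm[i]
        colsq[i], colsq[p] = colsq[p], colsq[i]
        a = cols[perm[i]]
        basis = [cols[perm[j]] for j in range(i)]
        r1, q = project_out(a, basis, i)   # quasi-Gram-Schmidt step
        r2, q = project_out(q, basis, i)   # reorthogonalization
        for j in range(i):                 # update column i of R
            R[j][i] = r1[j] + r2[j]
        R[i][i] = math.sqrt(dot(q, q))
        if R[i][i] == 0.0:
            break  # pivot column lies in the span of earlier ones
        # Column i of Q, recomputed from a and the final column of R.
        c = solve_upper(R, [R[j][i] for j in range(i)], i)
        q = [(a[t] - sum(cj * col[t] for cj, col in zip(c, basis))) / R[i][i]
             for t in range(m)]
        steps = i + 1
        # Row i of R12 is used only to downdate the squared column norms.
        for j in range(i + 1, n):
            rj = dot(q, cols[perm[j]])
            colsq[j] = max(colsq[j] - rj * rj, 0.0)
        nrm_r22 = math.sqrt(sum(colsq[i + 1:]))
        if nrm_r22 < tol:
            break
    return [row[:steps] for row in R[:steps]], perm, nrm_r22
```

With an empty basis (the first pass) both triangular solves are no-ops, so the special case for the first iteration falls out of the general code path. A "peptide"-style factor can then be recovered on demand as the selected columns times the inverse of the returned R11, again via triangular solves rather than an explicit inverse.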
- C++ implementation for efficiency and scalability
- Sparse matrix A in Harwell-Boeing format
- Sparse matrix-vector products
  - A[:, P[start:end]] * x
  - x^T * A[:, P[start:end]]
- LAPACK DTRTRS routine used to solve triangular systems

Whole Genome Phylogeny
1. Compile protein sequences for whole genomes; construct peptide-by-protein (sparse) frequency matrix A[m, n] (AACODE3)
   - Peptides: m = number of all possible overlapping x-gram peptides
     - 4-gram peptides: 160,000 overlapping tetrapeptides
     - 5-gram peptides: 3,200,000 overlapping pentapeptides
   - Proteins: n = number of proteins
   - Rows are indexed by peptides 1, …, m and columns by proteins 1, …, n:

$$A = \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ f_{21} & f_{22} & \cdots & f_{2n} \\ f_{31} & f_{32} & \cdots & f_{3n} \\ \vdots & \vdots & & \vdots \\ f_{m1} & f_{m2} & \cdots & f_{mn} \end{pmatrix}$$

Whole Genome Phylogeny (cont.)
2. Apply SPQR algorithm to $A^T$ to get Q matrix
   - Store Q matrix: each column appended to file at each iteration
   - For k iterations, storage = n*k*sizeof(double)
3. Construct species vectors by summing corresponding protein vectors (COSDIST)
4. Build evolutionary pairwise distance matrix by calculating cosine values for each pair of species vectors (COSDIST)
5. Generate phylogenetic trees (NEIGHBOR)

Whole Genome Phylogeny (cont.)
6. Construct consensus tree using subspaces of different dimensions (CONSENSE)
7. Analyze effect of dimensions by comparing trees and computing distances between them (TREEDIST)
8. Visualize and edit trees (TREE EXPLORER)

Experimental Results
Land Plants
- Chloroplast genome
- 17 species, 1,675 proteins
- Input matrix log-entropy transformed

  n-grams   Peptides   Proteins   Factors (SPQR)   Factors (SVD)
  4         160,000    1,675      620              534

[Figure: Symmetric Distance (0 to 30) vs. Number of Factors (10 to 610); 20-period moving averages of adjacent tree distances, distance from the SPQR 10-620 consensus tree, and distance from the SVD 10-534 consensus tree]

Experimental Results (cont.)
[Figure: SPQR-based consensus tree (branch labels of the form x/611) and SVD-based consensus tree (branch labels of the form x/525) for the 17 plant species Abel, Ntab, Sole, Atha, Lcor, Atri, Cfer, Oela, Osat, Zmay, Taes, Pthu, Pnud, Acap, Afor, Mpol, Cglo]

Experimental Results (cont.)
Vertebrates
- Mitochondrial genomes
- 64 species, 13 proteins each
- Input matrix log-entropy transformed

  n-grams   Peptides    Proteins   Factors (SPQR)   Factors (SVD)
  5         3,200,000   832        400              121

[Figure: Symmetric Distance (0 to 120) vs. Number of Factors (10 to 310); 20-period moving averages of distance from the SVD 10-121 consensus tree, distance from the SPQR 10-400 consensus tree, and adjacent tree distances]

Experimental Results (cont.)
[Figure: SPQR-based and SVD-based consensus trees for the 64 vertebrate species, with numeric branch labels]
Group labels: Perissodactyls, Carnivores, Cetartiodactyls, Edentata, Rodents, Primates, Non-eutherians, Birds & Reptiles, Bony Fish, Cartilaginous Fish
Leaf order, SPQR-based tree: Easi Ecab Csim Runi Hgry Pvit Cfam Fcat Oari Btau Sscr Bphy Bmus Hamp Ajam Teur Dnov Oafe Svul Mgli Cpor Ocun Mmus Rnor Eeur Ppya Hsap Lafr Dvir Mrob Oana Amis Dsem Cboy Ccic Fper Scam Rame Ssha Aame Ggal Vcha Cfru Cmyd Cpic Psub Eegr Gmor Sfon Salp Ssal Omyk Poli Ccar Caur Drer Clac Porn Lcha Pdol Scan Mman Saca Rrad
Leaf order, SVD-based tree: Easi Ecab Csim Runi Hgry Pvit Cfam Fcat Teur Oari Btau Sscr Bphy Bmus Hamp Ajam Dnov Oafe Ocun Svul Mgli Cpor Mmus Rnor Lafr Ppya Hsap Eeur Dvir Mrob Oana Dsem Cboy Ccic Scam Rame Aame Ggal Fper Vcha Cfru Ssha Amis Cmyd Cpic Psub Eegr Pdol Porn Sfon Salp Omyk Ssal Poli Gmor Ccar Caur Clac Drer Lcha Scan Mman Saca Rrad

Experimental Results (cont.)
Eukaryotes
- 9 whole genomes, 175,559 proteins
- Input matrix log-entropy transformed, and columns normalized

  n-grams   Peptides   Proteins   Factors (SPQR)   Factors (SVD)
  4         160,000    175,559    1400             437

[Figure: Symmetric Distance (0 to 5) vs. Number of Factors (10 to 1310); 20-period moving averages of distance from the SVD 10-437 and SPQR 1198-1400 consensus trees, and adjacent tree distances]

Experimental Results (cont.)
[Figure: SPQR-based consensus tree (branch labels of the form x/203) and SVD-based consensus tree (branch labels of the form x/428) for the 9 eukaryote genomes Mmus, Rnov, Hsap, Frub, Agam, Dmel, Cele, Scer, Pfal]

Experimental Results (cont.)
Performance Analysis
[Figure: bar charts "Plant Dataset Times" (seconds, axis 0 to 800) and "Vertebrate Dataset Times" (seconds, axis 0 to 2700); bars for SPQR(A^T), SPQR(A), and SVD, with COSDIST time shown alongside; data labels in the extracted text: 742.66, 725.54, 132.62, 47.76, 20.7, 857.33, 2443, 2447]

Experimental Results (cont.)
[Figure: bar chart "Eukaryote Dataset Times" (hours, axis 0 to 14); bars for SPQR(A^T), SPQR(A), and SVD, with COSDIST time shown alongside; data labels in the extracted text: 12.52, 12.35, 5.05, 3.94]
Observations:
- As the number of dimensions increases, COSDIST time increases
- If m (rows) > n (cols) in the original matrix, the protein matrix is calculated faster than the peptide matrix
- If n > m, the peptide matrix is computed faster than the protein matrix

Experimental Results (cont.)
Memory Usage Analysis
[Figure: memory usage in megabytes (axis 0 to 35) for the Plants, Vertebrates, and Eukaryotes datasets; bars for SPQR(A^T), SPQR(A), and SVD]

Conclusions
Advantages of SPQR-based approach for whole genome phylogeny analysis
- Fast
- Memory efficient
- Storage can be conserved from Q factor, if needed
- Scalable alternative to SVD for comparing whole genomes in a phylogenetic context
Disadvantages:
- Need both A and $A^T$ if both motif analysis and phylogenetic trees are desired

Future Work
For experiments conducted
- Motif analysis can be performed if needed
- Better consensus trees may be obtained by constructing gene trees
Algorithm
- Transposing the matrix in Harwell-Boeing format: examine tradeoffs between storage and additional computation
- Implement the algorithm in parallel (compute peptide and protein factor matrices simultaneously)

Questions