Bootstrap Support for Mary-Lee's Clusters
efg, 5 April 2006

1 Step-by-Step Analysis Procedure
  1.1 Starting Point
  1.2 R Script: SetupConsenseData
  1.3 PHYLIP neighbor
  1.4 PHYLIP consense
2 Consense Output – Baseline Case
3 Unrooted Trees in TreeView
  3.1 Baseline
  3.2 Delete-1 Jackknife
  3.3 Delete-2 Jackknife
  3.4 Delete-4 Bootstrap
  3.5 Delete-6 Bootstrap
4 Appendix A. SetupConsenseData.R

U:\efg\Research\Olivier\Mary-Lee\17PointSeries\NeighborConsense.doc, 9 October 2006

1 Step-by-Step Analysis Procedure

The following analysis uses the PHYLIP (PHYLogeny Inference Package) programs "neighbor" and "consense" from http://evolution.genetics.washington.edu/phylip/general.html, which are installed in the Stowers Linux environment, to analyze how much bootstrap support there is for the clusters in Mary-Lee's dataset of 90 selected genes with 17-point time series.

1.1 Starting Point

In directory U:\efg\Research\Olivier\Mary-Lee\17PointSeries, the Excel file Book2.xls contains the 90 selected genes and their 17-point time series.

1.2 R Script: SetupConsenseData

The script SetupConsenseData (see Appendix A) was used to create several datasets, each containing multiple distance matrices, for use with the PHYLIP neighbor program. Correlation matrices were computed after deleting "d" points from the time series. Each correlation matrix was converted to a distance matrix with this formula:

    Distance <- (1 - CorrelationMatrix)/2

This maps a correlation of +1 to a distance of 0 and a correlation of -1 to a distance of 1.

See Efron and Tibshirani, p. 149, for information about the "delete-d jackknife."

Run this R script under Windows:

    File | Change Dir … | U:\efg\Research\Olivier\Mary-Lee\17PointSeries
    source("SetupConsenseData.R")

[NOTE: This script is extremely slow when run directly from the "U" drive. Copying the script to the C:\ drive and running it there may give perhaps a 100X speed improvement. Unfortunately, this script does not work under Linux because it uses the RODBC package to read the unmodified Excel file.]

In R, issue the following commands:

• CreateBaseline()
• CreateDelete1Jackknife()
• CreateDelete2Jackknife()
• CreateDeleteNBootstrap(1000, 4)
• CreateDeleteNBootstrap(1000, 6)
• CreateDeleteNBootstrap(1000, 8)

The "Delete1" procedure creates the 17 "delete 1" jackknife samples.
Likewise, the "Delete2" procedure creates the 17*16=272 "delete 2" jackknife samples. For larger deletions, 1000 bootstrap samples (random choices of which points to delete) were used to approximate the exact jackknife deletions.

The functions above created the following files: infile-baseline, infile-1, infile-2, infile-4, infile-6, infile-8. These files were moved to the corresponding directories (Baseline, Delete1-17, Delete2-272, Delete4-1000-Boot, Delete6-1000-Boot, and Delete8-1000-Boot) and renamed "infile". This kept the cases separate, so no additional renaming was needed to keep all the results from neighbor and consense.

Note: Each matrix in infile is a "distance" matrix. The main diagonal should be 0s.

1.3 PHYLIP neighbor

Run the PHYLIP program neighbor under Linux after changing to each of the directories created above (Baseline, Delete1-17, Delete2-272, Delete4-1000-Boot, Delete6-1000-Boot, and Delete8-1000-Boot). This PHYLIP program reads the distance matrices and creates a file with the corresponding unrooted trees, which will be processed by the PHYLIP consense program.

For the baseline case, accept the neighbor defaults. For the other cases, select the "M" option ("Analyze multiple data sets") and specify the following values:

    Directory           # Data sets   Random Number Used
    Delete1-17                   17                   19
    Delete2-272                 272                   29
    Delete4-1000-Boot          1000                   71
    Delete6-1000-Boot          1000                  255
    Delete8-1000-Boot          1000                19937

    neighbor

    Neighbor-Joining/UPGMA method version 3.6a3

    Settings for this run:
      N       Neighbor-joining or UPGMA tree?  Neighbor-joining
      O                        Outgroup root?  No, use as outgroup species  1
      L         Lower-triangular data matrix?  No
      R         Upper-triangular data matrix?  No
      S                        Subreplicates?  No
      J     Randomize input order of species?  No. Use input order
      M           Analyze multiple data sets?  No
      0   Terminal type (IBM PC, ANSI, none)?  (none)
      1    Print out the data at start of run  No
      2  Print indications of progress of run  Yes
      3                        Print out tree  Yes
      4       Write out trees onto tree file?  Yes

      Y to accept these or type the letter for one to change
    M
    How many data sets?
    17
    Random number seed (must be odd)?
    19

      Y to accept these or type the letter for one to change
    Y
    . . .
    Cycle   2: node 1 (   0.00636) joins node 71 (   0.01994)
    Cycle   1: node 1 (   0.00579) joins node 60 (   0.04484)
    last cycle:
     node 1 (   0.00135) joins node 8 (   0.01066) joins node 15 (   0.01601)

    Output written on file "outfile"
    Tree written on file "outtree"

    Done.

Rename outtree to be intree:

    mv outtree intree

The intree files for the cases with 1000 distance matrices are fairly large: 31,028 lines.

1.4 PHYLIP consense

Run the Linux program consense. This PHYLIP program reads the trees created by the PHYLIP neighbor program and computes a consensus tree by the majority-rule consensus tree method (the extended variant, "MRe", in the settings below).

[Select "R" to replace files unless old output files are renamed or deleted.]

    consense

    Consensus tree program, version 3.6a3

    Settings for this run:
      C   Consensus type (MRe, strict, MR, Ml):  Majority rule (extended)
      O                          Outgroup root:  No, use as outgroup species  1
      R          Trees to be treated as Rooted:  No
      T     Terminal type (IBM PC, ANSI, none):  (none)
      1          Print out the sets of species:  Yes
      2   Print indications of progress of run:  Yes
      3                         Print out tree:  Yes
      4         Write out trees onto tree file:  Yes

    Are these settings correct? (type Y or the letter for one to change)
    Y

    Consensus tree written to file "outtree"
    Output written to file "outfile"

    Done.
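
The majority-rule idea behind consense can be illustrated with a toy R sketch. This is not PHYLIP's implementation: each "tree" is reduced here to a hand-written list of gene groups, the groupings are invented for illustration, and the helper name group_key is hypothetical. The sketch applies plain majority rule (keep a group if it occurs in more than half of the trees); the extended rule (MRe) selected above additionally adds less frequent groups that are compatible with the ones already kept.

```r
# Toy illustration only: three "trees", each described by the gene groups it contains.
# The groupings are made up; they are not results from this analysis.
trees <- list(
  list(c("Axin2", "Dkk1"), c("Hes1", "Hes5")),
  list(c("Dkk1", "Axin2"), c("Hes1", "Hes5")),
  list(c("Axin2", "Dkk1"), c("Hes1", "Myc"))
)

# Canonical label for a group, so that member order does not matter.
group_key <- function(g) paste(sort(g), collapse = "+")

# One key per group per tree, then count in how many trees each group occurs.
keys <- unlist(lapply(trees, function(tr) unique(sapply(tr, group_key))))
tab <- table(keys)
support <- setNames(as.integer(tab), names(tab))

# Plain majority rule: keep groups present in more than half of the trees.
majority <- support[support > length(trees) / 2]

# Support expressed as a fraction of the trees, comparable to a percentage threshold.
print(round(majority / length(trees), 2))
```

The counts in support play the role of the numbers consense prints on the branches: the number of input trees in which that grouping occurred.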

Look at the bottom of outfile for the consensus tree.

2 Consense Output – Baseline Case

Created by processing the original distance matrix through the neighbor and consense PHYLIP programs.

Annotated clusters: Wnt Cluster (Axin, Dkk1, …) and Notch Cluster (Hes1, Hes5, …).

NOTE: A better way to look at these unrooted trees is with the TreeView program, shown in the next section.

    Extended majority rule consensus tree

    CONSENSUS TREE:
    the numbers on the branches indicate the number
    of times the partition of the species into the two sets
    which are separated by that branch occurred
    among the trees, out of 1.00 trees
    (trees had fractional weights)

[ASCII consensus tree from outfile; branch layout not reproduced. Every labeled branch shows 1.0. The 90 leaf labels, in the order listed: Mxra8, Zfp191, Bcl9l, Dnpep, C80012, Kctd11, Ugp2, Rnf103, Dsp-B, Dsp-A, Arfl4, Nol5a, Mrps15, Mta3, Gpr175, Nkd1, Hey1, Nrarp, Rabep1, Chd4, Id1, Hira, Efna1, Lfng, Oact2, Mtm1, Ttc1, Ptpn11, Spry2, Nagk, Hes5, Egr1, Bcl2l11, Klf10, Hes1, Csnk2a2, Trub2, Gm428, 1427572 at, Nudt13, Star, 2810437L13, Seh1l, Otud5, 1418669 at, Gfra2, Sh2bp1, Ninj1, Tlr5, Cyp3a11, Spint2, Ubc-B, Ubc-A, Cflar, Sp5, Kcnmb2, Trim2, Fscn1, Fgf1, Zfp96, Poldip3, Wdr40a, H2-K1, Snrpd3, Wee1, Mtl5, Plxdc2, Cyp2c50, Tnfrsf9, 2900083I11, 1810008K03, 6330407G11, A830059I20, 2610042O14, 1200011M11, Rpl26-B, Rpl26-A, Tmsb10, Dnmt3a, Mdh1, Hbb-bh1, Phlda1, Has2, Rbm22, Tnfrsf19-B, Dact1, Axin2, Myc, Dkk1, Tnfrsf19-A]

    remember: this is an unrooted tree!

3 Unrooted Trees in TreeView

The TreeView program (from http://taxonomy.zoology.gla.ac.uk/rod/treeview.html) reads many different tree formats, including the format created by PHYLIP, and provides a better way to view "unrooted" trees.

The tree files, outtree, are in the various subdirectories under U:\efg\Research\Olivier\Mary-Lee\17PointSeries: Baseline, Delete1-17, Delete2-272, Delete4-1000-Boot, Delete6-1000-Boot, and Delete8-1000-Boot.

Instructions to view with TreeView:

1. Start TreeView
2. File | Open | Files of type: All files
3. Open
   Tree | Unrooted Tree | Show Internal Edge Labels
   Tree | Internal Label Font … | 10-point Arial

3.1 Baseline

3.2 Delete-1 Jackknife

17 "delete 1" jackknife samples

We want to look for clades of genes that consistently group together, say 66% of the time or more (~11 out of 17).

(Figure annotation: WNT Group)

3.3 Delete-2 Jackknife

17*16=272 "delete 2" jackknife samples

We want to look for clades of genes that consistently group together, say 66% of the time or more (~181 out of 272).

3.4 Delete-4 Bootstrap

1000 bootstrap samples of "delete 4" jackknife

We want to look for clades of genes that consistently group together, say 66% (less?) of the time or more (~666 out of 1000).

3.5 Delete-6 Bootstrap

1000 bootstrap samples of "delete 6" jackknife

4 Appendix A. SetupConsenseData.R

From U:\efg\Research\Olivier\Mary-Lee\17PointSeries

    # efg, 4 May 2006. Stowers Institute.

    library(RODBC)

    filename <- "U:/efg/Research/Olivier/Mary-Lee/17PointSeries/Book2.xls"
    connection <- odbcConnectExcel(filename)
    sqlTables(connection)
    worksheet <- sqlFetch(connection, "Sheet1", as.is=TRUE)
    close(connection)

    # Keep the two ID columns and the 17 time-series columns
    worksheet <- worksheet[,c(1,4,22:38)]

    # Get rid of blanks in Column Names
    colnames(worksheet) <- gsub(" ", "", colnames(worksheet))

    # Change "---" GeneSymbols to be Affy ProbeSetIDs
    worksheet$GeneSymbol[ worksheet$GeneSymbol == "---" ] <-
      worksheet$ProbeSetID[ worksheet$GeneSymbol == "---" ]

    # Get rid of overloaded Affy data in GeneSymbol field
    worksheet$GeneSymbol <- unlist( lapply( strsplit(worksheet$GeneSymbol, "///"), "[", 1 ) )

    # Gene symbols that occur more than once (informational; not used below)
    DuplicateGeneIDs <- worksheet$GeneSymbol[ duplicated(worksheet$GeneSymbol) ]

    # Add "-A" and "-B" to GeneSymbol duplicates
    # (assumes each duplicated symbol occurs exactly twice)
    IDtable <- table(worksheet$GeneSymbol)
    Duplicates <- names(which(IDtable > 1))
    for (i in seq_along(Duplicates))
    {
      worksheet$GeneSymbol[ worksheet$GeneSymbol == Duplicates[i]] <-
        paste(worksheet$GeneSymbol[ worksheet$GeneSymbol == Duplicates[i]],
              c("-A", "-B"), sep="")
    }

    rownames(worksheet) <- worksheet$GeneSymbol
    d <- data.matrix(worksheet[,3:ncol(worksheet)])

    # Base correlation matrix
    BaseCorrelationMatrix <- cor(t(d))

    # First five rows and columns:
    #                  Dkk1 Tnfrsf19-A       Hes1      Axin2     Dnmt3a
    # Dkk1        1.0000000  0.9254702 -0.9012325  0.8679224  0.3520647
    # Tnfrsf19-A  0.9254702  1.0000000 -0.9130869  0.7636788  0.3392926
    # Hes1       -0.9012325 -0.9130869  1.0000000 -0.7181422 -0.5555910
    # Axin2       0.8679224  0.7636788 -0.7181422  1.0000000  0.2403288
    # Dnmt3a      0.3520647  0.3392926 -0.5555910  0.2403288  1.0000000

    cor(d[1,], d[1,])   #[1] 1
    cor(d[1,], d[2,])   #[1] 0.9254702
    cor(d[1,], d[3,])   #[1] -0.9012325
    cor(d[1,], d[4,])   #[1] 0.8679224
    cor(d[1,], d[5,])   #[1] 0.3520647

    # Create Distance Matrix from Correlation Matrix
    # See R-Help, ?dist, second example
    Distance <- (1 - BaseCorrelationMatrix)/2

    # Distance[1:5,1:5]
    #                  Dkk1 Tnfrsf19-A      Hes1      Axin2    Dnmt3a
    # Dkk1       0.00000000 0.03726489 0.9506163 0.06603878 0.3239676
    # Tnfrsf19-A 0.03726489 0.00000000 0.9565434 0.11816062 0.3303537
    # Hes1       0.95061627 0.95654343 0.0000000 0.85907112 0.7777955
    # Axin2      0.06603878 0.11816062 0.8590711 0.00000000 0.3798356
    # Dnmt3a     0.32396763 0.33035369 0.7777955 0.37983561 0.0000000

    heatmap(Distance, scale="none", Colv=NULL, Rowv=NULL)
    heatmap(Distance)

    # Write one distance matrix in PHYLIP format to the open connection OutFile,
    # after deleting the time points (columns of d) listed in DeleteColumns
    WriteDistanceMatrix <- function(OutFile, DeleteColumns)
    {
      print(DeleteColumns)
      flush.console()   # Let Windows display catch up

      if ( length(DeleteColumns) == 0 )
      {
        DeleteD <- d
      } else {
        DeleteD <- d[,-DeleteColumns]
      }

      CorrelationMatrix <- cor(t(DeleteD))
      Distance <- (1 - CorrelationMatrix)/2

      # First line: number of genes; then one row per gene (name, distances)
      cat(sprintf("%5d", nrow(Distance)), "\n", file=OutFile)
      for (k in 1:nrow(Distance))
      {
        cat( sprintf("%-16s", rownames(Distance)[k]), file=OutFile)
        cat( sprintf("%10.6f", Distance[k,]), file=OutFile)
        cat("\n", file=OutFile)
      }
    }

    # baseline sample
    CreateBaseline <- function()
    {
      OutFile <- file("infile-baseline", "w")
      WriteDistanceMatrix(OutFile, NULL)
      close(OutFile)
    }

    # 17 Jackknife Delete-1 samples
    CreateDelete1Jackknife <- function(seed=19)
    {
      OutFile <- file("infile-1", "w")
      set.seed(seed)
      for (i in 1:ncol(d))
      {
        DeleteColumns <- i
        WriteDistanceMatrix(OutFile, DeleteColumns)
      }
      close(OutFile)
    }

    # 17*16 Jackknife Delete-2 samples
    CreateDelete2Jackknife <- function(seed=19)
    {
      OutFile <- file("infile-2", "w")
      set.seed(seed)
      for (i in 1:ncol(d))
      {
        DeleteColumns <- i
        for (j in 1:ncol(d))
        {
          if (j != i)
          {
            WriteDistanceMatrix(OutFile, c(DeleteColumns, j) )
          }
        }
      }
      close(OutFile)
    }

    # N Bootstrap samples of DeleteCount Jackknife
    CreateDeleteNBootstrap <- function(BootCount, DeleteCount, seed=19)
    {
      # Bootstrap with "Delete-d jackknife" (See Efron & Tibshirani, p. 149)
      OutFile <- file(paste("infile-", DeleteCount, sep=""), "w")
      set.seed(seed)
      for (i in 1:BootCount)
      {
        # Random choice of DeleteCount distinct time points to delete
        DeleteColumns <- sample(1:ncol(d))[1:DeleteCount]
        WriteDistanceMatrix(OutFile, DeleteColumns)
      }
      close(OutFile)
    }
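
The sample counts used by the functions above can be checked with a few lines of R. This is a minimal sketch, assuming 17 time points as in this analysis; the variable names are illustrative and not part of the script. It suggests that the nested loop in CreateDelete2Jackknife visits each pair of time points twice (as i,j and as j,i), so the 272 matrices should consist of 136 distinct deletions written twice each, which should leave the consensus percentages unchanged. It also shows how many distinct delete-4, delete-6, and delete-8 subsets exist, which is one way to see why 1000 random deletions were used instead of full enumeration.

```r
n_points <- 17   # length of each time series in this analysis

# Delete-1: one sample per deleted time point
n_delete1 <- n_points

# Delete-2 as enumerated by a nested loop with j != i: ordered pairs
ordered_pairs <- expand.grid(i = 1:n_points, j = 1:n_points)
ordered_pairs <- ordered_pairs[ordered_pairs$i != ordered_pairs$j, ]
n_delete2_ordered <- nrow(ordered_pairs)

# Deleting columns (i, j) or (j, i) leaves the same data,
# so the number of distinct deletions is the number of unordered pairs
n_delete2_unique <- ncol(combn(n_points, 2))

# Number of distinct delete-d subsets for the larger deletions
n_exact <- choose(n_points, c(4, 6, 8))
names(n_exact) <- c("delete4", "delete6", "delete8")

print(c(delete1 = n_delete1,
        delete2_ordered = n_delete2_ordered,
        delete2_unique = n_delete2_unique))
print(n_exact)
```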