Introduction to Multivariate Analysis Biology 4605/7220 Chih-Lin Wei Canadian Health Oceans Network Postdoc Fellow Ocean Science Centre, MUN My Background • Benthic ecologist: Community ecology How environments control macroecological patterns in the deep-sea Interested in R but “NOT a statistician”. • Education: BS in Zoology in Taiwan; MS & PhD in Biological Oceanography, Texas A&M University • Current project: Scale-up regional benthic diversity and standing stock pattern using ecological modeling approaches Lecture Contents • • • • • • • Visualization Resemblance index Cluster analysis Ordination Correlation Testing for difference Other stuff Clarke & Warwick (2001) Front Matter • Mostly non-parametric, permutation-based techniques • Start with graphical concept • Followed by examples in simple R codes • No more than 3 lines of code for each example • Most functions in base R or package “vegan” • All analyses are available on commercial software (PRIMER-E) [demo version] R packages # Install and load R Packages install.packages( c("vegan", "scatterplot3d", "reshape2", "lattice", "clustsig") ) library( vegan ) library( scatterplot3d ) library( reshape2 ) library( lattice ) library( clustsig) First thing first, plot the data # Violent Crime Rates by US State 200 50 plot( USArrests[,1:2] ) 100 150 Assault 250 300 USArrests 5 10 Murder 15 3D Scatter Plot 50 250 200 40 150 100 50 0 0 5 10 Murder 15 20 Assault 70 60 350 300 30 UrbanPop 80 90 100 scatterplot3d( USArrests[,1:3] ) Scatterplot Matrices 50 150 250 10 20 30 10 15 pairs( USArrests ) 40 250 5 Murder 70 90 50 150 Assault 30 40 30 50 UrbanPop 10 20 Rape 5 10 15 30 50 70 90 Lattice Graphs Rape # Melt dataframe to flat format m = melt( USArrests, id.vars = "Assault" ) m 80 60 40 20 value 0 Murder UrbanPop 80 60 # Multipanel scatter plot xyplot( value ~ Assault | variable, data = m ) 40 20 0 50 100 150 200 250 300 350 Assault Resemblance/distance Indices *Not good for data with lots of zero (e.g. species abundance) Clarke & Warwick (2001) Resemblance/distance Indices • D = 0, if species are identical in 2 samples • D = 1, if 2 samples have no species in common • Better for species abundance data (with lots of zero) Resemblance/distance Indices # Euclidean Distance: # Bray-Crutis Dissimilarity # Vegetation in lichen pastures dist( USArrests ) data( varespec ) varespec vegdist( varespec ) Hierarchical Clustering 4 2 3 1 0.0 0.2 0.3 0.4 0.5 0.6 0.7 Cluster Dendrogram Dissimilarity • Patterns in distance or dissimilarity matrix is difficult to detect. • Find natural grouping by successive fusing of samples Hierarchical Clustering Linkage Options: Sp 2 •Single linkage Group 1 Group 2 (neareast neighbour clustering) •Complete linkage Single Link (furthest neighbour clustering) Complete Link •Group-average linkage •Ward’s minimum variance Sp 1 5 10 15 Alaska Florida Delaware Hawaii Rhode Island Kentucky Missouri Utah Oregon Washington Massachusetts New Jersey Connecticut Vermont West Virginia Arkansas South Dakota Idaho North Dakota Minnesota Maine Wisconsin Iowa New Hampshire Wyoming Virginia Nebraska Oklahoma Montana Indiana Ohio Kansas Pennsylvania Nevada North Carolina Arizona Michigan Maryland New Mexico Illinois New York Tennessee Texas Georgia Mississippi South Alabama Carolina Louisiana California Colorado Vermont Maine North Dakota Idaho South Dakota Minnesota Wisconsin Iowa New Hampshire Hawaii Utah Oregon Washington New Jersey Oklahoma Indiana Ohio Delaware Rhode Island Connecticut Massachusetts Kentucky Wyoming Arkansas Virginia West Virginia Kansas Pennsylvania Montana NorthNebraska Carolina Georgia Alabama Louisiana Mississippi South Carolina Florida Michigan Maryland New Mexico Arizona Illinois New York Missouri Tennessee Texas California Colorado Alaska Nevada Dissimilarity 0.0 0.5 1.0 1.5 2.0 Dissimilarity 2.5 0.0 0.1 0.2 0.3 0.4 0.5 Single Linkage North Carolina Georgia Alabama Louisiana Mississippi South Carolina California Colorado Alaska Nevada Missouri Tennessee Texas Florida Illinois New York Arizona Michigan Maryland New Mexico West Virginia Idaho South Dakota Minnesota Wisconsin Iowa New Hampshire Vermont Maine North Dakota New Indiana Jersey Ohio Kentucky Arkansas Wyoming Oklahoma Virginia Hawaii Kansas Pennsylvania Montana Nebraska Utah Oregon Washington Delaware Rhode Island Connecticut Massachusetts 0 Height 1.2 # Euclidean Distance d = dist( arrest ) West Virginia NorthVermont Dakota Idaho South Dakota Minnesota Maine Wisconsin Iowa New Hampshire Hawaii Utah Oregon Washington New Jersey Kansas Pennsylvania Montana Nebraska Wyoming Virginia Oklahoma Indiana Ohio Arkansas Kentucky Delaware Rhode Island Connecticut Massachusetts Alaska Nevada California NorthColorado Carolina Georgia Mississippi South Carolina Alabama Louisiana Florida Arizona Illinois New York Michigan Maryland New Mexico Missouri Tennessee Texas 0.4 0.8 # Dendrograms plot( hclust( d, "single" ) ) plot( hclust( d, "complete" ) ) plot( hclust( d, "average" ) ) plot( hclust( d, "ward" ) ) Dissimilarity # Normalization arrest = scale( USArrests, center = FALSE ) 0.0 Hierarchical Clustering Complete Linkage Group-Average Linkage Ward's Minimum Variance North Carolina Georgia Alabama Louisiana Mississippi South Carolina California Colorado Alaska Nevada Missouri Tennessee Texas Florida Illinois New York Arizona Michigan Maryland New Mexico West Virginia Idaho South Dakota Minnesota Wisconsin Iowa New Hampshire Vermont Maine North Dakota New Jersey Indiana Ohio Kentucky Arkansas Wyoming Oklahoma Virginia Hawaii Kansas Pennsylvania Montana Nebraska Utah Oregon Washington Delaware Rhode Island Connecticut Massachusetts 10 15 # Cut into 3 groups rect.hclust( clus, k = 3 ) 5 Height North Carolina Georgia Alabama Louisiana Mississippi South Carolina California Colorado Alaska Nevada Missouri Tennessee Texas Florida Illinois New York Arizona Michigan Maryland New Mexico West Virginia Idaho South Dakota Minnesota Wisconsin Iowa New Hampshire Vermont Maine North Dakota New Jersey Indiana Ohio Kentucky Arkansas Wyoming Oklahoma Virginia Hawaii Kansas Pennsylvania Montana Nebraska Utah Oregon Washington Delaware Rhode Island Connecticut Massachusetts 0 5 Height 10 15 # Using Ward's mehtod clus = hclust( d, "ward" ) plot( clus ) 0 Determine Numbers of Clusters Cluster Dendrogram K=3 Cluster Dendrogram K=6 Determine Significant Clusters Clarke et al. (2008, JEMBE 366:56-69) West Virginia North Dakota Vermont Idaho South Dakota Minnesota Maine Wisconsin Iowa New Hampshire Hawaii Utah Oregon Washington New Jersey Kansas Pennsylvania Montana Nebraska Wyoming Virginia Oklahoma Indiana Ohio Arkansas Kentucky Delaware Rhode Island Connecticut Massachusetts Alaska Nevada California Colorado North Carolina Georgia Mississippi South Carolina Alabama Louisiana Florida Arizona Illinois New York Michigan Maryland New Mexico Missouri Tennessee Texas 0.0 0.2 clus2 = simprof( arrest ) simprof.plot( clus2 ) 0.6 0.8 1.0 1.2 # 999 permutation # Group-average clustering # alpha = 0.05 0.4 1.4 Similarity Profile Test * Colors = significant clusters Motivations for Ordination • Dendrogram is still difficult to understand • Clustering forced samples into groups despites the compositional changes may be continuous. • Ordination reduces dimensionality of multivariate data (data cloud so to speak) • Preferably, capture majority of the information as bivariate data frame, so the multivariate patterns can be shown on a scatter plot. Principal Component Analysis (PCA) 2 species example Clarke & Warwick (2001) Principal Component Analysis (PCA) • PC1 maximizes variance of points projected on it. 3 species example • PC2 is perpendicular to PC1 • PC3 is perpendicular to PC1 and PC2 • New orthogonal axes are linear combination of old data: PC1 = 0.62 Sp1 + 0.52 Sp2 + 0.58 Sp3 PC2 = -0.73 Sp1 + 0.65 Sp2 + 0.2 Sp3 PC3 = 0.28 Sp 1 + 0.55 Sp2 -0.79 Sp3 Clarke & Warwick (2001) Principal Component Analysis (PCA) 0.0 0.5 -0.4 0.0 0.2 0.4 # PCA pca = princomp( arrest ) 0.0 1.0 -0.5 0.5 -1.0 Comp.1 0.0 0.2 -0.5 0.0 0.2 0.4 -0.6 -0.2 Comp.3 Comp.4 -0.4 # New orthogonal axes pairs( pca$scores ) Comp.2 -1.0 0.0 1.0 -0.6 -0.2 0.2 Principal Component Analysis (PCA) 0.2 0.1 # Variance of PC axes plot( pca ) # Total variance explained summary( pca ) 0.0 Variances 0.3 0.4 # Variable contributions # PC1 = -0.65 Murder -0.6 Assault -0.46 Rape pca$loading Comp.1 Comp.2 Comp.3 Comp.4 Principal Component Analysis (PCA) Mississippi North Carolina 0.5 #Cut dentrogram for 6 cluster group = cutree( clus, 6 ) South Carolina Georgia Louisiana Alabama plot( pca$scores, type = "n" ) 0.0 Florida Arkansas Vermont Wyoming Maine South Dakota North Dakota Montana Virginia Texas New Hampshire Pennsylvania Delaware Maryland Rhode Island Wisconsin Iowa Kansas Connecticut Illinois Idaho Indiana Oklahoma Nebraska New York New Jersey Ohio New Mexico Minnesota Missouri Massachusetts Michigan Hawaii Arizona -0.5 Comp.2 Tennessee text( pca$scores, names( group ), col = group ) West Virginia Kentucky Utah Washington Oregon Alaska Nevada Colorado California -1.0 -0.5 0.0 Comp.1 0.5 1.0 Principal Component Analysis (PCA) -0.4 -0.2 0.0 0.2 0.4 0.6 0.6 1.0 -0.6 Murder 0.4 Mississippi 0.0 0.2 West Virginia Kentucky Tennessee Arkansas Vermont Wyoming Maine South Dakota North Dakota Montana Virginia Assault Texas New Hampshire Florida Pennsylvania Maryland Delaware Rhode Island Iowa Wisconsin Kansas Idaho Indiana Connecticut Oklahoma Nebraska NewIllinois York New Jersey Ohio New Mexico Missouri Minnesota Massachusetts Michigan Hawaii Arizona Washington Oregon Utah Alaska Nevada Colorado California UrbanPop -0.4 -0.2 0.0 -0.5 South Carolina Georgia Louisiana Alabama -0.6 -1.0 Comp.2 0.5 North Carolina Rape -1.0 -0.5 0.0 Comp.1 0.5 1.0 # Add variable contributions biplot( pca, scale = 0 ) Non-Metric Multidimensional Scaling (nMDS) • Ordination bases on ranked resemblance (or distance) matrix • Robust and flexible for all kind of resemblance indices • Using iterative procedure, successively refine the locations of ordination points according to the ranked dissimilarities of samples • Better choice for species abundance data (comparing to PCA) Multidimensional Scaling (nMDS) 2 1 0 Ordination Distance 3 Non-metric fit, R2 = 0.995 Linear fit, R2 = 0.98 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Observed Dissimilarity mds = metaMDS( arrest ) stressplot( mds ) Multidimensional Scaling (nMDS) -0.2 0.0 North Carolina Mississippi Murder 0.6 0.4 0.2 0.2 West Virginia -0.1 0.0 0.1 Vermont Kentucky South Carolina Georgia South Dakota Alabama Louisiana Arkansas Montana Tennessee Maine Assault Wyoming Delaware Florida North Dakota Idaho Maryland Texas Virginia Indiana Nebraska Kansas Pennsylvania Hampshire NewIowa New Mexico Oklahoma York NewIllinois Michigan Ohio Missouri Minnesota Alaska New Jersey Wisconsin NevadaArizonaRape Connecticut Massachusetts Oregon UrbanPop California Colorado Washington Utah -0.2 0.3 0.1 0.0 -0.1 -0.2 Vermont Kentucky South Carolina Georgia South Dakota Alabama Louisiana Arkansas Montana Tennessee Maine Wyoming Delaware Florida North Dakota Idaho Maryland Texas Virginia Indiana Nebraska Kansas Pennsylvania New Mexico Hampshire NewIowa Oklahoma Illinois York New Michigan Missouri Ohio Minnesota Alaska New Jersey Wisconsin Nevada Arizona Connecticut Massachusetts Oregon California Colorado Washington Utah 0.2 West Virginia MDS2 Hawaii -0.3 Hawaii Rhode Island Rhode Island -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 MDS1 # Ordination with 6 clusters plot( mds$points, type = "n" ) text( mds$points, names( group ), pch = group, col = group) -0.4 -0.2 0.0 0.2 -0.3 0.0 -0.1 -0.2 North Carolina Mississippi -0.3 MDS2 0.1 0.2 0.3 -0.4 0.4 0.6 0.8 MDS1 # Add variable score # Weighted average biplot( mds$points , mds$species ) Correlation between Matrices 0.8 # Vegetation and environment # in lichen pastures data( varespec ) data( varechem ) 0.6 0.4 # Euclidean distance env.dist = dist( scale( varechem ) ) 0.2 veg.dist # Bray-Crutis Dissimilarity veg.dist = vegdist( varespec ) 2 4 6 env.dist 8 Mantel Test Species Rank Sites Sites BC 1, 2, 3,…….... ρ Correlation Sites ED Rank Sites Sites Environ. Sites 1, 2, 3,…….... Mantel Test 100 r = 0.3 60 40 20 # Distribution of permuted r hist ( man$perm ) 0 Frequency 80 # Mantel test # Based on 999 permutations # Pearson's correlation man = mantel( veg.dist, env.dist ) man -0.3 -0.2 -0.1 0.0 0.1 0.2 Pearson Correlation (r) 0.3 0.4 Best Environmental Subsets Species Rank Sites Sites BC 1, 2, 3,…….... ρ Correlation Sites ED Rank Sites Sites Environ. Sites 1, 2, 3,…….... BIOENV bioenv( varespec, varechem ) 0.6 0.4 0.2 veg.dist 0.8 # 16383 possible subsets # Subset of environmental variables with best correlation to community data 1 2 3 4 env.dist (N + P + Al + Mn + Baresoil) 5 Testing Group Difference for Community Data data( dune ) #Vegetation in Dutch Dune Meadows dune # More species (variables) than samples # Dominance of zero values # Violates multivariate normality and constant variance across the groups # A robust, permuatation-based test is needed for community data. Analysis of Similarity (ANOSIM) BC Rank Sites Sites Species 1, 2, 3,…….... Sites • R = 1: Within group are more similar than between groups • R = 0: Between and within group are the same in average • R is an absolute measure of group seperation R rB rW n ( n 1) / 4 rB = Avg. rank between groups rW = Avg. rank within groups n = sample size Analysis of Similarity (ANOSIM) # MDS plot seems to suggest moisture effect plot( mds$points, pch = 21, bg = Moisture, cex = Moisture ) 1 0.0 2 3 4 -0.5 # Run a MDS on dune vegetation mds = metaMDS( dune ) Moisture MDS2 Moisture = as.numeric( dune.env$Moisture ) 0.5 # Does moisture has effect on vegetation? Vegetation in Dutch Dune Meadows 1.0 # Environment factors in Dutch Dune Meadows data( dune.env ) -0.5 0.0 MDS1 0.5 1.0 Analysis of Similarity (ANOSIM) aos = anosim( dune, Moisture ) R = 0.43 100 50 # Distribution of permuted R hist( aos$perm ) 0 Frequency 150 200 aos -0.2 0.0 0.2 ANOSIM R-statistics 0.4 Other Useful Functions Clustering: • pam() for clustering around medoids and clara() for clustering large data (both in “cluster”) • pvclust() in “pvclust” for assessing the uncertainty in hierarchical cluster analysis Ordination: • Great PCA video explanation on YOUTUBE • imputePCA() in “missMDA” for handling missing data • cca() and rda() in “vegan” for constrained type of ordinations Testing difference: • mrpp() in “vegan” for ANOSIM type analysis but using original dissimilarities instead of their ranks. • adonis() in “vegan” for robust and flexible multivariate permutational analysis of variance (e.g. factorial & nested design, mixed model, etc.) • betadisper() in “vegan” for testing constant multivariate variance (or dispersion)