Tutorial I
Female Mouse Liver Microarray Data
Network Construction and Module Analysis

Steve Horvath
Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm

Content of this tutorial
1) Gene Co-expression Network Construction
2) Module Definition Based on Average Linkage Hierarchical Clustering with the Dynamic Tree Cut Algorithm
3) Relating Modules to Physiological Traits (module significance analysis)
4) Comparing Weighted Network Results to Unweighted Network Results
5) Studying the Clustering Coefficient

Abstract
We use microarray data from an F2 mouse intercross to examine the large-scale organization of gene co-expression networks in female mouse liver and annotate several gene modules in terms of 20 physiological traits. Finally, we study the relationship between connectivity and measures of gene significance based on the physiological traits.

The data and biological implications are described in the following references:
Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006) Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics 2(8), August 2006.
Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) Weighted Gene Coexpression Network Analysis Strategies Applied to Mouse Weight. Mamm Genome 18(6):463-472.

We provide the statistical code used for generating the weighted gene co-expression network results. Thus, the reader should be able to reproduce all of our findings. This document also serves as a tutorial on weighted gene co-expression network analysis. Some familiarity with the R software is desirable, but the document is fairly self-contained.
This document and the data files can be found at the following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/MouseWeight/
More material on weighted network analysis can be found here:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Method Description
The network construction is conceptually straightforward: nodes represent genes, and nodes are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. Here we study networks that can be specified by an adjacency matrix A=[aij] that is symmetric with entries in [0,1]. By convention, the diagonal elements are assumed to be zero. For unweighted networks, the adjacency matrix contains binary information (connected=1, unconnected=0). In weighted networks, the adjacency matrix encodes pairwise connection strengths.

Microarray data
RNA preparation and array hybridizations were performed at Rosetta Informatics. The custom ink-jet microarrays used in this study (Agilent Technologies, previously described [2, 24]) contain 2186 control probes and 23,574 non-control oligonucleotides extracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-length clones. Mouse livers were homogenized and total RNA was extracted using Trizol reagent (Invitrogen, CA) according to the manufacturer's protocol. Three µg of total RNA was reverse transcribed and labeled with either Cy3 or Cy5 fluorochrome. Purified Cy3 or Cy5 complementary RNA was hybridized to at least two microarray slides with fluor reversal for 24 hours in a hybridization chamber, washed, and scanned using a laser confocal scanner. Arrays were quantified on the basis of spot intensity relative to background, adjusted for experimental variation between arrays using average intensity over multiple channels, and fit to an error model to determine significance (type I error).
Gene expression is reported as the ratio of the mean log10 intensity (mlratio) relative to the pool derived from 150 mice randomly selected from the F2 population.

Data Reduction: In order to minimize noise in the gene expression data set, several data filtering steps were taken. First, preliminary evidence showed major differences in gene expression levels between sexes amongst the F2 mice used (Yang, manuscript in preparation), and therefore only female mice were used for network construction. The construction and comparison of the male network will be reported elsewhere. Only those mice with complete phenotype, genotype and array data were used. This gave a final experimental sample of 135 female mice used for network construction.

Due to computational limitations, the following filtering steps were applied to the genome wide expression data. First, the 8000 most varying genes across all mice were identified (variation was measured on the mlratio, i.e. the log10 of the ratio of experimental mouse gene expression to the F2 pool). Next, amongst these 8000 genes, after preliminary network construction, the 3600 most connected genes were chosen for use in further steps (all excluded genes had a connectivity of 1 or less). These 3600 genes were then examined, and where appropriate, gene isoforms and genes containing duplicate probes were excluded by using only those with the highest expression among the redundant transcripts. This final filtering step yielded a count of 3421 genes for the experimental network construction.

The main reason for not using all 23000 genes is that genes with low variance across the mouse samples are likely to be less interesting in this analysis, which relates gene expression profiles to SNP markers and highly varying physiological traits. Noise genes may compromise module detection and thus our integrated model.
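The variance filter described above can be sketched in base R. This is a minimal illustration on simulated data, not the code used for the article; the object names (exprAll, nKeep, exprFiltered) and the matrix dimensions are assumptions made for the example.

```r
set.seed(1)
# Simulated expression matrix: rows are samples (mice), columns are genes.
nSamples <- 20
nGenes <- 500
exprAll <- matrix(rnorm(nSamples * nGenes), nrow = nSamples,
                  dimnames = list(NULL, paste0("gene", seq_len(nGenes))))
# Give the genes different spreads so that the ranking is not trivial.
exprAll <- sweep(exprAll, 2, runif(nGenes, 0.5, 2), "*")

# Keep the nKeep genes with the largest variance across samples
# (the tutorial keeps the 8000 most varying genes; 100 is used here).
nKeep <- 100
geneVar <- apply(exprAll, 2, var, na.rm = TRUE)
keepGenes <- order(geneVar, decreasing = TRUE)[seq_len(nKeep)]
exprFiltered <- exprAll[, keepGenes]
dim(exprFiltered)
```

The same pattern applies to the connectivity filter: replace the variance vector by a connectivity vector computed from a preliminary network.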
A computational reason for restricting the analysis to 8000 genes is that our R software code becomes extremely slow when dealing with matrices of dimension larger than 8000x8000. For module detection, we limited our analysis to the 3600 most connected genes since our module construction method and visualization tools cannot handle larger data sets at this point. By definition, module genes are highly connected with the genes of their module (i.e. module genes tend to have relatively high connectivity). Thus, for the purpose of module detection, restricting the analysis to the most connected genes does not lead to major information loss for the key points of our application. However, there may be applications where genes with relatively low connectivity are biologically interesting, so that gene filtering based on connectivity would lead to information loss. Finally, we eliminated multiple probes with similar expression patterns for the same gene since we are interested in studying gene networks as opposed to probeset networks. This resulted in 3421 genes in our final set, which we used for module detection.

Network Construction: We pioneered the use of a weighted coexpression network for mapping complex disease genes. In co-expression networks, network nodes correspond to genes and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene co-expression information and leads to weighted co-expression networks that are highly robust with respect to the construction method. The network construction algorithm is described in detail elsewhere (Zhang and Horvath 2005).
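The difference between an unweighted (hard thresholded) and a weighted (soft thresholded) network can be illustrated with a few lines of base R. The sketch below uses simulated data; the cut-off tau and the object names are assumptions chosen for illustration, while the power of 6 is the value used later in this tutorial.

```r
set.seed(2)
# Simulated data: 30 samples (rows) by 6 genes (columns).
x <- matrix(rnorm(30 * 6), nrow = 30)
x[, 2] <- x[, 1] + rnorm(30, sd = 0.3)   # gene 2 is co-expressed with gene 1

s <- abs(cor(x, use = "p"))   # co-expression similarity

# Hard thresholding: binary adjacency (tau is an arbitrary illustrative cut-off).
tau <- 0.5
adjHard <- (s >= tau) * 1
diag(adjHard) <- 0

# Soft thresholding: raise the similarity to a power.
beta <- 6
adjSoft <- s^beta
diag(adjSoft) <- 0

round(adjSoft[1:3, 1:3], 3)
adjHard[1:3, 1:3]
```

Hard thresholding discards how strongly two genes are correlated once they pass the cut-off, whereas the soft thresholded adjacency keeps that information as a continuous weight.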
Briefly, a gene co-expression similarity measure (the absolute value of Pearson's product moment correlation) was used to relate every pair of genes. An adjacency matrix was then constructed using a 'soft' power adjacency function

$$a_{ij} = \mathrm{Power}(s_{ij}, \beta) \equiv |s_{ij}|^{\beta}$$

where $s_{ij}$ is the co-expression similarity and $a_{ij}$ is the resulting adjacency, which measures the connection strength. The power $\beta$ is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such that the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of $\beta = 6$ based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power (Zhang and Horvath 2005).

Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure, or more precisely a measure of node dissimilarity based on the topological overlap matrix. Specifically, the topological overlap matrix is given by

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$

where $l_{ij} = \sum_u a_{iu} a_{uj}$ denotes the number of nodes to which both i and j are connected, $k_i = \sum_u a_{iu}$ is the connectivity of node i, and u indexes the nodes of the network. The topological overlap matrix (TOM) is given by $\Omega = [\omega_{ij}]$. $\omega_{ij}$ is a number between 0 and 1 and is symmetric (i.e., $\omega_{ij} = \omega_{ji}$). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.

Network Module Identification. Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix $\Omega = [\omega_{ij}]$ defined above.
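The topological overlap formula can be evaluated directly with matrix operations. The following base R sketch uses simulated data and is meant only to make the formula concrete; it is not the WGCNA implementation, and setting the diagonal of the overlap matrix to 1 is a convention assumed for the sketch.

```r
set.seed(3)
x <- matrix(rnorm(40 * 8), nrow = 40)   # 40 samples by 8 genes (simulated)
a <- abs(cor(x))^6                      # power adjacency with beta = 6
diag(a) <- 0                            # diagonal is zero by convention

k <- rowSums(a)                         # connectivity k_i = sum_u a_iu
l <- a %*% a                            # l_ij = sum_u a_iu * a_uj
kMin <- outer(k, k, pmin)               # min{k_i, k_j} for every pair
omega <- (l + a) / (kMin + 1 - a)       # topological overlap
diag(omega) <- 1                        # assumed convention for the diagonal

dissTOMsketch <- 1 - omega              # dissimilarity usable for clustering
round(omega[1:3, 1:3], 4)
```

Because the diagonal of the adjacency matrix is zero, the sum defining l_ij automatically skips the terms u = i and u = j.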
To use the topological overlap matrix in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one, i.e., the topological overlap based dissimilarity measure is defined by

$$d_{ij} = 1 - \omega_{ij}$$

Based on the dissimilarity matrix, we can use hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in the hierarchical clustering dendrogram (the "tree" method of cutreeDynamic, see Langfelder, Zhang, and Horvath 2008). The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination, and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal. A comparison of different module construction methods is beyond the scope of this paper.

Detection of hub genes: To identify hub genes for the network, one may either consider the whole network connectivity (denoted by kTotal) or the intramodular connectivity (kWithin). We find that intramodular connectivity is far more meaningful than whole network connectivity.

Module Significance Analysis: Relating Gene Modules to Physiologic Traits
In the BXH F2 cross, 20 physiological traits were measured for each animal. We used this information to explore the physiological relevance of each of the modules in the network. To do this, Pearson's product moment correlation (Pearson's correlation) was computed between each gene within a module and each of the physiological traits. This measure is termed the 'gene significance' of a particular gene with that trait. The geometric mean was then calculated for the absolute value of all gene significance scores within each module, yielding the 'module significance' (MS) of that particular module with the trait.
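Gene significance and module significance can be sketched as follows. The example uses simulated data with hypothetical module labels and a hypothetical trait. Note that the text above refers to a geometric mean, whereas the code later in this tutorial averages gene significance with the arithmetic mean; the sketch follows the code.

```r
set.seed(4)
nSamples <- 50
trait <- rnorm(nSamples)                       # hypothetical trait, e.g. weight
expr <- matrix(rnorm(nSamples * 10), nrow = nSamples)
expr[, 1:5] <- expr[, 1:5] + trait             # genes 1 to 5 track the trait
module <- rep(c("blue", "brown"), each = 5)    # hypothetical module labels

# Gene significance: absolute correlation of each gene with the trait.
GS <- abs(as.vector(cor(expr, trait, use = "p")))

# Module significance: mean gene significance within each module.
MS <- tapply(GS, module, mean)
MS
```

In this simulated example the "blue" module is constructed to contain the trait related genes, so its module significance is higher than that of the "brown" module.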
In order to explore the characteristics of our connectivity measure, we plotted the connectivity parameter versus the variance and the mean of gene expression for each gene within the most functionally significant (Blue and Brown) modules. We observed an inverse relationship between connectivity and gene expression variance. This is consistent with the idea that the network's most highly connected 'hubs' are resilient to genetic background variations since they are vital for core biological functions.

Statistical References
• Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology 4(1), Article 17.
The following theoretical reference explores the meaning of coexpression network analysis:
• Horvath S, Dong J (2008) Geometric Interpretation of Gene Co-Expression Network Analysis. PLoS Computational Biology 4(8): e1000117. PMID: 18704157
The WGCNA R package is described in:
• Langfelder P, Horvath S (2008) WGCNA: an R package for Weighted Correlation Network Analysis. BMC Bioinformatics 9(1):559. PMID: 19114008
For the generalized topological overlap matrix as applied to unweighted networks see:
• Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8:22
Module detection based on branch cutting is described in:
• Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5):719-20. PMID: 18024473

# Absolutely no warranty on the code. Please contact SH with suggestions.
# CONTENTS
# This document contains functions for carrying out the following tasks
# A) Assessing scale free topology and choosing the parameters of the adjacency function
#    using the scale free topology criterion (Zhang and Horvath 05)
# B) Computing the topological overlap matrix
# C) Defining gene modules using clustering procedures
# D) Summing up modules by their first principal component (first eigengene)
# E) Relating a measure of gene significance to the modules
# F) Carrying out a within module analysis (computing intramodular connectivity)
#    and relating intramodular connectivity to gene significance.
# G) Miscellaneous other functions, e.g. for computing the cluster coefficient.

# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example, to install Hmisc, open R,
# go to menu "Packages/Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.
# To get this tutorial and data files, go to the following webpage
# www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
# Unzip all the files into the same directory.
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
library(sma) # install it for the function plot.mat
library(impute) # install it for imputing missing value
library(scatterplot3d)
# Download the WGCNA library as a .zip file from
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
# and choose "Install package(s) from local zip file" in the packages tab
library(WGCNA)
options(stringsAsFactors=F)

# Please adapt the file paths
setwd("C:/Documents and Settings/Steve Horvath/My Documents/ADAG/LinSong/NetworkScreening/MouseFemaleLiver")
# read in the custom network functions.
source("NetworkFunctions.txt")

# The following 3421 probe sets were arrived at using the following steps
# 1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes
dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules.xls",header=T)
names(dat0)
# this contains information on the genes
datSummary=dat0[,c(1:8,144:150)]
# the following data frame contains
# the gene expression data: columns are genes, rows are arrays (samples)
datExpr = t(dat0[,9:143])
no.samples = dim(datExpr)[[1]]
dim(datExpr)

datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
# Now we order the mice so that trait file and expression file agree
restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]])
table(restrictMice)
datClinicalTraits=datClinicalTraits[restrictMice,]
orderMiceTraits=order(datClinicalTraits$MiceID)
orderMiceExpr=order(dimnames(datExpr)[[1]])
datClinicalTraits =datClinicalTraits[orderMiceTraits,]
datExpr =datExpr[orderMiceExpr,]
# from the following table, we verify that all 135 mice are in order
table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]])
rm(dat0);gc()

# SOFT THRESHOLDING For Weighted Network Construction
# To construct a weighted network (soft thresholding with the power adjacency matrix), we
# consider the following vector of potential thresholds.
# Now we investigate soft thresholding with the power adjacency function
powers1=c(seq(1,10,by=1),seq(12,18,by=2))
# To choose the power, we propose to use the Scale-free Topology Criterion (Zhang and
# Horvath 2005). Here the focus is on the linear regression model fitting index
# (denoted below by scale.law.R.2) that quantifies how well a network
# satisfies a scale-free topology.
# The function pickSoftThreshold can help one to choose the power
# when using soft thresholding with the power adjacency function.
# The first column lists the power beta
# The second column reports the resulting scale free topology fitting index R^2.
# The third column reports the slope of the fitting line.
# The fourth column reports the fitting index for the truncated exponential scale free model.
# Usually we ignore it.
# The remaining columns list the mean, median and maximum connectivity.
RpowerTable=pickSoftThreshold(datExpr, powerVector=powers1)[[2]]

Power scale.law.R.2  slope truncated.R.2 mean.k. median.k. max.k.
1           -0.0848  0.392         0.484  706.00   722.000 1130.0
2            0.0244 -0.624         0.857  240.00   238.000  532.0
3            0.2950 -1.160         0.980  105.00    96.300  302.0
4            0.4760 -1.570         0.975   53.10    44.500  188.0
5            0.6510 -1.940         0.943   30.10    23.600  125.0
6            0.8810 -1.610         0.982   18.50    13.500   86.3
7            0.9240 -1.720         0.941   12.20     8.200   74.4
8            0.9110 -1.760         0.900    8.47     5.120   67.5
9            0.8680 -1.740         0.859    6.17     3.310   62.4
10           0.8350 -1.690         0.846    4.67     2.240   58.2
12           0.8640 -1.530         0.919    2.94     1.070   51.5
14           0.8740 -1.410         0.948    2.03     0.527   46.2
16           0.9090 -1.320         0.977    1.51     0.278   41.7
18           0.8910 -1.250         0.961    1.17     0.150   37.9

cex1=0.7
par(mfrow=c(1,2))
plot(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2], xlab="Soft Threshold (power)", ylab="Scale Free Topology Model Fit,signed R^2", type="n")
text(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2], labels=powers1, cex=cex1, col="red")
# this line corresponds to using an R^2 cut-off of h
abline(h=0.8,col="red")
plot(RpowerTable[,1], RpowerTable[,5], xlab="Soft Threshold (power)", ylab="Mean Connectivity", type="n")
text(RpowerTable[,1], RpowerTable[,5], labels=powers1, cex=cex1, col="red")
# [Figure: left panel, Scale Free Topology Model Fit (signed R^2) versus Soft Threshold (power); right panel, Mean Connectivity versus Soft Threshold (power)]

To choose the power beta, we use the Scale-free Topology Criterion (Zhang and Horvath 2005). Here the focus is on the linear regression model fitting index (denoted as scale.law.R.2) that quantifies how well a network satisfies a scale-free topology. We choose the soft thresholding parameter beta=6 for the correlation matrix since this is where the R^2 curve seems to saturate. From the above table, we find that the resulting slope looks OK (negative, between -1 and -2). In the appendix, we investigate different choices of beta.

# Here the scale free topology criterion would lead us to pick a power of beta=6.
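The scale free topology fitting index (scale.law.R.2) comes from a linear regression of log10(p(k)) on log10(k). A simplified base R sketch of that idea is shown below. It uses simulated connectivities, and the number of bins and all object names are assumptions; it is not the WGCNA implementation and will not reproduce the table above.

```r
set.seed(5)
# Simulated connectivities (hypothetical, for illustration only).
kSim <- rexp(2000, rate = 0.1) + 1

# Discretize the connectivities into bins.
nBins <- 10
bins <- cut(kSim, breaks = nBins)
meanK <- as.vector(tapply(kSim, bins, mean))   # mean connectivity per bin
pK <- as.vector(table(bins)) / length(kSim)    # relative frequency per bin

# Drop empty bins, then regress log10(p(k)) on log10(k).
keep <- pK > 0
fit <- lm(log10(pK[keep]) ~ log10(meanK[keep]))
r2 <- summary(fit)$r.squared
slope <- coef(fit)[[2]]

# Signed fit index as plotted above: positive when the slope is negative.
signedR2 <- -sign(slope) * r2
c(R2 = r2, slope = slope, signedR2 = signedR2)
```

A power is considered suitable when this index is high and the slope is negative, which is what the signed index in the plot above encodes.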
# In an appendix, we study how the biological findings depend on the choice of the power.
beta1=6 # this is the power adjacency function parameter in power(s,beta)
# By playing around with beta, you will find that the
# findings are highly robust with respect to beta1, which is an attractive property.

# The following computes the network connectivity (Connectivity)
Connectivity= softConnectivity(datExpr,power=beta1)
# Creating Scale Free Topology Plots (SUPPLEMENTARY FIGURE S1 in our article)
par(mfrow=c(1,1))
scaleFreePlot(Connectivity, truncated=T,main= paste("beta=",as.character(beta1)))
# [Figure: log10(p(k)) versus log10(k); title: beta= 6 , scale free R^2= 0.88 , slope= -1.61 , trunc.R^2= 0.98]

The Figure shows that the connectivity distribution p(k) is better modeled using an exponentially truncated power law $p(k) \sim k^{-\gamma} \exp(-\alpha k)$. In practice, we find that the two parameters $\alpha$ and $\gamma$ provide too much flexibility in curve fitting. The truncated exponential model fitting index R^2 tends to be high irrespective of the adjacency function parameter. For this reason, we focus on the scale-free topology fitting index in our scale-free topology criterion. Exploring the use of the truncated exponential fitting index is beyond the scope of this article.

Module Detection
An important step in network analysis is module detection. To group genes with coherent expression profiles into modules, we use average linkage hierarchical clustering, which uses the topological overlap measure as dissimilarity. The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to (see [Ravasz et al 2002, Yip and Horvath 2006]).

# Now define the power adjacency matrix
ADJ = adjacency(datExpr,power=beta1)
gc()
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
# TIME: Takes several minutes
dissTOM=TOMdist(ADJ)
gc()

# Now we carry out hierarchical clustering with the TOM matrix. Branches of the
# resulting clustering tree will be used to define gene modules.
hierTOM = hclust(as.dist(dissTOM),method="average");
par(mfrow=c(1,1))
plot(hierTOM,labels=F)

# By our definition, modules correspond to branches of the tree.
# The question is what height cut-off should be used? This depends on the
# biology. Large height values lead to big modules, small values lead to small
# but very cohesive modules.

We used a dynamic cut-tree algorithm for selecting branches of the hierarchical clustering dendrogram (Langfelder, Zhang, and Horvath 2008). The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination, and the process is iterated until the number of clusters becomes stable.

# The following is the original code used in the paper by Ghazalpour et al
myheightcutoff =0.995
mydeepSplit = FALSE # fine structure within module
myminModuleSize = 20 # modules must have this minimum number of genes
# new way for identifying modules based on hierarchical clustering dendrogram
colorh1=cutreeDynamic(hierclust= hierTOM, deepSplit=mydeepSplit, maxTreeHeight=myheightcutoff, minModuleSize=myminModuleSize)
table(colorh1)

# Our code has slightly changed. If we could go back in time, we would use the following code
# for branch cutting. But please skip it....
colorhNEWstep1= dynamicTreeCut::cutreeDynamic(dendro=hierTOM, cutHeight=0.9965, minClusterSize=20, method="tree", deepSplit=F)
# to turn the branch labels (which are integers) into colors we use
colorhNEWstep2=labels2colors(colorhNEWstep1)
# merge modules whose eigengenes are close (input: the colors from step 2)
colorhNEWstep3=mergeCloseModules(datExpr, colors=colorhNEWstep2, cutHeight=0.15, MEs=NULL, impute=TRUE, useAbs=F)$colors
# Note that the resulting color is quite similar to the original one:
table(colorhNEWstep3,colorh1)

# This results in the following color assignment.
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colors=data.frame(colorh1))
title("Colored by UNMERGED dynamic modules")

# Note that the colors correspond to portions of the branches.
# To determine whether some colors should be merged, we a) represent each module by its
# module eigengene (defined as its first principal component) and b) cluster
# the principal components. If 2 module eigengenes (PCs) are highly correlated
# then the modules should be merged.
# A general rule may be that two modules are merged if the distance between
# the two is smaller than 0.1 (i.e., the correlation is bigger than 0.9)
datME = moduleEigengenes(as.matrix(datExpr), colorh1)[[1]]
dissMEs = 1-abs(cor(datME, use="p"))
dissMEs = ifelse(is.na(dissMEs), 0, dissMEs)
hierMEs = hclust(as.dist(dissMEs),method="a")
# display ME hierarchical dendrogram on screen
par(mfrow=c(1,1), mar=c(0, 3, 1, 1) + 0.1, cex=1)
plot(hierMEs, xlab="",ylab="",main="",sub="")
par(mfrow=c(1,1))
# [Figure: dendrogram of the module eigengenes (PCred, PCpink, PCturquoise, PCmidnightblue, PCpurple, PCyellow, PCblack, PCtan, PCgreen, PClightgreen, PCblue, PCmagenta, PClightcyan, PCgrey60, PCgreenyellow, PCcyan, PCsalmon, PCgrey, PCbrown, PClightyellow)]
# This tree suggests merging several colors

# To merge a minor cluster into a major cluster, we use
colorh1 = merge2Clusters(colorh1, mainclusterColor="lightcyan", minorclusterColor="grey60")
colorh1 = merge2Clusters(colorh1, mainclusterColor="blue", minorclusterColor="magenta")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="turquoise")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="pink")
colorh1 = merge2Clusters(colorh1, mainclusterColor="black", minorclusterColor="yellow")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="lightgreen")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="tan")

# After merging some colors we arrive at the following hierarchical plot
#### FIGURE 1A) in our manuscript
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colorh1, title1="Colored by female liver modules")

# We also propose to use classical multi-dimensional scaling plots
# for visualizing the network. Here we compute 4 scaling dimensions
# and use the first 3 for the three-dimensional plot.
# This also takes about 10 minutes...
cmd1=cmdscale(as.dist(dissTOM),4)
par(mfrow=c(2,3))
plot(cmd1[,c(1,2)], col= as.character(colorh1) )
plot(cmd1[,c(1,3)], col= as.character(colorh1) )
plot(cmd1[,c(1,4)], col= as.character(colorh1) )
plot(cmd1[,c(2,3)], col= as.character(colorh1) )
plot(cmd1[,c(2,4)], col= as.character(colorh1) )
plot(cmd1[,c(3,4)], col= as.character(colorh1) )

### FIGURE 1 B in our article
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorh1), main="MDS plot", xlab="Scaling Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3", cex.axis=1.5, angle=320)

TOM plot and MDS plots
To visualize the network, we used several plots. The topological overlap matrix plot represents the topological overlap matrix where rows and columns are sorted and colored according to the hierarchical clustering tree used in the module definition. A classical multi-dimensional scaling plot that uses the topological overlap matrix as input can also be used.

# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Warning: for large gene sets, say more than 2000 genes,
# this may take a while...
TOMplot(dissTOM , hierTOM, colorh1)

Definition of trait based gene significance
For a given physiological trait, we defined a measure of gene significance by forming the absolute value of the Spearman correlation between trait and gene expression values. For example, the body weight can be used to define a gene significance of the ith gene expression, GSweight(i) = |cor(x(i), weight)|, where x(i) is the gene expression profile of the ith gene. A histogram of the clinical traits shows that several clinical traits appear to have outliers.

# To protect against outliers, we replace the values of the physiological traits
# by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatClinicalTraits=apply(datClinicalTraits[,5:26],2,rank1)
# This function computes the correlation between a gene expression
# and a physiological trait
if(exists("GSfunction")) rm(GSfunction)
GSfunction=function(x) {cor(x,rankdatClinicalTraits,use="p")}
# the following data frame has as columns the gene significance variables
# for different clinical traits
GeneSignificance =t(apply(datExpr,2,GSfunction))
dimnames(GeneSignificance)[[2]]=paste("GS",dimnames(rankdatClinicalTraits)[[2]],sep="")
# Since we only care about absolute values of correlations between expression
# profiles and traits, we set
GeneSignificance=data.frame(abs(GeneSignificance))
names(GeneSignificance)
 [1] "GSWeightG"         "GSLengthCM"      "GSAbFat"       "GSOtherFat"      "GSTotalFat"
 [6] "GSX100xfat.weight" "GSTrigly"        "GSTotalChol"   "GSHDLChol"       "GSUC"
[11] "GSFFA"             "GSGlucose"       "GSLDL.VLDL"    "GSMCP.1.phys."   "GSInsulin.ug.l."
[16] "GSGlucose.Insulin" "GSLeptin.pg.ml." "GSAdiponectin" "GSAorticLesions" "GSAneurysm"
[21] "GSAorticCal.M"     "GSAorticCal.L"

# Here we define more conventional names to annotate Figure 2 and Supplementary Figure S2
namesGS=c("Weight","Length","AbFat","OtherFat","TotalFat","Index",
"Trigly","Chol","HDL","UC","FFA","Glucose","LDL+VLDL", "MCP1",
"Insulin","GlucoseInsulin", "Leptin", "Adiponectin", "AorticLesions", "Aneurysm",
"AorticCal.M", "AorticCal.L")

The mean gene significance for a particular module can be considered as a measure of module significance (MS) (see Materials and Methods for the statistical test), which means that MS provides a measure of the overall correlation between the trait and the module. Thus, a module with a high MS value for "body weight" is on average composed of genes highly correlated with body weight.

# Here we use the function verboseBoxplot to create boxplots
# that show whether modules are enriched with significant genes.
# It also reports a Kruskal Wallis P-value.
# The gene significance can be a binary variable or a quantitative variable.
# It also plots the 95% confidence interval of the mean
par(mfrow=c(7,3), mar= c(1, 4, 3, 1) +0.1)
for (i in c(1:21) ) {
verboseBoxplot(GeneSignificance[,i],colorh1,col=levels(factor(colorh1)),main=namesGS[i],xlab="module",ylab="GS")
abline(h=.3,col="red")
}
# [Figure: 21 panels of gene significance (GS) by module, with Kruskal Wallis p-values in the panel titles:
#  Weight p = 7.5e-288; Length p = 3.6e-56; AbFat p = 1.9e-211; OtherFat p = 2.7e-262; TotalFat p = 5.1e-266;
#  Index p = 1.1e-233; Trigly p = 3.5e-171; Chol p = 5.1e-213; HDL p = 4.2e-46; UC p = 1.2e-201;
#  FFA p = 1.7e-197; Glucose p = 3.6e-223; LDL+VLDL p = 1.8e-208; MCP1 p = 1.5e-211; Insulin p = 1.4e-265;
#  GlucoseInsulin p = 2.2e-262; Leptin p = 2.7e-194; Adiponectin p = 7.2e-70; AorticLesions p = 2.5e-160;
#  Aneurysm p = 1.5e-81; AorticCal.M p = 4.1e-93]

So, interesting combinations include a) GSweight in the blue module and b) Glucose.Insulin in the brown module.

Since our particular interest is in understanding body weight, we focus on the blue module. The following code creates a barplot for the body weight based gene significance measure.
verboseBarplot(GeneSignificance[,1],colorh1,col=levels(factor(colorh1)),main="module significance",xlab="module",ylab="mean gene significance")
# [Figure: barplot "module significance p= 7.5e-288"; x-axis: module (black, blue, brown, cyan, darkred, greenyellow, lightcyan, midnightblue, red, royalblue); y-axis: mean gene significance]
dim(GeneSignificance)

# Now we produce Figure 2 of the article.
whichmodule="blue"
# mean gene significance=module significance
meanGS=apply(abs(GeneSignificance)[colorh1==whichmodule,],2,mean)
# corresponding standard error
stderrGS= apply(abs(GeneSignificance)[colorh1==whichmodule,],2,stderr1)
# The following code produces a barplot with rotated axis labels
## Increase bottom margin to make room for rotated labels
par(mar = c(7, 4, 4, 2) + 0.1)
## Create plot and get bar midpoints in 'mp'
mp = barplot(as.vector(meanGS),col=whichmodule, ylab="Module Significance",cex.lab=1.5)
## Set up x axis with tick marks alone
axis(1, at = mp, labels = FALSE)
## text labels
labels = namesGS
## Plot x axis labels at mp, you may want to change the offset -.005...
text(mp, par("usr")[3] - 0.005, srt = 45, adj = 1, labels = labels, xpd = TRUE,cex=1.3)
## Plot x axis label at line 6
mtext(1, text = "Physiological Traits", line = 6,cex=1.5)
# This creates the error bars
err.bp(meanGS , stderrGS, two.side=T)
# [Figure: barplot of Module Significance for the blue module across the Physiological Traits listed in namesGS]

# To get a sense of how related the modules are, one can summarize each module
# by its first eigengene (referred to as principal component).
# Next we cluster the eigengenes. This is very similar to the code used above
# for identifying modules that should be merged.
dME2=moduleEigengenes(datExpr,colorh1)[[1]]
hclustdME2=hclust(as.dist( 1-abs(t(cor(dME2, method="p")))), method="average" )
par(mfrow=c(1,1))
plot(hclustdME2, main="Clustering the Module Eigengenes")

[Figure: dendrogram "Clustering the Module Eigengenes" (average linkage; y-axis: Height). Leaves: PCgreenyellow, PCblue, PCgrey, PCred, PCmidnightblue, PCpurple, PCgreen, PClightyellow, PCblack, PCcyan, PClightcyan, PCsalmon, PCbrown.]

# Now we create scatter plots of the samples (arrays) along the module eigengenes.
dME2=dME2[,hclustdME2$order]
pairs( dME2, upper.panel = panel.smooth, lower.panel = panel.cor , diag.panel=panel.hist ,main="Relation between modules")

Comment: each dot represents a mouse. Above the diagonal are scatterplots. Below are the corresponding absolute values of the correlation.

# Now we study how connectivity is related to mean gene expression or variance of gene expression
#### This is Supplementary Figure S3 in our article
par(mfrow=c(2,2))
whichmodule="blue"
# mean expression of the blue module genes
meanExprModule=apply( datExpr[,colorh1==whichmodule],2,mean1)
# variance of expression
varExprModule=apply( datExpr[,colorh1==whichmodule],2,var1)
ConnectivityModule= SoftConnectivity(datExpr[,colorh1==whichmodule], power=beta1)
verboseScatterplot(ConnectivityModule,varExprModule,xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Variance",col=whichmodule)
verboseScatterplot(ConnectivityModule,meanExprModule,xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Mean Expression",col=whichmodule)
meanExpr=apply( datExpr,2,mean1)
varExpr=apply( datExpr,2,var1)
verboseScatterplot(Connectivity,varExpr,xlab=paste("Whole Network Connectivity (k.all)"), ylab="Variance",col=colorh1)
verboseScatterplot(Connectivity,meanExpr,xlab=paste("Whole Network Connectivity (k.all)"), ylab="Mean Expression",col=colorh1)

In the co-expression network presented here, we find that the gene expression levels of hub genes
are less variable (lower variance) than those of other, less connected nodes across all mice. This is consistent with the idea that the network's most highly connected hubs are resilient to large genetic background variations since they are vital for core biological functions.

# The following produces heatmap plots for each module.
# Here the rows are genes and the columns are samples.
# Well defined modules result in characteristic band structures since the corresponding genes are
# highly correlated.
ClusterSamples=hclust(dist(datExpr[,] ),method="average")
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="black"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module, title=which.module )
which.module="blue"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module, title=which.module )
which.module="brown"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module, title=which.module )

# The function intramodularConnectivity computes the whole network connectivity (kTotal) and
# the within module connectivity (kWithin). kOut=kTotal-kWithin
# and kDiff=kWithin-kOut=2*kWithin-kTotal
ConnectivityMeasures=intramodularConnectivity(abs(cor(datExpr,use="p"))^beta1,colorh1)
names(ConnectivityMeasures)
[1] "kTotal" "kWithin" "kOut" "kDiff"

# The following plots show the gene significance vs intramodular connectivity
colorlevels=levels(factor(colorh1))
par(mfrow=c(4,3))
for (i in 1:12) {
whichmodule=colorlevels[i];restrict1=colorh1==whichmodule
verboseScatterplot(ConnectivityMeasures$kWithin[restrict1],GeneSignificance[restrict1,1],col=whichmodule,main=whichmodule,xlab="Intramodular k", ylab="Gene Signif")
}

APPENDIX: Constructing an unweighted network and comparing it to the weighted network.

Here we redo the network analysis using hard thresholding, i.e.
dichotomizing the correlation matrix. We show that our main biological findings are highly robust with respect to the network construction method.

Use the scale free topology criterion for finding the hard threshold parameter tau.

thresholds1= c(seq(.1,.5, by=.1), seq(.55,.95, by=.05) )
TableHard=pickHardThreshold(datExpr, thresholds1)[[2]]
gc()

    Cut  p.value scale.law.R.2 slope. truncated.R.2  mean.k. median.k. max.k.
 1 0.10 2.47e-01        0.7750  2.460        0.8060 2290.000      2390   2880
 2 0.20 1.96e-02        0.1260  0.545        0.0153 1450.000      1540   2330
 3 0.30 3.88e-04       -0.1110 -0.124        0.0280  875.000       898   1820
 4 0.40 1.40e-06        0.2880 -0.852        0.8180  478.000       454   1320
 5 0.50 5.74e-10        0.6120 -1.340        0.9660  228.000       189    823
 6 0.55 4.05e-12        0.6690 -1.520        0.9580  149.000       111    633
 7 0.60 1.18e-14        0.7310 -1.680        0.9750   93.100        68    484
 8 0.65 0.00e+00        0.7750 -1.700        0.9880   55.100        35    344
 9 0.70 0.00e+00        0.7250 -1.910        0.9710   30.200        15    242
10 0.75 0.00e+00        0.6730 -1.680        0.8400   15.600         5    147
11 0.80 0.00e+00        0.6530 -1.500        0.6040    7.930         1    106
12 0.85 0.00e+00        0.0658 -1.600        0.0294    4.290         0    100
13 0.90 0.00e+00        0.6990 -0.987        0.8320    2.280         0     88
14 0.95 0.00e+00        0.9590 -0.996        0.9620    0.558         0     42

To choose a cut-off value tau, we propose to use the Scale-free Topology Criterion (Zhang and Horvath 2005). Here the focus is on the linear regression model fitting index (denoted scale.law.R.2), which quantifies how well a network satisfies a scale-free topology. We choose the cut value (tau) of 0.65 for the correlation matrix since this is where the R^2 curve seems to saturate. From the above table, we find that the resulting slope (-1.70) looks OK (negative and between -1 and -2), and the mean number of connections (mean.k. = 55.1) looks good. Below we investigate different choices of tau.
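The fitting index described above can be illustrated with a minimal base-R sketch. Everything in it is an assumption made for illustration: the data are simulated (datSim is not the mouse liver data), the helper name scale_free_fit is hypothetical, and the binning into 10 equal-width bins is a simple choice that need not match what pickHardThreshold does internally. The sketch dichotomizes a correlation matrix at tau, computes the unweighted connectivity k, and regresses log10(p(k)) on log10(k).

```r
# Hypothetical helper (not the tutorial's pickHardThreshold): regress
# log10(p(k)) on log10(k) over binned connectivities; return R^2 and slope.
scale_free_fit <- function(k, n_breaks = 10) {
  bins   <- cut(k, n_breaks)                    # equal-width connectivity bins
  mean_k <- as.vector(tapply(k, bins, mean))    # mean connectivity per bin
  p_k    <- as.vector(table(bins)) / length(k)  # relative frequency per bin
  keep   <- !is.na(mean_k) & p_k > 0 & mean_k > 0  # drop empty or zero bins
  x <- log10(mean_k[keep])
  y <- log10(p_k[keep])
  fit <- lm(y ~ x)
  c(r2 = summary(fit)$r.squared, slope = unname(coef(fit)[2]))
}

# Simulated expression data (assumption): 50 samples, 300 genes that load
# with varying strength on one shared latent factor.
set.seed(1)
n_samples <- 50
n_genes   <- 300
loading <- runif(n_genes)
factor1 <- rnorm(n_samples)
noise   <- matrix(rnorm(n_samples * n_genes), n_samples, n_genes)
datSim  <- outer(factor1, loading) + sweep(noise, 2, sqrt(1 - loading^2), "*")

# Hard thresholding: two genes are connected if |cor| exceeds tau.
tau <- 0.5
adj <- (abs(cor(datSim)) > tau) + 0
diag(adj) <- 0                # no self-connections
k <- colSums(adj)             # unweighted connectivity of each gene

fit <- scale_free_fit(k)
print(fit)
```

Repeating the last three steps over a grid of tau values gives one R^2 and slope per threshold, which is the kind of summary the table above reports.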
par(mfrow=c(1,1))
plot(thresholds1, -sign(TableHard[,4])*TableHard[,3], type="n",ylab="Scale Free Topology R^2",xlab="Hard Threshold tau", ylim=range(min(c( -sign(TableHard[,4])*TableHard[,3]),na.rm=T),1) )
text(thresholds1, -sign(TableHard[,4])*TableHard[,3], labels= thresholds1, col="black")
abline(h=.8)

[Figure: Scale Free Topology R^2 (y-axis) versus Hard Threshold tau (x-axis); each point is labeled by its threshold value, with a horizontal line at 0.8.]

tau1=.65 # this parameter is the hard threshold parameter.
# Let’s define the adjacency matrix of an unweighted network
ADJHARD = I(abs(cor(datExpr[,],use="p"))>tau1)+0.0
gc()
# This is the unweighted connectivity
ConnectivityHard =as.vector(apply(ADJHARD,2,sum))
scaleFreePlot(ConnectivityHard,truncated=T,main=paste("tau=",as.character(tau1)))

[Figure: scale free plot of log10(p(k)) (y-axis) versus log10(k) (x-axis). Title: tau= 0.65 , scale free R^2= 0.78 , slope= -1.74 , trunc.R^2= 0.99]

# Let’s compare weighted to unweighted connectivity in a scatter plot
verboseScatterplot(ConnectivityHard, Connectivity,xlab="Unweighted Connectivity",ylab="Weighted Connectivity", col= as.character(colorh1))
# Comment: the connectivity measure is highly preserved between the weighted and unweighted
# networks, but there are marked differences for the brown module.

# The following code computes the topological overlap matrix based on the
# adjacency matrix.
dissTOMhard=TOMdist(ADJHARD)
gc()
# Now we carry out hierarchical clustering with the TOM matrix.
hierTOMhard = hclust(as.dist(dissTOMhard),method="average");
# Next, we study whether the 'soft' modules of the weighted network described above can also be
# found in the unweighted network
# The following shows the hierarchical tree based on the unweighted network but the
# genes are colored according to their membership in the weighted network
par(mfrow=c(2,1))
plot(hierTOMhard, main="Unweighted Network Module Tree ", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOMhard, colors=colorh1)
title("Dynamic Colors, weighted network")

Comment: Overall the colors stay together. This is particularly true for the blue module, which is the main interest of our paper. This demonstrates that the module assignment is robust with respect to the network construction method.

# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Here we use the unweighted module tree but color it by the weighted modules.
TOMplot(dissTOMhard , hierTOMhard, as.character(colorh1))
gc()
# Comment: module assignment is highly preserved.
ConnectivityMeasuresHARD=intramodularConnectivity(ADJHARD,colorh1)

Appendix: Computation of the cluster coefficient

Although we do not discuss the clustering coefficient in our main article, we report it here for the sake of completeness since it is an important network concept. The cluster coefficient measures the cliquishness of a gene.
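To make "cliquishness" concrete, here is a minimal base-R sketch of a clustering coefficient that works for both weighted and unweighted adjacency matrices. It assumes the generalized definition C_i = sum_{u,v} a_iu a_uv a_vi / ((sum_u a_iu)^2 - sum_u a_iu^2) with a zero diagonal, which reduces to the usual "closed triangles over possible triangles" ratio for a binary network. The helper name cluster_coef_sketch is hypothetical, and whether the tutorial's clusterCoef function computes exactly this quantity is an assumption. Setting the coefficient to 0 for nodes with a zero denominator is a convention chosen here.

```r
# Hypothetical helper, not the tutorial's clusterCoef():
# clustering coefficient for a symmetric adjacency matrix with entries in [0,1].
cluster_coef_sketch <- function(adj) {
  diag(adj) <- 0                               # ignore self-connections
  numer <- diag(adj %*% adj %*% adj)           # sum over u,v of a_iu * a_uv * a_vi
  denom <- rowSums(adj)^2 - rowSums(adj^2)     # sum over u != v of a_iu * a_iv
  ifelse(denom == 0, 0, numer / denom)         # convention: 0 if undefined
}

# Unweighted clique on 4 nodes: every pair of neighbors is connected.
clique <- matrix(1, 4, 4)
cc_clique <- cluster_coef_sketch(clique)

# Unweighted star: node 1 is a hub, its neighbors are not connected to each other.
star <- matrix(0, 4, 4)
star[1, 2:4] <- 1
star[2:4, 1] <- 1
cc_star <- cluster_coef_sketch(star)

# Weighted clique with constant connection strength 0.5.
cc_weighted <- cluster_coef_sketch(matrix(0.5, 4, 4))

print(rbind(clique = cc_clique, star = cc_star, weighted = cc_weighted))
```

In the unweighted clique every node is maximally cliquish, while the hub of the star is not cliquish at all even though it has the highest connectivity. In the weighted clique the coefficient equals the common connection strength, which shows how the weighted version scales with edge weights.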
# First, we start with the weighted network
cluster.coef= clusterCoef(ADJ)
gc()
# Now we plot cluster coefficient versus connectivity
# for all genes
par(mfrow=c(1,1),mar=c(2,2,2,1))
plot(Connectivity, cluster.coef,col=as.character(colorh1),xlab="Connectivity",ylab="Cluster Coefficient")

Overall, we find that the clustering coefficient in a weighted network is roughly constant for highly connected genes inside a given module. Across modules the clustering coefficient varies a lot.

# Now we compute the CC for the unweighted network
diag(ADJHARD)=0
cluster.coefHARD= clusterCoef(ADJHARD)
ConnectivityHARD= apply(ADJHARD,2,sum)
par(mfrow=c(1,1))
plot(ConnectivityHARD,cluster.coefHARD,col=as.character(colorh1),xlab="Connectivity",ylab="Cluster Coefficient" )
# There is a marked difference between the weighted network and the unweighted network
# when it comes to the relationship between clustering coefficient and connectivity.

This is further discussed in Zhang and Horvath 2005 and the following reference: Horvath and Dong, Yip (2008) PloS Comp Biol

THE END

To cite the code and methods in this manual, please use
Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17