Microsoft word version

advertisement
Tutorial I
Female Mouse Liver Microarray Data
Network Construction and Module Analysis
Steve Horvath
Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm
Content of this tutorial
1) Gene Co-expression Network Construction,
2) Module Definition Based on Average Linkage hierarchical clustering with the
dynamic tree cut algorithm
3) Relating Modules To Physiological Traits (module significance analysis)
4) Comparing Weighted Network Results to Unweighted Network Results
5) Studying the Clustering Coefficicient
Abstract
We use microarray from an F2 mouse intercross to examine the large-scale organization of gene
co-expression networks in female mouse liver and annotate several gene modules in terms of 20
physiological traits. Finally we study the relationship between connectivity and a measures of gene
significance based on the physiological traits.
The data and biological implications are described in the following references


Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA,
Lusis AJ, Horvath S (2006)Integrating Genetic and Network Analysis to Characterize Genes Related to
Mouse Weight. PloS Genetics. Volume 2 | Issue 8 | AUGUST 2006
Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) Weighted Gene Coexpression Network Analysis Strategies Applied to Mouse Weight, Mamm Genome 18(6):463-472.
We provide the statistical code used for generating the weighted gene co-expression network
results. Thus, the reader be able to reproduce all of our findings. This document also serves as a
tutorial to weighted gene co-expression network analysis. Some familiarity with the R software is
desirable but the document is fairly self-contained. This document and data files can be found at the
following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/MouseWeight/
More material on weighted network analysis can be found here
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Method Description
The network construction is conceptually straightforward: nodes represent genes and nodes are
connected if the corresponding genes are significantly co-expressed across appropriately chosen
tissue samples. Here we study networks that can be specified with the following adjacency matrix:
A=[aij] is symmetric with entries in [0,1]. By convention, the diagonal elements are assumed to be
zero. For unweighted networks, the adjacency matrix contains binary information (connected=1,
unconnected=0). In weighted networks the adjacency matrix contains encodes pairwise connection
strengths.
1
Microarray data
RNA preparation and array hybridizations were performed at Rosetta Informatics. The
custom ink-jet microarrays used in this study (Agilent technologies, previously described [2, 24])
contain 2186 control probes and 23,574 non-control oligonucleotides extracted from mouse
Unigene clusters and combined with RefSeq sequences and RIKEN full-length clones. Mouse
livers were homogenized and total RNA extracted using Trizol reagent (Invitrogen, CA) according
to manufacturer’s protocol. Three µg of total RNA was reverse transcribed and labeled with either
Cy3 or Cy5 fluorochrome. Purified Cy3 or Cy5 complementary RNA was hybridized to at least
two microarray slides with fluor reversal for 24 hours in a hybridization chamber, washed, and
scanned using a laser confocal scanner. Arrays were quantified on the basis of spot intensity
relative to background, adjusted for experimental variation between arrays using average intensity
over multiple channels, and fit to an error model to determine significance (type I error). Gene
expression is reported as the ratio of the mean log10 intensity (mlratio) relative to the pool derived
from 150 mice randomly selected from the F2 population.
Data Reduction:
In order to minimize noise in the gene expression data set, several data filtering steps were taken.
First, preliminary evidence showed major differences in gene expression levels between sexes
amongst the F2 mice used (Yang, manuscript in preparation), and therefore only female mice were
used for network construction . The construction and comparison of the male network will be
reported elsewhere. Only those mice with complete phenotype, genotype and array data were used.
This gave a final experimental sample of 135 female mice used for network construction. Due to
computational limitations the following filtering steps were applied to the genome wide expression
data. First, the 8000 most varying genes (ml ratio – log10 of the ratio of experimental mouse gene
expression to F2 pool), across all mice were identified. Next, amongst these 8000 genes, after
preliminary network construction, the 3600 most connected genes were chosen as those to use in
further steps (All genes excluded had a connectivity of 1 or less). These 3600 genes were then
examined, and where appropriate, gene isoforms and genes containing duplicate probes were
excluded by using only those with the highest expression among the redundant transcripts. This
final filtering step yielded a count of 3421 genes for the experimental network construction.
The main reason for not using all 23000 genes is that genes with low variance across the mouse
samples are likely to be less interesting in this analysis that relates gene expression profiles to SNP
markers and highly varying physiological traits. Noise genes may compromise module detection
and thus our integrated model. A computational reason for restricting the analysis to 8000 genes is
that our R software code becomes extremely slow when dealing with matrices of dimension larger
than 8000x8000.
For module detection, we limited our analysis to 3600 most connected genes since our module
construction method and visualization tools cannot handle larger data sets at this point. By
definition, module genes are highly connected with the genes of their module (i.e. module genes
tend have relatively high connectivity). Thus, for the purpose of module detection, restricting the
analysis to the most connected genes does not lead to major information loss for the key points of
our application. However, there may be applications where genes with relatively low connectivity
are biologically interesting so that gene filtering based on connectivity would lead to information
loss. Finally, we eliminated multiple probes with similar expression pattern for the same gene since
we are interested in studying gene networks as opposed to probeset networks. This resulted in 3421
genes in our final set which we used for module detection.
2
Network Construction:
We pioneered the use of a weighted coexpression network for mapping complex disease
genes. In co-expression networks, network nodes correspond to genes and connection strengths are
determined by the pairwise correlations between expression profiles. In contrast to unweighted
networks, weighted networks use soft thresholding of the Pearson correlation matrix for
determining the connection strengths between two genes. Soft thresholding of the Pearson
correlation preserves the gene co-expression information and leads to weighted co-expression
networks that are highly robust with respect to the construction method.
The network construction algorithm is described in detail elsewhere (Zhang and Horvath
2005). Briefly, a gene co-expression similarity measure (absolute value of Pearson’s product
moment correlation) was used to relate every pairwise gene-gene relationship. An adjacency
matrix was then constructed using a `soft’ power adjacency function aij = Power(sij, )  |sij|  where
sij is the co-expression similarity, and aij represents the resulting adjacency that measures the
connection strengths. The power  is chosen using the scale free topology criterion proposed in
Zhang and Horvath (2005). Briefly, the power was chosen such the resulting network exhibited
approximate scale free topology and a high mean number of connections. The scale free topology
criterion led us to choose a power of  = 6 based on the preliminary network built from the 8000
most varying genes. However, since we are using a weighted network as opposed to an unweighted
network, the biological findings are highly robust with respect to the choice of this power (Zhang
and Horvath 2005) .
Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure or more
precisely a measure of node dissimilarity based on the topological overlap matrix. Specifically the
topological overlap matrix is given by
lij  aij
ij 
min{ki , k j }  1  aij
where, lij   aiu auj denotes the number of nodes to which both i and j are connected, and u
u i
indexes the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ω ij]. ωij is
a number between 0 and 1 and is symmetric (i.e, ωij= ωji). The rationale for considering this
similarity measure is that nodes that are part of highly integrated modules are expected to have high
topological overlap with their neighbors.
Network Module Identification.
Gene "modules" are groups of nodes that have high topological overlap. Module
identification was based on the topological overlap matrix Ω=[ωij] defined above. To use it in
hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e, the
topological overlap based dissimilarity measure is defined by dij  1  ij ). Based on the
dissimilarity matrix we can use hierarchical clustering to discriminate one module from another.
We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in
hierarchical clustering dendrogram (the "tree" method of cutreeDynamic, see Langfelder, Zhang
Horvath 2008).
3
The algorithm takes into account an essential feature of cluster occurrence and makes use of
the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process
of cluster decomposition and combination and the process is iterated until the number of clusters
becomes stable. No claim is made that our module construction method is optimal. A comparison
of different module construction methods is beyond the scope of this paper.
Detection of hub genes:
To identify hub genes for the network, one may either consider the whole network connectivity
(denoted by kTotal) or the intramodular connectivity (kWithin).
We find that intramodular connectivity is far more meaningful than whole network connectivity
Module Significance Analysis: Relating Gene Modules to Physiologic Traits,
In the BXH F2 cross, 20 physiological traits were measured for each animal. We used this
information to explore the physiological relevance of each of the modules in the network. To do
this, the Pearson’s product moment correlation (Pearson’s correlation) was computed between each
gene within a module and each of the physiological traits. This measure is termed the ‘gene
significance’ of a particular gene with that trait. The geometric mean was then calculated for the
absolute value of all gene significance scores within each module, yielding the ‘module
significance’ (MS) of that particular module with the trait .
In order to explore the characteristics of our connectivity measure, we plotted the
connectivity parameter versus variance and mean gene expression for each gene within the most
functionally significant (Blue and Brown) modules. We observed an inverse relationship between
connectivity and gene expression variance. This is consistent with the idea that the network’s most
highly connected ‘hubs’ are resilient to genetic background variations since they are vital for core
biological functions.
Statistical References
 Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene CoExpression Network Analysis", Statistical Applications in Genetics and Molecular
Biology: Vol. 4: No. 1, Article 17
The following theoretical reference explores the meaning of coexpression network analysis
•
Horvath S, Dong J (2008) Geometric Interpretation of Gene Co-Expression Network
Analysis. PloS Computational Biology. 4(8): e1000117. PMID: 18704157
The WGCNA R package is described in
•
Langfelder P, Horvath S (2008) WGCNA: an R package for Weighted Correlation Network
Analysis. BMC Bioinformatics. 2008 Dec 29;9(1):559. PMID: 19114008
For the generalized topological overlap matrix as applied to unweighted networks see
•
Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological
overlap measure. BMC Bioinformatics 8:22
 Module detection based on branch cutting is described in
Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the
Dynamic Tree Cut package for R. Bioinformatics. Bioinformatics. 2008 Mar 1;24(5):719-20.
PMID: 18024473
4
# Absolutely no warranty on the code. Please contact SH with suggestions.
# CONTENTS
# This document contains function for carrying out the following tasks
# A) Assessing scale free topology and choosing the parameters of the adjacency function
# using the scale free topology criterion (Zhang and Horvath 05)
# B) Computing the topological overlap matrix
# C) Defining gene modules using clustering procedures
# D) Summing up modules by their first principal component (first eigengene)
# E) Relating a measure of gene significance to the modules
# F) Carrying out a within module analysis (computing intramodular connectivity)
# and relating intramodular connectivity to gene significance.
# G) Miscellaneous other functions, e.g. for computing the cluster coefficient.
# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example to install Hmisc, open R,
# go to menu "Packages/Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.
# To get this tutorial and data files, go to the following webpage
# www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
# Unzip all the files into the same directory,
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries
library(MASS)
# standard, no need to install
library(class) # standard, no need to install
library(cluster)
library(sma) # install it for the function plot.mat
library(impute)# install it for imputing missing value
library(scatterplot3d)
# Download the WGCNA library as a .zip file from
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
and choose "Install package(s) from local zip file" in the packages tab
library(WGCNA)
options(stringsAsFactors=F)
5
# Please adapt the file paths
setwd("C:/Documents and Settings/Steve Horvath/My
Documents/ADAG/LinSong/NetworkScreening/MouseFemaleLiver")
# read in the custom network functions.
source("NetworkFunctions.txt")
# The following 3421 probe set were arrived at using the following steps
#1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes
dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules.xls",hea
der=T)
names(dat0)
# this contains information on the genes
datSummary=dat0[,c(1:8,144:150)]
# the following data frame contains
# the gene expression data: columns are genes, rows are arrays (samples)
datExpr = t(dat0[,9:143])
no.samples = dim(datExpr)[[1]]
dim(datExpr)
datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
#Now we order the mice so that trait file and expression file agree
restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]])
table(restrictMice)
datClinicalTraits=datClinicalTraits[restrictMice,]
orderMiceTraits=order(datClinicalTraits$MiceID)
orderMiceExpr=order(dimnames(datExpr)[[1]])
datClinicalTraits =datClinicalTraits[orderMiceTraits,]
datExpr =datExpr[orderMiceExpr,]
#from the following table, we verify that all 135 mice are in order
table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]])
rm(dat0);gc()
6
#SOFT THRESHOLDING For Weighted Network Construction
# To construct a weighted network (soft thresholding with the power adjacency matrix),
# we consider the following vector of potential thresholds.
# Now we investigate soft thesholding with the power adjacency function
powers1=c(seq(1,10,by=1),seq(12,18,by=2))
# To choose a cut-off value, we propose to use the Scale-free Topology Criterion (Zhang and
# Horvath 2005). Here the focus is on the linear regression model fitting index
# (denoted below by scale.law.R.2) that quantify the extent of how well a network
# satisfies a scale-free topology.
# The function PickSoftThreshold can help one to estimate the cut-off value
# when using hard thresholding with the step adjacency function.
# The first column lists the power beta
# The second column reports the resulting scale free topology fitting index R^2.
# The third column reports the slope of the fitting line.
# The fourth column reports the fitting index for the truncated exponential scale free model.
# Usually we ignore it.
# The remaining columns list the mean, median and maximum connectivity.
RpowerTable=pickSoftThreshold(datExpr, powerVector=powers1)[[2]]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Power scale.law.R.2 slope truncated.R.2 mean.k. median.k. max.k.
1
-0.0848 0.392
0.484 706.00
722.000 1130.0
2
0.0244 -0.624
0.857 240.00
238.000 532.0
3
0.2950 -1.160
0.980 105.00
96.300 302.0
4
0.4760 -1.570
0.975
53.10
44.500 188.0
5
0.6510 -1.940
0.943
30.10
23.600 125.0
6
0.8810 -1.610
0.982
18.50
13.500
86.3
7
0.9240 -1.720
0.941
12.20
8.200
74.4
8
0.9110 -1.760
0.900
8.47
5.120
67.5
9
0.8680 -1.740
0.859
6.17
3.310
62.4
10
0.8350 -1.690
0.846
4.67
2.240
58.2
12
0.8640 -1.530
0.919
2.94
1.070
51.5
14
0.8740 -1.410
0.948
2.03
0.527
46.2
16
0.9090 -1.320
0.977
1.51
0.278
41.7
18
0.8910 -1.250
0.961
1.17
0.150
37.9
cex1=0.7
par(mfrow=c(1,2))
plot(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2],xlab="
Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n")
text(RpowerTable[,1], -sign(RpowerTable[,3])*RpowerTable[,2],
labels=powers1,cex=cex1,col="red")
# this line corresponds to using an R^2 cut-off of h
abline(h=0.8,col="red")
plot(RpowerTable[,1], RpowerTable[,5],xlab="Soft Threshold (power)",ylab="Mean
Connectivity", type="n")
text(RpowerTable[,1], RpowerTable[,5], labels=powers1, cex=cex1,col="red")
7
16
9
12
14
18
600
500
200
0.4
4
400
0.6
Mean Connectivity
5
3
100
2
3
4
1
5
0
0.0
1
300
0.8
10
0.2
Scale Free Topology Model Fit,signed R^2
8
700
7
6
2
5
10
Soft Threshold (power)
15
5
6
7
8
9
10
12
10
14
16
18
15
Soft Threshold (power)
To choose a cut-off value beta, we use the Scale-free Topology Criterion (Zhang and Horvath
2005). Here the focus is on the linear regression model
fitting index (denoted as scale.law.R.2) that quantify the extent of how well
a network satisfies a scale-free topology. We choose the soft thresholding parameter beta=6 for the
correlation matrix since this is where the R^2 curve seems to saturates. From the above table, we
find that the resulting slope looks OK (negative between -1 and -2). In the appendix, we investigate
different choices of beta.
#Here the scale free topology criterion would lead us to pick a power of beta=6. In an appendix, we
#study how the biological findings depend on the choice of the power.
beta1=6 # this is the the power adjacency function parameter in power(s,beta)
# By playing around with beta, you will find that the
# findings are highly robust with respect to beta1, which is an attractive property.
8
# The following computes the network connectivity (Connectivity)
Connectivity= softConnectivity(datExpr,power=beta1)
# Creating Scale Free Topology Plots (SUPPLEMENTARY FIGURE S1 in our article)
par(mfrow=c(1,1))
scaleFreePlot(Connectivity, truncated=T,main= paste("beta=",as.character(beta1)))
-1.5
-2.0
log10(p(k))
-1.0
-0.5
beta= 6 , scale free R^2= 0.88 , slope= -1.61 , trunc.R^2= 0.98
0.8
1.0
1.2
1.4
1.6
1.8
log10(k)
The Figure shows that the connectivity distribution p(k) is better modeled using an exponentially
truncated power law p(k) ~ k- exp(-α k). In practice, we find that the two parameters α and 
provide too much flexibility in curve fitting. The truncated exponential model fitting index R^2
tends to be high irrespective of the adjacency function parameter. For this reason, we focus on the
scale-free topology fitting index in our scale-free topology criterion. Exploring the use of the
truncated exponential fitting index is beyond the scope of this article.
9
Module Detection
An important step in network analysis is module detetion.
To group genes with coherent expression profiles into modules, we use average linkage hierarchical
clustering, which uses the topological overlap measure as dissimilarity.
The topological overlap of two nodes reflects their similarity in terms of the commonality of the
nodes they connect to (see [Ravasz et al 2002, Yip and Horvath 2006]).
# Now define the power adjacency matrix
ADJ = adjacency(datExpr,power=beta1)
gc()
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
# TIME: Takes several minutes
dissTOM=TOMdist(ADJ)
gc()
# Now we carry out hierarchical clustering with the TOM matrix. Branches of the
# resulting clustering tree will be used to define gene modules.
hierTOM = hclust(as.dist(dissTOM),method="average");
par(mfrow=c(1,1))
plot(hierTOM,labels=F)
10
# By our definition, modules correspond to branches of the tree.
# The question is what height cut-off should be used? This depends on the
# biology. Large height values lead to big modules, small values lead to small
# but very cohesive modules.
We used a dynamic cut-tree algorithm for selection branches of the hierarchical clustering
dendrogram (Langfelder Zhang Horvath 2008). The algorithm takes into account an essential
feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically,
the algorithm is based on an adaptive process of cluster decomposition and combination and the
process is iterated until the number of clusters becomes stable.
11
# The following is the original code used in the paper by Ghazalpour et al
myheightcutoff =0.995
mydeepSplit
= FALSE # fine structure within module
myminModuleSize = 20 # modules must have this minimum number of genes
#new way for identifying modules based on hierarchical clustering dendrogram
colorh1=cutreeDynamic(hierclust= hierTOM, deepSplit=mydeepSplit,maxTreeHeight
=myheightcutoff,minModuleSize=myminModuleSize)
table(colorh1)
# Our code has slightly changed. If we could go back in time, we would use the following code
# for branch cutting. But please skip it....
colorhNEWstep1= dynamicTreeCut::cutreeDynamic(dendro=hierTOM, cutHeight =0.9965,
minClusterSize = 20, method = "tree",deepSplit =F)
# to turn the branch lables (which are integers) into colors we use
colorhNEWstep2=labels2colors(colorhNEWstep1)
colorhNEWstep3=mergeCloseModules(datExpr, colors=colorh2, cutHeight = 0.15, MEs = NULL,
impute = TRUE, useAbs = F)$colors
# Note that the resulting color is quite similar to the original one:
table(colorhNEWstep3,colorh1)
#This results in the following color assignment.
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colors=data.frame(colorh1))
title("Colored by UNMERGED dynamic modules")
12
#Note that the colors correspond to portions of the branches.
#To determine whether some colors should be merged we a) represent each module by its
#module eigengenes (defined as its first principal component) and b) clustering
#the principal components. If 2 module eigengenes (PCs) are highly correlated
#then the modules should be merged. A general rule may be that two modules are
#merged if the distance between the two is samller than 0.1 (i.e., correlation #is bigger than 0.9)
datME = moduleEigengenes(as.matrix(datExpr), colorh1)[[1]]
dissMEs = 1-abs(cor(datME, use="p"))
dissMEs = ifelse(is.na(dissMEs), 0, dissMEs)
hierMEs
= hclust(as.dist(dissMEs),method="a")
13
PCred
PCpink
PCturquoise
PCmidnightblue
PCpurple
PCyellow
PCblack
PCtan
PCgreen
PClightgreen
PCblue
PCmagenta
PClightcyan
PCgrey60
PCgreenyellow
PCcyan
PCsalmon
PCgrey
0.4
0.2
0.0
PCbrown
0.6
PClightyellow
0.8
#display ME hierarchical dendrogram on screen
par(mfrow=c(1,1), mar=c(0, 3, 1, 1) + 0.1, cex=1)
plot(hierMEs, xlab="",ylab="",main="",sub="")
par(mfrow=c(1,1))
#This tree suggest to merge several colors
14
#To merge a minor cluster to a major cluster, we use
colorh1 = merge2Clusters(colorh1, mainclusterColor="lightcyan", minorclusterColor="grey60")
colorh1 = merge2Clusters(colorh1, mainclusterColor="blue", minorclusterColor="magenta")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="turquoise")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="pink")
colorh1 = merge2Clusters(colorh1, mainclusterColor="black", minorclusterColor="yellow")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="lightgreen")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="tan")
# After merging some colors we arrive at the following hierarchical plot
#### FIGURE 1A) in our manuscript
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colorh1, title1="Colored by female liver modules")
15
# We also propose to use classical multi-dimensional scaling plots
# for visualizing the network. Here we chose 3 scaling dimensions
# This also takes about 10 minutes...
cmd1=cmdscale(as.dist(dissTOM),4)
par(mfrow=c(2,3))
plot(cmd1[,c(1,2)], col= as.character(colorh1) )
plot(cmd1[,c(1,3)], col= as.character(colorh1) )
plot(cmd1[,c(1,4)], col= as.character(colorh1) )
plot(cmd1[,c(2,3)], col= as.character(colorh1) )
plot(cmd1[,c(2,4)], col= as.character(colorh1) )
plot(cmd1[,c(3,4)], col= as.character(colorh1) )
### FIGURE 1 B in our article
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorh1), main="MDS plot",xlab="Scaling
Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3",cex.axis=1.5,angle=320)
16
TOM plot and MDS plots
To visualize the network, we used several plots. The topological overlap matrix plot
represents the topological overlap matrix where rows and columns are sorted and colored according
to the hierarchical clustering tree used in the module definition. A classical multi-dimensional
scaling plot that uses the topological overlap matrix as input can also be used.
# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Warning: for large gene sets, say more than 2000 genes
#this may take a while...
TOMplot(dissTOM , hierTOM, colorh1)
17
Definition of trait based gene significance
For a given physiological trait, we defined a measure of gene significance by forming the absolute
value of the Spearman correlation between trait and gene expression values. For example, the body
weight can be used to define a gene significance of the ith gene expression GSweight(i) = |cor(x(i),
weight)| where x(i) is the gene expression profile of the ith gene.
A histogram of the clinical traits shows that several clinical traits appear to have outliers.
# To protect agains outliers, we replace the values of the physiological traits
# by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatClinicalTraits=apply(datClinicalTraits[,5:26],2,rank1)
# This function computes the correlation between a gene expression
# and a physiological trait
if(exists("GSfunction")) rm(GSfunction)
GSfunction=function(x) {cor(x,rankdatClinicalTraits,use="p")}
# the following data frame has as columns the gene significance variables
# for different clinical traits
GeneSignificance =t(apply(datExpr,2,GSfunction))
dimnames(GeneSignificance)[[2]]=paste("GS",dimnames(rankdatClinicalTraits)[[2]],sep="" )
# Since we only care about absolute values of correlations between expression
# profiles and traits, we set
GeneSignificance=data.frame(abs(GeneSignificance))
names(GeneSignificance)
[1] "GSWeightG"
"GSLengthCM"
[6] "GSX100xfat.weight" "GSTrigly"
[11] "GSFFA"
"GSGlucose"
"GSInsulin.ug.l."
[16] "GSGlucose.Insulin" "GSLeptin.pg.ml."
[21] "GSAorticCal.M"
"GSAorticCal.L"
"GSAbFat"
"GSTotalChol"
"GSLDL.VLDL"
"GSAdiponectin"
"GSOtherFat"
"GSHDLChol"
"GSMCP.1.phys."
"GSAorticLesions"
"GSTotalFat"
"GSUC"
"GSAneurysm"
# Here we define more conventional to annotate Figure 2 and Supplementary Figure S2
namesGS=c("Weight","Length","AbFat","OtherFat","TotalFat","Index",
"Trigly","Chol","HDL","UC","FFA","Glucose","LDL+VLDL", "MCP1",
"Insulin","GlucoseInsulin", "Leptin", "Adiponectin", "AorticLesions", "Aneurysm",
"AorticCal.M", "AorticCal.L")
The mean gene significance for a particular module can be considered as a measure of module
significance (MS) (see Materials and Methods for statistical test), which means that MS provides a
measure for overall correlation between the trait and the module. This means that a module with
high MS value for "body weight" is on average composed of genes highly correlated with body
weight.
# Here we use the function verboseBoxplot to creates barplots
# that shows whether modules are enriched with significant genes.
# It also reports a Kruskal Wallis P-value.
# The gene significance can be a binary variable or a quantitative variable.
# It also plots the 95% confidence interval of the mean
18
par(mfrow=c(7,3), mar= c(1, 4, 3, 1) +0.1)
for (i in c(1:21) ) {
verboseBoxplot(GeneSignificance[,i],colorh1,col=levels(factor(colorh1)),main=
namesGS[i],xlab="module",ylab="GS")
abline(h=.3,col="red") }
GS
GS
GS
GS
0.0
GS
darkred lightyellow salmon
Aneurysm p = 1.5e-81
module
GS
black
darkred lightyellow salmon
HDL p = 4.2e-46
module
black
darkred lightyellow salmon
Glucose p = 3.6e-223
module
black
darkred lightyellow salmon
Insulin p = 1.4e-265
module
black
darkred lightyellow salmon
Adiponectin p = 7.2e-70
module
black
darkred lightyellow salmon
AorticCal.M p = 4.1e-93
module
0.0
darkred lightyellow salmon
Leptin p = 2.7e-194
module
black
0.0
black
darkred lightyellow salmon
Index p = 1.1e-233
module
0.0
darkred lightyellow salmon
MCP1 p = 1.5e-211
module
black
0.0
black
0.00
darkred lightyellow salmon
FFA p = 1.7e-197
module
GS
GS
GS
GS
GS
GS
black
0.00
GS
black darkred lightyellow salmon
AorticLesions p = 2.5e-160
module
0.00
darkred lightyellow salmon
Chol p = 5.1e-213
module
0.0
0.0
GS
black darkred lightyellow salmon
GlucoseInsulin p = 2.2e-262
module
AbFat p = 1.9e-211
0.00
GS
GS
darkred lightyellow salmon
LDL+VLDL p = 1.8e-208
module
black
0.0
black
darkred lightyellow salmon
TotalFat p = 5.1e-266
module
0.0
darkred lightyellow salmon
UC p = 1.2e-201
module
black
0.0
black
0.00
Length p = 3.6e-56
0.0
0.0
darkred lightyellow salmon
Trigly p = 3.5e-171
module
0.0
GS
black
0.0
GS
darkred lightyellow salmon
OtherFat p = 2.7e-262
module
0.0
GS
GS
black
0.0
GS
Weight p = 7.5e-288
So, interesting combinations include
a) GSweight in the blue module and
b) Glucose.Insulin the the brown module.
19
Since our particular interest is in understanding body weight, we focus on the blue module.
The following code creates a barplot for the body weight based gene significance measure.
verboseBarplot(GeneSignificance[,1],colorh1,col=levels(factor(colorh1)),main="module
significance",xlab="module",ylab="mean gene significance")
0.3
0.2
0.1
0.0
mean gene significance
module significance p= 7.5e-288
black
blue
brown
cyan
darkred
greenyellow
lightcyan
midnightblue
red
royalblue
module
dim(GeneSignificance)
# Now we produce Figure 2 of the article.
whichmodule="blue"
# mean gene significance=module significance
meanGS=apply(abs(GeneSignificance)[colorh1==whichmodule,],2,mean)
# corresponding standard error
stderrGS= apply(abs(GeneSignificance)[colorh1==whichmodule,],2,stderr1)
# The following code produces a barplot with rotated axis labels
## Increase bottom margin to make room for rotated labels
par(mar = c(7, 4, 4, 2) + 0.1)
## Create plot and get bar midpoints in 'mp'
mp = barplot(as.vector(meanGS),col=whichmodule, ylab="Module Significance",cex.lab=1.5)
## Set up x axis with tick marks alone
axis(1, at = mp, labels = FALSE)
## text labels
labels = namesGS
## Plot x axis labels at mp, you may want to change the offser -.005...
text(mp, par("usr")[3] - 0.005, srt = 45, adj = 1, labels = labels, xpd = TRUE,cex=1.3)
## Plot x axis label at line 4
mtext(1, text = "Physiological Traits", line = 6,cex=1.5)
# This creates the error bars
err.bp(meanGS , stderrGS, two.side=T)
20
W
ei
Le ght
ng
Ab th
O F
th at
e
To rFa
ta t
lF
In at
de
Tr x
ig
ly
C
ho
H l
D
L
U
C
G FFA
LD luc
L+ os
VL e
D
M L
G
C
l u I P1
co ns
se ul
In i n
su
Ad L lin
e
Ao ipo pti
rti ne n
cL ct
i
An esio n
Ao eur ns
rti ys
Ao cCa m
rti l.M
cC
al
.L
Physiological Traits
21
0.0
0.1
0.2
Module Significance
0.3
# To get a sense of how related the modules are one can summarize each module
# by its first eigengene (referred to as principal components).
# Next we cluster the eigengens. This is very similar to the code used above
# for identifying modules that should be merged.
dME2=moduleEigengenes(datExpr,colorh1)[[1]]
hclustdME2=hclust(as.dist( 1-abs(t(cor(dME2, method="p")))), method="average" )
par(mfrow=c(1,1))
plot(hclustdME2, main="Clustering the Module Eigengenes")
PCgreenyellow
PCblue
PCgrey
PCred
PCmidnightblue
PCpurple
PCgreen
PClightyellow
PCblack
PCcyan
PClightcyan
PCsalmon
PCbrown
0.6
0.5
0.4
0.3
Height
0.7
0.8
0.9
1.0
Clustering the Module Eigengenes
as.dist(1 - abs(t(cor(PC1, method = "p"))))
hclust (*, "average")
22
# Now we create scatter plots of the samples (arrays) along the module eigengenes.
dME2=dME2[,hclustdME2$order]
pairs( dME2, upper.panel = panel.smooth, lower.panel = panel.cor , diag.panel=panel.hist
,main="Relation between modules")
Comment: each dot represents a mouse. Above the diagonal are scatterplots. Below are the
corresponding absolute values of the correlation
23
#Now we study how connectivity is related to mean gene expression or variance of gene expression
#### This Supplementary Figure S3 in our article
par(mfrow=c(2,2))
whichmodule="blue"
# mean expression of the blue module genes
meanExprModule=apply( datExpr[,colorh1==whichmodule],2,mean1)
# variance of expression
varExprModule=apply( datExpr[,colorh1==whichmodule],2,var1)
ConnectivityModule= SoftConnectivity(datExpr[,colorh1==whichmodule], power=beta1)
verboseScatterplot(ConnectivityModule,varExprModule,xlab=paste("Connectivity (k.in)",
whichmodule, " module"), ylab="Variance",col=whichmodule)
verboseScatterplot(ConnectivityModule,meanExprModule,xlab=paste("Connectivity (k.in)",
whichmodule, " module"), ylab="Mean Expression",col=whichmodule)
meanExpr=apply( datExpr,2,mean1)
varExpr=apply( datExpr,2,var1)
verboseScatterplot(Connectivity,varExpr,xlab=paste("Whole Network Connectivity (k.all)"),
ylab="Variance",col=colorh1)
verboseScatterplot(Connectivity,meanExpr,xlab=paste("Whole Network Connectivity (k.all)"),
ylab="Mean Expression",col=colorh1)
In the co-expression network presented here, we find that the gene expression levels of hub genes
are less variable (lower variance) than other, less connected, nodes across all mice. This is
consistent with the idea that the network’s most highly connected hubs are resilient to large genetic
background variations since they are vital for core biological functions.
24
# The following produces heatmap plots for each module.
# Here the rows are genes and the columns are samples.
# Well defined modules results in characteristic band structures since the corresponding genes are
# highly correlated.
ClusterSamples=hclust(dist(datExpr[,] ),method="average")
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="black"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ])
),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
which.module="blue"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ])
),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
which.module="brown"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ])
),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
# The function intramodularConnectivity computes the whole network connectivity kTotal,
# the within module connectivity (kWithin). kOut=kTotal-kWithin and
# and kDiff=kIn-kOut=2*kIN-kTotal
25
ConnectivityMeasures=intramodularConnectivity(abs(cor(datExpr,use="p"))^beta1,co
lorh1)
names(ConnectivityMeasures)
[1] "kTotal" "kWithin" "kOut"
"kDiff"
# The following plots show the gene significance vs intromodular connectivity
colorlevels=levels(factor(colorh1))
par(mfrow=c(4,3))
for (i in 1:12) {
whichmodule=colorlevels[i];restrict1=colorh1==whichmodule
verboseScatterplot(ConnectivityMeasures$kWithin[restrict1],GeneSignificance[restrict1,1],col=wh
ichmodule,main=whichmodule,xlab="Intramodular k", ylab="Gene Signif")
}
26
APPENDIX: Constructing an unweighted networks and comparing it to the weighted
nework.
Here we redo the network analysis using hard thresholding, i.e. dichotomizing the correlation
matrix. We show that our main biological findings are highly robust with respect to the network
construction method.
Use the scale free topology criterion for finding the hard threshold parameter tau.
thresholds1= c(seq(.1,.5, by=.1), seq(.55,.95, by=.05) )
TableHard=pickHardThreshold(datExpr, thresholds1)[[2]]
gc()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Cut
0.10
0.20
0.30
0.40
0.50
0.55
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
p.value scale.law.R.2 slope. truncated.R.2 mean.k. median.k. max.k.
2.47e-01
0.7750 2.460
0.8060 2290.000
2390
2880
1.96e-02
0.1260 0.545
0.0153 1450.000
1540
2330
3.88e-04
-0.1110 -0.124
0.0280 875.000
898
1820
1.40e-06
0.2880 -0.852
0.8180 478.000
454
1320
5.74e-10
0.6120 -1.340
0.9660 228.000
189
823
4.05e-12
0.6690 -1.520
0.9580 149.000
111
633
1.18e-14
0.7310 -1.680
0.9750
93.100
68
484
0.00e+00
0.7750 -1.700
0.9880
55.100
35
344
0.00e+00
0.7250 -1.910
0.9710
30.200
15
242
0.00e+00
0.6730 -1.680
0.8400
15.600
5
147
0.00e+00
0.6530 -1.500
0.6040
7.930
1
106
0.00e+00
0.0658 -1.600
0.0294
4.290
0
100
0.00e+00
0.6990 -0.987
0.8320
2.280
0
88
0.00e+00
0.9590 -0.996
0.9620
0.558
0
42
To choose a cut-off value tau, we propose to use the Scale-free Topology Criterion (Zhang and
Horvath 2005). Here the focus is on the linear regression model
fitting index (denoted as scale.law.R.2) that quantify the extent of how well
a network satisfies a scale-free topology. We choose the cut value (tau) of 0.7 for the correlation
matrix since this is where the R^2 curve seems to saturates. From the above table, we find that the
resulting slope looks OK (negative and between -1 and -2), and the mean number of connections
looks good Below we investigate different choices of tau.
27
1.0
par(mfrow=c(1,1))
plot(thresholds1,
-sign(TableHard[,4])*TableHard[,3], type="n",ylab="Scale Free Topology R^2",xlab="Hard
Threshold tau", ylim=range(min(c( -sign(TableHard[,4])*TableHard[,3]),na.rm=T),1) )
text(thresholds1, -sign(TableHard[,4])*TableHard[,3], labels= thresholds1, col="black")
abline(h=.8)
0.95
0.6
0.65
0.55
0.9
0.75 0.8
0.4
0.0
0.85
0.2
0.3
-0.5
Scale Free Topology R^2
0.5
0.5
0.7
0.1
0.2
0.4
0.6
0.8
Hard Threshold tau
28
tau1=.65 # this parameter is hard threshold parameter.
#Let’s define the adjacency matrix of an unweighted network
ADJHARD = I(abs(cor(datExpr[,],use="p"))>tau1)+0.0
gc()
# This is the unweighted connectivity
ConnectivityHard =as.vector(apply(ADJHARD,2,sum))
scaleFreePlot(ConnectivityHard,truncated=T,main=paste("tau=",as.character(tau1)))
-1.5
-2.0
-2.5
log10(p(k))
-1.0
-0.5
tau= 0.65 , scale free R^2= 0.78 , slope= -1.74 , trunc.R^2= 0.99
1.2
1.4
1.6
1.8
2.0
2.2
2.4
log10(k)
29
# Let’s compare weighted to unweighted connectivity in a scatter plot
verboseScatterplot(ConnectivityHard, Connectivity,xlab="Unweighted
Connectivity",ylab="Weighted Connectivity", col= as.character(colorh1))
# Comments:
 the connectivity measures is highly preserved between weighted and unweighted networks
but there are marked differences for the brown module.
30
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
dissTOMhard=TOMdist(ADJHARD)
gc()
# Now we carry out hierarchical clustering with the TOM matrix.
hierTOMhard = hclust(as.dist(dissTOMhard),method="average");
#Next, we study whether the `soft’ modules of the unweighted network described above can also be
#found in the unweighted network
# The following shows the hierarchical tree based on the unweighted network but the
# genes are colored according to their membership in the weighted network
par(mfrow=c(2,1))
plot(hierTOMhard, main="Unweighted Network Module Tree ", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOMhard, colors=colorh1)
title("Dynamic Colors, weighted network")
Comment: Overall the colors stay together. This is particularly true for the blue module, which is
the main interest of our paper. This demonstrates that the module assignment is robust with respect
to the network construction method.
31
# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Here we use the unweighted module tree but color it by the weighted modules.
TOMplot(dissTOMhard , hierTOMhard, as.character(colorh1))
gc()
#Comment: module assignment is highly preserved.
ConnectivityMeasuresHARD=intramodularConnectivity(ADJHARD,colorh1)
32
Appendix: Computation of the cluster coefficient
Although, we don’t discuss the clustering coefficient in our main article, we briefly mention it here
since it is an important network concept.
The cluster coefficient measures the cliquishness of a gene.
While we don’t use the clustering coefficient in our manuscript, we report it here for the sake of
completeness.
# First, we start with the weighted network
cluster.coef= clusterCoef(ADJ)
gc()
# Now we plot cluster coefficient versus connectivity
# for all genes
par(mfrow=c(1,1),mar=c(2,2,2,1))
plot(Connectivity,
cluster.coef,col=as.character(colorh1),xlab="Connectivity",ylab="Cluster
Coefficient")
Overall, we find that the clustering coefficient in a weighted network is roughly constant for highly
connected genes inside of a given module. Across modules the clustering coefficient varies a lot.
33
# Now we compute the CC for the unweighted network
diag(ADJHARD)=0
cluster.coefHARD= clusterCoef(ADJHARD)
ConnectivityHARD= apply(ADJHARD,2,sum)
par(mfrow=c(1,1))
plot(ConnectivityHARD,cluster.coefHARD,col=as.character(colorh1),xlab="Connectivity",ylab="
Cluster Coefficient" )
# There is a marked difference between the weighted network and the unweighted network when it
comes to the relationship between clustering coefficient and connectivity. This is further discussed
in Zhang and Horvath 2005 and the following reference:
Horvath and Dong, Yip (2008) PloS Comp Biol
THE END
To cite the code and methods in this manual, please use
Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network
Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17.
http://www.bepress.com/sagmb/vol4/iss1/art17
34
Download