Tova Fuller, Steve Horvath
Correspondence: suprtova@ucla.edu
, shorvath@mednet.ucla.edu
Abstract
Here we illustrate differential network analysis by comparing the connectivity and module structure of two networks based on the liver expression data of lean and heavy mice. This unbiased method for comparing two phenotypically distinct subgroups of mouse samples serves as a method for understanding the underlying differential gene co-expression network topology giving rise to altered biological pathways.
This work is in press:
Tova Fuller, Anatole Ghazalpour, Jason Aten, Thomas A. Drake, Aldons J. Lusis, Steve
Horvath (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome, in press.
The data are described in:
Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang,Eric E. Schadt, Thomas
A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network
Analysis to Characterize Genes Related to Mouse Weight. PloS Genetics
We provide the statistical code used for generating the weighted gene co-expression network results. Thus, the reader be able to reproduce all of our findings. This document also serves as a tutorial to differential weighted gene co-expression network analysis. Some familiarity with the R software is desirable but the document is fairly self-contained. This document and data files can be found at the following webpage: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysi s
More material on weighted network analysis can be found here: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Method Description:
The data are described in the PLoS article cited above [1]. Please also refer to the citations above and below for more information regarding weighted gene co-expression network analysis (WGCNA).
Here we attempt to show that networks may be constructed from two phenotypically different subgroups of samples from a prior WGCNA experiment on mice. Here we identify 30 mice at both extremes of the weight spectrum in the BxH data and construct weighted gene coexpression networks from each.
Network Construction:
In co-expression networks, network nodes correspond to genes and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the
Pearson correlation preserves the gene co-expression information and leads to weighted coexpression networks that are highly robust with respect to the construction method [2].
The network construction algorithm is described in detail elsewhere [2]. Briefly, a gene coexpression similarity measure (absolute value of Pearson’s product moment correlation) was used to relate every pairwise gene-gene relationship. An adjacency matrix was then constructed using a `soft’ power adjacency function a ij
= Power( s ij
,
)
| s ij
|
where s ij is the co-expression similarity, and a ij represents the resulting adjacency that measures the connection strengths. The power
is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of
= 6 based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power [2].
Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure or more precisely a measure of node dissimilarity based on the topological overlap matrix [2]. Specifically the topological overlap matrix is given by
ij
l ij
a ij k k i j
a ij where l ij
u a iu a uj
denotes the number of nodes to which both i and j are connected, and u indexes the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ω ij
].
ω ij
is a number between 0 and 1 and is symmetric (i.e, ω ij
= ω ji
). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.
Network Module Identification.
Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix Ω=[ω ij
] defined above. To use it in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e, the topological overlap based dissimilarity measure is defined by d ij
1
ij
). Based on the dissimilarity matrix we can use hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in hierarchical clustering dendrogram (the details of the algorithm could be found at http://www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut ).
The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive
process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal.
A comparison of different module construction methods is beyond the scope of this paper.
Network Comparison Measures:
For the i th
gene, we denote the whole network connectivity by k
1
(i) and k
2
(i) in networks 1 and 2, respectively. To facilitate the comparison between the connectivity measures of each network, we divide each gene connectivity by the maximum network connectivity, i.e.
K
1
( i ) k
1
( i ) max( k
1
)
and
K
2
( i ) k
2
( i ) max( k
2
)
. We utilize the following measure of differential connectivity as DiffK (i) = K
1
(i) – K
2
(i), but other measures of differential connectivity could also be considered. To measure differential gene expression between the lean and heavy mice,
Sector plots and permutation test:
Plotting DiffK versus the t-test statistic value for each gene gives a visual demonstration of how difference in connectivity relates to a more traditional t statistic describing difference in expression between the two networks.
To determine whether membership in each of these sectors was significant, I permute high or low weight status among our 60 mice in Networks 1 and 2, and reconstruct each network accordingly. I obtained values DiffK perm and ttest perm as shown above, taking into account permuted high or low weight status.
This permutation process was repeated 1000 times ( no.perm
= 1000), each time noting the amount of genes that could be found in each of the m = 8 sectors as perm n m
. Similarly, the sector counts for unpermuted, observed data was noted as each sector representing the number of iterations for which sector was found as
N m
1 no .
perm
1
.
Functional Analysis n obs m
. Then, I found a value N m
for obs n m
n perm m
. The p -value for each and Integrated
(http://david.niaid.nih.gov/david/ease.htm).
1.
2.
3.
Discovery (DAVID) [3]
Ghazalpour, A., et al., Integrating genetic and network analysis to characterize genes related to mouse weight.
PLoS Genet, 2006. 2 (8): p. e130.
Zhang, B. and S. Horvath, A general framework for weighted gene co-expression network analysis.
Stat Appl Genet Mol Biol, 2005. 4 : p. Article17.
Dennis, G., Jr., et al., DAVID: Database for Annotation, Visualization, and Integrated
Discovery.
Genome Biol, 2003. 4 (5): p. P3.
Statistical References
To cite this tutorial or the statistical methods please use
1.
Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression
Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4:
No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
For the generalized topological overlap matrix as applied to unweighted networks see
2.
Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006,
Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/ .
For some additional theoretical insights consider
3.
Horvath S, Dong J, Yip A (2006) The Relationship between Intramodular Connectivity and Gene Significance. Proceedings Volume. Biocomp Conference 2006, Las Vegas.
Technical report at http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/
4.
Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and
Other Network Concepts in Co-expression Networks. Submitted.
# Absolutely no warranty on the code. Please contact TF or SH with suggestions.
# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example to install Hmisc, open R,
# go to menu "Packages\Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.
# To get this tutorial and data files, go to the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysi s
# Download the zip file containing:
# 1) R function file: "NetworkFunctions.txt", which contains several R functions
# needed for Network Analysis.
# Unzip all the files into the same directory,
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries library(MASS) # standard, no need to install library(class) # standard, no need to install library(cluster) library(sma) # install it for the function plot.mat library(impute)# install it for imputing missing value library(scatterplot3d)
# Note: alter the following file paths to point towards where your files are. source("/Users/TovaFuller/Documents/HorvathLab2006/NetworkFunctions/Network
Functions.txt") setwd("/Users/TovaFuller/Documents/HorvathLab2007/MouseProject2.0/DiffNetwo rkAnalysis/")
# The following 3421 probe set were arrived at using the following steps
#1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodule s.xls",header=T) names(dat0)
# this contains information on the genes datSummary=dat0[,c(1:8,144:150)]
# the following data frame contains the gene expression data: columns are genes, rows are
# arrays (samples) datExpr <- t(dat0[,9:143]) no.samples <- dim(datExpr)[[1]] dimnames(datExpr)[[2]]=datSummary[,1] dim(datExpr)
# We read in the clinical data datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",heade r=T)
#Now we order the mice so that trait file and expression file agree restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]]) table(restrictMice) datClinicalTraits=datClinicalTraits[restrictMice,] orderMiceTraits=order(datClinicalTraits$MiceID) orderMiceExpr=order(dimnames(datExpr)[[1]]) datClinicalTraits =datClinicalTraits[orderMiceTraits,] datExpr =datExpr[orderMiceExpr,]
# From the following table, we verify that all 135 mice are in order table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]]) rm(dat0);collect_garbage()
BodyWeight=as.numeric(datClinicalTraits$WeightG)
# Now we find the 30 heaviest and leanest mice. Network 1 will refer to modules defined by
# the 30 leanest mice, and Network 2 will refer to modules defined by the 30 heaviest mic. rest1=rank(BodyWeight,ties="first")<=30 rest2=rank(-BodyWeight,ties="first")<=30
# We check to make sure there are 30 of each, with no overlap between the groups table(rest1) table(rest2) table(rest1==T & rest2==T)
# FALSE
# 135
# We separate the expressions for the fat and lean groups: datExpr1=datExpr[rest1,] datExpr2=datExpr[rest2,]
# Now we find the whole network connectivity measures for each: k1=SoftConnectivity(datExpr1,6) k2=SoftConnectivity(datExpr2,6)
# We would like to normalize the connectivity measures. We do this by dividing by the
# maximum values.
K1=k1/max(k1)
K2=k2/max(k2)
# Now we find the difference between the two connectivity values.
DiffK=K1-K2 # Note that we did not take the absolute value here.
# Negative values of this difference imply that normalized Lean connectivity (k2) is
# greater than K1, fat normalized connectivity. poolRest=as.logical(rest1+rest2) factorLevels=rep(NA,length(poolRest)) factorLevels[rest1]="lowWeight" # trait 1 is low weight factorLevels[rest2]="highWeight" # trait 2 is high weight ttest=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic. for (i in 1:dim(datExpr)[[2]]){ ttest[i]=t.test(datExpr[poolRest,i]~factorLevels[poolRest])$statistic
}
# Permuted Status
# Let's create a control DiffK with permuted fat/lean status. We choose from the same
# mice that were considered in the fat and lean groups. Note results from this following
# analysis will be different each time because we are sampling randomly. temp1=sample(which(poolRest==T),30,replace=F)
Rest1=rep(F,length(poolRest))
Rest1[temp1]=T table(Rest1)
Rest2=as.logical(poolRest-Rest1) table(Rest2) table(Rest1==T & Rest2==T)
# FALSE
# 135 good! factorLevels2=rep(NA,length(poolRest)) factorLevels2[Rest1]="permLowWeight" factorLevels2[Rest2]="permHighWeight" datExpr1.2=datExpr[Rest1,] datExpr2.2=datExpr[Rest2,]
# Now we find the whole network connectivity measures for each: k1.2=SoftConnectivity(datExpr1.2,6) k2.2=SoftConnectivity(datExpr2.2,6)
# We would like to normalize the permuted connectivity measures. We do this by
# dividing by the maximum values.
K1.2=k1.2/max(k1.2)
K2.2=k2.2/max(k2.2)
# Now we find the difference between the two connectivity values.
DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here. ttest2=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic. for (i in 1:dim(datExpr)[[2]]){ ttest2[i]=t.test(datExpr[poolRest,i]~factorLevels2[poolRest])$statistic
}
# Plotting DiffK versus ttest in Unpermuted and Permuted par(mfrow=c(1,2)) plot(DiffK,ttest, main=paste("Unpermuted: cor=", signif(cor(DiffK,ttest, use="pairwise.complete.obs") ,3)),xlim=range(DiffK),ylim=range(ttest)) # colorGroup1 ?? abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4) plot(DiffK.2,ttest2, main=paste("Permuted: cor=", signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)), xlim=range(DiffK),ylim=range(ttest))
# colorGroup1 ?? abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4)
1 2 3
# We repeat this permutation 1000 times, each time
# counting the amount of "dots" in the quadrants
8
4
# shown at left.
# Now we wish to obtain a p-value for the DiffK
# when there is permutation versus when there is
7
6 5
# not.
# Here are the sector counts for unpermuted: sector1obs = (ttest>1.96) & (DiffK<(-0.4)) n1obs=sum(sector1obs) sector2obs = (ttest>1.96) &(DiffK>(-0.4)) & (DiffK<0.4) n2obs=sum(sector2obs) sector3obs = (ttest>1.96) & (DiffK>0.4) n3obs=sum(sector3obs) sector4obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK>0.4) n4obs=sum(sector4obs) sector5obs = (ttest<(-1.96)) & (DiffK>0.4) n5obs=sum(sector5obs) sector6obs = (ttest<(-1.96)) &(DiffK>(-0.4)) & (DiffK<0.4) n6obs=sum(sector6obs) sector7obs = (ttest<(-1.96)) & (DiffK<(-0.4)) n7obs=sum(sector7obs) sector8obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK<(-0.4))
n8obs=sum(sector8obs)
# We will find a vector of these same counts for each permutation, but the niobs value
# will always stay the same (scalar) no.perms=1000
# We started with 100, and then redid the experiment with no.perms=1000 n1perm=rep(NA,no.perms) n2perm=rep(NA,no.perms) n3perm=rep(NA,no.perms) n4perm=rep(NA,no.perms) n5perm=rep(NA,no.perms) n6perm=rep(NA,no.perms) n7perm=rep(NA,no.perms) n8perm=rep(NA,no.perms)
# We turn now to obtaining sector counts for each permutation for (i in c(1:no.perms)) { temp1=sample(which(poolRest==T),30,replace=F)
Rest1=rep(F,length(poolRest))
Rest1[temp1]=T
Rest2=as.logical(poolRest-Rest1) factorLevels2=rep(NA,length(poolRest)) factorLevels2[Rest1]="fakefat" factorLevels2[Rest2]="fakelean" datExprFat.2=datExpr[Rest1,] datExprLean.2=datExpr[Rest2,] k1.2=SoftConnectivity(datExprFat.2,6) k2.2=SoftConnectivity(datExprLean.2,6)
K1.2=k1.2/max(k1.2)
K2.2=k2.2/max(k2.2)
DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here. ttest2=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic. for (j in 1:dim(datExpr)[[2]]){ ttest2[j]=t.test(datExpr[poolRest,j]~factorLevels2[poolRest])$statistic
} sector1perm = (ttest2>1.96) & (DiffK.2<(-0.4)) n1perm[i]=sum(sector1perm) sector2perm = (ttest2>1.96) &(DiffK.2>(-0.4)) & (DiffK.2<0.4) n2perm[i]=sum(sector2perm) sector3perm = (ttest2>1.96) & (DiffK.2>0.4) n3perm[i]=sum(sector3perm) sector4perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2>0.4) n4perm[i]=sum(sector4perm) sector5perm = (ttest2<(-1.96)) & (DiffK.2>0.4) n5perm[i]=sum(sector5perm) sector6perm = (ttest2<(-1.96)) &(DiffK.2>(-0.4)) & (DiffK.2<0.4) n6perm[i]=sum(sector6perm) sector7perm = (ttest2<(-1.96)) & (DiffK.2<(-0.4)) n7perm[i]=sum(sector7perm) sector8perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2<(-0.4)) n8perm[i]=sum(sector8perm)
}
logicalSum1=sum(n1obs<=n1perm) logicalSum2=sum(n2obs<=n2perm) logicalSum3=sum(n3obs<=n3perm) logicalSum4=sum(n4obs<=n4perm) logicalSum5=sum(n5obs<=n5perm) logicalSum6=sum(n6obs<=n6perm) logicalSum7=sum(n7obs<=n7perm) logicalSum8=sum(n8obs<=n8perm) pval1=(logicalSum1+1)/(no.perms+1) pval1 pval2=(logicalSum2+1)/(no.perms+1) pval2 pval3=(logicalSum3+1)/(no.perms+1) pval3 pval4=(logicalSum4+1)/(no.perms+1) pval4 pval5=(logicalSum5+1)/(no.perms+1) pval5 pval6=(logicalSum6+1)/(no.perms+1) pval6 pval7=(logicalSum7+1)/(no.perms+1) pval7 pval8=(logicalSum8+1)/(no.perms+1) pval8
# For no.perms=100:
# pval1
# [1] 0.5940594
# pval2
# [1] 0.00990099
# pval3
# [1]0.00990099
# pval4
# [1] 0.930693
# pval5
# [1] 0.00990099
# pval6
# [1] 0.00990099
# pval7
# [1] 0.4950495
# pval8
# [1] 0.8118812
# For no.perms=1000 pval1
[1] 0.4645355 pval2
[1] 0.000999001 pval3
[1] 0.000999001 pval4
[1] 0.947053 pval5
[1] 0.00999001 pval6
[1] 0.000999001 pval7
[1] 0.3346653 pval8
[1] 0.7972028
# Because we are dividing by no.perms, the p-values are 10-fold different for 100 or 1000
# iterations.
# The result we achieve makes sense; regions with significantly high or low ttest values
# (p < 0.05) tend to be significant. However, since there are few data points that have a
# DiffK value < -0.4, the upper and lower right hand corner sectors do not prove
# significant. This could be because DiffK is defined as "normalized low weight
# connectivity" minus "normalized high weight connectivity" and there are few genes that in
# fact are more connected in the fat than in the lean that also have extreme ttest scores.
# Module Identification beta1=6
# Note: we use a beta1 = 6 as this was the value used in our previous analysis
# Finding hierGTOM in our Network 1 (low weight). datExpr1=data.frame(datExpr1)
AdjMat1=abs(cor(datExpr1,use="p"))^beta1 collect_garbage() dissGTOM1=TOMdist1(AdjMat1) collect_garbage() hierGTOM1 <- hclust(as.dist(dissGTOM1),method="average") par(mfrow=c(1,1)) plot(hierGTOM1,labels=F) rm(AdjMat1)
# Finding hierGTOM in our Network 2 (high weight). datExpr2=data.frame(datExpr2)
AdjMat2=abs(cor(datExpr2,use="p"))^beta1 collect_garbage() dissGTOM2=TOMdist1(AdjMat2) collect_garbage() hierGTOM2 <- hclust(as.dist(dissGTOM2),method="average") par(mfrow=c(1,1)) plot(hierGTOM2,labels=F) rm(AdjMat2)
# Dynamic Cut-Tree Algorithm
# "We used a dynamic cut-tree algorithm for selection branches of the hierarchical
# clustering dendrogram (the details of the algorithm can be found at the following link:
# www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut. The algorithm takes
# into account an # essential feature of cluster occurrence and makes use of the internal
# structure in a dendrogram. Specifically, the algorithm is based on an adaptive process
# of cluster decomposition and combination and the process is iterated until the number
# of clusters becomes stable. " (From first mouse tutorial) myheightcutoff = 0.98 mydeepSplit = FALSE # fine structure within module myminModuleSize = 90 # modules must have this minimum number of genes
# Note that we get a "too many modules" warning if we have a smaller minimum size, or
# higher height cutoff.
# new way for identifying modules based on hierarchical clustering dendrogram color1=cutreeDynamic(hierclust= hierGTOM1 , deepSplit=mydeepSplit,maxTreeHeight
=myheightcutoff,minModuleSize=myminModuleSize) table(color1) color1 turquoise blue brown yellow green red
355 336 271 255 241 224
black pink magenta purple greenyellow tan
218 184 155 139 126 117
salmon cyan midnightblue grey
113 110 92 485 pdf(file="modules1.pdf") par(mfrow=c(2,1)) plot(hierGTOM1, main="BXH Mouse Network 1",labels=F,xlab="",sub=""); hclustplot1(hierGTOM1,color1,title1="Colored by Network 1 Modules") dev.off()
# Now we define modules in the Network 2. color2=cutreeDynamic(hierclust= hierGTOM2 , deepSplit=mydeepSplit,maxTreeHeight
=myheightcutoff,minModuleSize=myminModuleSize) table(color2) color2
turquoise blue brown yellow green red
337 243 234 210 184 184
black pink magenta purple greenyellow tan
175 172 161 116 108 102
salmon cyan midnightblue lightcyan grey60 lightgreen
102 101 100 98 92 90
grey
612 pdf(file="modules2.pdf") par(mfrow=c(2,1)) plot(hierGTOM2, main="BXH Mouse Network 2",labels=F,xlab="",sub=""); hclustplot1(hierGTOM2,color2,title1="Colored by Network 2 Modules") dev.off()
# We also want to look at modules in Network 2 colored by Network 1: pdf(file="modules2by1.pdf") par(mfrow=c(2,1)) plot(hierGTOM2, main="BXH Mouse Network 2",labels=F,xlab="",sub=""); hclustplot1(hierGTOM2,color1,title1="Colored by Network 1 Modules") dev.off()
# We look at DiffK versus ttest in different modules. par(mfrow=c(1,2)) plot(DiffK,ttest,col=as.character(color1),main="Colored by Network 1") plot(DiffK,ttest,col=as.character(color2),main="Colored by Network 2")
# Here we create a Network 1 plot that is colored by module definitions, with sectors
# delineated. par(mfrow=c(1,1)) plot(DiffK,ttest,col=as.character(color1),main=paste("Colored by Network 1
Modules, cor=", signif(cor(DiffK,ttest,use="pairwise.complete.obs"),3)), xlim=range(DiffK),ylim=range(ttest),xlab="Difference in
Connectivity",ylab="t-test statistic") abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4) text(c(-0.5,-0.2,0.7,0.7,0.7,-0.2,-0.5,-0.5),c(7,7,7,0,-9,-9,-
9,0),labels=c("1","2","3","4","5","6","7","8"),cex=3)
# Here we create a Network 2 plot. par(mfrow=c(1,1)) plot(DiffK.2,ttest2, main=paste("Permuted, cor=", signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)), xlim=range(DiffK),ylim=range(ttest),xlab="Difference in
Connectivity",ylab="t-test statistic") abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4)
# Here we create the permuted plot that appears in the Mammalian Genome article: par(mfrow=c(1,2),cex.lab=1.3,mar=c(5,5,4,2)+0.1) plot(DiffK,ttest,col=as.character(color1),main=paste("Colored by Network 1
Modules, cor=", signif(cor(DiffK,ttest,use="pairwise.complete.obs"),3)), xlim=range(DiffK),ylim=range(ttest),xlab="Difference in
Connectivity",ylab="t-test statistic") abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4) text(c(-0.5,-0.2,0.7,0.7,0.7,-0.2,-0.5,-0.5),c(7,7,7,0,-9,-9,-
9,0),labels=c("1","2","3","4","5","6","7","8"),cex=3) plot(DiffK.2,ttest2, main=paste("Permuted, cor=", signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)), xlim=range(DiffK),ylim=range(ttest),xlab="Difference in
Connectivity",ylab="t-test statistic") abline(h=1.96) abline(h=-1.96) abline(v=.4) abline(v=-.4)
# FUNCTIONAL ANALYSIS OF HIGHLY DIFFERENTIAL MODULES
The yellow module in lean mice and the grey module in fat mice is particularly interesting because it is member to sector 3 – which is found to be significantly different in DiffK versus ttest at the p=9.9e-4 level. Furthermore, high connectivity in low weight mice in alongside low connectivity in high weight mice demonstrates that network structure either affects or is affected by mouse weight. We use an EASE analysis of these genes, looking at:
1.
yellow genes as defined by lean modules
2.
the intersection of yellow genes as defined by network 1 modules and grey genes defined by network 2 modules
3.
#1, limited by sector 3 boundaries
4.
#2, limited by sector 3 boundaries.
# Let's find the genes we are interested in:
# Number 1 (see above)
Yellow1=color1=="yellow" table(Yellow1)
# Yellow1
# FALSE TRUE
# 3166 255
Grey2=color2=="grey" table(Grey2)
# Grey2
# FALSE TRUE
# 2809 612
# Number 2
Y1G2=Yellow1 & Grey2 table(Y1G2)
# FALSE TRUE
# 3250 171
# Number 3
Y1S3= ((ttest>1.96) & (DiffK>0.4) & Yellow1) table(Y1S3)
# YleanS3
# FALSE TRUE
# 3342 79
# Number 4
Y1G2S3=ttest>1.96 & (DiffK>0.4) & Y1G2 table(Y1G2S3)
# Y1G2S3
# FALSE TRUE
# 3360 61
# dimnames(datExpr)[[2]] are the gene names. We can use the "locus link
# IDs", otherwise known as Entrez Gene IDs in the DAVID database to get a functional
# annotation.
EntrezGeneIDs=datSummary[,3]
# Background write.table(EntrezGeneIDs,"background.txt",sep=" ",row.names=F,col.names=F)
# Number 1 write.table(datSummary[,3][Yellow1],"Yellow1.txt", sep="
",row.names=F,col.names=F)
# Number 2 write.table(datSummary[,3][Y1G2],"Y1G2.txt", sep="
",row.names=F,col.names=F)
# Number 3 write.table(datSummary[,3][Y1S3],"Y1S3.txt", sep="
",row.names=F,col.names=F)
# Number 4 write.table(datSummary[,3][Y1G2S3],"Y1G2S3.txt", sep="
",row.names=F,col.names=F)
Because a 0 is placed in lieu of a missing ID, I manually opened these files and deleted "0" entries. Then I used the entire gene list of the 3421 most connected genes as a background and entered this information into DAVID for a functional analysis.
In the yellow lean gene list, 250 of the 255 yellow lean genes were recognized as mus musculus genes. 168 of yellow lean / grey fat out of 171 were recognized. All of the sector 1 genes, whether only yellow lean genes, or the intersection of lean and fat genes, were recognized.
The smallest group of genes – those that met criteria 4 were highly specific for extracellular matrix, extracellular region, epidermal growth factors and its related proteins, cell adhesion, and glycoproteins. On the next page you will find a table of these genes sorted by increasing p-value. Genes with p-value < 0.05 are shaded.
In criteria 4 we added the additional requirement of being a grey fat gene. The genes that do not meet this requirement (only meet criteria 3, being yellow genes in sector 1) are involved in blood vessel morphogenesis, metalloprotease inhibition, anion transport, or other similar functions seen in Criteria 4.
CRITERIA 4 GENE FUNCTIONAL ANNOTATION – yellow in network 1, grey in network
2, sector 3
Category
GOTERM_CC_ALL
SP_PIR_KEYWORDS
UP_SEQ_FEATURE
GOTERM_CC_ALL
GOTERM_BP_ALL
UP_SEQ_FEATURE
UP_SEQ_FEATURE
SP_PIR_KEYWORDS
UP_SEQ_FEATURE
SP_PIR_KEYWORDS
SP_PIR_KEYWORDS
GOTERM_BP_ALL
UP_SEQ_FEATURE
SP_PIR_KEYWORDS
GOTERM_MF_ALL
INTERPRO_NAME
UP_SEQ_FEATURE
SMART_NAME
SP_PIR_KEYWORDS
SMART_NAME
SP_PIR_KEYWORDS
GOTERM_MF_ALL
INTERPRO_NAME
INTERPRO_NAME
INTERPRO_NAME
INTERPRO_NAME
INTERPRO_NAME
Term extracellular region signal signal peptide extracellular space cell adhesion domain:EGF-like 1 glycosylation site:N-linked (GlcNAc...) glycoprotein domain:EGF-like 3 cell adhesion collagen cell-cell adhesion domain:EGF-like 2 immunoglobulin domain extracellular matrix structural constituent
IPR007110:Immunoglobulin-like disulfide bond
SM00181:EGF extracellular matrix
SM00179:EGF_CA egf-like domain structural molecule activity
IPR000742:EGF-like, type 3
IPR006210:EGF
IPR001881:EGF-like calcium-binding
IPR006209:EGF-like
IPR013032:EGF-like region
GOTERM_BP_ALL
INTERPRO_NAME
INTERPRO_NAME
GOTERM_BP_ALL
GOTERM_CC_ALL
GOTERM_CC_ALL
SMART_NAME
UP_SEQ_FEATURE regulation of signal transduction
IPR013106:Immunoglobulin V-set
IPR013111:EGF, extracellular negative regulation of signal transduction extracellular matrix (sensu Metazoa) extracellular matrix
SM00643:C345C domain:VWFC
INTERPRO_NAME
INTERPRO_NAME
SP_PIR_KEYWORDS
GOTERM_BP_ALL
IPR013091:EGF calcium-binding
IPR008160:Collagen triple helix repeat calcium binding phosphate transport
UP_SEQ_FEATURE
SP_PIR_KEYWORDS domain:TSP N-terminal duplication
PIR_SUPERFAMILY_NAME SF005770:regulator of G protein signaling, RGS4 type
GOTERM_BP_ALL organ morphogenesis
INTERPRO_NAME IPR001134:Netrin, C-terminal write.table(datSummary[Y1G2S3,],"Y1G2S3summary.csv",sep=",")
Count %
23 37.70%
22 36.07%
22 36.07%
21 34.43%
10 16.39%
5 8.20%
21 34.43%
21 34.43%
4 6.56%
7 11.48%
5 8.20%
5 8.20%
4 6.56%
7 11.48%
4 6.56%
7 11.48%
16 26.23%
5 8.20%
5 8.20%
4 6.56%
5 8.20%
7 11.48%
5 8.20%
5 8.20%
4 6.56%
5 8.20%
5 8.20%
4 6.56%
4 6.56%
3 4.92%
3 4.92%
5 8.20%
5 8.20%
2 3.28%
2 3.28%
3 4.92%
3 4.92%
3 4.92%
3 4.92%
2 3.28%
3 4.92%
2 3.28%
5 8.20%
2 3.28%
PValue
1.75E-04
3.55E-04
5.39E-04
5.69E-04
7.69E-04
8.66E-04
0.00123567
0.001564983
0.001605736
0.001704503
0.001819918
0.00376904
0.006019022
0.006493147
0.009692797
0.010018458
0.011240382
0.012956896
0.014607793
0.014804082
0.016893234
0.017782941
0.018214759
0.02067224
0.02409503
0.029234871
0.029234871
0.033918607
0.043000016
0.043093768
0.044205578
0.0465188
0.0465188
0.060483805
0.067182734
0.074400865
0.0801662
0.082536708
0.088210508
0.088585807
0.088868263
0.089493679
0.094055231
0.095041582