Tutorial
Weighted Gene Coexpression Network Analysis
BxH validation with BxD data
Tova Fuller, Steve Horvath
Correspondence: suprtova@ucla.edu, shorvath@mednet.ucla.edu
Contents of this tutorial:
1. Gene co-expression network construction
2. Module definition based on average linkage hierarchical clustering with the dynamic tree cut
algorithm, and studying module preservation
3. Finding connectivity and gene significance measures, and studying preservation across data
sets as well as relationships between these measures within each data set.
4. Obtaining linear models explaining variance in weight in each data set.
Abstract
Here we apply weighted gene co-expression network analysis (WGCNA) to
expression and genotype data from a previously studied BxH F2 mouse intercross as well as a new BxD
cross. Specifically, we use WGCNA methods to
demonstrate preservation of modules, intramodular connectivity and gene significance. We also obtain
linear models in both crosses using a module QTL identified in the BxH data that resides on
chromosome 19.
This work is in press:
Tova Fuller, Anatole Ghazalpour, Jason Aten, Thomas A. Drake, Aldons J. Lusis, Steve Horvath
(2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm
Genome, in press.
The data are described in:
Anatole Ghazalpour, Sudheer Doss, Bin Zhang, Susanna Wang, Eric E. Schadt, Thomas A. Drake,
Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network Analysis to
Characterize Genes Related to Mouse Weight. PLoS Genetics
We provide the statistical code used for generating the weighted gene co-expression network results.
Thus, the reader should be able to reproduce all of our findings. This document also serves as a tutorial to
differential weighted gene co-expression network analysis. Some familiarity with the R software is
desirable but the document is fairly self-contained. This document and data files can be found at the
following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis
More material on weighted network analysis can be found here:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Method Description:
The data are described in the PLoS article cited above [1]. Please also refer to the citations above and
below for more information regarding weighted gene co-expression network analysis (WGCNA).
Here we attempt to show that networks may be constructed from two phenotypically different
subgroups of samples from a prior WGCNA experiment on mice. Here we identify 30 mice at both
extremes of the weight spectrum in the BxH data and construct weighted gene co-expression networks
from each.
Network Construction:
In co-expression networks, network nodes correspond to genes and connection strengths are determined
by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted
networks use soft thresholding of the Pearson correlation matrix for determining the connection
strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene co-expression information and leads to weighted co-expression networks that are highly robust with
respect to the construction method [2].
The network construction algorithm is described in detail elsewhere [2]. Briefly, a gene co-expression
similarity measure (the absolute value of Pearson's product moment correlation) was used to relate every
pair of genes. An adjacency matrix was then constructed using a ‘soft’ power
adjacency function $a_{ij} = \mathrm{Power}(s_{ij}, \beta) \equiv |s_{ij}|^{\beta}$, where $s_{ij}$ is the co-expression similarity and $a_{ij}$ is
the resulting adjacency, which measures the connection strength. The power $\beta$ is chosen using the scale
free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such that the
resulting network exhibited approximate scale free topology and a high mean number of connections.
The scale free topology criterion led us to choose a power of $\beta = 6$ based on the preliminary network
built from the 8000 most varying genes. However, since we are using a weighted network as opposed to
an unweighted network, the biological findings are highly robust with respect to the choice of this
power [2].
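As an illustration of the soft-thresholding step, the following minimal R sketch builds a weighted adjacency matrix from simulated expression data. The object names (datExprSim, similarity, adjacency) and the simulated values are assumptions made for this example and are not part of the tutorial data; only the power of 6 is taken from the text.

```r
# Minimal sketch with simulated data (illustrative names, not the tutorial's objects)
set.seed(1)
nSamples <- 20
nGenes <- 5
# Rows are samples, columns are genes
datExprSim <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
colnames(datExprSim) <- paste0("gene", 1:nGenes)

beta <- 6  # soft-thresholding power used in the text

# Co-expression similarity: absolute Pearson correlation between gene columns
similarity <- abs(cor(datExprSim, use = "pairwise.complete.obs"))

# Soft-thresholded adjacency: a_ij = |s_ij|^beta
adjacency <- similarity^beta
round(adjacency, 3)
```

Because the similarity lies between 0 and 1, raising it to a power above 1 shrinks weak correlations towards zero while keeping every pairwise connection in the network.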
Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure or more precisely a measure
of node dissimilarity based on the topological overlap matrix [2]. Specifically the topological overlap
matrix is given by
$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$
where $l_{ij} = \sum_u a_{iu} a_{uj}$ denotes the number of nodes to which both i and j are connected, and u indexes
the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ωij]. ωij is a number
between 0 and 1 and is symmetric (i.e., ωij = ωji). The rationale for considering this similarity measure is
that nodes that are part of highly integrated modules are expected to have high topological overlap with
their neighbors.
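The topological overlap formula can be sketched in a few lines of base R. The function name tomSimilarity and the toy adjacency matrix are hypothetical and chosen for illustration; this is not the TOMdist1 function from NetworkFunctions.txt, although it is intended to follow the same definition.

```r
# Hypothetical helper: topological overlap from a symmetric adjacency matrix
tomSimilarity <- function(adj) {
  diag(adj) <- 0                    # ignore self-connections
  k <- rowSums(adj)                 # connectivity k_i = sum over u != i of a_iu
  l <- adj %*% adj                  # l_ij = sum over u of a_iu * a_uj
  kmin <- outer(k, k, pmin)         # min(k_i, k_j) for every pair
  tom <- (l + adj) / (kmin + 1 - adj)
  diag(tom) <- 1                    # a node overlaps fully with itself
  tom
}

# Toy adjacency matrix for three nodes
adj <- matrix(c(1.0, 0.8, 0.1,
                0.8, 1.0, 0.2,
                0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
tom <- tomSimilarity(adj)
dissTOM <- 1 - tom                  # dissimilarity used for clustering
round(tom, 4)
```

For nodes 1 and 2 the shared-neighbour term is 0.1 * 0.2 = 0.02, so the overlap is (0.02 + 0.8) / (0.9 + 1 - 0.8) = 0.82 / 1.1.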
Network Module Identification.
Gene "modules" are groups of nodes that have high topological overlap. Module identification was
based on the topological overlap matrix Ω=[ωij] defined above. To use it in hierarchical clustering, it
was turned into a dissimilarity measure by subtracting it from one (i.e, the topological overlap based
dissimilarity measure is defined by $d_{ij} = 1 - \omega_{ij}$). Based on the dissimilarity matrix we can use
hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm
for automatically and precisely identifying modules in a hierarchical clustering dendrogram (the details
of the algorithm can be found at
http://www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut).
The algorithm takes into account an essential feature of cluster occurrence and makes use of the
internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of
cluster decomposition and combination and the process is iterated until the number of clusters becomes
stable. No claim is made that our module construction method is optimal. A comparison of different
module construction methods is beyond the scope of this paper.
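The module identification step can be sketched with simulated data as follows. This is an illustration under stated assumptions: the data are simulated, the object names are invented for the example, and a static cut with cutree stands in for the dynamic cut-tree algorithm that the tutorial actually uses.

```r
# Simulate two groups of 10 co-expressed genes (illustrative data only)
set.seed(42)
nSamples <- 50
signal1 <- rnorm(nSamples)
signal2 <- rnorm(nSamples)
group1 <- sapply(1:10, function(i) signal1 + rnorm(nSamples, sd = 0.3))
group2 <- sapply(1:10, function(i) signal2 + rnorm(nSamples, sd = 0.3))
datExprSim <- cbind(group1, group2)   # rows are samples, columns are genes

# Weighted adjacency and topological overlap
adj <- abs(cor(datExprSim))^6
diag(adj) <- 0
k <- rowSums(adj)
tom <- (adj %*% adj + adj) / (outer(k, k, pmin) + 1 - adj)
diag(tom) <- 1

# Average linkage hierarchical clustering on the TOM-based dissimilarity
dissTOM <- 1 - tom
hier <- hclust(as.dist(dissTOM), method = "average")

# Static cut into two branches (a stand-in for the dynamic cut-tree algorithm)
moduleLabel <- cutree(hier, k = 2)
table(moduleLabel)
```

A static cut is adequate here because the two simulated groups are well separated; real dendrograms with nested branches are the case the dynamic algorithm is designed for.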
Intramodular connectivity and gene significance measures
The row sum of the adjacency matrix for a given gene i yields the network connectivity measure
(kall): $k_i = \sum_{u \neq i} a_{iu}$.
Analogously, the intramodular connectivity (kin) is found by summing the
adjacencies between gene i and the other genes in its module. Intramodular connectivity is an important concept for
identifying clinically relevant genes [3]. To measure intramodular connectivity, we find it
computationally convenient to define the module based connectivity, kME, as the absolute correlation between a
given gene expression profile and the module eigengene: kME(i)=|cor(x(i),ME)|. The module
eigengene is defined as the first principal component of the module's expression data and can be considered
the most representative gene expression profile of the module.
The gene significance with respect to a specific trait is referred to as GStrait, with GStrait of the ith gene
in the array equal to |cor(x(i), trait)|, where x(i) is the gene expression profile of the ith gene.
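The three measures can be illustrated on a small simulated module. In this sketch the module eigengene is taken as the first left singular vector of the scaled expression matrix, which is one common way to obtain a first principal component; the ModulePrinComps1 function used later in the tutorial may differ in detail. All object names and data are assumptions for the example.

```r
# Simulated module of 8 genes driven by one latent signal, plus a related trait
set.seed(7)
nSamples <- 40
signal <- rnorm(nSamples)
moduleExpr <- sapply(1:8, function(i) signal + rnorm(nSamples, sd = 0.5))
trait <- signal + rnorm(nSamples, sd = 1)

# Module eigengene: first left singular vector of the scaled data (samples x genes)
ME <- svd(scale(moduleExpr))$u[, 1]

# kME(i) = |cor(x(i), ME)|; the sign of ME is arbitrary, hence the absolute value
kME <- as.numeric(abs(cor(ME, moduleExpr, use = "pairwise.complete.obs")))

# GStrait(i) = |cor(x(i), trait)|
GStrait <- as.numeric(abs(cor(moduleExpr, trait, use = "pairwise.complete.obs")))

# Intramodular connectivity: row sums of the adjacency within the module
adj <- abs(cor(moduleExpr))^6
diag(adj) <- 0
kIn <- rowSums(adj)

round(cbind(kME, GStrait, kIn), 3)
```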
1. Ghazalpour, A., et al., Integrating genetic and network analysis to characterize genes related to
mouse weight. PLoS Genet, 2006. 2(8): p. e130.
2. Zhang, B. and S. Horvath, A general framework for weighted gene co-expression network
analysis. Stat Appl Genet Mol Biol, 2005. 4: p. Article17.
3. Horvath, S., J. Dong, and A.M. Yip, Connectivity, Module-Conformity, and Significance:
Understanding Gene Co-Expression Network Methods. UCLA Technical Report, 2006.
Statistical References
To cite this tutorial or the statistical methods please use
1. Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network
Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17.
http://www.bepress.com/sagmb/vol4/iss1/art17
For the generalized topological overlap matrix as applied to unweighted networks see
2. Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene
Co-expression Networks. Proceedings Volume. Biocomp Conference 2006, Las Vegas.
Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.
For some additional theoretical insights consider
3. Horvath S, Dong J, Yip A (2006) The Relationship between Intramodular Connectivity and
Gene Significance. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical
report at http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/
4. Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and Other
Network Concepts in Co-expression Networks. Submitted.
# Absolutely no warranty on the code. Please contact TF or SH with suggestions.
# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example to install Hmisc, open R,
# go to menu "Packages\Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.
# To get this tutorial and data files, go to the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis
# Download the zip file containing:
# 1) R function file: "NetworkFunctions.txt", which contains several R functions
# needed for Network Analysis.
# Unzip all the files into the same directory.
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# First we read in the libraries and source code we will need.
library(MASS)
library(class)
library(Hmisc)
library(sma)
library(impute)
library(scatterplot3d)
source("/Users/TovaFuller/Documents/HorvathLab2006/NetworkFunctions/NetworkFunctions.txt")
# We set our working directory.
setwd("/Users/TovaFuller/Documents/HorvathLab2007/MouseProject2.0/")
# Reading in and Processing Expression, Clinical and SNP data
# 1. Expression data
# First we read in our BXD and BXH expression data:
dataBXD=read.csv("Gene_Expression_All_Animalsfixed_BXD.csv", header=TRUE)
dataBXH=read.csv("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules_BXH.csv", header=TRUE)
# The following denotes mapmaker id's for the genes
rid=as.character(dataBXD [,1])
# Merging old, BxH color information
# Now we would like to merge old color information – denoting module membership in the BxH data
# set - to this new data set. This will be important later on in demonstrating module preservation.
RIDs=data.frame(rid)
colnames(RIDs)="mapmaker_id"
# STEP 1: Merge to match genes in BxH and BxD data sets.
rTOlocusid=read.csv("ridTOlocusid.csv",header=T)
colnames(rTOlocusid)=c("mapmaker_id2","locus_id")
# The column was renamed to "mapmaker_id2" above, so we refer to it by its full name.
table(is.element(RIDs$mapmaker_id,rTOlocusid$mapmaker_id2))
# FALSE TRUE
# 8208 16964 <- lose 8208 genes.
Merge1=merge(RIDs, rTOlocusid, by.x="mapmaker_id",
by.y="mapmaker_id2",all.x=T,all.y=F)
# Now we have to get the genes in order
Morder1=match(RIDs$mapmaker_id,Merge1$mapmaker_id)
Merge1=Merge1[Morder1,]
table(Merge1$mapmaker_id==RIDs$mapmaker_id)
# TRUE
# 25172
table(Merge1$mapmaker_id==rid)
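The reason for the match() step is that merge() sorts its result by the key column by default, so the original gene order is lost. The toy example below, with invented identifiers, shows the same merge-then-reorder pattern on four genes.

```r
# Toy identifiers (hypothetical); g4 has no entry in the lookup table
ids <- data.frame(mapmaker_id = c("g3", "g1", "g2", "g4"), stringsAsFactors = FALSE)
lookup <- data.frame(mapmaker_id2 = c("g1", "g2", "g3"),
                     locus_id = c(101, 102, 103),
                     stringsAsFactors = FALSE)

# Left join: keep every row of ids, fill unmatched rows with NA
merged <- merge(ids, lookup, by.x = "mapmaker_id", by.y = "mapmaker_id2",
                all.x = TRUE, all.y = FALSE)
merged$mapmaker_id    # sorted by key: "g1" "g2" "g3" "g4"

# Restore the original order of ids
restored <- merged[match(ids$mapmaker_id, merged$mapmaker_id), ]
restored$locus_id     # 103 101 102 NA
```

Comparing the restored key column with the original one, as the tutorial does with table(), is a cheap check that no rows were dropped or duplicated.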
# STEP 2: Merge to obtain color data from old data set
# Now we merge with the old mouse data (BXH) to obtain old module definitions.
dataBXHmodule=dataBXH[,c(3,2,150)]
colnames(dataBXHmodule)
# [1] "LocusLinkID" "gene_symbol" "module"
table(is.element(Merge1$locus_id, dataBXH$LocusLinkID))
# FALSE TRUE
# 23219 1953
# this number is so small because only the 3600 most connected genes are in dataBXHmodule, and
# there are several genes without locus link IDs in the BXH data file.
# Now we merge :
Merge2=merge(Merge1, dataBXHmodule, by.x="locus_id",
by.y="LocusLinkID",all.x=T,all.y=F)
# Now we have to get the genes in order
Morder2=match(RIDs$mapmaker_id,Merge2$ mapmaker_id)
Merge2=Merge2[Morder2,]
table(Merge2$mapmaker_id== RIDs$mapmaker_id)
# TRUE
# 25172
table(Merge2$mapmaker_id==rid)
# TRUE
# 25172
Merge3=merge(Merge1, dataBXH,by.x="locus_id",by.y="LocusLinkID",all.x=T,all.y=F)
# Takes a while
Morder3=match(RIDs$mapmaker_id,Merge3$mapmaker_id)
Merge3=Merge3[Morder3,]
table(Merge3$mapmaker_id==rid)
table(Merge2$mapmaker_id==Merge3$mapmaker_id)
check=cbind(RIDs$mapmaker_id,Merge2)
# We name a vector color0, for the old, BxH module colors.
color0=Merge2$module
table(is.na(color0))
# FALSE TRUE
# 1953 23219
overlap=!is.na(color0)
datExprBXH <- t(Merge3[overlap,10:144])
dim(datExprBXH)
# [1] 135 1953
BXHnamesOrdered=Merge3$substanceBXH[overlap]
colorOverlap=color0[overlap]
columnnames=colnames(dataBXD)
#obviously, the names of the columns, which are samples
# Rows are genes and columns are samples. This is our BxD expression data.
datExprBXD0=dataBXD[, grep("F2",columnnames)]
colnamedata=colnames(datExprBXD0) # sample names
# We omit duplicate samples.
dup1=grep("b",colnamedata)
dup2=grep("c",colnamedata)
dups=c(dup1,dup2)
datExprBXD=t(datExprBXD0[overlap,-dups])
# 151 x 1953
BXDmice=dimnames(datExprBXD)[[1]]
# 2. Clinical Trait data
# BxH Clinical Trait data
# We now read in our BxH clinical trait data:
datClinicalTraitsBXH=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
# rows are samples & columns are traits
# We order the mice so that trait file and expression file agree in BxH data:
restrictMice=is.element(datClinicalTraitsBXH$MiceID,dimnames(datExprBXH)[[1]])
table(restrictMice)
# restrictMice
# FALSE  TRUE
#   226   135
datClinicalTraitsBXH=datClinicalTraitsBXH[restrictMice,]
orderMiceTraits=order(datClinicalTraitsBXH$MiceID)
orderMiceExpr=order(dimnames(datExprBXH)[[1]])
datClinicalTraitsBXH=datClinicalTraitsBXH[orderMiceTraits,]
datExprBXH=datExprBXH[orderMiceExpr,]
# From the following table, we verify that all 135 mice are in order:
table(datClinicalTraitsBXH$MiceID==dimnames(datExprBXH)[[1]])
# TRUE
# 135
# BxD Clinical Trait data
# We also read in our BxD clinical trait data:
datClinicalTraits_SNPBXD0=read.csv("BXD_clinical_data.csv", header=T)
# We make the BxD mice names agree with datClinicalTraits sample names.
BXDmice_noF=gsub("F2_","",BXDmice)
restrictMiceBXD=is.element(datClinicalTraits_SNPBXD0$Mouse..,BXDmice_noF)
table(restrictMiceBXD)
# restrictMiceBXD
# FALSE  TRUE
#    44   113
datClinicalTraits_SNPBXD=datClinicalTraits_SNPBXD0[restrictMiceBXD,]
restrictMiceBXDa=is.element(BXDmice_noF,datClinicalTraits_SNPBXD$Mouse..)
table(restrictMiceBXDa)
datExprBXD=datExprBXD[restrictMiceBXDa,]
BXDmice_noF= BXDmice_noF[restrictMiceBXDa]
datClinicalTraits_SNPBXD$Mouse..=as.character(datClinicalTraits_SNPBXD$Mouse..)
orderMiceTraitsBXD=order(datClinicalTraits_SNPBXD$Mouse..)
orderMiceExprBXD=order(BXDmice_noF)
datClinicalTraits_SNPBXD=datClinicalTraits_SNPBXD[orderMiceTraitsBXD, ]
dim(datClinicalTraits_SNPBXD)
# 132x204
datExprBXD=datExprBXD[orderMiceExprBXD,]
BXDmice_noF= BXDmice_noF[orderMiceExprBXD]
table(datClinicalTraits_SNPBXD$Mouse..==BXDmice_noF)
# TRUE
# 113
datClinicalTraitsBXD=datClinicalTraits_SNPBXD[,c(133:204)] # 113 x 72
# 3. SNP data
# BxDSNP data
# We obtain the SNP data for the BxD mice.
datSNPBXD0= datClinicalTraits_SNPBXD[,c(2:132)]
# We read in SNP info for BxD.
datSNPBXDinfo0=read.table("BXDSNPinfo.csv",sep=",",header=T)
table(is.element(dimnames(datSNPBXD0)[[2]],
as.character(datSNPBXDinfo0$marker_name)))
# FALSE  TRUE
#     4   127
restrictSNPsBXD= is.element(dimnames(datSNPBXD0)[[2]],
as.character(datSNPBXDinfo0$marker_name))
datSNPBXD=datSNPBXD0[,restrictSNPsBXD]
orderSNPs=order(dimnames(datSNPBXD)[[2]])
datSNPBXD=datSNPBXD[,orderSNPs]
orderSNPs2=order(as.character(datSNPBXDinfo0$marker_name))
datSNPBXDinfo= datSNPBXDinfo0[orderSNPs2,]
# We check to ensure our SNPs match our SNP info.
table(is.element(dimnames(datSNPBXD)[[2]], as.character(datSNPBXDinfo$marker_name)))
# TRUE
# 127
# Now let's get chromosome number.
chrBXD=datSNPBXDinfo$chromosome_id
# BxH SNP data
datSNPinfoBXH=read.csv("SNPMarkerLocusTranslationTable.csv",header=T)
# [1] "UCSC.Name"         "Celera.Name"       "UCSC.Chromosome"   "UCSC.Location"
# [5] "Celera.Chromosome" "Celera.Location"
BXHindices=match(c("rs3662347","rs3714671","rs3721607","rs3676909","rs3704401","rs3658504",
"rs3683481","rs3691821","rs3658160"), datSNPinfoBXH$Celera.Name)
BXHsnps=datSNPinfoBXH[BXHindices,-c(5,6)]
# We read in our SNP data
dat1BXH=read.csv("BluemoduleGenesWeightandSNPs.csv",header=T)
datSNPBXH= data.frame(dat1BXH[1:9, c(9:143) ])
dimnames(datSNPBXH)[[1]]=as.character(dat1BXH[1:9,1])
# Now we have to make our mouse name order match that of our expression data.
dimnames(datSNPBXH)[[2]]=gsub("F2","F2_", dimnames(datSNPBXH)[[2]])
datSNPBXH=datSNPBXH[,dimnames(datExprBXH)[[1]]]
table(dimnames(datSNPBXH)[[2]]== dimnames(datExprBXH)[[1]])
# TRUE
# 135
BXHchr=BXHsnps$UCSC.Chromosome
BXHpos=BXHsnps$UCSC.Location
# We do some pre-processing on our SNP data so that we don't run into problems later on.
tSNPBXH =data.frame(t(datSNPBXH))
for (i in c(1:dim(tSNPBXH)[[2]]))
{tSNPBXH [,i]=as.numeric(as.character(tSNPBXH[,i]))}
tSNPBXH =data.frame(tSNPBXH)
dimnames(tSNPBXH)[[2]]=paste("mQTL",BXHchr,".", signif(BXHpos,3)/1e5,sep="")
tSNPBXD =data.frame(datSNPBXD)
for (i in c(1:dim(tSNPBXD)[[2]]))
{tSNPBXD [,i]=as.numeric(as.character(tSNPBXD[,i]))}
tSNPBXD =data.frame(tSNPBXD)
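The as.numeric(as.character(...)) idiom used in these loops matters when a column has been read in as a factor: calling as.numeric directly on a factor returns its internal level codes rather than the values. A small example with made-up values:

```r
# A factor whose labels are numbers (made-up values)
f <- factor(c("10", "2", "33"))
as.numeric(f)                  # level codes, not the values: 1 2 3
as.numeric(as.character(f))    # the intended values: 10 2 33

# The same conversion applied to every column of a data frame
df <- data.frame(snp1 = factor(c("0", "1", "2")),
                 snp2 = factor(c("2", "2", "0")))
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
str(df)
```

The lapply form is an alternative to the explicit for loop and gives the same result; it is shown only as one possible idiom.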
# Let's take a look at the dimensions of our data frames to make sure nothing went wrong. The number
# of samples in each data set should be constant, and BxD and BxH data sets should share the same
# number of genes, which they do.
dim(datClinicalTraitsBXD)
# [1] 113 72
dim(datExprBXD)
# [1] 113 1953
dim(tSNPBXD)
# [1] 113 127
dim(datClinicalTraitsBXH)
# [1] 135 26
dim(datExprBXH)
# [1] 135 1953
dim(tSNPBXH)
# [1] 135 9
for (i in c(1:dim(datExprBXD)[[2]]))
{datExprBXD [,i]=as.numeric(as.character(datExprBXD[,i]))}
datExprBXD =data.frame(datExprBXD)
for (i in c(1:dim(datExprBXH)[[2]]))
{datExprBXH[,i]=as.numeric(as.character(datExprBXH[,i]))}
datExprBXH=data.frame(datExprBXH)
for (i in c(1:dim(datClinicalTraitsBXD)[[2]]))
{datClinicalTraitsBXD [,i]=as.numeric(as.character(datClinicalTraitsBXD[,i]))}
datClinicalTraitsBXD =data.frame(datClinicalTraitsBXD)
weightBXH=as.numeric(datClinicalTraitsBXH[,5])
weightBXD= as.numeric(datClinicalTraitsBXD[,1])
# Module Preservation Analysis
# Now we produce a cluster dendrogram after obtaining the adjacency matrix. We do this using only
# the genes that were colored by our previous analysis (intersecting genes between the two data sets).
beta1=6
# Dynamic Cut-Tree Algorithm
# We used a dynamic cut-tree algorithm for selecting branches of the hierarchical clustering
# dendrogram (the details of the algorithm can be found at the following link:
# www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut). The algorithm takes into account an
# essential feature of cluster occurrence and makes use of the internal structure in a dendrogram.
# Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination
# and the process is iterated until the number of clusters becomes stable.
myheightcutoff =0.999
mydeepSplit = FALSE # fine structure within module
myminModuleSize = 20 # modules must have this minimum number of genes
# new way for identifying modules based on hierarchical clustering dendrogram
# We now obtain our hierGTOM for the BxD data.
AdjMatBXD=matrix(0, ncol=dim(datExprBXD)[[2]], nrow= dim(datExprBXD)[[2]])
AdjMatBXD<- abs(cor(datExprBXD,use="p"))^beta1
dissGTOMBXD=matrix(0, ncol= dim(datExprBXD)[[2]], nrow= dim(datExprBXD)[[2]])
dissGTOMBXD=TOMdist1(AdjMatBXD)
rm(AdjMatBXD)
hierGTOMBXD <- hclust(as.dist(dissGTOMBXD),method="average")
par(mfrow=c(1,1))
plot(hierGTOMBXD,labels=F)
# We find the modules based on the dynamic cut-tree algorithm in BxD data.
colorBXD=cutreeDynamic(hierclust= hierGTOMBXD, deepSplit=mydeepSplit,maxTreeHeight
=myheightcutoff,minModuleSize=myminModuleSize)
table(colorBXD)
# colorBXD
# turquoise      blue     brown    yellow     green       red     black      pink   magenta      grey
#       388       356       302       199       193       153       146       101        91        24
# Compare this with the previous, BxH module sizes in genes that were found in both data sets:
table(colorOverlap)
# colorOverlap
#        black         blue        brown         cyan        green  greenyellow
#          264          370          158           65          299           80
#         grey    lightcyan  lightyellow midnightblue       purple          red
#           50           91           16           46           45          437
#       salmon
#           32
# We visualize these modules:
pdf(file="pics/MG2/dendrogramPlot.pdf")
par(mar=c(1, 4, 4, 1) + 0.1,mfrow=c(3,1),cex=.9)
plot(hierGTOMBXD, main="BxD Cross Dendrogram", labels=F, xlab="", sub="");
hclustplot1(hierGTOMBXD,colorBXD, title1="Colored by BxD Modules")
hclustplot1(hierGTOMBXD,colorOverlap, title1="Colored by BxH Modules")
dev.off()
par(mar=c(5, 4, 4, 2) + 0.1)
# The top figure is a dendrogram showing the BxD data, the middle is coloring by BxD-defined
# modules, and the bottom figure shows the BxD data colored by old, BxH modules. Here we see rough
# module preservation.
# We produce a multidimensional scaling plot of the BxD data:
library(scatterplot3d)
cmd1=cmdscale(as.dist(dissGTOMBXD),4)
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorOverlap), main="MDS plot",
xlab="Scaling Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3",
cex.axis=1.5,angle=320)
# Comparing K.ME and GSweight between data sets
# Finding k.ME
# Note: In the text of the article we refer to the principal component as "MEblue". Here, this is the
# same as "PCblue".
PCblueBXD=ModulePrinComps1(datexpr=as.matrix(datExprBXD),
couleur=as.character(colorOverlap))[[1]]$PCblue
PCblueBXH=ModulePrinComps1(datexpr=as.matrix(datExprBXH),
couleur=as.character(colorOverlap))[[1]]$PCblue
blueModuleIndex=(colorOverlap)=="blue"
# We find kME's for the blue module genes and for all genes. Because kME is defined as an absolute
# correlation, the sign of the principal component does not matter here.
kMEblueBXD=as.numeric(abs(cor(PCblueBXD, datExprBXD[,blueModuleIndex],use="p")))
kMEblueBXDAll=as.numeric(abs(cor(PCblueBXD, datExprBXD,use="p")))
kMEblueBXH=as.numeric(abs(cor(PCblueBXH, datExprBXH[,blueModuleIndex],use="p")))
kMEblueBXHAll=as.numeric(abs(cor(PCblueBXH, datExprBXH,use="p")))
datExprBXHBlue=data.frame(datExprBXH [,blueModuleIndex])
datExprBXDBlue= data.frame(datExprBXD[,blueModuleIndex])
# Finding GSweight
# To protect against outliers, we replace the values of the physiological traits by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatClinicalTraitsBXD=apply(datClinicalTraitsBXD,2,rank1)
rankdatClinicalTraitsBXH=apply(datClinicalTraitsBXH[,5:26],2,rank1)
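To see what the rank transformation does, consider a small made-up trait table with one outlier and one missing value. With na.last="keep", missing values stay missing instead of being assigned the highest rank.

```r
# Made-up trait values: 120 is an outlier, NA is a missing measurement
rank1 <- function(x) rank(x, na.last = "keep")
traits <- data.frame(weight = c(30, 45, NA, 120),
                     length = c(9.1, 9.5, 8.7, 9.0))
rankTraits <- apply(traits, 2, rank1)
rankTraits
```

The outlying weight of 120 simply becomes rank 3, so it can no longer dominate a Pearson correlation computed on the ranked values.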
# Here we find gene significance, or the correlation between a gene expression and a physiological trait
if(exists("GSfunctionBXH")) rm(GSfunctionBXH)
GSfunctionBXH=function(x) {cor(x,rankdatClinicalTraitsBXH,use="p")}
if(exists("GSfunctionBXD")) rm(GSfunctionBXD)
GSfunctionBXD=function(x) {cor(x,rankdatClinicalTraitsBXD,use="p")}
# We also compute GeneSignificance for the data frame with omission of 1000 low detection genes in
# both data sets:
GeneSignificanceBXD =t(apply(datExprBXD,2,GSfunctionBXD))
dimnames(GeneSignificanceBXD)[[2]]=paste("GS",dimnames(rankdatClinicalTraitsBXD)[[2]],sep="" )
GeneSignificanceBXD=data.frame(abs(GeneSignificanceBXD))
GSweightBlueBXD= GeneSignificanceBXD[blueModuleIndex,1]
GSweightAllBXD= GeneSignificanceBXD[,1]
GeneSignificanceBXH =t(apply(datExprBXH,2,GSfunctionBXH))
dimnames(GeneSignificanceBXH)[[2]]=paste("GS",dimnames(rankdatClinicalTraitsBXH)[[2]],sep="" )
GeneSignificanceBXH=data.frame(abs(GeneSignificanceBXH))
GSweightBlueBXH = GeneSignificanceBXH[blueModuleIndex,1]
GSweightAllBXH= GeneSignificanceBXH[,1]
# Now we make plots of relationships between k.ME and GS.weight in our BxD and BxH data sets.
par(mfrow=c(2,2),mar=c(5, 5, 4, 2) + 0.2)
scatterplot1(kMEblueBXHAll, kMEblueBXDAll, xlab1="k.MEblue, BxH cross",
ylab1="k.MEblue, BxD cross",col1= "black")
scatterplot1(GSweightAllBXH, GSweightAllBXD,xlab1="GSweight, BxH cross",
ylab1="GSweight, BxD cross",col1= colorOverlap)
scatterplot1(kMEblueBXH, GSweightBlueBXH,xlab1="k.MEblue, BxH cross",
ylab1="GSweight, BxH cross",col1= "blue")
scatterplot1(kMEblueBXD, GSweightBlueBXD, xlab1="k.MEblue, BxD cross",
ylab1="GSweight, BxD cross",col1= "blue")
# This plot demonstrates:
# 1. kMEblue (BXH versus the new cross) for all genes – this is simply the affinity to the blue module
# 2. GSweight (BxH versus new cross) for all genes
# 3. kMEblue v GSweight in the blue module in BxH cross
# 4. kMEblue v GSweight in the blue module in the new BxD cross.
# Linear Models
# Regressing weight on PCblue
# We regress weight on PCblue in the BXH data
lm1=lm(weightBXH~ PCblueBXH)
summary(lm1)
Call:
lm(formula = weightBXH ~ PCblueBXH)
Residuals:
     Min       1Q   Median       3Q      Max
-12.1717  -3.3553   0.2933   2.6305  16.0453
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  38.0941     0.4146   91.88  < 2e-16 ***
PCblueBXH    43.5771     4.7887    9.10 1.35e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.763 on 130 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.3891, Adjusted R-squared: 0.3844
F-statistic: 82.81 on 1 and 130 DF, p-value: 1.352e-15
# We do the same with the square root of weight:
summary(lm(sqrt(weightBXH)~ PCblueBXH))
Call:
lm(formula = sqrt(weightBXH) ~ PCblueBXH)
Residuals:
     Min       1Q   Median       3Q      Max
-1.06334 -0.26591  0.02787  0.23405  1.22891
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.15155    0.03396 181.118  < 2e-16 ***
PCblueBXH    3.66285    0.39227   9.337 3.54e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3902 on 130 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.4014, Adjusted R-squared: 0.3968
F-statistic: 87.19 on 1 and 130 DF, p-value: 3.543e-16
# We repeat the same model, except in the new cross:
lm1=lm(weightBXD~ PCblueBXD)
summary(lm1)
Call:
lm(formula = weightBXD ~ PCblueBXD)
Residuals:
     Min       1Q   Median       3Q      Max
-18.5685  -4.8604  -0.9517   5.3467  26.0681
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.5974     0.6835  44.767  < 2e-16 ***
PCblueBXD    27.8609     7.2655   3.835 0.000209 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.265 on 111 degrees of freedom
Multiple R-Squared: 0.117, Adjusted R-squared: 0.109
F-statistic: 14.7 on 1 and 111 DF, p-value: 0.0002091
SNP19BXH= tSNPBXH[,9]
# We do the same with the square root of weight:
summary(lm(sqrt(weightBXD)~ PCblueBXD))
Call:
lm(formula = sqrt(weightBXD) ~ PCblueBXD)
Residuals:
     Min       1Q   Median       3Q      Max
-2.14934 -0.41173 -0.04982  0.51667  2.04400
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.48674    0.06225   88.14  < 2e-16 ***
PCblueBXD    2.66675    0.66173    4.03 0.000103 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6617 on 111 degrees of freedom
Multiple R-Squared: 0.1276, Adjusted R-squared: 0.1198
F-statistic: 16.24 on 1 and 111 DF, p-value: 0.0001025
# Now we regress weight on SNP19 in BXH:
lm1=lm(weightBXH~SNP19BXH)
summary(lm1)
Call:
lm(formula = weightBXH ~ SNP19BXH)
Residuals:
     Min       1Q   Median       3Q      Max
-14.3126  -4.0709   0.5152   4.0541  15.8874
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.5570     0.8460  42.028  < 2e-16 ***
SNP19BXH      2.6556     0.6961   3.815 0.000209 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.779 on 130 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.1007, Adjusted R-squared: 0.09377
F-statistic: 14.56 on 1 and 130 DF, p-value: 0.0002095
lm1=lm(sqrt(weightBXH)~SNP19BXH)
summary(lm1)
Call:
lm(formula = sqrt(weightBXH) ~ SNP19BXH)
Residuals:
     Min       1Q   Median       3Q      Max
-1.27270 -0.32349  0.06349  0.34286  1.19381
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.94060    0.06997  84.897  < 2e-16 ***
SNP19BXH     0.22086    0.05757   3.836 0.000194 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.478 on 130 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.1017, Adjusted R-squared: 0.09478
F-statistic: 14.72 on 1 and 130 DF, p-value: 0.0001939
lm1=lm(weightBXH~SNP19BXH+PCblueBXH)
summary(lm1)
Call:
lm(formula = weightBXH ~ SNP19BXH + PCblueBXH)
Residuals:
     Min       1Q   Median       3Q      Max
-12.2280  -2.9061   0.2126   2.6703  15.9109
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.4530     0.6853  53.189  < 2e-16 ***
SNP19BXH      1.6830     0.5687   2.960  0.00367 **
PCblueBXH    40.7803     4.7469   8.591 2.43e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.627 on 129 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.428, Adjusted R-squared: 0.4191
F-statistic: 48.26 on 2 and 129 DF, p-value: 2.258e-16
lm1=lm(sqrt(weightBXH)~SNP19BXH+PCblueBXH)
summary(lm1)
Call:
lm(formula = sqrt(weightBXH) ~ SNP19BXH + PCblueBXH)
Residuals:
     Min       1Q   Median       3Q      Max
-1.06799 -0.22244  0.03819  0.20894  1.21781
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.01601    0.05611 107.220  < 2e-16 ***
SNP19BXH     0.13901    0.04656   2.986  0.00339 **
PCblueBXH    3.43184    0.38863   8.831 6.43e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3788 on 129 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.4401, Adjusted R-squared: 0.4315
F-statistic: 50.71 on 2 and 129 DF, p-value: < 2.2e-16
# Finding SNP19 in BXD:
# Now we find SNP19 in the BXD data.
COR.PCblueBXD=rep(NA,dim(tSNPBXD)[[2]])
for (i in c(1: dim(tSNPBXD)[[2]]))
{COR.PCblueBXD[i]= as.numeric(abs(cor(PCblueBXD,as.numeric(tSNPBXD[,i]) ,use="p")))}
# There are five SNPs on the 19th chromosome.
tSNPBXD19= tSNPBXD[,chrBXD==19]
SNP19BXD= tSNPBXD19[,which.max(COR.PCblueBXD[chrBXD==19])]
# this is the 4th SNP on 19th chromosome.
SNP19BXD= tSNPBXD[,chrBXD==19]
# Below is a table of the base pair positions of each of the SNPs we will analyze, based on search results on
# Ensembl:
datSNPBXDinfo[chrBXD==19,]$marker_name
# [1] d19mit41 d19mit53 d19mit63 d19mit71 d19mit8
# SNP Name    Data Set     Basepair start          Basepair end
# rs3658160   BxH          47073456                47073456
# d19mit41    BxD SNP.1    18743419                18743582
# d19mit53    BxD SNP.2    45205220                45205330
# d19mit63    BxD SNP.3    36104688                36104837
# d19mit71    BxD SNP.4    59653090                59653225
# d19mit8     BxD SNP.5    Not mapped by Ensembl   MGI position of 47?
SNP19BXD.1=tSNPBXD[,chrBXD==19][,1]
SNP19BXD.2=tSNPBXD[,chrBXD==19][,2]
SNP19BXD.3=tSNPBXD[,chrBXD==19][,3]
SNP19BXD.4=tSNPBXD[,chrBXD==19][,4] # this is the one with the highest cor with PCblue
SNP19BXD.5=tSNPBXD[,chrBXD==19][,5]
# We use the 4th SNP as it has the highest correlation with PCblue
# Now we regress weight on SNP19 in BXD:
lm1=lm(weightBXD~SNP19BXD.4)
summary(lm1)
Call:
lm(formula = weightBXD ~ SNP19BXD.4)
Residuals:
     Min       1Q   Median       3Q      Max
-21.1928  -5.4956  -0.2433   4.5150  23.9347
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.160      1.143  25.518   <2e-16 ***
SNP19BXD.4     1.732      1.053   1.646    0.103
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.671 on 110 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-Squared: 0.02403, Adjusted R-squared: 0.01516
F-statistic: 2.709 on 1 and 110 DF, p-value: 0.1026
lm1=lm(sqrt(weightBXD)~SNP19BXD.4)
summary(lm1)
Call:
lm(formula = sqrt(weightBXD) ~ SNP19BXD.4)
Residuals:
     Min       1Q   Median       3Q      Max
-2.39846 -0.47337  0.02073  0.44362  1.85191
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.35715    0.10477  51.131   <2e-16 ***
SNP19BXD.4   0.15579    0.09651   1.614    0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7033 on 110 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-Squared: 0.02314, Adjusted R-squared: 0.01426
F-statistic: 2.606 on 1 and 110 DF, p-value: 0.1093
# The square root transformation does not improve the fit.
lm1=lm(weightBXD~SNP19BXD.4+PCblueBXD)
summary(lm1)
Call:
lm(formula = weightBXD ~ SNP19BXD + PCblueBXD)
Residuals:
    Min      1Q  Median      3Q     Max
-18.841  -5.231  -0.927   5.067  25.065
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.9286     1.1095  26.975  < 2e-16 ***
SNP19BXD      0.8336     1.0341   0.806 0.421932
PCblueBXD    26.5662     7.5487   3.519 0.000633 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.302 on 109 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-Squared: 0.1236, Adjusted R-squared: 0.1075
F-statistic: 7.687 on 2 and 109 DF, p-value: 0.0007531
summary(lm(sqrt(weightBXD)~SNP19BXD.4+PCblueBXD)) # 0.1172, p is 0.464311
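In the two-predictor model above, the SNP coefficient is no longer significant once the module eigengene is included. One way to examine this kind of result is to compare the nested models directly. The sketch below does so on simulated data in which the marker acts on the trait only through the eigengene; the variables `snp`, `eigengene`, and `weight` are hypothetical and the numbers do not correspond to the BxD results.

```r
set.seed(3)
n <- 112

# Hypothetical data: the marker shifts the eigengene, the eigengene drives the trait
snp <- sample(0:2, n, replace = TRUE)
eigengene <- 0.1 * snp + rnorm(n, sd = 0.1)
weight <- 30 + 25 * eigengene + rnorm(n, sd = 7)

# Nested linear models: marker only, then marker plus eigengene
fitSNP <- lm(weight ~ snp)
fitBoth <- lm(weight ~ snp + eigengene)

# F test for the improvement gained by adding the eigengene
comparison <- anova(fitSNP, fitBoth)
print(comparison)

# Proportion of variance explained by each model
rSquared <- c(snpOnly = summary(fitSNP)$r.squared,
              both = summary(fitBoth)$r.squared)
print(round(rSquared, 4))
```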
# Pearson correlations between measures of interest
# In comparing the PCs with weight, we must first ensure that the vector is in the correct order:
cor.test(kMEblueBXH, GSweightBlueBXH, method="p")
Pearson's product-moment correlation
data: kMEblueBXH and GSweightBlueBXH
t = 11.4797, df = 368, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4342818 0.5848353
sample estimates:
cor
0.5134995
cor.test(kMEblueBXD, GSweightBlueBXD, method="p")
Pearson's product-moment correlation
data: kMEblueBXD and GSweightBlueBXD
t = 13.2475, df = 368, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4949667 0.6334971
sample estimates:
cor
0.5682448
cor.test(PCblueBXH,weightBXH, method="p")
Pearson's product-moment correlation
data: PCblueBXH and weightBXH
t = 9.1, df = 130, p-value = 1.332e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5069661 0.7181278
sample estimates:
cor
0.6238009
# Previously the sign of PCblueBXD did not matter, because we defined kME as the
# absolute value of the correlation with PCblueBXD. Here we replace PCblueBXD by its
# additive inverse, keeping in mind that the sign of a principal component is arbitrary.
PCblueBXD=-PCblueBXD
cor.test(PCblueBXD,weightBXD, method="p")
Pearson's product-moment correlation
data: PCblueBXD and weightBXD
t = 3.8347, df = 111, p-value = 0.0002091
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1679007 0.4954488
sample estimates:
cor
0.3420223
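The sign flip above relies on the fact that a principal component is determined only up to sign. The sketch below illustrates this on simulated data: the absolute correlations are unchanged by the flip, while the correlation with a trait changes sign. It assumes the eigengene is the first left singular vector of a scaled samples-by-genes matrix; `expr`, `trait`, and the rule of orienting the component along the mean expression are assumptions made for illustration, not the tutorial's procedure.

```r
set.seed(4)
nSamples <- 40
nGenes <- 20

# Hypothetical module: every gene follows one latent signal plus noise
latent <- rnorm(nSamples)
expr <- sapply(seq_len(nGenes), function(i) latent + rnorm(nSamples))
trait <- latent + rnorm(nSamples)

# First left singular vector of the scaled expression matrix (one value per sample)
pc <- svd(scale(expr))$u[, 1]

# Assumed convention: orient the component along the average expression
if (cor(pc, rowMeans(expr)) < 0) pc <- -pc

# Absolute correlations with the component do not depend on its sign
kME <- abs(cor(expr, pc))
kMEflipped <- abs(cor(expr, -pc))
print(all.equal(kME, kMEflipped))

# The correlation with the trait changes sign when the component is flipped
print(c(original = cor(pc, trait), flipped = cor(-pc, trait)))
```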
# BXH:
cor.test(PCblueBXH,SNP19BXH, method="p")
Pearson's product-moment correlation
data: PCblueBXH and SNP19BXH
t = 2.2886, df = 133, p-value = 0.02368
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.02656463 0.35202807
sample estimates:
cor
0.1946481
cor.test(weightBXH,SNP19BXH, method="p")
Pearson's product-moment correlation
data: weightBXH and SNP19BXH
t = 3.8151, df = 130, p-value = 0.0002095
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1548397 0.4630806
sample estimates:
cor
0.3173167
# BXD: 4th SNP
cor.test(PCblueBXD,SNP19BXD.4, method="p")
Pearson's product-moment correlation
data: PCblueBXD and SNP19BXD.4
t = 2.6734, df = 110, p-value = 0.008653
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.0643953 0.4135993
sample estimates:
cor
0.2469997
cor.test(weightBXD,SNP19BXD.4, method="p")
Pearson's product-moment correlation
data: weightBXD and SNP19BXD.4
t = 1.6459, df = 110, p-value = 0.1026
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.03142964 0.33106245
sample estimates:
cor
0.1550303
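This correlation test agrees with the earlier regression of weightBXD on SNP19BXD.4: both report t = 1.646 and p = 0.1026, and the regression R-squared (0.02403) is the square of the correlation (0.1550303). For a linear model with a single predictor this agreement always holds. The sketch below checks it on simulated data; `snp` and `weight` are hypothetical variables.

```r
set.seed(2)
n <- 112

# Hypothetical additive marker coding and trait
snp <- sample(0:2, n, replace = TRUE)
weight <- 29 + 1.5 * snp + rnorm(n, sd = 7)

# Simple linear regression and Pearson correlation test on the same data
fit <- summary(lm(weight ~ snp))
ct <- cor.test(weight, snp, method = "pearson")

# t statistic and p-value of the slope versus those of the correlation test
tSlope <- fit$coefficients["snp", "t value"]
pSlope <- fit$coefficients["snp", "Pr(>|t|)"]
print(c(tSlope = tSlope, tCor = unname(ct$statistic)))
print(c(pSlope = pSlope, pCor = ct$p.value))

# R-squared of the regression versus the squared correlation
print(c(rSquared = fit$r.squared, corSquared = unname(ct$estimate)^2))
```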
# In the BxH data, B corresponds to "0" in additive marker coding, H corresponds to
# a "1" and A corresponds to a "2".
# In the BxD data, B corresponds to a "2" in additive marker coding, and A
# corresponds to a "0".
# We can visualize the distribution of these different genotypes:
par(mfrow=c(1,2))
bxhsnplevels=SNP19BXH
bxhsnplevels[bxhsnplevels==1]="H"
bxhsnplevels[bxhsnplevels==2]="A"
bxhsnplevels[bxhsnplevels==0]="B"
bxdsnplevels=SNP19BXD.4
bxdsnplevels[bxdsnplevels==1]="H"
bxdsnplevels[bxdsnplevels==0]="A"
bxdsnplevels[bxdsnplevels==2]="B"
# notch=TRUE draws notches around the medians; boxplot() has no "logical" argument, so it is dropped
boxplot(weightBXH~bxhsnplevels,notch=TRUE,main="BxH data")
boxplot(weightBXD~bxdsnplevels,notch=TRUE,main="BxD data")
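The genotype labels above are assigned one value at a time. An alternative is to build a factor in a single call, which also fixes the order of the boxes and keeps missing genotypes as NA. The sketch below uses the BxH coding described above (0 = B, 1 = H, 2 = A) on simulated data; `snp` and `weight` are hypothetical variables.

```r
set.seed(5)

# Hypothetical additive marker coding with some missing genotypes
snp <- sample(c(0, 1, 2, NA), 60, replace = TRUE)
weight <- 30 + 2 * ifelse(is.na(snp), 0, snp) + rnorm(60, sd = 5)

# One call recodes all values and fixes the order of the levels
genotype <- factor(snp, levels = c(0, 1, 2), labels = c("B", "H", "A"))
print(table(genotype, useNA = "ifany"))

# plot = FALSE returns the box statistics without opening a graphics device;
# drop it to draw the plot
stats <- boxplot(weight ~ genotype, plot = FALSE)
print(stats$names)
print(stats$stats[3, ])  # medians per genotype
```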