
Finding Genes that are Downstream of a Given Trait
Tova Fuller, Jason Aten, Steve Horvath
Introduction
Here we use simulated data to show how NEO can be used to find genes that are reactive
to a clinical trait (denoted by y).
To cite this tutorial and the R code, please use the following reference.
Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges
in quantitative trait networks: the NEO software. BMC Systems Biology 2008, 2:34
R code
# read in the R libraries
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
# Download the WGCNA package from the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
library(WGCNA)
library(impute) # install this package to impute missing values
# Memory: this step applies to Windows only (it is unnecessary for Mac users).
# Note: memory.size() and memory.limit() are no longer supported in R >= 4.2.
# check the maximum memory that can be allocated
memory.size(TRUE)/1024
# increase the available memory
memory.limit(size=4000)
# Set the working directory. This should be the folder that contains any files
# you will need.
dir.create("C:/NEOGeneScreeningMiniTutorial")
setwd("C:/NEOGeneScreeningMiniTutorial")
# Please adapt the following paths so that they point to the directory where
# the causality function files are located. Note that we use / instead of \
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/CausalityFunctions.txt")
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/updatedneo.txt")
#Step 1: Simulate the SNPs and gene expression data
# Here we establish the causal flow between SNPs, genes and our trait y.
# We can set the sample size by changing the value of no.obs
no.obs=200 # the number of mice/people/etc.
# Below we simulate the expression traits (profiles) of 5 genes, which will be denoted by
# Eturquoise, Eblue, Ebrown, Egreen, Eyellow.
# The following gene significance parameters determine how strongly each gene is
# correlated with, and reactive to, the trait y.
GeneSignificanceTurquoise=0.5 # very reactive
GeneSignificanceBlue=0.5 # very reactive
GeneSignificanceBrown=0.4
GeneSignificanceGreen=0.3
GeneSignificanceYellow=0.1 # less reactive
# Note that the turquoise and blue genes are the most reactive to y, while the yellow
# gene is least reactive.
# To make the analysis reproducible, we set the random seed.
set.seed(1)
# SNP data: We create vectors that represent SNP genotypes, randomly sampled from the
# values 0, 1 or 2 with Hardy-Weinberg equilibrium (HWE) probabilities.
SNP1=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25) )
SNP2=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25) )
SNP3=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25) )
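The probabilities 0.25, 0.5, 0.25 are the Hardy-Weinberg genotype proportions for an allele frequency of 0.5. The following minimal sketch shows how they could be derived for another allele frequency. The helper `hwe.probs` and the allele frequency 0.2 are illustrative assumptions and are not part of NEO or WGCNA.

```r
# Hypothetical helper (not part of NEO or WGCNA): Hardy-Weinberg genotype
# probabilities for carrying 0, 1 or 2 copies of an allele with frequency p.
hwe.probs = function(p) {
  stopifnot(p >= 0, p <= 1)
  c((1 - p)^2, 2 * p * (1 - p), p^2)
}

# p = 0.5 reproduces the probabilities used for SNP1, SNP2 and SNP3.
probs.half = hwe.probs(0.5)
probs.half

# Simulate a SNP with an assumed allele frequency of 0.2.
SNPexample = sample(c(0, 1, 2), 200, replace = TRUE, prob = hwe.probs(0.2))
table(SNPexample)
```

If you try this, run it in a separate R session: drawing extra random numbers in the middle of the tutorial changes the simulated data that follow.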
# Trait vector (y): We create a quantitative output that is a function of
# SNP1 and SNP2 plus some noise. This is the trait vector which affects
# the gene expression profiles.
y=scale(SNP1+SNP2+rnorm(no.obs))
# The above equation corresponds to the following causal graph:
# SNP1 -> y <- SNP2
# The expression profile of the turquoise gene is simulated to be reactive to the
# clinical trait y
GS=GeneSignificanceTurquoise;
Eturquoise= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
# Causal graph: y -> Eturquoise
# The blue gene is reactive to y and is also affected by SNP3
GS=GeneSignificanceBlue;
Eblue= GS*y+2*SNP3+sqrt(1-GS^2)*rnorm(no.obs)
# Causal graph: SNP3 -> Eblue <- y
GS=GeneSignificanceBrown;
Ebrown= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
# Causal graph: y -> Ebrown
GS=GeneSignificanceGreen;
Egreen= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
# Causal graph: y -> Egreen
GS=GeneSignificanceYellow;
Eyellow= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
# Causal graph: y -> Eyellow
# This data frame contains all the expression traits
datExpr=data.frame(Eturquoise,Eblue,Ebrown,Egreen, Eyellow)
# We find the gene significance: the Pearson correlation between each expression profile and y.
GS=as.numeric((cor(y,datExpr,use="p")))
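Because y is standardized and the noise term is scaled by sqrt(1-GS^2), a gene simulated as GS*y plus noise has variance close to 1 and a correlation with y close to GS. For Eblue, the extra 2*SNP3 term adds a variance of 4*Var(SNP3) = 4*0.5 = 2, so its expected correlation with y drops to roughly 0.5/sqrt(3), or about 0.29. The sketch below checks this reasoning with a much larger sample size than the tutorial uses. The seed, the sample size, and all names ending in `.chk` are assumptions made for illustration only.

```r
set.seed(42)       # assumed seed, used only for this check
n.chk = 100000     # much larger than no.obs, to reduce sampling noise
GS.chk = 0.5       # same value as GeneSignificanceTurquoise and GeneSignificanceBlue

geno.probs = c(0.25, 0.5, 0.25)
snpA.chk = sample(c(0, 1, 2), n.chk, replace = TRUE, prob = geno.probs)
snpB.chk = sample(c(0, 1, 2), n.chk, replace = TRUE, prob = geno.probs)
snpC.chk = sample(c(0, 1, 2), n.chk, replace = TRUE, prob = geno.probs)

# Standardized trait driven by two SNPs plus noise, as in the tutorial
y.chk = as.numeric(scale(snpA.chk + snpB.chk + rnorm(n.chk)))

# Gene that is reactive to y only (like Eturquoise)
Eplain.chk = GS.chk * y.chk + sqrt(1 - GS.chk^2) * rnorm(n.chk)
# Gene that is reactive to y and also affected by a third SNP (like Eblue)
EwithSNP.chk = GS.chk * y.chk + 2 * snpC.chk + sqrt(1 - GS.chk^2) * rnorm(n.chk)

cor.plain = cor(y.chk, Eplain.chk)       # expected to be near GS.chk
cor.withSNP = cor(y.chk, EwithSNP.chk)   # expected to be near GS.chk/sqrt(3)
c(plain = cor.plain, withSNP = cor.withSNP, expected.withSNP = GS.chk / sqrt(3))
```

With only 200 observations, as in the tutorial, the observed gene significances will scatter more widely around these expected values.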
# The total causal graph we simulate is summarized below.
# Note that SNP3 affects Eblue.
# SNP1 -> y <- SNP2
# y -> Eturquoise, y -> Ebrown, y -> Egreen, y -> Eyellow, y -> Eblue
# SNP3 -> Eblue
#Step 2: Apply NEO and determine LEO.NB.CPA scores
# We would now like to find LEO.NB.CPA scores based on SNP1, SNP2 (our markers),
# y, and all expression traits. Specifically, we want LEO.NB.CPA scores for orienting
# the edge from y -> expression traits.
# We create a combined data frame that contains only SNPs 1 and 2 (SNP3 is omitted)
# The combined data that contain SNPs and trait data are denoted by
datCombined = data.frame(SNP1, SNP2, y, datExpr)
# this sets the default parameters for NEO
pm=neo.get.param()
# Here we indicate the index of our expected "upstream" trait (which is y):
pm$A = 3 # y
# Here we indicate the indices of all other traits:
pm$B=4:(ncol(datCombined))# All expression traits
# If needed, one can impute medians of missing trait data with the following command:
# pm$rough.and.ready.NA.imputation = TRUE
# This parameter will indicate whether we want running feedback from neo or not.
pm$quiet=FALSE # We want it to be "loud".
# We run neo
NEOoutput1=neo(datCombined, pm=pm, snpcols=1:2)
[Figure: NEO blue line plot. Thresholds shown in the plot header: cor.dep.th = 0.1, cor.ind.th = 0.2, pcor.th = 0, SNP.below = 0.3, omega.th = 1, k.pcor.th = -0.05, ind.dep.th = 0.2, edge.th = 0.3. Nodes: SNP1, SNP2, y, Eturquoise, Eblue, Ebrown, Egreen, Eyellow. Edge labels: y->Eturquoise=3.8, y->Ebrown=2.7, y->Eblue=2.2, y->Egreen=1.8, y-]Eyellow=0.26.]
Caption: the blue line plot shows that y has been correctly anchored to SNPs 1 and 2.
L1=GetEdgeScores(NEOoutput1,datTraits=data.frame(y,datExpr))
L1
> L1
$LEO.NB.OCA.AtoB
               y Eturquoise Eblue Ebrown Egreen Eyellow
y             NA       3.77  2.25   2.72   1.78   0.333
Eturquoise -20.7         NA    NA     NA     NA      NA
Eblue      -20.8         NA    NA     NA     NA      NA
Ebrown     -21.7         NA    NA     NA     NA      NA
Egreen     -21.2         NA    NA     NA     NA      NA
Eyellow    -24.2         NA    NA     NA     NA      NA
$LEO.NB.CPA.AtoB
            y Eturquoise Eblue Ebrown Egreen Eyellow
y          NA       3.77  2.25   2.72   1.78   0.333
Eturquoise NA         NA    NA     NA     NA      NA
Eblue      NA         NA    NA     NA     NA      NA
Ebrown     NA         NA    NA     NA     NA      NA
Egreen     NA         NA    NA     NA     NA      NA
Eyellow    NA         NA    NA     NA     NA      NA
LEO.NB.CPAscores=L1$LEO.NB.CPA.AtoB[1,-1]
> LEO.NB.CPAscores
  Eturquoise Eblue Ebrown Egreen Eyellow
y       3.77  2.25   2.72   1.78   0.263
barplot(height=as.numeric(LEO.NB.CPAscores),
names.arg=names(LEO.NB.CPAscores) ,ylab="LEO.NB.CPA(y->E)",las=2)
abline(h=0.8,col="blue")
[Figure: bar plot of LEO.NB.CPA(y->E) for Eturquoise, Eblue, Ebrown, Egreen and Eyellow; y-axis from 0.0 to 3.5; blue horizontal line at 0.8.]
The results above show the LEO.NB.CPA score for each y->gene edge. The blue
horizontal line corresponds to the recommended threshold of 0.8 for the CPA score.
Recall that the turquoise and blue genes were simulated to be the most reactive genes.
However, also remember that the blue gene receives input from the causal anchor SNP3,
which was ignored in the above CPA-based analysis.
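Reading the bar plot against the threshold can also be done programmatically. The sketch below copies the LEO.NB.CPA scores from the printed LEO.NB.CPAscores output into a named vector and lists the genes whose y->gene edge exceeds 0.8. The names `cpa.scores`, `cpa.threshold` and `reactive.genes` are illustrative and do not come from the NEO code.

```r
# LEO.NB.CPA scores for the y -> gene edges, copied from the printed output
cpa.scores = c(Eturquoise = 3.77, Eblue = 2.25, Ebrown = 2.72,
               Egreen = 1.78, Eyellow = 0.263)
cpa.threshold = 0.8   # recommended CPA threshold stated in this tutorial

# Genes whose y -> gene edge score exceeds the threshold
reactive.genes = names(cpa.scores)[cpa.scores > cpa.threshold]
reactive.genes
```

Four of the five genes pass; only Eyellow, which was simulated with the weakest dependence on y, falls below the threshold.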
Let's see how things turn out if we also include SNP3 in the analysis.
# We create a combined data frame that contains all 3 SNPs and trait data
datCombined = data.frame(SNP1, SNP2, SNP3, y, datExpr)
pm=neo.get.param()
# Here we indicate the index of our expected "upstream" trait (which is y):
pm$A = 4 # y
# Here we indicate the indices of all other traits:
pm$B=5:(ncol(datCombined))# All expression traits
# This parameter will indicate whether we want running feedback from neo or not.
pm$quiet=FALSE # We want it to be "loud".
# We run neo
NEOoutput2=neo(datCombined, pm=pm, snpcols=1:3)
# The function call produces the following blue line plot
[Figure: NEO blue line plot. Thresholds shown in the plot header: cor.dep.th = 0.1, cor.ind.th = 0.2, pcor.th = 0, SNP.below = 0.3, omega.th = 1, k.pcor.th = -0.05, ind.dep.th = 0.2, edge.th = 0.3. Nodes: SNP1, SNP2, SNP3, y, Eturquoise, Eblue, Ebrown, Egreen, Eyellow. Edge labels: y->Eturquoise=3.8, y->Eblue=3.8, y->Ebrown=2.7, y->Egreen=1.8.]
Caption: the blue line plot shows that y was correctly anchored to SNPs 1 and 2.
Further, Eblue was correctly anchored to SNP3, but Eyellow was incorrectly anchored to
SNP3.
L2=GetEdgeScores(NEOoutput2,datTraits=data.frame(y,datExpr))
L2
> L2
$LEO.NB.OCA.AtoB
                  y Eturquoise Eblue Ebrown Egreen Eyellow
y                NA       3.77  3.83   2.72   1.78  0.0271
Eturquoise -20.7000         NA    NA     NA     NA      NA
Eblue       -9.7300         NA    NA     NA     NA      NA
Ebrown     -21.7000         NA    NA     NA     NA      NA
Egreen     -21.2000         NA    NA     NA     NA      NA
Eyellow     -0.0412         NA    NA     NA     NA      NA
$LEO.NB.CPA.AtoB
                 y Eturquoise Eblue Ebrown Egreen Eyellow
y               NA       3.77  2.25   2.72   1.78   0.333
Eturquoise      NA         NA    NA     NA     NA      NA
Eblue      -8.8500         NA    NA     NA     NA      NA
Ebrown          NA         NA    NA     NA     NA      NA
Egreen          NA         NA    NA     NA     NA      NA
Eyellow     0.0776         NA    NA     NA     NA      NA
# here we record the OCA scores
LEO.NB.OCAscores=L2$LEO.NB.OCA.AtoB[1,-1]
barplot(height=as.numeric(LEO.NB.OCAscores),
names.arg=names(LEO.NB.OCAscores) ,ylab="LEO.NB.OCA(y->E)",las=2)
abline(h=0.3,col="red")
[Figure: bar plot of LEO.NB.OCA(y->E) for Eturquoise, Eblue, Ebrown, Egreen and Eyellow; y-axis from 0.0 to 3.5; red horizontal line at 0.3.]
Message: Including SNP3 in the analysis helps to retrieve the causal signal for
Eblue. The red horizontal line corresponds to the recommended threshold of 0.3
for the OCA score.
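To make the effect of adding SNP3 easier to see, the two sets of y->gene scores can be placed side by side. The sketch below copies the values from the printed output (the LEO.NB.CPAscores vector from the first run and the first row of L2$LEO.NB.OCA.AtoB from the second run) and applies the thresholds stated in this tutorial. All object names are illustrative. CPA and OCA are different scores, so the side-by-side view is only a rough guide.

```r
genes = c("Eturquoise", "Eblue", "Ebrown", "Egreen", "Eyellow")

# y -> gene scores copied from the printed output of the two NEO runs
cpa.2snps = c(3.77, 2.25, 2.72, 1.78, 0.263)    # CPA, markers SNP1 and SNP2
oca.3snps = c(3.77, 3.83, 2.72, 1.78, 0.0271)   # OCA, markers SNP1, SNP2, SNP3

comparison = data.frame(
  gene = genes,
  CPA.2snps = cpa.2snps,
  OCA.3snps = oca.3snps,
  pass.CPA = cpa.2snps > 0.8,   # recommended CPA threshold
  pass.OCA = oca.3snps > 0.3,   # recommended OCA threshold
  stringsAsFactors = FALSE
)
comparison

# Gene whose score increased the most once SNP3 was included
most.improved = genes[which.max(oca.3snps - cpa.2snps)]
most.improved
```

In this comparison Eblue shows the largest increase, which matches the message above, while Eyellow stays below both thresholds.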
THE END