Finding Genes that are Downstream of a Given Trait

Tova Fuller, Jason Aten, Steve Horvath

Introduction

Here we use simulated data to show how NEO can be used to find genes that are reactive to a clinical trait (denoted by y).

To cite this tutorial and the R code, please use the following reference:
Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology 2008, 2:34

R code

# Read in the R libraries
library(MASS)    # standard, no need to install
library(class)   # standard, no need to install
library(cluster)
# Download the WGCNA package from the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
library(WGCNA)
library(impute)  # install it for imputing missing values

# Memory: this step applies to Windows only and is unnecessary for Mac users.
# (In R 4.2 and later, memory.size() and memory.limit() are no longer supported.)
# Check the maximum memory that can be allocated
memory.size(TRUE)/1024
# Increase the available memory
memory.limit(size=4000)

# Set the working directory. This should be the folder in which any files
# you will need are stored.
dir.create("C:/NEOGeneScreeningMiniTutorial")
setwd("C:/NEOGeneScreeningMiniTutorial")

# Please adapt the following paths so that they point to the directory where
# the causality function files are located. Note that we use / instead of \
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/CausalityFunctions.txt")
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/updatedneo.txt")

# Step 1: Simulate the SNPs and gene expression data

# Here we establish the causal flow between SNPs, genes and our trait y.
# We can set the sample size by changing the value of no.obs
no.obs=200 # the number of mice/people/etc.

# Below we simulate the expression traits (profiles) of 5 genes, which will be
# denoted by Eturquoise, Eblue, Ebrown, Egreen, Eyellow.
# The following gene significance parameters determine how strongly each gene is
# correlated with, and how reactive it is to, the trait y.
GeneSignificanceTurquoise=0.5 # very reactive
GeneSignificanceBlue=0.5 # very reactive
GeneSignificanceBrown=0.4
GeneSignificanceGreen=0.3
GeneSignificanceYellow=0.1 # less reactive
# Note that the turquoise and blue genes are the most reactive to y, while the yellow
# gene is least reactive.

# To make the analysis reproducible, we set the random seed.
set.seed(1)

# SNP data: We create vectors of SNP genotypes coded as 0, 1 or 2, randomly sampled
# with HWE probabilities.
SNP1=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25))
SNP2=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25))
SNP3=sample(c(0,1,2), no.obs, replace=TRUE, prob=c(0.25,0.5,0.25))

# Trait vector (y): We create a quantitative trait that is a function of
# SNP1 and SNP2 plus some noise. This is the trait vector which affects
# the gene expression profiles.
y=scale(SNP1+SNP2+rnorm(no.obs))
# The above equation corresponds to the following causal graph:
[Figure: causal graph SNP 1 -> y <- SNP 2]

# The expression profile of the turquoise gene is simulated to be reactive to the
# clinical trait y
GS=GeneSignificanceTurquoise; Eturquoise= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
[Figure: causal graph y -> Eturquoise]

# The blue gene is reactive to y and also receives input from SNP3
GS=GeneSignificanceBlue; Eblue= GS*y+2*SNP3+sqrt(1-GS^2)*rnorm(no.obs)
[Figure: causal graph SNP 3 -> Eblue <- y]

GS=GeneSignificanceBrown; Ebrown= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
[Figure: causal graph y -> Ebrown]

GS=GeneSignificanceGreen; Egreen= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
[Figure: causal graph y -> Egreen]

GS=GeneSignificanceYellow; Eyellow= GS*y+sqrt(1-GS^2)*rnorm(no.obs)
[Figure: causal graph y -> Eyellow]

# This data frame contains all the expression traits
datExpr=data.frame(Eturquoise,Eblue,Ebrown,Egreen, Eyellow)

# We find the gene significance: the Pearson correlation between each expression
# trait and y.
GS=as.numeric(cor(y,datExpr,use="p"))

# The entire causal graph we simulate is shown below.
# Note that SNP 3 affects Eblue.
[Figure: the full simulated causal graph: SNP 1 -> y <- SNP 2; SNP 3 -> Eblue; y -> Eturquoise, Eblue, Ebrown, Egreen, Eyellow]

# Step 2: Apply NEO and determine LEO.NB.CPA scores

# We would now like to find LEO.NB.CPA scores based on SNP1, SNP2 (our markers),
# y, and all expression traits. Specifically, we want LEO.NB.CPA scores for orienting
# the edges y -> expression trait.

# We create a combined data frame, datCombined, that contains the SNP and trait data.
# Only SNPs 1 and 2 are included (SNP3 is omitted).
datCombined = data.frame(SNP1, SNP2, y, datExpr)

# This sets the default parameters for NEO
pm=neo.get.param()
# Here we indicate the index of our expected "upstream" trait (which is y):
pm$A = 3 # y
# Here we indicate the indices of all other traits:
pm$B=4:(ncol(datCombined)) # all expression traits
# If needed, one can impute medians of missing trait data with the following command:
# pm$rough.and.ready.NA.imputation = TRUE
# This parameter indicates whether we want running feedback from neo or not.
pm$quiet=FALSE # We want it to be "loud".

# We run neo
NEOoutput1=neo(datCombined, pm=pm, snpcols=1:2)

[Figure: NEO blue line plot. Plot header: cor.dep.th = 0.1, cor.ind.th = 0.2, pcor.th = 0, SNP.below = 0.3, omega.th = 1, k.pcor.th = -0.05, ind.dep.th = 0.2, edge.th = 0.3. Nodes: SNP1, SNP2, y, Eturquoise, Eblue, Ebrown, Egreen, Eyellow. Edge labels: y->Eturquoise=3.8, y->Eblue=2.2, y->Ebrown=2.7, y->Egreen=1.8, y->Eyellow=0.26]
Caption: the blue line plot shows that y has been correctly anchored to SNPs 1 and 2.
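As a quick sanity check on the anchoring described in the caption, one can look at plain marker-trait correlations. The following is a minimal sketch in base R and is not part of the NEO software: it re-creates the simulated SNPs and trait from Step 1 and computes the Pearson correlation of y with each SNP. The object name snpCor is introduced here for illustration only.

```r
# Re-create the simulated markers and trait from Step 1 (base R only).
set.seed(1)
no.obs <- 200
SNP1 <- sample(c(0, 1, 2), no.obs, replace = TRUE, prob = c(0.25, 0.5, 0.25))
SNP2 <- sample(c(0, 1, 2), no.obs, replace = TRUE, prob = c(0.25, 0.5, 0.25))
SNP3 <- sample(c(0, 1, 2), no.obs, replace = TRUE, prob = c(0.25, 0.5, 0.25))
y <- as.numeric(scale(SNP1 + SNP2 + rnorm(no.obs)))

# Pearson correlation of the trait with each candidate marker.
# snpCor is a hypothetical helper object, not a NEO output.
snpCor <- sapply(data.frame(SNP1, SNP2, SNP3), function(snp) cor(snp, y))
print(round(snpCor, 2))
```

In this simulation y depends on SNP1 and SNP2 only, so their correlations with y should be clearly non-zero while the SNP3 correlation should be close to zero, which is consistent with y being anchored to SNPs 1 and 2.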
L1=GetEdgeScores(NEOoutput1,datTraits=data.frame(y,datExpr))
L1

> L1
$LEO.NB.OCA.AtoB
               y Eturquoise Eblue Ebrown Egreen Eyellow
y             NA       3.77  2.25   2.72   1.78   0.333
Eturquoise -20.7         NA    NA     NA     NA      NA
Eblue      -20.8         NA    NA     NA     NA      NA
Ebrown     -21.7         NA    NA     NA     NA      NA
Egreen     -21.2         NA    NA     NA     NA      NA
Eyellow    -24.2         NA    NA     NA     NA      NA

$LEO.NB.CPA.AtoB
            y Eturquoise Eblue Ebrown Egreen Eyellow
y          NA       3.77  2.25   2.72   1.78   0.333
Eturquoise NA         NA    NA     NA     NA      NA
Eblue      NA         NA    NA     NA     NA      NA
Ebrown     NA         NA    NA     NA     NA      NA
Egreen     NA         NA    NA     NA     NA      NA
Eyellow    NA         NA    NA     NA     NA      NA

LEO.NB.CPAscores=L1$LEO.NB.CPA.AtoB[1,-1]

> LEO.NB.CPAscores
  Eturquoise Eblue Ebrown Egreen Eyellow
y       3.77  2.25   2.72   1.78   0.263

barplot(height=as.numeric(LEO.NB.CPAscores), names.arg=names(LEO.NB.CPAscores),
        ylab="LEO.NB.CPA(y->E)", las=2)
abline(h=0.8,col="blue")

[Figure: bar plot of LEO.NB.CPA(y->E) for Eturquoise, Eblue, Ebrown, Egreen and Eyellow, with a blue horizontal line at 0.8]

The results above show the LEO.NB.CPA score for each y->gene edge. The blue horizontal line corresponds to the recommended threshold of 0.8 for the CPA score. Recall that the turquoise and blue genes were set to be the most reactive genes. However, also remember that the blue gene has input from the causal anchor SNP3, which was ignored in the above analysis (based on the CPA score). Let's see how things turn out if we also include SNP3 in the analysis.

# We create a combined data frame that contains all 3 SNPs and the trait data
datCombined = data.frame(SNP1, SNP2, SNP3, y, datExpr)

pm=neo.get.param()
# Here we indicate the index of our expected "upstream" trait (which is y):
pm$A = 4 # y
# Here we indicate the indices of all other traits:
pm$B=5:(ncol(datCombined)) # all expression traits
# This parameter indicates whether we want running feedback from neo or not.
pm$quiet=FALSE # We want it to be "loud".
# We run neo
NEOoutput2=neo(datCombined, pm=pm, snpcols=1:3)
# The function call produces the following blue line plot

[Figure: NEO blue line plot. Plot header: cor.dep.th = 0.1, cor.ind.th = 0.2, pcor.th = 0, SNP.below = 0.3, omega.th = 1, k.pcor.th = -0.05, ind.dep.th = 0.2, edge.th = 0.3. Nodes: SNP1, SNP2, SNP3, y, Eturquoise, Eblue, Ebrown, Egreen, Eyellow. Edge labels: y->Eturquoise=3.8, y->Eblue=3.8, y->Ebrown=2.7, y->Egreen=1.8]
Caption: the blue line plot shows that y was correctly anchored to SNPs 1 and 2. Further, Eblue was correctly anchored to SNP3. But Eyellow was incorrectly anchored to SNP3.

L2=GetEdgeScores(NEOoutput2,datTraits=data.frame(y,datExpr))
L2

> L2
$LEO.NB.OCA.AtoB
                  y Eturquoise Eblue Ebrown Egreen Eyellow
y                NA       3.77  3.83   2.72   1.78  0.0271
Eturquoise -20.7000         NA    NA     NA     NA      NA
Eblue       -9.7300         NA    NA     NA     NA      NA
Ebrown     -21.7000         NA    NA     NA     NA      NA
Egreen     -21.2000         NA    NA     NA     NA      NA
Eyellow     -0.0412         NA    NA     NA     NA      NA

$LEO.NB.CPA.AtoB
                 y Eturquoise Eblue Ebrown Egreen Eyellow
y               NA       3.77  2.25   2.72   1.78   0.333
Eturquoise      NA         NA    NA     NA     NA      NA
Eblue      -8.8500         NA    NA     NA     NA      NA
Ebrown          NA         NA    NA     NA     NA      NA
Egreen          NA         NA    NA     NA     NA      NA
Eyellow     0.0776         NA    NA     NA     NA      NA

# Here we record the OCA scores
LEO.NB.OCAscores=L2$LEO.NB.OCA.AtoB[1,-1]
barplot(height=as.numeric(LEO.NB.OCAscores), names.arg=names(LEO.NB.OCAscores),
        ylab="LEO.NB.OCA(y->E)", las=2)
abline(h=0.3,col="red")

[Figure: bar plot of LEO.NB.OCA(y->E) for Eturquoise, Eblue, Ebrown, Egreen and Eyellow, with a red horizontal line at 0.3]

Message: Including SNP3 in the analysis helps with the retrieval of the causal signal for Eblue: its LEO.NB.OCA score for the edge y->Eblue rises from 2.25 to 3.83. The red horizontal line corresponds to the recommended threshold of 0.3 for the OCA score.
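The two thresholds used in this tutorial (0.8 for LEO.NB.CPA and 0.3 for LEO.NB.OCA) can also be applied programmatically to screen genes. The sketch below is illustrative only. It hard-codes the scores printed above (CPA from the first run, OCA from the second) rather than calling NEO, and the names cpaThreshold, ocaThreshold, passesCPA, passesOCA and reactiveGenes are introduced here. Requiring both thresholds is a conservative choice made for this sketch, not a rule stated in the tutorial.

```r
# Scores copied from the printed output above. In a real analysis they would be
# extracted from the GetEdgeScores() result, e.g. L1$LEO.NB.CPA.AtoB[1, -1].
genes <- c("Eturquoise", "Eblue", "Ebrown", "Egreen", "Eyellow")
LEO.NB.CPAscores <- setNames(c(3.77, 2.25, 2.72, 1.78, 0.263), genes)
LEO.NB.OCAscores <- setNames(c(3.77, 3.83, 2.72, 1.78, 0.0271), genes)

# Recommended thresholds quoted in the tutorial text.
cpaThreshold <- 0.8
ocaThreshold <- 0.3

# Flag the y -> gene edges that exceed each threshold.
passesCPA <- LEO.NB.CPAscores > cpaThreshold
passesOCA <- LEO.NB.OCAscores > ocaThreshold

# Assumption of this sketch: keep only genes that pass both thresholds.
reactiveGenes <- genes[passesCPA & passesOCA]
print(reactiveGenes)
```

With the scores above, Eyellow falls below both thresholds and is the only gene not retained.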