RNA-SEQ ANALYSIS OF DIFFERENTIALLY EXPRESSED GENES FOR THE DIAGNOSIS OF CHAGAS DISEASE

RUAIRI LENNON
SCHOOL OF BIOTECHNOLOGY, DUBLIN CITY UNIVERSITY, GLASNEVIN, DUBLIN 11.
EMAIL: ruairi.lennon@gmail.com
WEBSITE: www.ruairi.info
Date: 30/04/2010
Student Number: 52585731

Abstract: RNA-SEQ ANALYSIS FROM A MOUSE GENOME SHOWS THAT DIFFERENTIALLY EXPRESSED GENES HIGHLIGHT PROTEIN-PROTEIN INTERACTIONS THAT ARE HIGHLY RELATED TO CHAGAS DISEASE. THIS DEMONSTRATES THE USE OF RNA-SEQ FOR THE DIAGNOSIS OF DISEASE PATHWAYS, AND REVEALS INTERACTIONS AND RELATED PATHWAYS THAT WARRANT FURTHER INVESTIGATION. A HIGHLY SYSTEMATIC APPROACH TO THE ANALYSIS WAS TAKEN, INVOLVING A VARIETY OF SOFTWARE PACKAGES AND STRATEGIES FOR DATA MINING, STATISTICAL ANALYSIS AND VALIDATION, AND FINALLY PATHWAY AND INTERACTOME ANALYSIS.

Background

RNA-Seq

RNA-Seq is a next-generation sequencing technology that is becoming popular alongside DNA microarrays for analysing gene expression levels at various cell life stages [1]. Like microarrays, it has been applied successfully in many fields of research, for example in mapping various types of interaction at the protein and DNA level. The concept, shown in fig 1, involves generating an mRNA library from the host cells, converting it to cDNA with PCR amplification, sequencing the cDNA, normalising the data against a reference genome, removing PCR duplicates and joining paired-end reads, splitting the data by environmental condition (e.g. growth phase or tissue type), cleaning the data and identifying genomic features of interest.

Fig 1: Stages of RNAseq analysis. In the wet-lab stage (grey), mRNA is converted into a cDNA library and sequenced. The fastq data from sequencing (cyan) is filtered, checked, then mapped to a reference genome of the same species, removing PCR duplicates and joining paired ends. The exons/UTRs are identified (lilac) and the genomic features of interest are identified (pink).

The RNAseq analysis must avoid rRNA inclusion in the sample prep, for example by using oligo-dT primers, which are biased toward mRNA priming. Genes located on the reverse strand of nuclear DNA also need to be identified while assembling the data. Once output, the reads are mapped to a reference genome using mapping software. Bowtie is one program that is well suited to rapid mapping against entire genomes. Paired-end mapping of reads is important because, in mammals among other species, introns present a further problem; since the paired-end methodology reads in both directions, these introns can be identified using Tophat. The quality of each read is then assessed by interpreting its ASCII-encoded Phred scores. The abundance of each good-quality transcript is then counted, in either Cufflinks or htseq.

Identification of DE genes is carried out by packages from the Bioconductor website (www.bioconductor.org/) for use in the statistics software R. Two main tools are EdgeR and DeSeq. Both pursue the same goal but give different results. Differential expression may be called on the adjusted p value alone (p<0.05) or additionally on the log fold change (logfold>2).

Investigation of RNAseq differentially expressed genes

InnateDB (www.innatedb.com) is a highly curated database that contains annotations for over a thousand genes, pathways and interactions for human and a few other species. It is linked to Ensembl and many other databases. Its outputs are very flexible: KEGG pathways, networks in Cytoscape, and tables of various gene properties as tab-delimited, CSV or Excel files. InnateDB recognizes expression values and p values.
For pathway analysis a variety of databases is available, including KEGG (www.genome.jp/kegg/kegg2.html) and Reactome (www.reactome.org), each with advantages and shortcomings relative to the others. Pathway analysis is very useful for associating phenotype with genotype by following pathways, for example from cell-surface receptors to protein-DNA interactions. Further, protein-protein interactions (PPI), protein-DNA and DNA-DNA interactions are also very indicative of cellular processes; sometimes no physical contact is needed between the two partners for an interaction to occur (use of substrates, for example). The InnateDB human database has about two thirds of the 28,000+ genes annotated to a good degree.

There are biases with RNAseq that need to be accounted for. The data set contains shot noise, a technical noise that is impossible to remove, and biological noise, due partly to the treatment of samples. The stages of data processing, including removal of PCR duplicates, carry a risk of type I and type II errors, for example from the presence of miRNAs or siRNAs. GoSeq is a Bioconductor tool that analyses the bias of gene length relative to the number of differentially expressed genes, and also removes some bias from RNAseq data when carrying out its built-in pathway analysis, which looks up Gene Ontology IDs for each gene online. It is important to validate each stage of the analysis, from running the Illumina analyzer to identifying the correct pathways.

Cytoscape is not simply visualization software for pathway/PPI analysis; it has valuable plugins that can perform calculations on the network, such as the cytohubba plugin, which identifies bottleneck or hub proteins, and the Cerebral plugin, which can remove hubs below a threshold p value. One of the best uses of RNAseq is disease identification [2].
Where a gene has not been annotated, or is poorly annotated, PPI can highlight significant roles of a protein, including its predicted function, functional modules, interaction partners, disease candidate genes, drug targets, network structure and evolution. Using a score of disease-gene relatedness, based on the disease-gene knowledgebase from OMIM [3], the diagnosis of a disease from a gene list by genotype-phenotype relationship becomes much more practical. Moreover, the pathways obtained from RNAseq can often include various disease modules, where diseases modify pathways mostly without killing the cell (not disabling critical hub proteins) and tend to affect proteins more peripheral to the network. The methods used to carry out the complete analysis are listed in the Methods section.

Results

Reads

The fastq files were copied to the data folder on ampato and fastqc was run as a preliminary examination of the data before further processing. The figures are found in Appendix 1. The error messages from all control samples were the same, and the error messages for all disease samples were the same (Appendix1(a)). For controls, the per-base errors were due to a high %A content at the end of each read (Appendix1(b) & Appendix1(c)). The high level of duplicates is shown in Appendix1(d). The Kmer content per read is shown in Appendix1(e), highlighting the AAAAA stretch at the ends of the reads. The disease GC content is shown in Appendix1(f). Quality of reads for control and disease is shown in Appendix1(g,h), with control very good and disease of good quality until the read ends.

After analysing the data and ensuring the results so far were satisfactory, the step-by-step alignment, removal of duplicates and counting were carried out. Read counts were measured using the "wc -l" command for counting lines, and converted to reads by accounting for the number of lines per read and the header offset (Table 1).
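The read counting described here can be sketched in shell as follows. This is a minimal illustration, not the commands actually used: the function names are hypothetical, and the SAM count skips header lines by their leading "@" rather than subtracting a fixed number of header lines, which is an assumption of this sketch.

```shell
# Count reads in a FASTQ file: each read occupies four lines,
# so the read count is the line count divided by 4.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Count alignment records in a SAM file by ignoring header lines,
# which begin with "@" (assumption: no fixed header length is known).
count_sam_reads() {
    awk '!/^@/ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_fastq_reads control_1.fastq` would print the number of raw reads, and `count_sam_reads control_1.aln.sorted_rmdup.sam` the number of records left after duplicate removal.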
Lecture notes recommend "grep "chr" filename", but other nonsense lines/reads are present and the table shows removal of these also.

Table 1. Number of reads at each stage of data processing. The number of reads was calculated by counting the number of lines with the UNIX command wc -l. This number was divided by 4 for fastq (4 lines per read in fastq) and reduced by 24 for SAM formats (the number of header lines before the transcript data begins).

Appendix2(a) shows the variance dropping as the sample size increases, indicative of shot noise, and is much lower for disease than control samples. In Appendix2(b,c) the residual variation with the mean shows a notable presence of biological noise. Control and disease have similar variations and distributions, allowing robust normalisation to reduce the noise (Appendix2 a,b,c,g). Appendix2(d,e,f) show the differential expression found between control and disease samples, with values given as log2 fold change. Figure d highlights the number of differentially expressed genes (red); figure e shows how transcript expression varies between control and disease samples (hence differential expression from control to disease); and figure f shows the PCA separation of control and disease into distinct groupings which, in light of the other graphs, must reflect differential expression. Figure g shows sensitivity versus specificity, where the sample CDFs are biased towards true positives rather than false positives [4].

The counts for all control and disease htseq files were assembled into one table in Excel and saved as Raw_counts.txt. Both EdgeR and DeSeq were run on the Raw_counts.txt file to conduct statistical analysis of the transcripts. DeSeq found 1707 differentially expressed transcripts (assuming a padj value (FDR) <0.05). EdgeR found 2309 significantly differentially expressed transcripts.
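The significance filter above, and a comparison of the two tools' lists by shared transcript ID, could be sketched in shell as below. This is an illustration under stated assumptions: the result tables are taken to be tab-delimited with a header row and the ID in the first column, and the function names and column names passed in are hypothetical rather than the actual output layout of either package.

```shell
# Print the IDs (column 1) of rows whose adjusted p-value is below a cutoff.
# $1 = tab-delimited table with a header row (assumed layout)
# $2 = name of the adjusted p-value column, $3 = cutoff (e.g. 0.05)
# Rows with NA in that column are skipped; output is sorted and unique.
significant_ids() {
    awk -F '\t' -v col="$2" -v cut="$3" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "NA" && $c + 0 < cut + 0 { print $1 }
    ' "$1" | sort -u
}

# Print the IDs present in both of two sorted ID lists.
shared_ids() {
    comm -12 "$1" "$2"
}
```

With each tool's table filtered into its own ID file by `significant_ids`, `shared_ids deseq.ids edger.ids | wc -l` would give the size of the overlap (the file names here are placeholders).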
Running a join query to match transcript IDs between both sets of results, keeping only rows where both have the same transcript ID, showed that the two tools have 1119 transcripts in common (fig2).

Fig.2. EdgeR and DeSeq showing some similarity in results, EdgeR detecting the most.

Analysis of differentially expressed and total genes

Appendix2(a) shows the DeSeq variance as a function of base mean. The variance for control is less than disease and drops as the number of bases increases, showing shot noise. Appendix2(b) and Appendix2(c) show the log variance vs. log base mean for disease and control respectively. These plots do show correlation, but with a high level of variance still observed. Appendix2(d) shows the level of differential expression in the transcript data set in terms of fold change. Appendix2(e) shows a heatmap of each set of genes from the control and disease categories, highlighting expression levels. Appendix2(f) shows a separation between control and disease using PCA, also determined by expression levels. Appendix2(g) shows the expected and experimental cumulative distribution functions for control and disease versus chi-squared values, and results in a small number of important residuals with really high p values. Appendix2(h) is a plot showing residual variation in A (red) versus residual variation in B (blue), with a chi-squared density of the data (grey).

The interactions detected among the differentially expressed genes, excluding interactions with genes not in the list, are shown in fig 3. The results are taken from InnateDB and visualised with Cytoscape, identifying the hubs with the cytohubba plugin.

Fig 3. (left) Interactome of DE genes highlighting top 5 hubs.

Results from the GoSeq pathway analysis were obtained from the EdgeR differentially expressed and total gene lists (using mouse mapping and mouse gene ontologies) (fig 4). A significant proportion of genes is differentially expressed in relation to gene length.

Fig4. (left) Differentially expressed genes as related to gene length.

Pathway Analysis

The pathway analysis carried out by GoSeq, using InnateDB pathways and gene_IDs on the EdgeR DE list, shows pathways of various activities and various phenotypes (including a wide variety of diseases). The list is similar to the pathways obtained from the InnateDB pathway analysis based on default distributions. Lysozyme is found both in the GoSeq enriched pathway analysis and in the InnateDB pathway analysis using the default hypergeometric distribution. Gamma-carboxylation is significantly present, related to blood clotting. Interestingly, dilated cardiomyopathy is ranked 14th, just below this list (p=0.07), and is 127th among the enriched pathways from GoSeq.

Table 2. The top 10 over-represented pathways from InnateDB sorted by corrected p-value (lowest to highest)

The Chagas disease pathway was examined by viewing it in Cerebral using Cytoscape, downloaded from the InnateDB pathway analysis results (fig5). The top 5 hub nodes can clearly be seen, as can all highlighted pathways involved in the genotype-to-phenotype stages of Chagas disease.

Fig 5. The pathways involved in Chagas disease (orange/red), showing the top 5 hub nodes found earlier.

Discussion

Given the six sample data sets from the Illumina Analyzer (version 1.5) in fastq format, Fastqc, bowtie, sort and strip-sam-duplicates were run on the DCU Ampato server at near maximum speed, finishing these steps in just over 12 hours for the complete data set. Strip-sam-duplicates removed PCR duplicates, and was used because SAM-TOOLS was malfunctioning at the time. The first main problem spotted in the Fastqc results was that the control samples showed errors in the last few positions of each read, shown in Appendix1(b,c,e) to be related to a poly-A tail. Mature mRNA is polyadenylated with such a tail, so this is of no concern.
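As a cross-check on the poly-A observation, the proportion of reads ending in a run of A bases can be counted directly from a fastq file. The sketch below is illustrative only: the function name and the default run length of 5 (chosen to match the AAAAA k-mer reported by Fastqc) are assumptions, not part of the original analysis.

```shell
# Report how many reads in a FASTQ file end in a run of at least
# min_run "A" bases. $1 = FASTQ file, $2 = min_run (default 5).
count_polya_reads() {
    awk -v n="${2:-5}" '
        NR % 4 == 2 {                      # sequence line of each 4-line record
            total++
            pos = length($0); run = 0
            while (pos > 0 && substr($0, pos, 1) == "A") { run++; pos-- }
            if (run >= n) tailed++
        }
        END { printf "%d of %d reads end in poly-A\n", tailed, total }
    ' "$1"
}
```

Running `count_polya_reads control_1.fastq` on a control sample and on a disease sample would show whether the tail is specific to the controls, as the Fastqc plots suggest.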
The high number of duplicates shown at 10+ in Appendix1(d) clearly reflects PCR duplicates, which are removed later at the strip-sam-duplicates stage. Since all control samples showed the same errors (Appendix1(a)), these were treated as negligible for the data analysis. The disease samples all showed the same GC problem, which was not a poly-A tail but a problem with GC content at the cap and 5' UTR region; GC content there is needed for mRNA stability and will vary from gene to gene. Since quality scores per read looked healthy for control and for most of disease (Appendix 1(g,h)), the Fastqc results were passed (noting that the disease reads do not have a poly-A tail and are exposed to biochemical breakdown).

Bowtie was run with the mm9 annotation index (Mus_musculus.mm9.ucsc.gtf) for the alignment. Tophat should have been used to identify exon boundaries in transcripts and improve the resulting SAM file. Also, reverse-strand genes should have been identified. However, given the time constraint and a data set of fair quality, it sufficed to skip Tophat and reverse-strand genes for this project. The coding region start and finish of the gtf file were mixed up, and a new gtf file was retrieved from ampato for bowtie and htseq.

Before htseq could be run, there was another issue with the conversion of the Illumina HWUSI machine tag name to a more readable scored index list of only good-quality reads. One line, the first line "control1" in the fastq for some files, was not being removed as a zero-quality read: after strip-sam-duplicates, PCR duplicates are scored zero and removed and HWUSI reads have their scores calculated, but this read was not scored. This read was removed manually in emacs. The abundances of each transcript were then successfully measured. Htseq-count was chosen over Cufflinks because it was readily available. The number of reads was counted (table1), showing massive removal of nonsense reads.
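The manual removal of the stray read could also be done non-interactively, which makes the step repeatable across all six samples. The following is a sketch under the assumption that only the unwanted line still contains the machine tag after duplicate stripping; the function name is hypothetical and the number of dropped lines is printed so that assumption can be checked.

```shell
# Copy a SAM file, dropping every line that contains a given tag.
# $1 = input SAM, $2 = output SAM, $3 = tag (default "HWUSI").
# Prints the number of lines dropped so the result can be verified.
drop_tagged_lines() {
    tag=${3:-HWUSI}
    # Keep lines in which the tag does not occur (fixed-string match).
    awk -v tag="$tag" 'index($0, tag) == 0' "$1" > "$2"
    # Count the lines that were left out.
    awk -v tag="$tag" 'index($0, tag) > 0 { n++ } END { print n + 0 }' "$1"
}
```

If the printed count is anything other than 1, the assumption does not hold for that file and the line should be inspected by hand instead.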
All counts from all six samples were assembled into a single table, listed by one set of unique transcript IDs. EdgeR and DeSeq were both run on this table to test for differentially expressed transcripts between disease and control samples. Variances and noise were identified, including shot noise (higher for control than disease), which dropped with the number of reads; biological noise, which was similar in disease and control and could be normalised; and possible technical noise in dimension two of the PCA.

These transcript IDs were then mapped to gene IDs using a Biomart table (www.biomart.org) listing both gene and transcript IDs, in a custom MS Access database built for this project. Initially there was a huge number of transcripts, which was reduced at each stage of data processing from fastq to htseq. Htseq produced 93,809 unique transcripts, of which 88,554 were mappable to human gene IDs. After mapping to mouse genes and removing duplicate genes, 36,814 unique mouse gene IDs remained. EdgeR returned 2309 differentially expressed (DE) transcripts, DeSeq returned 1706 DE transcripts, and the two had 1119 in common (fig2 in Results). Because EdgeR found the most DE transcripts, and to avoid removing important transcripts, the entire EdgeR DE mapped gene list was imported into InnateDB for analysis.

The interactions, filtered for "only within data set" genes, were retrieved from InnateDB through Cerebral into Cytoscape. A spring-embedded network was generated and Cytohubba found the top 20 hubs in the network by degree, each with 5+ edges. The top five hubs, Jun, Nfkb1, Irf8, Rel and Fos, were highlighted using Vizmapper (fig3); all proteins in the network are products of differentially expressed genes. Without further investigation of the protein-protein interactions (PPI), the pathway analysis was started.
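Ranking hubs by degree, as described above, amounts to counting each protein's distinct interaction partners. The sketch below shows that idea on a plain edge list; it is not Cytohubba's implementation, and the function name, the tab-delimited two-column input format and the default thresholds (top 20, degree of at least 5, taken from the text) are assumptions of the example.

```shell
# Rank nodes of an undirected interaction network by degree.
# $1 = tab-delimited edge list, one "nodeA<TAB>nodeB" pair per line
# $2 = number of top nodes to print (default 20)
# $3 = minimum degree to report (default 5)
top_hubs() {
    awk -F '\t' -v min="${3:-5}" '
        $1 != "" && $2 != "" {
            # Build an order-independent key so A-B and B-A count once.
            key = ($1 < $2) ? $1 FS $2 : $2 FS $1
            if (key in seen) next
            seen[key] = 1
            deg[$1]++
            if ($1 != $2) deg[$2]++     # a self-interaction counts once
        }
        END { for (n in deg) if (deg[n] >= min) print deg[n] "\t" n }
    ' "$1" | sort -k1,1nr -k2,2 | head -n "${2:-20}"
}
```

Each output line gives a degree followed by the node name, highest degree first, with ties broken alphabetically.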
A complete list of disease-causing human gene IDs was obtained using Prospectr, which references OMIM as its source. A score of 0.6 or higher was used to keep only those genes likely to be associated with disease; this threshold was low enough (compared to 0.7, say) to retain a fair number of likely disease candidate genes. These gene_IDs were mapped to the DE gene list from EdgeR to produce a disease gene seed list, reducing the list of DE genes from 2309 to 309. The logic for using human disease ontology for a mouse infection is that, because 88,000+ of the mouse transcripts mapped to human IDs, the vast majority of disease-causing genes in mice should have human orthologs, so human disease genes can be mapped to mouse genes. This helps the diagnosis of the disease.

The 309 disease-associated DE genes were uploaded to InnateDB for over-represented pathway analysis, and the top 10 pathways by p-value are given in Table 2 of Results. Lysozyme topped the table, given its high number of over-represented pathways found in InnateDB. The lysozyme pathway in Cerebral did not match the list of proteins found earlier in the interactome, and neither did malaria, which is discussed later in relation to the identified pathogen. However, the Chagas disease pathway contained all five hub proteins found in the interactome (fig5), as well as many other interactome proteins, suggesting a strong association between the Chagas disease pathway and the list of differentially expressed genes. The pathway runs from gene to cell surface, indicating a genotype-phenotype relationship. Dilated cardiomyopathy and gamma-carboxylation (related to fibrillation) were also found in the list of over-represented pathways, and both are known to be associated with chronic Chagas disease [5]. The chronic symptoms involving cardiomyopathy include blood clotting, indicating tissue damage or rupture of some sort.
The insect vector for Trypanosoma cruzi (Chagas disease) would be the Reduviidae [6].

To validate Chagas disease, GoSeq was run on the DE gene list from EdgeR. Gene bias was found (fig4), as well as a clear relationship between gene length and differential expression. The list of enriched pathways was obtained using a second pathway analysis, in which all EdgeR DE genes were run through InnateDB to create an innatedb.in file. This list also included lysozyme, Chagas disease and dilated cardiomyopathy, so Chagas disease remains suspect from the enriched pathway list. The genotype-phenotype link is established, including the cell-surface TNF receptor and its involvement in both chronic Chagas disease and dilated cardiomyopathy [7], and critical components such as NFKBIL1 [8]. The over-representation of lysozyme is explained by an immune response mediated by leukocytes upon infection [9].

The caveat is that disease modules usually involve less essential genes rather than the hubs needed for survival, tending towards peripheral pathways and proteins. Other caveats include the level of shot, biological and technical noise detected in the DeSeq results, the gene bias detected by GoSeq, and the possibility of disease pathways similar to Chagas, too numerous to verify by hand, any of which could give misleading results. Also bear in mind that the presence of Chagas disease is validated by the presence of related diseases reported in the literature, which makes a strong case.

Conclusion

The RNA-seq data was processed successfully despite some technical troubleshooting, errors identified in Fastqc and variation in the DeSeq results. The data was of reasonably high quality, sufficient for the interactome and pathway analysis in InnateDB using GoSeq and Cytoscape.
Chronic Chagas disease has been highlighted as the suspect infection given its similarity to the differentially expressed gene patterns observed in protein-protein interactions, in particular the genotype-phenotype relationship in the pathway and the significance of the top 20 protein hubs in the pathway, especially Jun, Nfkb1, Irf8, Rel and Fos. The relationship between Chagas disease, dilated cardiomyopathy and fibrillation was identified through pathway analysis. There was room for improvement: given more time, Tophat could have been run to improve the recognition of intron-exon boundaries and include otherwise discarded transcripts in the results. Cufflinks could also have been run, and further analysis of many other pathways beyond the top 10 could have been included. This work reveals novel protein interactions involving the Chagas disease pathway that could be investigated further in the laboratory.

Methods

1. Raw Data-processing

Fastqc, bowtie, sort, strip-sam-duplicates and htseq were run on the Illumina fastq sample data sets on the DCU Ampato server.

    #!/bin/bash
    #$ -pe mpich 40
    #$ -cwd -j y
    #$ -N htseq
    #$ -q 40node
    #$ -V
    mpirun -np 280 -ppn 7 -machinefile ~/.mpich/mpich_hosts.$JOB_ID fastqc control_1.fastq
    export BOWTIE_INDEXES=/users/ccreevey/programs/bowtie-0.12.7/indexes/
    /users/ccreevey/bin/bowtie -m 1 -S mm9 control_1.fastq control_1.aln.sam
    sort -k 3,3 -k 4,4n control_1.aln.sam > control_1.aln.sorted.sam
    /users/ccreevey/bin/strip_sam_duplicates control_1.aln.sorted.sam control_1.aln.sorted_rmdup.sam
    htseq-count control_1.aln.sorted_rmdup.sam Mus_musculus.mm9.ucsc.gtf > control_1.htseq

The server has 40 nodes, and 7 of the 8 processors on each node were used, giving a total of 280 processors for the task. The final 6 lines were copied and modified to run the same tasks on the files control_2.fastq, control_3.fastq, disease_1.fastq, disease_2.fastq and disease_3.fastq.
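Instead of copying and editing the commands by hand for each sample, the per-sample lines could be generated with a loop. The sketch below only prints the commands (a dry run) so they can be reviewed and pasted into the job script; the function name is hypothetical, and the tools are written without the full paths and mpirun wrapper used above, on the assumption that they are on the PATH.

```shell
# Print the per-sample pipeline commands rather than copying them by hand.
# Arguments: sample base names (without the .fastq extension).
# Nothing is executed; the commands are only echoed for review.
print_pipeline() {
    gtf=Mus_musculus.mm9.ucsc.gtf
    for s in "$@"; do
        echo "fastqc $s.fastq"
        echo "bowtie -m 1 -S mm9 $s.fastq $s.aln.sam"
        echo "sort -k 3,3 -k 4,4n $s.aln.sam > $s.aln.sorted.sam"
        echo "strip_sam_duplicates $s.aln.sorted.sam $s.aln.sorted_rmdup.sam"
        echo "htseq-count $s.aln.sorted_rmdup.sam $gtf > $s.htseq"
    done
}
```

For example, `print_pipeline control_1 control_2 control_3 disease_1 disease_2 disease_3` emits five command lines per sample, which also guards against filename typos when the block is edited manually.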
Htseq-count will not run until one line containing "HWUSI", the final line of the rmdup.sam file, has been removed. The line number is found with "grep -n "HWUSI" control_1.aln.sorted_rmdup.sam", and the line is then removed by opening the file at that line number n with "emacs +n control_1.aln.sorted_rmdup.sam". Htseq was rerun on the modified files. The htseq output files were downloaded from the server by SFTP (ampato.computing.dcu.ie, port 22) onto the local PC. Because the samples had identical transcript_IDs, the counts were all summarised in one tab-delimited file using Excel (Raw_counts_CA.txt). The number of reads at each step was calculated using wc -l and normalised by the number of lines per read.

2. Statistical Analysis of differentially expressed and total genes

EdgeR, DeSeq and GoSeq were downloaded from the Bioconductor website for installation into R.

    source("http://bioconductor.org/biocLite.R")
    biocLite("edgeR")
    biocLite("DESeq")
    biocLite("goseq")

The scripts to run EdgeR, DeSeq and GoSeq are as given in the lecture notes, with a two-line change for GoSeq as required to run. EdgeR gives the list of significantly differentially expressed genes with statistics.
Deseq (exactly as in tutorial)

    library(DESeq)
    setwd("/USERS/RUAIRI/workspace/practicaldata/")

    # Read in input file
    raw.data <- read.delim("Rawcounts_CA.txt", header=TRUE, stringsAsFactors=TRUE);

    # Intended filter: keep only the rows whose maximum value is greater than 5
    # Note: the gene column is still part of raw.data at this point, so max()
    # may compare strings rather than counts
    gt5 <- function(x){ max(x) > 5; }
    th <- array(apply(raw.data, 1, gt5));
    countsTable <- raw.data[ th == "TRUE" ,];

    pdf("graphsCADeSeq.pdf");
    head(countsTable);

    # Set the gene names in the first column as the row names and delete the first column
    rownames( countsTable ) <- countsTable$gene;
    countsTable <- countsTable[ , -1 ];   # Get rid of the column with the gene names

    # Set up the conditions of the experiment and the required data structures in R
    conds <- c("Control","Control","Control","Disease","Disease","Disease");
    cds <- newCountDataSet( countsTable, conds);

    # Estimate the size factors of the datasets and adjust accordingly (normalising for amount of DNA)
    cds <- estimateSizeFactors(cds);
    sizeFactors(cds);

    # Estimate the variance in the samples and plot graphs to check
    cds <- estimateVarianceFunctions(cds);
    scvPlot(cds, ylim=c(0,2));
    diagForControl <- varianceFitDiagnostics(cds, "Control");
    smoothScatter(log10(diagForControl$baseMean), log10(diagForControl$baseVar));
    lines( log10(fittedBaseVar) ~ log10(baseMean), diagForControl[ order(diagForControl$baseMean), ], col="red");
    diagForDisease <- varianceFitDiagnostics(cds, "Disease");
    smoothScatter(log10(diagForDisease$baseMean), log10(diagForDisease$baseVar) );
    lines( log10(fittedBaseVar) ~ log10(baseMean), diagForDisease[ order(diagForDisease$baseMean), ], col="red");
    par( mfrow=c(1,2));
    residualsEcdfPlot( cds, "Control" );
    residualsEcdfPlot( cds, "Disease" );

    # Carry out binomial test for differential gene expression
    res <- nbinomTest( cds, "Control", "Disease");
    head(res);

    # Plot the log(base2) fold changes against the base mean, colouring in red
    # those genes that are significant at 10% FDR
    plotDE <- function(res) plot( res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.1, col = ifelse( res$padj < .1, "red", "black" ) )
    plotDE(res);

    # Plot the density of resVarA and resVarB
    plot( density( res$resVarA, na.rm=TRUE, from=0, to=5), col="red");
    lines( density( res$resVarB, na.rm=TRUE, from=0, to=5 ), col="blue");
    xg <- seq( 0, 5, length.out=1000 );
    lines( xg, dchisq( xg, df=1 ), col="grey");

    # Sample clustering
    cds3 <- newCountDataSet(countsTable, conds);
    cds3 <- estimateSizeFactors(cds3);
    cds3 <- estimateVarianceFunctions(cds3);
    vsd <- getVarianceStabilizedData(cds3);
    dists <- dist( t(vsd));
    heatmap( as.matrix(dists), symm=TRUE );

    print(paste("number of genes at 1.0 =", length(res[,1]), sep=" "));
    write.table(res, "all.statsresults_CA_DESeq.txt", sep="\t");
    dev.off();

EdgeR (exactly as in tutorial)

    library(edgeR)
    setwd("/USERS/RUAIRI/workspace/practicaldata/")
    raw.data <- read.delim("Rawcounts_CA.txt", header=TRUE, stringsAsFactors=TRUE);
    pdf("graphs_cAEdgeR.pdf");

    # Use the first column as row names, leaving only the counts matrix
    rownames(raw.data) <- raw.data[, 1]
    raw.data <- raw.data[,-1];

    # Create a second array that only contains the rows whose maximum count is greater than 5
    gt5 <- function(x){ max(x) > 5; }
    th <- array(apply(raw.data, 1, gt5));
    d <- raw.data[ th == "TRUE" ,];
    write.table(d, "Rawcounts_CA1.txt", sep="\t");

    # The treatment groups in order in the table
    group <- c("Control","Control","Control","Disease","Disease","Disease");
    d <- DGEList(counts = d, group = group)

    # Make a MDS plot for the samples
    plotMDS.dge(d, main="MDS Plot for Data", xlim=c(-0.8,1.6))

    # Analysis using common dispersion
    d <- estimateCommonDisp(d)
    de.com <- exactTest(d)
    padjSigCount <- sum(p.adjust(de.com$table$p.value, method = "BH") < 0.1)
    results4ouput <- (topTags(de.com, n=padjSigCount, adjust.method="BH", sort.by="p.value")$table )
    alltests4output <- (topTags(de.com, n=nrow(de.com$table), adjust.method="BH", sort.by="p.value")$table )
    write.table(results4ouput, "all.significant_CA_EdgeR.txt", sep="\t");
    write.table(alltests4output, "all.testresults_CA_EdgeR.txt", sep="\t");
    dev.off();

GoSeq

    library(goseq)
    setwd("/USERS/RUAIRI/workspace/practicaldata/")
    de.genes <- scan("goseqDEEdgeR", what=character() )   ## reads in file of DE gene IDs
    all.genes <- scan("goseqall", what=character() )      ## reads in file of genes assayed in the experiment
    genes = as.integer(all.genes %in% de.genes)           ## creates a binary vector (0 = nonDE; 1 = DE) of genes indicating which are DE
    names(genes) = all.genes                              ## associates gene names with vector
    pwf = nullp(genes, "mm9", "ensGene")                  ### generates Probability Weighting Function (PWF)

    ## innatedb.in contains mappings between all human Ensembl IDs and InnateDB pathways.
    mapping <- scan("innatedb.in", list(QueryXref="",PathwayName="",PathwayName2=""))
    innatedb = split(mapping$QueryXref, mapping$PathwayName)
    pathways = goseq(pwf, gene2cat=innatedb)
    head(pathways, n = 50)                                ### show top 50 pathways
    enriched.pathways = pathways$category[pathways$upval < 0.01]   ### returns pathways with an uncorrected pvalue of < 0.01
    enriched.pathways

Mapping

Fig 6. Relational database built in MS Access.

MS Access was used to convert transcript IDs to gene_IDs, remove duplicate gene_IDs for goseq, and prepare a spreadsheet including gene_IDs, fold changes and p values for InnateDB. Biomart (www.biomart.org) provided the mapping spreadsheet between human and mouse, and between transcript and gene IDs. The relationships were set up and queries assembled as required (fig6). The numbers of DeSeq and EdgeR DE reads are given from these tables.

3. Pathway/Gene Ontology/interactome analysis for diagnosis of disease

The list of differentially expressed (DE) genes from EdgeR (with transcript IDs mapped to gene IDs) was uploaded to InnateDB. The data columns were identified (logs as expression values for cond1 and cond2 respectively, pvalue and fdr as p values for cond1 and cond2 respectively).
Gene_IDs were defined as a cross-reference ID to Ensembl. Following the Data Analysis link, the interactions were filtered to interactions with self-proteins (those within the data set), using a subset of data rather than microarray data, with default distributions. The Cerebral link opens a network visualisation of the results using Cytoscape. In the left window pane of Cytoscape (Cerebral) the tick box for high-quality rendering was selected, and the data panel's top-left icon was opened to select all ID boxes. The network was exported in XGMML format and reopened in an offline Cytoscape installation with Cerebral, Vizmapper and Cytohubba installed. Cytohubba detected the top 20 hubs by degree, and the visualisation of the top nodes was enhanced by right-clicking the top 5 hubs and manually styling each node's font, label and size.

For pathway and gene ontology analysis, the EdgeR DE gene list was further reduced by keeping only genes with scores above 0.6 from Prospectr, which links Gene_IDs to diseases found in OMIM (www.genetics.med.ed.ac.uk/prospectr/downloads/dump.tab). This reduced list of scored DE genes was uploaded into InnateDB through the Data Analysis link. For pathways, the over-represented pathways were explored; over-represented gene ontologies can be explored likewise. Pathways may also be viewed in Cerebral. The list of pathways was downloaded as a table and all columns except gene_id and pathways were removed, saving it as innatedb.in for use with the goseq pathway analysis.

References

[1] Brian T. Wilhelm, Josette-Renée Landry. "RNA-Seq - quantitative measurement of expression through massively parallel RNA-sequencing". Methods. 2009; 48(3): 249-257
[2] Jing Chen, Bruce J Aronow, Anil G Jegga. "Disease candidate gene identification and prioritization using protein interaction networks." BMC Bioinformatics. 2009; 10: 73
[3] Shi-Hua Zhang, Chao Wu, Xia Li, Xi Chen, Wei Jiang, Bin-Sheng Gong, Jiang Li, Yu-Qing Yan. "From phenotype to gene: Detecting disease-specific gene functional modules via a text-based human disease phenotype network construction." FEBS Letters. 2010; 584(16): 3635-3643
[4] Bolan Linghu, Evan S Snitkin, Zhenjun Hu, Yu Xia, Charles DeLisi. "Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network". Genome Biology. 2009; 10: R91
[5] Valente N, et al. "Serial electrophysiological studies of the heart's exicto conductor system in patients with chronic chagasic cardiopathy". Arq Bras Cardiol. 2006; 86(1): 19-25
[6] E. Pfeiler, B.G. Bitler, J.M. Ramsey, C. Palacios-Cardiel, T.A. Markow. "Genetic variation, population structure, and phylogenetic relationships of Triatoma rubida and T. recurva (Hemiptera: Reduviidae: Triatominae) from the Sonoran Desert, insect vectors of the Chagas' disease parasite Trypanosoma cruzi". Mol. Phyl. Evol. 2006; 41(1): 209-221
[7] Sandra A. Drigo, Edecio Cunha-Neto, Bárbara Ianni, Maria Regina A. Cardoso, Patrícia E. Braga, Kellen C. Faé, Vera Lopes Nunes, Paula Buck, Charles Mady, Jorge Kalil, Anna Carla Goldberg. "TNF gene polymorphisms are associated with reduced survival in severe Chagas' disease cardiomyopathy patients". Microbes and Infection. 2006; 8(3): 598-603
[8] Rajendranath Ramasawmy, Kellen C. Faé, Edecio Cunha-Neto, Susan C.P. Borba, Barbara Ianni, Charles Mady, Anna C. Goldberg, Jorge Kalil. "Variants in the promoter region of IKBL/NFKBIL1 gene may mark susceptibility to the development of chronic Chagas' cardiomyopathy among Trypanosoma cruzi-infected individuals". Molecular Immunology. 2008; 45(1): 283-288
[9] Ernesto H. de Titto and Rita L. Cardoni. "Trypanosoma cruzi: Parasite-induced release of lysosomal enzymes by human polymorphonuclear leukocytes." Experimental Parasitology. 1983; 56(2): 247-254

Appendix 1: FastQC Results

(a) FastQC errors
(b) Controls: poly-A tails at read ends
(c) Control: low GC at read ends
(d) High duplicates of 10+ indicating PCR duplicates
(e) High AAAAA content at read ends
(f) Disease: GC content with high variation at read starts
(g) Control: high-quality reads
(h) Disease: quality per read dropping at read ends

Appendix 2: EdgeR & DeSeq

(a) Variance function vs mean; variance drops with longer reads
(b) Disease: log var vs log mean
(c) Control: log var vs log mean. Variance for (b) and (c) is very similar, allowing for normalisation
(d) log2foldchange DE with DE genes in red
(e) Heatmap of DE with high similarity (red) compared to differentially expressed (white)
(f) PCA (EdgeR) with separation of control from disease in dimension 1 by DE, with a possibility of technical noise in dimension 2
(g) Residuals ECDF plot for control and disease, comparing the experimental distribution (ECDF) to chi-squared probability, comparing expected to observed to show that control and disease have significant and similar base residual distributions, in the form of sensitivity (ECDF) vs specificity
(h) Density of resVarA (red) and resVarB (blue) showing the two components are independent