RNA-SEQ ANALYSIS OF DIFFERENTIALLY
EXPRESSED GENES FOR THE DIAGNOSIS OF
CHAGAS DISEASE
RUAIRI LENNON
SCHOOL OF BIOTECHNOLOGY, DUBLIN CITY UNIVERSITY, GLASNEVIN, DUBLIN 11.
EMAIL: ruairi.lennon@gmail.com
WEBSITE: www.ruairi.info
Date: 30/04/2010
Student Number: 52585731
Abstract:
RNA-SEQ ANALYSIS FROM A MOUSE GENOME SHOWS THAT DIFFERENTIALLY EXPRESSED GENES HIGHLIGHT
PROTEIN-PROTEIN INTERACTIONS THAT ARE HIGHLY RELATED TO CHAGAS DISEASE. THIS DEMONSTRATES THE
USE OF RNA-SEQ FOR THE DIAGNOSIS OF DISEASE PATHWAYS, AND REVEALS INTERACTIONS AND RELATED
PATHWAYS THAT REQUIRE FURTHER INVESTIGATION. THE ANALYSIS FOLLOWED A HIGHLY SYSTEMATIC APPROACH
INVOLVING A VARIETY OF SOFTWARE PACKAGES AND STRATEGIES FOR DATA MINING, STATISTICAL
ANALYSIS AND VALIDATION, AND FINALLY PATHWAY AND INTERACTOME ANALYSIS.
Background
RNA-Seq
RNA-Seq is a next-generation sequencing technology that is becoming popular, behind DNA
microarrays, for analyzing gene expression levels at various cell life stages [1]. Like microarrays, it
is also a very successful technique in many fields of research, including for example mapping
various types of interaction at the protein and DNA level. The concept, shown in fig 1 below, involves
generating an mRNA library from the host cells, converting it to cDNA with PCR amplification,
sequencing the cDNA, normalising the data using a reference genome, removing PCR
duplicates and joining paired-end reads, splitting the data by environmental condition (e.g. growth
phase or tissue type), cleaning the data and identifying genomic features of interest.
Fig 1: Stages of RNAseq analysis. Wet-lab work (grey) converts mRNA into a cDNA library, which is sequenced. The fastq data
from sequencing (cyan) is filtered, checked, then mapped to a reference genome of the same species, removing
PCR duplicates and joining paired ends. The exons/UTRs are identified (lilac) and the genomic features of interest are
identified (pink).
The RNAseq analysis must avoid including rRNA in the sample prep, for example by using oligo-dT
primers, which are biased toward priming mRNA. Genes located on the reverse strand of nuclear DNA
also need to be identified while assembling the data. Once output, the data is mapped to a
reference genome using mapping software. Bowtie is one program that is well suited to rapid
mapping against entire genomes. Paired-end mapping of reads is important because, for
mammals amongst other species, introns present a further problem; because the paired-end
methodology reads in both directions, these introns can be identified using Tophat. The quality
of each read is then assessed by interpreting its ASCII-encoded Phred scores. The abundance
of each good transcript is then counted, either in Cufflinks or htseq.
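The Phred interpretation step can be sketched in a few lines of Python. This is an illustration only, not the code used in this project: it assumes the Phred+64 ASCII offset used by Illumina pipeline versions 1.3 to 1.7 (the data here came from version 1.5), and the function names and the example quality string are hypothetical.

```python
def phred_scores(quality_string, offset=64):
    """Convert an ASCII-encoded FASTQ quality string to integer Phred scores.

    offset=64 is an assumption (Illumina 1.3 to 1.7); Sanger-style files use 33.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def mean_quality(quality_string, offset=64):
    """Mean Phred score of one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


# Hypothetical quality line: good bases followed by a low-quality tail.
example_quality = "hhhhTTB"
print(phred_scores(example_quality))
print(round(mean_quality(example_quality), 2))
```

Reads whose mean score falls below a chosen cut-off could then be discarded; the cut-off itself is a judgment call and is not specified here.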
Identification of DE genes is carried out by a few packages from the Bioconductor website
(www.bioconductor.org/) for use in the statistics software R. Two main tools are EdgeR and DeSeq.
Both pursue the same goal, but they can return different results. The level of differential expression may be
determined by the adjusted p-value alone (p<0.05) or also by the log fold change (logfold>2).
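As an illustration of these two thresholds, the sketch below filters a results table by adjusted p-value and, optionally, by absolute log fold change. It is written in Python for brevity rather than in R, and the record layout, field names and example values are assumptions made for this example; they are not the EdgeR or DeSeq output format.

```python
def select_de(results, padj_cutoff=0.05, min_abs_logfc=None):
    """Return the IDs of rows that pass the differential-expression thresholds.

    results       : list of dicts with keys "id", "padj" and "logFC" (assumed layout)
    padj_cutoff   : keep rows with adjusted p-value strictly below this value
    min_abs_logfc : if given, also require |logFC| to be strictly above this value
    """
    selected = []
    for row in results:
        padj = row["padj"]
        if padj is None or padj >= padj_cutoff:
            continue  # not significant, or no p-value available
        if min_abs_logfc is not None and abs(row["logFC"]) <= min_abs_logfc:
            continue  # significant, but the change is too small
        selected.append(row["id"])
    return selected


# Hypothetical example rows.
example_rows = [
    {"id": "t1", "padj": 0.001, "logFC": 3.2},
    {"id": "t2", "padj": 0.03, "logFC": -1.1},
    {"id": "t3", "padj": 0.2, "logFC": 4.0},
    {"id": "t4", "padj": None, "logFC": 0.0},
]
print(select_de(example_rows))
print(select_de(example_rows, min_abs_logfc=2))
```

Using the absolute value treats up- and down-regulation alike, which is one reasonable reading of the fold-change threshold.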
Investigation of RNAseq differentially expressed genes
InnateDB (www.innatedb.com) is a highly curated database that contains annotations for over a
thousand genes, pathways and interactions for human and a few other species. It is linked to
Ensembl and many other databases. Its outputs are very flexible, including KEGG pathways,
networks in Cytoscape, and tables of various gene properties as tab-delimited, csv or Excel
files. InnateDB recognizes expression values and p-values. For pathway analysis there is a variety of
databases available, including KEGG (www.genome.jp/kegg/kegg2.html) and Reactome
(www.reactome.org), each with advantages and shortcomings relative to the others. Pathway analysis is
very useful for associating phenotype with genotype by following pathways, for example from
cell-surface receptors to protein-DNA interactions. Further, protein-protein interactions (PPI),
protein-DNA and DNA-DNA interactions are also very indicative of cellular processes, where there is
sometimes no physical contact needed between the two for an interaction to occur (use of
substrates for example). The InnateDB human database has about two thirds of the 28,000+ genes
annotated to a good degree.
There are biases in RNAseq that need to be accounted for. There is what we call shot noise in the
data set, a technical noise that is impossible to remove, and biological noise, due partially to the
treatment of samples. The stages of data processing, including removal of PCR duplicates, do carry
a risk of type I and type II errors, for example from the presence of miRNAs or siRNAs.
GoSeq is a tool from Bioconductor that analyses the bias of gene length against the number of
differentially expressed genes, and also removes some bias from RNAseq data when carrying out its
built-in pathway analysis by looking up the Gene Ontology IDs for each gene online. It is important to
validate each stage of the analysis, from running the Illumina analyzer to identifying the correct
pathways.
Cytoscape is not simply visualization software for pathway/PPI analysis, but has valuable plugins
that can perform calculations on the network, such as the Cytohubba plugin, which identifies
bottleneck or hub proteins, and the Cerebral plugin, which can remove hubs below a threshold p-value.
One of the optimum uses of RNAseq is for disease identification [2]. Where a gene has
not been annotated, or is poorly annotated, PPI can highlight significant roles of a protein, including its
predicted function, the functional modules it belongs to, its interactions, its disease
candidate genes, drug targets, network structure and evolution. Using a score of disease-gene
relatedness, based on the disease-gene knowledgebase from OMIM [3], the diagnosis of a disease from a
gene list by genotype-phenotype relationship becomes much more practical. Even more so, the
pathways obtained from RNAseq can often include various disease modules, where diseases
modify pathways mostly without killing the cell (not disabling critical hub proteins), tending instead to
affect proteins more peripheral to the network. The methods used to carry out the complete
analysis are listed in the methods section.
Results
Reads
The fastq files were copied to the data folder on ampato and fastqc was run as a preliminary
examination of the data before further processing. The figures are to be found in Appendix 1. The
error messages from all control samples were the same, and the error messages for all disease samples were
the same (Appendix1(a)). For controls, the per-base errors were due to a high %A content at the end of
each read (Appendix1(b) & Appendix1(c)). The high level of duplicates is shown in Appendix1(d).
The Kmer content per read is shown in Appendix1(e), highlighting the AAAAA stretch at the ends of
the reads. The disease GC content is shown in Appendix1(f). Quality of reads for control and
disease is shown in Appendix1(g,h), with control very good, and disease of good quality until the ends.
After analysing the data and ensuring the results so far were satisfactory, the step-by-step alignment,
removal of duplicates and counting were carried out. Read counts were measured using the “wc -l”
command for counting lines, and converted to reads using the number of lines per read and the
header offset (Table 1). The lecture notes recommend “grep "chr" filename”, but other nonsense lines/reads are
present and the table shows the removal of these also.
Table 1. Number of reads at each stage of data processing. The number of reads was calculated by
counting the number of lines with the UNIX command wc -l. This number was divided by 4 for fastq (4 lines
per read in fastq), and reduced by 24 for SAM formats (the number of lines in the header before
transcript data begins).
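The conversion from line counts to read counts described in this caption can be sketched as follows. This is a Python illustration under stated assumptions, not the procedure actually run: the function names are hypothetical, the default of 24 header lines is the value quoted above, and counting header lines by their leading "@" is offered as an alternative to a fixed offset.

```python
def fastq_reads_from_line_count(line_count):
    """FASTQ stores each read on 4 lines, so reads = lines / 4."""
    if line_count % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return line_count // 4


def sam_reads_from_line_count(line_count, header_lines=24):
    """SAM records follow the header, so records = lines - header lines."""
    return line_count - header_lines


def count_sam_records(lines):
    """Count SAM records directly, skipping header lines (they start with '@')."""
    return sum(1 for line in lines if line.strip() and not line.startswith("@"))


# Hypothetical miniature inputs.
example_sam = [
    "@HD\tVN:1.0",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t5\t255\t4M\t*\t0\t0\tACGT\thhhh",
    "read2\t16\tchr1\t9\t255\t4M\t*\t0\t0\tTTGA\thhhB",
]
print(fastq_reads_from_line_count(8))
print(count_sam_records(example_sam))
```

Note that a SAM file written with unaligned reads included will count those records too, so the figure is a record count rather than a count of aligned reads.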
Appendix2(a) shows the variance dropping as the sample size increases, indicative of shot
noise, and it is much lower for disease than for control samples. In Appendix2(b,c) the residual
variation against the mean shows a notable presence of biological noise. Control and disease have
similar variations and distributions, allowing robust normalisation to reduce the noise (Appendix2
a,b,c,g). Appendix2(d,e,f) shows the differential expression found between control and disease samples,
with values given as log2 fold change. Figure d highlights the number of differentially
expressed genes (red), figure e highlights how transcript expression varies between control and
disease samples (hence differential expression from control to disease), and figure f shows a PCA
breakdown of control and disease into separate, distinct groupings; in relation to the
other graphs, this separating vector must be differential expression. Figure g shows sensitivity versus specificity,
where the sample CDFs are biased towards true positives rather than false positives [4].
The counts for all control and disease htseq files were assembled into one table in Excel, and saved
as Raw_counts.txt. Both EdgeR and DeSeq were run on the Raw_counts.txt file to conduct statistical
analysis of the transcripts. DeSeq found 1707 differentially expressed transcripts (assuming a padj
value (FDR) <0.05). EdgeR found 2309 significantly differentially expressed transcripts. Running a
join query to match transcript IDs between both sets of results, joining only where both have the
same transcript ID, showed that the two have 1119 transcripts in common (fig2).
Fig.2. EdgeR and DeSeq showing some similarity in results, EdgeR detecting the most.
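The comparison behind this figure amounts to a set intersection on transcript IDs. The sketch below shows the idea in Python; the join was actually run as a database query, so the function name and the toy IDs here are hypothetical.

```python
def compare_de_lists(edger_ids, deseq_ids):
    """Summarise the overlap between two lists of DE transcript IDs."""
    edger, deseq = set(edger_ids), set(deseq_ids)
    return {
        "both": len(edger & deseq),        # found by both tools
        "edger_only": len(edger - deseq),  # found by EdgeR only
        "deseq_only": len(deseq - edger),  # found by DeSeq only
    }


# Hypothetical toy IDs, not taken from the data set.
toy_edger = ["T1", "T2", "T3", "T4"]
toy_deseq = ["T3", "T4", "T5"]
print(compare_de_lists(toy_edger, toy_deseq))
```

Applied to the counts reported above (2309 for EdgeR, 1707 for DeSeq, 1119 shared), the same arithmetic gives 2309 - 1119 = 1190 transcripts found only by EdgeR and 1707 - 1119 = 588 found only by DeSeq.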
Analysis of differentially expressed and total genes
The result from Appendix2(a) shows the DeSeq variance as a function of base mean density. The
variance for control is less than for disease and drops as the number of bases increases, showing shot
noise. Appendix2(b) and Appendix2(c) show the log variance vs. log base mean for disease and
control respectively. These plots do show correlation, but with a high level of variance still observed.
Appendix2(d) shows the level of differential expression in the transcript data set, described in terms
of fold change. Appendix2(e) shows a heatmap of each set of genes from the control and disease
categories, highlighting expression levels. Appendix2(f) shows a separation between control and
disease using PCA, determined also by expression levels. Appendix2(g) shows the expected and
experimental cumulative distribution functions for control and disease versus chi-squared values, and
indicates a small number of important residuals with very high p-values. Appendix2(h) is a plot
showing residual variation in A (red) versus residual variation in B (blue), with a chi-squared
density for comparison (grey). The interactions detected between all differentially expressed
genes, but not with those not in the list, are seen in fig 3. The results are taken from
InnateDB and visualised with Cytoscape, identifying the hubs with the Cytohubba plugin.
Fig 3. Interactome of DE genes highlighting top 5 hubs.
Results from the GoSeq pathway analysis were obtained from the EdgeR differentially
expressed and total gene lists (using mouse mapping and mouse gene ontologies) (fig 4).
A significant proportion of genes are differentially expressed in relation to gene length.
Fig4. Differentially expressed genes as related to gene length.
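The kind of trend shown in this figure can be approximated by binning genes by length and computing the proportion that are differentially expressed in each bin. The Python sketch below does only that; it is not GoSeq's algorithm (GoSeq fits a smooth probability weighting function), and the function name, the equal-sized binning and the toy values are assumptions for illustration.

```python
def de_proportion_by_length(lengths, is_de, n_bins=4):
    """Proportion of DE genes in bins of increasing gene length.

    lengths : dict mapping gene ID -> gene length
    is_de   : dict mapping gene ID -> 1 if differentially expressed, else 0
    Returns a list of (mean length in bin, proportion DE in bin) tuples.
    """
    genes = sorted(lengths, key=lambda g: lengths[g])
    if not genes:
        return []
    bin_size = -(-len(genes) // n_bins)  # ceiling division
    summary = []
    for start in range(0, len(genes), bin_size):
        chunk = genes[start:start + bin_size]
        mean_length = sum(lengths[g] for g in chunk) / len(chunk)
        proportion = sum(is_de[g] for g in chunk) / len(chunk)
        summary.append((mean_length, proportion))
    return summary


# Hypothetical toy data: the longer genes are the ones called DE.
toy_lengths = {"g1": 100, "g2": 200, "g3": 3000, "g4": 4000}
toy_de = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
print(de_proportion_by_length(toy_lengths, toy_de, n_bins=2))
```

A proportion that rises with mean length across bins is the signature of length bias that the weighting function is meant to correct.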
Pathway Analysis
The pathway analysis carried out by GoSeq using InnateDB pathways and gene_IDs on the EdgeR DE
list shows pathways of various activities and various phenotypes (including a wide variety of
diseases). The list is similar to the pathways obtained from the InnateDB pathway analysis based on
default distributions. Lysozyme is found both in the GoSeq enriched pathway analysis and in the
InnateDB pathway analysis using the default hypergeometric distribution analysis.
Gamma-carboxylation is significantly present, related to blood clotting. Interestingly, dilated
cardiomyopathy is ranked 14th, just below this list (p=0.07), and is present at 127th among the enriched
pathways from GoSeq.
Table 2. The top 10 over-represented pathways from InnateDB sorted by corrected p-value (lowest to highest)
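For readers unfamiliar with the hypergeometric test named above, the Python sketch below computes an over-representation p-value for a single pathway. It illustrates the textbook test only; InnateDB's own implementation and its multiple-testing correction are not reproduced, and the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, pathway_size, list_size, population_size):
    """P(X >= k) for a hypergeometric draw without replacement.

    k               : genes from the uploaded list that fall in the pathway
    pathway_size    : genes annotated to the pathway in the whole population
    list_size       : size of the uploaded gene list
    population_size : all annotated genes considered
    """
    total = comb(population_size, list_size)
    upper = min(pathway_size, list_size)
    favourable = sum(
        comb(pathway_size, i) * comb(population_size - pathway_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 10 genes in total, 4 in the pathway,
# a list of 3 genes of which 2 are in the pathway.
print(hypergeom_upper_tail(2, 4, 3, 10))
```

A small p-value means the pathway holds more of the listed genes than random sampling would be expected to give.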
Pathway analysis of Chagas disease was carried out by viewing the pathway in Cerebral using Cytoscape,
downloaded from the InnateDB pathway analysis results (fig5 below). The top 5 hub nodes
can clearly be seen, as can all highlighted pathways involved in the genotype-to-phenotype stages
of Chagas disease.
Fig 5. The pathways involved in Chagas disease (orange/red), showing the top 5 hub nodes found earlier.
Discussion
Given the six sample data sets from the Illumina Analyzer (version 1.5) in fastq format, Fastqc,
bowtie, sort, and strip-sam-duplicates were run on the DCU Ampato Server at near maximum speed,
finishing these steps in just over 12 hours for the complete data set. Strip-sam-duplicates
removed PCR duplicates, and was used because SAM-TOOLS was malfunctioning at the time.
The first main problem spotted in the Fastqc results was that the control samples showed errors in
the last few positions of each read, shown in Appendix1(b,c,e) to be related to a poly-A tail. A
mature mRNA is polyadenylated with such a tail, so this is of no concern. The high
number of duplicates shown at 10+ in Appendix1(d) is clearly due to PCR duplicates, which are removed later at the
strip-sam-duplicates stage. Since all control samples showed the same errors (Appendix1(a)),
these were ignored as negligible to the data analysis. The disease samples all showed the same GC
problem, which was not a poly-A tail but a problem with GC content at the cap and 5’ UTR region; these regions
have various GC contents as needed for mRNA stability, and these vary from gene to gene. Since quality
scores per read looked healthy for control and for most of disease (Appendix 1(g,h)), the Fastqc results
were passed (noting that the disease reads do not have a poly-A tail and are exposed to biochemical breakdown).
Bowtie was run with the mm9 annotation index (Mus_musculus.mm9.ucsc.gtf) for the alignment.
Tophat should have been used to identify exon boundaries in transcripts and improve the resulting
SAM file. Also, reverse-strand genes should have been identified. However, given the time constraint and
a data set of fair quality, it was sufficient to skip Tophat and reverse-strand genes for this
time-constrained project. The coding region start and finish in the gtf file were mixed up, and a new gtf
file was retrieved from ampato for bowtie and htseq. Before htseq could be run, there was another
issue with the conversion of the Illumina HWUSI machine tag names into a scored, more readable index
list of only good-quality reads. The problem was that one line, the first line “control1” in the fastq for
some files, was not being removed as a zero-quality read (after strip-sam-duplicates, PCR
duplicates are scored zero and removed; HWUSI reads had their scores calculated, but this read
was not scored). This read was removed manually in emacs. The abundances of each transcript were
then successfully measured. Htseq-count was chosen over Cufflinks because it was readily available.
The number of reads was counted (table1), showing massive removal of nonsense reads.
All counts from all six samples were assembled into a single table, all listed by one set of unique
transcript IDs. EdgeR and DeSeq were both run on the table output file to test for differentially
expressed genes between disease and control samples. Variances and noise were identified,
including:
• shot noise (higher for control than disease), which dropped with the number of reads
• biological noise (similar in both disease and control), which could be normalised
• the possibility of technical noise in dimension two of the PCA.
These transcript IDs were then mapped to gene IDs using a Biomart table (www.biomart.org) listing
both gene and transcript IDs in a sheet, carried out in a custom MS Access database built for this
project. To begin with, there was a huge number of transcripts, which was reduced through each stage
of the data processing from fastq to htseq. Htseq produced 93,809 unique transcripts. 88,554
were mappable to human gene IDs. After mapping to mouse genes and removing duplicate genes,
we were left with 36,814 unique mouse gene IDs. EdgeR returned 2309 differentially expressed (DE)
transcripts, DeSeq returned 1706 DE transcripts, and DeSeq and EdgeR had 1119 in common
(fig2 in results). Because EdgeR found the most DE transcripts, and to avoid removing important
transcripts, the entire EdgeR DE mapped gene list was imported into InnateDB for analysis. The
interactions, filtered for “only within data set” genes, were retrieved from InnateDB through
Cerebral into Cytoscape. A spring-embedded network was generated and Cytohubba found the top
20 hubs in the network by degree of 5+ edges. The top five hubs, Jun, Nfkb1, Irf8, Rel and Fos, were
highlighted using Vizmapper (fig3), and all proteins in the network are differentially expressed genes.
Without further investigation of the protein-protein interactions (PPI), the pathway analysis was
started. A complete list of disease-causing human gene IDs was found using Prospectr, which
references OMIM as its source. Scores of 0.6 or higher were used to keep only those genes that were
likely to be associated with disease; the value was low enough (compared to 0.7, say) to allow
a fair number of likely disease candidate genes to be considered. These gene_IDs were mapped
to the DE gene list from EdgeR to produce a disease gene seed list. This reduced the list of DE
genes from 2309 down to 309 genes. The logic for using human disease ontology for a mouse
infection is that, because 88,000+ of the mouse transcripts mapped to human IDs, the vast
majority of disease-causing genes in mice should have human orthologs, so human disease genes
can be mapped to mouse genes. This helps the diagnosis of the disease. The 309 disease-associated
DE genes were uploaded to InnateDB for over-represented pathway analysis and the top
10 pathways by p-value are found in Table 2 of results. Lysozyme topped the table, given its high
number of over-represented pathways found in InnateDB. The lysozyme pathway in Cerebral failed to
match the list of proteins found previously in the interactome, as did malaria; lysozyme is discussed
later in relation to the identified pathogen. However, Chagas disease had all five hub proteins found
in the interactome (fig5), as well as many others found in the interactome, suggesting a strong
association between the Chagas disease pathway and the list of differentially expressed genes. The
pathway goes right from gene to cell surface, indicating a genotype-phenotype relationship.
Dilated cardiomyopathy and gamma-carboxylation (related to fibrillation) were also found in the list
of over-represented pathways, and both are symptoms known to be associated with chronic Chagas
disease [5]. The chronic symptoms involving cardiomyopathy include blood clotting, indicating tissue
damage or rupture of some sort. The insect vector for Trypanosoma cruzi (Chagas disease) would
be Reduviidae [6].
To validate Chagas disease, GoSeq was run on the DE gene list from EdgeR. Gene bias was found
(fig4), as well as a clear relationship between gene length and differential expression level. The list of
enriched pathways was obtained using a second pathway analysis, with all EdgeR DE genes in
InnateDB used to create an innatedb.in file. This list also included lysozyme, Chagas disease
and dilated cardiomyopathy. Chagas disease therefore remains the suspect from the enriched pathway list. The
genotype-phenotype link is established, including the cell-surface TNF receptor and its involvement
in both chronic Chagas disease and dilated cardiomyopathy [7], and critical components such as
NFKBIL1 [8]. The reason for the over-representation of lysozyme is an immune response mediated by
leukocytes upon infection with Chagas disease [9]. The caveat is that disease modules usually involve
less essential genes rather than the hubs needed for survival, tending towards peripheral pathways and
proteins. Other caveats include the level of shot, biological and technical noise detected in the DeSeq
results, the detection of gene bias in GoSeq, and the possibility of disease pathways similar to Chagas,
too numerous to verify all by hand, any of which could give misleading results. Also bear in mind that we validate
the presence of Chagas by the presence of related diseases found in the literature, which represents a strong case.
Conclusion
The RNA-seq data was successfully processed amidst some technical troubleshooting, identification of
errors in Fastqc and variation in the DeSeq results. The reasonably high quality data was sufficient for the
interactome and pathway analysis in InnateDB using GoSeq and Cytoscape. Chronic Chagas disease
has been highlighted as the suspect infection given its similarity to the differentially expressed gene
patterns observed in protein-protein interactions, especially highlighting the genotype-phenotype
relationship in the pathway, and the significance of the top 20 protein hubs in the pathway,
especially Jun, Nfkb1, Irf8, Rel and Fos. The relationship between Chagas disease, dilated
cardiomyopathy and fibrillation was identified through pathway analysis. There was room for
improvement, given more time, to run Tophat, improve the recognition of intron-exon boundaries
and include otherwise discarded transcripts in the results. Also, Cufflinks could have been run, and
further analysis of many other pathways beyond the top 10 could have been included. This work
reveals novel protein interactions involving the Chagas disease pathway that could be investigated further
in the laboratory.
Methods
1. Raw Data-processing
Running fastqc, bowtie, sort, strip-sam-duplicates and htseq on Illumina fastq sample data sets on
DCU Ampato Server.
#!/bin/bash
#$ -pe mpich 40
#$ -cwd -j y
#$ -N htseq
#$ -q 40node
#$ -V
mpirun -np 280 -ppn 7 -machinefile ~/.mpich/mpich_hosts.$JOB_ID
fastqc control_1.fastq
export BOWTIE_INDEXES=/users/ccreevey/programs/bowtie-0.12.7/indexes/
/users/ccreevey/bin/bowtie -m 1 -S mm9 control_1.fastq control_1.aln.sam
sort -k 3,3 -k 4,4n control_1.aln.sam > control_1.aln.sorted.sam
/users/ccreevey/bin/strip_sam_duplicates control_1.aln.sorted.sam control_1.aln.sorted_rmdup.sam
htseq-count control_1.aln.sorted_rmdup.sam Mus_musculus.mm9.ucsc.gtf > control_1.htseq
The number of nodes on the server is 40, and 7 of the 8 processors on each node were used, giving a total
of 280 processors being used for the task. The final 6 lines were copied and modified to run the
same tasks on the files control_2.fastq, control_3.fastq, disease_1.fastq, disease_2.fastq and
disease_3.fastq. Htseq-count needs to have the final line, a line containing “HWUSI”, removed from the
rmdup.sam file before it will run. This is done by finding the line number:
grep -n "HWUSI" control_1.aln.sorted_rmdup.sam
and using the line number n
emacs +n control_1.aln.sorted_rmdup.sam
to remove that line.
Htseq was rerun on the modified files. The htseq output files were downloaded from the server by
SFTP at ampato.computing.dcu.ie on port 22 onto the local PC. The counts were all summarised in
one tab-delimited file using Excel (Raw_counts_CA.txt), given that the samples had identical transcript_IDs.
The no. of reads at each step was calculated using wc -l, and normalised by the no. of lines per read.
2. Statistical Analysis of differentially expressed and total genes
Downloading EdgeR, DeSeq and GoSeq from the Bioconductor website for installation into R.
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
biocLite("DESeq")
biocLite("goseq")
The scripts to run EdgeR, DeSeq and GoSeq are as given in the lecture notes, with a two-line change
for GoSeq as required for it to run. EdgeR gives the list of significantly differentially expressed genes with statistics.
Deseq (exactly as in tutorial)
library(DESeq)
setwd("/USERS/RUAIRI/workspace/practicaldata/")
#Read in input file and display top
raw.data <- read.delim("Rawcounts_CA.txt", header=TRUE, stringsAsFactors=TRUE);
gt5<- function(x){
max(x) > 5;
}
# Keep only the rows whose maximum count is greater than 5
th<-array(apply(raw.data, 1, gt5));
countsTable<- raw.data[ th == "TRUE" ,];
pdf("graphsCADeSeq.pdf");
head (countsTable);
#Set the gene names in the first column as the rownames of the table and delete the first column
rownames( countsTable ) <- countsTable$gene;
countsTable<- countsTable[ , -1 ]; # Get rid of the column with the gene names
#Set up the conditions of the experiment and set up the required data structures in R
conds<-c("Control","Control","Control","Disease","Disease","Disease");
cds <- newCountDataSet( countsTable, conds);
#Estimate the size factors of the datasets and adjust accordingly (normalising for amount of DNA)
cds<-estimateSizeFactors(cds);
sizeFactors(cds);
#Estimate the variance in the samples and plot graphs to check
cds<-estimateVarianceFunctions(cds);
scvPlot(cds, ylim=c(0,2));
diagForControl<-varianceFitDiagnostics(cds, "Control");
smoothScatter(log10(diagForControl$baseMean), log10(diagForControl$baseVar));
lines( log10(fittedBaseVar) ~ log10(baseMean), diagForControl[ order(diagForControl$baseMean), ], col="red");
diagForDisease<-varianceFitDiagnostics(cds, "Disease");
smoothScatter(log10(diagForDisease$baseMean), log10(diagForDisease$baseVar) );
lines( log10(fittedBaseVar) ~ log10(baseMean), diagForDisease[ order(diagForDisease$baseMean), ], col="red");
par( mfrow=c(1,2));
residualsEcdfPlot( cds, "Control" );
residualsEcdfPlot( cds, "Disease" );
# Carry out binomial test for differential gene expression
res<- nbinomTest( cds, "Control", "Disease");
head(res);
#plot the log(base2) fold changes against the base mean, colouring in red those genes that are significant at 10% FDR
plotDE <- function(res)
plot(
res$baseMean,
res$log2FoldChange,
log="x",
pch=20,
cex=.1,
col = ifelse( res$padj < .1, "red", "black" )
)
plotDE(res);
#Plot the density of resVarA and resVarB
plot( density( res$resVarA, na.rm=TRUE, from=0, to=5), col="red");
lines( density( res$resVarB, na.rm=TRUE, from=0, to=5 ), col="blue");
xg <- seq( 0, 5, length.out=1000 ); lines( xg, dchisq( xg, df=1 ), col="grey");
#Sample Clustering
cds3<- newCountDataSet(countsTable, conds);
cds3<- estimateSizeFactors(cds3);
cds3<- estimateVarianceFunctions(cds3);
vsd<-getVarianceStabilizedData(cds3);
dists<- dist( t(vsd));
heatmap( as.matrix(dists), symm=TRUE );
print(paste("number of genes at 1.0 =", length(res[,1]), sep=" "));
write.table(res, "all.statsresults_CA_DESeq.txt", sep="\t");
dev.off();
EdgeR (exactly as in tutorial)
library(edgeR)
setwd("/USERS/RUAIRI/workspace/practicaldata/")
raw.data <- read.delim("Rawcounts_CA.txt", header=TRUE, stringsAsFactors=TRUE);
pdf("graphs_cAEdgeR.pdf");
# Assign the counts matrix to an object "d"
rownames(raw.data) <- raw.data[, 1]
raw.data<-raw.data[,-1];
# Keep only the rows whose maximum count is greater than 5
gt5<- function(x){
max(x) > 5;
}
th<-array(apply(raw.data, 1, gt5));
d<- raw.data[ th == "TRUE" ,];
write.table(d, "Rawcounts_CA1.txt", sep="\t");
group <- c("Control","Control","Control","Disease","Disease","Disease"); # The treatment groups in order in the table
d <- DGEList(counts = d, group = group)
# Make a MDS plot for the samples
plotMDS.dge(d, main="MDS Plot for Data",xlim=c(-0.8,1.6))
# Analysis using common dispersion
d <- estimateCommonDisp(d)
de.com <- exactTest(d)
padjSigCount<-sum(p.adjust(de.com$table$p.value, method = "BH") < 0.1)
results4ouput<-(topTags(de.com, n=padjSigCount, adjust.method="BH",sort.by="p.value")$table )
alltests4output<-(topTags(de.com, n=nrow(de.com$table), adjust.method="BH",sort.by="p.value")$table )
write.table(results4ouput, "all.significant_CA_EdgeR.txt", sep="\t");
write.table(alltests4output, "all.testresults_CA_EdgeR.txt", sep="\t");
dev.off();
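Both scripts above depend on Benjamini-Hochberg (BH) adjusted p-values: the EdgeR script calls p.adjust with method = "BH", and DeSeq reports the adjusted value as padj. The Python sketch below shows the adjustment itself, for illustration only; the R functions were what the analysis actually used, and the function name and example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Counting how many adjusted values fall below a cut-off such as 0.1 corresponds to the padjSigCount line in the EdgeR script.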
GoSeq
library(goseq)
setwd("/USERS/RUAIRI/workspace/practicaldata/")
de.genes <- scan("goseqDEEdgeR", what=character() ) ## reads in file of DE gene IDs
all.genes <- scan("goseqall", what=character() ) ## reads in file of genes assayed in the experiment
genes = as.integer(all.genes %in% de.genes) ## creates a binary vector (0 = nonDE; 1 = DE) of genes indicating which are DE
names(genes) = all.genes ## associates gene names with vector
pwf = nullp(genes, "mm9", "ensGene") ### generates Probability Weighting Function (PWF)
## innatedb.in ## contains mappings between all human Ensembl IDs and InnateDB pathways.
mapping <- scan("innatedb.in", list(QueryXref="",PathwayName="",PathwayName2=""))
innatedb=split(mapping$QueryXref,mapping$PathwayName)
pathways=goseq(pwf,gene2cat=innatedb)
head(pathways, n = 50) ### show top 50 pathways
enriched.pathways = pathways$category[pathways$upval < 0.01] ### returns pathways with an uncorrected pvalue of < 0.01
enriched.pathways
Mapping
Fig 6. Showing relational database using MS Access.
MS Access was used to convert transcript IDs to gene_IDs, remove duplicate gene_IDs for
goseq, and prepare a spreadsheet including gene_IDs, fold changes and p-values for InnateDB.
Biomart (www.biomart.org) provided the mapping spreadsheet between human and mouse, and
between transcript and gene IDs. The relationships were set up and queries assembled as required
(fig6). The numbers of DeSeq and EdgeR DE transcripts were obtained from these tables.
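The mapping and de-duplication performed in MS Access can be pictured as a dictionary lookup. The Python sketch below is an illustration under assumptions: the lookup table stands in for the Biomart sheet, and the function name and the IDs are hypothetical.

```python
def map_transcripts_to_genes(transcript_ids, transcript_to_gene):
    """Map transcript IDs to unique gene IDs, keeping first-seen order.

    transcript_to_gene : dict standing in for the Biomart mapping sheet
    Returns (unique gene IDs, transcript IDs with no mapping).
    """
    genes, seen, unmapped = [], set(), []
    for transcript_id in transcript_ids:
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            unmapped.append(transcript_id)  # no entry in the mapping table
        elif gene_id not in seen:
            seen.add(gene_id)               # first transcript seen for this gene
            genes.append(gene_id)
    return genes, unmapped


# Hypothetical IDs: two transcripts of gene gA, one of gB, one unmapped.
toy_mapping = {"t1": "gA", "t2": "gA", "t3": "gB"}
print(map_transcripts_to_genes(["t1", "t2", "t3", "t9"], toy_mapping))
```

Keeping the unmapped IDs separate makes it easy to report how many transcripts were lost at this step.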
3. Pathway/Gene Ontology/interactome analysis for diagnosis of disease
The list of differentially expressed (DE) genes from EdgeR (after mapping transcript to gene ID) was
uploaded to InnateDB. The data columns were identified (logs as expression values for cond1 and
cond2 respectively, p-value and FDR as p-values for cond1 and cond2 respectively). Gene_IDs were
defined as a cross-reference ID to Ensembl. Following the Data Analysis link, the interactions were
filtered to interactions with self-proteins, using a subset of data rather than microarray data, and using
default distributions. The Cerebral link opens a network visualisation of the results using
Cytoscape. In the left window pane of Cytoscape (Cerebral) the tick box for high quality rendering was
selected, and the data panel's top left icon was opened to select all ID boxes. The network was
exported in XGMML format and reopened in an offline Cytoscape version with Cerebral, Vizmapper
and Cytohubba installed. Cytohubba detected the top 20 hubs by degree of edges, and the
visualization of the top nodes was enhanced by right-clicking the top 5 hubs and manually styling each node's
font/label/size.
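Ranking hubs by degree, as described above, reduces to counting the distinct edges at each node. The Python sketch below shows that calculation only; it is not Cytohubba's code, the function name is hypothetical, and the edges in the example are invented for illustration (only the gene names come from this report).

```python
from collections import Counter


def top_hubs(edges, top_n=5, min_degree=1):
    """Rank nodes of an undirected network by degree (number of distinct edges).

    edges : iterable of (node_a, node_b) pairs; duplicates and self-loops are ignored
    Returns up to top_n (node, degree) tuples, highest degree first.
    """
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    # Sort by descending degree, then by name so that ties are reproducible.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [(node, d) for node, d in ranked if d >= min_degree][:top_n]


# Invented edges between genes named in this report.
toy_edges = [("Jun", "Fos"), ("Fos", "Jun"), ("Jun", "Nfkb1"),
             ("Jun", "Rel"), ("Nfkb1", "Rel"), ("Irf8", "Irf8")]
print(top_hubs(toy_edges, top_n=2))
```

Setting min_degree=5 would mirror the "degree of 5+ edges" criterion mentioned in the discussion.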
For pathway and gene ontology analysis, the EdgeR DE gene list was further reduced by keeping only genes with scores
above 0.6 from Prospectr, which links Gene_IDs to diseases found in OMIM
(www.genetics.med.ed.ac.uk/prospectr/downloads/dump.tab). This reduced list of scored DE genes
was uploaded into InnateDB through the Data Analysis link. The over-represented
pathways were then explored; over-represented gene ontologies can be explored likewise. Pathways
may be viewed in Cerebral as well, and the list of pathways was downloaded as a table, removing all
columns except gene_id and pathways, and saved as innatedb.in for use with the goseq pathway analysis.
References
[1] Brian T. Wilhelm, Josette-Renée Landry. “RNA-Seq - quantitative measurement of expression through
massively parallel RNA-sequencing”. Methods. 2009; 48; 3; pp 249-257
[2] Jing Chen, Bruce J Aronow, Anil G Jegga. “Disease candidate gene identification and prioritization using
protein interaction networks.” BMC Bioinformatics. 2009; 10: 73.
[3] Shi-Hua Zhang, Chao Wu, Xia Li, Xi Chen, Wei Jiang, Bin-Sheng Gong, Jiang Li, Yu-Qing Yan.
“From phenotype to gene: Detecting disease-specific gene functional modules via a text-based human
disease phenotype network construction.” FEBS Letters. 2010; 584, 16, pp 3635-3643
[4] Bolan Linghu, Evan S Snitkin, Zhenjun Hu, Yu Xia, Charles DeLisi. “Genome-wide prioritization of
disease genes and identification of disease-disease associations from an integrated human functional linkage
network” Genome Biology. 2009; 10:R91
[5] Valente N, et al. "Serial electrophysiological studies of the heart’s excito conductor system in patients with
chronic chagasic cardiopathy". Arq Bras Cardiol. 2006; 86 (1): 19–25
[6] E. Pfeiler, B.G. Bitler, J.M. Ramsey, C. Palacios-Cardiel, T.A. Markow. “Genetic variation, population
structure, and phylogenetic relationships of Triatoma rubida and T. recurva (Hemiptera: Reduviidae:
Triatominae) from the Sonoran Desert, insect vectors of the Chagas’ disease parasite Trypanosoma cruzi” Mol.
Phyl. Evol. 2006; 41, 1, pp 209-221
[7] Sandra A. Drigo, Edecio Cunha-Neto, Bárbara Ianni, Maria Regina A. Cardoso, Patrícia E. Braga, Kellen C.
Faé, Vera Lopes Nunes, Paula Buck, Charles Mady, Jorge Kalil, Anna Carla Goldberg. “TNF gene
polymorphisms are associated with reduced survival in severe Chagas' disease cardiomyopathy patients”
Microbes and Infection, 2006; 8; 3; pp 598-603
[8] Rajendranath Ramasawmy, Kellen C. Faé, Edecio Cunha-Neto, Susan C.P. Borba, Barbara Ianni, Charles
Mady, Anna C. Goldberg, Jorge Kalil. “Variants in the promoter region of IKBL/NFKBIL1 gene may mark
susceptibility to the development of chronic Chagas’ cardiomyopathy among Trypanosoma cruzi-infected
individuals”. Molecular Immunology, 2008; 45; 1; pp 283-288
[9] Ernesto H. de Titto and Rita L. Cardoni. “Trypanosoma cruzi: Parasite-induced release of lysosomal
enzymes by human polymorphonuclear leukocytes.” Experimental Parasitology. 1983; 56; 2; pp 247-254
Appendix 1 – FastQC Results
(a) FastQC errors
(b) Controls: poly-A tails at read ends
(c) Control: low GC at read ends
(d) High duplicates of 10+ indicating PCR duplicates
(e) High AAAAA content at read ends
(f) Disease: GC content with high variation at read starts
(g) Control: high quality reads
(h) Disease: quality per read dropping at read ends
Appendix 2: EdgeR & DeSeq
(a) Variance function vs mean: variance drops with longer reads.
(b) Disease: log variance vs log mean.
(c) Control: log variance vs log mean. Variance for (b) and (c) is very similar, allowing for normalisation.
(d) log2 fold change, with DE genes in red.
(e) Heatmap of DE, with high similarity (red) compared to differentially expressed (white).
(f) PCA (EdgeR) with separation of control from disease in dimension 1 by DE, with a possibility of technical noise in
dimension 2.
(g) Residuals ECDF plot for Control and Disease, comparing the experimental distribution (ECDF) to chi-squared
probability, comparing expected to observed to show control and disease have significant and similar base residual
distributions, in the form of sensitivity (ECDF) vs specificity.
(h) Density of resVarA (red) and resVarB (blue), showing the two components are independent.