DESeq R practical

(Original author: Edouard Severing)
In this practical we will work with RNA-Seq read counts for maize genes measured in different internodes.
Make sure DESeq is installed. In R, run these two lines:
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
To load the DESeq library and a set of custom functions, run:
source("http://www.bioinformatics.nl/courses/RNAseq/DEseqExercise.R")
The read count data is already available; you can load it by typing the following R command:
internode_data = read.table(
"http://www.bioinformatics.nl/courses/RNAseq/maize_e3.table", row.names = 1,
header = T, sep = "\t" )
Explanation: row.names = 1 indicates that the first column in the file contains the names of the rows.
header = T indicates that the file contains a header, which in this case means the names of the columns.
sep = "\t" indicates that the fields are separated by tabs.
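The effect of these three arguments can be tried on a small table without downloading anything. The sketch below is only an illustration: the gene names, counts and annotations are made up, and the text = argument of read.table is used so that the "file" is an in-memory string.

```r
# Hypothetical two-gene table in the same layout as the course file:
# a header line, then one tab-separated row per gene.
toy_text <- "gene\ts1\ts2\tannotation\ngeneA\t5\t7\tkinase\ngeneB\t0\t3\tunknown\n"

# Same arguments as in the practical, applied to the toy text.
toy <- read.table(text = toy_text, row.names = 1, header = TRUE, sep = "\t")

dim(toy)       # number of rows, then number of columns
rownames(toy)  # taken from the first column because row.names = 1
colnames(toy)  # taken from the header line because header = TRUE
```

Because the first column is used for the row names, it no longer counts as a data column.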
Let’s take a closer look at the data by running:
dim( internode_data )
Two values will be displayed (ignore the [1]). The first value represents the number of rows and the second one the number of columns in our dataset.
By running:
colnames( internode_data )
you will obtain a list with the column names. In this case all the columns except the last one correspond to RNA-Seq samples. Note that the last column contains annotations.
By running:
rownames( internode_data)
you will obtain a list with all the row names. Row names correspond to the genes.
By running:
summary(internode_data)
you get a summary of the data in the different columns.
In order to get an indication of the total number of counts in each sample you can run:
apply( internode_data[,1:6], 2, sum)
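With this command we sum the counts in each column; the second argument, 2, tells apply to work column by column. The first parameter is set to internode_data[,1:6] and not just internode_data because we can only compute the sum for columns 1 to 6. Column number 7 contains annotations.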
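To see what the column-wise apply does, it can help to run it on a table small enough to check by hand. This is a sketch with made-up counts; toy and its column names are hypothetical and only mimic the layout of internode_data (count columns first, annotation last).

```r
# Hypothetical table: two count columns followed by an annotation column.
toy <- data.frame(
  s1 = c(5, 0, 12),
  s2 = c(7, 3, 20),
  annotation = c("a", "b", "c"),
  row.names = c("geneA", "geneB", "geneC")
)

# Sum per column (MARGIN = 2), restricted to the numeric count columns.
totals <- apply(toy[, 1:2], 2, sum)
totals

# colSums is a base R shortcut that gives the same totals.
same <- all(totals == colSums(toy[, 1:2]))
same
```

Including the annotation column would make apply convert everything to text, after which sum fails; this is why the count columns are selected first.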
Our first experiment
In our first experiment we are interested in what the differences are between the first and fourth internodes of maize plants. We hope that gene expression differences between the internodes can provide useful information. To this end we have generated RNA-Seq data for three samples of the first internode and three samples of the fourth internode of maize plants.
Cleaning
Now we would like to remove all genes that are not very informative. In this specific case we will remove genes that do not have more than 10 counts in any of the samples (an arbitrarily chosen threshold).
Run:
mx = apply( internode_data[,1:6], 1, max )
That command will create a vector (list) containing the maximum read count (over all our 6 samples) for each gene. Again we use internode_data[,1:6] because we must exclude column 7.
Next we will make a new table that only contains rows for which the maximum count is greater than ten.
internode_data = internode_data[ mx > 10, ]
Use dim(internode_data) to determine how many genes you have left in your set.
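The two cleaning steps can be followed on a toy table. The sketch below uses made-up counts (toy is hypothetical) and the same threshold of 10; only the gene whose highest count exceeds the threshold is kept.

```r
# Hypothetical table: two count columns followed by an annotation column.
toy <- data.frame(
  s1 = c(5, 0, 12),
  s2 = c(7, 3, 20),
  annotation = c("a", "b", "c"),
  row.names = c("geneA", "geneB", "geneC")
)

# Maximum count per gene (MARGIN = 1 means row by row).
mx <- apply(toy[, 1:2], 1, max)
mx

# Logical vector: TRUE for genes with at least one sample above 10 counts.
keep <- mx > 10

# Keep only those rows; all columns, including the annotation, are retained.
filtered <- toy[keep, ]
dim(filtered)
```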
DESeq
In order to continue we need to make a data object that DESeq can use for performing differential expression analysis.
Run this:
cds = newCountDataSet( internode_data[,1:6], conditions = c("first", "first",
"first", "fourth", "fourth", "fourth" ) )
With this command we created a DESeq count data set from internode_data. The second parameter indicates that the first three columns correspond to replicates from the first internode and the last three columns correspond to replicates from the fourth internode. Note that here we again exclude column number 7.
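The conditions vector must list one label per count column, in column order. As a small aside (not part of the original exercise), the same vector can be built with rep, which is less error-prone when there are many replicates; the variable names below are hypothetical.

```r
# The vector as typed out in the practical.
conditions_long <- c("first", "first", "first", "fourth", "fourth", "fourth")

# Equivalent construction: repeat each label three times, in order.
conditions_short <- rep(c("first", "fourth"), each = 3)

identical(conditions_long, conditions_short)

# Number of samples per condition.
table(conditions_short)
```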
Normalization
Before any comparison can be made between samples the counts have to be normalized. The reason for this is that the counts for a gene not only depend on its expression level but also on the depth of sequencing.
Run this:
cds = estimateSizeFactors( cds )
In order to show you the importance of normalization, type the following in the console:
norm_versus_non_norm( cds, 1, 2, left = 2, right = 8 )
This command calls a function from the DEseqExercise.R script. It takes the first and second columns of the count data set and generates two scatter plots. The first scatter plot contains the non-normalized gene counts and the second plot contains the normalized counts. Do you see why normalization is important?
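To make the idea of a size factor concrete, here is a base R sketch of the median-of-ratios approach that the DESeq documentation describes for estimateSizeFactors. It is a re-implementation for illustration on made-up counts, not DESeq's own code, and it assumes no gene has a zero count (zeros need extra handling).

```r
# Hypothetical counts: sample s2 was sequenced exactly twice as deep as s1.
counts_toy <- matrix(
  c(10, 20,
    100, 200,
    50, 100,
    30, 60),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2"))
)

# Log of the per-gene geometric mean across samples.
log_geo_means <- rowMeans(log(counts_toy))

# Size factor per sample: median ratio of its counts to the geometric means.
size_factors <- apply(counts_toy, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})
size_factors

# Normalized counts: divide every column by its size factor.
normalized <- sweep(counts_toy, 2, size_factors, "/")
normalized
```

After dividing by the size factors the two columns are identical, which is what one expects when the only difference between the samples is sequencing depth.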
Cluster analysis of the samples
It is very important to check whether your samples cluster as you expect them to. For instance, you don’t want replicate 1 of a root sample to cluster with a replicate of a shoot sample. In this section you will perform a simple and quick clustering analysis.
Type the following commands in the console:
cnt = log( 1 + counts( cds, normalized = T ) )
This command creates a table in which the normalized counts are transformed to log counts. The 1 is simply there to prevent taking the log of 0 and having transformed counts less than 0.
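The role of the added 1 is easy to check on a few made-up numbers. The sketch below compares the shifted and the plain logarithm; raw is a hypothetical vector of counts.

```r
# Hypothetical counts, including a zero.
raw <- c(0, 1, 9, 99)

# Plain log: the zero count becomes -Inf.
plain <- log(raw)
plain

# Shifted log, as in the practical: zero stays zero, nothing is negative.
shifted <- log(1 + raw)
shifted

# Base R's log1p computes the same quantity.
all.equal(shifted, log1p(raw))
```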
This command will change the column names of the table to shorter ones:
colnames( cnt ) = c( "1-1", "1-2", "1-3", "4-1", "4-2", "4-3" )
1-1 means first internode, replicate 1.
With this command you create a distance matrix:
dst = as.dist( 0.5 - 0.5 * cor( cnt ))
The distance between two samples is calculated as 0.5 - 0.5 * pearson_correlation. This distance is always between 0 (very close/identical) and 1 (totally different/opposite).
With this command you plot a tree:
plot( hclust( dst ) )
This tree represents a hierarchical clustering of the samples. Do they cluster as expected?
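The distance and clustering steps can be tried on a tiny matrix where the expected outcome is obvious. The values below are made up: two "first internode" columns rise together, two "fourth internode" columns fall together. cutree is used here (it is not part of the original exercise) to read the two main branches of the tree as group labels instead of inspecting the plot.

```r
# Hypothetical log-count table: 5 genes, 2 replicates per internode.
cnt_toy <- cbind(
  "1-1" = c(1.0, 2.0, 3.0, 4.0, 5.0),
  "1-2" = c(1.1, 2.0, 3.2, 3.9, 5.1),
  "4-1" = c(5.0, 4.0, 3.0, 2.0, 1.0),
  "4-2" = c(5.2, 3.9, 3.1, 2.0, 0.9)
)

# Correlation-based distance: 0 for correlation +1, 1 for correlation -1.
dst_toy <- as.dist(0.5 - 0.5 * cor(cnt_toy))
dst_toy

# Hierarchical clustering, then cut the tree into two groups.
hc <- hclust(dst_toy)
groups <- cutree(hc, k = 2)
groups
```

Replicates of the same internode end up in the same group; plot(hc) would draw the corresponding tree.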
Gene-specific dispersions
In order to detect differential expression, DESeq has to estimate the expression variance for each gene. DESeq assumes that gene counts within conditions follow the negative binomial distribution. According to this model the variance in expression of a gene depends on its mean expression level as follows:
$\sigma^2 = s\mu + \alpha s^2 \mu^2$
The left-hand side is the variance, which depends on the mean µ. In the formula, s is a scaling factor that is constant for all genes in a sample/condition and α is called the dispersion. DESeq tries to determine the dispersion value for each gene from the normalized count data. It will later use the dispersions to determine the gene-expression variance for each gene so it can test for differential expression.
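A quick numerical reading of the variance formula: the first term is what a Poisson model would give, and the second term is the extra variance controlled by the dispersion. The helper nb_variance below is hypothetical (it is not a DESeq function) and the input values are made up.

```r
# Variance under the model above: s * mu + alpha * s^2 * mu^2.
nb_variance <- function(mu, s, alpha) {
  s * mu + alpha * s^2 * mu^2
}

# With dispersion 0 the variance equals the scaled mean (Poisson-like).
nb_variance(mu = 100, s = 1, alpha = 0)

# With dispersion 0.1 the quadratic term dominates at this expression level.
nb_variance(mu = 100, s = 1, alpha = 0.1)
```

For a mean of 100 and a scaling factor of 1, a dispersion of 0.1 gives 100 + 0.1 * 100^2 = 1100, eleven times the Poisson variance.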
Run:
cds = estimateDispersions( cds, method = "per-condition" )
With this command the gene-specific dispersion values are estimated for each condition separately.
When you leave the method parameter out, you will estimate a single dispersion per gene over all samples.
Now type the following three commands in the console:
par( mfrow = c(1,2) )
plotDispEsts( cds, cond = "first" )
plotDispEsts( cds, cond = "fourth" )
You should now see 2 plots. The left plot corresponds to the “first” internode and the right one to the “fourth” internode. All black dots are the dispersion values that were directly calculated from the normalized count data. As you can see, DESeq fitted a (red) line through the data. This means that the dispersion value is modeled as a function of the mean expression value.
Differential expression
We have now arrived at the step where we can perform a differential expression analysis.
Type the following command in the console:
res = nbinomTest( cds, "first", "fourth" )
This command will perform the differential expression tests between our two conditions.
To see the top rows from the differential expression table, type the following command in the console:
head( res )
The padj column contains p-values that are adjusted for multiple testing. baseMeanA and baseMeanB are the mean count values for the first and fourth internode, respectively.
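To get a feel for what "adjusted for multiple testing" does to a p-value, the sketch below applies the Benjamini-Hochberg correction from base R's p.adjust to a few made-up p-values. The DESeq documentation describes padj as Benjamini-Hochberg adjusted; the numbers here are hypothetical and not taken from the maize data.

```r
# Hypothetical raw p-values for five genes.
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate).
padj_toy <- p.adjust(pvals, method = "BH")
padj_toy

# Number of genes below 0.01 before and after adjustment.
n_raw <- sum(pvals < 0.01)
n_adj <- sum(padj_toy < 0.01)
c(raw = n_raw, adjusted = n_adj)
```

Adjusted p-values are never smaller than the raw ones, so a threshold on padj is the stricter of the two.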
Next we are going to add our annotation back to the results table.
Type in the console:
res$annotation = internode_data[,7]
With this command we add a new column to the results table res, which contains the annotation column (number 7) from the internode_data table. Confirm this using head.
Excel table
Now we write an output table that you can open in Excel later.
Type in the console:
write.table( res, col.names = T, row.names = F, file = "DESeq_output", sep =
"\t")
With this command we write table res to disk without the row.names but with col.names. Row names are left out because they are now just numbers. We use tabs for separating fields (sep = "\t"). The file that is created is called: DESeq_output
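A way to convince yourself that the written file has the intended layout is to write a small table and read it straight back. This sketch uses a made-up two-row table and a temporary file (res_toy and out_file are hypothetical names), so it does not touch DESeq_output.

```r
# Hypothetical results table with two columns.
res_toy <- data.frame(id = c("geneA", "geneB"), padj = c(0.005, 0.5))

# Write to a temporary file with the same options as in the practical.
out_file <- tempfile(fileext = ".txt")
write.table(res_toy, col.names = TRUE, row.names = FALSE,
            file = out_file, sep = "\t")

# Read the file back: header line present, fields separated by tabs.
back <- read.table(out_file, header = TRUE, sep = "\t")
back

# Clean up the temporary file.
unlink(out_file)
```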
Volcano plot
To end this exercise we will make a volcano plot using one of the functions from the DEseqExercise.R file. Each point in a volcano plot represents a gene. The x-coordinate of the gene corresponds to the log2 fold change between the two conditions/tissues and the y-axis corresponds to -log10(p-value). Hence the volcano plot provides an overview of the log2 fold changes and p-values. All red points correspond to genes that are differentially expressed according to the adjusted p-value threshold of 0.01.
You can make the plot by typing:
volcano(res)
What is the first thing that strikes you when you examine the volcano plot?
Can you get a list of the 10 genes with the highest significant fold change?
How many genes are differentially expressed if you take a cut-off for the adjusted p-value of 0.01?
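One possible way to approach the last two questions is sketched below on a made-up results table. It assumes the table has columns named id, log2FoldChange and padj, as in the output of nbinomTest; res_toy and its values are hypothetical, and "highest fold change" is interpreted here as the largest absolute log2 fold change.

```r
# Hypothetical results table; NA mimics a gene that could not be tested.
res_toy <- data.frame(
  id = paste0("gene", 1:6),
  log2FoldChange = c(3.0, -4.5, 0.2, 2.1, -0.1, 5.0),
  padj = c(0.001, 0.0005, 0.8, 0.02, NA, 0.009)
)

# Genes passing the adjusted p-value cut-off of 0.01 (NA values are dropped).
sig <- res_toy[!is.na(res_toy$padj) & res_toy$padj < 0.01, ]
n_sig <- nrow(sig)
n_sig

# Order the significant genes by absolute log2 fold change, largest first,
# and keep at most the first 10.
top <- head(sig[order(abs(sig$log2FoldChange), decreasing = TRUE), ], 10)
top
```

On the real res table the same lines apply after replacing res_toy with res.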