DESeq R practical
(Original author: Edouard Severing)

In this practical we will work with RNA-Seq read counts for maize genes measured in different internodes.

Make sure DESeq is installed. In R run these two lines:

    source("http://bioconductor.org/biocLite.R")
    biocLite("DESeq")

To load the DESeq library and a set of custom functions run:

    source("http://www.bioinformatics.nl/courses/RNAseq/DEseqExercise.R")

The read count data is already available; you can load it by typing the following R command:

    internode_data = read.table( "http://www.bioinformatics.nl/courses/RNAseq/maize_e3.table", row.names = 1, header = T, sep = "\t" )

Explanation: row.names = 1 indicates that the first column in the file contains the names of the rows. header = T indicates that the file contains a header, which in this case means the names of the columns. sep = "\t" indicates that the fields are separated by tabs.

Let's look a bit closer at the data by running:

    dim( internode_data )

Two values will be displayed (ignore the [1]). The first value is the number of rows and the second one the number of columns in our data set.

By running:

    colnames( internode_data )

you will obtain a list with the column names. In this case all the columns except the last one correspond to RNA-Seq samples. Note that the last column contains annotations.

By running:

    rownames( internode_data )

you will obtain a list with all the row names. Row names correspond to the genes.

By running:

    summary( internode_data )

you get a summary of the data in the different columns.

In order to get an indication of the total number of counts in each sample you can run:

    apply( internode_data[,1:6], 2, sum )

With this command we sum the counts in each column (the 2 tells apply to work column by column). The first parameter is set to internode_data[,1:6] and not just internode_data because we can only compute the sum for columns 1 to 6. Column number 7 contains annotations.

Our first experiment

In our first experiment we are interested in the differences between the first and fourth internodes of maize plants. We hope that gene expression differences between the internodes can provide useful information. To this end we have generated RNA-Seq data for three samples of the first internode and three samples of the fourth internode of maize plants.
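The column-sum step used earlier to inspect the data can be tried on a small made-up table before touching the real one. The sketch below is only an illustration: the table toy, its column names and its values are invented, and it uses three sample columns instead of six.

```r
# Toy stand-in for internode_data: three count columns plus one annotation column.
# All names and values are made up for illustration.
toy <- data.frame(
  s1 = c(10, 0, 5),
  s2 = c(20, 1, 7),
  s3 = c(15, 0, 9),
  annotation = c("kinase", "unknown", "transporter"),
  row.names = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# MARGIN = 2 applies sum() to each column, giving one total per sample.
totals <- apply(toy[, 1:3], 2, sum)
print(totals)

# colSums() is a shortcut that gives the same totals.
same_as_colsums <- isTRUE(all.equal(totals, colSums(toy[, 1:3])))

# Including the text column turns everything into character strings,
# so sum() fails; this is why the annotation column is excluded.
mixed_fails <- inherits(try(apply(toy, 2, sum), silent = TRUE), "try-error")
print(mixed_fails)
```

Large differences between the per-sample totals are a first hint that the samples were sequenced to different depths.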
Cleaning

Now we would like to remove all genes that are not very informative. In this specific case we will remove genes that do not have more than 10 counts in any of the samples (an arbitrarily chosen threshold).

Run:

    mx = apply( internode_data[,1:6], 1, max )

That command will create a vector (list) containing the maximum read count (over all our 6 samples) for each gene. Again we use internode_data[,1:6] because we must exclude column 7.

Next we will make a new table that only contains rows for which the maximum count is greater than ten.

    internode_data = internode_data[ mx > 10, ]

Use dim( internode_data ) to determine how many genes you have left in your set.

DESeq

In order to continue we need to make a data object that DESeq can use for performing differential expression analysis.

Run this:

    cds = newCountDataSet( internode_data[,1:6], conditions = c( "first", "first", "first", "fourth", "fourth", "fourth" ) )

With this command we created a DESeq count data set from internode_data. The second parameter indicates that the first three columns correspond to replicates from the first internode and the last three columns correspond to replicates from the fourth internode. Note that we here again exclude column number 7.

Normalization

Before any comparison can be made between samples the counts have to be normalized. The reason for this is that the counts for a gene depend not only on its expression level but also on the depth of sequencing.

Run this:

    cds = estimateSizeFactors( cds )

In order to show you the importance of normalization, type the following in the console:

    norm_versus_non_norm( cds, 1, 2, left = 2, right = 8 )

This command calls a function from the DEseqExercise.R script. It takes the first and second column of the count data set and generates two scatter plots. The first scatter plot contains the non-normalized gene counts and the second plot contains the normalized counts. Do you see why normalization is important?

Cluster analysis of the samples

It is very important to check whether your samples cluster as you expect them to. For instance, you do not want replicate 1 of a root sample to cluster with a replicate of a shoot sample. In this section you will perform a simple and quick clustering analysis.
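To get a feel for what a good outcome looks like, the same kind of check can first be tried on made-up data that needs no DESeq. In the sketch below everything is hypothetical: the sample names, the simulated log-scale values and the noise levels are invented. The distance has the same form as the one used in the commands that follow (0.5 - 0.5 * Pearson correlation).

```r
# Made-up log-scale expression values for 50 genes in two tissues.
set.seed(1)
root_profile  <- rnorm(50, mean = 5, sd = 2)
shoot_profile <- rnorm(50, mean = 5, sd = 2)

# Two replicates per tissue: the tissue profile plus small replicate noise.
toy_cnt <- cbind(
  "root-1"  = root_profile  + rnorm(50, sd = 0.2),
  "root-2"  = root_profile  + rnorm(50, sd = 0.2),
  "shoot-1" = shoot_profile + rnorm(50, sd = 0.2),
  "shoot-2" = shoot_profile + rnorm(50, sd = 0.2)
)

# Correlation-based distance: near 0 for similar profiles, near 1 for opposite ones.
toy_dst <- as.dist(0.5 - 0.5 * cor(toy_cnt))

# Hierarchical clustering of the samples.
toy_tree <- hclust(toy_dst)

# Cutting the tree into two groups should separate the two tissues.
groups <- cutree(toy_tree, k = 2)
print(groups)

# Draw the tree when working in an interactive session.
if (interactive()) plot(toy_tree)
```

If a replicate ended up in the group of the other tissue, that would be a reason to inspect the sample before continuing.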
Type the following commands in the console:

    cnt = log( 1 + counts( cds, normalized = T ) )

This command creates a table in which the normalized counts are transformed to log counts. The 1 is added to avoid taking the log of 0 and to keep all transformed counts at or above 0.

This command will change the column names of the table to shorter ones:

    colnames( cnt ) = c( "1-1", "1-2", "1-3", "4-1", "4-2", "4-3" )

1-1 means first internode, replicate 1.

With this command you create a distance matrix:

    dst = as.dist( 0.5 - 0.5 * cor( cnt ) )

The distance between two samples is calculated as 0.5 - 0.5 * pearson_correlation. This distance is always between 0 (very close/identical) and 1 (totally different/opposite).

With this command you plot a tree:

    plot( hclust( dst ) )

This tree represents a hierarchical clustering of the samples. Do they cluster as expected?

Gene-specific dispersions

In order to detect differential expression DESeq has to estimate the expression variance for each gene. DESeq assumes that gene counts within conditions follow the negative binomial distribution. According to this model the variance in expression of a gene depends on its mean expression level as follows:

    $\sigma^2 = s\mu + \alpha s^2 \mu^2$

The left-hand side is the variance $\sigma^2$, which depends on the mean $\mu$. In the formula, $s$ is a scaling factor that is constant for all genes in a sample/condition and $\alpha$ is called the dispersion. DESeq tries to determine the dispersion value for each gene from the normalized count data. It will later use the dispersions to determine the gene-expression variance for each gene so it can test for differential expression.

Run:

    cds = estimateDispersions( cds, method = "per-condition" )

With this command the gene-specific dispersion values are estimated for each condition separately. When you leave the method parameter out, the dispersions are estimated once over all samples.
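The variance formula above can be made concrete with a few numbers. The sketch below simply evaluates the formula in plain R; the function name nb_variance and the chosen values (scaling factor 1, dispersion 0.1) are assumptions made for illustration and are not estimates from the maize data.

```r
# Variance under the model described above:
#   sigma^2 = s * mu + alpha * s^2 * mu^2
# nb_variance is a hypothetical helper, not a DESeq function.
nb_variance <- function(mu, s = 1, alpha = 0.1) {
  s * mu + alpha * s^2 * mu^2
}

mu <- c(10, 100, 1000)

# With alpha = 0 the second term vanishes and the variance equals s * mu.
v_no_dispersion <- nb_variance(mu, s = 1, alpha = 0)

# With alpha = 0.1 the quadratic term dominates for highly expressed genes.
v <- nb_variance(mu, s = 1, alpha = 0.1)

print(data.frame(
  mean            = mu,
  linear_term     = 1 * mu,
  dispersion_term = 0.1 * 1^2 * mu^2,
  variance        = v
))
```

For a mean of 10 the two terms contribute equally (10 + 10), while for a mean of 1000 the dispersion term (100000) is a hundred times larger than the linear term (1000). This is why the dispersion estimate matters most for strongly expressed genes.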
Now type the following three commands in the console:

    par( mfrow = c(1,2) )
    plotDispEsts( cds, cond = "first" )
    plotDispEsts( cds, cond = "fourth" )

You should now see 2 plots. The left plot corresponds to the "first" internode and the right plot to the "fourth" internode. All black dots are the dispersion values that were directly calculated from the normalized count data. As you can see, DESeq fitted a (red) line through the data. This means that the dispersion value is modelled as a function of the mean expression value.

Differential expression

We have now arrived at the step where we can perform a differential expression analysis.

Type the following command in the console:

    res = nbinomTest( cds, "first", "fourth" )

This command will perform the differential expression tests between our two conditions.

To see the top rows from the differential expression table, type the following command in the console:

    head( res )

The padj column contains p-values that are adjusted for multiple testing. baseMeanA and baseMeanB are the mean count values for the first and fourth internode, respectively.

Next we are going to add our annotation back to the results table.

Type in the console:

    res$annotation = internode_data[,7]

With this command we add a new column to the results table res which contains the annotation column (number 7) from the internode_data table. Confirm this using head.

Excel table

Now we write an output table that you can open in Excel later.

Type in the console:

    write.table( res, col.names = T, row.names = F, file = "DESeq_output", sep = "\t" )

With this command we write table res to disk without the row names but with the column names. Row names are left out because they are now just row numbers.
We use tabs for separating fields (sep = "\t"). The file that is created is called: DESeq_output

Volcano plot

To end this exercise we will make a volcano plot using one of the functions from the DEseqExercise.R file. Each point in a volcano plot represents a gene. The x-coordinate of the gene corresponds to the log2 fold change between the two conditions/tissues and the y-coordinate corresponds to -log10(p-value). Hence the volcano plot provides an overview of the log2 fold changes and p-values. All red points correspond to genes that are differentially expressed according to the adjusted p-value threshold of 0.01.

You can make the plot by typing:

    volcano( res )

What is the first thing that strikes you when you examine the volcano plot?

Can you get a list of the 10 genes with the highest significant fold change?

How many genes are differentially expressed if you take a cut-off for the adjusted p-value of 0.01?
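As a hint for the last two questions, the filtering and sorting can be practised on a small made-up table first. The sketch below is one possible approach, not the only one. The table toy_res and its values are invented; the column name padj comes from the text above, while id and log2FoldChange are assumed to match the columns shown by head( res ), so check them against your own output.

```r
# Made-up results table; values are invented for illustration.
toy_res <- data.frame(
  id = paste0("gene", 1:6),
  log2FoldChange = c(3.2, -4.1, 0.3, 1.5, -0.2, 2.8),
  padj = c(0.001, 0.0005, 0.8, 0.03, NA, 0.009),
  stringsAsFactors = FALSE
)

# Keep genes below the adjusted p-value cut-off.
# which() drops rows where padj is NA instead of returning NA rows.
sig <- toy_res[which(toy_res$padj < 0.01), ]

# Number of genes called differentially expressed at this cut-off.
n_sig <- nrow(sig)
print(n_sig)

# Sort by the size of the change, ignoring direction, largest first,
# and keep at most the first 10 rows.
sig_sorted <- sig[order(abs(sig$log2FoldChange), decreasing = TRUE), ]
top <- head(sig_sorted, 10)
print(top)
```

On the real table the same steps would be applied to res instead of toy_res. Fold changes can be infinite when a gene has a mean count of zero in one condition, so the top of the sorted list may need a closer look.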