“Bioinformatic tools for whole genome and targeted array analysis”:
Dealing with noise in the data
Or
No such thing as a free lunch!

Greg Peters & Artur Darmanian
Cytogenetics Dept, Children’s Hospital at Westmead
Sydney NSW 2145, Australia

CHW Cytogenetics Lab: CGH microarray diagnostic numbers & timeline.
[Figure: diagnostic array numbers by year, 2005 to 2009. Series labels: "Our lab: BAC arrays", "Oligo arrays", "SNP/genotyping arrays".]

BMC Genomics 2009, 10:588 doi:10.1186/1471-2164-10-588
Note significant effect of GC content on array signal log ratio, when GC > 0.45 or < 0.35 [Affymetrix arrays]

Consider CGH array data for a true Xq28 deletion, in a female.
Which also shows a proximal duplication?

Array traces for this patient, plus two “normal” cases run in the same batch.
Two of the three show similar [false] duplications in Xq28.
Green arrows: TARGETED genes MECP2, SLC6A8 & L1CAM

A 2nd abnormal case, with true distal duplication of Xq27.1 to Xqter
[Figure label: SLC6A8]

Same case, in detail: distal Xq, showing q27.1 translocation breakpoint
[Figure label: SLC6A8. Plot shows probe order only.]

An abnormal case: male with 16 Mb duplication, due to unbalanced t(X;3)(q27.1;q29) translocation. DLR=0.20

Agilent probe length vs GC content: Xq28 region
Agilent 60K targeted array: probe length in bases [blue] = 60 bases, usually, and probe GC content % [red]. BOX: detail of MECP2 probes
[Figure labels: SLC6A8, MECP2]

Agilent arrays: probe length compensates for GC content.
Grey areas = proportion of probes < 60 bases long.
BMC Genomics 2009, 10:588 doi:10.1186/1471-2164-10-588

For Agilent arrays, probes are reduced in length [from 60 to 45 bases] as mean GC level increases, as the whole chromosome data shows.
[Generally, high GC chromosomes are: gene rich, more lethal as trisomies, and generally early replicating.]
[Figure: proportion of probes < 60 bases (y axis, 0 to 0.35) against autosome number (x axis).]

“Agilent CGH Analytics software [used] the aberration detection method 1 (ADM1 or “adam-one”) algorithm.
ADM1 is an aberration algorithm that identifies all aberrant intervals in a given sample with consistently high or low log ratios based on the statistical score. The ADM algorithms search for intervals in which the average log ratio of the sample and reference channels exceeds a user specified threshold. [usually +/- ~0.20]
The ADM1 statistical score is computed as the average normalized log ratios of all probes in the genomic interval, multiplied by the square root of the number of these probes. It represents the deviation of the average of the normalized log ratios from its expected value of zero.
The ADM1 score is proportional to the height h (absolute average log ratio) of the genomic interval, and to the square root of the number of probes in the interval. Roughly, for an interval to have a high ADM1 score, it should have high height and/or include a large number of probes.”

Data from our lab:
Duplication +ve call for the ARSA gene in 22q: mean “height” h = ~0.5, with 18 probes.

Regarding algorithms:
OED: “a process or set of rules used for calculation or problem solving, especially with a computer”
[from “al-Khwarizmi”: 9th century Persian mathematician].

Modification of algorithm ADM1 to include weightings for probe reliability: ADM2
http://users.isr.ist.utl.pt/~jmrs/research/TopicsLinks/Genomics/gensips2005/papers/Gensips2005_SPSAppraoch140.pdf
.....etc!
[Figure from the linked paper: breast cancer sample, chromosome 17.]

Other examples: ADM2 algorithm’s weightings are GC related:
NB: the ARSA “duplication” was a false positive!
[Figure: Agilent "probescores", which are the probe “weightings” to correct for high GC effects, for 54 sequential probes in distal 22q13.33, including ARSA & SHANK3. Y axis: probescore, 0 to 1.2.]

[Figure: HIRA & TBX1 sequential probes in the VCFS region. Plot shows Agilent ADM2 "probescores" across the region. Y axis: probescore, 0 to 1.2.]
TBX1 is badly behaved.....
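As a rough illustration of the ADM1 score quoted above (average normalised log ratio multiplied by the square root of the probe count), the sketch below applies it to the ARSA call from the lab data. It is a minimal sketch, not Agilent's implementation: the function names, the optional `noise_sd` normalisation, and the weighted variant (one plausible way per-probe "probescores" could enter an ADM2-style score) are assumptions made purely for illustration.

```python
import math


def adm1_score(log_ratios, noise_sd=1.0):
    """Mean normalised log ratio times sqrt(number of probes).

    noise_sd is a hypothetical normalisation constant (an assumption);
    with the default of 1.0 the raw log ratios are used unchanged.
    """
    n = len(log_ratios)
    if n == 0:
        raise ValueError("interval contains no probes")
    mean_height = sum(x / noise_sd for x in log_ratios) / n
    return mean_height * math.sqrt(n)


def weighted_score(log_ratios, weights):
    """Hypothetical weighted variant: sum(w * x) / sqrt(sum(w ** 2)).

    Reduces to adm1_score (with noise_sd=1.0) when every weight is 1.
    Illustration only; this is not the published ADM2 formula.
    """
    if len(log_ratios) != len(weights):
        raise ValueError("need exactly one weight per probe")
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("all weights are zero")
    return sum(w * x for w, x in zip(weights, log_ratios)) / norm


# ARSA example from the lab data: mean height ~0.5 over 18 probes.
arsa = [0.5] * 18
print(round(adm1_score(arsa), 2))  # 0.5 * sqrt(18)

# Same interval, but with half of the probes given zero weight.
print(round(weighted_score(arsa, [1.0] * 9 + [0.0] * 9), 2))
```

Under these assumptions, discounting half of the probes lowers the score for the same mean height, which is the kind of effect a GC-related probe weighting would have on a marginal call such as the ARSA false positive.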
[Figure axis: sequential probes 1 to 130; TBX1 labelled.]
...and TBX1 is GC rich

Data for a patient with the standard VCFS deletion [red arrow]. But note TBX1 appears normal!
[Figure annotation: TBX1 region goes here]

[Figure: Xq28 region, from 149 to 153 Mb: Agilent "probescores" for the algorithm “ADM2”, which aim to correct for GC “bias” in certain array probes. About 300 sequential probes; labelled genes: MECP2, L1CAM, SLC6A8. Y axis: probescore, 0 to 1.2.]

Suggestion of similar GC effects with other genomic technologies:
Fan et al (2008) also observe a similar problem when measuring copy number (trisomy) by high throughput sequencing.
They state:
“We observed that certain chromosomes have large variations in the counts of sequenced fragments from sample to sample, and that this depends strongly on the GC content. It is unclear at this point whether this stems from PCR artifacts during sequencing, library preparation, or cluster generation or the sequencing process itself, or whether it is a true biological effect relating to chromatin structure.”
They go on to say that, fortunately for prenatal diagnosis, chromosomes 13, 18 and 21 are very little affected by this problem!!!
Fan HC, et al: “Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood” PNAS October 21, 2008, 105, no. 42: 16266-16271.

And a related explanation:
Vanneste et al (2009) consider these problems in the context of microarrays. They regard the cause as more likely “…non-technical, as deduced from the minor standard deviations between the duplicate spots” for each probe on the array.
They suggest the GC related problem is “more likely to be biological.
For instance, determination of copy numbers in cells during S phase will inevitably lead to more scatter of intensity ratios detected on consecutive BAC clones.”
Vanneste E, et al (2009), Nature Medicine 15, 577-583

A whole new can of worms!!!! (but also GC related)
[Figure: q arm of chromosome 11, GC %ge. Red dots = early replicating regions; blue dots = late. Eg NDUFV1 marked.]

[Figure: array log ratios (y axis, -0.3 to 0.8) for 81 sequential probes around NDUFV1. DLR=0.16]
Eg NDUFV1: a targeted gene in an early replicating, and GC rich, region of 11q.
Appears ?duplicated on this reasonable quality array data
DGV database: http://projects.tcag.ca/cgi-bin/variation/gbrowse/hg18

Recalling Vanneste et al:
....the GC related problem is “more likely to be biological...” [than technical].
“For instance, determination of copy numbers in cells during S phase will inevitably lead to more scatter of intensity ratios...”, because the early replicating regions will be over-represented in the DNA sample.
For some DNA sources, we suggest this effect can seriously degrade array QC values, often resulting in data rejection, or worse: false positives [duplications, usually]

What DNA samples are more likely to suffer the S phase cycling problem?
1) Non-blood [eg fibroblast] cells being cultured in vitro at time of DNA harvest
2) Neonatal blood lymphocytes
3) Rapidly cycling malignant cells
These do tend to give poor quality array data, with dups at GC rich loci.
[Figure labels: T cells, B cells]

All good old Cytogeneticists know about late vs early replication!!
R-banding: Dark = early replicating; Pale = late replicating.
Circled are both copies of autosome 19 [dark], and the partly deleted copy of X [pale].

Conclusions:
When interpreting microarray data…
Beware of artefacts both technical & biological!
Know your genome, and its peculiarities!!

Acknowledgments:
Thanks to all at Cytogenetics CHW, who contribute to this work.
…to Prof Ian Dawes and his colleagues, at The Ramaciotti Centre, Dept Biochem and Molec Biol, University of NSW, Sydney, who helped us get into microarrays.
And to Dan Belluocio [Agilent] for probe GC data.
[Figure: % replicons in S at time t, across an 8 hour S phase, from early to late replicating, ending at mitosis/harvest.]
R-banding: addition of the thymidine analogue BrdU, in mid S phase, inhibits giemsa staining of late replicating regions.
These include the inactive X, and all G-positive bands. The latter are gene and GC poor.
Works better if cell cycles are synchronised. An old cytogenetic technique.

[Figure: R-banded chromosomes, grouped as "mostly late" and "mostly early" replicating, with a normal X and a deleted X; colour key for early and late.]
R-banding courtesy of Mr John Kemp FHGSA [retired]

Blue = marginal array data for the MECP2 region of Xq28 [DLR=0.24]
Red = Agilent probe GC % contents, for probes in this region, on the 60K targeted array used in this experiment
[Figure labels: SLC6A8, MECP2]

Array data for a [different] normal subject
4Kb BlueGnome CytoChip BAC array data for chromosome 11
Blue points = poor quality array data [log ratios]
Red circles = relative GC content of each BAC probe used [rescaled]
[Figure axis: 11pter, cen, 11qter]

Equivalent dye-swap log ratio data for chromosome 11, in an oligomer array test [Agilent 44K].
The two dye swaps are here plotted as reciprocal ratios (black and tan traces respectively). This case suffers bad “GC” effect, and did fail QC [DLR > .25], although only just.
[Figure axis: qter to pter]

Chromosome 11
X axis: %GC for CytoChip BAC probes [rescaled]
Y axis: BAC array data [log ratios]
Correlation = 0.41.

Data analysis in CGH microarray work relies on algorithms, based on statistical theory, empirical considerations, and underlying biology of the genome.
Such algorithms are devised to minimise errors in interpreting the data, ie, failure to find the true, segmental “deviations from disomy”, which may be scattered anywhere along the chromosome.
Some errors may arise through:
A) stochastic/statistical sources:
   eg1 “background noise” effects: mainly dependent on quality of sample, or quality of experiment, such that the standard deviation of the data points will overwhelm the resolution required.
   eg2 “bonferroni” effects: so long as you do enough experiments [eg 100,000], you WILL get a significant result! [which may be meaningless]
B) genome biology:
   eg1 local peculiarities in composition of the DNA sequence, which affect probe behaviour [most obviously, GC %ge]
   eg2 cell cycle effects: what %ge of cells are in S phase at time of harvest?
   ? apoptosis
C) array design “faults”: particularly relevant in “targeted” arrays, in deciding as to the optimal distribution of probes. Because many thousands of probes [& loci] must be chosen, errors are likely.

Wikipedia:
The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data falsely giving the appearance of significance, as 1 out of every 20 hypothesis tests [or sets of hypothesis tests] is expected to be significant at the α = 0.05 level, purely due to chance.
Thus the probability of getting a significant result with n tests [or n sets of tests] at this “α” level of significance is 1 − 0.95^n (or 1 − the probability of not getting a significant result, with n tests…).

For arrays:
Example: if there is a 1% chance of finding one +ve for 5 adjacent probes in 50,000 [by chance alone], then for a higher resolution array, with eg 100,000 probes, the chance of such a false +ve will be > 1%.
Hence the number of false positives may be unacceptable.
In practice, a surrogate Bonf. correction can be applied by increasing the minimum number of adjacent probes required for a positive call.

In our lab:
Probes required           Array size (probes)   Resolution
5 [recommended no. is 4]  60,000                0.25 Mb
7                         180,000               0.16 Mb
10                        400,000               0.06 Mb

These numbers are not as per the textbook calculation.
In part, they are chosen based on observation of numbers of apparently true +ves per experiment.
This can be calibrated, to some extent, on the incidence of known common CNVs.
Eg with the 400,000 probe arrays, the average person will have 5 to 15 such CNVs, large enough to score as true positives.

[2009] *
* and in press, 2011:
Regarding: 16p11.2 microdeletion
Russell Dale, Padraic Grattan-Smith, Victor SC Fung, and Greg B Peters: Neurology
With acknowledgment to our colleagues in the Neurology Dept, CHW

Neurogenetics updated: Extended hunt for autism genes
Nature, 474, Pages: 254-255
Date published: (16 June 2011)
Published online 15 June 2011
Groups led by Michael Wigler at Cold Spring Harbor Laboratory in New York and Matthew State at Yale University in New Haven, Connecticut, conducted the most comprehensive search yet for spontaneous duplications or deletions of stretches of DNA that may be associated with autism spectrum disorders.
In analysing the genomes of more than 1,000 people (some with autism, some unaffected family members), the teams found at least 130 sites in the genome where spontaneous duplications or deletions might contribute to autism risk.

What does the lab’s report include?
Red? or Blue? or something in between?
[Figure label: CNV found by microarray software’s own algorithm]

Xq distal duplication in a male:
Array to confirm dup of MECP2 on MLPA
Duplication: 16.35 Mb from Xq27.1-Xqter

A case of TRUE Xq duplication! [in unbalanced t(X;3)]

Real duplication in Xq: DLR=0.20 Threshold=6.7
t(X;3) case: Xq dup, showing region of MECP2
[Figure labels: SLC6A8, MECP2]

Real duplication in Xq: DLR=0.20 Threshold=2.2
t(X;3) case, showing region of MECP2.
[Figure labels: SLC6A8, MECP2]

Another “normal” case

A normal case:
Here, the Agilent algorithm ADM2 finds false duplications flanking MECP2 [arrows], but only if the “significance threshold” is reduced to 3.2, from the usual 6
[Figure labels: SLC6A8, MECP2]

A normal case
And still does, in 2011!

CGH array data from an array user’s perspective:
Same case, with true distal duplication of Xq.
For the region arrowed: is this bit normal?
X chromosome array data, from pter to qter [plots probe order only]

1996: GC contents in Xq28
Red arrow shows approx position of MECP2

60K targeted array: ~1 Mb region flanking MECP2
1 probe per 2.1 Kb
[Figure axes: base no. against probes [~290]; SLC6A8 labelled]

44K array: ~1 Mb region flanking MECP2
1 probe per 21.7 Kb
[Figure axes: base no. against probes [~50]]
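
The multiple-testing argument quoted on the Bonferroni slide (the chance of at least one "significant" result in n tests at α = 0.05 is 1 − 0.95^n) can be checked numerically. This is a minimal sketch that assumes the tests are independent; the function names are illustrative and are not part of any array software.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one chance 'significant' result in
    n_tests independent tests, each run at significance level alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# The chance of a spurious hit grows quickly with the number of tests.
for n in (1, 20, 100):
    print(n, round(familywise_error(0.05, n), 3))

# Per-test level needed to hold the overall level at 0.05 for 100,000 tests.
print(bonferroni_alpha(0.05, 100_000))
```

With 20 tests the chance of at least one spurious result is already close to two in three, which is why the lab raises the minimum number of adjacent probes per call as array density increases rather than relying on a single-probe threshold.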