“Bioinformatic tools for whole genome and targeted
array analysis”:
Dealing with noise in the data
Or
No such thing as a free lunch!
Greg Peters & Artur Darmanian
Cytogenetics Dept,
Children’s Hospital at Westmead
Sydney NSW 2145, Australia
CHW Cytogenetics Lab:
CGH microarray diagnostic numbers & timeline.
[Chart: our lab's BAC arrays ….. oligo arrays; SNP/genotyping arrays. X axis: year, 2005 to 2009]
BMC Genomics 2009, 10:588 doi:10.1186/1471-2164-10-588
Note significant effect
of GC content on
array signal log ratio,
when GC > 0.45 or < 0.35
[Affymetrix arrays]
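As a rough illustration of the GC ranges quoted above, the sketch below computes the GC fraction of a probe sequence and labels it against the 0.35 and 0.45 cut-offs. The function names and the example sequences are hypothetical, and treating the cut-offs as hard class boundaries is an assumption made only for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a probe sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_flag(seq: str, low: float = 0.35, high: float = 0.45) -> str:
    """Label a probe using the GC cut-offs quoted on the slide.

    Using them as hard boundaries is an assumption for this sketch.
    """
    gc = gc_fraction(seq)
    if gc > high:
        return "high-GC"
    if gc < low:
        return "low-GC"
    return "mid-GC"


# Hypothetical example sequences, not real array probes.
for probe in ("GCGCATATAT", "ATGCGCGCGC", "ATATATATGC"):
    print(probe, round(gc_fraction(probe), 2), gc_flag(probe))
```

Probes flagged as high-GC or low-GC are the ones where, on the data cited above, the log ratio would be expected to be less reliable.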
Consider CGH array data for a true
Xq28 deletion, in a female.
Which also shows a proximal duplication?
Array traces for this patient,
plus two “normal” cases run
in the same batch.
Two of the three show similar
[false] duplications in Xq28.
Green arrows: TARGETED genes MECP2, SLC6A8 & L1CAM
A 2nd abnormal case, with true distal duplication of Xq27.1 to Xqter
SLC6A8
Same case, in detail:
Distal Xq, showing q27.1
translocation breakpoint
SLC6A8
Probe order only
An abnormal case: male with 16 Mb duplication,
due to unbalanced t(X;3)(q27.1;q29) translocation.
DLR = 0.20
Agilent probe length vs GC content: Xq28 region
[Plot labels: SLC6A8, MECP2]
Agilent 60K targeted array: probe length in bases [blue] = 60 bases, usually
And probe GC content % [red]. BOX: detail of MECP2 probes
Grey areas = proportion of probes < 60 bases long.
Agilent arrays:
Probe length compensates for
GC content.
BMC Genomics 2009, 10:588 doi:10.1186/1471-2164-10-588
For Agilent arrays, probes are reduced in length [from 60 to 45 bases]
as mean GC level increases, as the whole chromosome data shows.
[Generally, high GC chromosomes are: gene rich, more lethal as trisomies,
and generally early replicating.]
[Plot: proportion of probes < 60 bases (Y axis, 0 to 0.35) against autosome number (X axis)]
“Agilent CGH Analytics software [used] the aberration detection method 1 (ADM1 or “adam-one”) algorithm.
ADM1 is an aberration algorithm that identifies all aberrant intervals in a given sample with consistently
high or low log ratios based on the statistical score. The ADM algorithms search for intervals in which the
average log ratio of the sample and reference channels exceeds a user-specified threshold. [usually +/- ~0.20]
The ADM1 statistical score is computed as the average normalized log ratios of all probes
in the genomic interval, multiplied by the square root of the number of these probes.
It represents the deviation of the average of the normalized log ratios from its expected value of zero.
The ADM1 score is proportional to the height h (absolute average log ratio) of the genomic interval,
and to the square root of the number of probes in the interval.
Roughly, for an interval to have a high ADM1 score, it should have high height and/or include a large number of probes.”
Data from our lab:
Duplication +ve call for the ARSA gene in 22q:
Mean “height” h = ~0.5, with 18 probes.
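The quoted description of the ADM1 score (average log ratio times the square root of the number of probes) can be sketched in a few lines. This is only an illustration of that sentence, not Agilent's implementation: the function name is hypothetical, and the normalisation of the log ratios, which the quote mentions but does not spell out, is assumed to have been done already.

```python
import math


def adm1_like_score(log_ratios):
    """Mean log ratio of an interval multiplied by sqrt(number of probes).

    Follows the wording of the quoted description only. The inputs are
    assumed to be already-normalised log ratios (an assumption; the
    normalisation step is not described in the quote).
    """
    n = len(log_ratios)
    if n == 0:
        raise ValueError("interval contains no probes")
    mean_height = sum(log_ratios) / n
    return mean_height * math.sqrt(n)


# The ARSA example above: mean height ~0.5 across 18 probes.
# Idealised here as 18 identical values.
print(round(adm1_like_score([0.5] * 18), 2))
```

Because the score grows with the square root of the probe count, a modest height spread over many probes can score as highly as a large height over a few probes.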
Regarding algorithms:
OED:
“a process or set of rules used for calculation or problem-solving,
especially with a computer”
[from “al-Khwarizmi”: 9th century Persian mathematician].
Modification of algorithm ADM1 to include weightings for probe reliability: ADM2
http://users.isr.ist.utl.pt/~jmrs/research/TopicsLinks/Genomics/gensips2005/papers/Gensips2005_SPSAppraoch140.pdf.
.....etc!
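To make the idea of per-probe weighting concrete, here is a minimal sketch of one generic way to weight an interval score. It is NOT the published ADM2 formula, which is not given on these slides. The weighted statistic used (sum of weight times log ratio, divided by the square root of the sum of squared weights) is a standard weighted form chosen for illustration, and all names and numbers are hypothetical.

```python
import math


def weighted_interval_score(log_ratios, weights):
    """Illustrative weighted interval score (an assumption, not ADM2 itself).

    score = sum(w_i * x_i) / sqrt(sum(w_i ** 2))
    With all weights equal to 1 this reduces to mean * sqrt(n),
    i.e. the ADM1-style score described in the quote above.
    """
    if len(log_ratios) != len(weights) or not log_ratios:
        raise ValueError("need equal-length, non-empty inputs")
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("all weights are zero")
    return sum(w * x for w, x in zip(weights, log_ratios)) / norm


# Hypothetical interval: 18 probes at height 0.5.
ratios = [0.5] * 18
equal = weighted_interval_score(ratios, [1.0] * 18)
# Same interval, but half the probes treated as unreliable (weight 0.2).
down = weighted_interval_score(ratios, [1.0] * 9 + [0.2] * 9)
print(round(equal, 2), round(down, 2))
```

In this sketch, down-weighting unreliable probes lowers the score of an interval that depends on them, which is the behaviour one would want for GC-rich probes that tend to read high.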
Breast cancer sample:
Chromosome 17.
Other examples: ADM2 algorithm’s weightings are GC-related:
NB: the ARSA “duplication” was a false positive!
[Plot: distal 22q13.33 including ARSA & SHANK3. Agilent "Probescores", which are the probe “weightings” to correct for high GC effects. X axis: sequential probes 1 to 54; Y axis: probescore, 0 to 1.2. Labelled: SHANK3, ARSA.]
[Plot: HIRA & TBX1 sequential probes in the VCFS region. Plot shows Agilent ADM2 "Probescores" across the region. X axis: sequential probes 1 to 130; Y axis: probescore, 0 to 1.2. Labelled: HIRA, TBX1.]
TBX1 is badly behaved.....
...and TBX1 is GC-rich
Data for a patient with the standard VCFS deletion
[red arrow]. But note TBX1 appears normal!
TBX1 region goes here
[Plot: Xq28 region, from 149 to 153 Mb: Agilent "Probescores" for the algorithm “ADM2”, which aim to correct for GC “bias” in certain array probes. X axis: sequential probes 1 to 301; Y axis: probescore, 0 to 1.2. Labelled: MECP2, L1CAM, SLC6A8.]
Suggestion of similar GC effects with other genomic technologies:
Fan et al (2008) also observe a similar problem when measuring copy
number (trisomy) by high-throughput sequencing.
They state:
“We observed that certain chromosomes have large
variations in the counts of sequenced fragments from sample to sample,
and that this depends strongly on the GC content.
It is unclear at this point whether this stems from PCR artifacts
during sequencing, library preparation, or cluster generation or the
sequencing process itself, or whether it is a true biological effect relating
to chromatin structure.”
They go on to say that, fortunately for prenatal diagnosis,
chromosomes 13, 18 and 21 are very little affected by this problem!!!
Fan HC, et al: “Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing
DNA from maternal blood”
PNAS October 21, 2008, 105, no. 42: 16266-16271.
And a related explanation:
Vanneste et al (2009) consider these problems in the context of
microarrays. They regard the cause as more likely
“…nontechnical, as deduced from the minor standard deviations
between the duplicate spots” for each probe on the array.
They suggest the GC-related problem is “more likely to be biological.
For instance, determination of copy numbers in cells during S phase
will inevitably lead to more scatter of intensity ratios detected on
consecutive BAC clones.”
Vanneste E, et al (2009), Nature Medicine 15, 577-583
A whole new can of worms!!!! – but also GC-related
11q arm GC %ge
Eg NDUFV1
q arm of chromosome 11:
Red dots = early replicating regions
Blue dots = late
[Plot: array log ratios for sequential probes 1 to 81 around NDUFV1; Y axis from -0.3 to 0.8; DLR = 0.16]
Eg NDUFV1 – a targeted gene in an early replicating, and GC-rich region of 11q.
Appears ?duplicated on this reasonable quality array data
DGV database: http://projects.tcag.ca/cgibin/variation/gbrowse/hg18
Recalling Vanneste et al:
....the GC-related problem is “more likely to be biological...”
[than technical].
“For instance, determination of copy numbers in cells
during S phase will inevitably lead to more scatter of
intensity ratios...”, because the early replicating regions will
be over-represented in the DNA sample.
For some DNA sources, we suggest this effect can cause
serious problems with array QC values, often resulting in data
rejection,
or worse: false positives [duplications, usually]
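The over-representation argument can be put into numbers with a deliberately crude model. The sketch below assumes that a fraction f of the harvested cells are caught in S phase at a point where early replicating loci have already doubled (4 copies) while late replicating loci have not (2 copies), and that all other cells carry 2 copies everywhere. The model, the function name and the example fractions are all assumptions for illustration; real S phase is gradual, and array normalisation is ignored.

```python
import math


def early_vs_late_log2_ratio(s_phase_fraction: float) -> float:
    """Apparent log2 ratio of an early replicating locus relative to a late one.

    Crude two-state model (an assumption): cells in S phase have 4 copies
    of early loci and 2 copies of late loci; all other cells have 2 copies
    of both. Mean early copy number is then 2 + 2f, against 2 for late loci.
    """
    f = s_phase_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return math.log2((2 + 2 * f) / 2)


# Hypothetical S-phase fractions.
for f in (0.0, 0.05, 0.15, 0.30):
    print(f, round(early_vs_late_log2_ratio(f), 3))
```

Under these assumptions, a sample with around 15% of cells in this state would already shift early replicating loci by roughly 0.2, which is of the same order as the usual calling threshold quoted earlier.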
What DNA samples are more likely to suffer the S-phase cycling problem?
1) Non-blood [eg fibroblast] cells being cultured in vitro at time of DNA harvest
2) Neonatal blood lymphocytes
3) Rapidly cycling malignant cells
These do tend to give poor quality array data, with dups at GC-rich loci.
T cells
B cells
All good old Cytogeneticists know about late vs early replication!!
Circled are both copies
of autosome 19:
Dark = early replicating
R-banding:
Pale = late replicating,
partly deleted copy of X
Conclusions:
When interpreting microarray data…
Beware of artefacts both technical & biological!
Know your genome, and its peculiarities!!
Acknowledgments:
Thanks to all at Cytogenetics CHW, who contribute to this work.
…to Prof Ian Dawes and his colleagues, at
The Ramaciotti Centre, Dept Biochem and Molec Biol,
University of NSW, Sydney, who helped us get into microarrays.
And to Dan Belluocio [Agilent] for probe GC data.
[Diagram: % of replicons in S at time t, across an 8 hour S phase, from early to late replication, up to mitosis/harvest. X axis: time]
R-banding: addition of the thymidine analogue BrdU,
in mid S phase (at ) inhibits Giemsa staining of late replicating regions.
These include the inactive X, and all
G-positive bands. The latter are gene- and GC-poor.
Works better if cell cycles are synchronised.
An old cytogenetic technique.
[R-banding figure. Labels: 1111319; mostly late; mostly early; normal X; deleted X. Legend: early, late]
R-banding courtesy of
Mr John Kemp
FHGSA [retired]
Blue = marginal array data for the MECP2 region of Xq28 [DLR = 0.24]
Red = Agilent probe GC % contents, for probes in this region, on the
60K targeted array used in this experiment
SLC6A8
MECP2
Array data for a [different] normal subject
4Kb BlueGnome CytoChip BAC array data for chromosome 11
Blue points = poor quality array data [log ratios]
Red circles = relative GC content of each BAC probe used [rescaled]
11pter
cen
11qter
Equivalent dye-swap log ratio data for chromosome 11,
in an oligomer array test [Agilent 44K].
The two dye swaps are here plotted as reciprocal ratios (black and tan traces
respectively). This case suffers bad “GC” effect, and did fail QC [DLR > .25],
although only just.
qter
pter
Chromosome 11
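The DLR values quoted throughout are a measure of probe-to-probe noise. A minimal sketch of one common way to compute such a "derivative log ratio spread" follows. It assumes the usual description of the metric (a robust spread of the differences between consecutive probes, from the interquartile range, rescaled to a per-probe standard deviation); the exact Agilent implementation may differ, and the function name and simulated data are hypothetical.

```python
import math
import random
import statistics


def dlr_spread(log_ratios):
    """Robust spread of differences between consecutive log ratios.

    Assumed definition: IQR of adjacent-probe differences, divided by
    1.349 (IQR to SD for a normal distribution) and by sqrt(2) (a
    difference of two probes carries twice the single-probe variance).
    """
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    if len(diffs) < 4:
        raise ValueError("need at least 5 probes")
    q1, _median, q3 = statistics.quantiles(diffs, n=4)
    return (q3 - q1) / 1.349 / math.sqrt(2)


# Simulated trace: pure Gaussian noise with per-probe SD 0.2 (made-up data).
rng = random.Random(0)
trace = [rng.gauss(0.0, 0.2) for _ in range(5000)]
print(round(dlr_spread(trace), 3))
```

For pure Gaussian noise this estimate approximates the per-probe standard deviation, which is why it works as a QC figure. Because it uses differences between neighbouring probes, it is largely insensitive to long true copy number segments, but a wave-like GC effect along the chromosome can still inflate it.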
X axis: % GC for CytoChip BAC probes [rescaled]
Y axis: BAC array data [log ratios]
Correlation = 0.41.
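A correlation like the one quoted above can be computed directly from paired probe values. The sketch below is a plain Pearson correlation. The slide does not say which correlation coefficient was used, so Pearson is an assumption, and the GC and log ratio values shown are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the inputs has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Made-up probe values: GC fraction of each probe and its log ratio.
probe_gc = [0.36, 0.41, 0.39, 0.48, 0.52, 0.44, 0.55, 0.38]
probe_log_ratio = [-0.05, 0.02, -0.10, 0.08, 0.15, -0.02, 0.12, 0.06]
print(round(pearson_r(probe_gc, probe_log_ratio), 2))
```

A clearly positive value on a supposedly normal chromosome suggests that GC content, rather than copy number, is driving part of the signal.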
Data analysis in CGH microarray work relies on algorithms, based on
statistical theory, empirical considerations, and underlying biology
of the genome.
Such algorithms are devised to minimise errors in
interpreting the data, ie, failure to find the true, segmental
“deviations from disomy”,
which may be scattered anywhere along the chromosome.
Some errors may arise through:
A) stochastic/statistical sources:
eg 1 “background noise” effects: mainly dependent on quality of sample, or
quality of experiment, such that the standard deviation of
the data points will overwhelm the resolution required.
eg 2 “Bonferroni” effects
– so long as you do enough experiments [eg 100,000], you WILL get
a significant result! [which may be meaningless]
B) genome biology:
eg 1 local peculiarities in composition of the DNA sequence – which affect
probe behaviour [most obviously, GC %ge]
eg 2 cell cycle effects: what %ge of cells are in S phase at time of harvest?
?apoptosis
C) array design “faults”: particularly relevant in “targeted” arrays, in deciding
as to the optimal distribution of probes. Because many thousands
of probes [& loci] must be chosen, errors are likely.
Wikipedia:
The Bonferroni correction is a safeguard against multiple tests of statistical
significance on the same data falsely giving the appearance of significance,
as 1 out of every 20 hypothesis tests [or sets of hypothesis tests]
is expected to be significant at the α = 0.05 level, purely due to chance.
Thus the probability of getting a significant result with n
tests [or n sets of tests] at this “α” level of significance is 1 − 0.95^n
(or 1 − the probability of not getting a significant result, with n tests…).
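The formula in the quote is easy to evaluate. The sketch below computes the chance of at least one chance "significant" result among n independent tests, together with the per-test level a Bonferroni correction would use. The function names are hypothetical, and independence of the tests is the assumption behind the formula.

```python
def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive in n independent tests: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under a Bonferroni correction: alpha / n."""
    return alpha / n_tests


for n in (1, 20, 100):
    print(n, round(familywise_error(n), 3), bonferroni_alpha(n))
```

With 20 tests at the 0.05 level the chance of at least one false positive is already about 64%, which is why uncorrected testing across tens of thousands of probes cannot be taken at face value.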
For arrays: Example:
if there is a 1% chance of finding one +ve for 5 adjacent probes in 50,000
[by chance alone], then for a higher resolution array, with eg 100,000 probes,
the chance of such a false +ve will be > 1%.
Hence the number of false positives may be unacceptable.
In practice, a surrogate Bonf. correction can be applied by increasing
the minimum number of adjacent probes required for a positive call.
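The array example above can be worked through numerically. The sketch below assumes that every position on the array has the same small, independent chance of producing a false call, and that the minimum number of adjacent probes is left unchanged when the array gets denser. Both are simplifying assumptions, and the function name is hypothetical.

```python
def scaled_false_positive_chance(p_ref: float, n_ref: int, n_new: int) -> float:
    """Chance of at least one chance +ve on an array with n_new probes.

    p_ref is the chance of at least one chance +ve on the reference array
    with n_ref probes. Assumes independent, identical per-position chances
    (a simplification), so the 'no false +ve' probability scales as a power.
    """
    if not 0.0 <= p_ref < 1.0 or n_ref <= 0 or n_new <= 0:
        raise ValueError("invalid inputs")
    return 1.0 - (1.0 - p_ref) ** (n_new / n_ref)


# The example above: 1% on a 50,000-probe array, moved to 100,000 probes.
print(round(scaled_false_positive_chance(0.01, 50_000, 100_000), 4))
# A still denser array, for comparison.
print(round(scaled_false_positive_chance(0.01, 50_000, 400_000), 4))
```

Under these assumptions the chance roughly doubles when the probe count doubles, which is the motivation for raising the minimum number of adjacent probes as array density increases.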
In our lab:
5 probes required for 60,000-probe array [recommended number is 4] = 0.25 Mb resolution
7 probes required for 180,000-probe array = 0.16 Mb resolution
10 probes required for 400,000-probe array = 0.06 Mb resolution
These numbers are not as per the textbook calculation.
In part, they are chosen based on observation of numbers of apparently
true +ves per experiment.
This can be calibrated, to some extent, on the incidence of known
common CNVs.
Eg with the 400,000-probe arrays, the average person will have
5 to 15 such CNVs, large enough to score as true positives.
[2009]
*
* and in press, 2011:
Regarding: 16p11.2 microdeletion
Russell Dale, Padraic Grattan-Smith, Victor SC Fung, and Greg B Peters:
Neurology
With acknowledgment to
our colleagues in the
Neurology Dept, CHW
Neurogenetics updated: Extended hunt for autism genes
Nature, 474, Pages: 254–255
Date published: (16 June 2011)
Published online
15 June 2011
Groups led by Michael Wigler at Cold Spring Harbor
Laboratory in New York and Matthew State at Yale
University in New Haven, Connecticut, conducted the
most comprehensive search yet for spontaneous
duplications or deletions of stretches of DNA that may
be associated with autism spectrum disorders.
In analysing the genomes of more than 1,000 people
(some with autism, some unaffected family members),
the teams found at least 130 sites in the genome
where spontaneous duplications or deletions might
contribute to autism risk.
What does the lab’s report include?
Red? or Blue? or something in between?
CNV found by microarray
software’s own algorithm
Xq distal duplication in a male:
Array to confirm dup of MECP2 on MLPA
Duplication: 16.35 Mb from Xq27.1 to Xqter
A case of TRUE Xq duplication! [in unbalanced t(X;3)]
Real duplication in Xq: DLR = 0.20 Threshold = 6.7
SLC6A8
MECP2
t(X;3) case: Xq dup, showing region of MECP2
Real duplication in Xq: DLR = 0.20 Threshold = 2.2
SLC6A8
MECP2
t(X;3) case, showing region of MECP2.
Another “normal” case
A normal case:
Here, the Agilent algorithm ADM2 finds false duplications flanking MECP2,
[arrows] but only if “significance threshold” is reduced to 3.2, from the usual 6
SLC6A8
MECP2
A normal case
And still does, in 2011!
CGH array data from an array user’s perspective:
Same case, with true distal duplication of Xq,
for the region arrowed.
Is this bit
normal?
X chromosome array data, from pter to qter [plots probe order only]
1996: GC contents in Xq28
Red arrow shows approx position of MECP2
60K targeted array: ~1 Mb region flanking MECP2
Base no.
SLC6A8
1 probe per 2.1 Kb
Probes [~290]
44K array: ~1 Mb region flanking MECP2
Base no.
1 probe per 21.7 Kb
Probes [~50]