Bayesian Decision Theory
Chapter 2 (Duda, Hart & Stork)
CS 7616 - Pattern Recognition
Henrik I. Christensen, Georgia Tech

Bayesian Decision Theory
• Design classifiers to recommend decisions that minimize some total expected "risk".
– The simplest risk is the classification error (i.e., all costs are equal).
– Typically, the risk includes the cost associated with different decisions.

Terminology
• State of nature ω (random variable):
– e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
– e.g., prior knowledge of how likely it is to get a sea bass or a salmon
• Probability density function p(x) (evidence):
– e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

Terminology (cont'd)
• Conditional probability density p(x/ωj) (likelihood):
– e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
(Figure: lightness distributions of the salmon and sea-bass populations.)

Terminology (cont'd)
• Conditional probability P(ωj/x) (posterior):
– e.g., the probability that the fish belongs to class ωj given measurement x.

Decision Rule Using Prior Probabilities
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
• P(error) = P(ω1) if we decide ω2, and P(ω2) if we decide ω1; that is,
  P(error) = min[P(ω1), P(ω2)]
• Favours the most likely class.
• This rule makes the same decision every time.
– i.e., it is optimum only if no other information is available.

Decision Rule Using Conditional Probabilities
• Using Bayes' rule, the posterior probability of category ωj given measurement x is:
  P(ωj/x) = p(x/ωj) P(ωj) / p(x) = (likelihood × prior) / evidence
  where p(x) = Σ_{j=1}^{2} p(x/ωj) P(ωj) (i.e., a scale factor that makes the posteriors sum to 1).
• Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2,
  or equivalently: decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2); otherwise decide ω2.

Decision Rule Using Conditional pdfs (cont'd)
(Figure: class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x) for priors P(ω1) = 2/3 and P(ω2) = 1/3.)

Probability of Error
• The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1,
  or P(error/x) = min[P(ω1/x), P(ω2/x)]
• What is the average probability of error?
  P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx (integral from −∞ to ∞)
• The Bayes rule is optimum, that is, it minimizes the average probability of error!

Where do Probabilities Come From?
• There are two competing answers to this question:
(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
– Probabilities may reflect degrees of belief and can be based on opinion.

Example (objective approach)
• Classify cars according to whether they cost more or less than $50K:
– Classes: C1 if price > $50K, C2 if price <= $50K
– Feature: x, the height of a car
• Use Bayes' rule to compute the posterior probabilities:
  P(Ci/x) = p(x/Ci) P(Ci) / p(x)
• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2).

Example (cont'd)
• Collect data
– Ask drivers how much their car cost and measure its height.
• Determine the prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1 = 221, #C2 = 988
  P(C1) = 221/1209 = 0.183
  P(C2) = 988/1209 = 0.817

Example (cont'd)
• Determine the class-conditional probabilities p(x/Ci) (likelihood)
– Discretize car height into bins and use a normalized histogram.
(Figure: normalized histograms of car height for each class, giving p(x/Ci).)

Example (cont'd)
• Calculate the posterior probability for each bin, e.g., for x = 1.0:
  P(C1/x = 1.0) = p(x = 1.0/C1) P(C1) / [p(x = 1.0/C1) P(C1) + p(x = 1.0/C2) P(C2)]
               = 0.2081 × 0.183 / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438
(Figure: posteriors P(Ci/x) as a function of car height.)
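The car-height example can be reproduced in a few lines of code. Below is a minimal sketch: only the priors (221/1209 and 988/1209) and the likelihoods for the x = 1.0 bin (0.2081 and 0.0597) come from the slides; the remaining bin values are hypothetical, invented just to show the loop over bins.

```python
# Minimal sketch of the car-height example. Only the priors and the x = 1.0
# bin likelihoods come from the slides; the other bin values are hypothetical.

# Priors estimated from the 1209 collected samples
P_C1 = 221 / 1209    # price > $50K
P_C2 = 988 / 1209    # price <= $50K

# Class-conditional likelihoods p(x / Ci) from normalized height histograms
# (values for bins other than 1.0 are invented for illustration)
p_x_C1 = {1.0: 0.2081, 1.2: 0.30, 1.4: 0.30, 1.6: 0.19}
p_x_C2 = {1.0: 0.0597, 1.2: 0.25, 1.4: 0.40, 1.6: 0.29}

for x in sorted(p_x_C1):
    # Evidence p(x) = sum_j p(x / Cj) P(Cj): the scale factor in Bayes' rule
    p_x = p_x_C1[x] * P_C1 + p_x_C2[x] * P_C2
    # Posteriors via Bayes' rule
    post_C1 = p_x_C1[x] * P_C1 / p_x
    post_C2 = p_x_C2[x] * P_C2 / p_x
    decision = "C1 (> $50K)" if post_C1 > post_C2 else "C2 (<= $50K)"
    print(f"x = {x:.1f}: P(C1/x) = {post_C1:.3f}, P(C2/x) = {post_C2:.3f} -> {decision}")
```

For the x = 1.0 bin this reproduces the posterior P(C1/x = 1.0) ≈ 0.438 computed on the slide.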
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action).

Terminology
• Features form a vector x ∈ R^d.
• A finite set of c categories ω1, ω2, …, ωc.
• Bayes rule (i.e., using vector notation):
  P(ωj/x) = p(x/ωj) P(ωj) / p(x), where p(x) = Σ_{j=1}^{c} p(x/ωj) P(ωj)
• A finite set of l actions α1, α2, …, αl.
• A loss function λ(αi/ωj):
– the cost associated with taking action αi when the correct classification category is ωj.

Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi.
• Suppose that the cost associated with taking action αi when ωj is the correct category is λ(αi/ωj).
• The conditional risk (or expected loss) of taking action αi is:
  R(αi/x) = Σ_{j=1}^{c} λ(αi/ωj) P(ωj/x)

Overall Risk
• Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x; then the overall risk is defined as:
  R = ∫ R(α(x)/x) p(x) dx
• The optimum decision rule is the Bayes rule.

Overall Risk (cont'd)
• The Bayes decision rule minimizes R by:
(i) computing R(αi/x) for every αi given an x, and
(ii) choosing the action αi with the minimum R(αi/x).
• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:
  R* = min R

Example: Two-category classification
• Define
– α1: decide ω1 (c = 2)
– α2: decide ω2
– λij = λ(αi/ωj)
• The conditional risks are:
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)

Example: Two-category classification (cont'd)
• Minimum risk decision rule: decide ω1 if R(α1/x) < R(α2/x),
  or (λ21 − λ11) P(ω1/x) > (λ12 − λ22) P(ω2/x),
  or (i.e., using the likelihood ratio):
  p(x/ω1) / p(x/ω2) > (λ12 − λ22) P(ω2) / [(λ21 − λ11) P(ω1)]
  (likelihood ratio > threshold)

Special Case: Zero-One Loss Function
• Assign the same loss to all errors:
  λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j
• The conditional risk corresponding to this loss function is:
  R(αi/x) = Σ_{j≠i} P(ωj/x) = 1 − P(ωi/x)

Special Case: Zero-One Loss Function (cont'd)
• The decision rule becomes: decide ω1 if P(ω1/x) > P(ω2/x),
  or decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2),
  or decide ω1 if p(x/ω1) / p(x/ω2) > P(ω2) / P(ω1).
• In this case, the overall risk is the average probability of error!

Example
• Assuming zero-one loss: decide ω1 if p(x/ω1) / p(x/ω2) > θa, otherwise decide ω2, where θa = P(ω2) / P(ω1).
• Assuming a general loss: decide ω1 if p(x/ω1) / p(x/ω2) > θb, otherwise decide ω2, where
  θb = P(ω2)(λ12 − λ22) / [P(ω1)(λ21 − λ11)]; assume λ12 > λ21.
(Figure: the decision regions obtained with the two thresholds θa and θb.)

Discriminant Functions
• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, ..., c, where a feature vector x is assigned to class ωi if:
  gi(x) > gj(x) for all j ≠ i

Discriminants for Bayes Classifier
• Assuming a general loss function: gi(x) = −R(αi/x)
• Assuming the zero-one loss function: gi(x) = P(ωi/x)

Discriminants for Bayes Classifier (cont'd)
• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.
  gi(x) = P(ωi/x)
  gi(x) = p(x/ωi) P(ωi) / p(x)
  gi(x) = p(x/ωi) P(ωi)
  gi(x) = ln p(x/ωi) + ln P(ωi)   ← we'll use this form extensively!

Case of two categories
• It is more common to use a single discriminant function (dichotomizer) instead of two: decide ω1 if g(x) > 0, otherwise decide ω2.
• Examples:
  g(x) = P(ω1/x) − P(ω2/x)
  g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]

Decision Regions and Boundaries
• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• The decision boundary between two regions is defined by g1(x) = g2(x).

Discriminant Function for the Multivariate Gaussian Density N(µ, Σ)
• Consider the following discriminant function:
  gi(x) = ln p(x/ωi) + ln P(ωi)
  where p(x/ωi) is the multivariate Gaussian
  p(x/ωi) = 1 / [(2π)^(d/2) |Σi|^(1/2)] exp(−½ (x − µi)^T Σi^(-1) (x − µi))

Multivariate Gaussian Density: Case I
• Σi = σ²I
– Features are statistically independent.
– Each feature has the same variance.
• Dropping terms common to all classes:
  gi(x) = −||x − µi||² / (2σ²) + ln P(ωi)
– The ln P(ωi) term favours the a-priori more likely category.

Multivariate Gaussian Density: Case I (cont'd)
• Expanding the quadratic term gives a linear discriminant:
  gi(x) = wi^T x + wi0, with wi = µi / σ² and wi0 = −µi^T µi / (2σ²) + ln P(ωi)
• The decision boundary between ωi and ωj is the hyperplane w^T (x − x0) = 0, where
  w = µi − µj and
  x0 = ½ (µi + µj) − [σ² / ||µi − µj||²] ln[P(ωi)/P(ωj)] (µi − µj)
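As a quick check of the Case I expressions above, the sketch below evaluates the linear discriminants, computes the boundary point x0, and verifies numerically that the boundary is orthogonal to µi − µj. The means, variance, and priors are hypothetical numbers chosen for illustration; only the formulas for wi, wi0, and x0 follow the slides.

```python
import numpy as np

# Sketch of the Case I (Σi = σ²I) linear discriminant; the means, σ², and
# priors below are hypothetical -- only the formulas follow the slides.
mu1, mu2 = np.array([2.0, 1.0]), np.array([4.0, 4.0])
sigma2 = 1.5                      # common variance σ²
P1, P2 = 2 / 3, 1 / 3             # priors P(ω1), P(ω2)

def g(x, mu, prior):
    # gi(x) = wi·x + wi0 with wi = µi/σ² and wi0 = -µi·µi/(2σ²) + ln P(ωi)
    w = mu / sigma2
    w0 = -mu @ mu / (2 * sigma2) + np.log(prior)
    return w @ x + w0

# Boundary point x0 = ½(µ1+µ2) - [σ²/||µ1-µ2||²] ln(P1/P2) (µ1-µ2)
d = mu1 - mu2
x0 = 0.5 * (mu1 + mu2) - sigma2 / (d @ d) * np.log(P1 / P2) * d

# x0 lies on the boundary: g1(x0) - g2(x0) should be ~0
print("g1(x0) - g2(x0) =", g(x0, mu1, P1) - g(x0, mu2, P2))

# Stepping in a direction orthogonal to (µ1 - µ2) stays on the boundary,
# confirming the boundary is orthogonal to the line linking the means.
t = np.array([-d[1], d[0]])
print("g1 - g2 along the boundary:", g(x0 + t, mu1, P1) - g(x0 + t, mu2, P2))
```

With these unequal priors, x0 comes out shifted away from the more likely class ω1, which is exactly the property listed on the next slide.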
Multivariate Gaussian Density: Case I (cont'd)
• Properties of the decision boundary:
– It passes through x0.
– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)? (x0 is then the midpoint of the means.)
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
(Figures: examples in which unequal priors P(ωi) ≠ P(ωj) shift x0 away from the most likely category.)

Multivariate Gaussian Density: Case I (cont'd)
• Minimum distance classifier
– When the priors P(ωi) are equal, maximizing gi(x) reduces to maximizing
  gi(x) = −||x − µi||²
  i.e., assign x to the category with the nearest mean (minimum Euclidean distance).

Multivariate Gaussian Density: Case II
• Σi = Σ (all classes share the same covariance matrix):
  gi(x) = −½ (x − µi)^T Σ^(-1) (x − µi) + ln P(ωi)

Multivariate Gaussian Density: Case II (cont'd)
• Expanding the quadratic term again gives a linear discriminant:
  gi(x) = wi^T x + wi0, with wi = Σ^(-1) µi and wi0 = −½ µi^T Σ^(-1) µi + ln P(ωi)
• The decision boundary is the hyperplane w^T (x − x0) = 0, where
  w = Σ^(-1)(µi − µj) and
  x0 = ½ (µi + µj) − ln[P(ωi)/P(ωj)] / [(µi − µj)^T Σ^(-1) (µi − µj)] (µi − µj)

Multivariate Gaussian Density: Case II (cont'd)
• Properties of the hyperplane (decision boundary):
– It passes through x0.
– It is not orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)? (x0 is then the midpoint of the means.)
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
(Figures: examples in which unequal priors P(ωi) ≠ P(ωj) shift x0 away from the most likely category.)

Multivariate Gaussian Density: Case II (cont'd)
• Mahalanobis distance classifier
– When the priors P(ωi) are equal, maximizing gi(x) reduces to maximizing
  gi(x) = −(x − µi)^T Σ^(-1) (x − µi)
  i.e., assign x to the category whose mean is nearest in Mahalanobis distance.

Multivariate Gaussian Density: Case III
• Σi = arbitrary:
  gi(x) = −½ (x − µi)^T Σi^(-1) (x − µi) − ½ ln |Σi| + ln P(ωi)
• The discriminants are quadratic, and the decision boundaries are hyperquadrics; e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.

Example - Case III
(Figure: decision boundary for two Gaussians with P(ω1) = P(ω2); the boundary does not pass through the midpoint of µ1 and µ2.)

Multivariate Gaussian Density: Case III (cont'd)
(Figure: non-linear decision boundaries.)

Multivariate Gaussian Density: Case III (cont'd)
(Figure: more examples of hyperquadric decision boundaries.)

Error Bounds
• Exact error calculations can be difficult; it is easier to estimate error bounds!
• For two categories:
  P(error) = ∫ min[P(ω1/x), P(ω2/x)] p(x) dx
• Using min[a, b] ≤ a^β b^(1−β) (for a, b ≥ 0 and 0 ≤ β ≤ 1) gives the Chernoff bound:
  P(error) ≤ P(ω1)^β P(ω2)^(1−β) ∫ p(x/ω1)^β p(x/ω2)^(1−β) dx

Error Bounds (cont'd)
• If the class-conditional distributions are Gaussian, then
  ∫ p(x/ω1)^β p(x/ω2)^(1−β) dx = e^(−k(β)), where
  k(β) = [β(1−β)/2] (µ2 − µ1)^T [βΣ1 + (1−β)Σ2]^(-1) (µ2 − µ1) + ½ ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )

Error Bounds (cont'd)
• The Chernoff bound corresponds to the β that minimizes e^(−k(β)).
– This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.
(Figure: e^(−k(β)) versus β; the bound is loose near β = 0 and β = 1 and tight at the minimizing β.)

Error Bounds (cont'd)
• Bhattacharyya bound
– Approximate the error bound using β = 0.5.
– Easier to compute than the Chernoff bound, but looser.
• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.

Example
• Bhattacharyya error bound: k(0.5) = 4.06, so P(error) ≤ 0.0087.
(Figure: the two class-conditional Gaussian densities used in this example.)

Receiver Operating Characteristic (ROC) Curve
• Every classifier employs some kind of threshold, e.g.:
  θa = P(ω2) / P(ω1)
  θb = P(ω2)(λ12 − λ22) / [P(ω1)(λ21 − λ11)]
• Changing the threshold affects the performance of the system.
• ROC curves can help us evaluate system performance for different thresholds.

Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes):
– Authentic (A) and Impostor (I)
(Figure: overlapping score distributions for I and A.)

Example: Person Authentication (cont'd)
• Possible decisions:
(1) correct acceptance (true positive): x belongs to A, and we decide A
(2) incorrect acceptance (false positive): x belongs to I, and we decide A
(3) correct rejection (true negative): x belongs to I, and we decide I
(4) incorrect rejection (false negative): x belongs to A, and we decide I
(Figure: the I and A distributions with the decision threshold marking the correct acceptance, correct rejection, false positive, and false negative regions.)

Error vs Threshold
(Figure: false positive and false negative rates as functions of the threshold.)

ROC: False Negatives vs Positives
(Figure: ROC curve plotting false negatives against false positives as the threshold varies.)
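The ROC behaviour above can be illustrated by sweeping the decision threshold over simulated match scores. This is a minimal sketch assuming two hypothetical 1-D Gaussian score distributions for the Impostor (I) and Authentic (A) classes (none of the numbers come from the slides); the false-positive and false-negative rates are estimated by Monte Carlo at each threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D match-score distributions (not from the slides):
# impostors (I) score low, authentic users (A) score high.
scores_I = rng.normal(loc=0.0, scale=1.0, size=100_000)
scores_A = rng.normal(loc=2.0, scale=1.0, size=100_000)

print(" thr   FP-rate   FN-rate")
for thr in np.linspace(-2.0, 4.0, 13):
    # Decide "A" (accept) whenever the score exceeds the threshold.
    fp = np.mean(scores_I > thr)    # impostor accepted  (false positive)
    fn = np.mean(scores_A <= thr)   # authentic rejected (false negative)
    print(f"{thr:5.1f}   {fp:7.3f}   {fn:7.3f}")
```

Plotting the false-positive rate against the false-negative rate (or against 1 − FN) over all thresholds traces out the ROC curve: moving the threshold simply trades one error type against the other.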
Next Lecture
• Linear Classification Methods
– Hastie et al., Chapter 4
• Paper list will be available by the weekend
– Bidding to start on Monday

Bayes Decision Theory: Case of Discrete Features
• Replace ∫ p(x/ωj) dx with Σ_x P(x/ωj).
• See Section 2.9.

Missing Features
• Consider a Bayes classifier built from uncorrupted data.
• Suppose x = (x1, x2) is a test vector where x1 is missing and the value of x2 is x̂2; how can we classify it?
– If we set x1 equal to its average value, we will classify x as ω3.
– But p(x̂2/ω2) is larger; maybe we should classify x as ω2?

Missing Features (cont'd)
• Suppose x = [xg, xb] (xg: good features, xb: bad/missing features).
• Derive the Bayes rule using the good features by marginalizing the posterior probability over the bad features:
  P(ωi/xg) = ∫ p(x/ωi) P(ωi) dxb / ∫ p(x) dxb

Compound Bayesian Decision Theory
• Sequential decision:
(1) Decide as each fish emerges.
• Compound decision:
(1) Wait for n fish to emerge.
(2) Make all n decisions jointly.
– Could improve performance when consecutive states of nature are not statistically independent.

Compound Bayesian Decision Theory (cont'd)
• Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the n states of nature, where ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories).
• Suppose P(Ω) is the prior probability of the n states of nature.
• Suppose X = (x1, x2, …, xn) are the n observed vectors.

Compound Bayesian Decision Theory (cont'd)
• The compound posterior is P(Ω/X) = p(X/Ω) P(Ω) / p(X).
• It is acceptable to assume that the observations are conditionally independent given the states, so that p(X/Ω) = Π_{i=1}^{n} p(xi/ω(i)).
• i.e., consecutive states of nature may not be statistically independent, so P(Ω) itself generally does not factor!
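To make the compound setting concrete, here is a small sketch of a joint decision over n = 3 fish. It assumes hypothetical 1-D Gaussian lightness likelihoods, conditional independence of the measurements given the states (as noted above), and a hypothetical Markov prior P(Ω) that makes consecutive states dependent; all numbers are invented for illustration.

```python
import itertools
import math

# Sketch of a compound decision for n = 3 fish, c = 2 classes
# (1 = sea bass ω1, 2 = salmon ω2). All numbers below are hypothetical.

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# 1-D lightness likelihoods p(x / ωj)
like = {1: lambda x: gauss(x, 3.0, 1.0),    # sea bass
        2: lambda x: gauss(x, 5.0, 1.0)}    # salmon

# Markov prior over state sequences: consecutive fish tend to be of the same
# species (schools), so P(Ω) does NOT factor into independent per-fish priors.
p_first = {1: 0.5, 2: 0.5}
p_trans = {(1, 1): 0.9, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.9}

def prior(states):
    p = p_first[states[0]]
    for a, b in zip(states, states[1:]):
        p *= p_trans[(a, b)]
    return p

X = [3.9, 4.1, 5.2]   # observed lightness values for the 3 fish

# Maximize P(Ω / X) ∝ p(X / Ω) P(Ω), with p(X / Ω) = Π_i p(x_i / ω(i))
best = max(itertools.product([1, 2], repeat=len(X)),
           key=lambda st: prior(st) * math.prod(like[s](x) for s, x in zip(st, X)))
print("jointly most probable state sequence:", best)
```

With these made-up numbers the joint decision labels all three fish as salmon, even though the first measurement taken on its own would favour sea bass, which is exactly the kind of gain the compound formulation is after when consecutive states of nature are dependent.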