Bayesian Decision Theory
Chapter 2 (Duda, Hart & Stork)
CS 7616 - Pattern Recognition
Henrik I Christensen
Georgia Tech
Bayesian Decision Theory
• Design classifiers to recommend decisions that minimize some total expected "risk".
  – The simplest risk is the classification error (i.e., costs are equal).
  – Typically, the risk includes the cost associated with different decisions.
Terminology
• State of nature ω (random variable):
  – e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
  – e.g., prior knowledge of how likely it is to get a sea bass or a salmon
• Probability density function p(x) (evidence):
  – e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)
Terminology (cont'd)
• Conditional probability density p(x/ωj) (likelihood):
  – e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
  (e.g., lightness distributions between salmon/sea-bass populations)
Terminology (cont'd)
• Conditional probability P(ωj/x) (posterior):
  – e.g., the probability that the fish belongs to class ωj given measurement x.
Decision Rule Using Prior Probabilities
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

P(error) = P(ω1) if we decide ω2, and P(ω2) if we decide ω1
or P(error) = min[P(ω1), P(ω2)]

• Favours the most likely class.
• This rule will make the same decision every time.
  – i.e., optimum if no other information is available
Decision Rule Using Conditional Probabilities
• Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:

  P(ωj/x) = p(x/ωj) P(ωj) / p(x) = (likelihood × prior) / evidence

  where p(x) = Σj p(x/ωj) P(ωj), summed over j = 1, 2
  (i.e., a scale factor so that the posteriors sum to 1)

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
or
Decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2); otherwise decide ω2
(a small numeric sketch of this rule follows below)
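A minimal Python sketch of this decision rule, assuming hypothetical Gaussian lightness likelihoods; the priors 2/3 and 1/3 are the ones used on the next slide, but the means, variances, and test value are illustrative assumptions, not values from the lecture.

from scipy.stats import norm

# Priors P(w1), P(w2) (as on the next slide); likelihoods are assumed Gaussians.
P = {"sea bass": 2/3, "salmon": 1/3}
likelihood = {"sea bass": norm(loc=6.0, scale=1.0),   # p(x / sea bass), illustrative
              "salmon":   norm(loc=4.0, scale=1.0)}   # p(x / salmon), illustrative

def decide(x):
    # Decide w1 if p(x/w1) P(w1) > p(x/w2) P(w2); otherwise decide w2.
    scores = {w: likelihood[w].pdf(x) * P[w] for w in P}
    return max(scores, key=scores.get)

print(decide(5.2))   # prints the class with the larger (unnormalized) posterior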
Decision Rule Using Conditional pdf (cont'd)
[Figure: class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x) for priors P(ω1) = 2/3 and P(ω2) = 1/3.]
Probability of Error
• The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1
  or
  P(error/x) = min[P(ω1/x), P(ω2/x)]
• What is the average probability of error?
  P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx  (integrals over −∞ to ∞)
• The Bayes rule is optimum, that is, it minimizes the average probability of error!
Where do Probabilities Come From?
• There are two competing answers to this question:
  (1) Relative frequency (objective) approach.
    – Probabilities can only come from experiments.
  (2) Bayesian (subjective) approach.
    – Probabilities may reflect degree of belief and can be based on opinion.
Example (objective approach)
• Classify cars as costing more or less than $50K:
  – Classes: C1 if price > $50K, C2 if price <= $50K
  – Feature: x, the height of a car
• Use Bayes' rule to compute the posterior probabilities:
  P(Ci/x) = p(x/Ci) P(Ci) / p(x)
• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example (cont'd)
• Collect data
  – Ask drivers how much their car cost and measure its height.
• Determine prior probabilities P(C1), P(C2)
  – e.g., 1209 samples: #C1 = 221, #C2 = 988
  P(C1) = 221/1209 = 0.183
  P(C2) = 988/1209 = 0.817
Example (cont'd)
• Determine the class-conditional probabilities (likelihoods) p(x/Ci)
  – Discretize car height into bins and use the normalized histogram
Example (cont'd)
• Calculate the posterior probability for each bin, e.g., for x = 1.0 (a computation sketch follows below):
  P(C1/x=1.0) = p(x=1.0/C1) P(C1) / [p(x=1.0/C1) P(C1) + p(x=1.0/C2) P(C2)]
              = 0.2081 × 0.183 / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438
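A short sketch that plugs the slide's numbers into Bayes' rule for the bin at x = 1.0; the only inputs are the sample counts and histogram values quoted above.

# Priors estimated from the 1209 samples; likelihoods read off the histogram bin.
P_C1, P_C2 = 221/1209, 988/1209          # P(C1) = 0.183, P(C2) = 0.817
p_x_C1, p_x_C2 = 0.2081, 0.0597          # p(x = 1.0 / C1), p(x = 1.0 / C2)

evidence = p_x_C1 * P_C1 + p_x_C2 * P_C2          # p(x = 1.0)
post_C1 = p_x_C1 * P_C1 / evidence                # P(C1 / x = 1.0)
print(round(post_C1, 3))                          # 0.438, matching the slide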
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action).
Terminology
• Features form a vector x ∈ R^d
• A finite set of c categories ω1, ω2, …, ωc
• Bayes rule (i.e., using vector notation):
  P(ωj/x) = p(x/ωj) P(ωj) / p(x)
  where p(x) = Σj p(x/ωj) P(ωj), summed over j = 1, …, c
• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi/ωj)
  – the cost associated with taking action αi when the correct classification category is ωj
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi
• Suppose that the cost associated with taking action αi when ωj is the correct category is λ(αi/ωj)
• The conditional risk (or expected loss) of taking action αi is:
  R(αi/x) = Σj λ(αi/ωj) P(ωj/x), summed over j = 1, …, c
Overall Risk
• Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x; then the overall risk is defined as:
  R = ∫ R(α(x)/x) p(x) dx
• The optimum decision rule is the Bayes rule
Overall Risk (cont'd)
• The Bayes decision rule minimizes R by:
  (i) Computing R(αi/x) for every αi given an x
  (ii) Choosing the action αi with the minimum R(αi/x)
  (a small sketch of these two steps follows below)
• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:
  R* = min R
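A minimal sketch of steps (i) and (ii): compute every conditional risk and pick the action with the smallest one. The posterior vector and loss matrix (including a hypothetical "reject" action) are illustrative assumptions, not values from the lecture.

import numpy as np

posteriors = np.array([0.3, 0.6, 0.1])      # P(w_j / x) for c = 3 categories (illustrative)
# loss[i, j] = lambda(alpha_i / w_j); the last row is a hypothetical "reject" action.
loss = np.array([[0.0, 2.0, 4.0],
                 [1.0, 0.0, 3.0],
                 [2.0, 1.0, 0.0],
                 [0.5, 0.5, 0.5]])

cond_risk = loss @ posteriors               # (i) R(alpha_i / x) for every action
best_action = int(np.argmin(cond_risk))     # (ii) Bayes rule: minimum conditional risk
print(cond_risk, best_action)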
Example: Two-category classification (c = 2)
• Define
  – α1: decide ω1
  – α2: decide ω2
  – λij = λ(αi/ωj)
• The conditional risks are:
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
Example: Two-category classification (cont'd)
• Minimum risk decision rule: decide ω1 if R(α1/x) < R(α2/x)
  or
  decide ω1 if (λ21 − λ11) P(ω1/x) > (λ12 − λ22) P(ω2/x)
  or (i.e., using the likelihood ratio)
  decide ω1 if p(x/ω1) / p(x/ω2) > [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
  (likelihood ratio > threshold)
Special Case: Zero-One Loss Function
• Assign the same loss to all errors:
  λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j
• The conditional risk corresponding to this loss function:
  R(αi/x) = Σj≠i P(ωj/x) = 1 − P(ωi/x)
Special Case: Zero-One Loss Function (cont'd)
• The decision rule becomes: decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
  or (for two categories) decide ω1 if p(x/ω1) / p(x/ω2) > P(ω2) / P(ω1)
• In this case, the overall risk is the average probability of error!
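A tiny sketch checking this reduction: with the zero-one loss, minimizing R(αi/x) = 1 − P(ωi/x) picks the same class as maximizing the posterior. The posterior values are illustrative.

import numpy as np

posteriors = np.array([0.3, 0.6, 0.1])              # P(w_j / x), illustrative
zero_one = 1.0 - np.eye(3)                          # lambda = 0 if i == j, 1 otherwise
risk = zero_one @ posteriors                        # equals 1 - posteriors
assert np.argmin(risk) == np.argmax(posteriors)     # same decision either way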
Example
• Assuming zero-one loss: decide ω1 if p(x/ω1) / p(x/ω2) > θa; otherwise decide ω2, where
  θa = P(ω2) / P(ω1)
• Assuming general loss: decide ω1 if p(x/ω1) / p(x/ω2) > θb; otherwise decide ω2, where
  θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]   (assume λ12 > λ21)
[Figure: the resulting decision regions for thresholds θa and θb.]
Discriminant Functions
• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if:
  gi(x) > gj(x) for all j ≠ i
Discriminants for Bayes Classifier
• Assuming a general loss function:
  gi(x) = −R(αi/x)
• Assuming the zero-one loss function:
  gi(x) = P(ωi/x)
Discriminants for Bayes Classifier (cont'd)
• Is the choice of gi unique?
  – Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.
• Equivalent discriminants for the zero-one loss:
  gi(x) = P(ωi/x)
  gi(x) = p(x/ωi) P(ωi) / p(x)
  gi(x) = p(x/ωi) P(ωi)
  gi(x) = ln p(x/ωi) + ln P(ωi)   ← we'll use this form extensively!
Case of two categories
• More common to use a single discriminant function (dichotomizer) instead of two; decide ω1 if g(x) > 0, otherwise decide ω2.
• Examples:
  g(x) = P(ω1/x) − P(ω2/x)
  g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]
Decision Regions and Boundaries
• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• A decision boundary is defined by: g1(x) = g2(x)
Discriminant Function for Multivariate Gaussian Density
• Suppose p(x/ωi) ~ N(µi, Σi), i.e.,
  p(x/ωi) = 1 / [(2π)^(d/2) |Σi|^(1/2)] exp[ −(1/2)(x − µi)^T Σi^(−1) (x − µi) ]
• Consider the following discriminant function (a small numeric sketch follows below):
  gi(x) = ln p(x/ωi) + ln P(ωi)
        = −(1/2)(x − µi)^T Σi^(−1)(x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
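A sketch of this discriminant using SciPy's multivariate normal log-density; the means, covariances, priors, and test point are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.4, 0.6]

def g(i, x):
    # g_i(x) = ln p(x / w_i) + ln P(w_i)
    return multivariate_normal.logpdf(x, mean=means[i], cov=covs[i]) + np.log(priors[i])

x = np.array([1.0, 2.0])
print(max(range(2), key=lambda i: g(i, x)))   # index of the class with the largest g_i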
Multivariate Gaussian Density: Case I
• Σi = σ²I
  – Features are statistically independent
  – Each feature has the same variance
• The discriminant simplifies to gi(x) = −||x − µi||² / (2σ²) + ln P(ωi); the ln P(ωi) term favours the a-priori more likely category.
Multivariate Gaussian Density: Case I (cont'd)
• Expanding the quadratic and dropping terms common to all classes gives a linear discriminant gi(x) = wi^T x + wi0, with
  wi = µi / σ²  and  wi0 = −µi^T µi / (2σ²) + ln P(ωi)
• The decision boundary gi(x) = gj(x) is the hyperplane w^T (x − x0) = 0, with
  w = µi − µj  and  x0 = (1/2)(µi + µj) − [σ² / ||µi − µj||²] ln[P(ωi)/P(ωj)] (µi − µj)
Multivariate Gaussian Density: Case I (cont'd)
• Properties of the decision boundary:
  – It passes through x0.
  – It is orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)? Then x0 lies at the midpoint between the means.
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely category.
  – If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
Multivariate Gaussian Density: Case I (cont'd)
[Figures: examples illustrating how x0 shifts away from the more likely category when P(ωi) ≠ P(ωj).]
Multivariate Gaussian Density: Case I (cont'd)
• Minimum distance classifier
  – When the P(ωi) are equal:
    gi(x) = −||x − µi||²
    i.e., assign x to the class with the maximum gi(x), the class of the nearest mean.
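A minimal sketch of the minimum-distance classifier; the class means and the test point are illustrative.

import numpy as np

means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])   # one mean per class (illustrative)

def classify(x):
    d2 = np.sum((means - x) ** 2, axis=1)   # squared Euclidean distance to each mean
    return int(np.argmin(d2))               # maximizing g_i(x) = -||x - mu_i||^2

print(classify(np.array([1.0, 3.0])))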
Multivariate Gaussian Density: Case II
• Σi = Σ (all classes share the same covariance matrix)

Multivariate Gaussian Density: Case II (cont'd)
• The discriminant is again linear: gi(x) = wi^T x + wi0, with
  wi = Σ^(−1) µi  and  wi0 = −(1/2) µi^T Σ^(−1) µi + ln P(ωi)
• The decision boundary gi(x) = gj(x) is the hyperplane w^T (x − x0) = 0, with
  w = Σ^(−1)(µi − µj)  and
  x0 = (1/2)(µi + µj) − [ln(P(ωi)/P(ωj)) / ((µi − µj)^T Σ^(−1)(µi − µj))] (µi − µj)
Multivariate Gaussian Density: Case II (cont'd)
• Properties of the hyperplane (decision boundary):
  – It passes through x0.
  – It is not orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)? Then x0 lies at the midpoint between the means.
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely category.
Multivariate Gaussian Density: Case II (cont'd)
[Figures: examples illustrating how x0 shifts away from the more likely category when P(ωi) ≠ P(ωj).]
Multivariate Gaussian Density: Case II (cont'd)
• Mahalanobis distance classifier
  – When the P(ωi) are equal:
    gi(x) = −(x − µi)^T Σ^(−1)(x − µi)
    i.e., assign x to the class with the maximum gi(x), the class with the smallest Mahalanobis distance.
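A minimal sketch of the Mahalanobis-distance classifier; the means, the shared covariance, and the test point are illustrative.

import numpy as np

means = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])               # shared covariance (illustrative)
Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    diffs = means - x
    d2 = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)   # (x - mu_i)^T Sigma^-1 (x - mu_i)
    return int(np.argmin(d2))

print(classify(np.array([1.0, 2.0])))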
Multivariate Gaussian Density: Case III
• Σi = arbitrary
  – The discriminant is quadratic in x, and the decision boundaries are hyperquadrics; e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
Example - Case III
[Figure: decision boundary for P(ω1) = P(ω2); the boundary does not pass through the midpoint of µ1 and µ2.]
Multivariate Gaussian Density: Case III (cont'd)
[Figure: non-linear decision boundaries.]
Multivariate Gaussian Density: Case III (cont'd)
• More examples
Error Bounds
• Exact error calculations can be difficult; it is easier to estimate error bounds!
• Starting point: P(error/x) = min[P(ω1/x), P(ω2/x)], so
  P(error) = ∫ min[p(x/ω1)P(ω1), p(x/ω2)P(ω2)] dx
• Bound the integrand using min(a, b) ≤ a^β b^(1−β) for 0 ≤ β ≤ 1.
Error Bounds (cont'd)
• If the class-conditional distributions are Gaussian, then
  P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−κ(β))
  where:
  κ(β) = [β(1−β)/2] (µ2 − µ1)^T [βΣ1 + (1−β)Σ2]^(−1) (µ2 − µ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )
Error Bounds (cont'd)
• The Chernoff bound corresponds to the β that minimizes e^(−κ(β)).
  – This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.
[Figure: e^(−κ(β)) as a function of β; the bound is loose near the endpoints and tight at the minimizing β.]
Error Bounds (cont'd)
• Bhattacharyya bound
  – Approximate the error bound using β = 0.5
  – Easier to compute than the Chernoff bound, but looser.
• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.
Example
• Bhattacharyya error bound: κ(0.5) = 4.06, which gives
  P(error) ≤ 0.0087
  (a computation sketch follows below)
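A sketch of the Bhattacharyya bound (the Chernoff bound evaluated at β = 0.5) for two Gaussian class-conditional densities; the parameters below are illustrative and are not the ones behind the slide's κ(0.5) = 4.06.

import numpy as np

def bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2):
    # kappa(0.5) for two Gaussians, then the bound sqrt(P1 P2) exp(-kappa).
    S = (S1 + S2) / 2.0
    diff = mu2 - mu1
    k = (diff @ np.linalg.solve(S, diff)) / 8.0 \
        + 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return np.sqrt(P1 * P2) * np.exp(-k)      # upper bound on P(error)

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])
print(bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5))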
Receiver Operating Characteristic (ROC) Curve
• Every classifier employs some kind of threshold, e.g.
  θa = P(ω2) / P(ω1)   or   θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
• Changing the threshold affects the performance of the system.
• ROC curves can help us evaluate system performance for different thresholds.
Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes):
  – Authentic (A) and Impostor (I)
[Figure: score distributions for I and A.]
Example: Person Authentication (cont'd)
• Possible decisions (a threshold-sweep sketch follows below):
  – (1) correct acceptance (true positive): x belongs to A, and we decide A
  – (2) incorrect acceptance (false positive): x belongs to I, and we decide A
  – (3) correct rejection (true negative): x belongs to I, and we decide I
  – (4) incorrect rejection (false negative): x belongs to A, and we decide I
[Figure: the I and A distributions with the four outcome regions marked.]
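A sketch of how the ROC curve can be traced by sweeping the decision threshold on a scalar score; the Gaussian score distributions assumed for A and I are illustrative.

import numpy as np
from scipy.stats import norm

A = norm(loc=2.0, scale=1.0)     # authentic score distribution (illustrative)
I = norm(loc=0.0, scale=1.0)     # impostor score distribution (illustrative)

thresholds = np.linspace(-3.0, 5.0, 100)
false_pos = I.sf(thresholds)     # impostor accepted: score above threshold
false_neg = A.cdf(thresholds)    # authentic rejected: score below threshold
# Plotting false_neg against false_pos over all thresholds traces the ROC curve.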
Error vs Threshold
ROC: False Negatives vs False Positives
[Figures: error rates as a function of the threshold, and the resulting ROC curve.]
Next Lecture
• Linear Classification Methods
  – Hastie et al., Chapter 4
• Paper list will be available by the weekend
  – Bidding to start on Monday
Bayes Decision Theory: Case of Discrete Features
• Replace the integral ∫ p(x/ωj) dx with the sum Σx P(x/ωj)
• See section 2.9
Missing Features
• Consider a Bayes classifier trained on uncorrupted data.
• Suppose x = (x1, x2) is a test vector where x1 is missing and the value of x2 is x̂2; how can we classify it?
  – If we set x1 equal to its average value, we will classify x as ω3.
  – But p(x̂2/ω2) is larger; maybe we should classify x as ω2?
Missing Features (cont'd)
• Suppose x = [xg, xb] (xg: good features, xb: bad/missing features)
• Derive the Bayes rule using the good features only, i.e., marginalize the posterior probability over the bad features (a numeric sketch follows below):
  P(ωi/xg) = ∫ p(ωi, xg, xb) dxb / p(xg) = ∫ P(ωi/xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb
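A sketch of this marginalization for two 2-D Gaussian classes, integrating each class-conditional density numerically over the missing feature xb and then applying Bayes' rule to the good feature alone; all parameters are illustrative.

import numpy as np
from scipy.stats import multivariate_normal

means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

x_g = 1.4                                   # observed good feature x1 (illustrative)
xb_grid = np.linspace(-6.0, 8.0, 2001)      # numerical grid over the missing x2
dx = xb_grid[1] - xb_grid[0]

scores = []
for mu, S, P in zip(means, covs, priors):
    pts = np.column_stack([np.full_like(xb_grid, x_g), xb_grid])
    p_xg = np.sum(multivariate_normal.pdf(pts, mean=mu, cov=S)) * dx   # p(x_g / w_i)
    scores.append(p_xg * P)

posteriors = np.array(scores) / np.sum(scores)   # P(w_i / x_g)
print(posteriors)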
Compound Bayesian Decision Theory
• Sequential decision
  (1) Decide as each fish emerges.
• Compound decision
  (1) Wait for n fish to emerge.
  (2) Make all n decisions jointly.
  – Could improve performance when consecutive states of nature are not statistically independent.
Compound Bayesian Decision Theory (cont'd)
• Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories).
• Suppose P(Ω) is the prior probability of the n states of nature.
• Suppose X = (x1, x2, …, xn) are n observed vectors.
Compound Bayesian Decision Theory (cont'd)
• The compound posterior is P(Ω/X) ∝ p(X/Ω) P(Ω).
• It is usually acceptable to assume that the observations are conditionally independent given the states, p(X/Ω) = Πi p(xi/ω(i)); i.e., consecutive states of nature may not be statistically independent, but each observation depends only on its own state!