Week 1

PATTERN RECOGNITION
Fatoş Tunay Yarman Vural
Textbook: Pattern Recognition and Machine
Learning, C. Bishop
Recommended Book: Pattern Theory, U.
Grenander, M. Miller
Course Requirements
Final: 50%
Project: 50%
Literature survey report: 1 April
Algorithm development: 1 May
Full paper with implementation: 1 June
Content
1. What is a Pattern?
2. Probability Theory
3. Bayesian Paradigm
4. Information Theory
5. Linear Methods
6. Kernel Methods
7. Graph Methods
WHAT IS A PATTERN?
• Structures regulated by rules
• Goal: Represent empirical knowledge in mathematical form
• The mathematics of perception
• Need: Algebra, probability theory, graph
theory
What you perceive is not what you hear:
ACTUAL SOUND                      PERCEIVED WORDS
1. The ?eel is on the shoe        1. The heel is on the shoe
2. The ?eel is on the car         2. The wheel is on the car
3. The ?eel is on the table       3. The meal is on the table
4. The ?eel is on the orange      4. The peel is on the orange
(Warren & Warren, 1970)
Statistical inference is being used!
All flows! (Heraclitus)
• It is only the invariance, the permanent
facts, that enable us to find the meaning in a
world of flux.
• We can only perceive variations
• Our aim is to find the invariant laws of our
varying observations
Pattern Recognition
ASSUMPTION
SOURCE: hypotheses, classes, objects
CHANNEL: noisy
OBSERVATION: multiple sensors, variations
Example
Handwritten Digit Recognition
Polynomial Curve Fitting
Sum-of-Squares Error Function
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error:
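For reference, the polynomial model, the sum-of-squares error, and the RMS error in the standard notation of Bishop, Ch. 1:
y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2
E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^\star) / N \,}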
Polynomial Coefficients
Data Set Size:
9th Order Polynomial
Data Set Size:
9th Order Polynomial
Regularization
Penalize large coefficient values
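The regularized error function that penalizes large coefficient values, in the standard form of Bishop, Ch. 1:
\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2, \qquad \|\mathbf{w}\|^2 = \mathbf{w}^{\mathrm{T}} \mathbf{w}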
Regularization: fits for a range of values of the regularization coefficient λ
E_RMS vs. ln λ
Polynomial Coefficients
Probability Theory
Apples and Oranges
Probability Theory
Marginal Probability
Joint Probability
Conditional Probability
Probability Theory
Sum Rule
Product Rule
The Rules of Probability
Sum Rule
Product Rule
Bayes’ Theorem
posterior ∝ likelihood × prior
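The rules referred to above, in the standard notation of Bishop, Ch. 1:
\text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y)
\text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X)
\text{Bayes' theorem:} \quad p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y)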
Probability Densities
Transformed Densities
Expectations
Conditional Expectation
(discrete)
Approximate Expectation
(discrete and continuous)
Variances and Covariances
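The corresponding definitions (Bishop, Ch. 1), for a function f of a random variable x:
\mathbb{E}[f] = \sum_{x} p(x) f(x) \;\;\text{(discrete)}, \qquad \mathbb{E}[f] = \int p(x) f(x)\, \mathrm{d}x \;\;\text{(continuous)}
\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \quad \text{(approximate expectation from } N \text{ samples)}
\mathrm{var}[f] = \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[ (x - \mathbb{E}[x])(y - \mathbb{E}[y]) \big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]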
The Gaussian Distribution
Gaussian Mean and Variance
The Multivariate Gaussian
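The univariate and multivariate Gaussian densities (Bishop, Ch. 1), with D the dimensionality of x; the univariate case has mean E[x] = μ and variance var[x] = σ²:
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}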
Gaussian Parameter Estimation
Likelihood function
Maximum (Log) Likelihood
Properties of μ_ML and σ²_ML
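Maximizing the log likelihood of N i.i.d. observations gives the familiar estimates, whose bias is the property in question (Bishop, Ch. 1):
\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\, \sigma^2 \quad \text{(the ML variance estimate is biased low)}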
Curve Fitting Re-visited
Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error, E(w).
Predictive Distribution
MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error, Ẽ(w).
Bayesian Curve Fitting
Bayesian Predictive Distribution
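The Bayesian predictive distribution marginalizes over the polynomial coefficients w rather than using a point estimate; in the notation of Bishop, Ch. 1, with training inputs x and targets t:
p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}
For the curve-fitting example this integral is again Gaussian, with an input-dependent mean and variance.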
Model Selection
Cross-Validation
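A minimal sketch of S-fold cross-validation for choosing the polynomial order M in the curve-fitting example, using only NumPy; the helper names (fit_poly, rms_error, cross_validate) and the synthetic sin(2πx) data are illustrative, not from the course material.

    import numpy as np

    def fit_poly(x, t, M, lam=0.0):
        # Least-squares fit of an M-th order polynomial (optionally ridge-regularized).
        Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, ..., x^M
        A = Phi.T @ Phi + lam * np.eye(M + 1)
        return np.linalg.solve(A, Phi.T @ t)         # coefficient vector w

    def rms_error(w, x, t):
        Phi = np.vander(x, len(w), increasing=True)
        return np.sqrt(np.mean((Phi @ w - t) ** 2))

    def cross_validate(x, t, M, S=5, seed=0):
        # Average held-out RMS error over S folds.
        folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), S)
        errs = []
        for s in range(S):
            val = folds[s]
            trn = np.concatenate([folds[r] for r in range(S) if r != s])
            w = fit_poly(x[trn], t[trn], M)
            errs.append(rms_error(w, x[val], t[val]))
        return float(np.mean(errs))

    # Illustrative data: noisy samples of sin(2*pi*x), as in Bishop's running example.
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 30)
    for M in (0, 1, 3, 9):
        print(f"M = {M}: CV RMS error = {cross_validate(x, t, M):.3f}")

Held-out error typically rises again for the over-fitted M = 9 model even though its training error is lowest.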
Curse of Dimensionality
Curse of Dimensionality
Polynomial curve fitting, M = 3
Gaussian Densities in Higher Dimensions
Decision Theory
Inference step
Determine either the joint p(x, t) or the conditional p(t|x).
Decision step
For given x, determine optimal t.
Minimum Misclassification Rate
Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’
Loss matrix (rows: truth; columns: decision)
Minimum Expected Loss
Decision regions R_j are chosen to minimize the expected loss.
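In symbols (Bishop, Ch. 1), with loss matrix elements L_kj, the expected loss over the decision regions R_j is
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, \mathrm{d}\mathbf{x}
and it is minimized by assigning each x to the class j that minimizes \sum_{k} L_{kj}\, p(\mathcal{C}_k \mid \mathbf{x}).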
Reject Option
Why Separate Inference and Decision?
• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models
Decision Theory for Regression
Inference step
Determine p(t|x).
Decision step
For given x, make optimal prediction, y(x), for t.
Loss function: L(t, y(x)).
The Squared Loss Function
The optimal y(x), obtained by minimizing the expected
squared loss, is the mean of the
conditional distribution p(t|x).
Minimum expected loss = squared deviation of y(x) from the conditional mean, averaged over x, +
the variance of the distribution of t, averaged over x.
The second term can be regarded as noise: because it is independent of y(x), it
represents the irreducible minimum value of the loss function.
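The decomposition described above, written out (Bishop, Ch. 1):
\mathbb{E}[L] = \iint \{\, y(\mathbf{x}) - t \,\}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t
= \int \{\, y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \,\}^2 p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int \mathrm{var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}
so the optimal predictor is y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}], and the second (noise) term is the irreducible minimum of the expected loss.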
Minkowski metric: L_q = |y − t|^q
Generative vs Discriminative
Generative approach:
Model p(x|C_k) and p(C_k) (or the joint p(x, C_k));
use Bayes' theorem to obtain the posterior p(C_k|x).
Discriminative approach:
Model p(C_k|x)
directly
Information Theory: Claude Shannon (1916-2001)
Goal: Find the amount of information
carried by a specific value of an r.v.
Need something intuitive
Information Theory: C. Shannon
Information: giving form or shape to the mind
Assumptions:
Source message
Receiver
• Information is the quality of a message;
• it may be a truth or a lie;
• if the amount of information in the received
message increases, the message is more accurate.
• A common alphabet is needed to communicate.
Quantification of information.
Given an r.v. X with distribution p(x),
what is the amount of information gained when we
receive an outcome of X?
Self Information
h(x) = -log p(x)
Low probability → high information ("surprise")
Base e: nats
Base 2: bits
Entropy: Expected value of self-information;
the average information needed to specify the state of a
random variable.
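For a discrete random variable x this expectation is (Bishop, Ch. 1):
H[x] = -\sum_{x} p(x) \log_2 p(x) \quad \text{(bits; use } \ln \text{ for nats)}
with H[x] ≥ 0, and H[x] = 0 when one state has probability 1.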
Why does H(X) measure information?
• It makes sense intuitively.
• "Nobody knows what entropy really is, so in any
discussion you will always have an advantage." (von Neumann)
Entropy
Noiseless coding theorem:
• Entropy is a lower bound on the number of bits needed to
transmit the state of a random variable.
Example: a discrete r.v. x with 8 possible states (the alphabet), all equally likely;
how many binary digits (bits) are needed to transmit the state of x?
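A quick numerical check of this example; the helper entropy_bits below is illustrative, not part of the course material.

    import numpy as np

    def entropy_bits(p):
        # Entropy in bits of a discrete distribution; terms with p = 0 contribute 0.
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    # 8 equally likely states: H = 3 bits, matching a fixed-length 3-bit code.
    print(entropy_bits(np.full(8, 1 / 8)))   # 3.0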
Entropy
Entropy and Multiplicity
In how many ways can N identical objects be allocated in M bins?
Entropy is maximized when the objects are spread uniformly over the bins.
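The multiplicity argument, written out (Bishop, Ch. 1), with n_i objects in bin i:
W = \frac{N!}{\prod_{i} n_i!}, \qquad H = \frac{1}{N} \ln W \simeq -\sum_{i=1}^{M} \frac{n_i}{N} \ln \frac{n_i}{N} \quad (N \to \infty, \text{ by Stirling's approximation})
which is maximized when n_i/N = 1/M for all i, giving H = \ln M.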
Entropy
Differential Entropy
Put bins of width Δ along the real line.
The differential entropy is maximized (for fixed variance) when p(x) is Gaussian.
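Taking the limit of the binned (discrete) entropy gives the differential entropy and its Gaussian maximum (Bishop, Ch. 1):
H_\Delta = -\sum_{i} p(x_i)\Delta \ln\big(p(x_i)\Delta\big) = -\sum_{i} p(x_i)\Delta \ln p(x_i) - \ln \Delta
H[x] = -\int p(x) \ln p(x)\, \mathrm{d}x \quad \text{(the first term as } \Delta \to 0\text{; the } -\ln\Delta \text{ term diverges)}
\text{For fixed variance } \sigma^2: \quad p(x) = \mathcal{N}(x \mid \mu, \sigma^2) \;\Rightarrow\; H[x] = \tfrac{1}{2}\{\, 1 + \ln(2\pi\sigma^2) \,\}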
Conditional Entropy
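The conditional entropy of y given x, and its relation to the joint entropy (Bishop, Ch. 1):
H[\mathbf{y} \mid \mathbf{x}] = -\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x})\, \mathrm{d}\mathbf{y}\, \mathrm{d}\mathbf{x}, \qquad H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}]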
The Kullback-Leibler Divergence
Average additional amount of information required to specify the value of x as a result of using
q(x) instead of the true distribution p(x)
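In symbols (Bishop, Ch. 1):
\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, \mathrm{d}\mathbf{x} \;\geq\; 0
with equality if and only if p = q; note that the KL divergence is not symmetric in p and q.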
Mutual Information
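Mutual information measures how far the joint distribution is from independence, or equivalently the reduction in uncertainty about x after observing y (Bishop, Ch. 1):
I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y}) \big) = -\iint p(\mathbf{x}, \mathbf{y}) \ln \frac{p(\mathbf{x})\, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\, \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{y}
I[\mathbf{x}, \mathbf{y}] = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}]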