PATTERN RECOGNITION
Fatoş Tunay Yarman Vural

Textbook: Pattern Recognition and Machine Learning, C. Bishop
Recommended book: Pattern Theory, U. Grenander, M. Miller

Course Requirements
Final: 50%
Project: 50%
• Literature survey report: 1 April
• Algorithm development: 1 May
• Full paper with implementation: 1 June

Content
1. What is a Pattern
2. Probability Theory
3. Bayesian Paradigm
4. Information Theory
5. Linear Methods
6. Kernel Methods
7. Graph Methods

WHAT IS A PATTERN
• Structures regulated by rules
• Goal: represent empirical knowledge in mathematical form
• The mathematics of perception
• Needs: algebra, probability theory, graph theory

What you perceive is not what you hear (Warren & Warren, 1970):

ACTUAL SOUND                      PERCEIVED WORDS
1. The ?eel is on the shoe        The heel is on the shoe
2. The ?eel is on the car         The wheel is on the car
3. The ?eel is on the table       The meal is on the table
4. The ?eel is on the orange      The peel is on the orange

Statistical inference is being used!

"All flows!" (Heraclitus)
• It is only the invariance, the permanent facts, that enable us to find meaning in a world of flux.
• We can only perceive variances.
• Our aim is to find the invariant laws behind our varying observations.

Pattern Recognition: Assumptions
• SOURCE: hypothesis classes, objects
• CHANNEL: noisy
• OBSERVATION: multiple sensors, variations

Example: Handwritten Digit Recognition

Polynomial Curve Fitting
• Sum-of-squares error function: E(w) = (1/2) Σ_n { y(x_n, w) − t_n }²
• Fits of increasing order: 0th, 1st, 3rd, and 9th order polynomials
• Over-fitting: the 9th order polynomial fits the training points exactly but generalizes poorly
• Root-mean-square (RMS) error: E_RMS = sqrt( 2 E(w*) / N )
• The polynomial coefficients grow very large as the order increases
• Data set size: with more training data, the 9th order polynomial over-fits much less

Regularization
• Penalize large coefficient values: Ẽ(w) = (1/2) Σ_n { y(x_n, w) − t_n }² + (λ/2) ‖w‖²
• Plots of the fit and of E_RMS versus ln λ show how λ controls over-fitting
• The polynomial coefficients shrink as λ increases
• (A numerical sketch of these fits is given at the end of this outline.)

Probability Theory
• Apples and oranges example
• Marginal, joint, and conditional probability

The Rules of Probability
• Sum rule: p(X) = Σ_Y p(X, Y)
• Product rule: p(X, Y) = p(Y | X) p(X)
• Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X), i.e. posterior ∝ likelihood × prior

Probability Densities
• Densities and transformed densities
• Expectations; conditional expectation (discrete); approximate expectation (discrete and continuous)
• Variances and covariances

The Gaussian Distribution
• N(x | μ, σ²) = (2πσ²)^(−1/2) exp{ −(x − μ)² / (2σ²) }
• Gaussian mean and variance
• The multivariate Gaussian

Gaussian Parameter Estimation
• Likelihood function
• Maximum (log) likelihood
• Properties of the maximum likelihood estimates of the mean and variance

Curve Fitting Re-visited
• Maximum likelihood: determine the weights by minimizing the sum-of-squares error
• Predictive distribution
• MAP, a step towards Bayes: determine the weights by minimizing the regularized sum-of-squares error
• Bayesian curve fitting and the Bayesian predictive distribution

Model Selection
• Cross-validation

Curse of Dimensionality
• Polynomial curve fitting example with M = 3
• Gaussian densities in higher dimensions

Decision Theory
• Inference step: determine either the posterior probabilities or the joint distribution
• Decision step: for a given x, determine the optimal t
• Minimum misclassification rate
• Minimum expected loss; example: classify medical images as 'cancer' or 'normal', with a loss matrix indexed by truth and decision
• The decision regions are chosen to minimize the expected loss
• Reject option

Why Separate Inference and Decision?
• Minimizing risk (the loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models

Decision Theory for Regression
• Inference step: determine p(t | x)
• Decision step: for a given x, make the optimal prediction y(x) for t
• Loss function: L(t, y(x))

The Squared Loss Function
• y(x), obtained by minimizing the expected squared loss, is the mean of the conditional distribution p(t | x).
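A minimal NumPy sketch of the polynomial curve fitting and regularization steps outlined above. The function names, the toy sin(2πx) data, and the value of λ (lam) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def fit_polynomial(x, t, M, lam=0.0):
    """Fit an M-th order polynomial by minimizing the (regularized)
    sum-of-squares error  E(w) = 1/2 * sum_n {y(x_n, w) - t_n}^2 + lam/2 * ||w||^2.
    With lam = 0 this is ordinary least squares."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: 1, x, x^2, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)        # regularized normal equations
    return np.linalg.solve(A, Phi.T @ t)

def rms_error(w, x, t):
    """Root-mean-square error E_RMS = sqrt(2 E(w) / N)."""
    Phi = np.vander(x, len(w), increasing=True)
    residual = Phi @ w - t
    return np.sqrt(np.mean(residual ** 2))

# Toy data: noisy samples of sin(2*pi*x) (illustrative).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)

for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    print(f"M = {M}: training E_RMS = {rms_error(w, x_train, t_train):.3f}")

# The unregularized M = 9 fit over-fits; a small lam penalizes large coefficients.
w_reg = fit_polynomial(x_train, t_train, 9, lam=1e-3)
```

Setting lam = 0 gives the maximum likelihood (unregularized least squares) fit; a positive lam corresponds to the regularized, MAP-style solution discussed above.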
• The minimum expected loss decomposes into the squared deviation of the predictor from the conditional mean plus the variance of t, averaged over x:
  E[L] = ∫ { y(x) − E[t | x] }² p(x) dx + ∫ var[t | x] p(x) dx
• The second term can be regarded as noise: because it is independent of y(x), it represents the irreducible minimum value of the loss function.

Minkowski Loss
• L_q = |y − t|^q (the squared loss is the special case q = 2)

Generative vs. Discriminative
• Generative approach: model the class-conditional densities and priors (or the joint distribution) and use Bayes' theorem to obtain the posterior.
• Discriminative approach: model the posterior directly.

Information Theory: Claude Shannon (1916-2001)
• Goal: find the amount of information carried by a specific value of a random variable
• We need something intuitive

Information Theory: C. Shannon
• Information: giving form or shape to the mind
• Assumptions: a source sends a message to a receiver
• Information is a quality of a message; the message may be a truth or a lie
• If the amount of information in the received message increases, the message is more accurate
• A common alphabet is needed to communicate

Quantification of Information
Given a random variable X with distribution p(x), what is the amount of information received when we observe an outcome x?

Self-Information
• h(x) = −log p(x)
• Low probability → high information (surprise)
• Base e: nats; base 2: bits

Entropy
• H(X) = −Σ_x p(x) log p(x): the expected value of the self-information, i.e. the average information needed to specify the state of the random variable

Why does H(X) measure information?
• It makes sense intuitively.
• "Nobody knows what entropy really is, so in any discussion you will always have an advantage." (von Neumann)

Noiseless Coding Theorem
• Entropy is a lower bound on the average number of bits needed to transmit the state of a random variable.
• Example: a discrete variable x with 8 possible states (alphabet), to be transmitted using only two symbol values. How many bits are needed to transmit the state of x? With all states equally likely, H(x) = −8 × (1/8) log₂(1/8) = 3 bits.

Entropy and Multiplicity
• In how many ways can N identical objects be allocated to M bins?
• Entropy is maximized when the objects are spread uniformly over the bins, i.e. p_i = 1/M.

Differential Entropy
• Put bins of width Δ along the real line and let Δ → 0.
• For a fixed variance σ², the differential entropy is maximized when p(x) is Gaussian, in which case H(x) = (1/2){ 1 + ln(2πσ²) }.

Conditional Entropy
• H(Y | X) = −Σ_{x,y} p(x, y) log p(y | x), and H(X, Y) = H(Y | X) + H(X).

The Kullback-Leibler Divergence
• KL(p ‖ q) = −Σ_x p(x) log{ q(x) / p(x) }: the average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x).

Mutual Information
• I(X; Y) = KL( p(x, y) ‖ p(x) p(y) ) = H(X) − H(X | Y) = H(Y) − H(Y | X).
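A small numerical sketch of the entropy, KL divergence, and mutual information definitions above, for discrete distributions. The function names and the toy joint distribution are illustrative assumptions; base-2 logarithms are used, so results are in bits.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """KL(p || q): extra bits needed when coding with q instead of the true p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(joint):
    """I(X; Y) = KL( p(x, y) || p(x) p(y) ) for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    return kl_divergence(joint.ravel(), (px * py).ravel())

# Noiseless coding example: 8 equally likely states need H = 3 bits.
print(entropy(np.full(8, 1 / 8)))           # -> 3.0

# Mutual information of a toy joint distribution (illustrative numbers).
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
print(mutual_information(joint))            # > 0, since x and y are dependent
```

The first print reproduces the noiseless coding example above: eight equally likely states require 3 bits.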