Pattern classification: basic principles and tools

Summary
• Why a lecture on pattern recognition?
• Introduction to pattern recognition (Duda, Sections 1.1-1.6)
• An example of unsupervised learning: PCA
• Tools

Intelligent media environments
• Ambient Intelligence (AmI): electronic environments that are sensitive and responsive to the presence of people.
• AmI = ubiquitous computing + ubiquitous communication + intelligent social user interfaces.
• AmI at Philips, a video: http://www.date-conference.com/conference/2003/keynotes/
• IBM video: smart supermarket.
• Ambient intelligence envisions a world where people are surrounded by intelligent and intuitive interfaces embedded in the everyday objects around them. These interfaces recognize and respond to the presence and behavior of an individual in a personalized and relevant way.

Wireless sensor networks
1. Smart environments need an "information feed" → sensors
2. Sensor data must be communicated, stored and processed → network
3. Networking anywhere and everywhere, with little infrastructure → wireless
• Wireless sensor networks are the "sensory system" of the intelligent ambient "organism".
• Humans naturally recognize faces, understand spoken words, read handwritten characters, identify a key in a bag by touch... How do we provide such intelligence to the "digital organism"?

What is pattern recognition?
"The assignment of a physical object or event to one of several pre-specified categories" -- Duda & Hart
• A pattern is an object, process or event that can be given a name.
• A pattern class (or category) is a set of patterns sharing common attributes and usually originating from the same source.
• During recognition (or classification), given objects are assigned to prescribed classes.
• A classifier is a machine which performs classification.

Examples of applications
• Optical character recognition (OCR)
  • Handwritten text: sorting letters by postal code, input devices for PDAs.
  • Printed text: reading machines for blind people, digitization of text documents.
• Biometrics
  • Face recognition, verification, retrieval.
  • Fingerprint recognition.
  • Speech recognition.
• Diagnostic systems
  • Medical diagnosis: X-ray, EKG analysis.
  • Machine diagnostics, waster detection.
• Military applications
  • Automated target recognition (ATR).
  • Image segmentation and analysis (recognition from aerial or satellite photographs).

Examples of applications: wearable and ambient sensing
• Localization, HCI, user awareness, cooperative work and playtime.
• Smart objects and smart environments.
• Wearable devices and body area networks (BANs): gestures, natural interfaces, HCI.
• Bio-feedback, rehabilitation and healthcare, assistive technologies.
• Static and dynamic posture and activity monitoring (e.g. MicrelEye).

Design space
The design space is wide; two examples follow (a sketch of the first option is given after this list).
• Sequence of static postures: threshold-based algorithm, star network topology.
  • Can be embedded in a microcontroller.
  • Can be distributed among nodes.
  • More nodes (needed to understand complex postures) create problems of scalability, wearability, loss of real-time behavior, etc.
• Activity recognition (gait): SVM-based algorithm, one sensor.
  • Extreme wearability.
  • Needs computational power.
  • Can recognize more complex activities.
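The slides only name the threshold-based approach, so the following is a hypothetical illustration of the first design option: a posture rule on one tri-axial accelerometer sample. The choice of axis, the 45° threshold and the two posture labels are assumptions for the sketch, not part of the lecture; a rule of this kind is simple enough to run on a node's microcontroller.

```python
import math

def classify_posture(ax, ay, az, tilt_threshold_deg=45.0):
    """Classify a static posture from one tri-axial accelerometer sample.

    Hypothetical rule: the angle between the sensed gravity vector and the
    body's vertical axis (assumed here to be y) separates upright from lying.
    """
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        return "unknown"
    # Tilt of the vertical axis with respect to gravity, in degrees.
    tilt = math.degrees(math.acos(max(-1.0, min(1.0, ay / norm))))
    return "upright" if tilt < tilt_threshold_deg else "lying"

# Example: gravity mostly along y -> upright; mostly along z -> lying.
print(classify_posture(0.10, 0.97, 0.05))   # upright
print(classify_posture(0.05, 0.10, 0.99))   # lying
```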
An example
"Sorting incoming fish on a conveyor according to species using optical sensing."
• Species: sea bass or salmon.
• Material for the following slides is mainly taken from: Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000.

Problem analysis
• Set up a camera and take some sample images to extract features:
  • Length
  • Lightness
  • Width
  • Number and shape of fins
  • Position of the mouth, etc.
• This is the set of all suggested features to explore for use in our classifier.

Preprocessing
• Use a segmentation operation to isolate the fish from one another and from the background.
• Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features.
• The features are passed to a classifier.

Classification
• Select the length of the fish as a possible feature for discrimination.
• The length alone is a poor feature! Select the lightness as a possible feature instead.

Threshold decision boundary and cost
• Move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon).
• This is the task of decision theory.

Feature extraction
• Task: extract features which are good for classification.
• Good features: objects from the same class have similar feature values, and objects from different classes have different values.
• (Figure: examples of "good" and "bad" features.)

Feature vector
• Adopt the lightness and add the width of the fish: each fish is described by x^T = [x1, x2] = [lightness, width].

Basic concepts
• Feature vector x = [x1, x2, ..., xn]^T ∈ X: a vector of observations (measurements); x is a point in the feature space X.
• Hidden state y ∈ Y: cannot be directly measured; patterns with the same hidden state belong to the same class.
• Task: design a classifier (decision rule) q: X → Y which decides about the hidden state based on an observation.

In our case
• Task: fish recognition.
• Feature vector x = [x1, x2]^T with x1 = lightness and x2 = width; the feature space is X = R^2.
• The set of hidden states is Y = {H, J}, one label per species (sea bass, salmon).
• Training examples: {(x1, y1), ..., (xl, yl)}.
• Linear classifier:
  q(x) = H  if (w · x) + b ≥ 0
  q(x) = J  if (w · x) + b < 0
  The decision boundary is the line (w · x) + b = 0.
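The linear decision rule above can be written out in a few lines. This is a minimal sketch with hypothetical values: the weight vector w, the offset b and the fish measurement are made up for illustration; in practice w and b come from a learning algorithm applied to the training examples.

```python
import numpy as np

def q(x, w, b):
    """Linear decision rule: class H ('sea bass') if w.x + b >= 0, else J ('salmon')."""
    return "sea bass" if np.dot(w, x) + b >= 0 else "salmon"

# Hypothetical parameters and a hypothetical fish [lightness, width];
# real values would be learned from the training set.
w = np.array([1.0, -0.5])
b = -3.0
x = np.array([4.2, 1.1])
print(q(x, w, b))
```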
Adding features
• We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features".
• Ideally, the best decision boundary would be the one which provides optimal performance on the training samples.
• However, this satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input. This is the issue of generalization!

Overfitting and underfitting
• A decision boundary can underfit the data, fit it well, or overfit it.

Components of a PR system
Pattern → sensors and preprocessing → feature extraction → classifier → class assignment, with a teacher and a learning algorithm acting on the classifier.
• Sensors and preprocessing (segmentation / windowing).
• Feature extraction aims to create discriminative features that are good for classification.
• A classifier.
• A teacher provides information about the hidden state -- supervised learning.
• A learning algorithm sets up the PR system from training examples.

Classifier
• A classifier partitions the feature space X into class-labeled regions such that
  X = X1 ∪ X2 ∪ ... ∪ X|Y|  and  Xi ∩ Xj = ∅ for i ≠ j.
• Classification consists of determining to which region a feature vector x belongs.
• The borders between decision regions are called decision boundaries.

Representation of a classifier
• A classifier is typically represented as a set of discriminant functions fi(x): X → R, i = 1, ..., |Y|.
• The classifier assigns a feature vector x to the i-th class if fi(x) > fj(x) for all j ≠ i.
• (Diagram: the feature vector x is fed to the discriminant functions f1(x), ..., f|Y|(x); a max selector outputs the class identifier y. A sketch of this rule follows.)
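As a concrete illustration of the discriminant-function view, here is a minimal sketch. The three linear discriminants and their parameters are hypothetical examples, not taken from the lecture; only the argmax rule itself comes from the slide above.

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant function is largest."""
    scores = [f(x) for f in discriminants]
    # Class identifier i such that fi(x) > fj(x) for all j != i.
    return int(np.argmax(scores))

# Hypothetical linear discriminants fi(x) = wi.x + bi for three classes.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, -0.5, 1.0])
discriminants = [lambda x, i=i: float(W[i] @ x + b[i]) for i in range(3)]

print(classify(np.array([2.0, 0.5]), discriminants))  # -> 0
```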
Post-processing and evaluation
• Voting.
• Costs and risks.
• Computational complexity (distinguishing between learning and classifying).

The design cycle
• Data collection
• Feature choice
• Model choice
• Training
• Evaluation
• Computational complexity

• Data collection: how do we know when we have collected an adequately large and representative set of examples for training and testing the system?
• Feature choice: depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations and insensitive to noise.
• Model choice: we may be unsatisfied with the performance of our fish classifier and want to jump to another class of model.
• Training: use the data to determine the classifier. There are many different procedures for training classifiers and choosing models.
• Evaluation: measure the error rate (or performance) and, if needed, switch from one set of features to another.
• Computational complexity: what is the trade-off between computational ease and performance? How does an algorithm scale with the number of features, patterns or categories?

Learning and adaptation
• Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
• Unsupervised learning: the system forms clusters or "natural groupings" of the input patterns.

Unsupervised learning: PCA
• Principal Component Analysis is used abundantly because it is a method for extracting relevant information from confusing data sets.
• It is simple and non-parametric.
• It can be used to reduce a complex data set to a lower dimension, revealing hidden, simplified structures.
• Starting point: we are the experimenter; there is a phenomenon to measure, but the data appears clouded, unclear and redundant.

The toy example: motion of the ideal spring
• A ball of mass m is attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium, and it oscillates indefinitely along x at a fixed frequency about its equilibrium position.
• We are ignorant: we do not know how many axes and dimensions are important to measure.
• We decide to use three cameras, not placed orthogonally; each camera records, at 120 Hz, an image of the two-dimensional position of the ball (a projection); we chose three camera axes {a, b, c}.

The toy example, continued
• How do we get from this data set to a simple equation of x?
• One goal of PCA is to compute the most meaningful basis with which to re-express a noisy data set.
• The goal in our example is to discover that "the dynamics are along the x-axis", i.e. that the unit basis vector along the x-axis is the important dimension.
• Our data set at each point in time is the vector X = [xA, yA, xB, yB, xC, yC]^T, where each camera contributes a 2-dimensional projection of the ball's position.
• If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.

Change of basis
• Each sample X is an m-dimensional vector, where m is the number of measurement types (here m = 6).
• Every sample is a vector lying in an m-dimensional vector space spanned by an orthonormal basis B, and can be expressed as a linear combination of the basis vectors bi.
• Does there exist another basis B', a linear combination of B, that best re-expresses our data set?
• Linearity assumption: it restricts the set of potential bases, it formalizes the implicit assumption of continuity in the data set, and it limits PCA to re-expressing the data as a linear combination of its basis vectors.

• Let X be the original data set, with one sample (one point in time) per column: X is m × n (m = 6, n = 72000).
• Let Y be another m × n matrix, related to X by a linear transformation P: PX = Y.
• P is the matrix that transforms X into Y; geometrically, P is a rotation and a stretch.
• The rows of P, {p1, ..., pm}, are a set of new basis vectors for expressing the columns of X: the j-th coefficient of yi is the projection of xi onto the j-th row of P.

Variance and the goal
• The rows of P will be the principal components of X.
• What is the best way to re-express X? What is a good choice for P?
• What does "best express the data" mean? We want to decipher "garbled" data; in a linear system, "garbled" refers to noise, rotation and redundancy.

A. Noise and rotation
• Noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), a ratio of variances: SNR = σ²_signal / σ²_noise.
• A high SNR (>> 1) indicates precise data; a low SNR indicates noise-contaminated data.
• The SNR measures how "fat" the cloud of measurements is.
• We assume that the directions with the largest variances in our measurement space contain the dynamics of interest. Maximizing the variance (and, by assumption, the SNR) corresponds to finding the appropriate rotation of the naive basis.
• (Figure: simulated data of (xA, yA) for camera A.)
• This intuition corresponds to finding the direction p* in Figure 2b. How do we find p*? In the 2-dimensional case of Figure 2a, p* falls along the direction of the best-fit line for the data cloud. Thus, rotating the naive basis to lie parallel to p* would reveal the direction of motion of the spring in the 2-D case.

B. Redundancy
• Consider the possible plots between two arbitrary measurement types r1 and r2.
• In panel (a) there is no apparent relationship: r1 is entirely uncorrelated with r2; r1 and r2 are statistically independent.
• On the other extreme, Figure 3c depicts highly correlated recordings. Clearly, in panel (c) it would be more meaningful to have recorded just a single variable, not both. Indeed, this is the very idea behind dimensionality reduction.

Covariance matrix
• Generalizing to higher dimensionality: consider two sets of zero-mean measurements, A = {a1, ..., an} and B = {b1, ..., bn}.
• The variances are σ²_A = (1/n) Σ ai² and σ²_B = (1/n) Σ bi², and the covariance is σ²_AB = (1/n) Σ ai bi.
• The covariance measures the degree of the linear relationship between two variables: a large (small) value indicates high (low) redundancy.
• Generalizing to m row vectors x1, ..., xm, we obtain an m × n matrix X. Each row of X corresponds to all measurements of a particular type (xi); each column of X corresponds to the set of measurements from one particular trial.
• The covariance matrix is CX = (1/n) X X^T.
• CX is a square, symmetric m × m matrix. The diagonal terms of CX are the variances of the particular measurement types; the off-diagonal terms are the covariances between measurement types.
• CX captures the correlations between all possible pairs of measurements, and its values reflect the noise and redundancy in our measurements. In the diagonal terms, by assumption, large (small) values correspond to interesting dynamics (or noise). In the off-diagonal terms, large (small) values correspond to high (low) redundancy.
• Pretend we have the option of manipulating CX. We will suggestively call the manipulated covariance matrix CY. What features do we want to optimize in CY?
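The covariance matrix above is easy to compute directly. Here is a minimal numpy sketch with made-up data standing in for the spring recordings; the shapes, the row-wise mean subtraction and the 1/n normalization follow the definitions above, while the random data itself is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for the spring data: m = 6 measurement types (rows),
# n = 1000 trials (columns). Real data would come from the three cameras.
m, n = 6, 1000
X = rng.normal(size=(m, n))

# Subtract the mean of each measurement type (each row becomes zero-mean).
X = X - X.mean(axis=1, keepdims=True)

# Covariance matrix C_X = (1/n) X X^T: an m x m symmetric matrix whose
# diagonal holds variances and whose off-diagonal terms hold covariances.
C_X = (X @ X.T) / n

print(C_X.shape)                 # (6, 6)
print(np.allclose(C_X, C_X.T))   # True: C_X is symmetric
```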
Diagonalize the covariance matrix
• We want (1) to minimize redundancy, measured by the covariances, and (2) to maximize the signal, measured by the variances. In other words, CY must be diagonal.
• PCA takes the easiest route to achieve this:
  • PCA assumes that P is an orthonormal matrix.
  • PCA assumes that the directions with the largest variances are the signals, i.e. the most "important" or principal directions.
  • P acts as a generalized rotation that aligns the basis with the axes of maximal variance.
• The procedure is:
  1. Select a normalized direction in m-dimensional space along which the variance of X is maximized. Save this vector as p1.
  2. Find another direction along which the variance is maximized; however, because of the orthonormality condition, restrict the search to directions perpendicular to all previously selected directions. Save this vector as p2.
  3. Repeat this procedure until m vectors have been selected.
• The result is Y = PX with a diagonal CY.

Solving PCA
There are two ways to obtain the algebraic solution:
• the eigenvectors of the covariance matrix, or
• a more general solution: the SVD (singular value decomposition).
The first method corresponds, for a given data set X, to (1) subtracting the mean of each measurement type and (2) computing the eigenvectors of X X^T (which yields P).
More generally, performing PCA is done in three steps:
1. Organize the data set as an m × n matrix, where m is the number of measurement types and n is the number of trials.
2. Subtract off the mean of each measurement type (each row xi).
3. Compute the SVD or the eigenvectors of the covariance matrix.
(A minimal numpy sketch of this recipe is given at the end of the section.)

Tools and demonstration (30 min)
Commercial and open-source tools:
• WEKA http://www.cs.waikato.ac.nz/~ml/weka/index.html
• YALE (now RapidMiner, http://rapid-i.com/)
• The R Project for Statistical Computing http://www.r-project.org/
• Pentaho (whole BI solutions) http://www.pentaho.com/
• Matlab
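As a closing illustration, here is a minimal numpy sketch of the eigenvector recipe from the "Solving PCA" steps above, applied to made-up two-dimensional data standing in for the spring recordings. It is a sketch of the method under those assumptions, not a replacement for the tools listed.

```python
import numpy as np

def pca(X):
    """PCA of an m x n data matrix X (m measurement types, n trials).

    Returns the principal components as rows of P and the re-expressed data
    Y = P X, following the eigenvectors-of-the-covariance recipe above.
    """
    # Step 2: subtract the mean of each measurement type (row).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 3: eigenvectors of the covariance matrix C_X = (1/n) Xc Xc^T.
    C_X = (Xc @ Xc.T) / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C_X)    # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    P = eigvecs[:, order].T                   # principal components as rows
    return P, P @ Xc

# Made-up example: a noisy, elongated 2-D cloud (two correlated measurements).
rng = np.random.default_rng(1)
t = rng.normal(size=500)
X = np.vstack([3.0 * t + 0.1 * rng.normal(size=500),
               1.5 * t + 0.1 * rng.normal(size=500)])
P, Y = pca(X)
print(np.round((Y @ Y.T) / Y.shape[1], 3))  # C_Y is (approximately) diagonal
```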