Classification: When is Temple Cowley not Temple Cowley?
Amber Tomas, PRS, Dept of Statistics

Classification Problem
● We have data known to have been generated from a number of different populations (classes)
● Our goal is to correctly classify data that arises in the future, i.e. allocate it to the correct class

Approach to a problem
● Consider the client's problem
● Formulate the problem
● Model the problem
● Calculate or compute the solution to this model

Modelling
● The model is generally a simplification of real life
● The art of modelling is in
  – capturing the important processes
  – ignoring trivial or irrelevant detail
● The solution to the model should be a good approximation to the solution of the original problem

Outline of Talk
● Describe the classification problem
● Formulate the problem
  – introduce relevant aspects of probability and distribution theory
● Describe 3 approaches to modelling

"Usual" scenario
● Usually we want to infer some property of a single population, e.g. the mean height of 10-year-olds in Britain

Classification scenario
● Classification:
  – assume that there are at least 2 populations
  – each observation has come from only one of these populations

Classification example
● Credit card transactions: record amount, time, place, etc.
● All transactions can be put into one of two buckets (classes)

[Figure: credit card transactions plotted in feature space]

Classification cont.
● Ideally, we would like a system that correctly classifies every observation
● In reality, this is very often impossible

[Figures: further classification examples]

Summary of Problem
● We have n > 1 populations (buckets)
● Each population is given a label (e.g. valid or fraudulent)
● We have a set of labelled observations, i.e. some observations for which we know which bucket they came from
● Our goal is to design a system (a classifier) that will label a new observation correctly as frequently as possible

Decision Boundaries
● A decision boundary is a boundary in the feature space which divides the space into regions, such that an observation falling into a region is classified with some label, different from that of neighbouring regions

[Figures: example decision boundaries, showing the effect of prior probabilities and the effect of different costs of misclassification]

Optimal decision boundary
● If we know
  – the distribution of observations from each class,
  – the cost of misclassification for each class, and
  – the prior probability of class membership,
  then we can compute the optimal decision boundary, i.e. the boundary which minimises the expected cost of misclassification
● Generally we don't know this information

Linear Discriminant Analysis
● Based on the assumption that each class is normally distributed
● The classes are assumed to have the same variance but different means
● Under these assumptions we can calculate the optimal decision boundary

[Figure: data from 2 normal populations, with the LDA boundary]

Linear Discriminant Analysis cont.
● In practice, it is unlikely that the assumptions will be true
● We hope that the best solution to an approximate problem will still be a good solution to our original problem
● It is very important to check the validity of the assumptions made

[Figure: data that are not normally distributed, with the LDA decision boundary]
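To make the LDA recipe concrete, here is a minimal sketch (not from the talk itself; an illustration assuming Python with numpy and scikit-learn). It simulates two normal populations with a common covariance, fits LDA, and prints the fitted linear decision boundary and training error rate:

    # A minimal LDA sketch (illustrative; assumes numpy and scikit-learn
    # are installed).  The data are simulated from two normal populations
    # with a common covariance, matching the assumptions above.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    cov = np.array([[1.0, 0.3], [0.3, 1.0]])       # same variance for both classes
    class0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
    class1 = rng.multivariate_normal([2.0, 2.0], cov, size=100)  # different means
    X = np.vstack([class0, class1])
    y = np.array([0] * 100 + [1] * 100)

    lda = LinearDiscriminantAnalysis().fit(X, y)

    # With two classes the fitted boundary is the line w.x + b = 0
    w, b = lda.coef_[0], lda.intercept_[0]
    print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
    print(f"training error rate: {np.mean(lda.predict(X) != y):.2f}")

Because this simulated data satisfies the assumptions, the fitted boundary approximates the optimal one; on real data the assumption check emphasised above still applies.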
Separating Hyperplanes
● The boundary between two classes should "separate the two classes and maximise the distance to the closest point from either class"
● This method works when we have linearly separable data

[Figures: separating hyperplanes for linearly separable data]

Separating Hyperplanes cont.
● The idea is to transform the data to a higher-dimensional space, compute the optimal hyperplane there, then map it back to the original space
● Hence a linear method can produce non-linear boundaries

[Figures: SVM decision boundaries]

Nearest Neighbours
● "Model free", i.e. makes no distributional assumptions
● To classify an observation x, say, we consider only the k points in our data set which are nearest to x
● We label x with the majority label among its k nearest neighbours

[Figures: nearest-neighbour classification examples; decision boundaries for 1-, 5- and 10-nearest-neighbour classifiers on two data sets]

Nearest Neighbours cont.
● We need to
  – define "nearest", i.e. decide on a distance measure
  – choose a value for k
● We want to choose the value of k that will minimise the expected error rate

Data Splitting
● Estimate k by using data splitting:
  – training set: data used for fitting the model
  – test set: data used for testing the model, i.e. estimating the expected error rate of the model

[Figure: error rates on the training and test sets]

Conclusions
● Understand the problem of classification
● Understand how we can apply the principles of statistical modelling to this problem
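As a closing illustration of the nearest-neighbour and data-splitting ideas (again not from the talk itself; a minimal sketch assuming Python with numpy and scikit-learn), the following simulates two overlapping normal populations, holds out a test set, and compares the estimated error rate for several values of k:

    # A minimal k-NN sketch with data splitting (illustrative; assumes
    # numpy and scikit-learn are installed).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(1)
    # Two overlapping normal populations, as in the talk's examples
    X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                   rng.normal(1.5, 1.0, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    # Training set: used for fitting.  Test set: used to estimate the
    # expected error rate of each candidate model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    for k in (1, 5, 10):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        test_error = np.mean(knn.predict(X_test) != y_test)
        print(f"k={k:2d}  estimated error rate: {test_error:.2f}")

The value of k with the lowest test error is the one the data-splitting criterion would select; the training error alone would always favour k = 1, which typically overfits.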