Classification When is Temple Cowley not Temple Cowley? Amber Tomas

When is Temple Cowley
not Temple Cowley?
Amber Tomas
PRS, Dept of Statistics
Classification Problem
We have data known to have been generated from
a number of different populations (classes)
Our goal is to correctly classify data that arises in
the future i.e. allocate it to the correct class
Approach to a problem
Consider the client's problem
Formulate the problem
Model the problem
Calculate or compute the solution to this model
The model is generally a simplification of real life
The art of modelling is in
capturing the important processes
ignoring trivial or irrelevant detail
Solution to the model should be a good
approximation to the solution of the original
problem
Outline of Talk
Describe the classification problem
Formulate the problem
introduce relevant aspects of probability and
distribution theory
Describe 3 approaches to modelling
“Usual” scenario
Usually want to infer some property of a single
population
e.g. mean height of 10 year olds in Britain
Classification scenario
Classification – assume that there are at least 2
populations
each observation has come from only one of these
populations
Classification example
Credit card transactions
record amount, time, place etc
All transactions can be put into one of two
buckets (classes)
Classification cont.
Ideally, would like a system that correctly
classifies every observation
In reality, this is very often impossible
Summary of Problem
We have n > 1 populations (buckets)
each population is given a label (e.g. valid or
fraudulent)
We have a list of labelled observations i.e. some
observations for which we know what bucket
they are from
Our goal is to design a system (classifier) that
will label a new observation correctly as
frequently as possible
Decision Boundaries
Decision Boundary – a boundary in the feature
space which divides the space into regions, such
that an observation falling into a region will be
classified with some label, different to that of
neighbouring regions
Decision Boundary example
[Figures: example decision boundaries, showing the effect of the prior
and the effect of different costs of misclassification]
Optimal decision boundary
If we know
the distribution of observations from each class
the cost of misclassification for each class, and
the prior probability of class membership
Then we can compute the optimal decision
boundary, i.e. the boundary which minimises the
expected cost of misclassification
Generally we don't know this information
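When the three ingredients above are known, the minimum expected cost rule
can be sketched as follows. This is an illustration only: it assumes each
class is a one-dimensional normal distribution, and the function names and
the labels "A" and "B" are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def min_cost_label(x, classes):
    """Return the label with the smallest expected misclassification cost.

    classes maps label -> (mean, sd, prior, cost), where cost is the cost
    of misclassifying an observation that truly comes from that class
    (an assumed, simplified cost structure).
    """
    def expected_cost(label):
        # Cost is incurred whenever x really belongs to one of the
        # other classes; weight each by its prior and density at x.
        return sum(cost * prior * normal_pdf(x, mean, sd)
                   for other, (mean, sd, prior, cost) in classes.items()
                   if other != label)
    return min(classes, key=expected_cost)

# Equal priors and costs: the boundary sits midway between the means.
equal = {"A": (0.0, 1.0, 0.5, 1.0), "B": (2.0, 1.0, 0.5, 1.0)}
# A larger prior for A pushes the boundary towards B's mean.
skewed = {"A": (0.0, 1.0, 0.9, 1.0), "B": (2.0, 1.0, 0.1, 1.0)}
# A higher cost of missing B pulls the boundary back again.
costly = {"A": (0.0, 1.0, 0.9, 1.0), "B": (2.0, 1.0, 0.1, 9.0)}
```

Classifying the same point x = 1.5 under the three settings shows how the
prior and the costs move the boundary.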
Linear Discriminant Analysis
Based on the assumption that each class is
normally distributed
Classes are assumed to have the same variance
and different means
Based on these assumptions we can calculate the
optimal decision boundary
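A minimal sketch of this calculation for two classes in one dimension,
where the boundary is a single threshold. It assumes equal costs of
misclassification and that the two sample means differ; the function name
and arguments are hypothetical.

```python
import math
from statistics import mean

def lda_threshold(xs_a, xs_b, prior_a=0.5, prior_b=0.5):
    """Boundary between two normal classes with a common variance.

    The means are estimated from each sample and the common variance
    by pooling the two samples.
    """
    m_a, m_b = mean(xs_a), mean(xs_b)
    # Pooled within-class sum of squares and variance estimate.
    ss = (sum((x - m_a) ** 2 for x in xs_a)
          + sum((x - m_b) ** 2 for x in xs_b))
    var = ss / (len(xs_a) + len(xs_b) - 2)
    # Solve prior_a * f_a(x) = prior_b * f_b(x) for x.
    return (m_a + m_b) / 2 + var / (m_b - m_a) * math.log(prior_a / prior_b)

sample_a = [1.0, 2.0, 3.0]
sample_b = [5.0, 6.0, 7.0]
midpoint = lda_threshold(sample_a, sample_b)
shifted = lda_threshold(sample_a, sample_b, prior_a=0.8, prior_b=0.2)
```

With equal priors the threshold is the midpoint of the two means; a larger
prior for the first class moves the threshold towards the second class.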
data from 2 normal populations
LDA boundary
Linear Discriminant Analysis
In practice, it is unlikely the assumptions will be
met
We hope that the best solution to an approximate
problem will still be a good solution to our
original problem
It is very important to check the validity of the
assumptions made
Data (not normally distributed)
LDA decision boundary
Separating Hyperplanes
The boundary between two classes should
“separate the two classes and maximise the
distance to the closest point from either class”
This method works when we have linearly
separable data
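The criterion quoted above can be evaluated for any candidate hyperplane.
The sketch below only measures the margin of given candidates; it does not
solve the optimisation problem that finds the best hyperplane. The
function name and the +1 / -1 label coding are assumptions made for
illustration.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    labels are +1 or -1. A positive result means the hyperplane separates
    the two classes; a larger value means a wider gap to the closest point.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))

points = [(0.0, 1.0), (1.0, -1.0), (3.0, 2.0), (4.0, 0.0)]
labels = [-1, -1, +1, +1]

centred = margin((1.0, 0.0), -2.0, points, labels)    # line x1 = 2
off_centre = margin((1.0, 0.0), -1.5, points, labels) # line x1 = 1.5
failing = margin((1.0, 0.0), -5.0, points, labels)    # line x1 = 5
```

Both of the first two lines separate the classes, but the centred one has
the larger margin and so is preferred under this criterion.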
Separating Hyperplanes
The idea is to transform the data to a higher
dimensional space, compute the optimal
hyperplane then map this back to the original
space
Hence a linear method can produce non-linear
decision boundaries
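A toy illustration of the transformation idea, not an SVM implementation.
It assumes one-dimensional data in which one class lies near zero and the
other far from zero, so no single threshold on x separates them. The map
x -> (x, x^2), the labels "inner" and "outer", and the midpoint rule for
placing the boundary are all choices made for this sketch.

```python
def feature_map(x):
    """Map a 1-d observation into a 2-d feature space."""
    return (x, x * x)

def fit_threshold_on_square(inner, outer):
    """Find c so that the line z2 = c separates the mapped classes.

    c is placed halfway between the classes along the second coordinate.
    """
    max_inner = max(feature_map(x)[1] for x in inner)
    min_outer = min(feature_map(x)[1] for x in outer)
    if max_inner >= min_outer:
        raise ValueError("classes are not separable on x^2")
    return (max_inner + min_outer) / 2

def classify(x, c):
    """A linear rule in the mapped space, applied to the original x."""
    return "outer" if feature_map(x)[1] > c else "inner"

inner = [-1.0, -0.5, 0.0, 0.5, 1.0]
outer = [-3.0, -2.0, 2.0, 3.0]
c = fit_threshold_on_square(inner, outer)
```

In the original space the rule becomes |x| > sqrt(c), a boundary made of
two points rather than one, which a single linear threshold on x cannot
produce.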
SVM decision boundary
Nearest Neighbours
“Model Free” i.e. makes no distributional
assumptions
To classify an observation x, say, we consider
only the k points in our data set which are nearest
to x
We label x with the majority label among its k
nearest neighbours
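The rule just described can be sketched in a few lines. Euclidean distance
is assumed here as the meaning of "nearest", and the function name and
data layout are hypothetical.

```python
import math
from collections import Counter

def knn_classify(x, data, k):
    """Label x by majority vote among its k nearest neighbours.

    data is a list of (point, label) pairs; points are tuples of numbers.
    With two classes, an odd k avoids tied votes.
    """
    # Sort the labelled points by distance to x and keep the k closest.
    nearest = sorted(data, key=lambda pl: math.dist(x, pl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
```

For example, knn_classify((0.5, 0.5), data, 3) looks only at the three
closest labelled points, all of which carry the label "A".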
NN-1 decision boundary
NN-5 decision boundary
NN-10 decision boundary
Nearest Neighbours
We need to
define “nearest” i.e. decide on a distance measure
choose a value for k
Want to choose the value of k that will minimise
the expected error rate
Data Splitting
Estimate k by using data-splitting
training set : data used for fitting the model
test set : data used for testing the model, i.e.
estimating the expected error rate of the model
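A minimal sketch of choosing k by data splitting, assuming
one-dimensional observations, absolute difference as the distance, and a
split into training and test sets that has already been made. All names
and the small data set are made up for illustration.

```python
from collections import Counter

def knn_label(x, train, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pl: abs(x - pl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, test, k):
    """Proportion of test observations that are misclassified."""
    wrong = sum(knn_label(x, train, k) != label for x, label in test)
    return wrong / len(test)

def choose_k(train, test, candidates):
    """Pick the candidate k with the lowest error rate on the test set."""
    return min(candidates, key=lambda k: error_rate(train, test, k))

# Training set with one stray "B" observation inside the "A" region.
train = [(0.0, "A"), (1.0, "A"), (2.0, "A"), (3.0, "A"), (2.5, "B"),
         (5.0, "B"), (6.0, "B"), (7.0, "B"), (8.0, "B")]
test = [(2.4, "A"), (0.5, "A"), (6.5, "B"), (7.5, "B")]

best_k = choose_k(train, test, [1, 3])
```

Here k = 1 follows the stray training point and misclassifies the test
observation at 2.4, while k = 3 outvotes it, so the test set error rate
favours k = 3.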
Error rates on training and test sets
Understand the problem of classification
Understand how we can apply the principles of
statistical modelling to this problem