Classification
When is Temple Cowley not Temple Cowley?
Amber Tomas
PRS, Dept of Statistics
Classification Problem
● We have data known to have been generated from a number of different populations (classes)
● Our goal is to correctly classify data that arises in the future, i.e. allocate it to the correct class
Approach to a problem
● Consider the client's problem
● Formulate the problem
● Model the problem
● Calculate or compute the solution to this model
Modelling
● The model is generally a simplification of real life
● The art of modelling is in
  – capturing the important processes
  – ignoring trivial or irrelevant detail
● The solution to the model should be a good approximation to the solution of the original problem
Outline of Talk
● Describe the classification problem
● Formulate the problem
  – introduce relevant aspects of probability and distribution theory
● Describe 3 approaches to modelling
“Usual” scenario
● Usually we want to infer some property of a single population
● e.g. the mean height of 10-year-olds in Britain
Classification scenario
● Classification – we assume that there are at least 2 populations
● each observation has come from only one of these populations
Classification example
● Credit card transactions
  – record amount, time, place etc.
● All transactions can be put into one of two buckets (classes)
Classification cont.
● Ideally, we would like a system that correctly classifies every observation
● In reality, this is very often impossible
[figure slides: classification example]
Summary of Problem
● We have n > 1 populations (buckets)
● each population is given a label (valid or fraudulent)
● We have a list of labelled observations, i.e. some observations for which we know what bucket they are from
● Our goal is to design a system (classifier) that will label a new observation correctly as frequently as possible
Decision Boundaries
● Decision Boundary – a boundary in the feature space which divides the space into regions, such that an observation falling into a region will be classified with some label, different to that of neighbouring regions
[figure slides: decision boundary examples, showing the effect of prior probabilities and the effect of different costs of misclassification]
Optimal decision boundary
● If we know
  – the distribution of observations from each class
  – the cost of misclassification for each class, and
  – the prior probability of class membership
● then we can compute the optimal decision boundary, i.e. the boundary which minimises the expected cost of misclassification (one way to write this rule is sketched below)
● Generally we don't know this information
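
One standard way of writing this optimal rule (the notation here is mine, not from the slides): let $\pi_j$ be the prior probability of class $j$, $p_j(x)$ the density of observations from class $j$, and $C(k \mid j)$ the cost of labelling a class-$j$ observation as class $k$. The classifier that minimises the expected cost of misclassification assigns $x$ to

\[
  \hat{k}(x) = \arg\min_k \sum_j C(k \mid j)\, \pi_j\, p_j(x),
\]

and the optimal decision boundary is the set of points $x$ at which this minimum is attained by more than one class.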
Linear Discriminant Analysis
● Based on the assumption that each class is normally distributed
● Classes are assumed to have the same variance and different means
● Based on these assumptions we can calculate the optimal decision boundary (a code sketch follows the figure below)
[figure: data from 2 normal populations, with the LDA boundary]
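
A minimal sketch of this approach in Python with scikit-learn; the talk does not name an implementation, so the library choice and the simulated data are assumptions:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Two simulated normal populations with the same variance but
    # different means -- the LDA assumptions above.
    X = np.vstack([
        rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
        rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2)),
    ])
    y = np.array([0] * 100 + [1] * 100)

    # Fit LDA on the labelled data and classify a new observation.
    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.predict([[1.0, 1.0]]))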
Linear Discriminant Analysis
● In practice, it is unlikely the assumptions will be true
● We hope that the best solution to an approximate problem will still be a good solution to our original problem
● It is very important to check the validity of the assumptions made
[figure: data that are not normally distributed, with the resulting LDA decision boundary]
Separating Hyperplanes
● The boundary between two classes should “separate the two classes and maximise the distance to the closest point from either class”
● This method works when we have linearly separable data
[figure slides: separating hyperplanes]
Separating Hyperplanes
● The idea is to transform the data to a higher-dimensional space, compute the optimal hyperplane, then map this back to the original space
● Hence a linear method can produce non-linear boundaries (a code sketch follows the figures below)
[figures: SVM decision boundaries]
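
A minimal support vector machine sketch in Python with scikit-learn; the RBF kernel and the toy data set are my assumptions, since the talk does not specify them:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Two classes that are not linearly separable in the original space.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # The RBF kernel implicitly maps the data to a higher-dimensional
    # space, where the optimal separating hyperplane is computed; in the
    # original space the resulting decision boundary is non-linear.
    svm = SVC(kernel="rbf", C=1.0).fit(X, y)
    print(svm.predict([[0.5, 0.0]]))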
Nearest Neighbours
● “Model free”, i.e. makes no distributional assumptions
● To classify an observation x, say, we consider only the k points in our data set which are nearest to x
● We label x with the majority label among its k nearest neighbours (a code sketch follows the figures below)
[figure slides: nearest-neighbour decision boundaries for k = 1, 5 and 10]
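
A minimal k-nearest-neighbours sketch in Python with scikit-learn, assuming Euclidean distance as the measure of “nearest” (the library's default; the toy data set is again an assumption):

    from sklearn.datasets import make_moons
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # k controls how flexible the decision boundary is: k = 1 follows
    # the training data very closely, larger k gives smoother boundaries.
    for k in (1, 5, 10):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, knn.predict([[0.5, 0.0]]))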
Nearest Neighbours
● We need to
  – define “nearest”, i.e. decide on a distance measure
  – choose a value for k
● We want to choose the value of k that will minimise the expected error rate
Data Splitting
● Estimate k by using data splitting (a code sketch follows the figure below)
  – training set: data used for fitting the model
  – test set: data used for testing the model, i.e. estimating the expected error rate of the model
[figure: error rates on training and test sets]
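
A minimal data-splitting sketch in Python with scikit-learn, choosing k by its estimated error rate on a held-out test set (the split proportion and candidate k values are my assumptions):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

    # Training set for fitting; test set for estimating the error rate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)

    # Choose the k with the lowest estimated error rate on the test set.
    for k in (1, 5, 10, 20):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        error = 1 - knn.score(X_test, y_test)
        print(f"k={k}: estimated error rate {error:.3f}")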
Conclusions
● Understand the problem of classification
● Understand how we can apply the principles of statistical modelling to this problem