Uploaded by sophie.roginsky

Bias Variance

Generalization
 Central challenge in ML is that our algorithm must perform well on new, previously
unseen inputs, not just those on which our model was trained
 The ability to perform well on previously unobserved inputs is called generalization







Typically, when training an ML model, we have access to a training set
We can compute some error measure on the training set, called the training error, and
we reduce this training error
This is just an optimization problem
What separates ML from optimization is that we want the generalization error, also called
the test error, to be low as well
Generalization error is defined as the expected value of the error on a new input
We estimate the generalization error of an ML model by measuring its performance on a
test set of examples that were collected separately from the training set
For example, in linear regression, we train the model by minimizing the training error,
but we really care about the test error
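In symbols, the two error measures can be sketched as follows. The notation is an assumption made for illustration, since the notes do not fix one: f is the trained model, L is a per-example loss (squared error in the linear regression case), m is the number of training examples, and p_data is the distribution that new inputs are drawn from.

```latex
\begin{align*}
\text{training error} &= \frac{1}{m} \sum_{i=1}^{m} L\bigl(f(x^{(i)}), y^{(i)}\bigr) \\
\text{generalization error} &= \mathbb{E}_{(x, y) \sim p_{\text{data}}} \bigl[ L(f(x), y) \bigr]
\end{align*}
```

The expectation cannot be computed directly, which is why it is estimated by averaging the same loss over a separately collected test set.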
Data-generating distribution
 The training and test data are generated by a probability distribution over datasets
called the data-generating process
 We make a set of assumptions about this process, collectively called the iid assumptions
o 1. Examples in each dataset are independent of each other
o 2. The training and test sets are identically distributed, meaning they are drawn from
the same probability distribution
 The same distribution is used to generate every training example and every test example
 We call this shared underlying distribution the data-generating distribution
 An ideal model would know the true probability distribution that generates the data, but
even this model would still incur some error on many problems because there may be
noise in the distribution
Overfitting and Underfitting
 The factors that determine how well a ML algorithm will perform are its ability to:
1. Make the training error small
2. Make the gap between training and test error small
 These two factors correspond to two challenges in machine learning: underfitting and
overfitting
 Underfitting: the model is not able to obtain a low error on the training set
 Overfitting: the gap between training error and test error is too large
Capacity
 We can control whether a model is more likely to overfit or underfit by playing around
with its capacity
 A model’s capacity is its ability to fit a wide variety of functions
 Models with low capacity may struggle to fit a training set
 Models with high capacity can overfit by memorizing properties of the training set that
do not generalize well to the test set
 One way to control the capacity is by choosing a model’s hypothesis space – the set of
functions that the algorithm is allowed to select as being the solution
Linear Regression Example
 Linear regression algorithm has the set of all linear functions of its input as its
hypothesis space
 We can generalize linear regression to include polynomials in its hypothesis space
 This increases the model’s capacity
 If we introduce x^2 as another feature provided to the model, we can learn a model that is
quadratic as a function of x
 We can continue to add more powers of x as additional features to increase the
polynomial degree
 Even though the model implements a quadratic function of its input, the output is still a
linear function of the parameters, so we can still train the model in closed form
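The idea of adding powers of x as features can be sketched as follows. This is a minimal illustration, not the notes' own code: the helper names polynomial_features and fit_least_squares and the coefficients 1, 2, 3 are made up for the example. NumPy's lstsq is used in place of explicitly inverting the normal equations; it solves the same least-squares problem.

```python
import numpy as np

def polynomial_features(x, degree):
    """Stack the columns 1, x, x^2, ..., x^degree side by side."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, y, degree):
    """Return the weights that minimize squared training error for the given degree."""
    X = polynomial_features(x, degree)
    # The prediction X @ w is linear in w even when degree > 1,
    # so an ordinary linear least-squares solver still applies.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Noise-free quadratic data with made-up coefficients: y = 1 + 2x + 3x^2.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2
w = fit_least_squares(x, y, degree=2)
print(w)  # weights for the features 1, x, x^2
```

Raising degree only adds columns to the feature matrix; the fitting step itself does not change.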
Linear Regression Example
 We fit three models to an example training set whose underlying function is quadratic
 LEFT – a linear function fit to the data suffers from underfitting: it cannot capture the
curvature that is present in the data
 CENTER – a quadratic function fit to the data generalizes well to unseen points and does
not suffer from a significant amount of overfitting or underfitting
 RIGHT – a polynomial of degree 9 fit to the data suffers from overfitting: it passes through
all training points, but has a deep valley between two training points and a sharp increase
on the left side
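The three fits can be reproduced in spirit with a short experiment. This is a sketch under assumptions: the quadratic coefficients, noise level, and set sizes are invented, and the original figure's data is not available. With 10 training points, the degree-9 polynomial has enough parameters to pass through every training point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m points from a made-up noisy quadratic data-generating process."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.2, size=m)
    return x, y

def mse(w, x, y):
    """Mean squared error of the polynomial with weights w on the points (x, y)."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

x_train, y_train = sample(10)
x_test, y_test = sample(1000)   # same distribution, collected separately

results = {}
for degree in (1, 2, 9):
    X = np.vander(x_train, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    results[degree] = (mse(w, x_train, y_train), mse(w, x_test, y_test))
    print(degree, results[degree])  # (training error, test error)
```

Training error can only go down as the degree grows, because each hypothesis space contains the previous one. Test error is what separates the quadratic fit from the other two.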



In general, ML algorithms perform best when their capacity is appropriate for the true
complexity of the task
Models with insufficient capacity are unable to solve complex tasks
Models with high capacity can solve complex tasks, but when their capacity is higher
than needed, they may overfit
Relationship between capacity and error
 Although simpler functions are more likely to generalize, we still need to choose a
sufficiently complex hypothesis to achieve low training error
 Typically, as you increase model capacity, the training error decreases until it reaches a
minimum possible error value




Generalization error has a U-shape curve
At the left end of the graph, training error and generalization error are both high – this is
the underfitting zone
As we increase capacity, training error decreases, but the gap between training and
generalization error increases
Eventually, the size of the gap outweighs the decrease in training error and this is the
overfitting zone where the capacity is too large
No Free Lunch Theorem
 Averaged over all possible data-generating distributions, every classification algorithm
has the same error rate when classifying previously unobserved points
 In other words, no ML algorithm is universally better than any other
 The most sophisticated algorithm has the same average performance (over all possible
tasks) as merely predicting that every point belongs to the same class




Luckily, these results hold only when we average over all possible data-generating
distributions
If we make assumptions about the kinds of probability distributions we encounter, we
can design learning algorithms that perform well on these distributions
The goal of ML is not to seek a universal learning algorithm
Our goal is to understand what kinds of distributions are relevant to the “real world”,
and what kinds of ML algorithms perform well on data drawn from those data-generating
distributions
Regularization
 No free lunch theorem implies that we must design our ML algorithms to perform well
on a specific task
 We do so by building a set of preferences into the learning algorithm
 With the linear regression example, we discussed increasing the model’s capacity by
adding functions to the hypothesis space of the solutions the learning algorithm is able
to choose from
 We can also give a learning algorithm a preference for one solution over another in its
hypothesis space, which is another way of controlling the model’s capacity
 We can regularize a model that learns a function by adding a penalty to the cost
function
 Regularization is any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error
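One concrete penalty is the squared size of the weights, often called weight decay or ridge regression. The notes do not name a specific penalty, so this choice is an assumption made for illustration, as are the helper name fit_ridge, the strength parameter lam, and the synthetic data. A larger lam expresses a stronger preference for small weights.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam I) w = X^T y.
    # For simplicity this sketch also penalizes the constant (bias) weight.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.2, size=10)
X = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

w_small = fit_ridge(X, y, lam=1e-3)     # weak preference for small weights
w_large = fit_ridge(X, y, lam=1.0)      # strong preference for small weights
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The hypothesis space is the same degree-9 family in both fits. Only the preference among its members changes, and the stronger penalty accepts a higher training error in exchange for smaller weights.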
Point estimation
 Statistics offers useful concepts such as parameter estimation, bias and variance to
formally characterize generalization, underfitting and overfitting
 Point estimation is the attempt to find a single “best” prediction of some quantity of interest






This can be a single parameter or a vector of parameters in some parametric model,
such as the weights in our linear regression example
If the true value of our parameter is theta, we denote a point estimate as theta hat
Given a set of data points, a point estimate is any function g of the data
This definition is very general and does not even require that g return a value close to
the true theta
A good estimator is a function whose output is close to the true underlying theta that
generated the data
Frequentist perspective – we assume the true parameter theta is fixed but unknown,
while the point estimate theta hat is a function of the data
Bias
The bias of an estimator is the difference between this estimator's expected value and
the true value of the parameter being estimated
An estimator is unbiased if its expected value is equal to the true value of the parameter
Whiteboard example: Gaussian distribution estimator of the mean
- a common estimator of the Gaussian mean parameter is the sample mean
- mu hat = (1/m) * (x_1 + ... + x_m), the average of the m observed examples
To determine the bias of the sample mean, we are interested in calculating its
expectation
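The whiteboard calculation is not preserved in these notes; a standard version of it can be sketched as follows. It assumes m iid examples x^(1), ..., x^(m) drawn from a Gaussian with mean mu, and writes the sample mean as mu hat with a subscript m. The notation is chosen here for illustration.

```latex
\begin{align*}
\hat{\mu}_m &= \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \\
\operatorname{bias}(\hat{\mu}_m) &= \mathbb{E}[\hat{\mu}_m] - \mu \\
&= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\bigl[x^{(i)}\bigr] - \mu \\
&= \frac{1}{m} \cdot m\mu - \mu = 0
\end{align*}
```

The bias is zero, so the sample mean is an unbiased estimator of the Gaussian mean.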
Variance
- expected value of the squared sampling deviations
- indicates how far, on average, the collection of estimates is from the expected value of the
estimates
- Var(theta hat) = E[(theta hat - E[theta hat])^2]
Variance of an estimator is a measure of how we would expect the estimate to vary as
we independently resample the dataset from the underlying data-generating process
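The resampling view of variance can be simulated directly. This is a sketch with made-up values for the Gaussian parameters (mu, sigma), the dataset size m, and the number of resampled datasets. It compares the spread of the sample mean across datasets with the textbook value sigma^2 / m.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m = 2.0, 3.0, 25     # made-up Gaussian parameters and dataset size
n_datasets = 20000

# Each row is one independently resampled dataset of m examples.
datasets = rng.normal(mu, sigma, size=(n_datasets, m))
estimates = datasets.mean(axis=1)        # one sample-mean estimate per dataset

empirical_variance = estimates.var()     # spread of the estimates across datasets
theoretical_variance = sigma**2 / m      # variance of the mean of m iid draws
print(empirical_variance, theoretical_variance)
```

The two printed numbers should be close, and both shrink as m grows: larger datasets give estimates that vary less from one resample to the next.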
Mean squared error of estimator
- Bias and variance measure two sources of error in an estimator
- MSE measures the overall expected deviation between the estimator and true value of
parameter theta
- Evaluating MSE incorporates both bias and variance
- A good estimator has a small MSE; such estimators manage to keep both bias
and variance in check
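The way MSE combines the two sources of error can be made explicit with a short derivation. It uses the standard definitions, with theta hat the estimator and theta the fixed true parameter. The cross term in the second line vanishes because the expected deviation of theta hat from its own mean is zero.

```latex
\begin{align*}
\operatorname{MSE}(\hat{\theta})
&= \mathbb{E}\bigl[(\hat{\theta} - \theta)^2\bigr] \\
&= \mathbb{E}\bigl[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\bigr]
 + 2\,\bigl(\mathbb{E}[\hat{\theta}] - \theta\bigr)\,\mathbb{E}\bigl[\hat{\theta} - \mathbb{E}[\hat{\theta}]\bigr]
 + \bigl(\mathbb{E}[\hat{\theta}] - \theta\bigr)^2 \\
&= \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2
\end{align*}
```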
Bias variance tradeoff
- As capacity increases, bias tends to decrease and variance tends to increase
- U-shaped curve of the generalization error as a function of capacity
- You want to balance the bias and variance of your model, choosing the capacity at which
the test error (in practice, the validation error) is lowest rather than the capacity at which
the training error is lowest
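The tradeoff can be estimated empirically by refitting models of different capacity on many resampled training sets. This is a sketch under assumptions: the quadratic ground truth true_f, the noise level, the fixed training inputs, and the degrees 1, 2, 9 are all chosen for illustration. Squared bias measures how far the average prediction is from the truth; variance measures how much predictions move between training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Made-up quadratic ground truth, standing in for the notes' example."""
    return 1.0 - 2.0 * x + 3.0 * x**2

x_train = np.linspace(-1.0, 1.0, 15)
x_eval = np.linspace(-0.9, 0.9, 50)
n_datasets, noise_std = 500, 0.2

# Resample the training targets many times; the same datasets are reused for every degree.
Y = true_f(x_train) + rng.normal(0.0, noise_std, size=(n_datasets, x_train.size))

bias_sq, variance = {}, {}
for degree in (1, 2, 9):
    X_train = np.vander(x_train, degree + 1, increasing=True)
    X_eval = np.vander(x_eval, degree + 1, increasing=True)
    # Solve all least-squares problems at once: one column of W per dataset.
    W, *_ = np.linalg.lstsq(X_train, Y.T, rcond=None)
    preds = (X_eval @ W).T                       # shape (n_datasets, n_eval)
    bias_sq[degree] = float(np.mean((preds.mean(axis=0) - true_f(x_eval)) ** 2))
    variance[degree] = float(np.mean(preds.var(axis=0)))
    print(degree, bias_sq[degree], variance[degree])
```

In this setup the linear model is dominated by bias, the degree-9 model by variance, and the quadratic model has the smallest sum of the two.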