Generalization

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs, not just those on which the model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set. We can compute some error measure on the training set, called the training error, and we reduce this training error. By itself, this is just an optimization problem. What separates machine learning from pure optimization is that we also want the generalization error, also called the test error, to be low. The generalization error is defined as the expected value of the error on a new input. We estimate the generalization error of a model by measuring its performance on a test set of examples collected separately from the training set. For example, in linear regression we train the model by minimizing the training error, but what we really care about is the test error.

Data-generating distribution

The training and test data are generated by a probability distribution over datasets called the data-generating process. We make some assumptions about this process, collectively called the i.i.d. assumptions:
- 1. The examples in each dataset are independent of each other.
- 2. The training set and test set are identically distributed, meaning they are drawn from the same probability distribution.
The same distribution is used to generate every training example and every test example. We call this shared underlying distribution the data-generating distribution.

Ideal model

The ideal model would know the true probability distribution that generates the data. But even this model would incur error on many problems, because there may still be some noise in the distribution.
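As a minimal sketch of this train/test idea (the numpy usage and the synthetic data-generating distribution below are assumptions for illustration, not part of the notes): we draw a training set and a test set independently from the same distribution, minimize training error in closed form, and use test-set error as an estimate of generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(m):
    """Draw m i.i.d. examples from a synthetic data-generating
    distribution: y = 2x + 1 plus Gaussian noise (an assumption
    made up for this example)."""
    x = rng.uniform(-1, 1, size=m)
    y = 2 * x + 1 + 0.1 * rng.normal(size=m)
    return x, y

# Training and test sets are drawn independently from the same
# data-generating distribution (the i.i.d. assumptions).
x_train, y_train = sample_dataset(50)
x_test, y_test = sample_dataset(1000)

# Design matrix with a bias column; solve least squares,
# i.e. minimize the training error.
X_train = np.column_stack([x_train, np.ones_like(x_train)])
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def mse(x, y, w):
    X = np.column_stack([x, np.ones_like(x)])
    return np.mean((X @ w - y) ** 2)

print("training error:", mse(x_train, y_train, w))
print("test error (estimate of generalization error):", mse(x_test, y_test, w))
```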
Overfitting and Underfitting

The factors that determine how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.
- Underfitting: the model is not able to obtain a low error on the training set.
- Overfitting: the gap between the training error and the test error is too large.

Capacity

We can control whether a model is more likely to overfit or underfit by adjusting its capacity. A model's capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not generalize well to the test set. One way to control the capacity of a learning algorithm is by choosing its hypothesis space: the set of functions that the algorithm is allowed to select as the solution.

Linear Regression Example

The linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials in its hypothesis space, which increases the model's capacity. By introducing x^2 as another feature provided to the model, we can learn a model that is quadratic as a function of x, and we can continue to add more powers of x as additional features to increase the polynomial degree. Even though such a model implements a quadratic (or higher-degree) function of its input, the output is still a linear function of the parameters, so we can still train the model in closed form.

We fit three models to an example training set where the underlying distribution is quadratic:
- LEFT: a linear function fit to the data suffers from underfitting; it cannot capture the curvature that is present in the data.
- CENTER: a quadratic function fit to the data generalizes well to unseen points; it does not suffer from a significant amount of overfitting or underfitting.
- RIGHT: a polynomial of degree 9 fit to the data suffers from overfitting; it passes through every training point, but has a deep valley between two training points and a sharp increase on the left side that the data does not justify.

In general, machine learning algorithms will perform best when their capacity is appropriate for the true complexity of the task. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed, they may overfit.

Relationship between capacity and error

Although simpler functions are more likely to generalize, we still need to choose a sufficiently complex hypothesis to achieve low training error. Typically, as model capacity increases, the training error decreases until it approaches the minimum possible error value, while the generalization error follows a U-shaped curve. At the left end of the curve, training error and generalization error are both high: this is the underfitting zone. As we increase capacity, the training error decreases, but the gap between training and generalization error grows. Eventually, the size of the gap outweighs the decrease in training error, and we enter the overfitting zone, where the capacity is too large.
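The following sketch makes this concrete (again, the numpy usage and the synthetic quadratic data are assumptions for illustration). It fits polynomials of degree 1, 2, and 9 in closed form, mirroring the LEFT/CENTER/RIGHT figure, and prints training and test error; the training error typically falls as the degree grows while the test error traces the U-shaped curve.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(m):
    """Synthetic data whose underlying distribution is quadratic
    (coefficients invented for this example)."""
    x = rng.uniform(-1, 1, size=m)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.2 * rng.normal(size=m)
    return x, y

def design(x, degree):
    """Powers of x as features: capacity grows with the degree,
    but the model stays linear in its parameters."""
    return np.column_stack([x**k for k in range(degree + 1)])

x_train, y_train = sample_dataset(10)   # small training set
x_test, y_test = sample_dataset(1000)

for degree in [1, 2, 9]:
    X = design(x_train, degree)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)  # closed-form fit
    train_err = np.mean((X @ w - y_train) ** 2)
    test_err = np.mean((design(x_test, degree) @ w - y_test) ** 2)
    print(f"degree {degree}: train={train_err:.4f}  test={test_err:.4f}")
```

With 10 training points, the degree-9 polynomial has as many parameters as examples, so it can interpolate the training set (near-zero training error) while its test error blows up: the overfitting zone in miniature.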
No Free Lunch Theorem

Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, no machine learning algorithm is universally better than any other. The most sophisticated algorithm has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. Fortunately, these results hold only when we average over all possible data-generating distributions. If we make assumptions about the kinds of probability distributions we encounter, we can design learning algorithms that perform well on those distributions. The goal of machine learning is therefore not to seek a universal learning algorithm; our goal is to understand what kinds of distributions are relevant to the "real world", and what kinds of machine learning algorithms perform well on data drawn from those data-generating distributions.

Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. With the linear regression example, we discussed increasing the model's capacity by adding functions to the hypothesis space of solutions the learning algorithm is able to choose from. We can also give a learning algorithm a preference for one solution over another within its hypothesis space, which is another way of controlling capacity. In particular, we can regularize a model that learns a function by adding a penalty to the cost function (see the weight decay sketch at the end of these notes). Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Point estimation

Statistics offers us concepts such as parameter estimation, bias, and variance to formally characterize generalization, underfitting, and overfitting. Point estimation is the attempt to find a single "best" prediction of some quantity of interest. This can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example. If the true value of the parameter is theta, we denote a point estimate as theta hat. Given a set of data points, a point estimate is any function of the data, theta_hat = g(x_1, ..., x_m). This definition is very general and does not even require that g return a value close to the true theta; a good estimator is a function whose output is close to the true underlying theta that generated the data. We take the frequentist perspective: the true parameter theta is fixed but unknown, while the point estimate theta hat is a function of the data, and hence a random variable.

Bias

The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated: bias(theta_hat) = E[theta_hat] - theta. An estimator is unbiased when its expected value is equal to the true value of the parameter.

Whiteboard example: Gaussian distribution, estimator of the mean
- A common estimator of the Gaussian mean parameter mu is the sample mean, mu_hat = (1/m) * sum_i x_i.
- To determine the bias of the sample mean, we are interested in calculating its expectation: E[mu_hat] = (1/m) * sum_i E[x_i] = (1/m) * m * mu = mu, so bias(mu_hat) = 0 and the sample mean is unbiased.

Variance
- The variance of an estimator is the expected value of the squared sampling deviations: Var(theta_hat) = E[(theta_hat - E[theta_hat])^2].
- It indicates how far, on average, the collection of estimates is from the expected value of the estimates.
- The variance of an estimator is a measure of how we would expect the estimate to vary as we independently resample the dataset from the underlying data-generating process.

Mean squared error of an estimator
- Bias and variance measure two different sources of error in an estimator.
- The MSE measures the overall expected squared deviation between the estimator and the true value of the parameter theta: MSE = E[(theta_hat - theta)^2].
- Evaluating the MSE incorporates both bias and variance: MSE = bias(theta_hat)^2 + Var(theta_hat).
- A good estimator has a small MSE; these are estimators that manage to keep both bias and variance in check.

Bias-variance tradeoff
- As capacity increases, bias tends to decrease and variance tends to increase, which again yields the U-shaped curve of generalization error as a function of capacity.
- We want to balance the bias and variance of the model at the point where the test error (validation error) reaches its minimum.
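To connect these definitions to computation, here is a minimal Monte Carlo sketch (numpy and the Gaussian parameters below are assumptions invented for the illustration). It repeatedly resamples a dataset from the data-generating process, computes the sample mean on each, and empirically checks that the bias is near zero and that MSE = bias^2 + variance.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 3.0, 2.0   # true (but normally unknown) parameters
m = 20                 # dataset size
trials = 100_000       # independently resampled datasets

# One sample-mean estimate mu_hat per resampled dataset.
estimates = rng.normal(mu, sigma, size=(trials, m)).mean(axis=1)

bias = estimates.mean() - mu           # E[mu_hat] - mu, expected to be ~0
variance = estimates.var()             # E[(mu_hat - E[mu_hat])^2]
mse = np.mean((estimates - mu) ** 2)   # E[(mu_hat - mu)^2]

print(f"bias     = {bias:.5f}  (sample mean is unbiased)")
print(f"variance = {variance:.5f}  (theory: sigma^2/m = {sigma**2 / m:.5f})")
print(f"MSE      = {mse:.5f}  vs bias^2 + variance = {bias**2 + variance:.5f}")
```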
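Finally, returning to the regularization idea above: one standard way of adding a penalty to the cost function is weight decay (ridge regression), which minimizes training MSE plus lam * w^T w. The sketch below is illustrative only; the synthetic data and the closed-form solve are assumptions for the example, not a prescription from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)

def design(x, degree):
    return np.column_stack([x**k for k in range(degree + 1)])

# Synthetic quadratic data, as in the earlier capacity example.
x_train = rng.uniform(-1, 1, size=10)
y_train = 1.0 - 2.0 * x_train + 3.0 * x_train**2 + 0.2 * rng.normal(size=10)

degree = 9                      # high-capacity hypothesis space
X = design(x_train, degree)

for lam in [0.0, 1e-3, 1.0]:
    # Weight decay: minimize ||Xw - y||^2 + lam * ||w||^2, whose
    # closed-form solution is w = (X^T X + lam * I)^(-1) X^T y.
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y_train)
    print(f"lambda={lam:g}: ||w|| = {np.linalg.norm(w):.2f}")

# A larger lambda expresses a stronger preference for small weights,
# reducing effective capacity without shrinking the hypothesis space.
```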