Corentin Cossettini
Spring semester 2022
Université de Neuchâtel
Bachelor in Economics, 2nd year
Statistical Learning

Table of Contents
1 Introduction
  1.1 Types of Data
2 Review of Probability
  2.1 Random Variables and Probability Distributions
  2.2 Expected Values, Mean and Variance
  2.3 Two Random Variables
  2.4 Different Types of Distributions
  2.5 Random Sampling, Distribution of Ȳ and Estimation
3 Review of Statistics
  3.1 Hypothesis Tests
    3.1.1 Concerning the Population Mean
  3.2 Confidence Intervals
    3.2.1 Concerning the Population Mean
  3.3 Scatterplots, the Sample Covariance and the Sample Correlation
4 Linear Regression with One Regressor
  4.1 Linear Regression Model with a Single Regressor (population)
  4.2 Estimating the Coefficients of the Linear Regression Model
  4.3 Measures of Fit and Prediction Accuracy
  4.4 The Least Squares Assumptions (hypotheses)
  4.5 The Sampling Distribution of the OLS Estimators
5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals
  5.1 Testing Hypotheses About One of the Regression Coefficients
  5.2 Confidence Intervals for a Regression Coefficient
  5.3 Regression When X Is a Binary Variable
  5.4 Heteroskedasticity and Homoskedasticity
    5.4.1 Mathematical Implications of Homoskedasticity
  5.5 The Theoretical Foundations of OLS
  5.6 Using the t-Statistic in Regression When the Sample Size Is Small
  5.7 Summary and Assessment
6 Introduction to Data Mining
  6.1 Data Mining - Concepts
  6.2 Classification - Basic Concepts
  6.3 Rule-Based Classifier: Basic Concepts
  6.4 Nearest Neighbors Classifier
    6.4.1 Instance Based Classifiers
    6.4.2 Basic Idea and Concept
    6.4.3 Nearest Neighbor Classification
    6.4.4 Precisions
    6.4.5 Example: PEBLS (MVDM)
    6.4.6 Proximity Measures
  6.5 Naïve Bayesian Classifier
    6.5.1 Bayes Classifier
    6.5.2 Example of Bayes Theorem
    6.5.3 Bayesian Classifiers
    6.5.4 Naïve Bayes Classifier
    6.5.5 Estimate Probabilities from Data
    6.5.6 Examples of Naïve Bayes Classifier
    6.5.7 Solution to the Problem of a Count of an Attribute = 0
    6.5.8 Summary
  6.6 Decision Tree Classifier
    6.6.1 Example
    6.6.2 Advantages of a Decision Tree Based Classification
    6.6.3 General Process to Create a Model
    6.6.4 Hunt's Algorithm: General Structure
    6.6.5 Tree Induction
    6.6.6 Measures of Impurity
    6.6.7 Practical Issues of Classification
  6.7 Model Evaluation
    6.7.1 Metrics for Performance Evaluation
    6.7.2 Methods for Performance Evaluation
    6.7.3 Test of Significance (Confidence Intervals)
    6.7.4 Comparing Performance of 2 Algorithms

General information
- Textbook: Introduction to Econometrics, Stock & Watson, chapters 2-7
- Read and understand the material at home before the lesson
- Sum up the slides and add material from the book
- Work on the exercises/notebooks before the class and finish them in class
- Sessions are organized in two parts:
  1. Explanations of the chapter
  2. Solving the notebooks
- Work on and sum up the material before class: textbook + theory (repetition)

Part I

1 Introduction
Econometrics is the science and art of using economic theory and statistical techniques to analyze economic data.

1.1 Types of Data
- Cross-sectional data: multiple entities observed at a single time period
- Time series data: a single entity observed at multiple time periods
- Panel data: multiple entities observed at two or more time periods

2 Review of Probability

2.1 Random Variables and Probability Distributions
Population
- Collection of all possible entities of interest
- We will consider populations as infinitely large
Population distribution of Y
- Probability of the different values of Y that occur in the population
Random variable Y
- Numerical summary of a random outcome
- 2 types:
  - Discrete: takes a discrete set of values (0, 1, 2, ...)
  - Continuous: takes a continuum of possible values
Probability Distribution of a Discrete Random Variable
- List of all possible values of the variable and the probability that each value will occur.
- Probability of events: the probability that an event occurs comes from the probability distribution; it looks like a histogram because the variable is discrete.
- Cumulative probability distribution F(y): probability that the random variable is less than or equal to a particular value.
- Bernoulli distribution: the discrete random variable is binary, so the outcome is 0 or 1.
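To make the discrete case concrete, here is a minimal Python sketch (illustrative numbers, not from the course) that tabulates a small probability distribution, its cumulative distribution, and checks the Bernoulli mean and variance formulas E(G) = p and var(G) = p(1 - p):

```python
import numpy as np

# A discrete random variable Y with made-up outcomes and probabilities
values = np.array([0, 1, 2, 3])          # possible values of Y
probs  = np.array([0.1, 0.3, 0.4, 0.2])  # Pr(Y = y), must sum to 1

cdf = np.cumsum(probs)                   # cumulative distribution F(y) = Pr(Y <= y)
print("Pr(Y <= 2) =", cdf[2])            # 0.8

# Bernoulli(p): binary outcome
p = 0.78
mean_bernoulli = 1 * p + 0 * (1 - p)                       # E(G) = p
var_bernoulli  = (1 - p)**2 * p + (0 - p)**2 * (1 - p)     # var(G) = p(1 - p)
print(mean_bernoulli, var_bernoulli)
```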
Probability Distribution of a Continuous Random Variable ๏ท Cumulative Probability Distribution: probability that the random variable is less than or equal to a particular value ๏ท Probability Density Function: the area under the probability density function between any two points is the probability that the random variable falls between those points. 3 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année 2.2 Expected Values, Mean and Variance Expected Value of a Random Variable ๏ท Expected Value: denoted ๐ธ(๐), represents the long-run average value of the random variable over many repeated trials or occurrences. ๏ท For a discrete random variable, it’s computed as a weighted average of the possible outcomes of that random variable, where the weights are the probabilities of that outcome. ๏ท For a Bernoulli random variable: ๐ธ(๐บ) = ๐ Expected Value for a continuous Random Variable ๏ท It’s the probability-weighted average of the possible outcomes of the random variable. Standard deviation and Variance ๏ท The Variance of the discrete random variable ๐ is: ๐ ๐๐2 ๏ท ๏ท = ๐ฃ๐๐(๐) = ๐ธ[(๐ − ๐๐ )2 ] = ∑(๐ฆ๐ − ๐ข๐ )2 ๐๐ ๐=1 For a Bernoulli random variable: ๐ฃ๐๐(๐บ) = ๐(1 − ๐) The Standard Deviation of the discrete random variable ๐ is: ๐๐ = √๐๐2 Other Measures of the Shape of a Distribution Mean and sd are important measures for a distribution. Two others exist: Skewness Kurtosis ๐ธ[(๐ − ๐๐ )3 ] ๐ธ[(๐ − ๐๐ )4 ] ๐๐๐๐ค๐๐๐ ๐ = ๐พ๐ข๐๐ก๐๐ ๐๐ = ๐๐4 ๐๐3 Measures the lack of symmetry of a Measures how thick and heavy are the tails distribution. of a distribution. Changes the weight of the observations: the smaller disappears and the bigger gets extremely big. Symmetric/normal Distribution: ๐๐๐๐ค๐๐๐ ๐ = 0 Symmetric/normal Distribution: ๐พ๐ข๐๐ก๐๐ ๐๐ = 3 Distribution w/ long right tail: ๐๐๐๐ค๐๐๐ ๐ > 0 Distribution heavytailed: ๐พ๐ข๐๐ก๐๐ ๐๐ > 3 Distribution w/ long left tail: ๐๐๐๐ค๐๐๐ ๐ < 0 ๐พ๐ข๐๐ก๐๐ ๐๐ ≥ 0 The greater the kurtosis, the more likely are outliers. 4 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 2.3 Two Random Variables Joint and Marginal Distributions ๏ท Random variables X and Z have a joint distribution. It’s the probability that the random variables simultaneously take on certain values. ๏ท The marginal probability distribution of a random variable Y is the same as the probability distribution. Conditional distribution ๏ท Conditional distribution: it’s the distribution of a random variable Y conditional on another random variable X taking on a specific value. It gives the conditional probability that Y takes on the value y when X takes the value x. Pr(๐=๐ฅ,๐=๐ฆ) ๏ท Pr(๐ = ๐ฆ|๐ = ๐ฅ) = Pr(๐=๐ฅ) Conditional expectation/conditional mean ๏ท It’s the mean of the conditional distribution of Y given X. It’s the expected value of Y, computed using the conditional distribution of Y given X ๏ท ๐ธ(๐|๐ = ๐ฅ) = ∑๐๐=1 ๐ฆ๐ Pr(๐ = ๐ฆ|๐ = ๐ฅ) Conditional variance ๏ท It’s the variance of the conditional distribution of Y given X ๏ท ๐๐๐(๐|๐ = ๐ฅ) = ∑๐๐=1[๐ฆ๐ − ๐ธ(๐|๐ = ๐ฅ)]2 Pr(๐ = ๐ฆ๐ |๐ = ๐ฅ) Bayes rule ๏ท Says that the conditional probability of Y given X is the conditional probability of X given Y times the relative marginal probabilities of Y and X Pr(๐ = ๐ฅ |๐ = ๐ฆ)Pr(๐=๐ฆ) ๏ท Pr(๐ = ๐ฆ|๐ = ๐ฅ) = Pr(๐=๐ฅ) Independence Two random variables X and Y are independently distributed (independent) if: 1) Knowing the value of one of them provides no information about the other. 
2) The conditional distribution of Y given X equals the marginal distribution of Y 3) Pr(๐ = ๐ฆ|๐ = ๐ฅ) = Pr(๐ = ๐ฆ) Covariance ๏ท Measures the dependance of two random variables (how they move together) ๏ท Covariance between X and Y is the expected value of ๐ธ[(๐ − ๐๐ )(๐ − ๐๐ )], where ๐๐ is the mean of X and ๐๐ is the mean of Y. ๏ท ๐ถ๐๐ฃ(๐, ๐) = ๐ธ[(๐ − ๐๐ )(๐ − ๐๐ )] = ๐๐๐ ๏ท Measure of the linear association of X and Y: its units are (units of X)x (units of Y). ๏ท ๐ถ๐๐ฃ(๐, ๐) > 0 means a positive relation between X and Y (vice-versa) ๏ท If X and Y are independent: ๐ถ๐๐ฃ(๐, ๐) = 0 ๏ท ๐ถ๐๐ฃ(๐, ๐) = ๐ธ[(๐ − ๐๐ )(๐ − ๐๐ )] = ๐๐2 Correlation ๏ท Covariance does have units and it can be a problem. Correlation solves this problem. ๏ท Measure of dependance between X and Y ๐๐๐ฃ(๐,๐) ๐ ๏ท ๐๐๐๐(๐, ๐) = √(๐ฃ๐๐(๐)๐ฃ๐๐(๐) = ๐ ๐๐ ๐ ๏ท ๏ท ๐ฅ ๐ If ๐๐๐๐(๐, ๐) = 0: X and Y are uncorrelated/independant Correlation is always between -1 and 1 5 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 2.4 Different Types of Distributions The Normal Distribution ๏ท Normal Distribution with mean ๐ and variance ๐ 2 : ๐(๐, ๐ 2 ) ๏ท Standard Normal Distribution: ๐(0,1) ๏ท Standardize Y by computing ๐ = (๐ − ๐)/๐ Other types: Chi-Squared, Student t and F Distributions ฬ and Estimation 2.5 Random Sampling, Distribution of ๐ If ๐1 , … , ๐๐ are i.i.d and ๐ฬ is the estimator of the mean of the population (๐๐ ) ฬ Sampling Distribution of ๐ ฬ ๏ท The properties of ๐ are determined by its sampling distribution ๏ท The individuals in the sample are drawn randomly ๏ the values of (๐1 , . . . , ๐๐ ) are random ๏ functions of (๐1 , . . . , ๐๐ ), such as ๐ฬ , are random (had a different sample been drawn, they would have taken on a different value) ๏ท Sampling distribution of ๐ฬ = the distribution of ๐ฬ over different possible samples of size n ๏ท The mean and variance of ๐ฬ are the mean and variance of its sampling distribution: o ๐ธ(๐ฬ ) = ๐๐ ๐2 o ๐ฃ๐๐(๐ฬ ) = ๐ = ๐ธ[๐ฬ − ๐ธ(๐ฬ )]2 ๐ ๏ท ๏ท ๏ท As results of these two: ๏ท ๐ฬ ๏ unbiased estimator of ๐๐ ๏ท ๐ฃ๐๐(๐ฬ ) ๏ inversely proportional to ๐ The concept of the sampling distribution underpins all of econometrics. When ๐ is large: If ๐ is small ๏ complicated If ๐ is large ๏ simple As ๐ increases, distribution of ๐ฬ becomes more tightly centered around ๐๐ Distribution of ๐ฬ − ๐๐ becomes normal (CLT) Example of the sampling distribution: ๏ท Mean of ๐ฬ : If ๐ธ(๐ฬ ) = ๐ก๐๐ข๐ = ๐ = 0.78 ๏ ๐ฬ is an unbiased estimator of ๐ ๏ท If ๐ฬ becomes close to ๐ when ๐ large: Law of Large Number ๏ ๐ฬ is a consistent estimator of ๐ The Law of Large Numbers ๏ท Estimator ๏ Consistent if the Pr that it falls within an interval of the true pop value tends to 1 as the sample size increases ๏ท If (๐1 , … , ๐๐ ) are i.i.d. and ๐๐2 < ∞, then ๐ฬ is a consistent estimator of ๐๐ฆ , that is: Pr[|๐ฬ − ๐๐ | < ๐] → 1 ๐๐ ๐ → ∞, ๐ is an infinitesimal change, so basically zero ๐ ๐ฬ → ๐๐ (= lim ๐ฬ = ๐๐ฆ ) ๐→∞ 6 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 Central Limit Theorem ๏ท If (๐1 , … , ๐๐ ) are i.i.d. and ๐๐2 < ∞, then when ๐ is large, the distribution of ๐ฬ is well approximated by a normal distribution. 
๐2 ๏ท ๐ฬ ~๐ (๐๐ , ๐ ) ๐ ๐ฬ −๐ธ(๐ฬ ) √๐ฃ๐๐(๐ฬ ) ๐ฬ −๐๐ ๐ /√๐ ๏ท Standardized, ๐ฬ = ๏ท The larger the n, the better the approximation =๐ is approximately ๐(0, 1) 3 Review of Statistics Estimators and Estimates Estimator: function of a sample of data to be drawn randomly from a population. Random variable Estimate: numerical value of the estimator when it is actually computed using data from a specific sample. Nonrandom number Bias, Consistency and Efficiency ๐ฬ is an estimator of ๐๐ ๏ท Bias: for ๐ฬ it’s ๐ธ(๐ฬ ) − ๐๐ ๏ the difference between them! o ๐ฬ is unbiased ๏ ๐ธ(๐ฬ ) = ๐๐ ๏ท Consistency: when the sample size is large, the uncertainty of ๐๐ arising from random variations in the sample is very small. ๐ o ๐ฬ is consistent ๏ ๐ฬ → ๐๐ ๏ท Efficiency: if an estimator has a smaller variance than another one, he’s more efficient. o ๐ฬ is efficient ๏ ๐ฃ๐๐(๐ฬ ) < ๐ฃ๐๐(๐ฬ) ฬ ๏ท If ๐ is unbiased, ๐ฬ is the Best Linear Unbiased Estimator (BLUE) and the most efficient. 3.1 Hypothesis Tests Method to choose between 2 hypotheses, in an uncertain context. 3.1.1 Concerning the Population Mean Null hypothesis: hypothesis to be tested ๐ป0 : ๐ธ(๐) = ๐๐,0 Two-sided alternative hypothesis: holds if the null hypothesis does not ๐ป1 : ๐ธ(๐) ≠ ๐๐,0 ๏ We use the evidence to decide whether to reject ๐ป0 or failing to do so (and accept ๐ป1 ) p-value Probability of drawing a statistic at least as adverse to the null as the value actually computed with your data, assuming that the null hypothesis is true. ๏ท Significance level of a test: pre-specified probability of incorrectly rejecting the null, when the null is true. ๏ท Some definitions: ๐๐,0 : specific value of the population mean under ๐ป0 ๐ฬ : sample average ๐ธ(๐): population mean, unknown ๐ฬ ๐๐๐ก : value of the sample average actually computed in the data set at hand To compute the p-value, it is necessary to know the sampling distribution of ๐ฬ under ๐ป0 ๏ According to the CLT, the distribution is normal when the sample size (n) is large. ๐2 ๏ The sampling distribution of ๐ฬ is ๐ (๐๐,0 , ๐ ) ๐ 7 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 Calculating the p-value when ๐๐ is unknown In practice, we have to estimate the standard deviation from the sample because we don’t have it from the population. ๐2 ๐2 We have: ๐ฬ ~๐ (๐๐,0 , ๐ ) where ๐ = ๐๐ฬ 2 ๐ ๐ ๏ Standard deviation of the sampling distribution of ๐ฬ : ๐๐ฬ = ๐๐ /√๐ #built from the pop but we don’t know its variance to build the variance of the sampling distribution so we have to find an estimator. ๏ Estimator of ๐๐ฬ : ๐๐ธ(๐ฬ ) = ๐ฬ๐ฬ = ๐ ๐ /√๐ #see how ๐ ๐ is computed below QUESTION: we divide 2 times by N ? Sample variance and sample standard deviation ๐ 1 2 ๐ ๐ = ∑(๐๐ − ๐ฬ )2 ๐−1 ๐=1 ๐ ๐ = √๐ ๐2 #dividing by n-1 makes the estimator unbiased, we lost 1 degree of freedom when estimating the mean ๏ Calculating the p-value with ๐๐ฬ 2 estimated and n large: ๐ฬ − ๐๐,0 ๐ฬ ๐๐๐ก − ๐๐,0 ๐ฬ ๐๐๐ก − ๐๐,0 ๐ − ๐ฃ๐๐๐ข๐ ≅ ๐๐๐ป0 [| ๐ |) |>| |] = 2๐ (− | ๐ ๐ ๐ ๐๐ธ(๐ฬ ) √๐ √๐ The t-statistic It’s also called the standardized sample average, we use it as a test statistic to perform hypothesis tests quite often: ๐ก= ๐ฬ −๐๐,0 ๐ ๐ √๐ once computed: ๐ก ๐๐๐ก = ๐ฬ ๐๐๐ก −๐๐,0 ๐ ๐ √๐ We can rewrite the formula for the p-value by substituting the equ° of the t-statistic: ๐ − ๐ฃ๐๐๐ข๐ = ๐๐๐ป0 [|๐ก| > |๐ก ๐๐๐ก |] = 2๐(−|๐ก ๐๐๐ก |) Link p-value – significance level and results ๏ The significance level is prespecified. 
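A minimal sketch (made-up data) of the two-sided test just described: sample mean, sample standard deviation with n - 1, standard error s_Y/√n, t-statistic, and the large-sample p-value 2Φ(−|t_act|):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.8, scale=0.4, size=200)   # illustrative sample
mu_0 = 0.78                                     # hypothesized population mean under H0

n     = len(y)
y_bar = y.mean()
s_y   = y.std(ddof=1)               # sample sd, divides by n - 1
se    = s_y / np.sqrt(n)            # SE(Ybar) = s_Y / sqrt(n)
t_act = (y_bar - mu_0) / se
p_val = 2 * norm.cdf(-abs(t_act))   # large-n normal approximation (CLT)

print(f"t = {t_act:.3f}, p-value = {p_val:.3f}")
# Reject H0 at the 5% level if |t| >= 1.96, i.e. if the p-value <= 0.05
```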
Example: significance level = 5% ๏ท Reject ๐ป0 if |๐ก| ≥ 1.96 and reject ๐ป0 if ๐ ≤ 0.05 ๏ท A small p-value means that ๐ฬ ๐๐๐ก or ๐ก ๐๐๐ก is far away from the mean under ๐ป0 and that it is very unlikely that this sample would have been drawn if ๐ป0 is true, which means if the population mean is equal to the population mean under ๐ป0 . ๏ reject ๐ป0 ๏ท A big p-value means that ๐ฬ ๐๐๐ก or ๐ก ๐๐๐ก is close to the mean under ๐ป0 and that it is very likely that this sample would have been drawn if ๐ป0 is true, which means if the population mean is equal to the population mean under ๐ป0 . ๏ not reject ๐ป0 ๏ท P-value: marginal significance level One-sided Alternatives ๐ป1 : ๐ธ(๐) > ๐๐,0 The general approach is the same, with the modification that only large positive values of the t-statistic reject ๐ป0 ๏ right side of the normal distribution ๐ − ๐ฃ๐๐๐ข๐ = ๐๐๐ป0 (๐ > ๐ก ๐๐๐ก ) = 1 − ๐(๐ก ๐๐๐ก ) or ๐ป1 : ๐ธ(๐) < ๐๐,0 ๏ left side of the normal distribution ๐ − ๐ฃ๐๐๐ข๐ = ๐๐๐ป0 (๐ < ๐ก ๐๐๐ก ) = 1 + ๐(๐ก ๐๐๐ก ) 8 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année 3.2 Confidence Intervals Use data from a random sample to construct a set of values (confidence set) that contains the true population mean ๐๐ with a certain prespecified probability (confidence level). The upper and lower limits of the confidence set are an interval (confidence interval). 3.2.1 ๏ท ๏ท ๏ท Concerning the population mean 95% CI for ๐๐ is an interval that contains the true value of ๐๐ in 95% of all possible samples. The CI is random because it will differ from one sample to the next. ๐๐ is NOT random, we just don’t know it. A 95% CI can be seen as the set of value of ๐๐ not rejected by a hypothesis test with a 5% significance level. For further informations, see “Résumé statistique inférentielle II » from the 3rd semester. 3.3 Scatterplots, the Sample Covariance and the Sample Correlation The 3 ways to summarize the relationship between variables! Scatterplots Sample Covariance and Correlation Estimators of the population covariance and correlation. Computed by replacing a population mean with a sample mean. They are, just as the sample variance, consisten. 1 Sample covariance: ๐ ๐๐ = ๐−1 ∑๐๐=1(๐๐ − ๐ฬ )(๐๐ − ๐ฬ ) ๐ Sample correlation: ๐๐๐ = ๐ ๐๐ ๐ ๐ ๐ A high correlation (close to 1) means that the points in the scatter plot fall very close to a straight line ADD DIFFERENCE BETWEEN MEANS? NECESSARY? (chap. 3 SW). 9 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année 4 Linear Regression with One Regressor Allows us to estimate, and make inferences, about population slope coefficients. Our purpose is to estimate the causal effect on Y of a unit change in X. ๏ relationship between X and Y Linear regression (statistical procedure) can be used for: ๏ท Causal inference: using data to estimate the effect on an outcome of interest of an intervention that changes the value of another variable. ๏ท Prediction: using the observed value of some variable to predict the value of another variable. 4.1 Linear Regression Model with a Single Regressor (population) The problems of statistical inference for linear regression are basically the same as for estimation of the mean or of the differences between two means. We want to figure out if there is a relationship between the two variables (hyp test). Ideally, ๐ is large because we work with the CLT. 
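Before the formal notation below, here is a small illustrative simulation (invented parameter values) of data generated by a single-regressor population model, together with the sample covariance and correlation from section 3.3:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
beta_0, beta_1 = 2.0, -1.5          # illustrative population coefficients

x = rng.normal(10, 2, size=n)       # regressor
u = rng.normal(0, 3, size=n)        # error term, E(u|X) = 0 by construction
y = beta_0 + beta_1 * x + u         # population regression model

# sample covariance and correlation (section 3.3)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))
print(f"sample cov = {s_xy:.2f}, sample corr = {r_xy:.2f}")
```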
General Notation of the Population Regression Model ๏ท ๐๐ : independent variable or regressor ๏ท ๐๐ : dependant variable or regressand ๏ท ๐ฝ0 : intercept ๏ coefficient/parameter, value of the regression line when ๐ = 0 ๏ท ๐ฝ1 : slope ๏ coefficient/parameter, difference in Y associated with a unit difference in X ๏ท ๐ข๐ = ๐๐ − (๐ฝ0 + ๐ฝ1 ๐๐ ): the error term, difference between ๐๐ and its predicted value according to the regression line. ๏ท ๐๐ = ๐ฝ0 + ๐ฝ1 ๐๐ : population regression line/function ๏ relationship that holds between X and Y, on average, over the population ๐๐ = ๐ฝ0 + ๐ฝ1 ๐๐ + ๐ข๐ ๐๐๐ ๐ = 1, … , ๐ 4.2 Estimating the Coefficients of the Linear Regression Model We don’t know the coefficients so we have to find estimators for them from the available data! ๏ learn about the population using a sample of data. ๏ these values are estimated from a sample to make an inference about the population. The Ordinary Least Squares Estimator (OLS) Intuitively, we want to fit a line through the data, the line that makes the least error or squares. The OLS estimator chooses the regression coefficients so that the line is as close as possible to the data, you find them by following: 1) Let ๐0 and ๐1 be some estimators for ๐ฝ0 and ๐ฝ1 2) Regression line based on these estimators: ๐0 + ๐1 ๐๐ 3) ๐๐๐ ๐ก๐๐๐ ๐๐๐๐ ๐คโ๐๐ ๐๐๐๐๐๐๐ก๐๐๐ = ๐๐ − (๐0 + ๐1 ๐๐ ) = ๐๐ − ๐0 − ๐1 ๐๐ 4) Sum of all these squared predictions for ๐ observations: ∑๐๐=1(๐๐ − ๐0 − ๐1 ๐๐ )2 5) There is a unique pair of estimators that minimize this expression ๏ the OLS estimators ฬ0 and the OLS estimator of ๐ฝ1 is ๐ฝ ฬ1 ๏ OLS estimator of ๐ฝ0 is ๐ฝ General Notation of the Sample Regression Line ฬ1 = ๐ ๐๐ ๏ท ๐ฝ ๐ 2 ๏ท ๏ท ๏ท ๐ ฬ0 = ๐ฬ − ๐ฝ ฬ1 ๐ฬ ๐ฝ ๐ขฬ๐ = ๐๐ − ๐ฬ๐ : residual for the ๐ ๐กโ observation, ๐ฬ๐ is the predicted value. ฬ0 + ๐ฝ ฬ1 ๐: OLS regression line/sample regression line/function ๐ฝ ฬ0 + ๐ฝ ฬ1 ๐๐ ฬ๐ = ๐ฝ ๐ ๐๐๐ ๐ = 1, … , ๐ 10 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année 4.3 Measures of Fit and Prediction Accuracy How well does the regression line fit the data? Is the regressor responsible for a little or big variation of the dependant variable? Are the observations tightly distributed around the regression line or spread out? The Regression ๐น๐ Fraction of the sample variance of ๐ explained by ๐. 1) We have from above: ๐ขฬ๐ = ๐๐ − ๐ฬ๐ โบ ๐๐ = ๐ฬ๐ + ๐ขฬ๐ 2) ๐ 2 can be written as the ratio of the explained sum of squares to the total sum of squares: 2 ๐ธ๐๐ = ∑๐๐=1(๐ฬ๐ − ๐ฬ ) #(sample predicted – sample average)^2 ๐๐๐ = ∑๐๐=1(๐๐ − ๐ฬ )2 #(actual – sample average)^2 ๐ธ๐๐ ๐ 2 = = ๐ถ๐๐๐(๐, ๐)2 ๐๐๐ 3) ๐ 2 can also be written in terms of the fraction of the variance of ๐๐ not explained by ๐๐ , aka the sum of squared residuals. ๐ ๐๐๐ = ∑ ๐ขฬ๐ 2 ๐=1 4) ๐๐๐ = ๐ธ๐๐ + ๐๐๐ ๐ 2 = 1 − ๐๐๐ ๐๐๐ 0 ≤ ๐ 2 ≤ 1 Complete with the exercises ! The Standard Error of the Regression (SER) Estimator of the standard deviation of the regression error ๐ข๐ ๐๐ธ๐ = ๐ ๐ขฬ = √๐ ๐ขฬ2 1 ๐๐๐ where ๐ ๐ขฬ2 = ๐−2 ∑๐๐=1 û2๐ = ๐−2 #we divide by ๐ − 2 because two degrees of freedom were lost when estimating ๐ฝ0 and ๐ฝ1 . Predicting Using OLS ๏ท in-sample prediction observation for which the prediction is made was also used to estimate the regression coefficients ๏ we are in the sample. ๏ท out-of-sample prediction prediction for observations not in the estimation sample. 
The goal of prediction: provide accurate out-of-sample predictions 4.4 The Least Squares Assumptions (hypotheses) What, in a precise sense, are the properties of the OLS estimator? We would like it to be unbiased, and to have a small variance. Does it? Under what conditions is it an unbiased estimator of the true population parameters? To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme). These assumptions – there are three – are known as the Least Squares Assumptions. ๐๐ = ๐ฝ0 + ๐ฝ1 ๐๐ + ๐ข๐ , ๐ = 1, … , ๐ 11 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 1. The Conditional distribution of ๐๐ given ๐ฟ๐ has mean zero o That is, ๐ธ(๐ข|๐ = ๐ฅ) = 0 ฬ1 is unbiased o Implies that ๐ฝ o Intuitively, consider an ideal randomized controlled experiment : ๐ is randomly assigned to people (ex : students randomly assigned to different size classes). Randomization is done by computer, using no information about the individual. Because ๐ is assigned randomly, all other individual characteristics (the things that make up ๐ข) are independently distributed of ๐. Thus, in an ideal randomized controlled experiment : ๐ธ(๐ข|๐ = ๐ฅ) = 0 Recall: if ๐ธ(๐ข|๐ = ๐ฅ) = 0, then ๐ข and ๐ have 0 cov and are uncorrelated. In actual experiments, or with observational data, it’s not necessary true. 2. (๐ฟ๐ , ๐๐ ), ๐ = ๐, … , ๐ Are Independently and Identically Distributed o i.i.d arises automatically if sampled by simple random sampling: the entity is selected then, for that entity, X and Y are observed (recorded) o The main place we will encounter non-i.i.d. sampling is when data are recorded over time (“time series data”) ๏ extra complications 3. Large Outliers are Unlikely A large outlier can strongly influence the results! See this outlier of Y (red circle) ๏ that’s why it’s important to look at the data before doing any regression line. 4.5 The Sampling Distribution of the OLS Estimators ฬ0 , ๐ฝ ฬ1 ) are computed from a randomly drawn sample ๏ random variables OLS Estimators (๐ฝ with a sampling distribution that describes the values they could take over different possible random samples. ฬ0 and ๐ฝ ฬ1 are ๐ฝ0 and ๐ฝ1 : ๏ท The means of the sampling distributions of ๐ฝ ฬ ฬ ๐ธ(๐ฝ0 ) = ๐ฝ0 and ๐ธ(๐ฝ1 ) = ๐ฝ1 ๏ unbiased estimators Normal Approximation to the Distribution of the OLS Estimators in Large Samples ฬ If the Least Squares Assumptions hold, then ๐ฝฬ 0 , ๐ฝ1 have a jointly normal sampling distribution: 1 ๐ฃ๐๐[(๐๐ −๐๐ )๐ข๐ ] 2 2 ฬ1 ~๐ (๐ฝ1 , ๐ฬ o ๐ฝ ) where ๐๐ฝฬ = ๐ [๐ฃ๐๐(๐ ๐ฝ )]2 1 o ๏ท 1 ๐ 1 ๐ฃ๐๐(๐ป๐ ๐ข๐ ) ๐ 2 2 ฬ0 ~๐(๐ฝ0 , ๐ฬ ๐ฝ ) where ๐๐ฝฬ =๐ where ๐ป๐ = 1 − [๐ธ(๐๐2 ] ๐๐ ๐ฝ 2 2 0 0 [๐ธ(๐ป๐ )] ๐ ) Implications: o When ๐ is large, the distribution of the estimators will be tightly centred around their means. o Consistent estimators 2 o The larger the ๐ฃ๐๐(๐๐ ), the smaller ๐๐ฝฬ 1 Intuitively, if there is more variation in ๐, then there is more information in the data that you can use to fit the regression line. ๏ easier to draw a regression line for the black dots (with ๐ฃ๐๐(๐๐ ) bigger!) 
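A minimal sketch (illustrative data, not the course dataset) of the OLS formulas from sections 4.2-4.3: the slope as the ratio of sample covariance to sample variance, the intercept from the sample means, then R² = ESS/TSS and SER = sqrt(SSR/(n−2)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(20, 4, size=n)                  # e.g. a student-teacher ratio (invented)
y = 700 - 2.3 * x + rng.normal(0, 15, size=n)  # e.g. test scores (made-up DGP)

# OLS estimators
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x              # predicted values
u_hat = y - y_hat                              # residuals

# Measures of fit
ess = np.sum((y_hat - y.mean())**2)            # explained sum of squares
tss = np.sum((y - y.mean())**2)                # total sum of squares
ssr = np.sum(u_hat**2)                         # sum of squared residuals
r2  = ess / tss                                # equals 1 - ssr/tss
ser = np.sqrt(ssr / (n - 2))                   # standard error of the regression

print(f"beta0 = {beta0_hat:.2f}, beta1 = {beta1_hat:.3f}, R^2 = {r2:.3f}, SER = {ser:.2f}")
```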
##########################################################################
In-class recap, 30.04.2022 (material covered so far)

We looked at probability: the regression model
- Y_i = \beta_0 + \beta_1 X_i + u_i, where Z_i = \beta_0 + \beta_1 X_i is the systematic part. This model explains (X, Y).
- E[u|X] = 0, which implies
  - E[u] = 0
  - Cov(X, u) = 0

The parameters of the model are \beta_0 and \beta_1:
\beta_0 = E[Y] - \beta_1 E[X]
\beta_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}
\mathrm{Var}(Y) = \mathrm{Var}(Z) + \mathrm{Var}(u)
\mathrm{Var}(Z) = [\mathrm{Corr}(X, Y)]^2 \, \mathrm{Var}(Y)

All these probability tools are useful for doing statistics, i.e. working with samples: we try to estimate the parameters of the model and measure how good the regression is.
\widehat{\mathrm{Var}}(X) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2
\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
\beta_0 \;\to\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\beta_1 \;\to\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

We want to know the part of the regression that is "good", so we work with Z, because u is the error term:
\mathrm{Var}(Z) = \frac{1}{n}\left[(\beta_0 + \beta_1 X_1 - E(Y))^2 + \dots + (\beta_0 + \beta_1 X_n - E(Y))^2\right]
\widehat{\mathrm{Var}}(Z) = \frac{1}{n}\left[(\hat{\beta}_0 + \hat{\beta}_1 X_1 - \bar{Y})^2 + \dots + (\hat{\beta}_0 + \hat{\beta}_1 X_n - \bar{Y})^2\right]
where \hat{\beta}_0 + \hat{\beta}_1 X_i is what the book calls \hat{Y}_i, so
\widehat{\mathrm{Var}}(Z) = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
and
\frac{\widehat{\mathrm{Var}}(Z)}{\widehat{\mathrm{Var}}(Y)} = R^2

SER: Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i = \hat{u}_i is the estimate of the contribution of the other factors, and
\mathrm{SER} = \mathrm{sd}(\hat{u}) = \sqrt{\widehat{\mathrm{Var}}(\hat{u})}
##########################################################################

5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

5.1 Testing Hypotheses About One of the Regression Coefficients
The process is the same for \beta_0 and \beta_1!

General form of the t-statistic
t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}

Two-Sided Hypotheses Concerning \beta_1
Because \hat{\beta}_1 also has a normal sampling distribution in large samples, hypotheses about the true value of the slope \beta_1 can be tested using the same approach as for \bar{Y}.
H_0: \beta_1 = \beta_{1,0}    H_1: \beta_1 \neq \beta_{1,0}
To test H_0 against the alternative H_1, just as for the population mean, we follow 3 steps:

1) Compute the standard error of \hat{\beta}_1
SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_1}}
Heteroskedasticity-robust standard errors:
\hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2}{\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2}
\hat{\sigma}^2_{\hat{\beta}_0} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum_{i=1}^{n} \hat{H}_i^2 \hat{u}_i^2}{\left[\frac{1}{n} \sum_{i=1}^{n} \hat{H}_i^2\right]^2}, \quad \text{where } \hat{H}_i = 1 - \left(\frac{\bar{X}}{\frac{1}{n} \sum_{j=1}^{n} X_j^2}\right) X_i

2) Compute the t-statistic
t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}

3) Compute the p-value
Reject the hypothesis at the 5% significance level if the p-value is less than 0.05 or, equivalently, if |t^{act}| > 1.96.
p\text{-value} = \Pr_{H_0}\left[\left|\frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}\right| > \left|\frac{\hat{\beta}_1^{act} - \beta_{1,0}}{SE(\hat{\beta}_1)}\right|\right] = 2\Phi(-|t^{act}|)
Note: regression software computes these values automatically (e.g. lm() and summary() in R).

One-Sided Hypotheses Concerning \beta_1
- H_0: \beta_1 = \beta_{1,0}    H_1: \beta_1 > \beta_{1,0} (or \beta_1 < \beta_{1,0})
- Because H_0 is the same, the construction of the t-statistic is the same.
The only difference between a one- and a two-sided hypothesis test is how you interpret the t-statistic: o For a left-tail test: ๐ − ๐ฃ๐๐๐ข๐ = ๐๐๐ป0 (๐ < ๐ก๐๐๐ก ) = 1 + ๐(๐ก๐๐๐ก ) o For a right-tail test: ๐ − ๐ฃ๐๐๐ข๐ = ๐๐๐ป0 (๐ > ๐ก๐๐๐ก ) = 1 − ๐(๐ก๐๐๐ก ) In practice, we use this kind of test only when there is a reason to do so 14 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année 5.2 Confidence Intervals for a Regression Coefficient The process is the same for ๐ฝ0 and ๐ฝ1 ! Confidence Interval for ๐ท๐ ๏ท 2 definitions: 1) Set of value that cannot be rejected using a two-sided hypothesis test with a 5% significance level 2) Interval that has a 95% probability of containing the true value of ๐ฝ1 : in 95% of possible samples that might be drawn, the CI will contain the true value. ๏ท Link with hypothesis test: a hyp test w/ 5% significance level will reject the value of ๐ฝ1 in only 5% of all possible samples ๏ in 95% of all possible samples, the true value of ๐ฝ1 won’t be rejected (2nd definition above) ฬ1 − 1.96๐๐ธ(๐ฝ ฬ1 ), ๐ฝ ฬ1 + 1.96๐๐ธ(๐ฝ ฬ1 )] ๏ท When sample size is large: 95% ๐ถ๐ผ ๐๐๐ ๐ฝ1 = [๐ฝ Confidence Intervals for predicted effects of changing ๐ฟ ฬ1 − 1.96๐๐ธ(๐ฝ ฬ1 )) โ๐ฅ, (๐ฝ ฬ1 + 1.96๐๐ธ(๐ฝ ฬ1 )) โ๐ฅ] 95% ๐ถ๐ผ ๐๐๐ ๐ฝ1 โ๐ฅ = [(๐ฝ 5.3 Regression When X Is a Binary Variable It’s also possible to make a regression analysis when the regressor is binary (takes only 2 values: 0 or 1). Binary variable = indicator variable = dummy variable Interpretation of the Regression Coefficients ๏ท The mechanics are the same as with a continuous regressor ๏ท The interpretation is different and is equivalent to a difference of means analysis. ๐๐ = ๐ฝ0 + ๐ฝ1 ๐ท๐ + ๐ข๐ ๏ท What is ๐ท๐ ? o It’s the binary variable that can take only 2 values ๏ท What is ๐ฝ1 ? o Called “Coefficient on ๐ท๐ ” o 2 possible cases: ๐ซ๐ = ๐ ๐ซ๐ = ๐ ๐๐ = ๐ฝ0 + ๐ข๐ ๐๐ = ๐ฝ0 + ๐ฝ1 + ๐ข๐ We know: ๐ธ(๐ข๐ |๐ท๐ ) = 0 → We know: ๐ธ(๐๐ |๐ท๐ = 1) = ๐ฝ0 + ๐ฝ1 ๐ธ(๐๐ |๐ท๐ = 0) = ๐ฝ0 ๐ฝ0 + ๐ฝ1 is the population mean ๐ฝ0 is the population mean value when value when ๐ท๐ = 1 ๐ท๐ = 0 ๏ท Thus, ๐ฝ1 = ๐ธ(๐๐ |๐ท๐ = 1) − ๐ธ(๐๐ |๐ท๐ = 0) ๏ difference between population means! Hypothesis Tests and Confidence Intervals ๏ท If the 2 population means are the same ๏ ๐ฝ1 = 0 ๏ท We can test this ๐ป0 : ๐ฝ1 = 0 against ๐ป1 : ๐ฝ1 ≠ 0 5.4 Heteroskedasticity and Homoskedasticity We bother about this because it influences our hypothesis tests (the SE is not the same!) Definitions The error term ๐ข๐ is: ๏ท homoskedastic if the variance of the conditional distribution of ๐ข๐ given ๐๐ , ๐ฃ๐๐(๐ข๐ |๐๐ = ๐ฅ) is constant for ๐ = 1, … , ๐ and in particular does not depend on ๐๐ ; ๏ท heteroskedastic if otherwise 15 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 ๏ Homoskedastic data: particular case of heteroskedastic data where the variance of the ฬ1 becomes simpler. estimator ๐ฝ Visually Homoskedastic Heteroskedastic LSA 1 satisfied bc ๐ธ(๐ข๐ |๐๐ = ๐ฅ) = 0 Variance of ๐ข doesn’t depend on ๐ 5.4.1 LSA 1 satisfied bc ๐ธ(๐ข๐ |๐๐ = ๐ฅ) = 0 Variance of ๐ข depends on ๐ Mathematical Implications of Homoskedasticity So far, we have allowed that the data is heteroskedastic. What if the error terms are in fact homoskedastic? ๏ท Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent and asymptotically normal. 
ฬ1 and the OLS standard error simplifies ๏ท The formula for the variance of ๐ฝ Homoskedasticity-only Variance Formula 2 2 ฬ1 ) = ๐๐ข2 ฬ0 ) = ๐ธ(๐๐2 ) ๐๐ข2 ๐ฃ๐๐(๐ฝ ๐ฃ๐๐(๐ฝ ๐๐๐ ๐๐๐ ฬ1 ) is inversely proportional to ๐ฃ๐๐(๐): more spread in ๐ means more Note: we see that ๐ฃ๐๐(๐ฝ information about ๐ฝ1 Homoskedasticity-only Standard Errors 1 1 2 ฬ1 ) = √(๐ฬฬ ๐๐ธ(๐ฝ =) 1 ๐−2 ๐ ๐ฝ 1 ฬ2 ∑๐ ๐=1 ๐ข๐ ๐ ∑๐=1(๐๐ −๐ฬ )2 ๐ 1 2 2 ( ∑๐ ๐=1 ๐๐ )๐ û 2 ฬ0 ) = √(๐ฬฬ ๐๐ธ(๐ฝ =) ∑๐๐ ๐ฝ 0 ฬ 2 ๐=1(๐๐ −๐) The usual standard errors, which we call heteroskedasticity – robust standard errors, are valid whether or not the errors are heteroskedastic Practical Implications ๏ท In general, you get different standard errors using the different formulas. ๏ท To be sure, always use the general formula (heteroskedastic-robust standard errors) because valid for the both cases. ๏ท Warning: R uses the simpler formula ๏ท If you don’t override the default and there is in fact heteroskedasticity, your standard errors (and wrong t-statistics and confidence intervals) will be wrong – typically, homoskedasticity-only SEs are too small. 5.5 The Theoretical Foundations of OLS What we already know: ๏ท OLS is unbiased and consistent 16 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année ๏ท A formula for heteroskedasticity-robust standard errors ๏ท How to construct CI and test statistics We are using OLS is because it’s the language of the regression analysis! What we doubt: ๏ท Is this really a good reason to use OLS? Other estimators with smaller variances are maybe better? ๏ท What happened to the student t distribution? To answer these questions, we have to make stronger assumptions. The Extended Least Squares Assumptions 3 LSA + 2 others: 1) ๐ธ(๐ข๐ |๐๐ = ๐ฅ) = 0 2) (๐๐ , ๐๐ผ ), ๐ = 1, … , ๐ are i.i.d. 3) Large outliers are rare More restrictive – apply in fewer cases. But lead 4) ๐ข is homoskedastic to stronger results (under those assumptions) 5) ๐ข is distributed ๐(0, ๐ 2 ) because calculations simplify. Efficiency of OLS, part I: The Gauss-Markov Theorem ฬ1 has the smallest variance among all linear estimators. Under the first 4 assumptions, ๐ฝ “If the three least squares assumptions hold and if errors are homoskedastic, then the OLS ฬ1 is the Best Linear conditionally Unbiased Estimator (BLUE).” estimator ๐ฝ Efficiency of OLS, part II ฬ1 has the smallest variance of all consistent estimators. Under the 5 assumptions, ๐ฝ If the errors are homoskedastic and normally distributed + LSA 1-3, then OLS is a better choice than any other consistent estimator ๏ the best you can do! Some not-so-good thing about OLS Warning: that has important limitations! 1. Gauss-Markov Theorem has 2 limitations: ๏ท Homoskedasticity is rare and often doesn’t hold ๏ท Result only for linear estimators, which represent only a small subset of estimators 2. The strongest result (OLS is the best you can do, cf above “part II”) requires homoscedastic normal errors ๏ pretty rare in application 3. OLS is more sensitive to outliers than some other estimators. For example, in the case of the mean of the population with big outliers, we prefer the median. 5.6 Using the t-Statistic in Regression When the Sample Size Is Small 1) 2) 3) 4) 5) ๐ธ(๐ข๐ |๐๐ = ๐ฅ) = 0 (๐๐ , ๐๐ผ ), ๐ = 1, … , ๐ are i.i.d. Large outliers are rare ๐ข is homoskedastic ๐ข is distributed ๐(0, ๐ 2 ) ฬ0 ,ฬ If they all hold, ๐ฝ ๐ฝ1 are normally distributed and the t-statistic has ๐ − 2 degrees of freedom. 
Under those 5 assumptions and the null hypothesis, the t-statistic has a Student t distribution with ๐ − 2 degrees of freedom. 17 Corentin Cossettini Semestre de printemps 2022 ๏ท ๏ท ๏ท Université de Neuchâtel Bachelor en sciences économiques 2ème année Why ๐ − 2? Because we estimated 2 parameters For ๐ < 30, the t critical values can be a faire bit larger than the ๐(0,1) critical values For ๐ > 50, the difference in ๐ก๐−2 and ๐(0,1) distribution is negligible. Recall the student t table: Degrees of freedom 5% t-distribution critical value 10 2.23 20 2.09 30 2.04 60 2.00 1.96 ∞ Practical Implications ๏ท If n < 50 and you really believe that, for your application, u is homoskedastic and normally distributed, then use the t n−2 instead of the N(0,1) critical values for hypothesis tests and confidence intervals. ๏ท In most econometric applications, there is no reason to believe that u is homoskedastic and normal – usually, there is good reason to believe that neither assumption holds. ๏ท Fortunately, in modern applications, n > 50, so we can rely on the large-n results presented earlier, based on the CLT, to perform hypothesis tests and construct confidence intervals using the large-n normal approximation. 5.7 Summary and Assessment ๏ท The initial policy question: Suppose new teachers are hired so the student-teacher ratio falls by one student per class. What is the effect of this policy intervention (“treatment”) on test scores? ๏ท Does our regression analysis answer this convincingly? o Not really: districts with low student to teacher ratio tend to be ones with lots of other resources and higher income families, which provide kids with more learning opportunities outside school...this suggests that corr(ui , STR i ) > 0, so E(ui |X i) ≠ 0 o So, we have omitted some factors, or variables, from our analysis, and this has biased our results. Part II Data mining: deviner la valeur d’un attribut (variable) en utilisant d’autres attributs. 6 Introduction to Data Mining 6.1 Data Mining - Concepts Definitions ๏ท Science of discovering structure and making predictions in large samples. ๏ท Exploration and analysis by automatic or semi-automatic means of large quantities of data in order to discover meaningful patterns. Objectives ๏ท Discovering structures/patterns: explore and understand the data ๏ understanding the past! ๏ท Making predictions: given measurements of variables, learn a model to predict their future values. ๏ Predict the future! 
Data: definition 18 Corentin Cossettini Semestre de printemps 2022 ๏ท ๏ท ๏ท Université de Neuchâtel Bachelor en sciences économiques 2ème année Collection of data objects and their attributes Attribute: property or characteristic of an object Object: described by a collection of attributes Attributes Types Distinctness: =, ≠ Nominal Ex: ID number, eye color, … Ordinal Ex: rankings, grades, height Interval Ex: dates, temperatures, … Ratio Ex: length, time, counts Description Properties Order: <,> Addition: +,Multiplication: *,/ The attribute provides only enough Distinctness information to distinguish one object from another The attribute provides enough info to Distinctness and order order objects The difference between values are Distinctness, meaningful addition order and Differences between the values and Distinctness, order, ratios are meaningful addition and multiplication Concepts ๏ท Domain: set of objects from the real world about which knowledge is supposed to be delivered by data mining ๏ท Target attribute: attribute to be predicted (‘YES’/’NO’) ๏ท Input attributes: observable attributes that can be used for prediction ๏ท Dataset: a subset of the domain, described by the set of available attributes (rows: instances; columns: attributes) ๏ท Predictive model: chunks of knowledge about some domain of interest, that can be used to answer queries not just about instances from the data used for model creation, but also any other instances from the same domain Origins of Data Mining An analytic process that uses one or more available datasets from the same domain to create one or more models for the domain. The ultimate goal of data mining is delivering predictive models. ๏ท Uses ideas from machine learning, stats and database systems ๏ท Vocabulary differences between stats and machine learning: Inductive learning ๏ท Most of data mining algorithms are using inductive learning approach: The algorithm (or learner) is provided with training information from which it has to derive knowledge via inductive inference. Any prediction uses knowledge to deduce the answer via deductive inference. 19 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année ๏ท Knowledge: expected result of the learning process ๏ท Inductive inference: discovering patterns in training information and generalizing them appropriately (“specific to general”) ๏ท Query: a new cas or observation with unknown aspects ๏ท Deductive inference: supplying the values of unknown aspects based on a “general to specific process” The learner isn’t informed and has no possibility to verify with certainty which of the possible generalizations are correct ๏ complicated to check if the prediction is true or not. Use feedback to improve the quality of the model. Data Mining Tasks ๏ท Regression Predict a value of a numeric target attribute based on the values of other input attributes, assuming a linear or nonlinear model of dependency. Applications: predicting sales amount of new product based on advertising expenditure Numeric variable is the only difference between regression and classification. ๏ท Classification In a lot of cases the variable is binary. Predict a discrete target attribute based on the values of other input attributes. Applications: fraud detection, direct marketing... ๏ท Clustering Predict the assignment of instances to a set of clusters, such that the instances in any one cluster are more similar to each other than to instances outside the cluster. 
Applications: market segmentation ๏ท Association Rule Discovery Given a set of records each of which contains some number of items from a given collection. Application: identify items that are bought together by sufficiently many customers… 6.2 Classification - Basic Concepts Definition ๏ท Given a collection of records (training set) o Each record is defined as a tuple (x,y) where x is a set of attributes, and y is a special attribute, denoted as the class label (also called target) ๏ท Find a model for class attribute y as a function f of the values of attributes x o The function maps each attribute set x to one of the predefined class label y o The function is a classification model ๏ท Goal: previously unseen records should be assigned a class as accurately as possible o A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. 20 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année Illustrating Classification Task Classification Techniques ๏ท Neural networks ๏ท Decision Tress based Methods ๏ are the most important! But others exist. 6.3 Rule-Based Classifier: basic concepts ๏ท Classification model uses a collection of IF…THEN… rules o “๐ถ๐๐๐๐๐ก๐๐๐ → ๐ฆ” o LHS: left hand side ๏ rule antecedent or condition, conjunction of tests related to attributes o RHS: right hand side ๏ rule consequent, class label (y) Rule ๏ท A rule R covers an instance x if the attributes of the instance satisfy the condition of the rule ๏ท Coverage of a rule: fraction of records that satisfy the antecedent of a rule all the instances covered by the rule (%) ๏ท Accuracy of a rule: fraction of records that satisfy both the antecedent and consequent of a rule instances that satisfy the condition and look which ones are true and false. We don’t look at all the instances (only the ones that satisfy the rule)! If no rule applies, we give a default value Characteristics of Rule-Based Classifier ๏ท Mutually exclusive rules Every possible record is covered by at most one rule ๏ท Exhaustive rules Every possible record is covered by at least one rule We can mix them! (exhaustive +mutually exclusive, …) Rules Can Be Simplified Effects of a rule simplification ๏ท Rules are no longer mutually exclusive A record may trigger more than one rule Solution: ordered set rule, unordered rule set- use voting scheme ๏ท Rules are no longer exhaustive A record may not trigger any rules Solution: use a default class 21 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année Ordered Rule Set ๏ท Known as a decision list ๏ท Give an order to the rule set (the 1st applies /…) When the test record is presented to the classifier, it is assigned to the class label of the highest ranked rule it has triggered (if no rule applies, assigned to the default class!) Rule Ordering Schemes ๏ท Rule-based ordering: individual rules are ranked based on their quality o Question: how to define quality? 
Several possibilities… ๏ท Class-based ordering: rules that belong to the same class appear together Building Classification Rules ๏ท Direct method: extract rules directly from data example of sequential covering ๏ท Indirect method: extract rules from other classification models STOPPING conditions ๏ if the quality of the rule is not met; ๏ if there is no more instance to put in the Aspects of Sequential Covering How to find rules, linked with the direct method 1. Rule Growing 2 common strategies: 1) Top-down: general-to-specific 2) Down-up: specific-to-general 2. Rule Evaluation Needed to determine which conjunct should be added or removed during the rulegrowing process Metrics: For one rule The value of accuracy isn’t correct or certain for the future data, bc based on past data ๏ improve this measure and develop Laplace or M-estimate ๏ 2 corrections 22 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année Other measure: FOIL’s information gain Compare 2 rules (for a rule extension) but we had a measure to the second one ๐๐ mesure la précision de la ฬ ฬ ฬ ฬ ๐๐ +๐๐ deuxième règle, de même pour ๐๐ ฬ ฬ ฬ ๐ฬ ๐๐ +๐ 3. Rule Pruning Reduced Error Pruning ๏ remove one of the conjuncts in the rule ๏ comparer error rate on validation set before and after pruning. ๏ if error improves, prune the conjunct What is the validation set? Allows us to compute the accuracy after we created the rule in the training set. Training set ๏ rules ๏ verify with the validation set Laplace is also a way to evaluate the rules to choose which one on them is the best 4. Stopping criterion Compute the gain If gain is not significant, discard the new rule Direct method – summary 1. Grow a single rule 2. Remove instances from the rule 3. Prune the rule if necessary 4. Add rule to current rule set 5. Repeat Example of Direct method – RIPPER variant It’s a method (among others) that builds a Rule-Based Classifier. ๏ท For a 2-class problem Choose one of the classes as positive class (the other is the negative class) o Learn rules for positive class o Negative class is the default class ๏ท For a multi-class problem Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class) o Learn the rule set for smallest class first, treat the rest as negative class o Repeat with next smallest class as positive class ๏ท Growing a rule o Start form empty rule o Add conjuncts as long as they improve FOIL’s information gain ๏ท o Stop when rule starts covering negative examples Prune the rule immediately using incremental reduced error pruning 23 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année o ๏ท ๏ท Measure for pruning: ๐ฃ = (๐ − ๐)/(๐ + ๐) where ๏ง ๐ is the number of positive examples covered by the rule in the validation set ๏ง ๐ is the number of negative examples covered by the rule in the validation set o Pruning method: deletes any final sequence of conditions that maximizes ๐ฃ Building a rule set o Use sequential covering algorithm ๏ง Finds the best rule that covers the current set of positive examples ๏ง Eliminate both positive and negative examples covered by the rule Stop adding a new rule if the error rate of the rule on the set exceeds 50% 6.4 Nearest Neighbors Classifier We don’t create a model this time, a contrario of the rule-based classifier. 
6.4.1 ๏ท ๏ท ๏ท 6.4.2 Instance Based Classifiers We use instances ๏ no model Store the training records and use them to predict the class label of unseen cases. Examples: o Rote-learner: memorizes entire training data and performs classification only if attributes of record match one of the training examples EXACTLY, otherwise, doesn’t give results. o Nearest neighbor: uses ๐ closest points for performing classification. Basic Idea and concept Basic Idea: “If it walks like a duck, quacks like a duck, then it’s probably a duck.” 6.4.3 Nearest Neighbor Classification Requires 3 things: 1. Set of stored records 2. Distance metric to compute distance between records 3. Value of ๐, the number of nearest neighbors to retrieve To classify an unknown record: ๏ท Distance: Compute distance to other training records o Distance between 2 points: Euclidian distance: ๐(๐, ๐) = √∑๐(๐๐ − ๐๐ )2 o Problem with Euclidian measure: high dimensional data ๏ curse of dimensionality ๏ง If the number of points/objects is kept constant, higher the number of dimensions, larger the distance between points. ๏ง Given a point, the ratio “distance to its nearest neighbor/distance to its farthest neighbor” tends to one for high dimensions. ๏ท NN: Identify ๐ nearest neighbors o Choosing the value of ๐ o If ๐ too small ๏ sensitive to noise points o If ๐ too large ๏ neighborhood may include points from other classes 24 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 ๏ท Class: Use class labels of nearest neighbors to determine the class label of unknown record o Determine class from nearest neighbor list o Take the majority vote of class labels among the ๐ nearest neighbors o Weigh the vote according to distance 1 o Weight factor: ๐ค = ๐ ๐ nearest neighbors of a record ๐ฅ are data points that have the ๐ smallest distance to ๐ฅ (1-2-3 nearest neighbors) 6.4.4 ๏ท ๏ท Precisions Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. ๐-NN classifiers are lazy learners : No model is built, takes time to compute everything every time. 
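A minimal k-NN sketch under the definitions above (Euclidean distance, majority vote, optional 1/d² weighting); the toy points and labels are invented:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, weighted=False):
    """Classify x_new from the class labels of its k nearest training records."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))    # Euclidean distances
    nn = np.argsort(d)[:k]                               # indices of the k nearest neighbors
    if not weighted:
        return Counter(y_train[nn]).most_common(1)[0][0] # majority vote
    votes = {}
    for i in nn:                                         # weigh each vote by w = 1/d^2
        w = 1.0 / (d[i] ** 2 + 1e-12)
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

# toy training set (attributes should be scaled in practice, see the scaling issue above)
X = np.array([[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([2.9, 3.0]), k=3))      # -> "B"
```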
6.4.5 Example: PEBLS (MVDM) Parallel Exemplar-Based Learning System ๏ท Works with continuous (size in cm, …) and nominal (male/female, …) features o Nominal features: distance between 2 nominal values (๐1 and ๐2 ) is calculated using modified value difference metric (MVDM) ๐ ๐1๐ ๐2๐ ๐(๐1 , ๐2 ) = ∑ | − | ๐1 ๐2 ๐=1 o o ๐ – number of classes ๐๐๐ – number of examples from class ๐ with attribute value ๐๐ , ๐ = 1,2 ๐๐ - number of examples with attribute value ๐๐ , ๐ = 1, 2 Distance between ๐ and ๐: โ(๐, ๐) = ∑๐๐=1 ๐(๐๐ , ๐๐ )2 Number of nearest neighbors, ๐ = 1 Distance between nominal attribute values: 2 0 2 4 ๐(๐๐๐๐๐๐, ๐๐๐๐๐๐๐) = | − | + | − | = 1 4 4 4 4 Single and yes ๏ 2/4 Married and yes ๏ 0/4 Single and no ๏ 2/4 Married and no ๏ 4/4 2 1 2 1 ๐(๐๐๐๐๐๐, ๐ท๐๐ฃ๐๐๐๐๐) = | − | + | − | = 0 4 2 4 2 0 1 4 1 ๐(๐๐๐๐๐๐๐, ๐ท๐๐ฃ๐๐๐๐๐) = | − | + | − | = 1 4 2 4 2 0 3 3 4 ๐(๐ ๐๐๐ข๐๐ = ๐๐๐ , ๐ ๐๐๐ข๐๐ = ๐๐) = | − | + | − | = 6/7 3 7 3 7 6.4.6 Proximity measures Proximity refers to these 2: ๏ท Similarity o Numerical measure of how alike 2 data objects are o Higher when objects are more alike o Fall in the range [0,1] 25 Corentin Cossettini Semestre de printemps 2022 Université de Neuchâtel Bachelor en sciences économiques 2ème année ๏ท Dissimilarity o Numerical measure of how different are 2 data objects o Lower when objects are more alike o Min is 0, upper limit varies If Data: 1 p 2 q ๏ท Distance Particular type if dissimilarity ๏ท Euclidian Distance: ๐(๐, ๐) = √∑๐(๐๐ − ๐๐ )2 ๏ท 1 Minkowski Distance: ๐๐ (๐, ๐) = (∑๐๐=1|๐๐ − ๐๐ |๐ )๐ where ๐ = 1,2, … , ∞ o m = 1: Manhattant Distance/Taxicab distance o m = 2: Euclidian Distance o m = ∞: Supremum Distance 6.5 Naïve Bayesian Classifier Take the n attributes independently and try to look at them individually. “How often was it rainy when we go play? How often was it rainy when we go not play ?” ๏ Likelihood of going to play or not depending on each individual value. Then, just multiply all those values together to obtain a final likelihood value 6.5.1 Bayes Classifier Probabilistic framework for solving classification problems Conditional Probability ๐(๐ด, ๐ถ) ๐(๐ถ|๐ด) = ๐(๐ด) ๐(๐ด, ๐ถ) ๐(๐ด|๐ถ) = ๐(๐ถ) Bayes Theorem ๐ท(๐จ|๐ช)๐ท(๐ช) ๐(๐ช|๐จ) = ๐ท(๐จ) ๐(๐ถ) – prior probability of C ๐(๐ถ|๐ด) – posterior (revised) probability of C, after observing a new event A ๐(๐ด|๐ถ) – conditional probability of A, given C (likelihood) ๐(๐ด) – total probability of A (Probability to observe A independent of C) 26 Université de Neuchâtel Bachelor en sciences économiques 2ème année Corentin Cossettini Semestre de printemps 2022 6.5.2 Example of Bayes Theorem 6.5.3 Bayesian Classifiers ๏ท Consider each attribute and class label as random variables ๏ท Works only if the variables are independent. In general, no need to compute the denominator because it’s the same for all classes! 
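A generic numeric illustration of Bayes' theorem (the numbers are invented, not from the slides); here the denominator P(A) is computed explicitly via total probability, even though for classification it can be dropped because it is the same for all classes:

```python
# Bayes' theorem with invented numbers: C = "has the condition", A = "test is positive"
p_c             = 0.01    # prior P(C)
p_a_given_c     = 0.95    # likelihood P(A|C)
p_a_given_not_c = 0.10    # P(A|not C)

# total probability of A, then the posterior P(C|A) = P(A|C) P(C) / P(A)
p_a = p_a_given_c * p_c + p_a_given_not_c * (1 - p_c)
p_c_given_a = p_a_given_c * p_c / p_a
print(round(p_c_given_a, 3))   # about 0.088: observing A revises the prior P(C) upward
```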
6.5.4 Naïve Bayes Classifier
"Naïve" because it is naïve to believe that the attributes are independent → we assume independence among the attributes $A_i$ once the class $C$ is given:
$P(A_1 = a_1, \ldots, A_n = a_n \mid C = c_j) = P(A_1 = a_1 \mid C = c_j) \cdot P(A_2 = a_2 \mid C = c_j) \cdots P(A_n = a_n \mid C = c_j)$

6.5.5 Estimate Probabilities from Data
For discrete attributes:
$P(C = c_j) = N_{c_j} / N$
$P(A_i = a \mid C = c_j) = N_{a c_j} / N_{c_j}$
• $N_{c_j}$ – number of instances of class $c_j$
• $N$ – total number of instances
• $N_{a c_j}$ – number of instances having value $a$ for attribute $A_i$ and belonging to class $c_j$

Examples (for the class C "evade" of the tax data set):
• $P(Status = Married \mid C = No) = 4/7$
• $P(Refund = Yes \mid C = Yes) = 0$
• $P(C = No) = 7/10$

For continuous attributes:
1. Discretize the range into bins
  o One ordinal attribute per bin
  o Violates the independence assumption
  o Example: [60, 75] → cold; [76, 80] → mild; [81, 90] → hot
2. Probability density estimation
  o Assume the attribute follows a normal distribution:
    $P(A_i = a \mid C = c_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \, e^{-\frac{(a - \mu_{ij})^2}{2\sigma_{ij}^2}}$
  o Use the data to estimate the parameters of the distribution (mean, variance).
  o Once the probability distribution is known, it can be used to estimate the conditional probability $P(A_i = a \mid c_j)$.
  o Example: $A_i = Income$, $c_j = No$
    $\mu_{ij} \approx 110$ (sample mean of the Income values for class No)
    $\sigma_{ij}^2 \approx 2975$ (sample variance of the Income values for class No)
    → probability distribution! Estimate the conditional probability:
    $P(Income = 120 \mid No) = \frac{1}{54.54\sqrt{2\pi}} \, e^{-\frac{(120 - 110)^2}{2(2975)}} = 0.0072$

6.5.6 Examples of Naïve Bayes Classifier
When multiplying the values, you should never obtain a zero: a single zero makes all the other attributes stop counting → a problem with naïve Bayes! But there is a solution…

6.5.7 Solution to the problem of a conditional probability = 0
If one of the conditional probabilities is zero, the entire expression becomes zero! → we have to use other probability estimates (already known):
• The original (with the problem): $P(A_i = a \mid C = c) = \frac{N_{ac}}{N_c}$
• Laplace: $P(A_i = a \mid C = c) = \frac{N_{ac} + 1}{N_c + N_{A_i}}$
• M-estimate: $P(A_i = a \mid C = c) = \frac{N_{ac} + m\,p}{N_c + m}$
with
• $N_{A_i}$: number of distinct values that attribute $A_i$ can take
• $p$: prior probability
• $m$: parameter

6.5.8 Summary
• Robust to isolated noise points.
• Handles missing values by ignoring the instance during the probability estimate calculations.
• Robust to irrelevant attributes.
• The independence assumption may not hold for some attributes
  o Use other techniques such as Bayesian Belief Networks (BBN).
Even though attributes are often not independent, we still use naïve Bayes because it is quick to compute and very simple!
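A minimal sketch of the naïve Bayes estimates above for categorical attributes, with the Laplace correction used to avoid zero probabilities (the toy records and attribute names are made up for illustration):

```python
from collections import Counter, defaultdict

def train_nb(records):
    """records: list of (dict attribute -> value, class_label) pairs."""
    class_counts = Counter(label for _, label in records)      # N_c
    value_counts = defaultdict(Counter)                        # N_ac per (attribute, class)
    values = defaultdict(set)                                   # distinct values per attribute
    for attrs, label in records:
        for a, v in attrs.items():
            value_counts[(a, label)][v] += 1
            values[a].add(v)
    return class_counts, value_counts, values

def predict_nb(model, attrs):
    class_counts, value_counts, values = model
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / n                                          # prior P(C = c)
        for a, v in attrs.items():
            # Laplace: P(A = v | C = c) = (N_ac + 1) / (N_c + number of values of A)
            score *= (value_counts[(a, c)][v] + 1) / (n_c + len(values[a]))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data (illustrative only)
data = [({"Refund": "Yes", "Status": "Single"}, "No"),
        ({"Refund": "No", "Status": "Married"}, "No"),
        ({"Refund": "No", "Status": "Single"}, "Yes")]
model = train_nb(data)
print(predict_nb(model, {"Refund": "No", "Status": "Single"}))
```

Thanks to the "+1" in each numerator, an attribute value that was never seen together with a class contributes a small but non-zero factor instead of wiping out the whole product.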
6.6 Decision Tree Classifier
• Solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record;
• After each answer, a follow-up question is asked;
• Stop when a conclusion about the class label of the record is reached.
• The questions and their possible answers can be organized in the form of a decision tree:
  o A hierarchical structure consisting of nodes (nœuds) and directed edges.
  o Root node: has no incoming edge and zero or more outgoing edges.
  o Internal nodes: have exactly one incoming edge and two or more outgoing edges.
  o Leaf nodes: have exactly one incoming edge and no outgoing edges.

6.6.1 Example
(The example tree for the tax data set was shown as a figure.)
If Refund = Yes → Cheat = No, which is why we do not continue further down that branch.
If Marital Status = Married → Cheat = No, which is why we do not continue further there either.

6.6.2 Advantages of a Decision Tree Based Classification
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy comparable to other classification techniques for many simple data sets

6.6.3 General Process to Create a Model
Training set → learn the model (induction, via an algorithm) → model (in this case, a decision tree) → apply the model (deduction) → test set.
To create the tree → several algorithms exist (Hunt's, CART, ID3, …).

6.6.4 Hunt's Algorithm: General Structure
• $D_t$: set of training records that reach a node $t$
• General procedure (a minimal sketch of this recursion is given just before 6.6.6 below):
  o If $D_t$ contains records that all belong to the same class $y_t$: $t$ is a leaf node labeled as $y_t$.
  o If $D_t$ is an empty set: $t$ is a leaf node labeled with the default class $y_d$.
  o If $D_t$ contains records belonging to more than one class: use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset.

6.6.5 Tree Induction
How do we choose the question to ask at each node (including the root node)?
→ Greedy strategy: split the records based on the attribute test that optimizes a certain criterion.
→ Issues we face when doing this:
• Determine how to split the records
  o How to specify the attribute test condition?
  o How to determine the best split?
• Determine when to stop splitting

Specify the test condition
• Depends on the attribute type: nominal, ordinal, continuous.
• Depends on the number of ways to split: 2-way split or multi-way split.
(1) Split for nominal attributes.
(2) Split for ordinal attributes: the ordered values can be grouped into binary splits; such a split is also possible even if the order is not fully respected.
(3) Split for continuous attributes — different ways of handling it:
  o Discretization to form an ordinal categorical attribute (transforming the continuous values into discrete counterparts);
  o Binary decision: $(A < v)$ or $(A \ge v)$ → find the best possible cut among all the splits!

Determine the best split
• The data here are students: we build the tree so that we can classify new records when they arrive. The best variant is the 2nd one, because it actually helps classify the students. The 3rd one is "too perfect": every node is pure (a single class per node), so it tells us nothing useful about new data.
• Greedy approach: nodes with a homogeneous class distribution are preferred.
  o Non-homogeneous → high impurity; homogeneous → low impurity.
→ We need a measure of node impurity!

How to find the best split: compute M1 and M2, the impurities of the candidate children (there are several ways to do this, described in 6.6.6 below). Then compare them with M0, the impurity of the parent node, to check whether we have gained something (we should).
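A minimal sketch of the greedy, recursive partitioning idea of 6.6.4/6.6.5 (Hunt's general procedure). The helper `choose_split` is a placeholder assumption for whatever criterion is used to pick the attribute test; the impurity measures that such a criterion would rely on are defined in 6.6.6 just below.

```python
from collections import Counter

def hunt(records, choose_split, default_class=None):
    """records: list of (attribute_dict, class_label) pairs.
    choose_split(records) must return (test_fn, description) or None."""
    # Case 1: D_t is empty -> leaf node labeled with the default class y_d
    if not records:
        return {"leaf": default_class}
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Case 2: all records in D_t belong to the same class y_t -> leaf labeled y_t
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case 3: several classes -> pick an attribute test (greedy) and recurse
    split = choose_split(records)
    if split is None:
        return {"leaf": majority}
    test_fn, description = split
    yes = [r for r in records if test_fn(r[0])]
    no = [r for r in records if not test_fn(r[0])]
    if not yes or not no:            # the chosen test does not separate anything
        return {"leaf": majority}
    return {"test": description,
            "yes": hunt(yes, choose_split, majority),
            "no": hunt(no, choose_split, majority)}
```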
6.6.6 Measures of Impurity

1) GINI
Gini index for a given node $t$: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
• $p(j \mid t)$: relative frequency of class $j$ at node $t$
• Maximum value: $1 - \frac{1}{n_c}$, when the records are equally distributed among all $n_c$ classes
• Minimum value: 0, when all records belong to one class → if the node is pure, GINI = 0
• When a node $t$ is split into $k$ partitions (children), the aggregated GINI impurity of all the children is:
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$
  o $n_i$: number of records at child $i$
  o $n$: number of records at node $t$
• We look for the lowest $GINI_{split}$ → the lowest impurity.
The quality of the split (GINI gain) is expressed as: $Gain_{GINI} = GINI(t) - GINI_{split}$
• Examples for binary, categorical and continuous attributes were worked out as figures in the slides.

2) Entropy (information gain)
Entropy at a given node $t$: $Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)$
• Measures the homogeneity of a node
• Maximum value: $\log(n_c)$, when the records are equally distributed among all classes
• Minimum value: 0, when all records belong to one class
• When a node $t$ is split into $k$ partitions (children), the aggregated entropy of all the children nodes is:
  $ENTROPY_{split} = \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
  o $n_i$: number of records at child $i$
  o $n$: number of records at node $t$
The quality of the split (information gain) is expressed as: $Gain_{ENTROPY} = Entropy(t) - ENTROPY_{split}$
→ Measures the reduction in entropy achieved by the split.
→ Choose the split that achieves the largest reduction!

3) Classification error
Classification error at node $t$: $Error(t) = 1 - \max_j p(j \mid t)$
• Measures the misclassification error made by a node
• Maximum value: $1 - \frac{1}{n_c}$, when the records are equally distributed among all classes
• Minimum value: 0, when all records belong to one class
• The aggregated classification error of all the children nodes:
  $ERROR_{split} = \sum_{i=1}^{k} \frac{n_i}{n} Error(i)$
  o $n_i$: number of records at child $i$
  o $n$: number of records at node $t$
$Gain_{ERROR} = Error(t) - ERROR_{split}$

(A small sketch computing these three measures follows below.)
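A minimal sketch of the three impurity measures and of the split gain computed from them, in plain Python (the class counts at the end are made up; the entropy uses the base-2 logarithm, one common convention):

```python
import math

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy(t) = - sum_j p(j|t) log2 p(j|t)   (0 log 0 treated as 0)
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def class_error(counts):
    # Error(t) = 1 - max_j p(j|t)
    n = sum(counts)
    return 1.0 - max(counts) / n

def split_gain(parent_counts, children_counts, impurity=gini):
    """Gain = M0 - sum_i (n_i / n) * M_i, for any of the three impurity measures."""
    n = sum(parent_counts)
    m_split = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - m_split

# Toy node: 6 records of class "+" and 6 of "-", split into two children
parent = [6, 6]
children = [[5, 1], [1, 5]]
print(round(split_gain(parent, children, gini), 3))      # positive -> impurity decreased
print(round(split_gain(parent, children, entropy), 3))
```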
6.6.7 Practical Issues of Classification

6.6.7.1 Underfitting and Overfitting
Underfitting: the model is too simple.
• It has not yet learned the true structure of the data.
• The error rates on the training set and on the test set are both large.
Solution: build a slightly more complex model (relax the stopping conditions, e.g. lower minsplit from 20 to 2).

Overfitting: the model is more complex than necessary.
We have tried too hard to learn the data "by heart", including the noise in the data (values that do not help us).
• The error rate on the training set is small, but the error rate on the test data is large.
• The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
• We need new ways of estimating errors → the tree has become too large.
Solution: see below.

Overfitting due to noise: (illustrated with a figure in the slides).
Overfitting due to insufficient examples:
• The lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
• An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Estimating generalization errors
• Reduced error pruning (REP), the 3rd method: use a separate data set.

Occam's Razor (principle)
• Given 2 models with similar generalization errors, one should prefer the simpler model over the more complex one.
• For a complex model, there is a greater chance that it was fitted accidentally by errors in the data.
• Therefore, one should take model complexity into account when evaluating a model.

Minimum Description Length (MDL): (presented as a figure in the slides).

How to address overfitting (solutions)
1. Pre-pruning
We do not let the tree grow fully; we stop before it becomes too large.
• Stop the algorithm before it becomes a fully-grown tree.
• Typical stopping conditions for a node:
  o Stop if all instances belong to the same class.
  o Stop if all the attribute values are the same.
• More restrictive conditions:
  o Stop if the number of instances is less than some user-specified threshold.
  o Stop if the class distribution of the instances is independent of the available features (e.g. using a $\chi^2$ test).
  o Stop if expanding the current node does not improve the impurity measures (e.g. Gini or information gain).
2. Post-pruning
• Grow the decision tree to its entirety.
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the sub-tree by a leaf node.
• The class label of the leaf node is determined from the majority class of the instances in the sub-tree.
• MDL can be used for post-pruning.
(An example was worked through in the slides; a short practical illustration of pre- and post-pruning is given after 6.6.7.4 below.)

6.6.7.2 Missing Values
Missing values affect the construction of a decision tree in 3 different ways:
1. How the impurity measures are computed;
2. How an instance with a missing value is distributed to the child nodes;
3. How a test instance with a missing value is classified.

6.6.7.3 Data Fragmentation
• The number of instances gets smaller as you traverse down the tree.
• The number of instances at the leaf nodes could be too small to make any statistically significant decision.

6.6.7.4 Search Strategy
• Finding an optimal decision tree is NP-hard.
• The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.
• Other strategies?
  o Bottom-up
  o Bi-directional
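As a practical illustration of the pre- and post-pruning ideas above (this sketch assumes scikit-learn and NumPy are available, which is my assumption, not something the course prescribes, and the synthetic data is made up): `max_depth` and `min_samples_split` act as pre-pruning stopping conditions, while `ccp_alpha` triggers cost-complexity post-pruning of the fully grown tree.

```python
# Sketch only: assumes scikit-learn and NumPy are installed.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)   # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                   # fully grown
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=20,
                             random_state=0).fit(X_tr, y_tr)                    # pre-pruned
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)   # post-pruned

for name, m in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "train:", round(m.score(X_tr, y_tr), 3),
          "test:", round(m.score(X_te, y_te), 3))
```

Typically the fully grown tree reaches a near-perfect training score but a noticeably lower test score, while the pruned variants trade a little training accuracy for a smaller gap between the two — the overfitting picture described above.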
6.6.7.5 Expressiveness
• Decision trees provide an expressive representation for learning discrete-valued functions.
• But they do not generalize well to certain types of Boolean functions.
  o Example: the parity function
    ▪ Class = 1 if the number of Boolean attributes with value True is even
    ▪ Class = 0 if the number of Boolean attributes with value True is odd
  o For accurate modeling, a complete tree is required.
• Not expressive enough for modeling continuous variables.
  o Particularly when the test condition involves only a single attribute at a time.

Decision Boundary
• The border line between 2 neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

Oblique Decision Trees
• The test condition may involve multiple attributes.
• More expressive representation.
• Finding the optimal test condition is computationally expensive.

6.7 Model Evaluation
• Metrics for performance evaluation: "How to evaluate the performance of a model?"
• Methods for performance evaluation: "How to obtain reliable estimates?"
• Methods for model comparison: "How to compare the relative performance among competing models?"

6.7.1 Metrics for Performance Evaluation
Focus on the predictive capability of a model.
Confusion matrix (2-class case):
                      Predicted = Yes    Predicted = No
  Actual = Yes        a (TP)             b (FN)
  Actual = No         c (FP)             d (TN)

Metrics:
• $Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
• $Error\ rate = 1 - Accuracy = \frac{FP + FN}{TP + TN + FP + FN}$
• True positive rate (TPR, sensitivity) $= \frac{TP}{TP + FN}$
• True negative rate (TNR, specificity) $= \frac{TN}{TN + FP}$
• False positive rate (FPR) $= \frac{FP}{TN + FP}$
• False negative rate (FNR) $= \frac{FN}{TP + FN}$
(A short computational sketch of these metrics is given after the learning-curve discussion in 6.7.2 below.)

Limitation of accuracy
• Consider a 2-class problem:
  o Number of class 0 examples = 9990
  o Number of class 1 examples = 10
• If the model predicts everything to be class 0: $Accuracy = \frac{9990}{10000} = 99.9\%$
→ Misleading, because the model does not detect any class 1 example → use a cost matrix.

Cost Matrix
• $C(i, j)$: cost of classifying an example of class $i$ as class $j$ (one entry per cell of the confusion matrix).
• Total cost = sum over all the cells of (number of records in the cell) × (cost assigned to that cell):
  $Cost = a\,C(Yes, Yes) + b\,C(Yes, No) + c\,C(No, Yes) + d\,C(No, No)$
• Example: 2 models compared using the same cost matrix (figure in the slides).
• Cost vs. accuracy: (comparison shown as a figure in the slides).

Cost-Sensitive Measures
These are the kinds of measures used by Google in its search engine → searching for something within a very large number of documents.
• Precision: $p = \frac{a}{a + c} = \frac{TP}{TP + FP}$
• Recall: $r = \frac{a}{a + b} = \frac{TP}{TP + FN}$
• F-measure: $F = \frac{2\,r\,p}{r + p} = \frac{2a}{2a + b + c}$ → the bigger the F-measure, the better.
• Weighted accuracy $= \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$

Remarks
• Precision is biased towards $C(Yes, Yes)$ and $C(No, Yes)$.
• Recall is biased towards $C(Yes, Yes)$ and $C(Yes, No)$.
• F-measure is biased towards everything except $C(No, No)$.
  o It combines precision and recall → the (harmonic) mean of the two.

6.7.2 Methods for Performance Evaluation
How do we obtain an estimate of the performance?
→ Note: the performance of a model can depend on factors other than the learning algorithm: the class distribution (e.g. 1% vs 99%), the cost of misclassification, and the size of the training and test sets.

Learning Curve
• Shows how accuracy changes with varying sample size.
• Requires a sampling schedule for creating one.
• Warning: the smaller the sample size, the larger the error of the estimate.
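A minimal sketch of the metrics of 6.7.1, computed from the four cells of a 2-class confusion matrix (the counts used at the end are made up for illustration):

```python
def metrics(tp, fn, fp, tn):
    """Evaluation metrics of 6.7.1 from the cells a=TP, b=FN, c=FP, d=TN."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)            # p = a / (a + c)
    recall = tp / (tp + fn)               # r = a / (a + b), also the TPR
    fpr = fp / (fp + tn)                  # false positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": recall,
            "fpr": fpr, "f_measure": f_measure}

# Illustrative counts only
print(metrics(tp=40, fn=10, fp=5, tn=45))
```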
Methods of Estimation
1. Holdout: reserve 2/3 of the data for training and 1/3 for testing.
2. Repeated holdout: repeat the holdout (e.g. 3 times) so that every subset gets tested.
3. Cross-validation:
  o Partition the data into $k$ disjoint subsets;
  o $k$-fold: train on $k - 1$ partitions, test on the remaining one;
  o Leave-one-out: $k = n$.
4. Repeated cross-validation: repeat the whole cross-validation several times.

ROC (Receiver Operating Characteristic)
Characterizes the trade-off between positive hits (TPR) and false alarms (FPR).
• The ROC curve plots TPR (y-axis) against FPR (x-axis).
• The performance of each classifier is represented as a point on the ROC curve.
  o Changing the threshold $t$ (seuil) of the algorithm, the sample distribution or the cost matrix changes the location of the point.

ROC Curve
• 1-dimensional data set containing 2 classes (positive and negative).
• Any point located at $x > t$ is classified as positive.
• At the threshold $t \approx 0.3$ of the slide example: TPR = 0.5, FPR = 0.12, TNR = 0.88, FNR = 0.5.
Reading points as (TPR, FPR):
• (0, 0): declare everything to be the negative class;
• (1, 1): declare everything to be the positive class;
• (1, 0): ideal → we look for the point nearest to (1, 0).
Diagonal line:
• Random guessing;
• Below the diagonal line: the prediction is the opposite of the true class.

ROC for Model Comparison
• No model consistently outperforms the other:
  o $M_1$ is better for small FPR;
  o $M_2$ is better for large FPR.
• Area under the ROC curve (AUC):
  o Ideal: area = 1;
  o Random guess: area = 0.5.

How to construct a ROC curve (a small sketch follows below)
• Use a classifier that produces a posterior probability $P(+ \mid A)$ for each test instance $A$.
• Sort the instances according to $P(+ \mid A)$ in decreasing order.
• Apply a threshold $t$ at each unique value of $P(+ \mid A)$.
• Count the number of TP, FP, TN, FN at each threshold.
• TP rate: $TPR = \frac{TP}{TP + FN}$
• FP rate: $FPR = \frac{FP}{FP + TN}$
• If $t = 0.7$ → we send (the offer) to 7 people.
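A minimal sketch of the ROC construction just described: sort the test instances by their predicted $P(+ \mid A)$, lower the threshold one instance at a time, and record the (FPR, TPR) point at each step (the scores and true labels below are made up):

```python
def roc_points(scores, labels):
    """scores: predicted P(+|A) per instance; labels: '+' or '-'.
    Returns the list of (FPR, TPR) points obtained by sweeping the threshold."""
    pos = sum(1 for l in labels if l == "+")
    neg = len(labels) - pos
    # sort the instances by P(+|A) in decreasing order
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]                     # threshold above every score
    tp = fp = 0
    for score, label in ranked:               # lower the threshold one instance at a time
        if label == "+":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# Illustrative posterior probabilities and true classes
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "+", "-", "+", "-", "-"]
print(roc_points(scores, labels))
```

Plotting these points (FPR on the x-axis, TPR on the y-axis) gives the ROC curve; the closer the curve passes to the ideal corner, the better the classifier.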
6.7.3 Test of Significance (confidence intervals)
• Given 2 models:
  o $M_1$: accuracy = 85%, tested on 30 instances;
  o $M_2$: accuracy = 75%, tested on 5000 instances.
• Can we say that $M_1$ is better than $M_2$?
  o (1) What confidence can we place on the accuracy of $M_1$ and $M_2$?
  o (2) Can the difference in performance be explained as the result of random fluctuations in the test set?

(1) Confidence interval for the accuracy
• Accuracy is a mean.
• Each prediction is a Bernoulli trial:
  o 2 possible outcomes: correct or wrong, for example;
  o All the Bernoulli trials together form a binomial distribution: $X \sim Bin(N, p)$, with $X$ the number of correct predictions.
    ▪ Example: toss a fair coin 50 times.
    • Expected number of heads: $E(X) = Np = 50 \times 0.5 = 25$
    • Variance: $Var(X) = Np(1 - p) = 50 \times 0.5 \times 0.5 = 12.5$
    • Probability that heads shows up 20 times (from the binomial distribution): $P(X = 20) = \binom{50}{20} 0.5^{20} (1 - 0.5)^{30} \approx 0.04$
• Given $x$ (the number of correct predictions) and $N$ (the number of test instances): $acc = x / N$.
  o acc → sample; p → population.
• Is it possible to estimate $p$, the true accuracy of the model (for the whole population)?
• For large test sets ($N > 30$), $acc$ has approximately a normal distribution with mean $p$ and variance $p(1 - p)/N$:
  $P\left( Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1 - p)/N}} < Z_{1 - \alpha/2} \right) = 1 - \alpha$
• Confidence interval for $p$:
  $CI_{1-\alpha}[p] = \frac{2\,N\,acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4\,N\,acc - 4\,N\,acc^2}}{2\,(N + Z_{\alpha/2}^2)}$
• Example: (the worked numerical example was given as a figure in the slides.)

(2) Comparing the performance of 2 models
The 2 models are tested on test sets of 2 different sizes: is the larger one really statistically different from the smaller one? → test the difference between the 2 models $M_1$ and $M_2$.
• Given 2 models $M_1$ and $M_2$, which one is better?
  o $M_1$ is tested on $D_1$ (size $n_1$), found error rate $= e_1$;
  o $M_2$ is tested on $D_2$ (size $n_2$), found error rate $= e_2$;
  o Assume $D_1$ and $D_2$ are independent;
  o Assume $n_1$ and $n_2$ are sufficiently large; then:
    $e_1 \sim N(\mu_1, \sigma_1)$, $e_2 \sim N(\mu_2, \sigma_2)$, with $\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$
• To test whether the performance difference is statistically significant, consider $d = e_1 - e_2$:
  o $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference;
  o Since $D_1$ and $D_2$ are independent, their variances add up:
    $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \cong \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}$
  o At the $(1 - \alpha)$ confidence level: $CI_{1-\alpha}[d_t] = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$
• Example: (worked out as a figure in the slides.)
  o If zero is in the interval, the difference is not statistically significant.
  o If zero is not in the interval, the difference is statistically significant.

6.7.4 Comparing the Performance of 2 Algorithms
With different models!
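A minimal sketch of the two interval computations of 6.7.3 above: the confidence interval for the true accuracy $p$ of a single model, and the interval for the true difference $d_t$ between two error rates. The value $Z_{\alpha/2} = 1.96$ for 95% confidence is the usual one; the numbers fed in are those of the section (accuracy 85% on 30 instances vs 75% on 5000 instances, i.e. error rates 0.15 and 0.25).

```python
import math

def ci_accuracy(acc, n, z=1.96):
    """CI for the true accuracy p (formula of section 6.7.3, 95% by default)."""
    num = 2 * n * acc + z ** 2
    half = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    den = 2 * (n + z ** 2)
    return ((num - half) / den, (num + half) / den)

def ci_difference(e1, n1, e2, n2, z=1.96):
    """CI for the true difference d_t = e1 - e2 between two error rates."""
    d = e1 - e2
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # independent test sets: variances add up
    half = z * math.sqrt(var)
    return (d - half, d + half)

# M1: accuracy 85% on 30 instances; M2: accuracy 75% on 5000 instances
print(ci_accuracy(0.85, 30))     # wide interval: only 30 test instances
print(ci_accuracy(0.75, 5000))   # narrow interval: 5000 test instances
# Difference of the error rates e1 = 0.15 and e2 = 0.25
lo, hi = ci_difference(0.15, 30, 0.25, 5000)
print((lo, hi), "-> significant" if lo * hi > 0 else "-> not significant (0 is in the interval)")
```

With these numbers the interval for $d_t$ contains zero, so the apparent 10-point difference between the two models is not statistically significant, exactly the kind of conclusion the confidence-interval test above is meant to deliver.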