
Statistical Learning
Corentin Cossettini
Spring semester 2022
Université de Neuchâtel
Bachelor in Economics, 2nd year
Table of Contents
1 Introduction
  1.1 Types of Data
2 Review of Probability
  2.1 Random Variables and Probability Distributions
  2.2 Expected Values, Mean and Variance
  2.3 Two Random Variables
  2.4 Different Types of Distributions
  2.5 Random Sampling, Distribution of $\bar{Y}$ and Estimation
3 Review of Statistics
  3.1 Hypothesis Tests
    3.1.1 Concerning the Population Mean
  3.2 Confidence Intervals
    3.2.1 Concerning the Population Mean
  3.3 Scatterplots, the Sample Covariance and the Sample Correlation
4 Linear Regression with One Regressor
  4.1 Linear Regression Model with a Single Regressor (population)
  4.2 Estimating the Coefficients of the Linear Regression Model
  4.3 Measures of Fit and Prediction Accuracy
  4.4 The Least Squares Assumptions (hypotheses)
  4.5 The Sampling Distribution of the OLS Estimators
5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals
  5.1 Testing Hypotheses About One of the Regression Coefficients
  5.2 Confidence Intervals for a Regression Coefficient
  5.3 Regression When X Is a Binary Variable
  5.4 Heteroskedasticity and Homoskedasticity
    5.4.1 Mathematical Implications of Homoskedasticity
  5.5 The Theoretical Foundations of OLS
  5.6 Using the t-Statistic in Regression When the Sample Size Is Small
  5.7 Summary and Assessment
6 Introduction to Data Mining
  6.1 Data Mining - Concepts
  6.2 Classification - Basic Concepts
  6.3 Rule-Based Classifier: Basic Concepts
  6.4 Nearest Neighbors Classifier
    6.4.1 Instance Based Classifiers
    6.4.2 Basic Idea and Concept
    6.4.3 Nearest Neighbor Classification
    6.4.4 Precisions
    6.4.5 Example: PEBLS (MVDM)
    6.4.6 Proximity Measures
  6.5 Naïve Bayesian Classifier
    6.5.1 Bayes Classifier
    6.5.2 Example of Bayes Theorem
    6.5.3 Bayesian Classifiers
    6.5.4 Naïve Bayes Classifier
    6.5.5 Estimate Probabilities from Data
    6.5.6 Examples of Naïve Bayes Classifier
    6.5.7 Solution to the Problem of a Value of an Attribute = 0
    6.5.8 Summary
  6.6 Decision Tree Classifier
    6.6.1 Example
    6.6.2 Advantages of a Decision Tree Based Classification
    6.6.3 General Process to Create a Model
    6.6.4 Hunt's Algorithm: General Structure
    6.6.5 Tree Induction
    6.6.6 Measures of Impurity
    6.6.7 Practical Issues of Classification
  6.7 Model Evaluation
    6.7.1 Metrics for Performance Evaluation
    6.7.2 Methods for Performance Evaluation
    6.7.3 Test of Significance (confidence intervals)
    6.7.4 Comparing Performance of 2 Algorithms
General information
• Textbook: Introduction to Econometrics, Stock & Watson → chapters 2-7
• Read and understand the material at home before the lesson
• Sum up the slides and add material from the book
• Work on the exercises/notebooks before the class and finish them in class
• Sessions are organized in two parts:
  1. Explanations of the chapter
  2. Solving the notebooks
• Work on and sum up the material before class → textbook + theory (repetition)
Part I
1 Introduction
Econometrics is the science and art of using economic theory and statistical techniques to
analyze economic data.
1.1 Types of Data
Cross-Sectional Data
Consists of multiple entities observed at a single time period
Time Series Data
Consists of a single entity observed at multiple time periods
Panel Data
Consists of multiple entities observed at two or more time periods
2 Review of Probability
2.1 Random Variables and Probability Distributions
Population
• Collection of all possible entities of interest
• We will consider populations as infinitely large
Population distribution of Y
• Probability of the different values of Y that occur in the population
Random variable Y
• Numerical summary of a random outcome
• 2 types:
  o Discrete: takes a discrete set of values (0, 1, 2, …)
  o Continuous: takes a continuum of possible values
Probability Distribution of a Discrete Random Variable
• List of all possible values of the variable and the probability that each value will occur.
• Probability of events: the probability that an event occurs comes from the probability distribution → looks like a histogram because the variable is discrete
• Cumulative probability distribution: probability that the random variable is less than or equal to a particular value → $F_Y$
• Bernoulli distribution: the discrete random variable is binary, so the outcome is 0 or 1.
Probability Distribution of a Continuous Random Variable
• Cumulative probability distribution: probability that the random variable is less than or equal to a particular value
• Probability density function: the area under the probability density function between any two points is the probability that the random variable falls between those points.
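A minimal sketch of these definitions in R, with a hypothetical discrete example (two fair coin flips, not from the course): the probability distribution lists the values and their probabilities, and the cumulative distribution is just the running sum.

```r
# Sketch: probability distribution of a discrete random variable
# (hypothetical example: Y = number of heads in two fair coin flips)
y   <- c(0, 1, 2)
p   <- c(0.25, 0.50, 0.25)   # probability of each value (sums to 1)
F_Y <- cumsum(p)             # cumulative probability distribution F_Y
rbind(value = y, probability = p, cumulative = F_Y)
```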
2.2 Expected Values, Mean and Variance
Expected Value of a Random Variable
• Expected value: denoted $E(Y)$, represents the long-run average value of the random variable over many repeated trials or occurrences.
• For a discrete random variable, it is computed as a weighted average of the possible outcomes of that random variable, where the weights are the probabilities of each outcome.
• For a Bernoulli random variable $G$: $E(G) = p$
Expected Value of a Continuous Random Variable
• It is the probability-weighted average of the possible outcomes of the random variable.
Standard Deviation and Variance
• The variance of the discrete random variable $Y$ is:
$$\sigma_Y^2 = \mathrm{var}(Y) = E[(Y - \mu_Y)^2] = \sum_{i=1}^{k} (y_i - \mu_Y)^2 p_i$$
• For a Bernoulli random variable: $\mathrm{var}(G) = p(1-p)$
• The standard deviation of the discrete random variable $Y$ is $\sigma_Y = \sqrt{\sigma_Y^2}$
Other Measures of the Shape of a Distribution
Mean and standard deviation are important measures for a distribution. Two others exist:

Skewness:
$$\mathrm{Skewness} = \frac{E[(Y - \mu_Y)^3]}{\sigma_Y^3}$$
• Measures the lack of symmetry of a distribution.
• Symmetric/normal distribution: Skewness = 0
• Distribution with a long right tail: Skewness > 0
• Distribution with a long left tail: Skewness < 0

Kurtosis:
$$\mathrm{Kurtosis} = \frac{E[(Y - \mu_Y)^4]}{\sigma_Y^4}$$
• Measures how thick and heavy the tails of a distribution are.
• Raising deviations to the fourth power changes the weight of the observations: the smaller ones disappear and the bigger ones get extremely big.
• Symmetric/normal distribution: Kurtosis = 3
• Heavy-tailed distribution: Kurtosis > 3
• Kurtosis ≥ 0; the greater the kurtosis, the more likely outliers are.
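A small sketch of these two formulas in R, on simulated (hypothetical) data; for a normal distribution we expect skewness ≈ 0 and kurtosis ≈ 3:

```r
# Sketch: sample skewness and kurtosis computed from the formulas above
set.seed(1)
y  <- rnorm(1e5)             # simulated normal data
mu <- mean(y); s <- sd(y)
c(skewness = mean((y - mu)^3) / s^3,   # ~ 0 for a normal distribution
  kurtosis = mean((y - mu)^4) / s^4)   # ~ 3 for a normal distribution
```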
2.3 Two Random Variables
Joint and Marginal Distributions
• Random variables X and Z have a joint distribution: the probability that the random variables simultaneously take on certain values.
• The marginal probability distribution of a random variable Y is just another name for its probability distribution, used when Y is considered together with other random variables.
Conditional distribution
• The distribution of a random variable Y conditional on another random variable X taking on a specific value. It gives the conditional probability that Y takes on the value y when X takes the value x.
• $\Pr(Y = y \mid X = x) = \dfrac{\Pr(X = x, Y = y)}{\Pr(X = x)}$
Conditional expectation / conditional mean
• The mean of the conditional distribution of Y given X: the expected value of Y, computed using the conditional distribution of Y given X.
• $E(Y \mid X = x) = \sum_{i=1}^{k} y_i \Pr(Y = y_i \mid X = x)$
Conditional variance
• The variance of the conditional distribution of Y given X.
• $\mathrm{var}(Y \mid X = x) = \sum_{i=1}^{k} [y_i - E(Y \mid X = x)]^2 \Pr(Y = y_i \mid X = x)$
Bayes' rule
• Says that the conditional probability of Y given X is the conditional probability of X given Y times the relative marginal probabilities of Y and X:
$$\Pr(Y = y \mid X = x) = \frac{\Pr(X = x \mid Y = y)\Pr(Y = y)}{\Pr(X = x)}$$
Independence
Two random variables X and Y are independently distributed (independent) if:
1) Knowing the value of one of them provides no information about the other.
2) The conditional distribution of Y given X equals the marginal distribution of Y.
3) $\Pr(Y = y \mid X = x) = \Pr(Y = y)$
Covariance
• Measures the dependence of two random variables (how they move together)
• The covariance between X and Y is the expected value $E[(X - \mu_X)(Y - \mu_Y)]$, where $\mu_X$ is the mean of X and $\mu_Y$ is the mean of Y.
• $\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sigma_{XY}$
• Measure of the linear association of X and Y; its units are (units of X) × (units of Y).
• $\mathrm{cov}(X, Y) > 0$ means a positive relation between X and Y (and vice versa)
• If X and Y are independent: $\mathrm{cov}(X, Y) = 0$
• $\mathrm{cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = \sigma_X^2$
Correlation
• Covariance has units, and that can be a problem. Correlation solves this problem.
• Measure of dependence between X and Y:
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
• If $\mathrm{corr}(X, Y) = 0$: X and Y are uncorrelated (independence implies zero correlation, but zero correlation does not imply independence)
• Correlation is always between −1 and 1
2.4 Different Types of Distributions
The Normal Distribution
• Normal distribution with mean $\mu$ and variance $\sigma^2$: $N(\mu, \sigma^2)$
• Standard normal distribution: $N(0, 1)$
• Standardize Y by computing $Z = (Y - \mu)/\sigma$
Other types: Chi-squared, Student t and F distributions
2.5 Random Sampling, Distribution of $\bar{Y}$ and Estimation
If $Y_1, \dots, Y_n$ are i.i.d. and $\bar{Y}$ is the estimator of the population mean ($\mu_Y$):
Sampling Distribution of $\bar{Y}$
• The properties of $\bar{Y}$ are determined by its sampling distribution
• The individuals in the sample are drawn randomly → the values of $(Y_1, \dots, Y_n)$ are random → functions of $(Y_1, \dots, Y_n)$, such as $\bar{Y}$, are random (had a different sample been drawn, they would have taken on a different value)
• Sampling distribution of $\bar{Y}$ = the distribution of $\bar{Y}$ over different possible samples of size n
• The mean and variance of $\bar{Y}$ are the mean and variance of its sampling distribution:
  o $E(\bar{Y}) = \mu_Y$
  o $\mathrm{var}(\bar{Y}) = E[\bar{Y} - E(\bar{Y})]^2 = \dfrac{\sigma_Y^2}{n}$
As results of these two:
• $\bar{Y}$ → unbiased estimator of $\mu_Y$
• $\mathrm{var}(\bar{Y})$ → inversely proportional to n
The concept of the sampling distribution underpins all of econometrics.
When n is large:
• If n is small → the exact sampling distribution is complicated; if n is large → it is simple
• As n increases, the distribution of $\bar{Y}$ becomes more tightly centered around $\mu_Y$
• The distribution of $\bar{Y} - \mu_Y$ becomes normal (CLT)
Example of the sampling distribution:
• Mean of $\bar{Y}$: if $E(\bar{Y}) = \mu = 0.78$ (the true value) → $\bar{Y}$ is an unbiased estimator of $\mu$
• If $\bar{Y}$ gets close to $\mu$ when n is large: Law of Large Numbers → $\bar{Y}$ is a consistent estimator of $\mu$
The Law of Large Numbers
• An estimator is consistent if the probability that it falls within an interval of the true population value tends to 1 as the sample size increases
• If $(Y_1, \dots, Y_n)$ are i.i.d. and $\sigma_Y^2 < \infty$, then $\bar{Y}$ is a consistent estimator of $\mu_Y$, that is:
$$\Pr[|\bar{Y} - \mu_Y| < \varepsilon] \to 1 \ \text{as}\ n \to \infty, \quad \text{for any}\ \varepsilon > 0\text{, however small}$$
$$\bar{Y} \xrightarrow{p} \mu_Y \quad \left(= \lim_{n \to \infty} \bar{Y} = \mu_Y\right)$$
Central Limit Theorem
• If $(Y_1, \dots, Y_n)$ are i.i.d. and $\sigma_Y^2 < \infty$, then when n is large the distribution of $\bar{Y}$ is well approximated by a normal distribution:
$$\bar{Y} \sim N\!\left(\mu_Y, \frac{\sigma_Y^2}{n}\right)$$
• Standardized: $\dfrac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{var}(\bar{Y})}} = \dfrac{\bar{Y} - \mu_Y}{\sigma_Y/\sqrt{n}}$ is approximately $N(0, 1)$
• The larger n is, the better the approximation
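A simulation sketch in R of the claims above, using the Bernoulli example with the hypothetical true mean 0.78: many samples of size n are drawn, $\bar{Y}$ is computed for each, and its mean and variance are compared with $\mu_Y$ and $\sigma_Y^2/n$.

```r
# Sketch: the sampling distribution of Ybar for a Bernoulli(0.78) population
set.seed(1)
p <- 0.78; n <- 100
ybar <- replicate(10000, mean(rbinom(n, 1, p)))   # 10,000 samples of size n
c(mean_of_ybar = mean(ybar), mu_Y = p)                       # E(Ybar) = mu_Y (unbiased)
c(var_of_ybar = var(ybar), sigma2_over_n = p * (1 - p) / n)  # var(Ybar) = sigma_Y^2 / n
```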
3 Review of Statistics
Estimators and Estimates
• Estimator: a function of a sample of data drawn randomly from a population → a random variable.
• Estimate: the numerical value of the estimator when it is actually computed using data from a specific sample → a nonrandom number.
Bias, Consistency and Efficiency
$\bar{Y}$ is an estimator of $\mu_Y$:
• Bias: for $\bar{Y}$ it is $E(\bar{Y}) - \mu_Y$ → the difference between them!
  o $\bar{Y}$ is unbiased → $E(\bar{Y}) = \mu_Y$
• Consistency: when the sample size is large, the uncertainty about $\mu_Y$ arising from random variations in the sample is very small.
  o $\bar{Y}$ is consistent → $\bar{Y} \xrightarrow{p} \mu_Y$
• Efficiency: if an estimator has a smaller variance than another one, it is more efficient.
  o $\bar{Y}$ is more efficient than $\tilde{Y}$ → $\mathrm{var}(\bar{Y}) < \mathrm{var}(\tilde{Y})$
• Among linear unbiased estimators, $\bar{Y}$ has the smallest variance: it is the Best Linear Unbiased Estimator (BLUE) and the most efficient.
3.1 Hypothesis Tests
A method to choose between 2 hypotheses, in an uncertain context.
3.1.1 Concerning the Population Mean
Null hypothesis: the hypothesis to be tested
$$H_0: E(Y) = \mu_{Y,0}$$
Two-sided alternative hypothesis: holds if the null hypothesis does not
$$H_1: E(Y) \neq \mu_{Y,0}$$
→ We use the evidence to decide whether to reject $H_0$ (and accept $H_1$) or fail to reject it.
p-value
Probability of drawing a statistic at least as adverse to the null as the value actually computed with your data, assuming that the null hypothesis is true.
• Significance level of a test: pre-specified probability of incorrectly rejecting the null, when the null is true.
• Some definitions:
  $\mu_{Y,0}$: specific value of the population mean under $H_0$
  $\bar{Y}$: sample average
  $E(Y)$: population mean, unknown
  $\bar{Y}^{act}$: value of the sample average actually computed in the data set at hand
To compute the p-value, it is necessary to know the sampling distribution of $\bar{Y}$ under $H_0$.
→ According to the CLT, the distribution is normal when the sample size n is large.
→ The sampling distribution of $\bar{Y}$ under $H_0$ is $N\!\left(\mu_{Y,0}, \dfrac{\sigma_Y^2}{n}\right)$
Calculating the p-value when $\sigma_Y$ is unknown
In practice we have to estimate the standard deviation from the sample, because we don't have it for the population.
We have $\bar{Y} \sim N\!\left(\mu_{Y,0}, \dfrac{\sigma_Y^2}{n}\right)$, where $\dfrac{\sigma_Y^2}{n} = \sigma_{\bar{Y}}^2$.
→ Standard deviation of the sampling distribution of $\bar{Y}$: $\sigma_{\bar{Y}} = \sigma_Y/\sqrt{n}$ #built from the population, but we don't know its variance, so we have to find an estimator
→ Estimator of $\sigma_{\bar{Y}}$: $SE(\bar{Y}) = \hat{\sigma}_{\bar{Y}} = s_Y/\sqrt{n}$ #see how $s_Y$ is computed below
QUESTION: do we divide by n twice? (No: $s_Y$ divides by $n-1$ inside the sample variance of Y; the extra $\sqrt{n}$ then converts the spread of Y into the spread of $\bar{Y}$.)
Sample variance and sample standard deviation
$$s_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad s_Y = \sqrt{s_Y^2}$$
#dividing by n−1 makes the estimator unbiased; we lost 1 degree of freedom when estimating the mean
→ Calculating the p-value with $\sigma_{\bar{Y}}^2$ estimated and n large:
$$p\text{-value} \cong \Pr_{H_0}\!\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}\right| > \left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{SE(\bar{Y})}\right|\right] = 2\Phi\!\left(-\left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{s_Y/\sqrt{n}}\right|\right)$$
The t-statistic
Also called the standardized sample average; we use it quite often as a test statistic to perform hypothesis tests:
$$t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{n}}, \qquad \text{once computed:}\quad t^{act} = \frac{\bar{Y}^{act} - \mu_{Y,0}}{s_Y/\sqrt{n}}$$
We can rewrite the formula for the p-value by substituting in the t-statistic:
$$p\text{-value} = \Pr_{H_0}[|t| > |t^{act}|] = 2\Phi(-|t^{act}|)$$
Link between p-value, significance level and results
→ The significance level is prespecified. Example: significance level = 5%
• Reject $H_0$ if $|t^{act}| \geq 1.96$, equivalently reject $H_0$ if $p \leq 0.05$
• A small p-value means that $\bar{Y}^{act}$ (or $t^{act}$) is far away from the mean under $H_0$, and that it is very unlikely that this sample would have been drawn if $H_0$ were true, i.e. if the population mean were equal to the population mean under $H_0$. → reject $H_0$
• A big p-value means that $\bar{Y}^{act}$ (or $t^{act}$) is close to the mean under $H_0$, and that it is quite likely that this sample would have been drawn if $H_0$ were true. → do not reject $H_0$
• p-value: the marginal significance level
One-sided Alternatives
$H_1: E(Y) > \mu_{Y,0}$
The general approach is the same, with the modification that only large positive values of the t-statistic reject $H_0$ → right side of the normal distribution:
$$p\text{-value} = \Pr_{H_0}(Z > t^{act}) = 1 - \Phi(t^{act})$$
or $H_1: E(Y) < \mu_{Y,0}$ → left side of the normal distribution:
$$p\text{-value} = \Pr_{H_0}(Z < t^{act}) = \Phi(t^{act})$$
3.2 Confidence Intervals
Use data from a random sample to construct a set of values (confidence set) that contains the true population mean $\mu_Y$ with a certain prespecified probability (confidence level). The upper and lower limits of the confidence set form an interval (confidence interval).
3.2.1 Concerning the population mean
• A 95% CI for $\mu_Y$ is an interval that contains the true value of $\mu_Y$ in 95% of all possible samples. The CI is random because it will differ from one sample to the next.
• $\mu_Y$ is NOT random; we just don't know it.
• A 95% CI can be seen as the set of values of $\mu_Y$ not rejected by a hypothesis test with a 5% significance level.
For further information, see "Résumé statistique inférentielle II" from the 3rd semester.
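A tiny sketch of the large-sample 95% CI in R (hypothetical data, true mean 0.8):

```r
# Sketch: large-sample 95% confidence interval for mu_Y
set.seed(1)
y  <- rnorm(200, mean = 0.8)
ci <- mean(y) + c(-1.96, 1.96) * sd(y) / sqrt(length(y))
ci   # contains the true mean in ~95% of all possible samples
```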
3.3 Scatterplots, the Sample Covariance and the Sample Correlation
The 3 ways to summarize the relationship between variables!
Scatterplots
Sample Covariance and Correlation
Estimators of the population covariance and correlation, computed by replacing a population mean with a sample mean. Just like the sample variance, they are consistent.
Sample covariance: $s_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
Sample correlation: $r_{XY} = \dfrac{s_{XY}}{s_X s_Y}$
A high correlation (close to 1) means that the points in the scatterplot fall very close to a straight line.
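In R, cov() and cor() implement exactly the n−1 formulas above; a sketch on hypothetical data:

```r
# Sketch: sample covariance and correlation
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)               # hypothetical linear relationship
c(s_XY = cov(x, y), r_XY = cor(x, y)) # r close to 1 -> points near a line
```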
ADD DIFFERENCE BETWEEN MEANS? NECESSARY? (chap. 3 SW).
4 Linear Regression with One Regressor
Allows us to estimate, and make inferences about, population slope coefficients. Our purpose is to estimate the causal effect on Y of a unit change in X.
→ relationship between X and Y
Linear regression (a statistical procedure) can be used for:
• Causal inference: using data to estimate the effect on an outcome of interest of an intervention that changes the value of another variable.
• Prediction: using the observed value of some variable to predict the value of another variable.
4.1 Linear Regression Model with a Single Regressor (population)
The problems of statistical inference for linear regression are basically the same as for estimation of the mean or of the difference between two means. We want to figure out whether there is a relationship between the two variables (hypothesis test). Ideally n is large, because we work with the CLT.
General Notation of the Population Regression Model
• $X_i$: independent variable or regressor
• $Y_i$: dependent variable or regressand
• $\beta_0$: intercept → coefficient/parameter, value of the regression line when $X = 0$
• $\beta_1$: slope → coefficient/parameter, difference in Y associated with a unit difference in X
• $u_i = Y_i - (\beta_0 + \beta_1 X_i)$: the error term, difference between $Y_i$ and its predicted value according to the regression line
• $Y_i = \beta_0 + \beta_1 X_i$: population regression line/function → relationship that holds between X and Y, on average, over the population
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad i = 1, \dots, n$$
4.2 Estimating the Coefficients of the Linear Regression Model
We don't know the coefficients, so we have to find estimators for them from the available data!
→ learn about the population using a sample of data
→ these values are estimated from a sample to make an inference about the population
The Ordinary Least Squares Estimator (OLS)
Intuitively, we want to fit a line through the data: the line that makes the smallest squared errors. The OLS estimator chooses the regression coefficients so that the line is as close as possible to the data. You find them as follows:
1) Let $b_0$ and $b_1$ be some estimators of $\beta_0$ and $\beta_1$
2) Regression line based on these estimators: $b_0 + b_1 X_i$
3) Mistake made when predicting: $Y_i - (b_0 + b_1 X_i) = Y_i - b_0 - b_1 X_i$
4) Sum of all these squared prediction mistakes over the n observations: $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$
5) There is a unique pair of estimators that minimizes this expression → the OLS estimators
→ the OLS estimator of $\beta_0$ is $\hat{\beta}_0$ and the OLS estimator of $\beta_1$ is $\hat{\beta}_1$
General Notation of the Sample Regression Line
• $\hat{\beta}_1 = \dfrac{s_{XY}}{s_X^2}$
• $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
• $\hat{u}_i = Y_i - \hat{Y}_i$: residual for the $i$th observation, where $\hat{Y}_i$ is the predicted value
• $\hat{\beta}_0 + \hat{\beta}_1 X$: OLS regression line / sample regression line/function
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, \qquad i = 1, \dots, n$$
4.3 Measures of Fit and Prediction Accuracy
How well does the regression line fit the data? Is the regressor responsible for a small or a large part of the variation of the dependent variable? Are the observations tightly distributed around the regression line or spread out?
The Regression $R^2$
Fraction of the sample variance of Y explained by X.
1) We have from above: $\hat{u}_i = Y_i - \hat{Y}_i \iff Y_i = \hat{Y}_i + \hat{u}_i$
2) $R^2$ can be written as the ratio of the explained sum of squares to the total sum of squares:
$$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \quad \text{#(predicted − sample average)}^2$$
$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \quad \text{#(actual − sample average)}^2$$
$$R^2 = \frac{ESS}{TSS} = \mathrm{corr}(Y, X)^2$$
3) $R^2$ can also be written in terms of the fraction of the variance of $Y_i$ not explained by $X_i$, a.k.a. the sum of squared residuals:
$$SSR = \sum_{i=1}^{n} \hat{u}_i^2$$
4) $TSS = ESS + SSR$, so
$$R^2 = 1 - \frac{SSR}{TSS}, \qquad 0 \leq R^2 \leq 1$$
Complete with the exercises!
The Standard Error of the Regression (SER)
Estimator of the standard deviation of the regression error $u_i$:
$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2}, \qquad \text{where}\quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n-2}$$
#we divide by n−2 because two degrees of freedom were lost when estimating $\beta_0$ and $\beta_1$
Predicting Using OLS
• In-sample prediction: the observation for which the prediction is made was also used to estimate the regression coefficients → we stay in the sample.
• Out-of-sample prediction: prediction for observations not in the estimation sample.
The goal of prediction: provide accurate out-of-sample predictions.
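A sketch in R of $R^2$ and SER computed from the residuals, on the same kind of hypothetical data as the OLS sketch above:

```r
# Sketch: R^2 and SER from SSR and TSS
set.seed(1)
x   <- rnorm(100); y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
SSR <- sum(residuals(fit)^2)
TSS <- sum((y - mean(y))^2)
c(R2  = 1 - SSR / TSS,                 # matches summary(fit)$r.squared
  SER = sqrt(SSR / (length(y) - 2)))   # n - 2 degrees of freedom
```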
4.4 The Least Squares Assumptions (hypotheses)
What, in a precise sense, are the properties of the OLS estimator?
We would like it to be unbiased, and to have a small variance. Does it? Under what conditions
is it an unbiased estimator of the true population parameters?
To answer these questions, we need to make some assumptions about how Y and X are related
to each other, and about how they are collected (the sampling scheme).
These assumptions – there are three – are known as the Least Squares Assumptions.
๐‘Œ๐‘– = ๐›ฝ0 + ๐›ฝ1 ๐‘‹๐‘– + ๐‘ข๐‘– , ๐‘– = 1, … , ๐‘›
1. The conditional distribution of $u_i$ given $X_i$ has mean zero
  o That is, $E(u \mid X = x) = 0$
  o Implies that $\hat{\beta}_1$ is unbiased
  o Intuitively, consider an ideal randomized controlled experiment: X is randomly assigned to people (e.g. students randomly assigned to different class sizes). Randomization is done by computer, using no information about the individual. Because X is assigned randomly, all other individual characteristics (the things that make up u) are independently distributed of X. Thus, in an ideal randomized controlled experiment, $E(u \mid X = x) = 0$.
  o Recall: if $E(u \mid X = x) = 0$, then u and X have zero covariance and are uncorrelated.
  o In actual experiments, or with observational data, this is not necessarily true.
2. $(X_i, Y_i),\ i = 1, \dots, n$ are independently and identically distributed
  o i.i.d. arises automatically if sampling is by simple random sampling: the entity is selected, then, for that entity, X and Y are observed (recorded)
  o The main place we will encounter non-i.i.d. sampling is when data are recorded over time ("time series data") → extra complications
3. Large outliers are unlikely
A large outlier can strongly influence the results! (See the outlier of Y in the figure, red circle.)
→ that's why it's important to look at the data before fitting any regression line.
4.5 The Sampling Distribution of the OLS Estimators
The OLS estimators $(\hat{\beta}_0, \hat{\beta}_1)$ are computed from a randomly drawn sample → they are random variables with a sampling distribution that describes the values they could take over different possible random samples.
• The means of the sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ are $\beta_0$ and $\beta_1$:
  $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$ → unbiased estimators
Normal Approximation to the Distribution of the OLS Estimators in Large Samples
If the Least Squares Assumptions hold, then $\hat{\beta}_0, \hat{\beta}_1$ have a jointly normal sampling distribution:
  o $\hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}^2)$ where $\sigma_{\hat{\beta}_1}^2 = \dfrac{1}{n} \dfrac{\mathrm{var}[(X_i - \mu_X)u_i]}{[\mathrm{var}(X_i)]^2}$
  o $\hat{\beta}_0 \sim N(\beta_0, \sigma_{\hat{\beta}_0}^2)$ where $\sigma_{\hat{\beta}_0}^2 = \dfrac{1}{n} \dfrac{\mathrm{var}(H_i u_i)}{[E(H_i^2)]^2}$, with $H_i = 1 - \left[\dfrac{\mu_X}{E(X_i^2)}\right] X_i$
• Implications:
  o When n is large, the distribution of the estimators will be tightly centered around their means.
  o Consistent estimators.
  o The larger $\mathrm{var}(X_i)$, the smaller $\sigma_{\hat{\beta}_1}^2$. Intuitively, if there is more variation in X, then there is more information in the data that you can use to fit the regression line → it is easier to draw a regression line through the points with the bigger $\mathrm{var}(X_i)$ (the black dots in the figure).
##########################################################################
30.04.2022: recap so far
We looked at probability: the regression model
• $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $Z = \beta_0 + \beta_1 X_i$ is the systematic part. This model explains (X, Y).
• $E[u \mid X] = 0$, which implies:
  o $E[u] = 0$
  o $\mathrm{cov}(X, u) = 0$
The parameters of the model are $\beta_0$ and $\beta_1$:
$$\beta_0 = E[Y] - \beta_1 E[X], \qquad \beta_1 = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$
$$\mathrm{var}(Y) = \mathrm{var}(Z) + \mathrm{var}(u), \qquad \frac{\mathrm{var}(Z)}{\mathrm{var}(Y)} = [\mathrm{corr}(X, Y)]^2$$
All these probability tools are useful to do statistics, working with samples: we are going to estimate the parameters of the model and find out how good the regression is.
$$\widehat{\mathrm{var}}(X) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad \widehat{\mathrm{cov}}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
$$\beta_0 \to \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \beta_1 \to \hat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
We want to know the part of the regression that is good → work with Z, because u is the error term. Plugging the estimated coefficients into $\mathrm{var}(Z)$ (note that $\hat{\beta}_0 + \hat{\beta}_1 X_i$ is named $\hat{Y}_i$ in the book):
$$\widehat{\mathrm{var}}(Z) = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{Y}_i - \bar{\hat{Y}}\big)^2 \quad\Rightarrow\quad \frac{\widehat{\mathrm{var}}(Z)}{\widehat{\mathrm{var}}(Y)} = \frac{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{Y}_i - \bar{\hat{Y}}\big)^2}{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = R^2$$
SER:
$Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i = \hat{u}_i$; it is the estimate of the contribution of the other terms.
$$SER = sd(\hat{u}) = \sqrt{\widehat{\mathrm{var}}(\hat{u})}$$
##########################################################################
5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals
5.1 Testing Hypotheses About One of the Regression Coefficients
The process is the same for $\beta_0$ and $\beta_1$!
General form of the t-statistic:
$$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$$
Two-Sided Hypotheses Concerning $\beta_1$
Because $\hat{\beta}_1$ also has a normal sampling distribution in large samples, hypotheses about the true value of the slope $\beta_1$ can be tested using the same approach as for $\bar{Y}$:
$$H_0: \beta_1 = \beta_{1,0} \qquad H_1: \beta_1 \neq \beta_{1,0}$$
To test $H_0$ against the alternative $H_1$, just as for the population mean of Y, we follow 3 steps:
ฬ‚๐Ÿ
1) Compute the standard error of ๐œท
2
ฬ‚1 ) = √๐œŽฬ‚
๐‘†๐ธ(๐›ฝ
๐›ฝ
1
Heteroskedasticity-robust standard errors
1
2
Where ๐œŽฬ‚๐›ฝฬ‚
=๐‘›
0
2
ฬ…
1
๐‘‹
∑๐‘›
ฬ‚๐‘–2
๐‘ข
[1−(
)๐‘‹
]
๐‘–
1 ๐‘›
๐‘›−2 ๐‘–=1
∑
๐‘‹2
๐‘› ๐‘–=1 ๐‘–
2 2
ฬ…
1 ๐‘›
๐‘‹
(๐‘› ∑๐‘–=1[1−( 1 ๐‘›
)๐‘‹๐‘– ] )
∑
๐‘‹2
๐‘› ๐‘–=1 ๐‘–
2
๐œŽฬ‚๐›ฝฬ‚
1
ฬ‚๐‘– = 1 − ( 1
๐ป
๐‘›
๐‘‹ฬ…
2
∑๐‘›
๐‘–=1 ๐‘‹๐‘–
) ๐‘‹๐‘–
1
๐‘›
2 2
1 ๐‘› − 2 ∑๐‘–=1(๐‘‹๐‘– − ๐‘‹ฬ…) ๐‘ขฬ‚๐‘–
=
2
๐‘› 1 ๐‘›
[๐‘› ∑๐‘–=1(๐‘‹๐‘– − ๐‘‹ฬ…)2 ]
2) Compute the t-statistic
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
3) Compute the p-value
Reject the hypothesis at the 5% significance level if the p-value is less than 0.05 or, equivalently, if $|t^{act}| > 1.96$:
$$p\text{-value} = \Pr_{H_0}\!\left[\left|\frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}\right| > \left|\frac{\hat{\beta}_1^{act} - \beta_{1,0}}{SE(\hat{\beta}_1)}\right|\right] = 2\Phi(-|t^{act}|)$$
Note: regression software computes those values automatically (e.g. summary(lm(...)) in R)
One-Sided Hypotheses Concerning $\beta_1$
• $H_0: \beta_1 = \beta_{1,0}$ and $H_1: \beta_1 < \beta_{1,0}$ (or $> \beta_{1,0}$)
• Because $H_0$ is the same, the construction of the t-statistic is the same.
• The only difference between a one- and a two-sided hypothesis test is how you interpret the t-statistic:
  o For a left-tail test: $p\text{-value} = \Pr_{H_0}(Z < t^{act}) = \Phi(t^{act})$
  o For a right-tail test: $p\text{-value} = \Pr_{H_0}(Z > t^{act}) = 1 - \Phi(t^{act})$
• In practice, we use this kind of test only when there is a reason to do so.
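A sketch in R of testing $H_0: \beta_1 = 0$ with heteroskedasticity-robust standard errors, using the sandwich and lmtest packages (assumed installed; they are not part of base R; data are hypothetical):

```r
# Sketch: t-test on beta1 with heteroskedasticity-robust SEs
library(sandwich)
library(lmtest)
set.seed(1)
x   <- rnorm(100); y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust SE, t-statistic, p-value
```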
5.2 Confidence Intervals for a Regression Coefficient
The process is the same for $\beta_0$ and $\beta_1$!
Confidence Interval for $\beta_1$
• 2 definitions:
  1) The set of values that cannot be rejected using a two-sided hypothesis test with a 5% significance level
  2) An interval that has a 95% probability of containing the true value of $\beta_1$: in 95% of possible samples that might be drawn, the CI will contain the true value
• Link with hypothesis tests: a hypothesis test with a 5% significance level will reject the true value of $\beta_1$ in only 5% of all possible samples → in 95% of all possible samples, the true value of $\beta_1$ won't be rejected (2nd definition above)
• When the sample size is large:
$$95\%\ \text{CI for}\ \beta_1 = \left[\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1),\ \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)\right]$$
Confidence intervals for the predicted effect of changing X by $\Delta x$:
$$95\%\ \text{CI for}\ \beta_1 \Delta x = \left[\big(\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1)\big)\Delta x,\ \big(\hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)\big)\Delta x\right]$$
5.3 Regression When X Is a Binary Variable
It is also possible to run a regression analysis when the regressor is binary (takes only 2 values: 0 or 1).
Binary variable = indicator variable = dummy variable
Interpretation of the Regression Coefficients
• The mechanics are the same as with a continuous regressor
• The interpretation is different and is equivalent to a difference-of-means analysis.
$$Y_i = \beta_0 + \beta_1 D_i + u_i$$
• What is $D_i$? It is the binary variable, which can take only 2 values.
• What is $\beta_1$? Called the "coefficient on $D_i$". 2 possible cases:
  o $D_i = 0$: $Y_i = \beta_0 + u_i$. Since $E(u_i \mid D_i) = 0$, we get $E(Y_i \mid D_i = 0) = \beta_0$ → $\beta_0$ is the population mean value when $D_i = 0$
  o $D_i = 1$: $Y_i = \beta_0 + \beta_1 + u_i$, so $E(Y_i \mid D_i = 1) = \beta_0 + \beta_1$ → $\beta_0 + \beta_1$ is the population mean value when $D_i = 1$
• Thus, $\beta_1 = E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)$ → the difference between population means!
Hypothesis Tests and Confidence Intervals
• If the 2 population means are the same → $\beta_1 = 0$
• We can test this: $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$
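A sketch in R showing that the OLS slope on a dummy equals the difference in sample means (hypothetical data with $\beta_0 = 5$, $\beta_1 = 2$):

```r
# Sketch: regression on a dummy = difference-in-means analysis
set.seed(1)
d <- rbinom(100, 1, 0.5)             # binary regressor
y <- 5 + 2 * d + rnorm(100)
coef(lm(y ~ d))[2]                   # beta1_hat ...
mean(y[d == 1]) - mean(y[d == 0])    # ... equals the difference in sample means
```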
5.4 Heteroskedasticity and Homoskedasticity
We care about this because it influences our hypothesis tests (the SE is not the same!)
Definitions
The error term $u_i$ is:
• homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$, $\mathrm{var}(u_i \mid X_i = x)$, is constant for $i = 1, \dots, n$ and in particular does not depend on $X_i$;
• heteroskedastic otherwise.
→ Homoskedastic data are a particular case of heteroskedastic data in which the variance of the estimator $\hat{\beta}_1$ becomes simpler.
Visually
• Homoskedastic: LSA 1 satisfied because $E(u_i \mid X_i = x) = 0$; the variance of u does not depend on X
• Heteroskedastic: LSA 1 satisfied because $E(u_i \mid X_i = x) = 0$; the variance of u depends on X
5.4.1 Mathematical Implications of Homoskedasticity
So far, we have allowed the data to be heteroskedastic. What if the error terms are in fact homoskedastic?
• Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent and asymptotically normal.
• The formulas for the variance of $\hat{\beta}_1$ and the OLS standard errors simplify.
Homoskedasticity-only Variance Formulas
$$\mathrm{var}(\hat{\beta}_1) = \frac{\sigma_u^2}{n\,\sigma_X^2}, \qquad \mathrm{var}(\hat{\beta}_0) = \frac{E(X_i^2)\,\sigma_u^2}{n\,\sigma_X^2}$$
Note: we see that $\mathrm{var}(\hat{\beta}_1)$ is inversely proportional to $\mathrm{var}(X)$: more spread in X means more information about $\beta_1$.
Homoskedasticity-only Standard Errors
$$SE(\hat{\beta}_1) = \sqrt{\tilde{\sigma}_{\hat{\beta}_1}^2}, \qquad \tilde{\sigma}_{\hat{\beta}_1}^2 = \frac{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
$$SE(\hat{\beta}_0) = \sqrt{\tilde{\sigma}_{\hat{\beta}_0}^2}, \qquad \tilde{\sigma}_{\hat{\beta}_0}^2 = \frac{\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) s_{\hat{u}}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The standard errors we used before, which we call heteroskedasticity-robust standard errors, are valid whether or not the errors are heteroskedastic.
Practical Implications
• In general, you get different standard errors using the different formulas.
• To be safe, always use the general formula (heteroskedasticity-robust standard errors), because it is valid in both cases.
• Warning: R uses the simpler formula by default.
• If you don't override the default and there is in fact heteroskedasticity, your standard errors (and hence your t-statistics and confidence intervals) will be wrong; typically, homoskedasticity-only SEs are too small.
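A sketch in R comparing R's default (homoskedasticity-only) SEs with robust ones, on hypothetical heteroskedastic data (sandwich package assumed installed):

```r
# Sketch: homoskedasticity-only vs. robust SEs on the same fit
library(sandwich)
set.seed(1)
x   <- rnorm(200)
y   <- 1 + 0.5 * x + rnorm(200, sd = abs(x))   # error variance depends on x
fit <- lm(y ~ x)
rbind(homosk_only = sqrt(diag(vcov(fit))),                   # R's default formula
      robust_HC1  = sqrt(diag(vcovHC(fit, type = "HC1"))))   # robust formula
```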
5.5 The Theoretical Foundations of OLS
What we already know:
• OLS is unbiased and consistent
• A formula for heteroskedasticity-robust standard errors
• How to construct CIs and test statistics
We use OLS because it is the common language of regression analysis!
What we doubt:
• Is this really a good reason to use OLS? Might other estimators with smaller variances be better?
• What happened to the Student t distribution?
To answer these questions, we have to make stronger assumptions.
The Extended Least Squares Assumptions
The 3 LSA + 2 others:
1) $E(u_i \mid X_i = x) = 0$
2) $(X_i, Y_i),\ i = 1, \dots, n$ are i.i.d.
3) Large outliers are rare
4) u is homoskedastic
5) u is distributed $N(0, \sigma^2)$
Assumptions 4 and 5 are more restrictive, so they apply in fewer cases. But they lead to stronger results (under those assumptions), because the calculations simplify.
Efficiency of OLS, part I: The Gauss-Markov Theorem
Under the first 4 assumptions, $\hat{\beta}_1$ has the smallest variance among all linear estimators: "If the three least squares assumptions hold and the errors are homoskedastic, then the OLS estimator $\hat{\beta}_1$ is the Best Linear conditionally Unbiased Estimator (BLUE)."
Efficiency of OLS, part II
Under all 5 assumptions, $\hat{\beta}_1$ has the smallest variance of all consistent estimators. If the errors are homoskedastic and normally distributed, in addition to LSA 1-3, then OLS is a better choice than any other consistent estimator → the best you can do!
Some not-so-good things about OLS
Warning: OLS has important limitations!
1. The Gauss-Markov theorem has 2 limitations:
  • Homoskedasticity often doesn't hold in practice
  • The result holds only for linear estimators, which are only a small subset of all estimators
2. The strongest result (OLS is the best you can do, cf. "part II" above) requires homoskedastic normal errors → pretty rare in applications
3. OLS is more sensitive to outliers than some other estimators. For example, for the mean of a population with big outliers, we prefer the median.
5.6 Using the t-Statistic in Regression When the Sample Size Is Small
1) $E(u_i \mid X_i = x) = 0$
2) $(X_i, Y_i),\ i = 1, \dots, n$ are i.i.d.
3) Large outliers are rare
4) u is homoskedastic
5) u is distributed $N(0, \sigma^2)$
If they all hold, $\hat{\beta}_0, \hat{\beta}_1$ are normally distributed, and under the null hypothesis the t-statistic has a Student t distribution with $n - 2$ degrees of freedom.
• Why $n - 2$? Because we estimated 2 parameters
• For $n < 30$, the t critical values can be a fair bit larger than the $N(0, 1)$ critical values
• For $n > 50$, the difference between the $t_{n-2}$ and $N(0, 1)$ distributions is negligible. Recall the Student t table:

Degrees of freedom | 5% t-distribution critical value
10 | 2.23
20 | 2.09
30 | 2.04
60 | 2.00
∞ | 1.96
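The table can be reproduced in R with the t quantile function; qnorm() gives the normal limit:

```r
# Sketch: 5% two-sided critical values of the t distribution
round(qt(0.975, df = c(10, 20, 30, 60)), 2)   # 2.23 2.09 2.04 2.00
round(qnorm(0.975), 2)                        # 1.96 (the limit as df -> infinity)
```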
Practical Implications
• If n < 50 and you really believe that, for your application, u is homoskedastic and normally distributed, then use the $t_{n-2}$ instead of the $N(0, 1)$ critical values for hypothesis tests and confidence intervals.
• In most econometric applications, there is no reason to believe that u is homoskedastic and normal; usually, there is good reason to believe that neither assumption holds.
• Fortunately, in modern applications n > 50, so we can rely on the large-n results presented earlier, based on the CLT, to perform hypothesis tests and construct confidence intervals using the large-n normal approximation.
5.7 Summary and Assessment
• The initial policy question: suppose new teachers are hired so the student-teacher ratio falls by one student per class. What is the effect of this policy intervention ("treatment") on test scores?
• Does our regression analysis answer this convincingly?
  o Not really: districts with a low student-teacher ratio tend to be ones with lots of other resources and higher-income families, which provide kids with more learning opportunities outside school. This suggests that $\mathrm{corr}(u_i, STR_i) > 0$, so $E(u_i \mid X_i) \neq 0$.
  o So we have omitted some factors, or variables, from our analysis, and this has biased our results.
Part II
Data mining: predicting the value of one attribute (variable) using other attributes.
6 Introduction to Data Mining
6.1 Data Mining - Concepts
Definitions
• The science of discovering structure and making predictions in large samples.
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
Objectives
• Discovering structures/patterns: explore and understand the data → understanding the past!
• Making predictions: given measurements of variables, learn a model to predict their future values → predict the future!
Data: definition
• Collection of data objects and their attributes
• Attribute: a property or characteristic of an object
• Object: described by a collection of attributes
Attributes
Properties: distinctness (=, ≠) · order (<, >) · addition (+, −) · multiplication (*, /)

Type | Examples | Description | Properties
Nominal | ID number, eye color, … | Provides only enough information to distinguish one object from another | Distinctness
Ordinal | rankings, grades, height | Provides enough information to order objects | Distinctness and order
Interval | dates, temperatures, … | Differences between values are meaningful | Distinctness, order and addition
Ratio | length, time, counts | Differences between values and ratios are meaningful | Distinctness, order, addition and multiplication
Concepts
• Domain: the set of objects from the real world about which knowledge is supposed to be delivered by data mining
• Target attribute: the attribute to be predicted ('YES'/'NO')
• Input attributes: observable attributes that can be used for prediction
• Dataset: a subset of the domain, described by the set of available attributes (rows: instances; columns: attributes)
• Predictive model: chunks of knowledge about some domain of interest that can be used to answer queries not just about instances from the data used for model creation, but also about any other instances from the same domain
Origins of Data Mining
An analytic process that uses one or more available datasets from the same domain to create one or more models for the domain. The ultimate goal of data mining is delivering predictive models.
• Uses ideas from machine learning, statistics and database systems
• There are vocabulary differences between statistics and machine learning (the same concepts often carry different names).
Inductive learning
• Most data mining algorithms use an inductive learning approach: the algorithm (or learner) is provided with training information, from which it has to derive knowledge via inductive inference. Any prediction then uses that knowledge to deduce the answer via deductive inference.
• Knowledge: the expected result of the learning process
• Inductive inference: discovering patterns in the training information and generalizing them appropriately ("specific to general")
• Query: a new case or observation with unknown aspects
• Deductive inference: supplying the values of the unknown aspects based on a "general to specific" process
The learner isn't informed and has no possibility to verify with certainty which of the possible generalizations are correct → it is complicated to check whether a prediction is true or not. Use feedback to improve the quality of the model.
Data Mining Tasks
• Regression
  Predict the value of a numeric target attribute based on the values of other input attributes, assuming a linear or nonlinear model of dependency.
  Applications: predicting the sales amount of a new product based on advertising expenditure, …
  The numeric target variable is the only difference between regression and classification.
• Classification
  Predict a discrete target attribute based on the values of other input attributes. In a lot of cases the target variable is binary.
  Applications: fraud detection, direct marketing, …
• Clustering
  Predict the assignment of instances to a set of clusters, such that the instances in any one cluster are more similar to each other than to instances outside the cluster.
  Applications: market segmentation
• Association Rule Discovery
  Given a set of records, each of which contains some number of items from a given collection, discover rules relating the items.
  Applications: identify items that are bought together by sufficiently many customers, …
6.2 Classification - Basic Concepts
Definition
• Given a collection of records (training set)
  o Each record is defined as a tuple (x, y), where x is a set of attributes and y is a special attribute, denoted the class label (also called target)
• Find a model for the class attribute y as a function f of the values of the attributes x
  o The function maps each attribute set x to one of the predefined class labels y
  o The function is a classification model
• Goal: previously unseen records should be assigned a class as accurately as possible
  o A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
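A minimal sketch in R of a training/test split (R's built-in iris data as a stand-in for the course examples):

```r
# Sketch: dividing a data set into training and test sets
set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # 70% of the rows for training
train <- iris[idx, ]                           # used to build the model
test  <- iris[-idx, ]                          # held out to validate the model
```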
Illustrating Classification Task
Classification Techniques
• Neural networks
• Decision tree based methods → the most important! But others exist.
6.3 Rule-Based Classifier: basic concepts
• The classification model uses a collection of IF…THEN… rules
  o "Condition → y"
  o LHS (left hand side) → rule antecedent or condition, a conjunction of tests on attributes
  o RHS (right hand side) → rule consequent, the class label (y)
Rule
• A rule R covers an instance x if the attributes of the instance satisfy the condition of the rule
• Coverage of a rule: the fraction of records that satisfy the antecedent of the rule → all the instances covered by the rule (%)
• Accuracy of a rule: the fraction of records that satisfy both the antecedent and the consequent of the rule → among the instances that satisfy the condition, look which ones have the predicted class. We don't look at all the instances (only the ones that satisfy the rule)!
If no rule applies, we give a default value.
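A small sketch in R of the two measures, for a hypothetical rule "income > 50 → buys = yes" on a tiny hypothetical data frame:

```r
# Sketch: coverage and accuracy of one rule
df <- data.frame(income = c(60, 30, 80, 55, 20),
                 buys   = c("yes", "no", "no", "yes", "no"))
covered  <- df$income > 50                   # records satisfying the antecedent
coverage <- mean(covered)                    # fraction of ALL records covered
accuracy <- mean(df$buys[covered] == "yes")  # correct fraction among covered only
c(coverage = coverage, accuracy = accuracy)  # 0.6 and 2/3
```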
Characteristics of Rule-Based Classifiers
• Mutually exclusive rules: every possible record is covered by at most one rule
• Exhaustive rules: every possible record is covered by at least one rule
We can mix them! (exhaustive + mutually exclusive, …)
Rules Can Be Simplified
Effects of rule simplification:
• Rules are no longer mutually exclusive: a record may trigger more than one rule
  Solution: ordered rule set, or unordered rule set with a voting scheme
• Rules are no longer exhaustive: a record may not trigger any rule
  Solution: use a default class
Ordered Rule Set
• Known as a decision list
• Give an order to the rule set (the 1st that applies wins, …)
When a test record is presented to the classifier, it is assigned to the class label of the highest-ranked rule it has triggered (if no rule applies, it is assigned to the default class!)
Rule Ordering Schemes
• Rule-based ordering: individual rules are ranked based on their quality
  o Question: how to define quality? Several possibilities…
• Class-based ordering: rules that belong to the same class appear together
Building Classification Rules
• Direct method: extract rules directly from the data (example: sequential covering)
• Indirect method: extract rules from other classification models
STOPPING conditions:
→ if the quality of the rule is not met;
→ if there are no more instances to cover.
Aspects of Sequential Covering
How to find rules, linked with the direct method:
1. Rule Growing
2 common strategies:
  1) Top-down: general-to-specific
  2) Bottom-up: specific-to-general
2. Rule Evaluation
Needed to determine which conjunct should be added or removed during the rule-growing process.
Metrics for one rule: the plain accuracy is not a reliable estimate for future data, because it is based on past data → improve this measure with the Laplace or M-estimate corrections. With n the number of instances covered by the rule, $f_+$ the number of positive instances covered, k the number of classes and $p_+$ the prior probability of the positive class, the standard corrections are:
$$\text{Laplace} = \frac{f_+ + 1}{n + k}, \qquad \text{M-estimate} = \frac{f_+ + k\,p_+}{n + k}$$
Another measure: FOIL's information gain. It compares 2 rules (for a rule extension $R_0 \to R_1$): if $R_0$ covers $p_0$ positive and $n_0$ negative instances and $R_1$ covers $p_1$ positive and $n_1$ negative instances,
$$\text{Gain}(R_0, R_1) = p_1 \times \left[\log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0}\right]$$
where $\frac{p_1}{p_1 + n_1}$ measures the precision of the second rule, and likewise $\frac{p_0}{p_0 + n_0}$ for the first.
3. Rule Pruning
Reduced Error Pruning
๏ƒ  remove one of the conjuncts in the rule
๏ƒ  comparer error rate on validation set before and after pruning.
๏ƒ  if error improves, prune the conjunct
What is the validation set?
Allows us to compute the accuracy after we created the rule in the training set.
Training set ๏ƒ  rules ๏ƒ  verify with the validation set
Laplace is also a way to evaluate the rules to choose which one on them is the best
4. Stopping criterion
Compute the gain
If the gain is not significant, discard the new rule
Direct method – summary
1. Grow a single rule
2. Remove the training instances covered by the rule
3. Prune the rule if necessary
4. Add the rule to the current rule set
5. Repeat (see the sketch of this loop below)
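A sketch of that loop; learn_one_rule() is a hypothetical helper that grows (step 1) and prunes (step 3) a single rule and reports its quality:

def sequential_covering(records, learn_one_rule, min_quality=0.5):
    rules, remaining = [], list(records)
    while remaining:
        rule, quality = learn_one_rule(remaining)      # steps 1 + 3
        if rule is None or quality < min_quality:      # stopping criterion
            break
        rules.append(rule)                             # step 4
        condition, _ = rule
        remaining = [r for r in remaining              # step 2: drop covered instances
                     if not covers(condition, r)]
    return rules                                       # step 5 is the while-loop itself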
Example of Direct method – RIPPER variant
It’s a method (among others) that builds a Rule-Based Classifier.
• For a 2-class problem
Choose one of the classes as the positive class (the other is the negative class)
o Learn rules for the positive class
o The negative class is the default class
• For a multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
o Learn the rule set for the smallest class first, treating the rest as the negative class
o Repeat with the next smallest class as the positive class
• Growing a rule
o Start from an empty rule
o Add conjuncts as long as they improve FOIL's information gain
o Stop when the rule starts covering negative examples
• Prune the rule immediately using incremental reduced error pruning
o Measure for pruning: v = (p − n)/(p + n), where
▪ p is the number of positive examples covered by the rule in the validation set
▪ n is the number of negative examples covered by the rule in the validation set
o Pruning method: delete any final sequence of conditions that maximizes v
• Building a rule set
o Use the sequential covering algorithm
▪ Find the best rule that covers the current set of positive examples
▪ Eliminate both the positive and negative examples covered by the rule
o Stop adding a new rule if the error rate of the rule on the set exceeds 50%
6.4 Nearest Neighbors Classifier
This time we don't create a model, contrary to the rule-based classifier.
6.4.1 Instance-Based Classifiers
• We use instances → no model
• Store the training records and use them to predict the class label of unseen cases.
• Examples:
o Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples EXACTLY; otherwise, it gives no result.
o Nearest neighbor: uses the k closest points for performing classification.
6.4.2 Basic Idea and Concept
Basic idea: “If it walks like a duck, quacks like a duck, then it's probably a duck.”
6.4.3 Nearest Neighbor Classification
Requires 3 things:
1. A set of stored records
2. A distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
• Distance: compute the distance to the other training records
o Distance between 2 points: Euclidean distance: d(p, q) = √(Σᵢ (pᵢ − qᵢ)²)
o Problem with the Euclidean measure: high-dimensional data → curse of dimensionality
▪ If the number of points/objects is kept constant, the higher the number of dimensions, the larger the distance between points.
▪ Given a point, the ratio “distance to its nearest neighbor / distance to its farthest neighbor” tends to one in high dimensions.
• NN: identify the k nearest neighbors
o Choosing the value of k:
o If k is too small → sensitive to noise points
o If k is too large → the neighborhood may include points from other classes
• Class: use the class labels of the nearest neighbors to determine the class label of the unknown record
o Determine the class from the nearest-neighbor list
o Take the majority vote of the class labels among the k nearest neighbors
o Weigh the vote according to distance, with weight factor w = 1/d
The k nearest neighbors of a record x are the data points that have the k smallest distances to x (1-, 2-, 3-nearest neighbors).
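A minimal sketch of the whole procedure (Euclidean distance, unweighted majority vote; a distance-weighted vote would replace the plain count with weights w = 1/d):

from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, x, k=3):
    """train: list of (point, label) pairs; x: tuple of coordinates."""
    neighbors = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_predict(train, (1.1, 1.1), k=3))  # "A"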
6.4.4 Remarks
• Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes.
• k-NN classifiers are lazy learners: no model is built, so it takes time to compute everything at every classification.
6.4.5 Example: PEBLS (MVDM)
Parallel Exemplar-Based Learning System
• Works with continuous (size in cm, …) and nominal (male/female, …) features
o Nominal features: the distance between 2 nominal values (V₁ and V₂) is calculated using the modified value difference metric (MVDM):
d(V₁, V₂) = Σᵢ₌₁ᵐ | n₁ᵢ/n₁ − n₂ᵢ/n₂ |
o m – number of classes
o nⱼᵢ – number of examples from class i with attribute value Vⱼ, j = 1, 2
o nⱼ – number of examples with attribute value Vⱼ, j = 1, 2
• Distance between records X and Y: Δ(X, Y) = Σᵢ₌₁ᵈ d(Xᵢ, Yᵢ)²
• Number of nearest neighbors: k = 1
Distance between nominal attribute values, with the example counts
Single ∧ Yes → 2/4, Married ∧ Yes → 0/4, Single ∧ No → 2/4, Married ∧ No → 4/4:
d(Single, Married) = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced) = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund = Yes, Refund = No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
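A sketch reproducing the marital-status computations above, where counts[value] holds the per-class counts (here for classes Yes/No) of each attribute value:

def mvdm(counts, v1, v2):
    """Modified value difference metric between two nominal values."""
    n1, n2 = sum(counts[v1]), sum(counts[v2])
    return sum(abs(c1 / n1 - c2 / n2) for c1, c2 in zip(counts[v1], counts[v2]))

# Counts of (class = Yes, class = No) per marital status, as in the example:
counts = {"Single": [2, 2], "Married": [0, 4], "Divorced": [1, 1]}
print(mvdm(counts, "Single", "Married"))   # 1.0
print(mvdm(counts, "Single", "Divorced"))  # 0.0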
6.4.6 Proximity Measures
Proximity refers to these 2 notions:
• Similarity
o Numerical measure of how alike 2 data objects are
o Higher when objects are more alike
o Falls in the range [0, 1]
• Dissimilarity
o Numerical measure of how different 2 data objects are
o Lower when objects are more alike
o Minimum is 0, the upper limit varies
Given two data objects p and q:
• Distance: a particular type of dissimilarity
o Euclidean distance: d(p, q) = √(Σᵢ (pᵢ − qᵢ)²)
o Minkowski distance: d_m(p, q) = (Σᵢ₌₁ⁿ |pᵢ − qᵢ|^m)^(1/m), where m = 1, 2, …, ∞
▪ m = 1: Manhattan/Taxicab distance
▪ m = 2: Euclidean distance
▪ m = ∞: Supremum distance
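A sketch covering the three special cases in one function:

def minkowski(p, q, m=2):
    """Minkowski distance; m = float('inf') gives the supremum distance."""
    if m == float("inf"):
        return max(abs(pi - qi) for pi, qi in zip(p, q))
    return sum(abs(pi - qi) ** m for pi, qi in zip(p, q)) ** (1 / m)

p, q = (0, 0), (3, 4)
print(minkowski(p, q, 1))             # 7.0  Manhattan
print(minkowski(p, q, 2))             # 5.0  Euclidean
print(minkowski(p, q, float("inf")))  # 4    Supremum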
6.5 Naïve Bayesian Classifier
Take the n attributes independently and look at them individually:
“How often was it rainy when we went to play? How often was it rainy when we did not play?”
→ likelihood of playing or not, given each individual attribute value. Then just multiply all those values together to obtain a final likelihood value.
6.5.1 Bayes Classifier
Probabilistic framework for solving classification problems
Conditional Probability
P(C|A) = P(A, C) / P(A)
P(A|C) = P(A, C) / P(C)
Bayes Theorem
P(C|A) = P(A|C) · P(C) / P(A)
๐‘ƒ(๐ถ) – prior probability of C
๐‘ƒ(๐ถ|๐ด) – posterior (revised) probability of C, after observing a new event A
๐‘ƒ(๐ด|๐ถ) – conditional probability of A, given C (likelihood)
๐‘ƒ(๐ด) – total probability of A (Probability to observe A independent of C)
6.5.2 Example of Bayes Theorem
6.5.3 Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Works only if the variables are independent.
In general, there is no need to compute the denominator P(A), because it is the same for all classes!
6.5.4 Naïve Bayes Classifier
o “Naïve” because it is naïve to believe that the attributes are independent → we assume independence among the attributes Aᵢ when the class C is given:
P(A₁ = a₁, …, Aₙ = aₙ | C = cᵢ) = P(A₁ = a₁ | C = cᵢ) · P(A₂ = a₂ | C = cᵢ) · … · P(Aₙ = aₙ | C = cᵢ)
6.5.5 Estimate Probabilities from Data
For discrete attributes:
P(C = cᵢ) = N_cᵢ / N
P(Aⱼ = a | C = cᵢ) = Nⱼᵢ / N_cᵢ
N_cᵢ – number of instances from class cᵢ
N – total number of instances
Nⱼᵢ – number of instances having value a for attribute Aⱼ and belonging to class cᵢ
Examples, for class C = “Evade”:
• P(Status = Married | C = No) = 4/7
• P(Refund = Yes | C = Yes) = 0
• P(C = No) = 7/10
For continuous attributes:
1. Discretize the range into bins
o One ordinal attribute per bin
o Violates the independence assumption
o Example: [60, 75] → cold; [76, 80] → mild; [81, 90] → hot
2. Probability density estimation
o Assume the attribute follows a normal distribution:
P(Aⱼ = a | C = cᵢ) = 1/(σⱼᵢ√(2π)) · e^(−(a − μⱼᵢ)² / (2σⱼᵢ²))
o Use the data to estimate the parameters of the distribution (mean, variance)
o Once the probability distribution is known, we can use it to estimate the conditional probability P(Aⱼ = a | C = cᵢ)
o Example: Aⱼ = Income, cᵢ = No
μⱼᵢ ≈ 110 (sample mean of the Income values for class No)
σⱼᵢ² ≈ 2975 (sample variance of the Income values for class No)
→ probability distribution known!
Estimate the conditional probability:
P(Income = 120 | No) = 1/(54.54·√(2π)) · e^(−(120 − 110)² / (2 · 2975)) = 0.0072
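The same computation as a short sketch:

from math import exp, pi, sqrt

def gaussian_likelihood(a, mean, var):
    # normal density used as the conditional probability P(Aj = a | C = ci)
    return exp(-((a - mean) ** 2) / (2 * var)) / (sqrt(var) * sqrt(2 * pi))

print(round(gaussian_likelihood(120, mean=110, var=2975), 4))  # 0.0072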
6.5.6 Examples of Naïve Bayes Classifier
When multiplying the probability values, you should never have a zero: a zero makes the other attributes not count anymore → a problem with Naïve Bayes! But there is a solution…
6.5.7 Solution to the problem of a value of an attribute = 0
If one of the conditional probabilities is zero, then the entire expression becomes zero!
→ we have to use other probability estimates (we already know them):
• The original (with the problem): P(Aⱼ = a | C = cᵢ) = Nⱼᵢ / N_cᵢ
• Laplace: P(Aⱼ = a | C = cᵢ) = (Nⱼᵢ + 1) / (N_cᵢ + N_Aⱼ)
• M-estimate: P(Aⱼ = a | C = cᵢ) = (Nⱼᵢ + m·p) / (N_cᵢ + m)
with
• N_c: number of classes
• N_Aⱼ: number of values that the attribute takes
• p: prior probability
• m: parameter
6.5.8 Summary
• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some attributes
o Use other techniques such as Bayesian Belief Networks (BBN)
Even though attributes are often not independent, we use NB because it is quick to compute and very simple!
6.6 Decision Tree Classifier
• Solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record;
• After each answer, a follow-up question is asked;
• Stop when we reach a conclusion about the class label of the record.
• The questions and their possible answers can be organized in the form of a decision tree
o Hierarchical structure consisting of nodes and directed edges.
o Root node: has no incoming edges and zero or more outgoing edges.
o Internal nodes: have exactly one incoming edge and two or more outgoing edges.
o Leaf nodes: have exactly one incoming edge and no outgoing edges.
6.6.1 Example
If Refund = Yes → Cheat = No; that's why we don't continue further down this branch.
If Married → Cheat = No; that's why we don't continue further down this branch either.
6.6.2 Advantages of Decision-Tree-Based Classification
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple data sets
6.6.3 General Process to Create a Model
Training set → learn model (induction) with an algorithm → model (in this case, a decision tree) → apply model (deduction) → test set
To create the tree → several algorithms exist (Hunt's, CART, ID3, …)
6.6.4 Hunt's Algorithm: General Structure
• D_t: set of training records that reach a node t
• General procedure:
o If D_t contains records that all belong to the same class y_t: t is a leaf node labeled as y_t
o If D_t is an empty set: t is a leaf node labeled with the default class y_d
o If D_t contains records that belong to more than one class: use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
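A sketch of this recursion; best_split() is a hypothetical helper that picks an attribute test (e.g. by impurity gain, see 6.6.6) and partitions the records by its outcomes:

from collections import Counter

def hunt(records, default_class, best_split):
    """records: list of (attribute_dict, label) pairs. Returns a nested tree."""
    if not records:
        return ("leaf", default_class)                 # empty D_t -> default class y_d
    labels = [label for _, label in records]
    if len(set(labels)) == 1:
        return ("leaf", labels[0])                     # pure D_t -> leaf labeled y_t
    split = best_split(records)
    if split is None:                                  # no attribute test helps anymore
        return ("leaf", Counter(labels).most_common(1)[0][0])
    test, partitions = split                           # partitions: outcome -> sub-records
    return ("node", test, {out: hunt(sub, default_class, best_split)
                           for out, sub in partitions.items()})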
6.6.5 Tree Induction
How do we choose the question to ask at each node (incl. the root node)?
→ Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
→ Issues we face when doing this:
• Determine how to split the records
o How to specify the attribute test condition?
o How to determine the best split?
• Determine when to stop splitting
Specify Test Condition
• Depends on the attribute type: nominal, ordinal, continuous
• Depends on the number of ways to split: 2-way split, multi-way split
(1) Split for nominal attributes
(2) Split for ordinal attributes
Numerical values → binary splits; such a split is also possible even if the order isn't fully respected.
(3) Split for continuous attributes
Different ways of handling them:
• Discretization to form an ordinal categorical attribute (transforming continuous values into discrete counterparts)
• Binary decision: A < v or A ≥ v
Find the best cut possible among all the splits!
Determine the best split
• Example: the data are students; we create the tree to decide when receiving new data. The best variant is the 2nd, because it helps classify the students. The 3rd is too perfect, because every node is pure (one class is represented per node).
• Greedy approach: nodes with a homogeneous class distribution are preferred
o Non-homogeneous class distribution → high impurity
o Homogeneous class distribution → low impurity
→ We need a measure of node impurity!
How to Find the Best Split
Compute the children's impurities M1 and M2 (several ways to do it). After that, compare them with the parent impurity M0 to see whether we gained anything (we should).
6.6.6 Measures of Impurity
1) GINI
Gini index for a given node t: GINI(t) = 1 − Σⱼ (p(j|t))²
• p(j|t): relative frequency of class j at node t
• Max. possible value = 1 − 1/n_c, when records are equally distributed among all n_c classes
• Min. possible value = 0, when all records belong to one class → if it's a pure node, GINI = 0
• When a node t is split into k partitions (children), the aggregated GINI impurity measure for all children is computed as: GINI_SPLIT = Σᵢ₌₁ᵏ (nᵢ/n) · GINI(i)
o nᵢ: number of records at child i
o n: number of records at node t
• Seek the lowest GINI_SPLIT → the lowest impurity
The quality of the split (GINI gain) is expressed as:
Gain_GINI = GINI(t) − GINI_SPLIT
• Examples: binary attributes, categorical attributes, continuous attributes
2) Entropy (information gain)
Entropy at a given node t: Entropy(t) = − Σⱼ p(j|t) · log₂(p(j|t))
• Measures the homogeneity of a node
• Max. value = log₂(n_c), when records are equally distributed among all classes
• Min. value = 0, when all records belong to one class
• When a node t is split into k partitions (children), the aggregated entropy measure for all children nodes is computed as: ENTROPY_SPLIT = Σᵢ₌₁ᵏ (nᵢ/n) · Entropy(i)
o nᵢ: number of records at child i
o n: number of records at node t
The quality of the split (information gain) is expressed as:
Gain_ENTROPY = Entropy(t) − ENTROPY_SPLIT
→ Measures the reduction in entropy achieved because of the split.
→ Choose the split that achieves the most reduction!
3) Classification Error
Classification error at a node t: Error(t) = 1 − maxⱼ p(j|t)
• Measures the misclassification error made by a node
• Max. value = 1 − 1/n_c, when records are equally distributed among all classes
• Min. value = 0, when all records belong to one class
• The aggregated classification error for all children nodes:
ERROR_SPLIT = Σᵢ₌₁ᵏ (nᵢ/n) · Error(i)
o nᵢ: number of records at child i
o n: number of records at node t
Gain_ERROR = Error(t) − ERROR_SPLIT
6.6.7 Practical Issues of Classification
6.6.7.1 Underfitting and Overfitting
Underfitting: the model is too simple
• It has not yet learned the true structure of the data
• Error rates on the training set and on the test set are both large
Solution: build a slightly more complex model (relax the conditions, e.g. minsplit goes from 20 down to 2)
Overfitting: the model is more complex than necessary
Happens when we tried to learn the data too much “by heart”, including the noise in the data (values that do not help us)
• The error rate on the training set is small, but the error rate on the test data is large
• The training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• We need new ways of estimating errors
→ the tree becomes too large
Solution: see below
Overfitting due to Noise
Overfitting due to Insufficient Examples
• A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region
• An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Estimating Generalization Errors
33
Corentin Cossettini
Semestre de printemps 2022
Université de Neuchâtel
Bachelor en sciences économiques 2ème année
Reduced error pruning (REP), a 3rd method: use a separate data set
Occam's Razor (principle)
• Given 2 models with similar generalization errors, one should prefer the simpler model over the more complex one
• For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
• Therefore, one should include model complexity when evaluating a model
Minimum Description Length (MDL)
How to Address Overfitting (solutions)
1. Pre-Pruning
We do not let the tree grow fully; we stop before it becomes too large.
• Stop the algorithm before it becomes a fully-grown tree
• Typical stopping conditions for a node:
o Stop if all instances belong to the same class
o Stop if all the attribute values are the same
• More restrictive conditions:
o Stop if the number of instances is less than some user-specified threshold
o Stop if the class distribution of the instances is independent of the available features (e.g. using a χ² test)
o Stop if expanding the current node does not improve the impurity measures (e.g. Gini or information gain)
2. Post-Pruning
• Grow the decision tree in its entirety
• Trim the nodes of the decision tree in a bottom-up fashion
• If the generalization error improves after trimming, replace the sub-tree by a leaf node
• The class label of the leaf node is determined from the majority class of the instances in the sub-tree
• MDL can be used for post-pruning
Example
6.6.7.2 Missing Values
Missing values affect decision tree construction in 3 different ways:
1. How impurity measures are computed
2. How to distribute an instance with a missing value to child nodes
3. How a test instance with a missing value is classified
6.6.7.3 Data Fragmentation
Number of instances gets smaller as you traverse down the tree
Number of instances at the leaf nodes could be too small to make any statistically significant
decision
6.6.7.4 Search Strategy
Finding an optimal decision tree is NP-hard
The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to
induce a reasonable solution
Other strategies?
• Bottom-up
• Bi-directional
6.6.7.5 Expressiveness
• Decision trees provide an expressive representation for learning discrete-valued functions
• But they do not generalize well to certain types of Boolean functions
o Example: parity function
▪ Class = 1 if there is an even number of Boolean attributes with value True
▪ Class = 0 if there is an odd number of Boolean attributes with value True
o For accurate modeling, we must have a complete tree
• Not expressive enough for modeling continuous variables
o Particularly when the test condition involves only a single attribute at a time
Decision Boundary
• The border line between 2 neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes, because each test condition involves a single attribute at a time
Oblique Decision Trees
• The test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
6.7 Model Evaluation
• Metrics for performance evaluation: “How to evaluate the performance of a model?”
• Methods for performance evaluation: “How to obtain reliable estimates?”
• Methods for model comparison: “How to compare the relative performance among competing models?”
6.7.1 Metrics for Performance Evaluation
Focus on the predictive capability of a model.
Confusion matrix
Metrics
• Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN)
• Error rate = 1 − Accuracy = (FP + FN)/(TP + TN + FP + FN)
• True Positive Rate (TPR) (sensitivity) = TP/(TP + FN)
• True Negative Rate (TNR) (specificity) = TN/(TN + FP)
• False Positive Rate (FPR) = FP/(TN + FP)
• False Negative Rate (FNR) = FN/(TP + FN)
Limitation of Accuracy
• Consider a 2-class problem
o Number of Class 0 examples = 9990
o Number of Class 1 examples = 10
• If the model predicts everything to be class 0:
o Accuracy = 9990/10000 = 99.9%
→ Misleading, because the model does not detect any class 1 example
→ Cost matrix
Cost Matrix
• C(i, j): cost of misclassifying a class i example as class j
Cost = C(Yes, Yes)·a + C(Yes, No)·b + C(No, Yes)·c + C(No, No)·d
• Example: 2 models with the same cost matrix
Cost vs Accuracy
Cost-Sensitive Measures
Measures used e.g. by Google in its search engine
→ searching for something in a large number of documents
• Precision(p) = a/(a + c)
• Recall(r) = a/(a + b)
• F-measure(F) = 2rp/(r + p) = 2a/(2a + b + c); the bigger the F-measure, the better
• Weighted Accuracy = (w₁a + w₄d)/(w₁a + w₂b + w₃c + w₄d)
Remarks
• Precision is biased towards C(Yes, Yes) and C(No, Yes)
• Recall is biased towards C(Yes, Yes) and C(Yes, No)
• F-measure is biased towards all except C(No, No)
o It combines precision and recall → (harmonic) mean of the two
6.7.2 Methods for Performance Evaluation
How do we obtain an estimate of the performance?
→ Note: the model's performance can depend on factors other than the algorithm (class distribution (1%–99%); cost of misclassification; size of the training and test sets)
Learning Curve
Shows how accuracy changes with varying sample size
• Requires a sampling schedule for creating one!
Warning: the smaller the sample size, the larger the error of the estimate.
Methods of Estimation
1. Holdout
Reserve 2/3 for training and 1/3 for testing
2. Repeated Holdout
Do it 3 times, to test every subset
3. Cross-validation
o Partition the data into k disjoint subsets
o k-fold: train on k − 1 partitions, test on the remaining one
o Leave-one-out: k = n
4. Repeated cross-validation
Do it k times
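A sketch of k-fold cross-validation; train_fn and eval_fn are hypothetical stand-ins for the learning algorithm and its accuracy evaluation:

def k_fold_cv(records, k, train_fn, eval_fn):
    folds = [records[i::k] for i in range(k)]            # k disjoint subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(eval_fn(train_fn(train), test))    # train on k-1 folds, test on 1
    return sum(scores) / k

# Leave-one-out is the special case k = len(records);
# repeated cross-validation repeats this whole procedure several times.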
ROC (Receiver Operating Characteristic)
Characterizes the trade-off between positive hits (TPR) and false alarms (FPR)
• The ROC curve plots TPR (y-axis) against FPR (x-axis)
• The performance of each classifier is represented as a point on the ROC curve
o Changing the threshold t of the algorithm, the sample distribution, or the cost matrix changes the location of the point
ROC Curve
• 1-dimensional data set containing 2 classes (positive and negative)
• Any point located at x > t is classified as positive
At threshold t ≈ 0.3: TPR = 0.5, FPR = 0.12, TNR = 0.88, FNR = 0.5
Points (TPR, FPR):
(0, 0): declare everything to be the negative class
(1, 1): declare everything to be the positive class
(1, 0): ideal
→ We have to find the nearest point to (1, 0)
Diagonal line:
• Random guessing
• Below the diagonal line: the prediction is the opposite of the true class
ROC for Model Comparison
No model consistently outperforms the other:
• M₁ is better for small FPR
• M₂ is better for large FPR
Area under the ROC curve (AUC):
• Ideal: Area = 1
• Random guess: Area = 0.5
How to construct a ROC curve
• Use a classifier that produces a posterior probability P(+|A) for each test instance
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold t at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate: TPR = TP/(TP + FN)
• FP rate: FPR = FP/(FP + TN)
If t = 0.7 → we send to 7 people
6.7.3 Test of Significance (confidence intervals)
• Given 2 models:
o M₁: accuracy = 85%, tested on 30 instances
o M₂: accuracy = 75%, tested on 5000 instances
• Can we say M₁ is better than M₂?
o (1) What confidence can we place on the accuracy of M₁ and M₂?
o (2) Can the difference in performance be explained as a result of random fluctuations in the test set?
(1) CI for Accuracy
• Accuracy is the mean
• Each prediction is a Bernoulli trial
o 2 possible outcomes: correct or wrong
o All the Bernoulli trials together form a binomial distribution: X ~ Bin(N, p), with X the number of correct predictions
▪ Example: toss a fair coin 50 times
• Expected number of heads: E(X) = Np = 50 · 0.5 = 25
• Variance: Var(X) = Np(1 − p) = 50 · 0.5 · 0.5 = 12.5
• Probability that heads shows up 20 times: P(X = 20) = C(50, 20) · 0.5²⁰ · (1 − 0.5)³⁰ ≈ 0.04 (comes from the binomial distribution)
• Given x (the number of correct predictions) and N (the number of test instances): acc = x/N
acc → sample // p → population
Is it possible to estimate p, the true accuracy of the model (for the whole population)?
• For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N:
P( z_{α/2} < (acc − p)/√(p(1 − p)/N) < z_{1−α/2} ) = 1 − α
Confidence interval for p:
CI[p]_{1−α} = ( 2N·acc + z²_{α/2} ± z_{α/2}·√( z²_{α/2} + 4N·acc − 4N·acc² ) ) / ( 2(N + z²_{α/2}) )
• Example:
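A sketch of this interval; plugging in M₁'s numbers (acc = 0.85, N = 30) gives roughly (0.68, 0.94) at the 95% level:

from math import sqrt

def accuracy_ci(acc, n, z=1.96):                # z = z_{alpha/2}; 1.96 for 95%
    center = 2 * n * acc + z ** 2
    half = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - half) / denom, (center + half) / denom

print(accuracy_ci(0.85, 30))    # ~ (0.68, 0.94): wide, because N is small
print(accuracy_ci(0.75, 5000))  # much tighter for M2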
(2) Comparing the Performance of 2 Models
Tested on 2 sets of different sizes: is the bigger one really statistically different from the smaller one?
→ Test the difference between the 2 models M₁ and M₂
• Given 2 models M₁ and M₂, which one is better?
o M₁ is tested on D₁ (size = n₁), found error rate = e₁
o M₂ is tested on D₂ (size = n₂), found error rate = e₂
o Assume D₁ and D₂ are independent
o Assume n₁ and n₂ are sufficiently large; then:
e₁ ~ N(μ₁, σ₁), e₂ ~ N(μ₂, σ₂), with σ̂ᵢ² = eᵢ(1 − eᵢ)/nᵢ
• To test whether the performance difference is statistically significant: d = e₁ − e₂
o d ~ N(d_t, σ_t), where d_t is the true difference
o Since D₁ and D₂ are independent, their variances add up:
σ_t² = σ₁² + σ₂² ≅ σ̂₁² + σ̂₂² = e₁(1 − e₁)/n₁ + e₂(1 − e₂)/n₂
• At the (1 − α) confidence level: CI[d_t]_{1−α} = d ± z_{α/2}·σ̂_t
Example:
If zero is in the interval, the difference is not statistically significant.
If zero is not in the interval, the difference is statistically significant.
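A sketch of the whole test, plugging in the numbers from 6.7.3's example (M₁: e₁ = 0.15 on n₁ = 30; M₂: e₂ = 0.25 on n₂ = 5000):

from math import sqrt

def diff_ci(e1, n1, e2, n2, z=1.96):                 # 95% confidence level
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2    # independent sets: variances add
    d = e1 - e2
    half = z * sqrt(var)
    return d - half, d + half

low, high = diff_ci(e1=0.15, n1=30, e2=0.25, n2=5000)
print((low, high), low <= 0 <= high)  # zero inside -> NOT statistically significant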
6.7.4 Comparing the Performance of 2 Algorithms
With different models!