Uploaded by Kristine Auh

15075 Lecture 03

15.075: Statistical Thinking and Data Analysis
Lecture 3
Mohammad Fazel-Zarandi
February 13, 2019
The Normal Distribution
Empirical Rules
Detecting Normality
Z Scores and “Curving”
Beyond the Empirical Rules
Recap
• What two visual summaries of quantitative data did we discuss?
• Did we give a formal definition for an outlying data point?
• What were the three types of numerical summaries?
• What was the “shift criterion”?
15.075 (Spring 2019)
Lecture 3
February 13, 2019
Road Map
The Normal Distribution
Empirical Rules
Detecting Normality
Z Scores and “Curving”
Beyond the Empirical Rules
The Bell Curve
We will use the terms “bell curve” = “normal distribution” = “Gaussian
distribution” interchangeably
• The normal distribution is sometimes a good approximation to
histograms of numerical variables.
• Not always!
Is there only one “Normal Distribution?”
Normal Distributions
• All of these are normally distributed
• How do they differ?
[Figure: “Normal Distributions”, several normal density curves f(x) plotted against x, for x from −15 to 15]
• Bell shape: unimodal, symmetric
• Not every symmetric bell-curve-looking shape is a normal distribution!
• The shape of the normal distribution is the result of a certain formula:
$f(x) \propto \exp\{-x^2/2\}$
[Figure: density curves for x from −3 to 3. Red = Normal; Black = Non-Normal (tails are too heavy).]
Red curve is a normal distribution. All of the others aren’t! The other
curves have heavy tails, which make them not normal distributions.
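For reference, the proportionality above is the standard case (mean 0, sd 1). Writing μ for the mean and σ for the sd (symbols assumed here; the slides write "mean" and "sd"), the full normal density is usually written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
```

Setting μ = 0 and σ = 1 recovers exp{−x²/2} up to the constant 1/√(2π), which only rescales the curve so that the total area is 1.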
Mean, SD, and the Normal Distribution
What’s so appealing about the normal distribution?
Theorem
The normal distribution is fully characterized by the mean and the
standard deviation
Implication: If we know the mean and the sd, and if we know the
distribution is normal, then we know all the quantiles!
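A minimal sketch in R (the language of the `qqnorm` and `pnorm` calls later in the lecture) of what this implication means in practice: once the mean and sd are fixed, every quantile of a normal variable equals mean + z × sd, where z is the matching standard normal quantile. The values of `m`, `s`, and `p` below are illustrative assumptions.

```r
# Illustrative values (assumptions for this sketch)
m <- 66.7
s <- 8.9
p <- c(0.025, 0.25, 0.50, 0.75, 0.975)

# Quantiles taken directly from the normal distribution with this mean and sd
q_direct <- qnorm(p, mean = m, sd = s)

# The same quantiles built from standard normal quantiles: mean + z * sd
q_from_z <- m + qnorm(p) * s

print(round(q_direct, 2))
print(all.equal(q_direct, q_from_z))
```

Both routes give the same numbers, which is the sense in which the mean and sd pin down all the quantiles.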
Empirical Rules
If a variable is normally distributed:
• 50% of its values fall between [mean − (2/3)sd, mean + (2/3)sd]
• 68% of its values fall between [mean − 1sd, mean + 1sd]
• 95% of its values fall between [mean − 2sd, mean + 2sd]
• 99.7% of its values fall between [mean − 3sd, mean + 3sd]
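These percentages can be checked against the normal curve itself. The sketch below uses R's `pnorm` on the standard normal, which is enough because the coverage of mean ± k·sd does not depend on the particular mean or sd.

```r
# Coverage of [mean - k*sd, mean + k*sd] for a normal variable
k <- c(2/3, 1, 2, 3)
coverage <- pnorm(k) - pnorm(-k)
names(coverage) <- c("2/3 sd", "1 sd", "2 sd", "3 sd")

print(round(coverage, 4))
# roughly 0.495, 0.683, 0.954, 0.997
```

The rules are therefore rounded values rather than exact ones: ±2 sd covers a bit more than 95%, and ±(2/3) sd covers slightly less than 50%.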
Empirical Rules Visualized
Units on the x axis: standard deviations from the mean!
Finding Quantiles Using Empirical Rules
95% of its values fall between [mean − 2sd, mean + 2sd]
• mean − 2sd ∼ lower 2.5% quantile
• mean + 2sd ∼ upper 2.5% quantile
68% of its values fall between [mean − 1sd, mean + 1sd]
• mean − 1sd ∼ lower 16% quantile
• mean + 1sd ∼ upper 16% quantile
50% of its values fall between [mean − (2/3)sd, mean + (2/3)sd]
• mean − (2/3)sd ∼ lower quartile
• mean + (2/3)sd ∼ upper quartile
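To see how rough the "∼" in these statements is, the exact standard normal quantiles can be placed next to the rule-of-thumb multipliers. This is a small sketch using R's `qnorm`; the variable names are chosen here for illustration.

```r
# Lower-tail fractions from the rules above
p <- c(0.025, 0.16, 0.25)

exact_z <- qnorm(p)            # exact number of sd below the mean
rule_z  <- c(-2, -1, -2/3)     # multipliers used by the empirical rules

print(data.frame(lower_tail = p,
                 exact_z    = round(exact_z, 3),
                 rule_z     = round(rule_z, 3)))
```

The exact values are about −1.96, −0.99, and −0.67, so each rule is off by only a few hundredths of an sd.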
Data Example
Let’s check these approximate relationships in “Exam Scores.csv”.
• mean = 66.7
• sd = 8.9
• 2.5% and 97.5% quantiles: mean ± 2sd = 48.9 and 84.5
• Actual quantiles from data set: 50 and 82.75 – which is close!
• What about ±1sd?
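The ±1 sd question can be approached the same way. In the sketch below the mean and sd come from the slide, but the `scores` vector is simulated as a stand-in, since the contents of the data file are not reproduced here; its sample quantiles are illustrative only and are not the course's results.

```r
m <- 66.7   # mean from the slide
s <- 8.9    # sd from the slide

# Empirical-rule predictions for the 16% and 84% quantiles
predicted <- m + c(-1, 1) * s
print(predicted)   # 57.8 and 75.6

# Stand-in data (assumption): simulated scores, NOT the real exam file
set.seed(1)
scores <- rnorm(250, mean = m, sd = s)

# With the real data, these sample quantiles would be compared to `predicted`
observed <- quantile(scores, probs = c(0.16, 0.84))
print(round(observed, 2))
```

With the actual exam scores loaded into `scores`, the last two lines give the quantiles to compare against 57.8 and 75.6.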
How Can We Tell If Our Data Are Normal?
Unfortunately, our current graphical summaries fall short! They can only
help us in isolating certain departures from normality
• Histogram:
  ◦ multimodal or skew ⇒ not bell shaped
• Boxplot:
  ◦ Are the quartiles symmetric about the median?
  ◦ Are there outlying observations?
We need a sharper tool!
Normal Quantile Plot
Also called a Q-Q Plot
qqnorm(scores)
qqline(scores)
[Figure: Normal Q−Q Plot of scores. x axis: Theoretical Quantiles (−3 to 3); y axis: Sample Quantiles (40 to 80)]
Notes: q(i) = i/(n+1) × 100. If the points follow the straight reference
line, this indicates a normal distribution, since the sample quantiles
then line up one-to-one with the theoretical quantiles.
How to Read a Normal Q-Q Plot:
• Vertical axis: sorted “Scores” (why staircasing?)
• Horizontal axis: “theoretical normal quantiles”
• “Under the hood”:
  ◦ The idea is that, for example, the 10th smallest value of a variable with
    252 cases is an estimate of the (10/252) × 100% quantile; hence plot
    the sorted values against the corresponding quantiles from a normal
    distribution.
• The normal curve can be used to calculate “theoretical quantiles”,
which are plotted on the x axis
  ◦ If the data were normally distributed, then all I need to know are the
    mean and the standard deviation, and I can calculate all of the
    quantiles
  ◦ So, take the variable in question, compute its mean and standard
    deviation, and then compute what the quantiles should be if the
    variable were normally distributed with that mean/sd.
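The "under the hood" steps can be sketched by hand in R. The data below are simulated and rounded to whole numbers as a stand-in for the lecture's scores (an assumption), and the plotting fractions i/(n+1) are one common convention; R's built-in `qqnorm` uses slightly different fractions (see `ppoints`), and `qqline` draws its line through the quartiles rather than using the mean and sd.

```r
# Stand-in data (assumption): simulated whole-number scores
set.seed(1)
scores <- round(rnorm(252, mean = 66.7, sd = 8.9))

n <- length(scores)
sorted_scores <- sort(scores)      # vertical axis: sample quantiles

# The i-th smallest value estimates roughly the i/(n+1) quantile
p <- (1:n) / (n + 1)

# Horizontal axis: standard normal quantiles at those fractions
theoretical <- qnorm(p)

plot(theoretical, sorted_scores,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (by hand)")

# Reference line: where the points would fall for a normal variable
# with the sample's mean and sd
abline(a = mean(scores), b = sd(scores))
```

Because the simulated scores are whole numbers, many of them are tied, which is one plausible source of the staircase pattern mentioned above.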
How to Use a Normal Q-Q Plot:
• Use: If the points don’t deviate from the diagonal line much, then the
variable is “approximately normally distributed.”
• Idea: The straight line tells where the sorted values of the variable
should fall approximately IF they are normally distributed.
Detecting Non-Normalities
How non-normalities show up in normal quantile plots:
• Right-Skewness: frequent in finance/econ (e.g., CEO compensation
data). This causes convex (cup-shaped) curvature in normal quantile
plots
  ◦ Note: right skew shows up in the Q-Q plot as an exponential-looking
    curve, because most values are concentrated at the low end while, in
    the right tail, the values get bigger and bigger.
• What do you think the curvature would be for left skewed data?
• Outliers cause points to be too high on the right or too low on the left
• Multi-modality (rare) causes snaking of the normal quantile plot
Let’s show a few examples
• What is the nature of the non-normality?
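A quick way to build intuition for these patterns is to simulate them. The sketch below draws three samples whose shapes are known in advance and plots each with `qqnorm`/`qqline`; the particular distributions (lognormal for right skew, t with 3 degrees of freedom for heavy tails) are choices made here for illustration, not the lecture's data sets.

```r
set.seed(1)
n <- 1000

samples <- list(
  normal       = rnorm(n),         # reference: points hug the line
  right_skewed = rlnorm(n),        # lognormal: convex, cup-shaped curve
  heavy_tailed = rt(n, df = 3)     # t(3): symmetric, both ends peel away
)

old_par <- par(mfrow = c(1, 3))    # three panels side by side
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]])
}
par(old_par)                       # restore the plotting layout
```

Replacing one of the samples with its negative (for example `-rlnorm(n)`) is a simple way to explore the left-skew question.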
CEO Compensations
From the data set CEO comp 2003.csv
Right Skewed
[Figure, left panel: Histogram of comp. x axis: comp (0.0e+00 to 3.0e+07); y axis: Frequency (0 to 1000)]
[Figure, right panel: Normal Q−Q Plot. x axis: Theoretical Quantiles (−3 to 3); y axis: Sample Quantiles (0.0e+00 to 3.0e+07)]
Heavy Tails
Fat Tails
[Figure, left panel: Histogram of x. x axis: x (−4 to 6); y axis: Frequency (0 to 300)]
[Figure, right panel: Normal Q−Q Plot. x axis: Theoretical Quantiles (−3 to 3); y axis: Sample Quantiles (−4 to 6)]
The Role of Standardization
Test scores are often approximately normally distributed. ⇒ Empirical rule
and reliance on mean and sd work well.
• Is a 70 out of 100 a good test score?
  ◦ What if the mean is 80, sd = 5?
  ◦ What if the mean is 60, sd = 10?
  ◦ What if the mean is 60, sd = 5?
• Knowing your score was a 70 clearly isn’t enough even if you believe
scores are normally distributed! Depending on how many sds from
the mean you are, a 70 could be very good or very bad.
• Can we think of the outcomes on a scale that reflects this?
Centering and Scaling a Continuous Variable
Suppose I have a continuous variable X. What is mean(X − x̄)?
• Called demeaning, or centering, the variable
Suppose I have a continuous variable X. What is sd(X/sd(X))?
• Called scaling a variable.
Suppose the variable X follows a normal distribution. Does centering
and/or scaling affect its normality?
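The first two questions can be checked numerically. This is a minimal sketch in R with made-up values for `x`; any numeric variable with some spread would behave the same way.

```r
# Made-up values for illustration (assumption)
x <- c(52, 61, 66, 70, 74, 83, 91)

centered <- x - mean(x)   # demeaning / centering
scaled   <- x / sd(x)     # scaling

print(mean(centered))     # 0, up to floating-point rounding
print(sd(scaled))         # 1, up to floating-point rounding

# Centering does not change the spread
print(all.equal(sd(centered), sd(x)))
```

Centering moves the mean to 0 without touching the sd, and scaling sets the sd to 1.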
Z-Scores
A z score answers the following question:
• “How many sd above (+) or below (-) the mean was the observed
value?”
• Can write any observation as the mean, plus some number (z) times
the sd: observed = mean + z × sd
Solve for z:
Z-SCORE (unitless measure)
$z = \frac{\text{observed} - \text{mean}}{\text{sd}}$
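Applied to the three exam scenarios from the standardization slide, the formula gives very different answers for the same score of 70. The helper name `z_score` is hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: number of sd the observation lies from the mean
z_score <- function(observed, mean, sd) (observed - mean) / sd

print(z_score(70, mean = 80, sd = 5))    # -2: two sd below the mean
print(z_score(70, mean = 60, sd = 10))   #  1: one sd above the mean
print(z_score(70, mean = 60, sd = 5))    #  2: two sd above the mean
```

On this scale, a 70 ranges from quite poor (z = −2) to quite good (z = 2), depending on the exam.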
Z-Scores as a New Variable
Think of defining a new variable, where the values are the z-scores of a
variable X for each case. We could denote this new variable as:
Z Scores as a Variable
$Z(X) = \frac{X - \text{mean}(X)}{\text{sd}(X)}$
Z-Scores as a Change of Units
Recall our exam example: A 70 on an exam could mean very different
things depending on the mean and standard deviation
• z-scores can be thought of as a change in units, where the units
become standard deviations above the mean
• unit = 1 sd, mean = 0
Mean, Standard Deviation of Z-Scores
If we form z-scores of any continuous variable, X, then we have:
mean(z-scores) = 0
sd(z-scores) = 1
MEMORIZE THIS
Changing Units of Z-Scores
Suppose I have a variable X, and I change units by X* = a + bX (with b ≠ 0):
Changing Units
If b is positive...
z(X*) = z(X)
If b is negative...
z(X*) = −z(X)
That is, the magnitude of a z-score is not affected by additive and
multiplicative shifts. The only thing that can change is the sign, if b is
negative.
Example: Suppose we had z-scores of temperatures in Celsius and
someone changed the original data set to be in Fahrenheit. The z-scores
would not change!
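The temperature example can be verified directly. The sketch below uses made-up Celsius readings and a small helper `z` (a name chosen here for illustration) that computes z-scores for a whole variable.

```r
# z-scores of a variable: (X - mean(X)) / sd(X)
z <- function(x) (x - mean(x)) / sd(x)

# Made-up temperatures in Celsius (illustrative data)
celsius    <- c(-5, 0, 12, 18.5, 25, 31)
fahrenheit <- 32 + 1.8 * celsius       # a = 32, b = 1.8 (b positive)

print(all.equal(z(celsius), z(fahrenheit)))   # TRUE: z-scores unchanged

flipped <- 10 - 2 * celsius            # a = 10, b = -2 (b negative)
print(all.equal(z(flipped), -z(celsius)))     # TRUE: only the sign flips

# z-scores have mean 0 and sd 1 (up to floating-point rounding)
print(c(mean = mean(z(celsius)), sd = sd(z(celsius))))
```

The additive constant a drops out when the mean is subtracted, and the factor b cancels against the sd, leaving only its sign.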
Normality of CERTAIN Z-Scores
Suppose the continuous variable of interest, X, is normally distributed,
with some mean m and some standard deviation s. Then we can say
something further about the z-scores:
Normality of Z-Scores
If we form z-scores of any normally distributed continuous variable, X,
then the z-scores will also follow a normal distribution, with
a mean of 0 and a standard deviation of 1
Important: Taking a z-score cannot make a variable look “more normal”
or “less normal”
• If the variable is normal, its z-scores will be normal
• If the variable is not normally distributed, neither will its z-scores be
Empirical Rules for Z-Scores
If a variable is normally distributed:
• 50% of its z-scores fall between [−2/3, 2/3]
• 68% of its z-scores fall between [−1, 1]
• 95% of its z-scores fall between [−2, 2]
• 99.7% of its z-scores fall between [−3, 3]
Quantiles Beyond the Empirical Rules
• We can now answer quantile and range problems approximately for
z = ±2/3, ±1, ±2, ±3
• General quantile problems:
  ◦ “What is the fraction of students with a z-score above 1.5?”
  ◦ “What is the fraction of students with a score below 70?”
• Converse:
  ◦ “My score is at the 87% quantile. How many sd above the mean is it?”
• General range problems:
  ◦ “What fraction of scores is within 1.2 sd of the mean?”
  ◦ “What range about the mean contains 80% of the values?”
  ◦ “What fraction of students scored between an 85 and a 94?”
Old School: Normal Tables
“Normal Tables.pdf”
• Try to make sense of the tops of the columns:
• Graph: curve = idealized histogram for n = infinity under a normal
distribution. Shaded area indicates quantile or range
• Formula: “P(...)” = “Proportion of cases with ...”
  ◦ Ex.: P(Z<z) = Proportion of cases with z-score below z, where Z =
    column z-score values and z = threshold on z-score values
Understanding the Columns
• Left margin: values of threshold z
• 1st column: Proportion of cases with z-scores below -z
• 2nd column: Proportion of cases with z-scores below +z
• 3rd column: Proportion of cases with z-scores below -z OR above +z
• 4th column: Proportion of cases with z-scores within ± z
  ◦ Why do columns 1 and 2 add up to 1?
  ◦ Why do columns 3 and 4 add up to 1?
  ◦ Why are the values in column 1 below 0.5?
New School: Calculating within R
• Proportion of z-scores below z:
pnorm(z)
• Proportion of z-scores above z:
1 - pnorm(z)
# OR
pnorm(z, lower.tail = FALSE)
• How would we find the proportion with z-scores below -z or above
z? Within ± z?
• p-th percentile / p-quantile of z-scores (p given as a fraction, not a percentage)
qnorm(p)
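One possible answer to the two-sided questions is to rebuild the four columns of the normal table with `pnorm`. This is a sketch; the column names and the particular thresholds in `z` are chosen here for illustration.

```r
# A few positive thresholds (illustrative choice)
z <- c(0.5, 1, 1.5, 2, 3)

normal_table <- data.frame(
  z        = z,
  below_mz = pnorm(-z),               # column 1: below -z
  below_pz = pnorm(z),                # column 2: below +z
  outside  = 2 * pnorm(-z),           # column 3: below -z OR above +z
  within   = pnorm(z) - pnorm(-z)     # column 4: within +/- z
)

print(round(normal_table, 4))
```

The symmetry of the normal curve is what makes `2 * pnorm(-z)` work for the two tails combined, and it is also why columns 1 and 2 sum to 1.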
Exercises with Z-Scores
• What fraction of values is below me if I’m 1.2 SD ABOVE the mean?
• What fraction of values is above me if I’m 1.2 SD BELOW the mean?
• What fraction of values is in the interval ± 1.2 SD around the mean?
• How many SD above/below the mean am I if my quantile is 35%?
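One way to work these four exercises in R, assuming the values are normally distributed (the variable names are chosen here for illustration):

```r
# Fraction below me if I am 1.2 sd ABOVE the mean
below_if_above <- pnorm(1.2)

# Fraction above me if I am 1.2 sd BELOW the mean
above_if_below <- pnorm(-1.2, lower.tail = FALSE)

# Fraction within +/- 1.2 sd of the mean
within_1.2 <- pnorm(1.2) - pnorm(-1.2)

# Number of sd from the mean at the 35% quantile (negative means below)
z_at_35 <- qnorm(0.35)

print(round(c(below_if_above = below_if_above,
              above_if_below = above_if_below,
              within_1.2     = within_1.2,
              z_at_35        = z_at_35), 3))
# roughly 0.885, 0.885, 0.770, -0.385
```

The first two answers agree by symmetry: being 1.2 sd above with everything below is the mirror image of being 1.2 sd below with everything above.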
General Exercises
A histogram of GPAs for College Students at a particular university follows
the Normal curve with a mean of 2.7 and a standard deviation of 0.5
• What percentage of students have a GPA of 3.5 or lower?
• What percent have GPAs greater than 2.5?
• What percent have GPAs greater than 2.8?
• What percent have GPAs between 2.5 and 2.8?
• If I am in the top 10% of the class, what is the minimum my GPA
could be?
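These can be answered without converting to z-scores by hand, because `pnorm` and `qnorm` accept a mean and sd directly. The sketch below is one way to set them up, with variable names chosen for illustration.

```r
m <- 2.7   # mean GPA from the exercise
s <- 0.5   # sd of GPA from the exercise

p_at_most_35 <- pnorm(3.5, mean = m, sd = s)                      # 3.5 or lower
p_above_25   <- pnorm(2.5, mean = m, sd = s, lower.tail = FALSE)  # above 2.5
p_above_28   <- pnorm(2.8, mean = m, sd = s, lower.tail = FALSE)  # above 2.8
p_between    <- pnorm(2.8, mean = m, sd = s) -
                pnorm(2.5, mean = m, sd = s)                      # 2.5 to 2.8
gpa_top_10   <- qnorm(0.90, mean = m, sd = s)                     # top-10% cutoff

print(round(c(p_at_most_35 = p_at_most_35, p_above_25 = p_above_25,
              p_above_28 = p_above_28, p_between = p_between,
              gpa_top_10 = gpa_top_10), 3))
# roughly 0.945, 0.655, 0.421, 0.235, 3.341
```

Multiplying the first four results by 100 gives the percentages asked for. The cutoff is a model value, so it can be compared with real GPAs only as far as the normal approximation holds.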
From Quantiles to Means/SDs
Suppose that at a certain school of 820 students it is known that
• 3.8 GPA is 95th percentile
• 3.3 GPA is 80th percentile
• GPAs are normally distributed
Questions:
• What’s the mean and sd of GPAs in the school?
• Can you approximate the class rank of a student with a 3.0 GPA?
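One way to set this up: each known percentile gives an equation of the form GPA = mean + z × sd, where z is the standard normal quantile, so two percentiles give two equations in two unknowns. The sketch below solves them in R; the rank convention (students expected above, plus one) is an assumption made here.

```r
# Known facts from the slide
gpa        <- c(3.8, 3.3)
p          <- c(0.95, 0.80)
n_students <- 820

# Standard normal quantiles matching the two percentiles
z <- qnorm(p)

# Two equations, two unknowns: gpa = m + z * s
s <- (gpa[1] - gpa[2]) / (z[1] - z[2])
m <- gpa[1] - z[1] * s
print(round(c(mean = m, sd = s), 3))   # roughly mean 2.78, sd 0.62

# Approximate class rank of a 3.0 GPA (assumed convention:
# number of students expected above, plus one)
frac_above  <- pnorm(3.0, mean = m, sd = s, lower.tail = FALSE)
approx_rank <- round(frac_above * n_students) + 1
print(approx_rank)
```

Plugging the fitted mean and sd back into `qnorm(0.95, m, s)` and `qnorm(0.80, m, s)` should return 3.8 and 3.3, which is a quick check on the algebra.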