15.075: Statistical Thinking and Data Analysis (Spring 2019)
Lecture 3
Mohammad Fazel-Zarandi
February 13, 2019

Recap
• What two visual summaries of quantitative data did we discuss?
• Did we give a formal definition for an outlying data point?
• What were the three types of numerical summaries?
• What was the “shift criterion”?

Road Map
• The Normal Distribution
• Empirical Rules
• Detecting Normality
• Z Scores and “Curving”
• Beyond the Empirical Rules

The Bell Curve
We will use the terms “bell curve” = “normal distribution” = “Gaussian distribution” interchangeably.
• The normal distribution is sometimes a good approximation to histograms of numerical variables.
• Not always!
Is there only one “Normal Distribution”?

Normal Distributions
[Figure: “Normal Distributions”, several normal density curves f(x) plotted against x from −15 to 15]
• All of these are normally distributed
• How do they differ?

[Figure: several bell-shaped density curves. Red = Normal; Black = Non-Normal (tails are too heavy).]
Note: the other curves have heavy tails, which make them not normal distributions.
• Bell shape: unimodal, symmetric
• Not every symmetric bell-curve-looking shape is a normal distribution!
• The shape of the normal distribution is the result of a certain formula (shown here for mean 0 and sd 1): $f(x) \propto \exp\{-x^2/2\}$
[Figure: density curves over x from −3 to 3.] The red curve is a normal distribution. All of the others aren’t!

Mean, SD, and the Normal Distribution
What’s so appealing about the normal distribution?
Theorem: The normal distribution is fully characterized by the mean and the standard deviation.
Implication: If we know the mean and the sd, and if we know the distribution is normal, then we know all the quantiles!

Empirical Rules
If a variable is normally distributed, then approximately:
• 50% of its values fall within [mean − (2/3) sd, mean + (2/3) sd]
• 68% of its values fall within [mean − 1 sd, mean + 1 sd]
• 95% of its values fall within [mean − 2 sd, mean + 2 sd]
• 99.7% of its values fall within [mean − 3 sd, mean + 3 sd]

Empirical Rules Visualized
Units on the x axis: standard deviations from the mean!
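As a quick check of the empirical rules, the coverage of each interval can be computed from the standard normal distribution. This is a minimal R sketch using base R's pnorm() and qnorm(); the helper name within_k_sd is ours, not from the lecture.

```r
# Fraction of a normal distribution lying within k sd of the mean.
# pnorm() is the standard normal CDF, so the result does not depend on
# the particular mean or sd of the variable.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

k <- c(2/3, 1, 2, 3)
coverage <- within_k_sd(k)
print(round(coverage, 4))

# Going the other way: the exact sd multipliers that give
# 50%, 68%, 95% and 99.7% coverage.
target <- c(0.50, 0.68, 0.95, 0.997)
multipliers <- qnorm(1 - (1 - target) / 2)
print(round(multipliers, 3))
```

The multipliers come out close to, but not exactly, 2/3, 1, 2 and 3 (the 95% multiplier is about 1.96, for instance), which is why the rules are stated as approximations.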

Finding Quantiles Using Empirical Rules
95% of the values fall within [mean − 2 sd, mean + 2 sd]
• mean − 2 sd ≈ lower 2.5% quantile
• mean + 2 sd ≈ upper 2.5% quantile (the 97.5% quantile)
68% of the values fall within [mean − 1 sd, mean + 1 sd]
• mean − 1 sd ≈ lower 16% quantile
• mean + 1 sd ≈ upper 16% quantile (the 84% quantile)
50% of the values fall within [mean − (2/3) sd, mean + (2/3) sd]
• mean − (2/3) sd ≈ lower quartile
• mean + (2/3) sd ≈ upper quartile

Data Example
Let’s check these approximate relationships in “Exam Scores.csv”.
• mean = 66.7
• sd = 8.9
• 2.5% and 97.5% quantiles: mean ± 2 sd = 48.9 and 84.5
• Actual quantiles from data set: 50 and 82.75, which is close!
• What about ±1 sd?

How Can We Tell If Our Data Are Normal?
Unfortunately, our current graphical summaries fall short! They can only help us in isolating certain departures from normality.
• Histogram:
  - multimodal or skew ⇒ not bell shaped
• Boxplot:
  - Are the quartiles symmetric about the median?
  - Are there outlying observations?
We need a sharper tool!
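The comparison in the Data Example (empirical-rule quantiles versus actual sample quantiles) might be sketched in R as follows. The lecture's “Exam Scores.csv” is not available here, so this sketch assumes simulated stand-in scores with the mean and sd quoted on the slide; all variable names are illustrative.

```r
# Stand-in data (an assumption): simulated, roughly normal scores with the
# mean and sd quoted in the Data Example, not the real exam file.
set.seed(1)
scores <- rnorm(10000, mean = 66.7, sd = 8.9)

m <- mean(scores)
s <- sd(scores)

# Empirical-rule approximations to the 2.5%, 16%, 84% and 97.5% quantiles
probs <- c(0.025, 0.16, 0.84, 0.975)
approx_q <- c(m - 2 * s, m - s, m + s, m + 2 * s)

# Actual sample quantiles at the same probabilities
actual_q <- as.numeric(quantile(scores, probs = probs))

comparison <- data.frame(prob = probs,
                         empirical_rule = approx_q,
                         sample_quantile = actual_q)
print(round(comparison, 2))
```

With real data, scores would instead be read from the file, and larger gaps between the two columns would hint at a departure from normality.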

Normal Quantile Plot
Also called a Q-Q Plot
  qqnorm(scores)
  qqline(scores)
[Figure: “Normal Q−Q Plot” of scores; x axis: Theoretical Quantiles, y axis: Sample Quantiles]
Note: q(i) = i/(n+1) × 100 is the percentile assigned to the i-th smallest of n values. If the points follow the straight reference line, this indicates a normal distribution, since the sample quantiles then match the theoretical quantiles one-to-one.

How to Read a Normal Q-Q Plot:
• Vertical axis: sorted “Scores” (why staircasing?)
• Horizontal axis: “theoretical normal quantiles”
• “Under the hood”:
  - The idea is that, for example, the 10th smallest value of a variable with 252 cases is an estimate of the (10/252) × 100% quantile; hence plot the sorted values against the corresponding quantiles from a normal distribution.
• The normal curve can be used to calculate “theoretical quantiles”, which are plotted on the x axis
  - If the data were normally distributed, then all I need to know are the mean and the standard deviation, and I can calculate all of the quantiles
  - So, take the variable in question, compute its mean and standard deviation, and then compute what the quantiles should be if the variable were normally distributed with that mean/sd.

How to Use a Normal Q-Q Plot:
• Use: If the points don’t deviate from the diagonal line much, then the variable is “approximately normally distributed.”
• Idea: The straight line tells where the sorted values of the variable should fall approximately IF they are normally distributed.
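The “under the hood” construction of a normal Q-Q plot can be sketched by hand in R. This is an illustrative sketch, not the exact algorithm behind qqnorm(): it assumes simulated stand-in scores (rounded to integers, which produces the staircase pattern) and uses the i/(n+1) plotting positions noted on the slide.

```r
# Stand-in data (an assumption): 252 simulated integer exam scores.
set.seed(42)
scores <- round(rnorm(252, mean = 66.7, sd = 8.9))

n <- length(scores)
sorted_scores <- sort(scores)      # vertical axis: sorted sample values
probs <- (1:n) / (n + 1)           # i-th smallest value ~ i/(n+1) quantile
theoretical <- qnorm(probs)        # horizontal axis: standard normal quantiles

plot(theoretical, sorted_scores,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q plot (by hand)")
# Where the points would fall if scores were normal with this mean and sd
abline(a = mean(scores), b = sd(scores))

# Built-in equivalent for comparison
qqnorm(scores)
qqline(scores)
```

The built-in functions differ slightly in detail: qqnorm() takes its plotting positions from ppoints(n) rather than i/(n+1), and qqline() draws its line through the first and third quartiles rather than using the mean and sd. For data that are close to normal, the two pictures look nearly identical.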

Detecting Non-Normalities
How non-normalities show up in normal quantile plots:
• Right-Skewness: frequent in finance/econ (e.g., CEO compensation data). This causes convex (cup-shaped) curvature in normal quantile plots
  - Note: right skew shows up in the Q-Q plot as an exponential-looking curve, since the lower half of the values is concentrated near the lower quartile, while in the right tail the values get bigger and bigger.
• What do you think the curvature would be for left-skewed data?
• Outliers cause points to be too high on the right or too low on the left
• Multi-modality (rare) causes snaking of the normal quantile plot
Let’s show a few examples
• What is the nature of the non-normality?

CEO Compensations
From the data set CEO comp 2003.csv
Right Skewed
[Figure: left panel “Histogram of comp” (x axis: comp, y axis: Frequency); right panel “Normal Q−Q Plot” (x axis: Theoretical Quantiles, y axis: Sample Quantiles)]

Heavy Tails
Fat Tails
[Figure: left panel “Histogram of x” (y axis: Frequency); right panel “Normal Q−Q Plot” (y axis: Sample Quantiles)]

The Role of Standardization
Test scores are often approximately normally distributed. ⇒ Empirical rule and reliance on mean and sd work well.
• Is a 70 out of 100 a good test score?
  - What if the mean is 80, sd = 5?
  - What if the mean is 60, sd = 10?
  - What if the mean is 60, sd = 5?
• Knowing your score was a 70 clearly isn’t enough even if you believe scores are normally distributed! Depending on how many sds from the mean you are, a 70 could be very good or very bad.
• Can we think of the outcomes on a scale that reflects this?

Centering and Scaling a Continuous Variable
Suppose I have a continuous variable X. What is mean(X − x̄)?
• Called demeaning, or centering, the variable
Suppose I have a continuous variable X. What is sd(X / sd(X))?
• Called scaling a variable.
Suppose the variable X follows a normal distribution. Does centering and/or scaling affect its normality?

Z-Scores
A z-score answers the following question:
• “How many sd above (+) or below (−) the mean was the observed value?”
• Can write any observation as the mean, plus some number (z) times the sd:
  observed = mean + z × sd
Solve for z:
Z-SCORE (a unitless measure)
  $z = \dfrac{\text{observed} - \text{mean}}{\text{sd}}$

Z-Scores as a New Variable
Think of defining a new variable, where the values are the z-scores of a variable X for each case.
We could denote this new variable as:
Z-Scores as a Variable
  $Z(X) = \dfrac{X - \text{mean}(X)}{\text{sd}(X)}$

Z-Scores as a Change of Units
Recall our exam example: A 70 on an exam could mean very different things depending on the mean and standard deviation
• z-scores can be thought of as a change in units, where the unit becomes standard deviations above the mean
• unit = 1 sd, mean = 0
Mean, Standard Deviation of Z-Scores
If we form z-scores of any continuous variable X, then we have:
  mean(z-scores) = 0
  sd(z-scores) = 1
MEMORIZE THIS

Changing Units of Z-Scores
Suppose I have a variable X, and I change units by X* = a + bX (with b ≠ 0):
Changing Units
  If b is positive: z(X*) = z(X)
  If b is negative: z(X*) = −z(X)
That is, z-scores are not affected in terms of magnitude by additive and multiplicative shifts. The only thing that can change is the sign, if b is negative.
Example: Suppose we had z-scores of temperatures in Celsius and someone changed the original data set to be in Fahrenheit. The z-scores would not change!
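The z-score properties above can be checked numerically. The following is a minimal R sketch; the helper z_score and the temperature values are made up for illustration and are not from the lecture.

```r
# z-score of each value: how many sd above (+) or below (-) the mean it is
z_score <- function(x) (x - mean(x)) / sd(x)

# A score of 70 under the three scenarios from the exam example:
# (mean 80, sd 5), (mean 60, sd 10), (mean 60, sd 5)
z_70 <- (70 - c(80, 60, 60)) / c(5, 10, 5)
print(z_70)   # -2, 1, 2

# Made-up temperatures (illustrative only)
celsius    <- c(12.5, 18.0, 21.3, 25.1, 30.4, 16.7)
fahrenheit <- 32 + 1.8 * celsius          # X* = a + bX with b > 0

z_c <- z_score(celsius)
z_f <- z_score(fahrenheit)

print(c(mean = mean(z_c), sd = sd(z_c)))  # 0 and 1, up to rounding error
same_units_free <- isTRUE(all.equal(z_c, z_f))                 # b > 0: unchanged
sign_flips <- isTRUE(all.equal(z_score(5 - 2 * celsius), -z_c)) # b < 0: sign flips
print(c(same_units_free, sign_flips))
```

Base R's scale() performs the same centering and scaling on a vector or on the columns of a matrix, so as.numeric(scale(celsius)) should agree with z_score(celsius).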

Normality of CERTAIN Z-Scores
Suppose the continuous variable of interest, X, is normally distributed, with some mean m and some standard deviation s. Then we can say something further about the z-scores:
Normality of Z-Scores
If we form z-scores of any normally distributed continuous variable X, then the z-scores will also follow a normal distribution, with a mean of 0 and a standard deviation of 1.
Important: Taking a z-score cannot make a variable look “more normal” or “less normal”
• If the variable is normal, its z-scores will be normal
• If the variable is not normally distributed, neither will its z-scores be

Empirical Rules for Z-Scores
If a variable is normally distributed, then approximately:
• 50% of its z-scores fall within [−2/3, 2/3]
• 68% of its z-scores fall within [−1, 1]
• 95% of its z-scores fall within [−2, 2]
• 99.7% of its z-scores fall within [−3, 3]

Quantiles Beyond the Empirical Rules
• We can now answer quantile and range problems approximately for z = ±2/3, ±1, ±2, ±3
• General quantile problems:
  - “What is the fraction of students with a z-score above 1.5?”
  - “What is the fraction of students with a score below 70?”
• Converse:
  - “My score is at the 87% quantile.
    How many sd above the mean is it?”
• General range problems:
  - “What fraction of scores is within 1.2 sd of the mean?”
  - “What range about the mean contains 80% of the values?”
  - “What fraction of students scored between an 85 and a 94?”

Old School: Normal Tables
“Normal Tables.pdf”
• Try to make sense of the tops of the columns:
• Graph: curve = idealized histogram for n = infinity under a normal distribution. Shaded area indicates quantile or range
• Formula: ‘P(...)’ = ‘Proportion of cases with ...’
  - Ex.: P(Z < z) = Proportion of cases with z-score below z, where Z = column z-score values and z = threshold on z-score values

Understanding the Columns
• Left margin: values of threshold z
• 1st column: Proportion of cases with z-scores below −z
• 2nd column: Proportion of cases with z-scores below +z
• 3rd column: Proportion of cases with z-scores below −z OR above +z
• 4th column: Proportion of cases with z-scores within ±z
  - Why do columns 1 and 2 add up to 1?
  - Why do columns 3 and 4 add up to 1?
  - Why are the values in column 1 below 0.5?

New School: Calculating within R
• Proportion of z-scores below z: pnorm(z)
• Proportion of z-scores above z: 1 - pnorm(z)  # OR pnorm(z, lower.tail = FALSE)
• How would we find the proportion with z-scores below −z and above z? Within ±z?
• p-th percentile / p-quantile of z-scores (with p given as a fraction, not a percentage): qnorm(p)

Exercises with Z-Scores
• What fraction of values is below me if I’m 1.2 SD ABOVE the mean?
• What fraction of values is above me if I’m 1.2 SD BELOW the mean?
• What fraction of values is in the interval ±1.2 SD around the mean?
• How many SD above/below the mean am I if my quantile is 35%?

General Exercises
A histogram of GPAs for college students at a particular university follows the normal curve with a mean of 2.7 and a standard deviation of 0.5.
• What percentage of students have a GPA of 3.5 or lower?
• What percentage have GPAs greater than 2.5?
• What percentage have GPAs greater than 2.8?
• What percentage have GPAs between 2.5 and 2.8?
• If I am in the top 10% of the class, what does my GPA need to be at least?

From Quantiles to Means/SDs
Suppose that at a certain school of 820 students it is known that
• 3.8 GPA is the 95th percentile
• 3.3 GPA is the 80th percentile
• GPAs are normally distributed
Questions:
• What are the mean and sd of GPAs in the school?
• Can you approximate the class rank of a student with a 3.0 GPA?
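One possible way to work these exercises in R is sketched below, using only pnorm() and qnorm() and assuming the stated normal models hold exactly. The variable names are ours, and the class-rank convention (number of students above, plus one) is an assumption.

```r
# --- Exercises with z-scores (standard normal) ---
below_plus  <- pnorm(1.2)                        # below a value 1.2 sd ABOVE the mean
above_minus <- pnorm(-1.2, lower.tail = FALSE)   # above a value 1.2 sd BELOW the mean
within_12   <- pnorm(1.2) - pnorm(-1.2)          # inside +/- 1.2 sd of the mean
z_at_35     <- qnorm(0.35)                       # negative, so below the mean

# --- GPA exercises: normal with mean 2.7 and sd 0.5 ---
gpa_mean <- 2.7
gpa_sd   <- 0.5
p_le_35   <- pnorm(3.5, gpa_mean, gpa_sd)                      # GPA of 3.5 or lower
p_gt_25   <- pnorm(2.5, gpa_mean, gpa_sd, lower.tail = FALSE)  # GPA above 2.5
p_gt_28   <- pnorm(2.8, gpa_mean, gpa_sd, lower.tail = FALSE)  # GPA above 2.8
p_between <- p_gt_25 - p_gt_28                                 # between 2.5 and 2.8
top10_cut <- qnorm(0.90, gpa_mean, gpa_sd)                     # cutoff for the top 10%

# --- From quantiles to mean/sd ---
# Each known quantile gives one equation: GPA = mean + z * sd,
# so two quantiles give two equations in two unknowns.
z95 <- qnorm(0.95)
z80 <- qnorm(0.80)
school_sd   <- (3.8 - 3.3) / (z95 - z80)
school_mean <- 3.3 - z80 * school_sd

# Approximate rank of a 3.0 GPA among 820 students, counted from the top
frac_above  <- pnorm(3.0, school_mean, school_sd, lower.tail = FALSE)
approx_rank <- round(820 * frac_above) + 1

print(round(c(below_plus, above_minus, within_12, z_at_35), 4))
print(round(c(p_le_35, p_gt_25, p_gt_28, p_between, top10_cut), 4))
print(round(c(mean = school_mean, sd = school_sd, rank = approx_rank), 3))
```

Because of symmetry, below_plus and above_minus are the same number, which mirrors the relationship between the columns of the normal table.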