Statistics 111 - Lecture 7
Exploring Data: Numerical Summaries of One Variable

Summarizing Data
• Center of a Distribution
  • Average or sample mean
  • Median
• Spread of a Distribution
  • Sample Standard Deviation
  • Interquartile Range

Measure of Center: Average
• The average of a list of numbers is simply the arithmetic mean:
  $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Simple examples:
  • Numbers: 1, 2, 3, 4, 10000  average = 2002
  • Numbers: -1, -0.5, 0.1, 20  average = 4.65

Expected Value versus Average
We have a random variable:
  $X = \begin{cases} 1 & \text{w.p. } 1/6 \\ 0 & \text{w.p. } 2/6 \\ -1 & \text{w.p. } 3/6 \end{cases}$
  $E(X) = \sum_i x_i P(X = x_i) = 1 \cdot \frac{1}{6} + 0 \cdot \frac{2}{6} + (-1) \cdot \frac{3}{6} = -\frac{2}{6} = -\frac{1}{3}$
We have a sample:
  $x_1 = 1,\ x_2 = -1,\ x_3 = -1,\ x_4 = 0,\ x_5 = -1,\ x_6 = -1$
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1 + (-1) + (-1) + 0 + (-1) + (-1)}{6} = \frac{1 \cdot 1 + (-1) \cdot 4 + 0 \cdot 1}{6} = \frac{1}{n} \sum_i x_i f_i = -\frac{3}{6} = -\frac{1}{2}$
  (in the last sum, $x_i$ runs over the distinct values and $f_i$ is the number of times each value appears in the sample)

Average Behavior
The average, $\bar{x}$, estimates the population expected value, $\mu$.
Reminder: we learned three interesting phenomena regarding the mean of random variables:
1. The Law of Large Numbers
2. The mean and the variance
3. The Central Limit Theorem
What are the implications of these theorems for real data?

Average Behavior: mean
Theory: The Law of Large Numbers. As n increases, the average of n independent random variables from a population with expected value $\mu$ converges to $\mu$.
Practice: If we have a large sample, then the average is a good estimate of the population expected value.

Average Behavior: mean and variance
Theory: If $X_1, \ldots, X_n$ are i.i.d. random variables with expected value $\mu$ and variance $\sigma^2$, then
1. $E(\bar{X}) = \mu$
2. $Var(\bar{X}) = \sigma^2 / n$
Practice: If one draws many independent samples of size n from a population with expected value $\mu$ and variance $\sigma^2$, then
1. The mean of these averages is approximately $\mu$
2. The variance of these averages is approximately $\sigma^2 / n$

Average Behavior: mean and variance
[Figure: a population from which Sample 1, Sample 2, ..., Sample 8, ... are drawn, each of size n and each producing its own average $\bar{x}$]
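The two "Practice" statements about the mean and variance of the averages can be checked with a small simulation. The following is only a sketch under stated assumptions: it uses the three-valued random variable X from the expected-value example (values 1, 0, -1 with probabilities 1/6, 2/6, 3/6), and the function names, the seed, and the number of samples are illustrative choices, not part of the lecture.

```python
import random
import statistics

# Distribution of X from the expected-value example (assumed values and probabilities).
VALUES = [1, 0, -1]
PROBS = [1 / 6, 2 / 6, 3 / 6]

# Population expected value and variance, computed from their definitions.
MU = sum(v * p for v, p in zip(VALUES, PROBS))                   # equals -1/3
SIGMA2 = sum((v - MU) ** 2 * p for v, p in zip(VALUES, PROBS))   # equals 5/9


def sample_average(n, rng):
    """Draw n i.i.d. copies of X and return their average."""
    draws = rng.choices(VALUES, weights=PROBS, k=n)
    return sum(draws) / n


def simulate_averages(n, n_samples, rng):
    """Return the averages of n_samples independent samples, each of size n."""
    return [sample_average(n, rng) for _ in range(n_samples)]


if __name__ == "__main__":
    rng = random.Random(111)  # arbitrary seed, used only for reproducibility
    for n in (1, 10, 100):
        averages = simulate_averages(n, n_samples=2000, rng=rng)
        print(
            f"n={n:4d}  mean of averages={statistics.mean(averages):7.4f} "
            f"(theory {MU:.4f})  variance of averages="
            f"{statistics.pvariance(averages):.4f} (theory {SIGMA2 / n:.4f})"
        )
```

The simulated mean of the averages should stay close to the expected value for every n, while the simulated variance should shrink roughly in proportion to 1/n.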
Average Behavior: mean and variance
• Population: seasonal home-run totals for 7032 baseball players from 1901 to 1996
• Take different samples from this population and compare the sample mean we get each time
• In real life, we can’t do this, because we usually don’t have the entire population, just one sample!

Sample size                    Mean of $\bar{X}$   Variance of $\bar{X}$
100 samples of size n = 1      3.69                46.8
100 samples of size n = 10     4.43                4.43
100 samples of size n = 100    4.42                0.43
100 samples of size n = 1000   4.42                0.06
Population parameter: $\mu$ = 4.42

Average Behavior: Distribution
Theory: The Central Limit Theorem. If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with finite variance, then as n increases, the distribution of the average approaches a Normal distribution.
Practice: If we have many averages from the same population, each based on a reasonably large sample, their histogram should look approximately Normal.

Average Behavior: Distribution
Practice: The sampling distribution of the average is approximately Normal!
[Figure: a population with unknown parameter $\mu$, from which Sample 1, Sample 2, ..., Sample 8, ... are drawn, each of size n and each producing its own average $\bar{x}$. Distribution of these values? NORMAL]

Example: Home Runs per Season
• Take many different samples from the seasonal HR totals for a population of 7032 players
• Calculate the sample mean for each sample
[Figure: histograms of the sample means, with panels n = 1, n = 10, n = 100]

Problems with the Mean
• The mean is sensitive to large outliers
• Example: 2002 income of people in the Harvard Class of 1977
  • Mean income approximately $150,000
  • Yet almost all incomes were $70,000 or less!
  • Why such a discrepancy?
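One way to see how such a discrepancy can arise is to look at what a single large value does to an average. The sketch below uses the small list 1, 2, 3, 4, 10000 from the earlier average example, not the income data (which are not reproduced in these notes); the helper name `average` is an illustrative choice.

```python
def average(values):
    """Arithmetic average: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


numbers = [1, 2, 3, 4, 10000]           # example list from the average slide
without_largest = sorted(numbers)[:-1]  # same list with the largest value removed

print(average(numbers))          # 2002.0
print(average(without_largest))  # 2.5

# Four of the five values lie below the average of the full list.
print(sum(1 for x in numbers if x < average(numbers)))  # 4
```

A single extreme value moves the average from 2.5 to 2002, so the average ends up far from where almost all of the data sit.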
Potential Solution: Trimming the Mean
• Throw away the most extreme k% on both sides of the distribution, then calculate the mean of what remains
• Gets rid of outliers that exert an extreme influence on the mean
• Common to trim by 5% on each side, but can also do 10%, 20%, ...

Measure of Center: Median
• Take trimming to the extreme by throwing away all the data except for the middle value
• Median = “middle number in distribution”
• Simple examples:
  • Numbers: 1, 2, 3, 4, 10000  Median = 3
  • Numbers: -1, -0.5, 0.1, 20  Median = -0.2 (with an even number of values, average the two middle ones: (-0.5 + 0.1) / 2)
• The median is often described as a more robust or resistant measure of the center

Examples
• Shoe size of Stat 111 class
• Top 100 Richest People (Forbes 2004): Mean = 9.67 billion, Median = 7.45 billion

Effect of outliers
Dataset                            Mean    Median
Shoe Size                          8.96    8.75
Shoe Size with Shaq in class       9.27    8.75
Forbes Top 100 Richest             9.67    7.45
Forbes without Gates or Buffett    8.96    7.4

Effect of Asymmetry
• Symmetric Distributions
  • Mean ≈ Median (approx. equal)
• Skewed to the Left
  • Mean < Median
  • Mean pulled down by small values
• Skewed to the Right
  • Mean > Median
  • Mean pulled up by large values

Measures of Spread: Sample Standard Deviation
• Want to quantify, on average, how far each observation is from the center
• For observation $x_i$, deviation = $x_i - \bar{x}$
• The sample variance is (almost) the average of the squared deviations, dividing by n - 1 rather than n:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• The sample Standard Deviation (SD):
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• If n is large enough, the sample standard deviation is a good estimate of the population standard deviation

Sensitivity to outliers, again!
• The sample standard deviation is also an average (like the mean), so it is sensitive to outliers
• Can think about a similar solution: start trimming away extreme values on either side of the distribution
• If we trim away 25% of the data on either side, we are left with the first and third quartiles

Measures of Spread: Inter-Quartile Range
• First Quartile (Q1) is the median of the smaller half of the data (bottom 25% point)
• Third Quartile (Q3) is the median of the larger half of the data (top 25% point)
• Inter-Quartile Range is also a measure of spread: IQR = Q3 - Q1
• Like the median, the Inter-Quartile Range (IQR) is robust or resistant to outliers

Detecting Outliers
• IQR is used to detect outliers in a boxplot
• An observation $x_i$ is an outlier if either:
  1. $x_i$ is less than $Q1 - 1.5 \times IQR$
  2. $x_i$ is greater than $Q3 + 1.5 \times IQR$
• This definition comes from the Normal distribution
• The rule is not perfect: some outliers don’t fit the definition, and some observations that do are not outliers
• Note: if the data don’t go out as far as 1.5 × IQR beyond the quartiles, the whiskers stop before that point

Examples of Detecting Outliers
Dataset               IQR    Q1 - 1.5 x IQR   Q3 + 1.5 x IQR   Outliers
Shoe Size             3      3                15               none
Forbes 2004 Top 100   5.05   -2.1             18.1             First 14 people!

What to use?
• In the presence of outliers or asymmetry, it is usually better to use the median and IQR
• If distributions are symmetric and there are no outliers, the median and mean are approximately the same
• Mean and standard deviation are easier to deal with mathematically, so we will often use models that assume symmetry and no outliers
  • Example: Normal distribution
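The summaries compared in this lecture can be computed side by side with a short sketch. The helper names (`trimmed_mean`, `quartiles`, `outliers`) are illustrative, the shoe-size-like values are made up rather than taken from the class data, and the quartile rule follows the "median of each half" description above with one added assumption: when n is odd, the middle value is left out of both halves. Statistical software often uses slightly different quartile conventions, so its results may differ.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the values.

    The number dropped on each side is rounded down to a whole observation.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def quartiles(values):
    """Q1 and Q3 as the medians of the smaller and larger halves of the data.

    Assumption: when n is odd, the middle value is left out of both halves.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)


def outliers(values):
    """Observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Made-up values with one unusually large observation (not the class data).
sizes = [7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 22]

print(statistics.mean(sizes))     # 10.3
print(statistics.median(sizes))   # 9.25
print(trimmed_mean(sizes, 0.10))  # 9.25
print(statistics.stdev(sizes))    # sample SD, dividing by n - 1
print(quartiles(sizes))           # (8, 10.5)
print(outliers(sizes))            # [22]
```

In this made-up list the single large value pulls the mean above the median, while the median, the 10% trimmed mean, and the quartiles are unaffected by how large that value is.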