Exploring Data Numerical Summaries of One Variable Statistics 111

Statistics 111 - Lecture 7
Exploring Data
Numerical Summaries
of One Variable
Summarizing Data
• Center of a Distribution
• Average or sample mean
• Median
• Spread of a Distribution
• Sample Standard Deviation
• Interquartile Range
Measure of Center: average
• The average of a list of numbers is simply the
arithmetic average:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Simple examples:
• Numbers: 1, 2, 3, 4, 10000 average = 2002
• Numbers: –1, –0.5, 0.1, 20 average = 4.65
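The two simple examples above can be checked with a few lines of Python. This is only a minimal sketch; the function name `average` is an illustrative choice, not something from the lecture.

```python
def average(values):
    """Arithmetic average: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# The two lists from the slide.
print(average([1, 2, 3, 4, 10000]))
print(average([-1, -0.5, 0.1, 20]))
```

Note how the single value 10000 drags the first average far away from the other four numbers.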

Expected Value versus Average
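We have a random variable:
$X = \begin{cases} 1 & \text{w.p. } \frac{1}{6} \\ 0 & \text{w.p. } \frac{2}{6} \\ -1 & \text{w.p. } \frac{3}{6} \end{cases}$
$E(X) = \sum_i x_i \cdot P(X = x_i) = 1 \cdot \frac{1}{6} + 0 \cdot \frac{2}{6} + (-1) \cdot \frac{3}{6} = -\frac{2}{6} = -\frac{1}{3}$
We have a sample:
$x_1 = 1, x_2 = -1, x_3 = -1, x_4 = 0, x_5 = -1, x_6 = -1$
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + (-1) + (-1) + 0 + (-1) + (-1)}{6} = -\frac{3}{6} = -\frac{1}{2}$
Equivalently, grouping equal values, where $f_i$ is the number of times the value $x_i$ appears in the sample:
$\bar{x} = \frac{\sum_i x_i \cdot f_i}{n} = \frac{1 \cdot 1 + (-1) \cdot 4 + 0 \cdot 1}{6} = -\frac{3}{6} = -\frac{1}{2}$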
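The contrast between the expected value (computed from probabilities) and the average (computed from observed data) can be sketched in Python. The variable names are illustrative assumptions; the probabilities and the six sample values are the ones on the slide, with the sample taken as 1, -1, -1, 0, -1, -1.

```python
from collections import Counter
from fractions import Fraction

# Distribution of X from the slide: value -> probability.
pmf = {1: Fraction(1, 6), 0: Fraction(2, 6), -1: Fraction(3, 6)}
expected_value = sum(x * p for x, p in pmf.items())

# The observed sample of size n = 6.
sample = [1, -1, -1, 0, -1, -1]
sample_mean = Fraction(sum(sample), len(sample))

# Same average, computed from the frequency of each distinct value.
frequencies = Counter(sample)
sample_mean_by_freq = Fraction(
    sum(x * f for x, f in frequencies.items()), len(sample)
)

print(expected_value, sample_mean, sample_mean_by_freq)
```

The expected value is a property of the distribution, while the average depends on which six values happened to be drawn, so the two need not agree.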
Average Behavior
The average, $\bar{x}$, estimates the value of the population
expected value $\mu$
Reminder: we learned three interesting phenomena regarding
the mean of random variables:
1. The Law of Large Numbers
2. The mean and the variance
3. The Central Limit Theorem
What are the implications of these theorems for real data?
Average Behavior: mean
Theory: The Law of Large Numbers
As n increases, the mean of independent random variables
from a population with expected value $\mu$ will converge
to the expected value
Practice: If we have a large sample, then the average is a good
estimate of the population expected value $\mu$
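A small simulation can illustrate the Law of Large Numbers. The sketch below assumes the three-valued random variable used earlier in the lecture (1, 0, -1 with probabilities 1/6, 2/6, 3/6, so the expected value is -1/3); the function name and the fixed seed are illustrative choices.

```python
import random

VALUES = [1, 0, -1]
WEIGHTS = [1, 2, 3]  # proportional to the probabilities 1/6, 2/6, 3/6

def sample_average(n, seed=0):
    """Average of n independent draws from the distribution above."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    draws = rng.choices(VALUES, weights=WEIGHTS, k=n)
    return sum(draws) / n

# The average should settle near -1/3 as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_average(n))
```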
Average Behavior: mean and variance
Theory: If $X_1, \ldots, X_n$ are i.i.d. random variables with
expected value $\mu$ and variance $\sigma^2$, then
1. $E(\bar{X}) = \mu$
2. $Var(\bar{X}) = \frac{\sigma^2}{n}$
Practice: If one draws many independent samples of size
n from a population with expected value $\mu$ and
variance $\sigma^2$, then
1. The mean of these averages is $\mu$
2. The variance of these averages is $\frac{\sigma^2}{n}$
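The "Practice" statement can be tried out by simulation. This sketch assumes a population given by the three-valued random variable from earlier in the lecture (expected value -1/3, variance 5/9); the function name, the number of samples, and the seed are illustrative assumptions.

```python
import random
import statistics

VALUES = [1, 0, -1]
WEIGHTS = [1, 2, 3]  # probabilities 1/6, 2/6, 3/6
MU = -1 / 3          # population expected value
SIGMA2 = 5 / 9       # population variance: E(X^2) - mu^2 = 4/6 - 1/9

def simulate_averages(n, num_samples=2000, seed=1):
    """Draw num_samples samples of size n; return mean and variance of their averages."""
    rng = random.Random(seed)
    averages = []
    for _ in range(num_samples):
        sample = rng.choices(VALUES, weights=WEIGHTS, k=n)
        averages.append(sum(sample) / n)
    return statistics.mean(averages), statistics.variance(averages)

mean_of_averages, variance_of_averages = simulate_averages(25)
print(mean_of_averages, MU)
print(variance_of_averages, SIGMA2 / 25)
```

With a finite number of samples the simulated values only approximate $\mu$ and $\sigma^2 / n$.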
Average Behavior: mean and variance
[Diagram: from one Population, draw Sample 1, Sample 2, ..., Sample 8, and so on, each of size n; each sample yields its own average $\bar{x}$]
Average Behavior: mean and variance
• Population: seasonal home-run totals for 7032
baseball players from 1901 to 1996
• Take different samples from this population and
compare the sample mean we get each time
• In real life, we can’t do this because we don’t usually
have the entire population, just one sample!
Sample Size                     Mean of $\bar{X}$    Variance of $\bar{X}$
100 samples of size n = 1       3.69                 46.8
100 samples of size n = 10      4.43                 4.43
100 samples of size n = 100     4.42                 0.43
100 samples of size n = 1000    4.42                 0.06
Population Parameter: $\mu$ = 4.42
Average Behavior: Distribution
Theory: The Central Limit Theorem
If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables, then as n
increases, the distribution of the average approaches a Normal distribution
Practice: If we have many averages from the same population,
their histogram should look approximately normal
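One rough way to see the Central Limit Theorem at work is to check a known property of the Normal distribution: about 68% of values fall within one standard deviation of the mean. The sketch below assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1), which is not from the lecture; the function name, sample counts, and seed are also illustrative.

```python
import math
import random

def fraction_within_one_sd(n, num_samples=5000, seed=2):
    """Share of sample averages within one SD (sigma / sqrt(n)) of the population mean."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and SD of the assumed exponential(1) population
    sd_of_average = sigma / math.sqrt(n)
    hits = 0
    for _ in range(num_samples):
        avg = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(avg - mu) <= sd_of_average:
            hits += 1
    return hits / num_samples

# For a Normal distribution this share is about 0.68.
print(fraction_within_one_sd(50))
```

This checks only one feature of the Normal shape; plotting a histogram of the averages would give a fuller picture.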
Average Behavior: Distribution
Practice: The sampling distribution of the average is normal!
[Diagram: a Population with unknown parameter $\mu$; Sample 1, Sample 2, ..., Sample 8, and so on, each of size n, each yielding an average $\bar{x}$. Distribution of these values? NORMAL]
Example: Home Runs per Season
• Take many different samples from the seasonal HR totals
for a population of 7032 players
• Calculate sample mean for each sample
[Figure: histograms of the sample means for n = 1, n = 10, and n = 100]
Problems with the Mean
• Mean is sensitive to large outliers
• Example: 2002 income of people in Harvard
Class of 1977
• Mean Income approximately $150,000
• Yet almost all incomes are $70,000 or less!
• Why such a discrepancy?
Potential Solution: Trimming the Mean
• Throw away the most extreme k % on both
sides of the distribution, then calculate the
mean
• Gets rid of outliers that are exerting an
extreme influence on mean
• Common to trim by 5% on each side, but can
also do 10%, 20%, …
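The trimming procedure can be sketched as follows. The function name `trimmed_mean` and the choice to round the number of discarded values down are assumptions made for illustration; statistical packages may handle fractional counts differently.

```python
def trimmed_mean(values, k_percent):
    """Mean after discarding the k_percent smallest and k_percent largest values."""
    if not 0 <= k_percent < 50:
        raise ValueError("k_percent must be at least 0 and below 50")
    ordered = sorted(values)
    cut = int(len(ordered) * k_percent / 100)  # values dropped on each side, rounded down
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 10000]
print(trimmed_mean(data, 0))   # no trimming: the ordinary mean
print(trimmed_mean(data, 20))  # drops 1 and 10000, averages 2, 3, 4
```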
Measure of Center: Median
• Take trimming to the extreme by throwing away
all the data except for the middle value
• Median = “middle number in distribution”
• Simple examples:
• Numbers: 1, 2, 3, 4, 10000 Median = 3
• Numbers: -1, -0.5, 0.1, 20 Median = -0.2
• Median is often described as a more robust or
resistant measure of the center
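The two median examples can be reproduced with a short function. This is a minimal sketch using the usual convention that, for an even number of values, the median is the average of the two middle values; the function name is illustrative.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 3, 4, 10000]))
print(median([-1, -0.5, 0.1, 20]))
```

Replacing 10000 with any other value above 3 leaves the first median unchanged, which is what "resistant" means here.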
Examples
• Shoe size of Stat 111 Class
Top 100 Richest People (Forbes 2004)
Mean = 9.67 billion
Median = 7.45 billion
Effect of outliers
Dataset                            Mean    Median
Shoe Size                          8.96    8.75
Shoe Size with Shaq in class       9.27    8.75
Forbes Top 100 Richest             9.67    7.45
Forbes without Gates or Buffett    8.96    7.4
Effect of Asymmetry
• Symmetric Distributions
• Mean ≈ Median (approx. equal)
• Skewed to the Left
• Mean < Median
• Mean pulled down by small values
• Skewed to the Right
• Mean > Median
• Mean pulled up by large values
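The three cases can be illustrated with small made-up datasets. The lists below are hypothetical examples chosen for this sketch, not data from the lecture.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # long tail of large values
left_skewed = [-x for x in right_skewed]  # mirror image: long tail of small values

for name, data in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    print(name, statistics.mean(data), statistics.median(data))
```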
Measures of Spread: Sample Standard Deviation
• Want to quantify, on average, how far each
observation is from the center
• For observation $x_i$, deviation = $x_i - \bar{x}$
• The sample variance is the average of the
squared deviations of each observation (dividing by n - 1 rather than n):
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• The sample Standard Deviation (SD): $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• If n is large enough, the sample standard deviation $s$
is a good estimate of the population standard
deviation $\sigma$
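The two formulas translate directly into code. This is a minimal sketch; the function names and the eight-value dataset are illustrative assumptions, and Python's standard `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the average, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # average is 5, squared deviations sum to 32
print(sample_variance(data))     # 32 / 7
print(sample_sd(data))
```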

Sensitivity to outliers, again!
• Sample Standard Deviation is also an average (like
the mean) so it is sensitive to outliers
• Can think about a similar solution: start trimming
away extreme values on either side of the
distribution
• If we trim away 25% of the data on either side, we
are left with the first and third quartiles
Measures of Spread: Inter-Quartile Range
• First Quartile (Q1) is the median of the smaller half of
the data (bottom 25% point)
• Third Quartile (Q3) is the median of the larger half of
the data (top 25% point)
• Inter-Quartile Range is also a measure of spread:
IQR = Q3 - Q1
• Like the median, the Inter-Quartile Range (IQR) is
robust or resistant to outliers
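The quartile definitions above (median of the smaller half, median of the larger half) can be sketched as follows. One assumption is labeled in the code: when n is odd, the middle observation is left out of both halves. Other conventions exist, and software packages may return slightly different quartiles.

```python
def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q3) as the medians of the smaller and larger halves of the data."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # assumption: middle value excluded when n is odd
    upper = ordered[-half:]
    return median(lower), median(upper)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]), iqr([1, 2, 3, 4, 5, 6, 7, 8]))
# Replacing the largest value with an extreme one leaves the IQR unchanged.
print(iqr([1, 2, 3, 4, 5, 6, 7, 1000]))
```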
Detecting Outliers
• IQR is used to detect outliers in a boxplot:
• An observation $x_i$ is an outlier if either:
1. $x_i$ is less than Q1 - 1.5 × IQR
2. $x_i$ is greater than Q3 + 1.5 × IQR
• This definition comes from the normal
distribution.
• Some outliers don’t fit the definition, and some
observations that do are not outliers
• Note: if the data don’t go out that far, then
the whiskers stop before 1.5 × IQR
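The 1.5 × IQR rule can be sketched in code once Q1 and Q3 are known. The function names are illustrative, the quartiles 7.5 and 10.5 are assumed values chosen so that IQR = 3, and the five data points are made up for the example.

```python
def iqr_fences(q1, q3):
    """Lower and upper cutoffs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread

def find_outliers(values, q1, q3):
    """Observations that fall below the lower cutoff or above the upper cutoff."""
    lower, upper = iqr_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

# Assumed quartiles Q1 = 7.5 and Q3 = 10.5 give cutoffs of 3 and 15.
print(iqr_fences(7.5, 10.5))
# Hypothetical observations: only 22 lies outside the cutoffs.
print(find_outliers([6, 8, 9, 11, 22], 7.5, 10.5))
```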
Examples of Detecting Outliers
Dataset                 IQR     Q1 - 1.5 x IQR    Q3 + 1.5 x IQR    Outliers
Shoe Size               3       3                 15                none
Forbes 2004 Top 100     5.05    -2.1              18.1              First 14 people!
What to use?
• In presence of outliers or asymmetry, it is usually
better to use median and IQR
• If distributions are symmetric and there are no
outliers, median and mean are approximately the same
• Mean and standard deviation are easier to deal
with mathematically, so we will often use models
that assume symmetry and no outliers
• Example: Normal distribution