NUMERICAL DESCRIPTIVE MEASURES

Wish to describe data using summary statistics
1. Measures of Central Tendency (what is the middle of the data?)
Three Options: Arithmetic Mean, Median, Mode
(ignore geometric mean in text)
1.a. Arithmetic Mean
Most common measure of central tendency: the “average”
When referring to sample values, denoted as $\bar{X}$
When referring to population values, denoted as μ
Sample mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is sample size
Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
where N is population size
Example: raw data: 1 3 5 7 9
mean = (1+3+5+7+9)/5 = 5
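As a quick check of this example, here is a minimal Python sketch that applies the formula directly. The function name `arithmetic_mean` is an illustrative choice, not something from the text.

```python
def arithmetic_mean(values):
    """Sum the observations and divide by the number of observations."""
    return sum(values) / len(values)

raw_data = [1, 3, 5, 7, 9]
print(arithmetic_mean(raw_data))  # 5.0
```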
Arithmetic Mean is affected by outliers
[Figure: two dot plots. On an axis from 0 to 10, Mean = 5. On an axis extended to 14 to include an extreme value, Mean = 6.]
Chapter 3 -
1
1.b. Median
Raw data is arrayed in ascending order. The MEDIAN is the
middle of the data, i.e., a value such that half of the observations
lie below and half lie above the value.
If n or N is odd, the median is the middle number
If n or N is even, the median is the average of the two middle
numbers.
Example: a sample of 10 house prices (in thousands) yields:
144 98 204 177 155 316 100 177 177 170
to find median, arrange in ascending order:
98 100 144 155 170 177 177 177 204 316
Since there is an even number of observations, median = (170+177)/2 = 173.5
Note: arithmetic mean = 171.8
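The odd/even rule above can be sketched in a few lines of Python. This is one possible hand-written version for illustration (the name `median` and the variable names are assumptions), applied to the ten house prices.

```python
def median(values):
    """Middle value of the sorted data; with an even count,
    the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

house_prices = [144, 98, 204, 177, 155, 316, 100, 177, 177, 170]  # in thousands
print(median(house_prices))                   # 173.5
print(sum(house_prices) / len(house_prices))  # 171.8
```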
Median is often preferred when data are skewed, since not
affected by extreme values:
Ex. Housing prices: 98 100 100 102 110 125 350
Median = 102
Mean = 140.7
[Figure: two dot plots. On an axis from 0 to 10, Median = 5. On an axis extended to 14 to include an extreme value, Median = 5.]
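To see how differently the two measures react to an extreme value, the sketch below recomputes both for the seven housing prices with and without the 350 observation. Dropping that observation is an illustrative step added here, not part of the original example. It uses Python's standard `statistics` module.

```python
from statistics import mean, median

prices = [98, 100, 100, 102, 110, 125, 350]
without_extreme = prices[:-1]  # illustrative: drop the 350 observation

print(round(mean(prices), 1), median(prices))                    # 140.7 102
print(round(mean(without_extreme), 1), median(without_extreme))  # 105.8 101.0
```

The mean moves by about 35 while the median moves by 1.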
1.c. Mode
Value that occurs most often
There may be no mode or there may be more than one mode.
Unaffected by extreme values
Ex. In housing price data above, mode = 177
[Figure: two dot plots. On an axis from 0 to 14, Mode = 9. On an axis from 0 to 6, No Mode.]
Mode is infrequently used, but has applications in quality
control.
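A short sketch of finding the mode with `statistics.multimode` from Python's standard library (Python 3.8 or later), which returns every value tied for most frequent. The second data set is hypothetical and is included only to show the "no mode" case.

```python
from statistics import multimode

house_prices = [144, 98, 204, 177, 155, 316, 100, 177, 177, 170]
print(multimode(house_prices))  # [177]

# Hypothetical data in which every value is distinct: all values tie
# for "most frequent", which these notes describe as having no mode.
print(multimode([1, 2, 3, 4]))  # [1, 2, 3, 4]
```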
Which is the best measure to use? It depends.
If histogram of data is symmetric about the middle, all three
measures give similar results.
If histogram is skewed right or left, measures will differ.
Choose in context. The remainder of course will focus on mean
(e.g., what is the mean income of Canadians? what is the mean
life expectancy of a component?) and how to estimate
population mean using sample mean.
2. Measures of Variation/Dispersion
Need a single number describing how “spread out” the data are.
Many options: discuss range, MAD, Variance
2.a. Range: simply distance between highest and lowest values
Not very informative: ignores all but two observations:
Ex. 1, 9, 9.5, 9.5, 10    Range = 9
Ex. 1, 3, 7, 9, 10        Range = 9
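The two examples can be checked with a one-line function. The name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Distance between the highest and lowest observations."""
    return max(values) - min(values)

print(data_range([1, 9, 9.5, 9.5, 10]))  # 9
print(data_range([1, 3, 7, 9, 10]))      # 9
```

Both data sets give the same range even though the first is clustered near 10 and the second is spread evenly.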
2.b. Interquartile range (ignore)
2.c. Mean Absolute Deviation (MAD)
Not in text, but simple idea. A conceptually straightforward approach
is to measure how spread out the data are from the middle by
reporting the average distance from the middle to each
observation. Let the mean be the measure of the middle.
Ex. 4 observations 3 4 5 6
Mean = (3+4+5+6)/4 = 4.5
Observation 1 lies 1.5 units away from mean
Obs. 2 lies 0.5 units away
Obs. 3 lies 0.5 units away
Obs. 4 lies 1.5 units away.
Therefore, average distance away from mean is
(1.5+0.5+0.5+1.5)/4 = 1
This measure is Mean Absolute Deviation (MAD). Can be
computed using the formula (for population)
$$\text{MAD} = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N}$$
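A minimal Python sketch of the population MAD, applied to the four observations from the example. The function name is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each observation from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

print(mean_absolute_deviation([3, 4, 5, 6]))  # 1.0
```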
Absolute value operators are difficult to work with, but are
required, since simple mean of deviations will always equal
zero. An alternative to absolute value operators is to square
the deviations before adding and dividing by no. of observations
– then positives do not cancel negatives.
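The claim that the simple mean of deviations always equals zero follows from the definition of the population mean. A short derivation, using the same symbols as the rest of these notes:

```latex
\sum_{i=1}^{N} (X_i - \mu)
  = \sum_{i=1}^{N} X_i - N\mu
  = N\mu - N\mu
  = 0,
\qquad \text{since } \mu = \frac{1}{N}\sum_{i=1}^{N} X_i .
```

For the four observations 3, 4, 5, 6 with mean 4.5, the deviations are -1.5, -0.5, 0.5, 1.5, which sum to 0. Squaring each deviation removes this cancellation.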
Leads to important concept in course …
2.d. Variance
Defined: the average squared deviation from the mean.
Computed differently in population than in sample.
Population Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Sample Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Example of calculation:
Observation   Deviation from Mean   Squared Deviation
3             -1.5                  2.25
4             -0.5                  0.25
5              0.5                  0.25
6              1.5                  2.25
Total                               5
Therefore population variance = 5/4 = 1.25
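The same calculation can be reproduced with Python's standard `statistics` module, which provides `pvariance` (divides by N) and `variance` (divides by n - 1). Treating the four observations as a sample in the second call is only for comparison and is not part of the original example.

```python
from statistics import pvariance, variance

observations = [3, 4, 5, 6]

# Population variance: sum of squared deviations (5) divided by N = 4
print(pvariance(observations))           # 1.25

# Sample variance: the same sum divided by n - 1 = 3
print(round(variance(observations), 4))  # 1.6667
```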
Interpretation of variance is difficult
Most frequently used measure of variation/dispersion …
STANDARD DEVIATION - square root of variance
In Population: $\sigma = \sqrt{\sigma^2}$
In Sample: $s = \sqrt{s^2}$
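Continuing with the observations 3, 4, 5, 6 from the variance example, a brief sketch using `pstdev` and `stdev` from Python's `statistics` module shows that each standard deviation is the square root of the corresponding variance.

```python
from math import sqrt
from statistics import pstdev, stdev

observations = [3, 4, 5, 6]

print(round(pstdev(observations), 4))  # 1.118  (square root of 1.25)
print(round(stdev(observations), 4))   # 1.291  (square root of 5/3)
```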
[Figure: three dot plots on a common axis from 11 to 21, each with Mean = 15.5 but different spread. Data A: s = 3.338. Data B: s = 0.9258. Data C: s = 4.57.]
The magnitude of the Standard Deviation is meaningful.
Standard Deviation uses original units of measurement.
Value of standard deviation interpreted using the Empirical
Rule: If the data are fairly symmetric about the mean (i.e.,
histogram is bell-shaped), then the interval
$\mu \pm 1\sigma$ contains approximately 68% of the observations
$\mu \pm 2\sigma$ contains approximately 95% of the observations
$\mu \pm 3\sigma$ contains approximately 99.7% of the observations
Ex. Suppose final grades in class of 150 statistics students are
distributed in a symmetric manner about the mean, with μ = 72
and σ = 9 . Does this indicate a wide variation in grades?
In order to capture approximately 68% of final grades, we require a range from
72 – 9 = 63 to 72 + 9 = 81,
and approximately 95% of students will have grades between 72 – 18 = 54 and 72 + 18 = 90.
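The intervals in this example can be generated for any mean and standard deviation. The helper below is a hypothetical sketch (the function name is an assumption), applied to μ = 72 and σ = 9.

```python
def empirical_rule_intervals(mu, sigma):
    """Return {k: (mu - k*sigma, mu + k*sigma)} for k = 1, 2, 3."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

for k, (low, high) in empirical_rule_intervals(72, 9).items():
    print(k, low, high)
# 1 63 81
# 2 54 90
# 3 45 99
```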
2.e. Coefficient of Variation
We may wish to compare the dispersion of two sets of data. If
they have different means or are measured in different units, the
standard deviations are not comparable. In these cases, use
measures of dispersion relative to the mean …
CV = (Standard Deviation/Mean) x 100%
where standard dev. and mean can either be pop. or sample
Note: measured in percentages
Ex.
Stock A: mean price last year = $50; std. dev. of prices last year = $5
Stock B: mean price last year = $100; std. dev. of prices last year = $5
Calculate CVs to compare degree of risk:
Stock A: CV = (5/50) x 100% = 10%
Stock B: CV = (5/100) x 100% = 5%
Both stocks have the same standard deviation, but Stock A's price varies more relative to its mean.
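A small sketch of the CV calculation for the two stocks. The function name is illustrative, and the result is expressed in percent.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

print(coefficient_of_variation(5, 50))   # 10.0
print(coefficient_of_variation(5, 100))  # 5.0
```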
3. Measures of Shape
Measures of skewness indicate shape of distribution.
Left-Skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
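As a rough illustration only, the direction of skew can be guessed by comparing the mean with the median, in line with the orderings above. This is a crude heuristic sketched here, not a formal skewness measure from the text, and the function name and labels are assumptions.

```python
from statistics import mean, median

def skew_direction(values):
    """Crude check: compare mean with median to guess skew direction."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "mean equals median"

print(skew_direction([98, 100, 100, 102, 110, 125, 350]))  # right-skewed
print(skew_direction([1, 3, 5, 7, 9]))                     # mean equals median
```

The housing prices, with mean 140.7 above median 102, come out as right-skewed.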
4. Measures of Correlation
Require measure of how closely correlated two variables are.
Coefficient of Correlation indicates type and strength of linear
relationship in bivariate data.
Denoted r
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Coefficient of correlation is unit free, ranges from -1 to +1
-1 indicates perfect negative linear relationship
+1 indicates perfect positive linear relationship
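The coefficient of correlation can be computed directly from its definition. The sketch below is a minimal version under stated assumptions: the function name and the small data sets are hypothetical, chosen so that the points lie exactly on a line.

```python
from math import sqrt

def correlation(xs, ys):
    """Sum of cross-products of deviations, divided by the square root
    of the product of the two sums of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]                       # hypothetical data
print(correlation(xs, [2, 4, 6, 8, 10]))   # 1.0  (perfect positive)
print(correlation(xs, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative)
```

Rescaling either variable, for example changing its units, leaves r unchanged, which is what "unit free" means.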
[Figure: five scatter plots of Y against X, illustrating r = -1, r = -0.6, r = 0, r = 0.6, and r = 1.]