Chapter 4

Week 2
September 8-12
Five Mini-Lectures
QMM 510
Fall 2014
ML 2.1
Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation and Covariance
4.7 Grouped Data
4.8 Skewness and Kurtosis
So many topics, so little time …
Describing Data Numerically
Three key characteristics of numerical data: center, variability, and shape.
Visual Description
Measures of Center
Mean
• A familiar measure of center.
• Population mean: \( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \)
• Sample mean: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
• Excel function =AVERAGE(Data), where Data is an array of data values.
Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower halves of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.
Mode
• The most frequently occurring data value.
• Familiar and easy to understand.
• But data may have multiple modes or no mode.
• Most useful for discrete or categorical data with only a few values. Rarely useful for continuous data or data with a wide range.
Example: Revenue growth in 32 bio-tech companies last year.
0.57    4.00    8.29   11.00
1.57    4.01    8.43   11.57
1.71    5.28    9.14   11.57
1.71    5.29    9.29   11.86
1.86    6.14   10.00   12.43
2.14    6.43   10.29   13.43
2.43    6.71   10.43   13.57
2.86    6.86   10.43   14.14
Caution: In decimal data, some data values may occur more than once,
but this is likely due to chance (not central tendency). Excel’s
=MODE(Data) returns only the first mode (1.71 in this example).
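The measures of center above can also be checked outside Excel. The following is a minimal Python sketch (Python is an assumption here; the slides themselves use only Excel) applied to the 32 revenue-growth values. The variable names are illustrative.

```python
import statistics

# Revenue growth for the 32 bio-tech companies listed above
growth = [
    0.57, 1.57, 1.71, 1.71, 1.86, 2.14, 2.43, 2.86,
    4.00, 4.01, 5.28, 5.29, 6.14, 6.43, 6.71, 6.86,
    8.29, 8.43, 9.14, 9.29, 10.00, 10.29, 10.43, 10.43,
    11.00, 11.57, 11.57, 11.86, 12.43, 13.43, 13.57, 14.14,
]

mean = statistics.mean(growth)
median = statistics.median(growth)      # n is even: average of the two middle values
modes = statistics.multimode(growth)    # every value tied for the highest count
first_mode = statistics.mode(growth)    # first mode encountered (Python 3.8+)

print(f"n = {len(growth)}, mean = {mean:.3f}, median = {median:.3f}")
print("all modes:", modes, "first mode:", first_mode)
```

Like Excel's =MODE, `statistics.mode` reports only 1.71, while `multimode` shows that three different values each occur twice, which is the situation the caution describes.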
• Compare the mean and median, or look at the histogram, to determine the degree of skewness.
• Figure 4.10 shows prototype population shapes with varying degrees of skewness.
Geometric Mean
• The geometric mean (G) is a multiplicative average: \( G = \sqrt[n]{x_1 x_2 \cdots x_n} \)
• In Excel, use =GEOMEAN(Data) or, for example, =(2*3*7*9*10*12)^(1/6)
Growth Rates
• A variation on the geometric mean used to find the average growth rate for a time series: \( GR = \sqrt[n-1]{x_n / x_1} - 1 \)
• For example, from 2006 to 2010, JetBlue Airlines revenues are:
Year   Revenue (mil)
2006   2,361
2007   2,843
2008   3,392
2009   3,292
2010   3,779
The average growth rate: \( GR = \sqrt[4]{3779 / 2361} - 1 \approx 0.125 \), or 12.5% per year.
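The geometric mean and the average growth rate can be reproduced with a short Python sketch. The function names below are illustrative, not part of the course materials; the growth calculation assumes the usual form (last value / first value) raised to 1/(n - 1), minus 1.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def average_growth_rate(series):
    """Average growth per period: (last / first) ** (1 / (n - 1)) - 1."""
    periods = len(series) - 1
    return (series[-1] / series[0]) ** (1 / periods) - 1

g = geometric_mean([2, 3, 7, 9, 10, 12])
revenue = [2361, 2843, 3392, 3292, 3779]  # JetBlue revenue (mil), 2006-2010
gr = average_growth_rate(revenue)
print(f"G = {g:.3f}, average growth = {gr:.1%}")
```

Only the first and last revenue values enter the growth rate, so the dip in 2009 does not affect the result.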
Midrange
• The midrange is the point halfway between the lowest and highest values of X: Midrange = (xmin + xmax)/2.
• Easy to use but sensitive to extreme data values.
• For the J.D. Power quality data, the midrange (126.5) is higher than both the mean (114.70) and the median (113).
Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k by n, which is 0.05 × 33 = 1.65, or 2 observations after rounding.
• So, we would remove the two smallest and two largest observations before averaging the remaining values.
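The trimming rule described above can be sketched in Python as follows. This follows the slide's rule of rounding k × n to a whole number of observations per tail; the function name and the sample data are hypothetical.

```python
def trimmed_mean(values, k):
    """Drop round(k * n) observations from each end, then average the rest."""
    ordered = sorted(values)
    n = len(ordered)
    cut = round(k * n)            # e.g. 0.05 * 33 = 1.65 -> 2
    kept = ordered[cut:n - cut]
    if not kept:
        raise ValueError("k is too large for this sample")
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value
sample = [7, 10, 11, 12, 12, 13, 14, 15, 16, 90]
print(trimmed_mean(sample, 0.10))   # trims 1 value from each end
print(trimmed_mean(sample, 0))      # ordinary mean, for comparison
```

Two caveats: Python's `round()` sends exact halves to the nearest even integer (2.5 becomes 2), and Excel's =TRIMMEAN may round the number of trimmed points differently, so results can differ slightly from a spreadsheet.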
Trimmed Mean
• Here is a summary of all the measures of central tendency for the J.D. Power data, along with Excel functions.
Measure           Value     Excel function
Mean              114.70    =AVERAGE(Data)
Median            113       =MEDIAN(Data)
Mode              111       =MODE.SNGL(Data)
Geometric Mean    113.35    =GEOMEAN(Data)
Midrange          126.5     =(MIN(Data)+MAX(Data))/2
5% Trim Mean      113.94    =TRIMMEAN(Data, 0.1)
• In =TRIMMEAN, the second argument is the total fraction trimmed from both ends, so 0.1 removes 5 percent from each end.
• The trimmed mean mitigates the effects of very high values.
Chapter 4
Measures of Variability
Variability is the “spread” of data points about the
center of the distribution in a sample.
Range
  Formula: xmax - xmin
  Excel: =MAX(Data)-MIN(Data)
  Pro: Easy to calculate.
  Con: Sensitive to extreme data values.
Sample variance (s²)
  Formula: \( s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \)
  Excel: =VAR.S(Data)
  Pro: Plays a key role in mathematical statistics.
  Con: Nonintuitive meaning.
Population variance: \( \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \)
Population standard deviation: \( \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \)
Sample standard deviation (s)
  Formula: \( s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \)
  Excel: =STDEV.S(Data)
  Pro: Most common measure. Uses same units as the raw data ($, £, ¥, grams, etc.).
  Con: Nonintuitive meaning.
Sample coefficient of variation (CV)
  Formula: \( CV = 100 \times \frac{s}{\bar{x}} \)
  Excel: =100*STDEV.S(Data)/AVERAGE(Data)
  Pro: Measures relative variation in percent, so data sets can be compared.
  Con: Requires nonnegative data.
Mean absolute deviation (MAD)
  Formula: \( MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \)
  Excel: =AVEDEV(Data)
  Pro: Easy to understand.
  Con: Lacks “nice” theoretical properties.
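As a cross-check on the Excel functions listed above, here is a minimal Python sketch that computes the same five measures. The helper name and the sample are hypothetical; `statistics.variance` and `statistics.stdev` use the n - 1 divisor, matching =VAR.S and =STDEV.S.

```python
import statistics

def describe_spread(data):
    mean = statistics.mean(data)
    s = statistics.stdev(data)                      # sample standard deviation (n - 1)
    return {
        "range": max(data) - min(data),
        "variance": statistics.variance(data),      # sample variance (n - 1)
        "stdev": s,
        "cv_percent": 100 * s / mean,               # meaningful only for positive means
        "mad": sum(abs(x - mean) for x in data) / len(data),
    }

# Hypothetical sample
spread = describe_spread([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in spread.items():
    print(f"{name}: {value:.3f}")
```

Note that the MAD divides by n while the variance divides by n - 1, as in the formulas above.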
Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
Example: Class scores on 16-point quiz on first day of class and after
students had an opportunity to review the material.
Caution: Only appropriate for nonnegative data. CV is undefined if
the mean is zero or negative (this could happen, for example, if stocks
in a portfolio had negative rates of return).
Standardized Data
ML 2.2
Topics
• sorting, standardizing, z-scores
• normal distribution as a benchmark
• Empirical Rule (MegaStat)
• outliers and unusual observations
• Excel functions (Appendix J)
• examples: birth weight, voting
• using MegaStat and Minitab
The Empirical Rule
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of observed data:
k = 1: 68.26% will lie within μ ± 1σ
k = 2: 95.44% will lie within μ ± 2σ
k = 3: 99.73% will lie within μ ± 3σ
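The Empirical Rule percentages come from the standard normal distribution. A minimal Python sketch, using the standard library's `NormalDist` (available in Python 3.8 and later), recomputes them:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within mu ± {k} sigma: {p:.2%}")
```

The computed values round to 68.27%, 95.45%, and 99.73%. The small differences from the 68.26% and 95.44% quoted on the slide presumably reflect rounding in printed normal tables.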
Note: No upper bound is given. Data values outside μ ± 3σ are rare.
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: \( z_i = \frac{x_i - \mu}{\sigma} \)
Standardization formula for a sample (for n > 30): \( z_i = \frac{x_i - \bar{x}}{s} \)
• A negative z value means the observation is to the left of the mean.
• A positive z value means the observation is to the right of the mean.
Example: Birth Weights (n = 1429)
Resembles a normal distribution except for the low tail (a few extremely tiny babies).
Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.
Weights are in ounces (1 lb = 16 oz), with mean 116.14 and standard deviation 21.96:
• 5 pound baby’s z-score: z = (80-116.14)/21.96 = -1.65
• 9 pound baby’s z-score: z = (144-116.14)/21.96 = 1.27
• 11 pound baby’s z-score: z = (176-116.14)/21.96 = 2.73
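The birth-weight z-scores can be verified with a few lines of Python. This is a sketch under the assumption that the mean (116.14) and standard deviation (21.96) are in ounces, which is what the numerators 80, 144, and 176 imply (144 oz is 9 lb). The function and constant names are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

MEAN_OZ, SD_OZ = 116.14, 21.96     # birth-weight mean and standard deviation (oz)
OZ_PER_LB = 16

for pounds in (5, 9, 11):
    z = z_score(pounds * OZ_PER_LB, MEAN_OZ, SD_OZ)
    print(f"{pounds} lb: z = {z:.2f}")
```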
Example: Voting in 2004 Presidential Election
State           Voting%   z-Score
Hawaii          46.2      -2.35
California      49.1      -1.89
Texas           50.3      -1.71
Nevada          51.3      -1.55
Georgia         52.6      -1.35
…               …         …
Oregon          70.6      1.45
North Dakota    70.8      1.48
Maine           72.0      1.67
Wisconsin       73.0      1.82
Minnesota       76.7      2.40
Mean = 61.29, St Dev = 6.43, n = 50
Use Excel’s function =STANDARDIZE(x, μ, σ)
Only two states stand out as unusual.
Note: Sorting the data values allows you to see the extremes. Values within μ ± 1σ are less interesting.
Excel
Voting percent in 50 states (Excel’s Descriptive Statistics output):
Voting%
Mean                  61.286
Standard Error        0.909788089
Median                61.5
Mode                  59.7
Standard Deviation    6.433173274
Sample Variance       41.38571837
Kurtosis              0.014949556
Skewness              0.00241464
Range                 30.5
Minimum               46.2
Maximum               76.7
Sum                   3064.3
Count                 50
Note: In Excel’s Descriptive Statistics, you can’t choose the statistics displayed.
MegaStat
Voting percent in 50 states (MegaStat output):
Statistic                       Voting%
count                           50
mean                            61.286
sample variance                 41.386
sample standard deviation       6.433
minimum                         46.2
maximum                         76.7
range                           30.5
1st quartile                    57.450
median                          61.500
3rd quartile                    64.950
interquartile range             7.500
mode                            59.700
empirical rule
mean - 1s                       54.853
mean + 1s                       67.719
percent in interval (68.26%)    68.00%
mean - 2s                       48.420
mean + 2s                       74.152
percent in interval (95.44%)    96.00%
mean - 3s                       41.986
mean + 3s                       80.586
percent in interval (99.73%)    100.00%
low outliers                    0
high outliers                   1
high extremes                   0
Note: You can choose the statistics displayed (e.g., Empirical Rule).
Appendix J: Excel Functions
Quantiles
ML 2.3
Topics
• percentiles, quartiles, boxplots
• fences, another view of outliers
• examples: birth weight, city MPG
Percentiles, Quartiles, and Box-Plots
Percentiles
• Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.
• Percentiles may be used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles can be used in employee merit evaluation and salary benchmarking.
Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
            Q1             Q2            Q3
Lower 25%   |  Second 25%  |  Third 25%  |  Upper 25%
The three values that separate the four groups are called Q1, Q2, and Q3.
• The second quartile Q2 is the median, a measure of central tendency.
                Q2
← Lower 50% →   |   ← Upper 50% →
Method of Medians
• For small data sets, find quartiles using the method of medians:
Step 1: Sort the observations.
Step 2: Find the median Q2.
Step 3: Find the median of the data values that lie below Q2.
Step 4: Find the median of the data values that lie above Q2.
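The four steps above can be sketched in Python as follows. The function name is hypothetical, and one detail is an assumption: "below" and "above" Q2 are taken by position in the sorted list, so when n is odd the middle observation is left out of both halves.

```python
import statistics

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians (needs n >= 2)."""
    ordered = sorted(values)                 # Step 1: sort
    n = len(ordered)
    half = n // 2
    q2 = statistics.median(ordered)          # Step 2: median of all values
    lower = ordered[:half]                   # positions below the median
    upper = ordered[n - half:]               # positions above the median
    q1 = statistics.median(lower)            # Step 3
    q3 = statistics.median(upper)            # Step 4
    return q1, q2, q3

print(quartiles_by_medians([8, 1, 6, 3, 5, 2, 7, 4]))   # even n
print(quartiles_by_medians([1, 2, 3, 4, 5, 6, 7]))      # odd n
```

Other software (including Excel's quartile functions) interpolates differently, so quartiles from this method may not match spreadsheet output exactly.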
Quartiles – The method of medians
• The first quartile Q1 is the median of the data values below Q2.
• The third quartile Q3 is the median of the data values above Q2.
            Q1             Q2            Q3
Lower 25%   |  Second 25%  |  Third 25%  |  Upper 25%
For the first half of the data, 50% lie above and 50% lie below Q1.
For the second half of the data, 50% lie above and 50% lie below Q3.
Method of Medians
Example:
Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• For the previous P/E ratios example:
Xmin   Q1   Q2     Q3     Xmax
7      27   35.5   40.5   49
Box Plots
• The box plot is displayed visually, like this.
Box Plots: Midhinge
• The average of the first and third quartiles: Midhinge = (Q1 + Q3)/2.
• The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.
Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:
               Inner fences          Outer fences
Lower fence    Q1 – 1.5 (Q3 – Q1)    Q1 – 3.0 (Q3 – Q1)
Upper fence    Q3 + 1.5 (Q3 – Q1)    Q3 + 3.0 (Q3 – Q1)
• Values outside the inner fences are unusual while those outside the outer fences are outliers.
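The fence formulas can be applied to the P/E ratios five-number summary (7, 27, 35.5, 40.5, 49) with a short Python sketch. The function names and the "typical" label are illustrative; "unusual" and "outlier" follow the wording of the fence definitions.

```python
def fences(q1, q3):
    """Inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(x, q1, q3):
    f = fences(q1, q3)
    lo_out, hi_out = f["outer"]
    lo_in, hi_in = f["inner"]
    if x < lo_out or x > hi_out:
        return "outlier"       # beyond an outer fence
    if x < lo_in or x > hi_in:
        return "unusual"       # beyond an inner fence only
    return "typical"

# Five-number summary from the P/E ratios example
xmin, q1, q2, q3, xmax = 7, 27, 35.5, 40.5, 49
print(fences(q1, q3), "midhinge:", (q1 + q3) / 2)
print(classify(xmin, q1, q3), classify(xmax, q1, q3))
```

For this sample the inner fences are 6.75 and 60.75, so neither the minimum (7) nor the maximum (49) is flagged.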
Example: Birth Weights (n = 1429)
Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.
Note: The middle 50% of birth weights lie within a small range (105 to 130 oz, or about 6.56 lb to 8.13 lb). But there are extremes on the low end.
Box-Plots with Fences
Fences Visualized: Example
Interpretation: There are three outliers (beyond the inner upper fence). One is on the border of the upper outer fence, so it is almost an extreme outlier. Lower fences are not displayed since they are irrelevant for this sample.
Example: Fences and Unusual Data Values
Interpretation: Based on the fences, there is only one outlier and no extreme outliers. Lower fences are not displayed since they are not needed for this sample.
Correlation, Grouped Data, Shape
ML 2.4
Topics
• scatter plots
• correlation coefficient
• covariance – population, sample
• mean from grouped data
• skewness, kurtosis (Excel)
Correlation and Covariance
Correlation Coefficient
The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y.
\( r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \)
Note: -1 ≤ r ≤ +1, where r = -1 is perfect negative correlation and r = +1 is perfect positive correlation.
Illustration of Correlation Coefficients
Correlation Coefficient: Examples
X = car weight (lbs), Y = city MPG
X = gestation (months), Y = birth weight (oz)
Covariance
The covariance of two random variables X and Y (denoted σXY) measures the degree to which the values of X and Y change together.
Population covariance: \( \sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N} \)
Sample covariance: \( s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \)
Caution: The covariance is not easy to interpret because its units depend on the units of X and Y (e.g., dollars). That’s why we usually refer to the correlation coefficient (it is unit free).
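To see why the correlation coefficient is preferred over the covariance, the sketch below computes both in Python. The functions use the standard sample formulas (n - 1 divisor), and the paired data are hypothetical, loosely modeled on the car weight and city MPG example.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance scaled by both standard deviations, so it is unit free."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

# Hypothetical paired data
weight = [2500, 3000, 3500, 4000, 4500]   # car weight (lbs)
mpg = [30, 26, 24, 19, 17]                # city MPG
print(sample_covariance(weight, mpg), correlation(weight, mpg))
```

Rescaling weight from pounds to thousands of pounds changes the covariance by a factor of 1,000 but leaves r unchanged.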
Grouped Data
Weighted Mean
Group Mean
Note: You will rarely need this. If you are given only grouped data, you will have to make your own tables in Excel (like this).
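The group mean formula did not survive in the slide text, so the following Python sketch assumes the usual approximation: weight each class midpoint by its frequency and divide by the total count. The function name and the frequency table are hypothetical.

```python
def grouped_mean(bins):
    """bins: list of (lower_limit, upper_limit, frequency) tuples."""
    n = sum(f for _, _, f in bins)
    total = sum(f * (lo + hi) / 2 for lo, hi, f in bins)   # frequency x midpoint
    return total / n

# Hypothetical frequency table
table = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_mean(table))
```

Because every observation in a class is treated as if it sat at the midpoint, the result only approximates the mean of the raw data.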
Skewness
To interpret Excel’s skewness coefficient, you need a table showing critical values for various sample sizes.
Note: You can assess skewness from the histogram or boxplot (usually revealed by outliers or a long tail). It’s usually not worth it to bother with the table.
Kurtosis
To interpret Excel’s kurtosis coefficient, you need a table showing critical values for various sample sizes.
Caution: You cannot reliably assess kurtosis from the histogram, because its x-axis scale affects its appearance. Maybe best to let statisticians worry about this topic.
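For readers who want to see what the two coefficients measure, here is a Python sketch of the adjusted sample skewness and excess kurtosis. The formulas are assumed to match those documented for Excel's =SKEW and =KURT; the function names are illustrative, and at least 4 observations are needed.

```python
import statistics

def excel_style_skew(data):
    """Adjusted sample skewness (assumed to match =SKEW)."""
    n = len(data)
    m, s = statistics.mean(data), statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)

def excel_style_kurt(data):
    """Adjusted excess kurtosis (assumed to match =KURT)."""
    n = len(data)
    m, s = statistics.mean(data), statistics.stdev(data)
    z4 = sum(((x - m) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * z4 - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]          # hypothetical sample
print(excel_style_skew(symmetric), excel_style_kurt(symmetric))
```

A symmetric sample gives a skewness of zero; values near zero for both coefficients, as in the voting data, are consistent with a roughly normal shape.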
Assignments
ML 2.5
• Connect C-2 (covers chapter 4)
  • You get three attempts
  • Feedback is given if requested
  • Printable if you wish
  • Deadline is midnight each Monday
• Project P-1 (data, tasks, questions)
  • Review instructions
  • Look at the data
  • Your task is to write a nice, readable report (not a spreadsheet)
  • Length is up to you
Projects: General Instructions
For each team project, submit a short (5-10 page) report (using Microsoft Word
or equivalent) that answers the questions posed. Strive for effective writing (see
textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless
spelling and grammar. Paste graphs and computer tables or output into your
written report (it may be easier to format tables in Excel and then use Paste
Special > Picture to avoid weird formatting and permit sizing within Word).
Allocate tasks among team members as you see fit, but all should review and
proofread the report (submit only one report).
Project P-1
Random teams are assigned on Moodle (submit only one report).
Data: Download Big Dataset 02 - Crime in Major Cities from Moodle. Your team is assigned one crime category (but you can change it if you wish). Copy the city names and the chosen crime data column to a new spreadsheet. Delete lines (if any) with missing data.
Analysis:
(a) Sort the observations (with city names).
(b) List the top 10 and bottom 10 data values (with city names).
(c) For the entire data set, calculate the mean and median. What do they tell you about center? Would the mode be helpful for this type of data? Explain.
(d) Calculate the standard deviation.
(e) Calculate the standardized z-value for each observation.
(f) Are there outliers or unusual data values (see p. 137)? Discuss.
(g) Use MegaStat (or Minitab or Excel) to make a histogram. Describe its shape.
(h) Calculate the quartiles. Make a boxplot and describe it.
(i) Make a scatter plot of your kind of crime versus a different type of crime. What does it show?
(j) Ambitious students: Sort the database in random order (see bottom of page 36) using Excel’s function =RAND(). Copy and paste the first few sorted lines into your report to illustrate your sorting method.
Comment on anything unusual (or interesting things that you might find on the web).
Watch the video walkthrough using Voting, North Carolina Births, and CEO compensation as examples (posted on Moodle).
Your 2010 data will look like this (2005 and 2000 are also available).
Crime Rates in U.S. Metropolitan Areas, 2010 (n = 365), first 30 rows
Violent Crimes Per 100,000: All Violent, Murder, Rape, Robbery, Assault
Property Crimes Per 100,000: All Property, Burglary, Larceny, Car Theft
Metropolitan Statistical Area | All Violent | Murder | Rape | Robbery | Assault | All Property | Burglary | Larceny | Car Theft
Abilene, TX M.S.A. | 423.0 | 3.1 | 48.9 | 72.7 | 298.3 | 3617.3 | 1009.0 | 2459.8 | 148.5
Akron, OH M.S.A. | 304.7 | 3.7 | 40.9 | 105.1 | 155.0 | 3185.6 | 947.7 | 2074.5 | 163.3
Albany, GA M.S.A. | 566.0 | 8.7 | 24.9 | 150.4 | 382.1 | 4512.6 | 1417.8 | 2803.4 | 291.4
Albany-Schenectady-Troy, NY M.S.A. | 310.4 | 1.5 | 21.0 | 98.5 | 189.4 | 2693.6 | 512.1 | 2076.2 | 105.4
Albuquerque, NM M.S.A. | 670.4 | 5.8 | 44.8 | 124.3 | 495.6 | 3896.1 | 920.6 | 2586.2 | 389.4
Alexandria, LA M.S.A. | 638.0 | 5.8 | 23.1 | 132.3 | 476.7 | 4592.9 | 1203.3 | 3176.3 | 213.3
Allentown-Bethlehem-Easton, PA-NJ M.S.A. | 228.2 | 3.5 | 20.3 | 93.6 | 110.9 | 2298.0 | 432.2 | 1758.1 | 107.7
Altoona, PA M.S.A. | 243.6 | 0.8 | 38.0 | 49.8 | 155.0 | 1811.7 | 425.4 | 1318.2 | 68.0
Amarillo, TX M.S.A. | 513.1 | 5.7 | 40.8 | 98.9 | 367.8 | 4812.7 | 1137.2 | 3390.5 | 285.0
Ames, IA M.S.A. | 299.5 | 1.1 | 41.7 | 12.4 | 244.4 | 2528.1 | 478.6 | 1966.1 | 83.3
Anchorage, AK M.S.A. | 812.9 | 4.2 | 85.9 | 148.5 | 574.4 | 3506.3 | 416.1 | 2813.4 | 276.8
Anderson, IN M.S.A. | 205.8 | 2.3 | 33.4 | 70.6 | 99.5 | 3353.8 | 848.1 | 2294.6 | 211.1
Anderson, SC M.S.A. | 586.0 | 5.3 | 36.4 | 75.9 | 468.4 | 4707.8 | 1297.6 | 3041.7 | 368.4
Ann Arbor, MI M.S.A. | 338.5 | 1.4 | 43.2 | 69.8 | 224.0 | 2713.7 | 659.7 | 1879.5 | 174.4
Appleton, WI M.S.A. | 155.8 | 0.0 | 21.4 | 13.8 | 120.5 | 2136.7 | 378.5 | 1708.2 | 50.0
Asheville, NC M.S.A. | 229.7 | 1.9 | 21.8 | 59.9 | 146.1 | 2454.9 | 749.6 | 1534.9 | 170.3
Athens-Clarke County, GA M.S.A. | 374.9 | 4.2 | 19.6 | 70.5 | 280.5 | 3843.7 | 1018.0 | 2588.1 | 237.5
Atlanta-Sandy Springs-Marietta, GA M.S.A. | 413.8 | 6.1 | 20.9 | 149.7 | 237.1 | 3462.6 | 957.0 | 2135.7 | 370.0
Atlantic City-Hammonton, NJ M.S.A. | 529.8 | 8.0 | 18.9 | 245.5 | 257.5 | 3550.3 | 741.5 | 2685.7 | 123.1
Augusta-Richmond County, GA-SC M.S.A. | 412.9 | 10.2 | 37.4 | 156.6 | 208.7 | 4815.3 | 1355.1 | 3037.7 | 422.5
Austin-Round Rock-San Marcos, TX M.S.A. | 327.9 | 3.4 | 24.7 | 84.0 | 215.8 | 3792.0 | 754.3 | 2866.9 | 170.8
Bakersfield-Delano, CA M.S.A. | 593.0 | 9.0 | 19.9 | 148.4 | 415.7 | 3713.1 | 1148.0 | 1931.6 | 633.6
Baltimore-Towson, MD M.S.A. | 685.3 | 10.3 | 23.6 | 214.4 | 437.0 | 3090.7 | 649.5 | 2135.5 | 305.7
Bangor, ME M.S.A. | 68.4 | 2.0 | 12.6 | 27.2 | 26.6 | 3098.2 | 573.3 | 2429.3 | 95.7
Barnstable Town, MA M.S.A. | 434.6 | 0.5 | 36.1 | 57.6 | 340.3 | 2972.8 | 1116.6 | 1764.7 | 91.5
Battle Creek, MI M.S.A. | 697.6 | 4.5 | 75.3 | 109.6 | 508.3 | 3703.5 | 1145.6 | 2411.1 | 146.8
Bay City, MI M.S.A. | 335.2 | 0.9 | 78.1 | 50.8 | 205.2 | 2472.4 | 610.1 | 1776.6 | 85.7
Beaumont-Port Arthur, TX M.S.A. | 498.3 | 5.6 | 37.7 | 157.9 | 297.0 | 3865.3 | 1156.9 | 2488.4 | 220.1
Bellingham, WA M.S.A. | 267.0 | 2.5 | 44.7 | 50.6 | 169.1 | 3197.8 | 694.2 | 2372.7 | 130.8
Bend, OR M.S.A. | 304.9 | 4.3 | 29.0 | 30.9 | 240.7 | 2973.7 | 497.5 | 2360.2 | 116.0
Definitions
Violent crime:
• Murder and nonnegligent manslaughter
• Forcible rape
• Robbery
• Aggravated assault
Property crime:
• Burglary
• Larceny-theft
• Motor vehicle theft
Example: CEO Compensation
• Sorting is a good first step. Highlight all data (including the headings) and use Custom Sort.
• Now you can clearly see the high and low data values (and comment on any weird data values).
• Use MegaStat’s Descriptive Statistics to get your basic stats along with a nice boxplot.
• Use MegaStat’s Frequency Distributions to get a frequency table, histogram, etc. The histogram is severely skewed (annotated by user: normal if logs used?).
• Standardize the sorted list by subtracting the mean from each x value and then dividing by the standard deviation (or use the =STANDARDIZE function).
• After standardizing the sorted list, unusual z values can be seen.
• To randomize the list, paste values of =RAND() beside the data and custom sort on the =RAND() column.
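The sort, standardize, and randomize steps of the walkthrough can also be sketched in Python for anyone working outside Excel. This is an illustration only: it uses the murder rates of the first five metropolitan areas in the crime table from Project P-1, and the seed value and variable names are arbitrary choices.

```python
import random
import statistics

# Murder rates per 100,000 for the first five areas in the Project P-1 table
rates = {
    "Abilene, TX": 3.1,
    "Akron, OH": 3.7,
    "Albany, GA": 8.7,
    "Albany-Schenectady-Troy, NY": 1.5,
    "Albuquerque, NM": 5.8,
}

# Step 1: sort by value, keeping city names attached
ranked = sorted(rates.items(), key=lambda item: item[1])

# Step 2: standardize each value using the sample mean and standard deviation
mean = statistics.mean(rates.values())
sd = statistics.stdev(rates.values())
z_scores = {city: (x - mean) / sd for city, x in ranked}

# Step 3: put the rows in random order, like sorting on a column of =RAND()
rng = random.Random(510)       # fixed seed so the shuffle is repeatable
shuffled = list(rates.items())
rng.shuffle(shuffled)

for city, z in z_scores.items():
    print(f"{city}: z = {z:.2f}")
```

With only five rows the z-scores are not very informative; the same steps apply unchanged to the full data set.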