STA 1020 - Part 2 - Wayne State University

** STA 1020 - Part 2 (24/Oct/13) **
MATERIAL FOR EXAM #2
Exam 2 of 3: Organizing Data
STA 1020
Fall 2013 Section 09 MWF 10:40-11:35 0035 State
Instructor: Dr. J.L. Menaldi
Textbook - Statistics: Concepts and Controversies,
by David S. Moore and William I. Notz, 2013, W.H. Freeman & Company [8th ed]
Class Link: http://www.math.wayne.edu/~menaldi/teach/13f1020.htm
Quizzes every chapter and then Second Partial Exam
Contents
Chapter 10 - Graphs, Good and Bad
Chapter 11 - Displaying Distributions with Graphs
Chapter 12 - Describing Distributions with Numbers
Chapter 13 - Normal Distributions
Chapter 14 - Describing Relationships: Scatterplots and Correlations
Chapter 15 - Relationships: Regression, Predictions and Causation
Chapter 16 - The Consumer Price Index and Government Statistics – skipped!
“Statistics” is the Science of collecting, describing and interpreting data...
It is said that “Probability” is the vehicle of Statistics, i.e., if it were not for the laws
of probability, the theory of statistics would not be possible
Chapter 10
Ch10 - Graphs, Good and Bad
Part 2: Figures won’t lie, but liars will figure . . . , beware!
Thought Questions. . .
Data Tables
. . . Tables summarize data.
Table 10.1 Education of people 25 years and over, 2006
Level of education         Number of persons (thousands)   Percent
Less than high school      27,896                          14.5
High school graduate       60,898                          31.7
Some college, no degree    32,611                          17.0
Associate’s degree         16,760                           8.7
Bachelor’s degree          35,153                          18.3
Advanced degree            18,567                           9.7
Total                      191,884                         100.0
Source: Census Bureau, Educational Attainment in the United States: 2006
What is confusing or misleading about the following graph?
Ex1: How well educated are adults? Pay attention to details! Labels should be clear and
present everywhere. Do not forget the source.
Ex2: Roundoff errors. . . 27896 + · · · + 18567 = 191885, while the table reports a total of 191,884
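The roundoff effect in Ex2 can be checked with a few lines of Python. This is only a sketch: the high school graduate count is taken as 60,898, which is an assumption chosen because it is the value consistent with the quoted sum 191885 and with the 31.7% share. The variable names are illustrative.

```python
# Counts in thousands, as in Table 10.1.
# ASSUMPTION: 60,898 for high school graduates (consistent with the sum 191885).
counts = {
    "Less than high school": 27_896,
    "High school graduate": 60_898,
    "Some college, no degree": 32_611,
    "Associate's degree": 16_760,
    "Bachelor's degree": 35_153,
    "Advanced degree": 18_567,
}
published_total = 191_884  # total printed in the table

# Each entry was rounded to the nearest thousand, so the rounded
# entries need not add up to the rounded total.
computed_total = sum(counts.values())

# Percents rounded to one decimal place need not add to exactly 100 either.
percents = {level: round(100 * n / published_total, 1) for level, n in counts.items()}
percent_sum = round(sum(percents.values()), 1)

print("sum of entries:", computed_total)
print("published total:", published_total)
print("sum of rounded percents:", percent_sum)
```

Small discrepancies of this kind are expected in published tables and do not indicate a mistake in the data.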
Our eyes react to the area of the pictures!
Pie charts
Pie charts show how a whole is divided into parts. Wedges within the circle
represent the parts, with the angle spanned by each wedge in proportion to the
size of that part, e.g., 18.3% of those in this age group have a bachelor’s degree
but not an advanced degree, so that wedge spans 0.183 × 360 ≈ 66 degrees. Pie charts can compare
quantities that are parts of a whole
Pie chart of the distribution of level of education among persons aged 25 years and over
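As a quick illustration of the wedge rule (angle = fraction of the whole × 360 degrees), the following sketch converts the percents of Table 10.1 into wedge angles. The dictionary and variable names are illustrative assumptions, not part of the course material.

```python
# Percents from Table 10.1 (education of people 25 years and over, 2006).
percents = {
    "Less than high school": 14.5,
    "High school graduate": 31.7,
    "Some college, no degree": 17.0,
    "Associate's degree": 8.7,
    "Bachelor's degree": 18.3,
    "Advanced degree": 9.7,
}

# Each wedge spans the same fraction of 360 degrees as its share of the whole.
angles = {level: pct / 100 * 360 for level, pct in percents.items()}

for level, angle in angles.items():
    print(f"{level}: {angle:.1f} degrees")
```

Because the rounded percents add to 99.9 rather than 100, the angles add to slightly less than 360 degrees.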
Bar graphs
Bar graph of the distribution of level of education among persons aged 25 years
and over
The distribution of a variable tells us what values it takes and how often
it takes these values.
Bar graphs compare quantities, not necessarily parts of a whole
Ex3 High taxes?
Figure 10.4 Percentage of gross wage earnings paid in income tax and employee Social
Security contributions in eight countries in 2006, for Example 3. These percentages are
for single individuals without children at the income level of the average worker. (Data
from the Organisation for Economic Co-operation and Development)
Ex4 Beware of pictograms
Recall: Our eyes react to the area of the pictures!
To magnify a picture, the artist must increase both height and width to
avoid distortion. This creates a misleading graph.
Figure 10.5 A pictogram, for Example 4. This variation of a bar graph is attractive but
misleading
Another misleading graph
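The pictogram problem comes down to simple arithmetic: if a picture is scaled in both height and width, its area grows with the square of the scale factor. A minimal sketch, where the function name is a hypothetical helper introduced only for illustration:

```python
def pictogram_area_ratio(value_ratio: float) -> float:
    """Area ratio seen by the eye when BOTH height and width are
    scaled by value_ratio (hypothetical helper for illustration)."""
    return value_ratio ** 2

# A quantity that doubles is drawn twice as tall and twice as wide,
# so the picture covers four times the area.
print(pictogram_area_ratio(2))
print(pictogram_area_ratio(3))
```

A plain bar graph avoids this because only the height of the bar changes.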
Line graphs
A line graph of a variable plots each observation against the time at
which it was measured.
Time always goes on the horizontal scale! Connect the data points by lines
to display the change over time
A categorical variable places an individual into one of several groups or
categories.
A quantitative variable takes numerical values for which arithmetic
operations such as adding and averaging make sense.
Changes over time
Figure 10.6 A line graph of the average cost of regular unleaded gasoline each week
from January 3, 2000, to January 21, 2008, for Example 5. (Bureau of Labor Statistics)
Changes over time
Look for an overall pattern (trend)
Look for patterns that repeat at known regular intervals (seasonal
variations)
Look for any striking deviations that might indicate unusual
occurrences
.......................................................................
A pattern that repeats itself at known regular intervals of time is
called seasonal variation
Many series of regular measurements over time are seasonally
adjusted, i.e., the expected seasonal variation is removed before the
data are published
Scales
Figure 10.7 The effect of changing the scales in a line graph, for Example 6. Both
graphs plot the same data, but the right-hand graph makes the increase appear much
more rapid
Ex7 Getting rich A
Figure 10.8 Percentage increase or decrease in the Standard & Poor’s 500 index of
common stock prices, 1971 to 2003, for Example 7
Ex7 Getting rich B
Figure 10.9 Value at the end of each year, 1970 to 2003, of $1000 invested in the
Standard & Poor’s 500 index at the end of 1970, for Example 7
Ex8 Rise in college Education
Figure 10.10 Chart junk: this graph is so cluttered with unnecessary ink that it is hard
to see the data
Ex9 High Taxes?
Changing the order of the bars has improved the graph in Figure 10.4
Figure 10.11 Percentage of gross wage earnings paid in income tax and employee Social
Security contributions in eight countries in 2006, for Example 9
Making Good Graphs
Title your graph
Make sure labels and legends describe variables and their
measurement units. Be careful with the scales used
Make the data stand out. Avoid distracting grids, artwork, etc
Pay attention to what the eye sees. Avoid pictograms and tacky
effects
Categorical and Quantitative Variables, Distributions, Pie Charts, Bar
Graphs, Line Graphs, Techniques for Making Good Graphs.
Recall that descriptive statistics consists of procedures used to
summarize and describe the important characteristics of a set of
measurements.
Now it’s your turn. Read Case Study Evaluated.
Exercise Ch10
10.10 College freshmen. A survey of college freshmen in 2001 asked
what field they planned to study. The results: 12.6%, arts and humanities;
16.6%, business; 10.1%, education; 18.6%, engineering and science;
12.0%, professional; and 10.3%, social science.
(a) What percentage of college freshmen plan to study fields other than
those listed?
(b) Make a graph that compares the percentages of college freshmen
planning to study various fields.
Exercise (answer) Ch10
**Answers
(a) The given percents add to 80.2%, so 19.8% were in other fields. (b) Either a
bar chart or a pie chart would be appropriate; both are shown below.
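The arithmetic behind answer (a) can be sketched as follows; the dictionary name is an illustrative assumption.

```python
# Percentages reported in Exercise 10.10.
listed = {
    "arts and humanities": 12.6,
    "business": 16.6,
    "education": 10.1,
    "engineering and science": 18.6,
    "professional": 12.0,
    "social science": 10.3,
}

# Round to one decimal place to suppress floating-point noise.
listed_total = round(sum(listed.values()), 1)
other = round(100 - listed_total, 1)

print("listed fields:", listed_total, "%")
print("other fields:", other, "%")
```

For a pie chart the "other" category must be included as its own wedge, since a pie chart needs all the parts of the whole.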
Multiple choice Ch10
A company database contains the following information about each employee:
age, date hired, sex (male or female), ethnic group (Asian, black, Hispanic, etc.),
job category (clerical, management, technical, etc.), and yearly salary. Which of
the following lists of variables are all categorical?
(a) age, sex, ethnic group. (b) sex, ethnic group, job category. (c) ethnic group,
job category, yearly salary. (d) yearly salary, age, date hired.
Answer: (b)
.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Here is a table of the undergraduate enrollment at a large state
university, broken down by class:
Class         Count of students   Percent of students
Freshman      8,248               26.8%
Sophomore     8,073               26.2%
Junior        7,001               22.8%
Senior        6,904               22.4%
Non-degree    535                  1.7%
Total         30,761              100%
To make a correct graph of the distribution of students by class, you could use
(a) a bar graph. (b) a pie chart. (c) a line graph. (d) (a) or (b), but not (c).
Answer: (d)
Chapter 11
Ch11 - Displaying Distributions with Graphs
Data Tables
Table 11.1 Percentage of residents aged 65 and over in the 50 states, 2006
State          Percent   State            Percent   State            Percent
Alabama        13.4      Louisiana        12.2      Ohio             13.4
Alaska          6.8      Maine            14.6      Oklahoma         13.2
Arizona        12.8      Maryland         11.6      Oregon           12.9
Arkansas       13.9      Massachusetts    13.3      Pennsylvania     15.2
California     10.8      Michigan         12.5      Rhode Island     13.9
Colorado       10.0      Minnesota        12.1      South Carolina   12.8
Connecticut    13.4      Mississippi      12.4      South Dakota     14.2
Delaware       13.4      Missouri         13.3      Tennessee        12.7
Florida        16.8      Montana          13.8      Texas             9.9
Georgia         9.8      Nebraska         13.3      Utah              8.8
Hawaii         14.0      Nevada           11.1      Vermont          13.3
Idaho          11.5      New Hampshire    12.4      Virginia         11.6
Illinois       11.9      New Jersey       12.9      Washington       11.5
Indiana        12.4      New Mexico       12.4      West Virginia    15.3
Iowa           14.6      New York         13.1      Wisconsin        13.0
Kansas         12.9      North Carolina   12.2      Wyoming          12.2
Kentucky       12.4      North Dakota     14.6
Source: 2008 Statistical Abstract of the United States
Ex1 Making histograms
Histogram
1. Divide the range of the data into classes of equal width. Be sure to
specify the classes precisely so that each individual falls into exactly one class, i.e.,
classes are exclusive
2. Count the number of individuals in each class
Class          Count   Class          Count   Class          Count
6.0 to 6.9     1       10.0 to 10.9   2       14.0 to 14.9   5
7.0 to 7.9     0       11.0 to 11.9   6       15.0 to 15.9   2
8.0 to 8.9     1       12.0 to 12.9   16      16.0 to 16.9   1
9.0 to 9.9     2       13.0 to 13.9   14
3. Draw the histogram.
Mark on the horizontal axis the scale for the variable whose distribution
you are displaying (e.g., “percentage of residents aged 65 and over”).
The vertical axis contains the scale of counts (each bar represents a class).
Be sure that the classes for a histogram have equal widths.
There is not one right choice for the number of classes or class widths
Figure 11.1 Histogram of the percentages of residents aged 65 and older in the 50
states, for Example 1. Note the outlier
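The counting step of Example 1 can be reproduced with a short sketch. It assumes the classes of width 1 used above (6.0 to 6.9, 7.0 to 7.9, and so on), so the class of an observation is simply its whole-number part; the variable names are illustrative.

```python
import math
from collections import Counter

# Percent of residents aged 65 and over, Table 11.1 (50 states, 2006).
percent_65_plus = [
    13.4, 6.8, 12.8, 13.9, 10.8, 10.0, 13.4, 13.4, 16.8, 9.8,
    14.0, 11.5, 11.9, 12.4, 14.6, 12.9, 12.4, 12.2, 14.6, 11.6,
    13.3, 12.5, 12.1, 12.4, 13.3, 13.8, 13.3, 11.1, 12.4, 12.9,
    12.4, 13.1, 12.2, 14.6, 13.4, 13.2, 12.9, 15.2, 13.9, 12.8,
    14.2, 12.7, 9.9, 8.8, 13.3, 11.6, 11.5, 15.3, 13.0, 12.2,
]

# Each observation falls into exactly one class: the class whose
# lower limit is the whole-number part of the value.
class_counts = Counter(math.floor(p) for p in percent_65_plus)

for lower in range(6, 17):
    # A Counter returns 0 for an empty class such as 7.0 to 7.9.
    print(f"{lower}.0 to {lower}.9: {class_counts[lower]}")
```

Choosing a different class width would give a different, equally valid histogram of the same data.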
Interpreting histogram
* In any graph of data, look for an overall pattern and also for striking
deviations from that pattern
* An outlier in any graph of data is an individual observation that falls
outside the overall pattern of the graph
* Ex2: shape (the distribution has a single peak?), roughly symmetric; center
(the midpoint of the distribution is close to the peak?), spread (how?), outlier (how
many?)
Overall pattern of a distribution: center, spread and shape
* A distribution is symmetric if the right and left sides of the histogram
are approximately mirror images of each other
* A distribution is skewed to the right (or left) if the right (or left) side of
the histogram (containing the half of the observations with larger (or smaller) values)
extends much farther out than the left (or right) side
Ex3 Tuition & Fees
Figure 11.2 Histogram of the tuition and fees charged by 121 Illinois colleges and
universities in the 2004-2005 academic year, for Example 3.
Overall description: Roughly symmetric and skewed to the right
Ex4 Sampling again
Figure 11.3 Histogram of the sample proportion p̂ for 1000 simple random samples from
the same population, for Example 4. This is a symmetric distribution
Ex5 Shakespeare’s words
Figure 11.4 The distribution of word lengths used by Shakespeare in his plays, for
Example 5. This distribution is skewed to the right
Stemplot
Histograms are not the only graphical display of distributions. For small
data sets, a stemplot is quicker to make and presents more detailed
information.
Stem-and-Leaf Plots (for quantitative variables)
1. Separate each observation into a stem consisting of all but the final
(rightmost) digit and a leaf, the final digit. Stems may have as many
digits as needed, but each leaf contains only a single digit
2. Write the stems in a vertical column with the smallest at the top, &
draw a vertical line at the right of this column
3. Write each leaf in the row to the right of its stem, in increasing order
out from the stem
Stemplots look like histograms turned on end
Ex6 “65 and over”
From Table 11.1, the whole-number part of the observation is the stem, and the final
digit (tenths) is the leaf, i.e., the Alabama entry, 13.4, has stem 13 and leaf 4. Remember to
sort the leaves at the very end
Figure 11.6 Making a stemplot of the data in Table 11.1. Whole percents form the
stems, and tenths of a percent form the leaves
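The stemplot recipe for Table 11.1 can be sketched in a few lines. This is only an illustration under the stated convention (stem = whole percent, leaf = tenths digit); the function name `stemplot` is a hypothetical helper, not something from the textbook.

```python
from collections import defaultdict

# Percent of residents aged 65 and over, Table 11.1 (50 states, 2006).
percent_65_plus = [
    13.4, 6.8, 12.8, 13.9, 10.8, 10.0, 13.4, 13.4, 16.8, 9.8,
    14.0, 11.5, 11.9, 12.4, 14.6, 12.9, 12.4, 12.2, 14.6, 11.6,
    13.3, 12.5, 12.1, 12.4, 13.3, 13.8, 13.3, 11.1, 12.4, 12.9,
    12.4, 13.1, 12.2, 14.6, 13.4, 13.2, 12.9, 15.2, 13.9, 12.8,
    14.2, 12.7, 9.9, 8.8, 13.3, 11.6, 11.5, 15.3, 13.0, 12.2,
]

def stemplot(values):
    """Return one text row per stem (hypothetical helper).
    Stem = whole-number part, leaf = tenths digit."""
    leaves = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # 13.4 -> 134, avoids float surprises
        stem, leaf = divmod(tenths, 10)   # 134 -> stem 13, leaf 4
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        sorted_leaves = "".join(str(d) for d in sorted(leaves.get(stem, [])))
        rows.append(f"{stem:>2} | {sorted_leaves}")
    return rows

for row in stemplot(percent_65_plus):
    print(row)
```

Empty stems are kept on purpose, so that gaps in the data (here the stem 7) remain visible, just as an empty class does in a histogram.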
Tuition & Fees
Figure 11.7 Stemplot of the Illinois tuition and fee data
Data can be found at http://www.collegeillinois.com/en/collegefunding/costs.htm
Example Weight Data 1
This is how you do it
Example Weight Data 2
Choose the stems and the leaves
Example Weight Data 3
After sorting the leaves
Example Weight Data 4
Choose the classes
Example Weight Data 5
** Now it’s your turn. Read Case Study Evaluated **
Distributions
From left to right, from top to bottom: Symmetric Distributions Bell-Shaped, Symmetric
Distributions Uniform, Asymmetric Distributions Skewed to the Left, and to the Right
Exercise Ch11
11.4 Where do the young live? Figure 11.10 is a stemplot of the percentage
of residents aged under 18 in each of the 50 states in 2006. As in Figure 11.6
(page 227) for older residents, the stems are whole percents and the leaves are
tenths of a percent. (a) Utah has the largest percentage of young adults. What
is the percentage for this state? (b) Ignoring Utah, describe the shape, center,
and spread of this distribution. (c) Is the distribution for young adults more or
less spread out than the distribution in Figure 11.6 for older adults?
Figure 11.6
Figure 11.10
Exercise (answer) Ch11
**Answers
(a) Utah has 31.0% young adults.
(b) Without Utah, the distribution is roughly symmetric, centered at
about 24.2%, spread from 21.2% to 27%.
(c) The distribution of young adults is less spread out than the
distribution of older adults.
Figure 11.6 (older adults)
Figure 11.10 (young adults)
Multiple choice Ch11
To make a correct graph of the distribution of students by class, you could use
(a) a bar graph. (b) a pie chart. (c) a line graph. (d) (a) or (b), but not (c).
Answer: (d)
.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A well-drawn histogram should have
(a) bars all the same size. (b) no space between bars (unless a class has no
observations). (c) a clearly marked vertical scale. (d) all of these. (e) (a) and
(c), but not necessarily (b).
Answer: (d)
.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
You want to make a graph to display the distribution of salaries of the 1,500
professors at a very large university. The best choice is:
(a) a histogram. (b) a line graph. (c) a pie chart. (d) a stemplot
Answer: (a)
Chapter 12
Ch12 - Describing Distributions with Numbers
Describing . . . center & spread
Number of home runs hit by Barry Bonds in his first 22 seasons
Season   Runs   Season   Runs
1986     16     1997     40
1987     25     1998     37
1988     24     1999     34
1989     19     2000     49
1990     33     2001     73
1991     25     2002     46
1992     34     2003     45
1993     46     2004     45
1994     37     2005      5
1995     33     2006     26
1996     42     2007     28
A graph and a few words give a good description of Barry Bonds’s home run career.
We need numbers that summarize the center and the spread of the distribution
Median
Data Set: Number of home runs hit by Barry Bonds in his first 22 seasons
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 5 26 28
Median M
1. Arrange all observations in order, from the smallest to the largest
5 16 19 24 25 25 26 28 33 33 34 34 37 37 40 42 45 45 46 46 49 73
2. If the number of observations n is odd, the median is the center
observation in the ordered list
3. If the number of observations n (= 22) is even, the median is the
average of the two center observations in the ordered list:
(34 + 34)/2 = 34
The location of the median is the (n + 1)/2 (= 11.5) “position”
Quartiles Q1 and Q3
1. Arrange all observations in increasing order and locate the median M
in the ordered list of observations
5 16 19 24 25 25 26 28 33 33 34 | M = 34 | 34 37 37 40 42 45 45 46 46 49 73
Q1 = 25 (6th observation of the left half) and Q3 = 45 (6th observation of the right half)
2. The first (or third) quartile Q1 (or Q3) is the median of the observations
whose position in the ordered list is to the left (or right) of the
location of the overall median, i.e., Q1 is a number such that at most
25% of the data are smaller in value than Q1 and at most
75% are larger (for Q3, at most 75% are smaller and at most 25% are larger).
If n is odd, e.g., suppose only 21 seasons (no season 1986 = 16 runs)
5 19 24 25 25 26 28 33 33 34 | M = 34 | 37 37 40 42 45 45 46 46 49 73
Q1 = (25 + 26)/2 = 25.5
and Q3 = 45
Another case
Data Set: Number of home runs hit by Hank Aaron in his first 23 seasons
13 27 26 44 30 39 40 34 45 44 24 32 44 39 29 44 38 47 34 40 20 12 10
Arrange all observations in increasing order and locate the median M
in the ordered list of observations
10 12 13 20 24 26 27 29 30 32 34 | M = 34 | 38 39 39 40 40 44 44 44 44 45 47
For n = 23 we have (n + 1)/2 = 12, so M = 34. To the left (or to
the right) of the median there are 11 numbers, so (11 + 1)/2 = 6,
i.e., Q1 = 26 and Q3 = 44
In the ordered list, the position of the median is (n + 1)/2, and the
position of the first (third) quartile is (n + 1)/4 from the first (last)
(if the position is not an integer, take the average between both
adjacent places, could be a weighted average)
Summary Numbers
The five-number summary of a distribution consists of the smallest
observation, the first quartile, the median, the third quartile, and the
largest observation, written in order from smallest to largest. In symbols,
the five-number summary is
min   Q1   M   Q3   max,
i.e., for Bonds 5 25 34 45 73 and for Aaron 10 26 34 44 47
A boxplot is a graph of the five-number summary
A central box spans the quartiles.
A line in the box marks the median.
Lines extend from the box out to the smallest and largest observation
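The median, quartile and five-number-summary rules can be sketched in Python. The sketch follows the rule stated in these notes (quartiles are the medians of the observations to the left and to the right of the location of the overall median); the function names are hypothetical helpers, and statistical software may use slightly different quartile conventions.

```python
def median(sorted_values):
    """Median of an already sorted list (hypothetical helper)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                           # center observation
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # average of the two centers

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the rule in these notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # observations left of the median's location
    upper = data[half + (n % 2):]   # observations right of it (skip the median if n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Home runs per season, in season order.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]
aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
         44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]

print("Bonds:", five_number_summary(bonds))
print("Aaron:", five_number_summary(aaron))
# The 21-season case: drop the 1986 season (16 runs).
print("Bonds, 21 seasons:", five_number_summary(bonds[1:]))
```

These five numbers are exactly what a boxplot draws: the box from Q1 to Q3, a line at M, and whiskers out to min and max.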
Boxplot
Figure 12.2 Boxplots comparing the yearly home run production of Barry Bonds
(5 25 34 45 73) and Hank Aaron (10 26 34 44 47).
Now it’s your turn: 12.2 Babe Ruth
Ex3 Income inequality
The Census Bureau Web site provides information on income distribution by race
Figure 12.3 Boxplots comparing the distributions of income among Hispanics, blacks,
and whites. The ends of each plot are at 0 and at the 95% points in the distribution.
* Check Statistical Controversies: “Income Inequality”
Mean and standard deviation
The mean x̄ (pronounced “x-bar”) of a set of observations is their
average. To find the mean of n observations, add the values and divide by
n, i.e., (sum of observations)/n
The standard deviation s measures the average distance of the
observations from their mean. It is calculated by finding an average of the
squared distances and then taking the square root. To find the standard
deviation of n observations:
1. Find the distance of each observation from the mean and square each
of these distances
2. Average the squared distances by dividing their sum by n − 1. This average
squared distance is called the variance
3. The standard deviation s is the square root of this average squared
distance
In Formulae. . .
If x1 , . . . , xn are the observed numerical values then
Mean
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Variance
\[ s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Standard Deviation
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Ex4 Finding x̄ and s
For the (Data Set) number of home runs hit by Barry Bonds in his first 22 seasons
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 5 26 28
Ex4 Finding x̄ and s (cont)
Figure 12.6 Barry Bonds’s home run counts, for Example 4, with their mean and the distance of
one observation from the mean indicated. Think of the standard deviation as an average of
these distances.
We have n = 22 and
\[ \bar{x} = \frac{16 + 25 + \cdots + 28}{22} = \frac{762}{22} = 34.6, \]
\[ s^2 = \frac{(16 - 34.6)^2 + (25 - 34.6)^2 + \cdots + (28 - 34.6)^2}{22 - 1} = \frac{4139.12}{21} = 197.1 \]
and finally \( s = \sqrt{s^2} = \sqrt{197.1} = 14.04 \).
...........................................................................................
* The standard deviation s measures spread about the mean x̄. Use s to describe
the spread of a distribution only when you use x̄ to describe the center.
* s = 0 only when there is no spread. This happens only when all observations have
the same value. So standard deviation zero means no spread at all. Otherwise s > 0. As
the observations become more spread out about their mean, s gets larger
Now it’s your turn! Hank Aaron’s home run x̄ and s
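The three steps for the standard deviation can be followed literally in code. This sketch uses the Bonds data of Example 4 and keeps full precision for the mean, so the intermediate sum of squared distances differs very slightly from the hand calculation with x̄ rounded to 34.6; the rounded results agree. Variable names are illustrative.

```python
import math

# Barry Bonds's home runs in his first 22 seasons (Example 4).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

n = len(bonds)
mean = sum(bonds) / n                       # x-bar = (sum of observations)/n

# Step 1: squared distance of each observation from the mean.
squared_distances = [(x - mean) ** 2 for x in bonds]

# Step 2: divide the sum by n - 1 to get the variance s^2.
variance = sum(squared_distances) / (n - 1)

# Step 3: the standard deviation s is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"n = {n}, sum = {sum(bonds)}")
print(f"mean = {mean:.1f}")
print(f"variance = {variance:.1f}")
print(f"standard deviation = {std_dev:.2f}")
```

Replacing the list with Hank Aaron's 23 seasons gives the "your turn" calculation by the same steps.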
Ex5 Investing 101
Investors should think statistically (or not?). You can assess an investment
by thinking about the distribution of (say) yearly return. Risk (or variability):
Treasury bonds are riskier than Treasury bills. Stocks are even riskier (and you know why, right?)
Figure 12.7 Stemplot of the yearly returns
on common stocks for the 50 years 1950
to 1999, for Example 5. The returns are
rounded to the nearest whole percent. The
stems are 10s of percents and the leaves are
single percents
Ex6 Mean vs Median
Figure 12.8 Stemplot of the salaries (in millions of dollars)
of Los Angeles Lakers players, with median M = 2.7 and
mean x̄ = 5.5.
The distribution is skewed to the right and there are
three outliers. If we drop the outliers, the mean for
the other 10 players is only x̄ = 2.5 and the median
decreases to M = 2.2. For instance, moving the highest
salary from 19.5 to 195 would not change the median,
but the mean would increase considerably.
TRY Textbook Online / Quizzes / Statistical Applets
(http://bcs.whfreeman.com/scc7e)
Choosing a summary
The mean and standard deviation are strongly affected by outliers or
by the long tail of a skewed distribution
The median and quartiles are less affected. If the distribution is exactly
symmetric, then the mean x̄ and the median M are exactly equal
The five-number summary (and its graph, a boxplot) is usually better
than the mean and standard deviation for describing a skewed
distribution or a distribution with outliers
Use x̄ and s only for reasonably symmetric distributions that are free
of outliers
The variance is the square of the standard deviation s
** Read Case Study Evaluated **
Exercise Ch12
12.30 Mean x̄ and standard deviation s are not enough. The mean x̄
and standard deviation s measure center and spread but are not a
complete description of a distribution. Data sets with different shapes can
have the same mean and standard deviation. To demonstrate this fact, use
your calculator to find x̄ and s for these two small data sets. Then make a
stemplot of each and comment on the shape of each distribution.
Data A: 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
Data B: 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50
Exercise (answer) Ch12
**Answers
Both sets of data have the same mean and standard deviation (x̄ = 7.50
and s = 2.03). However, the two distributions are quite different: Set A is
left-skewed, while set B is roughly uniform with a high outlier.
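The claim that both data sets share the same mean and standard deviation can be checked with Python's standard `statistics` module. The assignment of the 22 listed values to Data A and Data B is an assumption here (values alternate between the two sets), chosen because it reproduces the two stemplots in this answer.

```python
import statistics

# ASSUMPTION: the values of Exercise 12.30 alternate A, B, A, B, ...
data_a = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
data_b = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

for name, data in (("Data A", data_a), ("Data B", data_b)):
    mean = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation (divides by n - 1)
    print(f"{name}: mean = {mean:.2f}, s = {s:.2f}")
```

The two summaries match to two decimals, yet only a graph (here a stemplot) reveals how different the shapes are.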
– Data A –        – Data B –
 3 | 1             3 |
 4 | 7             4 |
 5 |               5 | 257
 6 | 1             6 | 58
 7 | 2             7 | 079
 8 | 1177          8 | 48
 9 | 112           9 |
10 |              10 |
11 |              11 |
12 |              12 | 5
Check Textbook Portal: Statistical Applets . . . One Variable Statistical Calculator
Multiple choice Ch12
Here are boxplots of the number of
calories in 20 brands of beef hot
dogs, 17 brands of meat hot dogs,
and 17 brands of poultry hot dogs
1. The main advantage of boxplots over stemplots and histograms is: (a) boxplots
make it easy to compare several distributions, as in this example. (b) boxplots
show more detail about the shape of the distribution. (c) boxplots use the
five-number summary, whereas stemplots and histograms use the mean and
standard deviation. (d) boxplots show skewed distributions, whereas stemplots and
histograms show only symmetric distributions.
Answer: (a)
2. This plot shows that: (a) all poultry hot dogs have fewer calories than the median
for beef and meat hot dogs. (b) about half of poultry hot dog brands have fewer
calories than the median for beef and meat hot dogs. (c) hot dog type is not
helpful in predicting calories, because some hot dogs of each type are high and
some of each type are low. (d) most poultry hot dog brands have fewer calories
than most beef and meat hot dogs, but a few poultry hot dogs have more calories
than the median beef and meat hot dog.
Answer: (d)
Chapter 13
Ch13 - Normal Distributions
Thought Questions. . .
Birth weights of babies born in the United States follow, at least
approximately, a bell-shaped curve. What does that mean?
What does it mean if a person’s SAT score falls at the 20th percentile
for all people who took the test?
Many measurements in nature tend to follow a similar pattern. The
pattern is that most of the individual measurements take on values
that are near the average, with fewer and fewer measurements taking
on values that are farther from the average in either direction.
Describe what shape the distribution of such measurements would
have
Frequency Histogram
How to compare these two graphs?
Figure 3.1 Draw 1000 SRSs of size 100 from the same population
Figure 3.2 Draw 1000 SRSs of size 2527 from the same population as in Figure 3.1.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . [relative frequency!]
Histogram and. . .
Figure 13.1 A histogram and a computer-drawn curve. Both picture the distribution of
the number of engineering doctorates earned by members of minority groups at 152
universities. This distribution is skewed to the right
. . . Computer-drawn curves
Figure 13.2 A histogram and a computer-drawn curve. Both picture the distribution of
the sample proportion in 1000 simple random samples from the same population. This
distribution is quite symmetric. Almost a normal curve!
Analyzing Data
Density Curves versus (relative frequency) Histograms
Always plot your data: make a graph, usually a histogram or a
stemplot
Look for the overall pattern (shape, center, spread) and for striking
deviations such as outliers
Choose either the five-number summary or the mean and standard
deviation to briefly describe center and spread in numbers
Sometimes the overall pattern of a large number of observations is so
regular that we can describe it by a smooth curve
Bell-Shaped Curve? Asymmetric Distributions? Normal
Distribution?
JLM (WSU)
Ex1 Density Curves
STA 1020
Ch13 - Normal Distributions
64 / 114
Center and Spread
Figure 13.5 A perfectly symmetric Normal curve (distribution of sample proportions
Figure 13.4 A histogram and a Normal Density Curve, for Example 1. (a) The area of
the shaded bars in the histogram represents observations greater than 0.51. These make
up 171 of the 1000 observations. (b) The shaded area under the Normal curve
represents the proportion of observations greater than 0.51. This area is 0.1667
Figure 13.6 The mean of a density curve is the point at which it would balance
Median and Mean
The median of a density curve is the equal-areas point, the point
that divides the area under the curve in half
The mean of a density curve is the balance point, or center of
gravity, at which the curve would balance if made of solid material
The median and mean are the same for a symmetric density curve.
They both lie at the center of the curve. The mean of a skewed curve
is pulled away from the median in the direction of the long tail
Both median and mean are measures of the central tendency of
distributions
Distributions (density curves): the five-number summary helps to
understand the shape, while the standard deviation measures spread
Recall that histograms are made from a table of either a frequency
distribution or a relative frequency distribution
Normal Distributions
Figure 13.7 Two Normal curves. The standard deviation fixes the spread of a Normal curve
Normal Density Curves
The normal curves are symmetric, bell-shaped curves that have these
properties:
A specific (theoretical) normal curve is completely described by giving
its mean µ (mu) and its standard deviation σ (sigma); the density is
given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where exp means “exponential”, i.e.,
$$\exp(x) = e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots$$
The mean determines the center of the distribution. It is located at
the center of symmetry of the curve
The standard deviation determines the shape of the curve. It is the
distance from the mean to the change-of-curvature points on either
side
Usually, x̄ and s are used for the sample mean and standard
deviation, while µ and σ denote the population mean and standard
deviation
The 68-95-99.7 rule
If the observations follow (approximately) a normal distribution then,
approximately,
68% of the observations fall within one standard deviation of the
mean
95% of the observations fall within two standard deviations of the
mean
99.7% of the observations fall within three standard deviations of the
mean
This is known as the “68-95-99.7 rule” (or the Empirical Rule) for the
normal distribution.
The 68-95-99.7 rule (cont)
Figure 13.8 The 68-95-99.7 rule for Normal distributions
Ex2 The 68-95-99.7 rule
Figure 13.9: If the height of women aged 18 to 24 is approximately normal with mean 65
inches and standard deviation 2.5 inches then, the rule says about women’s height that
Now it’s your turn: Heights of young men
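The 68-95-99.7 rule can be checked numerically. The sketch below is only an illustration, not part of the course material: the helper name empirical_rule is made up here, and the women's mean (65 in) and standard deviation (2.5 in) are the values quoted in Ex2. It also compares the rounded 68/95/99.7 figures with the areas under a standard Normal curve computed by Python's statistics.NormalDist.

```python
from statistics import NormalDist


def empirical_rule(mean, sd):
    """Return the 68-95-99.7 intervals as {percent: (low, high)}.

    Hypothetical helper for illustration only.
    """
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}


# Heights of women aged 18 to 24: mean 65 in, standard deviation 2.5 in (Ex2)
women = empirical_rule(65.0, 2.5)
for pct, (low, high) in women.items():
    print(f"{pct}% of heights fall between {low} and {high} inches")

# Areas under the standard Normal curve within 1, 2 and 3 standard deviations;
# the rule rounds these to 68%, 95% and 99.7%.
z = NormalDist()  # mean 0, standard deviation 1
exact = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]
print([round(100 * p, 2) for p in exact])
```

The same helper can be applied to the young men's heights (mean 70.0 in, standard deviation 2.8 in) for the "Now it's your turn" question.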
Ex3/4 ACT vs SAT scores
* Jennie scored 600 on the SAT Math exam, and Gerald scored 21 on the ACT math
part. SAT and ACT scores are approximately normal with means 500 and 18, and
standard deviations 100 and 6, respectively. Who did better?
The standard score for an observation x is z = (x − x̄)/s,
and measures the relative standing of a measurement in a data set
(check Wikipedia http://en.wikipedia.org/wiki/Standard_score)
* Jennie’s standard score is (600 − 500)/100 = 1.0 and Gerald’s is (21 − 18)/6 = 0.5.
The larger standard score means the better relative standing, so Jennie did better.
The c-th percentile of a distribution is a value such that c percent
of the observations lie below it and the rest lie above, i.e., F −1 (c),
where F (z) is the area under the curve up to z • Table B
* Jennie 84.13 and Gerald 69.15
• Table B w/2 digits
Ex5 Reverse Search
In any Table B, we can find the percentile c for a given standard score z or, in
reverse, we can find the standard score z for a given percentile c. [Jennie
z = 1, c = 84.13 ≈ 68/2 + 50 = 84]
For instance, how high must a student score on the SAT to fall in the top 10% of
all scores? That requires a score at or above the 90-th percentile. Table B gives
c = 90.32 for z = 1.3 and c = 88.49 for z = 1.2, so we take
z = 1.3. This yields x = x̄ + z s = 500 + (1.3)(100) = 630. • [“symmetry”]
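The table lookups in Ex3/4 and Ex5 can also be reproduced in software. The following is a minimal sketch, assuming Python's statistics.NormalDist stands in for Table B; the function name standard_score is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist


def standard_score(x, mean, sd):
    """z = (x - mean) / sd, the standard score of observation x."""
    return (x - mean) / sd


# Ex3/4: Jennie (SAT, mean 500, sd 100) and Gerald (ACT, mean 18, sd 6)
z_jennie = standard_score(600, 500, 100)
z_gerald = standard_score(21, 18, 6)

std_normal = NormalDist()  # standard Normal curve, plays the role of Table B
pct_jennie = 100 * std_normal.cdf(z_jennie)  # percentile of z = 1.0
pct_gerald = 100 * std_normal.cdf(z_gerald)  # percentile of z = 0.5

# Ex5, reverse search: standard score at the 90th percentile, then the SAT score
z_top10 = std_normal.inv_cdf(0.90)
sat_cutoff = 500 + z_top10 * 100

print(z_jennie, z_gerald)
print(round(pct_jennie, 2), round(pct_gerald, 2))
print(round(z_top10, 2), round(sat_cutoff))
```

The software value of z at the 90th percentile lies between the table entries 1.2 and 1.3, so the cutoff it gives is slightly below the 630 obtained on the slide by rounding up to z = 1.3.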
Another Example
Health and Nutrition Examination Study of 1976-1980 (HANES).
From Data: Heights of adults, ages 18-24
women: mean 65.0 in & standard deviation 2.5 in
men: mean 70.0 in & standard deviation 2.8 in
.......................................................................
Empirical Rule (68-95-99.7)
women 68% are between 62.5 and 67.5 inches
women 95% are between 60.0 and 70.0 inches
women 99.7% are between 57.5 and 72.5 inches
men 68% are between 67.2 and 72.8 inches
men 95% are between 64.4 and 75.6 inches
men 99.7% are between 61.6 and 78.4 inches
[Figures: two Normal curves, horizontal axes in standard scores from -4 to +4]
• Another Table B
More questions
What proportion of men are less
than 72.8 inches tall?
By the rule: (100 − 68)/2 + 68 or 50 + 68/2 = 84.
Ans: 84% or 84.13% (Table B)
What proportion of men are less
than 68 inches tall?
Observation x = 68, standard score z =
(68 − 70.0)/2.8 = −0.71. In Table B, we
find c = 24.20 for z = −0.7.
Ans: 24% or 23.87% (2-digit table)
Table B http://www.math.wayne.edu/˜menaldi/teach/others/Sta1020/table-percentile.pdf
Two-digits Table . . . /p-values-table.pdf and . . . /p-values-table-alt.pdf
or even this Table with comments . . . /p-values-table-triola.pdf
** Now it’s your turn. Read Case Study Evaluated **
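As a cross-check on the two questions about men's heights, here is a short sketch that uses statistics.NormalDist in place of Table B. The mean 70.0 in and standard deviation 2.8 in come from the HANES summary above; the variable names are illustrative only.

```python
from statistics import NormalDist

# Heights of men aged 18-24 (HANES): mean 70.0 in, standard deviation 2.8 in
men = NormalDist(mu=70.0, sigma=2.8)

# 72.8 in is one standard deviation above the mean (z = 1)
p_below_72_8 = men.cdf(72.8)

# 68 in is below the mean: z = (68 - 70.0)/2.8
z_68 = (68.0 - 70.0) / 2.8
p_below_68 = men.cdf(68.0)

print(round(100 * p_below_72_8, 1), "percent are shorter than 72.8 in")
print(round(z_68, 2), round(100 * p_below_68, 1))
```

Software uses the unrounded standard score, so its answer for 68 inches can differ in the last digit or two from a table lookup made with z rounded to −0.7 or −0.71.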
Exercise Ch13
13.6 Random numbers. If you ask a computer to generate “random
numbers” between 0 and 1, you will get observations from a uniform
distribution.
Figure 13.12 shows the density curve for a uniform distribution. This curve
takes the constant value 1 between 0 and 1 and is zero outside that range.
Use this density curve to answer these questions.
(a) Why is the total area under the curve equal to 1?
(b) The curve is symmetric. What is the value of the mean and median?
(c) What percentage of the observations lie between 0 and 0.1?
(d) What percentage of the observations lie between 0.6 and 0.9?
Exercise (answer) Ch13
**Answers
(a) The curve forms a 1 × 1 square, which has area 1. (b) The mean and
median are both 0.5. (c) 10% (the region is a rectangle with height 1 and
base width 0.1; hence the area is 0.1). (d) 30% (the region is a rectangle
with height 1 and base width 0.9 − 0.6 = 0.3).
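The rectangle-area reasoning in Exercise 13.6 is simple enough to write out as code. The sketch below is an illustration under the exercise's assumption of a uniform density on [0, 1]; the function name uniform_area is hypothetical.

```python
def uniform_area(a, b, low=0.0, high=1.0):
    """Area under the uniform density on [low, high] between a and b.

    The density is a rectangle of height 1/(high - low), so the area is
    (width of the overlap with [low, high]) times that height.
    """
    height = 1.0 / (high - low)
    left, right = max(a, low), min(b, high)
    return max(0.0, right - left) * height


total = uniform_area(0.0, 1.0)    # part (a): the whole 1 x 1 square
part_c = uniform_area(0.0, 0.1)   # part (c): between 0 and 0.1
part_d = uniform_area(0.6, 0.9)   # part (d): between 0.6 and 0.9

print(total, round(100 * part_c), round(100 * part_d))
```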
Multiple choice Ch13
Suppose that the Blood Alcohol Content (BAC) of students who drink five
beers varies from student to student according to a normal distribution
with mean 0.07 and standard deviation 0.01.
1
The middle 95% of students who drink five beers have BAC between:
(a) 0.06 and 0.08. (b) 0.05 and 0.09.
(c) 0.04 and 0.10. (d) 0.03 and 0.11.
Answer: (b)
2
What percent of students who drink five beers have BAC above 0.08
(the legal limit for driving in most states)? (a) 0.15%. (b) 2.5%.
(c) 5%. (d) 16%. (e) 32%.
Answer: (d)
3
What percent of students who drink five beers have BAC above 0.10
(the legal limit for driving in other states)? (a) 0.15%. (b) 2.5%.
(c) 5%. (d) 1.5%. (e) 32%.
Answer: (a)
Ch14 - Describing Relationships: Scatterplots and Correlation
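The three answers follow from the 68-95-99.7 rule, and they can be confirmed with a Normal model. This is a sketch only, using statistics.NormalDist with the stated mean 0.07 and standard deviation 0.01; variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 0.07, 0.01
bac = NormalDist(mu=mean, sigma=sd)

# Question 1: the middle 95% lies within two standard deviations of the mean
middle_95 = (mean - 2 * sd, mean + 2 * sd)

# Question 2: 0.08 is one standard deviation above the mean,
# so about (100 - 68)/2 = 16% lie above it
p_above_008 = 1 - bac.cdf(0.08)

# Question 3: 0.10 is three standard deviations above the mean,
# so about (100 - 99.7)/2 = 0.15% lie above it
p_above_010 = 1 - bac.cdf(0.10)

print(middle_95)
print(round(100 * p_above_008, 1), round(100 * p_above_010, 2))
```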
Chapter 14
Bivariate Data
These are the values of two different variables that are obtained from the
same population
Both variables are qualitative (attribute)
Both variables are quantitative (numerical)
One variable is qualitative and the other is quantitative
Two quantitative variables are seen as ordered pairs, sometimes called
explanatory (or input, or independent) variable and response (or output,
or dependent) variable.
.......................................................................
Example 1: Hubble’s law and the Big Bang. Investigate the relationship
between “distance from the earth” and “recession velocity” (moving away
from the observer)
Key evidence for the idea of the expanding universe, and rewinding, the
“Big Bang” appears!
Scatterplots
A Scatterplot shows the relationship between two quantitative variables
measured on the same individuals. The values of one variable appear on the
horizontal axis, and the values of the other variable appear on the vertical axis.
Each individual in the data appears as the point in the plot fixed by the values of
both variables for that individual.
Figure 14.2 Scatterplot of recession velocity against distance from the earth.
Always plot the explanatory variable on the horizontal (x) axis of the scatterplot
Ex2 Health and Wealth
Data from the World Bank. The explanatory variable is the GDP per person, the
response variable is the life expectancy at birth. Three African nations are outliers.
Figure 14.3 Scatterplot of the life expectancy of people in many nations
against each nation’s gross domestic product per person.
The overall pattern does show that people in richer countries live
longer, but life expectancy tends to rise very quickly as GDP increases,
then levels off.
Examining a scatterplot
Look for
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a scatterplot by the form,
direction, and strength of the relationship.
An important kind of deviation is an outlier, an individual value that
falls outside the overall pattern of the relationship.
Two variables are positively associated when above-average values of
one tend to accompany above-average values of the other, and
below-average values also tend to occur together; the scatterplot
slopes upward as we move from left to right. They are negatively
associated when above-average values of one tend to accompany
below-average values of the other; the scatterplot slopes downward
Ex3 Classifying fossils
Archaeopteryx fossils
Length in centimeters of . . .
Femur    38  56  59  64  74
Humerus  41  63  70  72  84
Figure 14.5 Scatterplot of the lengths of two
bones in 5 fossil specimens of the extinct beast
Archaeopteryx, for Example 3.
The plot shows a strong, positive, straight-line
association
Actually, six archaeopteryx fossil specimens
are known, but the humerus of the last fossil is missing. To continue . . .
Recall x̄ and s
If x1 , . . . , xn are the observed numerical values then
Mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Variance
$$s^2 = \frac{(x_1-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2$$
Standard Deviation
$$s = \sqrt{\frac{(x_1-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2}$$
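The formulas for the mean, variance and standard deviation translate directly into code. The sketch below applies them to the five femur and humerus lengths from Ex3; the function names are illustrative, and the n − 1 divisor matches the sample formulas recalled above.

```python
from math import sqrt


def mean(xs):
    """Sample mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def std_dev(xs):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(variance(xs))


# Archaeopteryx bone lengths in centimeters (Ex3)
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

print(round(mean(femur), 1), round(std_dev(femur), 2))
print(round(mean(humerus), 1), round(std_dev(humerus), 2))
```

These are the values x̄, sx, ȳ and sy that the correlation calculation for the same fossils starts from.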
Ex4 Linear correlation
The correlation describes the direction and strength of a straight-line
relationship between two quantitative variables. The (coefficient of linear)
correlation is usually written as r ,
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right),$$
where the roles of the explanatory variable x and response variable y are symmetric and unit
independent (i.e., r does not change if x and y are exchanged, or the unit is changed)
Note that r is always a number between −1 and 1.
* Correlation does not describe curved relationships between variables, no matter how
strong they are.
* Like the mean and the standard deviation, the correlation is strongly affected by a few
outlying observations
Ex4 Calculating correlation
Archaeopteryx fossils (cont.)
i =       1   2   3   4   5
Femur    38  56  59  64  74
Humerus  41  63  70  72  84
1
First we calculate the mean and the standard deviation for
explanatory variable x and response variable y , i.e., x̄ = 58.2,
sx = 13.20, ȳ = 66.0, sy = 15.89
2
Next the standard scores z = (x − x̄)/sx and z = (y − ȳ )/sy for each
observation, i.e., for i = 1 we have (38 − 58.2)/13.20 = −1.530 and
(41 − 66.0)/15.89 = −1.573, and we continue up to the last one i = 5
to get (74 − 58.2)/13.20 = 1.197 and (84 − 66.0)/15.89 = 1.133
3
Finally we add all, i.e., n = 5 and
r = [(−1.530)(−1.573) + · · · + (1.197)(1.133)]/4 = 0.994
Other Scatterplots
Figure 14.7 Patterns closer to a straight line have correlations closer to 1 or -1.
Figure 14.8 Moving one point reduces the correlation
from r = 0.994 to r = 0.640.
Relationships
Statistical versus Deterministic Relationships
(Distance) = (Time) × (Speed)
(Income) ≈ a + b × Assets
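Returning to the Ex4 calculation for the Archaeopteryx fossils, the three steps (means and standard deviations, standard scores, average of the products with divisor n − 1) can be sketched as follows. The function name correlation is an illustrative choice; statistics.mean and statistics.stdev are the standard-library sample mean and sample standard deviation.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standard scores, divided by n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)      # step 1: means
    sx, sy = stdev(xs), stdev(ys)    # step 1: sample standard deviations
    products = [((x - mx) / sx) * ((y - my) / sy)   # step 2: standard scores
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)   # step 3: add and divide by n - 1


# Archaeopteryx bone lengths in centimeters
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

r = correlation(femur, humerus)
print(round(r, 3))

# r is symmetric in x and y, and does not depend on the unit of measurement
r_swapped = correlation(humerus, femur)
r_millimeters = correlation([10 * x for x in femur], humerus)
```

The last two lines illustrate the remark that r does not change when x and y are exchanged or when the unit is changed (here, femur lengths converted to millimeters).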
Statistical Significance
Assume you are doing a study . . . and you find that . . .
A strong relationship seen in the sample may indicate a strong
relationship in the population
The sample may exhibit a strong relationship simply by chance and
the relationship in the population is not strong or is zero.
The observed relationship is considered to be statistically significant
if it is stronger than a large proportion of the relationships we could
expect to see just by chance
“Statistical significance” does not imply the relationship is strong
enough to be considered “practically important”
Even a weak relationship may be labeled statistically significant if
the sample size is very large, and a strong relationship may fail to be
labeled statistically significant if the sample size is very small
Thought Questions. . .
For all cars manufactured in the U.S., there is a positive correlation
between the size of the engine and horsepower
There is a negative correlation between the size of the engine and gas
mileage. Is this what you expected?
What does it mean for two variables to have a positive correlation or
a negative correlation?
Do you expect a correlation between quality and price? Outliers?
Thought Questions. . .
What type of correlation would the following pairs of variables have:
positive, negative, or none?
1
Temperature during the summer and electricity bills
2
Temperature during the winter and heating costs
3
Number of years of education and height
4
Frequency of brushing and number of cavities
5
Number of churches and number of bars in cities in your state
6
Height of husband and height of wife
** Now it’s your turn. Read Case Study Evaluated **
Exercise Ch14
14.8 & 14.10 Calories and salt in hot dogs.
(14.8) Figure 14.11 shows the calories
and sodium content in 17 brands of meat
hot dogs. Describe the overall pattern
of these data. In what way is the point
marked A unusual?
(14.10) Is the correlation r for the data
in Figure 14.11 near -1, clearly negative
but not near -1, near 0, clearly positive
but not near 1, or near 1? Explain your
answer.
14.12 Outliers and correlation. Figure 14.10 contains outliers marked A, B,
and C. In Figure 14.11 the point marked A is an outlier. Removing the outliers
will increase the correlation r in one figure and decrease r in the other figure.
What happens in each figure, and why?
Exercise Ch14 (cont.)
14.12 Outliers and correlation.
Figure 14.10
Figure 14.11
Q: Figure 14.10 contains outliers marked A, B, and C. In Figure 14.11 the point
marked A is an outlier. Removing the outliers will increase the correlation r in one
figure and decrease r in the other figure. What happens in each figure, and why?
Exercise (answer) Ch14
**Answers
(14.8) The association is roughly linear and positive (high calories tend to
go with high sodium, and low tends to go with low). Point A is a hot dog
brand which is well below average in both calories and sodium.
(14.10) This shows a fairly strong positive association, so r should be
reasonably close to 1.
Note: In fact, r = 0.863. In this case, point A makes the correlation
higher, because its presence makes the scatterplot appear more linear.
(With point A removed, the correlation drops slightly to 0.834.)
(14.12) The correlation increases when A, B, and C are removed from
Figure 14.10, because their presence makes the plot look less linear. The
correlation decreases when A is removed from Figure 14.11, because that
plot looks more linear with A. (That is, if we drew a line through that
scatterplot, there is less relative scatter about that line with point A
than without.)
Multiple choice Ch14
The stock market did well during the 1990s. Here are the percent total returns
(change in price plus dividends paid) for the Standard & Poor’s 500 stock index:
Year    1990  1991  1992  1993  1994  1995  1996  1997  1998  1999
Return  -3.1  30.5   7.6  10.1   1.3  37.6  23.0  33.4  28.6  21.0
1
The correlation of U.S. stock returns with overseas stock returns during
these years was about r = 0.4. This tells you that: (a) when U.S. stocks
rose, overseas stocks also tended to rise, but the connection was not very
strong. (b) when U.S. stocks rose, overseas stocks rose by almost exactly
the same amount. (c) when U.S. stocks rose, overseas stocks tended to fall,
but the connection was not very strong. (d) nothing, because this is not a
possible value of r.
Answer: (a)
2
Stock returns are measured in percent. What are the units of the mean, the
median, the quartiles, the standard deviation, and the correlation between
U.S. and overseas returns? (a) all are measured in percent. (b) all are
measured in percent except the standard deviation, which is measured in
squared percent. (c) all are measured in percent except the correlation,
which is a number that has no units. (d) all are measured in percent except
the correlation, which is measured in squared percent.
Answer: (c)
Ch15 - Relationships: Regression, Prediction and Causation
Chapter 15
Regression
A regression line is a straight line that describes how a response
variable y changes as an explanatory variable x changes. We often
use a regression line to predict the value of y for a given value of x.
The least-squares regression line of y on x is the line that makes
the sum of the squares of the vertical distances of the data points
from the line as small as possible.
With the help of calculus, we obtain the equation of the least-squares
regression line, namely, y = a + bx, where the slope b = r sy /sx and
the intercept a = ȳ − bx̄
Usually, with the help of a computer (or calculator) we find the means
x̄ and ȳ , the standard deviations sx and sy , the correlation coefficient
r , and a, b.
Ex1 & 3 Regression Equation
Archaeopteryx (cont.)
i =       1   2   3   4   5   6
Femur    38  56  59  64  74  50
Humerus  41  63  70  72  84   ?
(humerus) = (−3.66) + (1.197) × (femur),
(−3.66) + (1.197)(50) = 56.2.
Ans: 56.2 cm
Understanding Prediction
Prediction is based on fitting some “model” to a set of data (prophecy?); it works best
when the model fits the data closely, and predicting outside the range of available data is risky
The square of the correlation r 2 is the proportion of the variation in the values
of y that is explained by the least-squares regression of y on x
Ex 5 Using r 2 : For Ex 1 (5 fossils) we have
r = 0.994 so r 2 = (0.994)2 = 0.988, i.e., only
1.2% of the variation of y is not explained by
the variation of x
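The regression equation for the fossils can be reproduced from the formulas b = r sy /sx and a = ȳ − bx̄. The sketch below is one way to do it with the standard library; the function name least_squares is an illustrative choice, and the sixth fossil's femur length (50 cm) is the value from the table.

```python
from statistics import mean, stdev


def least_squares(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b x."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)
    b = r * sy / sx      # slope
    a = my - b * mx      # intercept
    return a, b, r


# The five complete Archaeopteryx specimens (lengths in centimeters)
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

a, b, r = least_squares(femur, humerus)
predicted_humerus = a + b * 50   # sixth fossil: femur 50 cm, humerus missing

print(round(a, 2), round(b, 3))
print(round(predicted_humerus, 1), "cm")
print(round(r ** 2, 3), "of the variation in y is explained by the line")
```

The femur length 50 cm lies inside the observed range 38 to 74 cm, so this prediction does not involve extrapolation.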
Figure 15.2 A weaker straight-line pattern. The
data are the percentage in each state who voted
Democratic in the two Reagan presidential elections.
r = 0.704, r 2 = 0.498, i.e., a 50.2% not explained!
Ex6 Causation
Ex 6: Does TV extend life? Measure the
number of TV sets per person x and the life
expectancy y for the world’s nations. There is
a high positive correlation: nations with many
TV sets have higher life expectancies.
A lurking variable (national wealth)
Read Ex 7 Obesity in mothers and daughters
Statistics and causation
A strong relationship between two variables does not always mean
that changes in one variable cause changes in the other.
The relationship between two variables is often influenced by other
variables lurking in the background.
The best evidence for causation comes from randomized comparative
experiments.
Causation
Figure 15.5 Some explanations for an observed association. A dashed line shows an association.
An arrow shows a cause-and-effect link. Variable x is explanatory, y is a response variable, and z
is a lurking variable.
The observed relationship between two variables may be due to direct
causation, common response, or confounding. Two or more of these
factors may be present together.
An observed relationship can, however, be used for prediction without
worrying about causation as long as the patterns found in past data
continue to hold true.
Ex8 SAT scores. . .
High scores “x” on the SAT exams in high school certainly do not cause
high grades “y ” in college. A moderate association (say r 2 about 27%) is
no doubt explained by a common response to lurking variables “z” such as
academic ability, study habits and staying sober.
* Prediction does not require causation.
Evidence of causation: strong association (i.e., association between
smoking and lung cancer is very strong), consistent (many studies of
different kinds of people in many countries link smoking to lung cancer),
higher doses yield stronger response (people who smoke more cigarettes
per day or who smoke over a longer period get lung cancer more often),
alleged cause is plausible (experiments with animals show that tars from
cigarette smoke do cause cancer), etc.
** Now it’s your turn. Read Case Study Evaluated **
Thought Questions. . .
From past natural disasters, a strong positive correlation has been
found between the amount of aid sent and the number of deaths.
Would you interpret this to mean that sending more aid causes more
people to die? Explain.
From a long-term study on several families, researchers constructed a
scatterplot of the cholesterol level of a child at age 50 versus the
cholesterol level of the father at age 50. You know the cholesterol
level of your best friend’s father at age 50. How could you use this
scatterplot to predict what your best friend’s cholesterol level will be
at age 50?
Thought Questions (cont). . .
Studies have shown a negative correlation between the amount of
food consumed that is rich in beta carotene and the incidence of lung
cancer in adults. Does this correlation provide evidence that beta
carotene is a contributing factor in the prevention of lung cancer?
Explain.
A scatterplot of number of bicycles sold versus number of bank
robberies in the United States for each year over the past century
would show a very strong positive correlation. Why would this be
true? Does an increase in one cause an increase in the other?
More Examples
Prediction via Regression Line
A Caution
The regression equation is y = 3.6 + 0.97x (where y is the average
age of all husbands who have wives of age x)
For all women aged 30, we predict the average husband age to be
32.7 years: 3.6 + (0.97)(30) = 32.7
Suppose we know that an individual wife’s age is 30. What would we
predict her husband’s age to be?
The square of the correlation r 2 measures the usefulness of regression
prediction, e.g.,
if r = ±1 or r 2 = 1 then the regression line explains all (100%) of the
variation in y
if r = 0.7 or r 2 = 0.49 then the regression line explains almost half
(50%) of the variation in y
Beware of Extrapolation: Sarah’s height was plotted against her age
(Hand, et al., A Handbook of Small Data Sets, London: Chapman and Hall)
Regression line:
y = 71.95 + 0.383 x
Can you predict her height at age 42 months?
Height at age 42 months?
y = (71.95) + (0.383)(42) = 88 cm.
Can you predict her height at age 30 years (360 months)?
Height at age 30 years? y = (71.95) + (0.383)(360) = 209.8 cm.
She is predicted to be 6’ 10.5” at age 30. [Could be possible?]
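Both regression lines above can be evaluated with a single helper, which makes the danger of extrapolation easy to see. This is a sketch for illustration: the function name predict is hypothetical, the coefficients are the ones quoted on the slides, and the slides do not list the ages at which Sarah was actually measured, so no data range is assumed here.

```python
def predict(intercept, slope, x):
    """Value of the regression line y = intercept + slope * x at x."""
    return intercept + slope * x


# Husband's age from wife's age: y = 3.6 + 0.97 x
husband_age = predict(3.6, 0.97, 30)

# Sarah's height (cm) from her age (months): y = 71.95 + 0.383 x
height_42 = predict(71.95, 0.383, 42)     # plausible for a young child
height_360 = predict(71.95, 0.383, 360)   # age 30 years: far outside childhood

# Convert the extrapolated height to feet and inches (1 in = 2.54 cm)
feet, inches = divmod(height_360 / 2.54, 12)

print(round(husband_age, 1))
print(round(height_42), round(height_360, 1))
print(int(feet), "ft", round(inches, 1), "in")
```

The line keeps adding 0.383 cm per month forever, which is why the prediction at 360 months comes out close to 6 feet 10 inches, an unrealistic height for an adult woman.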
Again. . .
Correlation does not imply causation, and two variables may be related if
Explanatory variable causes change in response variable
Response variable causes change in explanatory variable
Explanatory variable may be a contributing cause, but is not the sole cause
of changes in the response variable
Confounding variables may exist
Both variables may result from a common cause (such as, both
variables changing over time)
The correlation may be merely a coincidence
Imagining Examples. . .
Explanatory causes Response: [pollen count from grasses
(explanatory)] and [percentage of people suffering from allergy
symptoms (response)]; or [amount of food eaten (explanatory)] and
[hunger level (response)]
Response causes Explanatory: [Hotel advertising dollars
(explanatory)] and [occupancy rate (response)] Positive correlation?
(more advertising leads to increased occupancy rate?)
No, lower occupancy leads to more advertising
Explanatory is not Sole Contributor: [Consumption of barbecued
foods (explanatory)] and [Incidence of stomach cancer (response)]
Barbecued foods are known to contain carcinogens, but other lifestyle
choices may also contribute
Imagining Examples. . . (cont)
Confounding Variables: [meditation (explanatory)] and [aging
(measurable aging factor) (response)] General concern for one’s
well being may be confounded with decision to try meditation
Common Response (both variables change due to common cause)
[divorce among men (explanatory)] and [percent abusing alcohol
(response)] Both may result from an unhappy marriage
Both Variables are Changing Over Time: both divorces and suicides
have increased dramatically since 1900. Are divorces
causing suicides or are suicides causing divorces? The population
has increased dramatically since 1900 (causing both to increase)
Better to investigate: Has the rate of divorce or the rate of suicide
changed over time?
Exercise Ch15
15.6 & 15.8 IQ and the school GPA.
15.6. Figure 14.10 (page 302) plots school grade
point average (GPA) against IQ test score for
78 seventh-grade students. There is a roughly
straight-line pattern with quite a bit of scatter. The correlation between these variables is
r = 0.634. What percentage of the observed
variation among the GPAs of these 78 students
is explained by the straight-line relationship between GPA and IQ score? What percentage of
the variation is explained by differences in GPA
among students with similar IQ scores?
15.8. The least-squares line for predicting school GPA from IQ score, based on
the 78 students plotted in Figure 14.10, is GPA = −3.56 + (0.101)(IQ). Explain
in words the meaning of the slope b = 0.101. Then predict the GPA of a student
whose IQ score is 115.
Exercise (answer) Ch15
**Answers
15.6. Of the observed variation among the GPAs of these 78 students, the
percent explained by the straight-line relationship between GPA and IQ
score is r 2 = (0.634)2 = 0.402 = 40.2%. The rest of the variation (59.8%)
is due to differences in GPA among students with similar IQ scores.
15.8. The slope b = 0.101 means that we expect GPA to increase by
about 0.101 points for every one-point increase in IQ (and GPA drops by
about 0.101 for every one-point decrease in IQ). For an IQ of 115, we
predict a GPA of −3.56 + (0.101)(115) = 8.055.
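The arithmetic in the answers to 15.6 and 15.8 can be verified in a few lines. This is only a sketch; the names explained, unexplained and predict_gpa are illustrative, and the numbers r = 0.634, a = −3.56 and b = 0.101 are those given in the exercise.

```python
# Exercise 15.6: share of the variation in GPA explained by the line
r = 0.634
explained = r ** 2            # proportion explained by the regression on IQ
unexplained = 1 - explained   # variation among students with similar IQ scores


# Exercise 15.8: least-squares line GPA = -3.56 + 0.101 * IQ
def predict_gpa(iq):
    """Predicted GPA for a given IQ score, using the line from the exercise."""
    return -3.56 + 0.101 * iq


gpa_115 = predict_gpa(115)
slope_effect = predict_gpa(116) - predict_gpa(115)   # change per IQ point

print(round(100 * explained, 1), round(100 * unexplained, 1))
print(round(gpa_115, 3), round(slope_effect, 3))
```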
Multiple choice Ch15
Consider a large number of countries around the world. There is a positive
correlation between the number of Nintendo games per person x and the average
life expectancy y . Does this mean that we could increase the life expectancy in
Rwanda by shipping Nintendo games to that country?
(a) Yes: the correlation says that as the number of Nintendo games per person
goes up, so does life expectancy. (b) No: if the correlation were negative we
could accept that conclusion, but this correlation is positive. (c) Yes: positive
correlation means that if we increase x, then y will also increase. (d) No: the
positive correlation just shows that richer countries have both more Nintendo
games per person and higher life expectancies.
Answer: (d)
.......................................................................
Suppose that the correlation between the scores of students on Exam 1 and Exam
2 in a statistics class is r = 0.7. One way to interpret r is to say what percent of
the variation in Exam 2 scores can be explained by the straight-line relationship
between Exam 2 scores and Exam 1 scores. This percent is about
(a) 0.49%. (b) 70%. (c) 49%. (d) 30%.
Answer: (c)