STAT 145 (Notes) - Department of Mathematics and Statistics

STAT 145 (Notes)
Al Nosedal
anosedal@unm.edu
Department of Mathematics and Statistics
University of New Mexico
Fall 2013
CHAPTER 1
PICTURING DISTRIBUTIONS WITH GRAPHS.
Definitions
- Statistics is the science of data.
- Individuals are the objects described by a set of data.
  Individuals may be people, but they may also be animals or
  things.
- A variable is any characteristic of an individual. A variable can
  take different values for different individuals.
Descriptive Statistics
Most of the statistical information in newspapers, magazines,
company reports, and other publications consists of data that are
summarized and presented in a form that is easy for the reader to
understand. Such summaries of data, which may be tabular,
graphical, or numerical, are referred to as descriptive statistics.
Statistical Inference
Many situations require information about a large group of
elements. But, because of time, cost, and other considerations,
data can be collected from only a small portion of the group. The
larger group of elements in a particular study is called the
population, and the smaller group is called the sample. As one of
its major contributions, Statistics uses data from a sample to make
estimates and test hypotheses about the characteristics of a
population through a process referred to as statistical inference.
Categorical and Quantitative Variables
- A categorical variable places an individual into one of several
  groups or categories.
- A quantitative variable takes numerical values for which
  arithmetic operations such as adding and averaging make
  sense. The values of a quantitative variable are usually
  recorded in a unit of measurement such as seconds or
  kilograms.
Example. Fuel economy
Here is a small part of a data set that describes the fuel economy
(in miles per gallon) of model year 2010 motor vehicles:

Make and Model        Type        Transmission  Cylinders  City mpg  Highway mpg  Carbon footprint
Aston Martin Vantage  Two-seater  Manual        8          12        19           13.1
Honda Civic           Subcompact  Automatic     4          25        36           6.3
Toyota Prius          Midsize     Automatic     4          51        48           3.7
Chevrolet Impala      Large       Automatic     6          18        29           8.3

The carbon footprint measures a vehicle’s impact on climate
change in tons of carbon dioxide emitted annually.
a) What are the individuals in this data set?
b) For each individual, what variables are given? Which of these
variables are categorical and which are quantitative?
Fuel economy (solution)
a) The individuals are the car makes and models.
b) For each individual, the variables recorded are Vehicle Type
(categorical), Transmission Type (categorical), Number of cylinders
(quantitative), City mpg (quantitative), Highway mpg
(quantitative), and Carbon footprint (tons, quantitative).
Distribution of a variable
The distribution of a variable tells us what values it takes and how
often it takes these values.
The values of a categorical variable are labels for the categories.
The distribution of a categorical variable lists the categories and
gives either the count or the percent of individuals that fall in each
category.
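As a rough illustration (not part of the original notes), the count-and-percent tally that defines the distribution of a categorical variable could be sketched in Python as follows. The function name `categorical_distribution` is hypothetical; the sample values are the Transmission column of the fuel economy example.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Transmission types of the four vehicles in the fuel economy example.
transmissions = ["Manual", "Automatic", "Automatic", "Automatic"]
dist = categorical_distribution(transmissions)

for category, (count, percent) in dist.items():
    print(f"{category}: count = {count}, percent = {percent:.0f}%")
```

Either the counts or the percents alone would describe the distribution; the sketch reports both for convenience.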
Example. Never on Sunday?
Births are not, as you might think, evenly distributed across the
days of the week. Here are the average numbers of babies born on
each day of the week in 2008:

Day        Births
Sunday      7,534
Monday     12,371
Tuesday    13,415
Wednesday  13,171
Thursday   13,147
Friday     12,919
Saturday    8,617

Example. Never on Sunday? (cont.)
Present these data in a well-labeled bar graph. Would it also be
correct to make a pie chart? Suggest some possible reasons why
there are fewer births on weekends.
Solution.
It would be correct to make a pie chart but a pie chart would make
it more difficult to distinguish between the weekend days and the
weekdays. Some births are scheduled (e.g., induced labor), and
probably most are scheduled for weekdays.
Example. Never on Sunday? Bar chart.
[Figure: bar graph of average births by day of the week, Sun through Sat; vertical axis: Births, 0 to 14,000.]
Example. Never on Sunday? Pie chart.
[Figure: pie chart of average births by day of the week, one slice per day.]
Example. What color is your car?
The most popular colors for cars and light trucks vary by region
and over time. In North America white remains the top color
choice, with black the top choice in Europe and silver the top
choice in South America. Here is the distribution of the top colors
for vehicles sold globally in 2010.

Color          Popularity (%)
Silver         26
Black          24
White          16
Gray           16
Red            6
Blue           5
Beige, brown   3
Other colors   ?

What color is your car? (cont.)
a) Fill in the percent of vehicles that are in other colors.
b) Make a graph to display the distribution of color popularity.
Solution
a) Other = 100 − (26 + 24 + 16 + 16 + 6 + 5 + 3) = 100 − 96 = 4,
so 4% of vehicles are in other colors.
Graph
[Figure: bar graph of color popularity (%) for silver, black, white, gray, red, blue, brown, and other; vertical axis: Popularity, 0 to 25.]
Summarizing Quantitative Data
A common graphical representation of quantitative data is a
histogram. This graphical summary can be prepared for data
previously summarized in either a frequency, relative frequency, or
percent frequency distribution. A histogram is constructed by
placing the variable of interest on the horizontal axis and the
frequency, relative frequency, or percent frequency on the vertical
axis.
Example
Consider the following data
14 21 23 21 16 19 22 25 16 16
24 24 25 19 16 19 18 19 21 12
16 17 18 23 25 20 23 16 20 19
24 26 15 22 24 20 22 24 22 20.
a. Develop a frequency distribution using classes of 12-14, 15-17,
18-20, 21-23, and 24-26.
b. Develop a relative frequency distribution and a percent
frequency distribution using the classes in part (a).
c. Make a histogram.
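The tallying asked for in parts (a) and (b) can be sketched in Python as below. This is only an illustrative sketch: the helper name `frequency_table` is an assumption, and the classes are treated as inclusive integer ranges, which matches the integer data of this example.

```python
data = [14, 21, 23, 21, 16, 19, 22, 25, 16, 16,
        24, 24, 25, 19, 16, 19, 18, 19, 21, 12,
        16, 17, 18, 23, 25, 20, 23, 16, 20, 19,
        24, 26, 15, 22, 24, 20, 22, 24, 22, 20]
classes = [(12, 14), (15, 17), (18, 20), (21, 23), (24, 26)]

def frequency_table(values, class_limits):
    """Return rows of (class label, frequency, relative freq., percent freq.)."""
    n = len(values)
    rows = []
    for low, high in class_limits:
        freq = sum(low <= v <= high for v in values)  # count values in the class
        rows.append((f"{low}-{high}", freq, freq / n, 100 * freq / n))
    return rows

table = frequency_table(data, classes)
for label, freq, rel, pct in table:
    print(f"{label:>5}  {freq:>2}  {rel:.3f}  {pct:.1f}%")
```

The relative frequency is the frequency divided by the number of observations, and the percent frequency is the relative frequency multiplied by 100.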
Example (solution)

Class    Frequency  Relative Freq.  Percent Freq.
12-14    2          2/40 = 0.05     5%
15-17    8          8/40 = 0.20     20%
18-20    11         11/40 = 0.275   27.5%
21-23    10         10/40 = 0.25    25%
24-26    9          9/40 = 0.225    22.5%

Modified classes (solution)

Class          Frequency  Relative Freq.  Percent Freq.
12 ≤ x < 15    2          2/40 = 0.05     5%
15 ≤ x < 18    8          8/40 = 0.20     20%
18 ≤ x < 21    11         11/40 = 0.275   27.5%
21 ≤ x < 24    10         10/40 = 0.25    25%
24 ≤ x < 27    9          9/40 = 0.225    22.5%

Histogram of Frequencies
[Figure: histogram of the data using the modified classes, with bar heights 2, 8, 11, 10, and 9; horizontal axis: data; vertical axis: Frequency.]
Symmetric and Skewed Distributions
A distribution is symmetric if the right and left sides of the
histogram are approximately mirror images of each other. A
distribution is skewed to the right if the right side of the histogram
(containing the half of the observations with larger values) extends
much farther out than the left side. It is skewed to the left if the
left side of the histogram extends much farther out than the right
side.
Symmetric Distribution
[Figure: histogram of a symmetric distribution.]
Distribution Skewed to the Right
[Figure: histogram of a distribution skewed to the right.]
Distribution Skewed to the Left
[Figure: histogram of a distribution skewed to the left.]
Examining a histogram
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a histogram by its shape,
center, and spread.
An important kind of deviation is an outlier, an individual value
that falls outside the overall pattern.
Quantitative Variables: Stemplots
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the
final (rightmost) digit, and a leaf, the final digit. Stems may have
as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the
top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing
order out from the stem.
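The three steps above can be sketched in Python as follows. This is a minimal sketch that assumes non-negative whole-number observations; the function name `stemplot` is hypothetical, and the sample data are the 16 values of the stemplot example in these notes.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot as strings of the form 'stem | leaves'."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = final digit, stem = all other digits
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest so that empty stems are kept.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} |" + "".join(f" {d}" for d in leaves.get(stem, []))
        rows.append(row)
    return rows

scores = [70, 72, 75, 64, 58, 83, 80, 82, 76, 75, 68, 65, 57, 78, 85, 72]
rows = stemplot(scores)
print("\n".join(rows))
```

Sorting the observations first guarantees that the leaves on each stem appear in increasing order, as step 3 requires.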
Example: Making a stemplot
Construct stem-and-leaf display (stemplot) for the following data:
70 72 75 64 58 83 80 82 76 75 68 65 57 78 85 72.
Solution
5 | 7 8
6 | 4 5 8
7 | 0 2 2 5 5 6 8
8 | 0 2 3 5
Health care spending.
The table below shows the 2009 health care expenditure per capita
in 35 countries with the highest gross domestic product in 2009.
Health expenditure per capita is the sum of public and private
health expenditure (in international dollars, based on
purchasing-power parity, or PPP) divided by population. Health
expenditures include the provision of health services but
exclude the provision of water and sanitation. Make a stemplot of
the data after rounding to the nearest $100 (so that stems are
thousands of dollars, and leaves are hundreds of dollars). Split the
stems, placing leaves 0 to 4 on the first stem and leaves 5 to 9 on
the second stem of the same value.
Describe the shape, center, and spread of the distribution. Which
country is the high outlier?
Table
Health expenditure per capita (dollars), as reported, after rounding
to the nearest $100, and in units of hundreds of dollars.

Country        Dollars  Nearest $100  Hundreds
Argentina      1387     1400          14
Australia      3382     3400          34
Austria        4243     4200          42
Belgium        4237     4200          42
Brazil         943      900           9
Canada         4196     4200          42
China          308      300           3
Denmark        4118     4100          41
Finland        3357     3400          34
France         3934     3900          39
Germany        4129     4100          41
Greece         3085     3100          31
India          132      100           1
Indonesia      99       100           1
Iran           685      700           7
Italy          3027     3000          30
Japan          2713     2700          27
Korea, South   1829     1800          18
Mexico         862      900           9
Netherlands    4389     4400          44
Norway         5395     5400          54
Poland         1359     1400          14
Portugal       2703     2700          27
Russia         1038     1000          10
Saudi Arabia   1150     1200          12
South Africa   862      900           9
Spain          3152     3200          32
Sweden         3690     3700          37
Switzerland    5072     5100          51
Thailand       345      300           3
Turkey         965      1000          10
U. A. E.       1756     1800          18
U. K.          3399     3400          34
U. S. A.       7410     7400          74
Venezuela      737      700           7

Stemplot
(stems are thousands of dollars, leaves are hundreds of dollars)
0 | 1 1 3 3
0 | 7 7 9 9 9
1 | 0 0 2 4 4
1 | 8 8
2 |
2 | 7 7
3 | 0 1 2 4 4 4
3 | 7 9
4 | 1 1 2 2 2 4
4 |
5 | 1 4
5 |
6 |
6 |
7 | 4
Shape, Center and Spread
This distribution is somewhat right-skewed, with a single high
outlier (U.S.A.). There are two clusters of countries. The center of
this distribution is around 27 ($2700 spent per capita), ignoring
the outlier. The distribution’s spread is from 1 ($100 spent per
capita) to 74 ($7400 spent per capita).
Time Plots
A time plot of a variable plots each observation against the time at
which it was measured. Always put time on the horizontal scale of
your plot and the variable you are measuring on the vertical scale.
Connecting the data points by lines helps emphasize any change
over time.
When you examine a time plot, look once again for an overall
pattern and for strong deviations from the pattern. A common
overall pattern in a time plot is a trend, a long-term upward or
downward movement over time. Some time plots show cycles,
regular up-and-down movements over time.
Example. The cost of college
Below you will find data on the average tuition and fees charged to
in-state students by public four-year colleges and universities for
the 1980 to 2010 academic years. Because almost any variable
measured in dollars increases over time due to inflation (the falling
buying power of a dollar), the values are given in "constant dollars"
adjusted to have the same buying power that a dollar had in 2010.
a) Make a time plot of average tuition and fees.
b) What overall pattern does your plot show?
c) Some possible deviations from the overall pattern are outliers,
periods when charges went down (in 2010 dollars), and periods of
particularly rapid increase. Which are present in your plot, and
during which years?
Table

Year  Tuition ($)    Year  Tuition ($)    Year  Tuition ($)
1980  2119           1991  3373           2002  4961
1981  2163           1992  3622           2003  5507
1982  2305           1993  3827           2004  5900
1983  2505           1994  3974           2005  6128
1984  2572           1995  4019           2006  6218
1985  2665           1996  4131           2007  6480
1986  2815           1997  4226           2008  6532
1987  2845           1998  4338           2009  7137
1988  2903           1999  4397           2010  7605
1989  2972           2000  4426
1990  3190           2001  4626

Time plot
[Figure: Time Plot of Average Tuition and Fees (1980-2010); horizontal axis: Year; vertical axis: Average Tuition ($).]
Answers to b) and c)
b) Tuition has steadily climbed during the 30-year period, with
sharpest absolute increases in the last 10 years.
c) There is a sharp increase from 2000 to 2010.
CHAPTER 2
DESCRIBING DISTRIBUTIONS WITH NUMBERS.
Problem
How much do people with a bachelor's degree (but no higher
degree) earn? Here are the incomes of 15 such people, chosen at
random by the Census Bureau in March 2002 and asked how much
they earned in 2001. Most people reported their incomes to the
nearest thousand dollars, so we have rounded their responses to
thousands of dollars. 110 25 50 50 55 30 35 30 4 32 50 30 32 74
60.
How could we find the "typical" income for people with a
bachelor's degree (but no higher degree)?
Measuring center: the mean
The most common measure of center is the ordinary arithmetic
average, or mean. To find the mean of a set of observations, add
their values and divide by the number of observations. If the n
observations are $x_1, x_2, \ldots, x_n$, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Income Problem
\[ \bar{x} = \frac{110 + 25 + 50 + 50 + 55 + 30 + \cdots + 32 + 74 + 60}{15} = \frac{667}{15} \approx 44.467 \]
Do you think that this number represents the "typical" income for
people with a bachelor's degree (but no higher degree)?
Measuring center: the median
The median M is the midpoint of a distribution, the number such
that half the observations are smaller and the other half are larger.
To find the median of the distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center
observation in the ordered list. Find the location of the median by
counting $(n+1)/2$ observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean
of the two center observations in the ordered list. The location
$(n+1)/2$ then falls halfway between the two center observations.
Income Problem (Median)
We know that if we want to find the median, M, we have to order
our observations from smallest to largest: 4 25 30 30 30 32 32 35
50 50 50 55 60 74 110. Let's find the location of M:
\[ \text{location of } M = \frac{n+1}{2} = \frac{15+1}{2} = 8 \]
Therefore, $M = x_8 = 35$ ($x_8$ = 8th observation on our ordered list).
Measuring center: Mode
Another measure of location is the mode. The mode is defined as
follows. The mode is the value that occurs with greatest frequency.
Note: situations can arise for which the greatest frequency occurs
at two or more different values. In these instances more than one
mode exists.
Income Problem (Mode)
Using the definition of mode, we have that:
mode1 = 30
and
mode2 = 50
Note that both of them have the greatest frequency, 3.
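The three measures of center for the income problem can be checked with Python's standard `statistics` module. This is offered only as a cross-check of the hand calculations; `multimode` requires Python 3.8 or later, and the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Incomes of the 15 people in the income problem, in thousands of dollars.
incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 32, 74, 60]

x_bar = mean(incomes)               # arithmetic average
m = median(incomes)                 # middle value of the ordered list
modes = sorted(multimode(incomes))  # every value with the greatest frequency

print(f"mean = {x_bar:.3f}, median = {m}, modes = {modes}")
```

The mean is pulled above the median here because of the single large income of 110.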
Example: New York travel times.
Here are the travel times in minutes of 20 randomly chosen New
York workers:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45.
Compare the mean and median for these data. What general fact
does your comparison illustrate?
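One way to carry out this comparison is sketched below in Python. The function `median_by_location` is a hypothetical helper that follows the (n+1)/2 location rule from these notes rather than calling a library routine.

```python
def median_by_location(values):
    """Median found by counting (n + 1) / 2 positions up the ordered list."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        location = (n + 1) // 2      # 1-based position of the center observation
        return xs[location - 1]
    # n even: the location (n + 1) / 2 falls between positions n/2 and n/2 + 1.
    lower = xs[n // 2 - 1]
    upper = xs[n // 2]
    return (lower + upper) / 2

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

mean_time = sum(travel_times) / len(travel_times)
median_time = median_by_location(travel_times)
print(f"mean = {mean_time}, median = {median_time}")
```

With these data the mean exceeds the median, which is what one expects when a few long travel times stretch the right tail.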
Solution
Mean:
\[ \bar{x} = \frac{10 + 30 + 5 + \cdots + 60 + 60 + 40 + 45}{20} = \frac{625}{20} = 31.25 \]
Median:
First, we order our data from smallest to largest:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85.
\[ \text{location of } M = \frac{n+1}{2} = \frac{20+1}{2} = 10.5 \]
This means that we have to find the mean of $x_{10}$ and $x_{11}$:
\[ M = \frac{x_{10} + x_{11}}{2} = \frac{20 + 25}{2} = 22.5 \]
Comparing the mean and the median
The mean and median of a symmetric distribution are close
together. In a skewed distribution, the mean is farther out in the
long tail than is the median. Because the mean cannot resist the
influence of extreme observations, we say that it is not a resistant
measure of center.
The quartiles Q1 and Q3
To calculate the quartiles:
Arrange the observations in increasing order and locate the median
M in the ordered list of observations.
The first quartile Q1 is the median of the observations whose
position in the ordered list is to the left of the location of the
overall median.
The third quartile Q3 is the median of the observations whose
position in the ordered list is to the right of the location of the
overall median.
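The rule above can be sketched in Python as follows. Note that statistical software often uses slightly different quartile conventions, so library functions may not reproduce these values; the helper names `median_sorted` and `quartiles` are assumptions made for this sketch.

```python
def median_sorted(xs):
    """Median of a list that is already sorted."""
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2              # number of observations on each side of the median
    lower = xs[:half]          # observations left of the median's location
    upper = xs[n - half:]      # observations right of the median's location
    return median_sorted(lower), median_sorted(xs), median_sorted(upper)

incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 32, 74, 60]
q1, m, q3 = quartiles(incomes)
print(q1, m, q3)
```

When n is odd, the median itself belongs to neither half, which is why both halves of the 15 incomes contain 7 observations.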
Income Problem (Q1 )
Data:
4 25 30 30 30 32 32 35 50 50 50 55 60 74 110.
From previous work, we know that M = x8 = 35.
This implies that the first half of our data has $n_1 = 7$ observations.
Let us find the location of Q1:
\[ \text{location of } Q_1 = \frac{n_1 + 1}{2} = \frac{7+1}{2} = 4 \]
This means that $Q_1 = x_4 = 30$.
Income Problem (Q3 )
Data:
4 25 30 30 30 32 32 35 50 50 50 55 60 74 110.
From previous work, we know that M = x8 = 35.
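This implies that the second half of our data also has $n_2 = 7$
observations: 50 50 50 55 60 74 110.
Let us find the location of Q3 within this second half:
\[ \text{location of } Q_3 = \frac{n_2 + 1}{2} = \frac{7+1}{2} = 4 \]
This means that Q3 is the 4th observation of the second half, so $Q_3 = 55$.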
Five-number summary
The five-number summary of a distribution consists of the smallest
observation, the first quartile, the median, the third quartile, and
the largest observation, written in order from smallest to largest.
In symbols, the five-number summary is
min Q1 M Q3 MAX .
Income Problem (five-number summary)
Data: 4 25 30 30 30 32 32 35 50 50 50 55 60 74 110. The
five-number summary for our income problem is given by:
4 30 35 55 110
Boxplot
A boxplot is a graph of the five-number summary.
A central box spans the quartiles Q1 and Q3 .
A line in the box marks the median M.
Lines extend from the box out to the smallest and largest
observations.
Boxplot for income data
[Figure: boxplot of the income data; axis: Income (thousands of dollars), 0 to 100.]
Pittsburgh Steelers
The 2010 roster of the Pittsburgh Steelers professional football
team included 7 defensive linemen and 9 offensive linemen. The
weights in pounds of the defensive linemen were
305 325 305 300 285 280 298
and the weights of the offensive linemen were
338 324 325 304 344 315 304 319 318
a) Make a stemplot of the weights of the defensive linemen and
find the five-number summary.
b) Make a stemplot of the weights of the offensive linemen and
find the five-number summary.
c) Does either group contain one or more clear outliers? Which
group of players tends to be heavier?
Solution
a) Defensive linemen.
28 | 0 5
29 | 8
30 | 0 5 5
31 |
32 | 5
Solution
b) Offensive linemen.
30 | 4 4
31 | 5 8 9
32 | 4 5
33 | 8
34 | 4
Five-number summary

                      Minimum  Q1     Median  Q3     Maximum
Offensive line (lbs)  304      309.5  319     331.5  344
Defensive line (lbs)  280      285    300     305    325

c) Apparently, neither of these two groups contains outliers. It
seems that the offensive line players are heavier.
Measures of Variability: Range
The simplest measure of variability is the range.
Range= Largest value - smallest value
Range= MAX - min
Measures of Variability: IQR
A measure of variability that overcomes the dependency on
extreme values is the interquartile range (IQR).
IQR = third quartile - first quartile
IQR = Q3 − Q1
1.5 IQR Rule
Identifying suspected outliers. Whether an observation is an outlier
is a matter of judgement: does it appear to clearly stand apart
from the rest of the distribution? When large volumes of data are
scanned automatically, however, we need a rule to pick out
suspected outliers. The most common rule is the 1.5 IQR rule. A
point is a suspected outlier if it lies more than 1.5 IQR below the
first quartile Q1 or above the third quartile Q3 .
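A minimal Python sketch of the 1.5 IQR rule is shown below. The function name `suspected_outliers` is hypothetical, and the quartiles are passed in as arguments (here Q1 = 30 and Q3 = 55, the values found earlier for the income problem) rather than computed inside the function.

```python
def suspected_outliers(values, q1, q3):
    """Return the values lying more than 1.5 IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Income problem, in thousands of dollars.
incomes = [4, 25, 30, 30, 30, 32, 32, 35, 50, 50, 50, 55, 60, 74, 110]
flagged = suspected_outliers(incomes, q1=30, q3=55)
print(flagged)
```

The rule only flags candidates; whether a flagged value is a genuine outlier remains a matter of judgement.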
A high income.
In our income problem, we noted the influence of one high income
of $110,000 among the incomes of a sample of 15 college
graduates. Does the 1.5 IQR rule identify this income as a
suspected outlier?
Solution
Data: 4 25 30 30 30 32 32 35 50 50 50 55 60 74 110.
Q1 and Q3 are given by:
Q1 = 30 and Q3 = 55, so IQR = Q3 − Q1 = 55 − 30 = 25.
Q3 + 1.5 IQR = 55 + 1.5(25) = 92.5
Since 110 > 92.5, we conclude that 110 is a suspected outlier.
Problem
Ebby Halliday Realtors provide advertisements for distinctive
properties and estates located throughout the United States. The
prices listed for 22 distinctive properties and estates are shown
here. Prices are in thousands.
1500 895 719 619 625 4450 2200
1280 700 619 725 739 799 2495
1395 2995 880 3100 1699 1120 1250 912.
a) Provide a five-number summary.
b) The highest priced property, $4,450,000, is listed as an estate
overlooking White Rock Lake in Dallas, Texas. Should this
property be considered an outlier?
Solution
a) min = 619
Q1 = 725
M = 1016
Q3 = 1699
MAX = 4450
b) IQR = 1699 - 725 = 974
Q3 +1.5 IQR = 1699 + 1.5 (974) = 1699 + 1461 = 3160.
Since 4450 > 3160, we conclude that 4450 is an outlier.
Measures of Variability: Variance
The variance $s^2$ of a set of observations is an average of the
squares of the deviations of the observations from their mean. In
symbols, the variance of n observations $x_1, x_2, \ldots, x_n$ is
\[ s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} \]
or, more compactly,
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Measures of Variability: Standard Deviation
The standard deviation s is the square root of the variance $s^2$:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Example
Consider a sample with data values of 10, 20, 12, 17, and 16.
Compute the variance and standard deviation.
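The calculation can be sketched directly from the definitions in Python. The function name `sample_variance` is an assumption made for this sketch; it divides by n − 1, as in the formula for $s^2$ used in these notes.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

sample = [10, 20, 12, 17, 16]
s2 = sample_variance(sample)   # variance
s = math.sqrt(s2)              # standard deviation
print(s2, s)
```

Taking the square root returns the measure of spread to the original units of the data.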
Solution
First, we have to calculate the mean, $\bar{x}$:
\[ \bar{x} = \frac{10+20+12+17+16}{5} = 15. \]
Now, let’s find the variance $s^2$:
\[ s^2 = \frac{(10-15)^2 + (20-15)^2 + (12-15)^2 + (17-15)^2 + (16-15)^2}{5-1} = \frac{64}{4} = 16. \]
Finally, let’s find the standard deviation s:
\[ s = \sqrt{16} = 4. \]
x̄ and s
Radon is a naturally occurring gas and is the second leading cause
of lung cancer in the United States. It comes from the natural
breakdown of uranium in the soil and enters buildings through
cracks and other holes in the foundations. Found throughout the
United States, levels vary considerably from state to state. There
are several methods to reduce the levels of radon in your home, and
the Environmental Protection Agency recommends using one of
these if the measured level in your home is above 4 picocuries per
liter. Four readings from Franklin County, Ohio, where the county
average is 9.32 picocuries per liter, were 5.2, 13.8, 8.6 and 16.8.
a) Find the mean step-by-step.
b) Find the standard deviation step-by-step.
c) Now enter the data into your calculator and use the mean and
standard deviation buttons to obtain x̄ and s. Do the results agree
with your hand calculations?
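In place of a calculator, part (c) could also be checked with Python's standard `statistics` module, as in the sketch below. `stdev` computes the sample standard deviation with the n − 1 divisor, matching the definition of s in these notes.

```python
from statistics import mean, stdev

# Four radon readings from Franklin County, Ohio (picocuries per liter).
readings = [5.2, 13.8, 8.6, 16.8]

x_bar = mean(readings)
s = stdev(readings)   # sample standard deviation (divides by n - 1)
print(f"mean = {x_bar:.1f}, s = {s:.4f}")
```

Small differences in the last decimal place between hand and software results usually come from rounding intermediate values.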
Solution
First, we have to calculate the mean, $\bar{x}$:
\[ \bar{x} = \frac{5.2+13.8+8.6+16.8}{4} = 11.1. \]
Now, let’s find the variance $s^2$:
\[ s^2 = \frac{(5.2-11.1)^2 + (13.8-11.1)^2 + (8.6-11.1)^2 + (16.8-11.1)^2}{4-1} = \frac{80.84}{3} \approx 26.9467 \]
Finally, let’s find the standard deviation s:
\[ s = \sqrt{26.9467} \approx 5.1910. \]
Choosing a summary
The five-number summary is usually better than the mean and
standard deviation for describing a skewed distribution or a
distribution with strong outliers. Use x̄ and s only for reasonably
symmetric distributions that are free of outliers.
CHAPTER 3
THE NORMAL DISTRIBUTIONS.
Simple Example
Random Experiment: Rolling a fair die 300 times.

Class        Expected Frequency  Expected Relative Freq.
1 ≤ x < 2    50                  1/6
2 ≤ x < 3    50                  1/6
3 ≤ x < 4    50                  1/6
4 ≤ x < 5    50                  1/6
5 ≤ x < 6    50                  1/6
6 ≤ x < 7    50                  1/6

Histogram of Expected Frequencies
[Figure: histogram of expected frequencies over the classes from 1 to 7; every bar has height 50.]
Histogram of Expected Relative Frequencies
[Figure: histogram of expected relative frequencies over the classes from 1 to 7; every bar has height 1/6.]
Density Curve
A density curve is a curve that is always on or above the horizontal
axis, and has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The
area under the curve and above any range of values is the
proportion of all observations that fall in that range.
Note. No set of real data is exactly described by a density curve.
The curve is an idealized description that is easy to use and
accurate enough for practical use.
Accidents on a bike path
Examining the location of accidents on a level, 5-mile bike path
shows that they occur uniformly along the length of the path. The
figure below displays the density curve that describes the
distribution of accidents.
a) Explain why this curve satisfies the two requirements for a
density curve.
b) The proportion of accidents that occur in the first mile of the
path is the area under the density curve between 0 miles and 1
mile. What is this area?
c) There is a stream alongside the bike path between the 0.8-mile
mark and the 1.3-mile mark. What proportion of accidents happen
on the bike path alongside the stream?
d) The bike path is a paved path through the woods, and there is
a road at each end. What proportion of accidents happen more
than 1 mile from either road?
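Because the density curve here is a flat line, each proportion is just the area of a rectangle (width times height). A small Python sketch, with the hypothetical helper `proportion_between` and the assumed height 1/5 for a 5-mile path, is:

```python
PATH_LENGTH = 5.0          # miles
HEIGHT = 1 / PATH_LENGTH   # height of the uniform density curve

def proportion_between(a, b):
    """Area under the uniform density curve between mile a and mile b."""
    a = max(0.0, a)              # clip the interval to the path
    b = min(PATH_LENGTH, b)
    return max(0.0, b - a) * HEIGHT

total_area = proportion_between(0, 5)      # part a): should be 1
first_mile = proportion_between(0, 1)      # part b)
stream = proportion_between(0.8, 1.3)      # part c)
away_from_roads = proportion_between(1, 4) # part d)
print(total_area, first_mile, stream, away_from_roads)
```

Printed values may show tiny floating-point rounding; the exact areas are products of the interval widths and 0.20.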
Density Curve
[Figure: uniform density curve of height 0.20 between 0 and 5 miles; horizontal axis: Distance along bike path (miles).]
Solution
a) It is on or above the horizontal axis everywhere, and because it
forms a 1/5 × 5 rectangle, the area beneath the curve is 1.
Solution b)
proportion = 1 x 0.20 = 0.20
[Figure: density curve with the area between 0 and 1 mile shaded.]
Solution c)
proportion = (1.3-0.8) x 0.20 = 0.10
[Figure: density curve with the area between 0.8 and 1.3 miles shaded.]
Solution d)
proportion = (4-1) x 0.20 = 0.60
[Figure: density curve with the area between 1 and 4 miles shaded.]
Normal Distributions
A Normal Distribution is described by a Normal density curve. Any
particular Normal distribution is completely specified by two
numbers, its mean µ and standard deviation σ.
The mean of a Normal distribution is at the center of the
symmetric Normal curve. The standard deviation is the distance
from the center to the change-of-curvature points on either side.
Standard Normal Distribution
[Figure: Normal density curve with mean = 0 and standard deviation = 1, plotted from -3 to 3.]
Two Different Standard Deviations
[Figure: two Normal density curves, one with std. dev. = 2 and one with std. dev. = 5, plotted from -15 to 15.]
Two Different Means
[Figure: two Normal density curves, one with mean = -5 and one with mean = 5, plotted from -15 to 15.]
The 68-95-99.7 rule
In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of the mean µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
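The three percentages can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch; the helper name `proportion_within` is an assumption.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal(mu, sigma) distribution within k sigma of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

proportions = {k: proportion_within(k) for k in (1, 2, 3)}
for k, p in proportions.items():
    print(f"within {k} sigma: {100 * p:.1f}%")
```

The result does not depend on the particular mean and standard deviation, which is why one rule serves every Normal distribution.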
Problem
The national average for the verbal portion of the College Board's
Scholastic Aptitude Test (SAT) is 507. The College Board
periodically rescales the test scores such that the standard
deviation is approximately 100. Answer the following questions
using a bell-shaped distribution and the empirical rule for the
verbal test scores.
a. What percentage of students have an SAT verbal score greater
than 607?
b. What percentage of students have an SAT verbal score greater
than 707?
c. What percentage of students have an SAT verbal score between
407 and 507?
d. What percentage of students have an SAT verbal score between
307 and 707?
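Solution a)
[Figure: Normal curve of SAT scores (200 to 800) with the central 68% and the two 16% tails marked.]
607 is one standard deviation above the mean, so about 16% of
students score above 607.
Solution b)
[Figure: Normal curve of SAT scores with the central 95% and the two 2.5% tails marked.]
707 is two standard deviations above the mean, so about 2.5% of
students score above 707.
Solution c)
[Figure: Normal curve of SAT scores with areas 16%, 34%, 34%, and 16% marked.]
About 34% of students score between 407 and 507.
Solution d)
[Figure: Normal curve of SAT scores with the central 95% and the two 2.5% tails marked.]
About 95% of students score between 307 and 707.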
Fruit flies
The common fruit fly Drosophila melanogaster is the most studied
organism in genetic research because it is small, easy to grow, and
reproduces rapidly. The length of the thorax (where the wings and
legs attach) in a population of male fruit flies is approximately
Normal with mean 0.800 millimeters (mm) and standard deviation
0.078 mm. Draw a Normal curve on which this mean and standard
deviation are correctly located.
Solution
[Figure: Normal curve for thorax length centered at 0.800, with the points 0.800-0.078 and 0.800+0.078 marked; horizontal axis: Thorax length, 0.5 to 1.1.]
Fruit flies
The length of the thorax in a population of male fruit flies is
approximately Normal with mean 0.800 mm and standard
deviation 0.078 mm. Use the 68-95-99.7 rule to answer the
following questions.
a) What range of lengths covers almost all (99.7%) of this
distribution?
b) What percent of male fruit flies have a thorax length exceeding
0.878 mm?
Solution a) Between 0.566 mm and 1.034 mm
0.800 - 3(0.078) = 0.566 and 0.800 + 3(0.078) = 1.034
[Figure: Normal curve for thorax length with the central 99.7% between 0.566 and 1.034 marked.]
Solution b) 16% of thorax lengths exceed 0.878 mm
[Figure: Normal curve for thorax length with 84% of the area below 0.878 and 16% above.]
Monsoon rains
The summer monsoon brings 80% of India’s rainfall and is
essential for the country’s agriculture. Records going back more
than a century show that the amount of monsoon rainfall varies
from year to year according to a distribution that is approximately
Normal with mean 852 mm and standard deviation 82 mm. Use
the 68-95-99.7 rule to answer the following questions.
a) Between what values do the monsoon rains fall in 95% of all
years?
b) How small are the monsoon rains in the driest 2.5% of all years?
Solution
a) In 95% of all years, monsoon rain levels are between
852 - 2(82) and 852 + 2(82), i.e. between 688 mm and 1016 mm.
b) The driest 2.5% of monsoon rainfalls are less than 688 mm; this
is more than two standard deviations below the mean.
Standard Normal Distribution
The standard Normal distribution is the Normal distribution N(0,1)
with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(µ,σ) with mean µ
and standard deviation σ, then the standardized variable
\[ z = \frac{x - \mu}{\sigma} \]
has the standard Normal distribution.
SAT vs ACT
In 2010, when she was a high school senior, Alysha scored 670 on
the Mathematics part of the SAT. The distribution of SAT Math
scores in 2010 was Normal with mean 516 and standard deviation
116. John took the ACT and scored 26 on the Mathematics
portion. ACT Math scores for 2010 were Normally distributed with
mean 21.0 and standard deviation 5.3. Find the standardized
scores for both students. Assuming that both tests measure the
same kind of ability, who had the higher score?
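The standardization $z = (x - \mu)/\sigma$ is a one-line computation. The Python sketch below applies it to both students; the function name `z_score` is an assumption for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

z_alysha = z_score(670, mu=516, sigma=116)   # SAT Math, 2010
z_john = z_score(26, mu=21.0, sigma=5.3)     # ACT Math, 2010

print(f"Alysha: {z_alysha:.2f}, John: {z_john:.2f}")
```

Because z-scores are unit-free, they allow scores from differently scaled tests to be compared directly.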
Solution
Alysha’s standardized score is
\[ z_A = \frac{670 - 516}{116} = 1.33. \]
John’s standardized score is
\[ z_J = \frac{26 - 21}{5.3} = 0.94. \]
Alysha’s score is relatively higher than John’s.
Men’s and women’s heights
The heights of women aged 20 to 29 are approximately Normal
with mean 64.3 inches and standard deviation 2.7 inches. Men the
same age have mean height 69.9 inches with standard deviation
3.1 inches. What are the z-scores for a woman 6 feet tall and a
man 6 feet tall? Say in simple language what information the
z-scores give that the original nonstandardized heights do not.
.
.
.
.
.
.
Solution
We need to use the same scale, so recall that 6 feet = 72 inches.
A woman 6 feet tall has standardized score
zW = (72 − 64.3)/2.7 = 2.85
(quite tall, relatively).
A man 6 feet tall has standardized score
zM = (72 − 69.9)/3.1 = 0.68.
Hence, a woman 6 feet tall is 2.85 standard deviations taller than
average for women. A man 6 feet tall is only 0.68 standard
deviations above average for men.
.
.
.
.
.
.
Using the Normal table
Use Table A to find the proportion of observations from a standard
Normal distribution that satisfies each of the following statements.
In each case, sketch a standard Normal curve and shade the area
under the curve that is the answer to the question.
a) z < −1.42
b) z > −1.42
c) z < 2.35
d) −1.42 < z < 2.35
.
.
.
.
.
.
Solution a) 0.0778
[Figure: standard Normal curve with the area to the left of z* = -1.42 shaded; shaded area 0.0778.]
.
.
.
.
Solution b) 0.9222
[Figure: standard Normal curve with the area to the right of z* = -1.42 shaded; shaded area 0.9222, area to the left 0.0778.]
.
.
.
.
Solution c) 0.9906
[Figure: standard Normal curve with the area to the left of z* = 2.35 shaded; shaded area 0.9906.]
.
.
.
.
Solution d) 0.9906 - 0.0778 = 0.9128
[Figure: standard Normal curve with the area between z* = -1.42 and z* = 2.35 shaded; shaded area 0.9906 - 0.0778 = 0.9128.]
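The four table lookups can be reproduced without Table A. The sketch below uses the standard identity that the cumulative proportion is 0.5(1 + erf(z/√2)); the function name phi is an illustrative choice.

```python
from math import erf, sqrt

def phi(z):
    """Cumulative proportion below z for the standard Normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(phi(-1.42), 4))              # a) 0.0778
print(round(1 - phi(-1.42), 4))          # b) 0.9222
print(round(phi(2.35), 4))               # c) 0.9906
print(round(phi(2.35) - phi(-1.42), 4))  # d) 0.9128
```

Parts b) and d) use only cumulative proportions: area above a point is 1 minus the area below it, and area between two points is the difference of the two areas below.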
.
.
.
.
Monsoon rains
The summer monsoon rains in India follow approximately a Normal
distribution with mean 852 mm of rainfall and standard deviation
82 mm.
a) In the drought year 1987, 697 mm of rain fell. In what percent
of all years will India have 697 mm or less of monsoon rain?
b) “Normal rainfall” means within 20% of the long-term average,
or between 683 and 1022 mm. In what percent of all years is the
rainfall normal?
.
.
.
.
.
.
Solution a)
1. State the problem. Let x be the monsoon rainfall in a given
year. The variable x has the N(852, 82) distribution. We want the
proportion of years with x ≤ 697.
2. Standardize. Subtract the mean, then divide by the standard
deviation, to turn x into a standard Normal z.
Hence x ≤ 697 corresponds to z ≤ (697 − 852)/82 = −1.89.
3. Use the table. From Table A, we see that the proportion of
observations less than −1.89 is 0.0294. Thus, the answer is 2.94%.
.
.
.
.
.
.
Solution b)
1. State the problem. Let x be the monsoon rainfall in a given
year. The variable x has the N(852, 82) distribution. We want the
proportion of years with 683 < x < 1022.
2. Standardize. Subtract the mean, then divide by the standard
deviation, to turn x into a standard Normal z.
683 < x < 1022 corresponds to
(683 − 852)/82 < z < (1022 − 852)/82, or
−2.06 < z < 2.07.
3. Use the table. Hence, using Table A, the area is
0.9808 − 0.0197 = 0.9611, or 96.11%.
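Software can work directly on the original scale, without rounding z to two decimals. The sketch below uses NormalDist from Python's standard statistics module; the variable names are illustrative. Because z is not rounded, the results agree with the Table A answers only to about three decimal places.

```python
from statistics import NormalDist

# Monsoon rainfall modeled as N(852, 82), as in the problem statement.
rain = NormalDist(mu=852, sigma=82)

p_drought = rain.cdf(697)                # a) proportion of years with x <= 697
p_normal = rain.cdf(1022) - rain.cdf(683)  # b) proportion with 683 < x < 1022

print(round(p_drought, 3), round(p_normal, 3))  # 0.029 0.961
```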
.
.
.
.
.
.
The Medical College Admission Test
Almost all medical schools in the United States require students to
take the Medical College Admission Test (MCAT). The exam is
composed of three multiple-choice sections (Physical Sciences,
Verbal Reasoning, and Biological Sciences). The score on each
section is converted to a 15-point scale so that the total score has
a maximum value of 45. The total scores follow a Normal
distribution, and in 2010 the mean was 25.0 with a standard
deviation of 6.4. There is little change in the distribution of scores
from year to year.
a) What proportion of students taking the MCAT had a score over
30?
b) What proportion had scores between 20 and 25?
.
.
.
.
.
.
Solution a)
1. State the problem. Let x be the MCAT score of a randomly
selected student. The variable x has the N(25, 6.4) distribution.
We want the proportion of students with x > 30.
2. Standardize. Subtract the mean, then divide by the standard
deviation, to turn x into a standard Normal z.
Hence x > 30 corresponds to z > (30 − 25)/6.4 = 0.78.
3. Use the table. From Table A, we see that the proportion of
observations less than 0.78 is 0.7823. Hence, the answer is
1 − 0.7823 = 0.2177, or 21.77%.
.
.
.
.
.
.
Solution b)
1. State the problem. Let x be the MCAT score of a randomly
selected student. The variable x has the N(25, 6.4) distribution.
We want the proportion of students with 20 ≤ x ≤ 25.
2. Standardize. Subtract the mean, then divide by the standard
deviation, to turn x into a standard Normal z.
20 ≤ x ≤ 25 corresponds to
(20 − 25)/6.4 ≤ z ≤ (25 − 25)/6.4, or −0.78 ≤ z ≤ 0.
3. Use the table. Using Table A, the area is
0.5 − 0.2177 = 0.2823, or 28.23%.
.
.
.
.
.
.
Using a table to find Normal proportions
Step 1. State the problem in terms of the observed variable x.
Draw a picture that shows the proportion you want in terms of
cumulative proportions.
Step 2. Standardize x to restate the problem in terms of a
standard Normal variable z.
Step 3. Use Table A and the fact that the total area under the curve
is 1 to find the required area under the standard Normal curve.
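The three steps can be sketched in Python as follows. The function table_proportion and its convention of passing None for an open end are assumptions made for this illustration. It rounds z to two decimals to mimic reading Table A and is applied here to the MCAT example, N(25, 6.4).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the standard Normal distribution N(0, 1)

def table_proportion(low, high, mu, sigma):
    """Proportion of N(mu, sigma) between low and high.

    z is rounded to two decimals, as when reading Table A.
    Pass None for an endpoint that is unbounded.
    """
    # Step 2: standardize each finite endpoint.
    z_low = None if low is None else round((low - mu) / sigma, 2)
    z_high = None if high is None else round((high - mu) / sigma, 2)
    # Step 3: subtract cumulative proportions; the total area is 1.
    below_high = 1.0 if z_high is None else STD_NORMAL.cdf(z_high)
    below_low = 0.0 if z_low is None else STD_NORMAL.cdf(z_low)
    return below_high - below_low

print(round(table_proportion(30, None, 25, 6.4), 4))  # 0.2177
print(round(table_proportion(20, 25, 25, 6.4), 4))    # 0.2823
```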
.
.
.
.
.
.
Table A
Use Table A to find the value z∗ of a standard Normal variable
that satisfies each of the following conditions. (Use the value of z∗
from Table A that comes closest to satisfying the condition.) In
each case, sketch a standard Normal curve with your value of z∗
marked on the axis.
a) The point z∗ with 15% of the observations falling below it.
b) The point z∗ with 70% of the observations falling above it.
.
.
.
.
.
.
Solution a) z* = -1.04
[Figure: standard Normal curve with z* = -1.04 marked on the axis; area to its left 0.1492.]
.
.
.
.
Solution b) z* = -0.52
[Figure: standard Normal curve with z* = -0.52 marked on the axis; area to its left 0.3015, area to its right 1 - 0.3015 = 0.6985.]
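Reading Table A backwards corresponds to an inverse cumulative lookup. A minimal sketch using inv_cdf from Python's statistics.NormalDist is shown below, with illustrative variable names. For part b), 70% of observations above z* is the same as 30% below it.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

z_a = std_normal.inv_cdf(0.15)      # a) 15% of observations below z*
z_b = std_normal.inv_cdf(1 - 0.70)  # b) 70% above z*, so 30% below

print(round(z_a, 2), round(z_b, 2))  # -1.04 -0.52
```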
.
.
.
.
The Medical College Admission Test
The total scores on the Medical College Admission Test (MCAT)
follow a Normal distribution with mean 25.0 and standard
deviation 6.4. What are the median and the first and third
quartiles of the MCAT scores?
.
.
.
.
.
.
Solution: Finding the median
Because the Normal distribution is symmetric, its median and
mean are the same. Hence, the median MCAT score is 25.
.
.
.
.
.
.
Solution: Finding Q1
1. State the problem. We want to find the MCAT score x with
area 0.25 to its left under the Normal curve with mean µ = 25 and
standard deviation σ = 6.4.
2. Use the table. Look in the body of Table A for the entry closest
to 0.25. It is 0.2514. This is the entry corresponding to
z∗ = −0.67. So z∗ = −0.67 is the standardized value with area
0.25 to its left.
3. Unstandardize to transform the solution from the z∗ back to
the original x scale. We know that the standardized value of the
unknown x is z∗ = −0.67.
So x itself satisfies
(x − 25)/6.4 = −0.67.
Solving this equation for x gives
x = 25 + (−0.67)(6.4) = 20.71
.
.
.
.
.
.
Solution: Finding Q3
1. State the problem. We want to find the MCAT score x with
area 0.75 to its left under the Normal curve with mean µ = 25 and
standard deviation σ = 6.4.
2. Use the table. Look in the body of Table A for the entry closest
to 0.75. It is 0.7486. This is the entry corresponding to z∗ = 0.67.
So z∗ = 0.67 is the standardized value with area 0.75 to its left.
3. Unstandardize to transform the solution from the z∗ back to
the original x scale. We know that the standardized value of the
unknown x is z∗ = 0.67.
So x itself satisfies
(x − 25)/6.4 = 0.67.
Solving this equation for x gives
x = 25 + (0.67)(6.4) = 29.29
.
.
.
.
.
.
Finding a value when given a proportion
1. State the problem.
2. Use the table.
3. Unstandardize to transform the solution from the z∗ back to
the original x scale.
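Applied to the MCAT quartiles, the "unstandardize" step can be sketched as below. The helper name unstandardize is illustrative. The first computation uses the Table A value z* = ±0.67; the second uses inv_cdf without rounding z, so it differs slightly in the second decimal.

```python
from statistics import NormalDist

def unstandardize(z, mu, sigma):
    """Solve (x - mu)/sigma = z for x."""
    return mu + z * sigma

# With the Table A values z* = -0.67 and z* = 0.67:
q1_table = unstandardize(-0.67, 25, 6.4)
q3_table = unstandardize(0.67, 25, 6.4)
print(round(q1_table, 2), round(q3_table, 2))  # 20.71 29.29

# Without rounding z, directly on the N(25, 6.4) scale:
mcat = NormalDist(mu=25, sigma=6.4)
q1_exact, q3_exact = mcat.inv_cdf(0.25), mcat.inv_cdf(0.75)
print(round(q1_exact, 1), round(q3_exact, 1))  # 20.7 29.3
```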
.
.
.
.
.
.