Chapter 1

Welcome to MATH171!
• Overview of Syllabus
• Technology Overview
• Basic Skills Quiz
• Start Chapter 1!
Displaying data with graphs
BPS chapter 1
© 2006 W. H. Freeman and Company
With Modifications by Dr. M. Leigh Lunsford
What is Statistics?
Statistics is the Science of Learning from Data
The Collection and Analysis of Data
Sampling and Experimental Design
Chapters 8 & 9
Probability & Sampling Distributions
Chapters 10 & 11
Descriptive Statistics
(Data Exploration)
Chapters 1 - 5
Inferential Statistics
Chapters 14 - 21
Objectives for Chapter 1
Picturing Distributions with Graphs

Individuals and variables

Two types of data: categorical and quantitative

Ways to chart categorical data: bar graphs and pie charts

Ways to chart quantitative data: histograms and stemplots

Interpreting histograms

Time plots
Individuals and variables (page 3)
Individuals are the objects described by a set of data. Individuals
may be people, but they may also be animals or things.

Example: Freshmen, 6-week-old babies, golden retrievers, fields
of corn, cells
A variable is any characteristic of an individual. A variable can take
different values for different individuals.

Example: Age, height, blood pressure, ethnicity, leaf length, first
language
Two types of variables (page 4)
A variable can be either

quantitative

Something that can be counted or measured for each individual
and then added, subtracted, averaged, etc., across individuals in
the population.

Example: How tall you are, your age, your blood cholesterol level,
the number of credit cards you own.
OR

categorical

Something that falls into one of several categories. What can be
counted is the count or proportion of individuals in each category.

Example: Your blood type (A, B, AB, O), your hair color, your
ethnicity, whether you paid income tax last tax year or not.
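As a rough illustration of the distinction, the sketch below summarizes a quantitative variable by averaging it and a categorical variable by counting individuals in each category. The three students and their values are made up for illustration; they are not data from the text.

```python
from collections import Counter
from statistics import mean

# Hypothetical individuals (made-up values, not from the text).
students = [
    {"name": "A", "height_cm": 160, "blood_type": "O"},
    {"name": "B", "height_cm": 170, "blood_type": "A"},
    {"name": "C", "height_cm": 180, "blood_type": "O"},
]

# Quantitative variable: arithmetic such as averaging makes sense.
average_height = mean(s["height_cm"] for s in students)

# Categorical variable: what we can count is the individuals in each category.
blood_type_counts = Counter(s["blood_type"] for s in students)

print(average_height)
print(dict(blood_type_counts))
```

Averaging the blood types would be meaningless, which is one practical way to tell the two kinds of variable apart.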
Example 1.1 (page 4-5)


How do you determine if a variable is categorical or quantitative?
Identify individuals, variables and types of variables.
Ways to graph categorical data
Because the variable is categorical, the data in the graph can be
ordered any way we want (alphabetical, by increasing value, by year,
by personal preference, etc.).

Bar graphs
Each category is represented by a bar.

Pie charts
Use when you want to emphasize each category’s relation to the whole.
Example: Top 10 causes of death in the United States, 2001
Rank  Cause of death         Counts    Percent of top 10s   Percent of total deaths
1     Heart disease          700,142   37%                  29%
2     Cancer                 553,768   29%                  23%
3     Cerebrovascular        163,538    9%                   7%
4     Chronic respiratory    123,013    6%                   5%
5     Accidents              101,537    5%                   4%
6     Diabetes mellitus       71,372    4%                   3%
7     Flu and pneumonia       62,034    3%                   3%
8     Alzheimer’s disease     53,852    3%                   2%
9     Kidney disorders        39,480    2%                   2%
10    Septicemia              32,238    2%                   1%
      All other causes       629,967                        26%

How did they get these numbers?
For each individual who died in the United States in 2001, we record what was
the cause of death. The table above is a summary of that information.
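One way to see how such a summary is produced is to recompute the "Percent of top 10s" column from the counts. The sketch below is only an illustration; the variable names are my own, and the counts are copied from the table.

```python
# Counts copied from the table (top 10 causes of death, U.S., 2001).
counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}

# Each percent is the category's count divided by the total for the top 10.
top10_total = sum(counts.values())
percent_of_top10 = {
    cause: round(100 * n / top10_total) for cause, n in counts.items()
}

for cause, pct in percent_of_top10.items():
    print(f"{cause:<22}{counts[cause]:>9,}{pct:>5}%")
```

The rounded results agree with the "Percent of top 10s" column in the table.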
Bar graphs
Each “value” of the categorical variable is represented by one bar. The bar’s
height shows the count (or sometimes the percentage) for that particular
category.
Top 10 causes of death in the U.S., 2001
The number of individuals
who died of an accident in
2001 is approximately
100,000.
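Because the bars of a categorical variable can be put in any order, the same counts can be drawn sorted by count or sorted alphabetically. The sketch below prints a crude text bar graph both ways; the helper `text_bar_graph` and the choice of one `#` per 25,000 deaths are assumptions made for illustration.

```python
# Counts copied from the table (top 10 causes of death, U.S., 2001).
counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}


def text_bar_graph(items, per_mark=25_000):
    """Return one line per (label, count) pair; each '#' stands for per_mark deaths."""
    return [f"{label:<22}{'#' * round(count / per_mark)}" for label, count in items]


# Same data, two orderings of the bars.
by_rank = sorted(counts.items(), key=lambda item: item[1], reverse=True)
alphabetical = sorted(counts.items())

print("Sorted by rank")
print("\n".join(text_bar_graph(by_rank)))
print("Sorted alphabetically")
print("\n".join(text_bar_graph(alphabetical)))
```

In the rank-ordered version the bars shrink steadily down the page, so the largest and smallest categories are easy to spot; the alphabetical version carries the same information but hides that pattern.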
[Figure: bar graph of the top 10 causes of death in the U.S., 2001. Horizontal axis: cause of death; vertical axis: Counts (x1000), from 0 to 800.]
Top 10 causes of death in the U.S., 2001
Bar graph sorted by rank: easy to analyze
Sorted alphabetically: much less useful
Pie charts
Each slice represents a piece of one whole.
The size of a slice depends on what percent of the whole this category represents.
Percent of people dying from top 10 causes of death in the U.S., 2000
Make sure your labels match the data!
Make sure all percents add up to 100!!
[Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]
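The "add up to 100" check can be sketched in a few lines. The function name `slices_make_a_whole` and the one-point rounding tolerance are assumptions for illustration; the percents are copied from the top 10 causes of death table.

```python
def slices_make_a_whole(percents, tolerance=1):
    """True when the percents sum to 100, within a small rounding tolerance."""
    return abs(sum(percents) - 100) <= tolerance


# "Percent of top 10s" column: the ten categories are the whole.
percent_of_top10 = [37, 29, 9, 6, 5, 4, 3, 3, 2, 2]

# "Percent of total deaths" column for the ten named causes only.
percent_of_total_top10_rows = [29, 23, 7, 5, 4, 3, 3, 2, 2, 1]

print(slices_make_a_whole(percent_of_top10))
print(slices_make_a_whole(percent_of_total_top10_rows))
```

The first list sums to 100, so it can be drawn as a complete pie. The second sums to only 79, so a pie chart of deaths from all causes needs an extra slice for every cause outside the top 10.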
Apply Your Knowledge Problem 1.4



Let’s work Problem 1.4 (page 10) together!
Bar graph (in count & percent)
Pie chart?

Births in 2004 by Day of Week (pie chart labels)
Sun 10%, Sat 11%, Mon 15%, Fri 16%, Tues 16%, Thurs 16%, Wed 16%

Number of Babies Born on Each Day of the Week in 2003
Day of Week   Births
Sun            7563
Mon           11733
Tues          13001
Wed           12598
Thurs         12514
Fri           12396
Sat            8605
Ways to chart quantitative data

Histograms and stemplots
These are summary graphs for a single variable. They are very useful to
understand the pattern of variability in the data.

Line graphs: time plots
Use when there is a meaningful sequence, like time. The line connecting
the points helps emphasize any change over time.

Other graphs to reflect numerical summaries (see Chapter 2)
An Example

Suppose we want to determine the following:
• What percent of all fifth grade students in our district have an IQ score of at least 120?
• What is the average IQ score of all fifth grade students in our district?

It is too expensive to give an IQ test to all fifth grade students in our district.
Below are the IQ test scores from 60 randomly chosen fifth graders in our district. (Individuals (subjects)? Variable(s)?)
Previews of Coming Attractions!





• We are interested in questions about a population (all fifth grade students in our district).
• We want to know the percent (or proportion) of the population in a particular category (IQ score of at least 120) and the average value of a variable for the population (average IQ score).
• We have taken a random sample from the population.
• Eventually we will use the data from the sample to infer about the population. (Inferential Statistics)
• For now we will describe the data in the sample. (Descriptive Statistics)
  • We will graphically represent the IQ scores for our sample (histogram and stem-and-leaf plot).
  • We will find the percent of students in our sample with an IQ score of at least 120 and understand how that percent relates to the graph.
  • Later (Chapter 2) we will also be able to describe the data with numerical summaries and other types of plots (boxplots).
Stemplots (page 19)
How to make a stemplot:
1) Separate each observation into a stem, consisting of
all but the final (rightmost) digit, and a leaf, which is
that remaining final digit. Stems may have as many
digits as needed, but each leaf contains only a single
digit.
2) Write the stems in a vertical column with the smallest
value at the top, and draw a vertical line at the right
of this column.
3) Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
Let’s try it with this data: 9, 9, 22, 32, 33, 39, 39, 42,
49, 52, 58, 70
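The three steps can be sketched in code for the small data set above. This is a minimal illustration that assumes non-negative whole numbers; the function name `stemplot` is my own, and showing stems that have no leaves is a choice made here so that gaps in the data stay visible.

```python
def stemplot(values):
    """Build stemplot rows: stem = all but the final digit, leaf = final digit."""
    leaves = {}
    # Sorting first puts the leaves in increasing order out from each stem.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves.setdefault(stem, []).append(str(leaf))
    # List every stem from smallest to largest, including stems with no leaves.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{stem} | {''.join(leaves.get(stem, []))}" for stem in stems]


data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]
rows = stemplot(data)
print("\n".join(rows))
```

The single-digit values 9 and 9 get stem 0, so the first row reads `0 | 99`.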
Now Let’s Make a Stemplot for Our IQ Data

Stem and Leaf Plot for IQ Scores
IQ Test Scores for 60 Randomly Chosen 5th Grade Students
stem unit = 10, leaf unit = 1

Frequency  Stem  Leaf
    3        8   129
    4        9   0467
   14       10   01112223568999
   17       11   00022334445677788
   11       12   22344456778
    9       13   013446799
    2       14   25
   60
Now Let’s Make a Histogram (pages 10-12)



Use the Same IQ Data
We will start by hand, using class (bin) widths of 10 starting at 80.
Make a Frequency Table for the data:
Variable: X = IQ score
Frequency Table:
Bins             Frequency   Percent
80 ≤ X < 90           3        5.0%
90 ≤ X < 100          4        6.7%
100 ≤ X < 110        14       23.3%
110 ≤ X < 120        17       28.3%
120 ≤ X < 130        11       18.3%
130 ≤ X < 140         9       15.0%
140 ≤ X < 150         2        3.3%
totals:              60       99.9%
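The frequency table can be rebuilt from the data. In the sketch below the 60 scores are read back from the stem-and-leaf plot of the IQ data (stem unit 10, leaf unit 1); the function name `frequency_table` and its parameters are assumptions for illustration, and the function expects every score to fall inside the bins.

```python
# IQ scores read back from the stem-and-leaf plot (stem unit 10, leaf unit 1).
stem_and_leaf = {
    8: "129",
    9: "0467",
    10: "01112223568999",
    11: "00022334445677788",
    12: "22344456778",
    13: "013446799",
    14: "25",
}
scores = [
    10 * stem + int(leaf)
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
]


def frequency_table(data, start, width, n_bins):
    """Count observations in bins start + i*width <= x < start + (i+1)*width."""
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts


counts = frequency_table(scores, start=80, width=10, n_bins=7)
percents = [round(100 * c / len(scores), 1) for c in counts]

for i, (c, p) in enumerate(zip(counts, percents)):
    low = 80 + 10 * i
    print(f"{low:>3} <= X < {low + 10:<3}  {c:>2}  {p:>4}%")
print("total percent:", round(sum(percents), 1))
```

The rounded percents total 99.9 rather than 100 because each one is rounded to one decimal place before adding.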
Now Let’s Make a Histogram (pages 10-12)
[Figure: histogram titled "IQ Scores of Randomly Chosen Fifth Grade Students". Horizontal axis: IQ Score, 80 to 150 in bins of width 10; vertical axis: Percent, 0 to 30. Bar heights: 5.0, 6.7, 23.3, 28.3, 18.3, 15.0, 3.3.]
Percent of What?
What is the meaning of this bar?

Back to Our Question:

What percent of the 60 randomly chosen fifth grade students have
an IQ score of at least 120?

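Numerically?
From the rounded bin percents: 18.3% + 15.0% + 3.3% = 36.6%
From the counts: (11 + 9 + 2)/60 = 0.367, or 36.7% (the two answers differ only because the bin percents were rounded)

How to Represent Graphically?
The grey shaded region corresponds to the 36.6% of students with an IQ score of at least 120.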
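The same answer can be checked directly against the 60 scores. The sketch below reads the scores back from the stem-and-leaf plot of the IQ data and compares the exact percent with the sum of the rounded bin percents; the variable names are my own.

```python
# IQ scores read back from the stem-and-leaf plot (stem unit 10, leaf unit 1).
stem_and_leaf = {
    8: "129",
    9: "0467",
    10: "01112223568999",
    11: "00022334445677788",
    12: "22344456778",
    13: "013446799",
    14: "25",
}
scores = [
    10 * stem + int(leaf)
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
]

# Exact answer: count the students at or above 120 and divide by the sample size.
at_least_120 = sum(1 for x in scores if x >= 120)
exact_percent = 100 * at_least_120 / len(scores)

# Answer from the frequency table: add the already-rounded bin percents.
rounded_bin_percents = [18.3, 15.0, 3.3]
table_percent = sum(rounded_bin_percents)

print(at_least_120, round(exact_percent, 1), round(table_percent, 1))
```

Working from counts avoids the small rounding error that creeps in when rounded percents are added.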
Another Histogram of the IQ Data!
What is Different From the Histogram we Generated In Class?
How to create a histogram
It is an iterative process—try and try again.
What bin (class) size should you use?

Not too many bins with either 0 or 1 counts

Not so summarized that you lose all the information

Not so detailed that it is no longer a summary
Rule of thumb: Start with 5 to 10 bins.
Look at the distribution and refine your bins.
(There isn’t a unique or “perfect” solution.)
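To get a feel for the trade-off, the sketch below bins the 60 IQ scores (read back from the stem-and-leaf plot) with three different widths and reports how many bins result and how many of them hold 0 or 1 observations. The function name `bin_counts` and the widths 2, 10, and 40 are choices made for illustration; bins are aligned to multiples of the width.

```python
# IQ scores read back from the stem-and-leaf plot (stem unit 10, leaf unit 1).
stem_and_leaf = {
    8: "129",
    9: "0467",
    10: "01112223568999",
    11: "00022334445677788",
    12: "22344456778",
    13: "013446799",
    14: "25",
}
scores = [
    10 * stem + int(leaf)
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
]


def bin_counts(data, width):
    """Map each bin's lower edge to its count; bins start at multiples of width."""
    low = min(data) // width * width
    high = max(data) // width * width
    counts = {edge: 0 for edge in range(low, high + width, width)}
    for x in data:
        counts[x // width * width] += 1
    return counts


for width in (2, 10, 40):
    counts = bin_counts(scores, width)
    sparse = sum(1 for c in counts.values() if c <= 1)
    print(f"width {width:>2}: {len(counts):>2} bins, {sparse} with 0 or 1 counts")
```

A width of 2 gives 33 narrow bins, a width of 40 collapses everything into 2 bins, and a width of 10 gives the 7 bins used in the frequency table.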
[Figure: three histograms of the same data set. One is not summarized enough, one is too summarized, and one meets the GOAL: capture the overall pattern.]
Apply Your Knowledge

Let’s try problem 1.7 (page 14)

What is the difference between a histogram and a bar chart?

See pages 12-13
Interpreting histograms
When describing a quantitative variable, we look for the overall pattern and for
striking deviations from that pattern. We can describe the overall pattern of a
histogram by its shape, center, and spread.
Histogram with a line connecting each column: too detailed
Histogram with a smoothed curve highlighting the overall pattern of the distribution
Most common distribution shapes

Symmetric distribution
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution
Not all distributions have a simple overall shape, especially when there are few observations.
Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetric except for two
states clearly not belonging
to the main trend. Alaska
and Florida have unusual
representation of the
elderly in their population.
A large gap in the
distribution is typically a
sign of an outlier.
IMPORTANT NOTE:
Your data are the way they are.
Do not try to force them into a
particular shape.
It is a common misconception
that if you have a large enough
data set, the data will eventually
turn out nice and symmetrical.
Line graphs: time plots
Time always goes on the
horizontal, or x, axis.
The variable of interest—
here “retail price of fresh
oranges”—goes on the
vertical, or y, axis.
This time plot shows a regular pattern of yearly variations. These are seasonal
variations in fresh orange pricing most likely due to similar seasonal variations in
the production of fresh oranges.
There is also an overall upward trend in pricing over time. It could simply be
reflecting inflation trends or a more fundamental change in this industry.
Let’s Start Problem 1.41 on Page 35….
Scales matter
How you stretch the axes and choose your scales can give a different impression.
[Figure: four time plots of the same data, "Death rates from cancer (US, 1945-95)". Horizontal axis: Years, 1940 to 2000; vertical axis: Death rate (per thousand). Three panels use a vertical scale of 0 to 250 with differently stretched axes; the fourth uses a vertical scale of 120 to 220.]
A picture is worth a thousand words, BUT there is nothing like hard numbers.
Look at the scales.