Uploaded by tyten.treyvon

Chapter 1 Slides (1)

advertisement
Chapter 1
Picturing Distributions with Graphs
Adapted from the DS1000B lectures by Dr. Yifan Li
Revisions made by Ashley McAlpine, FSA, FCIA
© 2021 W. H. Freeman and Company
Chapter 1 Overview
 Individuals and variables
 Categorical variables: pie charts and bar graphs
 Quantitative variables: histograms
 Interpreting histograms
 Quantitative variables: stemplots
 Time plots
2
Data Science
 Data science is a multidisciplinary field
(Statistics, Computer Science, Mathematics)
with the goals of extracting insight/information
from data and making data-driven decisions.
 Data Science encompasses: data collection,
storage, preprocessing, analysis and
visualisation.
3
Statistics
Statistics is the science of data. The first step in
dealing with data is to organize your thinking
about the data.
Individual: an object described by a set of data
Variable: a characteristic of the individual
Individuals
Variables
4
When planning a study …
… or simply exploring data from someone else’s
work, ask yourself these questions:
Who?
What individual does the data describe?
What?
How many and what are the exact definitions of the
variables in the data?
In what unit of measurement is each variable
recorded?
Where?
The context of the data collection is always important.
When?
(see previous point.)
Why?
Were the data collected to describe just those
individuals or to represent a larger group?
5
Types of Variables
A categorical variable places individuals into
one of several groups or categories.
A quantitative variable takes numerical values
for which arithmetic operations make sense
(usually recorded in a unit of measurement).
Most data tables follow this format—the data here appear in a spreadsheet program.
6
Exploratory Data Analysis
An exploratory data analysis is the process of
using statistical tools and ideas to examine data
in order to describe their main features.
EXPLORING DATA
 Begin by examining each variable by itself.
Then move on to studying the relationships
among the variables.
 Begin with a graph or graphs. Then add
numerical summaries of specific aspects of
the data.
7
Exercise
Suppose we have a dataset that describes the fuel
economy in miles per gallon (mpg) of model year 2019
motor vehicles. What is the individual in this dataset?
A. mile per gallon (mpg)
B. a year
C. the driver
D. a vehicle
8
 Categorical variables:
 Quantitative variables:
9
Distribution of a Variable
To examine a single variable, we usually want to display its
distribution.
DISTRIBUTION OF A VARIABLE
 The distribution of a variable tells us what values it
takes and how often it takes these values.
10
Displaying Categorical
Data
The distribution of a categorical variable lists the
categories and gives either the count or the percent of
individuals who fall into each category.
 Pie charts show the distribution of a categorical
variable as a “pie” where the sizes of the slices
reflect the counts or percents for the categories.
 Bar graphs represent each category as a bar whose
height shows the category count or percent.
11
Pie Chart or Bar Graph
EXAMPLE: What do the 1.5 million full-time first-year students
plan to study? Here are data on the percents of post-secondary
first-year students who plan to major in several discipline areas.
Field of Study
Percent of Student
Biological sciences
15.5
Business
13.8
Health professions
11.7
Engineering
11.5
Social sciences
11
Arts and humanities
8.8
Math and computer science
6.2
Education
4.4
Physical sciences
2.7
Other majors and undeclared
13.1
Total
98.7
PERCENT OF STUDENTS
Social
sciences
Arts and
humanities
Physical
sciences
Biological
sciences
Other
majors and
undeclared
Math and
computer
science
rounding error
Business
Health
professions
Education
Engineering
12
Pie Chart or Bar Graph
Field of Study
Percent of Student
Biological sciences
15.5
Business
13.8
Health professions
11.7
Engineering
11.5
Social sciences
11
Arts and humanities
8.8
Math and computer science
6.2
Education
4.4
Physical sciences
2.7
Other majors and undeclared
13.1
Percent of students who plan to major
EXAMPLE (cont’d): The same example now shown as a bar
graph
18
16
14
12
10
8
6
4
2
0
Field of Study
13
Bar Graphs Only
Brand
Percent of 12-34s
Who Have
Listened to Each
Brand
Pandora
36
Spotify
46
iHeartRadio
14
Apple Music
20
Amazon Music
10
SoundCloud
23
Google Play All Access
8
Percent of 12-34 YO’s Who Have Listened to Each Brand
EXAMPLE: What sources do Americans aged 12–34 years
use to keep up to date and50 learn about music?
45
40
35
30
25
20
15
10
5
0
Spotify
Pandora
SoundCloud
Apple Music
iHeartRadio
Amazon Music
Google Play All
Access
Audio Brand
Note: For bar graphs, percents don’t necessarily add to 100.
14
Bar Graphs vs Pie Charts
They both describe categorical variables, but we need
to choose them wisely in different cases:
 Pie charts must include all categories that make up
a whole. Use a pie chart only when you emphasize
each category’s relationship to the whole.
 Bar graphs are more flexible than pie charts (e.g.
the categories do not need to make up a whole;
when there are many categories, pie charts can be
hard to see but bar graphs still work.)
15
Can we use a pie chart?
Here are top brands chosen by Americans aged 12-34 who are
using social media.
A. Yes
B. No
16
Can we use a pie chart?
Here are the average numbers of babies born on each day of the week in 2017
A. Yes
B. No
17
Quantitative Data
The distribution of a quantitative variable tells us
what values the variable takes on and how often it
takes on those values.
 Histograms show the distribution of a
quantitative variable by using bars where the
height of each bar represents the number of
individuals who take on a value within a
particular class/interval.
 Stemplots separate each observation into a
stem and a leaf that are then plotted to display
the distribution, while maintaining the original
values of the variable.
18
Histograms
 Are appropriate for quantitative variables that
take on many values and/or for large
datasets.
 Divide the possible values into
classes/intervals (equal widths).
 Count how many observations fall into each
interval (may change to percents).
 Draw a picture representing the distribution
― bar heights are equivalent to the number
(percent) of observations in each interval.
19
Histograms (Example)
 Freshman Graduation Rate, or FGR, Data for 2016–
2017
20
Histograms vs Bar Graphs
Histogram
Bar graphs
Type of variable
Quantitative
Categorical
Space between
bars
No space
Equal space
The height of bars
(both mean
count/frequency/
percent)
Changes with the setup of
intervals or break points
Once the unit of
measurement is fixed, it does
not change
1. Why is there usually no space between the bars of a histogram?
2. Sometimes there is a gap between bars in a histogram. What does this mean?
21
Interpreting Histograms
EXAMINING A HISTOGRAM
 In any graph of data, look for the overall pattern
and for striking deviations from that pattern.
 You can describe the overall pattern by its
shape, center, and variability. You will
sometimes see variability referred to as spread.
 An important kind of deviation is an outlier, an
individual that falls outside the overall pattern.
22
Overall Shape of Distributions
SYMMETRIC AND SKEWED DISTRIBUTIONS
 A distribution is symmetric if the right and left sides of the
graph are approximately mirror images of each other.
 A distribution is skewed to the right (right-skewed) if the
right side of the graph (containing the half of the
observations with larger values) is much longer than the
left side.
 A distribution is skewed to the left (left-skewed) if the left
side of the graph is much longer than the right side.
Symmetric
Right-skewed
Left-skewed
23
Direction to the long tail (not the direction where most points are clustered)
1000
0
500
Frequency
1500
2000
Symmetric
6
8
10
12
14
Right-skewed
100
Frequency
50
150
100
0
50
40
50
60
70
80
90
100
0
Frequency
200
150
250
200
Left-skewed
0
5
10
15
20
24
What is the shape of this distribution?
A. Right-skewed
B. Left-skewed
C. Symmetric
25
What is the shape of this distribution?
A. Right-skewed
B. Left-skewed
C. Symmetric
26
Stemplots (Stem-and-Leaf Plots)
To make a stemplot (not suitable for a large dataset):
1. Separate each observation into a stem (consisting of all
but the final, or rightmost, digit) and a leaf (the final digit).
Stems may have as many digits as needed, but each leaf
contains only a single digit.
2. Write the stems in a vertical column with the smallest at
the top and draw a vertical line at the right of this column.
Be sure to include all the stems needed to span the data,
even when some stems have no leaves.
3. Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
27
Example of Stemplot
5.3, 5.6, 5.3, 5.0, 5.5, 5.9, 5.9, 3.3, 5.0, 3.4, 5.0, 3.4, 4.6, 6.9, 6.0
• First, order the data from smallest to largest.
• For this dataset, the stems are the non-decimals and leaves are the
decimals (recall: leaves can only be one digit)
3.3, 3.4, 3.4, 4.6, 5.0, 5.0, 5.0, 5.3, 5.3, 5.5, 5.6, 5.9, 5.9, 6.0, 6.9
Stemplot:
3 | 344
4|6
5 | 000335699
6 | 09
28
 If there are very few stems (that is, when the data covers
only a very small range of values), then we may want to
create more stems by splitting the original stems.
 EXAMPLE:
3.3, 3.4, 3.4, 4.6, 5.0, 5.0, 5.0, 5.3, 5.3, 5.5, 5.6, 5.9, 5.9, 6.0, 6.9
3 | 344  decimals 0 to 4
3|
 decimals 5 to 9
4|
4|6
5 | 00033
5 | 5699
6|0
6|9
29
Exercise
The shape of the distribution below is:
a. Skewed to the left.
b. Skewed upward.
c. Skewed to the right.
30
Stemplot of Median Income for Families
of Four
Median incomes range from
$46,596 (New Mexico) to
$82,879 (Maryland).
Stemplot of Median Incomes:
4|66789
5|11344
Stems: 4 to 8, reusing two times 5|56666688899999
6|011112334
(0-5 then 5-9)
with leaves truncated to $1,000s. 6|556666789
Note leaves have been ordered. 7|01223
7|
Example:
8|0022
$46,596 would be truncated
Example: 4|6 = $46,xxx
to 46,000 and shown as 4|6
Source: Federal Registry, April 15, 2003
31
Obtaining Info from the Stemplot
Determine shape, identify outliers, locate center.
Pulse Rates:
5|4
5|789
6|023344
6|55567789
7|00124
7|58
Exam Scores
3|2
4|
5|5
6|024418
7|5659839
8|5430820
9|532087
Bell-shape
Centered mid 60’s
no outliers
Outlier of 32.
Apart from 55,
rest uniform from
the 60’s to 90’s.
Median Incomes:
4|66789
5|11344
5|56666688899999
6|011112334
6|556666789
7|01223
7|
8|0022
Wide range with 4
unusually high values.
Rest bell-shape around
high $50,000s.
32
Time Plots
 A time plot shows behavior over time.
 Time is always on the horizontal axis, and the
variable being measured is on the vertical axis.
 Look for an overall pattern (trend) and for deviations
from this trend. Connecting the data points by lines
may emphasize this trend.
 Look for patterns that repeat at known regular
intervals (seasonal variations or cycles).
33
Time Plots (Example 1)
Time Plot of Average Gauge Height,
Everglades National Park Monitoring Station
34
Time Plots (Example 2)
Time Plot of the Mean Atmospheric CO2 Concentration (ppm)
35
Practice Textbook Problems:
Odd problems from 1.13 to 1.45 (except 1.29)
36
The Basic Practice of Statistics
Ninth Edition
David S. Moore
William I. Notz
Chapter 1
Picturing Distributions with Graphs
Clicker Questions
© 2021 W. H. Freeman and Company
Question 1
In a study of commuting patterns of people in a large metropolitan area,
respondents were asked to report the time they took to travel to their
workplace on a specific day of the week. What is the individual?
a)
b)
c)
d)
travel time
a person
the day of the week
the city in which they lived
Question 2
In a study of commuting patterns of people in a large metropolitan area,
respondents were asked to report the time they took to travel to their
workplace on a specific day of the week. What is the variable of
interest?
a)
b)
c)
d)
travel time
a person
the day of the week
the city in which the person lives
Question 3
If we asked people to report their “weight,” would that variable be
considered a categorical variable or a quantitative variable?
a)
b)
categorical
quantitative
Question 4
We ask people whether their weight is underweight, normal,
overweight, or obese. Is this variable categorical or quantitative?
a)
b)
categorical
quantitative
Question 5
A survey involving 35 questions was given to a sample of 200 students
attending a university with an enrollment of 10,000. How many
individuals do the data describe?
a)
b)
c)
d)
35
200
10,000
It is impossible to say.
Question 6
A survey involving 35 questions was given to a sample of 200 students
attending a university with an enrollment of 10,000. How many
variables do the data contain?
a)
b)
c)
d)
35
200
35 × 200
35 × 10,000
Question 7
What type of data is produced by the answer choices for the following
question?
How many times have you accessed the Internet this week?
1)
2)
3)
4)
a)
b)
never
once or twice
three or four times
more than four times
categorical
quantitative
Question 8
Which statement is correct regarding the distribution of a variable?
a)
b)
c)
d)
The distribution of a quantitative variable can be displayed using a
graphical display like a histogram.
The distribution of a categorical variable can be displayed using a
table of counts or percents.
The distribution of a variable tells us what values the variable takes
and how often it takes these values.
All of these choices are correct.
Question 9
The purpose of pie charts and bar charts is to _______________.
a)
b)
c)
d)
hide the data from the audience
impress the audience with interesting colors and shapes
help the audience grasp the distribution of the data quickly
display the distribution of quantitative variables
Question 10
Married respondents, in a survey, were asked how they met their
spouses. According to the bar chart below, approximately how many
respondents met their spouses through a date (dating) service?
a)
b)
c)
d)
2
20
30
70
Question 11
Bar graphs are more flexible than pie charts. Both types of graphs can
display the distribution of a categorical variable, but a bar graph can
also compare any set of quantities measured in the same units.
a)
b)
true
false
Question 12
Look at the following histogram of salaries of baseball players. What
shape would you say the data take?
a)
b)
c)
d)
e)
bi-modal
left-skewed
right-skewed
symmetric
uniform
Question 13
Look at the following histogram of percent urban population in 50 U.S.
states. What shape would you say the data take?
a)
b)
c)
d)
e)
bi-modal
left-skewed
right-skewed
symmetric
uniform
Question 14
Wine bottles at a small winery are sampled
and tested for quality. One measurement is
volume: the bottles should be filled to 750 ml,
with some variation expected. Based on this
histogram, does 750 ml seem to be a
reasonable center for the data?
a)
b)
c)
Yes, the center is approximately 750 ml.
No, the center is too variable.
No, the data range from 744 to 755 ml.
Question 15
Considering this histogram, about how many bottles of wine have more
than 752 ml?
a)
b)
c)
d)
17
44
6
5
Question 16
In the dataset represented by the following stemplot, in which the
leaf unit = 1.0, how many times does the number “28” occur?
0 9
1 246999
2 111134567888999
3 000112222345666699
a)
b)
c)
d)
0
1
3
4
4 001445
5 0014
6 7
7 3
Question 17
Which of these plots is a time plot?
a)
b)
c)
d)
Plot A
Plot B
Both of these answers are correct.
None of these choices is correct.
Question 18
What type of trend does the following time plot show?
a)
b)
c)
downward
upward
no trend
Question 19
What would be the correct interpretation of the following graph?
a)
b)
c)
There is an upward trend in the data.
There is a downward trend in the data.
The data show seasonal variation.
Question 20
An individual value that falls outside the overall pattern is called ____________.
a)
b)
c)
d)
the center
a shape
an outlier
a spread
Download