Chapter 2
Turning Data
Into
Information
Copyright ©2011 Brooks/Cole, Cengage Learning
2.1 Raw Data
• Raw data are numbers and category labels
that have been collected but have not yet been
processed in any way.
• Example list of questions and raw data for a student:
2.1 Raw Data
• An observation is an individual entity in a study.
• A variable is a characteristic that may differ
among individuals.
• Sample data are collected from a subset of a
larger population.
• Population data are collected when all individuals
in a population are measured.
• A statistic is a summary measure of sample data.
• A parameter is a summary measure of population data.
2.2 Types of Variables
• Raw data from categorical variables consist of
group or category names that don’t necessarily
have a logical ordering. Examples: eye color,
country of residence.
• Categorical variables for which the categories
have a logical ordering are called ordinal
variables. Examples: highest educational degree
earned, tee shirt size (S, M, L, XL).
• Raw data from quantitative variables consist
of numerical values taken on each individual.
Examples: height, number of siblings.
Asking the Right Questions
One Categorical Variable
Example: What percentage of college students
favor the legalization of marijuana,
and what percentage of college students
oppose legalization of marijuana?
Ask: How many and what percentage of
individuals fall into each category?
Asking the Right Questions
Two Categorical Variables
Example: In Case Study 1.6, we asked if the risk
of having a heart attack was different for the
physicians who took aspirin than for those who
took a placebo.
Ask: Is there a relationship between the two
variables? Does the chance of falling into a
particular category for one variable depend
on which category an individual is in for the
other variable?
Asking the Right Questions
One Quantitative Variable
Example: What is the average body temperature
for adults, and how much variability is there in
body temperature measurements?
Ask: What are the interesting summary measures,
like the average or the range of values?
Asking the Right Questions
One Categorical and One Quantitative Variable
Example: Do men and women drive at the same
“fastest speeds” on average?
Ask: Are the measurements similar across
categories or do they differ? Could be asked
regarding the averages or the ranges.
Asking the Right Questions
Two Quantitative Variables
Example: Does average body temperature
change as people age?
Ask: Are these variables related so that when
measurements are high (or low) on one
variable the measurements for the other
variable also tend to be high (or low)?
Explanatory and Response Variables
Many questions are about the relationship between
two variables.
It is useful to identify one variable as the
explanatory variable and the other variable
as the response variable.
In general, the value of the explanatory variable
for an individual is thought to partially explain the
value of the response variable for that individual.
2.3 Summarizing One or Two
Categorical Variables
Numerical Summaries
• Count how many fall into each category.
• Calculate the percent in each category.
• If two variables, have the categories of
the explanatory variable define the rows
and compute row percentages.
Example 2.1
Seatbelt Use by Twelfth-Graders
2003 nationwide survey of American HS students
“How often do you wear a seatbelt when driving a car?”
• Total sample size n = 3042 students.
• A majority, 1686/3042 = .554, or 55.4%, said they
always wear a seatbelt, while 115/3042 = .038,
or 3.8%, said they never wear a seatbelt.
• Rarely or never: 8.2% + 3.8% = 12%
Example 2.1
Seatbelt Use by Twelfth-Graders
Are females more likely to say they always wear a seatbelt?
Are males more likely to say they rarely or never wear a seatbelt?
• Females: 915/1467 = 62.4% said they always wear a seatbelt.
• Males: 771/1575 = 49.0% said they always wear a seatbelt.
• Males: 10.5% + 5.7% = 16.2% rarely or never wear one.
• Females: 5.7% + 1.7% = 7.4% rarely or never wear one.
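The group percentages in this example can be checked with a few lines of Python. This is only a sketch: the counts come from the example, while the dictionary layout and the helper name `percent` are illustrative choices, not part of the original slides.

```python
# Counts from Example 2.1: students who said "always", and each group's total.
always = {"Females": 915, "Males": 771}
totals = {"Females": 1467, "Males": 1575}


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage within each group (sex plays the role of the explanatory variable).
always_pct = {group: percent(always[group], totals[group]) for group in totals}

# Overall percentage across both groups combined.
overall_pct = percent(sum(always.values()), sum(totals.values()))

print(always_pct)
print(overall_pct)
```

Computing the percentage within each group, rather than out of the whole sample, is what makes the female and male figures directly comparable.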
2.3 Summarizing One or Two
Categorical Variables
Frequency and Relative Frequency
• A frequency distribution for a categorical
variable is a listing of all categories along with
their frequencies (counts).
• A relative frequency distribution is a listing of all
categories along with their relative frequencies
(given as proportions or percentages, for example).
Example 2.2 Lighting the Way
to Nearsightedness
Survey of n = 479 children.
Those who slept with nightlight or in fully lit
room before age 2 had higher incidence of
nearsightedness (myopia) later in childhood.
Note: Study does not prove sleeping with light
actually caused myopia in more children.
Visual Summaries
for Categorical Variables
• Pie Charts: useful for summarizing
a single categorical variable if not
too many categories.
• Bar Graphs: useful for summarizing
one or two categorical variables and
particularly useful for making comparisons
when there are two categorical variables.
Example 2.3 Humans Are Not
Good Randomizers
Survey of n = 190 college students.
“Randomly pick a number between 1 and 10.”
Results: Most chose 7, very few chose 1 or 10.
Example 2.4 Revisiting Nightlights
and Nearsightedness
Survey of n = 479 children.
Response: Degree of Myopia
Explanatory: Amount of Sleeptime Lighting
2.4 Finding Information
in Quantitative Data
Long list of numbers – needs to be organized
to obtain answers to questions of interest.
Five-Number Summaries
• Find extremes (high, low),
the median, and the quartiles
(medians of lower and upper
halves of the values).
• Quick overview of the data values.
• Information about the center,
spread, and shape of data.
Example 2.5 Right Handspans
• Majority of females had handspans between 19 and 21 cm,
and many males had handspans between 21.5 and 23 cm.
• Two females with unusually small handspans.
Example 2.5 Right Handspans
• About 25% of handspans of females are
between 12.5 and 19.0 centimeters,
• about 25% are between 19 and 20 cm,
• about 25% are between 20 and 21 cm, and
• about 25% are between 21 and 23.25 cm.
Summary Features of
Quantitative Variables
• Location: center or average. e.g. median
• Spread: variability e.g. difference between
two extremes or two quartiles.
• Shape: clumped in middle or on one end
(more later)
• Outliers: a data point that is not consistent
with the bulk of the data
Example 2.6 Annual Compensation
for Highest Paid CEOs in the United States
Paid compensation (in millions of $) for 50 highest-paid
CEOs in 2008 for Fortune Magazine’s Top 500 companies
Median: ~ $35.6 million
Minimum: $24.3 million
Maximum: $557 million (perhaps outlier?)
Shape: most clumped on lower end (= skewed)
Example 2.7 Ages of Death
of U.S. First Ladies
Partial Data Listing and five-number summary:
Extremes are more interesting here:
Who died at 34? Martha Jefferson
Who lived to be 97? Bess Truman
Pictures for Quantitative Data
• Histograms: similar to bar graphs, used
for any number of data values.
• Stem-and-leaf plots and dotplots:
present all individual values, useful for
small to moderate sized data sets.
• Boxplot or box-and-whisker plot:
useful summary for comparing two
or more groups.
Interpreting Histograms, Stemplots,
and Dotplots
• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23 cm.
Creating a Histogram
Step 1: Decide how many equally spaced (same
width) intervals to use for the horizontal axis.
Between 6 and 15 intervals is a good number.
Step 2: Decide whether to use frequencies (counts) or relative
frequencies (proportions) on the vertical axis.
Step 3: Draw equally spaced intervals on horizontal
axis covering entire range of the data. Determine
frequency or relative frequency of data values in
each interval and draw a bar with corresponding
height. Decide on a rule for values that fall on
the border between two intervals.
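The three steps can be sketched in Python as follows. The data are the song counts from Example 2.9 in this chapter; the function name `histogram_counts`, the interval width of 1000, and the border rule are assumptions chosen for illustration.

```python
# Song counts from Example 2.9 in this chapter.
songs = [2510, 500, 500, 1300, 687, 600, 500, 2600, 30, 900, 800, 0, 750,
         1500, 1500, 2400, 800, 2017, 1150, 5000, 4000, 1250, 1700, 3305]


def histogram_counts(values, start, width, num_bins):
    """Count values in equal-width intervals beginning at `start`.

    Border rule (an assumption): a value on a border goes into the interval
    to its right; values at or beyond the top edge go into the last interval.
    Assumes every value is at least `start`.
    """
    counts = [0] * num_bins
    for v in values:
        k = int((v - start) // width)
        k = min(k, num_bins - 1)  # keep top-edge values in the last interval
        counts[k] += 1
    return counts


# Step 1: six intervals of width 1000 cover the range 0 to 5000.
counts = histogram_counts(songs, start=0, width=1000, num_bins=6)

# Step 2: relative frequencies are the counts divided by n.
relative = [c / len(songs) for c in counts]

# Step 3: draw one bar per interval (here, a row of '#' characters).
for k, c in enumerate(counts):
    print(f"{k * 1000:>4}-{(k + 1) * 1000:<4} {'#' * c}")
```

Changing `width` or `start` changes the picture, which is why two histograms of the same data can look different.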
Example 2.8 Ages of Death of First Ladies
Two different histograms
Creating a Dotplot
• Draw a number line (horizontal axis)
to cover range from smallest to largest
data value.
• For each observation, place a dot
above the number line located at the
observation’s data value.
• When multiple observations have the
same value, stack the dots vertically.
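A rough text version of a dotplot can be built in Python. This sketch assumes integer-valued data and uses the six female handspans from the notation example later in this chapter; the function name `text_dotplot` is hypothetical.

```python
from collections import Counter

# Handspans (cm) for six females, from the notation example in this chapter.
handspans = [21, 19, 20, 20, 22, 19]


def text_dotplot(values):
    """Return the lines of a text dotplot for integer-valued data."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest stack down to the number line.
    for level in range(tallest, 0, -1):
        row = "".join(" o " if counts[v] >= level else "   "
                      for v in range(low, high + 1))
        lines.append(row.rstrip())
    # The number line covers every integer from the smallest to largest value.
    lines.append("".join(f"{v:^3}" for v in range(low, high + 1)).rstrip())
    return lines


print("\n".join(text_dotplot(handspans)))
```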
Creating a Stem-and-Leaf Plot
Step 1: Determine stem values. The “stem”
contains all but the last of the displayed digits
of a number. Stems should define equally
spaced intervals.
Step 2: For each individual, attach a “leaf”
to the appropriate stem. A “leaf” is the last
of the displayed digits of a number. Often
leaves are ordered on each stem.
Note: More than one way to define stems.
Can use split-stems or truncate/round values first.
Example 2.9 Big Music Collection
How many songs on iPod or MP3?
2510, 500, 500, 1300, 687, 600, 500, 2600, 30, 900, 800, 0, 750,
1500, 1500, 2400, 800, 2017, 1150, 5000, 4000, 1250, 1700, 3305
Final two digits truncated
• 2510: stem label of 2 and leaf value of 5
• 500: stem label of 0 and leaf value of 5
• 30: stem value is 0 and leaf value also 0
Two stems for each 1000s possibility:
• first = leaf values 0, 1, 2, 3, 4
• second = leaf values 5, 6, 7, 8, 9
Shape is skewed right
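The truncation and split-stem rules used in this example can be sketched in Python. The function name `split_stem_plot` and the `stem | leaves` text layout are assumptions for illustration; the data and the rules come from the example.

```python
# Song counts from Example 2.9.
songs = [2510, 500, 500, 1300, 687, 600, 500, 2600, 30, 900, 800, 0, 750,
         1500, 1500, 2400, 800, 2017, 1150, 5000, 4000, 1250, 1700, 3305]


def split_stem_plot(values):
    """Stem = thousands digit, leaf = hundreds digit (last two digits truncated).

    Each thousands value gets two stems: leaves 0-4 on the first line
    and leaves 5-9 on the second.
    """
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v // 100, 10)  # truncate two digits, then split
        half = 0 if leaf <= 4 else 1
        rows.setdefault((stem, half), []).append(leaf)
    top_stem = max(stem for stem, _ in rows)
    lines = []
    for stem in range(top_stem + 1):
        for half in (0, 1):
            leaves = "".join(str(leaf) for leaf in rows.get((stem, half), []))
            lines.append(f"{stem} | {leaves}")
    return lines


print("\n".join(split_stem_plot(songs)))
```

Because the values are sorted before leaves are attached, the leaves on each stem come out in order, and the long first two rows make the right skew visible.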
Describing Shape
• Symmetric, bell-shaped
• Symmetric, not bell-shaped
• Skewed Right: values trail off to right
• Skewed Left: values trail off to left
Boxplots: Picturing Location and
Spread for Group Comparisons
• Box covers the middle
50% of the data
• Line within box marks
the median value
• Possible outliers are
marked with asterisk
• Apart from outliers, lines
extending from box reach
to min and max values.
Using Visual Displays
• To illustrate location and spread,
any of the pictures work well.
• To illustrate shape,
histograms and stem-and-leaf plots are best.
• To see individual values,
use stem-and-leaf plots and dotplots.
• To sort values,
use stem-and-leaf plots.
• To compare groups,
use side-by-side boxplots.
• To identify outliers
using the standard definition, use a boxplot.
2.5 Numerical Summaries
of Quantitative Data
Notation for Raw Data:
n = number of individuals in a data set
$x_1, x_2, x_3, \ldots, x_n$ represent individual raw data values
Example: A data set consists of handspan
values in centimeters for six females;
the values are 21, 19, 20, 20, 22, and 19.
Then, n = 6
$x_1 = 21$, $x_2 = 19$, $x_3 = 20$, $x_4 = 20$, $x_5 = 22$, and $x_6 = 19$
Describing the Location
of a Data Set
• Mean: the numerical average
• Median: the middle value (if n odd)
or the average of the middle two
values (n even)
Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median
Determining the Mean and Median
The Mean
$$\bar{x} = \frac{\sum x_i}{n}$$
where $\sum x_i$ means “add together all the values”
The Median
If n is odd: M = middle of ordered values.
Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values.
Average values that are (n/2) and (n/2) + 1
down from top of ordered list.
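The counting rules for the mean and median translate directly into code. This is a minimal sketch using the six handspans from the notation example; the function names are illustrative, and Python's built-in `statistics` module offers equivalent `mean` and `median` functions.

```python
def mean(values):
    """Add together all the values and divide by n."""
    return sum(values) / len(values)


def median(values):
    """Middle ordered value if n is odd, else the average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Count (n + 1)/2 down the ordered list (shifted to 0-based indexing).
        return ordered[(n + 1) // 2 - 1]
    # Average the values at positions n/2 and n/2 + 1 (1-based).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


handspans = [21, 19, 20, 20, 22, 19]
print(mean(handspans), median(handspans))
```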
Example 2.12 Will “Normal” Rainfall
Get Rid of Those Odors?
Data: Average rainfall (inches)
for Davis, California for 47 years
Mean = 18.69 inches
Median = 16.72 inches
In 1997-98, a company
with odor problem blamed
it on excessive rain.
That year rainfall was
29.69 inches. More rain
occurred in 4 other years.
The Influence of Outliers
on the Mean and Median
Larger influence on mean than median.
High outliers will increase the mean.
Low outliers will decrease the mean.
If ages at death are: 76, 78, 80, 82, and 84
then mean = median = 80 years.
If ages at death are: 46, 78, 80, 82, and 84
then median = 80 but mean = 74 years.
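The two sets of ages can be compared with Python's standard `statistics` module. The variable names here are illustrative only.

```python
from statistics import mean, median

ages = [76, 78, 80, 82, 84]
ages_with_low_outlier = [46, 78, 80, 82, 84]  # 76 replaced by 46

for label, data in [("original", ages), ("with outlier", ages_with_low_outlier)]:
    # The median depends only on the middle ordered value, so it does not move;
    # the mean uses every value, so the low outlier pulls it down.
    print(label, "mean =", mean(data), "median =", median(data))
```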
Describing Spread: Range
and Interquartile Range
• Range = high value – low value
• Interquartile Range (IQR) =
upper quartile – lower quartile
• Standard Deviation
(covered later in Section 2.7)
Example 2.13 Fastest Speeds Ever Driven
Five-Number Summary for 87 males
• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data
Range = 150 – 55 = 95 mph
• Two quartiles describe spread over middle 50% of data
Interquartile Range = 120 – 95 = 25 mph
Notation and Finding the Quartiles
Split the ordered values into the half
that is below the median and the half
that is above the median.
Q1 = lower quartile
= median of data values
that are below the median
Q3 = upper quartile
= median of data values
that are above the median
Example 2.14 Fastest Speeds for Men
Ordered Data (in rows of 10 values) for the 87 males:
• Median = value at position (87+1)/2 = 44 in the ordered list = 110 mph
• Q1 = median of the 43 values below the median =
value at position (43+1)/2 = 22 from the start of the list = 95 mph
• Q3 = median of the 43 values above the median =
value at position (43+1)/2 = 22 from the end of the list = 120 mph
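The position arithmetic in this example, and the median-of-halves rule behind it, can be sketched in Python. The function names are hypothetical, and the short list passed to `quartiles` is made-up demonstration data, since the 87 raw speeds are not listed here.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule.

    The ordered list is split by position; when n is odd, the median
    itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Positions for the 87 ordered speeds in Example 2.14.
n = 87
median_pos = (n + 1) // 2          # position of the median
values_below = median_pos - 1      # number of values below the median
q1_pos = (values_below + 1) // 2   # position of Q1, counted from the start

print(median_pos, values_below, q1_pos)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # made-up data for demonstration
```

Statistical software often uses slightly different interpolation rules for quartiles, so results from a library function may not match this rule exactly.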
How to Draw a Boxplot
and Identify Outliers
Step 1: Label either a vertical axis or a horizontal axis
with numbers from min to max of the data.
Step 2: Draw box with lower end at Q1 and upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Calculate IQR = Q3 – Q1.
Step 5: Draw a line from Q1 end of box to smallest data value
that is not further than 1.5 × IQR from Q1.
Draw a line from Q3 end of box to largest data value
that is not further than 1.5 × IQR from Q3.
Step 6: Mark data points further than 1.5 × IQR from either
edge of the box with an asterisk. Points represented
with asterisks are considered to be outliers.
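Steps 4 through 6 can be sketched in Python as below. The data are the Cambridge crew weights from Example 2.16 later in this chapter; the function name `boxplot_summary` and the returned dictionary keys are assumptions for illustration, and quartiles use the median-of-halves rule described in this chapter.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def boxplot_summary(values):
    """Compute the quantities needed to draw a boxplot and flag outliers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1                                  # Step 4
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "Q1": q1, "M": median(ordered), "Q3": q3, "IQR": iqr,
        "whisker_low": inside[0],                  # Step 5: end of lower line
        "whisker_high": inside[-1],                # Step 5: end of upper line
        "outliers": outliers,                      # Step 6: asterisks
    }


# Cambridge weights (pounds) from Example 2.16.
cambridge = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]
summary = boxplot_summary(cambridge)
print(summary)
```

For these weights the only point beyond 1.5 × IQR from the box is 109.0, so the lower line stops at 178.5 rather than at the minimum.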
Percentiles
The kth percentile is a number that has
k% of the data values at or below it and
(100 – k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile
2.6 How to Handle Outliers
Outlier: a data point that is not
consistent with the bulk of the data.
• Look for them via graphs.
• Can have big influence on conclusions.
• Can cause complications in some
statistical analyses.
• Cannot discard without justification.
Possible Reasons for Outliers
and Reasonable Actions
• Outlier is legitimate data value and represents natural
variability for the group and variable(s) measured.
Values may not be discarded — they provide important
information about location and spread.
• Mistake made while taking measurement or entering it
into computer. If verified, should be discarded/corrected.
• Individual in question belongs to a different group
than bulk of individuals measured. Values may be
discarded if summary is desired and reported for the
majority group only.
Example 2.16 Tiny Boatsmen
Weights (in pounds) of 18 men on crew team:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0,
203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5,
202.5, 174.0, 183.0, 109.5
Note: last weight in each list is unusually small.
They are the coxswains for their teams,
while others are rowers.
2.7 Bell-Shaped Distributions
of Numbers
Many measurements follow a predictable pattern:
• Most individuals are clumped around the center
• The greater the distance a value is from the
center, the fewer individuals have that value.
Variables that follow such a pattern are said
to be “bell-shaped”. A special case is called
a normal distribution or normal curve.
Example 2.17 Bell-Shaped
British Women’s Heights
Data: representative sample of 199 married British couples.
The histogram of the wives’ heights is shown with a normal
curve superimposed. The mean height = 1602 millimeters.
Describing Spread
with Standard Deviation
Standard deviation measures variability
by summarizing how far individual
data values are from the mean.
Think of the standard deviation as
roughly the average distance
values fall from the mean.
Describing Spread
with Standard Deviation
Both sets have same mean of 100.
Set 1: all values are equal to the mean so there is
no variability at all.
Set 2: one value equals the mean and other four values
are 10 points away from the mean, so the average
distance away from the mean is about 10.
Calculating the Standard Deviation
Formula for the (sample) standard deviation:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
The value of $s^2$ is called the (sample) variance.
An equivalent formula, easier to compute, is:
$$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$$
Calculating the Standard Deviation
Step 1: Calculate $\bar{x}$, the sample mean.
Step 2: For each observation, calculate the
difference between the data value
and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3,
and then divide this sum by n – 1.
Step 5: Take the square root of the value in step 4.
Example 2.18 Calculating
a Standard Deviation
Consider four pulse rates: 62, 68, 74, 76
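Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3: The differences from the mean are −8, −2, 4, and 6;
their squares are 64, 4, 16, and 36, which sum to 120.
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} \approx 6.3$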
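The five steps, applied to the four pulse rates, can be sketched in Python. The function names are illustrative; `sample_sd_shortcut` implements the equivalent computing formula, so the two results should agree up to floating-point rounding.

```python
from math import isclose, sqrt


def sample_sd(values):
    """Sample standard deviation, following Steps 1 to 5."""
    n = len(values)
    xbar = sum(values) / n                           # Step 1: sample mean
    diffs = [x - xbar for x in values]               # Step 2: differences
    squared = [d ** 2 for d in diffs]                # Step 3: squared differences
    variance = sum(squared) / (n - 1)                # Step 4: divide sum by n - 1
    return sqrt(variance)                            # Step 5: square root


def sample_sd_shortcut(values):
    """Equivalent formula: sqrt((sum of x^2 - n * xbar^2) / (n - 1))."""
    n = len(values)
    xbar = sum(values) / n
    return sqrt((sum(x * x for x in values) - n * xbar ** 2) / (n - 1))


pulse = [62, 68, 74, 76]
s = sample_sd(pulse)
print(round(s ** 2, 1), round(s, 1))
print(isclose(s, sample_sd_shortcut(pulse)))
```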
Population Mean and Standard Deviation
Data sets usually represent a sample from a larger
population. If the data set includes measurements for
an entire population, the notations for the mean and
standard deviation are different, and the formula for
the standard deviation is also slightly different.
A population mean is represented by the symbol μ
(“mu”), and the population standard deviation is
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$
Interpreting the Standard Deviation
for Bell-Shaped Curves:
The Empirical Rule
For any bell-shaped curve, approximately
• 68% of the values fall within 1 standard
deviation of the mean in either direction
• 95% of the values fall within 2 standard
deviations of the mean in either direction
• 99.7% of the values fall within 3 standard
deviations of the mean in either direction
Note: ~0.3% fall farther than 3 standard deviations from mean
Example 2.19 Women’s Heights revisited
Mean height for the 199 British women is 1602 mm
and standard deviation is 62.4 mm.
• 68% of the 199 heights would fall in the range
1602 ± 62.4, or 1539.6 to 1664.4 mm
• 95% of the heights would fall in the interval
1602 ± 2(62.4), or 1477.2 to 1726.8 mm
• 99.7% of the heights would fall in the interval
1602 ± 3(62.4), or 1414.8 to 1789.2 mm
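The three intervals can be reproduced with a short Python sketch. The mean and standard deviation are from the example; the function name `empirical_intervals` and the dictionary it returns are illustrative assumptions.

```python
def empirical_intervals(mean, sd):
    """Return the mean ± 1, 2, and 3 standard deviation intervals,
    keyed by the approximate percentage each should contain."""
    return {
        pct: (mean - k * sd, mean + k * sd)
        for k, pct in ((1, 68), (2, 95), (3, 99.7))
    }


intervals = empirical_intervals(mean=1602.0, sd=62.4)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% between {low:.1f} and {high:.1f} mm")
```

These are the ranges the Empirical Rule predicts; how well they match depends on how close the actual heights are to a bell shape.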
Example 2.19 Women’s Heights revisited
Note: Not perfect, but follows Empirical Rule quite well
The Empirical Rule, the Standard
Deviation, and the Range
• Empirical Rule => the range from the
minimum to the maximum data values equals
about 4 to 6 standard deviations for data with
an approximate bell shape.
• You can get a rough idea of the value of the
standard deviation by dividing the range by 6.
$$s \approx \frac{\text{Range}}{6}$$
Standardized z-Scores
Standardized score or z-score:
$$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$$
Example: Mean resting pulse rate for adult men is 70
beats per minute (bpm), standard deviation is 8 bpm.
The standardized score for a resting pulse rate of 80:
$$z = \frac{80 - 70}{8} = 1.25$$
A pulse rate of 80 is 1.25 standard deviations
above the mean pulse rate for adult men.
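The z-score calculation is a one-line function in Python. The name `z_score` and its parameter names are illustrative; the numbers are those of the pulse-rate example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


z = z_score(80, mean=70, sd=8)
print(z)

# A value below the mean gives a negative z-score.
print(z_score(62, mean=70, sd=8))
```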
The Empirical Rule Restated
For bell-shaped data,
• About 68% of values have z-scores between –1 and +1.
• About 95% of values have z-scores between –2 and +2.
• About 99.7% of values have z-scores between –3 and +3.