Lecture 2

Statistics 111 - Lecture 2
Collecting Data
Surveys and Sampling / Graphs of a Single Variable
May 27, 2008
Stat 111 – Lecture 2
Sampling and Graphing
Administrative Notes
• Lecture notes on website
• Office hours today from 3-4:30pm
• Homework 1 available on website
• Due at beginning of class on Monday, June 1
• JMP “how to” guide for the homework on website
Outline for First Half of Lecture
• Introduction to Sampling
• Voluntary Response Samples
• Simple Random Samples
• Sources of Sampling Bias
• More complicated sampling schemes
• Preview of Inference
• Bias versus Variability
• Read: Section 3.3
Survey Definitions
• Population: entire group of objects or people
about which information is sought
• Census: survey of an entire population
• Sample: survey that examines only a portion
of the population
• Parameter: a numerical characteristic of the
population
• Statistic: a numerical characteristic of the
sample
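A minimal sketch of the parameter/statistic distinction. The population of heights, the seed, and the sample size of 50 are all made up for illustration:

```python
import random
import statistics

random.seed(111)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (in inches) of 10,000 people.
population = [random.gauss(67, 4) for _ in range(10_000)]

# Parameter: a numerical characteristic of the whole population
# (what a census would measure).
parameter = statistics.mean(population)

# Statistic: the same characteristic computed from a sample only.
sample = random.sample(population, 50)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The statistic will usually be close to, but not equal to, the parameter, and it changes from sample to sample.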
Why Sample?
• Expense: cheaper than a census
• Nielsen ratings: based on 5000 out of an estimated
105.5 million US households with TVs
• Time: quicker than a census
• Exit polls: give news agencies valuable (?)
information on election day in order to project
the election before all votes (census) are counted
• Sampled units must sometimes be destroyed
(or changed) to measure characteristics
• Reliability studies: testing lifetime of light bulbs,
strength of windshields, etc.
Sampling Bias
• Systematic errors that result in a sample
that is not representative of the overall
population of interest
• Just like in experiments, we must be
cautious of potential sources of bias in
our sampling results
Voluntary Response Samples
• People choose to be included in sample
themselves by responding to a general appeal
• E.g., Amazon consumer ratings
• Results are often biased because people with
strong opinions (usually negative) are more
likely to respond and be included in the sample
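A small simulation can make the bias concrete. This is only a sketch: the ratings population and the response probabilities in `responds` are assumptions chosen so that strong (especially negative) opinions respond more often.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 customers with ratings from 1 to 5.
ratings = [random.choice([1, 2, 3, 4, 5]) for _ in range(100_000)]

def responds(rating):
    """Assumed behavior: strong opinions respond more often,
    and unhappy customers respond most of all."""
    chance = {1: 0.30, 2: 0.05, 3: 0.01, 4: 0.05, 5: 0.15}[rating]
    return random.random() < chance

# Voluntary response sample: people select themselves.
voluntary = [r for r in ratings if responds(r)]

# Simple random sample of the same size, for comparison.
srs = random.sample(ratings, len(voluntary))

pop_mean = statistics.mean(ratings)
voluntary_mean = statistics.mean(voluntary)
srs_mean = statistics.mean(srs)

print(f"population mean rating:  {pop_mean:.2f}")
print(f"voluntary response mean: {voluntary_mean:.2f}")
print(f"simple random sample:    {srs_mean:.2f}")
```

Under these assumptions the voluntary sample understates the average rating, while the random sample of the same size stays close to it. Collecting more voluntary responses would not remove the gap.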
Hite Report: Women and Love (1987)
• Hite mailed 100,000 questionnaires to groups of women
professionals, counseling centers, church societies,
senior citizens centers. Only 4.5% were returned
• 84% of women are “not satisfied emotionally with their
relationships” (p. 804)
• 70% of all women “married five or more years are
having sex outside of their marriages” (p. 856)
• 95% of women “report forms of emotional and
psychological harassment from men with whom they are
in love relationships” (p. 810)
• 84% of women report forms of condescension from the
men in their love relationships (p. 809)
Simple Random Sampling (SRS)
• Just as an experiment can be improved by
randomization, so can sampling
• Each individual in the population has an equal
chance of being included in the sample
• Does not allow self-selection or evaluators to
influence the makeup of the sample (similar to
double-blinding in experiments)
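Drawing a simple random sample might look like the sketch below. The sampling frame of 500 ID numbers, the seed, and the sample size of 25 are assumptions for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 500 students.
population = list(range(1, 501))

rng = random.Random(2008)  # seeded generator so the draw can be repeated

# rng.sample draws without replacement; every individual has the same
# chance (here 25/500 = 5%) of ending up in the sample.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Neither the respondents nor the person running the survey decides who is included; the random number generator does.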
Example: Presidential Elections
• In 1912 Literary Digest began using surveys to predict US
presidential elections
• “The poll represents 30 years constant evolution and perfection...”
• In the 1936 Roosevelt vs Landon election, they polled 10
million voters:
• 1,293,669 said they would vote for Landon
• 972,897 said they would vote for Roosevelt
• Reality: Landslide victory (61% to 37%) for FDR
• What went wrong?
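The counts above already hint at the problem. A quick sketch of the arithmetic, using only the numbers on this slide (the variable names are illustrative, and only the two listed candidates are counted as responses):

```python
voters_polled = 10_000_000
landon = 1_293_669
roosevelt = 972_897

responses = landon + roosevelt              # answers for either candidate
response_rate = responses / voters_polled
landon_share = landon / responses
roosevelt_share = roosevelt / responses

print(f"responses counted: {responses:,}")
print(f"response rate: {response_rate:.1%}")
print(f"Landon: {landon_share:.1%}, Roosevelt: {roosevelt_share:.1%}")
```

Fewer than a quarter of those polled answered, and among those who did, Landon led by about 57% to 43%, the opposite of the actual result.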
Biases in Random Samples
• Randomization doesn’t correct for certain problems with
sampling
• Bias 1: Undercoverage: some groups in the population
are left out of the process of choosing the sample
• Bias 2: Nonresponse: sampled individuals cannot be
contacted or do not cooperate
• E.g., 1936 presidential polls
• Low response rate: less than 25% of those polled responded
• Undercoverage of poorer demographics: sample of voters relied
heavily on lists of automobile and telephone owners, which were
generally more affluent voters
• Well, at least we learned from those mistakes, right?
Recent Presidential Elections
• Using exit polls, several networks reported early that Gore
would win Florida in the 2000 election
• Using exit polls, several pundits predicted Kerry would win
Ohio in the 2004 election
• In general, we have gotten better, but we can still make
mistakes (especially when the difference itself is so small)
More Potential Problems with Surveys
• Response Bias: respondents may not answer
truthfully to survey questions
• Illegal or unpopular behavior such as drug usage
• Controversial topics such as teen sexual activity
• Race or gender of interviewer can influence answers
about race or gender-related questions
• Respondents often have trouble remembering past
events, e.g., yearly nutrition and health surveys
More Potential Problems with Surveys
• Wording of questions can be confusing or
intentionally lead the respondent
• Do you favor a ban on disposable diapers?
• It is estimated that disposable diapers account for
less than 2% of the trash in today’s landfills. In
contrast, beverage containers, third-class mail and
yard wastes account for 21% of the trash in landfills.
Given this, would it be fair to ban disposable diapers?
• Complicated multi-part forms that require lots of
skipped questions lead to a drop off in response
More Complicated Random Surveys
• Weakness of simple random sampling is that
you cannot use extra information about
population (similar to blocking in experiments)
• What if you know a particular group is missing from
your sample?
• Stratified random sampling: individuals are
divided into groups called strata
• Simple random sampling done within each stratum
• National surveys can be even more complicated
by using multistage sampling (cheaper)
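Stratified random sampling can be sketched in two steps: split the population into strata, then take a simple random sample within each. The strata names, their sizes, and the 10% allocation below are assumptions for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

rng = random.Random(111)  # seeded so the draw can be repeated

# Hypothetical population: (person_id, stratum) pairs with unequal strata.
population = (
    [(i, "urban") for i in range(0, 600)]
    + [(i, "suburban") for i in range(600, 900)]
    + [(i, "rural") for i in range(900, 1000)]
)

# Step 1: divide individuals into strata.
strata = defaultdict(list)
for person_id, stratum in population:
    strata[stratum].append(person_id)

# Step 2: simple random sample within each stratum.
# Assumption: proportional allocation, 10% of every stratum.
sample = {
    name: rng.sample(members, k=len(members) // 10)
    for name, members in strata.items()
}

for name, chosen in sample.items():
    print(name, len(chosen))
```

Because every stratum is sampled separately, no group can be missing from the sample by chance, which a simple random sample of 100 cannot guarantee.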
Dinner and Drugs Study
• Study by CASA that linked frequent family dining
to reduced risk of substance abuse
“There is no more important thing that a parent can do”
• Some problems with study that relate to what we
know about surveys and observational studies
• Problem 1: undercoverage of minority groups
• Survey not representative of teen population
Dinner and Drugs Study II
• Problem 2: high level of non-response in survey
• Many households declined to answer, didn’t complete
survey or denied permission to use
• Problem 3: observational study with lots of
potential confounding variables
• Drug use itself wasn’t measured, but rather a risk score
for drug use
• Study isn’t adjusted for age, which is also associated
with drug use
• No proof of causation!
After Break
• Exploring Data: Graphical summaries of
a single variable
• Moore, McCabe and Craig: Section 1.1
Break!
• 5 minutes
• More awesome statistics to come
Outline for Second Half of Lecture
• Characteristics of Distributions
• Center, spread, shape, outliers
• Plotting Distributions of Data
• Boxplots
• Histograms (no stem and leaf plots)
• Density Curves
• Read: Section 1.1
January 29, 2008
Stat 111 - Lecture 4 - Graphing
Definitions
• Variable: any characteristic that takes
different values for different individuals
• Categorical variables place an individual
into one of several groups
• Examples: gender, race
• Quantitative variables take on numerical
values that are usually considered as
continuous
• Examples: height, age, wages
Distributions
• A distribution describes what values a variable
takes and how frequently these values occur.
• The distribution of a variable can be described
graphically and numerically in terms of:
• Center: where are most of the values located?
• Spread: how variable are the values?
• Shape: is the distribution symmetric or skewed?
Are there multiple peaks or just one?
• Outliers: are there certain values that seem
surprisingly large or small?
Barplots and Pie Charts
• For categorical variables, we can graph the distribution
using bar plots and pie charts
Barplots and Pie Charts
• Pie charts are generally not as useful as
bar plots
• Need to have all categories to make a pie
chart
• harder to compare subsets of categories
• Scale of pie charts can sometimes be
misleading
• harder to see small differences
Boxplots
• Box plots are an effective tool for conveying
information about continuous variables
• Box contains the central 50% of the data, with a line
indicating the median
• Median is the value with 50% of data on either side
• Whiskers contain most of the rest of the data, except
for suspected outliers
• Outliers are suspiciously large or small values
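The quantities a boxplot draws can be computed directly. This is a sketch on made-up shoe sizes (not the class data); the 1.5 × IQR rule for flagging outliers is a common convention assumed here, since the slides do not say which rule is used. It needs Python 3.8 or later for `statistics.quantiles`.

```python
import statistics

# Hypothetical shoe sizes; not the actual Stat 111 class data.
sizes = [5, 6.5, 7, 7.5, 7.5, 8, 8, 8.5, 8.5, 9, 9.5, 10, 10, 11, 14.5]

# Quartiles: the box spans q1 to q3 (the central 50% of the data),
# with a line at the median.
q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR beyond the box
# are flagged as suspected outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in sizes if x < low_fence or x > high_fence]

# Whiskers reach to the most extreme values that are not flagged.
inside = [x for x in sizes if low_fence <= x <= high_fence]
whiskers = (min(inside), max(inside))

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whiskers}, suspected outliers: {outliers}")
```

For this made-up data the box runs from 7.5 to 9.75 with the median at 8.5, the whiskers reach 5 and 11, and 14.5 is flagged.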
Boxplot: Shoe Size of Stat 111 Class
• Almost all values are between 5 and 13
• 50% of values are between 7.5 and 10
• Center (Median) is around 8.5
• Couple of suspected outliers: 14 and 14.5
Summary of Boxplots
• Useful for displaying center and spread of a
distribution, as well as potential outliers
• However, boxplot doesn’t really give us much
of an idea of the shape of the distribution
• Histograms are much better graphical
summaries of shape
• We’ll see boxplots again in Chapter 2, for
comparing distributions across groups
Histograms
• Histograms emphasize frequency of different
values in the distribution
• X-axis: Values are divided into bins
• Y-axis: Height of each bar is the frequency with which values
from that bin appear in the dataset
Another Example: Height in Stat 111
• Vertical axis is sometimes the density (or
relative frequency): equal to the frequency
of the bin divided by the total number of observations
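Binning, frequency, and relative frequency can be sketched in a few lines. The heights, the bin width of 3 inches, and the starting edge of 60 are assumptions for illustration, not the class data.

```python
from collections import Counter

# Hypothetical heights in inches; not the actual class data.
heights = [61, 62, 64, 64, 65, 66, 66, 67, 67, 68, 69, 70, 71, 72, 75]

bin_width = 3   # assumed bin width
start = 60      # assumed left edge of the first bin

# Assign each value to the bin [left, left + bin_width) and count.
counts = Counter(
    start + ((h - start) // bin_width) * bin_width for h in heights
)

n = len(heights)
for left in sorted(counts):
    freq = counts[left]
    rel = freq / n  # relative frequency: bin count / total observations
    bar = "#" * freq
    print(f"[{left}, {left + bin_width}): {bar:<6} freq={freq} rel={rel:.2f}")
```

The frequencies add up to the number of observations and the relative frequencies add up to 1, so switching the vertical axis changes the scale but not the shape.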
Histograms versus Boxplots
• Both graphs give a good idea of the spread
• Boxplots may be a little clearer in terms of the center
and outliers in a distribution
[Figure annotations: center, outliers, spread of likely values]
Histograms versus Boxplots
• Histograms much more effective at displaying the
shape of a distribution
• Skewness: departure from left-right symmetry
• Multi-modality: presence of multiple high frequency values
[Figure annotations: clearly not symmetric; not symmetric?; possible second peak?]
Symmetry - Histograms vs. Boxplots
Density Curves
• Often easier to examine a distribution with a
smooth curve instead of a histogram
• Example: vocabulary scores from 947 seventh
graders in Gary, Indiana
Example with Test Score Data
• Number of scores less than 6 in population is
287 out of 947, so relative frequency is 0.303
• Using a density curve (normal distribution),
the approximate frequency is 0.293
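The comparison above can be sketched as follows. The counts 287 and 947 come from the slide; the mean and standard deviation of the normal curve are not given here, so `assumed_mean` and `assumed_sd` are hypothetical placeholders and will not necessarily reproduce the 0.293 quoted above.

```python
import math

# Observed relative frequency from the counts quoted above.
below_6 = 287
total = 947
relative_frequency = below_6 / total

def normal_cdf(x, mean, sd):
    """Area under a normal density curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Hypothetical parameters, chosen only to illustrate the calculation;
# the slide does not state the fitted mean and standard deviation.
assumed_mean = 6.8
assumed_sd = 1.5
approx = normal_cdf(6, assumed_mean, assumed_sd)

print(f"observed relative frequency: {relative_frequency:.3f}")
print(f"normal approximation (assumed parameters): {approx:.3f}")
```

The first number is a count divided by a total; the second is an area under a smooth curve. The density curve is useful when the two agree closely.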
Approximations
• Real data will never exactly fit a density curve,
i.e., be exactly symmetric or normally distributed
• We will talk later in course about how to fit these
density curves and we will use them to make
probability calculations
Next Class - Lecture 3
• Using JMP
• Exploring Data: Numerical summaries
of a single variable
• Moore and McCabe: Section 1.2