Chapter 1: Data Collection

1.1 Introduction to the Practice of Statistics
1.2 Observational Studies, Experiments, and Simple Random Sampling
1.3 Other Effective Sampling Methods
1.4 Sources of Errors in Sampling
1.5 The Design of Experiments
September 3, 2008
1
Definition of Statistics
Given a question, statistics is the art and science of designing studies,
collecting the data, summarizing the data, and then analyzing the data
to draw conclusions. In particular, statistics is:
• collecting data
• organizing this data
• summarizing the organized data
• analyzing the summarized data
• drawing conclusions from this analysis
Section 1.1
2
Data
Data is information that is collected about a generic population (people,
animals, machines, etc.).
In the social sciences it is usually about people: the characteristics (height,
weight, age, etc.) or attitudes (beliefs, political opinions, religion, etc.).
3
Types of Statistics
• Descriptive Statistics: This type of statistics uses graphs, tables,
charts and the calculation of various statistical measures (mean,
standard deviation, etc.) to organize and summarize information
about a population. This is material in Math 127A.
• Inferential Statistics: This type of statistics consists of techniques
(hypothesis testing, confidence intervals, etc.) to reach conclusions
about a population based upon information obtained from a subset of
the population. This is the material in Math 127B.
4
Average Yearly Temperature in Nashville
Question: Is the climate of Nashville warming?
The average temperature of Nashville is available on the National
Weather Service website for 1872-2007. The average daily
temperature is calculated by summing the highest and lowest
hourly temperatures and then dividing by 2. The monthly
average temperature is obtained by computing the average
of the daily average temperatures, and the yearly average
temperature is obtained by computing the average of the
monthly average temperatures.
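The three averaging steps can be sketched in a few lines of Python. This is only an illustration: the function names and the temperature readings are made up for the example and are not National Weather Service data.

```python
from statistics import mean

def daily_average(high, low):
    """Average daily temperature: (highest + lowest hourly reading) / 2."""
    return (high + low) / 2

def monthly_average(daily_averages):
    """Monthly average: the mean of the daily averages in that month."""
    return mean(daily_averages)

def yearly_average(monthly_averages):
    """Yearly average: the mean of the monthly averages."""
    return mean(monthly_averages)

# Hypothetical (high, low) readings in degrees Fahrenheit, two short "months".
january = [daily_average(h, l) for h, l in [(48, 30), (52, 34), (40, 26)]]
february = [daily_average(h, l) for h, l in [(55, 35), (50, 32)]]

year = yearly_average([monthly_average(january), monthly_average(february)])
print(round(year, 2))
```

Note that the yearly figure is an average of monthly averages, so every month counts equally regardless of how many days it has.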
5
Mathematica Notebook
6
The Statistical Method (QDDI)
• Question: What is the problem of interest? Identify your
research objective.
• Design: How will the data be collected? From whom? About
what?
• Description: Give the characteristics of the data. This is where
mathematics can play a major role. Summarize the data. Give
a graphical description of the data. (Descriptive Statistics)
• Inference: What does the data tell us? If you started with a
hypothesis, does the data confirm this hypothesis? (Inferential
Statistics)
7
Example
Harvard Medical School studied 22,000 male physicians to determine if
taking aspirin could prevent heart attacks. The physicians were split into
two equal groups: 11,000 would receive an aspirin per day and the other
11,000 would receive a placebo. The assignment of physicians was done
randomly. During the course of the study, 0.9% of the male physicians
who were taking aspirin had a heart attack, while 1.7% of those
taking the placebo experienced a heart attack. The researchers then used the
statistical method to predict that if all male physicians could have
participated in the study, the percentage having a heart attack would have
been lower for those taking aspirin.
8
QDDI
• Question: Does taking aspirin each day reduce the
incidence of heart attacks in male physicians?
• Design: Take sample with half taking aspirin and half
taking a placebo. This is called an experiment.
• Description: Heart attack rate: aspirin (0.9%) versus
placebo (1.7%).
• Inference: All male physicians would benefit from taking
daily aspirin.
9
Terminology of Statistics
• Population: A population is the complete collection of all elements to be
studied.
• Sample: Any subset or group of a population is called a sample.
• Variable: A variable is a characteristic of the individuals in the population
that will be analyzed.
• Parameter: A parameter is a numerical summary of a variable for the
population.
• Statistic: A statistic is a numerical summary of a variable obtained from a
sample of the population.
10
Types of Data
• Quantitative data is composed of measurements (numbers)
about the population.
• Categorical (or qualitative) data is data that can be separated
into categories and can be identified by some non-numeric
characteristic.
• Continuous data is quantitative data that can take any value in an
interval (e.g., height).
• Discrete data is quantitative data that is not continuous; it takes
separate, countable values (e.g., number of children).
11
Example
• Population: All of the students in Math 127A that are in WH 103 today.
• Sample: The students in Row 10 of the classroom.
• Variables:
– Color of eyes
– Month of birth
– Home state
– Age
– Religion
12
Example (continued)
• Data (Qualitative/Quantitative):
– Blue eyes
– October
– Georgia
– 18
– Lutheran
• Parameter:
– The average age.
– The standard deviation of heights.
• Statistics:
– The average age of students in Row 5.
– The fraction of students with blue eyes in Row 9.
13
Data for Statistical Studies
• Census: A census is a list of all individuals in a population along with certain
characteristics of each individual in the population (e.g., age, race, home
ownership, etc.).
• Observational Study: An observational study attempts to measure a
characteristic of the population by examining a sample, but does not
manipulate the sample. An observational study often uses a sample
survey to collect data.
• Experimental Study: An experiment selects a sample of the population
and manipulates one or more variables of the population. The variable
that is manipulated is called an independent variable and the variable that is
affected is called a dependent variable.
Section 1.2
14
Census Website
http://www.census.gov
15
Observational Study
Observational Study: An observational study measures
the characteristics of a population by studying a sample
of individuals. It attempts to find connections between
these characteristics without manipulation of the sample.
The study is passive or ex post facto.
16
Design of Observational Studies
17
Example of Sample Survey
Sample Survey: A random sample of 10,000 people is drawn, and each individual
is interviewed to determine information about the following variables of
the population:
• age
• race
• gender
• number of children
• income bracket ($0-$25K, $25K-$50K, ….)
• wealth bracket
• homeowner
Question: Is there a relationship between homeownership and number
of children?
18
Algorithm for Setting Up a Sample Survey
• Step 1: Identify the population from which the sample is to be drawn.
• Step 2: Compile a list of subjects in the population from which the
sample will be taken. This is called the sampling frame.
• Step 3: Specify a method for selecting subjects from the sampling
frame. This is called the sampling design.
• Step 4: Collect the data.
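The four steps above can be sketched in Python. Everything specific here is an assumption made for illustration: the population of 1,000 ID numbers, the rule that makes some residents unreachable, the sample size of 50, and the fixed seed.

```python
import random

# Step 1: identify the population (hypothetical: 1,000 residents, by ID number).
population = range(1, 1001)

# Step 2: compile the sampling frame, the list of subjects we can actually reach.
# Assumption for illustration: residents whose ID is divisible by 10 are unlisted.
sampling_frame = [pid for pid in population if pid % 10 != 0]

# Step 3: sampling design, here a simple random sample of 50 from the frame.
rng = random.Random(2008)  # fixed seed only so the sketch is repeatable
selected = rng.sample(sampling_frame, 50)

# Step 4: collect the data (placeholder: one empty record per selected subject).
responses = {pid: None for pid in selected}

print(len(sampling_frame), len(selected))
```

The gap between the population (1,000) and the frame (900) is where coverage problems come from: anyone missing from the frame can never be selected.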
19
Designed Experiments
Experimental Study: An experiment is a study in which data
is used and manipulated to determine the effects of one or
more variables (called explanatory variables) on another
variable (called the response variable). That is, the
explanatory variable is controlled to see how the response
variable changes with changes in the explanatory variable.
The conditions placed on the explanatory variable are called
treatments. In this type of study, the explanatory variable is
sometimes called a factor of the experiment.
20
Design of Experiments
21
Remark
Observational studies are useful for detecting connections between
two variables in a population. Experimental studies are useful to
determine the nature of the connection.
22
Types of Sampling
• Random (good)
• Non-random (bad)
Examples: Suppose that our population is 200 students who are seated in a
classroom of 10 rows with 20 seats per row.
If we choose as our sample the students who sit in the even-numbered rows,
then this would be a non-random sample.
Suppose that we place 10 balls each marked with a separate number (1-10)
in a bag. We would generate a random sample of 20 by choosing one of the
balls out of the bag and using the number on the ball as the row for our
sample.
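The two classroom examples can be sketched in Python. The (row, seat) labels are an assumption made for illustration, and random.randint stands in for drawing a ball from the bag.

```python
import random

ROWS, SEATS = 10, 20

# Label each of the 200 students by (row, seat).
students = [(row, seat)
            for row in range(1, ROWS + 1)
            for seat in range(1, SEATS + 1)]

# Non-random sample: every student sitting in an even-numbered row.
even_rows = [s for s in students if s[0] % 2 == 0]

# Random sample: draw one "ball" numbered 1-10 and take that whole row.
rng = random.Random()
ball = rng.randint(1, ROWS)
random_row = [s for s in students if s[0] == ball]

print(len(even_rows), ball, len(random_row))
```

The even-row sample is the same every time it is drawn, while the row chosen by the ball changes from draw to draw.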
Section 1.3
23
Simple Random Sample
Simple Random Sampling: every possible sample of a given
size has the same chance of being selected, so each
individual in the population has the same chance of being
selected as any other individual. A list of
individuals in the population from which a sample is to
be drawn is called a frame.
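A small simulation can illustrate the "equal chance" property. This is a sketch under assumed values: a frame of 10 individuals, samples of size 3, and 10,000 repetitions. Python's random.sample draws without replacement.

```python
import random
from collections import Counter

frame = list(range(1, 11))   # hypothetical frame of 10 individuals
rng = random.Random(1)       # fixed seed only so the sketch is repeatable
trials = 10_000

counts = Counter()
for _ in range(trials):
    # One simple random sample of size 3, drawn without replacement.
    counts.update(rng.sample(frame, 3))

# Each individual should land in roughly 3/10 of the samples.
proportions = {i: counts[i] / trials for i in frame}
for i in frame:
    print(i, round(proportions[i], 3))
```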
24
Two Sets of Random Numbers
Generate two sets of 100 random numbers (1 - 9):
S = {8, 1, 7, 1, 2, 7, 6, 4, 4, 5, 9, 6, 5, 4, 9, 9, 2, 4, 6, 6, 6, 7, 4, 2, 1,
8, 8, 7, 5, 9, 2, 6, 6, 7, 2, 8, 1, 4, 1, 4, 9, 2, 7, 2, 8, 7, 4, 4, 1, 9, 8,
8, 2, 9, 9, 6, 8, 6, 2, 9, 8, 1, 1, 8, 2, 9, 1, 9, 3, 9, 4, 5, 2, 2, 5, 3, 5,
1, 6, 7, 4, 6, 9, 1, 8, 4, 1, 8, 5, 9, 6, 3, 7, 5, 4, 1, 9, 9, 5, 3}

S = {1, 6, 6, 9, 3, 1, 6, 3, 5, 5, 4, 4, 4, 9, 2, 1, 1, 7, 6, 3, 2, 8, 1, 5, 4,
6, 4, 9, 8, 1, 3, 7, 5, 7, 9, 6, 1, 8, 1, 6, 8, 8, 6, 2, 5, 1, 6, 9, 6, 5, 8,
3, 5, 5, 5, 2, 8, 1, 2, 4, 2, 2, 7, 4, 2, 8, 8, 2, 4, 3, 9, 3, 7, 3, 2, 5, 1,
7, 2, 4, 1, 1, 4, 7, 4, 7, 7, 9, 9, 2, 4, 4, 9, 3, 6, 6, 6, 4, 1, 6}


Frequency Chart of Numbers
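A frequency chart of this kind can be reproduced with a short Python sketch. The slides use a Mathematica notebook. The version below is only an assumed equivalent that draws 100 integers from 1 to 9 and prints a text bar chart.

```python
import random
from collections import Counter

rng = random.Random()
numbers = [rng.randint(1, 9) for _ in range(100)]  # 100 random integers, 1-9

# Count how many times each value occurs.
freq = Counter(numbers)

# Text frequency chart: one '#' per occurrence.
for digit in range(1, 10):
    print(digit, "#" * freq[digit])
```

Running it twice gives two different charts, neither perfectly flat.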
25
Types of Samples
Simple Random Sample: A sample that is obtained by randomly choosing individuals in the
population.
Stratified Sample: A stratified sample is a sample that is obtained by separating the population
into non-overlapping groups (called strata) and then randomly selecting individuals from each
stratum.
Systematic Sample: A systematic sample is a sample that is obtained by selecting individuals in
the population in a systematic way, e.g., every 5th individual.
Cluster Sample: A cluster sample is a sample that is obtained by selecting all individuals within a
randomly selected subset or group of the population.
Convenience Sample: A convenience sample is a type of sample that is drawn because it is easy
or convenient to collect. Convenience samples are likely to underrepresent portions of the
population. They may not be random and may contain bias due to time or location.
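The stratified, systematic, and cluster definitions above can be sketched in Python. The function names, the example populations, and the random starting point for the systematic sample are assumptions made for illustration.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Randomly select `per_stratum` individuals from every stratum.

    strata: dict mapping stratum name -> list of individuals.
    """
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, k, rng):
    """Select every k-th individual, starting from a random position."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every individual in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [individual for name in chosen for individual in clusters[name]]

rng = random.Random()

# Hypothetical populations used only to exercise the functions.
strata = {"freshmen": list(range(0, 10)), "seniors": list(range(10, 30))}
population = list(range(100))
rows = {row: [(row, seat) for seat in range(1, 21)] for row in range(1, 11)}

by_stratum = stratified_sample(strata, 3, rng)
every_fifth = systematic_sample(population, 5, rng)
two_rows = cluster_sample(rows, 2, rng)
print(len(every_fifth), len(two_rows))
```

The stratified sample draws from every group, while the cluster sample takes everyone in a few randomly chosen groups.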
Section 1.3
26
Three Main Sampling Methods
Random
Cluster
Stratified
27
Advantages of Different Random Sampling Methods
• Simple Random Sampling: Gives a good picture of the
whole population.
• Cluster Random Sampling: It is often easier and cheaper to
implement because subjects are close together and well-defined
once clusters are chosen.
• Stratified Random Sampling: Guarantees that each
stratum (segment) is sampled.
28
Sources of Errors in Sampling
Fact: Erroneous conclusions can be drawn from observational or experimental
studies due to faulty statistical design and sampling.
• Non-sampling Errors: These errors occur when the sampling process (design)
is faulty. This usually occurs when there is a problem with the sampling frame
or sampling design. In other words, preference is given to selecting some
individuals over other individuals in the population.
– response errors
– non-response errors
– processing errors
– analysis errors
– coverage errors
• Sampling or Estimation Errors: This error occurs when the sample gives an
incomplete picture of the population. This type of error is due to the fact that
we are using a sample instead of the whole population.
Section 1.4
29
Non-sampling Errors
• Response Errors: Poor questionnaire design, interview
bias, respondent errors, poor survey process. For example,
the organization of the survey could be confusing, individuals
give deceptive responses to questions, the data collector
may not speak the language of the individual to be
interviewed, etc.
• Non-response Errors: Complete or partial non-response.
For example, individuals may agree to be interviewed, but
then choose not to answer some or all of the questions.
• Processing Errors: There are computational errors in
coding, capturing, editing and presenting the final data.
• Analysis Errors: Incorrect statistical tests are applied to
the data resulting in erroneous conclusions.
• Coverage Errors: There are errors in the duplication or
omission of individuals in the sample.
30
Non-sampling Bias
Example: Suppose we are interested in the approval rating of Mayor Dean and we
will conduct a random telephone survey on whether citizens of Nashville approve
or disapprove of his job performance since he took office. Is there bias in this
sample survey?
Answer: Maybe, since it will miss citizens who do not have a telephone and this
group of people may have different opinions about the mayor than those who do
have a telephone.
31
Design of Experiments
Review from Section 1.3:
An experiment is a study for the collection of data that is used to
determine the effects of one or more variables (called explanatory
variables) on another variable (called the response variable). The
individuals from which the data is collected are called subjects or
experimental units. The conditions placed on the explanatory variable are
called treatments. In this type of study, the explanatory variable is
sometimes called a factor. An experiment is called double-blind if the
subjects and the experimenter do not know which treatments are being
administered to each subject. We say that the experiment is completely
randomized if each experimental unit is randomly assigned to a
treatment. A randomized experiment comparing medical treatments is
called a clinical trial.
Section 1.5
32
Types of Experiments
• Completely Randomized Design: Each experimental unit is
randomly assigned a treatment.
• Randomized Matched-pairs Design: Experimental units are
paired, with each experimental unit in the pair assigned a
different treatment. The matched-pair can be the same
individual so that the individual receives both treatments (e.g.,
before and after).
• Randomized Block Design: Experimental units are
grouped together in groups. Units in each group (block) are
randomly assigned treatments.
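A completely randomized assignment can be sketched in Python. The function name, the 20 unit labels, and the choice to keep group sizes as equal as possible are assumptions made for illustration.

```python
import random

def completely_randomized(units, treatments, rng):
    """Randomly assign each experimental unit to one treatment.

    The units are shuffled and then dealt out to the treatments in turn,
    so the group sizes differ by at most one.
    """
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

rng = random.Random()
units = list(range(1, 21))  # 20 hypothetical subjects
assignment = completely_randomized(units, ["aspirin", "placebo"], rng)

aspirin_group = sorted(u for u, t in assignment.items() if t == "aspirin")
print(aspirin_group)
```

Shuffling first means no characteristic of a subject can influence which treatment that subject receives.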
33
Example
Object of Study: Does aspirin reduce the heart attack rate?
Population: Male physicians in the U.S.
Sample: 22,071 male physicians between the ages of 40 and 84.
Study: The sample was split in two groups. One group took an aspirin per
day and the other group took a placebo. The doctors were randomly
assigned to these two groups. The doctors were monitored over a 5 year
period.
Explanatory Variable: aspirin: yes or no (categorical)
Response Variable: heart attack: yes or no (categorical)
Type of Experiment: Completely randomized design.
34
Example (continued)
           Heart attack
           Yes      No       Total
Aspirin    104      10,933   11,037
Placebo    189      10,845   11,034
Total      293      21,778   22,071
This is an experiment and the aspirin/placebo are the
treatments. We manipulated the explanatory variable
to see the effect on the response variable.
35
Example (continued)
Fraction of Heart Attacks for both Treatments
           Yes      No       Total
Aspirin    0.0094   0.9906   1.0
Placebo    0.0171   0.9829   1.0
36
Example (continued)
Conclusion from Study: The heart attack rate per 1000 male physicians
is 9.4 for those taking aspirin and 17.1 for those taking the placebo.
Hence, we would conclude that taking aspirin reduces the heart attack
rate.
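The fractions and the rates per 1000 follow directly from the counts in the table. A short Python check, using only those counts (the variable names are illustrative):

```python
# (heart attack: yes, heart attack: no) counts from the study table.
counts = {"Aspirin": (104, 10933), "Placebo": (189, 10845)}

rates_per_1000 = {}
for treatment, (yes, no) in counts.items():
    total = yes + no
    fraction = yes / total                     # fraction with a heart attack
    rates_per_1000[treatment] = 1000 * fraction
    print(treatment, total, round(fraction, 4),
          round(rates_per_1000[treatment], 1))

# Aspirin: 104 / 11,037 = 0.0094, i.e. 9.4 per 1000
# Placebo: 189 / 11,034 = 0.0171, i.e. 17.1 per 1000
```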
37
Matched-pairs Designs
A matched-pair design experiment is a study where there are only two
treatments and experimental units are matched. One experimental unit receives
one treatment and the other experimental unit receives the second treatment.
The pairs may be the same individual (before treatment and after treatment) or it
may be two individuals who have similar characteristics (e.g., gender, age, etc.).
The assignment of the treatments to each pair should be random.
38
Example of Matched-Pairs
Purpose: Study the effect of taking caffeine one half hour before
swimming.
Sample: 50 randomly chosen swimmers.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Matched-pair Design: The 50 swimmers are selected. Each swimmer is randomly
given the caffeine pill or the placebo and swims one mile with the time recorded. After 1
week, the same 50 swimmers return and are given the treatment that they did not
receive the previous week. They swim the mile and the time is recorded. Each
swimmer’s time with the caffeine pill is compared against his or her time with the placebo.
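The randomization and the paired comparison can be sketched in Python. The swimmer labels, the coin-flip ordering, and the sample times are assumptions made for illustration, not data from the study.

```python
import random

rng = random.Random()
swimmers = [f"swimmer_{i}" for i in range(1, 51)]

# Week 1 treatment is chosen at random; week 2 is the other treatment.
order = {}
for s in swimmers:
    first = rng.choice(["caffeine", "placebo"])
    second = "placebo" if first == "caffeine" else "caffeine"
    order[s] = (first, second)

def paired_differences(times):
    """Caffeine time minus placebo time for each swimmer.

    times: dict mapping swimmer -> {"caffeine": minutes, "placebo": minutes}.
    A negative difference means the swimmer was faster with caffeine.
    """
    return {s: t["caffeine"] - t["placebo"] for s, t in times.items()}

# Hypothetical mile times (minutes) for two swimmers.
example_times = {
    "swimmer_1": {"caffeine": 30.0, "placebo": 31.5},
    "swimmer_2": {"caffeine": 28.0, "placebo": 27.5},
}
print(paired_differences(example_times))
```

Because each swimmer serves as his or her own comparison, differences in ability between swimmers drop out of the analysis.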
39
Blocks and Block Designs
• A collection of experimental units that have the same (or similar) values on a key
variable is called a block. In the previous example, each subject (person) is a block.
• Experimental units are divided into groups (blocks) and each treatment is randomly
assigned to one or more of the units in each block. In other words, a block design
identifies blocks before the start of the experiment and assigns subjects to
treatments within those blocks.
• To reduce bias, the order of treatments within each block is randomized, and we call
this a randomized block design.
• A matched-pair design is a special type of block design. Here each pair of
experimental units forms a block.
• In a block design study, an experimental unit (subject) may receive only one
treatment.
40
Example of Block Design
Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 swimmers: 16 males who swim competitively, 14 males who do not
swim competitively, 8 females who swim competitively and 12 females who do not swim
competitively.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Randomized Block Design: We create four blocks (16, 14, 8, 12 subjects). Within
each block, individuals are randomly assigned either the caffeine pill or the placebo.
Each subject’s swim time is recorded. The times of the swimmers within each block, as
well as across the blocks, are compared (caffeine pill versus placebo).
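The within-block randomization can be sketched in Python. The block sizes come from the example. The subject labels and the even split between caffeine and placebo inside each block are assumptions, since the slide does not state the split.

```python
import random

block_sizes = {
    "male competitive": 16,
    "male non-competitive": 14,
    "female competitive": 8,
    "female non-competitive": 12,
}

def assign_within_blocks(block_sizes, rng):
    """Randomly split each block in half: caffeine versus placebo."""
    assignment = {}
    for block, size in block_sizes.items():
        subjects = [f"{block} #{i}" for i in range(1, size + 1)]
        rng.shuffle(subjects)          # randomize inside this block only
        half = size // 2
        for s in subjects[:half]:
            assignment[s] = (block, "caffeine")
        for s in subjects[half:]:
            assignment[s] = (block, "placebo")
    return assignment

rng = random.Random()
assignment = assign_within_blocks(block_sizes, rng)

for block, size in block_sizes.items():
    n_caffeine = sum(1 for b, t in assignment.values()
                     if b == block and t == "caffeine")
    print(block, size, n_caffeine)
```

Randomizing separately inside each block guarantees that both treatments appear in every block, which a single randomization over all 50 swimmers would not.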
41
What type of experiment?
A drug company wanted to test a new arthritis medication. The
researchers found 200 adults aged 25-35 and randomly assigned them to
two groups. The first group received the new drug, while the second
received a placebo. After one month of treatment, the percentage of each
group whose arthritis symptoms decreased was recorded and compared
with their original condition. What type of experimental design is this?
42
What type of experiment?
A medical journal published the results of an experiment on insomnia.
The experiment investigated the effects of a controversial new therapy for
insomnia. Researchers measured the insomnia levels of 86 adult women
who suffer moderate conditions of the disorder. After the therapy, the
researchers again measured the women's insomnia levels. The
differences between the pre- and post-therapy insomnia levels were
reported. What type of experimental design is this?
43
What type of experiment?
A farmer wishes to test the effects of a new fertilizer on her tomato yield.
She has four equal-sized plots of land--one with sandy soil, one with rocky
soil, one with clay-rich soil, and one with average soil. She divides each of
the four plots into three equal-sized portions and randomly labels them A,
B, and C. The four A portions of land are treated with her old fertilizer. The
four B portions are treated with the new fertilizer, and the four C's are
treated with no fertilizer. At harvest time, the tomato yield is recorded for
each section of land. What type of experimental design is this?
44
What type of experiment?
A random sample of 1,000 overweight male adults is recruited. Each male
is weighed and his weight is recorded. Each individual is given a diet and
is told to follow it for one month. After one month, each individual is
weighed again and his weight is recorded. The “before” and “after” weights are compared. What type of
experimental design is this?
45
What type of experiment?
A random sample of 30 Vanderbilt students is selected. We are interested in
the reaction times when using or not using a cell phone during driving. Each
student’s reaction time was measured when he or she was using or not
using a cell phone on a driving course in a Vanderbilt parking lot. What type
of experimental design is this?
46