Week 8 Lecture Notes (Sampling)

I.4 Sampling Lecture Notes
1. Statistical Thinking
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” – H. G. Wells, author of “The War of the Worlds”
Definition: Statistics is the science of collecting, analyzing, and interpreting
data in such a way that the conclusions can be objectively evaluated.
2. Three Phases of Statistics
• Collect the data
• Analyze the data
– order the data
– graphical displays
– numerical calculations (such as mean and standard deviation)
• Interpret the results
– use proper statistical techniques to substantiate or refute hypothesized statements
– match data to the appropriate technique
– determine whether the proper assumptions are satisfied
3. Two types of statistics
• Descriptive statistics – summarize and describe a characteristic for
some group
• Inferential statistics – estimate, infer, predict, or conclude something
about a larger group
4. Examples
• Descriptive
– Batting Average
– Yards Per Carry
– Test Scores
• Inferential
– Polls
– Medical Studies
– Market Surveys
5. Two types of data
• Quantitative data – values recorded on a natural numerical scale
• Qualitative data – classified into categories
6. Quantitative Data
• Weight of subjects in medical sample
• Height of buildings in Chicago
• Daily temperatures at Antarctica Weather Station
7. Qualitative Data
• Gender of subjects in medical sample
• Political affiliation of respondents in a poll survey
• Class (fresh, soph, jr, sr) of Math 101 students
8. Vocabulary
• The population is the entire set of objects (people or things) under
consideration.
• A sample is a subset of the population that is available for the analysis.
• A bias is a favoring of certain outcomes over others.
• A census collects data from each member of the population.
• A statistic is a statement of numerical information about a sample.
• A parameter is a statement of numerical information about a population.
9. Census versus Sample
Would you use a census or a sample to determine the following:
• Project the winner of an election
• Calculate a baseball player’s batting average
• Predict whether it will rain tomorrow
• Test whether the soup is too salty
• Calculate Shaq’s free throw average
• Use a market study to determine a new flavor of toothpaste
• Report the Dow Jones Average
• Generalize a medical study to other groups
• The average score on the first test
10. Dealing with bias
Bias in some form occurs in the collecting of most, if not all, sets of data.
The bias may come from
• the portion of the population surveyed
• the phrasing of the questions
11. Examples
• “Dewey defeats Truman” projection of Chicago Tribune based on 1948
telephone poll
• “Are you in favor of Illinois banning cell phones in cars? Dial *91 on
your cellular phone to vote.”
• “Do you feel budget cuts are more important than humanitarian programs that would need to be cut to obtain a balanced budget?”
12. Methods for Choosing Samples
• Judgement Sample
– Use the opinion of person(s) deemed qualified to choose members
of the sample.
– Example: to investigate study habits of athletes, ask their coaches
and teachers.
• Simple Random Selection
– Use random numbers to select the sample.
– Page 315 Random Digit Table:
72985547555515086461
• Stratified Sampling
– Divide the population into relatively homogeneous groups, draw a
sample from each group, and take their union.
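The last two methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster of 40 students, the function names, and the choice of two students per class are all hypothetical, and Python's pseudo-random generator stands in for the random digit table.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members, each subset of size k being equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, k_per_stratum, seed=None):
    """Group members by stratum, sample within each group, return the union."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        # Never ask for more members than the stratum contains.
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

# Hypothetical roster: (student id, class year), 10 students per class.
classes = ("fresh", "soph", "jr", "sr")
roster = [(i, classes[i % 4]) for i in range(40)]

srs = simple_random_sample(roster, 8, seed=1)
strat = stratified_sample(roster, lambda student: student[1], 2, seed=1)
```

The simple random sample may by chance over-represent one class, while the stratified sample always contains exactly two students from each class.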
13. Goals of a good sample
• from the correct population
• chosen in an unbiased way
• large enough to reflect total population
14. Normal Distribution of Random Events
Toss a coin 100 times and count the number of heads.
How many heads would you expect?
• about 50
• exactly 50
It does not seem reasonable that the count will be exactly 50.
We would not be surprised if the number of heads turned out to be 48 or
51 or even 55.
We would be surprised to see 80 heads, and would begin to suspect that the
coin was not fair.
15. Coin Toss Data
Experiment: A coin is tossed n = 100 times.
The experiment is repeated 1000 times.
Here are the results:
16. Frequency Table: No. of Heads
Heads  Freq     Heads  Freq     Heads  Freq
1      0        45     54       58     27
⋮      0        46     49       59     19
34     0        47     54       60     11
35     2        48     66       61     11
36     2        49     89       62     5
37     2        50     70       63     4
38     2        51     77       64     2
39     5        52     85       65     0
40     14       53     62       66     0
41     16       54     57       67     1
42     25       55     52       68     0
43     30       56     40       ⋮      0
44     31       57     36       100    0
mean = 50.296
standard deviation = 5.100
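The two summary values can be checked directly from the frequency table. The sketch below assumes the standard deviation is the population form (dividing by the 1000 repetitions rather than by 999); the variable names are illustrative.

```python
import math

# Nonzero frequencies copied from the table; every other count of heads is 0.
freq = {
    35: 2, 36: 2, 37: 2, 38: 2, 39: 5, 40: 14, 41: 16, 42: 25, 43: 30,
    44: 31, 45: 54, 46: 49, 47: 54, 48: 66, 49: 89, 50: 70, 51: 77,
    52: 85, 53: 62, 54: 57, 55: 52, 56: 40, 57: 36, 58: 27, 59: 19,
    60: 11, 61: 11, 62: 5, 63: 4, 64: 2, 67: 1,
}

total = sum(freq.values())                                # number of repetitions
mean = sum(h * f for h, f in freq.items()) / total        # weighted mean
variance = sum(f * (h - mean) ** 2 for h, f in freq.items()) / total
std_dev = math.sqrt(variance)

print(total, round(mean, 3), round(std_dev, 3))  # 1000 50.296 5.1
```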
17. Coin Toss Histogram
[Figure: histogram of the number of heads in each run of 100 tosses; horizontal axis marked at 30, 40, 50, 60, 70]
18. Sampling Distributions
If we could examine all possible samples of size n from a population, the
frequency distribution of the means of these samples would be approximately
normal (for sufficiently large n).
• $\mu$ = the mean over the entire population
• $\sigma$ = the standard deviation over the entire population
• $\bar{x}$ = the mean of the sampling distribution
• $\sigma_{\bar{x}}$ = the standard deviation of the sampling distribution
19. Two Rules
Rule 1. $\bar{x} = \mu$
Rule 2. $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
We are assuming in Rule 2 that the size of the entire population is much
larger than the sample size n.
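Both rules can be checked exactly on a tiny example by listing every possible sample. The sketch below makes two assumptions that are not in the notes: the population is the six faces of a die, and samples are drawn with replacement, which plays the role of a population much larger than the sample.

```python
import itertools
import math

population = [1, 2, 3, 4, 5, 6]   # hypothetical population (faces of a die)
n = 2                             # sample size

# Mean and standard deviation over the entire population.
mu = sum(population) / len(population)
sigma = math.sqrt(sum((v - mu) ** 2 for v in population) / len(population))

# Every possible ordered sample of size n, drawn with replacement.
sample_means = [sum(s) / n for s in itertools.product(population, repeat=n)]

# Mean and standard deviation of the sampling distribution.
mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)
)

print(mu, mean_of_means)                  # Rule 1: these agree
print(sigma / math.sqrt(n), sd_of_means)  # Rule 2: these agree
```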
20. Two Outcome Situations
Situation: Two outcomes (for–against; heads–tails; yes–no)
p = percent in favor
q = percent opposed
Written as decimals
p+q =1
Why?
21. Example
• 29% of Americans favor Bush’s handling of the War in Iraq, while 71% do not.
• p = .29, q = .71
• p + q = .29 + .71 = 1
22. Quantifying the Data
• We count a “for” (or yes) vote as $X_1 = 1$ and an “against” (or no) vote as $X_2 = 0$.
• Out of 100 people, we would expect 100p yes votes and 100q no votes.
23. To calculate the mean
Outcome (out of 100 cases):
Vote              Frequency   Freq × $X_i$
$X_1 = 1$ (yes)   100p        100p
$X_2 = 0$ (no)    100q        0
Total                         100p
So the mean is
$\mu = \frac{100p}{100} = p$
24. Standard Deviation
Out of 100 cases,
Vote        Freq   $(X_i - \mu)^2$   Freq × $(X_i - \mu)^2$
$X_1 = 1$   100p   $(1 - p)^2$       $100p(1 - p)^2$
$X_2 = 0$   100q   $(0 - p)^2$       $100q(0 - p)^2$
Total                                $100p(1 - p)^2 + 100q(0 - p)^2$
25. Calculating standard deviation
First divide the Total by n = 100 cases:
$\frac{\text{Total}}{100} = p(1 - p)^2 + q(0 - p)^2$
$= p(1 - p)^2 + qp^2$
$= pq^2 + qp^2$    [because 1 − p = q]
$= pq(q + p)$
$= pq$    [because p + q = 1]
Then to get $\sigma$, take the square root:
$\sigma = \sqrt{pq}$
26. The p–q Rule
Suppose a coin has probability p of landing heads and q = 1 − p of landing
tails.
(A value other than $p = \frac{1}{2}$ means the coin is not “fair.”)
The variable X that records a head (X = 1) versus a tail (X = 0) has
mean $\mu = p$ and standard deviation $\sigma = \sqrt{pq}$.
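As a quick check, the rule can be compared with a direct calculation on a list of 0s and 1s. This is a minimal sketch; the function names and the choice of 100 cases are illustrative assumptions.

```python
import math

def pq_rule(p):
    """Mean and standard deviation of a 0/1 outcome via the p-q rule."""
    q = 1 - p
    return p, math.sqrt(p * q)

def direct_calculation(p, cases=100):
    """Same quantities computed from an explicit list of 1s (yes) and 0s (no)."""
    yes = round(cases * p)
    votes = [1] * yes + [0] * (cases - yes)
    mean = sum(votes) / cases
    sd = math.sqrt(sum((x - mean) ** 2 for x in votes) / cases)
    return mean, sd

for p in (0.5, 0.29):
    print(p, pq_rule(p), direct_calculation(p))
```

For each value of p the two calculations agree, which is what the derivation of $\sigma = \sqrt{pq}$ predicts.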
27. Bush Popularity Example
29% think Bush is doing a good job
71% do not
p = .29 and q = .71
$\mu = p = .29$
$\sigma = \sqrt{pq} = \sqrt{(.29)(.71)} = .4538$
28. Fair Coin Toss
Heads = 1, Tails = 0
With a fair coin, we expect the percentage of heads to be 50%:
p = .5 and q = .5
$\mu = p = .5$
$\sigma = \sqrt{pq} = \sqrt{(.5)(.5)} = \sqrt{.25} = .5$
29. Percents versus Actual Numbers
Sometimes our calculations are in terms of percents and sometimes they are
given as actual numbers.
For example, suppose we flip a coin 340 times.
We would expect to have roughly 170 heads (and 170 tails).
We expect the percentage of heads to be $\frac{170}{340} = \frac{1}{2}$, or 50%.
p = 0.5 is the number used in our formulas (along with q = 0.5).
To convert from the percentage to the actual number of expected heads,
simply multiply p by n.
In this case, we expect $\frac{1}{2} \times 340 = 170$ heads.
30. Percents versus Actual Numbers Cont’d
The p–q formula computes the standard deviation $\sigma$ for the population when
we are thinking in terms of percents.
The formula $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ computes the standard error of the mean when we are
thinking in terms of percents.
To convert to actual numbers, multiply $\sigma_{\bar{x}}$ by n.
By properties of the square root function,
$\frac{\sigma}{\sqrt{n}} \cdot n = \sigma \cdot \sqrt{n}$
31. Percents versus Actual Numbers
Flip a coin 340 times and count the number of heads.
Mean and Standard Deviation for the Entire Population
$\mu = \frac{1}{2} = 0.5$    $\sigma = \sqrt{(.5 \times .5)} = 0.5$
Mean and Standard Deviation for Sample Size of n = 340 tosses
In terms of percents:
$\bar{x} = \mu = 0.5$    $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{340}} = .027$
In terms of actual numbers, multiply by n = 340:
mean = 0.5 × 340 = 170    stan. dev. = $\sigma_{\bar{x}} \times 340 = \sigma \sqrt{340} = 9.22$
32. Interpretation
Since the sampling distribution is approximately normal with mean 170 and
standard deviation 9.2, the 68–95–99.7 rule tells us:
If you flip a fair coin 340 times, you would expect the number of heads to be
between 161 and 179 about 68% of the time [1 standard deviation]
between 152 and 188 about 95% of the time [2 standard deviations]
between 142 and 198 about 99.7% of the time [3 standard deviations]
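The three ranges follow mechanically from p and n, as the sketch below shows. The function name is hypothetical, and rounding the endpoints to whole heads is an assumption about how the ranges above were obtained.

```python
import math

def heads_count_intervals(n, p=0.5):
    """Ranges of 1, 2, and 3 standard deviations around the expected heads."""
    q = 1 - p
    sigma = math.sqrt(p * q)          # population standard deviation
    std_error = sigma / math.sqrt(n)  # in terms of percents
    mean_count = p * n                # in terms of actual numbers
    sd_count = std_error * n          # same as sigma * sqrt(n)
    return {
        k: (mean_count - k * sd_count, mean_count + k * sd_count)
        for k in (1, 2, 3)
    }

for k, (low, high) in heads_count_intervals(340).items():
    print(k, round(low), round(high))
# 1 161 179
# 2 152 188
# 3 142 198
```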
33. Coin–Toss Model
• Suppose a coin has probability p of landing heads and q = 1 − p of
landing tails.
• Suppose we flip the coin n times and record x, the number of heads for
each sample.
• The values of x will be approximately normally distributed, with mean and
standard deviation given as follows:
              Population             Sample Distribution         Sample Distribution
              Distribution           (Percents)                  (Actual Numbers)
Mean          p                      p                           p·n
Stan. Dev.    $\sigma = \sqrt{pq}$   $\frac{\sigma}{\sqrt{n}}$   $\sigma \cdot \sqrt{n}$
34. Comparison with Previous Experiment
Toss a coin n = 100 times
             Actual Value   Predicted Value
Mean         50.296         50
Stan. Dev.   5.100          5
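An experiment of the same shape is easy to rerun. The following is a minimal simulation sketch, assuming Python's pseudo-random generator as the coin and an arbitrary seed; it will not reproduce the exact figures in the table, only values close to the predicted 50 and 5.

```python
import random
import statistics

def simulate_head_counts(n_tosses=100, repetitions=1000, p=0.5, seed=0):
    """Return the number of heads in each of `repetitions` runs of n_tosses."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n_tosses))
        for _ in range(repetitions)
    ]

counts = simulate_head_counts()

# Compare with the predicted values p*n = 50 and sigma*sqrt(n) = 5.
print(statistics.mean(counts), statistics.pstdev(counts))
```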