Monday, January 27: 8 - Whittier Union High School District

Chapter 18: Sampling Distributions.1
___________________________ is the process of drawing conclusions about a population based on
data from a sample of that population.
Any quantity computed from the values in a sample is called a _____________. The corresponding
population characteristic is usually called a ___________________.
Note: Since we use “statistics” to estimate “parameters,” “statistics” are also called “estimates.”
The problem with using statistics to estimate parameters is that our estimates usually aren’t correct. For
example, suppose that we know the mean weight of all students at LSHS is 135 pounds (i.e., μ = 135).
However, if we were to take a random sample of 50 students, we may get ______________________.
We need to answer the question: “How good is our estimate?”
Since the value of the statistic (estimate) depends on the sample selected, the value of the statistic will
vary from sample to sample. This variability is called ____________________________.
Note: Sampling variability is different from the variability we have studied so far.
If y = weight of a LSHS student, then y has variability (different students have different weights). This
is NOT sampling variability.
If ȳ = the average weight of a sample of 50 LSHS students, then ȳ also has variability (different
samples will produce different averages). This is sampling variability.
Thus, statistics (estimates) are variable while parameters (population characteristics) are constant.
The distribution of a statistic (estimate) is called its ______________________________. It is the
distribution of all the values of the statistic based on all possible samples of the same size.
Note: The important features of a sampling distribution are its ______________________________.
Suppose that our class is the population and we are taking samples of size 3 to estimate the true average
height (μ).
y = height (inches)
ȳ = average height of 3 students
The sampling distribution of ȳ lists all possible values of ȳ and how frequently they will occur.
Note: each possible statistic (range, standard deviation, median, proportion of males, etc.) has its own
sampling distribution
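To make the idea concrete, here is a minimal sketch that lists every possible sample of size 3 from a tiny population and tabulates the resulting values of ȳ. The five heights are made-up values used only for illustration, not class data.

```python
from itertools import combinations
from collections import Counter
from statistics import mean

# Hypothetical population of heights in inches (illustrative values only).
population = [62, 64, 66, 68, 70]

def sampling_distribution_of_mean(values, n):
    """Count how often each sample mean occurs among all samples of size n
    drawn without replacement from values."""
    return Counter(round(mean(sample), 2) for sample in combinations(values, n))

dist = sampling_distribution_of_mean(population, 3)
for ybar, count in sorted(dist.items()):
    print(f"ybar = {ybar:5.2f}  occurs {count} time(s)")
```

With a real class the list of samples is far too long to write out by hand, which is why simulation is usually used instead.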
HW #1 Textbook intro
How Many Textbooks??
During World War II, Allied forces noticed that all German tanks were labeled with a serial number.
After observing numerous tanks, it became obvious that the tanks were numbered systematically. Thus,
the allies were able to predict the total number of German tanks and their locations using the observed
serial numbers. This became known as the German Tank Problem.
Fortunately for our troops in the Middle East, enemy forces do not have any tanks to speak of.
However, the methods used to solve the German Tank Problem can be applied to many other situations.
In this assignment, your job is to develop a method to estimate the total number of AP Statistics books
(or any other textbook) at our school based on a random sample of books and the assumption that the
books are numbered sequentially starting from 1.
For example, suppose a random sample of n = 7 books gives the following numbers: 10, 38, 59, 61, 74,
90, 94. How can you use this information to estimate the total number of books (N)? One possible
method would be to take the median value and double it. In this example, median • 2 = 61 • 2 = 122.
Thus, our estimate for the total number of books is N = 122.
For this activity you will create three other statistics to estimate the total number of books. Remember, a
statistic is any quantity computed from the values in a sample. You may use any combination of the
summary statistics we already know (mean, median, min, max, quartiles, IQR, standard deviation, etc.)
or invent your own. The goal is to find a relatively simple statistic that reliably predicts the total number
of books.
To determine which of your three statistics gives the best predictions, you will perform a simulation to
generate sampling distributions for each of your three statistics. For the purposes of the simulation,
assume that there are 100 books total (N = 100) and that you will be taking samples of size 7 (n = 7).
For each run (do 25 runs), generate 7 random numbers from 1-100 and compute the values of each of
your three statistics.
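One way to carry out the simulation is sketched below. Only the "2 × median" statistic comes from this handout; the other two entries are hypothetical placeholders standing in for whatever statistics your group invents. The sketch also assumes the 7 book numbers are drawn without replacement, since the same book cannot be sampled twice.

```python
import random
from statistics import mean, median

N = 100      # true number of books in the simulation (from the assignment)
n = 7        # sample size
RUNS = 25    # number of simulated samples

# Candidate statistics. Only "2 * median" is from the handout; the other
# two are hypothetical stand-ins for a group's own ideas.
statistics_to_try = {
    "2 * median": lambda s: 2 * median(s),
    "2 * mean (hypothetical)": lambda s: 2 * mean(s),
    "max (hypothetical)": lambda s: max(s),
}

def simulate(runs=RUNS, seed=None):
    """Return {statistic name: list of estimates}, one estimate per run."""
    rng = random.Random(seed)
    results = {name: [] for name in statistics_to_try}
    for _ in range(runs):
        # Assumption: books are distinct, so sample without replacement.
        sample = rng.sample(range(1, N + 1), n)
        for name, stat in statistics_to_try.items():
            results[name].append(stat(sample))
    return results

results = simulate(seed=1)
for name, estimates in results.items():
    print(f"{name:26s} mean = {mean(estimates):6.1f}  "
          f"min = {min(estimates):6.1f}  max = {max(estimates):6.1f}")
```

Each list in `results` holds the 25 values needed for one dotplot. A statistic whose values center near 100 and stay close together is a good candidate.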
What you need to turn in:
• An introduction describing the choice of your 3 statistics
• Three dotplots showing the sampling distributions for each of your statistics. Make sure they are
on the same scale so they can be easily compared.
• A verbal description/comparison of the three distributions.
• A description of which statistic you think is best and why.
Due _________________ (one per group)
NOTE: _______________ in class we will have a competition to see which group has the best statistic.
The nominating order will be randomly selected and no statistic will be allowed more than once. During
each round of the competition, I will secretly choose N and then give you a sample of n numbers from 1 to N. Each group will compute the value of their statistic based on the sample of n numbers and the closest
estimate gets 3 points, the second closest gets 2 points, and the third closest gets 1 point. After several
rounds, the team with the most points will get 2 extra credit points per member.
Chapter 18: Modeling the Distribution of Sample Proportions.2
One of the major tasks of a statistician is to discover the proportion of individuals in a population who
possess a certain property or characteristic.
Unfortunately, without conducting a census, we will never know the true value of p, the ____________.
Instead, statisticians will estimate p by taking a sample and computing the ____________________, p̂ .
$$\hat{p} = \frac{\text{number of "successes" in the sample}}{\text{sample size}} = \frac{x}{n}$$
Note: since x is the number of successes, x is a binomial random variable
Suppose that 22% of registered voters in California are senior citizens
If we were to take a sample of size 100 and find the sample proportion of senior citizens ( p̂ ), what
values might we get?
What does the distribution of p̂ look like?
What about the shape of the distribution? Recall that the binomial distribution can be well approximated
by a normal distribution when both np and nq are at least 10.
General Properties of the Sampling Distribution of a Sample Proportion:
Let p̂ be the proportion of successes in a random sample of size n from a population whose proportion
of successes is p. Denote the mean value of the p̂ distribution by $\mu_{\hat{p}}$ and the standard deviation of the
p̂ distribution by $\sigma_{\hat{p}}$. Then,
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$
3. When n is large and p is not too close to 0 or 1, the sampling distribution of p̂ is approximately
normal.
Conditions:
a. The data is from a random sample from the population of interest
b. Large sample size: both np and nq are at least 10 (so the normal model is a good approximation
to the binomial)
c. Sample less than 10% of the population (so we can consider trials to be independent)
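The properties above can be checked by simulation. The sketch below uses the senior-citizen example (p = 0.22, n = 100); the number of repetitions and the random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

p, n = 0.22, 100          # values from the voter example
REPS = 5000               # number of simulated samples (arbitrary choice)

rng = random.Random(0)    # fixed seed so the run is repeatable

def sample_proportion(p, n, rng):
    """Simulate one sample of size n and return the proportion of successes."""
    successes = sum(rng.random() < p for _ in range(n))
    return successes / n

p_hats = [sample_proportion(p, n, rng) for _ in range(REPS)]

print("mean of simulated p-hats:", round(mean(p_hats), 4))
print("SD of simulated p-hats:  ", round(pstdev(p_hats), 4))
print("sqrt(pq/n):              ", round(sqrt(p * (1 - p) / n), 4))
print("np =", n * p, " nq =", n * (1 - p))
```

The simulated mean should land close to p and the simulated SD close to sqrt(pq/n). A dotplot or histogram of `p_hats` should look roughly bell shaped, since np and nq are both well above 10.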
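Note: Rules 1 and 2 follow directly from the rules for binomial random variables.
$$\mu_{\hat{p}} = \frac{\mu_x}{n} = \frac{np}{n} = p$$
$$\sigma_{\hat{p}} = \frac{\sigma_x}{n} = \frac{\sqrt{npq}}{n} = \sqrt{\frac{npq}{n^2}} = \sqrt{\frac{pq}{n}}$$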
Suppose that 43% of registered voters in California are Democrats. What is the probability that there
will be less than 30% Democrats in a random sample of size 50?
If n = 50, what is the probability that the sample proportion will be within .05 of the true proportion?
How can we make this probability bigger?
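The two probability questions above can be set up with the normal model as in the sketch below. It assumes the three conditions hold (random sample, np and nq at least 10, sample under 10% of the population) and uses Python's standard-library `NormalDist` in place of a z-table.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.43, 50
q = 1 - p

# Large-sample check: both np and nq should be at least 10.
assert n * p >= 10 and n * q >= 10

sd = sqrt(p * q / n)                   # standard deviation of p-hat
p_hat_dist = NormalDist(mu=p, sigma=sd)

# P(p-hat < 0.30)
prob_below_30 = p_hat_dist.cdf(0.30)

# P(p - 0.05 < p-hat < p + 0.05)
prob_within_05 = p_hat_dist.cdf(p + 0.05) - p_hat_dist.cdf(p - 0.05)

print(f"SD of p-hat     = {sd:.4f}")
print(f"P(p-hat < 0.30) = {prob_below_30:.4f}")
print(f"P(within 0.05)  = {prob_within_05:.4f}")

# A larger n shrinks the SD, so more of the distribution falls within 0.05.
for bigger_n in (50, 200, 800):
    d = NormalDist(p, sqrt(p * q / bigger_n))
    print(bigger_n, round(d.cdf(p + 0.05) - d.cdf(p - 0.05), 4))
```

The loop at the end suggests the answer to the last question: the only thing we control is n, and increasing it concentrates p̂ around p.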
Note: If the large sample size condition isn’t met (or if you just don’t like the normal distribution), you
can still use the binomial distribution. However, in later chapters we will need to use the normal
approximation so we will use it whenever we can from now on.
HW #1: Problems: 18.1-10 all
Chapter 18: The Sampling Distribution of the Sample Mean.3
General Properties of the Sampling Distribution of a Sample Mean (ȳ):
Let ȳ denote the mean of the observations in a random sample of size n from a population having mean
$\mu_y$ and standard deviation $\sigma_y$. Denote the mean value of the ȳ distribution by $\mu_{\bar{y}}$ and the standard
deviation of the ȳ distribution by $\sigma_{\bar{y}}$. Then, the following rules hold:
1. $\mu_{\bar{y}} = \mu_y$
2. $\sigma_{\bar{y}} = \dfrac{\sigma_y}{\sqrt{n}}$
3. When the population distribution of y is normal, the sampling distribution of ȳ is also normal
for any size n.
4. Central Limit Theorem: When n is sufficiently large, the sampling distribution of ȳ is well
approximated by a normal curve, even when the population is not normal.
To explore this, try the following Applet: http://www.ruf.rice.edu/~lane/rvls.html
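If the applet is unavailable, the Central Limit Theorem can also be explored with a short simulation. The sketch below draws from a strongly right-skewed population (an exponential distribution with mean 1 and SD 1, a hypothetical choice for illustration) and tracks how the center, spread, and skewness of the sample means change with n.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)

def draw_sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return mean(rng.expovariate(1.0) for _ in range(n))

def skewness(data):
    """Moment-based skewness; near 0 for a symmetric distribution."""
    m, s = mean(data), pstdev(data)
    return mean(((x - m) / s) ** 3 for x in data)

for n in (1, 5, 30, 100):
    means = [draw_sample_mean(n) for _ in range(4000)]
    print(f"n = {n:3d}  mean = {mean(means):.3f}  "
          f"SD = {pstdev(means):.3f}  (sigma/sqrt(n) = {1 / n ** 0.5:.3f})  "
          f"skewness = {skewness(means):.2f}")
```

The mean of the sample means should stay near 1 for every n, the SD should track σ/√n, and the skewness should shrink toward 0 as n grows.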
Conditions:
a. The data is from a random sample from the population of interest
b. Large sample size (n > 30) OR distribution of population approximately normal (so distribution
of ȳ can be considered approximately normal)
c. Sample < 10% of population (so selections can be considered independent)
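Note: Rules 1 and 2 come from the rules for combining independent random variables:
$$\mu_{\bar{y}} = \mu_{\frac{y_1 + y_2 + \cdots + y_n}{n}} = \frac{\mu_y + \mu_y + \cdots + \mu_y}{n} = \frac{n\mu_y}{n} = \mu_y$$
$$\sigma_{\bar{y}} = \sigma_{\frac{y_1 + y_2 + \cdots + y_n}{n}} = \frac{\sqrt{\sigma_y^2 + \sigma_y^2 + \cdots + \sigma_y^2}}{n} = \frac{\sqrt{n\sigma_y^2}}{n} = \sqrt{\frac{n\sigma_y^2}{n^2}} = \sqrt{\frac{\sigma_y^2}{n}} = \frac{\sigma_y}{\sqrt{n}}$$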
Suppose that the weights of apples are approximately normally distributed with a mean of 200 grams
and a standard deviation of 35 grams.
a. What is the probability that if you choose one apple at random, its weight is at least 220 grams?
b. What is the probability that if you choose 25 apples at random, the average weight is at least 220 g?
Suppose that the population of residents of a local town has a mean annual income of $50,000 and a
standard deviation of $35,000. If a random sample of 16 town residents is selected, what is the
probability that their sample mean is within $5000 of the true mean?
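The apple and income questions above can be set up as in the sketch below, which uses `NormalDist` from the standard library in place of a z-table. Note the caveat in the comments: for the income question, treating ȳ as normal is an assumption that a small sample from a skewed population may not justify.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_at_least(threshold, mu, sigma, n=1):
    """P(sample mean >= threshold) under a normal model with SD sigma/sqrt(n)."""
    return 1 - NormalDist(mu, sigma / sqrt(n)).cdf(threshold)

# Apples: mu = 200 g, sigma = 35 g, population approximately normal.
one_apple = prob_mean_at_least(220, 200, 35, n=1)
avg_of_25 = prob_mean_at_least(220, 200, 35, n=25)
print(f"(a) one apple >= 220 g:       {one_apple:.4f}")
print(f"(b) mean of 25 apples >= 220: {avg_of_25:.4f}")

# Incomes: mu = 50,000, sigma = 35,000, n = 16.
# Caution: incomes are usually right-skewed and n = 16 is small, so the
# normal model for the sample mean is an assumption here, not a given.
income_dist = NormalDist(50_000, 35_000 / sqrt(16))
within_5000 = income_dist.cdf(55_000) - income_dist.cdf(45_000)
print(f"income mean within $5000:     {within_5000:.4f}")
```

Comparing (a) and (b) shows how much less variable an average of 25 apples is than a single apple: the SD drops from 35 g to 35/√25 = 7 g.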
HW #3: 18.11, 17-23 odd
How Many Textbooks? The Competition!!
HW #4: 18.12, 18-24 even
Review Chapter 18
HW #5: 18.13, 14, 27, 28, 30
Test Chapter 18
Chapter 19: Confidence Intervals for Proportions.1
One of the primary jobs of a statistician is to estimate characteristics of populations (parameters). What
proportion of students drive to school? What is the average weight of a teenager?
To make these estimates, statisticians will select a sample from the population of interest and study the
members of the sample. However, because of sampling variability, it is possible that our estimates will
be wrong.
In general, there are two types of estimates that can be made: point estimates and interval estimates.
A ________________________ is a single number that represents a plausible value for the population
parameter. It is called a “point” estimate since it represents a single point on the number line. For
example, based on a recent survey, we estimate that 61% of young adults favor social security reform.
An _________________________ is a range of plausible values for that parameter. It is called an
“interval” estimate since it represents an interval on the number line. For example, based on a recent
survey, we estimate that between 58% and 64% of young adults favor social security reform (61% ±
3%).
Using an interval estimate greatly increases the probability we are correct. For example, I predict the
high temperature tomorrow will be between 0 and 150 degrees. Of course, this interval won’t help me
pick my clothes in the morning. There is a tradeoff between confidence and usefulness.
Making point estimates: Which statistic should we use?
There are two things to look for when choosing a statistic to estimate a population characteristic
(parameter): the statistic should have no bias and low variability.
An _____________________________ is a statistic with a mean value equal to that of the population
characteristic being estimated (i.e. its sampling distribution is centered in the right place).
For example, in the textbook problem, some unbiased statistics were:
They are unbiased since their sampling distributions were centered in the right place. In other words,
the estimates weren’t consistently too low or consistently too high.
Some examples of biased statistics would be:
Why are these biased?
Given a choice between several unbiased statistics, which should we use?
Note: Sometimes a biased statistic is preferred if it has much less variability.
Making Interval Estimates
Because of the variability inherent in sampling, our point estimates will rarely be correct. After all,
different samples will yield different estimates. And, although a point estimate may be our best single
guess for the value of a population parameter, it is certainly not the only plausible value.
Thus, statisticians will usually report an interval of plausible values for the population parameter based
on the sample. This interval is called a ___________________________. Using a confidence interval
gives a much better chance of being correct.
The Logic of Confidence Intervals:
1. The distance from p to p̂ is the same as the distance from p̂ to p.
2. 95% of all samples will give a sample proportion p̂ within 1.96 SD of the population proportion
p, assuming the distribution of p̂ is normal.
3. Therefore, in 95% of all samples the population proportion p will be within 1.96 SD of the
sample proportion p̂ .
When the distribution of p̂ is normal, the 95% confidence interval for a population proportion (p) is:
$$\text{point estimate} \pm \text{margin of error} = \hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
Why do we use p̂ in the standard deviation instead of p?
Note: When we use the sample to estimate the SD, it is called the Standard Error (SE).
$$SD = \sqrt{\frac{pq}{n}} \quad\text{and}\quad SE = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
How do we know if the distribution of p̂ is normal?
Suppose that in a random sample of 100 La Serna HS students, 43 owned a graphing calculator. Find a
95% confidence interval for the true proportion of LSHS students with a graphing calculator.
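A minimal sketch of the calculation for this example follows. The function name is made up for illustration, and the sketch assumes the sample is random and small relative to the LSHS student body.

```python
from math import sqrt

def proportion_ci_95(successes, n):
    """95% confidence interval for a proportion: p-hat ± 1.96 * SE."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error
    margin = 1.96 * se
    return p_hat - margin, p_hat + margin

x, n = 43, 100
# Large-sample check using the sample counts: n*p-hat and n*q-hat at least 10.
assert x >= 10 and n - x >= 10

low, high = proportion_ci_95(x, n)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The interval runs from about 0.333 to about 0.527, that is, 0.43 ± 0.097.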
HW #1: Read 432-36 Problems 19.17, 18
Chapter 19: Formal Confidence Intervals.2
Suppose that in a random sample of 200 Whittier teenagers, 18 had tattoos. Use a 95% confidence
interval to estimate p, the true proportion of Whittier teenagers with a tattoo.
To get full credit on any confidence interval problem, you MUST include the following 4 steps!
Note: If any of the conditions in step 2 are violated, then the interval may not be reliable. In this case,
state your concerns and proceed with caution.
Changing confidence levels: Although a 95% confidence interval is the most frequently used, there are
other common confidence levels, including 90% and 99%.
The general form of a confidence interval:
point estimate ± margin of error
statistic ± (critical value)(estimated standard deviation of the statistic)
Suppose that in a random sample of 1000 US adults, 483 believe that the US war in Iraq was worth the
costs. Use a 90% confidence interval to estimate the true proportion of US residents who believe this.
Since p̂ = .483, can we conclude that the majority of US residents feel the war was not worth the costs?
Repeat step 3 with a 95% CI and a 99% CI.
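The three intervals can be compared side by side with a sketch like the one below. It covers only the calculation step, not the full 4-step write-up, and it computes the critical values with `NormalDist().inv_cdf` rather than reading them from a table, so they may differ from table values in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Large-sample CI for a proportion at the given confidence level."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # z* leaves (1 - confidence)/2 in each tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    return p_hat - margin, p_hat + margin, z_star

for level in (0.90, 0.95, 0.99):
    low, high, z_star = proportion_ci(483, 1000, level)
    print(f"{level:.0%} CI: ({low:.3f}, {high:.3f})  z* = {z_star:.3f}  "
          f"contains 0.5? {low <= 0.5 <= high}")
```

Because each interval includes values above 0.5, the sample does not give convincing evidence that a majority feels the war was not worth the costs. The intervals also get wider as the confidence level increases.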
What factors affect the margin of error (length of the interval)?
1. The confidence level. If we want to be really confident, we can increase the confidence level.
However, while this will give the interval a better chance to capture the parameter, our estimate will
have less precision.
2. The sample size. If we want more precision, we can increase the sample size. However, this will
cost us additional time and money.
HW #2: Read 436-438, Problems 19.3, 7, 11, 21, 23
Chapter 19: Finding Sample Size.3
Suppose we are planning a survey and we want a margin of error of less than 4% with 95% confidence.
What sample size do we need?
$$1.96\sqrt{\frac{\hat{p}\hat{q}}{n}} \le .04$$
We are trying to solve for n, but we also do not know the value of p̂ . What can we do?
• Use a known value of p̂ from a previous survey
• Do a preliminary pilot survey to estimate p̂
• Be conservative. What is the biggest sample size we could possibly need?
o The bigger p̂q̂ is, the bigger n needs to be.
o For what value of p̂ will p̂q̂ be the largest?
Note: When no other information is given use the third option ( p̂ = .5)
Suppose we wanted to estimate a proportion with 99% confidence with a margin of error of at most
2.5%. What is the largest sample size we might need?
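Solving the margin-of-error inequality for n gives n ≥ (z*/ME)² · p̂q̂. The sketch below applies that formula with a hypothetical helper function. Its critical values come from `NormalDist().inv_cdf`, so an answer may differ by one or two from a calculation done with a rounded table value such as 2.576.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence, p_guess=0.5):
    """Smallest n with z* * sqrt(p q / n) <= margin, using p_guess for p-hat."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    return ceil(n)   # always round up so the margin is not exceeded

print(sample_size_for_proportion(0.04, 0.95))     # conservative p-hat = 0.5
print(sample_size_for_proportion(0.025, 0.99))    # conservative p-hat = 0.5
print(sample_size_for_proportion(0.04, 0.95, p_guess=0.2))  # hypothetical pilot value
```

The last call shows why p̂ = 0.5 is the conservative choice: any other guess for p̂ makes p̂q̂ smaller and therefore asks for a smaller sample.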
HW #3 Read 441-443 Problems 19.20, 25, 27, 29
Chapter 19: Interpreting the confidence level.4
Suppose that 10% of all students at La Serna have a tattoo and we want to simulate a random sample of
size 200 to calculate a 90% confidence interval for the proportion of La Serna students with a tattoo.
Let 0 = has a tattoo and 1-9 = no tattoo
Generate 200 integers from 0-9 and count the number of 0’s.
Use this number to calculate p̂ and the 90% CI for p.
Interpreting the confidence level. In other words, what does it mean to be 90% confident?
• In the long run, 90% of all intervals produced with this method will capture the true value.
• Or: Before we gather our sample, there is a 90% chance that the sample we will obtain will
produce an interval that captures the true value.
• It does NOT mean that there is a 90% chance (or probability) that a particular interval captures
the true value--that particular interval either captures the value or it doesn’t.
Again, it is incorrect to say that “there is a 90% chance that the interval from .05 to .13 captures the true
proportion.” Rather, the 90% confidence level refers to the method used to construct the interval rather
than any particular interval.
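The long-run interpretation can be seen by repeating the tattoo simulation many times. The sketch below follows the digit scheme described above (0 = has a tattoo) and uses z* = 1.645 for 90% confidence; the number of repetitions and the seed are arbitrary choices.

```python
import random
from math import sqrt

P_TRUE, N, Z_90 = 0.10, 200, 1.645   # values from the example; z* from the normal table
rng = random.Random(7)

def one_interval():
    """Simulate one sample of 200 digits and return its 90% CI for p."""
    digits = [rng.randint(0, 9) for _ in range(N)]   # 0 = has a tattoo
    p_hat = digits.count(0) / N
    margin = Z_90 * sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - margin, p_hat + margin

REPS = 2000
hits = sum(low <= P_TRUE <= high
           for low, high in (one_interval() for _ in range(REPS)))
coverage = hits / REPS
print(f"{coverage:.1%} of {REPS} simulated intervals captured p = {P_TRUE}")
```

The capture rate should come out in the neighborhood of 90%, though not exactly, because the normal model is only an approximation. Each individual interval either captured 0.10 or it did not.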
How many students can you identify at La Serna High School? Suppose that Mr. DuPont randomly
selected 50 students from LSHS and you could name 14 of them. Construct a 95% CI for the proportion
of students you know at La Serna High School. Then, use the fact that there are 1800 students at LSHS
to estimate the total number of students you can identify at LSHS.
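A worked sketch of the calculation step, assuming the conditions hold (the sample is random, n p̂ = 14 and n q̂ = 36 are both at least 10, and 50 is less than 10% of 1800):

$$\hat{p} = \frac{14}{50} = 0.28, \qquad SE = \sqrt{\frac{(0.28)(0.72)}{50}} \approx 0.0635$$
$$0.28 \pm 1.96(0.0635) \approx 0.28 \pm 0.124 = (0.156,\ 0.404)$$
$$1800(0.156) \approx 281 \quad \text{and} \quad 1800(0.404) \approx 727$$

Multiplying both endpoints by the population size gives an interval of roughly 280 to 730 students for the total number you can identify.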
HW #4 Problems: 19.5, 13, 26, 30
Chapter 19: Review.5
HW #5 Problems 19.14, 22, 24, 28
Chapter 23: Confidence Intervals for a Population Mean.6
In addition to using interval estimates for population proportions (p), we can also use confidence
intervals to estimate the true value of a population mean (µ).
The formula for a CI for a population mean μ is based on the sampling distribution of the sample mean
ȳ. Recall that:
1. $\mu_{\bar{y}} = \mu$
2. $\sigma_{\bar{y}} = \dfrac{\sigma}{\sqrt{n}}$
3. ȳ is approximately normally distributed when n > 30 or when the population is normal
Thus far, we have been using z-critical values from the normal distribution to make our confidence
intervals.
Now, remember that a z-score is calculated as
$$z = \frac{\text{observed value} - \text{expected value}}{\text{standard deviation}} = \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}$$
As long as ȳ is normally distributed, the distribution of z will also be normal. This is because z is a
linear transformation of ȳ. In other words, since μ, σ, and n are all constants, the shape shouldn’t change,
only the center and spread.
However, if we don’t know σ (the population standard deviation) and we use s (the sample standard
deviation) to estimate σ, the distribution of
$$\frac{\bar{y} - \mu}{s/\sqrt{n}}$$
isn’t quite normal, even if ȳ is approximately normal. So, we give this quantity a new name:
$$t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$$
Since t is based on 2 variables (s and ȳ) instead of just ȳ (like z), the t-distributions have more
variability than the z (normal) distribution. However, as the sample size increases, s gets closer to σ
and the t-distributions get closer to the standard normal distribution (z).
The t-curves are very similar to normal curves, except that they are wider and defined by a number
called the __________________________. For this chapter, df = n – 1.
Properties of the t-distributions:
1. The t-curve corresponding to any fixed number of degrees of freedom (df) is bell shaped,
symmetric and centered at 0.
2. Each t-curve is more spread out than the z-curve (standard normal curve).
3. As the df increase, the spread of the corresponding t-curve decreases.
4. As the number of df increases, the t-curves get closer and closer to the z-curve.
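These properties can be illustrated by simulation. The sketch below repeatedly samples from a normal population (mean 0 and SD 1, a hypothetical choice) and compares how often the standardized sample mean lands beyond ±1.96 when σ is known (a z statistic) and when s is used in its place (a t statistic).

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(3)
MU, SIGMA = 0.0, 1.0     # hypothetical normal population
REPS = 4000              # simulated samples per setting (arbitrary choice)

def tail_fraction(n, use_s):
    """Fraction of samples whose standardized mean lands beyond ±1.96.
    use_s=True standardizes with s (t statistic); False uses sigma (z statistic)."""
    beyond = 0
    for _ in range(REPS):
        sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
        spread = stdev(sample) if use_s else SIGMA
        stat = (mean(sample) - MU) / (spread / sqrt(n))
        beyond += abs(stat) > 1.96
    return beyond / REPS

for n in (5, 15, 50):
    z_tail = tail_fraction(n, use_s=False)
    t_tail = tail_fraction(n, use_s=True)
    print(f"n = {n:2d} (df = {n - 1:2d}):  z beyond ±1.96: {z_tail:.3f}   "
          f"t beyond ±1.96: {t_tail:.3f}")
```

The z fraction should stay near 0.05 for every n, while the t fraction should be noticeably larger for small df and approach 0.05 as df grows. This is the extra spread of the t-curves.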
Since the t-curves are wider than the z-curve, we must go out further than 1.96 SD to capture 95% of the
area.
To find out how far, we use a t-critical value from the t-table.
If df = 20 and you want 95% confidence, what t-critical value should you use?
Thus, to capture the middle 95% of the t-distribution with 20 df, you must go out 2.086 standard deviations.
Find the t critical values for the following:
a. 95% confidence with n = 10
b. 90% confidence with n = 25
c. 99% confidence with n = 100
Note: When using the t-table and the df you want are not provided, round down to the nearest df given.
Suppose that a machine is designed to produce bolts that have a diameter of 5 mm. Every hour a
random sample of 15 bolts is selected and a 95% confidence interval for the mean diameter is
constructed. If there is evidence that μ ≠ 5, the machine is adjusted. In one particular sample, the mean
diameter was 5.08 mm with a standard deviation of 0.11 mm. Calculate the interval and decide if you
need to adjust the machine.
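A worked sketch of the calculation step, assuming the bolt diameters are approximately normal (needed because n = 15 is small). With df = 15 − 1 = 14, the t-table gives t* = 2.145 for 95% confidence:

$$\bar{y} \pm t^{*}\frac{s}{\sqrt{n}} = 5.08 \pm 2.145 \cdot \frac{0.11}{\sqrt{15}} \approx 5.08 \pm 0.061 = (5.019,\ 5.141)$$

Since 5 is not in this interval, the sample gives evidence that μ ≠ 5, which points toward adjusting the machine.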
HW #6: Read 527-530, problems 23.3, 5, 7, 13 (4 steps only), 15
Chapter 23: More confidence intervals.7
Checking for Normality: When the sample size is small and the data from the sample are provided, you
must graph the sample to see if it is reasonable to assume it came from a normally distributed
population.
Conclusions:
• slight to moderate skewness is OK, since samples from normal populations will not always look
normal: “Since there are no outliers or strong skewness in the sample, it is reasonable to assume
that the population is approximately normal.”
• strong skewness or outliers make this condition questionable: “Since there is an outlier or strong
skewness in the sample, it is questionable to assume that the population is approximately normal.
I will proceed with caution.”
Suppose that a random sample of students at a particular SAT school was selected and their
improvements in SAT score were calculated. Construct a 99% confidence interval to estimate the true
mean improvement of students at this school. Does this interval give evidence that students in this
school are improving their scores, on average?
Improvements: 50, 110, 20, 140, 80, 70, 70, 40
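The calculation step for this interval can be sketched as follows. The critical value t* = 3.499 (99% confidence, df = 7) is read from a standard t-table and hard-coded, since Python's standard library has no t-distribution. The sketch assumes the normality check on the sample has already been done with a graph.

```python
from math import sqrt
from statistics import mean, stdev

improvements = [50, 110, 20, 140, 80, 70, 70, 40]
n = len(improvements)
df = n - 1

# t* for 99% confidence with df = 7, read from a standard t-table.
T_STAR_99_DF7 = 3.499

y_bar = mean(improvements)
s = stdev(improvements)               # sample standard deviation (divides by n - 1)
margin = T_STAR_99_DF7 * s / sqrt(n)
low, high = y_bar - margin, y_bar + margin

print(f"n = {n}, df = {df}, y-bar = {y_bar:.1f}, s = {s:.2f}")
print(f"99% CI: ({low:.1f}, {high:.1f})")
```

Every value in the resulting interval is positive, which is consistent with students at this school improving their scores on average.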
HW #7 Problems: 23.4, 6, 8, 28 (4 steps + e), 27 (4 steps)
Chapter 23: Finding Sample Size.8
Suppose that we wanted to estimate the average height of a LSHS student to within 0.5 inches of the true
value with 99% confidence. If the standard deviation is known to be 2.3 inches, how big of a sample do
we need?
Since we know the population standard deviation, we can use z to calculate the sample size.
Unfortunately, we rarely (if ever) know the true standard deviation σ and not the true mean μ. So,
these problems are somewhat unrealistic.
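Setting the margin of error z*·σ/√n at or below the target and solving for n gives n ≥ (z*·σ/ME)². The sketch below applies this to the height example with a hypothetical helper function. Its critical value comes from `NormalDist().inv_cdf` rather than a table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(margin, sigma, confidence):
    """Smallest n with z* * sigma / sqrt(n) <= margin (sigma treated as known)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * sigma / margin) ** 2)   # round up, never down

n_needed = sample_size_for_mean(margin=0.5, sigma=2.3, confidence=0.99)
print("students needed:", n_needed)
```

Rounding up matters here: rounding down would leave the margin of error slightly larger than 0.5 inches.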
In most cases, to find an appropriate sample size we need to first estimate the standard deviation (s).
There are a few ways to do this.
-Use a previously known value of s.
-You can do a preliminary study (pilot study) and use the sample standard deviation s.
Also, we use z-critical values instead of t-critical values since we cannot calculate df without knowing n
(which is what we are looking for). However, if the sample size is large, using z will be close enough. But,
for small sample sizes, using z will underestimate the sample size we need.
Using the example from yesterday, how many additional students do we need to select to reduce the
margin of error to 20 points?
HW #8: Read 535-538, Problems: 23.11 (4 steps + de), 13, 34 (4 steps + e)
HW #9 Review:
Chapter 19 & 23 Test