- Stratified Samples - Systematic Samples - Samples can vary - Standard Error

From last time: A sample is a small collection we observe and
assume is representative of a larger population.
Example: You haven’t seen Vancouver, you’ve only seen a
small part of it. It would be infeasible to see all of Vancouver.
When someone asks you ‘how is Vancouver?’, you infer to the
whole population of Vancouver places using your sample.
From last time: A sample is random if every member of the
population has an equal chance of being in the sample.
Your Vancouver sample is not random. You’re more likely to
have seen Production Station than 93rd St. in
Surrey.
From last time: A simple random sample (SRS) is one where
the chances of being in a sample are independent.
Your Vancouver sample is not SRS because if you’ve seen 93rd
St., you’re more likely to have also seen 94th St.
A common, random but not SRS sampling method is stratified
sampling.
To stratify something means to divide it into groups.
(Geologically into layers)
To do stratified sampling, first split the population into
different groups or strata. Often this is done naturally.
Possible strata: Sections of a course, gender, income level,
grads/undergrads, or any category like that.
Then, randomly select some of the strata.
Unless you’re doing something fancy like multiple layers, the
strata are selected using SRS.
Within each selected stratum, select members of the population using
SRS.
If the strata are different sizes, select samples from them
proportional to their sizes.
Example: Quality testing of milk.
A government agency wants to check if the milk from a
company is up to code.
There are several trucks leaving the plant today; each truck
is a stratum (the singular of strata). The agency selects
some of the trucks with SRS.
Each truck is carrying many jugs of milk, some jugs from each
truck are selected by SRS.
One of the trucks is twice as big as the others, so twice as
many jugs are sampled from that one. Therefore every jug has
an equal chance of being sampled.
Say they tested 50 jugs of milk from a total of 5 trucks.
That’s a lot easier than stopping 50 trucks and testing 1 jug
each. This is part of the appeal of stratified sampling.
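The milk example can be sketched in Python as a two-step draw: pick trucks with SRS, then pick jugs within each picked truck with SRS. The fleet of 8 trucks, the jug counts, and the 10% sampling fraction are all made up for illustration; they are not the agency's actual procedure.

```python
import random

def sample_jugs(trucks, n_trucks, fraction, seed=None):
    """Pick n_trucks trucks by SRS, then the same fraction of jugs from each by SRS.

    trucks maps a truck name to its list of jug IDs (hypothetical data layout).
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(trucks), n_trucks)   # SRS of the strata
    picked = {}
    for name in chosen:
        jugs = trucks[name]
        k = round(fraction * len(jugs))             # proportional to truck size
        picked[name] = rng.sample(jugs, k)          # SRS within the stratum
    return picked

# Hypothetical fleet: 7 trucks of 100 jugs and one double-sized truck of 200.
trucks = {f"truck{i}": [f"t{i}-jug{j}" for j in range(100)] for i in range(1, 8)}
trucks["truck8"] = [f"t8-jug{j}" for j in range(200)]

picked = sample_jugs(trucks, n_trucks=5, fraction=0.10, seed=1)
for name, jugs in picked.items():
    print(name, len(jugs))
```

If the double-sized truck is among the five chosen, it contributes 20 jugs instead of 10, which is what keeps each jug's chance of being tested equal.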
Another appeal is that you can choose EVERY stratum. (A
stratum’s chance of being picked by SRS becomes 1)
Example: Employment survey.
A large company wants information about its workforce of
1000 full time employees and 500 part-time employees.
The company chooses both strata and uses SRS to select 80 from
the full-time stratum and 40 from the part-time stratum.
8% of each stratum is sampled this way.
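The 80 and 40 come from splitting a total sample in proportion to stratum size. A minimal sketch of that arithmetic, assuming the company decided on 120 people in total (the function name is hypothetical):

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size."""
    population = sum(stratum_sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in stratum_sizes.items()}

sizes = {"full-time": 1000, "part-time": 500}
allocation = proportional_allocation(sizes, total_sample=120)
print(allocation)  # {'full-time': 80, 'part-time': 40}
```

With less tidy numbers the rounded pieces may not add up to the total exactly, so a real survey would need a rule for the leftover.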
Samples can vary.
Not every sample will be the same. My Vancouver sample is
different from yours, which will be different from the person
sitting next to you.
You’ve all seen different parts of the city, you’ve all observed a
different set of members of the population.
If samples are different, then their means are going to be
different too.
But, no matter how many times you take a sample, it’s always
from the same population.
So the sample mean $\bar{x}$ can change, but the population mean μ
is always the same (unknown) number.
The sample mean $\bar{x}$, on average, is going to be the population
mean μ.
(Average of $\bar{x}$ is μ)
The standard deviation of $\bar{x}$ is the standard error:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$
The typical amount that sample means change from the true
mean is the standard error.
Technically, it’s the standard error of the mean, because you
can have standard errors of other things too, but we’ll only
look at the standard error of the mean.
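One way to see this is to draw many samples from the same population and look at how their means spread out. The sketch below assumes a normal population with a made-up mean of 50 and standard deviation of 10, and samples of size 25, so the standard error should be about 10 / sqrt(25) = 2.

```python
import random
import statistics

def sample_means(population_mean, population_sd, n, repeats, seed=0):
    """Draw `repeats` samples of size n and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(population_mean, population_sd) for _ in range(n)])
        for _ in range(repeats)
    ]

# Made-up population values, chosen only for illustration.
means = sample_means(population_mean=50, population_sd=10, n=25, repeats=20_000)

print(statistics.fmean(means))   # average of the sample means: close to 50
print(statistics.stdev(means))   # spread of the sample means: close to 2
```

Each sample mean is different, but they center on the population mean and their typical distance from it is the standard error, not the population standard deviation.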
The standard error is our main tool for reducing the
uncertainty of our sample mean.
n is the sample size. The larger n gets, the smaller the standard
error $\sigma / \sqrt{n}$ gets.
In other words, a bigger sample gives you a better estimate of
the population mean.
This should be intuitive: if you take a bigger sample, you have
more information about the population.
This is important because it gives us some measure of control
over the statistics we get. We can’t do that with the standard
deviation.
Say the government agency from before knows that in regular
milk, the amount of calcium is normal with mean 20 mg/L and
standard deviation 5 mg/L.
If it samples one 1 L bottle of regular milk, the calcium concentration
will have a standard error of 5 mg/L.
If it samples four 1 L bottles of milk, the mean calcium concentration
will have a standard error of $5 / \sqrt{4} = 2.5$ mg/L.
If the agency samples 25 one-litre bottles, the average calcium
per bottle is going to be a lot closer to the true mean of
20 mg/L than it was with 4 bottles.
The sample mean of 25 bottles will have a standard error of
$5 / \sqrt{25} = 1$ mg/L, even though the standard deviation of a
single bottle is 5 mg/L.
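The three milk cases are the same calculation with a different n. A small Python sketch (the function name is just for illustration):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: population standard deviation over sqrt(n)."""
    return sd / math.sqrt(n)

# Calcium in milk: standard deviation 5 mg/L, samples of 1, 4 and 25 bottles.
for n in (1, 4, 25):
    print(n, standard_error(5, n))   # prints 5.0, then 2.5, then 1.0
```

Going from 1 bottle to 4 halves the standard error, but it takes 25 bottles to cut it to a fifth, because n sits under a square root.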
Why does this happen?
Consider: Which is more likely, one bottle being above the
mean, or a whole lot of bottles?
In a large sample, the bottles above the mean are going to
balance out with the bottles below the mean.
As you get more and more bottles, the closer to a 50-50
balance you would expect.
As we get closer to that 50-50 balance, the sample mean will
tend to be closer and closer to the true mean.
Since we become more sure of where the sample mean will be,
we say it becomes less variable.
It’s why elevators can post the capacity limits they do:
the limit works out to 68 kg/person, and lots of people weigh more than 68 kg.
But how often will you get a group of 26 averaging more than
68 kg/person?
Practice example: Suppose the average age when smokers
begin is 17 years old with a standard deviation of 2 years.
What’s the standard error of the mean from a sample of 16
smokers?
What’s the standard error of the mean of 100 smokers?
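Worked through with the standard error of the mean, $\sigma / \sqrt{n}$, and $\sigma = 2$ years, the two answers should come out as:

```latex
\[
\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{16}} = \frac{2}{4} = 0.5 \text{ years},
\qquad
\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = \frac{2}{10} = 0.2 \text{ years}
\]
```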
The center of the sample mean’s distribution never changes with
sample size; it’s always centered around the true mean of 17.
We can expand our definition of z-score from something that
pertains to single values to something that pertains to sample
means.
It’s still (value minus mean) / (standard deviation of the value),
But since the value is a sample mean instead of a single value,
it has a different standard deviation.
Consider again the smokers, who start at a mean age of 17 years with a standard deviation of 2 years.
What’s the z-score of a single smoker if he starts at 18 years?
What’s the z-score of a sample of 16 smokers if their mean is
18 years?
Instead of finding the standard error first, we can put it all into
one question. (Just another option)
What’s the chance of getting a sample of 100 smokers who
started at an average of 18 years or older?
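The three smoker questions can be checked with a short Python sketch. It assumes the sample mean is normally distributed, and it computes the upper-tail probability from the standard library's `math.erfc` rather than a z-table; the function names are hypothetical.

```python
import math

def z_score(value, mean, sd, n=1):
    """z-score of a sample mean from n observations (n=1 is a single value)."""
    return (value - mean) / (sd / math.sqrt(n))

def upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(z_score(18, 17, 2))          # single smoker: 0.5
print(z_score(18, 17, 2, n=16))    # mean of 16 smokers: 2.0
z100 = z_score(18, 17, 2, n=100)   # mean of 100 smokers: 5.0
print(z100, upper_tail(z100))      # the chance is under one in a million
```

The same one-year gap from the mean goes from unremarkable for one smoker to almost impossible for the average of 100, because the standard error shrinks from 2 to 0.2.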
Common Question: How do I know what z-score formula to
use? This or the one from chapter 5?
Answer: Look for an indication that you’re dealing with a
sample. If it’s giving you an n (sample size), use it.
Pro-Tip: Use this new one by default. If you can’t find n, you
probably have a sample of size 1, so use n=1.
When you use a sample of size 1, the standard error z becomes
the standard deviation z.
When n=1, $\sqrt{n} = 1$.
So $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{x - \mu}{\sigma}$
In other terms:
Use the formula with square root n when you have an n.
Use the original z-score formula when it’s just a single value.
If you don’t know, use the square root n formula because
you’ll still get the right answer, you’ll just waste some effort.
Finally… Why would we ever deal with standard error?
Parameters are usually unknown.
In less contrived situations, we wouldn’t know what the true
mean was, but the larger our sample the better our idea of
that true mean.
On Monday
- More standard error, now with proportion data!
- Law of large numbers
- End of Midterm 1 exam material.