Introduction to Statistics
From the Data at Hand
to the World at Large
Part I
Siana Halim
TOPICS
 Population and Sample
 Sampling distribution
models
 Confidence interval for
proportions
References:
• De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005
• John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995
Sampling and Population
 We’d like to know about an
entire population of
individuals, but examining all of
them is usually impractical, if
not impossible. So we settle for
examining a smaller group of
individuals – a sample – selected from the population.
 We should select individuals for
the sample at random.
 Randomizing protects us
from the influences of all the
features of our population, even
ones that we may not have
thought about.
 The fraction of the population
that you’ve sampled doesn’t
matter. It’s the sample size
itself that’s important.
Sampling and Population
 Does a census make sense?
 It can be difficult to complete a census.
 Populations rarely stand still.
 Taking a census can be more complex than sampling.
Population and Parameters
 Models use mathematics to represent reality. Parameters are the
key numbers in those models.
 A parameter used in a model for a population is called a
population parameter.
 Any summary found from the data is a statistic.
Name                     Statistic   Parameter
Mean                     ȳ           µ (mu)
Standard deviation       s           σ (sigma)
Correlation              r           ρ (rho)
Regression coefficient   b           β (beta)
Proportion               p̂           p
Simple Random Samples
 We need to be sure that the
statistics we compute from the
sample reflect the
corresponding parameter
accurately (representative).
 How would we select a
representative sample ?
 A Simple Random Sample (SRS)
 Every possible sample of the size
we plan to draw has an equal
chance to be selected.
 Each combination of people has
an equal chance of being
selected as well.
 The sampling frame is a list of
individuals from which the
sample is drawn.
 Samples drawn at random
generally differ one from
another. Each draw of random
numbers selects different people
for our sample.
 These differences lead to
different values for the variables
we measure.
 We call these sample-to-sample
differences sampling
variability.
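A minimal sketch of sampling variability, assuming a made-up sampling frame of 10,000 individuals of whom 13% are left-handed. The frame, the seed and the helper name simple_random_sample are illustrative assumptions, not part of the slides.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw a simple random sample of size n, without replacement, from the sampling frame."""
    return rng.sample(frame, n)

# Hypothetical sampling frame: 1 = left-handed, 0 = right-handed (13% lefties).
population = [1] * 1300 + [0] * 8700
rng = random.Random(42)  # fixed seed so the run is reproducible

# Draw several samples of the same size and record each sample proportion.
proportions = []
for _ in range(5):
    sample = simple_random_sample(population, 100, rng)
    proportions.append(sum(sample) / len(sample))

# The proportions typically differ from draw to draw: that is sampling variability.
print(proportions)
```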
Stratified Sampling
 All statistical sampling
designs have in common the
idea that chance, rather than
human choice, is used to
select the sample.
 Designs that are used to
sample from large
populations – especially
populations residing across
large areas – are often more
complicated than simple
random samples.
 Sometimes the population is
first sliced into homogeneous
groups, called strata, before
the sample is selected. Then
simple random sampling is
used within each stratum
before the results are
combined. This common
sampling design is called
stratified random
sampling.
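A rough sketch of stratified random sampling: take a simple random sample inside each stratum, then combine. The strata names, their sizes and the 5% sampling fraction are hypothetical choices for illustration only.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Take an SRS of the given fraction within each stratum, then combine the results."""
    combined = []
    for name, members in strata.items():
        k = max(1, round(fraction * len(members)))  # at least one individual per stratum
        combined.extend((name, member) for member in rng.sample(members, k))
    return combined

# Hypothetical population split into two homogeneous groups (strata).
strata = {"urban": list(range(600)), "rural": list(range(400))}
rng = random.Random(1)

sample = stratified_sample(strata, 0.05, rng)
print(len(sample))  # 30 urban + 20 rural individuals
```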
Cluster and Multistage Sampling
 Splitting the population into similar parts or clusters can make sampling more practical.
 Then we could simply select one or a few clusters at random and perform a census within each of them.
 Sampling schemes that combine several methods are called multistage samples.
 Sometimes we draw a
sample by selecting
individuals systematically.
This is called systematic
sampling.
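One possible sketch of cluster and systematic sampling. The classroom clusters, the frame of 1,000 individuals and the step of 50 are invented for illustration; a random starting point is assumed for the systematic sample.

```python
import random

def systematic_sample(frame, step, rng):
    """Select every step-th individual, starting from a randomly chosen position."""
    start = rng.randrange(step)
    return frame[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and take a census within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(7)

# Hypothetical sampling frame of 1000 individuals, sampled every 50th.
frame = list(range(1000))
systematic = systematic_sample(frame, 50, rng)

# Hypothetical clusters: 40 classrooms of 25 students each.
clusters = {f"classroom_{i}": [f"student_{i}_{j}" for j in range(25)] for i in range(40)}
clustered = cluster_sample(clusters, 3, rng)

print(len(systematic), len(clustered))
```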
Sampling Distribution Models
• Why do sample proportions vary
at all ?
• How can surveys conducted at
essentially the same time by the
same organization asking the
same questions get different
results?
• The answer is at the heart of
statistics.
• It’s because each survey is based
on a different sample.
• The proportions vary from sample
to sample because the samples are
composed of different people.
Modeling the Distribution of Sample
Proportion
 Most models are useful only when specific assumptions
are true. In the case of the model for the distribution of
sample proportions, there are two assumptions:
1. The sampled values must be independent of each other.
2. The sample size, n, must be large enough.
 The corresponding conditions to check before using the Normal
to model the distribution of sample proportions are:
1. 10% condition: If the sample has been drawn without
replacement, then the sample size, n, must be no larger than
10% of the population.
2. Success/failure condition: The sample size has to be
big enough so that both np and nq are at least 10 (where q = 1 − p).
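The two conditions can be checked mechanically. This sketch treats "big enough" as at least 10 expected successes and 10 expected failures, as on the Assumptions and Conditions slide; the function name and the dictionary it returns are illustrative assumptions.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the conditions for using a Normal model for sample proportions.

    population_size is optional; leave it as None when the population is
    effectively unlimited or sampling is done with replacement.
    """
    q = 1 - p
    success_failure = n * p >= 10 and n * q >= 10
    ten_percent = population_size is None or n <= 0.10 * population_size
    return {"success_failure": success_failure, "ten_percent": ten_percent}

# 90 students, 13% left-handed: np = 11.7 and nq = 78.3, so the condition holds.
print(check_proportion_conditions(90, 0.13))
# 50 students: np = 6.5 is too small for the Normal model.
print(check_proportion_conditions(50, 0.13))
```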
The Sampling Distribution Model of a
Proportion
 Provided that the sampled values are independent and the sample
size is large enough, the sampling distribution of p̂ is modeled by a
Normal model with mean µ(p̂) = p and standard deviation
$$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$$
that is,
$$N\left(p, \sqrt{\frac{pq}{n}}\right)$$
Sample proportion:
$$\hat{p} = \frac{y}{n}, \qquad \hat{q} = 1 - \hat{p}$$
y is the number of successes
n is the sample size
[Figure: Normal curve centered at p, with the axis marked at p ± 1, 2 and 3 multiples of √(pq/n)]
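A small simulation can make the model concrete. This sketch assumes p = 0.13 and n = 90 (the numbers from the left-handed example) and draws 5,000 simulated samples; the seed and the number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, reps, rng):
    """Simulate reps samples of size n and return the sample proportion of each."""
    props = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        props.append(successes / n)
    return props

p, n = 0.13, 90
rng = random.Random(2024)
props = simulate_sample_proportions(p, n, 5000, rng)

# The simulated mean should sit near p, and the simulated SD near sqrt(pq/n).
print("mean of p-hat:", statistics.mean(props), "model:", p)
print("SD of p-hat:  ", statistics.stdev(props), "model:", math.sqrt(p * (1 - p) / n))
```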
The Central Limit Theorem (CLT)
 As the sample size, n, increases, the mean of n independent values
has a sampling distribution that tends toward a Normal model
with mean µ(ȳ) equal to the population mean, µ, and standard
deviation
$$\sigma(\bar{y}) = SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$$
 The CLT requires remarkably few assumptions, so there are few
conditions to check:
1. Random sampling condition.
2. Independence assumption
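The CLT can be illustrated by simulation. This sketch assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), which is an illustrative choice and not taken from the slides; the sample size of 50 and 2,000 repetitions are also arbitrary.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, rng):
    """Return the means of reps independent samples of size n from an exponential population."""
    return [statistics.mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

n = 50
rng = random.Random(11)
means = simulate_sample_means(n, 2000, rng)

# Population mean and SD are both 1, so the model says mean 1 and SD 1/sqrt(n).
print("mean of sample means:", statistics.mean(means), "model:", 1.0)
print("SD of sample means:  ", statistics.stdev(means), "model:", 1.0 / math.sqrt(n))
```

Even though the individual values are far from Normal, the sample means cluster symmetrically around the population mean with spread close to σ/√n.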
Sampling Distribution Model for Mean
 If assumptions of independence and random sampling are met,
and the sample size is large enough, the sampling
distribution of the sample mean is modeled by a Normal model
with a mean equal to the population mean, µ, and a standard
deviation equal to σ/√n:
$$N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
 The parameter µ in the population is estimated by the sample mean
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and σ is estimated by the sample standard deviation
$$\hat{\sigma} = s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^2}$$
[Figure: Normal curve centered at µ, with the axis marked at µ ± σ/√n, µ ± 2σ/√n and µ ± 3σ/√n]
Working with Sample Distribution
Models
Example 1.
About 13% of the population is left-handed. A 200-seat school
auditorium has been built with 15 “leftie seats,” seats that have the built-in desk on the left rather than the right arm of the chair. In a class of 90
students, what’s the probability that there will not be enough seats for
the left-handed students?
Step-by-step
• State what we want to know.
• Check the conditions.
• State the parameters and the sampling distribution model.
• Make a picture. Sketch the model and shade the area we’re interested
in.
• Find the z-score or the cutoff proportion.
• Find the resulting probability from a table of Normal probabilities.
• Discuss the probability in the context of the question.
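The steps above can be sketched for Example 1 as follows. This assumes the class behaves like a random sample of the population and uses the Normal model in place of a printed table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

p = 0.13    # population proportion of left-handers
n = 90      # class size
seats = 15  # number of "leftie seats"
q = 1 - p

# Check the success/failure condition: np and nq should both be at least 10.
assert n * p >= 10 and n * q >= 10

# Parameters of the sampling distribution model N(p, sqrt(pq/n)).
sd = math.sqrt(p * q / n)

# There are not enough seats if the class proportion exceeds 15/90.
cutoff = seats / n
z = (cutoff - p) / sd
prob = 1 - NormalDist().cdf(z)

print(f"SD(p-hat) = {sd:.4f}, z = {z:.2f}, P(not enough seats) = {prob:.3f}")
```

The z-score is a little above 1, so the probability comes out at roughly 0.15: the seats run short in about one class of this size out of seven.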
Working with Sample Distribution
Models
Example 2.
Suppose that mean adult weight is 175 pounds with a
standard deviation of 25 pounds. An elevator in our
building has a weight limit of 10 persons or 2000
pounds. What’s the probability that the 10 people who
get on the elevator overload its weight limit?
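Example 2 can be worked the same way. With only 10 people, the Normal model for the mean is an approximation that assumes individual weights are themselves roughly Normal; that assumption, and the variable names, are not stated in the slides.

```python
import math
from statistics import NormalDist

mu = 175      # mean adult weight (pounds)
sigma = 25    # standard deviation of adult weight (pounds)
n = 10        # people on the elevator
limit = 2000  # weight limit (pounds)

# The total exceeds 2000 pounds exactly when the mean weight exceeds 2000/10 = 200.
cutoff_mean = limit / n

# Sampling distribution model for the mean: N(mu, sigma/sqrt(n)).
sd_mean = sigma / math.sqrt(n)
z = (cutoff_mean - mu) / sd_mean
prob = 1 - NormalDist().cdf(z)

print(f"SD(y-bar) = {sd_mean:.2f}, z = {z:.2f}, P(overload) = {prob:.5f}")
```

A mean of 200 pounds is more than 3 standard deviations above 175, so an overload is very unlikely (less than 1 chance in 1,000).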
Standard Error
 When we estimate the standard deviation of a sampling
distribution using statistics found from the data, the
estimate is called a standard error.
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} \qquad SE(\bar{y}) = \frac{s}{\sqrt{n}}$$
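Both standard errors are computed from the data alone. In this sketch the helper names and the sample values are hypothetical, chosen only to illustrate the two formulas.

```python
import math
import statistics

def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p-hat * q-hat / n)."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical data: 13 left-handers in a sample of 100, and five measured weights.
print(se_proportion(13, 100))
print(se_mean([160, 175, 190, 170, 180]))
```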
Confidence Interval for Proportion
 We are 95% confident that the true proportion of the
population is in our interval.
Confidence Interval (Example)
Sea fans, one spectacular kind of coral, in
the Caribbean Sea have been under
attack by the disease aspergillosis. In
June of 2000, the sea fan disease team
from Dr. Drew Harvell’s lab randomly
sampled some sea fans at the Las Redes
Reef in Akumal, Mexico, at a depth of 40
feet. They found that 54 of the 104 sea
fans they sampled were infected with the
disease. What might this say about the
prevalence of this disease among sea fans
in general?
Confidence Interval (Example)
What can we say about the population
proportion, p? Is the infected proportion
of all sea fans 51.9%?
We do know, though, that the sampling
distribution model of p̂ is centered at p,
and we know that the standard deviation of
the sampling distribution is
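$$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$$
But we don’t know p. Instead we’ll use p̂ and find the standard error,
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(0.519)(0.481)}{104}} \approx 0.049$$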
Now we know the sampling model for p̂ should look like this:
$$N\left(p, \sqrt{\frac{pq}{n}}\right)$$
[Figure: Normal curve centered at p, with the axis marked at p ± 1, 2 and 3 multiples of √(pq/n)]
Because it’s Normal, it says that about 68% of all samples of 104 sea fans will
have p̂’s within 1 SE, 0.049, of p. And about 95% of all these samples will be
within p ± 2 SEs. But where is our sample proportion in this picture?
We do know that for 95% of random samples, p̂ will be no more than 2
SEs away from p. So let’s look at this from p̂‘s point of view. If I’m p̂ ,
there’s a 95% chance that p is no more than 2 SEs away from me. If I reach
out 2 SEs, or 2 x 0.049, away from me on both sides, I’m 95% sure that p
will be within my grasp. Now, I’ve got him! Probably.
“Far better an approximate answer to the right question, …
than an exact answer to the wrong question.”
- John W. Tukey
So what can we really say about p?
1. “51.9% of all sea fans on the Las Redes Reef are infected.” → NO WAY!
2. “It is probably true that 51.9% of all sea fans on the Las Redes Reef are
infected” → NO
3. “We don’t know exactly what proportion of sea fans on the Las Redes
Reef are infected but we know that it’s within the interval 51.9% ±
2x4.9%. That is, it’s between 42.1% and 61.7%” → GETTING
CLOSER!
4. “We don’t know exactly what proportion of sea fans on the Las Redes
Reef are infected, but the interval from 42.1% to 61.7% probably
contains the true proportion.” → TRUE but a bit wishy-washy.
5. “We are 95% confident that between 42.1% and 61.7% of Las Redes
Reef sea fans are infected.” → YES!
Statements like these are called confidence intervals. They’re the best we can
do. The interval is called a one-proportion z-interval.
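The sea fan interval can be reproduced in a few lines, using the rule of thumb of 2 standard errors for 95% confidence as in the statements above. The variable names are illustrative.

```python
import math

infected = 54
n = 104

p_hat = infected / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Reach out 2 SEs on both sides of the sample proportion.
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"p-hat = {p_hat:.3f}, SE = {se:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```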
Margin of Error
A confidence interval (CI) has the form
$$\hat{p} \pm 2\,SE(\hat{p})$$
The extent of the interval on either side of p̂ is called the margin of error
(ME). In general, CIs look like this:
estimate ± ME
The more confident we want to be, the larger the margin of error must
be.
Critical Value
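The values z* = 1.96 (for 95% confidence) and z* = 1.645 (for 90% confidence) are
called critical values.
[Figure: standard Normal curve with central area 0.95 between −1.96 and 1.96]
[Figure: standard Normal curve with central area 0.90 between −1.645 and 1.645]
The CIs based on the sample proportion
and the sample mean can be
formulated as follows:
$$\hat{p} \pm z^{*}\,SE(\hat{p})$$
$$\bar{X} \pm z^{*}\,SE(\bar{X})$$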
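Critical values need not come from a printed table. A minimal sketch using Python's standard library; the helper name z_critical is an assumption made for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z* such that the central area under the standard Normal curve equals confidence."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

print(round(z_critical(0.95), 3))  # close to 1.96
print(round(z_critical(0.90), 3))  # close to 1.645
```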
Assumptions and Conditions
Independence Assumption → check three conditions:
 Plausible independence condition. This condition
depends on your knowledge of the situation.
 Randomization condition. Were the data sampled at
random or generated from a properly randomized
experiment?
 10% condition.
Sample Size Assumption → check success/failure
condition.
We must expect at least 10 “successes” and at least 10
“failures.”
One-proportion z-interval
When the conditions are met, we are ready to find the
confidence interval for the population proportion, p.
The confidence interval is
$$\hat{p} \pm z^{*}\,SE(\hat{p})$$
where the standard error of the proportion is estimated by
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
Example
In May 2002, the Gallup Poll asked 537 randomly sampled
adults the question “Generally speaking, do you believe the
death penalty is applied fairly or unfairly in this country
today?” Of these, 53% answered “Fairly” and 7% said they
didn’t know. What can we conclude from this survey?
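One way to work the Gallup example is a 95% one-proportion z-interval for the share answering “Fairly”. The function name and the choice of 95% confidence are assumptions; the slides do not fix a confidence level for this example.

```python
import math
from statistics import NormalDist

def one_proportion_z_interval(p_hat, n, confidence=0.95):
    """Return (lower, upper) for a one-proportion z-interval."""
    q_hat = 1 - p_hat
    # Success/failure condition: at least 10 observed successes and failures.
    if n * p_hat < 10 or n * q_hat < 10:
        raise ValueError("success/failure condition not met")
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    se = math.sqrt(p_hat * q_hat / n)
    margin = z_star * se
    return p_hat - margin, p_hat + margin

lower, upper = one_proportion_z_interval(0.53, 537)
print(f"95% CI for the proportion answering 'Fairly': ({lower:.3f}, {upper:.3f})")
```

The interval runs from about 48.8% to 57.2%, so it includes values below 50%: the survey does not establish that a majority of all adults believe the death penalty is applied fairly.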
Student t distribution
Note: t → Z as n increases.
The t-distribution has a similar shape to
the Normal distribution, but it has
longer tails.
[Figure: density curves centered at 0 for the standard Normal (t with df = ∞), t (df = 13) and t (df = 5); horizontal axis t]
t-Distribution
Upper Tail Area
df    .25     .10     .05
1     1.000   3.078   6.314
2     0.817   1.886   2.920
3     0.765   1.638   2.353
Let: n = 3, so df = n − 1 = 2. With α = .10, α/2 = .05, and the table gives t* = 2.920.
This is the value of t, not the value of the
probability.
[Figure: t curve centered at 0 with upper tail area α/2 = .05 to the right of t = 2.920]
Using the t distribution, the CI for the mean can be formulated as
$$\bar{X} \pm t^{*}_{n-1}\,SE(\bar{X})$$
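A sketch of the t-interval using only the critical values shown in the slide's table. The three data values are hypothetical, and the function name t_interval and its upper_tail_area argument are illustrative; a fuller version would look up t* from a statistics library instead of a hard-coded table.

```python
import math
import statistics

# Critical values copied from the slide's table: T_TABLE[df][upper tail area].
T_TABLE = {
    1: {0.25: 1.000, 0.10: 3.078, 0.05: 6.314},
    2: {0.25: 0.817, 0.10: 1.886, 0.05: 2.920},
    3: {0.25: 0.765, 0.10: 1.638, 0.05: 2.353},
}

def t_interval(values, upper_tail_area):
    """Confidence interval for the mean: x-bar +/- t* SE(x-bar), with df = n - 1."""
    n = len(values)
    t_star = T_TABLE[n - 1][upper_tail_area]
    x_bar = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return x_bar - t_star * se, x_bar + t_star * se

# Hypothetical sample with n = 3, so df = 2; alpha = .10 means upper tail area .05.
lower, upper = t_interval([10.0, 12.0, 14.0], 0.05)
print(f"90% CI for the mean: ({lower:.3f}, {upper:.3f})")
```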