Fundamentals of
Sampling Method
Week 4
Research Methods & Data
Analysis
Dr. Mario Mazzocchi
Tutorials
• Thursday 30th October
9-11 AG GL 20 (M. Mazzocchi)
• Tuesday 4th November
11-1pm (H.Neeliah)
• You may attend:
– One (the most convenient for you)
– Both (it may be very useful)
– None (not really advised…)
Lecture outline
• Key notions of statistics
• Simple random sampling
• Sampling error
• Sample size
• Other sampling methods
Distributions
• The values observed in a set of data, together with their:
– Absolute frequencies
– Relative frequencies (probabilities)
[Figure: histogram of Amount spent (x-axis, 200.00 to 700.00) against Count (y-axis, 0 to 60)]
Relative and cumulative
frequencies
$f_i = n_i / N$
$F_i = f_1 + f_2 + \dots + f_i = \sum_{h=1}^{i} f_h$
[Figure: two charts of Amount spent (x-axis, 200.00 to 700.00): relative frequencies (Percent, 0% to 8%) and cumulative frequencies (Percent, up to 100%)]
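A minimal Python sketch of the two definitions, relative frequency $f_i = n_i/N$ and cumulative frequency $F_i$. The amounts and the function name `frequency_table` are illustrative assumptions, not data from the lecture.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return (value, n_i, f_i, F_i) rows sorted by value."""
    counts = Counter(values)                     # absolute frequencies n_i
    total = len(values)                          # N
    ordered = sorted(counts)
    rel = [counts[v] / total for v in ordered]   # f_i = n_i / N
    cum = list(accumulate(rel))                  # F_i = f_1 + ... + f_i
    return [(v, counts[v], f, F) for v, f, F in zip(ordered, rel, cum)]

# Hypothetical amounts spent, for illustration only
amounts = [200, 300, 300, 400, 400, 400, 500, 500, 600, 700]
for value, n_i, f_i, F_i in frequency_table(amounts):
    print(f"{value:>5} {n_i:>3} {f_i:6.2f} {F_i:6.2f}")
```

The last cumulative frequency is always 1 (100%), which is a quick check that the table is complete.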
Distributions of random
variables
• The distribution of possible values
together with their probabilities
(probability density function, p.d.f.)
The normal (Gaussian)
distribution
• …is the distribution representing perfect
randomness around a mean value
• In statistics, the normal distribution plays a
key role in the theory of errors
• The central limit theorem implies that
“averaging” almost always gives rise to a
normal distribution (the error on the average is
random), provided that the number of
observations is large (>40)
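The effect of averaging can be explored with a small simulation. The sketch below draws repeated samples from a deliberately non-normal population and compares the spread of the sample means with $\sigma/\sqrt{n}$. The population, the sample size of 50, the number of repetitions and the helper name `sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(population, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n (with replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(repetitions)]

# Hypothetical, clearly non-normal population: nine 0s and one 10
population = [0] * 9 + [10]
means = sample_means(population, n=50, repetitions=2000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print("population mean:      ", pop_mean)
print("mean of sample means: ", statistics.fmean(means))
print("sd of sample means:   ", statistics.stdev(means))
print("sigma / sqrt(n):      ", pop_sd / 50 ** 0.5)
```

Plotting a histogram of `means` for increasing n is one way to see the bell shape emerge.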
The normal distribution
[Figure: normal density $p$ centred on $\mu$; 95% of values lie between $\mu - 1.96\sigma$ and $\mu + 1.96\sigma$, with probability 0.025 in each tail]
The Student-t distribution
• When the variable has a normal distribution in the
population (with unknown variance), the standardized
sample mean follows a t distribution
• The t-distribution is similar to the normal
distribution, apart from having higher tail probabilities
• The larger the sample, the more similar the t-distribution is to the normal distribution
• For samples with more than 30-40 units, the
difference between the two distributions is negligible
The t-distribution
[Figure: t density centred on $\bar{x}$, with bounds $\bar{x} - t_{\alpha/2} s_{\bar{x}}$ and $\bar{x} + t_{\alpha/2} s_{\bar{x}}$]
$t_{\alpha/2}$ and $z_{\alpha/2}$: tabled values
(t according to sample size n)

Level of confidence  $\alpha$  $\alpha/2$  t (n=10)  t (n=20)  t (n=30)  t (n=40)  z
99%                  0.01      0.005       3.17      2.85      2.75      2.70      2.58
95%                  0.05      0.025       2.23      2.09      2.04      2.02      1.96
90%                  0.10      0.050       1.81      1.72      1.70      1.68      1.64
Population parameters
(in a population of N elements)
• Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
• Standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
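As a sketch, the three population parameters can be computed directly from their definitions. The five values and the function name `population_parameters` are made up for illustration.

```python
import math

def population_parameters(values):
    """Mean, variance and standard deviation over all N elements of a population."""
    N = len(values)
    mu = sum(values) / N
    sigma2 = sum((x - mu) ** 2 for x in values) / N   # divisor N, not N - 1
    return mu, sigma2, math.sqrt(sigma2)

# Hypothetical population of N = 5 values
mu, sigma2, sigma = population_parameters([2, 4, 4, 4, 6])
print(mu, sigma2, sigma)
```

Python's standard `statistics.pvariance` and `statistics.pstdev` use the same divisor N.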
Sampling
• A sample is a subgroup of the
population selected for the study
• Sample statistics allow us to make
inferences about the population
parameters, through estimation and
hypothesis testing
• The sample space is a complete set of
all possible results of the sampling
procedure
Simple random sampling
• Each element of the population has a known and
equal probability of selection
• Every element is selected independently from other
elements
• The probability of selecting a given sample of n
elements is computable (known)
• The Central Limit Theorem guarantees that, for
simple random samples with a sufficiently large
sample size n (>40), the sample mean
approximately follows a normal distribution
Sample statistics
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (the $n-1$ divisor gives unbiasedness)
• Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
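The sample statistics map directly onto Python's standard `statistics` module, whose `variance` and `stdev` functions use the $n-1$ divisor. The data below are a hypothetical sample used only to show the calls.

```python
import statistics

# Hypothetical sample of n = 5 observations
sample = [2, 4, 4, 4, 6]

x_bar = statistics.fmean(sample)      # sample mean
s2 = statistics.variance(sample)      # sample variance, divisor n - 1
s = statistics.stdev(sample)          # sample standard deviation

# For comparison: the same data treated as a whole population (divisor n)
print(x_bar, s2, s, statistics.pvariance(sample))
```

The sample variance is larger than the population-style variance on the same data, because of the smaller divisor.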
Standard deviation and
standard error
• The standard deviation measures
the variability of a given variable (e.g.
X) within the population or sample
• The standard error refers to the
accuracy (variability) of the sample
statistics (e.g. mean), i.e. the error due
to the fact that the statistic is computed
on a sample rather than on the
population (sampling error)
Basic SRS sample statistics
(unknown pop. variance)
Mean case:
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample standard deviation of X: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
• Standard error of the mean: $s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$
Proportion case (p):
• Standard error of the proportion: $s_p = \sqrt{\frac{p(1-p)}{n-1}}$
The standard errors measure the ACCURACY of sample estimates
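A short sketch of the two standard errors, assuming the forms $s/\sqrt{n}$ for the mean and $\sqrt{p(1-p)/(n-1)}$ for the proportion. The function names and the input figures are illustrative assumptions.

```python
import math

def standard_error_mean(s, n):
    """Standard error of the sample mean: sqrt(s^2 / n) = s / sqrt(n)."""
    return s / math.sqrt(n)

def standard_error_proportion(p, n):
    """Standard error of a sample proportion, with the n - 1 divisor."""
    return math.sqrt(p * (1 - p) / (n - 1))

# Hypothetical figures: s = 1.25 from n = 20 units; p = 0.30 from n = 101 units
print(standard_error_mean(1.25, 20))
print(standard_error_proportion(0.30, 101))
```

Quadrupling n halves the standard error of the mean, which is why extra precision becomes progressively more expensive.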
Finite population
correction factor
• For finite populations (i.e. all populations in social
research), with large samples (more than 10% of
N) the uncorrected formula tends to overestimate
the standard error of the sample mean (proportion)
• In order to account for that, the following
correction is necessary
$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} \, \sqrt{1 - \frac{n}{N}}$
$s_p = \sqrt{\frac{p(1-p)}{n-1}} \, \sqrt{1 - \frac{n}{N}}$
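The correction is a single multiplicative factor, $\sqrt{1 - n/N}$, applied to the uncorrected standard error. A minimal sketch, with hypothetical function names and inputs:

```python
import math

def fpc(n, N):
    """Finite population correction factor sqrt(1 - n/N)."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return math.sqrt(1 - n / N)

def corrected_se_mean(s, n, N):
    """Standard error of the mean with the finite population correction."""
    return math.sqrt(s ** 2 / n) * fpc(n, N)

def corrected_se_proportion(p, n, N):
    """Standard error of a proportion with the finite population correction."""
    return math.sqrt(p * (1 - p) / (n - 1)) * fpc(n, N)

# Hypothetical: n = 20 sampled from N = 200 (10% of the population)
print(fpc(20, 200))
print(corrected_se_mean(1.25, 20, 200))
```

The factor approaches 1 when n is a small share of N and reaches 0 when the whole population is surveyed (n = N), where there is no sampling error left.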
Level of confidence $1-\alpha$
and z parameter
The level of confidence $1-\alpha$ is the
probability that the identified confidence
interval contains the true population mean
For the normal distribution, given a
value of $\alpha$, the corresponding $z_{\alpha/2}$
value is tabulated (e.g. $\alpha = 0.05$ gives $z_{\alpha/2} = 1.96$)
[Figure: normal density centred on $\bar{x}$, with probability $\alpha/2$ in each tail beyond $\bar{x} - z_{\alpha/2} s_{\bar{x}}$ and $\bar{x} + z_{\alpha/2} s_{\bar{x}}$]
Confidence interval around $\bar{x}$ at a level of confidence $1-\alpha$: $\bar{x} \pm z_{\alpha/2} s_{\bar{x}}$
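Instead of reading the z value from a table, it can be computed from the inverse cumulative distribution of the standard normal. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names are assumptions.

```python
from statistics import NormalDist

def z_value(alpha):
    """Two-sided critical value z_{alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def normal_confidence_interval(x_bar, se, alpha):
    """Lower and upper bound x_bar -/+ z_{alpha/2} * se."""
    z = z_value(alpha)
    return x_bar - z * se, x_bar + z * se

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  z={z_value(alpha):.2f}")
```

Rounded to two decimals, these values agree with the z column of the tabled values (1.64, 1.96, 2.58).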
The t-distribution
[Figure: t density centred on $\bar{x}$, with bounds $\bar{x} - t_{\alpha/2} s_{\bar{x}}$ and $\bar{x} + t_{\alpha/2} s_{\bar{x}}$]
Confidence intervals
• Calculate the sample mean
• Decide a level of confidence (usually
95% or 99%)
• Choose whether to use the Student-t
distribution or the Normal distribution
• Compute the sample standard error
• Define the lower and upper bound of
the confidence interval
Exercise
• Suppose that you have interviewed 20
students out of 200 in the agricultural
building, asking them how much they
paid for lunch yesterday
• You get an average of £ 3.67
• The standard deviation is 1.25
• Compute the 95% confidence interval
• Compute the 99% confidence interval
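One possible way to set up the exercise in Python, as a sketch rather than the official solution. It assumes the Student-t distribution is used (n = 20 is below the 30-40 threshold) with the tabled t values for sample size 20 (2.09 and 2.85). Since the sample is exactly 10% of N, whether to apply the finite population correction is a judgement call, so both versions are computed. The function name `t_interval` is hypothetical.

```python
import math

# t values copied from the lecture's table (sample size 20)
T_TABLE_N20 = {0.95: 2.09, 0.99: 2.85}

def t_interval(x_bar, s, n, t_value, N=None):
    """Confidence interval x_bar -/+ t * s_xbar; applies sqrt(1 - n/N) when N is given."""
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return x_bar - t_value * se, x_bar + t_value * se

for level, t_value in T_TABLE_N20.items():
    plain = t_interval(3.67, 1.25, 20, t_value)
    corrected = t_interval(3.67, 1.25, 20, t_value, N=200)
    print(f"{level:.0%}: without FPC ({plain[0]:.2f}, {plain[1]:.2f}), "
          f"with FPC ({corrected[0]:.2f}, {corrected[1]:.2f})")
```

The 99% interval is wider than the 95% one: more confidence is paid for with less precision.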
Determining sample size
Factors influencing sample size (n):
• Size of the population (N)
• Variability of the population ($\sigma$)
• Desired level of accuracy (q)
• Level of confidence ($1-\alpha$)
• Budget constraint
Simple random sampling:
determining sample size
• Relative sampling error (r.s.e.)
$r = \frac{t_{\alpha/2} \, s_x}{\sqrt{n} \, \bar{X}} \sqrt{1 - \frac{n}{N}}$
• Determining sample size for a given
r.s.e. (approximate formula)
$n_0 = \left( \frac{t_{\alpha/2} \, s_x}{r \, \bar{X}} \right)^2$
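A sketch of both formulas, reading $s_x$ as the standard deviation of X (an assumption that makes the two expressions consistent with each other). The planning values and function names are hypothetical, and a z value of 1.96 stands in for $t_{\alpha/2}$ because the sample size is not yet known.

```python
import math

def relative_sampling_error(t_value, s_x, x_bar, n, N):
    """r = t * s_x / (sqrt(n) * x_bar) * sqrt(1 - n/N)."""
    return t_value * s_x / (math.sqrt(n) * x_bar) * math.sqrt(1 - n / N)

def approximate_sample_size(t_value, s_x, x_bar, r):
    """n0 = (t * s_x / (r * x_bar))^2, rounded up to a whole unit."""
    return math.ceil((t_value * s_x / (r * x_bar)) ** 2)

# Hypothetical planning values: s_x = 1.25, mean = 3.67, target r = 5%
n0 = approximate_sample_size(1.96, 1.25, 3.67, 0.05)
print(n0)
print(relative_sampling_error(1.96, 1.25, 3.67, n0, N=2000))
```

The approximate formula ignores the finite population factor, so in a finite population the relative error achieved with $n_0$ units is slightly smaller than the target.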
The sampling design
process
1. Define the target population, its elements and
the sampling units
2. Determine the sampling frame (list)
3. Select a sampling technique
• Sampling with/without replacement
• Probability/Nonprobability sampling
4. Determine the sample size
• Precision versus costs
• The marginal value in terms of precision of
additional sampling units is decreasing
5. Execute the sampling process
The sampling techniques
• Probabilistic samples
– Simple random sampling
– Systematic sampling
– Stratified sampling
– Cluster sampling
– Other sampling techniques
• Nonprobabilistic samples
– Convenience sampling
– Judgmental sampling
– Quota sampling
– Snowball sampling
Representativeness
• A sample can be considered as
“representative” when it is expected to
exhibit the average properties of the
population
Selection bias
• Improper selection of sample units (ignoring
a relevant “control variable” that generates
bias), so that the values observed in the
sample are biased and the sample is not
representative.
Example:
A survey is conducted to measure goat milk
consumption, but the interviewers select only
people in urban areas, who on average drink
less goat milk.
Simple random sampling
• Each element of the population has a known and
equal probability of selection
• Every element is selected independently from other
elements
• The probability of selecting a given sample of n
elements is computable (known)
Advantages:
– Statistical inference is possible
– It is easily understood
Disadvantages:
– Representative samples are large
and expensive
– Standard errors are larger than in
other probabilistic sampling
techniques
– Sometimes it is difficult to execute
a really random sampling
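A minimal sketch of drawing a simple random sample from a sampling frame with Python's `random.sample`. It assumes sampling without replacement, so every unit in the frame has the same chance of selection and no unit appears twice. The frame of student IDs and the function name are illustrative.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each with equal probability."""
    rng = random.Random(seed)
    return rng.sample(frame, n)   # sampling without replacement

# Hypothetical sampling frame: student IDs 1..200
frame = list(range(1, 201))
sample = simple_random_sample(frame, 20, seed=42)
print(sorted(sample))
```

Fixing the seed makes the draw reproducible, which helps when the selection has to be documented.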
Systematic sampling
• A list of N elements in the population is compiled,
ordered according to a specified variable
– Unrelated to the target variable (similar to SRS)
– Related to the target variable (increased
representativeness)
• A sample size n is chosen
• A systematic step of k=N/n is set
• A random number s between 1 and k is extracted and
represents the first element to be included
• Then the other elements selected are s+k, s+2k, s+3k…
Advantages:
– Cheaper and easier than SRS
– More representative if order is related
to the interest variable (monotone)
– Sampling frame not always necessary
Disadvantages:
– Less representative (biased) if the
order is cyclical
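The selection steps can be sketched as follows. This assumes the usual linear variant, in which the random start is drawn within the first step (between 1 and k) so that exactly n units fall inside the list, and in which k is N/n rounded down when the division is not exact. The function name and the frame are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit of an ordered frame after a random start."""
    N = len(frame)
    k = N // n                        # sampling step, k = N/n rounded down
    if k < 1:
        raise ValueError("sample size exceeds frame size")
    rng = random.Random(seed)
    start = rng.randint(1, k)         # assumed: random start within the first step
    # 1-based positions start, start + k, start + 2k, ..., n units in total
    return [frame[start - 1 + j * k] for j in range(n)]

# Hypothetical ordered frame of N = 200 units, n = 20, so k = 10
frame = list(range(1, 201))
print(systematic_sample(frame, 20, seed=1))
```

Only one random number is needed for the whole sample, which is what makes the method cheap, and also why a cyclical ordering of the list can bias every selected unit in the same direction.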
Stratified sampling
• The population is partitioned into strata through control
variables (stratification variables) closely related
to the target variable, so that there is homogeneity
within each stratum and heterogeneity between strata
• Simple random sampling is applied within each
stratum of the population
– Proportionate sampling: size of the sample from each stratum is
proportional to the relative size of the stratum in the total
population
– Disproportionate sampling: size is also proportional to the
standard deviation of the target variable in each stratum
Advantages:
– Gains in precision
– Includes all relevant subpopulations even
if small
Disadvantages:
– Stratification variables may not be
easily identifiable
– Stratification can be expensive
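The two allocation rules can be sketched as follows: proportionate allocation sets $n_h \propto N_h$, while the disproportionate rule described above sets $n_h \propto N_h \sigma_h$. The strata, their sizes and standard deviations, and the function names are invented for illustration; the results are left unrounded.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate n so that n_h is proportional to the stratum size N_h."""
    N = sum(stratum_sizes.values())
    return {h: n * N_h / N for h, N_h in stratum_sizes.items()}

def disproportionate_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n so that n_h is proportional to N_h * sigma_h."""
    weights = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}

# Hypothetical strata: sizes and standard deviations of the target variable
sizes = {"urban": 600, "rural": 400}
sds = {"urban": 1.0, "rural": 3.0}
print(proportionate_allocation(sizes, 100))
print(disproportionate_allocation(sizes, sds, 100))
```

In this made-up case the more variable rural stratum receives a larger share of the sample under the disproportionate rule, even though it is the smaller stratum.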
Cluster sampling
• The population is partitioned into clusters
• Elements within the cluster should be as
heterogeneous as possible with respect to the
variable of interest (e.g. area sampling)
1. A random sample of clusters is extracted through
SRS (with probability proportional to the cluster
size)
2a. All the elements of the cluster are selected (one-stage)
2b. A probabilistic sample is extracted from the cluster
(two-stage cluster sampling)
Advantages:
– Reduced costs
– Higher feasibility
Disadvantages:
– Less precision
– Inference can be difficult
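A sketch of one-stage and two-stage cluster sampling. For simplicity it draws clusters with equal probability; selection with probability proportional to cluster size is not implemented here. The cluster contents and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """One-stage (n_per_cluster=None) or two-stage cluster sampling.

    clusters: dict mapping cluster name -> list of elements.
    Clusters are drawn by SRS with equal probability (a simplifying assumption).
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        units = clusters[name]
        if n_per_cluster is None:
            sample[name] = list(units)                    # one-stage: take all units
        else:
            k = min(n_per_cluster, len(units))
            sample[name] = rng.sample(units, k)           # two-stage: SRS within cluster
    return sample

# Hypothetical areas and the households they contain
areas = {"A": [1, 2, 3], "B": [4, 5, 6, 7], "C": [8, 9]}
print(cluster_sample(areas, 2, seed=0))
print(cluster_sample(areas, 2, n_per_cluster=2, seed=0))
```

Only the selected clusters need a full list of elements, which is where the cost saving comes from.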
Non probabilistic samples
Convenience sampling
• Only “convenient” elements enter the
sample
Advantages:
– Cheapest method
– Quickest method
Disadvantages:
– Selection bias
– Non representativeness
– Inference is not possible
Judgmental sampling
• Selection based on the judgment of the
researcher
Advantages:
– Low cost
– Quick
Disadvantages:
– Non representativeness
– Inference is not possible
– Subjective
Quota sampling
1. Define control categories (quotas) for the
population elements, such as sex, age…
2. Apply a “restricted judgmental sampling”,
so that quotas in the sample are the same as
those in the population
Advantages:
– Cheapest method
– Quickest method
Disadvantages:
– There is no guarantee that the
sample is representative (relevance
of control characteristic chosen)
– Many sources of selection bias
– No assessment of sampling error
Snowball sampling
• A first small sample is selected randomly
• Respondents are asked to identify others who
belong to the population of interest
• The referrals will have demographic and
psychographic characteristics similar to the
referrers
Advantages:
– Lower costs
– Low variability
– Useful for “rare” populations
Disadvantages:
– Inference is not possible