
Fundamentals of Data Analysis: Inferences Based on One and Two Samples

These are a series of notes to help you grasp the contents of the subject Fundamentals of Data Analysis for BBA + BDA students. Here you will find the formulas and R commands for most of the material related to one and two samples (Chapters 6, 7, and 8 in MBS). In case of doubt, just reach out to me.

Course 2022 - 2023
Fundamentals of Data Analysis - IE University - BBA + BDA
Inferences Based on One and Two Samples
Contents

1 • What are we doing?
2 • Point Estimators
3 • Confidence Interval
  3.1 Formulas for Confidence Intervals
    3.1.1 Confidence interval for µ: Large Sample
    3.1.2 Confidence interval for µ: Small Sample
    3.1.3 Confidence Interval for Population Proportion p: Large Sample
    3.1.4 Confidence interval for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling
    3.1.5 Confidence interval for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling
    3.1.6 Conf. Interval for Diff. of means µd (Matched Pairs): Large Sample + Dep. Sampling
    3.1.7 Conf. Interval for Diff. of means µd (Matched Pairs): Small Sample + Dep. Sampling
    3.1.8 Conf. interval for difference of population proportions p1 − p2: Large Sample
4 • Hypothesis testing
  4.1 The two routes
    4.1.1 Way 1: Compare test statistic and rejection region
    4.1.2 Way 2: Compare p-value and α
  4.2 Corresponding test statistics
    4.2.1 Hypothesis Testing for Population mean µ
    4.2.2 Hypothesis Testing for Population mean µ: Large Sample
    4.2.3 Hypothesis Testing for Population mean µ: Small Sample
    4.2.4 Hypothesis Testing for Population Proportion p: Large Sample
    4.2.5 Hypothesis Testing for Difference of means µ1 − µ2
    4.2.6 Hypothesis Testing for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling
    4.2.7 Hypothesis Testing for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling
    4.2.8 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs)
    4.2.9 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Large Sample + Dep. Sampling
    4.2.10 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Small Sample + Dep. Sampling
    4.2.11 Hypothesis testing for p1 − p2: Large Sample
5 • Determining Sample Size
  5.1 Confidence interval for the mean
  5.2 Confidence Interval for a proportion
  5.3 Confidence Interval for the difference of two means
  5.4 Confidence Interval for the difference of two proportions
A Appendix: Useful Definitions
B Appendix: zα and tα (also zα/2 and tα/2)
  B.1 • zα and zα/2
  B.2 • tα and tα/2
Inf. Based on One and Two Samples
FDA
1 • What are we doing?
MAIN GOAL: Obtain information about some parameters of the population (target parameters) through a sample¹. Examples:
• Population mean ⇒ µ
• Population proportion ⇒ p
• Difference of two means ⇒ µ1 − µ2
• Difference of two means, matched pairs ⇒ µd
• Difference of two proportions ⇒ p1 − p2
We can pose many questions that lead to different inferences. Nonetheless, here we focus on:
• Point estimator: a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the population parameter.
• Confidence interval: (or interval estimator) a formula that tells us how to use the sample data to calculate an interval that estimates the target parameter with a certain level of confidence.
• Hypothesis testing: where we make a statistical hypothesis (the null hypothesis). An alternative hypothesis is proposed for the probability distribution of the data, and the comparison of the two models is deemed statistically significant if, according to a threshold probability (the significance level), the data are very unlikely to have occurred under the null hypothesis.
2 • Point Estimators
You should understand point estimators as values that approximate parameters of the population that are unknown (target parameters). Below we list the main point estimators seen so far.

¹ Definitions in the Appendix.
Inf. Based on One and Two Samples
FDA
Mean of a population (µ)
Sample mean, x̄ = (Σᵢ xᵢ)/n. Where the sample = x1, x2, . . . , xn

Variance of a population (σ²)
Sample variance, s² = Σᵢ (xᵢ − x̄)² / (n − 1). Where the sample = x1, x2, . . . , xn. Related to the sample variance we have the sample standard deviation:
Sample standard deviation, s = √( Σᵢ (xᵢ − x̄)² / (n − 1) ). Where the sample = x1, x2, . . . , xn.

Sample proportion (p)
Sample proportion, p̂ = x/n. Where x = successes, n = sample size

Difference of means (µ1 − µ2)
Diff. of sample means, x̄1 − x̄2. Where x̄1 is the sample mean of pop. 1 and x̄2 the sample mean of pop. 2

Difference of means, Matched Pairs (µd)
Diff. of sample means, Matched Pairs, d̄ = (Σ dᵢ)/nd. Where the dᵢ are computed as the differences of the matched pairs and nd is the number of pairs

Difference of proportions (p1 − p2)
Difference of sample proportions, p̂1 − p̂2. Where p̂1 is the sample prop. of pop. 1 and p̂2 the sample prop. of pop. 2
• Computing them in R is easy, using R as a calculator.
• Each of these estimators has its own properties. We saw them in class and you should be able to locate them easily using the slides. For instance, each of these estimators has its own distribution, which we call the sampling distribution, and they are unbiased estimators, that is, their expected value is exactly the target parameter.
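As a quick illustration, all of the point estimators above can be computed with one-liners in R. The sample values below are made up for illustration:

```r
# Hypothetical sample (made-up values)
x <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7)

mean(x)   # sample mean x-bar, estimates mu
var(x)    # sample variance s^2 (R divides by n - 1)
sd(x)     # sample standard deviation s

# Sample proportion: 18 successes out of n = 40 trials
p_hat <- 18 / 40

# Matched pairs: d_i = x1_i - x2_i, then d-bar estimates mu_d
x1 <- c(5.1, 4.8, 5.6, 5.0)
x2 <- c(4.9, 4.7, 5.2, 5.1)
d  <- x1 - x2
mean(d)   # point estimator of mu_d = mu_1 - mu_2
```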
3 • Confidence Interval
Interpretation: A 100(1 − α)% confidence interval is an interval that captures the target parameter with confidence 100(1 − α)%, or probability 1 − α.
• Remember that we tend to write confidence intervals as

Conf. Interval Structure
point estimator ± Margin of Error

• Usually, Margin of Error = Critical Value · Standard Error, and Width = 2 · Margin of Error
3.1 Formulas for Confidence Intervals

3.1.1 Confidence interval for µ: Large Sample

• Large sample n ≥ 30
• σ is known

x̄ ± zα/2 (σ/√n)

• Large sample n ≥ 30
• σ is unknown

x̄ ± zα/2 (s/√n)
• To compute them with R you need to use R as a calculator.
The commands below can come in handy
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα/2 can be computed with the command

zα/2 = qnorm(p = 1 − α/2)
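As a sketch, here is a 95% large-sample confidence interval for µ computed "by calculator" in R, with σ unknown (the data are generated only for illustration):

```r
# Made-up large sample (n >= 30); sigma unknown, so we use s
set.seed(1)
x <- rnorm(40, mean = 100, sd = 15)

alpha <- 0.05
z     <- qnorm(1 - alpha / 2)            # z_{alpha/2}, about 1.96
me    <- z * sd(x) / sqrt(length(x))     # Margin of Error
ci    <- c(mean(x) - me, mean(x) + me)   # x-bar +/- z_{alpha/2} * s / sqrt(n)
ci
```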
3.1.2 Confidence interval for µ: Small Sample

• Small sample n < 30
• The population is approx. normal.

x̄ ± tα/2 (s/√n)

tα/2 computed with n − 1 degrees of freedom
• With R, use the t.test function. For a sample S
t.test(x = S,conf.level = 1 - α)
Also, tα/2 can be computed using:
tα/2 = qt(p = 1 − α/2,degrees of freedom)
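For example, with a made-up small sample S, the t.test output and the by-hand formula give the same interval:

```r
# Made-up small sample (n < 30) from an approx. normal population
S <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9)

alpha <- 0.05
ci_r <- t.test(x = S, conf.level = 1 - alpha)$conf.int   # 95% CI for mu

# Same interval by hand: x-bar +/- t_{alpha/2} * s / sqrt(n)
n       <- length(S)
tc      <- qt(1 - alpha / 2, df = n - 1)
ci_hand <- mean(S) + c(-1, 1) * tc * sd(S) / sqrt(n)
```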
3.1.3 Confidence Interval for Population Proportion p: Large Sample

• Large sample np̂ ≥ 15 and nq̂ ≥ 15

p̂ ± zα/2 √( p̂q̂/n )

where q̂ = 1 − p̂
• In R, you can use R as a calculator. There are packages that allow you to do the computation, but
we will not see them in this course.
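Using R as a calculator, the interval for p looks like this (counts are made up):

```r
# Made-up data: 52 successes in n = 120 trials
x <- 52; n <- 120
p_hat <- x / n
q_hat <- 1 - p_hat
stopifnot(n * p_hat >= 15, n * q_hat >= 15)   # large-sample conditions

alpha <- 0.05
me <- qnorm(1 - alpha / 2) * sqrt(p_hat * q_hat / n)
ci <- c(p_hat - me, p_hat + me)               # 95% CI for p
```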
3.1.4 Confidence interval for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are known

(x̄1 − x̄2) ± zα/2 √( σ1²/n1 + σ2²/n2 )

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are unknown

(x̄1 − x̄2) ± zα/2 √( s1²/n1 + s2²/n2 )
• To compute them with R you need to use R as a calculator.
Remember the commands
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα/2 can be computed with the command
zα/2 = qnorm(p = 1 − α/2)
• We can use confidence intervals to make inferences, as summarized below.
Inferences Using Conf. Intervals
If we have a conf. interval for µ1 − µ2 with a confidence level (1 − α)% then
1. If the confidence interval lies entirely above 0, then with confidence (1 − α) or level of significance α we can say that µ1 > µ2
2. If the confidence interval lies entirely below 0, then with confidence (1 − α) or level of significance α we can say that µ2 > µ1
3. If the interval contains 0, we cannot conclude anything.
• This also applies to the other confidence intervals that we will see along the course.
3.1.5 Confidence interval for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling

• Small sample n1 < 30 or n2 < 30
• Independent Sampling
• The populations are approx. normal.
• Equal variances σ1 = σ2

(x̄1 − x̄2) ± tα/2 √( sp² (1/n1 + 1/n2) )

tα/2 computed with n1 + n2 − 2 degrees of freedom,

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
• In R, use the t.test function. For samples S1 and S2
t.test(x = S1 , y = S2 ,conf.level = 1 - α,var.equal = TRUE)
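A minimal sketch with two made-up small samples, assuming equal variances:

```r
# Made-up small independent samples, assuming sigma_1 = sigma_2
S1 <- c(5.2, 4.9, 5.6, 5.1, 5.4, 5.0)
S2 <- c(4.6, 4.8, 4.5, 5.0, 4.7, 4.4)

alpha <- 0.05
tt <- t.test(x = S1, y = S2, conf.level = 1 - alpha, var.equal = TRUE)
tt$conf.int   # CI for mu_1 - mu_2 based on the pooled variance sp^2
```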
3.1.6 Conf. Interval for Diff. of means µd (Matched Pairs): Large Sample + Dep. Sampling

Remember that for dependent sampling you have to construct a table where dᵢ = x1i − x2i and nd is the number of pairs,

d̄ = (Σᵢ dᵢ)/nd,  sd = √( Σᵢ (dᵢ − d̄)² / (nd − 1) )

and that the mean of the d's satisfies

µd = µ1 − µ2

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is known

d̄ ± zα/2 (σd/√nd)
• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is unknown

d̄ ± zα/2 (sd/√nd)
• Use R as a calculator
3.1.7 Conf. Interval for Diff. of means µd (Matched Pairs): Small Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Small sample nd < 30
• Pop. approx. normal

d̄ ± tα/2 (sd/√nd)
• In R, you can use the t.test function. Given the two samples S1 and S2 we need to set the paired
option to true
t.test(x = S1 , y = S2 ,conf.level = 1 - α, paired = TRUE)
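A sketch with made-up paired data. Note that the paired interval is exactly the one-sample interval computed on the differences:

```r
# Made-up matched pairs: before/after measurements on the same units
S1 <- c(80, 75, 90, 68, 72, 84)   # before
S2 <- c(78, 74, 85, 69, 70, 80)   # after

alpha <- 0.05
ci_paired <- t.test(x = S1, y = S2, conf.level = 1 - alpha, paired = TRUE)$conf.int

# Equivalent: a one-sample interval on the differences d_i = S1_i - S2_i
ci_diff <- t.test(x = S1 - S2, conf.level = 1 - alpha)$conf.int
```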
3.1.8 Conf. interval for difference of population proportions p1 − p2: Large Sample

• Large sample size (nᵢ p̂ᵢ ≥ 15 and nᵢ q̂ᵢ ≥ 15 for i = 1, 2)
• Independent sampling

(p̂1 − p̂2) ± zα/2 √( p̂1q̂1/n1 + p̂2q̂2/n2 )
• To compute it with R use R as a calculator. There are packages that allow you to do this computation,
but we will not see them in this course.
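Using R as a calculator with made-up counts:

```r
# Made-up data: x1 = 60 successes of n1 = 150, x2 = 45 of n2 = 140
x1 <- 60; n1 <- 150
x2 <- 45; n2 <- 140
p1 <- x1 / n1; q1 <- 1 - p1
p2 <- x2 / n2; q2 <- 1 - p2

alpha <- 0.05
me <- qnorm(1 - alpha / 2) * sqrt(p1 * q1 / n1 + p2 * q2 / n2)
ci <- c((p1 - p2) - me, (p1 - p2) + me)   # 95% CI for p1 - p2
```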
4 • Hypothesis testing

There are two hypotheses: the null and the alternative.
• The null hypothesis, denoted H0 , represents the hypothesis that will be accepted unless the
data provide convincing evidence that it is false. This usually represents the “status quo” or
some claim about the population parameter that the researcher wants to test.
• The alternative (research) hypothesis, denoted Ha or H1, represents the hypothesis that will be accepted only if the data provide convincing evidence of its truth. This usually represents the values of a population parameter for which the researcher wants to gather evidence to support. It is stated in one of the following forms:
Types of hypotheses. The types of hypotheses are:
• H0: target parameter = certain value
  Ha: target parameter ≠ certain value → Two-tailed
• H0: target parameter = certain value
  Ha: target parameter < certain value → One-tailed, lower-tailed
• H0: target parameter = certain value
  Ha: target parameter > certain value → One-tailed, upper-tailed
4.1 The two routes

To perform a hypothesis test we can check whether:
1. The test statistic falls in the rejection region
2. The p-value is smaller than α
4.1.1 Way 1: Compare test statistic and rejection region
• Define the hypotheses
• Get the test statistic
• Define the rejection region
• Draw a conclusion (does the test statistic fall into the rejection region?)
How to compute rejection regions for z-statistic. Depending on the type of hypothesis we have:
• H0 : target parameter = certain value,
Ha : target parameter < certain value
then reject if z < −zα
• H0 : target parameter = certain value,
Ha : target parameter > certain value
then reject if z > zα
• H0 : target parameter = certain value,
Ha : target parameter ≠ certain value
then reject if |z| > zα/2 . Equivalently z < −zα/2 or z > zα/2 .
How to compute rejection regions for t-statistic. Depending on the type of hypothesis we have:
• H0 : target parameter = certain value,
Ha : target parameter < certain value
then reject if t < −tα
• H0 : target parameter = certain value,
Ha : target parameter > certain value
then reject if t > tα
• H0 : target parameter = certain value,
Ha : target parameter ≠ certain value
then reject if |t| > tα/2 . Equivalently t < −tα/2 or t > tα/2 .
• Remember to check the appropriate degrees of freedom
4.1.2 Way 2: Compare p-value and α
• Define the hypotheses
• Get the test statistic
• Calculate the p-value (does not depend on α)
• Draw a conclusion (is the p-value < α?)
How to compute the p-values: z-statistic. Depending on the type of hypothesis we have:
Lower-tailed test p-value
• p-value = P (z < test statistic)
Upper-tailed test p-value:
• p-value = P (z > test statistic)
Two-tailed test p-value
If test statistic is positive
• p-value = 2P (z > test statistic)
If test statistic is negative
• p-value = 2P (z < test statistic)
Remember that to reject or not reject we have:
• If p-value < α ⇒ Reject
• If p-value ≥ α ⇒ Do not reject
• You can use the table to compute them. If you want to use R, the probability P (z < certain value)
is given by pnorm(certain value). Using the symmetry of the normal distribution you can compute
P (z > certain value) = 1 − P (z < certain value)
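For instance, the three kinds of p-values for a made-up z test statistic:

```r
z_stat <- -1.8   # a made-up z test statistic

# Lower-tailed: P(z < test statistic)
p_lower <- pnorm(z_stat)
# Upper-tailed: P(z > test statistic) = 1 - P(z < test statistic)
p_upper <- 1 - pnorm(z_stat)
# Two-tailed: twice the tail area beyond |test statistic|
p_two <- 2 * pnorm(-abs(z_stat))
```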
How to compute the p-values: t-statistic. Depending on the type of hypothesis we have:
Lower-tailed test p-value
• p-value = P (t < test statistic)
Upper-tailed test p-value:
• p-value = P (t > test statistic)
Two-tailed test p-value
If test statistic is positive
• p-value = 2P (t > test statistic)
If test statistic is negative
• p-value = 2P (t < test statistic)
Remember that to make the conclusion:
• If p-value < α ⇒ Reject
• If p-value ≥ α ⇒ Do not reject
• t should be of the appropriate degrees of freedom
• In R we will use the t.test function to compute it. Alternatively, the probability P(t< certain
value), where t is the t distribution with certain degrees of freedom, can be computed with pt(certain
value, degrees of freedom). Using the symmetry of the t distribution you can compute P (t >
certain value) = 1 − P (t < certain value).
4.2 Corresponding test statistics
• A test statistic is a statistic (a quantity derived from the sample) used in statistical hypothesis testing.
A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary
of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In
general, a test statistic is selected or defined in such a way as to quantify, within observed data,
behaviours that would distinguish the null from the alternative hypothesis, where such an alternative
is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative
hypothesis.
• An important property of a test statistic is that its sampling distribution under the null hypothesis
must be calculable, either exactly or approximately, which allows p-values to be calculated.
4.2.1 Hypothesis Testing for Population mean µ

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µ = µ0 , Ha : µ < µ0
One-tailed, upper-tailed
• H0 : µ = µ0 , Ha : µ > µ0
Two-tailed
• H0 : µ = µ0 , Ha : µ ≠ µ0
Now we show the corresponding test statistics
4.2.2 Hypothesis Testing for Population mean µ: Large Sample

• Large sample n ≥ 30
• σ is known

z = (x̄ − µ0) / (σ/√n)

• Large sample n ≥ 30
• σ is unknown

z = (x̄ − µ0) / (s/√n)
• To compute them with R you need to use R as a calculator.
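A sketch of the calculator approach with made-up data, testing H0: µ = 100 against a two-tailed alternative:

```r
# Made-up large sample; test H0: mu = 100 vs Ha: mu != 100
set.seed(7)
x   <- rnorm(50, mean = 103, sd = 10)
mu0 <- 100

z <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # sigma unknown -> use s
p_value <- 2 * pnorm(-abs(z))                      # two-tailed p-value

alpha <- 0.05
p_value < alpha   # TRUE means reject H0
```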
4.2.3 Hypothesis Testing for Population mean µ: Small Sample

• Small sample n < 30
• The population is approx. normal.

t = (x̄ − µ0) / (s/√n)

t has n − 1 degrees of freedom
• With R, use the t.test function. For a sample S
t.test(x = S,conf.level = 1 - α, alternative = ’type’, mu = µ0 )
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis (lower-tailed, upper-tailed or two-tailed respectively)
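For example, a lower-tailed test with a made-up small sample:

```r
# Made-up small sample; test H0: mu = 12 vs Ha: mu < 12 (lower-tailed)
S   <- c(11.4, 11.9, 11.2, 12.1, 11.6, 11.3, 11.8)
mu0 <- 12

tt <- t.test(x = S, conf.level = 0.95, alternative = "less", mu = mu0)
tt$statistic   # t statistic with n - 1 degrees of freedom
tt$p.value     # reject H0 if this is below alpha
```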
4.2.4 Hypothesis Testing for Population Proportion p: Large Sample

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : p = p0 , Ha : p < p0
One-tailed, upper-tailed
• H0 : p = p0 , Ha : p > p0
Two-tailed
• H0 : p = p0 , Ha : p ≠ p0

• Large sample np0 ≥ 15 and nq0 ≥ 15

z = (p̂ − p0) / √( p0q0/n )
where q0 = 1 − p0
• In R you can use R as a calculator. There are some packages that allow you to do this computation,
but we will not see them in this course.
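Using R as a calculator with made-up counts, for an upper-tailed test:

```r
# Made-up data: test H0: p = 0.5 vs Ha: p > 0.5 with 120 successes in n = 200
x <- 120; n <- 200
p0 <- 0.5; q0 <- 1 - p0
stopifnot(n * p0 >= 15, n * q0 >= 15)   # large-sample conditions

p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * q0 / n)
p_value <- 1 - pnorm(z)                 # upper-tailed p-value
```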
4.2.5 Hypothesis Testing for Difference of means µ1 − µ2

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 < D0
One-tailed, upper-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 > D0
Two-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 ≠ D0
• Usually D0 = 0
4.2.6 Hypothesis Testing for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are known

z = [ (x̄1 − x̄2) − D0 ] / √( σ1²/n1 + σ2²/n2 )

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are unknown

z = [ (x̄1 − x̄2) − D0 ] / √( s1²/n1 + s2²/n2 )
• To compute them with R you need to use R as a calculator.
Remember the commands
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα and zα/2 can be computed with the commands

zα = qnorm(p = 1 − α)
zα/2 = qnorm(p = 1 − α/2)
4.2.7 Hypothesis Testing for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling

• Small sample n1 < 30 or n2 < 30
• Independent Sampling
• The populations are approx. normal.
• Equal var. σ1 = σ2

t = [ (x̄1 − x̄2) − D0 ] / √( sp² (1/n1 + 1/n2) )

t has n1 + n2 − 2 degrees of freedom,

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
• With R you can use the t.test function. For samples S1 and S2
t.test(x = S1 , y = S2 , conf.level = 1 - α, alternative = ’type’, mu = D0 , var.equal = TRUE)
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis
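A sketch with made-up small samples, testing H0: µ1 − µ2 = 0 two-tailed:

```r
# Made-up small independent samples; test H0: mu1 - mu2 = 0 (two-tailed)
S1 <- c(21, 19, 23, 20, 22, 18, 24)
S2 <- c(17, 18, 16, 19, 15, 18, 17)

tt <- t.test(x = S1, y = S2, conf.level = 0.95,
             alternative = "two.sided", mu = 0, var.equal = TRUE)
tt$statistic   # pooled t statistic with n1 + n2 - 2 df
tt$p.value     # compare against alpha
```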
4.2.8 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs)

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µd = D0 , Ha : µd < D0
One-tailed, upper-tailed
• H0 : µd = D0 , Ha : µd > D0
Two-tailed
• H0 : µd = D0 , Ha : µd ≠ D0
• Usually D0 = 0
4.2.9 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Large Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is known

z = (d̄ − D0) / (σd/√nd)

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is unknown

z = (d̄ − D0) / (sd/√nd)

• Remember that for dependent sampling you have to construct a table where dᵢ = x1i − x2i and nd is the number of pairs,

d̄ = (Σᵢ dᵢ)/nd,  sd = √( Σᵢ (dᵢ − d̄)² / (nd − 1) )
• In R you would need to use R as a calculator
4.2.10 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Small Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Small sample nd < 30
• The population of differences has a distribution that is approximately normal

t = (d̄ − D0) / (sd/√nd)

where t has nd − 1 degrees of freedom
• In R, you can use the t.test function. Given the two samples S1 and S2 we need to set the paired
option to true
t.test(x = S1 , y = S2 ,conf.level = 1 - α, paired = TRUE, alternative = ’type’, mu = D0 )
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis
4.2.11 Hypothesis testing for p1 − p2: Large Sample

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 < 0
One-tailed, upper-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 > 0
Two-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 ≠ 0

• Large sample size (nᵢ p̂ᵢ ≥ 15 and nᵢ q̂ᵢ ≥ 15 for i = 1, 2)
• Independent sampling

z = (p̂1 − p̂2) / √( p̂q̂ (1/n1 + 1/n2) )

where p̂ = (x1 + x2)/(n1 + n2), q̂ = 1 − p̂
• To compute it with R we will use R as a calculator. There are packages that allow you to do this computation, but we will not see them in this course.
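The calculator approach with made-up counts, using the pooled proportion:

```r
# Made-up data; test H0: p1 - p2 = 0 vs Ha: p1 - p2 != 0
x1 <- 64; n1 <- 160
x2 <- 42; n2 <- 140

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled estimate p-hat
q_pool <- 1 - p_pool

z <- (p1_hat - p2_hat) / sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
p_value <- 2 * pnorm(-abs(z))     # two-tailed p-value
```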
5 • Determining Sample Size
There are situations in which we want to compute the necessary sample size to obtain a confidence interval with a certain confidence level and width. Here we describe the corresponding formulas in several cases. We are going to assume equal sample sizes for the two samples. Remember you have to round upwards in all cases.
5.1 Confidence interval for the mean

Sample size for the mean

n = ( zα/2 σ / ME )²,  ME (margin of error) = width / 2
We would have all the data necessary to answer our question except the value of the population variance. Usually this value is unknown, so we have two options:
1. Replace it with an estimate from prior sampling: s²
2. Use a rule of thumb s = R/4, where R (range) is the difference between the highest value of the sample and the lowest value.
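For example, suppose we want a 95% CI for µ of width 4 (so ME = 2) and a prior estimate suggests σ ≈ 10 (all numbers made up):

```r
# Made-up goal: 95% CI for mu with width 4, so ME = 2; sigma approx 10
alpha <- 0.05
sigma <- 10
ME    <- 4 / 2

n <- (qnorm(1 - alpha / 2) * sigma / ME)^2
ceiling(n)   # remember to round upwards
```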
5.2 Confidence Interval for a proportion

Sample size for a proportion

n = (zα/2)² pq / (ME)²,  ME (margin of error) = width / 2
We would have all the data necessary to answer our question except the value of the population proportion. Usually this value is unknown, so we have two options:
• Replace it with an estimate from prior sampling: p̂
• Use a rule of thumb → p = 0.5
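With the rule of thumb p = 0.5 (the most conservative choice, since pq is maximized there), a 95% CI with margin of error 0.03 requires:

```r
# Made-up goal: 95% CI for p with ME = 0.03, no prior estimate of p
alpha <- 0.05
ME    <- 0.03
p     <- 0.5          # rule of thumb: maximizes p * q
q     <- 1 - p

n <- qnorm(1 - alpha / 2)^2 * p * q / ME^2
ceiling(n)   # round upwards
```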
5.3 Confidence Interval for the difference of two means

Sample size diff. two means

n1 = n2 = (zα/2)² (σ1² + σ2²) / (ME)²,  ME = width / 2
We would have all the data necessary to answer our question except the values of the population variances. Usually these values are unknown, so we have two options:
1. Replace them with estimates from prior sampling: s1² and s2²
2. Use a rule of thumb → s = R/4, where R (range) is the difference between the highest value of the sample and the lowest value.
5.4 Confidence Interval for the difference of two proportions

Sample size diff. two proportions

n1 = n2 = (zα/2)² (p1q1 + p2q2) / (ME)²,  ME = width / 2
We would have all the data necessary to answer our question except the values of the population proportions. Usually these values are unknown, so we have two options:
• Replace them with estimates from prior sampling: p̂1 and p̂2
• Use a rule of thumb → p1 = p2 = 0.5
A Appendix: Useful Definitions
Experimental (or observational) unit: is an object (e.g., person, thing, transaction, or event) upon
which we collect data.
Population: is a set of units (usually people, objects, transactions, or events) that we are interested in
studying.
Sample: is a subset of the units of a population.
Statistical inference: is an estimate or prediction or some other generalization about a population
based on information contained in a sample.
Parameter: is a numerical descriptive measure of a population. Because it is based on the observations in the population, its value is almost always unknown.
Statistic: any function of the sample (e.g. sample mean, sample variance, sample proportion).
Sampling distribution of a statistic: is the probability distribution of the statistic.
Point Estimator: of a population parameter is a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the population parameter. It is unbiased if the sampling distribution of the sample statistic has a mean equal to the population parameter the statistic is intended to estimate, and biased if the mean of the sampling distribution is not equal to the parameter.
Target Parameter: is the unknown population parameter that we are interested in estimating.
Central Limit Theorem (CLT): Consider a random sample of n observations selected from a population (any probability distribution) with mean µ and standard deviation σ. Then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution with mean µx̄ = µ and standard deviation σx̄ = σ/√n.
Interval estimator (or confidence interval): is a formula that tells us how to use the sample data to
calculate an interval that estimates the target parameter.
P-value (observed significance level): for a specific statistical test is the probability (assuming H0
is true) of observing a value of the test statistic that is at least as contradictory to the null hypothesis,
and supportive of the alternative hypothesis, as the actual one computed from the sample data.
Critical Value: the value that separates the rejection region and the acceptance region.
Confidence coefficient: is the probability that a randomly selected confidence interval encloses
the population parameter. That is, the relative frequency with which similarly constructed intervals
enclose the population parameter when the estimator is used repeatedly a very large number of times.
The confidence level is the confidence coefficient expressed as a percentage.
Statistical hypothesis: is a statement about the numerical value of a population parameter.
Null hypothesis: denoted H0 , represents the hypothesis that is assumed to be true unless the data provide convincing evidence that it is false. This usually represents the “status quo” or some claim about the population parameter that the researcher wants to test.
The alternative (research) hypothesis: denoted Ha , represents the hypothesis that will be accepted
only if the data provide convincing evidence of its truth. This usually represents the values of a
population parameter for which the researcher wants to gather evidence to support.
Test statistic: is a sample statistic, computed from information provided in the sample, that the
researcher uses to decide between the null and alternative hypotheses.
Type I error: occurs if the researcher rejects the null hypothesis in favor of the alternative hypothesis
when, in fact, H0 is true. The probability of committing a Type I error is denoted by α .
Type II error: occurs if the researcher accepts the null hypothesis when, in fact, H0 is false. The
probability of committing a Type II error is denoted by β.
Rejection region of a statistical test: is the set of possible values of the test statistic for which the
researcher will reject H0 in favor of Ha .
B Appendix: zα and tα (also zα/2 and tα/2)

B.1 • zα and zα/2
The value zα is defined as the value of the standard normal random variable z such that an area
(probability) α will lie to its right. In other words, P (z > zα ) = α. Equivalently, P (z ≤ zα ) = 1 − α.
The value zα/2 is defined as the value of the standard normal random variable z such that the area α/2
will lie to its right. In other words, P (z > zα/2 ) = α/2. Equivalently, P (z ≤ zα/2 ) = 1 − α/2.
There are two main ways to compute zα (analogously for zα/2).
1. Using a table. The tables usually give you the value of P (z ≤ z0 ) where z0 is a number. You
have to look for the number z0 such that P (z ≤ z0 ) = 1 − α. That z0 is zα .
2. Using the R programming language. You can use the qnorm function: qnorm(1 - α) will give
you zα .
Analogously for zα/2 .
B.2 • tα and tα/2
The value tα is defined as the value of Student’s t random variable t such that an area α will lie to its right. Remember that you need to fix the degrees of freedom. In other words, P (t > tα ) = α. Equivalently, P (t ≤ tα ) = 1 − α.
The value tα/2 is defined as the value of Student’s t random variable t such that an area α/2 will lie to its right. In other words, P (t > tα/2 ) = α/2. Equivalently, P (t ≤ tα/2 ) = 1 − α/2.
There are two main ways to compute tα:
1. Using a table. You have to pay attention to which value the table is giving you (right or left area).
2. Using the R programming language. You can use the qt function: qt(1 - α, degrees of freedom) will give you tα.
Analogously for tα/2 .