
Fundamentals of Data Analysis: Inferences Based on One and Two Samples

These are a series of notes to help you grasp the contents of the subject Fundamentals of Data Analysis for BBA + BDA students. Here you will find the formulas and R commands for most of the material related to one and two samples (Chapters 6, 7, and 8 in MBS). In case of doubt, just reach out to me.

Course 2022 - 2023
Fundamentals of Data Analysis - IE University - BBA + BDA
Inferences Based on One and Two Samples
Contents

1 • What are we doing?
2 • Point Estimators
3 • Confidence Interval
  3.1 Formulas for Confidence Intervals
    3.1.1 Confidence interval for µ: Large Sample
    3.1.2 Confidence interval for µ: Small Sample
    3.1.3 Confidence Interval for Population Proportion p: Large Sample
    3.1.4 Confidence interval for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling
    3.1.5 Confidence interval for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling
    3.1.6 Conf. Interval for Diff. of means µd (Matched Pairs): Large Sample + Dep. Sampling
    3.1.7 Conf. Interval for Diff. of means µd (Matched Pairs): Small Sample + Dep. Sampling
    3.1.8 Conf. interval for difference of population proportions p1 − p2: Large Sample
4 • Hypothesis testing
  4.1 The two routes
    4.1.1 Way 1: Compare test statistic and rejection region
    4.1.2 Way 2: Compare p-value and α
  4.2 Corresponding test statistics
    4.2.1 Hypothesis Testing for Population mean µ
    4.2.2 Hypothesis Testing for Population mean µ: Large Sample
    4.2.3 Hypothesis Testing for Population mean µ: Small Sample
    4.2.4 Hypothesis Testing for Population Proportion p: Large Sample
    4.2.5 Hypothesis Testing for Difference of means µ1 − µ2
    4.2.6 Hypothesis Testing for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling
    4.2.7 Hypothesis Testing for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling
    4.2.8 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs)
    4.2.9 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Large Sample + Dep. Sampling
    4.2.10 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Small Sample + Dep. Sampling
    4.2.11 Hypothesis testing for p1 − p2: Large Sample
5 • Determining Sample Size
  5.1 Confidence interval for the mean
  5.2 Confidence Interval for a proportion
  5.3 Confidence Interval for the difference of two means
  5.4 Confidence Interval for the difference of two proportions
A Appendix: Useful Definitions
B Appendix: zα and tα (also zα/2 and tα/2)
  B.1 • zα and zα/2
  B.2 • tα and tα/2
Inf. Based on One and Two Samples
FDA
1 • What are we doing?
MAIN GOAL: Obtain information about some parameters of the population (target parameters) through a sample¹. Examples:
• Population mean ⇒ µ
• Population proportion ⇒ p
• Difference of two means ⇒ µ1 − µ2
• Difference of two means, matched pairs ⇒ µd
• Difference of two proportions ⇒ p1 − p2
We can pose many questions that lead to different inferences. Nonetheless, here we focus on:
• Point estimator: a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the population parameter.
• Confidence interval: (or interval estimator) a formula that tells us how to use the sample data to calculate an interval that estimates the target parameter with a certain level of confidence.
• Hypothesis testing: where we make a statistical hypothesis (the null hypothesis). An alternative hypothesis is proposed for the probability distribution of the data, and the comparison of the two models is deemed statistically significant if, according to a threshold probability (the significance level), the data are very unlikely to have occurred under the null hypothesis.
2 • Point Estimators
You should understand point estimators as values that approximate parameters of the population that are unknown (target parameters). Below we list the main point estimators seen so far.

¹ Definitions in the Appendix.
Inf. Based on One and Two Samples
FDA
Mean of a population (µ)
Sample mean, x̄ = (Σᵢ xᵢ)/n. Where the sample = x1, x2, . . . , xn

Variance of a population (σ²)
Sample variance, s² = Σᵢ (xᵢ − x̄)² / (n − 1). Where the sample = x1, x2, . . . , xn. Related to the sample variance we have the sample standard deviation:
Sample standard deviation, s = √( Σᵢ (xᵢ − x̄)² / (n − 1) ). Where the sample = x1, x2, . . . , xn.

Sample proportion (p)
Sample proportion, p̂ = x/n. Where x = successes, n = sample size

Difference of means (µ1 − µ2)
Diff. of sample means, x̄1 − x̄2. Where x̄1 is the sample mean of pop. 1 and x̄2 the sample mean of pop. 2

Difference of means, Matched Pairs (µd)
Diff. of sample means, Matched Pairs, d̄ = (Σ dᵢ)/nd. Where the dᵢ are computed as the differences of the matched pairs and nd is the number of pairs

Difference of proportions (p1 − p2)
Difference of sample proportions, p̂1 − p̂2. Where p̂1 is the sample prop. of pop. 1 and p̂2 the sample prop. of pop. 2
• Computing them in R is easy, using R as a calculator.
• Each of these estimators has its own properties. We saw them in class and you should be able to locate them easily using the slides. For instance, each of these estimators has its own distribution, which we call the sampling distribution, and they are unbiased estimators, that is, their expected value is exactly the target parameter.
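As a quick illustration, all of the point estimators above can be computed with one-liners in R. The sample values below are made up for illustration:

```r
# Hypothetical sample (made-up values)
x <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7)

mean(x)   # sample mean x-bar, estimates mu
var(x)    # sample variance s^2 (R divides by n - 1)
sd(x)     # sample standard deviation s

# Sample proportion: 18 successes out of n = 40 trials
p_hat <- 18 / 40

# Matched pairs: d_i = x1_i - x2_i, then d-bar estimates mu_d
x1 <- c(5.1, 4.8, 5.6, 5.0)
x2 <- c(4.9, 4.7, 5.2, 5.1)
d  <- x1 - x2
mean(d)   # point estimator of mu_d = mu_1 - mu_2
```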
3 • Confidence Interval
Interpretation: A 100(1 − α)% confidence interval is an interval that captures the target parameter with confidence 100(1 − α)%, or probability 1 − α.
• Remember that we tend to write confidence intervals as

Conf. Interval Structure
point estimator ± Margin of Error

• Usually, Margin of Error = Critical Value · Standard Error, and Width = 2 · Margin of Error
3.1 Formulas for Confidence Intervals

3.1.1 Confidence interval for µ: Large Sample

• Large sample n ≥ 30
• σ is known

x̄ ± zα/2 (σ/√n)

• Large sample n ≥ 30
• σ is unknown

x̄ ± zα/2 (s/√n)
• To compute them with R you need to use R as a calculator.
The commands below can come in handy
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα/2 can be computed with the command

zα/2 = qnorm(p = 1 − α/2)
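As a sketch, here is a 95% large-sample confidence interval for µ computed "by calculator" in R, with σ unknown (the data are generated only for illustration):

```r
# Made-up large sample (n >= 30); sigma unknown, so we use s
set.seed(1)
x <- rnorm(40, mean = 100, sd = 15)

alpha <- 0.05
z     <- qnorm(1 - alpha / 2)            # z_{alpha/2}, about 1.96
me    <- z * sd(x) / sqrt(length(x))     # Margin of Error
ci    <- c(mean(x) - me, mean(x) + me)   # x-bar +/- z_{alpha/2} * s / sqrt(n)
ci
```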
3.1.2 Confidence interval for µ: Small Sample

• Small sample n < 30
• The population is approx. normal.

x̄ ± tα/2 (s/√n)

tα/2 computed with n − 1 degrees of freedom
• With R, use the t.test function. For a sample S
t.test(x = S,conf.level = 1 - α)
Also, tα/2 can be computed using:
tα/2 = qt(p = 1 − α/2,degrees of freedom)
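For example, with a made-up small sample S, the t.test output and the by-hand formula give the same interval:

```r
# Made-up small sample (n < 30) from an approx. normal population
S <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9)

alpha <- 0.05
ci_r <- t.test(x = S, conf.level = 1 - alpha)$conf.int   # 95% CI for mu

# Same interval by hand: x-bar +/- t_{alpha/2} * s / sqrt(n)
n       <- length(S)
tc      <- qt(1 - alpha / 2, df = n - 1)
ci_hand <- mean(S) + c(-1, 1) * tc * sd(S) / sqrt(n)
```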
3.1.3 Confidence Interval for Population Proportion p: Large Sample

• Large sample np̂ ≥ 15 and nq̂ ≥ 15

p̂ ± zα/2 √( p̂q̂/n )

where q̂ = 1 − p̂
• In R, you can use R as a calculator. There are packages that allow you to do the computation, but
we will not see them in this course.
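Using R as a calculator, the interval for p looks like this (counts are made up):

```r
# Made-up data: 52 successes in n = 120 trials
x <- 52; n <- 120
p_hat <- x / n
q_hat <- 1 - p_hat
stopifnot(n * p_hat >= 15, n * q_hat >= 15)   # large-sample conditions

alpha <- 0.05
me <- qnorm(1 - alpha / 2) * sqrt(p_hat * q_hat / n)
ci <- c(p_hat - me, p_hat + me)               # 95% CI for p
```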
3.1.4 Confidence interval for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are known

(x̄1 − x̄2) ± zα/2 √( σ1²/n1 + σ2²/n2 )

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are unknown

(x̄1 − x̄2) ± zα/2 √( s1²/n1 + s2²/n2 )
• To compute them with R you need to use R as a calculator.
Remember the commands
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα/2 can be computed with the command
zα/2 = qnorm(p = 1 − α/2)
• We can use confidence intervals to make inferences, as summarized below.
Inferences Using Conf. Intervals
If we have a conf. interval for µ1 − µ2 with a confidence level (1 − α)% then
1. If the confidence interval lies entirely above 0, then with confidence (1 − α) or level of significance α we can say that µ1 > µ2
2. If the confidence interval lies entirely below 0, then with confidence (1 − α) or level of significance α we can say that µ2 > µ1
3. If the interval contains 0, we cannot conclude anything.
• This also applies to the other confidence intervals that we will see along the course.
3.1.5 Confidence interval for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling

• Small sample n1 < 30 or n2 < 30
• Independent Sampling
• The populations are approx. normal.
• Equal variances σ1 = σ2

(x̄1 − x̄2) ± tα/2 √( sp² (1/n1 + 1/n2) )

tα/2 computed with n1 + n2 − 2 degrees of freedom,

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
• In R, use the t.test function. For samples S1 and S2
t.test(x = S1 , y = S2 ,conf.level = 1 - α,var.equal = TRUE)
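A minimal sketch with two made-up small samples, assuming equal variances:

```r
# Made-up small independent samples, assuming sigma_1 = sigma_2
S1 <- c(5.2, 4.9, 5.6, 5.1, 5.4, 5.0)
S2 <- c(4.6, 4.8, 4.5, 5.0, 4.7, 4.4)

alpha <- 0.05
tt <- t.test(x = S1, y = S2, conf.level = 1 - alpha, var.equal = TRUE)
tt$conf.int   # CI for mu_1 - mu_2 based on the pooled variance sp^2
```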
3.1.6 Conf. Interval for Diff. of means µd (Matched Pairs): Large Sample + Dep. Sampling

Remember that for dependent sampling you have to construct a table where dᵢ = x1i − x2i and nd is the number of pairs,

d̄ = (Σᵢ dᵢ)/nd,  sd = √( Σᵢ (dᵢ − d̄)² / (nd − 1) )

and that the mean of the d's satisfies

µd = µ1 − µ2

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is known

d̄ ± zα/2 (σd/√nd)
• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is unknown

d̄ ± zα/2 (sd/√nd)
• Use R as a calculator
3.1.7 Conf. Interval for Diff. of means µd (Matched Pairs): Small Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Small sample nd < 30
• Pop. approx. normal

d̄ ± tα/2 (sd/√nd)
• In R, you can use the t.test function. Given the two samples S1 and S2 we need to set the paired
option to true
t.test(x = S1 , y = S2 ,conf.level = 1 - α, paired = TRUE)
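A sketch with made-up paired data. Note that the paired interval is exactly the one-sample interval computed on the differences:

```r
# Made-up matched pairs: before/after measurements on the same units
S1 <- c(80, 75, 90, 68, 72, 84)   # before
S2 <- c(78, 74, 85, 69, 70, 80)   # after

alpha <- 0.05
ci_paired <- t.test(x = S1, y = S2, conf.level = 1 - alpha, paired = TRUE)$conf.int

# Equivalent: a one-sample interval on the differences d_i = S1_i - S2_i
ci_diff <- t.test(x = S1 - S2, conf.level = 1 - alpha)$conf.int
```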
3.1.8 Conf. interval for difference of population proportions p1 − p2: Large Sample

• Large sample size (nᵢ p̂ᵢ ≥ 15 and nᵢ q̂ᵢ ≥ 15 for i = 1, 2)
• Independent sampling

(p̂1 − p̂2) ± zα/2 √( p̂1q̂1/n1 + p̂2q̂2/n2 )
• To compute it with R use R as a calculator. There are packages that allow you to do this computation,
but we will not see them in this course.
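Using R as a calculator with made-up counts:

```r
# Made-up data: x1 = 60 successes of n1 = 150, x2 = 45 of n2 = 140
x1 <- 60; n1 <- 150
x2 <- 45; n2 <- 140
p1 <- x1 / n1; q1 <- 1 - p1
p2 <- x2 / n2; q2 <- 1 - p2

alpha <- 0.05
me <- qnorm(1 - alpha / 2) * sqrt(p1 * q1 / n1 + p2 * q2 / n2)
ci <- c((p1 - p2) - me, (p1 - p2) + me)   # 95% CI for p1 - p2
```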
4 • Hypothesis testing

There are two hypotheses: the null and the alternative.
• The null hypothesis, denoted H0 , represents the hypothesis that will be accepted unless the
data provide convincing evidence that it is false. This usually represents the “status quo” or
some claim about the population parameter that the researcher wants to test.
• The alternative (research) hypothesis, denoted Ha or H1, represents the hypothesis that will be accepted only if the data provide convincing evidence of its truth. This usually represents the values of a population parameter for which the researcher wants to gather evidence to support. It is stated in one of the following forms:
Types of hypotheses. The types of hypotheses are:
• H0: target parameter = certain value
  Ha: target parameter ≠ certain value → Two-tailed
• H0: target parameter = certain value
  Ha: target parameter < certain value → One-tailed, lower-tailed
• H0: target parameter = certain value
  Ha: target parameter > certain value → One-tailed, upper-tailed
4.1 The two routes

To perform a hypothesis test we can check whether:
1. The test statistic falls in the rejection region
2. The p-value is smaller than α
4.1.1 Way 1: Compare test statistic and rejection region
• Define the hypotheses
• Get the test statistic
• Define the rejection region
• Draw a conclusion (does the test statistic fall into the rejection region?)
How to compute rejection regions for z-statistic. Depending on the type of hypothesis we have:
• H0 : target parameter = certain value,
Ha : target parameter < certain value
then reject if z < −zα
• H0 : target parameter = certain value,
Ha : target parameter > certain value
then reject if z > zα
• H0 : target parameter = certain value,
Ha : target parameter ≠ certain value
then reject if |z| > zα/2 . Equivalently z < −zα/2 or z > zα/2 .
How to compute rejection regions for t-statistic. Depending on the type of hypothesis we have:
• H0 : target parameter = certain value,
Ha : target parameter < certain value
then reject if t < −tα
• H0 : target parameter = certain value,
Ha : target parameter > certain value
then reject if t > tα
• H0 : target parameter = certain value,
Ha : target parameter ≠ certain value
then reject if |t| > tα/2 . Equivalently t < −tα/2 or t > tα/2 .
• Remember to check the appropriate degrees of freedom
4.1.2 Way 2: Compare p-value and α
• Define the hypotheses
• Get the test statistic
• Calculate the p-value (does not depend on α)
• Draw a conclusion (is the p-value < α?)
How to compute the p-values: z-statistic. Depending on the type of hypothesis we have:
Lower-tailed test p-value
• p-value = P (z < test statistic)
Upper-tailed test p-value:
• p-value = P (z > test statistic)
Two-tailed test p-value
If test statistic is positive
• p-value = 2P (z > test statistic)
If test statistic is negative
• p-value = 2P (z < test statistic)
Remember that to reject or not reject we have:
• If p-value < α ⇒ Reject
• If p-value ≥ α ⇒ Do not reject
• You can use the table to compute them. If you want to use R, the probability P (z < certain value)
is given by pnorm(certain value). Using the symmetry of the normal distribution you can compute
P (z > certain value) = 1 − P (z < certain value)
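For instance, the three kinds of p-values for a made-up z test statistic:

```r
z_stat <- -1.8   # a made-up z test statistic

# Lower-tailed: P(z < test statistic)
p_lower <- pnorm(z_stat)
# Upper-tailed: P(z > test statistic) = 1 - P(z < test statistic)
p_upper <- 1 - pnorm(z_stat)
# Two-tailed: twice the tail area beyond |test statistic|
p_two <- 2 * pnorm(-abs(z_stat))
```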
How to compute the p-values: t-statistic. Depending on the type of hypothesis we have:
Lower-tailed test p-value
• p-value = P (t < test statistic)
Upper-tailed test p-value:
• p-value = P (t > test statistic)
Two-tailed test p-value
If test statistic is positive
• p-value = 2P (t > test statistic)
If test statistic is negative
• p-value = 2P (t < test statistic)
Remember that to make the conclusion:
• If p-value < α ⇒ Reject
• If p-value ≥ α ⇒ Do not reject
• t should be of the appropriate degrees of freedom
• In R we will use the t.test function to compute it. Alternatively, the probability P(t< certain
value), where t is the t distribution with certain degrees of freedom, can be computed with pt(certain
value, degrees of freedom). Using the symmetry of the t distribution you can compute P (t >
certain value) = 1 − P (t < certain value).
4.2 Corresponding test statistics
• A test statistic is a statistic (a quantity derived from the sample) used in statistical hypothesis testing.
A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary
of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In
general, a test statistic is selected or defined in such a way as to quantify, within observed data,
behaviours that would distinguish the null from the alternative hypothesis, where such an alternative
is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative
hypothesis.
• An important property of a test statistic is that its sampling distribution under the null hypothesis
must be calculable, either exactly or approximately, which allows p-values to be calculated.
4.2.1 Hypothesis Testing for Population mean µ

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µ = µ0 , Ha : µ < µ0
One-tailed, upper-tailed
• H0 : µ = µ0 , Ha : µ > µ0
Two-tailed
• H0 : µ = µ0 , Ha : µ ≠ µ0
Now we show the corresponding test statistics
4.2.2 Hypothesis Testing for Population mean µ: Large Sample

• Large sample n ≥ 30
• σ is known

z = (x̄ − µ0) / (σ/√n)

• Large sample n ≥ 30
• σ is unknown

z = (x̄ − µ0) / (s/√n)
• To compute them with R you need to use R as a calculator.
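A sketch of the calculator approach with made-up data, testing H0: µ = 100 against a two-tailed alternative:

```r
# Made-up large sample; test H0: mu = 100 vs Ha: mu != 100
set.seed(7)
x   <- rnorm(50, mean = 103, sd = 10)
mu0 <- 100

z <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # sigma unknown -> use s
p_value <- 2 * pnorm(-abs(z))                      # two-tailed p-value

alpha <- 0.05
p_value < alpha   # TRUE means reject H0
```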
4.2.3 Hypothesis Testing for Population mean µ: Small Sample

• Small sample n < 30
• The population is approx. normal.

t = (x̄ − µ0) / (s/√n)

t has n − 1 degrees of freedom
• With R, use the t.test function. For a sample S
t.test(x = S,conf.level = 1 - α, alternative = ’type’, mu = µ0 )
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis (lower-tailed, upper-tailed or two-tailed respectively)
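For example, a lower-tailed test with a made-up small sample:

```r
# Made-up small sample; test H0: mu = 12 vs Ha: mu < 12 (lower-tailed)
S   <- c(11.4, 11.9, 11.2, 12.1, 11.6, 11.3, 11.8)
mu0 <- 12

tt <- t.test(x = S, conf.level = 0.95, alternative = "less", mu = mu0)
tt$statistic   # t statistic with n - 1 degrees of freedom
tt$p.value     # reject H0 if this is below alpha
```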
4.2.4 Hypothesis Testing for Population Proportion p: Large Sample

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : p = p0 , Ha : p < p0
One-tailed, upper-tailed
• H0 : p = p0 , Ha : p > p0
Two-tailed
• H0 : p = p0 , Ha : p ≠ p0

• Large sample np0 ≥ 15 and nq0 ≥ 15

z = (p̂ − p0) / √( p0q0/n )
where q0 = 1 − p0
• In R you can use R as a calculator. There are some packages that allow you to do this computation,
but we will not see them in this course.
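Using R as a calculator with made-up counts, for an upper-tailed test:

```r
# Made-up data: test H0: p = 0.5 vs Ha: p > 0.5 with 120 successes in n = 200
x <- 120; n <- 200
p0 <- 0.5; q0 <- 1 - p0
stopifnot(n * p0 >= 15, n * q0 >= 15)   # large-sample conditions

p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * q0 / n)
p_value <- 1 - pnorm(z)                 # upper-tailed p-value
```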
4.2.5 Hypothesis Testing for Difference of means µ1 − µ2

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 < D0
One-tailed, upper-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 > D0
Two-tailed
• H0 : µ1 − µ2 = D0 , Ha : µ1 − µ2 ≠ D0
• Usually D0 = 0
4.2.6 Hypothesis Testing for Diff. of means µ1 − µ2: Large Sample + Indep. Sampling

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are known

z = [ (x̄1 − x̄2) − D0 ] / √( σ1²/n1 + σ2²/n2 )

• Large sample n1 ≥ 30 and n2 ≥ 30
• Independent Sampling
• σ1 and σ2 are unknown

z = [ (x̄1 − x̄2) − D0 ] / √( s1²/n1 + s2²/n2 )
• To compute them with R you need to use R as a calculator.
Remember the commands
• mean(): gives the mean of a vector
• sd(): gives the sample standard deviation of a vector
Also, remember that zα and zα/2 can be computed with the commands

zα = qnorm(p = 1 − α)
zα/2 = qnorm(p = 1 − α/2)
4.2.7 Hypothesis Testing for Diff. of means µ1 − µ2: Small Sample + Indep. Sampling

• Small sample n1 < 30 or n2 < 30
• Independent Sampling
• The populations are approx. normal.
• Equal var. σ1 = σ2

t = [ (x̄1 − x̄2) − D0 ] / √( sp² (1/n1 + 1/n2) )

t has n1 + n2 − 2 degrees of freedom,

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
• With R you can use the t.test function. For samples S1 and S2
t.test(x = S1 , y = S2 , conf.level = 1 - α, alternative = ’type’, mu = D0 , var.equal = TRUE)
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis
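A sketch with made-up small samples, testing H0: µ1 − µ2 = 0 two-tailed:

```r
# Made-up small independent samples; test H0: mu1 - mu2 = 0 (two-tailed)
S1 <- c(21, 19, 23, 20, 22, 18, 24)
S2 <- c(17, 18, 16, 19, 15, 18, 17)

tt <- t.test(x = S1, y = S2, conf.level = 0.95,
             alternative = "two.sided", mu = 0, var.equal = TRUE)
tt$statistic   # pooled t statistic with n1 + n2 - 2 df
tt$p.value     # compare against alpha
```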
4.2.8 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs)

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : µd = D0 , Ha : µd < D0
One-tailed, upper-tailed
• H0 : µd = D0 , Ha : µd > D0
Two-tailed
• H0 : µd = D0 , Ha : µd ≠ D0
• Usually D0 = 0
4.2.9 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Large Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is known

z = (d̄ − D0) / (σd/√nd)

• Dependent sampling (matched pairs)
• Large sample nd ≥ 30
• σd is unknown

z = (d̄ − D0) / (sd/√nd)

• Remember that for dependent sampling you have to construct a table where dᵢ = x1i − x2i and nd is the number of pairs,

d̄ = (Σᵢ dᵢ)/nd,  sd = √( Σᵢ (dᵢ − d̄)² / (nd − 1) )
• In R you would need to use R as a calculator
4.2.10 Hypothesis testing for µd = µ1 − µ2 (Matched Pairs): Small Sample + Dep. Sampling

• Dependent sampling (matched pairs)
• Small sample nd < 30
• The population of differences has a distribution that is approximately normal

t = (d̄ − D0) / (sd/√nd)

where t has nd − 1 degrees of freedom
• In R, you can use the t.test function. Given the two samples S1 and S2 we need to set the paired
option to true
t.test(x = S1 , y = S2 ,conf.level = 1 - α, paired = TRUE, alternative = ’type’, mu = D0 )
where ’type’ is equal to one of the following values, ’less’, ’greater’ or ’two.sided’ depending on the
type of hypothesis
4.2.11 Hypothesis testing for p1 − p2: Large Sample

Remember that the types of hypotheses are:

One-tailed, lower-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 < 0
One-tailed, upper-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 > 0
Two-tailed
• H0 : p1 − p2 = 0, Ha : p1 − p2 ≠ 0

• Large sample size (nᵢ p̂ᵢ ≥ 15 and nᵢ q̂ᵢ ≥ 15 for i = 1, 2)
• Independent sampling

z = (p̂1 − p̂2) / √( p̂q̂ (1/n1 + 1/n2) )

where p̂ = (x1 + x2)/(n1 + n2), q̂ = 1 − p̂
• To compute it with R we will use R as a calculator. There are packages that allow you to do this computation, but we will not see them in this course.
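The calculator approach with made-up counts, using the pooled proportion:

```r
# Made-up data; test H0: p1 - p2 = 0 vs Ha: p1 - p2 != 0
x1 <- 64; n1 <- 160
x2 <- 42; n2 <- 140

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled estimate p-hat
q_pool <- 1 - p_pool

z <- (p1_hat - p2_hat) / sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
p_value <- 2 * pnorm(-abs(z))     # two-tailed p-value
```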
5 • Determining Sample Size
There are situations in which we want to compute the necessary sample size to obtain a confidence interval with a certain confidence level and width. Here we describe the corresponding formulas in several cases. We are going to assume equal sample sizes for the two samples. Remember you have to round upwards in all cases.
5.1 Confidence interval for the mean

Sample size for the mean

n = ( zα/2 σ / ME )²,  ME (margin of error) = width / 2
We would have all the data necessary to answer our question except the value of the population variance. Usually this value is unknown, so we have two options:
1. Replace it with an estimate from prior sampling: s²
2. Use a rule of thumb s = R/4, where R (range) is the difference between the highest value of the sample and the lowest value.
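For example, suppose we want a 95% CI for µ of width 4 (so ME = 2) and a prior estimate suggests σ ≈ 10 (all numbers made up):

```r
# Made-up goal: 95% CI for mu with width 4, so ME = 2; sigma approx 10
alpha <- 0.05
sigma <- 10
ME    <- 4 / 2

n <- (qnorm(1 - alpha / 2) * sigma / ME)^2
ceiling(n)   # remember to round upwards
```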
5.2 Confidence Interval for a proportion

Sample size for a proportion

n = (zα/2)² pq / (ME)²,  ME (margin of error) = width / 2
We would have all the data necessary to answer our question except the value of the population proportion. Usually this value is unknown, so we have two options:
• Replace it with an estimate from prior sampling: p̂
• Use a rule of thumb → p = 0.5
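With the rule of thumb p = 0.5 (the most conservative choice, since pq is maximized there), a 95% CI with margin of error 0.03 requires:

```r
# Made-up goal: 95% CI for p with ME = 0.03, no prior estimate of p
alpha <- 0.05
ME    <- 0.03
p     <- 0.5          # rule of thumb: maximizes p * q
q     <- 1 - p

n <- qnorm(1 - alpha / 2)^2 * p * q / ME^2
ceiling(n)   # round upwards
```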
5.3 Confidence Interval for the difference of two means

Sample size diff. two means

n1 = n2 = (zα/2)² (σ1² + σ2²) / (ME)²,  ME = width / 2
We would have all the data necessary to answer our question except the values of the population variances. Usually these values are unknown, so we have two options:
1. Replace them with estimates from prior sampling: s1² and s2²
2. Use a rule of thumb → s = R/4, where R (range) is the difference between the highest value of the sample and the lowest value.
5.4 Confidence Interval for the difference of two proportions

Sample size diff. two proportions

n1 = n2 = (zα/2)² (p1q1 + p2q2) / (ME)²,  ME = width / 2
We would have all the data necessary to answer our question except the values of the population proportions. Usually these values are unknown, so we have two options:
• Replace them with estimates from prior sampling: p̂1 and p̂2
• Use a rule of thumb → p1 = p2 = 0.5
A Appendix: Useful Definitions
Experimental (or observational) unit: is an object (e.g., person, thing, transaction, or event) upon
which we collect data.
Population: is a set of units (usually people, objects, transactions, or events) that we are interested in
studying.
Sample: is a subset of the units of a population.
Statistical inference: is an estimate or prediction or some other generalization about a population
based on information contained in a sample.
Parameter: is a numerical descriptive measure of a population. Because it is based on the observations in the population, its value is almost always unknown.
Statistic: any function of the sample (e.g. sample mean, sample variance, sample proportion).
Sampling distribution of a statistic: is the probability distribution of the statistic.
Point Estimator: of a population parameter is a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the population parameter. It is unbiased if the sampling distribution of the sample statistic has a mean equal to the population parameter the statistic is intended to estimate, and biased if the mean of the sampling distribution is not equal to the parameter.
Target Parameter: is the unknown population parameter that we are interested in estimating.
Central Limit Theorem (CLT): Consider a random sample of n observations selected from a population (any probability distribution) with mean µ and standard deviation σ. Then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution with mean µx̄ = µ and standard deviation σx̄ = σ/√n.
Interval estimator (or confidence interval): is a formula that tells us how to use the sample data to
calculate an interval that estimates the target parameter.
P-value (observed significance level): for a specific statistical test is the probability (assuming H0
is true) of observing a value of the test statistic that is at least as contradictory to the null hypothesis,
and supportive of the alternative hypothesis, as the actual one computed from the sample data.
Critical Value: the value that separates the rejection region and the acceptance region.
Confidence coefficient: is the probability that a randomly selected confidence interval encloses
the population parameter. That is, the relative frequency with which similarly constructed intervals
enclose the population parameter when the estimator is used repeatedly a very large number of times.
The confidence level is the confidence coefficient expressed as a percentage.
Statistical hypothesis: is a statement about the numerical value of a population parameter.
Null hypothesis: denoted H0 , represents the hypothesis that is assumed to be true unless the data provide convincing evidence that it is false. This usually represents the “status quo” or some claim about the population parameter that the researcher wants to test.
The alternative (research) hypothesis: denoted Ha , represents the hypothesis that will be accepted
only if the data provide convincing evidence of its truth. This usually represents the values of a
population parameter for which the researcher wants to gather evidence to support.
Test statistic: is a sample statistic, computed from information provided in the sample, that the
researcher uses to decide between the null and alternative hypotheses.
Type I error: occurs if the researcher rejects the null hypothesis in favor of the alternative hypothesis
when, in fact, H0 is true. The probability of committing a Type I error is denoted by α .
Type II error: occurs if the researcher accepts the null hypothesis when, in fact, H0 is false. The
probability of committing a Type II error is denoted by β.
Rejection region of a statistical test: is the set of possible values of the test statistic for which the
researcher will reject H0 in favor of Ha .
B Appendix: zα and tα (also zα/2 and tα/2)

B.1 • zα and zα/2
The value zα is defined as the value of the standard normal random variable z such that an area
(probability) α will lie to its right. In other words, P (z > zα ) = α. Equivalently, P (z ≤ zα ) = 1 − α.
The value zα/2 is defined as the value of the standard normal random variable z such that the area α/2
will lie to its right. In other words, P (z > zα/2 ) = α/2. Equivalently, P (z ≤ zα/2 ) = 1 − α/2.
There are two main ways to compute zα (analogously for zα/2).
1. Using a table. The tables usually give you the value of P (z ≤ z0 ) where z0 is a number. You
have to look for the number z0 such that P (z ≤ z0 ) = 1 − α. That z0 is zα .
2. Using the R programming language. You can use the qnorm function: qnorm(1 - α) will give
you zα .
Analogously for zα/2 .
B.2 • tα and tα/2
The value tα is defined as the value of Student’s t random variable t such that an area α will lie to its right. Remember that you need to fix the degrees of freedom. In other words, P (t > tα ) = α. Equivalently, P (t ≤ tα ) = 1 − α.
The value tα/2 is defined as the value of Student’s t random variable t such that an area α/2 will lie to its right. In other words, P (t > tα/2 ) = α/2. Equivalently, P (t ≤ tα/2 ) = 1 − α/2.
There are two main ways to compute tα:
1. Using a table. You have to pay attention to which value the table is giving you (right or left area).
2. Using the R programming language. You can use the qt function: qt(1 - α, degrees of freedom) will give you tα.
Analogously for tα/2 .