
AP Statistics Cheat Sheet

Data Collection
Methods of data collection
Census
entire population
requires more time and cost

Sample Survey
part of a population
requires less time and cost

Observational Study
part of a population
subjects observed
stratification (strata)
indicates correlation
confounding (lurking variable) may occur

Experiment
part of a population
subjects controlled
blocking (blocks)
indicates causation
control group and treatment group
single-blind and double-blind
Methods of data planning
(for surveys)
1. simple random sampling (equal prob. for all)
2. systematic sampling (every nth person)
3. stratified sampling (homogeneous strata)
4. proportional sampling (proportional to pop.)
5. cluster sampling (heterogeneous clusters)
6. multistage sampling (methods combined)
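As a quick illustration, the first three plans can be sketched in a few lines of Python; the population of 100 IDs, the sample size, and the stratum split below are made-up examples.

```python
# Sketch of three survey-sampling plans using only the standard library.
import random

population = list(range(1, 101))  # hypothetical sampling frame of 100 IDs
n = 10

# simple random sampling: every group of n units has an equal chance
srs = random.sample(population, n)

# systematic sampling: random start, then every kth unit
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# stratified sampling: split into homogeneous strata, sample within each
strata = {"stratum_A": population[:50], "stratum_B": population[50:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, n // 2)]

print(srs, systematic, stratified, sep="\n")
```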
Bias in data planning
(for surveys)
1. household bias (family of two vs. five)
2. nonresponse bias (refuse to respond)
3. response bias (respond untruthfully)
4. voluntary response bias (strong opinions)
5. quota sampling bias (homogeneous group)
6. selection bias (specific subjects are chosen)
7. size bias (big vs. small coins)
8. undercoverage bias (part of pop. ignored)
9. wording bias (poorly worded questions)
Data Collection
More on experiments
x is the explanatory variable (factor)
y is the response variable
experimental units (nonhuman) are sometimes called subjects (when human)
control group does not receive treatment
treatment group receives treatment
placebo effect is a response to "fake"
treatment
single blinding (only subjects are blind)
double blinding (both subjects and
evaluators are blind)
completely randomized design
different samples get different
treatments
randomized paired comparison design
one sample gets different treatments
randomized block design
population undergoes blocking
each block receives randomization
random samples get different
treatments
control, blocking, randomization, replicability
and generalizability
Data Analysis I
Graphs for data analysis
Bar Chart
Note: there are gaps between bars.
Dotplot
Histogram
Note: there may be no gaps between bars.
Sampling error vs. bias
Sampling Error: natural variability when taking samples from a population; cannot be avoided
Bias: tendency to favor the selection of certain members of a population; can be avoided
Data Analysis I
Graphs for data analysis (cont.)

Stemplot
Note: the leaves may not be skipped and the key must be clearly indicated.

Cumulative Relative Frequency (CRF) Plot
Note: the median value can be found by drawing a horizontal line across 0.5 on the y-axis.

Boxplot
Note: min, Q1, median, Q3, and max are indicated. Outliers are indicated by separate dots.

Special features of graphs
Clusters
Gaps
Outliers

Choosing the right graphs
Qualitative (Categorical) Variable: dotplot, bar chart
Quantitative (Numerical) Variable: dotplot, histogram, stemplot, boxplot, CRF plot

Data Analysis I
Describing distributions (SOCS)

Shape
symmetric
skewed to the right (median < mean)
skewed to the left (median > mean)
bell-shaped
uniform

Center
mean
median (divides area under graph into two equal parts)
mode (uni- vs. bimodal)

Outliers
by inspection
by formula: less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR

Spread
range
interquartile range
variance
standard deviation

Center (Measures of Central Tendency)
mean: add all values and divide by n
median: arrange values in ascending order and do one of the following
  if n is odd, select the middle value
  if n is even, take the average of the two middle values
mode: most frequently occurring value

Spread (Measures of Dispersion)
range: max - min
interquartile range: Q3 - Q1
variance: the average squared deviation from the mean
standard deviation: the square root of the variance
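As a reminder, the mean, sample variance, and sample standard deviation are:

```latex
\bar{x} = \frac{\sum x_i}{n}, \qquad
s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, \qquad
s = \sqrt{s^2}
```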
Data Analysis I
Measures of Position
simple ranking: indicates rank from an
ordered list
percentile ranking: indicates the percentage of values below the value under consideration
z-score: indicates specifically by how many
standard deviations the value under
consideration varies from the mean
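As a reminder, the z-score is computed as:

```latex
z = \frac{x - \mu}{\sigma} \quad (\text{population})
\qquad\text{or}\qquad
z = \frac{x - \bar{x}}{s} \quad (\text{sample})
```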
Empirical Rule (68-95-99.7 Rule)
(applies to bell-shaped curves only)
About 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Note: range is approximately equal to 6 standard deviations in a bell-shaped distribution.

Transforming distributions
Addition and subtraction affects
  mean
  median
  mode
Multiplication and division affects
  mean
  median
  mode
  range
  IQR
  variance
  standard deviation

Resistance to outliers
Not resistant
  mean
  range
  variance
  standard deviation
Resistant
  median
  mode
  IQR

Data Analysis I
Comparing distributions
(Use SOCS when making comparisons)
Parallel Boxplots
Double Bar Charts
Back-to-back Stemplots
Overlapping CRF Plots
Data Analysis II
Exploring bivariate data

Scatterplot
x is the explanatory (independent) variable.
y is the response (dependent) variable.
A line of best fit describes the overall pattern.
The correlation coefficient (r) gives the strength of association between the two variables. -1 ≤ r ≤ 1

Data Analysis II
Least squares regression line (LSRL)

There could be many lines of best fit, but the one which minimizes the sum of the squares of residuals is called the least squares regression line.

The LSRL passes through the point (x̄, ȳ) and has a slope b1, which has the same sign as r.

Population Regression Line
Sample Regression Line

The coefficient of determination (r²) gives the percentage of variation in y that is explained by the variation in x.

Data Analysis II
Residual plots

residual = observed (actual) - predicted
ê = y - ŷ
The sum of residuals is always zero.

The residual plot is used as evidence that a linear regression is a good fit when the residual plot shows no overall pattern.

The standard deviation of the residuals can be calculated as shown in the formulas below. It gives a measure of how the data points are spread around the regression line.

If the residual plot shows a pattern, it means that a nonlinear model is more appropriate.
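The population and sample regression lines, the LSRL slope and intercept, and the standard deviation of the residuals referred to above are the standard ones:

```latex
\text{Population: } \mu_y = \beta_0 + \beta_1 x
\qquad
\text{Sample (LSRL): } \hat{y} = b_0 + b_1 x
```

```latex
b_1 = r\,\frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1\bar{x}, \qquad
s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}
```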
Data Analysis II
Transformation to achieve linearity

Instead of using nonlinear regression models, we can transform existing data so that the scatterplot shows a linear pattern (which means that linear regression could be used).

The main transformations are log, square root, and reciprocal transformations.

(scatterplots before and after a log transformation illustrate the effect)

Data Analysis III
Exploring categorical variables

Marginal Frequency and Distribution
Conditional Frequency and Distribution
(example conditional distributions: for people who saw baby animals, for people who saw adult animals, for people who saw tasty foods)

Probability Distribution

Relative frequency tells you the percentage of an event that happened relative to the whole. Relative frequencies vary from experiment to experiment.

Law of large numbers
When an experiment is performed a large number of times, the relative frequency converges to a certain value. We call this value the probability of that event. In other words, probability is long-term relative frequency.

Calculating probabilities
General formulae

Mutually exclusive events
(implies that there is no intersection)

Independent events
(implies that events do not influence each other)

Conditional probability
(probability of B given A)
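The general formulae referred to above are the standard probability rules:

```latex
P(A \cup B) = P(A) + P(B) - P(A \cap B) \\
\text{Mutually exclusive: } P(A \cap B) = 0 \\
\text{Independent: } P(A \cap B) = P(A)\,P(B) \\
\text{Conditional: } P(B \mid A) = \frac{P(A \cap B)}{P(A)}
```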
Probability Distribution
Calculating multistage probabilities

Product Principle
(multiply probabilities that occur together or in series)

Addition Principle
(add probabilities that cannot occur together or are from entirely different scenarios)

Types of probability distributions
Discrete
  binomial
  geometric
Continuous
  normal
Binomial distribution

Binomial distributions are used to model problems that have two possible outcomes (successes or failures). Examples of such scenarios include:
defective vs. not defective
5 on a die vs. not 5 on a die
heads vs. tails
score a goal vs. not score a goal

Remember that although there are only two possible outcomes, you can still have different combinations of these two outcomes. For instance, when you toss a coin, the two possible outcomes are heads (H) and tails (T). But you can have various combinations of them: HHHH, HHTT, TTHT, etc. Each of these combinations has a probability associated with it: a binomial probability.

The binomial probability of exactly k successes in n trials is
P(X = k) = (n choose k) p^k q^(n-k)
where
p is the probability of success
q is the probability of failure (1-p)
n is the number of trials
k is the number of successes
(n-k) is the number of failures

Alternatively, you can use the binompdf(n, p, k) function to calculate specific binomial probabilities. If you have to add up binomial probabilities (starting from 0), use binomcdf(n, p, k), where k is the number of successes you want to add up to.

Binomial distribution keywords
binompdf: exactly, ____ out of ____
binomcdf: at most, at least, more than, less than
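For reference, the calculator functions map onto scipy.stats.binom as sketched below; n = 10 and p = 0.3 are made-up example values.

```python
# Sketch of binompdf / binomcdf using scipy.stats.binom.
from scipy.stats import binom

n, p = 10, 0.3
print(binom.pmf(4, n, p))      # binompdf(n, p, 4): P(X = 4), "exactly 4"
print(binom.cdf(4, n, p))      # binomcdf(n, p, 4): P(X <= 4), "at most 4"
print(1 - binom.cdf(4, n, p))  # P(X > 4), "more than 4"
```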
Geometric distribution

When there are two possible outcomes (binomial) and you want to find the probability that the first success occurs on the nth trial, model the situation using a geometric distribution.

For example, what is the probability that the first honest man Diogenes encounters will be the third man he meets? The trials are binomial (each person met is honest or not honest), and we are interested in meeting an honest man on the third trial. This implies that Diogenes would not meet an honest man in the first and second trials. So, the probability would be (failure) x (failure) x (success). In general,
P(X = k) = q^(k-1) p
where
p is the probability of success
q is the probability of failure (1-p)
k is the trial number when success occurs

You can use the geometpdf(p, k) or the geometcdf(p, k) functions accordingly.

Geometric distribution keywords
(happens first, first success is, first occurrence is)
geometpdf: first, second, third
geometcdf: no later than
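A matching sketch for the geometric functions, using scipy.stats.geom, which (like the definition above) counts the trial on which the first success occurs; p = 0.5 is a made-up example value.

```python
# Sketch of geometpdf / geometcdf using scipy.stats.geom.
from scipy.stats import geom

p = 0.5
print(geom.pmf(3, p))  # geometpdf(p, 3): first success on the 3rd trial
print(geom.cdf(3, p))  # geometcdf(p, 3): first success no later than the 3rd trial
```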
Probability Distribution
Discrete distribution

When there are two or more possible outcomes, you can use a discrete distribution to model the problem.

Formula for discrete random variable

For example, a highway engineer knows that his crew can lay 5 miles of highway on a clear day, 2 miles on a rainy day, and only 1 mile on a snowy day. You can construct a discrete probability distribution as follows.

Note that the binomial distribution is a special case of the discrete distribution when there are only two possible outcomes.

Formula for binomial random variable

For example, in a lottery, 10,000 tickets are sold at $1 each with a prize of $7,500 for one winner. You can construct a discrete/binomial probability distribution as follows.

In both cases, the discrete random variable (usually denoted as X) is associated with a numerical value.
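The formulas the two headings above refer to are the standard mean and standard deviation results:

```latex
\mu_X = \sum x_i\,p_i, \qquad
\sigma_X = \sqrt{\sum (x_i - \mu_X)^2\,p_i}
\qquad\qquad
\text{Binomial: } \mu_X = np, \quad \sigma_X = \sqrt{npq}
```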
Combining random variables
(random variables must be independent)

Means add or subtract directly: the mean of X ± Y is μX ± μY. The variance of a sum and the variance of a difference are both found by adding the individual variances: the variance of X ± Y is σX² + σY². Variances may be combined only when the two random variables are independent.

There is no formula for combining standard deviations of two random variables. So, you must calculate the variances of each random variable, use the variance combination formula above, and then take the square root of the combined variance.

Transforming random variables
Adding a constant a to a random variable shifts the mean by a but leaves the spread unchanged; multiplying by a constant b multiplies the mean by b and the standard deviation by |b|.

Designing simulations

In performing a simulation, you must do the following.
1. Set up a correspondence between outcomes and random numbers (e.g., 0~6 is success and 7~9 is failure).
2. Give a procedure for choosing random numbers.
3. Give a stopping rule.
4. Note what is to be counted.
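A minimal sketch of this recipe in Python, assuming a made-up success probability of 0.7 (digits 0~6) and a fixed stopping rule of 10 digits per trial.

```python
# Simulation sketch: digits 0-6 count as success, 7-9 as failure.
import random

def one_trial(n_digits=10):
    successes = 0
    for _ in range(n_digits):          # procedure for choosing random numbers
        digit = random.randint(0, 9)   # correspondence: 0-6 success, 7-9 failure
        if digit <= 6:
            successes += 1
    return successes                   # what is to be counted

results = [one_trial() for _ in range(1000)]
print(sum(results) / len(results))     # long-run average number of successes
```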
Probability Distribution
Normal distribution

The normal distribution is a type of continuous distribution. It is symmetric, bell-shaped, and unimodal. It has two tails that approach the horizontal axis but never reach it. It is useful in describing various natural phenomena. The normal distribution is the limiting case of the binomial distribution as n → ∞.

Finding area under a normal curve

The area under the normal curve is the probability of whatever you are solving for. You can use a z-table to find the area under a normal curve. For this method, you need to calculate the z-score. Remember that the z-table gives you the area to the right of that z-score.

Alternatively, you can use normalcdf(lower bound, upper bound, mean, standard deviation) to find the area under a normal curve. You can enter raw values into this function. However, if you want to enter z-scores for the lower and upper bound, you must set the mean to 0 and the standard deviation to 1.

To find the z-score given an area to the left of the normal curve, use invNorm(area).
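The same areas can be found with scipy.stats.norm; the mean of 70 and standard deviation of 5 below are made-up example values.

```python
# Sketch of normalcdf / invNorm using scipy.stats.norm.
from scipy.stats import norm

# normalcdf(lower, upper, mean, sd): area between raw scores 60 and 75
print(norm.cdf(75, loc=70, scale=5) - norm.cdf(60, loc=70, scale=5))

# with z-scores, use mean 0 and sd 1 (the defaults)
print(norm.cdf(1.0) - norm.cdf(-2.0))

# invNorm(area): z-score with the given area to its left
print(norm.ppf(0.975))  # about 1.96
```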
Probability Distribution
Common probabilities and z-scores
For statistical inference: z* = 1.645 for 90% confidence, 1.960 for 95%, 2.576 for 99%
For percentile ranking
Normal approximation to binomial
The binomial distribution takes values only at integers, while the normal distribution is continuous with probabilities corresponding to areas over intervals.
For approximation purposes, we think of each
binomial probability corresponding to the
normal probability over a unit interval
centered at the desired value.
For example, to approximate the binomial
probability of five successes we determine
the normal probability of being between 4.5
and 5.5.
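A small sketch of this continuity correction, with made-up values n = 20 and p = 0.25.

```python
# Approximate the binomial P(X = 5) by the normal area from 4.5 to 5.5.
from math import sqrt
from scipy.stats import binom, norm

n, p = 20, 0.25
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = binom.pmf(5, n, p)
approx = norm.cdf(5.5, mu, sigma) - norm.cdf(4.5, mu, sigma)
print(exact, approx)  # the two values should be close
```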
Checking for normality
You must be able to decide whether it is reasonable to assume the data come from a normal population. This skill is especially important when you have to do statistical inference.
To check for normality, you should create a
graph of the data. For example, the ages at
inauguration of U.S. presidents were: {57, 61,
57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
51, 56, 55, 51, 54, 51, 60, 61, 43, 55, 56, 61,
52, 69, 64, 46, 54}. Can we conclude that the
distribution is roughly normal?
Checking for normality (cont.)
Normal probability plot
A diagonal straight line pattern in the normal
probability plot is an indication that the
distribution of data is roughly normal.
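One way to draw such a plot for the inauguration ages listed above is with scipy's probplot (matplotlib is assumed to be available).

```python
# Normal probability (quantile-quantile) plot of the inauguration ages.
import matplotlib.pyplot as plt
from scipy import stats

ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
        65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
        51, 56, 55, 51, 54, 51, 60, 61, 43, 55, 56, 61,
        52, 69, 64, 46, 54]

stats.probplot(ages, dist="norm", plot=plt)  # roughly diagonal => roughly normal
plt.show()
```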
Parameter vs. statistic
A parameter is a number that describes
some characteristic of the population.
A statistic is a number that describes some
characteristic of a sample. It is used to
estimate the parameter of interest.
Population parameter: pop. proportion (p), pop. mean (μ), pop. standard deviation (σ)
Sample statistic: sample proportion (p̂), sample mean (x̄), sample standard deviation (s)
Probability Distribution
Sampling distribution
The AP Statistics exam tests you on five types
of sampling distributions.
sampling distribution of sample
proportions
sampling distribution of sample means
sampling distribution of differences
between sample proportions
sampling distribution of differences
between sample means
sampling distribution of slope of sample
LSRL
When random samples are taken from a
population, the sample statistics vary from
sample to sample. This natural deviation is
called sampling variability. It refers to the
fact that different random samples of the
same size from the same population produce
different values for a statistic.
For example, every sample taken will have a
unique sample proportion p̂. The various p̂
values possible can then be plotted to create
a distribution. This distribution of various
sample proportions is called the sampling
distribution of sample proportions.
A similar case can be made for sampling distributions of other sample statistics.
Note that sampling distributions are
different from sample distributions and
population distributions.
Sampling distribution (cont.)
The population distribution of a variable
describes the values of the variable for all
individuals in a population.
The sample distribution describes the values
of the variable for all individuals in a
particular sample.
Biased and unbiased estimators
A statistic can be an unbiased estimator or a
biased estimator of a parameter.
A statistic is an unbiased estimator if the
center (mean) of its sampling distribution is
equal to the true value of the parameter.
When trying to estimate a parameter, choose
a statistic with low or no bias and minimum
variability.
Sampling distribution of p̂
When we want information about the
population proportion p of successes, we
often take an SRS and use the sample
proportion p̂ to estimate the unknown
parameter p.
The sampling distribution of the sample
proportion p̂ describes how the statistic p̂
varies in all possible samples of the same size
from the population.
The mean of the sampling distribution of p̂ is p.
So, p̂ is an unbiased estimator of p.
The standard deviation of the sampling distribution of p̂ is √(p(1-p)/n).
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
large counts condition (np≥10, n(1-p)≥10)
  state that since np≥10 and n(1-p)≥10, the sampling distribution of p̂ is approximately normal by the large counts condition.
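A small simulation sketch that illustrates these facts, with made-up values p = 0.6 and n = 50.

```python
# Simulate the sampling distribution of p-hat and compare with theory.
from math import sqrt
import random

p, n, reps = 0.6, 50, 10_000
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_phat = sum(phats) / reps
sd_phat = sqrt(sum((x - mean_phat) ** 2 for x in phats) / reps)
print(mean_phat, sd_phat)          # simulated center and spread
print(p, sqrt(p * (1 - p) / n))    # theoretical: p and sqrt(p(1-p)/n)
```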
Probability Distribution
Sampling distribution of x̄
When we want information about the
population mean μ for some quantitative
variable, we often take an SRS and use the
sample mean x̄ to estimate the unknown
parameter μ.
The sampling distribution of the sample mean
x̄ describes how the statistic x̄ varies in all
possible samples of the same size from the
population.
The mean of the sampling distribution of x̄ is μ.
So x̄ is an unbiased estimator of μ.
The standard deviation of the sampling distribution of x̄ is σ/√n.
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
normality
if normal, say that it is
if not normal, use central limit
theorem and state that since n≥30,
the sampling distribution of x̄ is
approximately normal by the
central limit theorem.
Sampling distribution of p̂1-p̂2

The mean of the sampling distribution of p̂1-p̂2 is p1-p2.
So p̂1-p̂2 is an unbiased estimator of p1-p2.
The standard deviation of the sampling distribution of p̂1-p̂2 is √(p1(1-p1)/n1 + p2(1-p2)/n2).

Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that since the large counts condition is met for both samples, the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples

Sampling distribution of x̄1-x̄2

The mean of the sampling distribution of x̄1-x̄2 is μ1-μ2.
So x̄1-x̄2 is an unbiased estimator of μ1-μ2.
The standard deviation of the sampling distribution of x̄1-x̄2 is √(σ1²/n1 + σ2²/n2).

Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if both are normal, say that they are
  if both are not normal, use the central limit theorem on both samples and state that since the central limit theorem is met for both samples (n1≥30 and n2≥30), the sampling distribution of x̄1-x̄2 is approximately normal.
  if one is normal but the other isn't, state that it is normal for the normal data but use the central limit theorem on the other
independence condition
  mention that the two samples are independent random samples
Probability Distribution
Sampling distribution of b1

The mean of the sampling distribution of b1 is β1. So b1 is an unbiased estimator of β1.

The standard deviation of the sampling distribution of b1 is σ/(σx√n), where σx is the standard deviation of the x-values; in practice it is estimated by the standard error of the slope from computer output.

Conditions you need to check are:
SRS
10% condition (n < 0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal

Statistical Inference I
Confidence interval

The AP Statistics exam tests you on five types of confidence intervals (CI).
CI for population proportion
CI for population mean
CI for difference between population proportions
CI for difference between population means
CI for slope of the LSRL

A confidence interval gives an interval of plausible values for a parameter based on sample data.

A point estimator is a statistic that provides an estimate of a population parameter. The value of that statistic from a sample is called a point estimate.

The confidence level c gives the overall success rate of the method used to calculate the confidence interval. To interpret the confidence level: if we were to select many random samples from a population and construct a [c]% confidence interval using each sample, about [c]% of the intervals would capture the true [parameter in context].
Statistical Inference I
Confidence interval (cont.)
To interpret the confidence interval: we are
[c]% confident that the interval from
[lower bound] to [upper bound] captures
the true [parameter in context].
The margin of error of an estimate
describes how far, at most, we expect the
estimate to vary from the true population
value.
Affecting margin of error
In general, we prefer an estimate with a small
margin of error. The margin of error gets
smaller when:
the confidence level decreases. To
obtain a smaller margin of error from the
same data, you must be willing to accept
less confidence.
the sample size n increases. In general,
increasing the sample size n reduces the
margin of error for any fixed confidence
level.
Statistical Inference I
CI for population proportion

p̂ ± z*·√(p̂(1-p̂)/n)

Identify: one sample z interval for p
Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np̂≥10, n(1-p̂)≥10)
  state that the number of successes (np̂) and the number of failures (n(1-p̂)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.
Calculate: if the conditions are met, perform the calculations (1-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

Sample size for a desired margin of error
n ≥ p̂(1-p̂)·(z*/ME)², using p̂ = 0.5 if no estimate is available.

The critical value is a multiplier that makes the interval wide enough to have the stated capture rate. The critical value depends on both the confidence level c and the sampling distribution of the statistic.
Statistical Inference I
CI for population mean

σ is known: x̄ ± z*·(σ/√n)
σ is unknown: x̄ ± t*·(s/√n)

Identify: one sample z interval for μ OR
one sample t interval for μ (df = n-1)
Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use central limit theorem and state that since n≥30, the sampling distribution of x̄ is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
Calculate: if the conditions are met, perform the calculations (ZInterval or TInterval)
Conclude: interpret your confidence interval in the context of the problem.

Sample size for a desired margin of error
n ≥ (z*·σ/ME)²
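A sketch of both interval calculations with scipy; the counts and summary statistics are made-up examples.

```python
# Sketch of 1-PropZInt and TInterval using scipy.stats.
from math import sqrt
from scipy.stats import norm, t

conf = 0.95

# one sample z interval for p: 42 successes out of n = 100
x, n = 42, 100
phat = x / n
zstar = norm.ppf(1 - (1 - conf) / 2)
me = zstar * sqrt(phat * (1 - phat) / n)
print(phat - me, phat + me)

# one sample t interval for mu from summary stats (xbar, s, n)
xbar, s, n2 = 25.3, 4.1, 20
tstar = t.ppf(1 - (1 - conf) / 2, df=n2 - 1)
me2 = tstar * s / sqrt(n2)
print(xbar - me2, xbar + me2)
```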
Statistical Inference I
CI for difference in population proportions

(p̂1-p̂2) ± z*·√(p̂1(1-p̂1)/n1 + p̂2(1-p̂2)/n2)

Identify: two sample z interval for p1-p2
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p̂1≥10, n1(1-p̂1)≥10
  n2p̂2≥10, n2(1-p̂2)≥10
  state that the numbers of successes (n1p̂1, n2p̂2) and the numbers of failures (n1(1-p̂1), n2(1-p̂2)) are both greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

Statistical Inference I
CI for difference in population means

σ is known: (x̄1-x̄2) ± z*·√(σ1²/n1 + σ2²/n2)
σ is unknown: (x̄1-x̄2) ± t*·√(s1²/n1 + s2²/n2)
*df = (n1-1 or n2-1, whichever is smaller) OR use technology for precision

Identify: two sample z interval for μ1-μ2 OR
two sample t interval for μ1-μ2
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-SampZInt or 2-SampTInt)
Conclude: interpret your confidence interval in the context of the problem.

Statistical Inference I
CI for slope of population regression line

b1 ± t*·SE(b1)

Identify: one sample t interval for β (df = n-2)
Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is
approximately linear
no apparent pattern in residuals plot
(=equal SD; residuals have roughly equal
variability at all x-values in sample data)
distribution of residuals is approximately normal
Calculate: if the conditions are met, perform
the calculations (LinRegTInt)
Conclude: interpret your confidence interval
in the context of the problem.
IMPORTANT
For paired data (data that are not independent), create a new variable d, the variable of differences, by taking the difference between each value from sample 1 and the corresponding value from sample 2.
Then, you need to create a one sample t interval using the new variable d by using TInterval.
Statistical Inference II
Test of significance for
quantitative data: hypothesis test
The AP Statistics exam tests you on five types
of significance tests for quantitative data.
hypothesis test for population proportion
hypothesis test for population mean
hypothesis test for difference between
population proportions
hypothesis test for difference between
population means
hypothesis test for slope of population
LSRL
Confidence interval vs.
significance test
Confidence intervals aim to estimate over
which interval the unknown parameter may
lie.
Significance tests aim to investigate whether a claimed value of the parameter is valid or needs to be changed.
Types of hypotheses
The hypothesis test is conducted by setting
up the null hypothesis (Ho) and the
alternative hypothesis (Ha). Ha is also
known as the research hypothesis.
The null hypothesis almost always uses the
equality symbol (=) while the alternative
hypothesis uses inequality symbols (<, >, ≠).
Statistical Inference II

Types of hypothesis tests

The inequality symbol used in the alternative hypothesis determines whether a one-tailed or two-tailed test should be performed.

When the < or > symbol is used, conduct a one-tailed test. One-tailed tests are also known as one-sided tests.

When the ≠ symbol is used, conduct a two-tailed test. Two-tailed tests are also known as two-sided tests. Do not forget to double the p-value at one tail for two-tailed tests.

P-value vs. significance level

The p-value refers to the probability of getting evidence for the alternative hypothesis as strong as or stronger than the observed evidence, assuming the null hypothesis is true.

The significance level (α) is the value that we use as a boundary for deciding whether an observed result is unlikely to happen by chance alone, assuming the null hypothesis is true. α = 1 - c

In a hypothesis test, the p-value is compared with the significance level (α).

Type I and Type II Errors

When we make a conclusion in a significance test, there are two kinds of mistakes we can make. A type I error is rejecting Ho when Ho is actually true. A type II error is failing to reject Ho when Ho is actually false. You need to be able to describe type I and type II errors in context.

The probability of making a type I error is α. The probability of making a type II error is β. α and β are inversely related; they do not necessarily add up to 1.

The probability of avoiding a type II error is called power. Power = 1 - β
Statistical Inference II
HT for population proportion

Identify: one sample z test for p

State the hypotheses: Ho: p = p0 and Ha: p < p0, p > p0, or p ≠ p0, where p is (description in context).

State the significance level. (If not given, use 0.05.)

Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np0≥10, n(1-p0)≥10)
  state that the number of successes (np0) and the number of failures (n(1-p0)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.

Calculate: if the conditions are met, perform the calculations (1-PropZTest)

Alternatively, find the p-value using normalcdf and compare this value with the significance level. When calculating the standard deviation of sample proportions, use √(p0(1-p0)/n), where p0 is the null proportion, not the sample proportion.

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

Statistical Inference II
HT for population mean

Identify: one sample z test for μ OR
one sample t test for μ (df = n-1)

State the hypotheses: Ho: μ = μ0 and Ha: μ < μ0, μ > μ0, or μ ≠ μ0, where μ is (description in context).

State the significance level. (If not given, use 0.05.)

Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use CLT and state that since n≥30, the sampling distribution of x̄ is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
Statistical Inference II
HT for population mean (cont.)

Calculate: if the conditions are met, perform the calculations (Z-Test or T-Test)

Alternatively, you can find the p-value using normalcdf or tcdf and compare this value with the significance level. When using tcdf, remember that df = n - 1.

When calculating the standard deviation of sample means, use the following formula.
σ is known: σ/√n
σ is unknown: s/√n

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

Statistical Inference II
HT for difference in population proportions

Identify: two sample z test for p1-p2

Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that the numbers of successes (n1p1, n2p2) and the numbers of failures (n1(1-p1), n2(1-p2)) are both greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples

Calculate: if the conditions are met, perform the calculations (2-PropZTest)

Alternatively, find the p-value using normalcdf and compare this value with the significance level. When calculating the standard deviation of the difference in sample proportions, use the pooled proportion p̂ = (x1 + x2)/(n1 + n2) in the formula √(p̂(1-p̂)(1/n1 + 1/n2)).

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).
Statistical Inference II
HT for difference in population means

Identify: two sample z test for μ1-μ2 OR
two sample t test for μ1-μ2

Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples

Calculate: if the conditions are met, perform the calculations (2-SampZTest or 2-SampTTest)

Alternatively, you can find the p-value using normalcdf or tcdf and compare this value with the significance level. When using tcdf, use df = n1-1 or df = n2-1, whichever is smaller. Or use technology for a more precise df.

When calculating the standard deviation of the difference in sample means, use the following formula.
σ is known: √(σ1²/n1 + σ2²/n2)
σ is unknown: √(s1²/n1 + s2²/n2)

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).
Statistical Inference II
HT for slope of population regression line

Identify: one sample t test for β (df = n-2)

Remember, β1 = 0 simply states that there is no linear relationship between the two variables.
For Ha, use
β1 > 0 if you need to prove a positive linear relationship between two variables,
β1 < 0 if you need to prove a negative linear relationship between two variables, or
β1 ≠ 0 if you need to show there is "some" linear relationship between two variables.

Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal

Calculate: if the conditions are met, perform the calculations (LinRegTTest)

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

IMPORTANT
For paired data (data that are not independent), create a new variable d, the variable of differences, by taking the difference between each value from sample 1 and the corresponding value from sample 2. Then, perform a one sample t test using the new variable d. Do not perform a two sample test on paired data!

Identify: paired t-test for the set of differences
Then, follow a similar procedure as the HT for population mean.
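A sketch of the paired procedure with scipy; the before/after values are made-up examples, and both calls give the same t statistic and p-value.

```python
# Paired data: run a one sample t test on the differences (ttest_rel is equivalent).
from scipy.stats import ttest_rel, ttest_1samp

before = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9]
after  = [11.5, 11.0, 12.2, 10.9, 12.0, 11.3]

diffs = [b - a for b, a in zip(before, after)]
print(ttest_1samp(diffs, popmean=0))   # one sample t test on the differences d
print(ttest_rel(before, after))        # same result via the paired t-test helper
```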
Statistical Inference III
Test of significance for
qualitative data: chi-square test
The AP Statistics exam tests you on three
types of significance tests for qualitative
data.
chi-square test for goodness-of-fit
chi-square test for independence
chi-square test for homogeneity
Chi-square test statistic
The chi-square test statistic is a measure of
how far the observed counts are from the
expected counts.
You should be able to do a follow up analysis
on which category has the largest
contribution to the chi-square test statistic.
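In symbols, the chi-square test statistic is:

```latex
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
```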
Chi-square distribution
A chi-square distribution is defined by a
density curve that takes only nonnegative
values and is skewed to the right. A particular
chi-square distribution is specified by its
degrees of freedom.
Statistical Inference III
Chi-square test for goodness-of-fit

The χ2 test for goodness-of-fit compares the distribution of observed counts in the sample with the distribution of expected counts if Ho were true.

The expected count for any category is found by multiplying the sample size (n) by the proportion in each category according to the null hypothesis.

Data is usually given in a one-way table.

Identify: chi-square test for goodness-of-fit
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2 GOF-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = (number of categories) - 1.
Conclude: compare p-value and significance level; reject or fail to reject Ho

Statistical Inference III
Chi-square test for homogeneity

The χ2 test for homogeneity compares the distribution of a single categorical variable for each of several populations.

The expected count for any category is found using the formula below.
expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for homogeneity
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].
Conclude: compare p-value and significance level; reject or fail to reject Ho
Statistical Inference III
Chi-square test for independence

The χ2 test for independence is used to test the association/relationship between two categorical variables in a single population.

The null hypothesis is that there is no association between the two categorical variables in the population of interest. Another way to state the null hypothesis is that the two categorical variables are independent in the population of interest.

The expected count for any category is found using the formula below.
expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for independence
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].
Conclude: compare p-value and significance level; reject or fail to reject Ho
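A sketch of both chi-square calculations with scipy; the counts are made-up examples.

```python
# Goodness-of-fit (chisquare) and two-way-table tests (chi2_contingency).
from scipy.stats import chisquare, chi2_contingency

# goodness-of-fit: observed vs. expected counts, df = categories - 1
print(chisquare(f_obs=[18, 22, 30, 30], f_exp=[25, 25, 25, 25]))

# independence / homogeneity: observed counts in a two-way table, df = (r-1)(c-1)
table = [[20, 30, 25],
         [30, 20, 25]]
chi2, p, df, expected = chi2_contingency(table)
print(chi2, p, df)
```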
How to Get a 5
in AP Statistics
Get a perfect score on
the MCQs.
Write something (that is
sensical) on the FRQs.
And do your homework.