AP Statistics Summary Book
亿思科学生之家 (eskedu.com) — a non-profit international education information service platform

THEME 1: Exploratory Analysis

Univariate Data

Median and Mean
Mean:
• Affected by extreme values
• Used for statistical inference
• Can be obtained from both a population and a sample
Median:
• Not affected by extreme values
• Used for descriptive statistics
• Usually can only be obtained from a sample

Range, Interquartile Range (IQR), Variance and Standard Deviation
Range = Maximum − Minimum
IQR = Q3 − Q1
Population Variance: σ² = Σ(x − μ)² / n
Sample Variance: s² = Σ(x − x̄)² / (n − 1)
Population S.D.: σ = √(Σ(x − μ)² / n)
Sample S.D.: s = √(Σ(x − x̄)² / (n − 1))   (formula sheet)

• Range: useful in evaluating samples with very few items.
• IQR: removes the influence of extreme values on the range.
• Variance: dispersion from the mean, in squared units.
• Standard deviation: dispersion from the mean, in the original units.

Percentile Ranking and Z-score
• Percentile ranking indicates what percentage of all values fall below the value under consideration.
• The z-score indicates how many standard deviations above or below the mean the given value lies:
  z = (x − μ) / σ

Histogram and Central Tendency
• Symmetrical: Mean = Median = Mode
• Skewed to the right: Mean > Median > Mode
• Skewed to the left: Mean < Median < Mode

Empirical Rule
• 68% of the values lie within 1 s.d. of the mean
• 95% of the values lie within 2 s.d. of the mean
• 99.7% of the values lie within 3 s.d. of the mean
Note: z is usually between −3 and 3, but not always!

Effect of Changing Units on Summary Measures
                                    Adding a constant     Multiplying by a constant
Central tendency (mean, median):    add the constant      multiply by the constant
Spread (range, IQR, s.d.):          remains the same      multiply by the constant

Box Plots
A box plot shows the minimum, lower quartile (Q1), median, upper quartile (Q3) and maximum.
IQR = Upper Quartile (Q3) − Lower Quartile (Q1)
Outlier: any value > Q3 + 1.5 × IQR OR < Q1 − 1.5 × IQR
(Outliers need to be marked separately on a boxplot.)

Comparing Distributions (back-to-back stem plot, parallel box plot, parallel dot plot)
Compare the following features:
• Shape (symmetric, skewed to the left, skewed to the right)
• Center (mean and median)
• Spread (range and/or IQR)
• Outliers (identify)
• Clusters and gaps (identify)

Bivariate Data

Correlation Coefficient (r)
The correlation coefficient is a mathematical measure of the strength of the linear association between two variables.
r = 1/(n − 1) × Σ[((xᵢ − x̄)/sₓ)((yᵢ − ȳ)/s_y)]   (formula sheet)
• Significant correlation does not necessarily indicate causation.
• A correlation at or near zero means there is no linear relationship, but there may still be a strong nonlinear relationship!
• Changing units does not change the correlation, so the correlation between the standardized z-scores stays the same.
• If x and y are interchanged, the correlation coefficient stays the same.

Coefficient of Determination (r²)
r² indicates the percentage of variation in y (the dependent variable) that can be predicted by the variation in x (the independent variable). The coefficient of determination is usually expressed as a percentage.

Least Squares Regression Line
The equation of the least squares regression line is
ŷ − ȳ = b₁(x − x̄)
where x̄ and ȳ are the mean values of x and y, b₁ is the slope of the least squares regression line, and ŷ is the predicted value of y for a given value of x.
b₁ = r × (s_y / sₓ)   (formula sheet)
where r is the correlation coefficient, sₓ is the standard deviation of x and s_y is the standard deviation of y.
The fitted equation should be written as:
(predicted Dependent Variable) = a + b₁ × (Independent Variable)

Interpreting a and b₁
• b₁: on average, when the independent variable increases by 1 unit, the predicted dependent variable increases/decreases by |b₁| units.
• a: the average (predicted) value of the dependent variable when the independent variable is zero.
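As a quick illustration of the formulas above, here is a minimal Python sketch (the data values are made up purely for illustration) that computes r, the slope b₁ = r·s_y/sₓ, the intercept a = ȳ − b₁x̄, and a prediction:

```python
import statistics as st

# Hypothetical example data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = st.mean(x), st.mean(y)
s_x, s_y = st.stdev(x), st.stdev(y)          # sample standard deviations (n - 1)

# Correlation coefficient: r = 1/(n-1) * sum of z_x * z_y
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

# Least squares slope and intercept: b1 = r * s_y / s_x, a = y_bar - b1 * x_bar
b1 = r * s_y / s_x
a = y_bar - b1 * x_bar

print(f"r = {r:.3f}, r^2 = {r**2:.3f}")
print(f"y-hat = {a:.3f} + {b1:.3f} * x")
print(f"predicted y at x = 3.5: {a + b1 * 3.5:.3f}")
```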
Making Predictions
When we use the regression line to predict a y-value for a given x-value, we are actually predicting the mean y-value for that given x-value.
• Interpolation: predicting a y-value from an x-value within the range of the given data.
• Extrapolation: predicting a y-value from an x-value outside the range of the given data.
Extrapolation is less reliable than interpolation!

Residual Plot
The residual for a given y-value is
êᵢ = yᵢ − ŷᵢ
where êᵢ is the residual, yᵢ is the observed value and ŷᵢ is the predicted value.
The standard deviation of the residuals is
s_e = √(Σêᵢ² / (n − 2)) = √(Σ(yᵢ − ŷᵢ)² / (n − 2))
s_e gives a measure of how the points are spread around the regression line.
To test for linearity:
• When the residuals are randomly scattered, a linear relationship can be assumed between the two variables.
• When the residual plot shows an obvious pattern, a non-linear model will fit the data better than the straight regression line.

Outliers and Influential Points
In a scatterplot, regression outliers are indicated by points falling far away from the overall pattern. In many cases, a point is an outlier if its residual is an outlier in the set of residuals.
In a scatterplot, influential points are those whose removal would sharply change the regression line. Sometimes this description is restricted to points with extreme x-values.

Transformations to Achieve Linearity
When a scatterplot shows a non-linear pattern, it can sometimes be linearized by transforming one or both of the variables and then fitting a linear relationship.
For example, if y is transformed to y² to linearize the model, then the least squares regression line is:
(predicted DV)² = a + b₁ × IV
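A minimal sketch of the residual definitions, reusing the same hypothetical data as the earlier regression sketch (values made up for illustration): it computes each residual yᵢ − ŷᵢ and the standard deviation of the residuals s_e.

```python
import statistics as st

# Hypothetical data (made up for illustration), refit with the formulas from Theme 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)

x_bar, y_bar = st.mean(x), st.mean(y)
s_x, s_y = st.stdev(x), st.stdev(y)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
b1 = r * s_y / s_x
a = y_bar - b1 * x_bar

# Residuals e_i = y_i - y-hat_i and their standard deviation s_e = sqrt(sum(e_i^2) / (n - 2))
residuals = [yi - (a + b1 * xi) for xi, yi in zip(x, y)]
s_e = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5

print("residuals:", [round(e, 3) for e in residuals])
print(f"s_e = {s_e:.3f}")
# A residual plot graphs these residuals against x; a random scatter supports the linear model.
```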
THEME 2: Planning a Study

Data Collection
Data can be collected through an observational study (a census or a sample survey) or through an experiment.

Observational Study
In an observational study we simply observe and measure something that has taken place or is taking place, while trying not to cause any changes by our presence.
The results of an observational study show only the existence of association, NOT a causal relationship (cause and effect).

Conditions for a well-designed and well-conducted survey
• Must incorporate randomness
• Must have consistent responses
• Must ask neutral questions
Such a survey will result in a representative sample.

Sampling Error vs Bias
Sampling error is the difference between a sample statistic and the population parameter. It exists naturally.
Bias occurs when there is a tendency to favour the selection of certain members of a population. It is the consequence of a poorly designed survey.

Sampling Techniques
• Simple Random Sampling: assign a number to everyone in the population and use a random number table (or have a computer generate numbers) to indicate the choices. A Simple Random Sample (SRS) is one in which every possible sample of the desired size has an equal chance of being selected.
• Systematic Sampling: list the population in some order, choose a random point to start, and pick every tenth/hundredth/thousandth/kth person from the list.
• Stratified Sampling: divide the population into homogeneous groups called strata (according to gender, income level, race, etc.), then choose a random sample of persons from each stratum.
• Cluster Sampling: divide the population into heterogeneous groups called clusters, then choose a random sample of clusters from all clusters.

Bias
• Household bias: when a sample includes only one member of any given household, members of large households are underrepresented.
• Nonresponse bias: when people refuse to respond, or are unreachable or too difficult to contact.
• Response bias: people may respond untruthfully when face to face with an interviewer or when filling out a questionnaire that is not anonymous.
• Size bias: some items/people are naturally more likely to be selected because of their size or group size.
• Selection bias: when a particular group of people is selected, which may result in similar responses.
• Undercoverage bias: when a particular method is used to reach people, those who cannot be reached through that method are ignored.
• Voluntary response bias: samples based on individuals who offer to participate typically give too much emphasis to people with strong opinions.
• Unintentional bias: when people are given free choices, they tend to make a particular type of choice.
• Wording bias: nonneutral or poorly worded questions may lead to answers that are very unrepresentative of the population.

Experiments
In an experiment we impose some change or treatment and measure the results or responses. The subjects are usually divided into a treatment group and a control group. The results of an experiment can suggest a causal relationship.

Explanatory Variables vs Response Variables
Explanatory variables, also called factors, are believed to have an effect on the response variable. They correspond to the independent and dependent variables in regression analysis. An explanatory variable can have different levels; each level is called a treatment. An experiment allows multiple explanatory variables.

Confounding Variable vs Lurking Variable
• Confounding variable: when we are uncertain about which variable is causing an effect, we say the variables are confounding variables.
• Lurking variable: a lurking variable is a variable that drives two other variables, creating the mistaken impression that the two other variables are related by cause and effect.

Treatment Group and Control Group
In an experiment, the subjects are randomly assigned to two groups: a treatment group, in which they receive the treatment, and a control group, in which they don't. The subjects in the control group receive a placebo, a "simulated" or medically ineffectual treatment. The placebo effect occurs when people respond to any kind of perceived treatment; the physical response may be caused by the psychological placebo effect instead of the actual treatment.

Randomization
Randomization is the use of chance in deciding which subjects go into which group. It usually refers to how given subjects are assigned to treatments, not to how a group of subjects is chosen from an entire population. Randomization makes sure the subjects in each group have a variety of levels of any potential confounding or lurking variables, and therefore helps minimize the effect of confounding and lurking variables.
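To make random assignment concrete, here is a minimal Python sketch (the subject labels and group sizes are hypothetical) of a completely randomized assignment of subjects to a treatment group and a control group:

```python
import random

# Hypothetical subjects (made up for illustration)
subjects = [f"subject_{i:02d}" for i in range(1, 21)]   # 20 subjects

random.seed(2024)            # fixed seed so the example is reproducible
random.shuffle(subjects)     # chance decides the ordering

# First half -> treatment group, second half -> control group
half = len(subjects) // 2
treatment_group = subjects[:half]
control_group = subjects[half:]

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```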
Blinding
Blinding occurs when the subjects don't know which treatment they are receiving, or when the response evaluators don't know which subjects are receiving which treatment.
• Single blinding: only the subjects do not know which treatment they are receiving.
• Double blinding: neither the subjects nor the response evaluators know who is receiving which treatment.
Blinding helps minimize hidden bias. Many studies suggest that subjects appear to consciously or subconsciously want to help the researcher prove a point, and a doctor's judgment may also be influenced by knowing which subject receives which treatment.

Randomized Designs
• Completely randomized design: every subject has an equal chance of receiving any treatment.
• Randomized block design: first divide the subjects into representative groups called blocks, then randomly assign the subjects in each block to the different treatment groups. Blocking helps control certain lurking variables by bringing them directly into the picture, and helps make conclusions more specific.
• Randomized paired comparison design: subjects are paired first, and then the subjects in every pair are randomly assigned to different treatment groups. Often the paired subjects are a single subject who is given both treatments, one at a time. This is a special case of a block design with "very small blocks".

Replication
The treatment should be repeated on a sufficient number of subjects so that the obtained response differences are statistically significant. To achieve this:
• For a paired comparison design: increase the number of pairs of subjects.
• For a completely randomized design or block design: increase the group sizes.

Generalizability of Results
A major goal of experiments is to be able to generalize the results to broader populations. To achieve this:
• Often an experiment must be repeated in a variety of settings.
• Realistic situations should be created in testing. Testing and experimenting on people does not put them in natural states, and this can lead to artificial responses.

Three primary principles for a well-planned and well-conducted experiment
➢ Possible confounding variables must be controlled.
➢ Chance should be used in assigning which subjects are placed in which groups for which treatment.
➢ Natural variation in outcomes can be lessened by using more subjects.

THEME 3: Probability

Probability Basics
Law of Large Numbers: the relative frequency tends to get closer and closer to a certain number (the probability) as an experiment is repeated more and more times.
As the number of experiments → ∞, relative frequency → probability.

Complementary Events
P(Aᶜ) = 1 − P(A), where Aᶜ means that A does not occur. Aᶜ and A are called complementary events.

Addition Rule
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   (formula sheet)

Mutually Exclusive Events
If A and B are mutually exclusive, then A and B cannot occur simultaneously, so
P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B)

Conditional Probability
P(A|B) = P(A ∩ B) / P(B)   (formula sheet)
P(A|B) is called the probability of A given B, that is, the probability of A given that B has happened.

Independent Events
If A and B are independent, then the occurrence of A is not affected by the occurrence of B:
P(A ∩ B) = P(A) × P(B) and P(A|B) = P(A)
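As an illustration of the law of large numbers and the independence rules above, a small simulation sketch (the events chosen are arbitrary): roll two fair dice many times, estimate P(A ∩ B) for A = "first die is even" and B = "second die shows 6", and compare with P(A) × P(B).

```python
import random

random.seed(1)
trials = 100_000

count_a = count_b = count_ab = 0
for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    a = (d1 % 2 == 0)      # A: first die is even,  P(A) = 1/2
    b = (d2 == 6)          # B: second die shows 6, P(B) = 1/6
    count_a += a
    count_b += b
    count_ab += (a and b)

p_a, p_b, p_ab = count_a / trials, count_b / trials, count_ab / trials
print(f"relative frequencies: P(A) = {p_a:.4f}, P(B) = {p_b:.4f}, P(A and B) = {p_ab:.4f}")
print(f"P(A) * P(B) = {p_a * p_b:.4f}   # close to P(A and B), since A and B are independent")
```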
Discrete Random Variables

Random Variables
A random variable X is a numerical outcome of interest from an experiment. Discrete random variables take only a countable number of values, while continuous random variables take values associated with an interval.

Probability Distribution
A probability distribution for a discrete random variable is a list or formula giving the probability for each value of the random variable. P(X = x) denotes the probability that the random variable X takes the value x. The sum of all probabilities must equal 1 for any probability distribution.
Cumulative probability: P(a ≤ X ≤ b) = Σ from x = a to x = b of p(x)

Bernoulli Trials
A Bernoulli trial must have the following properties:
• Each trial results in one of two outcomes, designated either a success S or a failure F.
• The probability of success on a single trial, p, is constant for all trials, so the probability of failure on a single trial is (1 − p).
• The trials are independent (the outcome of any trial is not affected by the outcome of any previous trial).

Binomial Random Variable
The number of successes in a sequence of n Bernoulli trials is called a binomial random variable and is said to have a binomial probability distribution. The probability of achieving k successes in n Bernoulli trials is:
P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ,   where C(n, k) = n! / (k!(n − k)!)   (formula sheet)
where n = number of independent trials, k = number of successes, p = probability of success.

Using the TI for Binomial Probability
• Binomial PDF: to find P(X = x)
• Binomial CDF: to find cumulative probabilities such as P(x₁ < X ≤ x₂)

Geometric Random Variable
A geometric random variable is the number of Bernoulli trials needed to obtain the first success. If the probability of success is p and the probability of failure is q = 1 − p, then the probability that the first success occurs on trial number k is:
P(X = k) = (1 − p)ᵏ⁻¹ p

Mean (Expected Value) and Standard Deviation of a Random Variable   (formula sheet)
                          Any random variable X               Binomial random variable
Expected value (mean):    E(X) = μ_X = Σ xᵢpᵢ                 μ_X = np
Variance:                 Var(X) = σ²_X = Σ (xᵢ − μ_X)² pᵢ     σ²_X = np(1 − p)
Standard deviation:       σ_X = √Var(X)                        σ_X = √(np(1 − p))
Alternative formula for variance: Var(X) = E(X²) − [E(X)]²

Fair Game
If a game is fair, then the expected winnings for each player in the game are zero.

Independent Random Variables
Two random variables X and Y are independent if
P(X = x | Y = y) = P(X = x)   OR   P(X = x, Y = y) = P(X = x) × P(Y = y)
for all values of X and Y.

Linear Combinations of E(X) and Var(X)
E(X ± a) = E(X) ± a          Var(X ± a) = Var(X)
E(bX) = bE(X)                Var(bX) = b² Var(X)
Combining X and Y:
E(X ± Y) = E(X) ± E(Y)
Var(X ± Y) = Var(X) + Var(Y) if X and Y are independent, so σ_{X±Y} = √(σ_X² + σ_Y²)
Note: when combining X and Y, every value of X is combined with every value of Y.
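A minimal sketch (n, p and k below are arbitrary values chosen for illustration) that evaluates the binomial and geometric formulas above and confirms that the general mean/variance definitions agree with the binomial shortcuts μ = np and σ² = np(1 − p):

```python
from math import comb, sqrt

# Hypothetical parameters (arbitrary, for illustration)
n, p = 10, 0.3

def binom_pmf(k: int) -> float:
    """P(X = k) = C(n, k) p^k (1 - p)^(n - k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Mean and variance from the general definitions ...
mean = sum(k * binom_pmf(k) for k in range(n + 1))
var = sum((k - mean) ** 2 * binom_pmf(k) for k in range(n + 1))

# ... agree with the binomial shortcuts mu = np and sigma^2 = np(1 - p)
print(f"E(X)  = {mean:.4f}   (np             = {n * p:.4f})")
print(f"SD(X) = {sqrt(var):.4f}   (sqrt(np(1-p))  = {sqrt(n * p * (1 - p)):.4f})")

# Geometric: probability that the first success occurs on trial k
k = 4
print(f"P(first success on trial {k}) = {(1 - p)**(k - 1) * p:.4f}")
```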
Normal Distribution
• Symmetric and bell-shaped with an infinite base.
• The total area under the normal curve is 1.
• The mean (μ) is located at the center and the standard deviation (σ) represents the width of the normal curve.
• A normal model is denoted N(μ, σ²).
• The mean and standard deviation of a standard normal random variable Z are 0 and 1 respectively.
P(a ≤ X ≤ b) is the area under the normal curve between x = a and x = b.

Z-score
z = (x − μ) / σ
where x is the data value, μ is the mean, and σ is the standard deviation.

The 68 – 95 – 99.7% Rule
P(μ − σ < X < μ + σ) = P(−1 < Z < 1) ≈ 0.68
P(μ − 2σ < X < μ + 2σ) = P(−2 < Z < 2) ≈ 0.95
P(μ − 3σ < X < μ + 3σ) = P(−3 < Z < 3) ≈ 0.997

Common Z-scores (reference table of critical values)

Using the TI for Normal Probability
• Normal CDF: given N(μ, σ²), find P(x₁ < X < x₂).
• Inverse Normal: Case 1 — given P(X < x) = p₀ and N(μ, σ²), find x. Case 2 — given P(X < x₀) = p₀, find the mean or standard deviation.

Normal Approximation to the Binomial
When np > 10 and n(1 − p) > 10, it is reasonable to use a normal distribution to approximate the binomial, with
μ_X = np and σ_X = √(np(1 − p))
Using the continuity correction with N(μ_X, σ_X²):
• To estimate P(X = x) for the binomial, find P(x − 0.5 < X < x + 0.5)
• To estimate P(X ≤ x), find P(X < x + 0.5)
• To estimate P(X < x), find P(X < x − 0.5)
• To estimate P(X ≥ x), find P(X > x − 0.5)
• To estimate P(X > x), find P(X > x + 0.5)

Sampling Distributions

Sampling Distribution of a Sample Proportion
The sample proportions are approximately normally distributed if the following conditions are satisfied:
1. Both np and n(1 − p) are at least 10.
2. The sample is a simple random sample.
3. The sample cannot be too large: the sample size n should be no larger than 10% of the population.
The mean and standard deviation of the sampling distribution of the sample proportion are:
μ_p̂ = p and σ_p̂ = √(p(1 − p)/n)   (formula sheet)
where p is the population proportion and n is the sample size.

Sampling Distribution of a Difference Between Two Sample Proportions
The differences between two sample proportions are approximately normally distributed if the following conditions are satisfied:
1. n₁p₁, n₁(1 − p₁), n₂p₂ and n₂(1 − p₂) are all at least 10.
2. The two samples are independent SRSs.
3. The samples cannot be too large: n₁ and n₂ should each be no larger than 10% of the respective population.
The mean and standard deviation of the sampling distribution of the difference between two sample proportions are:
μ_{p̂₁−p̂₂} = p₁ − p₂ and σ_{p̂₁−p̂₂} = √(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂)   (formula sheet)
where p₁ and p₂ are the population proportions and n₁ and n₂ are the sample sizes.

Sampling Distribution of a Sample Mean
The sample means are approximately normally distributed regardless of the shape of the population distribution if the following conditions are satisfied:
1. n is sufficiently large (n > 30).
2. The sample is an SRS.
3. The sample size is no larger than 10% of the population.
This is the Central Limit Theorem. (Note: if the population distribution is normal, then condition 1 is NOT necessary!)
The mean and standard deviation of the sampling distribution of the sample mean are:
μ_x̄ = μ and σ_x̄ = σ/√n   (formula sheet)
where μ is the population mean, σ is the population standard deviation and n is the sample size.
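A small simulation sketch (the population model and sample size are arbitrary choices for illustration) showing the Central Limit Theorem in action: sample means drawn from a strongly right-skewed (exponential) population are centered at μ with standard deviation close to σ/√n.

```python
import random
import statistics as st

random.seed(7)

# Hypothetical skewed population: exponential with mean 2 (so mu = 2, sigma = 2)
mu = sigma = 2.0
n = 40             # sample size
reps = 5_000       # number of samples drawn

sample_means = [st.mean(random.expovariate(1 / mu) for _ in range(n)) for _ in range(reps)]

print(f"mean of sample means = {st.mean(sample_means):.3f}   (mu            = {mu})")
print(f"sd of sample means   = {st.stdev(sample_means):.3f}   (sigma/sqrt(n) = {sigma / n**0.5:.3f})")
# A histogram of these sample means would look approximately normal even though
# the population itself is strongly right-skewed.
```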
Sampling Distribution of a Difference Between Two Sample Means
The differences between two sample means are approximately normally distributed if the following conditions are satisfied:
1. n₁ and n₂ are both sufficiently large (> 30).
2. The two samples are independent SRSs.
3. n₁ and n₂ are no larger than 10% of the respective populations.
The mean and standard deviation of the sampling distribution of the difference between two sample means are:
μ_{x̄₁−x̄₂} = μ₁ − μ₂ and σ_{x̄₁−x̄₂} = √(σ₁²/n₁ + σ₂²/n₂)
where μ₁ and μ₂ are the population means, σ₁ and σ₂ are the population standard deviations and n₁ and n₂ are the sample sizes.

THEME 4: Statistical Inference

Confidence Intervals

Confidence Interval
estimate ± critical score × standard deviation
where the 'estimate' is the observed sample statistic, the 'critical score' corresponds to the confidence level (either a z or a t score), and the 'standard deviation' of the sampling distribution is estimated from the sample statistics.
critical score × standard deviation is called the margin of error.

Interpretation of a confidence interval:
• We are (confidence level)% confident that the (population parameter) lies within the interval estimate ± critical score × standard deviation.
• There is a (confidence level)% chance that the confidence interval contains the true population parameter.

Confidence Interval for a Proportion
p̂ ± z × σ_p̂,   where σ_p̂ ≈ √(p̂(1 − p̂)/n)
where p̂ is the observed sample proportion, z is the critical score, and σ_p̂ is the standard deviation of the sample proportion (also called SE_p̂, the standard error of p̂).
Assumptions:
➢ np̂ > 10 and n(1 − p̂) > 10
➢ The sample is a simple random sample.
➢ The sample is less than 10% of the population.
Maximum error and minimum sample size: since σ_p̂ is largest when p̂ = 0.5,
max σ_p̂ = 0.5/√n, so z × 0.5/√n ≤ error ⇒ minimum n = [z/(2 × error)]²

Confidence Interval for a Difference of Two Proportions
(p̂₁ − p̂₂) ± z × σ_{p̂₁−p̂₂},   where σ_{p̂₁−p̂₂} ≈ √(p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂)
where p̂₁ − p̂₂ is the observed difference of the two sample proportions, z is the critical score, and σ_{p̂₁−p̂₂} is the standard deviation of the difference between two sample proportions.
Assumptions:
➢ n₁p̂₁, n₁(1 − p̂₁), n₂p̂₂ and n₂(1 − p̂₂) should all be at least 10.
➢ Both samples are SRSs and they are independent.
➢ Both samples should be less than 10% of their populations.
Maximum error and minimum sample size (equal sample sizes n):
max σ_{p̂₁−p̂₂} = √0.5/√n, so z × √0.5/√n ≤ error ⇒ minimum n = [z/(√2 × error)]²

Confidence Interval for a Mean
Given the population standard deviation σ:
x̄ ± z × σ_x̄, where σ_x̄ = σ/√n
(x̄ is the observed sample mean, z is the critical score, and σ_x̄ is the standard deviation of the sample mean calculated from the population standard deviation.)
Assumptions:
1. n is sufficiently large (n > 30).
2. The sample is an SRS.
3. The sample size is no larger than 10% of the population.

Given the sample standard deviation s:
x̄ ± t × σ_x̄ (df = n − 1), where σ_x̄ ≈ s/√n
(x̄ is the observed sample mean, t is the critical score, and σ_x̄ is the standard deviation of the sample mean calculated from the sample standard deviation.)
Assumptions:
1. n > 40, OR the sample data are approximately normally distributed, OR the population data are roughly symmetric and unimodal.
2. The sample is an SRS.
3. The sample size is no larger than 10% of the population.

Properties of the t-distribution
t = (x̄ − μ)/(s/√n), where σ_x̄ ≈ s/√n
➢ The t-distribution is also bell-shaped and symmetric.
➢ The t-distribution is more spread out than the normal distribution.
➢ The t-distribution is different for different values of n.
➢ df = n − 1 is called the degrees of freedom.
➢ The larger the df (that is, the larger the sample size), the closer the t-distribution is to the normal distribution.
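A minimal sketch (the sample values are made up for illustration) of a one-sample t-interval for a mean, x̄ ± t*·s/√n, with the critical value taken from scipy.stats; the same pattern applies to the z-intervals above.

```python
import statistics as st
from scipy.stats import t

# Hypothetical sample data (made up for illustration)
sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 5.2, 4.7, 5.5]
n = len(sample)
x_bar = st.mean(sample)
s = st.stdev(sample)

conf_level = 0.95
t_star = t.ppf(1 - (1 - conf_level) / 2, df=n - 1)   # critical t score, df = n - 1

margin = t_star * s / n**0.5                          # margin of error = t* x s/sqrt(n)
print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, t* = {t_star:.3f}")
print(f"{conf_level:.0%} CI for the mean: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")
```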
Choosing a Sample Size (for a mean)
In general, if z × σ/√n ≤ error, where z is the critical score for the confidence level, then the minimum sample size n needed to achieve a confidence interval with a given margin of error is:
n = (zσ / error)²

Confidence Interval for a Difference Between Two Means
Given the population standard deviations σ₁ and σ₂:
(x̄₁ − x̄₂) ± z × σ_{x̄₁−x̄₂}, where σ_{x̄₁−x̄₂} = √(σ₁²/n₁ + σ₂²/n₂)
Assumptions:
1. n₁ and n₂ are both sufficiently large (> 30).
2. The samples are independent SRSs.
3. The sample sizes are no larger than 10% of their populations.

Given the sample standard deviations s₁ and s₂:
(x̄₁ − x̄₂) ± t × σ_{x̄₁−x̄₂}, where σ_{x̄₁−x̄₂} ≈ √(s₁²/n₁ + s₂²/n₂) and df = (n₁ − 1) + (n₂ − 1)
Assumptions:
1. n₁ > 40 and n₂ > 40, OR the sample data are approximately normally distributed, OR the population data are approximately normally distributed.
2. The samples are independent SRSs.
3. The sample sizes are no larger than 10% of their populations.

Choosing Sample Sizes (for a difference of means)
In general, if z × √(σ₁²/n + σ₂²/n) ≤ error, where z is the critical score for the confidence level, then the minimum size for both samples is:
n = (z√(σ₁² + σ₂²) / error)²

Confidence Interval for the Slope of a Least Squares Regression Line
b₁ ± t × s_{b₁}
where b₁ is the slope of the sample regression line ŷ = ȳ + b₁(x − x̄), t is the critical score for df = n − 2, and s_{b₁} is the standard deviation of b₁.
Standard deviation of the slope b₁:
s_{b₁} = √(Σ(yᵢ − ŷᵢ)²/(n − 2)) / √(Σ(xᵢ − x̄)²)   (formula sheet)
OR equivalently s_{b₁} = s_e / (sₓ√(n − 1))
where s_e = √(Σ(yᵢ − ŷᵢ)²/(n − 2)) is the standard deviation of the residuals and sₓ = √(Σ(xᵢ − x̄)²/(n − 1)) is the standard deviation of x.
Assumptions:
1. The sample must be randomly selected.
2. The scatterplot of the sample data should be approximately linear (no apparent pattern in the residual plot).
3. The distribution of the residuals should be approximately normal.

Hypothesis Tests

Introduction to Hypothesis Tests
Purpose: a hypothesis test is used to test whether a claim about a population parameter (a hypothesis) is acceptable.

Null Hypothesis and Alternative Hypothesis
• The null hypothesis (H₀) is an equality about a population parameter.
• The alternative hypothesis (Hₐ) is an inequality (greater than, less than, or not equal to) about a population parameter.

Conclusion of a Hypothesis Test
Either "We have sufficient evidence to reject the null hypothesis" OR "We do not have sufficient evidence to reject the null hypothesis."

Errors in Hypothesis Tests
Decision based on sample      H₀ true             H₀ false
Reject H₀:                    Type I error        correct decision
Fail to reject H₀:            correct decision    Type II error
• The probability of making a Type I error is the significance level (the α risk).
• The probability of making a Type II error, β, is different for each possible value of the population parameter.
• The power of a hypothesis test, 1 − β, is the probability that a Type II error is not committed, that is, the probability that a false null hypothesis is correctly rejected.
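A small simulation sketch (every parameter value here is an arbitrary choice for illustration) estimating the Type I error rate and the power of a one-sided z-test for a mean with σ known, matching the definitions above:

```python
import random
import statistics as st
from scipy.stats import norm

random.seed(3)
mu0, sigma, n, alpha = 50.0, 10.0, 25, 0.05
z_crit = norm.ppf(1 - alpha)            # reject H0: mu = mu0 when z > z_crit (Ha: mu > mu0)

def reject_rate(true_mu: float, reps: int = 10_000) -> float:
    """Fraction of simulated samples for which H0 is rejected when the true mean is true_mu."""
    rejections = 0
    for _ in range(reps):
        x_bar = st.mean(random.gauss(true_mu, sigma) for _ in range(n))
        z = (x_bar - mu0) / (sigma / n**0.5)
        rejections += (z > z_crit)
    return rejections / reps

print(f"Estimated Type I error (true mu = mu0 = 50): {reject_rate(50.0):.3f}  (should be near alpha = 0.05)")
print(f"Estimated power        (true mu = 55):       {reject_rate(55.0):.3f}  (this is 1 - beta)")
```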
• Choosing a smaller α results in a higher risk of a Type II error and lower power.
• The greater the difference between the null hypothesis and the true population parameter, the smaller the risk of a Type II error and the greater the power.

Hypothesis Testing for Proportions

Test for one proportion (p):
Step 1 — Set up the hypotheses and significance level:
H₀: p = p₀;  Hₐ: p < p₀, p > p₀ or p ≠ p₀;  significance level α = α₀
Step 2 — Find μ_p̂ and σ_p̂:
μ_p̂ = p₀,  σ_p̂ = √(p₀(1 − p₀)/n)
Step 3 — Compute the P-value (the attained significance):
P(p̂ < p̂₀) or P(p̂ > p̂₀), where p̂₀ is the obtained sample proportion.
Step 4 — Compare the P-value with the significance level α.
Assumptions:
1. Both np₀ and n(1 − p₀) are at least 10.
2. The sample is an SRS.
3. The sample size is less than 10% of the population.

Test for a difference between two proportions (p₁ − p₂):
Step 1 — Set up the hypotheses and significance level:
H₀: p₁ − p₂ = 0;  Hₐ: p₁ − p₂ < 0, > 0 or ≠ 0;  significance level α = α₀
Step 2 — Find μ_{p̂₁−p̂₂} and σ_{p̂₁−p̂₂}:
μ_{p̂₁−p̂₂} = 0,  σ_{p̂₁−p̂₂} ≈ √(p̂(1 − p̂)(1/n₁ + 1/n₂)), where the pooled proportion is p̂ = (x₁ + x₂)/(n₁ + n₂)
Step 3 — Compute the P-value (the attained significance):
P(p̂₁ − p̂₂ < obtained difference) or P(p̂₁ − p̂₂ > obtained difference)
Step 4 — Compare the P-value with the significance level α.
Assumptions:
1. n₁p̂₁, n₁(1 − p̂₁), n₂p̂₂ and n₂(1 − p̂₂) should all be at least 10.
2. The samples are independent SRSs.
3. The sample sizes are less than 10% of their populations.

Hypothesis Testing for Means

Test for one mean (μ):
Step 1 — Set up the hypotheses and significance level:
H₀: μ = μ₀;  Hₐ: μ < μ₀, μ > μ₀ or μ ≠ μ₀;  significance level α = α₀
Step 2 — Find μ_x̄ and σ_x̄:
μ_x̄ = μ₀,  σ_x̄ ≈ s/√n
Step 3 — Either:
• find the t-score for the observed sample statistic and the critical t-score for the significance level α₀, then compare the two t-scores; or
• find the t-score for the observed sample statistic and the corresponding P-value, then compare the P-value with the significance level α₀.
t-score for a sample mean: t = (x̄ − μ₀)/(s/√n)
Assumptions:
1. The sample is an SRS.
2. The sample is large enough, OR the sample data are approximately symmetric and unimodal, OR the population distribution is approximately normal.

Test for a difference between two means (μ₁ − μ₂):
Step 1 — Set up the hypotheses and significance level:
H₀: μ₁ − μ₂ = 0;  Hₐ: μ₁ − μ₂ < 0, > 0 or ≠ 0;  significance level α = α₀
Step 2 — Find μ_{x̄₁−x̄₂} and σ_{x̄₁−x̄₂}:
μ_{x̄₁−x̄₂} = 0,  σ_{x̄₁−x̄₂} ≈ √(s₁²/n₁ + s₂²/n₂)
Step 3 — As above, using the t-score for the difference between two sample means:
t = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂)
Assumptions:
1. The two samples are independent SRSs.
2. The samples are large enough, OR the sample data are approximately symmetric and unimodal, OR the population distributions are approximately normal.

Hypothesis Test for the Slope of a Least Squares Line
Step 1 — Set up H₀ and Hₐ:
H₀: β = 0;  Hₐ: β > 0, β < 0 or β ≠ 0
Step 2 — Compute the t-score:
t₀ = (b₁ − 0)/s_{b₁}
where b₁ is the observed slope and s_{b₁} is the standard deviation of the slope.
Step 3 — Compute the P-value:
P-value = P(t > t₀) or P(t < t₀), with df = n − 2
Step 4 — Compare the P-value with the significance level α.
Assumptions:
1. The sample must be randomly selected.
2. The scatterplot should be approximately linear (no apparent pattern in the residual plot).
3. The distribution of the residuals should be approximately normal.
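As a worked illustration of the four-step procedure above (the counts are made up for illustration), a one-proportion z-test of H₀: p = 0.5 against Hₐ: p > 0.5:

```python
from scipy.stats import norm

# Hypothetical data (made up): 60 successes in n = 100 trials, testing H0: p = 0.5 vs Ha: p > 0.5
n, x, p0, alpha = 100, 60, 0.5, 0.05

p_hat = x / n
sigma_p_hat = (p0 * (1 - p0) / n) ** 0.5       # sqrt(p0(1 - p0)/n), computed under H0
z = (p_hat - p0) / sigma_p_hat

p_value = norm.sf(z)                            # P(Z > z) for the one-sided alternative Ha: p > p0
print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, P-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0", f"at alpha = {alpha}")
# Conditions to check first: n*p0 and n*(1 - p0) at least 10, SRS, sample < 10% of population.
```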
Step 1 Ha : the observed distribution Ha : The two variables are doesn’t fit the expected not independent. distribution. Homogeneity H0 : The distributions from two populations are the same. Ha : The distributions from two populations are not the same. Compute the X 2 according to the observed values and the expected values: 𝜒0 Step 2 Step 3 2 (𝑂𝑏𝑠 − 𝐸𝑥𝑝)2 =∑ 𝐸𝑥𝑝 Where obs = the observed value exp = expected value Compute P-value: Compute P-value: P-value = P(χ2 > χ20 ) P-value = P(χ2 > χ20 ) where df = r– 1 Where df = (r − 1)(c − 1) Compare the P-value with the significance level to test the hypothesis Compare the P-value with the significance level α: If P-value is greater than α, then do not reject H0 . If P-value is less than α, then reject H0 . 1. The sample is a SRS. 1. The samples are 1. The sample is a SRS. independent SRS. 2. The expected values for all 2. The expected values for all cells must be at 2. The expected values cells must be at least 5. Assumption for all cells must be least 5. Step 4 at least 5. Test for Goodness of Fit Comparing a single sample to a population model Test for Independence Work with a single sample classified on two variables Test for Homogeneity Compare samples from two or more populations about a single variable THE END 31 / 31 eskedu.com