AP Statistics Study Guide
By Geoffrey Gao

Chapter 1 : Probability / Random Variables

1) PROBABILITY
  a) Probability – the likelihood that a particular event will occur
    i) The relative frequency of an event is the number of times the event happened divided by the total number of trials.
    ii) The probability of an outcome is always between 0 and 1, inclusive
  b) The Law of Large Numbers
    i) The Law of Large Numbers states that the relative frequency of an event tends closer and closer to a fixed number (the probability) as an experiment is repeated more and more times
  c) Complementary Events
    i) The probability that an event will not occur is equal to 1 minus the probability that the event will occur
    ii) Equation: P(not A) = 1 - P(A)
  d) General Addition Rule
    i) When two events are not mutually exclusive, the sum of their probabilities counts the shared occurrence twice. Thus you add the probabilities of the individual events and subtract the probability of both
    ii) Equation: P(A or B) = P(A) + P(B) – P(A and B)
  e) Multiplication Rule
    i) The chance that two independent events both occur is the product of their separate probabilities
    ii) Equation: P(A and B) = P(A) * P(B)
  f) Conditional Probability
    i) Conditional Probability is the probability of an event given that another event has occurred
    ii) Equation: P(A|B) = P(A and B) / P(B)
    iii) You can reverse the conditional probability with Bayes’ Formula
    iv) Equation: P(B|A) = P(A|B) * P(B) / P(A)
  g) Independence / Disjoint
    i) Two events are independent if the occurrence of one event does not affect the probability of the other
      (1) Equation: P(B) = P(B|A)
    ii) Two events are disjoint, or mutually exclusive, if the two events cannot both happen simultaneously

2) RANDOM VARIABLES
  a) Random Variables are variables that represent the different numbers associated with the potential outcomes of a certain situation
    i) A Discrete random variable only has a countable number of values
    ii) A Continuous random variable has a range of values with any
      value in between
    iii) The Expected Value of a random variable X is the sum of the products obtained by multiplying each value x_i by its corresponding probability p_i
      (1) Equation: E(X) = μ_X = Σ (x_i * p_i)
    iv) The Variance is the probability-weighted average of the squared deviations from the mean
      (1) Equation: var(X) = σ² = Σ (x_i - μ_X)² * p_i
    v) The Standard Deviation is the square root of the variance
      (1) Equation: σ = √( Σ (x_i - μ_X)² * p_i )
  b) Bernoulli Trials are those that satisfy the following conditions
    i) There are only two possible outcomes on each trial: success and failure
    ii) The probability of success is the same on every trial
    iii) The trials are independent. If this assumption is violated because the sampling is done without replacement, it is still acceptable to proceed if the sample is smaller than 10% of the population
  c) Binomial Probability Distributions deal with HOW MANY successes occur in a fixed number of trials
    i) The mean of a binomial distribution is the expected number of successes
      (1) Equation: μ_X = np
    ii) The standard deviation is the following:
      (1) Equation: σ = √(npq), where q = 1 - p
    iii) Probability Equation, where X is the number of successes in n trials
      (1) Equation: P(X = x) = nCx * p^x * q^(n-x)
    iv) Calculations
      (1) Binompdf(n,p,x) gives the probability of exactly x successes in n trials, where p is the probability of success on a single trial
      (2) Binomcdf(n,p,x) gives the cumulative probability of x or fewer successes in n trials, where p is the probability of success on a single trial
  d) Geometric Probability Distributions deal with WHEN the first success occurs in a chain of trials
    i) The expected value of a geometric distribution is the expected number of trials until the first success
      (1) Equation: E(X) = 1/p
    ii) Standard Deviation
      (1) Equation: σ = √(q / p²)
    iii) Probability Equation (where X is the number of trials until the first success occurs)
      (1) Equation: P(X = x) = q^(x-1) * p
    iv) Calculations
      (1) Geometpdf(p,x) gives the probability that the first success occurs on exactly the xth trial.
        You specify the probability of success (p) and the trial number of the first success (x)
      (2) Geometcdf(p,x) evaluates the cumulative distribution function. You specify the probability of success (p) and the value x, and it returns the probability that the first success occurs on or before the xth trial

Chapter 2 : Describing/Comparing Distributions

1) Graphical Displays
  a) Bar Graphs show categorical variables; Dotplots show the individual values of a quantitative variable
  b) Histograms show quantitative values
    i) Useful for large data sets but do not show individual scores
  c) Stem Plots
    i) Show individual scores
2) Summarizing/Comparing Distributions
  a) Summarizing:
    i) Center
      (1) Separates the values roughly in half
    ii) Shape
      (1) Clusters and Gaps
        (a) Clusters show natural subgroups into which values fall
        (b) Gaps show holes where no values fall
      (2) Modes
        (a) Peaks are known as modes. Unimodal means one peak, bimodal means two peaks, etc.
      (3) Certain common patterns of shape:
        (a) Symmetric
        (b) Skewed Left
          (i) Mean < Median
        (c) Skewed Right
          (i) Mean > Median
    iii) Spread
      (1) Scope of the values from smallest to largest
      (2) Equation: Interquartile Range (IQR) is Q3 – Q1
    iv) Outliers
      (1) Outliers are extreme values that can be the result of natural chance variation or errors in measurement
      (2) Equation: a value is an outlier if it is below Q1 – 1.5 * IQR or above Q3 + 1.5 * IQR
    v) Standard Deviation/Variance
      (1) Standard Deviation: roughly the typical distance of the values from the mean
        (a) Equation: σ = √( Σ (x - μ)² / n )
      (2) Variance:
        (a) Equation: σ² = Σ (x - μ)² / n
    vi) Transformations
      (1) Adding a constant to every value shifts the mean and median by that constant but does not change the SD. Multiplying every value by a constant multiplies the mean and median by that constant and the SD by its absolute value
3) Normal Distributions
  a) Normal Distributions are bell-shaped and symmetric and have an infinite base.
  b) Empirical Rule (68-95-99.7 rule) – About 68% of all values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3.
  c) Z-Scores tell us how many standard deviations a certain value is away from the mean.
    i) Equation: z = (x – μ) / σ
    ii) Evaluating Z-Scores:
      (1) Invnorm(%) – finds the z-score for a given percentile (the area to the left, entered as a decimal)
      (2) Normcdf(lower,upper) – finds the proportion of values between two z-scores
4) Regressions
  a) A graphical display called a scatterplot gives an immediate visual impression of a possible relationship between two variables, while a numerical measurement, called a correlation coefficient, is often used as a quantitative measure of the strength of a linear relationship
    i) r is the correlation coefficient. It ranges from -1 to 1. Values of 1 and -1 are the strongest linear associations, and 0 indicates no linear association. Positive r values indicate a positive relationship (positive slope), and negative r values indicate a negative relationship. Correlation does not imply causation! It only measures the strength of a linear relationship
    ii) R² is called the coefficient of determination. It is found by squaring the r-value. When you explain your R² value you make the statement: [R²] of the variability in [y-variable] can be explained by the linear association with [x-variable].
  b) The line of best fit (the least-squares regression line, LSRL) is the line that gives the best predictions for a set of data. It minimizes the sum of the squared residuals
    i) Residual Value: Observed – Predicted
    ii) Equation: b1 = r (s_y / s_x) – Slope of the LSRL
  c) Residual Plots
    i) The residual plot is made up of the residuals of all the values. The sum of the residuals is always 0. A large R² value, small residuals, and no clear pattern in the residual plot indicate that the regression line is an appropriate model
    ii) Influential points are those that sharply change the regression line when they are removed.
    iii) Transformations alter the y values, the x values, or both to achieve a non-patterned residual plot. Usually if the residual plot has a pattern, a linear model is inappropriate and a nonlinear model is more appropriate (thus the transformation).
      (1) Exponential: log of Y
      (2) Power: log of Y and log of X
      (3) Quadratic: square root of Y
      (4) Reciprocal: 1/Y
      (5) Logarithmic: log of X

Chapter 3 : Planning a Study

1) Methods of Data Collection
  a) The Population is the entire group of individuals we want information about, and the Sample is the smaller group from the population that is actually studied
  b) A Census is a complete enumeration of an entire population. It is an attempt to contact every member of a population
  c) A Sample Survey aims to obtain information about a whole population by studying a part of it, or a sample.
    i) A sample is biased if in some critical way it does not represent the population. The main technique to avoid bias is to incorporate randomness
  d) Experiment vs. Observational Study
    i) An Experiment is a controlled study. In an experiment, a treatment is applied to one or more of the groups and the response is observed. There are often treatment groups and control groups. Good experimental designs include:
      (1) Controls – A group that receives the same conditions as the other groups but without the treatment. This is used as a baseline comparison for the response measurement
      (2) Blocking – Process in which the subjects are divided into groups of similar subjects (such as by gender) before treatments are randomly assigned within each group, to bring certain differences directly into the picture
      (3) Randomization – Unknown and uncontrollable differences are handled by randomizing who receives which treatments
    ii) An Observational Study is a study in which the researcher has no choice in regard to who goes into the treatment and control groups. No action is taken; the researcher merely observes what has occurred. Observational studies on the impact of some variable on another variable often fail because explanatory variables are confounded with other variables
      (1) Confounding Variables are variables that are not accounted for in the original design and whose effects cannot be separated from those of the explanatory variable.
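
The randomization principle listed under good experimental designs above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a completely randomized design with two groups; the function name assign_groups, the subject labels, and the seed are hypothetical and are not part of the guide.

```python
import random

def assign_groups(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)   # the seed only makes this sketch reproducible
    shuffled = list(subjects)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)       # chance alone decides the order
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels, not data from the guide
subjects = [f"S{i}" for i in range(1, 11)]
treatment, control = assign_groups(subjects, seed=1)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in each group, unknown differences between subjects tend to be spread evenly across the treatment and control groups. For a blocked design, the same function could be called once within each block.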
2) Planning and Conducting Surveys
  a) Simple Random Sample
    i) A Simple Random Sample (SRS) is one in which every possible sample of the desired size has an equal chance of being selected.
      (1) A typical way of taking an SRS is assigning everyone a number and using a random number generator
  b) Bias / Sampling Variation
    i) All surveys give a statistic as an estimate for a population parameter. Different samples give different statistics, all of which are estimates for the same population parameter, and so error, called sampling error, is present. This error tends to be smaller when the sample size is larger.
    ii) Bias is the tendency to favor the selection of certain members of a population. Here are explanations of a few types of bias
      (1) Response Bias – People don’t want to be perceived as having unpopular or unsavory views and so they respond untruthfully when face to face with an interviewer
      (2) Wording Bias – Non-neutral or poorly worded questions may lead to answers that are unrepresentative of the population
      (3) Selection Bias – Sampling from a group that does not represent the population of interest. For instance, asking for opinions regarding welfare reform only in an area that is largely conservative.
      (4) Undercoverage Bias – This occurs when part of the population is inadequately represented in the sample. Convenience Samples are based on choosing individuals who are easy to reach. These tend to produce unrepresentative data
      (5) Voluntary Response Bias – Samples based on individuals who offer to participate typically give too much emphasis to people with strong opinions
      (6) Nonresponse Bias – Occurs when certain people refuse to respond or are unreachable or too difficult to contact
  c) Other Sampling Methods
    i) Systematic Sampling – Involves listing the population in some order, choosing a random point to start, and picking people from the list at fixed intervals (i.e. every 10th person).
      This gives a reasonable sample as long as the original order of the list is unrelated to the variables under consideration
    ii) Stratified Sampling – Involves dividing the population into homogeneous (similar) groups called strata, and then choosing a random sample of persons from every stratum.
    iii) Cluster Sampling – Involves dividing the population into heterogeneous (mixed) groups called clusters, randomly choosing some of the clusters, and sampling the persons in the chosen clusters. Each cluster should resemble the entire population
    iv) Multistage Sampling – Sampling in two or more successive stages (for example, randomly choosing clusters and then taking an SRS within each chosen cluster).
3) Confounding, Control Groups, Placebo Effects, and Blinding
  a) Experiments involve explanatory variables, called factors, which are believed to have an effect on response variables.
  b) When there is uncertainty with regard to which variable is causing an effect, the variables are confounded
  c) A lurking variable is a variable that drives two other variables, creating the mistaken impression that the two other variables are related by cause and effect. The two variables are linked only through their common response to the lurking variable
  d) The placebo effect is the fact that many people respond to any kind of perceived treatment, even if the treatment has no active ingredient
  e) Blinding occurs when the subjects or the response evaluators don’t know which subjects are receiving which treatments. Double blind is when both are unaware.

Chapter 4 : Statistical Inference

1) A-G (for review)
  a) List statistics and parameters (p̂, n, π, N, OR x̄, s, n, μ, σ) and degrees of freedom
  b) State which test you are using. State α
  c) State hypotheses in terms of the population (μ or π)
    i) H0 : null hypothesis (π = π0 OR μ = μ0).
      No change or difference from the specification (any difference is due to natural sample-to-sample variation)
    ii) Ha : alternative hypothesis, one sided or two sided
    iii) Assume H0 is true if doing a significance test (this doesn’t make sense in a confidence interval because you are trying to find π)
  d) Verify the conditions that you can, and make assumptions for those that can’t be verified. State “we will proceed with caution anyway” if you need to
  e) If doing a significance test, make a diagram with shading. If doing a Confidence Interval, find the confidence interval
  f) Calculate P-Value
  g) Three-Part Conclusion
    i) Make a statement about the P-Value: “Assuming the true population [proportion or mean] [H0 in context], the probability we could get a sample at least as extreme as [state results from sample] due to natural sample-to-sample variability is [insert P-Value].” OR make a confidence interval conclusion
    ii) Make a statement about H0: Compare P-Value to α and state whether H0 is rejected or plausible (failed to be rejected). If P-Value < α, reject the null; otherwise, fail to reject
    iii) Make a statement about Ha :
      (1) If H0 is rejected state in context:
        (a) “substantial evidence for Ha”: P ≤ 1%
        (b) “moderate evidence for Ha”: P = 1% - 5%
        (c) “weak evidence for Ha”: P = 5% - 10%
      (2) If H0 is plausible (failed to be rejected), state in context that there is not sufficient evidence for Ha
2) Proportion Tests
  a) 1-Prop and 2-Prop z-tests deal with proportions of populations. All proportions are between 0 and 1 and describe the fraction of a population with a certain characteristic.
  b) Confidence Interval
    i) Conditions:
      (1) Randomization – Is the sample random?
      (2) Normality – np̂ ≥ 10 and n(1 - p̂) ≥ 10
      (3) Independence (population large enough) – N > 10n
    ii) Equation: CI = p̂ ± Margin of Error
      (1) Margin of Error: z* * SE
    iii) Equation: Standard Error
      (1) 1-Prop: SE = √( p̂ (1 - p̂) / n )
      (2) 2-Prop: SE = √[ p̂_1 (1 - p̂_1) / n_1 + p̂_2 (1 - p̂_2) / n_2 ]
  c) P-Value (Z-Scores)
    i) Conditions:
      (1) Randomization – Is the sample random?
      (2) Normality – np ≥ 10 and nq ≥ 10
      (3) Independence (population large enough) – N > 10n
    ii) Equation: Standard Deviation
      (1) 1-Prop: σ = √(pq / n)
      (2) 2-Prop: σ = √[ p̂_c (1 - p̂_c) (1/n_1 + 1/n_2) ]
        (a) Equation: p̂_c = (x_1 + x_2) / (n_1 + n_2)
    iii) Note that in P-Values you use the hypothesized p and q to solve the equations, whereas in Confidence Intervals you use p̂!
  d) Calculations
      (1) z* = invNorm(%)
      (2) Confidence Interval: 1-PropZInt, 2-PropZInt
      (3) P-Value: 1-PropZTest, 2-PropZTest
      (4) Evaluating P-Value: normcdf(lower,upper)
3) Sample Tests
  a) 1-Sample and 2-Sample t-tests deal with the means of populations. You will need to find the sample means and the standard deviations
  b) Confidence Interval / P-Value
    i) Conditions:
      (1) Randomization – Is the sample random?
      (2) Normality – Graph with histogram/box and whisker and describe shape (unimodal + symmetric)
      (3) Independence (population large enough) – N > 10n
    ii) Equation: CI = x̄ ± t* SE(x̄)
    iii) Equation: Standard Error
      (1) 1-Sample: SE(x̄) = s / √n
      (2) 2-Sample: SE = √[ s_1² / n_1 + s_2² / n_2 ]
    iv) Equation: Degrees of Freedom: n - 1 (1-Sample)
    v) Calculations:
      (1) t* = invT(%,df)
      (2) Confidence Interval: TInterval, 2-SampTInt
      (3) P-Value: T-Test, 2-SampTTest
      (4) Evaluating P-Value: tcdf(lower, upper, degrees of freedom)
  c) Matched Pair
    i) These occur when two treatments are applied to, or two measurements are taken on, the same subject in a sample. They are calculated the same way as a 1-sample t-test, using the differences in the data.
4) Chi-Squared Tests
  a) Chi-Squared tests were derived to perform significance testing for categorical variables.
     They assess whether the observed counts in a sample are consistent with the counts expected under the null hypothesis
    i) Equation: χ² = Σ (observed – expected)² / expected
    ii) Make sure to write out the sum above term by term, in a form like x1 + x2 + … + xn
  b) Conditions
    i) Randomization: Is the sample chosen randomly?
    ii) Expected Cell Frequency: The expected counts in each cell are at least 5
    iii) Independence: N > 10n
  c) Goodness of Fit
    i) Goodness of Fit is used to determine whether our observed data fit the theoretical distribution for that data
    ii) Equation: Degrees of Freedom = number of categories – 1
    iii) Equation: Expected = Sum/columns (when every category is expected to be equally likely)
    iv) Hypotheses
      (1) H0: There is no difference between each event (the data follow the theoretical distribution)
      (2) Ha: There is a difference between each event (the data do not follow the theoretical distribution)
    v) Calculations:
      (1) χ²GOF-Test = goodness of fit test
  d) Homogeneity Test
    i) Homogeneity Test is used to compare the distribution of a categorical variable across multiple populations/samples. Under H0, the categories appear in the same proportions in every population
    ii) Equation: Degrees of Freedom = (rows – 1) (columns – 1)
    iii) Hypotheses
      (1) H0: The proportions across each sample are equal
      (2) Ha: The proportions across each sample are different
    iv) Calculations
      (1) χ²-Test (input values into matrix)
      (2) χ²cdf(lower, upper, df)
  e) Independence Test
    i) The independence test is used to gain evidence of association between two categorical variables. Questions will usually ask “are the two associated?”
    ii) Equation: Degrees of Freedom = (rows – 1) (columns – 1)
    iii) Hypotheses
      (1) H0: The two variables are INDEPENDENT/NOT ASSOCIATED
      (2) Ha: The two variables are DEPENDENT/ASSOCIATED
    iv) Calculations
      (1) χ²-Test (input values into matrix)
      (2) χ²cdf(lower, upper, df)
5) Regression Tests
  a) Questions will usually ask whether there is evidence of a linear relationship between two quantitative variables
  b) Conditions:
    i) Linearity Assumption – Check the scatter plot to see if the shape is linear
    ii) Independence Assumption – Check the residuals plot. The residuals should appear randomly scattered
    iii) Equal Variance Condition – Check the residuals plot again.
      The vertical spread of residuals should be roughly the same everywhere
    iv) Normal Population Assumption – Check the histogram of the residuals. The distribution of residuals should be unimodal and symmetric
  c) Hypotheses
      (1) H0: β = 0 (no linear association)
      (2) Ha: β > 0 (positive linear association), β < 0 (negative linear association), or β ≠ 0 (some linear association)
  d) Confidence Interval
    i) Equation: SE_b = s / √( Σ (x - x̄)² )
    ii) Equation: s = √[ (1/(n-2)) Σ (y - ŷ)² ]
    iii) Calculation
      (1) LinRegTInt
  e) P-Value
    i) Equation: t = b / SE_b
    ii) P-Value: the probability, assuming β = 0, of a t statistic at least as extreme as the one observed, in the direction of Ha
    iii) Calculation
      (1) LinRegTTest
  f) Degrees of Freedom
    i) Equation: n - 2
6) Errors
  a) Type I
    i) This occurs when you rejected the null when you shouldn’t have. Thus the null is actually true. The probability of a Type I error is α
  b) Type II
    i) This occurs when you failed to reject the null when you should have. Thus the null is false.
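
A Type I error can be made concrete with a small simulation. The sketch below repeatedly draws samples for which H0 is true, runs a two-sided 1-proportion z-test on each using σ = √(pq/n), and counts how often H0 is rejected anyway. The function names and the choices p0 = 0.5, n = 100, α = 0.05, 2000 trials, and the seed are all illustrative assumptions, not values from the guide.

```python
import random
from statistics import NormalDist

def one_prop_z_pvalue(successes, n, p0):
    """Two-sided P-value for H0: p = p0, using sigma = sqrt(p0 * q0 / n)."""
    p_hat = successes / n
    sd = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(p0=0.5, n=100, alpha=0.05, trials=2000, seed=0):
    """Simulate samples with H0 true and return the fraction that reject H0."""
    rng = random.Random(seed)   # the seed only makes this sketch reproducible
    rejections = 0
    for _ in range(trials):
        # Each sample is n Bernoulli trials with success probability p0
        successes = sum(rng.random() < p0 for _ in range(n))
        if one_prop_z_pvalue(successes, n, p0) < alpha:
            rejections += 1     # H0 is true here, so this is a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Simulated Type I error rate: {rate:.3f}")
```

The simulated rate should land near α, though not exactly on it, because the number of successes is a whole number and the normal model is only an approximation. Simulating a Type II error would require drawing the samples from a population in which H0 is false.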