AP Statistics Study Guide
By Geoffrey Gao
Chapter 1 : Probability / Random Variables
1) PROBABILITY
a) Probability – the likelihood that a particular event will occur
i) The relative frequency of an event is the number of times the
event happened divided by the total number of trials.
ii) The probability of an outcome is always between 0 and 1
b) The Law of Large Numbers
i) The Law of Large Numbers states that the relative frequency of an
event tends closer and closer to a certain number (the probability)
as an experiment is repeated more and more times
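The Law of Large Numbers can be seen in a quick simulation. The sketch below is only an illustration in Python; the function name `relative_frequency`, the fair-coin probability of 0.5, and the seed are assumptions chosen for this example.

```python
import random

def relative_frequency(n_trials, p=0.5, seed=0):
    """Simulate n_trials success/failure trials and return the fraction of successes."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials

# The relative frequency should settle near p = 0.5 as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With only 10 trials the relative frequency can be far from 0.5; with 100,000 trials it should be very close.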
c) Complementary Events
i) The probability that an event will not occur is equal to 1 minus the
probability that the event will occur
ii) Equation: P(not A) = 1 - P(A)
d) General Addition Rule
i) When two events are not mutually exclusive, the sum of their
probabilities counts the shared occurrence twice. Thus you add the
probabilities of the individual events and subtract the probability of both
ii) Equation: P(A or B) = P(A) + P(B) – P(A and B)
e) Multiplication Rule
i) The chance that two independent events both occur is the product
of their separate probabilities
ii) Equation: P(A and B) = P(A) * P(B)
f) Conditional Probability
i) Conditional Probability is the probability of an event given that
another event has occurred
ii) Equation: P(A|B) = P(A and B) / P(B)
iii) You can reverse the conditional probability with Bayes’ Formula
iv) Equation: P(B|A) = P(A|B) * P(B) / P(A)
g) Independence / Disjoint
i) Two events are independent if the occurrence of one event does
not affect the probability of the other
(1) Equation: P(B) = P(B|A)
ii) Two events are disjoint, or mutually exclusive if the two events
cannot both happen simultaneously
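As a worked example of the rules in this section, the sketch below uses one card drawn from a standard 52-card deck. The events A (heart) and B (face card) are assumptions picked for illustration; exact fractions are used so no rounding is involved.

```python
from fractions import Fraction

# One card drawn from a standard 52-card deck (illustrative example).
p_a = Fraction(13, 52)        # A: the card is a heart
p_b = Fraction(12, 52)        # B: the card is a face card (J, Q, K)
p_a_and_b = Fraction(3, 52)   # J, Q, K of hearts

p_not_a = 1 - p_a                        # complement rule
p_a_or_b = p_a + p_b - p_a_and_b         # general addition rule
p_a_given_b = p_a_and_b / p_b            # conditional probability
p_b_given_a = p_a_given_b * p_b / p_a    # Bayes' formula
independent = (p_a_given_b == p_a)       # P(A|B) = P(A) means independent

print(p_not_a, p_a_or_b, p_a_given_b, p_b_given_a, independent)
```

Here A and B are independent (knowing the card is a face card does not change the chance it is a heart) but not disjoint, since three cards belong to both events.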
2) RANDOM VARIABLES
a) Random Variables are variables that represent the different
numbers associated with the potential outcomes of a certain situation
i) A Discrete random variable only has a countable number of
values
ii) A Continuous random variable has a range of values with any
value in between
iii) The Expected Value of a random variable X is the sum of the
products obtained by multiplying each value by its corresponding
probability
(1) Equation: E(X) = μ_X = Σ (x_i * p_i)
iv) The Variance is the probability-weighted average of the squared
deviations from the mean.
(1) Equation: var(X) = σ^2 = Σ (x_i - μ_X)^2 * p_i
v) The Standard Deviation is the square root of the variance
(1) Equation: σ = √( Σ (x_i - μ_X)^2 * p_i )
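The three formulas above can be checked on a small distribution. This sketch assumes X is the number of heads in two fair coin flips; the variable names are illustrative.

```python
import math

values = [0, 1, 2]            # possible values of X
probs = [0.25, 0.5, 0.25]     # their probabilities (must sum to 1)

mean = sum(x * p for x, p in zip(values, probs))                     # E(X)
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))   # var(X)
sd = math.sqrt(variance)                                             # standard deviation

print(mean, variance, sd)
```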
b) Bernoulli Trials are those that satisfy the following conditions
i) There are only two possible outcomes on each trial: success and
failure
ii) The probability of success is the same on every trial
iii) The trials are independent. If sampling without replacement violates
this assumption, it is still acceptable to proceed as long as the
sample is smaller than 10% of the population
c) A Binomial Probability Distribution describes HOW MANY successes
occur in a fixed number n of Bernoulli trials, where p is the
probability of success and q = 1 - p
i) The mean value of a binomial distribution describes the expected
number of successes
(1) Equation: μ_X = np
ii) The standard deviation is the following:
(1) Equation: σ = √(npq)
iii) Probability Equation where X is the number of successes in n trials
(1) Equation: P(X = x) = nCx * p^x * q^(n-x)
iv) Calculations
(1) Binompdf(n,p,x) gives the probability of exactly x successes in
n trials where p is the probability of success on a single trial
(2) Binomcdf(n,p,x) gives the cumulative probability of x or fewer
successes in n trials, where p is the probability of success on a
single trial
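For readers without the calculator, the binomial formulas can be sketched in Python. The function names below simply mirror the calculator commands and are defined here for illustration; n = 10 and p = 0.5 are made-up inputs.

```python
from math import comb, sqrt

def binompdf(n, p, x):
    """P(X = x): exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomcdf(n, p, x):
    """P(X <= x): x or fewer successes in n trials."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

n, p = 10, 0.5
mu = n * p                        # expected number of successes
sigma = sqrt(n * p * (1 - p))     # standard deviation

print(binompdf(n, p, 3), binomcdf(n, p, 3), mu, sigma)
```

A probability such as "at least 4 successes" comes from the complement: 1 - binomcdf(n, p, 3).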
d) A Geometric Probability Distribution describes WHEN the first
success occurs in a chain of Bernoulli trials
i) The expected value of a geometric distribution is the expected
number of trials until the first success
(1) Equation: E(X) = 1/p
ii) Standard Deviation
(1) Equation: σ = √(q / p^2)
iii) Probability Equation (where X is the number of trials until the first
success occurs)
(1) Equation: P(X = x) = q^(x-1) * p
iv) Calculations
(1) Geometpdf(p,x) gives the probability that the first success
occurs on exactly trial x. You specify the probability of
success (p) and the number of the first success trial (x)
(2) Geometcdf(p,x) gives the cumulative probability that the first
success occurs on or before trial x. You specify the
probability of success (p) and the value x
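The geometric formulas can be sketched the same way. The function names mirror the calculator commands and are defined here for illustration; p = 0.2 is a made-up input.

```python
from math import sqrt

def geometpdf(p, x):
    """P(X = x): the first success occurs on exactly trial x."""
    return (1 - p) ** (x - 1) * p

def geometcdf(p, x):
    """P(X <= x): the first success occurs on or before trial x."""
    return 1 - (1 - p) ** x   # complement of "x failures in a row"

p = 0.2
expected_trials = 1 / p          # E(X) = 1/p
sigma = sqrt(1 - p) / p          # same as sqrt(q / p^2)

print(geometpdf(p, 3), geometcdf(p, 3), expected_trials, sigma)
```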
Chapter 2 : Describing/Comparing Distributions
1) Graphical Displays
a) Bar Graphs show categorical variables; Dotplots show each value of
a small quantitative data set
b) Histograms show quantitative (numerical) values
i) Useful for large data sets, but they do not show individual scores
c) Stem Plots
i) Shows individual scores
2) Summarizing/Comparing Distributions
a) Summarizing:
i) Center
(1) Separates the values roughly in half
ii) Shape
(1) Clusters and Gaps
(a) Clusters show natural subgroups into which values fall
(b) Gaps show holes where no values fall
(2) Modes
(a) Peaks are known as modes. Unimodal means one peak, etc.
(3) Certain common patterns of shape:
(a) Symmetric
(b) Skewed Left
(i) Mean < Median
(c) Skewed Right
(i) Mean > Median
iii) Spread
(1) Scope of the values from smallest to largest
(2) Equation: Interquartile Range (IQR) is Q3 – Q1
iv) Outliers
(1) Outliers are extreme values that can be the result of natural
chance variation or errors in measurement
(2) Equation: Outliers are values below Q1 – 1.5 * IQR or above Q3 + 1.5 * IQR
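The IQR and the usual 1.5 * IQR outlier rule can be sketched as follows. The data are made up, and the helper `quartiles` assumes one common convention (Q1 and Q3 are the medians of the lower and upper halves); other software may use slightly different quartile rules.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the two halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(s), median(upper)

data = [2, 4, 5, 7, 8, 9, 10, 12, 30]   # made-up values
q1, med, q3 = quartiles(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, med, q3, iqr, low_fence, high_fence, outliers)
```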
v) Standard Deviation/Variance
(1) Standard Deviation: the typical distance of values from the mean
(a) Equation: σ = √( Σ (x - μ)^2 / n )
(2) Variance: the average of the squared deviations from the mean
(a) Equation: σ^2 = Σ (x - μ)^2 / n
vi) Transformations
(1) Adding a constant to every value shifts the mean and median by
that constant but does not change the SD. Multiplying every
value by a factor multiplies the mean and median by that
factor and the SD by its absolute value
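The effect of adding and multiplying can be checked numerically. This sketch uses made-up data and the population standard deviation (`pstdev`, which divides by n); the shift of 10 and the factor of 3 are arbitrary choices.

```python
from statistics import mean, median, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]       # made-up values
shifted = [x + 10 for x in data]      # add 10 to every value
scaled = [x * 3 for x in data]        # multiply every value by 3

for label, values in (("original", data), ("shifted", shifted), ("scaled", scaled)):
    # mean, median, and population SD of each version of the data
    print(label, mean(values), median(values), pstdev(values))
```

Adding 10 moves the mean and median up by 10 and leaves the SD alone; multiplying by 3 triples all three.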
3) Normal Distributions
a) Normal Distributions are bell-shaped and symmetric and have an
infinite base.
b) Empirical Rule (68-95-99.7 rule) – In a normal distribution, about
68% of all values fall within 1 standard deviation of the mean,
about 95% within 2 standard deviations, and about 99.7% within 3.
c) Z-Scores tell us how many standard deviations a certain value is
away from the mean.
i) Equation: z = (x_i – μ) / σ
ii) Evaluating Z-Scores:
(1) Finding a z-score from a percentile: invNorm(area to the left,
as a decimal)
(2) Finding the proportion of values between two z-scores:
normcdf(lower, upper)
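Python's standard library has a normal distribution object that can stand in for the calculator commands. In this sketch the function names `z_score`, `normcdf`, and `inv_norm` are illustrative wrappers defined here, and the score of 85 with mean 70 and SD 10 is made up.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, SD 1

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def normcdf(lower, upper):
    """Proportion of a normal distribution between two z-scores."""
    return std_normal.cdf(upper) - std_normal.cdf(lower)

def inv_norm(area_left):
    """z-score with the given area (as a decimal) to its left."""
    return std_normal.inv_cdf(area_left)

print(z_score(85, 70, 10))
print(normcdf(-1, 1), normcdf(-2, 2), normcdf(-3, 3))   # the empirical rule
print(inv_norm(0.975))
```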
4) Regressions
a) A graphical display called a scatterplot gives an immediate visual
impression of a possible relationship between two variables, while a
numerical measurement, called a correlation coefficient, is often
used as a quantitative value of the strength of a linear relationship
i) r is the correlation coefficient. It ranges from -1 to 1. Values of 1
and -1 are the strongest linear associations, and 0 means no linear
association. Positive r values indicate a positive relationship
(positive slope), and negative r values indicate a negative
relationship. Correlation does not imply causation! It only
measures the strength of a linear relationship
ii) r^2 is called the coefficient of determination. It is computed by
squaring the r-value. When you explain your r^2 value you make
the statement: [r^2 as a percent] of the variability in [y-variable]
can be explained by the linear association with [x-variable].
b) The Line of best fit (least-squares regression line, LSRL) is the line
that gives the best predictions for values given a set of data. It
minimizes the sum of the squared residuals
i) Residual Value: Observed – Predicted
ii) Equation: b1 = r * (s_y / s_x) – Slope of the LSRL
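The correlation, slope, and residuals can be computed by hand on a small data set. The data below are made up. The intercept formula b0 = ȳ - b1 * x̄ is not stated above; it is the standard fact that the LSRL passes through the point (x̄, ȳ).

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]     # made-up explanatory values
y = [2, 4, 5, 4, 5]     # made-up response values
n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)     # sample standard deviations

# Correlation: sum of products of z-scores, divided by n - 1
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y)) / (n - 1)

b1 = r * s_y / s_x          # slope of the LSRL
b0 = y_bar - b1 * x_bar     # intercept (line passes through the means)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # observed - predicted

print(r, r ** 2, b1, b0)
print(residuals, sum(residuals))
```

The residuals add up to zero apart from floating-point rounding.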
c) Residual Plots
i) The residual plot graphs the residuals of all the values. For the
LSRL, the sum of the residuals is always 0. A large r^2 value,
small residuals, and no clear pattern in the residual plot
indicate that the regression line is an appropriate model
ii) Influential points are those that sharply change the regression
line.
iii) Transformations are the altering of the y, x, or both values to
achieve a non-patterned residual plot. Usually if the residual plot
has a pattern, a linear model is inappropriate and a nonlinear
model is more appropriate (thus the transformation).
(1) Exponential: log of Y
(2) Power: Log of Y and Log of X
(3) Quadratic: Square root of Y
(4) Reciprocal: 1/y
(5) Logarithmic: Log of X
Chapter 3 : Planning a Study
1) Methods of Data Collection
a) The Population is the entire group of individuals we want
information about, and the Sample is a smaller group selected from
the population
b) A Census is a complete enumeration of an entire population. It is an
attempt to contact every member of a population
c) A Sample Survey aims to obtain information about a whole
population by studying a part of it, or a sample.
i) A sample is biased if in some critical way it does not represent the
population. The main technique to avoid bias is to incorporate
randomness
d) Experiment vs. Observational Study
i) An Experiment is a controlled study. In an experiment, there is an
action taken on one or more of the groups and the response is
observed. There are often treatment groups and control groups.
Good experimental designs include:
(1) Controls – A group that receives similar conditions as the other
groups without the treatment. This is used as a baseline
comparison for the response measurement
(2) Blocking – Process in which the subjects are divided into
representative groups (such as gender) to bring certain
differences directly into the picture
(3) Randomization – Unknown and uncontrollable differences are
handled by randomizing who receives which treatments
ii) An Observational Study is a study in which the researcher has no
choice in regard to who goes into the treatment and control
groups. No action is taken; the researcher merely observes what
has occurred. Observational studies on the impact of some
variable on another variable often fail because explanatory
variables are confounded with other variables
(1) Confounding Variables are variables, not accounted for in the
original design, whose effects on the response cannot be
separated from those of the explanatory variable.
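Blocking and randomization in an experiment can be sketched in a few lines. Everything here is hypothetical: the 20 subject labels, the two blocks, and the seed. Within each block, subjects are shuffled and half go to the treatment group and half to the control group.

```python
import random

rng = random.Random(1)                          # seeded for a reproducible example
subjects = [f"S{i}" for i in range(1, 21)]      # hypothetical subject labels
blocks = {"block_A": subjects[:10],             # hypothetical blocks, e.g. by gender
          "block_B": subjects[10:]}

assignment = {}
for members in blocks.values():
    shuffled = members[:]          # copy so the original block list is untouched
    rng.shuffle(shuffled)          # randomization within the block
    half = len(shuffled) // 2
    for s in shuffled[:half]:
        assignment[s] = "treatment"
    for s in shuffled[half:]:
        assignment[s] = "control"  # baseline group for comparison

print(assignment)
```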
2) Planning and Conducting Surveys
a) Simple Random Sample
i) A Simple Random Sample (SRS) is one in which every possible
sample of the desired size has an equal chance of being selected.
(1) A typical way to take an SRS is to assign everyone a number
and use a random number generator to choose the sample
b) Bias / Sampling Variation
i) All surveys give a statistic as an estimate for a population
parameter. Different samples give different statistics, all of which
are estimates for the same population parameter, and so error,
called sampling error, is present. This error tends to be
smaller when the sample size is larger.
ii) Bias is the tendency to favor the selection of certain members of a
population. Common types of bias:
(1) Response Bias – People don’t want to be perceived as having
unpopular or unsavory views and so they respond untruthfully
when face to face with an interviewer
(2) Wording Bias – Non-neutral or poorly worded questions may
lead to answers that are unrepresentative of the population
(3) Selection Bias – Sampling from a group that does not represent
the population of interest. For instance, asking for opinions
on welfare reform only in an area that is largely conservative.
(4) Undercoverage Bias – This occurs when there is inadequate
representation. Convenience Samples are based on choosing
individuals who are easy to reach. These tend to produce
under-representative data
(5) Voluntary Response Bias – Samples based on individuals
who offer to participate typically give too much emphasis to
people with strong opinions
(6) Nonresponse Bias – When certain people refuse to respond
or are unreachable or too difficult to contact
c) Other Sampling Methods.
i) Systematic Sampling – Involves listing the population in some
order, choosing a random point to start, and then picking people
from the list at a fixed interval (e.g., every 10th person). This
gives a reasonable sample as long as the original order of the list
is unrelated to the variables under consideration
ii) Stratified Sampling – Involves dividing the population into
homogeneous (similar) groups called strata, and choosing a
random sample of persons from every stratum.
iii) Cluster Sampling – Involves dividing the population into
heterogeneous (mixed) groups called clusters, randomly selecting
some of the clusters, and including every person in the selected
clusters. Each cluster should resemble the entire population
iv) Multistage Sampling – Taking multiple sampling steps.
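The sampling methods in this section can be compared side by side on a toy population. In this sketch the 100 ID numbers, the two strata, the ten clusters, and all sample sizes are hypothetical choices made for illustration.

```python
import random

rng = random.Random(42)                 # seeded for a reproducible example
population = list(range(1, 101))        # hypothetical ID numbers 1..100

# Simple random sample: every group of 10 IDs is equally likely
srs = rng.sample(population, 10)

# Systematic sample: random starting point, then every 10th person
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: an SRS of 5 from EVERY stratum
strata = {"juniors": population[:60], "seniors": population[60:]}
stratified = [person for group in strata.values() for person in rng.sample(group, 5)]

# Cluster sample: randomly choose 2 whole clusters and take everyone in them
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [person for c in rng.sample(clusters, 2) for person in c]

print(srs, systematic, stratified, cluster_sample, sep="\n")
```

The contrast to notice: stratified sampling draws some people from every group, while cluster sampling draws every person from some groups.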
3) Confounding, Control Groups, Placebo Effects, and Blinding
a) Experiments involve explanatory variables, called factors, which are
believed to have an effect on response variables.
b) When there is uncertainty with regard to which variable is causing an
effect, the variables are confounded
c) A lurking variable is a variable that drives two other variables,
creating the mistaken impression that the two other variables are
related by cause and effect. The two variables are linked only
because they share a common response to the lurking variable
d) The placebo effect is the fact that many people respond to any kind
of perceived treatment, even though it may be nothing
e) Blinding occurs when the subjects or the response evaluators don’t
know which subjects are receiving which treatments. Double blind is
when both the subjects and the evaluators are unaware.
Chapter 4 : Statistical Inference
1) A-G (for review)
a) List statistics and parameters (p̂, n, π , N, OR x̄, s, n, μ, σ ) and
degrees of freedom
b) State which test you are using. State α
c) State hypotheses in terms of the population (μ or π)
i) H0 : null hypothesis (π = π0 OR μ = μ0). No change or difference
from the specification (difference is due to natural
sample-to-sample variation)
ii) Ha : alternative hypothesis, one sided or two sided
iii) Assume H0 is true if doing a significance test (this doesn’t make
sense in a confidence interval because you are trying to find π)
d) Verify the conditions that you can, and make assumptions for those
that can’t be verified. State “we will proceed with caution anyway” if
you need to
e) If doing a significance test, make a diagram with shading. If doing a
Confidence Interval, find the confidence interval
f) Calculate p-value
g) Three- Part Conclusion
i) Make a statement about the P-Value: “Assuming the true
population [proportion or mean] [H0 in context], the probability we
could get a sample at least as extreme as [state results from
sample] due to natural sample to sample variability is [insert
P-Value].” OR make a confidence interval conclusion
ii) Make a statement about H0: Compare P-Value to α and state
whether H0 is rejected or plausible (failed to be rejected). If
P-Value < α, reject the null; otherwise, fail to reject
iii) Make a statement about Ha :
(1) If H0 is rejected state in context:
(a) “substantial evidence for Ha”: P ≤ 1%
(b) “moderate evidence for Ha”: P = 1% - 5%
(c) “weak evidence for Ha”: P = 5% - 10%
(2) If H0 is plausible (failed to be rejected), state in context that
there is not sufficient evidence for Ha
2) Proportion Tests
a) 1-Prop and 2-Prop z-tests deal with proportions of populations. All
proportions are between 0 and 1 and describe a proportion of a
population with a certain characteristic.
b) Confidence Interval
i) Conditions:
(1) Randomization – Is the sample random?
(2) Normality – np̂ and n(1- p̂) ≥10
(3) Independence (population large enough) – N > 10n
ii) Equation: CI = p̂ ± Margin of error
(1) Margin of Error: z* SE
iii) Equation: Standard Error
(1) 1-Prop: √(p̂ * (1 - p̂) / n)
(2) 2-Prop: √[(p̂_1(1 - p̂_1)/n_1) + (p̂_2(1 - p̂_2)/n_2)]
c) P-Value (Z-Scores)
i) Conditions:
(1) Randomization – Is the sample random?
(2) Normality – np and nq ≥10
(3) Independence (population large enough) – N > 10n
ii) Equation: Standard Deviation
(1) 1-Prop: σ = √(pq/n)
(2) 2-Prop: σ = √[p̂_c(1 - p̂_c)(1/n_1 + 1/n_2)]
(a) Equation: p̂_c = (x_1 + x_2)/(n_1 + n_2)
iii) Note that in P-Values you use p and q to solve the equations,
whereas in Confidence intervals you use p̂!
d) Calculations
(1) Z* = invNorm(%)
(2) Confidence Interval: 1-PropZInt, 2-PropZInt
(3) P-Value: 1-PropZTest, 2-PropZTest
(4) Evaluating P value: normcdf(lower,upper)
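A one-proportion interval and test can be worked through without the calculator. The sample below (60 successes in 100 trials), the hypothesized value p0 = 0.5, the 95% confidence level, and the two-sided alternative are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

z_dist = NormalDist()   # standard normal

n, x = 100, 60          # made-up sample: 60 successes out of 100
p0 = 0.5                # hypothesized proportion under H0
p_hat = x / n

# 95% confidence interval: the standard error uses p_hat
z_star = z_dist.inv_cdf(0.975)
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_star * se, p_hat + z_star * se)

# Two-sided significance test: the standard deviation uses p0
sd0 = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd0
p_value = 2 * (1 - z_dist.cdf(abs(z)))

print(ci, z, p_value)
```

Note the difference the guide points out: the interval uses p̂ in its standard error, while the test uses the hypothesized p.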
3) Sample Tests
a) 1-Sample and 2-Sample t-tests deal with the averages of populations.
You will need to find the means and the standard deviations
b) Confidence Interval / P-Value
i) Conditions:
(1) Randomization – Is the sample random?
(2) Normality – Graph with histogram/box and whisker and
describe shape (unimodal + symmetric)
(3) Independence (population large enough) – N > 10n
ii) Equation: CI = x̄ ± t* × SE(x̄)
iii) Equation: Standard error
(1) 1-Sample: SE(x̄) = s/√(n)
(2) 2-Sample: SE = √[(s_1^2/n_1) + (s_2^2/n_2)]
iv) Equation: Degrees of Freedom (1-Sample): n-1
v) Calculations:
(1) T* = invT(%,Df)
(2) Confidence Interval: TInterval, 2-SampTInt
(3) P-Value: T-Test, 2-SampTTest
(4) Evaluating P-Value: tcdf(lower, upper, degrees of freedom)
c) Matched Pair
i) These occur when two treatments or measurements are applied to
the same subject in a sample. Take the difference within each pair
and run a 1-sample t-test on the differences.
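A matched-pairs analysis reduces to a one-sample t procedure on the differences, which also shows the standard error and degrees of freedom formulas in action. The before/after scores are made up. Python's standard library has no t distribution, so the critical value t* = 2.262 (95% confidence, df = 9) is taken from a t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

before = [70, 65, 80, 75, 60, 72, 68, 77, 66, 71]   # made-up scores
after = [70, 68, 79, 77, 61, 76, 68, 79, 69, 72]    # same subjects, measured again
diffs = [a - b for a, b in zip(after, before)]       # one difference per subject

n = len(diffs)
d_bar = mean(diffs)         # sample mean of the differences
s_d = stdev(diffs)          # sample standard deviation (divides by n - 1)
se = s_d / sqrt(n)          # standard error of the mean difference
df = n - 1

t_stat = (d_bar - 0) / se   # test statistic for H0: mean difference = 0
t_star = 2.262              # from a t-table: 95% confidence, df = 9
ci = (d_bar - t_star * se, d_bar + t_star * se)

print(d_bar, s_d, se, df, t_stat, ci)
```

The P-Value would then come from tcdf on the calculator using this t statistic and df.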
4) Chi-Squared Tests
a) Chi-Squared tests were derived to perform significance testing for
categorical variables. They compare the observed counts in a sample
with the counts expected under the null hypothesis
i) Equation: χ^2 = Σ (observed – expected)^2 / expected
ii) Make sure to write out the individual terms of the sum above, in a
form like x1 + x2 + … + xn
b) Conditions
i) Randomization: is the sample chosen randomly
ii) Expected Cell Frequency: The expected cell counts of subjects in
each cell are at least 5
iii) Independence: N>10n
c) Goodness of Fit
i) Goodness of Fit is used to determine whether our observed data
fits the theoretical distribution for that data
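ii) Equation: Degrees of Freedom = n - 1, where n is the number of
categories
iii) Equation: Expected = total count * hypothesized proportion for the
category (total / number of categories when all categories are
assumed equally likely)
iv) Hypotheses
(1) H0: There is no difference between the observed and the
theoretical distribution
(2) Ha: There is a difference between the observed and the
theoretical distribution
v) Calculations:
(1) χ^2 GOF-Test = goodness of fit test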
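A goodness of fit statistic is easy to build term by term. This sketch assumes a die rolled 60 times with made-up counts and a null hypothesis that all six faces are equally likely. The critical value 11.07 (α = 0.05, df = 5) is taken from a chi-square table, since the standard library cannot compute chi-square probabilities directly.

```python
observed = [8, 12, 9, 11, 6, 14]       # made-up counts for faces 1..6
total = sum(observed)
proportions = [1 / 6] * 6              # H0: every face is equally likely
expected = [total * p for p in proportions]

# Condition check: every expected count should be at least 5
condition_ok = all(e >= 5 for e in expected)

# One term per category: (observed - expected)^2 / expected
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(terms)
df = len(observed) - 1

critical_value = 11.07                 # chi-square table: alpha = 0.05, df = 5
reject_h0 = chi_sq > critical_value

print(terms, chi_sq, df, reject_h0)
```

Here the statistic is well below the critical value, so H0 is not rejected for these made-up counts.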
d) Homogeneity Test
i) Homogeneity Test is used to compare the distribution of a
categorical variable across multiple populations/samples. Under
H0 we expect the same proportions in each category for every
population
ii) Equation: Degrees of Freedom = (rows – 1) (columns – 1)
iii) Hypotheses
(1) H0: The proportions across each sample are equal
(2) Ha: The proportions across each sample are different
iv) Calculations
(1) χ^2-Test (input values into matrix)
(2) χ^2cdf(lower, upper, df)
e) Independence Test
i) The independence test is used to gain evidence of association
between two categorical variables. They will usually ask “are the
two associated?”
ii) Equation: Degrees of Freedom = (rows – 1) (columns – 1)
iii) Hypotheses
(1) H0: The two events are INDEPENDENT/NOT ASSOCIATED
(2) Ha: The two events are DEPENDENT/ASSOCIATED
iv) Calculations
(1) χ^2-Test (input values into matrix)
(2) χ^2cdf(lower, upper, df)
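For homogeneity and independence tests the data sit in a two-way table. The sketch below uses a made-up 2 x 3 table. The expected-count rule (row total * column total / grand total) is not stated above; it is the standard formula for two-way tables. The P-Value line uses a shortcut that holds only when df = 2, where the chi-square upper-tail probability equals e^(-χ^2/2).

```python
from math import exp

observed = [[20, 30, 50],      # made-up counts: rows are samples/groups,
            [30, 30, 40]]      # columns are categories

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows - 1)(columns - 1)
p_value = exp(-chi_sq / 2)     # shortcut valid ONLY because df == 2 here

print(expected, chi_sq, df, p_value)
```

For any other df, use χ^2cdf on the calculator instead of the shortcut.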
5) Regression Tests
a) Will usually ask you if there is evidence that a relationship between
two things is linear
b) Conditions:
i) Linearity Assumption – Check the scatter plot to see if the shape is
linear
ii) Independence Assumption – Check the residuals plot. The
residuals should appear randomly scattered
iii) Equal Variance Condition – Check the residuals plot again. The
vertical spread of residuals should be roughly the same
everywhere
iv) Normal Population Assumption – Check the histogram of the
residuals. The distribution of residuals should be unimodal and
symmetric
c) Hypotheses
(1) H0: β = 0 (no linear association)
(2) Ha: β > 0 (positive linear association), β < 0 (negative linear
association), β ≠ 0
d) Confidence Interval
i) Equation: SE_b = s / √( Σ (x - x̄)^2 )
ii) Equation: s = √[ Σ (y - ŷ)^2 / (n - 2) ]
iii) Calculation
(1) LinRegTInt
e) P-Value
i) Equation: t = b / SE_b
ii) P value: P(a slope at least as extreme as b, in the direction of
Ha | β = 0)
iii) Calculation
(1) LinRegTTest
f) Degrees of Freedom
i) Equation: n-2
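Inference for the slope can be followed step by step on a small made-up data set. In this sketch the slope is computed as Σ(x - x̄)(y - ȳ) / Σ(x - x̄)^2, which is algebraically the same as r * (s_y / s_x). The critical value t* = 3.182 (95% confidence, df = 3) comes from a t-table.

```python
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]     # made-up explanatory values
y = [2, 4, 5, 4, 5]     # made-up response values
n = len(x)
x_bar, y_bar = mean(x), mean(y)

sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx   # slope
b0 = y_bar - b1 * x_bar                                              # intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # sum of squared residuals
df = n - 2
s = sqrt(sse / df)         # standard deviation of the residuals
se_b = s / sqrt(sxx)       # standard error of the slope
t_stat = b1 / se_b         # test statistic for H0: beta = 0

t_star = 3.182             # from a t-table: 95% confidence, df = 3
ci = (b1 - t_star * se_b, b1 + t_star * se_b)

print(b1, s, se_b, t_stat, df, ci)
```

The interval for these data contains 0, which matches a t statistic smaller than t*.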
6) Errors
a) Type I
i) This occurs when you rejected the null when you shouldn’t have.
Thus the null is true.
b) Type II
i) This occurs when you failed to reject the null when you should
have. Thus the null is false.
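The two error types can be estimated by simulation. This is only a sketch under stated assumptions: it uses a two-sided z-test for a mean with known σ = 1 (a setup not covered above, chosen because it needs only the normal distribution), samples of size 25, α = 0.05, and a hypothetical true mean of 0.5 for the case where H0 is false.

```python
import random
from math import sqrt
from statistics import NormalDist

def rejection_rate(true_mu, n=25, sigma=1.0, mu0=0.0, alpha=0.05, reps=2000, seed=0):
    """Fraction of simulated samples in which a two-sided z-test rejects H0: mu = mu0."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
        if abs(z) > z_star:
            rejections += 1
    return rejections / reps

# H0 is true: every rejection is a Type I error, so the rate should be near alpha.
type_1_rate = rejection_rate(true_mu=0.0)

# H0 is false: every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_mu=0.5)

print(type_1_rate, type_2_rate)
```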