
AP Statistics Cheat Sheet

Data Collection
Methods of data collection
Census
entire population
requires more time and cost

Sample Survey
part of a population
requires less time and cost

Observational Study
part of a population
subjects observed
stratification (strata)
indicates correlation
confounding (lurking variable) may occur

Experiment
part of a population
subjects controlled
blocking (blocks)
indicates causation
control group and treatment group
single-blind and double-blind
Methods of data planning
(for surveys)
1. simple random sampling (equal prob. for all)
2. systematic sampling (every nth person)
3. stratified sampling (homogeneous strata)
4. proportional sampling (proportional to pop.)
5. cluster sampling (heterogeneous clusters)
6. multistage sampling (methods combined)
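As a quick illustration, the first three plans can be sketched in a few lines of Python; the population of 100 IDs, the sample size, and the stratum split below are made-up examples.

```python
# Sketch of three survey-sampling plans using only the standard library.
import random

population = list(range(1, 101))  # hypothetical sampling frame of 100 IDs
n = 10

# simple random sampling: every group of n units has an equal chance
srs = random.sample(population, n)

# systematic sampling: random start, then every kth unit
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# stratified sampling: split into homogeneous strata, sample within each
strata = {"stratum_A": population[:50], "stratum_B": population[50:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, n // 2)]

print(srs, systematic, stratified, sep="\n")
```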
Bias in data planning
(for surveys)
1. household bias (family of two vs. five)
2. nonresponse bias (refuse to respond)
3. response bias (respond untruthfully)
4. voluntary response bias (strong opinions)
5. quota sampling bias (homogeneous group)
6. selection bias (specific subjects are chosen)
7. size bias (big vs. small coins)
8. undercoverage bias (part of pop. ignored)
9. wording bias (poorly worded questions)
Data Collection
More on experiments
x is the explanatory variable (factor)
y is the response variable
experimental units (nonhuman) are sometimes called subjects (when human)
control group does not receive treatment
treatment group receives treatment
placebo effect is a response to "fake"
treatment
single blinding (only subjects are blind)
double blinding (both subjects and
evaluators are blind)
completely randomized design
different samples get different
treatments
randomized paired comparison design
one sample gets different treatments
randomized block design
population undergoes blocking
each block receives randomization
random samples get different
treatments
control, blocking, randomization, replicability
and generalizability
Data Analysis I
Graphs for data analysis
Bar Chart
Note: there are gaps between bars.
Dotplot
Histogram
Note: there may be no gaps between bars.
Sampling error vs. bias
Sampling Error: natural variability when taking samples from a population; cannot be avoided
Bias: tendency to favor the selection of certain members of a population; can be avoided
Data Analysis I
Graphs for data analysis (cont.)

Stemplot
Note: the leaves may not be skipped and the key must be clearly indicated.

Cumulative Relative Frequency (CRF) Plot
Note: the median value can be found by drawing a horizontal line across 0.5 on the y-axis.

Boxplot
Note: min, Q1, median, Q3, and max are indicated. Outliers are indicated by separate dots.

Special features of graphs
Clusters
Gaps
Outliers

Choosing the right graphs
Qualitative (Categorical) Variable: dotplot, bar chart
Quantitative (Numerical) Variable: dotplot, histogram, stemplot, boxplot, CRF plot

Data Analysis I
Describing distributions (SOCS)

Shape
symmetric
skewed to the right (median < mean)
skewed to the left (median > mean)
bell-shaped
uniform

Center
mean
median (divides area under graph into two equal parts)
mode (uni- vs. bimodal)

Outliers
by inspection
by formula: less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR

Spread
range
interquartile range
variance
standard deviation

Center (Measures of Central Tendency)
mean: add all values and divide by n
median: arrange values in ascending order and do one of the following
  if n is odd, select the middle value
  if n is even, take the average of the two middle values
mode: most frequently occurring value

Spread (Measures of Dispersion)
range: max - min
interquartile range: Q3 - Q1
variance: the average squared deviation from the mean
standard deviation: the square root of the variance
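As a reminder, the mean, sample variance, and sample standard deviation are:

```latex
\bar{x} = \frac{\sum x_i}{n}, \qquad
s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, \qquad
s = \sqrt{s^2}
```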
Data Analysis I
Measures of Position
simple ranking: indicates rank from an
ordered list
percentile ranking: indicates the percentage of values below the value under consideration
z-score: indicates specifically by how many
standard deviations the value under
consideration varies from the mean
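As a reminder, the z-score is computed as:

```latex
z = \frac{x - \mu}{\sigma} \quad (\text{population})
\qquad\text{or}\qquad
z = \frac{x - \bar{x}}{s} \quad (\text{sample})
```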
Empirical Rule (68-95-99.7 Rule)
(applies to bell-shaped curves only)
About 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Note: range is approximately equal to 6 standard deviations in a bell-shaped distribution.

Transforming distributions
Addition and subtraction affects
  mean
  median
  mode
Multiplication and division affects
  mean
  median
  mode
  range
  IQR
  variance
  standard deviation

Resistance to outliers
Not resistant
  mean
  range
  variance
  standard deviation
Resistant
  median
  mode
  IQR

Data Analysis I
Comparing distributions
(Use SOCS when making comparisons)
Parallel Boxplots
Double Bar Charts
Back-to-back Stemplots
Overlapping CRF Plots
Data Analysis II
Exploring bivariate data

Scatterplot
x is the explanatory (independent) variable.
y is the response (dependent) variable.
A line of best fit describes the overall pattern.
The correlation coefficient (r) gives the strength of association between the two variables. -1 ≤ r ≤ 1

Data Analysis II
Least squares regression line (LSRL)

There could be many lines of best fit, but the one which minimizes the sum of the squares of residuals is called the least squares regression line.

The LSRL passes through the point (x̄, ȳ) and has a slope b1, which has the same sign as r.

Population Regression Line
Sample Regression Line

The coefficient of determination (r²) gives the percentage of variation in y that is explained by the variation in x.

Data Analysis II
Residual plots

residual = observed (actual) - predicted
ê = y - ŷ
The sum of residuals is always zero.

The residual plot is used as evidence that a linear regression is a good fit when the residual plot shows no overall pattern.

The standard deviation of the residuals can be calculated as shown in the formulas below. It gives a measure of how the data points are spread around the regression line.

If the residual plot shows a pattern, it means that a nonlinear model is more appropriate.
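The population and sample regression lines, the LSRL slope and intercept, and the standard deviation of the residuals referred to above are the standard ones:

```latex
\text{Population: } \mu_y = \beta_0 + \beta_1 x
\qquad
\text{Sample (LSRL): } \hat{y} = b_0 + b_1 x
```

```latex
b_1 = r\,\frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1\bar{x}, \qquad
s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}
```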
Data Analysis II
Transformation to achieve linearity

Instead of using nonlinear regression models, we can transform existing data so that the scatterplot shows a linear pattern (which means that linear regression could be used).

The main transformations are log, square root, and reciprocal transformations.

(scatterplots before and after a log transformation illustrate the effect)

Data Analysis III
Exploring categorical variables

Marginal Frequency and Distribution
Conditional Frequency and Distribution
(example conditional distributions: for people who saw baby animals, for people who saw adult animals, for people who saw tasty foods)

Probability Distribution

Relative frequency tells you the percentage of an event that happened relative to the whole. Relative frequencies vary from experiment to experiment.

Law of large numbers
When an experiment is performed a large number of times, the relative frequency converges to a certain value. We call this value the probability of that event. In other words, probability is long-term relative frequency.

Calculating probabilities
General formulae

Mutually exclusive events
(implies that there is no intersection)

Independent events
(implies that events do not influence each other)

Conditional probability
(probability of B given A)
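The general formulae referred to above are the standard probability rules:

```latex
P(A \cup B) = P(A) + P(B) - P(A \cap B) \\
\text{Mutually exclusive: } P(A \cap B) = 0 \\
\text{Independent: } P(A \cap B) = P(A)\,P(B) \\
\text{Conditional: } P(B \mid A) = \frac{P(A \cap B)}{P(A)}
```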
Probability Distribution
Calculating multistage probabilities

Product Principle
(multiply probabilities that occur together or in series)

Addition Principle
(add probabilities that cannot occur together or are from entirely different scenarios)

Types of probability distributions
Discrete
  binomial
  geometric
Continuous
  normal
Binomial distribution

Binomial distributions are used to model problems that have two possible outcomes (successes or failures). Examples of such scenarios include:
defective vs. not defective
5 on a die vs. not 5 on a die
heads vs. tails
score a goal vs. not score a goal

Remember that although there are only two possible outcomes, you can still have different combinations of these two outcomes. For instance, when you toss a coin, the two possible outcomes are heads (H) and tails (T). But you can have various combinations of them: HHHH, HHTT, TTHT, etc. Each of these combinations has a probability associated with it: a binomial probability.

The binomial probability of exactly k successes in n trials is
P(X = k) = (n choose k) p^k q^(n-k)
where
p is the probability of success
q is the probability of failure (1-p)
n is the number of trials
k is the number of successes
(n-k) is the number of failures

Alternatively, you can use the binompdf(n, p, k) function to calculate specific binomial probabilities. If you have to add up binomial probabilities (starting from 0), use binomcdf(n, p, k), where k is the number of successes you want to add up to.

Binomial distribution keywords
binompdf: exactly, ____ out of ____
binomcdf: at most, at least, more than, less than
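For reference, the calculator functions map onto scipy.stats.binom as sketched below; n = 10 and p = 0.3 are made-up example values.

```python
# Sketch of binompdf / binomcdf using scipy.stats.binom.
from scipy.stats import binom

n, p = 10, 0.3
print(binom.pmf(4, n, p))      # binompdf(n, p, 4): P(X = 4), "exactly 4"
print(binom.cdf(4, n, p))      # binomcdf(n, p, 4): P(X <= 4), "at most 4"
print(1 - binom.cdf(4, n, p))  # P(X > 4), "more than 4"
```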
Geometric distribution

When there are two possible outcomes (binomial) and you want to find the probability that the first success occurs on the nth trial, model the situation using a geometric distribution.

For example, what is the probability that the first honest man Diogenes encounters will be the third man he meets? The trials are binomial (each person met is honest or not honest), and we are interested in meeting an honest man on the third trial. This implies that Diogenes would not meet an honest man in the first and second trials. So, the probability would be (failure) x (failure) x (success). In general,
P(X = k) = q^(k-1) p
where
p is the probability of success
q is the probability of failure (1-p)
k is the trial number when success occurs

You can use the geometpdf(p, k) or the geometcdf(p, k) functions accordingly.

Geometric distribution keywords
(happens first, first success is, first occurrence is)
geometpdf: first, second, third
geometcdf: no later than
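A matching sketch for the geometric functions, using scipy.stats.geom, which (like the definition above) counts the trial on which the first success occurs; p = 0.5 is a made-up example value.

```python
# Sketch of geometpdf / geometcdf using scipy.stats.geom.
from scipy.stats import geom

p = 0.5
print(geom.pmf(3, p))  # geometpdf(p, 3): first success on the 3rd trial
print(geom.cdf(3, p))  # geometcdf(p, 3): first success no later than the 3rd trial
```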
Probability Distribution
Discrete distribution

When there are two or more possible outcomes, you can use a discrete distribution to model the problem.

Formula for discrete random variable

For example, a highway engineer knows that his crew can lay 5 miles of highway on a clear day, 2 miles on a rainy day, and only 1 mile on a snowy day. You can construct a discrete probability distribution as follows.

Note that the binomial distribution is a special case of the discrete distribution when there are only two possible outcomes.

Formula for binomial random variable

For example, in a lottery, 10,000 tickets are sold at $1 each with a prize of $7,500 for one winner. You can construct a discrete/binomial probability distribution as follows.

In both cases, the discrete random variable (usually denoted as X) is associated with a numerical value.
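The formulas the two headings above refer to are the standard mean and standard deviation results:

```latex
\mu_X = \sum x_i\,p_i, \qquad
\sigma_X = \sqrt{\sum (x_i - \mu_X)^2\,p_i}
\qquad\qquad
\text{Binomial: } \mu_X = np, \quad \sigma_X = \sqrt{npq}
```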
Combining random variables
(random variables must be independent)

Means add or subtract directly: the mean of X ± Y is μX ± μY. The variance of a sum and the variance of a difference are both found by adding the individual variances: the variance of X ± Y is σX² + σY². Variances may be combined only when the two random variables are independent.

There is no formula for combining standard deviations of two random variables. So, you must calculate the variances of each random variable, use the variance combination formula above, and then take the square root of the combined variance.

Transforming random variables
Adding a constant a to a random variable shifts the mean by a but leaves the spread unchanged; multiplying by a constant b multiplies the mean by b and the standard deviation by |b|.

Designing simulations

In performing a simulation, you must do the following.
1. Set up a correspondence between outcomes and random numbers (e.g., 0~6 is success and 7~9 is failure).
2. Give a procedure for choosing random numbers.
3. Give a stopping rule.
4. Note what is to be counted.
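A minimal sketch of this recipe in Python, assuming a made-up success probability of 0.7 (digits 0~6) and a fixed stopping rule of 10 digits per trial.

```python
# Simulation sketch: digits 0-6 count as success, 7-9 as failure.
import random

def one_trial(n_digits=10):
    successes = 0
    for _ in range(n_digits):          # procedure for choosing random numbers
        digit = random.randint(0, 9)   # correspondence: 0-6 success, 7-9 failure
        if digit <= 6:
            successes += 1
    return successes                   # what is to be counted

results = [one_trial() for _ in range(1000)]
print(sum(results) / len(results))     # long-run average number of successes
```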
Probability Distribution
Normal distribution

The normal distribution is a type of continuous distribution. It is symmetric, bell-shaped, and unimodal. It has two tails that approach the horizontal axis but never reach it. It is useful in describing various natural phenomena. The normal distribution is the limiting case of the binomial distribution as n → ∞.

Finding area under a normal curve

The area under the normal curve is the probability of whatever you are solving for. You can use a z-table to find the area under a normal curve. For this method, you need to calculate the z-score. Remember that the z-table gives you the area to the right of that z-score.

Alternatively, you can use normalcdf(lower bound, upper bound, mean, standard deviation) to find the area under a normal curve. You can enter raw values into this function. However, if you want to enter z-scores for the lower and upper bound, you must set the mean to 0 and the standard deviation to 1.

To find the z-score given an area to the left of the normal curve, use invNorm(area).
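The same areas can be found with scipy.stats.norm; the mean of 70 and standard deviation of 5 below are made-up example values.

```python
# Sketch of normalcdf / invNorm using scipy.stats.norm.
from scipy.stats import norm

# normalcdf(lower, upper, mean, sd): area between raw scores 60 and 75
print(norm.cdf(75, loc=70, scale=5) - norm.cdf(60, loc=70, scale=5))

# with z-scores, use mean 0 and sd 1 (the defaults)
print(norm.cdf(1.0) - norm.cdf(-2.0))

# invNorm(area): z-score with the given area to its left
print(norm.ppf(0.975))  # about 1.96
```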
Probability Distribution
Common probabilities and z-scores
For statistical inference: z* = 1.645 for 90% confidence, 1.960 for 95%, 2.576 for 99%
For percentile ranking
Normal approximation to binomial
The binomial distribution takes values only at integers, while the normal distribution is continuous with probabilities corresponding to areas over intervals.
For approximation purposes, we think of each
binomial probability corresponding to the
normal probability over a unit interval
centered at the desired value.
For example, to approximate the binomial
probability of five successes we determine
the normal probability of being between 4.5
and 5.5.
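A small sketch of this continuity correction, with made-up values n = 20 and p = 0.25.

```python
# Approximate the binomial P(X = 5) by the normal area from 4.5 to 5.5.
from math import sqrt
from scipy.stats import binom, norm

n, p = 20, 0.25
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = binom.pmf(5, n, p)
approx = norm.cdf(5.5, mu, sigma) - norm.cdf(4.5, mu, sigma)
print(exact, approx)  # the two values should be close
```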
Checking for normality
You must be able to decide whether it is reasonable to assume the data come from a normal population. This skill is especially important when you have to do statistical inference.
To check for normality, you should create a
graph of the data. For example, the ages at
inauguration of U.S. presidents were: {57, 61,
57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
51, 56, 55, 51, 54, 51, 60, 61, 43, 55, 56, 61,
52, 69, 64, 46, 54}. Can we conclude that the
distribution is roughly normal?
Checking for normality (cont.)
Normal probability plot
A diagonal straight line pattern in the normal
probability plot is an indication that the
distribution of data is roughly normal.
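One way to draw such a plot for the inauguration ages listed above is with scipy's probplot (matplotlib is assumed to be available).

```python
# Normal probability (quantile-quantile) plot of the inauguration ages.
import matplotlib.pyplot as plt
from scipy import stats

ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
        65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
        51, 56, 55, 51, 54, 51, 60, 61, 43, 55, 56, 61,
        52, 69, 64, 46, 54]

stats.probplot(ages, dist="norm", plot=plt)  # roughly diagonal => roughly normal
plt.show()
```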
Parameter vs. statistic
A parameter is a number that describes
some characteristic of the population.
A statistic is a number that describes some
characteristic of a sample. It is used to
estimate the parameter of interest.
Population parameter: pop. proportion (p), pop. mean (μ), pop. standard deviation (σ)
Sample statistic: sample proportion (p̂), sample mean (x̄), sample standard deviation (s)
Probability Distribution
Sampling distribution
The AP Statistics exam tests you on five types
of sampling distributions.
sampling distribution of sample
proportions
sampling distribution of sample means
sampling distribution of differences
between sample proportions
sampling distribution of differences
between sample means
sampling distribution of slope of sample
LSRL
When random samples are taken from a
population, the sample statistics vary from
sample to sample. This natural deviation is
called sampling variability. It refers to the
fact that different random samples of the
same size from the same population produce
different values for a statistic.
For example, every sample taken will have a
unique sample proportion p̂. The various p̂
values possible can then be plotted to create
a distribution. This distribution of various
sample proportions is called the sampling
distribution of sample proportions.
A similar case can be made for sampling distributions of other sample statistics.
Note that sampling distributions are
different from sample distributions and
population distributions.
Sampling distribution (cont.)
The population distribution of a variable
describes the values of the variable for all
individuals in a population.
The sample distribution describes the values
of the variable for all individuals in a
particular sample.
Biased and unbiased estimators
A statistic can be an unbiased estimator or a
biased estimator of a parameter.
A statistic is an unbiased estimator if the
center (mean) of its sampling distribution is
equal to the true value of the parameter.
When trying to estimate a parameter, choose
a statistic with low or no bias and minimum
variability.
Sampling distribution of p̂
When we want information about the
population proportion p of successes, we
often take an SRS and use the sample
proportion p̂ to estimate the unknown
parameter p.
The sampling distribution of the sample
proportion p̂ describes how the statistic p̂
varies in all possible samples of the same size
from the population.
The mean of the sampling distribution of p̂ is p.
So, p̂ is an unbiased estimator of p.
The standard deviation of the sampling distribution of p̂ is √(p(1-p)/n).
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
large counts condition (np≥10, n(1-p)≥10)
  state that since np≥10 and n(1-p)≥10, the sampling distribution of p̂ is approximately normal by the large counts condition.
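A small simulation sketch that illustrates these facts, with made-up values p = 0.6 and n = 50.

```python
# Simulate the sampling distribution of p-hat and compare with theory.
from math import sqrt
import random

p, n, reps = 0.6, 50, 10_000
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_phat = sum(phats) / reps
sd_phat = sqrt(sum((x - mean_phat) ** 2 for x in phats) / reps)
print(mean_phat, sd_phat)          # simulated center and spread
print(p, sqrt(p * (1 - p) / n))    # theoretical: p and sqrt(p(1-p)/n)
```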
Probability Distribution
Sampling distribution of x̄
When we want information about the
population mean μ for some quantitative
variable, we often take an SRS and use the
sample mean x̄ to estimate the unknown
parameter μ.
The sampling distribution of the sample mean
x̄ describes how the statistic x̄ varies in all
possible samples of the same size from the
population.
The mean of the sampling distribution of x̄ is μ.
So x̄ is an unbiased estimator of μ.
The standard deviation of the sampling distribution of x̄ is σ/√n.
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
normality
if normal, say that it is
if not normal, use central limit
theorem and state that since n≥30,
the sampling distribution of x̄ is
approximately normal by the
central limit theorem.
Sampling distribution of p̂1-p̂2

The mean of the sampling distribution of p̂1-p̂2 is p1-p2.
So p̂1-p̂2 is an unbiased estimator of p1-p2.
The standard deviation of the sampling distribution of p̂1-p̂2 is √(p1(1-p1)/n1 + p2(1-p2)/n2).

Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that since the large counts condition is met for both samples, the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples

Sampling distribution of x̄1-x̄2

The mean of the sampling distribution of x̄1-x̄2 is μ1-μ2.
So x̄1-x̄2 is an unbiased estimator of μ1-μ2.
The standard deviation of the sampling distribution of x̄1-x̄2 is √(σ1²/n1 + σ2²/n2).

Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if both are normal, say that they are
  if both are not normal, use the central limit theorem on both samples and state that since the central limit theorem is met for both samples (n1≥30 and n2≥30), the sampling distribution of x̄1-x̄2 is approximately normal.
  if one is normal but the other isn't, state that it is normal for the normal data but use the central limit theorem on the other
independence condition
  mention that the two samples are independent random samples
Probability Distribution
Sampling distribution of b1

The mean of the sampling distribution of b1 is β1. So b1 is an unbiased estimator of β1.

The standard deviation of the sampling distribution of b1 is σ/(σx√n), where σx is the standard deviation of the x-values; in practice it is estimated by the standard error of the slope from computer output.

Conditions you need to check are:
SRS
10% condition (n < 0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal

Statistical Inference I
Confidence interval

The AP Statistics exam tests you on five types of confidence intervals (CI).
CI for population proportion
CI for population mean
CI for difference between population proportions
CI for difference between population means
CI for slope of the LSRL

A confidence interval gives an interval of plausible values for a parameter based on sample data.

A point estimator is a statistic that provides an estimate of a population parameter. The value of that statistic from a sample is called a point estimate.

The confidence level c gives the overall success rate of the method used to calculate the confidence interval. To interpret the confidence level: if we were to select many random samples from a population and construct a [c]% confidence interval using each sample, about [c]% of the intervals would capture the true [parameter in context].
Statistical Inference I
Confidence interval (cont.)
To interpret the confidence interval: we are
[c]% confident that the interval from
[lower bound] to [upper bound] captures
the true [parameter in context].
The margin of error of an estimate
describes how far, at most, we expect the
estimate to vary from the true population
value.
Affecting margin of error
In general, we prefer an estimate with a small
margin of error. The margin of error gets
smaller when:
the confidence level decreases. To
obtain a smaller margin of error from the
same data, you must be willing to accept
less confidence.
the sample size n increases. In general,
increasing the sample size n reduces the
margin of error for any fixed confidence
level.
Statistical Inference I
CI for population proportion

p̂ ± z*·√(p̂(1-p̂)/n)

Identify: one sample z interval for p
Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np̂≥10, n(1-p̂)≥10)
  state that the number of successes (np̂) and the number of failures (n(1-p̂)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.
Calculate: if the conditions are met, perform the calculations (1-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

Sample size for a desired margin of error
n ≥ p̂(1-p̂)·(z*/ME)², using p̂ = 0.5 if no estimate is available.

The critical value is a multiplier that makes the interval wide enough to have the stated capture rate. The critical value depends on both the confidence level c and the sampling distribution of the statistic.
Statistical Inference I
CI for population mean

σ is known: x̄ ± z*·(σ/√n)
σ is unknown: x̄ ± t*·(s/√n)

Identify: one sample z interval for μ OR
one sample t interval for μ (df = n-1)
Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use central limit theorem and state that since n≥30, the sampling distribution of x̄ is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
Calculate: if the conditions are met, perform the calculations (ZInterval or TInterval)
Conclude: interpret your confidence interval in the context of the problem.

Sample size for a desired margin of error
n ≥ (z*·σ/ME)²
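A sketch of both interval calculations with scipy; the counts and summary statistics are made-up examples.

```python
# Sketch of 1-PropZInt and TInterval using scipy.stats.
from math import sqrt
from scipy.stats import norm, t

conf = 0.95

# one sample z interval for p: 42 successes out of n = 100
x, n = 42, 100
phat = x / n
zstar = norm.ppf(1 - (1 - conf) / 2)
me = zstar * sqrt(phat * (1 - phat) / n)
print(phat - me, phat + me)

# one sample t interval for mu from summary stats (xbar, s, n)
xbar, s, n2 = 25.3, 4.1, 20
tstar = t.ppf(1 - (1 - conf) / 2, df=n2 - 1)
me2 = tstar * s / sqrt(n2)
print(xbar - me2, xbar + me2)
```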
Statistical Inference I
CI for difference in population proportions

(p̂1-p̂2) ± z*·√(p̂1(1-p̂1)/n1 + p̂2(1-p̂2)/n2)

Identify: two sample z interval for p1-p2
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p̂1≥10, n1(1-p̂1)≥10
  n2p̂2≥10, n2(1-p̂2)≥10
  state that the numbers of successes (n1p̂1, n2p̂2) and the numbers of failures (n1(1-p̂1), n2(1-p̂2)) are both greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

Statistical Inference I
CI for difference in population means

σ is known: (x̄1-x̄2) ± z*·√(σ1²/n1 + σ2²/n2)
σ is unknown: (x̄1-x̄2) ± t*·√(s1²/n1 + s2²/n2)
*df = (n1-1 or n2-1, whichever is smaller) OR use technology for precision

Identify: two sample z interval for μ1-μ2 OR
two sample t interval for μ1-μ2
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-SampZInt or 2-SampTInt)
Conclude: interpret your confidence interval in the context of the problem.

Statistical Inference I
CI for slope of population regression line

b1 ± t*·SE(b1)

Identify: one sample t interval for β (df = n-2)
Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is
approximately linear
no apparent pattern in residuals plot
(=equal SD; residuals have roughly equal
variability at all x-values in sample data)
distribution of residuals is approximately normal
Calculate: if the conditions are met, perform
the calculations (LinRegTInt)
Conclude: interpret your confidence interval
in the context of the problem.
IMPORTANT
For paired data (data that are not independent), create a new variable d, the variable of differences, by taking the difference between each value from sample 1 and the corresponding value from sample 2.
Then, you need to create a one sample t interval using the new variable d by using TInterval.
Statistical Inference II
Test of significance for
quantitative data: hypothesis test
The AP Statistics exam tests you on five types
of significance tests for quantitative data.
hypothesis test for population proportion
hypothesis test for population mean
hypothesis test for difference between
population proportions
hypothesis test for difference between
population means
hypothesis test for slope of population
LSRL
Confidence interval vs.
significance test
Confidence intervals aim to estimate over
which interval the unknown parameter may
lie.
Significance tests aim to investigate whether a claimed value of the parameter is valid or needs to be changed.
Types of hypotheses
The hypothesis test is conducted by setting
up the null hypothesis (Ho) and the
alternative hypothesis (Ha). Ha is also
known as the research hypothesis.
The null hypothesis almost always uses the
equality symbol (=) while the alternative
hypothesis uses inequality symbols (<, >, ≠).
Statistical Inference II

Types of hypothesis tests

The inequality symbol used in the alternative hypothesis determines whether a one-tailed or two-tailed test should be performed.

When the < or > symbol is used, conduct a one-tailed test. One-tailed tests are also known as one-sided tests.

When the ≠ symbol is used, conduct a two-tailed test. Two-tailed tests are also known as two-sided tests. Do not forget to double the p-value at one tail for two-tailed tests.

P-value vs. significance level

The p-value refers to the probability of getting evidence for the alternative hypothesis as strong as or stronger than the observed evidence, assuming the null hypothesis is true.

The significance level (α) is the value that we use as a boundary for deciding whether an observed result is unlikely to happen by chance alone, assuming the null hypothesis is true. α = 1 - c

In a hypothesis test, the p-value is compared with the significance level (α).

Type I and Type II Errors

When we make a conclusion in a significance test, there are two kinds of mistakes we can make. A type I error is rejecting Ho when Ho is actually true. A type II error is failing to reject Ho when Ho is actually false. You need to be able to describe type I and type II errors in context.

The probability of making a type I error is α. The probability of making a type II error is β. α and β are inversely related; they do not necessarily add up to 1.

The probability of avoiding a type II error is called power. Power = 1 - β
Statistical Inference II
HT for population proportion

Identify: one sample z test for p

State the hypotheses: Ho: p = p0 and Ha: p < p0, p > p0, or p ≠ p0, where p is (description in context).

State the significance level. (If not given, use 0.05.)

Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np0≥10, n(1-p0)≥10)
  state that the number of successes (np0) and the number of failures (n(1-p0)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.

Calculate: if the conditions are met, perform the calculations (1-PropZTest)

Alternatively, find the p-value using normalcdf and compare this value with the significance level. When calculating the standard deviation of sample proportions, use √(p0(1-p0)/n), where p0 is the null proportion, not the sample proportion.

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

Statistical Inference II
HT for population mean

Identify: one sample z test for μ OR
one sample t test for μ (df = n-1)

State the hypotheses: Ho: μ = μ0 and Ha: μ < μ0, μ > μ0, or μ ≠ μ0, where μ is (description in context).

State the significance level. (If not given, use 0.05.)

Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use CLT and state that since n≥30, the sampling distribution of x̄ is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
Statistical Inference II
HT for population mean (cont.)

Calculate: if the conditions are met, perform the calculations (Z-Test or T-Test)

Alternatively, you can find the p-value using normalcdf or tcdf and compare this value with the significance level. When using tcdf, remember that df = n - 1.

When calculating the standard deviation of sample means, use the following formula.
σ is known: σ/√n
σ is unknown: s/√n

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

Statistical Inference II
HT for difference in population proportions

Identify: two sample z test for p1-p2

Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that the numbers of successes (n1p1, n2p2) and the numbers of failures (n1(1-p1), n2(1-p2)) are both greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples

Calculate: if the conditions are met, perform the calculations (2-PropZTest)

Alternatively, find the p-value using normalcdf and compare this value with the significance level. When calculating the standard deviation of the difference in sample proportions, use the pooled proportion p̂ = (x1 + x2)/(n1 + n2) in the formula √(p̂(1-p̂)(1/n1 + 1/n2)).

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).
Statistical Inference II
HT for difference in population means

Identify: two sample z test for μ1-μ2 OR
two sample t test for μ1-μ2

Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples

Calculate: if the conditions are met, perform the calculations (2-SampZTest or 2-SampTTest)

Alternatively, you can find the p-value using normalcdf or tcdf and compare this value with the significance level. When using tcdf, use df = n1-1 or df = n2-1, whichever is smaller. Or use technology for a more precise df.

When calculating the standard deviation of the difference in sample means, use the following formula.
σ is known: √(σ1²/n1 + σ2²/n2)
σ is unknown: √(s1²/n1 + s2²/n2)

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).
Statistical Inference II
HT for slope of population regression line

Identify: one sample t test for β (df = n-2)

Remember, β1 = 0 simply states that there is no linear relationship between the two variables.
For Ha, use
β1 > 0 if you need to prove a positive linear relationship between two variables,
β1 < 0 if you need to prove a negative linear relationship between two variables, or
β1 ≠ 0 if you need to show there is "some" linear relationship between two variables.

Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal

Calculate: if the conditions are met, perform the calculations (LinRegTTest)

Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

IMPORTANT
For paired data (data that are not independent), create a new variable d, the variable of differences, by taking the difference between each value from sample 1 and the corresponding value from sample 2. Then, perform a one sample t test using the new variable d. Do not perform a two sample test on paired data!

Identify: paired t-test for the set of differences
Then, follow a similar procedure as the HT for population mean.
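A sketch of the paired procedure with scipy; the before/after values are made-up examples, and both calls give the same t statistic and p-value.

```python
# Paired data: run a one sample t test on the differences (ttest_rel is equivalent).
from scipy.stats import ttest_rel, ttest_1samp

before = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9]
after  = [11.5, 11.0, 12.2, 10.9, 12.0, 11.3]

diffs = [b - a for b, a in zip(before, after)]
print(ttest_1samp(diffs, popmean=0))   # one sample t test on the differences d
print(ttest_rel(before, after))        # same result via the paired t-test helper
```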
Statistical Inference III
Test of significance for
qualitative data: chi-square test
The AP Statistics exam tests you on three
types of significance tests for qualitative
data.
chi-square test for goodness-of-fit
chi-square test for independence
chi-square test for homogeneity
Chi-square test statistic
The chi-square test statistic is a measure of
how far the observed counts are from the
expected counts.
You should be able to do a follow up analysis
on which category has the largest
contribution to the chi-square test statistic.
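In symbols, the chi-square test statistic is:

```latex
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
```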
Chi-square distribution
A chi-square distribution is defined by a
density curve that takes only nonnegative
values and is skewed to the right. A particular
chi-square distribution is specified by its
degrees of freedom.
Statistical Inference III
Chi-square test for goodness-of-fit

The χ2 test for goodness-of-fit compares the distribution of observed counts in the sample with the distribution of expected counts if Ho were true.

The expected count for any category is found by multiplying the sample size (n) by the proportion in each category according to the null hypothesis.

Data is usually given in a one-way table.

Identify: chi-square test for goodness-of-fit
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2 GOF-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = (number of categories) - 1.
Conclude: compare p-value and significance level; reject or fail to reject Ho

Statistical Inference III
Chi-square test for homogeneity

The χ2 test for homogeneity compares the distribution of a single categorical variable for each of several populations.

The expected count for any category is found using the formula below.
expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for homogeneity
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].
Conclude: compare p-value and significance level; reject or fail to reject Ho
Statistical Inference III
Chi-square test for independence

The χ2 test for independence is used to test the association/relationship between two categorical variables in a single population.

The null hypothesis is that there is no association between the two categorical variables in the population of interest. Another way to state the null hypothesis is that the two categorical variables are independent in the population of interest.

The expected count for any category is found using the formula below.
expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for independence
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].
Conclude: compare p-value and significance level; reject or fail to reject Ho
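A sketch of both chi-square calculations with scipy; the counts are made-up examples.

```python
# Goodness-of-fit (chisquare) and two-way-table tests (chi2_contingency).
from scipy.stats import chisquare, chi2_contingency

# goodness-of-fit: observed vs. expected counts, df = categories - 1
print(chisquare(f_obs=[18, 22, 30, 30], f_exp=[25, 25, 25, 25]))

# independence / homogeneity: observed counts in a two-way table, df = (r-1)(c-1)
table = [[20, 30, 25],
         [30, 20, 25]]
chi2, p, df, expected = chi2_contingency(table)
print(chi2, p, df)
```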
How to Get a 5
in AP Statistics
Get a perfect score on
the MCQs.
Write something (that is
sensical) on the FRQs.
And do your homework.