Uploaded by 陶旭栋

亿思科 (ESK) Student Home: a non-profit international education information service platform
AP Statistics Summary Book
eskedu.com
THEME 1: Explanatory Analysis
Univariate Data
Median and Mean
Mean:
⚫ Affected by extreme values
⚫ For statistical inference
⚫ Can be obtained from both a population and a sample
Median:
⚫ Not affected by extreme values
⚫ For descriptive statistics
⚫ Usually can only be obtained from a sample
Range, Interquartile Range (IQR), Variance and Standard Deviation
Range = Maximum − Minimum
IQR = $Q_3 - Q_1$
Population variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{n}$
Sample variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Population S.D.: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{n}}$
Sample S.D.: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ (formula sheet)
Range: Useful in evaluating samples with very few items.
IQR: Removing the influence of extreme values on the range.
Variance: Dispersion from the mean in square units.
Standard Deviation: Dispersion from the mean in standard units.
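The spread formulas above can be checked with a short Python sketch. The function name `spread_summary` and the data set are illustrative assumptions, not part of the original notes. IQR is left out because quartile conventions differ between textbooks and calculators.

```python
import math

def spread_summary(data):
    """Return the range, variances and standard deviations of a list of numbers."""
    n = len(data)
    mean = sum(data) / n
    squared_dev = sum((x - mean) ** 2 for x in data)
    pop_var = squared_dev / n          # population variance: divide by n
    samp_var = squared_dev / (n - 1)   # sample variance: divide by n - 1
    return {
        "range": max(data) - min(data),
        "pop_var": pop_var,
        "samp_var": samp_var,
        "pop_sd": math.sqrt(pop_var),
        "samp_sd": math.sqrt(samp_var),
    }

# Hypothetical data with mean 5
summary = spread_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

The sample variance is always slightly larger than the population variance for the same data, because it divides by n − 1 instead of n.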
Percentile Ranking and Z-score
Percentile ranking indicates what percentage of all values fall below the value under consideration.
The z-score indicates how many standard deviations above or below the mean the given value lies:
$z = \dfrac{x - \mu}{\sigma}$
Histogram and Central Tendency
Symmetrical:
Mean = Median = Mode
Skewed to the right:
Mean > Median > Mode
Skewed to the left:
Mean < Median < Mode
Empirical Rule
68% of the values lie within 1 s.d. of the mean.
95% of the values lie within 2 s.d. of the mean.
99.7% of the values lie within 3 s.d. of the mean.
Note: z is usually between -3 and 3, but not always!
Effect of Changing Units on Summary Measure
Central Tendency (Mean and Median)
Adding a constant: add the constant
Multiplying by a constant: multiply by the constant
Spread (Range, IQR and Standard Deviation)
Adding a constant: remains the same
Multiplying by a constant: multiply by the constant (in absolute value)
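A quick numerical check of these rules, as a sketch only: the data values and the helper `center_and_spread` are made up for illustration.

```python
import statistics

def center_and_spread(data):
    """Return (mean, sample standard deviation) of the data."""
    return statistics.mean(data), statistics.stdev(data)

heights_cm = [150, 160, 165, 170, 180]          # hypothetical data
shifted = [x + 10 for x in heights_cm]          # add a constant
scaled = [x * 0.01 for x in heights_cm]         # multiply by a constant (cm to m)

print(center_and_spread(heights_cm))
print(center_and_spread(shifted))   # mean moves by 10, s.d. is unchanged
print(center_and_spread(scaled))    # mean and s.d. are both multiplied by 0.01
```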
Box Plots
A box plot marks five values: Minimum, Lower quartile (Q1), Median, Upper quartile (Q3), Maximum.
IQR = Upper Quartile (Q3) − Lower Quartile (Q1)
A value is an outlier if it is greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR.
(Outliers need to be marked separately on a boxplot.)
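The 1.5 × IQR rule can be sketched in Python as follows. The quartiles are passed in as arguments because different tools compute them slightly differently. The function names and the data are assumptions for illustration.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the values lying outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical data; Q1 = 11 and Q3 = 17 are the medians of the lower and upper halves
data = [1, 10, 12, 13, 14, 15, 16, 18, 40]
outliers = find_outliers(data, q1=11, q3=17)
print(outlier_fences(11, 17), outliers)
```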
Comparing Distributions (Back-to-back stem plot, Parallel box plot, Parallel dot plot)
Compare the following features:
• Shape (symmetric, skewed to the left, skewed to the right)
• Center (mean and median)
• Spread (range and/or IQR)
• Outliers (identify)
• Clusters and Gaps (identify)
Bivariate Data
Correlation Coefficient (r)
The correlation coefficient is a mathematical measure of the strength and direction of the linear association between two variables.
$r = \dfrac{1}{n-1} \sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$ (formula sheet)
• Significant correlation does not necessarily indicate causation.
• A correlation at or near zero means there is no linear relationship, but there may still be a strong nonlinear relationship!
• Changing units does not change the correlation. So the correlation between the standardized z-scores stays the same.
• If x and y are interchanged, the correlation coefficient stays the same.
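The correlation formula and two of the properties listed above (symmetry in x and y, invariance under a change of units) can be illustrated with a small sketch. The function `correlation` and the data are hypothetical.

```python
import math

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the z-scores of x and y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                          # interchange x and y
r_rescaled = correlation([10 * x + 3 for x in xs], ys)   # change the units of x
print(r, r_swapped, r_rescaled)
```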
Coefficient of Determination ($r^2$)
$r^2$ indicates the percentage of variation in y (dependent variable) that can be predicted by the variation in
x (independent variable). The coefficient of determination is usually expressed as a percentage.
Least Squares Regression Line
To find the equation of the least squares regression line:
$\hat{y} - \bar{y} = b_1 (x - \bar{x})$
where $\bar{x}$ and $\bar{y}$ are the average values of x and y, $b_1$ is the slope of the least squares regression line, and $\hat{y}$
is the predicted value of y from a value of x.
$b_1 = r \dfrac{s_y}{s_x}$ (formula sheet)
where $r$ is the correlation coefficient, $s_x$ is the standard deviation of x and $s_y$ is the standard deviation of y.
The equation needs to be written as:
$\widehat{\text{Dependent Variable}} = a + b_1 \times \text{Independent Variable}$
where $a = \bar{y} - b_1 \bar{x}$ is the intercept.
Interpreting $a$ and $b_1$
$b_1$: On average, when the IV increases by 1 unit, the predicted DV increases/decreases by $|b_1|$ units.
$a$: the predicted average value of the DV when the IV is zero.
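A minimal sketch of fitting the least squares line. It computes the slope as Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is algebraically the same as r·s_y/s_x, and the intercept as ȳ − b1·x̄. The function name and the data are assumptions.

```python
def least_squares_line(xs, ys):
    """Return (a, b1) for the line y_hat = a + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # same value as r * s_y / s_x
    a = y_bar - b1 * x_bar      # the line passes through (x_bar, y_bar)
    return a, b1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b1 = least_squares_line(xs, ys)
predicted_at_3 = a + b1 * 3     # interpolation: 3 is inside the range of x
print(a, b1, predicted_at_3)
```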
Making Prediction
When we use the regression line to predict a y-value for a given x-value, we are actually predicting the mean
y-value for that given x-value
Interpolation: Predicting a y-value by an x-value within the range of given data.
Extrapolation: Predicting a y-value by an x-value outside the range of given data.
Extrapolation is less reliable than interpolation!
Residual Plot
The residual of a given y-value is:
$e_i = y_i - \hat{y}_i$
where $e_i$ is the residual, $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.
The standard deviation of the residuals is:
$s_e = \sqrt{\dfrac{\sum e_i^2}{n-2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
$s_e$ gives a measure of how the points are spread around the regression line.
To test for linearity:
• When the residuals are randomly scattered, a linear relationship can be assumed between the two variables.
• When the residual plot shows an obvious pattern, a non-linear model will fit the data better than the straight regression line.
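As an illustration, the residuals and their standard deviation can be computed once a line has been fitted. The data and the fitted coefficients (a = 2.2, b1 = 0.6) are assumed values chosen for this sketch.

```python
import math

def residuals(xs, ys, a, b1):
    """Observed minus predicted value for each point."""
    return [y - (a + b1 * x) for x, y in zip(xs, ys)]

def residual_sd(res):
    """s_e = sqrt(sum(e_i^2) / (n - 2))."""
    return math.sqrt(sum(e ** 2 for e in res) / (len(res) - 2))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, a=2.2, b1=0.6)   # assumed least squares fit for this data
s_e = residual_sd(res)
print(res, s_e)
```

For a least squares line the residuals sum to zero, so a residual plot is always centred on the horizontal axis. What matters for linearity is whether it shows a pattern.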
Outliers and Influential Points
In a scatterplot, regression outliers are indicated by points falling far away from the overall pattern. In many
cases, a point is an outlier if its residual is an outlier in the set of residuals.
In a scatterplot, influential points are those whose removal would sharply change the regression line.
Sometimes this description is restricted to points with extreme x-values.
Transformations to Achieve Linearity
When a scatterplot shows non-linear pattern, it can sometimes be linearized by transforming one or both of the
variables and then noting a linear relationship.
If y is transformed to $y^2$ to linearize the model, then the least squares regression line is:
$(\widehat{DV})^2 = a + b_1 \times IV$
THEME 2: Planning a Study
Data Collection
Observational Study
Experiment
Census
Sample Survey
Observational Study
Observational Study
We simply observe and measure something that has taken place or is taking place, while trying not
to cause any changes by our presence. The results of an observational study show only the
existence of an association, NOT a causal relationship (cause and effect).
Conditions for a well-designed and well-conducted survey
• Must incorporate randomness
• Must have consistent response
• Must ask neutral questions
Such a survey will result in a representative sample.
Sampling Error vs Bias
Sampling error is the difference between a sample statistic and the population parameter. It
occurs naturally.
Bias occurs when there is a tendency to favour the selection of certain members of a population.
It is the consequence of a poorly designed survey.
Sampling Techniques
Simple Random Sampling: Assigning a number to everyone in the population and using a
random number table or having a computer generate numbers to indicate choices.
A Simple Random Sample (SRS) is one in which every possible sample of the desired size
has an equal chance of being selected.
Systematic Sampling: Listing the population in some order, choosing a random point
to start, and picking every tenth/hundredth/thousandth/kth person from the list.
Stratified Sampling: Dividing the population into homogeneous groups called strata
according to gender/income level/race/others, then choosing random samples of persons
from each stratum.
Cluster Sampling: Dividing the population into heterogeneous groups called
clusters, then choosing a random sample of clusters from all clusters.
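The four techniques can be sketched with Python's standard `random` module. The function names and the way the population, strata and clusters are represented (a list, and dictionaries of lists) are assumptions made for this example.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Random start among the first k items, then every kth item."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Take a random sample from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)            # seeded so the sketch is repeatable
population = list(range(100))
print(simple_random_sample(population, 10, rng))
print(systematic_sample(population, 10, rng))
```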
Bias
Type of Bias: Definition
Household bias: When a sample includes only one member of any given household, members of large households are underrepresented.
Nonresponse bias: When people refuse to respond or are unreachable or too difficult to contact.
Response bias: People may respond untruthfully when face to face with an interviewer or when filling out a questionnaire that is not anonymous.
Size bias: Some items/people are naturally more likely to be selected due to their size / group size.
Selection bias: Bias exists when a particular group of people are selected, which may result in similar responses.
Undercoverage bias: When a particular method is used to reach people, those who cannot be reached through that method may be ignored.
Voluntary response bias: Samples based on individuals who offer to participate typically give too much emphasis to people with strong opinions.
Unintentional bias: When people are given free choices, they tend to make a particular type of choice.
Wording bias: Nonneutral or poorly worded questions may lead to answers that are very unrepresentative of the population.
Experiments
Experiments
We impose some change or treatment and measure the results or responses. The subjects are
usually divided into a treatment group and a control group. The results of an experiment can suggest a causal
relationship.
Explanatory Variables vs Response Variables
Explanatory variables, called factors, are believed to have an effect on the response variable. They
correspond to the independent variable and the dependent variable, respectively, in regression analysis.
An explanatory variable can have different levels. Each level is called a treatment.
An experiment allows multiple explanatory variables.
Confounding Variable vs Lurking Variable
Confounding Variable: When we are uncertain which variable is causing an effect,
we say the variables are confounding variables.
Lurking Variable: A lurking variable is a variable that drives two other variables, creating the
mistaken impression that the two other variables are related by cause and effect.
Treatment Group and Control Group
In an experiment, the subjects are randomly assigned into two groups: a treatment group in which
they receive treatments and a control group in which they don’t.
The subjects in the control group receive a placebo, which is a ‘simulated’ or medically ineffectual
treatment.
A placebo effect occurs when people respond to any kind of perceived treatment. The physical response
may be caused by the psychological placebo effect instead of the actual treatment.
Randomization
Randomization is to use chance in deciding which subjects go into which group. It usually refers to how given
subjects are assigned to treatments, not to how a group of subjects are chosen from an entire population.
Randomization helps ensure that the subjects in each group have a variety of levels of any potential confounding or
lurking variables. It therefore helps to minimize the effect of confounding variables and lurking variables.
Blinding
Blinding occurs when the subjects don’t know which treatment they are receiving, or when the response
evaluators don’t know which subjects are receiving which treatment.
Single Blinding: Only the subjects do not know which treatment they are receiving
Double Blinding: Neither the subjects nor the response evaluator know who is receiving which treatment.
Blinding helps minimize any hidden bias. Many studies suggest that subjects appear to consciously or
subconsciously want to help the researcher prove a point. Doctors' judgment may also be influenced if they
know which subject receives which treatment.
Randomized Design
Completely Randomized Design:
Every subject has an equal chance of receiving any treatment.
Randomized Block Design:
First divide the subjects into representative groups called blocks, then randomly assign the subjects in each block
to different treatment groups. Blocking helps control certain lurking variables by bringing them directly
into the picture and helps make conclusions more specific.
Randomized Paired Comparison Designs:
Subjects are paired first and then the subjects in every pair are randomly assigned to different treatment groups.
Often the paired subjects are just a single subject who is given both treatments, one at a time. It is a special case
of block design with ‘very small blocks’.
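The difference between a completely randomized design and a randomized block design can be sketched in code. The function names, subject labels and block names below are hypothetical, and the balanced round-robin assignment after shuffling is one simple choice among several.

```python
import random

def completely_randomized(subjects, treatments, rng):
    """Shuffle all subjects, then deal them out to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each block."""
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

rng = random.Random(42)
treatments = ["treatment", "control"]
blocks = {"block_A": ["a1", "a2", "a3", "a4"],
          "block_B": ["b1", "b2", "b3", "b4"]}
design = randomized_block(blocks, treatments, rng)
print(design)
```

With blocking, every block contributes equally to each treatment group, so the blocking variable cannot be confounded with the treatment.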
Replication
The treatment should be repeated on a sufficient number of subjects so that the obtained response differences
are statistically significant. To achieve this:
For a paired comparison design: increase the number of pairs of subjects
For a completely randomized design or block design: increase the group size
Generalizability of Results
A major goal of experiments is to be able to generalize the results to broader populations. To achieve this:
• Often an experiment must be repeated in a variety of settings.
• Realistic situations should be created in testing.
Testing and experimenting on people does not put them in natural states, and this situation can lead to artificial
responses.
Three primary principles for a well planned and well conducted experiment
➢ Possible confounding variables must be controlled.
➢ Chance should be used in assigning which subjects are to be placed in which groups for which treatment.
➢ Natural variation in outcomes can be lessened by using more subjects.
THEME 3: Probability
Probability Basics
Law of Large Numbers:
The relative frequency tends to be closer and closer to a certain number (the probability) as an experiment is
repeated more and more times.
$\lim_{\text{number of experiments} \to \infty} \text{relative frequency} = \text{probability}$
Complementary Events
$P(A^c) = 1 - P(A)$,
where $A^c$ means that A does not occur. $A^c$ and $A$ are called complementary events.
Addition Rule
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (formula sheet)
Mutually Exclusive Events
If A and B are mutually exclusive, then A and B cannot occur simultaneously. So
𝑃(𝐴 ∩ 𝐵) = 0
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
Conditional probability:
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ (formula sheet)
$P(A \mid B)$ is called the probability of A given B, that is, the probability of A given that B has happened.
Independent Events
If A and B are independent, then the occurrence of A is not affected by the occurrence of B.
If A and B are independent, then,
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) × 𝑃(𝐵)
P(A|B) = 𝑃(𝐴)
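The rules above can be checked on a small example. The sketch below uses one roll of a fair die, with A = "even" and B = "at least 4". The events and the helper `prob` are illustrative assumptions.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}                     # even number
B = {4, 5, 6}                     # at least 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_complement = 1 - prob(A)                      # complement rule
p_union = prob(A) + prob(B) - prob(A & B)       # addition rule
p_a_given_b = prob(A & B) / prob(B)             # conditional probability
independent = prob(A & B) == prob(A) * prob(B)  # independence check
print(p_complement, p_union, p_a_given_b, independent)
```

Here P(A|B) = 2/3 differs from P(A) = 1/2, so A and B are not independent.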
Discrete Random Variable
Random Variables
A random variable (X) is a numerical quantity of interest whose value is determined by the outcome of an experiment.
Discrete random variables assume only a countable number of values, while continuous random variables
assume values associated with an interval.
Probability Distribution
A probability distribution for a discrete random variable is a list or formula giving the probability for each value
of the random variable.
P(X = x) denotes the probability that the random variable X takes the value x.
The sum of all probabilities must be equal to 1 for any probability distribution.
The cumulative probability:
$P(a \le X \le b) = \sum_{x=a}^{b} p(x)$
Bernoulli Trials
A Bernoulli trial must have the following properties:
• Each trial results in one of two outcomes, which are designated either a success, S, or a failure, F.
• The probability of success on a single trial, p, is constant for all trials, and thus the probability of
failure on a single trial is (1 − p).
• The trials are independent (so that the outcome on any trial is not affected by the outcome of any
previous trial).
Binomial Random Variable:
The number of successes in a Bernoulli sequence of n trials is called a binomial random variable
and is said to have a binomial probability distribution.
The probability of achieving k successes in n Bernoulli trials is:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$, where $\binom{n}{k} = \dfrac{n!}{k!(n-k)!}$ (formula sheet)
Where:
n = number of independent trials
k = number of successes
p = probability of success
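A minimal sketch of the binomial formula using `math.comb` from the Python standard library. The function names are chosen here for illustration. `binomial_cdf` plays the role of a calculator's binomial CDF, P(X ≤ x).

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(x, n, p):
    """P(X <= x): sum of the pmf from 0 to x."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

p_three_heads = binomial_pmf(3, 5, 0.5)   # exactly 3 heads in 5 fair coin tosses
print(p_three_heads, binomial_cdf(3, 5, 0.5))
```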
Using TI for Binomial Probability
Binomial PDF: To find $P(X = x)$
Binomial CDF: To find $P(x_1 < X \le x_2)$
Geometric Random Variable
A geometric random variable is the number of Bernoulli trials needed to obtain the first success.
Suppose an experiment has two possible outcomes, called success and failure, with the probability of
success equal to p and the probability of failure equal to $q = 1 - p$. Then the probability that the first
success occurs on trial number k is
$P(X = k) = (1 - p)^{k-1} p$
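The geometric formula can be sketched as follows. The function names are assumptions. The cumulative form uses the fact that the first success comes after trial k only if the first k trials all fail.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k) = (1 - p)^(k - 1) * p."""
    return (1 - p) ** (k - 1) * p

def geometric_cdf(k, p):
    """P(first success occurs on or before trial k) = 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

p_third_toss = geometric_pmf(3, 0.5)   # first head on the third toss of a fair coin
print(p_third_toss, geometric_cdf(3, 0.5))
```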
Mean (Expected Value) And Standard Deviation of a Random Variable (Formula Sheet)
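For any random variable X:
Expected value / average / mean: $E(X) = \mu_X = \sum x_i p_i$
Variance: $\operatorname{var}(X) = \sigma_X^2 = \sum (x_i - \mu_X)^2 p_i$
Standard deviation: $\sigma_X = \sqrt{\operatorname{var}(X)}$
For a binomial random variable:
Expected value / average / mean: $\mu_X = np$
Variance: $\sigma_X^2 = np(1 - p)$
Standard deviation: $\sigma_X = \sqrt{np(1 - p)}$
The alternative formula for variance:
$\operatorname{Var}(X) = E(X^2) - [E(X)]^2$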
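To connect the general formulas with the binomial shortcuts, the sketch below computes the mean and variance of a binomial distribution with n = 2 and p = 0.5 directly from its probability table. The dictionary representation and function name are assumptions.

```python
import math

def mean_and_variance(distribution):
    """distribution maps each value x_i to its probability p_i."""
    mean = sum(x * p for x, p in distribution.items())
    variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
    return mean, variance

# Number of heads in 2 fair coin tosses: binomial with n = 2, p = 0.5
dist = {0: 0.25, 1: 0.5, 2: 0.25}
mu, var = mean_and_variance(dist)
alt_var = sum(x ** 2 * p for x, p in dist.items()) - mu ** 2   # E(X^2) - [E(X)]^2
sd = math.sqrt(var)
print(mu, var, alt_var, sd)
```

Both routes agree with np = 1 and np(1 − p) = 0.5.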
Fair Game
If a game is fair, then the expected winning for each player in the game is zero.
Independent Random Variables
Two random variables X and Y are independent if
$P(X = x \mid Y = y) = P(X = x)$ OR
$P(X = x, Y = y) = P(X = x) \times P(Y = y)$
for all values x and y.
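Linear Combination of E(X) and Var(X)
$E(X \pm a) = E(X) \pm a$
$\operatorname{Var}(X \pm a) = \operatorname{Var}(X)$
$E(bX) = bE(X)$
$\operatorname{Var}(bX) = b^2 \operatorname{Var}(X)$
Combining X and Y:
$E(X \pm Y) = E(X) \pm E(Y)$
$\operatorname{Var}(X \pm Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ if X and Y are independent.
$\sigma_{X \pm Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$
Note: When combining X and Y, every value of X is combined with every value of Y.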
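The rule that variances add for both sums and differences of independent variables can be verified by brute force. The sketch below pairs every face of one fair die with every face of another, as the note above describes. The helper `mean_var` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_var(values):
    """Mean and variance of equally likely values, as exact fractions."""
    values = list(values)
    n = len(values)
    mean = Fraction(sum(values), n)
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

mean_x, var_x = mean_var(faces)
# Every value of X is combined with every value of Y (two independent dice)
mean_sum, var_sum = mean_var(x + y for x, y in product(faces, repeat=2))
mean_diff, var_diff = mean_var(x - y for x, y in product(faces, repeat=2))
print(var_x, var_sum, var_diff)
```

The variance of the difference is 35/6, not 0, because subtracting an independent variable still adds uncertainty.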
Normal Distribution
Normal Distribution
• Symmetric and bell-shaped with an infinite base.
• The total area under the normal curve is 1.
• The mean (μ) is located at the center and the standard deviation (σ) represents the width of the normal
curve. A normal model is denoted as $N(\mu, \sigma^2)$.
• The mean and standard deviation of a standard normal random variable Z are 0 and 1 respectively.
• $P(a \le X \le b)$ is the area under the normal distribution curve between x = a and x = b.
Z-score
$z = \dfrac{x - \mu}{\sigma}$
where x is the data value, μ is the mean, and σ is the standard deviation.
The 68-95-99.7% Rule
$P(\mu - \sigma < X < \mu + \sigma) = P(-1 < Z < 1) \approx 0.68$
$P(\mu - 2\sigma < X < \mu + 2\sigma) = P(-2 < Z < 2) \approx 0.95$
$P(\mu - 3\sigma < X < \mu + 3\sigma) = P(-3 < Z < 3) \approx 0.997$
Common Z-score
Using TI for Normal Probability
Normal CDF: Given $N(\mu, \sigma^2)$, find $P(x_1 < X < x_2)$
Inverse Normal:
Case 1: Given $P(X < x) = p_0$ and $N(\mu, \sigma^2)$, find x.
Case 2: Given $P(X < x_0) = p_0$, find the mean or standard deviation.
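Outside the calculator, the same normal CDF and inverse normal calculations can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8+). Note that `NormalDist` takes the standard deviation σ, not the variance σ². The helper names and the N(100, 15²) model are assumptions for illustration.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # standard normal

def normal_cdf(x1, x2, mu, sigma):
    """P(x1 < X < x2) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(x2) - dist.cdf(x1)

def inverse_normal(p0, mu, sigma):
    """Case 1: find x such that P(X < x) = p0."""
    return NormalDist(mu, sigma).inv_cdf(p0)

within_1 = Z.cdf(1) - Z.cdf(-1)   # about 0.68
within_2 = Z.cdf(2) - Z.cdf(-2)   # about 0.95
within_3 = Z.cdf(3) - Z.cdf(-3)   # about 0.997
p_70_to_130 = normal_cdf(70, 130, mu=100, sigma=15)   # within 2 s.d. of the mean
x_90th = inverse_normal(0.90, mu=100, sigma=15)
print(within_1, within_2, within_3, p_70_to_130, x_90th)
```

For Case 2, the z-score from `Z.inv_cdf(p0)` can be substituted into z = (x0 − μ)/σ and the equation solved for the unknown μ or σ.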
Normal Approximation to the Binomial
When np > 10 and n(1 − p) > 10, it is reasonable to use the normal distribution to approximate the binomial:
$\mu_X = np$
$\sigma_X = \sqrt{np(1 - p)}$
To estimate $P(X = x)$ for binomial, find $P(x - 0.5 < X < x + 0.5)$ for $N(\mu_X, \sigma_X^2)$
To estimate $P(X \le x)$ for binomial, find $P(X < x + 0.5)$ for $N(\mu_X, \sigma_X^2)$
To estimate $P(X < x)$ for binomial, find $P(X < x - 0.5)$ for $N(\mu_X, \sigma_X^2)$
To estimate $P(X \ge x)$ for binomial, find $P(X > x - 0.5)$ for $N(\mu_X, \sigma_X^2)$
To estimate $P(X > x)$ for binomial, find $P(X > x + 0.5)$ for $N(\mu_X, \sigma_X^2)$
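The quality of the approximation can be seen by comparing it with the exact binomial probability. The sketch below does this for P(X ≤ 45) with n = 100 and p = 0.5. The function names and the chosen numbers are assumptions.

```python
import math
from statistics import NormalDist

def binomial_cdf(x, n, p):
    """Exact P(X <= x) for a binomial random variable."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def normal_approx_cdf(x, n, p):
    """Approximate P(X <= x) by P(X < x + 0.5) under N(np, np(1 - p))."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(x + 0.5)

n, p, x = 100, 0.5, 45     # np = 50 and n(1 - p) = 50, both above 10
exact = binomial_cdf(x, n, p)
approx = normal_approx_cdf(x, n, p)
print(exact, approx)
```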
Sampling Distribution
Sampling Distribution of Sample Proportion
The sample proportions are approximately normally distributed if the following conditions are satisfied:
1. Both np and n(1 − p) are at least 10.
2. The sample is a simple random sample.
3. The sample cannot be too large: the sample size n should be no larger than 10% of the population.
The mean and standard deviation of the sampling distribution of the sample proportion are:
$\mu_{\hat{p}} = p$
$\sigma_{\hat{p}} = \sqrt{\dfrac{p(1 - p)}{n}}$ (formula sheet)
where p is the population proportion and n is the sample size.
Sampling Distribution of a Difference Between Two Sample Proportions
The differences between two sample proportions are approximately normally distributed if the following
conditions are satisfied:
1. $n_1 p_1$, $n_1 (1 - p_1)$, $n_2 p_2$ and $n_2 (1 - p_2)$ are all at least 10.
2. The two samples are independent SRSs.
3. The samples cannot be too large: the sample sizes $n_1$ and $n_2$ should be no larger than 10% of the
respective populations.
The mean and standard deviation of the sampling distribution of differences between two sample
proportions are:
$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1 (1 - p_1)}{n_1} + \dfrac{p_2 (1 - p_2)}{n_2}}$ (formula sheet)
where $p_1$ and $p_2$ are the population proportions and $n_1$ and $n_2$ are the sample sizes.
Sampling Distribution of a Sample Mean
The sample means are approximately normally distributed, regardless of the shape of the population distribution, if the
following conditions are satisfied:
1. n is sufficiently large (n > 30)
2. The sample is an SRS
3. The sample size is no larger than 10% of the population.
This is also called the Central Limit Theorem.
(Note: If the population distribution is normal, then the large-sample condition above is NOT necessary!)
The mean and standard deviation of the sampling distribution of sample means are:
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$ (formula sheet)
where μ is the population mean, σ is the population standard deviation and n is the sample size.
Sampling Distribution of a Difference Between Two Sample Means
The differences between two sample means are approximately normally distributed if the following conditions are
satisfied:
1. $n_1$ and $n_2$ are both sufficiently large (n > 30)
2. The two samples are independent SRSs.
3. $n_1$ and $n_2$ are no larger than 10% of the respective populations.
The mean and standard deviation of the sampling distribution of differences between two sample means
are:
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
where $\mu_1$ and $\mu_2$ are the population means, $\sigma_1$ and $\sigma_2$ are the population standard deviations and $n_1$
and $n_2$ are the sample sizes.
THEME 4: Statistical Inferences
Confidence Intervals
Confidence Interval
$\text{Estimate} \pm \text{critical score} \times \text{standard deviation}$
where ‘estimate’ is the observed sample statistic,
‘critical score’ corresponds to the confidence level (either a z or t score),
‘standard deviation’ of the sampling distribution is estimated from the sample statistics, and
critical score × standard deviation is called the margin of error.
Interpretation of a Confidence Interval:
⚫ We are (confidence level)% confident that the (population parameter) is within the interval
Estimate ± critical score × standard deviation.
⚫ In repeated sampling, about (confidence level)% of the intervals constructed this way contain the true population
parameter.
Confidence Interval for a Proportion
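Confidence Interval:
$\hat{p} \pm z \times \sigma_{\hat{p}}$ for $\sigma_{\hat{p}} \approx \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
where $\hat{p}$ is the observed sample proportion, z is the critical score, and $\sigma_{\hat{p}}$ is the standard deviation of the sample
proportion (also called $SE_{\hat{p}}$, the standard error of $\hat{p}$).
Assumption:
➢ $n\hat{p} > 10$ and $n(1 - \hat{p}) > 10$
➢ The sample is a simple random sample
➢ The sample is less than 10% of the population.
Maximum error and minimum sample size:
$\max \sigma_{\hat{p}} = \dfrac{0.5}{\sqrt{n}}$
$z \times \dfrac{0.5}{\sqrt{n}} \le \text{error} \Rightarrow \text{minimum } n = \left[ \dfrac{z}{2(\text{error})} \right]^2$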
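A sketch of the interval and the minimum sample size calculation, using `statistics.NormalDist` for the critical z-score. The function names, the poll numbers (520 successes out of 1000) and the 3% margin of error are hypothetical.

```python
import math
from statistics import NormalDist

def critical_z(confidence):
    """z such that the central area under the standard normal curve equals the confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = critical_z(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def min_sample_size(error, confidence=0.95):
    """Smallest n with z * 0.5 / sqrt(n) <= error, rounded up."""
    return math.ceil((critical_z(confidence) / (2 * error)) ** 2)

low, high = proportion_ci(520, 1000)
n_needed = min_sample_size(0.03)
print(low, high, n_needed)
```

The sample size is rounded up, never down, so that the margin of error does not exceed the target.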
Confidence Interval for a Difference of Two Proportions
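Confidence Interval:
$(\hat{p}_1 - \hat{p}_2) \pm z \times \sigma_{\hat{p}_1 - \hat{p}_2}$ for $\sigma_{\hat{p}_1 - \hat{p}_2} \approx \sqrt{\dfrac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
where $\hat{p}_1 - \hat{p}_2$ is the observed difference of two proportions, z is the critical score, and $\sigma_{\hat{p}_1 - \hat{p}_2}$ is the standard
deviation of the difference between two sample proportions.
Assumption:
➢ $n_1 \hat{p}_1$, $n_1 (1 - \hat{p}_1)$, $n_2 \hat{p}_2$, $n_2 (1 - \hat{p}_2)$ should all be at least 10
➢ Both samples are SRSs and they are independent.
➢ Both samples should be less than 10% of the population.
Maximum error and minimum sample size (for two samples of equal size n):
$\max \sigma_{\hat{p}_1 - \hat{p}_2} = \dfrac{\sqrt{0.5}}{\sqrt{n}}$
$z \times \dfrac{\sqrt{0.5}}{\sqrt{n}} \le \text{error} \Rightarrow \text{minimum } n = \left[ \dfrac{z}{\sqrt{2}\,(\text{error})} \right]^2$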
Confidence Interval for a Mean
Given population standard deviation σ:
$\bar{x} \pm z \times \sigma_{\bar{x}}$ for $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$,
where $\bar{x}$ is the observed sample mean, z is the critical score, and $\sigma_{\bar{x}}$ is the standard deviation of the
sample mean calculated from the population standard deviation.
Assumption:
1. n is sufficiently large (n > 30)
2. The sample is an SRS
3. The sample size is no larger than 10% of the population.
Given sample standard deviation s:
$\bar{x} \pm t \times \sigma_{\bar{x}}$ (df = n − 1) for $\sigma_{\bar{x}} \approx \dfrac{s}{\sqrt{n}}$,
where $\bar{x}$ is the observed sample mean, t is the critical score, and $\sigma_{\bar{x}}$ is the standard deviation of the
sample mean calculated from the sample standard deviation.
Assumption:
1. n > 40, OR sample data is normally distributed, OR population data is roughly symmetric and unimodal
2. The sample is an SRS
3. The sample size is no larger than 10% of the population.
Properties of t-distribution
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$ where $\sigma_{\bar{x}} \approx \dfrac{s}{\sqrt{n}}$
➢ The t-distribution is also bell-shaped and symmetric.
➢ The t-distribution is more spread out than the normal distribution.
➢ The t-distribution is different for different values of n.
➢ df = n − 1 is called the degrees of freedom.
➢ The larger the df value, that is, the larger the sample size, the closer the distribution is to the normal
distribution.
Choosing sample size
In general, if
$z \times \dfrac{\sigma}{\sqrt{n}} \le \text{error}$
where z is the critical score for the confidence level,
then the minimum sample size n needed to achieve a given margin of error at that confidence level is:
$n = \left( \dfrac{z\sigma}{\text{error}} \right)^2$
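As a worked illustration of the sample size formula, the sketch below finds the sample size needed to estimate a mean to within 2 units at 95% confidence when σ = 15. The function name and these numbers are assumptions.

```python
import math
from statistics import NormalDist

def min_sample_size_mean(sigma, error, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= error, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / error) ** 2)

n_needed = min_sample_size_mean(sigma=15, error=2)
print(n_needed)
```

Halving the allowed error roughly quadruples the required sample size, because n grows with 1/error².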
Confidence Interval for a Difference Between Two Means
Given population standard deviations $\sigma_1$ and $\sigma_2$:
$(\bar{x}_1 - \bar{x}_2) \pm z \times \sigma_{\bar{x}_1 - \bar{x}_2}$ for $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
Assumption:
1. $n_1$ and $n_2$ are both sufficiently large (> 30)
2. The samples are independent SRSs
3. The sample sizes are no larger than 10% of the population.
Given sample standard deviations $s_1$ and $s_2$:
$(\bar{x}_1 - \bar{x}_2) \pm t \times \sigma_{\bar{x}_1 - \bar{x}_2}$ for $\sigma_{\bar{x}_1 - \bar{x}_2} \approx \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
$df = (n_1 - 1) + (n_2 - 1)$
Assumption:
1. $n_1 > 40$ and $n_2 > 40$, OR sample data is normally distributed, OR population data is normally distributed
2. The samples are independent SRSs
3. The sample sizes are no larger than 10% of the population.
Choosing sample sizes
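In general, if
$z \times \sqrt{\dfrac{\sigma_1^2}{n} + \dfrac{\sigma_2^2}{n}} \le \text{error}$
where z is the critical score for the confidence level,
then the minimum size for both samples is:
$n = \left( \dfrac{z \sqrt{\sigma_1^2 + \sigma_2^2}}{\text{error}} \right)^2$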
Confidence Interval for the Slope of a Least Squares Regression Line
Confidence interval:
$b_1 \pm t \times s_{b_1}$
where $b_1$ is the slope of the sample regression line $\hat{y} = \bar{y} + b_1 (x - \bar{x})$,
t is the critical score for df = n − 2, and
$s_{b_1}$ is the standard deviation of $b_1$.
Standard deviation of the slope $b_1$:
$s_{b_1} = \dfrac{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}}{\sqrt{\sum (x_i - \bar{x})^2}}$ (formula sheet)
OR
$s_{b_1} = \dfrac{s_e}{s_x \sqrt{n - 1}}$
where $s_e = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$ is the standard deviation of the residuals
and $s_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$ is the standard deviation of x.
Assumptions:
1. The sample must be randomly selected.
2. The scatterplot of the sample data should be approximately linear (no apparent pattern in the residual plot).
3. The distribution of the residuals should be approximately normal for the sample data.
Hypothesis Test
Introduction to Hypothesis Test
Purpose
A hypothesis test is used to test whether a claimed value of a population parameter (a hypothesis) is acceptable.
Null Hypothesis and Alternative Hypothesis
• The Null Hypothesis ($H_0$) is an equality about a population parameter.
• The Alternative Hypothesis ($H_a$) is an inequality (greater, less, not equal) about a population parameter.
Conclusion of Hypothesis Test
We have sufficient evidence to reject the null hypothesis.
OR
We do not have sufficient evidence to reject the null hypothesis
Errors in Hypothesis Test
Decision based on sample vs. population truth:
Reject $H_0$ when $H_0$ is true: Type I error
Reject $H_0$ when $H_0$ is false: Correct decision
Fail to reject $H_0$ when $H_0$ is true: Correct decision
Fail to reject $H_0$ when $H_0$ is false: Type II error
• The probability of making a Type I error is the significance level (or α risk).
• The probability of making a Type II error, β, is different for each possible value of the population
parameter.
• The power of a hypothesis test is the probability that a Type II error is not committed, or the
probability that a false null hypothesis is correctly rejected (1 − β).
• Choosing a smaller α results in a higher risk of a Type II error and a lower power.
• The greater the difference between the null hypothesis and the true population parameter, the smaller the
risk of a Type II error and the greater the power.
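The relationships among α, β and power can be made concrete with a small calculation. The sketch below assumes a simplified setting that is not spelled out in the notes: an upper-tail z-test for a mean with known σ, where $H_0$ is rejected when the sample mean exceeds μ0 + z_α·σ/√n. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def power_upper_tail(mu0, mu_true, sigma, n, alpha):
    """Power of the test H0: mu = mu0 vs Ha: mu > mu0 when the true mean is mu_true."""
    se = sigma / math.sqrt(n)
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # rejection cutoff
    beta = NormalDist(mu_true, se).cdf(critical_mean)           # P(fail to reject | mu_true)
    return 1 - beta

power_05 = power_upper_tail(100, 105, sigma=15, n=25, alpha=0.05)
power_01 = power_upper_tail(100, 105, sigma=15, n=25, alpha=0.01)   # smaller alpha
power_far = power_upper_tail(100, 110, sigma=15, n=25, alpha=0.05)  # larger true difference
print(power_05, power_01, power_far)
```

A smaller α lowers the power, and a true mean farther from the null value raises it, matching the last two bullets above.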
Hypothesis Testing for Proportion
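Test for a proportion (p)
Step 1: Setting up hypotheses
$H_0: p = p_0$
$H_a: p < p_0$ or $p > p_0$ or $p \ne p_0$
Significance level: $\alpha = \alpha_0$
Step 2: Finding $\mu_{\hat{p}}$ and $\sigma_{\hat{p}}$
$\mu_{\hat{p}} = p_0$
$\sigma_{\hat{p}} = \sqrt{\dfrac{p_0 (1 - p_0)}{n}}$
Step 3: Computing the P-value (the attained significance)
$P(\hat{p} < \hat{p}_0)$ or $P(\hat{p} > \hat{p}_0)$, where $\hat{p}_0$ is the obtained sample proportion.
Step 4: Compare the P-value with the significance level α (with α for $H_a$: < or >, with α/2 for $H_a$: ≠).
Assumption:
1. Both $n p_0$ and $n(1 - p_0)$ are at least 10.
2. The sample is an SRS.
3. The sample size is less than 10% of the population.
Test for the difference between two proportions ($p_1 - p_2$)
Step 1: Setting up hypotheses
$H_0: p_1 - p_2 = 0$
$H_a: p_1 - p_2 < 0$ or $> 0$ or $\ne 0$
Significance level: $\alpha = \alpha_0$
Step 2: Finding $\mu_{\hat{p}_1 - \hat{p}_2}$ and $\sigma_{\hat{p}_1 - \hat{p}_2}$
$\mu_{\hat{p}_1 - \hat{p}_2} = 0$
$\sigma_{\hat{p}_1 - \hat{p}_2} \approx \sqrt{\hat{p}(1 - \hat{p}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}$ where $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$
Step 3: Computing the P-value (the attained significance)
$P(\hat{p}_1 - \hat{p}_2 < \text{obtained difference})$ or $P(\hat{p}_1 - \hat{p}_2 > \text{obtained difference})$
Step 4: Compare the P-value with the significance level α (with α for $H_a$: < or >, with α/2 for $H_a$: ≠).
Assumption:
1. $n_1 \hat{p}_1$, $n_1 (1 - \hat{p}_1)$, $n_2 \hat{p}_2$ and $n_2 (1 - \hat{p}_2)$ should all be at least 10.
2. Independent SRSs.
3. The sample sizes are less than 10% of the population.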
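The four steps for a single proportion can be sketched in Python. The function name, the `alternative` labels and the data (60 successes in 100 trials against p0 = 0.5) are assumptions. For a two-sided alternative this sketch doubles the one-tail P-value and compares it with α, which is equivalent to comparing the one-tail P-value with α/2.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="greater"):
    """Return (z, p_value) for H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    sd = math.sqrt(p0 * (1 - p0) / n)       # standard deviation under H0
    z = (p_hat - p0) / sd
    lower_tail = NormalDist().cdf(z)
    if alternative == "less":
        p_value = lower_tail
    elif alternative == "greater":
        p_value = 1 - lower_tail
    else:                                   # "two-sided"
        p_value = 2 * min(lower_tail, 1 - lower_tail)
    return z, p_value

z, p_value = one_proportion_z_test(60, 100, p0=0.5, alternative="greater")
reject_h0 = p_value < 0.05
print(z, p_value, reject_h0)
```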
Hypothesis Testing for Mean
Test for a mean (μ)
Step 1: Setting up hypotheses
$H_0: \mu = \mu_0$
$H_a: \mu < \mu_0$ or $\mu > \mu_0$ or $\mu \ne \mu_0$
Significance level: $\alpha = \alpha_0$
Step 2: Finding $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$
$\mu_{\bar{x}} = \mu_0$
$\sigma_{\bar{x}} \approx \dfrac{s}{\sqrt{n}}$
Step 3:
Option 1:
➢ Find the t-score for the observed sample statistic
➢ Find the critical t-score for the significance level $\alpha_0$
➢ Compare the t-scores to test
Option 2:
➢ Find the t-score for the observed sample statistic
➢ Find the corresponding P-value
➢ Compare the P-value with the significance level $\alpha_0$ to test
t-score for a sample mean: $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Assumption:
1. The sample is an SRS.
2. The sample is large enough, OR the sample data are approximately symmetric and unimodal, OR the population
distribution is approximately normal.
Test for the difference between two means ($\mu_1 - \mu_2$)
Step 1: Setting up hypotheses
$H_0: \mu_1 - \mu_2 = 0$
$H_a: \mu_1 - \mu_2 < 0$ or $> 0$ or $\ne 0$
Significance level: $\alpha = \alpha_0$
Step 2: Finding $\mu_{\bar{x}_1 - \bar{x}_2}$ and $\sigma_{\bar{x}_1 - \bar{x}_2}$
$\mu_{\bar{x}_1 - \bar{x}_2} = 0$
$\sigma_{\bar{x}_1 - \bar{x}_2} \approx \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
Step 3: Option 1 or Option 2 as above.
t-score for the difference between two sample means: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Assumption:
1. The two samples are SRSs and independent.
2. The samples are large enough, OR the sample data are approximately symmetric and unimodal, OR the population
distribution is approximately normal.
Hypothesis Test for Slope of Least Squares Line
Step 1: Set up $H_0$ and $H_a$
$H_0: \beta = 0$
$H_a: \beta > 0$ or $\beta < 0$ or $\beta \ne 0$
Step 2: Compute the t-score:
$t_0 = \dfrac{b_0 - 0}{s_b}$
where $b_0$ is the observed slope and $s_b$ is the standard deviation of the slope.
Step 3: Compute the P-value
P-value = $P(t > t_0)$ or $P(t < t_0)$
Step 4: Compare the P-value with the significance level α to test.
Assumptions:
1. The sample must be randomly selected
2. The scatterplot should be approximately linear. (There should be no apparent pattern in the residuals
plot)
3. The distribution of the residuals should be approximately normal.
Chi-Square Test for Goodness of Fit, Independence and Homogeneity
Step 1: Set up the hypotheses
Goodness of Fit: $H_0$: The observed distribution fits the expected distribution. $H_a$: The observed distribution doesn’t fit the expected distribution.
Independence: $H_0$: The two variables are independent. $H_a$: The two variables are not independent.
Homogeneity: $H_0$: The distributions from two populations are the same. $H_a$: The distributions from two populations are not the same.
Step 2: Compute $\chi^2$ according to the observed values and the expected values:
$\chi_0^2 = \sum \dfrac{(Obs - Exp)^2}{Exp}$
where Obs = the observed value and Exp = the expected value.
Step 3: Compute the P-value:
P-value = $P(\chi^2 > \chi_0^2)$
Goodness of Fit: df = r − 1
Independence and Homogeneity: df = (r − 1)(c − 1)
Step 4: Compare the P-value with the significance level α to test the hypothesis:
If the P-value is greater than α, then do not reject $H_0$.
If the P-value is less than α, then reject $H_0$.
Assumption:
Goodness of Fit: 1. The sample is an SRS. 2. The expected values for all cells must be at least 5.
Independence: 1. The sample is an SRS. 2. The expected values for all cells must be at least 5.
Homogeneity: 1. The samples are independent SRSs. 2. The expected values for all cells must be at least 5.
Test for Goodness of Fit: Comparing a single sample to a population model
Test for Independence: Work with a single sample classified on two variables
Test for Homogeneity: Compare samples from two or more populations about a single variable
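Step 2 for a test of independence or homogeneity can be sketched as below. Expected counts are computed as (row total × column total) / grand total, the usual rule for two-way tables. The function name and the 2 × 2 table of counts are hypothetical. The P-value step is left to a calculator or table, since it needs the chi-square distribution.

```python
def chi_square_two_way(table):
    """Return (statistic, df, expected) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum((obs - exp) ** 2 / exp
                    for obs_row, exp_row in zip(table, expected)
                    for obs, exp in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

observed = [[30, 10],
            [20, 40]]
statistic, df, expected = chi_square_two_way(observed)
all_expected_at_least_5 = all(e >= 5 for row in expected for e in row)
print(statistic, df, expected, all_expected_at_least_5)
```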
THE END