Harold's Statistics
"Cheat Sheet"
26 December 2015
Descriptive

Data: parameters (population) vs. statistics (sample). Used for describing and predicting.
Random Variable: X, Y (population); x, y (sample). The random value from the evaluated population.
Size: N (population); n (sample). The number of observations in the population/sample.
Measures of Center (measures of central tendency)
Indicate which value is typical for the data set; answer "Where is the center of the data located?"

Mean: μ = (1/N) Σ xᵢ (population); x̄ = (1/n) Σ xᵢ (sample). The average; used when each X has the same probability. The population mean includes the entire population.
Mean with f Table: μ = (1/N) Σ xᵢ fᵢ (population); x̄ = (1/n) Σ xᵢ fᵢ (sample). Measure of center for a frequency distribution.
Median: Md = the value at rank (n + 1)/2 when n is odd (the mean of the two middle values when n is even). The middle element in order of rank; more useful when data are skewed.
Mode: Mo = the most frequent value in a data set. Appropriate for categorical data.
Mid-Range: MidRange = (max + min)/2. Not often used, but easy to compute; highly sensitive to unusual values.
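As a quick numeric check of the four measures above, a minimal Python sketch (the data list is an invented example, not from the sheet):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # invented sample data, n = 6

mean = sum(data) / len(data)             # x̄ = (1/n) Σ xᵢ
median = statistics.median(data)         # middle element(s) by rank
mode = statistics.mode(data)             # most frequent value
mid_range = (max(data) + min(data)) / 2  # (max + min) / 2

print(mean, median, mode, mid_range)     # 5.0 4.0 3 6.0
```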
Measures of Variation (measures of dispersion)
Reflect the variability of the data (e.g., how different the values are from each other); answer "How spread out is the data?"

Variance:
  Population: σ² = (1/N) Σ (x − μ)² = (1/N) (Σ xᵢ² − N μ²)
  Sample: s² = (1/(n − 1)) Σ (x − x̄)² = (1/(n − 1)) (Σ xᵢ² − n x̄²)
  Not often used directly (see standard deviation); a special case of covariance when the two variables are identical.
Covariance:
  Population: σ(X, Y) = (1/N) Σ (x − μ_x)(y − μ_y) = (1/N) Σ xᵢ yᵢ − μ_x μ_y
  Sample: g = (1/(n − 1)) Σ (x − x̄)(y − ȳ) = (1/(n − 1)) (Σ xᵢ yᵢ − n x̄ ȳ)
  A measure of how much two random variables change together; a measure of "linear dependence". If X and Y are independent, then their covariance is zero.

Copyright © 2015 by Harold Toomey, WyzAnt Tutor
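The population vs. sample formulas above can be sketched directly in Python (the x and y lists are invented; since y = 2x here, each covariance is exactly twice the corresponding variance):

```python
# Population (divide by N) vs. sample (divide by n - 1) forms.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
N = len(x)
mu_x = sum(x) / N
mu_y = sum(y) / N
xbar, ybar = mu_x, mu_y  # same numbers; different roles (parameter vs. statistic)

var_pop = sum((xi - mu_x) ** 2 for xi in x) / N          # σ² = (1/N) Σ (x − μ)²
var_samp = sum((xi - xbar) ** 2 for xi in x) / (N - 1)   # s² = (1/(n−1)) Σ (x − x̄)²
cov_pop = sum(xi * yi for xi, yi in zip(x, y)) / N - mu_x * mu_y  # shortcut form
cov_samp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (N - 1)
```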
Standard Deviation:
  Population: σ = √σ² = √(Σ (x − μ)²/N) = √((Σ x²)/N − μ²)
  Sample: s_x = √(Σ (x − x̄)²/(n − 1)) = √((Σ x² − n x̄²)/(n − 1))
  Measure of variation; the average distance from the mean, in the same units as the mean. Answers "How spread out is the data?"
Pooled Standard Deviation:
  Population: σ_p = √((N₁ σ₁² + N₂ σ₂²)/(N₁ + N₂))
  Sample: s_p = √(((n₁ − 1) s₁² + (n₂ − 1) s₂²)/((n₁ − 1) + (n₂ − 1)))
  Used for inferences about two population means.
Range: Range = max − min. Not often used, but easy to compute; highly sensitive to unusual values.
Interquartile Range (IQR): IQR = Q₃ − Q₁. Less sensitive to extreme values.

Measures of Relative Standing (measures of relative position)
Indicate how a particular value compares to the others in the same data set.

Percentile: Data divided into 100 equal parts by rank.
Quartile: Data divided into 4 equal parts by rank. Used to compute the IQR.
Z-Score / Standard Score / Normal Score:
  Population: z = (x − μ)/σ, so x = μ + zσ
  Sample: z = (x − x̄)/s, so x = x̄ + zs
  The z variable measures how many standard deviations the value is away from the mean. Important in normal distributions.
  PDF/CDF on TI-84: [2nd][VARS][2] normalcdf(-1E99, z)
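A short sketch of the z-score and pooled standard deviation formulas above (all the summary numbers are invented for illustration):

```python
import math

mu, sigma = 100.0, 15.0  # invented population mean and standard deviation
x = 130.0
z = (x - mu) / sigma     # how many σ's x lies from the mean
x_back = mu + z * sigma  # inverting: x = μ + zσ

n1, s1 = 10, 4.0         # group 1: sample size and sample SD (invented)
n2, s2 = 15, 6.0         # group 2
s_p = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / ((n1 - 1) + (n2 - 1)))
```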
Regression and Correlation

Response Variable: Y (output).
Covariate / Predictor Variable: X (input).
Least-Squares Regression Line: ŷ = b₀ + b₁x, where b₁ is the slope and b₀ is the y-intercept. (x̄, ȳ) is always a point on the line.
Regression Coefficient (Slope): b₁ = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (x − x̄)² = r (s_y / s_x)
Regression Slope Intercept: b₀ = ȳ − b₁x̄
Linear Correlation Coefficient (Sample): r = (1/(n − 1)) Σ ((x − x̄)/s_x)((y − ȳ)/s_y) = g/(s_x s_y). Measures the strength and direction of the linear relationship between x and y.
  r = ±1    Perfect correlation
  r = +0.9  Positive linear relationship
  r = −0.9  Negative linear relationship
  r ≈ 0     No relationship
  r ≥ 0.8   Strong correlation
  r ≤ 0.5   Weak correlation
  Correlation DOES NOT imply causation.
Residual: êᵢ = yᵢ − ŷᵢ = yᵢ − (b₀ + b₁xᵢ). Residual = Observed − Predicted. The residuals sum to zero: Σ eᵢ = Σ (yᵢ − ŷᵢ) = 0.
Standard Error of Regression Slope: s_b₁ = √(Σ (yᵢ − ŷᵢ)²/(n − 2)) / √(Σ (xᵢ − x̄)²)
Coefficient of Determination: r². Measures how well the line fits the data; it is the proportion of the variation in y explained by the line of best fit, and determines how certain we can be in making predictions.
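The least-squares and correlation formulas above, sketched on an invented five-point data set:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented data
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

b1 = sxy / sxx                  # slope
b0 = ybar - b1 * xbar           # intercept; (x̄, ȳ) lies on the line
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # sum to zero
```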
Proportions

Proportion: P = p = x/N (population); p̂ = x/n (sample). Probability of success: the proportion of elements that have a particular attribute. x = p̂n = the frequency, or number of members in the sample that have the specified attribute.
Failure Proportion: Q = 1 − P, q = 1 − p (population); q̂ = 1 − p̂ (sample). Probability of failure: the proportion of elements that do not have the specified attribute.
Variance of Proportion: σ² = pq/N = p(1 − p)/N (population); s_p² = p̂q̂/(n − 1) = p̂(1 − p̂)/(n − 1) (sample). The sample form is considered an unbiased estimate of the true population variance.
Pooled Proportion: p̂_p = (x₁ + x₂)/(n₁ + n₂) = (p̂₁n₁ + p̂₂n₂)/(n₁ + n₂). (No separate population form.)
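A sketch of the sample and pooled proportion formulas above (the counts are invented); the two pooled forms from the sheet agree:

```python
x1, n1 = 30, 100   # successes / sample size in group 1 (invented)
x2, n2 = 45, 150   # group 2
p1_hat = x1 / n1   # p̂ = x/n
p2_hat = x2 / n2
q1_hat = 1 - p1_hat  # q̂ = 1 − p̂

# Pooled proportion: both equivalent forms from the sheet.
p_pooled = (x1 + x2) / (n1 + n2)
p_pooled_alt = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)
```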
Discrete Random Variables

Random Variable: X. Derived from a probability experiment with different probabilities for each X. Used in discrete or finite PDFs.
Expected Value of X: E(X) = μ_x = Σ pᵢ xᵢ = Σ P(X) X. E(X) is the same as the mean x̄. X takes some countable number of specific values; discrete.
Variance of X: Var(X) = σ_x² = Σ pᵢ (xᵢ − μ_x)² = Σ P(X) (X − E(X))² = Σ X² P(X) − E(X)² = E(X²) − E(X)². Calculate variances with proportions or expected values.
Standard Deviation of X: SD(X) = √Var(X), σ_x = √σ_x². Calculate standard deviations with proportions.
Sum of Probabilities: Σ pᵢ = 1 (i = 1 to N). If every outcome has the same probability, then pᵢ = 1/N.
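The discrete-PDF formulas above, sketched for a fair six-sided die (the equal-probability case, pᵢ = 1/N):

```python
import math

values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6                    # pᵢ = 1/N for each face
assert abs(sum(probs) - 1) < 1e-12     # Σ pᵢ = 1

ex = sum(p * v for p, v in zip(probs, values))        # E(X)
ex2 = sum(p * v**2 for p, v in zip(probs, values))    # E(X²)
var_x = ex2 - ex**2                                   # shortcut form
var_x_direct = sum(p * (v - ex) ** 2 for p, v in zip(probs, values))
sd_x = math.sqrt(var_x)                               # SD(X) = √Var(X)
```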
Statistical Inference

Sampling Distribution: The probability distribution of a statistic; a statistic of a statistic.
Central Limit Theorem (CLT): √n (x̄ − μ) ≈ 𝒩(0, σ²). Lots of x̄'s form a bell curve, approximating the normal distribution, regardless of the shape of the distribution of the individual xᵢ's.
Sample Mean: μ_x̄ = μ, σ_x̄ = σ/√n. (2x accuracy needs 4x n.)
  Rule of Thumb: use if n ≥ 30 or if the population distribution is normal.
Sample Proportion: μ = p, σ_p = √(pq/n) = √(p(1 − p)/n).
  Rule of Thumb (Large Counts Condition): use if np ≥ 10 and n(1 − p) ≥ 10.
  10 Percent Condition: use if N ≥ 10n.
Difference of Sample Means:
  Mean: E(x̄₁ − x̄₂) = μ_x̄₁ − μ_x̄₂
  Standard Deviation: σ_x̄₁₋x̄₂ = √(σ₁²/n₁ + σ₂²/n₂); special case when σ₁ = σ₂: σ_x̄₁₋x̄₂ = σ √(1/n₁ + 1/n₂)
Difference of Sample Proportions:
  Mean: Δp̂ = p̂₁ − p̂₂
  Standard Deviation: σ = √(p₁q₁/n₁ + p₂q₂/n₂) = √(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂); special case when p₁ = p₂: σ = √(pq) √(1/n₁ + 1/n₂) = √(p(1 − p)) √(1/n₁ + 1/n₂)
Bias: Caused by non-random samples.
Variability: Caused by too small a sample (n < 30).
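A small simulation sketch of the CLT statements above: sample means drawn from a skewed (exponential) population still center on μ with spread close to σ/√n. The population and sample sizes are arbitrary choices:

```python
import random
import statistics

random.seed(42)
population = [random.expovariate(1.0) for _ in range(100_000)]  # skewed, not normal
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 36
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(2000)
]
center = statistics.fmean(sample_means)   # should be close to μ
spread = statistics.pstdev(sample_means)  # should be close to σ/√n
```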
Confidence Intervals for One Population Mean

Standardized Test Statistic (of the variable x̄):
  z = (statistic − parameter)/(standard deviation of statistic) = (x̄ − μ)/(σ/√n)
Confidence Interval (C) for μ / z-interval (σ known, normal population or large sample):
  z-interval = statistic ± (critical value)(standard deviation of statistic)
  z-interval = x̄ ± E = x̄ ± z_α/2 · σ/√n, where α/2 = (100 − C)/2 (as percentages)
Margin of Error / Standard Error (SE) (for the estimate of μ):
  E = z_α/2 · σ/√n; SE(x̄) = σ/√n, or s/√n when σ is unknown
Sample Size (for estimating μ, rounded up): n = (z_α/2 · σ/E)²
Critical Value: z_α/2 = the z-score for a tail probability of α/2. The significance level α is always set ahead of time, usually at a threshold value of 0.05 (5%) or 0.01 (1%).

Hypothesis Testing

Null Hypothesis H₀: Assumed true for the purpose of carrying out the hypothesis test.
  • Always contains "="
  • The null value implies a specific sampling distribution for the test statistic
  • Can be rejected, or not rejected, but NEVER supported
Alternative Hypothesis H₁ or Hₐ: Supported only by carrying out the test, if the null hypothesis can be rejected.
  • Always contains ">" (right-tailed), "<" (left-tailed), or "≠" (two-tailed); tail selection is important
  • Without any specific value for the parameter of interest, the sampling distribution is unknown
  • Can be supported (by rejecting the null), or not supported (by failing to reject the null), but NEVER rejected
Steps:
  1. Formulate the null and alternative hypotheses
  2. If using the traditional approach, observe sample data
  3. Compute a test statistic from the sample data
  4. If using the p-value approach, compute the p-value from the test statistic
  5. Reject the null hypothesis (supporting the alternative):
     a. p-value approach: at significance level α, if the p-value ≤ α
     b. traditional approach: if the test statistic falls in the rejection region
     Otherwise, fail to reject the null hypothesis
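A sketch of the z-interval, margin of error, and sample-size formulas above, using Python's statistics.NormalDist for the critical value (x̄, σ, n, and C are invented numbers):

```python
import math
from statistics import NormalDist

xbar, sigma, n = 50.0, 8.0, 64  # invented sample mean, known σ, sample size
C = 95                          # confidence level in percent
alpha = (100 - C) / 100         # 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{α/2} ≈ 1.96
E = z_crit * sigma / math.sqrt(n)             # margin of error
interval = (xbar - E, xbar + E)               # x̄ ± E

# Sample size needed to shrink the margin of error to E = 1 (round up):
n_needed = math.ceil((z_crit * sigma / 1.0) ** 2)
```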
Test Statistics

Hypothesis Test Statistics for H₀:
Population/Sample Proportion: Z = (p̂ − p₀)/SE(p̂) = (p̂ − p₀)/√(p₀q₀/n)
  Standard normal Z under H₀. Assumes np ≥ 15 and nq ≥ 15.
Population/Sample Mean (variance unknown): t = (x̄ − μ₀)/SE(x̄) = (x̄ − μ₀)/(s/√n)
  t distribution with df = n − 1 under H₀.
Population/Sample Mean (variance known): z = (x̄ − μ₀)/(σ/√n)
  Both mean tests assume the data is normally distributed or n ≥ 30, since t approaches the standard normal Z when n is sufficiently large, due to the CLT.
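The one-sample test statistics above, sketched with invented inputs:

```python
import math

# Proportion: H0: p = 0.5, with 60 successes in n = 100 (invented counts).
p_hat, p0, n = 0.60, 0.50, 100
z_prop = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)  # standard normal under H0

# Mean, variance unknown: H0: μ = 25, with x̄ = 26.2, s = 4, m = 16 (invented).
xbar, mu0, s, m = 26.2, 25.0, 4.0, 16
t_stat = (xbar - mu0) / (s / math.sqrt(m))  # compare to t with df = m − 1
```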
Goodness-of-Fit Test – Chi-Square

Expected Frequencies: E = np, where p = proportion and n = sample size.
Chi-Square Test Statistic: χ² = Σ (observed − expected)²/expected
  Large χ² values are evidence against the null hypothesis, which states that the observed and expected percentages match (i.e., any differences are attributed to chance).
  (The related statistic χ² = (n − 1)s²/σ² is the one used for a test of a single variance.)
Degrees of Freedom: df = k − 1, where k = number of possible values (categories) for the variable under consideration.
Independence Test – Chi-Square

Expected Frequencies: E = (row total)(column total)/n
Chi-Square Test Statistic: χ² = Σ (O − E)²/E (see above)
Degrees of Freedom: df = (r − 1)(c − 1), where r = number of rows and c = number of columns (the numbers of possible values for the two variables under consideration).
Formulating Hypotheses

If the claim contains …            then the hypothesis test is   and the claim is represented by …
"…is not equal to…"                Two-tailed ≠                  H₁
"…is less than…"                   Left-tailed <                 H₁
"…is greater than…"                Right-tailed >                H₁
"…is equal to…" or "…is exactly…"  Two-tailed =                  H₀
"…is at least…"                    Left-tailed <                 H₀
"…is at most…"                     Right-tailed >                H₀
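The claim-to-tails table above can be written as a small lookup (the function name `formulate` is hypothetical, not from the sheet):

```python
def formulate(claim_phrase: str):
    """Return (tail, symbol, which hypothesis represents the claim)."""
    mapping = {
        "is not equal to": ("two-tailed", "≠", "H1"),
        "is less than":    ("left-tailed", "<", "H1"),
        "is greater than": ("right-tailed", ">", "H1"),
        "is equal to":     ("two-tailed", "=", "H0"),
        "is at least":     ("left-tailed", "<", "H0"),
        "is at most":      ("right-tailed", ">", "H0"),
    }
    return mapping[claim_phrase]
```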