Uploaded by Gabriel Carmody

Exam-Cheat-Sheet

FORMULAS
$SD_{pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
popsd(data)
BIAS
Selection Bias: Systematic tendency to exclude certain people from the sample.
SIMPSON'S PARADOX
         A               B
Day 1    63/90 = 70%     9/10 = 90%
Day 2    4/10 = 40%      45/90 = 50%
Total    67/100          54/100
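The reversal in the table can be checked with a few lines of R. This is only a sketch: the sheet does not label what is being counted, so the names `success` and `attempts` are assumptions, and the totals are recomputed from the four cell counts.

```r
# Assumed reading of each cell: first number = successes, second = attempts.
success <- matrix(c(63, 9,
                    4, 45),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Day 1", "Day 2"), c("A", "B")))
attempts <- matrix(c(90, 10,
                     10, 90),
                   nrow = 2, byrow = TRUE,
                   dimnames = dimnames(success))

daily_rate <- success / attempts                      # rate per day, per group
total_rate <- colSums(success) / colSums(attempts)    # pooled rate per group

# B has the higher rate on every single day ...
b_wins_each_day <- all(daily_rate[, "B"] > daily_rate[, "A"])
# ... yet A has the higher rate once the days are pooled.
a_wins_overall <- unname(total_rate["A"] > total_rate["B"])
```

The flip happens because A made most of its attempts on the easier day (Day 1) while B made most of its attempts on the harder day (Day 2).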
Non-Response/Consent Bias
Survivor/Adherer Bias
Interviewer's Bias: Characteristics of the interviewer that have an effect on answers.
$SD_{sample} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
sd(data)
$\text{Coefficient of Variation} = \frac{SD}{\text{mean}}$
For a Box with Two Elements:
$SD_{box} = (big - small) \times \sqrt{prop_{big} \times prop_{small}}$
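A short numerical check of the SD formulas, as a sketch only. `popsd()` on the sheet is not part of base R, so a hypothetical helper `pop_sd()` stands in for it here, and the data values and the box are made up for illustration.

```r
# Hypothetical helper: population SD (divides by n). Assumed to match what
# popsd() on the sheet computes; popsd() itself is not a base R function.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

data <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up values
n <- length(data)

sd_pop    <- pop_sd(data)            # divides by n
sd_sample <- sd(data)                # base R sd() divides by n - 1

# The two SDs differ only by the factor sqrt((n - 1) / n).
sd_pop_from_sample <- sd_sample * sqrt((n - 1) / n)

# Two-element box with 3 ones and 7 zeros: shortcut versus direct calculation.
box <- c(rep(1, 3), rep(0, 7))
sd_box_shortcut <- (1 - 0) * sqrt(0.3 * 0.7)
sd_box_direct   <- pop_sd(box)
```

The shortcut and the direct calculation agree because a box is treated as a whole population, so the population SD (not `sd()`) is the one to compare against.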
For the Sum of a Sample in a Box:
Estimate: $EV_{sum} = \text{sample size} \times mean_{box}$
Chance Error: $SE_{sum} = \sqrt{\text{sample size}} \times SD_{box}$
For the Proportion of a Sample in a {0,1} Box:
Estimate: $EV_{prop} = \frac{EV_{sum}}{\text{sample size}} = mean_{box}$
Chance Error: $SE_{prop} = \frac{SE_{sum}}{\text{sample size}} = \frac{SD_{box}}{\sqrt{\text{sample size}}}$
Measurement Bias: The form of a question in a survey that affects the response (e.g. Recall Bias, Sensitive Questions).
Placebo Effect
PREDICTION MODELS
1. Baseline Prediction: Given any value of $x$, return the mean of $y$.
2. Prediction in a Strip: Given a value of $x$, return the mean of the $y$ values corresponding to that $x$ value.
mean(y[x == xi])
3. Regression Line: using $y = mx + b$
4. Predicting Percentile Ranks: Find the percentile of $x$ from the Normal Curve, then find the corresponding $y$ value from the Normal Curve.
LANGUAGE
Dimension: A data set with p variables has dimension p.
Ordinal: Ordered data.
Nominal: Not ordered.
Robust: Not affected by outliers.
A Parameter is a numerical fact about a population.
A Statistic is a numerical fact calculated from sample values, used to estimate the parameter.
SE without Replacement:
$SE_{\text{without repl}} = SE_{\text{with repl}} \times \text{correction factor}$
$\text{correction factor} = \sqrt{\frac{\text{pop size} - \text{sample size}}{\text{pop size} - 1}}$
The Central Limit Theorem says that the distribution of the sample mean for a population with finite variance approaches normal, so long as the sample size is sufficiently large.
If we fix the number of draws (say 2) and repeat the simulation many times, the simulation histogram stabilises towards the probability histogram, by the Law of Averages.
If we increase the number of draws, the histogram of the sum gets smoother and approaches normal.
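The box-model formulas for the sum can be compared against a simulation. The sketch below assumes a made-up box and made-up sizes, and reuses a hypothetical `pop_sd()` helper for the SD of the box; the seed is fixed only so the run is repeatable.

```r
set.seed(1)                           # fixed seed for a repeatable run

box     <- c(0, 1, 1, 4)              # made-up box
draws   <- 25                         # sample size (number of draws)
repeats <- 10000                      # number of simulated samples

# Hypothetical helper: population SD of the box (divides by n).
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

ev_sum <- draws * mean(box)           # EV_sum = sample size x mean_box
se_sum <- sqrt(draws) * pop_sd(box)   # SE_sum = sqrt(sample size) x SD_box

# Draw with replacement and record the sum of each simulated sample.
sums <- replicate(repeats, sum(sample(box, draws, replace = TRUE)))

sim_mean <- mean(sums)                # should land close to ev_sum
sim_sd   <- sd(sums)                  # should land close to se_sum
```

Calling `hist(sums)` afterwards draws the simulation histogram of the sum; raising `draws` should make it look smoother and closer to the normal curve, and raising `repeats` should make it more stable from run to run.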
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
MEASUREMENT ERRORS
Chance Error is the inherent error in any predictive statistical model. It can be estimated by repeating measurements. $\text{Chance Error} = \text{Predicted Value} - EV$
Bias is a constant added to a measurement, and cannot be estimated by repeating measurements.
SCATTER PLOTS
The Independent Variable is the x value.
The Dependent Variable is the y value.
A strong Linear Association describes a tightly clustered data set.
The Centre of a Scatter Plot Cloud is at $(\bar{x}, \bar{y})$. The Horizontal Spread is measured by $\sigma_x$. The Vertical Spread is measured by $\sigma_y$.
The Correlation Coefficient, $r$, measures linear association. If $r$ is positive, there is positive linear association. If $r$ is close to $\pm 1$, there is strong linear association.
$r_{pop} = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \bar{x}}{\sigma_x} \times \frac{y_i - \bar{y}}{\sigma_y}$
cor(x,y)*((n-1)/n)
$r_{sample} = r_{pop} \times \frac{n}{n-1}$
cor(x,y)
The Regression Line has gradient $m = r_{sample} \cdot \frac{\sigma_y}{\sigma_x}$ and passes through the centre of the scatter plot. It is a smoothed version of the Graph of Averages, which plots the average $y$ for each $x$.
lm(y~x)
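The gradient formula and the first three prediction models can be tried on a small data set. This is a sketch with made-up `x` and `y` values; `r` is taken directly from `cor(x, y)`, and `sd()` is used for both spreads, which leaves the ratio of the SDs unchanged.

```r
x <- c(1, 2, 3, 4, 5)                 # made-up data
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r <- cor(x, y)                        # correlation coefficient
m <- r * sd(y) / sd(x)                # gradient = r x (SD of y / SD of x)
b <- mean(y) - m * mean(x)            # line passes through (mean x, mean y)

fit <- lm(y ~ x)                      # least-squares line for comparison

baseline <- mean(y)                   # 1. baseline prediction
strip    <- mean(y[x == 3])           # 2. prediction in a strip at x = 3
line_at3 <- m * 3 + b                 # 3. regression line at x = 3
```

`coef(fit)` returns the intercept and gradient in that order, so it can be compared with `c(b, m)`. Because 3 is the mean of `x` here, the regression prediction at `x = 3` equals the baseline prediction.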
PROBABILITY
Multiplication Rule:
$P(E_1 \text{ and } E_2) = P(E_1) \times P(E_2 \mid E_1)$
If two events are independent, then
$P(E_2) = P(E_2 \mid E_1)$
Addition Rule:
If two events are mutually exclusive, then
$P(E_1 \text{ or } E_2) = P(E_1) + P(E_2)$
SAMPLING
The Law of Averages states that the proportion from a simulation approaches the true probability (but need not equal it) and becomes more stable as the number of simulations increases.
coins = sample(c(0,1), 10000, replace = TRUE)
cumHeads = cumsum(coins)
probHeads = cumHeads/(1:10000)
Simple Random Sampling
Multi-Stage Cluster Sampling
Quota Sampling
Convenience Sampling
$\text{statistic} = \text{parameter} + \text{bias} + \text{chance error}$
$\text{statistic} = \text{parameter} + \text{non-sampling error} + \text{sampling error}$
THE NORMAL CURVE
The Standard Normal Curve is $Z \sim N(0, 1)$
The General Normal Curve is $X \sim N(\mu, \sigma^2)$
$Z = \frac{x - \bar{x}}{SD}$
To find the area under a Standard Curve:
To find area less than: pnorm(0.8)
To find area more than: pnorm(0.8, lower.tail = FALSE)
TYPES OF ONE-SAMPLE HYPOTHESIS TESTING
1. One-Sample Population SD Known: Z-Test
$z^* = \frac{OV - EV}{SD_{pop} / \sqrt{n}}$
2. One-Sample Population SD Unknown: T-Test with $n - 1$ Degrees of Freedom.
$t^* = \frac{OV - EV}{SD_{sample} / \sqrt{n}}$
t.test(data, mu = 0)
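The one-sample statistics can be worked by hand and compared with `t.test()`. The sketch below uses a made-up sample and a made-up hypothesised mean `mu0`; the population SD used for the Z-test is an assumed value, since a Z-test needs it to be known in advance.

```r
data <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2)   # made-up sample
mu0  <- 5                                           # hypothesised mean (EV)
n    <- length(data)

# T-test by hand: SD estimated from the sample, n - 1 degrees of freedom.
t_stat <- (mean(data) - mu0) / (sd(data) / sqrt(n))
p_t    <- 2 * pt(-abs(t_stat), df = n - 1)          # two-tailed p-value

fit <- t.test(data, mu = mu0)                       # built-in equivalent

# Z-test by hand: 0.4 is an assumed, "known" population SD.
sd_pop <- 0.4
z_stat <- (mean(data) - mu0) / (sd_pop / sqrt(n))
p_z    <- 2 * pnorm(-abs(z_stat))                   # two-tailed p-value
```

`fit$statistic` and `fit$p.value` should agree with `t_stat` and `p_t`. The Z-test has no built-in base R function, so `pnorm()` is used directly.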
To find the area under a General Normal Curve (area less than 171, with mean 162 and SD 8):
pnorm(171, 162, 8)
TYPES OF TWO-SAMPLE HYPOTHESIS TESTING
Check:
1. Equality of Variance: Box Plot (equal spread) and F-Test (p-value > 0.05).
var.test(data1, data2)
2. Normality of Sample: Box Plot (symmetric, no outliers), Shapiro-Wilk Test (p-value > 0.05), and Q-Q Plot for a straight line.
shapiro.test(data)
qqnorm(data)
From there:
1. Two-Sample Equal Population Variance: T-Test
$t^* = \frac{mean_1 - mean_2 - 0}{\sqrt{SD_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
$SD_p^2 = \frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{n_1 + n_2 - 2}$
$df = n_1 + n_2 - 2$
t.test(data1, data2, var.equal=T)
2. Two-Sample Unequal Population Variance: Welch T-Test
$t^* = \frac{mean_1 - mean_2 - 0}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}}$
t.test(data1, data2, var.equal=F)
If we took many samples and built a 95% CI from each, then about 95% of the CIs would contain the unknown parameter.
The Mean of Random Samples is Random.
HYPOTHESIS TESTING
1. Set Up Hypothesis
2. Find Test Statistic
$z^* = \frac{OV - EV}{SE}$
3. Find P-Value (area test-statistic covers)
The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed, assuming the Null Hypothesis is true.
Lower: pnorm(testStatistic)
Two-Tail: 2*pnorm(-abs(testStatistic))
4. Conclusion
If $p < \text{significance level} = 0.05$, reject the Null Hypothesis.
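The four steps can be followed for a two-sample test, with the by-hand statistics compared against `t.test()`. This is a sketch on made-up samples `data1` and `data2`, with the null hypothesis assumed to be "the difference in means is 0".

```r
data1 <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)         # made-up samples
data2 <- c(4.4, 4.8, 5.0, 4.6, 5.2, 4.7, 4.9)
n1 <- length(data1)
n2 <- length(data2)

# Step 2: pooled test statistic under H0 (difference in means is 0).
sd_p2  <- ((n1 - 1) * var(data1) + (n2 - 1) * var(data2)) / (n1 + n2 - 2)
t_pool <- (mean(data1) - mean(data2) - 0) / sqrt(sd_p2 * (1 / n1 + 1 / n2))
df     <- n1 + n2 - 2

# Step 3: two-tailed p-value from the t curve with df degrees of freedom.
p_pool <- 2 * pt(-abs(t_pool), df)

# Step 4: conclusion at the 0.05 significance level.
reject_h0 <- p_pool < 0.05

# Welch statistic: the two variances are not pooled.
t_welch <- (mean(data1) - mean(data2)) / sqrt(var(data1) / n1 + var(data2) / n2)

fit_pool  <- t.test(data1, data2, var.equal = TRUE)
fit_welch <- t.test(data1, data2, var.equal = FALSE)
```

`fit_pool$statistic` and `fit_welch$statistic` should match `t_pool` and `t_welch`. The Welch degrees of freedom are computed inside `t.test()` and are generally not a whole number.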