Exam-Cheat-Sheet

BIAS
Selection Bias: Systematic tendency to exclude someone.
Non-Response/Consent Bias
Survivor/Adherer Bias
Interviewer's Bias: Characteristics of the interviewer that have an effect on answers.
Measurement Bias: The form of a question in a survey that affects the response (e.g. Recall Bias, Sensitive Questions).
Placebo Effect

SIMPSON'S PARADOX
          A                B
Day 1     63/90 = 70%      9/10 = 90%
Day 2     4/10 = 40%       45/90 = 50%
Total     67/100           54/100

LANGUAGE
Dimension: A data set with p variables has dimension p.
Ordinal: Ordered data.
Nominal: Not ordered.
Robust: Not affected by outliers.
A Parameter is a numerical fact about a population.
A Statistic is calculated from sample values and is used to estimate the parameter.

PREDICTION MODELS
1. Baseline Prediction: Given any value of $x$, return the mean of $y$.
2. Prediction in a Strip: Given a value of $x$, return the mean of the $y$ values corresponding to that $x$ value. R: mean(y[x == xi])
3. Regression Line: using $y = mx + b$.
4. Predicting Percentile Ranks: Find the percentile of $x$ from the Normal Curve. Find the $y$ value from the Normal Curve.

FORMULAS
Population SD: $SD_{pop} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$   R: popsd(data)
Sample SD: $SD_{sample} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$   R: sd(data)
Coefficient of Variation: $CV = \frac{SD}{mean}$

For a Box with Two Elements:
$SD_{box} = (big - small) \sqrt{prop_{big} \times prop_{small}}$

For the Sum of a Sample in a Box:
Estimate: $EV_{sum} = \text{sample size} \times mean_{box}$
Chance Error: $SE_{sum} = \sqrt{\text{sample size}} \times SD_{box}$

For the Prop of a Sample in a {0,1} Box:
Estimate: $EV_{prop} = \frac{EV_{sum}}{\text{sample size}} = mean_{box}$
Chance Error: $SE_{prop} = \frac{SE_{sum}}{\text{sample size}} = \frac{SD_{box}}{\sqrt{\text{sample size}}}$

SE without Replacement:
$SE_{\text{without repl}} = SE_{\text{with repl}} \times \text{correction factor}$
$\text{correction factor} = \sqrt{\frac{\text{pop size} - \text{sample size}}{\text{pop size} - 1}}$

The Central Limit Theorem says that the distribution of the sample mean for a population with finite variance approaches normal, so long as the sample size is sufficiently large.
For a probability histogram, if we fix the number of draws (say 2) and repeat many times, the histogram gets more stable by the Law of Averages.
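The box-model formulas and the Central Limit Theorem statement above can be checked by simulation. The following is a minimal sketch in R, not part of the original sheet: the box contents, the number of draws, the seed, and the helper name pop_sd are all illustrative assumptions.

```r
# Hypothetical {0,1} box with one "1" ticket and three "0" tickets (assumption).
box <- c(0, 0, 0, 1)
n_draws <- 100

# Population SD (divide by n), written out by hand; pop_sd is a hypothetical helper.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Expected value and standard error for the sum of n_draws draws.
ev_sum <- n_draws * mean(box)
se_sum <- sqrt(n_draws) * pop_sd(box)

# Shortcut for a box with two elements: (big - small) * sqrt(prop_big * prop_small).
sd_shortcut <- (1 - 0) * sqrt(mean(box == 1) * mean(box == 0))

# Simulate the sum many times, drawing with replacement.
set.seed(1)
sums <- replicate(10000, sum(sample(box, n_draws, replace = TRUE)))

# The simulated mean and SD of the sums should sit close to EV and SE.
c(ev = ev_sum, simulated_mean = mean(sums))
c(se = se_sum, simulated_sd = sd(sums))
```

Calling hist(sums) afterwards draws the simulation histogram of the sum, which should look roughly bell-shaped for this many draws.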
For a simulation histogram, if we increase the number of draws, the histogram of the sum gets smoother and approaches normal.

SCATTER PLOTS
The Independent Variable is the $x$ value. The Dependent Variable is the $y$ value.
A strong Linear Association describes a tightly clustered data set.
The Centre of a Scatter Plot Cloud is at $(\bar{x}, \bar{y})$.
The Horizontal Spread is measured by $\sigma_x$. The Vertical Spread is measured by $\sigma_y$.
The Correlation Coefficient, $r$, measures linear association. If $r$ is positive, there is positive linear association. If $r$ is close to $\pm 1$, there is strong linear association.
$r_{pop} = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\sigma_x} \times \frac{y_i - \bar{y}}{\sigma_y}$   R: cor(x,y)*((n-1)/n)
$r_{sample} = r_{pop} \times \frac{n}{n-1}$   R: cor(x,y)
The Regression Line has gradient $m = r_{sample} \cdot \frac{\sigma_y}{\sigma_x}$ and passes through the centre of the scatter plot. It is a smoothed version of the Graph of Averages, which plots the average $y$ for each $x$. R: lm(y~x)

MEASUREMENT ERRORS
Chance Error is the inherent error in any predictive statistical model. It can be predicted by repeating measurements.
$\text{Chance Error} = \text{Predicted Value} - EV$
Bias is a constant added to a measurement, and cannot be predicted.

[Figure: three histograms. Left-Skewed (mean below median), Symmetric (mean equals median), Right-Skewed (median below mean).]

PROBABILITY
Multiplication Rule: $P(E_1 \text{ and } E_2) = P(E_1) \times P(E_2 | E_1)$
If two events are independent, then $P(E_2) = P(E_2 | E_1)$.
If two events are mutually exclusive, then the Addition Rule applies: $P(E_1 \text{ or } E_2) = P(E_1) + P(E_2)$

SAMPLING
The Law of Averages states that the proportion observed in a simulation approaches (but need not equal) the true probability and becomes more stable as the number of simulations increases.
R:
    coins = sample(c(0,1), 10000, replace = TRUE)
    cumHeads = cumsum(coins)
    probHeads = cumHeads/(1:10000)
Simple Random Sampling
Multi-Stage Cluster Sampling
Quota Sampling
Convenience Sampling
$\text{statistic} = \text{parameter} + \text{bias} + \text{chance error}$
$\text{statistic} = \text{parameter} + \text{non-sampling error} + \text{sampling error}$

THE NORMAL CURVE
The Standard Normal Curve is $Z \sim N(0, 1)$.
The General Normal Curve is $X \sim N(\mu, \sigma^2)$.
$Z = \frac{x - \bar{x}}{SD}$
To find the area under a Standard Curve:
To find area less than: pnorm(0.8)
To find area more than: pnorm(0.8, lower.tail = FALSE)
To find the area under a Normal Curve: pnorm(171, 162, 8)

If we took many samples and built a 95% CI from each, then about 95% of the CIs would contain the unknown parameter.

HYPOTHESIS TESTING
1. Set Up Hypothesis
2. Find Test Statistic: $z^* = \frac{OV - EV}{SE}$
3. Find P-Value (area test-statistic covers). The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed.
Lower: pnorm(testStatistic)
Two-Tail: 2*pnorm(-abs(testStatistic))
4. Conclusion: If $p < \text{significance level} = 0.05$, reject the Null Hypothesis.

TYPES OF ONE-SAMPLE HYPOTHESIS TESTING
1. One-Sample Population SD Known: Z-Test
$z^* = \frac{\bar{x} - \mu_0}{SD_{pop} / \sqrt{n}}$
2. One-Sample Population SD Unknown: T-Test with $n - 1$ Degrees of Freedom.
$t^* = \frac{\bar{x} - \mu_0}{SD_{sample} / \sqrt{n}}$   R: t.test(data, mu = 0)

TYPES OF TWO-SAMPLE HYPOTHESIS TESTING
Check:
1. Equality of Variance: Box Plot (equal spread) and F-Test (p-value > 0.05). R: var.test(data1, data2)
2. Normality of Sample: Box Plot (symmetric, no outliers), Shapiro-Wilk Test (p-value > 0.05), and Q-Q Plot for a straight line. R: shapiro.test(data), and with ggplot2: ggplot(data.frame(x = data), aes(sample = x)) + stat_qq() + stat_qq_line()
From there:
1. Two-Sample Equal Population Variance: T-Test
$t^* = \frac{mean_1 - mean_2 - 0}{SD_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
$SD_p^2 = \frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{n_1 + n_2 - 2}$
$df = n_1 + n_2 - 2$
R: t.test(data1, data2, var.equal = TRUE)
2. Two-Sample Unequal Population Variance: Welch T-Test
$t^* = \frac{mean_1 - mean_2 - 0}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}}$
R: t.test(data1, data2, var.equal = FALSE)

The Mean of Random Samples is Random.

[Figure: box plot with Q1 and Q3 marked.]
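The four hypothesis-testing steps and the one-sample t-test formula above can be traced by hand and compared against t.test. This is a minimal sketch in R, not part of the original sheet: the data values, the hypothesised mean mu0, and the variable names are made up for illustration.

```r
# Hypothetical sample (made-up values) and hypothesised mean under the null.
x <- c(5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7)
mu0 <- 5
n <- length(x)

# Step 1: H0: population mean = mu0, H1: population mean != mu0.

# Step 2: test statistic (OV - EV) / SE, using the sample SD.
se <- sd(x) / sqrt(n)
t_stat <- (mean(x) - mu0) / se

# Step 3: two-tailed p-value from the t distribution with n - 1 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Step 4: reject the null hypothesis if the p-value is below 0.05.
reject_null <- p_value < 0.05

# Cross-check: the built-in test should give the same statistic and p-value.
check <- t.test(x, mu = mu0)
c(by_hand = t_stat, t_test = unname(check$statistic))
c(by_hand = p_value, t_test = check$p.value)
```

Using -abs(t_stat) inside pt keeps the two-tailed p-value correct whether the test statistic is positive or negative.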
