FORMULAS

Population SD: $SD_{pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
R: popsd(data)

Sample SD: $SD_{sample} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
R: sd(data)

Coefficient of Variation: $CV = \frac{SD}{\text{mean}}$

For a Box with Two Elements:
$SD_{box} = (\text{big} - \text{small}) \times \sqrt{\text{prop}_{big} \times \text{prop}_{small}}$

For the Sum of a Sample in a Box:
Estimate: $EV_{sum} = \text{sample size} \times \text{mean}_{box}$
Chance Error: $SE_{sum} = \sqrt{\text{sample size}} \times SD_{box}$

For the Prop of a Sample in a {0,1} Box:
Estimate: $EV_{prop} = \frac{EV_{sum}}{\text{sample size}} = \text{mean}_{box}$
Chance Error: $SE_{prop} = \frac{SE_{sum}}{\text{sample size}} = \frac{SD_{box}}{\sqrt{\text{sample size}}}$

SE without Replacement:
$SE_{\text{without repl}} = SE_{\text{with repl}} \times \text{correction factor}$
$\text{correction factor} = \sqrt{\frac{\text{popn size} - \text{sample size}}{\text{popn size} - 1}}$

BIAS

Selection Bias: Systematic tendency to exclude someone.
Non-Response/Consent Bias
Survivor/Adherer Bias
Interviewer's Bias: Characteristics of the interviewer that have an effect on answers.
Measurement Bias: The form of a question in a survey that affects the response (e.g. Recall Bias, Sensitive Questions).
Placebo Effect

SIMPSON'S PARADOX

          A                B
Day 1     63/90 = 70%      9/10 = 90%
Day 2     4/10 = 40%       45/90 = 50%
Total     67/100           53/100

B has the higher rate on each day, yet A has the higher rate overall.

LANGUAGE

Dimension: A data set with p variables has dimension p.
Ordinal: Ordered data.
Nominal: Not ordered.
Robust: Not affected by outliers.
A Parameter is a numerical fact about a population.
A Statistic is calculated from sample values and is used to estimate the parameter.
The Central Limit Theorem says that the distribution of the sample mean for a population with finite variance approaches normal, so long as the sample size is sufficiently large.

PREDICTION MODELS

1. Baseline Prediction: Given any value of $x$, return the mean of $y$.
2. Prediction in a Strip: Given a value of $x$, return the mean of the $y$ values corresponding to that $x$ value.
   R: mean(y[x == xi])
3. Regression Line: Predict using $y = mx + b$.
4. Predicting Percentile Ranks: Find the percentile of $x$ from the Normal Curve, then find the $y$ value at that percentile from the Normal Curve.
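The box-model formulas in the FORMULAS section can be checked numerically. The sketch below is only an illustration: the box contents, `n_draws`, `pop_size` and the helper `pop_sd()` are hypothetical names and values, not taken from the sheet. It assumes a {0,1} box and uses base R only.

```r
# Hypothetical {0,1} box: 3 tickets marked 1 and 7 tickets marked 0.
box <- c(rep(1, 3), rep(0, 7))
n_draws <- 25    # assumed sample size
pop_size <- 100  # assumed population size, for the correction factor

# Population SD: divide by n rather than n - 1 (helper defined here, not a base R function).
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Shortcut for a box with two distinct values: (big - small) * sqrt(prop_big * prop_small).
prop_big <- mean(box == max(box))
sd_shortcut <- (max(box) - min(box)) * sqrt(prop_big * (1 - prop_big))

# Sum of the draws.
ev_sum <- n_draws * mean(box)
se_sum <- sqrt(n_draws) * pop_sd(box)

# Proportion of 1s among the draws.
ev_prop <- ev_sum / n_draws
se_prop <- se_sum / n_draws

# SE when drawing without replacement.
correction <- sqrt((pop_size - n_draws) / (pop_size - 1))
se_prop_without_repl <- se_prop * correction

print(c(ev_sum = ev_sum, se_sum = se_sum, ev_prop = ev_prop, se_prop = se_prop))
```

The shortcut SD and the full population SD agree for a two-valued box, and the sample SD from `sd(box)` is slightly larger because it divides by n - 1.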
For a probability histogram, if we fix the number of draws (say 2) and repeat many times, the histogram becomes more stable by the Law of Averages. For a simulation histogram, if we increase the number of draws, the histogram of the sum becomes smoother and approaches normal.

SCATTER PLOTS

The Independent Variable is the $x$ value. The Dependent Variable is the $y$ value.
A strong Linear Association describes a data set that is tightly clustered around a line.
The Centre of a Scatter Plot Cloud is at $(\bar{x}, \bar{y})$.
The Horizontal Spread is measured by $SD_x$. The Vertical Spread is measured by $SD_y$.
The Correlation Coefficient, $r$, measures linear association. If $r$ is positive, there is positive linear association. If $r$ is close to $\pm 1$, there is strong linear association.

$r_{pop} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{SD_x}\right) \times \left(\frac{y_i - \bar{y}}{SD_y}\right)$
R: cor(x,y)*((n-1)/n)

$r_{sample} = r_{pop} \times \frac{n}{n-1}$
R: cor(x,y)

The Regression Line has gradient $m = r \times \frac{SD_y}{SD_x}$ and passes through the centre of the scatter plot. It is a smoothed version of the Graph of Averages, which plots the average $y$ for each $x$.
R: lm(y~x)

MEASUREMENT ERRORS

Chance Error is the inherent error in any predictive statistical model. It can be predicted by repeating measurements.
$\text{Chance Error} = \text{Predicted Value} - EV$
Bias is a constant added to a measurement, and cannot be predicted by repeating measurements.

Left-Skewed: Mean < Median. Symmetric: Mean = Median. Right-Skewed: Median < Mean.

PROBABILITY

Multiplication Rule: $P(E_1 \text{ and } E_2) = P(E_1) \times P(E_2 \mid E_1)$
If two events are independent, then $P(E_2) = P(E_2 \mid E_1)$.
Addition Rule: If two events are mutually exclusive, then $P(E_1 \text{ or } E_2) = P(E_1) + P(E_2)$.

SAMPLING

The Law of Averages states that the proportion observed in a simulation approaches the expected proportion (without necessarily equalling it) and becomes more stable as the number of simulations increases.
R: coins = sample(c(0,1), 10000, replace = TRUE)
R: cumHeads = cumsum(coins)
R: probHeads = cumHeads/(1:10000)

Simple Random Sampling
Multi-Stage Cluster Sampling
Quota Sampling
Convenience Sampling

$\text{statistic} = \text{parameter} + \text{bias} + \text{chance error}$
$\text{statistic} = \text{parameter} + \text{non-sampling error} + \text{sampling error}$

The Mean of Random Samples is Random.

THE NORMAL CURVE

The Standard Normal Curve is $Z \sim N(0, 1)$.
The General Normal Curve is $X \sim N(\mu, \sigma^2)$.
$z = \frac{x - \bar{x}}{SD}$

To find the area under a Standard Curve:
To find area less than: pnorm(0.8)
To find area more than: pnorm(0.8, lower.tail = F)
To find the area under a Normal Curve: pnorm(171, 162, 8)

If we took many samples and built a 95% CI from each, then about 95% of the CIs would contain the unknown parameter.

HYPOTHESIS TESTING

1. Set Up Hypothesis
2. Find Test Statistic: $z^* = \frac{OV - EV}{SE}$
3. Find P-Value (area test-statistic covers). The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed.
   Lower: pnorm(testStatistic)
   Two-Tail: 2*pnorm(-abs(testStatistic))
4. Conclusion: If $p < \text{significance level} = 0.05$, reject the Null Hypothesis.

TYPES OF ONE-SAMPLE HYPOTHESIS TESTING

1. One-Sample Population SD Known: Z-Test
   $z^* = \frac{OV - EV}{SD_{pop} / \sqrt{n}}$
2. One-Sample Population SD Unknown: T-Test with $n - 1$ Degrees of Freedom.
   $t^* = \frac{OV - EV}{SD_{sample} / \sqrt{n}}$
   R: t.test(data, mu = 0)

TYPES OF TWO-SAMPLE HYPOTHESIS TESTING

Check:
1. Equality of Variance: Box Plot (equal spread) and F-Test (p-value > 0.05).
   R: var.test(data1, data2)
2. Normality of Sample: Box Plot (symmetric, no outliers), Shapiro-Wilk Test (p-value > 0.05), and Q-Q Plot for a straight line.
   R: shapiro.test(data)
   R: ggplot(data)

From there:
1. Two-Sample Equal Population Variance: T-Test
   $t^* = \frac{\text{mean}_1 - \text{mean}_2 - 0}{\sqrt{SD_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
   $SD_p^2 = \frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}$
   $df = n_1 + n_2 - 2$
   R: t.test(data1, data2, var.equal=T)
2. Two-Sample Unequal Population Variance: Welch T-Test
   $t^* = \frac{\text{mean}_1 - \text{mean}_2 - 0}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}}$
   R: t.test(data1, data2, var.equal=F)
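As a rough check of the two-sample formulas, the sketch below computes the pooled and Welch test statistics by hand and compares them with `t.test()`. The data are simulated and the names `group1`, `group2` and the seed are hypothetical; it assumes a null hypothesis of zero difference in means and a two-tailed test.

```r
# Simulated data for illustration only (not from the sheet).
set.seed(1)
group1 <- rnorm(20, mean = 10, sd = 2)
group2 <- rnorm(25, mean = 11, sd = 2)
n1 <- length(group1)
n2 <- length(group2)

# Equal-variance (pooled) t-test by hand.
pooled_var <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)
t_pooled <- (mean(group1) - mean(group2) - 0) / sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2
p_pooled <- 2 * pt(-abs(t_pooled), df = df_pooled)  # two-tailed p-value

# Welch (unequal-variance) test statistic by hand.
t_welch <- (mean(group1) - mean(group2) - 0) / sqrt(var(group1) / n1 + var(group2) / n2)

# Built-in versions for comparison.
fit_pooled <- t.test(group1, group2, var.equal = TRUE)
fit_welch <- t.test(group1, group2, var.equal = FALSE)

print(c(by_hand = t_pooled, built_in = unname(fit_pooled$statistic)))
print(c(by_hand = t_welch, built_in = unname(fit_welch$statistic)))
```

The hand-computed statistics should match the built-in ones. The Welch degrees of freedom are not `n1 + n2 - 2`; `t.test()` estimates them from the data, so they are read from `fit_welch$parameter` rather than recomputed here.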