Introduction to hypothesis testing + Probability
Seminar 6

A difficult mock question for mid-term
Plot the following graph. People who use Facebook only (but not Twitter) are generally happier than those who use Twitter only (but not Facebook). This tendency is weakest among new subscribers. However, after a certain number of years, there is a declining trend for both groups, with a sharper decline among Twitter users.
[Figure axes: y-axis “Happiness” (1 to 11); x-axis “Subscriber type” (New, Old); legend: Facebook-only, Twitter-only]

Today’s Question
• We want to know whether salary bonuses increase people’s psychological well-being. The average well-being of Delhi’s residents is 3.00 (SD = 1.00). We randomly sampled a group of 30 employees and gave them a salary bonus. Months later, we measured their well-being. The average well-being in this sample is 3.50.

Two possibilities
• The sample mean of 3.50 was drawn from your original population (colored in the figure).
• The sample mean of 3.50 was drawn from another population (grey in the figure).

The real problem: Randomness
Sampling error: Every sample is likely to have different sample statistics.
Using Excel’s =RANDBETWEEN(1,100):

      1    2    3    4    5    6    7    8    9   10      M     SD
A    93    2   18   55   81   66   67   88   54   32   55.6   30.1
B    58   23   69    3   39   18   15   42   17   58   34.2   22.2
C    53   26   16    5   56    4   85   84   98   45   47.2   34.2
D    97   23   18   99    5   25   65   62   95   31   52.0   36.1
E    37   50   75  100   48   93   99   14   47   79   64.2   29.2
F    50    1   55   50   39   15    8    8   90   15   33.1   28.4
G     8   83   45   85   12   94   58   99   91   45   62.0   33.7
H    30   13   35   99   99   92   80   29   32   86   59.5   34.3
I    78   29   40   60   16   85   82   90    2   10   49.2   33.9
J    58   48   82   73   37    1   73   39   74   40   52.5   24.6

Big Question
• Is the .50 difference between the salary bonus group and Delhi residents in general a result of the bonus, or simply an “accident” of sampling error (“randomness”)?

Two hypotheses are implied
Null hypothesis
• The sample comes from a population in which the mean is 3.00.
• The difference we observed is due to sampling error.
Alternative hypothesis
• The sample does not come from a population in which the mean is 3.00.
• The difference is due to the salary bonus. (Often called the “research hypothesis.”)

Mathematically…
Null hypothesis H0: μx = 3.00
Alternative hypothesis H1: μx ≠ 3.00
Note that the two hypotheses are mutually exclusive.

How can we determine which hypothesis is more likely to be true?
• The most popular tools: Null Hypothesis Significance Tests (NHSTs).
• Significance tests are quantitative techniques for evaluating the probability of observing the data, assuming that the null hypothesis is true.
• This information is used to make a binary (yes/no) decision about whether the null hypothesis is a viable explanation for the study results.

The NHST at its core
• Two statistical datasets, A and B, are compared.
• Each dataset has its own parameters (e.g., M & SD).
• The question is: is A = B? (the null hypothesis) If A = B, then A − B = 0 (that’s where the “null” comes from).
• Often, we want to prove A ≠ B (the alternative hypothesis) by disproving A = B.

NHST & philosophy
We cannot prove that something is true; we can only prove that something is false.
• “All swans are white”
• “Innocent until proven guilty”
Inferential statistics are probabilistic.
• “My hypothesis is true.”
• “How likely is it that my hypothesis is true?”

Basic probability
• $\text{Probability},\ p = \frac{\text{number of events that occurred}}{\text{total number of possible events}}$
• In a bag of 100 balls, 5 are red and 95 are blue.
  $p(\text{red}) = \frac{5}{100}$
  $p(\text{blue}) = \frac{95}{100}$
  $p(\text{blue}) = 1 - p(\text{red})$

Two basic rules of probability
The bag now has 5 red, 10 green, and 85 blue balls.
What is the probability that I will draw either a red or a green ball?
• Additive rule: $p(\text{red or green}) = p(\text{red}) + p(\text{green})$
If I draw two balls (with replacement), what is the probability that I will draw a red and a green ball?
• Multiplicative rule: $p(\text{red and green}) = p(\text{red}) \times p(\text{green})$

Relationship to sampling distributions
[Figure: sampling distribution; x-axis “Score” from −4 to 4; y-axis from 0 to 0.4]
• Recall: The sampling distribution is the distribution of means for repeated random samples.

How extreme is your sample mean of 3.50?
We calculate a z-score: $z = \frac{M - \mu}{\sigma / \sqrt{N}}$
Note: This is different from $z = \frac{X - \mu}{\sigma}$
One is inferential, one is descriptive.

$z = \frac{3.50 - 3.00}{1 / \sqrt{30}} = 2.74$, $p = .003$
The probability of getting a z-score of ≥ 2.74 is .003.

How NHSTs work
• Is .003 a “small” probability?
• Because the distribution of sample means is continuous, we create an arbitrary point along this continuum for denoting what is “small” and what is “large.”
• By convention in psychology, if the probability of observing a sample mean at least this extreme is less than 5%, researchers reject the null hypothesis.

Rules of the NHST Game
• When p < .05, a result is said to be “statistically significant.”
• In short, when a result is statistically significant (p < .05), we conclude that the difference we observed was unlikely to be due to sampling error alone. We “reject the null hypothesis.”
• If the result is not statistically significant (p > .05), we conclude that sampling error is a plausible interpretation of the results. We “fail to reject the null hypothesis.”

Binary Yes vs. No criteria
• NHSTs were developed for the purpose of making yes/no decisions about the null hypothesis.
• As a consequence, the null is either rejected or not, based on the p-value.
• Strictly speaking, NHSTs do not test the research hypothesis per se; only the null hypothesis is tested.

Different significance tests
• The previous example was a z-test of a sample mean (≠ z-score of a sample).
• Significance tests have been developed for:
  – difference between two group means: t-test
  – difference between two or more group means: ANOVA
  – differences between proportions: chi-square

What does statistical significance mean?
• The term “significant” does not mean important, substantial, or worthwhile.
• Showing that Facebook postings affect your mood with p = .001 and N > 1,000,000 says nothing about how important the effect is.
• More about this in Week 14.

Inferential Errors and NHST
• A yes/no decision about whether the null hypothesis is a viable explanation can lead to mistakes.
• What sort of mistakes?

                                   Conclusion of the test (sample)
Real world (population)            Null is true                      Null is false
Null is true                       Correct decision                  Type I error (false positive)
Null is false                      Type II error (false negative)    Correct decision

NHST thinking applied to the real world

                                   Conclusion of the test
Real world                         Null is true (acquittal)          Null is false (conviction)
Null is true (truly not guilty)    Correct decision                  Type I error (false positive)
Null is false (truly guilty)       Type II error (false negative)    Correct decision

Or simply… [Figure]

Errors in Inference using NHST
• The probability of making a Type I error is determined by the experimenter. It is often called the alpha value and is usually set to 5%.
• This determines how conservative we want to be.
• The probability of making a Type II error is also determined by the experimenter. It is often called the beta value (more in Week 12 on Power & Effect Size).

One-tail or two-tail tests?
Previously (two-tail):
H0: μx̄ = μ
H1: μx̄ ≠ μ
We could also have H1 as (one-tail, directional):
H1: μx̄ < μ
H1: μx̄ > μ
Often in psychology, we use two-tail tests.

Problem with one-tail tests
Before collecting data:
Null: μx̄ = 30
Alternative: μx̄ < 30
After collecting data, you found:
Case 1: μx = 50, p = .0001
Case 2: μx = 26, p = .04
You must reject H0 in Case 1, but you’re forced to conclude that 50 > 30?! (The mean is grossly opposite to your alternative hypothesis.)
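
As a rough illustration of how the choice of tail changes the p-value, the sketch below recomputes the z-test from “Today’s Question” (M = 3.50, μ = 3.00, σ = 1.00, N = 30) using only Python’s standard library. The function name z_test_p_values is hypothetical, and the code assumes the sampling distribution of the mean is normal.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_values(sample_mean, pop_mean, pop_sd, n):
    """Return z plus the lower-tail, upper-tail and two-tail p-values."""
    standard_error = pop_sd / sqrt(n)              # sigma / sqrt(N)
    z = (sample_mean - pop_mean) / standard_error
    std_normal = NormalDist()                      # mean 0, SD 1
    p_lower = std_normal.cdf(z)                    # one-tail, H1: mean < pop_mean
    p_upper = 1.0 - p_lower                        # one-tail, H1: mean > pop_mean
    p_two = 2.0 * min(p_lower, p_upper)            # two-tail, H1: mean != pop_mean
    return z, p_lower, p_upper, p_two


z, p_lower, p_upper, p_two = z_test_p_values(3.50, 3.00, 1.00, 30)
print(f"z = {z:.2f}")
print(f"one-tail (H1: mean > 3.00):  p = {p_upper:.3f}")
print(f"one-tail (H1: mean < 3.00):  p = {p_lower:.3f}")
print(f"two-tail (H1: mean != 3.00): p = {p_two:.3f}")
```

The upper-tail value should come out near the .003 reported for z ≥ 2.74, and the two-tail value is simply double that. The same data therefore give a different p depending on which alternative hypothesis was stated before collecting data.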
Problem with two-tail tests
Before collecting data:
Null: μx̄ = 30
Alternative: μx̄ ≠ 30
After collecting data, you found:
Case 1: μx = 26, p = .04 (reject null)
Case 2: μx = 27, p = .06 (do not reject null)
Two-tail tests can be too conservative.

Which should you choose?
• The debate can continue forever.
• Most psychologists would choose two-tail tests.
• Some psychologists choose Bayesian statistics (not in SRM I and II).
• What does your theory actually predict?

Five steps to NHST
1. State the null and alternative hypotheses
2. Choose the type of statistical test
3. Select the significance level (usually 5%), and the tail of the test
4. Derive the sample statistic (z, t, F, r, B, etc.)
5. Report results

State the appropriate H0 and H1 for the following studies
• Researchers want to test whether there is a difference in spatial ability between left- and right-handed people.
• Researchers want to test whether nurses who work 8-hour shifts deliver higher-quality work than those who work 12-hour shifts.
• A psychologist predicted that the number of advertisements shown increases the sales of a product geometrically.

Back to “Today’s Question”
• “We want to know whether salary bonuses increase people’s psychological well-being. The average well-being of Delhi’s residents is 3.00 (SD = 1.00). We randomly sampled a group of 30 employees and gave them a salary bonus. Months later, we measured their well-being. The average well-being in this sample is 3.50.”
• We derived this solution earlier: $z = \frac{M - \mu}{\sigma / \sqrt{N}} = \frac{3.50 - 3.00}{1 / \sqrt{30}} = 2.74$, $p = .003$

The problem
• Often the population variance is unknown (Seminar 5).
  “The average well-being of Delhi’s residents is 3.00 (SD = 1.00).”
• What do we do?

One-sample t-test
$z = \frac{M - \mu}{\sigma / \sqrt{N}}$ vs. $t = \frac{M - \mu}{s / \sqrt{N}}$
t distributions approximate z distributions as N → ∞.
df stands for “degrees of freedom”: the number of scores that are free to vary.
For a one-sample t-test, df = N – 1.

An example using one-sample t-test
Question: Do Ashoka students spend ₹200 a day on food on average?
Suppose we sampled daily food expenditure among 100 students, and found M = ₹220; SD = ₹20.
$t = \frac{M - \mu}{s / \sqrt{N}} = \frac{220 - 200}{20 / \sqrt{100}} = 10$, $p < .001$
To find the p-value:
1. Check out the t-distribution table (p. 543)
2. Google “t-test calculator” and enter the t value
3. Use software, e.g., JASP, SPSS, R

t-test family
• The previous example was a one-sample t-test.
• Very seldom used in psychology
• Very useful in quality control, e.g., “Does this batch of batteries meet ISO6001 standards?”
• Next week:
  – Independent samples
  – Dependent samples

An alternative to NHST: Bayesian
Problems with NHST
1. The significance level is arbitrary
2. It doesn’t test the research hypothesis directly
3. Tendency to “accept” or “reject” hypotheses blindly
Bayesian statistics (Google it; not in SRM I or II)
1. Bayes factors represent the weight of evidence in the data for competing hypotheses
2. Easily implemented in JASP
3. Has its own problems too

Summary
• Appreciate randomness in your data.
• NHST results in binary outcomes; sometimes this is useful, other times not.
• The z-test is useful for understanding statistical inference, but often useless for answering practical questions, for which t-tests are better suited.
• Next week we cover different types of t-tests.

Announcement
• 9 Nov has been declared a university holiday.
• Course syllabus has been rearranged.
• Deadline for research project has been pushed back.
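
Returning to the one-sample t-test example (M = ₹220, SD = ₹20, N = 100, tested against ₹200), the arithmetic can be sketched in Python as below. This is only an illustration: the function name one_sample_t is hypothetical, and the critical value 1.984 is an assumed approximate two-tail table value for α = .05 with df near 100, not something computed by the code.

```python
from math import sqrt


def one_sample_t(sample_mean, sample_sd, n, hypothesized_mean):
    """Return the one-sample t statistic and its degrees of freedom."""
    standard_error = sample_sd / sqrt(n)                      # s / sqrt(N)
    t = (sample_mean - hypothesized_mean) / standard_error
    df = n - 1                                                # scores free to vary
    return t, df


t, df = one_sample_t(sample_mean=220, sample_sd=20, n=100, hypothesized_mean=200)

# Assumption: approximate two-tail critical t for alpha = .05 with df near 100,
# read from a standard t table rather than computed here.
CRITICAL_T_05 = 1.984

print(f"t({df}) = {t:.2f}")
print("reject H0" if abs(t) > CRITICAL_T_05 else "fail to reject H0")
```

Comparing |t| with a tabled critical value mirrors option 1 on the slide (the t-distribution table). For an exact p-value, statistical software such as JASP, SPSS, or R would be the more practical route.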