Inferential Statistics

Introduction to hypothesis testing + Probability
Seminar 6
A difficult mock question for mid-term
Plot the following graph. People who use Facebook only (but not
Twitter) are generally happier than those who use Twitter only (but
not Facebook). This tendency is weakest among new subscribers.
However, after a certain number of years, there is a declining trend
for both groups, with a sharper decline among Twitter users.
[Figure: y-axis: Happiness (1 to 11); x-axis: Subscriber type (New to Old); legend: Facebook-only, Twitter-only]
Today’s Question
• We want to know whether salary bonuses increase
people’s psychological well-being. The average well-being
of Delhi’s residents is 3.00 (SD = 1.00). We
randomly sampled a group of 30 employees and
gave them a salary bonus. Months later, we measured
their well-being. The average well-being in this
sample is 3.50.
Two possibilities
• The sample mean of 3.50 was drawn from your original population (colored in the figure).
• The sample mean of 3.50 was drawn from another population (grey in the figure).
The real problem: Randomness
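Sampling error: Every sample is likely to have different
statistics (e.g., M and SD), even when all samples come from
the same population. Ten samples (A to J) of 10 random numbers
each, using Excel’s =RANDBETWEEN(1,100):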
Sample    1    2    3    4    5    6    7    8    9   10      M     SD
A        93    2   18   55   81   66   67   88   54   32   55.6   30.1
B        58   23   69    3   39   18   15   42   17   58   34.2   22.2
C        53   26   16    5   56    4   85   84   98   45   47.2   34.2
D        97   23   18   99    5   25   65   62   95   31   52.0   36.1
E        37   50   75  100   48   93   99   14   47   79   64.2   29.2
F        50    1   55   50   39   15    8    8   90   15   33.1   28.4
G         8   83   45   85   12   94   58   99   91   45   62.0   33.7
H        30   13   35   99   99   92   80   29   32   86   59.5   34.3
I        78   29   40   60   16   85   82   90    2   10   49.2   33.9
J        58   48   82   73   37    1   73   39   74   40   52.5   24.6
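The same demonstration can be repeated outside Excel. The following is a minimal sketch, assuming Python's `random.randint` as a stand-in for =RANDBETWEEN(1,100); the function name `draw_samples` and the seed are illustrative choices, not part of the seminar material. It draws ten samples of ten numbers and reports M and SD (sample SD, dividing by n - 1) for each.

```python
import random
import statistics


def draw_samples(n_samples=10, n_per_sample=10, low=1, high=100, seed=None):
    """Draw repeated random samples of integers and summarise each one.

    random.randint includes both endpoints, like Excel's RANDBETWEEN.
    Returns a list of (mean, sample SD) tuples, one per sample.
    """
    rng = random.Random(seed)
    summaries = []
    for _ in range(n_samples):
        sample = [rng.randint(low, high) for _ in range(n_per_sample)]
        summaries.append((statistics.mean(sample), statistics.stdev(sample)))
    return summaries


if __name__ == "__main__":
    # The seed is arbitrary; change it to see a different set of samples.
    for label, (m, sd) in zip("ABCDEFGHIJ", draw_samples(seed=6)):
        print(f"{label}: M = {m:.1f}, SD = {sd:.1f}")
```

Every sample comes from the same population (whole numbers from 1 to 100, with a population mean of 50.5), yet the sample means differ from one another. That spread is sampling error.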
Big Question
• Is the .50 difference between the salary bonus
group and Delhi residents in general a result of
the bonus, or simply an “accident” of sampling
error (“randomness”)?
Two hypotheses are implied
Null hypothesis
• The sample comes from
a population in which
the mean is 3.00
• The difference we
observed is due to
sampling error.
Alternative hypothesis
• The sample does not
come from a population
in which the mean is
3.00.
• The difference is due to
salary bonus. (Often
called the “research
hypothesis.”)
Mathematically…
Null hypothesis
H0: μx = 3.00
Alternative hypothesis
H1: μx ≠ 3.00
Note that the two hypotheses are mutually exclusive.
How can we determine which
hypothesis is more likely to be true?
• The most popular tools: Null Hypothesis
Significance Tests (NHSTs).
• Significance tests are quantitative techniques to
evaluate the probability of observing data at least as
extreme as ours, assuming that the null hypothesis is true.
• This information is used to make a binary (yes/no)
decision about whether the null hypothesis is a
viable explanation for the study results.
The NHST at its core
• Two statistical datasets, A and B, are compared.
• Each dataset has its own parameters (e.g., M & SD).
• The question is, is A = B? (the null hypothesis)
If A = B, A-B = 0 (that’s where the ‘null’ comes from)
• Often, we want to prove A ≠ B (the alternative hypothesis)
by disproving A = B.
NHST & philosophy
We cannot prove that something is true; we can only
prove that something is false.
“All swans are white”
“Innocent until proven guilty”
Inferential statistics are probabilistic.
“My hypothesis is true.”
“How likely is it that my hypothesis is true?”
Basic probability
• $\text{probability, } p = \dfrac{\text{number of events that occurred}}{\text{total number of possible events}}$
• In a bag of 100 balls, 5 are red, 95 are blue.
$p(\text{red}) = \frac{5}{100}$
$p(\text{blue}) = \frac{95}{100}$
$p(\text{blue}) = 1 - p(\text{red})$
Two basic rules of probability
The bag now has: 5 red, 10 green, and 85 blue balls.
What is the probability that I will draw either a red or green
ball?
• Additive rule (for mutually exclusive events):
$p(\text{red or green}) = p(\text{red}) + p(\text{green})$
If I draw two balls (with replacement), what is the probability
that I will draw a red ball and then a green ball?
• Multiplicative rule (for independent events):
$p(\text{red and green}) = p(\text{red}) \times p(\text{green})$
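The two rules can be checked with exact fractions for the bag described above. This is a small sketch; the dictionary `bag` and the helper `p` are illustrative names. The multiplication applies to one specific order (red on the first draw, green on the second), and drawing with replacement keeps the two draws independent.

```python
from fractions import Fraction

# Contents of the bag from the example: 5 red, 10 green, 85 blue.
bag = {"red": 5, "green": 10, "blue": 85}
total = sum(bag.values())


def p(colour):
    """Probability of drawing the given colour in a single draw."""
    return Fraction(bag[colour], total)


# Additive rule: red and green cannot both happen in one draw.
p_red_or_green = p("red") + p("green")

# Multiplicative rule: two independent draws (with replacement),
# red on the first draw and green on the second.
p_red_then_green = p("red") * p("green")

print(p_red_or_green)    # 3/20, i.e. .15
print(p_red_then_green)  # 1/200, i.e. .005
```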
Relationship to sampling distributions
• Recall: Sampling distribution is the distribution of
means for repeated random samples
[Figure: sampling distribution curve; x-axis: score (-4 to 4); y-axis: density (0 to 0.4)]
Relationship to sampling distributions
How extreme is your sample mean of 3.50?
We calculate a z-score:
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Note: This is different from
$z = \frac{X - \mu}{s}$
One is inferential, one is descriptive.
[Figure: sampling distribution curve; x-axis: score (-4 to 4); y-axis: density (0 to 0.4)]
Relationship to sampling distributions
$z = \frac{3.50 - 3}{1 / \sqrt{30}} = 2.74$
$p = .003$
The probability of getting a z-score of ≥ 2.74 is .003.
[Figure: sampling distribution curve; x-axis: score (-4 to 4); y-axis: density (0 to 0.4)]
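The arithmetic can be reproduced with Python's standard library. This is a minimal sketch: the function name `z_for_sample_mean` is illustrative, and `statistics.NormalDist` (Python 3.8+) supplies the standard normal curve.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """z-score of a sample mean: distance from the population mean
    in units of the standard error (pop_sd / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


# Values from Today's Question: M = 3.50, mu = 3.00, sigma = 1.00, n = 30.
z = z_for_sample_mean(3.50, 3.00, 1.00, 30)

# Area under the standard normal curve at or above z (upper tail).
p_upper = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_upper, 3))  # prints 2.74 0.003
```

Note that .003 is the area in the upper tail only, which is the quantity the slide describes.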
How NHSTs work
• Is .003 a “small” probability?
• Because the distribution of sample means is
continuous, we create an arbitrary point along this
continuum for denoting what is “small” and what is
“large.”
• By convention in psychology, if the probability of
observing a sample mean at least this extreme (assuming
the null hypothesis is true) is less than 5%,
researchers reject the null hypothesis.
Rules of the NHST Game
• When p < .05, a result is said to be “statistically
significant”
• In short, when a result is statistically significant (p <
.05), we conclude that the difference we observed
was unlikely to be due to sampling error alone. We
“reject the null hypothesis.”
• If the statistic is not statistically significant (p ≥ .05),
we conclude that sampling error is a plausible
interpretation of the results. We “fail to reject the
null hypothesis.”
Binary Yes vs. No criteria
• NHSTs were developed for the purpose of making
yes/no decisions about the null hypothesis.
• As a consequence, the null is either rejected or not,
based on the p-value.
• Strictly speaking, NHSTs do not test the research
hypothesis per se; only the null hypothesis is tested.
Different significance tests
• The previous example was an example of a z-test of
a sample mean. (≠ z-score of a sample)
• Significance tests have been developed for:
– difference between two group means: t-test
– difference between two or more group means:
ANOVA
– differences between proportions: chi-square
What does statistical significance mean?
• The term “significant” does not mean important,
substantial, or worthwhile.
• Showing that Facebook postings affect your mood
at p = .001 with N > 1,000,000 says nothing about
how large or important the effect is.
• More about this in Week 14.
Inferential Errors and NHST
• A yes/no decision about whether the null hypothesis
is a viable explanation can lead to mistakes.
• What sort of mistakes?
Inferential Errors and NHST
                           Conclusion of the test (sample)
Real world (population)    Null is true                      Null is false
Null is true               Correct decision                  Type I error (false positive)
Null is false              Type II error (false negative)    Correct decision
NHST thinking applied to the real world
                                   Conclusion of the test
Real world                         Null is true (acquittal)          Null is false (conviction)
Null is true (truly not guilty)    Correct decision                  Type I error (false positive)
Null is false (truly guilty)       Type II error (false negative)    Correct decision
Or simply…
Errors in Inference using NHST
• The probability of making a Type I error (when the null is
true) is set by the experimenter. Often called the alpha value.
Usually set to 5%.
• This determines how conservative we want to be.
• The probability of making a Type II error is only partly
under the experimenter’s control: it depends on alpha, the
sample size, and the true effect size. Often called the
beta value (more in Week 12 on Power & Effect
Size).
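A short simulation can make the alpha value concrete. This is a sketch under stated assumptions: the null hypothesis is true by construction (every sample is drawn from a normal population with mean 3.00 and SD 1.00, borrowing the numbers from Today's Question), a two-tail z-test is used, and `type_i_error_rate` is an illustrative name. The share of simulated studies that reject the null should land close to alpha.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(n_studies=5000, n=30, mu=3.0, sigma=1.0,
                      alpha=0.05, seed=0):
    """Proportion of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    # Two-tail cutoff: alpha is split between the two tails.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        # Sample from the null population, so any rejection is a false positive.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies


if __name__ == "__main__":
    print(f"False-positive rate: {type_i_error_rate():.3f}")
```

Lowering `alpha` in the call makes the test more conservative: fewer false positives, at the cost of more Type II errors when the null is actually false.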
One-tail or two-tail tests?
Previously (two-tail):
H0: μx̄ = μ
H1: μx̄ ≠ μ
We could also have H1 as (one-tail, directional):
H1: μx̄ < μ
H1: μx̄ > μ
Often in psychology, we use two-tail tests.
Problem with one-tail tests
Before collecting data
Null: μx̄ = 30
Alternative: μx̄ < 30
After collecting data, you found:
Case 1: μx̄ = 50, p = .0001
Case 2: μx̄ = 26, p = .04
The p-value in Case 1 tells you to reject H0, but the mean is grossly opposite
to your alternative hypothesis (50 > 30), so rejecting H0 cannot support it.
Problem with two-tail tests
Before collecting data
Null: μx̄ = 30
Alternative: μx̄ ≠ 30
After collecting data, you found:
Case 1: μx̄ = 26, p = .04 (reject null)
Case 2: μx̄ = 27, p = .06 (do not reject null)
Two-tail tests can be too conservative.
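The relationship between the two kinds of p-value can be sketched as follows. The z value of -1.88 is a hypothetical number chosen for illustration (it is not taken from the cases above), and `p_values` is an illustrative helper. For a symmetric distribution the two-tail p is twice the smaller one-tail p, so the same result can fall on either side of .05 depending on the tail chosen.

```python
from statistics import NormalDist


def p_values(z):
    """Return (lower-tail p, upper-tail p, two-tail p) for a z statistic."""
    std_normal = NormalDist()
    p_lower = std_normal.cdf(z)        # one-tail, H1: mean is below the null value
    p_upper = 1 - std_normal.cdf(z)    # one-tail, H1: mean is above the null value
    p_two = 2 * min(p_lower, p_upper)  # two-tail, H1: mean differs from the null value
    return p_lower, p_upper, p_two


# Hypothetical result: the sample mean is 1.88 standard errors below the null value.
p_lower, p_upper, p_two = p_values(-1.88)
print(f"one-tail (lower): {p_lower:.3f}, two-tail: {p_two:.3f}")
```

Here the one-tail test in the predicted direction is significant at the 5% level while the two-tail test is not, which is the sense in which two-tail tests are more conservative.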
Which should you choose?
• The debate can continue forever.
• Most psychologists would choose two-tail tests.
• Some psychologists choose Bayesian statistics (not
in SRM I and II)
• What does your theory actually predict?
Five steps to NHST
1. State the null and alternative hypotheses
2. Choose the type of statistical test
3. Select the significance level (usually 5%), and the
tail of the test
4. Compute the test statistic (z, t, F, r, B, etc.)
5. Report results
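The five steps can be sketched as one function for the simplest case, a z-test of a sample mean. This is an illustration only: the function name `z_test`, its parameters, and the `tail` labels are assumptions made for this sketch, and the numbers come from Today's Question.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, n, null_mean, pop_sd, alpha=0.05, tail="two"):
    """z-test of a sample mean against a null population mean.

    Step 1 (hypotheses) is expressed by null_mean and tail;
    step 2 (type of test) is fixed here as a z-test;
    step 3 (significance level and tail) is alpha and tail.
    """
    # Step 4: compute the test statistic and its p-value.
    z = (sample_mean - null_mean) / (pop_sd / sqrt(n))
    lower = NormalDist().cdf(z)
    upper = 1 - lower
    if tail == "two":
        p = 2 * min(lower, upper)
    elif tail == "greater":
        p = upper
    elif tail == "less":
        p = lower
    else:
        raise ValueError("tail must be 'two', 'greater' or 'less'")
    # Step 5: report the statistic, the p-value and the decision.
    return {"z": z, "p": p, "reject_null": p < alpha}


result = z_test(sample_mean=3.50, n=30, null_mean=3.00, pop_sd=1.00)
print(f"z = {result['z']:.2f}, p = {result['p']:.3f}, "
      f"reject H0: {result['reject_null']}")
```

With the two-tail alternative H1: μx ≠ 3.00, the p-value is about .006, twice the upper-tail area of .003; calling the function with `tail="greater"` gives the one-tail figure. The null is rejected at the 5% level either way.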
State the appropriate H0 and H1 for the
following studies
• Researchers want to test whether there is a
difference in spatial ability between left- and
right-handed people.
• Researchers want to test whether nurses who work
8-hour shifts deliver higher-quality work than those
who work 12-hour shifts.
• A psychologist predicted that the number of
advertisements shown increases the sales of a
product geometrically.
Back to “Today’s Question”
• “We want to know whether salary bonuses increase
people’s psychological well-being. The average well-being
of Delhi’s residents is 3.00 (SD = 1.00). We
randomly sampled a group of 30 employees and gave
them a salary bonus. Months later, we measured their
well-being. The average well-being in this sample is 3.50.”
• We derived this solution earlier:
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{3.50 - 3}{1 / \sqrt{30}} = 2.74$, $p = .003$
The problem
• Often the population variance is unknown
(Seminar 5).
“The average well-being of Delhi’s residents is
3.00 (SD = 1.00).”
• What do we do?
One-sample t-test
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ vs. $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
t distributions approximate z distributions as N → ∞.
df stands for “degrees of freedom”: the number of
scores that are free to vary.
For a one-sample t-test, df = n – 1
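One way to see why the t distribution matters is to simulate it. In this sketch the null hypothesis is true (samples come from a standard normal population), the statistic uses the sample SD as in the t formula, and we count how often |t| exceeds 1.96, the two-tail 5% cutoff of the z distribution. The function name `share_beyond` and the sample sizes 5 and 100 are illustrative choices.

```python
import random
from math import sqrt


def share_beyond(cutoff, n, n_studies=5000, seed=0):
    """Share of simulated one-sample t statistics with |t| > cutoff
    when the null hypothesis (population mean = 0) is true."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        m = sum(sample) / n
        # Sample SD with n - 1 in the denominator (df = n - 1).
        s = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        t = (m - 0) / (s / sqrt(n))
        if abs(t) > cutoff:
            count += 1
    return count / n_studies


if __name__ == "__main__":
    for n in (5, 100):
        print(f"n = {n:3d} (df = {n - 1}): "
              f"share with |t| > 1.96 = {share_beyond(1.96, n):.3f}")
```

With a small sample the share is well above 5% because the t distribution has heavier tails than z; with a large sample it settles near 5%, which is the sense in which t approximates z as N grows.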
An example using one-sample t-test
Question: Do Ashoka students spend ₹200 a day on
food on average?
Suppose we sampled daily food expenditure among
100 students, and found M = ₹220; SD = ₹20.
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{220 - 200}{20 / \sqrt{100}} = 10$, $p < .001$
To find the p-value:
1. Check out the t-distribution table (p. 543)
2. Google “t-test calculator” and enter the t value
3. Use software, e.g., JASP, SPSS, R
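The t statistic in this example can be computed from the summary statistics alone. This is a minimal sketch: `one_sample_t` is an illustrative name, and the critical value of 1.984 (two-tail, 5% level, df = 99) is hard-coded from a standard t table rather than computed, mirroring the table lookup in option 1.

```python
from math import sqrt


def one_sample_t(sample_mean, sample_sd, n, null_mean):
    """One-sample t statistic and its degrees of freedom."""
    t = (sample_mean - null_mean) / (sample_sd / sqrt(n))
    df = n - 1
    return t, df


# Food expenditure example: M = 220, SD = 20, n = 100, null mean = 200.
t, df = one_sample_t(sample_mean=220, sample_sd=20, n=100, null_mean=200)

# Assumption: two-tail critical value for alpha = .05 and df = 99,
# copied from a standard t table (not computed here).
T_CRIT_DF99 = 1.984

print(f"t({df}) = {t:.2f}; reject H0: {abs(t) > T_CRIT_DF99}")
```

A t of 10 is far beyond the critical value, which is consistent with the reported p < .001. For an exact p-value, statistical software such as JASP, SPSS or R remains the practical route.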
t-test family
• The previous example was a one-sample t-test.
• Very seldom used in psychology
• Very useful in quality control, e.g., “Does this batch of
batteries meet ISO6001 standards?”
• Next week:
• Independent samples
• Dependent samples
An alternative to NHST: Bayesian
Problems with NHST
1. The significance level is arbitrary
2. It doesn’t test the research hypothesis directly
3. Tendency to “accept” or “reject” hypotheses blindly
Bayesian statistics (Google it; not in SRM I or II)
1. Bayes factors represent the weight of evidence in
the data for competing hypotheses
2. Easily implemented in JASP
3. Has its own problems too
Summary
• Appreciate randomness in your data.
• NHST results in binary outcomes; sometimes this is
useful, other times not.
• The z-test is useful for understanding statistical inference, but
it rarely answers practical questions, for which t-tests are better suited.
• Next week we cover different types of t-tests.
Announcement
• 9 Nov has been declared a university holiday.
• Course syllabus has been rearranged.
• Deadline for research project has been pushed back.