Statistical Hypothesis Testing
(8th Session in “Gentle Introduction to Modeling Uncertainty”)

Lonnie Chrisman, Ph.D.
Lumina Decision Systems
Analytica User Group
15 July 2010

Copyright © 2010 Lumina Decision Systems, Inc.

Scope of Today’s Webinar

Included:
• Conceptual underpinnings of classical hypothesis testing.
• Interpretation of statistical significance (p-values).
• General methodology for applying it in any scenario.
Intended to promote conceptual understanding. Builds on Monte Carlo tools.

Not included:
• Standard canned hypothesis tests (such as t-tests).

Outline

• Motivating example
• Statistical significance
• The Statistic
• Methodology
• Modeling the Null hypothesis
• Computing the pValue
• Interpretation of results
• Drawbacks of methodology
• Additional exercise

Does Stock Market Volatility Vary with Day of Week?

• Randomly selected 100 trading days (from 2000-2010).
• Computed the day change, (close - open)/open, for the S&P 500 index.

Day of week   # samples   Volatility
Mon           20          19.4%
Tue           20          11.9%
Wed           20          21.5%
Thu           20          20.1%
Fri           20          14.3%

Total volatility: 18.1%

Side note: Annualized volatility := SDeviation * sqrt(T), where T = # trading days/yr = 250.

• Alice: “This shows that the market volatility does depend on the day of the week.”
• Bob: “No, the variation is just due to random sampling variation.”

Download Model with S&P Data

• Please download “Hypothesis Test S&P Volatility.ana”. The download link is at the bottom of the talk abstract on the Analytica Wiki.
• You’ll use this data for exercises…

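The annualized volatility formula in the side note can be sketched in Python as follows. This is only an illustrative sketch, not the Analytica model from the webinar: the function names are hypothetical, it assumes SDeviation means the sample standard deviation, and the day changes it prints are synthetic placeholders, not the S&P 500 data.

```python
import math
import random
import statistics

TRADING_DAYS_PER_YEAR = 250  # T from the side note


def annualized_volatility(day_changes, trading_days=TRADING_DAYS_PER_YEAR):
    """Sample standard deviation of daily changes, scaled by sqrt(T).

    Assumption: SDeviation is taken to be the sample (n - 1) standard deviation.
    """
    return statistics.stdev(day_changes) * math.sqrt(trading_days)


def volatility_by_day(changes_by_day):
    """Map each weekday to the annualized volatility of its samples."""
    return {day: annualized_volatility(c) for day, c in changes_by_day.items()}


if __name__ == "__main__":
    # Placeholder data only: 20 synthetic day changes per weekday, NOT S&P data.
    # The daily standard deviation is chosen so the true annualized value is 18.1%.
    rng = random.Random(0)
    daily_sd = 0.181 / math.sqrt(TRADING_DAYS_PER_YEAR)
    days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
    synthetic = {d: [rng.gauss(0.0, daily_sd) for _ in range(20)] for d in days}
    for day, vol in volatility_by_day(synthetic).items():
        print(f"{day}: {vol:.1%}")
```

Even though every synthetic weekday shares the same underlying volatility here, the five printed values will differ from one another, which is the kind of sampling variation Bob is pointing to.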
Statistical Significance

Day of week   # samples   Observed Volatility
Mon           20          19.4%
Tue           20          11.9%
Wed           20          21.5%
Thu           20          20.1%
Fri           20          14.3%

• Alice: “This shows that the market volatility depends on the day of the week.”
• Alice’s mission: To show that the observed variation would be unlikely if it were due only to random sampling variation.
• Null Hypothesis: The “true” underlying volatility is the same for every day of the week.
• Level of significance: The probability that at least this much variation in volatility would be observed if the Null Hypothesis were true (termed the “p-value”).

Statistical Significance #2

Day of week   # samples   Observed Volatility
Mon           20          19.4%
Tue           20          11.9%
Wed           20          21.5%
Thu           20          20.1%
Fri           20          14.3%

• After her statistical analysis, Alice might say: “This shows at a significance level p=3% that market volatility varies with the day of the week.”
• By convention, p ≤ 5% is usually considered “statistically significant”; p > 5% is said to be “not statistically significant”.
• What can you conclude if the p-value turns out to be 20%?

The “Statistic”

Day   # samples   Observed Volatility (vol)
Mon   20          19.4%
Tue   20          11.9%
Wed   20          21.5%
Thu   20          20.1%
Fri   20          14.3%

Total volatility: 18.1%

• We need a scalar metric that summarizes the degree of conflict with the Null hypothesis (H0).
    - Smaller value: more consistent with H0
    - Larger value: greater disagreement with H0
• Examples:
    - Max(vol,day) - Min(vol,day)
    - SDeviation(vol,day)
    - F = Variance(vol,day) / Total_volatility^2
• Exercise: Pick a statistic and compute its value for the S&P 500 dataset in your Analytica model.

Methodology

[Diagram: Model of Null Hypothesis → Simulated dataset → Statistic on simulated → pValue; Measured dataset → Statistic on measured → pValue]

• Construct a model that simulates measurements given that the null hypothesis is true. This typically requires making various assumptions.
• Use Monte Carlo simulation to produce many simulated datasets. Apply the statistic to each.
• pValue: Pr( Stat_sim ≥ Stat_meas )

Modeling the Null Hypothesis

Day   # samples   Observed Volatility (vol)
Mon   20          19.4%
Tue   20          11.9%
Wed   20          21.5%
Thu   20          20.1%
Fri   20          14.3%

Total volatility: 18.1%

• Null Hypothesis: The volatility is 18.1% on every day of the week.
• How could you simulate the data? (Hint: There are multiple possible approaches.) What assumptions are you making?
• Some ideas:
    - Randomly generate each day’s price change from a LogNormal distribution.
    - Shuffle existing data.
• Exercise: Implement a model of the null hypothesis in your Analytica model. (One random dataset for each item in Run.)

Computing Statistic on Simulated

• Exercise: Apply your statistic to each simulated dataset.
  Note: Larger statistic values occur when the variation in volatility by day is largest.
• Exercise: What fraction of simulated datasets have a statistic value at least as large as that of the actual data?
    - This is the p-value.
    - Is Alice’s hypothesis statistically significant?

Common Misuse of Paradigm: Multiple Hypotheses

• Scenario: Alice identifies 20 other plausible hypotheses to test, e.g.:
    - Volatility on Tues is different from the other 4 days.
    - Volatility varies by month.
    - September has a higher volatility than other months.
    - …
  She tests each of these individually and finds one of them to be statistically significant at a 5% level. She publishes this result.
• What’s wrong here?
• What should she do differently?

Interpreting p-Value

• Small value (≤ 5%)
    - Accept main hypothesis
    - Data is inconsistent with Null hypothesis
• Otherwise (p > 5%)
    - Conclude only that data sample was too small to detect relationship.
    - Hypothesis may still be true or false: “Larger research study required”
• P-value is not:
    - A measure of the strength of relationship.
    - The probability that the hypothesis is true.

Drawbacks with Statistical Hypothesis Testing Paradigm

• About 1 in 20 false hypotheses is accepted (at the 5% significance level).
    - Often abused by people testing many hypotheses.
• Nearly any hypothesis is confirmed with a large enough sample.
    - Most hypotheses will have at least a minuscule “true” effect.
    - With enough data, even the most minuscule effect becomes statistically significant.
• The “uncertainty” about the hypothesis is not available.
    - Doesn’t provide P(H), which would be useful in models that use the results.
• Numerous subjective components that are not recognized or reported explicitly.
• “Cookbook tests” are very often misapplied when assumptions don’t hold, leading to greater confidence than is warranted by the data.

New Exercise

Number of subjects (purely fictional data):

                   Parkinson’s   No Parkinson’s
Not exposed        10            140
Exposed to TCE     4             25

• Hypothesis: TCE exposure is associated with an increased risk of getting Parkinson’s disease.
• Null Hypothesis: Parkinson’s rates are the same among those exposed and not exposed to TCE.
• Exercise:
    - Identify an appropriate statistic.
    - Model the null hypothesis.
    - Compute the p-value.

Summary

• Statistical Hypothesis Testing tests: Is the support for a hypothesis statistically significant given a dataset?
• Significance level (p-value) is: Probability of seeing data at least as extreme as the actual data when the Null hypothesis is true.
• p-value ≤ 5%: accept hypothesis.
  p-value > 5%: conclude nothing, need more data.
• Methodology:
    - Identify statistic (scalar metric): a measure of divergence from the null hypothesis.
    - Build model of null hypothesis to “simulate” datasets.
    - Compute p-value.
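The three methodology steps can be sketched in Python for the fictional TCE exercise data. This is one possible approach under stated assumptions, not the webinar's Analytica solution: the statistic (difference in Parkinson's rates) and the null model (shuffling outcomes across subjects, which keeps the group sizes and total case count fixed) are choices made for this sketch, and all function names are hypothetical.

```python
import random

# Fictional counts from the exercise slide.
EXPOSED_CASES, EXPOSED_HEALTHY = 4, 25
UNEXPOSED_CASES, UNEXPOSED_HEALTHY = 10, 140

N_EXPOSED = EXPOSED_CASES + EXPOSED_HEALTHY        # 29 subjects
N_UNEXPOSED = UNEXPOSED_CASES + UNEXPOSED_HEALTHY  # 150 subjects
TOTAL_CASES = EXPOSED_CASES + UNEXPOSED_CASES      # 14 cases


def rate_difference(cases_among_exposed):
    """Statistic: Parkinson's rate among exposed minus rate among unexposed."""
    exposed_rate = cases_among_exposed / N_EXPOSED
    unexposed_rate = (TOTAL_CASES - cases_among_exposed) / N_UNEXPOSED
    return exposed_rate - unexposed_rate


def simulate_null(rng):
    """One dataset under the null: shuffle outcomes so exposure is irrelevant.

    Returns the number of cases that land in the "exposed" group.
    """
    outcomes = [1] * TOTAL_CASES + [0] * (N_EXPOSED + N_UNEXPOSED - TOTAL_CASES)
    rng.shuffle(outcomes)
    return sum(outcomes[:N_EXPOSED])


def p_value(num_simulations=10_000, seed=0):
    """Fraction of simulated datasets with Stat_sim >= Stat_meas."""
    rng = random.Random(seed)
    observed = rate_difference(EXPOSED_CASES)
    hits = sum(
        rate_difference(simulate_null(rng)) >= observed
        for _ in range(num_simulations)
    )
    return hits / num_simulations


if __name__ == "__main__":
    print(f"observed statistic: {rate_difference(EXPOSED_CASES):.4f}")
    print(f"estimated p-value:  {p_value():.3f}")
```

Shuffling is only one way to model the null hypothesis. Drawing each subject's outcome from a single pooled rate would be another, and it rests on slightly different assumptions.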
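The multiple-hypotheses misuse can be made concrete with a little arithmetic, sketched here in Python. The sketch assumes the 20 tests are independent and that every null hypothesis is true. The Bonferroni correction shown is one common remedy, not necessarily the answer intended in the webinar, and the function names are hypothetical.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests of true null
    hypotheses comes out significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate at or
    below alpha (Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    n = 20
    print(f"P(at least one false positive in {n} tests): "
          f"{prob_at_least_one_false_positive(n):.2f}")
    print(f"Bonferroni per-test threshold: {bonferroni_threshold(n):.4f}")
```

Under these assumptions, finding one “significant” result among 20 tests happens roughly 64% of the time by chance alone, so a single hit at the 5% level is weak evidence.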