Hypothesis Testing

The New York Times Daily Dilemma
Select 50% of users to see headline A
◦ Titanic Sinks
Select 50% of users to see headline B
◦ Ship Sinks Killing Thousands
Do people click more on headline A or on headline B?

Testing Hypotheses: Two Populations
[Figure: two samples of observations, only a few values observed (e.g. 10, 4, 12 in one sample, 9 in the other), the rest unknown]
Which one has the larger average?

The two-sample t-test
Is the difference in averages between two groups larger than we would expect from chance alone?

More Broadly: Hypothesis Testing Procedures
Parametric:
◦ Z test
◦ t-test
◦ Cohen's d (effect size)
Nonparametric:
◦ Wilcoxon rank-sum test
◦ Kruskal-Wallis H-test
◦ Kolmogorov-Smirnov test

Parametric Test Procedures
Test population parameters (e.g. the mean)
Make distributional assumptions (e.g. normal distribution)
Examples: Z test, t-test, χ² test, F test

Nonparametric Test Procedures
Not about population parameters; examples: probability distributions, independence
Data values are not used directly; the ordering (ranks) of the data is used instead
Examples: Wilcoxon rank-sum test, Kolmogorov-Smirnov test

Two-sample t-Test
In-class experiment with R:
  Left  = c(20, 5, 500, 15, 30)
  Right = c(0, 50, 70, 100)
  t.test(Left, Right,
         alternative = c("two.sided", "less", "greater"),  # defaults to "two.sided"
         var.equal = TRUE, conf.level = 0.95)

t-Test (Independent Samples)
The goal is to evaluate whether the average difference between two populations is zero.
Two hypotheses:
◦ H0: μ1 − μ2 = 0
◦ H1: μ1 − μ2 ≠ 0
The t-test makes the following assumptions:
◦ The values in X(0) and X(1) follow a normal distribution
◦ Observations are independent

t-Test Calculation
General t formula:
  t = (sample statistic − hypothesized population parameter) / (estimated standard error)
Independent-samples t, based on the empirical averages x̄(0) and x̄(1):
  t = (x̄(0) − x̄(1)) / SE(x̄(0) − x̄(1))
How do we estimate the standard deviation in the denominator?

t-Test: Standard Deviation Calculation (1/2)
Standard deviation of the difference in empirical averages:
  SE(x̄(0) − x̄(1)) = sqrt( s(0)²/n(0) + s(1)²/n(1) )
where s(0)² is the sample variance of X(0) and n(0) is the number of observations in X(0) (likewise for X(1)).
How much variance is there when we use the average difference of the observations to represent the true average difference?

t-Test: Standard Deviation Calculation (2/2)
The corresponding degrees of freedom follow the Welch–Satterthwaite approximation:
  df ≈ (s(0)²/n(0) + s(1)²/n(1))² / [ (s(0)²/n(0))²/(n(0)−1) + (s(1)²/n(1))²/(n(1)−1) ]
This unequal-variance version is also known as Welch's t-test.

t-Statistic and p-value
H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
What is the p-value?
Can we ever accept hypothesis H1?

t-Test: Effect Size
The t-test only tells us whether the difference is zero or not. What about the effect size?
Cohen's d:
  d = (x̄(0) − x̄(1)) / s
where s is the pooled standard deviation:
  s = sqrt( ((n(0)−1)·s(0)² + (n(1)−1)·s(1)²) / (n(0) + n(1) − 2) )

Bayesian Approach
Probability of the hypothesis given the data:
  P(H1 | data) / P(H0 | data) = [ P(data | H1) / P(data | H0) ] · [ P(H1) / P(H0) ]
The Bayes factor is the first term on the right-hand side:
  K = P(data | H1) / P(data | H0)

Nonparametric Testing of Distributions
Two-sample Kolmogorov-Smirnov test
◦ Do X(0) and X(1) come from the same underlying distribution?
◦ Compare the empirical distribution functions of the two samples:
    D(n,m) = max over x of | F(0)n(x) − F(1)m(x) |
◦ The hypothesis that the distributions are the same is rejected at level α if
    D(n,m) > c(α) · sqrt( (n + m) / (n·m) )
  where c(α) = sqrt( −ln(α/2) / 2 ) is the confidence-level factor and sqrt((n+m)/(n·m)) is the sample-size correction.
The K-S test is less sensitive when the differences between the curves are greatest at the beginning or the end of the distributions; it works best when the distributions differ near the center. (Wikipedia)
Good reading: M. Tygert, "Statistical tests for whether a given set of independent, identically distributed draws comes from a specified probability density," PNAS 2010.
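To make the formulas above concrete, here is a minimal R sketch (reusing the Left/Right data from the in-class experiment; the intermediate variable names are only illustrative) that computes Welch's t-statistic, degrees of freedom and p-value by hand, checks them against t.test, computes Cohen's d from the pooled standard deviation, and runs the two-sample Kolmogorov-Smirnov test with ks.test:

  Left  <- c(20, 5, 500, 15, 30)
  Right <- c(0, 50, 70, 100)

  # Welch's t: difference in means divided by the estimated standard error
  m0 <- mean(Left);   m1 <- mean(Right)
  v0 <- var(Left);    v1 <- var(Right)
  n0 <- length(Left); n1 <- length(Right)

  se     <- sqrt(v0/n0 + v1/n1)                 # SE of the difference in averages
  t_stat <- (m0 - m1) / se
  df     <- (v0/n0 + v1/n1)^2 /
            ((v0/n0)^2/(n0 - 1) + (v1/n1)^2/(n1 - 1))   # Welch-Satterthwaite df
  p_val  <- 2 * pt(-abs(t_stat), df)            # two-sided p-value
  t_stat; df; p_val
  t.test(Left, Right, var.equal = FALSE)        # should reproduce the values above

  # Cohen's d using the pooled standard deviation
  s_pooled <- sqrt(((n0 - 1)*v0 + (n1 - 1)*v1) / (n0 + n1 - 2))
  (m0 - m1) / s_pooled

  # Two-sample Kolmogorov-Smirnov test: same underlying distribution?
  ks.test(Left, Right)

With samples this small and skewed (note the 500 in Left), the normality assumption behind the t-test is questionable, which is exactly the situation where a distribution-based test such as K-S can be the safer choice.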
Chi-Squared Test
Twitter users have a gender and a number of tweets. We want to determine whether gender is related to the number of tweets.
Use the chi-square test for independence.

When to Use the Chi-Squared Test
When to use the chi-square test for independence:
◦ Uniform sampling design
◦ Categorical features
◦ Population is significantly larger than the sample
State the hypotheses:
◦ H0?
◦ H1?

Example Chi-Squared Test
  men   = c(300, 100, 40)
  women = c(350, 200, 90)
  data  = as.data.frame(rbind(men, women))
  names(data) = c('low', 'med', 'large')
  data
  chisq.test(data)
Rejecting H0 (p < 0.05) means …

Revisiting The New York Times Dilemma
Select 50% of users to see headline A
◦ Titanic Sinks
Select 50% of users to see headline B
◦ Ship Sinks Killing Thousands
Assign half the readers to headline A and half to headline B?
◦ Yes?
◦ No?
◦ Which test to use?
What happens if A is MUCH better than B?

Sequential Analysis (Sequential Hypothesis Test)
How to stop an experiment early if the hypothesis seems true
◦ The stopping criterion often needs to be decided before the experiment starts
◦ If ever needed:

But there is a better way…

Bandit Algorithms
K distinct hypotheses (so far we had K = 2)
◦ Hypothesis = choosing an NYT headline
Each time we pull arm i we get reward Xi (simple version of the problem)
The underlying population (reward distribution) does not change over time
Bandit algorithms attempt to minimize regret:
  regret = n·μ* − Σi μi·E[ni]
where n = total number of actions, ni = number of times action i was taken, μi = true average reward of action i, and μ* is the largest true average reward.

Challenge
Note that regret is defined over the true average reward.
How can we estimate the true average reward E[Xi]?
◦ We need to get many observations from population i
◦ But what happens if E[Xi] is small?
The core of decision-making problems:
◦ Exploration vs. exploitation
◦ When exploring, we seek to improve the estimated average reward
◦ When exploiting, we try what has worked better in the past
Balancing exploration and exploitation:
◦ Instead of trying the action with the highest estimated average, we try the action with the highest upper bound on its confidence interval (more on this next class)

UCB1 (Upper Confidence Bound 1)
Multi-Armed Bandit (MAB)
◦ A bandit process is a special type of Markov Decision Process
◦ In general, the reward Xi(ni) at the ni-th pull of arm i is drawn from P[Xi(ni) | Xi(ni−1)]
UCB1
◦ Pull the arm i that maximizes
    x̄i + sqrt( 2·ln(n) / ni )
  where x̄i is the empirical average reward of arm i, ni is the number of times arm i has been pulled, and n is the total number of pulls so far.

R Example
  numT  <- 2500   # number of time steps
  ttest <- c()    # t-test p-value at each time step

  # true means of the two reward populations
  mean1 = 0.4
  mean2 = 0.7

  # initialize observations with one pull of each arm
  x1 <- c(rbinom(n = 1, size = 1, prob = mean1))
  x2 <- c(rbinom(n = 1, size = 1, prob = mean2))
  n1 = 1
  n2 = 1

  for (i in 2:numT) {
    # UCB1 index of each arm: empirical mean + exploration bonus
    reward_1 = mean(x1) + sqrt(2 * log(i) / n1)
    reward_2 = mean(x2) + sqrt(2 * log(i) / n2)

    # pull the arm with the larger index
    if (reward_1 > reward_2) {
      x1 <- c(rbinom(n = 1, size = 1, prob = mean1), x1)
      n1 = n1 + 1
    } else {
      x2 <- c(rbinom(n = 1, size = 1, prob = mean2), x2)
      n2 = n2 + 1
    }

    # t-test p-value of the observations collected so far
    if ((n1 > 2) && (n2 > 2)) {
      d <- t.test(x1, x2)$p.value
    } else {
      d = 1
    }
    ttest <- c(ttest, d)
  }

  par(mfrow = c(1, 2))
  plot(2:numT, ttest, "l", xlab = "Time", ylab = "t-Test p-value", log = "y")
  barplot(c(n1, n2), xlab = "Arm Pulls", ylab = "Observations", names.arg = c("n1", "n2"))
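As a rough sanity check on the regret definition from the Bandit Algorithms slide, a short sketch (assuming it is run right after the simulation above, so mean1, mean2, n1 and n2 are still in scope) estimates the expected regret of the simulated run:

  # Expected regret of the run: regret = n * mu_star - sum_i mu_i * n_i,
  # where mu_star is the largest true mean reward.
  mu_star <- max(mean1, mean2)
  regret  <- (n1 + n2) * mu_star - (n1 * mean1 + n2 * mean2)
  regret   # stays small relative to the number of pulls when UCB1 mostly picks the better arm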