Stat 31, Section 1, Last Time • Inference for Proportions – • Hypothesis Tests 2 Sample Proportions Inference – • Skipped 2-way Tables – Sliced populations in 2 different ways – Look for independence of factors – Chi Square Hypothesis test Reading In Textbook Approximate Reading for Today’s Material: Pages 582-611, 634-667 Approximate Reading for Next Class: Pages 634-667 Midterm I - Results Preliminary comments: • Circled numbers are points taken off • Total for each problem in brackets • Points evenly divided among parts • Page total in lower right corner • Check those sum to total on front • Overall score out of 100 points Midterm I - Results Interpretation of Scores: • Too early for letter grades • These will change a lot: • – Some with good grades will relax – Some with bad grades will wake up Don’t believe “A & C” average to “B” Midterm I - Results Interpretation of Scores: • Recall large variation over 2 midterms – No exception this semester Midterm I - Results Compare Midterm Scores 100 Midterm 2 I 90 80 70 60 50 40 40 50 60 70 Midterm I 80 90 100 Midterm I - Results Compare Midterm Scores Line of Equal Scores 100 Midterm 2 I 90 80 70 60 50 40 40 50 60 70 Midterm I 80 90 100 Midterm I - Results Compare Midterm Scores Some have Dramatically Improved 90 Midterm 2 I Others have Been distracted By other things 100 80 70 60 50 40 40 50 60 70 Midterm I 80 90 100 Midterm I - Results Interpretation of Scores: • Recall large variation over 2 midterms – • No exception this semester Get better info from 2 test Total – So will report answers in those terms Midterm I - Results Histogram Midterm I + II, Total Score of Results: 14 10 8 6 4 2 Total Score 5 19 0 18 5 16 0 15 5 13 0 12 5 10 90 75 60 45 0 30 Frequency 12 Midterm I - Results Interpretation of Scores (2 Test total): 170 - 200 A 155 – 168 B 131 – 154 C 120 – 129 D -- 119 F Midterm I - Results Where do we go from here? • I see 2 rather different groups… • Which are you in? • What can you do? • Most important: It is still early days…… Chapter 9: Two-Way Tables Main idea: Divide up populations in two ways – – • E.g. 1: E.g. 2: Age & Sex Education & Income Typical Major Question: How do divisions relate? Are the divisions independent? • – – Similar idea to indepe’nce in prob. Theory Statistical Inference? Two-Way Tables Big Question: Is there a relationship? Class Example 31 - Counts 45 40 35 30 # Bottles 25 purchased 20 15 10 Other Wine 5 Italian Wine 0 None French Wine French Note: tallest bars French Wine French Music Italian Wine Italian Music Other Wine No Music Suggests there is a relationship Music Italian Two-Way Tables General Directions: • Can we make this precise? • Could it happen just by chance? – • Really: how likely to be a chance effect? Or is it statistically significant? – I.e. music and wine purchase are related? Two-Way Tables An alternate view: Replace counts by proportions (or %-ages) Class Example 31 (Wine & Music), Part 2 http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls Advantage: May be more interpretable Drawback: No real difference (just rescaled) Two-Way Tables Testing for independence: What is it? From probability theory: P{A | B} = P{A} i.e. Chances of A, when B is known, are same as when B is unknown Table version of this idea? Independence in 2-Way Tables Counts analog of P{A|B}??? Equivalent condition for independence is: P{ A & B} P{ A} P{B} So for counts, look for: Table Prop’n = Row Marg’l Prop’n x Col’n Marg’l Prop’n i.e. Entry = Product of Marginals Independence in 2-Way Tables Visualize Product of Marginals for: Class Example 31 (Wine & Music), Part 4 http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls Shows same structure as marginals But not match between music & wine Good null hypothesis Class Example 31 - Independent Model 0.18 0.16 0.14 0.12 0.1 # Bottles purchased 0.08 0.06 0.04 Other Wine 0.02 Italian Wine 0 None Music French Wine French Italian Independence in 2-Way Tables Approach: • Measure “distance between tables” – Use Chi Square Statistic – Has known probability distribution when table is independent • Assess significance using P-value – Set up as: H0: Indep. – P-value = P{what saw or m.c. | Indep.} HA: Dependent Independence in 2-Way Tables Chi-square statistic: • Based on: Observed Counts (raw data), Obsi Expected Counts (under indep.), Expi • X 2 cells i Obsi Expi 2 Expi Notes: – Small for only random variation – Large for significant departure from indep. Independence in 2-Way Tables Chi-square statistic calculation: X 2 Obsi Expi cells i 2 Expi Class example 31, Part 5: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls – Calculate term by term – Then sum – Is X2 = 18.3 “big” or “small”? Independence in 2-Way Tables H0 distribution of the X2 statistic: “Chi Squared” (another Greek letter ) 2 Parameter: “degrees of freedom” (similar to T distribution) Excel Computation: – CHIDIST (given cutoff, find area = prob.) – CHIINV (given prob = area, find cutoff) Independence in 2-Way Tables For test of independence, use: degrees of freedom = = (#rows – 1) x (#cols – 1) E.g. Wine and Music: d.f. = (3 – 1) x (3 – 1) = 4 Independence in 2-Way Tables E.g. Wine and Music: P-value = P{Observed X2 or m.c. | Indep.} = = P{X2 = 18.3 of m.c. | Indep.} = = P{X2 >= 18.3 | d.f. = 4} = = 0.0011 Also see Class Example 31, Part 5 http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls Independence in 2-Way Tables E.g. Wine and Music: P-value = 0.001 Yes-No: Very strong evidence against independence, conclude music has a statistically significant effect Gray-Level: evidence Also very strong Independence in 2-Way Tables Excel shortcut: CHITEST • Avoids the (obs-exp)^2 / exp calculat’n • Automatically computes d.f. • Returns P-value Independence in 2-Way Tables HW: 9.27 9.29 And Now for Something Completely Different A statistics joke, from: GARY C. RAMSEYER'S INTERNET GALLERY OF STATISTICS JOKES http://www.ilstu.edu/~gcramsey/Gallery.html And Now for Something Completely Different A somewhat advanced society has figured how to package basic knowledge in pill form. A student, needing some learning, goes to the pharmacy and asks what kind of knowledge pills are available. And Now for Something Completely Different The pharmacist says "Here's a pill for English literature." The student takes the pill and swallows it and has new knowledge about English literature! And Now for Something Completely Different "What else do you have?" asks the student. "Well, I have pills for art history, biology, and world history, "replies the pharmacist. The student asks for these, and swallows them and has new knowledge about those subjects! And Now for Something Completely Different Then the student asks, "Do you have a pill for statistics?" The pharmacist says "Wait just a moment", and goes back into the storeroom and brings back a whopper of a pill that is about twice the size of a jawbreaker and plunks it on the counter. "I have to take that huge pill for statistics?" inquires the student. And Now for Something Completely Different The pharmacist understandingly nods his head and replies: "Well, you know statistics always was a little hard to swallow." Caution about 2-Way Tables Simpson’s Paradox: Aggregation into tables can be dangerous E.g. from: http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node50.html Study Admission rates to professional programs, look for sex bias…. Simpson’s Paradox Admissions to Business School: Male Female Admit 480 180 Deny 120 20 % Males ad’ted = 480 / (480 + 120) * 100% = 80% % Females ad’ted = 180 / (180 + 20)* 100% = 90% Better for females??? Simpson’s Paradox Admissions to Law School: Male Female Admit 10 100 Deny 90 200 % Males ad’ted = 10 / (10 + 90) * 100% = 10% % Females ad’ted = 100 / (100+200)*100% = 33.3% Better for females??? Simpson’s Paradox Combined Admissions: Male Female Admit 490 280 Deny 210 220 % Males ad’ted = 490 / (490 + 210) * 100% = 70% % Females ad’ted = 280 / (280+210)*100% = 56% Better for males??? Simpson’s Paradox How can the rate be higher for both females and also males? Reason: depends on relative proportions Notes: • In Business (male applicants dominant), easier to get in (660 / 800) • In Law (female applicants dominant), much harder to get in (110 / 400) Simpson’s Paradox Lesson: Must be very careful about aggregation Worse: may not be aware that aggregation has been done…. Recall terminology: Lurking Variable Can hide in aggregation… Could be used for cheating… Simpson’s Paradox HW: 9.15 9.17 Inference for Regression Chapter 10 Recall: • Scatterplots • Fitting Lines to Data Now study statistical inference associated with fit lines E.g. When is slope statistically significant? Recall Scatterplot Toy Scatterplot, Separate Points For data (x,y) View by plot: 2.5 (1,2) 1.5 (3,1) 0.5 (-1,0) 2 y 1 0 -2 -1 -0.5 0 1 -1 (2,-1) -1.5 x 2 3 4 Recall Linear Regression Idea: Fit a line to data in a scatterplot • To learn about “basic structure” • To “model data” • To provide “prediction of new values” Recall Linear Regression Recall some basic geometry: A line is described by an equation: y = mx + b m = slope b = y intercept m b Varying m & b gives a “family of lines”, Indexed by “parameters” m & b (or a & b) Recall Linear Regression Approach: Given a scatterplot of data: ( x1 , y1 ),..., ( xn , yn ) Find a & b (i.e. choose a line) to “best fit the data” Recall Linear Regression Given a line, y bx a , “indexed” by b & a ( x1 , y1 ) ( x2 , y 2 ) ( x3 , y 3 ) Define “residuals” = “data Y” – “Y on line” = yi (bxi a ) Now choose b & a to make these “small” Recall Linear Regression Excellent Demo, by Charles Stanton, CSUSB http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html More JAVA Demos, by David Lane at Rice U. http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html http://www.ruf.rice.edu/~lane/stat_sim/comp_r/index.html Recall Linear Regression Make Residuals > 0, by squaring Least Squares: adjust b & a to Minimize the “Sum of Squared Errors” n SSE yi (bxi a ) i 1 2 Least Squares in Excel Computation: 1. INTERCEPT (computes y-intercept a) 2. SLOPE (computes slope b) Revisit Class Example 14 http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls HW: 10.17a Inference for Regression Goal: develop • Hypothesis Tests and Confidence Int’s • For slope & intercept parameters, a & b • Also study prediction Inference for Regression Idea: do statistical inference on: – Slope a – Intercept b Model: Yi aX i b ei ei are random, independent and N 0, e Assume: Inference for Regression Viewpoint: Data generated as: y = ax + b Yi chosen from Xi Note: a and b are “parameters” Inference for Regression Parameters a and b determine the underlying model (distribution) Estimate with the Least Squares Estimates: â and b̂ (Using SLOPE and INTERCEPT in Excel, based on data) Inference for Regression Distributions of â and b̂ ? Under the above assumptions, the sampling distributions are: aˆ ~ N a, a bˆ ~ N b, b • Centerpoints are right (unbiased) • Spreads are more complicated Inference for Regression Formula for SD of â : SD aˆ a e n xi x 2 i 1 • – • Big (small) for e big (small, resp.) Accurate data Accurate est. of slope Small for x’s more spread out – • Data more spread More accurate Small for more data – More data More accuracy Inference for Regression Formula for SD of b̂ : SD bˆ b e 1 n x2 n xi x 2 i 1 • – • – • Big (small) for e big (small, resp.) Accurate data Accur’te est. of intercept Smaller for x 0 Centered data More accurate intercept Smaller for more data – More data More accuracy Inference for Regression One more detail: Need to estimate using data e For this use: n se yi aˆxi bˆ i 1 2 n2 • Similar to earlier sd estimate, • Except variation is about fit line • n 2 is similar to n 1 from before s Inference for Regression Now for Probability Distributions, Since are estimating e by se Use TDIST and TINV With degrees of freedom = n 2 Inference for Regression Convenient Packaged Analysis in Excel: Tools Data Analysis Regression Illustrate application using: Class Example 27, Old Text Problem 8.6 (now 10.12)