8-23-05
JJ Deely (prof), Tilman Achberger (asst)
Office Hours: math building, room 546, T-Th 1-2pm or by appt (arranged in class, not by email)
1. Workbook: 3 items in your workbook, look at weekly schedule
Oct 6th: Test #1, 7-8pm
November 14th (Monday): Test #2, 7-8pm
1. Homework: familiarize with concepts
2. Data: how to collect it and think about it
3. Tests: comprehension
All tests/quizzes are open note, book, workbook
Final grade =
10% HW
10% rec. quiz
10% workbook and attendance
20% test 1
20% test 2
30% final
Vista: grades and e-mail
Quiz on Friday on “admin details” V and VI

8-25-05
What is stat and why am I here? So I can be educated
Main methods to obtain data:
Anecdote: small story with a point; people tell it as if it were the truth
Census: never reaches 100% of the population (people avoid it, are homeless, etc.)
Sampling: a census is too expensive
-obtaining data from the population
-measure in units
-how many is the sample
-from where? Sampling frame
-quantity of interest: variable
OJ example:
Population = all PU students
Unit = a student
Sample = m = 500
Sampling frame = mechanism to generate the sample (e.g. phone numbers from the book)
Variable = do you drink OJ
Observational Studies
Experiments

8-26-05 (Rec)
Gopal Panicker
Office Hours: Math G132, M: 3-4pm, Tr: 2-3pm, Thurs: 7-9pm, Phys 117
Can go to any TA’s hours; there are 11 TAs

8-30-05
Assign 1 is on web. Turn in Friday.
Take data set from web. Draw random sample.
250 heating bills; take a sample of 20
SRS: simple random sample
SampBC: if not prompted, go to Tools-Macro-Security-Medium
Alt+F8, run macro
A2:A251, 20, 1, C2 (column)
without replacement (not put back in the hat)
=AVERAGE(C2:C21)
do 10 times

9-1-05
p-hat (p with a caret over it) = 53% = .53
n = 208
Interval: p-hat - x, p-hat + x
.53 - x, .53 + x
using formula from workbook on pg 3, x = .07
using above: .46, .60
pg 37: cross out the formula, it’s an approximation
everyone wants a good sample
2 sources of error: sampling and non-sampling
difference:
sampling errors:
-variability of the population
-method used to obtain the sample (voluntary/convenience)
-sampling frame (telephone)
non-sampling errors (not a problem with the way the sample was drawn):
-non-response
-deceptive replies (truth hurts) or data incorrectly tabulated
-question not understood

9-2-05 (Rec)
p is the population proportion, p-hat is the sample proportion
confidence intervals: n = total number asked, not the number in favor or against

9-6-05
main purpose of a stat. study: does changing one variable (the explanatory variable, i.e. the controlled variable) cause changes in another variable (the response variable, i.e. the measured variable)?
drinking milk -> happiness
differences between observational studies and experiments:
Experiments (under our control): make plans about what, how, who to measure
Observational (no control over what is measured)
Treatments: placebo
Why observational studies: may be unethical or impossible to conduct a proper experiment (alcohol during pregnancy)
-certain explanatory variables can’t be controlled
Randomized Comparative Experiments:
-random, not haphazard
-eliminates placebo
-(look up more info)
Statistically Insignificant
-could have happened by chance
-question: what’s the chance that two treatments differ by a certain amount just by chance? If the chance of observing the difference in the data is very small, the difference is significant; if not, it’s not significant

9-8-05
one page of good data, and then summarize it for extra credit
Good experiments:
1. randomized block (why have blocks?
Randomizes more, OJ in men/women)
6-18: start with ABCDE, then BCDEA, then CDEAB: Latin square (Sudoku)
2. matched pair (p. 99): best kind of matched pair is an experiment done on the same unit; however, that’s not always possible, hence the name “matched pair”

9-13-05
Measurements: provide information about how, what, why, who
Measurements are not always numbers (who, etc.)
Categorical measurements (not #s)
Quantitative measurements
Valid, reliable but possibly biased measurements:
1. Valid: does it measure what it is supposed to?
2. Reliable: can someone else obtain approximately the same measurement?
3. Biased: a systematic prejudice in some given direction of the measurement involved
Do the #s make sense?
How were the #s produced?
Exactly what is being produced?
Many examples in chapter 9

9-15-05
Graphs for categorical variables:
-pie chart and bar graph
-Pareto bar graph (bar graph with the data sorted)
(these were categorical)
Quantitative:
Line graphs
Histograms summarize data as well
Excel: bin = interval
Format-Options-Gap width-0
Column C: My Bin, 200-1200
Tools-Histogram, C1:C12.
Output: D1
Font down to 8

9-20-05
Line graphs for quantitative variables
Relationships of one variable to another
Often “time” is on the x-axis, but not always
Scatterplot:
Delete lines
Scale to 300
Manatee deaths on y-axis, powerboat registrations on x-axis
New material: One-Number Summaries
Two types:
-Measures of central tendency
-Measures of variability or spread
Measures of central tendency: look at corn data
Another measure of central tendency: median (sort, take the middle number)
Measures of variability:
Range: largest - smallest
=MAX(B2:D88)
=MIN(B2:D88)
Quartiles and IQR (interquartile range)
Lower quartile: number which 25% are below (156)
Upper quartile: number which 25% are above (173)
IQR = 17 (the middle 50% of counties are within 17 bushels of one another)
Standard Deviation (stdev)
Big formula: pg 227 in book
divide by (n-1) instead of n (dividing by n would be the plain average; statisticians are weird)
measures distance from the mean to all of the other points

9-22-05
find max and min
find the median
find upper and lower quartile (number ¼ of the way through and ¾ of the way through)
min - lower q - median - upper q - max
draw in a thermometer-type vertical line
format axis and make the numbers look good
Alt+F8 for boxplot
have to include empty spaces from column C, so B2:C33
two dots: outliers: less than Q1 − (1.5 × IQR) or greater than Q3 + (1.5 × IQR), where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range (equal to Q3 − Q1).

9-27-05
The Bell Curve:
Mu = center of the bell curve
Mu-3sigma (min), Mu (middle), Mu+3sigma (max)
Standard Normal: every normal distribution can be reduced to the “standard normal”
1. mu, the mean = 0
2. sigma, the stdev = 1
Standard Score: Z = (x - mean)/stdev
A sample proportion is nearly Normal for large samples (give an example: go to binge drinking data)
1. mean of p-hat = p
2. stdev of p-hat = next slide.
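The standard score can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the example numbers (mean 250, stdev 25, x = 225) are borrowed from the 10-4-05 test review, and the area below z is computed with the error function from Python's standard library instead of being read from the table in the back of the book.

```python
import math

def standard_score(x, mean, stdev):
    """Z = (x - mean) / stdev, as in the notes."""
    return (x - mean) / stdev

def area_below(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: what share falls below 225 when mean = 250 and stdev = 25?
z = standard_score(225, 250, 25)
print(z)                        # -1.0
print(round(area_below(z), 2))  # 0.16
```

The same two functions cover the "above 300" style of question: take 1 minus the area below the z for 300.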
St dev = sqrt(p(1-p)/m)

9-29-05
q31-q74, then take one hour and do 1-30
Quiz tomorrow: 6-10 on sample test
Bell Curve problem: a sample proportion is nearly normal for large samples
m = total number asked
Multiply the st dev by 3; add to p for upper, subtract from p for lower; p is the middle
Proportion of p-hat below .17: Z = (.17-.2)/.004 = -7.5
New Topic: we now want to think about relationships: how strong are they and in what direction?
Look at the data. How strong? Positive or negative?
Draw scatterplot
What kind of association do they have?
Correlation is the stat method of defining the strength and direction of a linear relationship
Least-squares fit: square all the distances from the points to the line, then add; make this number as small as possible
Correlation is given by a formula (never have to use it, because Excel can do it for us)
Correlation is between -1 and +1
If r = +1, all points are on a line and the line goes up
If r = -1, all points are on a line and the line goes down
If r = 0, no linear relationship
Add trend line, display R-squared
Correlation r = sqrt(R-squared); must be same sign as the slope

10-4-05
test: CL50
75. 150
76. none
77. A, A=35
35. 95% (one of our rules)
CI on p: p-hat + or - 2 sqrt(p-hat(1-p-hat)/n)
300/400 = .75 = p-hat, m = 400
.706, .794 (B)
36. mean = p = .45
stdev: sqrt(p(1-p)/400) = .025
37. .375, .45, .525
btwn .4 and .45
below .45 is .5 (50%)
below .4: z = (.4-.45)/.025 = -2
back of book: .023
.5 - .023 = .477 (C)
43. 175, 250, 325 (3 sigma rule)
below 225: z = (225-250)/25 = -1, .16
44. above 300: z = (300-250)/25 = 2, .9772 or .98
1 - .98 = .02 (D)
45. .1 (10% of students), Z = -1.28
-1.28 = (x-250)/25, x = $218 (A)

10-6-05
margin of error: do a 95% confidence
4. E
5. E
10. .45 and .4
.5 - ?
z = (.4-.45)/.025 = -2
.475 (C)
18. 60% of 300 = 180, write out the tables (D)
21. D, with 100: 25 = LQ, 50 = median, 75 = UQ
24. C, STDEV = sqrt(p(1-p)/n)
43. KNOW THIS: do 3 sigma
above x is .1
z = (x-63)/2
area below z must be .9
1.28 = (x-63)/2, x = 66
Sigma = ST DEV
61. E, all of above
62.
31/131 = B
workbooks through pg 16
must report discrepancies by next Tuesday

10-13-05
The correlation is the statistical method of defining the strength and direction of a linear relationship between two quantitative variables (pg 270)
What line do we use to compute R? The least squares (best fit) line
Least squares estimates for the regression equation
Correlation: R-squared = formula pg 290
= variation in predicted Y divided by the total variation in Y
= % variation in Y explained
b-hat = slope of regression line
correlation must have same sign as slope
r = 0: there can definitely still be a relationship, just not a linear one
R-squared = percent of the variability of the data up and down the line, accounted for by the regression line. If all points are on the line, R-squared would be 1 (100%)

10-18-05
www.stat.purdue.edu/~jdely/stat113/assign/Lecture16secret.xls
What is an index: a shorthand summary to compare one number to another, or a bunch of numbers to other numbers
An index is a convenient and simple method to compare quantitative measurements
Price index = (current price/base price) x 100
Assignment A6: 2 extra credit: sentence about each of those terms, explaining what each variable is
106 from 100: increase of 6%
CPI: Consumer Price Index
Based on fixed market basket:
Food index doesn’t take into account any change in living, all about cost
CPI = (avg price now/avg price then) x 100
Dollars now/dollars then = CPI now/CPI then
CPI now = 198.8
Homework question about beef and gasoline
'95 price = (CPI then/CPI now) x price now

10-20-05
causation: two variables have a strong connection
retrospective study: looks back at events that have already occurred
prospective study: sets things up and then checks on them later
Uses of CPI: economic policy, compare prices over time, adjust other economic data
Dollars now/dollars then = CPI now/CPI then
Q16 on p324: www.stat.purdue.edu\~jdeely\stat113\MinWageNewDataFall05.xls
use this data and say I’d like some extra credit.
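The CPI rule "dollars now/dollars then = CPI now/CPI then" can be sketched in Python as below. The function names are invented for illustration, and apart from CPI now = 198.8 (from the notes) the numbers are hypothetical placeholders, not values from the class data set.

```python
def price_index(current_price, base_price):
    """Price index = (current price / base price) x 100."""
    return current_price * 100 / base_price

def convert_dollars(amount, cpi_from, cpi_to):
    """Re-express `amount` from the year whose CPI is cpi_from
    in dollars of the year whose CPI is cpi_to."""
    return amount * cpi_to / cpi_from

CPI_NOW = 198.8    # from the notes
CPI_THEN = 150.0   # hypothetical placeholder, not a real CPI value
price_now = 3.00   # hypothetical price today

# then-price = (CPI then / CPI now) x price now
price_then = convert_dollars(price_now, CPI_NOW, CPI_THEN)
print(round(price_then, 2))

# An index of 106 against a base of 100 is a 6% increase.
print(price_index(106, 100))
```

Swapping the two CPI arguments converts in the other direction, which is what the minimum wage question needs when everything is put in 1960 dollars.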
=A2*c2/a2
minimum wage has gone up, but in terms of 1960 dollars, it’s gone down

10-25-05
test 2: 11-14, Monday, 7pm
Canceled lectures: 11-17, 11-22
No Rec on 11-18 (no homework)
Tools for Statistical Inference
Dictionary: random: going, made, occurring, etc. without definite aim, purpose, or reason: at random
Stat meaning: various ideas (random sampling, random measurement); we can make quantitative inferences about events
Language of Probability:
Rules:
1. The collection of all possible measurements, called S, has probability 1
2. The probability of any collection of measurements is btwn 0 and 1
3. If A and B are collections of measurements from S and they have no measurement in common, the probability of A or B is the sum of the two probabilities; we write P(A or B) = P(A) + P(B)
P 360: 20 for A7
18 red and 18 black and 2 green
5 red cards out leaves more blacks, so a black coming out is more likely

10-27-05
3 rules:
P(S) = 1
Prob btwn 0 and 1
If A and B have no values in common, then P(A or B) = P(A) + P(B)
A7-Q2: 0 = girl, 1 = boy
00000 00001 00010 00011
00100 00101 00110 00111
01000 01001 01010 01011
01100 01101 01110 01111
10000 10001 10010 10011
10111 11111
in theory all are equally likely, so each has probability 1/total number
probability is number/total number (32)
Do this for extra credit: 38 outcomes, 1/38 each; 18 red, 18 black, 2 green
Add this: probability of black or larger than 30
3 values in common: 31, 33, 35
P(A or B) = (# of values in A or B)/# total
Black (18) + bigger than 30 (3 more not already counted) = 21/38
Extra credit homework: red or smaller than 6, 18+3 = 21/38
Probability tree:
          Ant yes            Ant no
Test +    .997               .015 (false pos)
Test -    .003 (false neg)   .985
P(ant is there given that test is positive):
Ant yes (.01, since 1% has it): test + .997, so + = .997 * .01
Ant no (.99): test + .015, so + = .99 * .015
# from the branch we’re on / # from both branches = (.997 * .01)/(.997 * .01 + .99 * .015) = about .40
Extra credit assignment
Numbers 1-6: prob .167
Tools-Data Analysis-Random Number Generation
2
900
discrete
give values and prob.
6721 (random)
output:
Ctrl+Shift+arrow up
Ctrl+D (down?)
=IF(OR(C2=6, C2=7, C2=8), 10, -10)

11-1-05
Independent Measurements
If the prob. of event A given event B equals the prob. of event A, then A and B are statistically independent
Statistically independent measurements give us a new rule
Multiplication Rule: P(A and B) = P(A)P(B)
If A and B are independent, then the prob. of A and B is P(A and B) = P(A)P(B)
A8: 19.7
A=4, B=3, C=2, D/F=1
Get ex cred if calculate the theoretical answer:
Prob 0 = .8
Expectation:
Expect: regard as likely to happen

11-3-05
on simulation, turn in: “After simulation, I got ___”
19.10 on p 387
prob .7, use Bernoulli
use this on Q 17 & 19 together
prob won’t show up is .25, prob of show is .75
what’s the chance of getting 8 or 9?
17: simulation .304
19: 180/1000 no bus, .18
Expectation:
1. roulette (workbook p. 23)
(2) txt p. 400 ex 3
Value   Prob
 1      18/38
-1      20/38
Expt: 18/38 - 20/38 = -2/38 = about -5 cents

11-8-05
Workbooks on Friday
Sample Test Questions
A9 on the web today, comes from Sample Test 2 questions

11-10-05
WTHR 200

11-15-05
2 by 2 tables
one categorical variable independent of another

12-1-05
A10: due 12-8, hand in lectures
Final: Dec 15, Lambert Field House, 10:20-12:20, have ID
Workbooks graded Dec 9, p25 not req
2-way table
Is one categorical var independent of another? Is there an association? Are they related?
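One way to read "is one categorical var independent of another?" is through the multiplication rule from 11-1-05: under independence, each cell's proportion equals its row proportion times its column proportion. The Python sketch below checks exactly that. The function name and both tables are hypothetical, made up for illustration only.

```python
def is_independent(table, tol=1e-9):
    """table: 2-way table of counts, given as a list of rows.
    True if every cell satisfies P(A and B) = P(A) * P(B)."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]        # P(row category)
    col_p = [sum(col) / total for col in zip(*table)]  # P(column category)
    return all(
        abs(table[i][j] / total - row_p[i] * col_p[j]) <= tol
        for i in range(len(table))
        for j in range(len(table[0]))
    )

# Hypothetical counts: rows keep the same 2:3 split, so no association.
print(is_independent([[20, 30], [40, 60]]))  # True

# Hypothetical counts: the split differs between rows.
print(is_independent([[30, 20], [40, 60]]))  # False
```

Counts from a real sample almost never match the product exactly, so an exact check like this is only a way to see what "independent" means in a table, not a test.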
Chi-square statistic
chi-sqr = sum of (observed count - expected count)^2 / expected count
if chi-sqr value is larger than the value in the table (475), we say there’s an association: there is a relationship, they’re not independent
expected count is the number we would’ve expected to see if there was no relationship
multiply row total by column total and divide that by grand total = expected value of any cell
DF: degrees of freedom
DF = (r-1)(c-1) (use this to get the number from the table to see if chi-sqr is bigger)
2x2 table: (2-1)(2-1) = 1
3x3 table: (3-1)(3-1) = 4, so row 4 of the .05 column
chi-sqr stat = all chi-sqr numbers added up
.05 column = error 5% or less of the time
Simpson’s Paradox: how can the women be behind the men in both instances, yet ahead overall?
Lurking variable: weighted average

12-6-05
if we do page 25 in our workbook, we get extra credit
A10-p482:12
Prob would be equal if there were no relationship
Once get probs, get chi-sqr statistic
37 is chi-sqr value
get number from book; ours is larger, so there IS a relationship
Simpson’s Paradox: relationships between two categorical variables may have hidden lurking variables
(women who are not employed are happier than those who are)
(must add to the filled-in cell)
breaking it down into age groups, there’s a reversal
18-35: 400-405, 800-750, .33-.35
35-50: 1600-45, 2200-50, .421-.474
confidence intervals: before, for an unknown population proportion (using p-hat); now for an unknown mu
answer for pg 25: x-bar - 2(s/sqrt(m)), x-bar + 2(s/sqrt(m)) for 95% confidence
2: the distribution of the sample mean approaches a bell curve as the sample size increases
sample mean = x-bar, s = st dev, m = number of people
a value not in the example? P 492, Ex 4: They’re wrong!

12-8-05
sample exam
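The chi-square recipe from 12-1-05 (expected count = row total x column total / grand total, then add up (observed - expected)^2 / expected over all cells) can be sketched in Python as follows. The function name and the 2x2 counts are hypothetical, chosen only to show the arithmetic; they are not from A10.

```python
def chi_square(observed):
    """observed: 2-way table of counts, given as a list of rows.
    Returns (chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count we'd expect in this cell if there were no relationship.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Hypothetical 2x2 table of counts.
stat, df = chi_square([[30, 20], [40, 60]])
print(round(stat, 2), df)

# The standard .05 table value for 1 degree of freedom is 3.84;
# a statistic above it suggests an association.
print(stat > 3.84)
```

The comparison value still has to come from the row of the chi-square table that matches the degrees of freedom; only the df = 1 value is written into this sketch.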