GAISEing into the Statistics Common Core
Day 2: Statistical Association
June 27, 2013

Team
• Dr. Stephanie Casey is an Assistant Prof. of MathEd at EMU. Her research focuses on teacher knowledge for teaching statistics at the middle and secondary levels, motivated by her experience of teaching secondary mathematics for fourteen years.
• Dr. Andrew Ross is an Associate Prof. of Math at EMU, specializing in operations research. He was named the Michigan MAA Distinguished Teaching Awardee in 2011.
• Dr. Brenda Gunderson is a Senior Lecturer in the Stats Dept. at the University of Michigan. She coordinates and teaches Statistics and Data Analysis, with approximately 1800 students each term.
• Anamaria Kazanis, Pstat, is a Senior Statistician at MSU. She is the current president of the Ann Arbor Chapter of ASA.
• Karen Nielsen is a PhD student in the Stats Dept. at the University of Michigan. She has taught 2 years of undergraduate introductory Statistics labs and served as a mentor to other Graduate Student Instructors. As part of a cross-disciplinary team, she helped to bring online learning objects into large-enrollment gateway classes.
• Mackenzie Fankell graduated from the U of M in 2009 with a degree in psychology. After graduating she worked as an English teacher in Chile for two years before returning to the US and working as a high school math teacher in Dearborn, MI. She began her master's in education at U of M in 2012 but transferred to a master's program in statistics later that year. She hopes to pursue research in education and the social sciences.

Outline of Our Day
• 9:00-10:30 a.m.: GAISE into the CCSS-M statistics standard(s) of the day:
  • The standard,
  • its learning trajectory, and
  • content
• 10:30-10:40 a.m.: BREAK
• 10:40 a.m.-12:10 p.m.:
GAISE activities part 1
  • activities that teach the standard through the GAISE process,
  • debrief on the experience and how to utilize the activity in their own classroom
• 12:10-1:00 p.m.: LUNCH BREAK
• 1:00-2:00 p.m.: GAISE activities part 2
• 2:00-2:30 p.m.: Interactive lecture on
  • knowledge of standard and students,
  • discussing what students are likely to think about and do as they progress through the learning trajectory for the standard;
  • common student conceptions, effective ways to support students as they move through the learning trajectory
• 2:30-3:00 p.m.: Reflections on the day’s standard(s), share ideas, comments, concerns, etc. for teaching the standard(s)

9:00-10:30 a.m.
GAISE into the CCSS-M statistics standards of the day:
• The standards
• Learning trajectory
• Content

Standards, Grade 8 (part 1)
Investigate patterns of association in bivariate data.
• CCSS.Math.Content.8.SP.A.1 Construct and interpret scatter plots for bivariate measurement data to investigate patterns of association between two quantities. Describe patterns such as clustering, outliers, positive or negative association, linear association, and nonlinear association.
• CCSS.Math.Content.8.SP.A.2 Know that straight lines are widely used to model relationships between two quantitative variables. For scatter plots that suggest a linear association, informally fit a straight line, and informally assess the model fit by judging the closeness of the data points to the line.

Standards, Grade 8 (part 2)
Investigate patterns of association in bivariate data.
• CCSS.Math.Content.8.SP.A.3 Use the equation of a linear model to solve problems in the context of bivariate measurement data, interpreting the slope and intercept. For example, in a linear model for a biology experiment, interpret a slope of 1.5 cm/hr as meaning that an additional hour of sunlight each day is associated with an additional 1.5 cm in mature plant height.
• CCSS.Math.Content.8.SP.A.4 Understand that patterns of association can also be seen in bivariate categorical data by displaying frequencies and relative frequencies in a two-way table. Construct and interpret a two-way table summarizing data on two categorical variables collected from the same subjects. Use relative frequencies calculated for rows or columns to describe possible association between the two variables. For example, collect data from students in your class on whether or not they have a curfew on school nights and whether or not they have assigned chores at home. Is there evidence that those who have a curfew also tend to have chores?

Standards, High School (part 1)
Summarize, represent, and interpret data on two categorical and quantitative variables
• CCSS.Math.Content.HSS-ID.B.5 Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.
• CCSS.Math.Content.HSS-ID.B.6 Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
• CCSS.Math.Content.HSS-ID.B.6a Fit a function to the data; use functions fitted to data to solve problems in the context of the data. Use given functions or choose a function suggested by the context. Emphasize linear, quadratic, and exponential models.
• CCSS.Math.Content.HSS-ID.B.6b Informally assess the fit of a function by plotting and analyzing residuals.
• CCSS.Math.Content.HSS-ID.B.6c Fit a linear function for a scatter plot that suggests a linear association.

Standards, High School (part 2)
Interpret linear models
• CCSS.Math.Content.HSS-ID.C.7 Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
• CCSS.Math.Content.HSS-ID.C.8 Compute (using technology) and interpret the correlation coefficient of a linear fit.
• CCSS.Math.Content.HSS-ID.C.9 Distinguish between correlation and causation.

AP Statistics (part 1)
• I. Exploring Data: Describing patterns and departures from patterns (20%–30%)
  Exploratory analysis of data makes use of graphical and numerical techniques to study patterns and departures from patterns. Emphasis should be placed on interpreting information from graphical and numerical displays and summaries.
  D. Exploring bivariate data
    1. Analyzing patterns in scatterplots
    2. Correlation and linearity
    3. Least-squares regression line
    4. Residual plots, outliers and influential points
    5. Transformations to achieve linearity: logarithmic and power transformations
  E. Exploring categorical data
    1. Frequency tables and bar charts
    2. Marginal and joint frequencies for two-way tables
    3. Conditional relative frequencies and association
    4. Comparing distributions using bar charts

AP Statistics (part 2)
• IV. Statistical Inference: Estimating population parameters and testing hypotheses (30%–40%)
  Statistical inference guides the selection of appropriate models.
  A. Estimation (point estimators and confidence intervals)
    8. Confidence interval for the slope of a least-squares regression line
  B. Tests of significance
    6. Chi-square test for … homogeneity of proportions, and independence (…two-way tables)
    7. Test for the slope of a least-squares regression line

Learning Trajectories/Progressions
• TurnOnCCMath.net
• Progressions for the Common Core State Standards in Mathematics
• Project SET: http://project-set.com/
• http://project-set.com/presentations/121712-regressionlpfinal-released/

Turn On CC Math.net (up to 8th grade)

Progressions for the Common Core State Standards in Mathematics
• By The Common Core Standards Writing Team themselves

GAISE Level A, assoc.-related
• I. Formulate the Question
  → Teachers help pose questions (questions in contexts of interest to the student).
• II.
Collect Data to Answer the Question
  → Students conduct a census of the classroom.
  → Students understand individual-to-individual natural variability.
  → Students conduct simple experiments with nonrandom assignment of treatments.
• III. Analyze the Data
  → Students observe association between two variables
  → Students use tools for exploring … association, including:
    ▪ Scatterplot
    ▪ Tables (using counts)
• IV. Interpret Results
Example:

GAISE Level B, assoc.-related
• I. Formulate Questions
  → Students begin to pose their own questions
• III. Analyze Data
  → Students quantify the strength of association between two variables, develop simple models for association between two numerical variables, and use expanded tools for exploring association, including:
    ▪ Contingency tables for two categorical variables
    ▪ Time series plots
    ▪ The QCR (Quadrant Count Ratio) as a measure of strength of association
    ▪ Simple lines for modeling association between two numerical variables
• IV. Interpret Results
  → Students understand basic interpretations of measures of association.
Example: favorite music

GAISE Level C, assoc.-related
• I. Formulate Questions
  → Students should be able to formulate questions and determine how data can be collected and analyzed to provide an answer
• III. Analyze Data
  → Students should be able to recognize association between two categorical variables.
  → Students should be able to recognize when the relationship between two numerical variables is reasonably linear, know that Pearson’s correlation coefficient is a measure of the strength of the linear relationship between two numerical variables, and understand the least squares criterion in line fitting
Example
Example: plotting residuals

• http://project-set.com (there are many other Project SET’s)
• Aimed at high school
• Loop 1, golf ball drop, could be used in middle school
  • Informal lines of fit
• Loop 2, vertical leap, is for HS: least-squares, residuals
  • And possibility of categorical association
• Loop 2, used car prices, is for HS: least-squares, residuals
• Loop 3, NFL QB salaries, is for HS: least-squares, r or R^2
• Loop 4&5, texting, just for AP Stat

Loop 1: Informal Fit
• Using Golf Ball Drop data
• Please read the handout, use spaghetti to show your informally fitted line.
  • Not allowed to break spaghetti to connect individual dots!
• Finish instructions on handout.
• Also, what is wrong with experimental plan?

Lack of Replication!
• When possible, should do at least 2 experiments under each experimental setting (drop height, in this case)
• Helps quantify uncertainty at each x value
• Can then use fancy tests for nonlinearity (post-AP-level stats)

What if we had only done one trial at each dose?
Might see just the diamonds, or just the Xs.
Also, when designing, choose 3 or more X values, so we can detect nonlinearity.

Show 3 Types of Scatterplots:
• Designed experiment, with replication
Don’t average the y values at each x value to “make it simpler”!
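One way to see why averaging the replicates is risky is a small numeric sketch. The Python below uses made-up bounce heights (not workshop data) with three drops at each of four drop heights, and a hypothetical helper `fit_line`. With the same number of replicates at every x value, the fitted line is the same either way, but R^2 computed from the averages is higher because the drop-to-drop variability has been hidden.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data: drop height (cm) -> three observed bounce heights (cm).
drops = {
    40: [22, 27, 25],
    60: [35, 41, 32],
    80: [50, 44, 53],
    100: [58, 66, 61],
}

# Raw data: one (x, y) point per individual drop.
raw_x = [h for h, bounces in drops.items() for _ in bounces]
raw_y = [b for bounces in drops.values() for b in bounces]

# "Simplified" data: one averaged y value per drop height.
avg_x = list(drops)
avg_y = [sum(bounces) / len(bounces) for bounces in drops.values()]

raw_slope, raw_intercept, raw_r2 = fit_line(raw_x, raw_y)
avg_slope, avg_intercept, avg_r2 = fit_line(avg_x, avg_y)

print(f"raw:      slope={raw_slope:.3f}  R^2={raw_r2:.3f}")
print(f"averaged: slope={avg_slope:.3f}  R^2={avg_r2:.3f}")
```

The averaged fit looks tighter than the experiment really was, so predictions for a single future drop would seem more certain than they are.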
Show 3 Types of Scatterplots:
• Observational Study
[Scatterplot: Pennsylvania, district-by-district; x = Avg Teacher Salary, y = Math test scores; trendline y = 0.0059x + 1023.8, R^2 = 0.2108]

Show 3 Types of Scatterplots:
• Time Series

Common Suggestions for Informal Fits
• Connect First and Last Points
• Connect Lowest and Highest Points
• Divide the data in half
• Connect as many points as possible
• And others we’ll get to later.
• Before we go on, sketch graphs that show these ideas aren’t great.

Common suggestions for informal fits:
Another common suggestion

Loop 2: Residuals = actual - predicted
NOTE: Residuals are measured VERTICALLY, not horizontally and not perpendicular to the line of best fit.

New ideas for informal fit?

Usual student’s answer: sum the absolute residuals
• Not a bad idea!
• But, some bad points about it:
  • Historically, harder to do than what we’ll see next.
  • Sometimes the choice of line is not unique.
  • Advanced statistical theory supports a different choice.
• Good points:
  • Modern software can do it.
  • It’s resistant to outliers.

Usual statistician’s answer: sum the squared residuals
• This applet shows the geometric squares of the residuals: http://www.geogebra.org/en/upload/files/mrfox001/line_of_best_fit.html
• Does CCSSM require use or knowledge of formulas to find the line that minimizes the sum of squared residuals?
• Standards aren’t so clear to me; the draft Progressions document seems to focus only on using technology to fit the line automatically.

Standards, High School (part 1)
Summarize, represent, and interpret data on two categorical and quantitative variables
• CCSS.Math.Content.HSS-ID.B.5 Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.
• CCSS.Math.Content.HSS-ID.B.6 Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
• CCSS.Math.Content.HSS-ID.B.6a Fit a function to the data; use functions fitted to data to solve problems in the context of the data. Use given functions or choose a function suggested by the context. Emphasize linear, quadratic, and exponential models.
• CCSS.Math.Content.HSS-ID.B.6b Informally assess the fit of a function by plotting and analyzing residuals.
• CCSS.Math.Content.HSS-ID.B.6c Fit a linear function for a scatter plot that suggests a linear association.

Standards, High School (part 2)
Interpret linear models
• CCSS.Math.Content.HSS-ID.C.7 Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
• CCSS.Math.Content.HSS-ID.C.8 Compute (using technology) and interpret the correlation coefficient of a linear fit.
• CCSS.Math.Content.HSS-ID.C.9 Distinguish between correlation and causation.

What is the length of this line?

Is this a square? What is its Area?

Popular drawing: sum of squared residuals
But squares are actually coming “out of the page” at us; both base & depth are measured in $
• What is the danger lurking in the equation that it shows?
http://xkcd.com/833/

Label the axes!
• It is very easy to get confused: is y=original data, or y=residuals?
• Other, more advanced plots have:
  • X=predicted, y=actual
  • X=predicted, y=residual
  • X=run sequence of data (1st, 2nd, etc), y=residual
• Here are six recommended plots for examining the residuals: http://www.itl.nist.gov/div898/handbook/eda/section3/6plot.htm
  However, it neglects another type that it mentions elsewhere: a run-order or run-sequence plot.

It is standard practice to graph the residuals!

Timing data from yesterday
Let’s try it on the TI calculators.
Name        1st  2nd
Mackenzie    17   17
Lori         21   27
Paul         17   21
ASK          24   22
Katelyn      23   20
Karin        24   33
Allison      18   19
Karen        20   20
Jamie        22   19
Andra        18   20
Susan        24   25
Sherita      53   45
Susan        15   16
Stephanie    23   27
Jordan       27   26
Ed           18   18
Mila         25   27
Wendy        24   23
Claudia      28   27
Steve        24   26
Linda        25   25
Karen        28   28
Elizabeth    28   26
Jeff         26   25
Kim          38   26
Jeannette    19   24
Lisa         29   28
Joanne       30   25
Molly        31   38
Laura        33   35

With line of best fit:
• What if we flip the x & y data before doing regression?

What should residual graphs look like?
• No patterns!
• If there are any patterns, that means our original regression missed something.

Which of these are okay/not okay?
Each graph has x=original x data values, y=residuals

Usual Procedure
1. Graph the data
2. Fit a function
3. Compute and graph residuals
4. Any pattern left? Repeat from step 2
5. No pattern left? We’re done!
• Students get confused: do I want to see a pattern, or not?
• In the original data, yes (usually). In the residuals, no.

Correlation Coefficient
• r (that is, little-r)
• Always between -1 and +1
• Close to zero: no linear relationship
• Close to -1 or +1: close-to-linear relationship
• 0 to 0.5 is weak, 0.5 to 0.8 is moderate, above 0.8 is strong (though that’s for social-science/biology stuff, not engr/physics)
• r doesn’t change if x or y units change (or both), or axes flip.

Which has the highest correl.?
• This is called Anscombe’s Quartet
• Again, which has the highest correlation?

Teacher Value-Added Scores
• New York City, 2006 vs 2008 school year
• Data on each individual teacher
  • Same school, same subject, same grade level
  • At least 3 years of experience
• X=VA z-score in 2006; Y=VA z-score in 2008
• What will the scatterplot look like?
• What will the r or R^2 value be?
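Returning to the timing data, the flip question and the properties of r listed above can be checked numerically. The Python sketch below uses the 30 pairs from the timing table; treating the first number as x and the second as y is an assumption, as are the helper name `summarize` and the factor of 1000 standing in for a change of units.

```python
from math import sqrt

# The 30 (first, second) timing values from the table, in the order listed.
pairs = [
    (17, 17), (21, 27), (17, 21), (24, 22), (23, 20), (24, 33),
    (18, 19), (20, 20), (22, 19), (18, 20), (24, 25), (53, 45),
    (15, 16), (23, 27), (27, 26), (18, 18), (25, 27), (24, 23),
    (28, 27), (24, 26), (25, 25), (28, 28), (28, 26), (26, 25),
    (38, 26), (19, 24), (29, 28), (30, 25), (31, 38), (33, 35),
]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

def summarize(xs, ys):
    """Return (r, slope) for the least-squares line predicting ys from xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy), sxy / sxx

r, slope_y_on_x = summarize(xs, ys)

# Flip the axes: r stays the same, but the regression slope changes.
r_flipped, slope_x_on_y = summarize(ys, xs)

# Change the units of x (a hypothetical factor of 1000): r is unchanged.
r_rescaled, _ = summarize([1000 * x for x in xs], ys)

print(f"r = {r:.4f}, flipped r = {r_flipped:.4f}, rescaled r = {r_rescaled:.4f}")
print(f"slope (y on x) = {slope_y_on_x:.4f}, slope (x on y) = {slope_x_on_y:.4f}")
print(f"product of slopes = {slope_y_on_x * slope_x_on_y:.4f}, r^2 = {r * r:.4f}")
```

The product of the two slopes equals r^2, which is why the flipped regression line is not simply the original line solved for x unless the points are perfectly collinear.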
Outliers & Influential Points
• Each data point could be:
  • An x outlier
  • A y outlier
  • Both x and y outlier, or neither x nor y outlier
  • A regression outlier (far from the pattern of the data)
  • Influential (if removed, the slope of the regression line would change more than just a little bit, whatever that means in the context of the problem)
• Not all outliers are influential!

Outlier boundaries
• Using the usual definition of outlier: more than 1.5 IQR from Q1 or Q3
• Slanted lines for regression outliers use Q1, Q3, IQR of residuals:
  • Trendline + Q3 + 1.5 IQR, and
  • Trendline + Q1 – 1.5 IQR (Q1 of residuals will be < 0, almost always)

Influence on Slope
• Consider a lattice of possible points we might add to data set
• Compute abs(%change in regression slope)
• Color small changes blue, large changes red.
• Center is near (mean x, mean y)

Influence on R^2
• Using abs(% change in R^2)
• Red regions show large change, blue shows little change
• Note white-ish regions at bottom-left & top-right: adding points from those regions (which are near the original trendline) increases R^2
• Adding points from top-left or bottom-right decreases R^2

Common Names for Variables
horizontal     vertical
x              y
independent    dependent
free           outcome
predictor      predicted
stimulus       response
cause          effect
controlled     uncontrolled
explanatory    explained
regressor      result

Example
• Suppose you read an article that says that people who eat at least one carrot a day tend to spend less on health care than those that don’t. Does this mean you should eat more carrots to stay healthier?
• Perhaps a hidden variable is a person’s attitude toward health. People who try to take good care of themselves probably eat a lot of all veggies. They probably also have better health than those who don’t care what they eat.
• This proposes a specific lurking variable (rather than saying “there is a lurking variable” with no further explanation), and it says how that variable affects both of the variables already mentioned. It doesn’t argue that the link doesn’t exist.

Correlation does not imply Causation
• Perhaps a 3rd variable, not in the study, is affecting both variables that were in the study (“lurking”)
• Perhaps the causation runs the opposite way of what was proposed
• Common lurking variables:
  • SES = Socio-Economic Status (poverty, etc.)
  • A person’s overall health
  • A person’s health attitude
  • Population Size (of a city/state/country)
  • Inflation or flow of time
  • Weather
  • Local cost of living
  • % liberal/conservative by region

Argue about these:
• There is a positive link between consumption of tobacco and non-use of seatbelts. So if you want to cut down on smoking, buckle up!
• There is a link between the # of years of math someone takes in high school and their future income. So, Michigan should require high school students to take at least Algebra 2.
• An actual article said something like: there is a link between credit card debt and health problems. So, to make yourself healthier, pay down your credit cards, since credit card debt causes stress which can cause health problems.
• There is a link between the presence of computers in K-12 schools and their standardized test scores. Therefore, we should spend more money on computers in schools.

• Smoking and seat belts: health attitude
• Algebra-2: lurking variable of geekiness?
• Debt and health problems: maybe health problems cause debt, more than debt causes health problems?
• Computers in schools: socio-economic status?
http://xkcd.com/552/

Interpreting Slope
[Scatterplot: Pennsylvania, district-by-district; x = Avg Teacher Salary, y = Math test scores; trendline y = 0.0059x + 1023.8, R^2 = 0.2108]

Algebra vs Statistics vocabulary: Slope
• Algebra: slope = how much y will change for a 1-unit change in x
• Statistics: slope of regression line = AVERAGE change in y per 1-unit DIFFERENCE in x
  • AVERAGE: no guarantee that y will change exactly that much.
  • DIFFERENCE: saying “change” might give the impression that we are changing the x value of a data point (putting someone on a stretching machine) instead of comparing two different x values (two people of different heights).

True or False?
1. T/F: If you give a raise of $10,000 to each teacher in a particular district, that district’s avg. test score will go up 59 points.
2. T/F: If you live in a district with a $50,000 average teacher salary and move to a district with a $60,000 average, your child’s test scores will go up, on average, by 59 points.
3. T/F: If you live in a district with a $50,000 average and move to a district with a $60,000 average, that district’s score will be 59 points higher.
4. T/F: If you live in a district with a $50,000 average and move to a district with a $60,000 average, then on average the scores in that district will be 59 points higher.
5. T/F: Since there is a district with a salary of about $30,000 with test scores above 1400, and another district around $65,000 with test scores below 1200, we can see that there’s no correlation between salary and test scores.

Answers
1. False; this presumes that the correlation is a causation. It talks about changing the x-value of one data point, not comparing two data points.
2. False; your child is still your child with all their existing demographics. While an increase might happen, there’s no reason to think it would even average 59 points.
3. False; this statement sounds like a guarantee.
4. True; saying “on average” is the key point.
5. False; a single counter-example (or even many of them) doesn’t disprove a general trend.

Free School Lunches cause bad test scores?
[Scatterplot: x = Pct Free Lunch, y = score; trendline y = -304.48x + 1394.4, R^2 = 0.4524]
http://xkcd.com/605/
http://xkcd.com/1007/

Ecological Fallacy
• Better to call it Aggregation Fallacy (my personal opinion)
• The fallacy is: aggregate data gives useful info on individuals.

10:40 a.m.-12:10 p.m.
• GAISE activities part 1: participants engage in activities that teach the standard through the GAISE process, then debrief on the experience and how to utilize the activity in their own classroom
• Possible / Favorite activities for Quantitative Association:
  • Barbie Bungee
  • Spaghetti Bridge
  • Balloon Descent Time
  • Paper Helicopter Descent Time
  • Sports ball bounce height (Golf? Ping-pong? Superbounce?)
  • M&M Exponential Survival Curve
  • Two Estimates of Timing, or weight of objects, or age of a person

12:10-1:00 p.m.: Lunch Break
Workshop participants engage in a process of balancing nutritional value, price, and flavor options to decide what food to eat. They might have already done this in preclass work (“brown bag”).
Participants will eat their selected food (or randomly selected?) n=30 times and record the resulting observations.

1:00-2:00 p.m.
GAISE activities part 2: participants engage in activities that teach the standard through the GAISE process and utilize technology, then debrief on the experience and how to utilize the activity in their own classroom

Categorical Association Activity: Eyes!

Categorical Association
• CCSS.Math.Content.8.SP.A.4 Understand that patterns of association can also be seen in bivariate categorical data by displaying frequencies and relative frequencies in a two-way table.
Construct and interpret a two-way table summarizing data on two categorical variables collected from the same subjects. Use relative frequencies calculated for rows or columns to describe possible association between the two variables. For example, collect data from students in your class on whether or not they have a curfew on school nights and whether or not they have assigned chores at home. Is there evidence that those who have a curfew also tend to have chores?

Categorical Association
• CCSS.Math.Content.HSS-ID.B.5 Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.

Summer Employment and Gender: Joint Frequency
Job Experience                                    Male   Female   Total
Never had a part-time job                          21      31
Had a part-time job during summer only             15      13
Had a part-time job but not only during summer     12       8
Total
What numbers/tables and graphs could we look at to explore the data looking for any association between gender and summer employment?

Are these helpful?
[Two bar charts of counts for Male and Female; one is labeled “Never had a part-time job”]

Marginal Frequency
Job Experience                                    Male   Female   Total
Never had a part-time job                          21      31       52
Had a part-time job during summer only             15      13       28
Had a part-time job but not only during summer     12       8       20
Total                                              48      52      100
What numbers/tables and graphs could we look at to explore the data looking for any association between gender and summer employment?

Conditional Frequency, by Gender
Job Experience                                    Male   Female   Total
Never had a part-time job                          44%     60%      52%
Had a part-time job during summer only             31%     25%      28%
Had a part-time job but not only during summer     25%     15%      20%
Total                                             100%    100%     100%
Grade 8: Use relative frequencies calculated for rows or columns.
HS: Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies)

Conditional Frequency, by Gender: 100% Stacked Column
• A.K.A. Segmented Bar Graph
[100% stacked column chart: one column each for Male and Female, segmented by job experience]

Conditional Frequency by Experience
Job Experience                                    Male   Female   Total
Never had a part-time job                          40%     60%     100%
Had a part-time job during summer only             54%     46%     100%
Had a part-time job but not only during summer     60%     40%     100%
Total                                              48%     52%     100%
What numbers/tables and graphs could we look at to explore the data looking for any association between gender and summer employment?

Conditional Frequency, by Experience
[100% stacked column chart: one column per job-experience category, segmented into Female and Male]

Famous Discrimination Case
          Recommended for Promotion   Not Recommended for Promotion   Total
Male                 21                            3                    24
Female               14                           10                    24
Total                35                           13                    48
Suppose these numbers show an actual association (“statistically significant”). Is that evidence of illegal discrimination?

Activity starter
• Everyone get a triangle-ended slip of paper.
• With both eyes open, hold hands at arm’s distance like this,
• Focus on a distant object.
• Close left eye. Reopen.
• Close right eye. Reopen.
• Whichever eye still saw the object is your “dominant eye”
• Write your dominant eye (L or R) on the triangle-end of your slip of paper.
• What might be related to eye dominance?
  • (suggestions from the whole group)
• Write that on the non-triangle end of your slip of paper.
• Pass slips of paper to the front

Let’s make a table
            Left Hand   Right Hand   Total
Left Eye        2           10         12
Right Eye       2           18         20
Total           4           28         32

Example table 1 (fake data)
         Eye   Hand
Left       5     9
Right     35    31

Example table 2 (fake data)
            Left Hand   Right Hand
Left Eye        3            2
Right Eye       6           29

What if extreme association?
            Left Hand   Right Hand   Total
Left Eye        4            8         12
Right Eye       0           20         20
Total           4           28         32

What if no association?
• First, make a segmented bar chart (100% Stacked Column chart)
• Then, make a table of conditional relative frequencies
• Then, make a table of joint frequencies
Conditional:
            Left Hand   Right Hand   Total
Left Eye      12.5%       87.5%      100%
Right Eye     12.5%       87.5%      100%
Joint:
            Left Hand          Right Hand          Total
Left Eye    12*12.5% = 1.5     12*87.5% = 10.5       12
Right Eye   20*12.5% = 2.5     20*87.5% = 17.5       20
Total             4                  28              32

Simulating No Association (artificial data)
            Left Hand   Right Hand   Total
Left Eye                                5
Right Eye                              35
Total           9           31         40

2:00-2:30 p.m.
• Interactive lecture on
  • knowledge of standard and students,
  • what students are likely to think about and do as they progress through the learning trajectory for the standard
  • common student (mis-)conceptions,
  • effective ways to support students as they move through the learning trajectory

Common student conceptions
“What Fits” --from Project SET and STEW
Activity: drop a golf ball from various heights, record height of its first bounce
The following graphs show where actual students placed a “best-fit” line (thin metal rod)
For each, try to figure out the student's reasoning; then we'll reveal it.
• I thought of the line of best fit like the mode, so I put the line through two points with the same y-coordinate because they occur most often.
• The line needs to start at (0,0) then go through the most dots. I got my line to go through two of the dots so I put it there.
• The line should be in the middle of the highest and lowest points because that’s like the average.
• I know the line should go in the middle of the data.
  I put it here so it would be in the middle, four points on each side.
• I tried to make my line go through the most dots.

Help these students
• You are leading the class in analyzing the relationship between GPA and ACT scores for a data set, using technology. You have asked your students to find the correlation coefficient, coefficient of determination, and regression line for the data set. One pair of students asks for your help. They have done their work independently and are now comparing their answers. They have the same correlation coefficient & coefficient of determination, but their regression lines are different.
• What went wrong?
• How would you help them?

The answer
• One of them used GPA to predict ACT, and the other did the reverse.
• The r and R^2 values will be the same
• The slopes will be different (slope1 × slope2 = r^2, so slope1 is close to 1/slope2 only when the correlation is strong, and equals it only when r = ±1)
• The intercepts will be different
• Predictions will be different
• How to help? Perhaps ask each to predict the ACT of a 4.0 student; one will be able to answer quickly, the other will have to solve a linear equation.
• Perhaps sketch a quick graph with vertical residuals and horizontal residuals?

Rerouting a Student Idea
• To start a lesson on lines of best fit, you are following the curriculum guide and have presented your students the following data set about the pounds of beans used by families of different sizes when traveling on the Overland Trail:
#people           5   8   6   7   11   10   5    7   10   5    8    7    9   12   10
Pounds of beans  61  95  56  75  125  135  80  100  103  75  100  105  125  150  125
• First question: how much does a party of 20 need?
• Student suggests: You could look at people with like 10 people in their families and just double that amount.
• How to reroute into a line-of-best-fit?

Rerouting a Student Idea
• Which party of 10? There are 3 of them.
• Why not take a party of 5 and quadruple their usage?
• Why not take a party of 12 and a party of 8 and add them?
• What if the parties of 10 were by chance not big eaters? Can we use all of the data to estimate needs?
• What if a student suggests to compute pounds-per-person for each party, average them all, then multiply by 20?
• Industrial engineers would point out that while a linear regression will forecast the needed amount of food, you should actually bring along some safety stock, above the forecast.

Vocabulary Time
• A student in your class raises his hand and asks “What’s the difference between correlation, association, and regression?”.
• What would you say, do, or draw?

The answer
• Association is the most general. All of these are associated:
  Association also applies to categorical data.
• Correlation is the direction and strength of any linear relationship
• Regression is the process of fitting a line or curve to a data set.
• But, many people use them as if they mean the same thing, so we can't assume when we read or hear something that the person is being technically accurate in their usage. This is even true for some textbooks.

• R^2 = % of variation explained by the predictor variable, using the model that you used.
  • = 1 – (sum of squared residuals)/(sum of squared y deviations from the y mean)
  • “Coefficient of Determination”
• Quadratic fits will ALWAYS be better (or at least as good) than linear fits, as measured by R^2 or sum of squared residuals.
• Similarly, cubic will ALWAYS beat quadratic (or be at least as good), etc.
• This is the danger of “Overfitting”: modeling your existing data points too well, at the expense of making good predictions of future data points.

Zero Correlation
• Create three distinctly differently patterned scatterplots which all have a correlation coefficient of approximately zero.
• Explain why the correlation coefficient is near zero for each of the cases.
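One possible set of answers to the zero-correlation task can be checked with a short Python sketch. The three point sets below (a parabola, a ring, and an X shape) are made up for illustration, and the helper name `pearson_r` is hypothetical. Each pattern is symmetric, so the positive and negative products of deviations cancel and r comes out to zero even though two of the three patterns show a very strong, nonlinear relationship.

```python
from math import sqrt

def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sqrt(sxx * syy)

# Pattern 1: a parabola, y = x^2, on x values symmetric about zero.
parabola = [(x, x * x) for x in range(-3, 4)]

# Pattern 2: a ring of points around the origin.
ring = [(3, 0), (0, 3), (-3, 0), (0, -3), (2, 2), (-2, 2), (-2, -2), (2, -2)]

# Pattern 3: an X shape made from the lines y = x and y = -x.
x_shape = [(x, x) for x in range(-3, 4)] + [(x, -x) for x in range(-3, 4)]

r_parabola = pearson_r(parabola)
r_ring = pearson_r(ring)
r_x_shape = pearson_r(x_shape)

for name, r in [("parabola", r_parabola), ("ring", r_ring), ("X shape", r_x_shape)]:
    print(f"{name}: r = {r:.3f}")
```

A fourth answer, a shapeless cloud of random points, would give r near zero rather than exactly zero, which is why the task says "approximately".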
Zero Correlation
• Also, if all the data are on a perfectly flat line, y=0x+b, the correlation coefficient is technically undefined (divide-by-zero), but we might redefine it to be zero in that instance.
• But that would never happen with real data.

Model Selection
• You have been teaching your class about finding the best model for a data set. A student says, “So I just try all of these different equations on the data set and whichever one gives me the biggest value of r is the best one, right?”

An answer
• First, is there theory that indicates which model might be best? Exponential growth for populations, for example. Decay toward zero in some cases (could be power or exponential). Horizontal asymptotes predicted by theory? Are negative values (for x or for y) allowed?
• If theory has nothing to say, we have an inherent preference for linear models, since (a) Occam’s razor says to use the simplest thing, and while mathematically they are all equally simple in some sense, we really like linear stuff, and (b) if you zoom in on any of these smooth curves you will get approximately a line. This preference might lead us to select a linear model with R^2=0.80 instead of a power/log/exponential model with R^2=0.83, for example. How big a difference is acceptable? Hard to say.
• If we’re including polynomial models like a*x^2+b*x+c, warn about the danger of overfitting: adding more terms never decreases R^2, but can make the predictions of future values ridiculous.

And then there’s Logs
• Logarithmic transforms aren’t part of the CCSSM, but are part of AP Statistics.
• If some of the data has already been log-transformed (decibels, earthquake Richter-scale magnitudes, pH, etc.), then it’s unlikely you would want to log it again.
• If the data (usually y rather than x) spans roughly 2 or more orders of magnitude, consider logging it.
• If the residuals have a fan shape, consider transforming y (either logging or square-rooting, usually).
• If the residuals look skewed, consider transforming y to bring them back toward normality.

Common Misconceptions on Categorical Association
• Research has found that students commonly have three incorrect conceptions about association of categorical variables:
• Determinist: students believed that an association meant all cases must show an association with no exceptions. These students believed that the cells in the two-way table that did not agree with the association should have zero frequency.
• Unidirectional: students believed dependence occurred only when it was direct, so an inverse association was not seen as an association. This could be explained by the tendency of students to give more relevance to positive cases than negative cases that confirm a given hypothesis.
• Localist: students looked at part of the data to determine if an association existed, often only looking at the cell with the highest frequency or at only one conditional distribution.

Determinist: no exceptions!
Artificial data:

                    Has AIDS   Doesn't have AIDS
Has HIV             97         1
Doesn't have HIV    2          9900

• What are some arguments, based on the numbers in this table, that they are NOT associated?
• What are some arguments, based on the numbers in this table, that they ARE associated?
• What does a Determinist look at in a segmented bar chart?
• Does an association have to hold for every single case for the association to be true?
• Does the strong association shown in the table show that HIV causes AIDS?

Unidirectional?
• Poll 100 people at your high school about their college plans.

               Private   Public
In-state       5         60
Out-of-state   25        10

Does the table above show an association? Does the table below show an association?

               Public   Private
In-state       60       5
Out-of-state   10       25

What does Unidirectional thinking look like on a segmented bar chart?

Localist?
• Ask 100 people whether they like summer or winter weather more, and their birthplace.

          East of the Mississippi   West of the Mississippi
Summer    72                        8
Winter    18                        2

It's pretty clear that people from the East prefer Summer to Winter. Does this mean that there is an association between birthplace and weather preference?
Compute the conditional distributions:
What is the probability that someone from the East likes Summer weather? [72/(72+18)=72/90=80%]
What is the probability that someone from the West likes Summer weather? [8/(8+2)=8/10=80%]
The two conditional proportions are equal, so this table shows no association.

Categorical Association and Experimental Design
• You are leading your students in an activity in which they complete all 4 steps of the GAISE framework (formulate a research question, collect their own data, analyze the data, and interpret the results) in a situation relevant to categorical association.
• Students have brainstormed the following research questions.
• For each question, determine if it is relevant to categorical association, and provide your response to the student.
• Consider data collection issues as well. For those that are not relevant and/or have data collection issues, suggest some modifications.

Relevant to Categ. Assoc.? Data collection issues?
• LAURA: I want to study if girls are smarter than boys, so I am going to compare the GPAs of boys and girls at school.
• BILL: I think people with higher GPAs are less likely to have had a car crash while driving. I am going to ask seniors with driver's licenses their GPAs and if they’ve crashed while driving.
• ANNE MARIE: I want to see whether boys are more likely to own smartphones than girls. So I’m going to count during passing periods the number of boys I see with smartphones and the number of girls I see with no phones.

Relevant to Categ. Assoc.? Data collection issues?
• NICK: I think girls are more likely to bring their lunch to school than boys.
So I’m going to stand at the entrance to the lunchroom and count the number of boys and girls that bring their lunches.
• JEFF: I want to study the relationships between gender, race, grade level, political party and whether students make honor roll or not at our school.
• KRISTIN: I read that 77% of high school students go on to college. I want to see if that’s true of the students here.

2:30-3:00 p.m.
• Reflections on the day’s standard(s), share ideas, comments, concerns, etc. for teaching the standard(s)

Standards for Mathematical Practice
• How do they relate to Statistical Association?
• (not sure what part of the day this fits in the best, if at all)
• CCSS.Math.Practice.MP1 Make sense of problems and persevere in solving them.
  • Make sense of trends in scatterplots
  • Make sense of outliers/influential points
  • Persevere in trying different functional fits
• CCSS.Math.Practice.MP2 Reason abstractly and quantitatively.
  • “The slope has important practical interpretations for most statistical investigations of this type” (CCSS Progressions grade 8)
• CCSS.Math.Practice.MP3 Construct viable arguments and critique the reasoning of others.
  • Think of lurking variables that would argue against a correlation being causal
  • Evaluate linear/exponential/power/log models based on their asymptotic behavior
• CCSS.Math.Practice.MP4 Model with mathematics.
  • Everything!
  • “build statistical models to explore the relationship between two variables” (CCSS Progressions grade 8)
  • “using their knowledge of functions to fit models to quantitative data” (CCSS Progressions HS)
• CCSS.Math.Practice.MP5 Use appropriate tools strategically.
  • Not much argument that technology is needed to do most regression analysis?
  • Tension between classroom technology and real-world technology?
• CCSS.Math.Practice.MP6 Attend to precision.
  • Don't report too many decimal places in slope, intercept, R^2, or predictions of new values
  • Phrase predictions about new values carefully, avoiding causal or deterministic language.
• CCSS.Math.Practice.MP7 Look for and make use of structure.
  • “looking for and making use of structure to describe possible association in bivariate data” (CCSS Progressions grade 8)
  • “using their knowledge of proportions to describe categorical associations”, “Looking for patterns in tables” (CCSS Progressions HS)
• CCSS.Math.Practice.MP8 Look for and express regularity in repeated reasoning.

Day 2 Wrap-Up
• What surprised you today?
• What did you find interesting?
• How might you bring these ideas to your class?
• What would you change?
• Other activities/ideas to share with the group?

Project-SET sports data

Gender   Height (in)   Vertical Jump Height (in)   Likes Sports?
male     72.5          22.5                        Yes
female   70            18.5                        Yes
male     71            17                          No
female   64            17                          No
female   69            16.5                        Yes
male     72.5          27.5                        Yes
female   64.75         12.5                        Yes
male     70            16                          Yes
female   67.5          5.5                         No
female   66            12                          Yes
male     65.5          20.5                        Yes
female   66.75         13.5                        No
female   59.25         11                          No
male     69.25         16                          No

What categorical association question can we ask?
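One candidate question is whether liking sports is associated with gender. The sketch below is only an illustration of how the two-way table and its conditional distributions could be tabulated. It assumes the gender column pairs with the data rows in the order listed, and the helper names `two_way_table` and `conditional_distribution` are hypothetical, introduced for this example.

```python
from collections import Counter

# (Gender, Likes Sports?) pairs from the Project-SET table, assuming the
# gender column lines up with the data rows in the order listed.
rows = [
    ("male", "Yes"), ("female", "Yes"), ("male", "No"), ("female", "No"),
    ("female", "Yes"), ("male", "Yes"), ("female", "Yes"), ("male", "Yes"),
    ("female", "No"), ("female", "Yes"), ("male", "Yes"), ("female", "No"),
    ("female", "No"), ("male", "No"),
]

def two_way_table(pairs):
    """Count each (row category, column category) combination."""
    return Counter(pairs)

def conditional_distribution(table, row_value):
    """Proportion of each column category among cases with the given row value."""
    row_counts = {col: n for (row, col), n in table.items() if row == row_value}
    total = sum(row_counts.values())
    return {col: n / total for col, n in row_counts.items()}

table = two_way_table(rows)
for gender in ("male", "female"):
    dist = conditional_distribution(table, gender)
    print(gender, {answer: round(p, 2) for answer, p in dist.items()})
```

Under that pairing, 4 of the 6 males and 4 of the 8 females say they like sports. With only 14 students, a difference of that size is a starting point for discussion rather than strong evidence of an association.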