CHAPTER 4, REGRESSION ANALYSIS… EXPLORING ASSOCIATIONS BETWEEN VARIABLES

RELATIONSHIPS BETWEEN ...
Talk to the person next to you. Think of two things that you believe may be related. For example, height and weight are generally related: the taller the person, the more they tend to weigh. Or consider the age of your car and its value: generally, the older a car, the less it is worth. Write two numerical variables that you believe are related on the board.

DO YOU BELIEVE THERE IS A RELATIONSHIP BETWEEN...
• TIME SPENT STUDYING AND GPA?
• # OF CIGARETTES SMOKED DAILY AND LIFE EXPECTANCY?
• SALARY AND EDUCATION LEVEL?
• AGE AND HEIGHT?

RELATIONSHIPS
• When data come in pairs (two variables measured on each individual), the data are referred to as bivariate data.
• Much of the bivariate data we will examine is numeric.
• There may or may not exist a relationship (an association) between the 2 variables.
• Does one variable influence the other? Or vice versa? Or do the two variables just 'go together' by chance? Or is the relationship influenced by one or more other variables that we are unaware of?
• Does one variable 'cause' the other? Caution! (Possibly put some examples here to discuss.)

BIVARIATE DATA
Proceed as with univariate distributions … (review... which graphical models do we typically use with univariate data?)
• Still graph (use visual model(s) to describe data: scatterplot; LSRL, the Least Squares Regression Line)
• Still look at overall patterns and deviations from those patterns (DOFS: Direction, Outlier(s), Form, Strength); review: how did we look for patterns in univariate data? What did we use?
• Still analyze numerical summaries (descriptive statistics)

BIVARIATE DISTRIBUTIONS
• Explanatory variable, x ('factor'): may help predict or explain changes in the response variable; usually on the horizontal axis
• Response variable, y: measures an outcome of a study; usually on the vertical axis

BIVARIATE DATA DISTRIBUTIONS
For example ...
• Alcohol (explanatory) and body temperature (response). Generally, the more alcohol consumed, the lower the body temperature. Still use caution with 'cause.'
• Sometimes we don't have variables that are clearly explanatory and response.
• Sometimes there could be two 'explanatory' variables, such as ACT scores and SAT scores, or activity level and physical fitness.
• Discuss with a partner for 1 minute; come up with another situation where we have two variables that are related, but neither is clearly explanatory or response.

GRAPHICAL MODELS…
• Many graphical models display univariate data exclusively (review). Discuss for 30 seconds and share out.
• The main graphical representation used to display bivariate data (two quantitative variables) is the scatterplot.

SCATTERPLOTS
• Scatterplots show the relationship between two quantitative variables measured on the same individuals or objects.
• Each individual/object in the data appears as a point (x, y) on the scatterplot.
• Plot the explanatory variable (if there is one) on the horizontal axis.
• If there is no distinction between explanatory and response, either variable can be plotted on the horizontal axis.
• Label both axes.
• Scale both axes with uniform intervals (but the two scales don't have to match).

LABEL & SCALE SCATTERPLOT

VARIABLES: CLEARLY EXPLANATORY AND RESPONSE??

CREATING & INTERPRETING SCATTERPLOTS
• Let's collect some data; on the board, write your height in inches and your hand span in inches (to the nearest ½ inch).
• Input into Minitab & create a scatterplot; which is our explanatory and which is our response variable?
• Let's do some predicting... to the best of our ability...

INTERPRETING SCATTERPLOTS
Look for overall patterns (DOFS) including:
• direction: up or down, + or – association?
• outliers/deviations: individual value(s) that fall outside the overall pattern; there is no outlier rule for bivariate data, unlike univariate data
• form: linear? curved? clusters? gaps?
• strength: how closely do the points follow a clear form? Strong, weak, moderate?
MEASURING LINEAR ASSOCIATION
• Scatterplots (bivariate data) show direction, outliers/deviation(s), form, and strength of the relationship between two quantitative variables
• Linear relationships are important; common, simple pattern; linear relationships are our focus in this course
• A linear relationship is strong if points are close to a straight line; weak if scattered about
• Other relationships (quadratic, logarithmic, etc.)

CREATING & INTERPRETING SCATTERPLOTS
• Go to my website, download the COC Math 140 Survey Data Fall 2015
• Copy & paste the last 2 columns ('Approximately how many minutes a day, on average, do you spend on social media?' and 'How many people live in your household, including yourself?')
• Is the data messy? Does it need to be 'fixed?' ... Hint: scan for ordered pairs (this is bivariate data); each and every point must be an ordered pair.

CREATING & INTERPRETING SCATTERPLOTS
'Approximately how many minutes a day, on average, do you spend on social media?' and 'How many people live in your household, including yourself?'
• Create a scatterplot of the data.
• Analyze (DOFS)
• Let's do some predictions...

SCATTERPLOTS: NOTE
• Might be asked to graph a scatterplot from data
• Might need to sketch what's on Minitab
• Doesn't have to be 100% accurate; do your best
• Scaling, labeling: a must!

HOW STRONG ARE THESE RELATIONSHIPS? WHICH ONE IS STRONGER?
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
• Sometimes our eyes are not a good judge
• Need to specify just how strong or weak a linear relationship is with bivariate data
• Need a numeric measure
• Correlation or 'r'

MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
• Correlation (r) is a numeric measure of the direction and strength of a linear relationship between two quantitative variables
• Correlation (r) is always between -1 and 1: $-1 \le r \le 1$
• Correlation (r) is not resistant (look at the formula; it is based on the mean)
• r doesn't tell us about individual data points, but rather trends in the data
• Never calculate by formula; use Minitab (dependent on having raw data)

CALCULATING CORRELATION “R”
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
where $n$ is the number of (x, y) pairs, $x_1, x_2$, etc. and $y_1, y_2$, etc. are the data values, $\bar{x}$ and $\bar{y}$ are the means, and $s_x$ and $s_y$ are the standard deviations.

MEASURING LINEAR ASSOCIATION: CORRELATION OR “R”
• r ≈ 0: not a strong linear relationship
• r close to 1: strong positive linear relationship
• r close to -1: strong negative linear relationship
• Go back to our height/hand span data & calculate 'r,' correlation; then practice calculating 'r' with our social media/# in household data (stat, regression, fitted line)

GUESS THE CORRELATION
WWW.ROSSMANCHANCE.COM/APPLETS
• 'March Madness' bracket-style Guess the Correlation tournament
• Playing cards; match up head-to-head competition/rounds
• Look at a scatterplot, make your guess
• The student who is closest survives until the next round

CORRELATION & REGRESSION APPLET PARTNER ACTIVITY
• Go to www.whfreeman.com/tps5e
• Go to applets
• Go to Correlation & Regression
• Follow the directions on the handout (or see my website)
• Partner up with the person next to you; this should take no more than 15-20 minutes, including the write-up; print out & turn in with both your names on it.

CAUTION… INTERPRETING CORRELATION
• Note: be careful when addressing form in scatterplots
• Strong positive linear relationship ► correlation ≈ 1
• But correlation ≈ 1 does not necessarily mean the relationship is linear; always plot the data!
R ≈ 0.816 FOR EACH OF THESE

FACTS ABOUT CORRELATION
• Correlation doesn't care which variable is considered explanatory and which is considered response; you can switch x & y and still get the same correlation (r) value
• Try with hand span & height data; try with minutes on social media & # in household data
• CAUTION! Switching x & y WILL change your scatterplot; try with our data sets!… It just won't change 'r'

FACTS ABOUT CORRELATION
• r is in standard units, so r doesn't change if units are changed
• If we change from yards to feet, or years to months, or gallons to liters ... r is not affected
• + r, positive association
• - r, negative association

FACTS ABOUT CORRELATION
• Correlation is always between -1 & 1
• Makes no sense for r = 13 or r = -5
• r = 0 means no linear relationship; r near 0 means a very weak linear relationship
• r = 1 or -1 means a perfect linear association (all points fall exactly on a line)

FACTS ABOUT CORRELATION
• Both variables must be quantitative (numerical). It doesn't make any sense to discuss r for qualitative or categorical data
• Correlation is not resistant (like the mean and SD). Be careful using r when outliers are present (think of the formula, think of our partner activity)

FACTS ABOUT CORRELATION
• r isn't enough! … If we just consider r, it could be misleading; we must also consider the distribution's mean, standard deviation, graphical representation, etc.
• Correlation does not imply causation; e.g., # of ice cream sales in a given week and # of pool accidents

ABSURD EXAMPLES… CORRELATION DOES NOT IMPLY CAUSATION…
• Did you know that eating chocolate makes winning a Nobel Prize more likely? The correlation between per capita chocolate consumption and the number of Nobel laureates per 10 million people for 23 selected countries is r = 0.791
• Did you know that statistics is causing global warming? As the number of statistics courses offered has grown over the years, so has the average global temperature!
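The facts about correlation listed above (switching x and y, changing units, lack of resistance to outliers) can be checked numerically. The Python sketch below is only an illustration, not the Minitab workflow used in class. It computes r from standardized values using the five (age, height) pairs that appear later in this chapter. The helper names and the extra outlying point (40, 50) are assumptions made up for this illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # r = (1 / (n - 1)) * sum of products of standardized x and y values.
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Age (years) and height (inches) pairs from the review slide in this chapter.
ages_years = [0, 1, 4, 5, 8]
heights_in = [18, 28, 40, 42, 49]

r = correlation(ages_years, heights_in)

# Fact: switching x and y does not change r.
r_swapped = correlation(heights_in, ages_years)

# Fact: changing units (years to months, inches to centimeters) does not change r.
ages_months = [12 * a for a in ages_years]
heights_cm = [2.54 * h for h in heights_in]
r_rescaled = correlation(ages_months, heights_cm)

# Fact: r is not resistant. One hypothetical outlying point (age 40, height 50),
# invented for this illustration, changes r noticeably.
r_with_outlier = correlation(ages_years + [40], heights_in + [50])

print(f"r              = {r:.4f}")
print(f"r (x, y swap)  = {r_swapped:.4f}")
print(f"r (new units)  = {r_rescaled:.4f}")
print(f"r (+ outlier)  = {r_with_outlier:.4f}")
```

The first three values agree, while the fourth is clearly smaller, which is one reason to plot the data before trusting r.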
LEAST SQUARES REGRESSION
• Last section… scatterplots of two quantitative variables
• r measures the strength and direction of the linear relationship in a scatterplot

WHAT WOULD WE EXPECT THE SODIUM LEVEL TO BE IN A HOT DOG THAT HAS 170 CALORIES?

LEAST SQUARES REGRESSION
• BETTER model: summarize the overall pattern by drawing a line on the scatterplot
• Not just any line; we want a best-fit line over the scatterplot
• Least Squares Regression Line (LSRL) or Regression Line

LEAST-SQUARES REGRESSION LINE

LET’S DO SOME PREDICTING BY USING THE LSRL...
About how much would a home cost if it were:
• 2,000 square feet?
• 2,600 square feet?
• 1,600 square feet?

LET’S DO SOME PREDICTING BY USING THE LSRL...
About how large would a home be if it were worth:
• $450,000?
• $350,000?
• $220,000?
Also, let's discuss where the x and y axes start...

LEAST SQUARES REGRESSION EQUATION TO PREDICT VALUES
LSRL Model: $\hat{y} = a + bx$
• $\hat{y}$ is the predicted value of the response variable
• a is the y-intercept of the LSRL
• b is the slope of the LSRL; the slope is the predicted (expected) rate of change
• x is the explanatory variable

LEAST SQUARES REGRESSION EQUATION
• It is typical to be asked to interpret the slope & y-intercept of the equation of the LSRL, in context
• Caution: Interpret the slope of the LSRL as the predicted (or average, or expected) change in the response variable for a one-unit change in the explanatory variable, NOT as the change in y for a unit change in x; the LSRL is a model, and models are not perfect

INTERPRET SLOPE & Y-INTERCEPT...
Notice the embedded context in the equation of the LSRL

LSRL: OUR DATA
• Go back to our data (hand span & height and/or minutes on social media & # in your household).
• Create a scatterplot; then put the LSRL on our scatterplot; also determine the equation of the LSRL
• Minitab: stat, regression, fitted line plot

LSRL: OUR DATA
• Look at the graph of our LSRL for our data
• Look at our LSRL equation for our data
• Our line fits the scatterplot well (best fit) but not perfectly
• Make some predictions… do we use our graph or our equation? Which is easier? Which is better?
More on this in a minute...
• Interpret our y-intercept; does it make sense?
• Interpretation of our slope?

ANOTHER EXAMPLE… VALUE OF A TRUCK

TRUCK EXAMPLE…
Suppose we were given the LSRL equation for our truck data as
$\widehat{\text{price}} = 38{,}257 - 0.1629\,(\text{miles driven})$
• We want to find a more precise estimate of the value if we have driven 100,000 miles. Use the LSRL equation.
• Using the graph, estimate the price if we have driven 40,000 miles. Then use the above LSRL equation to calculate the predicted value of the truck.

AGES & HEIGHTS…
Age (years)    Height (inches)
0              18
1              28
4              40
5              42
8              49

LET’S REVIEW FOR A MOMENT…
• Input into Minitab
• Create a scatterplot and describe it (what do we include in a description?)
• Calculate r (btw, different from slope; why?) and the equation of the LSRL; interpret the equation of the LSRL in context; does the y-intercept make sense?
• Based on the LSRL or the equation of the LSRL (you choose), make a prediction as to the height of a person at age 35.

LSRL: OUR DATA
• Extrapolation: use of a regression line (or the equation of a regression line) for prediction outside the range of values of the explanatory variable, x, used to obtain the line/equation of the line.
• Such predictions are often not accurate.
• Friends don't let friends extrapolate!

CALCULATING THE EQUATION OF THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
• We can still calculate the equation of the LSRL, but it is a little more time-consuming
• Note: Every LSRL goes through the point $(\bar{x}, \bar{y})$
• Formula for the slope of the LSRL: $b = r\,\frac{s_y}{s_x}$
• LSRL: $\hat{y} = a + bx$

CALCULATING THE EQUATION FOR THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
Equation of LSRL: $\hat{y} = a + bx$
• If you do not have raw data but still need to calculate an LSRL, you will be given $\bar{x}$, $\bar{y}$, $r$ (or $r^2$), $s_y$, and $s_x$
• Remember, $(\bar{x}, \bar{y})$ is an ordered pair that is on the graph of the LSRL

EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
• $\widehat{\text{BAL}} = a + b\,(\text{\# of beers consumed})$ (equation of the LSRL in context; better than x & y)
• Remember, the slope formula of the LSRL: $b = r\,\frac{s_y}{s_x}$
• Givens: $\bar{x} = 4.8125$, $\bar{y} = 0.07375$, $s_x = 2.1975$, $s_y = 0.0441$, and $r^2 = 0.80$
• Calculate the slope for the equation of the LSRL: since $r^2 = 0.80$ and the association is positive, $r = \sqrt{0.80} \approx 0.894$, so $b = 0.894 \times \frac{0.0441}{2.1975} \approx 0.0179$
• Remember, the equations of all LSRLs go through $(\bar{x}, \bar{y})$ … so what's next?
• $\hat{y} = a + 0.0179x$; substitute $(\bar{x}, \bar{y})$ into the equation
• $0.07375 = a + (0.0179)(4.8125)$; solve for 'a': $a \approx -0.0124$
• $\widehat{\text{BAL}} = -0.0124 + 0.0179\,(\text{\# of beers consumed})$

INTERPRETING SOFTWARE OUTPUT…
Age vs. Gesell Score

DETOUR… MEMORY MONDAY (OR WAY-BACK WEDNESDAY)…
• What is r? What is r's range?
• r tells us how linear the scatterplot is (and its direction). 'r' ranges from -1 to 1. 'r' describes the scatterplot only (not the LSRL)
• Why do we want/need 'r'?

NOW...
• We need a numerical measurement that tells us how well the LSRL fits (accurately describes) the scatterplot points, the data.
• Coefficient of Determination, or $r^2$

COEFFICIENT OF DETERMINATION …
• Do all the points on the scatterplot fall exactly on the LSRL?
• The line's predictions are sometimes too high and sometimes too low
• Is the LSRL a good model to use for a particular data set?
• How well does our model fit our data?
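The beers/BAL example worked earlier in this section can be reproduced in a few lines. The Python sketch below is only an illustration of the two facts used there, $b = r\,s_y/s_x$ and the LSRL passing through $(\bar{x}, \bar{y})$. The function name `lsrl_from_summary` is hypothetical, and taking the positive square root of $r^2$ is an assumption based on the positive slope in the example.

```python
import math

def lsrl_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of the LSRL y-hat = a + b*x.

    Uses b = r * s_y / s_x, then solves y_bar = a + b * x_bar for a,
    because every LSRL passes through (x_bar, y_bar).
    """
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Givens from the example (x = # of beers consumed, y = BAL).
x_bar, y_bar = 4.8125, 0.07375
s_x, s_y = 2.1975, 0.0441
r_squared = 0.80

# Assumption: the association is positive, so take the positive root.
r = math.sqrt(r_squared)

a, b = lsrl_from_summary(x_bar, y_bar, s_x, s_y, r)
print(f"predicted BAL = {a:.4f} + {b:.4f} * (# of beers consumed)")
```

Because this sketch carries the slope at full precision instead of rounding it to 0.0179 first, its intercept differs slightly in the fourth decimal place from the hand calculation on the slides.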
COEFFICIENT OF DETERMINATION OR $r^2$
• “R-sq” in software output
• Always $0 \le r^2 \le 1$
• Never calculate by hand; always use Minitab
• No need to memorize the formula; trust me … it's ugly

COEFFICIENT OF DETERMINATION OR $r^2$
• Remember “r”: correlation, the direction and strength of the linear relationship in a scatterplot; $-1 \le r \le 1$
• $r^2$, the coefficient of determination: the fraction of the variation in the values of y that is explained by the LSRL; it describes the LSRL; $0 \le r^2 \le 1$

COEFFICIENT OF DETERMINATION OR $r^2$
Interpretation of $r^2$: We say, “x% of the variation in (y variable) is explained by the LSRL relating (y variable) to (x variable).”

GENERAL FACTS TO REMEMBER ABOUT BIVARIATE DATA
• There is a distinction between explanatory and response variables. If they are switched, the scatterplot changes and the LSRL changes (but what doesn't change?)
• The LSRL minimizes the sum of the squared distances from the data points to the line, measured vertically only

GENERAL FACTS TO REMEMBER ABOUT BIVARIATE DATA
• $b = r\,\frac{s_y}{s_x}$
• There is a close relationship between correlation (r) and the slope of the LSRL, but r and b are (often) not the same; when would r and b have the same value?
• The LSRL always passes through $(\bar{x}, \bar{y})$
• We don't have to have raw data to identify the equation of the LSRL

GENERAL FACTS TO REMEMBER ABOUT BIVARIATE DATA
• Correlation (r) describes the direction and strength of straight-line relationships in scatterplots
• Coefficient of determination ($r^2$) is the fraction of the variation in the values of y explained by the LSRL

CORRELATION & REGRESSION WISDOM
Which of the following scatterplots has the highest correlation?

CORRELATION & REGRESSION WISDOM
• All r = 0.816; all have the exact same LSRL equation
• Lesson: Always graph your data! … because correlation and regression describe only linear relationships

CORRELATION & REGRESSION WISDOM
• Correlation is not causation!
• Association does not imply causation… want a Nobel Prize? Eat some chocolate!
• How about Methodist ministers & rum imports?
Year    Number of Methodist Ministers    Cuban Rum Imported to Boston
        in New England                   (in # of barrels)
1860    63                               8,376
1865    48                               6,506
1870    53                               7,005
1875    64                               8,486
1890    85                               11,265
1900    80                               10,547
1915    140                              18,559

BEWARE OF NONSENSE ASSOCIATIONS…
• r = 0.9749, but there is no economic relationship between these variables
• The strong association is due entirely to the fact that both imports & health spending grew rapidly in these years.
• The common year is the other variable.
• Any two variables that both increase over time will show a strong association. That doesn't mean one explains or influences the other

CORRELATION & REGRESSION WISDOM
Correlation is not resistant; always plot the data and look for unusual trends. … What if Bill Gates walked into a bar?

CORRELATION & REGRESSION WISDOM
• Extrapolation! Don't do it… ever.
• Example: Growth data from children from age 1 month to age 12 years … LSRL: $\widehat{\text{height}} = 1.5\ \text{ft} + 0.25\,(\text{age in years})$
• What is the predicted height of a 40-year-old?

OUTLIERS & INFLUENTIAL POINTS
• All influential points are outliers, but not all outliers are influential points.
• Outliers: observations that lie outside the overall pattern

OUTLIERS & INFLUENTIAL POINTS
• Influential points/observations: if removed, they would significantly change the LSRL (slope and/or y-intercept)

CLASS ACTIVITY…
Groups of 2 or 3; measure each other's head circumferences & arm spans (both in inches, rounded to the nearest ½"). Write data on board.
1. Create a scatterplot and describe the association between head circumference and arm span using DOFS. Calculate the correlation (r) for the scatterplot.
2. Is a regression line appropriate for our data? Why or why not? If so, create the LSRL graph & calculate its equation; calculate the coefficient of determination & interpret $r^2$.
3. Interpret the slope and the y-intercept of the LSRL in context.
Continue on next slide for more questions ....

CLASS ACTIVITY…
Groups of 2 or 3; measure each other's head circumferences & arm spans (both in inches, rounded to the nearest ½").
Write data on board.
4. (a) Make a prediction (you can use your LSRL graph or your equation of the LSRL; your choice). If a student's head circumference is 24.5", what would be the predicted arm span (in inches) for that person? (b) If a student's head circumference is 36", what would be the predicted arm span (in inches) for that person?
5. If there is an outlier that is not an influential point on your scatterplot, circle it in red and label it as an outlier.
Continue on next slide for more questions ....

CLASS ACTIVITY…
Groups of 2 or 3; measure each other's head circumferences & arm spans (both in inches, rounded to the nearest ½"). Write data on board.
6. If there is an influential point on your scatterplot, circle it in red and label it as an influential point.
7. Print everything out, put each group member's name on it, and turn it in.
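The numerical steps in the class activity (calculate r, fit the LSRL, calculate and interpret $r^2$, then predict) can be sketched in Python. Because the class measurements are not recorded here, the sketch uses the five age and height pairs from the review slide earlier in the chapter. It is only an illustration, not a substitute for the Minitab output: the function names are hypothetical, and age 6 is an arbitrary in-range value chosen for contrast with the age 35 prediction from the review question.

```python
import math

def fit_lsrl(xs, ys):
    """Return (intercept a, slope b, correlation r) for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # least-squares slope
    intercept = y_bar - slope * x_bar      # line passes through (x_bar, y_bar)
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

def r_squared_from_residuals(xs, ys, intercept, slope):
    """Fraction of the variation in y explained by the line: 1 - SSE/SST."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

def predict(intercept, slope, x, xs):
    """Return (predicted y, True if x is outside the observed x range)."""
    is_extrapolation = x < min(xs) or x > max(xs)
    return intercept + slope * x, is_extrapolation

ages = [0, 1, 4, 5, 8]            # years
heights = [18, 28, 40, 42, 49]    # inches

a, b, r = fit_lsrl(ages, heights)
r2 = r_squared_from_residuals(ages, heights, a, b)

print(f"predicted height = {a:.2f} + {b:.2f} * (age in years)")
print(f"r = {r:.3f}, r^2 = {r2:.3f}")

for age in (6, 35):
    height, extrapolated = predict(a, b, age, ages)
    note = " (extrapolation: do not trust)" if extrapolated else ""
    print(f"age {age}: predicted height {height:.1f} in{note}")
```

For age 35 the line predicts a height of roughly 152 inches (over 12 feet), which is the kind of result that makes the warning about extrapolation concrete.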