Bivariate Statistics

Bivariate data measures two variables at the same time to see whether there is a relationship between them. The data is graphed on a scatter plot, which we use to draw conclusions about the relationship.

Features of a scatter plot:
- Title
- Labelled axes
- Independent variable (the controlled variable) on the x axis
- Dependent variable (the variable that responds to changes in the independent variable) on the y axis
- Marks clearly indicated

Eg. Is there a relationship between the length of your forearm and the size of your foot?

Data: [table of length of arm and length of foot]
Scatter plot: [axes: length of arm and length of foot]

Looking for relationships

We describe the pattern of the data (the correlation) in a scatter plot as:
- Strong/moderate/weak
- Positive/negative
- Linear/non-linear

Eg. [example scatter plots]

Note: We can see that a relationship exists, but we can’t say that one variable “causes” the other.

Ex 4A page 152

Pearson’s Correlation Coefficient (r)

Pearson’s correlation coefficient (r) gives a numeric value to the strength of a linear relationship between two variables. It ranges in value from 1 (a perfect positive relationship) to -1 (a perfect negative relationship). The formula is complicated, but your calculator does all the hard work.

Values of r: 1, 0.75, 0.5, 0.25, 0, -0.25, -0.5, -0.75, -1

Note: Pearson’s correlation coefficient should not be applied to data that:
- Is non-linear
- Has outliers

To find r using your calculator:
- Enter the data in the Lists and Spreadsheet section.
- <MENU> 4: Statistics 1: Stat Calculations 2: Two-Variable Statistics.
- Complete the table, indicating which column has the x data, which has the y data, and which column you would like the results entered into.
- Scroll down to see the entry for r.

The Coefficient of Determination (r²)

To find the coefficient of determination, square Pearson’s correlation coefficient. The coefficient of determination ranges from 0 to 1.
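To show roughly what the calculator is working out, here is a minimal Python sketch that computes r from its standard definition and then squares it to get r². The notes themselves use a calculator, not Python. The function name `pearson_r` and the five forearm and foot measurements are invented for illustration and are not the class data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements in cm, made up for this sketch only
forearm = [22.0, 24.5, 25.0, 26.5, 28.0]
foot = [21.5, 23.0, 24.5, 25.0, 27.5]

r = pearson_r(forearm, foot)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Whatever data is entered, r always falls between -1 and 1, so r² always falls between 0 and 1.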
The coefficient of determination is useful when two variables have a linear relationship, as it tells us the percentage of variation in one variable that can be explained by the variation in the other variable.

Eg. A set of data giving the number of police traffic patrols on duty and the number of fatalities for the region was recorded, and a correlation coefficient of r = -0.8 was found. Calculate the coefficient of determination and interpret its value.

r = -0.8
r² = (-0.8)² = 0.64

We can conclude from this that 64% of the variation in the number of fatalities can be explained by the variation in the number of police traffic patrols on duty. This means the number of police traffic patrols on duty is a major factor in predicting the number of fatalities.

Linear Regression

A line of best fit (sometimes called a regression line) can be drawn through a scatter plot to model the relationship between the variables. This model can then be used to predict one variable given the other.

Methods for fitting a line of best fit

1. By eye
Using a ruler, draw a straight line through the scatter plot so that about half of the points are above the line and half are below. This gives a reasonable approximation. To find the equation of this line, select two points that lie on the line, find the gradient between them, and use the y = mx + c formula to find the equation.

Eg. Draw a line of best fit by eye through this data and determine an equation for it.

2. The two mean method
Moving along the x axis, divide the points into a lower half and an upper half. For the lower half of the data, find the mean of the x values and the mean of the y values to get the point (x̄L, ȳL). Do the same for the upper half of the data to find (x̄U, ȳU). The line of best fit passes through these two points. Find the equation of this two mean regression line in the same way as for the line of best fit by eye.

Practice: Ex 4C page 161 questions 1-12

3.
The Least Squares Regression Method
The formula for this calculation is complicated, but your calculator will do all the hard work for you.

Eg. The manager of a small ski resort has a problem. He wishes to predict, in advance, the number of skiers using his resort each weekend so that he can organise additional resort staffing and catering if needed. He knows that good deep snow will attract skiers in big numbers, but a scant covering is unlikely to attract a crowd. To investigate the situation further, he collects the following data over twelve consecutive weeks at his resort.

Create a scatterplot of the data. This can be done on the calculator.

1. In Lists and Spreadsheet view, enter the data in the table.
2. Hit the Home button and go to Data and Statistics view.
3. Tab to the horizontal axis and select the independent variable, depth, then tab to the vertical axis and select the dependent variable, skiers. The scatterplot will form.
4. It can be seen that there is a strong, positive, linear correlation between the depth of snow and the number of skiers. There is evidence to suggest that as the depth of the snow increases, the number of skiers increases.
5. Next find r, the coefficient of correlation, and r², the coefficient of determination.
Hit Ctrl and Left Arrow to return to Lists and Spreadsheet view.
Hit Menu, Statistics, Stat Calculations, Linear Regression (mx + b).
Hit the Click button and select depth from the drop down list for X List.
Hit Tab and select skiers from the drop down list for the Y List.
There is no need to enter data into the other boxes. Tab to OK and hit Enter.

The coefficient of correlation is r = 0.88402. This indicates that there is a strong, positive correlation between the depth of snow and the number of skiers.
The coefficient of determination is r² = 0.781492. We can say that about 78% of the variation in the number of skiers can be explained by the variation in the depth of snow.
The output also gives us the line of best fit, the least squares regression equation:
y = 186.418x + 28.3373
We can write this more clearly as:
Number of skiers = 186.418 × depth of snow in m + 28.3373
6. The equation of the least squares regression line can also be determined in Data and Statistics view.
Hit Ctrl and Right Arrow to return to your scatterplot.
Hit Menu, Analyse, Regression, Show Linear (mx + b).
The least squares regression equation is y = 186.418x + 28.3373

Interpreting the Gradient and y-intercept

The gradient, 186.418, indicates that for every 1 metre increase in the depth of snow, the predicted number of skiers increases by about 186.
The y-intercept, 28.3373, indicates that if the depth of snow is 0 m, the model predicts about 28 skiers attending the resort.

Practice: Bivariate data worksheets

Using the Least Squares Regression Equation to make Predictions

The usefulness of the model depends on the r value and, as a predictor, on whether we are:
INTERPOLATING: predicting within the range of the data, or
EXTRAPOLATING: predicting outside the range of the data.

Suppose we want to estimate the number of skiers when the depth of snow is 3 m.
Using Number of skiers = 186.418 × depth of snow + 28.3373
Number of skiers = 186.418 × 3 + 28.3373 = 587.5913
That is 588 skiers.
This result is reliable because we have interpolated: 3 m lies within the bounds of the depth of snow given in the table, that is, between 0.5 m and 3.6 m.

Suppose we want to estimate the number of skiers when the depth of snow is 4 m.
Number of skiers = 186.418 × 4 + 28.3373 = 774.0093
That is 774 skiers.
This result is unreliable because we have extrapolated: 4 m lies outside the bounds of the depth of snow given in the table, that is, outside the range of 0.5 m to 3.6 m.

Practice: Ex 4D page 172 qns 1-15
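The two predictions above can be checked with a short Python sketch. This is only an illustration: the notes use a calculator, and the names `predict_skiers`, `GRADIENT`, `INTERCEPT`, `DEPTH_MIN` and `DEPTH_MAX` are made up here. The coefficients and the 0.5 m to 3.6 m data range are the ones quoted in the ski resort example.

```python
# Least squares regression equation quoted in the ski resort example
GRADIENT = 186.418    # extra skiers predicted per extra metre of snow
INTERCEPT = 28.3373   # predicted skiers when the snow depth is 0 m

# Range of snow depths in the collected data (metres), as stated in the notes
DEPTH_MIN, DEPTH_MAX = 0.5, 3.6


def predict_skiers(depth_m):
    """Return the predicted number of skiers and the kind of prediction."""
    skiers = GRADIENT * depth_m + INTERCEPT
    if DEPTH_MIN <= depth_m <= DEPTH_MAX:
        kind = "interpolation"   # inside the data range
    else:
        kind = "extrapolation"   # outside the data range
    return skiers, kind


for depth in (3, 4):
    skiers, kind = predict_skiers(depth)
    print(f"{depth} m: about {round(skiers)} skiers ({kind})")
# prints:
# 3 m: about 588 skiers (interpolation)
# 4 m: about 774 skiers (extrapolation)
```

Labelling each prediction as interpolation or extrapolation alongside the number is a reminder to judge how far the estimate can be trusted.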