Chapter 6 Bivariate Data

6A Introduction to Bivariate Scatterplots
6B Bivariate Data Relationships
6C Pearson’s Correlation Coefficient
6D Line of Best Fit
6E Interpolation and Extrapolation
6F Statistical Investigation

Written by Benjamin Odgers
Maths Teacher
B Teaching / B Science

This theory booklet lines up with the Cambridge Year 12 NSW Standard Mathematics 2 Textbook, which can be found using the following link:
https://www.cambridge.edu.au/education/titles/CambridgeMATHS-Stage-6-Mathematics-Standard-2-Year-12-printand-interactive-textbook-powered-by-HOTmaths/#.XYgHTUszaUk
https://www.youtube.com/user/benjodgers

6A Introduction to Bivariate Scatterplots
https://youtu.be/9ycQrZDefxc

Bivariate data comprises two variables that may or may not be correlated. A good example of bivariate data was given by the Roman architect Vitruvius in the first century BC. He claimed that a person’s arm span is approximately the same as their height.

We can test Vitruvius’ claim using a bivariate scatterplot. We measure the height and arm span of a sample of people and plot each person as a point, with height and arm span as the two variables. If Vitruvius’ claim is true, we should see a strong correlation between people’s height and arm span.

The image accompanying this section is called the Vitruvian Man and was drawn by Leonardo da Vinci. The image was obtained from the following site: https://old.world-mysteries.com/sci_17_vm.htm

Example 1
The following table represents a sample of 15 people. Each person had their height and arm span measured and recorded.

Height (cm)     152  180  159  165  187  183  161  158  165  168  172  176  169  178  178
Arm Span (cm)   153  184  160  167  187  180  159  162  166  168  170  179  171  175  183

a) Construct a scatterplot by plotting the points on the number plane provided.
b) What is the scale for the horizontal axis?
c) According to this data, would you say there is a strong correlation between a person’s height and arm span? Why?
d) What do you notice about the shape of the scatterplot?

[Number plane for part (a): Height (cm) on the horizontal axis and Arm span (cm) on the vertical axis, both marked 150, 160, 170, 180, 190, 200.]

Example 2
https://youtu.be/qTJimK-zTX0
The following table represents a sample of 15 students. Each student recorded the number of hours they studied for an exam as well as their mark.

Study Time (h)    0   18    5    1    2   11   15   18    0    2    7   16   13   10    4
Exam Mark (%)     5   95   80   10   20   80   85   80   35   30   60   80   85   85   40

[Number plane for part (a): Study Time (h) on the horizontal axis, marked 0, 2, 4, ..., 20; Exam Mark (%) on the vertical axis, marked 0, 20, 40, ..., 100.]

a) Construct a scatterplot by plotting the points on the number plane provided.
b) What is the scale for the vertical axis?
c) How many students got a mark greater than 60%?
d) How many students completed less than 8 hours of study in preparation for the exam?
e) According to this data, would you say there is a strong correlation between a person’s study time and exam mark? Why?
f) What do you notice about the shape of the scatterplot?
g) Were there any students who seemed to lie outside the normal trend when comparing students’ study time to their exam mark?

6B Bivariate Data Relationships
https://youtu.be/wj1Ghj4gA2Q

When interpreting bivariate scatterplots, we often talk about variables having (or not having) a relationship. We also describe a relationship by saying that the variables have a correlation or an association with each other. We can tell if variables have a relationship by observing the position of the points. If there is an obvious pattern, a relationship exists between the variables.

Strength of the Relationship (or Association)
By observing the scattering of points on a scatterplot we can judge the strength of the relationship.
When a relationship is strong it will have an obvious trend that can be easily graphed using a straight line or curve. As relationships become weaker it becomes harder to graph the trend.

[Figure: four example scatterplots labelled Strong, Moderate, Weak and No Relationship. Axis labels shown: Arm Span (cm), Height (cm), Husband’s Age (years), Wife’s Age (years), Sale Price ($), Size of Land (Acres), Person’s IQ, Person’s Height (cm).]

Form of the Relationship (or Association)
[Figure: two example scatterplots labelled Linear Form and Non-Linear Form. Axis labels shown: Company Profits ($), Number of Employees, Life Expectancy (years), Person’s Age (years).]

Direction of the Relationship (or Association)
[Figure: two example scatterplots labelled Positive and Negative. Axis labels shown: Arm Span (cm), Height (cm), Life Expectancy (years), Person’s Age (years).]

Example 1
https://youtu.be/pRz5pa-fYXk
Describe the (i) strength, (ii) form and (iii) direction for each scatterplot below.
[Figure: six scatterplots labelled (a) to (f).]

What is the difference between dependent and independent variables when referring to bivariate scatterplots?
https://youtu.be/8op89NrNWf0

Example 2
https://youtu.be/hyRiwAcHUqo
A company would like to know what number of employees will bring the optimal profits. The table below compares the number of employees working to the weekly profits.

Number of employees                0   10    4    2   15    9   14   19    1   17   20    7   12    6    6
Company weekly profits ($1000)     0   90   48   15   50   85   55    8   10   25    2   68   76   61   55

[Number plane for part (a): Number of Employees on the horizontal axis, marked 0, 2, 4, ..., 20; Company Weekly Profits ($) on the vertical axis, marked 0, 20 000, 40 000, ..., 100 000.]

a) Construct a scatterplot from the above data.
b) Describe the strength of the association.
c) Describe the form of the association.
d) Describe the direction of the association.
e) What is the independent variable?
f) Predict the weekly profits gained when the company has 12 employees.
g) What is the dependent variable?
h) Predict the number of employees when the company makes a weekly profit of $70 000.

6C Pearson’s Correlation Coefficient
https://youtu.be/z86m0kWUhks

Pearson’s correlation coefficient (𝑟) measures the strength and direction of a linear correlation as a value between −1 and +1. Values close to −1 or +1 indicate a strong relationship, and values close to 0 indicate little or no linear relationship. The examples below illustrate how Pearson’s correlation coefficient works.

[Figure: seven example scatterplots placed along a scale from −1 to +1, labelled Perfect Negative (𝑟 = −1), Strong Negative (𝑟 = −0.9), Moderate Negative (𝑟 = −0.5), No Relationship (𝑟 = 0), Weak Positive (𝑟 = +0.3), Moderate Positive (𝑟 = +0.65) and Perfect Positive (𝑟 = +1).]

Example 1
Casio Calculator https://youtu.be/glubhdt8NZ0
Sharp Calculator https://youtu.be/O_HyzM1-fNY
The table and scatterplot below represent a sample of 15 people. Each person had their height and arm span measured and recorded.

Height (cm)     152  180  159  165  187  183  161  158  165  168  172  176  169  178  178
Arm Span (cm)   153  184  160  167  187  180  159  162  166  168  170  179  171  175  183

[Figure: scatterplot of Arm span (cm) against Height (cm), both axes marked 150, 160, 170, 180, 190, 200.]

a) Use a calculator to find Pearson’s correlation coefficient, correct to 2 decimal places.
b) What is the strength of the relationship?

6D Line of Best Fit
https://youtu.be/pM9jfsKZyGs

Most scatterplots do not form a perfect line. Instead, we draw the line that “best fits” the pattern of the points. Once we have drawn a line of best fit we can use it to make predictions.

Example 1
The scatterplot provided was used to compare a person’s height and arm span.
a) Draw a line of best fit (by eye) for the scatterplot.
b) Use your line of best fit to predict a person’s arm span, given that they have a height of 175 cm.
c) Use your line of best fit to predict a person’s arm span, given that they have a height of 195 cm.

[Figure: scatterplot of Arm span (cm) against Height (cm), both axes marked 150, 160, 170, 180, 190, 200.]

Interpolation and Extrapolation
Interpolation: when you make a prediction within the data set. Example 1 (b) above is an example of interpolation, since the line of best fit was already drawn at this point.
Extrapolation: when you make a prediction outside the data set. Example 1 (c) above is an example of extrapolation, since we had to extend the line of best fit in order to find the arm span.

Least-Squares Line of Best Fit
The least-squares method can be used to accurately draw a line of best fit for a scatterplot. We use the gradient-intercept formula ($y = mx + c$) and then find the gradient ($m$) and y-intercept ($c$) using the following formulas:

$$m = r\,\frac{s_y}{s_x} \qquad\qquad c = \bar{y} - m\bar{x}$$

• $r$ is Pearson’s correlation coefficient.
• $s_x$ is the standard deviation of $x$.
• $s_y$ is the standard deviation of $y$.
• $\bar{x}$ is the mean of $x$.
• $\bar{y}$ is the mean of $y$.

Example 2
Casio Calculator https://youtu.be/3nOhppOfv2Y
Sharp Calculator https://youtu.be/nSjI_B-5nmI
The table and scatterplot below represent a sample of 15 people. Each person gave their age and recorded the amount of time they were on the internet in one week.

Age (years)                        12   15   21   22   24   30   31   35   37   40   41   43   52   53   55
Internet Usage (hours per week)    15   21   21   14   17   10   16   14    8   11    5   12    8    6    2

[Figure: scatterplot of Internet Usage (hours per week), axis marked up to 25 in steps of 5, against Age (years), axis marked up to 60 in steps of 10.]

a) Use a calculator to find Pearson’s correlation coefficient, correct to 4 decimal places.
b) Find the equation of the least-squares line of best fit.
c) Draw the least-squares line of best fit on the scatterplot above.
d) Use the equation to predict the internet usage of someone aged 5.
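
The least-squares formulas above can also be checked with a short program instead of a calculator. The following is a minimal sketch, assuming Python and its standard `statistics` module. The function names `pearson_r`, `least_squares_line` and `predict` are illustrative only and do not come from the booklet or the textbook.

```python
from math import sqrt
from statistics import mean, stdev

# Data from 6D Example 2: age (years) and internet usage (hours per week).
ages = [12, 15, 21, 22, 24, 30, 31, 35, 37, 40, 41, 43, 52, 53, 55]
usage = [15, 21, 21, 14, 17, 10, 16, 14, 8, 11, 5, 12, 8, 6, 2]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (m, c) using m = r * s_y / s_x and c = y_bar - m * x_bar."""
    r = pearson_r(xs, ys)
    # Assumption: sample standard deviations are used. The ratio s_y / s_x
    # is the same if population standard deviations are used for both.
    m = r * stdev(ys) / stdev(xs)
    c = mean(ys) - m * mean(xs)
    return m, c


def predict(m, c, x):
    """Value of y on the line y = mx + c at the given x."""
    return m * x + c


r = pearson_r(ages, usage)
m, c = least_squares_line(ages, usage)

print(f"r = {r:.4f}")
print(f"y = {m:.4f}x + {c:.4f}")
print(f"Predicted usage at age 5: {predict(m, c, 5):.1f} hours per week")
```

Note that the ages in the table run from 12 to 55, so the prediction at age 5 is an extrapolation and should be treated with caution.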
6E Interpolation and Extrapolation
https://youtu.be/l6A3m8goZws

Interpolation and extrapolation were briefly described in the previous section (6D). In this section we will look at questions involving interpolation and extrapolation.

Interpolation: when you make a prediction within the data set. This is usually where a line of best fit exists.
Extrapolation: when you make a prediction outside the data set. This is when you need to extend the line of best fit to make a prediction.

Interpolation can be used effectively to predict values. Extrapolation can also be used effectively, but it can sometimes be inaccurate or misleading.

Example 1
https://youtu.be/QUbkxZGYEac
The following table and graph represent a sample of 8 students. Each student recorded the number of hours they studied for an exam as well as their mark. This example was taken from 6A Example 2. We have omitted some values to help you see the problems we face when we use extrapolation.

Study Time (h)    0    5    1    2    2    7   10    4
Exam Mark (%)     5   80   10   20   30   60   85   40

[Figure: scatterplot of Exam Mark (%), axis marked 0, 20, 40, ..., 100, against Study Time (h), axis marked 0, 2, 4, ..., 20.]

a) Use the following website to find the equation of the least-squares line of best fit:
https://www.mathsisfun.com/data/least-squares-calculator.html
b) Draw the least-squares line of best fit on the scatterplot.
c) By referring to the least-squares line of best fit, what exam mark would you expect to get if you studied for 6 hours?
d) Use the equation from part (a) to calculate the expected exam mark for someone who studied for 18 hours.
e) Question (d) above is an example of extrapolation.
In your own words, explain why extrapolation can sometimes be misleading or inaccurate.

6F Statistical Investigation
https://youtu.be/yxDAxtyFk6c

Statistical investigation is the process of gathering, organising and analysing data in order to make predictions and draw conclusions. This process can help companies and government organisations make informed decisions.

Example 1
https://youtu.be/PFt_nAIXcyM
Statistical investigation involves four steps. In this example we are going to follow these four steps and explore the correlation between a girl’s age and height.

1. Collect the Data: We will start by collecting data about girls’ ages and heights. You can do this using either a:
- primary source of data (measuring the height of 15 girls aged between 2 and 15)
- secondary source of data (going on the internet and finding the average height of girls aged between 2 and 15)
Make sure you have a good spread of data, otherwise the dots on your scatterplot will be grouped too closely together. You are gathering data from a sample of the population, so it is imperative that you gather data from reliable sources that are representative of the entire population.

2. Organising Data: Organise the data into the table below. Note: a table is not only a great way of organising data, it is also a great way to display data, as in step 3.
[Blank table with headings: Girl’s age; Height (cm).]

3. Displaying and Summarising Data: We like to use tables and graphs to display data. We summarise data when we talk about mean, median, mode and standard deviation. There is no need to summarise the data above. Instead, you are going to display the data using a scatterplot.
[Blank number plane: Girl’s Age on the horizontal axis, marked 0, 2, 4, ..., 20; Height (cm) on the vertical axis, marked 80, 100, 120, ..., 180.]

4. Analysing Data: When we analyse data we are interpreting the data and giving it meaning.
For the scatterplot in step 3:
a) Calculate the correlation coefficient.
b) Comment on the strength, form and direction of the correlation.
c) Find the equation for the least-squares line of best fit and sketch it on the graph in step 3.
d) Use interpolation to calculate the approximate height of someone who is 10 years old.
e) Use extrapolation to calculate the approximate height of someone who is 50 years old.
f) We can use interpolation and extrapolation to make predictions and conclusions. Explain why extrapolation was misleading for this example.

Causation
https://youtu.be/VMUQSMFGBDo
Ice cream sales and the number of murders in New York have been shown to have a correlation. Does this mean that ice cream sales are the cause of murders in New York? Explain the difference between correlation and causation.