Uploaded by Chris Tran

Bivariate Data Notes - Chapter 6 (Benjamin Odgers)

Chapter 6
Bivariate Data
6A Introduction to Bivariate Scatterplots
6B Bivariate Data Relationships
6C Pearson’s Correlation Coefficient
6D Line of Best Fit
6E Interpolation and Extrapolation
6F Statistical Investigation
Written by
Benjamin Odgers
Maths Teacher
B Teaching / B Science
The following theory booklet lines up with the Cambridge Year 12 NSW Standard Mathematics 2
Textbook. This can be found using the following link:
https://www.cambridge.edu.au/education/titles/CambridgeMATHS-Stage-6-Mathematics-Standard-2-Year-12-printand-interactive-textbook-powered-by-HOTmaths/#.XYgHTUszaUk
https://www.youtube.com/user/benjodgers
6A Introduction to Bivariate Scatterplots https://youtu.be/9ycQrZDefxc
Bivariate data consists of two variables that may or may not
be correlated. A good example of bivariate data was
given by the Roman architect Vitruvius in the first century
BC. He claimed that a person’s arm span is approximately the
same as their height.
We can test Vitruvius’ claim by using a bivariate
scatterplot. We can measure a sample of people and plot
their measurements on a scatterplot. Each person’s height and arm
span are the two variables for our scatterplot. If
Vitruvius’ claim is true, we should see a strong correlation
between people’s height and arm span.
The image at right is called the Vitruvian Man and was drawn
by Leonardo da Vinci. The image was obtained from the
following site:
https://old.world-mysteries.com/sci_17_vm.htm
Example 1
The following table represents a sample of 15 people. Each person had their height and arm span measured
and recorded.
Height (cm)     152  180  159  165  187  183  161  158  165  168  172  176  169  178  178
Arm Span (cm)   153  184  160  167  187  180  159  162  166  168  170  179  171  175  183
[Blank number plane: Height (cm) on the horizontal axis and Arm span (cm) on the vertical axis, both marked 150 to 200 in steps of 10.]
a) Construct a scatterplot by plotting the points on the number plane at right.
b) What is the scale for the horizontal axis?
c) According to this data, would you say there is a strong correlation between a person’s height and
arm span? Why?
d) What do you notice about the shape of the scatter plot?
Example 2 https://youtu.be/qTJimK-zTX0
The following table represents a sample of 15 students. Each student recorded the number of hours they
studied for an exam as well as their marks.
Study Time (h)   0  18   5   1   2  11  15  18   0   2   7  16  13  10   4
Exam Mark (%)    5  95  80  10  20  80  85  80  35  30  60  80  85  85  40
[Blank number plane: Study Time (h) on the horizontal axis, marked 0 to 20 in steps of 2; Exam Mark (%) on the vertical axis, marked 0 to 100 in steps of 20.]
a) Construct a scatterplot by plotting the points on the number plane at right.
b) What is the scale for the vertical axis?
c) How many students got a mark greater than 60%?
d) How many students completed
less than 8 hours of study in preparation for the exam?
e) According to this data, would you say there is a strong correlation between a person’s study time and
exam mark? Why?
f) What do you notice about the shape of the scatter plot?
g) Were there any students who seemed to lie outside the general trend when comparing students’ study
time to their exam mark?
6B Bivariate Data Relationships https://youtu.be/wj1Ghj4gA2Q
When interpreting bivariate scatterplots, we often talk about variables having (or not having) a relationship.
When talking about the relationship between variables we often use language such as variables having a
correlation or an association with each other. We can tell if variables have a relationship by observing the
position of the points. If there is an obvious pattern then we can see that a relationship exists between the
variables.
Strength of the Relationship (or Association)
By observing the scattering of points on a scatter plot we can find the strength of the relationship. When a
relationship is strong it will have an obvious trend that can be easily graphed using a straight line or curve.
As relationships become weaker it becomes harder to graph the trend.
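[Figure: four example scatterplots labelled Strong, Moderate, Weak and No Relationship. The axis labels appearing in these plots are Height (cm) and Arm Span (cm), Husband’s Age (years) and Wife’s Age (years), Size of Land (Acres) and Sale Price ($), and Person’s Height (cm) and Person’s IQ.]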
Form of the Relationship (or Association)
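[Figure: two example scatterplots labelled Linear Form and Non-Linear Form. The axis labels appearing in these plots are Number of Employees and Company Profits ($), and Person’s Age (years) and Life Expectancy (years).]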
Direction of the Relationship (or Association)
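[Figure: two example scatterplots labelled Positive and Negative. The axis labels appearing in these plots are Height (cm) and Arm Span (cm), and Person’s Age (years) and Life Expectancy (years).]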
Example 1 https://youtu.be/pRz5pa-fYXk
Describe the (i) strength, (ii) form and (iii) direction for each scatterplot below
[Figure: six scatterplots labelled (a) to (f).]
What is the difference between dependent and independent variables when referring to bivariate
scatterplots? https://youtu.be/8op89NrNWf0
Example 2 https://youtu.be/hyRiwAcHUqo
A company would like to know what number of employees will bring the optimal profits. The table below
compares the number of employees working to the weekly profits.
Number of employees              0  10   4   2  15   9  14  19   1  17  20   7  12   6   6
Company weekly profits ($1000)   0  90  48  15  50  85  55   8  10  25   2  68  76  61  55
[Blank number plane: Number of Employees on the horizontal axis, marked 0 to 20 in steps of 2; Company Weekly Profits ($) on the vertical axis, marked 0 to 100 000 in steps of 20 000.]
a) Construct a scatter plot from the above data
b) Describe the strength of the association
c) Describe the form of the association
d) Describe the direction of the association
e) What is the independent variable?
f) Predict the weekly profits gained when the company has 12 employees
g) What is the dependent variable?
h) Predict the number of employees when the company makes a weekly profit of $70 000
6C Pearson’s Correlation Coefficient https://youtu.be/z86m0kWUhks
Pearson’s correlation coefficient (𝑟) can be used to measure the strength of a correlation. A value between
−1 and +1 is used to represent the strength of the correlation. The examples below illustrate how Pearson’s
correlation coefficient works.
[Figure: a number line from −1 to +1 with an example scatterplot at each of the following values of 𝑟.]
Perfect Negative (𝑟 = −1)
Strong Negative (𝑟 = −0.9)
Moderate Negative (𝑟 = −0.5)
No Relationship (𝑟 = 0)
Weak Positive (𝑟 = +0.3)
Moderate Positive (𝑟 = +0.65)
Perfect Positive (𝑟 = +1)
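For reference, one common way of writing the quantity a calculator evaluates for 𝑟 is given below. The booklet itself relies on the calculator, so the notation here is an assumption: the data are taken to be n pairs (x_i, y_i), and bars denote means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit above or below their means together, and negative when one tends to be high while the other is low. The denominator scales the result so that it always lies between −1 and +1.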
Example 1 Casio Calculator https://youtu.be/glubhdt8NZ0
Sharp Calculator https://youtu.be/O_HyzM1-fNY
The table and scatterplot below represent a sample of 15 people. Each person had their height and arm span
measured and recorded.
Height (cm)     152  180  159  165  187  183  161  158  165  168  172  176  169  178  178
Arm Span (cm)   153  184  160  167  187  180  159  162  166  168  170  179  171  175  183
[Scatterplot: Height (cm) on the horizontal axis and Arm span (cm) on the vertical axis, both marked 150 to 200 in steps of 10.]
a) Use a calculator to find Pearson’s correlation coefficient, correct to 2 decimal places.
b) What is the strength of the relationship?
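If a calculator is not at hand, part (a) can also be checked with a short program. The following is a minimal sketch in Python, not the booklet's method: the function name `pearson_r` and the variable names are made up for illustration, and the data are copied from the table in this example.

```python
from math import sqrt

# Data copied from the Example 1 table (15 people).
heights = [152, 180, 159, 165, 187, 183, 161, 158, 165, 168, 172, 176, 169, 178, 178]
arm_spans = [153, 184, 160, 167, 187, 180, 159, 162, 166, 168, 170, 179, 171, 175, 183]


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(heights, arm_spans)
print(round(r, 2))  # r correct to 2 decimal places
```

The printed value can be compared with the calculator answer from the videos linked above.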
6D Line of Best Fit https://youtu.be/pM9jfsKZyGs
Most scatterplots do not make a perfect line. Sometimes we need to draw the line that “best fits” the pattern
of the plot. Once we draw a line of best fit we can make predictions.
Example 1
The scatterplot at right was used to compare a
person’s height and arm span.
[Scatterplot: Height (cm) on the horizontal axis and Arm span (cm) on the vertical axis, both marked 150 to 200 in steps of 10.]
a) Draw a line of best fit (by eye) for the scatterplot at right.
b) Use your line of best fit to predict a person’s arm span, given that they have a height of 175 cm.
c) Use your line of best fit to predict a person’s arm span, given that they have a height of 195 cm.
Interpolation and Extrapolation
Interpolation – when you make a prediction within the range of the data set. Example 1 (b) above was an example
of interpolation, since a height of 175 cm lies within the plotted data, where the line of best fit was already drawn.
Extrapolation – when you make a prediction outside the range of the data set. Example 1 (c) above was an
example of extrapolation, since we had to extend the line of best fit beyond the plotted data in order to find the arm span.
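The distinction can be sketched as a simple range check. This Python snippet is illustrative only: the function name `prediction_type` is hypothetical, and it assumes the scatterplot in Example 1 uses the 15 heights from the table in section 6A.

```python
# Assumption: the plotted heights are the 15 values from the 6A Example 1 table.
heights = [152, 180, 159, 165, 187, 183, 161, 158, 165, 168, 172, 176, 169, 178, 178]


def prediction_type(x_value, observed_x):
    """Label a prediction at x_value as interpolation or extrapolation."""
    if min(observed_x) <= x_value <= max(observed_x):
        return "interpolation"  # inside the range of the data
    return "extrapolation"      # outside the range of the data


print(prediction_type(175, heights))  # Example 1 (b)
print(prediction_type(195, heights))  # Example 1 (c)
```

Under that assumption the data run from 152 cm to 187 cm, so 175 cm falls inside the range and 195 cm falls outside it.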
Least-Squares Line of Best Fit
The least-squares line of best fit method can be used to accurately draw a line of best fit for a scatter plot.
We use the gradient-intercept formula (𝑦 = 𝑚𝑥 + 𝑐) and then find the gradient (m) and y-intercept (c) using
the following formulas:
$$m = r\,\frac{s_y}{s_x} \qquad\qquad c = \bar{y} - m\bar{x}$$
• $r$ is Pearson’s correlation coefficient.
• $s_x$ is the standard deviation of $x$.
• $s_y$ is the standard deviation of $y$.
• $\bar{x}$ is the mean of $x$.
• $\bar{y}$ is the mean of $y$.
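As a check on the gradient formula, it can be rewritten in terms of sums of deviations. The symbols S_xy, S_xx and S_yy below are shorthand introduced here for illustration; they are not used elsewhere in the booklet.

```latex
\begin{align*}
S_{xy} &= \sum (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum (x_i - \bar{x})^2, \qquad
S_{yy} = \sum (y_i - \bar{y})^2 \\[4pt]
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
\frac{s_y}{s_x} = \sqrt{\frac{S_{yy}}{S_{xx}}} \\[4pt]
m &= r\,\frac{s_y}{s_x}
   = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}\,\sqrt{\frac{S_{yy}}{S_{xx}}}
   = \frac{S_{xy}}{S_{xx}}
\end{align*}
```

The ratio of the two standard deviations is the same whether they are calculated as sample or population standard deviations, provided the same type is used for both variables. The gradient has the same sign as r, so a negative correlation always gives a downward-sloping line of best fit.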
Example 2 Casio Calculator https://youtu.be/3nOhppOfv2Y Sharp Calculator https://youtu.be/nSjI_B-5nmI
The table and scatterplot below represent a sample of 15 people. Each person gave their age and recorded
the amount of time they were on the internet in one week.
Age (years)                       12  15  21  22  24  30  31  35  37  40  41  43  52  53  55
Internet Usage (hours per week)   15  21  21  14  17  10  16  14   8  11   5  12   8   6   2
[Scatterplot: Age (years) on the horizontal axis, marked 10 to 60 in steps of 10; Internet Usage (hours per week) on the vertical axis, marked 0 to 25 in steps of 5.]
a) Use a calculator to find Pearson’s correlation coefficient, correct to 4 decimal places.
b) Find the equation of the least-squares line of best fit.
c) Draw the least-squares line of best fit on the scatterplot above.
d) Use the equation to predict the internet usage of someone aged 5.
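The calculator steps in this example can be mirrored in Python as a rough cross-check. This is a sketch under stated assumptions rather than the booklet's method: the names `pearson_r` and `predict_usage` are hypothetical, the data are copied from the table in this example, and `statistics.stdev` (the sample standard deviation) is used for both variables.

```python
from math import sqrt
from statistics import mean, stdev

# Data copied from the Example 2 table (15 people).
ages = [12, 15, 21, 22, 24, 30, 31, 35, 37, 40, 41, 43, 52, 53, 55]
usage = [15, 21, 21, 14, 17, 10, 16, 14, 8, 11, 5, 12, 8, 6, 2]


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two paired lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(ages, usage)
m = r * stdev(usage) / stdev(ages)  # gradient: m = r * s_y / s_x
c = mean(usage) - m * mean(ages)    # y-intercept: c = y_bar - m * x_bar


def predict_usage(age):
    """Predict weekly internet usage (hours) from the fitted line."""
    return m * age + c


print(f"r = {r:.4f}")
print(f"y = {m:.4f}x + {c:.4f}")
print(f"Predicted usage at age 5: {predict_usage(5):.1f} hours per week")
```

Note that age 5 lies below the youngest age in the sample (12), so the prediction in part (d) is an extrapolation.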
6E Interpolation and Extrapolation https://youtu.be/l6A3m8goZws
Interpolation and extrapolation were briefly described in the previous section (6D). In this section
we will look at questions involving interpolation and extrapolation.
Interpolation – when you make a prediction within the range of the data set. This is usually where a line of best fit already exists.
Extrapolation – when you make a prediction outside the range of the data set. This is when you need to extend the line of
best fit to make a prediction.
Interpolation can be used effectively to predict values. Extrapolation can also be used effectively, but it can
sometimes be inaccurate or misleading.
Example 1 https://youtu.be/QUbkxZGYEac
The following table and graph represent a sample of 8 students. Each student recorded the number of hours
they studied for an exam as well as their marks. This example was taken from chapter 6A Example 2. We
have omitted some values to help you see the problems we face when we use extrapolation.
Study Time (h)   0   5   1   2   2   7  10   4
Exam Mark (%)    5  80  10  20  30  60  85  40
[Scatterplot: Study Time (h) on the horizontal axis, marked 0 to 20 in steps of 2; Exam Mark (%) on the vertical axis, marked 0 to 100 in steps of 20.]
a) Use the following website to find the equation for the least-squares line of best fit:
https://www.mathsisfun.com/data/least-squares-calculator.html
b) Draw the least-squares line of best fit on the scatter plot at right.
c) By referring to the least-squares line of best fit, what exam mark would you expect to get if you studied
for 6 hours?
d) Use the equation from part (a) to calculate the expected exam mark for someone who studied for 18
hours
e) Question (d) above is an example of extrapolation. In your own words, explain why extrapolation
can sometimes be misleading or inaccurate
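To see the problem numerically, the line can be fitted to the eight students and evaluated at 6 and 18 hours. The Python below is a hedged sketch, not the website's method: the function name `least_squares_line` is made up, the data are copied from the table in this example, and the gradient is computed from sums of deviations, which is algebraically the same as the 6D formula.

```python
# Data copied from the Example 1 table (8 students).
study_hours = [0, 5, 1, 2, 2, 7, 10, 4]
exam_marks = [5, 80, 10, 20, 30, 60, 85, 40]


def least_squares_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx          # gradient
    c = mean_y - m * mean_x  # y-intercept
    return m, c


m, c = least_squares_line(study_hours, exam_marks)

for hours in (6, 18):
    predicted = m * hours + c
    # A prediction inside the observed range of study times is interpolation.
    inside = min(study_hours) <= hours <= max(study_hours)
    label = "interpolation" if inside else "extrapolation"
    print(f"{hours} h -> {predicted:.1f}% ({label})")
```

The prediction for 18 hours comes out well above 100%, which is not a possible exam mark. The line only describes the trend over the 0 to 10 hours that were actually observed.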
6F Statistical Investigation https://youtu.be/yxDAxtyFk6c
Statistical investigation is the process of gathering, organising and analysing data in order to make
predictions and conclusions. This process can aid in making informed decisions for companies and
government organisations.
Example 1 https://youtu.be/PFt_nAIXcyM
Statistical investigation involves four steps. In this example we are going to follow these four steps and
explore the correlation between a girl’s age and height.
1. Collect the Data – We will start by collecting data about a girl’s age and height. You can do this
using either a:
- primary source of data (measuring the height of 15 girls aged between 2 and 15)
- secondary source of data (going on the internet and finding the average height of girls aged
between 2 and 15)
Make sure you have a good spread of data, otherwise the dots on your scatterplot will be grouped too
closely together. You are gathering data from a sample of the population so it is imperative that you
gather data from reliable sources that are representative of the entire population.
2. Organising Data – Organise the data into the table below. Note: a table is not only a great way of
organising data, it is also a great way to display data like in step 3.
[Blank table with two rows to fill in: Girl’s age and Height (cm).]
3. Displaying and Summarising Data – We like to use tables and graphs to display data. We summarise
data when we talk about mean, median, mode and standard deviation. There is no need to summarise
the data above, instead you are going to display the data using a scatterplot.
[Blank number plane: Girl’s Age on the horizontal axis, marked 0 to 20 in steps of 2; Height (cm) on the vertical axis, marked 80 to 180 in steps of 20.]
4. Analysing Data – When we analyse data we are interpreting the data and giving it meaning. For the
scatterplot in step 3:
a) Calculate the correlation coefficient.
b) Comment on the strength, form and direction of the correlation.
c) Find the equation for the least-squares line of best fit and sketch it on the graph in step 3.
d) Use interpolation to calculate the approximate height of someone who is 10 years old.
e) Use extrapolation to calculate the approximate height of someone who is 50 years old.
f) We can use interpolation and extrapolation to make predictions and conclusions. Explain why
extrapolation was misleading for this example.
Causation https://youtu.be/VMUQSMFGBDo
Ice cream sales and the number of murders in New York have been shown to have a correlation. Does
this mean that ice cream sales are the cause of murders in New York? Explain the difference between
correlation and causation.