Bivariate Statistics notes

Bivariate Statistics
Bivariate data records two variables for each subject at the same time so that we can see whether there is a relationship between them. The data is graphed on a scatter plot, which we use to draw conclusions about the relationship.
Features of a scatter plot:
• Title
• Labelled axes
• Independent variable (the controlled variable) on the x axis
• Dependent variable (the variable that responds to changes in the independent variable) on the y axis
• Points clearly marked
Eg. Is there a relationship between the length of your forearm and the size of your foot?
Data: [table of length of arm and length of foot]
Scatter plot: [axes: length of arm, length of foot]
Looking for relationships
We describe the pattern of the data (the correlation) in a scatter plot as:
• Strong/moderate/weak
• Positive/negative
• Linear/non-linear
Eg
Note: We can see that a relationship exists, but we can’t say that one variable “causes” the other.
Ex 4A page 152
Pearson’s Correlation Coefficient (r)
Pearson’s correlation coefficient (r) gives a numeric value to the strength of a linear relationship between two variables. It ranges in value from 1 (a perfect positive relationship) to -1 (a perfect negative relationship). The formula is complicated, but your calculator does all the hard work.
Values of r: 1, 0.75, 0.5, 0.25, 0, -0.25, -0.5, -0.75, -1
Note: Pearson’s correlation coefficient should not be used for data that:
• Is non-linear
• Has outliers
To find r using your calculator:
1. Enter the data in the Lists and Spreadsheets section.
2. <MENU> 4: Statistics, 1: Stat Calculations, 2: Two-Variable Statistics.
3. Complete the table, indicating which column has the x data, which has the y data, and which column you would like the results entered into.
4. Scroll down to see the entry for r.
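To check a calculator result, r can also be computed directly from its usual definition. The following is a minimal Python sketch, not the calculator's own routine; the function name pearson_r and the arm and foot measurements are made up for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two paired lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Fails with a division by zero if either list is constant
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements in cm (illustration only, not class data)
arm = [22.0, 24.5, 25.0, 26.5, 28.0]
foot = [21.5, 23.0, 24.0, 25.5, 26.0]
r = pearson_r(arm, foot)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship, matching the scale described above.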
The Coefficient of Determination (r²)
To find the coefficient of determination, square Pearson’s correlation coefficient.
The coefficient of determination ranges from 0 to 1.
It is useful when we have two variables which have a linear relationship as it tells us the percentage of
variation in one variable which can be explained by the variation in the other variable.
Eg. A set of data giving the number of police traffic patrols on duty and the number of fatalities for the region was recorded, and a correlation coefficient of r = -0.8 was found. Calculate the coefficient of determination and interpret its value.
r = -0.8
r² = (-0.8)²
= 0.64
We can conclude from this that 64% of the variation in the number of fatalities can be explained by the variation in the number of police traffic patrols on duty. This means the number of police traffic patrols on duty is a major factor in predicting the number of fatalities.
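The same arithmetic can be written as a short Python sketch. The variable names are chosen here for illustration; the value of r is the one from the example.

```python
r = -0.8                       # correlation coefficient from the example
r_squared = r ** 2             # coefficient of determination
percent_explained = round(r_squared * 100)
print(f"r^2 = {r_squared:.2f}, so about {percent_explained}% of the variation is explained")
```

Note that squaring removes the sign, so r² alone does not show whether the relationship is positive or negative.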
Linear Regression
A line of best fit (sometimes called a regression line) can be drawn through a scatter plot to model the relationship between the variables. This model can then be used to predict one variable given the other.
Methods for fitting a line of best fit
1. By eye. Using a ruler, draw a straight line through the scatter plot so that about half of the points are above the line and half are below. This gives a reasonable approximation. To find the equation of this line, select two points that lie on the line, find the gradient between them, and use y = mx + c to find the equation.
Eg. Draw a line of best fit by eye through this data and determine an equation for it.
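Finding the equation from two points on the line can be sketched in Python as follows. The function name line_through and the two points are made up for illustration; in practice the points are read off your ruled line.

```python
def line_through(p1, p2):
    """Return (m, c) for the line y = mx + c through points p1 and p2."""
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2:
        raise ValueError("the two points must have different x values")
    m = (y2 - y1) / (x2 - x1)   # gradient = rise / run
    c = y1 - m * x1             # substitute one point into y = mx + c
    return m, c

# Hypothetical points read from a line drawn by eye
m, c = line_through((2, 7), (6, 15))
print(f"y = {m}x + {c}")
```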
2. The two mean method
• Moving along the x axis, divide the points into a lower half and an upper half.
• To find two points for drawing your line of best fit: for the lower half of the data, find the mean of the x values and the mean of the y values, (x̄_L, ȳ_L). Do the same for the upper half of the data to find (x̄_U, ȳ_U). The line of best fit will pass through these two points.
• As for the line of best fit by eye, use these two points to find the equation of the two mean regression line.
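The two mean method can be sketched in Python as below. This is an illustration under stated assumptions: the function name two_mean_line and the data are made up, and when there is an odd number of points the middle point is left out of both halves, which is only one possible convention.

```python
def two_mean_line(points):
    """Return (m, c) for the two mean line y = mx + c through (x, y) points."""
    if len(points) < 2:
        raise ValueError("need at least 2 points")
    pts = sorted(points)            # order the points along the x axis
    half = len(pts) // 2
    lower = pts[:half]              # lower half of the data
    upper = pts[-half:]             # upper half (middle point dropped if odd: assumption)

    def mean_point(group):
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        return sum(xs) / len(xs), sum(ys) / len(ys)

    x_l, y_l = mean_point(lower)
    x_u, y_u = mean_point(upper)
    if x_l == x_u:
        raise ValueError("the two mean points must have different x values")
    m = (y_u - y_l) / (x_u - x_l)   # gradient between the two mean points
    c = y_l - m * x_l               # y-intercept from y = mx + c
    return m, c

# Hypothetical data (illustration only)
data = [(1, 3), (2, 5), (3, 4), (4, 8), (5, 9), (6, 10)]
m, c = two_mean_line(data)
print(f"y = {m:.3f}x + {c:.3f}")
```

For this made-up data the two mean points are (2, 4) and (5, 9), so the gradient is 5/3.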
Practice : Ex 4C page 161 questions 1-12
3. The Least Squares Regression Method
The formulas for these calculations are complicated, but your calculator will do all the hard work for you.
Eg. The manager of a small ski resort has a problem. He wishes to be
able to predict the number of skiers using his resort each weekend in
advance so that he can organise additional resort staffing and
catering if needed. He knows that good deep snow will attract skiers
in big numbers but scant covering is unlikely to attract a crowd. To
investigate the situation further he collects the following data over
twelve consecutive weeks at his resort.
Create a scatterplot of the data. This can be done on the calculator.
1. In Lists and Spreadsheet view, enter the data in the table.
2. Hit the Home button and go to Data and Statistics view.
3. Tab to the horizontal axis and select the independent variable depth and tab to the vertical axis and select the
dependent variable skiers. The scatterplot will form.
4. It can be seen that there is a linear, positive, strong correlation between the depth of snow and the number of
skiers. There is evidence to suggest that as the depth of the snow increases the number of skiers increases.
5. Next find 𝑟, the coefficient of correlation, and 𝑟², the coefficient of determination.
Hit Ctrl and Left Arrow to return to Lists and Spreadsheet View. Hit Menu, Statistics, Stat Calculations, Linear
Regression (mx + b). Hit the Click button and select depth from the drop down list for X List. Hit tab and select skiers
from the drop down list for the Y List.
There is no need to enter data into the other boxes. Tab to OK and hit Enter.
The coefficient of correlation 𝑟 = 0.88402
This indicates that there is a strong, positive correlation between the depth of snow and the number of skiers.
The coefficient of determination 𝑟² = 0.781492
We can say that 78% of the variation in the number of skiers can be explained by the variation in the depth of
snow.
The calculator output also gives us the line of best fit, the least squares regression equation.
𝑦 = 186.418𝑥 + 28.3373
We can write this more clearly as
Number of skiers = 186.418 × depth of snow in m + 28.3373
6. The equation of the least squares regression line can also be determined in Data and Statistics view. Hit Ctrl + right
arrow to return to your scatterplot. Hit Menu, Analyse, Regression, Show Linear (mx + b)
The least squares regression equation is 𝑦 = 186.418𝑥 + 28.3373
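For comparison with the calculator, the standard least squares formulas (gradient = Sxy / Sxx, intercept = mean of y minus gradient × mean of x) can be sketched in Python. The function name least_squares is an assumption, and the depth and skier values below are made up for illustration; they are not the resort's data, so the result will not match the equation above.

```python
def least_squares(xs, ys):
    """Return (m, b) for the least squares regression line y = mx + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be the same")
    m = sxy / sxx               # gradient
    b = mean_y - m * mean_x     # y-intercept: the line passes through the mean point
    return m, b

# Hypothetical data (illustration only)
depth = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]     # depth of snow in m
skiers = [120, 210, 310, 400, 490, 600]    # number of skiers
m, b = least_squares(depth, skiers)
print(f"Number of skiers = {m:.3f} x depth of snow + {b:.3f}")
```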
Interpreting the Gradient and y-intercept
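The gradient 186.418 indicates that for every 1 metre increase in the depth of snow, the predicted number of skiers increases by about 186.
The y-intercept 28.3373 indicates that if the depth of snow is 0, the model predicts about 28 skiers attending the resort. A depth of 0 lies outside the recorded depths (0.5 to 3.6m), so this value should be treated with caution.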
Practice: Bivariate data worksheets
Using the Least Squares Regression Equation to make Predictions
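The usefulness of the model depends on the value of r. How reliable a prediction is also depends on whether we are:
INTERPOLATING: predicting from an x value inside the range of the original data
Or
EXTRAPOLATING: predicting from an x value outside the range of the original data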
Suppose we want to estimate the number of skiers when the depth of snow is 3m. Using
Number of skiers = 186.418 × depth of snow + 28.3373
Number of skiers = 186.418 × 3 + 28.3373 = 587.5913
That is 588 skiers
This result is reliable because we have interpolated: 3m lies within the range of snow depths given in the table, that is, between 0.5 and 3.6m.
Suppose we want to estimate the number of skiers when the depth of snow is 4m.
Number of skiers = 186.418 × 4 + 28.3373 = 774.0093
That is 774 skiers.
The result is unreliable because we have extrapolated. 4m lies outside the bounds of the depth of snow given in the
table. It is outside the range of 0.5 to 3.6m.
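The two predictions above can be reproduced with a short Python sketch that also reports whether each one is an interpolation or an extrapolation. The function name predict_skiers is an assumption; the equation and the depth range of 0.5 to 3.6m come from the example.

```python
GRADIENT = 186.418                # from the least squares regression equation
INTERCEPT = 28.3373
MIN_DEPTH, MAX_DEPTH = 0.5, 3.6   # range of snow depths in the table (m)

def predict_skiers(depth_m):
    """Return (predicted skiers rounded to a whole number, type of prediction)."""
    estimate = GRADIENT * depth_m + INTERCEPT
    if MIN_DEPTH <= depth_m <= MAX_DEPTH:
        kind = "interpolation"    # inside the data range: more reliable
    else:
        kind = "extrapolation"    # outside the data range: less reliable
    return round(estimate), kind

print(predict_skiers(3))   # depth of 3m
print(predict_skiers(4))   # depth of 4m
```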
Practice: Ex 4D page 172 qns 1-15