CHAPTER 10
Correlation and Regression
Objectives
• Draw a scatter plot for a set of ordered pairs.
• Compute the correlation coefficient.
• Test the hypothesis H0: ρ = 0.
• Compute the equation of the regression line.
• Compute the coefficient of determination.
Section 10.1 Introduction
• In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.
Statistical Methods
• Correlation is a statistical method used to determine whether a relationship between variables exists.
• Regression is a statistical method used to describe the nature of the relationship between variables: that is, positive or negative, linear or nonlinear.
Statistical Questions
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Section 10.2 Correlation
I. Scatter Plots
• A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y.
• A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.
Example: Construct a scatter plot for the data obtained in a study of age and blood pressure.
Age x        43   48   56   61   67   70
Pressure y  128  120  135  143  141  152
II. Correlation Coefficient
• A correlation coefficient is a measure to determine the strength of the relationship between two variables.
• In a simple relationship, there are only two types of variables under study.
• In multiple relationships, many variables are under study.
Correlation Coefficient
• The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
• The symbol for the sample correlation coefficient is r.
• The symbol for the population correlation coefficient is ρ (Greek letter rho).
• The range of the correlation coefficient is from −1 to +1.
• If there is a strong positive linear relationship between the variables, the value of r will be close to +1.
• If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
• When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0.
Range of Values for the Correlation Coefficient
In general, if |r| > 0.7, there is a positive (r > 0.7) or negative (r < −0.7) linear correlation between x and y.
[Figure: Relationship Between the Correlation Coefficient and the Scatter Plot]
Formula for the Correlation Coefficient r
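$$ r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$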
where n is the number of data pairs.
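The formula for r translates directly into code. The following is a minimal Python sketch and is not part of the original notes; the function name `correlation_coefficient` and the small data set in the usage lines are illustrative assumptions.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, using the computational formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)                                   # number of data pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Illustrative data (an assumption, not from the notes): y = 2x + 1 exactly,
# so the points lie on a rising straight line and r should be +1.
r_demo = correlation_coefficient([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r_demo, 3))
```

If either variable is constant, the denominator is zero and r is undefined; this sketch lets Python raise a `ZeroDivisionError` in that case.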
Example: Compute the value of the correlation coefficient for the data obtained in the study of age
and blood pressure.
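Age x   Pressure y   xy   x²   y²
43      128
48      120
56      135
61      143
67      141
70      152

$$ r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$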
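As a check on the arithmetic for this example, the column sums for the six data pairs work out to Σx = 345, Σy = 819, Σxy = 47,634, Σx² = 20,399, and Σy² = 112,443, with n = 6. These sums are computed here from the data and are not printed in the notes. Substituting them into the formula for r can be sketched as:

```latex
\begin{aligned}
r &= \frac{6(47\,634) - (345)(819)}
          {\sqrt{\left[6(20\,399) - 345^2\right]\left[6(112\,443) - 819^2\right]}} \\
  &= \frac{3249}{\sqrt{(3369)(3897)}} \approx 0.897
\end{aligned}
```

This agrees with the value r = 0.897 used when the significance of the correlation coefficient is tested for the same data.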
Possible Relationships Between Variables
• There is a direct cause-and-effect relationship between the variables: that is, x causes y.
• There is a reverse cause-and-effect relationship between the variables: that is, y causes x.
• The relationship between the variables may be caused by a third variable: that is, y may appear to cause x but in reality z causes x.
• There may be a complexity of interrelationships among many variables; that is, x may cause y but w, t, and z fit into the picture as well.
• The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.
Interpretation of Relationships
• When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.
Population Correlation Coefficient
Formally defined, the population correlation coefficient ρ is the correlation computed by using all possible pairs of data values (x, y) taken from a population.
Hypothesis Testing
• In hypothesis testing, one of the following is true:
H0: ρ = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρ ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.
Formula for the t Test for the Correlation Coefficient
• Formula for the t test for the correlation coefficient:
$$ t = r \sqrt{\frac{n - 2}{1 - r^2}} $$
with degrees of freedom equal to n − 2.
Example: Test the significance of the correlation coefficient found for the data obtained in the study of age and blood pressure. Use α = 0.05 and r = 0.897.
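One way to work this example is sketched below in Python, using the standard test value t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The function name `t_statistic` is illustrative, and the critical value 2.776 is an assumption in the sense that it is read from a standard t table (two-tailed, α = 0.05, 4 degrees of freedom) rather than from these notes.

```python
import math

def t_statistic(r, n):
    """t test value for a sample correlation coefficient r from n data pairs."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.897, 6          # values from the age and blood pressure example
t = t_statistic(r, n)
df = n - 2               # degrees of freedom

# Assumption: two-tailed critical t value for alpha = 0.05 and d.f. = 4,
# taken from a standard t table (not given in the notes).
critical_value = 2.776
reject_h0 = abs(t) > critical_value
print(f"t = {t:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```

Because the test value is well beyond the critical value, the null hypothesis H0: ρ = 0 would be rejected under these assumptions, which suggests a significant linear relationship between age and blood pressure.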
Testing the significance of r using Table I
• Table I shows the values of the correlation coefficient that are significant for a specific α level and a specific number of degrees of freedom.
• Any value of r greater than a positive critical value or less than a negative critical value will be significant, and the null hypothesis will be rejected.
Example: Using Table I, test the significance of the correlation coefficient r = 0.0667 at α = ___ when the sample size is 9.
Section 10.3 Regression
• Regression is a statistical method used to describe the nature of the relationship between variables: that is, positive or negative, linear or nonlinear.
Types of Regression and Correlation
• Exponential regression
• Perfect correlation
• Positive correlation: as x increases, y increases
• Negative correlation: as x increases, y decreases
• No correlation
Linear Regression
• If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data's line of best fit.
• Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
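The "best fit" criterion can be checked numerically. The sketch below is illustrative and not part of the original notes. It takes the age and blood pressure data, computes a and b with the regression formulas given later in this section, and compares the sum of squared vertical distances against a few hypothetical competing lines obtained by nudging the intercept or slope.

```python
xs = [43, 48, 56, 61, 67, 70]          # age
ys = [128, 120, 135, 143, 141, 152]    # blood pressure

def sum_squared_error(a, b):
    """Sum of squared vertical distances from the points to y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Least-squares intercept a and slope b.
denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
b = (n * sum_xy - sum_x * sum_y) / denom

best = sum_squared_error(a, b)
# Hypothetical competing lines: shift the intercept or the slope slightly.
others = [
    sum_squared_error(a + 1, b),
    sum_squared_error(a - 1, b),
    sum_squared_error(a, b + 0.05),
    sum_squared_error(a, b - 0.05),
]
print(round(best, 1), [round(s, 1) for s in others])
```

Every perturbed line gives a larger sum of squares than the fitted line, which is what "best fit" means here.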
[Figures: Scatter Plot with Three Lines; A Linear Relation. Both plot y against x.]
Equation of a Line
• In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept.
• In statistics, the equation of the regression line is written as y = a + bx, where a is the y intercept and b is the slope of the line.
Formulas for the Regression Line
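• Formulas for the regression line y = a + bx:
$$ a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} $$
$$ b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} $$
where a is the y intercept and b is the slope of the line.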
Rounding Rule
• When calculating the values of a and b, round to three decimal places.
Example: Find the equation of the regression line for the data obtained in the study of age and blood
pressure.
Example: Using the equation of the regression line, predict the blood pressure for a person who is 50
years old.
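Both examples can be worked with a short script. This is a sketch rather than the notes' own solution: the function name `regression_line` is illustrative, and a and b are rounded to three decimal places before the prediction, following the rounding rule.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

a, b = regression_line(ages, pressures)
a, b = round(a, 3), round(b, 3)      # rounding rule: three decimal places
predicted = a + b * 50               # predicted pressure at age 50
print(f"y = {a} + {b}x; predicted pressure at age 50: {predicted:.3f}")
```

With these data the line comes out as approximately y = 81.048 + 0.964x, so the predicted blood pressure for a 50-year-old is about 129.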
Procedure
Finding the correlation coefficient and the regression line equation
• Step 1: Make a table with columns for subject, x, y, xy, x², and y².
• Step 2: Find the values of xy, x², and y². Place them in the appropriate columns, and sum each column.
• Step 3: Substitute in the formula to find the value of r.
• Step 4: When r is significant, substitute in the formulas to find the values of a and b for the regression line equation y = a + bx.
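Steps 1 and 2 of the procedure can be sketched in Python for the age and blood pressure data. The subject numbers 1 to 6 are an assumption made for illustration, since the notes do not label the subjects.

```python
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

# Steps 1 and 2: one row per subject holding x, y, xy, x^2, y^2,
# then the sum of each column.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(ages, pressures)]
totals = tuple(sum(column) for column in zip(*rows))

header = ("x", "y", "xy", "x^2", "y^2")
print("subject " + " ".join(f"{h:>7}" for h in header))
for subject, row in enumerate(rows, start=1):   # hypothetical subject labels
    print(f"{subject:<7} " + " ".join(f"{v:>7}" for v in row))
print("sum     " + " ".join(f"{v:>7}" for v in totals))
```

The five column sums in `totals` are the quantities that Steps 3 and 4 substitute into the formulas for r, a, and b.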
Summary
• The strength and direction of the linear relationship between variables are measured by the value of the correlation coefficient r.
• r can assume values between and including −1 and +1.
• A value of +1 or −1 indicates a perfect linear relationship.
• The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship is between the variables.
• Relationships can be linear or curvilinear.
• To determine the shape, one draws a scatter plot of the variables.
• If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.
Conclusion
• Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.