Pearson Product Moment Coefficient of Correlation:

$$r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

The variances and covariances are given by:

$$s_{xy} = \frac{S_{xy}}{n-1}, \qquad s_x^2 = \frac{S_{xx}}{n-1}, \qquad s_y^2 = \frac{S_{yy}}{n-1}$$
In general, when a sample of n individuals or experimental units is selected, and two variables are measured on each individual or unit so that both variables are random, the correlation coefficient r is the appropriate measure of linearity for use in this situation.
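The computational formulas translate directly into code. Here is a minimal Python sketch (the function names sums_of_squares and pearson_r are mine, not from the text) that computes the sums of squares and cross-products and then r:

```python
def sums_of_squares(x, y):
    """Computational formulas: S_xx = sum(x_i^2) - (sum x_i)^2 / n, etc."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def pearson_r(x, y):
    """r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / (s_xx * s_yy) ** 0.5
```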
Example
The heights and weights of n = 10 offensive backfield football players are randomly selected from a county's football all-stars. Calculate the correlation coefficient for the heights (in inches) and weights (in pounds) given in the table below.

Table: Heights and weights of n = 10 backfield all-stars
Player   Height x   Weight y
  1         73        185
  2         71        175
  3         75        200
  4         72        210
  5         72        190
  6         75        195
  7         67        150
  8         69        170
  9         71        180
 10         69        175
Solution
You should use the appropriate data entry method of your scientific calculator to verify the calculations for the sums of squares and cross-products:

$$S_{xy} = 328, \qquad S_{xx} = 60.4, \qquad S_{yy} = 2610$$

using the calculational formulas given earlier in this chapter. Then

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$$

or r ≈ .83. This value of r is fairly close to 1, the largest possible value of r, which indicates a fairly strong positive linear relationship between height and weight.
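As a cross-check of the calculator work, here is the same computation in Python, reusing the sketch functions defined earlier (rounding is for display only):

```python
# Heights and weights from the table above (players 1-10).
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

s_xx, s_yy, s_xy = sums_of_squares(height, weight)
print(round(s_xy), round(s_xx, 1), round(s_yy))  # 328 60.4 2610
print(round(pearson_r(height, weight), 4))       # 0.8261
```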
There is a direct relationship between the calculation formulas for the correlation coefficient r and the slope of the regression line b. Since the numerator of both quantities is $S_{xy}$, both r and b have the same sign. Therefore, the correlation coefficient has these general properties:

- When r = 0, the slope is 0, and there is no linear relationship between x and y.
- When r is positive, so is b, and there is a positive relationship between x and y.
- When r is negative, so is b, and there is a negative relationship between x and y.
The regression line can also be written in terms of r:

$$\frac{\hat{y} - \bar{y}}{s_y} = r\left(\frac{x - \bar{x}}{s_x}\right)$$

or, equivalently,

$$\hat{y} = \bar{y} + r\,\frac{s_y}{s_x}(x - \bar{x})$$

Therefore, the slope of the regression line is

$$b = r\,\frac{s_y}{s_x}$$
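A quick numerical check of this identity on the height-weight data, reusing the data and sketch functions from the example above:

```python
s_xx, s_yy, s_xy = sums_of_squares(height, weight)
n = len(height)
b = s_xy / s_xx                              # least-squares slope
r = pearson_r(height, weight)
s_x = (s_xx / (n - 1)) ** 0.5                # sample standard deviation of x
s_y = (s_yy / (n - 1)) ** 0.5                # sample standard deviation of y
print(round(b, 4), round(r * s_y / s_x, 4))  # both print 5.4305
```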
Figure: Some typical scatter plots.
The population correlation coefficient $\rho$ is calculated and interpreted as it is in the sample. The experimenter can test the hypothesis that there is no correlation between the variables x and y using a test statistic that is exactly equivalent to the test of the slope $\beta$ in the previous section.
Test of Hypothesis Concerning the Correlation Coefficient:

1. Null hypothesis: $H_0: \rho = 0$
2. Alternative hypothesis:
   One-tailed test: $H_a: \rho > 0$ (or $H_a: \rho < 0$)
   Two-tailed test: $H_a: \rho \neq 0$
3. Test statistic:

$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with $(n-2)$ degrees of freedom.
For a hypothesized value $\rho_0$ other than zero:

1. Null hypothesis: $H_0: \rho = \rho_0$
2. Alternative hypothesis:
   One-tailed test: $H_a: \rho > \rho_0$ (or $H_a: \rho < \rho_0$)
   Two-tailed test: $H_a: \rho \neq \rho_0$
3. Test statistic:

$$t_0 = \frac{(r - \rho_0)\sqrt{n-2}}{\sqrt{(1-r^2)(1-\rho_0^2)}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with $(n-2)$ degrees of freedom.
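Both test statistics can be wrapped in one small Python function; this is a sketch following the formulas above (the function name and the rho0 default are my choices, not the text's):

```python
import math

def corr_t_stat(r, n, rho0=0.0):
    """t0 for H0: rho = rho0; rho0 = 0 reduces to r*sqrt(n-2)/sqrt(1-r^2)."""
    return ((r - rho0) * math.sqrt(n - 2)
            / math.sqrt((1 - r ** 2) * (1 - rho0 ** 2)))
```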
4. Rejection region: Reject $H_0$ when
   One-tailed test: $t_0 > t_{\alpha,\,n-2}$ (or $t_0 < -t_{\alpha,\,n-2}$ when the alternative hypothesis is $H_a: \rho < 0$ or $H_a: \rho < \rho_0$)
   Two-tailed test: $t_0 > t_{\alpha/2,\,n-2}$ or $t_0 < -t_{\alpha/2,\,n-2}$
   or when p-value < $\alpha$.
Example
Refer to the height and weight data in the previous example. The correlation of height and weight was calculated to be r = .8261. Is this correlation significantly different from 0?

Solution
To test the hypotheses

$$H_0: \rho = 0 \quad \text{versus} \quad H_a: \rho \neq 0$$

the value of the test statistic is

$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.8261\sqrt{10-2}}{\sqrt{1-(.8261)^2}} = 4.15$$

which for n = 10 has a t distribution with 8 degrees of freedom. Since this value is greater than $t_{.005} = 3.355$, the two-tailed p-value is less than 2(.005) = .01, and the correlation is declared significant at the 1% level (P < .01). The value $r^2 = (.8261)^2 = .6824$ means that about 68% of the variation in one of the variables is explained by the other. The Minitab printout in Figure 12.17 displays the correlation r and the exact p-value for testing its significance.
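The same numbers can be reproduced with SciPy (assuming it is available); the exact two-tailed p-value comes out near .003, consistent with the p < .01 conclusion above:

```python
from scipy import stats

r, n = 0.8261, 10
t0 = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)  # two-tailed p-value
print(round(t0, 2), round(p_value, 4))       # 4.15 and roughly 0.0032
```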
Remember that r is a measure of linear correlation: x and y could be perfectly related by some curvilinear function even when the observed value of r is equal to 0.
In general, we do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.
Probability plotting can only be used for examining whether a population is normally distributed.
Histogram plotting and similar methods can only be used to guess the possible type of the underlying distribution.
A random sample of size n is taken from a population whose probability distribution is unknown. These n observations are arranged in a frequency histogram having k bins or class intervals. Let $O_i$ be the observed frequency in the ith class interval, and $E_i$ be the expected frequency in the ith class interval under the hypothesized probability distribution. The test statistic is

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
If the population follows the hypothesized distribution, $\chi_0^2$ has approximately a chi-square distribution with k - p - 1 degrees of freedom, where p represents the number of parameters of the hypothesized distribution estimated by sample statistics. That is,

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-p-1}$$

Reject the hypothesis if

$$\chi_0^2 > \chi^2_{\alpha,\,k-p-1}$$
Class intervals are not required to be of equal width.
The minimum value of the expected frequency cannot be too small; 3, 4, and 5 are typical minimum values.
When the minimum expected frequency is too small, we can combine that class interval with its neighboring class intervals. In this case, k is reduced by one.
Example 8-18
The number of defects in printed circuit boards is hypothesized to follow a Poisson distribution. A random sample of 60 printed boards has been collected, and the numbers of defects observed are shown in the table below:

Number of defects   Observed frequency
       0                  32
       1                  15
       2                   9
       3                   4

The only parameter of the Poisson distribution is $\lambda$, which can be estimated by the sample mean: $\hat{\lambda}$ = {0(32) + 1(15) + 2(9) + 3(4)}/60 = 0.75. Therefore, the expected frequency of the first class is

$$p_1 = P(X = 0) = \frac{e^{-0.75}(0.75)^0}{0!} = 0.472, \qquad E_1 = 60(0.472) = 28.32$$
Since the expected frequency in the last cell is less than 3, we combine the last two cells:

Number of defects   Observed   Expected
       0               32        28.32
       1               15        21.24
      ≥2               13        10.44
1. The variable of interest is the form of the distribution of defects in printed circuit boards.
2. $H_0$: The form of the distribution of defects is Poisson.
   $H_1$: The form of the distribution of defects is not Poisson.
3. k = 3, p = 1, so k - p - 1 = 1 degree of freedom.
4. At $\alpha = 0.05$, we reject $H_0$ if $\chi_0^2 > \chi^2_{0.05,1} = 3.84$.
5. The test statistic is:

$$\chi_0^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i} = \frac{(32 - 28.32)^2}{28.32} + \frac{(15 - 21.24)^2}{21.24} + \frac{(13 - 10.44)^2}{10.44} = 2.94$$

6. Since $\chi_0^2 = 2.94 < \chi^2_{0.05,1} = 3.84$, we are unable to reject the null hypothesis that the distribution of defects in printed circuit boards is Poisson.
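The example can be replayed in Python as a sanity check. Keeping full precision for the expected frequencies gives a statistic of about 2.96 rather than the text's rounded 2.94; the conclusion is unchanged:

```python
import math
from scipy import stats

counts = {0: 32, 1: 15, 2: 9, 3: 4}              # observed defects per board
n = sum(counts.values())                         # 60 boards
lam = sum(k * f for k, f in counts.items()) / n  # lambda-hat = 0.75

# Cell probabilities for 0, 1, and >=2 defects (last two cells combined).
p = [math.exp(-lam), lam * math.exp(-lam)]
p.append(1 - p[0] - p[1])
expected = [n * pi for pi in p]                  # about [28.34, 21.26, 10.40]
observed = [32, 15, 13]

chi2_0 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2_0, 2), round(stats.chi2.ppf(0.95, df=1), 2))  # 2.96 < 3.84
```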
Example 8-20
A company has to choose among three pension plans.
Management wishes to know whether the preference for plans is independent of job classification and wants to use $\alpha = 0.05$.
The opinions of a random sample of 500 employees are shown in Table 8-4.
There are two classifications: one has r levels and the other has c levels (3 pension plans and 2 types of workers).
We want to know whether the two methods of classification are statistically independent (i.e., whether the preference for pension plans is independent of job classification).
The observed frequencies are arranged in an r × c contingency table with entries $O_{ij}$.
Let $p_{ij}$ be the probability that a randomly selected element falls in the ij-th cell, given that the two classifications are independent. Then $p_{ij} = u_i v_j$, where the estimators for $u_i$ and $v_j$ are

$$\hat{u}_i = \frac{1}{n}\sum_{j=1}^{c} O_{ij}, \qquad \hat{v}_j = \frac{1}{n}\sum_{i=1}^{r} O_{ij}$$

Therefore, the expected frequency of each cell is

$$E_{ij} = n\,\hat{u}_i\,\hat{v}_j = \frac{1}{n}\sum_{j=1}^{c} O_{ij}\sum_{i=1}^{r} O_{ij}$$

Then, for large n, the statistic

$$\chi_0^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

has an approximate chi-square distribution with $(r-1)(c-1)$ degrees of freedom.
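A minimal NumPy sketch of this computation (the function name is mine); SciPy's stats.chi2_contingency(table, correction=False) returns the same statistic:

```python
import numpy as np

def chi_square_independence(table):
    """E_ij = (row total i)(column total j) / n; df = (r-1)(c-1)."""
    O = np.asarray(table, dtype=float)
    n = O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    chi2_0 = ((O - E) ** 2 / E).sum()
    df = (O.shape[0] - 1) * (O.shape[1] - 1)
    return chi2_0, df
```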
I. A Linear Probabilistic Model

1. When the data exhibit a linear relationship, the appropriate model is $y = \alpha + \beta x + \epsilon$.
2. The random error $\epsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

II. Method of Least Squares

1. Estimates a and b, for $\alpha$ and $\beta$, are chosen to minimize SSE, the sum of the squared deviations about the regression line $\hat{y} = a + bx$.
2. The least squares estimates are $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$.

III. Analysis of Variance

1. Total SS = SSR + SSE, where Total SS = $S_{yy}$ and SSR = $(S_{xy})^2/S_{xx}$.
2. The best estimate of $\sigma^2$ is MSE = SSE/(n - 2).

IV. Testing, Estimation, and Prediction

1. A test for the significance of the linear regression, $H_0: \beta = 0$, can be implemented using one of two test statistics:

$$t = \frac{b}{\sqrt{\mathrm{MSE}/S_{xx}}} \qquad \text{or} \qquad F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$

2. The strength of the relationship between x and y can be measured using

$$R^2 = \frac{\mathrm{MSR}}{\text{Total SS}}$$

which gets closer to 1 as the relationship gets stronger.
3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model.
4. Confidence intervals can be constructed to estimate the intercept $\alpha$ and slope $\beta$ of the regression line and to estimate the average value of y, $E(y)$, for a given value of x.
5. Prediction intervals can be constructed to predict a particular observation, y, for a given value of x. For a given x, prediction intervals are always wider than confidence intervals.
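The quantities in Sections II-IV fit in a few lines of Python; the sketch below (names are mine) is illustrative rather than a substitute for a statistics package:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit, ANOVA decomposition, and test statistics."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s_xx = ((x - x.mean()) ** 2).sum()
    s_yy = ((y - y.mean()) ** 2).sum()              # Total SS
    s_xy = ((x - x.mean()) * (y - y.mean())).sum()
    b = s_xy / s_xx                                 # slope
    a = y.mean() - b * x.mean()                     # intercept
    ssr = s_xy ** 2 / s_xx                          # SSR; SSE = Total SS - SSR
    mse = (s_yy - ssr) / (n - 2)                    # best estimate of sigma^2
    t = b / (mse / s_xx) ** 0.5                     # tests H0: beta = 0
    f = ssr / mse                                   # F = MSR/MSE; here F = t^2
    r2 = ssr / s_yy                                 # coefficient of determination
    return a, b, t, f, r2
```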
V. Correlation Analysis

1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near +1 or -1 indicates a strong linear relationship.
3. A test of the significance of the correlation coefficient is identical to the test of the slope $\beta$.
An observed correlation between X and Y can have several explanations:
X could cause Y
Y could cause X
X and Y could cause each other
X and Y could be caused by a third variable Z
X and Y could be related by chance: bad (or good) luck
A careful examination of the study is needed; try to find previous evidence or academic explanations.