
Correlation Analysis

Pearson Product Moment Coefficient of Correlation: r

$$r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

The variances and covariances are given by:

$$s_{xy} = \frac{S_{xy}}{n-1}, \qquad s_x^2 = \frac{S_{xx}}{n-1}, \qquad s_y^2 = \frac{S_{yy}}{n-1}$$



In general, when a sample of n individuals or experimental units is selected and two variables are measured on each individual or unit so that both variables are random, the correlation coefficient r is the appropriate measure of linearity for use in this situation.

Regression Analysis (2)

Example

The heights and weights of n = 10 offensive backfield football players are randomly selected from a county's football all-stars.

Calculate the correlation coefficient for the heights (in inches) and weights (in pounds) given in the table below.

Table: Heights and weights of n = 10 backfield all-stars

Player   Height x (in.)   Weight y (lb)
1        73               185
2        71               175
3        75               200
4        72               210
5        72               190
6        75               195
7        67               150
8        69               170
9        71               180
10       69               175


Solution

You should use the appropriate data entry method of your scientific calculator to verify the calculations for the sums of squares and cross-products:

$$S_{xy} = 328, \qquad S_{xx} = 60.4, \qquad S_{yy} = 2610$$

using the calculational formulas given earlier in this chapter. Then

$$r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$$

or r = .83. This value of r is fairly close to 1, the largest possible value of r, which indicates a fairly strong positive linear relationship between height and weight.
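The calculator check can also be sketched in a few lines of Python. This is only an illustration: the helper name `sums_of_squares` is made up here, and the (height, weight) pairing per player is an assumption, chosen because it reproduces the sums quoted in the solution.

```python
import math

# Heights (inches) and weights (pounds) for players 1..10.
# The pairing is an assumption; it reproduces S_xy = 328, S_xx = 60.4, S_yy = 2610.
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

def sums_of_squares(x, y):
    """Return (S_xx, S_yy, S_xy) as sums of deviations about the sample means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy

s_xx, s_yy, s_xy = sums_of_squares(heights, weights)
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 4))
```

Dividing each sum by n - 1 would give the covariance and variances, but the factor cancels in r, so the capital-S sums are enough.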








There is a direct relationship between the calculation formulas for the correlation coefficient r and the slope of the regression line b.

Since the numerator of both quantities is $S_{xy}$, both r and b have the same sign.

Therefore, the correlation coefficient has these general properties:

- When r = 0, the slope is 0, and there is no linear relationship between x and y.

- When r is positive, so is b, and there is a positive relationship between x and y.

- When r is negative, so is b, and there is a negative relationship between x and y.
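The properties above follow from the shared numerator: since b = S_xy / S_xx and r = S_xy / sqrt(S_xx S_yy), the slope can also be written as b = r (s_y / s_x). A minimal numerical check, assuming the height and weight pairs listed in the code (the pairing is an assumption that reproduces the sums used in the example):

```python
import math

# Assumed (height, weight) pairs for players 1..10
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_yy = sum((y - y_bar) ** 2 for y in weights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

r = s_xy / math.sqrt(s_xx * s_yy)

# Sample standard deviations
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

slope_least_squares = s_xy / s_xx   # b from the least squares formula
slope_from_r = r * s_y / s_x        # b rewritten in terms of r

print(round(slope_least_squares, 4), round(slope_from_r, 4))
print((slope_least_squares > 0) == (r > 0))  # same sign
```

Both expressions give the same slope because s_y / s_x equals sqrt(S_yy / S_xx); the n - 1 divisors cancel.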


The relationship between r (correlation coefficient) and the regression model

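$$\frac{\hat{Y} - \bar{Y}}{s_y} = r\,\frac{X - \bar{X}}{s_x}$$

$$\hat{Y} = \bar{Y} + r\,\frac{s_y}{s_x}\,(X - \bar{X}) = \left(\bar{Y} - r\,\frac{s_y}{s_x}\,\bar{X}\right) + r\,\frac{s_y}{s_x}\,X$$

Therefore

$$\hat{\beta}_1 = r\,\frac{s_y}{s_x}$$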


Figure Some typical scatter plots






The population correlation coefficient $\rho$ is calculated and interpreted as it is in the sample.

The experimenter can test the hypothesis that there is no correlation between the variables x and y using a test statistic that is exactly equivalent to the test of the slope $\beta$ in the previous section.


Test of Hypothesis Concerning the Correlation Coefficient $\rho$:

1. Null hypothesis: $H_0: \rho = 0$

2. Alternative hypothesis:

One-Tailed Test: $H_a: \rho > 0$ (or $H_a: \rho < 0$)

Two-Tailed Test: $H_a: \rho \neq 0$

3. Test statistic:

$$t_0 = r\sqrt{\frac{n-2}{1-r^2}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with (n - 2) degrees of freedom.


When comparing to a nonzero constant

1. Null hypothesis: $H_0: \rho = \rho_0$

2. Alternative hypothesis:

One-Tailed Test: $H_a: \rho > \rho_0$ (or $H_a: \rho < \rho_0$)

Two-Tailed Test: $H_a: \rho \neq \rho_0$

3. Test statistic:

$$t_0 = (r - \rho_0)\sqrt{\frac{n-2}{(1-r^2)(1-\rho_0^2)}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with (n - 2) degrees of freedom.


4. Rejection region: Reject $H_0$ when

One-Tailed Test: $t > t_{\alpha, n-2}$ (or $t < -t_{\alpha, n-2}$ when the alternative hypothesis is $H_a: \rho < 0$ or $H_a: \rho < \rho_0$)

Two-Tailed Test: $t > t_{\alpha/2, n-2}$ or $t < -t_{\alpha/2, n-2}$

or when p-value < $\alpha$
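The test statistic and rejection rules can be sketched as below. The function names are illustrative, the nonzero-constant form follows the statistic as read from this section, and the critical value must come from a t table (the code does not compute it). The values r = .8261 and n = 10 are the height and weight results; 3.355 is the tabled $t_{.005}$ with 8 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n, rho0=0.0):
    """t0 = (r - rho0) * sqrt((n - 2) / ((1 - r^2)(1 - rho0^2))).

    With rho0 = 0 this reduces to r * sqrt((n - 2) / (1 - r^2)).
    """
    return (r - rho0) * math.sqrt((n - 2) / ((1 - r ** 2) * (1 - rho0 ** 2)))

def reject_two_tailed(t0, t_crit):
    """Reject H0 when t0 > t_crit or t0 < -t_crit (t_crit = t_{alpha/2, n-2})."""
    return abs(t0) > t_crit

def reject_one_tailed(t0, t_crit, upper=True):
    """Reject H0 when t0 > t_crit (upper tail) or t0 < -t_crit (lower tail)."""
    return t0 > t_crit if upper else t0 < -t_crit

T_CRIT_005_8DF = 3.355  # from a t table: t_{.005} with 8 degrees of freedom

t0 = correlation_t_statistic(r=0.8261, n=10)
print(round(t0, 2), reject_two_tailed(t0, T_CRIT_005_8DF))
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; a library quantile function could replace the table lookup.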


Example

Refer to the height and weight data in the previous example. The correlation of height and weight was calculated to be r = .8261. Is this correlation significantly different from 0?

Solution

To test the hypotheses $H_0: \rho = 0$ versus $H_a: \rho \neq 0$, the value of the test statistic is

$$t_0 = r\sqrt{\frac{n-2}{1-r^2}} = .8261\sqrt{\frac{10-2}{1-(.8261)^2}} = 4.15$$

which for n = 10 has a t distribution with 8 degrees of freedom.

Since this value is greater than $t_{.005} = 3.355$, the two-tailed p-value is less than 2(.005) = .01, and the correlation is declared significant at the 1% level (P < .01). The value $r^2 = .8261^2 = .6824$ means that about 68% of the variation in one of the variables is explained by the other. The Minitab printout in Figure 12.17 displays the correlation r and the exact p-value for testing its significance.

Note that r is a measure of linear correlation; x and y could be perfectly related by some curvilinear function even when the observed value of r is equal to 0.


Testing for Goodness of Fit



In general, we do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.



Probability plotting can only be used for examining whether a population is normally distributed.



Histogram Plotting and others can only be used to guess the possible underlying distribution type.


Goodness-of-Fit Test (I)







A random sample of size n is taken from a population whose probability distribution is unknown.

These n observations are arranged in a frequency histogram having k bins or class intervals.

Let $O_i$ be the observed frequency in the i-th class interval, and let $E_i$ be the expected frequency in the i-th class interval from the hypothesized probability distribution. The test statistic is

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$


Goodness-of-Fit Test (II)







If the population follows the hypothesized distribution, $\chi_0^2$ has approximately a chi-square distribution with k - p - 1 d.f., where p represents the number of parameters of the hypothesized distribution estimated by sample statistics.

That is,

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-p-1}$$

Reject the hypothesis if

$$\chi_0^2 > \chi^2_{\alpha,\,k-p-1}$$


Goodness-of-Fit Test (III)



Class intervals are not required to be equal width.



The minimum expected frequency cannot be too small; 3, 4, and 5 are ideal minimum values.



When the minimum expected frequency is too small, we can combine that class interval with a neighboring class interval. In this case, k is reduced by one.


Example 8-18 The number of defects in printed circuit boards is hypothesized to follow a Poisson distribution. A random sample of 60 printed boards has been collected, and the observed number of defects is shown in the table below:

Number of defects   Observed frequency
0                   32
1                   15
2                   9
3                   4

The only parameter in the Poisson distribution, $\lambda$, can be estimated by the sample mean = {0(32) + 1(15) + 2(9) + 3(4)}/60 = 0.75. Therefore, the expected frequency of the first cell is:

$$p_1 = P(X = 0) = \frac{e^{-0.75}(0.75)^0}{0!} = 0.472$$

$$E_1 = 0.472 \times 60 = 28.32$$


Example 8-18 (Cont.)



Since the expected frequency in the last cell is less than 3, we combine the last two cells:
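The expected frequencies and the merge of the last two cells can be sketched as follows. This assumes observed counts of 32, 15, 9, and 4 boards with 0, 1, 2, and 3 defects (they sum to 60 and give the sample mean 0.75), treats the last cell as "3 or more", and rounds probabilities to three decimals to match the hand calculation. The helper `combine_last_cells` is hypothetical and only merges from the right-hand end.

```python
import math

observed = [32, 15, 9, 4]  # assumed counts for 0, 1, 2, and 3 (or more) defects
n = sum(observed)
lam = sum(k * o for k, o in enumerate(observed)) / n  # sample mean estimates lambda

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# P(X = 0), P(X = 1), P(X = 2), then P(X >= 3) as the remainder
probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1 - sum(probs))
probs = [round(p, 3) for p in probs]  # rounded as in the hand calculation
expected = [p * n for p in probs]

def combine_last_cells(obs, exp, minimum=3):
    """Merge the last cell into its neighbor while its expected frequency is below minimum."""
    obs, exp = list(obs), list(exp)
    while len(exp) > 1 and exp[-1] < minimum:
        last_exp = exp.pop()
        last_obs = obs.pop()
        exp[-1] += last_exp
        obs[-1] += last_obs
    return obs, exp

obs_combined, exp_combined = combine_last_cells(observed, expected)
print(obs_combined)
print([round(e, 2) for e in exp_combined])
```

After the merge there are k = 3 cells, which is the value of k used in the degrees of freedom for the test.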


Example 8-18 (Cont.)



1. The variable of interest is the form of the distribution of defects in printed circuit boards.

2. $H_0$: The form of the distribution of defects is Poisson.

$H_1$: The form of the distribution of defects is not Poisson.

3. k = 3, p = 1, k - p - 1 = 1 d.f.

4. At $\alpha$ = 0.05, we reject $H_0$ if $\chi_0^2 > \chi^2_{0.05,1} = 3.84$.

5. The test statistic is:

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(32 - 28.32)^2}{28.32} + \frac{(15 - 21.24)^2}{21.24} + \frac{(13 - 10.44)^2}{10.44} = 2.94$$

6. Since $\chi_0^2 = 2.94 < \chi^2_{0.05,1} = 3.84$, we are unable to reject the null hypothesis that the distribution of defects in printed circuit boards is Poisson.
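The arithmetic in steps 3 to 6 can be checked with a short sketch. The observed and expected frequencies are the three combined cells from the example, the critical value 3.84 is taken from the text rather than computed, and the function name is illustrative.

```python
observed = [32, 15, 13]           # combined cells from the example
expected = [28.32, 21.24, 10.44]  # expected frequencies under the fitted Poisson

def chi_square_statistic(obs, exp):
    """Sum of (O_i - E_i)^2 / E_i over all class intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

k = len(observed)  # number of class intervals after combining
p = 1              # parameters estimated from the sample (lambda)
df = k - p - 1

chi2_0 = chi_square_statistic(observed, expected)
CRITICAL_05_1DF = 3.84  # chi-square critical value at alpha = 0.05, 1 d.f. (from the text)
reject = chi2_0 > CRITICAL_05_1DF

print(round(chi2_0, 2), df, reject)
```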




Contingency Table Tests

Example 8-20

A company has to choose among three pension plans.

Management wishes to know whether the preference for plans is independent of job classification and wants to use $\alpha$ = 0.05.

The opinions of a random sample of 500 employees are shown in Table 8-4.


Contingency Table Test

- The Problem Formulation (I)







There are two classifications; one has r levels and the other has c levels (3 pension plans and 2 types of workers).

We want to know whether the two methods of classification are statistically independent (whether the preference for pension plans is independent of job classification).

The table:








Contingency Table Test

- The Problem Formulation (II)

Let $p_{ij}$ be the probability that a randomly selected element falls in the ij-th cell, given that the two classifications are independent.

Then $p_{ij} = u_i v_j$, where the estimators for $u_i$ and $v_j$ are

$$\hat{u}_i = \frac{1}{n}\sum_{j=1}^{c} O_{ij}, \qquad \hat{v}_j = \frac{1}{n}\sum_{i=1}^{r} O_{ij}$$

Therefore, the expected frequency of each cell is

$$E_{ij} = n\,\hat{u}_i\,\hat{v}_j = \frac{1}{n}\sum_{j=1}^{c} O_{ij}\,\sum_{i=1}^{r} O_{ij}$$

Then, for large n, the statistic

$$\chi_0^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

has an approximate chi-square distribution with (r-1)(c-1) d.f.
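The formulation above can be sketched in code. The counts below are hypothetical, invented only to make the arithmetic easy to follow; they are not the data of Table 8-4. The function name is also illustrative.

```python
def contingency_chi_square(table):
    """Return (statistic, degrees of freedom, expected frequencies) for an r x c table.

    E_ij = (row total i) * (column total j) / n, the product of the
    estimated marginal probabilities times n.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical 2 x 3 table (2 job classifications, 3 plans); NOT the Table 8-4 data
counts = [
    [30, 20, 10],
    [20, 20, 20],
]

statistic, df, expected = contingency_chi_square(counts)
print(round(statistic, 3), df)
print(expected)
```

The statistic would then be compared with the chi-square critical value for (r-1)(c-1) degrees of freedom at the chosen $\alpha$.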


Example 8-20



Key Concepts and Formulas

I. A Linear Probabilistic Model

1. When the data exhibit a linear relationship, the appropriate model is $y = \alpha + \beta x + \varepsilon$.

2. The random error $\varepsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

II. Method of Least Squares

1. Estimates a and b, for $\alpha$ and $\beta$, are chosen to minimize SSE, the sum of the squared deviations about the regression line, $\hat{y} = a + bx$.

2. The least squares estimates are $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$.

III. Analysis of Variance

1. Total SS = SSR + SSE, where Total SS = $S_{yy}$ and SSR = $(S_{xy})^2 / S_{xx}$.

2. The best estimate of $\sigma^2$ is MSE = SSE / (n - 2).

IV. Testing, Estimation, and Prediction

1. A test for the significance of the linear regression, $H_0: \beta = 0$, can be implemented using one of the two test statistics:

$$t = \frac{b}{\sqrt{\mathrm{MSE}/S_{xx}}} \quad \text{or} \quad F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$

2. The strength of the relationship between x and y can be measured using

$$R^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$$

which gets closer to 1 as the relationship gets stronger. (With one regression degree of freedom, SSR = MSR.)

3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model.

4. Confidence intervals can be constructed to estimate the intercept $\alpha$ and slope $\beta$ of the regression line and to estimate the average value of y, E(y), for a given value of x.

5. Prediction intervals can be constructed to predict a particular observation, y , for a given value of x . For a given x , prediction intervals are always wider than confidence intervals.
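The formulas in parts II to IV can be tied together with a short sketch using the height and weight sums from the correlation example. The sample means 71.4 and 183 are computed from the listed heights and weights; all variable names are illustrative.

```python
import math

# Sums of squares and cross-products from the height and weight example
s_xx, s_yy, s_xy = 60.4, 2610.0, 328.0
n = 10
x_bar, y_bar = 71.4, 183.0  # sample means computed from the listed data

# II. Least squares estimates
b = s_xy / s_xx
a = y_bar - b * x_bar

# III. Analysis of variance
total_ss = s_yy
ssr = s_xy ** 2 / s_xx
sse = total_ss - ssr
mse = sse / (n - 2)

# IV. Test statistics and strength of the relationship
t = b / math.sqrt(mse / s_xx)
f = ssr / mse            # MSR = SSR / 1 in simple linear regression
r_squared = ssr / total_ss

print(round(b, 4), round(a, 2))
print(round(t, 2), round(f, 2), round(r_squared, 4))
```

In simple linear regression F equals t squared, and the t value here matches the statistic from the test of the correlation coefficient, which is the equivalence the summary describes.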


V. Correlation Analysis

1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near 1 or -1 indicates a strong linear relationship.

3. A test of the significance of the correlation coefficient is identical to the test of the slope $\beta$.
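An r near 0 rules out only a linear relationship. As a small illustration with made-up data (not from the examples above), the points below satisfy y = x squared exactly, yet the correlation coefficient is 0 because the x values are symmetric about their mean. The function name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: y is an exact (curvilinear) function of x
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)
```

A scatter plot of such data would show the pattern immediately, which is why plotting the data is a useful companion to computing r.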


Cause and Effect













X could cause Y

Y could cause X

X and Y could cause each other

X and Y could be caused by a third variable Z

X and Y could be related by chance

Bad (or good) luck



A careful examination of the study is needed. Try to find previous evidence or academic explanations.

