Nonparametric methods


Nonparametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable, such as normality of the data, when observations are measured on an ordinal rather than an interval scale, or when the data come from skewed or multimodal distributions. All tests involving ranked data, i.e., data that can be put in order, are nonparametric.

Setting           Nonparametric method               Parametric counterpart
One sample        Sign Test                          One-sample t test (paired t-test)
One sample        Wilcoxon Signed Ranks Test         One-sample t test (paired t-test)
Two samples       Wilcoxon Mann-Whitney Test         Two-sample t test
One-way ANOVA     Kruskal-Wallis Test                One-way ANOVA test
Two-way ANOVA     Friedman Test                      Two-way ANOVA test
Correlation       Spearman Rank Correlation          Correlation test
Goodness-of-fit   Kolmogorov-Smirnov Test            Chi-squared goodness-of-fit test
Regression        Nonparametric linear regression    Linear regression

Nonparametric statistics are used when the data are not compatible with the assumptions of parametric methods, such as normality or homogeneity of variance.

Advantages of nonparametric methods

1. They are easy to apply.
2. Assumptions, such as normality, can be relaxed.
3. When observations are drawn from non-normal populations, nonparametric methods are more reliable.
4. They can be used for "ranks", i.e., scores that are not exact in a numerical sense.

Disadvantages of nonparametric methods

1. When observations are drawn from normal populations, nonparametric tests are not as powerful as parametric tests.
2. Parametric methods are sometimes robust to certain types of departure from normality, especially as n gets large (Central Limit Theorem).
3. Confidence interval construction is difficult with nonparametric methods.

Sign Test

One Sample

The sign test is designed to test a hypothesis about the location of a population distribution. It is most often used to test the hypothesis about a population median, and often involves the use of matched pairs, for example, before and after data, in which case it tests for a median difference of zero.

Example

A sample of 10 mentally retarded boys received general appearance scores as follows: 4, 5, 8, 8, 9, 6, 10, 7, 6, 6.

1. H0: median = 5 vs HA: not H0.

2. Transform the data into signs: assign "+" if the observed value > 5, "-" if the observed value < 5, and "0" if the observed value = 5.

   obs    4  5  8  8  9  6  10  7  6  6
   sign   -  0  +  +  +  +  +   +  +  +

   Zeros are eliminated from the analysis. Since there is 1 zero, the number of observations is reduced from 10 to 9.

3. Thus we observed 8 +'s out of 9 trials. The probability of observing 8 or more +'s is, in EXCEL, 1 - BINOMDIST(7,9,0.5,TRUE) = .0195. Since we perform a two-sided test, the p-value = 2*.0195 = .0391.

Large sample approximation for n > 20:

Z = (T - n/2) / sqrt(n/4) ~ N(0, 1)

In this example, Z = (8 - 9/2) / sqrt(9/4) = 2.33.

From EXCEL, the p-value = 2*[1 - NORMDIST(2.33,0,1,TRUE)] = 0.020 < 0.05.

4. Since the p-value is smaller than 0.05, we conclude that the median score is not equal to 5.
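The exact binomial calculation above can be reproduced with a short script. The following is an illustrative Python sketch (not part of the original SAS/EXCEL material) using only the standard library:

```python
from math import comb

def sign_test_p(data, median0):
    """Exact two-sided sign test for H0: median = median0.
    Values equal to median0 are dropped; under H0 the number of
    '+' signs T is Binomial(n, 0.5)."""
    diffs = [x - median0 for x in data if x != median0]
    n, t = len(diffs), sum(1 for d in diffs if d > 0)
    upper = sum(comb(n, k) for k in range(t, n + 1)) / 2 ** n  # P(T >= t)
    lower = sum(comb(n, k) for k in range(0, t + 1)) / 2 ** n  # P(T <= t)
    return min(1.0, 2 * min(upper, lower))

scores = [4, 5, 8, 8, 9, 6, 10, 7, 6, 6]
print(round(sign_test_p(scores, 5), 4))  # 0.0391
```

The result matches the Pr >= |M| = 0.0391 line in the SAS output later in this section.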

Wilcoxon Signed Ranks Test

The Wilcoxon Signed Ranks test is designed to test a hypothesis about the median of a population distribution. It often involves the use of matched pairs, for example, before and after data, in which case it tests for a median difference of zero. In many applications, this test is used in place of the one sample t-test when the normality assumption is questionable. It is a more powerful alternative to the sign test, but does assume that the population probability distribution is symmetric.

Example

id        1    2    3    4    5    6    7    8    9   10
score     4    5    8    8    9    6   10    7    6    6
x - 5    -1    0    3    3    4    1    5    2    1    1
|x - 5|   1    0    3    3    4    1    5    2    1    1
Rank-   2.5                                                 T- = 2.5
Rank+             6.5  6.5    8  2.5    9    5  2.5  2.5    T+ = 42.5

1. H0: median = 5 vs HA: not H0.

2. Find the differences between the observed values and the proposed median, 5.

3. Find the absolute values of the differences.

4. Eliminate any observation whose value, after subtracting 5, becomes 0.

5. Rank the absolute values of the differences, averaging the ranks among ties.

6. Add all the ranks for the positive differences: T+ = 42.5.

7. From the statistical table for the Wilcoxon Signed-Rank Test, when n = 9, T+ = 42.5, T- = 2.5, the p-value = 2*.008 = .016.

For large samples (n > 20), use the approximation

Z = (T - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) ~ N(0, 1)

For this example, Z = (42.5 - 9*10/4) / sqrt(9*10*19/24) = 2.37, and its p-value is .018.

8. Since n < 20, we rely on the exact p-value from the table (.016) and conclude that the median is not equal to 5.
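The hand calculation can be mirrored in code. The Python sketch below (an illustration, not part of the original notes) computes T+ with averaged ranks for ties and applies the large-sample Z formula from the notes; note the Z approximation is used here for illustration even though n = 9 < 20:

```python
from math import sqrt

def midranks(values):
    """1-based ranks with average (mid) ranks assigned to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of the tied rank positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank(data, median0):
    diffs = [x - median0 for x in data if x != median0]   # drop zeros
    ranks = midranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    n = len(diffs)
    z = (t_plus - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return t_plus, z

t_plus, z = signed_rank([4, 5, 8, 8, 9, 6, 10, 7, 6, 6], 5)
print(t_plus, round(z, 2))  # 42.5 2.37
```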

SAS program for the sign test and the Wilcoxon signed rank test for one-sample data

DATA IN;
INPUT X @@;
diff = X - 5;
CARDS;
4 5 8 8 9 6 10 7 6 6
;
run;
PROC UNIVARIATE;
VAR diff;
run;

Output

Univariate Procedure

Variable=DIFF

Moments

N 10 Sum Wgts 10

Mean 1.9 Sum 19

Std Dev 1.852926 Variance 3.433333

Skewness 0.180769 Kurtosis -0.62777

USS 67 CSS 30.9

CV 97.5224 Std Mean 0.585947

T:Mean=0 3.242617 Pr>|T| 0.0101

Num ^= 0 9 Num > 0 8

M(Sign) 3.5 Pr>=|M| 0.0391

Sgn Rank 20 Pr>=|S| 0.0195

M(Sign) = # of +'s - n/2 = 8 - 9/2 = 3.5

Sgn Rank = T+ - n(n+1)/4 = 42.5 - 22.5 = 20

Two Paired Samples

Sign Test

Example (from Table 18-1, p. 489)

Matched-pair design involving change scores in self-perception of health among hypertensives:

id          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Treatment  10  12   8   8  13  11  15  16   4  13   2  15   5   6   8
Control     6   5   7   9  10  12   9   8   3  14   6  10   1   2   1
Sign        +   +   +   -   +   -   +   +   +   -   -   +   +   +   +    T+ = 11

1. H0: median(Treatment) = median(Control) vs HA: not H0.

2. The signs are assigned "+" if Treatment > Control, "-" if Treatment < Control, and "0" if Treatment = Control.

3. n = 15, T+ = 11. The exact p-value = 2*[1 - BINOMDIST(10,15,0.5,TRUE)] = 0.1185.

Large sample approximation for n > 20:

Z = (T - n/2) / sqrt(n/4) ~ N(0, 1)

In this example, Z = (11 - 15/2) / sqrt(15/4) = 1.81.

From EXCEL, the p-value =2*[1- NORMDIST(1.81,0,1,TRUE)] = 0.07 > 0.05.

Based on the large sample approximation and the exact p-value, we conclude that the median of the treatment group is not different from the median of the control group.

Wilcoxon Signed Rank Test

Example

Matched-pair design involving change scores in self-perception of health among hypertensives:

id          1     2   3   4   5   6   7   8   9  10   11  12   13   14   15
Treatment  10    12   8   8  13  11  15  16   4  13    2  15    5    6    8
Control     6     5   7   9  10  12   9   8   3  14    6  10    1    2    1
Diff        4     7   1  -1   3  -1   6   8   1  -1   -4   5    4    4    7
|Diff|      4     7   1   1   3   1   6   8   1   1    4   5    4    4    7
Rank-                     3       3           3      8.5                      T- = 17.5
Rank+     8.5  13.5   3       6      12  15   3       11   8.5  8.5  13.5    T+ = 102.5

1. H0: median(Treatment) = median(Control) vs HA: not H0.

2. Find the difference between treatment and control.

3. Rank the absolute differences, averaging the ranks among ties.

4. Add all the ranks for the positive differences: T+ = 102.5.

5. From the statistical table for the Wilcoxon Signed-Rank Test, when n = 15, T+ = 102.5, T- = 17.5, the p-value = 2*.007 = .014.

For large samples (n > 20), use the approximation

Z = (T - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) ~ N(0, 1)

For this example, Z = (102.5 - 15*16/4) / sqrt(15*16*31/24) = 2.41, and its p-value = .016.

6. Since the p-value < .05, we conclude that the medians are different.

Note: the results from the sign test and the Wilcoxon signed ranks test on the paired sample are different. The Wilcoxon signed ranks test is more powerful.

SAS program for the sign test for paired data

data hyper;
input treat control @@;
diff = treat - control;
cards;
10 6 12 5 8 7 8 9 13 10 11 12 15 9 16 8
4 3 13 14 2 6 15 10 5 1 6 2 8 1
;
run;
proc univariate;
var diff;
run;

Output

Univariate Procedure

Variable=DIFF

Moments

N 15 Sum Wgts 15
Mean 2.866667 Sum 43
Std Dev 3.563038 Variance 12.69524
Skewness -0.34413 Kurtosis -0.82487
USS 301 CSS 177.7333
CV 124.292 Std Mean 0.919972
T:Mean=0 3.116036 Pr>|T| 0.0076      (parametric paired t-test)
Num ^= 0 15 Num > 0 11
M(Sign) 3.5 Pr>=|M| 0.1185
Sgn Rank 42.5 Pr>=|S| 0.0139

M(Sign) = # of +'s - n/2 = 11 - 15/2 = 3.5

Sgn Rank = T+ - n(n+1)/4 = 102.5 - 60 = 42.5

Two Independent Samples

Wilcoxon Mann-Whitney Test

The Wilcoxon Mann-Whitney Test is one of the most powerful of the nonparametric tests for comparing two populations. In many applications, the Wilcoxon Mann-Whitney Test is used in place of the two sample t-test when the normality assumption is questionable.

This test can also be applied when the observations in a sample of data are ranks, that is, ordinal data rather than direct measurements.

Example

A researcher assessed the effects of prolonged inhalation of cadmium oxide on hemoglobin levels.

Exposed   Unexposed    Exposed (sorted)  Rank    Unexposed (sorted)  Rank
14.4      17.4         13.7               1      15.0                 8.5
14.2      16.2         13.8               2      15.0                 8.5
13.8      17.1         14.0               3      16.0                15
16.5      17.5         14.1               4.5    16.2                16
14.1      15.0         14.1               4.5    16.3                17
16.6      16.0         14.2               6      16.8                21
15.9      16.9         14.4               7      16.9                22
15.6      15.0         15.3              10.5    17.1                23
14.1      16.3         15.3              10.5    17.4                24
15.3      16.8         15.6              12      17.5                25
15.7                   15.7              13
16.7                   15.9              14
13.7                   16.5              18
15.3                   16.6              19
14.0                   16.7              20
                       S1 = 145                  S2 = 180

1. Hypotheses: H0: median(X) = median(Y) vs HA: not H0.

2. Sort each column of the variables. Assign the joint ranks to the samples from the two variables. Find S1 and S2, the sums of the ranks assigned to each group, and let S = max(S1, S2).

3. The test statistic is T = S - n(n+1)/2, where n = max(n1, n2).

SAS program for the Wilcoxon Mann-Whitney Test

data oxide;
infile cards missover;
input group $ @;
do until (hemo = .);
input hemo @;
if hemo ne . then output;
end;
cards;
1 13.7 13.8 14 14.1 14.1 14.2 14.4 15.3 15.3 15.6 15.7 15.9 16.5 16.6 16.7
2 15 16 16.2 16.3 16.8 16.9 17.1 17.4 17.5 15
;
run;
proc npar1way wilcoxon;
class group;
var hemo;
exact;
run;
proc means median;
class group;
var hemo;
run;

N P A R 1 W A Y P R O C E D U R E

Wilcoxon Scores (Rank Sums) for Variable HEMO

Classified by Variable GROUP

Sum of Expected Std Dev Mean

GROUP N Scores Under H0 Under H0 Score

1 15 145.0 195.0 18.0173527 9.6666667

2 10 180.0 130.0 18.0173527 18.0000000

Average Scores Were Used for Ties

Wilcoxon 2-Sample Test S = 180.000

Exact P-Values

(One-sided) Prob >= S = 0.0021

(Two-sided) Prob >= |S - Mean| = 0.0042

Normal Approximation (with Continuity Correction of .5)

Z = 2.74735 Prob > |Z| = 0.0060

T-Test Approx. Significance = 0.0112

The MEANS Procedure

Analysis Variable : hemo

N

group Obs Median

-------------------------------
1 15 15.3000000
2 10 16.5500000
-------------------------------

Conclusion: Reject H0: median(X) = median(Y) and conclude that the median hemo of Groups 1 and 2 are different; in particular, the median hemo for Group 2 is greater than that for Group 1.
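The joint-ranking step can be checked with a few lines of code. The Python sketch below (an illustration, not part of the original notes) reproduces the rank sums S1 = 145 and S2 = 180 reported by PROC NPAR1WAY:

```python
def midranks(v):
    """Joint ranks, 1-based, with average (mid) ranks for ties."""
    s = sorted(v)
    return [(s.index(x) + (len(s) - 1 - s[::-1].index(x))) / 2 + 1 for x in v]

exposed = [13.7, 13.8, 14.0, 14.1, 14.1, 14.2, 14.4, 15.3, 15.3, 15.6,
           15.7, 15.9, 16.5, 16.6, 16.7]
unexposed = [15.0, 15.0, 16.0, 16.2, 16.3, 16.8, 16.9, 17.1, 17.4, 17.5]

# Rank the pooled sample jointly, then split the ranks back by group.
ranks = midranks(exposed + unexposed)
s1, s2 = sum(ranks[:len(exposed)]), sum(ranks[len(exposed):])
print(s1, s2)  # 145.0 180.0
```

Then S = max(S1, S2) = 180, matching the "Wilcoxon 2-Sample Test S = 180.000" line in the SAS output.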

One-way ANOVA

Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test used to compare three or more samples.

Data

            Observations                         Sum       Mean
Level 1     Y11  Y12  Y13  ...  Y1n1             Y1.       Ybar1.
Level 2     Y21  Y22  Y23  ...  Y2n2             Y2.       Ybar2.
  :
Level a     Ya1  Ya2  Ya3  ...  Yana             Ya.       Ybara.
---------------------------------------------------------------
Total                                            Y..       Ybar..

N = n1 + n2 + ... + na

Converting the original data into ranks, we get

            Observations                         Sum       Mean
Level 1     R11  R12  R13  ...  R1n1             R1.       Rbar1.
Level 2     R21  R22  R23  ...  R2n2             R2.       Rbar2.
  :
Level a     Ra1  Ra2  Ra3  ...  Rana             Ra.       Rbara.
---------------------------------------------------------------
Total                                            R..       Rbar..

Hypotheses: H0: Median1 = Median2 = ... = Mediana vs HA: not H0

Test statistic:

T = [12 / (N(N+1))] * Sum_{i=1..a} n_i * (R_i./n_i - R../N)^2
  = [12 / (N(N+1))] * Sum_{i=1..a} (R_i.^2 / n_i) - 3(N+1) ~ chi-square(a-1)

SAS program for the Kruskal-Wallis Test

data Kruskal;
infile cards missover;
input treat $ @;
do until (time = .);
input time @;
if time ne . then output;
end;
lines;
A 17 20 40 31 35
B 8 7 9 8
C 2 5 4 3
;
run;
proc npar1way wilcoxon;
class treat;
var time;
exact;
run;
proc means median;
class treat;
var time;
run;

N P A R 1 W A Y P R O C E D U R E

Wilcoxon Scores (Rank Sums) for Variable TIME

Classified by Variable TREAT

Sum of Expected Std Dev Mean

TREAT N Scores Under H0 Under H0 Score

A 5 55.0 35.0 6.82191040 11.0000000

B 4 26.0 28.0 6.47183246 6.5000000

C 4 10.0 28.0 6.47183246 2.5000000

Average Scores Were Used for Ties

Kruskal-Wallis Test S = 10.711

Exact P-Value Prob >= S = 6.66E-05

Chi-Square Approximation

DF = 2 Prob > S = 0.0047

The MEANS Procedure

Analysis Variable : time

N

treat Obs Median

-------------------------------
A 5 31.0000000
B 4 8.0000000
C 4 3.5000000
-------------------------------

Conclusion: Reject H0: median(A) = median(B) = median(C) and conclude that the median times of Treatment groups A, B, and C are different. It seems that the median time due to Treatment A is greater than the medians due to Treatments B and C. This must be followed up by a nonparametric multiple comparison procedure.
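The test statistic can also be computed directly from the formula above. The Python sketch below (illustrative, not part of the notes) gives T ≈ 10.68 without the tie correction; SAS's 10.711 includes a correction for the tied ranks:

```python
from math import exp

groups = {"A": [17, 20, 40, 31, 35], "B": [8, 7, 9, 8], "C": [2, 5, 4, 3]}
pooled = sorted(y for g in groups.values() for y in g)

def rank(x):
    """Midrank of x in the pooled sample (averages tied positions)."""
    i = pooled.index(x)
    j = len(pooled) - 1 - pooled[::-1].index(x)
    return (i + j) / 2 + 1

N = len(pooled)
rank_sums = {t: sum(rank(y) for y in g) for t, g in groups.items()}
# T = 12/(N(N+1)) * sum(R_i.^2 / n_i) - 3(N+1)
T = (12 / (N * (N + 1))
     * sum(r * r / len(groups[t]) for t, r in rank_sums.items())
     - 3 * (N + 1))
p = exp(-T / 2)  # chi-square upper tail is exactly exp(-T/2) when df = a-1 = 2
print(rank_sums, round(T, 3), round(p, 4))
```

The rank sums {A: 55, B: 26, C: 10} match the "Sum of Scores" column in the SAS output.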

Randomized Block Design

Friedman Test

Data: let the data be in the following format,

            Block 1   Block 2   Block 3   ...   Block b    Sum       Mean
Level 1     Y11       Y12       Y13       ...   Y1b        Y1.       Ybar1.
Level 2     Y21       Y22       Y23       ...   Y2b        Y2.       Ybar2.
  :
Level a     Ya1       Ya2       Ya3       ...   Yab        Ya.       Ybara.
Sum j       Y.1       Y.2       Y.3       ...   Y.b        Y..
Mean j      Ybar.1    Ybar.2    Ybar.3    ...   Ybar.b     Ybar..

where N = ab.

Converting the data into separate ranks within each block, we get

            Block 1   Block 2   Block 3   ...   Block b    Sum       Mean
Level 1     R11       R12       R13       ...   R1b        R1.       Rbar1.
Level 2     R21       R22       R23       ...   R2b        R2.       Rbar2.
  :
Level a     Ra1       Ra2       Ra3       ...   Rab        Ra.       Rbara.

Hypotheses: H0: Median1 = Median2 = ... = Mediana vs HA: not H0

Test statistic:

T = [12 / (ab(a+1))] * Sum_{i=1..a} R_i.^2 - 3b(a+1) ~ chi-square(a-1)

SAS program for the Friedman Test

DATA Fried;

INPUT BLOCK $ TRTMENT $ YIELD @@;

CARDS;

1 A 32.6 1 B 36.4 1 C 29.5 1 D 29.4

2 A 42.7 2 B 47.1 2 C 32.9 2 D 40.0

3 A 35.3 3 B 40.1 3 C 33.6 3 D 35.0

4 A 35.2 4 B 40.3 4 C 35.7 4 D 40.0

5 A 33.2 5 B 34.3 5 C 33.2 5 D 34.0

6 A 33.1 6 B 34.4 6 C 33.1 6 D 34.1

run;

PROC RANK;

BY BLOCK;

VAR YIELD;

RANKS RYIELD;

RUN;
proc freq;
tables block*trtment*ryield / noprint cmh;
title 'Friedman''s Chi-Square';
run;
proc means median;
class trtment;
var yield;
run;

Output

Friedman's Chi-Square

The FREQ Procedure

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic Alternative Hypothesis DF Value Prob

---------------------------------------------------------------

1 Nonzero Correlation 1 0.7448 0.3881

2 Row Mean Scores Differ 3 12.6207 0.0055

3 General Association 12 27.7500 0.0060

Total Sample Size = 24

The MEANS Procedure

Analysis Variable : YIELD

N

TRTMENT Obs Median

-------------------------------

A 6 34.2000000

B 6 38.2500000

C 6 33.1500000

D 6 34.5500000

-------------------------------

Conclusion: We reject the null hypothesis and conclude that the median yields of

Treatment A, B, C and D are different. It seems that the median yield due to

Treatment B is greater than the medians due to Treatment A, C, and D. This must be followed up by the nonparametric multiple comparison procedure.
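The Friedman statistic can likewise be verified with a short script. The Python sketch below (illustrative; SAS's CMH "Row Mean Scores Differ" value of 12.6207 includes a tie correction, so this uncorrected version gives T = 12.2) ranks within each block and applies T = 12/(ab(a+1)) * Sum R_i.^2 - 3b(a+1):

```python
blocks = [
    {"A": 32.6, "B": 36.4, "C": 29.5, "D": 29.4},
    {"A": 42.7, "B": 47.1, "C": 32.9, "D": 40.0},
    {"A": 35.3, "B": 40.1, "C": 33.6, "D": 35.0},
    {"A": 35.2, "B": 40.3, "C": 35.7, "D": 40.0},
    {"A": 33.2, "B": 34.3, "C": 33.2, "D": 34.0},
    {"A": 33.1, "B": 34.4, "C": 33.1, "D": 34.1},
]

# Rank within each block (average ranks for ties), then sum per treatment.
rank_sums = {t: 0.0 for t in "ABCD"}
for obs in blocks:
    vals = sorted(obs.values())
    for t, y in obs.items():
        i, j = vals.index(y), len(vals) - 1 - vals[::-1].index(y)
        rank_sums[t] += (i + j) / 2 + 1

a, b = 4, 6  # a treatments, b blocks
T = 12 / (a * b * (a + 1)) * sum(r * r for r in rank_sums.values()) - 3 * b * (a + 1)
print(rank_sums, round(T, 2))  # compare T with chi-square, df = a - 1 = 3
```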

Correlation

The Spearman Rank Correlation Coefficient

The Spearman rank correlation test uses the ranks (rather than the actual values) of the two variables to calculate the correlation coefficient, r_s.

data rankcorr;
input age EEG @@;
lines;
20 98 21 75 22 95 24 100 27 99 30 65 31 64 33 70 35 85
38 74 40 68 42 66 46 48 51 54 53 63 55 52 58 67 60 55
;
run;
proc rank;
var age;
ranks rage;
run;
proc rank;
var EEG;
ranks rEEG;
run;
Proc corr;
var rage rEEG;
run;

Output

Correlation Analysis

2 'VAR' Variables: RAGE REEG

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum Label

RAGE 18 9.5000 5.3385 171.0 1.0000 18.0000 RANK FOR VARIABLE AGE

REEG 18 9.5000 5.3385 171.0 1.0000 18.0000 RANK FOR VARIABLE EEG

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 18

RAGE REEG

RAGE 1.00000 -0.76264

RANK FOR VARIABLE AGE 0.0 0.0002

REEG -0.76264 1.00000

RANK FOR VARIABLE EEG 0.0002 0.0

Conclusion: The Spearman rank correlation between age and EEG is -0.76264. We reject H0: rho = 0 and conclude that the correlation is different from 0, based on the p-value of 0.0002.
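Because neither variable has ties here, r_s can also be obtained from the classical shortcut formula r_s = 1 - 6 * Sum(d^2) / (n(n^2 - 1)), where d is the difference between paired ranks. An illustrative Python sketch (not part of the original notes):

```python
age = [20, 21, 22, 24, 27, 30, 31, 33, 35, 38, 40, 42, 46, 51, 53, 55, 58, 60]
eeg = [98, 75, 95, 100, 99, 65, 64, 70, 85, 74, 68, 66, 48, 54, 63, 52, 67, 55]

def ranks(v):
    """1-based ranks; valid here because neither variable has ties."""
    s = sorted(v)
    return [s.index(x) + 1 for x in v]

n = len(age)
d2 = sum((ra - re) ** 2 for ra, re in zip(ranks(age), ranks(eeg)))
r_s = 1 - 6 * d2 / (n * (n * n - 1))  # tie-free shortcut formula
print(round(r_s, 5))  # -0.76264
```

This agrees with the Pearson correlation of the two rank variables computed by PROC CORR above.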

Goodness-of-Fit Test

Kolmogorov-Smirnov Goodness-of-Fit Test

1) The Kolmogorov-Smirnov test can be used to test whether data come from a normal distribution.

The graph on the right plots the empirical distribution function together with a normal cumulative distribution function for 100 normal random numbers. The Kolmogorov-Smirnov test is based on the maximum distance between these two curves. The test statistic, here designated D_max, is the maximum difference between the cumulative proportions of the two patterns.

2) Suppose that we observe N = m + n observations, X1, ..., Xm and Y1, ..., Yn, which are mutually independent and come from two populations.

The question becomes: are the populations the same or different? Equivalently: H0: P(X < a) = P(Y < a) for all a.

To test this, define the Empirical Distribution Functions (EDFs):

F_X(t) = (# X's <= t) / m   and   F_Y(t) = (# Y's <= t) / n.

The test statistic, here designated D_max, is the maximum difference between the cumulative proportions of the two patterns.

PROC NPAR1WAY computes the Kolmogorov-Smirnov statistic as

KS = max_j sqrt( (1/n) * Sum_i n_i ( F_i(x_j) - F(x_j) )^2 ),  j = 1, 2, ..., n,

where F_i is the EDF of the i-th class and F is the pooled EDF.

The asymptotic Kolmogorov-Smirnov statistic is computed as

KSa = sqrt(n) * KS.

If there are only two class levels, PROC NPAR1WAY computes the two-sample Kolmogorov statistic as

D = max_j | F1(x_j) - F2(x_j) |,  j = 1, 2, ..., n.

SAS program for the Kolmogorov-Smirnov Test

data oxide;
infile cards missover;
input group $ @;
do until (hemo = .);
input hemo @;
if hemo ne . then output;
end;
cards;
1 13.7 13.8 14 14.1 14.1 14.2 14.4 15.3 15.3 15.6 15.7 15.9 16.5 16.6 16.7
2 15 16 16.2 16.3 16.8 16.9 17.1 17.4 17.5 15
;
run;

/* checking the normality of a single variable using the Kolmogorov-Smirnov test */
proc univariate normal;
var hemo;
run;

SAS output(edited)

Tests for Normality

Test --Statistic--- -----p Value------

Shapiro-Wilk W 0.943237 Pr < W 0.1925

Kolmogorov-Smirnov D 0.123822 Pr > D >0.1500

Cramer-von Mises W-Sq 0.053008 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.403511 Pr > A-Sq >0.2500

Conclusion: The distribution of the variable, hemo, may be considered normal.

/* comparing the distributions of the two samples */
proc npar1way wilcoxon edf; /* edf = empirical distribution function */
class group;
var hemo;
run;

SAS output

Kolmogorov-Smirnov 2-Sample Test (Asymptotic)

KS = 0.293939 D = 0.600000

KSa = 1.46969 Prob > KSa = 0.0266

Conclusion: The distribution of X is different from that of Y.
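The two-sample statistic D = max_j |F1(x_j) - F2(x_j)| is easy to compute directly. The Python sketch below (illustrative, not part of the notes) reproduces D = 0.60 for the cadmium oxide data:

```python
g1 = [13.7, 13.8, 14.0, 14.1, 14.1, 14.2, 14.4, 15.3, 15.3, 15.6,
      15.7, 15.9, 16.5, 16.6, 16.7]
g2 = [15.0, 16.0, 16.2, 16.3, 16.8, 16.9, 17.1, 17.4, 17.5, 15.0]

def edf(sample, t):
    """Empirical distribution function: proportion of the sample <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

# The maximum gap between the two EDFs occurs at one of the data points.
D = max(abs(edf(g1, t) - edf(g2, t)) for t in g1 + g2)
print(round(D, 4))  # 0.6
```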

Regression

Nonparametric linear regression using the Theil estimate.

X   X1  X2  ...  Xn
Y   Y1  Y2  ...  Yn

Here, we construct estimates of the slope from the n(n-1)/2 pairs of observations and use the median of these pairwise slopes as the estimate of the slope.

Suppose that we have ordered the data by their x values, so that (Y_i, x_i) and (Y_j, x_j) with i < j satisfy x_i <= x_j. We now define the pairwise slopes

S_ij = (Y_j - Y_i) / (x_j - x_i).

Note that this causes a big problem if some of the x values are the same! In that case, one must use only the finite slopes (i.e., only the S_ij with x_i not equal to x_j) and take the median of this reduced set. Therefore,

beta-hat = median(S_ij), i < j.

When one cannot assume that the error terms are symmetric about 0, find the n terms Y_i - beta-hat * X_i, i = 1, ..., n. The median of these n terms is the estimate of the intercept.

In the following example, Y = acid levels and X = exercise times (in minutes). We want to establish the relationship between these two variables.

Parametric linear regression

DATA npreg1;

input indep dep @@;

CARDS;

230 421 175 278 315 618 290 482

275 465 150 105 360 550 425 750

run;

proc reg;

model dep = indep;

run;

Model: MODEL1

Dependent Variable: DEP

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Prob>F

Model 1 53614.66151 53614.66151 58.115 0.0003

Error 6 5535.33849 922.55642

C Total 7 59150.00000

Root MSE 30.37361 R-square 0.9064

Dep Mean 277.50000 Adj R-sq 0.8908

C.V. 10.94545

Parameter Estimates

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 76.210483 28.50456809 2.674 0.0368

INDEP 1 0.438898 0.05757290 7.623 0.0003

Nonparametric linear regression

DATA npreg1;

ARRAY X(8) X1-X8;

ARRAY Y(8) Y1-Y8;

DO I = 1 TO 8;

INPUT Y(I) X(I) @@;

END;

OUTPUT;

CARDS;

230 421 175 278 315 618 290 482

275 465 150 105 360 550 425 750

run;

DATA npreg2;

SET npreg1;

ARRAY X(8) X1-X8;

ARRAY Y(8) Y1-Y8;

DO I=1 TO 7;

DO J=I+1 TO 8;

SLOPE = (Y(J)-Y(I))/(X(J)-X(I));

OUTPUT;

END;

END;

KEEP SLOPE;

PROC SORT;
BY SLOPE;
run;
PROC PRINT;
TITLE 'THEIL SLOPE ESTIMATE EXAMPLE';
run;
proc means median;
var slope;
run;

Data npreg3;

set npreg1;

ARRAY X(8) X1-X8;

ARRAY Y(8) Y1-Y8;

/* when one assumes that the error terms are not symmetric about 0 */

DO I = 1 to 8;

inter1 = Y(I)- 0.4878*X(I); /* .4878 is the median of the slopes */

output;

END;

keep inter1;

Proc print;
var inter1;
run;
Proc means median;
var inter1;
run;

THEIL SLOPE ESTIMATE EXAMPLE

Obs SLOPE

The MEANS Procedure

Analysis Variable : SLOPE

Median

------------

0.4878207

------------

Obs inter1
1 24.6362
2 39.3916
3 13.5396
4 54.8804
5 48.1730
6 98.7810
7 91.7100
8 59.1500

Analysis Variable : inter1

Median
------------
51.5267000
------------

Compare the linear regression equations due to parametric regression,

Y-hat = 76.2105 + 0.4389 * X,

and due to nonparametric regression,

Y-hat = 51.5267 + 0.4878 * X.

To compare the performance of the predicted values, let's compare the means of their residuals as follows:

DATA npreg1;
input indep dep @@;
pred1 = 76.2105 + 0.4389*dep;
resid1 = indep - pred1;
np_pred2 = 51.5267 + 0.4878*dep;
resid2 = indep - np_pred2;
CARDS;
230 421 175 278 315 618 290 482
275 465 150 105 360 550 425 750
;
run;
proc means;
var resid1 resid2;
run;

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
resid1 8 -0.0010125 28.1205022 -32.4507000 42.3945000
resid2 8 2.2560250 29.7632035 -37.9871000 47.2543000
-------------------------------------------------------------------------------

In this case, it seems that the parametric regression equation predicts the values of Y more closely than the nonparametric one does. Still, the choice of method depends on the distributional assumptions of the data.
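The Theil fit can be reproduced without SAS arrays. The Python sketch below (illustrative; as in the notes, Y is the acid level and X the exercise time) takes the median of all n(n-1)/2 pairwise slopes and then the median of the residual terms for the intercept:

```python
from statistics import median

y = [230, 175, 315, 290, 275, 150, 360, 425]  # acid levels
x = [421, 278, 618, 482, 465, 105, 550, 750]  # exercise times (all distinct)

# Theil slope: median of all pairwise slopes S_ij = (y_j - y_i)/(x_j - x_i)
slopes = [(y[j] - y[i]) / (x[j] - x[i])
          for i in range(len(x)) for j in range(i + 1, len(x))]
beta = median(slopes)

# Intercept when the errors need not be symmetric about 0:
# the median of the n terms y_i - beta * x_i
alpha = median(yi - beta * xi for xi, yi in zip(x, y))
print(round(beta, 4), round(alpha, 2))  # 0.4878 51.52
```

The small difference from the SAS intercept (51.5267) arises because the SAS step used the rounded slope 0.4878 rather than the full-precision median.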

Homework problems

1. A sample of 15 patients suffering from asthma participated in an experiment to study the effect of a new treatment on pulmonary function. The dependent variable is FEV (forced expiratory volume in 1 second, liters), measured before and after application of the treatment.

Subject  Before  After       Subject  Before  After
1        1.69    1.69        9        2.58    2.44
2        2.77    2.22        10       1.84    4.17
3        1.00    3.07        11       1.89    2.42
4        1.66    3.35        12       1.91    2.94
5        3.00    3.00        13       1.75    3.04
6         .85    2.74        14       2.46    4.62
7        1.42    3.61        15       2.35    4.42
8        2.82    5.14

On the basis of these data, can one conclude that the treatment is effective in increasing the FEV level? Let alpha = 0.05 and find the p-value. a) Perform the sign test. b) Perform the Wilcoxon signed-rank test.

2. In the same context as Problem 1, subjects 1-8 came from Clinic A and subjects 9-15 from Clinic B. Can one conclude that the FEV levels from these two groups are different? Let alpha = 0.05 and find the p-value. Perform the Wilcoxon Mann-Whitney Test.

Subject  Clinic A     Subject  Clinic B
1        1.69         9        2.44
2        2.22         10       4.17
3        3.07         11       2.42
4        3.35         12       2.94
5        3.00         13       3.04
6        2.74         14       4.62
7        3.61         15       4.42
8        5.14

3. In the same context as Problems 1 and 2, subjects 1-8 came from Clinic A, subjects 9-15 from Clinic B, and subjects 16-20 from Clinic C were added later. Can one conclude that the FEV levels from these three groups are different? Let alpha = 0.05 and find the p-value. Perform the Kruskal-Wallis Test.

Subject  Clinic A     Subject  Clinic B     Subject  Clinic C
1        1.69         9        2.44         16       2.34
2        2.22         10       4.17         17       3.17
3        3.07         11       2.42         18       4.42
4        3.35         12       2.94         19       4.94
5        3.00         13       3.04         20       5.04
6        2.74         14       4.62
7        3.61         15       4.42
8        5.14

4. The following table shows the scores made by nine randomly selected student nurses on final examinations in three subject areas:

Student number  Fundamentals  Physiology  Anatomy
1               98            95          77
2               95            71          79
3               76            80          91
4               95            81          84
5               83            77          80
6               99            70          93
7               82            80          87
8               75            72          81
9               88            81          83

Test the null hypothesis that the student nurses from whom the above sample was drawn perform equally well in all three subject areas, against the alternative hypothesis that they perform better in at least one area. Let alpha = 0.05 and find the p-value. Perform the Friedman Test.

5. From Problem 4, find the Spearman rank correlation between the Physiology scores and the Anatomy scores.
