Nonparametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable, such as normality of the data, when observations are measured on an ordinal rather than an interval scale, or when the data come from skewed or multimodal distributions. All tests involving ranked data, i.e. data that can be put in order, are nonparametric.
Situation          Nonparametric method                Parametric method
One sample         Sign Test                           One-sample t test (paired t-test)
One sample         Wilcoxon Signed Ranks Test          One-sample t test (paired t-test)
Two samples        Wilcoxon Mann-Whitney Test          Two-sample t test
One-way ANOVA      Kruskal-Wallis Test                 One-way ANOVA test
Two-way ANOVA      Friedman Test                       Two-way ANOVA test
Correlation        Spearman Rank Correlation           Correlation test
Goodness-of-fit    Kolmogorov-Smirnov Test             Chi-squared goodness-of-fit test
Regression         Nonparametric linear regression     Linear regression
Non-parametric statistics are used if the data are not compatible with the assumptions of parametric methods such as normality or homogeneous variance.
Advantages of non-parametric methods
1. They are easy to apply.
2. Assumptions, such as normality, can be relaxed.
3. When observations are drawn from non-normal populations, non-parametric methods are more reliable.
4. They can be used for "ranks", i.e. scores that are not exact in a numerical sense.
Disadvantages of non-parametric methods
1. When observations are drawn from normal populations, non-parametric tests are not as powerful as parametric tests.
2. Parametric methods are sometimes robust to certain types of departure from normality, especially as n gets large (Central Limit Theorem).
3. Confidence interval construction is difficult with non-parametric methods.
Sign Test
One Sample
The sign test is designed to test a hypothesis about the location of a population distribution. It is most often used to test the hypothesis about a population median, and often involves the use of matched pairs, for example, before and after data, in which case it tests for a median difference of zero.
Example
A sample of 10 boys with intellectual disabilities received general appearance scores as follows: 4, 5, 8, 8, 9, 6, 10, 7, 6, 6.
1. H0: median = 5 vs. HA: not H0.
2. Transform the data into signs: assign "+" if the observed value > 5, "-" if the observed value < 5, and "0" if the observed value = 5.

   obs:   4  5  8  8  9  6  10  7  6  6
   sign:  -  0  +  +  +  +  +   +  +  +
Zeros are eliminated from the analysis. Since there is 1 zero, the number of observations is reduced from 10 to 9.
3. Thus we observed 8 +'s out of 9 trials. The probability of observing 8 or more +'s is, in EXCEL, 1-BINOMDIST(7,9,0.5,TRUE) = .0195. Since we perform a two-sided test, the p-value = 2*.0195 = .039.
Large Sample Approximation for n > 20:

Z = (T - n/2) / sqrt(n/4) ~ N(0,1)

In this example, Z = (8 - 9/2) / sqrt(9/4) = 2.33.

From EXCEL, the p-value = 2*[1 - NORMDIST(2.33,0,1,TRUE)] = 0.020 < 0.05.
4. Since the p-value is smaller than 0.05, we conclude that the median score is not equal to 5.
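Both the exact binomial calculation in step 3 and the large-sample approximation can be reproduced in SAS. The following is a small illustrative sketch added to these notes (it is not part of the original example); it relies only on the Base SAS functions PROBBNML and PROBNORM.

data _null_;
  T = 8;  n = 9;                               /* 8 +'s out of 9 non-zero observations */
  p_exact  = 2 * (1 - probbnml(0.5, n, T-1));  /* 2 * P(T >= 8), about 0.039           */
  z        = (T - n/2) / sqrt(n/4);            /* about 2.33                           */
  p_approx = 2 * (1 - probnorm(z));            /* about 0.020                          */
  put p_exact= z= p_approx=;
run;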
Wilcoxon Signed Ranks Test
The Wilcoxon Signed Ranks test is designed to test a hypothesis about the median of a population distribution. It often involves the use of matched pairs, for example, before and after data, in which case it tests for a median difference of zero. In many applications, this test is used in place of the one sample t-test when the normality assumption is questionable. It is a more powerful alternative to the sign test, but does assume that the population probability distribution is symmetric.
Example

id        1    2    3    4    5    6    7    8    9   10
score     4    5    8    8    9    6   10    7    6    6
x - 5    -1    0    3    3    4    1    5    2    1    1
|x - 5|   1    0    3    3    4    1    5    2    1    1
Rank-   2.5
Rank+             6.5  6.5    8  2.5    9    5  2.5  2.5     T+ = 42.5
1. H0: median = 5 vs. HA: not H0.
2. Find the differences between the observed values and the proposed median, 5.
3. Find the absolute values of the differences.
4. Eliminate any observation whose value, after subtracting 5, becomes 0.
5. Rank the absolute values of the differences, assigning average ranks to ties.
6. Add all the ranks for the positive differences: T+ = 42.5.
7. From the statistical table for the Wilcoxon Signed-Rank Test, when n = 9 and T+ = 42.5 (T- = 2.5), the p-value = 2*.008 = .016.
For the large-sample approximation (n > 20):

Z = (T+ - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) ~ N(0,1)

For this example, Z = (42.5 - [9*10/4]) / sqrt(9*10*19/24) = 2.37, and its p-value is .018.
8. Since n < 20, we use the exact p-value from the table, .016. Because it is smaller than 0.05, we conclude that the median is not equal to 5.
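Steps 2-6 and the large-sample approximation can also be carried out directly in SAS. The sketch below is added for illustration (it is not part of the original handout); it uses PROC RANK with TIES=MEAN to average tied ranks and should reproduce T+ = 42.5 and Z = 2.37.

data wsr;
  input x @@;
  diff = x - 5;
  absdiff = abs(diff);
  if diff ne 0;                                 /* step 4: drop the zero difference */
cards;
4 5 8 8 9 6 10 7 6 6
;
run;
proc rank data=wsr out=wsr_ranked ties=mean;    /* step 5: average ranks for ties */
  var absdiff;
  ranks r;
run;
proc means data=wsr_ranked sum;                 /* step 6: sum of ranks for positive diffs, T+ = 42.5 */
  where diff > 0;
  var r;
run;
data _null_;
  Tplus = 42.5;  n = 9;
  z = (Tplus - n*(n+1)/4) / sqrt(n*(n+1)*(2*n+1)/24);   /* about 2.37  */
  p = 2 * (1 - probnorm(z));                            /* about 0.018 */
  put z= p=;
run;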
SAS program for the sign test and the Wilcoxon signed rank test for one-sample data

DATA IN;
  INPUT X @@;
  diff = X - 5;
CARDS;
4 5 8 8 9 6 10 7 6 6
;
run;
PROC UNIVARIATE;
  VAR diff;
run;
Output
Univariate Procedure
Variable=DIFF
Moments
N 10 Sum Wgts 10
Mean 1.9 Sum 19
Std Dev 1.852926 Variance 3.433333
Skewness 0.180769 Kurtosis -0.62777
USS 67 CSS 30.9
CV 97.5224 Std Mean 0.585947
T:Mean=0 3.242617 Pr>|T| 0.0101
Num ^= 0 9 Num > 0 8
M(Sign) 3.5 Pr>=|M| 0.0391
Sgn Rank 20 Pr>=|S| 0.0195
Note: in the output above, M(Sign) = (# of +'s) - n/2 = 8 - 9/2 = 3.5, and Sgn Rank = T+ - n(n+1)/4 = 42.5 - 22.5 = 20.

Two Paired Samples
Sign Test
Example (From Table 18-1 (p489))
Matched-pair design involving change scores in self-perception of health among hypertensives

id          1    2    3    4    5    6    7    8    9   10   11   12   13   14   15     T+
Treatment  10   12    8    8   13   11   15   16    4   13    2   15    5    6    8
Control     6    5    7    9   10   12    9    8    3   14    6   10    1    2    1
Sign        +    +    +    -    +    -    +    +    +    -    -    +    +    +    +     11
1. H0: median of Treatment = median of Control vs. HA: not H0.
2. Assign "+" if Treatment > Control, "-" if Treatment < Control, and "0" if Treatment = Control.
3. n = 15, T+ = 11.
   The exact p-value = 2*[1 - BINOMDIST(10,15,0.5,TRUE)] = 0.1185.
Large Sample Approximation for n > 20:

Z = (T - n/2) / sqrt(n/4) ~ N(0,1)

In this example, Z = (11 - 15/2) / sqrt(15/4) = 1.81.

From EXCEL, the p-value = 2*[1 - NORMDIST(1.81,0,1,TRUE)] = 0.07 > 0.05.

Based on the large sample approximation and the exact p-value, we conclude that the median of the treatment group is not different from the median of the control group.
Wilcoxon Signed Rank Test
Example
Matched-pair design involving change scores in self-perception of health among hypertensives

id          1     2     3     4     5     6     7     8     9    10    11    12    13    14    15      T+
Treatment  10    12     8     8    13    11    15    16     4    13     2    15     5     6     8
Control     6     5     7     9    10    12     9     8     3    14     6    10     1     2     1
Diff        4     7     1    -1     3    -1     6     8     1    -1    -4     5     4     4     7
|Diff|      4     7     1     1     3     1     6     8     1     1     4     5     4     4     7
Rank-                         3           3                       3   8.5
Rank+     8.5  13.5     3           6          12    15     3                11   8.5   8.5  13.5   102.5

1. H0: median of Treatment = median of Control vs. HA: not H0.
2. Find the difference between treatment and control for each pair.
3. Rank the absolute differences, assigning average ranks to ties.
4. Add all the ranks for the positive differences: T+ = 102.5.
From the statistical table for the Wilcoxon Signed-Rank Test, when n = 15 and T+ = 102.5 (T- = 17.5), the p-value = 2*.007 = .014.
For the large-sample approximation (n > 20):

Z = (T+ - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) ~ N(0,1)

For this example, Z = (102.5 - [15*16/4]) / sqrt(15*16*31/24) = 2.41, and its p-value = .016.
5. Since the p-value < .05, we conclude that the medians of the treatment and control groups are different.
Note: the results from the sign test and the Wilcoxon signed ranks test on the paired sample are different. The Wilcoxon signed ranks test is more powerful.
SAS program for the sign test for paired data

data hyper;
  input treat control @@;
  diff = treat - control;
cards;
10 6 12 5 8 7 8 9 13 10 11 12 15 9 16 8
4 3 13 14 2 6 15 10 5 1 6 2 8 1
;
run;
proc univariate;
  var diff;
run;
Output
Univariate Procedure
Variable=DIFF
Moments
N 15 Sum Wgts 15
Mean 2.866667 Sum 43
Std Dev 3.563038 Variance 12.69524
Skewness -0.34413 Kurtosis -0.82487
USS 301 CSS 177.7333
CV 124.292 Std Mean 0.919972
T:Mean=0 3.116036 Pr>|T| 0.0076
Num ^= 0 15 Num > 0 11
M(Sign) 3.5 Pr>=|M| 0.1185
Sgn Rank 42.5 Pr>=|S| 0.0139

Note: T:Mean=0 is the parametric paired t-test. M(Sign) = (# of +'s) - n/2 = 11 - 15/2 = 3.5, and Sgn Rank = T+ - n(n+1)/4 = 102.5 - 60 = 42.5.
Two Independent Samples
Wilcoxon Mann-Whitney Test
The Wilcoxon Mann-Whitney Test is one of the most powerful of the nonparametric tests for comparing two populations. In many applications, the Wilcoxon Mann-Whitney Test is used in place of the two sample t-test when the normality assumption is questionable.
This test can also be applied when the observations in a sample of data are ranks, that is, ordinal data rather than direct measurements.
Example
A researcher assesses the effect of prolonged inhalation of cadmium oxide on hemoglobin levels.
Exposed   Unexposed      Exposed (sorted)   Rank     Unexposed (sorted)   Rank
14.4      17.4           13.7                1       15.0                  8.5
14.2      16.2           13.8                2       15.0                  8.5
13.8      17.1           14.0                3       16.0                 15
16.5      17.5           14.1                4.5     16.2                 16
14.1      15.0           14.1                4.5     16.3                 17
16.6      16.0           14.2                6       16.8                 21
15.9      16.9           14.4                7       16.9                 22
15.6      15.0           15.3               10.5     17.1                 23
14.1      16.3           15.3               10.5     17.4                 24
15.3      16.8           15.6               12       17.5                 25
15.7                     15.7               13
16.7                     15.9               14
13.7                     16.5               18
15.3                     16.6               19
14.0                     16.7               20
                                       S1 = 145                      S2 = 180

1. Hypotheses: H0: median of X = median of Y vs. HA: not H0.
2. Sort each column of the variables. Assign the joint ranks to the samples from the two variables. Find S1 and S2, the sums of the ranks assigned to each group. S = max(S1, S2).
3. The test statistic is T = S - n(n+1)/2, where n = max(n1, n2).

SAS program for the Wilcoxon Mann-Whitney Test

data oxide;
  infile cards missover;
  input group $ @;
  do until (hemo = .);
    input hemo @;
    if hemo ne . then output;
  end;
cards;
1 13.7 13.8 14 14.1 14.1 14.2 14.4 15.3 15.3 15.6 15.7 15.9 16.5 16.6 16.7
2 15 16 16.2 16.3 16.8 16.9 17.1 17.4 17.5 15
;
run;
proc npar1way wilcoxon;
  class group;
  var hemo;
  exact;
run;
proc means median;
  class group;
  var hemo;
run;
N P A R 1 W A Y P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable HEMO
Classified by Variable GROUP
Sum of Expected Std Dev Mean
GROUP N Scores Under H0 Under H0 Score
1 15 145.0 195.0 18.0173527 9.6666667
2 10 180.0 130.0 18.0173527 18.0000000
Average Scores Were Used for Ties
Wilcoxon 2-Sample Test S = 180.000
Exact P-Values
(One-sided) Prob >= S = 0.0021
(Two-sided) Prob >= |S - Mean| = 0.0042
Normal Approximation (with Continuity Correction of .5)
Z = 2.74735 Prob > |Z| = 0.0060
T-Test Approx. Significance = 0.0112
The MEANS Procedure
Analysis Variable : hemo
N
group Obs Median
-------------------------------
1 15 15.3000000
2 10 16.5500000
-------------------------------
Conclusion: Reject H0: median of X = median of Y and conclude that the median hemo levels of Groups 1 and 2 are different; in particular, the median hemo for Group 2 is greater than the median for Group 1.
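As a hand check on step 2 (an illustrative sketch, not part of the original notes), the joint rank sums S1 = 145 and S2 = 180 can be reproduced from the oxide data set created above with PROC RANK:

proc rank data=oxide out=jointranks ties=mean;   /* joint ranks across both groups */
  var hemo;
  ranks rhemo;
run;
proc means data=jointranks sum;                  /* rank sums by group: 145 and 180 */
  class group;
  var rhemo;
run;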
One-way ANOVA
Kruskal-Wallis Test
The Kruskal-Wallis test is a nonparametric test used to compare three or more samples.
Data

Levels   Observations                          Sum_i   Mean_i
1        Y11   Y12   Y13   .....   Y1n1        Y1.     Ybar1.
2        Y21   Y22   Y23   .....   Y2n2        Y2.     Ybar2.
:
a        Ya1   Ya2   Ya3   .....   Yana        Ya.     Ybara.
--------------------------------------------------------------
Total                                          Y..     Ybar..      N = n1 + n2 + ... + na
Converting the original data into ranks, we get

Levels   Observations                          Sum_i   Mean_i
1        R11   R12   R13   .....   R1n1        R1.     Rbar1.
2        R21   R22   R23   .....   R2n2        R2.     Rbar2.
:
a        Ra1   Ra2   Ra3   .....   Rana        Ra.     Rbara.
--------------------------------------------------------------
Total                                          R..     Rbar..
Hypotheses: H0: Median1 = Median2 = ... = Median_a vs. HA: not H0.
Test statistic:

T = [12 / (N(N+1))] * SUM_{i=1..a} n_i (R_i./n_i - R../N)^2
  = [12 / (N(N+1))] * SUM_{i=1..a} (R_i.^2 / n_i) - 3(N+1) ~ X^2 with a-1 degrees of freedom.
SAS program for the Kruskal-Wallis Test

data Kruskal;
  infile cards missover;
  input treat $ @;
  do until (time = .);
    input time @;
    if time ne . then output;
  end;
lines;
A 17 20 40 31 35
B 8 7 9 8
C 2 5 4 3
;
run;
proc npar1way wilcoxon;
  class treat;
  var time;
  exact;
run;
proc means median;
  class treat;
  var time;
run;
N P A R 1 W A Y P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable TIME
Classified by Variable TREAT
Sum of Expected Std Dev Mean
TREAT N Scores Under H0 Under H0 Score
A 5 55.0 35.0 6.82191040 11.0000000
B 4 26.0 28.0 6.47183246 6.5000000
C 4 10.0 28.0 6.47183246 2.5000000
Average Scores Were Used for Ties
Kruskal-Wallis Test S = 10.711
Exact P-Value Prob >= S = 6.66E-05
Chi-Square Approximation
DF = 2 Prob > S = 0.0047
The MEANS Procedure
Analysis Variable : time
N
treat Obs Median
-------------------------------
A 5 31.0000000
B 4 8.0000000
C 4 3.5000000
-------------------------------
Conclusion: Reject H0: median A = median B = median C and conclude that the median times of Treatment groups A, B, and C are not all equal. It seems that the median time for Treatment A is greater than the medians for Treatments B and C. This should be followed up with a nonparametric multiple comparison procedure.
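As a check on the test-statistic formula above, the rank sums reported by PROC NPAR1WAY (55, 26, and 10, with n1 = 5, n2 = n3 = 4, and N = 13) can be plugged into T directly. This sketch is added for illustration; it uses the formula without the tie correction that PROC NPAR1WAY applies, so it gives about 10.68 rather than the reported 10.711.

data _null_;
  N = 13;
  T = 12/(N*(N+1)) * (55**2/5 + 26**2/4 + 10**2/4) - 3*(N+1);
  p = 1 - probchi(T, 2);        /* chi-square with a-1 = 2 degrees of freedom */
  put T= p=;                    /* T is about 10.68, p is about 0.005         */
run;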
Randomized Block Design
Friedman Test
Data: Let the data be in the following format:

                           Blocks
Levels    1       2       3       .....   b        Sum_i   Mean_i
1         Y11     Y12     Y13     .....   Y1b      Y1.     Ybar1.
2         Y21     Y22     Y23     .....   Y2b      Y2.     Ybar2.
:
a         Ya1     Ya2     Ya3     .....   Yab      Ya.     Ybara.
Sum_j     Y.1     Y.2     Y.3     .....   Y.b      Y..
Mean_j    Ybar.1  Ybar.2  Ybar.3  .....   Ybar.b   Ybar..

where N = ab.
Converting the data into separate ranks within each block, we get

                           Blocks
Levels    1       2       3       .....   b        Sum_i   Mean_i
1         R11     R12     R13     .....   R1b      R1.     Rbar1.
2         R21     R22     R23     .....   R2b      R2.     Rbar2.
:
a         Ra1     Ra2     Ra3     .....   Rab      Ra.     Rbara.
Hypotheses: H0: Median1 = Median2 = ... = Median_a vs. HA: not H0.
Test statistic:

T = [12 / (ab(a+1))] * SUM_{i=1..a} R_i.^2 - 3b(a+1) ~ X^2 with a-1 degrees of freedom.
SAS program for the Friedman Test
DATA Fried;
INPUT BLOCK $ TRTMENT $ YIELD @@;
CARDS;
1 A 32.6 1 B 36.4 1 C 29.5 1 D 29.4
2 A 42.7 2 B 47.1 2 C 32.9 2 D 40.0
3 A 35.3 3 B 40.1 3 C 33.6 3 D 35.0
4 A 35.2 4 B 40.3 4 C 35.7 4 D 40.0
5 A 33.2 5 B 34.3 5 C 33.2 5 D 34.0
6 A 33.1 6 B 34.4 6 C 33.1 6 D 34.1
run;
PROC RANK;
BY BLOCK;
VAR YIELD;
RANKS RYIELD;
RUN;
proc freq;
  tables block*trtment*ryield / noprint cmh;
  title 'Friedman''s Chi-Square';
run;
proc means median;
  class trtment;
  var yield;
run;
Output
Friedman's Chi-Square
The FREQ Procedure
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
---------------------------------------------------------------
1 Nonzero Correlation 1 0.7448 0.3881
2 Row Mean Scores Differ 3 12.6207 0.0055
3 General Association 12 27.7500 0.0060
Total Sample Size = 24
The MEANS Procedure
Analysis Variable : YIELD
N
TRTMENT Obs Median
-------------------------------
A 6 34.2000000
B 6 38.2500000
C 6 33.1500000
D 6 34.5500000
-------------------------------
Conclusion: We reject the null hypothesis and conclude that the median yields of Treatments A, B, C, and D are not all equal. It seems that the median yield for Treatment B is greater than the medians for Treatments A, C, and D. This should be followed up with a nonparametric multiple comparison procedure.
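The within-block rank sums R_i. that enter the Friedman statistic are not printed by PROC FREQ, but they can be obtained and plugged into the formula by hand. The sketch below is added for illustration (it is not part of the original handout); the hand formula carries no tie correction, so its value (about 12.2) is slightly smaller than the tie-corrected Row Mean Scores statistic of 12.6207.

proc rank data=Fried out=friedrank ties=mean;   /* rank yields within each block */
  by block;
  var yield;
  ranks ryield;
run;
proc means data=friedrank noprint nway sum;
  class trtment;
  var ryield;
  output out=ranksums sum=Ri;                   /* R_i. for each treatment */
run;
data _null_;
  set ranksums end=last;
  retain ssum 0;
  ssum + Ri**2;
  if last then do;
    a = 4; b = 6;                               /* a treatments, b blocks */
    T = 12/(a*b*(a+1))*ssum - 3*b*(a+1);
    p = 1 - probchi(T, a-1);
    put T= p=;                                  /* T is about 12.2 */
  end;
run;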
Correlation
The Spearman Rank Correlation Coefficient

The Spearman rank correlation test uses the ranks (rather than the actual values) of the two variables to calculate the correlation coefficient, r_s.

data rankcorr;
  input age EEG @@;
lines;
20 98 21 75 22 95 24 100 27 99 30 65 31 64 33 70 35 85
38 74 40 68 42 66 46 48 51 54 53 63 55 52 58 67 60 55
;
run;
proc rank;
  var age;
  ranks rage;
run;
proc rank;
  var EEG;
  ranks rEEG;
run;
proc corr;
  var rage rEEG;
run;
Output
Correlation Analysis
2 'VAR' Variables: RAGE REEG
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
RAGE 18 9.5000 5.3385 171.0 1.0000 18.0000 RANK FOR VARIABLE AGE
REEG 18 9.5000 5.3385 171.0 1.0000 18.0000 RANK FOR VARIABLE EEG
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 18
RAGE REEG
RAGE 1.00000 -0.76264
RANK FOR VARIABLE AGE 0.0 0.0002
REEG -0.76264 1.00000
RANK FOR VARIABLE EEG 0.0002 0.0
Conclusion: The Spearman rank correlation between age and EEG is -0.76264. We reject H0: rho = 0 and conclude that the correlation is different from 0, based on the p-value of 0.0002.
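The same coefficient can be obtained in one step, since PROC CORR will rank the variables internally when the SPEARMAN option is given (a sketch added for convenience, not part of the original handout):

proc corr data=rankcorr spearman;   /* reports the same coefficient, -0.76264 */
  var age EEG;
run;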
Goodness- Of-Fit Test
Kolmogorov-Smirnov Goodness- Of-Fit Test
1) The Kolmogorov-Smirnov test can be used to test whether data come from a normal distribution.

The accompanying graph plots the empirical distribution function together with a normal cumulative distribution function for 100 normal random numbers. The Kolmogorov-Smirnov test is based on the maximum distance between these two curves. The test statistic, here designated D_max, is the maximum difference between the cumulative proportions of the two patterns.
2) Suppose that we observe N = m + n observations, X1, ..., Xm and Y1, ..., Yn, which are mutually independent and come from two populations.

The question becomes: are the populations the same or different? Equivalently, H0: P(X < a) = P(Y < a) for all a.

To test this, define the empirical distribution functions (EDFs)

F_X(t) = (# of X's <= t) / m   and   F_Y(t) = (# of Y's <= t) / n.
The test statistic, here designated D_max, is the maximum difference between the cumulative proportions of the two patterns.
PROC NPAR1WAY computes the Kolmogorov-Smirnov statistic as

KS = max_j sqrt( (1/n) * SUM_i n_i ( F_i(x_j) - F(x_j) )^2 ),  where j = 1, 2, ..., n.

The asymptotic Kolmogorov-Smirnov statistic is computed as

KSa = KS * sqrt(n).

If there are only two class levels, PROC NPAR1WAY computes the two-sample Kolmogorov statistic as

D = max_j | F_1(x_j) - F_2(x_j) |,  where j = 1, 2, ..., n.
SAS program for the Kolmogorov-Smirnov Test

data oxide;
  infile cards missover;
  input group $ @;
  do until (hemo = .);
    input hemo @;
    if hemo ne . then output;
  end;
cards;
1 13.7 13.8 14 14.1 14.1 14.2 14.4 15.3 15.3 15.6 15.7 15.9 16.5 16.6 16.7
2 15 16 16.2 16.3 16.8 16.9 17.1 17.4 17.5 15
;
run;

/* checking the normality of a single variable using the Kolmogorov-Smirnov test */
proc univariate normal;
  var hemo;
run;
SAS output(edited)
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.943237 Pr < W 0.1925
Kolmogorov-Smirnov D 0.123822 Pr > D >0.1500
Cramer-von Mises W-Sq 0.053008 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.403511 Pr > A-Sq >0.2500
Conclusion: The distribution of the variable, hemo, may be considered normal.
/* comparing the distributions of the two sample groups */
proc npar1way wilcoxon edf;   /* edf = empirical distribution function statistics */
  class group;
  var hemo;
run;
SAS output
Kolmogorov-Smirnov 2-Sample Test (Asymptotic)
KS = 0.293939 D = 0.600000
KSa = 1.46969 Prob > KSa = 0.0266
Conclusion: The distribution of hemo for group 1 is different from that for group 2.
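If an exact p-value for the two-sample statistic is wanted, PROC NPAR1WAY's EXACT statement also accepts a KS option (a sketch, assuming a reasonably recent SAS release):

proc npar1way edf;
  class group;
  var hemo;
  exact ks;      /* requests the exact two-sample Kolmogorov-Smirnov test */
run;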
Regression
Nonparametric linear regression using the Theil estimate.
X X1 X2 .... Xn
Y Y1 Y2 .... Yn
Here, we construct estimates of the slope using ( n ( n -1))/2 pairs of observations, and use the median of these as the estimate of the slope.
Suppose that we have ordered the data by their x values, so that for (Y_i, x_i) and (Y_j, x_j) with i < j we have x_i <= x_j. For each pair, compute the slope

S_{i,j} = (Y_j - Y_i) / (x_j - x_i).

Note that this causes a problem if some of the x values are the same! In that case, one must use only the finite slopes (i.e., only the values S_{i,j} with x_i not equal to x_j) and take the median of this reduced set. Therefore, the slope estimate is

beta-hat = median of the S_{i,j} over i < j.

When one cannot assume that the error terms are symmetric about 0, find the n terms Y_i - beta-hat*X_i, i = 1, ..., n. The median of these n terms is the estimate of the intercept.
In the following example, Y = acid levels and X = exercise times (in minutes). We want to establish the relationship between these two variables.
Parametric linear regression
DATA npreg1;
input dep indep @@;   /* the first value in each pair is the dependent variable (Y), the second is the independent variable (X) */
CARDS;
230 421 175 278 315 618 290 482
275 465 150 105 360 550 425 750
run;
proc reg;
model dep = indep;
run;
Model: MODEL1
Dependent Variable: DEP
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 53614.66151 53614.66151 58.115 0.0003
Error 6 5535.33849 922.55642
C Total 7 59150.00000
Root MSE 30.37361 R-square 0.9064
Dep Mean 277.50000 Adj R-sq 0.8908
C.V. 10.94545
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 76.210483 28.50456809 2.674 0.0368
INDEP 1 0.438898 0.05757290 7.623 0.0003
Nonparametric linear regression
DATA npreg1;
ARRAY X(8) X1-X8;
ARRAY Y(8) Y1-Y8;
DO I = 1 TO 8;
INPUT Y(I) X(I) @@;
END;
OUTPUT;
CARDS;
230 421 175 278 315 618 290 482
275 465 150 105 360 550 425 750
run;
DATA npreg2;
SET npreg1;
ARRAY X(8) X1-X8;
ARRAY Y(8) Y1-Y8;
DO I=1 TO 7;
DO J=I+1 TO 8;
SLOPE = (Y(J)-Y(I))/(X(J)-X(I));
OUTPUT;
END;
END;
KEEP SLOPE;
PROC SORT;
BY SLOPE; run;
PROC PRINT;
  TITLE 'THEIL SLOPE ESTIMATE EXAMPLE';
run;
proc means median;
  var slope;
run;
Data npreg3;
set npreg1;
ARRAY X(8) X1-X8;
ARRAY Y(8) Y1-Y8;
/* when one assumes that the error terms are not symmetric about 0 */
DO I = 1 to 8;
inter1 = Y(I)- 0.4878*X(I); /* .4878 is the median of the slopes */
output;
END;
keep inter1;
Proc print;
var inter1; run;
Proc means median;
var inter1; run;
THEIL SLOPE ESTIMATE EXAMPLE
Obs SLOPE
The MEANS Procedure
Analysis Variable : SLOPE
Median
------------
0.4878207
------------
Obs    inter1
1      24.6362
2      39.3916
3      13.5396
4      54.8804
5      48.1730
6      98.7810
7      91.7100
8      59.1500

Analysis Variable : inter1
Median
------------
51.5267000
------------

Compare the linear regression equations: due to parametric regression,
Y-hat = 76.2105 + 0.4389 * X,
and due to nonparametric regression,
Y-hat = 51.5267 + 0.4878 * X.
To compare the performance of the predicted values, let’s compare the means of their residuals as follows:
DATA npreg1;
  input dep indep @@;   /* dep = Y (first value of each pair), indep = X (second value) */
  pred1 = 76.2105 + 0.4389*indep;      /* parametric prediction    */
  resid1 = dep - pred1;
  np_pred2 = 51.5267 + 0.4878*indep;   /* nonparametric prediction */
  resid2 = dep - np_pred2;
CARDS;
230 421 175 278 315 618 290 482
275 465 150 105 360 550 425 750
;
run;
proc means;
  var resid1 resid2;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
resid1 8 -0.0010125 28.1205022 -32.4507000 42.3945000
resid2 8 2.2560250 29.7632035 -37.9871000 47.2543000
-------------------------------------------------------------------------------
In this case, it seems that the parametric regression equation predicts the values of Y more closely than the nonparametric regression does. Still, the choice of method depends on the distributional assumptions that can be made about the data.
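For reference, the n(n-1)/2 pairwise slopes used by the Theil estimate can also be generated with a self-join in PROC SQL. This is an alternative sketch added to these notes (not part of the original program); it should reproduce the median slope of about 0.4878.

data pairs;
  input y x @@;
  obsno = _n_;                  /* index used to pair each observation with later ones */
cards;
230 421 175 278 315 618 290 482
275 465 150 105 360 550 425 750
;
run;
proc sql;
  create table slopes as
  select (b.y - a.y) / (b.x - a.x) as slope
  from pairs as a, pairs as b
  where a.obsno < b.obsno and a.x ne b.x;   /* keep only finite slopes */
quit;
proc means data=slopes median;
  var slope;
run;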
Homework problems
1. A sample of 15 patients suffering from asthma participated in an experiment to study the effect of a new treatment on pulmonary function. The dependent variable is FEV
(forced expiratory volume, liters, in 1 second) before and after application of the treatment.
Subject  Before  After        Subject  Before  After
1        1.69    1.69         9        2.58    2.44
2        2.77    2.22         10       1.84    4.17
3        1.00    3.07         11       1.89    2.42
4        1.66    3.35         12       1.91    2.94
5        3.00    3.00         13       1.75    3.04
6        0.85    2.74         14       2.46    4.62
7        1.42    3.61         15       2.35    4.42
8        2.82    5.14
On the basis of these data, can one conclude that the treatment is effective in increasing the FEV level? Let α = 0.05 and find the p-value.
a) Perform the sign test.
b) Perform the Wilcoxon signed-rank test.
2. In the same context as Problem 1, subjects 1-8 came from Clinic A and subjects 9-15 from Clinic B. Can one conclude that the FEV levels from these two groups are different? Let α = 0.05 and find the p-value. Perform the Wilcoxon Mann-Whitney Test.

Subject  Clinic A       Subject  Clinic B
1        1.69           9        2.44
2        2.22           10       4.17
3        3.07           11       2.42
4        3.35           12       2.94
5        3.00           13       3.04
6        2.74           14       4.62
7        3.61           15       4.42
8        5.14
3. In the same context as Problems 1 and 2, subjects 1-8 came from Clinic A, subjects 9-15 from Clinic B, and subjects 16-20 from Clinic C were added later. Can one conclude that the FEV levels from these three groups are different? Let α = 0.05 and find the p-value. Perform the Kruskal-Wallis Test.

Subject  Clinic A       Subject  Clinic B       Subject  Clinic C
1        1.69           9        2.44           16       2.34
2        2.22           10       4.17           17       3.17
3        3.07           11       2.42           18       4.42
4        3.35           12       2.94           19       4.94
5        3.00           13       3.04           20       5.04
6        2.74           14       4.62
7        3.61           15       4.42
8        5.14
4. The following table shows the scores made by nine randomly selected student nurses on a final examination in three subject areas:

Student number   1   2   3   4   5   6   7   8   9
Subject Area
  Fundamentals  98  95  76  95  83  99  82  75  88
  Physiology    95  71  80  81  77  70  80  72  81
  Anatomy       77  79  91  84  80  93  87  81  83
Test the null hypothesis that the student nurses from whom the above sample was drawn perform equally well in all three subject areas against the alternative hypothesis that they perform better in at least one area. Let α = 0.05 and find the p-value. Perform the Friedman Test.
5. From Problem 4, find the Spearman rank correlation between the Physiology scores and the Anatomy scores.