proc univariate

EIPB 698C Lecture 7
Raul Cruz-Cano
Summer 2012
Statistical analysis procedures
• Proc univariate
• Proc ttest
• Proc corr
Proc Univariate
• The UNIVARIATE procedure provides data
summarization on the distribution of numeric variables.
PROC UNIVARIATE <option(s)>;
VAR variable-1 ... variable-n;
RUN;
Options:
PLOTS : create low-resolution stem-and-leaf, box, and
normal probability plots
NORMAL: Request tests for normality
data blood;
INFILE 'C:\teaching\SAS09\lecture9\blood.txt';
INPUT subjectID $ gender $ bloodtype $
age_group $ RBC WBC cholesterol;
run;
proc univariate data =blood ;
var cholesterol;
run;
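The PLOTS and NORMAL options listed above can be combined on the PROC UNIVARIATE statement. The following is a minimal sketch, not part of the original lecture code. It assumes a small inline data set so that it runs without blood.txt; the data set name sbp is hypothetical, and the twelve values are the SBPbefore readings from the paired example later in this lecture.

```sas
/* Hypothetical data set name; values are the SBPbefore readings used later in this lecture */
data sbp;
  input SBPbefore @@;
  datalines;
120 124 130 118 140 128 140 135 126 130 126 127
;
run;

/* NORMAL requests tests for normality;
   PLOTS requests stem-and-leaf, box, and normal probability plots */
proc univariate data=sbp normal plots;
  var SBPbefore;
run;
```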
OUTPUT (1)
The UNIVARIATE Procedure
Variable: cholesterol
Moments
N                  795           Sum Weights         795
Mean               201.43522     Sum Observations    160141
Std Deviation      49.8867157    Variance            2488.6844
Skewness           -0.0014449    Kurtosis            -0.0706044
Uncorrected SS     34234053      Corrected SS        1976015.41
Coeff Variation    24.7656371    Std Error Mean      1.76929947
N - This is the number of valid
observations for the variable. The
total number of observations is the
sum of N and the number of missing
values.
Moments - Moments are statistical summaries of a distribution.
Sum Weights - A numeric variable can be specified as a weight variable to weight the
values of the analysis variable. By default, each observation has a weight of 1.
This field is the sum of the values of the weight variable, so with the default
weights it equals N.
Sum Observations - This is the sum of observation values. If a weight
variable is specified, this field is the weighted sum. The mean for the
variable is the sum of observations divided by the sum of weights.
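The relationship between the mean, Sum Observations, and Sum Weights can be checked by hand. This is only an illustrative sketch: the two numbers are copied from the Moments table for cholesterol, and the variable names are made up for the example.

```sas
data _null_;
  sum_obs = 160141;   /* Sum Observations from the Moments table */
  sum_wgt = 795;      /* Sum Weights (default weight of 1 per observation) */
  mean = sum_obs / sum_wgt;
  put mean= 12.5;     /* written to the SAS log */
run;
```

The value written to the log should agree with the Mean reported in the listing (201.43522).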
Std Deviation - Standard deviation is the square root of the
variance. It measures the spread of a set of observations. The
larger the standard deviation is, the more spread out the
observations are.
Variance - The variance is a measure of variability. It is the sum of
the squared distances of each data value from the mean, divided by N-1.
We don't generally use variance as an index of spread because it is
in squared units. Instead, we use standard deviation.
Skewness - Skewness measures the degree and direction of
asymmetry. A symmetric distribution such as a normal
distribution has a skewness of 0, and a distribution that is
skewed to the left, e.g. when the mean is less than the
median, has a negative skewness.
(1) Kurtosis - Kurtosis is a measure of the heaviness of the tails of a
distribution. SAS reports excess kurtosis, so a normal distribution has kurtosis 0.
(2) Extremely nonnormal distributions may have high positive or negative
kurtosis values, while nearly normal distributions will have kurtosis values
close to 0.
(3) Kurtosis is positive if the tails are "heavier" than for a normal distribution
and negative if the tails are "lighter" than for a normal distribution.
Uncorrected SS - This is the sum of squared data values.
Corrected SS - This is the sum of squared
distance of data values from the mean.
This number divided by the number of
observations minus one gives the variance.
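The definitions of Corrected SS, Variance, and Std Deviation can be tied together with a short calculation. This is a sketch under one added assumption: it uses the standard identity Corrected SS = Uncorrected SS - (Sum Observations)^2 / N, which the slides do not state. The inputs are copied from the Moments table and the variable names are illustrative.

```sas
data _null_;
  n   = 795;          /* N */
  sum = 160141;       /* Sum Observations */
  uss = 34234053;     /* Uncorrected SS */
  css = uss - sum**2 / n;      /* sum of squared distances from the mean */
  variance = css / (n - 1);    /* Corrected SS divided by N-1 */
  std_dev  = sqrt(variance);   /* standard deviation is the square root of the variance */
  put css= 14.2 variance= 12.4 std_dev= 12.7;
run;
```

The three values in the log should be close to the Corrected SS, Variance, and Std Deviation entries of the listing.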
(1) Coeff Variation - The coefficient of variation is another way
of measuring variability.
(2) It is a unitless measure.
(3) It is defined as the ratio of the standard deviation to the
mean and is generally expressed as a percentage.
(4) It is useful for comparing variation between different
variables.
(1) Std Error Mean - This is the estimated standard deviation of
the sample mean.
(2) It is estimated as the standard deviation of the sample
divided by the square root of sample size.
(3) This provides a measure of the variability of the sample
mean.
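As a quick check of the coefficient of variation and the standard error of the mean as defined in this lecture, the sketch below recomputes both from N, Mean, and Std Deviation in the Moments table. The variable names are illustrative only.

```sas
data _null_;
  n = 795;
  mean = 201.43522;
  std_dev = 49.8867157;
  cv = 100 * std_dev / mean;      /* ratio of SD to mean, as a percentage */
  std_err = std_dev / sqrt(n);    /* SD divided by the square root of the sample size */
  put cv= 10.5 std_err= 10.6;
run;
```

Both results should be close to the Coeff Variation and Std Error Mean entries of the listing.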
OUTPUT (2)
Basic Statistical Measures
Location                 Variability
Mean      201.4352       Std Deviation          49.88672
Median    202.0000       Variance               2489
Mode      208.0000       Range                  314.00000
                         Interquartile Range    71.00000
NOTE: The mode displayed is the smallest of 2 modes with a count of 12.
Median - The median is a
measure of central tendency.
It is the middle number when
the values are arranged in
ascending (or descending)
order. It is less sensitive than
the mean to extreme
observations.
Mode - The mode is another measure
of central tendency. It is the value
that occurs most frequently in the
variable.
Range - The range is a measure
of the spread of a variable. It is
equal to the difference
between the largest and the
smallest observations.
It is easy to compute and easy
to understand.
Interquartile Range - The interquartile
range is the difference between the
upper (75% Q) and the lower quartiles
(25% Q). It measures the spread of a
data set. It is robust to extreme
observations.
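Both spread measures can be recomputed from the Quantiles table of the same output (Max 331, Min 17, Q3 236, Q1 165). A minimal sketch with illustrative variable names:

```sas
data _null_;
  x_max = 331;  x_min = 17;   /* 100% Max and 0% Min */
  q3 = 236;     q1 = 165;     /* 75% Q3 and 25% Q1 */
  range = x_max - x_min;      /* largest minus smallest observation */
  iqr   = q3 - q1;            /* upper quartile minus lower quartile */
  put range= iqr=;            /* writes range=314 iqr=71 to the log */
run;
```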
OUTPUT (3)
Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   113.8503     Pr > |t|     <.0001
Sign           M   397.5        Pr >= |M|    <.0001
Signed Rank    S   158205       Pr >= |S|    <.0001
(1) Student's t - The Student t-test is used to test the null hypothesis that
the population mean equals Mu0. The default value in SAS for Mu0 is 0.
(2) The t-statistic is defined to be the difference between the sample mean and the
hypothesized mean divided by the standard error of the mean.
(3) The p-value is the two-tailed probability computed using a t
distribution. If the p-value associated with the t-test is small (the cutoff is usually
set at p < 0.05), there is evidence to reject the null hypothesis in favor of the
alternative. In other words, the mean is statistically significantly different
from the hypothesized value.
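The t-statistic and its two-tailed p-value can be reproduced from the Moments table. This is a sketch rather than what PROC UNIVARIATE runs internally: it uses the PROBT function for the t distribution and the PVALUE format for display, and the variable names are illustrative.

```sas
data _null_;
  n = 795;
  mean = 201.43522;
  std_err = 1.76929947;
  mu0 = 0;                               /* default hypothesized mean */
  t = (mean - mu0) / std_err;            /* difference divided by the standard error */
  p = 2 * (1 - probt(abs(t), n - 1));    /* two-tailed probability, N-1 degrees of freedom */
  put t= 10.4 p= pvalue6.4;
run;
```

To test against a value other than 0, the MU0= option can be given on the PROC UNIVARIATE statement, for example mu0=200.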
(1) Sign - The sign test is a simple nonparametric procedure to test the null
hypothesis regarding the population median.
(2) It is used when we have a small sample from a nonnormal distribution.
(3) The statistic M is defined to be M = (N+ - N-)/2, where N+ is the number of
values that are greater than Mu0 and N- is the number of values that are less
than Mu0. Values equal to Mu0 are discarded.
(4) Under the hypothesis that the population median is equal to Mu0,
the sign test calculates the p-value for M using a binomial distribution.
(5) The interpretation of the p-value is the same as for the t-test. In our example the
M-statistic is 397.5 and the p-value is less than 0.0001. We conclude that the
median of the variable is significantly different from zero.
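The M statistic for this example follows directly from the definition: the minimum cholesterol value is 17, so all 795 valid values lie above Mu0 = 0. The sketch below recomputes M. It also assumes that the two-sided p-value is twice the binomial probability of a count no larger than min(N+, N-); the slides do not give this formula, so treat it as an assumption. Variable names are illustrative.

```sas
data _null_;
  n_pos = 795;   /* values greater than Mu0 = 0 (the minimum is 17) */
  n_neg = 0;     /* values less than Mu0 */
  n_t = n_pos + n_neg;            /* values equal to Mu0 are discarded */
  m = (n_pos - n_neg) / 2;        /* sign statistic: 397.5 */
  /* assumed two-sided binomial p-value, capped at 1 */
  p = min(1, 2 * probbnml(0.5, n_t, min(n_pos, n_neg)));
  put m= p= pvalue6.4;
run;
```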
(1) Signed Rank - The signed rank test is also known as the Wilcoxon test. It is
used to test the null hypothesis that the population median equals Mu0.
(2) It assumes that the distribution of the population is symmetric.
(3) The Wilcoxon signed rank test statistic is computed from the ranks of the
absolute differences from Mu0 and from which observations fall above or below Mu0.
(4) The interpretation of the p-value is the same as for the t-test. In our
example, the S-statistic is 158205 and the p-value is less than 0.0001. We
therefore conclude that the median of the variable is significantly different from
zero.
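The reported S value can be reproduced by hand for this example. The sketch assumes the usual definition S = (sum of the ranks of the positive differences) - n(n+1)/4, which the slides do not spell out. Because every cholesterol value is above Mu0 = 0, all 795 ranks belong to positive differences. Variable names are illustrative.

```sas
data _null_;
  n_t = 795;                               /* observations not equal to Mu0 */
  sum_pos_ranks = n_t * (n_t + 1) / 2;     /* all ranks 1..795 are positive here */
  expected      = n_t * (n_t + 1) / 4;     /* expected rank sum under the null hypothesis */
  s = sum_pos_ranks - expected;
  put s=;                                  /* writes s=158205 to the log */
run;
```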
OUTPUT (4)
Quantiles (Definition 5)
Quantile       Estimate
100% Max       331
99%            318
95%            282
90%            267
75% Q3         236
50% Median     202
25% Q1         165
10%            138
5%             123
1%             94
0% Min         17
95% - Ninety-five
percent of all values
of the variable are
equal to or less than
this value.
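Selected percentiles can also be saved to a data set with the OUTPUT statement of PROC UNIVARIATE. The sketch below is not from the lecture: it assumes a small inline data set (hypothetical name sbp, holding the SBPbefore readings from the paired example in this lecture) and a hypothetical output data set named pctl.

```sas
/* Hypothetical data set; values are the SBPbefore readings used in this lecture */
data sbp;
  input SBPbefore @@;
  datalines;
120 124 130 118 140 128 140 135 126 130 126 127
;
run;

/* NOPRINT suppresses the listing; PCTLPTS= lists the requested percentiles
   and PCTLPRE= gives the prefix of the new variables (P5 and P95 here) */
proc univariate data=sbp noprint;
  var SBPbefore;
  output out=pctl pctlpts=5 95 pctlpre=P;
run;

proc print data=pctl;
run;
```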
OUTPUT (5)
Extreme Observations
----Lowest----      ----Highest---
Value     Obs       Value     Obs
17        829       323       828
36        492       328       203
56        133       328       375
65        841       328       541
69        79        331       191
Missing Values
                         -----Percent Of-----
Missing                               Missing
Value       Count        All Obs      Obs
.           205          20.50        100.00
Extreme Observations - This is a list of the five lowest and five highest
values of the variable.
Student's t-test
• Independent One-Sample t-test
• This equation is used to compare one sample mean to a
specific value μ0.
$$ t = \frac{\bar{X} - \mu_0}{s / \sqrt{N}} $$
• Where $\bar{X}$ is the sample mean, s is the standard deviation of the sample, and N is the
sample size. The degrees of freedom used in this test is N-1.
Student's t-test
• Dependent t-test is used when the samples are dependent;
that is, when there is only one sample that has been tested
twice (repeated measures) or when there are two samples that
have been matched or "paired".
$$ t = \frac{\bar{X}_D - \mu_0}{s_D / \sqrt{N}} $$
• For this equation, the differences between all pairs must be
calculated. The pairs are either one person's pretest and
posttest scores or one person in a group matched to another
person in another group. The average ($\bar{X}_D$) and standard
deviation ($s_D$) of those differences are used in the equation, and N is
the number of pairs. The constant μ0 is non-zero if you want to test whether the
average of the differences is significantly different from μ0. The
degrees of freedom used is N-1.
PROC TTEST
The following statements are available in PROC TTEST.
PROC TTEST < options > ;
CLASS variable ;
PAIRED variables ;
BY variables ;
VAR variables ;
CLASS: A CLASS statement giving the name of the classification (or
grouping) variable must accompany the PROC TTEST statement in
the two-independent-sample case (two-sample t test). The class
variable must have two, and only two, levels.
Paired Statements
• PAIRED: the PAIRED statement identifies the variables to be
compared in paired t test
1. You can use one or more variables in the PairLists.
2. Variables or lists of variables are separated by an asterisk (*)
or a colon (:).
3. The asterisk (*) requests comparisons between each
variable on the left with each variable on the right.
4. Use the PAIRED statement only for paired comparisons.
5. The CLASS and VAR statements cannot be used with the
PAIRED statement.
PROC TTEST
OPTIONS :
ALPHA=p
specifies that confidence intervals are to be 100(1-p)% confidence intervals,
where 0<p<1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or
1 or more, an error message is printed.
H0=m
requests tests against m instead of 0 in all three situations (one-sample, two-sample,
and paired observation t tests). By default, PROC TTEST uses H0=0.
DATA=SAS-data-set
names the SAS data set for the procedure to use
*One sample ttest*;
Proc ttest data =blood H0=200;
var cholesterol;
run;
One sample t test Output
The TTEST Procedure
Variable: cholesterol
N      Mean     Std Dev    Std Err    Minimum    Maximum
795    201.4    49.8867    1.7693     17.0000    331.0
Mean     95% CL Mean       Std Dev    95% CL Std Dev
201.4    198.0   204.9     49.8867    47.5493   52.4676
DF     t Value    Pr > |t|
794    0.81       0.4175
95% CL Mean - This is the 95% confidence interval for the mean.
95% CL Std Dev - This is the 95% confidence interval for the standard deviation.
Pr > |t| - This is the probability of observing a greater absolute value of t
under the null hypothesis.
DF - The degrees of freedom for the t-test is simply the
number of valid observations minus 1. We lose one degree
of freedom because we have estimated the mean from the
sample. We have used some of the information from the
data to estimate the mean; therefore, it is not available to use
for the test, and the degrees of freedom account for this.
t Value - This is the t-statistic. It is the ratio of the difference
between the sample mean and the given number to the
standard error of the mean.
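The one-sample results for H0=200 can be checked from the summary statistics. This is a rough sketch, not PROC TTEST's internal code: it uses the PROBT and TINV functions, takes N, the mean, and the standard error from the UNIVARIATE listing, and uses illustrative variable names.

```sas
data _null_;
  n = 795;
  mean = 201.43522;
  std_err = 1.76929947;
  h0 = 200;                       /* hypothesized mean, as in H0=200 */
  alpha = 0.05;                   /* default ALPHA= value */
  df = n - 1;                     /* valid observations minus 1 */
  t = (mean - h0) / std_err;      /* t Value */
  p = 2 * (1 - probt(abs(t), df));        /* Pr > |t| */
  t_crit = tinv(1 - alpha / 2, df);       /* critical value for the 95% interval */
  lower = mean - t_crit * std_err;        /* lower 95% CL Mean */
  upper = mean + t_crit * std_err;        /* upper 95% CL Mean */
  put df= t= 8.2 p= 8.4 lower= 8.1 upper= 8.1;
run;
```

The log values should be close to the DF, t Value, Pr > |t|, and 95% CL Mean entries of the listing.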
title 'Paired Comparison';
data pressure;
input SBPbefore SBPafter @@;
diff_BP=SBPafter-SBPbefore ;
datalines;
120 128 124 131 130 131 118 127
140 132 128 125 140 141 135 137
126 118 130 132 126 129 127 135
;
run;
proc ttest data=pressure;
paired SBPbefore*SBPafter;
run;
Paired t test Output
The TTEST Procedure
Difference: SBPbefore - SBPafter
N     Mean       Std Dev    Std Err    Minimum    Maximum
12    -1.8333    5.8284     1.6825     -9.0000    8.0000
Mean       95% CL Mean         Std Dev    95% CL Std Dev
-1.8333    -5.5365   1.8698    5.8284     4.1288   9.8958
DF    t Value    Pr > |t|
11    -1.09      0.2992
Mean - This is the mean of the differences.
t Value - This is the t statistic for testing whether the mean of the differences is 0.
Pr > |t| - With p = 0.30, we fail to reject the null hypothesis: the data give no
evidence that the mean of the differences is different from 0.
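A paired t test is a one-sample t test on the within-pair differences, so the same t Value and p-value should come out of a VAR statement applied to the difference. The sketch below illustrates this with the lecture's blood pressure readings. The data set name pressure_diff and the variable diff are hypothetical, and the difference is taken as before minus after to match the direction of the PAIRED output.

```sas
data pressure_diff;
  input SBPbefore SBPafter @@;
  diff = SBPbefore - SBPafter;   /* same direction as PAIRED SBPbefore*SBPafter */
  datalines;
120 128 124 131 130 131 118 127
140 132 128 125 140 141 135 137
126 118 130 132 126 129 127 135
;
run;

/* One-sample t test of H0: mean difference = 0 */
proc ttest data=pressure_diff h0=0;
  var diff;
run;
```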
Proc corr
The CORR procedure is a statistical procedure for numeric
random variables that computes correlation statistics (The
default correlation analysis includes descriptive statistics,
Pearson correlation statistics, and probabilities for each
analysis variable).
PROC CORR options;
VAR variables;
WITH variables;
BY variables;
Proc corr data=blood;
var RBC WBC cholesterol;
run;
Proc Corr Output
Simple Statistics
Variable       N      Mean         Std Dev     Sum        Minimum     Maximum
RBC            908    7043         1003        6395020    4070        10550
WBC            916    5.48353      0.98412     5023       1.71000     8.75000
cholesterol    795    201.43522    49.88672    160141     17.00000    331.00000
N - This is the number of valid (i.e., non-missing) cases used in the correlation. By
default, proc corr uses pairwise deletion for missing observations, meaning that a
pair of observations (one from each variable in the pair being correlated) is
included if both values are non-missing. If you use the nomiss option on the proc
corr statement, proc corr uses listwise deletion and omits all observations with
missing data on any of the named variables.
Proc Corr
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
               RBC         WBC         cholesterol
RBC            1.00000     -0.00203    0.06583
                           0.9534      0.0765
               908         833         725
WBC            -0.00203    1.00000     0.02496
               0.9534                  0.5014
               833         916         728
cholesterol    0.06583     0.02496     1.00000
               0.0765      0.5014
               725         728         795
In each cell, the first entry is the correlation coefficient, the second is the
P-value, and the third is the number of observations.
Pearson Correlation Coefficients - These measure the strength and direction of
the linear relationship between the two variables. The correlation coefficient
can range from -1 to +1, with -1 indicating a perfect negative correlation, +1
indicating a perfect positive correlation, and 0 indicating no linear
correlation at all.
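The P-value printed under each coefficient can be approximated by hand. The sketch below assumes the usual t transformation of a Pearson correlation, t = r * sqrt((n-2)/(1-r^2)) with n-2 degrees of freedom, which the slides do not state. It uses the RBC and cholesterol pair (r = 0.06583, n = 725), and the variable names are illustrative.

```sas
data _null_;
  r = 0.06583;     /* Pearson correlation of RBC and cholesterol */
  n = 725;         /* pairs with both values non-missing */
  df = n - 2;
  t = r * sqrt(df / (1 - r**2));      /* assumed t transformation of r */
  p = 2 * (1 - probt(abs(t), df));    /* two-sided Prob > |r| under H0: Rho=0 */
  put t= 8.4 p= 8.4;
run;
```

The p-value in the log should land near the 0.0765 shown in the table.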