HLTH 653 Lecture 2
Raul Cruz-Cano
Spring 2013
Statistical analysis procedures
• Proc univariate
• Proc ttest
• Proc corr
• Proc reg
Proc Univariate
• The UNIVARIATE procedure provides data summarization on the distribution of numeric variables.

PROC UNIVARIATE <option(s)>;
  VAR variable-1 ... variable-n;
RUN;

Options:
PLOTS: create low-resolution stem-and-leaf, box, and normal probability plots
NORMAL: request tests for normality
data blood;
  infile 'C:\blood.txt';
  input subjectID $ gender $ bloodtype $
        age_group $ RBC WBC cholesterol;
run;

proc univariate data=blood;
  var cholesterol;
run;
OUTPUT (1)

The UNIVARIATE Procedure
Variable: cholesterol

Moments
N                      795    Sum Weights             795
Mean             201.43522    Sum Observations     160141
Std Deviation   49.8867157    Variance          2488.6844
Skewness        -0.0014449    Kurtosis         -0.0706044
Uncorrected SS    34234053    Corrected SS     1976015.41
Coeff Variation 24.7656371    Std Error Mean   1.76929947
Moments - Moments are statistical summaries of a distribution.

N - This is the number of valid observations for the variable. The total number of observations is the sum of N and the number of missing values.

Sum Weights - A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight is 1 for each observation, so by default this field equals N; in general, it is the sum of the weight-variable values across observations.
Sum Observations - This is the sum of the observation values. If a weight variable is specified, this field is the weighted sum. The mean of the variable is the sum of observations divided by the sum of weights.
Std Deviation - The standard deviation is the square root of the variance. It measures the spread of a set of observations: the larger the standard deviation, the more spread out the observations are.
Variance - The variance is a measure of variability. It is the sum of the squared distances of the data values from the mean, divided by N-1. We don't generally use the variance as an index of spread because it is in squared units; instead, we use the standard deviation.
Skewness - Skewness measures the degree and direction of asymmetry. A symmetric distribution, such as a normal distribution, has a skewness of 0; a distribution that is skewed to the left (e.g., when the mean is less than the median) has a negative skewness.
Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SAS, a normal distribution has kurtosis 0. Kurtosis is positive if the tails are "heavier" than for a normal distribution and negative if the tails are "lighter." Extremely nonnormal distributions may have high positive or negative kurtosis values, while nearly normal distributions have kurtosis values close to 0.
Uncorrected SS - This is the uncorrected sum of squares: the sum of the squared data values.

Corrected SS - This is the sum of the squared distances of the data values from the mean. This number divided by the number of observations minus one gives the variance.
Coeff Variation - The coefficient of variation is another way of measuring variability. It is a unitless measure, defined as the ratio of the standard deviation to the mean, and is generally expressed as a percentage. It is useful for comparing variation between different variables.
Std Error Mean - This is the estimated standard deviation of the sample mean, computed as the standard deviation of the sample divided by the square root of the sample size. It provides a measure of the variability of the sample mean.
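The Moments statistics above are straightforward to reproduce by hand. A minimal Python sketch of the same formulas on a small made-up sample (using the bias-corrected skewness and kurtosis that SAS reports by default; the sample values are illustrative only):

```python
import math

def moment_stats(x):
    """Statistics from the PROC UNIVARIATE Moments table."""
    n = len(x)
    mean = sum(x) / n
    uncorrected_ss = sum(v ** 2 for v in x)          # Uncorrected SS: sum of squared values
    corrected_ss = sum((v - mean) ** 2 for v in x)   # Corrected SS: squared distances from the mean
    variance = corrected_ss / (n - 1)
    sd = math.sqrt(variance)
    cv = 100 * sd / mean                             # Coeff Variation, as a percentage
    sem = sd / math.sqrt(n)                          # Std Error Mean
    z = [(v - mean) / sd for v in x]
    # Bias-corrected sample skewness and kurtosis (SAS defaults)
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": mean, "sd": sd, "variance": variance, "cv": cv,
            "sem": sem, "skew": skew, "kurt": kurt,
            "uncorrected_ss": uncorrected_ss, "corrected_ss": corrected_ss}

stats = moment_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Note that Uncorrected SS = Corrected SS + N * mean², which is why both appear in the table.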
OUTPUT (2)

Location                  Variability
Mean    201.4352          Std Deviation         49.88672
Median  202.0000          Variance              2489
Mode    208.0000          Range                 314.00000
                          Interquartile Range   71.00000

NOTE: The mode displayed is the smallest of 2 modes with a count of 12.

Median - The median is a measure of central tendency. It is the middle number when the values are arranged in ascending (or descending) order. It is less sensitive than the mean to extreme observations.

Mode - The mode is another measure of central tendency. It is the value that occurs most frequently in the variable.
Range - The range is a measure of the spread of a variable. It is equal to the difference between the largest and the smallest observations. It is easy to compute and easy to understand.

Interquartile Range - The interquartile range is the difference between the upper quartile (Q3, 75%) and the lower quartile (Q1, 25%). It measures the spread of a data set and is robust to extreme observations.
OUTPUT (3)

Tests for Location: Mu0=0
Test           Statistic         p Value
Student's t    t   113.8503      Pr > |t|    <.0001
Sign           M   397.5         Pr >= |M|   <.0001
Signed Rank    S   158205        Pr >= |S|   <.0001
Sign - The sign test is a simple nonparametric procedure to test the null hypothesis regarding the population median. It is used when we have a small sample from a nonnormal distribution. The statistic M is defined as M=(N+ - N-)/2, where N+ is the number of values greater than Mu0 and N- is the number of values less than Mu0; values equal to Mu0 are discarded. Under the hypothesis that the population median is equal to Mu0, the sign test calculates the p-value for M using a binomial distribution. The interpretation of the p-value is the same as for the t-test. In our example the M statistic is 397.5 and the p-value is less than 0.0001, so we conclude that the median of the variable is significantly different from zero.
Signed Rank - The signed rank test is also known as the Wilcoxon test. It is used to test the null hypothesis that the population median equals Mu0, and it assumes that the distribution of the population is symmetric. The Wilcoxon signed rank statistic is computed from the rank sum and the numbers of observations that are either above or below the median. The interpretation of the p-value is the same as for the t-test. In our example, the S statistic is 158205 and the p-value is less than 0.0001, so we conclude that the median of the variable is significantly different from zero.
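Outside SAS, both tests can be run with SciPy. A small sketch on made-up data (the sample values and mu0 below are illustrative, not from the blood data set):

```python
import numpy as np
from scipy import stats

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 6.3, 4.4, 5.8])
mu0 = 4.0  # hypothesized median

# Sign test: M = (N+ - N-)/2, with the p-value from a Binomial(N+ + N-, 0.5)
diffs = x - mu0
n_pos = int(np.sum(diffs > 0))
n_neg = int(np.sum(diffs < 0))  # values equal to mu0 are discarded
m_stat = (n_pos - n_neg) / 2
p_sign = stats.binomtest(min(n_pos, n_neg), n_pos + n_neg, 0.5).pvalue

# Wilcoxon signed rank test (assumes the distribution is symmetric)
s_stat, p_wilcoxon = stats.wilcoxon(x - mu0)
```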
OUTPUT (4)

Quantiles (Definition 5)
Quantile      Estimate
100% Max           331
99%                318
95%                282
90%                267
75% Q3             236
50% Median         202
25% Q1             165
10%                138
5%                 123
1%                  94
0% Min              17

95% - Ninety-five percent of all values of the variable are equal to or less than this value.
OUTPUT (5)

Extreme Observations
----Lowest----    ----Highest----
Value    Obs      Value    Obs
 17      829       323     828
 36      492       328     203
 56      133       328     375
 65      841       328     541
 69       79       331     191

Missing Values
                    -----Percent Of-----
Missing   Count     All Obs   Missing Obs
Value
.         205       20.50     100.00

Extreme Observations - This is a list of the five lowest and five highest values of the variable.
Student's t-test
• Independent one-sample t-test
• This equation is used to compare one sample mean to a specific value μ0:

  t = (Xbar - μ0) / (s / √N)

• where s is the sample standard deviation and N is the sample size. The degrees of freedom used in this test is N-1.
Student's t-test
• The dependent t-test is used when the samples are dependent; that is, when there is only one sample that has been tested twice (repeated measures) or when there are two samples that have been matched or "paired":

  t = (XbarD - μ0) / (sD / √N)

• For this equation, the differences between all pairs must be calculated. The pairs are either one person's pretest and posttest scores or one person in a group matched to a person in another group. The average (XbarD) and standard deviation (sD) of those differences are used in the equation. The constant μ0 is nonzero if you want to test whether the average difference is significantly different from μ0. The degrees of freedom used is N-1.
PROC TTEST
The following statements are available in PROC TTEST:

PROC TTEST <options>;
  CLASS variable;
  PAIRED variables;
  BY variables;
  VAR variables;

CLASS: a CLASS statement giving the name of the classification (or grouping) variable must accompany the PROC TTEST statement in the two-independent-sample case (two-sample t-test). The class variable must have two, and only two, levels.
Paired Statements
• PAIRED: the PAIRED statement identifies the variables to be compared in a paired t-test.
1. You can use one or more variables in the PairLists.
2. Variables or lists of variables are separated by an asterisk (*)
or a colon (:).
3. The asterisk (*) requests comparisons between each
variable on the left with each variable on the right.
4. Use the PAIRED statement only for paired comparisons.
5. The CLASS and VAR statements cannot be used with the
PAIRED statement.
PROC TTEST
OPTIONS:
ALPHA=p
specifies that confidence intervals are 100(1-p)% confidence intervals, where 0<p<1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an error message is printed.
H0=m
requests tests against m instead of 0 in all three situations (one-sample, two-sample, and paired-observation t-tests). By default, PROC TTEST uses H0=0.
DATA=SAS-data-set
names the SAS data set for the procedure to use.
*One sample t-test*;
proc ttest data=blood H0=200;
  var cholesterol;
run;
One sample t test Output

The TTEST Procedure
Variable: cholesterol

  N      Mean    Std Dev   Std Err   Minimum   Maximum
795     201.4    49.8867    1.7693   17.0000     331.0

 Mean     95% CL Mean       Std Dev    95% CL Std Dev
201.4    198.0    204.9     49.8867   47.5493   52.4676

 DF    t Value    Pr > |t|
794       0.81      0.4175

95% CL Mean is the 95% confidence interval for the mean.
95% CL Std Dev is the 95% confidence interval for the standard deviation.
One sample t test Output

Pr > |t| - This is the probability of observing a greater absolute value of t under the null hypothesis.

DF - The degrees of freedom for the t-test is simply the number of valid observations minus 1. We lose one degree of freedom because we have estimated the mean from the sample; that information is no longer available for the test, and the degrees of freedom accounts for this.

t Value - This is the t-statistic: the ratio of the difference between the sample mean and the given number to the standard error of the mean.
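The t statistic and its p-value can be recomputed directly from the summary statistics (mean, standard error, and degrees of freedom). A quick SciPy sketch using the numbers reported above:

```python
from scipy import stats

n, mean, se = 795, 201.43522, 1.76929947  # N, mean, and Std Err from the output
mu0 = 200                                  # null value from H0=200

t_value = (mean - mu0) / se                # t = (Xbar - mu0) / (s / sqrt(N))
df = n - 1
p_value = 2 * stats.t.sf(abs(t_value), df)  # two-sided Pr > |t|
```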
title 'Paired Comparison';
data pressure;
input SBPbefore SBPafter @@;
diff_BP=SBPafter-SBPbefore ;
datalines;
120 128 124 131 130 131 118 127
140 132 128 125 140 141 135 137
126 118 130 132 126 129 127 135
;
run;
proc ttest data=pressure;
paired SBPbefore*SBPafter;
run;
Paired t test Output

The TTEST Procedure
Difference: SBPbefore - SBPafter

 N      Mean    Std Dev   Std Err   Minimum   Maximum
12   -1.8333     5.8284    1.6825   -9.0000    8.0000

   Mean     95% CL Mean        Std Dev   95% CL Std Dev
-1.8333   -5.5365   1.8698      5.8284   4.1288   9.8958

DF    t Value    Pr > |t|
11      -1.09      0.2992

Mean is the mean of the differences. The t statistic tests whether the mean of the differences is 0; the p-value of 0.2992 suggests the mean difference is not significantly different from 0.
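The same paired analysis can be reproduced with SciPy's ttest_rel on the blood-pressure data from the DATA step above (it analyzes before - after, matching SAS's SBPbefore - SBPafter difference):

```python
from scipy import stats

before = [120, 124, 130, 118, 140, 128, 140, 135, 126, 130, 126, 127]
after  = [128, 131, 131, 127, 132, 125, 141, 137, 118, 132, 129, 135]

# Paired t-test on the differences before - after
t_value, p_value = stats.ttest_rel(before, after)
```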
Two independent samples t-test
• An independent samples t-test is used when you
want to compare the means of a normally
distributed interval dependent variable for two
independent groups. For example, using the
hsb2 data file, say we wish to test whether the
mean for write is the same for males and
females.
proc ttest data = "c:\hsb2";
class female;
var write;
run;
Proc corr
The CORR procedure is a statistical procedure for numeric random variables that computes correlation statistics. The default correlation analysis includes descriptive statistics, Pearson correlation statistics, and probabilities for each analysis variable.

PROC CORR options;
  VAR variables;
  WITH variables;
  BY variables;

proc corr data=blood;
  var RBC WBC cholesterol;
run;
Proc Corr Output

Simple Statistics
Variable        N        Mean     Std Dev        Sum    Minimum     Maximum
RBC           908        7043        1003    6395020       4070       10550
WBC           916     5.48353     0.98412       5023    1.71000     8.75000
cholesterol   795   201.43522    49.88672     160141   17.00000   331.00000

N - This is the number of valid (i.e., non-missing) cases used in the correlation. By default, proc corr uses pairwise deletion for missing observations, meaning that a pair of observations (one from each variable in the pair being correlated) is included if both values are non-missing. If you use the nomiss option on the proc corr statement, proc corr uses listwise deletion and omits all observations with missing data on any of the named variables.
Proc Corr

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

              RBC         WBC         cholesterol
RBC           1.00000     -0.00203    0.06583
                           0.9534     0.0765
              908          833        725

WBC          -0.00203      1.00000    0.02496
              0.9534                  0.5014
              833          916        728

cholesterol   0.06583      0.02496    1.00000
              0.0765       0.5014
              725          728        795

Each cell shows the correlation coefficient, its p-value, and the number of observations.

Pearson Correlation Coefficients - These measure the strength and direction of the linear relationship between two variables. The correlation coefficient can range from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no correlation at all.
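Pearson's r and its p-value for one pair of variables can be computed the same way with SciPy (the sample values below are made up, not from the blood data):

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # roughly 2x, so r should be close to +1

# r measures the strength and direction of the linear relationship;
# p tests H0: Rho=0, as in the Prob > |r| row of the SAS output
r, p = stats.pearsonr(x, y)
```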
Proc Reg
• The REG procedure is one of many regression procedures in
the SAS System.
PROC REG <options>;
  MODEL dependents = <regressors> </options>;
  BY variables;
  OUTPUT <OUT=SAS-data-set> keyword=names;

data blood;
  infile 'F:\blood.txt';
  input subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;
data blood1;
set blood;
if gender='Female' then sex=1; else sex=0;
if bloodtype='A' then typeA=1; else typeA=0;
if bloodtype='B' then typeB=1; else typeB=0;
if bloodtype='AB' then typeAB=1; else typeAB=0;
if age_group='Old' then Age_old=1; else Age_old=0;
run;
proc reg data =blood1;
model cholesterol =sex typeA typeB typeAB Age_old RBC WBC ;
run;
Proc reg output

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               7             41237     5891.02895       2.54    0.0140
Error             655           1521839     2323.41811
Corrected Total   662           1563076

DF - These are the degrees of freedom associated with the sources of variance.
(1) The total variance has N-1 degrees of freedom (663-1 = 662).
(2) The model degrees of freedom corresponds to the number of predictors minus 1 (P-1). Including the intercept, there are 8 predictors, so the model has 8-1 = 7 degrees of freedom.
(3) The residual degrees of freedom is the total DF minus the model DF: 662-7 = 655.
Sum of Squares - These are the sums of squares associated with the three sources of variance: total, model, and residual.
SSTotal - The total variability around the mean: Sum(Y - Ybar)².
SSResidual - The sum of squared errors in prediction: Sum(Y - Ypredicted)².
SSModel - The improvement in prediction from using the predicted value of Y over just using the mean of Y; hence, the squared differences between the predicted value of Y and the mean of Y: Sum(Ypredicted - Ybar)².
Note that SSTotal = SSModel + SSResidual, and SSModel / SSTotal is equal to R-Square, the proportion of the variance explained by the independent variables.
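The decomposition is easy to verify numerically. A small NumPy sketch that fits an ordinary least-squares model to made-up data and checks SSTotal = SSModel + SSResidual and R-Square = SSModel / SSTotal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)  # made-up data, not the blood data set

# OLS fit with an intercept column
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_total = np.sum((y - y.mean()) ** 2)      # Sum(Y - Ybar)^2
ss_model = np.sum((y_hat - y.mean()) ** 2)  # Sum(Ypredicted - Ybar)^2
ss_resid = np.sum((y - y_hat) ** 2)         # Sum(Y - Ypredicted)^2

r_square = ss_model / ss_total
```

The identity holds exactly whenever the model includes an intercept.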
Mean Square - These are the mean squares: the sums of squares divided by their respective DF. They are computed so you can form the F ratio, dividing the Mean Square Model by the Mean Square Residual, to test the significance of the predictors in the model.
F Value and Pr > F - The F value is the Mean Square Model divided by the Mean Square Residual. The F value and p-value are used to answer the question "Do the independent variables predict the dependent variable?" The p-value is compared to your alpha level (typically 0.05); if it is smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable." Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable.
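Both numbers can be recomputed from the ANOVA table itself with SciPy:

```python
from scipy import stats

# Mean squares and degrees of freedom from the Analysis of Variance table above
ms_model, ms_error = 5891.02895, 2323.41811
df_model, df_error = 7, 655

f_value = ms_model / ms_error                      # F = MS Model / MS Residual
p_value = stats.f.sf(f_value, df_model, df_error)  # Pr > F
```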
Proc reg output

Root MSE           48.20185    R-Square    0.0264
Dependent Mean    201.69683    Adj R-Sq    0.0160
Coeff Var          23.89817

Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).
Dependent Mean - This is the mean of the dependent variable.

Coeff Var - This is the coefficient of variation, a unit-less measure of variation in the data. It is the Root MSE divided by the mean of the dependent variable, multiplied by 100: (100*(48.20185/201.69683) = 23.90).
Proc reg output

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             187.91927          17.45409      10.77      <.0001
sex           1               1.48640           3.79640       0.39      0.6955
typeA         1               0.74839           4.01841       0.19      0.8523
typeB         1              10.14482           6.97339       1.45      0.1462
typeAB        1             -19.90314          10.45833      -1.90      0.0575
Age_old       1             -11.61798           3.85823      -3.01      0.0027
RBC           1               0.00264           0.00191       1.38      0.1676
WBC           1               0.20512           1.88816       0.11      0.9135

t Value and Pr > |t| - These columns provide the t value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0.
ANOVA
• A one-way analysis of variance (ANOVA) is
used when you have a categorical
independent variable (with two or more
categories) and a normally distributed interval
dependent variable and you wish to test for
differences in the means of the dependent
variable broken down by the levels of the
independent variable.
ANOVA
• The following example studies the effect of bacteria on the nitrogen content of red clover plants. The treatment factor is bacteria strain, and it has six levels. Five of the six levels consist of five different Rhizobium trifolii bacteria cultures combined with a composite of five Rhizobium meliloti strains. The sixth level is a composite of the five Rhizobium trifolii strains with the composite of the Rhizobium meliloti. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams.

title1 'Nitrogen Content of Red Clover Plants';
data Clover;
  input Strain $ Nitrogen @@;
  datalines;
3DOK1 19.4 3DOK1 32.6 3DOK1 27.0 3DOK1 32.1 3DOK1 33.0
3DOK5 17.7 3DOK5 24.8 3DOK5 27.9 3DOK5 25.2 3DOK5 24.3
3DOK4 17.0 3DOK4 19.4 3DOK4 9.1 3DOK4 11.9 3DOK4 15.8
3DOK7 20.7 3DOK7 21.0 3DOK7 20.5 3DOK7 18.8 3DOK7 18.6
3DOK13 14.3 3DOK13 14.4 3DOK13 11.8 3DOK13 11.6 3DOK13 14.2
COMPOS 17.3 COMPOS 19.4 COMPOS 19.1 COMPOS 16.9 COMPOS 20.8
;
run;

proc anova data=Clover;
  class Strain;
  model Nitrogen = Strain;
run;

proc freq data=Clover;
  tables Strain;
run;
ANOVA
• The test for Strain suggests that there are
differences among the bacterial strains, but it
does not reveal any information about the
nature of the differences. Mean comparison
methods can be used to gather further
information.
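As a cross-check, the same one-way ANOVA can be run in Python with scipy.stats.f_oneway on the Clover data:

```python
from scipy import stats

# Nitrogen content by bacteria strain (the Clover data from the DATA step above)
nitrogen = {
    "3DOK1":  [19.4, 32.6, 27.0, 32.1, 33.0],
    "3DOK5":  [17.7, 24.8, 27.9, 25.2, 24.3],
    "3DOK4":  [17.0, 19.4, 9.1, 11.9, 15.8],
    "3DOK7":  [20.7, 21.0, 20.5, 18.8, 18.6],
    "3DOK13": [14.3, 14.4, 11.8, 11.6, 14.2],
    "COMPOS": [17.3, 19.4, 19.1, 16.9, 20.8],
}

# One-way ANOVA: does mean nitrogen content differ across strains?
f_value, p_value = stats.f_oneway(*nitrogen.values())
```

A significant p-value here agrees with the PROC ANOVA result; as noted above, it says nothing about which strains differ.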
HLTH 653
• This is a required class
• It is part of the qualifier exams