Chap4and8

• The print procedure
PROC PRINT DATA=data-set NOOBS LABEL;
BY variable-list;          Group the output by the BY variables. The data must be presorted.
ID variable-list;          Replace the observation numbers with the ID variables.
SUM variable-list;         Print sums for the variables in the list.
VAR variable-list;         Specify which variables to print and in what order.
LABEL variable='label';    Use the given label for the specified variable.
The NOOBS option suppresses the observation numbers. The LABEL option prints
variable labels instead of variable names, provided labels have been defined
with a LABEL statement, either in a DATA step or in the PROC step itself. Note
that when a LABEL statement is used in a DATA step, the labels become part of
the data set; when it is used in a PROC step, the labels stay in effect only for
the duration of that step.
data shoes;
/* style is read from columns 1-15; the remaining fields use list input */
input style $ 1-15 ExerciseType :$10. Sales price;
datalines;
Max Flight      running 1930 142.99
Zip fit leather walking 2250 83.99
zoom airborne   running 4150 112.99
Light step      walking 1130 73.99
Max step woven  walking 2230 75.99
zip sneak       c-train 1190 92.99
Air             basketball 1000 150
;
run;
proc sort data=shoes;
by ExerciseType;
run;
proc print data=shoes label;
by ExerciseType;
sum sales;
var style sales price;
label Sales="sales in 2009";
run;
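The difference between a label stored by a DATA step and a label supplied in a PROC step can be seen in a small sketch. The data set shoes_demo and its three rows are hypothetical and are used only for illustration.

```sas
/* Hypothetical three-row data set, used only for this illustration */
data shoes_demo;
   input style $ sales;
   label sales = "Sales (stored label)";   /* becomes part of the data set */
   datalines;
Air 1000
Zoom 4150
Step 1130
;
run;

/* Prints with the label stored by the DATA step */
proc print data=shoes_demo label noobs;
run;

/* Overrides the label for this step only */
proc print data=shoes_demo label noobs;
   label sales = "Sales in 2009";
run;

/* The stored label should be unchanged afterwards */
proc contents data=shoes_demo;
run;
```

The PROC CONTENTS listing at the end is one way to confirm which label is actually saved with the data set.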
• The sort procedure
PROC SORT DATA=messy OUT=neat NODUPKEY;
BY state DESCENDING city;
RUN;
The NODUPKEY option tells SAS to eliminate any observations that duplicate the
BY-variable values of an earlier observation. The DESCENDING keyword before
city requests that city be sorted in descending order. By default, SAS sorts in
ascending order.
proc sort data=shoes out=shoes_sorted NODUPKEY;
by ExerciseType;
run;
proc print; run;
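To see which observations NODUPKEY removes, the dropped rows can be written to a second data set with the DUPOUT= option. The sketch below uses a hypothetical data set named visits; the data set and variable names are assumptions made for illustration.

```sas
/* Hypothetical data: two rows share the key state=MD, city=Bowie */
data visits;
   input state $ city $ n;
   datalines;
MD Bowie 1
MD Bowie 2
PA Erie 3
MD Laurel 4
;
run;

/* OUT= receives one row per state/city key; DUPOUT= receives the removed rows */
proc sort data=visits out=visits_unique dupout=visits_dropped nodupkey;
   by state descending city;
run;

proc print data=visits_unique; run;    /* rows with unique state/city keys */
proc print data=visits_dropped; run;   /* the duplicate-key rows that were removed */
```

Because OUT= is specified, the original visits data set is left untouched.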
• The format procedure: mainly used to recode variable values through user-defined
formats.
PROC FORMAT library=libref.catalogname ;
VALUE numfmt value1='formatted-value-1' value2='formatted-value-2'
........ valuen='formatted-value-n' ;
VALUE $charfmt 'value1'='formatted-value-1' 'value2'='formatted-value-2'
........ 'valuen'='formatted-value-n' ;
RUN;
PROC FORMAT statement: Without the LIBRARY= option, formats are stored
in a catalog called FORMATS in the temporary WORK library and exist only for
the duration of the SAS session. If the LIBRARY= option specifies only a libref,
formats are permanently stored in that library in a catalog called FORMATS.
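A minimal sketch of storing a format permanently follows. The folder C:\myformats, the libref mylib and the format yesno are hypothetical; the FMTSEARCH= system option is included on the assumption that the format will be used from a library other than WORK or LIBRARY.

```sas
/* 'C:\myformats' is a hypothetical folder; substitute an existing path */
libname mylib 'C:\myformats';

/* The format is stored in the catalog MYLIB.FORMATS */
proc format library=mylib;
   value yesno 1='Yes' 0='No';
run;

/* Tell SAS to search MYLIB (then WORK) for user-defined formats */
options fmtsearch=(mylib work);

data answers;
   input q1 @@;
   format q1 yesno.;
   datalines;
1 0 1
;
run;

proc print data=answers; run;
```

In a later session only the LIBNAME and OPTIONS statements would need to be resubmitted before the format can be used again.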
data temp;
infile cards dlm=',';
input id $ sex $ emp_stat yr_edu jobcat $ ;
cards;
A, m, 2, 18, 42
B, F, 0, 16, 00
C, f, 2, 16, 32
D, M, 1, 12, 52
E, f, 1, 18, 01
;
run;
proc format;
value $job '01'='Teacher'
'31'-'33'='Computing Consultant'
'41'-'49','51'-'59'='Medical Professional'
other='N/A' ;
value empst 0='NotEmployed'
1='Part-time Employed'
2='Full-time Employed' ;
value edu_cat 1-11='Less than High School'
12='High School'
12<-high='More than High School' ;
value $gen 'M','m'='Males'
'F','f'='Females' ;
run;
data rec;
set temp;
FORMAT emp_stat empst. jobcat $job.;
run;
proc print data=rec;
run;
/* Using formats temporarily in a PROC step */
proc freq data=rec;
tables sex yr_edu ;
format sex $gen. yr_edu edu_cat. ;
run;
data new;
set rec;
/* Detaching formats from these variables */
FORMAT emp_stat jobcat ;
run;
proc print data=new;
title "Data with NO user-written formats" ;
run;
Specifying ranges of values:
1. Ranges can be single constants or lists of values separated by commas:
a. 'a', 'b', 'c'
b. 1, 22, 43
2. Ranges can include intervals such as:
<lower> - <higher> means that the interval includes both endpoints.
<lower> <- <higher> means that the interval includes the higher endpoint, but not the lower.
<lower> -< <higher> means that the interval includes the lower endpoint, but not the higher.
<lower> <-< <higher> means that the interval does not include either endpoint.
3. The numeric (.) and character (' ') missing values can each be assigned a formatted
value individually.
4. Ranges can be specified with special keywords:
a. LOW: From the least (most negative) possible number.
b. HIGH: To the largest (positive) possible number.
c. OTHER: All other values not otherwise specified.
5. The LOW keyword does not format numeric missing values.
6. The OTHER keyword does include missing values.
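The range rules above can be tried out in a short sketch. The format names agegrp and $yn, the data set people and its values are all hypothetical.

```sas
proc format;
   value agegrp                  /* hypothetical numeric format */
      .          = 'Missing'     /* numeric missing gets its own label */
      low -< 18  = 'Under 18'    /* 18 itself is excluded */
      18 - 64    = '18 to 64'    /* both endpoints are included */
      64 <- high = 'Over 64'     /* 64 itself is excluded */
   ;
   value $yn                     /* hypothetical character format */
      'Y','y' = 'Yes'
      'N','n' = 'No'
      ' '     = 'Missing'
      other   = 'Invalid'
   ;
run;

data people;
   input age ans $;
   datalines;
17.9 Y
18 n
64 x
64.5 .
. y
;
run;

proc print data=people;
   format age agegrp. ans $yn.;
run;
```

The boundary values 18 and 64 are included in the data so that the effect of the `-<` and `<-` operators is visible in the printed output.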
• The means procedure
proc means options;
statements;
Commonly used options:
n, nmiss, mean, median, std, stderr, clm, lclm, uclm, min, max, sum, var, q1, q3,
qrange, cv, skewness, kurtosis, t, prt (p-value for the t-test), maxdec= (number of
decimal places to print).
Commonly used statements:
class variable-list; Request a separate summary analysis for each group. The data need
not be sorted first.
by variable-list; Request a separate summary analysis for each group. The data must be
sorted by the BY variables first.
var variable-list; Specify the analysis variables.
output out=data-set-name stat-keyword=names; Save the requested statistics in a data set.
data htwt;
input subject $ gender $ height weight score $;
datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
proc means data=htwt maxdec=2 N mean std stderr clm;
run;
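The CLASS and OUTPUT statements described above can be combined as in the following sketch. The data set htwt_demo is a hypothetical four-row subset, and the output variable names (mean_ht, sd_wt and so on) are assumptions chosen for illustration.

```sas
/* Hypothetical subset used only for this illustration */
data htwt_demo;
   input gender $ height weight;
   datalines;
M 68.5 155
F 61.2 99
F 63.0 115
M 70.0 205
;
run;

/* NOPRINT suppresses the listing; the statistics go to the data set STATS */
proc means data=htwt_demo noprint;
   class gender;
   var height weight;
   /* names after each keyword are matched to the VAR variables in order */
   output out=stats mean=mean_ht mean_wt std=sd_ht sd_wt;
run;

proc print data=stats; run;
```

The printed data set should contain one overall row plus one row per gender, identified by the automatic _TYPE_ variable.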
• The univariate procedure
proc univariate options;
statements/statement options;
Some useful options:
normal: test for normality
plot: produce three text plots: stem-and-leaf, box plot and normal probability plot
(QQplot)
Statements:
Var variable-list;
By variable-list;
histogram variable-list/normal; This will generate a histogram with normal
density curve superimposed.
QQplot: quantile-quantile plot
Probplot: quantile-probability plot
Inset: add a box that displays selected stats
proc univariate data=htwt plot normal;
var weight;
histogram weight/normal;
Inset mean='Mean' (5.2)
std='standard deviation' (6.3)/Font='Arial'
POS=NW HEIGHT=3;
QQplot weight/normal(mu=160 sigma=44 color=red);
Probplot weight/normal(mu=160 sigma=44 color=red);
run;
See http://www.ats.ucla.edu/stat/sas/output/univ.htm for a detailed annotation of the
output.
• Two sample comparisons
T-test: testing the differences between two independent group means
Assumptions:
1. Two groups are independent and samples within each group are independent
2. The means of the two groups are normally distributed
3. The variances of the two groups are approximately equal
data grouptime;
Do group="C", "T";
Do sub=1 to 5;
input time @;
output;
end;
end;
drop sub;
datalines;
80 93 83 89 98 100 103 104 99 102
;
proc ttest data=grouptime;
class group;
var time;
run;
• Wilcoxon rank-sum test
Appropriate for non-normal distributions, small sample sizes, and ordinal data. The null
hypothesis is that the distributions of X in both groups are the same.
Group A: 3.1 2.2 1.7 2.7 2.5
Group B: 0.0 0.0 1.0 2.3
Ordered data: 0.0  0.0  1.0  1.7  2.2  2.3  2.5  2.7  3.1
Group:        B    B    B    A    A    B    A    A    A
Rank:         1.5  1.5  3    4    5    6    7    8    9
Rank-sum A: 4+5+7+8+9=33
Rank-sum B: 1.5+1.5+3+6=12
Test statistic: min(Rank-sum A, Rank-sum B)
data tumor;
infile datalines missover;
input group $ mass1-mass5;
datalines;
A 3.1 2.2 1.7 2.7 2.5
B 0.0 0.0 1.0 2.3
;
proc transpose data=tumor out=tumor1 prefix=mass;
by group;
var mass1-mass5;
run;
proc npar1way data=tumor1 wilcoxon;
class group;
var mass1;
exact wilcoxon;
run;
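The hand-computed rank sums for the tumor example can be checked with PROC RANK followed by PROC MEANS. This is a sketch only; the data set names tumor_long and ranked and the variable mass_rank are hypothetical.

```sas
/* Same nine values as the hand calculation, one observation per value */
data tumor_long;
   input group $ mass @@;
   datalines;
A 3.1 A 2.2 A 1.7 A 2.7 A 2.5
B 0.0 B 0.0 B 1.0 B 2.3
;
run;

/* TIES=MEAN gives tied values the average of their ranks (1.5 for the two zeros) */
proc rank data=tumor_long out=ranked ties=mean;
   var mass;
   ranks mass_rank;
run;

/* Sum of ranks within each group */
proc means data=ranked n sum;
   class group;
   var mass_rank;
run;
```

The sums reported for groups A and B should match the values 33 and 12 obtained by hand.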
• Paired t-test
The same subject is measured under two different treatment conditions.
Assumption: the mean of the within-pair differences is normally distributed.
data grouptime1;
set grouptime;
ctime=lag5(time);
if ctime ^=.;
rename time=ttime;
drop group;
proc ttest data=grouptime1;
paired ctime*ttime;
run;
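PROC UNIVARIATE can also be used for the paired t-test when it is applied to the
within-pair differences. It reports more than PROC TTEST, since its output also includes
nonparametric tests (the sign test and the Wilcoxon signed-rank test).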
data grouptime2;
set grouptime1;
change=ctime-ttime;
keep change;
proc univariate data=grouptime2;
run;
• One way analysis of variance (one way ANOVA): Comparison of one continuous
variable among multiple groups
Assumptions:
1. Groups are independent and samples within each group are independent
2. data are normally distributed
3. The variances of the groups are approximately equal
F-test:
1. Total sum of squares: TSS
2. Sum of squares due to treatment: SST
3. Sum of squares due to error: SSE
TSS = SST + SSE
$$F = \frac{SST/(k-1)}{SSE/(N-k)}$$
where N = total sample size and k = number of groups.
data reading;
input group $ words @@;
datalines;
X 700 X 850 X 820 X 640 X 920
Y 480 Y 460 Y 500 Y 570 Y 580
Z 500 Z 550 Z 480 Z 600 Z 610
;
proc anova data=reading;
class group;
model words=group;
means group;
run;
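As a worked check of the F formula, the calculation below uses the reading data, assuming that the fifth value in each group is 920 (X), 580 (Y) and 610 (Z), so that N = 15 and k = 3. Under that assumption the group means are 786, 518 and 548, and the grand mean is 9260/15 ≈ 617.33.

```latex
\begin{aligned}
SST &= 5\left[(786-617.33)^2 + (518-617.33)^2 + (548-617.33)^2\right] \approx 215613.3 \\
SSE &= 51920 + 11680 + 13480 = 77080 \\
F   &= \frac{SST/(k-1)}{SSE/(N-k)} = \frac{215613.3/2}{77080/12}
      \approx \frac{107806.7}{6423.3} \approx 16.78
\end{aligned}
```

The three terms in SSE are the within-group sums of squared deviations for X, Y and Z. An F of about 16.78 on 2 and 12 degrees of freedom is well above the 5% critical value of roughly 3.89, so the group means would be judged different; the PROC ANOVA output should show the same F value.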
• Categorical data
One binomial or multinomial variable
proc freq data=data;
tables variable-list/statement options;
Some TABLES statement options that control the display:
missing: includes missing values in frequency statistics
nocum: no cumulative frequency
nopercent: no percentage
list: prints cross-tabulations in list format rather than as a grid
nocol: suppresses printing of column percentages in cross-tabulations
norow: suppresses printing of row percentages in cross-tabulations
TABLES statement options that request statistics:
AGREE: requests tests and measures of classification agreement including
McNemar's test, kappa statistics, etc.
BIN: requests binomial proportion, confidence limits and test for one-way tables
CHISQ: requests chi-square tests of homogeneity and measures of association
CL: requests confidence limits for measures of association
EXACT: requests Fisher's exact test
MEASURES: requests measures of association including Pearson and Spearman
correlation coefficients, etc.
RELRISK: requests relative risk measures for 2x2 tables
RISKDIFF: requests risk difference and confidence limits for 2x2 tables
data htwt;
input subject gender $ height weight score $;
datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
proc freq data=htwt;
tables score/bin(p0=.6 level="L");
*tables score;
*exact bin;
run;
data htwt1;
set htwt;
score1=(score="H");
proc print;run;
proc freq data=htwt1;
tables gender*score1/riskdiff;
run;
2-way contingency table (cross tabulations)
Useful for
1. Comparing two proportions with independent samples
2. Testing independence between two categorical variables for one sample
Commonly used tests:
1. Chi-square test (chisq): appropriate when the expected count in each cell is greater than 5
2. Fisher's exact test (fisher): for small sample sizes
data fisher;
input gender $ vote $ count;
datalines;
M Y 5
M N 0
F Y 1
F N 4
;
proc freq data=fisher;
tables gender*vote/fisher;
weight count;
run;
proc freq data=htwt;
tables gender*score/chisq fisher;
exact chisq;
run;
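Whether the chi-square test is appropriate can be checked by printing the expected cell counts with the EXPECTED option of the TABLES statement. The sketch below reuses the counts from the Fisher example in a hypothetical data set named votes.

```sas
/* Same counts as the Fisher example; the data set name is hypothetical */
data votes;
   input gender $ vote $ count;
   datalines;
M Y 5
M N 0
F Y 1
F N 4
;
run;

/* EXPECTED prints the expected count under independence in each cell */
proc freq data=votes;
   tables gender*vote / expected chisq norow nocol nopercent;
   weight count;
run;
```

With row totals of 5 and 5 and column totals of 6 and 4, the expected counts are 3 and 2 in each row, all below 5, which is why Fisher's exact test is the better choice for these data.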
• proc gplot
proc gplot data=data_name;
plot y1*x=symbol y2*x=symbol/overlay haxis=axis1 vaxis=axis2;
run;
The example below is from data collected on a series of plots in Maryland to examine the
relationship between gypsy moth egg mass densities and subsequent defoliation. The
plots are 60 ha in size, and a random sample of 0.01 ha subplots was taken in each plot.
Egg masses were counted and defoliation (as a percent) was measured.
option ls=70;
title 'create defol means';
data dd;
infile 'C:\Documents and Settings\anna\Desktop\597\87md.dat';
input plot $ subplot egg def;
run;
proc print data=dd;run;
proc means data =dd nway;
class plot;
var egg def;
output out=result2 mean=meanegg meandef stderr=seegg sedef;
run;
proc print data=result2;run;
The NWAY option in the PROC MEANS statement limits the output data set to the
observations with the highest _TYPE_ value, that is, the statistics for each combination
of all CLASS variables. The overall summary row is omitted.
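The effect of NWAY is easiest to see with two CLASS variables. In the sketch below, the data set plots_demo, its variables and its values are hypothetical.

```sas
/* Hypothetical data with two classification variables */
data plots_demo;
   input site $ year egg;
   datalines;
a 1986 10
a 1987 14
b 1986 30
b 1987 22
;
run;

/* Without NWAY: _TYPE_ = 0 (overall), 1 (by year), 2 (by site), 3 (site by year) */
proc means data=plots_demo noprint;
   class site year;
   var egg;
   output out=all_types mean=meanegg;
run;

/* With NWAY: only the _TYPE_ = 3 rows are kept */
proc means data=plots_demo noprint nway;
   class site year;
   var egg;
   output out=nway_only mean=meanegg;
run;

proc print data=all_types; run;
proc print data=nway_only; run;
```

Comparing the two printed data sets shows which summary rows NWAY drops.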
data c;
set result2;
up=meanegg+(1.96*seegg);
low = meanegg-(1.96*seegg);
run;
proc print data=c;run;
title1 'Mean egg mass and two standard errors';
title2 ' Maryland 1987';
title2;
axis2 label=("Egg mass");
symbol1 value=u color=red;
symbol2 value=l color=red;
symbol3 value=m color=black;
proc gplot data=c;
plot up*plot=1 low*plot=2 meanegg*plot=3/overlay
vaxis=axis2;
run;
Title statement:
1. Global statement
2. TITLE1 is twice the height of all other titles and uses the SWISS font.
3. All other TITLE statements are one unit high and use the default hardware font.
4. The following quoted passage is from the SAS online documentation:
“Using TITLE and FOOTNOTE Statements
You can define TITLE and FOOTNOTE statements anywhere in your SAS program.
They are global and remain in effect until you cancel them or until you end your SAS
session. All currently defined FOOTNOTE and TITLE statements are automatically
displayed.
You can define up to ten TITLE statements and ten FOOTNOTE statements in your SAS
session. A TITLE or FOOTNOTE statement without a number is treated as a TITLE1 or
FOOTNOTE1 statement. You do not have to start with TITLE1 and you do not have to
use sequential statement numbers. Skipping a number in the sequence leaves a blank line.
You can use as many text strings and options as you want, but place the options before
the text strings they modify.
The most recently specified TITLE or FOOTNOTE statement of any number completely
replaces any other TITLE or FOOTNOTE statement of that number. In addition, it
cancels all TITLE or FOOTNOTE statements of a higher number. For example, if you
define TITLE1, TITLE2, and TITLE3, resubmitting the TITLE2 statement cancels
TITLE3. To cancel individual TITLE or FOOTNOTE statements, define a TITLE or
FOOTNOTE statement of the same number without options (a null statement):
title4;
But remember that this will cancel all other existing statements of a higher number.
To cancel all current TITLE or FOOTNOTE statements, use the RESET= graphics option
in a GOPTIONS statement:
goptions reset=footnote;
Specifying RESET=GLOBAL or RESET=ALL also cancels all current TITLE and
FOOTNOTE statements as well as other settings.”
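The replacement and cancellation rules quoted above can be tried with a short sketch. The one-row data set one and the title texts are hypothetical.

```sas
/* Hypothetical one-row data set, just to have something to print */
data one;
   x = 1;
run;

title1 'First title';
title2 'Second title';
title3 'Third title';
proc print data=one; run;            /* all three titles are shown */

title2 'Replacement second title';   /* replaces TITLE2 and cancels TITLE3 */
proc print data=one; run;

title;                               /* same as TITLE1; cancels all titles */
proc print data=one; run;
```

Because TITLE statements are global, the final null TITLE statement also keeps these titles from appearing on output from later steps.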
Symbol statement:
1. Global statement
2. Syntax:
symbol<1...255> keyword=value ...;
keywords include:
color
line
value
interpol
3. A new SYMBOL definition of any number replaces the values of the keywords it
specifies in the old SYMBOL definition of the same number.
Axis statement:
1. Global statement
2. Syntax:
axis<1...99> label=("value");
proc reg data=htwt;
model weight=height;
plot weight * height P.*height/overlay;
run;
symbol1 value=plus color=black;
symbol2 I=RLCLM95 line=1 color=red;
symbol3 I=RLCLI95 line=4 color=blue;
proc gplot data=htwt;
plot weight*height weight*height=2
weight*height=3/overlay;
run;