EPIB 698D Lecture 2
Raul Cruz
Spring 2013
SAS functions
• SAS has over 400 functions, with the following general form:
Function-name (argument, argument, …)
• All functions must have parentheses even if they don’t require
any arguments
• Example:
X = Int(log(10));
Mean_score = mean(score1, score2, score3);
The MEAN function returns the mean of the nonmissing arguments. This differs
from adding the arguments and dividing by their number, which returns a
missing value if any argument is missing
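The difference between the MEAN function and plain arithmetic can be checked with a short DATA step. This is a minimal sketch; the variable names and values are made up for illustration.

```sas
data _null_;
  score1 = 80;
  score2 = .;      /* missing value */
  score3 = 90;
  mean_fn    = mean(score1, score2, score3);    /* (80 + 90) / 2 = 85 */
  mean_arith = (score1 + score2 + score3) / 3;  /* missing, because score2 is missing */
  put mean_fn= mean_arith=;                     /* writes both results to the log */
run;
```

The log should show mean_fn=85 and mean_arith=. together with a note that missing values were generated.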
Common Functions And Operators
• Functions
ABS: absolute value
EXP: exponential
LOG: natural logarithm
MAX and MIN: maximum and minimum
SQRT: square root
SUM: sum of variables
Example: SUM (of x1-x10, x21)
• Arithmetic: +, -, *, /, ** (not ^)
More SAS functions
Function Name   Example                          Result
Max             Y=Max(1, 3, 5);                  Y=5
Round           Y=Round(1.236, .01);             Y=1.24
Sum             Y=Sum(1, 3, 5);                  Y=9
Length          a='my cat'; Y=Length(a);         Y=6
Trim            a='my '; b='cat'; Y=trim(a)||b;  Y='mycat'
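In ROUND, the second argument is the rounding unit, not a number of decimal places. The sketch below (illustrative values only) contrasts three calls.

```sas
data _null_;
  y1 = round(1.236, .01);  /* nearest multiple of 0.01 -> 1.24 */
  y2 = round(1.236, 2);    /* nearest multiple of 2    -> 2    */
  y3 = round(1.236);       /* default unit is 1        -> 1    */
  put y1= y2= y3=;         /* writes the three results to the log */
run;
```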
Using IF-THEN statement
• The IF-THEN statement is used for conditional
processing. Example: you want to derive mean
test scores for female students but not for male
students. Here we derive the means conditional on
gender = 'F'
• Syntax:
If condition then action;
E.g.:
If gender = 'F' then mean_score = mean(scr1, scr2);
Using IF-THEN statement
List of logical comparison operators
Logical comparison          Mnemonic   Symbol
Equal to                    EQ         =
Not equal to                NE         ^= or ~=
Less than                   LT         <
Less than or equal to       LE         <=
Greater than                GT         >
Greater than or equal to    GE         >=
Equal to one in a list      IN         (none)
Note: A missing numeric value compares as smaller than any nonmissing number,
so it satisfies conditions such as Age < 20
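The IN operator and the mnemonic forms can be tried in a small DATA step. This is only a sketch; the variable names and values are invented for illustration.

```sas
data _null_;
  length quiz $2;
  /* loop over three illustrative quiz grades */
  do quiz = 'A', 'B+', 'C';
    if quiz in ('A', 'A+') then put quiz= 'matches the list';
    else put quiz= 'does not match the list';
  end;
  age = 21;
  /* GE is the mnemonic form of >= */
  if age ge 20 then put 'age GE 20 is the same test as age >= 20';
run;
```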
Using IF-THEN statement
• Example: We have data containing the following information
on subjects: Age Gender Midterm Quiz FinalExam
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
• Task: Group the students by age (<20, [20-40),
[40-60), >=60)
data conditional;
input Age Gender $ Midterm Quiz $2. FinalExam;
datalines;
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
;
data new1;
set conditional;
if Age < 20 then AgeGroup = 1;
if 20 <= Age < 40 then AgeGroup = 2;
if 40 <= Age < 60 then AgeGroup = 3;
if Age >= 60 then AgeGroup = 4;
Run;
Multiple conditions with AND and OR
• IF condition1 and condition2 then action;
• E.g.:
If age < 40 and gender = 'F' then group = 1;
If age < 40 or gender = 'F' then group = 2;
IF-THEN statement, multiple conditions
• Example: We have data containing the following information
on subjects: Age Gender Midterm Quiz FinalExam
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
• Task: Group the students by age (<40, >=40) and
gender
data new1;
set conditional;
If age <40 and gender='F' then group=1;
If age >=40 and gender='F' then group=2;
IF age <40 and gender ='M' then group=3;
IF age >=40 and gender ='M' then group=4;
run;
• Note: A missing numeric value compares as smaller than any
nonmissing number, so a condition such as Age < 20 is true when Age is missing
• Example: grouping age into age groups when some ages are missing
21 M 80 B- 82
20 F 90 A 93
. M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
. M 88 C 93
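One way to keep missing ages out of the youngest group is to test for missing values first. The sketch below is an illustration under assumptions: the data set names with_missing and grouped are hypothetical, and only the Age and Gender columns are read.

```sas
data with_missing;
  input Age Gender $;
  datalines;
21 M
. M
48 F
;
run;

data grouped;
  set with_missing;
  if missing(Age) then AgeGroup = .;   /* handle missing ages before any comparison */
  else if Age < 20 then AgeGroup = 1;
  else if Age < 40 then AgeGroup = 2;
  else if Age < 60 then AgeGroup = 3;
  else AgeGroup = 4;
run;

proc print data=grouped;
run;
```

Without the first test, the observation with a missing Age would satisfy Age < 20 and be assigned AgeGroup = 1.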
IF-THEN statement, with multiple actions
• Example: We have data containing the following information
on subjects: Age Gender Midterm Quiz FinalExam
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
• Task: Group the students by age, and assign a test
date based on the age group
Multiple actions with Do, end
• Syntax:
IF condition then do;
   action1;
   action2;
End;
• Example:
If age <= 20 then do;
   group = 1;
   exam_date = "Monday";
End;
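A DO group can be paired with an ELSE DO group so that every observation receives both values. This is a minimal sketch: the data set name tested, the second group, and the "Wednesday" date are assumptions added for illustration.

```sas
data tested;
  input Age Gender $;
  length exam_date $9;          /* long enough for "Wednesday" */
  if Age <= 20 then do;
    group = 1;
    exam_date = "Monday";
  end;
  else do;                      /* hypothetical second group */
    group = 2;
    exam_date = "Wednesday";
  end;
  datalines;
21 M
20 F
15 M
;
run;

proc print data=tested;
run;
```

The LENGTH statement matters here: without it, exam_date would take its length from the first assignment ("Monday", 6 characters) and longer values would be truncated.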
IF-THEN/ELSE statement
• Syntax
IF condition1 then action1;
Else if condition2 then action2;
Else if condition3 then action3;
• The IF-THEN/ELSE statement has two advantages over
a series of IF-THEN statements:
(1) It is more efficient and uses less computing time, because SAS
skips the remaining ELSE conditions once one condition is true
(2) The ELSE logic ensures that your groups are mutually
exclusive, so that no observation is put
into more than one group
IF-THEN/ELSE statement
data new1;
set conditional;
if Age < 20 then AgeGroup = 1;
else if Age >= 20 and Age < 40 then AgeGroup = 2;
else if Age >= 40 and Age < 60 then AgeGroup = 3;
else if Age >= 60 then AgeGroup = 4;
run;
Subsetting your data
• You can subset your data using an IF
statement in a DATA step
• Example: the following two DATA steps both keep only the
observations with gender = 'F'
Data new1;
Set new;
If gender = 'F';
Run;
Data new1;
Set new;
If gender ^= 'F' then delete;
Run;
Stacking data sets using the SET statement
• With more than one data set, the SET statement stacks the data
sets one on top of the other
• Syntax:
DATA new-data-set;
SET data-set-1 data-set-2 … data-set-n;
• The number of observations in the new data set equals
the sum of the numbers of observations in the old data sets
• The order of observations is determined by the order of the
list of old data sets
• If one of the data sets has a variable not contained in the
other data sets, then observations from the other data sets
will have missing values for that variable
Stacking data sets using the SET statement
• Example: Here is a data set containing information on visitors to a park.
There are two entrances: a south entrance and a north entrance. The
data file for the south entrance has an S for south, followed by the
customers' pass numbers, the sizes of their parties, and their ages. The data
file for the north entrance has an N for north, the same data as the
south entrance, plus one more variable for the parking lot.
/* South.dat */
S 43 3 27
S 44 3 24
S 45 3 2
/* North.dat */
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
DATA southentrance;
INPUT Entrance $ PassNumber PartySize Age;
cards;
S 43 3 27
S 44 3 24
S 45 3 2
;
run;
DATA northentrance;
INPUT Entrance $ PassNumber PartySize Age Lot;
Cards;
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
;
run;
DATA both;
SET southentrance northentrance;
RUN;
Combining data sets with one-to-many match
• One-to-many match: matching one observation from one
data set with more than one observation in another data
set
• The statements for a one-to-many match are the same as for a one-to-one match:
DATA new-data-set;
MERGE data-set-1 data-set-2;
BY variable-list;
• The data sets must first be sorted by the BY variables
• If the two data sets have variables with the same names,
besides the BY variables, the values from the second
data set will overwrite the values of the same-named variables
from the first data set
Example: Shoes data
• The shoe store is putting all its shoes on sale. They have
two data files, one containing information about each type of
shoe, and one with discount information. We want to find
the new price of the shoes
Shoe data:
Max Flight        running   142.99
Zip Fit Leather   walking    83.99
Zoom Airborne     running   112.99
Light Step        walking    73.99
Max Step Woven    walking    75.99
Zip Sneak         c-train    92.99
Discount data:
c-train   .25
running   .30
walking   .20
DATA regular;
INFILE datalines dsd;
length style $15;
INPUT Style $ ExerciseType $ RegularPrice @@;
datalines;
Max Flight , running, 142.99, …
;
PROC SORT DATA = regular;
BY ExerciseType;
DATA discount;
INPUT ExerciseType $ Adjustment @@;
cards;
c-train .25 …
;
DATA prices;
MERGE regular discount;
BY ExerciseType;
NewPrice = ROUND(RegularPrice - (RegularPrice * Adjustment), .01);
RUN;
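The slide program elides most of the data lines. The sketch below is a self-contained variant that types in the shoe and discount values listed in the example, one record per line (so it drops the @@ line-hold). The record layout is an assumption made for illustration and may differ from the original files.

```sas
data regular;
  infile datalines dsd;             /* comma-delimited values */
  length Style $15;
  input Style $ ExerciseType $ RegularPrice;
  datalines;
Max Flight,running,142.99
Zip Fit Leather,walking,83.99
Zoom Airborne,running,112.99
Light Step,walking,73.99
Max Step Woven,walking,75.99
Zip Sneak,c-train,92.99
;
run;

proc sort data=regular;             /* MERGE with BY needs sorted input */
  by ExerciseType;
run;

data discount;                      /* already in ExerciseType order */
  input ExerciseType $ Adjustment;
  datalines;
c-train .25
running .30
walking .20
;
run;

data prices;
  merge regular discount;           /* one discount row matches many shoes */
  by ExerciseType;
  NewPrice = round(RegularPrice - (RegularPrice * Adjustment), .01);
run;

proc print data=prices;
run;
```

Each discount observation is matched to every shoe with the same ExerciseType, which is the one-to-many behavior described earlier.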
Simplifying programs with Arrays
• SAS arrays are collections of elements (usually SAS
variables) that let you write SAS statements
referencing the group of variables.
• Arrays are defined using the ARRAY statement:
ARRAY name (n) variable-list;
name: a name you give to the array
n: the number of variables in the array
E.g.: ARRAY store (4) macys sears target costco;
store(1) is the variable macys
store(2) is the variable sears
Simplifying programs with Arrays
• A radio station is conducting a survey asking people to rate 10
songs. The rating is on a scale of 1 to 5, with 1 = do not like the
song and 5 = like the song
• If the listener does not want to rate a song, he or she enters a "9" to
indicate a missing value
• Here are the data, with location, listener's age, and ratings for the 10
songs
Albany     54 4 3 5 9 9 2 1 4 4 9
Richmond   33 5 2 4 3 9 2 9 3 3 3
Oakland    27 1 3 2 9 9 9 3 4 2 3
Richmond   41 4 3 5 5 5 2 9 4 5 5
Berkeley   18 3 4 9 1 4 9 3 9 3 2
• We want to change each 9 to a missing value (.)
Simplifying programs with Arrays
DATA songs;
INFILE 'E:\radio.txt';
INPUT City $ 1-15 Age domk wj hwow simbh
kt aomm libm tr filp ttr;
ARRAY song (10) domk wj hwow simbh kt
aomm libm tr filp ttr;
DO i = 1 TO 10;
IF song(i) = 9 THEN song(i) = .;
END;
run;
Using shortcuts for lists of variable names
• When writing SAS programs, we often need to write a
list of variable names. When a data set has many
variables, a shortcut for lists of variable names is helpful
• Numbered range list: variables whose names start with the same
characters and end with consecutive numbers can be part
of a numbered range list
• E.g., the following two statements are equivalent:
INPUT cat8 cat9 cat10 cat11;
INPUT cat8 - cat11;
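A numbered range list can be used anywhere a variable list is accepted, including inside a function call with OF. The sketch below uses a made-up data set name (cats) and a single made-up record.

```sas
data cats;
  input cat8-cat11;               /* same as: input cat8 cat9 cat10 cat11; */
  total = sum(of cat8-cat11);     /* 1 + 2 + 3 + 4 = 10 */
  datalines;
1 2 3 4
;
run;

proc print data=cats;
run;
```

Note the keyword OF: without it, sum(cat8-cat11) would compute the difference cat8 minus cat11.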
Using shortcuts for lists of variable names
• Name range list: a name range list depends on the internal
order, or position, of the variables in a SAS data set. This is
determined by the order in which the variables appear in the
DATA step.
• E.g.:
Data new;
Input x1 x2 y2 y3;
Run;
• Then the internal order of the variables is: x1 x2 y2 y3
• The shortcut for this variable list is x1 -- y3 (two hyphens)
• PROC CONTENTS with the POSITION option can be
used to find out the internal order
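The internal order and a name range list can be checked together, as in the sketch below. The single data record is made up, and VARNUM is used in PROC CONTENTS on the assumption that POSITION is an alias for it.

```sas
data new;
  input x1 x2 y2 y3;
  datalines;
1 2 3 4
;
run;

proc contents data=new varnum;    /* lists variables in their internal (creation) order */
run;

proc print data=new;
  var x1--y3;                     /* name range list: x1, x2, y2, y3 */
run;
```

A single hyphen (x1-y3) would not work here, because a numbered range list requires the same prefix on both ends.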
Using shortcuts for lists of variable names
DATA songs;
INFILE 'E:\radio.txt';
INPUT City $ 1-15 Age domk wj hwow simbh
kt aomm libm tr filp ttr;
ARRAY new (10) Song1 - Song10;
ARRAY old (10) domk -- ttr;
DO i = 1 TO 10;
IF old(i) = 9 THEN new(i) = .;
ELSE new(i) = old(i);
END;
AvgScore = MEAN(OF Song1 - Song10);
run;
Sorting, Printing and Summarizing Your Data
• SAS procedures (PROCs) perform a specific analysis or
function and produce results or reports
• E.g.: Proc Print data=new; run;
• All procedures have required statements, and most have
optional statements
• All procedures start with the keyword PROC, followed
by the name of the procedure, such as PRINT or CONTENTS
• Options, if there are any, follow the procedure name
• The DATA=data_name option tells SAS which data set to use as
input for the procedure. NOTE: if you omit it, SAS will
use the most recently created data set, which is not
necessarily the same as the most recently used data set.
BY statement
• The BY statement is required for only one procedure, PROC
SORT
PROC Sort data = new;
By gender;
Run;
• For all other procedures, BY is an optional statement
that tells SAS to perform the analysis separately for each level of the
BY variable, instead of treating all
subjects as one group
Proc Print data = new;
By gender;
Run;
• All procedures except PROC SORT assume your data are
already sorted by the variables in your BY statement
PROC Sort
• Syntax
Proc Sort data =input_data_name out =out_data_name ;
By variable-1 … variable-n;
• The variables in the BY statement are called BY variables.
• With one BY variable, SAS sorts the data based on the
values of that variable
• With more than one BY variable, SAS sorts observations by
the first variable, then by the second variable within the
categories of the first variable, and so on
• The DATA= and OUT= options specify the input and output
data sets. Without the DATA= option, SAS will use the most
recently created data set. Without the OUT= option,
SAS will replace the original data set with the newly sorted
version
PROC Sort
• By default, SAS sorts data in ascending order, from the
lowest to the highest value or from A to Z. To reverse the
order, add the keyword DESCENDING
before each variable that should be sorted from highest to
lowest or from Z to A
• The NODUPKEY option tells SAS to eliminate any
observations that duplicate the BY-variable values of an
earlier observation
PROC Sort
• Example: The file Sealife.txt contains the average length in
feet of selected whales and sharks. We want to sort the data by
family and length
Name Family Length
beluga whale 15
whale shark 40
basking shark 30
gray whale 50
mako shark 12
sperm whale 60
dwarf shark .5
whale shark 40
humpback . 50
blue whale 100
killer whale 30
PROC Sort
DATA marine;
INFILE 'E:\Sealife.txt';
INPUT Name $ Family $ Length;
run;
* Sort the data;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
BY Family DESCENDING Length;
run;
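For readers without the external file, the same sort can be tried with the example records typed in as data lines. This is a sketch under that assumption; the records are the ones listed in the example.

```sas
data marine;
  input Name $ Family $ Length;
  datalines;
beluga whale 15
whale shark 40
basking shark 30
gray whale 50
mako shark 12
sperm whale 60
dwarf shark .5
whale shark 40
humpback . 50
blue whale 100
killer whale 30
;
run;

proc sort data=marine out=seasort nodupkey;
  by Family descending Length;    /* families A to Z, longest first within family */
run;

proc print data=seasort;
run;
```

The sorted data set should have 10 observations rather than 11, because the second "whale shark 40" record repeats the BY values of the first. The humpback record, whose Family is missing, sorts first because missing values are the lowest values.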
Summarizing your data with PROC MEANS
• The MEANS procedure provides simple statistics on
numeric variables. Syntax: PROC MEANS options;
• Statistics that PROC MEANS can produce:
Produced by default:
N: number of nonmissing values
MEAN: the mean
STDDEV: the standard deviation
MIN: the minimum value
MAX: the maximum value
Available on request:
NMISS: number of missing values
RANGE: the range of the data
SUM: the sum
MEDIAN: the median
Proc means
• Statements used with PROC MEANS:
• BY variable-list: performs the analysis for each level of the
variables in the list. The data must be
sorted first
• CLASS variable-list: performs the analysis for each level of the
variables in the list. The data do not need to
be sorted
• VAR variable-list: specifies which variables to use in the
analysis
Proc means
• A wholesale nursery is selling garden flowers, and they want to
summarize their sales figures by month. The data are as
follows:
ID      Date       Lily SnapDragon Marigold
756-01  05/04/2001 120   80        110
756-01  05/14/2001 130   90        120
834-01  05/12/2001  90  160         60
834-01  05/14/2001  80   60         70
901-02  05/18/2001  50  100         75
834-01  06/01/2001  80   60        100
756-01  06/11/2001 100  160         75
901-02  06/19/2001  60   60         60
756-01  06/25/2001  85  110        100
DATA sales;
INFILE 'E:\Flowers.txt';
INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily
SnapDragon Marigold;
Month = MONTH(SaleDate);
PROC SORT DATA = sales;
BY Month;
* Calculate means by Month for flower sales;
PROC MEANS DATA = sales;
BY Month;
VAR Lily SnapDragon Marigold;
TITLE 'Summary of Flower Sales by Month';
RUN;
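The same monthly summary can be produced with a CLASS statement, which does not require sorting first. The sketch below types in the flower sales records as data lines instead of reading the external file, and uses list input with the colon modifier for the date; both choices are assumptions made to keep the example self-contained.

```sas
data sales;
  input CustomerID $ SaleDate :mmddyy10. Lily SnapDragon Marigold;
  Month = month(SaleDate);          /* 5 = May, 6 = June */
  datalines;
756-01 05/04/2001 120 80 110
756-01 05/14/2001 130 90 120
834-01 05/12/2001 90 160 60
834-01 05/14/2001 80 60 70
901-02 05/18/2001 50 100 75
834-01 06/01/2001 80 60 100
756-01 06/11/2001 100 160 75
901-02 06/19/2001 60 60 60
756-01 06/25/2001 85 110 100
;
run;

title 'Summary of Flower Sales by Month';
proc means data=sales;
  class Month;                      /* no PROC SORT needed with CLASS */
  var Lily SnapDragon Marigold;
run;
```

With CLASS the results appear in a single table with one block of rows per month, whereas BY produces a separate table for each month.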
Proc GCHART for bar charts
• Example: A bar chart showing the distribution of blood
types from the Blood data set
/* The blood.txt data contain information on 1000 subjects.
The variables include: subject ID, gender, blood type, age
group, red blood cell count, white blood cell count, and
cholesterol. */
DATA blood;
INFILE 'C:\blood.txt';
INPUT ID Sex $ BloodType $ AgeGroup $ RBC WBC Cholesterol;
run;
title "Distribution of Blood Types";
proc gchart data=blood;
vbar BloodType;
run;
Proc GCHART for bar charts
• VBAR: requests a vertical bar chart for the variable
• Alternatives to VBAR are as follows:
HBAR: horizontal bar chart
VBAR3D: three-dimensional vertical bar chart
HBAR3D: three-dimensional horizontal bar chart
PIE: pie chart
PIE3D: three-dimensional pie chart
DONUT: donut chart
A Few Options
proc gchart data=blood;
vbar bloodtype / space=0 type=percent;
run;
• space=0: controls the spacing between bars
• type=percent: changes the statistic from frequency to percent
Type option
• Type =freq : displays frequencies of a categorical variable
• Type =pct (Percent): displays percent of a categorical
variable
• Type =cfreq : displays cumulative frequencies of a
categorical variable
• Type =cpct (cPercent): displays cumulative percent of a
categorical variable
Basic Output
[Figure: bar chart output. The bar at the midpoint value of 7,000 corresponds to a
class ranging from 6500 to 7500, with a frequency of about 350.]
SAS computes the midpoint of each bar automatically. You can change this by supplying your own
midpoints: vbar RBC / midpoints=4000 to 11000 by 1000;
Creating charts with values representing categories
• SAS places continuous variables into groups before
generating a frequency bar chart
• If you want to treat the values as discrete categories, you
can use the DISCRETE option
• Example: create a bar chart showing the frequencies by day
of the week for visits to a hospital
libname d "C:\";
data day_of_week;
set d.hosp;
Day = weekday(AdmitDate);
run;
*Program Demonstrating the DISCRETE option of PROC GCHART;
title "Visits by Day of the Week";
proc gchart data =day_of_week;
vbar Day / discrete;
run;
The Discrete Option
proc gchart data=day_of_week;
vbar Day / discrete;
run;
• DISCRETE establishes each distinct value of the midpoint variable as
a midpoint on the graph. If the variable is formatted, the formatted
values are used for the construction.
• If you use DISCRETE with a numeric variable you should:
1. Be sure it has only a few distinct values,
or
2. Use a format to make categories for it.
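A user-defined format can label the weekday numbers so the bars read as day names. The sketch below is an illustration under assumptions: the format name dayfmt and the data set visits (a generated stand-in for the hospital data) are hypothetical, and PROC GCHART requires SAS/GRAPH.

```sas
proc format;
  value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed'
               5='Thu' 6='Fri' 7='Sat';
run;

data visits;                              /* hypothetical stand-in data */
  do AdmitDate = '01JAN2013'd to '20JAN2013'd;
    Day = weekday(AdmitDate);             /* 1 = Sunday, ..., 7 = Saturday */
    output;
  end;
  format AdmitDate date9.;
run;

title "Visits by Day of the Week";
proc gchart data=visits;
  vbar Day / discrete;                    /* one bar per distinct value */
  format Day dayfmt.;                     /* bars are labeled Sun, Mon, ... */
run;
quit;
```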
Summary Variables
• To make the bar chart summarize the values of an
analysis variable for each midpoint, use the
SUMVAR= (and TYPE=) options
• sumvar=variable-name
• Type=mean: displays the mean of a continuous
variable
• Type=sum: displays the total of a continuous variable
(this is the default)
Creating bar charts representing sums
• The GCHART procedure can be used to create bar charts where the
height of bars represents some statistic, means (or sums) for example,
for each value of a classification variable
• Example: Bar chart showing the sum of TotalSales for each region
of the country
title "Total Sales by Region";
proc gchart data=d.sales;
vbar Region / sumvar=TotalSales type=sum;
format TotalSales dollar8.;
run;
Creating bar charts representing means
proc gchart data=blood;
* The gender variable is named Sex in the blood data set read earlier;
vbar Sex / sumvar=Cholesterol type=mean;
run;
quit;
GPLOT
• The GPLOT procedure plots the values of
two or more variables on a set of
coordinate axes (X and Y).
• The procedure produces a variety of two-dimensional graphs including
– simple scatter plots
– overlay plots in which multiple sets of data
points display on one set of axes
Procedure Syntax: PROC GPLOT
• PROC GPLOT;
PLOT y*x </option(s)>;
run;
• Example: plot of systolic blood pressure (SBP) by diastolic
blood pressure (DBP)
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
plot SBP * DBP;
run;
• Multiple plots can be made in 3 ways:
(1) proc gplot; plot y1*x y2*x / overlay; run;
plots y1 versus x and y2 versus x using the same horizontal and
vertical axes.
(2) proc gplot; plot y1*x; plot2 y2*x; run;
plots y1 versus x and y2 versus x using different vertical
axes. The second vertical axis appears on the right-hand
side of the graph.
(3) proc gplot; plot y1*x=z; run;
uses z as a classification variable and produces a single
graph plotting y1 against x, with a separate set of points for each value of z.
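The overlay form, option (1), can be tried on a small made-up data set. Everything in this sketch (the data set trend, its values, and the SYMBOL settings) is an assumption for illustration, and it requires SAS/GRAPH.

```sas
data trend;                               /* hypothetical data */
  input x y1 y2;
  datalines;
1 10 12
2 14 13
3 19 15
4 25 18
;
run;

/* one SYMBOL statement per plot request; colors set explicitly */
symbol1 value=dot    interpol=join color=blue;
symbol2 value=circle interpol=join color=red;

title "Overlay of y1 and y2 against x";
proc gplot data=trend;
  plot y1*x y2*x / overlay;               /* both curves share one set of axes */
run;
quit;
```

Replacing the PLOT statement with plot y1*x; plot2 y2*x; would give the second form, with the y2 scale drawn on the right-hand axis.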
*controlling the axis ranges;
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
plot SBP * DBP / haxis=70 to 120 by 5
vaxis=100 to 220 by 10;
run;