EPIB 698A SAS lecture 7 Modifying and combining SAS data sets

EPIB 698C SAS lecture 5
Modifying and combining SAS data sets
Raul Cruz-Cano
Summer 2012
Modifying a data set with the SET statement
The SET statement
• The SET statement in the DATA step lets you read an
existing SAS data set so that you can add new
variables, create a subset, or otherwise modify the data
• The SET statement brings a SAS data set into the DATA
step one observation at a time for processing
• Syntax:
DATA new-data-set;
SET data-set;
Modifying a data set with the SET statement
Data new;
input x y ;
cards;
1 2
3 4
;
run;
Data new1;
set new;
z=x+y;
run;
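The SET statement can also be used to create a subset rather than add a variable. The following is a minimal sketch, assuming the same two-observation data set new as above; the output name subset and the cutoff x > 1 are chosen here only for illustration.

```sas
/* Recreate the small example data set so this block runs on its own */
DATA new;
   INPUT x y;
   DATALINES;
1 2
3 4
;
RUN;

/* Hypothetical subset: keep only observations with x > 1 and drop y */
DATA subset;
   SET new;
   IF x > 1;     /* subsetting IF: observations failing the condition are not written */
   DROP y;       /* y is read from new but not written to subset */
RUN;

PROC PRINT DATA = subset;
RUN;
```

With these two observations, only the second one (x = 3) satisfies the condition, so subset should contain a single observation.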
Stacking data sets using the SET statement
• With more than one data set, the SET statement stacks the data
sets one on top of the other
• Syntax:
DATA new-data-set;
SET data-set-1 data-set-2 … data-set-n;
• The number of observations in the new data set equals
the sum of the numbers of observations in the old data sets
• The order of observations is determined by the order of the
list of old data sets
• If one of the data sets has a variable not contained in the other
data sets, then observations from the other data sets will have
missing values for that variable
Stacking data sets using the SET statement
• Example: These data sets contain information about visitors to a park.
There are two entrances: a south entrance and a north entrance. The data
file for the south entrance has an S for south, followed by the
customers' pass numbers, the sizes of their parties, and their ages. The data
file for the north entrance has an N for north, the same data as the
south entrance, plus one more variable for parking lot.
/* South.dat */
S 43 3 27
S 44 3 24
S 45 3 2
/* North.dat */
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
DATA southentrance;
INPUT Entrance $ PassNumber PartySize Age @@;
cards;
S 43 3 27 S 44 3 24 S 45 3 2
;
run;
DATA northentrance;
INPUT Entrance $ PassNumber PartySize Age Lot @@;
cards;
N 21 5 41 1 N 87 4 33 3 N 65 2 67 1
N 66 2 7 1
;
run;
/* Stack the two data sets; Lot is missing for the south entrance */
DATA both;
SET southentrance northentrance;
RUN;
Interleaving data sets using the SET statement
• If your data sets are already sorted by some important
variable, simply stacking them may leave the combined data
set unsorted. You can add a BY statement to keep the final data
set in sorted order.
• Syntax:
DATA new-data-set;
SET data-set-1 data-set-2 … data-set-n;
BY variable-list;
• The old data sets must already be sorted by the BY variables
The Park Data
• Example: The data file for the south entrance has an S for south,
followed by the customers’ pass numbers, the size of their parties, and
ages. The data file for the north entrance has an N for north, the same
data as the south entrance, plus one more variable for parking lot.
Suppose the data sets have been sorted by the customer’s pass numbers
/* South.dat */
S 43 3 27
S 44 3 24
S 45 3 2
/* North.dat */
N 21 5 41 1
N 65 2 67 1
N 66 2 7 1
N 87 4 33 3
PROC sort DATA = southentrance;
by PassNumber;
run;
PROC sort DATA = northentrance;
by PassNumber;
run;
DATA interleave;
SET northentrance southentrance;
BY PassNumber;
run;
Combining data sets with one-to-one match merge
• The MERGE statement matches observations from one data
set with observations from another
• If the two data sets are in EXACTLY the same order, you
don't need any common variables between the two
data sets
• Usually, however, you want some common variables
for matching purposes, which can uniquely identify each
observation
• Syntax:
DATA new-data-set;
Merge data-set-1 data-set-2;
By variable-list;
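When the BY statement is left out, MERGE simply pairs observations by position. The sketch below illustrates this, assuming two hypothetical data sets (names and scores, invented for this example) that are already in exactly the same order.

```sas
/* Hypothetical data: one name per observation */
DATA names;
   INPUT Name $;
   DATALINES;
Ann
Bob
;
RUN;

/* Hypothetical data: one score per observation, in the same order as names */
DATA scores;
   INPUT Score;
   DATALINES;
90
85
;
RUN;

/* No BY statement: observation 1 is paired with observation 1, 2 with 2 */
DATA combined;
   MERGE names scores;
RUN;

PROC PRINT DATA = combined;
RUN;
```

Pairing by position is fragile: if either data set gains, loses, or reorders an observation, the pairs silently shift, which is why a BY variable is usually preferred.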
Combining data sets with one-to-one match merge
Chocolate sales example:
• A store records its chocolate sales each day in a file that
contains the code number of each product and the number of
pieces sold that day
• In a separate file, they keep the detailed description of each
product
• We need to merge the two data sets in order to print the
day's sales along with the descriptions of the products.
/* sales data */
C865 15
K086 9
A536 21
S163 34
K014 1
A206 12
B713 29
/* Descriptions */
A206 Mokka Coffee buttercream in dark chocolate
A536 Walnoot Walnut halves in bed of dark chocolate
B713 Frambozen Raspberry marzipan covered in milk chocolate
C865 Vanille Vanilla-flavored rolled in ground hazelnuts
K014 Kroon Milk chocolate with a mint cream center
K086 Koning Hazelnut paste in dark chocolate
M315 Pyramide White with dark chocolate trimming
S163 Orbais Chocolate cream in dark chocolate
DATA descriptions;
INFILE 'F:\teaching\SAS\lecture7\chocolate.txt'
TRUNCOVER;
INPUT CodeNum $ 1-4 Name $ 6-14 Description $ 15-60;
DATA sales;
INFILE 'F:\teaching\SAS\lecture7\sales.txt';
INPUT CodeNum $ PiecesSold ;
PROC SORT DATA = sales;
BY CodeNum;
DATA chocolates;
MERGE sales descriptions;
BY CodeNum;
run;
Combining data sets with one-to-many match
• One-to-many match: matching one observation from one
data set with more than one observation in another data set
• The statements for a one-to-many match are the same as for a
one-to-one match
DATA new-data-set;
Merge data-set-1 data-set-2;
By variable-list;
• The data sets must be sorted first by the BY variables
• If the two data sets have variables with the same names,
besides the BY variables, the variables from the second
data set will overwrite any variables with the same name in
the first data set
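One way to avoid the overwriting described in the last bullet is the RENAME= data set option. The following sketch assumes two small hypothetical data sets (master and lookup, invented here) that both contain a variable named Price and are already sorted by Type.

```sas
/* Hypothetical "many" data set: several observations per Type */
DATA master;
   INPUT Type $ Price;
   DATALINES;
a 10
a 12
b 20
;
RUN;

/* Hypothetical "one" data set: one observation per Type, also with a Price variable */
DATA lookup;
   INPUT Type $ Price;
   DATALINES;
a 1
b 2
;
RUN;

/* Rename the second Price as it is read, so both values survive the merge */
DATA merged;
   MERGE master lookup (RENAME = (Price = LookupPrice));
   BY Type;
RUN;

PROC PRINT DATA = merged;
RUN;
```

Because the rename happens as lookup is read, the original lookup data set itself is left unchanged.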
Example: Shoes data
• The shoe store is putting all its shoes on sale. They have
two data files, one with information about each type of
shoe and one with discount information. We want to find
the new price of each shoe
Shoe data:
Max Flight       running  142.99
Zip Fit Leather  walking   83.99
Zoom Airborne    running  112.99
Light Step       walking   73.99
Max Step Woven   walking   75.99
Zip Sneak        c-train   92.99
Discount data:
c-train  .25
running  .30
walking  .20
DATA regular;
INFILE datalines dsd;
length style $15;
INPUT Style $ ExerciseType $ RegularPrice @@;
datalines;
Max Flight , running, 142.99, …
;
PROC SORT DATA = regular;
BY ExerciseType;
DATA discount;
INPUT ExerciseType $ Adjustment @@;
cards;
c-train .25 …
;
DATA prices;
MERGE regular discount;
BY ExerciseType;
NewPrice = ROUND(RegularPrice - (RegularPrice *
Adjustment), .01);
RUN;
Updating a master data set with transactions
• The UPDATE statement is used when you have a master data
set that must be updated with new information from a
transaction data set.
• The basic form of the UPDATE statement is
DATA master-data-set;
UPDATE master-data-set transaction-data-set;
BY variable-list;
• With the UPDATE statement, the resulting master data set
has just one observation for each unique value of the
common variables
• Missing values in the transaction data do not overwrite the
existing values of the master data set
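The behavior of missing values in the transaction data can be seen in a small self-contained sketch. The data sets master and trans and their variables are hypothetical, invented for this illustration; both are already sorted by ID.

```sas
/* Hypothetical master data set, one observation per ID */
DATA master;
   INPUT ID Name $ Code $;
   DATALINES;
1 Ann A
2 Bob B
;
RUN;

/* Hypothetical transactions: ID 2 changes Code only (Name is missing),
   and ID 3 is a new record */
DATA trans;
   INPUT ID Name $ Code $;
   DATALINES;
2 . C
3 Cy D
;
RUN;

/* Apply the transactions to the master data set */
DATA master;
   UPDATE master trans;
   BY ID;
RUN;

PROC PRINT DATA = master;
RUN;
```

In the result, ID 2 should keep the name Bob (the missing Name in trans does not overwrite it) while its Code becomes C, and ID 3 is added as a new observation.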
Example: hospital data
• A hospital maintains a master database with information about patients.
Account Name     Address           DOB        Sex Insur LastUpdate
620135  Smith    234 Aspen St.     12-21-1975 m   CBC   02-16-1998
645722  Miyamoto 65 3rd Ave.       04-03-1936 f   MCR   05-30-1999
645739  Jensvold 505 Glendale Ave. 06-15-1960 f   HLT   09-23-1993
874329  Kazoyan  76-C La Vista     .          .   MCD   01-15-2003
• The hospital creates a transaction record for every new patient and any
returning patient whose status has changed
Account Name     Address           DOB        Sex Insur LastUpdate
620135  .        .                 .          .   HLT   06-15-2003
874329  .        .                 04-24-1954 m   .     06-15-2003
235777  Harman   5656 Land Way     01-18-2000 f   MCD   06-15-2003
LIBNAME perm 'F:\teaching\SAS\lecture7';
DATA perm.patientmaster;
INFILE 'F:\teaching\SAS\lecture7\Admit.txt';
INPUT Account LastName $ 8-16 Address $ 17-34
BirthDate MMDDYY10. Sex $ InsCode $ 48-50
@52 LastUpdate MMDDYY10.;
DATA transactions;
INFILE 'F:\teaching\SAS\lecture7\NewAdmit.txt';
INPUT Account LastName $ 8-16 Address $ 17-34
BirthDate MMDDYY10. Sex $ InsCode $ 48-50
@52 LastUpdate MMDDYY10.;
PROC SORT DATA = transactions;
BY Account; run;
DATA perm.patientmaster;
UPDATE perm.patientmaster transactions;
BY Account; run;
SAS data set options
• SAS has three basic types of options: system options, statement
options, and data set options
• System options stay in effect for the duration of your job
or session. These options affect how SAS operates and are usually set
when you invoke SAS or via an OPTIONS statement. E.g.: OPTIONS CENTER
NODATE;
• Statement options appear in individual statements and influence how
SAS runs that particular DATA or PROC step. E.g.: PROC MEANS
DATA=new NOPRINT;
• Data set options affect only how SAS reads or writes an individual
data set. You can use data set options in a DATA step (in DATA, SET,
MERGE, or UPDATE statements) or in a PROC step (in conjunction with
a DATA= statement option). E.g.: DATA new; SET old (KEEP= x1 x2);
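The three kinds of options can appear side by side in one short program. This is only a sketch: the data set old and its variables x1, x2, x3 are hypothetical, and NOOBS is used here as one example of a statement option.

```sas
/* System options: stay in effect for the rest of the session */
OPTIONS NODATE NOCENTER;

/* Hypothetical input data set with three variables */
DATA old;
   INPUT x1 x2 x3;
   DATALINES;
1 2 3
4 5 6
;
RUN;

/* Data set option: KEEP= affects only how old is read in this step */
DATA new;
   SET old (KEEP = x1 x2);
RUN;

/* Statement option: NOOBS affects only this PROC PRINT */
PROC PRINT DATA = new NOOBS;
RUN;
```

The scope is the main difference: the OPTIONS statement keeps applying to later steps, while KEEP= and NOOBS stop applying as soon as their own step ends.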
SAS data set options
• To use a data set option, you simply put it between
parentheses directly following the data set name. Here are
some commonly used data set options:
Data set option          Function
KEEP=variable-list       Which variables to keep
DROP=variable-list       Which variables to drop
RENAME=(oldvar=newvar)   Rename oldvar to newvar
FIRSTOBS=n               Start reading at observation n
OBS=n                    Stop reading at observation n
IN=new-var-name          Create a temporary variable for tracking
                         whether that data set contributed to the
                         current observation
Data new;
input x y @@;
datalines;
1 2 3 4 5 6 7 8
;
run;
data new1;
set new (keep=x firstobs=2 obs=3
rename= (x=z) );
run;
Tracking and selecting observations with the IN= option
• The IN=new-var-name option creates a new temporary
variable. The variable exists only for the duration of the
current DATA step and is not added to the data set being
created
• SAS gives the IN= variable a value of 0 if that data set
did not contribute to the current observation and a value of
1 if it did
• You can use the IN= variable to track, select, or
delete observations based on the data set of origin
• The IN= option is most often used with the
MERGE statement
data new1;
input ID $ gender $;
cards;
1 F
2 M
3 F
;
data new2;
input ID $ age;
cards;
1 15
3 20
4 35
;
data new3;
merge new1 (In=indicator1) new2 (In=indicator2);
by ID;
if indicator1=1 and indicator2=1;
run;
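The same IN= variables can select the non-matches instead of the matches. A sketch, assuming the same two small data sets as above; the output name only1 and the flag names in1 and in2 are chosen here for illustration.

```sas
/* Recreate the two example data sets so this block runs on its own */
DATA new1;
   INPUT ID $ gender $;
   DATALINES;
1 F
2 M
3 F
;
RUN;

DATA new2;
   INPUT ID $ age;
   DATALINES;
1 15
3 20
4 35
;
RUN;

/* Keep observations that come from new1 but have no match in new2 */
DATA only1;
   MERGE new1 (IN = in1) new2 (IN = in2);
   BY ID;
   IF in1 AND NOT in2;   /* in1 and in2 are temporary and are not written to only1 */
RUN;

PROC PRINT DATA = only1;
RUN;
```

With these data, only ID 2 appears in new1 without a partner in new2, so only1 should hold that single observation with a missing age.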
Writing multiple datasets using the OUTPUT
statement
• You can create more than one data set in a single DATA
step by putting multiple data set names after the DATA keyword:
DATA dataset1 dataset2 dataset3;
• If you want the data sets to be different, you need to
combine the above DATA statement with an OUTPUT
statement. The basic form of an OUTPUT statement is:
OUTPUT data-set-name;
• If you leave out the data-set-name after OUTPUT,
then the observation will be written to all data sets named
in the DATA statement
• Example: A local zoo maintains a database about the feeding of
animals. For each group of animals, the data include the scientific
class, the enclosure those animals live in, and whether they get fed in
the morning, afternoon, or both. Here are the data:
bears     Mammalia E2 both
elephants Mammalia W3 am
flamingos Aves     W1 pm
frogs     Amphibia S2 pm
kangaroos Mammalia N4 am
lions     Mammalia W6 pm
snakes    Reptilia S1 pm
tigers    Mammalia W9 both
zebras    Mammalia W2 am
We want to create two lists, one for morning feedings and one for
afternoon feedings
DATA morning afternoon;
INFILE 'F:\lecture7\Zoo.txt';
INPUT Animal $ 1-9 Class $ 11-18
Enclosure $ FeedTime $;
IF FeedTime = 'am' THEN OUTPUT morning;
ELSE IF FeedTime ='pm' THEN OUTPUT afternoon;
ELSE IF FeedTime = 'both' THEN OUTPUT;
Run;
Use SAS automatic variables
• In addition to the variables you create in your SAS data set, SAS
creates a few automatic variables. These variables are temporary and
are not saved with your data, but they are available in the DATA step
and you can use them just like any variables that you created
yourself
• _N_: indicates the number of times SAS has looped through the
DATA step
• _ERROR_: has a value of 1 if there is a data error for that observation
and 0 if there isn't
• FIRST.variable and LAST.variable: available when you are using a BY
statement in a DATA step. FIRST.variable (LAST.variable) equals 1
for the first (last) observation with a given value of
that variable and 0 for all other observations.
• Example: here is a data set containing information about a walking
race. It contains the entry ID, age group, and finishing time
54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0
38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3
11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6
We need to derive the overall finishing place and find the
walkers who finished first within each age group
DATA walkers;
INPUT EntryID AgeGroup $ Time @@;
cards;
54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0
38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3
11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6
;
PROC SORT DATA = walkers; BY Time; run;
/* After sorting by Time, _N_ gives the overall finishing place */
DATA ordered;
SET walkers;
Place = _N_;
run;
PROC SORT DATA = ordered; BY AgeGroup Time; run;
/* Keep the fastest walker in each age group */
DATA winners; SET ordered; BY AgeGroup;
IF FIRST.AgeGroup = 1;
run;
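FIRST. and LAST. variables can also bracket a running count within each BY group. The sketch below assumes the same walkers data; the data set name groupcounts and the variable Count are introduced here only for illustration.

```sas
/* Recreate the race data so this block runs on its own */
DATA walkers;
   INPUT EntryID AgeGroup $ Time @@;
   DATALINES;
54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0
38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3
11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6
;
RUN;

PROC SORT DATA = walkers;
   BY AgeGroup;
RUN;

/* Count the walkers in each age group */
DATA groupcounts;
   SET walkers;
   BY AgeGroup;
   IF FIRST.AgeGroup THEN Count = 0;   /* reset at the start of each group */
   Count + 1;                          /* sum statement: Count is retained across observations */
   IF LAST.AgeGroup THEN OUTPUT;       /* write one observation per group */
   KEEP AgeGroup Count;
RUN;

PROC PRINT DATA = groupcounts;
RUN;
```

For these twelve entries the result should be one observation per age group: 5 adults, 3 seniors, and 4 youths.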