EPIB 698C SAS lecture 5 Modifying and combining SAS data sets Raul Cruz-Cano Summer 2012 1 Modifying a data set with the SET statement The SET statement • The SET statement in the data step allows you to read a SAS data set so that you can add new variables, create a subset, or modify the data set • The SET statement brings a SAS data set, one observation at a time, in to a data step for processing • Syntax: Data new-data-set ; Set data set; 2 Modifying a data set with the SET statement Data new; input x y ; cards; 1 2 3 4 ; run; Data new1; set new; z=x+y; run; 3 Stacking data sets using the SET statement • With more than one data, the SET statement stacks the data sets one on top of the other • Syntax: DATA new-data-set; SET data-set-1 data-set-2 … data-set-n; • The Number of observations in the new data set will equal to the sum of the number of observations in the old data sets • The order of observations is determined by the order of the list of old data sets • If one of the data set has a variables not contained in the other data sets, then observations from the other data sets will have missing values for that variable 4 Stacking data sets using the SET statement • Example: Here is data set contains information of visitors to a park. There are two entrances: south entrance and north entrance. The data file for the south entrance has an S for south, followed by the customers pass numbers, the size of their parties, and ages. The data file for the north entrance has an N for north, the same data as the south entrance, plus one more variable for parking lot. /* South .dat */ S 43 3 27 S 44 3 24 S 45 3 2 /* North.dat */ N N N N 21 87 65 66 5 4 2 2 41 33 67 7 1 3 1 1 5 DATA southentrance; INPUT Entrance $ PassNumber PartySize Age @@; cards; S 43 3 27 S 44 3 24 S 45 3 2 ; run; DATA northentrance; INPUT Entrance $ PassNumber PartySize Age Lot @@; Cards; N 21 5 41 1 N 87 4 33 3 N 65 2 67 1 N 66 2 7 1; run; DATA both; SET southentrance northentrance; RUN; 6 Interleaving data sets using the SET statement • If you have data sets that are already sorted by some important variable, then simply stacking the data sets may unsort the data sets. You can add a BY statement to keep the final data set in a sorted version. • Syntax: DATA new-data-set; SET data-set-1 data-set-2 … data-set-n; BY variable-list; The old data sets need to be sorted by the BY variables 7 The Park Data • Example: The data file for the south entrance has an S for south, followed by the customers’ pass numbers, the size of their parties, and ages. The data file for the north entrance has an N for north, the same data as the south entrance, plus one more variable for parking lot. Suppose the data sets have been sorted by the customer’s pass numbers /* South .dat */ S 43 3 27 S 44 3 24 S 45 3 2 /* North.dat */ N N N N 21 65 66 87 5 2 2 4 41 67 7 33 1 1 1 3 8 PROC sort DATA = southentrance; by PassNumber; run; PROC sort DATA = northentrance; by PassNumber; run; DATA interleave; SET northentrance southentrance; BY PassNumber; run; 9 Combining data sets with one-to-one match merge • The MERGE statement match observations from one data set with observations from another • If the two data sets are in EXACTLY the same order, you don’t have to have any common variables between the two data sets • However, usually you want to have, for matching purpose, some common variables which can uniquely identify each observation • Syntax DATA new-data-set; Merge data-set-1 data-set-2; By variable-list; 10 Combining data sets with one-to-one match merge Chocolate sales example: • A store keeps all the chocolate sales data each day which contains the code number of the products and the number of pieces sold that day • In a separate file, they keep the detailed information of the products • We need to merge the two data sets in order to print the day’s sales along with the descriptions of the products. 11 /* sales data */ C865 15 K086 9 A536 21 S163 34 K014 1 A206 12 B713 29 /* Descriptions */ A206 Mokka Coffee buttercream in dark chocolate A536 Walnoot Walnut halves in bed of dark chocolate B713 Frambozen Raspberry marzipan covered in milk chocolate C865 Vanille Vanilla-flavored rolled in ground hazelnuts K014 Kroon Milk chocolate with a mint cream center K086 Koning Hazelnut paste in dark chocolate M315 Pyramide White with dark chocolate trimming S163 Orbais Chocolate cream in dark chocolate 12 DATA descriptions; INFILE 'F:\teaching\SAS\lecture7\chocolate.txt' TRUNCOVER; INPUT CodeNum $ 1-4 Name $ 6-14 Description $ 15-60; DATA sales; INFILE 'F:\teaching\SAS\lecture7\sales.txt'; INPUT CodeNum $ PiecesSold ; PROC SORT DATA = sales; BY CodeNum; DATA chocolates; MERGE sales descriptions; BY CodeNum; run; 13 Combining data sets with one-to-many match • One-to-many match: matching one observation from one data set with more than one observation to another data set • The statement of one-to-many match is the same as one-toone match DATA new-data-set; Merge data-set-1 data-set-2; By variable-list; • The data sets must be sorted first by the BY variables • If the two data sets have variables with the same names, besides the BY variables, the variables from the second data set will overwrite any variables with the same name in the first data set 14 Example: Shoes data • The shoe store is putting all its shoes on sale. They have two data file, one contains information about each type of shoe, and one with discount information. We want to find out new price of the shoes Shoe data: Discount data Max Flight running 142.99 Zip Fit Leather walking 83.99 Zoom Airborne running 112.99 Light Step walking 73.99 Max Step Woven walking 75.99 Zip Sneak c-train 92.99 c-train .25 running .30 walking .20 15 DATA regular; INFILE datalines dsd; length style $15; INPUT Style $ ExerciseType $ RegularPrice @@; datalines; Max Flight , running, 142.99, … ; PROC SORT DATA = regular; BY ExerciseType; DATA discount; INPUT ExerciseType $ Adjustment @@; cards; c-train .25 … ; DATA prices; MERGE regular discount; BY ExerciseType; NewPrice = ROUND(RegularPrice - (RegularPrice * Adjustment), .01); RUN; 16 Updating a master data set with transactions • The update statement is used when you have a master data set that must be updated with some new information. • The basic form of the UPDATE statement is DATA master-data-set; UPDATE master-data-set transaction-data-set; BY variable-list; • With the UPDATE statement, the resulting master data set has just one observation for each unique value of the common variables • Missing values in the transaction data do not overwrite the existing values of the master data set 17 Example: hospital data • A hospital maintains a master database with information about patients. Account Name Address DOB Sex Insur lastupdate 620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998 645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999 645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993 874329 Kazoyan 76-C La Vista . . MCD 01-15-2003 • The hospital create a transaction record for every new patient and any returning patient whose status has changed Account Name Address 620135 . . 874329 . . 235777 Harman 5656 Land Way DOB Sex Insur lastupdate . . HLT 06-15-2003 04-24-1954 m . 06-15-2003 01-18-2000 f MCD 06-15-2003 18 LIBNAME perm 'F:\teaching\SAS\lecture7'; DATA perm.patientmaster; INFILE 'F:\teaching\SAS\lecture7\Admit.txt'; INPUT Account LastName $ 8-16 Address $ 17-34 BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.; DATA transactions; INFILE 'F:\teaching\SAS\lecture7\NewAdmit.txt'; INPUT Account LastName $ 8-16 Address $ 17-34 BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.; PROC SORT DATA = transactions; BY Account; run; DATA perm.patientmaster; UPDATE perm.patientmaster transactions; BY Account; run; 19 SAS data set options • SAS has three basic types of options: system options, statement options, and data set options • System options are those that stay in effect for the duration of your job or session. These options affect how SAS operates, and usually issued when you invoke SAS or via Options statement. Eg: Options center nodate • Statement options appear in individual statement and influence how SAS runs that particular DATA or PROC step. Eg: Proc means data =new noprint; • Data set options affect only how SAS reads or writes an individual data set. You can use data set options in DATA step (in DATA, SET, MERGE, or UPDATE statements) or in PROC step ( in conjunct with a DATA=statement option). Eg: DATA New; set old (keep= x1 x2); 20 SAS data set options • To use a data set option, you simply put it between parentheses directly following the data set name. Here are some commonly used data set options: Data set options Functions Keep =variable-list Which variables to keep Drop=variable-list Which variables to drop Rename=(oldvar=newvar) Rename oldvar to newvar Firstobs=n To start reading at observation n Obs=n To stop reading at observation n IN=new-var-name Create a temporary variable for tracking whether that data set contributed to the current observation 21 Data new; input x y @@; datalines; 1 2 3 4 5 6 7 8 ; run; data new1; set new (keep=x firstobs=2 obs=3 rename= (x=z) ); run; 22 Tracking and selecting observations with the IN=option • The IN=new-var-name option creates a new temporary variable. The variable exist only for the duration of the current DATA step and are not added to the data set being created • SAS gives IN=new-var-name a value of 0 if that data set did not contribute to the current observation and a value of 1 if it did • You can use the IN=new-var-name to track, select, or delete observations based on the data set of origin • The IN=new-var-name option is most often used with MERGE statement 23 data new1; 1 F 2 M 3 F ; input ID $ gender $ ; cards; data new2 ; input ID $ 1 15 3 20 4 35 ; age; cards; data new3; merge new1 (In=indicator1) new2 (In=indicator2); by ID; if indicator1=1 and indicator2=1; run; 24 Writing multiple datasets using the OUTPUT statement • You can create more than one data set in a single DATA step by putting multiple data set names after the DATA: DATA dataset1 dataset2 dataset3; • IF you want the data sets to be different, you need to combine the above data statement with an OUTPUT statement. The basic form of an output statement is: OUTPUT data-set-name; • IF you leave out the data-set-name after the OUTPUT, then the observation will be written to all data sets named in the DATA statement 25 • Example: A local zoo maintains a data base about the feeding of animals. For each groups of animals, the data include the scientific class, the enclosure those animals live in and whether they get fed in the morning, afternoon or both. Here is the data: bears Mammalia elephants Mammalia flamingos Aves frogs Amphibia kangaroos Mammalia lions Mammalia snakes Reptilia tigers Mammalia zebras Mammalia E2 W3 W1 S2 N4 W6 S1 W9 W2 both am pm pm am pm pm both am We want to create two list, one for morining feedings and one for afternoon feedings 26 DATA morning afternoon; INFILE 'F:\lecture7\Zoo.txt'; INPUT Animal $ 1-9 Class $ 11-18 Enclosure $ FeedTime $; IF FeedTime = 'am' THEN OUTPUT morning; ELSE IF FeedTime ='pm' THEN OUTPUT afternoon; ELSE IF FeedTime = 'both' THEN OUTPUT; Run; 27 Use SAS automatic variables • In addition to the variables you created in your SAS data set, SAS creates a few automatic variable. These variables are temporary and are not saved with your data. But they are available in the DATA step and you can use them just like you use any variables that your created yourself • _N_: indicates the number of times SAS has looped through the DATA step • _ERROR_: has a value of 1 if there is a data error for that observation and 0 if there isn’t • First.variable and Last.variable: available when you are using a BY statement in a DATA step. The First.variable =1 (Last.variable =1 ) is when an observation with the first (last) occurrence of a new value for that variable and 0 for the other observations. 28 • Example: here is a data set contains information of walking race. It contains subject ID, age group and finishing time 54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0 38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3 11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6 We need to derive the overall finishing place and find out subjects who finished first within each age group 29 DATA walkers; cards; INPUT EntryID AgeGroup $ Time @@; 54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0 38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3 11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6 ; PROC SORT DATA = walkers; BY Time; run; DATA ordered; Place = _N_; SET walkers; run; PROC SORT DATA = ordered; BY AgeGroup Time; DATA winners; SET ordered; BY AgeGroup; IF FIRST.AgeGroup = 1; run; 30