• Use the UPDATE statement to:
  – update a master dataset with new transactions (e.g., a bank account updated regularly with deposits and withdrawals…). It isn’t used a lot, but when you need it, it’s exactly what you need…
  – the general form is

DATA master_data_set;
   UPDATE master_data_set transaction_data_set;
   BY variable_list;

Notes on the UPDATE statement:
• only two datasets can be specified (master & transactions)
• both datasets must be SORTed by their common (BY) variables
• the values of the BY variables must be unique in the master dataset (e.g., only one account per account number in the master bank dataset… there could be many transactions per account, though)
• missing values in the transaction dataset don’t overwrite existing values in the master dataset.

*Go over the example in section 6.8 on pages 194-195;
LIBNAME perm 'c:\MySASLib';
DATA perm.patientmaster;
   *INFILE fill in here;
   INPUT Account LastName $ 8-16 Address $ 17-34
         BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
RUN;

/* Second Program */
LIBNAME perm 'c:\MySASLib';
DATA transactions;
   *INFILE fill in here;
   INPUT Account LastName $ 8-16 Address $ 17-34
         BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
PROC SORT DATA = transactions;
   BY Account;
* Update patient data with transactions;
DATA perm.patientmaster;
   UPDATE perm.patientmaster transactions;
   BY Account;
PROC PRINT DATA = 'c:\MySASLib\patientmaster';
   FORMAT BirthDate LastUpdate MMDDYY10.;
   TITLE 'Admissions Data';
RUN;

There are many SAS dataset OPTIONS.
The list in section 6.9 is not comprehensive, but gives a flavor of what’s possible…
• RENAME = (oldvariable_name = newvariable_name) – this changes a variable’s name
• FIRSTOBS = n – this tells SAS the observation number on which to begin reading
• OBS = n – this tells SAS the observation number on which to stop reading
• IN = new_variable_name – this tells SAS to create a new (temporary) variable that tracks whether an observation comes from that dataset (value=1) or not (value=0).

Let’s try the example in section 6.10…

Here’s the customer data:

101 Murphy's Sports   115 Main St.
102 Sun N Ski         2106 Newberry Ave.
103 Sports Outfitters 19 Cary Way
104 Cramer & Johnson  4106 Arlington Blvd.
105 Sports Savers     2708 Broadway

Here’s the orders data:

102 562.01
104 254.98
104 1642.00
101 3497.56
102 385.30

Here’s the SAS code to find the customers who didn’t place any orders:

DATA customer;
   *INFILE fill-in TRUNCOVER;
   INPUT CustomerNumber Name $ 5-21 Address $ 23-42;
DATA orders;
   *INFILE why no TRUNCOVER?;
   INPUT CustomerNumber Total;
PROC SORT DATA = orders;
   BY CustomerNumber;
* Combine the data sets using the IN= option;
DATA noorders;
   MERGE customer orders (IN = Recent);
   BY CustomerNumber;
   IF Recent = 0;
PROC PRINT DATA = noorders;
   TITLE 'Customers with No Orders in the Third Quarter';
RUN;

Now modify the code so you can see the effect of the IN= option…
• take out the subsetting IF statement
• create a new variable whose values are those of the variable RECENT (why do I have to do this?)
• PRINT the entire dataset, including this new variable made from RECENT, to see its effect.

• A DATA step can create more than one dataset: name them all in the DATA statement; e.g.,

DATA X Y Z;
   INPUT … ;

With no OUTPUT statement, this will create 3 identical datasets (named WORK.X, WORK.Y, and WORK.Z). The next example uses IF … THEN statements with the OUTPUT statement to send observations to different datasets.

/* Here’s the zoo data with feeding time as the last column.
Create two datasets using the OUTPUT statement, one for each of the feeding times: morning and afternoon - be sure to put the animals in both datasets if they are fed at both times… */

bears     Mammalia E2 both
elephants Mammalia W3 am
flamingos Aves     W1 pm
frogs     Amphibia S2 pm
kangaroos Mammalia N4 am
lions     Mammalia W6 pm
snakes    Reptilia S1 pm
tigers    Mammalia W9 both
zebras    Mammalia W2 am

DATA morning afternoon;
   *INFILE fill-in here;
   INPUT Animal $ 1-9 Class $ 11-18 Enclosure $ FeedTime $;
   IF FeedTime = 'am' THEN OUTPUT morning;
      ELSE IF FeedTime = 'pm' THEN OUTPUT afternoon;
      ELSE IF FeedTime = 'both' THEN OUTPUT;
PROC PRINT DATA = morning;
   TITLE 'Animals with Morning Feedings';
PROC PRINT DATA = afternoon;
   TITLE 'Animals with Afternoon Feedings';
RUN;

We may also use OUTPUT statements to generate our own data and to create datasets from raw data formatted in unusual ways (see section 6.12 and below…)

dm log 'clear';
dm output 'clear';
options ls=80;
DATA generate;
   DO x=1 to 10;
      y=x**2;
      z=sqrt(x);
      OUTPUT;
   END;
PROC PRINT DATA=generate;
run;
quit;

/* Put this into a raw datafile */

Jan Varsity 56723 Downtown 69831 Super-6 70025
Feb Varsity 62137 Downtown 43901 Super-6 81534
Mar Varsity 49982 Downtown 55783 Super-6 69800

*now read it in properly…;
DATA theaters;
   *INFILE fill-in;
   INPUT Month $ Location $ Tickets @;
   OUTPUT;
   INPUT Location $ Tickets @;
   OUTPUT;
   INPUT Location $ Tickets;
   OUTPUT;
PROC PRINT DATA = theaters;
   TITLE 'Ticket Sales';
RUN;

/* We may also convert observations to variables and vice versa… */

PROC TRANSPOSE DATA=old OUT=new;
   BY var_list;
   ID variable;
   VAR var_list;

/* go over the example on p.194 - here’s the data… team name, player #, type of data, value of the salary or b.a.
*/

Garlics 10 salary 43000
Peaches 8 salary 38000
Garlics 21 salary 51000
Peaches 10 salary 47500
Garlics 10 batavg .281
Peaches 8 batavg .252
Garlics 21 batavg .265
Peaches 10 batavg .301

/* Here’s the SAS code… */
DATA baseball;
   *INFILE fill-in here;
   INPUT Team $ Player Type $ Entry;
PROC SORT DATA = baseball;
   BY Team Player;
PROC PRINT DATA = baseball;
   TITLE 'Baseball Data After Sorting and Before Transposing';
* Transpose data so salary & batavg are vars;
PROC TRANSPOSE DATA = baseball OUT = flipped;
   BY Team Player;
   ID Type;
   VAR Entry;
PROC PRINT DATA = flipped;
   TITLE 'Baseball Data After Transposing';
RUN;

Notes on PROC TRANSPOSE:
• The BY variables are included in the new dataset, not transposed. There will be one observation for each BY level per variable transposed.
• The ID variable’s values become the names of the variables in the newly transposed dataset. The ID variable’s values must be unique within the BY-values.
• The VAR statement names the variables whose values are going to be transposed.
• SAS creates a new variable (_NAME_) whose value(s) is the name of the VAR variable(s).
See the previous example and the graphic at the top of p.194.

There are several variables that SAS creates automatically when you create a new dataset, but because they are temporary, you never see them. A short list is given on page 196:
• _N_ = the number of times SAS has looped through the DATA step
• _ERROR_ = 0 or 1 depending upon whether there is a data error for that particular observation.
• FIRST.variable and LAST.variable are created when you use a BY statement in the DATA step. FIRST.variable has the value 1 when SAS is processing the first occurrence of a new value of the BY variable and 0 otherwise. LAST.variable is similar: it has the value 1 when SAS is processing the last occurrence of a value of the BY variable and 0 otherwise.

See the example program on pages 196-197… Here’s the data (entry #, age group, finishing time). We want to create a new variable whose value is the overall place that the person finished.
Note that the value of place can be determined from the _N_ variable if the new dataset is being created from a dataset sorted by finishing time. The second part of the program uses the FIRST.agegroup automatic variable to pick the top finisher in each age category.

54 youth 35.5 21 adult 21.6 6 adult 25.8 13 senior 29.0
38 senior 40.3 19 youth 39.6 3 adult 19.0 25 youth 47.3
11 adult 21.9 8 senior 54.3 41 adult 43.0 32 youth 38.6

DATA walkers;
   *INFILE fill in here;
   INPUT Entry AgeGroup $ Time @@; /*note >1 obs per line*/
PROC SORT DATA = walkers;
   BY Time;
* Create a new variable, Place;
DATA ordered;
   SET walkers;
   Place = _N_;
PROC PRINT DATA = ordered;
   TITLE 'Results of Walk';
PROC SORT DATA = ordered;
   BY AgeGroup Time;
* Keep first observation in each age group;
DATA winners;
   SET ordered;
   BY AgeGroup;
   IF FIRST.AgeGroup = 1;
PROC PRINT DATA = winners;
   TITLE 'Winners in Each Age Group';
RUN;
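
The program above only uses FIRST.AgeGroup. As a rough sketch of how FIRST. and LAST. can work together, the step below counts the walkers in each age group and keeps the fastest time. It assumes the dataset ordered from the program above, already sorted BY AgeGroup Time; the names groupsummary, NumWalkers, and BestTime are made up for this illustration and are not from the book.

```sas
* Sketch only, assuming WORK.ordered exists and is sorted BY AgeGroup Time.
  The names groupsummary, NumWalkers and BestTime are hypothetical;
DATA groupsummary;
   SET ordered;
   BY AgeGroup;
   RETAIN BestTime;
   * First walker in an age group, so reset the count and save the fastest time;
   IF FIRST.AgeGroup = 1 THEN DO;
      NumWalkers = 0;
      BestTime = Time;
   END;
   * Sum statement, so NumWalkers keeps its value from one observation to the next;
   NumWalkers + 1;
   * Last walker in the age group, so write one summary observation;
   IF LAST.AgeGroup = 1 THEN OUTPUT;
   KEEP AgeGroup NumWalkers BestTime;
PROC PRINT DATA = groupsummary;
   TITLE 'Number of Walkers and Best Time in Each Age Group';
RUN;
```

With the walkers data above, this should give one observation per age group: adult (5 walkers, best time 19.0), senior (3 walkers, 29.0), and youth (4 walkers, 35.5). BestTime needs the RETAIN statement because it is a new variable; without it, SAS would reset it to missing at the start of each loop through the DATA step.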