HRP 223 - 2008 - Stanford University

Topic 4 – Data Manipulation
Copyright © 1999-2008 Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties.
Unauthorized reproduction of this presentation, or any portion of it, may result in
severe civil and criminal penalties and will be prosecuted to maximum extent possible
under the law.
Why Code
 Data step advantages
– Splitting data into many subsets
– Tasks that require looping
– Quickly subsetting
– Complex retains
 Minor tweaks with nice pay offs
– Adding Page Numbers
– Inserting Group Names in Titles
– Title and Footnote Justification
– Conditional Highlighting
 Including parameters
Common Ground … where
 The first week of class you saw that you can
point-and-click with EG or write data step
code or PROC SQL statements to subset data.
where
 The syntax for where is identical in SQL and data steps.
 Differences vs. if statements:
– main points work in where only
• sub points work in either
– x between y and z
• x >= y and x <= z
• y <= x <= z
– string1 ? string2 or string1 contains string2
• index(string1,string2) > 0
– string1 =* string2
• soundex(string1) = soundex(string2)
– x is null or x is missing
• missing(x)
– string1 like "U%of%A%"
• use regular expressions (PRX)
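A minimal sketch of a few where-only operators next to equivalents that also work in an if statement. The dataset work.schools and its values are made up for illustration.

```sas
/* Hypothetical data, made up for this illustration */
data work.schools;
   length name $40;
   infile datalines dlm=",";
   input name $ enrolled;
   datalines;
University of Arizona,45000
University of Alabama,38000
Stanford University,17000
Unknown College,.
;
run;

/* like and between work in where only */
proc print data=work.schools;
   where name like "U%of%A%" and enrolled between 30000 and 50000;
run;

/* contains and is missing work in where only */
proc print data=work.schools;
   where name contains "Stanford" or enrolled is missing;
run;

/* index() and missing() work in where and in if */
data work.subset;
   set work.schools;
   if index(name, "Stanford") > 0 or missing(enrolled);
run;
```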
Why bother?
 If you can use the GUI to write the subsets,
why bother learning the code?
– It takes time to make a new dataset. If all you
want is to subset for an analysis, it is a LOT faster
to add the where into the analysis code.
• First run the analysis on the complete data.
• Right click the node and choose open last
submitted code.
• (Tell it to keep all variables.)
• Scroll to the procedure and add in the
where.
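For example, a where added straight into the analysis code might look like the sketch below. It uses sashelp.class, a sample table that ships with SAS, as a stand-in for your own data.

```sas
/* Subset inside the analysis itself; no new dataset is created */
proc means data=sashelp.class mean std;
   where sex = "F";
   var height;
run;
```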
Keep All Data
 Before the analysis code, SAS puts in
instructions to subset the data. Tell it to
include all variables by adding the variable used in
the where statement to the list of selected variables,
or just use a *. More on this in a bit.
where Syntax
 The where statement, like all SAS statements,
begins with a keyword (where) and ends in a
semicolon.
– where isDead = "false";
– where isDead ne "true";
– where missing(gender);
– where salary > 100000;
– where country in ("USA", "Japan", "UK");
– where country in ("USA" "Japan" "UK");
where Syntax
 Arithmetic
– where salary/12 > 10000;
– where (salary /12) * 1.20 ge 9900;
– where salary + bonus < 120000;
 Logical
– where gender ne "M" and salary >= 50000;
– where gender ne "M" or salary >= 50000;
– where country = "UK" or country = "UTAH";
– where country not in ("USA", "AU");
Make Decisions
 SAS has many operations available to help you
make decisions.
– = eq, ~= ne, < lt, > gt, <= le, >= ge, in ( )
– Not
• requires the expression following it to not be true.
– & And, | or, in
• & Requires both operands to be true.
• | Requires one operand to be true.
• In () requires at least one comparison to be true.
– Math operations:
• + - * / **.
Logical Decisions & Compound Expressions
 Use the List Data … option on the Describe menu to choose
what variables to report, then include validity checks on the
data.
 Common tests and common problems:
where YODeath < YOBirth;
where Sex = "M" and numPreg > 0;
where Sex="M" and numPreg > 0 or ageLMP > 0; *** bad ***;
where Sex="M" and (numPreg > 0 or ageLMP > 0); *** good ***;
– Moral: Use parentheses generously with ands and ors.
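A small sketch of why the parentheses matter: SAS evaluates and before or. The four records in work.checks are made up for illustration.

```sas
/* Hypothetical records, made up for this illustration */
data work.checks;
   input id Sex $ numPreg ageLMP;
   datalines;
1 M 0 0
2 M 2 0
3 F 0 51
4 F 3 48
;
run;

/* No parentheses: read as (Sex="M" and numPreg > 0) or ageLMP > 0, */
/* so the women with an age at LMP (ids 3 and 4) are listed too.    */
proc print data=work.checks;
   where Sex = "M" and numPreg > 0 or ageLMP > 0;
run;

/* With parentheses: only men with a suspicious value (id 2). */
proc print data=work.checks;
   where Sex = "M" and (numPreg > 0 or ageLMP > 0);
run;
```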
Looking at Data
 The traditional way to look at data is with proc print.
proc print data=parity;
var gender numBirths yoBirth yoDeath ageLMP;
run;
 You can print out the corners of your data table.
proc print data=parity(firstobs=6 obs=6);
var gender ageLMP;
run;
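Note: obs= is the number of the last observation to read, not a count of
rows, so firstobs=6 obs=6 prints only observation 6. SAS should have called
this lastobs.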
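A short sketch of firstobs= and obs= together, using the sashelp.class sample table as a stand-in for parity:

```sas
/* Rows 6 through 10: obs= names the last row to read, */
/* so this prints 5 rows, not 10.                      */
proc print data=sashelp.class(firstobs=6 obs=10);
   var name sex age;
run;
```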
Moving Stuff in EG
 Last time somebody asked me how to move
stuff between process flows and I said that I
just copied the entire project.
 Actually, you can copy a bunch of stuff then
right click and choose “Move to > somewhere”.
Data Step
 There are a few things that can be done in a
data step that can’t be done in SQL.
 Most SAS programmers do not know SQL and I
need you to be able to look at their code.
Data Step Parts
 Data steps begin with a data statement.
 The second statement is usually a set
statement or an input statement.
 There are any number of additional
statements after the set or input line.
 The data step ends with a run statement or (if
the programmer is too lazy to type run;) at the
beginning of the next data step or procedure.
About that second line…
 set blah;
– Says you are going to read data from an existing SAS
data set (called blah in this case) into your new data
set.
 input gender $ age;
– Means that you are going to read existing data from
this page of code or from a text file. Typically the
input statement appears with a datalines statement
(for reading from this file) or an infile statement (for
reading from another text file).
– “gender $” means that one variable is a character
string
– Age does not have a $. So this signifies a numeric
variable.
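The two kinds of second line can be sketched side by side. The dataset names work.people and work.people2 and the three records are made up for illustration.

```sas
/* input + datalines: read values typed directly into the program */
data work.people;
   input gender $ age;   /* gender is character ($), age is numeric */
   datalines;
F 34
M 41
F 29
;
run;

/* set: read from the existing SAS data set created above */
data work.people2;
   set work.people;
run;
```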
Those lines after the 2nd line
 Commands that you are likely to see after the
set line include:
 where statements are used to select what
records to include based on the values in the
source file.
 if-then-else statements are used to check
simple logic to assign new values.
 select statements are used to perform
complex checking and choosing from a list.
How SAS Processes a Dataset

When you create a SAS data set with data step code, SAS
does the following things:
1. It figures out what variables it needs to track and it sets
aside some space in the computer’s working memory (RAM)
to hold the variables. This space is called the Program Data
Vector (PDV).
2. It sets the values in the PDV to missing.
3. Then it does all the instructions you tell it to do, in the order
you have written them.
4. Then it writes all the variables out to the new dataset.
5. It then repeats the process if there is more data.
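One way to watch these steps is to write the PDV to the log with put _all_. This is a sketch only; work.old here is a two-record stand-in with made-up values.

```sas
/* Hypothetical two-record dataset */
data work.old;
   input id refage;
   datalines;
1 45
2 62
;
run;

data work.new;
   put "Top of step: " _all_;   /* PDV before this record is read */
   set work.old;
   isYoung = (refage < 50);     /* 1 if true, 0 if false */
   put "End of step: " _all_;   /* PDV just before it is written out */
run;
```

In the log, isYoung should show as missing at the top of every pass. Variables read by set keep their old values until the next record replaces them. The automatic variables _N_ and _ERROR_ also appear in the log but are not written to work.new.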
Manipulating Data
 Say you have a dataset with a bunch of
variables. How does SAS keep track of the
data and allow you to manipulate it?
 Say this is the dataset called “OLD”.
Manipulating Data
 When you do a data step every variable on the set or
input lines is added to the PDV for the life of the data
step.
data new;
set old;
run;
– set old; The variables id, race, case, refage and lname are put
into the PDV.
– data new; The variables id, race, case, refage and lname are put
into the new dataset.
 If you don’t tell SAS to do something different, every
variable in the PDV is output to the new dataset.
How SAS Really Works
The Program Data Vector
 SAS processes information a record at a time in the
PDV. The PDV tracks variable names and their
contents, plus a couple of automatic variables. The
automatic ones don’t get output.
 SAS forgets what is in the vector when it reads
the next record but you can force it to
remember without too much effort.
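One common way to make SAS remember a value across records is the retain statement. A minimal sketch, using the sashelp.class sample table; the variable totalWeight is made up for illustration.

```sas
data work.running;
   set sashelp.class;
   retain totalWeight 0;                /* keep the value from record to record */
   totalWeight = totalWeight + weight;  /* running total of weight */
run;
```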
Working With Variables
 It is easy to add variables to the PDV.
 You can create a variable called isMale and set it to a
value for everyone in a dataset like this:
data new;
set old;
isMale = 'yes';
run;
– isMale = 'yes'; isMale is added to the PDV.
– Id, race, case, refage, lname and isMale are put into the new
dataset.
or you can conditionally assign a variable’s value
with an if statement:
if refAge < 50 then isYoung = 1;
if then (else) statements
 If you want to do something if a condition is true,
use an if-then statement.
 Remember you need both words, if and then .
data males females;
set parity;
if gender = "m" then output males;
else output females;
run;
What could possibly go wrong?
 If you send the people who have a gender of
“m” to one dataset and everyone else to
another, you will be dealing with major
headaches later. The “female” records plus all
the “Males”, the blanks, and every misspelling of male
and female go into that second file.
 I do use if-then statements but very rarely do I
use simple else statements.
select-when-otherwise-end
 I use select statements instead of complex else logic.
 The first condition in the block that is true is executed
and the rest are ignored.
data males females others;
set parity;
select (gender);
when ("M") output males;
when ("F") output females;
otherwise output others;
end;
run;
Creating a Variable
data x; input grade $ @@;
datalines;
A B C D F
;
run;
data y;
set x;
select (grade);
when ("A") score = "Woop!!!!";
when ("B", "C") score = "Bah";
when ("D", "F") score = "Ut oh";
end;
run;
Complex Decisions
The first condition that is true is done and the rest
are ignored. So get things in the correct order.
data ovary.affected;
set ovary.rptca;
select;
when (refage = .) agegr = .;
when (refage < 60) agegr = 1;
when (refage < 65) agegr = 2;
when (refage < 70) agegr = 3;
when (refage >= 70) agegr = 4;
end;
run;
Note: missing is negative infinity so check for missing before your first <
Note: Select ends with end;
Note: NO THEN
 I use select statements to track known problems in a
dataset.
data alice2 missingStuff badAge ageThing;
set alice;
select;
*These are FATAL data errors each dataset should
have 0 observations;
*no year blood draw, yob, age at entry in study;
when (missing(yr_bl_dr) or missing(birthyr) or
missing(dadage)) output missingStuff;
*age from blood draw inconsistent with reported age;
when ((yr_bl_dr-birthyr)-dadage > 2) output badAge;
* blood draw before age at birth;
when (yr_bl_dr - birthyr < 0) output ageThing;
otherwise output alice2;
end;
run; * NOTE: this does not notice multiple errors;
No Otherwise
 If you leave off the otherwise statement, SAS
will generate an error if the data is not
“trapped” by one of the other conditions.
 This is very helpful because it makes it easy to
see problems.
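A sketch of the trap in action. The grade "Z" below is a made-up bad value; because no when matches it and there is no otherwise, the second step should stop with an error in the log.

```sas
/* Hypothetical data with one unexpected grade */
data work.grades;
   input grade $ @@;
   datalines;
A B Z
;
run;

data work.scored;
   set work.grades;
   length score $8;
   select (grade);
      when ("A") score = "High";
      when ("B") score = "Middle";
      /* no otherwise: the "Z" record is not trapped, so SAS reports an error */
   end;
run;
```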
Adding New Variables
 As it scans down the page containing a data
step, SAS figures out if new variables are
character or numeric by looking for quotation
marks. The first time it sees a new variable it
sets the width in the PDV.
Playing with Character Variables
 If you manipulate character strings you want to remember these
things:
– upcase()
– lowcase()
 What variables and contents are in the new dataset?
data case;
band = "Skinny Puppy";
uBand = upcase(band);
output;
band = "Assemblage 23";
lBand = lowcase(band);
output;
run;
Length
 Be sure to set the length of the variable to be wide enough
to hold your data.
data case2;
length band $50;
band = "Skinny Puppy";
uBand = upcase(band);
output;
uBand = "";
band = "Assemblage 23";
lBand = lowcase(band);
output;
run;
EG Helps
Combining
 EG 4.1 does not have all the functions in SAS 9.1.3 listed. A
couple important missing functions are the CATs.
 CAT Function
– Concatenates character strings without removing leading or
trailing blanks
 CATS Function
– Concatenates character strings and removes leading and trailing
blanks
 CATT Function
– Concatenates character strings and removes trailing blanks
 CATX Function
– Concatenates character strings, removes leading and trailing
blanks, and inserts separators
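The four functions can be compared on the same pair of strings. A minimal sketch; the dataset work.cats and the padded values are made up for illustration.

```sas
data work.cats;
   first = "  Skinny ";
   last  = "Puppy  ";
   c  = cat(first, last);        /* keeps every leading and trailing blank */
   ct = catt(first, last);       /* drops trailing blanks from each piece */
   cs = cats(first, last);       /* drops leading and trailing blanks */
   cx = catx(" ", first, last);  /* strips blanks, then puts one space between */
run;
```

Here cs holds SkinnyPuppy and cx holds Skinny Puppy.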
Compressing
 Often you will have a variable which has extra
characters in it and you want to get rid of
them.
– Check digits in medical record numbers.
 Use the function compress() to remove the dashes
and spaces. List the characters to drop as the second argument:
"- "
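A minimal sketch of compress() on a made-up record number:

```sas
data work.mrn;
   raw   = "123-45 678-9";        /* made-up record number */
   clean = compress(raw, "- ");   /* second argument lists the characters to drop */
run;
```

After this step clean holds 123456789.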
Splitting Strings
 If you need to break a string of letters into
words use the scan() function.
– Specify the original string, comma, the word
number, comma, an optional list of word
delimiters.
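A short sketch of scan() with and without a delimiter list. The dataset work.words and its strings are made up for illustration.

```sas
data work.words;
   school   = "Leland Stanford Junior University";
   second   = scan(school, 2);          /* blank is one of the default delimiters */
   fullName = "Smith,John";
   lastName = scan(fullName, 1, ",");   /* comma given as the only delimiter */
run;
```

Here second holds Stanford and lastName holds Smith.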
The First Word
Example of Character Functions
Variable Order
 There are times when you will want to move a
variable to the beginning of the PDV and
therefore, to the left side of a dataset. I do this
if I am calculating values and I do not want to
scroll to the end of the spreadsheet
(viewtable) to check a value.
 Just reference the variable before the set
statement.
data life;
input subj_id yob yod @@;
datalines;
1000100 1920 1942 1000101 1921 1942
1000102 1930 1995
;
run;
data span;
* move age to head of pdv by referencing it before
it is read in the set statement;
age = 0;
set life;
age=yod-yob;
run;
Importing Data from External Text Files
 You also use the keyword input to get data
from a stored text file. Specify an infile
statement to define the source of your data
and do not use datalines.
data life;
infile 'c:\projects\blah\life.txt';
input subj_id yob dob;
run;
Importing Data… the Hard Way
 You can specify what column has the data you want as well as
how wide it is:
data rawblah;
infile 'c:\projects\pam\prostate.dat';
input
@1 id 7.
@3 race 1.
@2 case 1.
@24 refage 2.
@99 l_name $10.;
run;
– @1 id 7. This variable is written as a seven digit number with no
decimal places.
– @99 l_name $10. Here you tell SAS that this variable is
going to hold 10 characters.
Importing Data… the Hard Way
 If you have fixed length character variables, specify
them with a dollar sign and an informat like this:
 input l_name $10.;
 If your character variables are of variable length and
you want to read them up to a maximum length or a
delimiter, include a : in the specification:
 input l_name : $10.;
 This is handy if you are reading tab-delimited data
with character variables with imbedded blanks.
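A sketch of the colon modifier on delimited data. To keep it self-contained it reads comma-delimited datalines rather than a tab-delimited file; the records are made up for illustration.

```sas
data work.names;
   infile datalines dlm=",";       /* only a comma ends a value */
   input id l_name : $10. refage;  /* read l_name up to the delimiter, at most 10 characters */
   datalines;
1,Smith,45
2,De La Cruz,62
;
run;
```

Both Smith and De La Cruz are read whole, blanks included, because the blank is no longer a delimiter.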
Comments
 Comment the heck out of the code you write.
 Two syntaxes you have seen:
– * blah;
– /* blah */
 You can also select a block of code and push
– Control / to comment it out
– Control shift / to turn the comment back into code.
What is a bug anyway?
 When you write a program and it doesn’t work the
way that you intended, it is described as having a bug.
 There are many types of bugs. Syntax and semantic
errors are relatively easy to find and fix. When these
errors happen, SAS can not figure out what you want
done. Conceptual errors happen when SAS
understands the words you give it but it does not do
what you intended. These can be very, very hard to
find and fix.
 Spotting syntax and semantic bugs is easy. You just
need to look in the SAS log.
Syntax Errors
 As you try to write code you will see syntax errors
and lots of red in the log. Look at the line it
marks first. If you can’t see the problem, look for
problems (especially a missing semicolon) on the
line above where the red begins.
– Misspelled keywords
– Unmatched quotation marks
– Missing semicolons
– Invalid options
What is a bug anyway? (2)
 You will look in the log window to find out if
SAS found any syntax errors.
* oops forgot the "then";
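The code that went with this comment is not shown here. A guess at the kind of mistake it describes, using the sashelp.class sample table as stand-in data; the dataset name work.boys is made up.

```sas
/* Broken version, left as a comment so this block runs cleanly: */
/*    if sex = "M" output work.boys;                              */
data work.boys;
   set sashelp.class;
   if sex = "M" then output work.boys;   /* the then is required */
run;
```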