HRP 222
Lecture 2 Using The Data Step
Copyright © 1999-2001 Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties.
Unauthorized reproduction of this presentation, or any portion of it, may result in
severe civil and criminal penalties and will be prosecuted to maximum extent possible
under the law.
From Last Time(1)
Stanford Software
Stanford uses a security system called
Kerberos to protect passwords going to and
from machines on campus (e.g., your email
account);
The program that actually does the Kerberos
authentication is called Leland. You can download it from
here:
http://www.stanford.edu/group/itss/ess/
From Last Time(2)
Logging into a UNIX machine to run SAS.
Download and install Leland and Sampson.
Start Sampson
The first time you run Sampson push the add
button and tell it you want to set up a
connection to:
tree.stanford.edu
You can then follow the instructions for using
X-Windows.
Help
SAS has a hypertext (weblike) set of help
files that are just awesome!
The help system is called OnlineDoc. The
software ships on a separate CD so you
will need to request it separately from the
typical SAS installation disks.
UCLA has a copy on the web:
http://sasdocs.ats.ucla.edu
Getting More Faster From
the Data Step
More on libraries and importing data
How SAS really works – PDV
Different types of variables
Working with sets of variables
The do block
Calculations and simple statistics
Sorting, processing “by”, “class”, first., last.
More About Libraries
Assignment
(0)
Recall from last time that SAS doesn’t know
about folders full of data. It only knows
libraries.
You can use the point and click interface but
it gets tiresome after a while.
More About Libraries
Assignment
(1)
You can do the same thing in one step by
typing into the program editor:
libname name 'where';
libname ingrid
'c:\projects\ingrid\dis';
libname ingridv6 v6
'c:\projects\ingrid\dis\old';
The name is anything eight or fewer
alpha-numeric characters but I usually
add v6 or just 6 to keep things clear.
The v6 after the name tells SAS to
read and write old style SAS files.
More About Libraries
Deleting Contents
(2)
With a library called Ingrid you can get
information on and modify its data with
proc datasets. Read the text for more
information on proc datasets.
One important task you do with proc
datasets is deleting datasets.
proc datasets library=ingrid;
delete sourceFiles;
quit;
More About Libraries
Data Sets’ Contents
(3)
If you just want to know the variables of a
single data set in alphabetical order, you can
use proc contents.
proc contents data=ingrid.raw; run;
To get the variables in their stored order, use:
proc contents data=ingrid.raw position; run;
Knowing the variables’ order can help you do
complex things.
Importing – Excel
(1)
While you can use the import wizard to
get data into SAS, most of the time you
will want to write a program to do it.
proc import out = work.testData
datafile = "C:\Projects\HRP222-2001\TestData.xls"
dbms = Excel2000 replace;
sheet = "Raw";
run;
The first semicolon comes after replace. There is no semicolon
before it: the statement is very long so I wrapped it
onto more than one line.
The documentation is
wrong when it says to put
sheet before the
semicolon.
Importing – Excel
(2)
Make sure the Excel workbook is closed or you
will get a lousy error message.
It is a good idea to have a single row indicating
the variable names at the top of every column
but if you have no column headings you can
include this line in the import procedure (put it
below the sheet line).
getnames=no;
Creating Data(1)
There are times when you need to create
an entirely new data set without using an
outside program like Excel.
You can do this in a data step using the
input and datalines statements.
Creating Data(2)
Example 1
SAS 8.x Windows This works nicely if
somebody gives you data
as a text file or as part of
the content of a website
or an email.
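A minimal sketch of the input and datalines pattern, with one record per line, could look like this. The data set name life and the values are the ones the double trailing @ example uses; the proc print step is only added here so you can see the result.

```sas
data life;
   /* list input: values are separated by blanks, one record per line */
   input subj_id yob dob;
   datalines;
1000100 1920 1942
1000101 1921 1942
1000102 1930 1995
;
run;

/* look at what was read */
proc print data=life;
run;
```

Pasting text from an email or web page between datalines; and the lone semicolon is usually all it takes, as long as the values line up with the variables named on the input statement.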
Creating Data(3)
Example 2 - Double Trailing @
Data life;
input subj_id yob dob @@;
datalines;
1000100 1920 1942 1000101 1921 1942
1000102 1930 1995
;
run;
Importing Data….
The Hard Way
(1)
You also use the keyword input to get
data from a stored text file. Instead of
using datalines you use infile.
data life;
infile 'c:\projects\blah\life.txt';
input subj_id yob dob;
run;
Importing Data….
The Hard Way
(2)
Reading data delimited with special
characters is easy:
data fakedata.life;
infile 'c:\projects\genealogy.txt' DLM=':';
input subj_id yob dob;
run;
Reads in a text file with this data:
1000101:1921:1941
1000102:1930:1995
Comma-delimited files are commonly read with the DSD option:
infile 'c:\projects\genealogy.txt' DSD;
Importing Data….
The Hard Way
(3)
You can specify what column
has the data you want as well as
how wide it is:
data rawblah;
infile 'c:\projects\pam\prostate.dat';
input
@1 id 7.
@3 race 1.
@2 case 1.
@24 refage 2.
@99 l_name $ 10.;
run;
The 7. after id says this variable is going to be a seven digit
number with no decimal places.
The $ 10. after l_name tells SAS that this
variable is going to hold up to
10 characters.
New Datasets
The first thing you should do when you
get data is look at it to see if it could be
real.
Check the “corner” values to make sure
everything was imported.
Check the range of values.
Check the frequency of values.
Never do descriptive stats without plots!
Looking at Data
The traditional way to look at data is with proc print.
proc print data=work.blah;
var id dob fname;
run;
You can print out the corners of your
data table.
proc print data=rawWeight(firstobs=762 obs=762);
var chart_numb gest_weeks;
run;
The obs= option should be called lastobs: it is the number of the
last observation to print, not a count.
The var statement lists the first and last
variables.
How SAS Really Works
The Program Data Vector
SAS processes information a record at a time in
the PDV. The PDV tracks variable names and
their contents, plus a couple of automatic
variables: _N_ and _ERROR_.
SAS forgets what is in the vector when it reads
the next record but you can force it to
remember without too much effort.
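One way to make SAS remember a value from record to record is the retain statement. The sketch below is only an illustration: the data set life and the variable maxYob are made up for this example, and it keeps a running maximum of yob while writing the PDV contents to the log.

```sas
data life;
   input subj_id yob;
   datalines;
1000100 1920
1000101 1921
1000102 1930
;
run;

data _null_;
   set life;
   retain maxYob;                       /* do not reset maxYob when the next record is read */
   if yob > maxYob then maxYob = yob;   /* a missing maxYob is smaller than any number */
   put _n_= yob= maxYob=;               /* _n_ is one of the automatic variables */
run;
```

If you take out the retain statement, maxYob starts every record as missing and simply ends up equal to yob each time.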
Working With Variables
It is easy to add variables to the PDV.
You can create a variable called isCase and set it
to a value for everyone in a dataset like this:
data newdata;
set olddata;
isCase = 'yes';
run;
or you can conditionally assign a variable’s value
with an if statement:
if yob > 1967 then isYoung = 1;
The Power to Make Decisions
The “if-then-else” statement is used when
you need to have SAS make a simple yes
or no decision. It is frequently used to set
values for variables in a data step.
data ovary.affected;
set ovary.rptca;
if ca_site=1 or ca_site=26 then ov_can='yes';
else ov_can = 'no';
run;
Logical Decisions
(1)
SAS has many operations available to help you
make decisions.
= eq, ~= ne, < lt, > gt, <= le, >=ge.
And, or, in()
and & requires both operands to be true.
or | requires at least one operand to be true.
in requires at least one comparison to be true.
Math operations:
** * / + -
Functions galore!
Logical Decisions
(2)
Compound Expressions
Common tests and common problems:
if YODeath < YOBirth then isBad="Y";
if Sex = "M" and numPreg>0 then isBad = "Y";
if Sex="M" and numPreg>0 or ageLMP>0 then isBad="Y";
*** bad ***;
if Sex="M" and (numPreg>0 or ageLMP > 0) then
isBad="Y";
*** good ***;
Moral: Use parentheses generously with ands and ors.
Logical Decisions
(3)
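The order of operations: Exponentiation first,
then Multiplication and Division, then Addition and Subtraction.
Operations of equal priority are evaluated from left to right
(except exponentiation, which goes right to left).
Parentheses override the normal order.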
score = sex**beta1 + cohort**yob-1900;
*** bad ***;
score = sex**beta1 + cohort**(yob-1900);
*** good ***;
Moral: Use parentheses generously if you don’t
do math all the time.
Logical Decisions (4)
Tricky code is bad!
This is so useful that you need to know it.
Yes = 1 and No = 0
data blah;
input sex $ @@;
datalines;
M M F
;run;
data blah2;
set blah;
isMale = (Sex="M");
run;
This is a boolean logic
check. A true statement is
replaced by 1 and a false
statement returns 0.
Complex Decisions
“If” is used for simple decisions. “Select”
is used for complex decisions.
data ovary.affected;
set ovary.rptca;
select;
when (refage <55) agegr = 1;
when (refage <60) agegr = 2;
when (refage <65) agegr = 3;
when (refage <70) agegr = 4;
when (refage >=70) agegr = 5;
end;
run;
Note: NO THEN after when.
Note: Select ends with end;
Setting Several Variables
If you need to create or modify several
variables when a condition is true, you
can use a do block:
data blah; set t_source;
length isearly islate $ 5; /* otherwise islate gets length 4 and 'false' is cut to 'fals' */
if yob > 1990 then do;
isearly = 'false';
islate = 'true';
end;
else do;
isearly = 'true'; islate = 'false';
end;
run;
Checking Variables
One of the most important tasks you will
do in this class is finding bad data. The
logic of the problem finding code is
always:
Check to see if something (bad) is true.
If the problem exists, document the problem.
Checking Variables
Data
(1)
data newData;
input @1 id 1.
@3 sex $1.;
datalines;
1 M
2 F
3
4 M
5 N
;run;
Checking Variables
Lousy
(2)
data _null_;
set work.newData;
if sex = 'M' or sex = 'F' then ;
else put id= sex=;
run;
data _null_ means: don't bother making
a new dataset.
put id= sex= writes these two variables
with labels to the log.
Checking Variables
Not Lousy
(3)
data _null_;
set work.newData;
if not (sex = 'M' or sex = 'F')
then put id= sex=;
run;
Checking Variables
Good
(4)
data _null_;
set work.newData;
if sex not in ('M', 'F') then
put id= sex=;
run;
Working With Variables
Functions
You can do simple calculations on variables in
the PDV like this :
varx = 1+blah;
varx+1; like varx=varx+1, except that varx starts at 0 and is kept from record to record
YearsSurvived = AgeAtInterview - AgeAtDx;
You can also use functions:
varx = max(of var1--var3);
-- means all variables between (and including) the two
specified, in their stored order
varx = min(of x1990-x1995);
- gets the minimum value in the variables x1990, x1991,
x1992, x1993, x1994 and x1995
Procedures vs Functions
(1)
Procedures are designed to work with
multiple records in a dataset. Procedures
are used to create printed summaries (or
new tables).
proc means data = blah mean;
var cScore1 cScore2 cScore3;
run;
ID  isMale  cScore1  cScore2  cScore3
1   1       5        1        3
2   0       5        4        2
3   0       4        3        4
4   1       4        3        1
5   1       5        3        1
Procedures vs Functions
(2)
Functions are designed to work on records
one at a time. Functions are used to create
new variables.
data new;
set blah;
theAverage = mean(of cScore1, cScore2, cScore3);
run;
The commas are optional.
ID  isMale  cScore1  cScore2  cScore3  theAverage
1   1       5        1        3        3
2   0       5        4        2        3.666666667
3   0       4        3        4        3.666666667
4   1       4        3        1        2.666666667
5   1       5        3        1        3
Functions
SAS Gives You 100’s of Them
Arithmetic
Character
Date and time
Financial
Mathematical
Probability
Quantile
Random Number
Sample Statistic
State & ZIP Code
Trig & Hyperbolic
Truncation
Frequently Used Functions(1)
Arithmetic
Abs(v)-returns the absolute value
Dim(v)-returns the current dimension of an array
HBOUND(v)-returns the upper bound of an array
LBOUND(v)-returns the lower bound of an array
Mod(v)-calculates the remainder
Sign(v)-returns the sign of the argument or 0
SQRT(v)-calculates the square root
theSD = sqrt(theVariance);
Frequently Used Functions(2)
Character
Compress(v)- removes blanks or specified characters from
a character variable
Index(v)- searches for a pattern of characters
Left(v)- left justifies a variable
LOWCASE/UPCASE(v)- converts all to lower or upper case
Reverse(v)- reverses characters
Scan(v)- scans for words
SUBSTR(v)- extracts a sub-string
Translate(v)- changes characters
Trim(v)- removes trailing blanks
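A short sketch of a few of these character functions working together is shown below. The variable names and the sample value are made up for illustration; the results are written to the log with put.

```sas
data _null_;
   raw = '  Smith, John ';
   lastName  = left(scan(raw, 1, ','));   /* first comma-delimited word, left justified */
   firstName = left(scan(raw, 2, ','));   /* second comma-delimited word */
   initials  = substr(firstName, 1, 1) || substr(lastName, 1, 1);
   shout     = upcase(trim(lastName));    /* drop trailing blanks, then upper case */
   noBlanks  = compress(raw);             /* removes every blank */
   commaAt   = index(raw, ',');           /* position of the first comma */
   put lastName= firstName= initials= shout= noBlanks= commaAt=;
run;
```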
Frequently Used Functions(3)
Date and Time
There is a quarter of a lecture on dates and
times because they are challenging in SAS:
Date – returns today's date as a SAS date value
Days since 01jan60
MDY – returns a SAS date value from three
variables holding month, day, and year
Year – returns the year from a SAS date value
Datepart – gives you the date part of a time and
date variable. You will need this if you import dates
from Excel.
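The date functions listed above can be tried out in a data _null_ step like this one. It is only a sketch: the variable names and the example date are made up, and dhms (which is not on the list above) is used just to build a date-and-time value for datepart to work on.

```sas
data _null_;
   today   = date();                  /* today as days since 01jan1960 */
   birth   = mdy(7, 4, 1976);         /* month, day, year */
   yob     = year(birth);             /* 1976 */
   stamp   = dhms(birth, 8, 30, 0);   /* a date and time value: seconds since 01jan1960 */
   justDay = datepart(stamp);         /* back to a plain SAS date */
   put today= date9. birth= date9. yob= justDay= date9.;
run;
```

Without the date9. format the dates would print as plain numbers, which is a good reminder that a SAS date is just a count of days.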
Frequently Used Functions(4)
Mathematical
Exp(v)-raises e (2.71828) to a specified power
Log(v)-calculates the natural logarithm (base
e)
LOG2(v)-calculates the logarithm to the base 2
Log10(v)-calculates the common logarithm
Lots of GAMMA functions
A trick for getting a factorial:
factorial = gamma(val+1);
Frequently Used Functions(5)
Probability
Poisson-calculates the Poisson prob. dist.
PROBBETA -calculates the beta prob. distribution
PROBBNML-calculates the binomial prob. dist.
PROBCHI-calculates the chi-squared prob. dist.
PROBF-calculates the F probability distribution
PROBGAM-calculates the gamma prob. dist.
PROBHYPR-calculates the hypergeometric prob. dist.
PROBIT-calculates the inverse normal distribution
PROBNEGB-calculates the negative binomial prob. dist.
PROBNORM-calculates the standard normal prob. dist.
PROBT-calculates a Student's t distribution
Frequently Used Functions(6)
Quantile
BETAINV (p) -returns a quantile from the beta
distribution
CINV (p) -returns a quantile from the chi-squared
distribution
FINV (p) -returns a quantile from the F distribution
GAMINV (p) -returns a quantile from the
gamma distribution
TINV (p) -returns a quantile from a Student's t
distribution
PROBIT (p) returns quantile from the standard
normal distribution
Frequently Used Functions(7)
Random Number
NORMAL(v)-generates a normally distributed pseudorandom variate
RANBIN(v)-generates an observation from a binomial
distribution
RANUNI(v) or UNIFORM(v)-generates a pseudo-random
variate uniformly distributed on the interval (0,1)
data fakebabies (keep = trimester fakeweight);
set grace.predictors;
fakeweight=fetal_wgt_+int(ranuni(77777)*10);
run;
Frequently Used Functions(8)
Calculations
CV(v)-calculates the coefficient of variation
MAX(v) or MIN(v)-returns the largest/smallest value
MEAN(v)-computes the arithmetic mean (average)
N(v)-returns the number of nonmissing arguments
NMISS(v)-returns the number of missing arguments
RANGE(v)-calculates the range
STD(v)-calculates the standard deviation
SUM(v)-calculates the sum of the arguments
VAR(v)-calculates the variance
Frequently Used Functions(9)
State and ZIP Code
ZIPNAME(v)-converts ZIP codes to state
names (all uppercase)
ZIPNAMEL(v)-converts ZIP codes to state
names (uppercase and lowercase)
ZIPSTATE(v)-converts ZIP codes to two-letter state codes
Frequently Used Functions(10)
Truncation
CEIL(v)-returns the smallest integer greater than or
equal to the argument
FLOOR(v)-returns the largest integer less than or
equal to the argument
FUZZ(v)-returns the nearest integer if the argument is within
1E-12 of it
INT(v)-returns the integer value (truncates)
ROUND(v)-rounds a value to the nearest round-off
unit
TRUNC(v)-truncates a numeric value to a specified
length
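The differences between these functions are easiest to see on one number. The following is a small sketch with made-up variable names; the values in the comments are what each function returns for 2.567.

```sas
data _null_;
   x = 2.567;
   up    = ceil(x);          /* 3 */
   down  = floor(x);         /* 2 */
   whole = int(-x);          /* -2: int chops off the decimals, toward zero */
   cents = round(x, 0.01);   /* 2.57: rounded to the nearest 0.01 */
   put up= down= whole= cents=;
run;
```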
Function Examples
If you need to get the cumulative
expected frequency for a binomial
distribution you can do something like
this:
* from Sokal and Rohlf 3rd Edition page 79;
data _null_;
x = CDF('BINOMIAL',3,.5, 17) ;
put x;
run;
Sorting
While you probably think of sorting as
nothing more than alphabetizing, sorting
in SAS gives you the power to:
Find duplicate records
Process related groups of data
Things like families in a data set
Data from the same decade
Case vs. control groups
Sorting
Syntax
(2)
proc sort data=ingrid.raw out=ingrid.sorted;
by fam_id; run;
/*delete observations with common BY values*/
proc sort data=ingrid.raw out=ingrid.sorted
nodupkey;
by dude_id; run;
Sorting
Syntax
(3)
If you want to get rid of duplicates do this:
proc sort data=ingrid.raw out=ingrid.sorted
nodupkey;
by _all_;
run;
Sorting (4)
Working With Sorted Data
Once you have a data set sorted, you
have the power to issue commands on the
first or last occurrence within a sorted set.
For example, if you have a variable that is
keeping track of the family IDs you can
have SAS do special things when it gets to
the first or last family member. More on
this in a week.
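As a preview, here is a minimal sketch of first. and last. processing that counts the members of each family. The data set family and the variables member and famSize are made up for this example; fam_id is the family ID variable.

```sas
data family;
   input fam_id member $;
   datalines;
2 Ann
1 Bob
2 Cal
1 Dee
1 Eve
;
run;

proc sort data=family out=sortedFamily;
   by fam_id;
run;

data _null_;
   set sortedFamily;
   by fam_id;                          /* creates first.fam_id and last.fam_id */
   if first.fam_id then famSize = 0;   /* start a new count at the first family member */
   famSize + 1;                        /* sum statement: kept from record to record */
   if last.fam_id then put fam_id= famSize=;
run;
```

The log shows fam_id=1 famSize=3 and then fam_id=2 famSize=2. The by statement in the data step only works because the data were sorted by fam_id first.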
Next Time
Security
More on problem detection
Descriptive statistics
Common graphics
Creating data sets with Procs
Common procedures revisited
Making things look nice
Titles, Footnotes, Labels
Proc Format
Before Next Time…
Cody & Smith 22-34, 45-75