HRP 222 Lecture 2
Using The Data Step

Copyright © 1999-2001 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to the maximum extent possible under the law.

From Last Time(1)
Stanford Software
Stanford uses a security system called Kerberos to protect passwords going to and from machines on campus (e.g., your email account). The program that actually does the Kerberos authentication is called Leland. You can download it from here: http://www.stanford.edu/group/itss/ess/

From Last Time(2)
Logging into a UNIX machine to run SAS:
    Download and install Leland and Sampson.
    Start Sampson.
    The first time you run Sampson, push the add button and tell it you want to set up a connection to: tree.stanford.edu
    You can then follow the instructions for using X-Windows.

Help
SAS has a hypertext (weblike) set of help files that are just awesome! The help system is called OnlineDoc. The software ships on a separate CD, so you will need to request it separately from the typical SAS installation disks. UCLA has a copy on the web: http://sasdocs.ats.ucla.edu

Getting More Faster From the Data Step
    More on libraries and importing data
    How SAS really works – PDV
    Different types of variables
    Working with sets of variables
    The do block
    Calculations and simple statistics
    Sorting, processing “by”, “class”, first., last.

More About Libraries Assignment (0)
Recall from last time that SAS doesn’t know about folders full of data. It only knows libraries. You can use the point and click interface but it gets tiresome after a while.

More About Libraries Assignment (1)
You can do the same thing in one step by typing into the program editor:

    libname name 'where';
    libname ingrid 'c:\projects\ingrid\dis';
    libname ingridv6 v6 'c:\projects\ingrid\dis\old';

The name is anything eight or fewer alpha-numeric characters, but I usually add v6 or just 6 to keep things clear. The v6 keyword tells SAS to read and write old style SAS files.

More About Libraries Deleting Contents (2)
With a library called Ingrid you can get information on and modify its data with proc datasets. Read the text for more information on proc datasets. One important task you do with proc datasets is deleting datasets.

    proc datasets library=ingrid;
      delete sourceFiles;
    quit;

More About Libraries Data Sets’ Contents (3)
If you just want to know the variables of a single data set in alphabetical order, you can use proc contents.

    proc contents data=ingrid.raw;
    run;

To get the variables in their stored order, use:

    proc contents data=ingrid.raw position;
    run;

Knowing the variables’ order can help you do complex things.

Importing – Excel (1)
While you can use the import wizard to get data into SAS, most of the time you will want to write a program to do it.

    proc import out = work.testData
      datafile = "C:\Projects\HRP222-2001\TestData.xls"  /* no semicolon here: the statement is long so it is wrapped */
      dbms = Excel2000 replace;                          /* there is a semicolon here */
      sheet = "Raw";
    run;

The documentation is wrong when it says to put sheet before the semicolon.

Importing – Excel (2)
Make sure the Excel workbook is closed or the import will fail with an error message.
It is a good idea to have a single row indicating the variable names at the top of every column, but if you have no column headings you can include this line in the import procedure (put it below the sheet line):

    getnames=no;

Creating Data(1)
There are times when you need to create an entirely new data set without using an outside program like Excel.
You can do this in a data step using the input and datalines statements.

Creating Data(2) Example 1
[Figure: Example 1, SAS 8.x for Windows]
This works nicely if somebody gives you data as a text file or as part of the content of a website or an email.

Creating Data(3) Example 2 - Double Trailing @

    data life;
      input subj_id yob dob @@;
      datalines;
    1000100 1920 1942 1000101 1921 1942 1000102 1930 1995
    ;
    run;

Importing Data…. The Hard Way (1)
You also use the keyword input to get data from a stored text file. Instead of using datalines you use infile.

    data life;
      infile 'c:\projects\blah\life.txt';
      input subj_id yob dob;
    run;

Importing Data…. The Hard Way (2)
Reading data delimited with special characters is easy:

    data fakedata.life;
      infile 'c:\projects\genealogy.txt' DLM=':';
      input subj_id yob dob;
    run;

Reads in a text file with this data:

    1000101:1921:1941
    1000102:1930:1995

Comma delimited is commonly used with DSD:

    infile 'c:\projects\genealogy.txt' DSD;

Importing Data…. The Hard Way (3)
You can specify what column has the data you want as well as how wide it is:

    data rawblah;
      infile 'c:\projects\pam\prostate.dat';
      input @1  id     7.      /* a seven digit number with no decimal places */
            @3  race   1.
            @2  case   1.
            @24 refage 2.
            @99 l_name $ 10.;  /* this variable holds up to 10 characters */
    run;

New Datasets
The first thing you should do when you get data is look at it to see if it could be real.
    Check the “corner” values to make sure everything was imported.
    Check the range of values.
    Check the frequency of values.
    Never do descriptive stats without plots!

Looking at Data
The traditional way to look at data is with proc print.

    proc print data=work.blah;
      var id dob fname;
    run;

You can print out the corners of your data table.

    proc print data=rawWeight(firstobs=762 obs=762);  /* obs= should be called lastobs */
      var chart_numb gest_weeks;                       /* the first and last variables */
    run;

How SAS Really Works The Program Data Vector
SAS processes information a record at a time in the PDV.
The PDV tracks variable names and their contents, plus a couple of automatic variables (_N_ and _ERROR_). SAS forgets what is in the vector when it reads the next record, but you can force it to remember without too much effort.

Working With Variables
It is easy to add variables to the PDV. You can create a variable called isCase and set it to a value for everyone in a dataset like this:

    data newdata;
      set olddata;
      isCase = 'yes';
    run;

or you can conditionally assign a variable’s value with an if statement:

    if yob > 1967 then isYoung = 1;

The Power to Make Decisions
The “if-then-else” statement is used when you need to have SAS make a simple yes or no decision. It is frequently used to set values for variables in a data step.

    data ovary.affected;
      set ovary.rptca;
      if ca_site=1 or ca_site=26 then ov_can='yes';  /* a logical decision */
      else ov_can = 'no';
    run;

Logical Decisions (1)
SAS has many operations available to help you make decisions.
    Comparisons: = eq, ~= ne, < lt, > gt, <= le, >= ge
    Logical: and, or, in()
        and (&) requires both operands to be true.
        or (|) requires at least one operand to be true.
        in requires at least one comparison to be true.
    Math operations: ** * / + -
    Functions galore!

Logical Decisions (2) Compound Expressions
Common tests and common problems:

    if YODeath < YOBirth then isBad="Y";
    if Sex = "M" and numPreg>0 then isBad = "Y";
    if Sex="M" and numPreg>0 or ageLMP>0 then isBad="Y";    *** bad ***;
    if Sex="M" and (numPreg>0 or ageLMP > 0) then isBad="Y"; *** good ***;

Moral: Use parentheses generously with ands and ors.

Logical Decisions (3)
The order of operations: exponentiation first, then multiplication and division, then addition and subtraction. Multiplication/division and addition/subtraction are evaluated from left to right. Parentheses override the normal order.

    score = sex**beta1 + cohort**yob-1900;    *** bad ***;
    score = sex**beta1 + cohort**(yob-1900);  *** good ***;

Moral: Use parentheses generously if you don’t do math all the time.

Logical Decisions (4)
Tricky code is bad! This is so useful that you need to know it.
Yes = 1 and No = 0

    data blah;
      input sex $ @@;
      datalines;
    M M F
    ;
    run;

    data blah2;
      set blah;
      isMale = (Sex="M");
    run;

This is a boolean logic check. A true statement is replaced by 1 and a false statement returns 0.

Complex Decisions
“If” is used for simple decisions. “Select” is used for complex decisions.

    data ovary.affected;
      set ovary.rptca;
      select;
        when (refage <55) agegr = 1;   /* note: no THEN */
        when (refage <60) agegr = 2;
        when (refage <65) agegr = 3;
        when (refage <70) agegr = 4;
        when (refage >=70) agegr = 5;
      end;                             /* note: select ends with end */
    run;

Setting Several Variables
If you need to create or modify several variables when a condition is true, you can use a do block:

    data blah;
      set t_source;
      length isearly islate $ 5;  /* otherwise islate gets length 4 from 'true' and 'false' is cut to 'fals' */
      if yob > 1990 then do;
        isearly = 'false';
        islate = 'true';
      end;
      else do;
        isearly = 'true';
        islate = 'false';
      end;
    run;

Checking Variables
One of the most important tasks you will do in this class is finding bad data. The logic of the problem finding code is always:
    Check to see if something (bad) is true.
    If the problem exists, document the problem.

Checking Variables Data

    data newData;
      input @1 id 1. @4 sex $1.;
      datalines;
    1  M
    2  F
    3
    4  M
    5  N
    ;
    run;

Checking Variables Lousy (2)

    data _null_;                       /* don't bother making a new dataset */
      set work.newData;
      if sex = 'M' or sex = 'F' then ;
      else put id= sex=;               /* write these two variables with labels to the log */
    run;
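
The same check-then-document logic carries over to numeric fields. The sketch below is an illustration only: the dataset work.lifeCheck, the variable yod (year of death), and the data values are hypothetical and are not part of the course data. It assumes that a missing numeric value compares as smaller than any number in SAS, which is why the first test excludes missing values explicitly.

```sas
/* Hypothetical data, invented for illustration */
data work.lifeCheck;
  input subj_id yob yod;
  datalines;
1000100 1920 1995
1000101 1921 1919
1000102 1930 .
;
run;

/* Check whether something bad is true; if so, document it in the log */
data _null_;
  set work.lifeCheck;
  /* a missing yod would pass "yod < yob", so rule it out first */
  if yod ne . and yod < yob then put "Died before birth: " subj_id= yob= yod=;
  if yod = . then put "Missing year of death: " subj_id=;
run;
```

With these made-up records, the second and third subjects are the ones written to the log.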

Checking Variables Not Lousy (3)

    data _null_;
      set work.newData;
      if not (sex = 'M' or sex = 'F') then put id= sex=;
    run;

Checking Variables Good (4)

    data _null_;
      set work.newData;
      if sex not in ('M', 'F') then put id= sex=;
    run;

Working With Variables Functions
You can do simple calculations on variables in the PDV like this:

    varx = 1+blah;
    varx+1;   /* sum statement: like varx=varx+1, but varx starts at 0 and is retained across records */
    YearsSurvived = AgeAtInterview - AgeAtDx;

You can also use functions:

    varx = max(of var1--var3);
    varx = min(of x1990-x1995);

The double dash (--) means all variables between (and including) the two specified, in stored order. The single dash gets the minimum value in the variables x1990, x1991, x1992, x1993, x1994 and x1995.

Procedures vs Functions (1)
Procedures are designed to work with multiple records in a dataset. Procedures are used to create printed summaries (or new tables).

    proc means data = blah mean;
      var cScore1 cScore2 cScore3;
    run;

    ID  isMale  cScore1  cScore2  cScore3
    1   1       5        1        3
    2   0       5        4        2
    3   0       4        3        4
    4   1       4        3        1
    5   1       5        3        1

Procedures vs Functions (2)
Functions are designed to work on records one at a time. Functions are used to create new variables.

    data new;
      set blah;
      theAverage = mean(of cScore1, cScore2, cScore3);  /* the commas are optional */
    run;

    ID  isMale  cScore1  cScore2  cScore3  theAverage
    1   1       5        1        3        3
    2   0       5        4        2        3.666666667
    3   0       4        3        4        3.666666667
    4   1       4        3        1        2.666666667
    5   1       5        3        1        3

Functions SAS Gives You 100’s of Them
    Arithmetic
    Character
    Date and time
    Financial
    Mathematical
    Probability
    Quantile
    Random Number
    Sample Statistic
    State & ZIP Code
    Trig & Hyperbolic
    Truncation

Frequently Used Functions(1) Arithmetic
    Abs(v) - returns the absolute value
    Dim(v) - returns the current dimension of an array
    HBOUND(v) - returns the upper bound of an array
    LBOUND(v) - returns the lower bound of an array
    Mod(v) - calculates the remainder
    Sign(v) - returns the sign of the argument or 0
    SQRT(v) - calculates the square root

    theSD = sqrt(theVariance);

Frequently Used Functions(2) Character
    Compress(v) - removes blanks or specified characters from a character variable
    Index(v) - searches for a pattern of characters
    Left(v) - left justifies a variable
    LOWCASE/UPCASE(v) - converts all characters to lower or upper case
    Reverse(v) - reverses characters
    Scan(v) - scans for words
    SUBSTR(v) - extracts a sub-string
    Translate(v) - changes characters
    Trim(v) - removes trailing blanks

Frequently Used Functions(3) Date and Time
There is a quarter of a lecture on dates and times because they are challenging in SAS:
    Date - returns today's date as a SAS date value (days since 01jan60)
    MDY - returns a SAS date value from three variables holding month, day, and year
    Year - returns the year from a SAS date value
    Datepart - gives you the date part of a time and date variable. You will need this if you import dates from Excel.

Frequently Used Functions(4) Mathematical
    Exp(v) - raises e (2.71828) to a specified power
    Log(v) - calculates the natural logarithm (base e)
    LOG2(v) - calculates the logarithm to the base 2
    Log10(v) - calculates the common logarithm
    Lots of GAMMA functions

A trick for factorials:

    factorial = gamma(val+1);

Frequently Used Functions(5) Probability
    Poisson - calculates the Poisson prob.
    dist.
    PROBBETA - calculates the beta prob. distribution
    PROBBNML - calculates the binomial prob. dist.
    PROBCHI - calculates the chi-squared prob. dist.
    PROBF - calculates the F probability distribution
    PROBGAM - calculates the gamma prob. dist.
    PROBHYPR - calculates the hypergeometric prob. dist.
    PROBIT - calculates the inverse normal distribution
    PROBNEGB - calculates the negative binomial prob. dist.
    PROBNORM - calculates the standard normal prob. dist.
    PROBT - calculates a Student's t distribution

Frequently Used Functions(6) Quantile
    BETAINV(p) - returns a quantile from the beta distribution
    CINV(p) - returns a quantile from the chi-squared distribution
    FINV(p) - returns a quantile from the F distribution
    GAMINV(p) - returns a quantile from the gamma distribution
    TINV(p) - returns a quantile from a Student's t distribution
    PROBIT(p) - returns a quantile from the standard normal distribution

Frequently Used Functions(7) Random Number
    NORMAL(v) - generates a normally distributed pseudorandom variate
    RANBIN(v) - generates an observation from a binomial distribution
    RANUNI(v) or UNIFORM(v) - generates a pseudo-random variate uniformly distributed on the interval (0,1)

    data fakebabies (keep = trimester fakeweight);
      set grace.predictors;
      fakeweight=fetal_wgt_+int(ranuni(77777)*10);
    run;

Frequently Used Functions(8) Calculations
    CV(v) - calculates the coefficient of variation
    MAX(v) or MIN(v) - returns the largest/smallest value
    MEAN(v) - computes the arithmetic mean (average)
    N(v) - returns the number of nonmissing arguments
    NMISS(v) - returns the number of missing arguments
    RANGE(v) - calculates the range
    STD(v) - calculates the standard deviation
    SUM(v) - calculates the sum of the arguments
    VAR(v) - calculates the variance

Frequently Used Functions(9) State and ZIP Code
    ZIPNAME(v) - converts ZIP codes to state names (all uppercase)
    ZIPNAMEL(v) - converts ZIP codes to state names (uppercase and lowercase)
    ZIPSTATE(v) - converts ZIP codes to two-letter state codes

Frequently Used Functions(10) Truncation
    CEIL(v) - returns the smallest integer greater than or equal to the argument
    FLOOR(v) - returns the largest integer less than or equal to the argument
    FUZZ(v) - returns the nearest integer if the argument is within 1E-12 of it
    INT(v) - returns the integer value (truncates)
    ROUND(v) - rounds a value to the nearest round-off unit
    TRUNC(v) - truncates a numeric value to a specified length

Function Examples
If you need to get the cumulative expected frequency for a binomial distribution you can do something like this:

    * from Sokal and Rohlf 3rd Edition page 79;
    data _null_;
      x = CDF('BINOMIAL',3,.5, 17);
      put x;
    run;

Sorting
While you probably think of sorting as nothing more than alphabetizing, sorting in SAS gives you the power to:
    Find duplicate records
    Process related groups of data
        Things like families in a data set
        Data from the same decade
        Case vs. control groups

Sorting Syntax (2)

    proc sort data=ingrid.raw out=ingrid.sorted;
      by fam_id;
    run;

    /*delete observations with common BY values*/
    proc sort data=ingrid.raw out=ingrid.sorted nodupkey;
      by dude_id;
    run;

Sorting Syntax (3)
If you want to get rid of duplicate records do this:

    proc sort data=ingrid.raw out=ingrid.sorted nodupkey;
      by _all_;
    run;

Sorting (4) Working With Sorted Data
Once you have a data set sorted, you have the power to issue commands on the first or last occurrence within a sorted set. For example, if you have a variable that is keeping track of the family IDs, you can have SAS do special things when it gets to the first or last family member. More on this in a week.

Next Time
    Security
    More on problem detection
    Descriptive statistics
    Common graphics
    Creating data sets with Procs
    Common procedures revisited
    Making things look nice
        Titles, Footnotes, Labels
        Proc Format

Before Next Time…
Cody & Smith 22-34, 45-75
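
As a preview of the first. and last. processing mentioned under Sorting (4), here is a minimal sketch that counts the members of each family. It is an illustration only: the dataset names work.family, work.familySorted and work.famSize, the variables person_id and nMembers, and the data values are all hypothetical. It assumes the usual rule that a data step with a by statement needs its input sorted by that variable.

```sas
/* Hypothetical data, invented for illustration */
data work.family;
  input fam_id person_id;
  datalines;
2 21
1 11
1 12
3 31
3 32
3 33
;
run;

/* The by statement in the next data step requires sorted input */
proc sort data=work.family out=work.familySorted;
  by fam_id;
run;

data work.famSize (keep = fam_id nMembers);
  set work.familySorted;
  by fam_id;
  if first.fam_id then nMembers = 0;  /* reset at the first member of each family */
  nMembers + 1;                       /* sum statement: the count is retained across records */
  if last.fam_id then output;         /* write one record per family at its last member */
run;

proc print data=work.famSize;
run;
```

With these made-up records the result has one row per family, with counts of 2, 1 and 3 for families 1, 2 and 3.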