Introducing SAS® software

Acknowledgements to David Williams
Caroline Brophy
Statistics in Science

Need to know
– SAS environment
– SAS files (datasets, catalogs etc) & libraries
– SAS programs

How to:
– Get data in
– Manipulate data
– Get results out

SAS software environment

SAS Windows (SAS 9)

Some (!) SAS windows
– Editor: where code is written or imported, and submitted
– Log: what happened, including what went wrong
– Output: results of program procedures that produce output
– Explorer: shows libraries (SAS & Windows) and their files; where you can see data and graphs
– Results: shows how the output is made up of tables, graphs, datasets etc
– Notepad: a useful place to keep bits of code

SAS software programs

SAS Programs

    data one;
    input x y;
    datalines;
    -3.2 0.0024
    -3.1 0.0033
    . . .
    ;
    run;

    proc print data = one (obs = 5);
    run;

    proc means data = one;
    run;

– DATA step creates SAS data set
– PROC steps process data in data set

Step Boundaries

SAS steps begin with a
– DATA statement, or
– PROC statement.

SAS detects the end of a step when it encounters
– a RUN statement (for most steps)
– a QUIT statement (for some procedures)
– the beginning of another step (DATA statement or PROC statement).

Recommendation: use RUN; at end of each step

Step Boundaries

    data seedwt;
    input oz $ rad wt;
    datalines;
    Low 118.4 0.7
    High 109.1 1.3
    Low 215.2 2.9
    ;
    run;

    proc print data = seedwt;   /* no RUN; here: the next PROC statement ends this step */
    proc means data = seedwt;
    class oz;
    var rad wt;
    run;

Submitting a SAS Program

When you execute a SAS program, the output generated by SAS is divided into two major parts:
– SAS log: contains information about the processing of the SAS program, including any warning and error messages.
– SAS output: contains reports generated by SAS procedures and DATA steps.

Recommended steps!
1) Submit all (or selected) code
   – press F3 (or F8), or
   – click on the running-figure icon in the toolbar
2) Read log
3) Look in output window if you expect code to produce output
4) Problems
   – Bad syntax
   – Missing ; at end of line
   – Missing quote ' at end of title (nasty!)

Improved output - HTML
– Tools > Options > Preferences > Results
– Turn on HTML output there & resubmit code
– Check HTML output in Results Window

SAS data sets

SAS data sets
• SAS procedures (PROC … ) process data from SAS data sets
• Need to know (briefly!)
  – What a SAS data set looks like
  – How to get our data into a SAS data set

SAS data sets
• live in libraries
• have a descriptor part (with useful info)
• have a data part, which is a rectangular table of character and/or numeric data values (rows called observations)
• have names with syntax <libname.>datasetname
  libname defaults to work if omitted

work library

SAS data sets with a single part name like oz, wp or mybestdata99
1) are stored in the work library
2) can be referenced e.g. as mybestdata99 or work.mybestdata99
3) are deleted at end of SAS session!

Don’t lose your data! Keep the SAS program that read the data from its original source . . . More later!

Viewing descriptor & data

    /* view descriptor part */
    proc contents data = wp;
    run;

    /* view data part */
    proc print data = work.wp;
    run;

Alternatively, use SAS Explorer:
– Open (for data)
– Properties (for descriptor)
Properties is not as clear as CONTENTS

SAS variables

There are two types of variables:
• character: contain any value: letters, numbers, special characters, and blanks. Character values are stored with a length of 1 to 32,767 bytes (default is 8). One byte equals one character.
• numeric: stored as floating point numbers in 8 bytes of storage by default. Eight bytes of floating point storage provide space for 16 or 17 significant digits.
  You are not restricted to 8 digits. Don’t change the 8 byte length!

SAS variables

OUTPUT

The CONTENTS Procedure
Alphabetic List of Variables and Attributes

#    Variable    Type    Len
1    oz          Char    8
2    rad         Num     8
3    wt          Num     8

SAS names – for data sets & variables
• can be up to 32 characters long.
• can be uppercase, lowercase, or mixed-case, but are not case sensitive!
• must start with a letter or underscore. Subsequent characters can be letters, underscores, or numeric digits: no %$!*&#@ or spaces.

Missing Data Values

A value must exist for every variable for each observation. Missing values are valid values.

LastName    FirstName    JobTitle    Salary
TORRES      JAN          Pilot       50000
LANGKAMM    SARAH        Mechanic    80000
SMITH       MICHAEL      Mechanic    .
WAGSCHAL    NADJA        Pilot       77500
TOERMOEN    JOCHEN                   65000

A character missing value is displayed as a blank.
A numeric missing value is displayed as a period.

SAS syntax
• Not case sensitive
• Each ‘line’ usually begins with a keyword and ends with ;
• Common Errors:
  – Forgetting the ;
  – Misspelt or wrong keyword
  – Missing final quote in title

    title 'Woodpecker Habitat;    /* quote mark missing */
    title 'Woodpecker Habitat';

Comments
1. Type /* to begin a comment.
2. Type your comment text.
3. Type */ to end the comment.
• To comment selected typed text remember: Ctrl+/
• Alternative: * comment ;

Creating a SAS data set

Getting data in!

Consider 2 methods
1) Data in program (briefly!)
2) Data in Excel workbook

Getting data in!

Data in program file:

    data oz;
    input oz $ rad wt;
    datalines;
    Low 118.4 0.7
    High 109.1 1.3
    Low 215.2 2.9
    . . .
    ;
    run;

Note:
1. oz is a text variable so requires $
2. No missing values
3. Values of oz
   • don’t contain spaces
   • are at most 8 characters long

Getting data in! from Excel
• Use the IMPORT wizard, saving the program it generates to reduce future clicking!
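
A saved import program is typically a PROC IMPORT step. The lines below are only a sketch of what such a program might look like, not the wizard's exact output: the file path, sheet name and output data set name are hypothetical, and dbms = xlsx assumes the SAS/ACCESS Interface to PC Files is licensed at your site.

```sas
/* Sketch only: path, sheet name and data set name are hypothetical */
proc import datafile = "C:\mydata\seedwt.xlsx"  /* hypothetical workbook */
    out = work.seedwt                           /* SAS data set to create */
    dbms = xlsx                                 /* assumes SAS/ACCESS to PC Files */
    replace;                                    /* overwrite work.seedwt if it exists */
    sheet = "Sheet1";                           /* hypothetical worksheet name */
    getnames = yes;                             /* first row holds variable names */
run;

/* check what was read: descriptor part, then the first few rows */
proc contents data = work.seedwt;
run;

proc print data = work.seedwt (obs = 5);
run;
```

Resubmitting a program like this rereads the workbook without any clicking, which also protects against losing a work data set at the end of the session.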
Creating new variables

Adding a new variable to an existing SAS data set (say work.old)
1. Use set
2. Give definition of new variable

    data new;
    /* read data from work.old */
    set old;
    ysquared = y**2;
    logy = log(y);              /* natural log; missing when y <= 0 */
    logy_base10 = log10(y);
    t1 = (treat = 'B');         /* 1 if treat is B, otherwise 0 */
    run;

Data set: work.new

Obs    treat    y        ysquared    logy        logy_base10    t1
1      A        10.0     100.00      2.30259     1              0
2      A        100.0    10000.00    4.60517     2              0
3      B        -10.0    100.00      .           .              1
4      B        0.0      0.00        .           .              1
5      B        0.1      0.01        -2.30259    -1             1

Data Screening

Data Screening
checking input data for gross errors
• Use PRINT procedure to scan for obvious anomalies
• Use MEANS procedure & examine summary table
  – MAXIMUM, MINIMUM: reasonable?
  – MEAN: near middle of range?
  – MISSING VALUES: input or calculation error, e.g. log(0)?
  – CV (= 100*std.dev/mean): < 10% for plant growth; between 12 & 30% for animal production variables; > 50% implies skewness for any positive variable

MEANS syntax

What else should go here?

Dealing with data errors
• Check original records
• Change mistakes in recording where the correct value is beyond question
• Regenerate observations where possible
  – e.g. reweigh sample, redo chemical analysis
• With a large body of data in an unbalanced design, err on the side of omitting questionable data

Do not proceed until data has been properly cleaned
– if necessary perform a number of screening runs
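
The screening checks listed under Data Screening can be requested in a single MEANS step. The sketch below reuses the three seedwt observations from the Step Boundaries example, so it runs on its own. The choice of statistics and the maxdec = 2 setting are illustrative assumptions, not a prescribed list.

```sas
/* Small data set taken from the Step Boundaries example */
data seedwt;
input oz $ rad wt;
datalines;
Low 118.4 0.7
High 109.1 1.3
Low 215.2 2.9
;
run;

/* Scan the raw values for obvious anomalies */
proc print data = seedwt;
run;

/* Screening summary: count, missing count, range, mean, std. dev. and CV */
proc means data = seedwt n nmiss min max mean std cv maxdec = 2;
class oz;       /* separate summary for each level of oz */
var rad wt;     /* numeric variables to screen */
run;
```

NMISS shows how many values are missing for each variable, MIN and MAX show whether the range is reasonable, and CV is reported as a percentage (100*std.dev/mean), so it can be compared directly with the guideline figures above.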