An overview of SAS system and how to run it. What is SAS? SAS (Statistical analysis system) is many things. Among other things, it can be used for data management and report generation, as a statistical package including descriptive and inferential statistics (for a wide variety of problems) and as a general programming environment, including a portion called IML which allows for easy handling of matrices. The complete package is extremely comprehensive, and continues to expand at a rapid rate. This course provides an introduction to SAS with the primary focus on data management and statistical computing, including how to program to handle customized analyses. Important links through OIT: - Info on OIT Classrooms (which have SAS): http://www.oit.umass.edu/computer-classrooms - Info on SAS site licenses if you want to buy your own copy: http://www.oit.umass.edu/support/software/statistical-software-site-licences - Basic instructions for using SAS, including online documentation for SAS http://www.umass.edu/statdata/resources.html Where do I run SAS? a. On a PC with Windows: You can run SAS in any of the OIT computer classrooms or you can buy a copy (renewable each year) for your own machine. b. For graduate students in Mathematics and Statistics: There is a dedicated PC in LGRT1537 with SAS on it. c. SAS on demand. SAS now allows students to register and run SAS directly through a SAS server. For this class, you will need to register for SAS OnDemand for Academics and then access Client Bundle (Enterprise Guide and Enterprise Miner). Here is how to get started: Access the following Web site: http://support.sas.com/ondemand/index.html#account Review the information and follow the steps at this site. If you have additional questions about using SAS® OnDemand for Academics, see http://support.sas.com/ondemand or contact me. How do I run SAS? On a PC, SAS is run interactively in the windows environment. When SAS is invoked there are various sub-windows, the three main ones being the program window, the output window and the log window. When you are actually using programming statements (as opposed to click and go as in Analyst, etc.) the programming statements are put in the program window. The program is then submitted (by clicking on the running figure), with output going to the output window and the log window containing information (including error messages) about the job. To run only portions of the program in the program window, you highlight that portion and then submit.You can open and save files from the various windows. Various other aspects of running in this environment will be demonstrated in class and are described in the on-line help. The general structure of running SAS using programming statements versus “click and go”. a. Using programming statements. The data step creates a SAS data file. The data step might involve reading data from a file in a particular way or consist of programming statements which generate a data file (or a combination of the two). SAS data sets can also be created by importing data. Once a SAS data set is in place, there is a large number of pre-programmed procedures (procs) that can be run on that data. These are invoked with a proc statement. For the most part, each proc has a fixed set of rules and options that must be followed. b. Running SAS without “programming”. There are various options now available in SAS for running SAS without programming, but instead using drop down menus and “click and go”. We will not cover this type of analysis. The analyst, the main tool for the “click and go” analysis, is easy to use (although if you are proficient with SAS language, it is sometimes faster and easier to program directly), creates the accompanying SAS code, produces graphs and has some statistical tools not available in other parts of SAS (such as power and sample size calculations). At the same time, it does not have all of the options available through the procedures. Getting your data into SAS: entering data with viewtable Menu: tools table editor A blank viewtable is displayed. The rows (observations) are labeled with numbers and the columns (variables) are labeled with letters. You can enter the data and SAS will automatically figure out if your columns are numeric or character, according to the data values The column names can be changed by double clicking the default names. The column attributes such as the variable type, can be changed by right clicking the column names and open the Column Attributes Window. To save the data, click the save icon above the viewtable. Select a library, which corresponds to a directory, and then specify the member name of your table. Note if you choose the library Work, the data will only be available for the current SAS session. Open and edit existing tables by clicking the open icon above the blank viewtable. Or one can open an existing table through SAS explorer or windows explorer. To reference the table you created in viewtable, use libraryname.membername. For example, Proc print data=sasuser.test; Run; Getting started using SAS software The SAS program A SAS program is a sequence of statements executed in order Every SAS statement ends with a semicolon Layout of SAS programs is flexible 1) SAS statements can be in upper or lower case 2) Statements can continue on the next line (don’t split words) 3) Statements can be on the same line as other statements, but use semicolon to separate 4) Statements can start in any column Comments can be inserted into a program. Comments can be enclosed in asterisk (*) and semicolon (;), or a slash asterisk (/*) and an asterisk slash (*/) *Add a new variable to the existing data sasuser.test; data sasuser.test; set sasuser.test; height_cm=2.5*height; proc print data=sasuser.test; /* print the results*/ run; SAS programs are constructed from two basic building blocks: DATA steps and PROC steps. Most statements work in only one type of step. The basic differences between them: DATA steps PROC steps begin with DATA statements begin with PROC statements read and modify data perform specific analysis or function many statements can be used: input data, do only a handful of possible statements for each loops, if-then/else logic and many functions SAS procedure create a SAS data set produce results or report A step ends when SAS encounters a new step, a RUN statement, or the end of the program. Run statements tell SAS to run all the proceeding lines of the step and are among those rare global statements that are not part of a DATA or PROC step. SAS data sets SAS data sets are also called tables, observations are also called rows, and variables are also called columns. There are just two types of variables: numeric and character. A variable containing letters or special characters must be character data. Sometimes data that consist only of numerals make more sense as character data than as numeric, ZIP codes, for example. Missing character data are represented by blanks, and missing numeric data are represented by a single period Rules for SAS names: 1). Names must be 32 characters or fewer in length 2). Names must start with a letter or an underscore ( _ ) 3). Names can contain only letters, numerals, or underscores. No other special characters 4). Names can contain upper and lower case letters. SAS is insensitive to case. Viewing data sets with SAS explorer: menu: view explorer, where you can 1). Create new library 2). View existing data sets 3). List the properties of a SAS data set Submitting SAS programs SAS programs are entered into the editor window. Highlight the part of the program that you want to run, then click on the submit icon on top After submission, any notes, errors, or warnings associated with your program as well as the program statements themselves will appear in the log window. It is very important to check SAS log for errors before proceeding to the results. If your program generates any printable results, then they will appear in the Output window The Results window is like a table of contents for your output window Using SAS system options System options control SAS –how it works, what the output looks like, how much memory is used, error handing and a host of other things. In SAS windowing environment, two common ways of specifying system options: 1) Change selected options in the SAS System Options window. Menu: Tools options 2) Use the OPTIONS global statement as a part of your SAS program: OPTIONS LINESIZE=80 NODATE; Common options: CENTER|NOCENTER Controls whether output are centered or left-aligned. Default: CENTER DATE|NODATE Controls whether or not today’s date will appear at the top of each page of output. Default: DATE LINESIZE = n Controls the maximum length of output lines. Possible values for n are 64 to 256. Default varies according to system. NUMBER|NONUMBER Controls whether or not page number appear on each page of SAS output. Default: NUMBER ORIENTATION=PORTRAIT Specifies the orientation for printing output. Default: PORTRAINT ORIENTATION=LANDSCAPE PAGENO = n Starts numbers output pages with n. Default is 1. PAGESIZE = n Controls the maximum number of lines per page of output. Possible values for n are 15 to 32767. Default varies. RIGHTMARGIN = n Specifies size of margin (such as 0.75in or 2cm) to be used for printing output. Default: 0in LEFTMARGIN = n TOPMARGIN = n BOTTOMMARGIN = n YEARCUTOFF = yyyy Specifies the cutoff year for DATE informats/functions Getting your data into SAS: continued Methods of getting data into SAS: 1) Enter data directly with viewtable 2) Creating SAS data sets from raw data files 3) Converting other software’s data files into SAS data sets 4) Reading other software’s data files directly Reading files with the Import Wizard Menu: File Import Data. Show example: bands.csv The Import Wizard can read all types of delimited files including comma-separated values (CSV) files which are a common file type for moving data between applications. If you have SAS/ACCESS for PC File Formats software, then you can also read a number of popular PC file types (Excel, Access, Lotus, etc). Creating SAS data sets from raw data files Telling SAS where to find your raw data If your data are in raw data files (also referred to as text, ASCII, sequential, or flat files), using the DATA step to read the data give you the most flexibility. 1) Internal raw data: use the CARDS or DATALINES statement to indicate internal data. The CARDS or DATALINES statement must be the last statement in the DATA step. All lines in the SAS program following the DATALINES statement are considered data until SAS encounters a semicolon. The semicolon must be on a line by itself. DATA uspresidents; INPUT President $ Party $ Number; DATALINES; Adams F 2 Lincoln R 16 Grant R 18 Kennedy D 35 ; RUN; 2) External raw data: use the INFILE statement to tell SAS the filename and location DATA uspresidents; INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\president.txt'; INPUT President $ Party $ Number; RUN; SAS log shows the number of records read from the file, which should be checked to ensure the file is read correctly. In some operating systems, SAS assumes external files have a record length of 256 or less. The default can be changed with the LRECL= option in the INFILE statement: INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\president.txt' LRECL=2000; Reading Raw Data Separated by Spaces The simplest INPUT statement simply list the variable names after the INPUT keyword in the order they appear in the data file. This type of input is called list input (or free formatted input). It is appropriate for reading data separated by spaces. It is easy to use but have a few limitations: a. You must read all the data in a record – no skipping b. Any missing data must be indicated with a period c. Character data cannot have embedded spaces and cannot be greater than eight characters in length d. Special data such as dates needs formatted input, therefore cannot be read by list input DATA HTWT; INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt.txt'; INPUT ID gender$ age height weight; RUN; Reading Raw Data Arranged in Columns Column input can be used if each of the variable’s value is always found in the same place of the data line. It has the following advantages over list input: i. Spaces are not required between values ii. Missing values can be left blank iii. Character data can have embedded spaces or longer than 8 characters iv. You can skip unwanted variables DATA sales; INPUT visitingteam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28 ourhits 29-31 theirhits 32-34 ourRuns 35-37 theirRuns 38-40; *---+----1----+----2----+----3----+----4; DATALINES; Columbia Peaches 35 67 1 10 2 1 Plains Peanuts 210 2 5 0 2 Gilroy Garlics 151035 12 11 7 6 Sacramento Tomatoes 124 85 15 4 9 1 ; Reading Raw Data Not in Standard Format Numbers with embedded commas or dollar signs, and dates like 10-31-2003 or 31OCT03 are examples of non-standard data. These data can be read with formatted input. DATA contest; INPUT name $ 1-16 age type $ date mmddyy10. sccore1 *format date mmddyy10.; *format amount comma9.2; *---+----1----+----2----+----3----+----4----+; DATALINES; Alica Grossman 13 c 10-28-2003 7.8 $1,000.8 Matthew Lee 9 D 10-30-2003 6.5 $1,023.4 ; PROC PRINT DATA=contest; RUN; General format for informat (input format): Character Numeric $informatw. $informatw.d amount comma9.1; Date $informatw. Commonly used informats: a. For character: $w. b. For numeric: w.d, commaw.d c. For date: mmddyyw. Summary on types of input 1. list input: easy to use 2. column input: allow space in each variable field; do not require spaces between variables 3. formatted input: read special data /*Exercise1*/ Variable names: Name Age Type Date Score1-Score5 Alicia Grossman 13 c 10-28-2003 7.8 6.5 7.2 8.0 7.9 Matthew Lee 9 d 10-30-2003 6.5 5.9 6.8 6.0 8.1 Elizabeth Garcia 10 C 10-29-2003 8.9 7.9 8.5 9.0 8.8 /*Exercise2*/ Variable names: ID Name Score; 1 stevenson 89 2 cody 100 3 smith 55 4 gettlefinger 92 /*Exercise3*/ Variable names: Park Name State Year Acreage Yellowstone ID/MT/WY 1872 4,065,493 Everglades FL 1934 1,398,800 Yosemite CA 1864 760,917 Great Smoky Mountains NC/TN 1926 520,269 Wolf Trap Farm VA 1966 130 Mixing input styles One needs to be careful when mixing input styles. Note when SAS reads a line of raw data, it uses a pointer to mark its place. Each style of input uses the pointer a little differently. After reading a variable, 1. if formatted input or column input, the pointer is left in the column that immediately follows the last column that is read. 2. List input, on the other hand, uses a scanning method to determine the pointer location. With list input, the pointer reads until a blank is reached and then stops in the next column. To read the next variable value, the pointer moves automatically to the first nonblank column, discarding any leading blanks it encounters. There are three commonly used column pointers and two line pointers. Three column pointers: 1. @n, put pointer at column n 2. +n, move pointer to the right by n columns 3. @character, put pointer next to the end of the specified character data contest; input name $16. age 3. +1 type $1. +1 date mmddyy10. sccore1 5.1 dollar comma10.1; format dollar comma9.2; *---+----1----+----2----+----3----+----4----+; datalines; Alica Grossman 13 c 10-28-2003 7.8 $1,000.8 Matthew Lee 9 D 10-30-2003 6.5 $1,023.4 ; proc print data=contest; run; +2 or equivalently, data contest; input name $16. age 3. @21 type $1. @23 date mmddyy10. sccore1 5.1 @38 dollar comma8.1; *---+----1----+----2----+----3----+----4----+; datalines; Alica Grossman Matthew Lee 13 c 10-28-2003 7.8 $1,000.8 9 D 10-30-2003 6.5 $1,023.4 ; /* @'character' pointer */ data weblogs; input @'[' accessdate date11. @'/' file :$20.; datalines; 130.192.70.235 -- [08/Jun/2001:23:51:32 - 0700] "GET /rover.jpg http/1.1" 200 66820 128.32.236.8 -- [08/Jun/2001:23:51:40 - 0700] "GET /grooming.html http/1.0" 200 8471 ; proc print data=weblogs; run; /* Note that pointer can also be used to change the order of reading variables*/ Multiple lines of data per observation data highlow; input city $ state $ normalHigh normalLow RecordHigh RecordLow; datalines; Nome AK 55 44 88 29 Miami FL 90 75 97 65 Raleigh NC 88 68 105 50 ; proc print; run; /*SAS secret: the pointer moves to the next line if there are more variables than data on the current line*/ data highlow; input city $ state $; input normalHigh normalLow; input RecordHigh RecordLow; datalines; Nome AK 55 44 88 29 Miami FL 90 75 97 65 Raleigh NC 88 68 105 50 ; proc print; run; /*SAS secret: the pointer moves to the next line after an input statement;*/ Two line pointers: moves pointer to the next line (/) or the specified line (#n). Using line pointers, one can skip certain lines and read selective lines. data highlow; input city $ state $ / normalHigh normalLow #3 RecordHigh RecordLow; datalines; Nome AK 55 44 88 29 Miami FL 90 75 97 65 Raleigh NC 88 68 105 50 ; proc print; run; SAS secret: Data steps execute line by line and observation by observation data highlow; input city $ state $ / normalHigh normalLow #3; datalines; Nome AK 55 44 88 29 Miami FL 90 75 97 65 Raleigh NC 88 68 105 50 ; proc print; run; Multiple observations per line and reading part of raw data Use line holders: double trailing @@ or single trailing @. Double trailing @@: a double trailing at the end of the sas input statement instructs sas to hold that line of data, continuing to read observations until it either runs out of data or reaches an input statement that does not end with a double trailing. Single trailing releases the pointer to the next line until it either runs out of data, or reaches an input statement that does not end with trailing, or SAS reaches the end of the data step for an observation. Double trailing holds the line more strongly. The double trailing holds a line even when SAS starts building a new observation. @@ is mainly used to read data with multiple observations per line data rainfall; input city $ state $ normalRain meanDaysRain @@; datalines; Nome AK 2.5 15 Miami FL 6.75 18 Raleigh NC . 12 ; proc print; run; @ is mainly used to read part of raw data /* The following data step outputs "freeway" data only*/ data traffic; input type $ name $ 9-38 AMtraffic PMtraffic; if type="freeway"; Mtraffic=mean(of AMtraffic PMtraffic); *---+----1----+----2----+----3----+----4; datalines; freeway 408 3684 3459 surface Martin Luther King Jr. Blvd. 1590 1234 surface Broadway 1259 1290 surface Rodeo Dr. 1890 2067 freeway 608 4583 3860 freeway 808 2386 2518 surface Lake Shore Dr. 1590 1234 surface Pennsylvania Ave. 1259 1290 ; proc print; run; data traffic; input type $ @; if type="surface" then delete; if type="freeway" then input name $ PMtraffic; datalines; freeway 408 3684 surface Martin Luther King Jr. Blvd. 1590 surface Broadway 1259 surface Rodeo Dr. 1890 freeway 608 4583 freeway 808 2386 surface Lake Shore Dr. 1590 surface Pennsylvania Ave. 1259 ; proc print; run; 9-38 AMtraffic 3459 1234 1290 2067 3860 2518 1234 1290 /*double trailing and single trailing do not make a difference in the following example*/ data traffic; input type $ @@; if type="surface" then input name_c $ 9-38 AMtraffic PMtraffic; if type="freeway" then input name_n $ 9-38 AMtraffic PMtraffic; datalines; freeway 408 3684 3459 surface Martin Luther King Jr. Blvd. 1590 1234 surface Broadway 1259 1290 surface Rodeo Dr. 1890 2067 freeway 608 4583 3860 freeway 808 surface Lake Shore Dr. surface Pennsylvania Ave. ; proc print; run; 2386 2518 1590 1234 1259 1290 Exercise: Each line contains a three-digit number, a two-digit number, and a four-digit number, with four sets of numbers for each subject; datalines; 123121234217874444123872345873235432 192837465748392919283747372818182838 ; Solution 1: put the data in a file called, for example, ex.txt in the directory C:\Users\anna\Documents\stat597.F13. Then input with the following SAS program data ex; infile "C:\Users\anna\Documents\stat597.F13\ex.txt"; input num1 3. num2 2. num3 4. @@; Solution 2: with in-stream data (datalines), SAS automatically padded spaces to the datalines so that the datalines have width 80. There is no easy way of removing it (see a discussion for the exactly same problem here: http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0405a&L=sas-l&P=1399) One way to read this data with datalines: data ex; do set=1 to 4; input num1 3. num2 2. num3 4. @; output; end; datalines; 123567890123456789123567890123456789 321459876321459876321459876321459876 ; run; proc print;run; Reading delimited files with the data step To read files delimited by comma, or tab, use the dlm option with the infile statement. *Reading comma delimited files; data HTWT; infile datalines dlm=','; input ID name$ age height weight; datalines; 1, Ali G, 23, 68, 155 2, Mat L, , 61, 102 3, Liz G, 55, 70, 202 ; proc print;run; *Use dlm='09'X to read tab delimited files; data HTWT; infile "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt_tab.txt" dlm='09'X; input ID name$ age height weight; proc print;run; data HTWT; infile datalines dlm=',' dsd; input ID name $ age height weight; datalines; 1, "G, Ali", 23, 68, 155 2, "L, Mat", , 61, 102 3, "G, Liz", 55, 70, 202 ; proc print data=htwt; run; More options with the infile statement 1. DLM option. Two delimiters in a row is treated as one. 2. DSD option (DSD represents delimiter-sensitive data). It does: 1) Ignores delimiter enclosed in quotation marks 2) Strip off the quotation marks 3) Two consecutive delimiters indicate a missing value 3. firstobs= 4. obs= 5. lrecl=256 6. missover: missing values anticipated at the end of a data line 7. truncover: use truncover if you are using column or formatted input for the last variable and expect some of its data values are shorter than others. 8. pad: pad each line of data with blank spaces data HTWT; infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt_tab.txt' dlm='09'X lrecl=5; input ID gender$ age height weight; proc print data=htwt; run; Note The DATALINES automatically PADs the data whereas the external file does not. /*Exercise*/ Variable names: Name, Number, Address John Garcia 114 Maple Ave. Sylvia Chung 1302 Washington Drive Martha Newton 45 S.E. 14th St. ; /*Exercise*/ Variable names: BandName, GigDate, EightPM, NinePM, TenPM, ElevenPM Lupine Lights,12/3/2003,45,63,70, Awesome Octaves,12/15/2003,17,28,44,12 "Stop,Drop, and Rock-N-Roll",1/5/2004,34,62,77,91 The Silveyville Jazz Quartet,1/18/2004,38,30,42,43 Catalina Converts,1/13/2004,56,,65,34 ; Reading delimited files or PC files with the import procedure Proc import datafile='filename' out=data-set DBMS=identifier REPLACE; If filename is with extension .csv, .txt, .xls, dbf, .wk1, .wk3, .wk4, DBMS option can be skiped. Otherwise, use DBMS=DLM, and specify delimiter with DELIMITER='delimiter-character' statement; Proc import datafile='C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\Bands.csv' out=bandsData REPLACE; getnames=no; run; proc print;run; Permanent SAS data sets When a SAS data set is created and given a two level name: libname.memberName, it is a permanent unless the libname is WORK or unspecified. A library can be defined through point-and-click the menus. It can also be defined through a SAS statement: libname mysas "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12"; Proc import datafile='C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\Bands.csv' out=mysas.bandsData REPLACE; getnames=no; run; You can see a SAS data set called bandsData.sas7bdat created under the directory to which the library mysas corresponds. Another way to create permanent SAS data set is through direct referencing: data "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\bands"; infile datalines dlm="," dsd; input BandName :$50. GigDate :mmddyy10. EightPM NinePM TenPM ElevenPM; datalines; Lupine Lights,12/3/2003,45,63,70, Awesome Octaves,12/15/2003,17,28,44,12 "Stop,Drop, and Rock-N-Roll",1/5/2004,34,62,77,91 The Silveyville Jazz Quartet,1/18/2004,38,30,42,43 Catalina Converts,1/13/2004,56,,65,34 ; run; You can see a SAS data set called bandsData.sas7bdat created under the specified directory The contents procedure proc contents data=datasetName; Observation 1 Observation 2 Observation 3 Observation 4 Line 1 Line 2 Line 3 Input data SAS program Output data Observation 1 Observation 2 Observation 3 Observation 4