Ch. 2: Getting Your Data into SAS

• At OSEDA we try to ease the burden of converting data by having data already in SAS data set format in the archive.
• But it is still important that you know how to convert your own.
• If you understand input statements and proc import, you'll be well on your way to knowing how to use put statements and proc export.

Four Basic Methods

• Direct data entry via Viewtable or SAS/FSP windows.
• Reading raw data files (ASCII) with input statements.
• Converting files from other packages such as dbf, xls, wkn -- usually via proc import.
• Accessing data in other database formats directly via special engines that make them look like SAS data sets. (Oracle access, for example.)

Entering Data Directly via Windows

• Not used much here. This is usually done as part of an on-line transaction system, which is not something we do much with SAS.
• It can be done, however, and can be quite sophisticated.

Reading Raw Data with Input Statements

• This has traditionally been the most important way we capture our data.
• It requires the most knowledge of the SAS language. Sometimes the format of the data makes it quite challenging.
• Reading .csv files has become very important and common. SAS makes it pretty easy.

Converting xls, dbf, etc.

• DDE (Windows only) is one option. It requires having the other application running during the conversion.
• Proc import and the Import Data Wizard are easiest.
• Exporting the other application's data to csv and then importing it remains an option.

Reading SPSS, Oracle, etc.

• We now have special "engines" that can be used to make things such as Oracle tables look just like SAS data sets.
• We have had good success accessing Oracle tables via the engine.
• The nice part of doing it this way is that if the source data changes you do not have to reconvert. (Sometimes, though, that's a problem.)

Reading Raw Data Files

• For small data collections you can embed the data lines in your own code, preceded by a datalines; statement.
• Most of the time you'll be dealing with data stored in external files.
• The infile statement points SAS to the file to be read. The input statement does the actual reading.

Accessing the Sample Code

• Go to URL: http://ftp.sas.com/samples/A56649
• Cut and paste to get individual programs.

Long Records

• An annoying "gotcha" is the default of truncating input lines at 256 bytes.
• Easy to circumvent if you know to code lrecl=2000 (or some large value) as an option on the infile statement. This sets up the buffer size; it does not mean the records really have to be that long. They just can't be longer.

Space-Separated Data

• Avoid this except for really small sets of data that you are keying in yourself.
• Too easy for something to go wrong.
• Character data containing blanks are a big problem.

Reading Fixed-Format Files

• These are files where the "fields" are in the same column locations on each input line.
• Read using a combination of column and formatted read specifications.
• A good idea is to use formatted input for everything. Keeps it simple.

Formatted vs. Column Input

• input id $5. +1 year $4. +1 sales 7. ;
• input id $1-5 year $7-10 sales 12-18;
• input id $char5. @7 year $4. @12 sales 7.;
• input id $char5. +1 year $4. +1 sales 7; *-!-;
• vs. free format:
  - input id $ year $ sales ;

SAS Built-in Formats

• Used within input (and put) statements (and functions) to convert between external formats and SAS internal formats.
• $ is always used as the 1st character of a character format or informat.
• Most references to formats require a period at the end to distinguish what they are.
• input dob date7 … ; *<-- what happens? -;

Rich Collection of Formats

• As seen on pp. 34-35 of TLSB.
• Most of these can be used to write data (in a put statement) as well as read data. They just reverse the conversion process.
• Note the wide assortment of date- and time-related formats.
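The formatted-input and lrecl advice above can be combined into one complete step. The sketch below is hypothetical: the file path is made up, and the column layout (id in 1-5, year in 7-10, sales in 12-18) is assumed to match the formatted-vs-column examples.

```sas
* Sketch only: file name and layout are illustrative assumptions. ;
data sales;
  * A large lrecl= avoids the 256-byte input-line truncation gotcha. ;
  infile 'c:\MyRawData\sales.dat' lrecl=2000;
  * Formatted input for everything, as recommended above. ;
  input id $char5. +1 year $4. +1 sales 7.;
run;
```

Note the trailing periods on $char5., $4., and 7. -- without them SAS would treat the names as variables rather than informats.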
Converting Raw Data at OSEDA

• A major purpose of creating the data archive is to make the expensive and error-prone task of getting the data properly converted something that is done just once, and by professional programmers.
• The goal is to replace infile/input with set statements as much as possible, and to make record-layout type documentation obsolete.

Need to Carefully Review

• It is extremely easy to make a mistake when coding complex input statements.
• Always review the resulting data set(s) very carefully before going on.
• Use proc contents and proc print in batch, or the corresponding interactive windows in the Display Manager, to carefully inspect data sets.

SAS Data Sets

• It is absolutely essential that you become very comfortable with what these are and how they are created, referenced, etc.
• For some reason, many new users see the way SAS handles data set references as really hard.
• But it's really pretty simple.

New in V7: Data Set Literals

• Starting with v7 you can code:
  data 'c:\temp\boone_county';
• Before v7 this was heresy. You always had to define the data library separately and use a two-level name to reference a data set within that directory:
  libname tempsas 'c:\temp';
  data tempsas.boone_county;

One-Level Data Set Names

• References to SAS data sets that have only one level (i.e., no period, as in save.set1) are assumed to be stored in a special temporary SAS data library named work.
• An important exception: if you define the special user libref, SAS will store and look for all one-level names in that data library.

SAS Data Libraries

• On Unix and Windows these are just collections of SAS data sets stored in directories.
• At least it used to be that simple. With v8 we can now define SAS data libraries that include multiple directories.
• But the latter kind are still rare around here.
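The one-level/two-level naming rules are easy to demonstrate. A minimal sketch, in which the libref, directory, data set name, and variable are all made up for illustration:

```sas
* Hypothetical illustration of one- vs. two-level data set names. ;
libname tempsas 'c:\temp';    * the directory must already exist ;

data tempsas.boone_county;    * two-level name: permanent, lives in c:\temp ;
  x = 1;
run;

data boone_county;            * one-level name: temporary, lives in WORK ;
  x = 1;
run;

* Review what you created before going on: ;
proc contents data=tempsas.boone_county;
run;
proc print data=tempsas.boone_county;
run;
```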
Creating a Permanent SAS Data Set

• Typically involves something such as:
  libname mysas 's:\myname\mysas';
  data mysas.data0901;
• The directory specified in quotes in the libname statement must be allocated outside SAS (using mkdir or the equivalent).
• The data set is stored as a system file within that directory with the name data0901. The extension will vary with the engine/SAS version/platform.

Entering Data with the VT Window

• I have never done this or known anyone who has.
• One of our goals at OSEDA is to be able to have a record of how the data we use relates to the original source.
• Keying in data like this can be tricky: it is way too easy to make a mistake that the software cannot detect.

Reading Multiple Input Lines

• The input statement typically reads one line of data to create one observation in a SAS data set. However, ….
• You can use "/" to tell SAS to go to the next input line, or "#2" (or #3, etc.) to tell it to position itself at the 2nd line (relative to where it began reading).

Load Data and Code for 2.12

• Go to the sample code at http://ftp.sas.com/samples/A56649
• Search for the text '2.12'.
• Go to Windows to create c:\MyRawData.
• Type the DM command "note rawdata" to open a SAS notepad window.
• Copy the data from your browser and paste it into the notepad window.

Data & Code for 2.12 - cont.

• Use File-Save As to save the contents of the notepad window to c:\MyRawData\temperature.dat
• Go back and get the sample data step code from the browser. Come back to SAS and paste the code into the editor window.
• Submit the code.

Multiple Obs Per Line of Raw Data

• Use the double trailing @ to tell SAS to stay on this line, even across cycles of the DATA step. (The default is to flush the current record at the end of each DATA step cycle.)
• Example:
  data scores;
  input name $ score @@;
  datalines;
  Mike 25 Samuel 36 Melanie 40 Me 2
  ;
  proc print; sum score;
  run;

Trailing @ to Read Part of a Line

• A single trailing @ in an input statement says to leave the data pointer where it is.
• Subsequent input statements in the same step will pick up where the last input left off.
• But the record is released at the end of the DATA step cycle (unlike with the double trailing @).
• This allows you to read part of a record and then conditionally continue reading, or have alternative reads based on a "record type".

Infile Statement Options

• firstobs= and obs= are commonly used to begin/end at specific locations within the file.
• missover, truncover, and stopover (not in TLSB) specify how to handle the case where SAS needs to go to a new line to complete the input read.

Reading Delimited (.csv) Files

• You almost always want to use the dsd option. Add the option dlm='09'x for tab-delimited files.
• A problem with reading this way is that character variables may not get the right length imputed.
• Strongly recommend "declaring" all variables (with a length statement prior to the input statement).

Example of Reading a csv File

  data class;
    length stud_id $6 Name $24 Address $40 City $20 State $2 zip $5 gpa 5;
    infile datalines dsd;
    input stud_id -- gpa;
    put / _all_;
  datalines;
  001234,Joe Smith,123 S 5th St,Columbia,MO,65201,3.0
  003456,Mary Jones,909 Nifong,Jefferson City,MO,65103,3.5
  ;
  run;

Running the Samples

• We have captured the sample code for the text (TLSB) and edited/extended it for local use.
• See the code in the files:
  s:\sas8\Ch2_Samples.sas
  s:\sas8\Ch3_Samples.sas
  etc.

The %copyto Macro

• A utility macro used to copy the sample data to our local data directory, s:\sas8\RawData.
• It takes a single positional parameter and a single keyword parameter.
• It generates a simple SAS data step that reads data from datalines and writes a file in the RawData directory, with the name determined by the parameters.

Run the Samples - Ch 2

• Use the new enhanced editor window.
• Do a File-Open and select the Ch2_Samples.sas program file.
• Note that all the steps that invoke %copyto have already been run and the results are already out there. But it will not hurt if you rerun them.
• You can edit the copyto macro and make your own personal copies, which you can then play with.
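A macro of the shape described for %copyto (one positional parameter, one keyword parameter, generating a data step that copies datalines to a file) might look roughly like this. This is a hedged sketch, not the actual s:\sas8 macro: the parameter names and default directory are assumptions.

```sas
* Sketch only: the real %copyto may differ in names and details. ;
%macro copyto(fname, dir=s:\sas8\RawData);
  data _null_;
    infile datalines truncover;
    file "&dir\&fname";   * target file named by the parameters ;
    input;
    put _infile_;         * copy each raw line through unchanged ;
%mend copyto;
```

Because a datalines block cannot appear inside a macro definition, the raw data would follow the macro call in open code: %copyto(temperature.dat), then a datalines; block, then the closing null statement and run;.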