Ch. 2: Getting Your Data into SAS

• At OSEDA we try to ease the burden of
converting data by having data already in
SAS data set format in the archive.
• But it is still important that you know how
to convert your own.
• If you understand input statements and proc
import, you’ll be well on your way to knowing
how to use put statements and proc export.
Four Basic Methods
• Direct data entry via Viewtable or SAS/FSP
windows.
• Reading raw data files (ASCII) with input
statements.
• Converting files from other packages such
as dbf, xls, wkn -- usually via proc import.
• Accessing data in other DB formats directly
via special engines that make them look like
SAS data sets. (Oracle access, for example.)
Entering Data Directly via Windows
• Not used much here. This is usually done
as part of an on-line transaction system,
which is not something we do much with
SAS.
• It can be done, however, and can be quite
sophisticated.
Reading Raw Data with Input Stmts
• This has traditionally been the most
important way we capture our data.
• This requires the most knowledge of the SAS
language. Sometimes the format of
the data makes it quite challenging.
• Reading .csv files has become very
important and common. SAS makes it
pretty easy.
Converting xls, dbf, etc.
• DDE (Windows only) option. Requires
having the other application running during
conversion.
• Proc import and the Import Data Wizard are
the easiest.
• Exporting the other application’s data to csv
and then importing remains an option.
Reading SPSS, Oracle, etc.
• We now have special “engines” that can be
used to make things such as Oracle tables
look just like a SAS data set.
• We have had good success accessing Oracle
tables via the engine.
• A nice part of doing it this way is that if the
source data changes you do not have to
reconvert. (Sometimes, though, that can itself
be a problem.)
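• A minimal sketch of the idea, assuming
SAS/ACCESS to Oracle is licensed (the user,
password, path, and table name below are all
placeholders):

libname oralib oracle user=scott password=tiger path=proddb;
proc print data=oralib.employees (obs=10); run;

Once the libname is assigned, oralib.employees
can be used anywhere a SAS data set name is
expected.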
Reading Raw Data Files
• For small data collections you can embed
the data lines in your own code, preceded
by a datalines; statement.
• Most of the time you’ll be dealing with data
stored in external files.
• The infile statement is used to point SAS to
the file to be read. The input statement does
the actual reading.
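• Two minimal sketches, one of each style (the
file path, variable names, and data are made up):

* Small data embedded in the program;
data small;
   input name $ age;
   datalines;
Ann 34
Bob 29
;
run;

* Data stored in an external file;
data big;
   infile 'c:\temp\people.dat';
   input name $ age;
run;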
Accessing the Sample Code
• Go to URL:
http://ftp.sas.com/samples/A56649
• Cut and paste to get individual programs
Long Records
• An annoying “gotcha” is the default of
truncating input lines at 256 bytes.
• Easy to circumvent if you know to code
lrecl=2000 (or some other large value) as an
option on the infile statement. This sets the
buffer size; it does not mean the records
really have to be that long, just that they
can’t be longer.
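• For example (the path and layout are made up):

data wide;
   infile 'c:\temp\wide.dat' lrecl=2000; * allow records up to 2000 bytes;
   input id $ 1-5 flag $ 1500;
run;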
Space Separated Data
• Avoid this except for really small sets of
data that you are keying in yourself.
• Too easy for something to go wrong.
• Character values containing blanks are a big
problem.
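• A small sketch of the problem (data made up):
with simple list input, a blank inside a value
splits it into two fields.

data cities;
   input city $ pop;
   datalines;
Columbia 84531
Jefferson City 43079
;
run;

The second observation comes out as city=Jefferson
with pop missing, because SAS tries to read the
word City as the numeric pop value.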
Reading Fixed Format File
• These are files where the “fields” are in the
same column locations on each input line.
• Read using a combination of column and
formatted read specifications.
• A good idea is to use formatted input for
everything. Keeps it simple.
Formatted vs. Column Input
• Input id $5. +1 year $4. +1 sales 7. ;
• Input id $1-5 year $7-10 sales 12-18;
• input id $char5. @7 year $4. @12 sales 7.;
• input id $char5. +1 year $4. +1 sales 7; *-!-;
• vs. free format:
– input id $ year $ sales ;
SAS builtin formats
• Used within input (and put) statements (and
functions) to convert between external
representations and SAS internal values.
• The $ is always the first character of a
character format or informat.
• Most references to formats require a period
at the end to distinguish them from variable
names.
• Input dob date7 … ; *<--what happens?-;
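• Without the trailing period, date7 is read as the
name of a second variable rather than as an
informat. A corrected sketch (data made up):

data birth;
   input dob date7.;    * date7. informat reads values like 01JAN60;
   format dob date9.;   * display as 01JAN1960 instead of a raw number;
   datalines;
01JAN60
;
run;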
Rich Collection of Formats
• As seen on pp. 34-35 of TLSB.
• Most of these can be used to write data (in a
put statement) as well as read data. They
just reverse the conversion process.
• Note the wide assortment of data and time
related formats.
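• For example, a quick sketch writing one date
value two ways in a put statement:

data _null_;
   x = '01JAN2024'd;    * date literal becomes a SAS date value;
   put x date9.;        * writes 01JAN2024;
   put x mmddyy10.;     * writes 01/01/2024;
run;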
Converting Raw Data at OSEDA
• A major purpose of creating the data archive
is to make the expensive and error-prone task
of getting the data properly converted
something that is done just once, by
professional programmers.
• Goal is to replace infile/input with set
statements as much as possible. Make
record-layout type documentation obsolete.
Need to Carefully Review
• It is extremely easy to make a mistake when
coding complex input statements.
• Always review the resulting data set(s) very
carefully before going on.
• Use Proc Contents and Proc Print in batch,
or the corresponding interactive windows in
DM to carefully inspect data sets.
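• A typical batch check might look like this (the
data set name is made up):

proc contents data=work.mydata;
run;
proc print data=work.mydata (obs=20);
run;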
SAS Data Sets
• Absolutely essential that you become very
comfortable with what these are, how they
are created, referenced, etc.
• For some reason, many new users find the
way SAS handles data set references really
hard.
• But it’s really pretty simple.
New in V7: Data Set Literals
• Starting with v7 you can code:
data 'c:\temp\boone_county';
• Before v7 this was heresy. You always had
to define the data library separately and use
a 2-level name to reference a data set within
that directory:
libname tempsas 'c:\temp';
data tempsas.boone_county;
One Level Data Set Names
• References to SAS data sets that have only
one level (i.e., no period, as in save.set1)
are assumed to be stored in a special
temporary SAS data library named work.
• An important exception to this is if you
define the special user libref. If such a
library is defined, SAS will store and look
for all 1-level names in that data library.
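• A minimal sketch of the user libref exception
(the directory is made up):

libname user 'c:\temp\mysas';
data set1;    * one-level name now goes to c:\temp\mysas, not work;
   x = 1;
run;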
SAS Data Libraries
• On Unix and Windows these are just
collections of SAS data sets stored in
directories.
• At least it used to be that simple. With v8
we can now define SAS data libraries that
include multiple directories.
• But the latter kind are still rare around here.
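• A sketch of such a concatenated library (the
paths are made up):

libname both ('c:\sasdata\new' 'c:\sasdata\old');
* members of both.xxx are searched for in the listed order;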
Creating a Permanent SAS Data Set
• Typically involves something such as:
libname mysas 's:\myname\mysas';
data mysas.data0901;
• The directory specified in quotes in the libname
statement must be allocated outside SAS (using
mkdir or the equivalent).
• The data set is stored as a system file within that
directory with name data0901. The extension will
vary with the engine/SAS version/platform.
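• Filled out as a complete step (the source data
set here is hypothetical; under v8 on Windows the
stored file would be data0901.sas7bdat):

libname mysas 's:\myname\mysas';
data mysas.data0901;
   set work.newdata;   * hypothetical source data set;
run;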
Entering Data with VT Window
• I have never done this or known anyone
who has.
• One of our goals at OSEDA is to be able to
have a record of how the data we use relates
to the original source.
• Keying in data like this can be tricky in that
it is way too easy to make a mistake that the
software cannot detect.
Reading Multiple Input Lines
• The input statement typically reads one line
of data to create one observation in a SAS
data set. However, ….
• You can use the “/” to tell SAS to go to the
next input line, or “#2” (or #3, etc.) to tell it
to position itself at the 2nd line (relative to
where it began reading.)
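• A sketch reading two raw lines per observation
(data made up); input name $ age #2 city $ would
read the same layout:

data people;
   input name $ age / city $;   * slash advances to the next line;
   datalines;
Ann 34
Columbia
Bob 29
Fulton
;
run;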
Load Data and Code for 2.12
• Go to the sample code at
http://ftp.sas.com/samples/A56649
• Search for text ‘2.12’
• Go to Windows to create c:\MyRawData.
• Type the DM command note rawdata to open
a SAS notepad window.
• Copy the data from your browser and paste
it to the notepad window.
Data & Code for 2.12 - cont.
• Use File-Save as to save the contents of the
notepad window to
c:\MyRawData\temperature.dat
• Go back and get the sample data step code
from the browser. Come back to SAS and
paste the code into the editor window.
• Submit the code.
Multiple Obs Per Line of Raw Data
• Use the double trailing @ to tell SAS to stay on
this line, even across cycles of the DATA step.
(The default is to flush the current record at the
end of each data step cycle.)
data scores;
   input name $ score @@;
   datalines;
Mike 25 Samuel 36 Melanie 40 Me 2
;
proc print; sum score; run;
Trailing @ to Read Part of Line
• A single trailing @ in an input stmt says to leave
the data pointer where it is.
• Subsequent input statements in the same step will
pick up where last input left off.
• But the record is released at the end of the data
step cycle (unlike with the double trailing @).
• Allows you to read part of a record and then
conditionally continue reading, or have alternative
reads based on “record type”.
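• A sketch of a record-type read with a single
trailing @ (the layout is made up):

data mixed;
   input rectype $1. @;    * read the type code, hold the line;
   if rectype = 'P' then input @3 name $20.;
   else if rectype = 'A' then input @3 street $20.;
   datalines;
P Joe Smith
A 123 S 5th St
;
run;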
Infile Statement Options
• Firstobs= and obs= commonly used to
begin/end at specific locations within file.
• Missover, truncover, and stopover (not in
TLSB) specify how to handle the case where
SAS would need to go to a new line to complete
the input read.
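• A sketch combining these options (the path and
layout are made up):

data part;
   infile 'c:\temp\big.dat' firstobs=2 obs=101 truncover;
   input id $ 1-5 value 7-14;
run;

Here firstobs=2 skips a header line, obs=101 stops
reading after line 101, and truncover keeps short
lines rather than reading on into the next record.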
Reading Delimited (.csv) Files
• You almost always want to use the dsd
option. Add the option dlm='09'x for
tab-delimited files.
• A problem with reading this way is that
character variables may not get the right
length assigned.
• Strongly recommend “declaring” all vars
before reading (e.g., with a length statement
prior to the input statement).
Example of Reading csv File
data class;
   length stud_id $6 Name $24 Address $40 City $20
          State $2 zip $5 gpa 5;
   infile datalines dsd;
   input stud_id -- gpa;
   put / _all_;
   datalines;
001234, Joe Smith,123 S 5th St,Columbia,MO,65201,3.0
003456, Mary Jones,909 Nifong, Jefferson City,MO,65103,3.5
;
run;
Running the Samples
• We have captured the sample code for the
text (TLSB) and edited/extended it for local
use.
• See the code in file
s:\sas8\Ch2_Samples.sas
s:\sas8\Ch3_Samples.sas
etc.
The %copyto Macro
• Utility macro used to copy the sample data
to our local data directory,
s:\sas8\RawData
• Uses a single positional parameter and a
single keyword parameter.
• Generates a simple SAS data step that reads
data from datalines and writes a file in the
RawData directory with name determined
by the parameters.
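• The macro itself is not reproduced in these
notes; a minimal sketch of what such a macro
might look like (the parameter names and the
put _infile_ copy loop are assumptions, not the
actual OSEDA code):

%macro copyto(fname, dir=s:\sas8\RawData);
   data _null_;
      infile datalines;       * read the data lines following the call;
      file "&dir\&fname";     * write them out under the given name;
      input;
      put _infile_;
%mend copyto;

It would be invoked as %copyto(temperature.dat),
immediately followed by a datalines; block, the
data, and a closing semicolon line.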
Run the Samples - Ch 2
• Use the new enhanced editor window.
• Do a file-open and select the
Ch2_Samples.sas program file.
• Note that all the steps which invoke
%copyto have already been run and the
results are already out there. But it will not
hurt if you rerun.
• You can edit the copyto macro and make
your own personal copies which you can
then play with.