SAS Lesson 2: More Ways to Input Data

In the previous lesson, the following statements were used to create a dataset.
DATA oranges;
INPUT state $ 1-10 early 12-14 late 16-18;
DATALINES;
Florida    130  90
California  37  26
Texas      1.3 .15
Arizona    .65 .85
;
This is an example of column input, in which the columns of text in which data are stored are
explicitly furnished to SAS. It may be impractical to follow this example for all datasets. SAS
has facilities for reading data stored in many different ways; some of these facilities are
described below.
Reading data from a text file
Suppose that the data above are stored in a folder on a drive. Blank spaces, not tabs, are used to
separate the values. Also, the file is a simple ASCII text file. There are no hidden codes in the
file pertaining to word processors (such as margins and font sizes) or spreadsheets (such as
formulas and graphs).
The following lines of code could be used to read the dataset:
DATA oranges;
INFILE 'D:\yields\oranges.dat' FIRSTOBS=3 OBS=6;
INPUT state $ 1-10 early 12-14 late 16-18;
The INFILE statement replaces DATALINES. INFILE gives the location of the external text
file, including the drive name and any subdirectories. The FIRSTOBS option tells SAS to skip
the first two lines of the file and to begin reading data on line 3. The OBS option tells SAS that
line 6 is the last line that contains legitimate data. If the text file were edited by removing
the top two lines and the bottom two lines, so that it contained only data, then an INFILE
statement that gives only the file location, with no FIRSTOBS or OBS options, would be sufficient.
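A minimal sketch of that simpler case, assuming the cleaned file is still saved at the same
location as before, might look like this:

```sas
DATA oranges;
* Assumes the file now holds only the four data lines, so FIRSTOBS and OBS are not needed;
INFILE 'D:\yields\oranges.dat';
INPUT state $ 1-10 early 12-14 late 16-18;
RUN;
```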
Reading data from the Internet
SAS, like R, offers the capability to read data from files available on the Internet. For example,
the eggs dataset contains data on the yearly average number of eggs produced by female king
crabs near Kodiak Island, Alaska. The following statements could be used to create a SAS
dataset.
FILENAME kodiak url 'http://lib.stat.cmu.edu:80/crab/eggs';
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
In these statements, kodiak is a nickname used by SAS to refer to the longer Internet address.
The statements create a dataset called eggdata which contains two variables, year and
numeggs. SAS can also read data from FTP sites, but you must supply the appropriate
information about the FTP site address, subdirectories, user names, and passwords. This
example shows how to obtain the eggs dataset by anonymous FTP.
FILENAME kodiak ftp 'eggs' cd='/crab/'
user='anonymous' pass='guest' host='lib.stat.cmu.edu';
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
Of course, it would be easy to find the eggs dataset in a Web browser, save the file as a text file
on your computer, and use the methods for reading data from a text file. However, reading the
data directly from the Internet is useful when the data files are very large. It is also convenient
when data files are continually updated; examples include stock market data, weather data, and
batting averages.
Creating and reading permanent SAS datasets
In all of the previous examples, the SAS datasets that were created were temporary. They
are kept in the temporary WORK library and can be used throughout the SAS session, but they
disappear when the SAS session ends.
Permanent SAS datasets can be created; these are stored on a disk and can be recalled easily in
future SAS sessions. Permanent SAS datasets are convenient to use when the amount of data is
large. Also, if you have to go through several steps to create a SAS dataset, you only need to do
those steps once if you create a permanent dataset.
A LIBNAME statement is needed to create a permanent dataset or to read one that has already
been created. The LIBNAME statement assigns a surrogate name to the location on a disk where
the permanent dataset is or will be stored. For example, consider the following statement:
LIBNAME college 'D:';
This prepares SAS to look in Drive D for permanent datasets. The name college is called a
libref (library reference). The names that you can supply for librefs follow the same rules as
dataset and variable names; however, you should not use the names LIBRARY, WORK, USER,
or anything starting with the three letters SAS, since these are reserved for special uses within
SAS. For example, the following statements create a permanent SAS dataset.
LIBNAME college 'D:';
DATA original;
INPUT dept $ 1-8 count 10-13 class $ 15-21;
DATALINES;
FineArts  449 day
Science  1411 day
Music     259 evening
Language  759 day
;
DATA college.enrolled;
SET original;
IF class='evening' THEN DELETE;
PROC PRINT;
RUN;
This creates the new file ENROLLED.SD on your drive. This is the new permanent SAS
dataset. Only SAS will be able to interpret this file; you will not be able to see its contents by
using a word processor or spreadsheet program. The two statements after the data lines work as
follows: "DATA name1; SET name2;" creates the dataset name1 from the dataset name2.
The "IF ... THEN DELETE;" statement removes every observation that satisfies the condition.
We will spend more time on these statements in Lesson 5. Thus, the dataset college.enrolled has
only the three observations with class='day'. In order to use the dataset that was just created,
you must refer to it with its full name, college.enrolled.
To retrieve this dataset from the disk in a later SAS session:
LIBNAME college 'D:';
DATA tempenrl;
SET college.enrolled;
PROC PRINT DATA=tempenrl;
RUN;
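If the permanent dataset only needs to be examined, copying it into a temporary dataset is
optional. As a sketch, assuming the libref college points to the same location as above, a
procedure can refer to the two-level name directly:

```sas
LIBNAME college 'D:';
* Print the permanent dataset directly, using its two-level name;
PROC PRINT DATA=college.enrolled;
RUN;
* List the variables and attributes stored in the permanent dataset;
PROC CONTENTS DATA=college.enrolled;
RUN;
```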
Delimited files
With list input, blank spaces are delimiters, or special characters used to separate the values of
variables in a line. SAS can also interpret other characters as delimiters. For example, suppose
that a dataset of student grades was stored in the text file A:\GRADES.TXT as
follows:
Ann/84/90/A-/0
Bill/78/84/B/0
Cathy/95/89/A/1
David/84/88/B+/1
Then, the following statements could be used to create the GRADES dataset.
DATA grades;
INFILE 'a:\grades.txt' delimiter='/';
INPUT name $ quiz test project $ absences;
The phrase 'dlm=' can be used in place of 'delimiter='. This option is used if a keyboard
character, such as a comma, slash, or asterisk, separates values in a line. Of course, the
character used to separate variables should not appear within a data value. For example, in a
comma-delimited file, the number 125,000 would have to be written as 125000;
otherwise, SAS would try to break it apart into two variables with values 125 and 000.
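To experiment with delimiters without creating an external file, the same idea can be tried with
in-stream data. The sketch below is an illustration rather than part of the original example: it
assumes the grades are typed after DATALINES with commas as separators, and it uses
INFILE DATALINES so that the dlm= option applies to the in-stream lines.

```sas
DATA grades;
* INFILE DATALINES lets INFILE options such as dlm= apply to in-stream data;
INFILE DATALINES dlm=',';
INPUT name $ quiz test project $ absences;
DATALINES;
Ann,84,90,A-,0
Bill,78,84,B,0
Cathy,95,89,A,1
David,84,88,B+,1
;
PROC PRINT DATA=grades;
RUN;
```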
Tabs are an exception to the use of the DELIMITER option. In a tab-delimited file, the
EXPANDTABS option replaces the DELIMITER option. In the example, if tabs had been used
in place of slashes, the proper statement for reading the data would be:
INFILE 'a:\grades.txt' expandtabs;
Column pointers
Suppose that the data set with student grades is stored in D:\GRADES.TXT as follows:
Ann   84 90 A- 0
Bill  78 84 B  0
Cathy 95 89 A  1
David 84 88 B+ 1
This file would be easy to read with column input and even easier with list input. However,
suppose that you only needed to use the students' names and project grades. You can use column
pointers to skip over undesired data. Column pointers use the @ symbol to tell SAS to begin
reading data at a specified column. In this example, we need to know that the names start in
Column 1 and the project grades start in Column 13. The following statements could be used.
DATA grades;
INFILE 'D:\grades.txt';
INPUT @1 name $ @13 project $;
The quiz grade, test grade, and absences do not appear in this dataset. As shown above, column
pointers can be used to skip over unneeded data.
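The same technique can be tried without an external file. The sketch below assumes the four
records are typed as in-stream data, laid out so that the names start in Column 1 and the project
grades start in Column 13; the PROC PRINT step is added only to check the result.

```sas
DATA grades;
* @1 and @13 move the pointer to Columns 1 and 13 before each variable is read;
INPUT @1 name $ @13 project $;
DATALINES;
Ann   84 90 A- 0
Bill  78 84 B  0
Cathy 95 89 A  1
David 84 88 B+ 1
;
PROC PRINT DATA=grades;
RUN;
```

Only name and project should appear in the printed dataset, since the other values are skipped.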
Mixed input
You may occasionally find it necessary or convenient to use a combination of input techniques
for a particular dataset.
For example, suppose that the dataset of grades appears as follows:
Ann       84 90 A- 0
Bill      78 84 B 0
Catherine 95 89 A 1
David     84 88 B+ 1
Recall that list input can be used only when character variables have 8 or fewer characters, with
no blanks. Catherine has 9 letters, so simple list input cannot be used. However, the absences
are not neatly aligned in a column, and the last four variables would be easy to read with list
input. The following statements could be used:
DATA grades;
INFILE 'A:\grades.txt';
INPUT name $ 1-9 quiz test project $ absences;
It is also possible to use list input, column input, and column pointers simultaneously. For
example, if you only needed the name, project grade, and absences, you could use the
following INPUT statement:
INPUT name $ 1-9 @17 project $ absences;
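As a sketch of this combination, the statements below read in-stream data instead of an external
file. The layout is an assumption made for illustration: names occupy Columns 1-9, project
grades start in Column 17, and the absences follow after a single blank.

```sas
DATA grades;
* Column input for name, a column pointer to reach project, then list input;
INPUT name $ 1-9 @17 project $ absences;
DATALINES;
Ann       84 90 A- 0
Bill      78 84 B 0
Catherine 95 89 A 1
David     84 88 B+ 1
;
PROC PRINT DATA=grades;
RUN;
```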
Line pointers
So far, all of the data for each observation have appeared in one line. You may occasionally
encounter data in which the variables for one observation appear in two or more consecutive
lines, as shown below:
Ann
84 90 A- 0
Bill
78 84 B 0
Cathy
95 89 A 1
David
84 88 B+ 1
You may need to use a line pointer to read such data. A line pointer is like a column pointer,
except that it specifies the line on which SAS should begin reading the data. A slash (/) tells
SAS to skip to the next line, and #number tells SAS to go to that line of an observation's data to
resume reading data. If the data above were stored in A:\GRADES.TXT, a dataset could be
created in SAS as follows:
DATA grades;
INFILE 'a:\grades.txt';
INPUT name $ / quiz test project $ absences;
Equivalently, you could use the following INPUT statement:
INPUT name $ #2 quiz test project $ absences;
Since the data consist of simple characters and numbers, the following INPUT statement could
also be used. Notice that there are no line pointers.
INPUT name $ quiz test project $ absences;
SAS will automatically go to the next line of data to complete the set of variables listed in the
INPUT statement. However, if the data are irregular (missing values, blanks in character
variables, etc.), then line pointers may be necessary.
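For a version that can be run without an external file, the sketch below assumes the same
two-line records are typed as in-stream data. It names both lines explicitly with #1 and #2, which
is one way to make the intended layout visible in the INPUT statement.

```sas
DATA grades;
* #1 and #2 refer to the first and second lines of each observation;
INPUT #1 name $ #2 quiz test project $ absences;
DATALINES;
Ann
84 90 A- 0
Bill
78 84 B 0
Cathy
95 89 A 1
David
84 88 B+ 1
;
PROC PRINT DATA=grades;
RUN;
```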
Multiple observations on one line
Reading multiple observations from one line was discussed in Lesson 1, using the @@ symbol.
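As a brief reminder, and assuming a hypothetical data line that holds several name and quiz
pairs, a trailing @@ tells SAS to hold the current line so that successive passes through the
DATA step keep reading from it:

```sas
DATA quizzes;
* The trailing @@ holds the line across iterations of the DATA step;
INPUT name $ quiz @@;
DATALINES;
Ann 84 Bill 78 Cathy 95 David 84
;
PROC PRINT DATA=quizzes;
RUN;
```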