Using Mixed Processing Methods to Read Raw Data Files

Using Mixed Processing Methods to Read Raw Data Files Transcript was developed by Ted Durie.
Additional contributions were made by Cindy Cragin, David Ghan, Linda Mitterling, and Bruce Reed.
Editing and production support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
Using Mixed Processing Methods to Read Raw Data Files Transcript
Copyright © 2009 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E1445, course code RLSPCMI, prepared date 19Feb2009.
RLSPCMI_001
ISBN 978-1-60764-033-2
For Your Information
Table of Contents
Lecture Description
Prerequisites
Using Mixed Processing Methods to Read Raw Data Files
1. Investigate File Layouts
2. Describe SAS Input Styles
3. Read Mixed Data Type Records
Lecture Description
In this lecture you’ll see how to process raw data files with structures that change within records of data.
The following topics are addressed: using SAS as a tool to examine the structure of various raw data files,
choosing the appropriate input styles for different structures in a raw data file, and using the appropriate
style as formats change within a record.
To learn more…
For information on other courses in the curriculum, contact the SAS Education
Division at 1-800-333-7660, or send e-mail to training@sas.com. You can also
find this information on the Web at support.sas.com/training/ as well as in the
Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at
1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the
USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a
complete list of books and a convenient order form.
Prerequisites
Before viewing this lecture, you should be familiar with the DATA step, the PRINT procedure, and
INFILE and INPUT statement syntax. You should also understand IF-THEN/ELSE logic and be able to
view raw data using text editors. You can gain this knowledge by completing the SAS Programming 1:
Essentials (PRG1) course.
Using Mixed Processing Methods to
Read Raw Data Files
Welcome to this e-Lecture on Using Mixed Processing Methods to Read Raw Data Files. My name is
Ted and I am an instructor for SAS. Today we will be looking at how to process raw data files with
structures that change within records of data.
1. Investigate File Layouts
2. Describe SAS Input Styles
3. Read Mixed Data Type Records
The following three topics will be covered as part of this e-Lecture:
• First, we will examine the structure of various raw data files and we will look at a SAS tool to use for
this purpose.
• Next, we will discuss how to choose the appropriate input styles for the different structures that might
exist in a raw data file.
• Then, after we have a clear understanding of the different input styles that SAS provides, we will take a
look at using the appropriate style as formats change within a record.
1. Investigate File Layouts
Let’s start by examining the structure of different kinds of raw data files.
Objectives
• Recognize the need to familiarize yourself with a file layout.
• Use the FSLIST procedure in SAS to examine raw data file structure.
• Investigate various record layouts.
In this first section, we will take a look at the kinds of information that we will need to supply SAS to
read in raw data records and create SAS data sets. We will look at different layouts and we’ll do this by
opening up several raw data files with a tool in SAS.
File Description
This is the file layout for the offers.dat raw data file:

Field Description   Columns   Data Type
Customer Type       1-4       Character
Offer Date          5-12      Date in the format of mmddyy8.
Item Group          14-21     Character
Discount            22-24     Numeric with percent signs
The first step in processing a raw data file is to determine the layout of the data values in the data records.
The arrangement of values can change from field to field, or even from record to record. The good news
is that SAS has different input styles to handle different layouts.
To determine which style to use for reading data values into SAS, someone either needs to hand you some
sort of codebook that explains the file layout to you, or you will need to open the data file and examine it.
The table shown here is what I am referring to as a codebook or file layout. As you can see, it contains
the descriptions, column locations, and data types of the fields of information. The task of converting a
raw data file into a SAS data set can be greatly simplified if the person who created the file can give you
this type of information. But this often is not the case.
Raw Data File
This is a partial listing of the offers.dat raw data file:
Partial raw data file: offers.dat
104012/02/07 Outdoors15%
202010/07/07 Golf     7%
103009/22/07 Shoes   10%
103009/22/07 Clothes 10%
202007/08/07 Clothes 15%
203007/08/07 Clothes 25%
More often, you are going to have to open the file to become familiar with the file layout. Here, I have
opened a file named offers.dat. By looking at the file, I can see that there is some sort of ID number
in the first field and all of these values are fixed within certain column positions. Then it looks like there
might be some sort of date value here and it looks like it is always 8 bytes in length. So, without a
codebook, I have to view the data in the file to determine its structure.
General Syntax of FSLIST Procedure
Any tool that can read text can open a raw data file, such as
• Notepad
• Microsoft Word
• PROC FSLIST in SAS
General syntax for the FSLIST procedure:
PROC FSLIST file='<directory-location/filename>';
There are many different tools that you can use to view a raw data file, including text editors, like
Notepad or Microsoft Word. In this e-lecture, we will be using a SAS procedure called FSLIST, because
it allows us to view contents of a raw data file regardless of size or the operating system where the data
resides.
This is the general syntax for the FSLIST procedure. It starts with PROC FSLIST, then you will typically
specify the directory location where the file is stored and the name of the raw data file to be examined.
Let’s look at an example of opening a raw data file with PROC FSLIST.
Examining a Raw Data File
Open the offers.dat raw data file with PROC FSLIST.
proc fslist file='s:\workshop\offers.dat';
run;
Here I am opening a file named offers.dat with the FSLIST procedure. I say: proc fslist
file= , and then I have specified the fully qualified path to the raw data file. Note that if you execute
this procedure in a non-interactive SAS session or batch job, then you will get a report listing of the data.
If you execute the procedure in an interactive session, such as in SAS Display Manager, then the FSLIST
window will open showing the contents of the raw data file as you see here.
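As a side note, here is a minimal sketch of the same step written with a fileref instead of a quoted path. The fileref name offers is my own choice for illustration and does not come from the lecture code. Treat it as one possible variation, assuming the FILENAME statement and the FILEREF= option of PROC FSLIST are available in your SAS session.

```sas
/* Hypothetical fileref name (offers); the path is the one used in the lecture. */
filename offers 's:\workshop\offers.dat';

/* Browse the raw data file through the fileref instead of a quoted path. */
proc fslist fileref=offers;
run;

/* Release the fileref when it is no longer needed. */
filename offers clear;
```

A fileref can be handy when the same raw data file is opened in FSLIST and then read in a DATA step, because the path is written only once.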
Investigate Raw Data File Structures
This demonstration illustrates different raw data file layouts.
Great. Well now that we have a tool to use in SAS to open our raw data files, let’s take a look at some of
the different file structures that we might run across.
I’m going to open several files with PROC FSLIST. Here is code that I wrote earlier so you don’t have to
sit and watch me type! Let me highlight this first set of code that opens up supplier.dat. The
FSLIST window opens, and here you see the contents of the file. I am going to turn on a command in this
window to make it easier for me to see how the data values are laid out within each record. It is called the
COLS command. I type it here on the command bar. It creates this nice ruler that I can use to determine
the position of the data values in the file.
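If you want a similar ruler without opening the FSLIST window, one alternative sketch is to dump a few records to the SAS log with a DATA step. This is not part of the demonstration; it assumes the same file path used in this lecture and uses the LIST statement, which writes the current input record to the log under a column ruler.

```sas
/* Sketch: write the first five records of the raw file to the SAS log. */
data _null_;                                  /* no output data set is created */
   infile 's:\workshop\supplier.dat' obs=5;   /* stop after five records */
   input;                                     /* load the record into the input buffer only */
   list;                                      /* print the buffer to the log with a column ruler */
run;
```

The choice of five records is arbitrary; raise or remove OBS= to see more of the file.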
Let’s look at this file named supplier.dat. Notice that the data fields in this file always start in the
same position from record to record. And notice that the data is fixed within a certain width. What I mean
by that is, this first set of values is fixed between columns 1 and column 7. The next set of values is fixed
between columns 8 and 39. So, the data here is fixed in particular columns. Another item to note in this
file is that it contains all standard data values. There is nothing special about the data values. There are no
date values, or numeric values that contain dollar signs or commas or percent signs. These are just
standard character and numeric values. So, as we look through these files, these are the two questions or
pieces of information that we will be looking for:
• First, is there structure to the values – in other words are the data fields in fixed columns?
• And, second, are there any non-standard data values?
Let me go back to the editor again. This next file is named offers.dat. We actually just saw this file
earlier in this section. When I submit this code, we see that this file contains fixed data values just like the last file that we
opened. But in this file, there are special data values as you can see here – this is a date value and this last
set of values contains percent signs. This is all ok to SAS. It just means that we need to give SAS this
information when it is reading in the file so it can convert the values into the proper form in the data set
being created. So, the data in this file is fixed with some special processing needed.
Let me go back to the code in my editor and let’s open another raw data file named sales.txt. Let’s
ask the first question, “Are the data values in fixed columns?” The answer here is “no”. Here we have
what we call free-format data. Notice that the data does not line up in nice fixed columns. This type of
data is definitely readable in SAS, but will require a different style of input than the fixed data files that
we just saw. Glancing through the records, we ask the second question, “Do we need to deal with any
special data values?” Yes, it looks like there are two date fields here and another thing to note about this
file is that the data values are separated by commas. That’s another piece of information that we need to
know when selecting the appropriate input style in SAS to read the file.
And, let me open one more file…. Here we have a file that is mixed. It contains some fixed data and some
free-format data. We’ll see how to handle a mixed type file in this e-lecture as well.
In this demo, we have seen several different file structures. Next we want to look at the different methods
or styles of input that SAS has to handle these different file structures.
2. Describe SAS Input Styles
In Section 2, we will describe the commonly used input styles available in SAS. This section is a quick
review of the essential features of these input styles. These topics are covered more fully in our SAS
Programming 1: Essentials and SAS Programming II: Accessing and Manipulating Data courses, where
additional options are introduced and there are more data scenarios presented than we are able to cover
here.
Objectives
• Describe the commonly used input styles in SAS.
• Identify the type of input to use for different file structures.
So, let’s take a look at the input styles that are most commonly used in SAS and we’ll look at the different
file structures and talk about which input style needs to be used for each type of structure.
Input Styles
There are three commonly used styles of input in SAS:
• Column
• Formatted
• List
[Figure: raw data is read by a DATA step (data …; infile …; input …; run;) to create a SAS data set.]
Records of raw data files are read into SAS in the DATA step through an INPUT statement. The INPUT
statement takes the values and converts them appropriately and places them into a SAS data set.
To address the different types of layouts that can exist within raw data files, there are three commonly used styles of input in SAS. They are called Column input, Formatted input, and List input. In the next
few slides, we’ll see what the syntax for these styles looks like on the INPUT statement, and how they
correspond to the record layouts I showed you in the previous section. If you have used INPUT
statements in SAS before and are familiar with the differences between each style of input, then some of
the information in this section may be review to you. If so, you can skip directly to the next section.
General Form of Column Input
Column input is appropriate for reading the following:
• data in fixed columns
• standard character and numeric data
General form of a column INPUT statement:
INPUT variable <$> startcol-endcol ;
We’ll begin our discussion by looking at Column input. This mode of input is used when the data is in a
fixed-width form and there are only standard character and numeric values, with no special values, such as
dates or currency values containing dollar signs or commas, that you want read in as numeric.
When Column input is used to process raw data, the INPUT statement will consist of three components.
• First, you need to provide the name of the variable that will be created.
• Second, specify a “$” for data that will be stored in a character variable. Omit the “$” if the data will be
put into a numeric variable.
• Third, provide beginning and ending column locations for the field in the external file.
You will need to provide this set of three specifications for every field that will be processed from the raw
data file. Now, this does not mean that when you are writing your INPUT statement, you have to include
all of the data contained in a record. In fact, you can choose only to read in a subset of the data values. For
instance, if the raw data file has 20 fields of data, but you only need to process 6 or 7 of them when
working in SAS, then you would specify a variable name and beginning and ending column locations for
just those 6 or 7 fields of data that you want to include.
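To make the idea of reading only a subset concrete, here is a small sketch that reads two of the fields from supplier.dat and skips the rest. The column positions are the ones given for supplier.dat in this lecture; the data set name work.supplier_subset is hypothetical.

```sas
/* Sketch: read only two fields from each record of supplier.dat. */
data work.supplier_subset;            /* hypothetical data set name */
   infile 's:\workshop\supplier.dat';
   input supplier_name $ 8-39         /* character value in columns 8-39  */
         country_code  $ 40-41;       /* character value in columns 40-41 */
run;
```

Columns 1-7 are simply never mentioned on the INPUT statement, so no variable is created for them.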
Processing Raw Data File With Column Input
Partial raw data file: supplier.dat
data work.supplier;
infile 's:\workshop\supplier.dat';
input supplier_name $ 8-39
id $ 1-7
country_code $ 40-41;
run;
If your data is structured like this file, then Column input is the best approach. The values are fixed within
columns, and there are no special fields that need instructions for SAS as it is reading in the values. And,
a little twist here, I would like to change the order of the data when it is read into the work.supplier
data set. I want to see the name of the supplier first, followed by the supplier’s id and then their country
code. So, here is the Column INPUT statement that I will write:
• Start with the keyword INPUT.
• Then we tell SAS that we would like to create a variable named supplier_name. We tell SAS that
this is a character variable by using the dollar sign. Then we need to tell SAS where it can find the
values for supplier_name. So, we specify columns 8-39 in the raw data file.
• Next, we tell SAS to create the variable ID. These values are all numeric, but we have no intention to
do any sort of numeric calculation with them. So, I will store them as a character variable. This will
save storage space in my new data set. If I stored them as numeric, they would take up a default 8 bytes
of storage. I then tell SAS where to find the values, and that is in columns 1-7. And note that this also
means that the values will be stored in 7 bytes.
• Last, I tell SAS that I want to create a variable named country_code and I want it to be defined in
SAS as a character variable. The values can be found in columns 40-41 of the raw data file.
Now, this syntax that you see here - variable name, followed by a dollar sign if it is a character variable,
followed by the column numbers - is the order that you need to place these specifications on the INPUT
statement. Don’t mix up this order. This order and the fact that you have supplied SAS with column
numbers makes this Column input.
Column Input Results
proc print data=supplier noobs;
run;
Partial SAS data set: work.supplier
The SAS System

supplier_name                  id      country_code
Scandinavian Clothing A/S      50      NO
Petterson AB                   109     SE
Prime Sports Ltd               316     GB
Top Sports                     755     DK
AllSeasons Outdoor Clothing    772     US
Sportico                       798     ES
British Sports Ltd             1280    GB
Eclipse Inc                    1303    US
And if we execute the DATA step and print the resulting SAS data set, this is what we see. The id values
were first in the raw data file, but now the names of the suppliers come first and then the id values and
finally the country codes.
General Form of Formatted Input
Formatted input is appropriate for reading the following:
• data in fixed columns
• standard and nonstandard character and numeric data
• date values that need to be stored as numeric values in SAS
General form of a formatted INPUT statement:
INPUT <pointer-control> variable informat ;
The next style of input that we want to look at is called formatted input. With formatted input, data must
be in fixed-width fields and special instructions called informats are used to tell SAS how to convert data
values as they come into SAS.
When using formatted input to process data, a combination of three values has to be specified to convert
data values.
• First, you must specify a pointer control to tell SAS the starting column location of the field to be read
from the raw data file.
• Second, you will specify the name of the variable being created, using SAS variable naming
convention rules.
• And third, you will tell SAS how to read in values with informats, which are simply conversion
routines. Note that you use informats to read in special data fields, but there are also informats for
standard data as well.
These three specifications must be provided for every field that will be processed using formatted input.
Note that this syntax is very different from what we saw for Column input. With Column input, you
specify a variable name first, followed by the location of the column. Here you state the starting position
for the value, followed by the variable name and then an informat.
Standard Character and Numeric Values

Standard Character Data
• Contain any value: letters, numbers, special characters, and blanks.
• Are stored with a length of 1 to 32,767 bytes with one byte equal to one character.

Standard Numeric Data
• Can contain positive and negative numbers (ex. -10.24, 120), exponential notation (ex. 5.67E5), and decimal values (ex. 6.7894562).
• Are stored as floating point numbers in 8 bytes of storage by default.

All other data forms will have to be converted using an informat.
Before we look at an example of formatted input, let’s review briefly what we mean by standard character
and numeric values in SAS, because this ties into the need for informats for non-standard data values.
This table describes what SAS views as standard character and numeric data. All other data forms that are
not described in this table have to be converted using an informat.
Any raw data values can be read into SAS as character values. Character values can contain special
characters, blanks, etc. and they are stored byte for byte. So if the length of a field in the raw data is 5 and
it comes into SAS as a character value, it will be stored, by default, in a length of 5 bytes.
For numerics, they are standard if they take on one of these forms – a value with digits that is either
positive or negative, exponential values and digits that include a decimal point and decimal values. These
values can be converted into numeric variables in SAS, without having to specify an informat. If the data
is stored in another form, and a numeric variable is required in the output data set, then that field will
have to go through a conversion routine specified with an informat for SAS to read it in properly.
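Here is a small sketch of that difference, using made-up in-stream data rather than one of the lecture files. The variable names are hypothetical. The first three fields are standard numeric forms, so they need no informat; the fourth contains a dollar sign and a comma, so I am assuming the COMMA informat, applied with the colon modifier that is described with List input later in this section.

```sas
/* Sketch with made-up data; the data set and variable names are hypothetical. */
data work.numeric_forms;
   /* whole, negative and expo are standard numeric data: no informat needed. */
   /* amount holds a dollar sign and a comma, so it needs an informat.        */
   input whole negative expo amount :comma10.;
   datalines;
120 -10.24 5.67E5 $1,234
;
run;

proc print data=work.numeric_forms noobs;
run;
```

If the :comma10. specification were left off, SAS would not be able to convert $1,234 and would set amount to missing with a note in the log about invalid data.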
Conversion Requirements
SAS recognizes that not all data is stored the way that it likes it. Data coming from raw data files or other
software packages might contain information in other forms, including: packed decimal, text dates, or
currency values. In these cases, a SAS informat must be used to convert non-native SAS data types into a
form that is compatible with the SAS system.
Processing a Raw Data File With Formatted Input
Partial raw data file: offers.dat
data work.discounts;
   infile 's:\workshop\offers.dat';
   input @1 customer_type $4.
         offer_date mmddyy8.
         @14 item_group $8.
         @22 discount percent3.;
run;
Let’s look at an example of formatted input. Looking at the raw data file, we see that the fields are fixed
widths. Even though some of the fields are adjacent to others, they are still in fixed locations. But, we also
notice that there are date values here. We can read those date values in as character strings, or if we want
to do manipulations with them later, like subtract an offer_date from a later offer_date to see
how many days passed between offers, then we will need to create these date values as SAS date numeric
values in the resulting data set. And, we also need to provide special instructions, or informats for the
discount values that you see here. We want to store these values as numerics in a new data set. To do that,
we will need to remove the percent signs and we want to store these as percentages, so we will need to
divide the values by 100. I don’t want to have to do all that work – especially if I were dealing with a
really large SAS data set. But, there is an informat that will do all of this for me and it is called the
PERCENT informat.
Putting it all together in the INPUT statement:
• First, we have the keyword INPUT.
• Then we want to start reading values in column 1 through column 4 and place those values into the
variable named customer_type. If we were using Column input, we would just specify the column
numbers here. But we are using Formatted input, so the structure of the input specification is a little
different. We specify the starting position by saying @1, then the variable name, then the informat that
tells SAS how far to read to get the values for this variable. This is a standard character variable, so I’ll
use the $4. informat.
• The pointer is now resting in column 5 and that is where I want SAS to start reading the next set of
values. So, I have not specified a pointer control here. I ask SAS to read values from the current
column, column 5, and I tell SAS that it can expect a two-digit month, a two-digit day and a two-digit
year.
• Next is item_group. We want to start reading values for this variable in column 14 and read the next
8 columns. It will be a character variable.
• And the last variable that we will define is discount. For this variable, we will get the values from
columns 22, 23 and 24, and we are asking SAS to use the PERCENT. informat to strip out percent
signs and divide the values by 100.
If we execute this DATA step and then use PROC PRINT to view it, this is what we will get…
Formatted Input Results
proc print data=discounts noobs;
run;
Partial SAS data set: work.discounts
The SAS System

customer_type    offer_date    item_group    discount
1040             17502         Outdoors      0.15
2020             17446         Golf          0.07
1030             17431         Shoes         0.10
1030             17431         Clothes       0.10
2020             17355         Clothes       0.15
2030             17355         Clothes       0.25
Note that the date values are in a SAS date numeric form, so they are not very understandable at the
moment. But we can see that our data conversion did work. Also, the percent signs are gone from our
discount values, and the values have been converted to decimal values.
I am ok with the discount values on this report, but let’s make those date values more understandable.
Formatted Input Results
Add a FORMAT statement to print the date value in a more readable form:
proc print data=discounts noobs;
format offer_date date9.;
run;
Partial SAS data set: work.discounts
The SAS System

customer_type    offer_date    item_group    discount
1040             02DEC2007     Outdoors      0.15
2020             07OCT2007     Golf          0.07
1030             22SEP2007     Shoes         0.10
1030             22SEP2007     Clothes       0.10
2020             08JUL2007     Clothes       0.15
2030             08JUL2007     Clothes       0.25
I’ve added a FORMAT statement to my PROC PRINT step, and now, I can more easily read those dates.
So, we read these date values in with an informat of MMDDYY8. because that is the way that they were
stored in the raw data file and we wanted SAS to store them as numeric date values in the data set. Then
we used a format to take the internal SAS value and write it out in DATE9. form. Pretty cool!
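Because offer_date is now a numeric SAS date value, we can also do arithmetic with it, which was the reason for converting it in the first place. Here is a minimal sketch; the data set name work.offer_gaps, the variable name days_to_year_end, and the reference date are all my own choices for illustration.

```sas
/* Sketch: count the days from each offer to an arbitrary reference date. */
data work.offer_gaps;                          /* hypothetical data set name */
   set work.discounts;
   /* '31DEC2007'd is a SAS date constant; the date itself is arbitrary. */
   days_to_year_end = '31DEC2007'd - offer_date;
   format offer_date date9.;                   /* keep the date readable */
run;
```

For the first record, with an offer date of 02DEC2007, the subtraction gives 29. That kind of calculation would not be possible if the dates had been read in as character strings.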
General Form of List Input
List input is appropriate for reading the following:
• free-format delimited data
• standard or nonstandard character and numeric data
General form of a List INPUT statement:
INPUT variable <$> <:informat> ;
The third and last style of input that we want to look at is called List input. With List input, data is free-format. This is another type of raw data file that is commonly processed in SAS programs. Free-format
means that the values for fields do not start or end in the same column locations from one record to the
next. The data must be delimited by a blank or some sort of defined delimiter. The only constant that we
have for this type of data is the order of the fields going from left to right across the file. So, List input
requires that you list variable names on your INPUT statement in the order that the fields appear in the
raw data records. Again, this is because with this input style, SAS scans each data record from left to right
to locate fields based on a delimiter.
When using List input to read standard data values, only a variable name has to be specified. No column
numbers or pointer controls are necessary because SAS is determining the start and stop positions of
fields based on the delimiters between fields.
If you are dealing with nonstandard data, then you can still use List input, you just need to specify a colon
modifier and the appropriate informat.
Processing a Raw Data File With List Input
Partial raw data file: sales.txt
data sales_employees;
   infile 's:\workshop\sales.txt' dlm=',';
   input employee_ID $
         first_name :$20. last_name :$20.
         gender $ salary job_title :$20.
         country $ birth_date :date9.
         hire_date :mmddyy10.;
run;
Let’s look at an example of List input. Looking at the raw data file, we immediately notice that the fields
are separated by commas. The default delimiter is a blank. So, we will need to give SAS a special
instruction to let SAS know that commas separate the fields. We do that with the DLM= option on the
INFILE statement. Now any commas in our data become delimiters, not part of a value.
Looking further along our data records, we notice that the job_title values contain blanks. Since our
delimiter is a comma, these blanks will not pose a problem to us. If our delimiter had been a blank, then
SAS would have seen the blanks within the job_title values as delimiters, not spaces within a field
value. But we don’t have that problem here. Then we run across a date field. We want it to be defined in
our data set as a numeric date value, so we’ll need to specify an informat for this value, and the same
holds true for this second date value as well.
Now that you are familiar with the input specification requirements for our raw data file, let’s generate the
INPUT statement to read this data.
• We will begin with the keyword INPUT as always.
• Then we tell SAS that we want to create the employee_ID variable. We don’t have to give SAS a
starting or ending position. It will just start reading the raw data file wherever the pointer currently
resides and will continue to read, or scan the data values until it sees a comma. In this first record, it
will read 120102, stop and take that value and place it into the employee_ID variable.
• Next we read in the first_name values. Note the different syntax here. We have a colon modifier
and an informat. By default, any variable created with List input is assigned a length of 8
bytes. In the case of the first_name variable, we have values that are longer than 8 bytes. If there
are fields over 8 characters wide, then you will need to specify an informat as we have done here.
• For last_name, we are also specifying a colon modifier and an informat.
• For gender, the values are always only 1 byte, so we don’t have to specify anything special here.
• We define the job_title and country variables.
• Then we come across the date variables. The birth_date values are in a DATE9. form, so we’ll use a
colon modifier and an informat for it.
• And for hire_date, it is in a slightly different date form, so we’ll specify MMDDYY10.
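To see why the colon modifier and informat matter for the longer character fields, here is a small sketch with made-up in-stream data. It reads the same value twice, once with simple List input and once with :$20. The data set and variable names are hypothetical.

```sas
/* Sketch: compare the default 8-byte length with an explicit informat. */
data work.length_demo;              /* hypothetical data set name */
   infile datalines dlm=',';
   input short_name $               /* simple List input: length defaults to 8 */
         long_name  :$20.;          /* colon modifier plus informat: length 20 */
   datalines;
Christina,Christina
Hofmeister,Hofmeister
;
run;

proc print data=work.length_demo noobs;
run;
```

In the printed output, short_name holds only the first 8 characters (Christin and Hofmeist), while long_name holds the full values.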
If we were to execute the DATA step….
List Input Results
proc print data=sales_employees noobs;
run;
Partial SAS data set: work.sales_employees
The SAS System

employee_ID  first_name  last_name   gender  salary  job_title      country  birth_date  hire_date
120102       Tom         Zhou        M       108255  Sales Manager  AU             3510      10744
120103       Wilson      Dawes       M        87975  Sales Manager  AU            -3996       5114
120121       Irenie      Elvish      F        26600  Sales Rep. II  AU            -5630       5114
120122       Christina   Ngan        F        27475  Sales Rep. II  AU            -1984       6756
120123       Kimiko      Hotstone    F        26190  Sales Rep. I   AU             1732       9405
120124       Lucian      Daymond     M        26480  Sales Rep. I   AU             -233       6999
120125       Fong        Hofmeister  M        32040  Sales Rep. IV  AU            -1852       6999
120126       Satyakam    Denny       M        26780  Sales Rep. II  AU            10490      17014
...and then view it with PROC PRINT, notice that the date values have been converted into SAS date
values, which are numeric. Also notice that the lengths of first_name, last_name, and
job_title are large enough to store the longest values. If I added a FORMAT statement to my PROC
PRINT step, I could make my date values more understandable. But, I’m not going to do that for this
example, since we saw an example earlier of how to do it.
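For reference, a FORMAT statement of the kind mentioned here could look like the following. It is a minimal sketch that builds a one-row stand-in data set (work.date_demo, a made-up name) from the first row's unformatted values rather than rereading the raw file.

```sas
/* Hypothetical one-row data set holding the first row's date values */
data work.date_demo;
   birth_date = 3510;
   hire_date  = 10744;
run;

proc print data=work.date_demo noobs;
   /* DATE9. displays the stored day counts as ddMONyyyy text */
   format birth_date hire_date date9.;
run;
```

With the format applied, 3510 displays as 11AUG1969 and 10744 displays as 01JUN1989. The stored values themselves do not change.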
Comparing Input Styles
                                Fixed-Width  Free-Format  Nonstandard
                                Data         Data         Data
Column Input                    X
Formatted Input                 X                         X
Simple List Input                            X
List Input with Colon Modifier               X            X
This table compares the three input styles that we just looked at. List input is represented twice, once as
simple List input, which means that you would just list the variable name and a dollar sign if applicable,
and then there is List input with the colon modifier which means that the data is free-format, but you are
going to be using informats to read in special values.
Taking a look at the chart, if you are dealing with fixed-width data, then you can either use Column or
Formatted input. If you are dealing with free-format data, then List input is the way to go. If you have
non-standard data values coming into SAS, then you will need to read those values in with an informat.
Both Formatted input and List input can use informats, but Column input cannot do so.
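To make the comparison concrete, here is a small sketch that reads made-up records with each of the four styles in the table. The data set names, variable names, and values are all invented for illustration.

```sas
/* Column input: fixed-width, standard data only */
data work.column_demo;
   input name $ 1-8 amount 10-14;
   datalines;
Tom       1250
Kimiko     980
;
run;

/* Formatted input: fixed-width, and an informat can read nonstandard data */
data work.formatted_demo;
   input @1 name $8. @10 hired date9.;
   datalines;
Tom      01JUN1989
Kimiko   15MAR1995
;
run;

/* Simple List input: free-format, standard data, blank-delimited by default */
data work.simple_list_demo;
   input name $ amount;
   datalines;
Tom 1250
Kimiko 980
;
run;

/* List input with colon modifier: free-format, informats allowed */
data work.colon_list_demo;
   infile datalines dlm=',';
   input name :$12. hired :date9.;
   datalines;
Christina,01JUN1989
Tom,15MAR1995
;
run;
```

In the first two steps the values must sit in exact columns. In the last two, only the order of the values and the delimiter matter.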
3. Read Mixed Data Type Records
Using Mixed Processing Methods to Read Raw Data Files
1. Investigate File Layouts
2. Describe SAS Input Styles
3. Read Mixed Data Type Records
Now let’s take a look at a case where there is more than one input style needed within each raw data
record.
Objectives
• Investigate the raw data file.
• Create a DATA step to process the mixed record type file.
In this section:
• We will take a look at a raw data file using the FSLIST procedure.
• Then we’ll generate DATA step syntax to process records where the input style changes from
Column, to Formatted, to List, and then back to Formatted specifications.
Processing Scenario
• Data from donations database.
• Employees make contributions quarterly.
• The data structure changes within a record.
The charities.txt raw data file contains the cash, check, or payroll deduction contributions of employees
who donate quarterly to various charities throughout the year. We will be looking at data records for 2008.
As we investigate the data, we will find that the style of input needed to read the raw data is going to
change within a record. This means that we will be specifying Formatted, Column, and List input all in
one INPUT statement.
Examining the Mixed Data
proc fslist file='s:\workshop\charities.txt';
run;
Partial raw data file: charities.txt
Let’s take a quick look at the raw data file with the FSLIST procedure. Notice that some of the data fields
are separated with spaces and some are separated with commas. We also notice that some of the fields are
fixed-width and others are free-format. So, there is definitely a mixture of input styles needed to read just
a single record.
Over the next several slides, we’ll take a look at the fields in the raw data one at a time and build our
INPUT statement as we go along.
Coding the INPUT Statement
The first two data fields are fixed width and standard character
data.
Partial raw data file: charities.txt
Partial DATA step
data work.donations;
infile 's:\workshop\charities.txt';
input employee_id $ 1-6
@8 paid_by $17.
…
The first data field is fixed-width, and there is nothing special about the values. The field actually represents
employee IDs. We have no intention of ever manipulating these data values, so we will create them as
standard character values in SAS. We have a choice of reading the data with either Column or Formatted
input. For simplicity, we will use Column input. We will call the variable employee_id. It’s character,
so we’ll need to use a dollar sign and, looking at the ruler, we’ll tell SAS to read columns 1-6 to get the
values for this variable.
Looking at the second data field, it is also fixed-width standard data. So, again either Column or
Formatted would work here. Let’s use Formatted for demonstration purposes. After reading through
column 6 for the employee_id values, the pointer is now resting in column 7. We need to move the
pointer to column 8, so we say @8. We name the variable paid_by and we use the $17. informat
because we need SAS to read the next 17 columns of data for this variable.
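As a sketch of these first two specifications on their own, the step below reads two made-up records from in-stream data instead of charities.txt. The employee IDs, the paid_by values, and the data set name work.first_two are invented. Only the column positions follow the narration.

```sas
data work.first_two;
   input employee_id $ 1-6     /* Column input: columns 1 through 6        */
         @8 paid_by $17.;      /* Formatted input: 17 columns from column 8 */
   datalines;
120265 Payroll Deduction
120267 Cash or Check
;
run;

proc print data=work.first_two noobs;
run;
```

Because both fields are fixed-width, the blank inside a value such as Cash or Check is read as part of the value.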
Coding the INPUT Statement
The quarterly contribution fields are free-formatted and
comma-delimited.
Partial raw data file: charities.txt
Partial DATA step
data work.donations;
infile 's:\workshop\charities.txt' dlm=',';
input employee_id $ 1-6
@8 paid_by $17.
@27 qtr1 qtr2 qtr3 qtr4
…
As we continue from left to right across the file, the data structure changes from fixed-width to
comma-delimited, free-format for the quarterly contributions. List input will need to be applied here.
Before we add to our INPUT statement, we need to tell SAS that the values are delimited by commas. We
will do that on our INFILE statement with the DLM= option. Now, when SAS sees a comma in the data
while scanning with List input, it will treat it as a delimiter, not part of a field value.
On our INPUT statement, we need to direct the pointer to move to column 27. Then we name the
variables qtr1, qtr2, qtr3, and qtr4. These are standard numeric values, so no informats are needed.
We simply write qtr1, qtr2, qtr3, and qtr4 and SAS will scan for the comma delimiter to get the
values. The input style is recognized as List input because we have not provided column start-stop
positions or informats.
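Here is a minimal sketch of just the List input portion, using invented IDs and amounts. It assumes that every quarter has a value, because without the DSD option SAS treats consecutive delimiters as a single delimiter.

```sas
data work.quarters;
   infile datalines dlm=',';        /* commas delimit the List input fields */
   input @27 qtr1 qtr2 qtr3 qtr4;   /* start at column 27, scan to each comma */
   datalines;
120265 Payroll Deduction  25,25,25,25
120267 Cash or Check      15,20,15,20
;
run;
```

The first 26 columns of each record are simply skipped here. The @27 pointer control is what hands the record over from the fixed-width fields to the delimited ones.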
Coding the INPUT Statement
The hiredate field is free-formatted, comma-delimited, and needs a
date informat.
Partial raw data file: charities.txt
Partial DATA step
data work.donations;
infile 's:\workshop\charities.txt' dlm=',';
input employee_id $ 1-6
@8 paid_by $17.
@27 qtr1 qtr2 qtr3 qtr4
hiredate :mmddyy10.
…
As we look at the next field of data, we see more free-format, comma-delimited values. Also, this is a date
field, and we want it defined in SAS as a SAS date value. Therefore, we will need to specify a date
informat for this variable using a colon modifier.
The input syntax for this field consists of a variable name, the colon format modifier, and the informat
to use, which is MMDDYY10.
Coding the INPUT Statement
The organization field is fixed width and standard character data.
Partial raw data file: charities.txt
Partial DATA step
data work.donations;
infile 's:\workshop\charities.txt' dlm=',';
input employee_id $ 1-6
@8 paid_by $17.
@27 qtr1 qtr2 qtr3 qtr4
hiredate :mmddyy10.
@50 organization $26.;
run;
The last field of data is fixed-width, with a starting position of column 50. These are standard character
values, so we can read them with either Column or Formatted input. Let’s go with Formatted. We’ll say @50,
the name of the variable, organization, and give it a $26. informat. Note that some of the values contain
commas. The DLM= option has identified the comma as a delimiter. However, because we have switched
back to Formatted input, the commas in this field are treated as part of the value rather than as delimiters.
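To try the complete INPUT statement without the course file, here is a sketch that reads two invented records from in-stream data. The IDs, amounts, dates, and organization names are made up. The layout also assumes that a comma follows the date so that List input stops scanning before column 50, which may or may not match the real charities.txt.

```sas
data work.donations;
   infile datalines dlm=',';
   input employee_id $ 1-6          /* Column input                     */
         @8 paid_by $17.            /* Formatted input                  */
         @27 qtr1 qtr2 qtr3 qtr4    /* simple List input                */
         hiredate :mmddyy10.        /* List input with colon modifier   */
         @50 organization $26.;     /* Formatted input, commas are data */
   datalines;
120265 Payroll Deduction  25,25,25,25,06/01/2006,Animal Rescue, Inc.
120267 Cash or Check      15,20,15,20,11/15/2007,Food Bank, Eastern Region
;
run;
```

In-stream records are padded to 80 columns, so they behave like the fixed-length file in the lecture. The organization values keep their embedded commas because the $26. informat reads columns 50 through 75 without looking for delimiters.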
One quick note here: for this particular file, each record has a fixed length of 80. If this were a
variable-length file instead, then you would have to specify additional options on your
INFILE statement to make this program work correctly. Consult the documentation for your operating
system for more information about fixed- and variable-length records.
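As one illustration, TRUNCOVER is an INFILE option that is commonly used when records are shorter than an informat expects. Whether it is the option the lecture has in mind is an assumption. The sketch below first writes a short, made-up record to a temporary file (the fileref demo and the data set name work.donations_v are invented) and then reads it back.

```sas
filename demo temp;   /* temporary file standing in for a variable-length file */

data _null_;
   file demo;
   /* PUT adds no trailing padding, so this record ends at column 68 */
   put '120265 Payroll Deduction  25,25,25,25,06/01/2006,Animal Rescue, Inc.';
run;

data work.donations_v;
   /* TRUNCOVER keeps SAS from moving to the next record when the
      $26. informat asks for more columns than the record contains */
   infile demo dlm=',' truncover;
   input employee_id $ 1-6
         @8 paid_by $17.
         @27 qtr1 qtr2 qtr3 qtr4
         hiredate :mmddyy10.
         @50 organization $26.;
run;
```

Without TRUNCOVER, the default behavior is to look on the next record for the rest of the organization value, which is not what we want here.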
If we print the data set...
Unformatted Results
proc print data=work.donations noobs;
run;
Partial work.donations data set
We see that all of the data fields have been converted correctly. Note that the hiredate field is in
SAS date form: it is the number of days relative to January 1, 1960. Let’s make that date more readable.
Formatted Results
proc print data=work.donations noobs;
format qtr1-qtr4 dollar3. hiredate date9.;
run;
Partial work.donations data set
We’ll make this final modification to the scenario solution by adding this FORMAT statement to the
PROC PRINT step. The quarter values will be displayed as currency by using the DOLLAR3. format. This
format places a dollar sign in front of each value, and because we have specified no decimal places, none
are displayed. For the date values, we want a two-digit day, a three-letter month, and a four-digit year.
The DATE9. format will display the values in this fashion.
And, you can see the final results. So, our INPUT statement notation along the way went from Column
input, to Formatted input, to simple List input, to colon-modified List input, and back to Formatted input.
Mixing the styles is no problem. It’s just a matter of knowing what your data looks like, deciding how you
want it stored in SAS, and selecting the correct style of input to read the data into SAS and have it
converted appropriately.
Lecture Summary
• Investigated various file structures using PROC FSLIST.
• Defined List, Column, and Formatted input styles.
• Worked through a DATA step scenario using all of the input styles within one INPUT statement.
During this lecture, we saw that there are many different forms in which raw data can be stored. If we
want to take that raw data and convert it into a SAS data set, then we must tell SAS what the data looks
like as it is being read into a SAS data set. So, we either need to have a file layout available for a given
file, or we need to open the file and become familiar with the data ourselves. There are many different
editors and browsers that you can use to investigate the data depending on your operating system. If you
want to stick with one editor that will work in SAS and will work across operating systems, then you can
use the FSLIST procedure, as we did in this lecture.
Next we matched the style of input needed for various file structures. We talked about List, Column and
Formatted input, and when to use one over the other.
And finally, we looked at a situation where the raw data file was mixed type, meaning that there were
data fields that worked with Column input, fields that required us to use Formatted input, and fields that
required us to use List input. We saw that it is OK to mix different styles within one INPUT statement.
Other Related e-Lectures
A complete list of available SAS training, including SAS e-Lectures
can be found at the following site:
http://support.sas.com/training
This concludes our e-Lecture entitled Using Mixed Processing Methods to Read Raw Data Files. We
hope that you found the material and presentation helpful.
Please visit the SAS Web site at http://support.sas.com/training/ for a complete list of other available
e-Lectures and SAS training.
Credits
Using Mixed Processing Methods to Read Raw Data Files was
developed by Ted Durie. Additional contributions were made
by Cindy Cragin, David Ghan, Linda Mitterling, and Bruce Reed.
This lecture was developed by Ted Durie with additional contributions from Cindy Cragin, David Ghan,
Linda Mitterling, and Bruce Reed.
Comments?
We would like to hear what you think.
• Do you have any comments about this lecture?
• Did you find the information in this lecture useful?
• What other e-Lectures would you like to see SAS develop in the future?
Please e-mail your comments to EDULectures@sas.com
SAS Education would like to know what you think about this e-Lecture and e-Lectures in general. If you
have any comments, we would appreciate receiving your input. You can use the e-mail address listed here
to provide feedback, or fill out the short survey at the end of this lecture.
Copyright
SAS and all other SAS Institute Inc. product or service names are
registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries.
® indicates USA registration. Other brand and product names
are trademarks of their respective companies.
Copyright © 2009 by SAS Institute Inc., Cary, NC 27513, USA.
All rights reserved.
Thank you.