Using Mixed Processing Methods to Read Raw Data Files Transcript

Using Mixed Processing Methods to Read Raw Data Files Transcript was developed by Ted Durie. Additional contributions were made by Cindy Cragin, David Ghan, Linda Mitterling, and Bruce Reed. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Using Mixed Processing Methods to Read Raw Data Files Transcript
Copyright © 2009 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code E1445, course code RLSPCMI, prepared date 19Feb2009. RLSPCMI_001
ISBN 978-1-60764-033-2

Table of Contents

Lecture Description
Prerequisites
Using Mixed Processing Methods to Read Raw Data Files
    1. Investigate File Layouts
    2. Describe SAS Input Styles
    3. Read Mixed Data Type Records

Lecture Description

In this lecture you’ll see how to process raw data files with structures that change within records of data. The following topics are addressed: using SAS as a tool to examine the structure of various raw data files, choosing the appropriate input styles for different structures in a raw data file, and using the appropriate style as formats change within a record.

To learn more…

For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to training@sas.com. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.

Prerequisites

Before viewing this lecture, you should be familiar with the DATA step, the PRINT procedure, and INFILE and INPUT statement syntax. You should also understand IF-THEN/ELSE logic and be able to view raw data using text editors. You can gain this knowledge by completing the SAS Programming I Essentials (PRG1) course.
Using Mixed Processing Methods to Read Raw Data Files

Welcome to this e-Lecture on Using Mixed Processing Methods to Read Raw Data Files. My name is Ted and I am an instructor for SAS. Today we will be looking at how to process raw data files with structures that change within records of data.

Using Mixed Processing Methods to Read Raw Data Files
    1. Investigate File Layouts
    2. Describe SAS Input Styles
    3. Read Mixed Data Type Records

The following three topics will be covered as part of this e-Lecture:
• First, we will examine the structure of various raw data files and we will look at a SAS tool to use for this purpose.
• Next, we will discuss how to choose the appropriate input styles for the different structures that might exist in a raw data file.
• Then, after we have a clear understanding of the different input styles that SAS provides, we will take a look at using the appropriate style as formats change within a record.

1. Investigate File Layouts

Let’s start by examining the structure of different kinds of raw data files.

Objectives
• Recognize the need to familiarize yourself with a file layout.
• Use the FSLIST procedure in SAS to examine raw data file structure.
• Investigate various record layouts.

In this first section, we will take a look at the kinds of information that we will need to supply to SAS to read in raw data records and create SAS data sets. We will look at different layouts, and we’ll do this by opening up several raw data files with a tool in SAS.
File Description

This is the file layout for the offers.dat raw data file:

    Field Description    Columns    Data Type
    Customer Type        1-4        Character
    Offer Date           5-12       Date in the format of mmddyy8.
    Item Group           14-21      Character
    Discount             22-24      Numeric with percent signs

The first step in processing a raw data file is to determine the layout of the data values in the data records. The arrangement of values can change from field to field, or even from record to record. The good news is that SAS has different input styles to handle different layouts. To determine which style to use for reading data values into SAS, someone either needs to hand you some sort of codebook that explains the file layout to you, or you will need to open the data file and examine it.

The table shown here is what I am referring to as a codebook or file layout. As you can see, it contains the locations and data types of the fields of information. The task of converting a raw data file into a SAS data set can be greatly simplified if the person who created the file can give you this type of information. But this often is not the case.

Raw Data File

This is a partial listing of the offers.dat raw data file:

    104012/02/07 Outdoors15%
    202010/07/07 Golf     7%
    103009/22/07 Shoes   10%
    103009/22/07 Clothes 10%
    202007/08/07 Clothes 15%
    203007/08/07 Clothes 25%

More often, you are going to have to open the file to become familiar with the file layout. Here, I have opened a file named offers.dat. By looking at the file, I can see that there is some sort of ID number in the first field and all of these values are fixed within certain column positions. Then it looks like there might be some sort of date value here, and it looks like it is always 8 bytes in length. So, without a codebook, I have to view the data in the file to determine its structure.
General Syntax of FSLIST Procedure

Any tool that can read text can open a raw data file, such as:
• Notepad
• Microsoft Word
• PROC FSLIST in SAS

General syntax for the FSLIST procedure:

    PROC FSLIST FILE='<directory-location/filename>';

There are many different tools that you can use to view a raw data file, including text editors, like Notepad or Microsoft Word. In this e-lecture, we will be using a SAS procedure called FSLIST, because it allows us to view the contents of a raw data file regardless of size or the operating system where the data resides.

This is the general syntax for the FSLIST procedure. It starts with PROC FSLIST, then you will typically specify the directory location where the file is stored and the name of the raw data file to be examined. Let’s look at an example of opening a raw data file with PROC FSLIST.

Examining a Raw Data File

Open the offers.dat raw data file with PROC FSLIST.

    proc fslist file='s:\workshop\offers.dat';
    run;

Here I am opening a file named offers.dat with the FSLIST procedure. I say: proc fslist file=, and then I have specified the fully qualified path to the raw data file. Note that if you execute this procedure in a non-interactive SAS session or batch job, then you will get a report listing of the data. If you execute the procedure in an interactive session, such as in SAS Display Manager, then the FSLIST window will open showing the contents of the raw data file as you see here.

Investigate Raw Data File Structures

This demonstration illustrates different raw data file layouts.

Great. Well, now that we have a tool to use in SAS to open our raw data files, let’s take a look at some of the different file structures that we might run across. I’m going to open several files with PROC FSLIST. Here is code that I wrote earlier so you don’t have to sit and watch me type!
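The demonstration program itself is not reproduced in this transcript. As a rough sketch only, assuming the files sit in the same s:\workshop directory that appears on the slides, such a program might look something like this:

```sas
/* Sketch only: the directory is assumed from paths shown elsewhere in   */
/* this lecture. In an interactive session each step opens an FSLIST     */
/* window; typing COLS on its command line displays a column ruler.      */

/* First file in the demonstration */
proc fslist file='s:\workshop\supplier.dat';
run;

/* Second file in the demonstration */
proc fslist file='s:\workshop\offers.dat';
run;

/* Third file in the demonstration */
proc fslist file='s:\workshop\sales.txt';
run;
```

Each step can be highlighted and submitted on its own, which matches the way the files are opened one at a time in the demonstration.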
Let me highlight this first set of code that opens up supplier.dat. The FSLIST window opens, and here you see the contents of the file. I am going to turn on a command in this window to make it easier for me to see how the data values are laid out within each record. It is called the COLS command. I type it here on the command bar. It creates this nice ruler that I can use to determine the position of the data values in the file.

Let’s look at this file named supplier.dat. Notice that the data fields in this file always start in the same position from record to record. And notice that the data is fixed within a certain width. What I mean by that is, this first set of values is fixed between column 1 and column 7. The next set of values is fixed between columns 8 and 39. So, the data here is fixed in particular columns.

Another item to note in this file is that it contains all standard data values. There is nothing special about the data values. There are no date values, or numeric values that contain dollar signs or commas or percent signs. These are just standard character and numeric values.

So, as we look through these files, these are the two questions, or pieces of information, that we will be looking for:
• First, is there structure to the values? In other words, are the data fields in fixed columns?
• And, second, are there any non-standard data values?

Let me go back to the editor again. This next file is named offers.dat. We actually just saw this file earlier in this section. If I submit this code, we see that this file contains fixed data values just like the last file that we opened. But in this file, there are special data values, as you can see here: this is a date value, and this last set of values contains percent signs. This is all OK to SAS. It just means that we need to give SAS this information when it is reading in the file so it can convert the values into the proper form in the data set being created.
So, the data in this file is fixed with some special processing needed.

Let me go back to the code in my editor and let’s open another raw data file named sales.txt. Let’s ask the first question, “Are the data values in fixed columns?” The answer here is “no”. Here we have what we call free-format data. Notice that the data does not line up in nice fixed columns. This type of data is definitely readable in SAS, but will require a different style of input than the fixed data files that we just saw.

Glancing through the records, we ask the second question, “Do we need to deal with any special data values?” Yes, it looks like there are two date fields here. Another thing to note about this file is that the data values are separated by commas. That’s another piece of information that we need to know when selecting the appropriate input style in SAS to read the file.

And, let me open one more file…. Here we have a file that is mixed. It contains some fixed data and some free-format data. We’ll see how to handle a mixed type file in this e-lecture as well.

In this demo, we have seen several different file structures. Next we want to look at the different methods or styles of input that SAS has to handle these different file structures.

2. Describe SAS Input Styles

In Section 2, we will describe the commonly used input styles available in SAS. This section is a quick review of the essential features of these input styles. These topics are covered more fully in our SAS Programming 1: Essentials and SAS Programming II: Accessing and Manipulating Data courses, where additional options are introduced and there are more data scenarios presented than we are able to cover here.

Objectives
• Describe the commonly used input styles in SAS.
• Identify the type of input to use for different file structures.

So, let’s take a look at the input styles that are most commonly used in SAS, and we’ll look at the different file structures and talk about which input style needs to be used for each type of structure.

Input Styles

There are three commonly used styles of input in SAS:
• Column
• Formatted
• List

[Diagram: raw data is read by a DATA step (data …; infile …; input …; run;) to create a SAS data set.]

Records of raw data files are read into SAS in the DATA step through an INPUT statement. The INPUT statement takes the values, converts them appropriately, and places them into a SAS data set.

To address the different types of layouts that can exist within raw data files, there are three commonly used styles of input in SAS. They are called Column input, Formatted input, and List input. In the next few slides, we’ll see what the syntax for these styles looks like on the INPUT statement, and how they correspond to the record layouts I showed you in the previous section.

If you have used INPUT statements in SAS before and are familiar with the differences between each style of input, then some of the information in this section may be review to you. If so, you can skip directly to the next section.

General Form of Column Input

Column input is appropriate for reading the following:
• data in fixed columns
• standard character and numeric data

General form of a column INPUT statement:

    INPUT variable <$> startcol-endcol ;

We’ll begin our discussion by looking at Column input. This mode of input is used when the data is in a fixed-width form and contains only standard character and numeric values, with no special values such as dates or currency values that contain dollar signs or commas.

When Column input is used to process raw data, the INPUT statement will consist of three components.
• First, you need to provide the name of the variable that will be created.
• Second, specify a “$” for data that will be stored in a character variable. Omit the “$” if the data will be put into a numeric variable.
• Third, provide beginning and ending column locations for the field in the external file.

You will need to provide this set of three specifications for every field that will be processed from the raw data file. Now, this does not mean that when you are writing your INPUT statement, you have to include all of the data contained in a record. In fact, you can choose only to read in a subset of the data values. For instance, if the raw data file has 20 fields of data, but you only need to process 6 or 7 of them when working in SAS, then you would specify a variable name and beginning and ending column locations for just those fields of data that you want to include.

Processing Raw Data File With Column Input

Partial raw data file: supplier.dat

    data work.supplier;
       infile 's:\workshop\supplier.dat';
       input supplier_name $ 8-39
             id $ 1-7
             country_code $ 40-41;
    run;

If your data is structured like this file, then Column input is the best approach. The values are fixed within columns, and there are no special fields that need instructions for SAS as it is reading in the values.

And, a little twist here, I would like to change the order of the data when it is read into the work.supplier data set. I want to see the name of the supplier first, followed by the supplier’s id and then their country code. So, here is the Column INPUT statement that I will write:
• Start with the keyword INPUT.
• Then we tell SAS that we would like to create a variable named supplier_name. We tell SAS that this is a character variable by using the dollar sign. Then we need to tell SAS where it can find the values for supplier_name. So, we specify columns 8-39 in the raw data file.
• Next, we tell SAS to create the variable id.
These values are all numeric, but we have no intention to do any sort of numeric calculation with them. So, I will store them as character variables. This will save storage space in my new data set. If I stored them as numeric, they would take up a default 8 bytes of storage. I then tell SAS where to find the values, and that is in columns 1-7. And note that this also means that the values will be stored in 7 bytes.
• Last, I tell SAS that I want to create a variable named country_code and I want it to be defined in SAS as a character variable. The values can be found in columns 40-41 of the raw data file.

Now, this syntax that you see here (variable name, followed by a dollar sign if it is a character variable, followed by the column numbers) is the order in which you need to place these specifications on the INPUT statement. Don’t mix up this order. This order, and the fact that you have supplied SAS with column numbers, makes this Column input.

Column Input Results

    proc print data=supplier noobs;
    run;

Partial SAS data set: work.supplier

    The SAS System

    supplier_name                  id      country_code
    Scandinavian Clothing A/S      50      NO
    Petterson AB                   109     SE
    Prime Sports Ltd               316     GB
    Top Sports                     755     DK
    AllSeasons Outdoor Clothing    772     US
    Sportico                       798     ES
    British Sports Ltd             1280    GB
    Eclipse Inc                    1303    US

And if we execute the DATA step and print the resulting SAS data set, this is what we see. The id values were first in the raw data file, but now the names of the suppliers come first, then the id values, and finally the country codes.
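To confirm the variable order and storage lengths that the Column input example describes, one option (not part of the original lecture) is to run PROC CONTENTS on the new data set. This is a minimal sketch that assumes work.supplier was created by the DATA step shown for supplier.dat:

```sas
/* Sketch only: assumes work.supplier already exists in the WORK library. */
/* VARNUM lists the variables in creation order instead of alphabetically. */
proc contents data=work.supplier varnum;
run;
```

With Column input, a character variable’s length equals the number of columns read, so the listing should report supplier_name with a length of 32 (columns 8-39), id with a length of 7 (columns 1-7), and country_code with a length of 2 (columns 40-41).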
General Form of Formatted Input

Formatted input is appropriate for reading the following:
• data in fixed columns
• standard and nonstandard character and numeric data
• date values that need to be stored as numeric values in SAS

General form of a formatted INPUT statement:

    INPUT <pointer-control> variable informat ;

The next style of input that we want to look at is called Formatted input. With Formatted input, data must be in fixed-width fields, and special instructions called informats are used to tell SAS how to convert data values as they come into SAS.

When using Formatted input to process data, a combination of three specifications has to be provided to convert data values.
• First, you must specify a pointer control to tell SAS the starting column location of the field to be read from the raw data file.
• Second, you will specify the name of the variable being created, using SAS variable naming convention rules.
• And third, you will tell SAS how to read in values with informats, which are simply conversion routines. Note that you use informats to read in special data fields, but there are also informats for standard data as well.

These three specifications must be provided for every field that will be processed using Formatted input. Note that this syntax is very different from what we saw for Column input. With Column input, you specify a variable name first, followed by the location of the column. Here you state the starting position for the value, followed by the variable name and then an informat.

Standard Character and Numeric Values

Standard Character Data
• Contain any value: letters, numbers, special characters, and blanks.
• Are stored with a length of 1 to 32,767 bytes, with one byte equal to one character.

Standard Numeric Data
• Can contain positive and negative numbers (ex. -10.24, 120), exponential notation (ex. 5.67E5), and decimal values (ex. 6.7894562).
• Are stored as floating point numbers in 8 bytes of storage by default.

All other data forms will have to be converted using an informat.

Before we look at an example of Formatted input, let’s review briefly what we mean by standard character and numeric values in SAS, because this ties into the need for informats for non-standard data values. This table describes what SAS views as standard character and numeric data. All other data forms that are not described in this table have to be converted using an informat.

Any raw data values can be read into SAS as character values. Character values can contain special characters, blanks, etc., and they are stored byte for byte. So if the length of a field in the raw data is 5 and it comes into SAS as a character value, it will be stored, by default, in a length of 5 bytes.

Numeric values are standard if they take on one of these forms: a value with digits that is either positive or negative, an exponential value, or digits that include a decimal point and decimal values. These values can be converted into numeric variables in SAS without having to specify an informat. If the data is stored in another form, and a numeric variable is required in the output data set, then that field will have to go through a conversion routine specified with an informat for SAS to read it in properly.

Conversion Requirements

SAS recognizes that not all data is stored the way that it likes it. Data coming from raw data files or other software packages might contain information in other forms, including packed decimal, text dates, or currency values. In these cases, a SAS informat must be used to convert non-native SAS data types into a form that is compatible with the SAS system.
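As a small illustration of what an informat does, the sketch below converts a few literal text values with the INPUT function and writes the results to the log. It is not taken from the lecture: the variable names and literal values are made up for illustration, and the two-digit year is assumed to resolve to 2007 under the session’s YEARCUTOFF= setting.

```sas
/* Sketch only: variable names and literal values are hypothetical. */
data _null_;
   sale_amount = input('$1,234.50', comma10.);  /* currency text: dollar sign and comma are stripped */
   offer_date  = input('12/02/07', mmddyy8.);   /* text date: becomes a SAS date value               */
   discount    = input('15%', percent3.);       /* percent text: sign stripped, value divided by 100 */
   put sale_amount= offer_date= discount=;      /* write the converted numeric values to the log     */
run;
```

The same informat names can be placed on an INPUT statement, which is how values are converted while a raw data file is being read.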
Processing a Raw Data File With Formatted Input

Partial raw data file: offers.dat

    data work.discounts;
       infile 's:\workshop\offers.dat';
       input @1 customer_type $4.
             offer_date mmddyy8.
             @14 item_group $8.
             @22 discount percent3.;
    run;

Let’s look at an example of Formatted input. Looking at the raw data file, we see that the fields are fixed widths. Even though some of the fields are adjacent to others, they are still in fixed locations. But, we also notice that there are date values here. We can read those date values in as character strings, or, if we want to do manipulations with them later, like subtracting an offer_date from a later offer_date to see how many days passed between offers, we will need to create these date values as SAS date numeric values in the resulting data set.

And, we also need to provide special instructions, or informats, for the discount values that you see here. We want to store these values as numerics in a new data set. To do that, we will need to remove the percent signs, and we want to store these as percentages, so we will need to divide the values by 100. I don’t want to have to do all that work, especially if I were dealing with a really large SAS data set. But, there is an informat that will do all of this for me, and it is called the PERCENT informat.

Putting it all together in the INPUT statement:
• First, we have the keyword INPUT.
• Then we want to start reading values in columns 1 through 4 and place those values into the variable named customer_type. If we were using Column input, we would just specify the column numbers here. But we are using Formatted input, so the structure of the input specification is a little different. We specify the starting position by saying @1, then the variable name, then the informat that tells SAS how far to read to get the values for this variable. This is a standard character variable, so I’ll use the $4. informat.
• The pointer is now resting in column 5, and that is where I want SAS to start reading the next set of values. So, I have not specified a pointer control here. I ask SAS to read values from the current column, column 5, and I tell SAS that it can expect a two-digit month, a two-digit day and a two-digit year.
• Next is item_group. We want to start reading values for this variable in column 14 and read the next 8 columns. It will be a character variable.
• And the last variable that we will define is discount. For this variable, we will get the values from columns 22, 23 and 24, and we are asking SAS to use the PERCENT3. informat to strip out percent signs and divide the values by 100.

If we execute this DATA step and then use PROC PRINT to view it, this is what we will get…

Formatted Input Results

    proc print data=discounts noobs;
    run;

Partial SAS data set: work.discounts

    The SAS System

    customer_type    offer_date    item_group    discount
    1040             17502         Outdoors      0.15
    2020             17446         Golf          0.07
    1030             17431         Shoes         0.10
    1030             17431         Clothes       0.10
    2020             17355         Clothes       0.15
    2030             17355         Clothes       0.25

Note that the date values are in a SAS date numeric form, so they are not very understandable at the moment. But we can see that our data conversion did work. Also, the percent signs are gone from our discount values, and the values have been converted to decimal values. I am OK with the discount values on this report, but let’s make those date values more understandable.
Formatted Input Results

Add a FORMAT statement to print the date value in a more readable form:

    proc print data=discounts noobs;
       format offer_date date9.;
    run;

Partial SAS data set: work.discounts

    The SAS System

    customer_type    offer_date    item_group    discount
    1040             02DEC2007     Outdoors      0.15
    2020             07OCT2007     Golf          0.07
    1030             22SEP2007     Shoes         0.10
    1030             22SEP2007     Clothes       0.10
    2020             08JUL2007     Clothes       0.15
    2030             08JUL2007     Clothes       0.25

I’ve added a FORMAT statement to my PROC PRINT step, and now I can more easily read those dates. So, we read these date values in with an informat of MMDDYY8. because that is the way that they were stored in the raw data file and we wanted SAS to store them as numeric date values in the data set. Then we used a format to take the internal SAS value and write it out in DATE9. form. Pretty cool!

General Form of List Input

List input is appropriate for reading the following:
• free-format delimited data
• standard or nonstandard character and numeric data

General form of a List INPUT statement:

    INPUT variable <$> <:informat> ;

The third and last style of input that we want to look at is called List input. With List input, data is free-format. This is another type of raw data file that is commonly processed in SAS programs. Free-format means that the values for fields do not start or end in the same column locations from one record to the next. The data must be delimited by a blank or some sort of defined delimiter. The only constant that we have for this type of data is the order of the fields going from left to right across the file. So, List input requires that you list variable names on your INPUT statement in the order that the fields appear in the raw data records. Again, this is because with this input style, SAS scans each data record from left to right to locate fields based on a delimiter.
When using List input to read standard data values, only a variable name has to be specified. No column numbers or pointer controls are necessary because SAS is determining the start and stop positions of fields based on the delimiters between fields. If you are dealing with nonstandard data, then you can still use List input; you just need to specify a colon modifier and the appropriate informat.

Processing a Raw Data File With List Input

Partial raw data file: sales.txt

    data sales_employees;
       infile 's:\workshop\sales.txt' dlm=',';
       input employee_ID $ first_name :$20.
             last_name :$20. gender $ salary
             job_title :$20. country $
             birth_date :date9.
             hire_date :mmddyy10.;
    run;

Let’s look at an example of List input. Looking at the raw data file, we immediately notice that the fields are separated by commas. The default delimiter is a blank. So, we will need to give SAS a special instruction to let SAS know that commas separate the fields. We do that with the DLM= option on the INFILE statement. Now any commas in our data become delimiters, not part of a value.

Looking further along our data records, we notice that the job_title values contain blanks. Since our delimiter is a comma, these blanks will not pose a problem to us. If our delimiter had been a blank, then SAS would have seen the blanks within the job_title values as delimiters, not spaces within a field value. But we don’t have that problem here.

Then we run across a date field. We want it to be defined in our data set as a numeric date value, so we’ll need to specify an informat for this value, and the same holds true for this second date value as well.

Now that you are familiar with the input specification requirements for our raw data file, let’s generate the INPUT statement to read this data.
• We will begin with the keyword INPUT as always.
• Then we tell SAS that we want to create the employee_ID variable. We don’t have to give SAS a starting or ending position.
It will just start reading the raw data file wherever the pointer currently resides and will continue to read, or scan, the data values until it sees a comma. In this first record, it will read 120102, stop, and place that value into the employee_ID variable.
• Next we read in the first_name values. Note the different syntax here. We have a colon modifier and an informat. By default, any variable created with List input is assigned a length of 8 bytes. In the case of the first_name variable, we have values that are longer than 8 bytes. If there are fields over 8 characters wide, you need to specify an informat, as we have done here.
• For last_name, we also specify a colon modifier and an informat.
• For gender, the values are always only 1 byte, so we don't have to specify anything special here.
• We define the job_title and country variables. The job_title values are longer than 8 bytes, so job_title also gets a colon modifier and an informat.
• Then we come across the date variables. The birth_date values are in DATE9. form, so we use a colon modifier and an informat for it.
• The hire_date values are in a slightly different date form, so we specify MMDDYY10.

If we were to execute the DATA step…

List Input Results

   proc print data=sales_employees noobs;
   run;

Partial SAS data set: work.sales_employees

The SAS System

employee_ID  first_name  last_name   gender  salary  job_title      country  birth_date  hire_date
120102       Tom         Zhou        M       108255  Sales Manager  AU             3510      10744
120103       Wilson      Dawes       M        87975  Sales Manager  AU            -3996       5114
120121       Irenie      Elvish      F        26600  Sales Rep. II  AU            -5630       5114
120122       Christina   Ngan        F        27475  Sales Rep. II  AU            -1984       6756
120123       Kimiko      Hotstone    F        26190  Sales Rep. I   AU             1732       9405
120124       Lucian      Daymond     M        26480  Sales Rep. I   AU             -233       6999
120125       Fong        Hofmeister  M        32040  Sales Rep. IV  AU            -1852       6999
120126       Satyakam    Denny       M        26780  Sales Rep. II  AU            10490      17014

...and then view it with PROC PRINT, notice that the date values have been converted into numeric SAS date values. Also notice that the length attributes for first_name, last_name, and job_title are large enough to store the longest values. If I added a FORMAT statement to my PROC PRINT step, I could make my date values more understandable. But I'm not going to do that for this example, since we saw an example earlier of how to do it.

Comparing Input Styles

                                  Fixed-Width   Free-Format   Nonstandard
                                  Data          Data          Data
Column Input                      X
Formatted Input                   X                           X
Simple List Input                               X
List Input with Colon Modifier                  X             X

This table compares the three input styles that we just looked at. List input is represented twice: once as simple List input, which means that you just list the variable name and a dollar sign if applicable, and once as List input with the colon modifier, which means that the data is free-format but you are using informats to read in special values.

Taking a look at the chart, if you are dealing with fixed-width data, you can use either Column or Formatted input. If you are dealing with free-format data, List input is the way to go. If you have nonstandard data values coming into SAS, you need to read those values with an informat. Both Formatted input and List input can use informats, but Column input cannot.

3. Read Mixed Data Type Records

Using Mixed Processing Methods to Read Raw Data Files
1. Investigate File Layouts
2. Describe SAS Input Styles
3. Read Mixed Data Type Records

Now let's take a look at a case where there is more than one input style needed within each raw data record.
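Before moving on, the comparison of input styles can be sketched in code. The two DATA steps below are a minimal illustration only: the inline records, the data set names (work.fixed_demo, work.list_demo), and the variable names (id, name, amount, start_date) are all invented for this sketch and do not come from the lecture files.

```sas
/* Fixed-width data (invented records).
   Column input reads standard values by column range; Formatted input
   uses a pointer control plus an informat, so it can also read the
   nonstandard date value. */
data work.fixed_demo;
   input id $ 1-4                 /* Column input: columns 1-4        */
         @6  amount 5.            /* Formatted input: columns 6-10    */
         @12 start_date date9.;   /* Formatted input, date informat   */
   datalines;
A001 12500 01JAN2008
A002   980 15MAR2008
;
run;

/* Free-format, comma-delimited data (invented records).
   Simple List input needs only the variable name; the colon modifier
   adds an informat for long or nonstandard values. */
data work.list_demo;
   infile datalines dlm=',';
   input id $                     /* simple List input                */
         name :$20.               /* colon modifier: value over 8 bytes */
         start_date :date9.       /* colon modifier: nonstandard date */
         amount;                  /* simple List input, numeric       */
   datalines;
A001,Maximilian Sample,01JAN2008,12500
A002,Ann Demo,15MAR2008,980
;
run;
```

Note that the first step never uses Column input for the date, which matches the table: Column input cannot apply an informat.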
Objectives

• Investigate the raw data file.
• Create a DATA step to process the mixed record type file.

In this section:
• I will take a look at a raw data file using FSLIST.
• Then we'll generate DATA step syntax to process records where the input syntax changes from Column, to Formatted, to List, and then back to Formatted specifications.

Processing Scenario

• Data from donations database.
• Employees make contributions quarterly.
• The data structure changes within a record.

The charities.txt raw data file contains the cash, check, or payroll deduction contributions of employees who donate quarterly to various charities throughout the year. We will be looking at data records for 2008. As we investigate the data, we will find that the style of input needed to read the raw data changes within a record. This means that we will be specifying Formatted, Column, and List input all in one INPUT statement.

Examining the Mixed Data

   proc fslist file='s:\workshop\charities.txt';
   run;

Partial raw data file: charities.txt

Let's take a quick look at the raw data file with the FSLIST procedure. Notice that some of the data fields are separated with spaces and some are separated with commas. We also notice that some of the fields are fixed-width and others are free-format. So there is definitely a mixture of input styles needed to read just a single record. Over the next several slides, we'll take a look at the fields in the raw data one at a time and build our INPUT statement as we go along.

Coding the INPUT Statement

The first two data fields are fixed width and standard character data.

Partial raw data file: charities.txt

Partial DATA step

   data work.donations;
      infile 's:\workshop\charities.txt';
      input employee_id $ 1-6
            @8 paid_by $17.
            …

The first data field is fixed-width, and there is nothing special about the values. The field represents employee IDs. We have no intention of ever manipulating these data values, so we will create them as standard character values in SAS. We have a choice of reading the data with either Column or Formatted input. For simplicity, we will use Column input. We will call the variable employee_id. It's character, so we need a dollar sign and, looking at the ruler, we tell SAS to read columns 1-6 to get the values for this variable.

The second data field is also fixed-width standard data. So, again, either Column or Formatted input would work here. Let's use Formatted input for demonstration purposes. After reading through column 6 for the employee_id values, the pointer is resting in column 7. We need to move the pointer to column 8, so we say @8. We name the variable paid_by and we use the $17. informat because we need SAS to read the next 17 columns of data for this variable.

Coding the INPUT Statement

The quarterly contribution fields are free-formatted and comma-delimited.

Partial raw data file: charities.txt

Partial DATA step

   data work.donations;
      infile 's:\workshop\charities.txt' dlm=',';
      input employee_id $ 1-6
            @8 paid_by $17.
            @27 qtr1 qtr2 qtr3 qtr4
            …

As we continue from left to right across the file, the data structure changes from fixed-width to comma-delimited free-format for the quarterly contributions. List input needs to be applied here. Before we add to our INPUT statement, we need to tell SAS that the values are delimited by commas. We do that on our INFILE statement with the DLM= option. Now, when SAS sees a comma in the data, it treats it as a delimiter, not as part of a field value.

On our INPUT statement, we need to direct the pointer to move to column 27. Then we name the variables qtr1, qtr2, qtr3, and qtr4.
These are standard numeric values, so no informats are needed. We simply write qtr1, qtr2, qtr3, and qtr4, and SAS scans for the comma delimiter to get the values. The input style is recognized as List input because we have not provided column start-stop positions or informats.

Coding the INPUT Statement

The hiredate field is free-formatted, comma-delimited, and needs a date informat.

Partial raw data file: charities.txt

Partial DATA step

   data work.donations;
      infile 's:\workshop\charities.txt' dlm=',';
      input employee_id $ 1-6
            @8 paid_by $17.
            @27 qtr1 qtr2 qtr3 qtr4
            hiredate :mmddyy10.
            …

As we look at the next field of data, we see more free-format, comma-delimited values. This is a date field, and we want it defined in SAS as a SAS date value. Therefore, we need to specify a date informat for this variable using a colon modifier. The input syntax for this field consists of a variable name, the colon format modifier, and the informat to be used.

Coding the INPUT Statement

The last data field is fixed width and standard character data.

Partial raw data file: charities.txt

Partial DATA step

   data work.donations;
      infile 's:\workshop\charities.txt' dlm=',';
      input employee_id $ 1-6
            @8 paid_by $17.
            @27 qtr1 qtr2 qtr3 qtr4
            hiredate :mmddyy10.
            @50 organization $26.;
   run;

The last field of data is fixed-width with a starting position of column 50. These are standard character values, so we can read them with either Column or Formatted input. Let's go with Formatted. We say @50, then the name of the variable, organization, and give it a $26. informat.

Note that some of the values contain commas. The DLM= option has identified the comma as a delimiter. However, because we have switched back to Formatted input, the commas in this field are treated as part of the value rather than as delimiters.
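To try the completed INPUT statement without access to charities.txt, here is a minimal, self-contained sketch that reads inline records instead of the file. The three records, the employee IDs, and the organization names are invented for illustration; they only imitate the layout described above (employee_id in columns 1-6, paid_by in columns 8-24, comma-delimited values starting in column 27, organization starting in column 50). The comma after each date value is an assumption of this sketch, so that List input has a delimiter to stop on.

```sas
/* Sketch only: invented records that imitate the lecture's layout.
   They are not taken from charities.txt. */
data work.donations_demo;
   infile datalines dlm=',';
   input employee_id $ 1-6          /* Column input                      */
         @8  paid_by $17.           /* Formatted input, columns 8-24     */
         @27 qtr1 qtr2 qtr3 qtr4    /* simple List input, comma scan     */
         hiredate :mmddyy10.        /* List input with colon modifier    */
         @50 organization $26.;     /* Formatted input keeps the commas  */
   datalines;
900001 Cash or Check      25,25,25,25,01/15/1998,Sample Aid International
900002 Payroll Deduction  15,15,15,15,06/01/2003,Example Relief, Inc.
900003 Cash or Check      5,10,15,20,03/01/2001, Demo Shelter, Inc.
;
run;

proc print data=work.donations_demo noobs;
   format hiredate date9.;
run;
```

The third record has shorter quarterly values, so its delimited portion ends one column earlier than in the other two; the @50 pointer control still lines up the organization field, which is why an absolute column pointer is a reasonable choice when switching from List input back to Formatted input.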
One quick note here: for this particular file, each record has a fixed length of 80. If this were a variable-length file instead, you would have to specify additional options on your INFILE statement to make this program work correctly. Consult the documentation for your operating system for more information about fixed-length and variable-length records. If we print the data set...

Unformatted Results

   proc print data=work.donations noobs;
   run;

Partial work.donations data set

We see that all of the data fields have been converted correctly. Note that the hiredate field is in SAS date form. It is a number of days relative to January 1, 1960. Let's make that date more readable.

Formatted Results

   proc print data=work.donations noobs;
      format qtr1-qtr4 dollar3. hiredate date9.;
   run;

Partial work.donations data set

We make this final modification to the scenario solution by adding a FORMAT statement to the PROC PRINT step. The quarter values are displayed as currency by using the DOLLAR3. format. This format places dollar signs in the values, and we have specified no decimal places. For the date values, we want them displayed with a two-digit day, a three-letter month, and a four-digit year. The DATE9. format displays the values in this fashion. And you can see the final results.

So, our INPUT statement notation along the way went from Column input, to Formatted input, to simple List input, to colon-modified List input, and back to Formatted input. Mixing the styles is no problem. It's just a matter of knowing what your data looks like, deciding how you want it stored in SAS, and selecting the correct style of input to read the data into SAS and have it converted appropriately.

Lecture Summary

• Investigated various file structures using PROC FSLIST.
• Defined List, Column, and Formatted input styles.
• Worked through a DATA step scenario using all of the input styles within one INPUT statement.

During this lecture, we saw that there are many different forms in which raw data can be stored. If we want to take that raw data and convert it into a SAS data set, we must tell SAS what the data looks like as it is being read. So we either need to have a file layout available for a given file, or we need to open the file and become familiar with the data ourselves. There are many different editors and browsers that you can use to investigate the data, depending on your operating system. If you want to stick with one tool that works in SAS and across operating systems, you can use the FSLIST procedure, as we did in this lecture.

Next we matched the style of input needed for various file structures. We talked about List, Column, and Formatted input, and when you use one over the other.

And finally, we looked at a situation where the raw data file was mixed type, meaning that there were data fields that worked with Column input, fields that required us to use Formatted input, and fields that required us to use List input. We saw that it is OK to mix different styles within one INPUT statement.

Other Related e-Lectures

A complete list of available SAS training, including SAS e-Lectures, can be found at the following site: http://support.sas.com/training

This concludes our e-Lecture entitled Using Mixed Processing Methods to Read Raw Data Files. We hope that you found the material and presentation helpful. Please visit the SAS Web site at http://support.sas.com/training/ for a complete list of other available e-Lectures and SAS training.

Credits

Using Mixed Processing Methods to Read Raw Data Files was developed by Ted Durie. Additional contributions were made by Cindy Cragin, David Ghan, Linda Mitterling, and Bruce Reed.
This lecture was developed by Ted Durie with additional contributions from Cindy Cragin, David Ghan, Linda Mitterling, and Bruce Reed.

Comments?

We would like to hear what you think.
• Do you have any comments about this lecture?
• Did you find the information in this lecture useful?
• What other e-Lectures would you like to see SAS develop in the future?

Please e-mail your comments to EDULectures@sas.com

SAS Education would like to know what you think about this e-Lecture and e-Lectures in general. If you have any comments, we would appreciate receiving your input. You can use the e-mail address listed here to provide feedback, or fill out the short survey at the end of this lecture.

Copyright

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Copyright © 2009 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved.

Thank you.