Chapter 17 Read Raw Data in Fixed Format Using Formatted Input

Objectives
• Distinguish between standard and nonstandard numeric data
• Read standard fixed-field data
• Read nonstandard fixed-field data

Review of Column Input

General syntax of column input:

INPUT var <$> start_col - end_col ... ;

– var is the variable name.
– $ indicates a character variable.
– start_col and end_col specify the starting and ending columns from which the variable is read.

Ex.
INPUT L_Name $ 1-15 F_name $ 17-25 age 26-28 Choles 30-35;

Important features and uses of Column Input
• It can read character values that contain embedded blanks.
• A field that is blank in the defined columns is read as missing (blank for character variables, '.' for numeric variables).
• Columns can be re-read.
  Ex. INPUT supplier $ 5-20 ItemNum 15-20 amount 22-30;
• Columns can be read in any order, backwards or forwards.
  Ex. INPUT F_name $ 20-30 L_Name $ 1-15 age 16-18;

Raw data that cannot be read by Column Input
• Data values that are not standard numeric data. Ex. values containing a $ sign, a comma, a %, etc.
• Data values that are not organized in fixed columns for each variable.
• Numeric data values with decimal places where the decimal point is not recorded in the data.
• Date, time, and datetime values that are recorded in a commonly used form such as 11/14/2010 rather than as numeric values. Such a date requires a special informat to be read as a date. With column input, it must be read as character data.
• Data values that are not recorded in fixed format, for example values separated by delimiters such as a blank, slash, comma, semicolon, or tab.

A review of Standard vs. Nonstandard Numeric Data

Standard numeric data can contain only
• Numbers
• Decimal points
• Numbers in scientific or E-notation (ex. 4.2E3)
• Plus or minus signs

Nonstandard numeric data includes
• Values that contain special characters, such as %, $, comma (,), etc.
• Date and time values
• Data in fractions, integer binary, real binary, hexadecimal forms, etc.

Determine whether each of the following numeric data values is standard or nonstandard:

Value         Answer
345.12        Standard
$345.12       Nonstandard
3,456.12      Nonstandard
20DEC2010     Nonstandard (date)
12/20/2010    Nonstandard (date)

A review of Fixed Format vs. Free Format

Fixed format means a variable occupies the same fixed range of columns from observation to observation. Free format means the data values are not in a fixed range of columns.

Ex: Fixed format

12345678901234567890
--------------------
HIGH    340 12.5  F
LOW    5630  7.5  F
MEDIAN  674 26.73 M

Ex: Free format

12345678901234567890
--------------------
HIGH 340 12.5 F
LOW 5630 7.5 F
MEDIAN 674 26.73 M

A review of basic statements for reading external raw data in a DATA step

General form for the complete DATA step without a FILENAME statement:

DATA SAS_data_set_name;
INFILE 'input-raw-data-file';
INPUT variable $ start - end ... ;
RUN;

General form for the complete DATA step with a FILENAME statement:

FILENAME fileref 'input-raw-data-file';
DATA SAS_data_set_name;
INFILE fileref;
INPUT variable $ start - end ... ;
RUN;

Example: Review of Reading External Raw Data Using Column Input

Read external data salesdata.dat with a FILENAME statement:

FILENAME sal_dat 'C:\math707\RawData\RawData_dat\salesdata.dat';
DATA saleslib.sales_sasdata;
INFILE sal_dat;
INPUT last_name $ 1-7 sale_date $ 9-11 residential 13-21 commercial 23-31;
RUN;

Read external data salesdata.dat without a FILENAME statement:

DATA saleslib.Sales_sasdata;
INFILE 'C:\math707\RawData\RawData_dat\salesdata.dat';
INPUT last_name $ 1-7 sale_date $ 9-11 residential 13-21 commercial 23-31;
RUN;

Reading External Raw Data using Formatted Input

General syntax:

INPUT <pointer-control> variable informat. ... ;

• Pointer-control: moves the column pointer to the position where reading starts.
• Variable: the name of the variable to be created.
• Informat: the informat used to read the variable.
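The general syntax above can be tried on in-stream data before pointing it at an external file. The following is a minimal sketch only: the data set name work.demo, the variable names, and the two data lines are made up for illustration, and the column positions (1, 10, 20) are assumptions chosen to match those made-up lines.

```sas
/* Sketch: formatted input on in-stream data (all names and values are illustrative). */
data work.demo;
   input @1  name   $8.          /* character informat, columns 1-8           */
         @10 hired  date9.       /* date informat, columns 10-18              */
         @20 salary comma10.2;   /* strips $ and commas, columns 20-29        */
   format hired date9. salary dollar12.2;   /* display formats for printing   */
   datalines;
SMITH    10JAN2010 $27,134.28
DAVIS    15FEB2010   5,890.06
;
run;

proc print data=work.demo;
run;
```

Each variable is preceded by an @n pointer control, so every field is read from a stated column regardless of where the previous field ended.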
NOTE: Two pointer controls set the column position:
• @n : moves the pointer to column n. This is the absolute column of the data record.
• +n : moves the pointer forward n columns from the current position.

Informat in the INPUT statement
• An informat is an instruction that SAS uses in the INPUT statement to READ data values.
• Most of the SAS formats discussed in Chapter 4 for DISPLAYING data values have an informat of the same name that can be used in the INPUT statement for formatted input.

Recall: SAS Format
• A format is an instruction that SAS uses to write data values.
• SAS formats have the following form:

<$>format<w>.<d>

$        indicates a character format
format   format name
w        total width (including decimal places and special characters)
.        required delimiter
d        number of decimal places

SAS Informats

External Data Value    Informat      Data Value Read
27,134.28              COMMA10.2     27134.28
27134.28               10.2          27134.28
2713428                10.2          27134.28
271342864              10.2          2713428.64
27134.2864             10.2          27134.2864
$27,134.28             COMMA10.2     27134.28
27.5%                  PERCENT5.1    .275

When the external value contains a decimal point, that decimal point is used and the d in w.d is ignored. When it does not, the last d digits are treated as decimal places.

SAS Informats for Date and Time

Recall that a SAS date is stored as the number of days between 01JAN1960 and the specified date. If date values are recorded in a form such as 10/16/2001 or 16OCT2001, an informat is needed to read the date properly from the external data set:

Date in External Data Set    Informat     Data Value Read
10/16/2001                   MMDDYY10.    101601
16OCT2001                    DATE9.       101601

Both rows store the same SAS date value, 15264. The value 101601 is that date as it displays with the MMDDYY6. format.

Some commonly used informats

$BINARYw.   $VARYINGw.   $w.        COMMAw.d   DATE9.
DATETIMEw.  HEXw.        JULIANw.   MMDDYYw.   NENGOw.
PDw.d       PERCENTw.d   TIMEw.     w.d

The Informat COMMAw.d

The COMMAw.d informat reads nonstandard numeric data and removes embedded blanks, commas, dashes, dollar signs, percent signs, and right parentheses. A left parenthesis at the beginning of the field is converted to a minus sign.
Actual Data Value    Informat      Data Value Read
12,345.67            COMMA9.2      12345.67
$12,345.67           COMMA10.2     12345.67
12 345.67            COMMA9.2      12345.67
12-345.67            COMMA9.2      12345.67
(12,345.67)          COMMA11.2     -12345.67

Exercise

Practice using informats in the INPUT statement. The data set aug99n.dat is posted on the class website.

Field Name     Start Column   End Column   Maximum Width   Data Type
ID             1              3            3               numeric
Date           5              13           9               character
Item           15             15           1               character
Quantity       17             19           3               numeric
Price          21             24           4               numeric
Percentsale    26             29           4               numeric

Three observations of the data set are shown below:

100 02AUG1999 R   6  180  90%
103 03AUG1999 C  20   80 100%
102 04AUG1999 T   4  800  85%

Write a SAS program to read this data. Pay special attention to the use of informats to read nonstandard numeric values. Print the data set using proper display formats for the nonstandard variables.

Answer: There are many ways to accomplish the same goal. Here is one example.

data orders;
infile 'C:\math707\RawData\RawData_dat\aug99n.dat';
input ID 1-3 @5 date date9. item $ 15 quantity 17-19 totalcost 21-24
      @26 percentsale percent4.;
run;
proc print;
format date MMDDYY8. percentsale percent6.;
run;

Salesdata.dat

1234567890123456789012345678901234567890
----------------------------------------
SMITH   10JAN2010 140097.98 440356.70
DAVIS   15JAN2010 385873.00 178234.00
JOHNSON 20JAN2010  98654.32 339485.00
SMITH   01FEB2010 225983.09  12250.00
DAVIS   12FEB2010  88456.23  55564.00
JOHNSON 22FEB2010 135837.34  32423.00
SMITH   10MAR2010   .000000 584593.00
DAVIS   18MAR2010  89342.00 173222.00
JOHNSON 26MAR2010 398573.90 193567.00

Read Salesdata.dat using Formatted Input

Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat';
input last_name $7. @9 sale_date date9. @19 residential 9.2 +1 commercial 9.2;
run;
Proc print;
Run;

NOTE:
• INFILE defines the location of the raw data file.
• $w. is the informat for character variables.
• @n : moves the pointer to column n.
• +n : moves the pointer forward n columns from the current position.
• The pointer starts at column 1.
• After reading a variable, the pointer sits at the next column, which becomes the current position.
  Ex: After reading last_name with 7 columns, the pointer is at column 8. After reading residential (starting at column 19 and reading 9 columns, that is, columns 19 to 27), the pointer is at column 28. Hence +1 moves the pointer one column forward from column 28 to column 29, and commercial is then read from the next 9 columns.

Example: Read the following data using Formatted Input

The following are the scores on the quizzes, test 1, test 2, and the final for a class.

Name Q1  Q2  Q3  Q4  Q5  T1  T2  Final
----+----1----+----2----+----3----+----4
CSA  17  18  15  19  18  85  92  145
DB    .  16  14  18  16  72  76  120
QC   20  18  19  16  20  92  95  143
DC   18  15   .  15  20  82  79  125
E    20  18  15  15  18  80  82  135
F    16  16  15  15  16  72  75  116
GC   20  16  17  16  17   .  87  139
HD   18  15  15   .  19  85  79  115
IM   17  18  19  20  20  95  92  145
WB   13  16  14  15  16  72  66  110

Write a SAS program to read the data, with the data included in the SAS program.

/* Program statements */
DATA scores;
/* Column input */
INPUT Name $ 1-5 Q1 6-7 Q2 10-11 Q3 14-15 Q4 18-19 Q5 22-23
      TEST1 25-27 TEST2 29-31 Final 33-36;
/* Formatted input: an equivalent alternative. Use one INPUT statement, not both.
INPUT NAME $5. Q1 2. @10 Q2 2. @14 Q3 2. @18 Q4 2. @22 Q5 2.
      @25 TEST1 3. @29 TEST2 3. @33 FINAL 4.;
*/
/* Column ruler: 1234567890123456789012345678901234567890 */
DATALINES;
CSA  17  18  15  19  18  85  92  145
DB    .  16  14  18  16  72  76  120
QC   20  18  19  16  20  92  95  143
DC   18  15   .  15  20  82  79  125
E    20  18  15  15  18  80  82  135
F    16  16  15  15  16  72  75  116
GC   20  16  17  16  17   .  87  139
HD   18  15  15   .  19  85  79  115
IM   17  18  19  20  20  95  92  145
WB   13  16  14  15  16  72  66  110
;
RUN;

Different formatted inputs to read the same data

/* Column input */
INPUT Name $ 1-5 Q1 6-7 Q2 10-11 Q3 14-15 Q4 18-19 Q5 22-23
      TEST1 25-27 TEST2 29-31 Final 33-36;

/* Formatted input */
INPUT NAME $5. Q1 2. @10 Q2 2. @14 Q3 2. @18 Q4 2. @22 Q5 2.
      @25 TEST1 3. @29 TEST2 3. @33 FINAL 4.;

INPUT NAME $5. Q1 2. +2 Q2 2. +2 Q3 2. +2 Q4 2. @22 Q5 2.
      @25 TEST1 3. +1 TEST2 3. @33 FINAL 4.;

INPUT NAME $5. (Q1-Q5 TEST1 TEST2)(2. +2) @33 FINAL 4.;

Exercise

Open the program c5_colInp and rewrite its column INPUT statement using formatted input.

Fixed Record Length vs. Variable Record Length

When reading an external data set, the record length is the size of each record. Usually a record consists of the variables of one observation.
• NOTE: One record can also consist of multiple observations. This will be discussed later.

The size of each record is usually 'FIXED', that is, every record has the same size. However, this is not always the case in recorded data: the record size may differ from record to record.

When the record lengths differ, formatted input may not read the data values correctly, because formatted input reads exactly the number of columns specified for each variable. If a record is too short, the pointer may continue to the next record in order to read the specified number of columns, usually for the last variable in the INPUT statement. When this happens the data values are read incorrectly, and SAS writes a note to the log that it went to a new line.

Formatted Input when Reading Records with Variable Record Lengths: Using the PAD Option in the INFILE Statement

One way to fix the problem is to add blank spaces to the records that are too short, so that every record has the same 'FIXED' length. The other way is to tell SAS to 'PAD' the short records with blanks.

Suppose the record length for Salesdata.dat is not fixed.

Example:

Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat' pad;
input last_name $7. @9 sale_date date9. @19 residential 9.2 +1 commercial 9.2;
run;

Formatted PUT to Create an External Data Set

Just as formatted input reads an external raw data set, a formatted PUT statement can create one.

FILENAME fileref 'file-location';
DATA SAS_data_set_name;
SET input_SAS_data_set;
FILE fileref;
PUT var format ... ;
RUN;

Example: Create an External Data Set using Formatted PUT

Create an external data set from the sales data that contains only the March sales.
Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat';
input last_name $7. @9 sale_date date9. @19 residential 9.2 +1 commercial 9.2;
Run;

Data marchsale;
set work.sale;
FILE 'C:\math707\RawData\RawData_dat\Sales_March.dat';
IF MONTH(sale_date) = 3;
PUT last_name $7. @10 sale_date MMDDYY10. @20 residential 10.2 @35 commercial 10.2;
run;

Exercise

The following is a finance data set. The variables are SSN, Name, Salary, Nyear, and Birthday. The values are listed by variable:

SSN:      029-46-9261 074-53-9892 228-88-9649 442-21-8075 446-93-2122 776-84-5391 929-75-0218
Name:     Rudelich Vincent Benito Sirignano Harbinger Phillipon Gunter
Salary:   55,000.00 65,063.28 78,386.54 $5,890.06 73,925.37 $49,750.92 57,445.32
Nyear:    2 5 9 1 7 12 4
Birthday: 4595 6780 9295 7896 6596 6097

Write a SAS program to read this data using formatted input. Practice using the PAD option in the INFILE statement, and make sure you see and understand the difference between reading with PAD and without PAD.

An answer

/* Program (b): read variable-length records without PAD.
   This program has errors. Check them carefully. */
data financeb;
infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat';
input SSN $ 1-11 Name $ 13-22 @24 salary comma10. +2 Nyear 3. +2 birthdate 5.;
run;
proc print;
format birthdate date9.;
title 'Errors in reading variable-length records';
run;

/* Program (c): use the PAD option in the INFILE statement */
data financec;
infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat' pad;
input SSN $ 1-11 Name $ 13-22 @24 salary comma10. +2 Nyear 3. +2 birthdate 5.;
run;
proc print;
format birthdate date9.;
title 'Use PAD option to read variable-length records';
run;
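
The difference between reading with and without PAD can also be seen without the class files. The sketch below is illustrative only: it writes a small temporary file whose second record is deliberately short, then reads it twice. The fileref demo, the data set names, and the three records are assumptions made up for this example. They are not the contents of finance3_recordlength.dat.

```sas
/* Sketch: compare INFILE with and without PAD on a made-up file. */
filename demo temp;                     /* temporary file, hypothetical fileref */

data _null_;
   file demo;
   put 'SMITH   140097.98 440356.70';   /* full record, 27 columns              */
   put 'DAVIS   385873.00 1782.50';     /* short record, ends at column 25      */
   put 'JOHNSON  98654.32 339485.00';   /* full record, 27 columns              */
run;

/* Without PAD: the last informat asks for columns 19-27 of every record. */
data work.nopad;
   infile demo;
   input last_name $7. @9 residential 9.2 +1 commercial 9.2;
run;

/* With PAD: short records are extended with blanks before they are read. */
data work.padded;
   infile demo pad;
   input last_name $7. @9 residential 9.2 +1 commercial 9.2;
run;

proc print data=work.nopad;
   title 'Without PAD';
run;

proc print data=work.padded;
   title 'With PAD';
run;
```

Comparing the two listings and the log should show the behavior described earlier: without PAD, SAS is expected to run onto the following record when it reaches the short line, so that step produces fewer or incorrect observations, while the padded step should return one observation per record.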