Chapter 17 Read Raw Data in Fixed Format

Chapter 17
Read Raw Data in Fixed Format
using Formatted Input
Objectives
• Distinguish between standard and
nonstandard numeric data
• Read standard fixed-field data
• Read nonstandard fixed-field data
Review of Column Input
General Syntax of Column Input:
INPUT var <$> start_col - end_col ... ;
– var is the variable name.
– $ indicates a character variable.
– start_col and end_col specify the starting and ending column numbers for
reading the variable.
Ex.
INPUT L_Name $ 1-15 F_name $ 17-25 age 26-28 Choles 30-35;
Important features and usages of Column Input
• It can read character values that have embedded
blanks.
• Missing data values are read as missing from the defined
columns (blank for character and '.' for numeric).
• Columns can be re-read.
ex. INPUT supplier $ 5-20 ItemNum 15-20 amount 22-30;
• Columns can be read backwards or forwards.
Ex. INPUT F_name $ 20-30 L_Name $ 1-15 age 16-18;
Raw data that cannot be read by Column Input
• Data values that are not standard numeric data:
Ex. Data values containing a $ sign, a comma, a %, etc.
• Data values that are not organized in fixed columns for each variable.
• Numeric data values with decimal places where the decimal point is not
recorded in the data.
• Date, time, and datetime values that are recorded not as numeric
values but in a commonly used form, such as
11/14/2010. Such a date requires a special informat to read it. If it
is read using Column Input, it must be read as character data.
• Data values that are not recorded in fixed format, for example,
values separated by delimiters such as a blank, / , ; a
tab, and so on.
A review of
Standard Vs. Nonstandard Numeric Data
Standard numeric data can contain only
• Numbers
• Decimal places
• Numbers in scientific or E-notation (ex, 4.2E3)
• Plus or minus signs
Nonstandard numeric data includes
• Values contain special characters, such as %, $,
comma (,), etc.
• Date and time values
• Data in fractions, integer binary, real binary,
hexadecimal forms, etc.
Determine whether each of the following numeric
data values is standard or nonstandard
345.12        Standard
$345.12       Nonstandard
3,456.12      Nonstandard
20DEC2010     Nonstandard date
12/20/2010    Nonstandard date
A review of
Fixed Format Vs. Free format
Fixed format means a variable occupies in a
fixed range of columns from observation to
observation.
Free format means the data values are not in
a fixed range of columns.
Ex: Fixed format
12345678901234567890
--------------------
HIGH    340 12.5  F
LOW    5630  7.5  F
MEDIAN  674 26.73 M
Free format
12345678901234567890
--------------------
HIGH 340 12.5 F
LOW 5630 7.5 F
MEDIAN 674 26.73 M
A Review of basic statements for reading
External Raw data in a Data Step
General form for the complete DATA step without a FILENAME
statement:
DATA SAS_data_set_name;
INFILE 'input-raw-data-file';
INPUT variable $ start - end . . .;
RUN;
General form for the complete DATA step with a FILENAME statement:
FILENAME fileref 'input-raw-data-file';
DATA SAS_data_set_name;
INFILE fileref;
INPUT variable $ start - end . . .;
RUN;
Example: Review of Reading External Raw Data
Using Column Input
Read External Data salesdata.dat with FILENAME statement
FILENAME sal_dat
'C:\math707\RawData\RawData_dat\salesdata.dat';
DATA saleslib.sales_sasdata;
INFILE sal_dat;
INPUT last_name $ 1-7 sale_date $ 9-11 residential 13-21 commercial
23-31;
RUN;
Read External Data salesdata.dat without FILENAME
statement
DATA saleslib.Sales_sasdata;
INFILE 'C:\math707\RawData\RawData_dat\salesdata.dat';
INPUT last_name $ 1-7 sale_date $ 9-11 residential 13-21 commercial
23-31;
RUN;
Reading External Raw Data using
Formatted Input
General Syntax:
INPUT <pointer-control> variable Informat.;
• Pointer-control: moves the column pointer to the position where reading starts.
• Variable: the name of the variable to be created.
• Informat: the informat used to read the variable.
NOTE: Two pointer controls for setting the column position are:
@n : moves the pointer to column n. This is the
absolute column of the data record.
+n : moves the pointer forward n columns from the
current position.
Informat in the INPUT statement
• An informat is the instruction SAS uses in the INPUT
statement to READ data values, which is why it is
called an INformat.
• Many of the SAS formats discussed in Chapter 4 for
DISPLAYING data values have informats of the same name
that can be used in the INPUT statement for
formatted input.
Recall: SAS Format
• A format is an instruction that SAS uses to
write data values.
• SAS formats have the following form:
<$>format<w>.<d>
– $ : indicates a character format
– format : format name
– w : total width (including decimal places and special
characters)
– . : required delimiter
– d : number of decimal places
SAS INFormats
External Data Value    Informat      Data Value Read
27,134.28              COMMA10.2     27134.28
27134.28               10.2          27134.28
2713428                10.2          27134.28
271342864              10.2          2713428.64
27134.2864             10.2          27134.2864
$27,134.28             COMMA10.2     27134.28
27.5%                  PERCENT5.1    .275
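The w.d rows of the table can be checked with a small in-stream program. This is only a sketch: the data set name informat_demo and the variable name value are made up for illustration. It shows that the d in 10.2 supplies implied decimal places only when the raw value has no decimal point of its own.

```sas
/* Sketch: the w.d informat with and without a decimal point in the data. */
/* informat_demo and value are illustrative names, not from the course.  */
data informat_demo;
   input @1 value 10.2;      /* read columns 1-10 with two implied decimals */
   datalines;
27134.28
2713428
271342864
27134.2864
;
run;

proc print data=informat_demo;
   format value 12.4;        /* show four decimals so all digits are visible */
run;
```

The printed values should match the "Data Value Read" column for the four 10.2 rows.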
SAS INFormats for Date, Time
Recall that a SAS date is stored as the number of days
between 01JAN1960 and the specified date.
If date or time values are recorded in a form such as 10/16/2001 or
16OCT2001, an informat is needed to read them properly from the
external data set:
Date in External data set    Informat     Data value read
10/16/2001                   MMDDYY10.    15264
16OCT2001                    DATE9.       15264
Both informats store the same SAS date value, 15264, which displays as
101601 with the MMDDYY6. format.
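A short sketch of the two date informats follows. The names date_demo, d1, d2 and same are hypothetical. Both fields describe 16 October 2001, so both variables should hold the same stored number of days since 01JAN1960.

```sas
/* Sketch: two date informats reading the same calendar date.     */
/* date_demo, d1, d2 and same are illustrative names.             */
data date_demo;
   input @1 d1 mmddyy10.     /* columns 1-10:  10/16/2001 */
         @12 d2 date9.;      /* columns 12-20: 16OCT2001  */
   same = (d1 = d2);         /* 1 when both informats give the same SAS date */
   datalines;
10/16/2001 16OCT2001
;
run;

/* Without a FORMAT statement the stored day count is printed. */
proc print data=date_demo;
run;

/* With a date format the same values are shown as calendar dates. */
proc print data=date_demo;
   format d1 d2 date9.;
run;
```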
Some commonly used Informat
PERCENTw.d
$BINARYw.
$VARYINGw.
$w.
COMMAw.d
DATE9.
DATETIMEw.
HEXw.
JULIANw.
MMDDYYw.
NENGOw.
PDw.d
PERCENTw.
TIMEw.
w.d
The Informat COMMAw.d
The COMMAw.d informat reads nonstandard numeric data and
removes embedded
blanks,
commas,
dashes,
dollar signs,
percent signs, and
right parentheses.
A left parenthesis at the beginning of the field is converted to a minus sign.
Actual Data value    Informat      Data value read
12,345.67            COMMA9.2      12345.67
$12,345.67           COMMA10.2     12345.67
12 345.67            COMMA9.2      12345.67
12-345.67            COMMA9.2      12345.67
(12,345.67)          COMMA11.2     -12345.67
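The COMMAw.d examples can be tried with in-stream data. In this sketch the names comma_demo and amount are made up, and a single width of 11 is used for every record (an assumption for convenience, so that the widest value fits), rather than the per-row widths in the table.

```sas
/* Sketch: COMMAw.d stripping special characters from numeric fields. */
/* comma_demo and amount are illustrative names.                      */
data comma_demo;
   input @1 amount comma11.2;   /* width 11 covers the longest value */
   datalines;
12,345.67
$12,345.67
12 345.67
12-345.67
(12,345.67)
;
run;

proc print data=comma_demo;
   format amount 12.2;
run;
```

The listing should reproduce the "Data value read" column, with the value in parentheses read as negative.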
Exercise
Practice Informat in Input statement: The data set aug99n.dat
is posted on the class website.
Field Name     Start Column   End Column   Maximum Width   Data Type
ID             1              3            3               numeric
Date           5              13           9               character
Item           15             15           1               character
Quantity       17             19           3               numeric
Price          21             24           4               numeric
Percentsale    26             29           4               numeric
Three observations of the data set are shown below:
100 02AUG1999 R   6  180  90%
103 03AUG1999 C  20   80 100%
102 04AUG1999 T   4  800  85%
Write a SAS program to read this data, pay special attention to the use of
Informat to read non-standard numeric values. Print the data set using
proper display formats for non-standard variables.
Answer: There are many ways to
accomplish the same goal. Here is an
example.
data orders;
infile 'C:\math707\RawData\RawData_dat\aug99n.dat';
input ID 1-3 @5 date date9. item $ 15 quantity 17-19
totalcost 21-24 @26 percentsale percent4.;
run;
proc print data=orders;
format date MMDDYY8. percentsale percent6.;
run;
Salesdata.dat
1234567890123456789012345678901234567890
----------------------------------------
SMITH   10JAN2010 140097.98 440356.70
DAVIS   15JAN2010 385873.00 178234.00
JOHNSON 20JAN2010  98654.32 339485.00
SMITH   01FEB2010 225983.09  12250.00
DAVIS   12FEB2010  88456.23  55564.00
JOHNSON 22FEB2010 135837.34  32423.00
SMITH   10MAR2010   .000000 584593.00
DAVIS   18MAR2010  89342.00 173222.00
JOHNSON 26MAR2010 398573.90 193567.00
Read Salesdata.dat data using Formatted Input
Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat';
input last_name $7. @9 month date9.
@19 residential 9.2 +1 commercial 9.2;
run;
Proc print; Run;
NOTE: INFILE defines the location of the raw data file.
$w. is the informat for a character variable.
@n: moves the pointer to column n.
+n: moves the pointer forward n columns from the current position.
The pointer starts at column 1. After reading a variable, the pointer moves to the
next column, which becomes the current position.
Ex: After reading last_name with 7 columns, the pointer moves to column 8 as
the current position. Residential starts at column 19 and is 9 columns wide,
so it occupies columns 19 to 27, and the pointer then moves to column
28 as the current column. Hence, +1 moves the pointer one column
forward from col 28 to col 29, where 9 columns are read for commercial.
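The pointer arithmetic described above can be checked by reading the same two records twice, once with the relative control +1 and once with the absolute control @29. This is a sketch with made-up data set names (rel_pointer, abs_pointer) and in-stream copies of two Salesdata.dat records. If the reasoning holds, PROC COMPARE should report no differences.

```sas
/* Sketch: +1 after residential lands on the same column as @29. */
/* rel_pointer and abs_pointer are illustrative names.           */
data rel_pointer;
   input last_name $7. @9 sale_date date9.
         @19 residential 9.2 +1 commercial 9.2;   /* relative move: 28 -> 29 */
   datalines;
SMITH   10JAN2010 140097.98 440356.70
JOHNSON 20JAN2010  98654.32 339485.00
;
run;

data abs_pointer;
   input @1 last_name $7. @9 sale_date date9.
         @19 residential 9.2 @29 commercial 9.2;  /* absolute move to column 29 */
   datalines;
SMITH   10JAN2010 140097.98 440356.70
JOHNSON 20JAN2010  98654.32 339485.00
;
run;

/* Compare the two data sets variable by variable. */
proc compare base=rel_pointer compare=abs_pointer;
run;
```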
Example: Read the following data using Formatted Input
The following are the scores for the quizzes, test1, test2 and final of a
class.
Name Q1  Q2  Q3  Q4  Q5  T1  T2 Final
----+----1----+----2----+----3----+----4
CSA  17  18  15  19  18  85  92  145
DB    .  16  14  18  16  72  76  120
QC   20  18  19  16  20  92  95  143
DC   18  15   .  15  20  82  79  125
E    20  18  15  15  18  80  82  135
F    16  16  15  15  16  72  75  116
GC   20  16  17  16  17   .  87  139
HD   18  15  15   .  19  85  79  115
IM   17  18  19  20  20  95  92  145
WB   13  16  14  15  16  72  66  110
Write a SAS program to read the data by having the
data included in the SAS program.
/* Program Statements */
DATA scores;
/* Column Input (alternative; use only one INPUT statement):
INPUT Name $ 1-5 Q1 6-7 Q2 10-11 Q3 14-15 Q4 18-19 Q5 22-23
TEST1 25-27 TEST2 29-31 Final 33-36;
*/
/* Formatted Input */
INPUT NAME $5. Q1 2. @10 Q2 2. @14 Q3 2. @18 Q4 2. @22
Q5 2. @25 TEST1 3. @29 TEST2 3. @33 FINAL 4.;
/* Column ruler for the data lines:
1234567890123456789012345678901234567890 */
DATALINES;
CSA  17  18  15  19  18  85  92  145
DB    .  16  14  18  16  72  76  120
QC   20  18  19  16  20  92  95  143
DC   18  15   .  15  20  82  79  125
E    20  18  15  15  18  80  82  135
F    16  16  15  15  16  72  75  116
GC   20  16  17  16  17   .  87  139
HD   18  15  15   .  19  85  79  115
IM   17  18  19  20  20  95  92  145
WB   13  16  14  15  16  72  66  110
;
RUN;
Different formatted inputs to read the
same data
/*Column Input */
INPUT Name $ 1-5 Q1 6-7 Q2 10-11 Q3 14-15 Q4 18-19
Q5 22-23 TEST1 25-27 TEST2 29-31 Final 33-36;
/*Formatted Input */
INPUT NAME $5. Q1 2. @10 Q2 2. @14 Q3 2. @18 Q4 2.
@22 Q5 2. @25 TEST1 3. @29 TEST2 3. @33 FINAL 4.;
INPUT NAME $5. Q1 2. +2 Q2 2. +2 Q3 2. +2 Q4 2.
@22 Q5 2. @25 TEST1 3. +1 TEST2 3. @33 FINAL 4.;
INPUT NAME $5. (Q1-Q5 TEST1 TEST2)(2. +2)
@33 FINAL 4.;
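In the last form, the informat list (2. +2) is reused for each variable in the parenthesized variable list: read 2 columns, skip 2 columns, and repeat. The sketch below applies it to two in-stream records laid out as in the column-input version (name in columns 1-5, two-digit scores every fourth column starting at column 6). The data set name scores_demo is made up. Note that this form reads TEST1 and TEST2 from 2 columns only, so it assumes no three-digit test score.

```sas
/* Sketch: a grouped informat list recycled across seven variables. */
/* scores_demo is an illustrative name.                             */
data scores_demo;
   input Name $5.
         (Q1-Q5 TEST1 TEST2) (2. +2)   /* read 2 columns, skip 2, repeat */
         @33 Final 4.;                 /* absolute jump to the final score */
   datalines;
CSA  17  18  15  19  18  85  92  145
DB    .  16  14  18  16  72  76  120
;
run;

proc print data=scores_demo;
run;
```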
Exercise
Open the program c5_colInp
And change the Column INPUT statement using
Formatted Input.
Fixed Record Length Vs. Variable
Record Length
In reading an external data set, the record length is the size of each
record. Usually a record consists of the variables of one observation.
• NOTE: It is possible for one record to consist of multiple observations.
This will be discussed later.
The size of each record is usually 'FIXED', that is, every record has the same
size. However, this is not always the case in data recording: the
record size may differ from record to record.
When the record lengths differ, Formatted Input may not read the data
values correctly, because Formatted Input looks for the number
of columns specified for each variable. When a record is too short,
the pointer may continue to the next record in order to read the
specified number of columns for (usually) the last variable in the INPUT
statement. When this happens, values are read incorrectly and SAS writes
a note to the log that it went to a new line.
Formatted Input when Reading Records with
Variable Record Lengths
Using the PAD option in the INFILE Statement
One way to fix the problem is to add blank spaces to the
records that are too short, so that every record has the same
'FIXED' length.
The other way is to tell SAS to 'PAD' the records
that are too short with blanks.
Suppose the record length for Salesdata.dat is not fixed.
Example:
Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat' pad;
input last_name $7. @9 sale_date date9.
@19 residential 9.2 +1 commercial 9.2;
run;
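The effect of PAD can be seen without editing any course file by first writing a small temporary file whose second record is one column short. This is a sketch under stated assumptions: the fileref shortrec, the data set names no_pad and with_pad, and the three records are all made up for illustration. Without PAD, the INPUT statement should run past the end of the short record and continue on the next line, so fewer observations are expected than with PAD.

```sas
/* Sketch: the same short record read without and with PAD.        */
/* shortrec, no_pad and with_pad are illustrative names.           */
filename shortrec temp;          /* temporary file, removed when SAS ends */

data _null_;
   file shortrec;
   put 'SMITH   10JAN2010 140097.98 440356.70';   /* 37 columns */
   put 'DAVIS   15JAN2010 385873.00 178234.0';    /* 36 columns: too short */
   put 'JOHNSON 20JAN2010  98654.32 339485.00';   /* 37 columns */
run;

data no_pad;                     /* default behavior: may flow to the next line */
   infile shortrec;
   input last_name $7. @9 sale_date date9.
         @19 residential 9.2 +1 commercial 9.2;
run;

data with_pad;                   /* PAD fills short records with blanks */
   infile shortrec pad;
   input last_name $7. @9 sale_date date9.
         @19 residential 9.2 +1 commercial 9.2;
run;

proc print data=no_pad;   format sale_date date9.; title 'Without PAD'; run;
proc print data=with_pad; format sale_date date9.; title 'With PAD';    run;
title;
```

Comparing the two listings and the log notes for the no_pad step shows what the PAD option changes.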
Formatted PUT to create an External
Data Set
Similar to using formatted input to read an external raw data
set, one can create an external data set using a formatted
PUT statement.
FILENAME fileref 'file-location';
DATA _NULL_;
SET SAS_data_set_name;
FILE fileref;
PUT var format ... ;
RUN;
Example: Create External Data Set using
Formatted PUT
To create an external data set from the salesdata that
consists of only the MARCH sales.
Data work.sale;
INFILE 'C:\math707\RawData\RawData_dat\Salesdata.dat';
input last_name $7. @9 sale_date date9.
@19 residential 9.2 +1 commercial 9.2;
Run;
Data marchsale; set work.sale;
FILE 'C:\math707\RawData\RawData_dat\Sales_March.dat';
IF MONTH(sale_date) = 3;
PUT last_name $7. @10 sale_date MMDDYY10.
@20 residential 10.2 @35 commercial 10.2;
run;
Exercise
The following is a finance data set. The variables are
SSN, Name, Salary, Nyear, Birthday
SSN           Name         Salary        Nyear
029-46-9261   Rudelich     55,000.00     2
074-53-9892   Vincent      65,063.28     5
228-88-9649   Benito       78,386.54     9
442-21-8075   Sirignano    $5,890.06     1
446-93-2122   Harbinger    73,925.37     7
776-84-5391   Phillipon    $49,750.92    12
929-75-0218   Gunter       57,445.32     4
Birthday (values as listed; only six are given):
4595 6780 9295 7896 6596 6097
Write a SAS program to read this data using formatted
input. Practice using the PAD option in the INFILE statement
and make sure you see and understand the difference
between reading with PAD and without PAD.
An answer
/* Program b: read variable-length records. This program has errors.
Check the errors carefully. */
data financeb;
infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat';
input SSN $ 1-11 Name $ 13-22 @24 salary comma10. +2 Nyear 3. +2 birthdate 5.;
run;
proc print; format birthdate date9.;
title 'Errors in reading variable-length records'; run;
/* Program c: use the PAD option in the INFILE statement */
data financec;
infile 'C:\math707\RawData\RawData_dat\finance3_recordlength.dat' pad;
input SSN $ 1-11 Name $ 13-22 @24 salary comma10. +2 Nyear 3. +2 birthdate 5.;
run;
proc print; format birthdate date9.;
title 'Use PAD option to read variable-length records'; run;