PROC IMPORT and more
Or
When PROC IMPORT just doesn't do the job
David B. Horvath, CCP, MS
PhilaSUG Spring 2014 Meeting
PROC IMPORT and more
Copyright © 2014, David B. Horvath, CCP — All Rights Reserved
The Author can be contacted at:
504 Longbotham Drive, Aston PA 19014-2502, USA
Phone: 1-610-859-8826
Email: dhorvath@cobs.com
Web: http://www.cobs.com/
All trademarks and servicemarks are the
property of their respective owners.
2
Introductions
• My Background
• PROC IMPORT Basics
• CSV Specification and Issues
• Fixing CSV Data Outside of SAS
• Tips and Tricks
3
Abstract
• PROC IMPORT comes in handy when you need to quickly load a CSV or similar file, but it does have limitations. Unfortunately, I have run into those limitations and had to work around them.
• This session will discuss the original CSV specification (early 1980's), how
Microsoft Excel violates that specification, how SAS PROC IMPORT does not
follow that specification, and the issues that can result.
• Simple UNIX tools will be described that can help ensure CSV issues do not lead to data hilarities.
• Recommendations will be made for getting around some of PROC IMPORT's limitations (such as field naming, data type determination, the limit on the number of fields, and separators appearing in the data).
• CSV, TAB, and DLM types will be discussed.
4
My Background
• Base SAS on Mainframe, UNIX, and PC Platforms
• SAS is primarily an ETL tool or Programming Language for me
• My background is IT – I am not a modeler
• Not my first User Group presentation – presented sessions and
seminars in Australia, France, the US, and Canada.
• Undergraduate: Computer and Information Sciences, Temple Univ.
• Graduate: Organizational Dynamics, UPenn
• Most of my career was in consulting
• Have written several books (none SAS-related, yet)
• Adjunct Instructor covering IT topics.
5
PROC IMPORT Basics
• Will focus on code rather than GUI Import Wizard
• Handy tool to load data from other sources
• You provide the input (flat) file, SAS creates the dataset
• The primary formats:
• CSV – Comma-Separated Values
• TAB – Tab (ASCII '09'x, EBCDIC '05'x)
• DLM – You pick it
• Quoted strings get special treatment any time you use delimited files
• PROC IMPORT
• INFILE with DLM and DSD ("Delimiter-Sensitive Data")
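A minimal sketch of the INFILE route; the file name, lengths, and variable names here are illustrative assumptions, not part of any particular example:
data work.sample;
  infile 'sample.csv' dlm=',' dsd missover firstobs=2;  /* firstobs=2 skips a header row */
  length name $40 city $20;
  input name $ city $ amount :comma10.;
run;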
6
PROC IMPORT Basics
• Syntax:
PROC IMPORT
DATAFILE="filename"
OUT=<libref.>SAS data-set <(SAS data-set-options)>
<DBMS=identifier>
<REPLACE> ;
<data-source-statement(s);>
• DBMS identifies data source type
• CSV, TAB, and DLM will be discussed here
• DBMS options could be an entire talk in themselves and largely depend on the operating system and installed components.
• REPLACE allows output dataset to be overwritten.
7
PROC IMPORT Basics
• Data-source-statements:
• DELIMITER is only specified with DBMS=DLM
• Will override DBMS=CSV
• Can contain multiple values: DELIMITER='!|'
• GETNAMES gets variable names from file
• GETNAMES=YES is the default
• Will "sasify" the names (underscores replace spaces, long names are truncated, etc.)
• Defaults DATAROW to 2
• DATAROW=n
• Is similar in function to FIRSTOBS
• Defaults to 1 unless GETNAMES=YES
• GUESSINGROWS=20 is the default
• Used by wizard to build format and input statements
• Tradeoff between time and accuracy – higher count gives greater accuracy
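A sketch pulling these statements together; the pipe-delimited file name and GUESSINGROWS value are illustrative assumptions:
proc import datafile='pipe_data.txt'
     out=work.pipe_data
     dbms=dlm
     replace;
  delimiter='|';      /* required with DBMS=DLM */
  getnames=yes;       /* variable names from row 1 */
  datarow=2;          /* data starts on row 2 */
  guessingrows=500;   /* scan more rows before typing the variables */
run;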
8
Example from SAS Documentation
• The input CSV:
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
"Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454"
"Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341"
"Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765"
"Pacific","Boot","Auckland","12","$20,141","$97,919","$962"
"South America","Boot","Bogota","19","$15,312","$35,805","$1,229"
"United States","Boot","Chicago","16","$82,483","$305,061","$3,735"
"Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129“
• The Code:
proc import datafile="csv1.csv"
out=shoes dbms=csv replace;
getnames=no;
run;
9
Example from SAS Documentation
• Generated Code (informat, format, and input lines for VAR2 through VAR7 elided):
/**********************************************************************
*  PRODUCT:          SAS
*  VERSION:          9.2
*  CREATOR:          External File Interface
*  DATE:             10JUN14
*  DESC:             Generated SAS Datastep Code
*  TEMPLATE SOURCE:  (None Specified.)
***********************************************************************/
data WORK.SHOES ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;
   informat VAR1 $34. ;
   ...
   format VAR1 $34. ;
   ...
input
   VAR1 $
   ...
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
10
Example from SAS Documentation
• Output:
Obs  VAR1                       VAR2  VAR3         VAR4  VAR5      VAR6      VAR7
  1  Africa                     Boot  Addis Ababa    12  $29,761   $191,821    $769
  2  Asia                       Boot  Bangkok         1  $1,996    $9,576       $80
  3  Canada                     Boot  Calgary         8  $17,720   $63,280     $472
  4  Central America/Caribbean  Boot  Kingston       33  $102,372  $393,376  $4,454
  5  Eastern Europe             Boot  Budapest       22  $74,102   $317,515  $3,341
...
 10  Western Europe             Boot  Copenhagen      2  $1,663    $4,657      $129
11
CSV Specification and Issues
• When CSV first came out, there was no real specification
• History traced to list-directed/free-form input in FORTRAN 77
• First mention I could find (in SAS TS-673) was Version 6
• Did not handle empty fields (two delimiters in a row)
• Did not handle quoted strings containing the delimiter
• Each vendor had its own "standard," which may or may not have supported quoted strings
• For a long time, Excel did not quote strings
• The creator of your file may not quote strings
• RFC 4180, IETF 2005, is the definitive reference now!
12
CSV Specification and Issues
• The format is basically:
• Optional Header (usually column names):
field_name,field_name,field_name
• Followed by Datalines:
aaa,bbb,ccc
zzz,yyy,xxx
• Optionally, strings are contained within single or double quotes
"aaa","bbb","ccc"
• If a quoted string contains the quote character, it must be escaped by doubling it (see the demo after this list):
• "Af"r,ica"  (not escaped) is parsed as the two fields Af"r and ica"
• "Af""r,ica" (escaped) is parsed as the single field Af"r,ica
• Each line should have a line terminator
• This only really matters if you have issues importing
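A small demo of the escaping rule above, showing how a doubled quote inside a quoted field is read with DSD; the dataset and field names are illustrative:
data quoted;
  infile datalines dsd dlm=',';
  length f1 f2 $20;
  input f1 $ f2 $;
  put f1= f2=;   /* expect f1=Af"r,ica  f2=second */
datalines;
"Af""r,ica",second
;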
13
CSV Specification and Issues
• What happens if delimiter-containing fields are not quoted?
• Depends on consistency of data
• OK:
Applicant, Instructor
Resources, Human
Management, Fiscal
• Not OK:
Applicant, Instructor
Resources, Human
Training
• Inconsistent number of fields
• Column names exist in some files
• Page titles
14
Fixing CSV Data Outside of SAS
• I strongly advise against CSV in favor of TAB delimited
• Second choice is delimited by another character (like '|' pipe)
• Ask for a new version properly formatted
• Manually edit the file
• Use some UNIX and SAS tricks
15
Tips and Tricks – Modifying Generated Code
• My favorite trick is to let PROC IMPORT write the initial code for a file
and then copy/paste/edit the resulting DATA step into the program.
• Can alter variable names, formats, and data types
• Can use more INFILE options like DLMSTR, DLMSOPT, END, others
• Can perform data validation/correction
• Excel will let you enter '14-JUL' as a date, treat it as July 14, 2014, but write it out to the file as '14-JUL'
• Humans are lazy.
• Your program could input the field and correct it as the data flows through.
• Programmers are lazy. I don’t want to have to type all the variables in
a large delimited file.
• UNIX has a tool to cut columns out of a file:
• cut -c 10- csv3.log > data.txt
• Will eliminate row numbers and formatting in log columns 1-9
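After cutting the generated step out of the log, it can be edited into something like this; a sketch based on the shoes CSV from earlier, where the variable names and informats are my illustrative choices, not what PROC IMPORT produces:
data work.shoes;
  infile 'csv1.csv' delimiter=',' missover dsd lrecl=32767;
  informat region $34. style $12. city $20.
           stores 8. sales dollar12. inventory dollar12. returns dollar12.;
  format sales inventory returns dollar12.;  /* display the dollar amounts as entered */
  input region $ style $ city $ stores sales inventory returns;
run;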
16
Tips and Tricks – Hanging PROC
• If you see this error when using PROC IMPORT or EXPORT while
running from the command line:
ERROR: Cannot open X display.
Check the name/server access authorization.
• Then the wizard is trying to run interactively. Your command line
should include:
sas -noterminal
• Supposed to be fixed in v9.3
• Another symptom is that the PROC “hangs”
17
Tips and Tricks – Delimited File Not Quote Escaped
• Problem reported recently by Don Stanley bchr0001@GMAIL.COM
• This is the case where quotes are just data:
AAAA.BBBB.'I.CCCC'.0.DL
• PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC' as one field; the developer wants it to be two.
• With the possibility of empty fields, partial DSD functionality is needed.
But DSD also invokes the quoted field handling.
18
Tips and Tricks – Delimited File Not Quote Escaped
• SAS Solution (I lost the contributor name)
infile custdata dlm=',' ;
input @ ;
do while(index(_infile_,',,')>0) ;
  _infile_ = tranwrd(_infile_,',,',', ,') ;
end ; /* comma-comma to comma-space-comma */
input << delimited fields >> ;
• _infile_ is limited to 32,767 characters – could be a problem for large records
• UNIX Shell Solution
nawk '{gsub(",,", ", ,"); print $0;}' INPUTFILE > OUTPUTFILE
• Handles (essentially) unlimited record lengths
19
Tips and Tricks – File Contents
• UNIX head will show the first few rows
• Does the file have extra lines at the top (like a title line)?
• UNIX tail will show the last few rows
• Do I have all the data?
• Are the records consistent beginning to end?
• UNIX wc -l (word count, line) will show how many lines in total
• Is this file the right size?
• You can always edit the file (unless too large)
• UNIX od -c -x (octal dump, character and hex) is useful to see the format and contents of a file (including unprintable characters).
20
Tips and Tricks – File Contents
• Determine maximum record length and field count (based on
delimiters) with a script
• This example is written in awk (new awk):
BEGIN{
FS=",";
mNF=0;
mRL=0; }
{
if (NF > mNF) mNF=NF;
RL=length($0);
if (RL > mRL) mRL=RL;
}
END{
print "max NF " mNF " max RECL " mRL;}
• Max NF is highest number of fields on any record (does not handle
quote escaping)
• Max RECL is the longest record length in the file
• Can be saved in a file and executed on your data:
nawk -f csv_check.awk INPUTFILE.csv
21
Tips and Tricks – Dropping Titles
• The input CSV:
This is a title line and should be here
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472“
• PROC IMPORT results:
Number of names found is less than number of variables found.
Name This is a title line and should be here truncated to
This_is_a_title_line_and_should.
Problems were detected with provided names. See LOG.
• How to get rid of the first line since DATAROW= won’t help:
• wc -l INPUT – returns value 5 (5 lines)
• tail -4 INPUT > OUTPUT – saves the last 4 lines
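An alternative is to skip the extra lines inside SAS with a hand-written step instead of trimming the file; a sketch assuming the file layout above, with illustrative file and variable names:
data work.salaries;
  infile 'csv4.csv' dsd dlm=',' missover firstobs=3;  /* skip the title line and the header row */
  length country $30 shoe $10 city $20;
  input country $ shoe $ city $ count
        average_salary :dollar10. max_salary :dollar10. average :dollar8.;
run;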
22
Tips and Tricks – Dropping Junk Records at End
• The input CSV:
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
,,,,,,
• Results in:
Obs  country  shoe  city         count  average_salary  max_salary  average
  1  Africa   Boot  Addis Ababa     12  $29,761         $191,821       $769
  2  Asia     Boot  Bangkok          1  $1,996          $9,576          $80
  3  Canada   Boot  Calgary          8  $17,720         $63,280        $472
  4
• How to get rid of the last line since DATAROW= won't help:
• wc -l INPUT – returns value 5 (5 lines)
• head -4 INPUT > OUTPUT – saves the first 4 lines
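Another option is to drop the junk observation after PROC IMPORT runs; a sketch assuming the imported dataset and the variable names shown above (WORK.IMPORTED is an illustrative name for whatever OUT= you used):
data work.clean;
  set work.imported;
  /* the trailing ,,,,,, line arrives as an all-missing observation */
  if missing(country) and missing(shoe) and missing(city) then delete;
run;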
23
Personal Note
• On a Personal Note
• I seem to learn a bunch when working on presentations, new classes, and
writings
• I was still treating PROC IMPORT as if I were still using V6; I didn't realize it now handles quoted strings properly.
• I have to learn some new habits and trust the tool more.
• In any commands, the single and double quotation marks should be simple, not the "smart quotes" forced by Microsoft. The same applies to dashes or minus signs – they should not be "em dashes" (- versus –)
24
Wrap Up
Questions and Answers
25
Proc Import References
• NOTES
• Proc Import with a twist: http://www2.sas.com/proceedings/sugi30/03830.pdf
• Data Example: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000314361.htm
• RFC 4180, IETF 2005, http://tools.ietf.org/html/rfc4180
• Wiki has good CSV background: http://en.wikipedia.org/wiki/Comma-separated_values
• Hanging PROC IMPORT: http://support.sas.com/kb/3/610.html
• CSV-1203 “Standard” – identifies some Excel “oddities”:
http://mastpoint.curzonnassau.com/csv-1203
26
A Few More Words About
Mainframe Data in Non-EBCDIC Environment
David B. Horvath, CCP, MS
PhilaSUG Spring 2014 Meeting
Mainframe Data in Non-EBCDIC Environment
• SAS TS-642 is a great place to start
http://support.sas.com/techsup/technote/ts642.html
• UNIX character set conversion:
• dd if=INPUT of=OUTPUT conv=ascii
• Does not handle binary or packed decimal correctly
• Is readable so you can look at the file
• Input formats:
• $ebcdicNN. – converts EBCDIC character data to ASCII (COBOL pic X
and 9 usage display – the default)
• s370fpdN.M. – converts packed data (2 binary coded decimal digits per
byte, COBOL pic 9 usage comp-3)
• s370fzdN. – converts zoned decimal (COBOL S9 usage display where
sign is not separate leading/trailing character)
• s370fibN. – converts binary data (COBOL pic 9 usage comp)
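A sketch of these informats in use, assuming a binary-transferred fixed-format file; the record layout, field names, and file name are illustrative assumptions:
data work.mf_accounts;
  infile 'mainframe.dat' recfm=f lrecl=28;
  input @1  acct_id  $ebcdic10.   /* character, pic X(10) usage display      */
        @11 balance  s370fpd6.2   /* packed decimal, pic s9(9)v99 comp-3     */
        @17 tran_cnt s370fib4.    /* binary, pic 9(8) comp                   */
        @21 adj_amt  s370fzd8.2;  /* zoned decimal, pic s9(6)v99 display     */
run;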
28
Mainframe Data in Non-EBCDIC Environment
• Looking at the data:
• On the Mainframe itself (easiest if you have access and speak ISPF)
• Converted to ASCII (via dd command)
• In native format in dump format (character and hex):
• od -c -x EBCDIC_FILE
• Transferring the Data – Fixed Format
• If no binary/packed data, you can use FTP ASCII mode
• Otherwise you need BINARY mode
29
Mainframe Data in Non-EBCDIC Environment
• Transferring the Data – Variable Format
• Conversion on mainframe to fixed layout
• Wasteful of space
• Allows FTP in ASCII or BINARY depending on contents
• May still have varying records that require processing
• Convert on mainframe to RECFM=U via IEBGENER (copy)
• Hides Block and Record descriptor data from the operating system and FTP
• FTP in BINARY
• Use recfm=s370vb on infile statement to handle layout oddities
• Access data directly via filename/ftp:
• filename test1 ftp "'SASXXX.VB.TEST1'" HOST='MVS'
USER='SASXXX' PROMPT s370v RCMD='site rdw';
• The s370v option and RCMD='site rdw' tell SAS to handle the layout oddities
30
Mainframe Data in Non-EBCDIC Environment
• Variable Format Internal Layout (RECFM=VB)
• The files are formatted like:
• BB00RR00yourdataRR00secondrecordRR00thirdrecord
• BB is the blksize (block) and counts everything (47 bytes)
• RR is the physical record length – includes the space for the data and RR/00 (12
for your data, 16 for second, and 15 for third)
• A record cannot exceed the defined LRECL, a block cannot exceed the
defined BLKSIZE. These are defined when the dataset is created.
• Each block has the same general layout
• In the record, these fields are binary data; do not try to use BB or RR in calculations if you have converted the file from EBCDIC to ASCII.
• Native FTP will strip out the layout data and may insert CR/LF
• Your binary data may contain bytes that happen to look like CR/LF
• May you never have to deal with mainframe tapes (See me if you do)
31
Mainframe Data in Non-EBCDIC Environment
• Meaning Changes in Variable Record Format Files
• Typically a record type in first byte
• Remainder of the record is of that type
• For instance, Employee File:
• A – basic employee information: name, address, SSN
• B – department and report-to information
• C – education, 1 to many records
• D – hire/termination information, 1 to many records
• Individual byte position in the record can have different meaning (COBOL
redefines, C union)
input @1 rectype $1. @;
if rectype = "A" then do;
  input @2 nextfield …
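A fuller sketch of this record-type pattern for an EBCDIC file; the file name, field names, positions, and record types are illustrative assumptions:
data work.emp_basic work.emp_dept;
  infile 'empfile.dat' recfm=s370vb;      /* VB file transferred in BINARY mode */
  input @1 rectype $ebcdic1. @;           /* $ebcdic converts, so compare to ASCII 'A' */
  if rectype = 'A' then do;
    input @2 name $ebcdic30. @32 ssn $ebcdic9.;
    output work.emp_basic;
  end;
  else if rectype = 'B' then do;
    input @2 dept $ebcdic8. @10 mgr_id $ebcdic6.;
    output work.emp_dept;
  end;
run;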
32
Mainframe Data in Non-EBCDIC Environment – Wrap Up
Questions and Answers
33