PROC IMPORT and more
Or When PROC IMPORT just doesn't do the job
David B. Horvath, CCP, MS
PhilaSUG Spring 2014 Meeting

PROC IMPORT and more
Copyright © 2014, David B. Horvath, CCP — All Rights Reserved
The Author can be contacted at:
504 Longbotham Drive, Aston PA 19014-2502, USA
Phone: 1-610-859-8826
Email: dhorvath@cobs.com
Web: http://www.cobs.com/
All trademarks and servicemarks are the property of their respective owners.

Introductions
• My Background
• PROC IMPORT Basics
• CSV Specification and Issues
• Fixing CSV Data Outside of SAS
• Tips and Tricks

Abstract
• PROC IMPORT comes in handy when quickly trying to load a CSV or similar file. But it does have limitations. Unfortunately, I've run into those limitations and had to work around them.
• This session will discuss the original CSV specification (early 1980s), how Microsoft Excel violates that specification, how SAS PROC IMPORT does not follow that specification, and the issues that can result.
• Simple UNIX tools will be described that can be used to ensure that data hilarities do not occur due to CSV issues.
• Recommendations will be made to get around some of PROC IMPORT's limitations (field naming, data type determination, limits on the number of fields, separators appearing in the data).
• CSV, TAB, and DLM types will be discussed.

My Background
• Base SAS on Mainframe, UNIX, and PC Platforms
• SAS is primarily an ETL tool or programming language for me
• My background is IT – I am not a modeler
• Not my first User Group presentation – I have presented sessions and seminars in Australia, France, the US, and Canada
• Undergraduate: Computer and Information Sciences, Temple Univ.
• Graduate: Organizational Dynamics, UPenn
• Most of my career was in consulting
• Have written several books (none SAS-related, yet)
• Adjunct Instructor covering IT topics

PROC IMPORT Basics
• Will focus on code rather than the GUI Import Wizard
• Handy tool to load data from other sources
• You provide the input (flat) file, SAS creates the dataset
• The primary formats:
  • CSV – Comma Separated Values
  • TAB – Tab (ASCII '09'x, EBCDIC '05'x)
  • DLM – You pick it
• Quoted strings get special treatment any time you use delimited files
  • PROC IMPORT
  • INFILE with DLM and DSD ("Delimiter-Sensitive Data")

PROC IMPORT Basics
• Syntax:
  PROC IMPORT DATAFILE="filename"
       OUT=<libref.>SAS data-set <(SAS data-set-options)>
       <DBMS=identifier> <REPLACE> ;
  <data-source-statement(s);>
• DBMS identifies the data source type
  • CSV, TAB, and DLM will be discussed here
  • DBMS options could be an entire talk in themselves and largely depend on the operating system and installed components
• REPLACE allows the output dataset to be overwritten

PROC IMPORT Basics
• Data-source statements:
  • DELIMITER is only specified with DBMS=DLM
    • Will override DBMS=CSV
    • Can contain multiple values: DELIMITER='!|'
  • GETNAMES gets variable names from the file
    • GETNAMES=YES is the default
    • Will "sasify" the names (underscores replace spaces, length limits, etc.)
  • DATAROW=n
    • Is similar in function to FIRSTOBS
    • Defaults to 2 when GETNAMES=YES, otherwise to 1
  • GUESSINGROWS=20 is the default
    • Used by the wizard to build the format and input statements
    • Tradeoff between time and accuracy – a higher count gives greater accuracy
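Putting several of these data-source statements together, a minimal sketch might look like the following; the file name pipe_data.txt, the output dataset name, and the '|' delimiter are made up for illustration:

  /* Sketch only: read a hypothetical pipe-delimited file with a header row */
  proc import datafile="pipe_data.txt"
       out=work.mydata
       dbms=dlm
       replace;
       delimiter='|';      /* honored because DBMS=DLM                      */
       getnames=yes;       /* variable names come from the first row        */
       datarow=2;          /* data starts on row 2 when GETNAMES=YES        */
       guessingrows=500;   /* scan more rows for better type/length guesses */
  run;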
Example from SAS Documentation
• The input CSV:
  "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
  "Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
  "Canada","Boot","Calgary","8","$17,720","$63,280","$472"
  "Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454"
  "Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341"
  "Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765"
  "Pacific","Boot","Auckland","12","$20,141","$97,919","$962"
  "South America","Boot","Bogota","19","$15,312","$35,805","$1,229"
  "United States","Boot","Chicago","16","$82,483","$305,061","$3,735"
  "Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129"
• The Code:
  proc import datafile="csv1.csv" out=shoes dbms=csv replace;
     getnames=no;
  run;

Example from SAS Documentation
• Generated Code (elided after the first variable):
  /**********************************************************************
  * PRODUCT:   SAS
  * VERSION:   9.2
  * CREATOR:   External File Interface
  * DATE:      10JUN14
  * DESC:      Generated SAS Datastep Code
  * TEMPLATE SOURCE:  (None Specified.)
  ***********************************************************************/
  data WORK.SHOES ;
  %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
  infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;
     informat VAR1 $34. ;
     ...
     format VAR1 $34. ;
     ...
  input
     VAR1 $
     ...
  ;
  if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
  run;

Example from SAS Documentation
• Output:
  Obs  VAR1                        VAR2  VAR3         VAR4  VAR5      VAR6      VAR7
    1  Africa                      Boot  Addis Ababa    12  $29,761   $191,821  $769
    2  Asia                        Boot  Bangkok         1  $1,996    $9,576    $80
    3  Canada                      Boot  Calgary         8  $17,720   $63,280   $472
    4  Central America/Caribbean   Boot  Kingston       33  $102,372  $393,376  $4,454
    5  Eastern Europe              Boot  Budapest       22  $74,102   $317,515  $3,341
  ...
   10  Western Europe              Boot  Copenhagen      2  $1,663    $4,657    $129

CSV Specification and Issues
• When CSV first came out, there was no real specification
• History traced to the list-directed/free-form input specification in FORTRAN 77
• First mention I could find (in SAS TS-673) was Version 6
  • Did not handle empty fields (two delimiters in a row)
  • Did not handle quoted strings containing the delimiter
• Each vendor had their own "standard" which may or may not support quoted strings
  • For a long time, Excel did not quote strings
  • The creator of your file may not quote strings
• RFC 4180, IETF 2005, is the definitive reference now!

CSV Specification and Issues
• The format is basically:
  • Optional header (usually column names):
    field_name,field_name,field_name
  • Followed by datalines:
    aaa,bbb,ccc
    zzz,yyy,xxx
  • Optionally, strings are contained within single or double quotes:
    "aaa","bbb","ccc"
  • If a quoted string contains the quote character, it must be escaped (doubled):
    • "Af"r,ica"  becomes "Af"r and ica" – two fields
    • "Af""r,ica" becomes Af"r,ica – a single field
  • Each line should have a line terminator
    • This only really matters if you have issues importing
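As a quick check of how a DSD-style reader treats an embedded delimiter and a doubled quote, here is a minimal sketch using in-stream data; the dataset name and field lengths are made up:

  /* Sketch: DSD keeps quoted commas and turns "" back into a single quote */
  data quoted_demo;
     infile datalines dsd dlm=',' truncover;
     length a b c $ 30;
     input a $ b $ c $;
  datalines;
"Af""r,ica","Boot","Addis Ababa"
;
  /* a resolves to Af"r,ica -- one field, with the comma and quote kept */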
CSV Specification and Issues
• What happens if delimiter-containing fields are not quoted?
  • Depends on the consistency of the data
    • OK:     Applicant, Instructor Resources, Human Management, Fiscal
    • Not OK: Applicant, Instructor Resources, Human Training
  • Inconsistent number of fields
  • Column names exist in some files
  • Page titles

Fixing CSV Data Outside of SAS
• I strongly advise against CSV in favor of TAB-delimited
• Second choice is delimited by another character (like '|' pipe)
• Ask for a new version, properly formatted
• Manually edit the file
• Use some UNIX and SAS tricks

Tips and Tricks – Modifying Generated Code
• My favorite trick is to let PROC IMPORT write the initial code for a file and then copy/paste/edit the resulting DATA step into the program.
  • Can alter variable names, formats, and data types
  • Can use more INFILE options like DLMSTR, DLMSOPT, END, and others
  • Can perform data validation/correction
    • Excel will let you enter '14-JUL' as a date, treat it as July 14 2014, but write it out to the file as '14-JUL'
    • Humans are lazy.
    • Your program could input the field and correct it as the data flows through.
• Programmers are lazy. I don't want to have to type all the variables in a large delimited file.
  • UNIX has a tool to cut columns out of a file:
    cut -c 10- csv3.log > data.txt
  • Will eliminate the row numbers and formatting in log columns 1-9

Tips and Tricks – Hanging PROC
• If you see this error when using PROC IMPORT or EXPORT while running from the command line:
  ERROR: Cannot open X display. Check the name/server access authorization.
• Then the wizard is trying to run interactively. Your command line should include:
  sas -noterminal
• Supposed to be fixed in V9.3
• Another symptom is that the PROC "hangs"

Tips and Tricks – Delimited File Not Quote Escaped
• Problem reported recently by Don Stanley (bchr0001@GMAIL.COM)
• This is the case where quotes are just data:
  AAAA.BBBB.'I.CCCC'.0.DL
• PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC' as one field; the developer wants it to be two.
• With the possibility of empty fields, partial DSD functionality is needed. But DSD also invokes the quoted-field handling.

Tips and Tricks – Delimited File Not Quote Escaped
• SAS Solution (I lost the contributor's name):
  infile custdata dlm=',' ;
  input @ ;
  do while(index(_infile_,',,') > 0) ;
     _infile_ = tranwrd(_infile_,',,',', ,') ;
  end ; /* comma-comma to comma-space-comma */
  input << delimited fields >> ;
  • _infile_ is limited to 32,767 characters – could be a problem for large records
• UNIX Shell Solution:
  nawk '{gsub(",,", ", ,"); print $0;}' INPUTFILE > OUTPUTFILE
  • Handles (essentially) unlimited record lengths

Tips and Tricks – File Contents
• UNIX head will show the first few rows
  • Does the file have extra lines at the top (like a title line)?
• UNIX tail will show the last few rows
  • Do I have all the data?
  • Are the records consistent beginning to end?
• UNIX wc -l (word count, lines) will show how many lines in total
  • Is this file the right size?
• You can always edit the file (unless it is too large)
• UNIX od -c -x (octal dump, character and hex) is useful to see the format and contents of a file (including unprintable characters)
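If the UNIX tools are not handy, a rough SAS-only counterpart to these checks can be sketched as follows; the fileref rawcsv is hypothetical, and the comma counting is only as good as COUNTW's 'q' modifier:

  /* Sketch: report record count, longest record, and most fields seen */
  data _null_;
     infile rawcsv lrecl=32767 truncover end=eof;
     input;
     retain maxlen maxnf 0;
     maxlen = max(maxlen, length(_infile_));
     maxnf  = max(maxnf, countw(_infile_, ',', 'q'));  /* 'q' ignores quoted commas */
     if eof then put 'NOTE: records=' _n_ ' max length=' maxlen ' max fields=' maxnf;
  run;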
Tips and Tricks – File Contents
• Determine the maximum record length and field count (based on delimiters) with a script
• This example is written in awk (new awk):
  BEGIN { FS=","; mNF=0; mRL=0; }
  { if (NF > mNF) mNF=NF;
    RL=length($0);
    if (RL > mRL) mRL=RL; }
  END { print "max NF " mNF " max RECL " mRL; }
• Max NF is the highest number of fields on any record (does not handle quote escaping)
• Max RECL is the longest record length in the file
• Can be saved in a file and executed against your data:
  nawk -f csv_check.awk INPUTFILE.csv

Tips and Tricks – Dropping Titles
• The input CSV:
  This is a title line and should be here
  country,shoe,city,count,average salary,max salary,average
  "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
  "Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
  "Canada","Boot","Calgary","8","$17,720","$63,280","$472"
• PROC IMPORT results:
  Number of names found is less than number of variables found.
  Name This is a title line and should be here truncated to This_is_a_title_line_and_should.
  Problems were detected with provided names. See LOG.
• How to get rid of the first line, since DATAROW= won't help:
  wc -l INPUT            – returns the value 5 (5 lines)
  tail -4 INPUT > OUTPUT – saves the last 4 lines

Tips and Tricks – Dropping Junk Records at End
• The input CSV:
  country,shoe,city,count,average salary,max salary,average
  "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
  "Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
  "Canada","Boot","Calgary","8","$17,720","$63,280","$472"
  ,,,,,,
• Results in:
  Obs  country  shoe  city         count  average_salary  max_salary  average
    1  Africa   Boot  Addis Ababa     12  $29,761         $191,821    $769
    2  Asia     Boot  Bangkok          1  $1,996          $9,576      $80
    3  Canada   Boot  Calgary          8  $17,720         $63,280     $472
    4
• How to get rid of the last line, since DATAROW= won't help:
  wc -l INPUT            – returns the value 5 (5 lines)
  head -4 INPUT > OUTPUT – saves the first 4 lines

Personal Note
• On a Personal Note
  • I seem to learn a bunch when working on presentations, new classes, and writings
  • I was still treating PROC IMPORT like I was still using V6; I didn't realize it would handle quoted strings properly now.
  • I have to learn some new habits and trust the tool more.
• In any commands, the single and double quotation marks should be simple, not the "smart quotes" forced by Microsoft. The same applies to dashes or minus signs – they should not be "em dashes" (- versus –).

Wrap Up
• Questions and Answers

Proc Import References
• NOTES
• Proc Import with a twist: http://www2.sas.com/proceedings/sugi30/03830.pdf
• Data Example: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000314361.htm
• RFC 4180, IETF 2005: http://tools.ietf.org/html/rfc4180
• Wikipedia has good CSV background: http://en.wikipedia.org/wiki/Comma-separated_values
• Hanging PROC IMPORT: http://support.sas.com/kb/3/610.html
• CSV-1203 "Standard" – identifies some Excel "oddities": http://mastpoint.curzonnassau.com/csv-1203

A Few More Words About Mainframe Data in Non-EBCDIC Environment
David B. Horvath, CCP, MS
PhilaSUG Spring 2014 Meeting

Mainframe Data in Non-EBCDIC Environment
• SAS TS-642 is a great place to start:
  http://support.sas.com/techsup/technote/ts642.html
• UNIX character set conversion:
  dd if=INPUT of=OUTPUT conv=ascii
  • Does not handle binary or packed decimal correctly
  • The result is readable so you can look at the file
• Input formats (see the sketch after this list):
  • $ebcdicNN. – converts EBCDIC character data to ASCII (COBOL pic X and 9 usage display – the default)
  • s370fpdN.M – converts packed data (2 binary coded decimal digits per byte, COBOL pic 9 usage comp-3)
  • s370fzdN. – converts zoned decimal (COBOL S9 usage display where the sign is not a separate leading/trailing character)
  • s370fibN. – converts binary data (COBOL pic 9 usage comp)
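A minimal sketch of these informats in use follows; the fileref, field names, lengths, and column positions are all invented for illustration, and the file is assumed to have been transferred in BINARY mode so the packed and binary fields survive:

  /* Sketch: fixed-format EBCDIC record with character, packed, binary,  */
  /* and zoned fields; the layout is hypothetical                        */
  data work.mf_sample;
     infile mfdata recfm=f lrecl=80;
     input @1  acct_id    $ebcdic10.   /* pic X(10), character              */
           @11 balance    s370fpd6.2   /* pic S9(9)V99 comp-3, packed       */
           @17 units      s370fib4.    /* pic S9(8) comp, binary fullword   */
           @21 adj_amount s370fzd8.2;  /* pic S9(6)V99 usage display, zoned */
  run;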
Mainframe Data in Non-EBCDIC Environment
• Looking at the data:
  • On the Mainframe itself (easiest if you have access and speak ISPF)
  • Converted to ASCII (via the dd command)
  • In native format, in dump format (character and hex):
    od -c -x EBCDIC_FILE
• Transferring the Data – Fixed Format
  • If there is no binary/packed data, you can use FTP ASCII mode
  • Otherwise you need BINARY mode

Mainframe Data in Non-EBCDIC Environment
• Transferring the Data – Variable Format
  • Conversion on the mainframe to a fixed layout
    • Wasteful of space
    • Allows FTP in ASCII or BINARY depending on contents
    • May still have varying records that require processing
  • Convert on the mainframe to RECFM=U via IEBGENER (copy)
    • Hides the Block and Record descriptor data from the operating system and FTP
    • FTP in BINARY
    • Use recfm=s370vb on the INFILE statement to handle the layout oddities
  • Access the data directly via filename/ftp:
    filename test1 ftp "'SASXXX.VB.TEST1'" HOST='MVS' USER='SASXXX' PROMPT s370v RCMD='site rdw';
    • The s370v option and RCMD='site rdw' tell SAS to handle the layout oddities

Mainframe Data in Non-EBCDIC Environment
• Variable Format Internal Layout (RECFM=VB)
  • The files are formatted like:
    BB00RR00yourdataRR00secondrecordRR00thirdrecord
  • BB is the blksize (block descriptor) and counts everything (47 bytes here)
  • RR is the physical record length – it includes the space for the data and the RR/00 itself (12 for yourdata, 16 for secondrecord, and 15 for thirdrecord)
  • A record cannot exceed the defined LRECL; a block cannot exceed the defined BLKSIZE. These are defined when the dataset is created.
  • Each block has the same general layout
  • In the record, these fields are binary data; do not try to use BB or RR in calculations if you have converted the file from EBCDIC to ASCII.
  • Native FTP will strip out the layout data and may insert CR/LF
    • Your data may look like CR/LF
  • May you never have to deal with mainframe tapes (see me if you do)

Mainframe Data in Non-EBCDIC Environment
• Meaning Changes in Variable Record Format Files
  • Typically a record type in the first byte
  • The remainder of the record is of that type
  • For instance, an Employee File:
    • A – basic employee information: name, address, SSN
    • B – department and report-to information
    • C – education, 1 to many records
    • D – hire/termination information, 1 to many records
  • Individual byte positions in the record can have different meanings (COBOL REDEFINES, C union)
    input @1 rectype $1. @;
    if rectype = 'A' then do;
       input @2 nextfield ...
  • A fuller sketch of this dispatch pattern appears after the wrap-up below

Mainframe Data in Non-EBCDIC Environment – Wrap Up
• Questions and Answers
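As promised above, here is a fuller sketch of the record-type dispatch pattern; the fileref, record types, field names, and column layouts are all invented for illustration:

  /* Sketch: dispatch on a record-type byte into separate datasets */
  data emp_a(keep=rectype name ssn)
       emp_c(keep=rectype school degree);
     infile empfile truncover;
     input @1 rectype $1. @;           /* read the type and hold the line */
     if rectype = 'A' then do;         /* basic employee information      */
        input @2 name $30. @32 ssn $9.;
        output emp_a;
     end;
     else if rectype = 'C' then do;    /* education, 1 to many records    */
        input @2 school $25. @27 degree $10.;
        output emp_c;
     end;
     /* record types B and D would be handled the same way */
  run;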