Get Your DATA Step “Amped” on Java
By David Bell, Staff Programmer Analyst,
State of California: Dept. of Industrial Relations
Abstract
Beginning with SAS v. 9.1.3, the Data Step supports calls to Java objects through the "javaobj" component object.
This enormously extends SAS, because Java can be used to apply experimental methods to SAS data that are
not provided by SAS procedures or other statements. The methods presented are a "fuzzy Hamming"
dissimilarity measure and a metaphone phonetic coding routine, both of which come in handy when
trying to match names from different databases. Setting up Eclipse as the development environment
will be demonstrated, together with the steps for including Java within the Data Step.
Search Keywords: Data Step, Java, Eclipse IDE, Fuzzy Hamming, Metaphone
Operating Systems applicable: Windows
Introduction:
In this session, I will show how to set up a SAS Java development environment using Eclipse,
demonstrate the development and compilation of a small Java program for inclusion in the SAS Data Step,
and show how to call that small Java program from the Data Step.
The session will conclude with a demonstration of applying the Double Metaphone and Fuzzy Hamming
methods to link two datasets in which the major identifiers (names and number identifiers) are misspelled
and/or miskeyed.
Why Java Objects?
SAS is a very robust analytical system that is known for its quality and for an extensive set of analytical
tools and suites that can do just about anything. But what if you come across that rare occasion when you
want to use a method that is not (i.e., “not yet”) in SAS's arsenal of analytical tools? Maybe the method is
new or experimental (e.g., Fuzzy Hamming dissimilarity measurement or Independent Components
Analysis), is just emerging as a widespread method (e.g., Double Metaphone phonetic coding), or is just
getting prototyped by your shop. In any case, you need to write the method in a programming language and
use a consistent data management environment to test it.
Yet you want to use SAS Data step tools as a reliable and stable way of managing your data and providing a
rock solid data prep environment so you can test your new method without having to worry that your data
preparations are unstable or unreliable.
A side benefit is that your data logic is kept completely separate from your method logic, which enforces the
modularization desired in Object Oriented Design and Development (OOD2).
Why Eclipse?
In the world of IDEs, the industry is standardizing primarily on two:
Visual Studio/Visual Studio .NET (x86 and x64 Windows and .NET) and Eclipse (x86 and x64 Windows, Unix,
.NET (using plugins), and z/OS (in development)). Each IDE has its strong and weak points.
Since the development here is not going to involve .NET, and since we want to develop both Java and SAS
code, Eclipse makes for a sensible choice. Eclipse's generous offering of plugins and its support for other
languages such as C/C++, Python, Matlab, et al. make it a very capable IDE for Java and SAS
development.
Setting up the Eclipse Environment
“Okey dokey... let's go” Hannibal Lecter.
The first thing to do is to get Eclipse and Java. First let's get Java at http://java.sun.com. The current
version is Java 6. You will want the “standard edition” (aka: Java SE 6 ). After Java is installed the next
thing is to get Eclipse.
Eclipse is an open source IDE that can be obtained from: http://www.eclipse.org. Eclipse 3.3 is the current
version, but 3.2 will work as well.
After Java and Eclipse are installed, the next thing is to set up Eclipse to run SAS programs. This is a bit
tricky, but very doable.
The first thing is to open up Eclipse and go into the “External Tools” area (the green arrow with a toolbox at the
bottom). Next, a new tool needs to be created. We will call it “runSAS.”
Under location put:
<SAS Directory or SAS HOME>\sas.exe.
Next we have to set up the working directory. On that line put:
<Drive Letter>:\workspace\SASFILES.
We'll assume we have already set up a general project in our workspace called “SASFILES.” Okay, so far so
good. Now the razzle dazzle...
This next step CANNOT BE DONE IN INTERACTIVE SAS. It has to be done before the general SAS
runtime begins (i.e., before SAS is started at all). Here is where Eclipse starts earning its keep. Under the
arguments section type:
-sysin ${resource_name} -set CLASSPATH .;
This will allow you to run a COMPILED Java program that is placed in your \workspace\SASFILES directory.
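The runSAS tool amounts to launching sas.exe in batch mode with a fixed argument list. As a rough sketch that is not part of the original setup, the Java class below assembles that same list so it can be inspected or launched outside Eclipse. The class name RunSasCommand, the install path, and the program name happy.sas are all hypothetical. Because the class uses generics, it is meant for a current JDK rather than the SAS JRE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the command line that the "runSAS" tool issues.
public class RunSasCommand {

    // sasExe      : full path to sas.exe (the tool's "Location")
    // programFile : the .sas file to run (Eclipse fills in ${resource_name})
    // classpath   : value handed to SAS as the CLASSPATH variable, e.g. ".;"
    public static List<String> build(String sasExe, String programFile, String classpath) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(sasExe);
        cmd.add("-sysin");
        cmd.add(programFile);
        cmd.add("-set");
        cmd.add("CLASSPATH");
        cmd.add(classpath);
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed install path and program name, for illustration only.
        List<String> cmd = build("C:\\SAS\\sas.exe", "happy.sas", ".;");
        System.out.println(cmd);
    }
}
```

Passing the returned list to java.lang.ProcessBuilder, with the working directory set to the SASFILES project, should reproduce what the external tool does.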
A Little Program.
Let's try a small program. First we have to compile a small Java program THAT IS COMPATIBLE WITH
THE SAS JAVA JRE. This is very, very important, since the class will not run in the Data Step if it is not compatible.
This can be a major pain, but in Eclipse it is a snap. All we have to do is configure our build path to use the
SAS JRE (aka: 1.4.1) and NOT the Java 6 JRE.
Now you might be tempted to just leave it at that, but not so fast. WE MUST ALSO SET the compiler
COMPLIANCE LEVEL to be compatible with 1.4!! Otherwise “bad things man... bad things!”
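One way to confirm which JRE is actually loading your classes is to ask the JVM itself. The class below is a minimal sketch, with the hypothetical name JreInfo, that returns two standard Java system properties as Strings. It avoids generics and other newer language features so that it can be compiled at the 1.4 compliance level.

```java
// Hypothetical helper class; the name and methods are not from the paper.
public class JreInfo {

    // Version of the JVM that loaded this class.
    public String getJavaVersion() {
        return System.getProperty("java.version");
    }

    // Highest class-file format version this JVM accepts.
    public String getClassVersion() {
        return System.getProperty("java.class.version");
    }

    public static void main(String[] args) {
        JreInfo info = new JreInfo();
        System.out.println("java.version       = " + info.getJavaVersion());
        System.out.println("java.class.version = " + info.getClassVersion());
    }
}
```

Run from Eclipse, main reports the JRE on the build path. Called from a Data Step with j.callStringMethod('getJavaVersion', v), where v is a character variable, it should report the JRE that SAS itself is using.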
Okay we're ready to roll. Type the following in your editor screen:
public class Greetings
{
    // Three slots: two preset greetings and one empty slot to be filled from SAS.
    private String phrases[] = { "Hello", "Hi There ", "" };

    // The index arrives from SAS as a double, so it is cast to int.
    public void setGreeting (double index, String phrase)
    {
        phrases [ (int) index ] = phrase;
    }

    public String getGreeting ()
    {
        return phrases [ 0 ];
    }

    public String getGreeting (double index)
    {
        return phrases [ (int) index ];
    }
}
Now compile it into a class file. OK.
Now copy that class file into your SASFILES directory in your Eclipse workspace. Now type the following
SAS program in your editor:
data happy(drop=greets);
   declare JavaObj j ('Greetings');
   length s1-s3 $200 greets $30;
   greets = "Happy happy JOY JOY!!";
   j.callStringMethod ('getGreeting', s1);
   j.callStringMethod ('getGreeting', 1, s2);
   j.callVoidMethod ('setGreeting', 2, greets);
   j.callStringMethod ('getGreeting', 2, s3);
   put (s1-s3) (=/);
run;
proc print;
run;
This generates the following output:
                 The SAS System          20:43 Friday, July 27, 2007   1

Obs    s1       s2          s3
  1    Hello    Hi There    Happy happy JOY JOY!!
Okay this is cool. Wha... happened?
We first created an indexed String array called phrases[] with three elements: "Hello", "Hi There ", and an
empty string. We read the first two into our SAS data set using the getGreeting methods: the version with no
arguments returns element 0, and the version that accepts a double value as an index returns the element at
that position to the SAS character variable we created in our Data Step. Not to be constrained by this, we then
passed our SAS variable “greets” to the Java setGreeting method, which stored it in the empty third String
element, giving us:
Happy happy JOY JOY!!
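Before involving SAS at all, it can help to exercise the class from plain Java. The sketch below repeats the Greetings class and adds a small driver, with the hypothetical name GreetingsCheck, that makes the same four calls as the Data Step and passes the indexes as doubles. Greetings is declared without "public" here only so that both classes fit in one listing.

```java
// Same logic as the Greetings class in the text.
class Greetings {
    private String phrases[] = { "Hello", "Hi There ", "" };

    public void setGreeting(double index, String phrase) {
        phrases[(int) index] = phrase;
    }

    public String getGreeting() {
        return phrases[0];
    }

    public String getGreeting(double index) {
        return phrases[(int) index];
    }
}

// Hypothetical driver that mirrors the Data Step calls.
public class GreetingsCheck {
    public static void main(String[] args) {
        Greetings j = new Greetings();
        String s1 = j.getGreeting();               // callStringMethod('getGreeting', s1)
        String s2 = j.getGreeting(1.0);            // callStringMethod('getGreeting', 1, s2)
        j.setGreeting(2.0, "Happy happy JOY JOY!!"); // callVoidMethod('setGreeting', 2, greets)
        String s3 = j.getGreeting(2.0);            // callStringMethod('getGreeting', 2, s3)
        System.out.println("s1=" + s1);
        System.out.println("s2=" + s2);
        System.out.println("s3=" + s3);
    }
}
```

If the driver prints the same three values as the SAS output, any later failure inside the Data Step is more likely a classpath or JRE compatibility problem than a bug in the Java logic.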
Now The Wow..
Okay, now that we have the hang of this, let's do something with a little more pizzazz, like adding a jar file to
the mix. I just happen to have one that does Fuzzy Hamming distance dissimilarity calculations and
Metaphone coding of names.
We need to include the following into our runtime arguments statement:
-sysin ${resource_name} -set CLASSPATH .;<DRIVE>:/jars/javasas.jar;
See what we did? We now are going to have SAS call an external jar file that we compiled and built. I used
the “fatjar” Eclipse Plugin. Ya can get downright spoiled.. Let's try a little SAS program:
LIBNAME MYDAT "C:\DATA\SASDATA";
/* ********************************************************************
 * IMPORT DATA SETS INTO SAS
 * OUTPUT TO PERMANENT SAS DATASETS
 *
 * DATA SomeDATA
 *
 * CLASSPATH STATEMENT:
 *   DIRNAME USES UNIX CONVENTIONS NOT WINDOWS!
 *   E.G.: c:/jars/javasas.jar IS OK. NOT c:\jars\javasas.jar
 *
 *   sas -sysin <FILENAME>.sas -set CLASSPATH .;<DIRNAME>javasas.jar;
 *
 * NOTE: COMPILE JAVA IN SAME VERSION AS SAS IS RUNNING !!
 *       USE proc javainfo; to determine the SAS JRE version
 ********************************************************************/
/* GET data Set One WITH SSN*/
* DONT WANT JUNK. FILTER OUT BAD SSNS;
data ONE(where=(SUBSTR(SSN,1,5) not in ('99999','11111','00000')));
   length SSN_ONE $9;
   set ONE;
   SSN_ONE = SSN;
run;
/* Sort it */
proc SORT;
   BY SSN;
RUN;
data TWO;
   /* *************************************************************************
    * Load (declare) the Java object. Testing _N_ = 1 loads it once instead of
    * on every Data Step iteration; this is similar to loading a macro library
    * once. Note the setting up of an alias 'j'.
    ***************************************************************************/
   IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS');
   * OKAY CALL IN THE DATA;
   SET MYDAT.TWO;
   LENGTH SSN $9 CLIENT_NO $7 DEU_EMPLOYEE_LNAME $11 DEU_EMPLOYEE_LNAME $10;
   /* Call the method. This is similar to calling a macro in a macro library.
    * The format here is ObjAlias.call<Type>Method('<method name>', IN VARS,
    * Return Var); */
   CLIENT_NO = VAR1; * WE HAVE TO RENAME SOME RAW VARIABLES;
   LNAME = TRIM(VAR3);
   FNAME = TRIM(VAR4);
   /* REMOVE DASHES FROM SSN USING JAVA'S STRING TOKENIZER */
   j.callStringMethod ('parseSSN',VAR5,SSN);
run;
proc SORT;
   BY SSN;
run;
data ONENEW outFROI TWONEW;
   MERGE ONE(in=ok) TWO(in=ok2);
   BY SSN;
   IF (ok and ok2) then output ONENEW; /* DETERMINISTIC MATCH */
   else if (ok and not ok2) THEN output outFROI;
   else output TWONEW;
run;
/* NOW LET'S USE METAPHONE FOR THOSE THAT DID NOT MATCH */
data outONE;
   IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS'); * constructor call;
   SET outFROI; * RECORDS FROM ONE WITH NO SSN MATCH IN TWO;
   length LNphoneme FNphoneme $6;
   j.callVoidMethod ('setPhoneme',LAST_NAME);
   j.callStringMethod ('getPhoneme',LNphoneme);
   j.callVoidMethod ('setPhoneme',FIRST_NAME);
   j.callStringMethod ('getPhoneme',FNphoneme);
run;
proc sort;
by LNphoneme;
run;
data outTWO;
   IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS'); * constructor call;
   SET TWONEW;
   length LNphoneme FNphonemeTWO $6;
   j.callVoidMethod ('setPhoneme',LNAME);            * method call: SEND IN LNAME;
   j.callStringMethod ('getPhoneme',LNphoneme);      * method call: GET METAPHONE CODE;
   j.callVoidMethod ('setPhoneme',FNAME);            * method call: SEND IN FNAME;
   j.callStringMethod ('getPhoneme',FNphonemeTWO);   * method call: GET METAPHONE CODE;
RUN;
proc contents;
run;
proc sort;
   by LNphoneme;
run;
data ONETWO UNMATCHED;
MERGE outTWO(in=ok) outONE(in=ok2);
BY LNphoneme;
IF (ok and ok2) then
output ONETWO;
else output UNMATCHED; * NO NAME LINKAGE;
run;
* FINAL MATCH ON METAPHONE LAST AND FIRST NAMES AND FUZZY HAMMING SSN;
* change year;
data MYDAT.ONETWO1P NOMATCH;
   IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS');
   SET ONETWO;
   length SIM 8;
   if FNphonemeTWO = FNphoneme then do;
      j.callVoidMethod('setFHamming',SSN,SSN_ONE);
      j.callDoubleMethod('getFHamming',SIM);
   end;
   * SET OUR THRESHOLD;
   if SIM > .85 and CLIENT_NO ^= '.' then output MYDAT.ONETWO1P;
   else output NOMATCH;
run;
proc PRINT data=MYDAT.ONETWO1P;
RUN;
This matches the following records:
recNO  SSN        LAST NAME  FIRST NAME  SIMILARITY
1      123862718  HAMMY      PIGGIE
2      123861718  HAMMS      PIGGY       0.88889
1      183123456  ROBBLES    BARNEY
2      183123456  RUBBLES    BARN        0.88889
We just matched records that would NOT match using DETERMINISTIC linkage in a million years.
Why? The names are not spelled exactly the same and the SSNs are keyed differently. The differences may be
slight, but computers are very, very literal. Those slight differences would result in a total non-match. In
most datasets, such slight differences are quite commonplace. When looking for records to link to one
another, probabilistic and/or proximal methods often prove necessary when absolutely error-free data
cannot be assumed.
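The jar's source is not listed in this paper, so the class below is only a sketch of the kind of Java the SAS program calls into. The class name LinkageSketch is hypothetical. parseSSN strips dashes with a StringTokenizer, as the program comment describes. setFHamming and getFHamming follow the set-then-get calling pattern used from the Data Step, but they compute a plain position-by-position Hamming similarity, which is an assumption and NOT the fuzzy variant. For the first pair of SSNs listed above, one digit in nine differs, so the plain measure gives 8/9, or about 0.88889.

```java
import java.util.StringTokenizer;

// Hypothetical stand-in for the class in the jar; written without generics
// so that it can be compiled at the 1.4 compliance level.
public class LinkageSketch {
    private double similarity;

    // Drops dashes and blanks from an SSN such as "123-86-2718".
    public String parseSSN(String raw) {
        StringBuffer digits = new StringBuffer();
        StringTokenizer tokens = new StringTokenizer(raw, "- ");
        while (tokens.hasMoreTokens()) {
            digits.append(tokens.nextToken());
        }
        return digits.toString();
    }

    // Plain Hamming similarity (assumption: not the author's fuzzy version).
    // Stores 1 - mismatches/length; a length difference counts as mismatches.
    public void setFHamming(String a, String b) {
        String x = a.trim();
        String y = b.trim();
        int longer = Math.max(x.length(), y.length());
        int shorter = Math.min(x.length(), y.length());
        if (longer == 0) {
            similarity = 0.0; // two blank values are treated as a non-match
            return;
        }
        int mismatches = longer - shorter;
        for (int i = 0; i < shorter; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                mismatches++;
            }
        }
        similarity = 1.0 - (double) mismatches / longer;
    }

    // Returns the value computed by the last setFHamming call.
    public double getFHamming() {
        return similarity;
    }

    public static void main(String[] args) {
        LinkageSketch j = new LinkageSketch();
        System.out.println(j.parseSSN("123-86-2718"));
        j.setFHamming("123862718", "123861718");
        System.out.println(j.getFHamming());
    }
}
```

Splitting the work into a void "set" call and a typed "get" call keeps each method to the simple String and double signatures that the Data Step examples in this paper rely on.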