Get Your DATA Step “Amped” on Java

By David Bell, Staff Programmer Analyst, State of California: Dept. of Industrial Relations

Abstract

Beginning with SAS v. 9.1.3, the Data step supports calls to Java objects through the "javaobj" component object. This enormously extends SAS, because Java can be used to apply experimental methods to SAS data that are not provided by SAS procedures or other statements. The methods presented are a "fuzzy Hamming" dissimilarity measure and a Metaphone phonetic coding method, which comes in handy when trying to match names from different databases. Setting up Eclipse as the development environment will be demonstrated, together with the steps for including Java within the Data step.

Search Keywords: Data Step, Java, Eclipse IDE, Fuzzy Hamming, Metaphone

Operating Systems applicable: Windows

Introduction:

In this session, I will present how to set up a SAS Java development environment using Eclipse, demonstrate the development and compilation of a small Java program for inclusion in the SAS Data step, and show how to call that small Java program from the Data step. The session will conclude with a demonstration of applying the Double Metaphone and Fuzzy Hamming methods to link two datasets where the major identifiers (names and number identifiers) are misspelled and/or miskeyed.

Why Java Objects?

SAS is a very robust analytical system, known for its quality and for an extensive set of analytical tools and suites that can do just about anything. But what if you come across that rare occasion when you want to use a method that is not (i.e., “not yet”) in the SAS arsenal of analytical tools? Maybe the method is new or experimental (e.g., Fuzzy Hamming dissimilarity measurement or Independent Components Analysis), is just emerging as a widespread method (e.g., Double Metaphone phonetic coding), or is just getting prototyped by your shop.
In any case, you need to write the method in a programming language and use a consistent data management environment to test it. You also want to use the SAS Data step tools as a reliable and stable way of managing your data, providing a rock solid data prep environment, so you can test your new method without having to worry that your data preparations are unstable or unreliable. A side benefit is that your data logic is totally encapsulated from your method logic, which enforces the modularization desired in Object Oriented Design and Development (OOD2).

Why Eclipse?

In the world of IDEs, the industry is standardizing primarily on two: Visual Studio/Visual Studio.Net (x86 and x64 Windows and .Net) and Eclipse (x86 and x64 Windows, Unix, .Net (using plugins), and z/OS (in development)). Each IDE has its strong and weak points. Since the development here is not going to involve .Net, and since we want to develop both Java and SAS code, Eclipse makes for a sensible choice. Eclipse's generous offering of plugins and its support for other languages such as C/C++, Python, Matlab et al. make it a very capable IDE for Java and SAS development.

Setting up the Eclipse Environment

“Okey dokey... let's go” Hannibal Lecter.

The first thing to do is to get Eclipse and Java. First let's get Java at http://java.sun.com. As of this writing the current version is Java 6. You will want the “standard edition” (aka: Java SE 6). After Java is installed, the next thing is to get Eclipse. Eclipse is an open source IDE that can be obtained from: http://www.eclipse.org. Eclipse 3.3 is the current version, but 3.2 will work as well.

After Java and Eclipse are installed, the next thing is to set up Eclipse to run SAS programs. This is a bit tricky, but very do-able. The first thing is to start Eclipse and open the “External Tools” area (the green arrow with a toolbox at the bottom). Next, a new tool needs to be created.
We will call it “runSAS.” Under Location put:

    <SAS Directory or SAS HOME>\sas.exe

Next we have to set up the working directory. On that line put:

    <Drive Letter>:\workspace\SASFILES

We'll assume we have already set up a general project in our workspace called “SASFILES.” Okay, so far so good. Now the razzle dazzle... This next step CANNOT BE DONE IN INTERACTIVE SAS. It has to be done before the general SAS runtime begins (i.e., before SAS is started at all). Here is where Eclipse starts earning its keep. Under the Arguments section type:

    -sysin ${resource_name} -set CLASSPATH .;

The "." entry puts the working directory on the classpath, so this will allow you to run a COMPILED Java program that is placed in your \workspace\SASFILES.

A Little Program.

Let's try a small program. First we have to compile a small Java program THAT IS COMPATIBLE WITH THE SAS JAVA JRE. This is very, very important, since it will not run in the Data step if it is not compatible. This can be a major pain, but in Eclipse it is a snap. All we have to do is configure our build path to use the SAS JRE (aka: 1.4.1) and NOT the Java 6 JRE. Now you might be tempted to just leave it at that, but not so fast. WE MUST ALSO SET the compiler COMPLIANCE LEVEL to be compatible with 1.4!! Otherwise “bad things man... bad things!”

Okay, we're ready to roll. Type the following in your editor screen:

    public class Greetings {
        // Three slots; the third starts empty and is filled in from SAS.
        private String[] phrases = { "Hello", "Hi There ", "" };

        // SAS numerics arrive as doubles, so the index is cast to int.
        public void setGreeting(double index, String phrase) {
            phrases[(int) index] = phrase;
        }

        // No argument: return the first phrase.
        public String getGreeting() {
            return phrases[0];
        }

        // Return the phrase stored at the given index.
        public String getGreeting(double index) {
            return phrases[(int) index];
        }
    }

Now compile it into a class file. OK. Now copy that class file into your SASFILES directory in your Eclipse workspace.
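One thing the class above leaves implicit is that the index arrives from the Data step as a double, and nothing stops a value like 7 or 1.5 from reaching the array. The following is a minimal sketch of a guarded variant, not part of the original program: the class name SafeGreetings and the choice to return an empty string for a bad index are assumptions made for illustration. It is declared without `public` here only so that it compiles under any file name, and it sticks to Java 1.4 syntax.

```java
// Hypothetical variant of Greetings; the name SafeGreetings is an assumption.
// It keeps the double-typed parameters used in the original class, but checks
// the index before casting so a bad value cannot run off the end of the array.
class SafeGreetings {
    private String[] phrases = { "Hello", "Hi There ", "" };

    // True only for whole numbers inside the array bounds (NaN fails the test).
    private boolean valid(double index) {
        return index >= 0 && index < phrases.length && index == Math.floor(index);
    }

    // Silently ignores an invalid index instead of throwing.
    public void setGreeting(double index, String phrase) {
        if (valid(index)) {
            phrases[(int) index] = phrase;
        }
    }

    public String getGreeting() {
        return phrases[0];
    }

    // Returns an empty string for an out-of-range or fractional index.
    public String getGreeting(double index) {
        return valid(index) ? phrases[(int) index] : "";
    }

    // Quick check from plain Java before the class file goes anywhere near SAS.
    public static void main(String[] args) {
        SafeGreetings g = new SafeGreetings();
        g.setGreeting(2, "Happy happy JOY JOY!!");
        System.out.println(g.getGreeting());
        System.out.println(g.getGreeting(2));
        System.out.println("[" + g.getGreeting(7) + "]");
    }
}
```

Whether to return an empty string or let the exception surface is a design choice; the quiet version is shown only because it keeps the Data step running while you experiment.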
Now type the following SAS program in your editor:

    data happy(drop=greets);
        declare JavaObj j ('Greetings');
        length s1-s3 $200 greets $30;
        greets = "Happy happy JOY JOY!!";
        j.callStringMethod ('getGreeting', s1);
        j.callStringMethod ('getGreeting', 1, s2);
        j.callVoidMethod ('setGreeting', 2, greets);
        j.callStringMethod ('getGreeting', 2, s3);
        put (s1-s3) (=/);
    run;

    proc print; run;

This generates the following output:

    The SAS System                    20:43 Friday, July 27, 2007   1

    Obs    s1       s2          s3
      1    Hello    Hi There    Happy happy JOY JOY!!

Okay, this is cool. Wha... happened? We first created an indexed String array called phrases[] with three elements: "Hello", "Hi There ", and an empty string. We read the first two into our SAS data set using the getGreeting methods: the version with no arguments returns element 0, and the version that accepts a double value as an index returns the element at that index. Each returns a String to a SAS character variable that we created in our Data step. Not to be constrained by this, we then passed our SAS variable “greets” to the Java setGreeting method, which stored it in our empty third String element, and read it back, giving us: Happy happy JOY JOY!!

Now The Wow..

Okay, now that we have the hang of this, let's do something with a little more pizzazz, like adding a jar file to the mix. I just happen to have one that does Fuzzy Hamming distance dissimilarity calculations and Metaphone coding of names. We need to include the following in our runtime arguments statement:

    -sysin ${resource_name} -set CLASSPATH .;<DRIVE>:/jars/javasas.jar;

See what we did? We are now going to have SAS call an external jar file that we compiled and built. I used the “fatjar” Eclipse plugin. Ya can get downright spoiled.. Let's try a little SAS program:

    LIBNAME MYDAT "C:\DATA\SASDATA";

    /* ****************************************************************
     * IMPORT DATA SETS INTO SAS
     * OUTPUT TO PERMANENT SAS DATASETS
     * DATA SomeDATA
     * CLASSPATH STATEMENT:
     * DIRNAME USES UNIX CONVENTIONS NOT WINDOWS!
     * E.G.: c:/jars/javasas.jar IS OK.
     NOT c:\jars\javasas.jar
     * sas -sysin <FILENAME>.sas -set CLASSPATH .;<DIRNAME>javasas.jar;
     * NOTE: COMPILE JAVA IN SAME VERSION AS SAS IS RUNNING !!
     * USE proc javainfo; to determine the SAS JRE version
     ********************************************************************/

    /* GET data Set One WITH SSN */
    * DONT WANT JUNK. FILTER OUT BAD SSNS;
    data ONE(where=(SUBSTR(SSN,1,5) not in ('99999','11111','00000')));
        length SSN_ONE $9;
        set ONE;
        SSN_ONE = SSN;
    run;

    /* Sort it */
    proc SORT; BY SSN; RUN;

    data TWO;
        /* load (declare) the java object. Test _N_ = 1 to load it once during
           the data step iterations. This is similar to loading a macro library
           once. Note the setting up of an alias 'j'. */
        IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS');
        /* OKAY, CALL IN THE DATA */
        SET MYDAT.TWO;
        LENGTH SSN $9 CLIENT_NO $7 LNAME $11 FNAME $10;
        /* Call the method. This is similar to calling a macro in a macro library.
           The format here is ObjAlias.call<Type>Method('<method name>', IN VARS, Return Var); */
        CLIENT_NO = VAR1;
        * WE HAVE TO RENAME SOME RAW VARIABLES;
        LNAME = TRIM(VAR3);
        FNAME = TRIM(VAR4);
        /* REMOVE DASHES FROM SSN USING THE JAVA STRING TOKENIZER */
        j.callStringMethod ('parseSSN', VAR5, SSN);
    run;

    proc SORT; BY SSN; run;

    data ONENEW outFROI TWONEW;
        MERGE ONE(in=ok) TWO(in=ok2);
        BY SSN;
        IF (ok and ok2) then output ONENEW;               * DETERMINISTIC MATCH;
        else if (ok and not ok2) THEN output outFROI;     * ONE ONLY;
        else output TWONEW;                               * TWO ONLY;
    run;

    /* NOW USE METAPHONE FOR THOSE THAT DID NOT MATCH */
    data outONE;
        IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS'); * constructor call;
        SET outFROI;                                       * unmatched records from ONE;
        length LNphoneme FNphoneme $6;
        j.callVoidMethod ('setPhoneme', LAST_NAME);
        j.callStringMethod ('getPhoneme', LNphoneme);
        j.callVoidMethod ('setPhoneme', FIRST_NAME);
        j.callStringMethod ('getPhoneme', FNphoneme);
    run;

    proc sort; by LNphoneme; run;

    data outTWO;
        IF _N_ = 1 THEN declare
        JavaObj j ('DeuParseSAS');                         * constructor call;
        SET TWONEW;                                        * unmatched records from TWO;
        length LNphoneme FNphonemeTWO $6;
        j.callVoidMethod ('setPhoneme', LNAME);            * method call: SEND IN LNAME;
        j.callStringMethod ('getPhoneme', LNphoneme);      * method call: GET METAPHONE CODE;
        j.callVoidMethod ('setPhoneme', FNAME);            * method call: SEND IN FNAME;
        j.callStringMethod ('getPhoneme', FNphonemeTWO);   * method call: GET METAPHONE CODE;
    RUN;

    proc contents; run;

    proc sort; by LNphoneme; run;

    data ONETWO UNMATCHED;
        MERGE outTWO(in=ok) outONE(in=ok2);
        BY LNphoneme;
        IF (ok and ok2) then output ONETWO;
        else output UNMATCHED;                             * NO NAME LINKAGE;
    run;

    /* FINAL MATCH ON METAPHONE LAST AND FIRST NAMES AND FUZZY HAMMING SSN */
    * change year;
    data MYDAT.ONETWO1P NOMATCH;
        IF _N_ = 1 THEN declare JavaObj j ('DeuParseSAS');
        SET ONETWO;
        length SIM 8;
        if FNphonemeTWO = FNphoneme then do;
            j.callVoidMethod ('setFHamming', SSN, SSN_ONE);
            j.callDoubleMethod ('getFHamming', SIM);
        end;
        * SET OUR THRESHOLD;
        if SIM > .85 and CLIENT_NO ^= '.' then output MYDAT.ONETWO1P;
        else output NOMATCH;
    run;

    proc PRINT data=MYDAT.ONETWO1P; RUN;

This matches the following records:

    recNO   SSN         LAST NAME   FIRST NAME   SIMILARITY
    1       123862718   HAMMY       PIGGIE
    2       123861718   HAMMS       PIGGY        0.88889

    1       183123456   ROBBLES     BARNEY
    2       183123456   RUBBLES     BARN         0.88889

We just matched two pairs of records that would NOT match using DETERMINISTIC linkage in a million years. Why? The names are not spelled exactly the same and the SSNs are keyed differently. The differences may be slight, but computers are very, very literal, and those slight differences would result in a total non-match. In most datasets, such slight differences are quite commonplace. When looking for records to link to one another, probabilistic and/or proximal methods may often prove necessary when absolutely error-free data cannot be assumed.
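To give a feel for what sits behind a call pair like setFHamming and getFHamming, here is a rough sketch in Java 1.4 style. It is NOT the code inside javasas.jar: the class name HammingSketch is hypothetical, and the measure is a plain normalized Hamming similarity (the share of character positions that agree), which is simpler than the fuzzy variant used in the session. Only the method names and the set-then-get calling pattern are borrowed from the SAS program above. The class is declared without `public` here only so that it compiles under any file name.

```java
// Hypothetical stand-in for the similarity calls; NOT the contents of javasas.jar.
// Computes a plain normalized Hamming similarity, an assumed simplification of
// the fuzzy Hamming measure named in the paper.
class HammingSketch {
    private double similarity = 0.0;

    // Compare the two strings position by position and store the share of
    // positions that match. Positions past the end of the shorter string
    // count as mismatches.
    public void setFHamming(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        int shorter = Math.min(a.length(), b.length());
        if (longer == 0) {
            similarity = 1.0; // two empty strings are treated as identical
            return;
        }
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        similarity = (double) matches / longer;
    }

    // Returned as a double so the Data step can receive it with callDoubleMethod.
    public double getFHamming() {
        return similarity;
    }

    // Quick check from plain Java using the first pair of SSNs shown above.
    public static void main(String[] args) {
        HammingSketch h = new HammingSketch();
        h.setFHamming("123862718", "123861718");
        System.out.println(h.getFHamming());
    }
}
```

For the first pair of SSNs above, eight of the nine digits agree, so this sketch gives 8/9, or about 0.88889, which clears the .85 threshold used in the final Data step.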