EXTRACTION FROM ENORMOUS SEQUENTIAL HEALTH CARE CLAIMS DATA FILES USING MULTIPLE FORMAT TABLES

Nancy A. Fox, Pracon Incorporated
Craig Ray, ORI, Inc.

This paper describes the use of multiple format tables for the extraction of data from enormous sequential health care claims data files. The format of health care claims data sets used for analysis and the types of fields to be extracted in each record are discussed. The advantages and disadvantages of the various methods available for extracting health care claims are presented (the comparative strengths of these methods were previously discussed in the paper "A Comparison of Table Lookup Techniques," by Craig Ray, in the SUGI 12 proceedings) [1]. Finally, generalized SAS® macros that perform the extraction using multiple format tables are described.

Introduction

The cost of medical care has become an increasingly major concern among government, third party insurers, business coalitions, private industry and the general public. Health care costs have increased dramatically during the past two decades, creating pressure to investigate and evaluate costs at all levels of the health care system. To economically evaluate an illness, drug or other medical innovation, it is necessary to assess the changes in the associated costs and benefits. Research related to health care utilization and costs includes the following types of studies: cost-effectiveness and cost-benefit analysis; quality of life and productivity studies; adverse reaction analysis; and epidemiological studies.

Data Source

Medicaid claims represent a record of each health care encounter of program eligibles submitted for reimbursement to the Medicaid program. These data files include health care services from pharmacies, physicians, laboratories, hospitals and nursing homes. Thus, use of these data allows the tracking of patients throughout the health care system, from the hospital to the physician's office to the pharmacy.
Utilizing these data allows for an analysis of the costs incurred by the Medicaid program, the manner in which physicians are diagnosing and treating patients, what drugs or other health care products are being utilized, and how patient demographic characteristics are related to treatment regimens and their associated costs.

Research of this nature utilizes a wide variety of secondary data resources. One of the most important data sources used is state Medicaid claims data. Detailed health care utilization and cost data exist in the form of state Medicaid claims. These claims represent a record of each program-eligible individual's health care encounter submitted for reimbursement to the Medicaid program.

These Medicaid data files are enormous; the number of records from one year often exceeds 25 million, representing the health care utilization of over one million patients. In conducting health economic research studies utilizing this data source, it is necessary to extract all the claims of a subset of the patient population (for example, 65,000 patients) from the main file for analysis. The keys of the records to be extracted (patient identification numbers) may be thought of as a lookup table. However, when processing very large data sets, table lookup procedures must be carefully designed. If designed poorly, these procedures may become a program bottleneck. PROC FORMAT usually would not be considered for a table of 65,000 entries because the cost of creating multiple format tables is normally considered prohibitive. However, due to the enormous size of the main file, the speed of the lookup is the overriding design concern compared to the cost of the setup.
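To make the idea of treating the extraction keys as a lookup table concrete, the following is a minimal sketch of a format-based lookup on toy data. It is not the generalized macros this paper describes: the data set names (LOOKUP, CNTL, MAIN, EXTRACT), the format name $KEEPID and the sample values are hypothetical. It also builds the format with the CNTLIN= option of PROC FORMAT instead of generated VALUE statements. In a real application the main file would be read from a sequential file with INFILE, not from in-stream data.

```sas
/* Hypothetical lookup file: one observation per patient ID to extract */
data lookup;
  input patid $3.;
  datalines;
01A
01C
;
run;

/* Control data set for PROC FORMAT: each key maps to 'Y', all else to 'N' */
data cntl;
  set lookup end=last;
  retain fmtname '$keepid' type 'C';
  start = patid;
  label = 'Y';
  output;
  if last then do;
    hlo = 'O';            /* marks the OTHER range */
    label = 'N';
    output;
  end;
run;

proc format cntlin=cntl;
run;

/* Hypothetical main file, unsorted, several records per patient */
data main;
  input recip $3. amount;
  datalines;
01A 13.50
02P 14.50
01C 12.00
;
run;

/* The lookup: keep a record only when its key is found in the format */
data extract;
  set main;
  if put(recip, $keepid.) = 'Y';
run;
```

The setup (building the format) is paid once, while the PUT function call is paid once per main-file record, which is why the per-lookup speed dominates for very large main files.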
Variables that are typically used from Medicaid claims files in cost-effectiveness analyses are:

o  patient identification number;
o  provider type (e.g., hospital, pharmacy, physician);
o  date of service;
o  medical diagnoses (provided as ICD-9-CM codes);
o  National Drug Code;
o  days' supply and quantity of drug;
o  demographic characteristics of patient (age, sex and race); and/or
o  amount paid by Medicaid for service.

There are approximately 25 million records contained in these data files annually, representing the health care claims of approximately 1.5 million patients. These 25 million records contain multiple records per patient. Each record provides the variables listed above. There is no internal sort order to the records contained in the data file.

Definitions

There are four terms that are used throughout this paper:

o  Main File - the Medicaid claims data file of interest that is being processed one record at a time.
o  Lookup File - the subset of patients (e.g., 65,000 patients) for which all matching records from the main file are needed.
o  Key - the variable(s) in common between the main file and lookup file. For this application, the patient identification number is the key.
o  Result of Lookup - the data that is obtained from the lookup file. For this application, the Medicaid record is the result of lookup.

Checking every record of the two-year main file against the lookup file implies that 50 million lookups must be performed. All table lookup techniques have two costs associated with them: setup cost (incurred once) and lookup cost (incurred once for each lookup). Because this problem entails such a large number of lookups, the overriding design concern is the speed of each lookup; the overhead associated with the setup becomes negligible.

Multiple Format Tables vs. Binary Search vs. Hashing vs. Sort/Merge

Multiple Format Tables.
The traditional use of PROC FORMAT is for the creation of formats used to display the value of variables on output. However, when used in conjunction with the PUT function, PROC FORMAT becomes an effective, efficient and generalized tool for table lookup. For this application, the lookup file is the list of 65,000 patients for which all health care records from the main file are needed for analysis. Due to the size of the lookup file involved, the setup, which involves the creation of multiple format tables, is expensive. However, the actual lookup has been found to be the most efficient with respect to CPU time in the SAS system.

Binary Search. Binary search divides the range of records to be searched in half until the sought-after record is found [1]. Compared to multiple format tables, setting up the lookup file was found to be cheaper. The setup involves only the sorting of the records in the lookup file. The lookup, however, utilizes SET with the POINT option and was found to be less efficient with respect to the speed of the lookup. It was found that SET with the POINT option used approximately three times more CPU time than sequential SET without the POINT option.

Figure 1 presents an example of the layout of a typical Medicaid claims file and the lookup file.

Typical Extract Problem

An example of a recently completed study that necessitated the extraction of the health care records of a subset patient population involved the evaluation of the cost-effectiveness of drugs used to treat diabetes. A total of 65,000 diabetics were identified to be included in the study. Rates of hospitalization, emergency room treatment and physician visits were evaluated for this study population of diabetics. To conduct this study, it was necessary to extract all the health care records of the diabetic population.
Using multiple format tables, all the health care records (which include records for pharmacy, physician and hospital) for this population of 65,000 patients (lookup file) were extracted from the main file of two years of data (50 million records). The SAS data set resulting from this extract contained approximately 15 million claims, representing all health care claims of the 65,000 patients.

Design Concerns

Due to the enormous size of the Medicaid claims file, it was necessary to design a technique to efficiently extract the necessary data from the main data files of 50 million records. For each of the 50 million records in the main file, the list of 65,000 patient identification numbers to be extracted must be searched.

Hashing. The method of hashing arranges the observations in the lookup file such that the value of the key indicates where the observation is to be placed. The setup associated with hashing is also inexpensive compared to the setup cost associated with multiple format tables. The setup of hashing involves: computing the key variable; sorting the lookup data set; and creating the hash table. The technique of hashing also utilizes SET with the POINT option [2]. Because of this, each lookup was also found to be much more expensive with respect to CPU time than with multiple format tables.

Sort/Merge. The technique of SORT/MERGE was investigated; however, because the Medicaid (main) data files are non-SAS files and are not sorted, this option was not desirable. The setup costs associated with this technique include the creation of SAS data sets (for 50 million records) and the sorting of these SAS data sets; these costs would be prohibitively expensive.
The lookup time associated with the merge step would be approximately the same as the lookup cost of multiple format tables.

Multiple Format Tables

After reviewing the four options, it was determined that multiple format tables were the most efficient technique to use for data extraction because the lookup is the most efficient. The following section presents an overview of the generalized SAS macros that perform the extraction using multiple format tables.

For our applications, the size of the lookup files ranges from 25,000 to 100,000 patients. It would not be efficient with respect to CPU time to create one format table. The cost of creating a format table grows exponentially with the size of the lookup file, so several smaller format tables are less expensive to create than one large table. We therefore split the lookup files into smaller SAS data sets with equal numbers of records. Figure 2 presents a SAS macro, %PARTIT, that partitions a lookup SAS data set into equal pieces. This generalized macro tool partitions one SAS data set into many SAS data sets, each having an equal number of observations.

For this application, the following specifications are supplied to execute %MACRO PARTIT (keyword parameters, as defined in Figure 2):

    %PARTIT(SIZE=5000, DATA=LOOKUP, KEY=PATID, FORMAT=RECIP)

%MAKEFMT is called during %MACRO PARTIT and creates each of the format tables for the SAS data sets (Figure 3). In the DATA _NULL_ step, the macro reads each observation of the lookup file and writes both the key and the result to successive macro variables using the data step routine CALL SYMPUT. These macro variables are created at data step execution time. They are then resolved, creating successive lines of PROC FORMAT code.

Once the format tables and pointer table are created, the data extraction is executed as follows:

    /* The %DO loop requires this step to be generated inside a macro */
    DATA EXTRACT;
      INFILE OLD;
      INPUT @1 RECIP $4. @;            /* read the key and hold the record */
      %DO I=1 %TO &NUMDS;
        IF PUT(RECIP,$KEEP.)="&I" THEN DO;          /* pointer format selects the table */
          IF PUT(RECIP,$RECIP&I.A.)='Y' THEN DO;    /* key is present in that table */
            INPUT @11 PROV 2.
                  @13 DATE YYMMDD6.
                  @19 DIAG $6. ... ;
            OUTPUT;
          END;
        END;
      %END;
    RUN;
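The partitioning arithmetic that %PARTIT performs can be illustrated with a short DATA step. This is a minimal sketch under assumed values taken from the application in this paper (a lookup file of 65,000 observations and tables of at most 5,000 entries). The variable names are illustrative, and the step only writes the observation ranges to the log; it creates no data sets or formats.

```sas
data _null_;
  nobs = 65000;                 /* size of the lookup file */
  size = 5000;                  /* maximum entries per format table */
  numds = ceil(nobs / size);    /* number of format tables needed */
  put numds=;
  do i = 1 to numds;
    firstobs = (i - 1) * size + 1;      /* first observation of table i */
    lastobs  = min(i * size, nobs);     /* last observation of table i */
    put i= firstobs= lastobs=;
  end;
run;
```

For these values the step reports 13 tables, the last one covering observations 60,001 through 65,000.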
The %PARTIT macro iteratively calls a generalized macro, %MAKEFMT, that constructs a SAS format from each of the partitioned SAS data sets [1]. After all the formats are created, %PARTIT creates a pointer format that, for any given patient identification number, indicates the format table that will contain this value, if it is present. (Note: to create this pointer format, the lookup file must be sorted by patient identification number.) The pointer format eliminates the need to check every format table during each lookup.

Conclusion

This paper has demonstrated the technique of using multiple format tables for extraction of data from enormous sequential data files. Due to the enormous size of the data, our primary concern in evaluating the most efficient technique was the speed of the lookup with respect to CPU time. We found that although the setup CPU cost of the multiple format lookup table is expensive compared to binary search and hashing, the overall CPU cost using multiple format tables is lowest due to the efficiency of the actual lookup procedure (Figure 4).
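The trade-off behind this conclusion can be written as a simple cost model. The following is a sketch only: the symbols for setup cost, per-lookup cost and number of lookups N are notation introduced here for illustration, MFT and BS abbreviate multiple format tables and binary search, and no measured values from the study are implied.

```latex
% Total CPU cost of any table lookup technique
C_{\text{total}} = C_{\text{setup}} + N \cdot c_{\text{lookup}}

% Multiple format tables beat binary search overall when the extra
% setup cost is smaller than the accumulated per-lookup saving
C_{\text{setup}}^{\text{MFT}} - C_{\text{setup}}^{\text{BS}}
  < N \left( c_{\text{lookup}}^{\text{BS}} - c_{\text{lookup}}^{\text{MFT}} \right)
```

With N on the order of 50 million lookups, even a small saving per lookup can outweigh a considerably larger one-time setup cost.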
It is necessary to specify the following in %MACRO PARTIT:

o  Size: size of the lookup tables
o  Data: name of lookup SAS data set
o  Key: list of keys for the lookup
o  Format: root name for each format table

For this application, the lookup file of 65,000 observations would be divided evenly into 13 format tables of 5,000 observations. A pointer format table is also created ($KEEP format), which points to the correct format table for any given key.

Figure 1
SAMPLE DATA

MAIN FILE

    Patient          Provider        Date of                       Amount
    Identification   Type            Service    Age   Sex   Race   Paid
    Number
    01A              Physician       10/01/87   35    F     W        13.50
    01A              Pharmacy        10/01/87   35    F     W        15.90
    01A              Physician       09/15/87   35    F     W        13.60
    01A              Hospital        11/15/87   35    F     W       950.25
    02P              Physician       01/03/88   60    F     B        14.50
    02P              Nursing Home    01/01/88   60    F     B      1005.50
    01C              Physician       04/30/88   18    M     B        12.00
    01C              Pharmacy        05/01/88   18    M     B        25.50
    02P              Hospital        02/01/88   60    F     B       945.25
    01A              Physician       04/15/88   18    M     B        12.00

Each record also carries Primary Medical Diagnosis, Secondary Medical Diagnosis, National Drug Code, Days' Supply and Quantity fields. Their values as printed, with the row alignment lost:

    Diagnoses and National Drug Codes: 250.5 401.9 0049 250.5 250.5 599.0 599.0 250.01 250.9 251.0 401.0 531.0 188.0 531.0 535.0 599.0 250.01 0077
    Days' Supply and Quantity: 30 20 60 40

LOOKUP FILE

    Patient Identification Number
    01A
    01C

KEY = Recipient Identification Number
RESULT OF LOOKUP = Medicaid Claims Record

Figure 2
A SAS MACRO TO CREATE MULTIPLE FORMAT TABLES

    %MACRO PARTIT(SIZE=,     /* THE MAX SIZE OF ANY FORMAT TABLE */
                  DATA=,     /* THE NAME OF THE LOOKUP SASDS     */
                  KEY=,      /* THE LIST OF KEYS FOR THE LOOKUP  */
                  FORMAT=    /* ROOT NAME FOR EACH FORMAT TABLE  */
                             /* AT MOST 5 CHARACTERS!!!          */
                  );
    %*********************************************************;
    %*                                                       *;
    %* THIS MACRO PARTITIONS A LOOKUP SAS DATA SET INTO      *;
    %* EQUAL PIECES DETERMINED BY &SIZE AND CREATES A        *;
    %* FORMAT TABLE FROM EACH OF THE PARTITIONED DATA SETS.  *;
    %* IT IS ASSUMED THAT THE LOOKUP TABLE IS SORTED BY      *;
    %* &KEY (THE KEY OF THE LOOKUP). IN WHICH CASE, A        *;
    %* POINTER FORMAT IS CREATED WHICH FOR ANY GIVEN KEY     *;
    %* WILL INDICATE WHICH OF THE TABLES TO FIND THE         *;
    %* DESIRED RESULT.                                       *;
    %*                                                       *;
    %*********************************************************;
    %LOCAL I;

    /* CALCULATE THE NUMBER OF FORMATS FOR SAS DATA SET LOOKUP */
    DATA _NULL_;
      IF 0 THEN SET &DATA POINT=_N_ NOBS=NOBS;
      NUMDS = CEIL(NOBS/&SIZE);
      CALL SYMPUT('NUMDS',PUT(NUMDS,5.));
      STOP;
    RUN;

    %DO I = 1 %TO &NUMDS;
      %LET FIRSTOBS = %EVAL(((&I - 1) * &SIZE) + 1);
      %LET LASTOBS = %EVAL(&I * &SIZE);

      /* SAVE THE FIRST AND LAST KEY OF EACH PARTITION FOR THE POINTER FORMAT */
      DATA _TABLE&I(KEEP=&KEY);
        SET &DATA(FIRSTOBS=&FIRSTOBS OBS=&LASTOBS) END=LASTREC;
        IF _N_ = 1 THEN CALL SYMPUT("_KEY1&I",&KEY);
        IF LASTREC THEN CALL SYMPUT("_KEY2&I",&KEY);
      RUN;

      /* PARAMETER NAMES FOLLOW %MAKEFMT AS DEFINED IN FIGURE 3 */
      %MAKEFMT(DATA = _TABLE&I,
               KEY = &KEY,
               RESULT = 'Y',
               FORMAT = $&FORMAT&I.A,
               OTHER = 'N')
    %END;

    /* POINTER FORMAT: MAPS A KEY RANGE TO THE NUMBER OF ITS FORMAT TABLE */
    PROC FORMAT;
      VALUE $KEEP
      %DO I = 1 %TO &NUMDS;
        "&&_KEY1&I" - "&&_KEY2&I" = "&I"
      %END;
        OTHER = 'N';
    RUN;
    %MEND PARTIT;

Figure 3
MACRO WHICH CREATES FORMAT TABLE FROM SAS DATA SET

    %MACRO MAKEFMT(DATA=,     /* SAS DATA SET */
                   KEY=,      /* KEY VARIABLE FOR FORMAT; CONCATENATED IF MULT. KEYS */
                   RESULT=,   /* RESULT OF LOOKUP, CONCATENATED IF MULT. VARIABLES */
                   FORMAT=,   /* NAME OF THE FORMAT TABLE; PREFIX WITH $ IF CHAR. FORMAT */
                   OTHER=     /* OPTION IN PROC FORMAT */
                   );
    %LOCAL I;
    OPTIONS DQUOTE;

    /* WRITE EACH KEY AND RESULT TO SUCCESSIVE MACRO VARIABLES */
    DATA _NULL_;
      DO UNTIL(LASTREC);
        SET &DATA END=LASTREC;
        _CNTR+1;
        _CCNTR=LEFT(PUT(_CNTR,5.));
        %IF %SUBSTR(&FORMAT,1,1) = $ %THEN %DO;    /* CHARACTER FORMAT */
          CALL SYMPUT('_VL' || _CCNTR, "'" || &KEY || "'");
        %END;
        %ELSE %DO;                                 /* NUMERIC FORMAT */
          CALL SYMPUT('_VL' || _CCNTR, &KEY);
        %END;
        CALL SYMPUT('_LB' || _CCNTR, "'" || &RESULT || "'");
      END;
      CALL SYMPUT('ENTRIES', _CCNTR);
      STOP;
    RUN;

    /* RESOLVE THE MACRO VARIABLES INTO LINES OF THE VALUE STATEMENT */
    PROC FORMAT;
      VALUE &FORMAT
      %DO I=1 %TO &ENTRIES;
        &&_VL&I = &&_LB&I
      %END;
      %IF &OTHER NE %THEN %DO;
        OTHER = "&OTHER"
      %END;
      ;
    RUN;
    %MEND MAKEFMT;

Figure 4
EVALUATION OF SETUP AND LOOKUP TIMES OF MULTIPLE FORMAT TABLES, BINARY SEARCH, HASHING AND SORT/MERGE

                       Setup Time                Lookup Time
    Least Expensive    Binary Search             Multiple Format Tables
                       Hashing                   Sort/Merge
                       Multiple Format Tables    Hashing
    Most Expensive     Sort/Merge                Binary Search

REFERENCES
1. Ray C: "A Comparison of Table Lookup Techniques." SUGI 12 Proceedings, 1987.
2. Ray C: "The Implementation of a Hashing Routine Using SAS Software." To be presented at SUGI 13, Orlando, Florida, 1988.

For further information about this article contact Nancy A. Fox, Center for Economic Studies in Medicine, Pracon Incorporated, 1800 Robert Fulton Drive, Reston, Virginia, 22091.

SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.