Sugi-13-113 Fox Ray

EXTRACTION FROM ENORMOUS SEQUENTIAL
HEALTH CARE CLAIMS DATA FILES
USING MULTIPLE FORMAT TABLES
Nancy A. Fox, Pracon Incorporated
Craig Ray, ORI, Inc.
This paper describes the use of multiple format tables for the extraction of data from enormous sequential health care claims data files. The format of health care claims data sets used for analysis and the types of fields to be extracted in each record are discussed. Advantages and disadvantages of the various methods available for data extraction of health care claims are presented (the comparative strengths of these methods were previously discussed in the paper, "A Comparison of Table Lookup Techniques," by Craig Ray, in the SUGI12 proceedings [1]). Finally, generalized SAS macros that perform the extraction using multiple format tables are described.
Introduction
The cost of medical care has become an increasingly major concern among government, third party insurers, business coalitions, private industry and the general public. Health care costs have increased dramatically during the past two decades, creating pressure to investigate and evaluate costs at all levels of the health care system. To economically evaluate an illness, drug or other medical innovation, it is necessary to assess the changes in the associated costs and benefits. Research related to health care utilization and costs includes the following types of studies: cost-effectiveness and cost-benefit analysis; quality of life and productivity studies; adverse reaction analysis; and epidemiological studies.
Data Source
Medicaid claims represent a record of
each health care encounter of program
eligibles submitted for reimbursement to
the Medicaid program. These data files
include health care services from
pharmacies, physicians, laboratories,
hospitals and nursing homes. Thus, use
of these data allows the tracking of
patients throughout the health care
system, from the hospital to the
physician's office to the pharmacy.
Utilizing these data allows for an
analysis of the costs incurred by the
Medicaid program, the manner in which
physicians are diagnosing and treating
patients, what drugs or other health care
products are being utilized and how
patient demographic characteristics are
related to treatment regimens and their
associated costs.
Research of this nature utilizes a wide variety of secondary data resources. One of the most important data sources used is state Medicaid claims data. Detailed health care utilization and cost data exist in the form of state Medicaid claims. These claims represent a record of each program-eligible individual's health care encounter submitted for reimbursement to the Medicaid program. These Medicaid data files are enormous; the number of records from one year often exceeds 25 million, representing the health care utilization of over one million patients.
In conducting health economic research studies utilizing this data source, it is necessary to extract all the claims of a subset of the patient population (for example, 65,000 patients) from the main file for analysis. The keys of the records to be extracted (patient identification numbers) may be thought of as a lookup table. However, when processing very large data sets, table lookup procedures must be carefully designed. If designed poorly, these procedures may become a program bottleneck. PROC FORMAT usually would not be considered when using a table of 65,000 keys because the cost of creating multiple format tables is normally considered prohibitive. However, due to the enormous size of the main file, the speed of the lookup is the overriding design concern compared to the cost of the setup.
Variables that are typically used from Medicaid claims files in cost-effectiveness analyses are:
o  patient identification number;
o  provider type (e.g., hospital, pharmacy, physician);
o  date of service;
o  medical diagnoses (provided as ICD-9-CM codes);
o  National Drug Code;
o  days' supply and quantity of drug;
o  demographic characteristics of patient (age, sex and race); and/or
o  amount paid by Medicaid for service.
There are approximately 25 million records contained in these data files annually, representing the health care claims of approximately 1.5 million patients. These 25 million records include multiple records per patient. Each record provides the variables listed above. There is no internal sort order to the records contained in the data file.

Every record in the main file must be checked against the lookup file, so a main file of 50 million records (two years of data) implies that 50 million lookups must be performed. All table lookup techniques have two costs associated with them: setup cost (performed once) and lookup cost (performed once for each lookup). Because this problem entails such a large number of lookups, the overriding design concern is the speed of each lookup; the overhead associated with the setup becomes negligible.
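The trade-off between setup cost and lookup cost described above can be written as a simple cost model. The symbols are introduced here only for illustration and do not appear in the original paper: $C_{\text{setup}}$ is the one-time setup cost, $c_{\text{lookup}}$ is the cost of a single lookup, and $N$ is the number of lookups.

```latex
C_{\text{total}} = C_{\text{setup}} + N \cdot c_{\text{lookup}}, \qquad N = 5 \times 10^{7}
```

With $N$ this large, the second term dominates, so a technique with a higher setup cost but a cheaper individual lookup can still give the lowest total cost.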
Definitions
There are four terms that are used
throughout this paper:
o  Main File - the Medicaid claims data file of interest that is being processed one record at a time.
o  Lookup File - the subset of patients (e.g., 65,000 patients) for which all matching records from the main file are needed.
o  Key - the variable(s) in common between the main file and lookup file. For this application, the patient identification number is the key.
o  Result of Lookup - the data that is obtained from the lookup file. For this application, the Medicaid record is the result of lookup.
Multiple Format Tables vs. Binary Search
vs. Hashing vs. Sort/Merge
Multiple Format Tables. The traditional use of PROC FORMAT is for the creation of formats used to display the value of variables on output. However, when used in conjunction with the PUT function, PROC FORMAT becomes an effective, efficient and generalized tool for table lookup. For this application, the lookup file is the list of 65,000 patients for which all health care records from the main file are needed for analysis. Due to the size of the lookup file involved, the setup, which involves the execution of multiple format tables, is expensive. However, the actual lookup has been found to be the most efficient with respect to CPU time in the SAS system.
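A minimal sketch of a format-based lookup with the PUT function follows. The format name $KEEPID, the data set names CLAIMS and EXTRACT, the variable names and the sample values are all hypothetical and chosen only for illustration; a real lookup format would hold thousands of keys.

```sas
/* Hypothetical lookup table of two patient IDs, stored as a format. */
proc format;
   value $keepid
      '01A' = 'Y'
      '01C' = 'Y'
      other = 'N';
run;

/* Hypothetical stand-in for the main claims file. */
data claims;
   input recip $3. amount;
   datalines;
01A 13.50
02P 14.50
01C 12.00
;
run;

/* Table lookup: PUT returns 'Y' only for keys present in the format. */
data extract;
   set claims;
   if put(recip, $keepid.) = 'Y';
run;
```

The subsetting IF keeps a claim only when its key is found in the format, so no sort or merge of the main file is needed.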
Binary Search. Binary search repeatedly divides the range of records to be searched in half until the sought-after record is found [1]. As compared to multiple format tables, the overhead cost of setting up the lookup file was found to be cheaper. The setup involves only the sorting of the records in the lookup file. The lookup, however, utilizes SET with the POINT option and was found to be less efficient with respect to the speed of the lookup. It was found that SET with the POINT option used approximately three times more CPU time than sequential SET without the POINT option.

Typical Extract Problem
An example of a study recently completed that necessitated the extraction of the health care records of a subset patient population involved the evaluation of the cost-effectiveness of drugs used to treat diabetes. A total of 65,000 diabetics were identified to be included in the study. Rates of hospitalization, emergency room treatment and physician visits were evaluated for this study population of diabetics. To conduct this study, it was necessary to extract all the health care records of the diabetic population. Using multiple format tables, all the health care records (which include records for pharmacy, physician and hospital) for this population of 65,000 patients (lookup file) were extracted from the main file of two years of data, 50 million records. The SAS data set resulting from this extract contained approximately 15 million claims, representing all health care claims of the 65,000 patients. Figure 1 presents an example of the layout of a typical Medicaid claims file and the lookup file.
Hashing. The method of hashing arranges the observations in the lookup file such that the value of the key indicates where the observation is to be placed. The setup associated with hashing is also inexpensive as compared to the setup cost associated with multiple format tables. The setup of hashing involves: computing the key variable; sorting the lookup data set; and creating the hash table. The technique of hashing also utilizes SET with the POINT option [2]. However, because hashing utilizes SET with the POINT option, each lookup was also found to be much more expensive with respect to CPU time as compared to multiple format tables.
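Both binary search and hashing rely on direct access through SET with the POINT= option. The following is a minimal sketch of the binary search variant only, not the authors' implementation; the data set names, variable names and sample values are hypothetical, and the lookup data set is assumed to be sorted by its key.

```sas
/* Hypothetical lookup file, already sorted by patient ID. */
data lookup;
   input patid $3.;
   datalines;
01A
01C
05K
;
run;

/* Hypothetical stand-in for the main claims file. */
data claims;
   input recip $3. amount;
   datalines;
01A 13.50
02P 14.50
01C 12.00
;
run;

data extract(drop=lo hi found patid);
   set claims;                     /* one main-file record per iteration */
   lo = 1;
   hi = nkeys;                     /* NOBS= value is set at compile time */
   found = 0;
   do while (lo <= hi and not found);
      mid = floor((lo + hi) / 2);
      set lookup point=mid nobs=nkeys;   /* direct access to one key */
      if patid = recip then found = 1;
      else if patid < recip then lo = mid + 1;
      else hi = mid - 1;
   end;
   if found;                       /* keep only matching claims */
run;
```

Each main-file record triggers several direct-access reads of the lookup data set, which is where the additional CPU time of the POINT= approach arises.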
Design Concerns
Due to the enormous size of the Medicaid
claims file, it was necessary to design a
technique to efficiently extract the
necessary data from the main data files
of 50 million records. For each of the
50 million records in the main file, the
list of 65,000 patient identification
numbers to be extracted must be searched.
Sort/Merge. The technique of SORT/MERGE was investigated; however, because the Medicaid (main) data files are non-SAS files and are not sorted, this option was not desirable. The setup costs associated with this technique include the creation of SAS data sets (for 50 million records) and the sorting of these SAS data sets; these costs would be prohibitively expensive. The lookup time associated with the merge step would generate approximately the same cost as the lookup cost of multiple format tables.

A pointer format table is also created ($KEEP format), which points to the correct format table for any given key.
Thus, the following specifications are supplied to execute the %PARTIT macro (keyword form, matching the parameter definitions in Figure 2):
%PARTIT(SIZE=5000, DATA=LOOKUP, KEY=PATID, FORMAT=RECIP)
%MAKEFMT is called by %PARTIT and creates each of the format tables for the SAS data sets (Figure 3). In the DATA _NULL_ step, the macro reads each observation of the lookup file and writes both the key and the result to successive macro variables using the DATA step routine CALL SYMPUT. These macro variables are created at DATA step execution time. They are then resolved, creating successive lines of PROC FORMAT code.
Multiple Format Tables
After reviewing the four options, it was
determined that the multiple format
tables were the most efficient technique
to use for data extraction because the
lookup is most efficient. The following
section presents an overview of the
generalized SAS macros that perform the
extraction using multiple format tables.
For our applications, the size of the lookup files ranges from 25,000 to 100,000 patients. It would not be efficient with respect to CPU time to create one format table. The cost of creating the format table relates exponentially to the size of the lookup files. Smaller format tables, however, are less expensive to create than larger tables. We therefore split the lookup files into smaller SAS data sets with equal numbers of records. Figure 2 presents a SAS macro, %PARTIT, that partitions a lookup SAS data set into equal pieces.

This generalized macro tool partitions one SAS data set into many SAS data sets with each having an equal number of observations. The macro then iteratively calls a generalized macro, %MAKEFMT, that constructs a SAS format from each of the SAS data sets [1]. After all the formats are created, %PARTIT creates a pointer format that, for any given patient identification number, indicates the format table that will contain this value, if it is present. (Note: to create this pointer format, the lookup file of the patient identification numbers must be sorted by the patient identification number.) The pointer format was included to eliminate the need of checking in each format table during each successive lookup.

Once the format tables and pointer table are created, the data extraction is executed as follows:

   DATA EXTRACT;
      INFILE OLD;
      INPUT @1 RECIP $4. @;
      %DO I=1 %TO &NUMDS;
         IF PUT(RECIP,$KEEP.)="&I"
         THEN DO;
            IF PUT(RECIP,$RECIP&I.A.)='Y'
            THEN DO;
               INPUT @11 PROV 2. @13 DATE YYMMDD6.
                     @19 DIAG $6. ... ;
               OUTPUT;
            END;
         END;
      %END;
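The partition-and-pointer scheme described in this section can be illustrated on a toy scale. The sketch below writes the formats by hand instead of generating them with %PARTIT and %MAKEFMT; the format names $PART1A and $PART2A, the data set names and all key values are hypothetical.

```sas
/* Hypothetical lookup keys, split by hand into two small format tables. */
proc format;
   value $part1a '01A', '01C' = 'Y' other = 'N';
   value $part2a '05K', '07M' = 'Y' other = 'N';
   /* Pointer format: maps a key range to the number of the table
      that would hold the key, if the key is present at all. */
   value $keep   '01A' - '01C' = '1'
                 '05K' - '07M' = '2'
                 other         = 'N';
run;

/* Hypothetical stand-in for the main claims file. */
data claims;
   input recip $3. amount;
   datalines;
01A 13.50
01B 20.00
02P 14.50
07M 12.00
;
run;

data extract;
   set claims;
   select (put(recip, $keep.));
      when ('1') if put(recip, $part1a.) = 'Y' then output;
      when ('2') if put(recip, $part2a.) = 'Y' then output;
      otherwise;                   /* key outside every range: skip */
   end;
run;
```

Key 01B falls inside the range of the first table but is not one of its entries, so the pointer format alone is not sufficient; the second PUT against the selected table makes the final decision. Each record needs at most two format lookups regardless of how many tables exist.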
Conclusion
This paper has demonstrated the technique of using multiple format tables for extraction of data from enormous sequential data files. Due to the enormous size of the data, our primary concern in evaluating the most efficient technique was the speed of the lookup with respect to CPU time. We found that although the setup CPU cost of the multiple format lookup table is expensive compared to binary search and hashing, the overall CPU cost using multiple format tables is lowest due to the efficiency of the actual lookup procedure (Figure 4).
It is necessary to specify the following in %PARTIT:
o  Size: size of the lookup tables
o  Data: name of lookup SAS data set
o  Key: list of keys for the lookup
o  Format: root name for each format table
For this application, the lookup file of 65,000 observations would be divided evenly into 13 format tables of 5,000 observations.
Figure 1
SAMPLE DATA

MAIN FILE

Patient ID   Provider Type   Date of Service   Age   Sex   Race   Amount Paid
01A          Physician       10/01/87          35    F     W         13.50
01A          Pharmacy        10/01/87          35    F     W         15.90
01A          Physician       09/15/87          35    F     W         13.60
01A          Hospital        11/15/87          35    F     W        950.25
02P          Physician       01/03/88          60    F     B         14.50
02P          Nursing Home    01/01/88          60    F     B       1005.50
01C          Physician       04/30/88          18    M     B         12.00
01C          Pharmacy        05/01/88          18    M     B         25.50
02P          Hospital        02/01/88          60    F     B        945.25
01A          Physician       04/15/88          18    M     B         12.00

The main file also carries Primary Medical Diagnosis, Secondary Medical Diagnosis, National Drug Code, Days' Supply and Quantity columns; the row alignment of these values is not recoverable.
Diagnosis and National Drug Code values: 250.5, 401.9, 0049, 250.5, 250.5, 599.0, 599.0, 250.01, 250.9, 251.0, 401.0, 531.0, 188.0, 531.0, 535.0, 599.0, 250.01, 0077
Days' Supply and Quantity values: 30, 20, 60, 40

LOOKUP FILE

Patient Identification Number
01A
01C

KEY = Recipient Identification Number
RESULT OF LOOKUP = Medicaid Claims Record
Figure 2
A SAS MACRO TO CREATE MULTIPLE FORMAT TABLES

%MACRO PARTIT(SIZE=,    /* THE MAX SIZE OF ANY FORMAT TABLE */
              DATA=,    /* THE NAME OF THE LOOKUP SASDS     */
              KEY=,     /* THE LIST OF KEYS FOR THE LOOKUP  */
              FORMAT=   /* ROOT NAME FOR EACH FORMAT TABLE  */
                        /* AT MOST 5 CHARACTERS!!!          */
              );
%**********************************************************;
%*                                                        *;
%*  THIS MACRO PARTITIONS A LOOKUP SAS DATA SET INTO      *;
%*  EQUAL PIECES DETERMINED BY &SIZE AND CREATES A        *;
%*  FORMAT TABLE FROM EACH OF THE PARTITIONED DATA        *;
%*  SETS. IT IS ASSUMED THAT THE LOOKUP TABLE IS          *;
%*  SORTED BY &KEY (THE KEY OF THE LOOKUP). IN WHICH      *;
%*  CASE, A POINTER FORMAT IS CREATED WHICH FOR ANY       *;
%*  GIVEN KEY WILL INDICATE WHICH OF THE TABLES TO        *;
%*  FIND THE DESIRED RESULT.                              *;
%*                                                        *;
%**********************************************************;
   %LOCAL I FIRSTOBS LASTOBS;

   /* CALCULATE THE NUMBER OF FORMATS FOR SAS DATA SET LOOKUP */
   DATA _NULL_;
      IF 0 THEN SET &DATA POINT=_N_ NOBS=NOBS;
      NUMDS = CEIL(NOBS/&SIZE);
      CALL SYMPUT('NUMDS',PUT(NUMDS,5.));
      STOP;
   RUN;

   %DO I = 1 %TO &NUMDS;
      %LET FIRSTOBS = %EVAL(((&I - 1) * &SIZE) + 1);
      %LET LASTOBS  = %EVAL(&I * &SIZE);

      /* SAVE THE FIRST AND LAST KEY OF EACH PARTITION */
      DATA _TABLE&I(KEEP=&KEY);
         SET &DATA(FIRSTOBS=&FIRSTOBS OBS=&LASTOBS) END=LASTREC;
         IF _N_ = 1 THEN CALL SYMPUT("_KEY1&I",&KEY);
         IF LASTREC THEN CALL SYMPUT("_KEY2&I",&KEY);
      RUN;

      /* PARAMETER NAMES MATCH THE %MAKEFMT DEFINITION IN FIGURE 3; */
      /* %MAKEFMT ADDS THE QUOTES AROUND THE OTHER= LABEL           */
      %MAKEFMT(DATA = _TABLE&I,
               KEY = &KEY, FORMAT = $&FORMAT&I.A,
               RESULT = 'Y',
               OTHER = N)
   %END;

   /* POINTER FORMAT: KEY RANGE OF EACH PARTITION = TABLE NUMBER */
   PROC FORMAT;
      VALUE $KEEP
      %DO I = 1 %TO &NUMDS;
         "&&_KEY1&I" - "&&_KEY2&I" = "&I"
      %END;
         OTHER = 'N';
   RUN;
%MEND PARTIT;
Figure 3
MACRO WHICH CREATES FORMAT TABLE FROM SAS DATA SET

%MACRO MAKEFMT(DATA=,    /* SAS DATA SET */
               KEY=,     /* KEY VARIABLE FOR FORMAT;
                            CONCATENATED IF MULT. KEYS */
               RESULT=,  /* RESULT OF LOOKUP, CONCATENATED
                            IF MULT. VARIABLES */
               FORMAT=,  /* NAME OF THE FORMAT TABLE;
                            PREFIX WITH $ IF CHAR. FORMAT */
               OTHER=    /* OPTION IN PROC FORMAT */
               );
   %LOCAL I;   /* KEEPS THIS LOOP INDEX SEPARATE FROM THE CALLER'S &I */
   OPTIONS DQUOTE;

   /* WRITE EACH KEY AND RESULT TO NUMBERED MACRO VARIABLES */
   DATA _NULL_;
      DO UNTIL(LASTREC);
         SET &DATA END=LASTREC;
         _CNTR+1;
         _CCNTR=LEFT(PUT(_CNTR,5.));
         %IF %SUBSTR(&FORMAT,1,1) = $ %THEN
         %DO; /* CHARACTER FORMAT */
            CALL SYMPUT('_VL' || _CCNTR, '"' || &KEY || '"');
         %END;
         %ELSE
         %DO; /* NUMERIC FORMAT */
            CALL SYMPUT('_VL' || _CCNTR, &KEY);
         %END;
         CALL SYMPUT('_LB' || _CCNTR, '"' || &RESULT || '"');
      END;
      CALL SYMPUT('ENTRIES', _CCNTR);
      STOP;
   RUN;

   /* RESOLVE THE MACRO VARIABLES INTO LINES OF PROC FORMAT CODE */
   PROC FORMAT;
      VALUE &FORMAT
      %DO I=1 %TO &ENTRIES;
         &&_VL&I = &&_LB&I
      %END;
      %IF &OTHER NE %THEN %DO;
         OTHER = "&OTHER"
      %END;
      ;   /* ENDS THE VALUE STATEMENT WHETHER OR NOT OTHER= IS GIVEN */
   RUN;
%MEND MAKEFMT;
Figure 4
EVALUATION OF SETUP AND LOOKUP TIMES OF MULTIPLE FORMAT TABLES, BINARY SEARCH, HASHING AND SORT/MERGE

                   Setup Time                Lookup Time
Least Expensive    Binary Search             Multiple Format Tables
                   Hashing                   Sort/Merge, Hashing
                   Multiple Format Tables
Most Expensive     Sort/Merge                Binary Search
REFERENCES
1. Ray C: "A Comparison of Table Lookup Techniques." SUGI12 Proceedings, 1987.
2. Ray C: "The Implementation of a Hashing Routine Using SAS Software." To be presented at SUGI13, Orlando, Florida, 1988.
For further information about this article contact Nancy A. Fox, Center for Economic Studies in Medicine, Pracon Incorporated, 1800 Robert Fulton Drive, Reston, Virginia, 22091.
SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.