LOCF-Different Approaches, Same Results Using LAG Function, RETAIN
Statement, and ARRAY Facility
Iuliana Barbalau, ClinOps LLC. San Francisco, CA.
ABSTRACT
LOCF stands for “Last Observation Carried Forward”. It is a popular imputation method in clinical trials and
throughout the pharmaceutical industry. For example, if a patient drops out of the study after the second week, the
last observed value is “carried forward” to the end of treatment as a conservative estimate of how well the subject
would have done had he or she remained in the study [1]. In another application, a vital signs dataset contains one
observation for each scheduled or unscheduled visit a patient has with a doctor. If a patient misses an assigned
visit, the primary measure is missing for that visit. In this paper, the measurement of interest is weight. LOCF
imputation uses the non-missing weight from an earlier visit as the weight for the later visit where it is missing.
This paper introduces SAS® syntax to accomplish LOCF and demonstrates the use of the RETAIN statement, the ARRAY
facility, and the LAG function.
THE ORGANIZATION OF THIS PAPER
1. Example data set before and after LOCF
2. LOCF using RETAIN statement
3. LOCF using ARRAY facility and LAG function
4. Conclusions
1. EXAMPLE OF DATA SET BEFORE AND AFTER LOCF
The SAS code below creates four patients with up to four visits each and an illustrative weight measurement at each
visit. Several weights are missing: patient 1 at visits 2 and 3, patient 2 at visit 3, patient 3 at visit 2, and
patient 4 at visits 1 and 3. LOCF fills a missing weight at the current visit with the non-missing weight from an
earlier visit.
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 .
1 3 .
1 4 123
2 1 156
2 3 .
3 1 112
3 2 .
4 1 .
4 2 123
4 3 .
;
run;
Figure 1 shows the data before LOCF is applied. Every patient except patient 4 has a weight measurement at visit 1
(considered baseline, or screening). Patient 4 is the exception we will check after applying each LOCF method. If the
method is applied correctly, the weight for patient 4 at visit 1 stays missing in the final dataset, while all other
missing weights are carried forward from the same patient's previous non-missing visit.
Figure 1 – Data set sda before LOCF
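The baseline check described above can be sketched in a short DATA step. This is only an illustration under stated
assumptions: the data set names sda_sorted and baseline_check are hypothetical and do not appear in the original
paper.

```sas
/* Sort a copy of sda so that FIRST.ptno marks each patient's first recorded visit. */
/* sda_sorted is a hypothetical name used only for this sketch.                     */
proc sort data=sda out=sda_sorted;
   by ptno visit;
run;

/* Keep only patients whose first recorded visit has a missing weight.              */
/* baseline_check is a hypothetical name used only for this sketch.                 */
data baseline_check;
   set sda_sorted;
   by ptno visit;
   if first.ptno and missing(weight);
run;

proc print data=baseline_check noobs;
   title "Patients with a missing weight at their first recorded visit";
run;
title;
```

With the sample data above, this listing should contain a single row: patient 4 at visit 1.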
After LOCF is applied, the data should look like figure 2 below. Note that patient 4 is still missing weight
information at visit 1. When applying LOCF methods, we must make sure that values are carried forward only within the
same patient, from that patient's own non-missing visits.
Figure 2 – Data set sda after LOCF
2. LOCF USING THE RETAIN STATEMENT
An elegant way to use the RETAIN statement is presented below. This approach is inspired by the paper “LOCF
Method and Application in Clinical Data Analysis” by Huijuan Xu [3]. It uses a RETAIN statement to create a
temporary variable called tempval, which holds the value to be imputed from one observation to the next. The only
drawback of this procedure is the assumption that the baseline value (visit 1) is always non-missing.
We first need to create a common list of subjects and scheduled visits using the code below.
data all;
format ptno z3.;
do i=1 to 4;
do j=1 to 4;
ptno=i;
visit=j;
output;
end;
end;
drop i j;
run;
Data set all is shown in figure 3. It includes four patients, each assigned four scheduled visits, for a total of
sixteen observations.
Figure 3 – data set all
We sort both the common list of subjects, all, and our regular dataset, sda, by patient and visit, so that the two
datasets can later be merged by patient and visit.
proc sort data=sda; by ptno visit; run;
proc sort data=all; by ptno visit; run;
Below, the RETAIN statement gives tempval an initial value of 0.
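data final (drop=tempval);
retain tempval 0;
merge sda(in=b) all (in=val);
by ptno visit ;
/* keep every scheduled visit listed in data set all */
if val;
/* missing weight: impute the retained value */
if weight eq . then weight=tempval;
/* non-missing weight: remember it for later visits */
else tempval=weight;
run;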
Using the above code, we obtain the data presented in figure 4.
Figure 4 – Data set final
Please note that with this method the weight for patient 4 at visit 1 was carried forward from the previous patient
(patient 3), whose last retained value was 112. The method gives valid results when visit 1 is present for every
patient, but when data is missing at the first visit, LOCF is not implemented correctly. It is a good idea to check
the data before attempting any LOCF method, and to check it again after applying LOCF.
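One possible way to avoid carrying a value across patients is to reset the retained variable at the start of each
BY group. The sketch below is a variation on the code above, not the method from [3]. It assumes that sda and all
are already sorted by ptno and visit, and the output name final_by is hypothetical.

```sas
/* Variation on the RETAIN approach: clear tempval whenever a new patient starts. */
/* final_by is a hypothetical name used only for this sketch.                     */
data final_by (drop=tempval);
   merge sda all (in=val);
   by ptno visit;
   retain tempval;                      /* no initial value, so it starts missing */
   if first.ptno then tempval = .;      /* forget the previous patient's weight   */
   if val;                              /* keep every scheduled visit             */
   if missing(weight) then weight = tempval;
   else tempval = weight;
run;
```

With the sample data, patient 4 at visit 1 should stay missing in final_by, because nothing has been retained yet
for that patient when the first visit is processed.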
3. LOCF USING ARRAY FACILITY AND LAG FUNCTION
LAGn() returns the value of the nth previous observation. Example: if our data has 3 observations where x takes on
the values of 1, 2 and 3, then LAG2(x) on the 3rd observation will return 1, the value of the first observation.
LAG() is the same as LAG1().
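The example in the paragraph above can be reproduced with a small DATA step. The data set name lag_demo and the
variable names prev1 and prev2 are hypothetical, chosen only for this illustration.

```sas
/* Three observations with x = 1, 2, 3, plus the first and second lags of x. */
data lag_demo;
   input x;
   prev1 = lag(x);     /* same as lag1(x) */
   prev2 = lag2(x);
   datalines;
1
2
3
;
run;

proc print data=lag_demo noobs;
run;
```

On the third observation prev2 is 1 and prev1 is 2. On the first observation both lags are missing. Each LAGn
function returns values saved from its own earlier executions, so it is usually safest to call it on every
observation rather than inside an IF condition.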
This procedure is lengthier. It obtains the last available non-missing observation using a set of conditions (in
our case, the dataset final3 is processed by patient and visit). When the first observation for a patient is reached,
we reset the LAGn() values to missing. After that we assign LOCF using the ARRAY facility.
FIRST STEP – sort data set sda by patient and by visit. This way, the LAGn() functions in the data step pick up
the weight measurements in the correct order.
SECOND STEP – define the array “reset” to hold the lagged weight variables. As a general rule, the array should have
n-1 elements (where n equals the total number of possible or scheduled visits). In our case n is 4, so the array has
three elements. The reason is that a value can be carried forward across at most the number of possible visits less
one (the original observation that supplies the LOCF value).
THIRD STEP – set to missing the elements of “reset” that would reach back into the previous patient: lagx1, lagx2 and
lagx3 at a patient's first observation, lagx2 and lagx3 at the second, and lagx3 at the third. This way, we prevent a
weight measurement from being “carried forward” from one patient to another.
FOURTH STEP – consider each LAGn() of weight as a candidate value to carry forward from one visit to another. For
example, if weight is missing for a particular visit and lagx1 is not missing, then we use the lagx1 value to
populate the current missing weight measurement. If both the current weight and lagx1 are missing, then we use the
most recent non-missing measurement available: lagx2 or, if that is also missing, lagx3.
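data final3;
set sda;
by ptno visit;
array reset(*) lagx1-lagx3;
/* call the LAG functions on every observation so their queues stay in step with the data */
lagx1=lag(weight);
lagx2=lag2(weight);
lagx3=lag3(weight);
/* count is the position of the current observation within the patient */
if first.ptno then count=1;
/* blank out the lags that reach back into the previous patient */
do i=count to dim(reset);
reset(i)=.;
end;
count+1;
/* impute from the most recent non-missing lag */
if weight=. and lagx1 ne . then weight=lagx1;
else if weight=. and lagx1 eq . and lagx2 ne . then weight=lagx2;
else if weight=. and lagx1 eq . and lagx2 eq . then weight=lagx3;
run;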
The dataset created using this method is presented below in figure 5. As we can observe, the weight for patient 1 at
visit 1 is carried forward to visits 2 and 3, while for patient 4 the weight at visit 1 remains missing since no
earlier information is available for that patient.
Please note that the number of observations (eleven) is the same as in the original data set sda (figure 1). In the
previous example, data final (figure 4) had sixteen observations because we merged with a common dataset containing
all the possible (scheduled) visits a patient might have.
Figure 5 – Interim set final3
The final dataset of interest will look like figure 6 presented below.
Figure 6 – Data set final3
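The paper does not show the step that turns the interim data set into the final one. One plausible way, assuming the
only difference is the helper variables created in the data step above, is to drop them. The output name
final3_clean is hypothetical.

```sas
/* Remove the lag variables, the counter and the loop index from the interim data. */
/* final3_clean is a hypothetical name used only for this sketch.                  */
data final3_clean;
   set final3 (drop=lagx1-lagx3 count i);
run;
```

Alternatively, a DROP= data set option on the DATA statement that creates final3 would remove these variables in the
same step.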
4. CONCLUSIONS
Knowing that there are multiple ways to obtain LOCF results encourages SAS programmers to be more creative when
writing their code. Before starting LOCF, we need to ask ourselves what final dataset we want to obtain. Do we want
to generate a common dataset with all the possible scheduled visits, or are we interested only in LOCF values for
missing measurements in the dataset of interest to us? After we answer this question, we need to check the structure
of our data and choose the most efficient method that produces accurate results. Above all, LOCF requires us to be
familiar with our data. If we are not cautious enough, we could impute incorrect information to different patients
or time points, and that could affect the integrity of the information to be analyzed.
REFERENCES
[1] Wikipedia, “Analysis of clinical trials”: http://en.wikipedia.org/wiki/Analysis_of_clinical_trials
[2] Shein-Chung Chow, Encyclopedia of Biopharmaceutical Statistics, page 176
[3] Huijuan Xu, “LOCF Method and Application in Clinical Data Analysis”, Biogen Idec, Inc.
ACKNOWLEDGMENTS
I would like to thank my manager Irina Walsh for continuous support, Patrick Thornton for helpful mentoring and
Jeanina Worden for encouraging me to be one of the SAS gigs at WUSS Conference.
CONTACT INFORMATION
Iuliana Barbalau
ClinOps LLC.
353 Sacramento Street, Suite 800
San Francisco, CA 94111
Work Phone: (415) 679-2373
Fax: (415) 679-3280
E-mail: iuliana24@yahoo.com
Web: www.clinops.com
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks
of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and
product names are trademarks of their respective companies.