LOCF-Different Approaches, Same Results Using LAG Function, RETAIN Statement, and ARRAY Facility
Iuliana Barbalau, ClinOps LLC, San Francisco, CA

ABSTRACT
LOCF stands for “Last Observation Carried Forward”. It is an imputation method frequently used in clinical trials and popular throughout the pharmaceutical industry. For example, if a patient drops out of the study after the second week, then the last observed value is “carried forward” until the end of the treatment as a conservative estimate of how well the subject would have done had he or she remained in the study [1]. In another application, a vital signs dataset contains one observation for each scheduled or unscheduled visit a patient has with a doctor. If a patient misses an assigned visit, then the primary measure is missing. In this paper, the measurement of interest is weight. LOCF imputation uses the non-missing weight from an earlier visit as the weight for the later missing visit. This paper introduces SAS® syntax to accomplish LOCF and demonstrates the use of the RETAIN statement, the ARRAY facility, and the LAG function.

THE ORGANIZATION OF THIS PAPER
1. Example data set before and after LOCF
2. LOCF using RETAIN statement
3. LOCF using ARRAY facility and LAG function
4. Conclusions

1. EXAMPLE OF DATA SET BEFORE AND AFTER LOCF
Below is SAS code that generates four distinct patients, each with visits numbered from 1 to 4, and sample measurements for weight. Every patient has at least one missing weight: patient 1 at visits 2 and 3, patient 2 at visit 3, patient 3 at visit 2, and patient 4 at visits 1 and 3. We can use LOCF methods to carry forward the non-missing weight measurement from an earlier visit when the weight measurement for the current visit is missing.

    data sda;
      input ptno visit weight;
      format ptno z3.;
      cards;
    1 1 122
    1 2 .
    1 3 .
    1 4 123
    2 1 156
    2 3 .
    3 1 112
    3 2 .
    4 1 .
    4 2 123
    4 3 .
    ;
    run;

The data before applying LOCF are shown in figure 1.
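Before imputing, it can help to list any patient whose first recorded visit has no weight, because such a record has nothing to carry forward. The step below is a minimal sketch of that check and is not one of the methods developed in this paper. It assumes that sda is in ptno and visit order (as entered above); the data set name first_missing is hypothetical.

```sas
/* Sketch only: flag patients whose first recorded visit has no weight.  */
/* Assumes sda is already in ptno/visit order; first_missing is a        */
/* hypothetical, illustrative data set name.                             */
data first_missing;
  set sda;
  by ptno visit;
  /* subsetting IF: keep only a patient's first record when it is missing */
  if first.ptno and missing(weight);
run;

proc print data=first_missing noobs;
  var ptno visit weight;
run;
```

With the data above, the listing contains a single record: patient 004 at visit 1.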
Please note that all patients except patient 4 have a measurement available for visit 1 (considered baseline, or screening). Patient 4 is the exception: it has no weight measurement available for visit 1. We will check this particular patient after we apply each LOCF method. If the LOCF method is applied correctly, the weight measurement for patient 4 at visit 1 will remain missing in the final dataset, while for the other patients the missing weights will be filled with the value carried forward from the previous non-missing visit.

Figure 1: Data set sda before LOCF

After applying the LOCF method, the data should look like figure 2 below. Please note that patient 4 is still missing weight information at visit 1. When applying LOCF methods we need to make sure that the correct information for the correct patient is carried forward from the non-missing visits.

Figure 2: Data set sda after LOCF

2. LOCF USING THE RETAIN STATEMENT
An elegant way to use the RETAIN statement is presented below. This approach is inspired by the paper “LOCF Method and Application in Clinical Data Analysis” by Huijuan Xu [3]. It uses a RETAIN statement to create a temporary variable called tempval, which holds the value to be imputed from one observation to the next. The only drawback of this procedure is the assumption that the baseline value (visit 1) is always non-missing: tempval is never reset between patients, so a missing baseline picks up the value retained from the previous patient.

First we need to create a common list of subjects and scheduled visits using the code below.

    data all;
      format ptno z3.;
      do i=1 to 4;
        do j=1 to 4;
          ptno=i;
          visit=j;
          output;
        end;
      end;
      drop i j;
    run;

Data set all is described in figure 3. It includes four patients, each assigned four scheduled visits, for a total of sixteen observations.

Figure 3: Data set all

We sort the common list of subjects (all) and our original dataset (sda) by patient and visit, so that the two datasets can later be merged by patient and visit.
    proc sort data=sda; by ptno visit; run;
    proc sort data=all; by ptno visit; run;

Below we RETAIN tempval with an initial value of 0.

    data final (drop=tempval);
      retain tempval 0;
      merge sda(in=b) all(in=val);
      by ptno visit;
      if val;    /* keep every scheduled visit listed in data set all */
      if weight eq . then weight=tempval;
      else tempval=weight;
    run;

Using the above code, we obtain the data presented in figure 4.

Figure 4: Data set final

Please note that with this method, for patient 4 at visit 1, we carried forward the weight measurement of the previous patient (patient 3, weight equals 112). Although the method gives valid results when visit 1 is present, LOCF is not implemented correctly when data are missing for the first visit. It is a good idea to check the data both before and after applying any LOCF method.

3. LOCF USING ARRAY FACILITY AND LAG FUNCTION
LAGn() returns the value of its argument from the nth previous observation. Example: if our data has 3 observations where x takes on the values 1, 2 and 3, then LAG2(x) on the 3rd observation returns 1, the value of the first observation. LAG() is the same as LAG1().

This procedure is lengthier. It obtains the last available non-missing observation using a set of conditions (in our case, the dataset final3 is processed by patient and visit). When the first observation of a patient is reached, we reset the LAGn() values to missing through the array, and then we assign the LOCF value.

FIRST STEP - sort data set sda by patient and by visit. This way, the appropriate weight measurements will be used by the LAGn() functions in the data step.

SECOND STEP - define the array “reset” to hold the lagged weight variables for a patient. As a general rule, the array should have a total of n-1 elements, where n equals the total number of possible or scheduled visits. In our case, n is 4. The reason is that a value can be carried forward across at most n-1 visits: all possible visits less one, the original observation that serves as the source for our LOCF.
THIRD STEP - set the elements of the array “reset” (lagx1, lagx2, lagx3) to missing each time the first observation for a patient occurs. On a patient's second observation lagx2 and lagx3 are set to missing again, and on the third observation lagx3, because those lags would still reach back into the previous patient's records. This way, we prevent a weight measurement from being “carried forward” from one patient to another.

FOURTH STEP - consider all LAGn() values of weight that may need to be carried forward from one visit to another. For example, if weight is missing for a particular visit and lagx1 is not missing, then we use the lagx1 value to populate the current missing weight measurement. If both the current weight and lagx1 are missing, then we use the most recent non-missing measurement available, either lagx2 or lagx3.

    data final3;
      set sda;
      by ptno visit;
      array reset(*) lagx1-lagx3;
      lagx1=lag(weight);
      lagx2=lag2(weight);
      lagx3=lag3(weight);
      /* count = position of the current observation within the patient */
      if first.ptno then count=1;
      /* blank out the lags that reach back into the previous patient */
      do i=count to dim(reset);
        reset(i)=.;
      end;
      count+1;
      if weight=. and lagx1 ne . then weight=lagx1;
      else if weight=. and lagx1 eq . and lagx2 ne . then weight=lagx2;
      else if weight=. and lagx1 eq . and lagx2 eq . then weight=lagx3;
    run;

The dataset created using this method is presented below in figure 5. As we can observe, the weight for patient 1 is carried forward from visit 1 to visits 2 and 3, while for patient 4 the weight measurement at visit 1 remains missing since we have no earlier information available for that patient. Please note that the number of observations (eleven) is the same as in the original data set sda (figure 1). This differs from the previous example, data final (figure 4), where we created a common dataset (sixteen observations) with all the possible (scheduled) visits a patient might have.

Figure 5: Interim set final3

The final dataset of interest will look like figure 6 presented below.

Figure 6: Data set final3

4. CONCLUSIONS
It is useful to know that there are multiple ways to obtain LOCF results, because it encourages SAS programmers to be more creative when writing their code. Before starting LOCF, we need to ask ourselves what final dataset we want to obtain.
Do we want to generate a common dataset with all the possible values for scheduled visits, or are we interested only in LOCF values for missing measurements in the dataset of interest to us? After we answer this question, we need to check the structure of our data and choose the most efficient method that produces accurate results. The most important thing to remember about LOCF is that we need to be familiar with our data. If we are not cautious enough, we could impute incorrect information to the wrong patients or time points, and that could affect the integrity of the information to be analyzed.

REFERENCES
[1] Definition: http://en.wikipedia.org/wiki/Analysis_of_clinical_trials
[2] Shein-Chung Chow, Encyclopedia of Biopharmaceutical Statistics, page 176
[3] Huijuan Xu, Biogenidec, Inc., LOCF Method and Application in Clinical Data Analysis

ACKNOWLEDGMENTS
I would like to thank my manager Irina Walsh for continuous support, Patrick Thornton for helpful mentoring and Jeanina Worden for encouraging me to be one of the SAS gigs at WUSS Conference.

CONTACT INFORMATION
Iuliana Barbalau
ClinOps LLC.
353 Sacramento Street, Suite 800
San Francisco, CA 94111
Work Phone: (415) 679-2373
Fax: (415) 679-3280
E-mail: iuliana24@yahoo.com
Web: www.clinops.com

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.