NESUG 2011 PROC FORMAT

Using PROC FORMAT to Create Time-of-Day Intervals
James Zeitler, Harvard Business School, Boston, MA

Abstract

It’s not too hard to find end-of-day stock prices and summary measures such as open, high, low, close, and volume, but historical intraday prices are harder to come by, and similar summary measures for time periods within the trading day often require some custom programming. In this presentation I describe how I used PROC FORMAT to generate summary measures of trading activity for fifteen-minute subintervals throughout a trading day.

Introduction

To set the scene, let’s look at this chart (Figure 1) of IBM’s daily stock price for the month of May 2011, which I drew (using Excel) on the basis of data I found at the Investor Relations page of the company’s web site. This is an example of a candlestick chart that tracks the stock’s open, high, low, and close prices each day. The endpoints of the narrow vertical lines indicate the day’s low and high prices. The endpoints of the wider bars indicate the open (actually, the previous day’s closing price) and close prices, and the color indicates whether the price has risen (green) or fallen (red) relative to the previous day’s close.

Figure 1 (candlestick chart of IBM’s daily stock price, May 2011)

This chart is based on readily available end-of-day observations. However, I was recently asked to come up with similar data for all stocks traded on the New York Stock Exchange on the basis of fifteen-minute intervals within the trading day rather than an entire day. In this paper I’d like to describe how I used PROC FORMAT to help me find summary measures for fifteen-minute intraday trading intervals.

How SAS® Handles Time of Day

I’ve worked enough with dates in SAS® to know that SAS® stores dates as integer values counting the number of days before or after January 1, 1960, which is day zero in the SAS® calendar.
But I needed to refer back to the documentation to verify my recollection that SAS® stores time of day as the number of seconds past midnight, which is time zero on the SAS® clock. So a SAS® time variable can assume values from 0 (midnight) to 86,400 seconds (24 hours * 60 minutes per hour * 60 seconds per minute), and each fifteen-minute interval includes 15*60 = 900 seconds. The program and listing in Figure 2 illustrate the principles in play.

    options nocenter;

    data work.temp;
       length I 8 TimeC $8;
       /* 96 fifteen-minute steps cover the 24-hour day */
       do I = 0 to 96;
          TimeC = put(I*900,tod8.0);   /* seconds past midnight, written as hh:mm:ss */
          output;
       end;
    run;

    /* Print only the first and last few intervals */
    proc print data = work.temp (where = (0 <= I <= 5 or 91 <= I <= 96));
    run;

    Obs     I    TimeC
      1     0    00:00:00
      2     1    00:15:00
      3     2    00:30:00
      4     3    00:45:00
      5     4    01:00:00
      6     5    01:15:00
     92    91    22:45:00
     93    92    23:00:00
     94    93    23:15:00
     95    94    23:30:00
     96    95    23:45:00
     97    96    00:00:00

    Figure 2

In this example, I’ve defined TimeC as a character variable; the TODw.d format translates a number of seconds into a time of day; and the PUT function writes the formatted time of day to the character variable TimeC. I’m using the TODw.d format rather than TIMEw.d so that the leading zero is written for times before 10:00 AM.

Creating the Fifteen-Minute Interval Format

With this in mind, my aim is to assign a label to times within each fifteen-minute interval. For example, the first fifteen minutes of the day would be labeled “00:00:00-00:15:00” and the time between 3:30PM and 3:45PM would be labeled “15:30:00-15:45:00”. Rather than enter each interval label and its starting and ending values by hand, I define the format ranges and labels in a data step and then use an input control (CNTLIN) data set to generate the format programmatically. A CNTLIN data set is essentially a collection of the components needed to create a format: the format name and type, the starting and ending values of the formatted ranges, and the label to be associated with each range of values.
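
To make those components concrete, here is a minimal sketch of a hand-built input control data set. It is not part of the program described in this paper: the format name sessn, the data set name work.cntl_demo, and the three session labels are illustrative assumptions, and the ranges assume times recorded in whole seconds.

```sas
/* Sketch only: a three-row CNTLIN data set (all names and labels are hypothetical) */
data work.cntl_demo;
   length fmtname $8 type $1 label $12;
   retain fmtname "sessn"     /* format name, repeated on every row */
          type    "N";        /* N = numeric format                 */
   input start end label $;   /* range start, range end, label      */
   datalines;
0     34199 pre-open
34200 57599 regular
57600 86399 after-hours
;
run;

/* Build the format from the control data set */
proc format cntlin = work.cntl_demo;
run;

/* Apply the format: write a time and its label to the log */
data _null_;
   t = "10:30:00"t;
   label = put(t, sessn.);
   put t= tod8. label=;
run;
```

Because 10:30:00 is 37,800 seconds past midnight, the log line should show the label regular. Note that these hand-typed ranges stop one second short of the next starting value, so a fractional time between two ranges would not be labeled by this sketch.
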
To these minimum requirements I also added an option to avoid overlapping ranges. The program is reproduced in Figure 3, and the comments that follow refer to the numbered markers in the program listing.

    options nocenter;

    *******************************************************;
    * Define fifteen minute intervals in a CNTLIN data set *;
    *******************************************************;
    data work.int15m;
       retain label   "00:00:00-00:00:00"                    /* 1 */
              fmtname "int15m"                               /* 2 */
              type    "N"
              start   0                                      /* 3 */
              end     0
              eexcl   "Y";                                   /* 4 */
       do start = "00:00:00"t to "23:45:00"t by 900;         /* 5 */
          end = start + 900;
          label = put(start,tod8.0)||"-"||put(end,tod8.0);   /* 6 */
          output;
       end;
       hlo = "O";                                            /* 7 */
       label = "***OTHER***";
       output;
       format start end tod8.0;
    run;

    ********************************************************;
    * Use the CNTLIN data set to create the interval format *;
    ********************************************************;
    proc format cntlin = work.int15m;                        /* 8 */
    run;

    ***************************;
    * View format description *;
    ***************************;
    proc format library = work fmtlib;                       /* 9 */
       select int15m;
    run;

    Figure 3

1. Strictly speaking, I don’t have to RETAIN the variable label, but this is a convenient way to specify the variable’s length and appearance.
2. The same RETAIN statement sets up the variables fmtname, type, start, end, and eexcl.
   a. fmtname is the format name, and it has to be in every row of the CNTLIN data set.
   b. type indicates the format type. In this case the format is being applied to a numeric variable, so the type is “N”. Here are the acceptable values for the type variable (from “Output Control Data Set” in the FORMAT Procedure section of the Base SAS® documentation):
      i. C specifies a character format
      ii. I specifies a numeric informat
      iii. J specifies a character informat
      iv. N specifies a numeric format
      v. P specifies a picture format
3. The start and end variables hold the starting and ending points for each interval.
4. The variable eexcl indicates whether the ending point of the interval is excluded. Y indicates that the ending point is excluded (the interval does not contain the ending point). N would indicate that the ending point is not excluded (the interval does include the ending point). Without this variable, the defined intervals might overlap: the ending point of one interval would coincide with the starting point of the next. The sexcl variable serves the same function with respect to the interval starting points.
5. As in the earlier program, the DO loop runs in fifteen-minute steps from midnight to midnight. This is where I assign values to the format range start and end variables.
6. The format label is in the form “09:00:00-09:15:00”. With eexcl = “Y”, this interval includes the time 09:00:00, the starting value, and excludes 09:15:00, the ending value, so that successive format ranges do not overlap. It’s important that the labels sort properly from earliest to latest, so I use the tod8.0 format to include leading zeros where necessary.
7. The hlo variable, in this case, indicates what the format should do when it encounters a time not covered by any of the defined ranges (OTHER). Again, see the SAS® documentation for more information.
8. Use the CNTLIN= option in PROC FORMAT to generate the formats defined in the input control data set. In this case I’m defining only a single format, but multiple formats could be defined in one input control data set.
9. Finally, the fmtlib option in PROC FORMAT lets you look at the format details (Figure 4). This temporary format is stored in the WORK library, and I’ve asked for a description of the int15m format alone.
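
A quick way to check the boundary behavior described in comments 4 and 6 is to push a few test times through the finished format. This is a sketch, not part of the original program; it assumes the int15m format from Figure 3 already exists in the WORK library, and the test values are arbitrary.

```sas
/* Sketch: see how int15m labels times at and around an interval boundary */
data _null_;
   do t = "09:00:00"t, "09:14:59"t, "09:15:00"t, .;
      interval = put(t, int15m.);   /* label assigned by the format */
      put t= tod8. interval=;       /* write the time and its label to the log */
   end;
run;
```

If the format behaves as described, 09:14:59 should fall in the 09:00:00-09:15:00 interval, 09:15:00 should begin the 09:15:00-09:30:00 interval, and the missing value should receive the ***OTHER*** label.
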

    +--------------------------------------------------------------------------+
    | FORMAT NAME: INT15M   LENGTH: 17   NUMBER OF VALUES: 97                  |
    | MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 17   FUZZ: STD           |
    +----------------+----------------+----------------------------------------+
    |START           |END             |LABEL (VER. V7|V8 10JUN2011:08:50:29)   |
    +----------------+----------------+----------------------------------------+
    |               0|             900|00:00:00-00:15:00                       |
    |             900<            1800|00:15:00-00:30:00                       |
    |            1800<            2700|00:30:00-00:45:00                       |
    ----------------------------------------------------------------------------
    |           84600<           85500|23:30:00-23:45:00                       |
    |           85500<           86400|23:45:00-00:00:00                       |
    |**OTHER**       |**OTHER**       |***OTHER***                             |
    +----------------+----------------+----------------------------------------+

    Figure 4

I can’t recall a time when I’ve had to use it, but I’ll mention in passing that the CNTLOUT= option of PROC FORMAT writes information about format contents to a data set.

TAQ Data from NYSE

Defining the format lays the groundwork for generating the fifteen-minute interval trading statistics. Before I get into the details of that process, I’d like to look a little more closely at the trades themselves. I get intraday stock price data from the Trades and Quotes (TAQ) database supplied by the New York Stock Exchange (NYSE), which includes all trades and quotes for all stocks traded on the NYSE. I generally access it through Wharton Research Data Services (WRDS) at the Wharton School of the University of Pennsylvania. WRDS converts the raw TAQ data files into SAS® data sets, each of which contains information on a single day’s trades or quotes for all covered stocks.

Figure 5 displays a brief description of a TAQ trades data set from PROC CONTENTS. I’ll take advantage of the fact that the data set is already sorted by SYMBOL and TIME. Each TAQ data set at WRDS holds observations from a single day, so DATE doesn’t enter into the process.

    Variables in Creation Order

    #    Variable    Type    Len    Format       Label
    1    SYMBOL      Char     10                 Stock Symbol
    2    DATE        Num       4    YYMMDDN8.    Transaction Date
    3    TIME        Num       4    TIME.        Trade Time
    4    PRICE       Num       8                 Actual Trade Price per Share
    5    SIZE        Num       5                 Number of Shares Traded
    6    G127        Num       3                 Comb G Rule 127 and Stop Stock indicator
    7    CORR        Num       3                 Correction Indicator
    8    COND        Char      2                 Sale Condition
    9    EX          Char      1                 Exchange on which the Trade occurred

    Alphabetic List of Indexes and Attributes

    #    Index     # of Unique Values
    1    SYMBOL    7830

    Sort Information

    Sortedby         SYMBOL TIME
    Validated        YES
    Character Set    ASCII

    Figure 5

Summarizing Trades by Fifteen-Minute Intervals

My approach to accumulating the summary measures is:

1. Use my int15m format to add an interval variable INT15M to the daily TAQ data set.
2. Reopen the data set BY SYMBOL INT15M so that I can keep track of open, high, low, close, and volume for each fifteen-minute interval for each ticker symbol. As the CONTENTS output indicates, the TAQ data set is already sorted by SYMBOL and TIME, and I’ve taken care to define the interval labels so that they also sort properly by time of day. Because the data set is already sorted by SYMBOL and TIME, it is implicitly sorted by SYMBOL and INT15M as well.
3. Retain the summary measures OPEN, HIGH, LOW, CLOSE, and VOLUME, setting their initial values to missing.
4. Reset the summary measures to missing the first time I encounter a ticker symbol or a new interval.
5. Assign values to the summary measures as each trade is processed.
6. Output an observation at the end of each time interval for each ticker symbol.

In the first part of the resulting program (Figure 6), I add interval labels to the TAQ data:

    data work.TAQ15_20100506;
       set taq.CT_20100506
           (keep = Symbol Date Time Price Size G127 CORR COND EX);
       by Symbol Time;
       Int15M = put(time,int15m.);   /* interval label for this trade */
       CT_RowNum + 1;                /* running row number */
    run;

    Figure 6

In the next data step (Figure 7) I generate the summary measures for each symbol for each time interval.

    data xx.TAQ15_20100506 (drop = Time Price Size G127 CORR COND EX);
       set work.TAQ15_20100506;
       by symbol Int15M;
       retain Open High Low Close Volume Trades .;
       /* Reset at the start of each symbol and interval. DATE is not a  */
       /* BY variable here, so first.date is not tested.                 */
       if first.symbol or first.Int15M then do;
          Open   = .;
          High   = .;
          Low    = .;
          Close  = .;
          Volume = .;
          Trades = .;
       end;
       if Open = . then Open = Price;   /* first trade in the interval */
       High   = max(Price ,High);
       Low    = min(Price ,Low );
       Volume = sum(Volume,Size);
       Close  = Price;                  /* last trade seen so far */
       Trades + 1;
       if last.Int15M then output;
    run;

    Figure 7

Using a View

You might have noticed that I’ve had to read through the data twice, once to add the interval label variable, int15m, and a second time to do the interval-level processing. The problem is that int15m has to be in the data set I’m reading in the second data step, because I’m using it as a BY variable. But this, it’s occurred to me, is a good time to use a data set VIEW. I may save some time if I create a VIEW rather than an actual data set in the first of the data steps, as illustrated in Figure 8.

    ************************************************;
    * Add 15-minute interval labels to TAQ dataset *;
    ************************************************;
    data work.TAQ15_20100506 / view = work.TAQ15_20100506;
       length Time 8 Date 8;
       set taq.CT_20100506
           (keep = Symbol Date Time Price Size G127 CORR COND EX);
       by Symbol Date Time;
       Int15M = put(time,int15m.);
       CT_RowNum + 1;
    run;

    Figure 8

Now I can refer to this data set exactly as I did before, without first writing out a full copy of the data just to add the interval labels. The absolute amount of CPU time saved by using the view rather than an actual data set is small (about 9.5 seconds of combined user and system CPU time in the logged runs), but the relative savings, roughly 23 percent, are respectable (Figures 9 and 10). Almost all of the savings are in system CPU time; elapsed real time was not lower in the run that used the view.
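
The extended resource statistics in Figures 9 and 10 appear to be the kind that SAS writes to the log when the FULLSTIMER system option is in effect. As a sketch of how such a comparison might be set up (the output data set name work.timing_check is hypothetical, and this is not necessarily how the figures were produced), the option can be switched on around the step being timed:

```sas
/* Sketch: turn on extended resource statistics for the step being timed */
options fullstimer;

data work.timing_check;            /* hypothetical output data set */
   set work.TAQ15_20100506;        /* resolves to the view or to the data set */
   by Symbol Int15M;
   if last.Int15M;                 /* keep one row per symbol and interval */
run;

options nofullstimer;              /* switch back to the shorter log note */
```

Running the same step once against the stored data set and once against the view, and adding the log note for the step that builds the stored data set, gives the two sets of figures to compare.
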

    Data set

    NOTE: The SAS System used:
          real time                       50.07 seconds
          user cpu time                   30.20 seconds
          system cpu time                 11.22 seconds
          Memory                          10305k
          OS Memory                       11340k
          Timestamp                       6/10/2011 10:30:08 AM
          Page Faults                     0
          Page Reclaims                   7644
          Page Swaps                      0
          Voluntary Context Switches      2091
          Involuntary Context Switches    1988

    Figure 9

    Data set view

    NOTE: The SAS System used:
          real time                       1:12.78
          user cpu time                   30.33 seconds
          system cpu time                 1.57 seconds
          Memory                          10648k
          OS Memory                       11596k
          Timestamp                       6/10/2011 10:16:00 AM
          Page Faults                     0
          Page Reclaims                   7771
          Page Swaps                      0
          Voluntary Context Switches      2411
          Involuntary Context Switches    93

    Figure 10

The Finished Product

As the candlestick plot of the fifteen-minute intervals illustrates (Figure 11), the period between 2:30PM and 3:00PM on May 6, 2010, the “flash crash,” was an exciting time on the New York Stock Exchange.

Figure 11 (candlestick plot of fifteen-minute intervals, May 6, 2010)

I created the first plot in this presentation (Figure 1) in Excel. This later plot (Figure 11) was generated by SAS®, based on program code from Sample 24914 in the Samples & SAS Notes section of the SAS Knowledge Base (http://support.sas.com/kb/24/914.html). It’s not a pretty process.

CONCLUSION

My first inclination when attempting a task such as this would be to create the time-interval indicator with a tedious series of IF-THEN statements. However, PROC FORMAT and the PUT function provide another way to accomplish the task. When used with the PUT function, formats are not just for output anymore. So my second inclination would be to create interval formats by typing in ranges and labels manually and to use the PUT function to assign labels to time variables in the data set. But the CNTLIN option of PROC FORMAT allows me to define the ranges and labels programmatically, arriving at a quicker, easier, more flexible, and more reliable solution.
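
The contrast between typing ranges by hand and generating them can be sketched as follows. Neither part comes from the original program: the format names int15x and intx, the data set work.intfmt, and the macro variable width are hypothetical. The first part shows what a hand-typed VALUE statement would look like for just the first three of the 96 ranges, with the -< operator playing the role of eexcl = "Y". The second part is a variation on Figure 3 in which the interval width is a parameter, assuming a width that divides evenly into 86,400 seconds.

```sas
/* Part 1 (sketch): the hand-typed alternative, first three ranges only */
proc format;
   value int15x                          /* hypothetical format name */
      0    -< 900  = "00:00:00-00:15:00" /* -< excludes the ending value */
      900  -< 1800 = "00:15:00-00:30:00"
      1800 -< 2700 = "00:30:00-00:45:00"
      other        = "***OTHER***";
run;

/* Part 2 (sketch): generate the ranges for any interval width */
%let width = 300;   /* width in seconds; 300 = five minutes (an assumption) */

data work.intfmt;
   length label $17 fmtname $8 type $1 eexcl $1 hlo $1;
   retain fmtname "intx" type "N" eexcl "Y";
   hlo = " ";                             /* blank for ordinary ranges */
   do start = 0 to 86400 - &width by &width;
      end   = start + &width;
      label = put(start, tod8.) || "-" || put(end, tod8.);
      output;
   end;
   hlo   = "O";                           /* catch-all OTHER row */
   label = "***OTHER***";
   output;
run;

proc format cntlin = work.intfmt;
run;
```

Changing the interval width in the second part means editing one number, whereas the hand-typed version would need every range and label retyped.
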

REFERENCES

BASE SAS DOCUMENTATION: The FORMAT Procedure (http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000063536.htm)

SAS KNOWLEDGE BASE/SAMPLES AND SAS NOTES: Sample 24914: Create candlestick plots (http://support.sas.com/kb/24/914.html)

The text of a report on the events of May 6, 2010, issued by the U.S. Securities and Exchange Commission (SEC) and the Commodity Futures Trading Commission (CFTC), can be found here: http://sec.gov/news/studies/2010/marketevents-report.pdf

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

TAQ (Trades and Quotes) data are available through the New York Stock Exchange. More information on the database may be found at the NYSE website (http://www.nyxdata.com/Data-Products/Daily-TAQ). Access to TAQ data is available through Wharton Research Data Services (WRDS) at the Wharton School of the University of Pennsylvania (http://wrds-web.wharton.upenn.edu/wrds/).

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

James Zeitler
Baker Research Services
Harvard Business School
Soldiers Field
Boston, MA 02163
Work Phone: (617) 495-6837
Email: jzeitler@hbs.edu