My Fifteen Minutes of Format Fame

NESUG 2011
PROC FORMAT
Using PROC FORMAT to Create Time-of-Day Intervals
James Zeitler, Harvard Business School, Boston, MA
Abstract
It’s not too hard to find end-of-day stock prices and summary measures such as open, high, low, close,
and volume, but historical intraday prices are harder to come by, and similar summary measures for time
periods within the trading day often require some custom programming. In this presentation I describe
how I used PROC FORMAT to generate summary measures of trading activity for fifteen-minute
subintervals throughout a trading day.
Introduction
To set the scene, let’s look at this chart (Figure 1) of IBM’s daily stock price for the month of May, 2011,
which I drew (using Excel) on the basis of data I found at the Investor Relations page of the company’s
web site. This is an example of a candlestick chart that tracks the stock’s open, high, low, and close price
each day. The endpoints of the narrow vertical lines indicate the day’s low and high prices. The endpoints
of the wider bars indicate the open (actually, the previous day’s closing price) and close prices, and the
color indicates whether the price has risen (green) or fallen (red) relative to the previous day’s close.
Figure 1
This chart is based on readily available end-of-day observations. However, I was asked recently to come
up with similar data for all stocks traded on the New York Stock Exchange on the basis of fifteen-minute
intervals within the trading day rather than an entire day. In this paper I’d like to describe how I used
PROC FORMAT to help me find summary measures for fifteen-minute intraday trading intervals.
How SAS® Handles Time of Day
I’ve worked enough with dates in SAS® to know that SAS® stores dates as integer values counting the
number of days before or after January 1, 1960 – day zero in the SAS® calendar. But I needed to refer
back to the documentation to verify my recollection that SAS® stores time-of-day as the number of
seconds past midnight – time zero in the SAS® clock. So a SAS® time variable can assume values from
0 (midnight) to 86,400 seconds (24 hours * 60 minutes per hour * 60 seconds per minute), and each
fifteen minute interval includes 15*60 = 900 seconds. The program and listing in Figure 2 illustrate the
principles in play.
options nocenter;

data work.temp;
    length I     8
           TimeC $8;
    do I = 0 to 96;
        TimeC = put(I*900,tod8.0);
        output;
    end;
run;

proc print data = work.temp (where = (0 <= I <= 5
                                   or 91 <= I <= 96));
run;
Obs     I    TimeC
  1     0    00:00:00
  2     1    00:15:00
  3     2    00:30:00
  4     3    00:45:00
  5     4    01:00:00
  6     5    01:15:00
 92    91    22:45:00
 93    92    23:00:00
 94    93    23:15:00
 95    94    23:30:00
 96    95    23:45:00
 97    96    00:00:00
Figure 2
In this example, I’ve defined TimeC as a character variable; the todw.d format translates a number of
seconds into a time of day; and the PUT function writes the formatted time of day to the character variable
TimeC. I’m using the todw.d format rather than timew.d so that I can get the leading zero written for
the times before noon.
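As a small additional sketch (not part of the original program), the step below writes one time value to the log in several ways. The variable names t1, t2, and interval are my own illustrative choices. Both t1 and t2 should print as 33300 (9*3600 + 15*60 seconds past midnight), and interval should print as 37, the zero-based index of the fifteen-minute interval that contains 09:15:00. The last two PUT statements let you compare how timew.d and todw.d handle the leading zero.

```sas
data _null_;
    t1 = "09:15:00"t;             /* time literal: seconds past midnight          */
    t2 = hms(9, 15, 0);           /* the same value built from hours, min, sec    */
    interval = floor(t1 / 900);   /* zero-based index of the 15-minute interval   */
    put t1= t2= interval=;
    put "time8. writes: " t1 time8.;   /* compare the leading character ...       */
    put "tod8.  writes: " t1 tod8.;    /* ... with the todw.d version             */
run;
```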
Creating the Fifteen-Minute Interval Format
With this in mind, my aim is to assign a label to times within each fifteen-minute interval. For example,
the first fifteen minutes of the day would be labeled “00:00:00-00:15:00” and the time between 3:30PM
and 3:45PM would be labeled “15:30:00-15:45:00”. Rather than enter each interval label and starting and
ending values by hand, I define the format ranges and labels in a data step and then use an input control
(CNTLIN) data set to generate the format programmatically. A CNTLIN data set is essentially a collection
of the components needed to create a format: the format name and type, the starting and ending values
of the formatted ranges, and the label to be associated with each range of values. To these minimum
requirements I also added an option to avoid overlapping ranges. The program is reproduced in Figure 3,
and these comments refer to the marked lines in the program listing:
1. Strictly speaking, I don’t have to RETAIN the variable label, but this is a convenient way to specify
the variable length and appearance.
2. I do have to RETAIN the variables fmtname, type, start, end, and eexcl.
a. fmtname is the format name, and it has to be in every row of the CNTLIN data set
b. type indicates the format type. In this case the format is being applied to a numeric
variable, so the type is “N”. Here are acceptable values for the type variable (from
“Output Control Data Set” in the FORMAT Procedure section of the Base SAS®
documentation):
i. C specifies a character format
ii. I specifies a numeric informat
iii. J specifies a character informat
iv. N specifies a numeric format
v. P specifies a picture format
options nocenter;

*******************************************************;
* Define fifteen minute intervals in a CNTLIN data set *;
*******************************************************;
data work.int15m;
    retain label   "00:00:00-00:00:00"                     /* 1 */
           fmtname "int15m"                                /* 2 */
           type    "N"
           start   0                                       /* 3 */
           end     0
           eexcl   "Y";                                    /* 4 */
    do start = "00:00:00"t to "23:45:00"t by 900;          /* 5 */
        end   = start + 900;
        label = put(start,tod8.0)||"-"||put(end,tod8.0);   /* 6 */
        output;
    end;
    hlo   = "O";                                           /* 7 */
    label = "***OTHER***";
    output;
    format start end tod8.0;
run;

********************************************************;
* Use the CNTLIN data set to create the interval format *;
********************************************************;
proc format cntlin = work.int15m;                          /* 8 */
run;

***************************;
* View format description *;
***************************;
proc format library = work fmtlib;                         /* 9 */
    select int15m;
run;
Figure 3
3. The start and end variables hold the starting and ending points for each interval.
4. The variable eexcl indicates whether the ending point of the interval is included. Y indicates that
the ending point is excluded (the interval does not contain the ending point). N would indicate
that the ending point would not be excluded (the interval does include the ending point). Without
this variable, the defined intervals might overlap; the ending point for one would overlap with the
starting point of the next. Alternatively, the sexcl variable has the same function with respect to
the interval starting points.
5. As in the earlier program, the DO loop runs in fifteen minute intervals from midnight to midnight.
This is where I assign values to the format range start and end variables.
6. The format label is in the form “09:00:00-09:15:00”. With eexcl = “Y”, this interval includes
the time 09:00:00, the starting value, and excludes 09:15:00, the ending value, so that
successive format ranges do not overlap. It’s important that the labels sort properly from earliest
to latest, so I use the tod8.0 format to include leading zeros where necessary to ensure proper
sorting.
7. The hlo variable, set to “O” here, marks this row as the OTHER range, so its label is applied to
any time not covered by the defined ranges. Again, look at SAS® documentation for more information.
8. Use the CNTLIN = option in PROC FORMAT to generate the formats defined in the input control
data set. In this case I’m defining only a single format, but multiple formats could be defined in
the input control data set.
9. Finally, the fmtlib option in PROC FORMAT lets you look at the format details (Figure 4). This
temporary format is stored in the WORK library and I’ve asked for a description of the int15m
format alone.
----------------------------------------------------------------------------
|       FORMAT NAME: INT15M   LENGTH:   17   NUMBER OF VALUES:   97        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH  17  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   10JUN2011:08:50:29)|
|----------------+----------------+----------------------------------------|
|               0|             900|00:00:00-00:15:00                       |
|            900<            1800|00:15:00-00:30:00                       |
|           1800<            2700|00:30:00-00:45:00                       |
|--------------------------------------------------------------------------|
|          84600<           85500|23:30:00-23:45:00                       |
|          85500<           86400|23:45:00-00:00:00                       |
|**OTHER**       |**OTHER**       |***OTHER***                             |
----------------------------------------------------------------------------
Figure 4
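To make the role of eexcl more concrete, here is a rough sketch of how the first two ranges might be typed by hand in a VALUE statement, where the -< range operator excludes the ending point. The format name int15m_demo and the variable names are my own, chosen so that this sketch does not disturb the int15m format. With these definitions, 899 should fall in the first interval, 900 in the second, and 1800, which no range covers in this shortened demo, should receive the OTHER label.

```sas
proc format;
    value int15m_demo                            /* hypothetical demo format      */
        0    -< 900  = "00:00:00-00:15:00"       /* includes 0, excludes 900      */
        900  -< 1800 = "00:15:00-00:30:00"       /* includes 900, excludes 1800   */
        other        = "***OTHER***";
run;

data _null_;
    do t = 899, 900, 1800;                       /* values on either side of a boundary */
        label = put(t, int15m_demo.);
        put t= label=;
    end;
run;
```

Typing all 96 ranges this way is exactly the manual work that the CNTLIN data set avoids.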
I can’t recall a time when I’ve had to use it, but I’ll mention in passing that the CNTLOUT = option of
PROC FORMAT will write information about format contents to a data set.
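For completeness, a minimal sketch of the CNTLOUT = option follows. It assumes the int15m format from Figure 3 already exists in the WORK library. The output data set name work.int15m_out is my own choice. One thing to be aware of is that START and END come back as character variables in the output control data set.

```sas
proc format library = work cntlout = work.int15m_out;   /* hypothetical output name */
    select int15m;                                       /* only this format         */
run;

proc print data = work.int15m_out (obs = 5);
    var fmtname start end label eexcl hlo;               /* a few of the columns     */
run;
```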
TAQ Data from NYSE
Defining the formats lays the groundwork for generating the fifteen-minute interval trading statistics.
Before I get into the details of that process, I’d like to look a little more closely at the trades themselves. I
access intraday stock price data from the Trades and Quotes (TAQ) database supplied by the New York
Stock Exchange (NYSE), which includes all trades and quotes for all stocks traded on the NYSE. I
generally access it through Wharton Research Data Services (WRDS) at the Wharton School of the
University of Pennsylvania. WRDS converts the raw TAQ data files into SAS® data sets, each of which
contains information on a single day’s trades or quotes for all covered stocks. Figure 5 displays a brief
description of a TAQ trades data set from PROC CONTENTS.
I’ll take advantage of the fact that the data set is already sorted by SYMBOL and TIME. Each TAQ
dataset at WRDS holds observations from a single day, so DATE doesn’t enter into the process.
Variables in Creation Order

#    Variable    Type    Len    Format       Label
1    SYMBOL      Char     10                 Stock Symbol
2    DATE        Num       4    YYMMDDN8.    Transaction Date
3    TIME        Num       4    TIME.        Trade Time
4    PRICE       Num       8                 Actual Trade Price per Share
5    SIZE        Num       5                 Number of Shares Traded
6    G127        Num       3                 Comb G Rule 127 and Stop Stock indicator
7    CORR        Num       3                 Correction Indicator
8    COND        Char      2                 Sale Condition
9    EX          Char      1                 Exchange on which the Trade occurred

Alphabetic List of Indexes and Attributes

#    Index     # of Unique Values
1    SYMBOL    7830

Sort Information

Sortedby         SYMBOL TIME
Validated        YES
Character Set    ASCII
Figure 5
Summarizing Trades by Fifteen-Minute Intervals
My approach to accumulating the summary measures is:
1. Use my int15m format to add an interval variable INT15M to the daily TAQ data set.
2. Reopen the dataset BY SYMBOL INT15M so that I can keep track of open, high, low, close, and
volume for each fifteen minute interval for each ticker symbol. As the CONTENTS indicate, the
TAQ data set is already sorted by SYMBOL and TIME, and I’ve taken care to define the interval
labels so that they also sort properly by time of day. Because the data set is already sorted by
SYMBOL and TIME, it is implicitly sorted properly by SYMBOL and INT15M.
3. Retain the summary measures OPEN, HIGH, LOW, CLOSE, and VOLUME, setting their initial
values to missing.
4. Reset the summary measures at the start of each time interval for each ticker symbol.
5. Assign values to the summary measures as each trade is processed.
6. Output an observation at the end of each time interval for each ticker symbol.
In the first part of the resulting program (Figure 6), I add interval labels to the TAQ data:
data work.TAQ15_20100506;
    set taq.CT_20100506 (keep = Symbol Date Time Price Size G127 CORR COND EX);
    by Symbol
       Time;
    Int15M = put(time,int15m.);
    CT_RowNum + 1;
run;
Figure 6
In the next data step (Figure 7) I generate the summary measures for each symbol for each time interval.
data xx.TAQ15_20100506 (drop = Time Price Size G127 CORR COND EX);
    set work.TAQ15_20100506;
    by symbol Int15M;
    retain Open
           High
           Low
           Close
           Volume
           Trades .;
    /* first.Int15M is also set whenever a new symbol begins.          */
    /* DATE is not a BY variable here, so first.date is not available. */
    if first.Int15M then do;
        Open   = .;
        High   = .;
        Low    = .;
        Close  = .;
        Volume = .;
        Trades = .;
    end;
    if Open = . then Open = Price;      /* first trade in the interval  */
    High   = max(Price ,High);
    Low    = min(Price ,Low );
    Volume = sum(Volume,Size);
    Close  = Price;                     /* last trade seen so far       */
    Trades + 1;
    if last.Int15M then output;
run;
Figure 7
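Before running this logic against a full day of TAQ data, it can be tried on a handful of made-up trades. The sketch below is only an illustration: the data set names work.toy and work.toy15 and the four trades are hypothetical, and it assumes the int15m format from Figure 3 has already been created in the WORK library. If the logic is right, symbol AAA should show open 10, high 10.5, low 10, close 10.5, volume 300, and 2 trades for 09:30:00-09:45:00, followed by a single trade at 9.75 for 09:45:00-10:00:00. Symbol BBB should show a single trade at 20 with volume 500.

```sas
/* Hypothetical toy trades, already in Symbol and Time order */
data work.toy;
    input Symbol $ Time :time8. Price Size;
    Int15M = put(Time, int15m.);        /* interval label, as in Figure 6 */
    format Time tod8.;
    datalines;
AAA 09:30:05 10.00 100
AAA 09:40:00 10.50 200
AAA 09:46:10  9.75 300
BBB 09:31:00 20.00 500
;
run;

/* Same accumulation pattern as Figure 7, applied to the toy data */
data work.toy15 (drop = Time Price Size);
    set work.toy;
    by Symbol Int15M;
    retain Open High Low Close Volume Trades .;
    if first.Int15M then do;            /* start of a new interval */
        Open = .; High = .; Low = .; Close = .; Volume = .; Trades = .;
    end;
    if Open = . then Open = Price;
    High   = max(Price, High);
    Low    = min(Price, Low);
    Volume = sum(Volume, Size);
    Close  = Price;
    Trades + 1;
    if last.Int15M then output;         /* one row per symbol and interval */
run;

proc print data = work.toy15;
run;
```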
Using a View
You might have noticed that I’ve had to read through the data set twice, once to add the interval label
variable, int15m, and a second time to do the interval-level processing. The problem is that int15m has
to be in the data set I’m reading in the second data step, because I’m using it as a BY variable. But this,
it’s occurred to me, is a good time to use a data set VIEW. I may save some time if I create a VIEW
rather than an actual data set in the first of the data steps, as illustrated in Figure 8.
************************************************;
* Add 15-minute interval labels to TAQ dataset *;
************************************************;
data work.TAQ15_20100506 / view = work.TAQ15_20100506;
    length Time 8
           Date 8;
    set taq.CT_20100506 (keep = Symbol Date Time Price Size G127 CORR COND EX);
    by Symbol
       Date
       Time;
    Int15M = put(time,int15m.);
    CT_RowNum + 1;
run;
Figure 8
Now I can refer to this view exactly as I referred to the data set before, without first writing a labeled
copy of the entire data set to disk. In the runs I timed (Figures 9 and 10), real time was actually longer
with the view (1:12.78 versus 50.07 seconds) and user CPU time was essentially unchanged, but system
CPU time dropped from 11.22 seconds to 1.57 seconds.
Data set
NOTE: The SAS System used:
      real time                       50.07 seconds
      user cpu time                   30.20 seconds
      system cpu time                 11.22 seconds
      Memory                          10305k
      OS Memory                       11340k
      Timestamp                       6/10/2011 10:30:08 AM
      Page Faults                     0
      Page Reclaims                   7644
      Page Swaps                      0
      Voluntary Context Switches      2091
      Involuntary Context Switches    1988
Figure 9
Data set view
NOTE: The SAS System used:
      real time                       1:12.78
      user cpu time                   30.33 seconds
      system cpu time                 1.57 seconds
      Memory                          10648k
      OS Memory                       11596k
      Timestamp                       6/10/2011 10:16:00 AM
      Page Faults                     0
      Page Reclaims                   7771
      Page Swaps                      0
      Voluntary Context Switches      2411
      Involuntary Context Switches    93
Figure 10
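The detailed resource lines in these log notes (memory, page faults, context switches) look like what the FULLSTIMER system option produces. Assuming that is how they were obtained, a comparison like this one could be set up along the following lines. The DATA _NULL_ step is only a stand-in for the steps being timed.

```sas
options fullstimer;       /* request detailed resource statistics in the log */

/* Stand-in workload: replace with the data steps being compared */
data _null_;
    do i = 1 to 1e6;
        x = sqrt(i);
    end;
run;

options nofullstimer;     /* return to the shorter timing notes */
```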
The Finished Product
As the candlestick plot of the fifteen-minute intervals illustrates (Figure 11), the period between 2:30PM
and 3:00PM on May 6, 2010 (the “flash crash”) was an exciting half hour on the New York Stock Exchange.
Figure 11
I created the first plot in this presentation (Figure 1) in Excel. This later plot (Figure 11) has been
generated by SAS®, based on program code from Sample 24914 in the Samples & SAS Notes section of
the SAS Knowledge Base (http://support.sas.com/kb/24/914.html). It’s not a pretty process.
CONCLUSION
My first inclination when attempting a task such as this would be to create the time-interval indicator with
a tedious series of IF-THEN statements. However, PROC FORMAT and the PUT function provide
another way to accomplish the task. When used with the PUT function, formats are not just for output
anymore. So my second inclination would be to create interval formats by typing in ranges and labels
manually and to use the PUT function to assign labels to time variables in the dataset. But the CNTLIN
option of PROC FORMAT allows me to define the ranges and labels programmatically, arriving at a
quicker, easier, more flexible, and more reliable solution.
REFERENCES
BASE SAS DOCUMENTATION: The Format Procedure
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000063536.htm
SAS KNOWLEDGE BASE/SAMPLES AND SAS NOTES: Sample 24914: Create candlestick plots
(http://support.sas.com/kb/24/914.html)
The text of a report on the events of May 6, 2010 issued by the U.S. Securities and Exchange Commission
(SEC) and the Commodity Futures Trading Commission (CFTC) can be found here:
http://sec.gov/news/studies/2010/marketevents-report.pdf
ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
TAQ (Trades and Quotes) data are available through the New York Stock Exchange. More information
on the database may be found at the NYSE website (http://www.nyxdata.com/Data-Products/Daily-TAQ).
Access to TAQ data is available through Wharton Research Data Services (WRDS) at the Wharton
School of the University of Pennsylvania (http://wrds-web.wharton.upenn.edu/wrds/).
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
James Zeitler
Baker Research Services
Harvard Business School
Soldiers Field
Boston, MA 02163
Work Phone: (617) 495-6837
Email: jzeitler@hbs.edu