Software, Hardware,
and Database Structure Options for
Research in Financial Economics
SAS and Computing Speed
Michael Boldin, WRDS, University of Pennsylvania
boldinm@wharton.upenn.edu
Main Questions
1. How can researchers take advantage of modern computing technology?
2. Which econometric software packages would you recommend to students?
3. How do SAS features and computing speed stack up?
Q1. How can researchers take advantage of modern computing technology?
Observations:
• Today’s PCs are better than yesterday’s ‘supercomputer’ (for single users).
• The system (hardware, software, and network connections) needs to work as a whole.
• Database management (DBMS) matters.
Q2. Which econometric software packages would you recommend to students?
Observations:
• Undergrad and Grad advice differs.
• Power, flexibility, and user-friendliness are not mutually exclusive.
• Almost too many choices (and change is hard).
• Few students care about good programming practice, and they keep their bad habits.
Q3. How do SAS features and computing
speed stack up?
• Is SAS fast enough in raw computing speed?
• Does the SAS Data step framework create
performance handicaps?
• How does SAS/IML stack up to MATLAB and
GAUSS in functionality?
Other issues:
• Does SAS need a better interface to C/C++ and FORTRAN modules?
• What does SAS offer as an RDBMS complement to MATLAB?
• Is greater compatibility with open-source software such as MySQL and PHP possible?
Statistical Software Evaluations
Reviewing the Reviews
Noteworthy:
Jeffrey MacKie-Mason (1992), ‘Econometric Software: A User’s View’
• Could not select an unqualified winner among: Gauss, Limdep, RATS, SAS, SST, Stata, and TSP.
• Preferred TSP. Saw advantages to SAS, but found problems in PC SAS (of 1991).
• Correctly predicted movement toward matrix algebra oriented software such as GAUSS.
John Rust (1993), ‘GAUSS and MATLAB: A Comparison’
• Highlighted the advantages of matrix oriented programming for econometrics.
• Correctly predicted that users would soon be moving away from DOS.
• Incorrectly predicted that the move would be toward UNIX workstations.
Problems:
• Most other reviews just count features.
• Or worse, they stress speed over all other issues.
• Within 2 years, a review is largely obsolete.
• After 5 years, it is likely to be completely misleading, if not irrelevant.
Speed Comparisons
(by Stefan Steinhaus)

               Speed Score               |  Overall Score
               1997    1999    2002      |  1997    1999    2002
GAUSS          49.94   47.96   47.90     |  64.38   63.64   64.80
Mathematica     7.67   31.95   31.32     |  48.76   54.93   57.34
Matlab         39.98   34.64   65.89     |  60.03   55.85   69.74
Ox             66.21   68.12   62.22     |  47.30   49.22   58.45
O-Matrix       70.80   67.29   69.80     |  48.72   43.68   45.83
S-Plus         37.18   30.51   38.56     |  54.28   44.90   48.61

Source: http://www.scientificweb.com/ncrunch/index.html
Higher scores are better. 100 is the highest possible score in each year’s evaluation.
Speed scores are not comparable across years. The overall score includes breadth of functionality
and other usability considerations, using these weights: Mathematical functions 38%, Graphical
functions 10%, Programming environment 9%, Data import/export 5%, Available operating
systems 2%, Speed comparison 36%.
In pure speed comparisons (made comparable across years), a faster PC combined with
a newer software version turns a formerly poor performer into the top performer
relative to the ‘best’ pairing of old hardware and old software.
And how about SAS?
A Helicopter View of PC Technology
1981: IBM PC = $5,500 in today’s prices
64K memory, no hard disk, monochrome
monitor, no networking capabilities
Today: Dell Pentium IV < $1000
1 GB memory, 3000x faster, 80 GB hard drive,
DVD/CD burner, flat screen color monitor, and
built-in networking.
The Speed Issue
Moore’s Law in Action

Pentium   Clock speed   Year   MWIPS   Time index
I         120 MHz       1995      79      100.0
II        266 MHz       1997     218       36.2
III       550 MHz       1999     448       17.6
IV        1.8 GHz       2001     638       12.4
IV        3.6 GHz       2003    1342        5.9
IV        3.8 GHz       2004    3899        2.0

[Chart: MWIPS (axis 0 to 4500) and Time Index (axis 0 to 120) by year, 1995 to 2004]
MWIPS = Millions of Whetstone Instructions Per Second.
A higher MWIPS score is better (i.e., a faster chip); doubling the MWIPS score cuts the time
needed for an average numerical calculation roughly in half.
Source: http://homepage.virgin.net/roy.longbottom/whetstone.htm
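The Time Index column can be reproduced from the MWIPS scores. The sketch below assumes, as the note on MWIPS implies, that time per calculation is inversely proportional to MWIPS and that the 1995 chip is the base of 100. The function and variable names are illustrative only.

```python
# MWIPS scores by year, copied from the Moore's Law table.
MWIPS_BY_YEAR = {1995: 79, 1997: 218, 1999: 448, 2001: 638, 2003: 1342, 2004: 3899}


def time_index(mwips_by_year, base_year=1995):
    """Relative time per calculation, with the base year set to 100.

    Assumption: time per calculation is proportional to 1 / MWIPS.
    """
    base = mwips_by_year[base_year]
    return {year: round(100.0 * base / score, 1)
            for year, score in mwips_by_year.items()}


for year, index in time_index(MWIPS_BY_YEAR).items():
    print(year, index)
```

Under this assumption, a calculation that took 100 time units in 1995 takes about 2 units on the 2004 chip.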
Evaluation of Statistical Software
Three categories
1. Traditional programming languages:
FORTRAN, C/C++, and Basic.
Relatively new: Perl, Python, and Java.
2. Statistical packages:
EVIEWS, SAS, STATA, and TSP.
3. Matrix algebra oriented computing software:
GAUSS, Mathematica, MATLAB, R, and S-Plus.
Speed & User Friendliness
Computation Speed:
Fortran > C > C++ > Matlab > SAS > Perl
User Friendliness:
SAS > Matlab > Perl > C++ > C > Fortran
Rankings of other languages/packages?
Java, VBasic, Stata, SPSS, S-Plus/R
Are the speed differences significant?
Are ‘user’ elements only a matter of taste?
How can user friendliness and computation speed be
combined in an evaluation?
Computing Speed
Only One Part of the Equation
Total Research Project Time
1. Planning
2. Data Management
3. Programming
4. Computation
5. Analysis of Results
6. Re-Evaluation
(revisit & repeat prior steps)
Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor)
Computation = (a0j + a1*x + a2*x^2) / (speed)
(x = complexity of the task)
Both programming and computing time depend on the complexity of the task, and the
computing speed advantage of Package 2 may overwhelm the ease-of-use issue for
modestly complex tasks.
[Charts, plotted against complexity (0 to 10) for Package 1 (slow and easy) and Package 2 (fast and hard):
User Programming (Time and Effort) Element; Computation Time Element;
Differences in Costs (Total, Programming, Computation).
Annotation: Package 2 preferred for complexity level above 6.]
Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor*2)
Computation = (a0j + a1*x + a2*x^2) / (speed*10)
An increase in computing speed (relative to ease-factor) makes Package 1 a better
choice for a larger range of tasks.
[Charts, plotted against complexity (0 to 10) for Package 1 (slow and easy) and Package 2 (fast and hard):
User Programming (Time and Effort) Element; Computation Time Element;
Differences in Costs (Total, Programming, Computation).
Annotation: Threshold for preferring Package 2 rises.]
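The tradeoff model can be sketched in a few lines of Python. All coefficients and the ease-factor and speed values for the two packages below are hypothetical, chosen only to show the mechanism; they are not the values behind the slides' charts, so the switch points will not match the "above 6" level quoted there. The sketch also assumes that the "*2" and "*10" improvements apply to both packages equally.

```python
def programming_time(x, ease, b=(2.0, 1.0, 0.1)):
    """Programming time: quadratic in complexity x, divided by the ease-factor."""
    b0, b1, b2 = b  # hypothetical coefficients
    return (b0 + b1 * x + b2 * x**2) / ease


def computation_time(x, speed, a=(0.0, 0.1, 0.5)):
    """Computation time: quadratic in complexity x, divided by the speed."""
    a0, a1, a2 = a  # hypothetical coefficients
    return (a0 + a1 * x + a2 * x**2) / speed


def total_time(x, ease, speed):
    return programming_time(x, ease) + computation_time(x, speed)


# Hypothetical packages, each described as (ease-factor, speed).
PACKAGE_1 = (2.0, 1.0)  # slow and easy
PACKAGE_2 = (1.0, 4.0)  # fast and hard


def switch_point(ease_scale=1.0, speed_scale=1.0, max_complexity=100):
    """First integer complexity level at which Package 2 has the lower total time.

    Returns None if Package 1 is never beaten within the scanned range.
    """
    for x in range(max_complexity + 1):
        t1 = total_time(x, PACKAGE_1[0] * ease_scale, PACKAGE_1[1] * speed_scale)
        t2 = total_time(x, PACKAGE_2[0] * ease_scale, PACKAGE_2[1] * speed_scale)
        if t2 < t1:
            return x
    return None


print("switch point, baseline:", switch_point())
print("switch point, ease*2 and speed*10:", switch_point(2.0, 10.0))
```

With these made-up numbers the switch point moves to a higher complexity level once speed improves faster than ease of use, which is the qualitative point of the second slide.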
Black-Scholes Calculation Speeds
* SAS code -- Black-Scholes option value calculation;
* S = spot price, X = exercise price, sigma = stock return volatility,
  r = risk-free bond rate, q = dividend rate, tau = time until maturity;
d1 = ( log(S/X) + ( r - q + 0.5*sigma*sigma ) * tau ) / ( sigma*sqrt(tau) );
d2 = d1 - sigma * sqrt(tau);
* Normal cumulative distribution function values;
N1 = cdf('normal',d1); N2 = cdf('normal',d2);
Vc = ( S * exp(-q*tau) * N1 ) - ( X * exp(-r*tau) * N2 );
1 million cases

             System A      System B
             Sun V440      Pentium 4 PC
C Program    3.0 seconds   1.5 seconds
Fortran      4.1           --
Matlab       2.4           1.4
SAS          4.6           6.7
R            --            1.9
EXCEL VBA    --            560
Perl         39.6          --
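For readers who want to try the same calculation outside SAS, here is a minimal Python sketch of the call-value formula, using only the standard library. It mirrors the SAS variable names; the helper `norm_cdf` and the sample inputs are illustrative assumptions, and no timing claims are made for it.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bs_call(S, X, r, q, sigma, tau):
    """Black-Scholes value of a European call with a continuous dividend rate q."""
    d1 = (math.log(S / X) + (r - q + 0.5 * sigma * sigma) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * math.exp(-q * tau) * norm_cdf(d1) - X * math.exp(-r * tau) * norm_cdf(d2)


# Illustrative inputs: at-the-money option, 5% rate, no dividends, 20% volatility, 1 year.
print(round(bs_call(S=100.0, X=100.0, r=0.05, q=0.0, sigma=0.2, tau=1.0), 4))
```

A speed test in the spirit of the table would call `bs_call` in a loop over one million parameter sets and time the loop.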
SAS vs. MATLAB
Computation Speed Comparison
Basic Statistics Example
Simulated Data: 1 million observations, 10 variables, in 10 groups

                 SAS            MATLAB
Data creation    3.6            1.4
Mean & std       1.6            1.4
Frequency        0.3            0.3
REG module       0.8            2.2
Sort by group    8.4            2.4
REG by group     1.1            1.4
Sum              15.8 seconds   9.1 seconds
Bottom line:
• MATLAB is almost twice as fast in relative terms (it takes 42% less time in
this example), but it is only 6.7 seconds faster in absolute terms.
• Most applications have fewer than 1 million observations,
so the absolute difference is even smaller.
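The relative and absolute differences quoted in the bottom line follow directly from the step timings. A short sketch of the arithmetic, with the timings copied from the comparison table and illustrative names:

```python
# Step timings in seconds: data creation, mean & std, frequency,
# REG module, sort by group, REG by group.
SAS_SECONDS = [3.6, 1.6, 0.3, 0.8, 8.4, 1.1]
MATLAB_SECONDS = [1.4, 1.4, 0.3, 2.2, 2.4, 1.4]


def compare(slow, fast):
    """Return (absolute seconds saved, percent less time, speed ratio)."""
    slow_total, fast_total = sum(slow), sum(fast)
    absolute = slow_total - fast_total
    percent_less_time = 100.0 * absolute / slow_total
    speed_ratio = slow_total / fast_total
    return absolute, percent_less_time, speed_ratio


absolute, percent_less_time, speed_ratio = compare(SAS_SECONDS, MATLAB_SECONDS)
print(f"{absolute:.1f} s saved, {percent_less_time:.0f}% less time, {speed_ratio:.2f}x speed")
```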
SAS vs. MATLAB
Computation Speed Comparison
Is MATLAB’s speed advantage due to its matrix based programming?
No. SAS also has an Interactive Matrix Language module (IML).
Using SAS IML shows how alternative programming methods can
matter (within the same package).
OLS Regression Example: 1 million observations, 10 variables

            B = inv(X'X)*(X'y)   REG module
SAS IML     2.6                  0.8
MATLAB      0.4                  2.2

Programming the OLS matrix algebra equation in MATLAB beats
MATLAB’s regress(.) function in terms of speed,
while the opposite is true for SAS.
Finance Research Example
CAPM (Beta) Test:
R_{i,t} = α_i + β_i R_{m,t}

500 Beta Calculations
SAS, System A (Sun V440, multi-user UNIX):   1.3 / 2.5 seconds
SAS, System B (Pentium 4 Windows PC):        1.2 seconds
MATLAB:                                      1.0
MATLAB ‘loop’ version:                       17.3
Multi-user UNIX system run time varies depending on load.
MATLAB run time varies depending on program design: optimal vectorized
code versus an inefficient loop.
A true CAPM test would estimate multi-factor betas (β_i) for 5,000 to 25,000
stocks over different sample periods. Summarizing the results requires sorting into
portfolios and applying two-stage estimation and testing techniques.
Example: SAS run = 40 minutes // MATLAB = 35 minutes
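A single-factor beta of the kind counted in the "500 Beta Calculations" test can be sketched as follows. This is a plain Python illustration of the regression of a stock's return on the market return, with made-up return series and hypothetical names; it is not the SAS or MATLAB code that produced the timings.

```python
def capm_alpha_beta(stock_returns, market_returns):
    """OLS estimates of alpha and beta in R_i = alpha + beta * R_m."""
    n = len(market_returns)
    mean_m = sum(market_returns) / n
    mean_i = sum(stock_returns) / n
    cov = sum((ri - mean_i) * (rm - mean_m)
              for ri, rm in zip(stock_returns, market_returns))
    var = sum((rm - mean_m) ** 2 for rm in market_returns)
    beta = cov / var
    alpha = mean_i - beta * mean_m
    return alpha, beta


# Made-up data: a market return series and two stocks built from it.
market = [0.01, -0.02, 0.03, 0.00, 0.015]
stocks = {
    "STOCK_A": [0.002 + 1.5 * rm for rm in market],
    "STOCK_B": [0.8 * rm for rm in market],
}

# One regression per stock; the timing test repeats this for 500 stocks.
betas = {name: capm_alpha_beta(returns, market)[1] for name, returns in stocks.items()}
print(betas)
```

The per-stock loop here corresponds to the "loop" style of program; a vectorized version would compute all betas in one matrix operation.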
Conclusions:
Changes in technology change the equation for determining
the best system; personal preferences are important.
Absolute speed (not relative speed) may matter,
but programming time is overwhelmingly the larger
component (in > 90% of the cases) anyway.
Software is not an either/or situation.
Advice: Learn and use two or more software packages as
complements.
Database management and connectivity are the key to the
greatest possible flexibility.
Almost Counterintuitive General Conclusion:
Technological progress makes human
factors and personal preferences
most important.