Software, Hardware, and Database Structure Options for Research in Financial Economics

SAS and Computing Speed

Michael Boldin, WRDS, University of Pennsylvania
boldinm@wharton.upenn.edu

Main Questions
1. How can researchers take advantage of modern computing technology?
2. Which econometric software packages would you recommend to students?
3. How do SAS features and computing speed stack up?

Q1. How can researchers take advantage of modern computing technology?
Observations:
• Today’s PCs are better than yesterday’s ‘supercomputer’ (for single users).
• The system (hardware, software, and network connections) needs to work as a whole.
• Database management (DBMS) matters.

Q2. Which econometric software packages would you recommend to students?
Observations:
• Undergrad and Grad advice differs.
• Power, flexibility and user-friendly elements are not mutually exclusive.
• Almost too many choices (and change is hard).
• Few students care about good programming practice and they keep bad habits.

Q3. How do SAS features and computing speed stack up?
• Is SAS fast enough in raw computing speed?
• Does the SAS Data step framework create performance handicaps?
• How does SAS/IML stack up to MATLAB and GAUSS in functionality?
Other issues:
• Does SAS need a better interface to C/C++ and FORTRAN modules?
• What does SAS offer as an RDBMS complement to MATLAB?
• Is greater compatibility with open-source software such as MySQL and PHP possible?

Statistical Software Evaluations: Reviewing the Reviews
Noteworthy:
Jeffrey MacKie-Mason (1992), ‘Econometric Software: A User’s View’
• Could not select an unqualified winner among: Gauss, Limdep, RATS, SAS, SST, Stata, and TSP. Preferred TSP.
• Saw advantages to SAS, but found problems in PC SAS (of 1991).
• Correctly predicted movement toward matrix algebra oriented software such as GAUSS.
John Rust (1993), ‘GAUSS and MATLAB: A Comparison’
• Highlighted the advantages of matrix oriented programming for econometrics.
• Correctly predicted that users would soon be moving away from DOS.
• Incorrectly predicted that the move would be toward UNIX workstations.

Problems:
• Most other reviews just count features.
• Or worse, stress speed over all other issues.
• Within 2 years, the review is largely obsolete.
• After 5 years, it is likely to be completely misleading if not irrelevant.

Speed Comparisons (by Stefan Steinhaus)

               Speed Score              |  Overall Score
               1997    1999    2002     |  1997    1999    2002
GAUSS          49.94   47.96   47.90    |  64.38   63.64   64.80
Mathematica     7.67   31.95   31.32    |  48.76   54.93   57.34
Matlab         39.98   34.64   65.89    |  60.03   55.85   69.74
Ox             66.21   68.12   62.22    |  47.30   49.22   58.45
O-Matrix       70.80   67.29   69.80    |  48.72   43.68   45.83
S-Plus         37.18   30.51   38.56    |  54.28   44.90   48.61

Source: http://www.scientificweb.com/ncrunch/index.html
Higher scores are better. 100 is the highest possible score in each year’s evaluation. Speed scores are not comparable across years.
Overall score includes breadth of functionality and other usability considerations, using these weights: Mathematical functions 38%, Graphical functions 10%, Programming environment 9%, Data import/export 5%, Available operating systems 2%, Speed comparison 36%.

In pure speed comparisons (made comparable across years), a faster PC paired with a newer software vintage turns a poor performer into the top performer relative to the ‘best’ old-technology pair.

And how about SAS?

A Helicopter View of PC Technology
1981: IBM PC = $5,500 in today’s prices
64K memory, no hard disk, monochrome monitor, no networking capabilities
Today: Dell Pentium IV < $1000
1G memory, 3000x faster, 80 Gig hard drive, DVD/CD burner, flat screen color monitor, and built-in networking.
The Speed Issue: Moore’s Law in Action

Pentium   Clock speed   Year   MWIPS   Time index (1995 = 100)
I         120 MHz       1995      79   100.0
II        266 MHz       1997     218    36.2
III       550 MHz       1999     448    17.6
IV        1.8 GHz       2001     638    12.4
IV        3.6 GHz       2003    1342     5.9
IV        3.8 GHz       2004    3899     2.0

[Chart: MWIPS and Time Index by year, 1995 to 2004]

MWIPS = Millions of Whetstone Instructions Per Second. A higher MWIPS score is better (i.e. a faster chip), and doubling MWIPS cuts the time for an average numerical calculation roughly in half.
Source: http://homepage.virgin.net/roy.longbottom/whetstone.htm

Evaluation of Statistical Software
Three categories
1. Traditional programming languages: FORTRAN, C/C++, and Basic. Relatively new: Perl, Python, and Java.
2. Statistical packages: EViews, SAS, Stata, and TSP.
3. Matrix algebra oriented computing software: GAUSS, Mathematica, MATLAB, R and S-Plus.

Speed & User Friendliness
Computation Speed: Fortran > C > C++ > Matlab > SAS > Perl
User Friendliness: SAS > Matlab > Perl > C++ > C > Fortran
Rankings of other languages/packages? Java, VBasic, Stata, SPSS, S-Plus/R
• Are the speed differences significant?
• Are ‘user’ elements only a matter of taste?
• How can user friendliness and computation speed be combined in an evaluation?

Computing Speed: Only One Part of the Equation
Total Research Project Time
1. Planning
2. Data Management
3. Programming
4. Computation
5. Analysis of Results
6. Re-Evaluation (revisit & repeat prior steps)

Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor)
Computation = (a0j + a1*x + a2*x^2) / (speed)
Here x is the complexity of the task. Both programming and computing time depend on it, and the computing speed advantage of Package 2 may overwhelm the ease-of-use issue for moderately complex tasks.
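The tradeoff model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficient values, the (ease, speed) pairs for each package, and the helper names `total_time`, `threshold`, and `scaled` are all hypothetical and do not come from the slides, and the package-specific intercept a0j is treated as a single common value for simplicity.

```python
# Hypothetical coefficients for the quadratic time model; not taken from the slides.
B = (2.0, 1.0, 0.1)   # programming: b0, b1, b2
A = (0.0, 0.5, 0.5)   # computation: a0, a1, a2 (a0 assumed equal across packages)

# Each package is an (ease-factor, speed) pair; values are illustrative only.
PACKAGE_1 = (2.0, 1.0)   # slow and easy
PACKAGE_2 = (1.0, 4.0)   # fast and hard


def total_time(x, ease, speed):
    """Programming time plus computation time at complexity level x."""
    programming = (B[0] + B[1] * x + B[2] * x ** 2) / ease
    computation = (A[0] + A[1] * x + A[2] * x ** 2) / speed
    return programming + computation


def threshold(pkg1, pkg2, max_complexity=100):
    """First integer complexity at which pkg2 has the lower total time, else None."""
    for x in range(max_complexity + 1):
        if total_time(x, *pkg2) < total_time(x, *pkg1):
            return x
    return None


def scaled(pkg, ease_mult, speed_mult):
    """Apply a technology shift: multiply the ease-factor and speed of a package."""
    return (pkg[0] * ease_mult, pkg[1] * speed_mult)


base = threshold(PACKAGE_1, PACKAGE_2)
# Same shift as in the second version of the model: ease-factor*2 and speed*10.
faster = threshold(scaled(PACKAGE_1, 2, 10), scaled(PACKAGE_2, 2, 10))
print("threshold before:", base, "after:", faster)
```

With these made-up numbers the threshold for preferring Package 2 rises from complexity 2 to complexity 20 once speed grows faster than ease of use. The specific levels depend entirely on the assumed coefficients; only the direction of the shift reflects the argument in the slides.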
[Figure: three panels plotted against task complexity (0 to 10) for Package 1 (slow and easy) and Package 2 (fast and hard): User Programming (Time and Effort) Element; Computation Time Element; Differences in Costs (Total, Programming, Computation). Annotation: Package 2 preferred for complexity level above 6.]

Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor*2)
Computation = (a0j + a1*x + a2*x^2) / (speed*10)

[Figure: the same three panels after the change in ease-factor and speed. Annotation: Threshold for preferring Package 2 rises.]

An increase in computing speed (relative to ease-factor) makes Package 1 a better choice for a larger range of tasks.

Black-Scholes Calculation Speeds

* SAS code: Black-Scholes option value calculation;
* S = spot price, X = exercise price, sigma = stock return volatility;
* r = risk-free bond rate, q = dividend rate, tau = time till maturity;
d1 = ( log(S/X) + ( r - q + 0.5*sigma*sigma ) * tau ) / ( sigma*sqrt(tau) );
d2 = d1 - sigma * sqrt(tau);
* Normal curve cumulative distribution function values;
N1 = cdf('normal', d1);
N2 = cdf('normal', d2);
Vc = ( S * exp(-q*tau) * N1 ) - ( X * exp(-r*tau) * N2 );

Run time in seconds, 1 million cases
              System A (Sun V440)   System B (Pentium 4 PC)
C program      3.0                   1.5
Fortran        4.1                   --
Matlab         2.4                   1.4
SAS            4.6                   6.7
R              --                    1.9
Excel VBA      --                    560
Perl          39.6                   --

SAS vs. MATLAB Computation Speed Comparison
Basic Statistics Example
Simulated data: 1 million observations, 10 variables, in 10 groups

Run time in seconds
                  SAS    MATLAB
Data creation     3.6    1.4
Mean & std        1.6    1.4
Frequency         0.3    0.3
REG module        0.8    2.2
Sort by group     8.4    2.4
REG by group      1.1    1.4
Sum              15.8    9.1

Bottom line:
• MATLAB is almost twice as fast in relative terms (it takes 42% less time in this example), but only 6.7 seconds faster in absolute terms.
• Most applications have fewer than 1 million observations, so the absolute difference is even smaller.

SAS vs. MATLAB Computation Speed Comparison
Is MATLAB’s speed advantage due to its matrix based programming? No. SAS also has an Interactive Matrix Language module (IML). Using SAS IML shows how alternative programming methods can matter (within the same package).

OLS Regression Example: 1 million observations, 10 variables

Run time in seconds
            B = inv(X’X)*(X’y)   REG module
SAS IML     2.6                  0.8
MATLAB      0.4                  2.2

Programming the OLS matrix algebra equation in MATLAB beats MATLAB’s regress(.) function in terms of speed, while the opposite is true for SAS.

Finance Research Example
CAPM (Beta) Test: R_{i,t} = α_i + β_i R_{m,t}

500 Beta Calculations
SAS, System A (Sun V440, multi-user UNIX):   1.3 / 2.5 seconds
SAS, System B (Pentium 4 Windows PC):        1.2 seconds
MATLAB:                                      1.0
MATLAB ‘loop’ version:                       17.3

Multi-user UNIX system run time varies depending on load. MATLAB run time varies depending on program design: optimal vectorized code versus an inefficient loop.

A true CAPM test would estimate multi-factor betas (β_i) for 5,000 to 25,000 stocks over different sample periods. Summarizations require sorting into portfolios and applying two-stage estimation and testing techniques.
Example: SAS run = 40 minutes // MATLAB = 35 minutes

Conclusions:
• Changes in technology change the equation for determining the best system, and personal preferences are important.
• Absolute speed (not relative speed) may matter, but programming time is overwhelmingly the larger component (in > 90% of the cases) anyway.
• Software is not an either/or situation.

Advice:
• Learn and use two or more software packages as complements.
• Database management and connectivity are the key to the greatest possible flexibility.

Almost Counterintuitive General Conclusion: Technological progress makes human factors and personal preferences most important.
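In the spirit of using packages as complements, the Black-Scholes calculation timed in the speed comparison can also be written outside SAS. The following is a minimal Python sketch, not the code used for the reported timings: the function names `norm_cdf`, `black_scholes_call`, and `price_many` are assumptions made for illustration, and the normal CDF is computed from the standard library's `math.erf` in place of SAS's `cdf('normal', .)`.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def black_scholes_call(S, X, r, q, sigma, tau):
    """Call value using the same symbols as the SAS code.

    S = spot price, X = exercise price, r = risk-free rate,
    q = dividend rate, sigma = return volatility, tau = time till maturity.
    """
    d1 = (math.log(S / X) + (r - q + 0.5 * sigma * sigma) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * math.exp(-q * tau) * norm_cdf(d1) - X * math.exp(-r * tau) * norm_cdf(d2)


def price_many(cases):
    """Price a list of (S, X, r, q, sigma, tau) tuples one case at a time."""
    return [black_scholes_call(*case) for case in cases]


# Hypothetical inputs chosen only to exercise the function.
example = (100.0, 100.0, 0.05, 0.0, 0.2, 1.0)
print(round(black_scholes_call(*example), 4))
```

The loop in `price_many` is deliberately plain. As the MATLAB ‘loop’ versus vectorized timings suggest, how the cases are iterated can matter as much as which package is used.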