Software Testing Doesn't Scale
James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server

Overview
- The Problem:
  - S/W size & complexity inevitable
  - Short cycles reduce S/W reliability
  - S/W testing is the real issue
  - Testing doesn't scale
- Trading complexity for quality: a cluster-based solution
  - The Inktomi lesson
  - Shared-nothing cluster architecture
  - Redundant data & metadata
  - Fault isolation domains

S/W Size & Complexity Inevitable
- Successful S/W products grow large:
  - Reality of commodity, high-volume S/W: large feature sets
  - Same trend as consumer electronics
- The # of features used by any given user is small, but the union of per-user feature sets is huge
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 mloc
  - DB: ~2 mloc
  - NT: ~50 mloc
- Testing all feature interactions is impossible

Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Fast revisions are a competitive advantage:
  - Modern ERP systems can go through 6+ minor revisions/year
  - Many e-commerce sites change even faster
- Current testing and release methodology:
  - As much testing time as dev time
  - Significant additional beta-cycle time
- Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle

Testing the Real Issue
- 15 yrs ago test teams were a tiny fraction of the dev group; now test teams are of similar size to dev & growing rapidly
- Test team growth scales exponentially with system complexity
- Current test methodology is improving incrementally:
  - Random grammar-driven test case generation
  - Fault injection
  - Code path coverage tools
- Testing remains effective at feature testing, but ineffective at finding inter-feature interactions
- Only a tiny fraction of Heisenbugs are found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
- Beta testing exists because test is known to be inadequate
- Test and beta cycles are already intolerably long

The Inktomi Lesson
- Inktomi web search engine (SIGMOD'98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don't attempt to test & beta until quality is high
- System availability of paramount importance; individual node availability unimportant
- Shared-nothing cluster exploits the ability to fail individual nodes:
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
- Replace failed hardware weekly (mostly disks)
- Dark machine room: no panic midnight calls to admins
- Mask failures rather than futilely attempting to avoid them

Apply to High-Value TP Data?
- Inktomi model:
  - Exploits the ability to lose an individual node without impacting system availability
  - Accepts temporarily losing some data w/o significantly impacting query quality
  - Scales to 100s of nodes
  - S/W evolves quickly; low testing costs and no beta requirement
- Most TP systems can't lose data availability
- Redundant data allows node loss w/o losing data availability
- The Inktomi model with redundant data & metadata is a solution to the exploding test problem

Connection Model/Architecture
[Diagram: clients connecting to a server cloud of symmetric server nodes]
- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server
- All nodes SAN-connected

Compilation & Execution Model
[Diagram: client request compiled on a server thread, executed across the server cloud]
- Server thread: lex analyze, parse, normalize, optimize, code generate
- Query execution runs on many subthreads synchronized by a root thread

Node Loss/Rejoin
[Diagram: execution in progress across the server cloud]
- Lose a node mid-execution: recompile, then re-execute
- Rejoin: node-local recovery, rejoin the cluster, recover global data at the rejoining node
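To make the node-loss flow concrete, here is a minimal Python sketch of the coordination loop implied above, assuming a hypothetical ClusterCoordinator with run_query and rejoin operations (none of these names come from the actual system): the root thread compiles a plan over the currently live nodes, recompiles and re-executes if a node is lost, and lets a recovered node rejoin only after node-local and global-data recovery.

```python
class NodeFailure(Exception):
    """Raised when a participating node dies mid-query."""


class ClusterCoordinator:
    """Hypothetical root-thread coordinator; illustrative sketch only."""

    def __init__(self, nodes):
        self.live_nodes = set(nodes)

    def compile_plan(self, sql):
        # Root thread: lex analyze, parse, normalize, optimize, and code-generate
        # a plan whose fragments target the currently live nodes.
        return {"sql": sql, "nodes": sorted(self.live_nodes)}

    def execute(self, plan):
        # Subthreads on each node would run plan fragments, synchronized by the
        # root thread; a node that died since compilation surfaces as NodeFailure.
        for node in plan["nodes"]:
            if node not in self.live_nodes:
                raise NodeFailure(node)
        return f"result of {plan['sql']!r} over {len(plan['nodes'])} nodes"

    def run_query(self, sql, max_retries=3):
        # Lose node -> recompile against the survivors -> re-execute.
        for _ in range(max_retries):
            plan = self.compile_plan(sql)
            try:
                return self.execute(plan)
            except NodeFailure as dead:
                self.live_nodes.discard(dead.args[0])
        raise RuntimeError("query failed after repeated node losses")

    def rejoin(self, node):
        # Node-local recovery first, then recover global (redundant) data at the
        # rejoining node, and only then let it accept new plan fragments.
        self._node_local_recovery(node)
        self._recover_global_data(node)
        self.live_nodes.add(node)

    def _node_local_recovery(self, node):
        pass  # placeholder: replay the local log, rebuild in-memory structures

    def _recover_global_data(self, node):
        pass  # placeholder: catch up the redundant copies hosted on this node
```

Used as, for example, ClusterCoordinator(["n1", "n2", "n3"]).run_query("SELECT ..."), the loop keeps answering on the surviving nodes; the client never sees the individual node loss.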
Redundant Data Update Model
- Updates are standard parallel plans
- The optimizer knows all redundant data paths; the generated plan updates all copies
- No significant new technology: like materialized view & index updates today

Fault Isolation Domains
- Trade single-node perf for redundant data checks:
  - Many of the best redundant checks are compiled out of "retail" versions when shipped (when they are needed most)
- Fail fast rather than attempting to repair:
  - Repair attempts are fairly common…but complex error-recovery code is even more likely to be wrong than the original forward-processing code
  - Bring down the node for memory-based data structure faults
  - Never patch inconsistent data…other copies keep the system available
- If anything goes wrong, "fire" the node and continue (see the sketch after the Summary):
  - Attempt node restart
  - Auto-reinstall the O/S and DB, and recreate the DB partition
  - Mark the node "dead" for later replacement

Summary
- 100 MLOC of server-side code and growing: can't afford a 2-to-3-year dev cycle
- '60s large-system mentality still prevails: optimizing precious machine resources is a false economy
- Continuing focus on single-system perf is dead wrong:
  - Can't fight it & can't test it…quality will continue to decline if we don't do something different
  - Scalability & system perf rather than individual node performance
- Why are we still incrementally attacking an exponential problem?
- Any reasonable alternatives to clusters?
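The sketch below restates the fault-isolation escalation from the Fault Isolation Domains slide in runnable form: keep the redundant consistency checks in retail builds and fail fast when one trips, then restart the node, reimage it, or finally mark it dead. It is a minimal illustration under assumed helper names (redundant_check, try_restart, reinstall_and_recreate, mark_dead), not a description of any shipping implementation.

```python
import logging

log = logging.getLogger("fault_isolation")


def redundant_check(condition, message):
    # Keep redundant consistency checks in retail builds: fail fast rather than
    # attempting in-place repair of a suspect in-memory structure.
    if not condition:
        log.error("redundant check failed: %s", message)
        raise SystemExit(1)  # bring the node down; other copies keep the system available


def handle_failed_node(node):
    """'Fire' the node and continue; escalate until it is healthy or marked dead."""
    if try_restart(node):
        log.info("node %s restarted and rejoined", node)
        return
    if reinstall_and_recreate(node):
        log.info("node %s reimaged (O/S + DB) and its partition recreated", node)
        return
    mark_dead(node)  # leave the hardware for the weekly replacement sweep
    log.warning("node %s marked dead for later replacement", node)


# Placeholders for cluster operations this sketch assumes but does not implement.
def try_restart(node):
    return False  # reboot the node and run node-local recovery


def reinstall_and_recreate(node):
    return False  # scripted reinstall of O/S and DB, then re-replicate the partition


def mark_dead(node):
    pass  # remove the node from cluster membership
```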