Software Testing Doesn’t Scale
James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server

Overview

The Problem:
- S/W size & complexity inevitable
- Short cycles reduce S/W reliability
- S/W testing is the real issue
- Testing doesn’t scale
  - Trading complexity for quality

Cluster-based solution:
- The Inktomi lesson
- Shared-nothing cluster architecture
- Redundant data & metadata
- Fault isolation domains

S/W Size & Complexity Inevitable

- Successful S/W products grow large
  - Reality of commodity, high-volume S/W
  - Large feature sets
  - Same trend as consumer electronics
- # features used by a given user is small
  - But the union of per-user feature sets is huge
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 MLOC
  - DB: ~2 MLOC
  - NT: ~50 MLOC
- Testing all feature interactions impossible (see the sketch below)
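
To make the combinatorics concrete, a back-of-envelope sketch in Python (the feature counts are illustrative assumptions, not measurements of SAP, DB, or NT):

```python
from math import comb

# Illustrative feature counts (assumed for this sketch, not measured).
for features in (10, 100, 1_000, 10_000):
    pairs = comb(features, 2)     # 2-way feature interactions to test
    triples = comb(features, 3)   # 3-way interactions grow even faster
    print(f"{features:>6} features: {pairs:>14,} pairs, {triples:>22,} triples")
```

Even at 1,000 features, covering just the pairwise interactions means roughly half a million combinations; full interaction coverage is hopeless.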

Short Cycles Reduce S/W Reliability

- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year
- Many e-commerce sites change even faster
  - Fast revisions a competitive advantage
- Current testing and release methodology:
  - As much testing time as dev time
  - Significant additional beta-cycle time
- Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle

Testing the Real Issue

- 15 yrs ago test teams were a tiny fraction of the dev group
  - Now test teams are of similar size to dev & growing rapidly
  - Test team growth scales exponentially with system complexity
- Current test methodology improving incrementally:
  - Random grammar-driven test case generation (see the sketch below)
  - Fault injection
  - Code path coverage tools
- Testing remains effective at feature testing
- Ineffective at finding inter-feature interactions:
  - Only a tiny fraction of Heisenbugs found in testing
    (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
  - Beta testing because test known to be inadequate
- Test and beta cycles already intolerably long
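
A minimal sketch of random grammar-driven test case generation, the first technique above (the toy SQL-fragment grammar and its productions are assumptions for illustration, not SQL Server’s actual test grammar):

```python
import random

# Toy grammar: nonterminals map to alternative productions;
# plain strings are terminals, tuples are symbol sequences.
GRAMMAR = {
    "query": [("SELECT ", "cols", " FROM ", "table", "tail")],
    "cols":  ["*", ("col", ", ", "col")],
    "col":   ["a", "b", "c"],
    "table": ["t1", "t2"],
    "tail":  ["", (" WHERE ", "col", " = ", "value")],
    "value": ["0", "1", "NULL"],
}

def generate(symbol="query", rng=random):
    """Expand a symbol by recursively picking random productions."""
    if symbol not in GRAMMAR:                 # terminal: emit as-is
        return symbol
    choice = rng.choice(GRAMMAR[symbol])
    if isinstance(choice, tuple):             # sequence of symbols
        return "".join(generate(s, rng) for s in choice)
    return generate(choice, rng)

for _ in range(5):
    print(generate())   # each line is a random, grammar-valid test query
```

Because every generated case is syntactically valid, the generator exercises deep engine paths rather than only the parser’s error handling.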

The Inktomi Lesson

- Inktomi web search engine (SIGMOD’98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don’t attempt to test & beta until quality is high
- System availability of paramount importance:
  - Individual node availability unimportant
  - Shared-nothing cluster
- Exploit the ability to fail individual nodes (see the sketch below):
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
  - Replace failed hardware weekly (mostly disks)
- Dark machine room:
  - No panic midnight calls to admins
  - Mask failures rather than futilely attempting to avoid them
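
A minimal sketch of that restart discipline, assuming each node runs as a supervised worker process (`do_work`, the failure rate, and the reboot interval are all illustrative assumptions, not Inktomi’s code):

```python
import random
import time
import multiprocessing as mp

MAX_LIFETIME_S = 3600   # assumed: periodic reboot bounds memory-leak growth

def do_work(node_id):
    """Stub for one serving step; a real node would answer queries here."""
    time.sleep(0.1)
    return random.random() > 0.001    # redundant checks occasionally fail

def run_node(node_id):
    while True:
        if not do_work(node_id):
            # Fail fast: crash the process instead of trying to repair.
            raise RuntimeError("redundant check failed")

def supervise(node_id):
    """Restart the node on any exit: a crash or a planned reboot."""
    while True:
        p = mp.Process(target=run_node, args=(node_id,))
        p.start()
        p.join(timeout=MAX_LIFETIME_S)   # planned reboot after max lifetime
        if p.is_alive():
            p.terminate()                # automatic reboot avoids leaks
        p.join()
        time.sleep(1)                    # then automatic restart

if __name__ == "__main__":
    supervise(node_id=0)
```

No administrator is in the loop: the machine room stays dark because restart is the normal recovery path, not an emergency.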

Apply to High Value TP Data?

- Inktomi model:
  - Scales to 100’s of nodes
  - S/W evolves quickly
  - Low testing costs and no beta requirement
  - Exploits the ability to lose an individual node without impacting system availability
  - Ability to temporarily lose some data w/o significantly impacting query quality
- Can’t lose data availability in most TP systems:
  - Redundant data allows node loss w/o losing data availability
- The Inktomi model with redundant data & metadata is a solution to the exploding test problem

Connection Model/Architecture

[Diagram: a client connected to one server node within the server cloud]

- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server (see the sketch below)
- All nodes SAN-connected
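
A minimal sketch of “any client connects to any server”, assuming a client-side library that treats every node as equivalent (the node list, port, and retry policy are illustrative assumptions):

```python
import random
import socket

# Symmetric server nodes: any one can serve any request (assumed addresses).
NODES = [("node1.example", 1433),
         ("node2.example", 1433),
         ("node3.example", 1433)]

def connect(nodes=NODES, attempts=6, timeout_s=2.0):
    """Try nodes in random order; any reachable node is as good as another."""
    shuffled = random.sample(nodes, k=len(nodes))
    for host, port in (shuffled * attempts)[:attempts]:
        try:
            return socket.create_connection((host, port), timeout=timeout_s)
        except OSError:
            continue        # node may be down or restarting: try another
    raise ConnectionError("no server node reachable")
```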

Compilation & Execution Model

[Diagram: client submits a query to one server thread; execution fans out across the server cloud]

- Server thread: lex analyze → parse → normalize → optimize → code generate → query execute
- Query execution on many subthreads synchronized by the root thread (see the sketch below)
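
A minimal sketch of that compile-then-fan-out shape (the stage functions and the thread pool are placeholders; the engine’s phases share these names but not this code):

```python
from concurrent.futures import ThreadPoolExecutor

# Compilation pipeline on the server thread; each stage is a placeholder.
def lex(text):      return text.split()
def parse(tokens):  return {"tree": tokens}
def normalize(t):   return t
def optimize(t):    return {"plan": t["tree"], "fragments": 4}
def codegen(plan):  return plan

def compile_query(text):
    return codegen(optimize(normalize(parse(lex(text)))))

def run_fragment(plan, i):
    """Placeholder for one subthread's share of query execution."""
    return f"fragment {i} of {plan['plan']!r}"

def execute(text):
    plan = compile_query(text)
    # The root thread fans out to subthreads, then synchronizes on results.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_fragment, plan, i)
                   for i in range(plan["fragments"])]
        return [f.result() for f in futures]

print(execute("SELECT a FROM t1"))
```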

Node Loss/Rejoin

[Diagram: server cloud during execution; one node is lost, then later rejoins]

- Execution in progress
- Lose a node: recompile, then re-execute (see the sketch below)
- Rejoin:
  - Node local recovery
  - Rejoin cluster
  - Recover global data at the rejoining node
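
A minimal sketch of the lose-node path, assuming the root thread observes a node failure as an exception and that redundant data lets the optimizer recompile against the survivors (`NodeLost`, `Cluster`, and the stubs are hypothetical names for illustration):

```python
import random

class NodeLost(Exception):
    """Raised when a node participating in the query fails mid-execution."""

class Cluster:
    def __init__(self, n):
        self.nodes = list(range(n))
    def live_nodes(self):
        return list(self.nodes)

def compile_for(query, nodes):
    # Stub: the optimizer would pick among redundant data copies here.
    return {"query": query, "nodes": nodes}

def run(plan):
    # Stub execution: occasionally a node "fails" mid-query.
    if random.random() < 0.3:
        raise NodeLost(random.choice(plan["nodes"]))
    return f"result of {plan['query']!r} on {len(plan['nodes'])} nodes"

def execute_with_retry(query, cluster, max_retries=5):
    for _ in range(max_retries):
        plan = compile_for(query, cluster.live_nodes())
        try:
            return run(plan)        # root thread drives the subthreads
        except NodeLost:
            continue                # recompile against survivors, re-execute
    raise RuntimeError("query failed after repeated node losses")

print(execute_with_retry("SELECT a FROM t1", Cluster(8)))
```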

Redundant Data Update Model

[Diagram: a client update fanning out to redundant copies across the server cloud]

- Updates are standard parallel plans
- Optimizer knows all redundant data paths
- Generated plan updates all copies (see the sketch below)
- No significant new technology:
  - Like materialized view & index updates today
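
A minimal sketch of that expansion, assuming a catalog that maps each logical table to its redundant copies (the catalog shape and plan representation are illustrative only):

```python
# Assumed catalog: logical object -> every redundant physical copy,
# including derived copies (as with materialized views & indexes today).
CATALOG = {
    "orders": ["orders@node1", "orders@node4", "orders_by_date@node2"],
}

def plan_update(table, assignment, predicate):
    """Expand one logical update into a parallel plan over all copies."""
    branches = [
        {"target": copy, "set": assignment, "where": predicate}
        for copy in CATALOG[table]     # optimizer knows every data path
    ]
    return {"op": "parallel-update", "branches": branches}

print(plan_update("orders", {"status": "shipped"}, "id = 42"))
```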

Fault Isolation Domains

- Trade single-node perf for redundant data checks:
  - Fairly common… but complex error-recovery code is even more likely to be wrong than the original forward-processing code
  - Many of the best redundant checks are compiled out of “retail versions” when shipped (when needed most)
- Fail fast rather than attempting to repair (see the sketch below):
  - Bring down the node for memory-based data structure faults
  - Never patch inconsistent data… other copies keep the system available
- If anything goes wrong, “fire” the node and continue:
  - Attempt node restart
  - Auto-reinstall O/S, DB and recreate the DB partition
  - Mark node “dead” for later replacement
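
A minimal sketch of a redundant check that stays enabled in shipped builds and fails fast rather than repairing (`node_check` and the toy data structure are assumptions for illustration):

```python
import os
import sys

def node_check(condition, message):
    """Redundant check kept in retail builds: never repair, fire the node."""
    if not condition:
        sys.stderr.write(f"FATAL: {message}; failing fast\n")
        # Bring the whole node down; the supervisor restarts or replaces it,
        # and other data copies keep the system available meanwhile.
        os._exit(1)

class FreeList:
    """Toy in-memory structure carrying cheap redundant bookkeeping."""
    def __init__(self):
        self.items = []
        self.count = 0

    def push(self, x):
        self.items.append(x)
        self.count += 1
        # Cross-check the invariant on every mutation; corruption is
        # detected immediately instead of being patched around.
        node_check(self.count == len(self.items), "free list count mismatch")
```

Unlike `assert`, this check survives `python -O`, mirroring the point above: the checks must not be compiled out of the shipped build.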

Summary

- 100 MLOC of server-side code and growing:
  - Can’t fight it & can’t test it…
  - Quality will continue to decline if we don’t do something different
  - Can’t afford a 2- to 3-year dev cycle
- 60’s large-system mentality still prevails:
  - Optimizing precious machine resources is a false economy
- Continuing focus on single-system perf is dead wrong:
  - Scalability & system perf rather than individual-node performance
- Why are we still incrementally attacking an exponential problem?
- Any reasonable alternatives to clusters?