Automatic Misconfiguration Diagnosis with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang
Microsoft Research
OSDI 2004, San Francisco, CA

Misconfiguration Diagnosis
• Technical support contributes 17% of TCO [Tolly2000]
• Much application malfunctioning comes from misconfigurations
• Why?
  – Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications
• How about maintaining a golden configuration state?
  – Very hard [Larsson2001]
    • Complex software components and compositions
    • Third-party applications
    • …

Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Goals
• Effectiveness
  – A small set of sick configuration candidates that contains the root-cause entries
• Automation
  – No second-party involvement
  – No need to remember or identify what is healthy

Intuition behind PeerPressure
• Assumption
  – Applications function correctly on most machines; malfunctioning is the anomaly
• Succumb to the peer pressure

An Example

Suspects   Mine   P1's   P2's   P3's   P4's
e1         0      1      1      1      1
e2         on     on     on     on     off
e3         57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state
• We use Bayesian statistics to estimate the sick probability of a suspect; this is our ranking metric

System Overview
[Diagram. Labeled components: Run the faulty app, App Tracer, Canonicalizer, Search & Fetch, Database, Peer-to-Peer Troubleshooting Community, Statistical Analyzer, PeerPressure]

Registry Entry Suspects
Entry                     Data
HKLM\Software\Msft\...    On
HKLM\System\Setup\...     0
HKCU\%\Software\...       null

Troubleshooting Result
Entry                     Prob.
HKLM\Software\Msft\...    0.6
HKLM\System\Setup\...     0.2
HKCU\%\Software\...       0.003

The Sick Probability
• P(Sick) = (N + c) / (N + ct + cm(t - 1))
  – N: the number of samples
  – c: the cardinality of the entry
  – t: the number of suspects
  – m: the number of sample entries that match the suspect entry value
• Properties:
  – As m increases, P decreases
  – As c increases, P decreases; in particular, when m = 0, a smaller c implies a larger P

The PeerPressure Prototype
• Database of 87 live Windows XP registry snapshots as our sample pool
  – The Registry: hierarchical persistent storage for named, typed entries
• PeerPressure troubleshooter implemented in C#
• Needed to “sanitize” the entry values
  – 1, “1”, “#1”
  – Heuristics: unify the values of entries that have different types

Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Windows Registry Characteristics
• Max size: 333,193
• Min size: 77,517
• Average size: 198,376
• Median size: 198,608
• Cardinality: 1 for 87% of entries, <= 2 for 94%
• Distinct canonicalized entries in GeneBank: 1,476,665
• Common canonicalized entries: 43,913
• Distinct entries, data-sanitized: 1,820,706

Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond.
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes

Response Time
[Chart: response time in seconds (0 to 250) versus number of suspects]
• Number of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average with SQL Server hosted on a 2.4 GHz workstation with 1 GB RAM
• Sequential database queries dominate

Troubleshooting Effectiveness
• Metric: root-cause ranking
• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Rank = 3, 9, 12, 16 for 4 cases, respectively
  – Could not solve one case

Source of False Positives
• Nature of the root-cause entry
  – The root-cause entry has a large cardinality
• How unique the other suspects are
  – A highly customized machine likely produces more noise
• The database is not pristine

Impact of the Sample Set Size
• A larger sample set doesn’t necessarily yield better accuracy
  – Strong conformity doesn’t depend on the number of samples
  – Operational state doesn’t depend on the number of samples
  – It only helps with a non-pristine sample set
• 10 samples are enough for most cases

Related Work
• Blackbox-based techniques
  – Strider: needs to identify the healthy [Wang ‘03]
  – Hardware and software component dependencies [Brown ‘01]
• Much prior work on leveraging statistics to pinpoint anomalies
  – Bugs as deviant behavior [Engler et al SOSP ‘01]
  – Host-based intrusion detection based on system calls [Forrest ’96] and on registry behavior [Apap et al, ‘99]

Future Work
• We have only scratched the surface!
• Multiple root-cause entries
• Cross-application troubleshooting
• Database maintenance
• Privacy
  – Friends Troubleshooting Network

Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the masses to automate the manual identification of the healthy
  – Initial results are promising
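
Worked sketch: ranking the example suspects
The sick-probability formula from "The Sick Probability" slide can be applied to the three suspects on the "An Example" slide. The Python below is a minimal sketch, not the authors' C# implementation. The function names (sick_probability, rank_suspects) are hypothetical. Estimating the cardinality c as the number of distinct values seen across the peer samples plus the suspect's own value is an assumption made for illustration; the slides do not say how c is obtained.

```python
def sick_probability(n_samples, cardinality, n_suspects, n_matches):
    """P(Sick) = (N + c) / (N + c*t + c*m*(t - 1)), as given on the slide."""
    N, c, t, m = n_samples, cardinality, n_suspects, n_matches
    return (N + c) / (N + c * t + c * m * (t - 1))


def rank_suspects(mine, peers):
    """Rank suspect entries by sick probability, highest first.

    mine:  dict mapping entry name -> value on the sick machine
    peers: list of dicts with the same keys, one per sample machine
    """
    N = len(peers)   # number of samples
    t = len(mine)    # number of suspects
    ranked = []
    for entry, value in mine.items():
        peer_values = [p[entry] for p in peers]
        # m: how many samples agree with the suspect's value
        m = sum(1 for v in peer_values if v == value)
        # ASSUMPTION: cardinality = distinct values observed, including ours
        c = len(set(peer_values) | {value})
        ranked.append((entry, sick_probability(N, c, t, m)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Values taken from the "An Example" slide (Mine vs. P1..P4).
mine = {"e1": 0, "e2": "on", "e3": 57}
peers = [
    {"e1": 1, "e2": "on", "e3": 4},
    {"e1": 1, "e2": "on", "e3": 0},
    {"e1": 1, "e2": "on", "e3": 100},
    {"e1": 1, "e2": "off", "e3": 34},
]

ranked = rank_suspects(mine, peers)
for entry, prob in ranked:
    print(f"{entry}: {prob:.3f}")
```

Under these assumptions e1 ranks first (no peer shares its value and its cardinality is small), e3 second (no peer shares its value, but its large cardinality suggests an operational state), and e2 last (most peers share its value). This agrees with the qualitative answers on the example slide.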