Automatic Misconfiguration Troubleshooting with PeerPressure

Automatic Misconfiguration Diagnosis with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang
Microsoft Research
OSDI 2004, San Francisco, CA

Misconfiguration Diagnosis
• Technical support contributes 17% of total cost of ownership (TCO) [Tolly2000]
• Much application malfunctioning comes from misconfigurations
• Why?
  – Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications
• How about maintaining the golden config state?
  – Very hard [Larsson2001]
    • Complex software components and compositions
    • Third-party applications
    • …

Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Goals
• Effectiveness
  – A small set of sick configuration candidates that contains the root-cause entries
• Automation
  – No second-party involvement
  – No need to remember or identify what is healthy

Intuition behind PeerPressure
• Assumption
  – Applications function correctly on most machines; malfunctioning is the anomaly
• Succumb to the peer pressure

An Example

Suspects   Mine   P1's   P2's   P3's   P4's
e1         0      1      1      1      1
e2         on     on     on     on     off
e3         57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state
• We use Bayesian statistics to estimate the sick probability of a suspect, which is our ranking metric

System Overview
[Diagram: components of the PeerPressure troubleshooter]
• App Tracer: run the faulty app to capture the registry entry suspects
• Canonicalizer
• Search & Fetch, against the Database or the Peer-to-Peer Troubleshooting Community
• Statistical Analyzer

Registry Entry Suspects
Entry                     Data
HKLM\Software\Msft\...    On
HKLM\System\Setup\...     0
HKCU\%\Software\...       null

Troubleshooting Result
Entry                     Prob.
HKLM\Software\Msft\...    0.6
HKLM\System\Setup\...     0.2
HKCU\%\Software\...       0.003

The Sick Probability
• P(Sick) = (N + c) / (N + ct + cm(t - 1))
  – N: the number of samples
  – c: the cardinality of the entry
  – t: the number of suspects
  – m: the number of sample entries that match the suspect entry's value
• Properties:
  – As m increases, P decreases
  – As c increases, P decreases; when m = 0, a smaller c implies a larger P

The PeerPressure Prototype
• Database of 87 live Windows XP registry snapshots as our sample pool
  – The Registry: hierarchical persistent storage for named, typed entries
• PeerPressure troubleshooter implemented in C#
• Needed to "sanitize" the entry values
  – e.g., 1, "1", and "#1"
  – Heuristics: unifying values of entries with different types

Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Windows Registry Characteristics
• Max size: 333,193
• Min size: 77,517
• Average size: 198,376
• Median size: 198,608
• Cardinality: 1 for 87% of entries, <= 2 for 94%
• Distinct canonicalized entries in GeneBank: 1,476,665
• Common canonicalized entries: 43,913
• Distinct entries data-sanitized: 1,820,706

Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes

Response Time
[Figure: response time in seconds (y-axis, 0 to 250) versus number of suspects (x-axis)]
• Number of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average, with SQL Server hosted on a 2.4 GHz CPU workstation with 1 GB RAM
• Sequential database queries dominate

Troubleshooting Effectiveness
• Metric: root-cause ranking
• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Rank = 3, 9, 12, 16 for 4 cases, respectively
  – Cannot solve one case

Source of False Positives
• Nature of the root-cause entry
  – The root-cause entry has a large cardinality
• How unique the other suspects are
  – A highly customized machine likely produces more noise
• The database is not pristine

Impact of the Sample Set Size
• A larger sample set doesn't necessarily yield better accuracy
  – Strong conformity doesn't depend on the number of samples
  – Operational state doesn't depend on the number of samples
  – Only helps with a non-pristine sample set
• 10 samples are enough for most cases

Related Work
• Blackbox-based techniques
  – Strider: needs to identify the healthy [Wang '03]
  – Hardware, software component dependencies [Brown '01]
• Much prior work on leveraging statistics to pinpoint anomalies
  – Bugs as deviant behavior [Engler et al., SOSP '01]
  – Host-based intrusion detection based on system calls [Forrest '96] and based on registry behavior [Apap et al., '99]

Future Work
• Only scratched the surface!
• Multiple root-cause entries
• Cross-application troubleshooting
• Database maintenance
• Privacy
  – Friends Troubleshooting Network

Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate manual identification of the healthy
  – Initial results are promising
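
To make the ranking metric concrete, the sketch below applies the sick-probability formula from the slides to the three-suspect example table (one sick machine, four peer snapshots). It is an illustration only, not the authors' C# implementation: the function names `sick_probability` and `rank_suspects` are hypothetical, values are treated as already-sanitized strings, and the cardinality c is assumed to be the number of distinct values observed across the peers plus the suspect's own value, which the slides do not specify.

```python
def sick_probability(n, c, t, m):
    """P(Sick) = (N + c) / (N + c*t + c*m*(t - 1)), as stated on the slides.

    n: number of samples, c: cardinality of the entry,
    t: number of suspects, m: number of samples matching the suspect's value.
    """
    return (n + c) / (n + c * t + c * m * (t - 1))


def rank_suspects(suspects, samples):
    """Return (entry, probability) pairs sorted by descending sick probability.

    suspects: dict mapping entry name -> value on the sick machine.
    samples:  list of dicts (one per peer snapshot) with the same entry names.
    """
    n = len(samples)
    t = len(suspects)
    ranked = []
    for entry, value in suspects.items():
        peer_values = [sample.get(entry) for sample in samples]
        # Assumption: estimate cardinality from the distinct values seen,
        # counting the suspect's own value as well.
        c = len(set(peer_values) | {value})
        # m counts the peers whose value equals the suspect's value.
        m = sum(1 for v in peer_values if v == value)
        ranked.append((entry, sick_probability(n, c, t, m)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Values from the example table: "Mine" versus peers P1..P4.
mine = {"e1": "0", "e2": "on", "e3": "57"}
peers = [
    {"e1": "1", "e2": "on", "e3": "4"},
    {"e1": "1", "e2": "on", "e3": "0"},
    {"e1": "1", "e2": "on", "e3": "100"},
    {"e1": "1", "e2": "off", "e3": "34"},
]

ranking = rank_suspects(mine, peers)
for entry, p in ranking:
    print(f"{entry}: {p:.3f}")
```

Under these assumptions the first suspect (e1 in the table) ranks highest at about 0.60, the high-cardinality entry e3 follows at about 0.47, and e2, which matches three of the four peers, comes last at about 0.27. This agrees with the qualitative reading on the example slide: a value that conflicts with unanimous peers is the most suspicious, while an entry whose value differs on every machine looks like operational state.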
