Automating Network Diagnostics to Help End-Users
Dave Thaler
dthaler@microsoft.com

Motivation
• User needs are becoming more dependent on the Internet
• Problems can occur anywhere in the Internet
• Users have no control over these problems, and often neither do their direct providers!
• Increased support calls are NOT the answer

Poor error messages today
• User doesn’t understand the problem, only the symptom
• Error message pop-ups often aren’t helpful to users
• Event logs aren’t really any better
• Technical information is only useful to technical experts who don’t have access to end-users’ machines

Goals
• Reduce number of support calls
  – Help the user/app help itself where possible
  – Locate the correct party to contact if not
• Reduce the time spent on support calls that do occur

Focus on what the user wants!
• User doesn’t want to have to call anyone
• User doesn’t want to get email about an outage
• User doesn’t want to go to some web site to find out
• User just wants the application to work
• If it doesn’t work, user often wants to know when it will work
• If application is non-interactive, application may want to retry as soon as it is fixed (e.g. search engine)
• User also wants to know that someone is working on fixing the problem
• If user policy decision needed, user wants choices

Multiple Administrative Entities
• Policy Principles
  – Freedom of information: Outage info should be available to all those affected by it.
  – Privacy: Outage info should be available only to those affected by it.
  – Freedom of speech: Any entity should be able to report a problem, whether or not it is trusted.
  – Conservation of effort: Perform the minimum work needed to troubleshoot the problem.

Solution Framework
• Self-diagnosing, self-healing:
  – Naming: identification of a problem instance
  – Message routing: getting problem instance to capable agent
  – Methodology: structured process for confirming, diagnosing, repairing, etc.
  – Domain-specific classes/agents
• Self-improving:
  – Learning what possible causes are the most likely
  – Learning what is normal/abnormal within a class
  – Reporting on agent behavior for improvements

Architecture
[Diagram. Box labels: Application or Monitoring Tool, Client API, Engine (×2), Helper Class (×4), Component (×4), Protocol]

Separation of Roles
• Engine
  – Maintains cause-effect tree
  – Handles message routing
  – Implements core methodology
• Helper Classes
  – Implement simple API
  – Embed knowledge about one component type
  – Generate causal hypotheses that are treated just like client reports

Example TCPIP-related Helper Classes
[Diagram. Box labels, in extraction order: Socket UDP Session TCP Listener TCP Module UDP Endpoint TCP Connection IP Path UDP Module ICF Port IP Address DHCP Interface IP Module IP Route IP Neighbor Link Router IP Interface Interface]

Summary
• It’s not just about one network being self-managing
  – We need to improve the end-user experience
  – Handling multiple administrative domains is a core issue
• Structured methodology aids in problem analysis
  – Today too many things are ad hoc
• High-level methodology should be independent of component-specific knowledge
  – Provides extensibility
  – Facilitates appropriate learning at both levels
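
The separation of roles between the engine and the helper classes can be illustrated with a minimal Python sketch. It is illustrative only: the names `Problem`, `HelperClass`, `Engine`, `TableHelper`, and `demo`, the `confirm`/`hypothesize` methods, and the instance identifiers are assumptions made for this example, not the API from the talk. The toy helpers consult canned lookup tables rather than probing a real network.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol, Set, Tuple


@dataclass
class Problem:
    """A problem instance: one node in the engine's cause-effect tree."""
    component_type: str               # e.g. "TCP Connection"
    instance: str                     # hypothetical instance identifier
    confirmed: Optional[bool] = None  # None until some helper has checked it
    causes: List["Problem"] = field(default_factory=list)


class HelperClass(Protocol):
    """Assumed 'simple API': knowledge about exactly one component type."""
    component_type: str

    def confirm(self, problem: Problem) -> bool: ...
    def hypothesize(self, problem: Problem) -> List[Problem]: ...


class Engine:
    """Owns the tree, routes problems to helpers, runs the methodology."""

    def __init__(self) -> None:
        self._helpers: Dict[str, HelperClass] = {}

    def register(self, helper: HelperClass) -> None:
        self._helpers[helper.component_type] = helper

    def report(self, problem: Problem, max_depth: int = 8) -> Problem:
        # Message routing: find the helper capable of handling this type.
        helper = self._helpers.get(problem.component_type)
        if helper is None or max_depth == 0:
            return problem  # nobody can check it; stays unconfirmed
        problem.confirmed = helper.confirm(problem)
        if problem.confirmed:
            # Hypotheses re-enter through report(), like client reports.
            for cause in helper.hypothesize(problem):
                problem.causes.append(self.report(cause, max_depth - 1))
        return problem

    def root_causes(self, problem: Problem) -> List[Problem]:
        """Confirmed problems that have no confirmed cause beneath them."""
        if not problem.confirmed:
            return []
        deeper = [rc for c in problem.causes for rc in self.root_causes(c)]
        return deeper or [problem]


class TableHelper:
    """Toy helper driven by lookup tables instead of real probes."""

    def __init__(self, component_type: str, broken: Set[str],
                 suspects: Dict[str, List[Tuple[str, str]]]) -> None:
        self.component_type = component_type
        self._broken = broken      # instances this helper would confirm
        self._suspects = suspects  # instance -> [(component type, instance)]

    def confirm(self, problem: Problem) -> bool:
        return problem.instance in self._broken

    def hypothesize(self, problem: Problem) -> List[Problem]:
        return [Problem(t, i) for t, i in self._suspects.get(problem.instance, [])]


def demo() -> List[str]:
    engine = Engine()
    engine.register(TableHelper("TCP Connection", {"conn-1"},
                                {"conn-1": [("IP Path", "path-1")]}))
    engine.register(TableHelper("IP Path", {"path-1"},
                                {"path-1": [("IP Route", "route-1"),
                                            ("IP Neighbor", "nbr-1")]}))
    engine.register(TableHelper("IP Route", {"route-1"}, {}))
    engine.register(TableHelper("IP Neighbor", set(), {}))
    tree = engine.report(Problem("TCP Connection", "conn-1"))
    return [f"{p.component_type}:{p.instance}" for p in engine.root_causes(tree)]
```

With the canned data above, `demo()` traces the reported TCP connection failure through the IP path down to the route entry, while the unconfirmed IP neighbor hypothesis is pruned. The engine never inspects component details itself; adding a new component type only requires registering another helper.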