Automating Network Diagnostics
to Help End-Users
Dave Thaler
dthaler@microsoft.com
Motivation
• User needs are becoming more dependent on the Internet
• Problems can occur anywhere in the Internet
• Users have no control over these problems, and often neither do their direct providers!
• Increased support calls are NOT the answer
Poor error messages today
• User doesn’t understand the problem, only the symptom
• Error message pop-ups often aren’t helpful to users
• Event logs aren’t really any better
• Technical information is only useful to technical experts who don’t have access to end-users’ machines
Goals
• Reduce number of support calls
– Help the user/app help itself where possible
– Locate the correct party to contact if not
• Reduce the time spent on support calls that do occur
Focus on what the user wants!
• User doesn’t want to have to call anyone
• User doesn’t want to get email about an outage
• User doesn’t want to go to some web site to find out
• User just wants the application to work
• If it doesn’t work, user often wants to know when it will work
• If application is non-interactive, application may want to retry as soon as it is fixed (e.g. search engine)
• User also wants to know that someone is working on fixing the problem
• If user policy decision needed, user wants choices
Multiple Administrative Entities
• Policy Principles
– Freedom of information: Outage info should be available to all those affected by it.
– Privacy: Outage info should be available only to those affected by it.
– Freedom of speech: Any entity should be able to report a problem, whether or not it is trusted.
– Conservation of effort: Perform the minimum work needed to troubleshoot the problem.
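The first three principles can be read as simple access rules. The following is a minimal sketch in Python, not part of the original design: the names `Outage`, `Report`, `may_view`, and `accept_report` are hypothetical, and entities are assumed to be identified by plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    description: str
    affected: frozenset  # identifiers of the entities affected by the outage


@dataclass(frozen=True)
class Report:
    reporter: str
    description: str
    trusted: bool  # recorded for later weighting, never used to reject a report


def may_view(outage, entity):
    # Freedom of information plus privacy: visible to all affected entities,
    # and only to them.
    return entity in outage.affected


def accept_report(reporter, description, trusted_reporters):
    # Freedom of speech: every report is accepted; trust is only noted.
    return Report(reporter, description, reporter in trusted_reporters)


outage = Outage("upstream link down", frozenset({"alice", "isp-a"}))
report = accept_report("mallory", "cannot reach example.com", {"isp-a"})
print(may_view(outage, "alice"), may_view(outage, "bob"), report.trusted)
```

Taken together, the first two principles reduce to a single membership test: outage information is visible exactly to the affected set.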
Solution Framework
• Self-diagnosing, self-healing:
– Naming: identification of a problem instance
– Message routing: getting problem instance to capable agent
– Methodology: structured process for confirming, diagnosing, repairing, etc.
– Domain-specific classes/agents
• Self-improving:
– Learning what possible causes are the most likely
– Learning what is normal/abnormal within a class
– Reporting on agent behavior for improvements
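One way to picture "learning what possible causes are the most likely" is a frequency table of previously confirmed causes. The sketch below is an illustration only: the class `CauseStats`, its methods, and the symptom and cause strings are hypothetical, and simple counting stands in for whatever learning method the real system uses.

```python
from collections import Counter, defaultdict


class CauseStats:
    """Tracks how often each candidate cause was confirmed for a symptom."""

    def __init__(self):
        # symptom -> Counter of confirmed causes
        self._confirmed = defaultdict(Counter)

    def record(self, symptom, cause):
        """Record that `cause` was confirmed as the root of `symptom`."""
        self._confirmed[symptom][cause] += 1

    def rank(self, symptom, candidates):
        """Order candidates by past confirmations, most frequent first."""
        counts = self._confirmed[symptom]
        # sorted() is stable, so causes never seen before keep their given order.
        return sorted(candidates, key=lambda cause: -counts[cause])


stats = CauseStats()
stats.record("name lookup failed", "dns server unreachable")
stats.record("name lookup failed", "dns server unreachable")
stats.record("name lookup failed", "no ip address")
ranked = stats.rank(
    "name lookup failed",
    ["link down", "no ip address", "dns server unreachable"],
)
print(ranked)
# ['dns server unreachable', 'no ip address', 'link down']
```

Checking the most frequently confirmed cause first also fits the conservation-of-effort principle, since the common case is resolved with the fewest tests.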
Architecture
[Figure: an Application or Monitoring Tool calls the Client API of an Engine. Each Engine works with several Helper Classes, each tied to a Component, and two Engines are linked by a Protocol.]
Separation of Roles
• Engine
– Maintains cause-effect tree
– Handles message routing
– Implements core methodology
• Helper Classes
– Implement simple API
– Embed knowledge about one component type
– Generate causal hypotheses that are treated just like client reports
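The split between engine and helper classes can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual API: `Problem`, `Engine.register`, `Engine.report`, the `hypothesize` method, and the two example helpers are all hypothetical names, and the dependency chain from TCP connection to IP path to IP route is assumed for the example. The sketch also omits confirmation of hypotheses and protection against cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """A problem instance; `component_type` decides which helper sees it."""
    component_type: str
    description: str
    causes: list = field(default_factory=list)  # children in the cause-effect tree


class Engine:
    def __init__(self):
        self._helpers = {}  # component type -> helper object

    def register(self, component_type, helper):
        self._helpers[component_type] = helper

    def report(self, problem):
        """Single entry point for client reports and helper hypotheses alike."""
        helper = self._helpers.get(problem.component_type)
        if helper is None:
            return problem  # no helper knows this component type
        for hypothesis in helper.hypothesize(problem):
            # A hypothesis is routed exactly like a client report.
            problem.causes.append(self.report(hypothesis))
        return problem


class TcpConnectionHelper:
    def hypothesize(self, problem):
        return [Problem("ip_path", "no path to remote address")]


class IpPathHelper:
    def hypothesize(self, problem):
        return [Problem("ip_route", "no matching route")]


def render(problem, depth=0):
    """Return the cause-effect tree as indented lines."""
    lines = ["  " * depth + f"{problem.component_type}: {problem.description}"]
    for cause in problem.causes:
        lines.extend(render(cause, depth + 1))
    return lines


engine = Engine()
engine.register("tcp_connection", TcpConnectionHelper())
engine.register("ip_path", IpPathHelper())
root = engine.report(Problem("tcp_connection", "connect timed out"))
print("\n".join(render(root)))
```

The engine holds no component knowledge of its own here: it only routes problems and grows the tree, so supporting a new component type means registering one more helper.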
Example TCP/IP-related Helper Classes
[Figure: diagram of helper classes: Socket, UDP Session, TCP Listener, TCP Module, UDP Endpoint, TCP Connection, IP Path, UDP Module, ICF Port, IP Address, DHCP Interface, IP Module, IP Route, IP Neighbor, Link, Router, IP Interface, Interface]
Summary
• It’s not just about one network being self-managing
– We need to improve the end-user experience
– Handling multiple administrative domains is a core issue
• Structured methodology aids in problem analysis
– Today too many things are ad hoc
• High-level methodology should be independent of component-specific knowledge
– Provides extensibility
– Facilitates appropriate learning at both levels