SYSTEM ADMINISTRATION
Chapter 17
Using a Structured Troubleshooting Strategy

Understanding Troubleshooting
• Troubleshooting is the process of taking a large, complex problem and, through the use of various techniques, excluding potential causes one by one until only the actual cause remains.
• Once the actual cause is identified, it can then be fixed.
• Good troubleshooting skills require a commonsense, structured approach.

Establish the Symptoms
• At this stage of the process you are primarily working with symptoms of the actual problem.
• Most symptoms manifest themselves as a failure of some type.
• Failures may or may not be accompanied by one or more error messages.

Open-Ended Questions
• Open-ended questions are designed to elicit additional information. The plan is to engage in a dialog with the user.
• Examples of open-ended questions are:
  – What did you observe when the problem happened?
  – What were you doing when the error occurred?

Closed-Ended Questions
• With closed-ended questions, you are looking for a specific answer. Many times the answer will be yes or no, or a selection of one or more options.
• Examples of closed-ended questions are:
  – Is there an error message?
  – What does the error message say?
  – Is your computer plugged into the wall?
  – Is anyone else having this problem?
  – Have any recent changes been made to the computer?

The Basic Questions
• Always remember to answer the essential questions: who, what, when, where, why, and how. Here are a few examples:
• Who
  – Who reported the problem?
  – Who found the actual problem?
  – Whom does the problem affect?
• What
  – What are the symptoms of this particular problem?
  – What type of device is having the problem?
  – What was the user doing when he or she first noticed the problem?
• When
  – When was the problem first noticed?
  – When did the problem actually start?
• Where
  – Where is the affected device?
  – Where was the device when the problem occurred?
    (Used primarily with laptop computers.)
• Why
  – Why was this problem noticed?
  – Why is this problem occurring?
• How
  – How did this problem occur?
  – Can you make this problem recur? If so, how?

Is There a Problem?
One of the questions you must always answer is, “Does the problem really exist, or is the system simply operating the way it is intended?”

Identify the Affected Area
• This section is concerned with the scope of the problem.
• A common question to ask at this point is, “How widespread is the problem?” That is to say, is the problem limited to a single user, group, computer, server, subnet, or an entire network?
• If one user calls the help desk to report a problem, you can usually deduce that it affects only one user or computer.
• On the other hand, if your phones start ringing off the hook with multiple users reporting the same problem, you can be fairly certain that the problem is more widespread.
• The scope matters because it determines where you will focus your troubleshooting efforts.

Establish What Has Changed
• A good question to ask is whether anything has changed recently. For example, has any new software been installed on the computer?
• A dialog with other groups within the IT department will help you determine what software upgrades or repairs have been made to user workstations.
• Always keep in mind that the user may have made changes on her or his own, but has decided not to tell you about them.

Could You Do This Before?
• It is important to determine whether an employee has been able to perform a task in the past.
• Many times, a user will have certain expectations of a computer system based on home use or past job experiences.

Recreate the Problem
It is always a good idea to see if you can recreate the problem.
Select the Most Probable Cause
• Once you have asked a sufficient number of questions, you should have a fairly good idea where the problem is, although you may not be able to describe what the problem is.
• The difficulty is that you rarely have only one possible answer to the problem.
• At this point, you will start using some more techniques. For example, you may need to make a trip to the user’s computer, if possible, and run some diagnostics or utilities to eliminate some of the possibilities.
• Always keep an open mind and avoid jumping to conclusions when it comes to troubleshooting.

What Can You Eliminate?
• By testing one thing at a time (TCP/IP, DHCP, the router, and DNS), you systematically eliminate one possible cause after another until only the actual problem remains.
• Through the process of elimination and basic troubleshooting utilities, you can quickly identify where the problem exists.
• Now you must determine what the problem is. That is to say, you know the problem rests with DNS; you just don’t know what the problem is with DNS.

Troubleshooting Tools
• Always use all of the troubleshooting tools that are available to you.
• Most computers provide either a utility or log files with which to view system problems. These files log or document everything that happens at the server.
• Looking through these log files might direct you to the problem, or at least identify for you the area where the problem is occurring.

Event Viewer
• The Windows Event Viewer contains three logs: Application, Security, and System.
  – Application
    • The application log contains events that are caused by applications or programs.
  – Security
    • The security log records events that relate directly to system security, such as valid and invalid logon attempts.
  – System
    • The system log contains events that are reported by system components.
• The log indicates three different types of entries: Information, Warning, and Error.
  – Information
    • Describes a successful operation. For example, when an application, driver, or service loads successfully, an Information event will be logged.
  – Warning
    • An event that may be an indicator of a possible future problem. For example, if a system begins to get low on disk space, it will issue a Warning.
  – Error
    • Indicates a significant problem. This problem may result in a loss of data or loss of a function. For example, if a device driver fails to load during system startup, an Error will be logged.
• To view additional information for a specific event, double-click the entry in the right pane.

Log Files
• Most log files are simple text files that can be viewed through a text editor such as Notepad.
• These log files normally monitor one particular item or process and report on the successes and failures of that item or process.
• Each operating system contains a number of log files. The easiest way to locate them is to use the search utility for your particular computer.

Formulate a Solution
• Once you have a good idea where the problem is, you can begin formulating a plan of attack to fix it.
• This plan should be based on your knowledge of the way the system is supposed to operate, and any ancillary factors that affect the operation of that particular object.
• Always consider consulting a technical reference when troubleshooting.
• Most major manufacturers of computer hardware and software have a Web site loaded with support information for their products.
• Many times, the problem has already been identified by the manufacturer, who can provide the information or software necessary to fix the problem.
• Based on your knowledge of the systems and your understanding of the problem, you will start to develop one or more solutions to fix the problem.
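Because most log files are plain text, a short script can do a first pass before you read them line by line. The following is a minimal sketch in Python, not a tool for any specific product: the function name `summarize_log`, the severity keywords, and the sample lines are all invented for illustration, and real log formats vary by operating system and application.

```python
import re
from collections import Counter

# Hypothetical severity keywords; real log formats differ by product.
SEVERITY_PATTERN = re.compile(r"\b(ERROR|WARNING|INFO)\b", re.IGNORECASE)


def summarize_log(lines):
    """Count lines by severity and collect the lines marked as errors."""
    counts = Counter()
    errors = []
    for line in lines:
        match = SEVERITY_PATTERN.search(line)
        if match is None:
            continue  # line carries no recognizable severity keyword
        severity = match.group(1).upper()
        counts[severity] += 1
        if severity == "ERROR":
            errors.append(line.rstrip("\n"))
    return counts, errors


# Made-up sample lines standing in for the contents of a text log file.
sample = [
    "2024-01-05 08:00:01 INFO service started",
    "2024-01-05 08:02:17 WARNING disk space low on C:",
    "2024-01-05 08:03:40 ERROR driver failed to load",
]
counts, errors = summarize_log(sample)
```

To scan a real file, pass an open file object in place of `sample`; the counts show which kind of entry dominates, and the collected error lines are usually the best place to start reading.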
Implement a Solution
Always try not to make the problem worse.

One Thing at a Time
• Implement only one solution at a time. There are two reasons for this:
  – First, if you do three things and the problem is corrected, you really don’t know which of these actions fixed the problem.
  – Conversely, if your actions made the problem worse, you really don’t know which one of the actions made the problem worse; therefore, you have to undo all three actions.
• Each step should build on the previous step. For example, if restarting the server does not correct the problem, what is the next most logical step to try that is not destructive? Continue working through these steps until you have resolved the problem.

Consider the User
• While you are working to resolve the problem, it is easy to forget about the person who is actually experiencing the problem: the user.
• Always try to accommodate the user to the extent possible.
• In all cases, keep the user informed of your progress. If you believe that it will take two hours to fix the problem, be sure you communicate that information to the user.
• Be realistic with your time estimates. If you simply make up a time estimate, it will reflect badly on you as a technician and on the IT department as a whole.

Test the Result
• Testing the result simply means asking, “Did it work?”
• Keep in mind that you should only implement one potential solution at a time.
• After you implement that one solution, try it to see if it works. If not, try another possible solution and then test it.

Recognize the Potential Effects of the Solution
• Will the “fix” that you developed “unfix” something else?
• Occasionally, you will repair one problem only to cause another.
• This is especially prevalent when working with products from several competing companies.
• Some “side-effect” problems are not obvious.
• The best way to head off these problems is to consider all of the factors involved and to conduct extensive research prior to implementing a new software solution.
• Independent discussion groups and good technical references are a great place to start.
• In most cases, real users or administrators of a product moderate these groups. Since these folks have no vested interest in the product, they tend to be more honest about the problems that they have experienced with different manufacturers’ software.
• A good technical reference will inform you of how all network components work together.

Document the Solution
• Once the problem is fixed, write down what you did. Most problems recur at some time.
• Once a problem has been successfully resolved, document the problem and the solution in a format that is available to others.
• The documentation can be stored in some type of formal, structured facility or as simply as in a loose-leaf notebook.
• Many companies use a knowledge base to document their troubleshooting efforts.

Knowledge Base
• A knowledge base is generally a computerized system that allows you to log reported problems along with the steps a technician took to repair them.
• A knowledge base can range from the very simple to the elaborate.
• The simplest type of knowledge base might be a series of folders or binders containing technical notes that are compiled and maintained by technicians.
• A knowledge base can also be a very sophisticated database-based system that can be queried by keyword or simply asked a question in the form of a text string.

Frequently Asked Questions (FAQs)
• Another simple knowledge base might consist of one or more Web pages made up largely of text that users or technicians can browse for frequently asked questions (FAQs).
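To make the idea of a keyword-queried knowledge base concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `Entry` record, the `search` function, and the two sample entries are hypothetical, and a production knowledge base would normally sit on a database with proper indexing.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One documented problem and the fix that resolved it (hypothetical record)."""
    problem: str
    solution: str


def search(entries, query):
    """Return the entries whose text contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []  # an empty query matches nothing
    results = []
    for entry in entries:
        text = f"{entry.problem} {entry.solution}".lower()
        if all(word in text for word in words):
            results.append(entry)
    return results


# Made-up sample entries for illustration only.
knowledge_base = [
    Entry("User cannot resolve host names",
          "Corrected the DNS server address on the workstation"),
    Entry("Printer offline after move",
          "Reconnected the network cable and restarted the spooler"),
]
matches = search(knowledge_base, "dns server")
```

The match is a plain substring test on lowercased text, so it is simple to follow but will also match partial words; that trade-off is acceptable for a small set of technician notes and is one reason larger systems move to a real database.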