Module 13 Troubleshooting the Operating System

13.1 - Identifying and Locating Symptoms
and Problems
13.2 - LILO Boot Errors
13.3 - Various Reasons for Package
Dependency Problems
13.4 - Troubleshooting Network Problems
13.5 - Disaster Recovery
Identifying and Locating
Symptoms and Problems
Hardware Problems
• Although a few problems are
due to a combination of
factors, most can be isolated in
origin to one of these:
– Hardware, Kernel, Application
Software, Configuration, and
User Error.
• Some hardware problems leave traces that
the kernel detects and records.
• Assuming an error is such that it
does not crash the system,
evidence might be left in the log
file /var/log/messages, with the
message prefixed by the word
oops.
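As a minimal sketch of checking the log for such traces, the following Python function scans log lines for the word "oops" in any letter case. The function name is hypothetical, and the log path and exact message format are assumptions that vary between distributions and kernel versions.

```python
import re

# Matches the word "oops" in any letter case, as a whole word.
OOPS_PATTERN = re.compile(r"\boops\b", re.IGNORECASE)


def find_oops_lines(lines):
    """Return the lines that mention a kernel oops."""
    return [line.rstrip("\n") for line in lines if OOPS_PATTERN.search(line)]


if __name__ == "__main__":
    # Assumption: the log lives at /var/log/messages and is readable
    # (this usually requires root privileges).
    try:
        with open("/var/log/messages", errors="replace") as log:
            for hit in find_oops_lines(log):
                print(hit)
    except OSError as exc:
        print(f"could not read log: {exc}")
```

Because the function accepts any iterable of lines, it can be tried on a copy of the log file without touching the live system.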
Kernel Problems
• Released Linux kernels are remarkably stable,
unless experimental versions are used or
individual modifications are made.
• Loadable kernel modules are considered part
of the kernel as well, at least for the time
period they are loaded.
• Sometimes these can cause difficulties, too.
• The good news with modules is that they can
be uninstalled and replaced with fixed versions
while the system is still running.
Application Software
• Errors in application packages are most identifiable in
that they occur only when running the application.
• This is in contrast to hardware and kernel conditions
that affect an entire system.
• Some common signs of application bugs are failure to
execute and program crash.
• An application may consume too much system
memory and ultimately begin to swap so badly that
the whole system is affected.
• Some errors are caused by things that have to do with
the running program itself.
Configuration
• Configuration problems tend to affect whole subsystems,
such as the graphics, printing, or networking subsystems.
• If the system is rebooted and a remote file system that
was once present is not, the first place to look is in the
configuration file /etc/fstab to see if the file system is
supposed to be mounted at boot time.
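The /etc/fstab check described above can be sketched in Python. This is an illustration only: the function names, the server name, and the sample entries are hypothetical, and the sketch assumes the usual whitespace-separated layout (device, mount point, type, options).

```python
def parse_fstab(text):
    """Parse fstab-style text into a list of entry dictionaries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed: device, mount point, type, options expected
        device, mount_point, fs_type, options = fields[:4]
        entries.append({
            "device": device,
            "mount_point": mount_point,
            "type": fs_type,
            "options": options.split(","),
        })
    return entries


def mounted_at_boot(entries, mount_point):
    """True if mount_point is listed and not marked 'noauto'."""
    for entry in entries:
        if entry["mount_point"] == mount_point:
            return "noauto" not in entry["options"]
    return False


# Hypothetical sample content, not taken from a real system.
SAMPLE = """\
# device                  mount point  type     options    dump pass
fileserver:/export/home   /mnt/home    nfs      defaults   0    0
/dev/cdrom                /mnt/cdrom   iso9660  noauto,ro  0    0
"""

if __name__ == "__main__":
    entries = parse_fstab(SAMPLE)
    print(mounted_at_boot(entries, "/mnt/home"))
    print(mounted_at_boot(entries, "/mnt/cdrom"))
```

If the remote file system is missing from the file, or is listed with the noauto option, it will not be mounted automatically at boot.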
User Error
• It is not unforgivable to make a mistake in using a
computer program, nor is it to be ignorant of the right
way to do something. It is only unforgivable to insist
on remaining stubbornly so.
• There is more to know about the ins and outs of
operating almost any software package than
everyday users will ever care or attempt to learn.
Using System Utilities and System Status Tools
• Linux operating systems
provide various system
utilities and system status
tools.
• The setserial utility
provides information about,
and sets options for, the
serial ports on the system.
• The lpq command helps
resolve printing problems.
• The command will display
all the jobs that are
waiting to be printed.
Using System Utilities and System Status Tools
• The ifconfig command can
be entered at the shell to
return the current network
interface configuration of
the system.
• The route command
displays or sets the
information on the system’s
routing, which it uses to
send information to
particular IP addresses.
Unresponsive Programs and Processes
• Sometimes there are programs and
processes that for various reasons can
become unresponsive or “lock up”.
• Sometimes just the program or process itself
will lock up, and other times it can cause the
entire system to become unresponsive.
• Once the unresponsive program has been
identified and located, an effective way to
troubleshoot the problem is to kill or restart
the process or program.
When to Start, Stop,
or Restart a Process
• It is easiest to terminate a program by using the kill
command.
• Other processes, such as daemons, need to be stopped
through their Sys V startup scripts.
• When restarting a program, service, or daemon it is
best to first consult the documentation because
different programs have to be restarted in different
ways.
• Some support using the restart command, some
need to be stopped completely and then started
again, and others can simply reread their
configuration files without needing to be either
stopped and started again, or restarted.
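The three approaches above (terminate, force, reread configuration) correspond loosely to signals that the kill command can send. The sketch below shows this from Python on a POSIX system. The ACTIONS table and the function name are hypothetical, and using SIGHUP to reread configuration is a common convention rather than a guarantee, so the documentation of each program should still be consulted.

```python
import signal
import subprocess
import sys

# Conventional signal choices; individual daemons may differ.
ACTIONS = {
    "terminate": signal.SIGTERM,  # polite request to exit (kill's default)
    "force": signal.SIGKILL,      # cannot be caught or ignored
    "reload": signal.SIGHUP,      # many daemons reread their configuration
}


def stop_child_demo():
    """Start a throwaway child process, terminate it, return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(ACTIONS["terminate"])
    # On POSIX, a negative return code -N means "killed by signal N".
    return child.wait(timeout=10)


if __name__ == "__main__":
    print(stop_child_demo())
```

The demonstration only signals a child process that it created itself, so it is safe to run on a working system.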
Troubleshooting Persistent Problems
• The best way to fix programs that crash repeatedly is
to replace them with new software or with a different
kind of software that performs the same task.
• If it is possible, try using the software in a different
way or if there is a particular keystroke or command
that causes the program to fail, stop using it.
• Most times there will be replacement software
available.
• If it is a daemon that is crashing regularly try using
other methods of starting it and running it.
Examining Log Files
• Some of the more important log
files on a Linux system are the
/var/log/messages,
/var/log/secure, and the
/var/log/syslog log files.
• The system’s log files can be
used to monitor system loads
such as how many pages a
web server has served.
• They can also be used to check for
security breaches such as
intrusion attempts, verify that
the system is functioning
properly, and note any errors
that might be generated by
software or programs.
Examining Log Files
• There are several different types of
information that are good to know, which will
make identifying problems using the log files
a little easier.
• Some of these are listed below:
– Monitoring System Loads
– Intrusion Attempts and Detection
– Normal System Functioning
– Missing Entries
– Error Messages
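As a rough sketch of how two of these categories might be counted in a log file, the Python below tallies lines that look like intrusion attempts or error messages. The patterns, the function name, and the sample lines in the usage section are hypothetical; real message text differs between daemons and versions.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust them to the messages a system really logs.
CATEGORIES = {
    "intrusion_attempt": re.compile(
        r"failed password|authentication failure", re.IGNORECASE
    ),
    "error": re.compile(r"\berror\b", re.IGNORECASE),
}


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines for illustration only.
    sample = [
        "Jan  1 10:00:00 host sshd[101]: Failed password for root from 192.0.2.7",
        "Jan  1 10:00:05 host httpd[202]: error: cannot open file",
        "Jan  1 10:00:09 host crond[303]: job started",
    ]
    print(dict(summarize(sample)))
```

A sudden rise in either count compared with a normal day is a hint that the log deserves a closer look.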
The dmesg Command
• The dmesg command can
be used to display the
recent kernel messages,
also known as the kernel
ring buffer.
• These messages contain
information about the
hardware installed in the
system and the drivers.
• The information in these
messages relates to
whether the drivers are
being loaded successfully
and what devices the
drivers are controlling.
Troubleshooting Problems
Based on User Feedback
• There are several
different types of
problems that users
report.
• Some of the most
common ones are:
– Login Problems
– File Permission
Problems
– Removable Media
Problems
– E-mail Problems
– Program Errors
– Shutdown Problems
LILO Boot Errors
Error Codes
• The LILO boot loader is the
first piece of code that takes
control of the boot process
from the BIOS. It loads the
Linux kernel, and then
passes control entirely to the
Linux kernel.
• When there is a problem
with LILO an error code will
be displayed:
– None, L error-code, LI,
LI101010…, LIL, LIL?, LIL-,
LILO
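The letters that LILO manages to print show how far loading got before it stopped. The lookup table below is a sketch of that idea. The descriptions are paraphrased summaries of the LILO documentation, the function name is hypothetical, and the repeating LI101010… output is not covered.

```python
# Paraphrased summaries; consult the LILO documentation for the full text.
LILO_STAGES = {
    "": "No part of LILO was loaded (not installed, or its partition "
        "is not active).",
    "L": "First stage loaded, but the second stage could not be loaded; "
         "a two-digit error code follows.",
    "LI": "Second stage was loaded but could not be executed.",
    "LIL": "Second stage started, but the descriptor table could not be "
           "loaded from the map file.",
    "LIL?": "Second stage was loaded at an incorrect address.",
    "LIL-": "The descriptor table is corrupt.",
    "LILO": "All parts of LILO were loaded successfully.",
}


def describe(prompt):
    """Return a summary for the characters LILO managed to print."""
    text = prompt.strip()
    if text.startswith("L "):
        # "L <error-code>": keep only the stage letter
        text = "L"
    return LILO_STAGES.get(text, "Unrecognized LILO output.")


if __name__ == "__main__":
    for shown in ("L 01", "LI", "LIL-", "LILO"):
        print(f"{shown!r}: {describe(shown)}")
```

Note that a complete LILO prompt is the success case, not an error.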
Booting a Linux System
without LILO
• Using the LILO on a
Floppy method is the
least useful but it can help
in some instances.
• From this screen a LILO
boot floppy disk can be
created which can be used
to boot Linux from LILO
using the floppy disk.
Emergency Boot System
• Linux provides an
emergency copy
of LILO, which can be
used to boot Linux in the
event that the original LILO
boot loader has errors or is
not working.
• This is known as the
Emergency Boot System.
• To use this copy of LILO,
configuration changes
must be made in lilo.conf.
Using an Emergency
Boot Disk in Linux
• There are several
reasons and errors that
can cause a Linux
system not to boot,
besides LILO problems.
• The emergency boot
disk should have the
necessary disk utilities
such as fdisk, mkfs,
and fsck, which can be
used to partition, format,
and check a hard drive so
that Linux can be
installed on it.
Using an Emergency
Boot Disk in Linux
• It is always important to
include some sort of
backup software utility.
• If a change or repair to
some configuration files
needs to be made, first
back them up.
• Most distributions come
with some sort of backup
utility like tar, restore,
cpio, and possibly others.
Recognizing Common Errors
Various Reasons for
Package Dependency Problems
• When a package is installed in a Linux system there
might be other packages that need to be installed for
that particular package to work properly.
• The dependency package may have certain files
which need to be in place or it may run certain
services which need to be started before the package
that is to be installed can work.
• Linux will often notify the user if they are installing a
package that has dependencies so that they can be
installed as well.
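The idea that dependencies must be in place before the package that needs them can be sketched as an ordering problem. The Python below uses the standard-library graphlib module (Python 3.9 or later). The package names are invented and the function is only an illustration; it is not how any particular package manager is implemented.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def install_order(dependencies):
    """Return package names ordered so that dependencies come first.

    dependencies maps each package to the set of packages it requires.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical packages for illustration only.
    deps = {
        "webapp": {"libssl", "libc"},
        "libssl": {"libc"},
        "libc": set(),
    }
    print(install_order(deps))
```

If two packages require each other, static_order() raises graphlib.CycleError, which mirrors the circular dependency problems sometimes seen in practice.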
Solutions to Package
Dependency Problems
• One solution to package dependency problems is to
simply ignore the error message and forcibly install the
package anyway.
• The correct and recommended method
is to modify the system so that it has the
dependencies the package needs to run properly.
• It may be necessary to rebuild the package from source code
if there are dependency error messages showing up.
• The easiest way is to locate a different version of the package
that is causing the problems.
• Another option is to look for a newer version of the package.
Backup and Restore Errors
• Backup and Restore errors can occur at different points.
• Some errors will occur when the system is actually
performing the backup.
• Other errors will occur during the restore process when
the system is attempting to recover data.
• Some of the most common types of problems:
– Driver problems
– Tape drive access errors
– File access errors
– Media errors
– Files not found errors
Application Failure
on Linux Servers
• There are several things that can provide some
indication of an application failure or software problem
on a Linux server:
– Failure to Start
– Failure to Respond
– Slow Responses
– Unexpected Responses
– Crashing Application or Server
• A good general rule is to check the system’s logs.
• The system’s log files are usually the place to find most
error messages that are generated because they are not
always displayed on the screen.
Troubleshooting Network Problems
Loss of Connectivity
• Loss of connectivity can be hardware and/or software
related. The first rule of troubleshooting is to check
for physical connectivity.
• Ensure that the cables are properly plugged in at
both ends, that the network adapter is functioning by
checking the link light on the NIC, that the hub's
status lights are on, and that the communication
problem is not a simple hardware malfunction.
Operator Error
• Be sure that users are using the correct username and
password and that their accounts are not restricted in a
way that prevents them from being able to connect to the
network.
• Software settings might have been changed by the
installation routine of a recently installed program, or the
user might have been experimenting with settings.
• Users accidentally, or purposely, delete files, and power
surges or shutting down the computer abruptly can
damage file data.
• Viruses can also damage system files or user data.
Using TCP/IP Utilities
• The first step in checking for a
suspected connectivity
problem is to ping the host.
• If a reply is received, the
physical connection between
the two computers is intact and
working.
• If the host is on the Internet, a
successful reply also signifies
that the calling system can
reach the Internet.
• The term ping time refers to
the amount of time that
elapses between the sending
of the Echo Request and
receipt of the Echo Reply.
• A low ping time indicates a fast
connection.
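Ping times can be pulled out of saved ping output for comparison. The sketch below assumes the common "time=… ms" wording, which varies between ping implementations. The function name is hypothetical and the sample output in the usage section is invented.

```python
import re

# Matches "time=0.512 ms" and the "time<1 ms" form some versions print.
TIME_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def ping_times_ms(output):
    """Extract the round-trip times, in milliseconds, from ping output text."""
    return [float(value) for value in TIME_PATTERN.findall(output)]


if __name__ == "__main__":
    # Invented sample output for illustration only.
    sample = (
        "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.512 ms\n"
        "64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.498 ms\n"
    )
    times = ping_times_ms(sample)
    print(f"replies: {len(times)}, slowest: {max(times)} ms")
```

An empty result means that no replies were found in the text, which is itself a sign of a connectivity problem.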
Using TCP/IP Utilities
• Tracing utilities are used to
discover the route taken by a
packet to reach its destination.
• The way to determine packet
routing in UNIX systems is the
traceroute command.
• Traceroute shows all the
routers through which the
packet passes as it travels
through the network from
sending computer to destination
computer.
• This is useful for determining at
what point connectivity is lost or
slowed.
Using TCP/IP Utilities
• The ipconfig
command is used in
Windows NT and
Windows 2000 to
display the IP address,
subnet mask, and
default gateway for
which a network
adapter is configured.
• For more detailed
information, the /all
switch is used.
Problem-Solving Guidelines
• Troubleshooting a network requires problem-solving
skills.
• The use of a structured method to detect, analyze,
and address each problem as it is encountered
increases the likelihood of successful
troubleshooting.
• These steps should be followed:
– Gather information
– Analyze the information
– Formulate and implement a "treatment" plan
– Test to verify the results of the treatment
– Document everything
Windows 2000 Diagnostic Tools
• The network diagnostic
tools for Microsoft
Windows 2000 Server
include Ipconfig,
Nbtstat, Netstat,
Nslookup, Ping, and
Tracert.
• Windows 2000 Server
also includes the
Netdiag and Pathping
commands.
Wake-on-LAN
• Some network interface cards support a technology
known as Wake-On-LAN (WOL).
• The purpose of WOL technology is to enable a
network administrator to power up a computer by
sending a signal to the NIC with WOL technology.
• The signal is called a magic packet.
• When the magic packet is received by the NIC, it
will power up the computer in which it is installed.
• When fully powered up, the remote computer can
be accessed through normal remote diagnostic
software.
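A magic packet has a simple layout: six bytes of 0xFF followed by the target NIC's MAC address repeated sixteen times. The Python sketch below builds such a packet and broadcasts it over UDP. The function names are hypothetical, the MAC address in the usage section is a placeholder, and the broadcast address and port 9 are common defaults rather than requirements.

```python
import socket


def build_magic_packet(mac):
    """Build a magic packet: 6 bytes of 0xFF, then the MAC address 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the packet over UDP. Address and port are common defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Placeholder MAC address; only the packet is built here, nothing is sent.
    packet = build_magic_packet("00:11:22:33:44:55")
    print(len(packet), "bytes")
```

WOL usually also has to be enabled in the BIOS and on the NIC before the computer will respond to the packet.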
Disaster Recovery
Risk Analysis
A good risk analysis can be broken into the following
four parts:
1. Identify business processes and their associated
infrastructure.
2. Identify the threats associated with each of the
business processes and associated infrastructure.
3. Define the level of risk associated with each threat.
4. Rank the risks based on severity and likelihood.
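Step 4 can be sketched as a simple sort. The scoring rule used here (severity multiplied by likelihood, each on a 1 to 5 scale) is one common convention and is an assumption, not something this module prescribes. The function name and the threats listed are hypothetical.

```python
def rank_risks(threats):
    """Sort threats by severity x likelihood, highest score first.

    threats is a list of (name, severity, likelihood) tuples; both
    scores use a hypothetical 1-5 scale.
    """
    return sorted(threats, key=lambda t: t[1] * t[2], reverse=True)


if __name__ == "__main__":
    # Invented example threats for illustration only.
    threats = [
        ("site fire", 5, 1),
        ("disk failure", 4, 3),
        ("power outage", 3, 3),
    ]
    for name, severity, likelihood in rank_risks(threats):
        print(f"{severity * likelihood:>2}  {name}")
```

Whatever scale is chosen, it should be applied in the same way to every threat so that the ranking is comparable.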
Understanding Redundancy
• Redundancy is the ability to
continue providing service
when something fails.
• RAID 0 - also known as disk
striping, it writes data
across two or more physical
drives and has no
redundancy.
Understanding Redundancy
• RAID 1 - also known as disk
mirroring, requires the use of two
disk drives and one disk controller
to provide redundancy.
• To increase performance add a
second controller, one for each disk
drive.
• RAID 5 - also known as disk striping
with parity. Parity is redundant
information, calculated from the data,
that allows the contents of a failed drive
to be rebuilt.
• RAID 5 is similar to RAID 0 in that it
writes data across disks, but it adds
parity information for redundancy.
• At least three drives are required to
implement this type of RAID.
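The parity idea behind RAID 5 can be illustrated with the XOR operation. The sketch below is a deliberate simplification: a real array works on stripes and rotates the parity blocks across all of the drives. The function name and data values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


# Two data blocks on two drives; the parity block goes on a third.
data_a = b"LINUX"
data_b = b"RAID5"
parity = xor_blocks([data_a, data_b])

# If the drive holding data_b fails, XOR of the survivors rebuilds it.
rebuilt_b = xor_blocks([data_a, parity])

if __name__ == "__main__":
    print(rebuilt_b == data_b)
```

The same calculation rebuilds any single missing block, which is why the array survives the loss of one drive but not two.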
Understanding Redundancy
• RAID 0+1 offers the best of
both worlds, the performance
of RAID 0 and the
redundancy of RAID 1.
• This is an expensive solution
because of the number of
drives it requires.
• A number of other
components in the server
can be configured in a
redundant manner:
– Power supplies, Cooling
fans, Network interface
adapters, Processors,
Uninterruptible power supply
(UPS)
Clustering
• A cluster is a group of
independent computers
working together as a
single system.
• This system is used to
ensure that mission-critical
applications and resources
are as highly available as
possible.
• The advantages of running
a clustered configuration include:
– Fault tolerance, High
availability, Scalability, Easier
manageability
Scalability
• Scalability refers to how well
a hardware or software
system can adapt to
increased demands.
• The question is how much
extra capacity should be built
in, and how much additional
capacity can be added once
the server is installed?
• It is a good idea to add an
additional 25% of capacity to
any new server configuration
to ensure scalability.
High Availability
• High availability is the designing and configuring of a
server to provide continuous use of critical data and
applications.
• Highly available systems are required to provide
access to the enterprise applications that keep
businesses up and running, regardless of planned or
unplanned interruption.
• It is not uncommon for mission-critical applications to
have an availability requirement of 99.999%.
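An availability percentage is easier to judge when converted into permitted downtime. The short Python sketch below does the arithmetic for a 365-day year. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

At 99.999%, the permitted downtime is about 5.26 minutes per year, which leaves no room for a manual repair and explains the reliance on redundancy and clustering.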
Hot Swapping,
Warm Swapping, and Hot Spares
The types of components that might be kept on hand in case of
a problem are broken into these three basic categories:
1. A hot-swap component has the capability to be added
and removed from a computer while the computer is
running and have the operating system automatically
recognize the change.
2. Warm swaps are generally done in conjunction with the
failure of a hard drive. In this case, it is necessary to shut
down the disk array before the drive can be replaced.
3. A hot-spare component is a component that can be kept
on hand in case of an equipment failure.
Creating a Disaster Recovery Plan
Based on Fault Tolerance/Recovery
The first piece of the plan is the fault-tolerance portion of the
disaster-recovery plan. To create it, follow these steps:
1. From the risk analysis, identify the hardware failure-related
threats.
2. From the list of components, identify the components
that place the data at the most risk if they were to fail.
3. Take each component and make a list of the methods
that could be used to implement it in a fault-tolerant
configuration. List approximate costs for each solution
and the estimated outage time in the event of a failure
for each component.
Creating a Disaster Recovery Plan Based on
Fault Tolerance/Recovery
4. Take any components that can be implemented in a
cost-effective manner and start documenting the
configuration.
5. Take any components that either cannot be
implemented in a fault-tolerant configuration or for
which a fault-tolerant configuration would be
cost-prohibitive, and determine whether a spare part should
be kept on hand in the event of an outage.
6. The disaster-recovery plan should include documented
contingencies for any of the threats identified as part of
the risk analysis.
7. After all this information has been documented, place
the orders and get ready to start configuring the server.
Testing the Plan
Some of the things that should be tested for include the following:
• Check the documentation to ensure that it is understandable
and complete.
• Do a “dry run” of each of the components of the plan. Make
sure spare drives can be located, if applicable, and that
replacement parts can be ordered from the vendor.
• Test the notification processes. It should be documented
who is to be notified in case of an outage.
• Check the locations of any hot spare equipment or servers.
• Verify that any support contracts that are on equipment are
still in effect, and that all the contact numbers are available.
• Test the tape backups at least once a week.
• Test the RAID configuration at least twice a year.
Hot and Cold Sites
Two types of disaster-recovery sites are commonly used:
1. A hot site is a commercial facility available for
systems backup.
• For a fee, these vendors provide a facility with server
hardware, telecommunications hardware, and
personnel to assist a company with restoring critical
business applications and services in the event of an
interruption in normal operations.
2. A cold site, also known as a shell site, is a facility
that has been prepared to receive computer
equipment in the event that the primary facility is
destroyed or becomes unusable.