SYSTEM ADMINISTRATION
Chapter 15: Network Integrity

Network Integrity
• Network integrity means maintaining the state of the network so that all parts function as a whole in a sound and unimpaired state.
• A plan to maintain network integrity must cover:
  – Documentation
  – Disaster planning/recovery
  – Fault tolerance
  – System backup

Documentation
• Network documentation includes information on the following:
  – LAN/WAN topology
  – Hardware inventory
  – Software inventory
  – Change logs
  – Server information
  – Router and switch configurations
  – User policies and profiles
  – Baseline documents
  – Mission-critical applications and hardware
  – Network service configuration
  – Procedures
• Good network documentation aids in troubleshooting problems that occur within the network, such as failed connections, failed servers, hung applications, WAN connection failures, and user resource access failures.
• Documentation can be kept formally, using custom forms, or informally, using inexpensive notebooks to record change and repair events.

Disaster Planning
• A disaster is an event that causes widespread destruction or distress, or total failure.
• Planning for the worst-case scenario allows the network administrator, planning team, and technicians to anticipate the consequences of both natural and man-made disasters.
• Disaster planning may be as simple as writing a procedure to back up all data, or as complex as contracting for remote hot sites with 100% uptime so that the network can sustain a disaster without loss of service.

Disaster Recovery Plan
• A disaster recovery plan follows a set of well-defined steps:
  – Creation of the disaster recovery (DR) team
  – Identification of the risks and vulnerabilities that threaten the network
  – Business impact assessment
  – Definition of needs
  – Detailed plan development
  – Testing
  – Maintenance of the plan
• A disaster recovery plan is a living document that may take many weeks or months to develop and implement, and it must be updated consistently as changes are made to the network.

Mirrored Servers (Failover Clustering)
• Mirrored servers provide 99.9% uptime for mission-critical applications and data.
• To build mirrored servers, both servers must be configured with identical equipment and software, and both must be attached to the network.
• The “primary” mirrored server answers requests from the network and issues a “heartbeat” to its twin to let the secondary mirrored server know that the primary is servicing the network.
• When the secondary server does not hear a heartbeat within a predetermined time frame, it begins answering requests from the network. The window between failure of the primary server and “cutover” to the secondary is usually 30 to 45 seconds.

Clustered Servers
• A cluster consists of two or more servers that are configured with identical applications and file structures, all attached to the network, and all answering requests from the network. Together, the servers act as one very large server.
• Clustered servers can make use of replication services. Several servers may be located off-site and participate in replication to ensure that data is identical on all servers in the event of a failure within the network or a disaster.
• Clustering is expensive and complex to implement. For this reason, small- to medium-sized businesses usually do not choose this option for disaster planning and recovery.

Power Protection
• Power loss is one of the small disasters that an administrator can mitigate without undue expense or complex configurations.
• Several types of power protection can be used in the network.
  The choice depends on the nature of the operations and on how stable power is at the business's geographic location.

Surge Protectors
• Surge protectors are designed to minimize the effects of power spikes and surges.
• Surge protectors do not protect equipment from brownouts, “dirty power,” noise on the line, or power failure.
• Over time, the surge-suppression components in a surge protector (typically metal oxide varistors) lose their sensitivity to power fluctuations and can allow large variations in power to pass through, weakening the attached components.
• Surge protectors should be replaced at least yearly to reduce the likelihood of component damage.

Online Uninterruptible Power Supply (UPS)
• The purpose of a UPS is to supply enough power, for enough time, to let a server or other critical machine shut down gracefully.
• An online UPS protects equipment by conditioning the power before it reaches the equipment.
• Inside the UPS is a battery that stores power coming from the wall outlet. That power is then sent to the equipment. Noise and fluctuation are minimized, so the power used by the server is “clean” again, and the battery provides a power source should utility power be lost.
• The size of the UPS depends on the wattage of the attached equipment. As a rule of thumb, each watt of load requires about 1.4 VA of UPS capacity. Calculate the total wattage of the equipment, multiply it by 1.4 to get the minimum VA rating, and then determine how much time is needed to complete the shutdown process and any other routines that must finish while the machine is still running.
• Most UPSs provide 15 to 20 minutes of power by default. If a longer runtime is needed, the total wattage must be multiplied by the additional time (beyond 15 to 20 minutes) to determine the size of the UPS.

Standby UPS
• A standby UPS lets power flow directly to the equipment while it charges a battery in the UPS. When a power failure occurs, the UPS detects the reduction in power and cuts over to battery power.
• Some devices, such as servers, may reboot or shut down during the short gap between loss of power and cutover to battery power.

Fault Tolerance
• Fault tolerance is a system’s capacity to continue functioning despite a “fault” or malfunction of one or more components.

Disk Fault Tolerance
• Disk fault tolerance gives the network the ability to recover from the loss of a hard disk storage device and to prevent loss of the data stored on that device.
• One of several disk fault tolerance strategies can be implemented in the servers to protect the data. The most common is some form of redundant array of inexpensive disks (RAID).

RAID Level 0
• RAID level 0 is commonly called disk striping without parity.
• This form of RAID writes data across multiple disks but does not provide any fault tolerance.
• RAID 0 requires at least two hard disks to implement.
• With RAID 0, both read and write performance improve over single-disk usage.
• RAID 0 uses all available disk space for storage.
• This form of RAID is useful for noncritical data that is routinely backed up.

RAID Level 1
• RAID level 1 is commonly referred to as disk mirroring (or disk duplexing when two controllers are used).
• With RAID 1, data is written to both disks at the same time. Should one disk fail, the other disk takes over servicing requests from the network.
• RAID level 1 requires two disks to implement.
• Mirroring/duplexing provides good read and write access to data on the disk.
• Only 50% of the total disk space can be used for storage.
• This form of RAID is used where fault tolerance is needed but cost is a concern.

RAID Level 2
• RAID level 2 is known as bit-level striping with Hamming code ECC.
• This level of RAID is not used in modern systems.

RAID Level 3
• RAID level 3 uses byte-level striping with dedicated parity.
• Data is striped across multiple drives, and parity information is written to a dedicated hard disk for recovery of lost data.
• Read performance with RAID 3 is good, but write performance is only poor to fair.
• This type of RAID is costly to implement and is not as efficient as other implementations.

RAID Level 4
• RAID level 4 uses a method called block-level striping with dedicated parity.
• The difference between RAID 3 and RAID 4 is simply that RAID 4 stripes in blocks of a size determined by the administrator, while RAID 3 stripes at the byte level.
• Read performance is good and write performance is fair.
• This type of RAID sits midway between RAID 3 and RAID 5, but is not frequently implemented.

RAID Level 5
• RAID level 5 is commonly known as striping with parity.
• This form of RAID requires at least three disks. Data is striped across the disks, and parity information is distributed across the same disks. There is no dedicated parity disk.
• Read performance is very good, while write performance is fair.
• To figure the available storage space, add the disk space on all drives and subtract the space on one drive.
• RAID 5 is often considered the best choice for balancing fault tolerance and performance.

RAID Level 6
• RAID level 6 uses block-level striping with dual distributed parity.
• This form of RAID requires a minimum of four disks to implement. The equivalent of two disks is lost to parity.
• Read performance is good, and write performance is poor to fair because two sets of parity information must be written to the drives.

RAID Level 7
• RAID level 7 is a proprietary form of RAID that uses an asynchronous cached striping mechanism with dedicated parity storage.
• Although it is a defined RAID level, it is vendor-specific; consult the vendor for more information.

Backups
• When determining a backup strategy, the first two considerations are how you want to accomplish the backup (the hardware) and what software you will use to complete this task.
• Some of the options for backup media include:
  – Small- and large-capacity removable disks
  – Optical discs
  – Magnetic tape (the most commonly used)
• Once the medium is identified, the administrator determines a schedule of backups using one or more of the following methods:
  – Full backups
  – Incremental backups
  – Differential backups

Full Backups
• A full backup takes all data and commits it to tape.
• During a full backup, the archive bit (attribute) is reset to “off” to notify the backup software that the file has been saved to tape.
• Full backups done on a daily basis allow a quick restore because only one tape is needed to complete the restore.

Incremental Backups
• Only files that have changed since the last backup that reset the archive bit are committed to tape. That backup may have been a full or an incremental backup.
• This method of backup is used in conjunction with weekly full backups.
• Incremental backups reduce the time it takes to complete the backup process because only a limited selection of files is backed up.
• When restoring, use the last full backup and all subsequent incremental backups, in order.
• Incremental backups reset the archive bit to off.

Differential Backups
• A differential backup saves all files that have changed since the last full backup.
• Differential backups do not reset the archive bit.
• To restore, only the tapes from the last full backup and the last (most recent) differential backup are used.
• This method of backing up data is used in conjunction with weekly full backups.
• A weekly full backup with daily differential backups is considered the most efficient and safest strategy for maintaining data integrity.

Other Considerations for Network Integrity
• Tape rotation patterns are determined when the backup strategy is designed.
  – The choices are:
    • Daily rotation
    • Weekly rotation
    • Monthly rotation
    • Yearly rotation
• With each option, the administrator must consider what archive of past data must be maintained for the business, and whether the cost of maintaining a large archive of tapes outweighs the value of the protection it provides.
• Most businesses use either a weekly rotation or a monthly rotation to manage archived data.
• Tape storage is important to consider as well.
• Magnetic tape is susceptible to damage from environmental factors including heat, sunlight, water, and humidity.
• Proper storage for disaster recovery is necessary.
• Tapes should be stored in climate-controlled rooms that are physically protected, or at an off-site storage facility.
• The best option for disaster recovery is to contract with a third party to maintain the tape archive at a remote location.
  – This method allows the tapes to remain safe should there be a disaster at the location of the business.
  – Restoration of the data can then take place at a new location should the old one be rendered unusable.

Network Attached Storage (NAS)
• NAS attaches large data storage to the network but does not require a server to manage it.
• Access to NAS is controlled through file system permissions.
• NAS can use multiple file-sharing protocols, such as CIFS and NFS.
• The NAS device is a storage facility and does not expend any resources providing other services to the network.
• NAS devices can be brought down for maintenance without causing outages on the network.
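
Worked example: UPS sizing
The sizing rule from the Online UPS section (total wattage multiplied by 1.4) can be worked through in a few lines of Python. This is only a minimal sketch: the device names and wattages are invented for illustration, and the 1.4 factor is the rule of thumb quoted in this chapter, not a figure from any vendor's data sheet.

```python
# Rule of thumb from this chapter: minimum VA rating = total watts x 1.4.
VA_PER_WATT = 1.4


def required_va(device_watts):
    """Return the minimum UPS VA rating for a dict of {device: watts}."""
    total_watts = sum(device_watts.values())
    return total_watts * VA_PER_WATT


# Hypothetical loads, chosen only to make the arithmetic easy to follow.
loads = {"file server": 400, "monitor": 50, "switch": 50}

print(f"Total load: {sum(loads.values())} W")
print(f"Minimum UPS rating: {required_va(loads):.0f} VA")  # 500 W x 1.4 = 700 VA
```

The result covers only the VA rating. The runtime needed for a graceful shutdown still has to be checked against the battery capacity of the specific UPS.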
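
Worked example: usable RAID capacity
The storage rules given in the RAID sections (all space for RAID 0, 50% for RAID 1, one disk lost to parity for RAID 5, two for RAID 6) can be compared side by side. The sketch below assumes identical disks and covers only those four levels. The function name and the 500 GB disk size are hypothetical choices made for this illustration.

```python
# Minimum disk counts as stated in the RAID sections of this chapter.
MINIMUM_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Return usable space in GB for an array of identical disks."""
    if level not in MINIMUM_DISKS:
        raise ValueError(f"RAID {level} is not covered by this sketch")
    if disk_count < MINIMUM_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MINIMUM_DISKS[level]} disks")
    if level == 1 and disk_count != 2:
        raise ValueError("RAID 1 is treated here as a two-disk mirror")

    if level == 0:
        return disk_count * disk_size_gb        # no space lost, no fault tolerance
    if level == 1:
        return disk_size_gb                     # 50% of the two-disk total
    if level == 5:
        return (disk_count - 1) * disk_size_gb  # one disk's worth of parity
    return (disk_count - 2) * disk_size_gb      # RAID 6: two disks' worth of parity


for level, disks in [(0, 2), (1, 2), (5, 4), (6, 4)]:
    print(f"RAID {level} with {disks} x 500 GB: {usable_capacity_gb(level, disks, 500)} GB usable")
```

With four 500 GB disks, RAID 5 leaves 1500 GB usable and RAID 6 leaves 1000 GB, which shows the cost of the second set of parity.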
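
Worked example: which tapes a restore needs
The restore rules from the incremental and differential backup sections can be expressed as a small function. This is a sketch under stated assumptions, not a description of any backup product: the `(label, kind)` tuples and the function name are hypothetical, and a schedule is assumed to use either incremental or differential backups between full backups, never both.

```python
def restore_set(history):
    """Return the backups needed for a restore, oldest first.

    `history` is a list of (label, kind) tuples in the order the backups ran;
    kind is "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("a restore needs at least one full backup")
    last_full = full_positions[-1]
    later = history[last_full + 1:]

    incrementals = [b for b in later if b[1] == "incremental"]
    differentials = [b for b in later if b[1] == "differential"]
    if incrementals and differentials:
        # Assumption of this sketch: one method per schedule.
        raise ValueError("mixed schedules are not handled here")

    if incrementals:
        # Each incremental holds only the changes since the previous backup,
        # so every one of them is needed, in order.
        return [history[last_full]] + incrementals
    # A differential holds everything changed since the full backup,
    # so only the most recent one is needed (none if the list is empty).
    return [history[last_full]] + differentials[-1:]


week_incremental = [("Sun", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
week_differential = [("Sun", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]

print([label for label, _ in restore_set(week_incremental)])   # ['Sun', 'Mon', 'Tue', 'Wed']
print([label for label, _ in restore_set(week_differential)])  # ['Sun', 'Wed']
```

For the same Wednesday failure, the incremental schedule needs four tapes and the differential schedule needs two. This is the trade-off described above: incremental backups run faster each night, while differential backups restore from fewer tapes.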