IT2204: Systems Administration I 7. Device Management Managing desktops Managing many desktops (workstations) Best done by automation for supported platforms, thus automating the process of managing many PCs. − Main duties for workstations □ □ □ − Loading of system and software Updating system software and applications Configuring network parameters All three must be right □ □ □ Initial load must be consistent across all machines Quick updates 2 Network configuration to be managed centrally Managing many workstations − Best done by automation for supported platforms 3 Machine life cycle Five states and several transitions exist. There is need to plan for the different states and transitions 4 Machine life cycle • Machine states New − A new machine Clean − OS installed, but not yet configured for environment. Configured − Set up (configured) correctly for the operating environment. Unknown − Misconfigured, broken, newly discovered, outdated configuration, etc. Off − Retired computer 5 Machine life cycle • State transitions Build − Transition from new to clean states. − Set up hardware and install OS. Initialize − Configure for environment; often part of build. Update − Install new software. − Patch old software. − Change configurations. 6 Machine life cycle • State transitions Entropy − Undisciplined/ unmanageable changes to configurations Major environment changes Unexplained problems − − Debug − Going back to correct configured state Rebuild − Machine rebuilding □ Possibly due to a major OS revision, □ Drastic changes to be made such that simple updates make no sense 7 Automated Installations • Advantages Saves time/money − Boot the computer, then go do something else. Ensures consistency − No chance of entering wrong input during install. − Avoids user requests due to mistakes in configuration − What works on one desktop, works on all. Allows for fast system recovery − Rebuild system with auto-install vs. slow tapes.8 Automated Installations Full automation better than partial one − − Eliminates prompts in installation scripts Can include complete notification when complete. Partial automation (better than no none) − Needs proper documentation for consistency 9 Vendor Installations • Weaknesses Need to always reload the OS on new machines (you may have your OS preference) − − − You need to configure the host for your environment Eventually you’ll reload the OS on a desktop, leaving you with two platforms to support: the vendor OS install and your OS install. Vendors change their OS images from time to time, so systems bought today have a different OS from systems bought a few months ago. 10 Your own Installations Little trust to vendor' s pre-installed OS − − − Makes adding new apps to your clean install easier as you are installing for your own environment. No need to contact the vendor for installation help. There may be need for special apps or add-ons You may eventually need a re-install □ This may be different from that of the vendor □ Need to make sure required drivers, 11 software are available System and Application Updates At times the software we install may get bugs and • security loop holes. New applications are also launched now and again. We may need to update or upgrade our software. Updates should be updated too Automation systems include: − Solaris autopatch − Windows − Linux package updaters eg yum, apt Updates can be retrieved from the online software 12 update centers for different OS vendors. System and Application Updates • A software update provides bug fixes for features that aren't working right, minor software enhancements and sometimes include new drivers. • Some software updates are free, and are sometimes called a patch because the update is installed over software you're already using and it isn't a full software installation. 13 System and Application Updates • A software upgrade requires a purchase of a new version of your software, usually at a lower price than you would pay if you bought the software for the first time. • Some companies will offer a free update to the latest version if you just recently bought your software, so be sure to register the software when you install it so you know if 14 you qualify for free upgrades. System and Application Updates • A patch is a software update comprised of code inserted (or patched) into the code of an executable program. Typically, a patch is installed into an existing software program. Patches are often temporary fixes between full releases of a software package. • Patches may do any of the following: 15 – – – – – Fix a software bug Install new drivers Address new security vulnerabilities Address software stability issues Upgrade the software • A patch is designed to update a computer program or its supporting data, to fix or improve it. This includes fixing security vulnerabilities and other bugs, and improving the usability or performance. • Though meant to fix problems, poorly designed patches can sometimes introduce new problems. 16 Network configuration Done over the network, typically using DHCP − − − Eliminates time wastage and manual error More secure – only authorized systems have access Centralized control makes updates and changes a lot easier (e.g. new DNS server) 17 Managing Servers Different from workstation − Serves many users − Requires reliability and high uptime (a server should always be on) − Requires tight security − Often expected to last longer − More expensive − Typically has a different OS configuration from desktops − Deployed within a data center − Often has maintenance contracts − Has backup systems 18 − Has better remote access Server HW Buy server HW for servers − − − − specialized hardware offers more internal space offers more CPU performance Offer high performance I/O (both disk and network) − Has more upgrade options − Can be rack mountable For reliability, use known vendors 19 Vendor server product lines • The typical vendor has three product lines: 1. Home 2. Require more internal space 3. (Business) Offer high performance I/O (Server) 20 Vendor server product lines Home − − − Absolute cheapest purchase price Original equipment manufacturer (OEM) components change often focuses on being the lowest cost at the outset. consumers make purchasing decisions based on price. (Any add-on features can be considered later and purchased for a premium cost.) 21 Vendor server product lines Require more internal space (Business) − − − − Longer life, reduced total cost of ownership Fewer component changes concentrate on the total cost of ownership. Businesses tend to keep their computers for a longer time than the average home consumer. Therefore, the manufacturers have to keep a large pool of spare parts for maintenance of these computers. Usually higher quality components than that of the home line computers. 22 Vendor server product lines Offer high performance I/O (Server) − − − Lowest cost performance metric Easier to service components and design The server line tends to focus on the lowest total cost of ownership. For example, a server is designed with higher quality components that will last a great deal of time longer than the home or business line of computers. A server is also designed to be able to process a higher workload than a home line of computers. 23 Maintenance contracts Vendors have variety of service contracts − Customer-purchased spare parts get replaced when they get used up How to select a maintenance contract? − Think of needs □ □ □ Non-critical hosts: next-day or two-day response time is likely reasonable, or perhaps no contract Large groups of similar hosts: use spares approach Controlled model: only use a small set of distinct technologies so that few spare part kits 24 needed. Maintenance contracts How to select a maintenance contract? − Think of needs □ Critical host: stock failure-prone and interchangeable parts (power supplies, hard drives); get same-day contract □ Large variety of models from same vendor: sufficiently large sites may opt for a contract with an on-site technician 25 Data backups Servers often have critical data that must be backed up − − Client data often backed up on server Think of separate administrative network □ Keep bandwidth-hungry backup jobs off of the production network □ Provide alternate access during network problems □ May require additional NICs, cabling, switches etc 26 Servers in the data center Servers should be located in data centers − Data centers provide □ Proper power (enough power, conditioned, UPS, maybe generator) □ Fire protection/suppression □ Networking □ Sufficient air conditioning (climate controlled) □ Physical security 27 Remote administration Data centers are expensive and may be distant from admin office Servers should not require physical presence at a console − − − Typical solution is a console server □ Eliminate need for keyboard and screen □ Can see booting, can send special keystrokes □ Access to console server can be remote (e.g., ssh, rdesktop) Power cycling provided by remote-access power-strips 28 Media insertion & hardware servicing are still problems RAID Redundant Array of Independent Disks RAID Redundant array of independent disks RAID − A system whereby two or more disks are physically linked together to form a single logical, large capacity storage device that offers a number of advantages over conventional hard disk storage devices − Makes many smaller disks appear as one 30 large disk to a server Why RAID? Superior performance and system reliability Improved resilience through increased redundancy Lower costs 31 Why RAID? Performance The parallelism or ability to access multiple disks in the same time allows for the data to be written or read from an array in a faster way than what would be possible in a simple single drive. − Typically used in large file servers, transaction or application servers, where data accessibility is critical, and fault tolerance is required. − Today, RAID is also being used in desktop environments for CAD, multimedia editing and playback where higher transfer rates are needed. Performance is increased because the server has more "spindles" to read from or write to when data is 32 accessed from a drive Why RAID? More resilience This provided allowance for a backup of data in the storage array during failure. The failure of any array can be prevented by swapping out a new drive without turning the system off. The RAID performance depends on the number of drives used in the array. 33 Why RAID? More resilience Redundancy is achieved by either writing the same data to multiple drives (mirroring), or collecting data (parity data) across the array, such that the failure of one or more disks in the array does not result in data loss. A failed disk may be replaced by a new one, and the lost data reconstructed from the remaining data and the parity data. 34 Why RAID? Cheaper Since the main principle involved in the RAID is to provide greater or the same storage capacity to a system in comparison with that for a single drive, there is a high price difference. When self repairing configurations are used that do not need humans to replace drives, storage could become cheaper and more liable. 35 RAID Techniques • Two principal techniques employed: 1. Mirroring − The first implementation of RAID, typically requiring two individual drives of similar capacity. One drive is the active drive and the secondary drive is the mirror − The technique provides a simple form of redundancy for data by automatically writing data to the mirror drive when it is written 36 to the active drive. RAID Techniques • Two principal techniques employed: 2. Striping − This technique provides increased performance. It is a method of mapping data across the physical drives in an array to create a large virtual drive. − The data is subdivided into consecutive segments or stripes that are written sequentially across the drives in the array. 37 RAID levels Several levels of RAID are used RAID 0 – “Disk Striping” − It is technically not a RAID level since it provides no fault tolerance. Data is written in blocks across multiple drives, so one drive can be writing or reading a block while the next is seeking the next block. 38 RAID 0: Striping 39 RAID levels − − − Advantages: higher access rate, and full utilization of the array capacity, easy to implement. Disadvantage: there is no fault tolerance - if one drive fails, the entire contents of the array become inaccessible. Ideal for non-critical systems. 40 RAID levels: RAID 1 RAID 1 – “Disk Mirroring” − − Provides redundancy by storing data twice. Data is written to both the data disk (active) and a mirror disk(s). If one fails, the controller uses either the data drive or the mirror drive for data recovery and continues operation. 41 RAID 1: Mirroring 42 RAID levels: RAID 1 − − − Advantage: it provides the best protection of data since the array management software will simply direct all application requests to surviving disk when one disk fails. Disadvantages: no improvement in data access speed, and higher cost, since twice the number of drives is required. Ideal for critical mission systems 43 RAID levels: RAID 3 RAID 3 − Data blocks are subdivided (striped) and written in parallel on two or more drives. An additional drive stores parity information for error correction/recovery . A minimum of 3 disks is needed for a RAID 3 array. − Since parity is used, a RAID 3 stripe set can withstand a single disk failure without losing data or access to data. 44 RAID 3 45 RAID levels: RAID 3 − Advantages: it provides high throughput (both read and write) for large data transfers; and disk failures do not significantly slow down throughput. − Disadvantages: this technology is fairly complex and too resource intensive to be done in software; performance is slower for random, small I/O operations. 46 RAID levels: RAID 5 RAID 5 – “Data striping with parity” − The most common secure RAID level. − Similar to RAID-3 except that data are transferred to disks by independent read and write operations (not in parallel). − The written data chunks are also larger. − Instead of a dedicated parity disk, parity information is spread across all the drives. A minimum of 3 disks is needed for a RAID 5 array. 47 RAID 5 – “Data striping with parity” 48 RAID levels: RAID 5 RAID 5 – “Data striping with parity” − A RAID 5 array can withstand a single disk failure without losing data or access to data. Although RAID 5 can be achieved in software, a hardware controller is recommended. Often extra cache memory is used on these controllers to improve the write performance. 49 RAID levels: RAID 5 − − − − Advantages: read data transactions are very fast while write data transaction are somewhat slower (due to the parity that has to be calculated). Disadvantages: Disk failures have an effect on throughput, although this is still acceptable; like RAID 3, this is complex technology. RAID 5 is a good all-round system that combines efficient storage with excellent security and decent performance. It is ideal for file and application servers. 50 RAID levels • RAID 0 and RAID 1 combinations Combines the advantages (and disadvantages) of RAID 0 and RAID 1 in one single system. − It provides security by mirroring all data on a secondary set of disks (disk 3 and 4 in the drawing) while using striping across each set of disks to speed up data transfers. 51 52 RAID levels • RAID 0 and RAID 1 combinations RAID 1+0 (or 10) is a mirrored data set (RAID 1) which is then striped (RAID 0), hence the "1+0" name. A RAID 10 array requires a minimum of two drives, but is more commonly implemented with 4 drives to take advantage of speed benefits. RAID 0+1 (or 01) is a striped data set (RAID 0) which is then mirrored (RAID 1). A RAID 0+1 array requires a minimum of four drives: two to hold the striped data, plus another two to mirror the first pair. 53 RAID Implementations Data distribution across multiple drives can be managed either by dedicated hardware or by software. When done in software the software may be part of the OS or it may be part of the firmware and drivers supplied with the card. 55 RAID Implementations • Operating system based (Software RAID) • Software implementations are now provided by many OSs. • A software layer sits above the disk device drivers and provides an abstraction layer between the logical drives (RAIDs) and physical drives. • Most common levels are RAID 0 and RAID 1, followed by RAID 1+0, RAID 0+1, and RAID 5 are supported. 56 RAID Implementations • Software RAID Apple's Mac OS X Server supports RAID 0, RAID 1, RAID 5 and RAID 1+0. FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5 and all layerings of the above via GEOM (main storage framework for FreeBSD OS) modules, as well as supporting RAID 0, RAID 1, RAID-Z, and RAID-Z2 (similar to RAID 5 and RAID 6 respectively), plus nested combinations of those via ZFS a Suns file system and logical volume manager . Linux supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6 and all layerings of the above. 57 RAID Implementations • Software RAID Microsoft's server OSs support 3 RAID levels; RAID 0, RAID 1, and RAID 5. Some of the Microsoft desktop OSs support RAID such as Windows XP Professional which supports RAID level 0 in addition to spanning multiple disks but only if using dynamic disks and volumes. Windows XP supports RAID 0, 1, and 5 with a simple file patch. RAID functionality in Windows is slower than hardware RAID, but allows a RAID array to be moved to another machine with no compatibility issues. NetBSD supports RAID 0, RAID 1, RAID 4 and RAID 5 (and any nested combination of those like 1+0) via its software implementation, named RAIDframe. 58 RAID Implementations • Software RAID OpenBSD aims to support RAID 0, RAID 1, RAID 4 and RAID 5 via its software implementation softraid. OpenSolaris and Solaris 10 supports RAID 0, RAID 1, RAID 5 (or the similar “RAID Z” found only on ZFS), and RAID 6 (and any nested combination of those like 1+0) via ZFS and now has the ability to boot from a ZFS volume on both x86 and UltraSPARC a Suns microprocessor. − Through SVM, Solaris 10 and earlier versions support RAID 1 for the boot filesystem, and adds RAID 0 and RAID 5 support (and various 59 nested combinations) for data drives. RAID Implementations • Hardware RAID Hardware RAID controllers use different, proprietary disk layouts, so it is not usually possible to span controllers from different manufacturers. They do not require processor resources, the BIOS can boot from them, and tighter integration with the device driver may offer better error handling. 60 RAID Implementations • Hardware RAID A hardware implementation of RAID requires at least a special-purpose RAID controller. On a desktop system this may be a PCI expansion card, PCI-e expansion card or built into the motherboard. Controllers supporting most types of drive may be used – IDE/ATA, SATA, SCSI, SSA, Fibre Channel, sometimes even a combination. The controller and disks may be in a stand-alone disk enclosure, rather than inside a computer. The enclosure may be directly attached to a computer, or connected via SAN. 61 RAID Implementations • Hardware RAID Most hardware implementations provide a read/write cache, which, depending on the I/O workload, will improve performance. In most systems the write cache is non-volatile (battery-protected), so pending writes are not lost on a power failure. Hardware implementations provide guaranteed performance, add no overhead to the local CPU complex and can support many operating systems, as the controller simply presents a logical disk to the operating system. 62 RAID Implementations • Reading Assignment What are the advantages and disadvantages of SW RAID? What are the advantages and disadvantages of HW RAID? 63 RAID and Backup! • RAID is no substitute for back-up! All RAID levels except RAID 0 offer protection from a single drive failure. A RAID 6 system even survives 2 disks dying simultaneously. For complete security there is need to back-up the data from a RAID system. − A back-up comes in handy if all drives fail simultaneously because of a power spike. − It is a safeguard if the storage system gets stolen. − Back-ups can be kept off-site at a different location. This can come in handy if a natural disaster or fire 64 destroys your workplace. RAID and Backup! • RAID is no substitute for back-up! − The most important reason to back-up multiple generations of data is user error. If someone accidentally deletes some important data and this goes unnoticed for several hours, days or weeks, a good set of back-ups ensure you can still retrieve those files. − **Read about existing backup strategies 65 Redundant Power supplies • Power supplies are the 2nd most failureprone part • Ideally, servers should have RPSs – – – The server will still operate if one power supply fails Should have separate power cords Should draw power from different sources (e.g., separate UPS) 66 Hot-swap components • Redundant components should be hotswappable New components can be added without downtime – Failed components can be replaced without outage • Hot-swap components increases cost – – But consider cost of downtime • Always check – – – Does OS fully support hot-swapping components? What parts are not hot-swappable? 67 How long/severe is the service interruption? But Servers are expensive ! • Is there an alternative? • Server appliances – – Dedicated-purpose, already optimized Examples: file servers, web servers, email, DNS, routers, etc. • Many inexpensive workstations – – – Common approach for web services □ Google, Hotmail, Yahoo, etc. Use full redundancy to counter unreliability Can be useful (but need to consider total costs, e.g., 68 support and maintenance, not just purchase price) Managing Services • Services distinguish a structured computing environment from a bunch of standalone computers • Larger groups are typically linked by shared services that ease communication and optimize resources • Typical environments have many services – – DNS, email, authentication, networking, printing Remote access, license servers, DHCP, software repositories, backup services, Internet access, file service 69 Managing Services • Providing a service means – – – – Not just putting together hardware and software Making service reliable Scaling the service Monitoring, maintaining, and supporting the service 70 Provide Good Solid Services • Get customer requirements – Reason for service □ □ □ □ – Define a service level agreement (SLA) □ □ □ – How service will be used Features needed vs. desired Level of reliability required Justifies budget level Enumerates services Defines level of support provided Response time commitments for various kinds of problems Estimate satisfaction from demos or small usability trials 71 Provide Good Solid Services • Get operational requirements – What other services does it depend on? □ Only services/systems built to same standards or higher □ Integration with existing authentication or directory services? – How will the service be administered? – Will the service scale for growth in usage or data? – How is it upgraded? Will it require touching each desktop? – Consider high-availability or redundant hardware – Consider network impact and performance for remote users • Revisit budget after considering operational concerns72 Provide Good Solid Services • Consider an open architecture – e.g. open protocols and open file formats – Proprietary protocols and formats can be changed, may cause dependent systems/vendors to become incompatible – Beware of vendors who “embrace and extend” so that claims can be made for standards support, while not providing customer interoperability – Open protocols allow different parties to select client vs. server portions separately – Open protocols change slowly, typically in upward compatible ways, giving maximum product choice – No need for protocol gateways (another 73 system/service) Provide Good Solid Services • Favor simplicity over complexity KISS Keep it Simple Stupid – Simple systems are more reliable, easier to maintain, and less expensive – Typically a features vs. reliability trade-off Take advantage of vendor relationships – – – – Provide recommendations for standard services Let multiple vendors compete for your business Understand where the product is going Attempt to favor vendors who develop natively on your platform (not port to it) 74 Provide Good Solid Services • Machine independence – – – • Clients should access services using generic names □ e.g., www, calendar, pop, imap, etc. Moving services to different machines becomes invisible to users Consider (at the start) what it will take to move the service to a new machine Supportive environment – – Data center provides power, AC, security, networking Only rely on systems/services also found in data center (within protected environment) 75 Provide Good Solid Services • Reliability – – Build on reliable hardware Exploit redundancy when available □ – Plug redundant power supply into different UPS on different circuit Components of service should be tightly coupled □ Reduce single points of failure – e.g., all on same power circuit, network switch, etc. □ Includes dependent services – e.g., authentication, authorization, DNS, etc. 76 Provide Good Solid Services • Reliability – – Make service as simple as possible Independent services on separate machines, when possible like having a DNS service independent of a Mail service. □ But put multiple parts of single service together 77 Provide Good Solid Services • Restrict access – Customers should not need physical access to servers □ □ • Fewer people Eliminate any unnecessary services on server (security) Centralization and standards – Building a service = centralizing management of service – May be desirable to standardize the service and centralize within the organization as well □ □ Makes support easier, reducing training costs Eliminates redundant resources 78 Provide Good Solid Services • Performance – – If a complicated service is deployed, but slow, it is unsuccessful Need to build in ability to scale □ □ – – – Can't afford to build servers for service every year Need to understand how the service can be split across multiple machines if needed Estimate capacity required for production (and get room for growth) First impression of user base is very difficult to correct When choosing hardware, consider whether 79 service is Disk I/O, memory, or network bound Provide Good Solid Services • Monitoring – – – • Helpdesk or front-line support must be automatically alerted to problems Customers that notice major problems before sysadmins are getting poor service Need to monitor for capacity planning as well Service roll-out – First impressions □ □ □ Have all documentation available Helpdesk fully trained 80 Use slow roll-out (helps clients adjust to service) Q&A