Chapter 6: Assuring Reliable and Secure IT Services

The inherent reliability of internetworks:
The inherent reliability of internetworks is due to US Department of Defense research that led to technologies robust enough to withstand a military attack.
The key to this inherent reliability is redundancy: the exceptionally large number of potential paths a message can take between any two points in a network. Messages are routed around network problems.
However, some components of a firm’s infrastructure are not inherently reliable. For example, the reliability of processing systems depends on how they are designed and managed.
Reliability through redundancy comes at a price: extra equipment to guard against failures.
How much reliability to build in is a management decision contingent on numerous, mainly business, factors:
  Tangible factors: direct revenue losses, such as how costly is a 15-minute failure of the order management system?
  Less tangible factors: how many customers frustrated by the outage will never return? How likely is such an event to happen?
Redundant systems are more complex and difficult to manage than nonredundant systems:
  Charles Perrow: failures are inevitable in tightly coupled complex systems.
  Precautions such as adding redundancy create new categories of accidents by adding complexity.

Malicious threats to computing infrastructure:
Hackers range from pranksters to organized criminals to international terrorists.
It’s an arms race requiring constantly improving defenses against increasingly sophisticated weaponry.
Attacks are often automated and systematic, carried out by wrecking routines that probe for vulnerabilities and inflict damage randomly.

A. Availability Math
Reliability of computing infrastructure = availability of a specific information technology service, expressed as a percentage. Ex: 98% availability, which is equivalent to about half an hour of downtime per day.
A business’s tolerance for outages varies by system and situation.
Downtime in large chunks of time is more serious than short-term outages.
Predictable downtime is more manageable than random downtime.
Availability for real-time infrastructure is usually expressed in terms of a “number of nines”: 5 nines means 99.999% availability, which is less than a second of downtime in a 24-hour day, or a little over a minute in 3 months.
Service availability is generally lower than the availability of individual components, and it decreases as components are added in a series.

1. The Availability of Components in Series
IT service availability degrades severely as components are added in a chain.
By the time 15 devices that are each 98% available are added in a series, downtime exceeds 25%.
If even one component in the chain fails, 99.999% availability cannot be maintained.

2. The Effect of Redundancy on Availability
Solution: components connected in parallel in the provision of an IT service.
With five components in parallel, any one of which can support the service, all five must fail at the same time for the combination to fail.
Availability increases when components that are 98% available are combined in parallel.

B. High-Availability Facilities

1. Uninterruptible Electric Power Delivery
Redundant power is provided to each piece of computing equipment housed in these facilities, for example, more than one power cable for each computer.
Power distribution inside the facility is fully redundant and includes uninterruptible power supplies (UPSs) to maintain power even if power delivery to the facility is interrupted. UPSs can employ batteryless, flywheel-based technologies.
Connections to outside sources of power are redundant: facilities access two utility power grids.
Diesel generators stand by for backup power generation, and on-site fuel tanks contain fuel for a day or more of operation.
Plans are in place for high-priority access to additional fuel in case of a lengthy primary power outage (e.g., delivery by helicopter).
High-end data centers may obtain primary power from on-site power plants, with first-level backup from local utility power grids and second-level backup from diesel generators.

2. Physical Security
Security guards in bulletproof enclaves protect points of entry and patrol facilities regularly.
Closed-circuit TV monitors critical infrastructure and provides immediate visibility into any area of the facility from a security desk.
Access to internal areas requires photo ID and presence on a prearranged list.
Entry is through a single-person buffer zone, with integrated metal and explosive detection, that can be locked down.
Motion sensors and biometric scanners (retinal, palm, voice recognition) are in place.
The building that houses the data center is not shared with other businesses.
The building is “hardened” against external explosions, earthquakes, and other disasters.

3. Climate Control and Fire Suppression
Redundant heating, ventilating, and air-conditioning equipment monitors and maintains suitable temperature conditions.
Mobile cooling units.
Integrated fire suppression systems.

4. Network Connectivity
External connections to Internet backbone providers are redundant, involving at least two backbone providers, and enter the building through separate points.
Agreements are made with backbone providers that permit significant percentages (i.e., 50-90%) of network traffic to travel from origin to destination across the backbone company’s private network, avoiding often-congested public Internet junctions.
A 24x7 network operations center (NOC) is staffed with network engineers who monitor the connectivity infrastructure of the facility.
A redundant NOC on another site is capable of delivering services of equal quality to those provided by the primary NOC.
24x7 assistance to customers.
Automated problem-tracking systems are integrated with similar systems at service delivery partner sites.

5. N + 1 and N + N Redundancy
N + 1 redundancy (99.9%) is the minimum level that must be maintained for mission-critical components (for each type of critical component, at least one unit is standing by).
N + N redundancy (99.999%): twice as many mission-critical components as are necessary to run a facility.
Facilities are categorized according to the level of uptime they support:
  o Level 1 data centers: N + 1 redundancy, available 99-99.9% of the time.
  o Level 2 and 3 data centers: more.
  o Level 4 data centers: N + N or better, achieving uptimes of 99.999-99.9999%. Downtime is seconds per year, unnoticeable to most users.
High levels of availability are costly.
  o Increasing the availability of a single web site from 99% to 99.999% costs millions of dollars.
  o A 99.999% availability data center costs 3 or 4 times more than one capable of 99 to 99.9% availability.
  o Management decisions about the design of IT infrastructures involve tradeoffs between availability and the expense of additional components.

C. Securing Infrastructure against Malicious Threats
99% of companies and government agencies in a 2003 survey had detected security breaches in the last 12 months.
Who are the hackers? Thrill seekers, those who have a specific grudge against a company, those seeking a company’s proprietary data, and terrorists.

1. Classification of Threats
External Attacks
  o Attacks against computing infrastructure that harm it or degrade its services without actually gaining access to it.
  o Denial of service (DoS) attacks disable infrastructure devices by flooding them with an overwhelming number of messages that the targeted computer cannot handle.
  o Hackers send packets that originate from multiple locations on the Internet or that appear to originate from multiple locations.
  o Distributed denial of service (DDoS) attacks are carried out by automated routines secretly deposited on Internet-connected computers whose owners have not secured them against intrusion (this includes a large percentage of DSL and cable modem-connected PCs). Spoofing occurs when hackers provide packets with false origin addresses that mislead filtering software at a target site.
  o Anyone can hack: DoS attack routines can be downloaded from the Internet and are as easy to use as email. DDoS and spoofing attacks are a little more complicated, but require no technical expertise.
  o DoS attacks are very difficult to defend against. It is relatively simple for attackers to vary their patterns of attack and make them look like legitimate e-commerce traffic.
  o Even when DoS attacks do not cause outages, they affect infrastructure performance, waste company resources, and reduce customer satisfaction.
Intrusion
  o Intrusion attacks gain access to a company’s internal IT infrastructure:
      By obtaining user names and passwords. People tend to use the same user name/password in multiple systems. Social engineering is used to get people to divulge privileged info, such as over the phone.
      By acquiring passwords by eavesdropping on network conversations with “sniffer” software.
      By exploiting vulnerabilities left in software when it was developed to gain access to systems without first obtaining passwords. Computers are “port scanned” within a few minutes of connecting to the Internet. Hackers can also use automated routines that systematically scan IP addresses and report back to their masters which addresses contain exploitable vulnerabilities.
  o Once inside, intruders have the same rights of access and control over systems and resources as legitimate users. They can steal info, erase or alter data, deface web sites, pose as company representatives, and deposit time bombs. It is very difficult to figure out what an intruder may have done.
Viruses and Worms
  o Malicious software programs that replicate themselves to other computers.
  o Distinguished by their degree of automation and ability to replicate across networks.
  o Viruses require assistance (often inadvertent) from users to replicate and propagate (e.g., opening a file attached to an email message or even opening a web page), whereas worms replicate and move across networks automatically.
  o Danger: they can incorporate and automate other types of attacks, such as DoS attacks.

2. Defensive Measures
Security Policies: company policy should specify what is appropriate and inappropriate:
  o What kinds of passwords are to be used, and how often should they be changed?
  o Who is allowed to have accounts on company systems?
  o What security features must be activated before a computer can connect to a network?
  o What services are allowed to operate inside a company’s network?
  o What are users allowed to download?
  o How is the security policy enforced?
Firewalls
  o A collection of hardware/software designed to prevent unauthorized access to a company’s internal computer resources.
  o Located at points of maximum leverage within a network, typically at the point of connection between a company’s internal network and the external public network.
  o Some work by filtering packets coming from outside the company before passing them along to computers inside the company’s facilities.
  o Others use a sentry computer that relays info between internal and external computers without allowing external packets direct entry.
  o They are excellent points at which to collect data about traffic moving inside and outside networks.
  o They can be used to divide an internal network into segments, so that an intruder who penetrates one part cannot access the rest.
  o They conceal internal network configurations from external prying.
  o Limitations of firewalls: they provide no defense against malicious insiders or against activity that does not traverse the firewall (ex: traffic that enters a network via an unauthorized dial-up modem behind the firewall).
Authentication
  o The variety of techniques and software used to control who accesses elements of computing infrastructures.
  o Host authentication controls access to specific computers (hosts).
  o Network authentication controls access to regions of a network.
  o Both types are used together.
  o Strong authentication: passwords expire regularly, and forms of passwords are restricted to make them harder to guess. User name/password plus one other factor, such as certificate authentication or biometric verification of identity.
Encryption
  o Renders the contents of electronic transmissions unreadable by anyone who might intercept them.
  o Legitimate recipients can decrypt transmission contents by using a piece of data called a key.
  o The key must be kept secret and protected from social engineering, physical theft, and insecure transmission.
  o By setting up encryption at both ends of a connection across public networks, a company can extend its secure private network, creating a virtual private network.
      Ex: Public-private key encryption: one unique key (the public key) is used to transform a plain text message into encrypted form, and a different one (the private key) is used to decrypt the message back into plain text at its destination.
  o Limitations of encryption: hackers can still gain useful info from the pattern of transmission, message lengths, and origin or destination addresses. Hackers can still intercept and change data in a transmission.
Patching and Change Management
  o Keeping track of the variety of systems in a company’s infrastructure, their security weaknesses, the available patches, and whether they have been applied is very important.
  o Best practice calls for keeping detailed records of all files that are supposed to be on production computers.
  o However, in many cases shortcuts are taken in formal change management procedures, resulting in a gap in formal knowledge about what files and programs should be present on company systems.
Intrusion Detection and Network Monitoring
  o Help network administrators recognize when infrastructure is or has been under attack.
  o Network monitoring automatically filters out external attack traffic at company network boundaries.
  o Intrusion detection systems include combinations of hardware probes and software diagnostic systems that log activity throughout company networks and highlight patterns of suspicious activity.
  o They provide information that can help to reconstruct exactly what an intruder did.

3. A Security Management Framework: principles of security management
Make deliberate security decisions
Consider security a moving target
Practice disciplined change management
Educate users
Deploy multilevel technical measures, as many as are affordable

4. Risk Management of Availability and Security
Risks must be characterized and addressed in proportion to their likelihood and potential consequences.
Management actions to mitigate risks must be prioritized according to costs and potential benefits.
One method of prioritizing involves computing the expected loss associated with incidents by multiplying the probability of an incident by its cost if it occurs.
The logic of risk management can be very complex. Sometimes, managers address high-cost risks first, even though their likelihood of occurrence is very low.
Intangible aspects of risk are also hard to define.
Risk is often also a factor in acquiring new technology, since it can affect security and availability.

5. Incident Management and Disaster Recovery: steps (described in 6, 7, and 8) that need to be taken before, during, and after an incident.

6. Managing Incidents before They Occur
Sound infrastructure design
Disciplined execution of operating procedures
Careful documentation
Established crisis management procedures
Rehearsing incident response

7. Managing during an Incident
Psychological obstacles humans have to deal with in crises:
  Emotional responses, including confusion, denial, fear, and panic
  Wishful thinking and groupthink
  Political maneuvering, diving for cover, and ducking responsibility
  Leaping to conclusions and blindness to evidence that contradicts current beliefs
  Public relations inhibition: managers don’t want to admit the seriousness of a problem

8. Managing after an Incident
After an incident, infrastructure managers may need to rebuild parts of the infrastructure.
Well-documented configurations and procedures are necessary for recovery.

Questions for Discussion:
1. Why are internetworks inherently reliable? Why are some components of a firm’s infrastructure not necessarily reliable?
2. Discuss the availability of components in a series and the effect of redundancy on availability.
3. Discuss the steps companies must take to ensure that their facilities are high-availability.
4. Discuss N + 1 and N + N redundancy and the way facilities are categorized according to the level of uptime they support. Include the costliness of high levels of availability.
5. Discuss the three major categories of threats to computer infrastructure, the dangers of each, and why they are hard to prevent.
6. Describe how security policies, firewalls, authentication, and encryption are used as security measures.
7. Describe security measures that should be taken before, during, and after a security breach incident.
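
Worked example for Question 2 (a sketch, not part of the original chapter notes): the availability figures in section A and the expected-loss rule in section C.4 can be checked with a few lines of Python. The function names and the sample risk register below are hypothetical, and the formulas assume that components fail independently of one another, which the notes do not state explicitly.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain: the service is up only if every component is up."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant components: the service is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_minutes_per_day(availability):
    """Expected minutes of downtime in a 24-hour day."""
    return (1 - availability) * 24 * 60


def expected_loss(probability, cost):
    """Expected loss of an incident: probability of occurrence times cost if it occurs."""
    return probability * cost


# Figures from section A, assuming independent failures (an assumption of this sketch).
chain = series_availability([0.98] * 15)        # 15 components in series
redundant = parallel_availability([0.98] * 5)   # 5 components in parallel

# Hypothetical risk register: (name, annual probability, cost in dollars).
risks = [
    ("web site defacement", 0.20, 50_000),
    ("order system outage", 0.05, 1_000_000),
    ("data center fire", 0.001, 20_000_000),
]
# Rank risks from highest to lowest expected loss.
ranked = sorted(risks, key=lambda r: expected_loss(r[1], r[2]), reverse=True)

if __name__ == "__main__":
    print(f"15 in series:  {chain:.4%} available")
    print(f"5 in parallel: {redundant:.7%} available")
    print(f"98% available: {downtime_minutes_per_day(0.98):.1f} minutes down per day")
    for name, p, cost in ranked:
        print(f"{name}: expected loss ${expected_loss(p, cost):,.0f}")
```

Under these assumptions the 15-component chain comes out at roughly 73.9% availability, which matches the "downtime exceeds 25%" figure in section A.1, and 98% availability works out to about 28.8 minutes of downtime per day. In the made-up risk register, the order system outage ranks first by expected loss even though it is neither the most likely nor the most costly incident, which is the kind of result that makes managers sometimes override a pure expected-loss ranking.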