Server Infrastructure Failure Guide

It is essential to remember that most server infrastructure is virtualized. A single virtual host can run multiple virtual servers, so if that host goes offline, every server on it can go offline too unless you have appropriate redundancy in place.

To prevent these failures from causing prolonged and unnecessary downtime, you need to put a few things in place. First, make sure that all of your production hardware is under warranty, or that you have a service provider that can supply parts for your hardware at a moment's notice. You can also keep a selection of spare parts, such as power supplies and hard drives, on-site so that small failures are fixed quickly and don't become big problems further down the line.

Most servers have multiple hard drives and power supplies, so the server can keep operating if one of them fails. These parts can also usually be replaced without powering the server down, which is where the term "hot-swappable" comes from.

Type 3: Hard Drive Failures

Although we mentioned hard drives earlier, it is essential to understand a few critical things about how your company's servers and their hard drives work together. A traditional server usually has multiple hard drives that connect directly to its motherboard. To help with performance and redundancy, they are typically configured in a RAID array. With a redundant RAID level, the system can continue to run without data loss if one drive fails. When a replacement drive is installed, the missing data is rebuilt onto that drive.

With virtual machines, there is usually a dedicated unit that stores all of the virtual hosts' hard drives, generally on a fiber network. This is known as a Storage Area Network (SAN), and it also uses RAID to help with performance and redundancy.
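The way a redundant array rebuilds a failed drive can be illustrated with XOR parity, the idea behind RAID 5. The sketch below is a simplified illustration, not how any particular RAID controller or SAN is implemented: the three "drives" and their contents are hypothetical, and a real array stripes and rotates parity across all of its disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: one block of data on each of three data drives.
data_drives = [b"ACCT", b"MAIL", b"WEB1"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_drives)

# Simulate losing the second drive, then rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_drives[1])
```

Because XOR cancels out any block that appears twice, combining the survivors with the parity block leaves only the missing block. The same property explains why a single-parity array cannot recover if a second drive fails before the rebuild completes.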
Hard drive failures on a SAN can be disastrous if data cannot be recreated on replacement drives, because the impact can spread to many different servers, services, and applications.

Example: Outdated Equipment

Striking a balance between cost-effective computer infrastructure and technology improvements can be tricky. Because organization-wide upgrades are expensive, it is not uncommon to see some hardware stay in service beyond its warranty period. This often happens with legacy systems that are no longer supported, where building a replacement system would take a lot of resources.

When outdated equipment fails, it has two effects. The first is a scarcity of skill and supply. If your hardware has not been manufactured for some time, then the odds of quickly finding replacements are not great. The same applies to the skills and expertise required to install replacement hardware on failed legacy systems. The second is cost: this scarcity drives up the price of repairs and makes such equipment very expensive to maintain. Whenever systems start to reach the end of their life, you need to start planning suitable replacements before you run into more significant problems further down the line.

Example: Failed/Incompatible Firmware Upgrades or Patches

To keep your hardware running effectively, manufacturers release firmware upgrades and patches from time to time. Firmware is the low-level code that tells your hardware how its components interact and what information is available to the operating system. If firmware becomes corrupt on a device, the result is usually that the device becomes "bricked." This means that it won't power on correctly or, if it does power on, it can't do much else. Firmware patching can fail when a device loses power midway through the process or when a communications cable is accidentally removed mid-process.
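One check that can be run before a flash starts is comparing the firmware image against a checksum published by the vendor. The sketch below is a minimal illustration, assuming the vendor publishes a SHA-256 checksum alongside the image. The function names and the sample image bytes are hypothetical and do not come from any real flashing tool.

```python
import hashlib


def sha256_hex(image: bytes) -> str:
    """Return the SHA-256 checksum of a firmware image as a hex string."""
    return hashlib.sha256(image).hexdigest()


def safe_to_flash(image: bytes, published_sha256: str) -> bool:
    """Allow the flash only if the image matches the published checksum."""
    return sha256_hex(image) == published_sha256.strip().lower()


# Stand-in for a downloaded firmware file (hypothetical contents).
firmware_image = b"FW-2.1 example payload"

# In practice this value would come from the vendor's release notes;
# here it is computed from the good image purely for illustration.
published = sha256_hex(firmware_image)

# Simulate a download that was damaged in transit.
corrupted_image = firmware_image[:-1] + b"\x00"

print(safe_to_flash(firmware_image, published))
print(safe_to_flash(corrupted_image, published))
```

A checksum only catches a damaged or altered file. It does not protect against power loss during the flash or against an image built for different hardware, so it complements rather than replaces the other precautions.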
Sometimes a corrupt firmware file is flashed to a device anyway, or a device accepts a flash that was meant for incompatible hardware, leaving it inoperable. The flashing software usually runs extensive checks before starting, but errors can still occur.

Software patches to an operating system also have the potential to break certain services on a server. Microsoft has had many examples of hotfixes needing to be released to fix bugs introduced by Windows Updates, although this has gotten considerably better over the years. The best way to prevent bad software patches from affecting your systems is to test them in isolation on a test network before deploying them to a live system. Any bugs or performance issues should be noted before you begin rolling the patches out to the rest of your network.

Example: Overheating

Overheating can cause severe damage to your server infrastructure, and it can also introduce strange, hard-to-diagnose errors. Your data center needs proper cooling, especially if you have many populated server racks generating heat. Most server rooms have a dedicated hot and cold aisle system that directs cool air into the servers and exhausts hot air into a channel that carries it out of the room. If these systems are not running correctly, the hardware in that room will be affected by heat, which has short- and long-term effects on both the systems' operation and their lifespan.

The key takeaway is that your cooling systems always need to be running efficiently. To accomplish this, you must have set service intervals for your server room's cooling components. Fans, ducts, and filters on all of your servers and rack-mounted equipment need to be maintained on a schedule as well. Over time, dirt will block the airflow of any device with one or more fans installed, and this airflow path needs to be kept clean for the best cooling performance.
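Set service intervals are easier to keep when something flags overdue items automatically. The sketch below is one simple way to do that, under stated assumptions: the component names, dates, and intervals are made up for illustration, and real intervals should come from your equipment vendors' recommendations.

```python
from datetime import date, timedelta

# Hypothetical service log: (component, last serviced, interval in days).
SERVICE_LOG = [
    ("Room cooling unit A", date(2024, 1, 10), 90),
    ("Rack 4 fan filters", date(2024, 3, 1), 30),
    ("Exhaust duct", date(2024, 1, 5), 180),
]


def overdue(log, today):
    """Return the components whose next service date has already passed."""
    return [
        name
        for name, last_serviced, interval_days in log
        if last_serviced + timedelta(days=interval_days) < today
    ]


for name in overdue(SERVICE_LOG, today=date(2024, 4, 15)):
    print(f"Service overdue: {name}")
```

Passing the current date in as a parameter, rather than reading it inside the function, keeps the check easy to test and to run against past or future dates.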
Type 4: Software Failures

Software failures can occur for many different reasons. A license can expire, a configuration file can go missing or become corrupted, a bad software update can cause issues, or a software bug can introduce them. The list is almost endless.

Another factor to consider is whether the failed software is an off-the-shelf solution purchased from a vendor or an application developed in-house. The time it takes to get the software running again depends on who built it and whether they are available to provide support. The result is usually the same: the software is not working, and many business functions are impacted. If the application performs an essential network function, its failure may affect everything else and cause further issues until it is fixed.

If you are still rolling out your updates manually, then any fix will also need to be implemented and rolled out manually across the business. This is avoidable with automation tools, so it is vital for companies that rely heavily on software to automate as many of these processes as possible.

Perhaps a software update has been introduced to your environment without being validated, causing bugs and errors. You will need to visit each workstation or node where the update has been applied and either (1) apply a patch or fix or, in some cases, (2) roll back to the known good version of the software. It is always an excellent idea to have a rollback plan when implementing software updates.

To prevent these kinds of failures, follow a proper testing and validation plan for all of your software updates. Many companies maintain a testing environment that mimics most of their mission-critical systems, where any updates that are applied can be closely watched and documented.
That way, when the rollout happens in your live environment, you won't be met with any surprises.

Type 5: Human Failures

Unfortunately, human error can still cause issues on your network. Anything from accidental hardware or cable damage to poorly configured network devices can cause downtime. If your network is not maintained regularly, you can also expect network failures to occur.

Whenever a device is added to or removed from your network cabinet, cable management must be adhered to, and documentation must be updated. Sometimes a network failure occurs because cables are not correctly labeled and one is accidentally removed. Even worse, when an unlabeled cable is removed, it takes that much longer to find and fix the fault. It is for this reason that network maintenance must be carried out at regular intervals.

Another major cause is bad changes made to the network environment. Changes such as VLAN configurations, routing, and IP address configurations that are not tested before being deployed can produce unexpected results. Network changes and network maintenance must not be treated as an afterthought and must instead be scheduled and carried out regularly.

Type 6: Security Failures

Security could be thought of as an extension of human error, but many other variables also play a role. Security should be considered an active measure that is implemented from the very start of a project and maintained throughout its life cycle. If you don't actively work to protect your environment, you open yourself up to unnecessary risks.

DDoS (Distributed Denial of Service) attacks have become a standard method of attack used by cybercriminals. The method involves using thousands of hosts to send requests to a website or server.
The unexpected load can sometimes take such a service offline, meaning that the business cannot continue to operate until it is brought back up. There are ways to mitigate this. Modern solutions can detect DDoS attacks and then reroute that traffic to another data center or network appliance, where the data packets cannot reach the intended target. Some internet service providers and hosting providers offer this kind of protection, so it is a good idea to find out whether you can integrate such a solution into your online services.

Another point of entry is malware and viruses. Antivirus protection has become commonplace in many organizations, but you need a clear IT policy to use this technology effectively. You can have the best security software packages in the world, but if your users are not following the rules, they will not be effective at all.

A lack of user awareness and training around social-engineering attacks and phishing scams can also lead to enormous security risks within an organization. Teaching your employees how to identify and avoid such scams and attacks will help protect your company from losing valuable data.

Data Loss Prevention (DLP) is an area that many companies are starting to invest in to retain intellectual property and sensitive data. DLP solutions can scan all outgoing data, such as emails and attachments, for specific keywords and phrases that relate to the protected data you are trying to keep from being exfiltrated from your organization. More advanced solutions can also search the metadata of files that contain proprietary data, making it easier to identify those responsible for trying to send out your data.

Proper physical security needs to be implemented at all of your sites to ensure that hardware and other IT assets are not removed without authorization.
The Impact of Network Failures

Network failures mean possible network downtime. Depending on the company, this could mean a loss of thousands or tens of thousands of dollars per hour. In highly competitive markets with external-facing internet services, you can also lose customers, who will use a competitor while your company is offline.

The Business Costs of Network Failures

Many different factors can cause a network failure, and each has its own potential cost. Some of these costs are financial, but there are other things to consider, such as reputational damage or the tarnishing of your company's brand.

If your company experiences a catastrophe that destroys a data center, then you are probably looking at millions of dollars' worth of damage. If you have off-site backups for all of your data, configuration files for all of your devices, and a proper disaster recovery plan in place, you are not out of the race just yet. You will have to rely on your teams to execute the disaster recovery plan to get your services back up and running, even if only in a reduced form. If you don't have a plan to recover from such a failure, then you might not be able to recover from that kind of worst-case disaster at all.

Hardware costs are the most obvious concern for many people, and that is understandable.
Servers, switches, routers, uninterruptible power supplies, and anything else required to run your IT operations are very costly.

You might not think of software as carrying a cost when your network fails, but there are a few things to consider. If your company has its own proprietary software and systems, then recovering from a complete failure will have its own unique set of challenges. You will need to protect all of your development resources, such as source code and repositories. With those protected, if your production environment is compromised, you can spin up another instance and restore data to it.

Your customers are vital to your operations, and so is their data. Losing critical data through software or hardware failures can be challenging to recover from. Backups help to make sure that your customers experience as little frustration as possible. If your customers are unable to use your services, or if you are not able to help them recover their data, then you risk damaging your company's reputation.

Intellectual property loss can be a considerable cost factor, especially if that data is exposed online through a cyber attack or exfiltration. You need to make sure that sensitive intellectual property is stored in a safe off-site location that you can still access during a network failure.

If your company operates in a space where compliance and reporting penalties could be a factor, you will need safeguards to minimize the impact that a network failure could have on your organization.

Brand damage can occur if your customers are negatively affected by a network or system failure. Your competitors will take advantage of your downtime, and once a customer decides to jump ship, you will almost always struggle to win them back, even after your systems have recovered.
The Human Costs of Network Failures

While your staff is battling to get your network back up and running, other areas are naturally going to suffer. These interruptions take your team away from productive work and create backlogs in other business areas. They also have the unintended consequence that your support staff suddenly have to respond to customer complaints. This is difficult because technical staff don't necessarily have the same customer-focused skills as customer-facing staff, which can lead to miscommunication.

In most cases, your marketing and sales staff will need to reach out to your customers so that they understand what is happening and how the fix is progressing. Again, this takes staff away from their core business roles and creates more backlogs throughout the business.

If overtime is needed to get your systems back up and running, you will incur additional labor costs. If you don't have the necessary in-house skills, then bringing in a service provider or subcontractor to assist will also cost extra.

Your teams will also experience increased stress levels while the problems are ongoing. Remember to give them some time off after all of the issues are fixed, as fatigue and stress can be real productivity killers afterward.

How to Prevent Network Failures

Now that we know what causes network and system failures, we can start to prepare for them. Nobody likes to fixate on the negatives, but you have to plan for the worst when it comes to network failure prevention.

Your first port of call should be preparing a disaster recovery plan. We could write an entire article on what such a plan needs to consider, so we will touch on only a few basics here. Your staff needs to be trained with a series of test drills to make sure that everybody knows what to do in an emergency.
These test runs need to be carried out often so that your teams are ready to spring into action. Part of this preparation is documenting your disaster recovery procedures. The documentation can take the form of playbooks, step-by-step guides, and any other resources that will help your teams get the job done.

To detect issues before they turn into a massive network outage, you need to monitor your systems continuously. This can be as elaborate as a fully integrated network operations center or as simple as a single workstation with monitoring software, as long as your support staff can see a problem before it becomes an outage.

We've gone over a few different solutions that can help you prevent these outages, such as uninterruptible power supplies and inverters, backup solutions, and fire or flooding protection. Your critical servers must be part of high-availability groups for redundancy. Combined with virtualization, this lets you keep downtime to a minimum, with little to no interruption to service. Other things to look at include a comprehensive security training program and defensive security systems.

Final Thoughts

We've covered many different areas in this article and looked at many essential practices that should be part of your day-to-day operations. The main takeaway is this: planning is only half of the battle. Implementing your plan is just as tricky. To accomplish this, you need to create and document your plan and make it accessible to everyone who needs to know what is in it. Once you have all of the details figured out, you need to practice it at set intervals to make sure that, when the time comes, your team is ready to act at a moment's notice. Cybersecurity training and workforce security awareness are also paramount.
