Uploaded by Duc Trung Dang

Fault management

It is essential to consider that most server infrastructure is virtualized. For one virtual
host, there can be multiple virtual servers operating inside of it. This means that if your
virtual host goes offline, then you will also potentially have multiple servers going offline
if you do not have appropriate redundancy in place.
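As a rough illustration of the redundancy point above, the following sketch lists services whose virtual servers all sit on a single virtual host. The inventory format, the service names, and the host names are hypothetical and only serve the example; a real environment would pull this data from its virtualization platform.

```python
from collections import defaultdict


def single_host_services(placement):
    """Return services whose virtual servers all run on one virtual host.

    `placement` maps a virtual server name to a (service, host) pair.
    """
    hosts_by_service = defaultdict(set)
    for _vm, (service, host) in placement.items():
        hosts_by_service[service].add(host)
    return sorted(s for s, hosts in hosts_by_service.items() if len(hosts) == 1)


# Hypothetical inventory: every name here is illustrative only.
placement = {
    "web-1": ("web", "host-a"),
    "web-2": ("web", "host-b"),
    "db-1": ("database", "host-a"),
    "db-2": ("database", "host-a"),
}

print(single_host_services(placement))  # ['database']
```

In this made-up inventory, losing host-a would take the whole database service offline, while the web service would keep one server running on host-b.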
To prevent these kinds of failures from causing prolonged and unnecessary downtime,
you need to make sure that a few things are put in place. First, you need to make sure
that all of your production hardware is under warranty, or that you have a service
provider that can supply the parts for your hardware at a moment's notice. You
can keep a selection of hot spares such as power supplies and hard drives on-site to
help clear errors and ensure that small failures don't become big problems further down
the line.
Most servers have multiple hard drives and power supplies so that the server can keep
operating if one of them fails. These parts can also usually be replaced without the
server needing to be powered down. This is where the term "hot-swappable" comes from.
Type 3: Hard Drive Failures
Although we mentioned hard drives earlier, it is essential to understand a few critical
things about how your company's servers and their hard drives work together. There are
usually multiple hard drives on a traditional server that connect directly to a server's
motherboard. To help with performance and redundancy, they are typically configured in
a RAID array. With a redundant RAID level (such as RAID 1, 5, or 6), the system can
continue to run without data loss if one drive fails. When a replacement drive is
installed, the data is rebuilt onto that drive.
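To make the rebuild step concrete, here is a minimal sketch of the XOR parity idea used by parity-based RAID levels such as RAID 5. It is a simplification under stated assumptions: real arrays work on striped blocks and rotate parity across drives, and the three "drives" below are just short byte strings chosen for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three illustrative data "drives" holding one block each.
data_drives = [b"ACCT", b"LOGS", b"MAIL"]
parity = xor_blocks(data_drives)

# Simulate losing the second drive, then rebuild it from the survivors
# plus the parity block.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'LOGS'
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. This is also why a second failure during the rebuild is so dangerous for single-parity arrays: there is no longer enough information to reconstruct either drive.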
With virtual machines, there is usually a unit that stores all of the virtual host's hard
drives for it, generally on a fiber network. This is known as a Storage Area Network
(SAN), and it also uses RAID to help with performance and redundancy. Hard drive
failures on a SAN can be disastrous if data cannot be recreated on replacement drives
because the impact can potentially affect many different servers, services, and
applications.
Example: Outdated Equipment
A balance between cost-effective computer infrastructure and technology improvements
can be tricky to navigate. Because organization-wide upgrades are expensive, it is not
uncommon to see some hardware staying in service beyond its warranty period. This
might happen with legacy systems that are no longer supported, and building a
replacement system will take a lot of resources to accomplish.
When outdated equipment fails, it has two effects. The first is a scarcity of skill and
supply. If your hardware has not been manufactured for some time, then the odds of
quickly finding replacements are not great. This also applies to the skills and expertise
required to install replacement hardware on failed legacy systems. The scarcity impacts
the cost of these repairs and makes it a very costly exercise to maintain such
equipment.
Whenever systems start to reach the end of life, you need to start planning suitable
replacements before you run into more significant problems further down the line.
Example: Failed/Incompatible Firmware Upgrades or Patches
To keep your hardware running effectively, manufacturers will release firmware
upgrades and patches from time to time. Firmware is the low-level code that tells your
hardware how its components interact and what information is available to the operating
system.
If firmware becomes corrupt on a device, then the result is usually that the device
becomes "bricked." This means that it won't even be able to power on correctly, or, if it
can power on, then it can't do much else. Failed firmware patching can occur when a
device loses power mid-way through the process or if a communications cable is
accidentally removed mid-process. Sometimes a corrupt firmware file is still flashed to a
device, or an incompatible device accepts the flash, leaving it inoperable.
There are usually extensive checks done by the flashing software before starting, but
errors can sometimes occur.
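The kind of pre-flash checks described above can be sketched roughly as follows. This is not any vendor's flashing tool; the function name, the model strings, and the idea that the vendor publishes a SHA-256 checksum alongside the image are all assumptions made for illustration.

```python
import hashlib


def safe_to_flash(image, expected_sha256, image_model, device_model):
    """Return True only if the image is intact and built for this device.

    Two checks: the SHA-256 digest of the image must match the published
    value, and the model the image targets must match the device model.
    """
    digest = hashlib.sha256(image).hexdigest()
    checksum_ok = digest == expected_sha256.lower()
    model_ok = image_model == device_model
    return checksum_ok and model_ok


# Hypothetical firmware image and its published checksum.
image = b"firmware-image-v2"
published = hashlib.sha256(image).hexdigest()

print(safe_to_flash(image, published, "SW-24", "SW-24"))         # True
print(safe_to_flash(image + b"\x00", published, "SW-24", "SW-24"))  # False
print(safe_to_flash(image, published, "SW-24", "SW-48"))         # False
```

Checks like these catch a corrupt file or a mismatched device before anything is written, but they cannot protect against a power loss or a pulled cable during the flash itself.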
Software patches to an operating system also have the potential to break certain
services on a server. Microsoft has had many examples of hotfixes needing to be
released to fix bugs introduced from Windows Updates, although this has gotten
considerably better over the years.
The best way to prevent bad software patches from affecting your systems is to test
them in isolation on a test network before deploying them to a live system. Any bugs or
performance issues should be noted before you begin rolling the patches out to the rest
of your network.
Example: Overheating
Overheating can cause severe damage to your server infrastructure, and it can also
introduce some strange errors. Your data center needs to have proper cooling,
especially if you have many populated server racks generating heat. Most server rooms
will have a dedicated hot and cold aisle system that directs cool air into the servers and
exhausts hot air out into a channel that drains it out of the room.
If these systems are not running correctly, then the hardware in that room will be
affected by heat, which has short- and long-term effects on both the systems' operation
and lifespan. The key takeaway from this is that your cooling systems always need to
be running efficiently. To accomplish this, you must have set service intervals for your
server room's cooling components.
Maintenance of fans, ducts, and filters needs to be carried out on all of your servers and
rack-mounted equipment on a maintenance schedule as well. Over time, dirt will block
the airflow for any device with a fan (or multiple fans) installed. This airflow needs to be
kept clean for the best cooling performance.
Type 4: Software Failures
Software failures can occur for a whole multitude of different reasons. A license can
expire, a configuration file can go missing or become corrupted, a bad software update
can cause issues, a software bug can introduce issues — the list is almost endless.
Another factor to consider is whether the failed software is an off-the-shelf solution
purchased from a vendor or if the application has been developed in house. The time it
takes to get your software up and running again depends on who designed the
nonfunctioning software and whether they are available to provide support.
The result is usually the same: the software is not working — and many business
functions are impacted. Your application might be performing some essential network
functions that might now be impacting everything else, which will cause more issues
until it is fixed.
If you are still manually rolling out your updates, then you will find that a manual fix will
need to be implemented and rolled out across the business. This is avoidable with
automation tools. Therefore, it is vital for all companies that rely heavily on software
solutions to think about how they can better leverage their resources by automating as
many processes as possible.
Perhaps a software update has been introduced to your environment but not validated,
causing bugs and errors. You will need to visit each workstation or node where the
update has been applied and either (1) apply a patch or fix or, in some cases, (2) roll
back to the known good version of the software. It is always an excellent idea to have
some kind of rollback plan when implementing software updates to your systems.
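A rollback plan can be as simple as remembering the known good version of each node and restoring it when a health check fails. The sketch below assumes a hypothetical in-memory inventory and a caller-supplied `health_check` function; a real deployment would call its own automation tooling instead.

```python
def apply_update(nodes, new_version, health_check):
    """Update every node, rolling back any node that fails its health check.

    `nodes` maps a node name to its installed version and is modified in
    place. `health_check(name, version)` returns True if the node is healthy.
    """
    results = {}
    for name, known_good in list(nodes.items()):
        nodes[name] = new_version
        if health_check(name, new_version):
            results[name] = "updated"
        else:
            # Restore the known good version recorded before the update.
            nodes[name] = known_good
            results[name] = "rolled back"
    return results


# Hypothetical workstations; the check below pretends ws-02 breaks on 1.1.
nodes = {"ws-01": "1.0", "ws-02": "1.0"}
result = apply_update(nodes, "1.1", lambda name, version: name != "ws-02")

print(result)  # {'ws-01': 'updated', 'ws-02': 'rolled back'}
print(nodes)   # {'ws-01': '1.1', 'ws-02': '1.0'}
```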
To prevent these kinds of failures, you need to ensure that you follow a proper testing
and validation plan for all of your software updates. Most companies have a testing
environment that mimics most of the mission-critical systems. Any updates that are
applied can be closely watched and documented. When the rollout happens in your live
environment, you won't be met with any surprises.
Type 5: Human Failures
Unfortunately, it is still possible for human errors to cause issues on your network.
Anything from accidental hardware damage to cable damage or even poorly
configured network devices can cause downtime on your network.
If your network is not maintained regularly, then you can also expect network failures to
occur. Whenever a device is removed or added to your network cabinet, cable
management must be adhered to, and documentation must be updated. Sometimes a
network failure can occur because cables are not correctly labeled, creating problems
when they are accidentally removed.
Even worse, if an unlabeled cable is removed, it can take that much longer for the fault
to be found and then successfully troubleshot. It is for this reason that network
maintenance must be carried out at regular intervals.
Another major cause has to do with bad changes being made to the network
environment. Changes such as VLAN configurations, routing, and IP Address
configurations that are not tested before being deployed have the potential for
unexpected results. Network changes and network maintenance must not be treated as
an afterthought and must instead be scheduled and carried out regularly.
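One small check that can be run before a network change is deployed is to confirm that no two VLANs in the proposed plan have overlapping subnets. The sketch below uses Python's standard `ipaddress` module; the VLAN IDs and address ranges are made up for the example.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vlan_subnets):
    """Return pairs of VLAN IDs whose subnets overlap.

    `vlan_subnets` maps a VLAN ID to a subnet in CIDR notation.
    """
    networks = {vlan: ipaddress.ip_network(cidr) for vlan, cidr in vlan_subnets.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical change request: VLAN 30 accidentally reuses part of VLAN 10.
proposed = {
    10: "192.168.10.0/24",
    20: "192.168.20.0/24",
    30: "192.168.10.128/25",
}

print(find_overlaps(proposed))  # [(10, 30)]
```

A check like this does not replace testing the change on a lab network, but it is cheap enough to run on every proposed change before it reaches production.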
Type 6: Security Failures
Security could be thought of as an extension of human errors, but many other variables
also play a role. Security should be considered an active measure that must be
implemented from the very start of a project and maintained throughout its life cycle. If
you don't actively work to protect your environment, you can open yourself to
unnecessary risks.
DDoS (Distributed Denial of Service) attacks have become a standard method of attack
used by cybercriminals. The method involves using thousands of hosts that send
requests to a website or server. The unexpected load can sometimes take such a
service offline, meaning that the business cannot continue to operate until it is brought
back up. There are ways to mitigate this. Modern solutions can detect DDoS attacks
and then reroute that traffic to another data center or network appliance where the data
packets cannot reach the intended target. Some internet service providers and internet
hosts offer this kind of protection, so it is a good idea to find out if you can integrate
such a solution into your online services.
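The detection step can be illustrated with a very simplified sliding-window counter that flags when the number of requests in a recent time window exceeds a limit. Commercial DDoS protection is far more sophisticated; the class name, window size, and threshold here are assumptions chosen only to show the idea.

```python
from collections import deque


class RateMonitor:
    """Flag when more than `max_requests` arrive within `window_seconds`."""

    def __init__(self, window_seconds, max_requests):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one request; return True if the window is over the limit."""
        self.timestamps.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_requests


monitor = RateMonitor(window_seconds=10, max_requests=5)

# Normal traffic followed by a sudden burst (timestamps in seconds).
arrivals = [0, 3, 6, 9, 20, 20.1, 20.2, 20.3, 20.4, 20.5]
flags = [monitor.record(t) for t in arrivals]

print(flags[-1])  # True
```

Only the last request in the burst trips the limit here, at which point a real system would start rerouting or dropping the offending traffic.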
Another point of entry is through malware and viruses. Antivirus protection has become
commonplace in many organizations, but you need to have an IT acceptable use policy to
use this technology effectively. You can have the best security software packages in the
world, but if your users are not following the rules, they will not be effective at all.
A lack of user awareness and training around social-engineering attacks and phishing
scams can also lead to enormous security risks within an organization. Teaching your
employees how to identify and avoid such scams and attacks will protect your company
from losing valuable data.
Data Loss Prevention is an area that most companies are starting to employ to retain
intellectual property and sensitive data. DLP solutions can scan all outgoing data such
as emails and attachments and find specific keywords and phrases that relate to the
protected data that you are trying to prevent from being exfiltrated out of your
organization.
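The keyword-matching part of a DLP scan can be sketched in a few lines. This is only a toy version of what commercial DLP products do; the protected terms and the sample email are invented for the example.

```python
import re


def find_protected_terms(text, terms):
    """Return the protected terms found in `text`, ignoring case."""
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    return sorted({match.group(0).lower() for match in pattern.finditer(text)})


# Hypothetical protected keywords and outgoing message.
terms = ["Project Falcon", "confidential"]
email = "Attached is the CONFIDENTIAL roadmap for project falcon."

print(find_protected_terms(email, terms))  # ['confidential', 'project falcon']
```

A message that returns a non-empty list could then be held for review rather than sent. Simple keyword matching produces false positives, so it is usually only the first layer of a DLP policy.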
More advanced solutions can search within the metadata of files that are injected with
proprietary data, making it easier to identify those responsible for trying to send out your
data. Proper physical security also needs to be implemented at all of your sites to ensure
that hardware and other IT assets are not removed without authorization.
The Impact of Network Failures
Network failures mean possible network downtime. Depending on the company, this
could equal the loss of thousands or tens of thousands of dollars per hour. In highly
competitive markets with external-facing internet services, downtime can also cost you
customers, who will turn to a competitor while your company is offline.
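The hourly-loss framing above translates into simple arithmetic. The figures below are hypothetical and only show how an estimate might be put together; real numbers depend entirely on the business.

```python
def downtime_cost(hours_offline, revenue_per_hour, staff_cost_per_hour=0.0):
    """Estimate the direct cost of an outage.

    Adds lost revenue and extra staff cost for each hour offline. Indirect
    costs such as lost customers or brand damage are not included.
    """
    return hours_offline * (revenue_per_hour + staff_cost_per_hour)


# Hypothetical outage: 3.5 hours, $10,000 revenue/hour, $400 staff cost/hour.
print(downtime_cost(3.5, 10_000, 400))  # 36400.0
```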
The Business Costs of Network Failures
Many different factors can cause a network failure, and each has its own potential cost.
Some of these costs are financial, but there are other things to consider, such as
reputational damage or the tarnishing of your company's brand.
If your company experiences a catastrophe that destroys a data center, then you are
probably looking at millions of dollars' worth of damage. However, if you have off-site
backups for all of your data, configuration files for all of your devices, and a proper
disaster recovery plan in place, then you are not out of the race just yet.
You will have to rely on your teams to execute the Disaster Recovery Plan to get your
services back up and running, even if it is in a minimized form. If you don't have a plan
to recover from such an unthinkable failure, then you might not be able to recover from
that kind of worst-case scenario disaster.
Hardware costs are the most obvious concern for many people, and understandably so.
Servers, switches, routers, and uninterruptible power supplies (anything required to run
your IT operations) are very costly.
You wouldn't think of software as carrying a cost when your network fails, but there are
a few things to consider. If your company has its own proprietary software and systems,
then recovering from a complete failure will have its own unique set of challenges. You
will need to protect all of your development resources, such as source code and
repositories, so that if your production environment is compromised, you can spin
up another instance and restore data to it.
Your customers are vital to your operations, and so is their data. Losing critical data
either through software or hardware failures can be challenging to recover from.
Backups can help you to make sure that your customers experience as little frustration
as possible. If your customers are unable to use your services, or if you are not able to
help them to recover their data, then you risk damaging your company's reputation.
Intellectual property loss can be a considerable cost factor, especially if that data is
exposed online through a cyber attack or exfiltration. You need to make sure that
sensitive intellectual property is stored in a safe offsite location that you can access in a
network failure event.
If your company operates in a space where compliance and reporting penalties
could be a factor, you will need to have safeguards to minimize the impact that a
network failure could impose on your organization.
Brand damage can occur if your customers are negatively affected by a network or
system failure. Your competitors will take advantage of your downtime, and once a
customer decides to jump ship, you will almost always struggle to win them back, even
after your systems have recovered.
The Human Costs of Network Failures
While your staff is battling to get your network back up and running, there are naturally
going to be other areas that suffer. These interruptions take your team away from
productive work and create backlogs in other business areas. It also has the unintended
consequence that your support staff suddenly have to respond to customer complaints.
This is difficult because your technical staff doesn't necessarily have the same
customer-focused skills as customer-facing staff, which can lead to miscommunication. In
most cases, your marketing and sales staff will need to reach out and let your
customers know about the interruptions so that they understand what is happening and
how the recovery is progressing. Again, this takes your staff away from their core business roles and creates
more backlogs throughout the business.
If overtime is needed to get your systems back up and running, you will incur additional
labor costs during these types of failures. If you don't have the necessary in-house
skills, then getting a service provider or subcontractor to assist will also incur additional
costs.
Your teams will also experience increased stress levels while the problems are ongoing.
Remember to give them some time off after fixing all of the issues as fatigue and stress
can be real productivity killers afterward.
How to Prevent Network Failures
Now that we know what causes network and system failures, we can start to prepare for
them. Nobody likes to fixate on only the negatives, but you have to plan for the worst
when it comes to network failure prevention. Your first port of call should be preparing a
disaster recovery plan. We could write an entire article on what you need to consider for
a disaster recovery plan, but we will touch on a few basics here.
Your staff needs to be trained with a series of test drills to make sure that everybody
knows what they need to do in the event of an emergency. These test runs need to be
carried out often so that your teams are ready to spring into action. Part of this
preparation means that you need to document your disaster recovery procedures. These can be in
the form of playbooks, step by step guides, and any other resources that will help your
teams get the job done.
To detect issues before they turn into a massive network outage, you need to be
continuously monitoring your systems. This can be as elaborate as a fully integrated
network operations center or as simple as a single workstation with monitoring software,
as long as your support staff can see an issue before it becomes an outage.
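At the simple end of that range, a basic monitor just checks whether each service still accepts connections. The sketch below is a minimal example using Python's standard `socket` module; the function names and the idea of polling a list of (host, port) targets are assumptions, and a real monitoring product would add alerting, history, and much more.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_all(targets, timeout=2.0):
    """Check each (host, port) pair and return a dict of results."""
    return {target: is_reachable(*target, timeout=timeout) for target in targets}
```

Calling `check_all` on a schedule with the hosts and ports of your critical services, and raising an alert whenever a value comes back `False`, would give support staff an early warning before users start reporting problems.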
We've gone over a few different solutions that can help you to prevent these outages,
like Uninterruptible Power Supplies and inverters, backup solutions, and fire or flooding
protection. Your critical servers must be a part of high availability groups for
redundancy. Combined with virtualization, you can minimize your downtime with little to
no interruptions to service. Other things to look at include a comprehensive security
training program and defensive security systems.
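The benefit of redundancy can be estimated with a short calculation. The sketch below assumes that replicas fail independently of each other, which is optimistic: servers that share a virtual host, power feed, or SAN do not fail independently. The 99% figure is hypothetical.

```python
def combined_availability(single_availability, replicas):
    """Availability of a group that is up if at least one replica is up.

    Assumes replica failures are independent of each other.
    """
    return 1 - (1 - single_availability) ** replicas


# Hypothetical server that is available 99% of the time.
for replicas in (1, 2, 3):
    print(replicas, round(combined_availability(0.99, replicas), 6))
```

Under these assumptions, a second replica moves the group from roughly 99% to roughly 99.99% availability, which is why critical servers are placed in high availability groups rather than run as single instances.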
Final Thoughts
We've covered many different areas in this article and looked at many essential
features that should be a part of your day-to-day operations. The main takeaway is this:
planning is only half of the battle. Implementing your plan is just as tricky. To
accomplish this, you need to create and document your plan and make it accessible to
everyone that needs to know what is in it.
Once you have all of the details figured out, you need to practice the plan at set intervals to
make sure that when the time comes, your team is ready to act at a moment's
notice. Cybersecurity training and workforce security awareness are also paramount.