Classic Mistakes - Software Engineering II

University of Southern California
Center for Systems and Software Engineering
Recovering IT in a Disaster & Classic Mistakes
CS 577b Software Engineering II
Supannika Koolmanojwong
http://en.wikipedia.org/wiki/Hurricane_Katrina
http://napoleonlive.info/see-the-evidence/never-forget-9-11-essay/
http://news.nationalgeographic.com
Avian influenza
Cyber attack
http://www.itrportal.com/absolutenm/templates/article-channelnews.aspx?articleid=7115&zoneid=45
http://bepast.org/dataman.pl?c=lib&frame_nav=1&dir=docs/photos/avian%20flu/
California Natural Disasters
http://www.americanforests.org/magazine/article/regrowing-a-forest/
http://www.exponent.com/earthquake_engineering/
Recovering IT in a Disaster:
Lessons from Hurricane Katrina
Iris Junglas, Blake Ives, MIS Quarterly Executive Vol. 6 No. 1 / Mar 2007
• August 29, 2005 - Hurricane
Katrina destroyed a data center
and communications
infrastructure at the Pascagoula
and Gulfport, Mississippi,
operations of the Ship Systems
sector of Northrop Grumman
Corporation
• Also put a second data center
out of commission in a shipyard
near New Orleans
http://www.scholastic.com/browse/article.jsp?id=3754772
NGC’s Shipyard
• 20,000 employees in Ship Construction
• Katrina caused over US$1 billion in damage to the company
• It brought two of the nation’s largest shipyards to a
standstill
Recovering IT in a Disaster
• How to adapt when the business continuity
plan and the public infrastructure prove inadequate
• Reexamine our processes for preparing
disaster plans
• Processes for assessing preparedness and
response after a disaster or a near-disaster.
Northrop Grumman Corporation
• Products : electronics, aerospace, and
shipbuilding
• Customers: government and commercial
customers worldwide
• Major business:
– Ship construction - large military vessels
– Revenue: US$5.7 billion in 2005
– Customers: DoD and Navy
– 12,900 employees in Mississippi
– 7,100 employees in New Orleans
Preparation for Hurricane
• Hurricanes are nothing new to
the shipbuilding industry
– September 2004 – Hurricane Ivan
– July 2005 – Hurricane Dennis
• A bigger one was heading in
– August 2005
• 11 people dead, over
US$1 billion in damage in
Florida
http://www.fema.gov/hazard/flood/recoverydata/katrina/katrina_about.shtm
Preparation for Hurricane
• Data
– Data backups were sent to Iron Mountain (information
management services)
– Second backup copy kept in Dallas
• Servers
– Powered off
– Wrapped in plastic
• New backup generator – in a secure location
• Only one extranet kept alive (crucial to the Navy and DoD)
• People
– Left the area
The storm smashed
• NGC facilities were in the storm’s path
• Communication failed
• Extensive damage to the shipyards and nearby
communities
• Emergency command center – set up at the Dallas
office, where a newly assembled emergency team
was formed
– Began to pull together the first stages of NGC
disaster recovery response
Damages
• Collected digital images of the damage
• In Mississippi, lost
– 1,500 PCs, 200 servers, 300 printers, 600 data input
devices, and hundreds of two-way radios
– Communications closets, routers, switches, fiber and
copper cables and wires
– LAN / WAN / MAN – no longer worked
• In New Orleans
– Infrastructure was still in place
– Air-conditioning systems were not working, so servers shut down
automatically
• A week after the storm, communication lines went
down again because cars were driving over them
First things first
• Not about restoring computer systems, but
restoring human resources
• But most of the 20,000 employees were out
of contact
• Tools
– Press releases
– Corporate web site (67,000 hits in the weeks
after the storm )
– Toll-free call in number
• Payroll through Wal-Mart and Western
Union
Restoring IT infrastructure
• Electronic communication – nonexistent
because the public communication infrastructure was down
• Communication through BlackBerry devices
worked only intermittently
• Two-way radios, walkie-talkies
• Key members used satellite phones
– Required line-of-sight access to satellites
• Later on, wireless communication was used
Building new data center
• Hardware acquisition
– 1,500 desktops, 200 servers, etc.
– Contacted suppliers and reissued the most recent orders
• Incompatibilities between software and new
hardware environment
• Inaccessible or difficult to find system
documentation, e.g. license keys, server
names, addressing schemes, login IDs
Restoring data and applications
• Some firms found that their backup data was
partially unreadable
• NGC had 2 backups: Iron Mountain and
Dallas
• Lost some data on desktops or local
machines
• Two weeks after Katrina – had a new data
center; essential systems were up and
running
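One way to catch partially unreadable backups before a disaster is to check each backup copy against a checksum recorded when the backup was made. The sketch below is illustrative only and is not taken from the case study: the manifest format, the function names `sha256_of` and `unreadable_backups`, and the directory layout are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unreadable_backups(manifest: dict[str, str], backup_dir: Path) -> list[str]:
    """List manifest entries whose backup copy is missing or has a different checksum.

    `manifest` maps a file name to the digest recorded at backup time
    (a hypothetical format chosen for this sketch).
    """
    bad = []
    for name, expected in sorted(manifest.items()):
        copy = backup_dir / name
        if not copy.is_file() or sha256_of(copy) != expected:
            bad.append(name)
    return bad
```

Running a check like this on a schedule, against both the offsite copy and the second copy, turns "the backup exists" into "the backup can be read back".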
Disaster preparedness
• Common mistake: organizations prepare only for disasters specific
to their domain
– Financial institutions prepare for IT failures
– Hospitals for pandemics
– Airlines for technical failures and sabotage
• An alternative approach: consider a broader
spectrum of generic disaster types
– Economic, information, physical, human resource,
reputation, psychopathic, and natural disasters
• Identify common characteristics of each disaster
category, then construct the plan
IT disaster preparedness framework
• Frameworks provide generic objectives, measurements, and guidelines for
establishing IT disaster preparedness
• They emphasize developing an IT continuity plan, identifying and
allocating critical resources, executing a business impact
analysis, and maintaining, testing, and training on the plan
• COBIT (Control Objectives for Information and Related
Technology)
– For operational IT and business managers
– Focuses on three core elements of IT governance: IT as an asset, IT-related risks, and IT control structures
• ITIL (IT Infrastructure Library)
– Focuses on improving the efficiency and effectiveness of IT services
delivered to customers within the enterprise
– De facto standard for IT service management
Lesson Learned
1. Keep Data and Data Centers Out of Harm’s
Way
2. Don’t Assume the Public Infrastructure
Will Be Available
3. Plan for Civil Unrest
4. Assume Some People Will Not Be
Available
5. Leverage Your Suppliers as Critical Team
Members
Lesson Learned
6. Expect the Unexpected
7. Get Prepared – Crisis portfolio
8. Establish a Strong Leadership Position
9. Empower Decision Makers on the Team
10. Exploit Fresh-Start Opportunities
IT disaster recovery plan
IT disaster recovery (DR) plan
National Institute of Standards and Technology (NIST)
• Goal
– Minimize any negative impacts on company operations
• By
– Identifying critical IT systems and networks
– Prioritizing their recovery time objectives
– Delineating the steps needed to restart, reconfigure,
and recover them
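The first two steps above (identify critical systems, then prioritize by recovery time objective) can be sketched as a simple sort. This is a minimal illustration, not a NIST procedure: the `System` class, the `recovery_order` function, and every system name and RTO value below are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    critical: bool
    rto_hours: float  # recovery time objective: maximum tolerable downtime


def recovery_order(systems: list[System]) -> list[str]:
    """Order systems for recovery: critical first, then shortest RTO first."""
    ranked = sorted(systems, key=lambda s: (not s.critical, s.rto_hours))
    return [s.name for s in ranked]


# Hypothetical inventory; names and RTO values are illustrative only.
inventory = [
    System("intranet wiki", critical=False, rto_hours=72),
    System("payroll", critical=True, rto_hours=24),
    System("customer extranet", critical=True, rto_hours=4),
    System("print server", critical=False, rto_hours=48),
]

print(recovery_order(inventory))
# ['customer extranet', 'payroll', 'print server', 'intranet wiki']
```

The third step, the restart and reconfiguration procedure for each system, would hang off each entry in the same inventory.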
http://searchdisasterrecovery.techtarget.com/feature/IT-disaster-recovery-DR-plan-template-A-free-download-and-guide
IT Disaster Recovery Process
• Perform risk assessment
• Identify potential threats
• Determine important infrastructure elements
Structure for an IT disaster
recovery plan (1)
1. Develop the contingency planning policy statement. A
formal policy provides the authority and guidance necessary to
develop an effective contingency plan.
2. Conduct the business impact analysis (BIA). The business
impact analysis helps to identify and prioritize critical IT systems
and components.
3. Identify preventive controls. These are measures that reduce
the effects of system disruptions and can increase system
availability and reduce contingency life cycle costs.
4. Develop recovery strategies. Thorough recovery strategies
ensure that the system can be recovered quickly and effectively
following a disruption.
National Institute of Standards and Technology (NIST)
Structure for an IT disaster
recovery plan (2)
5. Develop an IT contingency plan. The contingency plan should
contain detailed guidance and procedures for restoring a
damaged system.
6. Plan testing, training and exercising. Testing the plan
identifies planning gaps, whereas training prepares recovery
personnel for plan activation; both activities improve plan
effectiveness and overall agency preparedness.
7. Plan maintenance. The plan should be a living document that is
updated regularly to remain current with system enhancements.
National Institute of Standards and Technology (NIST)
Important IT disaster recovery
planning considerations
• Senior management support.
• Take the IT DR planning process
seriously. You need the right information, and that
information should be current and accurate.
• Availability of standards. Standards relevant to IT DR plans
include NIST SP 800-34, ISO/IEC 24762, and BS
25777.
• Keep it simple
• Review results with business units.
• Be flexible
Reviewing the IT disaster
recovery plan template (1)
• Information Technology Statement of Intent -- This
sets the stage and direction for the plan.
• Policy Statement -- Very important to include an
approved statement of policy regarding the provision of
disaster recovery services.
• Objectives -- Main goals of the plan.
• Key Personnel Contact Information -- Very
important to have key contact data near the front of the
plan. It's the information most likely to be used right
away, and should be easy to locate.
Reviewing the IT disaster
recovery plan template (2)
• Plan Overview -- Basic aspects of the plan, such as updating.
• Emergency Response -- Describes what needs to be
done immediately following the onset of an incident.
• Disaster Recovery Team -- Members and contact
information of the DR team.
• Emergency Alert, Escalation and DRP Activation -- Steps to take through the early phase of the incident,
leading to activation of the DR plan.
• Media, Insurance, Financial and Legal Issues
• Single Disk Failure
– Likelihood and impact: Medium
– Detection (how will we know it has happened): Nagios warning
– Immediate action: Replace failed disk in RAID volume.
– Later action: Order new disks. Have existing disks destroyed.
– Effect on users: No effect
– Mitigation and contingency (currently in place): Nagios monitoring of RAID volumes. Keep replacement drives available.
• Multiple Disk Failure
– Likelihood and impact: Low
– Detection: Nagios warning
– Immediate action: Replace failed disks in RAID volume. Restore from hot backup.
– Later action: Order new disks. Have existing disks destroyed.
– Effect on users: No effect (failover)
– Mitigation and contingency (currently in place): Nagios monitoring of RAID volumes. Keep replacement drives available.
• Unauthorised modification of content
– Likelihood and impact: Low
– Detection: Periodic auditing of logs. Monitoring of application.
– Immediate action: Restore modified content.
– Later action: Repair security breach. Determine root vulnerability.
– Effect on users: Low effect on users.
– Mitigation and contingency (currently in place): Determine root vulnerability. Repair vulnerability.
www.questionpro.com/.../SA-Disaster-Recovery-Plan-120D.doc
• Data loss
– Likelihood and impact: Low
– Detection (how will we know it has happened): Nagios warning
– Immediate action: Restore data from hot or offsite backup.
– Later action: No later action necessary.
– Effect on users: Users will not have access to their data.
– Mitigation and contingency (currently in place): Hot and offsite backups in place.
• Multiple machine failure
– Likelihood and impact: Low
– Detection: Nagios warning
– Immediate action: Repair machine, replace machine with hot backup machine.
– Later action: Repair machine, replace machine with hot backup machine. Order new hot backup machine.
– Effect on users: Low effect (failover). Performance will be compromised.
– Mitigation and contingency (currently in place): Monitor machine health with Nagios.
• Software failure
– Likelihood and impact: Medium
– Detection: Nagios warning
– Immediate action: Update/repair software.
– Later action: Update/repair software.
– Effect on users: Low effect or no access to software.
– Mitigation and contingency (currently in place): Update software to latest stable version.
www.questionpro.com/.../SA-Disaster-Recovery-Plan-120D.doc
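A risk table like the two above can also be kept as structured data, so the plan can be sorted and queried during an incident. The sketch below encodes a few of the rows. The `Risk` class, the `LEVELS` ranking (including the "High" level, which does not appear in the tables), and the two helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass

# Ranking used only for sorting; "High" is an assumed extra level.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Risk:
    description: str
    likelihood_and_impact: str
    detection: str
    immediate_action: str


register = [
    Risk("Single disk failure", "Medium", "Nagios warning",
         "Replace failed disk in RAID volume."),
    Risk("Multiple disk failure", "Low", "Nagios warning",
         "Replace failed disks in RAID volume. Restore from hot backup."),
    Risk("Data loss", "Low", "Nagios warning",
         "Restore data from hot or offsite backup."),
    Risk("Software failure", "Medium", "Nagios warning",
         "Update/repair software."),
]


def review_order(risks: list[Risk]) -> list[Risk]:
    """Highest-rated risks first; ties keep register order (sorted is stable)."""
    return sorted(risks, key=lambda r: -LEVELS[r.likelihood_and_impact])


def immediate_action(risks: list[Risk], description: str) -> str:
    """Look up the immediate action for a risk by its description."""
    for risk in risks:
        if risk.description.lower() == description.lower():
            return risk.immediate_action
    raise KeyError(description)


print([r.description for r in review_order(register)])
print(immediate_action(register, "data loss"))
```

Keeping the register in one place like this also makes it easier to review during plan maintenance, when rows are added or ratings change.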
Classic Mistakes
IT Project Management: Infamous Failures,
Classic Mistakes, and Best Practices
R. Ryan Nelson, MIS Quarterly Executive, 2007
Classic Mistakes
• People
• Process
• Product
• Technology
Classic Mistakes : People
• Undermined motivation
• Weak individual capabilities of the team members
• Failure to take action to deal with a problem
employee
• Adding people to a late project
Classic Mistakes : Process
• Wasting time in the fuzzy front end (approval
and budgeting), then compensating with an aggressive schedule later
• Human tendency to underestimate and
produce overly optimistic schedules
• Insufficient risk management
– lack of sponsorship, changes in stakeholder
commitment, scope creep, and contractor
failure.
• Risks from outsourcing and offshoring
– QA, interfaces, unstable requirements
Classic Mistakes : Product
• Requirements gold-plating
– unnecessary product size and/or characteristics
• Developer gold-plating
– Developers try out new technology / features
• Feature creep
– Roughly 25% change in requirements over a project's lifetime
Classic Mistakes : Technology
• Silver-bullet syndrome
– Expect new technology to solve all problems
• 4GL, CASE tools, OOD
• Overestimated savings from new tools or
methods
– Did not account for learning curve and unknown
unknowns
• Switching tools in the middle of a project
– Version upgrade
Findings from empirical study – 99 projects
• Finding 1
– People (43%), Process (45%), Product (8%),
Technology (4%)
• Scope creep
– Not in the top 10, although ¼ of the projects faced
scope creep, so managers should watch out for
it
• Top 3 mistakes were found in ½ of the projects
– Managers should have focused more on estimation,
scheduling, stakeholder management, and risk
management
Classic Mistakes vs Best Practices
References
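• R. Ryan Nelson, IT Project Management: Infamous Failures,
Classic Mistakes, and Best Practices, MIS Quarterly Executive, 2007
• Iris Junglas and Blake Ives, Recovering IT in a Disaster: Lessons from
Hurricane Katrina, MIS Quarterly Executive Vol. 6 No. 1, Mar 2007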