Business Continuity
For busy IT people
GOETEC seminar 16th February 2012
A bit about me
David Hayling
• Kent MAN operations manager for 10 years
– Microwave radio links
– ATM
– BT circuits, first wavestream
• The BT ‘excuse book’
(rain, trees)
(LANE, clock)
(spares)
(back breaking)
• Christ Church Infrastructure Manager
– One or two interesting experiences
flood, fire, pestilence … …
electricity
Business Continuity
Why
Things go wrong
“In theory, theory and practice are
the same. In practice, they are not.”
Albert Einstein
Things go wrong
City University fire 2001
“Around 300 people had to be
evacuated from City University's
college building in central London
last night, after a fire gutted the roof
and fourth floor offices. Students
continued to sit their
examinations today.”
[The Guardian, Tuesday 22 May 2001]
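Causes of outage
[Pie chart. Categories: human error, software malfunction, hardware fault, computer virus, site disaster. Segment labels as extracted, not paired with categories: 7%, 3%, 32%, 44%, 14%. Source: BCS – BC in practice]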
UCISA Top Concerns
Networking Risks
Five golden rules of
business continuity
British Computer Society
• Understand the business requirements
– Institutional DR / BC plan
– Make friends with the auditor
– Insurance officer
– Check with fellow service providers (e.g. Estates)
– Senior managers
– Your manager
• Commit time and effort from across the business
• Internal communication is critical
• Documentation should match the organisation
• Test the plan
Hardware fault
• Look at your key business systems
– Network
– AAA
– Key services – web, mail, teaching
Hardware fault
• Identify single points of failure
– Risk assess
– Mitigate / accept
– RAID 1, 5, 10, …
– SAN
– Virtualisation
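
One way to make "identify single points of failure" concrete is to model the key systems as a graph and ask which single component, if lost, cuts a service off from its users. The sketch below is only an illustration of that idea: the topology and the node names (`core-sw-1`, `san`, and so on) are invented for the example and do not come from the seminar.

```python
from collections import deque


def undirected(pairs):
    """Build an adjacency map from a list of (a, b) links."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, removed):
    """Return the nodes reachable from start while `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(links, source, target):
    """List components whose loss alone disconnects source from target."""
    spofs = []
    for node in sorted(links):
        if node in (source, target):
            continue  # only intermediate components are candidates
        if target not in reachable(links, source, removed=node):
            spofs.append(node)
    return spofs


# Hypothetical topology: two core switches, but one VM host and one SAN.
EXAMPLE = undirected([
    ("users", "core-sw-1"), ("users", "core-sw-2"),
    ("core-sw-1", "web-vm-host"), ("core-sw-2", "web-vm-host"),
    ("web-vm-host", "san"), ("san", "web-data"),
])

print(single_points_of_failure(EXAMPLE, "users", "web-data"))
```

Each name the function reports is then a candidate for the "mitigate / accept" decision: add redundancy, or record that the risk is accepted.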
Hardware fault
www.brentozar.com
Hierarchy of Database Needs
Hardware fault
• Test your backups
– Can you recover the data?
– How long does it take?
• Maintenance contracts
– What do they cover: just break/fix, or replacement?
• Cold spares
– Check you can deploy
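
The two backup questions (can you recover the data, and how long does it take) can be answered by a scripted trial restore. The sketch below assumes, purely for illustration, that the backup is a tar archive you created yourself and that you hold a manifest of SHA-256 checksums for the files you care about; the function name `restore_and_verify` and the manifest format are hypothetical, not part of any backup product.

```python
import hashlib
import tarfile
import tempfile
import time
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(archive, expected_hashes):
    """Restore into a scratch directory; return (ok, seconds, problems).

    expected_hashes maps a path inside the archive to its SHA-256 hex digest.
    """
    problems = []
    started = time.monotonic()
    with tempfile.TemporaryDirectory() as scratch:
        # Assumption: the archive is our own trusted backup, not untrusted input.
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for relative_name, expected in expected_hashes.items():
            restored = Path(scratch) / relative_name
            if not restored.is_file():
                problems.append(f"missing: {relative_name}")
            elif sha256_of(restored) != expected:
                problems.append(f"corrupt: {relative_name}")
    elapsed = time.monotonic() - started
    return (not problems, elapsed, problems)
```

The elapsed time covers both the restore and the verification, so it is a rough upper bound on recovery time for that data set rather than a precise measurement.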
Human error
• Change control
– Don’t change unless you know (and have written
down): why, what, when, to what, who to tell,
what success looks like, backout plan, test plan
• Working mobile phones
– Normally used
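
The change control checklist above can be enforced rather than just remembered. As a minimal sketch, assuming changes are recorded as simple text fields, the record below refuses to count as ready until every item on the list has been written down. The class and field names are hypothetical and simply mirror the bullet.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRequest:
    """One field per item on the change control checklist."""
    why: str = ""
    what: str = ""
    when: str = ""
    to_what: str = ""
    who_to_tell: str = ""
    success_looks_like: str = ""
    backout_plan: str = ""
    test_plan: str = ""

    def missing(self):
        """Return the checklist items that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready(self):
        """A change is ready only when nothing on the checklist is blank."""
        return not self.missing()


# Illustrative, made-up change: only two items have been filled in so far.
draft = ChangeRequest(why="Vendor security patch", what="Upgrade firmware")
print(draft.ready(), draft.missing())
```

The same check works equally well as a required-fields rule in a ticketing system; the point is that a blank backout plan blocks the change.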
Software malfunction
• Follow supplier’s patching plan
– Do not compromise
• Automated tests
– Test the actual service
– (e.g. Nagios)
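
"Test the actual service" means checking what a user would see, not just that a host answers ping. The sketch below follows the Nagios plugin convention of return codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). Everything else is an assumption made for illustration: the function names, the two-second warning threshold, and the idea of matching a piece of expected text in the page.

```python
import time
import urllib.request

# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, body, seconds, expected_text, warn_after=2.0):
    """Turn one observed response into a (return code, message) pair."""
    if status != 200:
        return CRITICAL, f"CRITICAL - HTTP status {status}"
    if expected_text not in body:
        return CRITICAL, "CRITICAL - page served but expected text missing"
    if seconds > warn_after:
        return WARNING, f"WARNING - correct page but slow ({seconds:.1f}s)"
    return OK, f"OK - correct page in {seconds:.1f}s"


def check_service(url, expected_text, timeout=10.0):
    """Fetch the page as a user would and classify the result."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except Exception as error:  # DNS failure, timeout, HTTP 4xx/5xx, ...
        return CRITICAL, f"CRITICAL - request failed: {error}"
    return classify(status, body, time.monotonic() - started, expected_text)
```

A wrapper script would print the message and exit with the code, for example `code, message = check_service(url, "Log in")` followed by `print(message)` and `sys.exit(code)`, where the URL and the expected text are whatever identifies a healthy page for your own service.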
Software malfunction
• Anti-virus
– Keep up to date automatically
– Check all vectors
– Beware false positives
• User behaviour training
– Spear phishing
– Have a response
– Make sure CERT contacts are up to date
Site disaster
• Consult with estates
– What is their plan wrt site loss
• Telco circuit faults are rare (<5%)
• SPF
• Acute and slow to recover – vs – acute and quick to recover – vs – chronic
You’re already doing
Business Continuity
Just document, review, improve
“In theory, theory and practice are
the same. In practice, they are not.”
Albert Einstein
• The Practice of System and Network Administration
– Thomas A. Limoncelli et al.
Links & credits
• UCISA, BCS, JANET(UK), Gartner
• Harvey Rutt & Adrian Pickering (ECS,
Southampton University)
• Brent Ozar (www.brentozar.com)
• david.hayling@canterbury.ac.uk