Business Continuity
For busy IT people
GOETEC seminar, 16th February 2012

A bit about me
David Hayling
• Kent MAN operations manager for 10 years
  – Microwave radio links
  – ATM
  – BT circuits, first Wavestream
• The BT ‘excuse book’ (rain, trees) (LANE, clock) (spares) (back breaking)
• Christ Church Infrastructure Manager
  – One or two interesting experiences: flood, fire, pestilence … electricity

Business Continuity: Why

Things go wrong
“In theory, theory and practice are the same. In practice, they are not.” (attributed to Albert Einstein)

City University fire 2001
“Around 300 people had to be evacuated from City University's college building in central London last night, after a fire gutted the roof and fourth floor offices. Students continued to sit their examinations today.” [Guardian, Tuesday 22 May 2001]

Causes of outage
[Pie chart. Categories: human error, software malfunction, hardware fault, computer virus, site disaster. Slice values as extracted: 7%, 3%, 32%, 44%, 14%. Source: BCS, BC in practice]

UCISA Top Concerns
[Figure]

Networking Risks
[Figure]

Five golden rules of business continuity
British Computer Society
• Understand the business requirements
  – Institutional DR / BC plan
  – Make friends with the auditor
  – Insurance officer
  – Check with fellow service providers
    • Estates
  – Senior managers
  – Your manager
• Commit time and effort from across the business
• Internal communication is critical
• Documentation should match the organisation
• Test the plan

Hardware fault
• Look at your key business systems
  – Network
  – AAA
  – Key services: web, mail, teaching
• Identify single points of failure
  – Risk assess
  – Mitigate / accept
  – RAID 1, 5, 10, …
  – SAN
  – Virtualisation

[Figure: Hierarchy of Database Needs, www.brentozar.com]

• Test your backups
  – Can you recover the data?
  – How long does it take?
• Maintenance contracts
  – What do they cover: just break/fix, or replacement?
• Cold spares
  – Check you can deploy
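The two backup questions under Hardware fault (can you recover the data, and how long does it take) can be turned into a repeatable drill. The following is a minimal sketch, not a definitive procedure: `restore` is a hypothetical hook that would wrap whatever backup tool is in use, `target_seconds` is an assumed recovery-time target agreed with the business, and the comparison assumes the original data has not changed since the backup was taken.

```python
import hashlib
import time
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(original: Path, restored: Path) -> list:
    """List relative paths that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems


def restore_test(restore, original: Path, restored: Path, target_seconds: float) -> dict:
    """Run a restore, time it, and verify the restored files.

    `restore` is a hypothetical callable taking the destination path;
    call the real backup tool from inside it.
    """
    start = time.monotonic()
    restore(restored)
    elapsed = time.monotonic() - start
    problems = compare_trees(original, restored)
    return {
        "elapsed_seconds": elapsed,
        "within_target": elapsed <= target_seconds,
        "recovered": not problems,
        "problems": problems,
    }
```

Recording `elapsed_seconds` from each drill gives a measured answer to "how long does it take" rather than an estimate.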
Human error
• Change control
  – Don’t change unless you know (and have written down): why, what, when, to what, who to tell, what success looks like, backout plan, test plan
• Working mobile phones
  – Normally used

Software malfunction
• Follow supplier’s patching plan
  – Do not compromise
• Automated tests
  – Test the actual service
  – (e.g. Nagios)
• Anti-virus
  – Keep up to date automatically
  – Check all vectors
  – Beware false positives
• User behaviour training
  – Spear phishing
  – Have a response
  – Make sure CERT contacts are up to date

Site disaster
• Consult with estates
  – What is their plan with respect to site loss?
• Telco circuit faults are rare (<5%)
• Single points of failure (SPF)
• Acute and slow to recover, vs acute and quick to recover, vs chronic

You’re already doing Business Continuity
Just document, review, improve

“In theory, theory and practice are the same. In practice, they are not.” (attributed to Albert Einstein)

• The Practice of System and Network Administration – Thomas A. Limoncelli, et al

Links & credits
• UCISA, BCS, JANET(UK), Gartner
• Harvey Rutt & Adrian Pickering (ECS, Southampton University)
• Brent Ozar (www.brentozar.com)
• david.hayling@canterbury.ac.uk
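As an illustration of the "test the actual service" point under Software malfunction, here is a minimal sketch of an end-to-end probe. It follows the Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function names, thresholds and marker text are assumptions for illustration, not part of the original slides. The idea is to check that the page really contains what a user would expect to see, not merely that the port answers.

```python
import time
import urllib.request

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2


def http_fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page body as text; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def check_service(fetch, expected_text, warn_seconds=2.0, crit_seconds=5.0):
    """Return (exit_code, message) for one end-to-end probe.

    `fetch` is any zero-argument callable returning the response body,
    so the same check can wrap HTTP, a mail round trip, and so on.
    The thresholds are hypothetical defaults; tune them per service.
    """
    start = time.monotonic()
    try:
        body = fetch()
    except Exception as exc:
        return CRITICAL, f"CRITICAL - request failed: {exc}"
    elapsed = time.monotonic() - start

    # A 200 response with the wrong content is still a broken service.
    if expected_text not in body:
        return CRITICAL, f"CRITICAL - expected text not found ({elapsed:.2f}s)"
    if elapsed >= crit_seconds:
        return CRITICAL, f"CRITICAL - response took {elapsed:.2f}s"
    if elapsed >= warn_seconds:
        return WARNING, f"WARNING - response took {elapsed:.2f}s"
    return OK, f"OK - content verified in {elapsed:.2f}s"
```

A wrapper script would call something like `check_service(lambda: http_fetch(url), marker)` with a site-specific URL and marker string, print the message, and exit with the returned code so the monitoring system can act on it.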