(Note: All times noted in this document are Eastern Daylight Time (EDT).)
At approximately 6:00 P.M. on April 11th, 2014, the NCREN constituents listed below experienced an interruption of network service when the MCNC NCREN Research Triangle Park (RTP) optical node suffered a total, catastrophic failure. This event interrupted network connectivity for all NCREN constituents that receive IP and/or transport services through the RTP Regional Point-of-Presence (RPoP). MCNC Data Center constituents also experienced an interruption of service during this event. The RTP optical node was restored and returned to service by approximately 12:00 A.M. on April 12th, 2014. The majority of the disrupted constituent connectivity services were restored at that time, but several constituents’ services remained unavailable until approximately 1:30 A.M. on April 12th, 2014.
The outage was caused by a catastrophic failure of the MCNC NCREN Research Triangle Park (RTP) optical transport node. Both the primary and redundant line cards responsible for the node’s timing and control functions failed spontaneously and simultaneously, leaving the entire node unable to pass any network traffic. The equipment vendor performed exhaustive diagnostic tests after restoration but, to date, has not been able to determine the root cause of the initial failure. Restoration of service was delayed by two factors: 1) a software bug that resulted in continuous controller card switchovers and resets, and 2) corruption of the optical node’s backup configuration database files, which are routinely created by the vendor’s management platform.
At approximately 5:00 P.M. on April 11th, an NCREN network engineer noticed in the vendor-supplied optical monitoring system that the RTP optical node’s primary timing and control card appeared to have lost its connection to the monitoring system. The engineer attempted to reach the network management interface on the card from several NOC workstations, without success. At this time, there was no loss of service.
At approximately 5:15 P.M. on April 11th, the decision was made to manually fail over from the primary timing and control card to the redundant card by removing the primary card from its slot, following processes and procedures the equipment vendor had authorized in similar situations in the past.
Although the status LEDs indicated that the redundant card was now active, NCREN network engineers were still unable to reach the network management interface on the card. There was still no loss of service at this time.
2014-04-11 NCREN Optical Node Outage - RTP Postmortem FINAL.docx
At approximately 5:20 P.M. on April 11th, the decision was made to fail back to the primary timing and control card by reinserting it into its slot. At this point, both the primary and redundant timing and control cards began a 30-minute reboot and synchronization phase. Once this completed, the primary card’s status LEDs indicated that it had become the active timing and control card; however, the network management interface on the card still could not be reached. There was still no loss of service at this time.
At approximately 6:00 P.M. on April 11th, the decision was made to try again to regain connectivity to the RTP optical node’s network management interface by removing and reinserting the primary timing and control card. Shortly after the reinsertion, the RTP optical node suffered a complete failure that resulted in the loss of all services provisioned on or through the node.
The NCREN Network Operations Center (NOC) began receiving network management system alarms and calls from multiple NCREN constituents at approximately 6:00 P.M. The issue was immediately isolated to the Research Triangle Park RPoP optical node, and senior network engineers were immediately engaged in troubleshooting. The incident was escalated to MCNC executive management within 20 minutes.
By 6:30 P.M., MCNC senior network engineers had concluded, based on the behavior of the failed optical node, that the appropriate measure was to restore the node’s configuration from backup. After several failed restoration attempts, a Technical Assistance Center (TAC) case was opened with the equipment vendor at approximately 7:19 P.M. Senior engineers then worked with the TAC to identify a working backup file and restore the failed optical node.
At approximately 7:30 P.M., NCREN Executive Management escalated the issue to the equipment vendor’s account team. MCNC NCREN senior network engineers continued working with the equipment vendor to identify the source of the problem. There was no estimated time to restoration (ETR) at that time.
After several hours of work with the equipment vendor’s TAC, a non-corrupt backup configuration database file was identified on the NCREN NOC’s optical network monitoring system, and the RTP optical node was restored and back in service by approximately 12:00 A.M. on April 12th.
The majority of the disrupted constituent connectivity services were restored at this time, but several constituents’ services remained unavailable until approximately 1:30 A.M. on April 12th. The final circuits were delayed because the optical system required additional fine-tuning.
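The outage durations implied by the approximate times above can be worked out directly. The following is a minimal sketch using Python's standard `datetime` module; the variable and function names are illustrative, and the timestamps are the approximate EDT values stated in this document.

```python
from datetime import datetime

# Approximate times taken from the narrative above (all EDT, naive datetimes).
outage_start = datetime(2014, 4, 11, 18, 0)    # complete node failure, ~6:00 P.M.
node_restored = datetime(2014, 4, 12, 0, 0)    # node back in service, ~12:00 A.M.
all_restored = datetime(2014, 4, 12, 1, 30)    # final circuits restored, ~1:30 A.M.

def hours_between(start, end):
    """Return the elapsed time between two datetimes in hours."""
    return (end - start).total_seconds() / 3600

majority_outage_hours = hours_between(outage_start, node_restored)
longest_outage_hours = hours_between(outage_start, all_restored)

print(f"Majority of services: about {majority_outage_hours:.1f} hours")
print(f"Remaining services:   about {longest_outage_hours:.1f} hours")
```

On these figures, most constituents were without service for roughly 6 hours, and the last circuits to be restored were down for roughly 7.5 hours.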
The NCREN NOC’s initial attempt to send an email notification to NCREN constituents failed because of the connectivity issues created by the loss of the optical node at MCNC’s RTP facility. Available off-site NOC staff members were notified about the communication issue at 6:33 P.M. An initial notification was posted via Twitter at 6:54 P.M. Additional attempts to send notification via off-site email were made at approximately 6:55 P.M., 7:12 P.M., and 7:22 P.M. These attempts failed because the primary NCREN constituent mailing list server was unavailable due to the outage.
At 7:22 P.M., the initial notification was posted on the MCNC website and Facebook. At 7:31 P.M., the initial notification was sent via email and auto-dial using MCNC’s backup notification tool, targeting constituents that received IP services out of RTP. Another notification was sent at approximately 8:07 P.M. via email and auto-dial specifically to MCNC Data Center constituents.
Another status message was posted via Twitter at 10:26 P.M. and emailed at 10:36 P.M. to MCNC NCREN constituents (RTP and MCNC Data Center constituents). At 12:31 A.M., a final notification was emailed to all NCREN constituents stating that the failed optical node had been restored. At that point, the NCREN backbone was no longer considered impaired and was functioning in a normal state.
NOC staff then began calling NCREN constituents that had requested a call back. Updates were posted to Facebook and Twitter at 12:44 A.M. and 1:07 A.M., respectively. The remaining impacted constituents’ connectivity services were restored at approximately 1:30 A.M. on April 12th.
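The notification sequence above shows several channels failing because they depended on the affected node. One way to reason about this is to attempt every configured channel independently and record which ones failed, so that a single unavailable channel never blocks the others. The sketch below is illustrative only: the channel names and sender functions are hypothetical stand-ins, not MCNC tooling, and no real email or social media API is called.

```python
from typing import Callable, List, Tuple

# Hypothetical channel type: a name plus a function that sends the message
# and raises an exception if delivery is not possible.
Channel = Tuple[str, Callable[[str], None]]

def notify_all_channels(message: str, channels: List[Channel]) -> Tuple[List[str], List[str]]:
    """Attempt every channel in order; return (delivered, failed) channel names."""
    delivered, failed = [], []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # A failure is recorded and the remaining channels are still tried.
            failed.append(name)
    return delivered, failed

# Illustrative stand-ins for real senders (assumptions for this example).
def primary_mailing_list(message: str) -> None:
    raise ConnectionError("mailing list server unreachable")

def backup_notification_tool(message: str) -> None:
    pass  # pretend the off-site tool accepted the message

delivered, failed = notify_all_channels(
    "RTP optical node outage under investigation",
    [("primary mailing list", primary_mailing_list),
     ("backup notification tool", backup_notification_tool)],
)
print("delivered:", delivered)
print("failed:", failed)
```

Recording the failed channels alongside the delivered ones gives staff an immediate view of which notification paths share a dependency with the failed infrastructure.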
The following items are noted as mitigating factors, i.e., things that may have increased the impact or delayed problem resolution:
1. Several initial attempts to restore the optical node failed because the NCREN network engineering team used corrupted backup configuration database files. Backup configuration files are automatically retrieved at scheduled intervals from each NCREN optical node and stored on the vendor-supplied optical network monitoring system. This application cannot determine whether any backup file that resides on it is corrupt.
2. The MCNC Net-Info list, the primary application used to communicate the status of network events (such as the April 11th interruption of service) to NCREN constituents, was not functioning correctly.
3. The NCREN network engineering team delayed opening a TAC case with the vendor because they believed the outage would be mitigated by restoring the optical node’s configuration from a backup configuration file located on the vendor-supplied optical network monitoring system, based on guidance received in similar past situations.
4. Redundant pre-provisioned virtual circuits to redundant NCREN aggregation routers would have allowed quicker restoration of, and/or prevented loss of, service to affected constituents on the IP network.
5. NCREN network engineers had to relocate from the NOC to a different building on the MCNC campus to use Out-of-Band Direct Internet Access.
These mitigating factors will all be examined in a subsequent section titled “Action Items.”
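The first factor above notes that the monitoring system cannot tell whether a stored backup is corrupt. One generic technique, shown here only as a sketch, is to record a checksum when each backup is retrieved and to re-verify it before attempting a restore. The file names and contents below are hypothetical, and the vendor's actual backup format and tooling are not described in this document. Note also that a checksum only detects corruption that occurs after the checksum was recorded; it cannot detect a database that was already corrupt when it was exported.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_intact_backups(recorded: dict, backup_dir: Path) -> list:
    """Return names of backups whose current digest matches the recorded one."""
    intact = []
    for name, expected in sorted(recorded.items()):
        candidate = backup_dir / name
        if candidate.is_file() and sha256_of(candidate) == expected:
            intact.append(name)
    return intact

# Demonstration with throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    backup_dir = Path(tmp)
    (backup_dir / "rtp-node-0410.db").write_bytes(b"config A")
    (backup_dir / "rtp-node-0411.db").write_bytes(b"config B")
    recorded = {p.name: sha256_of(p) for p in backup_dir.iterdir()}

    # Simulate later corruption of the newest backup.
    (backup_dir / "rtp-node-0411.db").write_bytes(b"config B\x00garbage")

    intact = find_intact_backups(recorded, backup_dir)
    print("usable backups:", intact)
```

A check of this kind, run on a schedule, would let staff know in advance which stored backups are still byte-for-byte what was originally retrieved.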
MCNC
Tommy Jacobson | Chief Operating Officer and VP Network Infrastructure Initiatives | 919-248-1178 | tjacobson@mcnc.org
Ray Suitte | Sr. Technical Manager/Core Engineering & Network Deployment | 919-248-1454 | rsuitte@mcnc.org
Matt Valenzisi | Chief Network Architect/Sr. Technical Manager Network Management | 919-248-8429 | mvalenzisi@mcnc.org
Jeremy Buenviaje | Sr. Manager, Network Operations & Customer Fulfillment | 919-248-8429 | jbuenviaje@mcnc.org
1) Upgrade optical nodes to address the equipment vendor software issue, beginning with the most critical nodes.
   a. Owner: Ray Suitte
   b. Due Date: April 21, 2014
   PLEASE NOTE: As of the distribution of this RFO on April 17th, 2014, the most critical operational optical nodes on NCREN that contained the previously noted software bug and corrupted database files have been successfully upgraded.
2) Plan for and build backup virtual circuits for constituents receiving MCNC NCREN IP Connectivity Service via MCNC’s owned and operated MPLS-based transport network.
   a. Owner: Matt Valenzisi & Ray Suitte
   b. Due Date: June 11, 2014
3) Relocate the secondary MCNC Data Center lambda, as well as any secondary constituent lambdas, to a diverse optical node in RTP as soon as possible.
   a. Owner: Matt Valenzisi & Ray Suitte
   b. Due Date: Pending completion of new fiber infrastructure currently being constructed in Research Triangle Park to an alternate location (most likely by June 1, 2014).
4) Investigate the potential deployment and added value of additional independent optical infrastructure at several key NCREN RPoPs to increase network availability and reduce disruptions during any network degradation or outage.
   a. Owner: Matt Valenzisi & Ray Suitte
   b. Due Date: May 30, 2014
5) Modify NOC policy for contacting the vendor’s TAC when dealing with unreachable network management interfaces on optical nodes and critical routing and switching infrastructure. Going forward, when loss of connectivity to a network management interface is detected, a TAC case will be opened immediately and, as long as there is no service disruption, no action will be taken until directed by the vendor’s TAC. In addition, as long as there is no service disruption, any recommended actions will be taken during the first available emergency maintenance window.
   a. Owner: Jeremy Buenviaje & Ray Suitte
   b. Due Date: COMPLETED
6) Review High Availability (HA) capabilities of all Network Management tools to ensure maximum availability during any network degradation or outage.
   a. Owner: Matt Valenzisi & Jeremy Buenviaje
   b. Due Date: May 16, 2014
7) Review the posture of NCREN network status and NCREN constituent communication and notification tools to ensure that they will function as expected during any type of network degradation or outage. In addition, verify current constituent contacts for each service provided by NCREN. Ensure that all constituents are aware of all existing channels of communication available to them for receiving information during events of this nature (e.g., Twitter, Facebook, email, etc.).
   a. Owner: Jeremy Buenviaje
   b. Due Date: May 16, 2014
8) Extend the Out-of-Band (OOB) 3rd-party Direct Internet Access circuit (Building 3) into the MCNC Network Operations Center (Building 2).
   a. Owner: Matt Valenzisi
   b. Due Date: May 16, 2014
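The revised NOC policy in action item 5 amounts to a small decision rule, which can be sketched as a function. This is a paraphrase for illustration, not an official runbook: the class and function names are hypothetical, and the branch for the case where service is already disrupted is an assumption, since the policy text above does not spell it out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeStatus:
    management_reachable: bool   # can the NOC reach the management interface?
    service_disrupted: bool      # is constituent traffic affected?

def required_steps(status: NodeStatus) -> List[str]:
    """Return the ordered steps implied by the revised policy (a paraphrase)."""
    if status.management_reachable:
        return []  # the policy is triggered only by loss of management connectivity
    steps = ["open TAC case immediately"]
    if not status.service_disrupted:
        steps.append("take no action until directed by vendor TAC")
        steps.append("apply TAC-recommended actions in first available emergency maintenance window")
    else:
        # Assumption: this branch is not stated in the policy text.
        steps.append("begin service restoration with vendor TAC engaged")
    return steps

for step in required_steps(NodeStatus(management_reachable=False, service_disrupted=False)):
    print("-", step)
```

Applied to the situation at 5:00 P.M. on April 11th (management interface unreachable, no loss of service), this rule would have called for opening a TAC case before any card was removed.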
Administrative Office of the Courts – Raleigh
Akamai Networks - RTP NC
Albemarle Regional Library
Asheboro City Schools
Barton College
Beaufort Community College
Beaufort County Schools
Bertie County Schools
Brunswick Community College
Brunswick County Schools
Burroughs Wellcome Fund
Camden County Schools
Campbell University – Law School
Campbell University – RTP Center
Campbell University – to RTP
1-99.147
1-99.131
1-99.236
1-99.102
1-99.160
1-99.213
1-99.225
1-99.189
1-99.178
1-99.200
1-99.114
1-99.188
1-99.18
1-99.38
1-99.43
Cape Fear Community College
Cardinal Gibbons High School
CarolinaEast – Heart Center New Bern
Carteret Community College
Carteret County General Hospital
Carteret County Health Department
Carteret County Schools
Casa Esperanza Montessori Charter School
Central Carolina Community College
Central Park School for Children
Central Piedmont Community College – MCNCDCS-ChltMyers
Chapel Hill-Carrboro City Schools
Charlotte Mecklenburg County Government
Charlotte-Mecklenburg County Government – RTP-ChltMyers
City of New Bern
Coastal Carolina Community College
College of The Albemarle
Craven Community College
Craven County Government
Craven County Schools
Cumberland County Schools
Daymark Recovery Services – Archdale Center
Daymark Recovery Services – Hoke Center
Daymark Recovery Services – Lee Center
Daymark Recovery Services – Rockingham Center
Daymark Recovery Services – Vance Center
Daymark Recovery Services – Wake Center
Daymark Recovery Services – Harnett Center
Department of Justice – ITS – MCNCDCS-RaleighDep
Department of Justice – ITS – MCNCDCS-RaleighWest
Duke University – DukeAlex-DukeNorth
Duke University – RTP-Duke-Naan
Durham Public Schools
Durham Technical Community College – Hillsborough
Durham Technical Community College – Lawson St Main Campus
1-99.239
1-99.46
1-99.123
Durham Technical Community College – North Satellite Campus 1-99.48
Durham Technical Community College – Northgate Mall 1-99.89
Durham Technical Community College – South Bank Building
Eastern NC School for the Deaf
1-99.41
1-99.164
1-99.179
1-99.218
1-99.229
1-99.177
1-99.165
1-99.245
1-99.32
1-99.124
1-99.159
1-99.104
1-28.911
1-99.256
1-99.64
1-28.910
1-99.149
1-99.136
1-99.214
1-99.133
1-99.168
1-99.151
1-99.227
1-99.249
1-99.251
1-99.259
1-99.253
1-99.254
1-99.228
1-99.250
1-19.910
1-20.911
26-40.910
1-99.122
Edenton-Chowan County Schools
Elizabeth City State University
Elizabeth City-Pasquotank County Schools
Elon University
Elon University – Data Center #1
Elon University – Data Center #2
Engelhard Medical Center
Exploris Middle Charter
Fayetteville State University – Seymour Johnson AFB
First Flight Venture Center
Franklin County Schools
Gates County Schools
Global Scholars Academy Charter School
Governor Morehead School for the Blind
Granville County Schools
Hamner Institutes for Health Sciences
Hertford County Schools
Hyde County Schools
Information Technology Services – NCTN Trunk to RTP RPoP
Johnston Community College
Juniper
Kestrel Heights Charter
Louisburg College
Magellan Charter School
Martin Community College
Martin County Schools
Mayland Community College
MCNC Data Center – RaleighWest
MCNC Data Center – WSA1A Primary
MCNC Data Center – WSA1A Secondary
MCNC Data Center Services
Meredith College
Mooresville Graded School District
Mount Olive Medicine Center
Nash Community College
Nash Health Care Systems
National Humanities Center
National Institute of Statistical Sciences #1
National Institute of Statistical Sciences #2
NC Biotechnology Center #1
1-99.215
1-99.207
1-99.88
1-99.107
1-99.163
1-99.209
1-99.45
1-99.85
1-99.195
1-99.204
1-99.210
1-20.910
1-3.911
1-3.912
1-1.9
1-99.14
1-99.191
1-99.196
1-99.193
1-99.255
1-78.910
1-78.911
1-99.226
1-99.148
1-99.12
1-99.134
1-99.212
1-99.222
1-99.132
1-99.109
1-99.198
1-99.128
1-99.217
1-99.182
1-99.153
1-99.166
1-99.53
1-99.52
1-99.5
1-99.56
NC Biotechnology Center #2
NC Central University – to RTP RPoP
NC Independent Colleges and Universities
NC Rural Economic Development Center
NC State University – RTP-NCSUHills
NCSU - Center for Marine Sciences and Technology (CMAST)
NCSU Tape Library – MCNCDCS-NCSUCMDF
NCSU Tape Library – MCNCDCS-NCSUHills
New Hanover County IT Library
New Hanover Regional Medical Center
NOAA NOS
North Carolina School of Science and Mathematics
North Raleigh Christian Academy
Onslow County Schools
Orange County Schools
Partnership for Defense Innovation
Perquimans County Schools
Person County Public Library
Piedmont Community College
Piedmont Health Services – Carrboro
Piedmont Health Services – Charles Drew
Piedmont Health Services – Pittsboro
Piedmont Health Services – Senior Care
Raleigh Charter High School
Ravenscroft School
Red Hat
RENCI – RalDep to NLR
RENCI – to RTP
Research Triangle High School
River Mill Academy
Roanoke Rapids City Schools
Roanoke-Chowan Community College
Robeson Community College – Pembroke Campus
RST Fiber Optic Network – Lincolnton to Hyde L2 P2P
RST Fiber Optic Network – Lincolnton to Williamston L2 P2P
RTI
Saint Augustine's University
SAS Institute
Shaw University
Shodor Education Foundation
1-99.192
1-99.183
1-99.201
1-99.172
1-99.176
1-99.231
1-99.174
1-99.101
1-99.19
1-99.144
5-23.911
1-99.71
1-99.135
1-99.93
1-99.194
1-99.220
1-99.4
1-99.37
1-99.17
1-99.11
1-99.72
1-99.143
1-14.910
1-21.910
1-99.185
7-99.18
1-99.203
1-99.70
1-99.15
1-99.155
1-99.152
1-99.82
1-99.171
46-72.01
46-74.01
1-99.142
1-99.75
1-99.181
1-99.40
1-99.106
South Piedmont Community College – Polkton Campus
Southeastern Baptist Theological Seminary
Southeastern Regional Medical Center
St. David's School
Sterling Montessori Charter School
The Institute for the Development of Young Leaders
Town of Chapel Hill – Martin Luther King Jr Blvd
Triangle Math and Science Academy
Tyrrell County Public Schools
UNC Chapel Hill – Chlt-UNCManning
UNC Chapel Hill – UNCPhillips-UNC54
UNC Charlotte – UNCAtkins to MCNC Data Center
UNC Charlotte – UNCRUP to MCNC Data Center
UNC Coastal Studies Institute
UNC Healthcare – Meadowmont to RTP via UNC Manning
UNC Pembroke – to RTP RPoP
UNC-Chapel Hill Institute of Marine Sciences to UNC Manning P2P
UNC-General Administration – (main floor 215) (Backup Connection)
UNC-General Administration – MCNC Data Center-UNCManning L2
UNC-General Administration – MCNC Data Center-UNCPhillips L2
UNC-General Administration – RTP-UNCManning L3
UNC-TV via Data Center
Urban Ministries of Wake County
Vance County Public Schools
Vance-Granville Community College
Vidant Health Beaufort (Washington) L2
Vidant Health Bertie (Windsor) L2
Vidant Health Chowan (Edenton) L2
Vidant Health Greenville L2
Vidant Health Outer Banks (Nags Head) L2
Vidant Health Roanoke-Chowan (Ahoskie) L2
Voyager Academy
Wake County Government – to RTP RPoP
Wake County Schools
Wake Technical Community College
Warren County Schools
Washington County Schools
Wayne Community College
1-99.167
1-99.126
1-99.125
1-99.100
1-99.162
1-99.216
1-99.232
1-99.205
1-99.206
2-99.79
24-36.910
1-29.910
1-30.910
1-99.186
1-99.230
1-99.161
99-99.43
1-99.94
1-25.912
1-24.910
1-99.175
1-99.246
1-99.97
1-99.199
1-99.184
74-99.02
74-99.01
70-99.01
8-99.29
71-99.01
69-99.01
1-99.47
1-99.10
1-99.224
1-99.158
1-99.190
1-99.202
1-99.197
Wayne County Schools
WebAssign – LAN
WebAssign – VoIP
William Peace University – to RTP RPoP
Willow Oak Montessori
Wilson Community College
Wilson County School District
Woods Charter School
1-99.221
1-99.233
1-99.234
1-99.98
1-99.208
1-99.150
1-99.154
1-99.119
Bams Hosting
CASC
CFCC
Campbell
Catawba Hosting
City of Ral Fixd
DOJ
Duke Psychiatry
Durham Public Schools
Greensboro College
ICAN Fixed
ICAN Fixed, NewTeacherCtr
ITS
Learn NC
Mecklenburg
NC Live
NCCU
NCCU DR
NCHA
NCICU
NCSSM Hosting
NCSU DELTA WebCT Vista Hosting
NCSU Off of Ext Tech Svcs
NCSU VCL
NCVPS
NDG
NewTeacherCtr
ONET Hosting
Priveon
RTI
Remote Learner
Sandhills CC
Shaw Univ
St Aug
Surya Technologies
TRLN
Tape Backup Ntl Humanitie
TapeBack-upNC RuralEconCn
TrustedMetricsLLC
UNC-C
UNC-C VirtualHstg
UNC-CH LEARN NC
UNC-Char Mgd Hstg
UNC-GA Dell Banner, Puredisk UNC-GA
UNC-GA Dell Banner, UNC-GA SUN Banner
UNC-GA SUN Banner
UNC-GA, UNCGA-LEARN NC Hosting
UNC-W Hosting
UNCFSU
UNCGA
UNCW Hosting
WebAssign Hosting
e-NC eLearning Commission
Prepared by Jeremy Buenviaje, Matt Valenzisi
Reviewed by Joe Freddoso, Tommy Jacobson, Ray Suitte