NOC Report to CIC Regarding Operational Considerations of a Distributed Network Support Model

INTRODUCTION

ACNS has seen a continued increase in the number of network applications that require configuration from the core to the edge. The four basic areas of network management (design, initial and ongoing switch configuration, monitoring, and troubleshooting) are challenged by the current infrastructure. Inconsistencies and challenges in these four areas exist in the currently distributed switch operational model. Operating the switches from a centralized point and organizational unit is proposed as a means of resolving these challenges. This document is broken into two sections: an analysis of network operational areas as they relate to CSU, and a discussion of concerns expressed by the ACNS NOC and network managers.

BACKGROUND

Some background is useful in understanding the scale, complexity and foundation of the CSU network. Networking is typically broken down into “layers” (physical through application). ACNS’ NOC, currently composed of seven individuals, has extensive campus experience across all layers. The campus network consists of over 900 switches spread across three campuses, 100+ MDFs and 280+ IDFs, carrying the traffic of more than 40,000 pieces of equipment. In total, this comprises 213 subnets coordinated with 60 subnet managers. This background provides the foundation for understanding the proposal for consistency and operation of the network by a single, cohesive entity. Additionally, as part of the background, statistics relating to centralized network operations are provided below.

WestNet Survey

In a June 2011 survey of 14 institutions of higher education attending the Westnet Conference at the University of Utah, only one of the 14 schools stated that it has a distributed support model for networking. That school (the University of Utah) further indicated that it was making steady progress toward centralizing support of its networking infrastructure.
Current Estimate of Central vs. Distributed Switch Management

Table 1, below, illustrates the proportion of switches for which the NOC or the college/department is responsible for daily switch management. It is important to note that this does not take into consideration the design, monitoring or troubleshooting aspects of network management; it covers only the ongoing virtual changes for ports/VLANs and the physical cabling/patching/maintenance. Currently, the NOC is responsible for over ¾ of all virtual management across CSU’s switches and either shares in or is entirely responsible for ¾ of the physical management.

              NOC    50/50    10/90    90/10
Virtual %      77       20        1        2
Physical %     39       40       19        2

Table 1. Illustration of the proportion of central vs. distributed ongoing operational support.

Notes regarding Table 1:
“NOC”: NOC does all work (virtual or physical) on the switch.
“50/50”: VLAN and physical changes are shared between the department/college and the NOC.
“90/10”: NOC is responsible for 90% of the changes (virtual or physical) on the switch; the department is responsible for 10%.
“10/90”: NOC is responsible for 10% of the changes (virtual or physical) on the switch; the department is responsible for 90%.

Current Estimate of the Scope of VLAN Change Requests

This analysis was done on all switches backed up by the NOC. The NOC monitors all changes on many of the switches across campus departments and colleges. This is done primarily to keep backups of switch configurations. It also assists in troubleshooting, since the process records changes to those configurations. Table 2, below, shows that, of the changes made over the six-week period*, 69% were made by the NOC.

COB     Engr    CVMBS    AHS    CNR    LibArts    CNS    AgSci    Other    NOC
53**    7       2        0      1      0          0      0        1        141

Table 2. Illustration of VLAN change requests handled either by the colleges or the NOC.
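The 69% share reported for Table 2 can be checked directly from the counts listed in the table. The short Python sketch below does only that arithmetic; the dictionary keys are the column labels from Table 2, and nothing beyond the published counts is assumed.

```python
# VLAN change counts as listed in Table 2 (April 11 to May 25, 2011).
changes = {
    "COB": 53, "Engr": 7, "CVMBS": 2, "AHS": 0, "CNR": 1,
    "LibArts": 0, "CNS": 0, "AgSci": 0, "Other": 1, "NOC": 141,
}

total = sum(changes.values())          # all recorded changes in the period
noc_share = changes["NOC"] / total     # fraction of changes made by the NOC
cob_share = changes["COB"] / total     # fraction made by COB, the largest college

print(f"Total changes: {total}")       # Total changes: 205
print(f"NOC share: {noc_share:.0%}")   # NOC share: 69%
print(f"COB share: {cob_share:.0%}")   # COB share: 26%
```

Note that most of the COB count comes from a single day in which one VLAN was configured across 30+ switches, so the college share on a typical day is likely lower than this figure suggests.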
Notes regarding Table 2:
*Summary of network changes between April 11 and May 25, 2011.
**COB configured a VLAN across 30+ switches in one day.

Finally, in addition to operating the four main focus areas mentioned above, security and documentation must be done consistently, especially as initiatives such as Voice over IP (VoIP) are deployed across campus. These combined activities form the basis for justification of the centralized switch operational proposal.

Design: In most cases, the design phase of networks is already handled by the NOC. The NOC works closely with architects, Facilities, Colleges/Departments, and Telecommunications on new building designs and remodels. All aspects of the network design are dealt with by the NOC: power/UPS needs, cooling requirements, wireless, VoIP, security, environmental controls, budgets, etc. Building designs and remodels are a complex process requiring experience along with often daily communication between NOC members and others on the design team. There is an established workflow and communication structure to make sure the proper equipment is ordered and installed. This design process is currently being handled by the NOC as a coordinated service to the campus.

Configuration: The second part of network operations is configuration. The first element of this consists of physically installing and preparing the cabling in anticipation of the new switch gear. For most areas on campus, the NOC oversees the physical installation. This is because the physical infrastructure, both fiber and copper, is a necessary part of installations. While the copper installation can be done by non-ACNS staff, this is not the recommended model for several reasons:

1. The physical networking infrastructure (fiber, cabling, patch panels, jacks, etc.) is under a 30-year service warranty from Commscope. This warranty becomes void if very prescriptive cabling procedures are not followed.
Audits of closets have found cabling that does not match specifications, resulting in voided warranties, and putting the infrastructure back under warranty can require extraordinary resources.

2. The fiber infrastructure is complex, unfamiliar to most, and even dangerous, as it carries invisible laser light that can burn retinas if not handled properly.

Thus, for safety and financial reasons, it is recommended that copper and fiber infrastructure be managed only by ACNS telecommunications. Overall, new installs and remodels are handled by the NOC in close work with telecommunications.

The second part of installation, which complements the physical installation, is the configuration of the Ethernet switches. Again, except for a few colleges, this is already done primarily by the NOC. VLAN, multicast, spanning-tree, QoS, and SNMP settings are just some of the necessary items that are configured on switches. The NOC has developed a base configuration which can be easily modified to accommodate the following:

- Facilities needs on the network: these require coordination with the various entities needing such services (environmental controls, CSUPD, Alarm Shop, CardKey access, metering, etc.).
- Central VLANs trunked across campus for such services: the NOC keeps track of all such VLANs and assists in tracking which switch port and jack is assigned to each, along with the IP address ranges mapped to those VLANs and used for routing purposes.
- PoE needs: the NOC manages power consumption and temperature issues related to closets and PoE. This includes UPS management, temperature monitoring in closets, and power/cooling needs. This is then coordinated with telecommunications and Facilities for rectification or modification of existing infrastructure.
- Wireless configurations: the campus has three wireless vendors. Which departments and areas need which SSIDs is carefully coordinated, and the jack numbers and switch port numbers of all wireless APs are tracked as well.
- Uplink and downlink physical considerations: fiber and copper infrastructure play a large part in how MDFs and IDFs are connected. In addition, extensive knowledge of switch models, parts, and the interaction of various connectors is necessary to ensure compatibility between and among devices.
- Spanning tree configuration, loop detection, BPDU guard and dhcp-protect: improper configuration of spanning tree can lead to widespread network outages, easily affecting multiple departments and buildings. Various centrally deployed tools are used to ensure that spanning tree, loop detection, BPDU guard and DHCP protection are configured consistently.

While the above is not intended to be a comprehensive list of the considerations needed for proper switch configuration, it illustrates the amount of coordination that must be overseen by one entity for complete and consistent deployment of network gear.

Monitoring: The third part of network operations relates to monitoring. This comprises a number of factors, including uptime, traffic loads, errors, and throughput. The NOC maintains central servers to monitor all of those areas where it has a consistent picture of the network. Proper monitoring includes configuration settings that report back to a central server for data collection, traffic monitoring, error reporting, troubleshooting, and/or security event handling/detection. Currently, this picture of campus is somewhat limited due to the distributed nature of the network. A consistent configuration across campus is needed to establish the foundation necessary for additional and more complicated deployments of services requiring QoS, service level agreements, or specific uptime or security needs.

Troubleshooting: The fourth area of network operations is troubleshooting. The previous three items must be in place before effective troubleshooting can be accomplished. Improper configuration and installation can lead to network problems.
An inability to monitor the entire network to the edge switch in a consistent and complete fashion hampers troubleshooting. One charge of the CIC in evaluating network management policies is to consider operational efficiency. A consistent design, configuration and monitoring process across campus would greatly speed up the NOC’s ability to troubleshoot problems. Currently, far too much time is spent by the NOC looking for network devices that aren’t known or dealing with configurations that are not consistent.

A major component of troubleshooting is the notification and response structure. While also reliant upon the monitoring piece of network operations, notification and response relates to which entities are notified in which order, how the notifications are escalated, and who actually responds during which times. Currently, many devices are monitored for Facilities and CSUPD, along with basic network services such as network service to an entire building. The NOC maintains central monitoring such that devices are monitored during specified hours, and when they fail, NOC staff are contacted electronically in a specified order. Many of these responses are 24x7 to address life and safety concerns. Coordinating all of those alerts, devices, individuals, and entities requires a central base of knowledge and is a natural extension of the NOC. Beyond the notification mechanism, complete problem resolution often requires a response by the NOC together with Facilities or CSUPD.

Security and Documentation: Two additional areas were noted that encompass network operations: security and documentation. There has been an increase in the security and integrity requirements of networks. Some of these cases deal with ensuring that traffic cannot be analyzed or monitored except by those with authorization to do so. Right now, anyone with read/write access to the switches can monitor/sniff ports.
Other cases deal with the physical integrity of the switch infrastructure, that is, its exposure to physical compromise. One project in particular deals with radioactive material. For this project, physical access to the switches is restricted solely to those with a demonstrated and approved need for access. Additional access restrictions and identification processes are put in place, and audits are conducted to ensure compliance. These sorts of security requirements and issues are increasing across campus. A distributed operational model for the switch infrastructure is in conflict with those requirements. While the NOC has been able to work through the existing cases, the concern for the NOC is that the current model of network management does not scale and does not pass such audit requirements.

The second encompassing area of network operations is documentation. Accurate and complete documentation is a necessary component of any successful network operations policy. As the campus moves into the next generation of networking and network applications, documentation of the four key areas of network operations needs to be addressed. This requires a level of consistency best achieved by an individual or small team of individuals, and doesn’t lend itself to a distributed model. In fact, a consistent approach to design, configuration, monitoring and troubleshooting facilitates accurate documentation. Naturally, the more disparate or distributed the entities involved in that process, the more likely it is that inaccuracies will be introduced, and the more difficult the network will be to operate. Improper documentation causes delays, inefficiencies, and mistakes at all levels of the network operations process.

SUMMARY

Four areas of network operations were discussed: design, configuration, monitoring and troubleshooting. It was noted that the NOC is already involved in many of the design stages of the campus network.
Currently, many campus entities ship their network gear to the NOC for consistent configurations. The NOC noted that a more centralized approach to network management will yield switch configurations that can prevent some common network outages. The monitoring and troubleshooting of the network involves having a consistent and complete picture of the campus network from core to edge in order to note problems and correctly respond to or troubleshoot them. Finally, two encompassing concerns were presented: security and documentation. The security component is growing in focus across campus. Meanwhile, the documentation piece is fundamental to a successful deployment of any network operations policy.

NOC RESPONSES TO SPECIFIC CONCERNS EXPRESSED BY THE UNITS

Discussion on various areas of concern:

1. Patching of cables is one area that has been done by network managers. The NOC does not have the resources to patch cables with the responsiveness needed by the entire campus.

The committee must consider this issue as it moves forward. The issues surrounding it are those already mentioned, such as safety and warranty concerns. Meanwhile, as more devices are added to the network, telecommunications works closely with the NOC to map network switch ports to patch panel locations. Switch ports are no longer generic; rather, they are dedicated to specific services. Knowing which switch ports go to which environmental control, which camera or which VoIP phone is critical for all phases of network operations, from design to troubleshooting. The NOC works closely with telecommunications during installs and changes to document these items. Color coding and documentation processes are in place for tracking these services. Unauthorized or uncommunicated moves of copper cabling can result in outages that must be tracked down and fixed, sometimes on a 24x7 basis.
To that end, the NOC proposes a simple system of color coding, education, and communication with network managers as to which cables can be patched by the department and which carry central infrastructure. To maintain the status quo in this particular area, the NOC recommends pre-configuring switch ports to meet most of the network managers’ needs, so that patching is all that is necessary, with an emphasis on limiting the amount of actual switch configuration that the NOC would need to do to meet these needs.

2. The distributed units may require too many VLAN changes on edge switch ports for the NOC to accommodate within the units’ expected timeframe.

The NOC has looked at statistics on switch configurations (see Table 2 near the start of this document). Over a six-week period, the NOC was responsible for the majority of all network changes, with only a subset of changes being made by the departments or colleges. However, it is possible that entities might need a number of changes made in a hurry, a demand which the NOC, at current staffing levels, could not meet. Two solutions are presented here: increased staffing of the NOC to meet these needs (in process), and a process that allows network managers to make edge switch port changes through an automated system that validates the request, then permits and documents the change without human intervention. This latter method helps to keep an important part of network support in place: the network manager who is in the field. Finally, in critical situations, a phone call can be placed to the NOC for immediate response.

3. The network managers have always had a good working relationship with the NOC. Why is that not working all of a sudden?

It’s critical to note that the NOC is not singling out any entity or person, and truly believes in the good job being done by network managers and in the critical role they play in the system.
The concern for the NOC is solely that the current system does not scale to the demands being made upon the NOC. Thus, the network managers and the NOC are charged with working together to find a way to proceed in an efficient manner that provides a more hardened and consistent network for the future. In the end, the current good working relationship must continue for us all at CSU to be successful.

4. Why must I provide read/write access to the NOC, but not retain it myself since my department/college bought the switch?

That’s a fair question, and hopefully largely addressed above. Models for centrally funding switch replacement are being considered for subsequent proposal to the campus.

5. We have entities for which we must provide a certain level of service. How can we do that if central services are to be provided through our network, especially if those services require a level of QoS with a higher priority than my data does?

The committee does need to address this issue. Do life and safety packets have priority over researcher packets and if so, who pays for those life and safety ports? Do electronic meters prioritize over data packets? What about VOIP? These types of questions must be answered. An operational model will indeed facilitate the configuration of the network so that these policies can be implemented coherently and consistently for the entire network.

6. Why can’t we continue to manage the department/college switches like we did, just with closer integration with the NOC?

The simple answer is that it is becoming too hard to operate under the present model, and the additional communication and collaboration necessary to persist under the current model is unsustainable as it presents challenges at all levels of network operations. In the design phase, it was noted that the NOC works with numerous entities to design the network properly.
This process establishes a common knowledge base and documentation process that is shared among the NOC staff to facilitate the next steps of proper and efficient network operations.

Configuration of the switches is something that is discussed by the NOC on a regular basis. Base configurations, naming conventions, product lines and parts are constantly reevaluated, analyzed, and discussed with vendors and tech support. The configurations are discussed with a view to making all services work across campus. Configuration is not simply a matter of putting a configuration on a switch; it is an integral step that follows the design phase and builds the common knowledge of the NOC.

Monitoring is done across campus devices, as many services are spread across all networks. Additionally, switches are being configured to report back to a central monitoring server for error and data collection. It’s difficult for the NOC to envision monitoring being done in a distributed manner.

Finally, troubleshooting is a critical component that builds on the monitoring. The entity with the design knowledge, configuration understanding and monitoring tools is the one alerted to the problem and the one armed with the tools to fix it. In addition, the NOC is charged with 24x7 response for those critical items. Given that, the primary concern for the NOC is that undocumented changes to the switches, along with inconsistent configurations across campus, force the NOC to respond to outages that it did not cause and has no knowledge of. Those charged with the responsibility to respond feel it is only fair either that they oversee the changes that affect that responsibility, or that those who have the ability to make such changes also share in the 24x7 response effort. While the latter sounds plausible, it doesn’t work practically.
A network manager simply isn’t going to have the common knowledge of the network that the NOC has in order to help CSUPD fix a camera that’s out on a snowy Saturday night. It’s going to take the NOC to do it. (Thus, the compromise to provide an automated change request process was proposed, above.)

7. Why the concern over security all of a sudden? Don’t secure protocols eliminate many of those concerns?

Two primary concerns are in the forefront. First, some regulatory agencies require the network to be hardened and accountable in a manner that simply isn’t workable in a distributed environment. Access, design, and even knowledge of some of these networks are restricted! Second, while secure protocols may make sniffing VoIP and similar traffic difficult, denial of service (DoS), ARP spoofing, and sniffing of ports lead to network integrity issues and raise liability concerns for individuals and CSU.

8. What are some protocol/technical challenges the NOC has before it that it says are hard to meet campus-wide with the current architecture?

802.1x, IP multicast, QoS, VoIP, Facilities networks for environmental controls, smart meters, card access devices, power/UPS management, fire/police alarms along with video surveillance services/cameras, and distance learning in classrooms are the primary ones.

9. What does the NOC want exactly?

Sole read/write access to the switches, while still providing read access to the units where desired.

10. Will that fix everything?

No. We’ll always have network issues, computers will always break and there will always be problems. The NOC’s primary objective is to get the network hardened and managed efficiently and consistently so that such problems are minimized and the response to them is quicker and more effective.
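As an illustration only, the automated edge-port change process proposed under concern 2 (validate the request, then permit and document it without human intervention) might be sketched in Python as follows. Everything here is an assumption made for the example: the class names, the requester and switch names, the VLAN and port numbers, and the two validation rules are hypothetical, and the sketch records a decision rather than touching a real switch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ChangeRequest:
    requester: str   # network manager submitting the request (hypothetical ID)
    switch: str      # switch name (hypothetical)
    port: int        # edge port to be reassigned
    vlan: int        # VLAN the port should be moved to


@dataclass
class ChangePolicy:
    # Assumed data: VLANs delegated to each requester, and ports on each
    # switch that are dedicated to central services (cameras, VoIP, etc.).
    allowed_vlans: Dict[str, Set[int]]
    reserved_ports: Dict[str, Set[int]]
    audit_log: List[dict] = field(default_factory=list)

    def validate(self, req: ChangeRequest) -> Optional[str]:
        """Return a rejection reason, or None if the request is acceptable."""
        if req.vlan not in self.allowed_vlans.get(req.requester, set()):
            return "VLAN not delegated to requester"
        if req.port in self.reserved_ports.get(req.switch, set()):
            return "port reserved for central infrastructure"
        return None

    def submit(self, req: ChangeRequest) -> bool:
        """Validate the request and document the outcome either way."""
        reason = self.validate(req)
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "request": req,
            "approved": reason is None,
            "reason": reason,
        })
        # A real system would push the change to the switch here.
        return reason is None


policy = ChangePolicy(
    allowed_vlans={"engr-netmgr": {110, 120}},
    reserved_ports={"engr-idf-1": {47, 48}},
)
ok = policy.submit(ChangeRequest("engr-netmgr", "engr-idf-1", 12, 110))
blocked = policy.submit(ChangeRequest("engr-netmgr", "engr-idf-1", 48, 110))
```

In this sketch the first request is approved and the second is rejected because port 48 is marked as reserved; both outcomes are written to the audit log, which mirrors the requirement that every change be documented.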