Campus Infrastructure Committee
Report to ITEC Advisory Committee (IAC) on a CSU Network Operations Policy
July, 2011

Committee Members
Scott Baily, chair, Academic Computing and Network Services (ACNS)
Ty Boyack, Natural Resources Ecology Laboratory
Jim Cox, College of Natural Sciences
Michael Dennis, Business and Financial Services
Robin McGee, Warner College of Natural Resources
Jon Peterson, Research Services
Mark Ritschard, College of Engineering
Jon Schroth, College of Business

Introduction
Input from Campus
Scope and Draft Policy
Appendix A: Background and Concerns from ACNS NOC
Appendix B: NOC Responses to Specific Concerns Expressed by the Units
Appendix C: Campus Concerns Regarding Central Management of Network Switches
Appendix D: College need for Write Access to Network Switches
Appendix E: CIC Report on Campus Network Infrastructure

Network Operations Policy
DRAFT – July 20, 2011

Introduction

The campus network is essential to the conduct of CSU’s education, research and outreach tripartite mission. The data network must be "hardened" in order to ensure the capacity, availability, and throughput required to accommodate user needs, in particular, to prepare for the upcoming transition from a legacy telephone system to a unified communications system as an application relying upon the data network.
Furthermore, the ACNS Network Operations Center (NOC) reports that it is unable to meet these goals in the present operational environment, especially given its university role in sustaining an environment capable of supporting Voice over IP (VoIP) and an ever-growing list of environmental and access control systems and life and safety devices. Finally, this activity is consistent with a directive from the Provost to become more efficient, and is integral to implementing CSU’s IT Consolidation Activity #7.

The Vice President for Information Technology, as chair of the ITEC Advisory Council (IAC), charged the Campus Infrastructure Committee (CIC), an ad-hoc subcommittee of the IAC, to develop a policy with the following characteristics:

1. Effective, allowing the network to be hardened, as described above, resulting in increased availability and higher capacity as it is upgraded in speed,
2. Efficient, supporting a single, non-duplicative infrastructure,
3. Flexible to the extent practicable, establishing usage parameters for the network that recognize special circumstances, such as use of the network for research, yet preserve the quality of the network for other users,
4. Clear in terms of responsibilities for all aspects of operations, and
5. Sustainable, enabling the ACNS NOC to maximize administrative and operational efficiency, availability, and overall reliability of the campus network.

Input from Campus

As part of its consideration of the current network operations model, the CIC has discussed issues that have rendered operations of the network problematic, as well as the merits of continuing to allow non-ACNS personnel to configure and operate network switches. These discussions are included as Appendices A-C.
During the Spring of 2011, discussions regarding a new network operations policy were conducted with the Provost, the IAC, the Council of Deans, ISTeC’s Education and Research Advisory Councils, and campus Subnet Managers.

Approximately 77% of the network switches on the campus today are operated by the ACNS Network Operations Center (NOC). In those areas, local IT support personnel are satisfied with the support received from the NOC. In many cases, concerns about the ability of non-ACNS IT staff to adequately manage their local area networks can be addressed by defining procedures for things such as Virtual LAN (VLAN) changes, the ability to monitor network traffic for problem diagnostic purposes, and the ability to look at traffic levels, port errors, etc. on network switches in their area.

For this policy to be effective, the CIC believes an exemption request and review process will be required. Exemptions to the policy are addressed below.

Scope

In the context of this policy, Covered Devices shall include:

1. Building distribution switches that have been migrated to dual gigabit Ethernet connections (via the IAC CIC model)
2. Edge switches in buildings where VoIP, environmental or access control systems, or life and safety devices have been deployed
3. Exceptions to this policy include network switches used to implement independent and specific unit requirements, e.g. HPC clusters, SAN implementations, or for purposes other than supporting “edge” devices such as end-user workstations, laptops, printers, etc. Units may request exemptions for their new and existing data switches, but in doing so must agree to:
   i. Configure units’ data switches according to NOC-approved standards for settings like spanning tree, multicast, etc.
   ii. Fund the initial cost of implementing a dual infrastructure (one for units’ data switches and the other for VoIP, energy management, life and safety, etc.
devices), as illustrated in Figure 3 of Appendix E, the CIC infrastructure report of January, 2010.
   iii. Independently fund the maintenance and life cycle replacement of such exempted switches
   iv. Accept that the NOC will disable connectivity to exempted switches if they are determined to be disrupting other critical services

Policy

Covered Devices, as described above, will fall under the administrative control of the ACNS NOC. A web form will be developed through which exceptions to this policy may be requested. Such requests will be reviewed by the CIC, which shall render a decision within 10 business days from the date the exemption request was submitted.

Authorized IT personnel, upon request, will be granted “read” access to Covered Devices and, as available, assistance with NOC-developed or NOC-supported tools and utilities useful in monitoring network utilization, port errors, etc. to better enable end user and application support.

Procedures, defined and periodically reviewed and modified per IAC approval, will be adhered to for:

• Port activations and related documentation
• VLAN changes for ports, whether they are currently active or pre-configured with an alternate VLAN
• Requests to “mirror” edge device traffic to a specific monitoring port for diagnostic purposes

Such procedures will include reasonable expectations for turnaround time for requests of various magnitudes (small, incidental requests; medium and large requests). Procedures will also include an escalation process to ensure timely response by the ACNS NOC. Procedures shall also define responsibilities shared between the NOC and other IT personnel.
For example:

• Authorized IT staff may physically connect (“patch”) port activations in response to requests by the units
• Any traffic requested for monitoring purposes shall adhere to the University’s Acceptable Use Policy

Appendix A – Background and Concerns from ACNS NOC

INTRODUCTION

ACNS has seen a continued increase in the number of network applications that require configuration from the core to the edge. The four basic areas of network management (design, initial/ongoing switch configuration, monitoring, and troubleshooting) are challenged by the current infrastructure. Inconsistencies and challenges in these four areas exist in the currently distributed switch operational model. Operating the switches from a centralized point and organizational unit is proposed as a means of resolving these challenges. This document is broken into two sections: an analysis of network operational areas as they relate to CSU, and a discussion of concerns expressed by the ACNS NOC and network managers.

BACKGROUND

Some background is useful in understanding the nature of the CSU network’s scale, complexity and foundation. Networking is typically broken down into “layers” (physical through application). ACNS’ NOC, currently composed of seven individuals, has extensive campus experience across all layers. The campus network consists of over 900 switches spread across three campuses, 100+ MDFs, and 280+ IDFs, carrying the traffic of more than 40,000 pieces of equipment. In total, this comprises 213 subnets coordinated with 60 subnet managers. This background provides the foundation for understanding the proposal for consistency and operations of the network from a single, cohesive entity.
Additionally, as part of the background, statistics relating to centralized network operations are provided:

WestNet Survey

In a June 2011 survey of 14 institutions of higher education attending the Westnet Conference at the University of Utah, only one of the 14 schools stated that it has a distributed support model for networking. That school (the University of Utah) further indicated it was making steady progress toward centralizing support of networking infrastructure.

Current Estimate of Central vs. Distributed Switch Management

Table 1, below, illustrates the proportion of switches for which the NOC or college/department is responsible for the daily switch management. It is important to note that this does not take into consideration the design, monitoring or troubleshooting features of network management, but rather the ongoing virtual changes for ports/VLANs or physical cabling/patching/maintenance. Currently, the NOC is responsible for over ¾ of all virtual management across CSU’s switches and either shares in or is entirely responsible for ¾ of the physical management.

             NOC    50/50    10/90    90/10
Virtual %     77       20        1        2
Physical %    39       40       19        2

Table 1. Illustration of the proportion of central vs. distributed ongoing operational support.

Notes regarding Table 1:
“NOC”: NOC does all work (virtual or physical) on the switch
“50/50”: VLAN and physical changes shared between department/college and NOC
“90/10”: NOC responsible for 90% of the changes (virtual or physical) on the switch, department responsible for 10% of changes (virtual or physical)
“10/90”: NOC responsible for 10% of the changes (virtual or physical) on the switch, department responsible for 90% of the changes (virtual or physical)

Current Estimate of the Scope of VLAN Change Requests

This analysis was done on all switches backed up by the NOC. The NOC monitors all changes on many of the switches across campus departments and colleges.
This is primarily done to have backups of switch configurations. Additionally, it offers assistance in troubleshooting, as the process also reflects changes in such configurations. Table 2, below, illustrates that of the changes made over the six-week period, 69% (141 of 205) were made by the NOC.

COB    Engr   CVMBS   AHS   CNR   LibArts   CNS   AgSci   Other   NOC
53*    7      2       0     1     0         0     0       1       141

Table 2. Illustration of VLAN change requests handled either by the colleges or the NOC.

Notes regarding Table 2:
Summary of network changes between April 11 and May 25, 2011
*COB configured a VLAN across 30+ switches in one day.

Finally, in addition to operating the four main focus areas mentioned above, security and documentation must be done consistently, especially as initiatives such as Voice over IP (VoIP) are deployed across campus. These combined activities form the basis for justification of the centralized switch operational proposal.

Design: In most cases, the design phase of networks is already handled by the NOC. The NOC works closely with architects, Facilities, Colleges/Departments, and Telecommunications on new building designs and remodels. All aspects of the network design are dealt with by the NOC: power/UPS needs, cooling requirements, wireless, VoIP, security, environmental controls, budgets, etc. Building designs and remodels are a complex process requiring experience along with often daily communication between NOC members and others on the design team. There is an established workflow and communication structure to make sure the proper equipment is ordered and installed. This design process is currently being handled by the NOC as a coordinated service to the campus.

Configuration: The second part of network operations is configuration. The first element of this consists of physical installation and preparation of the cabling in anticipation of the new switch gear. For most areas on campus, the NOC oversees the physical installation.
This is because the physical infrastructure, both fiber and copper, is a necessary part of installations. While the copper installation can be done by non-ACNS staff, this is not the recommended model for several reasons:

1. The physical networking infrastructure (fiber, cabling, patch panels, jacks, etc.) is under a 30 year warranty of service by Commscope. This warranty becomes void if very prescriptive cabling procedures are not followed. Audits of closets have found cabling not matching specifications, resulting in voided warranties, and putting the infrastructure back into warranty can require extraordinary resources.
2. The fiber infrastructure is complex, unfamiliar to most, and even dangerous, as it carries invisible laser light that can burn retinas if not handled properly.

Thus, for safety and financial reasons, it is recommended that copper and fiber infrastructure only be managed by ACNS telecommunications. Overall, new installs and remodels are handled by the NOC in close work with telecommunications.

The second part of installation that complements the physical installation is the configuration of the Ethernet switches. Again, except for a few colleges, this is already done primarily by the NOC. VLAN, multicast, spanning-tree, QoS, and SNMP are just some of the necessary configuration items that are done on switches. The NOC has developed a base configuration which can be easily modified to accommodate the following:

• Facilities needs on the network – requiring coordination with the various entities needing such services: environmental controls, CSUPD, Alarm Shop, CardKey access, metering, etc.
• Central VLANs trunked across campus for such services – NOC keeps track of all such VLANs and assists on tracking which switch port and jack is assigned to these along with IP address ranges used to map to those VLANs and for routing purposes.
• PoE needs – NOC manages power consumption and temperature issues related to closets and PoE.
This includes UPS management, temperature monitoring in closets and power/cooling needs. This is then coordinated with telecommunications and Facilities for rectification or modification of existing infrastructure.
• Wireless configurations – the campus has three wireless vendors. Which departments and areas need which SSIDs is carefully coordinated, and the jack numbers and switch port numbers of all wireless APs are tracked as well.
• Uplink and downlink physical considerations – fiber and copper infrastructure play a large part in how MDFs and IDFs are connected. In addition, extensive knowledge of switch models, parts and interaction of various connectors is necessary to ensure compatibility between and among devices.
• Spanning tree configuration, loop detection, BPDU guard and dhcp-protect – improper configuration of spanning tree can lead to widespread network outages easily affecting multiple departments and buildings. Additionally, various centrally deployed tools are used to ensure consistent spanning tree configuration, loop detection, BPDU guard and DHCP protection.

While the above is not intended to be a comprehensive list of considerations for proper switch configuration, it is provided to illustrate the amount of coordination that must be overseen by one entity for complete and consistent deployment of network gear.

Monitoring: The third part of network operations relates to monitoring. This comprises a number of factors, including uptime, traffic loads, errors, and throughput. The NOC maintains central servers to monitor all of those areas where it has a consistent picture of the network. Proper monitoring includes configuration settings that report back to a central server for data collection, traffic monitoring, error reporting, troubleshooting, and/or security event handling/detection. Currently, this picture of campus is somewhat limited due to the distributed nature of the network.
A consistent configuration across campus is needed to establish the foundation necessary for additional and more complicated deployments of services requiring QoS, service level agreements, specific uptime or security needs.

Troubleshooting: The fourth area of network operations is troubleshooting. The previous three items must be in place before effective troubleshooting can be accomplished. Improper configuration and installation can lead to network problems. An inability to monitor the entire network to the edge switch in a consistent and complete fashion hampers troubleshooting. One charge of the CIC in evaluating network management policies is to consider operational efficiency. A consistent design, configuration and monitoring process across campus would greatly speed up the NOC’s ability to troubleshoot problems. Currently, far too much time is spent by the NOC looking for network devices that aren’t known or dealing with configurations that are not consistent.

A major component of troubleshooting is the notification and response structure. While also reliant upon the monitoring piece of network operations, notification and response relates to which entities are notified in which order, how the notifications are escalated, and who actually responds during which times. Currently, many devices are monitored for Facilities and CSUPD along with basic network services such as network service to an entire building. The NOC maintains central monitoring such that devices are monitored during specified hours, and when they fail, NOC staff are contacted electronically in a specified order. Many of these responses are 24x7 to address life and safety concerns. Coordinating all of those alerts, devices, individuals, and entities requires a central base of knowledge and is a natural extension of the NOC. Besides the notification mechanism, complete problem resolution often requires response by both Facilities or CSUPD and the NOC.
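The notification and response structure described above (monitoring during specified hours, contacts notified in a specified order, escalation when nobody responds) can be sketched in a few lines of Python. This is an illustrative sketch only: the device names, contact roles, monitoring hours, and function names are hypothetical and do not describe the NOC's actual tooling.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class MonitoredDevice:
    """A device watched by a central monitoring server (all values hypothetical)."""
    name: str
    start: time       # start of the monitoring window
    end: time         # end of the monitoring window
    contacts: tuple   # people to notify, in escalation order

    def is_monitored_at(self, now: time) -> bool:
        """True if `now` falls inside the window; start == end means 24x7."""
        if self.start == self.end:
            return True
        if self.start < self.end:
            return self.start <= now < self.end
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return now >= self.start or now < self.end


def escalation_order(device: MonitoredDevice, now: time, unacknowledged: int = 0) -> list:
    """Return the contacts still to be notified, in order.

    `unacknowledged` is how many contacts were already tried without a response.
    Outside the monitoring window nobody is notified.
    """
    if not device.is_monitored_at(now):
        return []
    return list(device.contacts[unacknowledged:])


# Hypothetical examples: a life-and-safety device monitored 24x7 and a
# building uplink monitored only during business hours.
fire_panel = MonitoredDevice("fire-panel-1", time(0, 0), time(0, 0),
                             ("noc-on-call", "noc-backup", "noc-manager"))
uplink = MonitoredDevice("bldg-uplink", time(8, 0), time(17, 0),
                         ("noc-on-call", "noc-backup"))

print(escalation_order(fire_panel, time(23, 30)))
print(escalation_order(uplink, time(23, 30)))
print(escalation_order(uplink, time(9, 0), unacknowledged=1))
```

In this sketch the 24x7 device always yields its full contact list, while the business-hours device yields nobody at 23:30. Keeping the window and the contact order in one record per device is one way to hold the "central base of knowledge" in a single place.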
Security and Documentation: Two additional areas were noted that encompass network operations: security and documentation. There has been an increase in the security and integrity requirements of networks. Some of these cases deal with ensuring that traffic cannot be analyzed or monitored except by those with authorization to do so. Right now, anyone with read/write access to the switches can monitor/sniff ports. Other cases deal with the physical integrity of the switch infrastructure as relates to its ability to be physically compromised. One project in particular deals with radioactive material. For this project, physical access to the switches is restricted solely to those with a demonstrated and approved need for access. Additional access restrictions and identification processes are put in place and audits are conducted to ensure compliance. These sorts of security requirements and issues are increasing across campus. A distributed operational model for the switch infrastructure is in conflict with those requirements. While the NOC has been able to work through the existing cases, the concern for the NOC is that the current model of network management does not scale and does not pass such audit requirements.

The second encompassing area of network operations is the documentation. Accurate and complete documentation is a necessary component of any successful network operations policy. As the campus moves into the next generation of networking and network applications, documentation of the four key points of network operations needs to be addressed. This is something that requires a level of consistency best done by an individual or small team of individuals, and doesn’t lend itself to a distributed model. In fact, a consistent approach to the design, configuration, monitoring and troubleshooting facilitates accurate documentation.
Naturally, the more disparate or distributed entities involved in that process, the more likely inaccuracies will be introduced, and the more difficult it will be to operate. Improper documentation causes delays, inefficiencies, and mistakes in all levels of the network operations process.

SUMMARY

Four areas of network operations were discussed: design, configuration, monitoring and troubleshooting. It was noted that the NOC is already involved in many of the design stages of the campus network. Currently, many campus entities ship their network gear to the NOC for consistent configurations. The NOC noted that a more centralized approach to network management will yield switch configurations that can prevent some common network outages. The monitoring and troubleshooting of the network involves having a consistent and complete picture of the campus network from core to edge in order to note problems and correctly respond to or troubleshoot them. Finally, two encompassing concerns were presented: security and documentation. The security component is growing in focus across campus. Meanwhile, the documentation piece is fundamental to a successful deployment of any network operations policy.

Appendix B: NOC RESPONSES TO SPECIFIC CONCERNS EXPRESSED BY THE UNITS

Discussion on various areas of concern:

1. Patching of cables is one area that has been done by network managers.

The NOC does not have the resources to patch cables with the responsiveness needed by the entire campus. The committee must consider this issue as it moves forward. Issues surrounding this are those already mentioned such as safety and warranty concerns. Meanwhile, as more devices are added to the network, telecommunications works closely with the NOC to map network switch ports to patch panel locations. Switch ports are no longer generic, but rather switch ports are dedicated to specific services.
Knowing which switch ports go to which environmental control, which camera or which VoIP phone is critical for all phases of network operations from design to troubleshooting. The NOC works closely with telecommunications during installs and changes to document these items. Color coding and documentation processes are in place for tracking these services. Unauthorized or un-communicated moves of copper cabling can result in outages that must be tracked down and fixed, sometimes on a 24x7 basis. To that end, the NOC proposes a simple system of color coding, education, and communication with network managers as to which cables can be department patched and which carry central infrastructure. To maintain the status quo in this particular area, the NOC recommends preconfiguring switch ports to meet most of the network managers’ needs so that patching is all that is necessary, with an emphasis on limiting the amount of actual switch configuration that would need to be done by the NOC to meet these needs.

2. The distributed units may require too many VLAN changes on edge switch ports for the NOC to accommodate within the units’ expected timeframe.

The NOC has looked at statistics on switch configurations (see Table 2 in Appendix A). Over a six week period, the NOC was responsible for the majority of all network changes with only a subset of changes being done by the departments or colleges. However, it is possible that entities might need a number of changes made in a hurry for which the NOC, at current staffing levels, could not meet the demand. Two solutions to this are presented herein: increased staffing of the NOC to meet these needs (in process), and a process for allowing network managers to make edge switch port changes through an automated process that validates the request and permits and documents the change without human intervention.
This latter method helps to keep an important part of the network support in place: the network manager who is in the field. Finally, in critical situations, a phone call can be placed to the NOC for immediate response.

3. The network managers have always had a good working relationship with the NOC. Why is that not working all of a sudden?

It’s critical to note that the NOC is not singling out any entity or person and truly believes in the good job that is being done by network managers and the critical component they are in the system. The concern for the NOC is solely that the current system does not scale to the demands being made upon the NOC. Thus, the network managers and the NOC are charged with working together to find a way to proceed in an efficient manner that provides a more hardened and consistent network for the future. In the end, the current good working relationship must continue for us all at CSU to be successful.

4. Why must I provide read/write access to the NOC, but not retain it myself since my department/college bought the switch?

That’s a fair question, and hopefully largely addressed above. Models for centrally funding switch replacement are being considered for subsequent proposal to the campus.

5. We have entities for which we must provide a certain level of service. How can we do that if central services are to be provided through our network, especially if those services require a level of QoS with a higher priority than my data does?

The committee does need to address this issue. Do life and safety packets have priority over researcher packets and if so, who pays for those life and safety ports? Do electronic meters prioritize over data packets? What about VoIP? These types of questions must be answered. An operational model will indeed facilitate the configuration of the network so that these policies can be implemented coherently and consistently for the entire network.

6.
Why can’t we continue to manage the department/college switches like we did, just with closer integration with the NOC?

The simple answer is that it is becoming too hard to operate under the present model, and the additional communication and collaboration necessary to persist under the current model is unsustainable as it presents challenges at all levels of network operations.

In the design phase, it was noted that the NOC works with numerous entities to design the network properly. This process establishes a common knowledge base and documentation process that is shared amongst the NOC to facilitate the next steps of proper and efficient network operations.

Configuration of the switches is something that is discussed by the NOC on a regular basis. Base configurations, naming conventions, product lines and parts are elements that are constantly re-evaluated, analyzed, discussed with vendors and tech support. The configurations are discussed with a view to make all services work across campus. It’s not simply putting a configuration on a switch, but an integral part following the design phase and building common knowledge of the NOC.

Monitoring is something that is done across campus devices as many services are spread across all networks. Additionally, switches are being configured to report back to a central monitoring server for error collection and data collection. It’s difficult for the NOC to envision the monitoring done in a distributed manner.

Finally, troubleshooting is a critical component based on the monitoring. The entity with the design knowledge, configuration understanding and monitoring tools is the one alerted to the problem and the one armed with the tools to fix the problem. Finally, the NOC is charged with 24x7 responses for those critical items.
Given that, the primary concern for the NOC is that undocumented changes to the switches along with inconsistent configurations across campus force the NOC to respond to outages that they are not responsible for and have no knowledge of. Those that are charged with the responsibility to respond feel it is only fair that either they are the ones to oversee the changes that affect that responsibility, or those who have that ability also share in the response effort 24x7. While the latter sounds plausible, it doesn’t work practically. A network manager simply isn’t going to have the common knowledge of the network the NOC has in order to help the CSUPD fix a camera that’s out on a snowy Saturday night. It’s going to take the NOC to do it. (Thus, the compromise to provide an automated change request process was proposed, above.)

7. Why the concern over security all of a sudden – don’t secure protocols eliminate much of those concerns?

Two primary concerns are in the forefront: First, some regulatory agencies require the network to be hardened and accountable in a manner that simply isn’t workable in a distributed environment. Access, design, and even knowledge of some of these networks are restricted! Second, while secure protocols may make sniffing VoIP and similar traffic difficult, denial of service (DoS), ARP spoofing, and sniffing of ports lead to network integrity issues and raise liability concerns for individuals and CSU.

8. What are some protocol/technical challenges the NOC has before it that it says it’s hard to meet campus wide with the current architecture?

802.1x, IP multicast, QoS, VoIP, Facilities networks for environmental controls, smart meters, card access devices, power/UPS management, fire/police alarms along with video surveillance services/cameras, and distance learning in classrooms are the primary ones.

9. What does the NOC want exactly?

Sole read/write access to the switches, while still providing read access to the units where desired.

10.
Will that fix everything?

No. We’ll always have network issues, computers will always break and there will always be problems. The NOC’s primary objective is to get the network hardened and managed efficiently and consistently so that such problems are minimized and response to them is quicker and more effective.

Appendix C – Campus Concerns Regarding Central Management of Network Switches
(Submitted by Ty Boyack, NREL and CIC Member)

Points that colleges/departments/units have brought up as possible barriers to a centralized network policy:

Overall, I see that subnet managers are willing to find a solution to solve issues that are hindering NOC operations, but are ready to push back if a sweeping decision is pushed on them. There are a number of issues, either real or perceived, that will need to be either addressed or dismissed. This is not an attempt to be a canonical list of issues or even a document of advocacy, but just things that have been brought up in recent discussions.

Security: NOC is rightly concerned with inappropriate people being able to sniff traffic, but the reverse holds true as well. There are projects which know and trust the local system and network admins, but are unaware of the ability of NOC to sniff traffic. Many issues and threats have been examined by subnet-level employees using tools like port sniffing. If the hurdle to get access to these tools is too high, these problems may go unsolved for a longer period of time than necessary. In some cases, “iceberg-threats” with a low visible profile may go undiagnosed altogether, resulting in a decrease in security.

Design: Networking is a large field, and not necessarily a one-size-fits-all system (e.g., the number of VLANs in each unit varies significantly across campus). If NOC delivers a world-class network to all of campus, but that design does not allow for units to deploy specific technologies that meet their customer base, then the local customers end up suffering. (e.g.
port-based IDS/IPS/NAC)

The idea of centralized switch management has a vaguely defined “edge” boundary. I think most people assume it will cover the MDF and IDF main switches, but there are many other questions:
• Could a unit put a switch in an IDF AFTER the “edge” that they maintain management of?
• Would this extend to blade server embedded switches?
• Would in-room switches be included (server rooms, labs, top-of-rack, etc.)?
• Is a wireless AP the same as an edge switch?
• Could parallel networking be installed to avoid these restrictions?

Wired vs. wireless should be examined in the policy, and either be consistent or have clearly defined differences, especially with the emergence of higher-speed wireless that could in some cases supplant traditional wired network infrastructures.

Comparison with other universities may not be representative without knowing what their server and support model is like. We have a distributed server and support model, and to centralize the network without centralizing the other portions may hinder rather than help.

Job Security: There is a threat that removing daily network operational duties from someone could result in loss of professional identity, clout, power, pay, or job. This is partly a two-level perception issue. If a subnet manager spends only 5% of his/her time on daily network issues, but that person's manager/supervisor perceives that “network management” is 80% of the job, then that person will feel especially threatened.

Efficiency: A large concern is the response time that NOC will be able to provide for changes to the network. There are several aspects here:
• Prior planning on the part of the subnet managers can mitigate some of this.
• Prior planning on the part of the end users can mitigate some of this, but they are removed from this process, and willing to yell loudly if the local subnet manager does not respond with a timely fix, regardless of the NOC schedule.
• Emergencies on the data network are often (usually?) not the result of a network outage. As such, the first-line response will still be with the subnet managers, who should be fully equipped with a toolset to solve the problem as quickly as possible.
• NOC is adding a person, going from 7 to 8 people. That's a 14% increase, but there are fears that NOC is taking on more than a 14% increase in time-sensitive duties, so actual response times may get worse.

Communication: NOC has rightly addressed the problem of the departments not communicating changes/work/outages upstream to the NOC, but departments are often unaware of changes that the NOC/facilities/etc. make on the network as well. This puts the subnet managers in the same position as NOC, yet the subnet manager is expected to be able to answer any questions from the deans/department heads/local academic leadership. There is a fear that if all network management goes central, the communication gap would increase, resulting in even less knowledge of the system at the unit level.

Financial Perceptions: Some units have built their “own” networks, and feel like those networks are being taken away. Some feel that they paid for the switches for data, and “allowed” NOC to use them (in a synergistic approach) to carry data for EHS, alarms, etc. Now, because of those services, they are being asked to give up access to the hardware. If this had been the stated end-game, that use might never have been allowed. If edge switches are being used by telecom (VoIP) and facilities (EHS, alarms, controls, etc.), then there should be a charge to them, rather than funding the hardware by data chargeback or voice charges only.
Appendix D: College Need for Write Access to Network Switches
(Submitted by Jon Schroth, College of Business and CIC Member)

Introduction

The local colleges within Colorado State University need write access to their switches in order to speed up day-to-day changes, development of solutions at the network level, and incident response. This does not imply that the importance of, and need for, consistency across an environment is not understood, but there is some room for compromise. Having a subnet manager actively manage their environment in compliance with campus standards and communication techniques can offset a lot of the day-to-day burden of minor changes, ease the implementation of LAN-scale network devices, and speed troubleshooting for issues inside the LAN and across the LAN/WAN link.

Day-to-Day Changes

One of the most common tasks for a subnet manager is to make ports active for new or moved devices. For edge ports that only have one subnet assigned to them, the ports can be pre-configured with the needed VLAN. However, in buildings with multiple subnets and devices that roam with people (such as printers), if a device with a static IP on subnet A plugs into a port on subnet B then communication will break. It is far more efficient to simply change the VLAN of the port going to the device so it is on the correct subnet, rather than re-IPing the device and re-mapping the device for everyone who might use it.

Multiple VLANs can be very helpful for deploying private networks that isolate communication within a dedicated management cluster (for Hyper-V, VMware, or SharePoint), for implementing NAT as assignable public IP address space runs out, and for shunting traffic through a system that acts as a layer 2 gateway. All of these can be done by a trained subnet manager working closely with the service owners to detect, diagnose, and resolve issues without generating work for NOC.
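The mismatch described above (a device with a static IP on subnet A plugged into a port on subnet B) can be detected mechanically. The following is a minimal sketch, assuming a hypothetical table that maps VLAN IDs to subnets; the VLAN numbers, address ranges, and function names are illustrative only and are not taken from the CSU network.

```python
import ipaddress

# Hypothetical VLAN-to-subnet table; these IDs and ranges are examples only.
VLAN_SUBNETS = {
    10: ipaddress.ip_network("192.0.2.0/25"),
    20: ipaddress.ip_network("192.0.2.128/25"),
}


def vlan_for_address(device_ip, vlan_subnets=VLAN_SUBNETS):
    """Return the VLAN ID whose subnet contains device_ip, or None."""
    address = ipaddress.ip_address(device_ip)
    for vlan_id, subnet in vlan_subnets.items():
        if address in subnet:
            return vlan_id
    return None


def check_port(device_ip, port_vlan, vlan_subnets=VLAN_SUBNETS):
    """Compare a port's VLAN with the VLAN a static IP belongs to.

    Returns (ok, wanted_vlan): ok is True when the port is already on the
    matching VLAN; wanted_vlan is the VLAN the port would need, or None if
    the address matches no known subnet.
    """
    wanted_vlan = vlan_for_address(device_ip, vlan_subnets)
    return wanted_vlan == port_vlan, wanted_vlan


if __name__ == "__main__":
    # A printer with a static address in VLAN 10's subnet, moved to a VLAN 20 port.
    print(check_port("192.0.2.15", 20))
```

A result whose first element is False and whose second element is a VLAN ID indicates that the port would need to be moved to that VLAN, which is the single-port change the text argues is more efficient than re-addressing the device.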
Development of Solutions at the Network Level

When it comes to LAN management, an attentive college should be working with monitoring tools to answer questions like “Are my devices online? When was my last configuration backup? Can it be automated? What do my normal network traffic trends look like? What traffic causes large spikes? Who is doing the most talking? Are they talking as expected?” Colleges should try to solve security questions like “Can I associate traffic to a user, not just to an IP address? How can I protect my servers from malicious clients on the same subnet? How can I throttle and block traffic from users dynamically without incurring administrative overhead? How do I slow or stop a virus outbreak?”

NOC has already built several monitoring tools (for example: Nagios, RANCID, Cacti) to answer several of these questions. However, not all of them are accessible to the different colleges today. Where possible, existing services built by NOC can be leveraged. However, if a college wants to re-deploy an existing service or deploy a new service that is not available through NOC, then access to the switches is needed.

Incident Response

Having a clean, systematic, documented environment helps determine what is versus what should be, a critical step in troubleshooting errors. If a college is going to ask to retain write access to the switches, it has the responsibility of building, maintaining, and communicating accurate documentation. In the event of an issue that spills out into the rest of campus, it is also the college's responsibility to work with NOC to solve the issue.

Speed of resolution is important in preventing people from going home due to an outage. To this end, having close communication, clean and correct documentation of configurations and topologies, and write access to the switches helps at a college level.
Write access to the switch helps by making the ‘show running-configuration’ command available, which cleanly lists how the switch is configured. Write access also allows the use of monitor ports to sniff traffic, a useful troubleshooting technique. If a simple error is found by the college’s subnet manager in their environment, they could also correct it directly. Calling or emailing NOC to correct it would slow down resolution.

NOC’s Concerns

NOC’s concerns include configuration change management, monitoring, network security, and documentation. With some changes having campus-wide and lifesaving implications, these concerns should not be casually ignored. If the colleges are not allowed write access, a compromise that allows the colleges to remain as flexible is needed.

Documentation

Each college should share with NOC documentation of its infrastructure, consistently documented and uniform with the rest of campus. Monitoring key links to gather information then becomes much simpler, and it is easier to tell what is really going on. A college’s subnet manager and NOC should work together to assemble that information. What is important to a subnet manager may not be immediately important to NOC.

Escalation Process

Though it may seem obvious, below is a generic escalation process that may be used for contacting NOC about issues. During incidents the subnet manager and NOC should have an open line of communication about what they each see, what implications that may have, what each party is doing, and what needs to happen. Once the issue is solved, a report about what the problem was, the resolution, and steps for possible prevention should be issued to the relevant parties. As seen, the subnet manager should be able to take some actions to help diagnose and solve the issue, such as disabling ports or reloading a switch. Having write access assists in the subnet manager’s ability to take care of issues inside the college.
[Figure: escalation process flowchart. Steps and decisions shown: Issue Reported; Is the issue LAN only?; Troubleshoot & try to fix the issue; Is the issue solved?; Is the issue an active outage?; Email noc@colostate.edu; Call someone at NOC, explain issue, and ask for help; Send out aftermath report about issue.]

Day-to-Day Changes

Day-to-day changes should be small changes like VLAN configuration and naming ports. If the subnet manager does not have write access then these tasks need to be done by NOC. If indirect access can be given through a tool like ProCurve Manager Plus, or a custom tool restricting modifications to a set of pre-approved actions, then colleges can be as agile as required.

Development of Solutions

As colleges grow they will need to deploy solutions at the network level like network firewalls, Intrusion Detection and Prevention (IDS/IPS) systems, and Hardware Load Balancers (HLBs). The ability to design, test and implement these types of solutions requires non-trivial amounts of time and effort. Involving NOC in all of these projects, while informative from a campus perspective, may mean timesharing NOC resources and slowing down deployment of college and non-college projects. Having write access to switches will remove that deployment limitation.

Training the Subnet Managers

To mitigate inaccuracies in change management it may be beneficial to train the subnet managers to meet NOC’s standards in communication, monitoring, and troubleshooting. While this improves the quality of the change management and communication process, it does not eliminate the increased number of people making changes or the potential for inaccuracies; it just helps limit the scope of inaccuracies. Below is a generic change management process that trained subnet managers can use for day-to-day tasks as well as development of solutions. The most complex step is grasping the implications and scope of a proposed change; it is also the step most likely to cause errors spanning campus.
[Figure: change management process flowchart. Steps and decisions shown: Change required; Does the change have implications for campus or NOC?; Contact NOC; After arriving at agreement with NOC, does the change still need to occur?; Does the change have potential for downtime?; Schedule an Outage & notify service owners and NOC at least 24 hours prior to the outage; Perform Change; Was the change successful?; Undo the change; Did the change make the documentation out-of-date?; Update Documentation & send a copy to NOC; End.]

Conclusion

The issues NOC is facing include scale, consistency, communication, documentation, and security. While having sole write access to the switch environments is one way to solve those issues, it incurs the burden of the daily tasks and projects that subnet managers were already doing. To avoid placing the burden of the subnet managers' work on NOC and slowing down campus projects, allow the subnet managers to continue to do the work in conjunction with NOC’s standards.

Appendix E

Campus Infrastructure Committee Report to ITEC Advisory Committee (IAC)
January, 2010

Committee Members
Scott Baily, chair, Academic Computing and Network Services (ACNS)
Jim Cox, College of Natural Sciences
Michael Dennis, Business and Financial Services
Neal Lujan, Division of Student Affairs
Mike Maxwell, College of Veterinary Medicine and Biological Sciences
Robin McGee, Warner College of Natural Resources
Mark Ritschard, College of Engineering
Jon Schroth, College of Business
Adam Warren, Public Affairs

Charge to Committee
Introduction
Proposed Campus Infrastructure
Comparison of Proposed Topologies
Current Campus Network Management
Final Analysis and Committee Recommendation

Charge to the Communications Infrastructure Committee (CIC)
June, 2009

The committee is constituted to formulate and analyze models for providing future communications infrastructure and services at CSU. The committee shall be chaired by Scott Baily, interim director of ACNS, and shall also include four members of the College IT Administrators Council (CITAC) appointed by the elected chair of the CITAC and four members of the IAC who are not members of CITAC appointed by the Vice President for IT.

The committee shall formulate and analyze models for providing future communications infrastructure and services at CSU, including technologies, equipment, architectures, service levels, user support levels, staffing levels, total costs including centralized and decentralized staffing costs, security aspects, E911 and other emergency communications requirements and services, levels of risk, and legal requirements.

Specifically, the committee shall first determine requirements for physical communications infrastructure (fiber and copper cabling), networking technologies, and telephone technologies. Then, the committee shall explore and analyze at least three operational models: centralized operations, hybrid centralized/decentralized operations, and decentralized operations for each technology, networking and telephony. For each model, the committee shall evaluate the total costs and benefits of each approach. Note that the committee is charged with a comprehensive, balanced analysis, and should not strive for any advocacy position until a comprehensive analysis is completed. The committee should address strengths and weaknesses of each model.

Prior to its reporting, the committee shall have its analyses assessed for risk by Internal Auditing, the Facilities Alarm Shop, CSUPD, and CSU Risk Management. The committee should endeavor to collect feedback from the campus, including surveys, open fora, etc. as it deems appropriate.
In this initial approach, the committee shall not address chargeback models, as that will be determined as an outcome after the committee completes its tasks. As its first task, the committee may review and adjust this charge as it deems appropriate, in consultation with the VP for IT. The committee shall endeavor to complete its work by December 15, 2009, and report back to the IAC.

Introduction

Since the first campus networks were deployed in the middle to late 1980s, an extensive fiber optic cable plant has been developed to support the main, south, and foothills campuses. The campus backbone network is currently at 1 Gigabit per second (Gbps), or 100 times the bandwidth of the original campus local area network (LAN) connections. CSU’s external connectivity is also currently at 1 Gbps, though there are plans to upgrade both the backbone network and the wide area connection to 10 Gbps in FY10.

In FY02, a chargeback model was implemented to fund the rising costs of networking on the campus (see http://www.acns.colostate.edu/?page=network_charge_back for information regarding this activity). The chargeback algorithm is based on the speed of the connection to the campus backbone network; for each 10x increase in network capacity there is a corresponding 2.8x increase in the annual charge. This chargeback model has had the unintended consequence of sub-optimal network connectivity to many campus buildings, as units minimize the annual recurring cost of network access. This document considers alternative topologies and support models that would enhance performance and availability, as afforded by a new chargeback algorithm (e.g. a per-FTE model).

Figure 1. Illustration of current network topologies on the CSU campus, with multiple buildings often behind a single connection to the campus backbone network.
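The chargeback scaling described in the Introduction above (a 2.8x increase in annual charge for each 10x increase in capacity) can be expressed as a multiplier. This is a sketch only: the report states the rule for 10x steps, so the smooth compounding between steps and the function name below are assumptions, and no actual dollar amounts are implied.

```python
import math

# From the report: a 2.8x increase in annual charge per 10x increase in capacity.
CHARGE_FACTOR_PER_DECADE = 2.8


def charge_multiplier(base_mbps, new_mbps):
    """Relative annual charge when moving from base_mbps to new_mbps.

    Assumes the 2.8x-per-10x rule compounds, i.e. 2.8 ** log10(new / base).
    """
    if base_mbps <= 0 or new_mbps <= 0:
        raise ValueError("connection speeds must be positive")
    decades = math.log10(new_mbps / base_mbps)
    return CHARGE_FACTOR_PER_DECADE ** decades


if __name__ == "__main__":
    # Compare charges relative to a 100 Mbps connection.
    for speed in (100, 1000, 10000):
        print(speed, round(charge_multiplier(100, speed), 2))
```

Under this reading, a 100x increase in capacity raises the annual charge by a factor of about 7.84 (2.8 squared), so the absolute cost still grows with each speed tier even though the cost per unit of bandwidth falls.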
Current Topology

As a result of the chargeback system, colleges and departments have “value engineered” their network connectivity in order to reduce costs. Figure 1 shows various configurations that have been adopted. While some buildings are attached via a single backbone connection, it is far more common to have several buildings attached to the core campus network via a single connection (typically at 100 Mbps). Indeed, the number of connections to the backbone network has been reduced by 33% since the chargeback mechanism was proposed in 2000. This not only creates a single point of failure for all buildings sharing that connection, it also presents a bottleneck for all users sharing a common link to the campus and the Internet. Additionally, when a core campus router fails or is taken down for maintenance, then all buildings serviced by that router are without connectivity to the rest of campus or the Internet until service has been restored.

Proposed Campus Physical Network Infrastructure

The CIC proposes two topologies for consideration, both of which include redundant core router connections to buildings with a significant number of users (or a small number of users with significant need for bandwidth). The connection to each core router of the campus network would be nominally 1 gigabit Ethernet, easily upgradable to 10 gigabit where needs dictate. In this design, buildings would have a primary connection to one core campus router and a backup connection to the other, adding resiliency in addition to capacity.

Figure 2. Campus Design 1: Buildings are connected via primary and backup gigabit Ethernet connections.

Proposed Campus Physical Network Infrastructure: Design 1

Figure 2 illustrates the first proposed topology: each major building has two redundant network connections, one to each core router. Building connections will typically be implemented at 1 Gbps, though upgrading to 10 Gbps is certainly possible.
The connectivity between core routers and between core and border routers will be 10 Gbps, upgradable to 40 Gbps.

Proposed Campus Physical Network Infrastructure: Design 2

The second proposed topology for the campus network infrastructure arose from conversations with ACNS networking staff and security and policy officers at other institutions. ACNS staff indicated that from a purely technical perspective, the best way to guarantee high-quality IP telephony support is to implement a separate, homogeneous network that is centrally managed. Special-purpose applications, such as video surveillance, card key access and environmental systems controls, could also be provisioned over such a network. In addition, security and policy officers at some universities indicated that voice, life, and safety network applications should be on a physically separate network. Given that the current voice, life, and safety devices are approximately 17% of the number of active campus network jacks, a physically separate network would make homogeneity possible. Figure 3 illustrates the second proposed topology: two physically separate and redundant networks, one for data and a second for voice, life, and safety.

Figure 3. Campus Design 2: Buildings are doubly connected via primary and backup gigabit Ethernet connections, with one redundant network for data and a second redundant network for voice, life, and safety.

Proposed Intra-Building Network Topologies

Once a building is connected to the campus network via the redundant connections proposed above, the question remains as to how to handle network traffic within the building. Similar to the proposed campus topologies, the CIC proposes two intra-building topologies: a single physical network for all applications, and a separate physical network for voice, life, and safety applications.
Proposed Intra-Building Physical Network Infrastructure: Design 1

Figure 4 illustrates the first proposed design, that of a single physical network infrastructure, with network traffic segregated as desired with virtual local area networks (VLANs). In particular, a VLAN would likely be created for IP based telephones (VoIP), one for life and safety, and one for the campus wireless network. The redundant connections from the campus backbone would connect to each building network’s main distribution frame (MDF). From the MDF, single connections would be routed to each of the network’s intermediate distribution frames (IDFs); i.e., the network closets. VLANs would then be configured by network personnel on each switch in each IDF. In this design, every switch must be supported by a UPS appropriately sized to accommodate the desired run-time of voice, life, and safety applications during a power outage.

Figure 4. Building Design 1: One physical network with all traffic segregated as needed by VLANs.

Proposed Intra-Building Physical Network Infrastructure: Design 2

Design 2 for the intra-building network follows from the same issues that gave rise to design 2 of the campus network. Concerns for a high-quality and highly available network for voice, life, and safety applications could be satisfied by establishing physically separate networks within buildings, regardless of the design of the campus backbone network. Such a network would have dedicated edge switches where needed for voice, life, and safety, and dedicated fiber connections back to the MDF. Should a homogeneous network for voice, life, and safety be desirable, design 2 reduces the size of the homogeneous network. Run-time considerations during a power outage would then be separated based on the individual needs of the data network vs. the voice, life, and safety network. Figure 5 illustrates the physically separate networks.

Figure 5.
Building Design 2: Dual physical networks within buildings to separate data traffic from voice, life, and safety applications.

Comparison of Proposed Topologies

Cost Comparison

Equipment costs shown in Table 1 are for the entire project, and are anticipated to be amortized over 5-6 years. These costs include maintenance and accommodate a replacement cycle of 5 years for core routers and 6 years for switches inside buildings. “FTE Yr.” is a measure of the effort (in FTEs) that would be required to accomplish a particular task in one year’s time, assuming the staff involved did nothing else.

Campus Design 1 (single redundant network)
  Approximate equipment costs: approximately $1.8M
  Approximate personnel requirements: 1 FTE Yr. installation effort, .25 FTE recurring

Campus Design 2 (dual redundant networks)
  Approximate equipment costs: approximately $2.8M (requires additional $250K for electronics and $750K for fiber builds)
  Approximate personnel requirements: 1.25 FTE Yr. installation effort, .25 FTE recurring

Building Design 1 (single network with VLANs)
  Approximate equipment costs: $1.8M
  Approximate personnel requirements: 1 FTE Yr. installation effort, .25 FTE recurring

Building Design 2 (dual networks)
  Approximate equipment costs: $2.4M
  Approximate personnel requirements: 1.5 FTE Yr. installation effort, .25 FTE recurring

Table 1. Cost comparisons of the proposed options.

Strengths and Weaknesses

Table 2 illustrates strengths and weaknesses of the proposed topologies, both the campus designs and the building designs.

Campus Design 1 (single redundant network)
  Strengths:
  • Takes advantage of existing fiber infrastructure; very little additional wiring needed.
  • Provides a robust, high-availability network.
  • Provides greater ability to manage network connection speeds to end points.
  • Eliminates decisions for building connectivity based on cost.
  Weaknesses:
  • Does not totally isolate voice, life, and safety applications.
  • Cost of electronics and fiber optics associated with establishing redundant connections to a building.

Campus Design 2 (dual redundant networks)
  Strengths:
  • Totally isolates voice, life, and safety, permitting greater control and optimal management of voice, life, and safety applications.
  • Provides a robust, high-availability network.
  • Provides greater ability to manage network connection speeds to end points.
  Weaknesses:
  • Very expensive.
  • Longer-term capital cost leading to higher replacement costs.

Building Design 1 (single network with VLANs)
  Strengths:
  • Takes advantage of existing fiber infrastructure; very little additional wiring needed.
  • Less expensive; all network ports aggregated into larger switches.
  • Allows use of one network jack in offices to support both voice and data.
  Weaknesses:
  • All services subject to same down-time of single switch for all services.
  • Homogeneity of VoIP network very difficult.

Building Design 2 (dual networks)
  Strengths:
  • Highest quality of service.
  • Highest level of confidence in E911 accuracy.
  • Hardware homogeneity for voice, life, and safety applications easier to attain.
  • Greater ability to fund and maintain a regular replacement cycle for voice, life, and safety due to smaller network footprint for these services.
  • Cost of voice network is clearly identifiable and, therefore, more easily charged by end-point device.
  • Ability to separately provide UPS service to voice, life, and safety.
  Weaknesses:
  • Somewhat longer-term capital cost leading to slightly higher replacement costs.
  • Implies additional, dedicated building wiring to support IP telephones.

Current Network Management Model

A hybrid network management model is in place today for academic and administrative units, utilizing IT staff from both central (ACNS) and local (College/Departmental) IT offices, as described below. The campus backbone network consists of the “core” and “border” routers shown in Figures 1-5. These devices and the circuits interconnecting them are managed by ACNS.

Management by ACNS

ACNS staff installs network switches for new construction and remodel projects. ACNS gives the building occupants the choice of managing building switches themselves or having ACNS do so. Approximately half of the building switches are managed by local IT offices.
In the case of special-purpose virtual local area networks (VLANs), as required to support functions such as video surveillance cameras, card key door access control, and VoIP phones, ACNS manages those configurations since the virtual path extends well beyond the building perimeter.

The campus’ connections to the Internet, Internet2 and National Lambda Rail are all managed by ACNS. Other centrally managed services include the campus wireless network, virtual private network (VPN) access, and firewall functionality at the campus border.

Central monitoring of the campus network includes devices on the campus backbone network, building distribution switches, and most edge switches in buildings. In cases where departmental firewalls exist, device monitoring inside buildings typically stops at the building distribution switch. ACNS networking staff is on call off-hours to respond to critical outages affecting the campus backbone network or to assist with large departmental outages. Monitoring of building network devices by local IT offices varies by department and College.

Management by College and Department IT Staff

The degree of management by local IT offices varies by college or department. Some IT offices manage the physical network among many buildings, while smaller IT offices manage the network within a single department. Local IT offices partner with ACNS and Telecommunications at various levels depending on the expertise and capabilities of the local IT staff.

The IT offices in most of the colleges install, configure, and manage network switches for the buildings in their college, often across several locations scattered among the campuses in Fort Collins. The colleges use a variety of network management tools and monitor their networks using automated tools. All network switches are configured following campus guidelines and often include college-specific VLANs.
Local IT offices also maintain network maps for buildings and offices within their areas of responsibility. Local IT staff are responsible for all network jack activations for their buildings and troubleshoot network jack problems. The installation of new data jacks and physical repairs for the wall jack or wiring are coordinated with Telecommunications. Colleges also rely on Telecommunications for fiber connectivity within and among buildings.

Final Analysis and Committee Recommendation

Physical Network Recommendation

For the physical network infrastructure, the committee recommends adoption of a single redundant network backbone for both data and voice, life, and safety (Campus Design 1) and a dual network for intra-building networks (Building Design 2). While the ideal network for voice, life, and safety applications would be a homogeneous separate network throughout the entire campus, a dual redundant backbone infrastructure (Campus Design 2) is acknowledged to be prohibitively expensive.

The committee agrees that the ideal network for voice, life, and safety is a homogeneous network across campus. However, given the decentralized funding for network switches at CSU, having a campus-wide homogeneous network for data, voice, life, and safety would be very difficult. Even with infrastructure funding provided centrally, the sheer number of switches on campus would require an annual replacement plan over a period of years. For example, a five-year replacement cycle of all campus network switches would create five different models of switches on campus, preventing the possibility of a homogeneous network for voice, life, and safety. As a result, network service for voice, life, and safety would compete with data and likely make quality of service difficult.

If all telephones on campus were voice over IP sets, approximately 17% of all networked devices would be telephones, with some very small percentage being life and safety devices and/or connections.
With about 17% of the network being needed for voice, life, and safety, the possibility of having a homogeneous network is within reach. The separate physical network would be small enough to enable all equipment to be replaced en masse, providing for long-term homogeneity of the voice, life, and safety network.

Regarding networked voice, life, and safety applications, there are two opposing camps in higher education. The first camp is networking professionals, including chief technology officers with primarily technical backgrounds, who prefer a single network with voice, life, and safety on separate virtual networks. In the second camp, security and policy experts, including chief technology officers with backgrounds in security, prefer a physically separate network for voice, life, and safety.

Included in the first camp are Notre Dame, University of North Carolina at Pembroke, Columbia, California State, and Bryn Mawr, among others. In the words of the chief technologist at UNC Pembroke, "It's the whole value of convergence."

The second camp was probably best represented by the CIO of the Community College system of New Hampshire: “I agree…that fire alarms,…phones and other life safety devices should be on their own dedicated circuits…most data folks that I have worked with over the years (and I have worked with some great people) do not have a clue as to the responsibility they are taking on by trying to converge those systems onto their data networks… Yes, there could be some savings by converging voice and data but are those savings worth a life?”

There is always a space between the two camps, and both Idaho State and Louisiana State are in that space. The chief information officer at Idaho State said, “Certainly a separate physical and logical network is by far the best from a management and security standpoint”, but they use a single network for cost reasons.
The chief IT security and policy officer at Louisiana State said two physical networks are desired on campus, but the networking group proceeded with a single network before the security and policy issues could be vetted. Even Bryn Mawr, in camp one, stated “If funding and staffing permitted we would choose option 1 [two physical networks] but as with most things in life compromise is necessary; we are using option 2 [single physical network].” As it turns out, though, Bryn Mawr has left critical phone systems on their PBX switch.

Two physically separate networks would also reduce the cost of providing uptime to voice, life, and safety applications during power outages. Whether the voice connections are power-over-Ethernet or “injected power”, a UPS would be needed to keep the systems up. Smaller switches would require smaller UPS units, or would stay up longer on a UPS of a given size. If cooling is needed for these separate switches, the cooling needs are also smaller and more easily handled in the separate network scenario.

The downside to two physically separate networks is the expense, including the cost to create the separate network, the initial staffing to implement it, and the consumption of data drops in offices (many office areas were wired with one data jack on a wall, typically already consumed by a computer). However, the cost of equipment is not much greater than for a single network, as network ports have to be provided for all the new IP phone sets regardless of the resulting topology. Once implemented, the additional management cost of the separate network would be very low. As the chief information officer at Idaho State said, “I think one of the weak points to VoIP is the fact that it is just another data application on your network. It is subject to all the issues and concerns of a network. You can control the environment when it is separate, but you lose the cost savings of convergence”.
All agreed that the long-term cost differential is minimal, but not many schools are willing to make the initial investment.

In summary, a separate physical network for voice, life, and safety applications provides the homogeneity needed for excellent and consistent network service.

Network Management Model Recommendation

In general, the recommendation for managing network devices will be largely the same as described above under Current Network Management Model. For example, ACNS network staff will continue to operate the core and border routers and other elements currently managed centrally.

In the proposed Building Design 1, which combines voice and data over a common infrastructure, ACNS and the distributed IT staff will need to develop policies and procedures for coordinating changes. A balance must be struck between the efficiency of using existing IT staff distributed around the campus and the need to ensure the integrity of both departmental data VLANs and special-purpose VLANs (e.g., VoIP, video surveillance, environmental controls).

In the proposed Building Design 2, switches specific to the special-purpose VLANs would be managed by ACNS, while the departmental data switches would be managed by either ACNS or distributed IT staff, as preferred by the building occupants.