NOC report to CIC - Final - Academic Computing and Networking

NOC Report to CIC Regarding Operational Considerations of a Distributed Network Support Model
INTRODUCTION
ACNS has seen a continued increase in the number of network applications that require configuration
from the core to the edge. The four basic areas of network management (design, initial/ongoing switch
configuration, monitoring, and troubleshooting) are challenged by the current infrastructure.
Inconsistencies and challenges in these four areas exist in the currently distributed switch operational
model. Operating the switches from a centralized point and organizational unit is proposed as a means
of resolving these challenges. This document is broken into two sections: an analysis of network
operational areas as they relate to CSU, and a discussion of concerns expressed by the ACNS NOC and
network managers.
BACKGROUND
Some background is useful in understanding the nature of the CSU network’s scale, complexity and
foundation. Networking is typically broken down into “layers” (physical through application). ACNS’
NOC, currently composed of seven individuals, has extensive campus experience across all layers. The
campus network consists of over 900 switches spread across three campuses, with 100+ MDFs and 280+ IDFs,
carrying the traffic of more than 40,000 pieces of equipment. In total, this comprises 213 subnets
coordinated with 60 subnet managers. This background provides the foundation for understanding the
proposal for consistency and operations of the network from a single, cohesive entity.
Additionally, as part of the background, statistics relating to centralized network operations are
provided:
WestNet Survey
In June 2011, 14 institutions of higher education attending the Westnet Conference at the
University of Utah were surveyed. Only one of the 14 schools stated that it has a distributed
support model for networking. That school (the University of Utah) further indicated that it was
making steady progress toward centralizing support of its networking infrastructure.
Current Estimate of Central vs. Distributed Switch Management
Table 1, below, illustrates the proportion of switches for which the NOC or college/department
is responsible for the daily switch management. It is important to note that this does not take
into consideration the design, monitoring or troubleshooting features of network management,
but rather the ongoing virtual changes for ports/VLANs or physical
cabling/patching/maintenance. Currently, the NOC is responsible for over ¾ of all virtual
management across CSU’s switches and either shares in or is entirely responsible for ¾ of the
physical management.
         Virtual%   Physical%
NOC         77         39
50/50       20         40
10/90        1         19
90/10        2          2
Table 1. Illustration of the proportion of central vs. distributed ongoing operational support.
Notes regarding Table 1:

“NOC”: NOC does all work (virtual or physical) on the switch

“50/50”: VLANs and Physical changes shared between department/college and NOC

“90/10”: NOC responsible for 90% of the changes (virtual or physical) on the switch,
department responsible for 10% of changes (virtual or physical)

“10/90”: NOC responsible for 10% of the changes (virtual or physical) on the switch,
department responsible for 90% of the changes (virtual or physical).
Current Estimate of the Scope of VLAN Change Requests
This analysis covers all switches backed up by the NOC. The NOC monitors all changes on
many of the switches across campus departments and colleges, primarily to maintain
backups of switch configurations. The process also assists in troubleshooting because it
records changes to those configurations. Table 2, below, shows that of the
changes made over the six-week period*, 69% were made by the NOC.
COB    Engr   CVMBS   AHS   CNR   LibArts   CNS   AgSci   Other   NOC
53**   7      2       0     1     0         0     0       1       141
Table 2. Illustration of VLAN change requests handled either by the colleges or the NOC.
Notes regarding Table 2:

*Summary of network changes between April 11 to May 25, 2011

**COB configured a VLAN across 30+ switches in one day.
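The 69% figure quoted above can be checked directly against the counts in Table 2. The short Python sketch below only restates the table; the variable names are illustrative and not part of the original analysis.

```python
# VLAN change counts transcribed from Table 2 (April 11 to May 25, 2011).
changes = {
    "COB": 53, "Engr": 7, "CVMBS": 2, "AHS": 0, "CNR": 1,
    "LibArts": 0, "CNS": 0, "AgSci": 0, "Other": 1, "NOC": 141,
}

total = sum(changes.values())
noc_share = changes["NOC"] / total

print(f"total changes: {total}")      # total changes: 205
print(f"NOC share: {noc_share:.0%}")  # NOC share: 69%
```

Most of the remainder is the 53 changes attributed to COB, which the note above ties to a single VLAN rolled out across 30+ switches in one day.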
Finally, in addition to operating the four main focus areas mentioned above, security and
documentation must be done consistently, especially as initiatives such as Voice over IP (VoIP) are
deployed across campus. These combined activities form the basis for justification of the centralized
switch operational proposal.
Design:
In most cases, the design phase of networks is already handled by the NOC. The NOC works closely with
architects, Facilities, Colleges/Departments, and Telecommunications on new building designs and
remodels. All aspects of the network design are dealt with by the NOC: power/UPS needs, cooling
requirements, wireless, VOIP, security, environmental controls, budgets, etc. Building designs and
remodels are a complex process requiring experience along with often daily communication between
NOC members and others on the design team. There is an established workflow and communication
structure to make sure the proper equipment is ordered and installed. This design process is currently
being handled by the NOC as a coordinated service to the campus.
Configuration:
The second part of network operations is configuration. The first element of this consists of physically
installing and preparing the cabling in anticipation of the new switch gear. For most areas on
campus, the NOC oversees the physical installation. This is because the physical infrastructure, both
fiber and copper, is a necessary part of installations. While the copper installation can be done by
non-ACNS staff, this is not the recommended model for several reasons: 1. The physical networking
infrastructure (fiber, cabling, patch panels, jacks, etc.) is under a 30-year warranty of service by
CommScope. This warranty becomes void if very prescriptive cabling procedures are not followed.
Audits of closets have found cabling that does not match specifications, resulting in voided warranties, and
putting the infrastructure back into warranty can require extraordinary resources. 2. The fiber
infrastructure is complex, unfamiliar to most, and even dangerous, as it carries invisible laser light
that can burn retinas if not handled properly. Thus, for safety and financial reasons, it is recommended
that copper and fiber infrastructure be managed only by ACNS telecommunications. Overall, new
installs and remodels are handled by the NOC in close work with telecommunications.
The second element of configuration, which complements the physical installation, is the configuration of the
Ethernet switches. Again, except for a few colleges, this is already done primarily by the NOC. VLAN,
multicast, spanning-tree, QoS, and SNMP are just some of the necessary configuration items that are
done on switches. The NOC has developed a base configuration which can be easily modified to
accommodate the following:



- Facilities needs on the network – requiring coordination with the various entities needing such
  services: environmental controls, CSUPD, Alarm Shop, CardKey access, metering, etc.

- Central VLANs trunked across campus for such services – NOC keeps track of all such VLANs and
  assists on tracking which switch port and jack is assigned to these along with IP address ranges
  used to map to those VLANs and for routing purposes.

- PoE needs – NOC manages power consumption and temperature issues related to closets and
  PoE. This includes UPS management, temperature monitoring in closets and power/cooling
  needs. This is then coordinated with telecommunications and Facilities for rectification or
  modification of existing infrastructure.

- Wireless configurations – the campus has three wireless vendors. This requires careful
  coordination of which departments and areas need which SSIDs; the jack numbers and switch port
  numbers of all wireless APs are tracked as well.

- Uplink and downlink physical considerations - fiber and copper infrastructure play a large part in
  how MDFs and IDFs are connected. In addition, extensive knowledge of switch models, parts
  and interaction of various connectors is necessary to ensure compatibility between and among
  devices.

- Spanning tree configuration, loop detection, bpdu guard and dhcp-protect - Improper
  configuration of spanning tree can lead to widespread network outages easily affecting multiple
  departments and buildings. Additionally, various centrally deployed tools are used to ensure
  consistent spanning tree configuration, loop detection, bpdu guard and dhcp protection.
While the above is not intended to be a comprehensive list of consideration needs for proper switch
configuration, it is provided to illustrate the amount of coordination needed to be overseen by one
entity for complete and consistent deployment of network gear.
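As an illustration of what checking switches against a base configuration could look like, the following Python sketch compares each switch's enabled features with a required baseline. It is a minimal sketch under stated assumptions: the feature names are taken from the list above rather than from any vendor's command syntax, the switch names are invented, and it does not describe the NOC's actual tooling.

```python
from __future__ import annotations

# Hypothetical baseline: feature labels borrowed from the text above,
# not from any vendor CLI.
BASELINE = frozenset(
    {"spanning-tree", "loop-detection", "bpdu-guard", "dhcp-protection"}
)


def audit(switches: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return each switch that deviates from BASELINE with its missing features."""
    report: dict[str, list[str]] = {}
    for name, enabled in switches.items():
        gaps = BASELINE - enabled
        if gaps:
            report[name] = sorted(gaps)
    return report


# Invented inventory for illustration only.
inventory = {
    "idf-example-1": {"spanning-tree", "loop-detection", "bpdu-guard", "dhcp-protection"},
    "idf-example-2": {"spanning-tree", "bpdu-guard"},
}

print(audit(inventory))
# {'idf-example-2': ['dhcp-protection', 'loop-detection']}
```

A check like this is only meaningful when one entity defines the baseline and can see every switch, which is the coordination point the paragraph above makes.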
Monitoring:
The third part of network operations is monitoring. This comprises a number of factors,
including uptime, traffic loads, errors, and throughput. The NOC maintains central servers to monitor all
of those areas where it has a consistent picture of the network. Proper monitoring includes
configuration settings that report back to a central server for data collection, traffic monitoring, error
reporting, troubleshooting, and/or security event handling/detection. Currently, this picture of campus
is somewhat limited due to the distributed nature of the network. A consistent configuration across
campus is needed to establish the foundation necessary for additional and more complicated
deployments of services requiring QoS, service level agreements, specific uptime or security needs.
Troubleshooting:
The fourth area of network operations is troubleshooting. The previous three items must be in place
before effective troubleshooting can be accomplished. Improper configuration and installation can lead
to network problems. An inability to monitor the entire network to the edge switch in a consistent and
complete fashion hampers troubleshooting. One charge of the CIC in evaluating network management
policies is to consider operational efficiency. A consistent design, configuration and monitoring process
across campus would greatly speed up the NOC’s troubleshooting. Currently, the NOC spends far too
much time looking for network devices it does not know about or dealing with
inconsistent configurations.
A major component of troubleshooting is the notification and response structure. While also reliant
upon the monitoring piece of network operations, notification and response relates to which entities are
notified in which order, how the notifications are escalated, and who actually responds during which
times. Currently, many devices are monitored for Facilities and CSUPD along with basic network
services such as network service to an entire building. The NOC maintains central monitoring such that
devices are monitored during specified hours, and when they fail, NOC staff is contacted electronically in
a specified order. Many of these responses are 24x7 to address life and safety concerns. Coordinating
all of those alerts, devices, individuals, and entities requires a central base of knowledge and is a natural
extension of the NOC. Beyond the notification mechanism, complete problem resolution often requires
a response from both the NOC and either Facilities or CSUPD.
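The notification structure described above (monitoring during specified hours, then contacting staff in a specified order) can be sketched as follows. This is an illustrative Python sketch, not the NOC's monitoring system: the device names, contact labels, and the rule that equal start and end times mean 24x7 coverage are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class MonitoredDevice:
    """A device with a monitoring window and an ordered contact list (hypothetical)."""
    name: str
    window_start: time
    window_end: time
    contacts: tuple[str, ...]

    def is_monitored(self, now: time) -> bool:
        # Assumption: equal start and end is treated as 24x7 coverage.
        if self.window_start == self.window_end:
            return True
        if self.window_start < self.window_end:
            return self.window_start <= now < self.window_end
        # Window wraps past midnight, e.g. 18:00 to 06:00.
        return now >= self.window_start or now < self.window_end


def next_contact(device: MonitoredDevice, now: time, unacknowledged: int) -> Optional[str]:
    """Pick who to notify after `unacknowledged` earlier alerts went unanswered."""
    if not device.contacts or not device.is_monitored(now):
        return None
    # Walk down the list in order; stay on the last contact once it is exhausted.
    index = min(unacknowledged, len(device.contacts) - 1)
    return device.contacts[index]


camera = MonitoredDevice("example-camera", time(0, 0), time(0, 0),
                         ("noc-primary", "noc-secondary", "noc-manager"))
lab_switch = MonitoredDevice("example-lab-switch", time(8, 0), time(17, 0),
                             ("noc-primary",))

print(next_contact(camera, time(23, 30), 0))      # noc-primary
print(next_contact(camera, time(23, 30), 1))      # noc-secondary
print(next_contact(lab_switch, time(23, 30), 0))  # None
```

Even in this toy form, the escalation order and monitoring windows live in one place, which is why the report treats notification as a natural extension of a central NOC.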
Security and Documentation:
Two additional areas were noted that encompass network operations: security and documentation.
There has been an increase in the security and integrity requirements of networks. Some of these cases
deal with ensuring that traffic cannot be analyzed or monitored except by those with authorization to do
so. Right now, anyone with read/write access to the switches can monitor/sniff ports. Other cases deal
with the physical integrity of the switch infrastructure, specifically its vulnerability to being physically
compromised. One project in particular deals with radioactive material. For this project, physical access
to the switches is restricted solely to those with a demonstrated and approved need for access.
Additional access restrictions and identification processes are put in place, and audits are conducted to
ensure compliance. These sorts of security requirements and issues are increasing across campus. A
distributed operational model for the switch infrastructure is in conflict with those requirements. While
the NOC has been able to work through the existing cases, the concern for the NOC is that the current
model of network management does not scale and does not pass such audit requirements.
The second encompassing area of network operations is the documentation. Accurate and complete
documentation is a necessary component of any successful network operations policy. As the campus
moves into the next generation of networking and network applications, documentation of the four key
points of network operations needs to be addressed. This is something that requires a level of
consistency best done by an individual or small team of individuals, and doesn’t lend itself to a
distributed model. In fact, a consistent approach to the design, configuration, monitoring and
troubleshooting facilitates accurate documentation. Naturally, the more disparate or distributed
entities involved in that process, the more likely inaccuracies will be introduced, and the more difficult it
will be to operate. Improper documentation causes delays, inefficiencies, and mistakes in all levels of
the network operations process.
SUMMARY
Four areas of network operations were discussed: design, configuration, monitoring and
troubleshooting. It was noted that the NOC is already involved in many of the design stages of the
campus network. Currently, many campus entities ship their network gear to the NOC for consistent
configurations. The NOC noted that a more centralized approach to network management will yield
switch configurations that can prevent some common network outages. The monitoring and
troubleshooting of the network involves having a consistent and complete picture of the campus
network from core to edge in order to note problems and correctly respond to or troubleshoot them.
Finally, two encompassing concerns were presented: security and documentation. The security
component is growing in focus across campus. Meanwhile, the documentation piece is fundamental to
a successful deployment of any network operations policy.
NOC RESPONSES TO SPECIFIC CONCERNS EXPRESSED BY THE UNITS
Discussion on various areas of concern:
1. Patching of cables is one area that has been done by network managers. The NOC does not
have the resources to patch cables with the responsiveness needed by the entire campus.
The committee must consider this issue as it moves forward. Issues surrounding this are those
already mentioned such as safety and warranty concerns. Meanwhile, as more devices are
added to the network, telecommunications works closely with the NOC to map network switch
ports to patch panel locations. Switch ports are no longer generic, but rather switch ports are
dedicated to specific services. Knowing which switch ports go to which environmental control,
which camera or which VOIP phone, is critical for all phases of network operations from design
to troubleshooting. The NOC works closely with telecommunication during installs and changes
to document these items. Color coding and documentation processes are in place for tracking
these services. Unauthorized or un-communicated moves of copper cabling can result in
outages that must be tracked down and fixed, sometimes on a 24x7 basis.
To that end, the NOC proposes a simple system of color coding, education, and communication
with network managers as to which cables can be department patched and which carry central
infrastructure. To maintain the status quo in this particular area, the NOC recommends
pre-configuring switch ports to meet most of the network managers’ needs so that
patching is all that is necessary, with an emphasis on limiting the amount of actual switch
configuration the NOC would need to do to meet these needs.
2. The distributed units may require too many VLAN changes on edge switch ports for the NOC to
accommodate within the units’ expected timeframe.
The NOC has looked at statistics on switch configurations (see Table 2 in the Background section).
Over a six-week period, the NOC was responsible for the majority of all network changes, with
only a subset of changes being done by the departments or colleges. However, it is possible that
entities might need a number of changes made in a hurry for which the NOC, at current staffing
levels, could not meet the demand. Two solutions to this are presented herein: increased
staffing of the NOC to meet these needs (in process), and a process for allowing network
managers to make edge switch port changes through an automated process that validates the
request and permits and documents the change without human intervention. This latter
method helps to keep an important part of the network support in place – the network manager
who is in the field. Finally, in critical situations, a phone call can be placed to the NOC for
immediate response.
3. The network managers have always had a good working relationship with the NOC. Why is that
not working all of a sudden?
It’s critical to note that the NOC is not singling out any entity or person and truly believes in the
good job that is being done by network managers and the critical component they are in the
system. The concern for the NOC is solely that the current system does not scale to the
demands being made upon the NOC. Thus, the network managers and the NOC are charged
with working together to find a way to proceed in an efficient manner that provides a more
hardened and consistent network for the future. In the end, the current good working
relationship must continue for us all at CSU to be successful.
4. Why must I provide read/write access to the NOC, but not retain it myself since my
department/college bought the switch?
That’s a fair question, and hopefully largely addressed above. Models for centrally funding
switch replacement are being considered for subsequent proposal to the campus.
5. We have entities for which we must provide a certain level of service. How can we do that if
central services are to be provided through our network, especially if those services require a
level of QoS with a higher priority than my data does?
The committee does need to address this issue. Do life and safety packets have priority over
researcher packets and if so, who pays for those life and safety ports? Do electronic meters
prioritize over data packets? What about VOIP? These types of questions must be answered. An
operational model will indeed facilitate the configuration of the network so that these policies
can be implemented coherently and consistently for the entire network.
6. Why can’t we continue to manage the department/college switches like we did, just with closer
integration with the NOC?
The simple answer is that it is becoming too hard to operate under the present model, and the
additional communication and collaboration necessary to persist under the current model is
unsustainable as it presents challenges at all levels of network operations. In the design phase,
it was noted that the NOC works with numerous entities to design the network properly. This
process establishes a common knowledge base and documentation process that is shared
amongst the NOC to facilitate the next steps of proper and efficient network operations.
Configuration of the switches is something that is discussed by the NOC on a regular basis. Base
configurations, naming conventions, product lines and parts are elements that are constantly
reevaluated, analyzed, and discussed with vendors and tech support. The configurations are discussed
with a view to make all services work across campus. It’s not simply putting a configuration on a
switch, but an integral part following the design phase and building common knowledge of the
NOC. Monitoring is something that is done across campus devices as many services are spread
across all networks. Additionally, switches are being configured to report back to a central
monitoring server for error collection and data collection. It’s difficult for the NOC to envision
the monitoring done in a distributed manner. Finally, troubleshooting is a critical component
based on the monitoring. The entity with the design knowledge, configuration understanding
and monitoring tools is the one alerted to the problem and the one armed with the tools to fix
the problem. Finally, the NOC is charged with 24x7 responses for those critical items.
Given that, the primary concern for the NOC is that undocumented changes to the switches,
along with inconsistent configurations across campus, force the NOC to respond to outages it
is not responsible for and has no knowledge of. Those charged with the responsibility to
respond feel it is only fair that either they oversee the changes that affect that
responsibility, or those able to make changes also share in the 24x7 response effort. While the
latter sounds plausible, it doesn’t work in practice. A network manager simply isn’t going to
have the common knowledge of the network that the NOC has in order to help CSUPD fix a
camera that’s out on a snowy Saturday night. It’s going to take the NOC to do it. (Thus, the
compromise to provide an automated change request process was proposed, above.)
7. Why the concern over security all of a sudden – don’t secure protocols eliminate much of those
concerns?
Two primary concerns are in the forefront: First, some regulatory agencies require the network
to be hardened and accountable in a manner that simply isn’t workable in a distributed
environment. Access, design, and even knowledge of some of these networks are restricted!
Second, while secure protocols may make sniffing VOIP and similar traffic difficult, denial of
service (DoS), ARP spoofing, and port sniffing threaten the integrity of the network and
raise liability concerns for individuals and CSU.
8. What are some protocol/technical challenges facing the NOC that it says are hard to meet
campus-wide with the current architecture?
802.1x, IP multicast, QoS, VOIP, Facilities networks for environmental controls, smart meters,
card access devices, power/UPS management, fire/police alarms along with video surveillance
services/cameras, and distance learning in classrooms are the primary ones.
9. What does the NOC want exactly?
Sole read/write access to the switches, while still providing read access to the units where
desired.
10. Will that fix everything?
No. We’ll always have network issues, computers will always break and there will always be
problems. The NOC’s primary objective is to get the network hardened and managed efficiently
and consistently so that such problems are minimized and response to them is quicker and more
effective.
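The automated change process proposed in response 2 above (validate the request, permit it, and document it without human intervention) could take a shape like the following Python sketch. Everything here is hypothetical: the class names, the two validation rules (no changes on ports reserved for central services, and only VLANs delegated to the requesting manager), and the sample data are assumptions for illustration. The sketch records decisions only and does not push any configuration to a switch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRequest:
    manager: str  # network manager submitting the request
    switch: str
    port: int
    vlan: int


@dataclass
class ChangePolicy:
    allowed_vlans: dict[str, set[int]]    # manager -> VLANs that manager may assign
    reserved_ports: set[tuple[str, int]]  # (switch, port) pairs carrying central services
    audit_log: list[str] = field(default_factory=list)

    def submit(self, req: ChangeRequest) -> bool:
        """Validate a request, record the decision, and return True if approved."""
        if (req.switch, req.port) in self.reserved_ports:
            approved, outcome = False, "rejected: port reserved for central services"
        elif req.vlan not in self.allowed_vlans.get(req.manager, set()):
            approved, outcome = False, "rejected: VLAN not delegated to this manager"
        else:
            approved, outcome = True, "approved"
        # Every decision is documented, whether approved or not.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.audit_log.append(
            f"{stamp} {req.manager} {req.switch}/{req.port} vlan {req.vlan}: {outcome}"
        )
        return approved


policy = ChangePolicy(
    allowed_vlans={"manager-a": {110, 120}},
    reserved_ports={("sw-example-1", 24)},
)

print(policy.submit(ChangeRequest("manager-a", "sw-example-1", 5, 110)))   # True
print(policy.submit(ChangeRequest("manager-a", "sw-example-1", 24, 110)))  # False
print(policy.submit(ChangeRequest("manager-a", "sw-example-1", 6, 999)))   # False
```

The point of such a design is that the network manager in the field keeps the ability to make routine edge-port changes, while the validation rules and the audit log stay under a single owner.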