Ethernet OAM
Victor Olifer (JANET/GEANT JRA1 Task 1)
JRA1/TERENA workshop, Copenhagen, 20 November 2012
connect • communicate • collaborate

Agenda
- Ethernet Service Assurance & Monitoring overview
  - Monitoring standards
  - Service assurance standards
- Service assurance lab trials
- CFM/Y.1731 trial
  - Multi-domain testbed
  - OAM agent boxes
  - CyPortal
- JRA1 & JRA2 trial (Year 4 extension)
  - Multi-segment connections
  - Diverse equipment
  - perfSONAR extensions

Wide-area point-to-point Ethernet connections
[Figure: connection segments labelled Ethernet, Ethernet over MPLS, Ethernet over Transport]
Multi-segment multi-domain connection with:
- Ethernet UNI (a must)
- segments of pure Ethernet (optional)
- segments where Ethernet is tunneled over some other technology, e.g. TDM (SDH, OTN) or MPLS (optional)
Where can we find such connections?
- GEANT Plus, JANET Lightpath: demand comes from big projects and large scientific centres
- Inter-router connections
- Offers from commercial providers: they had 20% revenue growth in 2010 over 2009. Mobile backhaul and multi-site corporates are the major users; the reasons are price and flexibility
- New demand for academic providers might arise from areas such as cloud services, data centres, HD videoconferencing and multi-site university connections

Problems with managing Ethernet connections
- Until recently Ethernet had no OAM tools (hence the cheapest equipment) -> no way to check, monitor and troubleshoot connectivity and performance end-to-end (a customer view) or within a domain (a provider view). E.g. compared with IP: no ping, traceroute or ICMP diagnostic messages are available.
- Partial solution: we can use MPLS or SDH/OTN OAM to manage tunnels
- Good news: Ethernet OAM functions have been developed and implemented in equipment since 2007-8
- Bad news: we (JANET) don’t have much experience in using Ethernet OAM.
  The same situation holds in other NRENs (as far as I know from GEANT3 participants).

Three areas of emerging Ethernet OAM standards
• Service assurance: checks whether a connection performs to its specs, e.g. up to CIR and EIR, after service configuration and activation.
• Service monitoring: periodic checks of connection connectivity (continuity) and performance (delay, loss, throughput, availability).
• Service troubleshooting: when monitoring shows a fault, one needs to locate the faulty point along the path and the possible reason(s) for the failure.

Service Assurance (1)
1. Service definitions (topology, e.g. point-to-point; bandwidth profile: CIR, EIR for several CoS):
   • MEF 10.2
   • ITU-T G.8011
   Very important, as this is often a cause of confusion: e.g. CIR might be measured for UDP payload or for Ethernet frames, giving very different figures for the same data flow
2. Service performance parameters (delay, loss, throughput, availability):
   • MEF 10.2.1
   • Y.1563

Service Assurance (2)
3. Service verification
   Relatively new (Summer 2011) ITU-T spec Y.1564 “Ethernet service activation test methodology”
   • Defines a simple, disruptive, on-demand procedure that tests connectivity and throughput up to CIR & EIR & policing limit by injecting traffic into a connection
   • More suitable for Ethernet than the complex and IP-centric RFC 2544; implemented in many traffic generators and boxes

Service Assurance trials
JANET lab trial of the SunRise RxT tester
- Positive impression: works according to the standard, looks worth trying in wide-area tests
[Figure: tester connected to the box under test; bandwidth profile with CIR and PIR, where PIR = CIR + EIR]
- Just one problem: Y.1564 gives no opportunity to detect the situation when the real PIR value is set lower than expected (not a box bug, just the intention of the standard)

Service Monitoring
IEEE 802.1ag Connectivity Fault Management (CFM) (ratified in 2007): -
Hierarchical sessions of heartbeat messages (Continuity Check Messages, CCM) -> up/down status check
- VLAN-aware
- MEP (End) and MIP (Intermediate) maintenance points
ITU-T Y.1731 (ratified in 2008):
- Same as CFM + performance monitoring (delay, loss, throughput)
[Figure: nested maintenance sessions: customer (level 7), service provider (level 5), operator (level 3)]

Service Troubleshooting
CFM:
- Linktrace (analogous to IP traceroute)
- Loopback (analogous to IP ping)
- RDI (Remote Defect Indication)
Y.1731:
- same as CFM + a richer set of diagnostic messages + performance monitoring (loss, delay, throughput):
  - Alarm Indication Signal (AIS)
  - Lock Signal
  - …

Service monitoring trials
JRA1 Task 1 Ethernet OAM trial (2011):
- 5 NRENs, 5 connections monitored for 6 months
- Small Y.1731 agent boxes from Overture
- CyPortal from Cyan Optics for storing and visualising monitoring data
- Positive results, but only for single-segment connections
Combined JRA1 Task 1 & JRA2 Task 3 Service Assurance & Monitoring trial, GN3 Year 4 (2012-2013): ongoing

JRA1 Ethernet OAM trial (2011) objectives
- Test CFM/Y.1731 functions in a multi-domain and multi-vendor environment (5 connections)
- Evaluate Y.1731 agent boxes
- Evaluate the OAM data visualisation system (CyPortal)
[Figure: trial testbed. Sites: Essex Uni, JANET LH, NORDUnet, SURFnet, CESNET, PIONIER (PSNC). The Cyan OAM portal (cloud service) receives OAM data from the Collector. Legend: equipment under test; OAM agent (Overture ISG24); monitored VLAN connections]

OAM agent options
- Dedicated extra network switch with advanced OAM capabilities
  Pros: uniform, rich OAM functionality and a consistent source of monitoring data
  Cons: overheads of extra boxes (added complexity, cost, especially for high-speed links, maintenance etc.)
- OAM capabilities of existing network boxes: routers, switches, muxes
  Pros: no extra equipment, ability
  to test internal segments
  Cons: some vendor-specific features, e.g. in CFM MIBs – a diverse environment with possible incompatibilities
- Software OAM agent on a dedicated server (e.g. ‘dot1ag-utils’ developed by SARA and presented by Ronald van der Pol at NORDUnet 2011)
  Pros: end users can ping and trace network elements; no switches needed
  Cons: currently limited to Down MEP functionality; performance depends on the server's performance; time precision might be an issue

ISG24 OAM agent box trial
- Compact 4-port GE demarcation box, low cost (~ $1000)
- 2 copper GE and 2 SFP ports (there is a 10GE version)
- Web GUI
- OAM functions:
  - CFM
  - Y.1731 D(elay)MM and L(oss)M
  - RFC 2544
  - PAA – a proprietary analogue of Y.1731
  - Ethernet in the First Mile (802.3ah)

ISG24 CCM (continuity) tests
Positive results – the Up/Down state of all 5 connections was properly detected by permanent monitoring over 6 months
[Figure: compact web form]
[Figure: detailed web form]

ISG24 DMM (performance) tests
Mostly positive results – CFM and PAA delay measurement sessions showed stable one-way and two-way delay and jitter results, close to those expected from other sources
[Figure: Janet – NORDUnet PAA results]
[Figure: PSNC – CESNET CFM DMM results]
We experienced some problems with CFM one-way delay measurements on two connections; these are discussed after the CyPortal slides

CyPortal: monitoring data storage and visualisation
- Detailed monitoring data are collected from ISG24 agent boxes and stored in a cloud-based database
- Web GUI provides a map of all services; parameters which currently violate the SLD are shown in red

CyPortal: per-service data
- Historical graphical presentation of all parameters under monitoring
- Zooming into a selected time period
- Setting of SLA limits
- Flexible reports

Problems encountered
1.
Saw-tooth shape of delay between JANET LH and Essex Uni
[Figure: Level 5 DMM session]
- There was no reason for the saw-tooth shape of the two-way delay, with peaks of about 1 sec, shown by the Level 5 MEP (ISG24 box)
[Figure: Level 3 DMM session]
- Capturing and analyzing traffic before and after the Level 3 MEP (Ciena 311v box) revealed the ‘guilty’ box: the Level 3 MEP time-stamped the packets of the Level 5 MEP instead of forwarding them transparently – definitely a bug in the box software

Problems encountered (cont.)
2. Inability of ISG boxes to measure CFM one-way delay on some connections (LH-Copenhagen, LH-Essex)
   PAA:     OAD = 10.903   TWD = 23,004
   CFM DMM: OAD = ----     TWD = 23,004
- The ISG vendor's explanation: synchronization is too poor to calculate CFM OWD
- This seems not to be true: why, then, is it good enough for the proprietary PAA?
- Needs further investigation!

JRA1 Ethernet OAM trial conclusions
- Ethernet OAM functions embedded in carrier-grade Ethernet equipment are mature enough to be used for effective monitoring of the health and performance of wide-area Ethernet services from both customer and provider perspectives
- The use of dedicated Ethernet demarcation boxes with a rich set of OAM functions (Overture ISG and Accedian MetroNID) proved to be an effective way of monitoring Ethernet services end-to-end
- Visualization and data store software like CyPortal is a very useful element for providing managed Ethernet services
- We managed to monitor only single-segment connections end-to-end – still more to try

Year 4 JRA1/JRA2 Service Assurance & Monitoring trial
Trial objectives:
To continue the previous trial, extending the investigation to:
- multi-segment connections with hierarchical monitoring
- troubleshooting
- use of embedded CFM/Y.1731 functions in carrier-class equipment (such as Cisco, Juniper, Extreme, Brocade, Alcatel etc.)
To support new Ethernet OAM functionality in perfSONAR software:
• perfSONAR
  protocol and topology extensions to support Eth OAM data (data storing, searching and fetching)
• use of the existing GN3 perfSONAR implementation (perfSONAR MDM) with the needed changes
• standardization under the OGF NMC/NM umbrella
Trial term: 1 year, ending in March 2013

GN3 Year 4 testbed
[Figure: testbed map. Nodes: Bristol Uni, NORDUnet core, JANET LH, Collector, GEYSERS, NORDUnet testbed, PSNC, TNO (NL), SARA, SURFnet, CESNET]
Legend:
- Y.1731 agent box (ISG24 from Overture)
- Y.1731-enabled equipment of the trial participants
- non-Y.1731-enabled equipment of the trial participants

Multi-segment Janet – NORDUnet service
[Figure: hierarchy of maintenance sessions. Devices: Janet ISG24 (193.63.63.133); Janet testbed Ciena 5305; NORDUnet testbed ALU 1850 TSS; NORDUnet ISG24 (109.105.113.183). Numeric labels on the figure: 134, 314, 144, 1152, 1153, 1101, 1102, 344, 1, 3, 2, 1]
- Customer, level 5, MA/MEG=“jan-nor-400”
- Provider, level 4, MA/MEG=“jan-nor-400-4” – doesn’t exist yet
- Operator, level 2, MA/MEG=“janet-400-2”
- Operator, level 2, MA/MEG=“nor-400-2” – doesn’t exist yet
- Inter-node, level 0, MA/MEG=“isg-ciena-400-0”
- Inter-operator, level 0, MA/MEG=“jan-nor-400-0” – doesn’t exist yet
- Inter-node, level 0, MA/MEG=“tss-isg”

Multi-segment tests
• Evaluation of different hierarchical schemes:
  - Shared levels (same VLAN ID for domains)
  - Independent levels (C-VID, S-VID)
• Testing different ways of visualizing the hierarchical monitoring information for different types of users: NOC engineers, end users.
• Location of a failure by:
  - using a hierarchy of CCM sessions
  - using the Linktrace protocol and MIPs
Different types of faults should be emulated:
• Link failure
• Port failure
• Route loops
• VLAN mismatch

Year 4 trial team
JRA1 Task 1:
• Alberto Colmenero, NORDUnet
• Victor Olifer, Janet
• Marcin Garstka, PSNC
• Jan Radil, CESNET
• Michal Hazlinsky, CESNET
• Mayur Channegowda, Essex Uni
JRA2 Task 3:
• Roman Lapacz, PSNC
• Jakub Gutkowski, PSNC
• Freek Dijkstra, SARA
• Ronald van der Pol, SARA
• Richa Malhotra, SURFnet
• Borgert van der Kluit, TNO
• Rob Smets, TNO
• Piotr Zuraniewski, TNO
• Otto Baijer, TNO

Questions?

Year 3 Partner testbed example
[Figure: PSNC testbed]
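
A note on the delay measurements reported in the ISG24 DMM tests: in Y.1731, two-way frame delay is derived from the four timestamps of one DMM/DMR exchange, so a constant offset between the two MEP clocks cancels out. One-way delay subtracts timestamps taken from two different clocks and therefore inherits that offset. The Python sketch below illustrates this arithmetic only. The class and function names, the `simulate` helper and all numeric values are hypothetical; they are not taken from the trial or from any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmmTimestamps:
    """Four timestamps of one DMM/DMR exchange, in milliseconds.

    tx_f and rx_b are read from the initiating MEP's clock,
    rx_f and tx_b from the responding MEP's clock.
    """
    tx_f: float  # DMM sent by the initiator (TxTimeStampf)
    rx_f: float  # DMM received by the responder (RxTimeStampf)
    tx_b: float  # DMR sent by the responder (TxTimeStampb)
    rx_b: float  # DMR received by the initiator (RxTimeb)


def two_way_delay(t: DmmTimestamps) -> float:
    """Round-trip time minus the responder's processing time.

    Each bracket uses a single clock, so a clock offset cancels.
    """
    return (t.rx_b - t.tx_f) - (t.tx_b - t.rx_f)


def one_way_delay_forward(t: DmmTimestamps) -> float:
    """Initiator-to-responder delay; mixes two clocks, so it is
    only meaningful when those clocks are synchronised."""
    return t.rx_f - t.tx_f


def simulate(forward_ms: float, backward_ms: float, processing_ms: float,
             offset_ms: float, start_ms: float = 1000.0) -> DmmTimestamps:
    """Hypothetical helper: build the timestamps that would be observed
    for given true delays when the responder clock runs offset_ms ahead."""
    tx_f = start_ms                          # initiator clock
    rx_f = tx_f + forward_ms + offset_ms     # responder clock
    tx_b = rx_f + processing_ms              # responder clock
    rx_b = (tx_b - offset_ms) + backward_ms  # back on the initiator clock
    return DmmTimestamps(tx_f, rx_f, tx_b, rx_b)


# Illustrative values only: 11 ms forward, 12 ms backward, 0.5 ms processing.
synced = simulate(11.0, 12.0, 0.5, offset_ms=0.0)
skewed = simulate(11.0, 12.0, 0.5, offset_ms=250.0)
```

With the simulated 250 ms clock offset, `two_way_delay(skewed)` is still 23.0 ms, while `one_way_delay_forward(skewed)` becomes 261.0 ms instead of 11.0 ms. This is why one-way figures need synchronised boxes. The sketch does not explain why PAA reported a one-way value on the same connections; that question remains open.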