OAM and QoS
Presented by: Yaakov (J) Stein, Chief Scientist, Unique Access Solutions

SERVICE GUARANTEES

OAMQoS-YJS Slide 2

Why do we pay for services?
Generally good (and frequently much better than toll quality) voice service is available free of charge (Skype, Fring, Nimbuzz, …)
So why does anyone pay for voice services?
Similarly, one can get free
• (WiFi) Internet access
• email boxes
• file storage and sharing
• web hosting
• software services
So why pay?

Paying for QoS
The simple answer is that one doesn’t pay for the service; one pays for Quality of Service guarantees
In our voice model:
[figure: price plotted against QoS, with labels BE, toll quality, with mobility]
But what does QoS mean, and why are we willing to pay for it?
To explain, we need to review some history

Father of the telephone
Everyone knows that the father of the telephone was Alexander Graham Bell (along with his assistant Mr. Watson)
But Bell did not invent the telephone network
Bell and Watson sold pairs of phones to customers
The father of the telephone network was Theodore Vail

Theodore Vail
Theodore Who?
  Son of Alfred Vail (Morse’s coworker)
  Ex-General Superintendent of US Railway Mail Service
  First general manager of Bell Telephone
  Father of the PSTN
Why is he so important?
  Organized PSTN
  Established principle of reinvestment in R&D
  Established Bell Telephone’s IPR division
  Executed merger with Western Union to form AT&T
  Solved the main technological problems
  • use of copper wire
  • use of twisted pairs
  Organized telephony as a service (like the postal service!)
Vailism is the philosophy that public services should be run as closed centralized monopolies for the public good

What’s the difference?
In the Bell-Watson model the customer pays once, but is responsible for
• installation
• wires
• wiring
• operations
• power
• fault repair
• performance (distortion and noise)
• infrastructure maintenance
while the Bell company is responsible only for providing functioning telephones
In the Vail model the customer pays a monthly fee, but the provider assumes responsibility for everything, including fault repair and performance maintenance
The telephone company owns the telephone sets and even the wires in the walls!

Service Level Agreements
In order to justify recurring payments the provider agrees to a minimum level of service in an SLA
SLAs should capture Quality of user Experience (QoE), but this is often hard to quantify
So SLAs usually detail measurable network parameters that influence QoE, such as:
• availability (e.g., the famous five nines)
• time to repair (e.g., the famous 50 ms)
• information rate (throughput)
• information latency (delay)
• allowable defect densities (noise/distortion)
Availability (basic connectivity) always influences QoE
It is hard to predict the effect of the other parameters on QoE, even when there is only one application (e.g., voice)
When multiple applications are in use, it may be impossible

Some Applications
System traffic: routing protocols, DNS, DHCP, time delivery, system update, OAM, tunneling and VPN setup
Business processes: database access, backup and data-center, B2B, ERP
Communications, interactive: voice, video conferencing, telepresence, instant messaging, remote desktop, application sharing
Communications, non-interactive: email, broadcast programming, music
  video: progressive download, live streaming, interactive
Information gathering: http(s), Web 2.0, file transfer
Recreational: gaming, p2p file transfer
Malicious: DoS, malware injection, illicit information retrieval

What do applications need?
Some applications only require availability
Some also require minimum available throughput
Some require end-to-end (or RT) delay less than some bound
Some require packet loss ratio (PLR) less than some percentage
and these parameters are not necessarily independent
For example, TCP throughput drops with PLR
[figure: TCP throughput versus PLR, for 1000 B packets and 50 ms RTT]

Some rules of thumb
Mission Critical (and life critical) applications require
• high availability
If there are any MC applications then system traffic requires high availability too
MC applications do not necessarily require strict throughput but always indirectly require
• a certain minimal average throughput
• bounded delay
If the MC application uses TCP then it requires
• low PLR
Real-time applications require
• sufficient throughput
but not necessarily low PLR (audio and video codecs have PLC)
Interactive applications require
• low RT delay
It may be more scalable for an SP to measure 1-way delays

OAM

Monitoring an SLA
The Service Provider’s justification for payment is the maintenance of an SLA
To ensure SLA compliance, the SP must:
• monitor the SLA parameters
• take action if a parameter is dropping below compliance levels
But how does the SP verify/ensure that the SLA is being met?
Monitoring is carried out using Operations, Administration, Maintenance (OAM)
The customer too may use OAM to see that the SP is compliant!
Technical note: OAM is a user-plane function but may influence control and management plane operations, for example
• OAM may trigger protection switching, but doesn’t switch
• OAM may detect provisioned links, but doesn’t provision them

Operations, Administration, Maintenance
Traditionally, one distinguishes between 2 OAM functionalities:
1. Fault Monitoring
• OAM runs continuously/periodically at required rate
• detection and reporting of anomalies, defects, and failures
• used to trigger mechanisms in the
  • control plane (e.g. protection switching) and
  • management plane (alarms)
• required for maintenance of basic connectivity (availability)
2. Performance Monitoring
• OAM run:
  • before enabling a service
  • on-demand or
  • per schedule
• measurement of performance criteria (delay, PDV, etc.)
• required for maintenance of all other QoE attributes

Early OAM
Analog channels and 64 kbps digital channels did not have mechanisms to check signal validity and quality
Thus
• major faults could go undetected for long periods of time
• it was hard to characterize and localize faults when reported
• minor defects might go unnoticed indefinitely
As PDH networks evolved, more and more OAM was added on:
• monitoring for valid signal
• loopbacks
• defect reporting
• alarm indication/inhibition
The OAM overhead started to explode in size!
When SONET/SDH was designed, bounded overhead was reserved for OAM functions

OAM for Packet Switched Networks
OAM is more complex for Packet Switched Networks
In addition to the previous defects:
• loss of signal
• bit errors
we have new defect types
• packets may be lost
• packets may be delayed
• packets may be delivered to the wrong destination
The first PSN-like network to acquire OAM was ATM (I.610)
although technically ATM is cell-based, not packet-based

Some FM OAM mechanisms (1)
How do we perform Continuity Check?
• send OAM packets at a constant known rate
• if CC packets are not received for >3 intervals then declare a fault
see also LB / echo mode
How do we perform Connectivity Verification?
• send OAM packets to a known destination
• if CV packets are received somewhere else then declare a fault
How do we indicate AIS (FDI)?
• when forward traffic is not received, send AIS OAM packets
• if AIS packets are received then declare a fault
How do we indicate RDI (BDI)?
• when reverse traffic is not received, send RDI OAM packets
• if RDI packets are received then declare a fault
Note: RDI is often a flag set on the CC message

Some FM OAM mechanisms (2)
How do we use LoopBack?
• non-intrusive (in-service) (echo mode)
  • send LB request OAM packet to remote site
  • remote site replies with LB reply
  • if LB reply is not received then declare a fault
• intrusive (out-of-service)
  • put remote site into LB mode
  • remote site reflects (and does not forward) all traffic (note that it must monitor OAM traffic)
  • if packets sent are not received then declare a fault
  note: need to inform next hops of LB by locking
How do we use LinkTrace?
• send LB request OAM packet to next hop
• send LB request to following hop
• etc.

Some PM OAM mechanisms (1)
How do we measure Packet Loss Ratio?
• Traffic (counter) based
  maintain 2 counters:
  • number of packets transmitted to peer: Tx
  • number of packets received from peer: Rx
  then
  • send Tx counter to peer at time 1: Tx(1)
  • peer notes its Rx counter at time of reception, Rx(2), and its Tx counter at time of its reply, Tx(3)
  • originator notes its Rx counter when reply is received: Rx(4)
  calculate PLR in both directions
• Synthetic: do not maintain counters, use OAM packets
  Note: synthetic loss is only a rough estimate
How do we measure Throughput?
• Primitive way (RFC 2544)
  • send packets at maximum rate and observe packet loss
  • reduce rate until no loss is observed
Note: there are more sophisticated mechanisms!

Some PM OAM mechanisms (2)
How do we measure 1-way Packet Delay (Latency)?
synchronize clocks at both OAM peers
• send timestamp T1 to peer
• peer timestamps receipt with T2
calculate time difference T2 - T1
How do we measure 2-way Packet Delay (Latency)?
• send timestamp T1 to peer
• peer timestamps receipt with T2
• peer replies at T3
• originator timestamps receipt of reply at T4
calculate time difference (T4 - T1) - (T3 - T2)
assuming symmetry, 1-way delay is half this amount
Note: do not need to synchronize clocks
How do we measure Packet Delay Variation?
• send timestamps at a constant rate
• peer calculates timestamp differences and statistics thereof
Note: do not need to synchronize clocks

ETHERNET OAM

What about Ethernet?
Carrier Ethernet has replaced ATM as the default layer-2
Ethernet is by far the most widespread network interface
Ethernet has some advantages as compared to ATM
• it has network-wide unique addresses
• it has a source address in every packet
but some aspects make Ethernet OAM more difficult
• ConnectionLess (CL)
• multipoint to multipoint
• overlapping layering: need OAM for operator, SPs, customer
• some specific problematic ETH behaviors (flooding, multicast, …)

What’s the problem with CL?
OAM makes a lot of sense in Connection Oriented environments
• connections last a relatively long amount of time
• there is some SLA at the connection level
For CL networks, the network path is neither known nor pinned
So it doesn’t really make sense to talk about FM:
what does continuity mean if, when a link goes down, the network automatically reroutes around the failure?
The Ethernet CL problem is solved by overlaying CO functionality:
• flows or
• EVCs

Ethernet OAM
For many years there was no OAM for Ethernet (LANs don’t need OAM)
now there are two incompatible ones!
• Link layer OAM: 802.3 clause 57 (EFM OAM, 802.3ah)
  single link only
  slow protocol, limited functionality
  some management functions
• Service OAM: Y.1731, 802.1ag (CFM)
  any network configuration
  multilevel OAM functionality
In some cases one may need to run both, while in others only service OAM makes sense
Link layer OAM is only for a single link, which is necessarily CO
Service OAM is most frequently used for infrastructure networks, which are also CO

Layer 2 control protocols (L2CPs)
Do not be confused: L2CPs are NOT OAM!
Here are a few well-known L2CPs:

protocol             DA                                                       reference
STP/RSTP/MSTP        01-80-C2-00-00-00, 802.2 LLC                             802.1D §8,9; 802.1D §17; 802.1Q §13
PAUSE                01-80-C2-00-00-01                                        802.3 §31B, 802.3x
LACP/LAMP            01-80-C2-00-00-02, EtherType 88-09, Subtype 01 and 02    802.3 §43 (ex 802.3ad)
Link OAM             01-80-C2-00-00-02, EtherType 88-09, Subtype 03           802.3 §57 (ex 802.3ah)
ESMC                 01-80-C2-00-00-02, EtherType 88-09, Subtype 10           G.8264
Port Authentication  01-80-C2-00-00-03                                        802.1X
E-LMI                01-80-C2-00-00-07                                        MEF-16
Provider MSTP        01-80-C2-00-00-08                                        802.1D § 802.1ad
Provider MMRP        01-80-C2-00-00-0D                                        802.1ak
LLDP                 01-80-C2-00-00-0E, EtherType 88-CC                       802.1AB-2009
GARP (GMRP, GVRP)    Block 01-80-C2-00-00-20 through 01-80-C2-00-00-2F        802.1D §10, 11, 12

Note: IEEE disallows forwarding of L2CPs, MEF allows it under certain circumstances

Link Layer OAM (AKA EFM OAM)
Ethernet in the First Mile (Last Mile?)
EFM networks are mostly p2p DSL links or p2mp PONs
thus a link layer OAM is sufficient for EFM applications
Since the EFM link is between customer and Service Provider, EFM OAM entities are either active (SP) or passive (customer)
The active entity can place the passive one into LB mode, but not the reverse
EFM OAMPDUs are slow protocol frames, never forwarded
• Ethertype = 88-09 and subtype 03
• messages multicast to slow protocol specific group address
• OAMPDUs must be sent once per second (heartbeat)
• messages are TLV-based
OAMPDU format:
DA (01-80-C2-00-00-02) | SA | TYPE (8809) | SUBTYPE (03) | FLAGS (2B) | CODE (1B) | DATA | CRC

EFM OAM capabilities
6 codes are defined
• Information (autodiscovery, heartbeat, fault notification)
• Event notification (statistics reporting)
• Variable request (active entity queries passive’s configuration) (mngt)
• Variable response (passive entity responds to query) (mngt)
• Loopback control (active entity enable/disable of intrusive LB mode)
• Organization specific (proprietary extensions)
and there are flags in every OAMPDU to expedite notification of critical events
• link fault (RDI)
• dying gasp
• unspecified
monitor slow degradations in performance

Service OAM (AKA CFM, Y.1731)
Many SPs need to monitor full networks, not just single links
Service layer OAM provides end-to-end integrity of the Ethernet service over arbitrary server layers
Because Ethernet is flat, without true client-server layering (except MAC-in-MAC), service layer OAM is multilevel
Because SPs want to replace transport networks with Ethernet, service OAM must support all OAM features and must enable advanced transport capabilities (such as linear/ring protection switching)
A transport network is a network with:
1. High availability (Fault Management OAM and Automatic Protection Switching)
2. SLA support (Performance Management OAM and QoS mechanisms)
3. a Management plane (optionally a control plane) for configuration and provisioning
4. Efficiency and Scalability

Y.1731 messages
Y.1731 supports many OAM message types:
• Continuity Check: proactive heartbeat with 7 possible rates
• Synthetic Loss Measurement: on-demand loss rate estimation
• LoopBack: unicast/multicast pings with optional patterns
• Link Trace: identify path taken to detect failures and loops
• AIS: periodically sent when CC fails
• RDI: flag set to indicate reverse defect
• Client Signal Fail: sent by MEP when client doesn’t support AIS
• LoCK signal: inform peer entity about diagnostic actions
• TeST signal: in-service/out-of-service tests for loss rate, etc.
• Automatic Protection Switching
• Maintenance Communications Channel: remote maintenance
• EXPerimental
• Vendor SPecific

Y.1731 frame format
After DA, SA and Ethertype (8902), Y.1731/802.1ag PDUs have the following header (may be VLAN tagged)
LEVEL (3b) | VER (5b) | OPCODE (1B) | FLAGS (1B) | TLV-OFF (1B)
if there are sequence numbers/timestamp(s) they immediately follow
then come TLVs, the “end TLV”, followed by the CRC
TLVs have 1B type and 2B length fields
there may or may not be a value field
the “end TLV” has type = zero and no length or value fields

Y.1731 PDU types

opcode   OAM Type      DA
1        CCM           M1 or U
3        LBM           M1 or U
2        LBR           U
5        LTM           M2
4        LTR           U
6-31     RES IEEE
32-63    unused: RES ITU-T
33       AIS           M1 or U
35       LCK           M1 or U
37       TST           M1 or U
39       Linear APS    M1 or U
40       Ring APS      M1 or U
41       MCC           M1 or U
43       LMM           M1 or U
42       LMR           U
45       1DM           M1 or U
47       DMM           M1 or U
46       DMR           U
49       EXM
48       EXR
51       VSM
50       VSR
52       CSF           M1 or U
55       SLM           U
54       SLR           U
64-255   RES IEEE

MEPs and MIPs
Maintenance Entity (ME): entity that requires maintenance
An ME is a relationship between ME end points
Because Ethernet is MP2MP, we need to define an ME Group (MEG)
MEGs can be nested, but not overlapped
MEG LEVEL takes a value 0 … 7
by default: 0,1,2 operator; 3,4 SP; 5,6,7 customer
MEP = MEG end point (MEG = ME group, ME = Maintenance Entity)
(in IEEE a MEG is called an MA = Maintenance Association)
unique MEG IDs specify to which MEG we send the OAM message
MEPs are responsible for keeping OAM messages from leaking out, but transparently transfer OAM messages of higher level
MIPs = MEG Intermediate Points
• never originate OAM messages
• process some OAM messages
• transparently transfer others

MEPs and MIPs (cont.)

How is OAM used?
MEF-30 Service OAM FM and MEF-xx Service OAM PM describe the use of OAM for Carrier Ethernet networks, such as
• which Y.1731/802.1 features/messages should be used
• where to put MEPs, what MA names and MEG levels should be used
• minimum number of EVCs that must be supported
• what should be reported and how
Y.1564 (ex Y.156sam) Ethernet Service Activation Test Methodology describes commissioning procedures (replaces RFC2544-like benchmarking)
It tests that the desired performance level can be achieved, including
• CIR, EIR (and optionally CBS and EBS for bursting)
• traffic policing
• rate, loss, delay, delay variation, availability (measured simultaneously)
Testing is in two steps:
• Service Configuration Test: each service separately
• Service Performance Test: all services together
Performance testing may be for:
• 15 minutes (new service on operational network)
• 2 hours (single operator network)
• 24 hours (multiple operator networks)

QOS ENFORCEMENT

QoS approaches
There are two approaches to QoS handling
IntServ (guaranteed QoS)
• define traffic flows (CO approach)
• guarantee QoS attributes for each flow
• reserve resources at each router along the flow
• signaling protocol (e.g., RSVP) needed
DiffServ (statistical QoS)
• retain CL paradigm
• no guaranteed QoS attributes
• mark packets (differentiated, e.g., gold, silver, bronze)
• marking can be by VLAN, P-bits, IP-ToS/DSCP, or general “flow”
• offer special treatment (priority) relative to other packets
• no resource reservation
For Ethernet and IP, DiffServ is the preferred approach

Some fields for marking
Example: for an IPv4 packet inside Q-in-Q Ethernet we have various choices for marking priority
DA (6B) | SA (6B)
ET=88A8 (2B) | P (3b) | DEI (1b) | SVID (12b)
ET=8100 (2B) | P (3b) | CFI (1b) | CVID (12b)
ET=0800 (2B)
Ver (4b) | IHL (4b) | ToS (1B) | Len (2B) | ... | Source IP Address (4B) | Destination IP Address (4B)
802.1p user priority field (AKA P-bits)
• takes values 0 … 7
• priority tagging (VLAN=0) if no VLAN
• P=0 means non-expedited traffic
• 802.1Q recommends mappings
IP ToS
RFC 2474 redefined ToS to contain a 6 bit DSCP (see also RFC 4594); the remaining 2 bits now carry ECN (RFC 3168)

Queuing
Ethernet switches have queues (FIFO buffers) on each output port
If there were only one queue then traffic handling would be FIFO
To enable DiffServ prioritization multiple queues are used
Outgoing frames are inserted into queues according to priority marking
There are many methods for emptying queues
The most popular are:
• Strict Priority: always take from the nonempty queue of highest priority
• Weighted Fair Queuing: take from nonempty queues according to configured “weight”
[figure: input ports, switch fabric, and multiple queues feeding each output port]

Traffic shaping
One of the most important parts of an SLA is the Committed Information Rate (bps)
This is the datarate (bandwidth) that the SP guarantees will be forwarded
There may also be an Extra Information Rate (bps)
This is a datarate that the SP will forward if possible
Packet traffic is often bursty
A customer who did not send data for a while will expect to be able to send at a higher rate afterwards
This is accomplished via traffic shaping
• time integration is accomplished by leaky/token buckets
• the effect of shaping is marking drop eligibility (marking a packet on the line is only possible with S-tags!)
There is often also traffic policing
policing simply discards packets to police a maximum rate!
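
A minimal sketch of the token-bucket time integration mentioned above, assuming a single bucket that starts full and is refilled continuously at the information rate. The class name `TokenBucketPolicer` and its parameters are illustrative only and are not taken from any standard; a frame that finds too few tokens is simply discarded, as in policing.

```python
class TokenBucketPolicer:
    """Single token bucket: rate in bits/s, depth (burst) in bytes."""

    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate_bps = rate_bps
        self.burst_bytes = burst_bytes
        self.tokens = burst_bytes   # assumption: bucket starts full
        self.last_time = 0.0        # time of the previous frame, in seconds

    def offer(self, frame_len: int, now: float) -> bool:
        """Return True if the frame is forwarded, False if it is discarded."""
        # Credit tokens for the time elapsed since the last frame.
        # Tokens are bytes, so the bit rate is divided by 8.
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bps / 8)
        if frame_len <= self.tokens:
            self.tokens -= frame_len
            return True
        return False


if __name__ == "__main__":
    policer = TokenBucketPolicer(rate_bps=8000, burst_bytes=3000)  # 1000 B/s
    # A burst at t=0 drains the bucket, then the next frame is discarded.
    print([policer.offer(1000, 0.0) for _ in range(4)])
    # After 10 idle seconds the bucket is full again and a burst passes.
    print([policer.offer(1000, 10.0) for _ in range(3)])
```

The idle period is what lets the customer send a burst above the long-term rate: credit accumulates while nothing is sent, but only up to the bucket depth.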

MEF token bucket algorithm
Metro Ethernet Forum (MEF) 10.x defines a bandwidth profile
• there are two byte buckets, C of size CBS and E of size EBS (in bytes)
• tokens are added to the buckets at rates CIR/8 and EIR/8 (bytes per second)
• when a bucket overflows, tokens are lost (use it or lose it)

if ingress frame length < number of tokens in C bucket
    frame is green and its length in tokens is debited from C bucket
else if ingress frame length < number of tokens in E bucket
    frame is yellow and its length in tokens is debited from E bucket
else
    frame is red

for simplicity we assume
• no coupling and
• no sharing!

green frames are delivered and service objectives apply
yellow frames are delivered but service objectives don’t apply
red frames are discarded
[figure: two token buckets, C of size CBS and E of size EBS]
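
The coloring rules above can be sketched in Python as follows. This is a simplified illustration under the slide's own assumptions (no coupling, no sharing), not the MEF specification text. The class and method names are hypothetical, both buckets are assumed to start full, and the comparison uses `<=` where the slide writes `<`, so that a frame exactly matching the available tokens conforms.

```python
class TwoBucketProfile:
    """Two byte buckets: C (refilled at CIR, depth CBS) and E (EIR, EBS)."""

    def __init__(self, cir_bps: float, cbs_bytes: float,
                 eir_bps: float, ebs_bytes: float):
        self.cir_bps, self.cbs = cir_bps, cbs_bytes
        self.eir_bps, self.ebs = eir_bps, ebs_bytes
        self.c_tokens = cbs_bytes   # assumption: buckets start full
        self.e_tokens = ebs_bytes
        self.last_time = 0.0        # time of the previous frame, in seconds

    def color(self, frame_len: int, now: float) -> str:
        """Return 'green', 'yellow' or 'red' for a frame arriving at `now`."""
        elapsed = now - self.last_time
        self.last_time = now
        # Add tokens at CIR/8 and EIR/8 bytes per second; overflow is lost.
        self.c_tokens = min(self.cbs,
                            self.c_tokens + elapsed * self.cir_bps / 8)
        self.e_tokens = min(self.ebs,
                            self.e_tokens + elapsed * self.eir_bps / 8)
        if frame_len <= self.c_tokens:
            self.c_tokens -= frame_len
            return "green"      # delivered, service objectives apply
        if frame_len <= self.e_tokens:
            self.e_tokens -= frame_len
            return "yellow"     # delivered, no service objectives
        return "red"            # discarded


if __name__ == "__main__":
    profile = TwoBucketProfile(cir_bps=8000, cbs_bytes=2000,
                               eir_bps=8000, ebs_bytes=1000)
    print([profile.color(1000, 0.0) for _ in range(4)])
    print([profile.color(1000, 1.0) for _ in range(3)])
```

With these example numbers each bucket gains 1000 bytes per second, so a back-to-back burst of 1000-byte frames first uses the committed tokens, then the excess tokens, and is then discarded until more tokens arrive.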