Describing performance levels

Computer Networks II course (Corso di Reti di Calcolatori II)
Defining Service Level Agreements
Giorgio Ventre
The COMICS Research Group @ The University of Napoli Federico II
COMICS (COMputer for Interaction and CommunicationS) Research Group – DIS, University of Napoli Federico II
Professional & Business Challenges
• Excessive alarms: A medium-sized operations center receives 100,000 to 1,000,000 alarms per day.
• Constant changes: New or upgraded devices and new services launch frequently.
• Complex services structure: Services are vital for business and customer interaction, but they are not really managed.
Professional & Business Challenges
• Customer interaction: Operators must handle customer complaints, customer care, selling services and service-level agreements (SLAs).
• Cuts in operations costs: A small team must run a large, multifaceted network.
• Difficult interface integration: Diverse equipment and support systems make managing interface integration a challenge.
Professional & Business Challenges
• The bulk of network administrators’ daily work involves alarms.
• A large number of alarms indicates that many signals are irrelevant and uncorrelated.
• It’s therefore hard to understand the true state of problems in the network.
• Today’s alarms are more or less raw warnings from the different equipment and vendor-specific management systems.
Professional & Business Challenges
• Operators establish a multi-level organization to handle the alarms.
• The first line has three major tasks:
– check for alarms that indicate the same problem,
– group the alarms and attach them to a trouble ticket,
– distribute problem information to affected parties, such as SLA customers and customer care.
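The first two first-line tasks can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the Alarm and TroubleTicket classes, their fields, and the 300-second grouping window are hypothetical, and real correlation would also need topology and expert knowledge.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    device: str       # hypothetical: equipment that raised the alarm
    problem: str      # hypothetical: normalized problem type
    timestamp: float  # seconds since an arbitrary epoch


@dataclass
class TroubleTicket:
    key: tuple
    alarms: list = field(default_factory=list)


def group_alarms(alarms, window_s=300.0):
    """Attach alarms with the same (device, problem) key to one ticket,
    as long as each arrives within window_s of the previous one."""
    latest = {}    # key -> most recent ticket opened for that key
    tickets = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.device, alarm.problem)
        ticket = latest.get(key)
        if ticket is None or alarm.timestamp - ticket.alarms[-1].timestamp > window_s:
            # No recent ticket for this key: open a new one.
            ticket = TroubleTicket(key=key)
            latest[key] = ticket
            tickets.append(ticket)
        ticket.alarms.append(alarm)
    return tickets
```

Each resulting ticket carries the alarms assumed to indicate the same problem, so affected parties can be notified once per ticket rather than once per alarm.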
Professional & Business Challenges
• If it’s a simple problem, the first line resolves it and closes the ticket.
• If it’s a complex problem, the first line dispatches it to the second- and third-line organizations. This might involve:
– equipment vendors
– operator staff in the field who might perform onsite management, card replacement, and so on
Professional & Business Challenges
• Automatic trouble ticketing manages the workflow from problem identification to problem solution.
• An alarm’s context determines whether it affects services, customer SLAs, or equipment state.
• We need alarm correlation, but:
– Alarm quality is insufficient
– We lack an overall network topology
– Correlation knowledge is spread across the organization and over several domain experts.
Professional & Business Challenges
• Operators are looking for service management solutions and SLA management solutions
• We need to solve several underlying problems:
– Topology management: network topology, service topology, and the mapping between these.
– Service management: formal but dynamic management of services and SLAs
– Service-centric integration and modeling
Professional & Business Challenges
• A basic tool is the Service Tree
• But it is also critical to be able to map the important dependencies:
– We need correlation software!
– We need formalisms!
– We need Knowledge Management!
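A service tree with its dependencies can be sketched as a plain mapping. The node names below are invented for illustration and the traversal is a minimal sketch, not a description of any specific correlation product.

```python
# Hypothetical service tree: each node lists the nodes it depends on.
SERVICE_TREE = {
    "vpn_customer_a": ["mpls_core", "access_router_1"],
    "web_hosting": ["dns", "server_farm"],
    "dns": ["server_farm"],
    "mpls_core": [],
    "access_router_1": [],
    "server_farm": [],
}


def affected_by(failed, tree):
    """Return every node that depends, directly or transitively, on `failed`."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for node, deps in tree.items():
            if node in affected:
                continue
            # A node is hit if it depends on the failed node
            # or on any node already known to be affected.
            if failed in deps or affected.intersection(deps):
                affected.add(node)
                changed = True
    return affected
```

For example, affected_by("server_farm", SERVICE_TREE) reports both the DNS and the Web hosting service, which is the kind of mapping from equipment faults to services that the slide asks for.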
The concept of Service
[Figure: a Supplier provides value (goods, services, …) to a Customer]
New business model
• In traditional telephony, a customer was just a number (SAP) and a well-defined service
– Quality used to be controlled statistically
– Faults were rare and mostly at exchanges
– …the good old days of monopolists
• Today we have multiple services with different possible configurations
– We now need to link customers to services
Definitions
• Service = the interface between supplier and customer:
– Service Access Point (SAP): where the service is accessed (my home, your company’s sites…)
– Customer Service Management (CSM): how the customer interacts with the supplier
Service life-cycle
• Design (definition, marketing)
• Negotiation (a business interaction)
• Provisioning (implementation & test)
• Usage
• Operation
• Change
• Deinstallation (end of supply)
TMN: Telecommunication Management Network Model
TOM: Telecom Operations Map
eTOM
Service-related processes
• Service Creation
• Service Provisioning
• Service Activation
• Service Assurance
• Service Monitoring
• Service Accounting
Business scenario at-a-glance
[Figure: business scenario linking Billing, Customer Care, Service Creation, Service Provisioning, Service Activation, Inventory, Service Monitoring, Service Assurance, Service Accounting, and Network planning and provisioning with the Network]
Service Management integration
• Integration with all main business processes:
– Order Management
– Service Creation (Marketing)
– Network Configuration
– Network Management
– Service Monitoring (SLA)
– Billing
– Customer Care
A formal definition
• A Service Level Agreement is defined as a contract between the service provider and the customer that specifies the QoS level that can be expected for that service.
• It includes technical elements like the expected behavior of the service, the parameters for QoS verification, the devices involved…
• But it also includes legal components: costs, obligations, compensations…
A formal definition
• The Internet Engineering Task Force’s differentiated services (DiffServ) working group defines an SLA in networking parlance as follows:
– An SLA is a service contract between a customer and a service provider that specifies the forwarding service a customer should receive (RFC 2475).
– The SLA contains both technical and nontechnical terms and conditions. The technical specification of the transport service is given in service level specifications (SLSs).
– An SLS is a set of parameters and their values which together define the service offered to a traffic stream by a DiffServ domain (RFC 3260).
Research Issues on SLA
• SLA parameter definitions
– This concerns the definition of service level parameters
such as availability, reliability, latency, and loss for SLA.
– Ongoing work on common standards
• SLA Negotiation
– Can we automatically stipulate and negotiate SLAs?
• SLA measurement
– This issue deals with how to accurately measure the QoS
that service providers deliver to their customers.
– Quite a mature research area
Research Issues on SLA
• SLA compliance reporting
– This deals with mechanisms to satisfy increasingly sophisticated
customers who demand real-time reporting to confirm that they
are receiving the service levels they were promised.
– Many things already done
• QoS management
– This issue deals with how to manage and control the QoS
delivered to customers to ensure compliance with established
SLAs.
– A lot of work remains to be done
• Languages for SLA definition and validation
– Can we define Service Level Agreements automatically and correctly?
SLA Parameter Definition
• The QoS team of the TeleManagement Forum has been
working on the automation of the interface between
service providers and customers for performance reporting
with the SLA concept.
• They have identified common terms and definitions, and
have created an industry-wide glossary for performance
measurement and reporting
• NMF, “Performance Reporting Definition Document,” NMF 701, June 1998
SLA Parameter Definition
• The IP Performance Metrics (IPPM) Working Group of the Internet Engineering Task Force (IETF) has been working on the identification of Internet service metrics:
– Framework for IP Performance Metrics (RFC 2330)
– IPPM Metrics for Measuring Connectivity (RFC 2678)
– A One-Way Delay Metric for IPPM (RFC 2679)
– A One-Way Packet Loss Metric for IPPM (RFC 2680)
– A Round-Trip Delay Metric for IPPM (RFC 2681)
Components of an SLA
• A description of the nature of service to be provided
• The expected performance level of the service (i.e. reliability and responsiveness)
• The procedure for reporting problems
• The time frame for response and problem resolution
• The process for monitoring and reporting the service level
• Consequences for not meeting expectations
• Escape clauses and constraints
Describing a Service
• This part of an SLA includes the type of service to be provided and any qualifications of the type of service to be provided.
• In the context of IP network connectivity, the type of service may specify the maintenance of network connectivity.
• It may include additional functions such as operation and maintenance of domain name servers, dynamic host configuration protocol servers, etc.
• It also includes information related to the location(s) where the service has to be provided.
Describing a Service
• For Internet services, the definition can be quite complex and can include several components:
– Application: security, access, configuration, upgrades, resource utilisation, response time…
– System: all information related to system health
– Network: everything related to the data transport, such as technologies, QoS, traffic conditioning and profiles, VPNs, encryption…
Describing performance levels
• The Quality of Service (QoS) parameters for the Communication Network specify the minimum requirements for:
– Network accessibility
– Network availability
– Network performance (capacity, delay etc.)
– Network operation and maintenance
Describing performance levels
• One simple thing: to agree on the parameters to measure…
• Shall we go for ITU parameters or for IETF stuff?
– ITU: world basically made of SONET and optical lines, bits or cells
– IETF: world basically made of packets
Describing performance levels
• To estimate and verify the quality of the various components in the network, a number of measurements are specified in internationally agreed standards.
• The ITU Recommendations G.821 and G.826 specify a set of communication line parameters for SDH networks, primarily based on Bit Error Rates and derived numbers.
• The values will be part of the SLA between the end user and the network service provider.
Describing performance levels
• Recommendation G.821 has the following definitions:
– Errored second (ES): a one-second time interval in which one or more bit errors occur.
– Severely errored second (SES): a one-second time interval in which the bit error rate exceeds 10^-3.
– Unavailable second (US): a circuit is considered to be unavailable from the first of at least 10 consecutive SES. The circuit is available again from the first of at least 10 consecutive seconds which are not SES.
– Degraded minute (DM): a one-minute time interval in which the bit error rate exceeds 10^-6.
– Error-free second (EFS): a one-second time interval without any bit errors.
• In Recommendation G.826, similar definitions are specified at the block level.
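The G.821 definitions above can be applied to a trace of per-second bit error counts. The following is a minimal sketch: the function name, the constant bit rate, and the handling of the last few seconds of a trace are assumptions made for illustration and are not taken from the Recommendation.

```python
def classify_seconds(bit_errors, bit_rate):
    """Classify each one-second interval of a trace.

    bit_errors: number of bit errors observed in each second.
    bit_rate: bits transmitted per second (assumed constant).
    Returns three lists of booleans: ES, SES and unavailable flags.
    """
    es = [e > 0 for e in bit_errors]
    ses = [e / bit_rate > 1e-3 for e in bit_errors]
    unavailable = []
    down = False
    for i in range(len(ses)):
        window = ses[i:i + 10]
        if len(window) == 10:
            if not down and all(window):
                down = True      # first of 10 consecutive SES
            elif down and not any(window):
                down = False     # first of 10 consecutive non-SES seconds
        # With fewer than 10 seconds left in the trace, the state is held.
        unavailable.append(down)
    return es, ses, unavailable
```

On a 64 kbit/s circuit, for instance, a run of 12 seconds with 100 bit errors each is flagged as 12 unavailable seconds, while a run of only 9 such seconds counts as SES but leaves the circuit available.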
Describing performance levels
• Recommendation G.826 has the following definitions:
– Errored second (ES): a one-second time interval containing one or more errored blocks.
– Errored block (EB): a block containing one or more errored bits.
– Severely errored second (SES): a one-second time interval in which more than 30% of the blocks are errored.
– Unavailable second (US): as for G.821.
– Background block error (BBE): an errored block that does not occur as part of an SES.
• A measurement time interval has to be specified, and the derived ratios for ES, SES and BBE are the base for the QoS parameters.
• The recommended measurement time for G.821 and G.826 is 30 days.
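The block-based ratios can be derived from per-second block counts. This is a sketch under stated assumptions: the function name and input layout are hypothetical, the 30% threshold follows the slide, and unavailable time is ignored for brevity.

```python
def g826_ratios(seconds):
    """seconds: list of (errored_blocks, total_blocks), one pair per second.

    Returns the errored second ratio, severely errored second ratio and
    background block error ratio over the whole measurement interval.
    """
    if not seconds:
        raise ValueError("empty measurement interval")
    es = ses = bbe = non_ses_blocks = 0
    for errored, total in seconds:
        if errored > 0:
            es += 1
        if total > 0 and errored / total > 0.30:
            ses += 1                 # more than 30% of blocks errored
        else:
            bbe += errored           # errored blocks outside SES
            non_ses_blocks += total
    n = len(seconds)
    return {
        "ESR": es / n,
        "SESR": ses / n,
        "BBER": bbe / non_ses_blocks if non_ses_blocks else 0.0,
    }
```

In an SLA the interval would normally cover the agreed measurement period, and the three ratios would be compared with the contracted limits.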
Describing performance levels
• Recommendation I.356 has the following definitions:
– Cell Loss Ratio (CLR): the number of cells lost divided by the number of cells transmitted.
– Cell Error Ratio (CER): the number of errored cells divided by the number of cells transmitted.
– Cell Misinsertion Rate (CMR): the number of wrongly inserted cells in a specified time interval.
– Cell Transfer Delay (CTD): the time between a cell entering a device under test and leaving it.
– Mean Cell Transfer Delay (mean CTD): the arithmetic mean of a number of CTD values in a specified period.
– Cell Delay Variation (CDV): the degree of variation in the cell transfer delay (CTD) of a virtual connection.
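As a worked illustration of the cell-level parameters, the sketch below computes them from raw counters. The function and argument names are hypothetical, and CDV is taken here as the peak-to-peak spread of the CTD samples, which is only one possible reading of "degree of variation".

```python
from statistics import mean


def cell_metrics(sent, lost, errored, misinserted, interval_s, ctd_samples_s):
    """Compute cell-level metrics for one measurement interval.

    sent, lost, errored, misinserted: cell counters for the interval.
    interval_s: length of the interval in seconds.
    ctd_samples_s: measured cell transfer delays in seconds.
    """
    return {
        "CLR": lost / sent,
        "CER": errored / sent,
        "CMR": misinserted / interval_s,   # misinserted cells per second
        "mean_CTD": mean(ctd_samples_s),
        # Assumption: CDV expressed as max minus min of the samples.
        "CDV_peak_to_peak": max(ctd_samples_s) - min(ctd_samples_s),
    }
```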
Describing performance levels
• For IP-based services, performance levels might be related to a packet-based, routed world:
– Average delay measured monthly across the ISP network between any two access routers should be less than 200 ms.
– The average delay across the ISP network on the transcontinental link between the New York City access router of the customer and the London, U.K., access router of the customer would be less than 250 ms.
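Clauses of this kind translate directly into a monthly check. The sketch below assumes delay samples have already been collected per router pair; the router names and the function name are illustrative only.

```python
from statistics import mean

DEFAULT_LIMIT_MS = 200.0
# Pair-specific limits; frozenset makes the pair order irrelevant.
PAIR_LIMITS_MS = {frozenset({"nyc", "london"}): 250.0}


def delay_violations(samples_ms):
    """samples_ms: dict mapping (router_a, router_b) to the list of
    delay samples (ms) collected during the month.
    Returns the pairs whose monthly average is not below the limit."""
    violations = {}
    for (a, b), samples in samples_ms.items():
        limit = PAIR_LIMITS_MS.get(frozenset({a, b}), DEFAULT_LIMIT_MS)
        avg = mean(samples)
        if avg >= limit:
            violations[(a, b)] = (avg, limit)
    return violations
```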
Describing performance levels
• For IP-based services, the method of measurement is sometimes also specified:
– The customer will not have unscheduled connectivity disruption across the ISP network between any two access routers exceeding 5 min. Connectivity disruption would be defined as the loss of 100% of packets as measured by pinging an access router from a machine connected to another access router.
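Given periodic ping results, the disruption clause can be checked as follows. This is a sketch, assuming each probe is recorded as a timestamp plus the number of replies received; a disruption is measured conservatively from the first fully lost probe to the next successful one.

```python
def longest_disruption_s(probes):
    """probes: list of (timestamp_s, replies_received), sorted by time.
    Returns the longest period, in seconds, with 100% packet loss."""
    longest = 0.0
    start = None   # timestamp of the first lost probe in the current run
    for ts, replies in probes:
        if replies == 0:
            if start is None:
                start = ts
            longest = max(longest, ts - start)
        else:
            if start is not None:
                # Disruption ends at the first successful probe.
                longest = max(longest, ts - start)
            start = None
    return longest
```

With probes every 60 seconds, six consecutive lost probes followed by a success give a 360-second disruption, which would exceed the 5-minute limit.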
Example: JANET, UK
Network availability of at least:
• 99.6% to more than 90% of clients
• 99% to more than 96.5% of clients
• 97% to more than 98.5% of clients
• 93% to more than 99.5% of clients
Mean time between failures of the service of at least:
• 1000 hours provided to 99% of clients
The target rate is less than 0.001 incidents per hour, calculated each month by dividing the number of failures in the best 99% of access points by the number of access points and the number of hours in the month.
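The tiered targets and the incident rate can be evaluated with a short script. This is a sketch, not JANET's own procedure: the function names are invented, and the rate is divided by the number of access points actually kept in the best 99%, which is one reading of the clause.

```python
# (availability %, minimum share of clients %) as listed on the slide
AVAILABILITY_TIERS = [(99.6, 90.0), (99.0, 96.5), (97.0, 98.5), (93.0, 99.5)]


def tiers_met(client_availability):
    """client_availability: monthly availability (%) of each client.
    Returns {availability level: True if enough clients reached it}."""
    n = len(client_availability)
    results = {}
    for level, min_share in AVAILABILITY_TIERS:
        share = 100.0 * sum(a >= level for a in client_availability) / n
        results[level] = share > min_share   # "more than" the stated share
    return results


def incident_rate(failures_per_access_point, hours_in_month):
    """Incidents per access point per hour over the best 99% of access points."""
    ranked = sorted(failures_per_access_point)
    keep = max(1, len(ranked) * 99 // 100)   # integer arithmetic avoids rounding
    best = ranked[:keep]
    return sum(best) / (len(best) * hours_in_month)
```

A result below 0.001 incidents per hour corresponds to the 1000-hour MTBF target.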
Example: JANET, UK
End-to-end latency between any pair of clients for 128-octet packets of less than a stated target time, which depends on the transmission technology used, for 95% of transmissions over any thirty-minute period. Latency is measured from the time of entry on to the first access line of the last bit of the packet to the time of exit from the second access line of the first bit of the packet.
Clients shall normally expect to be able to transmit and receive traffic (from a number of sources) which, over any thirty-minute period, uses at least 40% of the nominal capacity of their access line, once the overheads of the data solely concerned with the transmission technology in use have been discounted.
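The latency clause amounts to a percentile-style test over each thirty-minute period. A minimal sketch, assuming the samples for one period are already available and that the target value is supplied by the contract:

```python
def latency_target_met(latencies_ms, target_ms, required_share=0.95):
    """latencies_ms: latency samples from one thirty-minute period.
    Returns True if at least required_share of them are below target_ms."""
    if not latencies_ms:
        raise ValueError("no samples in this period")
    within = sum(1 for x in latencies_ms if x < target_ms)
    return within / len(latencies_ms) >= required_share
```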
Example: JANET, UK
Performance Indicators and Service Levels for Domain Name Service:
• Availability of the primary name server for the target domain of 99.5%
• Availability of service from an available officially supported name server of 99.95%
Performance Indicators and Service Levels for NTP Time Service:
• This service is intended for use by access points in constructing their own distributed time services (RFC 1305).
• Availability of each time reference of 98%
• MTBF of 800 hours
Describing performance levels
• For hosting services we can have several types of contracted facilities:
– Boxes in specific locations (PoPs, IXPs): box availability is the customer’s task (local access)
– Boxes + uptime (i.e. maintenance): boxes are kept up and running. All the software inside is the customer’s responsibility (remote access)
– Boxes + uptime + applications: the customer just needs to access the service (remote usage)
• Examples are Web hosting, CDNs, Data Centres
Describing performance levels
• Typical performance and availability clauses:
– The hosted server will not be unavailable for a contiguous period exceeding 5 min in any 24-h period. Unavailability is defined as the inability to ping the server from a machine with network connectivity to the hosting provider’s access router.
– The hosted server will be able to handle inbound traffic of 30 000 Web requests per day.
Describing performance levels
• Typical performance and availability clauses:
– The hosted application will be provided access to the Internet at a bandwidth of 45 Mb/s or more.
– The service provider will ensure that there are at least five servers available and running the application at all times.
• If we host multiple customers at the same site we are responsible for ensuring that the performance of one customer’s server is not adversely affected by requests directed to other customers.
– See AKAMAI, IBM “Business on Demand”, …
– Resource control? Virtualisation?
Describing performance levels
• Today’s trend is to outsource entire IT services or even departments
• In this case we have an integrated services offer and the performance clauses are at the overall system level:
– The time to perform an employee lookup on the corporate directory would not exceed 500 ms.
– The average performance of a standard synthetic Web-based transaction, as reported by probes located at selected locations, will not exceed 100 ms.
– Unscheduled downtime of the mail server will not exceed a 30-min period during the normal business day of 9 A.M. to 5 P.M.
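The mail-server clause requires counting only the downtime that falls inside business hours. The sketch below assumes outages are logged as (begin, end) datetime pairs and reads the 30-minute limit as total downtime per day; both the function name and that reading are assumptions.

```python
from datetime import date, datetime, time, timedelta


def business_hours_downtime(outages, day, start=time(9, 0), end=time(17, 0)):
    """Sum the outage time that overlaps the business day on `day`.

    outages: list of (begin, finish) datetime pairs.
    day: a datetime.date; start and end delimit the business day.
    """
    window_start = datetime.combine(day, start)
    window_end = datetime.combine(day, end)
    total = timedelta(0)
    for begin, finish in outages:
        overlap_start = max(begin, window_start)
        overlap_end = min(finish, window_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total
```

The clause is met for a given day when the returned value does not exceed timedelta(minutes=30).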
Describing Customer Support
• This section includes the typical helpdesk guarantees on problem reporting and problem resolution.
• Examples include a single point of contact assigned to the customer and problem resolution within 48 hours of reporting.
• Sometimes it also indicates the SAP (e.g. a toll-free number or a web service).
SLA Measurement
• SLA measurement is quite an active area of R&D
• We can have different approaches depending on the component we need to monitor:
– Sampling: very effective and reliable for applications and system level parameters
– Trap/alarms: logging all troubles and faults
• Big issues for network parameters: how, when & where