Towards High-Availability for IP
Telephony using Virtual Machines
Devdutt Patnaik, Ashish Bijlani and
Vishal K Singh
Outline
• Virtualization
• High Availability (HA) in Virtualized Platforms
– XEN and REMUS (HA solution for XEN)
• Remus applied to IP Telephony (IPT)
applications
– Scalability and Reliability of IPT applications using
Virtualization
• Experimental Results
• Conclusion
Virtualization and its Benefits
• Abstraction layer (Hypervisor) between the
physical hardware and the OS.
• Single physical machine can host multiple virtual
machines each running a different OS +
application stack
• VMMs
– Xen, VMware, Microsoft Hyper-V
• Benefits
– Server consolidation
– Green computing
– Cost savings: space and power
– High Availability: reliability solutions, ease of upgrades with near-zero downtime
Virtualized hosting for IP Telephony
• Virtualized hosting for IP Telephony already
available
– Avaya, Cisco, Asterisk, etc.
• IP Telephony in Cloud
– Scalability: ability to elastically add/remove additional
servers while supporting High-Availability for all
servers
– Reliability: protection against hardware and software
failures
• HA features in virtualization platforms
– Memory state checkpointing
Virtualization and High Availability
• Seamless fail-over, Efficient and
transparent migration of VM to another
physical machine
– Live Migration with very small down-times
– Minimal or no impact to client nodes
• Asynchronous check-pointing
– Continuously syncs the state between the
primary and secondary host
• We use
– Remus: A High Availability Solution for XEN
Remus on XEN
• Remus is a High Availability solution available on the Xen VMM
• Remus uses continuous check-pointing and keeps a consistent client view of network state
• The secondary machine hosts a paused replica of the primary VM
• Uses a heart-beat mechanism
– Failure to receive the periodic heart-beat on the secondary will un-pause the backup VM
– Heart-beat time-out can be configured
Image: http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf
Fig 1
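The heart-beat behaviour described above can be illustrated with a small sketch. This is not Remus code: the class name `HeartbeatMonitor`, the `timeout_s` parameter and the `backup_running` flag are hypothetical names chosen for illustration, and "un-pausing the backup VM" is reduced to setting a flag.

```python
import time


class HeartbeatMonitor:
    """Runs on the secondary host; decides when to un-pause the backup.

    Hypothetical sketch only. Remus's real heart-beat logic lives inside
    the Xen tool stack and is not reproduced here.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # configurable heart-beat time-out
        self.clock = clock              # injectable clock, eases testing
        self.last_beat = clock()
        self.backup_running = False

    def on_heartbeat(self):
        # Called whenever a heart-beat arrives from the primary.
        self.last_beat = self.clock()

    def poll(self):
        # If no heart-beat arrived within the time-out, fail over.
        silent_for = self.clock() - self.last_beat
        if not self.backup_running and silent_for > self.timeout_s:
            self.backup_running = True  # stand-in for un-pausing the VM
        return self.backup_running
```

A shorter time-out detects failures sooner but raises the risk of a false fail-over when the primary is merely slow.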
Remus on XEN (contd.)
• Remus modes of operation
– Net Mode: highly reliable
– No-Net Mode: better performance with negligible packet loss in case of failure
– Tunable for Reliability vs. Performance
Image: http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf
Fig. 2 Disk writes and Network Writes
Net Mode: Buffers outgoing network packets until execution state is synced with the backup VM (on secondary host).
• Reliability at the cost of performance
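As a rough illustration of Net Mode buffering, the sketch below holds outgoing packets until the backup acknowledges the checkpoint that covers them. The class `NetModeBuffer`, its method names and the notion of numbered epochs are assumptions made for this example; they are not taken from the Remus implementation.

```python
from collections import deque


class NetModeBuffer:
    """Holds outgoing packets until the backup has the matching state.

    Hypothetical sketch: epochs number the checkpoint intervals, and a
    packet is released only after its epoch is acknowledged.
    """

    def __init__(self):
        self.epoch = 0
        self.pending = deque()   # (epoch, packet) pairs, oldest first
        self.released = []       # packets actually put on the wire

    def send(self, packet):
        # The guest "sends" a packet; it is buffered, not transmitted.
        self.pending.append((self.epoch, packet))

    def checkpoint(self):
        # Close the current epoch and return its id for the backup to ack.
        closed = self.epoch
        self.epoch += 1
        return closed

    def on_backup_ack(self, acked_epoch):
        # Release every packet produced in an acknowledged epoch.
        while self.pending and self.pending[0][0] <= acked_epoch:
            self.released.append(self.pending.popleft()[1])
```

Clients therefore never see output from state the backup does not hold, which is where the added delay in Net Mode comes from.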
Remus applied to IP Telephony
- Scale with Reliability
• Our work using HA in XEN extends the
“architecture for fail-over and load sharing for
IP Telephony” proposed by Kundan Singh et al.
• Challenges:
– Overheads of virtualization on IP Telephony
performance
– Co-Hosted/Co-located media server causes
interference because of heavy I/O workload
Reliability and Scalability using Virtual
Machines
• Scalability using load balancer (LB)
– LB can elastically add more VMs as demand grows
• Reliability using Remus in XEN
[Figure: stateless load balancer]
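One way a stateless load balancer could map users to server VMs and grow the pool is sketched below. The hashing scheme, the class `StatelessLoadBalancer`, the `max_load_per_vm` threshold and the `new_vm_name` callback are all assumptions for illustration; the slides do not specify how the LB selects servers or when it adds VMs.

```python
import hashlib


class StatelessLoadBalancer:
    """Maps a SIP user to a server VM without per-call state.

    Hypothetical sketch: the server is chosen by hashing the user
    identifier, and VMs are added while average load per VM is too high.
    """

    def __init__(self, servers, max_load_per_vm):
        self.servers = list(servers)
        self.max_load_per_vm = max_load_per_vm

    def pick(self, user):
        # A stable hash keeps the mapping identical across LB restarts.
        digest = hashlib.sha256(user.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]

    def scale(self, offered_load, new_vm_name):
        # Add VMs until the average load per VM is within the limit.
        added = []
        while offered_load / len(self.servers) > self.max_load_per_vm:
            name = new_vm_name(len(self.servers))
            self.servers.append(name)
            added.append(name)
        return added
```

Plain modulo hashing remaps many users whenever the pool size changes; a consistent-hashing scheme would reduce that churn at the cost of more code.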
Reliability Architecture using Virtual machines
• For every primary Virtual Machine there is a
backup VM in paused state.
• Since the backup VM is paused, other running VMs
can be placed on the same physical machine
• Provides an N-to-M elastic backup model (M backups
for N primaries)
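The placement idea above (paused backups sharing hosts with running VMs) can be sketched as a simple assignment. The function `place_backups`, the least-loaded tie-breaking rule and the host and VM names are hypothetical; the slides do not describe a specific placement algorithm.

```python
def place_backups(primaries, backup_hosts):
    """Assign each primary VM a host for its paused backup replica.

    primaries: dict mapping VM name -> host running the primary.
    backup_hosts: hosts allowed to hold paused replicas; they may also
    run other primaries, since a paused replica uses no CPU.
    Hypothetical rule: pick the candidate host with the fewest replicas,
    never the host that runs the primary itself.
    """
    replicas = {host: 0 for host in backup_hosts}
    placement = {}
    for vm in sorted(primaries):
        candidates = [h for h in backup_hosts if h != primaries[vm]]
        if not candidates:
            raise ValueError("no backup host available for " + vm)
        host = min(candidates, key=lambda h: (replicas[h], h))
        placement[vm] = host
        replicas[host] += 1
    return placement


# Example: three primaries on two hosts, two hosts accepting replicas.
example = place_backups(
    {"sip1": "hostA", "sip2": "hostB", "media1": "hostA"},
    ["hostB", "hostC"],
)
```

Here `hostB` both runs the primary `sip2` and holds a paused replica, which is the sharing that the paused-backup model allows.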
Reliability and Scalability using Virtual
Machines (contd.)
• Reliability
– Provided by Xen + Remus
– Failure of primary starts the execution of the
secondary with IP address takeover
– Clients continue to execute un-affected
• Signaling and Media Server:
– Co-located on the same VM: allows better utilization, no overhead of inter-VM communication
– Placed on different VMs: elastic scaling of media and signaling VMs
Studying Performance Implications
• Experimental setup
– Primary/Backup Servers: Intel Core 2 Quad processors, 2.5 GHz, 8 GB RAM, 4 MB L2 cache
– Hypervisor: Xen 3.2.1 + Remus
– Default Credit Scheduler configuration
– Guest OS: paravirtualized Linux 2.6.18
• IP Telephony Workload
– Modeled our workload using SIPstone
• Measured % success of registrations during failover
• Used UDP and TCP as transport for registrations
– Used OpenSIPS as SIP server
– RTPProxy as Media Server
– SIPp for generating signaling and media traffic
Analysis and Results: Signaling
• Guest VM and Domain 0 both have high CPU utilization
with tcp_n (new TCP connection for each REGISTER)
• UDP and tcp_1 (one TCP connection for all REGISTERs)
have similar overhead.
[Figure: CPU utilization (in guest VM, dom0) for registration with Remus Net Mode. udp = UDP transport; tcp_1 = same connection for all calls; tcp_n = new connection for each call]
Analysis and Results: Signaling (contd.)
• CPU overhead increases proportionately with
signaling load
• Dom0 has significant overhead due to checkpointing.
• Net Mode gives good results for Signaling
• A failure was induced at 1400 regs/sec
– 100% of registrations completed after failover to the
backup
Analysis and Results: Media
• Media loads with Net Mode give poor results
• Media with No-Net Mode gives good performance even with 400
streams, with 2% losses
– This can be further reduced by tweaking scheduler parameters
• 100% fail-over of all calls in progress during media
experiments
[Figures: Net Mode and No-Net Mode, each with 100, 200, 400, 600 and 800 streams]
Conclusion
• Using No-Net mode for media streams gives a balance
between performance (loss and delay) and
reliability (failover) while still migrating 100% of
all calls in progress (using TCP), which is a significant result
• Net Mode for Signaling is a good configuration, with 100%
registration completion across failover
• No-Net mode for the Media server deployment provides a
significant improvement in performance: loss and delay
are reduced significantly
– While the No-Net configuration performs better for media, it
may not provide call completion guarantees for signaling during the failover operation
• Migration of user registration and call setup operations was
100% successful
Contributions
• Extended load sharing and failover architecture
using Virtualization
• Proposed use of high availability feature in
virtualized platforms to achieve reliability in IP
Telephony
• Proposed placement scheme of signaling and
media applications for scale (elasticity) and
efficiency (utilization)
• Systematic evaluation of overheads involved in
use of virtualization for IP Telephony Applications
• Demonstrated that High Availability using Virtual
Machines can be deployed for medium scale IP
Telephony infrastructure
Future Work
• More detailed analysis of overheads
– Overhead because of check pointing in virtualization
platform
– Overhead because of I/O in Domain 0
• Propose solutions to improve performance
– Improve I/O handling in XEN VMM
• Propose better VM placement algorithm for IP
Telephony applications
– Utilizing fine grained overhead measurements for resource
allocation
– Considering I/O (media) vs. memory (signaling state
replication) optimizations
– Elasticity with co-location of media and signaling server on
same VM
Questions
• vs2140@columbia.edu