Towards High-Availability for IP Telephony using Virtual Machines
Devdutt Patnaik, Ashish Bijlani and Vishal K Singh

Outline
• Virtualization
• High Availability (HA) in Virtualized Platforms
  – Xen and Remus (HA solution for Xen)
• Remus applied to IP Telephony (IPT) applications
  – Scalability and Reliability of IPT applications using Virtualization
• Experimental Results
• Conclusion

Virtualization and its Benefits
• Abstraction layer (hypervisor) between the physical hardware and the OS
• A single physical machine can host multiple virtual machines, each running a different OS + application stack
• VMMs: Xen, VMware, Microsoft Hyper-V
• Benefits
  – Server consolidation
  – Green computing
  – Cost savings (space and power)
  – High Availability, reliability solutions, ease of upgrades with near-zero downtime

Virtualized hosting for IP Telephony
• Virtualized hosting for IP Telephony is already available
  – Avaya, Cisco, Asterisk, etc.
• IP Telephony in the Cloud
  – Scalability: ability to elastically add/remove servers while supporting High Availability for all servers
  – Reliability: protection against hardware and software failures
• HA features in virtualization platforms
• Memory state checkpointing

Virtualization and High Availability
• Seamless fail-over; efficient and transparent migration of a VM to another physical machine
  – Live migration with very small downtimes
  – Minimal or no impact on client nodes
• Asynchronous checkpointing
  – Continuously syncs the state between the primary and secondary host
• We use
  – Remus: a High Availability solution for Xen

Remus on Xen
• Remus is a High Availability solution available on the Xen VMM
• Remus uses continuous checkpointing and keeps a consistent client view of network state
• The secondary machine hosts a paused replica of the primary VM
• Uses a heartbeat mechanism
  – Failure to receive the periodic heartbeat on the secondary un-pauses the backup VM
  – The heartbeat time-out can be configured
Image:
http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf (Fig. 1)

Remus on Xen (contd.)
• Remus modes of operation
  – Net Mode: highly reliable
  – No-Net Mode: better performance with negligible packet loss in case of failure
  – Tunable for reliability vs. performance
Image: http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf (Fig. 2)
Fig. 2: Disk writes and network writes
• Net Mode: buffers outgoing network packets until execution state is synced with the backup VM (on the secondary host)
  – Reliability at the cost of performance

Remus applied to IP Telephony - Scale with Reliability
• Our work using HA in Xen extends the “architecture for fail-over and load sharing for IP Telephony” proposed by Kundan Singh et al.
• Challenges:
  – Overheads of virtualization on IP Telephony performance
  – A co-hosted/co-located media server causes interference because of its heavy I/O workload

Reliability and Scalability using Virtual Machines
• Scalability using a load balancer (LB)
  – The LB can elastically add more VMs as demand grows
• Reliability using Remus in Xen
Figure label: Stateless load balancer

Reliability Architecture using Virtual Machines
• For every primary virtual machine there is a backup VM in a paused state
• Since the backup VM is paused, other running VMs can be placed on the same physical machine
• Provides an N-to-M elastic/backup model (M backups for N primaries)

Reliability and Scalability using Virtual Machines (contd.)
• Reliability
  – Provided by Xen + Remus
  – Failure of the primary starts execution of the secondary, with IP address takeover
  – Clients continue to execute unaffected
• Signaling and Media Server:
  – Co-located on the same VM: allows better utilization, no overhead of inter-VM communication
  – Placed on different VMs: elastic scaling of media and signaling VMs

Studying Performance Implications
• Experimental setup
  – Primary/Backup Servers: Intel Core 2 Quad processors, 2.5 GHz, 8 GB RAM, 4 MB L2 cache
  – Hypervisor: Xen 3.2.1 + Remus
  – Default Credit Scheduler configuration
  – Guest OS: paravirtualized Linux 2.6.18
• IP Telephony Workload
  – Modeled our workload using SIPStone
    • Measured % success of registrations during failover
    • Used UDP and TCP as transport for registrations
  – Used OpenSIPS as SIP server
  – RTPProxy as media server
  – SIPp for generating signaling and media traffic

Analysis and Results: Signaling
• Guest VM and Domain 0 both have high CPU utilization with tcp_n (a new TCP connection for each REGISTER)
• UDP and tcp_1 (one TCP connection for all REGISTERs) have similar overhead
Figure: CPU utilization (in guest VM, dom0). udp means UDP transport, tcp_1 means the same connection for all calls, tcp_n means a new connection for each call.
Figure: Registration overhead with Remus Net Mode.

Analysis and Results: Signaling (contd.)
• CPU overhead increases proportionately with signaling load
• Dom0 has significant overhead due to checkpointing
• Net Mode gives good results for signaling
• With 1400 regs/sec, a failure was induced
  – 100% of registrations completed by failover to the backup

Analysis and Results: Media
• Media loads with Net Mode give poor results
• Media with No-Net Mode gives good performance even with 400 streams, with 2% losses
  – This can be further reduced by tweaking scheduler parameters
• 100% fail-over of all calls in progress during media experiments
Figure panels: Net Mode with 100, 200, 400, 600 and 800 streams; No-Net Mode with 100, 200, 400, 600 and 800 streams

Conclusion
• Using No-Net Mode for media streams gives a balance between performance (loss and delay) and reliability (failover) while still being able to migrate 100% of all calls in progress (using TCP), which is a significant result
• Net Mode for signaling is a good configuration, with 100% registration completion with failover
• No-Net Mode for the media server deployment provides a significant improvement in performance: loss and delay are reduced significantly
  – While the No-Net configuration performs better for media, it may not provide call completion guarantees during the failover operation for signaling
• Migration of user registration and call setup operations was 100% successful

Contributions
• Extended the load sharing and failover architecture using virtualization
• Proposed use of the high availability feature in virtualized platforms to achieve reliability in IP Telephony
• Proposed a placement scheme for signaling and media applications for scale (elasticity) and efficiency (utilization)
• Systematic evaluation of the overheads involved in the use of virtualization for IP Telephony applications
• Demonstrated that High Availability using Virtual Machines can be deployed for medium-scale IP Telephony infrastructure

Future Work
• More detailed analysis of overheads
  – Overhead because of checkpointing in the virtualization platform
  – Overhead because of I/O in Domain 0
• Propose solutions to improve performance
  – Improve I/O handling in the Xen VMM
• Propose a better VM placement algorithm for IP Telephony applications
  – Utilizing fine-grained overhead measurements for resource allocation
  – Considering I/O (media) vs. memory (signaling state replication) optimizations
  – Elasticity with co-location of media and signaling server on the same VM

Questions
• vs2140@columbia.edu
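
Backup: Net Mode vs. No-Net Mode, a toy sketch
The trade-off between the two Remus modes described earlier in the deck can be illustrated with a small model. This is only a sketch under stated assumptions, not Remus code: the class ReplicatedOutput, its method names and the SIP-like packet strings are hypothetical, and it assumes that No-Net Mode releases packets immediately instead of buffering them. It shows why Net Mode never exposes a client to output produced from state the backup does not hold, while No-Net Mode can.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicatedOutput:
    """Toy model of a primary VM's outgoing network path (hypothetical names)."""
    net_mode: bool                                        # True: buffer until checkpoint ack
    pending: List[str] = field(default_factory=list)      # produced since last acked checkpoint
    released: List[str] = field(default_factory=list)     # packets visible to clients
    replicated: List[str] = field(default_factory=list)   # output whose state the backup holds

    def send(self, packet: str) -> None:
        """Guest emits a packet derived from state that is not yet replicated."""
        self.pending.append(packet)
        if not self.net_mode:
            # Assumption: No-Net Mode lets the packet leave right away.
            self.released.append(packet)

    def checkpoint_acked(self) -> None:
        """Backup confirms it holds the state behind all pending packets."""
        if self.net_mode:
            # Net Mode: buffered packets may leave only now.
            self.released.extend(self.pending)
        self.replicated.extend(self.pending)
        self.pending.clear()

    def exposed_but_unreplicated(self) -> List[str]:
        """Packets clients saw whose originating state would be lost on failover."""
        return [p for p in self.released if p not in self.replicated]


def run(net_mode: bool) -> ReplicatedOutput:
    """Send one packet, checkpoint, send another, then 'fail' before the next ack."""
    out = ReplicatedOutput(net_mode=net_mode)
    out.send("200 OK #1")
    out.checkpoint_acked()
    out.send("200 OK #2")  # primary fails here, before the next checkpoint ack
    return out


if __name__ == "__main__":
    for mode in (True, False):
        result = run(mode)
        label = "Net Mode" if mode else "No-Net Mode"
        print(label, "released:", result.released,
              "at risk:", result.exposed_but_unreplicated())
```

In this model the cost of Net Mode shows up as the second packet being held back until the next acknowledgement, which is consistent with the added delay reported for media streams; the benefit is that nothing a client has seen depends on state that would be lost at failover.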