Modeling and Assessing Secure Voice over IP Performance

by Cory L. Zue

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the Degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 24, 2005

Copyright 2005 Cory L. Zue. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 24, 2005

Certified by: Robert K. Cunningham, Associate Leader - Information Systems Technology Group (Lincoln Lab), Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Submitted to the Department of Electrical Engineering and Computer Science on May 25, 2005

Abstract

Voice over Internet Protocol (VoIP) systems enable efficient communications over data networks, but the security of VoIP and the impact of that security on communications quality have not been quantitatively modeled. A conversational model is adapted for VoIP and a computational model of communication quality - the Z-Model - is developed. VoIP conversations are simulated for networks with a range of performance characteristics, including differing bandwidth, latency, and bit error rates, to evaluate the impact of security on communication quality.
Results show that improving confidentiality via encryption of conversation data packets does not introduce significant delays, but does increase the bandwidth required. In certain restricted-bandwidth environments this results in dramatic reductions of perceived conversation quality.

Thesis Supervisor: Robert K. Cunningham
Title: Associate Leader, Information Systems Technology Group

This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Acknowledgements

There are several people whose work and input went into this thesis. First I'd like to thank my parents, for putting me through college and supporting me both financially and emotionally through some stressful times. From Lincoln Laboratory, I'd like to thank Lenny Veytser for the use of his tool for analyzing network traffic, and Mark Yeager for porting my VoIP traffic generating code into a scriptable format. I'd also like to thank Aaron Beveridge for putting up with my constant requests for help in various administrative tasks.

Mostly, though, I'd like to thank my advisors: Rob Cunningham, my official thesis advisor, and Cindy McLain, who, although her name is not on the cover, put as much work into this thesis as any advisor could. Rob was terrific in keeping me on task and seeing the big picture, while Cindy analyzed my work with an incredible attention to detail, making sure facts were checked and grammar rules were not broken. I feel very fortunate to have had not one, but two excellent advisors willing to work extra hours and late nights to ensure that this thesis got finished. To both of you I owe an incredible debt of gratitude, and probably a few hours of sleep.
Contents

1 Introduction
2 Background
  2.1 Before VoIP: Telephones and the Public Switched Telephone Network
  2.2 The Internet
3 Voice Over IP
  3.1 System Overview
  3.2 History and Growth
  3.3 Motivation
  3.4 VoIP Deployment Challenges
    3.4.1 Infrastructure Requirements
    3.4.2 Factors Affecting Conversation Quality
    3.4.3 Security
4 VoIP Implementation Details
  4.1 Call Setup Protocols
    4.1.1 H.323
    4.1.2 Session Initiation Protocol (SIP)
    4.1.3 SIP versus H.323: A Comparison
    4.1.4 Others
  4.2 Transport Protocols
    4.2.1 Internet Protocol (IP)
    4.2.2 User Datagram Protocol (UDP)
    4.2.3 Transmission Control Protocol (TCP)
    4.2.4 Real-Time Protocol (RTP)
5 VoIP and Security
  5.1 Internet Security Overview
    5.1.1 Definition of Security
    5.1.2 Security Details
    5.1.3 Security Implementation
  5.2 Security Applied to VoIP
    5.2.1 Confidentiality
    5.2.2 Integrity and Non-Repudiation
    5.2.3 Availability
    5.2.4 Security and Quality of Service
6 Measuring Conversation Quality
  6.1 Mean Opinion Score (MOS)
  6.2 Perceptual Evaluation of Speech Quality (PESQ)
  6.3 E-Model
7 VoIP Conversation Modeling
  7.1 Motivation
  7.2 Brady's Model for Two-Way Speech
    7.2.1 One-Port versus Many-Port Models
    7.2.2 The One-Port, Two-State Model
    7.2.3 The One-Port, Four-State Model
    7.2.4 The One-Port, Six-State Model
    7.2.5 Model Parameters and Interpretation
    7.2.6 Model Limitations
  7.3 Applying Brady's Model to VoIP
    7.3.1 Voice Activity Detection (VAD)
  7.4 Revising Brady's Parameters
    7.4.1 Correlating Brady's Data with the Switchboard Corpus
    7.4.2 Comparing the Switchboard Corpus to VoIP
  7.5 Developing User Models
    7.5.1 The Average Speaker Pair
    7.5.2 The Authority Relationship
    7.5.3 An Alternating Protocol
  7.6 Summary
8 Adapting the E-Model to Secure VoIP Systems: The Z-Model
  8.1 Using the E-Model for VoIP Communication
    8.1.1 Jitter
    8.1.2 Echo
    8.1.3 Error Rates
  8.2 Going from E-Model to Z-Model
    8.2.1 Modeling the Effects of Security
  8.3 Incorporating Conversational Improvements
9 Resources
  9.1 Traffic Generator
  9.2 Link Emulator
  9.3 Encryption Devices
  9.4 Test Network
10 Methodology
  10.1 Understanding the Performance of Secure VoIP
    10.1.1 Experiment 1: Performance Under Optimum Conditions
    10.1.2 Experiment 2: Performance of Openswan with Increased Traffic
    10.1.3 Experiment 3: Performance over Low-Bandwidth Links
    10.1.4 Experiment 4: Performance over Links with High Loss and Error Rates
  10.2 Adding Security to the Z-Model
  10.3 Evaluating the Performance of the Z-Model
    10.3.1 Evaluating the Disagreement Factor as a Replacement for Absolute Delay
    10.3.2 The Loss Due to Errors
    10.3.3 Using the Z-Model to Estimate and Measure Overall Conversation Quality
11 Future Work
  11.1 Conversation Modeling
  11.2 Security
  11.3 Network Characteristics
  11.4 Voice Quality
12 Conclusion

List of Figures

1 A Typical VoIP Setup
2 An H.323 Call Setup
3 A SIP Call Setup and Takedown
4 Evaluating VoIP with the PESQ Model
5 R-Values and MOS for Varying Conversation Quality[58]
6 A Two-State Conversation Model
7 A Four-State Conversation Model
8 A Six-State Conversation Model
9 Probability Distribution of Time Spent in a State
10 Six-State Conversation Model with α and β Parameters
11 Talking On/Off Patterns versus Time for a Single Speaker in a Conversation
12 Time in Each State for Brady's Data[32] and the Switchboard Speech Corpus[34]
13 Time in Each State for 3 Calls from the Switchboard Corpus
14 Simulated Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers
15 Real Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers
16 Simulated Talking On/Off Patterns versus Time for a Conversation with a Dominant Speaker
17 Simulated Talking On/Off Patterns versus Time for a Conversation with an Alternating Protocol
18 Agreement and Disagreement Time with ta Shorter than State Lengths
19 Agreement and Disagreement Time with ta Shorter than State Lengths
20 Test Network for Experimentation
21 End-to-End Delays for 128-bit, Uncompressed AES and Clear Communication with Varying Levels of Traffic
22 Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (End-to-end throughput)
23 Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (Actual Packet Sizes)
24 Introduced Loss Rates for Clear and Encrypted Communication
25 Loss Rates versus Bit Error Rates for Clear and Encrypted Communication
26 Disagreement Factor for Varying Two-Way Latencies
27 Expected and Observed Loss Rates versus Bit Error Rates for Clear and Encrypted Communication

List of Tables

1 Sample and Bandwidth Information for Various Voice Codecs
2 The MOS Quality Scale
3 Brady's Parameters for the Six-State Model
4 Statistics of Brady's Parameters for the Switchboard Data
5 Statistics of Brady's Parameters for the Switchboard Data with Buffer
6 Brady's Parameters for a Dominant and Passive Speaker
7 Brady's Parameters for an Alternating Protocol
8 Characteristics used to Simulate Various Airborne Links
9 Baseline Per-Packet Delays for Various Encryption Algorithms in Openswan
10 Factors Contributing to R score for Various Links

1 Introduction

The rapid growth and scope of the Internet have had a tremendous impact on the way people communicate.
E-mail and instant messaging have revolutionized the speed and cost of written correspondence, and have allowed people separated by great geographical distances to write to each other much faster and more cheaply than previously possible. Phone conversations, on the other hand, did not change significantly with the advent of the Internet until recently. In the past five years, though, the spread of Voice Over IP, a technology that allows voice communication over the Internet, has demonstrated that this is rapidly changing.

The slow adoption of Voice Over IP, or VoIP, has largely been a result of poor quality of service. The Internet is subject to delays and bandwidth limitations that, until recently, have made VoIP unattractive to typical users. In sending e-mail or browsing websites, delays of a few seconds do not significantly hamper a user's experience, but in a verbal conversation they make things quite difficult.

A second challenge facing VoIP is security. The traditional phone system, the public switched telephone network or PSTN, consists of a series of dedicated circuits that are owned by a few select bodies. This makes it difficult for others to "tap in" and listen to particular calls. The Internet, on the other hand, spans countless switches, lines, and routers, and in any given connection it is very difficult to know exactly where data is being sent, or who can see it. For this reason, the privacy that people have come to expect from their communications is not guaranteed in VoIP systems.

Providing privacy in an Internet setting generally relies on cryptographic algorithms that encode data before it is sent, in a way that only the intended recipient can decode. These algorithms, however, introduce an overhead in both time and bandwidth.
If people are to have a level of privacy from VoIP systems equivalent to that provided by the PSTN, the overhead introduced by applying security features must not reduce the conversation quality below the point of usability.

The main goal of this research is to develop a methodology for objectively estimating the quality of VoIP communication in a given network environment. Security should be incorporated into this model, as well as unique network characteristics that may arise in less than ideal situations. Some Internet communication runs over links with limited bandwidth or above average latency, and it is important to understand how VoIP performs in these environments. Of particular interest is the performance of VoIP over encrypted wireless airborne networks.

To explore the feasibility of secure VoIP in various network environments, an understanding of the behavior of VoIP systems is first necessary. The size and rate of packets sent, as well as the on/off spurts of VoIP speech, are studied so that an accurate model of a VoIP user can be developed. There has been quite a lot of exploration of VoIP traffic generation in the commercial sector, with emphasis on testing the limits of infrastructure capacity. These commercial tools, however, such as Hammer VoIP Test Solution[1] and IxVoice[2], do not use complex conversational models, but rather send pre-generated traffic streams, or use very simple exponential on/off models[37].

A second task of this thesis is to explore the various ways of providing security in an Internet setting. Security can be provided in several ways, which differ in the encryption algorithm used and in the network layer at which it is applied.

Finally, a metric for conversation quality that incorporates network characteristics and security implementation should be developed. Conversation quality is largely a subjective concept. Different people may have different standards of what a "good" quality conversation is.
Being able to objectively estimate voice quality is essential in estimating the performance of a particular proposed VoIP setup.

The focus of this thesis is the evaluation of conversation quality. The impact of security and unfavorable network links on call setup was not considered, although some information about call setup is included for completeness.

Section 2 provides some background information about the public switched telephone network and the Internet. Sections 3 and 4 present VoIP in more detail, discussing it in a historical context, providing implementation details, and reviewing the protocols involved. Section 5 provides an overview of Internet security, with emphasis on security's role in the context of VoIP. It also explains why we chose IPSec as the security layer for our experiments. Section 6 describes the various methods of evaluating speech quality, and justifies the choice of the International Telecommunication Union's E-Model as an appropriate performance measurement tool. In Section 7 the techniques for conversation modeling are explored, as well as a methodology for adapting a PSTN conversation model to mimic VoIP behavior.

The main focus of this thesis is in Sections 8 through 10. In Section 8 the Z-Model is introduced, a computational model, based on the E-Model, that incorporates security and fine-tunes the E-Model in the context of VoIP systems. Section 9 describes the resources used for experimentation, and Section 10 discusses the experiments performed in designing and evaluating the Z-Model as an appropriate tool for measuring VoIP quality. Finally, Section 11 discusses the limitations of this research and explores areas for further study. Section 12 offers some concluding remarks.

2 Background

Before delving into the world of VoIP and security, it is appropriate to know a little bit about VoIP's predecessors. This section will discuss the telephone and the network built to support telephony: the public switched telephone network.
It will also briefly discuss the advent of the Internet, the network VoIP was built to run over.

2.1 Before VoIP: Telephones and the Public Switched Telephone Network

The Internet has connected people further and faster than ever before in history. But before the Internet, the telephone was the global method of communicating across large distances. The first voice transmission was sent by Alexander Graham Bell in 1876[3]. Bell did not have a phone number to dial or an e-mail address; he simply picked up the phone and the person on the other end could hear what he said. For a long time, this was the model used by the telephone: a user had to have a direct connection to whomever he wanted to call. As phones became more widespread, a new model was required in which each person had a connection to a central switch. To reach someone else, a caller would ask the operator of the switch to physically connect the appropriate lines, which would establish a circuit for the duration of the call[3].

Over time the shape and mechanisms of the phone network changed. Operators were replaced by a signaling code (i.e., the phone number) that allowed calls to be routed and connected automatically. As the network grew, it developed hierarchical layers of switches, each one supporting more traffic. The resulting network is known as the public switched telephone network (PSTN).

The PSTN is excellent at providing high quality communications. Except in times of extreme usage, such as a natural disaster or holiday, when many people are trying to communicate at the same time, people can reach each other whenever they want. Typically the PSTN is up 99.999% of the time[4]. In addition, the typical sound quality of PSTN calls is very good. This is because PSTN calls are circuit-switched. That means that a dedicated circuit is created during the setup phase of the call, and this circuit continues to serve the call in an exclusive manner until it completes.
Each call is guaranteed a fixed amount of continuous bandwidth, no more and no less, for the duration of the call.

Circuit-switching also has its disadvantages. First of all, it is bandwidth inefficient; the circuit is tied up by a conversation, regardless of whether or not either of the parties happens to be talking. Additionally, this fixed bandwidth makes it difficult to add new features. Typically, a single home has a 56-kbps phone line, which is simply not enough bandwidth to support Internet, phone, and video at the same time[3].

The alternative to a circuit-switched network is a packet-switched network, in which each chunk of data is individually routed. Packet-switched networks, such as the Internet, make more efficient use of bandwidth. For this reason, the people who hope to converge the phone and data networks believe that voice should be migrated to packet networks and not the other way around. As it currently stands, many people still use the phone network and a modem for data transfers, such as e-mail and web browsing. As the demand for data and bandwidth continues to increase, the use of the PSTN for data becomes increasingly infeasible, and the need to migrate voice to the data network becomes apparent.

2.2 The Internet

In this section we provide a brief history of the Internet. It is not meant to be an exhaustive study of the Internet's details, but rather a context for comparing VoIP to PSTN calls.

The origins of the Internet go back to a 1969 project of the U.S. Department of Defense. At that time, the PSTN was the only available nation-wide communications system. The government realized that because of the PSTN's dependence on circuit switching, switching stations could be targeted during an attack, effectively taking down the communication channels of an entire region[5]. The government wanted to build a network that would dynamically route each piece of traffic (called a packet) depending on the availability of links.
The result of this project was the ARPANET, which would later develop into the Internet[6].

At the heart of the ARPANET was packet switching, which allows data to travel from one point to another in the network without setting up a connection or establishing a fixed path. Each node on the network has its own unique address, known as an IP (for Internet Protocol) address. Each packet is labeled with a source and destination address, and the routers look up where to send the packet to reach that destination. One problem with this protocol is that it is unreliable; packets are not guaranteed to reach their destination. For this reason, other protocols run on top of IP, such as Transmission Control Protocol (TCP) and User Datagram Protocol (UDP), that can provide varying levels of connection control. These protocols are discussed in Section 4.2.

Packet switching has several advantages. One is that it allows bandwidth to be used only when it is needed. Low bandwidth applications, such as e-mail, can share links with voice or video at the same time as long as the available bandwidth supports it. Packet switching also allows data to be dynamically routed, so connections can be maintained even as intermediate links and hosts go up and down. Dynamic routing also allows traffic to be spread across different links in times of congestion.

There are some disadvantages to packet switching as well. In particular, it is difficult to guarantee a particular level of service to users, a task that is readily accomplished by circuit-switched networks. Additionally, the Internet was designed such that the majority of network control is placed at the endpoints. This can make network management a difficult task. Despite its shortcomings, the Internet has experienced incredible growth since the development of the ARPANET.
As recently as 1983 there were fewer than 600 registered hosts on the ARPANET, most of which were universities and military research sites[5]. Now the Internet has countless hosts and supports incredible volumes of traffic. Thus, there is plenty of room on the Internet for voice traffic.

3 Voice Over IP

Voice over IP (VoIP) is the transmission of voice data over packet-switched networks, such as the Internet. This section provides an overview of VoIP technology and discusses why VoIP is an important topic to study. Section 3.1 provides a quick overview of VoIP systems. Section 3.2 discusses the historical context of VoIP. Section 3.3 discusses the motivation behind implementing VoIP instead of traditional PSTN technology, and Section 3.4 discusses some of the challenges facing VoIP, several of which are addressed in the rest of this thesis.

3.1 System Overview

A typical VoIP architecture is shown in Figure 1. Each user has a device, called a "Voice Terminal", that translates speech to data packets that can be sent over the Internet. A Voice Terminal can be a stand-alone piece of hardware (known as an IP phone), software running on a PC, or a regular phone running through an adaptor and plugged into the Internet.

[Figure 1: A Typical VoIP Setup. Alice's Voice Terminal (software or hardware phone) passes speech through an analog to digital converter and data compression, wraps the result in an RTP packet inside a UDP packet, and sends it over the Internet to Bob's Voice Terminal.]

The first element of the Voice Terminal is an analog to digital (A/D) converter. This device takes in the analog audio signal, either from a phone handset or PC microphone, and converts it to digital data that can be processed by a computer chip.

Once the speech has been digitized it is usually compressed by a device or program known as a codec. Compression allows the data to take up less memory and bandwidth, but sometimes results in a loss of information and quality.
There are several standard codecs for voice that offer varying levels of compression and quality, and an overview of the most popular codecs can be found in Table 1[16]. In addition to compressing the audio, the codec breaks it into data values, known as samples, taken at small, discrete timesteps. The exact timestep and size of these samples depend on the codec used.

Codec and         Codec Sample  Codec Sample   Voice Payload  Voice Payload  Packets Per   Bandwidth MP or  Bandwidth
Bit Rate (Kbps)   Size (Bytes)  Interval (ms)  Size (Bytes)   Size (ms)      Second (PPS)  FRF.12 (Kbps)    Ethernet (Kbps)
G.711 (64)        80            10             160            20             50            82.8             87.2
G.729 (8)         10            10             20             20             50            26.8             31.2
G.723.1 (6.3)     24            30             24             30             34            18.9             21.9
G.723.1 (5.3)     20            30             20             30             34            17.9             20.8
G.726 (32)        20            5              80             20             50            50.8             55.2
G.726 (24)        15            5              60             20             50            42.8             47.2
G.728 (16)        10            5              60             30             34            28.5             31.5

Table 1: Sample and Bandwidth Information for Various Voice Codecs

The compressed audio data is then wrapped inside a Real-Time Protocol (RTP) packet[71]. Depending on the codec, one or more samples will be put inside a single RTP packet. The purpose of RTP is to provide information on the codec used, as well as additional sequencing and timing information that allows the stream of packets to be converted back to audio. Details of how this is accomplished are discussed in Section 4.2.4.

The RTP packet is then put into a User Datagram Protocol (UDP) packet[69], to be sent over the Internet. UDP provides information about source and destination addressing so that routers in the Internet that handle the packet know where to send it. The choice of UDP over TCP is discussed in Section 4.2.2.

The packet is then shipped over the Internet (or, in some cases, a Local Area Network (LAN)) to the receiving user. This user has his own Voice Terminal that can perform the inverse operations on the packet.
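The encapsulation just described is the reason the per-call bandwidth in Table 1 is larger than the codec bit rate: every voice payload carries a fixed set of headers. The following is a minimal sketch of that arithmetic, not the method used in this thesis. The header sizes are assumptions that do not appear in the text above (12 bytes for a basic RTP header, 8 for UDP, 20 for IPv4 without options, and 18 bytes of Ethernet framing), and the function names are hypothetical.

```python
import struct

# Assumed per-packet overhead in bytes (not stated in the text above).
RTP_HEADER = 12          # fixed RTP header, no CSRC entries or extensions
UDP_HEADER = 8
IP_HEADER = 20           # IPv4 without options
ETHERNET_OVERHEAD = 18   # 14-byte frame header plus 4-byte frame check sequence


def build_rtp_header(payload_type, sequence, timestamp, ssrc):
    """Pack a minimal RTP header: version 2, no padding, extension, or CSRC."""
    first_byte = 2 << 6                 # version number in the top two bits
    second_byte = payload_type & 0x7F   # marker bit left cleared
    return struct.pack("!BBHII", first_byte, second_byte,
                       sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


def ethernet_bandwidth_kbps(payload_bytes, payload_ms):
    """One-way bandwidth of a single voice stream over Ethernet, in Kbps."""
    packets_per_second = 1000.0 / payload_ms
    packet_bytes = (payload_bytes + RTP_HEADER + UDP_HEADER
                    + IP_HEADER + ETHERNET_OVERHEAD)
    return packet_bytes * 8 * packets_per_second / 1000.0


# (voice payload in bytes, voice payload duration in ms) per codec.
# G.711 carries two 80-byte, 10 ms samples per packet, i.e. 160 bytes per 20 ms.
CODECS = {
    "G.711 (64)": (160, 20),
    "G.729 (8)": (20, 20),
    "G.723.1 (6.3)": (24, 30),
    "G.723.1 (5.3)": (20, 30),
    "G.726 (32)": (80, 20),
    "G.726 (24)": (60, 20),
    "G.728 (16)": (60, 30),
}

if __name__ == "__main__":
    print("RTP header length:", len(build_rtp_header(0, 1, 160, 0x1234)), "bytes")
    for name, (size, duration) in CODECS.items():
        print(f"{name:14s} {ethernet_bandwidth_kbps(size, duration):5.1f} Kbps")
```

Under these assumed header sizes the sketch gives 87.2 Kbps for G.711 and 31.2 Kbps for G.729, matching the Ethernet column of Table 1. The comparison also shows why low bit rate codecs gain less than their compression ratio suggests: for G.729 the headers are almost three times the size of the voice payload.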
Reversing the process involves first removing the UDP headers and parsing the RTP data to determine the codec and ordering of the packet. Then, the packet is decoded with the appropriate codec, and is passed through a digital to analog (D/A) converter, after which the resulting analog signal can be played (either through a handset or a computer's speakers). Because both parties in a two-way conversation need to encode and decode the data, both functions are needed in each host's voice terminal.

3.2 History and Growth

Although VoIP has recently emerged as one of the hottest topics in technology, it has actually been researched and developed for quite a long time. In 1978 the first packet-switched teleconference was held on the ARPANET, with members including Cliff Weinstein of MIT's Lincoln Laboratory[8]. In 1983, Weinstein et al. published a paper describing the transmission of speech over packet networks[9]. The first functional VoIP software was released by VocalTec Inc. in 1995[10]. This software was designed to run on a PC, but, unfortunately, insufficient processor speed, Internet bandwidth, and other factors prevented VoIP from being a viable alternative to PSTN telephony.

Since then, however, advances in processor speeds and a drastic increase in Internet bandwidth and reliability have allowed VoIP to be a reasonable alternative to the PSTN. As a result, VoIP has experienced tremendous growth. This can be seen in the rate at which Cisco, the leading manufacturer of IP phones, has sold its products. Cisco shipped its 1 millionth phone in August of 2002, representing three and a half years of sales. It shipped its 2 millionth phone just 12 months after that, in July of 2003, and its 3 millionth phone in April of 2004, only 8 months later[11].

Cisco is not the only company reflecting the growth in this industry. According to research done by the Yankee Group, 54 percent of businesses are currently testing or evaluating the potential of VoIP[13].
Another study, by Juniper Research, estimates that by 2009, 10% of all US households and 40% of business lines will be using VoIP. They also estimate that the global VoIP market in that year will be $32 billion[14].

From this information it seems clear that VoIP is an important industry and technology. The next section discusses motivations for switching to VoIP. Section 3.4 goes on to discuss some problems that should be addressed before employing a VoIP system.

3.3 Motivation

The main reason why many people and businesses are switching from the PSTN to VoIP is cost. The difference in cost between these technologies lies in the inherent difference between circuit-switched and packet-switched networks. Making a PSTN call requires "renting" bandwidth on a circuit controlled by a few large corporations for the duration of the call. Internet bandwidth, on the other hand, is a largely underutilized resource, and typically users are charged a fixed amount for connectivity regardless of bandwidth use. This makes VoIP extremely cheap. Additionally, while there are many regulatory charges for long distance and international phone calls over the PSTN, packet-switched networks, like the Internet, are currently completely unregulated[17].

As a result, many sites exist today that offer free VoIP to consumers (Skype and EarthLink are two), while countless other providers offer unlimited long-distance dialing over IP for a small monthly fee (AT&T, CableVision, Net2Phone). Since IP bandwidth is extremely cheap compared to PSTN lines, this is an economically sound option for both providers and users of VoIP systems.

For corporations, the switch to VoIP can be even more economical. Since VoIP uses the same underlying architecture as data networks, a complete switch to VoIP can eliminate the need for a company to have a separate voice network, saving cost and resources.
This consolidation of networks makes internal phone lines obsolete, a change that could save companies up to 20% when compared to PSTN[15].

There are also non-economic advantages to VoIP. One advantage emerging recently is the elegance and philosophy of "everything over IP." Merrill Lynch and Vonage have recently released a study[18] that claims that VoIP is a first step towards a day when all communication technologies (Internet, phone, television, radio, etc.) will run over IP. This convergence has economic advantages, but also creates simplicity in merging different mediums; only one set of communication lines and protocols would be necessary.

VoIP also allows users to be free of ties to a physical location. In this way, VoIP resembles a cellular phone; a user can plug his IP phone into any Internet jack in the world and have his phone respond to the same number. Wireless IP networks make this comparison even more viable. Another benefit is that telephone addresses are no longer confined to 10-digit numerical codes, but could take nearly any form (e-mail addresses with a special marker, for example).

Despite these advantages, there are still several problems with VoIP that need to be addressed. These are discussed next.

3.4 VoIP Deployment Challenges

VoIP, like any new technology, comes with a set of challenges that must be addressed before it can be widely and safely used. These challenges fall into three main categories: setting up a VoIP infrastructure capable of interfacing with the existing phone network and supporting VoIP systems from different vendors, maintaining conversation quality that users are accustomed to, and providing security through authentication and privacy. These challenges are discussed individually in the following sections.
3.4.1 Infrastructure Requirements

The most obvious problem when employing a new communications technology like VoIP is building an infrastructure that will allow it to run smoothly and integrate with existing technologies. Luckily this problem has, for the most part, already been addressed. One benefit of VoIP is that it runs over IP, so the existing Internet backbone can be used to transport data and new lines are not necessary. For addressing, call setup, and interfacing with the PSTN, several protocols have been suggested, with two of them, H.323 and SIP, emerging as the most popular[64]. At some point a standard for VoIP will have to be accepted, but until then the various protocols must be able to interface not only with each other, but also with the existing PSTN. For this purpose there exist devices, known as gateways, that allow translation from IP to PSTN for various VoIP signaling protocols.

Thus, two very important aspects of a VoIP infrastructure are already in place. First, a VoIP network can run almost exclusively as a layer over IP, so the connecting network is already set up. Second, this network can readily interface with the PSTN, so users can switch to VoIP one at a time without being disconnected from the existing phone network. This allows VoIP to be incrementally deployable, a crucial characteristic if it is to be incorporated smoothly into the telecom architecture.

3.4.2 Factors Affecting Conversation Quality

A second major concern with VoIP has been the issue of conversation quality. Users of any kind of telephone system are accustomed to a certain level of quality, below which the usability of the system quickly degrades. It is therefore important to ensure that the conversation quality provided by VoIP systems is equal to that of PSTN systems. Until recently this has been a difficult task.

There are several factors that can influence conversation quality. One is the problem of delay.
Sources disagree on the amount of delay beyond which quality rapidly deteriorates, but most place this value somewhere between 100 and 400 ms[17, 53]. Delays can come from a number of sources. Computationally, there are delays at each step of the way. The A/D and D/A conversions take time, as do compression by the codec, packetizing and depacketizing the data, and decompression by the codec. Clearly, these delays are directly related to the processor speed of the voice terminals. For modern computers and IP phones they are typically quite small, but for a long time they made VoIP unusable[17]. Additionally, data routed over the Internet can travel through several routers, each of which may have a packet queue that can add to delays. Finally, there is a propagation delay associated with sending a signal over a wire, which grows with the distance traveled and is bounded below by the speed of light. This is a physical barrier that cannot be avoided or significantly reduced for VoIP or PSTN conversations.

A second problem facing VoIP is that of bandwidth. Any live streaming application, by nature, will require a significant amount of bandwidth. Even after compression and the use of Voice Activity Detection (which prevents packets from being sent in periods of silence), nearly all VoIP applications require at least 20 kilobits per second of bandwidth on the network[16]. Only recently have most homes and businesses had this kind of consistent bandwidth at their disposal.

Another factor that affects VoIP quality is jitter, which is the variance of interarrival times between packets. For a real-time application such as VoIP, it is important that packets arrive at a relatively smooth and uniform rate. Jitter can drastically reduce conversation quality if measures are not taken to counter it. In many cases this problem can be mitigated through the use of a jitter buffer[50].

Several other factors affect VoIP quality, including packet drops and transmission errors over the network.
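As a rough illustration of the bandwidth requirement discussed above, the following Python sketch estimates the one-way IP-layer bandwidth of a constant-bit-rate voice stream once per-packet headers are added. The codec rates (G.711 at 64 kbit/s, G.729 at 8 kbit/s), the 20 ms packet interval, and the function name voip_bandwidth_kbps are assumptions introduced here for illustration; the estimate ignores link-layer framing and Voice Activity Detection.

```python
# Per-packet header sizes in bytes: IPv4 without options, UDP, fixed RTP header.
IP_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12


def voip_bandwidth_kbps(codec_kbps: float, packet_ms: float) -> float:
    """Estimate one-way IP-layer bandwidth for a constant-bit-rate codec.

    codec_kbps: codec output rate in kilobits per second (assumed value).
    packet_ms:  milliseconds of speech carried in each packet (assumed value).
    """
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    packet_bytes = payload_bytes + IP_HEADER + UDP_HEADER + RTP_HEADER
    return packet_bytes * 8 * packets_per_second / 1000


for name, rate in (("G.711", 64), ("G.729", 8)):
    print(f"{name}: {voip_bandwidth_kbps(rate, 20):.1f} kbit/s")
```

Under these assumptions the 40 bytes of headers are a fixed cost per packet, so a low-rate codec spends proportionally far more of its bandwidth on headers than a high-rate one.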
With the combination of all of these factors, it has turned out that only in the last decade or so have computer systems and a network infrastructure capable of supporting high-quality and ubiquitous VoIP been in place[58].

There are several ways of estimating conversation quality. One model that has been introduced by the International Telecommunication Union (ITU) is the E-Model[53], which attempts to determine a quality score based on delay, jitter, codec, and several other transmission factors. Evaluating conversation quality is the subject of Section 6, and the E-Model details, as well as an overview of other quality models, can be found there.

3.4.3 Security

In addition to challenges concerning conversation quality, another concern for VoIP relates to security. In contrast to the PSTN, which is largely considered to be a secure, private network, the Internet is open to eavesdropping, impersonation, and denial of service (DoS). A key problem with VoIP security is that many of the protocols and techniques used to protect data networks (see Section 5.1.2) have not been widely used or tested in VoIP networks. Firewalls, for example, are known to create problems with VoIP call-setup protocols[27]. Additionally, adding security inevitably increases packet sizes (and thus bandwidth) and delays, as the encryption algorithms take time to perform. Security issues with VoIP are discussed in more detail in Section 5, and how security affects conversation quality is the subject of Section 8.

4 VoIP Implementation Details

Section 3.1 provided a brief overview of Voice over IP systems. This section expands on that topic in more detail. In particular, it addresses the various protocols used in VoIP systems. These break down into two main categories: protocols used for setting up calls and building a VoIP infrastructure (discussed in Section 4.1), and protocols for transporting the speech data over the Internet (Section 4.2).
While the focus of this thesis is on conversation quality, the call setup and teardown protocols are discussed for completeness.

4.1 Call Setup Protocols

One of the most important needs for VoIP systems was a well-defined and accepted protocol for managing the VoIP infrastructure. Since VoIP devices are associated with an IP address that may change over time (if a laptop user plugs into the Internet from two different locations, for example) there must be a dynamic process of associating a particular user with an IP address. Additionally, there needs to be an established signaling process that allows users to call each other, be connected, receive busy signals, and leave voice-mail; anything users expect from traditional phones should be supported by VoIP. Finally, there should be a well-defined way to interface between VoIP and traditional PSTN phones and networks. Several protocols have been introduced to accomplish these tasks, with the two most important ones being H.323 and Session Initiation Protocol (SIP).

4.1.1 H.323

H.323 was the first widely used standard for VoIP[17]. The first version of H.323 was developed and endorsed by the International Telecommunication Union in 1996, and subsequent versions were released in 1998, 1999, 2000, and 2003[7].

There are four basic components in an H.323 system: a terminal, a gateway, a gatekeeper, and a multipoint control unit. The terminal is the endpoint device that provides a user interface to the system, similar to the voice terminal block in Figure 1. Terminals can be a dedicated piece of hardware known as an IP phone, a software program running on a personal computer (PC), or a normal telephone running through a network adaptor. The gateway is a device that provides a protocol conversion between the H.323 IP network and other types of networks (PSTNs, for example).
This allows the H.323 phones to communicate with devices running different protocols, such as PSTN phones, or VoIP devices running over a non-H.323 protocol (e.g. SIP). Gateways are maintained by a VoIP provider, or by a company to allow their users to interface with other communication networks.

[Figure 2: An H.323 Call Setup]

The third H.323 device, the gatekeeper, provides call control, bandwidth management, and address translation for connections between H.323 endpoints. Calls are initiated through gatekeepers, which are responsible for translating a phone number to an IP address, and also may monitor call times for billing and tracking purposes. Gatekeepers, again, are maintained by a provider or corporation, and users simply route calls through them.

The final device, known as a multipoint control unit, enables three or more terminals or gateways to establish a multipoint conference. Three-way calls are currently outside of the scope of this thesis, and so details of the multipoint control unit are not discussed.

One major drawback of H.323 is that many control messages and protocols are used to initiate and terminate calls. Figure 2 shows the control messages required to set up an H.323 call between two parties. It can be seen that H.323 relies on several other protocols including H.225 (to set up the connection), and H.245 (to determine the capabilities of each user's terminal).
Additionally, connections on two separate ports are required for the setup of a single call.

Another criticism of H.323 is that the protocol is binary-encoded, that is, it encodes information in numbers as opposed to text. This encoding is frustrating for programmers debugging H.323 applications because it is difficult to observe what the problems are. The binary encoding also makes H.323 less extensible, as information must be contained in a very specific part of the packet with a fixed size. For these reasons many people have recently moved away from H.323 and accepted SIP as the standard for VoIP.

4.1.2 Session Initiation Protocol (SIP)

Session Initiation Protocol[72], commonly known as SIP, has recently become the accepted standard of the Internet Engineering Task Force (IETF) for multimedia Internet applications, including VoIP. It was developed by the IETF, a body that oversees many of the Internet's standard protocols, as part of the Internet Multimedia Conferencing Architecture. In addition to being used for VoIP, SIP can also be used for video conferencing, instant messaging, and chat.

In SIP each user is associated with an address known as a uniform resource identifier (URI). The URI (sometimes called a SIP URI, or SIP address) is analogous to the uniform resource locators (URLs) of websites, with the key difference that URIs are meant to be dynamic[22]. URIs are not meant to be tied to a particular physical device, but to a logical entity that might move or exist in multiple places.

A SIP architecture has several similarities to an H.323 architecture. Two similar elements are SIP user agents (UAs) and SIP gateways. A SIP UA is analogous to an H.323 terminal; it is a hardware or software device that allows a user to make VoIP calls using SIP. SIP UAs can take the form of dedicated hardware phones, network-adapted analog phones, or software applications running on an Internet-enabled PC.
SIP gateways are also analogous to H.323 gateways; they provide translation between protocols. Two common gateways are SIP/PSTN gateways, which allow SIP devices to interface with traditional phone systems, and SIP/H.323 gateways, which interface SIP and H.323 devices together. Like H.323 gateways, these are part of the infrastructure backbone, and are generally maintained by service providers or corporations.

SIP architecture also requires a number of SIP servers, devices that handle SIP messages, and each one serves a different function. The three types of SIP servers are proxy servers, redirect servers, and registration servers. They are discussed below.

Proxy Servers

The role of a proxy server is to handle SIP messages on behalf of SIP user agents. Proxy servers usually have access to a database or location service to help determine what to do with the request. For example, a UA might try to initiate a call to a person whose SIP URI is alice@bigcompany.com. The UA sends a SIP INVITE request to a proxy, and the proxy attempts to determine the IP address associated with Alice's URI by querying its database or location service. The proxy server then forwards the request to the location it determines to be best for reaching Alice, or responds with a "not found" error message. Proxies, in general, do not create new messages; they merely forward and respond to requests as they see fit.

Redirect Servers

A redirect server is a SIP server that responds to requests, but never forwards them. Like the proxy server, it usually queries a database or location service to determine an appropriate response, but unlike the proxy it will never forward a message. Redirect servers usually inform the requesting UA of the location of someone, but leave it up to the UA to contact that location on its own.

Registration Servers

A registration server, or registrar, only accepts one type of SIP message, a SIP REGISTER request.
The registrar then keeps state of the users registered to it for a particular domain, allowing the proxy and redirect servers to query it for location information. Registration servers usually perform user authentication, although this isn't required. Authentication ensures that only valid SIP users in the registrar's domain are registered and serviced, and prevents outside users from placing and receiving calls through another domain's servers.

While each server is logically separate from the other two, any or all of the three servers can reside in the same physical location, and many open-source and commercially available SIP servers perform all three functions[27].

One advantage of SIP is that it is a text-encoded protocol. This means that the information is sent over the Internet as human-readable text. Other text-based protocols include Hyper-Text Transfer Protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), which World Wide Web and e-mail systems use, respectively[73, 74]. The advantage of a text-based protocol is that it is much easier to program, analyze, and debug. A simple traffic sniffing tool can be used on the network to easily understand what information is being sent - a task not nearly as straightforward in H.323 systems.

Another advantage is that SIP call setup is quite simple, as can be seen in Figure 3. This figure, in contrast to Figure 2, represents all the signaling required for a complete call (Figure 2 shows only an H.323 setup), and shows a call being routed through a SIP proxy server. SIP calls only require the establishment of a single TCP connection, and SIP can even bypass TCP and run entirely over UDP (a discussion of TCP versus UDP can be found later in this section). In SIP, all of the codec negotiation is done in the INVITE and OK messages, bypassing the need for the series of H.245 "terminal capabilities" messages seen in Figure 2.
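To make the text-encoding point concrete, the following Python sketch parses a hand-written, simplified INVITE request. It is only an illustration: the message omits several headers that a real request carries, all addresses, tags, and identifiers in it are hypothetical, and the helper names parse_sip and offered_payload_types are introduced here rather than taken from any SIP library. The final line of the body is a session description "m=" line listing the RTP payload types (codecs) the caller offers, which is how codec negotiation rides inside the INVITE.

```python
# A simplified INVITE; every address, tag, and identifier is hypothetical.
INVITE = (
    "INVITE sip:alice@bigcompany.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 123.45.67.9:5060;branch=z9hG4bK-example\r\n"
    "From: <sip:bob@example.com>;tag=1234\r\n"
    "To: <sip:alice@bigcompany.com>\r\n"
    "Call-ID: example-call-id@123.45.67.9\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"
    "m=audio 49170 RTP/AVP 0 18\r\n"
)


def parse_sip(message: str):
    """Split a SIP request into (method, uri, headers, body)."""
    head, _, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, uri, headers, body


def offered_payload_types(sdp: str):
    """Return the RTP payload type numbers listed on the audio media line."""
    for line in sdp.splitlines():
        if line.startswith("m=audio"):
            return [int(pt) for pt in line.split()[3:]]
    return []


method, uri, headers, body = parse_sip(INVITE)
print(method, uri, headers["CSeq"], offered_payload_types(body))
```

Because the message is plain text, a few string operations are enough to inspect it, which is the debugging advantage described above.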
4.1.3 SIP versus H.323: A Comparison

Several sources[22, 27] have provided a comparison of SIP and H.323. Both protocols have their own merits, which are discussed here.

[Figure 3: A SIP Call Setup and Takedown]

The difference between H.323 and SIP is largely a product of where they were developed. H.323 was developed by the International Telecommunication Union (ITU), and so it largely resembles other telecommunications protocols. H.323 even reuses parts of ISDN signaling, reflecting its roots in the telecom industry. SIP, on the other hand, was designed by the Internet Engineering Task Force (IETF), and so it resembles other Internet protocols like HTTP and SMTP.

One aspect of SIP that is highly advantageous is its use of a common universal addressing scheme: the SIP URI. This allows a single SIP user to have different SIP-enabled devices at multiple end-points that are all associated with that user. A SIP phone, videoconferencing tool, and instant messaging client could all be tied to a single URI, which is one of the reasons SIP is said to be more scalable than H.323. Additionally, SIP's text-based encoding, when compared to H.323's binary encoding, is an advantage in both clarity and extensibility. New fields can be added incrementally to SIP, a task that is more difficult to accomplish in H.323.

It is unclear when (if ever) a true standard will emerge, but while H.323 arrived and was widely implemented first, SIP's simplicity and versatility have caused it to gain great momentum in the IP telephony world. SIP has now become widely backed by major companies including Microsoft, Cisco, and Nortel, and appears to be the growing trend.
An interesting side note about the two protocols is that as time has progressed, they have grown more and more alike[22]. For example, SIP was designed to support DNS (the Domain Name System) from the start, a feature that was added in later versions of H.323. Conversely, a system similar to H.323's multipoint control units, which allow for 3-way calling, is currently being added to SIP.

Alan Johnston[22] provides a nice summary of the current situation. "While there are some similarities between the protocols in call setup, and some niche markets that H.323 currently dominates, SIP, with its text encoding, presence and instant message extensions, and Internet architecture, is poised to be the signaling and 'rendezvous' protocol of choice for Internet devices in the future." For these reasons, SIP was chosen as the protocol to use in our experiments.

4.1.4 Others

SIP and H.323 are not the only protocols that have been proposed for IP telephony. Others include Megaco, MGCP, and Skinny. While these protocols are used in some areas today, it is believed that either SIP or H.323 (or both) will become the standard for VoIP and so a detailed discussion of these protocols is not provided here. Please refer to the references[78, 79, 67] for more information about these protocols.

4.2 Transport Protocols

While SIP and H.323 perform call setup and takedown signaling, they do not provide any means of actually transmitting information over the Internet, nor do they handle the media streams containing the actual audio data. For this task several existing protocols are employed, including IP, UDP, TCP, and RTP. Both SIP and H.323, for example, must run over either TCP or UDP. Additionally, the audio data of the conversation uses RTP running over UDP for delivery. Finally, all of these protocols run over IP. An explanation of what these different transport protocols provide is found below.
4.2.1 Internet Protocol (IP)

Internet Protocol[68] (IP) is used to route packets across a data network, such as the Internet. It provides connectionless, best-effort packet delivery, meaning that packets might be lost, delayed, arrive out of sequence, or contain errors. Each location on the network is associated with a particular address, known as an IP address. The most widely deployed standard for IP addresses is IP version 4 (IPv4), in which addresses consist of four numbers between 0 and 255, separated by periods. 123.45.67.8 is an example of an IPv4 address, as is 11.0.0.255. A new version of IP, IP version 6 (IPv6), has been developed by the IETF to allow for longer addresses, because the IPv4 address space is rapidly running out. The increased address space provided by IPv6 is necessary if every telephone in the world is to have a unique IP address. That said, IPv6, while an important emerging technology, is largely outside the scope of this document, and we performed all our experiments with the more common IPv4 protocol.

IP addresses are assigned by the Internet Assigned Numbers Authority[22] (IANA), and so they are globally unique. This ensures consistency; packets destined for a particular unicast address should always arrive at one and only one unique machine (see footnote 1). An IP address is a lot like a street address. In the same way that any piece of mail with a specific address on it can be dropped in any mailbox and should find its way to that address, any IP packet with a specific destination IP address, sent to any router on the Internet, should find its way to that computer.

4.2.2 User Datagram Protocol (UDP)

User Datagram Protocol[69] (UDP) provides a small step up in complexity from IP. It is also a connectionless, best-effort protocol, meaning packets can be lost or dropped, but it provides a checksum, allowing errors to be detected. Additionally, UDP specifies not only an IP address, but also a port.
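A UDP endpoint is therefore an (address, port) pair. The short Python sketch below validates such a pair using the standard ipaddress module; the "address:port" text form, the function name parse_endpoint, and the port value used in the example are illustrative assumptions rather than anything prescribed by IP or UDP.

```python
import ipaddress


def parse_endpoint(text: str):
    """Parse 'a.b.c.d:port' into (IPv4Address, port), validating both parts."""
    host, _, port_text = text.rpartition(":")
    # IPv4Address raises a ValueError subclass unless the host is four
    # decimal fields in the range 0-255.
    address = ipaddress.IPv4Address(host)
    port = int(port_text)
    if not 0 <= port <= 65535:  # UDP port numbers are 16 bits wide
        raise ValueError(f"port out of range: {port}")
    return address, port


address, port = parse_endpoint("123.45.67.8:5060")
print(address, port, int(address))  # int() gives the 32-bit form of the address
```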
Certain communications run over specific, well-known ports (for example, HTTP uses port 80, and SIP uses port 5060). In this way, UDP allows multiple logical channels of communication to exist between two computers. In the case of VoIP, SIP could be handling the call setup over port 5060, while the actual data streams are traveling over some other port (usually determined at call setup).

Footnote 1: Actually, this isn't always the case because of dynamically assigned IP addresses and network address translators (NATs), but it can generally be considered true.

4.2.3 Transmission Control Protocol (TCP)

Transmission Control Protocol[70] (TCP) also runs over IP, but provides the reliability and guaranteed delivery of data that IP and UDP do not. It does this by using acknowledgements and sequence numbers. Every TCP packet that is sent has a header indicating which bytes of the overall stream of data the packet represents. This header allows the receiver to recognize when information is missing and ask the sender to resend. It also allows the receiver to reconstruct information in the correct order, even though packets may arrive out of order.

TCP is a good transport protocol to use when guaranteed and reliable delivery is required. Most traffic, including World Wide Web, e-mail, and File Transfer Protocol (FTP) traffic, uses TCP. A problem with TCP, however, is that it can make things slow. Retransmission takes time (usually more than double the round-trip time (RTT) between sender and receiver) and in live streaming applications (like VoIP) a delay as small as a second can severely detract from conversation quality. Additionally, TCP has problems with long latencies, since it drastically reduces the transmission rate when packets are lost. For these reasons, TCP is generally not used for media streams, although it can be used for call setup. As mentioned previously, SIP and H.323 currently support delivery over both TCP and UDP.
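The way byte-position information lets a TCP receiver reorder data and detect gaps can be sketched in a few lines of Python. This is a deliberately simplified model, not TCP itself: segments are represented as hypothetical (offset, data) pairs that fit within a stream of known total length, and the function name reassemble is introduced here for illustration. Real TCP adds randomized initial sequence numbers, cumulative acknowledgements, timers, and congestion control.

```python
def reassemble(segments, total_length):
    """Rebuild a byte stream from (offset, data) segments arriving in any order.

    Returns (stream, missing), where missing lists the (start, end) byte
    ranges that the receiver would ask the sender to retransmit.
    """
    buffer = bytearray(total_length)
    received = [False] * total_length
    for offset, data in segments:
        buffer[offset:offset + len(data)] = data
        for i in range(offset, offset + len(data)):
            received[i] = True

    # Scan for runs of bytes that never arrived.
    missing, start = [], None
    for i, ok in enumerate(received):
        if not ok and start is None:
            start = i
        elif ok and start is not None:
            missing.append((start, i))
            start = None
    if start is not None:
        missing.append((start, total_length))
    return bytes(buffer), missing


# The second half of the stream arrives before the first half.
stream, missing = reassemble([(6, b"world"), (0, b"hello ")], 11)
print(stream, missing)
```

Waiting for the missing ranges to be resent is exactly the step that costs at least a round trip, which is why this reliability is a poor fit for live voice.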
4.2.4 Real-time Transport Protocol (RTP)

VoIP and other media streaming applications are a unique type of traffic with very specific requirements. More than any other service these applications are extremely time-sensitive. Delays larger than a fraction of a second are simply unacceptable. Fortunately, these applications do not need guaranteed delivery of every packet. Audio and video applications can usually interpolate a value for a missing data point and end-users likely won't notice a loss in quality. A final requirement of these applications is that ordering should be preserved. Users don't want to hear words out of place in a conversation. Moreover, many voice and video codecs use differences rather than absolute values to encode data, and a reordering of information can make it completely useless. To summarize, VoIP requires a quick delivery of data along with ordering information, but doesn't require guaranteed delivery. The Real-time Transport Protocol[71] (RTP) was developed to meet these requirements.

RTP runs over UDP, and is therefore a connectionless, best-effort protocol. This ensures the fastest possible delivery over the Internet, and also means that some packets will be dropped or arrive out of order. In addition to relying on the UDP checksum, which alerts the receiver if the data in the packet contains errors, RTP provides several multimedia-specific features. One of these features is sequencing. As mentioned above, sequencing is important in streaming applications. Typical VoIP applications actually hold on to packets for a certain amount of time in a buffer (known as the jitter buffer, and mentioned earlier in Section 3.4.2) before delivering them to the end-user. This allows some packets that arrive late or out of order to be correctly placed in the stream, and has been shown to drastically increase conversation quality[50]. The sequencing feature of RTP allows this to be accomplished. In addition to sequencing, RTP also provides other valuable media-specific features.
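The role of the sequence number in a jitter buffer can be illustrated with the following Python sketch. It packs and unpacks the 12-byte fixed RTP header (the field layout is taken from the RTP specification, not from the text above) and then orders a batch of buffered packets, reporting any gaps. The function names, the SSRC value, the timestamp step of 160, and the one-byte payloads are hypothetical choices made for illustration; a real jitter buffer also handles playout timing and 16-bit sequence number wrap-around, which this toy version ignores.

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # 12-byte fixed header, network byte order


def build_rtp_packet(seq, timestamp, payload, payload_type=0, ssrc=0x1234):
    """Build a minimal RTP packet: version 2, no padding, extension, or CSRCs."""
    return RTP_HEADER.pack(2 << 6, payload_type & 0x7F, seq, timestamp, ssrc) + payload


def parse_rtp_header(packet):
    """Decode the fixed header; assumes no CSRC entries or header extension."""
    first, second, seq, timestamp, ssrc = RTP_HEADER.unpack(packet[:12])
    return {
        "version": first >> 6,
        "marker": second >> 7,
        "payload_type": second & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "payload": packet[12:],
    }


def play_out(buffered_packets):
    """Toy jitter buffer: order buffered packets and report missing ones."""
    parsed = sorted((parse_rtp_header(p) for p in buffered_packets),
                    key=lambda h: h["sequence"])
    if not parsed:
        return [], []
    present = {h["sequence"] for h in parsed}
    first, last = parsed[0]["sequence"], parsed[-1]["sequence"]
    missing = [s for s in range(first, last + 1) if s not in present]
    return [h["payload"] for h in parsed], missing


# Packets 3, 1, 2, 5 arrive out of order; packet 4 never arrives.
arrivals = [build_rtp_packet(s, 160 * s, bytes([s])) for s in (3, 1, 2, 5)]
payloads, missing = play_out(arrivals)
print(payloads, missing)
```

A receiver would play the ordered payloads and conceal the missing one (for example by interpolation) rather than wait for a retransmission.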
RTP has a built-in field to represent the codec used, according to a defined list of standard codecs that is kept by the IANA[30]. It also provides a timestamp recording when the packet was created (used by RTCP, see below), and a marker bit, which is usually used to signal the start of a new audio stream, or a special type of packet.

RTP runs together with RTCP (the RTP Control Protocol). RTCP is a protocol in which the two endpoints communicate things like delay, packet loss, and jitter to each other over the network. Some applications may use RTCP to renegotiate a media connection depending on the integrity of the channel. For example, if limited bandwidth seems to be creating excessive delays and packet losses, a new codec can be used that provides a lower voice quality, but requires less bandwidth.

5 VoIP and Security

This section discusses the role of security in VoIP systems. Section 5.1 provides an overview of Internet security in the context of VoIP systems. Following that, Section 5.2 moves on to issues specific to VoIP systems. Since the main focus of this thesis is the effect of security on conversation quality, this section is only a brief summary of Internet security and how it applies to VoIP. The reader is referred to the references for more information on the topic of Internet security.

5.1 Internet Security Overview

In the early development of the Internet, security was hardly in anyone's thoughts. The Internet at that stage was the ARPANET, which was a network of computers on which everyone knew everyone else who was connected. Thus, the need for security was overlooked; at that time the entire ARPANET was basically a private network[28]. However, as the ARPANET evolved into the public Internet, the need for security became recognized. As a result security usually became a layer that ran over the existing non-secure protocols, such as TCP and UDP.
This section first discusses exactly what is meant by Internet security, and then moves on to the details of how security is generally accomplished today.

5.1.1 Definition of Security

In general, Internet security can be divided into four central issues. These issues are: confidentiality (privacy), integrity, availability, and non-repudiation.

Confidentiality or Privacy

Confidentiality means keeping your information private. Confidentiality can be extremely important for military, business, or personal reasons. A user should be able to control who has access to information he wants to keep private. Confidentiality in the context of VoIP means that a person who isn't involved in the conversation should not be able to determine who is talking to whom, or what is being said.

Integrity

Integrity implies that information has not been modified or concealed. It also means that the source of information cannot be changed. Integrity basically means that whatever information you receive is guaranteed to be what you think it is. A malicious attacker cannot modify the information without you knowing, and can't pretend to be someone he is not. Integrity in VoIP systems implies that the person you are talking to is who they claim to be, and their words cannot be altered in transit. Integrity is also important in communicating with VoIP servers that may route or track your calls. A second, more subtle aspect of integrity is that it should not be possible to replay the conversation at a later date as though it were live. This aspect is known as replay protection.

Availability

Availability means that an attacker should not be able to prevent a person from using a system. A compromise of availability is usually accomplished through a denial of service (DoS) attack. DoS attacks can be specifically targeted at a single point or can take down an entire system. For VoIP systems to maintain availability it should be impossible for attackers to prevent a single person from using VoIP.
A less stringent definition of availability implies that attackers should not be able to take down crucial components of the VoIP architecture (e.g. SIP Servers). Defending availability is the most difficult security task to accomplish because of the large number, distributed location, and shared functionality of the network devices that participate in a single call. Additionally, any denial of service attack on the Internet could potentially affect VoIP users. One example of such an attack occurred in 2001, when the "Code Red" worm took down significant portions of the Internet for extended periods of time, costing industry and government an estimated $2.6 billion in damages[12].

Non-Repudiation

Non-repudiation means preventing the sender of a message from denying it was transmitted by them. Non-repudiation prevents a user who promised something from denying he made the promise. In the context of VoIP, non-repudiation means that a person shouldn't be able to deny making a call he has made; there should be some proof that would contradict this claim.

5.1.2 Security Details

The four elements of security are accomplished through various mechanisms. Two important mechanisms are authentication and encryption. Both of these mechanisms rely on the use of secret keys, which are usually large numbers known only to a particular user or group of users. A password is another common type of secret key.

In this section, we are not discussing the mathematics of security per se, but more the general principles involved with shared secret and public key cryptography. Then we will discuss authentication and encryption in the context of these mechanisms. Finally, we'll go over a few specific implementations of security. This section is not meant to be a comprehensive guide to Internet Security, but merely a review to provide enough information necessary to understand security in the context of this thesis. Reference [24] is an excellent source of comprehensive information about
Reference [24] is an excellent source of comprehensive information about cryptography systems.

Shared Secret vs. Public Key

As mentioned previously, there are two main cryptographic mechanisms, shared-secret and public key. Both of these mechanisms rely on the use of large, "secret" numbers, but they differ slightly in how they work.

In shared-secret, or symmetric systems, all parties must know the value of the key. The key becomes the "shared secret", and anyone who knows the key is assumed to be a trusted party. One of the challenges that faces shared-secret systems is key distribution. Everyone must be able to determine the value of the secret key, but this must be done in a way that prevents non-trusted parties from also learning the secret. Because of this, shared keys have to be placed manually in systems.

In contrast, public key schemes, such as the RSA system co-developed by Ron Rivest of MIT, use a single secret per user, along with a "public key" for that user that is viewable by anyone[26]. This removes the need for key exchange, but introduces a different problem, as a database or record of all public keys must be maintained. Another problem with public key encryption is that it is computationally more expensive than shared secret encryption. Many applications get around this problem by using public key encryption to authenticate a Diffie-Hellman key exchange. Diffie-Hellman key exchange[25] was introduced in 1976 as a way to use two private individual keys to arrive at a single shared secret. Diffie-Hellman key exchange, on its own, does not provide authentication, but by combining Diffie-Hellman with public-key authentication, a shared secret can be securely negotiated. This negotiated key can then be used for the remainder of the communication, allowing it to be more efficient.

To more closely understand the difference between these two mechanisms, we will look at them in the context of authentication and encryption schemes.
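The way a Diffie-Hellman exchange lets two parties arrive at a single shared secret from two private keys can be illustrated with a minimal sketch. The modulus `P`, base `G`, and the function names below are assumptions chosen for illustration; the parameters are far too small for real use, and the sketch omits the public-key authentication step described above.

```python
import secrets

# Toy public parameters (illustrative assumptions, NOT secure sizes).
P = 4294967291  # a small prime modulus (2**32 - 5)
G = 5           # public base, known to everyone


def make_keypair():
    """Return (private, public), where public = G**private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_secret(own_private, peer_public):
    """Combine our private value with the peer's public value."""
    return pow(peer_public, own_private, P)


# Each party keeps its private value and sends only the public value.
alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

alice_secret = shared_secret(alice_private, bob_public)
bob_secret = shared_secret(bob_private, alice_public)
print(alice_secret == bob_secret)
```

Both sides compute the same value because (G^a)^b and (G^b)^a are equal modulo P, while an eavesdropper sees only the two public values.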
Authentication

Authentication is a technique that allows a user to ensure that certain messages are authentic. Authentication can be used to provide both integrity and non-repudiation of data. In addition, password-protecting access to certain files can ensure their confidentiality, although if the files are transmitted over the Internet additional measures must also be taken.

One method of authentication is digital signatures. Digital signatures use a mathematical function that is dependent on both the data being sent and the secret key of the sender. This function is called a one-way function or hash function, and has the special property that it is easy to compute in one direction, but computationally infeasible to compute the other way.

In a shared-secret system, when a piece of data is transmitted, the hash function is used with a key to generate a digital signature that is attached to the data. The receiver of the data can then compute the signature if he knows the key, and verify that it matches. This makes it nearly impossible for a malicious attacker to change the message, because it would change the signature in an unpredictable way. Of course, if the attacker knew the key it would be possible to create a new message and signature pair, but most security systems rely on the assumption that the key is not known.

In a public key system, authentication works much the same way. The signature can be created only with the private key, known only to the signer. Anyone, however, can use the signer's public key, with a different mathematical function, to verify that the signature was created with the original data and private key.

Authentication proves that the message has not been tampered with and that the sender of the message knew the key. This ensures that the data's integrity is intact. By authenticating a sequence number, replay attacks can also be prevented. Additionally, in a public-key system, the sender cannot deny sending the message, ensuring non-repudiation.
There is, however, no non-repudiation for a shared key system, as any party involved in the communication is capable of sending and receiving all messages with the single key.

There are various algorithms used to compute hash functions. Two of the more common ones are MD5, which was developed by MIT Professor Ron Rivest[61], and SHA-1 (Secure Hash Algorithm, Version 1)[62], which was part of the U.S. Government's Capstone project[63], a project that attempted to develop a set of cryptographic standards.

Encryption

Encryption is the technique that is commonly used to provide confidentiality, particularly when data is in transit over a non-trusted network, such as the Internet. Typical encryption mechanisms use a one-way cryptographic function to convert readable data into seemingly random values, again using a secret key. In a shared-secret system, the data can then only be decrypted by someone who knows the secret key, using another function. In public key systems, data can be encrypted with the public key, and only someone who knows the secret key is able to decrypt it. By encrypting information before sending it over a network and decrypting it on the receiving side, the data is kept private from anyone on the network who doesn't know the key, thus ensuring confidentiality.

There are also various encryption algorithms employed in different areas. In 2000, the National Institute of Standards and Technology (NIST) held a competition to determine the new encryption standard. Ultimately an algorithm known as Rijndael became the Advanced Encryption Standard (AES, as it is also now known), beating out other algorithms including Ron Rivest's RC6, MARS, Twofish, and Serpent[52]. Still, AES is not the only encryption algorithm used, and the older 3DES and RC5 are also widely employed.

5.1.3 Security Implementation

There are several different implementations of security protocols.
Different algorithms can be used to authenticate and encrypt data, and different protocols can be used for transporting encrypted data. Technically, security can be applied at any of the four network layers, which in order of increasing complexity are[23]:

1. Data Link: The physical link between individual machines.
2. Network: The underlying addressing scheme (IP)
3. Transport: The establishment of a session (UDP, TCP)
4. Application: The application running over the network (FTP, HTTP, SIP, etc.)

Encryption can be done at each of these layers. For example, SHTTP and SRTP are two protocols designed at the application layer to add security to HTTP and RTP, respectively. IP Security, or IPSec[75], on the other hand, runs at the network layer and encrypts any network traffic.

The National Institute of Standards and Technology (NIST) investigated VoIP security, and recommended the use of IPSec for secure VoIP systems[64]. Thus, we will focus our discussion of security implementations on IPSec. Except where noted, all of the material in this section came from Sheila Frankel's book Demystifying the IPSec Puzzle[39], which is an excellent source of detailed information regarding IPSec.

Authentication in IPSec

Authentication is done in IPSec either with an authentication header (AH), or authentication of the Encapsulating Security Payload (ESP, see next section). The AH is meant to serve three specific purposes[39]:

* Connectionless Integrity: the received message is what was sent, and no tampering has occurred.
* Data Origin Authentication: a guarantee that the message was sent by the apparent originator of the message, as opposed to someone masquerading as that originator.
* Replay Protection (optional): the assurance that the same message isn't delivered multiple times, and messages aren't delivered grossly out of order.
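The three purposes listed above can be sketched with a keyed hash over a sequence number and payload. This is a simplified illustration and not the AH wire format: the packet layout, the `sign_packet` and `Receiver` names, the full 20-byte HMAC-SHA-1 tag, and the strictly increasing counter (used here in place of a sliding anti-replay window) are all assumptions made for the example.

```python
import hashlib
import hmac

TAG_LEN = 20  # length of a full SHA-1 digest in bytes


def sign_packet(key: bytes, seq: int, payload: bytes) -> bytes:
    """Build a toy packet: 4-byte sequence number, payload, keyed tag."""
    header = seq.to_bytes(4, "big")
    tag = hmac.new(key, header + payload, hashlib.sha1).digest()
    return header + payload + tag


class Receiver:
    """Accepts packets only if the tag verifies and the sequence is new."""

    def __init__(self, key: bytes):
        self.key = key
        self.highest_seq = 0

    def accept(self, packet: bytes):
        header = packet[:4]
        payload = packet[4:-TAG_LEN]
        tag = packet[-TAG_LEN:]
        expected = hmac.new(self.key, header + payload, hashlib.sha1).digest()
        if not hmac.compare_digest(tag, expected):
            return None  # integrity or origin check failed
        seq = int.from_bytes(header, "big")
        if seq <= self.highest_seq:
            return None  # replayed or stale packet
        self.highest_seq = seq
        return payload


key = b"k" * 16  # hypothetical shared secret
receiver = Receiver(key)
packet = sign_packet(key, 1, b"voice")
first = receiver.accept(packet)   # payload is returned
second = receiver.accept(packet)  # same packet again is rejected
```

Because the sequence number is covered by the tag, an attacker who lacks the key cannot renumber a captured packet to make it look fresh.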
The AH consists of six fields, but the most important ones are:

* Security Parameters Index (SPI): a number agreed upon by the communicating parties that represents this particular security association. The SPI is used to keep track of which keys are being used for the authentication and encryption protocols.
* Sequence Number Field: a number that increases incrementally with each packet, which is meant to provide replay protection.
* Authentication Data: the result of the one-way hash function of the key and the rest of the packet. This is equivalent to the digital signature above, and the method of calculating it is defined by the security association.

A nice feature of IPSec is that it is versatile. For example, many cryptographic hash functions exist, including MD5, SHA-1, and SHA-2, and IPSec allows users to customize which algorithm and keys they will use on a connection-by-connection basis. This is accomplished during key exchange negotiation, and is based on a security policy database (SPD) that is referenced by the SPI and IP address associated with the connection.

Encryption in IPSec

Encryption occurs in IPSec via the use of the Encapsulating Security Payload (ESP). The ESP, in addition to providing confidentiality, can provide all the functions of the AH. For this reason, many people argue that the AH should be removed from the RFC for IPSec entirely, and the ESP should perform both authentication and encryption. Although authentication is optional, it is highly recommended; otherwise the integrity of the message cannot be guaranteed. Some implementations of IPSec don't support the AH and rely on the ESP with a "null" encryption algorithm (that does not encrypt data) for authentication alone[40].

In addition to authentication, the two additional functions provided by the ESP are:

* Confidentiality: a guarantee that even if someone sees the message, the contents are not understandable except by the authorized recipient.
* Traffic Analysis Protection (optional): the assurance that an eavesdropper cannot determine who is communicating with whom, or the frequency and volume of communications.

The ESP is made up of seven fields, including the SPI, sequence number, and (optional, but recommended) authentication data fields also found in the AH. The important fields unique to the ESP are:

* Payload Data Field: the encrypted contents of the packet. The padding and padding length fields (see below) are also encrypted and contained in this field.
* Padding: additional bits that are not used but increase the length of the packet. This allows certain cryptographic algorithms that require fixed block sizes to be performed on the packet, and can also provide traffic analysis protection.
* Padding Length: total number of padding bytes, so they can be ignored by the receiver.

The ESP offers similar versatility in terms of encryption algorithms, and many implementations of IPSec offer a choice among AES, DES, 3DES, Blowfish, and several other common encryption algorithms. The keys, key sizes, and algorithms used are negotiated beforehand and interpreted through the use of the SPI.

The ESP and AH can be used in conjunction to provide authentication and encryption. Alternatively, the ESP alone can perform both of these functions, making use of the AH unnecessary.

Transport vs. Tunnel Mode

IPSec offers two different modes of encryption, called transport and tunnel mode. In transport mode, the two communicating users encrypt the contents of the packet, but leave the IP headers intact. Since packets are routed based solely on IP headers, this allows the encrypted part of the packet to traverse the network normally as long as there is no need for intermediate routers to inspect packet contents (e.g. the TCP header). In this case the entire encrypted packet arrives at its destination, where it can be decrypted by the other party.
Because the IP headers are sent in the clear, anyone can know who the communicating parties are, which can represent a breach of confidentiality.

Tunnel mode was designed to fix this problem. In tunnel mode two machines, typically known as VPN gateways, set up an encrypted "tunnel" between them. When a user wants to communicate securely with someone, he sends his packet to his gateway, where the whole packet is encrypted. Then a new IP header with the address of the other party's own gateway is put on the packet, and it is sent to the other party's gateway. Thus, the two gateways form a secure tunnel that allows any users behind them to communicate with each other. Observers in between the gateways cannot determine who is talking to whom, but merely that someone from one gateway's network is talking to someone else on the other gateway's network. Communication between the user and his own gateway can also be encrypted with a different security association, possibly using transport mode, if the internal network is not trusted.

Implementations

There are many implementations of IPSec, including FreeS/WAN[40], Openswan[41], StrongSwan[42], and an IPSec version built into Linux kernels 2.6 and higher[43]. FreeS/WAN (which was named for a free implementation developed out of the Secure Wide Area Network (S/WAN) project) was the first widely adopted open-source implementation of IPSec. Unfortunately, FreeS/WAN's development stopped in April of 2003, and as a result it lacks support for several algorithms including AES. Several projects branched from FreeS/WAN, including StrongSwan and Openswan.

These implementations of IPSec are done in software, but IPSec can also be implemented in hardware, and many times this is done for performance gains in the encryption algorithms. Additionally, cards exist that offload encryption algorithm computation, but must be combined with IPSec software to fully implement the protocol.
In Section 9 we discuss the IPSec implementation we chose for our experiments, Openswan, and why that choice was made.

5.2 Security Applied to VoIP

Now we will look at security in the context of VoIP. We will start with the four characteristics of security described above, and then move on to some issues that are VoIP specific. The purpose of this section is to provide an overview of some of the security challenges associated with VoIP networks. Reference [64] provides an excellent discussion of VoIP security considerations in much more detail.

5.2.1 Confidentiality

There are various levels of confidentiality that can be accomplished in VoIP systems. The most sensitive information is what is being said in the conversation. To protect this information, the data streams must be encrypted. There are various ways of encrypting VoIP data streams. One way is to use a broad layer of encryption, such as IPSec, for encrypting all communications. Secure RTP (SRTP)[77] can also be used, and has been shown to be slightly more efficient than IPSec. SRTP, however, is limited to the RTP data streams[64], so call setup requires the use of a separate security protocol.

A second level of confidentiality is necessary to mask usage patterns. Encrypting data streams alone still allows a malicious eavesdropper to see who is calling whom, what codecs they are using, how long they talked, and other usage patterns. To prevent this information from being seen, a stronger layer of security should be used. In particular the call-setup messages (typically either SIP or H.323) need to be encrypted as well. This usually requires a security relationship to exist between each of the users and the server that is connecting them. The calling party must establish a secure connection to the server to tell it who to call, then the server must establish a second secure connection with the person being called to relay the message and set up the call.
This will allow the call to be set up in a private manner. (Footnote 2: Additionally, a third security association must exist between the caller and callee if their communication is to be encrypted.)

A final level of confidentiality can be obtained to prevent traffic analysis. Traffic analysis is a technique that can be used to determine what (encrypted) VoIP conversations look like on a network. Then, even if an eavesdropper couldn't see who is talking or what is being said, he could at least see that a phone call was occurring between two subnets. This can only be prevented by masking VoIP data by sending additional data all the time. Then an eavesdropper wouldn't be able to distinguish a conversation from the background noise of constant data. This level of confidentiality is quite impractical, and for the most part unnecessary in all but the most secretive VoIP applications.

5.2.2 Integrity and Non-Repudiation

There are many ways in which integrity is important in VoIP communications. The most obvious application is preventing a person from impersonating someone else. The simplest way to prevent impersonations is to authenticate users at every step of the conversation, including authenticating the server that sets up the call. This assures that the person calling you is who they claim to be.

Authentication is also important for server administrators. In SIP, for example, the registration servers should only allow users to register if they "belong" to the group of users known by that particular server. Typically users are required to provide a password when registering and routing calls through a server.

One known problem with SIP is the ability to fake, or spoof, another user's phone number. Studies have shown that it is currently easy to fake a PSTN phone number using SIP[65]. This represents a violation of integrity, as the person you are talking to might not be who they claim to be.
This threat can sometimes be mitigated by using voice authentication, but biometric authentication is not currently widely implemented.

Most of the issues of integrity and non-repudiation can be solved by applying strict authentication rules at every step of the communication. If users and servers are required to authenticate each other, integrity and non-repudiation can be maintained.

5.2.3 Availability

Availability is typically the hardest security characteristic to maintain. Many factors, both malicious and accidental, can result in a loss of service. A power failure, for example, could take down SIP servers and prevent users from making calls.

Several known DoS attacks already exist against VoIP systems. One very simple attack is to send a large number of INVITE messages to a particular server or user. As the user gets flooded with requests to initiate calls, he cannot tell the difference between real ones and malicious ones. Combating attacks like these requires complex rules on when and from whom to accept calls[22].

One crucial aspect of telephone communication is the 911 emergency line. Currently, VoIP systems do not have a universal standard for emergency response, and if they are ever to replace PSTN phones, one must be set up and be made available all the time[64].

5.2.4 Security and Quality of Service

A final important aspect of security is the role it plays in quality of service. Securing communication has several effects, including interfering with call setup messages, and affecting conversation quality. For conversation quality there are two aspects: encryption overhead, and traffic prioritization.

It is known that cryptographic algorithms can be computationally intense, and can increase network bandwidth significantly. In an environment where bandwidth is limited, such as wireless links in airborne networks, this can be a serious problem.
Additionally, the computational delays introduced by encryption could significantly decrease the perceived quality of the conversation.

The second way encryption could hinder conversation quality is its influence on traffic prioritization. Traffic prioritization is a technique employed in routers to give delay-sensitive traffic, such as VoIP, a higher priority than traffic whose delivery is not as time-constrained (e.g. e-mail and web browsing). This can greatly improve call quality in bandwidth-constrained situations. The use of encryption, however, hides the information that allows the router to prioritize traffic. Traffic prioritization is a topic outside of the scope of this thesis and left for future work (see Section 11). The impact of the encryption overhead in non-prioritized VoIP systems, however, is a large part of the rest of this thesis. This subject is explored in Sections 8 and 9.

6 Measuring Conversation Quality

When using a service such as a telephone or VoIP system, being able to evaluate the quality of the service is extremely important. People simply will not use a telephone that is unreliable or difficult to hear. Secure VoIP systems must not only be reliable, but they should also provide a level of conversation quality equal to the traditional phone system to which people have grown accustomed. For this reason it is important to be able to measure conversation quality. There are various proposed methods of evaluating conversation quality, with the most accepted measure being the Mean Opinion Score (MOS). Details of the MOS and two computational models, the E-Model and the Perceptual Evaluation of Speech Quality (PESQ), are discussed in this section.

6.1 Mean Opinion Score (MOS)

The Mean Opinion Score, or MOS, is the most widely employed measure of conversation quality. It is summarized in recommendation P.800 of the International Telecommunication Union's Telecommunication Standardization Sector (ITU-T).
The MOS is what it sounds like, an average of people's opinions. MOS is a subjective quality metric that relies on human beings' perception of quality. Generally a particular sound file or channel will be evaluated by several people, and the resulting average of their scores becomes the MOS. MOS scores range from 1 to 5, with a descriptive quality associated with each score that is summarized in Table 2. An MOS of 4 or better is considered toll quality (equivalent to PSTN phones), while 3.6 or higher is called "acceptable" for toll quality. A setup with an MOS of less than 3.6 is only considered usable if it offers some benefit over traditional phones.

Score   Quality of Speech
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

Table 2: The MOS Quality Scale

[Figure 4: Evaluating VoIP with the PESQ Model]

There are various tradeoffs associated with the MOS, or any subjective score as a metric for evaluating conversation quality. On one hand, the MOS offers the most accurate assessment of true conversation quality. In an application like VoIP or telephony, ultimately a person will be the one who benefits or suffers from variations in conversation quality. Thus, it makes sense that the metric for evaluating quality should be a subjective judgment done by people.

On the other hand, using a subjective score to evaluate quality introduces many logistical problems. One issue is that it becomes quite impractical to perform multiple experiments if each one must be done with several users, each of whom assesses the quality separately so that it can be averaged. Additionally, different people have different perceptions of quality. A person who is accustomed to talking on a cellular phone might evaluate a call higher than one who always uses a land-line. While averaging allows some of this variation to be mitigated, it is difficult to get enough people's opinions to counteract the variation.
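The averaging step, and the way rater-to-rater variation limits its precision, can be illustrated with a short sketch. The ratings are hypothetical values invented for the example, the function names are assumptions, and the standard error is only one simple way to express the spread among raters; the 4.0 and 3.6 thresholds are the ones quoted above.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, standard error) for integer ratings on the 1-5 scale."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be non-empty integers from 1 to 5")
    mos = mean(ratings)
    # The standard error shrinks only with the square root of the rater count.
    if len(ratings) > 1:
        stderr = stdev(ratings) / sqrt(len(ratings))
    else:
        stderr = float("nan")
    return mos, stderr


def classify(mos):
    """Map an MOS onto the categories described in the text."""
    if mos >= 4.0:
        return "toll quality"
    if mos >= 3.6:
        return "acceptable for toll quality"
    return "below acceptable"


ratings = [4, 4, 3, 5, 4]  # hypothetical scores from five listeners
mos, stderr = mean_opinion_score(ratings)
label = classify(mos)
```

Because the standard error falls only as the square root of the number of raters, halving the uncertainty requires roughly four times as many listeners, which is one way to see why subjective testing becomes impractical for repeated experiments.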
To avoid the reliance on subjective human evaluation, two additional ways to evaluate quality were introduced by the ITU. These methods are discussed next.

6.2 Perceptual Evaluation of Speech Quality (PESQ)

The PESQ[54] is described in ITU-T recommendation P.862, and is a perceptual model. This means that it evaluates quality by comparing the sent and received speech signals in the psychoacoustic (audio) domain. The PESQ's quality rating is obtained by characterizing and determining the amount of distortion between the signals. This is shown in the context of a VoIP call in Figure 4[56]. Note that the analysis occurs outside the channel of communication in this context.

While PESQ offers a reasonable method of determining the loss in sound quality due to distortion of a signal, it does have some significant drawbacks. One problem is that the PESQ offers no way to evaluate conversation quality loss due to latency, as only the two endpoint signals are compared. Thus, a perfect channel with a very long latency would receive a high PESQ score, even though users would find its quality unacceptable. Since one focus of our study was high-latency links, the PESQ model was inappropriate. Additionally, the PESQ reveals no information about the sources of degradation, only that they occur.

Other perceptual models have been introduced that are similar to PESQ. The Perceptual Speech Quality Measurement (PSQM)[55] was introduced by the ITU before the PESQ, and hence the PESQ is considered an improvement. Additionally, an adaptive quality model[57] has been proposed to change its score over the duration of a conversation. The reader is referred to the references for more information about these models.

6.3 E-Model

The final quality model we will look at is the E-Model[53], which is described in ITU-T recommendation G.107. The E-Model is another objective model, calculated based on qualities of the signal and of the channel used for communication.
There are a few advantages to the E-Model over perceptual models. Unlike PESQ, the E-Model takes absolute delays into account when determining voice quality. This is impossible to do in a perceptual model that only compares end-point signals. Additionally, the E-Model is broken up into different terms that each represent a different aspect of the connection that contributes to conversation quality. This makes it easier to determine what is causing a particular communication channel to be deficient. Finally, the E-Model was designed with the knowledge of VoIP systems in mind, and includes terms for codec and packet loss. This makes it more easily applied to VoIP systems than the other models.

The E-Model defines a Quality Rating (R) that varies from 0 to 100. A formula for translating between R and MOS is given in [53] and summarized in Figure 5. The E-Model relies on an assumption (based on empirical evidence in psychophysical research) that the psychological effect of uncorrelated sources of impairments is additive. The resulting quality rating, R, is thus calculated by determining the signal to noise ratio, $R_0$, and subtracting a set of impairments:

$$R = R_0 - I_s - I_d - I_e + A$$

Below is a brief description of each term.

* $R_0$: Represents the best possible quality for a given signal to noise ratio.
* $I_s$: Simultaneous impairment factor due to loud signals, quantizing distortion, and sidetone effects.
* $I_d$: Delay impairment factor. This represents quality loss due to end-to-end delay and echoes.
* $I_e$: Equipment impairment factor. This takes into account signal distortion (due to a low-rate codec) and packet loss (in VoIP systems).
* $A$: Advantage factor. This attempts to model the fact that users will allow a degradation in quality for other advantages.

[Figure 5: R-Values and MOS for Varying Conversation Quality[58]. The figure maps R (0 to 100) to MOS and to user satisfaction bands ranging from "very satisfied" to "not recommended".]

The E-Model forms a basis for comparing the conversation quality provided by a communication network. Its use in VoIP systems will be discussed in more detail in Section 8.

7 VoIP Conversation Modeling

To repeatably compare VoIP systems, we will develop a model based on the widely used conversation model proposed by Paul T. Brady[31]. The remaining sections motivate our approach, describe the existing model, and discuss our modifications to it for VoIP applications.

7.1 Motivation

For the purpose of evaluating the impact of network and security conditions on conversation quality, a program that could generate realistic VoIP signaling and data traffic was desired. There are currently several commercial VoIP traffic generators[1, 2], but these systems are built primarily for determining network capacity. As a result, the intricacies of actual conversations are generally ignored and simple "always on" or "on-off" (see Section 7.2.2) models are used[37]. To develop a more accurate model of human conversations, speaker interactions and Voice Activity Detection systems used by commercial VoIP phones (see below) must be represented.

Most VoIP phones use a technique known as Voice Activity Detection (VAD) to only send data when a person is talking, and conserve bandwidth in periods of silence. The VAD systems that we studied were not perfect, however, and usually sent data for a few extra seconds in periods of silence. The details of the behavior of the Voice Activity Detectors, and how to model their behavior, are discussed in Section 7.3.

The need for different speaker models arose in evaluating the improvements in conversation quality gained through the use of particular protocols. In particular, push-to-talk systems are widely employed to improve conversation quality over channels with high latency.
We sought to incorporate this benefit into our improved quality assessment tool. Our efforts in this direction are discussed in Section 8.3.

A second aspect of security was whether anything could be learned from the (encrypted) traffic streams via pattern profiling. For example, in a military scenario, if it could be observed that one user talked much more frequently than another, it might be inferred that the talkative user was the commanding officer. In this way hierarchical relationships could be determined without being able to decrypt the packet streams. Pattern profiling lies outside the scope of this thesis, but the model we developed could be used to investigate this further.

For these reasons it is important to create an accurate model of VoIP traffic, and, by altering the parameters of the model, develop different user and conversation types that could be tested separately.

7.2 Brady's Model for Two-Way Speech

Accurately modeling a two-party conversation is complex. A conversation between two individuals can be modeled as a series of talkspurts and silences by each person. These periods can overlap: at any given time both people could be silent, either one could be talking while the other is silent, or they could both be talking. The distribution and duration of these events in PSTN systems is the subject of significant research[31, 32] by Paul Brady.

Brady's model has been the primary reference for conversation modeling for 38 years, but was developed before the advent of the Internet and VoIP technology. We have adapted the model to VoIP applications by adjusting the model parameters. In the rest of this section we'll first describe Brady's model via a series of increasingly complex models. We'll then describe how the model can be adapted to more closely resemble what is observed in VoIP systems.
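The overlap of talkspurts and silences described above can be made concrete with a small sketch that labels each timestep of a two-party conversation. The activity timelines are hypothetical values invented for the example, the state labels are chosen for illustration, and the helper names are assumptions; this is not Brady's model itself, only the bookkeeping that such a model must reproduce.

```python
from collections import Counter

# Illustrative labels for the four ways two speakers' activity can overlap.
LABELS = {
    (False, False): "mutual silence",
    (True, False): "A talking alone",
    (False, True): "B talking alone",
    (True, True): "double talk",
}


def joint_states(a_talking, b_talking):
    """Label each timestep from two equal-length activity timelines."""
    if len(a_talking) != len(b_talking):
        raise ValueError("timelines must have the same length")
    return [LABELS[(bool(a), bool(b))] for a, b in zip(a_talking, b_talking)]


def talk_fraction(talking):
    """Fraction of timesteps in which a speaker is active."""
    return sum(bool(t) for t in talking) / len(talking)


# Hypothetical activity per timestep: 1 = talking, 0 = silent.
speaker_a = [1, 1, 0, 0, 1]
speaker_b = [0, 1, 1, 0, 0]

states = joint_states(speaker_a, speaker_b)
occupancy = Counter(states)          # timesteps spent in each joint state
a_share = talk_fraction(speaker_a)   # the kind of statistic profiling would use
```

The per-speaker talk fraction is the sort of quantity that the pattern profiling discussed in Section 7.1 could estimate from packet timing alone, without decrypting any traffic.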
7.2.1 One-Port versus Many-Port Models

A one-port model describes a single user's speech patterns given only the user's knowledge of the state of the conversation. The model knows when it hears the other speaker, but only a single person's speech patterns are actually generated by each model. Two one-port models must be connected together via a communication channel to generate a full conversation.

A many-port model would attempt to model the entire system including the channels between the speakers. This allows full conversations to be modeled in a single system, but has several disadvantages when compared to a one-port model. One problem is that modeling different channels between the users (variable delays, for example) requires a new model for each type of channel. Also, each model encompasses an entire conversation (speaker pair) as opposed to a single speaker. In other words, to model conversations among N users, N(N-1) different sets of parameters are necessary for the two-port model (one for each speaker pair), while only N sets are required by the one-port model (one for each speaker).

Another advantage of the one-port model is that it depends only on what is seen at the receiving end. If someone is talking, but for some reason the data isn't making it to the other user (e.g. a network link went down), the receiving model treats this as silence from the sending model, in much the same way an individual speaker would. Leaving the channel out of the model also allows the patterns of conversation to change depending on the characteristics of the actual channel connecting the users, a property that is important for testing the impact of distinct channels.

[Figure 6: A Two-State Conversation Model. States: Talking, Not Talking. Transitions: A talks, A stops talking.]

7.2.2 The One-Port, Two-State Model

Brady used a series of increasingly complex models to better describe human conversations. The simplest and most intuitive model is the two-state model, shown in Figure 6.
In the two-state model a user is either in a talking state or a not-talking state. The length of time the user spends in each state follows an exponential probability distribution that is discussed in Section 7.2.5.

The two-state model, though simple, has several shortcomings. The biggest drawback is that it does not depend at all on input received from the other party. A person is equally likely to be talking whether the person on the other end is silently waiting or shouting over him! Despite these statistical shortcomings, the two-state model is actually the one that most major commercial VoIP traffic generators use to model conversations[37].

7.2.3 The One-Port, Four-State Model

To make the conversation model more realistic, Brady next introduced the four-state model. In the four-state model, the other speaker's input combined with the model's output determines the state. For simplicity, throughout the rest of this section the speaker being modeled will be referred to as speaker A, while the person they are talking to will be called speaker B. The four states of the model then become mutual silence, A talking alone, B talking alone, and double talk. These states and the transitions among them are shown in Figure 7.

Note that because this is a one-port model, speaker A only has control of some of the transitions. In particular, A's actions determine when the vertical lines (shown as solid lines) are traversed. Speaker A can decide whether to start talking if he is silent, or stop talking if he is talking. He does not, however, control the horizontal transitions (shown as dashed lines).
Whether or not B is talking is controlled by something else (perhaps a symmetric model connected at the other end, but it could be anything), and A can only perceive that B is talking or silent. If A hears that B has fallen silent, a horizontal transition in Figure 7 will be made, but this transition is not controlled by A.

Figure 7: A Four-State Conversation Model (states: Silence, Talking Alone, Listening, Double Talk)

7.2.4 The One-Port, Six-State Model

While the four-state model simulates conversation patterns more accurately than the two-state model, Brady found that it was still inadequate, particularly in predicting events around mutual silence and double talk[31]. Brady found that behavior in silence depended largely on who was talking immediately prior to the silence, and similarly that behavior during double talk depended largely on who interrupted whom. This issue was dealt with by breaking the double-talk and silence states into two states each, one for each possible previous state. The revised, final model is shown in Figure 8. As in the four-state model, only the vertical transitions, representing A's decision to start or stop talking, are controlled by the model. The six-state model was able to represent recorded, human, PSTN conversations, and accurately described a data set of conversations collected by Brady[31, 32].

7.2.5 Model Parameters and Interpretation

Although the states of the conversation model have been defined, we have not yet discussed how much time is spent in each state. This duration is calculated by first breaking up the conversation into a series of discrete timesteps dt during which the state is forced to remain constant.
Brady chose dt = .005 seconds, but because most VoIP codecs send data at a rate of 50 packets per second (pps) or less, dt = .02 seconds (corresponding to this rate) was used for our VoIP modeling. Then, at any given timestep the state will change if either a start-talking pulse occurs (when A is silent), or a stop-talking pulse occurs (when A is talking). These pulses are characterized by a Poisson arrival process[35]. "Start-talking" pulses are called $\alpha$-pulses, while "stop-talking" pulses are known as $\beta$-pulses. The values of $\alpha$ and $\beta$ depend on the current state of the conversation, and are such that the probability of starting to talk from a silent state during a timestep dt is $\alpha_{state} \cdot dt$, and the probability of falling silent from a talking state during a timestep is $\beta_{state} \cdot dt$. It is important to realize that the $\alpha$ and $\beta$ values are not probabilities themselves, but only become probabilities when multiplied by dt. This means that $\alpha$ and $\beta$ can (and will) take values greater than 1.

Figure 8: A Six-State Conversation Model (states: Silence A Last, Silence B Last, Talking Alone, Listening, Double Talk A Interrupts, Double Talk B Interrupts)

The physical interpretation of the $\alpha$ and $\beta$ values is that they represent a stream of pulses trying to drive speaker A out of his state. An $\alpha$ of some value, say three, means that there is a stream of $\alpha$-pulses forcing A to talk that occur at an average rate of three pulses/second. Thus, the units of $\alpha$ and $\beta$ are pulses/second. Figure 9 shows the probability of leaving a particular state for an unspecified $\alpha$ value. A nice property of this model is that the mean time a speaker spends in a state governed by an $\alpha$ (or $\beta$) value, neglecting transitions caused by the other speaker, is $1/\alpha$ (or $1/\beta$).
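The relationship between a pulse rate and the mean time spent in a state can be checked numerically. The following is a minimal sketch, not the thesis's traffic generator: the function names `dwell_time` and `mean_dwell` are hypothetical, and it assumes that one independent trial with success probability rate * dt is made per timestep, with transitions caused by the other speaker ignored.

```python
import random

DT = 0.02  # timestep in seconds (50 packets per second), as used in the text


def dwell_time(rate, rng, dt=DT):
    """Time spent in a state before the first pulse fires.

    One trial is made per timestep; each trial fires with
    probability rate * dt, which must stay below 1.
    """
    p = rate * dt
    if not 0.0 < p < 1.0:
        raise ValueError("rate * dt must lie strictly between 0 and 1")
    steps = 1
    while rng.random() >= p:  # no pulse in this timestep, stay in the state
        steps += 1
    return steps * dt


def mean_dwell(rate, trials=20000, seed=1):
    """Average dwell time over many independent visits to the state."""
    rng = random.Random(seed)
    return sum(dwell_time(rate, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for rate in (0.5, 3.0):
        print(f"rate={rate} pulses/s  simulated mean={mean_dwell(rate):.3f} s  "
              f"expected 1/rate={1.0 / rate:.3f} s")
```

The number of timesteps until the first pulse is geometrically distributed with mean 1/(rate * dt), so the mean dwell time is dt/(rate * dt) = 1/rate, which is the property stated above.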
In the six-state model there are six parameters that characterize an individual speaker: three $\alpha$ values, one for each of the three states where A is silent, and three $\beta$ values, one for each of the three states where A talks. These six parameters are called $\alpha_{pse}$ (pause), $\alpha_{alt}$ (alternation), $\alpha_{int}$ (interrupt), $\beta_{sol}$ (solitary), $\beta_{ted}$ (interrupted), and $\beta_{tor}$ (interruptor), and are summarized in Table 3. Figure 10, a revised version of Figure 8, shows the parameters in relation to the state transitions.

Figure 9: Probability Distribution of Time Spent in a State ($E[t] = 1/\alpha$)

Figure 10: Six-State Conversation Model with $\alpha$ and $\beta$ Parameters

Table 3: Brady's Parameters for the Six-State Model

Parameter        Description                        From State                  To State
$\alpha_{pse}$   Start talking after pausing        Silence A Last              Talking Alone
$\alpha_{alt}$   Start talking in alternation       Silence B Last              Talking Alone
$\alpha_{int}$   Start talking, interrupting        Listening                   Double Talk A Interrupts
$\beta_{sol}$    Stop talking when talking alone    Talking Alone               Silence A Last
$\beta_{ted}$    Stop talking after interrupted     Double Talk B Interrupts    Listening
$\beta_{tor}$    Stop talking after interrupting    Double Talk A Interrupts    Listening

7.2.6 Model Limitations

The six-state model, although far superior to the two-state and four-state models, is still not perfect. Brady comments that the model sometimes does not perfectly handle events surrounding double talk[31]. Additionally, the model only covers two-person conversations. In practice it would be beneficial to have a model that could handle three or more speakers in a conference call. We speculate that this type of system could be modeled with a logical "or" of the rest of the speakers representing speaker B's behavior, and perhaps an adjustment of the model parameters.
For example, to model speaker A's conference call with C, D and E, we could use the six-state model and represent B's talking as C or D or E talking. This admittedly would not be perfect: there would be no way to detect, or to change behavior in response to, multiple speakers (say C and D) talking at the same time. The validity of this approach, and any model that goes beyond two speakers, is outside the scope of this thesis. It is mentioned only as a possible extension of Brady's model.

7.3 Applying Brady's Model to VoIP

Brady designed and evaluated his model based on the analog communication channels of PSTN telephones. Additionally, his method of detecting speech involved monitoring when the noise level crossed a threshold volume (which he varied between -45 dBm and -35 dBm)[32]. Modeling VoIP is a slightly different task. In particular, in addition to determining whether a person is talking or not, we wish to know the rate of data being sent over the network. The data rate depends on the codec used (see Table 1) and on whether or not data is being sent. Since bandwidth usage is a defined constant for each codec, all that needs to be modeled is the on/off pattern of data sent from a particular phone. The phone we chose to model was a Cisco 7940[38], and, as will be shown in the next few sections, its on/off patterns did not closely match those observed by Brady's noise threshold detector.

7.3.1 Voice Activity Detection (VAD)

Some VoIP phones and software applications send a constant stream of data when they are being used. If this is the case, the bandwidth is a fixed value that depends only on the codec, and can be found in Table 1. A model of this type of traffic would be very simple: as soon as the call is set up, each side should send at a fixed data rate that models a particular codec, and this should continue until one of the parties hangs up.
Traffic rates are constant whether the calling parties are talking or silent, and in Brady's model, each user would permanently be in a "Double-Talk" state.

Most VoIP applications, however, use a technique known as Voice Activity Detection (VAD) to conserve bandwidth. The phones have a built-in filter that attempts to determine whether a person is speaking or not, and only send data when the filter hears speech[17]. When the person is silent, nothing is sent. Theoretically, VAD can reduce bandwidth usage by up to 50% relative to the same call without VAD. The nature of these VAD systems, and how they compare to known methods of conversation modeling, was a crucial element in simulating VoIP traffic. The problem we faced was determining how VAD systems behave when compared to Brady's noise detection algorithm.

To determine the behavior of the Cisco 7940's voice detector, a series of prerecorded audio samples was played through the phone, and the resulting traffic was recorded using tcpdump. Samples were played from a music file so that there would be no periods of silence within a sample. Samples of .01, .04, .1, .5, 1, 2, 3, 5, and 10 seconds were played, each one with silence gaps of .01, .04, .1, .5, 1, 2, 3, 5, 10, 20, 30, and 60 seconds in between. Then the difference between the actual length of the clip and the recorded period of activity on the network was measured.

The behavior was remarkably consistent across clip and gap lengths. All clips separated by gaps of 2 seconds or less were seen as a single stream of data on the network. For the clips with observable periods of silence, the mean difference between the time the phone sent packets and the length of the clip was 2.30 seconds, with a standard deviation of .023 seconds. From this observed behavior we believe that VoIP traffic from a Cisco 7940 IP phone can be simulated by generating regular speech via Brady's model and adding a "buffer length" of 2.3 seconds to each talkspurt.
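The buffering rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code used in the thesis: the function name `apply_vad_buffer` and the representation of talkspurts as (start, end) pairs in seconds are hypothetical. Each talkspurt is extended by the buffer length, and any talkspurt that begins before the previous extended one has ended is folded into it.

```python
def apply_vad_buffer(talkspurts, buffer_s=2.3):
    """Extend each (start, end) talkspurt by buffer_s seconds and merge overlaps.

    talkspurts: iterable of (start, end) times in seconds for one speaker.
    Returns a new list of (start, end) pairs sorted by start time.
    """
    merged = []
    for start, end in sorted(talkspurts):
        end += buffer_s  # the phone keeps sending for buffer_s after speech stops
        if merged and start <= merged[-1][1]:
            # Speech resumed before the previous buffer ran out:
            # the network sees one longer talkspurt.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Three short talkspurts; the 1 s gap between the first two is shorter
    # than the 2.3 s buffer, so they merge. The third stays separate.
    spurts = [(0.0, 1.0), (2.0, 3.0), (10.0, 11.0)]
    for start, end in apply_vad_buffer(spurts):
        print(f"{start:.1f} s to {end:.1f} s")
```

Under this rule any silence no longer than the buffer disappears from the traffic pattern, which matches the observation that clips separated by gaps of 2 seconds or less appeared as a single stream.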
Any talkspurts that overlap due to this buffer are combined into a single, longer talkspurt, which is the behavior observed in the phone. The resulting conversation consists of a smaller number of longer talkspurts, with shorter periods of silence in between.

To check that this behavior is accurate, a single half of a conversation was played through the Cisco 7940. The audio data was part of the Switchboard Cellular Phase I corpus[33], and the particular conversation used was chosen for its clarity and lack of noise. Traffic from the IP phone was captured using tcpdump and broken into talkspurts by a Perl script. This was then compared with a transcription of the conversation, which was prepared by the Linguistic Data Consortium[34]. As a final step, a 2.3 second buffer was added to each talkspurt in the transcription, to be compared with the data from the IP phone.

The results of this analysis are shown in Figure 11. Figure 11(a) shows the on/off patterns from the original transcription of the conversation. This data consists of many short talking and silence periods. Figure 11(b) shows on/off patterns from the transcription with the addition of the 2.3 second buffer. All silences shorter than 2.3 seconds have been absorbed into merged talkspurts and, as a result, there are fewer, longer talkspurts. Finally, Figure 11(c) shows the result of the audio data being played through the Cisco 7940. Although Figures 11(b) and 11(c) are not exactly the same, the addition of a 2.3 second buffer is clearly a reasonable model of the behavior of the Cisco 7940. Discrepancies (for example, around 177 seconds) are believed to be a result of imperfect transcriptions (due to small amounts of unrecorded noise).
Not every discrepancy was investigated individually, but in the recorded version of the conversation shown in Figure 11(a) there is an audible sniffle and click at 177 seconds that does not appear in the transcription, which supports the above hypothesis.

Figure 11: Talking On/Off Patterns versus Time for a Single Speaker in a Conversation. Panels: (a) Original Conversation, No Buffer; (b) Original Conversation, 2.3s Buffer Added; (c) Original Conversation, transmitted via VoIP. Horizontal axis: time (s).

7.4 Revising Brady's Parameters

Rather than rely on a rather arbitrary method of modeling VoIP traffic that adds 2.3 seconds to the talkspurts generated with Brady's conversation parameters, we decided to investigate whether a new set of parameters could be developed that would allow Brady's original model to accurately simulate VoIP traffic. This would, at a general level, mean drastically lowering the $\beta$ values (those that correspond to the rate of "stop-talking" pulses) and raising the $\alpha$ values (the rates of "start-talking" pulses). This would accomplish the desired result of creating longer and fewer talkspurts combined with shorter periods of silence.

7.4.1 Correlating Brady's Data with the Switchboard Corpus

We began by comparing the statistics from Brady's conversation data to those from the Switchboard speech corpus that we sought to use. The time in each of Brady's six states was compared over the entire data sets. The results of this analysis are shown in Figure 12.

There are a few important things to note in Figure 12. One is that the time in each state for both sets of data is symmetric. This is because the data is taken from both speakers' perspectives. That is, when two speakers are talking they see "opposite" states. Each speaker looks at the conversation from his own perspective, and thinks of himself as Speaker A. Imagine a hypothetical conversation between two speakers, Carl and Denise. When
Carl is talking and Denise is silent, Carl interprets this as being in the Talking Alone state, because Carl considers himself Speaker A. From Denise's perspective, however, she is in the Listening state, because she also considers herself Speaker A and considers Carl Speaker B. A similar symmetry results for the two Double Talk states and the two Silence states. Both sides of each conversation were considered because it is statistically meaningless to assign an arbitrary 'Speaker A' and 'Speaker B' to each conversation.

Figure 12: Time in Each State for Brady's Data[32] and the Switchboard Speech Corpus[34]. Panels: (a) Brady's Data; (b) Switchboard Speech Corpus.

A second thing to note about Figure 12 is that although the time spent in the Talking Alone states is nearly identical, there is significantly more double talk and less silence in the (more modern) speech corpus than in Brady's data. It is not known why this is, but it could be a result of a small dataset (see footnote 3), imperfect noise detection and/or transcriptions, or a change in conversation styles in the last 30 years (Brady's research was done in 1968, while the Switchboard corpus was obtained in 1999-2000). While these discrepancies exist, we believe that the Switchboard dataset is similar enough to Brady's data that his model is still valid. For the rest of this section we assume that the Switchboard corpus can be reasonably correlated with Brady's model.

Footnote 3: Brady's data covered only 16 conversations, while the Switchboard corpus covered 250.

7.4.2 Comparing the Switchboard Corpus to VoIP

Once it had been determined that the Switchboard corpus could be used with Brady's model, we sought to understand how the model could be adapted to what was observed in VoIP systems, and whether a revised set of Brady's parameters could model this behavior. The transcriptions from the corpus were used to generate the six transition parameters ($\alpha$ and $\beta$ values) from each speaker's perspective. The parameters were obtained from the transcripts via the method outlined in [31], and the resulting averages and standard deviations can be seen in Table 4.

Table 4: Statistics of Brady's Parameters for the Switchboard Data

Value            Average    St. Dev.
$\alpha_{pse}$   1.754      1.478
$\alpha_{alt}$   1.380      0.852
$\alpha_{int}$   0.289      0.188
$\beta_{sol}$    0.437      0.335
$\beta_{ted}$    1.209      0.706
$\beta_{tor}$    0.660      0.769

To make the data more closely resemble what was observed in the VoIP phones, a 2.3 second buffer was added to the end of every individual talkspurt, for the reasons described in Section 7.3.1. Talkspurts that overlapped due to this addition were combined into a single, longer talkspurt. Then, with these revised transcriptions, Brady's parameters were again extracted. The resulting values can be seen in Table 5.

Table 5: Statistics of Brady's Parameters for the Switchboard Data with Buffer

Value            Average    St. Dev.
$\alpha_{pse}$   1.838      4.240
$\alpha_{alt}$   2.046      9.586
$\alpha_{int}$   0.447      0.561
$\beta_{sol}$    0.0198     0.0292
$\beta_{ted}$    0.121      0.0684
$\beta_{tor}$    0.174      0.239

With the addition of the buffer, all three $\beta$ values drop significantly. $\beta_{tor}$ and $\beta_{ted}$, corresponding to the likelihood of leaving a double-talk state, decrease such that the average time in a double-talk state increases from .5-1.5 seconds to 5-10 seconds. The impact on $\beta_{sol}$, the likelihood of stopping when talking alone, is even more dramatic, resulting in the average time in that state increasing from just over 2 seconds to 50 seconds (see footnote 4)!

The impact of the 2.3 second buffer on the $\alpha$ values (those corresponding to start-talking pulses) is less clear. Although all three $\alpha$ values do increase, they do not do so with the consistency or magnitude of the $\beta$ values. Also, it is difficult to meaningfully characterize the values of $\alpha_{pse}$ and $\alpha_{alt}$ (the start-talking-from-silence parameters), because of a lack of data. Of the 250 transcribed conversations, 141 had no silence at all after the addition of the 2.3 second buffer.
For these conversations, extracting values for $\alpha_{pse}$ and $\alpha_{alt}$ was impossible. Another way to see this variability is by looking at the standard deviations of the values. For $\alpha_{pse}$ the average value was 1.8 with a standard deviation of over 4, and for $\alpha_{alt}$ the value was about 2 with a standard deviation of nearly 10.

Footnote 4: This does not include the fact that a solitary talking state is also left when the other person starts talking, which is much more likely. Additionally, once that happens, the person finds themselves in a double-talk state, in which they are more likely to stop talking. The reason for this low value is believed to be that the majority of silences between alternating speakers are shorter than 2.3 seconds. Thus, by the time the 2.3 second buffer runs out and Speaker A falls silent, Speaker B has already started talking and A is falling silent from a double-talk state.

It can be concluded that the addition of a buffer on the end of each speaker's talkspurts results in a significant drop in $\beta$ values, and a minor increase in $\alpha$ values. The most affected parameter is $\beta_{sol}$, the likelihood of stopping when talking alone, because the addition of the buffer causes the vast majority of talking-alone states to be terminated by interruptions. It is difficult to characterize the effect of the buffer on the $\alpha$ values, due to the high variability of the data.

Another way to observe the effect of the 2.3 second buffer is to compare the time spent in each state for calls passed through VoIP phones with the original transcriptions and with the transcriptions after the addition of the buffer. Time and resources did not allow us to do this for every call in the corpus. Instead, three calls from the corpus were analyzed in the hope that they would be representative of the corpus as a whole. The results of this analysis can be seen in Figure 13. Figure 13(a) shows the time in each state for the unbuffered transcriptions.
It can be verified that this distribution of states closely resembles the average of the Switchboard corpus (refer back to Figure 12). Figure 13(b) shows the time in each state for the same three calls played through the Cisco 7940 IP phones. In this case we see a drastic increase in the amount of double talk, and a large reduction of silence. Finally, Figure 13(c) shows the time in each state with the addition of the 2.3 second buffer. The buffered transcriptions closely resemble the VoIP calls, showing a drastic reduction of silence and an increase in double talk. It is not known why there is more double talk when the calls are played through the VoIP phones than with the addition of the buffer, but this is, again, believed to be a combined result of noisy conversations and imperfect transcriptions. A complete investigation into this issue was not done.

7.5 Developing User Models

As a second aspect of conversation modeling, we also wanted to generate different user and scenario models. To see if people could be categorized into different types of speakers, we again looked through the Switchboard data corpus, but this time focused only on conversations involving particular speakers. Unfortunately, for the transcriptions in this data set, the maximum number of conversations any individual participated in was three, with 11 speakers participating in three different conversations each. Although this was a small data sample, for each of the 11 speakers both Brady's parameters and the time in each state were recorded with and without the 2.3 second buffer.
We found that although some speakers consistently talked more than others, the values for each speaker varied greatly from conversation to conversation. In particular, when the parameters for a particular speaker were averaged across the three conversations they had participated in, the standard deviation was on the same order of magnitude as the mean, making the data difficult to interpret.

Figure 13: Time in Each State for 3 Calls from the Switchboard Corpus. Panels: (a) Unbuffered; (b) VoIP; (c) 2.3 Second Buffer.

We offer two hypotheses to explain this variation. One is that there simply is not enough data. Three six-minute calls are not a lot of time to characterize a person's speech patterns. Additionally, the three calls were about different subjects, the nature of which could have affected the person's speech patterns. A better analysis of speech patterns on a per-speaker basis would require keeping several variables (subject matter, connection channel properties, call duration) constant as the person talks to different parties. This was not the nature of the Switchboard corpus, and time did not permit our own investigation into this matter.

A second potential reason for the high variation of the data is that individual speaker patterns are simply not consistent across contexts. Several studies have recently shown that the advent of the Internet has introduced several types of data that are simply too variable to model with current techniques[36]. How to model this type of data is outside the scope of this document.

Instead we took another strategy. We decided to use our intuitive understanding of the parameters to create user models for different conversation types. A few of those types are discussed below.
7.5.1 The Average Speaker Pair

The first set of parameters we will look at is the "average" speaker. For this pair we gave each user the average $\alpha$ and $\beta$ values for the entire Switchboard corpus after adding the 2.3 second buffer appropriate for modeling VoIP. An output of the traffic generation program with this setup is shown in Figure 14. This can be compared to a single real conversation with similar parameters, which can be seen in Figure 15.

Figure 14: Simulated Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers. Panels: (a) Average Speaker A; (b) Average Speaker B; (c) Both Speakers. Horizontal axis: time (s).

Figure 15: Real Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers. Panels: (a) Average Speaker A; (b) Average Speaker B; (c) Both Speakers. Horizontal axis: time (s).

7.5.2 The Authority Relationship

Another conversation type that is of particular interest is a conversation where one of the speakers has authority over the other. This could represent a manager talking with an employee, or, in the military world, a commanding officer talking with a subordinate. These conversations could play a unique role in security because many times it is important to keep the location of the commanding officer secret or hidden.

Table 6: Brady's Parameters for a Dominant and Passive Speaker
Speakers: Authority, Subordinate
Parameters: $\alpha_{pse}$, $\alpha_{alt}$, $\alpha_{int}$, $\beta_{sol}$, $\beta_{ted}$, $\beta_{tor}$
Values: 2, .3, 2, .1, .3, .1, .1, .1, .5, .7, .01, .1

To choose parameters for an authority figure we thought about the conversation qualitatively. The authority figure is less likely to worry about interrupting someone, because he has the authority, so it was decided that his $\alpha_{int}$ value should be above average.
Additionally, he is probably less likely to stop talking when interrupted or interrupting someone, so $\beta_{ted}$ and $\beta_{tor}$ should be lower. Finally, it is believed that authorities usually talk more in conversations[19], so $\beta_{sol}$ was also lowered.

For the subordinate the situation is the opposite. He is likely to be talking less, and so $\beta_{sol}$ should be higher than average. Additionally, he is quite likely to stop when interrupted, so $\beta_{ted}$ should be high. Finally, the subordinate is probably unlikely to interrupt the person who has authority over him, so $\alpha_{int}$ was dramatically reduced. The values that were chosen for these two speaker types are summarized in Table 6. The resulting simulated conversation can be seen in Figure 16. The figure clearly shows that the dominant speaker talks and interrupts more, and that the conversation has the desired properties.

7.5.3 An Alternating Protocol

A final type of conversation we will consider is one with an alternating protocol. In this case, some other code or protocol is used to determine when a speaker is talking. One example of this is the "over" protocol, where people say "over" at the end of every statement made. Another example can occur over links with a long latency (for example, satellite communications). In these cases, sometimes a push-to-talk system is employed where a person first pushes a button indicating that they are talking, and releases it when they finish.

In general, the goal of these systems is to eliminate overtalk. Thus, $\alpha_{int}$ should be very low, and $\beta_{tor}$ and $\beta_{ted}$ should be relatively high. Additionally, the conversation usually flows with alternation (as opposed to a speaker falling silent and talking again), so $\alpha_{alt}$ should be significantly higher than $\alpha_{pse}$. We also hypothesize that in this scenario typical talkspurts last longer, so $\beta_{sol}$ was kept low. The final values chosen and the results of a simulated conversation can be seen in Table 7 and Figure 17, respectively.
Again, it can be plainly observed that the conversation has the desired properties.

Figure 16: Simulated Talking On/Off Patterns versus Time for a Conversation with a Dominant Speaker. Panels: (a) Subordinate Speaker; (b) Authority Speaker; (c) Both Speakers. Horizontal axis: time (s).

Figure 17: Simulated Talking On/Off Patterns versus Time for a Conversation with an Alternating Protocol. Panels: (a) Alternating Speaker A; (b) Alternating Speaker B; (c) Both Speakers. Horizontal axis: time (s).

Table 7: Brady's Parameters for an Alternating Protocol

Speaker       $\alpha_{pse}$   $\alpha_{alt}$   $\alpha_{int}$   $\beta_{sol}$   $\beta_{ted}$   $\beta_{tor}$
Alternator    .05              1.5              .01              .05             1.5             1.5

7.6 Summary

In this section we have discussed conversation modeling in the context of VoIP. Starting with a model for conversations developed by Paul Brady, we adjusted the parameters to more accurately reflect the observed data rates of VoIP. This resulted in an increase in the average length of talkspurts, which we attribute to the use of Voice Activity Detection systems that overestimate the time when a person is talking. We also discussed adapting the parameters of the model to generate different usage patterns. Patterns considered were an authority relationship, and an alternating protocol similar to a push-to-talk system or "over" protocol.

A scriptable VoIP traffic generator was developed on the basis of this work on conversation modeling. It allows us to play simulated VoIP conversations that closely resemble users' actual speech patterns. The traffic generator is flexible, allowing Brady's six parameters to be set as inputs, and thus making it possible to simulate different types of conversations over equivalent network setups. The details of this traffic generator are discussed further in Section 9.
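To make the role of the six parameters concrete, the following is a minimal sketch of two connected one-port, six-state models. It is not the traffic generator described in Section 9: the names `Speaker`, `pulse_rate`, `simulate`, and `double_talk_fraction` are hypothetical, the channel is assumed to be ideal (no delay or loss), and the rule that only one speaker may change state per timestep is a simplifying assumption made here. The parameter sets are the buffered Switchboard averages of Table 5 and the alternating-protocol values of Table 7.

```python
import random
from dataclasses import dataclass

DT = 0.02  # seconds per timestep (50 packets per second)


@dataclass(frozen=True)
class Speaker:
    a_pse: float  # start talking after own pause   (Silence, self spoke last)
    a_alt: float  # start talking in alternation    (Silence, other spoke last)
    a_int: float  # start talking, interrupting     (Listening)
    b_sol: float  # stop talking when talking alone
    b_ted: float  # stop talking after being interrupted
    b_tor: float  # stop talking after interrupting


AVERAGE = Speaker(1.838, 2.046, 0.447, 0.0198, 0.121, 0.174)  # Table 5 averages
ALTERNATOR = Speaker(0.05, 1.5, 0.01, 0.05, 1.5, 1.5)         # Table 7


def pulse_rate(me, params, talking, tag):
    """Rate (pulses/second) driving speaker `me` out of its current state.

    `tag` holds who spoke last during mutual silence, or who interrupted
    during double talk.
    """
    other = 1 - me
    if not talking[me]:
        if talking[other]:
            return params.a_int
        return params.a_pse if tag == me else params.a_alt
    if not talking[other]:
        return params.b_sol
    return params.b_tor if tag == me else params.b_ted


def simulate(speakers, duration, seed=0):
    """Return one (speaker0_talking, speaker1_talking) pair per timestep."""
    rng = random.Random(seed)
    talking = [False, False]
    tag = 0  # arbitrary initial choice: treat speaker 0 as having spoken last
    trace = []
    for _ in range(int(round(duration / DT))):
        fired = [i for i in (0, 1)
                 if rng.random() < pulse_rate(i, speakers[i], talking, tag) * DT]
        if len(fired) == 2:  # assumption: at most one transition per timestep
            fired = [rng.choice(fired)]
        for i in fired:
            talking[i] = not talking[i]
            if talking[i] == talking[1 - i]:
                # Entered double talk (i interrupted) or mutual silence (i spoke last).
                tag = i
        trace.append((talking[0], talking[1]))
    return trace


def double_talk_fraction(trace):
    """Fraction of timesteps in which both speakers are talking."""
    return sum(1 for a, b in trace if a and b) / len(trace)


if __name__ == "__main__":
    for name, params in (("average", AVERAGE), ("alternator", ALTERNATOR)):
        trace = simulate((params, params), duration=3600.0, seed=1)
        print(f"{name}: double talk {100 * double_talk_fraction(trace):.1f}% of the time")
```

With these inputs the alternating pair should spend far less time in double talk than the average pair, which is the qualitative property the alternating parameters were chosen to produce.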
8 Adapting the E-Model to Secure VoIP Systems: The Z-Model

In Section 6.3 we introduced the E-Model as a way to estimate conversation quality based on connection characteristics. The E-Model can estimate quality under a wide variety of conditions. It was designed to model traditional circuit-switched communication, and later adapted to include packet-switched communication like VoIP. We sought to develop a VoIP-specific model that could be readily utilized by VoIP users and implementers to determine the quality of their service. We also wanted to model the impact of security on voice communication quality. Finally, we sought to improve the model to capture the benefits and relative costs of using conversational protocols (e.g. "over") over high-latency links. We call the model that results from these changes the Z-Model.

8.1 Using the E-Model for VoIP Communication

Before we discuss any changes made to the E-Model, we will first discuss its use in VoIP communications. Recall the E-Model formula:

$$R = R_o - I_s - I_d - I_e + A \tag{1}$$

We will consider the effects of three factors that impair voice quality: jitter, echo, and error rates.

8.1.1 Jitter

Jitter is the variation in the interarrival time of packets. It does not occur in circuit-switched networks because there is no concept of a packet in those networks, but it is clearly important for VoIP. Packets that arrive very late due to jitter represent a difficult challenge. Once the audio that follows a packet has been played to the user, that late-arriving packet is no longer useful and should be discarded. On the other hand, discarding every packet that arrives a little bit late because of jitter would cause a significant amount of audio to be lost, which would also reduce conversation quality. A jitter buffer is used to mitigate this problem. A jitter buffer holds packets for a short time window before the phone converts them to audio and sends the signal to the user.
Studies have shown that the use of a jitter buffer can dramatically improve conversation quality[50]. However, using a jitter buffer requires a tradeoff. Waiting a fixed amount of time for packets that might arrive late due to jitter causes the end-to-end latency of every packet to increase. If jitter is so great that the jitter buffer must be increased beyond a few hundred milliseconds, the impairment due to delay may counteract the improvement from reduced jitter. A typical jitter buffer can be anywhere from 2-4 standard deviations longer than the average interarrival time to ensure that most slightly "jittered" packets are picked up, while not wasting time waiting for packets that arrive extremely late. The best value to use depends on both the jitter and latency of the network.

From the perspective of the E-Model, the use of a jitter buffer allows us to eliminate jitter from quality models by replacing it with a fixed delay and an increased packet loss. All packets that arrive within the window of the jitter buffer will be delayed in the buffer a fixed length of time before being heard by the user. Additionally, packets arriving outside the jitter buffer will be dropped, and are equivalent to packets lost in transit. If $t_a$ is the absolute end-to-end delay, and $t_{a-original}$ is the absolute delay in the absence of jitter, then:

$t_a = t_{a-original} + t_{jitter}$ (2)

where $t_{jitter}$ is the length of time that packets wait in the jitter buffer. Additionally, if $P_{loss}$ is the probability of a lost packet, and $P_{loss-original}$ is the probability of a packet lost in transit, then:

$P_{loss} = P_{loss-original} + P_{loss-jitter}$ (3)

where $P_{loss-jitter}$ is the probability that a packet arrives outside the duration of the jitter buffer, and is thus dropped.

8.1.2 Echo

The next factor we will consider is echo. In the E-Model, echo is represented in the $I_d$ term.
The specific equation is:

$I_d = I_{dte} + I_{dle} + I_{dd}$ (4)

where $I_{dte}$ and $I_{dle}$ represent terms due to talker and listener echo, respectively, and $I_{dd}$ is the impairment due to absolute delay. Both $I_{dte}$ and $I_{dle}$ have complex equations associated with them, but we will make use of some assumptions. We rely on the fact that many VoIP systems currently provide varying levels of echo cancellation[60]. We also argue that echo in a packet-based speech communication is reduced because there are no analog signals involved in the transmission of the signal (at least on the level that would contribute to the output). Combining these two facts, we will assume that there is no talker or listener echo, and that the only delay impairment corresponds to absolute delay. Then $I_d = I_{dd}$, where $I_{dd}$ is calculated from the one-way latency by Equation 5 below (from [53]).

$I_{dd} = 25 \left\{ (1 + X^6)^{1/6} - 3 \left(1 + [X/3]^6\right)^{1/6} + 2 \right\}$ (5)

where $X$ is defined in the equation below, with $t_a$ the absolute delay given by Equation 2.

$X = \log(t_a/100) / \log 2$ (6)

8.1.3 Error Rates

Finally we consider error rates. Many channels of communication are noisy, and in a digital setting, noise typically leads to single-bit errors; that is, a 1 becoming a 0, or vice versa. Many systems can recover from single-bit errors, but errors can be problematic for VoIP systems. This is because VoIP uses compression schemes to encode its speech. If a single bit of compressed data is wrong, the entire packet becomes corrupt, and must be thrown away. Thus, we return again to the formula for packet loss, and must add a new term, $P_{loss-error}$, representing the packet loss due to single-bit errors:

$P_{loss} = P_{loss-original} + P_{loss-jitter} + P_{loss-error}$ (7)

To calculate $P_{loss-error}$, we need to know two things: the bit-error rate, $ber$, of the channel, and the size of the VoIP packet in bits, $size$.
Then the probability of a packet being lost to bit errors is given by the following:

$P_{loss-error} = 1 - (1 - ber)^{size}$ (8)

That is, the probability of loss is 1 minus the probability that every bit in the packet is error-free. We can usually make a simplification to Equation 8. Many communication channels have very small bit-error rates. 10Gb Ethernet, for example, has a bit-error rate of approximately $10^{-13}$ [66]. In the case of a very small error rate we can assume the probability of a packet having more than one error is small enough that we neglect it. Then the equation for $P_{loss-error}$ becomes:

$P_{loss-error} = ber \cdot size$ (9)

For the $10^{-13}$ error rate cited for 10Gb Ethernet and a 200-byte (1600-bit) packet typical for VoIP, the values given by the two different formulas for $P_{loss-error}$ are $1.599999999872 \times 10^{-10}$ and $1.6 \times 10^{-10}$, respectively, making the assumption well over 99.99% accurate. For satellite links common in airborne networks the approximation is less accurate, but still good. For the $10^{-5}$ bit error rate of Iridium communication (see Table 8 of Section 9.2), the approximation gives 0.016 while the actual value is 0.0158727. This is still about 99.2% accurate.

8.2 Going from E-Model to Z-Model

Once the E-Model was adapted for VoIP in the ways described above, we sought to incorporate the effects of security. This is discussed in Section 8.2.1. Additionally, in our experience with conversation modeling we discovered a factor, which we call the disagreement factor, that we believe more accurately reflects delay impairments. This factor is discussed in Section 8.3.

8.2.1 Modeling the Effects of Security

Introducing security into VoIP networks affects several different aspects related to voice quality, including jitter, delay, packet loss, and bandwidth.

Jitter

Encryption introduces another step to the end-to-end communication process, so it is likely to increase the variance of the delays of each packet.
In particular, as bandwidth is increased, the cryptographic engine can become a bottleneck, and voice packets may have to wait in an encryption queue for other data to be encrypted before being sent. Naturally, the more data being sent through the encryptors, the longer each individual packet may have to wait in the queue. In general, this queue will cause additional jitter depending on the instantaneous amount of data being sent at any particular time. Fortunately, we can convert jitter to packet loss and absolute delay by again assuming the use of a jitter buffer as previously described. Thus, we only need to be able to measure the increase in jitter associated with added security, and then use the methods of Section 8.1.1 to determine the effect on voice quality.

Delay

Another effect of adding security is increased delay. Cryptographic algorithms are complex, and may require non-negligible amounts of time to perform. Additionally, packets waiting in an encryption queue can lead to increased jitter, resulting in the need for longer jitter buffers. Recalling Equation 2, we add the encryption terms to the equation for $t_a$, the absolute delay:

$t_a = t_{a-original} + t_{jitter} + t_{encrypt} + t_{decrypt} + t_{encryption-buffer}$ (10)

where $t_{encrypt}$ and $t_{decrypt}$ represent the time to encrypt and decrypt a single VoIP packet and are dependent on the encryption algorithm and implementation, as well as the size of the VoIP packet. Additionally, the time spent in the encryption buffer, $t_{encryption-buffer}$, is dependent on the encryption mechanism as well as the amount of traffic passing through the encryptors. The $t_{jitter}$ term represents jitter due to factors other than encryption times, and is the same term from Equation 2.

The cryptographic algorithms are also dependent on the implementation and speed of the computer performing the encryption. Faster computers will perform encryption operations faster.
For our model we assume Openswan IPSec running on Debian Linux on a computer with four 2.4 GHz Intel Xeon processors and 2 GB of memory, since this is the setup we used for our experiments (see Section 9).

Packet Loss

Encryption mechanisms will also affect packet loss rates. One effect will be the additional loss caused by packets waiting too long in the encryption buffer, as described previously. A second effect is the impact that the increased size of the encrypted packets has on the loss rate due to single-bit errors, described in Section 8.1.3. Recalling the equations for $P_{loss-error}$ given in Equations 8 and 9, we must plug in the new size, $size_{enc}$, for the packet after encryption. That is:

$P_{loss-error-enc} = 1 - (1 - ber)^{size_{enc}}$ (11)

or, if the $ber$ is low:

$P_{loss-error-enc} = ber \cdot size_{enc}$ (12)

If we know the ratio of the encrypted versus clear packet sizes, we can also express $P_{loss-error-enc}$ (under the same low-$ber$ approximation) as a product of the original $P_{loss-error}$ and this ratio. That is:

$P_{loss-error-enc} = \frac{size_{enc}}{size} \cdot P_{loss-error}$ (13)

Putting it all together, and referring back to Equation 7, the new equation for the loss rate, $P_{loss}$, is:

$P_{loss} = P_{loss-original} + P_{loss-jitter} + P_{loss-error-enc} + P_{loss-encryption-buffer}$ (14)

Bandwidth

Finally, any authentication and encryption mechanism comes with increased bandwidth. Encryption adds overhead to every packet, and the actual bandwidth observed on an encrypted network is significantly higher than the useful bandwidth seen at the endpoints. Thus, although bandwidth itself does not directly play a role in voice quality (besides the effect on error rates described above), increased bandwidth consumption will affect latencies and loss rates due to congestion. For the purposes of modeling we assume that a network administrator will understand how bandwidth contributes to delay and loss on a network, and do not include a bandwidth term directly in the Z-Model.
To help network planning, though, the Z-Model, in addition to providing a quality rating, will help estimate the increase in bandwidth caused by the addition of security. No claims are made, however, as to how the increased bandwidth affects other traffic characteristics, as we expect this to be largely dependent on the specifics of the network. The increased bandwidth is a function of the security implementation and the original size of the voice packets. By understanding the bandwidth requirements of secure VoIP, network administrators will be able to determine whether their secured networks will support the equivalent traffic volumes observed in their insecure systems.

8.3 Incorporating Conversational Improvements

In experimenting with various conversation models like the ones described in Section 7.5 we realized an important point. In many channels of communication in which there is a significant end-to-end delay, users adopt a signaling protocol to avoid talking over each other. Examples of this include the familiar use of the word "over" at the end of messages, and a push-to-talk system that only allows one person to talk at a time (users talk by pressing a button). Clearly there is an intrinsic benefit to using these systems over high-latency links, but this is not accounted for in the E-Model, where the only term that affects the absolute delay impairment ($I_{dd}$) is the end-to-end delay ($t_a$) (see Equations 5 and 6).

We speculate that the benefit of using such protocols is that it allows the parties to agree on what is occurring in the conversation. By saying "over" after I am done speaking, I am alerting you that I have finished and am now expecting you to talk. Both parties understand this alternating flow. On the other hand, in the absence of such a protocol, you might interpret a brief silence as the end of my statement and then, because of the long latency of the link, we end up talking over each other.
One way to define agreement is to use Brady's conversation states (refer back to Section 7.2 for a review of these states). In particular, we will break the conversation into states of double-talk, each individual speaker talking alone, and mutual silence. The more the two speakers agree on what state they are in, the less likely they are to miscommunicate as a result of the high-latency link.

As noted above, $I_{dd}$ is the only term in the E-Model that represents impairment due to absolute delay. For the Z-Model, we propose modifying the equation for $I_{dd}$, which currently depends only on absolute delay. In particular, we believe that this term should depend on the fraction of time the two parties disagree on the conversation state, which we call the disagreement factor:

$\text{disagreement factor} = \frac{t_{disagree}}{t_{agree} + t_{disagree}} = \frac{t_{disagree}}{t_{total}}$ (15)

where $t_{agree}$ is the time the two speakers agree on which state they are in, $t_{disagree}$ is the time they disagree, and $t_{total}$ is the length of the conversation. Note that since agreement and disagreement are mutually exclusive, $t_{total} = t_{agree} + t_{disagree}$.

It should be recognized that in the presence of any significant delay this disagreement fraction will always be nonzero. This is because there is a difference between the time one of the parties starts or stops talking and the time the other party perceives it. In fact, assuming that the delay is much shorter than every talk and silence spurt allows the disagreement factor to be determined in a simpler manner by estimating the disagreement time based on the absolute delay, $t_a$, and the number of state transitions, $N_{transitions}$:

$t_{disagree} \approx N_{transitions} \cdot t_a$ (16)

This estimate follows from each state transition introducing a time of length $t_a$ during which the parties are sure to disagree on the conversation state. Equation 16 becomes more accurate the longer individual states are, because $t_{disagree}$ is less likely to be affected by two transitions happening close together.
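Equations 15 and 16 can be compared with a short sketch. It rests on one assumption about how disagreement is measured: each speaker perceives their own speech immediately and the other speaker's speech after the delay $t_a$, and the parties disagree whenever the two perceived states differ. The function names, the interval representation, and the example talkspurts are illustrative only and are not taken from the traffic generator.

```python
def talking(intervals, t):
    """True if time t (ms) falls inside any (start, end) talkspurt."""
    return any(start <= t < end for start, end in intervals)


def disagreement_factor(a_talk, b_talk, delay_ms, total_ms, step_ms=1):
    """Equation 15, measured by stepping through the conversation."""
    disagree = 0
    for t in range(0, total_ms, step_ms):
        # Each side hears itself now and the other side delay_ms late.
        state_at_a = (talking(a_talk, t), talking(b_talk, t - delay_ms))
        state_at_b = (talking(a_talk, t - delay_ms), talking(b_talk, t))
        if state_at_a != state_at_b:
            disagree += step_ms
    return disagree / total_ms


def approx_disagreement_factor(n_transitions, delay_ms, total_ms):
    """Equation 16 substituted into Equation 15."""
    return n_transitions * delay_ms / total_ms


# Long talkspurts: every state lasts much longer than the 300 ms delay.
exact_long = disagreement_factor([(1000, 3000)], [(5000, 7000)], 300, 10000)
approx_long = approx_disagreement_factor(4, 300, 10000)

# A 100 ms talkspurt is shorter than the delay, so Equation 16 overestimates.
exact_short = disagreement_factor([(1000, 1100)], [], 300, 10000)
approx_short = approx_disagreement_factor(2, 300, 10000)
```

With long talkspurts the stepped measurement and the approximation coincide; with a talkspurt shorter than the delay, the two transitions overlap and the approximation counts more disagreement time than actually occurs.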
Note that this equation relies on the assumption that delay times are equal in both directions of communication. The two cases are illustrated in Figures 18 and 19. In Figure 18 the time in each state is longer than the delay, $t_a$, and Equation 16 holds. In Figure 19, however, the time in the two talking states is shorter than $t_a$ and the equation does not hold.

We had originally hoped to design a new equation for the impairment due to absolute delay, $I_{dd}$, that was based on this disagreement factor and not on the latency, $t_a$. Unfortunately, we did not have enough data to substantiate this equation, so we leave that as a subject of future work. Here we merely suggest that this disagreement factor should be considered over (or perhaps in addition to) the one-way latency in determining impairments due to delay. We discuss the validity of this claim in Section 10.3.1.

[Figure 18: Agreement and Disagreement Time with $t_a$ Shorter than State Lengths]

[Figure 19: Agreement and Disagreement Time with $t_a$ Longer than State Lengths]

9 Resources

This section describes the resources available for evaluating secure VoIP systems in varying network conditions. Section 9.1 describes a VoIP traffic generating program written based on the results of Section 7. Section 9.2 describes a link emulator program useful for simulating different connection parameters. Section 9.3 describes the devices used to perform the security protocols, and finally, Section 9.4 describes the network in which experiments were run.

9.1 Traffic Generator

A VoIP traffic generator was written in C++ to support scriptable, simulated conversations over a network. We used the C++ Portable Types Library[21] to allow the traffic generator to run on both Windows and Linux machines.
We also relied on an open-source implementation of RTP and RTCP developed by Jori Liesenborgs known as JRTP[20].

The traffic generator models a Cisco 7940 IP phone. Like that phone, it uses SIP running over UDP for setting up and tearing down calls, and uses RTP for transmitting the voice streams. Voice streams use the same packet size and formatting as the G.711 codec used by the phone, although the payload data was sent from a file of random bits instead of actual encoded speech. The same data file was used in all experiments. The traffic generator also models the phone's voice activity detection, so packets are sent in bursts when the users are talking.

The talking patterns of the users are governed by Brady's conversation model (described in Section 7.2) with appropriate VoIP-specific input parameters. For most experiments the average parameters were used to simulate a ten-minute call. However, having the traffic generator allowed many different types of conversations to be modeled under the same network conditions. This feature was used to evaluate the disagreement factor described in Section 8.3. For all of our experiments we focused on the conversation quality provided by the RTP data streams, and not on call setup and takedown.

9.2 Link Emulator

We employed a link emulator program developed at MIT Lincoln Laboratory that intercepts and processes packets at the link layer, and can simulate specific latencies, error rates, and bandwidth constraints.

Since part of this study was an investigation into the performance of VoIP over airborne links, the characteristics of various communication channels used by airplanes were important to know.

Link Type    BER      BW (kbps)   Latency (ms)   Description
TCDL         10^-7    10^4        2              Line of sight
Connexion    10^-6    128         325            Boeing, satellite
Inmarsat     10^-7    128         325            Geosynchronous satellites
Iridium      10^-5    2.4         2000           66 low earth orbiting satellites

Table 8: Characteristics used to Simulate Various Airborne Links
Four main types of airborne communication are summarized in Table 8. TCDL is a line-of-sight communication that usually occurs between two planes. The other three are forms of satellite communication, and thus have lower available bandwidths and much higher latencies. Connexion is a satellite-based Internet connection provided by Boeing and commercially available on many airplanes[47]. Inmarsat offers a similar type of communication via its own network[48]. Finally, Iridium is a network of low earth orbiting satellites that provides communication anywhere on Earth, and is now being used primarily by the US Government[49].

9.3 Encryption Devices

As mentioned in Section 5.1.3, Openswan[41] was chosen for our experiments from several different implementations of IPSec because it provides the most flexibility in selectable encryption algorithms, including AES. Additionally, Openswan has been adopted as the IPSec implementation for the Fedora Core and Debian Linux operating systems, and is believed to be quite stable. All of our encryption was done in software, and IPSec ran in tunnel mode the entire time. Authentication headers were not used.

9.4 Test Network

The network used in the experimentation is shown in Figure 20. Two identical computers running Fedora Core 1 Linux serve as endpoints. The VoIP traffic generator discussed in Section 9.1 was run at these two points. The endpoints are connected by an IPSec tunnel that runs between two nodes labeled IPSec1 and IPSec2. These IPSec hosts both run Debian Linux with the Openswan implementation of IPSec. Finally, between the IPSec hosts are a computer performing static routing using the kernel, and another node, labeled LinkEm, that runs a link-layer link emulator, as described in Section 9.2. When IPSec is turned on, traffic is unencrypted in the lighter, left side of the figure, and encrypted in the darker, right-hand side.
The computers labeled VoIP User 1 and VoIP User 2 were Dell 1750 servers with 2 GB of RAM and two 2.4 GHz Intel Xeon processors. The IPSec machines, LinkEm, and the Debian router were Silicon Mechanics 1271A 1U rackmount PCs with 2 GB of RAM and four 2.4 GHz Intel Xeon processors. The computers were connected by 1 Gbps Ethernets and switches, but the network interface cards were 100 Mbps, which was the limiting factor. Traffic dumps were taken at several points, labeled tcpdump(1-7), depending on what exactly was being measured. Dumps were taken using tcpdump[45], a program built into most Linux implementations that records traffic seen on a particular network interface.

[Figure 20: Test Network for Experimentation]

10 Methodology

This section describes the experiments run to evaluate the performance of secure VoIP and to develop and test the parameters of the Z-Model. Section 10.1 describes a set of experiments designed to evaluate the performance of encrypted and unencrypted VoIP under a variety of isolated network conditions. Section 10.2 goes on to explain how the results of these experiments can be used to determine the various constants found in the Z-Model. Finally, in Section 10.3 the Z-Model is evaluated as a tool for predicting conversation quality, and the feasibility of secure VoIP in a variety of wireless airborne network types is discussed.

10.1 Understanding the Performance of Secure VoIP

The first set of experiments was designed to study the impact of security on a Voice over IP system's overall conversation quality. As a first step, various encryption algorithms were compared against each other under optimum conditions. Then we focused on 128-bit AES, and evaluated its performance under conditions of increased traffic, reduced bandwidth, and various loss and error rates. In each experiment the average delay and packet loss were measured over time.
These values could then be used in refining the Z-Model to determine the impact security has on various transmission characteristics.

10.1.1 Experiment 1: Performance Under Optimum Conditions

In our first experiment we sought to determine the performance of VoIP with IPSec running in ideal conditions. This meant running limited traffic over low-latency, high-bandwidth communication channels. The results of this experiment translate to the best possible quality of service for a secure VoIP call. It is expected that voice quality will decrease from these values as latency and traffic increase, and bandwidth decreases.

In addition to determining this baseline quality value, we also sought to compare different encryption algorithms. As mentioned in Section 9.3, Openswan supports a variety of encryption algorithms including DES, 3DES, AES (also known as Rijndael), Blowfish, and Twofish. Additionally, compression can be turned on and off, and varying key sizes can be used. For this experiment we analyzed the performance of 3DES, AES with both 128 and 256-bit keys, and Blowfish. Additionally, we tested the performance of 3DES and 128-bit AES with and without compression.

To measure the performance of the various algorithms a single 10-minute simulated VoIP call was made using the VoIP traffic generator. The parameters input into the traffic generator were the average of the data from the Switchboard corpus described in Section 7, and listed in Table 5 of that section.

                                       Clockwise Communication      Counterclockwise Communication
Algorithm    Key Bits   Compression    Encryption    Decryption     Encryption    Decryption
                                       (ms)          (ms)           (ms)          (ms)
3DES         128        no             .054          .053           .086          .048
3DES         128        yes            .047          .053           .076          .049
AES          128        no             .037          .035           .034          .060
AES(solo)    128        no             .027          .034           .027          .035
AES          128        yes            .027          .034           .035          .026
AES          256        no             .034          .034           .036          .058
Blowfish     128        no             .031          .031           .033          .052

Table 9: Baseline Per-Packet Delays for Various Encryption Algorithms in Openswan
The same parameters and payload were used for all runs.

It was our initial plan to measure end-to-end delays of the packets between the two VoIP endpoints. When we attempted this, however, we found that the delays under these ideal conditions were on the same order of magnitude as the precision with which we were able to sync the clocks (see footnote 5). Thus, we decided to look at the delays going in and out of the encryption boxes, where we could use a single clock for reference. Using tcpdump running on the Openswan boxes at locations 2, 3, 5 and 6 of Figure 20, the time every packet was seen at each side of the encryption device was recorded. Then, using a Perl tool called pkt-scripts, the average delay was computed across all packets. For each call four delays were recorded: the outbound encryption and inbound decryption times at each of the two boxes. The values of these delays are summarized in Table 9.

A few things can immediately be seen from Table 9. The most important result is that encryption under these low-traffic conditions does not result in a significant loss in voice quality. Note that the highest delay from encryption was under 1/10th of a millisecond, and the total one-way delay from both encryption and decryption is well under .15 milliseconds for all algorithms. According to the E-Model, delays don't begin to impair quality until they are greater than 100 ms. Thus, the delay overhead due to encryption under these circumstances is less than .15% of the delay required to become an impairment to quality. Compared to other factors, such as round trip time, this delay is very small. Thus, we believe that the encryption overhead under ideal conditions can be virtually neglected, regardless of algorithm or key size.

A few other interesting results can be obtained from Table 9. One thing to note is that the claim that AES is faster than 3DES was verified for this data, as all implementations of AES, including with a 256-bit key, were faster than 3DES.
Additionally, it was observed that using compression gave a slight performance boost overall, although the difference was not extremely significant. The performance boost is likely due to the compression of the RTP headers, since the voice data itself is already compressed by the codec.

Footnote 5: Clocks were synched using NTP[76].

It is important to note that these encryption times are not fixed, but rather are functions of the amount of traffic and the hardware and software performing the encryption. This can immediately be seen by the fact that node IPSec2 usually outperformed node IPSec1 given the same traffic rate and algorithm. Since the nodes are identical machines, it was hypothesized that other processes were tying up the kernel on node IPSec1. To verify this hypothesis we performed a single test of uncompressed AES encryption at a time when no extraneous processes were running on the machine. The result of this test is shown in Table 9, on the line labeled "AES(solo)", and it can be seen that the times on the two machines were almost identical. The performance of 128-bit AES with compression was also measured under "clean" conditions, and comparing it to the clean test of AES without compression reveals that for VoIP, compression seems to play a negligible role in encryption times.

Returning to the main result of this experiment, we conclude that under ideal conditions, every encryption algorithm runs fast enough to support secure VoIP with negligible loss of quality.

10.1.2 Experiment 2: Performance of Openswan with Increased Traffic

In order to gain a better understanding of how Openswan performed, the second round of experiments ran a VoIP call with significantly more traffic running through the IPSec machines. The goal of this experiment was to determine how the delay introduced by encryption scaled with the traffic going through it for various encryption algorithms.
We used the same setup shown in Figure 20, and again simulated a 10-minute call with average parameters. This time, however, we also used iperf[46] to generate a stream of UDP traffic between the two VoIP endpoints. The Openswan boxes would encrypt this stream of traffic in addition to the VoIP call. We believed that as traffic increased and the encryptors were forced to do more processing, end-to-end delays would noticeably increase.

The purpose of this experiment was to put strain on the encryptors, not the network links. For this reason, all of the various nodes were connected via switches and Gbps Ethernets. The link emulator on LinkEm simply passed all traffic through, which is roughly equivalent in behavior to a switch.

For this experiment, and most of the remaining experiments, we focused on AES encryption with a 128-bit key. AES was chosen as it has been adopted as the standard encryption algorithm by both NIST and the U.S. Government. Additionally, a 128-bit key was chosen because key sizes in AES beyond 128 bits are currently considered unnecessary. As computers get increasingly fast the need to increase key size becomes relevant, but barring any unknown attack on AES faster than key exhaustion, it is believed that 128-bit AES will be secure for well over 20 years[52].

[Figure 21: End-to-End Delays for 128-bit, Uncompressed AES and Clear Communication with Varying Levels of Traffic. Plot title: Average Delay with Variable Traffic. Series: AES-128, unencrypted. Horizontal axis: Additional Traffic (Mb/s).]

We performed two sets of experiments, one with encryption turned off, and one with 128-bit AES. We then simulated 10-minute calls with average parameters and additional UDP traffic of 1, 5, 10, 50, 70, and 85 Megabits per second. The 100 Mbps network interface cards on several of the machines prevented data rates larger than 85 Mbps from being explored.
For our generated data, 1400-byte packets were sent, because iperf was unable to handle sending high volumes of traffic in smaller-sized packets. Delays were measured from end to end on the clear sides of the IPSec boxes (locations 2 and 6 in Figure 20), in hopes of being able to make statements about the end-to-end system as a whole. Although for some of the experiments the accuracy of clock synchronization became a factor, statements could be made, based on symmetry, about the average observed delay in both directions, as well as the relative increased delay introduced by encryption. The results of these experiments are summarized in Figure 21.

As can be seen in Figure 21, delays increased as more traffic was added to the network, as expected. What was surprising was the relatively small effect encryption had on increasing delays. Using a linear approximation for the two curves (which are approximately linear) it was determined that the slope of the clear experiment was 3 microseconds per megabit/second, while for AES this value increased to 8 microseconds per megabit/second. The result was that with 85 megabits per second of additional traffic, AES was about .5 milliseconds, or approximately 1.5 times, slower. Still, compared to the 100 milliseconds (cited by the E-Model) required for absolute delays to impair conversation quality, the computational overhead introduced by encryption is quite small.

[Figure 22: Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (End-to-end throughput). Plot title: Fraction of VoIP Packets Lost in 128kb/s Link for Varying Additional Traffic Rates. Series: Encrypted, Unencrypted. Horizontal axis: Additional Traffic (kb/s).]
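The arithmetic behind this comparison can be sketched as follows. The two slopes are the fitted values quoted above; treating the extra delay as the slope difference times the traffic rate (ignoring any difference in intercepts) and clamping the delay impairment to zero at or below 100 ms are simplifying assumptions made for illustration. The function and constant names are hypothetical.

```python
import math

CLEAR_SLOPE_US_PER_MBPS = 3  # fitted slope without encryption
AES_SLOPE_US_PER_MBPS = 8    # fitted slope with 128-bit AES


def added_delay_ms(traffic_mbps):
    """Extra delay attributable to encryption under the linear fits."""
    slope_diff = AES_SLOPE_US_PER_MBPS - CLEAR_SLOPE_US_PER_MBPS
    return slope_diff * traffic_mbps / 1000  # microseconds to milliseconds


def idd(ta_ms):
    """Delay impairment from Equations 5 and 6 (assumed zero up to 100 ms)."""
    if ta_ms <= 100:
        return 0.0
    x = math.log(ta_ms / 100) / math.log(2)
    return 25 * ((1 + x**6) ** (1 / 6) - 3 * (1 + (x / 3) ** 6) ** (1 / 6) + 2)


extra_ms = added_delay_ms(85)           # slope difference at 85 Mb/s
impairment_at_threshold = idd(100 + extra_ms)
impairment_at_200ms = idd(200)          # for scale: a delay of 200 ms
```

Under these assumptions the slope difference accounts for roughly 0.4 ms at 85 Mb/s, close to the .5 ms reported, and adding that amount to a link already at the 100 ms threshold leaves the delay impairment essentially at zero.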
10.1.3 Experiment 3: Performance over Low-Bandwidth Links

The third experiment we ran sought to determine the significance of the bandwidth overhead introduced by encryption. There are several scenarios in which Ethernet or equivalent bandwidth links are not available. This is especially true in wireless airborne networks where bandwidth can be severely limited.

For this experiment we changed the link emulator on the LinkEm computer to simulate a low-bandwidth link. For simplicity, a fixed bandwidth of 128 kbps was chosen, with a bit error rate of 0 and no additional latency. 128 kbps is the typical bandwidth found in Connexion and Inmarsat communication[47, 48], two common forms of satellite links. We then ran a single VoIP call, along with variable amounts of additional traffic generated with iperf (as described in Section 10.1.2) for both clear and encrypted communication. Iperf packets of 500 bytes were sent. The results of this experiment can be seen in Figure 22.

As can be seen in Figure 22, loss rates for encrypted communication are significantly higher than for clear communication, especially for low volumes of additional traffic. While encrypted communication demonstrates significant packet loss for only 20 kbps of additional traffic, unencrypted communication exhibited no loss until over 40 kbps of traffic was added. To better understand why this occurs we must understand how the bandwidth of the channel is filling up.

[Figure 23: Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (Actual Packet Sizes). Plot title: Fraction of VoIP Packets Lost in 128kb/s Link with Additional Traffic. Series: Encrypted, Unencrypted. Horizontal axis: Total Observed Traffic on Link (kb/s).]
Since we are simulating a G.711 voice codec, the Ethernet bandwidth used by the VoIP call alone was measured to be 80 kbps. This represents a significant portion of the 128 kbps available to the channel. When encryption is applied, the size of the VoIP packets increases from 200 to 256 bytes, raising the bandwidth to 102.4 kbps. This leaves even less room for additional data flows. Additionally, encryption introduces an overhead to the 500-byte iperf packets as well, causing them to increase in size to 600 bytes. By using these updated packet sizes and combining the VoIP and iperf data we were able to re-plot packet loss versus total used bandwidth. This is shown in Figure 23.

As can be seen in Figure 23, when the updated sizes for the encrypted packets are considered, the amount of packet loss versus bandwidth is comparable for encrypted and unencrypted communication. Additionally, it can be observed that the point where loss begins to occur is very close to the 128 kbps capacity of the channel. As the capacity of the channel is approached and exceeded, packet loss increases rapidly.

The encrypted data point at approximately 150 kbps is anomalous. We speculate that, since we are observing only loss rates in the VoIP packets (as opposed to total loss rates of combined VoIP and iperf packets), this run was simply "lucky", and a lot of iperf packets (as opposed to VoIP packets) were dropped. This would result in a low observed loss rate for VoIP, even though a greater fraction of packets could have been dropped overall. We speculate that repeated runs of these experiments would cause the data to more closely align, but leave that experiment as a topic of future work.

[Figure 24: Introduced Loss Rates for Clear and Encrypted Communication. Plot of the observed fraction of packets lost (clear and AES), together with the expected line, versus the requested fraction of packets lost.]
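The bandwidth accounting used to re-plot Experiment 3 can be sketched in a few lines. This is an illustration only, under two assumptions that are not stated explicitly in the text: that the VoIP stream sends 50 packets per second (the rate implied by 200-byte packets at 80 kbps), and that the additional traffic rate is quoted in terms of the 500-byte clear iperf packets. All names are hypothetical.

```python
LINK_CAPACITY_KBPS = 128.0

# Packet sizes reported in the text (bytes).
VOIP_CLEAR_BYTES = 200
VOIP_ENCRYPTED_BYTES = 256
IPERF_CLEAR_BYTES = 500
IPERF_ENCRYPTED_BYTES = 600

# Assumption: 200-byte packets at 80 kbps imply 50 packets per second.
VOIP_PACKETS_PER_SECOND = 50


def voip_rate_kbps(encrypted: bool) -> float:
    """Bandwidth of a single VoIP call on the link, in kbps."""
    size = VOIP_ENCRYPTED_BYTES if encrypted else VOIP_CLEAR_BYTES
    return size * 8 * VOIP_PACKETS_PER_SECOND / 1000


def total_link_load_kbps(additional_kbps: float, encrypted: bool) -> float:
    """VoIP plus background traffic as actually seen on the link.

    Assumption: additional_kbps is the rate of the clear 500-byte iperf
    packets; encryption scales it by the 600/500 size ratio.
    """
    if encrypted:
        additional_kbps = (additional_kbps * IPERF_ENCRYPTED_BYTES
                           / IPERF_CLEAR_BYTES)
    return voip_rate_kbps(encrypted) + additional_kbps


if __name__ == "__main__":
    for extra in (0, 20, 40, 60):
        clear = total_link_load_kbps(extra, encrypted=False)
        enc = total_link_load_kbps(extra, encrypted=True)
        print(f"{extra:3d} kbps extra: clear {clear:6.1f} kbps, "
              f"encrypted {enc:6.1f} kbps "
              f"(capacity {LINK_CAPACITY_KBPS:.0f} kbps)")
```

With these assumptions, 20 kbps of additional traffic already brings the encrypted load to about 126 kbps, just under the 128 kbps capacity, while the clear load with 40 kbps of additional traffic is still only 120 kbps. This matches the qualitative picture in Figures 22 and 23.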
10.1.4 Experiment 4: Performance over Links with High Loss and Error Rates

The fourth experiment sought to isolate another characteristic of some airborne links: non-negligible drop and error rates. The first thing we looked at was drop rates. Using the LinkEm software we simulated drop rates of .1%, 1%, 5%, 10%, and 20%, and for each case ran a clear and an encrypted VoIP call. The results are shown in Figure 24. It can be seen in this figure that encryption does not appear to affect drop rates at all. There is no observable difference between the clear and encrypted curves; each one more or less follows the expected line. In some cases one slightly outperformed the other, but we attribute this largely to the random drop modeling of the LinkEm program, and not to the use of encryption.

[Figure 25: Loss Rates versus Bit Error Rates for Clear and Encrypted Communication. Plot of drop rate versus bit error rate for unencrypted and encrypted calls.]

A second factor considered was bit error rates. This is the probability that any given bit of a packet is accidentally flipped (a 0 becomes a 1 or vice versa). Bit error rates are quite significant because many times the loss of a single bit is enough to force the packet to be discarded. This is especially true for encrypted data, where a single bit error makes the packet impossible to decrypt, but can also be true for compressed data (e.g. compressed speech). Because of this, for this experiment any time one or more bits of the packet had errors, we treated the packet as lost. We chose to simulate error rates ranging from $5 \times 10^{-7}$ to approximately $2.5 \times 10^{-4}$. Simulating bit error rates higher than this became difficult, as setting up the call often failed due to errors. For each error rate, we measured the percent of packets containing at least one error, which we called the drop rate.
Headers were omitted when looking for bit errors, as some headers can change naturally (e.g. the time to live field of an IP header) and these changes should not be counted as errors. It is not known whether information could be salvaged from the clear packets containing bit errors because we used simulated speech, but encrypted packets with errors would certainly be dropped. The results of this experiment are shown in Figure 25.

It can be seen that the use of encryption drastically increases the likelihood of a dropped packet. In fact, in each experiment performed, encrypted packets were 2.5 times as likely to be dropped with an equivalent bit-error rate. We believe that this increased drop rate is largely due to the increased size of the encrypted packets, as discussed in Section 8.2.1. As the size of the packet grows, there are more opportunities for a single bit error to occur; thus, the per-packet error rate increases. We were, however, surprised at the magnitude of this increase, as we observed encrypted VoIP packets to be only 1.3 times the size of clear ones. This discrepancy is discussed further in Section 10.3.2.

10.2 Adding Security to the Z-Model

Having obtained traffic data for various encryption setups and link configurations, we are now ready to return to the discussion of Section 8 and fill in some terms in the various equations. In Section 8.2.1 we discussed the addition of security to the Z-Model in the context of four terms: jitter, delay, packet loss, and bandwidth. We will now add the effects of security to the Z-Model for each of these terms.

Jitter

In Section 8.1.1 we discussed the ability to treat jitter as a combination of a fixed delay and a packet loss rate by assuming the use of a jitter buffer. In particular, we recall Equations 2 and 3, reproduced here:

$$t_a = t_{\text{a-original}} + t_{\text{jitter}} \tag{17}$$

$$p_{\text{loss}} = p_{\text{loss-original}} + p_{\text{loss-jitter}} \tag{18}$$

Let's explore what values to choose for $t_{\text{jitter}}$ and $p_{\text{loss-jitter}}$ to handle jitter in the context of security.
Recall that jitter is defined as the variation in packet interarrival times. If we assume that VoIP systems send packets in a consistent, jitter-free manner, then jitter will only be a result of network characteristics. In particular, we can estimate jitter as the standard deviation of the delay experienced by each packet. We were interested in the increased jitter due to encryption. By calculating standard deviations of packet delays across all experiments we were able to determine when the worst-case increase of jitter occurred. In our testing, the largest increase in jitter from unencrypted to encrypted communication occurred with 128-bit AES with 85 Mbps of additional traffic. In this case the jitter for encrypted communication was .44 ms, while the jitter for clear communication was .32 ms. This is a 37.5% increase over the normal case. Thus we suggest that in the case of encrypted communication the estimated jitter should be 37.5% greater than that observed in clear communication. Network administrators would then be responsible for using this new value to determine jitter buffer lengths.

Knowing how to calculate increased jitter due to security is nice, but sometimes a fixed value is desired. We will examine jitter in our own network as an example of determining this value. To choose jitter buffer lengths in our own network, we will assume that packet delays can be treated as a Gaussian random variable. This is a common method of handling jitter in packet-switched systems [50, 51]. In this case, we can choose a value for a jitter buffer that is four times the maximum observed jitter. Since jitter is the standard deviation of packet delays and we are treating delays as a Gaussian variable, 99.99% of all packets will arrive within this window [35]. Using the scenarios that we observed, this results in a worst-case jitter buffer of length $t_{\text{jitter}}$ = 1.76 ms.
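The jitter arithmetic above is easy to reproduce. The sketch below is illustrative only: it assumes, as the text does, that packet delays are Gaussian and that jitter is the standard deviation of those delays, and it uses the two-sided coverage of a Gaussian as one reading of the "within this window" statement. The function names are hypothetical.

```python
import math

# Worst-case jitter values observed in the experiments (milliseconds).
CLEAR_JITTER_MS = 0.32
ENCRYPTED_JITTER_MS = 0.44


def relative_increase(clear_ms: float, encrypted_ms: float) -> float:
    """Fractional increase in jitter from clear to encrypted traffic."""
    return (encrypted_ms - clear_ms) / clear_ms


def jitter_buffer_ms(jitter_ms: float, multiplier: float = 4.0) -> float:
    """Jitter buffer length chosen as a multiple of the jitter."""
    return multiplier * jitter_ms


def gaussian_coverage(k: float) -> float:
    """Fraction of a Gaussian lying within k standard deviations of the mean.

    Assumption: two-sided coverage is used here as an approximation of
    the fraction of packets that arrive inside the buffer window.
    """
    return math.erf(k / math.sqrt(2.0))


if __name__ == "__main__":
    print(f"Jitter increase: "
          f"{relative_increase(CLEAR_JITTER_MS, ENCRYPTED_JITTER_MS):.1%}")
    print(f"Buffer length: {jitter_buffer_ms(ENCRYPTED_JITTER_MS):.2f} ms")
    print(f"Coverage at 4 sigma: {gaussian_coverage(4.0):.4%}")
```

The four-sigma coverage computed this way is slightly above 99.99%, which agrees with the value quoted from statistics theory.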
Assuming the 99.99% value given from statistics theory is accurate, the maximum value for $p_{\text{loss-jitter}}$ becomes .01% in this case. Since we are again considering worst-case scenarios, we will use this maximum value for packets lost due to a jitter buffer of length 1.76 ms. We now have worst-case values for $t_{\text{jitter}}$ and $p_{\text{loss-jitter}}$ for our test network.

It is important to stress that we are making no claims about the level of jitter in a particular network. Our network represented nearly ideal conditions, as packets traversed short physical distances over fast switches and Gigabit Ethernets. Our main goal was to determine the additional jitter introduced by security, which we estimate at 37.5% above the existing jitter. Network administrators must be aware of the level of jitter on their own network and adjust this value appropriately. We recommend the use of a jitter buffer of length $t_{\text{jitter}}$ equal to four times the average jitter, for a loss rate of less than .01%.

Delay

The next thing we will consider is delay. Recalling Equation 10 of Section 8.2.1, we note that encryption introduces three delays:

$$t_a = t_{\text{a-original}} + t_{\text{jitter}} + t_{\text{encrypt}} + t_{\text{decrypt}} + t_{\text{encryption-buffer}} \tag{19}$$

The delays $t_{\text{encrypt}}$ and $t_{\text{decrypt}}$ represent the time to encrypt and decrypt a single VoIP packet, and are fixed values per codec, while $t_{\text{encryption-buffer}}$ represents the variable delay caused by a packet waiting in the encryption buffer while other data is being processed. For this reason $t_{\text{encryption-buffer}}$ is dependent on the bandwidth going through the IPSec hosts.

Let's first consider $t_{\text{encrypt}}$ and $t_{\text{decrypt}}$. Experiment 1 measured the time in and out of the IPSec hosts for various encryption algorithms, but we choose to focus on AES. Recalling Table 9, the maximum observed encryption time for AES was .036 milliseconds, and the maximum decryption time was .060 milliseconds.
Since we are interested in conservative measurements, we will use these worst-case values and, again, significantly overestimate them for our formulas. In particular, we will choose a value for $t_{\text{encrypt}}$ of .072 milliseconds and a value for $t_{\text{decrypt}}$ of .12 milliseconds. Each of these values is twice what we observed, and yet their sum is still only a fraction of a millisecond.

Now let's consider $t_{\text{encryption-buffer}}$. We will express this value as a function of bandwidth passing through the encryptors. Recalling Figure 21, we note that the slopes of the best-fit lines for delays of 128-bit AES and clear communication versus bandwidth were 8 microseconds/Mbps and 3 microseconds/Mbps respectively. Thus increasing bandwidth introduces two kinds of latency: one independent of encryption that represents the delays associated with routers, links, etc., and a second representing the increased time in the encryption buffer. Following the philosophy of requiring network administrators to understand their own network, we are interested only in the encryption buffer time. We can obtain this time, as a function of bandwidth, by subtracting the two lines in Figure 21. This yields the equation:

$$t_{\text{encryption-buffer}}(\mu s) = 5\,\frac{\mu s}{\text{Mbps}} \times \text{bandwidth}(\text{Mbps}) \tag{20}$$

representing the observed delays introduced by encryption. However, following our conservative attitude about times, we will increase the slope of this relationship by a factor of two, yielding the relationship:

$$t_{\text{encryption-buffer}}(\mu s) = 10\,\frac{\mu s}{\text{Mbps}} \times \text{bandwidth}(\text{Mbps}) \tag{21}$$

Packet Loss

We now turn our attention to packet loss. Fortunately, this mostly turns out to be an easy situation. Recalling Equation 14 of Section 8.2.1, we had:

$$p_{\text{loss}} = p_{\text{loss-original}} + p_{\text{loss-jitter}} + p_{\text{loss-error-enc}} + p_{\text{loss-encryption-buffer}} \tag{22}$$

We have already shown that $p_{\text{loss-jitter}}$ is effectively 0 (at most .01% with the jitter buffer chosen above), and for the experiments we performed, packets were not lost due to the encryption buffer either. Thus $p_{\text{loss-encryption-buffer}} = 0$ as well.
Thus, the only overhead in terms of packet loss due to encryption is that due to the increased packet size causing more single-bit errors in individual packets. The additional loss can be calculated from any of Equations 11, 12, or 13, reproduced here:

$$p_{\text{loss-error-enc}} = 1 - (1 - ber)^{size_{enc}} \tag{23}$$

$$p_{\text{loss-error-enc}} \approx ber \times size_{enc} \tag{24}$$

$$p_{\text{loss-error-enc}} \approx \frac{size_{enc}}{size} \, p_{\text{loss-error}} \tag{25}$$

So besides this small overhead due to single bit errors, there is no additional loss due to encryption. There is, however, a caveat associated with this statement. Limitations of our particular network prevented testing the encryption buffers beyond 85 Mbps. We point out that there must be a value for which $p_{\text{loss-encryption-buffer}}$ is non-zero (an IPSec host cannot process packets infinitely fast). That said, since we were not able to reach this value, we will only state that for bandwidth less than or equal to 85 Mbps, IPSec does not introduce any additional packet loss. An exploration of the limitations of IPSec in this regard is left for future study.

Bandwidth

Finally, we consider bandwidth consumption. In Section 8.2.1 we pointed out that although bandwidth consumed does not directly influence call quality in any way, depending on the network it can drastically affect loss rates, delays, jitter, and other factors that do. This was clearly demonstrated by Experiment 3, which showed increasing loss as bandwidth was used up in a low-bandwidth environment. We make no claims as to how bandwidth will affect these factors on any particular network, and argue that this is the job of the network administrator to determine. We can, however, say something about the bandwidth introduced by the application of encryption. To do this, we will again consider the worst-case scenario. The most significant bandwidth increase came from 256-bit AES.
This protocol increased the size of the G.711 packets from 200 bytes to 264 bytes, resulting in a bandwidth increase from 80 kbps to 105.6 kbps, for a 32% increase in size. Since G.711 is the codec that requires the most bandwidth and 256-bit AES was the encryption algorithm that introduced the most bandwidth, we believe that network administrators should provide 110 kbps per secure VoIP call (50% more than for calls made in the clear), with a modest amount of additional bandwidth allocated as required by lower-level protocols.

In bandwidth-limited situations voice codecs other than G.711 could be employed. G.729, for example, can reduce bandwidth use from 80 kbps to approximately 30 kbps. While this is a good way to reduce the bandwidth overhead of VoIP, an exploration of the encrypted bandwidth requirements of other codecs was not performed. This is left as a topic for future work.

10.3 Evaluating the Performance of the Z-Model

In this section we evaluate the performance of the Z-Model as a tool for predicting conversation quality. We first look at the disagreement factor defined in Section 8.3. We go on to examine the accuracy of the Z-Model's predicted loss rates for clear and encrypted communication in Section 10.3.2. Finally, we discuss the performance of the Z-Model on a series of links designed to model wireless airborne communication. This analysis serves the dual purpose of evaluating the predictive capabilities of the Z-Model and determining the feasibility of high-quality, secure VoIP over these unfavorable network links.

10.3.1 Evaluating the Disagreement Factor as a Replacement for Absolute Delay

Recall that we introduced the disagreement factor, defined as $\frac{t_{\text{disagree}}}{t_{\text{agree}} + t_{\text{disagree}}}$, so that the impairment due to delay $I_{dd}$ in the E-Model would demonstrate a benefit from the use of a signaling protocol over high-latency links. To perform this test we used the link emulator to simulate various latencies. Then for each latency we played two types of conversations: an average one and the push-to-talk system. For the average case the parameters entered into the model are those given in Table 5, and the parameters for the push-to-talk case can be found in Table 7, both of Section 7.5. The experiments were repeated four times to determine a pattern of behavior. It was hoped that the push-to-talk system would result in a lower disagreement factor than the average case, which could be used to represent the quality improvement gained from its use over high-latency connections.

[Figure 26: Disagreement Factor for Varying Two-Way Latencies. Plot of disagreement factor versus delay (ms) for the average and push-to-talk conversation models.]

To evaluate the disagreement factor, we had the traffic generator program output a timestamp, along with the state it believed itself to be in, every 20 milliseconds. In this way the outputs of the two traffic generators could be compared to determine how often they agreed on the state of the conversation, and how often they disagreed. The clocks were synchronized to within a millisecond, so we believed that clock skew would play a minimal role in the results of this experiment.

We simulated links with latencies of 0, 200, 500, 1000, 1200, 1500, 1700, and 2000 milliseconds. This latency was on top of the end-to-end latency of our system which, because it was less than a millisecond, is considered negligible. Since this experiment was focusing on the disagreement factor, we did not turn on encryption, nor did we add any bandwidth constraints or loss rates. The results of this experiment are summarized in Figure 26. The error bars on the figure represent a single standard deviation computed from 4 runs. Data points with no error bars are the results of a single run.
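The comparison of the two traffic generator logs can be sketched as follows. This is a minimal illustration rather than the tool actually used: it assumes the two logs have already been aligned by timestamp into equal-length sequences sampled every 20 milliseconds, and that both endpoints label conversation states with the same vocabulary. The state labels and the function name are hypothetical.

```python
from typing import Sequence


def disagreement_factor(states_a: Sequence[str],
                        states_b: Sequence[str]) -> float:
    """Fraction of samples in which the two endpoints disagree on the state.

    With equally spaced samples this equals
    t_disagree / (t_agree + t_disagree).
    Assumption: the sequences are already time-aligned, one entry per
    20 ms sample, using the same state labels on both sides.
    """
    if len(states_a) != len(states_b):
        raise ValueError("state logs must have the same number of samples")
    if not states_a:
        raise ValueError("state logs must not be empty")
    disagreements = sum(1 for a, b in zip(states_a, states_b) if a != b)
    return disagreements / len(states_a)


if __name__ == "__main__":
    # Hypothetical labels: "A" = A talking, "B" = B talking, "S" = silence.
    log_a = ["A", "A", "A", "B", "B", "S"]
    log_b = ["A", "A", "B", "B", "S", "S"]
    print(f"Disagreement factor: {disagreement_factor(log_a, log_b):.3f}")
```

In this toy example the endpoints disagree in two of six samples, giving a disagreement factor of one third.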
As can be seen in the figure, the push-to-talk system significantly outperformed the average case for medium-length delays. For example, with a 1 second delay, using a push-to-talk system resulted in disagreement of conversation state between the two parties only 14% of the time, while using no protocol caused the parties to disagree about 30% of the time. These results are well over one standard deviation apart, and so we conclude that the disagreement factor is one way of representing the gains of using a push-to-talk system.

As delays became extremely long, the improvement of the push-to-talk system over the average case became less pronounced. We suggest that as the one-way latency of the system begins to approach the length of a normal talkspurt, all notion of agreement gets lost. In this case the fraction of time in the same state approaches the fraction obtained by simply treating the two sides as independent random variables with equivalent probabilities of being in any particular state. That is, we may as well not be participating in the same conversation as long as we participate in a conversation in the same way. The delay becomes so great that exactly what is happening on the other end is meaningless; only the statistical behavior of what is happening matters.

Another interesting factor was the high variance of the data observed for the push-to-talk case relative to the average case. Recalling Equation 16 for estimating the disagreement factor based on the number of state transitions, we had:

$$\frac{t_{\text{disagree}}}{t_{\text{total}}} \approx \frac{N_{\text{transitions}} \times t_a}{t_{\text{total}}} \tag{26}$$

where $t_a$ is the one-way latency and $N_{\text{transitions}}$ is the number of state transitions. This equation helps us understand the higher variation of the push-to-talk system with high latency. As can be seen in Figures 14 and 17, there are significantly more state transitions in the average case.
Since the number of transitions is a random variable depending on the input parameters, the lower the number of transitions is, the greater the variance in the number of transitions, and according to Equation 26, this will directly affect the variance of the disagreement factor.

10.3.2 The Loss Due to Errors

The next issue we sought to evaluate was our equation for converting bit-error rates to loss rates, given by Equation 11 in Section 8.2.1. Using this equation we estimated the loss rates we expected to obtain for unencrypted and encrypted communication under the error rates discussed in Section 10.1.4. These expected rates, along with the actual rates measured in Experiment 4, are shown in Figure 27.

[Figure 27: Expected and Observed Loss Rates versus Bit Error Rates for Clear and Encrypted Communication. Plot of drop rate versus bit error rate, showing observed and expected curves for unencrypted and encrypted calls.]

We notice immediately that for encrypted communication, the predicted and observed loss rates are almost identical. For clear communication, however, the observed rates were much lower than the predicted rate. One theory for this is that our tool to check for errors only looked at the payload of the packets, and not the headers. This was because some routers may modify upper-level headers (such as time to live). IPSec in tunnel mode, on the other hand, encrypts the entire packet, and any errors in the payload would cause the packet to be discarded. Thus, the loss rates calculated from packet errors for clear communication may actually underestimate the real loss rates. We did not pursue this hypothesis any further, however, and it is left as a topic for future work.
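The conversion from bit error rate to packet loss rate evaluated in this section can be written out directly. The sketch below assumes that the packet size in the formula is measured in bits and that bit errors are independent, which is what the $1 - (1 - ber)^{size}$ form implies. The function name is hypothetical, and the numerically stable `expm1`/`log1p` form is a choice made here, not something taken from the thesis.

```python
import math


def loss_from_bit_errors(ber: float, packet_bytes: int) -> float:
    """Probability that a packet contains at least one bit error.

    Implements 1 - (1 - ber) ** bits, assuming independent bit errors.
    expm1 and log1p keep the result accurate for very small error rates.
    """
    bits = packet_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


if __name__ == "__main__":
    clear_bytes, encrypted_bytes = 200, 256
    for ber in (1e-7, 1e-6, 1e-5):
        clear = loss_from_bit_errors(ber, clear_bytes)
        enc = loss_from_bit_errors(ber, encrypted_bytes)
        print(f"ber={ber:.0e}: clear {clear:.3e}, encrypted {enc:.3e}, "
              f"ratio {enc / clear:.2f}")
```

For the packet sizes used here the predicted ratio of encrypted to clear loss stays close to 256/200, or about 1.28, which is why the observed factor of 2.5 in Experiment 4 stood out.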
10.3.3 Using the Z-Model to Estimate and Measure Overall Conversation Quality

As a final step we sought to evaluate the validity of the Z-Model in predicting VoIP conversation quality. To do this, we compared what the model predicted to what was observed for four different scenarios. The four scenarios used were based on four main channels of communication available in airborne networks, and described in Table 8: TCDL, Iridium, Connexion, and Inmarsat. The only modification made to the link characteristics described in Table 8 is that the bandwidth of Iridium was increased from 2.4 to 128 kbps. The reason for this change was that 2.4 kbps is only a small fraction of the bandwidth required for a G.711 VoIP call (approximately 80 kbps), and we thought the extra latency and higher error rate of the Iridium line would be interesting to study in the absence of this bandwidth constraint. Because we are changing the available bandwidth, we won't be able to make actual claims about VoIP service over Iridium.

For each type of link we simulated a single VoIP call with average parameters, both encrypted with 128-bit AES and unencrypted. We first predicted the loss rates and delays of each channel based on the results of the previous section, and then used these results to determine an R-value and MOS representing the overall quality of the communication. We then compared this predicted score to the actual observed behavior. The results are discussed below.

TCDL

A Tactical Common Data Link (TCDL) requires a line of sight between communicating parties. Recall from Table 8 that TCDL communication has a bit error rate of $10^{-7}$, a bandwidth of 10' kbps, and a latency of 2 milliseconds. For unencrypted communication, we use Equation 8, along with the bit error rate of $10^{-7}$ and packet size of 200 bytes, to estimate a loss rate due to errors of $1.60 \times 10^{-4}$. We also estimate that the delay in the absence of encryption will be no greater than 4 milliseconds.
Plugging these values into the E-Model gives an R score of 94.3 and a corresponding MOS of 4.43. Thus, we predict that TCDL should provide excellent quality VoIP. (Please refer to Table 2 and Figure 5 for a review of R and MOS scoring.)

For encrypted communication over TCDL we use the encrypted packet size of 256 bytes to estimate a loss rate of $2.05 \times 10^{-4}$. Additionally, we use Equation 10 and the methods described above (setting $t_{\text{jitter}}$ = 1.76 ms and $t_{\text{encrypt}} + t_{\text{decrypt}} + t_{\text{encryption-buffer}}$ = .2 ms) to estimate a worst-case additional delay of 2 milliseconds. This would be added to the delay observed for unencrypted communication.

To verify our estimations we simulated a TCDL link and ran two calls, one encrypted and one unencrypted. For the unencrypted communication we observed 0 lost packets and an average delay of 3.02 ms. These values yield an R score of 94.3 and MOS of 4.43. Adding the 2 ms encryption overhead, we use a delay of 5 ms and the loss rate above to predict an R score for the encrypted case of 94.3 and MOS of 4.43. Thus, we believe that encryption would not hinder voice quality at all in TCDL communication. The observed encrypted communication experienced a loss rate of $2.3 \times 10^{-4}$ and an average delay of 3.8 ms. These measurements correspond to the same R score of 94.3 and MOS of 4.43. Thus, we conclude that TCDL communication will readily support high-quality encrypted and unencrypted VoIP.

Connexion

Connexion links are significantly less reliable and slower than TCDL links. In particular, Connexion links have an estimated bit error rate of $10^{-6}$ and a latency of 325 milliseconds. Using the same methods as described for TCDL, for unencrypted communication we estimate a loss rate due to errors of $1.6 \times 10^{-3}$ and a latency of 327 milliseconds. These values determine an R score of 76.7 and MOS of 3.89.
For encrypted communication, the increased size of encrypted packets results in a loss rate of $2.05 \times 10^{-3}$ and the overhead of encryption translates to a worst-case latency of 329 milliseconds. With these values, we estimate an R score of 76.5 and MOS of 3.89.

Running an unencrypted VoIP call over a Connexion link, we observed a loss rate of $7.6 \times 10^{-4}$ and an average delay of 326.4 ms. These values translated into an R score of 76.8 and MOS of 3.90, closely matching our predictions. For encrypted communication we saw a loss rate of $2.16 \times 10^{-3}$ and a delay of 327.2 ms, for an R score of 76.7 and MOS of 3.89, again matching our predictions. Thus, while the extra latency of Connexion over TCDL communication does hinder VoIP quality, the effects of encryption do not significantly impair quality any further.

Inmarsat

Inmarsat links are very similar to Connexion links. The latency of both is 325 milliseconds, with the difference being that Inmarsat links have a bit error rate of $10^{-7}$ as opposed to $10^{-6}$ for Connexion. Since Inmarsat has the same error rate as TCDL, its expected loss rates are the same: $1.6 \times 10^{-4}$ and $2.05 \times 10^{-4}$ for unencrypted and encrypted communication, respectively. Likewise, since the latency of Inmarsat links is the same as Connexion links, the expected delays are the same, with estimated values of 327 ms for unencrypted communication and 329 ms for encrypted communication. These values determine R scores of 76.7 and 76.5, with corresponding equivalent MOS values of 3.89 for encrypted and unencrypted calls. It should be noted that these values are the same as those predicted for Connexion, and the error rate appears to have minimal impact on quality. This is discussed in more detail below.

In our experiments we observed a loss rate of $2.1 \times 10^{-3}$ and a delay of 326.3 ms for unencrypted communication. This resulted in an R score of 76.8 and MOS of 3.90. For encrypted communication we saw a loss rate of $2.1 \times 10^{-4}$ and a delay of 326.9 ms.
This produced an R score of 76.7 and MOS of 3.89. Once again, the R scores and MOS values associated with Inmarsat communication were equal to those for Connexion, and the bit error rate was not a significant factor affecting quality.

Modified Iridium

As a final test we looked at a modified Iridium link. The modification we made was to increase the bandwidth from 2.4 to 128 kbps. This provided ample bandwidth for a VoIP call to be transmitted. The bit error rate for Iridium is $10^{-5}$, and the delay is 2000 ms.

A bit error rate of $10^{-5}$ produces an expected loss rate of .0158 for unencrypted packets, and .0203 for encrypted packets. The expected delays are 2002 ms for unencrypted data and 2004 ms for encrypted data. These values result in an R score of 46.1 and MOS of 2.37 for unencrypted data and an R score of 46.0 and MOS of 2.37 for encrypted data.

In our experiments with unencrypted data running over this link, we observed a loss rate of .018 and a delay of 2001 ms. This resulted in an R score of 46.1 and MOS of 2.37. For encrypted data we observed a loss rate of .041 and a delay of 2001 ms. This resulted in an R score of 45.8 and MOS of 2.36. Once again, the overhead due to encryption is quite small.

Connection    Encryption  p_loss   I_e     Delay (ms)  I_d    R     MOS
TCDL          none        0        0       3.0         0      94.3  4.43
TCDL          AES-128     2.3e-4   1.9e-3  3.8         0      94.3  4.43
Connexion     none        7.6e-4   7.2e-3  326.4       17.5   76.8  3.90
Connexion     AES-128     2.1e-3   .021    327.2       17.6   76.7  3.89
Inmarsat      none        2.1e-3   .020    326.3       17.5   76.8  3.90
Inmarsat      AES-128     2.1e-4   2.0e-3  326.9       17.6   76.7  3.89
Iridium-like  none        .018     .17     2001        48.1   46.1  2.37
Iridium-like  AES-128     0.041    .39     2001        48.1   45.8  2.36

Table 10: Factors Contributing to R score for Various Links

Discussion

The results of this section are summarized in Table 10. We note a few important results that can be observed from this table. The first noticeable result is that encryption did not significantly hinder quality in any of these cases.
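As a cross-check on the R and MOS columns of Table 10, the R-to-MOS mapping can be computed directly. The sketch below uses the standard conversion published with the E-Model in ITU-T Recommendation G.107; we assume this is the mapping reviewed in Table 2 and Figure 5, since it reproduces the MOS values reported above. The function name is hypothetical.

```python
def r_to_mos(r: float) -> float:
    """Convert an E-Model R score to an estimated MOS.

    Uses the standard ITU-T G.107 mapping. Assumption: this is the same
    mapping used for the scores reported in the text.
    """
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    # Observed R scores from Table 10.
    observed = {
        "TCDL": 94.3,
        "Connexion (clear)": 76.8,
        "Connexion (AES-128)": 76.7,
        "Iridium-like (clear)": 46.1,
        "Iridium-like (AES-128)": 45.8,
    }
    for link, r in observed.items():
        print(f"{link:24s} R = {r:5.1f}  MOS = {r_to_mos(r):.2f}")
```

Because the mapping is fairly flat near the top of the scale and R differences between the clear and encrypted cases are a few tenths of a point at most, the resulting MOS values differ by no more than .01.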
For all links we looked at, encryption affected the R score by at most .3 and MOS by at most .01. This is good to know; encryption alone should not impair the quality of communication over any of these types of links.

A second interesting result is that for all of these links, the effect of latency on the overall score was much more pronounced than the effect of packet loss. The severe impairment caused by the delays of these links suggests that a push-to-talk system might improve conversation quality, but lack of subjective measurements made this impossible to investigate further. In the E-Model, latency is encapsulated by the $I_d$ term, which varied from 0 to 48 for the various links. The impact of packet loss, which is characterized by the $I_e$ term, was much less pronounced. For all of our experiments, packet loss never impaired the R score by more than a few tenths of a point. Compared to 48 points for delay in Iridium systems, or even 17.6 points for delay over Inmarsat and Connexion links, this small degradation in quality could largely be ignored.

11 Future Work

In this thesis, a technique for estimating VoIP call quality under various network and security conditions has been presented. This work is believed to be a good starting point for those interested in modeling and evaluating VoIP systems. There are, however, several areas for future study. These are broken down into four categories: conversation modeling, security, network characteristics, and voice quality.

11.1 Conversation Modeling

We did an extensive investigation on how to model the behavior of two users talking on a Cisco 7940 phone running a G.711 VoIP codec. While we believe that our research into the parameters of Brady's model accurately reflects the behavior of this system, we do not make any claims about how other systems behave. For example, the 2.3 second additional talkspurt length resulting from the Voice Activity Detection may not apply to other VoIP implementations.
Additionally, we did not attempt to model the behavior of any other VoIP codecs, and believe that it is important to investigate these systems at a later date. There were also conversation types that were not implemented, including calls involving more than two people. It would be interesting to observe the bandwidth requirements for calls as the number of parties increases, as presumably each person in the call will be talking a smaller percentage of the time.

We also did not send real speech over network links, and instead sent random bits. If speech data is compressible this could have reduced the bandwidth overhead of IPSec by allowing packets to be compressed before encryption. Since speech packets are already compressed we believe this would not be the case, but we are unable to state this definitively.

Finally, outside of the average case, we did not develop different user models based on a data set, but instead manually generated parameters we believed would reflect a particular style of conversation. We suggest a comprehensive study into the actual parameters observed for these systems as a topic for further review.

11.2 Security

This paper was not an exhaustive overview of VoIP security. Instead, we focused on conversation quality impairment due to implementing particular security protocols, mainly 128-bit AES. We did not research the effect of security being added to call setup and takedown, and suggest this as a topic for further study. Additionally, we assumed that security setup (including key exchange) was performed prior to the VoIP call, and did not consider the case where security protocols fail to set themselves up correctly. We also chose to focus our experiments on IPSec security, and did not consider the various other security protocols that could be used in VoIP systems, including SRTP and SSIP. Even in the context of IPSec there were still several other options that we did not fully explore.
While we did look at several different security algorithms, we did the bulk of our experiments with 128-bit AES. The experiments could be repeated with other algorithms to determine if differences in efficiency become more apparent under sub-optimal network conditions. Additionally, implementations of IPSec other than Openswan could be compared and the performance of each evaluated.

11.3 Network Characteristics

In our evaluation of the Z-Model, we only looked at four types of networks. There are many types of networks outside the realm of airborne networks that would be interesting to look at. In particular, the performance of secure VoIP over 802.11 and Bluetooth wireless networks is a topic we suggest for future study.

We also used link emulators to simulate the characteristics of these network links. We do not know if the performance of secure VoIP was influenced by these simulated (as opposed to real) connections. The only way to determine this would be to run secure VoIP over the actual satellite links.

11.4 Voice Quality

Perhaps the most important subject for future work is performing a qualitative study on the performance of secure VoIP. The E-Model, although developed based on qualitative research, is not a perfect measure of voice quality. Only qualitative research with human conversants can accurately evaluate the performance of any conversation quality model.

Additionally, it would be very helpful to know the perceived quality gain from the use of a push-to-talk system over no protocol for high-latency links. This would provide a basis for describing the dependence of quality on the disagreement factor proposed in Section 8.3. Unfortunately, since no qualitative evaluations of push-to-talk systems were available, we were unable to model this dependency.

12 Conclusion

The impact of security on VoIP systems was evaluated for varying network conditions.
A model for VoIP was developed, based on a conversation model developed by Paul Brady for use on circuit-switched calls. For the system we tested, employing voice activity detection had an effect that was similar to adding a fixed buffer to the end of every talkspurt. This effect was incorporated into Brady's model, resulting in a set of parameters that were drastically different from the ones observed by Brady. In addition to determining how average VoIP traffic patterns differed from circuit-switched calls, user models were also developed to represent different styles of conversation.

To evaluate conversation quality, various metrics, including MOS, PESQ, and the E-Model, were considered. Ultimately the E-Model was chosen as the basis on which to build a secure VoIP conversation quality model. Using the E-Model, various network characteristics and how they affect conversation quality were considered to develop a new model, the Z-Model, which incorporated security. The impact of security protocols on packet loss, delay, and jitter was considered, as well as the effects of bit error rates and bandwidth usage.

Additionally, a new metric, called the disagreement factor, was introduced as a possible replacement for delay in the E-Model's delay impairment equation. This metric was introduced to encapsulate the improved conversation quality of push-to-talk systems over high-latency links. It was found that this factor did demonstrate improved performance under these circumstances, but the exact relation between the disagreement factor and the R score was not determined due to lack of subjective data.

To determine the security dependencies of the Z-Model, a series of experiments was run. In each experiment a particular security or network characteristic was isolated. Results from encrypted trials were compared to unencrypted communication to isolate the overhead due to encryption.
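For readers comparing the two scales used above, the R score produced by the E-Model can be converted to an estimated MOS with the mapping given in ITU-T G.107. The sketch below implements only that published conversion; the function name and the example R values are illustrative and are not taken from the experiments.

```python
def r_to_mos(r: float) -> float:
    """Estimated MOS for an E-Model rating R, following the ITU-T G.107 mapping."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Illustrative R values only: a change of a few tenths of an R point
# moves the estimated MOS by roughly a hundredth of a point.
for r in (80.0, 79.7):
    print(f"R = {r:5.1f} -> MOS = {r_to_mos(r):.3f}")
```

Because the mapping is a cubic in R, the MOS change produced by a fixed change in R depends on where on the R scale the call sits, so the same small impairment does not always yield exactly the same MOS difference.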
The exact implementation considered for the bulk of the experimentation was 128-bit AES running through Openswan IPSec in tunnel mode. It was found that for the majority of cases the overhead due to encryption was quite small. Typically, the estimated call quality from encrypted communication was found to be very close to the call quality of unencrypted communication under equivalent network conditions. Encryption algorithms, even with traffic rates approaching 100 Mbps, were very fast, and delays due to encryption that would significantly impair VoIP quality were not observed for any security setup.

The most noticeable effect of encryption was the increased bandwidth that resulted from the encrypted packets. A bandwidth increase of 28-35% was typically observed in VoIP systems after encryption. In the cases where this resulted in a bandwidth increase beyond the capacity of the communication channel, a significantly increased loss rate was observed with encrypted communication. This resulted in reduced conversation quality.

Finally, the performance of secure VoIP systems over wireless links in airborne networks was evaluated. It was found that TCDL links offer the capability for toll-quality secure VoIP, while Inmarsat and Connexion links provide adequate conversation quality in the absence of additional traffic. The impact of delay on satellite links was observed to impair quality far more than the packets lost due to bit errors. For all links, neglecting bandwidth issues that could arise if other communication occurred during the VoIP call, the impact of encryption on overall conversation quality was negligible, and encrypted and unencrypted call quality were nearly identical. This behavior matched the quality predicted by the Z-Model, although the Z-Model was observed to slightly overestimate the impact of security in VoIP systems.
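The size of the bandwidth increase can be reproduced with a rough per-packet calculation for IPSec ESP in tunnel mode. The sketch below is an estimate under stated assumptions rather than a measurement from the experiments: it assumes IPv4, G.711 with 20 ms packetization (160 payload bytes), AES in CBC mode with a 16-byte IV, and a 12-byte truncated HMAC integrity check value, and it ignores link-layer framing. The function name is hypothetical.

```python
def esp_tunnel_size(inner_len: int, block: int = 16, iv: int = 16, icv: int = 12) -> int:
    """IPv4 size in bytes of an ESP tunnel-mode packet wrapping inner_len bytes."""
    outer_ip = 20      # new outer IPv4 header added by tunnel mode
    esp_header = 8     # SPI (4 bytes) + sequence number (4 bytes)
    esp_trailer = 2    # pad-length byte + next-header byte
    # The inner packet plus the trailer is padded up to a whole cipher block.
    padded = -(-(inner_len + esp_trailer) // block) * block
    return outer_ip + esp_header + iv + padded + icv


payload = 160                      # assumed: G.711 at 64 kbps, 20 ms per packet
inner = 20 + 8 + 12 + payload      # IPv4 + UDP + RTP headers + payload
outer = esp_tunnel_size(inner)

increase = 100.0 * (outer - inner) / inner
print(f"{inner} bytes unencrypted, {outer} bytes encrypted, +{increase:.0f}%")
```

Under these assumptions the estimate falls inside the observed 28-35% range. Other packetization intervals, integrity algorithms, or the inclusion of link-layer headers would shift the figure.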