Modeling and Assessing Secure
Voice over IP Performance
by
Cory L. Zue
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the Degrees of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 24, 2005
Copyright 2005 Cory L. Zue. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 24, 2005
Certified by
Robert K. Cunningham
Associate Leader - Information Systems Technology Group (Lincoln Lab)
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Modeling and Assessing Secure
Voice over IP Performance
by
Cory L. Zue
Submitted to the
Department of Electrical Engineering and Computer Science
May 25, 2005
In Partial Fulfillment of the Requirements for the Degree of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
Abstract
Voice over Internet Protocol (VoIP) systems enable efficient communications over
data networks, but the security of VoIP and the impact of that security on communications quality have not been quantitatively modeled. A conversational model is adapted
for VoIP and a computational model of communication quality - the Z-Model - is developed. VoIP conversations are simulated for networks with a range of performance
characteristics including differing bandwidth, latency and bit error rates to evaluate
the impact of security on communication quality. Results show that improving confidentiality via encryption of conversation data packets does not introduce significant
delays, but does increase bandwidth. In certain restricted-bandwidth environments
this results in dramatic reductions of perceived conversation quality.
Thesis Supervisor: Robert K. Cunningham
Title: Associate Leader, Information Systems Technology Group
This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002.
Opinions, interpretations, conclusions and recommendations are those of the author and are not
necessarily endorsed by the United States Government.
Acknowledgements
There are several people whose work and input went into this thesis. First I'd like to
thank my parents, for putting me through college and supporting me both financially
and emotionally through some stressful times. From Lincoln Laboratory, I'd like to
thank Lenny Veytser for the use of his tool for analyzing network traffic, and Mark
Yeager for porting my VoIP traffic generating code into a scriptable format. I'd also
like to thank Aaron Beveridge for putting up with my constant requests for help in
various administrative tasks. Mostly, though, I'd like to thank my advisors: Rob
Cunningham, my official thesis advisor, and Cindy McLain, who, although her name
is not on the cover, put as much work into this thesis as any advisor could. Rob
was terrific in keeping me on task and seeing the big picture, while Cindy analyzed
my work with an incredible attention to detail, making sure facts were checked and
grammar rules were not broken. I feel very fortunate to have had not one, but two
excellent advisors willing to work extra hours and late nights to ensure that this thesis
got finished. To both of you I owe an incredible debt of gratitude, and probably a
few hours of sleep.
Contents
1 Introduction
2 Background
  2.1 Before VoIP: Telephones and the Public Switched Telephone Network
  2.2 The Internet
3 Voice Over IP
  3.1 System Overview
  3.2 History and Growth
  3.3 Motivation
  3.4 VoIP Deployment Challenges
    3.4.1 Infrastructure Requirements
    3.4.2 Factors Affecting Conversation Quality
    3.4.3 Security
4 VoIP Implementation Details
  4.1 Call Setup Protocols
    4.1.1 H.323
    4.1.2 Session Initiation Protocol (SIP)
    4.1.3 SIP versus H.323: A Comparison
    4.1.4 Others
  4.2 Transport Protocols
    4.2.1 Internet Protocol (IP)
    4.2.2 User Datagram Protocol (UDP)
    4.2.3 Transmission Control Protocol (TCP)
    4.2.4 Real-Time Protocol (RTP)
5 VoIP and Security
  5.1 Internet Security Overview
    5.1.1 Definition of Security
    5.1.2 Security Details
    5.1.3 Security Implementation
  5.2 Security Applied to VoIP
    5.2.1 Confidentiality
    5.2.2 Integrity and Non-Repudiation
    5.2.3 Availability
    5.2.4 Security and Quality of Service
6 Measuring Conversation Quality
  6.1 Mean Opinion Score (MOS)
  6.2 Perceptual Evaluation of Speech Quality (PESQ)
  6.3 E-Model
7 VoIP Conversation Modeling
  7.1 Motivation
  7.2 Brady's Model for Two-Way Speech
    7.2.1 One-Port versus Many-Port Models
    7.2.2 The One-Port, Two-State Model
    7.2.3 The One-Port, Four-State Model
    7.2.4 The One-Port, Six-State Model
    7.2.5 Model Parameters and Interpretation
    7.2.6 Model Limitations
  7.3 Applying Brady's Model to VoIP
    7.3.1 Voice Activity Detection (VAD)
  7.4 Revising Brady's Parameters
    7.4.1 Correlating Brady's Data with the Switchboard Corpus
    7.4.2 Comparing the Switchboard Corpus to VoIP
  7.5 Developing User Models
    7.5.1 The Average Speaker Pair
    7.5.2 The Authority Relationship
    7.5.3 An Alternating Protocol
  7.6 Summary
8 Adapting the E-Model to Secure VoIP Systems: The Z-Model
  8.1 Using the E-Model for VoIP Communication
    8.1.1 Jitter
    8.1.2 Echo
    8.1.3 Error Rates
  8.2 Going from E-Model to Z-Model
    8.2.1 Modeling the Effects of Security
  8.3 Incorporating Conversational Improvements
9 Resources
  9.1 Traffic Generator
  9.2 Link Emulator
  9.3 Encryption Devices
  9.4 Test Network
10 Methodology
  10.1 Understanding the Performance of Secure VoIP
    10.1.1 Experiment 1: Performance Under Optimum Conditions
    10.1.2 Experiment 2: Performance of Openswan with Increased Traffic
    10.1.3 Experiment 3: Performance over Low-Bandwidth Links
    10.1.4 Experiment 4: Performance over Links with High Loss and Error Rates
  10.2 Adding Security to the Z-Model
  10.3 Evaluating the Performance of the Z-Model
    10.3.1 Evaluating the Disagreement Factor as a Replacement for Absolute Delay
    10.3.2 The Loss Due to Errors
    10.3.3 Using the Z-Model to Estimate and Measure Overall Conversation Quality
11 Future Work
  11.1 Conversation Modeling
  11.2 Security
  11.3 Network Characteristics
  11.4 Voice Quality
12 Conclusion
List of Figures
1 A Typical VoIP Setup
2 An H.323 Call Setup
3 A SIP Call Setup and Takedown
4 Evaluating VoIP with the PESQ Model
5 R-Values and MOS for Varying Conversation Quality[58]
6 A Two-State Conversation Model
7 A Four-State Conversation Model
8 A Six-State Conversation Model
9 Probability Distribution of Time Spent in a State
10 Six-State Conversation Model with α and β Parameters
11 Talking On/Off Patterns versus Time for a Single Speaker in a Conversation
12 Time in Each State for Brady's Data[32] and the Switchboard Speech Corpus[34]
13 Time in Each State for 3 Calls from the Switchboard Corpus
14 Simulated Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers
15 Real Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers
16 Simulated Talking On/Off Patterns versus Time for a Conversation with a Dominant Speaker
17 Simulated Talking On/Off Patterns versus Time for a Conversation with an Alternating Protocol
18 Agreement and Disagreement Time with ta Shorter than State Lengths
19 Agreement and Disagreement Time with ta Shorter than State Lengths
20 Test Network for Experimentation
21 End-to-End Delays for 128-bit, Uncompressed AES and Clear Communication with Varying Levels of Traffic
22 Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (End-to-end throughput)
23 Loss Rates for 128-bit, Uncompressed AES and Clear Communication with 128 Kbps Bandwidth and Varying Background Traffic (Actual Packet Sizes)
24 Introduced Loss Rates for Clear and Encrypted Communication
25 Loss Rates versus Bit Error Rates for Clear and Encrypted Communication
26 Disagreement Factor for Varying Two-Way Latencies
27 Expected and Observed Loss Rates versus Bit Error Rates for Clear and Encrypted Communication
List of Tables
1 Sample and Bandwidth Information for Various Voice Codecs
2 The MOS Quality Scale
3 Brady's Parameters for the Six-State Model
4 Statistics of Brady's Parameters for the Switchboard Data
5 Statistics of Brady's Parameters for the Switchboard Data with Buffer
6 Brady's Parameters for a Dominant and Passive Speaker
7 Brady's Parameters for an Alternating Protocol
8 Characteristics used to Simulate Various Airborne Links
9 Baseline Per-Packet Delays for Various Encryption Algorithms in Openswan
10 Factors Contributing to R score for Various Links
1 Introduction
The rapid growth and scope of the Internet have had a tremendous impact on the way
people communicate. E-mail and instant messaging have revolutionized the speed and
cost of written correspondence, allowing people separated by great geographical
distances to write to each other much faster and more cheaply than previously possible.
Phone conversations, on the other hand, did not change significantly with the advent
of the Internet until recently. In the past five years, though, the spread of Voice Over
IP, a technology that carries voice conversations over the Internet, has demonstrated
that this is rapidly changing.
The slow adoption of Voice Over IP, or VoIP, has largely been a result of poor quality
of service. The Internet is subject to delays and bandwidth limitations that, until
recently, have made VoIP unattractive to typical users. In sending e-mail or browsing
websites, delays of a few seconds do not significantly hamper a user's experience, but
in a verbal conversation such delays make communication quite difficult.
A second challenge facing VoIP is security. The traditional phone system, the public
switched telephone network (PSTN), consists of a series of dedicated circuits that are
owned by a few select bodies. This makes it difficult for others to "tap in" and
listen to particular calls. The Internet, on the other hand, spans countless switches,
lines, and routers, and in any given connection it is very difficult to know exactly where
data is being sent, or who can see it. For this reason, the privacy that people have
come to expect from their communications is not guaranteed in VoIP systems.
Providing privacy in an Internet setting generally relies on mathematical cryptographic algorithms that encode data before it is sent out in a way that only the
intended recipient can decode. These algorithms, however, introduce an overhead in
both time and bandwidth. If people are to have a level of privacy from
VoIP systems equivalent to that provided by the PSTN, the overhead introduced by applying
security features must not reduce the conversation quality below the point of usability.
The main goal of this research is to develop a methodology for objectively estimating
the quality of VoIP communication in a given network environment. Security should
be incorporated into this model, as well as unique network characteristics that may
arise in less than ideal situations. Some Internet communication runs over links with
limited bandwidth or above average latency, and it is important to understand how
VoIP performs in these environments. Of particular interest is the performance of
VoIP over encrypted wireless airborne networks.
To explore the feasibility of secure VoIP in various network environments, an understanding of the behavior of VoIP systems is first necessary. The size and rate of packets
sent, as well as the on/off spurts of VoIP speech, are studied so that an accurate
model of a VoIP user can be developed. VoIP traffic generation has been explored
extensively in the commercial sector, with emphasis on testing the limits
of infrastructure capacity. Commercial tools such as Hammer VoIP
Test Solution[1] and IxVoice[2], however, do not use complex conversational models, but rather
send pre-generated traffic streams or use very simple exponential on/off models[37].
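To make the term concrete, the following is a minimal sketch of a simple exponential on/off model of the kind mentioned above. It is an illustration only, not the implementation used by any of the cited tools: the function name and the mean talk and silence durations are hypothetical values chosen for the example.

```python
import random


def exponential_on_off(total_seconds, mean_talk=1.0, mean_silence=1.5, seed=0):
    """Return a list of (state, duration) pairs covering at least total_seconds.

    The speaker alternates between "talk" and "silence". Each duration is
    drawn from an exponential distribution. The mean durations are
    hypothetical values picked for illustration, not measured data.
    """
    rng = random.Random(seed)
    spurts = []
    elapsed = 0.0
    state = "talk"
    while elapsed < total_seconds:
        mean = mean_talk if state == "talk" else mean_silence
        duration = rng.expovariate(1.0 / mean)  # rate parameter is 1 / mean
        spurts.append((state, duration))
        elapsed += duration
        state = "silence" if state == "talk" else "talk"
    return spurts


if __name__ == "__main__":
    for state, duration in exponential_on_off(10.0):
        print(f"{state:8s} {duration:.2f} s")
```

A model like this has only two states and no notion of the second speaker, which is the sense in which it is simpler than the conversational models considered in Section 7.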
A second task of this thesis is to explore the various ways of providing security in
an Internet setting. Security implementations differ in the encryption algorithm they
use and in the network layer at which that algorithm is applied.
Finally, a metric for conversation quality that incorporates network characteristics
and security implementation should be developed. Conversation quality is largely a
subjective concept. Different people may have different standards of what a "good"
quality conversation is. Being able to objectively estimate voice quality is essential
in estimating the performance of a particular proposed VoIP setup. The focus of this
thesis was on the evaluation of conversation quality, and the impact of security and
unfavorable network links on call setup was not considered, although some information
about the latter is included for completeness.
Section 2 provides some background information about the public switched telephone
network and the Internet. Sections 3 and 4 present VoIP in more detail, discussing
it in a historical context, providing implementation details, and reviewing the protocols involved. Section 5 provides an overview of Internet security, with emphasis
on security's role in the context of VoIP. It also explains why we chose IPSec as the
security layer for our experiments.
Section 6 reviews the various methods of evaluating speech quality, and justifies
the choice of the International Telecommunication Union's E-Model as an appropriate
performance measurement tool. In Section 7 the techniques for conversation modeling
are explored, as well as a methodology for adapting a PSTN conversation model to
mimic VoIP behavior.
The main focus of this thesis is in Sections 8 through 10. In Section 8 the Z-Model is
introduced, a computational model, based on the E-Model, that incorporates security
and fine-tunes the E-Model in the context of VoIP systems. Section 9 describes the
resources used for experimentation, and Section 10 discusses the
experiments performed in designing and evaluating the Z-Model as an appropriate
tool for measuring VoIP quality.
Finally, Section 11 discusses the limitations of this research and explores areas for
further study. Section 12 offers some concluding remarks.
2 Background
Before delving into the world of VoIP and security, it is appropriate to know a little bit
about VoIP's predecessors. This section will discuss the telephone and the network
built to support telephony: the public switched telephone network. It will also briefly
discuss the advent of the Internet, the network VoIP was built to run over.
2.1 Before VoIP: Telephones and the Public Switched Telephone Network
The Internet has connected people further and faster than ever before in history. But
before the Internet, the telephone was the global method of communicating across
large distances. The first voice transmission was sent by Alexander Graham Bell in
1876[3]. Bell did not have a phone number to dial or an e-mail address; he simply
picked up the phone and the person on the other end could hear what he said. For
a long time, this was the model used by the telephone: a user had to have a direct
connection to whomever he wanted to call.
As phones became more widespread, a new model was required in which each person
had a connection to a central switch. To reach someone else, a caller would ask
the operator of the switch to physically connect the appropriate lines, which would
establish a circuit for the duration of the call[3].
Over time the shape and mechanisms of the phone network changed. Operators were
replaced by a signaling code (i.e. your phone number) that allowed calls to be routed
and connected automatically. As the network grew, it developed hierarchical layers
of switches, each one supporting more traffic. The resulting network is known as the
public switched telephone network (PSTN).
The PSTN is excellent at providing high quality communications. Except in times
of extreme usage, such as a natural disaster or holiday, when many people are trying
to communicate at the same time, people can reach each other whenever they want.
Typically the PSTN is up 99.999% of the time[4]. In addition, the typical sound
quality of PSTN calls is very good. This is because PSTN calls are circuit-switched. That means that a dedicated circuit is created during the setup phase of
the call, and this circuit continues to serve the call in an exclusive manner until it
completes. Each call is guaranteed a fixed amount of continuous bandwidth, no more
and no less, for the duration of the call.
Circuit-switching also has its disadvantages. First of all, it is bandwidth inefficient;
the circuit is tied up by a conversation, regardless of whether or not either of the
parties happens to be talking. Additionally, this fixed bandwidth makes it difficult to
add new features. Typically, a single home has a 56-kbps phone line, which is simply
not enough bandwidth to support Internet, phone, and video at the same time[3].
The alternative to a circuit-switched network is a packet switched network, in which
each chunk of data is individually routed. Packet switched networks, such as the
Internet, make more efficient use of bandwidth. For this reason, the people who hope
to converge the phone and data networks believe that voice should be migrated to
packet networks and not the other way around. As it currently stands, many people
still use the phone network and a modem for data transfers, such as e-mail and web
browsing. As the demand for data and bandwidth continues to increase, the use of
the PSTN for data becomes increasingly infeasible, and the need to migrate voice to
the data network becomes apparent.
2.2 The Internet
In this section we provide a brief history of the Internet. It is not meant to be an
exhaustive study of the Internet details, but rather a context for comparing VoIP to
PSTN calls.
The origins of the Internet go back to a 1969 project of the U.S. Department of Defense. At that time, the PSTN was the only available nation-wide communications
system. The government realized that, because of the PSTN's dependence on circuit switching, an attack targeting switching stations could effectively take down the
communication channels of an entire region[5]. The government wanted to build a
network that would dynamically route each piece of traffic (called a packet) depending
on the availability of links. The result of this project was the ARPANET, which
would later develop into the Internet[6].
At the heart of the ARPANET was packet switching, which allowed
data to travel from one point to another in the network without setting up a connection or establishing a fixed path. Each node on the network had its own unique
address, known as an IP (for Internet Protocol) address. Then each packet could be
labeled with a source and destination address, and the routers could look up where
to send the packet to reach that address. One problem with this protocol is that it
is unreliable; packets are not guaranteed to reach their destination. For this reason,
other protocols run over the Internet, such as Transmission Control Protocol (TCP),
and User Datagram Protocol (UDP), that can provide varying levels of connection
control. These protocols are discussed in Section 4.2.
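The per-packet lookup described above can be sketched as follows. This is a simplified illustration rather than a description of any particular router: the routing table contents and next-hop names are hypothetical, and the longest-prefix-match rule used to choose among overlapping entries is an assumption drawn from common IP practice, not something stated in the text.

```python
import ipaddress

# Hypothetical routing table mapping address prefixes to next hops.
ROUTES = {
    "0.0.0.0/0": "upstream",      # default route
    "10.0.0.0/8": "router-a",
    "10.1.0.0/16": "router-b",
}


def next_hop(routing_table, destination):
    """Return the next hop for a destination address, or None if no route matches.

    Among all prefixes that contain the destination, the most specific one
    (longest prefix) wins. Each packet is looked up independently, so no
    connection state is kept between packets.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in routing_table.items():
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, hop)
    return best[1] if best else None


if __name__ == "__main__":
    for address in ("10.1.2.3", "10.9.9.9", "192.0.2.1"):
        print(address, "->", next_hop(ROUTES, address))
```

Nothing in the lookup guarantees that the packet is delivered, which is why reliability is left to protocols layered on top of IP.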
There are several nice things about packet switching. One is that it allows bandwidth
to be used only when it is needed. Low bandwidth applications, such as e-mail, can
share links with voice or video at the same time as long as bandwidth supports it.
Packet switching also allows data to be dynamically routed, so connections can be
maintained even as intermediate links and hosts go up and down. Dynamic routing
also allows bandwidth to be spread across different links in times of congestion.
There are some disadvantages to packet switching as well. In particular, it is difficult
to guarantee a particular level of service to users, a task that is readily accomplished
by circuit switched networks. Additionally, the Internet was designed such that the
majority of network control is placed at the endpoints. This can cause network
management to be a difficult task.
Despite its shortcomings, the Internet has experienced incredible growth since the
development of the ARPANET. As recently as 1983 there were fewer than 600 registered hosts on the ARPANET, most of which were universities and military research
sites[5]. Now the Internet has countless hosts and supports incredible volumes of
traffic. Thus, there is plenty of room on the Internet for voice traffic.
3 Voice Over IP
Voice over IP (VoIP) is the transmission of voice data over packet-switched networks,
such as the Internet. This section provides an overview of VoIP technology and discusses why VoIP is an important topic to study. Section 3.1 provides a quick overview
of VoIP systems. Section 3.2 discusses the historical context of VoIP. Section 3.3 discusses the motivation behind implementing VoIP instead of traditional PSTN technology, and Section 3.4 discusses some of the challenges facing VoIP, several of which
are addressed in the rest of this thesis.
3.1 System Overview
A typical VoIP architecture is shown in Figure 1. Each user has a device, called a
"Voice Terminal" that performs the operation of translating speech to data packets
that can be sent over the Internet. Voice Terminals can be a stand-alone piece of
hardware (known as an IP phone), software running on a PC, or a regular phone
running through an adaptor and plugged into the Internet.
[Figure 1 diagram: Alice's Voice Terminal and Bob's Voice Terminal (software or hardware phones) connected through the Internet; legible labels include an analog-to-digital converter, an RTP packet, and a UDP packet.]
Figure 1: A Typical VoIP Setup
The first element of the Voice Terminal is an analog to digital (A/D) converter.
This device takes in the analog audio signal, either from a phone handset or PC
microphone, and converts it to digital data that can be processed by a computer
chip. Once the speech has been digitized it is usually compressed by a device or
program known as a codec. Compression allows the data to take up less memory, while
sometimes resulting in loss of information and quality. There are several standard
codecs for voice that offer varying levels of compression and quality, and an overview of
the most popular codecs can be found in Table 1[16]. In addition to compressing the
audio, the codec breaks it into data values, known as samples, taken at small,
discrete timesteps apart from each other. The exact timestep and size of these samples
depend on the codec used.

Codec and       Codec     Codec      Voice     Voice     Packets   Bandwidth   Bandwidth
Bit Rate        Sample    Sample     Payload   Payload   Per       MP or       Ethernet
(Kbps)          Size      Interval   Size      Size      Second    FRF.12      (Kbps)
                (Bytes)   (ms)       (Bytes)   (ms)      (PPS)     (Kbps)
G.711 (64)      80        10         160       20        50        82.8        87.2
G.729 (8)       10        10         20        20        50        26.8        31.2
G.723.1 (6.3)   24        30         24        30        34        18.9        21.9
G.723.1 (5.3)   20        30         20        30        34        17.9        20.8
G.726 (32)      20        5          80        20        50        50.8        55.2
G.726 (24)      15        5          60        20        50        42.8        47.2
G.728 (16)      10        5          60        30        34        28.5        31.5

Table 1: Sample and Bandwidth Information for Various Voice Codecs
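The Ethernet bandwidth figures in Table 1 can be approximately reproduced from the codec bit rate and the payload duration. The sketch below is one way to do the arithmetic, not a calculation taken from the thesis. It assumes 40 bytes of IP, UDP, and RTP headers per packet (20 + 8 + 12) and 18 bytes of Ethernet framing overhead; the Ethernet figure and the function names are assumptions made for this example.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + fixed RTP header
ETHERNET_OVERHEAD_BYTES = 18            # assumed layer-2 overhead per packet


def payload_bytes(codec_kbps, payload_ms):
    """Bytes of encoded voice carried by one packet."""
    # Kbps multiplied by ms gives bits; divide by 8 for bytes.
    return codec_kbps * payload_ms / 8


def bandwidth_kbps(codec_kbps, payload_ms, l2_overhead_bytes=ETHERNET_OVERHEAD_BYTES):
    """Per-call, one-way bandwidth in Kbps including all header overhead."""
    packets_per_second = 1000 / payload_ms
    packet_bytes = (payload_bytes(codec_kbps, payload_ms)
                    + IP_UDP_RTP_HEADER_BYTES
                    + l2_overhead_bytes)
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    for name, kbps, ms in (("G.711", 64, 20), ("G.729", 8, 20)):
        print(f"{name}: {payload_bytes(kbps, ms):.0f} byte payload, "
              f"{bandwidth_kbps(kbps, ms):.1f} Kbps on Ethernet")
```

For G.711 with 20 ms of audio per packet this gives a 160-byte payload and 87.2 Kbps; for G.729 it gives a 20-byte payload and 31.2 Kbps. The header overhead is fixed per packet, so it dominates the total for low-bit-rate codecs.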
The compressed audio data is then wrapped inside a Real-Time Protocol (RTP)
packet[71]. Depending on the codec, one or more samples will be put inside a single
RTP packet. The purpose of RTP is to provide information on the codec used,
as well as additional sequencing and timing information that allows the
stream of packets to be converted back to audio. Details of how this is accomplished
are discussed in Section 4.2.4. The RTP packet is then put into a User Datagram
Protocol (UDP) packet[69], to be sent over the Internet. The UDP header, together with
the IP header of the packet that carries it, provides source and destination
addressing so that routers in the Internet that handle
the packet know where to send it. The choice of UDP over TCP is discussed in
Section 4.2.2.
The packet is then shipped over the Internet (or, in some cases a Local Area Network
(LAN)) to the receiving user. This user has his own Voice Terminal that can perform
the inverse operations on the packet. This involves first removing the UDP headers
and parsing the RTP data to determine the codec and ordering of the packet. Then,
the packet is decoded with the appropriate codec, and is passed through a digital
to analog (D/A) converter, where the resulting analog signal can be played (either
through a handset or a computer's speakers). Because both parties in a two-way
conversation need to encode and decode the data, both functions are needed in each
host's voice terminal.
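The packaging and unpackaging steps described in this section can be sketched in a few lines. The example below is a simplified illustration, not the traffic generator used in this work: it builds only the fixed 12-byte RTP header (version 2, no padding, no extension, no contributing sources, marker bit clear), and the function names, sequence number, timestamp, and source identifier are hypothetical. Payload type 0 is the standard static assignment for G.711 mu-law audio.

```python
import struct

RTP_VERSION = 2
# Fixed RTP header: two flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit synchronization source (SSRC) identifier.
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_packet(payload_type, sequence, timestamp, ssrc, payload):
    """Prepend a fixed RTP header to an encoded voice payload."""
    first_byte = RTP_VERSION << 6        # no padding, extension, or CSRCs
    second_byte = payload_type & 0x7F    # marker bit left clear
    header = RTP_HEADER.pack(first_byte, second_byte,
                             sequence & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def parse_rtp_packet(packet):
    """Split a packet built by build_rtp_packet into header fields and payload."""
    first_byte, second_byte, sequence, timestamp, ssrc = RTP_HEADER.unpack(
        packet[:RTP_HEADER.size])
    return {
        "version": first_byte >> 6,
        "payload_type": second_byte & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[RTP_HEADER.size:],
    }


if __name__ == "__main__":
    voice = bytes(160)  # 20 ms of G.711 audio (silence, for illustration)
    packet = build_rtp_packet(0, sequence=1, timestamp=160, ssrc=0x1234, payload=voice)
    fields = parse_rtp_packet(packet)
    print(len(packet), fields["payload_type"], fields["sequence"])
```

On the sending side the resulting bytes would be handed to a UDP socket; on the receiving side the payload type selects the codec, and the sequence number and timestamp let the terminal restore packet order and playout timing before decoding.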
3.2 History and Growth
Although VoIP has recently emerged as one of the hottest topics in technology, it
has actually been researched and developed for quite a long time. In 1978 the first
packet-switched teleconference was held on the ARPANET, with members including
Cliff Weinstein of MIT's Lincoln Laboratory[8]. In 1983, Weinstein et al. published
a paper describing the transmission of speech over packet networks[9]. Additionally,
the first functional VoIP software was released by VocalTec Inc. in 1995[10]. This
software was designed to run on a PC, but, unfortunately, insufficient processor speed,
Internet bandwidth, and other factors prevented VoIP from being a viable alternative to
PSTN telephony.
Since then, however, advances in processor speeds and a drastic increase in Internet
bandwidth and reliability have allowed VoIP to be a reasonable alternative to PSTN.
As a result, VoIP has experienced tremendous growth. This can be seen in the rate
at which Cisco, the leading manufacturer of IP phones, has sold its products. Cisco
shipped its 1 millionth phone in August of 2002, representing three and a half years
of sales. It shipped its 2 millionth phone just 12 months after that, in July of 2003,
and its 3 millionth phone in April of 2004, only 8 months later[11].
Cisco is not the only company reflecting the growth in this industry. According to
research done by the Yankee Group, 54 percent of businesses are currently testing or
evaluating the potential of VoIP[13]. Another study, by Juniper Research, estimates
that by 2009, 10% of all US households and 40% of business lines will be using VoIP.
They also estimate that the global VoIP market in that year will be $32 billion[14].
From this information it seems clear that VoIP is an important industry and technology. The next section discusses motivations for switching to VoIP. Section 3.4 goes on
to discuss some problems that should be addressed before employing a VoIP system.
3.3 Motivation
The main reason why many people and businesses are switching from the PSTN to
VoIP is cost. The cost discrepancy between these technologies lies in the inherent
difference between circuit and packet switched networks. Making a PSTN call requires "renting" bandwidth on a circuit controlled by a few large corporations for the
duration of the call. Internet bandwidth, on the other hand, is a largely underutilized
resource, and typically users are charged a fixed amount for connectivity regardless
of bandwidth use. This makes VoIP extremely cheap. Additionally, while there are
many regulatory charges for long distance and international phone calls over PSTNs,
packet switched networks, like the Internet, are currently completely unregulated[17].
As a result, many sites exist today that offer free VoIP to consumers (Skype and
EarthLink are two), while countless other providers offer unlimited long-distance dialing over IP for a small monthly fee (AT&T, CableVision, Net2Phone). Since IP
bandwidth is extremely cheap compared to PSTN lines, this is an economically sound
option for both providers and users of VoIP systems.
For corporations, the switch to VoIP can be even more economical. Since VoIP
uses the same underlying architecture as data networks, a complete switch to VoIP
can eliminate the need for a company to have a separate voice network, saving cost
and resources. This consolidation of networks makes internal phone lines obsolete, a
change that could save companies up to 20% when compared to PSTN[15].
There are also non-economic advantages to VoIP. One advantage emerging recently
is the elegance and philosophy of "everything over IP." Merrill Lynch and Vonage
have recently released a study[18] that claims that VoIP is a first step towards a day
when all communication technologies (Internet, phone, television, radio, etc.) will
run over IP. This convergence has economic advantages, but also creates simplicity in
merging different media; only one set of communication lines and protocols would
be necessary.
VoIP also allows users to be free of ties to a physical location. In this way, VoIP
resembles a cellular phone; a user can plug his IP phone into any Internet jack in the
world and have his phone respond to the same number. Wireless IP networks make
this comparison even more viable. Another benefit is that telephone addresses are no
longer confined to 10-digit numerical codes, but could take nearly any form (e-mail
addresses with a special marker, for example).
Despite these advantages, there are still several problems with VoIP that need to be
addressed. These are discussed next.
3.4
VoIP Deployment Challenges
VoIP, like any new technology, comes with a set of challenges that must be addressed before it can be widely and safely used. These challenges fall into three main
categories: setting up a VoIP infrastructure capable of interfacing with the existing phone network and supporting VoIP systems from different vendors, maintaining
conversation quality that users are accustomed to, and providing security through authentication and privacy. These challenges are discussed individually in the following
sections.
3.4.1
Infrastructure Requirements
The most obvious problem when employing a new communications technology like
VoIP is building an infrastructure that will allow it to run smoothly and integrate
with existing technologies. Luckily this problem has, for the most part, already been
addressed.
One nice thing about VoIP is that it runs over IP, so the existing Internet backbone
can be used to transport data and new lines are not necessary.
For addressing, call setup, and interfacing with the PSTN, several protocols have been
suggested, with two of them, H.323 and SIP, emerging as the most popular[64]. At
some point a standard for VoIP will have to be accepted, but until then the various
protocols must be able to interface not only with each other, but also with the existing
PSTN. For this purpose there exist boxes, known as gateways, that allow translation
from IP to PSTN for various VoIP signaling protocols.
Thus, two very important aspects of a VoIP infrastructure are already in place. First,
a VoIP network can run almost exclusively as a layer over IP, so the connecting
network is already set up. Second, this network can readily interface with the PSTN,
so users can switch to VoIP one at a time without being disconnected from the
existing phone network. This allows VoIP to be incrementally deployable, a crucial
characteristic if it is to be incorporated smoothly into the telecom architecture.
3.4.2
Factors Affecting Conversation Quality
A second major concern with VoIP has been the issue of conversation quality. Users of
any kind of telephone system are accustomed to a certain level of quality, below which
the usability of the system quickly degrades. It is therefore important to assure that
the conversation quality provided by VoIP systems is equal to that of PSTN systems.
Until recently this has been a difficult task.
There are several factors that can influence conversation quality. One is the problem of
delay. Sources disagree on how much delay is acceptable before quality rapidly
deteriorates, but most place this value somewhere between 100 and 400 ms[17, 53].
Delays can come from a number of sources. Computationally, there are delays at each
step of the way. The A/D and D/A conversion take time, as does compression by
the codec, packetizing and depacketizing the data, and decompression by the codec.
Clearly, these delays are directly related to the processor speed of the voice terminals.
For modern computers and IP phones they are typically quite small, but for a long
time they made VoIP unusable[17]. Additionally, data routed over the Internet can
travel through several routers, each of which may have a packet queue that can add
to delays. Finally, there is also a delay associated with sending a signal over a wire
that is proportional to the distance traveled and limited by the speed of light. This is a physical barrier
that cannot be avoided or significantly reduced for VoIP or PSTN conversations.
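As a rough illustration of how these delay sources add up, the sketch below sums a one-way delay budget. Every component value is an assumption chosen for illustration, not a measurement from this thesis, and the propagation term assumes a signal speed of about two thirds of the speed of light, which is typical for copper or fiber.

```python
# Illustrative one-way delay budget; every number below is an assumption.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
PROPAGATION_FACTOR = 2 / 3  # assumed fraction of c in copper or fiber


def propagation_delay_ms(distance_km: float) -> float:
    """Delay due solely to signal travel time over the given distance."""
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * PROPAGATION_FACTOR) * 1000


def one_way_delay_ms(distance_km: float, codec_ms: float,
                     packetization_ms: float, queuing_ms: float) -> float:
    """Sum the processing, queuing, and propagation components."""
    return (codec_ms + packetization_ms + queuing_ms
            + propagation_delay_ms(distance_km))


if __name__ == "__main__":
    # Hypothetical 4,000 km call with assumed component delays.
    total = one_way_delay_ms(4000, codec_ms=15, packetization_ms=20, queuing_ms=10)
    print(f"estimated one-way delay: {total:.1f} ms")
```

Only the propagation term is fixed by physics; the other terms depend on the terminals and the network path.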
A second problem facing VoIP is that of bandwidth. Any live streaming application,
by nature, will require a significant amount of bandwidth. Even after compression
and employing Voice Activity Detection (which prevents packets from being sent in
periods of silence), nearly all VoIP applications require at least 20 Kilobits per second
of bandwidth on the network[16]. Only recently have most homes and businesses had
this kind of consistent bandwidth at their disposal.
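The bandwidth requirement can be estimated from the codec bit rate and the per-packet header overhead. The sketch below assumes one packet every 20 ms, an IPv4 header of 20 bytes, a UDP header of 8 bytes, and an RTP header of 12 bytes, and it ignores link-layer framing and Voice Activity Detection. The codec rates used in the example (64 kbit/s for G.711, 8 kbit/s for G.729) are the nominal rates of those codecs.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no options


def voip_bandwidth_kbps(codec_kbps: float, packet_interval_ms: float) -> float:
    """One-direction bandwidth including IP/UDP/RTP headers, in kbit/s."""
    packets_per_second = 1000 / packet_interval_ms
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    packet_bits = (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8
    return packet_bits * packets_per_second / 1000


if __name__ == "__main__":
    for name, rate in [("G.711", 64), ("G.729", 8)]:
        print(name, voip_bandwidth_kbps(rate, packet_interval_ms=20), "kbit/s")
```

Under these assumptions the headers are a large share of the total for low-rate codecs, which is why a compressed stream still needs bandwidth in the range quoted above.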
Another factor that affects VoIP quality is jitter, which is the variance of interarrival
times between packets. For a real-time application such as VoIP, it is important
that packets arrive at a relatively smooth and uniform rate. Jitter can drastically
reduce conversation quality if measures are not taken to counter it. In many cases
this problem can be mitigated through the use of a jitter buffer[50].
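Using the definition above, jitter can be computed directly from packet arrival times. The following is a minimal sketch; the arrival times are made-up values in milliseconds, and the function names are hypothetical.

```python
from statistics import pvariance


def interarrival_times(arrivals_ms):
    """Gaps between consecutive packet arrivals."""
    return [later - earlier for earlier, later in zip(arrivals_ms, arrivals_ms[1:])]


def jitter_variance(arrivals_ms):
    """Jitter as the (population) variance of the interarrival times."""
    return pvariance(interarrival_times(arrivals_ms))


if __name__ == "__main__":
    smooth = [0, 20, 40, 60, 80]   # packets exactly 20 ms apart
    uneven = [0, 20, 45, 60, 80]   # one packet arrives 5 ms late
    print(jitter_variance(smooth), jitter_variance(uneven))
```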
Several other factors affect VoIP quality, including packet drops and transmission
errors over the network. Given the combination of all of these factors, only in the
last decade or so have computer systems and a network infrastructure
capable of supporting high-quality and ubiquitous VoIP been in place[58].
There are several ways of estimating conversation quality. One model that has been
introduced by the International Telecommunication Union (ITU) is the E-Model
[53], which attempts to determine a quality score based on delay, jitter, codec, and
several other transmission factors. Evaluating conversation quality is the subject of
Section 6, and the E-Model details, as well as an overview of other quality models,
can be found there.
3.4.3
Security
In addition to challenges concerning conversation quality, a second concern for VoIP
relates to security. In contrast to the PSTN, which is largely considered to be a
secure, private network, the Internet is open to eavesdropping, impersonation, and
denial of service (DoS).
A key problem with VoIP security is that many of the protocols and techniques used
(see Section 5.1.2) to protect data networks have not been widely used or tested in
VoIP networks. Firewalls, for example, are known to create problems with VoIP call-setup protocols[27]. Additionally, adding security inevitably increases packet sizes
(thus bandwidth) and delays, as the mathematical encryption algorithms take time
to be performed. Security issues with VoIP are discussed in more detail in Section 5,
and how security affects conversation quality is the subject of Section 8.
4
VoIP Implementation Details
Section 3.1 provided a brief overview of Voice over IP systems. This section expands
on that topic in more detail. In particular, it addresses the various protocols used
in VoIP systems. These break down into two main categories: protocols used for
setting up calls and building a VoIP infrastructure (discussed in Section 4.1), and
protocols for transporting the speech data over the Internet (Section 4.2). While the
focus of this thesis is on conversation quality, the call setup and teardown protocols
are discussed for completeness.
4.1
Call Setup Protocols
One of the most important needs for VoIP systems was a well-defined and accepted
protocol for managing the VoIP infrastructure. Since VoIP devices are associated with
an IP address that may change over time (if a laptop user plugs into the Internet from
two different locations, for example) there must be a dynamic process of associating
a particular user with an IP address. Additionally, there needs to be an established
signaling process that allows users to call each other, be connected, receive busy signals, and leave voice-mail; anything users expect from traditional phones should be
supported by VoIP. Finally, there should be a well-defined way to interface between
VoIP and traditional PSTN phones and networks. Several protocols have been introduced to accomplish these tasks, with the two most important ones being H.323 and
Session Initiation Protocol (SIP).
4.1.1
H.323
H.323 was the first widely used standard for VoIP[17]. The first version of H.323
was developed and endorsed by the International Telecommunication Union in 1996,
and subsequent versions have been released in 1998, 1999, 2000, and 2003[7]. There
are four basic components in an H.323 system: a terminal, gateway, gatekeeper, and
multipoint control unit.
The terminal is the endpoint device that provides a user interface to the system,
similar to the voice terminal block in Figure 1. Terminals can be a dedicated piece of
hardware known as an IP phone, a software program running on a personal computer
(PC), or a normal telephone running through a network adaptor.
The gateway is a device that provides a protocol conversion between the H.323 IP network and other types of networks (PSTNs, for example). This allows the H.323 phones
to communicate with devices running different protocols, such as PSTN phones, or
VoIP devices running over a non-H.323 protocol (e.g. SIP). Gateways are maintained
by a VoIP provider, or by a company to allow their users to interface with other
communication networks.
Figure 2: An H.323 Call Setup
The third H.323 device, the gatekeeper, provides call control, bandwidth management, and address translation for connections between H.323 endpoints. Calls are
initiated through gatekeepers, which are responsible for translating a phone number
to an IP address, and also may monitor call times for billing and tracking purposes.
Gatekeepers, again, are maintained by a provider or corporation, and users simply
route calls through them.
The final device, known as a multipoint control unit, enables three or more terminals
or gateways to establish a multipoint conference. Three-way calls are currently outside
of the scope of this thesis, and so details of the multipoint control unit are not
discussed.
One major drawback of H.323 is that many control messages and protocols are used
to initiate and terminate calls. Figure 2 shows the control messages required to set up
an H.323 call between two parties. It can be seen that H.323 relies on several other
protocols including H.225 (to set up the connection), and H.245 (to determine the
capabilities of each user's terminal). Additionally, connections on two separate ports
are required for the setup of a single call.
Another criticism of H.323 is that the protocol is binary-encoded, that is, it encodes
information in numbers as opposed to text. This encoding is frustrating for programmers debugging H.323 applications because it is difficult to easily observe what the
problems are. The binary encoding also makes H.323 less extensible, as information
must be contained in a very specific part of the packet with a fixed size. For these
reasons many people have recently moved away from H.323 and accepted SIP as the
standard for VoIP.
4.1.2
Session Initiation Protocol (SIP)
Session Initiation Protocol[72], commonly known as SIP, has recently become the
Internet Engineering Task Force (IETF) standard for multimedia Internet applications, including VoIP. It was developed by the IETF, a
body that oversees many of the Internet's standard protocols, as part of the Internet
Multimedia Conferencing Architecture. In addition to being used for VoIP, SIP can
also be used for video conferencing, instant messaging, and chat.
In SIP each user is associated with an address known as a uniform resource identifier
(URI). The URI (sometimes called a SIP URI, or SIP address) is analogous to the uniform
resource locators (URLs) of websites, with the key difference that SIP URIs are meant to
be dynamic[22]. URIs are not meant to be tied to a particular physical device, but
to a logical entity that might move or exist in multiple places.
A SIP architecture has several similarities to an H.323 architecture. Two similar
elements are SIP user agents (UAs) and SIP gateways. A SIP UA is analogous to an
H.323 terminal; it is a hardware or software device that allows a user to make VoIP
calls using SIP. SIP UAs can take the form of dedicated hardware phones, network-adapted analog phones, or software applications running on an Internet-enabled PC.
SIP gateways are also analogous to H.323 gateways; they provide translation between
protocols. Two common gateways are SIP/PSTN gateways, which allow SIP devices
to interface with traditional phone systems, and SIP/H.323 gateways, which interface
SIP and H.323 devices together. Like H.323 gateways, these are part of the infrastructure backbone, and are generally maintained by service providers or corporations.
SIP architecture also requires a number of SIP servers, devices that handle SIP messages, and each one serves a different function. The three types of SIP servers are
proxy servers, redirect servers, and registration servers. They are discussed below.
Proxy Servers
The role of a proxy server is to handle SIP messages on behalf of SIP user agents.
Proxy servers usually have access to a database or location service to help determine
what to do with the request. For example, a UA might try to initiate a call to a person
whose SIP URI is alice@bigcompany.com. The UA sends a SIP INVITE request to
a proxy, and the proxy attempts to determine the IP associated with Alice's address
by querying its database or location service. The proxy server then forwards the
request to wherever it determines the best location for Alice is, or responds with a
"not found" error message. Proxies, in general, do not create new messages; they
merely forward and respond to requests as they see fit.
Redirect Servers
A redirect server is a SIP server that responds to requests, but never forwards them.
Like the proxy server, it usually queries a database or location service to determine an
appropriate response, but unlike the proxy it will never forward a message. Redirect
servers usually inform the requesting UA of the location of someone, but leave it up
to the UA to contact that location on its own.
Registration Servers
A registration server, or registrar, only accepts one type of SIP message, a SIP REGISTER request. The registrar then keeps state of the users registered to it for a
particular domain, allowing the proxy and redirect servers to query it for location
information. Registration servers usually perform user authentication, although this
isn't required. Authentication ensures that only valid SIP users in the registrar's
domain are registered and serviced, and prevents outside users from placing and receiving calls through another domain's servers.
While each server is logically separate from the other two, any or all of the three servers
can reside in the same physical location, and many open-source and commercially
available SIP servers perform all three functions[27].
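The division of labor among the three server types can be sketched with a toy in-memory model. This is not a SIP implementation: the class and function names are hypothetical, messages are reduced to simple tuples, and authentication is omitted. The status strings are the standard SIP response codes 404 and 302.

```python
class Registrar:
    """Keeps the current location (IP address) registered for each SIP URI."""

    def __init__(self):
        self._bindings = {}

    def register(self, uri, ip):
        """Handle a REGISTER request by recording where the user can be reached."""
        self._bindings[uri] = ip

    def lookup(self, uri):
        """Location query used by proxy and redirect servers."""
        return self._bindings.get(uri)


def proxy_handle_invite(registrar, uri):
    """A proxy forwards the request itself, or answers 'not found'."""
    ip = registrar.lookup(uri)
    if ip is None:
        return ("respond", "404 Not Found")
    return ("forward", ip)


def redirect_handle_invite(registrar, uri):
    """A redirect server never forwards; it tells the caller where to go."""
    ip = registrar.lookup(uri)
    if ip is None:
        return ("respond", "404 Not Found")
    return ("respond", f"302 Moved Temporarily; Contact: {ip}")


if __name__ == "__main__":
    registrar = Registrar()
    registrar.register("alice@bigcompany.com", "123.45.67.8")
    print(proxy_handle_invite(registrar, "alice@bigcompany.com"))
    print(redirect_handle_invite(registrar, "alice@bigcompany.com"))
```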
One advantage to SIP is that it is a text-encoded protocol. This means that the information is sent over the Internet as human-readable text. Other text-based protocols
include Hyper-Text Transfer Protocol (HTTP) and Simple Mail Transfer Protocol
(SMTP), which World Wide Web and e-mail systems use, respectively[73, 74]. The
advantage to a text-based protocol is that it is much easier to program, analyze, and
debug. A simple traffic sniffing tool can be used on the network to easily understand
what information is being sent - a task not nearly as straightforward in H.323 systems.
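To make the text encoding concrete, the sketch below builds and parses a minimal SIP INVITE as plain text. The addresses, tag, branch, and Call-ID values are made up for illustration, and the message carries only a small set of headers. The body is a short session description offering two audio codecs (RTP payload types 0 and 18), which is where codec negotiation takes place.

```python
def build_invite(caller, callee, caller_ip, rtp_port):
    """Return a minimal, illustrative SIP INVITE with a session description."""
    sdp = "\r\n".join([
        "v=0",
        f"o=- 0 0 IN IP4 {caller_ip}",
        "s=call",
        f"c=IN IP4 {caller_ip}",
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0 18",  # offer PCMU (0) and G.729 (18)
    ]) + "\r\n"
    headers = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {caller_ip}:5060;branch=z9hG4bK-example",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag=example",
        f"Call-ID: example-call-id@{caller_ip}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller.split('@')[0]}@{caller_ip}>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp.encode())}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + sdp


def parse_message(text):
    """Split a SIP message into its start line, header dict, and body."""
    head, _, body = text.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers, body


if __name__ == "__main__":
    print(build_invite("bob@example.org", "alice@bigcompany.com", "98.76.54.32", 49170))
```

Because every field is readable text, a captured message can be inspected with nothing more than a packet sniffer and a text viewer.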
Another advantage is that SIP call setup is quite simple, as can be seen in Figure 3.
This figure, as compared to Figure 2, represents all the signaling required for a complete call (Figure 2 is only an H.323 setup), and shows a call being routed through a
SIP proxy server. SIP calls only require the establishment of a single TCP connection, and SIP can even bypass TCP and run entirely over UDP (a discussion of TCP
versus UDP can be found later in this section). In SIP, all of the codec negotiation
is done in the INVITE and OK messages, bypassing the need for the series of H.245
"terminal capabilities" messages seen in Figure 2.
4.1.3
SIP versus H.323: A Comparison
Several sources[22, 27] have provided a comparison of SIP and H.323. Both protocols
have their own merits, which are discussed here.
Figure 3: A SIP Call Setup and Takedown
The differences between H.323 and SIP are largely a product of where they were developed.
H.323 was developed by the International Telecommunication Union (ITU), and so
it largely resembles other telecommunications protocols. H.323 even reuses parts of
ISDN signaling, reflecting its roots in the telecom industry. SIP, on the other hand,
was designed by the Internet Engineering Task Force (IETF), and so it resembles
other Internet protocols like HTTP and SMTP.
One aspect of SIP that is highly advantageous is its use of a common universal
addressing scheme: the SIP URI. Because of this it allows a single SIP user to have
different SIP-enabled devices at multiple end-points that are all associated with that
user. A SIP phone, videoconferencing tool, and instant messaging client could all
be tied to a single URI, which is one of the reasons SIP is said to be more scalable than
H.323. Additionally, SIP's text-based encoding, when compared to H.323's binary
encoding, is an advantage in both clarity and extensibility. New fields can be added
incrementally to SIP, a task that is more difficult to accomplish in H.323.
It is unclear when (if ever) a true standard will emerge, but while H.323 arrived and
was widely implemented first, SIP's simplicity and versatility have caused it to gain
great momentum in the IP telephony world. SIP has now become widely backed
by major companies including Microsoft, Cisco, and Nortel, and appears to be the
growing trend.
An interesting side note about the two protocols is that as time has progressed, they
have grown more and more alike[22]. For example, SIP was designed to support DNS
(domain name service) from the start, a feature that was added in later versions of
H.323. Conversely, a system similar to H.323's multipoint control units, which allow
for 3-way calling, is currently being added to SIP.
Alan Johnston[22] provides a nice summary of the current situation. "While there
are some similarities between the protocols in call setup, and some niche markets that
H.323 currently dominates, SIP, with its text encoding, presence and instant message
extensions, and Internet architecture, is poised to be the signaling and 'rendezvous'
protocol of choice for Internet devices in the future." For these reasons, SIP was
chosen as the protocol to use in our experiments.
4.1.4
Others
SIP and H.323 are not the only protocols that have been proposed for IP telephony.
Others include Megaco, MGCP, and Skinny. While these protocols are used in some
areas today, it is believed that either SIP or H.323 (or both) will become the standard
for VoIP and so a detailed discussion of these protocols is not provided here. Please
refer to the references[78, 79, 67] for more information about these protocols.
4.2
Transport Protocols
While SIP and H.323 perform call setup and takedown signaling, they do not provide
any means of actually transmitting information over the Internet, nor do they handle
the media streams containing the actual audio data. For this task several existing
protocols are employed, including IP, UDP, TCP, and RTP. Both SIP and H.323,
for example, must run over either TCP or UDP. Additionally, the audio data of the
conversation uses RTP running over UDP for delivery. Finally, all of these protocols
run over IP. An explanation of what these different transport protocols provide is
found below.
4.2.1
Internet Protocol (IP)
Internet Protocol[68] (IP) is used to route packets across a data network, such as the
Internet. It provides connectionless, best-effort packet delivery, meaning that packets
might be lost, delayed, arrive out of sequence, or contain errors.
Each location on the network is associated with a particular address, known as an IP
address. The current most-employed standard for IP addresses is IP version 4 (IPv4),
in which addresses consist of four numbers between 0 and 255, separated by periods.
123.45.67.8 is an example of an IPv4 address, as is 11.0.0.255. A new version of IP,
IP version 6 (IPv6) has been developed by the IETF to allow for longer addresses,
because the IPv4 address space is rapidly running out. The increased address space
provided by IPv6 is necessary if every telephone in the world is to have a unique IP
address. That said, IPv6, while an important emerging technology, is largely outside
the scope of this document, and we performed all our experiments with the more
common IPv4 protocol.
IP addresses are assigned by the Internet Assigned Numbers Authority[22] (IANA),
and so they are globally unique. This ensures consistency; packets destined for a
particular unicast address should always arrive at one and only one unique machine.¹
An IP address is a lot like a street address. In the same way that any piece of mail
with a specific address on it can be dropped in any mailbox and should find its way to
that address, any IP packet with a specific destination IP address, sent to any router
on the Internet, should find its way to that computer.
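The address formats described above can be examined with Python's standard ipaddress module. The sketch below reports the IP version and address length for a textual address and converts an IPv4 address to the single 32-bit number it represents. The helper names are hypothetical, and the IPv6 address is from the range reserved for documentation.

```python
import ipaddress


def describe(address: str) -> tuple[int, int]:
    """Return (IP version, number of address bits) for a textual address."""
    parsed = ipaddress.ip_address(address)
    return parsed.version, parsed.max_prefixlen


def to_int(address: str) -> int:
    """Return the address as the single integer it encodes."""
    return int(ipaddress.ip_address(address))


if __name__ == "__main__":
    for text in ["123.45.67.8", "11.0.0.255", "2001:db8::1"]:
        version, bits = describe(text)
        print(f"{text}: IPv{version}, {bits} bits, {2 ** bits} possible addresses")
```

The jump from 32 to 128 bits is what gives IPv6 enough addresses for every telephone to have its own.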
4.2.2
User Datagram Protocol (UDP)
User Datagram Protocol[69] (UDP) provides a small step up in complexity from
IP. It is also a connectionless, best-effort protocol, meaning packets can be lost or
dropped, but it provides a checksum, allowing errors to be detected. Additionally,
UDP specifies not only an IP address, but also a port. Certain communications run
over specific, well-known ports (for example, HTTP uses port 80, and SIP uses port
5060). In this way, UDP allows multiple logical channels of communication to exist
between two computers. In the case of VoIP, SIP could be handling the call setup
over port 5060, while the actual data streams are traveling over some other (usually
determined at call-setup) port.
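A minimal sketch of UDP's address-plus-port delivery, using the standard socket module over the loopback interface. The receiver asks the operating system for any free port rather than a well-known one such as 5060, so the example does not collide with a running service; the function name is hypothetical.

```python
import socket


def udp_round_trip(payload: bytes) -> tuple[bytes, int]:
    """Send one datagram over loopback and return (data, destination port)."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)         # UDP gives no delivery guarantee
        port = receiver.getsockname()[1]
        sender.sendto(payload, ("127.0.0.1", port))  # no connection setup needed
        data, _ = receiver.recvfrom(2048)
        return data, port
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    data, port = udp_round_trip(b"voice frame")
    print(f"received {data!r} on port {port}")
```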
4.2.3
Transmission Control Protocol (TCP)
Transmission Control Protocol[70] (TCP) also runs over IP, but provides the reliability and guaranteed delivery of data that IP and UDP do not. It does this by
using acknowledgements and sequence numbers. Every TCP packet that is sent has
a header representing which bytes of the overall stream of data the packet represents.
This header allows the receiver to recognize when information is missing and ask the
sender to resend. It also allows the receiver to reconstruct information in the correct
order, even though packets may arrive out of order.
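The role of the byte-range information can be illustrated with a simplified receiver that reassembles a stream from segments arriving in any order and reports which byte ranges still need to be resent. This is a sketch of the idea only, not of TCP itself: it assumes the total stream length is known in advance and that every segment falls within it, and it omits acknowledgements, windows, and timers.

```python
def reassemble(segments, total_length):
    """Rebuild a byte stream from (byte_offset, data) segments.

    Returns (stream, missing), where missing is a list of half-open
    (start, end) byte ranges that were never received.
    """
    buffer = bytearray(total_length)
    received = [False] * total_length
    for offset, data in segments:
        buffer[offset:offset + len(data)] = data
        for i in range(offset, offset + len(data)):
            received[i] = True

    missing = []
    start = None
    for i, ok in enumerate(received):
        if not ok and start is None:
            start = i                    # a gap begins here
        elif ok and start is not None:
            missing.append((start, i))   # the gap ended just before i
            start = None
    if start is not None:
        missing.append((start, total_length))
    return bytes(buffer), missing


if __name__ == "__main__":
    # Segments arrive out of order, and bytes 6..8 are lost.
    print(reassemble([(9, b"ld"), (0, b"hello ")], total_length=11))
```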
TCP is a good transport protocol to use when guaranteed and reliable delivery is
required. Most traffic, including World Wide Web, e-mail, and File Transfer Protocol
(FTP) traffic, uses TCP. A problem with TCP, however, is that it can make things slow.
Retransmission takes time (usually more than double the round-trip time (RTT)
between sender and receiver) and in live streaming applications (like VoIP) a delay
as small as a second can severely detract from conversation quality. Additionally,
TCP has problems with long latencies, since it drastically reduces the transmission
rate when packets are lost. For these reasons, TCP is generally not used for media
streams, although it can be used for call-setup. As mentioned previously, currently
SIP and H.323 support delivery over both TCP and UDP.
¹Actually, this isn't always the case because of dynamically assigned IP addresses and network
address translators (NATs), but can generally be considered true.
4.2.4
Real-time Transport Protocol (RTP)
VoIP and other media streaming applications are a unique type of traffic with very
specific requirements. More than any other service these applications are extremely
time-sensitive. Delays larger than a few fractions of a second are simply unacceptable.
Fortunately, these applications do not need guaranteed delivery of every packet. Audio and video applications can usually interpolate a value for a missing data point and
end-users likely won't notice a loss in quality. A final requirement of these applications
is that ordering should be preserved. Users don't want to hear words out of place in a
conversation. Moreover, many voice and video codecs use differences rather than absolute values to encode data, and a reordering of information can make it completely
useless. To summarize, VoIP requires a quick delivery of data along with ordering
information, but doesn't require guaranteed delivery. Real-time Transport Protocol[71] (RTP)
was developed to meet these characteristics.
RTP runs over UDP, and is therefore a connectionless, best-effort protocol. This
ensures the fastest possible delivery over the Internet, and also means that some
packets will be dropped or arrive out of order. In addition to the UDP checksum,
which alerts the receiver if the data in the packet contains errors, RTP also provides
several multimedia-specific features.
One of these features is sequencing. As mentioned above, sequencing is important
in streaming applications. Typical VoIP applications actually hold on to packets
for a certain amount of time in a jitter buffer (mentioned earlier in
Section 3.4.2) before delivering them to the end-user. This allows some packets that
arrive late or out of order to be correctly placed in the stream, and has been shown to
drastically increase conversation quality[50]. The sequencing feature of RTP allows
this to be accomplished.
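The reordering role of sequence numbers can be sketched with a small jitter buffer. This is a simplified model with hypothetical names: it holds a fixed number of packets rather than a fixed amount of time, and it ignores wrap-around of the 16-bit RTP sequence number.

```python
import heapq


class JitterBuffer:
    """Holds a few packets so late arrivals can be put back in order."""

    def __init__(self, depth):
        self.depth = depth      # packets held back before playout
        self._heap = []         # min-heap ordered by sequence number
        self._next_seq = None   # lowest sequence number still playable

    def push(self, seq, payload):
        """Store a packet; return any packets now ready, in sequence order."""
        if self._next_seq is not None and seq < self._next_seq:
            return []           # too late: its playout slot has passed
        heapq.heappush(self._heap, (seq, payload))
        ready = []
        while len(self._heap) > self.depth:
            released = heapq.heappop(self._heap)
            ready.append(released)
            self._next_seq = released[0] + 1
        return ready

    def flush(self):
        """Release everything still buffered, in sequence order."""
        ready = [heapq.heappop(self._heap) for _ in range(len(self._heap))]
        if ready:
            self._next_seq = ready[-1][0] + 1
        return ready


if __name__ == "__main__":
    buffer = JitterBuffer(depth=2)
    played = []
    for seq in [1, 3, 2, 5, 4, 6]:  # packets arrive out of order
        played += buffer.push(seq, f"frame {seq}")
    played += buffer.flush()
    print([seq for seq, _ in played])
```

A deeper buffer tolerates more reordering, at the cost of adding its holding time to the end-to-end delay.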
In addition to sequencing, RTP also provides other valuable media-specific features. It
has a built-in field to represent the codec used according to a defined list of standard
codecs that is kept by the IANA[30]. It also provides a timestamp when the packet is
created (used by RTCP, see below), and a marker bit, which is usually used to signal
the start of a new audio stream, or a special type of packet.
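These fields occupy a fixed 12-byte header at the start of each RTP packet. The sketch below packs and unpacks that header with the standard struct module, following the fixed header layout of the RTP specification; it assumes no padding, no header extension, and no contributing-source list, and the function names and SSRC value are illustrative.

```python
import struct

# Two flag bytes, 16-bit sequence number, 32-bit timestamp, 32-bit source ID,
# all in network (big-endian) byte order: 12 bytes in total.
RTP_HEADER = struct.Struct("!BBHII")


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build a fixed RTP header for the given codec and position in the stream."""
    byte0 = 2 << 6  # version 2, no padding, no extension, zero CSRCs
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc)


def unpack_rtp_header(data):
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(data[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,  # codec identifier, e.g. 0 for PCMU
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


if __name__ == "__main__":
    header = pack_rtp_header(payload_type=0, seq=1, timestamp=160,
                             ssrc=0x12345678, marker=True)
    print(header.hex(), unpack_rtp_header(header))
```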
RTP runs together with RTCP (RTP Control Protocol). RTCP is a protocol in
which the two endpoints communicate things like delay, packet loss, and jitter to each
other over the network. Some applications may use RTCP to renegotiate a media
connection depending on the integrity of the channel. For example, if bandwidth
seems to be creating excessive delays and packet losses, a new codec can be used that
represents a lower voice quality, but requires less bandwidth.
5
VoIP and Security
This section discusses the role of security in VoIP systems. Section 5.1 provides
an overview of Internet security in the context of VoIP systems. Following that,
Section 5.2 moves onto issues specific to VoIP systems. Since the main focus of this
thesis is the effect of security on conversation quality, this section is only a brief
summary of Internet security and how it applies to VoIP. The reader is referred to
the references for more information on the topic of Internet security.
5.1
Internet Security Overview
In the early development of the Internet, security was hardly in anyone's thoughts.
The Internet at that stage was the ARPANET, which was a network of computers on
which everyone knew everyone else who was connected. Thus, the need for security
was overlooked; at that time the entire ARPANET was basically a private network[28].
However, as the ARPANET evolved into the public Internet, the need for security
became recognized. As a result security usually became a layer that ran over the
existing non-secure protocols, such as TCP and UDP. This section first discusses
exactly what is meant by Internet security, and then moves on to the details of how
security is generally accomplished today.
5.1.1
Definition of Security
In general, Internet security can be divided into four central issues. These issues are:
confidentiality (privacy), integrity, availability, and non-repudiation.
Confidentiality or Privacy
Confidentiality means keeping your information private. Confidentiality can be extremely important for military, business, or personal reasons. A user should be able
to control who has access to information he wants to keep private. Confidentiality
in the context of VoIP means that a person who isn't involved in the conversation
should not be able to determine who is talking to whom, or what is being said.
Integrity
Integrity implies that information has not been modified or concealed. It also means
that the source of information cannot be changed. Integrity basically means that
whatever information you receive is guaranteed to be what you think it is. A malicious
attacker cannot modify the information without you knowing, and can't pretend to be
someone he is not. Integrity in VoIP systems implies that the person you are talking
to is who they claim to be, and their words cannot be altered in transit. Integrity
is also important in communicating with VoIP servers that may route or track your
calls.
A second, more subtle aspect of integrity is that the conversation should not be able
to be replayed at some later date as though it were live. This aspect is known as replay
protection.
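One common way to obtain both properties is to attach a keyed message authentication code and a sequence number to each packet. The sketch below is an illustration under stated assumptions, not the mechanism evaluated in this thesis: it uses HMAC-SHA256 from the Python standard library, a hypothetical packet layout (4-byte sequence number, payload, 32-byte tag), and a strictly increasing sequence check, whereas practical protocols use a sliding window so that reordered packets are not rejected.

```python
import hashlib
import hmac

TAG_BYTES = 32  # length of an HMAC-SHA256 tag


def protect(key, seq, payload):
    """Prefix a sequence number and append a tag covering both."""
    header = seq.to_bytes(4, "big")
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + tag


class Receiver:
    """Accepts only unmodified packets with a fresh sequence number."""

    def __init__(self, key):
        self.key = key
        self.highest_seq = -1

    def accept(self, packet):
        """Return the payload, or None if the packet is altered or replayed."""
        header, payload, tag = packet[:4], packet[4:-TAG_BYTES], packet[-TAG_BYTES:]
        expected = hmac.new(self.key, header + payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return None  # integrity failure: contents or source key differ
        seq = int.from_bytes(header, "big")
        if seq <= self.highest_seq:
            return None  # replay: this sequence number was already seen
        self.highest_seq = seq
        return payload


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for illustration only
    receiver = Receiver(key)
    packet = protect(key, 1, b"hello")
    print(receiver.accept(packet), receiver.accept(packet))
```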
Availability
Availability means that an attacker should not be able to prevent a person from using
a system. A compromise of availability is usually accomplished through a denial of
service (DoS) attack. DoS attacks can be specifically targeted at a single point or can
take down an entire system. For VoIP systems to maintain availability it should be
impossible for attackers to prevent a single person from using VoIP. A less stringent
definition of availability implies that attackers should not be able to take down crucial
components of the VoIP architecture (e.g. SIP Servers).
Defending availability is the most difficult security task to accomplish because of the
large number, distributed location, and shared functionality of the network devices
that participate in a single call. Additionally, any denial of service attack on the
Internet could potentially affect VoIP users. One example of such an attack occurred
in 2001, when the "Code Red" worm took down significant portions of the Internet for
extended periods of time, costing industry and government an estimated $2.6 billion
in damages[12].
Non-Repudiation
Non-repudiation means preventing the sender of a message from denying it was transmitted by them. Non-repudiation prevents a user who promised something from
denying he made the promise. In the context of VoIP, non-repudiation means that a
person shouldn't be able to deny making a call he has made; there should be some
proof that would contradict this claim.
5.1.2 Security Details
The four elements of security are accomplished through various mechanisms. Two
important mechanisms are authentication and encryption. Both of these mechanisms
rely on the use of secret keys, which are usually large numbers known only to a
particular user or group of users. A password is another common type of secret key.
In this section, we are not discussing the mathematics of security per se, but more
the general principles involved with shared secret and public key cryptography. Then
we will discuss authentication and encryption in the context of these mechanisms.
Finally, we'll go over a few specific implementations of security. This section is not
meant to be a comprehensive guide to Internet Security, but merely a review to
provide enough information necessary to understand security in the context of this
thesis. Reference [24] is an excellent source of comprehensive information about
cryptography systems.
Shared Secret vs. Public Key
As mentioned previously, there are two main cryptographic mechanisms, shared-secret
and public key. Both of these mechanisms rely on the use of large, "secret" numbers,
but they differ slightly in how they work.
In shared-secret, or symmetric systems, all parties must know the value of the key.
The key becomes the "shared secret", and anyone who knows the key is assumed
to be a trusted party. One of the challenges that faces shared-secret systems is key
distribution. Everyone must be able to determine the value of the secret key, but
this must be done in a way that prevents non-trusted parties from also learning the
secret. Because of this, shared keys have to be placed manually in systems.
In contrast, public key schemes, such as the RSA scheme co-developed by Ron Rivest of MIT, use a
single secret per user, along with a "public key" for that user that is viewable by
anyone[26]. This removes the need to distribute a shared secret, but introduces a different problem:
a database or record of all public keys must be maintained.
Another problem with public key encryption is that it is computationally more expensive than shared secret encryption. Many applications get around this problem by
using public key encryption to authenticate a Diffie-Hellman key exchange. Diffie-Hellman key exchange[25] was introduced in 1976 as a way to use two private individual keys to arrive at a single shared secret. Diffie-Hellman key exchange, on its own,
does not provide authentication, but by combining Diffie-Hellman with public-key
authentication, a shared secret can be securely negotiated. This negotiated key can
then be used for the remainder of the communication, allowing it to be more efficient.
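As a minimal sketch of the Diffie-Hellman idea described above, the following Python fragment lets two parties derive the same shared secret from their individual private keys. The modulus, base, and function names are illustrative assumptions and do not come from this thesis; the 64-bit modulus is far too small for real use, and the exchange shown here is unauthenticated.

```python
import secrets

# Toy public parameters (assumptions for illustration only).
# Real deployments use standardized groups of 2048 bits or more.
P = 2**64 - 59   # a prime modulus, much too small to be secure
G = 5            # public base

def dh_keypair():
    """Return (private, public), where public = G**private mod P."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P-2]
    return private, pow(G, private, P)

def dh_shared(own_private, peer_public):
    """Combine our private key with the peer's public value."""
    return pow(peer_public, own_private, P)

# Each side publishes only its public value.
a_priv, a_pub = dh_keypair()
b_priv, b_pub = dh_keypair()

# Both sides arrive at the same secret without ever transmitting it.
secret_a = dh_shared(a_priv, b_pub)
secret_b = dh_shared(b_priv, a_pub)
```

Because neither public value is signed in this sketch, an active attacker could substitute his own values; this is the gap that the public-key authentication step is meant to close.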
To more closely understand the difference between these two mechanisms, we will
look at them in the context of authentication and encryption schemes.
Authentication
Authentication is a technique that allows a user to ensure that certain messages are
authentic. Authentication can be used to provide both integrity and non-repudiation
of data. In addition, password protecting access to certain files can ensure their confidentiality, although if the files are transmitted over the Internet additional measures
must also be taken.
One method of authentication is digital signatures. Digital signatures use a mathematical function that is dependent on both the data being sent and the secret key
of the sender. This function is called a one-way function or hash function, and has
the special property that it is easy to compute in one direction, but computationally
infeasible to compute the other way.
In a shared-secret system, when a piece of data is transmitted, the hash function is
used with a key to generate a digital signature that is attached to the data. The
receiver of the data can then compute the signature if he knows the key, and verify
that it matches. This makes it nearly impossible for a malicious attacker to change
the message, because it would change the signature in an unpredictable way. Of
course, if the attacker knew the key it would be possible to create a new message and
signature pair, but most security systems rely on the assumption that the key is not
known.
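The shared-secret scheme above can be sketched with Python's standard library. We assume HMAC as the keyed-hash construction and SHA-1 as the hash (chosen only because it is named later in this section); the function names and the sample key are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    """Compute a keyed hash (HMAC-SHA1) over the message."""
    return hmac.new(key, message, hashlib.sha1).digest()

def verify(key: bytes, message: bytes, signature: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(key, message), signature)

shared_key = b"example shared secret"      # hypothetical key
message = b"INVITE sip:bob@example.com"    # hypothetical message
tag = sign(shared_key, message)
```

Anyone holding `shared_key` can both create and verify such a signature, which is why this scheme cannot provide non-repudiation.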
In a public key system, authentication works much the same way. The signature is
only creatable by the private key, known only to the signer. Anyone, however, can
use his public key, with a different mathematical function, to verify that the signature
was created with the original data and private key.
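To illustrate the asymmetry between signing and verifying, here is a toy sketch using textbook RSA with a tiny, well-known example key (p = 61, q = 53). It is an assumption-laden illustration only: the key is trivially breakable, the message digest is reduced modulo n, and no padding scheme is applied, so it should not be read as how any real system signs data.

```python
import hashlib

# Textbook toy key: n = 61 * 53 = 3233, public exponent e = 17,
# private exponent d = 2753 (since 17 * 2753 = 1 mod 3120).
N, E, D = 3233, 17, 2753

def digest_mod_n(message: bytes) -> int:
    """Hash the message and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Only the holder of the private exponent D can compute this."""
    return pow(digest_mod_n(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Anyone can check the signature with the public pair (N, E)."""
    return pow(signature, E, N) == digest_mod_n(message)

msg = b"I promise to call back"   # hypothetical message
sig = sign(msg)
```

Verification uses only public values, so a third party can confirm who signed the message; this is the property that supports non-repudiation.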
Authentication proves that the message has not been tampered with and that the
sender of the message knew the key. This ensures that the data's integrity is intact. By
authenticating a sequence number, replay attacks can also be prevented. Additionally,
in a public-key system, the sender cannot deny sending the message, ensuring non-repudiation. There is, however, no non-repudiation for a shared key system, as any
party involved in the communication is capable of sending and receiving all messages
with the single key.
There are various algorithms used to compute hash functions. Two of the more common ones are MD5, which was developed by MIT Professor Ron Rivest[61], and SHA-1
(Secure Hash Algorithm, Version 1) [62], which was part of the U.S. Government's
Capstone project[63], a project that attempted to develop a set of cryptographic
standards.
Encryption
Encryption is the technique that is commonly used to provide confidentiality, particularly when data is in transit over a non-trusted network, such as the Internet. Typical
encryption mechanisms use a one-way cryptographic function to convert readable data
into seemingly random values, again using a secret key. In a shared-secret system, the
data can then be decrypted only by someone who knows the secret key, using another
function. In public key systems, data can be encrypted with the public key, and only
someone who knows the secret key is able to decrypt it. By encrypting information
before sending it over a network and decrypting it on the receiving side, the data is
kept private from anyone on the network who doesn't know the key, thus ensuring
confidentiality.
There are also various encryption algorithms employed in different areas. In 2000,
the National Institute of Standards and Technology (NIST) held a competition to determine the
new encryption standard. Ultimately an algorithm known as Rijndael became the
Advanced Encryption Standard (AES, as it is also now known), beating out other
algorithms including Ron Rivest's RC6, MARS, Twofish, and Serpent[52]. Still, AES
is not the only encryption algorithm used, and the older 3DES and RC5 are also
widely employed.
5.1.3 Security Implementation
There are several different implementations of security protocols. Different algorithms
can be used to authenticate and encrypt data, and different protocols can be used for
transporting encrypted data. Technically, security can be applied at any of the four
network layers, which in order of increasing complexity are[23]:
1. Data Link: The physical link between individual machines.
2. Network: The underlying addressing scheme (IP)
3. Transport: The establishment of a session (UDP, TCP)
4. Application: The application running over the network (FTP, HTTP, SIP, etc.)
Encryption can be done at each of these layers. For example, SHTTP and SRTP are
two protocols designed at the application layer to add security to HTTP and RTP,
respectively. IP Security, or IPSec[75], on the other hand, runs over the network layer
and encrypts any network traffic.
The National Institute of Standards and Technology (NIST) investigated VoIP security, and recommended the use of IPSec for secure VoIP systems[64]. Thus, we will focus our discussion of security implementations on IPSec. Except where noted, all of the material
in this section came from Sheila Frankel's book Demystifying the IPSec Puzzle[39],
which is an excellent source of detailed information regarding IPSec.
Authentication in IPSec
Authentication is done in IPSec either with an authentication header (AH), or authentication of the Encapsulating Security Payload (ESP, see next section). The AH
is meant to provide three specific purposes[39]:
* Connectionless Integrity: the received message is what was sent, and no tampering has occurred.
* Data Origin Authentication: a guarantee that the message was sent by the
apparent originator of the message, as opposed to someone masquerading as
that originator.
* Replay Protection (optional): the assurance that the same message isn't delivered multiple times, and messages aren't delivered grossly out of order.
The AH consists of six fields, but the most important ones are:
* Security Parameters Index (SPI): a number agreed upon by the communicating
parties that represents this particular security association. The SPI is used to
keep track of which keys are being used for the authentication and encryption
protocols.
* Sequence Number Field: a number that increases incrementally with each
packet, which is meant to provide replay protection.
* Authentication Data: the result of the one-way hash function of the key and
the rest of the packet. This is equivalent to the digital signature above, and the
method of calculating it is defined by the security association.
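The way these three fields work together can be sketched as follows. This is a simplified illustration, not the real AH wire format: it omits the remaining header fields, assumes HMAC-SHA1 truncated to 12 bytes for the authentication data, and rejects any sequence number that is not strictly increasing, whereas actual IPSec implementations use a sliding window. All names are hypothetical.

```python
import hashlib
import hmac
import struct

ICV_LEN = 12  # assumed length of the truncated authentication data

def build_packet(spi: int, seq: int, key: bytes, payload: bytes) -> bytes:
    """Layout: SPI (4 bytes) | sequence (4 bytes) | auth data (12) | payload."""
    header = struct.pack("!II", spi, seq)
    icv = hmac.new(key, header + payload, hashlib.sha1).digest()[:ICV_LEN]
    return header + icv + payload

class Receiver:
    def __init__(self, keys_by_spi):
        self.keys_by_spi = keys_by_spi   # SPI -> key for that association
        self.highest_seq = {}            # SPI -> highest sequence accepted

    def accept(self, packet: bytes):
        """Return the payload if the packet is authentic and fresh, else None."""
        if len(packet) < 8 + ICV_LEN:
            return None
        spi, seq = struct.unpack("!II", packet[:8])
        icv, payload = packet[8:8 + ICV_LEN], packet[8 + ICV_LEN:]
        key = self.keys_by_spi.get(spi)
        if key is None:
            return None                  # unknown security association
        expected = hmac.new(key, packet[:8] + payload, hashlib.sha1).digest()[:ICV_LEN]
        if not hmac.compare_digest(icv, expected):
            return None                  # tampered packet or wrong key
        if seq <= self.highest_seq.get(spi, 0):
            return None                  # replayed or stale packet
        self.highest_seq[spi] = seq
        return payload
```

Because the sequence number is covered by the authentication data, an attacker cannot simply renumber a captured packet to make it look fresh.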
A nice feature of IPSec is that it is versatile. For example, many cryptographic
hash functions exist, including MD5, SHA-1, and SHA-2, and IPSec allows users to
customize which algorithm and keys they will use on a connection by connection basis.
This is accomplished during key exchange negotiation, and is based on a security
policy database (SPD) that is referenced by the SPI and IP address associated with
the connection.
Encryption in IPSec
Encryption occurs in IPSec via the use of the encapsulating security payload (ESP).
The ESP, in addition to providing confidentiality, can provide all the functions of the
AH. For this reason, many people argue that the AH should be removed from the RFC
for IPSec entirely, and the ESP should perform both authentication and encryption.
Although authentication is optional, it is highly recommended, otherwise the integrity
of the message cannot be guaranteed. Some implementations of IPSec don't support
the AH and rely on the ESP with a "null" encryption algorithm (that does not encrypt
data) for authentication alone[40].
In addition to authentication, the two additional functions provided by the ESP are:
* Confidentiality: a guarantee that even if someone sees the message, the contents
are not understandable except by the authorized recipient.
* Traffic Analysis Protection (optional): the assurance that an eavesdropper cannot determine who is communicating with whom, or the frequency and volume
of communications.
The ESP is made up of seven fields, including the SPI, sequence number, and (optional, but recommended) authentication data fields also found in the AH. The important fields unique to the ESP are:
* Payload Data Field: the encrypted contents of the packet. The padding and
padding length fields (see below) are also encrypted and contained in this field.
* Padding: additional bits that are not used but increase the length of the packet.
This allows certain cryptographic algorithms that require fixed block sizes to
be performed on the packet, and can also provide traffic analysis protection.
* Padding Length: total number of padding bytes, so they can be ignored by the
receiver.
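The padding and padding length fields can be illustrated with a short sketch. It models only the two fields named above (the real ESP trailer carries one further byte), assumes a 16-byte cipher block, and fills the padding with the 1, 2, 3, ... pattern that the ESP specification describes as the default. The function names are hypothetical.

```python
def esp_pad(payload: bytes, block_size: int = 16) -> bytes:
    """Append padding and a one-byte padding length so that the result
    is a whole number of cipher blocks."""
    pad_len = -(len(payload) + 1) % block_size   # +1 for the length byte
    padding = bytes(range(1, pad_len + 1))       # 1, 2, 3, ... fill pattern
    return payload + padding + bytes([pad_len])

def esp_unpad(padded: bytes) -> bytes:
    """Read the padding length from the last byte and strip the padding."""
    pad_len = padded[-1]
    return padded[: len(padded) - 1 - pad_len]

voice_frame = bytes(160)            # hypothetical 160-byte voice payload
padded_frame = esp_pad(voice_frame)
```

A sender that wants traffic analysis protection could pad beyond the minimum so that all packets share one length, at the cost of extra bandwidth.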
The ESP offers similar versatility in terms of encryption algorithms, and many implementations of IPSec offer a choice among AES, DES, 3DES, Blowfish, and several
other common encryption algorithms. The keys, key-sizes, and algorithms used are
negotiated beforehand and interpreted through the use of the SPI.
The ESP and AH can be used in conjunction to provide authentication and encryption. In addition, the ESP can perform both of these functions, making use of the
AH unnecessary.
Transport vs. Tunnel Mode
IPSec offers two different modes of encryption, called transport and tunnel mode. In
transport mode, the two communicating users encrypt the contents of the packet, but
leave the IP headers intact. Since packets are routed based solely on IP headers, this
allows the encrypted part of the packet to traverse the network normally as long as
there is no need for intermediate routers to inspect packet contents (e.g. the TCP
header). In this case the entire encrypted packet arrives at its destination, where it
can be decrypted by the other party. Because the IP headers are sent in the clear,
anyone can know who the communicating parties are, which can represent a breach
of confidentiality.
Tunnel mode was designed to fix this problem. In tunnel mode, two machines, typically
known as VPN gateways, set up an encrypted "tunnel" between them. When a user
wants to communicate securely with someone, he sends his packet to his gateway
where the whole packet is encrypted. Then a new IP header with the address of the
other party's own gateway is put on the packet, and sent to the other party's gateway.
Thus, the two gateways form a secure tunnel that allows any users behind them to
communicate with each other. Observers in between the gateways cannot determine
who is talking to whom, but merely that someone from one gateway's network is
talking to someone else on the other gateway's network. Communication between the
user and his own gateway can also be encrypted with a different security association,
possibly using transport mode, if the internal network is not trusted.
Implementations
There are many implementations of IPSec, including FreeS/WAN[40], Openswan[41],
StrongSwan[42], and an IPSec version built into Linux kernels 2.6 and higher[43].
FreeS/WAN (which was named for a free implementation developed out of the Secure
Wide Area Network (S/WAN) project) was the first widely adopted open-source
implementation of IPSec. Unfortunately, FreeS/WAN's development stopped in April
of 2003, and as a result it lacks support for several algorithms including AES. Several
projects branched from FreeS/WAN, including StrongSwan and Openswan.
These implementations of IPSec are done in software, but IPSec can also be implemented in hardware, and many times this is done for performance gains in the
encryption algorithms. Additionally, cards exist that offload encryption algorithm
computation, but must be combined with IPSec software to fully implement the protocol. In Section 9 we discuss the IPSec implementation we chose for our experiments,
Openswan, and why that choice was made.
5.2 Security Applied to VoIP
Now we will look at security in the context of VoIP. We will start with the four
characteristics of security described above, and then move onto some issues that are
VoIP specific. The purpose of this section is to provide an overview of some of
the security challenges associated with VoIP networks. Reference [64] provides an
excellent discussion of VoIP security considerations in much more detail.
5.2.1 Confidentiality
There are various levels of confidentiality that can be accomplished in VoIP systems.
The most sensitive information is what is being said in the conversation. To protect
this information, the data streams must be encrypted. There are various ways of
encrypting VoIP data streams. One way is to use a broad layer of encryption, such
as IPSec, for encrypting all communications. Secure RTP (SRTP) [77] can also be
used, and has been shown to be slightly more efficient than IPSec. SRTP, however,
is limited to the RTP data streams[64], so call setup requires the use of a separate
security protocol.
A second level of confidentiality is necessary to mask usage patterns. Encrypting data
streams alone still allows a malicious eavesdropper to see who is calling whom, what
codecs they are using, how long they talked, and other usage patterns. To prevent this
information from being seen, a stronger layer of security should be used. In particular
the call-setup messages (typically either SIP or H.323) need to be encrypted as well.
This usually requires a security relationship to exist between each of the users and the
server that is connecting them. The calling party must establish a secure connection
to the server to tell it who to call, then the server must establish a second secure
connection with the person being called to relay the message and set up the call. This
will allow the call to be set up in a private manner. (Additionally, a third security association must exist between the caller and callee if their communication is to be encrypted.)
A final level of confidentiality can be obtained to prevent traffic analysis. Traffic
analysis is a technique that can be used to determine what (encrypted) VoIP conversations look like on a network. Then, even if an eavesdropper couldn't see who is
talking or what is being said, he could at least see that a phone call was occurring
between two subnets. This can only be prevented by masking VoIP data by sending
additional data all the time. Then an eavesdropper wouldn't be able to distinguish a
conversation from the background noise of constant data. This level of confidentiality
is quite impractical, and for the most part unnecessary in all but the most secretive
VoIP applications.
5.2.2 Integrity and Non-Repudiation
There are many ways in which integrity is important in VoIP communications. The
most obvious application is preventing a person from impersonating someone else.
The simplest way to prevent impersonations is to authenticate users at every step
of the conversation, including authenticating the server that sets up the call. This
assures that the person calling you is who they claim to be.
Authentication is also important for server administrators. In SIP, for example, the
registration servers should only allow users to register if they "belong" to the group
of users known by that particular server. Typically users are required to provide a
password when registering and routing calls through a server.
One known problem with SIP is the ability to fake, or spoof, another user's phone
number. Studies have shown that it is currently easy to fake a PSTN phone number
using SIP[65]. This represents a violation of integrity, as the person you are talking
to might not be who they claim to be. This threat can sometimes be mitigated
by using voice authentication, but biometric authentication is not currently widely
implemented.
Most of the issues of integrity and non-repudiation can be solved by applying strict
authentication rules at every step of the communication. If users and servers are
required to authenticate each other, integrity and non-repudiation can be maintained.
5.2.3 Availability
Availability is typically the hardest security characteristic to maintain. Many factors,
both malicious and accidental, can result in a loss of service. A power failure, for
example, could take down SIP servers and prevent users from making calls.
Several known DoS attacks already exist against VoIP systems. One very simple
attack is to send a large number of INVITE messages to a particular server or user.
As the user gets flooded with requests to initiate calls, he cannot tell the difference
between real ones and malicious ones. Combatting attacks like these requires complex
rules on when and from whom to accept calls[22].
One crucial aspect of telephone communication is the 911 emergency line. Currently,
VoIP systems do not have a universal standard for emergency response, and if they
are ever to replace PSTN phones, one must be set up and be made available all the
time[64].
5.2.4 Security and Quality of Service
A final important aspect of security is the role it plays in quality of service. Securing
communication has several effects, including interfering with call setup messages,
and affecting conversation quality. For conversation quality there are two aspects:
encryption overhead, and traffic prioritization.
It is known that cryptographic algorithms can be computationally intense, and can
increase network bandwidth significantly. In an environment where bandwidth is
limited, such as wireless links in airborne networks, this can be a serious problem.
Additionally, the computational delays introduced by encryption could result in the
perceived quality of the conversation significantly decreasing.
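To make the bandwidth concern concrete, the following sketch estimates per-packet size with and without ESP in tunnel mode. Every number in it is an assumption chosen for illustration rather than a measurement from this thesis: a 64 kbit/s codec with 20 ms packets (160 voice bytes), IPv4, AES in CBC mode with a 16-byte initialization vector, and a 12-byte authentication field.

```python
IP_HEADER = 20    # IPv4 header without options
UDP_HEADER = 8
RTP_HEADER = 12
ESP_HEADER = 8    # SPI (4 bytes) + sequence number (4 bytes)
ESP_TRAILER = 2   # padding length + next header

def plain_packet_bytes(voice_bytes: int) -> int:
    """Size of an unencrypted voice packet on the wire (IP layer)."""
    return IP_HEADER + UDP_HEADER + RTP_HEADER + voice_bytes

def esp_tunnel_packet_bytes(voice_bytes: int, block: int = 16,
                            iv: int = 16, icv: int = 12) -> int:
    """Size of the same packet inside an ESP tunnel. The whole original
    IP packet is encrypted and a new outer IP header is added."""
    inner = plain_packet_bytes(voice_bytes) + ESP_TRAILER
    padded = -(-inner // block) * block    # round up to a full block
    return IP_HEADER + ESP_HEADER + iv + padded + icv

def bandwidth_bps(packet_bytes: int, frame_ms: int = 20) -> float:
    """Bits per second for one direction of a call."""
    return packet_bytes * 8 * 1000 / frame_ms

overhead_ratio = esp_tunnel_packet_bytes(160) / plain_packet_bytes(160)
```

The relative overhead grows as the voice payload shrinks, so low-rate codecs with small frames would be penalized more heavily under these assumptions.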
The second way encryption could hinder conversation quality is its influence on traffic prioritization. Traffic prioritization is a technique employed in routers to give
delay-sensitive traffic, such as VoIP, a higher priority than traffic whose delivery is
not as time-constrained (e.g. e-mail and web browsing). This can greatly improve
call quality in bandwidth constrained situations. The use of encryption, however,
causes the information that allows the router to prioritize traffic to be hidden. Traffic
prioritization is a topic outside of the scope of this thesis and left for future work (see
Section 11).
The impact of the encryption overhead in non-prioritized VoIP systems, however, is
a large part of the rest of this thesis. This subject is explored in Sections 8 and 9.
6 Measuring Conversation Quality
When using a service such as a telephone or VoIP system, being able to evaluate the
quality of the service is extremely important. People simply will not use a telephone
that is unreliable or difficult to hear. Secure VoIP systems must not only be reliable,
but they should also provide a level of conversation quality equal to the traditional
phone system to which people have grown accustomed. For this reason it is important
to be able to measure conversation quality.
There are various proposed methods of evaluating conversation quality, with the most
accepted measure being the Mean Opinion Score (MOS). Details of the MOS and two
computational models, the E-Model and the Perceptual Evaluation of Speech Quality
(PESQ), are discussed in this section.
6.1 Mean Opinion Score (MOS)
The Mean Opinion Score, or MOS, is the most widely employed measure of conversation quality. It is summarized in recommendation P.800 of the International Telecommunication Union's Telecommunication Standardization Sector (ITU-T). The MOS is
what it sounds like, an average of people's opinions.
MOS is a subjective quality metric that relies on human beings' perception of quality.
Generally a particular sound file or channel will be evaluated by several people, and
the resulting average of their scores becomes the MOS. MOS scores range from 1 to 5,
with a descriptive quality associated with each score that is summarized in Table 2.
An MOS of 4 or better is considered toll quality (equivalent to PSTN phones), while
3.6 or higher is called "acceptable" for toll quality. A setup with an MOS of less
than 3.6 is only considered usable if it offers some benefit over traditional phones.
Score    Quality of Speech
5        Excellent
4        Good
3        Fair
2        Poor
1        Bad
Table 2: The MOS Quality Scale
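As a small illustration of how an MOS is formed and interpreted, the sketch below averages a set of listener ratings and applies the thresholds given in the text (4 for toll quality, 3.6 for acceptable). The function names and the sample ratings are hypothetical.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average a list of integer ratings on the 1 to 5 scale of Table 2."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be a non-empty list of scores from 1 to 5")
    return float(mean(ratings))

def describe(mos: float) -> str:
    """Classify an MOS using the thresholds stated in the text."""
    if mos >= 4.0:
        return "toll quality"
    if mos >= 3.6:
        return "acceptable"
    return "below acceptable"

listener_ratings = [4, 4, 5, 3]   # hypothetical panel of four listeners
panel_mos = mean_opinion_score(listener_ratings)
```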
There are various tradeoffs associated with the MOS, or any subjective score as a
metric for evaluating conversation quality. On one hand, the MOS offers the most
accurate assessment of true conversation quality. In an application like VoIP or
telephony, ultimately a person will be the one who benefits or suffers from variations
in conversation quality. Thus, it makes sense that the metric for evaluating quality
should be a subjective judgment done by people.
Figure 4: Evaluating VoIP with the PESQ Model
On the other hand, using a subjective score to evaluate quality introduces many
logistical problems. One issue is that it becomes quite impractical to perform multiple
experiments if each one must be done with several users, each of whom assesses the
quality separately so that it can be averaged. Additionally, different people have
different perceptions of quality. A person who is accustomed to talking on a cellular
phone might evaluate a call higher than one who always uses a land-line. While
averaging allows some of this variation to be mitigated, it is difficult to get enough
people's opinions to counteract the variation.
To avoid the reliance on subjective human evaluation, two additional ways to evaluate
quality were introduced by the ITU. These methods are discussed next.
6.2 Perceptual Evaluation of Speech Quality (PESQ)
The PESQ[54] is described in ITU-T recommendation P.862, and is a perceptual
model. This means that it evaluates quality by comparing the sent and received
speech signals in the psychoacoustic (audio) domain. The PESQ's quality rating is
obtained by characterizing and determining the amount of distortion between the
signals. This is shown in the context of a VoIP call in Figure 4[56]. Note that the
analysis occurs outside the channel of communication in this context.
While PESQ offers a reasonable method of determining the loss in sound quality due
to distortion of a signal, it does have some significant drawbacks. One problem is
that the PESQ offers no way to evaluate conversation quality loss due to latency, as
only the two endpoint signals are compared. Thus, a perfect channel with a very
long latency would receive a high PESQ score, even though users would find its
quality unacceptable. Since one of the focuses of our study was on high latency links,
the PESQ model was inappropriate. Additionally, the PESQ reveals no information
about the sources of degradation, only that they occur.
Other perceptual models have been introduced that are similar to PESQ. The Perceptual Speech Quality Measurement (PSQM)[55] was introduced by the ITU before
the PESQ, and the PESQ is considered an improvement on it. Additionally, an
adaptive quality model[57] has been proposed to change its score over the duration
of a conversation. The reader is referred to the references for more information about
these models.
6.3 E-Model
The final quality model we will look at is the E-Model[53], which is described in ITU-T recommendation G.107. The E-Model is another objective model, calculated based
on qualities of the signal and of the channel used for communication. There are a
few advantages to the E-Model over perceptual models. Unlike PESQ, the E-Model
takes absolute delays into account when determining voice quality. This is impossible
to do in a perceptual model that only compares end-point signals. Additionally, the
E-Model is broken up into different terms that each represent a different aspect of the
connection that contributes to conversation quality. This makes it easier to determine
what is causing a particular communication channel to be deficient. Finally, the E-Model was designed with the knowledge of VoIP systems in mind, and includes terms
for codec and packet loss. This makes it more easily applied to VoIP systems than
the other models.
The E-Model defines a Quality Rating (R) that varies from 0-100. A formula for
translating between R and MOS is given in [53] and summarized in Figure 5. The
E-Model relies on an assumption (based on empirical evidence in psychophysical research) that the psychological effect of uncorrelated sources of impairments is additive.
The resulting quality rating, R, is thus calculated by determining the signal to noise
ratio, $R_0$, and subtracting a set of impairments:
$$R = R_0 - I_s - I_d - I_e + A$$
Below is a brief description of each term.
* $R_0$: Represents the best possible quality for a given signal to noise ratio.
* $I_s$: Simultaneous impairment factor due to loud signals, quantizing distortion,
and sidetone effects.
* $I_d$: Delay impairment factor. This represents quality loss due to end-to-end
delay and echoes.
* $I_e$: Equipment impairment factor. This takes into account signal distortion
(due to a low-rate codec) and packet loss (in VoIP systems).
* $A$: Advantage factor. This attempts to model the fact that users will allow a
degradation in quality for other advantages.
Figure 5: R-Values and MOS for Varying Conversation Quality[58]
The E-Model forms a basis for comparing the conversation quality provided by a
communication network. Its use in VoIP systems will be discussed in more detail in
Section 8.
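The additive structure of the E-Model and the translation from R to MOS can be sketched in a few lines. The conversion formula is not reproduced in this text; the cubic polynomial below is the mapping as it is commonly published for recommendation G.107 and should be checked against [53] before being relied upon. The clamp to a minimum of 1 and the function names are our own assumptions.

```python
def e_model_r(r0: float, i_s: float, i_d: float, i_e: float, a: float = 0.0) -> float:
    """Quality rating: start from R0, subtract impairments, add the advantage."""
    return r0 - i_s - i_d - i_e + a

def r_to_mos(r: float) -> float:
    """Translate an R value (0 to 100) into an estimated MOS (1 to 4.5)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    return max(1.0, mos)   # assumed clamp: the raw polynomial dips below 1 for very small R

# Hypothetical impairment values, chosen only to exercise the formula.
example_r = e_model_r(r0=90, i_s=5, i_d=10, i_e=5)
example_mos = r_to_mos(example_r)
```

With this mapping an R value of 70 corresponds to an MOS of roughly 3.6, which is consistent with the "acceptable" threshold quoted in Section 6.1.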
7 VoIP Conversation Modeling
To repeatably compare VoIP systems, we will develop a model based on the widely used conversation model proposed by Paul T. Brady[31]. The remaining sections
motivate our approach, describe the existing model, and discuss our modifications to
it for VoIP applications.
7.1 Motivation
For the purpose of evaluating the impact of network and security conditions on conversation quality, a program that could generate realistic VoIP signaling and data
traffic was desired.
There are currently several commercial VoIP traffic generators[2, 1], but these systems
are built primarily for determining network capacity. As a result, the intricacies of
actual conversations are generally ignored and simple "always on" or "on-off" (see
Section 7.2.2) models are used[37].
To develop a more accurate model of human conversations, speaker interactions and
Voice Activity Detection systems used by commercial VoIP phones (see below) must
be represented.
Most VoIP phones use a technique known as Voice Activity Detection (VAD) to only
send data when a person is talking, and conserve bandwidth in periods of silence.
The VAD systems that we studied were not perfect, however, and usually sent data
for a few extra seconds in periods of silence. The details of the behavior of the Voice
Activity Detectors, and how to model their behavior, are discussed in Section 7.3.
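The imperfect behavior described above, where data keeps flowing for a while after speech stops, can be approximated with a simple hangover counter. This is a sketch under our own assumptions (a fixed hangover measured in frames, and speech detection given as a list of flags); it is not the model developed in Section 7.3.

```python
def frames_sent(speech_flags, hangover_frames):
    """For each frame, decide whether the phone transmits it.
    A frame is sent if it contains speech, or if fewer than
    `hangover_frames` silent frames have passed since the last speech."""
    sent = []
    remaining = 0
    for is_speech in speech_flags:
        if is_speech:
            remaining = hangover_frames   # restart the hangover period
            sent.append(True)
        elif remaining > 0:
            remaining -= 1                # still inside the hangover period
            sent.append(True)
        else:
            sent.append(False)            # true silence: nothing is sent
    return sent

# Hypothetical talkspurt pattern: two speech frames, four silent, one speech, one silent.
flags = [True, True, False, False, False, False, True, False]
result = frames_sent(flags, hangover_frames=2)
```

A longer hangover reduces the risk of clipping the start of the next talkspurt, but it also reduces the bandwidth saved during silence.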
The need for different speaker-models arose in evaluating the improvements in conversation quality gained through the use of particular protocols. In particular, push-to-talk
systems are widely employed to improve conversation quality over channels
with high latency. We sought to incorporate this benefit into our improved quality
assessment tool. Our efforts in this direction are discussed in Section 8.3.
A second aspect to security was whether anything could be learned from the (encrypted) traffic streams via pattern profiling. For example, in a military scenario, if
it could be observed that one user talked much more frequently than another, it might
be inferred that the talkative user was the commanding officer. In this way hierarchical relationships could be determined without being able to decrypt the packet
streams. Pattern profiling lies outside the scope of this thesis, but the model we
developed could be used to investigate this further.
For these reasons it is important to create an accurate model of VoIP traffic, and, by
altering the parameters of the model, develop different user and conversation types
that could be tested separately.
7.2 Brady's Model for Two-Way Speech
Accurately modeling a two-party conversation is complex. A conversation between
two individuals can be modeled as a series of talkspurts and silences by each person.
These periods can overlap: at any given time both people could be silent, either
one could be talking while the other is silent, or they could both be talking. The
distribution and duration of these events in PSTN systems is the subject of significant
research[32, 31] by Paul Brady.
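The four overlapping cases just listed can be made concrete with a short sketch that classifies the state of a conversation at any instant and totals the time spent in each state. The representation of talkspurts as (start, end) pairs in milliseconds, the state labels, and the sample data are our own illustrative assumptions.

```python
from collections import Counter

def is_talking(talkspurts, t_ms):
    """True if time t_ms falls inside any (start_ms, end_ms) talkspurt."""
    return any(start <= t_ms < end for start, end in talkspurts)

def conversation_state(a_spurts, b_spurts, t_ms):
    """Classify the instant t_ms into one of the four possible states."""
    a = is_talking(a_spurts, t_ms)
    b = is_talking(b_spurts, t_ms)
    if a and b:
        return "double talk"
    if a:
        return "A talks"
    if b:
        return "B talks"
    return "mutual silence"

def state_durations_ms(a_spurts, b_spurts, total_ms):
    """Total time per state, sampled once per millisecond."""
    return dict(Counter(conversation_state(a_spurts, b_spurts, t)
                        for t in range(total_ms)))

# Hypothetical three-second exchange.
a_spurts = [(0, 1000), (2500, 3000)]
b_spurts = [(800, 2000)]
durations = state_durations_ms(a_spurts, b_spurts, 3000)
```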
Brady's model has been the primary reference for conversation modeling for 38 years,
but was developed before the advent of the Internet and VoIP technology. We have
adapted the model to VoIP applications by adjusting the model parameters.
In the rest of this section we'll first describe Brady's model via a series of increasingly
complex models. We'll then describe how the model can be adapted to more closely
resemble what is observed in VoIP systems.
7.2.1 One-Port versus Many-Port Models
A one-port model describes a single user's speech patterns given only the user's knowledge of the state of the conversation. The model knows when it hears the other
speaker, but only a single person's speech patterns are actually generated by each
model. Two one-port models must be connected together via a communication channel to generate a full conversation.
A many-port model would attempt to model the entire system including the channels
between the speakers. This allows full conversations to be modeled in a single system,
but has several disadvantages when compared to a one-port model. One problem is
that modeling different channels between the users (variable delays, for example)
requires a new model for each type of channel. Also, each model encompasses an
entire conversation (speaker pair) as opposed to a single speaker. In other words, to
model conversations among N users, N(N-1) different sets of parameters are necessary
for the two-port model (one for each speaker pair), while only N sets are required by
the one-port model (one for each speaker).
Another advantage of the one-port model is that it depends only on what is seen at
the receiving end. If someone is talking, but for some reason the data isn't making it
to the other user (e.g. a network link went down), the receiving model treats this as
silence from the sending model, in much the same way an individual speaker would.
Leaving the channel out of the model also allows the patterns of conversation to
change depending on the characteristics of the actual channel connecting the users, a
property that is important for testing the impact of distinct channels.
Figure 6: A Two-State Conversation Model (states: Talking, Not Talking; transitions: A talks, A stops talking)
7.2.2 The One-Port, Two-State Model
Brady used a series of increasingly complex models to better describe human conversations. The simplest and most intuitive model is the two-state model, shown in
Figure 6.
In the two-state model a user is either in a talking state or a not-talking state. The
duration of time the user spends in each state is based on an exponential probability
distribution that is discussed in Section 7.2.5. The two-state model, though simple,
has several shortcomings. The biggest drawback is that it is not at all dependent
on input received from the other party. A person is equally likely to be talking
whether the person on the other end is silently waiting or shouting over him! Despite
the statistical shortcomings of the two-state model, it is actually the one that most
major commercial VoIP traffic generators use to model conversations[37].
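The two-state model can be sketched in a few lines of Python. This is only an illustration: the start-talking and stop-talking rates (0.5 and 1.0 per second) are hypothetical values chosen for the example, the function names are ours, and the per-timestep transition rule anticipates the discussion in Section 7.2.5.

```python
import random

def simulate_two_state(alpha, beta, duration, dt=0.02, seed=0):
    """Return one boolean per timestep (True = talking).

    alpha: start-talking rate in pulses/second (hypothetical input).
    beta:  stop-talking rate in pulses/second (hypothetical input).
    The other party is ignored entirely, which is the model's main weakness.
    """
    rng = random.Random(seed)
    talking = False
    trace = []
    for _ in range(int(round(duration / dt))):
        rate = beta if talking else alpha
        # A pulse arrives in this timestep with probability rate * dt.
        if rng.random() < rate * dt:
            talking = not talking
        trace.append(talking)
    return trace

def mean_run_length(trace, value, dt=0.02):
    """Average duration in seconds of maximal runs equal to `value`."""
    runs, current = [], 0
    for sample in trace:
        if sample == value:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return dt * sum(runs) / len(runs) if runs else 0.0

trace = simulate_two_state(alpha=0.5, beta=1.0, duration=20000)
print("mean talkspurt (s):", mean_run_length(trace, True))
print("mean silence (s):  ", mean_run_length(trace, False))
```

With these inputs the mean talkspurt should come out near 1/beta and the mean silence near 1/alpha, regardless of what the other speaker is doing.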
7.2.3 The One-Port, Four-State Model
To make the conversation model more realistic, Brady next introduced the four-state
model. In the four-state model, the other speaker's input combined with the model's
output determines the state. For simplicity, throughout the rest of this section the
speaker being modeled will be referred to as speaker A, while the person they are
talking to will be called speaker B. The four states of the model then become mutual
silence, A talking alone, B talking alone, and double talk. These states and the
transitions among them are shown in Figure 7.
Note that because this is a one-port model, speaker A only has control of some of
the transitions. In particular, A's actions determine when the vertical lines (shown as
solid lines) are traversed. Speaker A can decide whether to start talking if he is silent,
or stop talking if he is talking. He doesn't, however, control the horizontal transitions
(shown as dashed lines). Whether or not B is talking is controlled by something else
(perhaps a symmetric model connected at the other end, but it could be anything),
and A can only perceive that B is talking or silent. If A hears that B has fallen silent,
a horizontal transition in Figure 7 will be made, but this transition is not controlled
by A.
Figure 7: A Four-State Conversation Model (states: Silence, Talking Alone, Listening, Double Talk; transitions: A talks, A stops talking, B talks, B stops talking)
7.2.4 The One-Port, Six-State Model
While the four-state model simulates conversation patterns more accurately than the
two-state model, Brady found that it was still inadequate, particularly in predicting events around mutual silence and double talk[31]. Brady found that behavior in
silence depended largely on who was talking immediately prior to the silence, and similarly that behavior during double-talk depended largely on who interrupted whom.
This issue was dealt with by breaking the double-talk and silence states into two
states each, one for each possible previous state. The revised, final model is shown
in Figure 8.
As in the four-state model, only the vertical transitions, representing A's decision to
start or stop talking, are controlled by the model.
The six-state model was able to represent recorded, human, PSTN conversations, and
accurately described a data set of conversations collected by Brady[31, 32].
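To make the role of conversational history concrete, the following Python sketch labels each timestep of a conversation with one of the six states, given on/off sequences for the two speakers. The function names, the default starting state, and the handling of the rare case in which both speakers change state in the same timestep are our assumptions and are not taken from Brady's papers.

```python
TALKING_ALONE = "Talking Alone"
LISTENING = "Listening"
DT_A = "Double Talk A Interrupts"
DT_B = "Double Talk B Interrupts"
SIL_A = "Silence A Last"
SIL_B = "Silence B Last"

def next_state(prev, a_talking, b_talking):
    """Six-state label for the current timestep, from speaker A's perspective."""
    if a_talking and not b_talking:
        return TALKING_ALONE
    if b_talking and not a_talking:
        return LISTENING
    if a_talking and b_talking:
        if prev in (DT_A, DT_B):
            return prev  # who interrupted whom does not change mid double talk
        # If A was already talking alone, B is the interruptor.
        # Assumed tie-break: any other predecessor counts as A interrupting.
        return DT_B if prev == TALKING_ALONE else DT_A
    # Mutual silence.
    if prev in (SIL_A, SIL_B):
        return prev  # who spoke last does not change during a silence
    # Assumed tie-break: leaving double talk in one step counts as B last.
    return SIL_A if prev == TALKING_ALONE else SIL_B

def label(a_trace, b_trace, start=SIL_B):
    """Label every timestep; `start` is an assumed initial state."""
    states, prev = [], start
    for a, b in zip(a_trace, b_trace):
        prev = next_state(prev, bool(a), bool(b))
        states.append(prev)
    return states

a = [1, 1, 0, 0, 0, 1, 1, 0]
b = [0, 0, 0, 1, 1, 1, 0, 0]
for step, state in enumerate(label(a, b)):
    print(step, state)
```

Calling label(b, a) gives the same conversation from the other speaker's point of view: each Talking Alone becomes Listening, each "A" state becomes the corresponding "B" state, and vice versa.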
7.2.5 Model Parameters and Interpretation
Although the states of the conversation model have been defined, we have not yet
discussed how much time is spent in each state. This duration is calculated by first
breaking up the conversation into a series of discrete timesteps dt during which the
state is forced to remain constant. Brady chooses dt = .005 seconds, but because
most VoIP codecs send data at a rate of 50 packets per second (pps) or less, dt = .02
seconds (corresponding to this rate) was used for our VoIP modeling. Then, at
any given timestep the state will change if either a start-talking pulse occurs (when
A is silent), or a stop-talking pulse occurs (when A is talking). These pulses are
characterized by a Poisson arrival process[35]. "Start-talking" pulses are called α-pulses, while "stop-talking" pulses are known as β-pulses. The values of α and β
depend on the current state of the conversation, and are such that the probability of
talking from a silent state during a dt timestep is α_state * dt, and the probability of
falling silent from a talking state for any timestep is β_state * dt.
Figure 8: A Six-State Conversation Model (states: Talking Alone, Listening, Double Talk A Interrupts, Double Talk B Interrupts, Silence A Last, Silence B Last; transitions: A talks, A stops talking, B talks, B stops talking)
It is important to realize that the α and β values are not probabilities themselves,
but only become probabilities when multiplied by a dt. This means that α and β
can (and will) have values greater than 1.
The physical interpretation of the α and β values is that they represent a stream
of pulses trying to drive speaker A out of his state. An α of some value, say three,
means that there is a stream of α-pulses forcing A to talk that occur at an average
rate of three pulses/second. Thus, the units of α and β are pulses/second. Figure 9
shows the probability of leaving a particular state for an unspecified α value. A nice
property of this model is that the mean time a speaker spends in a state governed by
an α (or β) value, neglecting transitions caused by the other speaker, is 1/α (or 1/β).
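The mean-time property follows directly from the per-timestep probability. The short derivation below is a sketch in our own notation (N is the number of timesteps spent in the state) and neglects transitions caused by the other speaker, as in the text.

```latex
% N = number of timesteps of length dt spent in a state before an \alpha-pulse arrives
P(N = n) = (1 - \alpha\,dt)^{\,n-1}\,\alpha\,dt, \qquad n = 1, 2, \dots

E[N] = \sum_{n=1}^{\infty} n\,(1 - \alpha\,dt)^{\,n-1}\,\alpha\,dt = \frac{1}{\alpha\,dt}

E[t] = E[N]\,dt = \frac{1}{\alpha}

% Example: \alpha = 3 pulses/second gives E[t] = 1/3 second.
% The same argument with \beta in place of \alpha gives E[t] = 1/\beta.
```

The result does not depend on the choice of dt, which is why the same parameters can be used with Brady's dt = .005 seconds and with the dt = .02 seconds used here.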
In the six-state model there are six parameters that characterize an individual speaker:
three α values, one for each of the three states where A is silent, and three β values, one for each of the three
states where A talks. These six parameters are called α_pse (pause), α_alt (alternation),
α_int (interrupt), β_sol (solitary), β_ted (interrupted), and β_tor (interruptor), and are summarized in Table 3. Figure 10, a revised version of Figure 8, shows the parameters in
relation to the state transitions.
Figure 9: Probability Distribution of Time Spent in a State (the mean of the distribution is E[t] = 1/α)
Figure 10: Six-State Conversation Model with α and β Parameters
Parameter   Description                        From State                  To State
α_pse       Start talking after pausing        Silence A Last              Talking Alone
α_alt       Start talking in alternation       Silence B Last              Talking Alone
α_int       Start talking, interrupting        Listening                   Double Talk A Interrupts
β_sol       Stop talking when talking alone    Talking Alone               Silence A Last
β_ted       Stop talking after interrupted     Double Talk B Interrupts    Listening
β_tor       Stop talking after interrupting    Double Talk A Interrupts    Listening
Table 3: Brady's Parameters for the Six-State Model
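Table 3 maps naturally onto a lookup table. The sketch below encodes it as a Python dictionary and computes the per-timestep probability that speaker A changes state. The parameter values in the example are hypothetical, and the dictionary and function names are ours.

```python
# For each state: (governing parameter, state entered if A's pulse fires).
A_CONTROLLED = {
    "Silence A Last": ("alpha_pse", "Talking Alone"),
    "Silence B Last": ("alpha_alt", "Talking Alone"),
    "Listening": ("alpha_int", "Double Talk A Interrupts"),
    "Talking Alone": ("beta_sol", "Silence A Last"),
    "Double Talk B Interrupts": ("beta_ted", "Listening"),
    "Double Talk A Interrupts": ("beta_tor", "Listening"),
}

def toggle_probability(state, params, dt=0.02):
    """Probability that A starts or stops talking during one timestep."""
    name, _destination = A_CONTROLLED[state]
    p = params[name] * dt
    if p > 1.0:
        # rate * dt is only a valid probability when dt is small enough.
        raise ValueError("dt is too large for rate %s" % name)
    return p

# Hypothetical parameter set, for illustration only.
example_params = {
    "alpha_pse": 1.0, "alpha_alt": 2.0, "alpha_int": 0.5,
    "beta_sol": 0.4, "beta_ted": 1.2, "beta_tor": 0.7,
}
print(toggle_probability("Listening", example_params))
print(A_CONTROLLED["Listening"][1])
```

Only the transitions that A controls appear in the table; the transitions driven by speaker B are determined by whatever is connected to the other end of the channel.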
7.2.6 Model Limitations
The six-state model, although far superior to the two- or four-state models, is still
not perfect. Brady comments that the model sometimes does not perfectly handle
events surrounding double-talk[31]. Additionally, the model only covers two-person
conversations. In practice it would be beneficial to have a model that could handle 3
or more speakers in a conference call.
We speculate that this type of system could be modeled with a logical "or" of the
rest of the speakers representing speaker B's behavior, and perhaps an adjustment
of the model parameters. For example, to model speaker A's conference call with C,
D and E, we could use the six-state model, and represent B's talking as C or D or
E talking. This admittedly wouldn't be perfect (there would be no way to know or
change behavior if multiple speakers (C and D) are talking at the same time).
The validity of this claim, and a model that goes beyond two speakers, are outside the
scope of this thesis. They are only mentioned as a possible extension of Brady's model.
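The logical "or" suggested above is simple to express in code. The following Python sketch combines the on/off sequences of the other conference participants into the single "speaker B" input seen by speaker A; the function name and the sample sequences are hypothetical.

```python
def combined_other_party(*traces):
    """Treat 'B is talking' as true whenever any other participant is talking."""
    return [any(samples) for samples in zip(*traces)]

# Hypothetical on/off sequences for participants C, D and E (1 = talking).
c = [0, 1, 0, 0]
d = [0, 0, 1, 0]
e = [0, 1, 0, 0]
print(combined_other_party(c, d, e))
```

As noted above, this loses information: the second timestep, where C and E talk at once, is indistinguishable from a timestep with a single talker.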
7.3 Applying Brady's Model to VoIP
Brady designed and evaluated his model based on the analog communication channels
of PSTN telephones. Additionally, his method of detecting speech involved monitoring when the noise level crossed a threshold volume (that he varied between -45dBm
and -35dBm) [32]. Modeling VoIP is a slightly different task. In particular, in addition
to determining whether a person is talking or not talking, we wish to know the rate
of data being sent over the network. The data rate depends on the codec used (see
Table 1), and also whether or not data is being sent. Since codec bandwidth usage
is a defined constant per codec, all that needs to be modeled is the on/off patterns
of sending data from a particular phone. The phone we chose to model was a Cisco
7940[38], and, as will be shown in the next few sections, the on/off patterns did not
closely match those observed by Brady's noise threshold detector.
7.3.1 Voice Activity Detection (VAD)
Some VoIP phones and software applications send a constant stream of data when
they are being used. If this is the case, the bandwidth is a fixed value that depends
only on the codec, and can be found in Table 1. A model of this type of traffic would
be very simple: as soon as the call is set up, each side should send a fixed data rate
that models a particular codec, and this should continue until one of the parties hangs
up. Traffic rates are constant whether the calling parties are talking or silent, and in
Brady's model, each user would permanently be in a "Double-Talk" state.
Most VoIP applications, however, use a technique known as Voice Activity Detection
(VAD) to conserve bandwidth. The phones have a built in filter that attempts to
determine whether a person is speaking or not, and only sends data when the filter
hears speech[17]. When the person is silent, nothing is sent. In theory, VAD
can reduce bandwidth use by up to 50% compared with sending data continuously.
The nature of these VAD systems and how they compared to known methods of
conversation modeling was a crucial element in simulating VoIP traffic.
The problem we were faced with was determining the behavior of VAD systems when
compared to Brady's noise detection algorithm. To determine the behavior of the
Cisco 7940's voice detector a series of prerecorded audio samples were played through
the phone, and the resulting data was recorded using tcpdump. Samples were played
from a music file so that there would be no periods of silence. Samples of .01, .04,
.1, .5, 1, 2, 3, 5, and 10 seconds were played, each one with silence gaps of .01, .04,
.1, .5, 1, 2, 3, 5, 10, 20, 30, and 60 seconds in between. Then the difference between
the actual length of the clip and the recorded period of activity on the network was
measured.
What was found was remarkably consistent behavior for varying length clips and gaps.
All clips with gaps of 2 seconds or less in between were seen as a single stream of data
on the network. For the clips with observable periods of silence, the mean difference
between time the phone sent packets and the length of the clip was 2.30 seconds,
with a standard deviation of .023 seconds. From this observed behavior it is believed
that simulating VoIP traffic from a Cisco 7940 IP phone can be done by simulating
regular speech, via Brady's model, and adding a "buffer length" of 2.3 seconds. Any
talkspurts that overlap due to this buffer are combined into a single, longer talkspurt,
which is the behavior observed in the phone. The resulting conversation would consist
of a smaller number of longer talkspurts, with shorter periods of silence in between.
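The buffering and merging step can be sketched as follows in Python. The 2.3 second default comes from the measurements above; the function name and the sample talkspurts are hypothetical.

```python
def add_hangover(talkspurts, buffer_s=2.3):
    """Extend each (start, end) talkspurt by buffer_s seconds and merge overlaps.

    talkspurts: iterable of (start, end) times in seconds.
    Returns a list of merged (start, end) tuples sorted by start time.
    """
    merged = []
    for start, end in sorted(talkspurts):
        end += buffer_s
        if merged and start <= merged[-1][1]:
            # This talkspurt begins before the previous buffer has run out,
            # so the phone would have sent one continuous stream.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical talkspurts: the first two are separated by a 1 second silence.
print(add_hangover([(0.0, 1.0), (2.0, 3.0), (10.0, 11.0)]))
```

In this example the first two talkspurts merge into one lasting from 0 to about 5.3 seconds, while the third remains separate and is extended to about 13.3 seconds.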
To check that this behavior is accurate, a single half of a conversation was played
through the Cisco 7940. The audio data was a part of the Switchboard Cellular
Phase I corpus[33], and the particular conversation used was chosen for its clarity
and lack of noise. Traffic from the IP phone was captured using tcpdump, and
broken into talkspurts by a Perl script. This was then compared with a transcription
of the conversation, which was prepared by the Linguistic Data Consortium[34]. As a
final step, a 2.3 second buffer was added to each talkspurt in the transcription, to be
compared with the data from the IP phone. The results of this analysis are shown in
Figure 11.
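The step of breaking captured traffic into talkspurts can be illustrated with a short Python sketch. This is not the Perl script used for the analysis: it assumes packet arrival times have already been extracted from the capture, and the 20 ms packet interval and 100 ms gap threshold are assumed values.

```python
def packets_to_talkspurts(timestamps, packet_interval=0.02, max_gap=0.1):
    """Group packet arrival times (seconds) into (start, end) talkspurts.

    A new talkspurt begins whenever the gap since the previous packet
    exceeds max_gap (an assumed threshold). Each talkspurt is taken to end
    one packet interval after its last packet.
    """
    spurts = []
    start = prev = None
    for t in sorted(timestamps):
        if start is None:
            start = prev = t
        elif t - prev > max_gap:
            spurts.append((start, prev + packet_interval))
            start = prev = t
        else:
            prev = t
    if start is not None:
        spurts.append((start, prev + packet_interval))
    return spurts

# Hypothetical arrival times: three packets, a long gap, then two more.
print(packets_to_talkspurts([0.00, 0.02, 0.04, 1.00, 1.02]))
```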
Figure 11(a) shows the on/off patterns from the original transcription of the conversation. It can be seen that this data consists of many short talking and silence
periods. Figure 11(b) shows on/off patterns from the transcription with the addition of the 2.3 second buffer. All silences less than 2.3 seconds have become merged
talkspurts, and, as a result, there are fewer, longer talkspurts. Finally, Figure 11(c)
shows the result of the audio data being played through the Cisco 7940. Although
Figures 11(b) and 11(c) are not exactly the same, it can certainly be seen that the
addition of a 2.3 second buffer is a reasonable model of the behavior of the Cisco
7940. Discrepancies (for example, around 177 seconds) are believed to be a result of
imperfect transcriptions (due to small amounts of unrecorded noise). Although not every
discrepancy was investigated individually, in the recorded version of the conversation shown in Figure 11(a), there is an audible sniffle and click at 177 seconds that
does not appear in the transcription, which supports the above hypothesis.
7.4 Revising Brady's Parameters
Rather than create a somewhat arbitrary method of modeling VoIP traffic that relies
on adding 2.3 seconds to Brady's conversation parameters, we decided to investigate
whether a new set of parameters could be developed that would allow Brady's original
model to accurately simulate VoIP traffic. This would, at a general level, mean
drastically lowering the β-values (those that correspond to the rate of "stop-talking"
pulses) and raising the α-values (the rates of "start-talking" pulses). This would
accomplish the desired result of creating longer and fewer talkspurts combined with
shorter periods of silence.
7.4.1 Correlating Brady's Data with the Switchboard Corpus
We began by comparing the statistics from Brady's conversation data to those from
the Switchboard speech corpus that we sought to use. The time in each of Brady's six
states was compared over the entire data sets. The results of this analysis are shown
in Figure 12.
There are a few important things to note in Figure 12. One is that the time in each
state for both sets of data is symmetric. This is because the data is taken from both
speakers' perspectives. That is, when two speakers are talking they see "opposite"
states. Each speaker looks at the conversation from his own perspective, and thinks
of himself as Speaker A.
Figure 11: Talking On/Off Patterns versus Time for a Single Speaker in a Conversation. Panels: (a) Original Conversation, No Buffer; (b) Original Conversation, 2.3s Buffer Added; (c) Original Conversation, transmitted via VoIP. Horizontal axis: time (s).
Figure 12: Time in Each State for Brady's Data[32] and the Switchboard Speech Corpus[34]. Panels: (a) Brady's Data; (b) Switchboard Speech Corpus.
Imagine a hypothetical conversation between two speakers, Carl and Denise. When
Carl is talking and Denise is silent, Carl interprets this as being in state A-Talking-Alone, because Carl considers himself Speaker A. From Denise's perspective, however,
she is in state A-Listening, because she also considers herself Speaker A and considers
Carl Speaker B. A similar symmetry results for the two Double-Talk states and the
two Silence states. Both sides of each conversation were considered because it is
statistically meaningless to assign an arbitrary 'Speaker A' and 'Speaker B' to each
conversation.
A second thing to note about Figure 12 is that although the time spent in the Talking-Alone states is nearly identical, there is significantly more double talk and less silence
in the (more modern) speech corpus than in Brady's data. It is not known why this
is, but it could be a result of a small dataset³, imperfect noise detection and/or
transcriptions, or a change in conversation styles in the last 30 years (Brady's research
was done in 1968, while the Switchboard corpus was obtained in 1999-2000). While
these discrepancies exist, it is believed that the Switchboard dataset is similar enough
to Brady's data that his model is still valid. For the rest of this section it is assumed
that the Switchboard corpus can be reasonably correlated with Brady's model.
7.4.2 Comparing the Switchboard Corpus to VoIP
Once it had been determined that the Switchboard corpus could be used with Brady's
model, we sought to understand how the model could be adapted to what was observed in VoIP systems, and whether a revised set of Brady's parameters could model
this behavior. The transcriptions from the corpus were used to generate the six
transition parameters (α and β values) from each speaker's perspective. The
parameters were obtained from the transcripts via the method outlined in [31], and
the resulting averages and standard deviations can be seen in Table 4.
Value      α_pse    α_alt    α_int    β_sol    β_ted    β_tor
Average    1.754    1.380    0.289    0.437    1.209    0.660
St. Dev.   1.478    0.852    0.188    0.335    0.706    0.769
Table 4: Statistics of Brady's Parameters for the Switchboard Data
To make the data more closely resemble what was observed in the VoIP phones, a
2.3 second buffer was added to the end of every individual talkspurt, for the reasons
described in Section 7.3.1. Talkspurts that overlapped due to this addition were combined in a single, longer talkspurt. Then, with these revised transcriptions, Brady's
parameters were again extracted. The resulting values can be seen in Table 5.
Value      α_pse    α_alt    α_int    β_sol     β_ted     β_tor
Average    1.838    2.046    0.447    0.0198    0.121     0.174
St. Dev.   4.240    9.586    0.561    0.0292    0.0684    0.239
Table 5: Statistics of Brady's Parameters for the Switchboard Data with Buffer
³ Brady's data compared only 16 conversations, while the Switchboard corpus looked at 250.
With the addition of the buffer, all three β values drop significantly. β_tor and β_ted,
corresponding to the likelihood of leaving a double-talk state, decrease such that the
average time in a double-talk state increases from .5-1.5 seconds to 5-10 seconds. The
impact on β_sol, the likelihood of stopping when talking alone, is even more dramatic,
resulting in the average time in that state increasing from just over 2 seconds to 50
seconds!⁴
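These durations follow from the mean-time property of Section 7.2.5: the average time in a talking state, neglecting transitions caused by the other speaker, is the reciprocal of the corresponding β value. The Python sketch below repeats that arithmetic using the average β values reported in Tables 4 and 5; the variable and function names are ours.

```python
# Average beta values (pulses/second) from Table 4 (no buffer) and Table 5 (buffer).
unbuffered = {"beta_sol": 0.437, "beta_ted": 1.209, "beta_tor": 0.660}
buffered = {"beta_sol": 0.0198, "beta_ted": 0.121, "beta_tor": 0.174}

def mean_dwell(rate):
    """Mean time in seconds spent in a state governed by the given rate."""
    return 1.0 / rate

for name in unbuffered:
    before = mean_dwell(unbuffered[name])
    after = mean_dwell(buffered[name])
    print("%s: %.2f s without buffer, %.2f s with buffer" % (name, before, after))
```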
The impact of the 2.3 second buffer on the α values (those corresponding to start-talking pulses) is less clear. Although all three α values do increase, they do not do
so with the consistency or magnitude of the β values. Also, it is difficult to
characterize the values of α_pse and α_alt (the start-talking from silence parameters) with any confidence,
because of a lack of data. Of the 250 transcribed conversations, 141 had no silence
at all after the addition of the 2.3 second buffer. For these conversations, extracting
values for α_pse and α_alt was impossible. Another way to see this variability is by
looking at the standard deviations of the values. For α_pse the average value was 1.8
with a standard deviation of over 4, and for α_alt the value was about 2 with a standard
deviation of nearly 10.
⁴ This isn't including the fact that a solitary talking state is left when the other person talks as
well, which is much more likely. Additionally, once that happens, the person finds themselves in a
double-talk state, in which they are more likely to stop talking. The reason for this low value is
believed to be that the majority of silences between alternating speakers are shorter than
2.3 seconds. Thus, by the time the 2.3 second buffer runs out and Speaker A falls silent, Speaker B
has already started talking and A is falling silent from a double-talk state.
It can be concluded that the addition of a buffer on the end of each speaker's talkspurts
results in a significant drop in β values, and a minor increase in α values. The
most affected parameter is β_sol, the likelihood of falling silent when talking alone,
because the addition of the buffer causes the vast majority of talking-alone states to
be terminated by interrupts. It is difficult to characterize the effect of the buffer on
the α values, due to the high variability of the data.
Another way to observe the effect of the 2.3 second buffer is to compare the time spent
in each state for the calls passed through VoIP phones, with the original transcriptions
and the transcriptions after the addition of the buffer. Time and resources did not
allow us to do this for every call in the corpus, but instead, three calls from the
corpus were analyzed in hopes that they would provide a representative behavior of
the corpus as a whole.
The results of this analysis can be seen in Figure 13. Figure 13(a) shows the time
in each state for the unbuffered transcriptions. It can be verified that this distribution of states closely resembles the average of the Switchboard corpus (refer back
to Figure 12). Figure 13(b) shows the time in each state for the same three calls
played through the Cisco 7940 IP phones. In this case we see a drastic increase in
the amount of double talk, and a large reduction of silence. Finally, Figure 13(c)
shows the time in each state with the addition of the 2.3 second buffer. It can be seen
that the buffered transcriptions closely resemble the VoIP calls, showing a drastic
reduction of silence and increase in double-talk.
It is not known why there is more double talk when the calls are played through the
VoIP phones than with the addition of the buffer, but this is, again, believed to be
a combined result of noisy conversations and imperfect transcriptions. A complete
investigation into this issue was not done.
7.5 Developing User Models
As a second aspect to conversation modeling, we also wanted to generate different
user and scenario models. To see if people could be categorized into different types
of speakers, we again looked through the Switchboard data corpus, but this time
focused only on conversations involving particular speakers. Unfortunately, for the
transcriptions in this data set, the maximum number of conversations any individual
participated in was three, with 11 speakers participating in three different conversations each. Although this was a small data sample, for each of the 11 speakers both
Brady's parameters and the time in each state were recorded with and without the
2.3 second buffer.
Figure 13: Time in Each State for 3 Calls from the Switchboard Corpus. Panels: (a) Unbuffered; (b) VoIP; (c) 2.3 Second Buffer.
We found that although some speakers consistently talked
more than others, the values for each speaker varied greatly from conversation to conversation. In particular, when the parameters for a particular speaker were averaged across the three
conversations they had participated in, the standard deviation was on the same order of magnitude as the mean, making the data difficult to interpret. We offer two
hypotheses to explain this variation.
One is that there simply isn't enough data. Three six-minute calls is not a lot of
time to characterize a person's speech patterns. Additionally, the three calls were
about different subjects, the nature of which could have affected the person's speech
patterns. To do a better analysis of speech patterns on a per-speaker basis would
require keeping several variables (subject matter, connection channel properties, call
duration) constant as the person talks to different parties. This was not the nature
of the Switchboard corpus, and time did not permit our own investigation into this
matter.
A second potential reason for the high variation of the data is that individual speaker
patterns are simply not consistent across contexts. Several studies have recently
shown that the advent of the Internet has introduced several types of data that are
simply too variable to model with current techniques[36]. How to model this type of
data is a subject that is outside the scope of this document.
Instead we took another strategy. We decided to use our intuitive understanding of
parameters to create user models for different conversation types. A few of those
types are discussed below.
7.5.1 The Average Speaker Pair
The first set of parameters we will look at is the "average" speaker pair. Each
user was given the average α and β values for the entire Switchboard corpus after
adding the 2.3 second buffer appropriate for modeling VoIP. An output of the traffic
generation program with this setup is shown in Figure 14. This can be compared to
a single real conversation that had similar parameters, which can be seen in
Figure 15.
7.5.2 The Authority Relationship
Another conversation type that is of particular interest is a conversation where one
of the speakers has authority over the other. This could represent a manager talking
with an employee, or, in the military world, a commanding officer talking with a
subordinate. These conversations could play a unique role in security because many
times it is important to keep the location of the commanding officer secret or hidden.
Figure 14: Simulated Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers. Panels: (a) Average Speaker A; (b) Average Speaker B; (c) Both Speakers. Horizontal axis: time (s).
Figure 15: Real Talking On/Off Patterns versus Time for a Single Conversation with a Pair of Average Speakers. Panels: (a) Average Speaker A; (b) Average Speaker B; (c) Both Speakers. Horizontal axis: time (s).
Speaker
Authority
Subordinate
α_pse
α_alt
α_int
β_sol
β_ted
β_tor
2
.3
2
.1
.3
.1
.1
.1
.5
.7
.01
.1
Table 6: Brady's Parameters for a Dominant and Passive Speaker
To choose parameters for an authority figure we thought about the conversation
qualitatively. The authority figure is less likely to worry about interrupting someone,
because he has the authority, so it was decided that his α_int value should be above
average. Additionally, he is probably less likely to stop talking when interrupted or
interrupting someone, so β_ted and β_tor should be lower. Finally, it is believed that
authorities usually talk more in conversations[19], so β_sol was also lowered.
For the subordinate the situation is the opposite. He is likely to be talking less, and
so β_sol should be higher than average. Additionally, he is quite likely to stop when
interrupted, so β_ted should be high. Finally, the subordinate is probably unlikely to
interrupt the person who has authority over him, so α_int was dramatically reduced.
The values that were chosen for these two speaker types are summarized in Table 6.
The resulting simulated conversation can be seen in Figure 16. The figure
shows clearly that the dominant speaker talks and interrupts more,
and that the conversation has the desired properties.
7.5.3 An Alternating Protocol
A final type of conversation we will consider is one with an alternating protocol. In
this case, some other code or protocol is used to determine when a speaker is talking.
One example of this is the "over" protocol, where people say "over" at the end of
every statement made. Another example can occur over links with a long latency (for
example satellite communications). In these cases, sometimes a push-to-talk system
is employed where a person first pushes a button indicating that they are talking,
and releases it when they finish.
In general, the goal of these systems is to eliminate overtalk. Thus, α_int should be
very low, and β_tor and β_ted should be relatively high. Additionally, the conversation
usually flows with alternation (as opposed to a speaker falling silent and talking
again), so α_alt should be significantly higher than α_pse. We also hypothesize that in
this scenario, typical talkspurts usually last longer, so β_sol was kept low. The final
values chosen, and the results of a simulated conversation can be seen in Table 7 and
Figure 17, respectively. Again, it can be plainly observed that the conversation has
the desired properties.
Figure 16: Simulated Talking On/Off Patterns versus Time for a Conversation with a Dominant Speaker. Panels: (a) Subordinate Speaker; (b) Authority Speaker; (c) Both Speakers. Horizontal axis: time (s).
Figure 17: Simulated Talking On/Off Patterns versus Time for a Conversation with an Alternating Protocol. Panels: (a) Alternating Speaker A; (b) Alternating Speaker B; (c) Both Speakers. Horizontal axis: time (s).
Speaker       α_pse    α_alt    α_int    β_sol    β_ted    β_tor
Alternator    .05      1.5      .01      .05      1.5      1.5
Table 7: Brady's Parameters for an Alternating Protocol
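To show how a parameter set such as the one in Table 7 drives a conversation, the following Python sketch connects two one-port, six-state speakers and tallies the time each spends in every state. It is an illustration rather than the traffic generator described in Section 9: the channel is assumed to have no delay or loss, the starting states and the tie-break for simultaneous state changes are our assumptions, and all names are hypothetical.

```python
import random

# Table 7 values for the "Alternator" speaker (pulses/second).
ALTERNATOR = {"alpha_pse": 0.05, "alpha_alt": 1.5, "alpha_int": 0.01,
              "beta_sol": 0.05, "beta_ted": 1.5, "beta_tor": 1.5}

# Parameter that governs the speaker's own transition out of each state.
RATE_FOR_STATE = {"sil_a": "alpha_pse", "sil_b": "alpha_alt", "listen": "alpha_int",
                  "alone": "beta_sol", "dt_b": "beta_ted", "dt_a": "beta_tor"}

def next_state(prev, me, other):
    """Six-state label from one speaker's own perspective (that speaker is 'A')."""
    if me and not other:
        return "alone"
    if other and not me:
        return "listen"
    if me and other:
        if prev in ("dt_a", "dt_b"):
            return prev
        return "dt_b" if prev == "alone" else "dt_a"  # assumed tie-break
    if prev in ("sil_a", "sil_b"):
        return prev
    return "sil_a" if prev == "alone" else "sil_b"     # assumed tie-break

def simulate(params_a, params_b, duration, dt=0.02, seed=1):
    """Return, for each speaker, the seconds spent in each of the six states."""
    rng = random.Random(seed)
    params = [params_a, params_b]
    talking = [False, False]
    state = ["sil_a", "sil_b"]  # assumed start: speaker 0 spoke last
    time_in_state = [dict.fromkeys(RATE_FOR_STATE, 0.0) for _ in range(2)]
    for _ in range(int(round(duration / dt))):
        heard = [talking[1], talking[0]]  # zero-delay, lossless channel (assumption)
        for i in range(2):
            state[i] = next_state(state[i], talking[i], heard[i])
            time_in_state[i][state[i]] += dt
        for i in range(2):
            rate = params[i][RATE_FOR_STATE[state[i]]]
            if rng.random() < rate * dt:
                talking[i] = not talking[i]
    return time_in_state

result = simulate(ALTERNATOR, ALTERNATOR, duration=3600)
for name, seconds in sorted(result[0].items()):
    print("%-7s %8.1f s" % (name, seconds))
```

With these parameters nearly all of the hour should be spent with exactly one speaker talking, and very little in either double-talk state, which is the behavior the alternating protocol is meant to produce. Passing different dictionaries for the two speakers gives asymmetric conversations.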
7.6 Summary
In this section we have discussed conversation modeling in the context of VoIP. Starting with a model for conversations developed by Paul Brady, we adjusted the parameters to more accurately reflect the observed data rates of VoIP. This resulted in an
increase in the average length of talkspurts, which we attribute to the use of Voice
Activity Detection systems that overestimate the time when a person is talking. We
also discussed adapting the parameters of the model to generate different usage patterns. Patterns considered were an authority relationship, and an alternating protocol
similar to a push-to-talk system or "over" protocol.
A scriptable VoIP traffic generator was developed on the basis of this work on conversation modeling. It allows us to play simulated VoIP conversations that closely
resemble users' actual speech patterns. The traffic generator is flexible, allowing
Brady's six parameters to be set as inputs, and thus making it possible to simulate
different types of conversations over equivalent network setups. The details of this
traffic generator are discussed further in Section 9.
8 Adapting the E-Model to Secure VoIP Systems: The Z-Model
In Section 6.3 we introduced the E-Model as a way to estimate conversation quality
based on connection characteristics. The E-Model can estimate quality under a wide
variety of conditions. It was designed to model traditional circuit-switched communication, and later adapted to include packet-switched communication like VoIP. We
sought to develop a VoIP-specific model that could be readily utilized by VoIP users
and implementers to determine the quality of their service. We also wanted to model
the impact of security on voice communication quality. Finally, we sought to improve
the model to capture the benefits and relative costs of using conversational protocols (e.g. "over") over high latency links. We call the model that results from these
changes the Z-Model.
8.1 Using the E-Model for VoIP Communication
Before we discuss any changes made to the E-Model, we will first discuss its use in
VoIP communications. Recall the E-Model formula:
$$R = R_o - I_s - I_d - I_e + A \tag{1}$$
We will consider the effects of three factors that impair voice quality: jitter, echo,
and error rates.
8.1.1 Jitter
Jitter is the variation in packet interarrival times. It does not occur in circuit-switched
networks because there is no concept of a packet in those networks, but it is clearly
important for VoIP. Packets that arrive very late due to jitter present a difficult challenge.
Once the playout time for a packet's audio has passed, a late-arriving packet is no longer
useful and should be discarded. On the other hand, discarding every packet that arrives
a little late because of jitter would cause a significant amount of audio to be lost, which
would also reduce conversation quality.
A jitter buffer is used to mitigate this problem. A jitter buffer holds packets for a
short time window before the phone converts them to audio and sends the signal to
the user. Studies have shown that the use of a jitter buffer can dramatically improve
conversation quality[50].
However, using a jitter buffer requires a tradeoff. Waiting a fixed amount of time
for packets that might arrive late due to jitter causes the end-to-end latency of every
packet to increase. If jitter is so great that the jitter buffer must be increased beyond
a few hundred milliseconds, the impairment due to delay may counteract the
improvement from reduced jitter. A typical jitter buffer can be anywhere from 2 to 4
standard deviations longer than the average interarrival time to ensure that most
slightly "jittered" packets are picked up, while not wasting time waiting for packets
that arrive extremely late. The best value to use depends on both the jitter and
latency of the network.
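The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the use of the sample standard deviation, and the example interarrival times are assumptions, not measurements or code from this work.

```python
from statistics import mean, stdev

def jitter_buffer_length_ms(interarrival_ms, k=3.0):
    """Buffer length as the mean interarrival time plus k standard deviations.

    interarrival_ms: observed packet interarrival times in milliseconds.
    k: number of standard deviations (the text suggests a value from 2 to 4).
    """
    return mean(interarrival_ms) + k * stdev(interarrival_ms)

# Hypothetical interarrival times for a nominal 20 ms packet stream.
samples = [18.0, 22.0, 18.0, 22.0]
for k in (2.0, 3.0, 4.0):
    print(f"k={k}: buffer length {jitter_buffer_length_ms(samples, k):.2f} ms")
```

A larger k catches more late packets at the cost of more added delay, which is the tradeoff described above.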
From the perspective of the E-Model, the use of a jitter buffer allows us to eliminate
jitter from quality models by replacing it with a fixed delay and an increased packet
loss. All packets that arrive within the window of the jitter buffer will be delayed
in the buffer a fixed length of time before being heard by the user. Additionally, packets
arriving outside the jitter buffer will be dropped, and are equivalent to packets lost
in transit. If $t_a$ is the absolute end-to-end delay, and $t_{\text{a-original}}$ is the absolute delay
in the absence of jitter, then:
$$t_a = t_{\text{a-original}} + t_{\text{jitter}} \tag{2}$$
where $t_{\text{jitter}}$ is the length of time that packets wait in the jitter buffer. Additionally, if
$P_{\text{loss}}$ is the probability of a lost packet, and $P_{\text{loss-original}}$ is the probability of a packet
lost in transit, then:
$$P_{\text{loss}} = P_{\text{loss-original}} + P_{\text{loss-jitter}} \tag{3}$$
where $P_{\text{loss-jitter}}$ is the probability that a packet arrives outside the duration of the
jitter buffer, and is thus dropped.
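Equations 2 and 3 can be illustrated with a short Python sketch. It assumes that per-packet one-way delays are available as a list in milliseconds and that a packet is dropped when its delay exceeds the jitter-free delay plus the buffer length; the function names and sample values are hypothetical.

```python
def jitter_to_delay_and_loss(delays_ms, t_a_original, t_jitter):
    """Replace jitter by a fixed delay (Equation 2) and a loss probability.

    delays_ms: observed one-way delay of each packet, in milliseconds.
    t_a_original: delay in the absence of jitter, in milliseconds.
    t_jitter: time packets wait in the jitter buffer, in milliseconds.
    Returns (t_a, p_loss_jitter).
    """
    deadline = t_a_original + t_jitter
    late = sum(1 for d in delays_ms if d > deadline)
    return deadline, late / len(delays_ms)

def total_loss(p_loss_original, p_loss_jitter):
    """Equation 3: loss in transit plus loss from late arrival."""
    return p_loss_original + p_loss_jitter

# Hypothetical delays: most packets arrive near 50 ms, two arrive very late.
delays = [50, 52, 55, 51, 90, 53, 50, 54, 120, 52]
t_a, p_jitter = jitter_to_delay_and_loss(delays, t_a_original=50.0, t_jitter=10.0)
print(f"t_a = {t_a} ms, P_loss-jitter = {p_jitter}, "
      f"P_loss = {total_loss(0.01, p_jitter)}")
```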
8.1.2 Echo
The next factor we will consider is echo. In the E-Model, echo is represented in the
$I_d$ term. The specific equation is:
$$I_d = I_{dte} + I_{dle} + I_{dd} \tag{4}$$
where $I_{dte}$ and $I_{dle}$ represent terms due to talker and listener echo, respectively, and
$I_{dd}$ is the impairment due to absolute delay. Both $I_{dte}$ and $I_{dle}$ have complex equations
associated with them, but we avoid these by making two simplifying assumptions. We rely
on the fact that many VoIP systems currently provide varying levels of echo cancellation[60].
We also argue that echo in packet-based speech communication is reduced because
there are no analog signals involved in the transmission of the signal (at least on the
level that would contribute to the output). Combining these two observations, we assume
that there is no talker or listener echo, and that the only delay impairment corresponds
to absolute delay. Then $I_d = I_{dd}$, where $I_{dd}$ is calculated from the one-way latency by
Equation 5 below (from [53]).
$$I_{dd} = 25\left\{(1 + X^6)^{1/6} - 3\left(1 + [X/3]^6\right)^{1/6} + 2\right\} \tag{5}$$
where X is defined in the equation below, with $t_a$, the absolute delay in milliseconds,
given by Equation 2.
$$X = \frac{\log(t_a/100)}{\log 2} \tag{6}$$
8.1.3 Error Rates
Finally we consider error rates. Many channels of communication are noisy, and in
a digital setting, noise typically leads to single-bit errors; that is, a 1 becoming a
0, or vice versa. Many systems can recover from single-bit errors, but such errors can
be problematic for VoIP systems. This is because VoIP uses compression schemes
to encode its speech. If a single bit of compressed data is wrong, the entire packet
becomes corrupt and must be thrown away. Thus, we return again to the formula
for packet loss, and must add a new term, $P_{\text{loss-error}}$, representing the packet loss due
to single-bit errors:
$$P_{\text{loss}} = P_{\text{loss-original}} + P_{\text{loss-jitter}} + P_{\text{loss-error}} \tag{7}$$
To calculate $P_{\text{loss-error}}$, we need to know two things: the bit-error rate, ber, of the
channel, and the size of the VoIP packet in bits, size. Assuming bit errors are independent,
the probability that a packet contains at least one bit error is given by the following:
$$P_{\text{loss-error}} = 1 - (1 - \text{ber})^{\text{size}} \tag{8}$$
That is, the probability of an error is 1 minus the probability that every bit in the
packet is error-free.
We can usually make a simplification to Equation 8. Many communication channels
have very small bit-error rates. 10Gb Ethernet, for example, has a bit-error rate of
approximately $10^{-13}$ [66]. In the case of a very small error rate we can assume the
probability of a packet having more than one error is small enough that we neglect
it. Then the equation for $P_{\text{loss-error}}$ becomes:
$$P_{\text{loss-error}} = \text{ber} \cdot \text{size} \tag{9}$$
For the $10^{-13}$ error rate cited for 10Gb Ethernet and a 200-byte (1600-bit) packet
typical for VoIP, the values given by Equations 8 and 9 for $P_{\text{loss-error}}$ are
$1.599999999872 \times 10^{-10}$ and $1.6 \times 10^{-10}$, respectively, making the approximation well over
99.99% accurate.
For satellite links common in airborne networks the approximation is less accurate,
but still good. For the $10^{-5}$ bit error rate of Iridium communication (see Table 8 of
Section 9.2), the approximation gives 0.016 while the actual value is 0.0158727. This
is still about 99.2% accurate.
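The comparison between Equations 8 and 9 can be reproduced with the following Python sketch. The function names are illustrative. The exact form is evaluated with `math.log1p` and `math.expm1` because computing `1 - (1 - ber) ** size` directly in double precision loses most of its significant digits when ber is as small as $10^{-13}$.

```python
import math

def loss_error_exact(ber, size_bits):
    """Equation 8: 1 - (1 - ber)^size, evaluated stably for very small ber."""
    return -math.expm1(size_bits * math.log1p(-ber))

def loss_error_approx(ber, size_bits):
    """Equation 9: linear approximation, valid when ber * size is small."""
    return ber * size_bits

SIZE_BITS = 1600  # 200-byte VoIP packet, as in the text

for name, ber in [("10Gb Ethernet", 1e-13), ("Iridium", 1e-5)]:
    exact = loss_error_exact(ber, SIZE_BITS)
    approx = loss_error_approx(ber, SIZE_BITS)
    print(f"{name}: exact={exact:.12e} approx={approx:.3e} "
          f"relative error={(approx - exact) / exact:.2e}")
```

The approximation always overestimates the exact value slightly, since it counts packets with several bit errors more than once.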
8.2 Going from E-Model to Z-Model
Once the E-Model was adapted for VoIP in the ways described above, we sought to
incorporate the effects of security. This is discussed in Section 8.2.1. Additionally,
in our experiments with conversation modeling we identified a factor, which we call the
disagreement factor, that we believe more accurately reflects delay impairments. This
factor is discussed in Section 8.3.
8.2.1 Modeling the Effects of Security
Introducing security into VoIP networks affects several different aspects related to
voice quality, including jitter, delay, packet loss, and bandwidth.
Jitter
Encryption introduces another step to the end-to-end communication process, so
it is likely to increase the variance of the delays of each packet. In particular, as
bandwidth is increased, the cryptographic engine can become a bottleneck, and voice
packets may have to wait in an encryption queue for other data to be encrypted before
being sent. Naturally, the more data being sent through the encryptors, the longer
each individual packet may have to wait in the queue. In general, this queue will
cause additional jitter depending on the instantaneous amount of data being sent at
any particular time.
Fortunately, we can convert jitter to packet loss and absolute delay by again assuming
the use of a jitter buffer as previously described. Thus, we only need to be able
to measure the increase in jitter associated with added security, and then use the
methods of Section 8.1.1 to determine the effect on voice quality.
Delay
Another effect of adding security is increased delay. Cryptographic algorithms are
complex, and may require non-negligible amounts of time to perform. Additionally,
packets waiting in an encryption queue can lead to increased jitter, resulting in the
need for longer jitter buffers. Recalling Equation 2, we add the encryption terms to
the equation for $t_a$, the absolute delay:
$$t_a = t_{\text{a-original}} + t_{\text{jitter}} + t_{\text{encrypt}} + t_{\text{decrypt}} + t_{\text{encryption-buffer}} \tag{10}$$
where $t_{\text{encrypt}}$ and $t_{\text{decrypt}}$ represent the time to encrypt and decrypt a single VoIP
packet and are dependent on the encryption algorithm and implementation, as well
as the size of the VoIP packet. Additionally, the time spent in the encryption buffer,
$t_{\text{encryption-buffer}}$, is dependent on the encryption mechanism as well as the amount of
traffic passing through the encryptors. The $t_{\text{jitter}}$ term represents jitter due to factors
other than encryption times, and is the same term as in Equation 2.
Cryptographic processing times also depend on the implementation and on the speed of
the computer performing the encryption. Faster computers will perform encryption
operations faster. For our model we assume Openswan IPSec running on Debian
Linux on a computer with four 2.4 GHz Intel Xeon processors and 2GB of memory, since
this is the setup we used for our experiments (see Section 9).
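To make the effect of the added terms concrete, the following Python sketch combines Equation 10 with the delay impairment of Equations 5 and 6. The function names are illustrative, and the 60 ms jitter buffer in the example is a hypothetical value. The sketch returns zero impairment at or below 100 ms, which is where Equation 5 itself evaluates to zero and matches the E-Model threshold cited in Section 10.1.1.

```python
import math

def absolute_delay_ms(t_a_original, t_jitter, t_encrypt=0.0, t_decrypt=0.0,
                      t_encryption_buffer=0.0):
    """Equation 10: one-way absolute delay with the encryption terms added."""
    return t_a_original + t_jitter + t_encrypt + t_decrypt + t_encryption_buffer

def delay_impairment(t_a_ms):
    """I_dd from Equations 5 and 6 (t_a in milliseconds)."""
    if t_a_ms <= 100.0:
        return 0.0  # no impairment at or below 100 ms
    x = math.log(t_a_ms / 100.0) / math.log(2.0)
    return 25.0 * ((1 + x ** 6) ** (1 / 6) - 3 * (1 + (x / 3) ** 6) ** (1 / 6) + 2)

# Link latencies from Table 8; encryption and decryption times of 0.086 ms and
# 0.048 ms are 3DES values from Table 9; the 60 ms jitter buffer is hypothetical.
for link, latency in [("TCDL", 2.0), ("Connexion", 325.0), ("Iridium", 2000.0)]:
    clear = absolute_delay_ms(latency, t_jitter=60.0)
    secure = absolute_delay_ms(latency, t_jitter=60.0, t_encrypt=0.086, t_decrypt=0.048)
    print(f"{link}: I_dd clear={delay_impairment(clear):.3f} "
          f"secure={delay_impairment(secure):.3f}")
```

In this sketch the link latency dominates: adding a fraction of a millisecond of cryptographic processing barely changes the impairment on any of the links.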
Packet Loss
Encryption mechanisms will also affect packet loss rates. One effect is the additional
loss caused by packets waiting too long in the encryption buffer, as described
previously. A second effect is the impact that the increased size of the encrypted
packets has on the loss rate due to single-bit errors, described in Section 8.1.3.
Recalling the equations for $P_{\text{loss-error}}$ given in Equations 8 and 9, we must plug in
the new size, size-enc, of the packet after encryption. That is:
$$P_{\text{loss-error-enc}} = 1 - (1 - \text{ber})^{\text{size-enc}} \tag{11}$$
or, if the ber is low:
$$P_{\text{loss-error-enc}} = \text{ber} \cdot \text{size-enc} \tag{12}$$
If we know the ratio of the encrypted to clear packet sizes, we can also express
$P_{\text{loss-error-enc}}$, in the low-ber approximation, as the product of the original
$P_{\text{loss-error}}$ and this ratio. That is:
$$P_{\text{loss-error-enc}} = \frac{\text{size-enc}}{\text{size}} \cdot P_{\text{loss-error}} \tag{13}$$
Putting it all together, and referring back to Equation 7, the new equation for the
loss rate, $P_{\text{loss}}$, is:
$$P_{\text{loss}} = P_{\text{loss-original}} + P_{\text{loss-jitter}} + P_{\text{loss-error-enc}} + P_{\text{loss-encryption-buffer}} \tag{14}$$
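A minimal Python sketch of Equations 13 and 14 follows. The function names are illustrative, and the encrypted packet size of 252 bytes and the other loss terms used in the example are hypothetical values chosen only to show the calculation.

```python
def loss_error_enc(ber, size_bits, size_enc_bits):
    """Equation 13: scale the low-ber loss estimate by the packet size ratio."""
    p_loss_error = ber * size_bits  # Equation 9
    return (size_enc_bits / size_bits) * p_loss_error

def total_loss_secure(p_original, p_jitter, p_error_enc, p_encryption_buffer):
    """Equation 14: sum of the individual loss terms."""
    return p_original + p_jitter + p_error_enc + p_encryption_buffer

# Iridium bit-error rate from the text; 200-byte clear packet; the 252-byte
# encrypted size and the remaining loss terms are hypothetical.
p_enc = loss_error_enc(ber=1e-5, size_bits=1600, size_enc_bits=2016)
print(f"P_loss-error-enc = {p_enc:.5f}")
print(f"P_loss = {total_loss_secure(0.001, 0.002, p_enc, 0.0):.5f}")
```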
Bandwidth
Finally, any authentication and encryption mechanism comes with increased bandwidth
consumption. Encryption adds per-packet overhead, so the actual bandwidth observed on
an encrypted network is significantly higher than the useful bandwidth seen at the
endpoints. Thus, although bandwidth itself does not directly play a role in voice quality
(besides the effect on error rates described above), increased bandwidth consumption
will affect latencies and loss rates due to congestion. For the purposes of modeling
we assume that a network administrator will understand how bandwidth contributes
to delay and loss on a network, and we do not include a bandwidth term directly in the
Z-Model. To help network planning, though, the Z-Model, in addition to providing
a quality rating, will help estimate the increase in bandwidth caused by the addition
of security. No claims are made, however, as to how the increased bandwidth affects
other traffic characteristics, as we expect this to be largely dependent on the specifics
of the network. The increased bandwidth is a function of the security implementation
and the original size of the voice packets. By understanding the bandwidth
requirements of secure VoIP, network administrators will be able to determine whether
their secured networks will support the traffic volumes observed in their insecure
systems.
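As a rough illustration of how the encrypted packet size and bandwidth might be estimated, the Python sketch below assumes IPSec ESP in tunnel mode with a 16-byte block cipher such as AES in CBC mode. The field sizes are the standard ESP ones (8-byte ESP header, 16-byte initialization vector, 2-byte trailer, 20-byte outer IPv4 header), but whether they match the configuration used in our experiments is an assumption, as are the 50 packets per second rate and the function names.

```python
def esp_tunnel_size(inner_bytes, block=16, iv=16, icv=0, outer_ip=20, esp_header=8):
    """Estimated size in bytes of an IP packet after ESP tunnel-mode encryption.

    Assumed layout: outer IP header, ESP header (SPI and sequence number), IV,
    then the inner packet plus a 2-byte trailer padded to the cipher block size,
    then an optional integrity check value (icv, 0 if ESP authentication is off).
    """
    trailer = 2  # pad length byte and next header byte
    blocks = (inner_bytes + trailer + block - 1) // block  # round up to whole blocks
    return outer_ip + esp_header + iv + blocks * block + icv

def bandwidth_kbps(packet_bytes, packets_per_second):
    """Bandwidth consumed by a constant stream of equal-sized packets."""
    return packet_bytes * 8 * packets_per_second / 1000

clear = 200  # 200-byte VoIP packet, as in Section 8.1.3
secure = esp_tunnel_size(clear)
print(f"clear: {clear} bytes, {bandwidth_kbps(clear, 50)} kbps")
print(f"encrypted: {secure} bytes, {bandwidth_kbps(secure, 50)} kbps")
print(f"size ratio: {secure / clear:.2f}")
```

The size ratio computed here is the same ratio that appears in Equation 13, so one estimate serves both the bandwidth and the loss-rate calculations.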
8.3 Incorporating Conversational Improvements
In experimenting with various conversation models like the ones described in Section 7.5 we realized an important point. In many channels of communication in
which there is a significant end-to-end delay, users adopt a signaling protocol to avoid
talking over each other. Examples of this include the familiar use of the word "over"
at the end of messages, and a push-to-talk system that only allows one person to
talk at a time (users talk by pressing a button). Clearly there is an intrinsic benefit
to using these systems over high-latency links, but this is not accounted for in the
E-Model, where the only term that affects the absolute delay impairment (Idd) is the
end-to-end delay (ta) (see Equations 5 and 6).
We speculate that the benefit of using such protocols is that it allows the parties to
agree on what is occurring in the conversation. By saying "over" after I am done
speaking, I am alerting you that I have finished and am now expecting you to talk.
Both parties understand this alternating flow. On the other hand, in the absence of
such a protocol, you might interpret a brief silence as the end of my statement and
then, because of the long latency of the link, we end up talking over each other.
One way to define agreement is to use Brady's conversation states (refer back to
Section 7.2 for a review of these states). In particular, we will break the conversation
into states of double-talk, each individual speaker talking alone, and mutual silence.
The more the two speakers agree on what state they are in, the less likely they are
to miscommunicate as a result of the high-latency link.
As noted above, $I_{dd}$ is the only term in the E-Model that represents impairment
due to absolute delay. For the Z-Model, we propose modifying the equation for $I_{dd}$,
which currently depends only on absolute delay. In particular, we believe that this term
should depend on the fraction of time the two parties disagree on the conversation state,
which we call the disagreement factor:
$$\text{disagreement factor} = \frac{t_{\text{disagree}}}{t_{\text{agree}} + t_{\text{disagree}}} = \frac{t_{\text{disagree}}}{t_{\text{total}}} \tag{15}$$
where $t_{\text{agree}}$ is the time the two speakers agree on which state they are in, $t_{\text{disagree}}$
is the time they disagree, and $t_{\text{total}}$ is the length of the conversation. Note that since
agreement and disagreement are mutually exclusive, $t_{\text{total}} = t_{\text{agree}} + t_{\text{disagree}}$. It should
be recognized that in the presence of any significant delay this disagreement fraction
will always be nonzero. This is because there is a difference between the time one
of the parties starts or stops talking and the time the other party perceives it. In
fact, assuming that the delay is much shorter than every talkspurt and silence allows
the disagreement factor to be determined in a simpler manner by estimating the
disagreement time from the absolute delay, $t_a$, and the number of state transitions,
$N_{\text{transitions}}$:
$$t_{\text{disagree}} \approx N_{\text{transitions}} \cdot t_a \tag{16}$$
This estimate follows because each state transition introduces an interval of length $t_a$
during which the parties are sure to disagree on the conversation state. Equation 16
becomes more accurate the longer individual states are, because $t_{\text{disagree}}$ is then less
likely to be affected by two transitions happening close together. Note that this equation
relies on the assumption that delay times are equal in both directions of communication.
The two cases are illustrated in Figures 18 and 19. In Figure 18 the time in each
state is longer than the delay, $t_a$, and Equation 16 holds. In Figure 19, however, the
time in the two talking states is shorter than $t_a$ and the equation does not hold.
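The two cases can be checked numerically with the following Python sketch. It is an illustration under stated assumptions rather than the procedure used in this work: time is stepped in whole milliseconds, each speaker's view of the conversation state combines their own speech now with the other party's speech delayed by $t_a$, and every talkspurt contributes two state transitions. All names and example talkspurts are hypothetical.

```python
def is_talking(spurts, t):
    """True if time t (ms) falls inside any (start, end) talkspurt."""
    return any(start <= t < end for start, end in spurts)

def disagreement_time(spurts_a, spurts_b, t_a, total):
    """Milliseconds during which the two speakers perceive different states."""
    disagree = 0
    for t in range(total):
        # Each party hears the other delayed by t_a milliseconds.
        view_a = (is_talking(spurts_a, t), is_talking(spurts_b, t - t_a))
        view_b = (is_talking(spurts_a, t - t_a), is_talking(spurts_b, t))
        if view_a != view_b:
            disagree += 1
    return disagree

def estimated_disagreement_time(spurts_a, spurts_b, t_a):
    """Equation 16, counting one transition per talkspurt start and end."""
    n_transitions = 2 * (len(spurts_a) + len(spurts_b))
    return n_transitions * t_a

# Long talkspurts, short delay: the estimate matches the measured time.
a, b = [(1000, 3000)], [(4000, 6000)]
measured = disagreement_time(a, b, t_a=200, total=8000)
print(measured, estimated_disagreement_time(a, b, 200), measured / 8000)

# Talkspurts shorter than the delay: the estimate is too large.
a, b = [(1000, 1100)], [(1150, 1250)]
print(disagreement_time(a, b, t_a=500, total=3000),
      estimated_disagreement_time(a, b, 500))
```

The first print also reports the disagreement factor of Equation 15, obtained by dividing the measured disagreement time by the conversation length.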
We had originally hoped to design a new equation for the impairment due to absolute
delay, $I_{dd}$, that was based on this disagreement factor and not on the latency, $t_a$.
Unfortunately, we did not have enough data to substantiate this equation, so we
leave that as a subject of future work. Here we merely suggest that this disagreement
factor should be considered instead of (or perhaps in addition to) the one-way latency
in determining impairments due to delay. We discuss the validity of this claim in
Section 10.3.1.
[Figure: timelines for Speaker A (A's talkspurt) and Speaker B (B hearing A's talkspurt), offset by $t_a$, with intervals of agreement and disagreement marked]
Figure 18: Agreement and Disagreement Time with $t_a$ Shorter than State Lengths
[Figure: timelines for Speaker A (A's talkspurt, A hearing B's talkspurt) and Speaker B (B hearing A's talkspurt, B's talkspurt), offset by $t_a$, with intervals of agreement and disagreement marked]
Figure 19: Agreement and Disagreement Time with $t_a$ Longer than State Lengths
9 Resources
This section describes the resources available for evaluating secure VoIP systems in
varying network conditions. Section 9.1 describes a VoIP traffic generating program
written based on the results of Section 7. Section 9.2 describes a link emulator
program useful for simulating different connection parameters. Section 9.3 describes
the devices used to perform the security protocols, and finally, Section 9.4 describes
the network in which experiments were run.
9.1 Traffic Generator
A VoIP traffic generator was written in C++ to support scriptable, simulated conversations over a network. We used the C++ Portable Types Library[21] to allow
the traffic generator to run on both Windows and Linux machines. We also relied
on an open-source implementation of RTP and RTCP developed by Jori Liesenborgs
known as JRTP[20].
The traffic generator models a Cisco 7940 IP phone. Like that phone, it uses SIP
running over UDP for setting up and tearing down calls, and uses RTP for transmitting the voice streams. Voice streams use the same packet size and formatting as
the G.711 codec used by the phone, although the payload data was sent from a file
of random bits instead of actual encoded speech. The same data file was used in all
experiments.
The traffic generator also models the phone's voice activity detection, so packets are
sent in bursts when the users are talking. The talking patterns of the users are
governed by Brady's conversation model (described in Section 7.2) with appropriate
VoIP-specific input parameters. For most experiments the average parameters
were used to simulate a ten-minute call. However, the traffic generator also allowed
many different types of conversations to be modeled under the same network
conditions. This feature was used to evaluate the disagreement factor described in
Section 8.3. For all of our experiments we focused on the conversation quality provided
by the RTP data streams, and not on call setup and teardown.
9.2 Link Emulator
We employed a link emulator program developed at MIT Lincoln Laboratory that
intercepts and processes packets at the link layer, and can simulate specific latencies,
error rates, and bandwidth constraints.
Since part of this study was an investigation into the performance of VoIP over airborne
links, the characteristics of various communication channels used by airplanes
were important to know. Four main types of airborne communication are summarized
in Table 8. TCDL is a line of sight communication that usually occurs between two
planes. The other three are forms of satellite communication, and thus have lower
available bandwidths and much higher latencies. Connexion is a satellite-based Internet
connection provided by Boeing and commercially available on many airplanes[47].
Inmarsat offers a similar type of communication via its own network[48]. Finally,
Iridium is a network of low earth orbiting satellites that provides communication
anywhere on Earth, and is now being used primarily by the US Government[49].

Link Type   BER         BW (kbps)   Latency (ms)   Description
TCDL        $10^{-7}$   $10^{4}$    2              Line of sight
Connexion   $10^{-6}$   128         325            Boeing, satellite
Inmarsat    $10^{-7}$   128         325            Geosynchronous satellites
Iridium     $10^{-5}$   2.4         2000           66 low earth orbiting satellites
Table 8: Characteristics used to Simulate Various Airborne Links
9.3 Encryption Devices
As mentioned in Section 5.1.3, Openswan[41] was chosen for our experiments from
several different implementations of IPSec because it provides the most flexibility in
selectable encryption algorithms, including AES. Additionally, Openswan has been
adopted as the IPSec implementation for the Fedora Core and Debian Linux operating
systems, and is believed to be quite stable. All of our encryption was done in software,
and IPSec ran in tunnel mode the entire time. Authentication headers were not used.
9.4 Test Network
The network used in the experimentation is shown in Figure 20. Two identical computers
running Fedora Core 1 Linux serve as endpoints. The VoIP traffic generator
discussed in Section 9.1 was run at these two points. The endpoints are connected
by an IPSec tunnel that runs between two nodes labeled IPSec1 and IPSec2. These
IPSec hosts both run Debian Linux using the Openswan implementation of
IPSec. Finally, between the IPSec hosts are a computer performing static routing
using the kernel and another node, labeled LinkEm, that runs a link-layer link emulator,
as described in Section 9.2. When IPSec is turned on, traffic is unencrypted in
the lighter, left side of the figure, and encrypted in the darker, right-hand side. The
computers labeled VoIP User 1 and VoIP User 2 were Dell 1750 servers with 2GB of
RAM and two 2.4 GHz Intel Xeon processors. The IPSec machines, LinkEm, and the Debian
router were Silicon Mechanics 1271A 1U rackmount PCs with 2GB of RAM and four
2.4 GHz Intel Xeon processors. The computers were connected by 1 Gbps Ethernet links
and switches, but the network interface cards were 100 Mbps, which was the limiting
factor.
[Figure: network diagram showing VoIP User 1 and VoIP User 2, the two IPSec hosts, the Debian router, and LinkEm, with the tcpdump capture points marked]
Figure 20: Test Network for Experimentation
Traffic dumps were taken at several points, labeled tcpdump(1-7), depending on what
exactly was being measured. Dumps were taken using tcpdump[45], a program built
into most Linux implementations that records traffic seen on a particular network
interface.
10 Methodology
This section describes the experiments run to evaluate the performance of secure
VoIP and develop and test the parameters of the Z-Model. Section 10.1 describes a
set of experiments designed to evaluate the performance of encrypted and unencrypted
VoIP under a variety of isolated network conditions. Section 10.2 goes on to explain
how the results of these experiments can be used to determine the various constants
found in the Z-Model. Finally, in Section 10.3 the Z-Model is evaluated as a tool
for predicting conversation quality, and the feasibility of secure VoIP in a variety of
wireless airborne network types is discussed.
10.1 Understanding the Performance of Secure VoIP
The first set of experiments was designed to study the impact of security on a Voice
over IP system's overall conversation quality. As a first step, various encryption
algorithms were compared against each other under optimum conditions. Then we
focused on 128-bit AES, and evaluated its performance under conditions of increased
traffic, reduced bandwidth, and various loss and error rates. In each experiment the
average delay and packet loss were measured over time. These values could then
be used in refining the Z-Model to determine the impact security has on various
transmission characteristics.
10.1.1 Experiment 1: Performance Under Optimum Conditions
In our first experiment we sought to determine performance of VoIP with IPSec
running in ideal conditions. This meant running limited traffic over low-latency,
high-bandwidth communication channels. The results of this experiment translate to
the best possible quality of service for a secure VoIP call. It is expected that voice
quality will decrease from these values as latency and traffic increase, and bandwidth
decreases.
In addition to determining this baseline quality value, we also sought to compare
different encryption algorithms. As mentioned in Section 9.3, Openswan supports a
variety of encryption algorithms including DES, 3DES, AES (also known as Rijndael),
Blowfish, and Twofish. Additionally, compression can be turned on and off, and
varying key sizes can be used. For this experiment we analyzed the performance of
3DES, AES with both 128 and 256-bit keys, and Blowfish. Additionally, we tested
the performance of 3DES and 128-bit AES with and without compression.
To measure the performance of the various algorithms a single 10-minute simulated
VoIP call was made using the VoIP traffic generator. The parameters input into the
traffic generator were the average of the data from the Switchboard corpus described
in Section 7, and listed in Table 5 of that section. The same parameters and payload
were used for all runs.

                                       Clockwise Communication      Counterclockwise Communication
Algorithm   Key Bits   Compression   Encryption    Decryption     Encryption    Decryption
                                     IPSec2 (ms)   IPSec1 (ms)    IPSec1 (ms)   IPSec2 (ms)
3DES        128        no            .054          .053           .086          .048
3DES        128        yes           .047          .053           .076          .049
AES         128        no            .037          .035           .060          .034
AES(solo)   128        no            .027          .034           .027          .035
AES         128        yes           .027          .034           .026          .035
AES         256        no            .034          .034           .058          .036
Blowfish    128        no            .031          .031           .052          .033
Table 9: Baseline Per-Packet Delays for Various Encryption Algorithms in Openswan
It was our initial plan to measure end-to-end delays of the packets between the two
VoIP endpoints. When we attempted this, however, we found that the delays under
these ideal conditions were on the same order of magnitude as the precision with
which we were able to synchronize the clocks (see footnote 5). Thus, we decided to look
at the delays going into and out of the encryption boxes, where we could use a single
clock for reference. Using tcpdump running on the Openswan boxes at locations 2, 3, 5
and 6 of Figure 20, the time every packet was seen at each side of the encryption device
was recorded. Then, using a Perl tool called pkt-scripts, the average delay was computed
across all packets. For each call four delays were recorded: the outbound encryption and
inbound decryption times at each of the two boxes. The values of these delays are
summarized in Table 9.
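The averaging step can be sketched as follows in Python. This is a stand-in for illustration only and is not the pkt-scripts tool; it assumes each capture has already been reduced to a mapping from some packet identifier to a timestamp in seconds taken from the same clock. How packets are matched across the clear and encrypted sides of a box is itself an assumption here, and the sample timestamps are hypothetical.

```python
def average_delay_ms(seen_before, seen_after):
    """Mean time in milliseconds between two sightings of the same packets.

    seen_before, seen_after: dicts mapping a packet identifier to the time (s)
    at which the packet was seen on each side of the encryption device.
    Packets present in only one capture are ignored.
    """
    matched = seen_before.keys() & seen_after.keys()
    if not matched:
        raise ValueError("no packets common to both captures")
    total = sum(seen_after[key] - seen_before[key] for key in matched)
    return 1000.0 * total / len(matched)

# Hypothetical timestamps for packets sent 20 ms apart.
clear_side = {1: 0.000000, 2: 0.020000, 3: 0.040000}
encrypted_side = {1: 0.000030, 2: 0.020040, 4: 0.060050}
print(f"average delay: {average_delay_ms(clear_side, encrypted_side):.3f} ms")
```

Because both timestamps for a packet come from the same machine, clock synchronization between hosts does not enter into the result.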
A few things can immediately be seen from Table 9. The most important result is that
encryption under these low-traffic conditions does not result in a significant
loss in voice quality. Note that the highest delay from encryption was under 1/10th
of a millisecond, and the total one-way delay from both encryption and decryption is
well under .15 milliseconds for all algorithms. According to the E-Model, delays do not
begin to impair quality until they are greater than 100 ms. Thus, the delay overhead
due to encryption under these circumstances is less than .15% of the delay required
to become an impairment to quality. Compared to other factors, such as round trip
time, this delay is very small. Thus, we believe that the encryption overhead
under ideal conditions can be virtually neglected, regardless of algorithm or key size.
A few other interesting results can be obtained from Table 9. One thing to note is that
the claim that AES is faster than 3DES was verified for this data, as all implementations
of AES, including with a 256-bit key, were faster than 3DES. Additionally, it
was observed that using compression gave a slight performance boost overall, although
the difference was small. The performance boost is likely due to
the compression of the RTP headers, since the voice data itself is already compressed
by the codec.
Footnote 5: Clocks were synchronized using NTP[76].
It is important to note that these encryption times are not fixed, but rather are
functions of the amount of traffic and the hardware and software performing the
encryption. This can immediately be seen from the fact that node IPSec2 usually
outperformed node IPSec1 given the same traffic rate and algorithm. Since the nodes are
identical machines, it was hypothesized that other processes were tying up the kernel
on node IPSec1. To test this hypothesis we performed a single test of uncompressed
AES encryption at a time when no extraneous processes were running on the machine.
The result of this test is shown in Table 9, on the line labeled "AES(solo)", and it can
be seen that the times on the two machines were almost identical. The performance
of 128-bit AES with compression was also measured under "clean" conditions, and
comparing it to the clean test of AES without compression reveals that for VoIP,
compression seems to play a negligible role in encryption times.
Returning to the main result of this experiment, we conclude that under the ideal
conditions, every encryption algorithm runs fast enough to support secure VoIP with
negligible loss of quality.
10.1.2 Experiment 2: Performance of Openswan with Increased Traffic
In order to gain a better understanding of how Openswan performed, the second
round of experiments ran a VoIP call with significantly more traffic running through
the IPSec machines. The goal of this experiment was to determine how the delay
introduced by encryption scaled with the traffic passing through the encryptors for
various encryption algorithms. We used the same setup shown in Figure 20, and again
simulated a 10-minute call with average parameters. This time, however, we also used
iperf[46] to generate a stream of UDP traffic between the two VoIP endpoints. The
Openswan boxes would encrypt this stream of traffic in addition to the VoIP call. We
believed that as traffic increased and the encryptors were forced to do more processing,
end-to-end delays would noticeably increase.
The purpose of this experiment was to put strain on the encryptors, not the network
links. For this reason, all of the various nodes were connected via switches and Gbps
Ethernets. The link emulator on LinkEm simply passed all traffic through, which is
roughly equivalent in behavior to a switch.
For this experiment, and most of the remaining experiments, we focused on AES
encryption with a 128-bit key. AES was chosen as it has been adopted as the standard
encryption algorithm by both NIST and the U.S. Government. Additionally,
a 128-bit key was chosen because key sizes in AES beyond 128 bits are currently
considered unnecessary. As computers get increasingly fast the need to increase
key size becomes relevant, but barring any unknown attack on AES faster than key
exhaustion, it is believed that 128-bit AES will be secure for well over 20 years[52].
[Figure: "Average Delay with Variable Traffic", average delay versus Additional Traffic (Mb/s) for AES-128 and unencrypted communication]
Figure 21: End-to-End Delays for 128-bit, Uncompressed AES and Clear Communication with Varying Levels of Traffic
We performed two sets of experiments, one with encryption turned off, and one with
128-bit AES. We then simulated 10-minute calls with average parameters and additional
UDP traffic of 1, 5, 10, 50, 70, and 85 megabits per second. The 100 Mbps
network interface cards on several of the machines prevented data rates larger than
85 Mbps from being explored.
For our generated data, 1400-byte packets were sent, because iperf was unable to
handle sending high volumes of traffic in smaller-sized packets. Delays were measured
from end-to-end on the clear sides of the IPSec boxes (locations 2 and 6 in Figure 20),
in hopes of being able to make statements about the end-to-end system as a whole.
Although for some of the experiments the accuracy of clock synchronization became a
factor, statements could be made, based on symmetry, of the average observed delay
in both directions, as well as the relative increased delay introduced by encryption.
The results of these experiments are summarized in Figure 21.
As can be seen in Figure 21, the expected result of delays increasing as more traffic
was added to the network was observed. What was surprising was the relatively
small effect encryption had on increasing delays. Using a linear approximation for
the two curves (which are approximately linear) it was determined that the slope
of the clear experiment was 3 microseconds per megabit/second, while for AES this
value was increased to 8 microseconds per megabit/second. The result was that for the
addition of 85 megabits per second of traffic, AES was .5 milliseconds, or approximately 1.5 times,
slower. Still, compared to the 100 milliseconds (cited by the E-Model) required for
absolute delays to impair conversation quality, the computational overhead introduced
by encryption is quite small.
Figure 22: Loss Rates for 128-bit, Uncompressed AES and Clear Communication
with 128 Kbps Bandwidth and Varying Background Traffic (End-to-end throughput); the plot shows the fraction of VoIP packets lost on a 128 kb/s link against additional traffic in kb/s, for encrypted and unencrypted communication.
10.1.3 Experiment 3: Performance over Low-Bandwidth Links
The third experiment we ran sought to determine the significance of the bandwidth
overhead introduced by encryption. There are several scenarios in which Ethernet
or equivalent bandwidth links are not available. This is especially true in wireless
airborne networks where bandwidth can be severely limited.
For this experiment we changed the link emulator on the LinkEm computer to simulate a low-bandwidth link. For the purposes of simplicity a fixed bandwidth of 128
kbps was chosen, with 0 for the bit error rate and no additional latency. 128 kbps is
the typical bandwidth found in Connexion and Inmarsat communication[47, 48], two
common forms of satellite links. We then ran a single VoIP call, along with variable
amounts of additional traffic generated with iperf (as described in Section 10.1.2) for
both clear and encrypted communication. Iperf packets of 500 bytes were sent. The
results of this experiment can be seen in Figure 22.
As can be seen in Figure 22, loss rates for encrypted communication are significantly
higher than for clear communication, especially for low volumes of additional traffic.
Figure 23: Loss Rates for 128-bit, Uncompressed AES and Clear Communication
with 128 Kbps Bandwidth and Varying Background Traffic (Actual Packet Sizes); the plot shows the fraction of VoIP packets lost against total observed traffic on the link in kb/s, for encrypted and unencrypted communication.
While encrypted communication demonstrates significant packet loss for only 20kbps
of additional traffic, unencrypted communication exhibited no loss until over 40kbps
traffic was added.
To better understand why this occurs we must understand how the bandwidth of
the channel is filling up. Since we are simulating a G.711 voice codec, the Ethernet
bandwidth used by the VoIP call alone was measured to be 80 kbps. This represents
a significant portion of the 128 kbps available to the channel. When encryption is
applied, the size of the VoIP packets increases from 200 to 256 bytes, resulting in an
increased bandwidth to 102.4 kbps. This leaves even less room for additional data
flows. Additionally, encryption introduces an overhead to the 500-byte iperf packets
as well, causing them to increase in size to 600 bytes. By using these updated packet
sizes and combining the VoIP and iperf data we were able to re-plot packet loss versus
total used bandwidth. This is shown in Figure 23.
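The bandwidth arithmetic above can be sketched in a few lines. This is a minimal illustration, not the tooling used in the experiments: the rate of 50 packets per second is an assumption inferred from 200-byte packets producing 80 kbps, and the function names are hypothetical. It also assumes that the requested iperf rate grows by the same 600/500 factor as the packet size.

```python
# Assumption: 50 packets/s, inferred from 200-byte packets yielding 80 kbps.
PACKETS_PER_SECOND = 50
LINK_CAPACITY_KBPS = 128.0


def voip_bandwidth_kbps(packet_bytes, packets_per_second=PACKETS_PER_SECOND):
    """Bandwidth used by a stream of fixed-size packets, in kbps."""
    return packet_bytes * 8 * packets_per_second / 1000.0


def total_link_load_kbps(voip_packet_bytes, iperf_kbps, iperf_expansion=1.0):
    """VoIP load plus background load, scaling iperf traffic by its size growth."""
    return voip_bandwidth_kbps(voip_packet_bytes) + iperf_kbps * iperf_expansion


clear_call = voip_bandwidth_kbps(200)       # clear G.711 call
encrypted_call = voip_bandwidth_kbps(256)   # same call after encryption

# Encrypted call with 20 kbps of requested iperf traffic (500 -> 600 byte packets).
encrypted_load = total_link_load_kbps(256, 20.0, iperf_expansion=600 / 500)
# Clear call with 40 kbps of iperf traffic and no size growth.
clear_load = total_link_load_kbps(200, 40.0)
```

Under these assumptions the encrypted call with 20 kbps of background traffic already sits within a few kbps of the 128 kbps capacity, while the clear call with 40 kbps still has some headroom, which is consistent with where loss begins in Figure 22.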
As can be seen in Figure 23, when the updated sizes for the encrypted packets are
considered, the amount of packet loss versus bandwidth is comparable for encrypted
and unencrypted communication. Additionally, it can be observed that the point
where loss begins to occur is very close to the 128 kbps capacity of the channel. As
the capacity of the channel is approached and exceeded, packet loss increases rapidly.
The encrypted data point at approximately 150kbps is anomalous. We speculate that,
since we are observing only loss rates in the VoIP packets (see footnote 6), this run was simply
"lucky", and a lot of iperf packets (as opposed to VoIP packets) were dropped. This
would result in a low observed loss rate for VoIP, even though a greater fraction of
packets could have been dropped overall. We speculate that repeated runs of these
experiments would cause the data to more closely align, but leave that experiment as
a topic of future work.
6 As opposed to total loss rates of combined VoIP and iperf packets.
Figure 24: Introduced Loss Rates for Clear and Encrypted Communication (plot of observed loss rate against requested fraction of packets lost, for clear and AES-encrypted calls, together with the expected line)
10.1.4 Experiment 4: Performance over Links with High Loss and Error Rates
The fourth experiment sought to isolate another characteristic of some airborne links:
non-negligible drop and error rates. The first thing we looked at was drop rates. Using
the LinkEm software we simulated drop rates of .1%, 1%, 5%, 10%, and 20%, and for
each case ran a clear and an encrypted VoIP call. The results are shown in Figure 24.
It can be seen in this figure that encryption does not appear to affect drop rates at
all. There is no observable difference between the clear and encrypted curves; each
one more or less follows the expected line. In some cases one slightly outperformed
the other, but we attribute this largely to the random drop modeling of the LinkEm
program, and not to the use of encryption.
A second factor considered was bit error rates. A bit error rate is the probability that any given
bit of a packet is accidentally flipped (a 0 becomes a 1 or vice versa). Bit error rates
are quite significant because many times the corruption of a single bit is enough to force
the packet to be discarded. This is especially true for encrypted data, where a single
bit-error makes the packet impossible to decrypt, but can also be true for compressed
data (e.g. compressed speech). Because of this, for this experiment any time one or
more bits of the packet had errors, we treated the packet as lost.
Figure 25: Loss Rates versus Bit Error Rates for Clear and Encrypted Communication (plot of drop rate against bit error rate, for encrypted and unencrypted communication)
We chose to simulate error rates ranging from $5 \times 10^{-7}$ to approximately $2.5 \times 10^{-4}$.
Simulating bit error rates higher than this became difficult, as setting up the call
often failed due to errors. For each error rate, we measured the percent of packets
containing at least one error, which we called the drop rate. Headers were omitted
when looking for bit-errors, as some headers can change naturally (e.g. time to live
field of an IP header) and these changes should not be counted as errors. It is not
known whether information could be salvaged from the clear packets containing bit-errors because we used simulated speech, but encrypted packets with errors would
certainly be dropped. The results of this experiment are shown in Figure 25.
It can be seen that the use of encryption drastically increases the likelihood of a
dropped packet. In fact, in each experiment performed, encrypted packets were 2.5
times as likely to be dropped with an equivalent bit-error rate. We believe that this
increased drop rate is largely due to the increased size of the encrypted packets, as discussed in Section 8.2.1. As the size of the packet grows, there are more opportunities
for a single bit error to occur, thus, the per-packet error rate increases.
We were, however, surprised at the magnitude of this increase, as we observed encrypted VoIP packets to be only 1.3 times the size of clear ones. This discrepancy is
discussed further in Section 10.3.2.
10.2 Adding Security to the Z-Model
Having obtained traffic data for various encryption setups and link configurations,
we are now ready to return to the discussion of Section 8 and fill in some terms in
the various equations. In Section 8.2.1 we discussed the addition of security to the
Z-Model in the context of four terms: jitter, delay, packet loss, and bandwidth. We
will now add the effects of security to the Z-Model for each of these terms.
Jitter
In Section 8.1.1 we discussed the ability to treat jitter as a combination of a fixed
delay and a packet loss rate by assuming the use of a jitter buffer. In particular, we
recall Equations 2 and 3, reproduced here:
$$t_a = t_{a\text{-}original} + t_{jitter} \qquad (17)$$
$$p_{loss} = p_{loss\text{-}original} + p_{loss\text{-}jitter} \qquad (18)$$
Let's explore what values to choose for $t_{jitter}$ and $p_{loss\text{-}jitter}$ to handle jitter in the
context of security. Recall that jitter is defined as the variation in packet interarrival
times. If we assume that VoIP systems send packets in a consistent, jitter-free manner,
then jitter will only be a result of network characteristics. In particular, we can
estimate jitter as the standard deviation of the delay experienced by each packet.
We were interested in the increased jitter due to encryption. By calculating standard
deviations of packet delays across all experiments we were able to determine when this
worst-case increase of jitter occurred. In our testing, the largest increased value for
jitter from unencrypted to encrypted communication occurred with 128-bit AES with
85 Mbps of additional traffic. In this case the jitter for encrypted communication was
.44ms, while the jitter for clear communication was .32ms. This is a 37.5% increase
over the normal case. Thus we suggest that in the case of encrypted communication
the estimated jitter should be 37.5% greater than that observed in clear communication. Network administrators would then be responsible for using this new value to
determine jitter buffer lengths.
Knowing how to calculate increased jitter due to security is nice, but sometimes a
fixed value is desired. We will examine jitter in our own network as an example of
determining this value.
To choose jitter buffer lengths in our own network, we will assume that packet delays
can be treated as a Gaussian random variable. This is a common method of handling
jitter in packet-switched systems[50, 51]. In this case, we can choose a value for
a jitter buffer that is four times the maximum observed jitter. Since jitter is the
standard deviation of packet delays and we are treating delays as a Gaussian variable,
99.99% of all packets will arrive within this window[35]. Using the scenarios that we
observed, this results in a worst-case jitter buffer of length $t_{jitter}$ = 1.76ms.
Assuming the 99.99% value given from statistics theory is accurate, the maximum
value for $p_{loss\text{-}jitter}$ becomes .01% in this case. Since we are again considering worst-case scenarios, we will use this maximum value for packets lost due to a jitter buffer
of length 1.76ms. We now have worst-case values for $t_{jitter}$ and $p_{loss\text{-}jitter}$ for our test
network.
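The jitter-buffer reasoning above can be checked with a short sketch. It assumes, as the text does, that packet delays are Gaussian and that jitter is their standard deviation; the function names are hypothetical, and the coverage is computed for a window of plus or minus four standard deviations around the mean.

```python
import math


def gaussian_coverage(k):
    """Fraction of a Gaussian lying within +/- k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def jitter_buffer_ms(jitter_ms, k=4.0):
    """Jitter buffer length chosen as k times the jitter (standard deviation)."""
    return k * jitter_ms


clear_jitter_ms = 0.32   # worst-case clear jitter reported in the text
aes_jitter_ms = 0.44     # worst-case AES-128 jitter reported in the text

relative_increase = aes_jitter_ms / clear_jitter_ms - 1.0   # about 0.375
buffer_ms = jitter_buffer_ms(aes_jitter_ms)                 # about 1.76 ms
late_fraction = 1.0 - gaussian_coverage(4.0)                # below 1e-4
```

The fraction of packets falling outside the four-sigma window comes out below .01%, so treating .01% as a worst-case value for the jitter-induced loss is conservative under the Gaussian assumption.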
It is important to stress that we are making no claims about the level of jitter in
a particular network. Our network represented nearly ideal conditions, as packets
traversed short physical distances over fast switches and Gigabit Ethernets. Our
main goal was to determine the additional jitter introduced by security, which we
estimate at 37.5% above the existing jitter. Network administrators must be aware
of the level of jitter on their own network and adjust this value appropriately. We
recommend the use of a jitter buffer of length $t_{jitter}$ equal to four times the average
jitter, for a loss rate of less than .01%.
Delay
The next thing we will consider is delay. Recalling Equation 10 of Section 8.2.1, we
note that encryption introduces three delays:
$$t_a = t_{a\text{-}original} + t_{jitter} + t_{encrypt} + t_{decrypt} + t_{encryption\text{-}buffer} \qquad (19)$$
The delays $t_{encrypt}$ and $t_{decrypt}$ represent the time to encrypt and decrypt a single VoIP packet,
and are fixed values per codec, while $t_{encryption\text{-}buffer}$ represents the variable delay caused by a packet waiting in the encryption buffer while other data is being
processed. For this reason $t_{encryption\text{-}buffer}$ is dependent on the bandwidth going
through the IPSec hosts.
Let's first consider $t_{encrypt}$ and $t_{decrypt}$. Experiment 1 measured the time in and out
of the IPSec hosts for various encryption algorithms, but we choose to focus on
AES. Recalling Table 9, the maximum observed encryption time for AES was .036
milliseconds, and the maximum decryption time was .060 milliseconds. Since we are
interested in conservative measurements, we will use these worst-case values and,
again, significantly overestimate them for our formulas. In particular, we will choose
a value for $t_{encrypt}$ of .072ms and a value for $t_{decrypt}$ of .12 milliseconds. Each of
these values is twice what we observed, and yet their sum is still only fractions of a
millisecond.
Now let's consider $t_{encryption\text{-}buffer}$. We will express this value as a function of bandwidth passing through the encryptors. Recalling Figure 21, we note that the slopes of
the best-fit lines for delays of 128-bit AES and clear communication versus bandwidth
were 8 microseconds/Mbps and 3 microseconds/Mbps respectively. Thus increasing
bandwidth introduces two kinds of latency, one independent of encryption that represents the delays associated with routers, links, etc., and a second representing the
increased time in the encryption buffer. Following the philosophy of requiring network administrators to understand their own network, we are interested only in the
encryption buffer time. We can obtain this time, as a function of bandwidth, by
subtracting the two lines in Figure 21. This yields the equation:
$$t_{encryption\text{-}buffer}(\mu s) = 5\,\frac{\mu s}{\text{Mbps}} * bandwidth(\text{Mbps}) \qquad (20)$$
representing the observed delays introduced by encryption. However, following our
conservative attitude about times, we will increase the slope of this relationship by a
factor of two, yielding the relationship:
$$t_{encryption\text{-}buffer}(\mu s) = 10\,\frac{\mu s}{\text{Mbps}} * bandwidth(\text{Mbps}) \qquad (21)$$
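As a worked illustration of the delay terms chosen above, the following sketch adds the conservative per-packet encryption and decryption times to the conservative buffer delay with its slope of 10 microseconds per Mbps. The constant and function names are hypothetical; the numeric values are the ones selected in the text.

```python
# Conservative values chosen in the text (twice the observed worst case).
T_ENCRYPT_MS = 0.072
T_DECRYPT_MS = 0.12
BUFFER_SLOPE_US_PER_MBPS = 10.0   # twice the observed 8 - 3 = 5 us/Mbps


def encryption_buffer_delay_us(bandwidth_mbps):
    """Conservative encryption-buffer delay in microseconds."""
    return BUFFER_SLOPE_US_PER_MBPS * bandwidth_mbps


def encryption_delay_ms(bandwidth_mbps):
    """Total added delay (encrypt + decrypt + buffer) in milliseconds."""
    return T_ENCRYPT_MS + T_DECRYPT_MS + encryption_buffer_delay_us(bandwidth_mbps) / 1000.0


heavy_load_ms = encryption_delay_ms(85.0)      # encryptors carrying 85 Mbps
single_call_ms = encryption_delay_ms(0.1024)   # one encrypted G.711 call (102.4 kbps)
```

Even with 85 Mbps passing through the encryptors the added delay stays near one millisecond, and for a single encrypted call the buffer term contributes almost nothing, leaving roughly .2 ms in total.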
Packet Loss
We now turn our attention to packet loss. Fortunately, this mostly turns out to be
an easy situation. Recalling Equation 14 of Section 8.2.1, we had:
$$p_{loss} = p_{loss\text{-}original} + p_{loss\text{-}jitter} + p_{loss\text{-}error\text{-}enc} + p_{loss\text{-}encryption\text{-}buffer} \qquad (22)$$
We have already shown that $p_{loss\text{-}jitter}$ is at most .01% and can be treated as 0, and for the experiments we performed,
packets were not lost due to the encryption buffer either. Thus $p_{loss\text{-}encryption\text{-}buffer} = 0$
as well.
Thus, the only overhead in terms of packet loss due to encryption is that due to
the increased packet size causing more single-bit errors in individual packets. The
additional loss can be calculated from any of Equations 11, 12, or 13, reproduced
here:
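$$p_{loss\text{-}error\text{-}enc} = 1 - (1 - ber)^{size_{enc}} \qquad (23)$$
$$p_{loss\text{-}error\text{-}enc} \approx ber * size_{enc} \qquad (24)$$
$$p_{loss\text{-}error\text{-}enc} \approx \frac{size_{enc}}{size} * p_{loss\text{-}error} \qquad (25)$$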
So besides this small overhead due to single bit errors, there is no additional loss due
to encryption. There is, however, a caveat associated with this statement. Limitations
of our particular network prevented testing the encryption buffers beyond 85 Mbps.
We point out that there must be a value for which $p_{loss\text{-}encryption\text{-}buffer}$ is non-zero
(an IPSec host cannot process packets infinitely fast). That said, since we were not
able to reach this value, we will only state that for bandwidth less than or equal to
85 Mbps, IPSec does not introduce any additional packet loss. An exploration of the
limitations of IPSec in this regard is left for future study.
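The loss terms of Equations 23 through 25 are easy to evaluate numerically. The sketch below is illustrative only: it assumes the packet size enters the formula in bits (eight times the byte size), treats a packet with one or more bit errors as lost, and uses hypothetical function names.

```python
def loss_from_ber(ber, size_bytes):
    """Probability that a packet of size_bytes contains at least one bit error."""
    bits = 8 * size_bytes
    return 1.0 - (1.0 - ber) ** bits


def loss_from_ber_linear(ber, size_bytes):
    """First-order approximation: bit error rate times packet size in bits."""
    return ber * 8 * size_bytes


CLEAR_BYTES = 200       # clear G.711 packet size from the text
ENCRYPTED_BYTES = 256   # AES-128 encrypted packet size from the text

clear_loss = loss_from_ber(1e-7, CLEAR_BYTES)
encrypted_loss = loss_from_ber(1e-7, ENCRYPTED_BYTES)

# For small error rates the ratio of the two approaches the size ratio (1.28).
loss_ratio = encrypted_loss / clear_loss
```

For a bit error rate of $10^{-7}$ these evaluate to roughly $1.6 \times 10^{-4}$ and $2.05 \times 10^{-4}$, and the linear form agrees closely because the product of error rate and packet size is small.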
Bandwidth
Finally, we consider bandwidth consumption. In Section 8.2.1 we pointed out that
although bandwidth consumed does not directly influence call quality in any way,
depending on the network it can drastically affect loss rates, delays, jitter, and other
factors that do. This was clearly demonstrated by Experiment 3, which showed
increasing loss as bandwidth was used up in a low-bandwidth environment. We make
no claims as to how bandwidth will affect these factors on any particular network, and
argue that this is the job of the network administrator to determine. We can, however,
say something about the bandwidth introduced by the application of encryption.
To do this, we will again consider the worst case scenario. The most significant
bandwidth increase came from 256-bit AES. This protocol increased the size of the
G.711 packets from 200 bytes to 264 bytes, resulting in a bandwidth increase from 80
kbps to 105.6 kbps, for a 32% increase in size. Since G.711 is the codec that requires
the most bandwidth and 256-bit AES was the encryption algorithm that introduced
the most bandwidth, we believe that network administrators should provide 110 kbps
per secure VoIP call (roughly 38% more than the 80 kbps used by calls made in the clear), with a modest
amount of additional bandwidth allocated as required by lower-level protocols.
In bandwidth limited situations voice codecs other than G.711 could be employed.
G.729, for example, can reduce bandwidth use from 80 kbps to approximately 30 kbps.
While this is a good way to reduce the bandwidth overhead of VoIP, an exploration
of the encrypted bandwidth requirements of other codecs was not performed. This is
left as a topic for future work.
10.3 Evaluating the Performance of the Z-Model
In this section we evaluate the performance of the Z-Model as a tool for predicting
conversation quality. We first look at the disagreement factor defined in Section 8.3.
We go on to examine the accuracy of the Z-Model's predicted loss rates for clear
and encrypted communication in Section 10.3.2. Finally, we discuss the performance
of the Z-Model on a series of links designed to model wireless airborne communication. This analysis serves the dual purpose of evaluating the predictive capabilities of
the Z-Model and determining the feasibility of high-quality, secure, VoIP over these
unfavorable network links.
10.3.1 Evaluating the Disagreement Factor as a Replacement for Absolute Delay
Recall that we introduced the disagreement factor, defined as $\frac{t_{disagree}}{t_{agree} + t_{disagree}}$, so that the
impairment due to delay $I_{dd}$ in the E-Model would demonstrate a benefit from the use
of a signaling protocol over high-latency links. To perform this test we used the link
emulator to simulate various latencies. Then for each latency we played two types
of conversations: an average one and the push-to-talk system. For the average case
the parameters entered into the model are those given in Table 5 and the parameters
for the push-to-talk case can be found in Table 7, both of Section 7.5. The runs
were repeated four times to determine a pattern of behavior. It was hoped that the
push-to-talk system would result in a lower disagreement factor than the average case,
which could be used to represent the quality improvement gained from its use over
high latency connections.
Figure 26: Disagreement Factor for Varying Two-Way Latencies (plot of disagreement factor against delay in ms, for the average and push-to-talk conversation models)
To evaluate the disagreement factor, we had the traffic generator program output a
timestamp, along with the state it believed itself to be in, every 20 milliseconds. In
this way the outputs of the two traffic generators could be compared to determine
how often they agreed on the state of the conversation, and how often they disagreed.
The clocks were synchronized within a range of under a millisecond, so we believed
that clock skew would play a minimal factor in the results of this experiment.
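The comparison of the two generator logs can be sketched as follows. This is a simplified illustration rather than the analysis tool used here: it assumes both logs have already been aligned on the same 20 millisecond grid, and the state labels and function name are hypothetical.

```python
def disagreement_factor(states_a, states_b):
    """Fraction of aligned samples on which the two endpoints report different states."""
    if len(states_a) != len(states_b) or not states_a:
        raise ValueError("state logs must be non-empty and of equal length")
    disagreeing = sum(1 for a, b in zip(states_a, states_b) if a != b)
    return disagreeing / len(states_a)


# Hypothetical logs: one entry per 20 ms sample, labels chosen for illustration.
log_a = ["A_talks", "A_talks", "A_talks", "silence", "silence"]
log_b = ["A_talks", "A_talks", "silence", "silence", "silence"]

factor = disagreement_factor(log_a, log_b)
```

Because every sample covers the same 20 ms interval, the fraction of disagreeing samples equals the fraction of time spent in disagreement.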
We simulated links with latencies of 0, 200, 500, 1000, 1200, 1500, 1700, and 2000
milliseconds. This latency was on top of the end-to-end latency of our system which,
because it was less than a millisecond, is considered negligible. Since this experiment
was focusing on the disagreement factor, we did not turn on encryption, nor did
we add any bandwidth constraints or loss rates. The results of this experiment are
summarized in Figure 26.
The error bars on the figure represent a single standard deviation averaged from 4
runs. Data points with no error bars are the results of a single run. As can be seen
in the figure, the push-to-talk system significantly outperformed the average case
for medium length delays. For example, with a 1 second delay, using a push-to-talk
system resulted in disagreement of conversation state between the two parties only
14% of the time, while using no protocol caused the parties to disagree about 30% of
the time. These results are more than one standard deviation apart, and so we
conclude that the disagreement factor is one way of representing the gains of using a
push-to-talk system.
As delays became extremely long, the improvement of the push-to-talk system over
the average case became less pronounced. We suggest that as the one-way latency
of the system begins to approach the length of a normal talkspurt, all notion of
agreement gets lost. In this case the fraction of time in the same state approaches the
fraction obtained by simply treating either side as independent random variables with
equivalent probabilities of being in any particular state. That is, we may as well not
be participating in the same conversation as long as we participate in a conversation
in the same way. The delay becomes so great that exactly what is happening on the
other end is meaningless, only the behavior of what is happening matters.
Another interesting factor was the high variance of the data observed for the push-to-talk case over the average case. Recalling Equation 16 for estimating the disagreement
factor based on the number of state transitions, we had:
$$\frac{t_{disagree}}{t_{total}} \approx \frac{N_{transitions} * t_a}{t_{total}} \qquad (26)$$
where $t_a$ is the one-way latency and $N_{transitions}$ is the number of state transitions.
This equation helps us understand the higher variation of the push-to-talk system
with high latency. As can be seen in Figures 14 and 17, there are significantly more
state transitions in the average case. Since the number of transitions is a random
variable depending on the input parameters, the lower the number of transitions is,
the greater the variance in the number of transitions, and according to Equation 26,
this will directly affect the variance of the disagreement factor.
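Equation 26 can be evaluated directly, as in the minimal sketch below. The function name and the example numbers are hypothetical, and capping the result at 1 is an added safeguard for very long latencies rather than part of the equation.

```python
def estimated_disagreement_factor(n_transitions, one_way_latency_s, total_time_s):
    """Estimate of t_disagree / t_total from the number of state transitions."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    estimate = n_transitions * one_way_latency_s / total_time_s
    return min(1.0, estimate)   # added safeguard: a fraction of time cannot exceed 1


# Hypothetical example: 120 transitions in a 10-minute call with 0.5 s one-way latency.
example = estimated_disagreement_factor(120, 0.5, 600.0)
```

Since the estimate scales linearly with the number of transitions, any run-to-run spread in that count carries over directly to the estimated disagreement factor.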
10.3.2 The Loss Due to Errors
The next issue we sought to evaluate was our equation for converting bit-error rates
to loss rates, given by Equation 11 in Section 8.2.1. Using this equation we estimated
the loss rates we expected to obtain for unencrypted and encrypted communication
under the error rates discussed in Section 10.1.4. These expected rates, along with
the actual rates measured in Experiment 4, are shown in Figure 27.
We notice immediately that for encrypted communication, the predicted and observed
loss rates are almost identical. For clear communication, however, the observed rates
were much lower than the predicted rate.
Figure 27: Expected and Observed Loss Rates versus Bit Error Rates for Clear and
Encrypted Communication (plot of drop rate against bit error rate, with observed and expected curves for encrypted and unencrypted communication)
One theory for this is that our tool to check for errors only looked at the payload
of the packets, and not the headers. This was because some routers may modify
upper level headers (such as time to live). IPSec in tunnel mode, on the other hand,
encrypts the entire packet, and any errors in the payload would cause the packet to be
discarded. Thus, the loss rates calculated from packet errors for clear communication
may actually underestimate the real loss rates. We did not pursue this hypothesis
any further, however, and it is left as a topic for future work.
10.3.3 Using the Z-Model to Estimate and Measure Overall Conversation Quality
As a final step we sought to evaluate the validity of the Z-Model in predicting VoIP
conversation quality. To do this, we compared what the model predicted to what was
observed for four different scenarios. The four scenarios used were based on four main
channels of communication available in airborne networks, and described in Table 8:
TCDL, Iridium, Connexion, and Inmarsat. The only modification made to the link
characteristics described in Table 8 is that the bandwidth of Iridium was increased
from 2.4 to 128 kbps. The reason for this change was that 2.4 kbps is only a small
fraction of the bandwidth required for a G.711 VoIP call (approximately 80 kbps),
and we thought the extra latency and higher error rate of the Iridium line would
be interesting to study in the absence of this bandwidth constraint. Because we are
changing the available bandwidth, we won't be able to make actual claims about
VoIP service over Iridium.
For each type of link we simulated a single VoIP call with average parameters, both
encrypted with 128-bit AES and unencrypted. We first predicted the loss rates and
delays of each channel based on the results of the previous section, and then used
these results to determine an R-value and MOS score representing the overall quality
of the communication. We then compared this predicted score to the actual observed
behavior. The results are discussed below.
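For readers who want to reproduce the MOS values quoted in this section, the sketch below uses the standard ITU-T G.107 conversion from R to an estimated MOS. It is an assumption that this is the same mapping reviewed in Table 2 and Figure 5, although it does reproduce the R and MOS pairs reported here; the function name is hypothetical.

```python
def r_to_mos(r):
    """ITU-T G.107 conversion from an E-Model R score to an estimated MOS."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# R scores that appear in this section.
mos_values = {r: round(r_to_mos(r), 2) for r in (94.3, 76.8, 76.7, 46.1, 45.8)}
```

The conversion is fairly flat near the top of the scale, which is why small changes in R for the TCDL link do not change the MOS to two decimal places.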
TCDL
A Tactical Common Data Link (TCDL) requires a line of sight between communicating parties. Recall from Table 8 that TCDL communication has a bit error rate
of $10^{-7}$, a bandwidth of 10' kbps, and a latency of 2 milliseconds. For unencrypted
communication, we use Equation 8, along with the bit error rate of $10^{-7}$ and packet
size of 200 bytes to estimate a loss rate due to errors of $1.60 \times 10^{-4}$. We also estimate
that the delay in the absence of encryption will be no greater than 4 milliseconds.
Plugging these values into the E-Model gives an R score of 94.3 and a corresponding
MOS of 4.43. Thus, we predict that TCDL should provide excellent quality VoIP (see footnote 7).
For encrypted communication over TCDL we use the encrypted packet size of 256
bytes to estimate a loss rate of $2.05 \times 10^{-4}$. Additionally, we use Equation 10
and the methods described above (setting $t_{jitter}$ = 1.76ms and $t_{encrypt} + t_{decrypt} + t_{encryption\text{-}buffer}$ = .2ms) to estimate a worst-case additional delay of 2 milliseconds.
This would be added to the delay observed for unencrypted communication.
To verify our estimations we simulated a TCDL link and ran two calls, one encrypted
and one unencrypted. For the unencrypted communication we observed 0 lost packets
and an average delay of 3.02ms. These values yield an R score of 94.3 and MOS of
4.43.
Adding the 2 ms encryption overhead, we use a delay of 5 ms and the loss rate above
to predict an R score for the encrypted case of 94.3 and MOS of 4.43. Thus, we believe
that encryption would not hinder voice quality at all in TCDL communication.
The observed encrypted communication experienced a loss rate of $2.3 \times 10^{-4}$ and an
average delay of 3.8ms. Together, these measurements correspond to the same R score
of 94.3 and MOS of 4.43.
Thus, we conclude that TCDL communication will readily support high-quality encrypted and unencrypted VoIP.
7 Please refer to Table 2 and Figure 5 for a review of R and MOS scoring.
Connexion
Connexion links are significantly less reliable and slower than TCDL links. In particular, Connexion links have an estimated bit error rate of $10^{-6}$ and a latency of
325 milliseconds. Using the same methods as described for TCDL, for unencrypted
communication we estimate a loss rate due to errors of $1.6 \times 10^{-3}$ and a latency of 327
milliseconds. These values determine an R score of 76.7 and MOS of 3.89.
For encrypted communication, the increased size of encrypted packets results in a loss
rate of $2.05 \times 10^{-3}$ and the overhead of encryption translates to a worst-case latency
of 329 milliseconds. With these values, we estimate an R score of 76.5 and MOS of
3.89.
Running an unencrypted VoIP call over a Connexion link, we observed a loss rate
of $7.6 \times 10^{-4}$ and an average delay of 326.4 ms. These values translated into an
R score of 76.8 and MOS of 3.90, closely matching our predictions. For encrypted
communication we saw a loss rate of $2.16 \times 10^{-3}$ and delay of 327.2 ms, for an R score of
76.7 and MOS of 3.89, again matching our predictions.
Thus, while the extra latency of Connexion over TCDL communication does hinder
VoIP quality, the effects of encryption do not significantly impair quality any further.
Inmarsat
Inmarsat links are very similar to Connexion links. The latency of both is 325 milliseconds, with the difference being that Inmarsat links have a bit error rate of $10^{-7}$ as
opposed to $10^{-6}$ for Connexion. Since Inmarsat has the same error rate as TCDL, its
expected loss rates are the same: $1.6 \times 10^{-4}$ and $2.05 \times 10^{-4}$, for unencrypted and encrypted communication, respectively. Likewise, since the latency of Inmarsat links is
the same as Connexion links, the expected delays are the same, with estimated values
of 327 ms for unencrypted communication and 329 ms for encrypted communication.
These values determine R scores of 76.7 and 76.5, with corresponding equivalent MOS
values of 3.89 for encrypted and unencrypted calls. It should be noted that these values are the same as those predicted for Connexion, and the error rate appears to have minimal
impact on quality. This is discussed in more detail below.
In our experiments we observed a loss rate of $2.1 \times 10^{-3}$ and a delay of 326.3 ms for
unencrypted communication. This resulted in an R score of 76.8 and MOS of 3.90.
For encrypted communication we saw a loss rate of $2.1 \times 10^{-4}$ and a delay of 326.9 ms.
This produced an R score of 76.7 and MOS of 3.89. Once again, the R scores and
MOS values associated with Inmarsat communication were equal to those for Connexion,
and the bit error rate was not a significant factor affecting quality.
Modified Iridium
As a final test we looked at a modified Iridium link. The modification we made was
to increase the bandwidth from 2.4 to 128kbps. This provided ample bandwidth for
a VoIP call to be transmitted. The bit error rate for Iridium is $10^{-5}$, and the delay
is 2000 ms.
Connection    Encryption  p_loss   I_e     Delay (ms)  I_d   R     MOS
TCDL          none        0        0       3.0         0     94.3  4.43
TCDL          AES-128     2.3e-4   1.9e-3  3.8         0     94.3  4.43
Connexion     none        7.6e-4   7.2e-3  326.4       17.5  76.8  3.90
Connexion     AES-128     2.1e-3   .021    327.2       17.6  76.7  3.89
Inmarsat      none        2.1e-3   .020    326.3       17.5  76.8  3.90
Inmarsat      AES-128     2.1e-4   2.0e-3  326.9       17.6  76.7  3.89
Iridium-like  none        .018     .17     2001        48.1  46.1  2.37
Iridium-like  AES-128     0.041    .39     2001        48.1  45.8  2.36
Table 10: Factors Contributing to R score for Various Links
A bit error rate of $10^{-5}$ produces an expected loss rate of .0158 for unencrypted
packets, and .0203 for encrypted packets. The expected delays are 2002 ms for unencrypted data and 2004 milliseconds for encrypted data. These values result in an R
score of 46.1 and MOS of 2.37 for unencrypted data and a score of 46.0 and MOS of
2.37 for encrypted data.
In our experiments with unencrypted data running over this link, we observed a loss
rate of .018 and delay of 2001 ms. This resulted in an R score of 46.1 and MOS of
2.37. For encrypted data we observed a loss rate of .041 and a delay of 2001 ms. This
resulted in an R score of 45.8 and MOS of 2.36.
Once again, the overhead due to encryption is quite small.
Discussion
The results of this section are summarized in Table 10. We note a few important
results that can be observed from this table.
The first noticeable result is that encryption did not significantly hinder quality in
any of these cases. For all links we looked at, encryption affected the R score by at
most .3 and MOS by at most .01. This is good to know; encryption alone should not
impair the quality of communication over any of these types of links.
A second interesting result is that for all of these links, the effect of latency on the
overall score was much more pronounced than the effect of packet loss. In the E-Model,
latency is captured by the Id term, which varied from 0 to 48 across the links. The
impact of packet loss, which is characterized by the Ie term, was much smaller: in all
of our experiments, packet loss never impaired the R score by more than a few tenths
of a point. Compared to 48 points for delay on the Iridium-like link, or even 17.6 points
for delay over Inmarsat and Connexion links, this small degradation in quality could
largely be ignored. The severe impairment caused by the delays of these links suggests
that a push-to-talk system might improve conversation quality, but the lack of subjective
measurements made this impossible to investigate further.
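The relationship between the Id and Ie terms, the R score and the MOS in Table 10 can be
checked with a few lines of code. This is a sketch under stated assumptions rather than
the exact computation used for the table: the base value of 94.3 is read from the
unimpaired TCDL row, and the R-to-MOS mapping is the standard conversion from ITU-T
G.107. The function names are hypothetical.

```python
R_BASE = 94.3  # assumption: R score of the unimpaired TCDL row in Table 10


def r_score(i_d: float, i_e: float, r_base: float = R_BASE) -> float:
    """Simplified R score: base value minus delay and loss impairments."""
    return r_base - i_d - i_e


def mos_from_r(r: float) -> float:
    """Standard E-Model (ITU-T G.107) mapping from R score to estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# (label, Id, Ie) taken from the rows of Table 10
ROWS = [
    ("TCDL, none", 0.0, 0.0),
    ("Connexion, none", 17.5, 7.2e-3),
    ("Connexion, AES-128", 17.6, 0.021),
    ("Iridium-like, none", 48.1, 0.17),
    ("Iridium-like, AES-128", 48.1, 0.39),
]

for label, i_d, i_e in ROWS:
    r = r_score(i_d, i_e)
    print(f"{label:24s} R = {r:5.1f}  MOS = {mos_from_r(r):.2f}")
```

With these inputs the computed R scores agree with the table to within about a tenth of
a point, and the delay term accounts for nearly all of the difference between links.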
11 Future Work
In this thesis, a technique for estimating VoIP call quality under various network and
security conditions has been presented. We believe this work is a good starting point
for those interested in modeling and evaluating VoIP systems. There are, however,
several areas for future study. These are broken down into four categories:
conversation modeling, security, network characteristics, and voice quality.
11.1 Conversation Modeling
We investigated extensively how to model the behavior of two users talking on
a Cisco 7940 phone running a G.711 VoIP codec. While we believe that our research
into the parameters of Brady's model accurately reflects the behavior of this system, we
do not make any claims about how other systems behave. For example, the 2.3
second additional talkspurt length resulting from the Voice Activity Detection may
not apply to other VoIP implementations. Additionally, we did not attempt to model
the behavior of any other VoIP codecs, and believe that it is important to investigate
these systems at a later date.
There were also conversation types that were not implemented, including calls involving more than two people. It would be interesting to observe the bandwidth
requirements for calls as the number of parties increases, as presumably each person
in the call will be talking a smaller percentage of the time.
We also did not send real speech over network links, and instead sent random bits. If
speech data were compressible, this could have reduced the bandwidth overhead of IPSec
by allowing packets to be compressed before encryption. Since speech packets are
already compressed, we believe this would not be the case, but we are unable to state
this definitively.
Finally, outside of the average case, we did not develop different user models based
on a data set, but instead manually generated parameters we believed would reflect
a particular style of conversation. We suggest a comprehensive study into the actual
parameters observed for these systems as a topic for further review.
11.2 Security
This thesis is not an exhaustive overview of VoIP security. Instead, we focused on
conversation quality impairment due to implementing particular security protocols,
mainly 128-bit AES. We did not research the effect of security being added to call
setup and teardown, and suggest this as a topic for further study. Additionally, we
assumed that security setup (including key exchange) was performed prior to the VoIP
call, and did not consider the case where security protocols fail to set themselves up
correctly.
We also chose to focus our experiments on IPSec security, and did not consider the
various other security protocols that could be used in VoIP systems, including SRTP
and SSIP. Even in the context of IPSec there were still several other options that we
did not fully explore. While we did look at several different security algorithms, we
did the bulk of our experiments with 128-bit AES. The experiments could be repeated
with other algorithms to determine if differences in efficiency become more apparent
under sub-optimal network conditions. Additionally, implementations of IPSec other than
Openswan could be compared and the performance of each evaluated.
11.3 Network Characteristics
In our evaluation of the Z-Model, we only looked at four types of networks. There
are many types of networks outside the realm of airborne networks that would be
interesting to look at. In particular, the performance of secure VoIP over 802.11 and
Bluetooth wireless networks is a topic we suggest for future study.
We also used link emulators to simulate the characteristics of these network
links. We do not know if the performance of secure VoIP was influenced by these
simulated (as opposed to real) connections. The only way to determine this would be
to run secure VoIP over the actual satellite links.
11.4 Voice Quality
Perhaps the most important subject for future work is performing a qualitative study
on the performance of secure VoIP. The E-Model, although developed based on qualitative research, is not a perfect measure of voice quality. Only qualitative research
with human conversants can accurately evaluate the performance of any conversation
quality model.
Additionally, it would be very helpful to know the perceived quality gain from the use
of a push-to-talk system over no protocol for high latency links. This would provide
a basis for describing the dependence of quality on the disagreement factor proposed
in Section 8.3. Unfortunately, since no qualitative evaluations of push-to-talk systems
were available, we were unable to model this dependency.
12 Conclusion
The impact of security on VoIP systems was evaluated for varying network conditions.
A model for VoIP was developed, based on a conversation model developed by Paul
Brady for use on circuit switched calls. For the system we tested, employing voice
activity detection had an effect that was similar to adding a fixed buffer to the end of
every talkspurt. This effect was incorporated into Brady's model, resulting in a set of
parameters that were drastically different from the ones observed by Brady. In addition to determining how average VoIP traffic patterns differed from circuit switched
calls, user models were also developed to represent different styles of conversation.
To evaluate conversation quality, various metrics, including MOS, PESQ, and the
E-Model, were considered. Ultimately the E-Model was chosen as a basis to build a secure
VoIP conversation quality model. Using the E-Model, various network characteristics
and how they affect conversation quality were considered to develop a new model, the
Z-Model, which incorporated security. The impact of security protocols on packet loss,
delay, and jitter was considered, as well as the effects of bit error rates and bandwidth
usage.
Additionally, a new metric, called the disagreement factor, was introduced as a possible
replacement for delay in the E-Model's delay impairment equation. This metric was
introduced to encapsulate the improved conversation quality of push-to-talk systems
over high latency links. It was found that this factor did demonstrate improved
performance under these circumstances, but the exact relation between the disagreement
factor and the R score was not determined due to lack of subjective data.
To determine the security dependencies of the Z-Model, a series of experiments was
run. In each experiment a particular security or network characteristic was isolated.
Results from encrypted trials were compared to unencrypted communication to isolate
the overhead due to encryption. The exact implementation considered for the bulk
of the experimentation was 128-bit AES running through Openswan IPSec in tunnel
mode.
It was found that for the majority of cases the overhead due to encryption was quite
small. Typically, the estimated call quality from encrypted communication was found
to be very close to the call quality of unencrypted communication under equivalent
network conditions. Encryption algorithms, even with traffic rates approaching 100
Mbps, were very fast, and delays due to encryption that would significantly impair
VoIP quality were not observed for any security setup.
The most noticeable effect of encryption was the increased bandwidth that resulted
from the encrypted packets. A bandwidth increase of 28-35% was typically observed
in VoIP systems after encryption. In the cases where this resulted in a bandwidth
increase beyond the capacity of the communication channel, a significantly increased
loss rate was observed with encrypted communication. This resulted in reduced
conversation quality.
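The effect of this bandwidth increase on a constrained link can be illustrated with a
rough calculation. The sketch below rests on assumptions that do not come from the
experiments: an unencrypted call rate of 80 kbps (50 packets per second of 200 bytes
each), a hypothetical 100 kbps link for comparison with the 128 kbps Iridium-like link,
and a simple fluid approximation in which all traffic above the link capacity is
dropped. It is not a model of the measured loss rates.

```python
def encrypted_rate_kbps(base_kbps: float, overhead_fraction: float) -> float:
    """Call bandwidth after applying a fractional encryption overhead."""
    return base_kbps * (1.0 + overhead_fraction)


def excess_loss_fraction(offered_kbps: float, capacity_kbps: float) -> float:
    """Fraction of offered traffic that exceeds link capacity.

    Fluid approximation (an assumption): traffic above capacity is dropped.
    """
    if offered_kbps <= capacity_kbps:
        return 0.0
    return 1.0 - capacity_kbps / offered_kbps


BASE_KBPS = 80.0            # assumed unencrypted call rate
OVERHEADS = (0.28, 0.35)    # range of bandwidth increase reported in the text
CAPACITIES = (128.0, 100.0) # Iridium-like link, and a hypothetical tighter link

for capacity in CAPACITIES:
    for overhead in OVERHEADS:
        offered = encrypted_rate_kbps(BASE_KBPS, overhead)
        loss = excess_loss_fraction(offered, capacity)
        print(f"capacity {capacity:5.0f} kbps, overhead {overhead:.0%}: "
              f"offered {offered:5.1f} kbps, excess loss {loss:.3f}")
```

Under these assumptions the unencrypted call fits on both links, while the encrypted
call fits only on the 128 kbps link. On the hypothetical 100 kbps link the encrypted
call exceeds capacity, which is the situation in which increased loss would be expected.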
Finally, the performance of secure VoIP systems over wireless links in airborne networks was evaluated. It was found that TCDL links offer the capability for toll-quality
secure VoIP, while Inmarsat and Connexion links provide adequate conversation quality in the absence of additional traffic. The impact of delay on satellite links was observed to impair quality far more than the packets lost due to bit-errors. For all links,
neglecting bandwidth issues that could arise if other communication occurred during
the VoIP call, the impact of encryption on overall conversation quality was negligible, and encrypted and unencrypted call quality were nearly identical. This behavior
matched the predicted quality estimated by the Z-Model, although the Z-Model was
observed to slightly overestimate the impact of security in VoIP systems.
References
[1] Empirix Inc., "Hammer VoIP Test Solution," 2005,
http://www.empirix.com/default.asp?action=article&ID=522
(visited: 5/2005).
[2] Ixia, "IxVoice RTP Test Library," 2005,
http://www.ixiacom.com/products/voice-testing/indexphp (visited: 5/2005).
[3] Jonathan Davidson, James Peters, and Brian Gracely, Voice over IP Fundamentals, Cisco
Press, 2000.
[4] Cisco Inc., " Enabling High Availability for Voice Services in Cable Networks," 2005,
http://www.cisco.com/en/US/products/hw/modules/ps4302/products.white-paperO9
186a0080179145.shtml, (visited: 5/2005).
[5] Brent Baccala, Connected: An Internet Encyclopedia, Third Edition, 1997,
http://www.freesoft.org/CIE/Topics/57.htm (visited: 5/2005).
[6] Internet Archive, "Arpanet," 2003 http://www.archive.org/details/arpanet (visited: 5/2005).
[7] Paul E. Jones, Packetizer Inc., "H.323 Standards," 2004,
http://www.packetizer.com/voip/h323/standards.html (visited: 4/2005).
[8] Weinstein, Forgie, McElwain, "Audio Recording of First Packetized Speech Teleconference,"
MIT Lincoln Laboratory, 5/1/1978.
[9] Clifford Weinstein, and James Forgie, "Experience with Speech Communication in Packet
Networks", IEEE Journal Selected Areas Communications, vol SAC-1, no. 6, Dec. 1983, pp.
963-980.
[10] Intertangent Technology Directory, "History of VoIP", 2004,
http://www.intertangent.com/023346/Articles-and-News/1413.html (visited: 12/2004).
[11] Cisco Systems Inc., "Adobe Receives Cisco's 3 Millionth IP Telephone," 5/12/2004,
http://newsroom.cisco.com/dlls/2004/prod-D51204b.html, (visited: 12/2004).
[12] D Moore, C Shannon, J Brown, "Code-Red: a case study on the spread and victims of an
Internet Worm," Proceedings of the 2002 ACM SICGOMM Internet Measurement.
[13] Colin Haley, "Are You Ready for VoIP?", InternetNews.com, 5/5/2004,
http://www.smallbusinesscomputing.com/news/article.php/3349791 (visited: 12/2004).
[14] EMarketer.com, "Consumer VoIP Adoption", 10/12/2004,
http://www.emarketer.com/Article.aspx?1003085, (visited: 12/2004).
[15] Rose Cordero, and Jennifer Williston, "Voice Over Internet Protocol," 2004,
http://www.unc.edu/courses/2004spring/law/357c/001/projects/jennwill/VOIP/facts.html
(visited: 5/2004).
[16] Cisco Systems Inc., "Voice Over IP - Per Call Bandwidth Consumption," 2004,
http://www.cisco.com/en/US/tech/tk652/tk698/technologies tech-note09186a0080094ae2
.shtml, (visited: 2/2005).
102
[17] Tasyumruk, Lutfullah, Analysis of Voice Quality Problems of Voice Over Internet Protocol
(VoIP), Masters Thesis, Naval PostgraduateSchool, Monterey, CA, September 2003.
[18] Glen Campbell, et al., "Everyting Over IP: VoIP: and Beyond," Merrill Lynch, 3/12/2004,
http://www.vonage.com/media/pdf/res-03-12_04.pdf (visited: 5/2005).
[19] Reynolds, D.A., Campbell, J. P., Campbell, et. al., "Beyond Cepstra: Exploiting High-Level
Information in Speaker Recognition" In Proc. Workshop on Multimodal User Authentication
in Santa Barbara, California, pp. 223-229, 11-12 December 2003.
[20] Jori Liesenborgs, "Jori's RTP Library," 10/2004,
http://research.edm.luc.ac.be/jori/jrtplib/jrtplib.html, (visited: 9/2004).
[21] Hovik Melikyan, "C++ Portable Types Library," 2004, http://www.melikyan.com/ptypes/
(visited: 9/2004).
[22] Johnston, Alan B., SIP: Understanding the Session Initiation Protocol, Artech House, 2004.
[23] Joseph D. Harwood, "IPSec: Technology and Application Overview," 1/6/2001, Vesta Corp.,
http://www.vesta-corp.com/IpsecOverview.pdf, (visited: 5/2005).
[24] Bruce Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, Wiley
Books, 1995.
[25] W. Diffie and M.E. Hellman, "New directions in cryptography", IEEE Transactions on
Information Theory 22 (1976), 644-654.
[26] Rivest, R. L., Shamir, A., Adleman, L. A., "A method for obtaining digital signatures and
public-key cryptosystems", Communications of the ACM, Vol.21, Nr.2, 1978, S.120-126.
[27] Fraunhofer Fokus, "The IP Telephony Site," 2005, http://www.iptel.org (visited: 9/2004).
[28] Hari Balakrishnan, Dina Katabi, Robert Morris; MIT Course 6.829: Computer Networks,
Lecture Notes 18 and 19, Fall, 2004.
[29] S. J. Rees, "Convergence and Voice Over IP,"
http://www.soc.staffs.ac.uk/sjr3/dccn%20tutorial%20sheet%20three%20with%20answers.doc,
(visited: 4/2005).
[30] "RTP Parameters," 5/20/2005 http://www.iana.org/assignments/rtp-parameters (visited:
May 2005).
[31] Brady, Paul T., "A Model for Generating on-off Speech Patterns in Two-Way Conversations,
The Bell System Technical Journal, September 1969, pp.2445-2471.
[32] Brady, Paul T., "A Statistical Analysis of On-Off Patterns in 16 Conversations, The Bell
System Technical Journal, January 1968, pp.73-91.
[33] David Graff, Kevin Walker, and David Miller, "Switchboard Cellular Part 1: Audio",
University of Pennsylvania, 2001,
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S13 (visited:
2/2004).
103
[34] David Graff, Kevin Walker, and David Miller, "Switchboard Cellular Part 1: Transcribed
Audio" University of Pennsylvania,
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S15 (visited:
2/2004).
[35] Cox, D.R., and Miller, H.D., The Theory of Stochastic Processes, New York: Wiley, 1965.
[36] Walter Willinger, David Alderson, and Lun Li, "A Pragmatic Approach to Dealing with
High-Variability in Network Measurements," Internet Measurement Conference, 2004.
[37] Personal correspondence with Hammer and Ixia representatives at Voice on the Net
Conference, Boston, 2004.
[38] Cisco Inc., "Cisco IP Phone 7940G," 2005,
http://www.cisco.com/warp/public/cc/pd/tlhw/prodlit/7940_ds.htm (visited: 10/2004).
[39] Sheila Frankel, Demystifying the IPSec Puzzle, Artech House, Boston, 2001.
[40] Linux FreeS/WAN version 2.06, 4/22/2004, http://www.freeswan.org/ (visited: 3/2005).
[41] Openswan version 2.3, 1/2005, http://www.openswan.org/ (visited: 3/2005).
[42] StrongSwan - IPSec for Linux, http://www.strongswan.org/ (visited: 3/2005).
[43] IPSec Tools, http://ipsec-tools.sourceforge.net/ (visited: 3/2005).
[44] Intel Corp., "Intel PRO/100 S Server Adapter," 2005,
http://www.intel.com/network/connectivity/products/prolOOs.srvr-adapter.htm,
3/2005).
(visited:
[45] "TCPDUMP Public Repository," 4/7/2005, http://www.tcpdump.org/ (visited: 4/2005).
[46] Iperf version 1.7.0, 3/2003, http://dast.nlanr.net/Projects/Iperf/ (visited: 4/2005).
[47] The Boeing Company, "Connexion by Boeing," 2005, http://www.connexionbyboeing.com/
(visited: 4/2005).
[48] Inmarsat Ltd., "Inmarsat - Total Communications Network," 2005,
http://www.inmarsat.com/ (visited: 4/2005).
[49] Iridium Satellite LLC., "About Iridium," 2005,
http://www.iridium.com/corp/iri-corp-understand.asp (visited: 4/2005).
[50] A Kos, B Klepec, S Tomazic, "Techniques for Performance Improvement of VoIP
Applications", Electrotechnical Conference, MELECON 2002.
[51] Agilent Technologies, "Jitter Solutions for Telecom, Enterprise, and Digital Designs," 2004,
http://cp.literature.agilent.com/litweb/pdf/5988-9592EN.pdf, (visited: 5/2005).
[52] National Institute of Standards, "CSRC Cryptographic Toolkit - AES," 1/28/2002,
http://csrc.nist.gov/CryptoToolkit/aes, (visited: 5/2005).
[53] International Telecommunications Union, "The E Model: A computational model for use in
Transmission Planning," ITU-T G.107.
104
[54] "Perceptual Evaluation of Speech Quality," ITU-T P.862.
[55] "Perceptual Speech Quality Measure," ITU-T P.861.
[56] Lucent Technologies, "Voice over Internet Protocol Voice Quality of Service," 1/1/2004,
http://www.lucent.com/livelink/090094038007ffeb-White-paper.pdf, (visited: 5/2005).
[57] Christian Hoene, et. al. "A Perceptual Quality Model for Adaptive VoIP Applications",
Proceedings of International Symposium on Performance Evaluation of Computer and
Telecommunication Systems (SPECTS'04), San Jose, California, USA, 2004.
[58] Athina Markopoulou, et. al., "Assessment of VoIP Quality over Internet Backbones," IEEE
INFOCOM, 2002.
[59] "Estimates of Ie and Bpl parameters for a range of CODEC types," ITU-T SG12 D.106,
http://www.telchemy.com/reference/ITU%20SG12%20D106%2OBpl%20parameters.pdf,
(visited: 4/2005).
[60] John C. Gammel, "Echo Cancellation for VoIP," 10/1999,
http://www.commsdesign.com/main/1999/10/9910feat4.htm (visited: 5/2005).
[61] Mordechai T. Abzug, "MD5 Homepage (unofficial)"
http://userpages.umbc.edu/ mabzugl/cs/md5/md5.html (visited: 5/2005).
[62] Philip A. DesAutels, "Secure Hash Algorithm - Version 1.0," 10/1997,
http://www.w3.org/PICS/DSig/SHAll_0.html (visited: 5/2005).
[63] X5 Networks, "Cryptography: What is Capstone?,"
http://www.x5.net/faqs/crypto/ql50.html (visited: 5/2005).
[64] Kuhn, Richard, et al. "Security Considerations for Voice Over IP Systems", National Institute
of Standards and Technology, Special Publication 800-58,
http://csrc.nist.gov/publications/nistpubs/800-58/SP800-58-final.pdf, (visited: 5/2005).
[65] Kevin Poulsen, "VoIP Hacks Gut Caller I.D.," SecurityFocus Jul 6 2004,
http://www.securityfocus.com/news/9061, (visited: 1/2005).
[66] Richard Taborek, "Recommendation of 10-1 Bit Error Rate for 10 Gigabit Ethernet," IEEE
LMSC, 1999,
http://grouper.ieee.org/groups/802/3/1OG-study/public/uly99/chang_2_0799.pdf, (visited:
5/2005).
[67] Cisco Inc., "Skinny Client Control Protocol (SCCP)," 2005,
http://www.cisco.com/en/US/tech/tk652/tk7Ol/tk589/tsd-technology-support
-sub-protocollhome.html, (visited: 5/2005).
[68] "Internet Protocol," RFC 791, 1981.
[69] "User Datagram Protocol," RFC 768, 1980.
[70] "Transmission Control Protocol," RFC 793, 1981.
[71] "Real Time Protocol," RFC 3550, 2003.
[72] "Session Initiation Protocol," RFC 3261, 2002.
105
[73] "Hypertext Transfer Protocol," RFC 2616, 1999.
[74] "Simple Mail Transfer Protocol," RFC 2821, 2001.
[75] "IP Security Document Roadmap" RFC 2411, 1998.
[76] "Network Time Protocol," RFC 1305, 1992.
[77] Secure Real-Time Protocol, RFC 3711, 2004.
[78] "Megaco Protocol Version 1.0," RFC 3015, 2000.
[79] "Media Gateway Control Protocol (MGCP) Version 1.0," RFC 2705, 1999.
106