Privacy and Performance Trade-offs in
Anonymous Communication Networks
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
John D. Geddes
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Doctor of Philosophy
Nicholas J. Hopper
February, 2017
© John D. Geddes 2017
ALL RIGHTS RESERVED
Acknowledgements
There are a great many people to whom I owe a debt of gratitude for their support during this dissertation. First and foremost is my advisor Nick Hopper, who not only helped contribute to and shape this work, but provided endless guidance and encouragement from start to finish.
I would like to thank all my collaborators and co-authors – Rob Jansen, Max Schuchard,
and Mike Schliep. Your input and contributions have been paramount to getting me
here, and this work is undeniably better off because of it. Additionally I want to thank
everyone I have had the great pleasure to work with during my tenure in graduate school
– Shuai Li, Zi Lin, Yongdae Kim, Denis Foo Kune, Aziz Mohaisen, Se Eun Oh, Micah
Sherr, Paul Syverson, Chris Thompson, and Chris Wacek.
I would like to thank my other committee members – Stephen McCamant, Andrew
Odlyzko, and Jon Weissman – for their comments and time volunteered during this
process.
I would especially like to thank my parents and brother who have been there since
the beginning. Never in a million years could I have accomplished this without your
continuous support and encouragement.
And last but certainly not least, I thank my best friend and wife, Dominique. This
work would not exist without you and your never-ending support. No matter the challenges or difficulties faced, you stood by me and helped me through it all.
Dedication
To Dominique, who has always been there for me, even through the never ending
“just one more year of grad school”.
Abstract
Anonymous communication systems attempt to prevent adversarial eavesdroppers from learning the identities of any two parties communicating with each other. In order to protect against global adversaries, such as nation states and large internet service providers, systems need to introduce large amounts of latency in order to sufficiently protect users' identities. Other systems sacrifice protection against global adversaries in order to provide low latency service to their clients. This makes the system usable for latency sensitive applications like web browsing. In turn, more users participate in the low latency system, increasing the anonymity set for everybody. These trade-offs between performance and anonymity are inherent in anonymous communication systems.
In this dissertation we examine these types of trade-offs in Tor, the most popular
low latency anonymous communication system in use today. First we look at how
user anonymity is affected by mechanisms built into Tor for the purpose of increasing
client performance. To this end we introduce an induced throttling attack against flow control and traffic admission control algorithms, which allows an adversarial relay to reduce the anonymity set of a client using the adversary as an exit. Second we examine
how connections are managed for inter-relay communication and look at some recent
proposals for more efficient relay communication. We show how some of these can be
abused to anonymously launch a low resource denial of service attack against target
relays. With this we then explore two potential solutions which provide more efficient
relay communication along with preventing certain denial of service attacks. Finally,
we introduce a circuit selection algorithm that can be used by a centralized authority
to dramatically increase network utilization. This algorithm is then adapted to work
in a decentralized manner allowing clients to make smarter decisions locally, increasing
performance while having a small impact on client anonymity.
Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Thesis Statement
  1.2 Outline

2 Background
  2.1 The Tor Anonymous Communication Network
  2.2 Circuits
  2.3 Streams
  2.4 Relay Architecture
  2.5 Inter-Relay Communication
  2.6 Flow Control

3 Related Work
  3.1 Increasing Performance
    3.1.1 Scheduling
    3.1.2 Selection
    3.1.3 Transport
    3.1.4 Incentives
  3.2 Security and Privacy
    3.2.1 Path Selection and Routing Attacks
    3.2.2 Side Channel and Congestion Attacks
  3.3 Experimentation

4 Experimental Setup
  4.1 Shadow and Network Topologies
  4.2 Performance Metrics
  4.3 Adversarial Model

5 How Low Can You Go: Balancing Performance with Anonymity in Tor
  5.1 Introduction
  5.2 Background
    5.2.1 Circuit Scheduling
    5.2.2 Flow Control
    5.2.3 Traffic Admission Control
    5.2.4 Circuit Clogging
    5.2.5 Fingerprinting
  5.3 Methodology
    5.3.1 Algorithmic-Specific Information Leakage
    5.3.2 Experimental Setup and Model
  5.4 Algorithmic Effects on Known Attacks
    5.4.1 Throughput as a Signal
    5.4.2 Latency as a Signal
  5.5 Induced Throttling via Flow Control
    5.5.1 Artificial Congestion
    5.5.2 Small Scale Experiment
    5.5.3 Smoothing Throughput
    5.5.4 Scoring Algorithm
    5.5.5 Large Scale Experiments
  5.6 Induced Throttling via Traffic Admission Control
    5.6.1 Connection Sybils
    5.6.2 Large Scale Experiments
    5.6.3 Search Extensions
  5.7 Analysis
  5.8 Conclusion

6 Managing Tor Connections for Inter-Relay Communication
  6.1 Introduction
  6.2 Background
  6.3 Socket Exhaustion Attacks
    6.3.1 Sockets in Tor
    6.3.2 Attack Strategies
    6.3.3 Effects of Socket Exhaustion
  6.4 IMUX
    6.4.1 Connection Management
    6.4.2 Connection Scheduler
    6.4.3 KIST: Kernel-Informed Socket Transport
  6.5 Evaluation
    6.5.1 Experimental Setup
    6.5.2 Implementations
    6.5.3 Connection Management
    6.5.4 Performance
  6.6 Replacing TCP
    6.6.1 Micro Transport Protocol
    6.6.2 xTCP
    6.6.3 Experimental Setup
    6.6.4 Performance
  6.7 Discussion
  6.8 Conclusion

7 Anarchy in Tor: Performance Cost of Decentralization
  7.1 Introduction
  7.2 Background
  7.3 Maximizing Network Usage
    7.3.1 Central Authority
    7.3.2 Offline Algorithm
    7.3.3 Circuit Sets
  7.4 ABRA for Circuit Selection
  7.5 Experimental Setup
    7.5.1 Shadow
    7.5.2 Implementations
    7.5.3 Consistency
  7.6 Performance
    7.6.1 Competitive Analysis
    7.6.2 ABRA Parameters
    7.6.3 ABRA Performance
  7.7 Privacy Analysis
    7.7.1 Information Leakage
    7.7.2 Colluding Relays Lying
    7.7.3 Denial of Service
  7.8 Conclusion

8 Future Work and Final Remarks
  8.1 Future Work
    8.1.1 Tor Transports
    8.1.2 Simulation Accuracy
  8.2 Final Remarks

References
List of Tables

6.1 Functions that xTCP needs to intercept and how xTCP must handle them.
7.1 The bottleneck clustering method's mean square error across varying bandwidth granularity and window parameters, with red values indicating scores less than the weighted random estimator.
List of Figures

2.1 Tor client encrypting a cell, sending it on the circuit, and relays peeling off their layer of encryption.
2.2 Internal Tor relay architecture.
5.1 Results for throughput attack with vanilla Tor compared to EWMA and N23 scheduling and flow control algorithms.
5.2 Results for throughput attack with vanilla Tor compared to different throttling algorithms.
5.3 Results of latency attack on EWMA and N23 algorithms.
5.4 Results of latency attack on the various throttling algorithms.
5.5 Effects of artificial throttling.
5.6 Raw and smoothed throughput of probe through guard during attack.
5.7 Throttle attack with BitTorrent sending a single stream over the circuit.
5.8 Degree of anonymity loss with throttling attacks with different numbers of BitTorrent streams over the Tor circuit.
5.9 Small scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. The sybils cause an easily recognizable drop in throughput on the target circuit.
5.10 Large scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. Due to token bucket sizes, the throughput signal may be missed if the phases of the attack are too short.
5.11 Probabilities of the victim client under attack scenarios for flow control algorithms.
5.12 Victim client probabilities, throttling algorithm attack scenarios.
6.1 Showing (a) anonymous socket exhaustion attack using client sybils, (b) throughput of relay when launching a socket exhaustion attack via circuit creation, with shaded region representing when the attack was being launched, and (c) memory consumption as a process opens sockets using libevent.
6.2 Showing (a) throughput over time, (b) linear regression correlating throughput to the number of open sockets, (c) kernel time to open new sockets over time, and (d) linear regression correlating kernel time to open a new socket to the number of open sockets.
6.3 The three different connection scheduling algorithms used in IMUX.
6.4 Number of open sockets at exit relay in PCTCP compared with IMUX with varying τ parameters.
6.5 Comparing socket exhaustion attack in vanilla Tor with Torchestra, PCTCP and IMUX.
6.6 Performance comparison of IMUX connection schedulers.
6.7 Performance comparison of IMUX to Torchestra [2] and PCTCP [3].
6.8 Performance comparison of IMUX to Torchestra [2] and PCTCP [3], all using the KIST performance enhancements [4].
6.9 Operation flows for vanilla Tor and using xTCP.
6.10 Performance comparison of using TCP connections in vanilla Tor and uTP for inter-relay communication.
7.1 Example of circuit bandwidth estimate algorithm. Each step the relay with the lowest bandwidth is selected, its bandwidth evenly distributed among any remaining circuits, with each relay on that circuit having their bandwidth decremented by the circuit bandwidth.
7.2 Example showing that adding an active circuit resulted in network usage dropping from 20 to 15 MBps.
7.3 Total network bandwidth when using the online and offline circuit selection algorithms while changing the set of available circuits, along with the relay bandwidth capacity of the best results for both algorithms.
7.4 Total network bandwidth utilization when using vanilla Tor, ABRA for circuit selection, the congestion and latency aware algorithms, and the central authority.
7.5 Download times and client bandwidth compared across circuit selection in vanilla Tor, using ABRA, and the centralized authority.
7.6 Weight difference on relays compared to the difference in number of bottleneck circuits on the relay.
7.7 (a) Fraction of circuits seen by colluding relays, and (b) fraction of circuits where colluding relays appear as guard and exit.
7.8 (a) Attacker's bandwidth and target's weight when an adversary is running the denial of service attack, (b) number of clients using the target for a circuit, and (c) number of downloads completed by clients using the target relay before and after the attack is started, shown both for vanilla Tor and using ABRA circuit selection.
Chapter 1
Introduction
In order to communicate over the internet, messages are encapsulated into packets
and stamped with a source and destination address. These addresses allow internet
service providers to determine the route that packets should be sent over. The route
consists of a series of third party autonomous systems which forward packets to the
final destination. With the source address the destination can now construct responses
that will be routed to the source, potentially through a completely different set of
autonomous systems. With so many external systems having access to packets being
routed, encryption is vital in hiding what is being said, concealing important data such
as banking information, passwords, and private communication.
While encryption can prevent third parties from reading the contents of the packets, they can still learn who is communicating with whom via the source and destination addresses. Concealing these addresses to hide the identities of the communicating parties is a much more difficult task, as the addresses are vital to ensure messages can be delivered. An early solution was the proxy, which acts as a middleman, forwarding messages between clients and servers. While any eavesdropper would only see the client or server communicating with the proxy and not with each other, the proxy itself knows that the client and server are communicating with each other. To
address these concerns a wide array of anonymous communication networks [5, 6, 7, 8]
were proposed, with the goal of preventing an adversary from simultaneously learning
the identity of both the client and server communicating with each other.
Of these Tor [8] emerged as one of the most widely used systems, currently providing anonymity to millions of users daily. Instead of tunneling connections through a single proxy, Tor uses onion routing [9, 10] to securely route connections through three proxies called relays. The Tor network consists of almost 7,000 volunteer relays contributing bandwidth ranging from 100 KBps to over 1 GBps. To tunnel connections through the Tor network clients create circuits through three relays, a guard, middle, and exit, which are chosen at random weighted by their bandwidth. By using three relays for the circuit, Tor ensures that no relay learns both the identity of the client and the server they are communicating with.
One major obstacle anonymous communication systems face is providing privacy in the face of a global passive adversary, defined as an adversary that can eavesdrop on large fractions of the network. Specifically, such an adversary can observe almost all connections entering the network from clients, and all connections exiting the network to servers. An adversary with this capability can use traffic analysis techniques [11] in order to link the identities of the endpoints who are communicating with each other through
the anonymity network. Some systems [5, 7] attempt to defend against a global adversary by adding random delays to messages, grouping and reordering messages, and
adding background traffic. These techniques result in networks providing high latency
performance to their clients. While some services, such as email, are not latency sensitive and can handle the additional latency added, other activities such as web browsing
are extremely sensitive to latency and would be unusable on these types of anonymity
systems.
To accommodate this Tor makes a conscious trade-off, providing low latency performance to clients by forwarding messages as soon as possible and sacrificing anonymity against a global adversary. However Tor is still able to protect against a local adversary, one who can only eavesdrop on a small fraction of the network and is unable to perform the end-to-end traffic analysis available to a global adversary. The major upside to providing low latency service to clients is that the network is much more usable for common activities such as web browsing, attracting a large number of clients to use the system. This in
turn increases the anonymity of clients, as they are part of a much larger anonymity set,
making it harder for adversaries to potentially link clients and servers communicating
through Tor.
1.1 Thesis Statement
This dissertation explores the following thesis:
There is an inherent trade-off between the performance of an anonymous communication
network and the privacy that it delivers to end users.
To this end we examine new performance enhancing mechanisms for Tor and look at how
they can be exploited to reduce client anonymity. We explore new side channels that leak
information and reduce client anonymity. By measuring the degree of anonymity we are
able to quantify the amount of information leaked and the reduction in anonymity. We
also investigate new avenues to launch denial of service attacks against Tor relays where
they become unavailable for use by clients. Alongside this we design and implement
new algorithms which explore the balance of maximizing performance and minimizing
attack surfaces open to adversaries.
1.2 Outline
Induced Throttling Attacks (Chapter 5)
In order to more efficiently allocate resources in Tor there have been proposals for improved flow control [12] and for identifying and throttling “loud” clients [13]. In
Chapter 5 we introduce the induced throttling attack which can be leveraged against
both flow control and throttling algorithms. An adversarial exit relay is able to control when data can and cannot be sent on a circuit, allowing for probe clients to detect
changes and identify the guard being used on the circuit. By identifying potential guards
being used on a circuit, the adversary can then use a latency attack [14] to reduce the
client anonymity set. We evaluate the effectiveness of the induced throttling attack and
compare results to known previous side channel attacks.
Managing Tor Connections (Chapter 6)
In Tor all communication between two relays is sent over a single TCP connection. This
has been known to be the source of some performance issues with cross circuit interference [15, 16] causing an unnecessary hit to throughput. One potential solution that has
been explored is increasing the number of inter-relay connections, either capping it at
two [2] or leaving it unbounded with each circuit getting a dedicated TCP connection
[3]. However, in modern operating systems the number of sockets a process is allowed to
have open at any one time is capped. In Chapter 6 we introduce the socket exhaustion attack, which launches a denial of service attack against Tor relays by consuming all available sockets on the Tor process. We then explore options that both increase performance and prevent certain socket exhaustion attacks. We first introduce
a generalized solution called IMUX which is able to dynamically manage open connections in an attempt to increase performance while limiting the effectiveness of some
socket exhaustion attacks. We also look at replacing TCP entirely as the protocol for
inter-relay communication. By using the micro transport protocol (uTP), we can have
user space protocol stacks for each circuit. This prevents cross circuit interference while
only requiring a single socket be maintained per relay we are communicating with.
Price of Anarchy (Chapter 7)
The price of anarchy [17] is a concept in game theory concerned with the degradation of performance that results from the selfish behavior and actions of a system's agents. In networking this is referred to as selfish routing [18, 19], where clients making decisions in a selfish and decentralized manner leave network resources underutilized. In
Chapter 7 we explore the price of anarchy in the Tor network by examining potential
performance gains that can be achieved by centralizing circuit selection. While such a
system would be wildly impractical to use on the live Tor network due to privacy concerns, it does allow us to examine the potential upper bound on performance and gives us insights
on how to improve circuit selection in Tor. With those insights we design a decentralized
version of the algorithm that lets relays gossip information to clients allowing them to
make more intelligent circuit selection decisions. With this system we then analyze any
potential drop in anonymity, both from a passive and active adversary.
Chapter 2
Background
2.1 The Tor Anonymous Communication Network
This dissertation focuses on the Tor anonymous communication network, primarily due
to its widespread deployment and relative popularity. Tor's current network capacity
is almost 200 GBps, providing anonymity to over 1.5 million clients daily. With such a
large user base it is more important than ever to ensure that clients receive high performance while guaranteeing the strongest privacy protections possible. In this chapter
we cover some of the low level details of Tor that will be relevant in later chapters.
Specifically we cover how clients interact and send data through Tor, the relay queueing architecture, and how relays communicate internally in the Tor network.
2.2 Circuits
When a client first joins the Tor network, the client connects to a directory authority
and downloads the network consensus along with relay microdescriptors, files containing
information for all the relays in the network, such as the relay's address, public keys,
exit policy, and measured bandwidth. Once the client has downloaded these files they
start building 10 circuits. When building a circuit the client selects relays at random
weighted by the relays' bandwidth, with extra consideration given when choosing an
exit and guard relay. For the exit relay, the client only selects amongst relays that are
configured to serve as an exit, along with also considering which ports the relay has
chosen to accept connections on (e.g. 80 for HTTP, 443 for HTTPS, etc). Since guard
relays know the identity of the client, clients want to reduce the chances of selecting an
adversarial guard relay which could attempt to deanonymize them [20, 21]. To this end
clients maintain a guard list of pre-selected relays that are always chosen from when
building circuits. When initially picking relays to add to this set, only relays that have
been assigned the GUARD flag by the authorities are considered. These are relays that have a large enough bandwidth and have been available for a minimum amount of time.
Once a guard, middle, and exit relay have been chosen the client can start building
the circuit. To begin building the circuit the client sends to the guard an EXTEND cell
containing key negotiation information. The guard responds with an EXTENDED cell
finishing the key negotiation and confirming the circuit was successfully extended. This
process is repeated with the middle and exit relay, with the EXTEND and EXTENDED cells forwarded through the circuit being built (i.e. the EXTEND cell sent to the middle relay is forwarded through the guard). After the circuit building process is completed the
client has a shared symmetric key with the guard, middle, and exit used for delivering
data along the circuit.
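As a rough illustration of this telescoping construction, the toy sketch below builds a circuit one hop at a time; key negotiation is reduced to exchanging random bytes, and every class and function name is invented for this example rather than taken from the Tor codebase.

```python
import os

class Relay:
    """Toy relay: 'key negotiation' is reduced to echoing random bytes."""
    def __init__(self, name):
        self.name = name

    def handle_extend(self, extend_cell):
        # A real relay would complete a Diffie-Hellman/ntor handshake and reply
        # with an EXTENDED cell; here we simply confirm the client's secret.
        return {"relay": self.name, "shared_key": extend_cell["client_part"]}

def build_circuit(relays):
    """Telescoping construction: each EXTEND is (conceptually) forwarded through
    the hops that already exist, and the client ends up with one shared
    symmetric key per hop."""
    circuit = []
    for relay in relays:
        extend_cell = {"client_part": os.urandom(16)}  # client half of the exchange
        # The first hop gets the EXTEND directly; later hops would receive it
        # relayed through the partial circuit (the relaying is elided in this toy).
        extended_cell = relay.handle_extend(extend_cell)
        circuit.append((relay.name, extended_cell["shared_key"]))
    return circuit

if __name__ == "__main__":
    for name, key in build_circuit([Relay("guard"), Relay("middle"), Relay("exit")]):
        print(name, key.hex())
```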
2.3 Streams
To have an application send data through the Tor network, the user configures the
application to use the SOCKS proxy opened by the Tor client. For each TCP connection
created by the application, the Tor client creates a corresponding stream. When the
first stream is created, one of the active circuits is selected and the stream is attached to
the circuit. This circuit is then used to multiplex all streams created over the next 10
minutes. After the 10 minute window a new circuit is selected for all future streams and
the old circuit is destroyed. All data sent over a stream is packaged into 512-byte cells
which are encrypted once each using secret symmetric keys the client shares with each
relay in the circuit. Cells are sent over the circuit with each relay “peeling” off a layer
of encryption and forwarding the cell to the next relay in the circuit. The exit relay
then forwards the client's data to the final destination specified by the client. Figure 2.1 shows an example of this process.

Figure 2.1: Tor client encrypting a cell, sending it on the circuit, and relays peeling off their layer of encryption.

When the server wishes to send data back to the client, the reverse of this process occurs. The exit packages data from the server into
cells which get sent along the circuit, with each relay adding a layer of encryption.
The client can then decrypt the cell three times with each shared secret key to recover
the original data sent by the server. Note that by default the raw traffic sent by the
client and server is not encrypted, meaning the exit relay has access to the data. To
prevent any eavesdroppers from viewing their communication the client and server can
establish a secure encrypted connection, such as by using TLS, to protect the underlying
communication.
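To make the layered encryption concrete, here is a small sketch in which Fernet, from the third-party Python cryptography package, stands in for Tor's actual relay cipher; only the structure of encrypting once per hop and having each hop peel exactly one layer is the point, not the cipher itself.

```python
# Toy illustration of the layered ("onion") encryption applied to a cell sent
# from the client toward the exit. Fernet is a stand-in for Tor's real cipher.
from cryptography.fernet import Fernet

hop_keys = {name: Fernet.generate_key() for name in ("guard", "middle", "exit")}

def client_wrap(cell: bytes) -> bytes:
    # The client encrypts once per hop, with the innermost layer for the exit.
    for name in ("exit", "middle", "guard"):
        cell = Fernet(hop_keys[name]).encrypt(cell)
    return cell

def relay_peel(cell: bytes, name: str) -> bytes:
    # Each relay removes exactly one layer with the key it shares with the client.
    return Fernet(hop_keys[name]).decrypt(cell)

cell = client_wrap(b"GET / HTTP/1.1")
for name in ("guard", "middle", "exit"):
    cell = relay_peel(cell, name)
print(cell)  # b'GET / HTTP/1.1': the plaintext only emerges at the exit
```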
2.4 Relay Architecture

Figure 2.2: Internal Tor relay architecture.
Figure 2.2 shows the internal architecture of cell processing in Tor relays. Packets are
read off the TCP connection and stored on kernel TCP in buffers. Libevent [22] notifies
Tor when data can be read, at which point packets are processed into TLS records,
decrypted, and copied to connection “in buffer” internal to Tor. At this point Tor can
process the unencrypted data as cells, moving the cells to their respective circuit queues
based on their circuit ID. Tor then iterates through the circuit queues, either in roundrobin fashion or using an EWMA circuit scheduler [23], pulling cells from the queue
and pushing them into their respective connection out buffer corresponding to the next
relay in the circuit. At this point the cell is either decrypted if the cell is travelling
upstream (towards the exit) or encrypted if the cell is travelling downstream (towards
the client). Once the newly encrypted or decrypted cell is pushed onto the out buffer,
Tor waits until libevent notifies the process that the TCP buffer can be written to, at
which point the data is flushed to the kernel TCP out buffer and eventually sent to the
next relay.
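The following sketch mimics only the circuit-queue stage of this pipeline, with a simple round-robin scheduler standing in for Tor's round-robin or EWMA schedulers; the data structures and function names are illustrative and are not Tor's.

```python
# Minimal sketch of the circuit-queue stage described above: processed cells
# are sorted into per-circuit queues, and a scheduler drains them toward the
# out buffer of the connection to the next relay. Illustrative only.
from collections import defaultdict, deque
from itertools import cycle

circuit_queues = defaultdict(deque)  # circuit ID -> queued cells

def enqueue_cell(circuit_id, cell):
    """Called once a cell has been read and processed from a connection in buffer."""
    circuit_queues[circuit_id].append(cell)

def flush_round_robin(out_buffer, budget):
    """Move up to `budget` cells into the outgoing connection buffer."""
    active = [cid for cid, q in circuit_queues.items() if q]
    for cid in cycle(active):
        if budget == 0 or not any(circuit_queues[c] for c in active):
            break
        if circuit_queues[cid]:
            out_buffer.append((cid, circuit_queues[cid].popleft()))
            budget -= 1

enqueue_cell(1, "cell-a"); enqueue_cell(1, "cell-b"); enqueue_cell(2, "cell-c")
out = []
flush_round_robin(out, budget=3)
print(out)  # cells from circuits 1 and 2 are interleaved
```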
2.5 Inter-Relay Communication
All communication between pairwise relays is done over a single TCP connection encrypted with TLS. Similarly to how streams are multiplexed over a single circuit, circuits
across two relays are multiplexed over the single connection. Using TCP guarantees reliable and in-order delivery, in addition to containing connection level flow and congestion
control. However since all circuits share the same TCP state there can be cross-circuit
interference resulting in degraded performance [15]. For example if one circuit causes
congestion on the connection, the sending window is shrunk reducing throughput for
every circuit using the TCP connection.
2.6 Flow Control
Tor uses an end-to-end flow control algorithm on both the stream and circuit level. The
algorithm operates on traffic flowing in both directions, with each edge of the circuit
(i.e. the client and exit relay) keeping track of a window for each circuit and stream.
For each cell sent over an ingress edge the corresponding circuit and stream window is
decremented by one. Any time a window reaches 0 the ingress edge stops transmitting
all data on either the stream or circuit. For every n cells the egress edge receives
it delivers a circuit or stream SENDME cell to the ingress edge. Once the SENDME cell is
received by the ingress edge it increments its circuit or stream window by n. By default,
circuit and stream windows are initialized to 1000 and 500 respectively, with SENDME
cells being sent for every n = 100 and n = 50 cells received on the circuit and stream.
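A minimal sketch of this windowing behavior, using the default circuit values above (an initial window of 1000 and a SENDME for every 100 cells), is shown below; the class is purely illustrative and ignores streams and the reverse direction.

```python
# Minimal sketch of Tor's end-to-end window flow control on a single circuit
# edge (illustrative, not Tor's implementation). The sender decrements its
# window per cell and stalls at zero; the receiver returns a SENDME every
# SENDME_INTERVAL cells, which restores that many slots.
CIRC_WINDOW_START = 1000
SENDME_INTERVAL = 100

class CircuitEdge:
    def __init__(self):
        self.send_window = CIRC_WINDOW_START
        self.cells_received = 0

    def can_send(self):
        return self.send_window > 0

    def send_cell(self):
        if not self.can_send():
            raise RuntimeError("window exhausted: must wait for a SENDME")
        self.send_window -= 1

    def receive_cell(self):
        # Egress side: after every SENDME_INTERVAL cells, acknowledge with a SENDME.
        self.cells_received += 1
        return self.cells_received % SENDME_INTERVAL == 0  # True -> send a SENDME back

    def receive_sendme(self):
        self.send_window += SENDME_INTERVAL

ingress, egress = CircuitEdge(), CircuitEdge()
for _ in range(1000):
    ingress.send_cell()
    if egress.receive_cell():
        ingress.receive_sendme()
print(ingress.can_send())  # True: the SENDMEs kept the window open
```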
Chapter 3
Related Work
A large amount of research has been done on Tor, mainly focusing on either improving
performance or reducing client anonymity. This chapter surveys a wide array of relevant
research on Tor. Not only does this help place the work in this dissertation in a better context, but much of it is integral to the remaining work, as we directly examine how some
proposals affect client anonymity.
3.1 Increasing Performance
One of the major keys to increasing anonymity for Tor users is ensuring a large anonymity
set, that is, a large active user base. To accomplish this Tor needs to offer low latency
to clients; bad performance in the form of slow web browsing can lead to fewer users
using the system overall. To this end, there has been a large set of research looking at
ways to increase performance in Tor.
3.1.1 Scheduling
With limited available resources, relays are constantly making scheduling decisions on
what to send and how much should be processed. These decisions happen on all levels,
between streams, circuits, and connections. One major issue is that a small percentage of users consumes a large majority of all network resources [24, 25]. These bulk clients, usually using BitTorrent to download large files, crowd out available bandwidth, which can
result in web clients experiencing high latency. Initially when a connection was ready
to write, Tor used a round-robin circuit scheduler to iterate through the circuits when
flushing circuit queues. This resulted in many cells from bulk circuits being scheduled ahead of cells from latency sensitive web circuits. To fix this Tang and Goldberg [23] introduced EWMA circuit scheduling, which prioritizes quiet circuits that have not written recently over loud circuits that are constantly transmitting cells. Building on this, AlSabah et al.
[26] introduce an algorithm that attempts to classify circuits and use different Quality
of Service (QoS) requirements for scheduling circuits. Further work by Jansen et al. [27]
showed that not only does circuit scheduling need to be done in global context of all
circuits across all connections, but that if Tor writes data too fast to connections, cells
spend most of their time waiting in the TCP kernel buffer, diminishing the control Tor has in prioritizing circuits. To fix this a rate limiting algorithm was developed, attempting
to keep TCP buffers full enough to prevent starvation while allowing Tor to hold onto
more cells for improved circuit prioritization.
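As a rough sketch of the EWMA idea (not the actual implementation from [23] or from Tor), each circuit can keep an exponentially decayed count of recently sent cells and the scheduler can drain the circuit with the lowest count first; the half-life constant below is an arbitrary illustrative choice.

```python
# Rough sketch of EWMA-style circuit prioritization: the quietest circuit,
# measured by an exponentially decayed count of sent cells, is scheduled first.
import time

DECAY_HALF_LIFE = 30.0  # seconds (hypothetical tuning value)

class Circuit:
    def __init__(self, name):
        self.name = name
        self.ewma = 0.0
        self.last_update = time.monotonic()

    def _decay(self):
        now = time.monotonic()
        self.ewma *= 0.5 ** ((now - self.last_update) / DECAY_HALF_LIFE)
        self.last_update = now

    def record_cell_sent(self):
        self._decay()
        self.ewma += 1.0

    def priority(self):
        self._decay()
        return self.ewma  # lower value means a quieter circuit, scheduled first

def pick_next(circuits):
    return min(circuits, key=lambda c: c.priority())

web, bulk = Circuit("web"), Circuit("bulk")
for _ in range(500):
    bulk.record_cell_sent()  # the bulk circuit has been loud recently
web.record_cell_sent()
print(pick_next([web, bulk]).name)  # "web": the quiet circuit is served first
```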
While these schedulers attempt to prioritize circuits for scheduling, bulk circuits
can still consume a large enough amount of bandwidth and cause congestion on relays.
Jansen et al. [13] propose methods for allowing guards to throttle the loudest clients,
reducing congestion and increasing throughput for the non-throttled clients. In another
attempt to reduce congestion, AlSabah et al. [12] developed an improved congestion
control algorithm, using an ATM-style link-based algorithm instead of the end-to-end
window algorithm. This allows faster response to congestion as the guard and middle
relay also react when congestion occurs. Moore, Wacek, and Sherr [28] go further and
argue that rate limiting all traffic entering Tor will increase the capacity of the entire
Tor network. Since most schedulers were developed in isolation, many do not work well
when combined in aggregate. Tschorsch and Scheuermann [29] develop a new scheduler
based on a max-min fairness model which achieves a more fair allocation of resources.
3.1.2 Selection
When creating circuits, Tor selects from relays at random weighted by their bandwidth.
Since trusting relays to accurately self report their available bandwidth has major security issues [30], Snader and Borisov have done work looking at using peer-to-peer
bandwidth measuring algorithms [31]. In addition a lot of work has been done on improving relay and circuit selections made by clients [32, 33, 34, 35, 36, 37, 38]. Snader
and Borisov introduced an improved relay and path selection algorithm [39] that can
balance anonymity and performance by considering active measurements of relay performance. Akhoondi et al. developed LASTor [34] that uses network coordinates to
estimate latency between relays to create low-latency circuits to be used by clients.
By grouping nearby relays and selecting randomly from within the group, LASTor can create low-latency circuits while preventing an adversary from simply creating a large number of closely located relays and then being selected for every relay along a single circuit created by a client. Instead of using latency, Wang et al. create a congestion-aware
path selection algorithm [35]. By using passive and active round trip measurements
on circuits, they can detect when relays become congested and switch to non-congested
circuits. In a similar fashion, AlSabah et al. propose traffic splitting [40] across multiple
circuits. By creating multiple circuits using the same exit relay, clients can load balance
across circuits, dynamically reducing the amount of data sent on congested circuits.
3.1.3 Transport
One of the noted performance issues in Tor is cross-circuit interference caused by the use
of a single TCP connection between relays [41]. For example, if circuits c1 and c2 share
a TCP connection and a cell from c1 gets dropped, all cells from c2 would need to buffer
and wait until retransmission mechanisms from TCP retransmit the cell from c1 . This
is because TCP guarantees in-order delivery, which in this case is not necessary for Tor;
the cells from circuit c2 could be sent while only c1 waits for successful retransmission.
To fix some of these issues Viecco proposes UDP-OR [42], a new protocol design that
uses UDP connections between relays. With this there is end-to-end reliability and
congestion control handled by the edges, preventing cross-circuit interference internally
in the network. Reardon [16] attempts to address this by implementing TCP-over-DTLS
allowing a UDP connection to be dedicated to each circuit. Using a user space TCP
stack, reliable in-order delivery is handled at the circuit level on a hop-by-hop basis. Since these user space implementations tend to not be very mature, AlSabah [3] uses IPSec
to enable Tor to dedicate a single kernel level TCP connection to each circuit. With
IPSec an adversary cannot directly determine the exact number of circuits between
two relays while still delivering the performance benefits of dedicating a TCP stack per
circuit. Since most of the cross circuit interference happens due to congestion on bulk
circuits interfering with latency sensitive web circuits, Gopal and Heninger [2] propose having just two connections between relays, one for web clients and one for bulk
clients. Finally, in an attempt to work around the specific issue of unnecessary blocking,
Nowlan et al. introduce uTCP and uTLS [43, 44], which allow for out-of-order delivery in Tor so it can process cells from circuits that are not blocked waiting on lost packets to be retransmitted.
3.1.4 Incentives
While the previous lines of research involved improving efficiency in how Tor handles traffic, another set looks at potential ways to incentivize more relays to join and contribute bandwidth to the Tor network. PAR [45] and XPay [46] offer monetary compensation to incentivize volunteers to run relays and add bandwidth to the Tor network.
Ngan, Dingledine, and Wallach proposed Gold Star [47], which prioritizes traffic from
relays providing high quality service in the Tor network. Using an authority to monitor
and assign gold stars to high performing relays, traffic from these relays would then always be prioritized ahead of traffic from relays without a gold star. Jansen, Hopper, and Kim developed
BRAIDS [48] where clients can obtain tickets and give them to relays in order to have
their traffic prioritized. The relays themselves can accrue tickets which they can spend to have their own traffic prioritized. Both Gold Star and BRAIDS have
scalability issues due to their reliance on centralized trusted authorities. To address
this Jansen, Johnson, and Syverson built LIRA [49], a light-weight and decentralized
incentive solution allowing for the same kind of traffic prioritization based on tokens
obtained by contributing bandwidth to the Tor network.
3.2 Security and Privacy
While Tor’s adversarial model excludes a global adversary, researchers have discovered
potential ways a local adversary can still reduce the anonymity of clients in the network.
In this section we will cover some of these attacks, including path selection, routing,
side channel, and congestion based attacks.
3.2.1 Path Selection and Routing Attacks
Plenty of research has been done showing that a global adversary, eavesdropping on the
client to guard connection and exit to server connection, can confirm that the client
and server are actually communicating with each other [50, 51, 52]. Given this, an adversary wants to increase the chances that they appear as both the guard and exit on a circuit, allowing them to run these end-to-end confirmation attacks. Ideally an
adversary providing a fraction f of all network bandwidth will only be the guard and
exit on a fraction f² of circuits. The goal of an adversary is to try and increase the fraction of circuits seen to some p > f².
Path selection attacks attempt to achieve this by exploiting how relays and circuits are
chosen by clients. Borisov et al. [53] show how an exit relay can simply not forward
cells on circuits that use a guard they do not control. This forces the client to choose a
different circuit, increasing the chances that clients will select circuits with an adversarial
guard and exit relay. Similarly, back when Tor relied on self reported relay bandwidth
values Bauer et al. [30] showed that relays could lie about their bandwidth, increasing
their chances of being selected as both the guard and exit in circuits without actually
having to dedicate the bandwidth resources. Now Tor actively measures the bandwidth
capacity of relays to calculate the relays weight, which is used by clients when selecting
relays for circuits. Another tactic an adversary can take is to run denial of service attacks
against other relays, preventing them from being selected. This should be done without the adversary having to use much bandwidth, otherwise the same could be achieved by simply providing that bandwidth to the Tor network, increasing the adversarial relays' weight. To this end Jansen et al. [54] introduce the sniper attack, capable of
launching a low resource denial of service attack against even high bandwidth relays.
By forcing a relay’s sending window down to 0, the relay will have to continually buffer
cells eventually causing the relay to exhaust its available memory.
While relays can attempt to increase their chances of being selected as the guard and
exit relay, adversarial autonomous systems (AS) can also attempt to eavesdrop on both
edges of the connection to run a traffic analysis. A lot of work has been done examining the ability of existing ASes to passively launch these kinds of attacks against anonymous communication systems [55, 21, 56]. More recently Sun et
al. introduced RAPTOR [57], an active AS level routing attack. Here an adversarial
AS can use BGP hijacking techniques to increase the amount of the Tor network they
can globally view, allowing for targeted end-to-end confirmation attacks.
3.2.2 Side Channel and Congestion Attacks
For side channel attacks we have an adversary at a single point in the circuit (i.e. exit relay, malicious server) that is attempting to learn information that can potentially identify the client. This can be done by making passive measurements or actively
interfering in the network to increase the signal on the side channel. Hopper et al. [14]
show how to launch two attacks using latency as a side channel. First an adversarial exit
relay can attempt to see if two separate connections from an exit relay are actually from
the same circuit. Second they demonstrate that an exit relay that knows the identity
of the guard being used in the circuit can estimate the latency between the client and
guard. With this they can use network coordinates to significantly reduce the anonymity
set of the client using the circuit. Mittal et al. [58] demonstrate how circuit throughput
can be used by an adversary to identify bottleneck relays. By probing potential relays,
an adversarial exit relay can correlate throughput on a specific circuit to try and identify
the guard being used. Additionally, separate servers with connections from a Tor exit
can attempt to identify if they originate from the same circuit and client.
While the previous side channel attacks rely on passive adversaries that simply make
network measurements, congestion attacks artificially manipulate network conditions
that can be observed on a side channel by an adversary. Murdoch and Danezis [59] first
introduce the idea where a server can periodically push large amounts of data down a
circuit causing congestion. This can then be detected by the adversary to learn exactly
which relays are used for a circuit. Evans et al. [60] expand on this, introducing a
practical congestion attack that can be deployed on larger networks. An adversarial
exit relay injects Javascript code onto a circuit allowing it to monitor round trip times
on the circuit. It then selects a relay it believes might be the guard and creates a long
circuit that loops multiple times back through the selected relay. The cells sent on the long circuit significantly increase the congestion on the relay. If the relay
appears on the circuit n times, for every 1 cell sent by the adversary the relay will have
to process it n times. When the adversary selects the actual guard being used, the
recorded round trip times on the target circuit will dramatically increase, notifying the
adversary that it is in fact the guard being used.
3.3 Experimentation
One of the key challenges to conducting research on Tor is running experiments. Due to
the sensitive nature of providing privacy to its users, testing on the live network is often problematic, especially when the stated goal is to attack client anonymity.
Early research would use PlanetLab [61] to set up a test network for experimentation. One problem with this approach is that the network properties (e.g. latency, bandwidth, packet loss) could be drastically different from those of the actual Tor network. Additionally,
as the Tor network grew it became difficult to scale up experiments, as Planetlab only
has a few hundred nodes, not all of which are reliable and stable. Some research can
be done by simulating various aspects of Tor allowing for large scale experimentation.
Johnson et al. [62] evaluated the security of relay and path selection and built a path
simulator that only involved the relay selection algorithms from Tor. Tschorsch and
Scheuermann [63] created a new transport algorithm for inter-relay communication and
used the network simulator ns3 [64] to compare against different transport designs.
The major hurdle all these methods have is to simultaneously achieve accuracy and
scalability. To achieve both researchers created systems that would run actual Tor code
but emulate or simulate the networking aspects. Bauer et al. designed ExperimenTor
[65] which uses ModelNet [66] to create a virtual topology which Tor processes can
communicate through. Jansen and Hopper created Shadow [67] which simulates the
network layer while running the actual Tor software in a single process. These methods
allow for precise control over the network environment, allowing for accurate modeling
of the Tor network [68] which can be incorporated into the experimental test beds.
Additionally these systems allow for experimental setups with thousands of clients,
even capable of simulating the entire Tor network [27]. This is all done in a way that
produces accurate and reproducible results.
Chapter 4
Experimental Setup
As discussed in Section 3.3, running experiments on Tor is a complicated process with trade-offs between accuracy and scalability. Out of the potential tools discussed, Shadow
[67] provides the highest accuracy while still allowing topologies consisting of thousands
of nodes. In this chapter we discuss details of Shadow and the network topologies
used, along with the metrics we consider when analyzing effects on anonymity and
performance.
4.1 Shadow and Network Topologies
Shadow [67, 69] is a discrete-event simulator that runs actual Tor code in a simulated
network environment. Not only does this allow us to run large scale experiments, but we
can incorporate any performance or attack code directly into Tor to test the effectiveness
of any implementation. With fine grained control over all aspects of the experiment we
can use carefully built network topologies reflecting the actual Tor network [70]. In
addition we are able to run deterministic and reproducible experiments, allowing us
to isolate the effects of specific changes (e.g. scheduling algorithms). These properties
are difficult to achieve simultaneously, particularly when using PlanetLab or the live
Tor network for experiments. The latter is not even an option when exploring attacks
on client anonymity, as there are significant ethical concerns about interacting with and possibly
collecting data on real live Tor users. ExperimenTor does have both these properties,
and is arguably more accurate as it uses a virtual environment running an actual Linux
kernel. However ExperimenTor experiments must run in real time leading to issues of
scalability. Using Shadow allows us to run experiments 10 times as large, and even
supports simulating the full Tor network [27] given enough hardware resources.
To make sure experiments are as accurate as possible, extensive work [70, 27] has
been done on producing realistic network models and topologies. The network topology is represented by a mesh network, with vertices representing countries, Canadian
provinces, and American states. Each vertex has an associated up and down bandwidth, and edges between vertices represent network links with corresponding latency
and packet loss values. Since all Tor relay information is public, when creating an experiment with r relays, Shadow carefully samples from the relay set in order to produce
a sub-sampling reflective of the larger Tor network. Relays are then assigned to a vertex
based on their known country code. The client model for Shadow is based on high level
statistics collected by Tor [71] along with work done examining what clients use Tor
for [72, 25]. From this we know the distribution of countries Tor clients originate from,
along with the fact that roughly 90% of clients use Tor for web browsing, while the
remaining 10% are performing bulk downloads, typically using BitTorrent.
Using this information Shadow comes with a fully built network topology and experiments configured with different numbers of relays and clients. Since we are interested
in maximizing accuracy we use the large scale experiment configuration with either 500
relays, 1350 web clients, 150 bulk clients, and 300 performance client or 400 relays, 2375
web clients, 125 bulk clients, and 225 performance clients. Web clients download a 320
KiB file and then pause randomly between 1 and 60,000 milliseconds, drawing from
the UNC think-time distribution [73]. Bulk clients download a 5 MiB file continuously,
with no pauses after a completed download. The performance clients perform a single
download every 10 minutes, downloading either a 50 KiB, 1 MiB, or 5 MiB file. The
experiment is configured to run for 60 virtual minutes, with relays starting between 5
to 15 virtual minutes into the experiment, and clients started between 20 and 30 virtual
minutes. This allows for all clients to properly bootstrap and produces realistic churn
seen in the live network.
4.2 Performance Metrics
A common use for Shadow is to test new or modified performance enhancing algorithms
to see what, if any, effect on performance they might have. Two experiments are set up using the exact same configuration, except one is configured to use vanilla Tor with no
modifications and the other with the new algorithm enabled. After these experiments
have completed we can then directly examine the performance achieved with the new
algorithm compared with vanilla Tor. For this we are interested in the following metrics.
Time To First Byte: The time to first byte measures the time between when a
client makes a request to download a file and when it receives the very first byte from
the server. The lower the time to first byte the better, as this indicates lower latency
and greater throughput when performing downloads, which is especially important for
latency sensitive applications such as web browsing.
Time To Last Byte: The time to last byte measures the time between when a client
makes a request to download a file and when it receives the very last byte from the
server. While time to first byte applies equally across all clients, for time to last byte we
separate it based on web and bulk clients. While all client performance is important,
we are generally more concerned with improving web client download times, as bulk
clients are not particularly sensitive to download times.
Total Network Bandwidth: In fixed intervals Shadow records the bandwidth consumed by every node in the network. The total network bandwidth measures the sum
of all bandwidth consumption across all nodes. Increasing network bandwidth usage in
Tor is important as it indicates clients are better utilizing all available resources.
In general all metrics need to be considered in aggregate, as there are situations
where something might be broken (e.g. clients not working) that are only reflected in a
single metric, with all other metrics actually improving. For example, a faster time to
first byte could be caused by having a large number of downloads time out. This would
free network resources for other clients that are able to actually use the network, where
they experience less congestion. However in this case we would also see a large drop in
total network bandwidth indicating a problem. Another common phenomenon is that we will see an increase in total network bandwidth and an improvement in bulk client download times; however, this will come at the expense of web clients. With prioritized bulk clients completing downloads faster, we can achieve higher overall network bandwidth, but web client download times, which are generally more important, suffer.
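For concreteness, the sketch below shows one way these metrics could be computed from per-download records; the record format is a hypothetical stand-in for Shadow's actual log output, and all numbers are invented.

```python
# Sketch of how the Chapter 4 metrics could be computed from per-download
# records (hypothetical format: client type, request time, first byte time,
# last byte time, and bytes downloaded).
from statistics import median

downloads = [
    ("web",  0.0, 0.8,  3.2,  320 * 1024),
    ("bulk", 0.0, 1.1, 42.0, 5 * 1024 * 1024),
    ("web",  5.0, 5.6,  8.1,  320 * 1024),
]

def time_to_first_byte(records):
    return median(first - req for _, req, first, _, _ in records)

def time_to_last_byte(records, client_type):
    return median(last - req for kind, req, _, last, _ in records
                  if kind == client_type)

def total_bytes(records):
    return sum(size for *_, size in records)

print("TTFB (all clients):", time_to_first_byte(downloads))
print("TTLB (web):", time_to_last_byte(downloads, "web"))
print("total bytes transferred:", total_bytes(downloads))
```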
4.3 Adversarial Model
The first adversarial model we consider is an adversarial exit relay which observes a
circuit connecting to a server and wishes to identify the client associated with the
circuit. The adversary can combine attacks that attempt to identify the guard being used on the circuit [59, 60, 58] with the latency attack [14] which can reduce the
anonymity set of clients assuming knowledge of the guard. More formally, the adversary
has prior knowledge of the set of potential clients C and all relays R that can serve as
a guard. The victim V ∈ C is using a circuit with guard G ∈ R and using the adversary
as the exit relay. We assume a priori that the probability that any client ci ∈ C is the
23
victim is P (V = ci ) =
1
|C| .
The adversary runs an attack attempting to identify the
guard being used in the circuit, producing an updated guard probability distribution
P (G = rj ). Then for each rj ∈ R the adversary runs the latency attack assuming the
guard is rj . This produces a conditional probability distribution P (V = ci |G = rj ) for
each client assuming the relay rj is the guard. We can then compute the final client
probability as:
P(V = c_i) = \sum_{r_j \in R} P(V = c_i \mid G = r_j) \cdot P(G = r_j)
To measure the amount of information leaked we can use entropy, which quantifies the uncertainty an adversary has about a guard or client, to compute the degree of anonymity
loss [74, 75]. These metrics operate over a reduced anonymity set, which can be applied
to both guards and clients. Given an updated probability distribution we use a threshold
τ to determine what does and does not get included in the reduced anonymity set. By
considering all possible values of τ we can determine the potential maximum degree of
anonymity that is lost. So given an initial set N and probability distribution P(n_i) we
have the reduced set S ⊆ N built as:
S = {n_i ∈ N | P(n_i) > τ}
Then for a target T ∈ N the entropy of the reduced anonymity set S is then defined as

H(S) = \begin{cases} \log_2\left(\frac{|N|}{|S|}\right) & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases}    (4.1)
When calculating entropy for clients we have N = C and T = V and for guards N = R
and T = G. Then with the maximum total entropy defined as HM = log2 (|C|) we
can quantify the actual information leakage as the degree of anonymity loss defined as
1 − H(S)/H_M. By varying the threshold τ we can then determine the maximum degree of anonymity loss.
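As a concrete reference for these definitions, the short Python sketch below mirrors the reduced-set construction and Equation (4.1) directly. It is our own illustration: the function name and the dictionary-based data layout are assumptions, not code from any tool used in this work.

    import math

    def degree_of_anonymity_loss(probs, target, tau):
        """Mirror of the definitions above: build the reduced set S, compute H(S)
        per Equation (4.1), and return 1 - H(S)/H_M."""
        s = {n for n, p in probs.items() if p > tau}   # S = {n_i in N | P(n_i) > tau}
        if target in s:
            h_s = math.log2(len(probs) / len(s))       # log2(|N| / |S|)
        else:
            h_s = 0.0
        h_m = math.log2(len(probs))                    # H_M over the full candidate set
        return 1.0 - h_s / h_m

    # Sweeping tau over the observed probability values yields the maximum degree
    # of anonymity loss discussed above, e.g.:
    # max(degree_of_anonymity_loss(probs, target, t) for t in set(probs.values()))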
The second adversarial model we consider is an adversary that is attempting to
increase the fraction of the network they can observe without increasing the amount
of bandwidth they need to provide. Specifically we are interested in the fraction of
streams that use a circuit where the adversary is the guard and exit on the circuit. In
this case they can run an end-to-end confirmation attack to link the client and server
communication with each other. In vanilla Tor if the fraction of bandwidth controlled
by an adversary is f , the fraction of circuits selected that they are on is f and the
fraction of circuits they will be the guard and exit on is f^2. Outside of providing more
bandwidth, there are two general ways an adversary can artificially inflate the fraction of
circuits they see and compromise. First they can launch denial of service attacks against
other target relays. In this case the fraction of bandwidth controlled by the adversary
increases because the total Tor network bandwidth decreases. The other method is
if an adversary can take advantage of or trick some mechanism in Tor that increases
the chances of them being selected relative to the fraction of bandwidth they control.
For example before Tor actively measured relay bandwidth and relied on self-reported
values, relays could lie about the amount of bandwidth they provided [30] to increase
the chances of being selected. Even now, if an adversarial relay knows they are being
measured for bandwidth, they can prioritize the measured traffic to achieve an inflated
bandwidth measurement, and then simply throttle all other Tor connections, receiving a
larger bandwidth weight compared to the actual bandwidth they are providing. In these
situations we are interested in the fraction p > f^2 of circuits they can compromise while providing a fraction f of all bandwidth.
Chapter 5
How Low Can You Go: Balancing
Performance with Anonymity in
Tor
5.1
Introduction
Recall that one of the key design choices of the Tor system is the goal of building a
large anonymity set by providing high performance to as many users as possible, while
sacrificing some level of protection from large-scale (global) adversaries. For example,
Tor does not attempt to protect against an end-to-end correlation attack that certain
mix systems try to prevent [5, 76, 7], as they introduce large costs in increased latency
making such systems difficult to use. This performance focus has led researchers to
investigate a variety of methods to improve performance, such as using different circuit
scheduling algorithms [77], better flow control [12], and throttling high bandwidth clients
[1, 78, 79]. Several of these mechanisms have been or will be incorporated into the Tor
software as a result of this research.
One overlooked side effect of these improvements, however, is that in some cases
improving the performance of users can also improve the performance of attacks against
the Tor network. For example, several attacks have been proposed [80, 81, 60, 82] that
rely on measuring the latency or throughput of a Tor circuit to draw inferences about
its source and destination. If an algorithm improves the throughput or responsiveness
of Tor circuits this can improve the accuracy of the measurements used by these attacks
either directly or by averaging a larger sample. Thus it is important to analyze how
these modifications to Tor interact with attacks based on network measurements.
In this section we investigate this interaction. We start by introducing a new class
of attacks based on network measurement, which we call induced throttling attacks. In
these attacks, an adversarial exit node exploits congestion or traffic admission control
algorithms to artificially throttle and unthrottle a chosen circuit without directly sending data through the circuit or relay. This leads to a recognizable pattern in other
circuits sharing resources with the target circuit, leaking information about the connection between the client and entry guard. We show that there are highly effective
induced throttling attacks against most of the proposed scheduling, flow control, and
admission control modifications to Tor, allowing an adversary to uniquely identify entry
guards in many cases.
We also examine the effect these algorithms have on previous attacks [80, 81] to see if
the improvement in performance, and therefore in network measurements, leads to more
successful attacks. Through large-scale simulation, we find that for throughput attacks,
the improved network measurements are essentially “cancelled out” by the reduced
variance in performance provided by these improvements. We also find that nearly all
of the proposed improvements increase the effectiveness of latency-based attacks, in
many cases leading to a 10% or higher loss in “degree of anonymity.”
Finally, we perform a comprehensive analysis of the combined effects of throughput, induced throttling and latency-measurement attacks. We show that using induced
throttling, the combined attacks can in many cases uniquely identify the source of a
circuit by a likelihood ratio test. These results indicate that flow and admission control
algorithms can have considerable impact on the security as well as performance of the
Tor network, and new proposals must be evaluated for resistance to induced throttling.
5.2
Background
In this section we discuss the proposed performance enhancing algorithms and some
details of some privacy reducing attacks in Tor.
5.2.1
Circuit Scheduling
When a connection was ready to write, Tor initially used a round robin [83] fair queuing algorithm to determine which circuit to flush cells from. One impact of this is that latency sensitive circuits would often have to queue behind large amounts of cells flushed from bulky file transfer circuits. To fix this Tang and Goldberg [77] suggested using
an EWMA-based algorithm to prioritize web circuits over bulky file sharing circuits,
reducing latency to clients browsing the web.
5.2.2
Flow Control
The high client-to-relay ratio in Tor causes performance problems that have been the
focus of a considerable amount of previous research. The main flow control mechanism
used by Tor is an end-to-end window based system, where the exit relay and client use
SENDME control cells to infer network level congestion. Tor separates data flowing inbound from data flowing outbound, and flow control mechanisms operate independently
on each flow. Each circuit starts with an initial 1000 cell window which is decremented
by the source edge node for every cell sent. When the window reaches 0, the source
edge stops sending. Upon receiving 100 cells, the receiver edge node returns a SENDME
cell to the source edge, allowing the source edge to increment its circuit window by 100
and continue sending more cells. Each stream that is multiplexed over a circuit also
has a similar flow control algorithm operating on it, with a 500 cell window and 50 cell
response rate. This work will focus on the circuit level flow control mechanisms.
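As a rough illustration of this window accounting, the following sketch captures the bookkeeping described above. It is a simplified paraphrase; the class and method names are ours and do not mirror the actual Tor source.

    class CircuitWindow:
        """Illustrative circuit-level window accounting; not the Tor implementation."""
        WINDOW_START = 1000       # initial circuit window, in cells
        SENDME_INCREMENT = 100    # cells acknowledged by each SENDME

        def __init__(self):
            self.package_window = self.WINDOW_START   # maintained at the sending edge
            self.delivered = 0                        # maintained at the receiving edge

        def can_send(self):
            return self.package_window > 0            # stop sending once the window hits 0

        def cell_sent(self):
            self.package_window -= 1                  # source edge decrements per cell

        def cell_received(self):
            """Receiving edge; returns True when a SENDME should be sent back."""
            self.delivered += 1
            return self.delivered % self.SENDME_INCREMENT == 0

        def sendme_received(self):
            self.package_window += self.SENDME_INCREMENT   # sender tops the window back up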
One of the main issues with flow control in vanilla Tor is that it’s done in an end-toend manner, meaning that circuit edge nodes will take longer to detect and react to any
congestion that occurs in the middle of the circuit. As an alternative approach, AlSabah
et al. introduced N23[12], a link based algorithm that can instead detect and react to
congestion on every link in the circuit. Similar to the native flow control mechanism
in Tor, each relay in an N23-controlled circuit initializes its credit balance to N 2 + N 3
and decrements it by one for every cell it forwards. After a node has forwarded N 2
cells, it returns back a flow control cell containing the number of forwarded cells to the
backward relay. Upon receiving a flow control cell from the forward relay, the backward
relay updates its credit balance to be N 2 + N 3 minus the difference in cells it has
forwarded and cells the forward relay has forwarded. N23 has been shown to improve
detection of and reaction to congestion[12].
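A minimal sketch of the per-link credit bookkeeping described above is shown below. This is our own paraphrase of the N23 description, not the implementation from [12]; all names are illustrative.

    class N23Link:
        """Illustrative per-link N23 credit bookkeeping, paraphrasing the description above."""
        def __init__(self, n2=10, n3=500):
            self.n2, self.n3 = n2, n3
            self.credit = n2 + n3        # initial credit balance of N2 + N3
            self.forwarded = 0           # total cells this relay has forwarded
            self.since_flow_cell = 0     # cells forwarded since the last flow-control cell

        def can_forward(self):
            return self.credit > 0

        def cell_forwarded(self):
            """Returns a flow-control cell to send backward after every N2 cells."""
            self.credit -= 1
            self.forwarded += 1
            self.since_flow_cell += 1
            if self.since_flow_cell == self.n2:
                self.since_flow_cell = 0
                return {"type": "flow", "forwarded": self.forwarded}
            return None

        def flow_cell_received(self, downstream_forwarded):
            """Backward relay updates its credit using the forward relay's forwarded count."""
            backlog = self.forwarded - downstream_forwarded
            self.credit = self.n2 + self.n3 - backlog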
5.2.3
Traffic Admission Control
Guard nodes in Tor have the ability1 to throttle clients using a basic rate limiter [84].
The algorithm uses a token bucket whose size and refill rate are configurable to enforce
a long-term average throughput while allowing short-term data bursts. The intuition
behind the algorithm is that a throttling guard node will limit the client’s rate of requests
for new data, which will lower the outstanding amount of data that exists inside the
network at any given time and generally reduce congestion and improve performance.
There have been many proposed uses of and alterations to the approach outlined
above, some of which vary the connections that are throttled [85, 1, 78] and others that
vary the throttle rate [1, 78, 79]. Of particular interest are the algorithms proposed
by Jansen et al. [1], each of which utilize the number of client-to-guard connections
in some way to adjust the algorithm. The bitsplit algorithm divides its configured
1 Tor does not currently enable throttling by default.
BandwidthRate evenly among client-to-guard connections, while the flag algorithm uses
the number of client-to-guard connections to determine the rate over which a client
will get flagged as “high throughput” and throttled. Finally, the threshold algorithm
throttles the loudest fraction of client-to-guard connections. These algorithms have been
shown to improve web client performance[1].
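The per-connection throttle rules described in this section can be paraphrased as in the following sketch. This is our own illustration, not the implementation from [1]; the 50 KB/s flagged rate and the 10% fraction follow the values advised in [1] and used later in this chapter.

    def bitsplit_rate(bandwidth_rate, connections):
        # split the configured BandwidthRate evenly across client-to-guard connections
        return bandwidth_rate / max(connections, 1)

    def flag_throttle(avg_throughput, bandwidth_rate, connections, flagged_rate=50 * 1024):
        # connections whose average throughput ever exceeds the per-connection split
        # are flagged as "high throughput" and throttled to a configured rate
        if avg_throughput > bandwidth_rate / max(connections, 1):
            return flagged_rate
        return None  # connection is left unthrottled

    def threshold_rates(throughputs, fraction=0.10, floor=50 * 1024):
        # throttle the loudest `fraction` of connections down to the throughput of the
        # quietest connection in that throttled set, but never below the floor
        if not throughputs:
            return []
        ranked = sorted(throughputs, reverse=True)
        k = max(1, int(len(ranked) * fraction))
        cap = max(ranked[k - 1], floor)
        return [min(t, cap) for t in throughputs]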
5.2.4
Circuit Clogging
Murdoch and Danezis previously proposed a Tor circuit clogging attack [82] in which
the adversary sends data through a circuit in order to cause congestion and change its
latency characteristics. The adversary correlates the latency variations of this circuit
with those of circuits through other relays in order to identify the likely relays of a target
circuit. The attack requires significant bandwidth in order to produce a signal strong
enough for correlation, and it has been shown to be no longer effective [60]. There have
been numerous variations on this attack, some of which have simple defenses [60, 86]
and others that have low success rates [87]. This work does not consider these “general”
congestion attacks where the main focus is keeping bandwidth usage small enough to
remain practical. Instead, we focus on the feasibility of new induced throttling attacks
introduced by recent performance enhancing algorithm proposals, and the effects that
our new attacks have on anonymity.
5.2.5
Fingerprinting
Mittal et al. recently proposed “stealthy” throughput attacks[80] where an adversary
that controls an exit node of a circuit attempts to find its guard relay by using “probe”
clients that measure the attainable throughput through each relay.2
The adversary
may then correlate the circuit throughput measured at the exit node with the throughput of each of its probes to find the guard node with high probability. Some of our
attacks also utilize probe clients in order to recognize the signal produced once throttling has been induced on a circuit. Hopperet al. [81] propose an attack where an
adversary injects malicious javascript into a webpage in order to measure the round trip
time of a circuit. The adversary may use these measurements to narrow the possible
2 The attack is active in that an adversary launches it by sending data to relays, but stealthy in that its data streams are indistinguishable from normal client streams.
path the target circuit is taking through the network and approximate the geographical
location of the client. As our techniques are similar to both of these attacks, we include
them in our evaluation in Section 5.4 and analysis in Section 5.7.
5.3
Methodology
We consider three classes of algorithms that have been proposed: EWMA circuit scheduling [77]; N23 flow control [12]; and bitsplit, flag, and threshold throttling [1]. We also
consider an ideal throttling algorithm that has perfect knowledge of the traffic type of
every stream. This ideal algorithm throttles high throughput nodes at a rate of 50 KB/s
and approximates the difftor approach of AlSabah et al. [85].
5.3.1
Algorithmic-Specific Information Leakage
We will explore new algorithm-specific attacks we have developed, as well as previously
published generic attacks [81, 80], and quantify the extent to which the attacks affect
anonymity. In analyzing the algorithms, we can expect them to have one of two effects: the algorithms may improve the effectiveness of statistical attacks by making
side channel throughput and latency measurements more accurate, improving the adversary’s ability to de-anonymize the client; or the algorithms may reduce the noise
that an adversary uses to eliminate entry guards and clients from the potential candidate set, frustrating the attacks and improving client anonymity. All of the attacks are
attempting to either reduce the candidate set of guards on the circuit or of clients using
the circuit. After running the attacks over the candidate set N with a target T ∈ N ,
each entity ni ∈ N will have a corresponding score s(ni ). Along with the raw scores
produced for the target T , to evaluate the effectiveness of the attacks we are interested
in the following metrics.
Percentile: The percentile for a candidate target T of an attack is defined as the
percent of other candidate targets (i.e., members of the anonymity set) with a lower
score3 than T , based on statistical information the attacker uses to score each candidate
3 In attacks where a lower score is “better” we can simply use the inverse for ranking.
as the true target. Formally it is calculated as:
percentile = \frac{|\{ n_i \in N \mid s(n_i) < s(T) \}|}{|N|}
A higher percentile for T means there is a greater likelihood that T is the true target.
Percentiles allow an attacker to reduce uncertainty by increasing confidence about the
true target, or increasing confidence in rejecting candidates unlikely to be the target.
Degrees of Anonymity: As discussed in Section 4.3 the degree of anonymity loss
[75, 74] is a useful metric for measuring information leakage. Given a threshold τ we can
calculate the reduced anonymity set S = {ni ∈ N | s(ni ) > τ } based on the score of
each entity in N. We can then calculate the entropy of S as

H(S) = \begin{cases} \log_2\left(\frac{|N|}{|S|}\right) & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases}    (5.1)
The degree of anonymity then quantifies the actual information leakage, calculated over
the maximum total entropy HM = log2 (|N |) and computed as
degree of anonymity loss = 1 − \frac{H(S)}{H_M}
While this may not show the direct implications to client anonymity, we can determine
the best case scenario for a potential adversary by examining the τ that maximizes the
degree of anonymity loss, and even more importantly it allows us to do cross experimental comparisons to determine the effect that different algorithms have under different
attack scenarios.
Client Probability: The previous metrics examine the attack in isolation, evaluating
how effective an attack is at reducing the candidate set for either guards or clients. For
each attack that reduces the candidate guard set, we are also interested in how well
an adversary can reduce the client anonymity set by combining the attack with the
latency attack. For this we want to examine the end client probability distribution set
discussed in Section 4.3. Since all the attacks produce scores for each entity, to produce
a probability distribution we simply normalize over the scores. So for ni ∈ N we have:
s(ni )
nj s(nj )
P (ni ) = P
With this we can combine the attacks as outlined in Section 4.3 to produce a final client
probability distribution P (T = ci ).
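For reference, the score normalization and the Section 4.3 combination can be sketched as follows; the data structures and function names are our own assumptions for this illustration.

    def normalize(scores):
        """Turn raw attack scores {entity: s(entity)} into a probability distribution."""
        total = sum(scores.values())
        return {k: v / total for k, v in scores.items()} if total else dict(scores)

    def combined_client_distribution(guard_scores, client_scores_given_guard):
        """guard_scores: {guard: s(guard)};
        client_scores_given_guard: {guard: {client: s(client | guard)}}."""
        p_guard = normalize(guard_scores)
        p_client = {}
        for g, p_g in p_guard.items():
            p_cond = normalize(client_scores_given_guard[g])
            for c, p_c in p_cond.items():
                # P(V = c) = sum over guards of P(V = c | G = g) * P(G = g)
                p_client[c] = p_client.get(c, 0.0) + p_c * p_g
        return p_client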
5.3.2
Experimental Setup and Model
Our experiments will utilize the Shadow simulator [88, 89], an accurate discrete event
simulator that runs the real Tor code over a simulated network. Shadow allows us to
configure large scale experiments running on network sizes not feasible using traditional
distributed network testbeds [61]. In addition, it offers fine grain control over network
topology, latency, and bandwidth characteristics. Shadow also enables precise control
over Tor’s circuit creation process at each individual node, allowing us to experiment
with our attacks in a safe environment. Shadow also allows us to run repeatable experiments while only modifying the algorithm or attack scenario of interest, resulting in
more accurate evaluations and comparisons.
We developed a model of the Tor network based on work by Jansen et al. [68], and
use it as the base of each large scale experiment in the following sections. We will discuss
necessary changes to the following base configuration as we explore each specific attack
scenario: 160 exit relays, 240 nonexit relays, 2375 web clients, 125 bulk clients, 75 small
TorPerf clients, 75 medium TorPerf clients, 75 large TorPerf clients, and 400 HTTP
servers. The web client downloads a 320 KiB file from one of the randomly selected
servers, after which it sleeps for a time between 1 and 60 seconds drawn uniformly
at random before starting the next download. The bulk clients repeatedly download
a 5 MiB file with no wait time between downloads. Finally, the TorPerf clients only
perform one download every 10 minutes, where the small, medium and large clients
download 50 KB, 1 MiB and 5 MiB files respectively. This distribution of clients is used
to approximate the findings of McCoy et al. [72], Chaabane et al. [25] and data from
Tor [71].
5.4
Algorithmic Effects on Known Attacks
This section evaluates how recently proposed performance enhancing algorithms affect
previously known guard and client identification attacks against Tor.
5.4.1
Throughput as a Signal
We first explore the scenario of Mittal et al. [80], where an attacker is able to identify
the guard relay of a circuit with high probability by correlating throughput measured
at an adversarial exit node to probe measurements made through a set of entry guards.
We ran our base experiment from Section 5.3.2 once without any probe clients in
order to discover what circuits each bulk client created. Then, for every entry G that
was not a middle or exit relay for any bulk client, we instantiated a probe client that
creates a one-hop circuit through G in order to measure its throughput. This was done
to minimize the interference of probes on bulk circuits, so that each probe only potentially affects the circuit it is measuring and no other. We compared vanilla Tor with 6
different algorithms: EWMA circuit scheduling [77], N23 flow control [12], bitsplit, flag,
and threshold throttling [1], and ideal throttling.
The results for the EWMA and N23 algorithms can be seen in Figure 5.1. The
percentile of the entry guard seen in Figure 5.1b shows the largest divergence between
the algorithms. For example, around 40% of entry guards in vanilla Tor and N23
have correlation scores in the top 20% of all scores calculated, while EWMA only has
around 25% of its entry guards in the top 20%. In terms of measuring the actual loss
of anonymity we see little difference between the algorithms, as seen in Figure 5.1c.
This shows the degree of anonymity loss while varying the threshold value, that is, the
minimum correlation score an entry guard must have to be included in set of possible
guards. Vanilla Tor has a slightly higher peak in anonymity loss, about 2% larger than
EWMA and N23, leading to greater anonymity reduction for an adversary.
While the previous algorithms had little overall effect on the attacks, the throttling algorithms have a significantly larger impact on how the attack performs compared to vanilla Tor, as seen in Figure 5.2. In particular, we see remarkably low entry guard correlation scores for the ideal and flag algorithms in Figure 5.2a, and looking at the
percentile of the guards in Figure 5.2b we see 48% of entry guards are in the top half
of the list based on correlation score in vanilla Tor, while only 22-41% of the guards
in the throttling algorithm experiments made it in the top half. Indeed, when looking
at degree of anonymity loss in Figure 5.2c, we see a much larger peak in the graph for
vanilla Tor compared to all other algorithms, indicating that in the best case scenario
an adversary would expect more information leakage. This implies that the throttling
algorithms would actually result in a larger anonymity set for the adversary, making the
attack produce worse results. Intuitively, the throughput of throttled circuits tends to be more similar than the throughput of unthrottled circuits, increasing the uncertainty during the attack. The throttling algorithms effectively smooth out circuit throughput to the configured long-term throttle rate, making it more difficult to distinguish the actual entry guard from the set of potential guards.
Figure 5.1: Results for throughput attack with vanilla Tor compared to EWMA and N23 scheduling and flow control algorithms. (a) Entry Scores, (b) Percentile of Entry, (c) Degree of Anonymity Loss.
Figure 5.2: Results for throughput attack with vanilla Tor compared to different throttling algorithms. (a) Entry Scores, (b) Percentile of Entry, (c) Degree of Anonymity Loss.
5.4.2
Latency as a Signal
We now explore the latency attack of Hopper et al. [81]. They show how an adversarial
exit relay, having learned the identity of the entry guard in the circuit, is able to estimate
the latency between the client and guard. This is accomplished by creating two ping
streams through the circuit, one originating from the client and one from the attacker.
The ping stream from the attacker is used to estimate the latency between the entry
guard and the exit relay which, when subtracted from the ping times between the client
and exit relay produces an estimate of latency between the client and guard. Using
network coordinates to compile the set of “actual” latencies between potential clients
and guards, the adversary is then able to reduce the anonymity set of clients based
on the estimated latency. Since this attack relies on the accuracy of the estimated
latency measurements, the majority of the algorithms have the potential to decrease
the anonymity of the clients by allowing an adversary to discard more potential clients.
For our experiments, we use the same base configuration with 400 relays and 2500
clients, with an additional 250 victim clients setup to send pings through the Tor circuit
every 5 seconds. Then, for each victim client a corresponding attacker client is added,
which creates an identical circuit to the one used by the victim, and then sends a ping
over this circuit every 5 seconds. These corresponding ping clients are used to calculate
the estimated latency between the victim and entry guard as discussed above. In order
to determine the actual latencies between the clients and guard node, we utilize the fact
that Shadow determines the latency distribution that is sampled from between each
node that communicates in the experiment, so we merely assign the median latency of
these distributions as the actual latencies between nodes. This would correspond to the
analysis done in [81] where it is assumed that these quantities were known a priori, so
we believe using this “insider information” doesn’t contradict any assumptions made in
the initial paper outlining the attack. Furthermore, since we’re ultimately concerned
with how the attacks differ using various algorithms, the analysis should hold.
Similar to the original attack, we take the minimum of all observed ping times seen
over both the victim and attacker circuits, denoted TVX and TAX respectively. Then, an estimate of the latency between the victim and entry guard, TVE, is calculated as T̂VE = TVX − TAX + TAE, where TAE is the latency between the attacker and entry
guard as calculated above. Figures 5.3a and 5.4a show the difference in estimated latency
computed by an adversary and the actual latency between the client and guard, while
Figures 5.3b and 5.4b show how these compare with the differences between the estimate
and other possible clients. While these graphs only show a slight improvement for every
algorithm except EWMA, looking at the degree of anonymity loss in Figures 5.3c and
5.4c shows a noticeable increase in the maximum possible information gain an adversary
can achieve. Even though there is only a slight improvement in the accuracy of the
latency estimation, this allows an adversary to consider a smaller window around the
estimation to filter out potential clients while still retaining similar accuracy rates. This
results in a smaller set of potential clients to consider, and thus a higher reduction in
anonymity of the victim client.
5.5
Induced Throttling via Flow Control
We now look at how an attacker is able to use specific mechanisms in the flow control
algorithms to induce throttling on a target circuit.
5.5.1
Artificial Congestion
Recall from Section 5.2.2, the flow control algorithms work by having control cells sent
backward to notify nodes and clients that they are able to send more data. If there is
congestion and the nodes go long enough without receiving these cells, they stop sending
data until the next control cell is received. Using these mechanisms, an adversarial exit
node can implicitly control when a client can and cannot send data forward, thereby
inducing artificial congestion.
Figure 5.3: Results of latency attack on EWMA and N23 algorithms. (a) Ping Differences, (b) Percentile, (c) Degree of Anonymity Loss.
Figure 5.4: Results of latency attack on the various throttling algorithms. (a) Ping Differences, (b) Percentile, (c) Degree of Anonymity Loss.
While it is often sufficient to model Tor clients as web and bulk [68], making small and large downloads respectively, in order for an exit node to induce artificial congestion there needs to be a large amount of data being sent from the client to the server. To address this issue, we introduce a more detailed third type of client modeling the BitTorrent protocol and a scheme that mimics “tit-for-tat” [90]. The new client instead swaps 16 KB blocks of data with the server, and doesn’t continue until both sides have received the full block, causing large amounts of data to both be uploaded and downloaded.
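A minimal sketch of this block-swapping behavior is given below. It is our own illustration of the lockstep exchange; the actual Shadow plug-in used for these experiments is not reproduced here, and `sock` is assumed to be a connected TCP socket.

    BLOCK = 16 * 1024   # 16 KB blocks, as described above

    def swap_blocks(sock, num_blocks):
        """Exchange fixed-size blocks in lockstep with the peer over a connected socket."""
        payload = b"\x00" * BLOCK
        for _ in range(num_blocks):
            sock.sendall(payload)          # upload one 16 KB block
            received = 0
            while received < BLOCK:        # do not continue until the peer's full block arrives
                chunk = sock.recv(BLOCK - received)
                if not chunk:
                    raise ConnectionError("peer closed mid-swap")
                received += len(chunk)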
To demonstrate the effectiveness of such techniques, we had a bulk and torrent client
create connections over the same circuit where each relay was configured with 128 KB/s
bandwidth. The exit relay would then periodically hold all control cells being sent to the
torrent client in an attempt to throttle the connection. Figure 5.5 shows the observed
throughput of both clients, where the shaded regions indicate periods when the exit
relay was holding control cells from the torrent client. We can see that approximately
30 seconds into these periods, the torrent client runs out of available cells to send and
goes into an idle state, leaving more resources to the bulk client resulting in a rise in
the observed throughput.
Next we want to identify how a potential adversary could utilize this in an attempt
to identify entry guards used in a circuit. We can see the intuition behind the attack
in Figure 5.5: when throttling of a circuit is repeatedly toggled, the throughput of all
other circuits going through those nodes will increase and then decrease, producing a
noticeable pattern which an adversary may be able to detect. We assume a scenario
similar to previous attacks [82, 60, 80], where an adversary controls an exit node and
wants to identify the entry guard of a circuit going through them. The adversary creates
one-hop probe circuits through possible entry guards by extending circuits through
middle relays that the adversary controls, and measures the throughput and congestion
on each circuit at a constant interval. The adversary then periodically throttles the
circuit by holding control cells and tests for an increase in throughput for the duration
of the attack. By repeatedly performing this action an attacker should be able to reduce
the possible set of entry guards that the circuit might be using.
5.5.2
Small Scale Experiment
To test the feasibility of such an attack, we designed a small scale experiment with 20
relays, 190 web clients and 10 bulk clients. One of the exit relays was designated as
the adversary and one of the bulk clients was designated as the victim. The victim
bulk client was then configured to use the torrent client while creating circuits using
the adversary as their exit relay.
Figure 5.5: Effects of artificial throttling.
Then, for each of the 19 remaining non-adversarial
relays, a probe client was added and configured to create one-hop circuits through the
relay, measuring observed throughput in 100 ms intervals. The adversarial exit relay
would then wait until the victim client created the connection, then every 60 seconds
would toggle between normal mode in which all control cells are sent as appropriate,
and throttle mode where all control cells are held.
Figure 5.6a shows the observed throughput at the probe client connected to the
entry guard that the victim client was using, where the shaded regions correspond to
the periods where the victim client runs out of available cells to send and is throttled.
During these periods the probe client sees a large spike in observed throughput as
more resources become available at the guard relay. While these changes are visually
identifiable, we need a quantitative test that can deal with noise and variability in order
to analyze a large number of probe clients and reduce the set of possible guards.
5.5.3
Smoothing Throughput
The first step is to smooth out the throughput measurements for each probe client in
order to filter out any noise. Given throughput measurements (ti , bi ), we first compute
the exponentially weighted moving average (EWMA), using α = 0.01. We then take the
output from EWMA and pass it through a cubic spline smoothing algorithm [91] with
smoothing parameter λ = 0.5. It takes as input the set of measurements (ti , EW M Ai )
and a smoothing parameter λ ≥ 0, and returns the function f (t) which minimizes the
penalized residual sum of squares:

\sum_{i=1}^{n} \left( EWMA_i - f(t_i) \right)^2 + \lambda \int_{t_1}^{t_n} f''(t)^2 \, dt

Figure 5.6: Raw and smoothed throughput of probe through guard during attack. (a) Raw Throughput, (b) Smoothed Throughput.
The result of this process can be seen in Figure 5.6b with the normalized smoothed
throughput plotted over the shaded attack windows.
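A hedged sketch of this smoothing pipeline in Python is shown below. Note that SciPy's UnivariateSpline exposes a smoothing factor s (a residual target) rather than the penalty weight λ above, so its parameterization only approximates the spline routine cited in [91]; the function name and defaults are our own.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def smooth_throughput(times, throughputs, alpha=0.01, s=0.5):
        """EWMA pass followed by a cubic smoothing spline over (t_i, EWMA_i)."""
        ewma = []
        acc = throughputs[0]
        for b in throughputs:
            acc = alpha * b + (1 - alpha) * acc           # EWMA with alpha = 0.01
            ewma.append(acc)
        spline = UnivariateSpline(times, ewma, k=3, s=s)  # cubic smoothing spline
        return spline(np.asarray(times))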
5.5.4
Scoring Algorithm
The intuition behind the scoring algorithm can be seen in Figure 5.6b. Over each attack
window marked by the shaded regions, the guard probe client should see large increases
and decreases in throughput at the beginning and end of the window. Here we want
the scoring algorithm to place heavy weight on consistent large increases and decreases
that align with all the windows, while at the same time minimizing false positives from
potentially assigning too much weight to large spikes that randomly happen to align
with the attack window.
The first step of the scoring algorithm is to calculate a linear regression on the
smoothed throughput over the first δ seconds4 at the start and end of each attack
window and collect the slope values si and ei for the start and end regression. Then
for each window i, first sort all probe clients based on their si slope value from highest
to lowest, and for each probe client record their relative rank rsi . Repeat this for the
4 Through empirical evaluation we found δ = 30 seconds to be ideal.
slope value ei , only instead sort the clients from lowest to highest, again recording their
relative rank rei . Rankings are used instead of the raw slope value in order to prevent
the false positives from large spikes mentioned previously. Now that each probe client
has a set of ranks {rs1 , re1 , . . . , rsn , ren } over n attack windows, the final score assigned
to each client is simply the mean of all the ranks, where the lower the score, the higher the chance that the probe client is connected through the entry guard.
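The ranking step can be sketched as follows; the probe_slopes layout (one (start slope, end slope) pair per attack window for each probe) is an assumption of this illustration.

    def score_probes(probe_slopes):
        """probe_slopes: {probe: [(start_slope, end_slope), ...]}, one pair per window.
        Returns the mean rank per probe; lower scores suggest the entry guard's probe."""
        probes = list(probe_slopes)
        n_windows = len(next(iter(probe_slopes.values())))
        ranks = {p: [] for p in probes}
        for i in range(n_windows):
            # start of window: rank by largest throughput increase first
            by_start = sorted(probes, key=lambda p: probe_slopes[p][i][0], reverse=True)
            # end of window: rank by largest decrease (most negative slope) first
            by_end = sorted(probes, key=lambda p: probe_slopes[p][i][1])
            for rank, p in enumerate(by_start):
                ranks[p].append(rank)
            for rank, p in enumerate(by_end):
                ranks[p].append(rank)
        return {p: sum(r) / len(r) for p, r in ranks.items()}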
5.5.5
Large Scale Experiments
In order to test the accuracy of this attack on a large scale, we used the base experiment
setup discussed in Section 5.3.2, with the addition of one-hop probe clients to connect
through each relay. For each run, a random circuit used by a bulk node was chosen to
be the attack circuit and the exit node used in the circuit was made an attacker node.
The bulk node was updated to run the client that simulated the BitTorrent protocol,
and the probe clients that were initially going through the middle and exit relay were
removed. For each algorithm, we performed 40 runs with the different attack circuits
chosen for each run. We configure the experiments with the vanilla Tor algorithm which
uses SENDME cells for flow control, and the N23 flow control algorithm with N 3 = 500
and N 3 = 100. We experimented with having BitTorrent send 1, 2, 5, and 10 streams
over the circuit. The results using just one stream can be seen in Figure 5.7, and results
when using multiple streams in Figure 5.8.
Figure 5.7a shows the CDF of the average score computed by the ranking algorithm
for the entry guards probe client, while Figure 5.7b shows the percentile compared to
all other probe clients. Interestingly, even though in vanilla Tor not a single entry guard
probe had a score better than 150 out of 400, we still see about 80% of the entry guard
probe clients were in the 25th percentile amongst all probe clients. Furthermore, we
see that the attack is much more successful with the N23 algorithm, especially with
N 3 = 100, with peaks in degree of anonymity loss at 31%, compared to 27% with
N 3 = 500 and 19% in vanilla Tor. This is due to the fact that it is easier to induce
throttling with N23, especially when the N 3 value is low. In vanilla Tor, the initial
window is set at 1000 cells, while for the N23 algorithm this would be N 2 + N 3, which
works out to 510 and 110 cells for N3 = 500 and N3 = 100 respectively (the default value for N2 is 10).
Figure 5.7: Throttle attack with BitTorrent sending a single stream over the circuit. (a) Entry Average Ranks, (b) Percentiles, (c) Degree of Anonymity Loss.
The outcome of this is that for N23 with N3 = 100, we were able
to throttle the attack circuit around 12 times, resulting in 24 comparison points over
all attack windows, while both with N 3 = 500 and vanilla Tor we see around 7 attack
windows, resulting in 14 comparison points. The reason that we see slightly better entry
probe rankings with N 3 = 500 than vanilla Tor is because with N23, each node buffers
cells when it runs out of credit, while vanilla Tor buffers cells at the client. This means
when the flow control cell is finally sent by the attacker, it would cause the entry guard
to flush the circuit’s buffer and cause an immediately noticeable change in throughput
and thus a higher rank score for the entry guard.
While using one circuit for one stream is the ideal greedy strategy for a BitTorrent
client using Tor, it may not always be feasible to accomplish this. To explore what
effects sending multiple streams over a circuit has on the attacks, for each algorithm
we experimented having BitTorrent send 2, 5, and 10 streams over the circuit that the
attacker throttles. The results are shown in Figure 5.8, and for each algorithm we can
see how the degree of anonymity loss changes as more streams are multiplexed over a
single Tor circuit. Not surprisingly, all three algorithms have a high degree of anonymity
loss when sending 10 streams over the circuit, as this dramatically increases the amount
of data being sent over the circuit. Thus, when the artificial throttling is induced,
there will be larger variations and changes in observed throughput at the entry guard’s
probe client. Even just adding one extra stream to the circuit can in some cases cause
a noticeable reduction in the degree of anonymity, particularly for the N23 algorithm
with N 3 = 100 as seen in Figure 5.8c.
5.6
Induced Throttling via Traffic Admission Control
We now look at how an attacker is able to use specific mechanisms in proposed traffic admission control algorithms to induce throttling on a target circuit, creating a throughput
signal at much lower cost than the techniques required in [80].
5.6.1
Connection Sybils
Figure 5.8: Degree of anonymity loss with throttling attacks with different numbers of BitTorrent streams over the Tor circuit. (a) Vanilla, (b) N23 (N3=500), (c) N23 (N3=100).
Recall that each of the algorithms proposed in [1] relies on the number of client-to-guard connections to adaptively adjust the throttle rate (see Section 5.2). Unfortunately, this feature may be controlled by the adversary during an active sybil attack [92]: an adversary may either modify a Tor client to create multiple connections to a target
guard relay instead of just one, or boot multiple Tor clients that are each instructed to
connect to the same target.5
As a result, the number of connections C at the target
will increase to C = Cn + Cs , where Cn is the number of normal connections and Cs is
the number of sybil connections. The throttling algorithms will be affected as follows:
the bitsplit algorithm will lower the throttle rate to BandwidthRate/(Cn + Cs); the flag algorithm will throttle any connection whose average throughput has ever exceeded BandwidthRate/(Cn + Cs) to the configured rate6; and the threshold algorithm will throttle the loudest fraction f of
connections to the throughput of the quietest of that throttled set7
(but no less than
a floor of 50 KB/s). Therefore, the attacker may cause the throttle rate at any guard
relay to reduce dramatically in all of the algorithms by using enough sybils. In this
way, an adversary controlling an exit node can determine the guard node belonging to a
target circuit with high probability. Note that the attacker need not use any bandwidth
or computational resources beyond that which is required to establish the connections
from its client(s) to the target guard relay.
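As a back-of-the-envelope illustration of how cheap this is, the sketch below estimates how many sybil connections push the bitsplit throttle rate at a guard below a desired per-connection rate; the numbers in the usage comment are hypothetical.

    import math

    def sybils_needed(bandwidth_rate, normal_connections, target_rate):
        """Connections needed so that BandwidthRate / (Cn + Cs) <= target_rate."""
        total = math.ceil(bandwidth_rate / target_rate)   # required Cn + Cs
        return max(0, total - normal_connections)         # Cs

    # e.g. pushing a guard configured with a 10 MB/s BandwidthRate and 100 normal
    # connections down to a 50 KB/s per-connection rate:
    # sybils_needed(10_000_000, 100, 50_000)  -> 100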
We test the feasibility of this attack in Shadow. We configure a network with 5
relays, a file server, and a single victim client that downloads a large file from the server
through Tor for 300 seconds. The adversary controls the exit relay on the client’s circuit
and therefore is able to compute the client’s throughput. The attacker starts multiple
sybil nodes at time t = 100 seconds that each connect to the same entry relay used by
the victim. The sybil nodes are shut down at t = 200, after being active for 100 seconds.
The results of this attack on each algorithm are shown in Figure 5.9, where the
shaded area represents the time during which the attack was active. Figure 5.9a shows
that the bitsplit algorithm is only affected while the attack is active, after which the
client throughput returns to normal. However, the client throughput remains degraded
in both Figures 5.9b and 5.9c—the flag algorithm flagged the client as a high throughput
node and did not unflag it while the threshold algorithm continued to throttle at the
50 KB/s floor. Further, notice that the throttling does not occur until roughly 10 to 20
seconds after the attack has begun. This is due to the size of the token bucket, i.e., the
BandwidthBurst configuration: the attack causes the refill rate to drop dramatically, but it takes time for the client to use up the existing tokens in the bucket. In addition to the delay associated with the token bucket size, Figure 5.9c shows added delay in the threshold algorithm because it only updates throttle rates once per minute.
5 A similar attack was previously described in [1], Section 5.2, Attack 4.
6 We use 50 KB/s as advised in [1].
7 We use f = 0.10 as advised in [1].
Figure 5.9: Small scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. The sybils cause an easily recognizable drop in throughput on the target circuit. (a) Bitsplit, (b) Flag, (c) Threshold.
5.6.2
Large Scale Experiments
We further explore the sybil attack on a large scale in Shadow. In addition to the base
experiment setup discussed in Section 5.3.2, we add a victim client who downloads a
large file through a circuit with an adversarial exit and a known target guard. The
adversary starts sybil nodes and instructs them to connect to the known target guard
at time t = 100. The sybil nodes cycle through 60 second active and inactive phases,
and the adversary measures the throughput of the circuit.
The results of this attack on each algorithm are shown in Figure 5.10, where the
shaded area again represents the time during which the attack was active. In the large
scale experiment, only the bitsplit algorithm (Figure 5.10a) produced a repeatable signal
while the throttle rate remained constant after the first phase for both flag (Figure 5.10b)
and threshold (Figure 5.10c). Also, these results show the importance of correctly
computing the attack duration: the signal was missed during the first phase for both
bitsplit and flag because the token bucket had not yet fully drained.
5.6.3
Search Extensions
Because the drop in throughput during the attack is easily recognizable, it is much easier
to carry out than those discussed in Section 5.5. Therefore, in addition to statistical
correlations that eliminate potential guards from a candidate set for a given circuit, a
search strategy over the potential guard set will also be useful. In a linear search, the
adversary would attack each candidate guard one by one until the throughput signal on
the target circuit is recognized. This strategy is simple, but it may take a long time to
test every candidate. A binary search, where the attacker tests half of the candidates in
each phase of the search, would significantly reduce the search time. Note that a binary
search may be ineffective on certain configurations of the flag and threshold algorithms
because of the lack of a repeatable signal (see Figures 5.9 and 5.10).
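A sketch of the binary search strategy follows; signal_observed is a hypothetical callback standing in for launching sybils against a candidate subset and checking the target circuit for the throughput drop.

    def binary_search_guard(candidates, signal_observed):
        """candidates: list of candidate guard relays; signal_observed(subset) attacks
        the subset and reports whether the target circuit's throughput dropped."""
        remaining = list(candidates)
        while len(remaining) > 1:
            half = remaining[: len(remaining) // 2]
            # keep whichever half produced (or failed to produce) the signal
            remaining = half if signal_observed(half) else remaining[len(remaining) // 2:]
        return remaining[0]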
Figure 5.10: Large scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. Due to token bucket sizes, the throughput signal may be missed if the phases of the attack are too short. (a) Bitsplit, (b) Flag, (c) Threshold.
Regardless of the search strategy, a successful attack will contain enough sybils to
allow the adversary to recognize the throughput signal, but otherwise be as fast as
possible. Given our results above, the attack should consider the token bucket size and
refill rate of each candidate guard to aid in determining the number of sybils to launch
and the length of time each sybil should remain active. An adversary who controls the
circuit exit may compute the average circuit throughput and estimate the amount of
time it would take for the circuit to deplete the remaining tokens in a target guard’s
bucket during an attack. Each sybil should then remain active for at least that amount
of time. Figure 5.10 shows that the throughput signal may be missed without these
considerations.
5.7
Analysis
Having seen how the algorithms perform independently in different attack scenarios, we
now want to examine the overall effects on anonymity that each algorithm has. For our
analysis, we use client probability distributions as discussed in Section 4.3 to measure how much information is gained by an adversary. Recall that the probability that a client is the victim is P[V = Ci] = \sum_j P[V = Ci | Rj] · P[G = Rj] for a set
of relays R and clients C. Now we need to determine how to update the probability
distributions for P [G = Rj ] and P [V = Ci |Rj ] based on the attacks we’ve covered.
There are three attacks discussed that are used to learn information about the guard
node used in a circuit, determining P [G = Rj ]. The throughput attack and artificial
throttling attack both attempt to reduce the set of possible entry guards by assigning
a score to each guard and using a threshold value to determine the reduced set. For
each attack, we compile the set of scores assigned to the actual guard nodes, and use
this set as input to a kernel density estimator in order to generate an estimate of the
probability density function, P̂ . Then, given a relay Rj with a score score(Rj ) we
can compute P [G = Rj ] = P̂ [X = score(Rj )]. For the sybil attacks on the throttling
algorithms we were able to uniquely identify the entry guard, so we have P [G = Rj ] = 1
if Rj is the guard, otherwise it’s 0. Therefore, denoting RG as the guard relay, we have
P [V = Ci ] = P [V = Ci |RG ].
For determining the probability distribution for P [V = Ci |Rj ], recall that the la-
tency attack computes the difference between the estimated latency T̂V E and the actual
latency TV E as the score, and ranks potential clients that way. Using the absolute
value of the difference as the score, we compute the probability density function P̂ in
the same way as we did for P [G = Rj ]. Therefore, to compute P [V = Ci |Rj ], we let
lat(Ci, Rj) be the actual latency between client Ci and relay Rj, and T̂VE be the estimated latency between the client and entry guard. Then, with diff = |lat(Ci, Rj) − T̂VE| we have P[V = Ci|Rj] = P̂[X = diff] from the computed probability density function.
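A minimal sketch of this density-estimation step is shown below. The analysis only specifies a kernel density estimator, so the use of SciPy's gaussian_kde here is our own choice, and the helper names are illustrative.

    import numpy as np
    from scipy.stats import gaussian_kde

    def density_estimator(observed_scores):
        """Fit a Gaussian KDE to scores observed for the true guards (or true clients)."""
        kde = gaussian_kde(np.asarray(observed_scores, dtype=float))
        return lambda score: float(kde(score)[0])   # P-hat evaluated at a candidate's score

    # p_guard = density_estimator(true_guard_scores)     -> P[G = Rj] = p_guard(score(Rj))
    # p_client = density_estimator(latency_differences)  -> P[V = Ci | Rj] = p_client(diff)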
In our analysis we first concentrate on how the algorithms perform with just the
throughput and latency attack compared to vanilla Tor. We then replace the throughput
attack with the new throttling attacks shown in Section 5.5 and 5.6 to see if there are
improvements over either vanilla Tor or the throughput attack with the algorithms.
For each attack we use a set of potential victims {Vi } and their entry guards Gi and
compute the probabilities P [V = Vi ] as shown above.
The results for the algorithms EWMA, N23, and the induced throttling attacks
described in Section 5.5 are shown in Figure 5.11. We can see in Figure 5.11a that
with just the throughput and latency attack, vanilla Tor leaks about the same amount
of anonymity of a client as EWMA and N23. The exception to this is that a tiny
proportion of clients have probabilities ranging from 4-6% which is outside the range in
vanilla Tor. Referring back to Figure 5.3 we see that these algorithms, N23 in particular,
leak slightly more information than vanilla Tor based on latency estimation. While
sometimes this information gain might be counteracted by the amount of possible entry
guards that need to be considered, there are a small number of cases where the guard
set is reduced enough that the extra information from the latency attack translates into
a higher probability for the client.
When replacing the throughput attack with the induced throttling attack we start
to see a larger divergence in client probabilities, as shown in Figure 5.11b. While the
induced throttling attack with vanilla Tor leaks slightly more information than vanilla
Tor with the throughput attack, N23 with N 3 = 500 has higher client probabilities
than both attacks on vanilla Tor and higher than N23 with just the throughput attack.
Furthermore, N23 with N 3 = 100 does significantly better than all previous algorithms,
leaking more information than vanilla Tor for almost half the clients, reaching probabilities as high as 15%.
Figure 5.11: Probabilities of the victim client under attack scenarios for flow control algorithms. (a) Throughput Attack, (b) Throttling Attack, (c) Perfect Knowledge.
The results in Figure 5.11b assume that the client only sends one stream over the
circuit, the worst case scenario for an adversary. As shown in Figure 5.8, as the number of streams multiplexed over the circuit increases, the degree of anonymity loss sharply
approaches 100% implying that an adversary would be able to uniquely identify the
entry guard. The analysis with this assumption can be seen in Figure 5.11c, where
P [G = Rj ] = 1 when Rj is the entry guard. Here we see a dramatic improvement
from when a client only sends a single stream over the circuit, with some clients having
probabilities as high as 60%, compared to a peak of 15% with a single stream.
Using the throughput attack with the throttling algorithms produces similar results
as N23 and EWMA, as shown in Figure 5.12a. There is a slightly higher upper bound
in the client probability caused by the threshold and ideal throttling algorithms, but
for the most part these results line up fairly closely to what was previously seen. Given
that the throttling algorithms all had similar peaks to N23 with respect to the loss of
anonymity in the latency attacks, these results aren’t too surprising. Even with the
better performance with the latency attack, these gains are wiped out by the fact that
the throughput attack results in too many guards that need to be considered in relation
to the clients. However, once we use the sybil attack with the throttling algorithm
instead of the throughput attack, where we assume an adversary is able to uniquely
identify the entry guard in use, we start to see dramatically higher client probabilities.
Figure 5.12b shows the result of this analysis, where at the extreme end we see clients
with probabilities as high as 90%. This is due to the fact that with the sybil attack
we are able to identify the exact entry guard used by each victim, thus reducing the
noise from having to consider the latency of clients based on other possible relays. This
very effectively demonstrates the level of anonymity lost when an adversary is able to
significantly reduce the set of possible entry guards.
5.8
Conclusion
While high performance is vital to the Tor system, algorithms which seek to improve
allocation of network resources via more advanced flow control or traffic admission
algorithms need to take into account the implications on anonymity, both with respect
to existing attacks and the potential for new ones. To this effect, we introduce a new class
of induced throttling attacks and demonstrate the effectiveness across a wide variety of performance enhancing algorithms, resulting in dramatic information leakage on victim clients.
Figure 5.12: Victim client probabilities, throttling algorithm attack scenarios. (a) Throughput Attack, (b) Sybil Attack.
Using the new class of attacks, we perform a comprehensive analysis on the
implications on anonymity, showing both the effects the algorithms have on existing
attacks, as well as showing the increase in information gain from the new attacks.
Preventing these new attacks isn’t straightforward, as in many cases the adversary
is merely exploiting the underlying mechanisms in the algorithms. With the induced
throttling attacks, an adversary acts exactly as they should under heavy congestion, so
prevention or detection becomes difficult without changing the algorithm altogether.
In these cases it comes down to the trade-off of performance increases versus potential
anonymity loss. In the past Tor has often come down on the side of performance, as
many times the increase in performance has the potential to attract new clients, perhaps
negating any anonymity loss introduced by the algorithms. However, in the throttling
algorithms the adversary is taking advantage of the fact that only the raw number
of open connections are considered when calculating the throttling rate, allowing Sybil
connections to be created using negligible resources. To prevent this we want a potential
adversary to have to allocate a non-trivial amount of bandwidth to the connections in
order to have such a substantial effect on the throttling rate. One way to do so is to instead
only consider active connections which have seen a minimum amount of bandwidth over
a certain time period. Another approach is, instead of basing the throttling rate on the
total number of connections, to calculate it over the average bandwidth. This way there
is a direct correlation between how much bandwidth an adversary must provide and how
low the throttling rate can be made.
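To make the second defense concrete, the sketch below (in C) computes a throttling rate from the bandwidth actually observed on each client-to-guard connection rather than from the raw connection count; the function name, the fair-share floor, and the treatment of idle connections are illustrative assumptions rather than a proposal for Tor's exact algorithm.

    /* Hypothetical sketch: base the throttle rate on measured per-connection
     * bandwidth instead of the raw connection count, so idle Sybil connections
     * cannot drag the rate down. All names and constants are illustrative. */
    #include <stddef.h>

    /* conn_bytes[i]: bytes seen on client-to-guard connection i over the interval */
    double throttle_rate_bps(const double *conn_bytes, size_t nconns,
                             double interval_secs, double guard_capacity_bps)
    {
        double total = 0.0;
        size_t active = 0;
        for (size_t i = 0; i < nconns; i++) {
            if (conn_bytes[i] > 0.0) {      /* ignore connections that sent nothing */
                total += conn_bytes[i];
                active++;
            }
        }
        if (active == 0)
            return guard_capacity_bps;      /* nothing to throttle */

        double avg_bps = (total / active) / interval_secs;
        double fair_share = guard_capacity_bps / active;
        /* the rate tracks how much bandwidth each connection actually uses,
           never dropping below a fair share of the guard's capacity */
        return avg_bps > fair_share ? avg_bps : fair_share;
    }

Under such a scheme, Sybil connections that carry no traffic neither count toward the rate calculation nor lower it, so driving the throttle down requires spending real bandwidth.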
Chapter 6

Managing Tor Connections for Inter-Relay Communication
6.1 Introduction
One well-recognized performance issue in Tor stems from the fact that all circuits passing
between a pair of relays are multiplexed over a single TLS connection. As shown by
Reardon and Goldberg [15], this can result in several undesirable effects on performance:
a single, high-volume circuit can lead to link congestion, throttling all circuits sharing
this link [15]; delays for packet re-transmissions can increase latency for other circuits,
leading to “head-of-line” blocking [93]; and long write buffers reduce the effectiveness
of application-level scheduling decisions [4].
As a result, several researchers have proposed changes in the transport protocol for
the links between relays. Reardon and Goldberg suggested that relays should use a
Datagram TLS tunnel at the transport level, while running a separate TCP session
at the application level for each circuit [15]; this adds a high degree of complexity
(an entire TCP implementation) to the application. Similarly, the “Per-Circuit TCP”
(PCTCP) design [3] establishes a TCP session for each circuit, hiding the exact traffic
volumes of these sessions by establishing an IPSEC tunnel between each pair of relays;
however, kernel-level TCP sessions are an exhaustible resource and we demonstrate
in Section 6.3 that this can lead to attacks on both availability and anonymity. In
contrast, the Torchestra transport suggested by Gopal and Heninger [2] has each relay
pair share one TLS session for “bulk download” circuits and another for “interactive
traffic.” Performance then critically depends on the threshold for deciding whether a
given circuit is bulk or interactive.
We present two novel solutions to address these problems. First is inverse-multiplexed
Tor with adaptive channel size (IMUX). In IMUX, each relay pair maintains a set of TLS
connections (channel) roughly proportional to the number of “active” circuits between
the pair, and all circuits share these TLS connections; the total number of connections
per relay is capped. As new circuits are created or old circuits are destroyed, connections are reallocated between channels. This approach allows relays to avoid many of the
performance issues associated with the use of a single TCP session: packet losses and
buffering on a single connection do not cause delays or blocking on the other connections
associated with a channel. At the same time, IMUX can offer performance benefits over
Torchestra by avoiding fate sharing among all interactive streams, or per-circuit designs
by avoiding the need for TCP handshaking and slow-start on new circuits. Compared
to designs that require a user-space TCP implementation, IMUX has significantly reduced implementation complexity, and due to the use of a per-relay connection cap,
IMUX can mitigate attacks aimed at exhausting the available TCP sessions at a target
relay. Second is to transparently replace TCP with the micro transport protocol (uTP),
a mature protocol used for communication in BitTorrent. A user space protocol stack
allows per circuit “connections” over a single UDP socket, preventing specific socket
exhaustion attacks while increasing client performance. To this end this chapter makes
the following contributions:
• We describe new socket exhaustion attacks on Tor and PCTCP that can anonymously disable targeted relays, and demonstrate how socket exhaustion leads to
reductions in availability, anonymity, and stability.
• We describe IMUX, a novel approach to the circuit-to-socket solution space.
Our approach naturally generalizes between the “per-circuit” approaches such
as PCTCP and the fixed number of sessions in “vanilla Tor” (1) and Torchestra
(2).
• We analyze a variety of scheduling designs for using a variable number of connec-
tions per channel through large-scale simulations with the Shadow simulator [67].
We compare IMUX to PCTCP and Torchestra, and suggest parameters for IMUX
that empirically outperform both related approaches while avoiding the need for
IPSEC and reducing vulnerability to attacks based on TCP session exhaustion.
• We perform the first large scale simulations of the Torchestra design and the
first simulations that integrate KIST [4] with Torchestra, PCTCP, and IMUX to
compare the performance interactions among the complementary designs.
• We introduce xTCP, a library for transparently replacing TCP with any in-order
reliable protocol. With xTCP we explore performance improvements when us-
ing uTP for inter-relay communication, allowing a dedicated uTP communication
stack per circuit while avoiding socket exhaustion attacks.
6.2 Background
In order to create a circuit, the client sends a series of EXTEND cells through the circuit,
each of which notifies the current last hop to extend the circuit to another relay. For
example, the client sends an EXTEND cell to the guard telling it to extend to the middle.
Afterwards the client sends another EXTEND to the middle telling it to extend the circuit
to the exit. The relay, on receiving an EXTEND cell, will establish a channel to the
next relay if one does not already exist. Cells from all circuits between the two relays
get transferred over this channel, which is responsible for in-order delivery and, ideally,
providing secure communication from potential eavesdroppers. Tor uses a TLS channel
with a single TCP connection between the relays for in-order delivery and uses TLS
to encrypt and authenticate all traffic. This single TCP connection shared by multiple
circuits is a major cause of performance issues in Tor [15]. Many proposals have been
offered to overcome these issues, such as switching completely away from TCP and
using different transport protocols. In this chapter we focus on three proposals, two
of which directly address cross-circuit interference and one tangentially related proposal
that deals with how Tor interacts with the TCP connections.
Torchestra: To prevent bulk circuits from interfering with web circuits, Gopal and
Heninger [2] developed a new channel, Torchestra, that creates two TLS connections
between each pair of relays, one reserved for web circuits and the other for bulk. This prevents
head-of-line blocking that might be caused by bulk traffic from interfering with web
traffic. The paper evaluates the Torchestra channel in a "circuit-dumbbell" topology and
shows that time to first byte and total download time for “interactive” streams decrease,
while “heavy” streams do not see a significant change in performance.
PCTCP: Similar to TCP-over-DTLS, AlSabah and Goldberg [3] propose dedicating a
separate TCP connection to each circuit and replacing the TLS session with an IPSEC
tunnel that can then carry all the connections without letting an adversary learn circuit
specific information from monitoring the different connections. This has the advantage
of eliminating the reliance on user-space TCP stacks, leading to reduced implementation
complexity and improved performance. However, as we show in the next section, the
use of a kernel-provided socket for every circuit makes it possible to launch attacks that
attempt to exhaust this resource at a targeted relay.
KIST: Jansen et al. [4] show that cells spend a large amount of time in the kernel
output buffer, causing unneeded congestion and severely limiting the effect of prioritization in Tor. They introduce a new algorithm, KIST, with two main components:
global scheduling across all writable circuits, which fixes circuit prioritization, and an
autotuning algorithm that dynamically determines how much data should be written
to the kernel. This allows data to stay internal to Tor for longer, allowing it to make
smarter scheduling decisions than simply dumping everything it can to the kernel, which
operates in a FIFO manner.
6.3 Socket Exhaustion Attacks
This section discusses the extent to which Tor is vulnerable to socket descriptor exhaustion attacks that may lead to reductions in relay availability and client anonymity,
explains how PCTCP creates a new attack surface with respect to socket exhaustion,
and demonstrates how socket resource usage harms relay stability. The attacks in this
section motivate the need for the intelligent management of sockets in Tor, which is the
focus of Sections 6.4 and 6.5.
6.3.1 Sockets in Tor
On modern operating systems, file descriptors are a scarce resource that the kernel must
manage and allocate diligently. On Linux, for example, soft and hard file limits are used
to restrict the number of open file descriptors that any process may have open at one
time. Once a process exceeds this limit, any system call that attempts to open a new
file descriptor will fail and the kernel will return an EMFILE error code indicating too
many open files. Since sockets are a specific type of file descriptor, this same issue can
arise if a process opens sockets in excess of the file limit. Aware of this limitation, Tor
internally utilizes its own connection limit. For relays running on Linux and BSD, an
internal variable ConnLimit is set to the maximum limit as returned by the getrlimit()
system call; the ConnLimit is set to a hard-coded value of 15,000 on all other operating
systems. Each time a socket is opened and closed, an internal counter is incremented
and decremented; if a tor_connect() function call is made when this counter is above
the ConnLimit, it preemptively returns an error rather than waiting for one from the
connect system call.
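The following is a minimal sketch, in C, of the behavior just described: derive the connection limit from getrlimit() on Linux and BSD, fall back to the hard-coded 15,000 elsewhere, and preemptively fail a connect attempt once the open-connection counter reaches the limit. The function names are illustrative and do not reflect Tor's actual internals.

    #include <sys/resource.h>
    #include <sys/socket.h>
    #include <errno.h>

    static unsigned long conn_limit;
    static unsigned long open_conns;

    void init_conn_limit(void)
    {
    #if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__)
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
            conn_limit = rl.rlim_cur;   /* soft descriptor limit, e.g. 4096 */
        else
            conn_limit = 15000;
    #else
        conn_limit = 15000;             /* hard-coded value on other platforms */
    #endif
    }

    int tor_like_connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        if (open_conns >= conn_limit) {
            errno = EMFILE;             /* mirror the kernel's "too many open files" */
            return -1;                  /* refuse before ever calling connect(2) */
        }
        if (connect(fd, addr, len) < 0)
            return -1;
        open_conns++;                   /* decremented again when the socket closes */
        return 0;
    }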
6.3.2 Attack Strategies
There are several cases to consider in order to exploit open sockets as a relay attack
vector. Relay operators may be: (i) running Linux with the default maximum descriptor
limit of 4096; (ii) running Linux with a custom descriptor limit or running a non-Linux
OS with the hard-coded ConnLimit of 15,000; and (iii) running any OS and allowing
unlimited descriptors. We note that setting a custom limit generally requires root
privileges, although it does not require that Tor itself be run as the root user. Also note
that each Tor relay connects to every other relay with which it communicates, leading
to potentially thousands of open sockets under normal operation. In any case, the
adversary’s primary goal is to cause a victim relay to open as many sockets as possible.
Consuming Sockets at Exit Relays: In order to consume sockets at an exit relay,
an adversary can create multiple circuits through independent paths and request TCP
streams to various destinations. Ideally, the adversary would select services that use
persistent connections to ensure that the exit holds open the sockets. The adversary
could then send the minimal amount required to keep the connections active. Although
the adversary remains anonymous (because the victim exit relay does not learn the
adversary’s identity), keeping persistent connections active so that they are not closed
by the exit will come at a bandwidth cost.
Consuming Sockets at Any Relay: Bandwidth may be traded for CPU and memory
by using Tor itself to create the persistent connections, in which case relays in any
position may be targeted. This could be achieved by an adversary connecting several
Tor client instances directly to a victim relay; each such connection would consume a
socket descriptor. However, the victim would be able to determine the adversary’s IP
address (i.e., identity). The attack can also be done anonymously. The basic mechanism
to do so was outlined in The Sniper Attack, Section II-C-3 [54], where it was used in a
relay memory consumption denial of service attack. Here, we use similar techniques to
anonymously consume socket descriptors at the victim.
The attack is depicted in Figure 6.1a. First, the adversary launches several Tor
client sybils. A1 and A5 are used to build independent circuits through G1, M1, E1 and
G2, M2, E2, respectively, following normal path selection policies. These sybil clients
also configure a SocksPort to allow connections from other applications. Then, A2, A3,
and A4 use either the Socks4Proxy or Socks5Proxy options to extend new circuits to a
victim V through the Tor circuit built by A1. The A6, A7, and A8 sybils similarly extend
circuits to V through the circuit built by A5. Each Tor sybil client will create a new
tunneled channel to V, causing the exits E1 and E2 to establish new TCP connections
with V.
Each new TCP connection to V will consume a socket descriptor at the victim relay.
When using either the Socks4Proxy or the Socks5Proxy options, the Tor software
manual states that “Tor will make all OR connections through the SOCKS [4,5] proxy
at host:port (or host:1080 if port is not specified).” We also successfully verified this
behavior using Shadow. This attack allows an adversary to consume roughly one socket
for every sybil client, while remaining anonymous from the perspective of the victim.
Further, the exits E1 and E2, who themselves are unable to discover the identity of the
true attacker, will be blamed if any misbehavior is suspected. If Tor were to
use a new socket for every circuit, as suggested by PCTCP [3], then the adversary could
effectively launch a similar attack with only a single Tor client.
Consuming Sockets with PCTCP: PCTCP may potentially offer performance gains
by dedicating a separate TCP connection for every circuit. However, PCTCP widens
the attack surface and reduces the cost of the anonymous socket exhaustion attack
discussed above. If all relays use PCTCP, an adversary may simply send EXTEND cells
to a victim relay through any other relay in the network, causing a new circuit to
be built and therefore a socket descriptor opened at the victim. Since the cells are
being forwarded from other relays in the network, the victim relay will not be able to
determine who is originating the attack. Further, the adversary gets long-term persistent
connections cheaply with the use of the Tor config options MaxClientCircuitsPending
and CircuitIdleTimeout. The complexity of the socket exhaustion attack is reduced
and the adversary no longer needs to launch the tunneled sybil attack in order to
anonymously consume the victim’s sockets. By opening circuits with a single client, the
attacker will cause the victim’s number of open connections to reach the ConnLimit or
may cause relay stability problems (or both).
6.3.3 Effects of Socket Exhaustion
Socket exhaustion attacks may lead to reduced relay availability and client anonymity
if there is a descriptor limit in place, or may harm relay stability if there is no
limit or the limit is too high. We now explore these effects.
Limited Sockets: If there is a limit in place, then opening sockets will consume the
shared descriptor resource. An adversary that can consume all sockets on a relay will
have effectively made that relay unresponsive to connections by other honest nodes due
to Tor’s ConnLimit mechanism. If the adversary can persistently maintain this state
over time, then it has effectively disabled the relay by preventing it from making new
connections to other Tor nodes.
We ran a socket consumption attack against both vanilla Tor and PCTCP on live Tor
relays. Our attacker node created 1000 circuits every 6 seconds through
a victim relay, starting at time 1800 and ending at time 3600. Figure 6.1b shows the
victim relay’s throughput over time as new circuits are built and victim sockets are
consumed. After consuming all available sockets, the victim relay’s throughput drops
close to 0 as old circuits are destroyed, effectively disabling the relay. This, in turn, will
move honest clients’ traffic away from the relay and onto other relays in the network. If
the adversary is running relays, then it has increased the probability that its relays will
be chosen by clients and therefore has improved its ability to perform end-to-end traffic
correlation [62]. After the attacker stops the attack at time 3600, the victim relay’s
throughput recovers as clients are again able to successfully create circuits through it.
Unlimited Sockets: One potential solution to the availability and anonymity concerns
caused by a file descriptor limit is to remove the limit (i.e., set an unlimited limit),
meaning that ConnLimit gets set to 2^32 or 2^64. At the lesser of the two values, and
assuming that an adversary can consume one socket for every 512-byte Tor cell it sends
to a victim, it would take around 2 TiB of network bandwidth to cause the victim to
reach its ConnLimit. However, even if the adversary cannot cause the relay to reach
its ConnLimit, opening and maintaining sockets will still drain the relay’s physical
resources, and will increase the processing time associated with socket-based operations
in the kernel. By removing the open descriptor limit, a relay becomes vulnerable to
performance degradation, increased memory consumption, and an increased risk of being
killed by the kernel or otherwise crashing. An adversary may cause these effects through
[Figure 6.1: (a) Anonymous socket exhaustion attack using client sybils; (b) throughput of the victim relay during a socket exhaustion attack via circuit creation, with the shaded region representing when the attack was being launched; (c) memory consumption as a process opens sockets using libevent.]
the same attacks it uses against a relay with a default or custom descriptor limit.
Figure 6.1c shows how memory consumption increases with the number of open
sockets as a process opens over a million sockets. We demonstrate other performance
effects using a private Tor network of 5 machines in our lab. Our network consisted
of 4 relays total (one directory authority), each running on a different machine. We
configured each relay to run Tor v0.2.5.2-alpha modified to use a simplified version
of PCTCP that creates a new OR connection for every new circuit. We then launched
a simple file server, a Tor client, and 5 file clients on the same machine as the directory
authority. The file clients download arbitrary data from the server through a specified
path of the non-directory relays, always using the same relay in each of the entry, middle,
and exit positions. The final machine ran our Tor attacker client that we configured
to accept localhost connections over the ControlPort. We then used a custom python
script to repeatedly: (1) request that 1000 new circuits be created by the Tor client,
and (2) pause for 6 seconds. Each relay tracked socket and bandwidth statistics; we
use throughput and the time to open new sockets to measure performance degradation
effects and relay instability.
The stability effects for the middle relay are shown in Figure 6.2. The attack ran
for just over 2500 seconds and caused the middle relay to successfully open more than
50 thousand sockets. We noticed that our relays were unable to create more sockets
due to port allocation problems, meaning that (1) we were unable to measure the
potentially more serious performance degradation effects that occur when the socket
count exceeds 65 thousand, and (2) unlimited sockets may not be practically attainable
due to port exhaustion between a pair of relays. Figure 6.2a shows throughput over
time and Figure 6.2b shows a negative correlation of bandwidth to the number of open
sockets; both of these figures show a drop of more than 750 KiB/s in the 60 second
moving average throughput during our experiment. Processing overhead during socket
system calls over time is shown in Figure 6.2c, and the correlation to the number
of open sockets is shown in Figure 6.2d; both of these figures clearly indicate that
increases in kernel processing time can be expected as the number of open sockets
increases. Although the absolute time to open sockets is relatively small, it more than
doubled during our experiment; we believe this is a strong indication of performance
degradation in the kernel and that increased processing delays in other kernel socket
[Figure 6.2: (a) Throughput over time; (b) linear regression correlating throughput to the number of open sockets (r = -0.536, r^2 = 0.288); (c) kernel time to open new sockets over time; (d) linear regression correlating kernel time to open a new socket to the number of open sockets (r = 0.928, r^2 = 0.861).]
processing functions are likely as well.
6.4 IMUX
This section explores a new algorithm that takes advantage of multiple connections while
respecting the ConnLimit imposed by Tor and preventing the attacks discussed above in
Section 6.3. Both Torchestra and PCTCP can be seen as heuristically derived instances
of a more general resource allocation scheme with two components, one determining
how many connections to open between relays, and the second in selecting a connection
to schedule cells on. Torchestra’s heuristic is to fix the number of connections at two,
designating one for light traffic and the other for heavy, then scheduling cells based on
the traffic classification of each circuit. PCTCP keeps a connection open for each circuit
between two relays, with each connection devoted to a single circuit that schedules
cells on it. While there is a demonstrable advantage to being able to open multiple
connections between two communicating relays, it is important to have an upper limit
on the number of connections allowed to prevent anonymous socket exhaustion attacks
against relays, as shown in Section 6.3. In this section we introduce IMUX, a new
heuristic for handling multiple connections between relays that is able to dynamically
manage open connections while taking into consideration the internal connection limit
in Tor.
6.4.1 Connection Management
Similar to PCTCP, we want to ensure the allocation of connections each channel has
is proportional to the number of active circuits each channel is carrying, dynamically
adjusting as circuits are opened and closed across all channels on the relay. PCTCP
can easily accomplish this by dedicating a connection to each circuit every time one is
opened or closed, but since IMUX enforces an upper limit on connections, the connection
management requires more care, especially since both ends of the channel may have
different upper limits.
We first need a protocol dictating how and when relays can open and close connections. During the entire time the channel is open, only one relay is allowed to open
connections, initially set to the relay that creates the channel. However, at any time
Algorithm 1 Function to determine the maximum number of connections that can be open on a channel.

    function getMaxConns(nconns, ncircs)
        totalCircs ← len(globalActiveCircList)
        if ncircs is 0 or totalCircs is 0 then
            return 1
        end if
        frac ← ncircs / totalCircs
        totalMaxConns ← ConnLimit · τ
        connsLeft ← totalMaxConns − n_open_sockets()
        maxconns ← frac · totalMaxConns
        maxconns ← MIN(maxconns, nconns · 2)
        maxconns ← MIN(maxconns, nconns + connsLeft)
        return maxconns
    end function
either relay may close a connection if it detects the number of open sockets approaching the total connection limit. When a relay decides to close a connection, it must
first decide which connection should be closed. To pick which connection to close the
algorithm uses three criteria for prioritizing available connections: (1) Always pick connections that haven’t fully opened yet; (2) Among connections with state OPENING, pick
the one that was created most recently; and (3) If all connections are open, pick the one
that was least recently used. In order to close a connection C, first the relay must make
sure the relay on the other end of the channel is aware C is being closed to prevent data
from being written to C during the closing process. Once a connection C is chosen, the
initiating relay sends out an empty cell with a new command, CLOSING_CONN_BEGIN, and
marks C for close to prevent any more data from being written to it. Once the responding relay on the other end of the channel receives the cell, it flushes any data remaining
in the buffer for C, sends back a CLOSING_CONN_END cell to the initiating relay,
and closes the connection. Once the initiating relay receives the CLOSING_CONN_END, it
knows that it has received all data and is then able to proceed closing the socket.
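A small sketch of the connection-selection rules above, in C; the structure and field names are illustrative, not Tor's connection type. It prefers connections still in the OPENING state (newest first) and otherwise falls back to the least recently used open connection.

    #include <stddef.h>
    #include <time.h>

    enum conn_state { CONN_OPENING, CONN_OPEN };

    struct channel_conn {
        enum conn_state state;
        time_t created;       /* when the connection was created */
        time_t last_used;     /* when a cell was last written to it */
    };

    struct channel_conn *pick_conn_to_close(struct channel_conn *conns, size_t n)
    {
        struct channel_conn *best = NULL;
        for (size_t i = 0; i < n; i++) {
            struct channel_conn *c = &conns[i];
            if (best == NULL) { best = c; continue; }
            if (c->state == CONN_OPENING && best->state != CONN_OPENING) {
                best = c;                                     /* rule 1: prefer OPENING */
            } else if (c->state == CONN_OPENING && best->state == CONN_OPENING) {
                if (c->created > best->created) best = c;     /* rule 2: most recently created */
            } else if (best->state != CONN_OPENING) {
                if (c->last_used < best->last_used) best = c; /* rule 3: least recently used */
            }
        }
        return best;
    }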
Once a channel has been established, a housekeeping function is called on the channel
every second that then determines whether to open or close any connections. The
function to determine the maximum number of connections that can open on a channel
can be seen in Algorithm 1. We calculate a soft upper limit on the total number of
allowed open connections on the relay by taking ConnLimit and multiplying it by the
parameter τ ∈ (0, 1). ConnLimit is an internal variable that determines the maximum
number of sockets allowed to be open on the relay. On Linux based relays this is set by
calling getrlimit() to get the file limit on the machine, otherwise it is fixed at 15,000.
The parameter τ is a threshold value between 0 and 1 that sets a soft upper limit on
the number of open connections. Since once the number of open connections exceeds
ConnLimit all connect() calls will fail, we want some breathing room so new channels
and connections can still be opened, temporarily going past the soft limit, until other
connections can be closed to bring the relay back under the limit. To calculate the
limit for the channel we simply take this soft limit on the total of all open connections
and multiply it by the fraction of active circuits using the channel. This gives us an
upper bound on the connection limit for the channel. We then take the minimum of this
upper limit with the number of current open connections on the channel multiplied by
two. This is done to prevent rapid connection opening when a channel is first created,
particularly when the other relay has a much lower connection limit. Finally we take
the minimum of that calculation with the number of open connections plus the number
of connections that can be opened on the relay before hitting the soft limit (that could
be negative, signaling that connections need to be closed). Otherwise the channel could
create too many connections, driving the number of open connections past ConnLimit.
The housekeeping function is called every second and determines if any connections
need to be opened or closed. If we have too few connections and, based on the protocol
discussed in the previous paragraph, the relay is allowed to open connections on this
channel, enough connections are created to match the maximum connections allowed.
If we have too many connections open, we simply close enough until we are at the
connection limit. We use the previously discussed protocol for selecting and closing
connections, prioritizing newly created connections in an attempt to close unneeded
connections before the TLS handshake is started, preventing unnecessary overhead.
In addition to the housekeeping function, whenever a channel accepts an incoming
connection, it also checks to see if the number of connections exceeds the maximum
allowed returned by Algorithm 1. If so, it simply notifies the relay at the other end of
the channel that it is closing the connection, preventing that relay from opening any
more connections as dictated by the protocol.
6.4.2 Connection Scheduler
For connection scheduling, PCTCP assigns each circuit a dedicated connection to schedule cells on. Torchestra schedules cells from a circuit to either the light or heavy connection depending on how much data is being sent through the circuit. A circuit starts
out on the light connection, and if at some point its EWMA value crosses a threshold
it is switched to the heavy connection. To accommodate this, a switching protocol is
introduced so the relay downstream can be notified when a circuit has switched and on
what connection it can expect to receive cells.
While scheduling cells from a circuit on a single connection makes in-order delivery
easier by relying on TCP, with multiple connections per channel it is not necessary to do
so and in fact may not be optimal to keep this constraint. Similar to the uTLS implementation in Tor [93] and Conflux [40], we embed an 8-byte sequence number in the relay
header of all cells. This allows the channel to schedule cells across multiple connections
that then get reordered at the other end of the channel. With sequence numbers in
place and the capability to schedule cells from a circuit across multiple connections,
we can evaluate different scheduling algorithms attempting to increase throughput or
“fairness”, for example, where low traffic circuits have better performance than high
traffic ones. We will briefly cover the different algorithms and heuristics we can use.
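As a rough illustration of the reordering this enables, the sketch below (in C) keeps a small per-channel reassembly window keyed by the 8-byte sequence number and releases cells strictly in order, regardless of which connection they arrived on. The window size and cell size are illustrative assumptions, not Tor's actual formats; the scheduler descriptions that follow cover how cells get spread across connections in the first place.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define WINDOW 64          /* assumed reassembly window, in cells */
    #define CELL_LEN 514       /* assumed fixed cell size */

    struct reorder_buf {
        uint64_t next_seq;                   /* next sequence number to deliver */
        bool     present[WINDOW];
        unsigned char cell[WINDOW][CELL_LEN];
    };

    /* store an arriving cell; returns false if it falls outside the window */
    bool reorder_put(struct reorder_buf *rb, uint64_t seq, const unsigned char *cell)
    {
        if (seq < rb->next_seq || seq >= rb->next_seq + WINDOW)
            return false;
        unsigned slot = (unsigned)(seq % WINDOW);
        memcpy(rb->cell[slot], cell, CELL_LEN);
        rb->present[slot] = true;
        return true;
    }

    /* deliver the next in-order cell if it has already arrived */
    bool reorder_pop(struct reorder_buf *rb, unsigned char *out)
    {
        unsigned slot = (unsigned)(rb->next_seq % WINDOW);
        if (!rb->present[slot])
            return false;
        memcpy(out, rb->cell[slot], CELL_LEN);
        rb->present[slot] = false;
        rb->next_seq++;
        return true;
    }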
Circuit Round Robin: The first scheduler, shown in Figure 6.3a, emulates PCTCP by
assigning each circuit a single connection to transmit its cells. When circuits are added
to a channel, the scheduler iterates round robin style through the circuit list, assigning
circuits to successive connections. If there are more circuits than connections some
circuits will share a single connection. When a connection is closed, any circuit assigned
to it will be given a new one to schedule through, with the remaining connections iterated
in the same round robin style as before.
EWMA Mapping: Internal to Tor is a circuit scheduling algorithm, proposed by Tang
and Goldberg [94], that uses an exponentially weighted moving average (EWMA) to
compute how "noisy" each circuit is being, and then schedules circuits from quietest to
loudest when choosing which circuits to flush. Using the same algorithm, we compute
an EWMA value for each connection as well as for each circuit. Then, as seen in Figure 6.3b, the circuits and connections are ordered from lowest to highest EWMA value
and we attempt to map each circuit to a connection with a similar EWMA value. More
specifically, after sorting the circuits, we take the rank of each circuit, 1 ≤ ri ≤ ncircs,
and compute its percentile pi = ri/ncircs. We do the same with the connections,
computing their percentiles, denoted Cj. Then, to determine which connection to map a
circuit to, we pick the connection j such that Cj−1 < pi ≤ Cj.

[Figure 6.3: The three different connection scheduling algorithms used in IMUX: (a) Circuit Round Robin Scheduler; (b) EWMA Mapping Scheduler; (c) Shortest Queue Scheduler.]
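A compact sketch of the mapping step just described, in C: with circuits and connections both sorted by EWMA, a circuit of rank ri among ncircs circuits is assigned to the connection j whose percentile bracket contains pi = ri/ncircs. The function is illustrative and assumes evenly spaced connection percentiles Cj = j/nconns.

    #include <stddef.h>

    /* circ_rank is 1-based among ncircs circuits sorted by EWMA;
       returns a zero-based index of the connection to schedule on */
    size_t map_circuit_to_conn(size_t circ_rank, size_t ncircs, size_t nconns)
    {
        double p_i = (double)circ_rank / (double)ncircs;   /* circuit percentile */
        for (size_t j = 1; j <= nconns; j++) {
            double c_j = (double)j / (double)nconns;       /* connection percentile */
            if (p_i <= c_j)
                return j - 1;          /* first j with C_{j-1} < p_i <= C_j */
        }
        return nconns - 1;             /* defensive fallback for p_i at 1.0 */
    }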
Shortest Queue: While the EWMA mapping scheduler is built around the idea of
“fairness”, where we penalize high usage circuits by scheduling them on busier connections, we can construct an algorithm aimed at increasing overall throughput by always
scheduling cells in an opportunistic manner. The shortest queue scheduler, shown in
Figure 6.3c, calculates the queue length of each connection and schedules cells on connections with the shortest queue. This is done by taking the length of the internal
output buffer queue that Tor keeps for each connection, and adding to it the kernel
TCP buffer that each socket has; the latter is obtained using the ioctl() system call,
passing it the socket descriptor and TIOCOUTQ (see
http://man7.org/linux/man-pages/man2/ioctl.2.html).
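A minimal sketch of that queue-length estimate in C, assuming a Linux relay; the Tor-side buffer length is passed in by the caller since the internal buffer type is Tor-specific.

    #include <stddef.h>
    #include <sys/ioctl.h>      /* TIOCOUTQ (Linux) */

    /* Tor's own output buffer length plus unsent bytes in the kernel TCP send queue */
    size_t conn_queue_len(int sock_fd, size_t tor_outbuf_len)
    {
        int kernel_unsent = 0;
        if (ioctl(sock_fd, TIOCOUTQ, &kernel_unsent) < 0)
            kernel_unsent = 0;          /* on error, fall back to Tor's buffer alone */
        return tor_outbuf_len + (size_t)kernel_unsent;
    }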
6.4.3 KIST: Kernel-Informed Socket Transport
Recent research by Jansen et al. [4] showed that by minimizing the amount of data that
gets buffered inside the kernel and instead keeping it local to Tor, better scheduling
decisions can be made and connections can be written in an opportunistic manner increasing performance. The two main components of the algorithm are global scheduling
and autotuning. In vanilla Tor, libevent iterates through the connections in a round
robin fashion, notifying Tor that the connection can write. When Tor receives this notification, it performs circuit scheduling only on circuits associated with the connection.
Global scheduling takes a list of all connections that can write, and schedules between
circuits associated with every single connection, making circuit prioritization more effective. Once a circuit is chosen to write to a connection, autotuning then determines
how much data should be flushed to the output buffer and onto the kernel. By using a
variety of socket and TCP statistics, it attempts to write just enough to keep data in
the socket buffer at all times, without flushing everything, giving more control to Tor
for scheduling decisions.
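The sketch below illustrates, in C on Linux, one way the autotuning step could bound how much a connection is handed at once: query TCP_INFO and write no more than the space left in the congestion window, keeping everything else queued inside Tor. This mirrors the description above but is a hedged approximation, not the authors' exact KIST implementation.

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>    /* struct tcp_info, TCP_INFO (Linux) */

    /* upper bound, in bytes, on how much to flush to this socket right now */
    size_t kist_style_write_budget(int sock_fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);
        memset(&ti, 0, sizeof(ti));
        if (getsockopt(sock_fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
            return 0;                                /* let the caller fall back */
        if (ti.tcpi_unacked >= ti.tcpi_snd_cwnd)
            return 0;                                /* window full: keep cells in Tor */
        /* cwnd and unacked are counted in segments, so scale by the send MSS */
        return (size_t)(ti.tcpi_snd_cwnd - ti.tcpi_unacked) * ti.tcpi_snd_mss;
    }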
KIST can be used in parallel with Torchestra, PCTCP, and IMUX, and also as
a connection manager within the IMUX algorithm. After KIST selects a circuit and
decides how much data to flush, Torchestra and PCTCP make their own determination
on which connection to write to based on their internal heuristics. IMUX can also take
into account the connection selected by the KIST algorithm and use that to schedule
cells from the circuit. Similar to the other connection schedulers, this means that circuits
can be scheduled across different connections in a more opportunistic fashion.
6.5 Evaluation
In this section we discuss our experimental setup, the details of our implementations
of Torchestra and PCTCP (for comparison with IMUX), evaluate how the dynamic
connection management in IMUX is able to protect against potential denial of service
attacks by limiting the number of open connections, and finally compare performance
across the multiple connection schedulers, along with both Torchestra and PCTCP.
6.5.1 Experimental Setup
We perform experiments in Shadow v.1.9.2 [69, 67], a discrete event network simulator
capable of running real Tor code in a simulated network. Shadow allows us to create
large-scale network deployments that can be run locally and privately on a single machine, avoiding privacy risks associated with running on the public network that many
active users rely on for anonymity. Because Shadow runs the Tor software, we are able
to implement our performance enhancements as patches to Tor v0.2.5.2-alpha. We also
expect that Tor running in Shadow will exhibit realistic application-level performance
effects including those studied in this paper. Finally, Shadow is deterministic; therefore
our results may be independently reproduced by other researchers. Shadow also enables
us to isolate performance effects and attribute them to a specific set of configurations,
such as variations in scheduling algorithms or parameters. This isolation means that our
performance comparisons are meaningful independent of our ability to precisely model
the complex behaviors of the public Tor network.
We initialize a Tor network topology and node configuration and use it as a common
Shadow deployment base for all experiments in this section. For this common base, we
use the large Tor configuration that is distributed with Shadow. The techniques for
producing this model are discussed in detail in [95] and updated in [4]. It consists of 500
relays, 1350 web clients, 150 bulk clients, 300 perf clients and 500 file servers. Web clients
repeatedly download a 320 KiB file while pausing between 1 and 60 seconds after every
download. Bulk clients continuously download 5 MiB files with no pausing between
downloads. The perf clients download a file every 60 seconds, with 100 downloading a
50 KiB file, 100 downloading a 1 MiB file, and 100 downloading a 5 MiB file.
The Shadow perf clients are configured to mimic the behavior of the TorPerf clients
that run in the public Tor network to measure Tor performance over time. Since the
Shadow and Tor perf clients download files of the same size, we verified that the performance characteristics in our Shadow model were reasonably similar to the public
network.
6.5.2 Implementations
In the original Torchestra design discussed in [2], the algorithm uses EWMA in an attempt to classify each circuit as “light” or “heavy”. Since the EWMA value will depend
on many external network factors (available bandwidth, network load, congestion, etc.),
the algorithm uses the average EWMA value for the light and heavy connection as
benchmarks. Using separate threshold values for the light and heavy connection, when
a circuit either goes above or below the average multiplied by the threshold the circuit is
reclassified and is swapped to the other connection. The issue with this, as noted in [26],
is that web traffic tends to be bursty, causing temporary spikes in circuit EWMA values.
When this occurs it increases the circuit's chance of becoming misclassified and assigned
to the bulk connection. Doing so will in turn decrease the average EWMA of both the
light and bulk connections, making it easier for circuits to exceed the light connection's
threshold and harder for circuits to drop below the heavy connection's threshold, meaning
web circuits that get temporarily misclassified will find it more difficult to get reassigned
to the light connection. A better approach would be to use a more complex classifier
such as DiffTor [26] to determine if a circuit was carrying web or bulk traffic. For our
implementation, we have Torchestra use an idealized version of DiffTor where relays
have perfect information about circuit classification. When a circuit is first created by a
client, the client sends either a CELL_TRAFFIC_WEB or CELL_TRAFFIC_BULK cell notifying
each relay of the type of traffic that will be sent through the circuit. Obviously this
would be unrealistic to have in the live Tor network, but it lets us examine Torchestra
under an ideal situation.
For PCTCP there are two main components of the algorithm. First is the dedicated
connection that gets assigned to each circuit, and the second is replacing per connection
TLS encryption with a single IPSec layer between the relays, preventing an attacker
from monitoring a single TCP connection to learn information about a circuit. For our
purposes we are interested in the first component, performance gains from dedicating
a connection to each circuit. IPSec has some potential to increase performance,
since each connection no longer requires a TLS handshake that adds some overhead,
but there are a few obstacles noted in [3] that could hinder deployment. Furthermore, it
can be deployed alongside any algorithm looking to open multiple connections between
relays. For simplicity, our PCTCP implementation simply opens a new TLS connection
for each circuit created that will use the new connection exclusively for transferring
cells.
6.5.3 Connection Management
One of the main goals of the dynamic connection manager in IMUX is to avoid denial
of service attacks by consuming all available open sockets. To achieve this IMUX has
a soft limit that caps the total number of connections at ConnLimit · τ, where τ is
a parameter between 0 and 1. If this is set too low, we may lose out on potential
performance gains, while if it is too high we risk exceeding the hard limit ConnLimit,
causing new connections to error out and leaving the relay open to denial
of service attacks. During our experiments we empirically observed that τ = 0.9 was
the highest the parameter could be set without risking crossing ConnLimit, particularly
when circuits were being created rapidly causing high connection churn.
In order to find a good candidate for τ we setup an experiment in Shadow to see how
different values act under heavy circuit creation. The experiment consists of 5 clients,
5 guard relays, and 1 exit relay. Each client was configured to create 2-hop circuits
through a unique guard to the lone exit relay. The clients were configured to start
2 minutes apart, starting 250 circuits all at once, with the final client creating 1,000
circuits. The guards all had ConnLimit set to 4096 while the exit relay has ConnLimit
set to 1024. The experiment was run with IMUX enabled using τ values of 0.80, 0.90,
and 0.95. In addition, one run was conducted using PCTCP, contrasting it against the
[Figure 6.4: Number of open sockets at the exit relay in PCTCP compared with IMUX with varying τ parameters.]
connection management in IMUX. For these experiments we disabled the code in Tor
that throws an error once ConnLimit is actually passed to demonstrate the number of
sockets the algorithm would attempt to keep open under the different scenarios.
Figure 6.4 shows the number of open sockets at the exit relay as clients are started
and circuits created. As expected we see PCTCP opening connections as each circuit is
created, spiking heavily when the last client opens 1,000 circuits. IMUX, on the other
hand, is more aggressive in opening connections initially, which quickly approaches
the soft limit of ConnLimit ·τ as soon as the first set of circuits are created and then
plateaus, temporarily dropping as circuits are created and connections need to be closed
and opened. For most of the experiment, the different values of τ produce the same
trend, but there are some important differences. With τ = 0.95, after the 3rd client
starts at around time 240 seconds, we see a quick spike that goes above the 1024
ConnLimit. This happens because while we can open connections fairly fast, closing
connections requires the CLOSING_CONN_BEGIN and CLOSING_CONN_END cells to be sent
and processed, which can take some time. Neither τ = 0.90 nor τ = 0.80 has this
issue; both are able to reliably stay under 1,000 open connections the entire time.
We can therefore use τ = 0.90 in our experiments in order to maximize the number of
[Figure 6.5: Comparing the socket exhaustion attack in vanilla Tor with Torchestra, PCTCP, and IMUX (relay throughput in MiB/s over time).]
connections we can have open while minimizing the risk of surpassing ConnLimit.
Figure 6.5 shows the effects of the socket exhaustion attack discussed in Section 6.3
with Torchestra and IMUX included. Since Torchestra simply opens two connections
between each pair of relays, the attack is not able to consume all available sockets, leaving the
relay unaffected. The IMUX results show an initial slight drop in throughput as many
connections are being created and destroyed. However, throughput recovers to the levels
achieved with vanilla Tor and Torchestra as the connection manager stabilizes.
6.5.4 Performance
First we wanted to determine how the various connection schedulers covered in Section 6.4.2 perform compared to each other and against vanilla Tor. Figure 6.6 shows
the results of large scale experiments run with each of the connection schedulers operating with IMUX. Round robin clearly performs better than both EWMA mapping and
the shortest queue connection schedulers, at least with respect to time to first byte and
download time for web clients shown in Figures 6.6a and 6.6b. This isn’t completely
[Figure 6.6: Performance comparison of IMUX connection schedulers: (a) time to first byte; (b) time to last byte of 320 KiB; (c) time to last byte of 5 MiB.]
surprising for the shortest queue scheduler as it indiscriminately tries to push as much
data as possible, favoring higher bandwidth traffic at the potential cost of web traffic
performance. The EWMA mapping scheduler produces slightly better results for web
traffic compared to shortest queue scheduling, but it still ends up performing worse than
vanilla Tor. This is related to the issue with using EWMA for classification in Torchestra: web traffic tends to be sent in large bursts, causing the EWMA value to spike
rapidly and then decrease over time. So while under the EWMA mapping scheme the
first data to be sent is given high priority, as the EWMA value climbs the
data gets sent to busier connections, increasing the total time to last byte as a
consequence.
Figure 6.7 shows the download times when using IMUX with round robin connection
scheduling against vanilla Tor, Torchestra and PCTCP. While Torchestra and PCTCP
actually perform identically to vanilla Tor, IMUX sees an increase in performance both
for web and bulk downloads. Half of the web clients see an improvement of at least
12% in their download times, with 29% experiencing more than 20% improvement,
with the biggest reduction in download time seen at the 75th percentile, dropping from
4.5 seconds to 3.3 seconds. Gains for bulk clients are seen too, although not as large;
around 10% of clients see improvements of 10-12%. Time to first byte across all
clients improves slightly as shown in Figure 6.7a, with 26% of clients seeing reductions
ranging from 20-23% when compared to vanilla Tor, which drops down to 11% of
clients who see the same level of improvements compared to Torchestra and PCTCP.
We then ran large experiments with KIST enabled, with IMUX using KIST for
connection scheduling. While overall download times improved from the previous experiments, IMUX saw slower download times for web clients and faster downloads for
bulk clients, as seen in Figure 6.8. 40% of clients see an increase in time to first byte,
while 87% of bulk clients see their download times decrease from 10-28%. This is due to
the fact that one of the main advantages to using multiple connections per channel is
that it prevents bulk circuits from forcing web circuits to hold off on sending data due to
packet loss caused by the bulk circuit. In PCTCP, for example, this will merely cause the
bulk circuit's connection to become throttled while still allowing all web circuits to send
data. Since KIST forces Tor to hold on to cells for longer and only writes a minimal
[Figure 6.7: Performance comparison of IMUX to Torchestra [2] and PCTCP [3]: (a) time to first byte; (b) time to last byte of 320 KiB; (c) time to last byte of 5 MiB.]
amount to the kernel, it’s able to make better scheduling decisions, preventing web traffic from unnecessarily buffering behind bulk traffic. Furthermore, KIST is able to take
packet loss into consideration since it uses the TCP congestion window in calculating
how much to write to the socket. Since the congestion window is reduced when there is
packet loss, KIST will end up writing a smaller amount of data whenever it occurs.
6.6 Replacing TCP
Torchestra, PCTCP, and IMUX attempt to avoid the problem of cross-circuit interference by establishing multiple connections to balance transmitting data over. While these
can sometimes overcome the obstacles, they serve more as a workaround than an actual
solution. In this section we explore actually replacing TCP with a different transport
protocol that might be more suited for inter-relay communication.
6.6.1 Micro Transport Protocol
The benefit from increasing the number of TCP connections between relays is to limit
the effect of cross circuit interference. Each connection gets its own kernel level TCP
stack and state, minimizing the effects that congestion experienced on one circuit has
on all the other circuits. However, as discussed in Section 6.3, making it easier for an
adversary to open sockets on a relay can make launching a socket exhaustion attack
trivial. Since the only thing we gain when opening multiple TCP connections is separate
TCP stacks, it is feasible to gain the performance boost while maintaining a single open
connection between relays by using a user space stack instead. Reardon and Goldberg
[15, 16] did just this, using a user-space TCP stack over a UDP connection, using DTLS
for encryption. One major issue with this approach is the lack of a complete and stable
user-space TCP stack, impacting performance and making it difficult to fully integrate into Tor.
Beyond that, it is not clear that TCP is the best protocol to use for inter-relay
communication. Murdoch [96] covers some other potential protocols that could be
used instead of TCP. Due to its use in the
BitTorrent file sharing protocol, one of the more mature user space protocols is the
micro transport protocol (uTP). It provides reliable in-order delivery, using a low extra
delay background transport (LEDBAT) [97] algorithm for congestion control, designed
[Figure 6.8: Performance comparison of IMUX to Torchestra [2] and PCTCP [3], all using the KIST performance enhancements [4]: (a) time to first byte; (b) time to last byte of 320 KiB; (c) time to last byte of 5 MiB.]

[Figure 6.9: Operation flows for (a) vanilla Tor and (b) Tor using xTCP.]
for utilizing the maximum amount of bandwidth while minimizing latencies. Instead of
using packet loss to signal congestion, uTP uses queuing delay on the sending path to
inform the transfer rate on the connection. The algorithm has a target queuing delay,
and with the end-points collaborating to actively measure delays at each end, the sender
can increase or decrease its congestion window based on how far off it is from the target
delay.
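The core of that LEDBAT-style controller can be summarized with the window update below, written in C; the 100 ms target matches LEDBAT's nominal value, while the gain and the one-segment floor are simplifying assumptions rather than uTP's exact constants.

    /* One LEDBAT-style window update: grow when measured queuing delay is below
     * the target, shrink when above, scaled by how far off target we are. */
    double ledbat_update_cwnd(double cwnd_bytes, double queuing_delay_ms,
                              double bytes_acked, double mss_bytes)
    {
        const double target_ms = 100.0;   /* nominal LEDBAT target queuing delay */
        const double gain = 1.0;          /* assumed gain */

        double off_target = (target_ms - queuing_delay_ms) / target_ms;
        cwnd_bytes += gain * off_target * bytes_acked * mss_bytes / cwnd_bytes;

        if (cwnd_bytes < mss_bytes)
            cwnd_bytes = mss_bytes;       /* never shrink below one segment */
        return cwnd_bytes;
    }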
6.6.2 xTCP
While using such a protocol might seem promising, replacing the transport layer in Tor
is a very difficult task. As Karsten, Murdoch, and Jansen note in [98]:
"One conclusion from experimenting with our libutp-based datagram transport implementation is that it's far from trivial to replace the transport in Tor."
Given this large overhead in integrating different transport protocols into Tor, even stable stacks will be difficult to implement and evaluate for performance effects in Tor. To
get around this engineering hurdle we take advantage of the fact that Tor is agnostic to
what transport protocol is used for inter-relay communication, as long as the protocol
  socket      Create UDP socket and register it with the transport library
  bind        No operation, UDP sockets do not bind to an address/port
  listen      No operation, UDP sockets do not listen on an address/port
  accept      Check for new incoming connections, return new descriptor if found
  connect     Create and send CONNECT message to address/port
  send        Append data to write buffer and flush maximum bytes from buffer
  recv        Read and append data to buffer queue, return at most len bytes
  close       Send any close messages that might be needed
  epoll_ctl   Call libc epoll_ctl on all descriptors related to the socket
  epoll_wait  Call libc epoll_wait on all descriptors and read in from all that are
              readable. For every existing socket, if the read buffer is non-empty
              set EPOLLIN, and if the write buffer is empty set EPOLLOUT.

Table 6.1: Functions that xTCP needs to intercept and how xTCP must handle them.
guarantees in-order reliable delivery. We develop xTCP, a stand-alone library that can
be used to replace TCP with any other transport protocol. The design is shown in
Figure 6.9. Using LD_PRELOAD xTCP can intercept all socket related functions. Instead
of passing them onto the kernel to create kernel space TCP stacks, xTCP creates UDP
connections which can be used by other transport libraries (e.g. uTP, SCTP, QUIC)
for establishing connections, sending and receiving data, congestion control, and data
recovery. Table 6.1 lists the libc functions that xTCP needs to intercept in order to
replace TCP used in Tor. All functions involved with creating, managing, and communicating with sockets need to be handled. For scheduling between sockets Tor uses
libevent [22] which is built on top of epoll, so xTCP also needs to intercept epoll functions so Tor can be properly notified when sockets can be read from and written to.
To support other applications that do not use epoll, other polling functions such as
poll and select would need to be intercepted and handled internally by xTCP. Finally,
to allow applications the ability to use both TCP and xTCP sockets, we add an option
for the socket function to only intercept calls when the parameter protocol is set to
a newly defined IPPROTO_XTCP.
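A minimal sketch of the interposition mechanism, in C: the library overrides socket(), redirects requests made with a (hypothetical) IPPROTO_XTCP protocol value onto a plain UDP socket for the user-space stack to drive, and passes every other call through to libc via dlsym(RTLD_NEXT, ...). A real xTCP module would intercept the remaining functions in Table 6.1 the same way.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define IPPROTO_XTCP 253    /* assumed: a protocol number unused by the kernel */

    static int (*real_socket)(int, int, int);

    int socket(int domain, int type, int protocol)
    {
        if (!real_socket)      /* look up the next (libc) definition of socket() */
            real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

        if (protocol == IPPROTO_XTCP) {
            /* hand back a UDP socket for the user-space stack (e.g. uTP) to run
               over; a full implementation would also register it with that stack */
            return real_socket(domain, SOCK_DGRAM, IPPROTO_UDP);
        }
        return real_socket(domain, type, protocol);  /* untouched TCP/other sockets */
    }

Such a library would be compiled as a shared object (for example with gcc -shared -fPIC -ldl) and activated by setting LD_PRELOAD before launching Tor, which is what lets the transport be swapped without modifying Tor's own socket-handling code.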
6.6.3 Experimental Setup
With the xTCP library we built a module utilizing the uTP library
(https://github.com/bittorrent/libutp) for reliable
transmission over UDP. For each function in Table 6.1 there exists a corresponding
uTP library function, excluding the epoll functions. Additionally Tor was modified to
open IPPROTO_XTCP sockets when creating OR connections. This enables us to use any
protocol for inter-relay communications, while still using TCP for external connections
(e.g. edge connections such as exit to server). We also added the option to establish
a separate connection for each circuit similar to PCTCP. While any deployed implementation would want to internally manage the connection stack within one socket to
prevent socket exhaustion attacks, for our purposes this achieves the same performance
benefit. Experiments are performed in Shadow v1.10.2 running Tor v0.2.5.10 with
the xTCP and PCTCP modifications. We used the large experimental setup deployed
with Shadow, similar to the one described in Section 6.5.1, using 500 relays and 1800
clients. Web clients still download 320 KiB files while randomly pausing and bulk clients
download 5 MiB files continuously. The vanilla experiments used TCP for inter-relay
communication with PCTCP disabled, while the uTP experiments had PCTCP enabled
to ensure each circuit had its own stack.
6.6.4 Performance
Figure 6.10 shows the results of the experiments, comparing vanilla Tor with TCP and
using uTP. One of the biggest performance differences is in time to first byte show
in Figure 6.10a. When using uTP almost every download retrieved the first byte in
under 2 seconds, while the vanilla experiments only achieved this for 70% of downloads.
Furthermore when using TCP for inter-relay communication 8% of downloads had to
wait more than 10 seconds to download the first byte. This is due to a combination of
the separate uTP stack for each circuit (and thus client download) in addition to the
fact that uTP is built to ensure low latency performance. While the time to first byte
improved dramatically for all clients, Figure 6.10b shows that total download time for
web clients were fairly identical in both cases, with a very small number of web clients
experiencing downloads that took 2-3 minutes to complete, compared to the median
download time of 12-13 seconds. On the other hand, bulk clients almost universally saw
increased performance when using uTP, as seen in Figure 6.10c. The majority of bulk
clients experienced 50-250% faster download times compared to vanilla Tor.
6.7 Discussion
In this section we discuss the potential for an adversary to game the IMUX algorithm,
along with the limitations of the protections against the denial of service attacks discussed in Section 6.3 and some possible ways to protect against them.
Active Circuits: The IMUX connection manager distributes connections to channels
based on the fraction of active circuits contained on each channel. An adversary could
game the algorithm by artificially increasing the number of active circuits on a channel they're using, heavily shifting the distribution of connections to that channel and increasing throughput. The ease of the attack depends heavily on how we define an "active circuit", which can be done in one of three ways: (1) the number of open circuits that haven't been destroyed; (2) the number of circuits that have sent a minimal number of cells, measured either in raw numbers or using an EWMA with a threshold; or (3) using DiffTor, only considering circuits classified as web or bulk. No matter what definition is used an adversary will still technically be able to game the algorithm; the major difference is the amount of bandwidth that needs to be expended by the adversary to accomplish their task.
If we just count the number of open circuits, an adversary could very easily restrict
all other channels to only one connection, while the rest are dedicated to the channel
they’re using. Using an EWMA threshold or DiffTor classifier requires the adversary to
actually send data over the circuits, with the amount determined by what thresholds
are in place. So while the potential to game the algorithm will always exist, the worst an adversary can do is reduce all other IMUX channels to one connection, the same as
how vanilla Tor operates.
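To make the EWMA-based definition of an active circuit concrete, the following is a minimal Python sketch; the decay factor, threshold, and per-interval bookkeeping are illustrative assumptions and are not values taken from IMUX or Tor.

class CircuitActivity:
    def __init__(self, decay=0.9, threshold=5.0):
        self.decay = decay          # weight given to history each interval (assumed value)
        self.threshold = threshold  # EWMA cells per interval needed to count as active (assumed)
        self.ewma = 0.0

    def record_interval(self, cells_sent):
        # Standard EWMA update, called once per fixed-length interval.
        self.ewma = self.decay * self.ewma + (1.0 - self.decay) * cells_sent

    def is_active(self):
        return self.ewma >= self.threshold

def active_circuit_fraction(channels):
    # channels: dict mapping a channel id to a list of CircuitActivity objects.
    # Returns the fraction of active circuits on each channel, the quantity an
    # IMUX-style connection manager would use when apportioning connections.
    totals = {ch: sum(1 for c in circs if c.is_active()) for ch, circs in channels.items()}
    overall = sum(totals.values()) or 1
    return {ch: n / overall for ch, n in totals.items()}

Under such a definition an adversary must keep each of its circuits' EWMA above the threshold, which is exactly the bandwidth cost referred to above.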
Defense Limitations: By taking into account the connection limit of the relay, the
dynamic connection manager in IMUX is able to balance the performance gains realized
by opening multiple connections while protecting against the new attack surface made
available with PCTCP that leads to a low-bandwidth denial of service attack against
any relay in the network. However there still exists potential socket exhaustion attacks
[Figure 6.10: Performance comparison of using TCP connections in vanilla Tor and uTP for inter-relay communication. (a) Time to first byte; (b) time to last byte of 320 KiB; (c) time to last byte of 5 MiB.]
inherent to how Tor operates. The simplest of these simply requires opening streams
through a targeted exit, causing sockets to be opened to any chosen destination. Since
this is a fundamental part of how an exit relay operates, there is little that can be done
to directly defend against this attack, although it can be made potentially more difficult
to perform. Exit relays can attempt to keep sockets short-lived and close ones that have
been idle for a short period of time, particularly when close to the connection limit.
They can also attempt to prioritize connections between relays instead of ones exiting
to external servers. While preventing access at all is undesirable, this may be the lesser
of two evils, as it will still allow the relay to participate in the Tor network, possibly
preventing adversarial relays from being chosen. This kind of attack only affects exit relays; however, the technique utilizing the Socks5Proxy option can target any relay in
the network. Since this is performed by tunneling OR connections through a circuit,
the attack is in effect anonymous, meaning relays cannot simply utilize IP blocking to
protect against it. One potential solution is to require clients to solve computationally
intense puzzles in order to create a circuit as proposed by Barbera et al. [99]. This
reduces the ease with which a single adversary is able to mass produce circuits that consume socket descriptors. Additionally, since this attack requires the client to send
EXTEND cells through the exit to initiate a connection through the targeted relay, exits
could simply disallow connections back into the Tor network for circuit creation. This
would force an adversary to have to directly connect to whichever non-exit relay they
were targeting, in which case IP blocking becomes a viable strategy to protect against
such an attack once it is detected.
6.8 Conclusion
In this paper we present a new class of socket exhaustion attacks that allow an adversary to anonymously perform denial of service attacks against relays in the Tor network.
We outline how PCTCP, a new transport proposal, introduces a new attack surface in
this new class of attacks. In response, we introduce a new protocol, IMUX, generalizing the designs of PCTCP and Torchestra, that is able to take advantage of opening
multiple connections between relays while still able to defend against these socket exhaustion attacks. Through large scale experiments we evaluate a series of connection
schedulers operating within IMUX, look at the performance of IMUX with respect to
vanilla Tor, Torchestra and PCTCP, and investigate how all these algorithms operate
with a newly proposed prototype, KIST. Finally, we introduce xTCP, a library that
can replace TCP connections with any reliable in-order transport protocol. Using the
library we show how the micro transport protocol (uTP) can be used instead of TCP
to achieve increased performance while protecting against the circuit opening socket
exhaustion attack introduced by PCTCP.
Chapter 7
Anarchy in Tor: Performance Cost of Decentralization
7.1 Introduction
The price of anarchy refers to the performance cost borne by systems where individual agents act in a locally selfish manner. In networking this specific problem is referred to as selfish routing [19, 17], with users making decisions locally in accordance with their
own interest. In this case the network as a whole is left in a sub-optimal state, with
network resources underutilized, degrading client performance. While Tor clients are not
necessarily selfish, circuit selections are made locally without consideration of the state
of the global network. To this end, several researchers have investigated improvements
to path selection [36], such as to consider latency [34] or relay congestion [35] in an
attempt to improve client performance.
In this section we explore the price of anarchy in Tor, examining how much performance is sacrificed to accommodate decentralized decision making. To this end we
design and implement a central authority in charge of circuit selection for all clients. With
a global view of the network the central authority can more intelligently route clients
around bottlenecks, increasing network utilization. Additionally we perform competitive analysis against the online algorithm used by the central authority. With an offline
genetic algorithm for circuit selection we can examine how close to an “ideal” solution
the online algorithm is. Using the same concepts from the online algorithm, we develop
a decentralized avoiding bottleneck relay algorithm (ABRA) allowing clients to make
smarter circuit selection decisions. Finally we perform a privacy analysis when using
ABRA for circuit selection, considering both information leakage and how an active
adversary can abuse the protocol to reduce client anonymity.
7.2 Background
When a client first joins the Tor network, it downloads a list of relays with their relevant
information from a directory authority. The client then creates 10 circuits through a
guard, middle, and exit relay. The relays used for each circuit are selected at random
weighted by bandwidth. For each TCP connection the client wishes to make, Tor creates
a stream which is then assigned to a circuit; each circuit will typically handle multiple
streams concurrently. The Tor client will find a viable circuit that can be used for the stream, which will then be used for the next 10 minutes or until the circuit becomes
unusable. After the 10 minute window a new circuit is selected to be used for the
following 10 minutes, with the old circuit being destroyed after any stream still using it
has ended.
To improve client performance a wide range of work has been done on improving relay
and circuit selection in Tor. Akhoondi, Yu, and Madhyastha developed LASTor [34]
which builds circuits based on inter-relay latency to even further reduce client latency
when using circuits. Wang et al. build a congestion aware path selection [35] that can
use active and passive round trip time measurements to estimate circuit congestion.
Clients can then select circuits in an attempt to avoid circuits with congested relays.
AlSabah et al. create a traffic splitting algorithm [40] that is able to load balance
between two circuits with a shared exit relay, allowing for clients to dynamically adapt
to relays that become temporarily congested.
7.3 Maximizing Network Usage
To explore the price of anarchy in Tor we want to try and maximize network usage by
using a central authority for all circuit selection decisions. To this end we design and
implement online and offline algorithms with the sole purpose of achieving maximal
network utilization. The online algorithm is confined to processing requests as they
are made, meaning that at any one time it can know all current active downloads and
which circuits they are using, but has no knowledge of how long the download will last
or the start times of any future downloads. The offline algorithm, on the other hand,
has full knowledge of the entire set of requests that will be made, both the times that
the downloads will start and how long they will last.
7.3.1 Central Authority
The online circuit selection algorithm used by the central authority is based on the delay-weighted capacity (DWC) routing algorithm, designed with the goal of minimizing the
“rejection rate” of requests in which the bandwidth requirements of the request cannot
be satisfied. In the algorithm, the network is represented as an undirected graph, where
each vertex is a router and edges represent links between routers, with the bandwidth
and delay of each link assigned to the edge. For an ingress-egress pair of routers (s, t),
the algorithm continually extracts a least delay path LPi , adds it to the set of all least
delay paths LP , and then removes all edges in LPi from the graph. This step is repeated
until no paths exist between s and t in the graph, leaving us with a set of least delay
paths LP = {LP1 , LP2 , . . . , LPk }.
For each path LPi we have the residual bandwidth Bi, which is the minimum bandwidth across all links, and the end-to-end delay Di across the entire path. The delay-weighted capacity (DWC) of the ingress-egress pair (s, t) is then defined as DWC = Σ_{i=1..k} Bi/Di. The link that determines the bandwidth for a path is the critical link, with the set of all critical links represented by C = {C1, C2, . . . , Ck}. The basic idea is that when picking a path, we want to avoid critical links as much as possible. To do so, each link is assigned a weight based on how many times it is a critical link in a path: wl = Σ_{i : l ∈ Ci} αi. The alpha value can take one of three forms: (1) the number of times a link is critical (αi = 1); (2) the overall delay of the path (αi = 1/Di); (3) the delay and bandwidth of the path (αi = 1/(Bi · Di)). With each link assigned a weight, the
routing algorithm simply chooses the path with the lowest sum of weights across all
links in the path. Once a path is chosen, the bandwidth of the path is subtracted from
the available bandwidth of all links in the path. Any links with no remaining available
bandwidth are removed from the graph and the next path can be selected.
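The following Python sketch illustrates the DWC and critical-link weighting just described, assuming each link object is hashable and carries bandwidth and delay attributes; the function and attribute names are our own, not taken from the original DWC work.

def dwc_and_link_weights(paths, alpha="count"):
    # paths: the extracted least-delay paths, each a list of link objects.
    weights = {}   # link -> accumulated weight w_l
    dwc = 0.0
    for path in paths:
        B = min(link.bandwidth for link in path)                 # residual bandwidth B_i
        D = sum(link.delay for link in path)                     # end-to-end delay D_i
        dwc += B / D
        critical = min(path, key=lambda link: link.bandwidth)    # critical link C_i
        if alpha == "count":
            a = 1.0                                              # alpha_i = 1
        elif alpha == "delay":
            a = 1.0 / D                                          # alpha_i = 1/D_i
        else:
            a = 1.0 / (B * D)                                    # alpha_i = 1/(B_i * D_i)
        weights[critical] = weights.get(critical, 0.0) + a
    return dwc, weights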
The main difference between the DWC routing algorithm and circuit selection in
Tor is we do not need to extract viable paths; when a download begins Tor clients have
a set of available circuits and we simply need to select the best one. For the online
circuit selection algorithm the main take away is that we want to avoid low bandwidth
relays which are a bottleneck on lots of active circuits. Since the central authority has
a global view of the network they can, in real time, make these calculations and direct
clients around potential bottlenecks.
For this we need an algorithm to identify bottleneck relays given a set of active
circuits. The algorithm has a list of relays and their bandwidth, along with the list of
active circuits. First we extract the relay with the lowest bandwidth per circuit, defined
as r_circBW = r_bw / |{c ∈ circuits | r ∈ c}|. This relay is the bottleneck on every circuit it is on in the circuit list, so we can assign it a weight of r_w = 1/r_bw. Afterwards, for each circuit that r appears on we can assign the circuit a bandwidth value of r_circBW. Additionally, for every other relay r′ that appears in those circuits, we decrement its available bandwidth r′_bw = r′_bw − r_circBW, accounting for the bandwidth being consumed by the circuit. Once this is completed, all circuits that r appears on are removed from the circuit list. Additionally, any relay that has a remaining bandwidth of 0 (which will always include r) is removed from the list of relays. This is repeated until there are no remaining circuits in the circuit list. The pseudocode for this is shown in Algorithm 2 and an example of this process is shown in Figure 7.1.

[Figure 7.1: Example of the circuit bandwidth estimate algorithm. At each step the relay with the lowest bandwidth is selected, its bandwidth is evenly distributed among the remaining circuits it is on, and each relay on those circuits has its bandwidth decremented by the circuit bandwidth.]
Note that when removing relays from the list when they run out of available bandwidth, we want to be sure that there are no remaining active circuits that the relay appears on. For the bottleneck relay r this is trivial since we actively remove any circuit that r appears on. But we also want to make sure this holds if another relay r′ is removed. So let relays r and r′ have bandwidths b and b′ and be on c and c′ circuits respectively. Relay r is the one selected as the bottleneck relay, and both relays have a bandwidth of 0 after the bandwidths are updated. By definition r is selected such that b/c ≤ b′/c′. When iterating through the c circuits, let n be the number of circuits that r′ is on. Note that this means that n ≤ c′. After the circuits have been iterated through and the operations performed, r′ will have b′ − (b/c) · n bandwidth left and be on c′ − n remaining circuits. We assume that r′ has 0 bandwidth, so b′ − (b/c) · n = 0. We want to show this means that r′ is on no more circuits, so that c′ − n = 0. We have

b′ − (b/c) · n = 0 ⇒ b′ = (b/c) · n ⇒ n = (b′ · c)/b
Algorithm 2 Calculate relay weight and active circuit bandwidth

function CalcCircuitBW(activeCircuits)
    activeRelays ← GetRelays(activeCircuits)
    while not activeCircuits.empty() do
        r ← GetBottleneckRelay(activeRelays)
        r.weight ← 1/r.bw
        circuits ← GetCircuits(activeCircuits, r)
        circuitBW ← r.bw / circuits.len()
        for c ∈ circuits do
            c.bw ← circuitBW
            for circRelay ∈ c.relays do
                circRelay.bw −= c.bw
                if circRelay.bw = 0 then
                    activeRelays.remove(circRelay)
                end if
            end for
            activeCircuits.remove(c)
        end for
    end while
end function
So the number of circuits that r′ is left on is

c′ − n = c′ − (b′ · c)/b = (b · c′ − b′ · c)/b

However, r was picked such that

b/c ≤ b′/c′ ⇒ b · c′ ≤ b′ · c ⇒ b · c′ − b′ · c ≤ 0

and since b > 0 we know that

c′ − n = (b · c′ − b′ · c)/b ≤ 0

and from how n was defined we have

n ≤ c′ ⇒ 0 ≤ c′ − n

which implies that 0 ≤ c′ − n ≤ 0 ⇒ c′ − n = 0. So we know that if r′ has no remaining bandwidth it will not be on any remaining active circuit.
Now when a download begins, the algorithm knows the set of circuits the download
can use and a list of all currently active circuits. The algorithm to compute relay weights
is run and the circuit with the lowest combined weight is selected. If there are multiple
circuits with the same lowest weight then the central authority selects the circuit with
the highest bandwidth, defined as the minimum remaining bandwidth (after the weight
computation algorithm has completed) across all relays in the circuit.
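The sketch below is a runnable Python rendering of Algorithm 2 together with the selection rule just described; relays are ids mapped to bandwidth and circuits are tuples of relay ids, a simplification of our own rather than the data structures used in our experiment tooling.

def calc_relay_weights(relay_bw, active_circuits):
    remaining = dict(relay_bw)              # remaining bandwidth per relay
    circuits = list(active_circuits)        # circuits are tuples of relay ids
    weight = {r: 0.0 for r in remaining}
    circuit_bw = {}
    while circuits:
        def per_circuit_bw(r):
            n = sum(1 for c in circuits if r in c)
            return remaining[r] / n if n and remaining[r] > 0 else float("inf")
        r = min(remaining, key=per_circuit_bw)
        if per_circuit_bw(r) == float("inf"):
            break                           # no relay with bandwidth left carries a circuit
        weight[r] = 1.0 / relay_bw[r]       # r is the bottleneck on every circuit it is on
        on = [c for c in circuits if r in c]
        share = remaining[r] / len(on)      # r's bandwidth split evenly across its circuits
        for c in on:
            circuit_bw[c] = share
            for rel in c:
                remaining[rel] = max(0.0, remaining[rel] - share)
            circuits.remove(c)
    return weight, remaining, circuit_bw

def pick_circuit(candidates, relay_bw, active_circuits):
    weight, remaining, _ = calc_relay_weights(relay_bw, active_circuits)
    total = lambda c: sum(weight.get(r, 0.0) for r in c)
    best = min(total(c) for c in candidates)
    tied = [c for c in candidates if total(c) == best]
    # Tie-break: highest minimum remaining bandwidth across the circuit's relays.
    return max(tied, key=lambda c: min(remaining.get(r, 0.0) for r in c))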
7.3.2 Offline Algorithm
An offline algorithm is allowed access to the entire input set while operating. For circuit
selection in Tor this means prior knowledge of the start and end time of every download
that will occur. While obviously impossible to do on a live network, even most Tor
client models [70] consist of client start times, along with file sizes to download and
how long to pause between downloads. To produce a comparison, we use the offline
algorithm to rerun circuit selection on observed download start and end times. We
then rerun the downloads with the same exact start and end times with the new circuit
selection produced by the offline algorithm to see if network utilization improves. As
input the algorithm takes a list of downloads di with their corresponding start and end
times (si , ei ) and set of circuits Ci = {ci1 , ci2 , ...} available to the download, along with
the list of relays rj and their bandwidth bwj . The algorithm then outputs a mapping
di → cij ∈ Ci with cij = hr1 , r2 , r3 i.
While typical routing problems are solved in terms of graph theory algorithms (e.g.
max-flow min-cut), the offline circuit selection problem more closely resembles typical
job scheduling problems [100]. There are some complications when it comes to precisely defining the machine environment, but the biggest issue is that the scheduling problems which most closely resemble circuit selection have been shown to be NP-hard. Due to these
complications we develop a genetic algorithm, an approach which has been shown to perform well on
other job scheduling problems, in order to compute a lower-bound on the performance
of an optimal offline solution.
The important pieces of a genetic algorithm are the seed solutions, how to breed
solutions, how to mutate a solution, and the fitness function used to score and rank
solutions. In our case a solution consists of a complete mapping of downloads to circuits.
To breed two parent solutions, for each download we randomly select a parent solution,
and use the circuit selected for the download in that solution as the circuit for the child
solution. For mutation, after a circuit has been selected for the child solution, with
a small probability m% we randomly select a relay on the circuit to be replaced with
a different random relay. Finally, to score a solution we want a fitness function that
estimates the total network utilization for a given circuit to download mapping.
To estimate network utilization for a set of downloads, we first need a method for
calculating the bandwidth being used for a fixed set of active circuits. For this we
can use the weight calculation algorithm from the previous section to estimate each
active circuit’s bandwidth, which can then be aggregated into total network bandwidth.
To compute the entire amount of bandwidth used across an entire set of downloads, we
extract all start and end times from the downloads and sort them in ascending order. We
then iterate over the times, and at each time ti we calculate the bandwidth being used
as bwi and then increment total bandwidth totalBW = totalBW + bwi · (ti+1 − ti ). The
idea here is that between ti and ti+1 , the same set of circuits will be active consuming
bwi bandwidth, so we add the product to total bandwidth usage. This is used for the
fitness function for the genetic algorithm, scoring a solution based on the estimated
total bandwidth usage. We consider two different methods for seeding the population,
explained in the next section.
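As a sketch of the breeding, mutation, and fitness steps, the following builds on the calc_relay_weights sketch above; the per-download parent choice, the mutation handling, and the assumption that each download uses a distinct circuit tuple are our own simplifications.

import random

def breed(parent_a, parent_b, relays, m=0.01):
    # parent_*: dict mapping a download id to its selected circuit (a relay-id tuple).
    child = {}
    for d in parent_a:
        circuit = random.choice((parent_a, parent_b))[d]      # pick a parent per download
        if random.random() < m:                               # mutation: swap one relay
            circuit = list(circuit)
            circuit[random.randrange(len(circuit))] = random.choice(relays)
            circuit = tuple(circuit)                          # validity checks omitted
        child[d] = circuit
    return child

def estimate_network_usage(downloads, relay_bw):
    # downloads: list of (start, end, circuit); this is the fitness of one candidate mapping.
    events = sorted({t for s, e, _ in downloads for t in (s, e)})
    total = 0.0
    for t0, t1 in zip(events, events[1:]):
        active = [c for s, e, c in downloads if s <= t0 and e >= t1]
        if not active:
            continue
        _, _, circuit_bw = calc_relay_weights(relay_bw, active)
        total += sum(circuit_bw.values()) * (t1 - t0)         # bw_i * (t_{i+1} - t_i)
    return total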
7.3.3 Circuit Sets
In both the online and offline circuit selection algorithms, each download has a set of
potential circuits that can be selected from. For the online algorithm we want to use
the full set of all valid circuits. We consider a circuit r1, r2, r3 valid if r3 is an actual exit relay, and if r1 ≠ r2, r2 ≠ r3, and r1 ≠ r3. Note we do not require that r1 actually has
the guard flag set. This is because while there are mechanisms internally in Tor that
prevent any use of a non-exit relay being used as an exit relay, there is nothing stopping
a client from using a non-guard relay as their guard in a circuit. We also remove any
“duplicate” circuits that contain the same set of relays, just in a different order. This
is done since neither algorithm considers inter-relay latency, only relay bandwidth, so
there is no need to consider circuits with identical relay sets.
For the offline algorithm we need to carefully craft the set of circuits as to prevent
the genetic algorithm from getting stuck at a local maximum. The motivation behind
[Figure 7.2: Example showing that adding an active circuit resulted in network usage dropping from 20 to 15 MBps.]
generating this pruned set can be seen in Figure 7.2. It shows that we can add an
active circuit and actually reduce the amount of bandwidth being pushed through the
Tor network. In the example shown it will always be better to use one of the already
active circuits shown on the left instead of the new circuit seen on the right side of the
graphic. There are two main ideas behind building the pruned circuit set: (1) when
building circuits always use non-exit relays for the guard and middle if possible and (2)
select relays with the highest bandwidth. To build a circuit to add to the pruned set,
the algorithm first finds the exit relay with the highest bandwidth. If none exist, the
algorithm can stop as no new circuits can be built. Once it has an exit, the algorithm
then searches for the non-exit relay with the highest bandwidth, and this relay is used
for the middle relay in the circuit. If one is not found it searches the remaining exit
relays for the highest bandwidth relay to use for the middle in the circuit. If it still
cannot find a relay, the algorithm stops as there are not enough relays left to build
another circuit. The search process for the middle is replicated for the guard, again
stopping if it cannot find a suitable relay. Now that the circuit has a guard, middle, and
exit relay, the circuit is added to the pruned circuit set and the algorithm calculates
the circuit bandwidth as the minimum bandwidth across all relays in the circuit. Each
relay then decrements its relay bandwidth by the circuit bandwidth. Any relay that
now has a bandwidth of 0 is permanently removed from the relay list. This is repeated
until the algorithm can no longer build valid circuits.
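A minimal Python sketch of this pruned-set construction is shown below, assuming relays are described by a bandwidth value and an exit flag; the dictionary layout is ours.

def build_pruned_circuits(relays):
    # relays: dict relay_id -> {"bw": float, "exit": bool}
    bw = {r: info["bw"] for r, info in relays.items()}
    circuits = []
    while True:
        exits = [r for r in bw if relays[r]["exit"]]
        if not exits:
            break                                        # no exit left: no more circuits
        chosen = [max(exits, key=lambda r: bw[r])]       # highest bandwidth exit
        for _ in ("middle", "guard"):
            non_exits = [r for r in bw if not relays[r]["exit"] and r not in chosen]
            pool = non_exits or [r for r in bw if relays[r]["exit"] and r not in chosen]
            if not pool:
                return circuits                          # not enough relays left
            chosen.append(max(pool, key=lambda r: bw[r]))
        circuit = (chosen[2], chosen[1], chosen[0])      # guard, middle, exit
        circuits.append(circuit)
        cbw = min(bw[r] for r in circuit)                # circuit bandwidth = min relay bandwidth
        for r in circuit:
            bw[r] -= cbw
            if bw[r] <= 0:
                del bw[r]                                # exhausted relays leave the pool
    return circuits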
Using these sets allows us to explore the full potential performances that can be
realized when considering both relay and circuit selection. To analyze the impacts on
circuit selection alone, we also run the algorithms using the original circuit set. For each
download, we extract the original circuits available in the vanilla experiments. Then
when selecting a circuit for the download in the online and offline algorithm we only
pick from those original circuits.
7.4 ABRA for Circuit Selection
While using a central authority for circuit selection allows us to explore the potential performance increases of better network utilization, using such an entity would have severe
privacy and anonymity implications on clients using Tor. Instead we would like to use
the key insights from the online algorithm and adapt it to a decentralized algorithm
that can be used to allow clients themselves to make more intelligent circuit selection
decisions. The main point of using the central authority with the online algorithm is
to directly route clients around bottlenecks. For the decentralized version we introduce
the avoiding bottleneck relay algorithm (ABRA) which has relays themselves estimate
which circuits they are a bottleneck on. With this they can locally compute and gossip their DWC weight, with clients then selecting circuits based on the combined relay
weight value.
The main obstacle is a method for relays to accurately estimate how many circuits they
are bottlenecks on. To locally identify bottleneck circuits, we make three observations:
(1) to be a bottleneck on any circuit, the relay's bandwidth should be fully consumed; (2) all bottleneck circuits should be sending more cells than non-bottleneck circuits;
and (3) the number of cells being sent on bottleneck circuits should be fairly uniform.
The last point is due to flow control built into Tor, where the sending rate on a circuit
should converge to the capacity of the bottleneck relay.
All of this first requires identifying circuit bandwidth which needs to be done in
a way that does not overestimate bulky circuits and underestimate quiet but bursty
circuits. To accomplish this we consider two parameters. First is the bandwidth window,
in which we keep track of all traffic sent on a circuit for the past w seconds. The other
parameter is the bandwidth granularity, a sliding window of g seconds that we scan over
the w second bandwidth window to find the maximum number of cells transferred during
Algorithm 3 Head/Tail clustering algorithm

function HeadTail(data, threshold)
    m ← sum(data)/len(data)
    head ← {d ∈ data | d ≥ m}
    if len(head)/len(data) < threshold then
        head ← HeadTail(head, threshold)
    end if
    return head
end function
any g second period. That maximum value is then assigned as the circuit bandwidth.
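A small sketch of this bandwidth estimate is given below; the 100 ms recording resolution mirrors the implementation described later in Section 7.5.2, while the class layout and parameter defaults are illustrative.

from collections import deque

class CircuitBandwidth:
    def __init__(self, window_s=1.0, granularity_s=0.1, resolution_s=0.1):
        slots = int(round(window_s / resolution_s))
        self.history = deque(maxlen=slots)                    # cell counts for the past w seconds
        self.span = max(1, int(round(granularity_s / resolution_s)))

    def record_slot(self, cells):
        self.history.append(cells)                            # called once per resolution interval

    def estimate(self):
        counts = list(self.history)
        if len(counts) <= self.span:
            return sum(counts)
        # Maximum number of cells seen in any g-second span within the w-second window.
        return max(sum(counts[i:i + self.span]) for i in range(len(counts) - self.span + 1))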
With circuits assigned a bandwidth value we need a method for clustering them
into bottleneck and non-bottleneck groups. For this we consider several clustering algorithms.
Head/Tail: The head/tail clustering algorithm [101] is useful when the underlying
data has a long tail, which could be useful for bottleneck identification, as we expect to
have a tight clustering around the bottlenecks with other circuits randomly distributed
amongst the lower bandwidth values. The algorithm first splits the data into two sets,
the tail set containing all values less than the arithmetic mean, while everything greater than or equal to the mean is put in the head set. If the percent of values that ended up
in the head set is less than some threshold, the process is repeated using the head set
as the new data set. Once the threshold is passed the function returns the very last
head set as the head cluster of the data. This algorithm is shown in Algorithm 3. For
bottleneck identification we simply pass in the circuit bandwidth data and the head
cluster returned contains all the bottleneck circuits.
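In Python the recursion of Algorithm 3 is essentially the following; the 40% stopping threshold is the value commonly used for head/tail breaks and is only an illustrative default here.

def head_tail(data, threshold=0.4):
    mean = sum(data) / len(data)
    head = [d for d in data if d >= mean]
    if 0 < len(head) < len(data) and len(head) / len(data) < threshold:
        return head_tail(head, threshold)
    return head

# Circuits whose bandwidth estimate falls in the returned head cluster are the
# ones the relay treats as its bottleneck circuits.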
Kernel Density Estimator: The idea behind the kernel density estimator [102, 103]
is to fit a multimodal distribution based on a Gaussian kernel to the circuit bandwidths. Instead of using the bandwidth estimate for each circuit, the estimator
takes as input the entire bandwidth history seen across all circuits, giving the estimator
more data points to build a more accurate density estimate. For the kernel bandwidth
we initially use the square root of the mean of all values. Once we have a density we
compute the set of local minima {m1 , ..., mn } and classify every circuit with bandwidth
above mn as a bottleneck. If the resulting density estimate is unimodal and we do not
have any local minima, we repeat this process, halving the kernel bandwidth until we
get a multimodal distribution.
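A sketch of this estimator using SciPy's Gaussian KDE is shown below; the grid size is an assumption, the samples are assumed non-degenerate (more than one distinct value), and the bandwidth-halving loop follows the description above.

import numpy as np
from scipy.stats import gaussian_kde

def bottleneck_threshold(samples, grid_points=256):
    samples = np.asarray(samples, dtype=float)
    bw = np.sqrt(samples.mean())                       # initial kernel bandwidth
    grid = np.linspace(samples.min(), samples.max(), grid_points)
    std = samples.std(ddof=1)
    while bw > 1e-9:
        kde = gaussian_kde(samples, bw_method=bw / std)
        density = kde(grid)
        minima = [i for i in range(1, grid_points - 1)
                  if density[i] < density[i - 1] and density[i] < density[i + 1]]
        if minima:
            return grid[minima[-1]]                    # last (highest bandwidth) local minimum
        bw /= 2.0                                      # unimodal: halve the kernel bandwidth
    return samples.max()                               # fallback: classify nothing as bottleneck

def bottleneck_circuits(circuit_bw, samples):
    # circuit_bw: dict circuit id -> bandwidth estimate; samples: all observed bandwidth values.
    threshold = bottleneck_threshold(samples)
    return [c for c, bw in circuit_bw.items() if bw > threshold]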
After relays have identified the circuits they are bottlenecks on, they compute their DWC weight as the sum of the inverse of those circuits' bandwidths, weight = Σ_i 1/bw_i.
This weight is then periodically gossiped to all clients that have a circuit through the
relay. In the centralized DWC algorithm paths are always selected with the lowest
combined DWC weight across all nodes in the path. However, since there is a delay between when circuits become bottlenecks and when clients are notified, we want to avoid clients over-burdening relays that might be congested. Instead clients will select circuits at
random based on the DWC weight information. First the client checks if there are any
circuits with DWC weight of 0, indicating that no relay on the circuit is a bottleneck on
any circuit. If there are any, all zero weight circuits are added to a list which the client
selects from randomly weighted by the circuit bandwidth, which is the lowest advertised
bandwidth of each relay in the circuit. If there are no zero-weight circuits, the client
then defaults to selecting amongst the circuits randomly weighted by the inverse of the
DWC weight, since we want to bias our selection to circuits with lower weights.
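The client-side rule reduces to a few lines; this sketch assumes each candidate circuit carries its summed gossiped weight and its minimum advertised relay bandwidth.

import random

def abra_select(circuits):
    # circuits: list of dicts with "weight" (sum of gossiped relay weights) and
    # "bw" (lowest advertised bandwidth of the relays on the circuit).
    zero = [c for c in circuits if c["weight"] == 0]
    if zero:
        # No relay on these circuits is a bottleneck: pick one weighted by bandwidth.
        return random.choices(zero, weights=[c["bw"] for c in zero], k=1)[0]
    # Otherwise bias towards circuits with lower combined weight.
    return random.choices(circuits, weights=[1.0 / c["weight"] for c in circuits], k=1)[0]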
7.5 Experimental Setup
In this section we discuss our experimental setup and some implementation details of
the algorithms used.
7.5.1 Shadow
To empirically test how our algorithms would perform in the Tor network, we use Shadow
[69, 67], a discrete event network simulator with the capability to run actual Tor code
in a simulated network environment. Shadow allows us to set up a large scale network
configuration of clients, relays, and servers, which can all be simulated on a single
machine. This lets us run experiments privately without operating on the actual Tor
network, avoiding potential privacy concerns of dealing with real users. Additionally,
it lets us have a global view and precise control over every aspect of the network,
which would be impossible in the live Tor network. Most importantly, Shadow performs
deterministic runs, allowing for reproducible results and letting us isolate exactly what
we want to test for performance effects.
Our experiments are configured to use Shadow v1.9.2 and Tor v0.2.5.10. We use the
large network configuration distributed with Shadow which consists of 500 relays, 1350
web clients, 150 bulk clients, 300 performance clients, and 500 servers. The default
client model in Shadow [67, 70, 27] downloads a file of a specific size, and after the
download is finished it chooses how long to pause until it starts the next download. Web
clients download 320 KB files and randomly pause between 1 and 60,000 milliseconds
before starting the next download; bulk clients download 5 MB files with no break
between downloads; and performance clients are split into three groups, downloading
50 KB, 1 MB, and 5 MB files, respectively, with 1 minute pauses between downloads.
Additionally, since the offline algorithm uses deterministic download times, we introduce
a new fixed download model. In this model each client has a list of downloads it will
perform with each download containing a start time, end time, and optional circuit to
use for the download. Instead of downloading a file of fixed size it downloads non-stop
until the end time is reached. In order to generate a fixed download experiment, we use
the download start and end times seen during an experiment running vanilla Tor using
the default client model.
To measure network utilization, every 10 seconds each relay logs the total number
of bytes it has sent during the previous 10 second window. This allows us to calculate
the total relay bandwidth being used during the experimental run, with higher total
bandwidth values indicating better utilization of network resources. When using the
default client model we also examine client performance, with clients reporting the time
it took them to download the first byte, and time to download the entire file.
7.5.2 Implementations
Online and Offline Algorithms: When using the online and offline algorithm with
the fixed download client model we create a tool that precomputes circuit selection
using as input the list of downloads, with their respective start and stop times, along
with which of the circuit sets described in Section 7.3.3 to consider for the downloads.
The computed mapping of downloads to circuits is used by Shadow to manage which
circuits clients need to build and maintain, along with which circuit is selected each
time a download starts. For the offline algorithm the genetic algorithm was run for 100
rounds with a breed percentile of b = 0.2 and elite percentile e = 0.1. This was done
for each circuit set, with mutation set to m = 0 for the pruned and original circuit sets,
and m = 0.01 when using the full circuit set. After 100 rounds the population with the
highest score saved its circuit selection for every download.
While the offline algorithm can only operate in the fixed download model, the online
algorithm can also work in the default client model. For this we introduce a new central
authority (CA) node which controls circuit selection for every client. Using the Tor
control protocol the CA listens on the CIRC and STREAM events to know what circuits
are available at each client and when downloads start and end. Every time a STREAM NEW
event is received it runs the algorithm described in Section 7.3.1 over the set of active
circuits. This results in each relay having an assigned DWC weight and the CA simply
picks the circuit on the client with the lowest combined weight.
Avoiding Bottleneck Relays Algorithm: To incorporate ABRA in Tor we first
implemented the local weight calculation method described in Section 7.4. Each circuit
keeps track of the number of cells received over each 100 millisecond interval for the past
w seconds. We created a new GOSSIP cell type which relays send with their DWC weight
downstream on each circuit they have. To prevent gossip cells from causing congestion,
relays send the gossip cells on circuits every 5 seconds based on the circuit ID; specifically
when now() ≡ circID mod 5. For clients we modified the circuit_get_best function
to select circuits randomly based on the sum of relay weights as discussed in Section 7.4.
Congestion and Latency Aware: Previous work has been done on using congestion
[35] and latency [34] to guide circuit selection. For congestion aware path selection we
use the implementation described in [36]. Clients get measurements on circuit round-trip
times from three sources: (1) time it takes to send the last EXTEND cell when creating
the circuit (2) a special PROBE cell sent right after the circuit has been created, and
(3) the time it takes to create a connection when using a circuit. For each circuit the
client keeps track of the minimum RTT seen rttmin , and whenever it receives an RTT
measurement calculates the congestion time tc = rtt − rttmin . When a client selects
a circuit it randomly chooses 3 available circuits and picks the one with the lowest
congestion time. Additionally we add an active mode where PROBE cells are sent every
n seconds, allowing for more up to date measurements on all circuits the client has
available. Instead of using network coordinates to estimate latency between relays as
done in [34], we create a latency aware selection algorithm that uses directly measured
RTT times via PROBE cell measurements. With these measurements the client either
selects the circuit with the lowest rttmin , or selects amongst available circuits randomly
weighted by each circuit’s rttmin .
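For reference, the congestion-aware choice described above amounts to the following sketch; the bookkeeping is ours and simplifies the implementation from [36].

import random

class CongestionTracker:
    def __init__(self):
        self.rtt_min = {}        # circuit id -> lowest RTT observed
        self.congestion = {}     # circuit id -> latest rtt - rtt_min

    def record_rtt(self, circ, rtt):
        self.rtt_min[circ] = min(rtt, self.rtt_min.get(circ, rtt))
        self.congestion[circ] = rtt - self.rtt_min[circ]

    def select(self, available):
        # Randomly choose three candidate circuits and take the least congested one.
        sample = random.sample(list(available), min(3, len(available)))
        return min(sample, key=lambda c: self.congestion.get(c, 0.0))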
7.5.3 Consistency
To keep the experiments as identical as possible, every experiment was configured with a
central authority and had relays gossiping their weight to clients. If the experiment was
testing a local circuit selection algorithm, the central authority would return a circuit ID
of 0 which indicates that the client itself should select a circuit. At this point the client
would use either the vanilla, congestion aware, latency aware, or ABRA algorithms to
select a circuit. This way any latency or bandwidth overhead incurred from the various
algorithms is present in every experiment, and differences between experimental results
are all due to different circuit selection strategies.
7.6 Performance
In this section we compare the performance of using ABRA for circuit selection to
vanilla Tor, congestion-based and latency-aware circuit selection, and our centralized
algorithms.
7.6.1 Competitive Analysis
To get an understanding of how well the online algorithm performs relative to a more optimal
offline solution, we first perform competitive analysis running both the online and offline
algorithms on the fixed download client model. We extracted all download start and
end times from the vanilla experiment, along with the original circuits available to each
download. Both the online algorithm (with the original and full circuit sets) and offline
algorithm (seeded with original and pruned circuit sets) were run over the downloads
to precompute a circuit selection. Experiments were then configured using each of the
computed circuit selections to analyze the exact network utilization impacts each of the
circuit selections had.
Figure 7.3a shows the total network bandwidth achieved with the online and offline
algorithms when they use the full and pruned circuit sets respectively. Both algorithms
                      Bandwidth Granularity: 100 ms        Bandwidth Granularity: 1 second
Bandwidth Window      1s      2s      5s      10s          1s      2s      5s      10s
Jenks Two             1.67    1.79    3.56    8.69         1.80    2.39    5.28    13.25
Jenks Best            1.11    1.14    1.91    3.87         1.00    1.18    2.25    4.70
Head/Tail             0.74    0.92    1.66    4.36         0.90    1.13    2.45    6.14
KernelDensity         0.81    0.82    0.85    0.93         0.79    0.79    0.95    1.62

Table 7.1: The bottleneck clustering methods' mean square error across varying bandwidth granularity and window parameters, with red values indicating scores less than the weighted random estimator.
produce almost identical network utilization, with median network bandwidth going
from 148 MBps in vanilla Tor up to 259 MBps, a 75% increase. To see why, we can look at the relay capacity shown in Figure 7.3c. This looks at the percent of bandwidth
being used on a relay compared to their configured BandwidthRate. Note that this can
go higher than 100% because Tor also has a BandwidthBurstRate that allows a relay
to temporarily send more than its BandwidthRate over short periods of time. This
shows us that in the best algorithm runs relays are operating at very close to optimal
utilization. Half of the time relays are using 90% or more of their configured bandwidth,
which was only happening about 20% of the time in vanilla Tor, meaning the centralized
algorithms were able to take advantage of resources that were otherwise being left idle.
Interestingly, while both algorithms had the highest performance gain when using
larger circuit sets, they still produced improved performance when restricted to the
original circuit sets as seen in Figure 7.3b. The online algorithm produced results
close to those seen when using the full circuit set, with median network bandwidth at
248 MBps compared to 259 MBps for the full set. The offline algorithm, while still
performing better than vanilla Tor, did not produce as much improvement when using
the original circuit set, with network bandwidth at 206 MBps.
7.6.2 ABRA Parameters
With the various parameters and clustering methods available to the algorithm, the
first thing we are interested in is which parameters result in the most accurate bottleneck estimation. To measure this we configured an experiment to run with the central
authority described in Section 7.5.2 making circuit selections. Every time the central
authority selects a circuit for a client, for every relay it outputs how many circuits the
[Figure 7.3: Total network bandwidth when using the online and offline circuit selection algorithms while changing the set of available circuits, along with the relay bandwidth capacity of the best results for both algorithms. (a) Full and pruned circuits; (b) original circuits; (c) relay capacity.]
relay is a bottleneck on. Additionally, during the run every relay periodically outputs
the entire 10 second bandwidth history of every circuit it is on. With this information we
can compute offline which circuits every relay would have classified itself as a bottleneck
on, depending on the parameters used. For comparing estimates, every time the central
authority outputs its bottleneck figures, every relay that outputs their circuit history
within 10 simulated milliseconds will have their local bottleneck estimate compared to
the estimate produced by the central authority.
Table 7.1 shows the mean-squared error between the bottleneck estimations produced by the central authority and local estimates with varying parameters and clustering methods. The bandwidth granularity was set to either 100 milliseconds or 1 second,
and the bandwidth window was picked to be either 1, 2, 5, or 10 seconds. Every value in
black performed worse than a weighted random estimator, which simply picked a random value from the set of all estimates computed by the central authority. While the
kernel-density estimator produced the lowest mean-squared error on average, the lowest
error value came from using the head/tail clustering algorithm with the bandwidth granularity set to 100 milliseconds and the bandwidth window at 1 second.
7.6.3 ABRA Performance
To evaluate the performance of Tor using ABRA for circuit selection, an experiment
was configured to use ABRA with g = 100ms and w = 1s. Additionally we ran experiments using congestion aware, latency aware, and latency weighted circuit selection.
For congestion aware circuit selection three separate experiments were run, one using
only passive probing, and two with clients actively probing circuits either every 5 or
60 seconds. Finally, an experiment was run with the central authority using the online
algorithm to make circuit selection decisions for all clients. Figure 7.4 shows the CDF
of the total relay bandwidth observed during the experiment. Congestion and latency
aware network utilization, shown in Figures 7.4a and 7.4b, actually drop compared to
vanilla Tor, pushing at best half as much data through the network. Clients in the
congestion aware experiments were generally responding to out of date congestion information, with congested relays being over weighted by clients, causing those relays to
become even more congested. Since latency aware circuit selection does not take into
account relay bandwidth, low bandwidth relays with low latencies between each other
[Figure 7.4: Total network bandwidth utilization when using vanilla Tor, ABRA for circuit selection, the congestion and latency aware algorithms, and the central authority. (a) Congestion aware; (b) latency aware; (c) central authority.]
are selected more often than they should be and become extremely congested.
While the congestion and latency aware algorithms performed worse than vanilla
Tor, using both ABRA and the central authority for circuit selection produced increased
network utilization. Figure 7.4c shows that ABRA and the central authority produced
on average 14% and 20% better network utilization respectively when compared to
vanilla Tor. Figure 7.5 looks at client performance using ABRA and centralized circuit selection. Download times universally improved, with some web clients performing
downloads almost twice as fast, and bulk clients consistently seeing a 5-10% improvement. While all web clients had faster download times, about 20% saw even better
results when using the central authority compared to ABRA for circuit selection. The
only slight degradation in performance was with ABRA, where about 12-13% of downloads had a slower time to first byte compared to vanilla Tor. The central authority,
however, consistently resulted in circuit selections that produced faster times to first
byte across all client downloads.
7.7 Privacy Analysis
With the addition of gossip cells and using ABRA for circuit selection, there are some
potential avenues for abuse an adversary could take advantage of to reduce client
anonymity. In this section we cover each of these and examine how effective the methods
are for an adversary.
7.7.1 Information Leakage
The first issue is that relays advertising their locally computed weight could leak information about other clients to an adversary. Mittal et al. [58] showed how an adversary
could use throughput measurements to identify bottleneck relays in circuits. Relay
weight values could be used similarly, where an adversary could attempt to correlate
start and stop times of connections with the weight values of potential bottleneck relays. To examine how much information is leaked by the weight values, every time a
relay sent a GOSSIP cell we recorded the weight of the relay and the number of active
circuits using the relay. Then for every two consecutive GOSSIP cells sent by a relay we
then recorded the difference in weight and number of active circuits, which should give
[Figure 7.5: Download times and client bandwidth compared across circuit selection in vanilla Tor, using ABRA, and the centralized authority. (a) Time to first byte; (b) web download times; (c) bulk download times.]
us a good idea of how well the changes in these two values are correlated over a short
period of time. Figure 7.6 shows the distribution of weight differences across various
changes in clients actively using the relay. Since we are interested in times when the
relay is a bottleneck on new circuits, we excluded times when the weight difference was
0 as this is indicative that the relay was not a bottleneck on any of the new circuits.
This shows an almost nonexistent correlation, with an R² value of 0.00021. In a situation similar to the one outlined in [58] where an adversary is attempting to identify
bottleneck relays used in a circuit, we are particularly interested in the situation where
the number of active circuits using the relay as a bottleneck increases by 1. If there
were large (maybe temporary) changes noticeable to an adversary they could identify
the bottleneck relay. But as we can see in Figure 7.6 the distribution of weight changes
when client difference is 1 is not significantly different from the distribution for larger
client differences, meaning it would be extremely difficult to identify bottleneck relays
by correlating weight changes.
7.7.2 Colluding Relays Lying
With relays self reporting their locally calculated weights, adversarial relays could lie
about their weight, consistently telling clients they have a weight of 0, increasing their
chances of being on a selected circuit. Note that construction of circuits using ABRA is
still unchanged, so the number of compromised circuits built by a client will not change;
it is only when a client assigns streams to circuits that malicious relays could abuse
GOSSIP cells to improve the probability of compromising a stream. Furthermore, this
attack has a “self-damping” effect, in that attracting streams away from non-adversarial
relays will decrease the bottleneck weight of those relays. Still, colluding relays, in an
effort to increase the percent of circuits they are able to observe, could jointly lie about their weights in an attempt to reduce the anonymity of clients.
To determine how much of an effect such an attack would have, while running the
experiment using ABRA for circuit selection we recorded the circuits available to the
client, along with the respective weight and bandwidth information. A random set of
relays were then marked as the “adversary”. For each circuit selection decision made,
if the adversary had a relay on an available circuit we would rerun the circuit selection
assuming all the adversarial relays had a reported weight of 0. We would then record
[Figure 7.6: Weight difference on relays compared to the difference in number of bottleneck circuits on the relay.]
the fraction of total bandwidth controlled by the adversary, the fraction of streams
they saw with and without lying, and the fraction of compromised streams, with and
without lying. A stream is considered compromised if the adversary controls both the
guard and exit in its selected circuit. This process was repeated over 2,000,000 times to
produce a range of expected streams seen and compromised as the fraction of bandwidth
controlled by an adversary increases. Figures 7.7a and 7.7b show the median fraction of streams observed, bounded by the 10th and 90th percentiles. As the adversary controls
more bandwidth, and thus more relays that can lie about their weight, an adversary is
able to observe about 10% more streams than they would have seen if they were not
lying. Figure 7.7b shows the fraction of streams that are compromised with and without
adversarial relays lying. When lying, an adversary sees an increase of roughly 2-3% in the fraction of streams they are able to compromise. For example, an adversary who controls 20%
of the bandwidth sees their percent of streams compromised go from 4.7% to 6.9%.
Note that these serve as an upper bound on the actual effect as this analysis is done
statically and ignores effects on the weight of non-adversarial relays.
7.7.3 Denial of Service
While adversarial relays are limited in the number of extra streams they can observe
by lying about their weight, they still could have the ability to reduce the chances that
other relays are selected. To achieve this they would need to artificially inflate the
weight of other relays in the network, preventing clients from selecting circuits that the
target relays appear on. Recall that the local weight calculation is based on how many
bottleneck circuits the relay estimates it is on. This means that an adversary cannot simply create inactive circuits through the relay to inflate its weight, since those circuits would never be labeled as bottlenecks. So to cause the weight to increase, the adversary needs to actually send data through the circuits. To test the effectiveness
of this attack, we configured an experiment to create 250 one-hop circuits through a
target relay. After 15 minutes the one-hop circuits were activated, downloading as much
data as they could. Note that we want as many circuits through the relay as possible to
make their weight as large as possible. The relay weight is summed across all bottleneck
P
circuits,
bw(ci )−1 . If we have n one-hop circuits through a relay of bandwidth bw,
each circuit will have a bandwidth of roughly
bw
n ,
so the weight on the relay will be
115
(a)
(b)
Figure 7.7: (a) fraction of circuits seen by colluding relays (b) fraction of circuits where
colluding relays appear as guard and exit
116
bw −1
Pn
1
n
=
n2
bw .
Figure 7.8a looks at the weight of the target relay along with how much bandwidth
the attacker is using, with the shaded region noting when the attack is active. We see
that the attacker is able to push through close to the maximum bandwidth that the relay
can handle, around 35 MB/s. When the circuits are active the weight spikes to almost
100 times what it was. After the attack is started, the number of clients using the target for a circuit (Figure 7.8b) plummets to almost 0, down from the 30-40 it was previously. But while the attack does succeed in inflating
the weight of the target, note that the adversary has to fully saturate the bandwidth
of the target. Doing this in vanilla Tor will have almost the same effect, essentially
running a denial of service by consuming all the available bandwidth. Figure 7.8c looks
at the number of completed downloads using the target relay in both vanilla Tor and
when using ABRA for circuit selection. Both experiments see the number of successful
downloads drop to 0 while the attack is running. So even though the addition of the
relay weight adds another mechanism that can be used to run a denial of service attack,
the avenue (saturating bandwidth) is the same.
7.8 Conclusion
In this paper we introduced an online algorithm that is capable of centrally managing
circuit selection for clients. Additionally we develop the decentralized version ABRA,
the avoiding bottleneck relay algorithm. This algorithm lets relays estimate which circuits they are bottlenecks on, allowing them to compute a weight that can be gossiped to clients, enabling more coordinated circuit selection. ABRA is compared to congestion and latency aware circuit selection algorithms, showing that while these algorithms tend to actually underperform vanilla Tor, ABRA results in a 14% increase
in network utilization. We show that the decentralized approach ABRA takes produces
utilization close to what even a centralized authority can provide. Using competitive
analysis we show that an online algorithm employed by a central authority matches a
lower-bound offline genetic algorithm. Finally, we examine potential ways an adversary
could abuse ABRA, finding that while information leakage is minimal, there exist small
increases in the percent of streams compromised based on the bandwidth controlled by
an adversary acting maliciously.
[Figure 7.8: (a) Attacker's bandwidth and target's weight when an adversary is running the denial of service attack; (b) number of clients using the target for a circuit; (c) number of downloads completed by clients using the target relay before and after the attack is started, shown both for vanilla Tor and using ABRA circuit selection.]
Chapter 8
Future Work and Final Remarks
8.1 Future Work
In this section we discuss open questions related to this dissertation and ideas for future
research on Tor and anonymous communication systems.
8.1.1 Tor Transports
In Section 6.6 we introduced xTCP, which is able to replace TCP with any other in-order reliable transport protocol. With this we were able to have Tor use the micro
transport protocol (uTP) for all inter-relay communication, allowing each circuit to
have its own user space protocol stack. We showed that using uTP instead of TCP
was able to increase client performance, with especially large improvements for time to
first byte. While these initial results are promising, a large amount of work remains in
investigating potential different protocols to use [104, 98].
Along with uTP, the stream control transmission protocol (SCTP) [105] could be a
good candidate for inter-relay communication. The main benefit is that it allows for partially ordered data delivery, eliminating the issue of head-of-line blocking that can cause
performance issues in Tor. Another possibility is to use Google’s QUIC [106] for Tor
[107]. QUIC offers a combination of TCP reliability and congestion control, along with
TLS for encryption and SPDY/HTTP2 for efficient and fast web page loading. In [104]
Murdoch additionally recommends CurveCP [108], a secure in-order reliable transport
protocol.
While other protocols are more mature and tested, it could be possible that a stand-alone Tor-specific protocol could provide the best performance. Tschorsch and Scheuermann
[63] recently implemented their own protocol for better flow and congestion control based
on back pressure, while using traditional techniques from TCP for packet loss detection
and retransmission. A protocol even more customized for Tor could deliver even better
performance guarantees. Jansen et al. showed with KIST [27] that allowing Tor to
buffer more data internally allows for better cell handling and prioritization, increasing
performance to latency sensitive applications. A Tor specific protocol would then have
direct control over traffic management and prioritization, allowing it to manage stream,
circuit, and connection level flow and congestion control in a global context.
8.1.2 Simulation Accuracy
With more work being done on the low level networking side of Tor [27, 2, 3, 109, 93, 63],
it is becoming increasingly important that the network stack in Shadow be as accurate
as possible. While PlanetLab or ExperimenTor [65] allow us to use the actual kernel TCP stack, these solutions have problems with scaling to full Tor topologies. In [27] a
large amount of work went into updating the Shadow TCP stack to replicate the Linux
kernel TCP implementation, but it is very difficult to guarantee equivalent performance.
Recently a couple of projects have been released that attempt to create a user
space networking library based on the Linux operating system code. NUSE [110], the
network stack in userspace, is part of the library operating system (LibOS) [111]. A
host backend layer handles the interactions from the kernel layer, with applications
making calls through the POSIX layer sitting above the kernel layer. The Linux kernel
library (LKL) [112, 113] is another library that applications can use to incorporate
Linux operating system functionality into their program. The advantage of LKL is that the
kernel code is compiled into a single object file which the application can link against directly.
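A minimal sketch of how a simulator process might drive the Linux stack through LKL follows; it is based on the lkl_start_kernel and lkl_sys_* entry points the LKL project documents, but the exact signatures and constants have varied across LKL versions, so this is an assumption-laden outline rather than tested integration code.

/* Sketch of booting an in-process Linux network stack via LKL.
 * Entry point names follow the LKL project's documented API, but the
 * precise signatures and constants differ across versions; treat this as
 * an outline only. */
#include <lkl.h>
#include <lkl_host.h>
#include <stdio.h>

int main(void) {
    /* Boot the library kernel with a small amount of memory. */
    if (lkl_start_kernel(&lkl_host_ops, "mem=32M") < 0) {
        fprintf(stderr, "failed to start LKL kernel\n");
        return 1;
    }

    /* System calls go through lkl_sys_* wrappers with LKL-prefixed
     * constants, so they exercise the library's TCP stack rather than the
     * host kernel's. */
    int sock = lkl_sys_socket(LKL_AF_INET, LKL_SOCK_STREAM, 0);
    if (sock < 0)
        fprintf(stderr, "lkl_sys_socket failed: %d\n", sock);
    else
        lkl_sys_close(sock);

    lkl_sys_halt();   /* shut the in-process kernel down */
    return 0;
}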
Tschorsch and Scheuermann [63] evaluated their back-pressure protocol for Tor by integrating it into ns-3 [114], a network simulator with a wide variety of protocols built
into it. While this has the advantage of comparing their back-pressure protocol against
the actual TCP implementation, it abstracts away many details of Tor, which could reduce
simulation accuracy. Instead, the ns-3 TCP implementation could be ported into
either a standalone library or directly into Shadow. This would avoid the overhead of
a library compiled from the entire kernel, in addition to relying on a mature
code base that has long been used for network simulation and experimentation.
8.2 Final Remarks
Since this dissertation began, Tor has grown enormously, from a network capacity of 5 GB/s
servicing a few hundred thousand clients to almost 200 GB/s of relay bandwidth handling
over 2 million clients daily. Additionally, with higher network bandwidth and more efficient
data processing, the time to download a 1 MB file has dropped dramatically, from roughly
20 seconds to about 2.5 seconds today. This large expansion in both network capacity and
client performance has allowed users and political dissidents in oppressive regimes to access the internet and globally communicate anonymously, an impressive and important
achievement.
With an ever-growing number of Tor users it is increasingly important to ensure
that the system properly balances performance and privacy. While this is an inherently
subjective question, this dissertation has worked to demonstrate some of the trade-offs
that arise when an adversary is able to take advantage of mechanisms intended to
increase client performance. To this end we first explored an induced throttling attack
that abuses flow and traffic admission control, enabling a low-resource attack on guard
identification which can be used to reduce the client anonymity set. We next introduced
a socket exhaustion attack in which an adversary consumes all available sockets. The
attack is made even worse by systems like PCTCP, which allow an adversary to anonymously
open sockets simply by creating circuits through a target. With this in mind we
introduced IMUX, which balances opening more connections to increase performance
against preventing these low-resource denial of service attacks. Additionally, we examined
the effects of replacing TCP entirely for inter-relay communication, and found that using
uTP prevents certain socket exhaustion attacks while simultaneously improving client
performance. Finally, we examined the price of anarchy in Tor and found that potentially
40-50% of network resources are left idle to the detriment of client performance.
In an attempt to close this gap we introduced ABRA for circuit selection, a decentralized
solution that can take advantage of these idle resources with minimal potential loss of
client anonymity.
References
[1] Rob Jansen, Paul Syverson, and Nicholas Hopper. Throttling Tor Bandwidth
Parasites. In Proceedings of the 21st USENIX Security Symposium, 2012.
[2] Deepika Gopal and Nadia Heninger. Torchestra: Reducing interactive traffic delays over Tor. In Proceedings of the Workshop on Privacy in the Electronic Society
(WPES 2012). ACM, October 2012.
[3] Mashael AlSabah and Ian Goldberg. PCTCP: Per-Circuit TCP-over-IPsec Transport for Anonymous Communication Overlay Networks. In Proceedings of the
20th ACM conference on Computer and Communications Security (CCS 2013),
November 2013.
[4] Rob Jansen, John Geddes, Chris Wacek, Micah Sherr, and Paul Syverson. Never
Been KIST: Tor’s Congestion Management Blossoms with Kernel-Informed Socket
Transport. In Proceedings of the 23rd USENIX Security Symposium.
USENIX Association, 2014.
[5] Ceki Gulcu and Gene Tsudik. Mixing E-mail with Babel. In Proceedings of the
Symposium on Network and Distributed System Security., 1996.
[6] Michael J. Freedman and Robert Morris. Tarzan: A Peer-to-Peer Anonymizing
Network Layer. In Proceedings of the 9th ACM Conference on Computer and
Communications Security (CCS 2002), November 2002.
[7] George Danezis, Roger Dingledine, and Nick Mathewson. Mixminion: Design of
a type III anonymous remailer protocol. In Proc. of IEEE Security and Privacy.,
2003.
[8] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation
onion router. In Proceedings of the 13th USENIX Security Symposium, August 2004.
[9] David Goldschlag, Michael Reed, and Paul Syverson. Onion routing. Communications of the ACM, 42(2):39–41, 1999.
[10] Michael G Reed, Paul F Syverson, and David M Goldschlag. Anonymous connections and onion routing. Selected Areas in Communications, IEEE Journal on,
16(4):482–494, 1998.
[11] Adam Back, Ulf Möller, and Anton Stiglic. Traffic Analysis Attacks and Trade-Offs in Anonymity Providing Systems. In Ira S. Moskowitz, editor, Information
Hiding: 4th International Workshop, IH 2001, pages 245–257, Pittsburgh, PA,
USA, April 2001. Springer-Verlag, LNCS 2137.
[12] Mashael AlSabah, Kevin Bauer, Ian Goldberg, Dirk Grunwald, Damon McCoy,
Stefan Savage, and Geoffrey Voelker. DefenestraTor: Throwing out windows in
Tor. In PETS, 2011.
[13] Rob Jansen, Paul Syverson, and Nicholas Hopper. Throttling Tor Bandwidth
Parasites. In Proceedings of the 21st USENIX Security Symposium, August 2012.
[14] Nicholas Hopper, Eugene Y. Vasserman, and Eric Chan-Tin. How Much Anonymity
does Network Latency Leak? ACM Transactions on Information and System Security,
13(2):13–28, February 2010.
[15] Joel Reardon and Ian Goldberg. Improving tor using a TCP-over-DTLS tunnel.
In Proceedings of the 18th conference on USENIX security symposium, pages 119–
134. USENIX Association, 2009.
[16] Joel Reardon. Improving Tor using a TCP-over-DTLS Tunnel. Master’s thesis,
University of Waterloo, September 2008.
[17] Tim Roughgarden. Selfish routing and the price of anarchy, volume 174. MIT
press Cambridge, 2005.
[18] Tim Roughgarden and Éva Tardos. How bad is selfish routing? Journal of the
ACM (JACM), 49(2):236–259, 2002.
[19] Tim Roughgarden. Selfish routing. PhD thesis, Cornell University, 2002.
[20] George Danezis. The traffic analysis of continuous-time mixes. In International
Workshop on Privacy Enhancing Technologies, pages 35–50. Springer, 2004.
[21] Steven J Murdoch and Piotr Zieliński. Sampled traffic analysis by internet-exchange-level
adversaries. In International Workshop on Privacy Enhancing Technologies, pages
167–183. Springer, 2007.
[22] libevent: an event notification library. http://libevent.org/.
[23] Can Tang and Ian Goldberg. An Improved Algorithm for Tor Circuit Scheduling.
In Angelos D. Keromytis and Vitaly Shmatikov, editors, Proceedings of the 2010
ACM Conference on Computer and Communications Security (CCS 2010). ACM,
October 2010.
[24] Damon McCoy, Kevin Bauer, Dirk Grunwald, Tadayoshi Kohno, and Douglas
Sicker. Shining Light in Dark Places: Understanding the Tor Network. In
Nikita Borisov and Ian Goldberg, editors, Privacy Enhancing Technologies: 8th
International Symposium, PETS 2008, pages 63–76, Leuven, Belgium, July 2008.
Springer-Verlag, LNCS 5134.
[25] Abdelberi Chaabane, Pere Manils, and Mohamed Ali Kaafar. Digging into anonymous traffic: A deep analysis of the tor anonymizing network. In Network and
System Security (NSS), 2010 4th International Conference on, 2010.
[26] Mashael AlSabah, Kevin Bauer, and Ian Goldberg. Enhancing Tor’s Performance
using Real-time Traffic Classification. In Proceedings of the 19th ACM conference
on Computer and Communications Security (CCS 2012), October 2012.
[27] Rob Jansen, John Geddes, Chris Wacek, Micah Sherr, and Paul Syverson. Never
Been KIST: Tor’s Congestion Management Blossoms with Kernel-Informed Socket
Transport. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security
14), San Diego, CA, August 2014. USENIX Association.
[28] W Brad Moore, Chris Wacek, and Micah Sherr. Exploring the potential benefits
of expanded rate limiting in tor: Slow and steady wins the race with tortoise. In
Proceedings of the 27th Annual Computer Security Applications Conference, pages
207–216. ACM, 2011.
[29] Florian Tschorsch and Björn Scheuermann. Tor is unfair – and what to do about
it. In Local Computer Networks (LCN), 2011 IEEE 36th Conference on, pages
432–440. IEEE, 2011.
[30] Kevin Bauer, Damon McCoy, Dirk Grunwald, Tadayoshi Kohno, and Douglas
Sicker. Low-Resource Routing Attacks Against Tor. In Proceedings of the Workshop on Privacy in the Electronic Society (WPES 2007), October 2007.
[31] Robin Snader and Nikita Borisov. EigenSpeed: secure peer-to-peer bandwidth
evaluation. In IPTPS, page 9, 2009.
[32] Micah Sherr, Matt Blaze, and Boon Thau Loo. Scalable link-based relay selection
for anonymous routing. In Privacy Enhancing Technologies, pages 73–93. Springer,
2009.
[33] Micah Sherr, Andrew Mao, William R Marczak, Wenchao Zhou, Boon Thau
Loo, and Matthew A Blaze. A3: An Extensible Platform for Application-Aware
Anonymity. 2010.
[34] Masoud Akhoondi, Curtis Yu, and Harsha V. Madhyastha. LASTor: A Low-Latency AS-Aware Tor Client. In Proceedings of the 2012 IEEE Symposium on
Security and Privacy, May 2012.
[35] Tao Wang, Kevin Bauer, Clara Forero, and Ian Goldberg. Congestion-aware Path
Selection for Tor. In Proceedings of Financial Cryptography and Data Security
(FC’12), February 2012.
[36] Christopher Wacek, Henry Tan, Kevin Bauer, and Micah Sherr. An Empirical
Evaluation of Relay Selection in Tor. In Proceedings of the Network and Distributed
System Security Symposium - NDSS’13. Internet Society, February 2013.
[37] Michael Backes, Aniket Kate, Sebastian Meiser, and Esfandiar Mohammadi.
(Nothing else) MATor(s): Monitoring the Anonymity of Tor’s Path Selection.
In Proceedings of the 21st ACM Conference on Computer and Communications
Security (CCS 2014), November 2014.
[38] SJ Herbert, Steven J Murdoch, and Elena Punskaya. Optimising node selection
probabilities in multi-hop M/D/1 queuing networks to reduce latency of Tor.
Electronics Letters, 50(17):1205–1207, 2014.
[39] Robin Snader and Nikita Borisov. Improving security and performance in the
Tor network through tunable path selection. Dependable and Secure Computing,
IEEE Transactions on, 8(5):728–741, 2011.
[40] Mashael Alsabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. The Path Less
Travelled: Overcoming Tor’s Bottlenecks with Traffic Splitting. In Proceedings of
the 13th Privacy Enhancing Technologies Symposium (PETS 2013), July 2013.
[41] Roger Dingledine and Steven J Murdoch. Performance Improvements on Tor
or, Why Tor is slow and what we’re going to do about it. Online: http://www.torproject.org/press/presskit/2009-03-11-performance.pdf, 2009.
[42] Camilo Viecco. UDP-OR: A fair onion transport design. Proceedings of Hot Topics
in Privacy Enhancing Technologies (HOTPETS08), 2008.
[43] Michael F Nowlan, Nabin Tiwari, Janardhan Iyengar, Syed Obaid Amin, and
Bryan Ford. Fitting square pegs through round pipes. In USENIX Symposium
on Networked Systems Design and Implementation (NSDI), 2012.
[44] Michael F Nowlan, David Wolinsky, and Bryan Ford. Reducing latency in Tor
circuits with unordered delivery. In USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2013.
[45] Elli Androulaki, Mariana Raykova, Shreyas Srivatsan, Angelos Stavrou, and
Steven M Bellovin. PAR: Payment for anonymous routing. In International Symposium on Privacy Enhancing Technologies Symposium, pages 219–236. Springer,
2008.
[46] Yao Chen, Radu Sion, and Bogdan Carbunar. XPay: Practical anonymous payments for Tor routing and other networked services. In Proceedings of the 8th
ACM workshop on Privacy in the electronic society, pages 41–50. ACM, 2009.
[47] Tsuen-Wan “Johnny” Ngan, Roger Dingledine, and Dan S. Wallach. Building
Incentives into Tor. In Radu Sion, editor, Proceedings of Financial Cryptography
(FC ’10), January 2010.
[48] Rob Jansen, Nicholas Hopper, and Yongdae Kim. Recruiting New Tor Relays with
BRAIDS. In Angelos D. Keromytis and Vitaly Shmatikov, editors, Proceedings
of the 2010 ACM Conference on Computer and Communications Security (CCS
2010). ACM, October 2010.
[49] Rob Jansen, Aaron Johnson, and Paul Syverson. LIRA: Lightweight Incentivized
Routing for Anonymity. In Proceedings of the Network and Distributed System
Security Symposium - NDSS’13. Internet Society, February 2013.
[50] Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan, and Weijia Jia. A new
cell counter based attack against tor. In Proceedings of the 16th ACM conference
on Computer and communications security, pages 578–589. ACM, 2009.
[51] Xinwen Fu, Zhen Ling, J Luo, W Yu, W Jia, and W Zhao. One cell is enough to
break Tor’s anonymity. In Proceedings of Black Hat Technical Security Conference,
pages 578–589. Citeseer, 2009.
[52] Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Weijia Jia, and Wei Zhao. Protocol-level attacks against Tor. Computer Networks, 57(4):869–886, 2013.
[53] Nikita Borisov, George Danezis, Prateek Mittal, and Parisa Tabriz. Denial of Service or Denial of Security? How Attacks on Reliability Can Compromise Anonymity.
In Proceedings of CCS 2007, October 2007.
[54] Rob Jansen, Florian Tschorsch, Aaron Johnson, and Björn Scheuermann. The
Sniper Attack: Anonymously Deanonymizing and Disabling the Tor Network.
In Proceedings of the Network and Distributed Security Symposium - NDSS ’14.
IEEE, February 2014.
[55] Nick Feamster and Roger Dingledine. Location diversity in anonymity networks.
In Proceedings of the 2004 ACM workshop on Privacy in the electronic society,
pages 66–76. ACM, 2004.
[56] Matthew Edman and Paul Syverson. AS-awareness in Tor path selection. In Proceedings of the 16th ACM conference on Computer and communications security,
pages 380–389. ACM, 2009.
[57] Yixin Sun, Anne Edmundson, Laurent Vanbever, Oscar Li, Jennifer Rexford,
Mung Chiang, and Prateek Mittal. RAPTOR: routing attacks on privacy in
tor. In 24th USENIX Security Symposium (USENIX Security 15), pages 271–286,
2015.
[58] Prateek Mittal, Ahmed Khurshid, Joshua Juen, Matthew Caesar, and Nikita
Borisov. Stealthy traffic analysis of low-latency anonymous communication using throughput fingerprinting. In Proceedings of the 18th ACM conference on
Computer and communications security, pages 215–226. ACM, 2011.
[59] Steven J. Murdoch and George Danezis. Low-Cost Traffic Analysis of Tor. In
2005 IEEE Symposium on Security and Privacy (IEEE S&P 2005) Proceedings,
pages 183–195. IEEE CS, May 2005.
[60] Nathan S Evans, Roger Dingledine, and Christian Grothoff. A practical congestion
attack on Tor using long paths. In Proceedings of the 18th USENIX Security
Symposium, 2009.
[61] Brent Chun, David Culler, Timothy Roscoe, Andy Bavier, Larry Peterson, Mike
Wawrzoniak, and Mic Bowman. PlanetLab: an overlay testbed for broad-coverage
services. SIGCOMM Computer Communication Review, 33, 2003.
[62] Aaron Johnson, Chris Wacek, Rob Jansen, Micah Sherr, and Paul Syverson. Users
Get Routed: Traffic Correlation on Tor by Realistic Adversaries. In Proceedings
of the 20th ACM conference on Computer and Communications Security (CCS
2013), November 2013.
[63] Florian Tschorsch and Björn Scheuermann. Mind the gap: towards a backpressure-based
transport protocol for the Tor network. In 13th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 16), pages 597–610, 2016.
[64] George F Riley and Thomas R Henderson. The ns-3 network simulator. In Modeling and Tools for Network Simulation, pages 15–34. Springer, 2010.
[65] Kevin S Bauer, Micah Sherr, and Dirk Grunwald. ExperimenTor: A Testbed for
Safe and Realistic Tor Experimentation. In CSET, 2011.
[66] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostić, Jeff
Chase, and David Becker. Scalability and accuracy in a large-scale network emulator. ACM SIGOPS Operating Systems Review, 36(SI):271–284, 2002.
[67] Rob Jansen and Nicholas Hopper. Shadow: Running Tor in a Box for Accurate and
Efficient Experimentation. In Proc. of the 19th Network and Distributed System
Security Symposium, 2012.
[68] Rob Jansen, Kevin Bauer, Nicholas Hopper, and Roger Dingledine. Methodically
Modeling the Tor Network. In Proceedings of the 5th Workshop on Cyber Security
Experimentation and Test, August 2012.
[69] Shadow Homepage and Code Repositories. https://shadow.github.io/, https://github.com/shadow/.
[70] Rob Jansen, Kevin S Bauer, Nicholas Hopper, and Roger Dingledine. Methodically
Modeling the Tor Network. In CSET, 2012.
[71] The Tor Project. The Tor Metrics Portal. http://metrics.torproject.org/.
[72] Damon McCoy, Tadayoshi Kohno, and Douglas Sicker. Shining light in dark places:
Understanding the Tor network. In Proc. of the 8th Privacy Enhancing Technologies Symp., 2008.
[73] Félix Hernández-Campos, Kevin Jeffay, and F Donelson Smith. Tracking the evolution of web traffic: 1995-2003. In Modeling, Analysis and Simulation of Computer Telecommunications Systems, 2003. MASCOTS 2003. 11th IEEE/ACM International Symposium on, pages 16–25. IEEE, 2003.
[74] Andrei Serjantov and George Danezis. Towards an information theoretic metric
for anonymity. In Privacy Enhancing Technologies, 2003.
[75] Claudia Diaz, Stefaan Seys, Joris Claessens, and Bart Preneel. Towards measuring
anonymity. In Privacy Enhancing Technologies, 2003.
[76] Ulf Möller, Lance Cottrell, Peter Palfrader, and Len Sassaman. Mixmaster protocol version 2. Draft, July 2003.
[77] Can Tang and Ian Goldberg. An improved algorithm for Tor circuit scheduling. In
Proc. of the 17th ACM conf. on Computer and Communications Security, 2010.
[78] W. Brad Moore, Chris Wacek, and Micah Sherr. Exploring the Potential Benefits
of Expanded Rate Limiting in Tor: Slow and Steady Wins the Race With Tortoise.
In Proceedings of 2011 Annual Computer Security Applications Conference, 2011.
[79] Deepika Gopal and Nadia Heninger. Torchestra: Reducing interactive traffic delays over Tor. In Proc. of the Workshop on Privacy in the Electronic Society,
2012.
[80] Prateek Mittal, Ahmed Khurshid, Joshua Juen, Matthew Caesar, and Nikita
Borisov. Stealthy traffic analysis of low-latency anonymous communication using throughput fingerprinting. In Proc. of the 18th ACM conf. on Computer and
Communications Security, 2011.
[81] Nicholas Hopper, Eugene Y. Vasserman, and Eric Chan-Tin. How much anonymity
does network latency leak? In CCS ’07: Proceedings of the 14th ACM conference
on Computer and communications security. ACM, 2007.
[82] Steven J. Murdoch and George Danezis. Low-Cost Traffic Analysis of Tor. In
Proceedings of the 2005 IEEE Symposium on Security and Privacy, SP ’05, 2005.
[83] E.L. Hahne. Round-robin scheduling for max-min fairness in data networks. IEEE
Journal on Selected Areas in Communications, 9(7):1024–1039, 1991.
[84] The Tor Project. Research problem: adaptive throttling of Tor clients by entry
guards. https://blog.torproject.org/blog/research-problem-adaptive-throttling-tor-clients-entry-guards.
[85] Mashael AlSabah, Kevin Bauer, and Ian Goldberg. Enhancing Tor’s performance
using real-time traffic classification. In Proceedings of the 2012 ACM conference
on Computer and communications security, 2012.
[86] Amir Houmansadr and Nikita Borisov. SWIRL: A Scalable Watermark to Detect Correlated Network Flows. In Proc. of the Network and Distributed Security
Symp., 2011.
[87] Sambuddho Chakravarty, Angelos Stavrou, and Angelos D. Keromytis. Traffic
Analysis Against Low-Latency Anonymity Networks Using Available Bandwidth
Estimation. In Proceedings of the European Symposium Research Computer Security - ESORICS’10, 2010.
[88] Rob Jansen. The Shadow Simulator. http://shadow.cs.umn.edu/.
[89] Rob Jansen and Nick Hopper. Shadow: Running Tor in a Box for Accurate and
Efficient Experimentation. In Proceedings of the 19th Network and Distributed
System Security Symposium, 2012.
[90] Bram Cohen. Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer systems, 2003.
[91] Trevor J Hastie and Robert J Tibshirani. Generalized additive models, volume 43.
1990.
[92] John R. Douceur. The Sybil Attack. In IPTPS ’01: Revised Papers from the First
International Workshop on Peer-to-Peer Systems, 2002.
[93] Michael F Nowlan, David Wolinsky, and Bryan Ford. Reducing Latency in Tor
Circuits with Unordered Delivery. In USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2013.
[94] Can Tang and Ian Goldberg. An Improved Algorithm for Tor Circuit Scheduling.
In Proc. of the 17th Conference on Computer and Communication Security, 2010.
[95] Rob Jansen, Kevin Bauer, Nicholas Hopper, and Roger Dingledine. Methodically
Modeling the Tor Network. In Proceedings of the USENIX Workshop on Cyber
Security Experimentation and Test (CSET 2012), August 2012.
[96] Steven J Murdoch. Comparison of Tor datagram designs. Technical report, November 2011. www.cl.cam.ac.uk/~sjm217/papers/tor11datagramcomparison.pdf.
[97] Stanislav Shalunov. Low Extra Delay Background Transport (LEDBAT).
[98] Karsten Loesing, Steven J. Murdoch, and Rob Jansen. Evaluation of a libutp-based
Tor Datagram Implementation. Technical Report 2013-10-001, The Tor Project, 2013.
[99] Marco Valerio Barbera, Vasileios P Kemerlis, Vasilis Pappas, and Angelos D
Keromytis. CellFlood: Attacking Tor onion routers on the cheap. In Computer
Security–ESORICS 2013, pages 664–681. Springer, 2013.
[100] Ronald L Graham, Eugene L Lawler, Jan Karel Lenstra, and AHG Rinnooy Kan.
Optimization and approximation in deterministic sequencing and scheduling: a
survey. Annals of discrete mathematics, 5:287–326, 1979.
[101] Bin Jiang. Head/tail breaks: A new classification scheme for data with a heavy-tailed distribution. The Professional Geographer, 65(3):482–494, 2013.
[102] Murray Rosenblatt et al. Remarks on some nonparametric estimates of a density
function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.
[103] Emanuel Parzen. On estimation of a probability density function and mode. The
annals of mathematical statistics, 33(3):1065–1076, 1962.
[104] Steven J. Murdoch. Comparison of Tor Datagram Designs. Technical Report
2011-11-001, The Tor Project, November 2011. http://www.cl.cam.ac.uk/~sjm217/papers/tor11datagramcomparison.pdf.
[105] Randall Stewart. Stream control transmission protocol. Technical report, 2007.
[106] QUIC, a multiplexed stream transport over UDP. https://www.chromium.org/quic.
[107] Xiaofin Li and Kevein Ku. Help with TOR on UDP/QUIC. https://www.mail-archive.com/tor-dev@lists.torproject.org/msg07823.html.
[108] Daniel Bernstein. CurveCP: Usable security for the Internet. URL: http://curvecp.org, 2011.
[109] John Geddes, Rob Jansen, and Nicholas Hopper. IMUX: Managing Tor connections from two to infinity, and beyond. In Proceedings of the 12th Workshop on
Privacy in the Electronic Society (WPES), November 2014.
[110] NUSE: Network stack in USerspacE. http://libos-nuse.github.io/.
[111] Hajime Tazaki, Ryo Nakamura, and Yuji Sekiya. Library Operating System with
Mainline Linux Network Stack. Proceedings of netdev.
[112] Octavian Purdila, Lucian Adrian Grijincu, and Nicolae Tapus. LKL: The Linux
kernel library. In Proceedings of the RoEduNet International Conference, pages
328–333, 2010.
[113] Linux kernel library project. https://github.com/lkl.
[114] Thomas R Henderson, Mathieu Lacage, George F Riley, C Dowell, and J Kopena.
Network simulations with the ns-3 simulator. SIGCOMM demonstration, 15:17,
2008.