Privacy and Performance Trade-offs in Anonymous Communication Networks

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY John D. Geddes IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Advisor: Nicholas J. Hopper

February, 2017

© John D. Geddes 2017, ALL RIGHTS RESERVED

Acknowledgements

There are a great many people to whom I owe a debt of gratitude for their support during this dissertation. First and foremost is my advisor Nick Hopper, who not only helped contribute to and shape this work, but provided endless guidance and encouragement from start to finish. I would like to thank all my collaborators and co-authors – Rob Jansen, Max Schuchard, and Mike Schliep. Your input and contributions have been paramount to getting me here, and this work is undeniably better off because of it. Additionally I want to thank everyone I have had the great pleasure to work with during my tenure in graduate school – Shuai Li, Zi Lin, Yongdae Kim, Denis Foo Kune, Aziz Mohaisen, Se Eun Oh, Micah Sherr, Paul Syverson, Chris Thompson, and Chris Wacek. I would like to thank my other committee members – Stephen McCamant, Andrew Odlyzko, and Jon Weissman – for their comments and the time they volunteered during this process. I would especially like to thank my parents and brother, who have been there since the beginning. Never in a million years could I have accomplished this without your continuous support and encouragement. And last but certainly not least, I thank my best friend and wife, Dominique. This work would not exist without you and your never ending support. No matter the challenges or difficulties faced, you stood by me and helped me through it all.

Dedication

To Dominique, who has always been there for me, even through the never ending "just one more year of grad school".

Abstract

Anonymous communication systems attempt to prevent adversarial eavesdroppers from learning the identities of any two parties communicating with each other. In order to protect against global adversaries, such as nation states and large internet service providers, systems need to induce large amounts of latency in order to sufficiently protect users' identities. Other systems sacrifice protection against global adversaries in order to provide low latency service to their clients. This makes the system usable for latency sensitive applications like web browsing. In turn, more users participate in the low latency system, increasing the anonymity set for everybody. These trade-offs between performance and the anonymity provided are inherent in anonymous communication systems. In this dissertation we examine these types of trade-offs in Tor, the most popular low latency anonymous communication system in use today. First we look at how user anonymity is affected by mechanisms built into Tor for the purpose of increasing client performance. To this end we introduce an induced throttling attack against flow control and traffic admission control algorithms which allows an adversarial relay to reduce the anonymity set of a client using the adversary as an exit. Second we examine how connections are managed for inter-relay communication and look at some recent proposals for more efficient relay communication. We show how some of these can be abused to anonymously launch a low resource denial of service attack against target relays. With this we then explore two potential solutions which provide more efficient relay communication along with preventing certain denial of service attacks.
Finally, we introduce a circuit selection algorithm that can be used by a centralized authority to dramatically increase network utilization. This algorithm is then adapted to work in a decentralized manner, allowing clients to make smarter decisions locally, increasing performance while having a small impact on client anonymity.

Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Thesis Statement
  1.2 Outline

2 Background
  2.1 The Tor Anonymous Communication Network
  2.2 Circuits
  2.3 Streams
  2.4 Relay Architecture
  2.5 Inter-Relay Communication
  2.6 Flow Control

3 Related Work
  3.1 Increasing Performance
    3.1.1 Scheduling
    3.1.2 Selection
    3.1.3 Transport
    3.1.4 Incentives
  3.2 Security and Privacy
    3.2.1 Path Selection and Routing Attacks
    3.2.2 Side Channel and Congestion Attacks
  3.3 Experimentation

4 Experimental Setup
  4.1 Shadow and Network Topologies
  4.2 Performance Metrics
  4.3 Adversarial Model

5 How Low Can You Go: Balancing Performance with Anonymity in Tor
  5.1 Introduction
  5.2 Background
    5.2.1 Circuit Scheduling
    5.2.2 Flow Control
    5.2.3 Traffic Admission Control
    5.2.4 Circuit Clogging
    5.2.5 Fingerprinting
  5.3 Methodology
    5.3.1 Algorithmic-Specific Information Leakage
    5.3.2 Experimental Setup and Model
  5.4 Algorithmic Effects on Known Attacks
    5.4.1 Throughput as a Signal
    5.4.2 Latency as a Signal
  5.5 Induced Throttling via Flow Control
    5.5.1 Artificial Congestion
    5.5.2 Small Scale Experiment
    5.5.3 Smoothing Throughput
    5.5.4 Scoring Algorithm
    5.5.5 Large Scale Experiments
  5.6 Induced Throttling via Traffic Admission Control
    5.6.1 Connection Sybils
    5.6.2 Large Scale Experiments
    5.6.3 Search Extensions
  5.7 Analysis
  5.8 Conclusion

6 Managing Tor Connections for Inter-Relay Communication
  6.1 Introduction
  6.2 Background
  6.3 Socket Exhaustion Attacks
    6.3.1 Sockets in Tor
    6.3.2 Attack Strategies
    6.3.3 Effects of Socket Exhaustion
  6.4 IMUX
    6.4.1 Connection Management
    6.4.2 Connection Scheduler
    6.4.3 KIST: Kernel-Informed Socket Transport
  6.5 Evaluation
    6.5.1 Experimental Setup
    6.5.2 Implementations
    6.5.3 Connection Management
    6.5.4 Performance
  6.6 Replacing TCP
    6.6.1 Micro Transport Protocol
    6.6.2 xTCP
    6.6.3 Experimental Setup
    6.6.4 Performance
  6.7 Discussion
  6.8 Conclusion

7 Anarchy in Tor: Performance Cost of Decentralization
  7.1 Introduction
  7.2 Background
  7.3 Maximizing Network Usage
    7.3.1 Central Authority
    7.3.2 Offline Algorithm
    7.3.3 Circuit Sets
  7.4 ABRA for Circuit Selection
  7.5 Experimental Setup
    7.5.1 Shadow
    7.5.2 Implementations
    7.5.3 Consistency
  7.6 Performance
    7.6.1 Competitive Analysis
    7.6.2 ABRA Parameters
    7.6.3 ABRA Performance
  7.7 Privacy Analysis
    7.7.1 Information Leakage
    7.7.2 Colluding Relays Lying
    7.7.3 Denial of Service
  7.8 Conclusion

8 Future Work and Final Remarks
  8.1 Future Work
    8.1.1 Tor Transports
    8.1.2 Simulation Accuracy
  8.2 Final Remarks

References

List of Tables

6.1 Functions that xTCP needs to intercept and how xTCP must handle them.
7.1 The bottleneck clustering method's mean square error across varying bandwidth granularity and window parameters, with red values indicating scores less than the weighted random estimator.

List of Figures

2.1 Tor client encrypting a cell, sending it on the circuit, and relays peeling off their layer of encryption.
2.2 Internal Tor relay architecture.
5.1 Results for throughput attack with vanilla Tor compared to EWMA and N23 scheduling and flow control algorithms.
5.2 Results for throughput attack with vanilla Tor compared to different throttling algorithms.
5.3 Results of latency attack on EWMA and N23 algorithms.
5.4 Results of latency attack on the various throttling algorithms.
5.5 Effects of artificial throttling.
5.6 Raw and smoothed throughput of probe through guard during attack.
5.7 Throttle attack with BitTorrent sending a single stream over the circuit.
5.8 Degree of anonymity loss with throttling attacks with different numbers of BitTorrent streams over the Tor circuit.
5.9 Small scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. The sybils cause an easily recognizable drop in throughput on the target circuit.
5.10 Large scale sybil attack on the bandwidth throttling algorithms from [1]. The shaded area represents the period during which the attack is active. Due to token bucket sizes, the throughput signal may be missed if the phases of the attack are too short.
5.11 Probabilities of the victim client under attack scenarios for flow control algorithms.
5.12 Victim client probabilities, throttling algorithm attack scenarios.
6.1 Showing (a) anonymous socket exhaustion attack using client sybils (b) throughput of relay when launching a socket exhaustion attack via circuit creation with shaded region representing when the attack was being launched (c) memory consumption as a process opens sockets using libevent.
6.2 Showing (a) throughput over time (b) linear regression correlating throughput to the number of open sockets (c) kernel time to open new sockets over time (d) linear regression correlating kernel time to open a new socket to the number of open sockets.
6.3 The three different connection scheduling algorithms used in IMUX.
6.4 Number of open sockets at exit relay in PCTCP compared with IMUX with varying τ parameters.
6.5 Comparing socket exhaustion attack in vanilla Tor with Torchestra, PCTCP and IMUX.
6.6 Performance comparison of IMUX connection schedulers.
6.7 Performance comparison of IMUX to Torchestra [2] and PCTCP [3].
6.8 Performance comparison of IMUX to Torchestra [2] and PCTCP [3], all using the KIST performance enhancements [4].
6.9 Operation flows for vanilla Tor and using xTCP.
6.10 Performance comparison of using TCP connections in vanilla Tor and uTP for inter-relay communication.
7.1 Example of circuit bandwidth estimate algorithm. Each step the relay with the lowest bandwidth is selected, its bandwidth evenly distributed among any remaining circuits, with each relay on that circuit having their bandwidth decremented by the circuit bandwidth.
7.2 Example showing that adding an active circuit resulted in network usage dropping from 20 to 15 MBps.
7.3 Total network bandwidth when using the online and offline circuit selection algorithms while changing the set of available circuits, along with the relay bandwidth capacity of the best results for both algorithms.
7.4 Total network bandwidth utilization when using vanilla Tor, ABRA for circuit selection, the congestion and latency aware algorithms, and the central authority.
7.5 Download times and client bandwidth compared across circuit selection in vanilla Tor, using ABRA, and the centralized authority.
7.6 Weight difference on relays compared to the difference in number of bottleneck circuits on the relay.
7.7 (a) fraction of circuits seen by colluding relays (b) fraction of circuits where colluding relays appear as guard and exit.
7.8 (a) attacker's bandwidth and target's weight when an adversary is running the denial of service attack (b) number of clients using the target for a circuit (c) number of downloads completed by clients using the target relay before and after the attack is started, shown both for vanilla Tor and using ABRA circuit selection.

Chapter 1

Introduction

In order to communicate over the internet, messages are encapsulated into packets and stamped with a source and destination address.
These addresses allow internet service providers to determine the route that packets should be sent over. The route consists of a series of third party autonomous systems which forward packets to the final destination. With the source address the destination can construct responses that will be routed back to the source, potentially through a completely different set of autonomous systems. With so many external systems having access to packets being routed, encryption is vital for hiding what is being said, concealing important data such as banking information, passwords, and private communication. While encryption can prevent third parties from reading the contents of the packets, those parties can still learn who is communicating with whom via the source and destination addresses. Concealing these addresses to hide the identities of the communicating parties is a much more difficult task, as the addresses are vital to ensure messages can be delivered. An early solution was proxies, which act as a middleman, forwarding messages between clients and servers. While any eavesdropper would only see the client or server communicating with the proxy and not with each other, the proxy itself knows that the client and server are communicating with each other. To address these concerns a wide array of anonymous communication networks [5, 6, 7, 8] were proposed, with the goal of preventing an adversary from simultaneously learning the identity of both the client and server communicating with each other. Of these, Tor [8] emerged as one of the most widely used systems, currently providing anonymity to millions of users daily. Instead of tunneling connections through a single proxy, Tor uses onion routing [9, 10] to securely route connections through three proxies called relays. The Tor network consists of almost 7,000 volunteer relays contributing bandwidth ranging from 100 KBps to over 1 GBps. To tunnel connections through the Tor network, clients create circuits through three relays - a guard, middle, and exit - chosen at random weighted by their bandwidth. By using three relays for the circuit, Tor ensures that no relay learns both the identity of the client and the identity of the server they are communicating with.

One major obstacle anonymous communication systems face is providing privacy against a global passive adversary, defined as an adversary that can eavesdrop on large fractions of the network. Specifically, such an adversary can observe almost all connections entering the network from clients, and all connections exiting the network to servers. An adversary with this capability can use traffic analysis techniques [11] to link the identities of the end points communicating with each other through the anonymity network. Some systems [5, 7] attempt to defend against a global adversary by adding random delays to messages, grouping and reordering messages, and adding background traffic. These techniques result in networks providing high latency performance to their clients. While some services, such as email, are not latency sensitive and can handle the additional latency, other activities such as web browsing are extremely sensitive to latency and would be unusable on these types of anonymity systems. To accommodate this, Tor makes a conscious trade-off, providing low latency performance to clients by forwarding messages as soon as possible and sacrificing anonymity against a global adversary.
However, Tor is still able to protect against a local adversary, one who can only eavesdrop on a small fraction of the network and is therefore unable to perform the end-to-end traffic analysis available to a global adversary. The major upside of providing low latency service to clients is that the network is much more usable for common activities such as web browsing, attracting a large number of clients to use the system. This in turn increases the anonymity of clients, as they are part of a much larger anonymity set, making it harder for adversaries to link clients and servers communicating through Tor.

1.1 Thesis Statement

This dissertation explores the following thesis: There is an inherent trade-off between the performance of an anonymous communication network and the privacy that it delivers to end users.

To this end we examine new performance enhancing mechanisms for Tor and look at how they can be exploited to reduce client anonymity. We explore new side channels that leak information and reduce client anonymity. By measuring the degree of anonymity we are able to quantify the amount of information leaked and the reduction in anonymity. We also investigate new avenues for launching denial of service attacks against Tor relays that render them unavailable for use by clients. Alongside this we design and implement new algorithms which explore the balance of maximizing performance and minimizing the attack surfaces open to adversaries.

1.2 Outline

Induced Throttling Attacks (Chapter 5) In order to more efficiently allocate resources in Tor there have been proposals suggesting improved flow control [12] and identifying and throttling "loud" clients [13]. In Chapter 5 we introduce the induced throttling attack, which can be leveraged against both flow control and throttling algorithms. An adversarial exit relay is able to control when data can and cannot be sent on a circuit, allowing probe clients to detect changes and identify the guard being used on the circuit. By identifying potential guards being used on a circuit, the adversary can then use a latency attack [14] to reduce the client anonymity set. We evaluate the effectiveness of the induced throttling attack and compare results to previously known side channel attacks.

Managing Tor Connections (Chapter 6) In Tor all communication between two relays is sent over a single TCP connection. This has been known to be the source of some performance issues, with cross-circuit interference [15, 16] causing an unnecessary hit to throughput. One potential solution that has been explored is increasing the number of inter-relay connections, either capping it at two [2] or leaving it unbounded with each circuit getting a dedicated TCP connection [3]. However, in modern operating systems the number of sockets a process is allowed to have open at any one time is capped. In Chapter 6 we introduce the socket exhaustion attack, which launches a denial of service attack against Tor relays by consuming all available sockets on the Tor process. We then explore other options that both increase performance and prevent certain socket exhaustion attacks. We first introduce a generalized solution called IMUX which dynamically manages open connections in an attempt to increase performance while limiting the effectiveness of some socket exhaustion attacks. We also look at replacing TCP entirely as the protocol for inter-relay communication. By using the micro transport protocol (uTP), we can have user space protocol stacks for each circuit.
This prevents cross-circuit interference while only requiring that a single socket be maintained per relay we are communicating with.

Price of Anarchy (Chapter 7) The price of anarchy [17] is a concept in game theory concerned with examining the degradation in performance that results from the selfish behavior and actions of a system's agents. In networking this is referred to as selfish routing [18, 19], where clients making decisions in a selfish and decentralized manner leave network resources underutilized. In Chapter 7 we explore the price of anarchy in the Tor network by examining the potential performance gains that can be achieved by centralizing circuit selection. While such a system would be wildly impractical to use on the live Tor network due to privacy concerns, it does allow us to examine the potential performance ceiling and gives us insights on how to improve circuit selection in Tor. With those insights we design a decentralized version of the algorithm that lets relays gossip information to clients, allowing them to make more intelligent circuit selection decisions. With this system we then analyze any potential drop in anonymity, from both a passive and an active adversary.

Chapter 2

Background

2.1 The Tor Anonymous Communication Network

This dissertation focuses on the Tor anonymous communication network, primarily due to its widespread deployment and relative popularity. Tor's current network capacity is almost 200 GBps, providing anonymity to over 1.5 million clients daily. With such a large user base it is more important than ever to ensure that clients receive high performance while guaranteeing the strongest privacy protections possible. In this chapter we cover some of the low level details of Tor that will be relevant in later chapters. Specifically we cover how clients interact with and send data through Tor, the relay queueing architecture, and how relays communicate internally in the Tor network.

2.2 Circuits

When a client first joins the Tor network, the client connects to a directory authority and downloads the network consensus along with relay microdescriptors, files containing information for all the relays in the network, such as each relay's address, public keys, exit policy, and measured bandwidth. Once the client has downloaded these files they start building 10 circuits. When building a circuit the client selects relays at random weighted by the relays' bandwidth, with extra considerations placed on choosing the exit and guard relay. For the exit relay, the client only selects amongst relays that are configured to serve as an exit, while also considering which ports the relay has chosen to accept connections on (e.g. 80 for HTTP, 443 for HTTPS, etc.). Since guard relays know the identity of the client, clients want to reduce the chances of selecting an adversarial guard relay which could attempt to deanonymize them [20, 21]. To this end clients maintain a guard list of pre-selected relays that are always chosen from when building circuits. When initially picking relays to add to this set, only relays that have been assigned the GUARD flag by the authorities are considered. These are relays with large enough bandwidth that have been available for a minimum amount of time.
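To make the selection process concrete, the sketch below shows a simplified bandwidth-weighted relay picker in Python. It is only an illustration of the idea described above - the tuple layout, flag strings, and function names are assumptions made for the example - and not Tor's actual path selection code, which additionally applies positional bandwidth weights, exit policy checks, and family and subnet constraints.

```python
import random

def pick_relay(relays, exclude=()):
    """Pick one relay at random, weighted by bandwidth.

    `relays` is a list of (name, bandwidth, flags) tuples, where `flags`
    is a set of strings such as {"Guard", "Exit"}. Illustrative only.
    """
    candidates = [r for r in relays if r[0] not in exclude]
    total = sum(bw for _, bw, _ in candidates)
    point = random.uniform(0, total)
    for name, bw, _ in candidates:
        point -= bw
        if point <= 0:
            return name
    return candidates[-1][0]

def build_path(relays, guard_list):
    """Choose a guard from the pre-selected guard list, then an exit and middle."""
    guard = random.choice(guard_list)
    exits = [r for r in relays if "Exit" in r[2]]
    exit_relay = pick_relay(exits, exclude={guard})
    middle = pick_relay(relays, exclude={guard, exit_relay})
    return guard, middle, exit_relay
```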
Once a guard, middle, and exit relay have been chosen, the client can start building the circuit. To begin building the circuit the client sends the guard an EXTEND cell containing key negotiation information. The guard responds with an EXTENDED cell, finishing the key negotiation and confirming the circuit was successfully extended. This process is repeated with the middle and exit relay, with the EXTEND and EXTENDED cells forwarded through the circuit being built (i.e. the EXTEND cell sent to the middle relay is forwarded through the guard). After the circuit building process is completed the client has a shared symmetric key with the guard, middle, and exit used for delivering data along the circuit.

2.3 Streams

To have an application send data through the Tor network, the user configures the application to use the SOCKS proxy opened by the Tor client. For each TCP connection created by the application, the Tor client creates a corresponding stream. When the first stream is created, one of the active circuits is selected and the stream is attached to the circuit. This circuit is then used to multiplex all streams created over the next 10 minutes. After the 10 minute window a new circuit is selected for all future streams and the old circuit is destroyed. All data sent over a stream is packaged into 512-byte cells which are encrypted once each using the secret symmetric keys the client shares with each relay in the circuit. Cells are sent over the circuit with each relay "peeling" off a layer of encryption and forwarding the cell to the next relay in the circuit. The exit relay then forwards the client's data to the final destination specified by the client. Figure 2.1 shows an example of this process.

Figure 2.1: Tor client encrypting a cell, sending it on the circuit, and relays peeling off their layer of encryption.

When the server wishes to send data back to the client, the reverse of this process occurs. The exit packages data from the server into cells which get sent along the circuit, with each relay adding a layer of encryption. The client can then decrypt the cell three times with each shared secret key to recover the original data sent by the server. Note that by default the raw traffic sent by the client and server is not encrypted, meaning the exit relay has access to the data. To prevent any eavesdroppers from viewing their communication, the client and server can establish a secure encrypted connection, such as by using TLS, to protect the underlying communication.

2.4 Relay Architecture

Figure 2.2 shows the internal architecture of cell processing in Tor relays.

Figure 2.2: Internal Tor relay architecture.

Packets are read off the TCP connection and stored in kernel TCP in-buffers. Libevent [22] notifies Tor when data can be read, at which point packets are processed into TLS records, decrypted, and copied to a connection "in buffer" internal to Tor. At this point Tor can process the unencrypted data as cells, moving the cells to their respective circuit queues based on their circuit ID. Tor then iterates through the circuit queues, either in round-robin fashion or using an EWMA circuit scheduler [23], pulling cells from the queues and pushing them into the connection out buffer corresponding to the next relay in the circuit. At this point the cell is either decrypted, if it is travelling upstream (towards the exit), or encrypted, if it is travelling downstream (towards the client). Once the newly encrypted or decrypted cell is pushed onto the out buffer, Tor waits until libevent notifies the process that the TCP buffer can be written to, at which point the data is flushed to the kernel TCP out buffer and eventually sent to the next relay.
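The scheduling step just described - choosing which circuit queue to flush next - can be illustrated with a small sketch of EWMA-style prioritization. This is a simplified model for intuition only, not Tor's scheduler; the decay constant and data structures are assumptions. Each circuit keeps an exponentially decaying count of recently sent cells, and the scheduler flushes from the circuit with the lowest count, so "loud" bulk circuits are deprioritized relative to quiet, latency-sensitive ones.

```python
import time

class Circuit:
    def __init__(self, cid):
        self.cid = cid
        self.queue = []          # cells waiting to be flushed
        self.ewma = 0.0          # exponentially decaying count of sent cells
        self.last_update = time.time()

    def decay(self, half_life=30.0):
        """Decay the count so that old activity stops counting against the circuit."""
        now = time.time()
        self.ewma *= 0.5 ** ((now - self.last_update) / half_life)
        self.last_update = now

def schedule_one_cell(circuits):
    """Flush one cell from the circuit with the lowest EWMA value."""
    active = [c for c in circuits if c.queue]
    if not active:
        return None
    for c in active:
        c.decay()
    quietest = min(active, key=lambda c: c.ewma)
    cell = quietest.queue.pop(0)
    quietest.ewma += 1           # sending a cell makes this circuit "louder"
    return quietest.cid, cell
```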
2.5 Inter-Relay Communication

All communication between a pair of relays is done over a single TCP connection encrypted with TLS. Similarly to how streams are multiplexed over a single circuit, all circuits between two relays are multiplexed over the single connection. Using TCP guarantees reliable and in-order delivery, in addition to providing connection-level flow and congestion control. However, since all circuits share the same TCP state there can be cross-circuit interference resulting in degraded performance [15]. For example, if one circuit causes congestion on the connection, the sending window is shrunk, reducing throughput for every circuit using the TCP connection.

2.6 Flow Control

Tor uses an end-to-end flow control algorithm at both the stream and circuit level. The algorithm operates on traffic flowing in both directions, with each edge of the circuit (i.e. the client and exit relay) keeping track of a window for each circuit and stream. For each cell sent by an ingress edge the corresponding circuit and stream windows are decremented by one. Any time a window reaches 0 the ingress edge stops transmitting all data on that stream or circuit. For every n cells the egress edge receives, it delivers a circuit or stream SENDME cell to the ingress edge. Once the SENDME cell is received by the ingress edge, it increments its circuit or stream window by n. By default, circuit and stream windows are initialized to 1000 and 500 respectively, with SENDME cells sent for every n = 100 and n = 50 cells received on the circuit and stream, respectively.
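The window accounting described above can be made concrete with a short sketch. This is a simplified model of the mechanism for intuition; the class name and callback are assumptions, and this is not Tor's implementation. The sending edge decrements its window per cell and stalls at zero, while the receiving edge returns a SENDME after every n cells, allowing the sender to continue.

```python
class CircuitEdge:
    """Simplified model of Tor's circuit-level SENDME flow control."""

    WINDOW_START = 1000   # initial circuit window
    SENDME_EVERY = 100    # receiver acknowledges every n cells

    def __init__(self):
        self.send_window = self.WINDOW_START
        self.cells_received = 0

    def can_send(self):
        return self.send_window > 0

    def on_cell_sent(self):
        assert self.can_send(), "edge must stop sending once the window hits 0"
        self.send_window -= 1

    def on_cell_received(self, send_sendme):
        # The receiving edge counts cells and returns a SENDME every n cells.
        self.cells_received += 1
        if self.cells_received % self.SENDME_EVERY == 0:
            send_sendme()

    def on_sendme_received(self):
        # Each SENDME allows the sender to emit another n cells.
        self.send_window += self.SENDME_EVERY
```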
Chapter 3

Related Work

A large amount of research has been done on Tor, mainly focusing on either improving performance or reducing client anonymity. This chapter surveys a wide array of relevant research on Tor. Not only does this help place the work in this dissertation in a better context, much of it is integral to the remaining work as we directly examine how some proposals affect client anonymity.

3.1 Increasing Performance

One of the major keys to increasing anonymity for Tor users is ensuring a large anonymity set, that is, a large active user base. To accomplish this Tor needs to offer low latency to clients; bad performance in the form of slow web browsing can lead to fewer users using the system overall. To this end, there has been a large body of research looking at ways to increase performance in Tor.

3.1.1 Scheduling

With limited available resources, relays are constantly making scheduling decisions on what to send and how much should be processed. These decisions happen at all levels, between streams, circuits, and connections. One major issue is that a small percentage of users consumes a large majority of all network resources [24, 25]. These bulky clients, usually using BitTorrent to download large files, crowd out available bandwidth, which can result in web clients experiencing high latency. Initially, when a connection was ready to write, Tor used a round-robin circuit scheduler to iterate through the circuits when flushing circuit queues. This resulted in a lot of bulk circuit cells being scheduled before latency-sensitive web circuits. To fix this Tang and Goldberg [23] introduced EWMA circuit scheduling, which prioritizes quiet circuits that have not written recently over loud circuits that are constantly transmitting cells. Building on this, AlSabah et al. [26] introduce an algorithm that attempts to classify circuits and use different Quality of Service (QoS) requirements for scheduling circuits. Further work by Jansen et al. [27] showed not only that circuit scheduling needs to be done in the global context of all circuits across all connections, but that if Tor writes data too fast to connections, cells spend most of their time waiting in the TCP kernel buffer, diminishing the control Tor has over prioritizing circuits. To fix this a rate limiting algorithm was developed, attempting to keep TCP buffers full enough to prevent starvation while allowing Tor to hold onto more cells for improved circuit prioritization.

While these schedulers attempt to prioritize circuits for scheduling, bulk circuits can still consume enough bandwidth to cause congestion on relays. Jansen et al. [13] propose methods for allowing guards to throttle the loudest clients, reducing congestion and increasing throughput for the non-throttled clients. In another attempt to reduce congestion, AlSabah et al. [12] developed an improved congestion control algorithm, using an ATM-style link-based algorithm instead of the end-to-end window algorithm. This allows faster response to congestion, as the guard and middle relay also react when congestion occurs. Moore, Wacek, and Sherr [28] go further and argue that rate limiting all traffic entering Tor will increase the capacity of the entire Tor network. Since most schedulers were developed in isolation, many do not work well when combined in aggregate. Tschorsch and Scheuermann [29] develop a new scheduler based on a max-min fairness model which achieves a fairer allocation of resources.

3.1.2 Selection

When creating circuits, Tor selects from relays at random weighted by their bandwidth. Since trusting relays to accurately self-report their available bandwidth has major security issues [30], Snader and Borisov have done work looking at using peer-to-peer bandwidth measuring algorithms [31]. In addition, a lot of work has been done on improving the relay and circuit selections made by clients [32, 33, 34, 35, 36, 37, 38]. Snader and Borisov introduced an improved relay and path selection algorithm [39] that can balance anonymity and performance by considering active measurements of relay performance. Akhoondi et al. developed LASTor [34], which uses network coordinates to estimate latency between relays and create low-latency circuits for clients. By grouping nearby relays and selecting randomly from within a group, LASTor can create low-latency circuits while preventing an adversary from simply creating a large number of closely located relays and being selected for every position along a single circuit created by a client. Instead of using latency, Wang et al. create a congestion-aware path selection algorithm [35]. By using passive and active round-trip measurements on circuits, they can detect when relays become congested and switch to non-congested circuits. In a similar fashion, AlSabah et al. propose traffic splitting [40] across multiple circuits. By creating multiple circuits using the same exit relay, clients can load balance across circuits, dynamically reducing the amount of data sent on congested circuits.

3.1.3 Transport

One of the noted performance issues in Tor is cross-circuit interference caused by the use of a single TCP connection between relays [41]. For example, if circuits c1 and c2 share a TCP connection and a cell from c1 gets dropped, all cells from c2 need to buffer and wait until TCP's retransmission mechanisms retransmit the cell from c1.
This is because TCP guarantees in-order delivery, which in this case is not necessary for Tor; the cells from circuit c2 could be sent while only c1 waits for successful retransmission. To fix some of these issues Viecco proposes UDP-OR [42], a new protocol design that uses UDP connections between relays. With this, end-to-end reliability and congestion control are handled by the edges, preventing cross-circuit interference internally in the network. Reardon [16] attempts to address this by implementing TCP-over-DTLS, allowing a UDP connection to be dedicated to each circuit. Using a user space TCP stack, reliable in-order delivery is handled at the circuit level on a hop-by-hop basis. Since these user space implementations tend not to be very mature, AlSabah et al. [3] use IPSec to enable Tor to dedicate a single kernel level TCP connection to each circuit. With IPSec an adversary cannot directly determine the exact number of circuits between two relays, while Tor still gains the performance benefits of dedicating a TCP stack per circuit. Since most of the cross-circuit interference happens due to congestion on bulk circuits interfering with latency-sensitive web circuits, Gopal and Heninger [2] propose having just two connections between relays, one for web clients and one for bulk clients. Finally, in an attempt to work around the specific issue of unnecessary blocking, Nowlan et al. introduce uTCP and uTLS [43, 44], which allow for out-of-order delivery in Tor so it can process cells from circuits that are not blocked waiting on lost packets to be retransmitted.

3.1.4 Incentives

While the previous lines of research involve improving the efficiency of how Tor handles traffic, another set looks at potential ways to incentivize more relays to join and contribute bandwidth to the Tor network. PAR [45] and XPay [46] offer monetary compensation to incentivize volunteers to run relays and add bandwidth to the Tor network. Ngan, Dingledine, and Wallach proposed Gold Star [47], which prioritizes traffic from relays providing high quality service in the Tor network. Using an authority to monitor and assign gold stars to high performing relays, traffic from these relays is then always prioritized ahead of traffic from relays without a gold star. Jansen, Hopper, and Kim developed BRAIDS [48], where clients can obtain tickets and give them to relays in order to have their traffic prioritized. The relays themselves can accrue tickets which they can then spend to have their own traffic prioritized. Both Gold Star and BRAIDS have scalability issues due to their reliance on centralized trusted authorities. To address this Jansen, Johnson, and Syverson built LIRA [49], a lightweight and decentralized incentive solution allowing for the same kind of traffic prioritization based on tokens obtained by contributing bandwidth to the Tor network.

3.2 Security and Privacy

While Tor's adversarial model excludes a global adversary, researchers have discovered potential ways a local adversary can still reduce the anonymity of clients in the network. In this section we cover some of these attacks, including path selection, routing, side channel, and congestion based attacks.

3.2.1 Path Selection and Routing Attacks

Plenty of research has been done showing that a global adversary, eavesdropping on the client-to-guard connection and the exit-to-server connection, can confirm that the client and server are actually communicating with each other [50, 51, 52].
An adversary therefore wants to increase the chances that they appear as both the guard and exit on a circuit, allowing them to run these end-to-end confirmation attacks. Ideally, an adversary providing a fraction f of all network bandwidth will be the guard and exit on only an f² fraction of circuits. The goal of the adversary is to try to increase the fraction of circuits seen to some p > f². Path selection attacks attempt to achieve this by exploiting how relays and circuits are chosen by clients. Borisov et al. [53] show how an exit relay can simply not forward cells on circuits that use a guard they do not control. This forces the client to choose a different circuit, increasing the chances that clients will select circuits with an adversarial guard and exit relay. Similarly, back when Tor relied on self-reported relay bandwidth values, Bauer et al. [30] showed that relays could lie about their bandwidth, increasing their chances of being selected as both the guard and exit in circuits without actually having to dedicate the bandwidth resources. Now Tor actively measures the bandwidth capacity of relays to calculate each relay's weight, which is used by clients when selecting relays for circuits. Another tactic an adversary can take is to run denial of service attacks against other relays, preventing them from being selected. This should be done without the adversary having to use much bandwidth; otherwise the same effect could be achieved by simply providing that bandwidth to the Tor network, increasing the adversarial relays' weight. To this end Jansen et al. [54] introduce the sniper attack, capable of launching a low resource denial of service attack against even high bandwidth relays. By forcing a relay's sending window down to 0, the relay has to continually buffer cells, eventually causing the relay to exhaust its available memory.

While relays can attempt to increase their chances of being selected as the guard and exit relay, adversarial autonomous systems (ASes) can also attempt to eavesdrop on both edges of the connection to run traffic analysis. A lot of work has been done examining the capability of existing ASes to passively launch these kinds of attacks against anonymous communication systems [55, 21, 56]. More recently Sun et al. introduced RAPTOR [57], an active AS-level routing attack. Here an adversarial AS can use BGP hijacking techniques to increase the portion of the Tor network they can view, allowing for targeted end-to-end confirmation attacks.

3.2.2 Side Channel and Congestion Attacks

In side channel attacks, an adversary at a single point in the circuit (i.e. an exit relay or malicious server) attempts to learn information that can potentially identify the client. This can be done by making passive measurements or by actively interfering in the network to increase the signal on the side channel. Hopper et al. [14] show how to launch two attacks using latency as a side channel. First, an adversarial exit relay can attempt to determine whether two separate connections through the exit are actually from the same circuit. Second, they demonstrate that an exit relay that knows the identity of the guard being used in the circuit can estimate the latency between the client and guard. With this they can use network coordinates to significantly reduce the anonymity set of the client using the circuit. Mittal et al. [58] demonstrate how circuit throughput can be used by an adversary to identify bottleneck relays.
By probing potential relays, an adversarial exit relay can correlate throughput on a specific circuit to try to identify the guard being used. Additionally, separate servers receiving connections from a Tor exit can attempt to identify whether they originate from the same circuit and client. While the previous side channel attacks rely on passive adversaries that simply make network measurements, congestion attacks artificially manipulate network conditions that can then be observed on a side channel by an adversary. Murdoch and Danezis [59] first introduced the idea, in which a server periodically pushes large amounts of data down a circuit, causing congestion. This can then be detected by the adversary to learn exactly which relays are used for a circuit. Evans et al. [60] expand on this, introducing a practical congestion attack that can be deployed on larger networks. An adversarial exit relay injects JavaScript code onto a circuit, allowing it to monitor round-trip times on the circuit. It then selects a relay it believes might be the guard and creates a long circuit that loops multiple times back through the selected relay. This significantly increases the congestion on the relay caused by cells sent on the long circuit: if the relay appears on the circuit n times, for every cell sent by the adversary the relay has to process it n times. When the adversary selects the actual guard being used, the recorded round-trip times on the target circuit dramatically increase, notifying the adversary that it is in fact the guard being used.

3.3 Experimentation

One of the key challenges in conducting research on Tor is running experiments. Due to the sensitive nature of providing privacy to its users, testing on the live network is often problematic, especially when the stated goal is to attack client anonymity. Early research used PlanetLab [61] to set up a test network for experimentation. One problem with this approach is that the network properties (e.g. latency, bandwidth, packet loss) can be drastically different from those of the actual Tor network. Additionally, as the Tor network grew it became difficult to scale up experiments, as PlanetLab only has a few hundred nodes, not all of which are reliable and stable. Some research can be done by simulating various aspects of Tor, allowing for large scale experimentation. Johnson et al. [62] evaluated the security of relay and path selection and built a path simulator that only involved the relay selection algorithms from Tor. Tschorsch and Scheuermann [63] created a new transport algorithm for inter-relay communication and used the network simulator ns3 [64] to compare against different transport designs. The major hurdle all these methods face is simultaneously achieving accuracy and scalability. To achieve both, researchers created systems that run actual Tor code but emulate or simulate the networking aspects. Bauer et al. designed ExperimenTor [65], which uses ModelNet [66] to create a virtual topology which Tor processes can communicate through. Jansen and Hopper created Shadow [67], which simulates the network layer while running the actual Tor software in a single process. These methods allow for precise control over the network environment, allowing for accurate modeling of the Tor network [68] which can be incorporated into the experimental test beds. Additionally these systems allow for experimental setups with thousands of clients, and are even capable of simulating the entire Tor network [27].
This is all done in a way that produces accurate and reproducible results.

Chapter 4

Experimental Setup

As discussed in Section 3.3, running experiments on Tor is a complicated process with trade-offs between accuracy and scalability. Of the potential tools discussed, Shadow [67] provides the highest accuracy while still allowing topologies consisting of thousands of nodes. In this chapter we discuss details of Shadow and the network topologies used, along with the metrics we consider when analyzing effects on anonymity and performance.

4.1 Shadow and Network Topologies

Shadow [67, 69] is a discrete-event simulator that runs actual Tor code in a simulated network environment. Not only does this allow us to run large scale experiments, but we can incorporate any performance or attack code directly into Tor to test the effectiveness of any implementation. With fine-grained control over all aspects of the experiment we can use carefully built network topologies reflecting the actual Tor network [70]. In addition we are able to run deterministic and reproducible experiments, allowing us to isolate the effects of specific changes (e.g. scheduling algorithms). These properties are difficult to achieve simultaneously, particularly when using PlanetLab or the live Tor network for experiments. The latter is not even an option when exploring attacks on client anonymity, as there are serious ethical concerns around interacting with, and possibly collecting data on, real Tor users. ExperimenTor does have both these properties, and is arguably more accurate as it uses a virtual environment running an actual Linux kernel. However, ExperimenTor experiments must run in real time, leading to issues of scalability. Using Shadow allows us to run experiments 10 times as large, and Shadow even supports simulating the full Tor network [27] given enough hardware resources.

To make sure experiments are as accurate as possible, extensive work [70, 27] has been done on producing realistic network models and topologies. The network topology is represented by a mesh network, with vertices representing countries, Canadian provinces, and American states. Each vertex has an associated up and down bandwidth, and edges between vertices represent network links with corresponding latency and packet loss values. Since all Tor relay information is public, when creating an experiment with r relays, Shadow carefully samples from the relay set in order to produce a sub-sampling reflective of the larger Tor network. Relays are then assigned to a vertex based on their known country code.
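As an illustration of this topology model, the sketch below shows one plausible in-memory representation; the field names and example values are assumptions made for the example and do not reflect Shadow's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    name: str            # country, Canadian province, or American state
    up_kbps: int         # upstream bandwidth of the vertex
    down_kbps: int       # downstream bandwidth of the vertex

@dataclass
class Edge:
    src: str
    dst: str
    latency_ms: float    # link latency
    packet_loss: float   # loss probability in [0, 1]

# A tiny two-vertex topology; relays and clients are assigned to vertices by
# country code, and traffic between them takes on the edge's properties.
vertices = [Vertex("US-MN", 50_000, 100_000), Vertex("DE", 40_000, 80_000)]
edges = [Edge("US-MN", "DE", latency_ms=95.0, packet_loss=0.005)]
```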
The client model for Shadow is based on high level statistics collected by Tor [71] along with work examining what clients use Tor for [72, 25]. From this we know the distribution of countries Tor clients originate from, along with the fact that roughly 90% of clients use Tor for web browsing, while the remaining 10% perform bulk downloads, typically using BitTorrent. Using this information, Shadow comes with a fully built network topology and experiments configured with different numbers of relays and clients. Since we are interested in maximizing accuracy we use the large scale experiment configuration with either 500 relays, 1350 web clients, 150 bulk clients, and 300 performance clients, or 400 relays, 2375 web clients, 125 bulk clients, and 225 performance clients. Web clients download a 320 KiB file and then pause randomly between 1 and 60,000 milliseconds, drawing from the UNC think-time distribution [73]. Bulk clients download a 5 MiB file continuously, with no pauses after a completed download. The performance clients perform a single download every 10 minutes, downloading either a 50 KiB, 1 MiB, or 5 MiB file. The experiment is configured to run for 60 virtual minutes, with relays starting between 5 and 15 virtual minutes into the experiment, and clients starting between 20 and 30 virtual minutes. This allows all clients to properly bootstrap and produces the realistic churn seen in the live network.

4.2 Performance Metrics

A common use for Shadow is to test new or modified performance enhancing algorithms to see what, if any, effect on performance they might have. Two experiments are set up using the exact same configuration, except one is configured to use vanilla Tor with no modifications and the other has the new algorithm enabled. After these experiments have completed we can then directly examine the performance achieved with the new algorithm compared with vanilla Tor. For this we are interested in the following metrics.

Time To First Byte: The time to first byte measures the time between when a client makes a request to download a file and when it receives the very first byte from the server. The lower the time to first byte the better, as this indicates lower latency and greater throughput when performing downloads, which is especially important for latency sensitive applications such as web browsing.

Time To Last Byte: The time to last byte measures the time between when a client makes a request to download a file and when it receives the very last byte from the server. While time to first byte applies equally across all clients, for time to last byte we separate results for web and bulk clients. While all client performance is important, we are generally more concerned with improving web client download times, as bulk clients are not particularly sensitive to download times.

Total Network Bandwidth: At fixed intervals Shadow records the bandwidth consumed by every node in the network. The total network bandwidth measures the sum of all bandwidth consumption across all nodes. Increasing network bandwidth usage in Tor is important as it indicates clients are better utilizing all available resources.

In general all metrics need to be considered in aggregate, as there are situations where something might be broken (e.g. clients not working) that is only reflected in a single metric, with all other metrics actually improving. For example, a faster time to first byte could be caused by a large number of downloads timing out. This would free network resources for the other clients that are able to actually use the network, so they experience less congestion. However, in this case we would also see a large drop in total network bandwidth, indicating a problem. Another common phenomenon is that we will see an increase in total network bandwidth and an improvement in bulk client download times; however, this comes at the expense of web clients. With prioritized bulk clients completing downloads faster, we can achieve higher overall network bandwidth, but web client download times, which are generally more important, suffer.

4.3 Adversarial Model

The first adversarial model we consider is an adversarial exit relay which observes a circuit connecting to a server and wishes to identify the client associated with the circuit.
The adversary can combine attacks that attempt to identify the guard being used on the circuit [59, 60, 58] with the latency attack [14], which can reduce the anonymity set of clients assuming knowledge of the guard. More formally, the adversary has prior knowledge of the set of potential clients C and all relays R that can serve as a guard. The victim V ∈ C is using a circuit with guard G ∈ R and using the adversary as the exit relay. We assume a priori that the probability that any client c_i ∈ C is the victim is P(V = c_i) = 1/|C|. The adversary runs an attack attempting to identify the guard being used on the circuit, producing an updated guard probability distribution P(G = r_j). Then for each r_j ∈ R the adversary runs the latency attack assuming the guard is r_j. This produces a conditional probability distribution P(V = c_i | G = r_j) for each client assuming the relay r_j is the guard. We can then compute the final client probability as:

P(V = c_i) = Σ_{r_j ∈ R} P(V = c_i | G = r_j) · P(G = r_j)

To measure the amount of information leaked we can use entropy, which quantifies the uncertainty an adversary has about a guard or client, to measure the degree of anonymity loss [74, 75]. These metrics operate over a reduced anonymity set, which can be applied to both guards and clients. Given an updated probability distribution we use a threshold τ to determine what does and does not get included in the reduced anonymity set. By considering all possible values of τ we can determine the potential maximum degree of anonymity that is lost. So given an initial set N and probability distribution P(n_i), we have the reduced set S ⊆ N built as:

S = {n_i ∈ N | P(n_i) > τ}

Then for a target T ∈ N the entropy of the reduced anonymity set S is defined as:

H(S) = log₂(|N|/|S|) if T ∈ S, and H(S) = 0 otherwise.   (4.1)

When calculating entropy for clients we have N = C and T = V, and for guards N = R and T = G. Then with the maximum total entropy defined as H_M = log₂(|C|), we can quantify the actual information leakage as the degree of anonymity loss, defined as 1 − H(S)/H_M. By varying the threshold τ we can then determine the maximum degree of anonymity loss.
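As a small illustration of how these pieces fit together, the sketch below combines a guard probability distribution with the per-guard client distributions to obtain the client posterior, and then applies a threshold to form the reduced anonymity set. It is only an illustration of the formulas above; the dictionary layout and function names are assumptions, not code from the attack implementation.

```python
def client_posterior(guard_probs, client_probs_given_guard):
    """Marginalize over guards: P(V = c) = sum_j P(V = c | G = r_j) * P(G = r_j).

    guard_probs: dict mapping guard -> P(G = guard)
    client_probs_given_guard: dict mapping guard -> {client: P(V = client | G = guard)}
    """
    posterior = {}
    for guard, p_guard in guard_probs.items():
        for client, p_client in client_probs_given_guard[guard].items():
            posterior[client] = posterior.get(client, 0.0) + p_client * p_guard
    return posterior

def reduced_anonymity_set(probs, tau):
    """Keep only the candidates whose posterior probability exceeds the threshold."""
    return {name for name, p in probs.items() if p > tau}
```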
For example, before Tor actively measured relay bandwidth and instead relied on self-reported values, relays could lie about the amount of bandwidth they provided [30] to increase their chances of being selected. Even now, if an adversarial relay knows it is being measured for bandwidth, it can prioritize the measured traffic to achieve an inflated bandwidth measurement and then simply throttle all other Tor connections, receiving a larger bandwidth weight than the bandwidth it actually provides warrants. In these situations we are interested in the fraction of circuits p > f² that the adversary can compromise while providing a fraction f of all bandwidth.

Chapter 5
How Low Can You Go: Balancing Performance with Anonymity in Tor

5.1 Introduction

Recall that one of the key design choices of the Tor system is the goal of building a large anonymity set by providing high performance to as many users as possible, while sacrificing some level of protection from large-scale (global) adversaries. For example, Tor does not attempt to protect against the end-to-end correlation attacks that certain mix systems try to prevent [5, 76, 7], as those systems introduce large costs in increased latency, making them difficult to use. This performance focus has led researchers to investigate a variety of methods to improve performance, such as using different circuit scheduling algorithms [77], better flow control [12], and throttling high bandwidth clients [1, 78, 79]. Several of these mechanisms have been or will be incorporated into the Tor software as a result of this research. One overlooked side effect of these improvements, however, is that in some cases improving the performance of users can also improve the performance of attacks against the Tor network. For example, several attacks have been proposed [80, 81, 60, 82] that rely on measuring the latency or throughput of a Tor circuit to draw inferences about its source and destination. If an algorithm improves the throughput or responsiveness of Tor circuits, this can improve the accuracy of the measurements used by these attacks, either directly or by averaging a larger sample. Thus it is important to analyze how these modifications to Tor interact with attacks based on network measurements.

In this section we investigate this interaction. We start by introducing a new class of attacks based on network measurement, which we call induced throttling attacks. In these attacks, an adversarial exit node exploits congestion or traffic admission control algorithms to artificially throttle and unthrottle a chosen circuit without directly sending data through the circuit or relay. This leads to a recognizable pattern in other circuits sharing resources with the target circuit, leaking information about the connection between the client and entry guard. We show that there are highly effective induced throttling attacks against most of the proposed scheduling, flow control, and admission control modifications to Tor, allowing an adversary to uniquely identify entry guards in many cases. We also examine the effect these algorithms have on previous attacks [80, 81] to see if the improvement in performance, and therefore in network measurements, leads to more successful attacks. Through large-scale simulation, we find that for throughput attacks, the improved network measurements are essentially "cancelled out" by the reduced variance in performance provided by these improvements.
We also find that nearly all of the proposed improvements increase the effectiveness of latency-based attacks, in many cases leading to a 10% or higher loss in "degree of anonymity." Finally, we perform a comprehensive analysis of the combined effects of throughput, induced throttling, and latency-measurement attacks. We show that using induced throttling, the combined attacks can in many cases uniquely identify the source of a circuit by a likelihood ratio test. These results indicate that flow and admission control algorithms can have considerable impact on the security as well as the performance of the Tor network, and new proposals must be evaluated for resistance to induced throttling.

5.2 Background

In this section we discuss the proposed performance enhancing algorithms and some details of several privacy reducing attacks on Tor.

5.2.1 Circuit Scheduling

When a connection is ready to write, Tor originally used a round-robin fair queuing algorithm [83] to determine which circuit to flush cells from. One impact of this is that latency sensitive circuits would often have to queue behind large amounts of cells flushed from bulky file transfer circuits. To fix this, Tang and Goldberg [77] suggested using an EWMA-based algorithm to prioritize web circuits over bulky file sharing circuits, reducing latency for clients browsing the web.

5.2.2 Flow Control

The high client-to-relay ratio in Tor causes performance problems that have been the focus of a considerable amount of previous research. The main flow control mechanism used by Tor is an end-to-end window based system, where the exit relay and client use SENDME control cells to infer network level congestion. Tor separates data flowing inbound from data flowing outbound, and flow control mechanisms operate independently on each flow. Each circuit starts with an initial 1000 cell window which is decremented by the source edge node for every cell sent. When the window reaches 0, the source edge stops sending. Upon receiving 100 cells, the receiving edge node returns a SENDME cell to the source edge, allowing the source edge to increment its circuit window by 100 and continue sending more cells. Each stream that is multiplexed over a circuit also has a similar flow control algorithm operating on it, with a 500 cell window and a 50 cell response rate. This work focuses on the circuit level flow control mechanisms.

One of the main issues with flow control in vanilla Tor is that it is done in an end-to-end manner, meaning that circuit edge nodes take longer to detect and react to any congestion that occurs in the middle of the circuit. As an alternative approach, AlSabah et al. introduced N23 [12], a link based algorithm that can instead detect and react to congestion on every link in the circuit. Similar to the native flow control mechanism in Tor, each relay in an N23-controlled circuit initializes its credit balance to N2 + N3 and decrements it by one for every cell it forwards. After a node has forwarded N2 cells, it returns a flow control cell containing the number of forwarded cells to the backward relay. Upon receiving a flow control cell from the forward relay, the backward relay updates its credit balance to N2 + N3 minus the difference between the cells it has forwarded and the cells the forward relay has forwarded. N23 has been shown to improve detection of and reaction to congestion [12].

5.2.3 Traffic Admission Control

Guard nodes in Tor have the ability¹ to throttle clients using a basic rate limiter [84].
The algorithm uses a token bucket whose size and refill rate are configurable to enforce a long-term average throughput while allowing short-term data bursts. The intuition behind the algorithm is that a throttling guard node will limit the client's rate of requests for new data, which lowers the outstanding amount of data inside the network at any given time and generally reduces congestion and improves performance. There have been many proposed uses of and alterations to this approach, some of which vary the connections that are throttled [85, 1, 78] and others that vary the throttle rate [1, 78, 79]. Of particular interest are the algorithms proposed by Jansen et al. [1], each of which utilizes the number of client-to-guard connections in some way to adjust the algorithm. The bitsplit algorithm divides its configured BandwidthRate evenly among client-to-guard connections, while the flag algorithm uses the number of client-to-guard connections to determine the rate above which a client will get flagged as "high throughput" and throttled. Finally, the threshold algorithm throttles the loudest fraction of client-to-guard connections. These algorithms have been shown to improve web client performance [1].

¹ Tor does not currently enable throttling by default.

5.2.4 Circuit Clogging

Murdoch and Danezis previously proposed a Tor circuit clogging attack [82] in which the adversary sends data through a circuit in order to cause congestion and change its latency characteristics. The adversary correlates the latency variations of this circuit with those of circuits through other relays in order to identify the likely relays of a target circuit. The attack requires significant bandwidth in order to produce a signal strong enough for correlation, and it has been shown to be no longer effective [60]. There have been numerous variations on this attack, some of which have simple defenses [60, 86] and others that have low success rates [87]. This work does not consider these "general" congestion attacks, where the main focus is keeping bandwidth usage small enough to remain practical. Instead, we focus on the feasibility of new induced throttling attacks introduced by recent performance enhancing algorithm proposals, and the effects that our new attacks have on anonymity.

5.2.5 Fingerprinting

Mittal et al. recently proposed "stealthy" throughput attacks [80] where an adversary that controls the exit node of a circuit attempts to find its guard relay by using "probe" clients that measure the attainable throughput through each relay.² The adversary may then correlate the circuit throughput measured at the exit node with the throughput of each of its probes to find the guard node with high probability. Some of our attacks also utilize probe clients in order to recognize the signal produced once throttling has been induced on a circuit. Hopper et al. [81] propose an attack where an adversary injects malicious JavaScript into a webpage in order to measure the round trip time of a circuit. The adversary may use these measurements to narrow the possible path the target circuit is taking through the network and approximate the geographical location of the client. As our techniques are similar to both of these attacks, we include them in our evaluation in Section 5.4 and our analysis in Section 5.7.

² The attack is active in that an adversary launches it by sending data to relays, but stealthy in that its data streams are indistinguishable from normal client streams.
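Throughout the evaluation that follows, attack results are combined and scored using the client/guard probability formulation of Section 4.3. As a concrete illustration, the following minimal sketch (hypothetical helper code, not the code used in our experiments) normalizes raw attack scores into probability distributions and applies that combination:

```python
# Minimal sketch of the probability combination from Section 4.3.
# All names are hypothetical; this is illustrative, not the experiment code.

def normalize(scores):
    """Turn raw attack scores s(n_i) into a probability distribution."""
    total = sum(scores.values())
    return {n: s / total for n, s in scores.items()}

def combine(guard_scores, client_scores_given_guard):
    """P(V = c_i) = sum_j P(V = c_i | G = r_j) * P(G = r_j).

    guard_scores: raw scores for each candidate guard r_j.
    client_scores_given_guard: {r_j: raw client scores assuming r_j is the guard}.
    """
    p_guard = normalize(guard_scores)
    p_client = {}
    for r_j, scores in client_scores_given_guard.items():
        conditional = normalize(scores)
        for c_i, p in conditional.items():
            p_client[c_i] = p_client.get(c_i, 0.0) + p * p_guard[r_j]
    return p_client
```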
5.3 Methodology

We consider three classes of algorithms that have been proposed: EWMA circuit scheduling [77]; N23 flow control [12]; and bitsplit, flag, and threshold throttling [1]. We also consider an ideal throttling algorithm that has perfect knowledge of the traffic type of every stream. This ideal algorithm throttles high throughput nodes at a rate of 50 KB/s and approximates the difftor approach of AlSabah et al. [85].

5.3.1 Algorithmic-Specific Information Leakage

We will explore new algorithm-specific attacks we have developed, as well as previously published generic attacks [81, 80], and quantify the extent to which the attacks affect anonymity. In analyzing the algorithms, we can expect them to have one of two effects: the algorithms may improve the effectiveness of statistical attacks by making side channel throughput and latency measurements more accurate, improving the adversary's ability to de-anonymize the client; or the algorithms may reduce the noise that an adversary uses to eliminate entry guards and clients from the potential candidate set, frustrating the attacks and improving client anonymity.

All of the attacks attempt to reduce the candidate set of either the guard on the circuit or the clients using the circuit. After running the attacks over the candidate set N with a target T ∈ N, each entity n_i ∈ N will have a corresponding score s(n_i). Along with the raw scores produced for the target T, to evaluate the effectiveness of the attacks we are interested in the following metrics.

Percentile: The percentile for a candidate target T of an attack is defined as the percent of other candidate targets (i.e., members of the anonymity set) with a lower score³ than T, based on the statistical information the attacker uses to score each candidate as the true target. Formally it is calculated as:

percentile = \frac{| \{ n_i \in N \mid s(n_i) < s(T) \} |}{|N|}

A higher percentile for T means there is a greater likelihood that T is the true target. Percentiles allow an attacker to reduce uncertainty by increasing confidence about the true target, or by increasing confidence in rejecting candidates unlikely to be the target.

³ In attacks where a lower score is "better" we can simply use the inverse for ranking.

Degree of Anonymity: As discussed in Section 4.3, the degree of anonymity loss [75, 74] is a useful metric for measuring information leakage. Given a threshold τ we can calculate the reduced anonymity set S = {n_i ∈ N | s(n_i) > τ} based on the score of each entity in N. We can then calculate the entropy of S as

H(S) = \begin{cases} \log_2 \frac{|N|}{|S|} & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases}  (5.1)

The degree of anonymity loss then quantifies the actual information leakage, calculated over the maximum total entropy H_M = \log_2(|N|) and computed as

degree of anonymity loss = 1 − H(S)/H_M

While this may not show the direct implications for client anonymity, we can determine the best case scenario for a potential adversary by examining the τ that maximizes the degree of anonymity loss, and, even more importantly, it allows us to make cross-experimental comparisons of the effect that different algorithms have under different attack scenarios.

Client Probability: The previous metrics examine an attack in isolation, evaluating how effective it is at reducing the candidate set for either guards or clients. For each attack that reduces the candidate guard set, we are also interested in how well an adversary can reduce the client anonymity set by combining the attack with the latency attack.
For this we want to examine the final client probability distribution discussed in Section 4.3. Since all the attacks produce scores for each entity, to produce a probability distribution we simply normalize over the scores. So for n_i ∈ N we have:

P(n_i) = \frac{s(n_i)}{\sum_{n_j \in N} s(n_j)}

With this we can combine the attacks as outlined in Section 4.3 to produce a final client probability distribution P(T = c_i).

5.3.2 Experimental Setup and Model

Our experiments utilize the Shadow simulator [88, 89], an accurate discrete event simulator that runs the real Tor code over a simulated network. Shadow allows us to configure large scale experiments on network sizes not feasible using traditional distributed network testbeds [61]. In addition, it offers fine-grained control over network topology, latency, and bandwidth characteristics. Shadow also enables precise control over Tor's circuit creation process at each individual node, allowing us to experiment with our attacks in a safe environment. Finally, Shadow allows us to run repeatable experiments while only modifying the algorithm or attack scenario of interest, resulting in more accurate evaluations and comparisons.

We developed a model of the Tor network based on work by Jansen et al. [68], and use it as the base of each large scale experiment in the following sections. We will discuss necessary changes to the following base configuration as we explore each specific attack scenario: 160 exit relays, 240 non-exit relays, 2375 web clients, 125 bulk clients, 75 small TorPerf clients, 75 medium TorPerf clients, 75 large TorPerf clients, and 400 HTTP servers. Each web client downloads a 320 KiB file from a randomly selected server, after which it sleeps for a time between 1 and 60 seconds drawn uniformly at random before starting the next download. The bulk clients repeatedly download a 5 MiB file with no wait time between downloads. Finally, the TorPerf clients perform only one download every 10 minutes, where the small, medium, and large clients download 50 KiB, 1 MiB, and 5 MiB files respectively. This distribution of clients is used to approximate the findings of McCoy et al. [72], Chaabane et al. [25], and data from Tor [71].

5.4 Algorithmic Effects on Known Attacks

This section evaluates how recently proposed performance enhancing algorithms affect previously known guard and client identification attacks against Tor.

5.4.1 Throughput as a Signal

We first explore the scenario of Mittal et al. [80], where an attacker is able to identify the guard relay of a circuit with high probability by correlating throughput measured at an adversarial exit node to probe measurements made through a set of entry guards. We ran our base experiment from Section 5.3.2 once without any probe clients in order to discover what circuits each bulk client created. Then, for every entry guard G that was not a middle or exit relay for any bulk client, we instantiated a probe client that creates a one-hop circuit through G in order to measure its throughput. This was done to minimize the interference of probes on bulk circuits, so that a probe only potentially affects the circuit it is measuring and no other. We compared vanilla Tor with six different algorithms: EWMA circuit scheduling [77], N23 flow control [12], bitsplit, flag, and threshold throttling [1], and ideal throttling. The results for the EWMA and N23 algorithms can be seen in Figure 5.1. The percentile of the entry guard, seen in Figure 5.1b, shows the largest divergence between the algorithms.
For example, around 40% of entry guards in vanilla Tor and N23 have correlation scores in the top 20% of all scores calculated, while EWMA only has around 25% of its entry guards in the top 20%. In terms of measuring the actual loss of anonymity we see little difference between the algorithms, as seen in Figure 5.1c. This shows the degree of anonymity loss while varying the threshold value, that is, the minimum correlation score an entry guard must have to be included in the set of possible guards. Vanilla Tor has a slightly higher peak in anonymity loss, about 2% larger than EWMA and N23, leading to greater anonymity reduction for an adversary.

While the previous algorithms have little overall effect on the attack, the throttling algorithms have a significantly larger impact on how the attack performs compared to vanilla Tor, as seen in Figure 5.2. In particular, we see very low entry guard correlation scores for the ideal and flag algorithms in Figure 5.2a, and looking at the percentile of the guards in Figure 5.2b we see that 48% of entry guards are in the top half of the list based on correlation score in vanilla Tor, while only 22-41% of the guards in the throttling algorithm experiments made it into the top half. Indeed, when looking at the degree of anonymity loss in Figure 5.2c, we see a much larger peak in the graph for vanilla Tor compared to all other algorithms, indicating that in the best case scenario an adversary would expect more information leakage. This implies that the throttling algorithms would actually result in a larger anonymity set for the adversary, making the attack produce worse results. Intuitively, the throughput of throttled circuits tends to be more similar than the throughput of unthrottled circuits, increasing the uncertainty during the attack. The throttling algorithms effectively smooth out circuit throughput to the configured long-term throttle rate, making it more difficult to distinguish the actual entry guard from the set of potential guards.

[Figure 5.1: Results for the throughput attack with vanilla Tor compared to the EWMA and N23 scheduling and flow control algorithms. Panels: (a) entry scores (CDF of |correlation score|), (b) percentile of entry, (c) degree of anonymity loss vs. threshold.]

[Figure 5.2: Results for the throughput attack with vanilla Tor compared to the different throttling algorithms (bitsplit, flag, threshold, ideal). Panels: (a) entry scores, (b) percentile of entry, (c) degree of anonymity loss vs. threshold.]

5.4.2 Latency as a Signal

We now explore the latency attack of Hopper et al. [81]. They show how an adversarial exit relay, having learned the identity of the entry guard on the circuit, is able to estimate the latency between the client and the guard. This is accomplished by creating two ping streams through the circuit, one originating from the client and one from the attacker. The ping stream from the attacker is used to estimate the latency between the entry guard and the exit relay, which, when subtracted from the ping times between the client and the exit relay, produces an estimate of the latency between the client and the guard.
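A minimal sketch of this estimation step is given below; the notation (T_VX, T_AX, T_AE) is made precise in the next paragraph, and the helper names are illustrative rather than taken from the attack implementation:

```python
# Sketch of the latency estimate used by the attack of Hopper et al. [81].
# Variable and function names are illustrative only.

def estimate_client_guard_latency(victim_pings, attacker_pings, t_attacker_guard):
    """Estimate T_VE, the latency between the victim and its entry guard.

    victim_pings:     observed ping times over the victim's circuit (T_VX samples)
    attacker_pings:   observed ping times over the attacker's identical circuit (T_AX samples)
    t_attacker_guard: latency between the attacker and the candidate guard (T_AE)
    """
    t_vx = min(victim_pings)    # best-case round trip, victim through the circuit
    t_ax = min(attacker_pings)  # best-case round trip, attacker through the circuit
    return t_vx - t_ax + t_attacker_guard

def rank_candidate_clients(estimate, actual_latencies):
    """Order candidate clients by |actual latency - estimate|; smaller is better."""
    return sorted(actual_latencies, key=lambda c: abs(actual_latencies[c] - estimate))
```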
Using network coordinates to compile the set of "actual" latencies between potential clients and guards, the adversary is then able to reduce the anonymity set of clients based on the estimated latency. Since this attack relies on the accuracy of the estimated latency measurements, the majority of the algorithms have the potential to decrease the anonymity of the clients by allowing an adversary to discard more potential clients.

For our experiments, we use the same base configuration with 400 relays and 2500 clients, with an additional 250 victim clients set up to send pings through the Tor circuit every 5 seconds. Then, for each victim client a corresponding attacker client is added, which creates an identical circuit to the one used by the victim and then sends a ping over this circuit every 5 seconds. These corresponding ping clients are used to calculate the estimated latency between the victim and the entry guard as discussed above. In order to determine the actual latencies between the clients and the guard node, we utilize the fact that Shadow determines the latency distribution sampled from between each pair of nodes that communicate in the experiment, so we simply assign the median latency of these distributions as the actual latencies between nodes. This corresponds to the analysis done in [81], where it is assumed that these quantities are known a priori, so we believe using this "insider information" does not contradict any assumptions made in the initial paper outlining the attack. Furthermore, since we are ultimately concerned with how the attacks differ using the various algorithms, the analysis should hold.

Similar to the original attack, we take the minimum of all observed ping times seen over both the victim and attacker circuits, denoted T_VX and T_AX respectively. Then an estimate of the latency between the victim and entry guard, T_VE, is calculated as T̂_VE = T_VX − T_AX + T_AE, where T_AE is the latency between the attacker and the entry guard as calculated above. Figures 5.3a and 5.4a show the difference between the estimated latency computed by an adversary and the actual latency between the client and guard, while Figures 5.3b and 5.4b show how these compare with the differences between the estimate and other possible clients. While these graphs only show a slight improvement for every algorithm except EWMA, looking at the degree of anonymity loss in Figures 5.3c and 5.4c shows a noticeable increase in the maximum possible information gain an adversary can achieve. Even though there is only a slight improvement in the accuracy of the latency estimation, it allows an adversary to consider a smaller window around the estimate to filter out potential clients while still retaining similar accuracy rates. This results in a smaller set of potential clients to consider, and thus a greater reduction in the anonymity of the victim client.

5.5 Induced Throttling via Flow Control

We now look at how an attacker is able to use specific mechanisms in the flow control algorithms to induce throttling on a target circuit.

5.5.1 Artificial Congestion

Recall from Section 5.2.2 that the flow control algorithms work by having control cells sent backward to notify nodes and clients that they are able to send more data. If there is congestion and the nodes go long enough without receiving these cells, they stop sending data until the next control cell is received.
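To make the mechanism concrete, here is a minimal model of Tor's circuit-level window accounting from Section 5.2.2 (illustrative only; real Tor also runs stream-level windows and other bookkeeping). It shows why an edge that withholds SENDME cells can stall the opposite edge:

```python
# Simplified model of Tor's circuit-level flow control (Section 5.2.2).
# Illustrative only; not Tor's actual implementation.

CIRC_WINDOW_INIT = 1000   # initial circuit window at the sending edge
SENDME_INCREMENT = 100    # cells acknowledged per SENDME

class SendingEdge:
    def __init__(self):
        self.window = CIRC_WINDOW_INIT

    def can_send(self):
        return self.window > 0          # window exhausted: the edge stops sending

    def on_cell_sent(self):
        self.window -= 1

    def on_sendme_received(self):
        self.window += SENDME_INCREMENT

class ReceivingEdge:
    def __init__(self, withhold_sendmes=False):
        self.cells_since_sendme = 0
        self.withhold_sendmes = withhold_sendmes   # an adversarial exit can hold these back

    def on_cell_received(self, peer):
        self.cells_since_sendme += 1
        if self.cells_since_sendme >= SENDME_INCREMENT and not self.withhold_sendmes:
            self.cells_since_sendme = 0
            peer.on_sendme_received()
```

With withhold_sendmes set, the sending edge's window drains after at most 1000 cells and the circuit goes idle, which is exactly the behavior exploited in the attack described next.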
Using these mechanisms, an adversarial exit node can implicitly control when a client can and cannot send data forward, thereby inducing artificial congestion. While it is often sufficient to model Tor clients as web and bulk clients [68], making small and large downloads respectively, in order for an exit node to induce artificial congestion there needs to be a large amount of data being sent from the client to the server. To address this issue, we introduce a more detailed third type of client that models the BitTorrent protocol and its "tit-for-tat" scheme [90]. The new client instead swaps 16 KB blocks of data with the server, and does not continue until both sides have received the full block, causing large amounts of data to be both uploaded and downloaded.

[Figure 5.3: Results of the latency attack on the EWMA and N23 algorithms. Panels: (a) ping differences, (b) percentile, (c) degree of anonymity loss vs. threshold.]

[Figure 5.4: Results of the latency attack on the various throttling algorithms (bitsplit, flag, threshold, ideal). Panels: (a) ping differences, (b) percentile, (c) degree of anonymity loss vs. threshold.]

To demonstrate the effectiveness of this technique, we had a bulk and a torrent client create connections over the same circuit, where each relay was configured with 128 KB/s of bandwidth. The exit relay would then periodically hold all control cells being sent to the torrent client in an attempt to throttle the connection. Figure 5.5 shows the observed throughput of both clients, where the shaded regions indicate periods when the exit relay was holding control cells from the torrent client. We can see that approximately 30 seconds into these periods, the torrent client runs out of available cells to send and goes into an idle state, leaving more resources for the bulk client and resulting in a rise in its observed throughput.

Next we want to identify how a potential adversary could utilize this to identify the entry guard used in a circuit. We can see the intuition behind the attack in Figure 5.5: when throttling of a circuit is repeatedly toggled, the throughput of all other circuits going through those nodes will increase and then decrease, producing a noticeable pattern which an adversary may be able to detect. We assume a scenario similar to previous attacks [82, 60, 80], where an adversary controls an exit node and wants to identify the entry guard of a circuit going through it. The adversary creates one-hop probe circuits through possible entry guards by extending circuits through middle relays that the adversary controls, and measures the throughput and congestion on each circuit at a constant interval. The adversary then periodically throttles the target circuit by holding control cells and tests for an increase in probe throughput for the duration of the attack. By repeatedly performing this action an attacker should be able to reduce the possible set of entry guards that the circuit might be using.
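The adversary's measurement loop can be sketched as follows; the toggle period and 100 ms sampling interval mirror the small scale experiment described next, and the exit-control and probe interfaces are hypothetical stand-ins:

```python
import time

def run_induced_throttling_attack(exit_ctl, probes, period_s=60, rounds=6):
    """Sketch of the adversary's loop from Section 5.5.1 (hypothetical interfaces).

    exit_ctl: adversary-controlled exit that can hold or release control cells
              on the target circuit.
    probes:   {guard_name: probe_client}, one-hop probe circuits that report
              observed throughput samples.
    Returns per-guard throughput traces plus the start times of attack windows.
    """
    traces = {guard: [] for guard in probes}
    windows = []
    for i in range(rounds):
        throttled = (i % 2 == 1)                 # alternate normal / throttle mode
        exit_ctl.set_hold_control_cells(throttled)
        start = time.time()
        if throttled:
            windows.append(start)
        while time.time() - start < period_s:
            for guard, probe in probes.items():
                traces[guard].append((time.time(), probe.read_throughput()))
            time.sleep(0.1)                      # 100 ms sampling interval
    return traces, windows
```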
5.5.2 Small Scale Experiment

To test the feasibility of such an attack, we designed a small scale experiment with 20 relays, 190 web clients, and 10 bulk clients. One of the exit relays was designated as the adversary and one of the bulk clients was designated as the victim. The victim bulk client was configured to use the torrent client behavior while creating circuits using the adversary as its exit relay. Then, for each of the 19 remaining non-adversarial relays, a probe client was added and configured to create one-hop circuits through the relay, measuring observed throughput in 100 ms intervals. The adversarial exit relay would then wait until the victim client created the connection, and every 60 seconds would toggle between normal mode, in which all control cells are sent as appropriate, and throttle mode, in which all control cells are held.

[Figure 5.5: Effects of artificial throttling. Observed throughput (KB/s) over time for the bulk and torrent clients; the shaded regions mark periods when control cells were held.]

Figure 5.6a shows the observed throughput at the probe client connected to the entry guard that the victim client was using, where the shaded regions correspond to the periods where the victim client runs out of available cells to send and is throttled. During these periods the probe client sees a large spike in observed throughput as more resources become available at the guard relay. While these changes are visually identifiable, we need a quantitative test that can deal with noise and variability in order to analyze a large number of probe clients and reduce the set of possible guards.

5.5.3 Smoothing Throughput

The first step is to smooth out the throughput measurements for each probe client in order to filter out noise. Given throughput measurements (t_i, b_i), we first compute the exponentially weighted moving average (EWMA), using α = 0.01. We then take the output from EWMA and pass it through a cubic spline smoothing algorithm [91] with smoothing parameter λ = 0.5. It takes as input the set of measurements (t_i, EWMA_i) and a smoothing parameter λ ≥ 0, and returns the function f(t) which minimizes the penalized residual sum of squares:

\sum_{i=1}^{n} \left( EWMA_i − f(t_i) \right)^2 + λ \int_{t_1}^{t_n} f''(t)^2 \, dt

[Figure 5.6: Raw and smoothed throughput of the probe through the guard during the attack. Panels: (a) raw throughput, (b) smoothed throughput.]

The result of this process can be seen in Figure 5.6b with the normalized smoothed throughput plotted over the shaded attack windows.

5.5.4 Scoring Algorithm

The intuition behind the scoring algorithm can be seen in Figure 5.6b.
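Before turning to the scoring details, the smoothing step of Section 5.5.3 can be sketched as follows; SciPy's smoothing spline stands in for the cubic spline smoother of [91], so its smoothing factor is not numerically identical to λ:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_throughput(times, values, alpha=0.01, smooth=0.5):
    """EWMA followed by a cubic smoothing spline (Section 5.5.3).

    times, values: raw probe measurements (t_i, b_i), e.g. at 100 ms intervals.
    alpha:  EWMA weight (the chapter uses 0.01).
    smooth: spline smoothing factor; a stand-in for the penalty weight lambda.
    """
    b = np.asarray(values, dtype=float)
    ewma = np.empty_like(b)
    ewma[0] = b[0]
    for i in range(1, len(b)):
        ewma[i] = alpha * b[i] + (1 - alpha) * ewma[i - 1]
    # UnivariateSpline fits a smoothing spline; a larger s gives a smoother fit.
    spline = UnivariateSpline(times, ewma, k=3, s=smooth * len(times))
    return spline(times)
```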
Over each attack window marked by the shaded regions, the guard's probe client should see large increases and decreases in throughput at the beginning and end of the window. Here we want the scoring algorithm to place heavy weight on consistent large increases and decreases that align with all the windows, while at the same time minimizing false positives from assigning too much weight to large spikes that randomly happen to align with an attack window. The first step of the scoring algorithm is to calculate a linear regression on the smoothed throughput over the first δ seconds⁴ at the start and at the end of each attack window and collect the slope values s_i and e_i for the start and end regressions. Then, for each window i, first sort all probe clients based on their s_i slope value from highest to lowest, and for each probe client record their relative rank rs_i. Repeat this for the slope value e_i, only instead sort the clients from lowest to highest, again recording their relative rank re_i. Rankings are used instead of the raw slope values in order to prevent the false positives from large spikes mentioned previously. Now that each probe client has a set of ranks {rs_1, re_1, ..., rs_n, re_n} over the n attack windows, the final score assigned to each client is simply the mean of all its ranks, where a lower score means a higher chance that the probe client is connected through the entry guard.

⁴ Through empirical evaluation we found δ = 30 seconds to be ideal.

5.5.5 Large Scale Experiments

In order to test the accuracy of this attack at a large scale, we used the base experiment setup discussed in Section 5.3.2, with the addition of one-hop probe clients connecting through each relay. For each run, a random circuit used by a bulk node was chosen to be the attack circuit and the exit node used in the circuit was made an attacker node. The bulk node was updated to run the client that simulates the BitTorrent protocol, and the probe clients that were initially going through the middle and exit relays were removed. For each algorithm, we performed 40 runs with a different attack circuit chosen for each run. We configure the experiments with the vanilla Tor algorithm, which uses SENDME cells for flow control, and the N23 flow control algorithm with N3 = 500 and N3 = 100. We experimented with having BitTorrent send 1, 2, 5, and 10 streams over the circuit.

The results using just one stream can be seen in Figure 5.7, and the results when using multiple streams in Figure 5.8. Figure 5.7a shows the CDF of the average score computed by the ranking algorithm for the entry guard's probe client, while Figure 5.7b shows the percentile compared to all other probe clients. Interestingly, even though in vanilla Tor not a single entry guard probe had a score better than 150 out of 400, we still see that about 80% of the entry guard probe clients were in the 25th percentile amongst all probe clients. Furthermore, we see that the attack is much more successful with the N23 algorithm, especially with N3 = 100, with peaks in degree of anonymity loss at 31%, compared to 27% with N3 = 500 and 19% in vanilla Tor. This is due to the fact that it is easier to induce throttling with N23, especially when the N3 value is low. In vanilla Tor the initial window is set at 1000 cells, while for the N23 algorithm it is N2 + N3, which works out to 510 and 110 cells for N3 = 500 and N3 = 100 respectively (the default value for N2 is 10).
The outcome of this is that for N23 with N3 = 100 we were able to throttle the attack circuit around 12 times, resulting in 24 comparison points over all attack windows, while with both N3 = 500 and vanilla Tor we see around 7 attack windows, resulting in 14 comparison points. The reason that we see slightly better entry probe rankings with N3 = 500 than with vanilla Tor is that with N23 each node buffers cells when it runs out of credit, while vanilla Tor buffers cells at the client. This means that when the flow control cell is finally sent by the attacker, it causes the entry guard to flush the circuit's buffer, producing an immediately noticeable change in throughput and thus a higher rank score for the entry guard.

[Figure 5.7: Throttle attack with BitTorrent sending a single stream over the circuit, comparing vanilla Tor with N23 (N3 = 100 and N3 = 500). Panels: (a) entry average ranks, (b) percentiles, (c) degree of anonymity loss vs. threshold.]

While using one circuit for one stream is the ideal greedy strategy for a BitTorrent client using Tor, it may not always be feasible to accomplish this. To explore what effect sending multiple streams over a circuit has on the attacks, for each algorithm we experimented with having BitTorrent send 2, 5, and 10 streams over the circuit that the attacker throttles. The results are shown in Figure 5.8, and for each algorithm we can see how the degree of anonymity loss changes as more streams are multiplexed over a single Tor circuit. Not surprisingly, all three algorithms have a high degree of anonymity loss when sending 10 streams over the circuit, as this dramatically increases the amount of data being sent over the circuit. Thus, when the artificial throttling is induced, there are larger variations and changes in observed throughput at the entry guard's probe client. Even adding just one extra stream to the circuit can in some cases cause a noticeable reduction in the degree of anonymity, particularly for the N23 algorithm with N3 = 100, as seen in Figure 5.8c.

[Figure 5.8: Degree of anonymity loss for the throttling attack with different numbers of BitTorrent streams (1, 2, 5, 10) over the Tor circuit. Panels: (a) vanilla, (b) N23 (N3 = 500), (c) N23 (N3 = 100).]

5.6 Induced Throttling via Traffic Admission Control

We now look at how an attacker is able to use specific mechanisms in proposed traffic admission control algorithms to induce throttling on a target circuit, creating a throughput signal at much lower cost than the techniques required in [80].

5.6.1 Connection Sybils

Recall that each of the algorithms proposed in [1] relies on the number of client-to-guard connections to adaptively adjust the throttle rate (see Section 5.2). Unfortunately, this feature may be controlled by the adversary during an active sybil attack [92]: an
adversary may either modify a Tor client to create multiple connections to a target guard relay instead of just one, or boot multiple Tor clients that are each instructed to connect to the same target.⁵ As a result, the number of connections C at the target will increase to C = C_n + C_s, where C_n is the number of normal connections and C_s is the number of sybil connections. The throttling algorithms will be affected as follows: the bitsplit algorithm will lower the throttle rate to BandwidthRate/(C_n + C_s); the flag algorithm will throttle any connection whose average throughput has ever exceeded BandwidthRate/(C_n + C_s) to the configured rate⁶; and the threshold algorithm will throttle the loudest fraction f of connections to the throughput of the quietest connection in that throttled set⁷ (but no lower than a floor of 50 KB/s). Therefore, the attacker may cause the throttle rate at any guard relay to drop dramatically under all of the algorithms by using enough sybils. In this way, an adversary controlling an exit node can determine the guard node belonging to a target circuit with high probability. Note that the attacker need not use any bandwidth or computational resources beyond what is required to establish the connections from its client(s) to the target guard relay.

⁵ A similar attack was previously described in [1], Section 5.2, Attack 4.
⁶ We use 50 KB/s as advised in [1].
⁷ We use f = 0.10 as advised in [1].

We test the feasibility of this attack in Shadow. We configure a network with 5 relays, a file server, and a single victim client that downloads a large file from the server through Tor for 300 seconds. The adversary controls the exit relay on the client's circuit and therefore is able to compute the client's throughput. The attacker starts multiple sybil nodes at time t = 100 seconds that each connect to the same entry relay used by the victim. The sybil nodes are shut down at t = 200, after being active for 100 seconds. The results of this attack on each algorithm are shown in Figure 5.9, where the shaded area represents the time during which the attack was active. Figure 5.9a shows that the bitsplit algorithm is only affected while the attack is active, after which the client throughput returns to normal. However, the client throughput remains degraded in both Figures 5.9b and 5.9c: the flag algorithm flagged the client as a high throughput node and did not unflag it, while the threshold algorithm continued to throttle at the 50 KB/s floor. Further, notice that the throttling does not occur until roughly 10 to 20 seconds after the attack has begun. This is due to the size of the token bucket, i.e., the BandwidthBurst configuration: the attack causes the refill rate to drop dramatically, but it takes time for the client to use up the existing tokens in the bucket. In addition to the delay associated with the token bucket size, Figure 5.9c shows added delay in the threshold algorithm because it only updates throttle rates once per minute.

[Figure 5.9: Small scale sybil attack on the bandwidth throttling algorithms from [1]: (a) bitsplit, (b) flag, (c) threshold. The shaded area represents the period during which the attack is active. The sybils cause an easily recognizable drop in throughput on the target circuit.]
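The effect of sybil connections on the throttle rate, and the delayed onset of throttling, can be estimated directly from the bucket parameters. A rough sketch with illustrative numbers (the real BandwidthRate and BandwidthBurst values are relay-specific):

```python
def throttle_rate_bitsplit(bandwidth_rate, normal_conns, sybil_conns):
    """Bitsplit: the guard's configured BandwidthRate is split evenly across
    client-to-guard connections, so sybils directly lower the per-connection rate."""
    return bandwidth_rate / (normal_conns + sybil_conns)

def seconds_until_throttled(burst_bytes, circuit_rate, throttled_rate):
    """Rough time before the victim circuit feels the new rate: the tokens left
    in the burst bucket must drain first, while refills arrive at throttled_rate."""
    if circuit_rate <= throttled_rate:
        return float("inf")  # the circuit never exceeds the new rate
    return burst_bytes / (circuit_rate - throttled_rate)

# Illustrative numbers: a 2 MiB burst bucket, a 250 KiB/s circuit, and enough
# sybils to force a 50 KiB/s rate leave roughly 10 seconds of delay before the
# throughput drop becomes visible to the attacker.
delay = seconds_until_throttled(2 * 1024 * 1024, 250 * 1024, 50 * 1024)
```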
5.6.2 Large Scale Experiments

We further explore the sybil attack at a large scale in Shadow. In addition to the base experiment setup discussed in Section 5.3.2, we add a victim client who downloads a large file through a circuit with an adversarial exit and a known target guard. The adversary starts sybil nodes and instructs them to connect to the known target guard at time t = 100. The sybil nodes cycle through 60 second active and inactive phases, and the adversary measures the throughput of the circuit. The results of this attack on each algorithm are shown in Figure 5.10, where the shaded area again represents the time during which the attack was active. In the large scale experiment, only the bitsplit algorithm (Figure 5.10a) produced a repeatable signal, while the throttle rate remained constant after the first phase for both flag (Figure 5.10b) and threshold (Figure 5.10c). These results also show the importance of correctly computing the attack duration: the signal was missed during the first phase for both bitsplit and flag because the token bucket had not yet fully drained.

[Figure 5.10: Large scale sybil attack on the bandwidth throttling algorithms from [1]: (a) bitsplit, (b) flag, (c) threshold. The shaded area represents the period during which the attack is active. Due to token bucket sizes, the throughput signal may be missed if the phases of the attack are too short.]

5.6.3 Search Extensions

Because the drop in throughput during the attack is easily recognizable, this attack is much easier to carry out than those discussed in Section 5.5. Therefore, in addition to statistical correlations that eliminate potential guards from a candidate set for a given circuit, a search strategy over the potential guard set will also be useful. In a linear search, the adversary would attack each candidate guard one by one until the throughput signal on the target circuit is recognized. This strategy is simple, but it may take a long time to test every candidate. A binary search, where the attacker tests half of the candidates in each phase of the search, would significantly reduce the search time. Note that a binary search may be ineffective on certain configurations of the flag and threshold algorithms because of the lack of a repeatable signal (see Figures 5.9 and 5.10).

Regardless of the search strategy, a successful attack must use enough sybils to allow the adversary to recognize the throughput signal, but should otherwise be as fast as possible. Given our results above, the attack should consider the token bucket size and refill rate of each candidate guard to aid in determining the number of sybils to launch and the length of time each sybil should remain active. An adversary who controls the circuit exit may compute the average circuit throughput and estimate the amount of time it would take for the circuit to deplete the remaining tokens in a target guard's bucket during an attack. Each sybil should then remain active for at least that amount of time. Figure 5.10 shows that the throughput signal may be missed without these considerations.

5.7 Analysis

Having seen how the algorithms perform independently in different attack scenarios, we now want to examine the overall effect on anonymity that each algorithm has.
For our analysis, we use the client probability distributions discussed in Section 4.3 to measure how much information is gained by an adversary. Recall that the probability that a client is the victim is

P[V = C_i] = \sum_j P[V = C_i \mid R_j] \, P[G = R_j]

for a set of relays R and clients C. Now we need to determine how to update the probability distributions P[G = R_j] and P[V = C_i | R_j] based on the attacks we have covered. Three of the attacks discussed are used to learn information about the guard node used in a circuit, determining P[G = R_j]. The throughput attack and the artificial throttling attack both attempt to reduce the set of possible entry guards by assigning a score to each guard and using a threshold value to determine the reduced set. For each attack, we compile the set of scores assigned to the actual guard nodes, and use this set as input to a kernel density estimator in order to generate an estimate of the probability density function, P̂. Then, given a relay R_j with score score(R_j), we can compute P[G = R_j] = P̂[X = score(R_j)]. With the sybil attacks on the throttling algorithms we were able to uniquely identify the entry guard, so we have P[G = R_j] = 1 if R_j is the guard and 0 otherwise. Therefore, denoting R_G as the guard relay, we have P[V = C_i] = P[V = C_i | R_G].

For the probability distribution P[V = C_i | R_j], recall that the latency attack computes the difference between the estimated latency T̂_VE and the actual latency T_VE as the score, and ranks potential clients that way. Using the absolute value of the difference as the score, we compute the probability density function P̂ in the same way as we did for P[G = R_j]. Therefore, to compute P[V = C_i | R_j], we let lat_{C_i,R_j} be the actual latency between client C_i and relay R_j, and T̂_VE be the estimated latency between the client and the entry guard. Then, with diff = |lat_{C_i,R_j} − T̂_VE|, we have P[V = C_i | R_j] = P̂[X = diff] from the computed probability density function.

In our analysis we first concentrate on how the algorithms perform with just the throughput and latency attacks compared to vanilla Tor. We then replace the throughput attack with the new throttling attacks shown in Sections 5.5 and 5.6 to see if there are improvements over either vanilla Tor or the throughput attack with the algorithms. For each attack we use a set of potential victims {V_i} and their entry guards G_i and compute the probabilities P[V = V_i] as shown above.

The results for the EWMA and N23 algorithms and the induced throttling attacks described in Section 5.5 are shown in Figure 5.11. We can see in Figure 5.11a that with just the throughput and latency attacks, vanilla Tor leaks about the same amount of information about a client as EWMA and N23. The exception is that a tiny proportion of clients have probabilities ranging from 4-6%, which is outside the range seen in vanilla Tor. Referring back to Figure 5.3, we see that these algorithms, N23 in particular, leak slightly more information than vanilla Tor based on latency estimation. While this information gain is sometimes counteracted by the number of possible entry guards that need to be considered, there are a small number of cases where the guard set is reduced enough that the extra information from the latency attack translates into a higher probability for the client. When replacing the throughput attack with the induced throttling attack we start to see a larger divergence in client probabilities, as shown in Figure 5.11b.
While the induced throttling attack with vanilla Tor leaks slightly more information than vanilla Tor with the throughput attack, N23 with N3 = 500 produces higher client probabilities than both attacks on vanilla Tor and higher than N23 with just the throughput attack. Furthermore, N23 with N3 = 100 does significantly better than all previous algorithms, leaking more information than vanilla Tor for almost half the clients and reaching probabilities as high as 15%.

The results in Figure 5.11b assume that the client sends only one stream over the circuit, the worst case scenario for an adversary. As shown in Figure 5.8, as the number of streams multiplexed over the circuit increases, the degree of anonymity loss sharply approaches 100%, implying that an adversary would be able to uniquely identify the entry guard. The analysis under this assumption can be seen in Figure 5.11c, where P[G = R_j] = 1 when R_j is the entry guard. Here we see a dramatic improvement over the case when a client sends only a single stream over the circuit, with some clients having probabilities as high as 60%, compared to a peak of 15% with a single stream.

[Figure 5.11: Probabilities of the victim client under the attack scenarios for the flow control algorithms. Panels: (a) throughput attack, (b) throttling attack, (c) perfect knowledge of the guard.]

Using the throughput attack with the throttling algorithms produces results similar to N23 and EWMA, as shown in Figure 5.12a. There is a slightly higher upper bound on the client probability caused by the threshold and ideal throttling algorithms, but for the most part these results line up fairly closely with what was previously seen. Given that the throttling algorithms all had peaks similar to N23 with respect to the loss of anonymity in the latency attacks, these results are not too surprising. Even with the better performance of the latency attack, these gains are wiped out by the fact that the throughput attack leaves too many guards that need to be considered in relation to the clients. However, once we use the sybil attack with the throttling algorithms instead of the throughput attack, where we assume an adversary is able to uniquely identify the entry guard in use, we start to see dramatically higher client probabilities. Figure 5.12b shows the result of this analysis, where at the extreme end we see clients with probabilities as high as 90%. This is due to the fact that with the sybil attack we are able to identify the exact entry guard used by each victim, thus reducing the noise from having to consider client latencies with respect to other possible relays. This very effectively demonstrates the level of anonymity lost when an adversary is able to significantly reduce the set of possible entry guards.

5.8 Conclusion

While high performance is vital to the Tor system, algorithms which seek to improve the allocation of network resources via more advanced flow control or traffic admission algorithms need to take into account the implications for anonymity, both with respect to existing attacks and the potential for new ones.
To this effect, we introduce a new class of induced throttling attacks and demonstrate their effectiveness across a wide variety of performance enhancing algorithms, resulting in dramatic information leakage about victim clients. Using the new class of attacks, we perform a comprehensive analysis of the implications for anonymity, showing both the effects the algorithms have on existing attacks and the increase in information gain from the new attacks.

[Figure 5.12: Victim client probabilities under the throttling algorithm attack scenarios. Panels: (a) throughput attack, (b) sybil attack.]

Preventing these new attacks is not straightforward, as in many cases the adversary is merely exploiting the underlying mechanisms of the algorithms. With the induced throttling attacks, an adversary acts exactly as a client would under heavy congestion, so prevention or detection becomes difficult without changing the algorithm altogether. In these cases it comes down to the trade-off between performance increases and potential anonymity loss. In the past Tor has often come down on the side of performance, as the increase in performance frequently has the potential to attract new clients, perhaps negating any anonymity loss introduced by the algorithms. However, in the throttling algorithms the adversary is taking advantage of the fact that only the raw number of open connections is considered when calculating the throttling rate, allowing sybil connections to be created using negligible resources. To prevent this, we want a potential adversary to have to allocate a non-trivial amount of bandwidth to the connections in order to have such a substantial effect on the throttling rate. One way to do so is to only consider active connections which have seen a minimum amount of bandwidth over a certain time period. Another approach is, instead of basing the throttling rate on the total number of connections, to calculate it over the average bandwidth. This way there is a direct correlation between how much bandwidth an adversary must provide and how low the throttling rate can be made.

Chapter 6
Managing Tor Connections for Inter-Relay Communication

6.1 Introduction

One well-recognized performance issue in Tor stems from the fact that all circuits passing between a pair of relays are multiplexed over a single TLS connection. As shown by Reardon and Goldberg [15], this can result in several undesirable effects on performance: a single, high-volume circuit can lead to link congestion, throttling all circuits sharing this link [15]; delays for packet re-transmissions can increase latency for other circuits, leading to "head-of-line" blocking [93]; and long write buffers reduce the effectiveness of application-level scheduling decisions [4]. As a result, several researchers have proposed changes to the transport protocol for the links between relays. Reardon and Goldberg suggested that relays should use a Datagram TLS tunnel at the transport level, while running a separate TCP session at the application level for each circuit [15]; this adds a high degree of complexity (an entire TCP implementation) to the application.
Similarly, the "Per-Circuit TCP" (PCTCP) design [3] establishes a TCP session for each circuit, hiding the exact traffic volumes of these sessions by establishing an IPSEC tunnel between each pair of relays; however, kernel-level TCP sessions are an exhaustible resource, and we demonstrate in Section 6.3 that this can lead to attacks on both availability and anonymity. In contrast, the Torchestra transport suggested by Gopal and Heninger [2] has each relay pair share one TLS session for "bulk download" circuits and another for "interactive" traffic. Performance then critically depends on the threshold for deciding whether a given circuit is bulk or interactive.

We present two novel solutions to address these problems. The first is inverse-multiplexed Tor with adaptive channel size (IMUX). In IMUX, each relay pair maintains a set of TLS connections (a channel) roughly proportional to the number of "active" circuits between the pair, and all circuits share these TLS connections; the total number of connections per relay is capped. As new circuits are created or old circuits are destroyed, connections are reallocated between channels. This approach allows relays to avoid many of the performance issues associated with the use of a single TCP session: packet losses and buffering on one connection do not cause delays or blocking on the other connections associated with a channel. At the same time, IMUX can offer performance benefits over Torchestra by avoiding fate sharing among all interactive streams, and over per-circuit designs by avoiding the need for TCP handshaking and slow-start on new circuits. Compared to designs that require a user-space TCP implementation, IMUX has significantly reduced implementation complexity, and due to the use of a per-relay connection cap, IMUX can mitigate attacks aimed at exhausting the available TCP sessions at a target relay. The second solution is to transparently replace TCP with the micro transport protocol (uTP), a mature protocol used for communication in BitTorrent. A user space protocol stack allows per-circuit "connections" over a single UDP socket, preventing specific socket exhaustion attacks while increasing client performance. To this end, this chapter makes the following contributions:

• We describe new socket exhaustion attacks on Tor and PCTCP that can anonymously disable targeted relays, and demonstrate how socket exhaustion leads to reductions in availability, anonymity, and stability.

• We describe IMUX, a novel approach to the circuit-to-socket solution space. Our approach naturally generalizes between the "per-circuit" approaches such as PCTCP and the fixed number of sessions in "vanilla Tor" (one) and Torchestra (two).

• We analyze a variety of scheduling designs for using a variable number of connections per channel through large-scale simulations with the Shadow simulator [67]. We compare IMUX to PCTCP and Torchestra, and suggest parameters for IMUX that empirically outperform both related approaches while avoiding the need for IPSEC and reducing vulnerability to attacks based on TCP session exhaustion.

• We perform the first large scale simulations of the Torchestra design and the first simulations that integrate KIST [4] with Torchestra, PCTCP, and IMUX to compare the performance interactions among these complementary designs.

• We introduce xTCP, a library for transparently replacing TCP with any in-order reliable protocol.
With xTCP we explore performance improvements when using uTP for inter-relay communication, allowing a dedicated uTP communication stack per circuit while avoiding socket exhaustion attacks.

6.2 Background

In order to create a circuit, the client sends a series of EXTEND cells through the circuit, each of which notifies the current last hop to extend the circuit to another relay. For example, the client sends an EXTEND cell to the guard telling it to extend to the middle. Afterwards the client sends another EXTEND cell to the middle telling it to extend the circuit to the exit. The relay, on receiving an EXTEND cell, will establish a channel to the next relay if one does not already exist. Cells from all circuits between the two relays get transferred over this channel, which is responsible for in-order delivery and, ideally, securing communication against potential eavesdroppers. Tor uses a TLS channel with a single TCP connection between the relays for in-order delivery and uses TLS to encrypt and authenticate all traffic. This single TCP connection shared by multiple circuits is a major cause of performance issues in Tor [15]. Many proposals have been offered to overcome these issues, such as switching completely away from TCP and using different transport protocols. In this chapter we focus on three proposals, two of which directly address cross-circuit interference and one tangentially related proposal that deals with how Tor interacts with its TCP connections.

Torchestra: To prevent bulk circuits from interfering with web circuits, Gopal and Heninger [2] developed a new channel, Torchestra, that creates two TLS connections between each relay pair, one reserved for web circuits and the other for bulk. This prevents head-of-line blocking caused by bulk traffic from interfering with web traffic. The paper evaluates the Torchestra channel in a "circuit-dumbbell" topology and shows that time to first byte and total download time for "interactive" streams decrease, while "heavy" streams do not see a significant change in performance.

PCTCP: Similar to TCP-over-DTLS, AlSabah and Goldberg [3] propose dedicating a separate TCP connection to each circuit and replacing the TLS session with an IPSEC tunnel that can then carry all the connections without letting an adversary learn circuit-specific information from monitoring the different connections. This has the advantage of eliminating the reliance on user-space TCP stacks, leading to reduced implementation complexity and improved performance. However, as we show in the next section, the use of a kernel-provided socket for every circuit makes it possible to launch attacks that attempt to exhaust this resource at a targeted relay.

KIST: Jansen et al. [4] show that cells spend a large amount of time in the kernel output buffer, causing unneeded congestion and severely limiting the effect of prioritization in Tor. They introduce a new algorithm, KIST, with two main components: global scheduling across all writable circuits, which restores effective circuit prioritization, and an autotuning algorithm that can dynamically determine how much data should be written to the kernel. This allows data to stay internal to Tor for longer, allowing it to make smarter scheduling decisions than simply dumping everything it can to the kernel, which operates in a FIFO manner.
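To make the autotuning idea concrete, the following is a minimal sketch of a kernel-informed write limit on Linux; it is not Tor's actual KIST implementation. It uses only the congestion window and the count of unacknowledged segments from TCP_INFO, whereas the published algorithm also accounts for free space in the socket's send buffer.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Return an upper bound on how many bytes should be flushed to this TCP
     * socket right now, or -1 if the kernel statistics are unavailable. */
    static long kernel_informed_write_limit(int fd) {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
            return -1;  /* fall back to writing as much as possible */

        /* Bytes the congestion window still allows: cwnd and unacked are
         * reported in segments, so scale both by the sender MSS. */
        long cwnd_bytes = (long)info.tcpi_snd_cwnd * info.tcpi_snd_mss;
        long unacked_bytes = (long)info.tcpi_unacked * info.tcpi_snd_mss;
        long space = cwnd_bytes - unacked_bytes;

        return space > 0 ? space : 0;
    }

Writing at most this much keeps the kernel buffer nearly empty, so cells remain inside Tor where the circuit prioritization described above can still reorder them.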
6.3 Socket Exhaustion Attacks

This section discusses the extent to which Tor is vulnerable to socket descriptor exhaustion attacks that may lead to reductions in relay availability and client anonymity, explains how PCTCP creates a new attack surface with respect to socket exhaustion, and demonstrates how socket resource usage harms relay stability. The attacks in this section motivate the need for the intelligent management of sockets in Tor, which is the focus of Sections 6.4 and 6.5.

6.3.1 Sockets in Tor

On modern operating systems, file descriptors are a scarce resource that the kernel must manage and allocate diligently. On Linux, for example, soft and hard file limits are used to restrict the number of file descriptors that any process may have open at one time. Once a process exceeds this limit, any system call that attempts to open a new file descriptor will fail and the kernel will return an EMFILE error code indicating too many open files. Since sockets are a specific type of file descriptor, this same issue can arise if a process opens sockets in excess of the file limit. Aware of this limitation, Tor internally utilizes its own connection limit. For relays running on Linux and BSD, an internal variable ConnLimit is set to the maximum limit as returned by the getrlimit() system call; ConnLimit is set to a hard-coded value of 15,000 on all other operating systems. Each time a socket is opened or closed, an internal counter is incremented or decremented; if a tor_connect() function call is made when this counter is above the ConnLimit, it preemptively returns an error rather than waiting for one from the connect system call.

6.3.2 Attack Strategies

There are several cases to consider in order to exploit open sockets as a relay attack vector. Relay operators may be: (i) running Linux with the default maximum descriptor limit of 4096; (ii) running Linux with a custom descriptor limit or running a non-Linux OS with the hard-coded ConnLimit of 15,000; or (iii) running any OS and allowing unlimited descriptors. We note that setting a custom limit generally requires root privileges, although it does not require that Tor itself be run as the root user. Also note that each Tor relay connects to every other relay with which it communicates, leading to potentially thousands of open sockets under normal operation. In any case, the adversary's primary goal is to cause a victim relay to open as many sockets as possible.

Consuming Sockets at Exit Relays: In order to consume sockets at an exit relay, an adversary can create multiple circuits through independent paths and request TCP streams to various destinations. Ideally, the adversary would select services that use persistent connections to ensure that the exit holds open the sockets. The adversary could then send the minimal amount of traffic required to keep the connections active. Although the adversary remains anonymous (because the victim exit relay does not learn the adversary's identity), keeping persistent connections active so that they are not closed by the exit will come at a bandwidth cost.

Consuming Sockets at Any Relay: Bandwidth may be traded for CPU and memory by using Tor itself to create the persistent connections, in which case relays in any position may be targeted. This could be achieved by an adversary connecting several Tor client instances directly to a victim relay; each such connection would consume a socket descriptor. However, the victim would be able to determine the adversary's IP address (i.e., identity).
The attack can also be done anonymously. The basic mechanism to do so was outlined in The Sniper Attack, Section II-C-3 [54], where it was used in a relay memory consumption denial of service attack. Here, we use similar techniques to anonymously consume socket descriptors at the victim. The attack is depicted in Figure 6.1a. First, the adversary launches several Tor client sybils. A1 and A5 are used to build independent circuits through G1, M1, E1 and G2, M2, E2, respectively, following normal path selection policies. These sybil clients also configure a SocksPort to allow connections from other applications. Then, A2, A3, and A4 use either the Socks4Proxy or Socks5Proxy options to extend new circuits to a victim V through the Tor circuit built by A1. The A6, A7, and A8 sybils similarly extend circuits to V through the circuit built by A5. Each Tor sybil client will create a new tunneled channel to V, causing the exits E1 and E2 to establish new TCP connections with V. Each new TCP connection to V will consume a socket descriptor at the victim relay. When using either the Socks4Proxy or the Socks5Proxy options, the Tor software manual states that "Tor will make all OR connections through the SOCKS [4,5] proxy at host:port (or host:1080 if port is not specified)." We also successfully verified this behavior using Shadow. This attack allows an adversary to consume roughly one socket for every sybil client, while remaining anonymous from the perspective of the victim. Further, the exits E1 and E2 will be blamed if any misbehavior is suspected, while they themselves will be unable to discover the identity of the true attacker. If Tor were to use a new socket for every circuit, as suggested by PCTCP [3], then the adversary could effectively launch a similar attack with only a single Tor client.

Consuming Sockets with PCTCP: PCTCP may potentially offer performance gains by dedicating a separate TCP connection to every circuit. However, PCTCP widens the attack surface and reduces the cost of the anonymous socket exhaustion attack discussed above. If all relays use PCTCP, an adversary may simply send EXTEND cells to a victim relay through any other relay in the network, causing a new circuit to be built and therefore a socket descriptor to be opened at the victim. Since the cells are being forwarded from other relays in the network, the victim relay will not be able to determine who is originating the attack. Further, the adversary gets long-term persistent connections cheaply with the use of the Tor config options MaxClientCircuitsPending and CircuitIdleTimeout. The complexity of the socket exhaustion attack is reduced, and the adversary no longer needs to launch the tunneled sybil attack in order to anonymously consume the victim's sockets. By opening circuits with a single client, the attacker will cause the victim's number of open connections to reach the ConnLimit or may cause relay stability problems (or both).
6.3.3 Effects of Socket Exhaustion

Socket exhaustion attacks may lead to either reduced relay availability and client anonymity if there is a descriptor limit in place, or may harm relay stability if either there is no limit or the limit is too high. We now explore these effects.

Limited Sockets: If there is a limit in place, then opening sockets will consume the shared descriptor resource. An adversary that can consume all sockets on a relay will have effectively made that relay unresponsive to connections by other honest nodes due to Tor's ConnLimit mechanism. If the adversary can persistently maintain this state over time, then it has effectively disabled the relay by preventing it from making new connections to other Tor nodes. We ran a socket consumption attack against both vanilla Tor and PCTCP using live Tor relays. Our attacker node created 1000 circuits every 6 seconds through a victim relay, starting at time 1800 and ending at time 3600. Figure 6.1b shows the victim relay's throughput over time as new circuits are built and victim sockets are consumed. After consuming all available sockets, the victim relay's throughput drops close to 0 as old circuits are destroyed, effectively disabling the relay. This, in turn, will move honest clients' traffic away from the relay and onto other relays in the network. If the adversary is running relays, then it has increased the probability that its relays will be chosen by clients and therefore has improved its ability to perform end-to-end traffic correlation [62]. After the attacker stops the attack at time 3600, the victim relay's throughput recovers as clients are again able to successfully create circuits through it.

Unlimited Sockets: One potential solution to the availability and anonymity concerns caused by a file descriptor limit is to remove the limit (i.e., set an unlimited limit), meaning that ConnLimit gets set to 2^32 or 2^64. At the lesser of the two values, and assuming that an adversary can consume one socket for every 512-byte Tor cell it sends to a victim, it would take around 2 TiB of network bandwidth to cause the victim to reach its ConnLimit. However, even if the adversary cannot cause the relay to reach its ConnLimit, opening and maintaining sockets will still drain the relay's physical resources and will increase the processing time associated with socket-based operations in the kernel. By removing the open descriptor limit, a relay becomes vulnerable to performance degradation, increased memory consumption, and an increased risk of being killed by the kernel or otherwise crashing. An adversary may cause these effects through the same attacks it uses against a relay with a default or custom descriptor limit. Figure 6.1c shows how memory consumption increases with the number of open sockets as a process opens over a million sockets.

[Figure 6.1: Showing (a) the anonymous socket exhaustion attack using client sybils, (b) throughput of a relay when launching a socket exhaustion attack via circuit creation, with the shaded region representing when the attack was being launched, and (c) memory consumption as a process opens sockets using libevent.]

We demonstrate other performance effects using a private Tor network of 5 machines in our lab. Our network consisted of 4 relays total (one directory authority), each running on a different machine. We configured each relay to run Tor v0.2.5.2-alpha modified to use a simplified version of PCTCP that creates a new OR connection for every new circuit. We then launched a simple file server, a Tor client, and 5 file clients on the same machine as the directory authority.
The file clients download arbitrary data from the server through a specified path of the non-directory relays, always using the same relay in each of the entry, middle, and exit positions. The final machine ran our Tor attacker client, which we configured to accept localhost connections over the ControlPort. We then used a custom python script to repeatedly: (1) request that 1000 new circuits be created by the Tor client, and (2) pause for 6 seconds. Each relay tracked socket and bandwidth statistics; we use throughput and the time to open new sockets to measure performance degradation effects and relay instability.

The stability effects for the middle relay are shown in Figure 6.2. The attack ran for just over 2500 seconds and caused the middle relay to successfully open more than 50 thousand sockets. We noticed that our relays were unable to create more sockets due to port allocation problems, meaning that (1) we were unable to measure the potentially more serious performance degradation effects that occur when the socket count exceeds 65 thousand, and (2) unlimited sockets may not be practically attainable due to port exhaustion between a pair of relays. Figure 6.2a shows throughput over time and Figure 6.2b shows a negative correlation of bandwidth to the number of open sockets (r = −0.536, r^2 = 0.288); both of these figures show a drop of more than 750 KiB/s in the 60 second moving average throughput during our experiment. Processing overhead during socket system calls over time is shown in Figure 6.2c, and the correlation to the number of open sockets is shown in Figure 6.2d (r = 0.928, r^2 = 0.861); both of these figures clearly indicate that increases in kernel processing time can be expected as the number of open sockets increases. Although the absolute time to open sockets is relatively small, it more than doubled during our experiment; we believe this is a strong indication of performance degradation in the kernel and that increased processing delays in other kernel socket processing functions are likely as well.

[Figure 6.2: Showing (a) throughput over time, (b) a linear regression correlating throughput to the number of open sockets, (c) kernel time to open new sockets over time, and (d) a linear regression correlating kernel time to open a new socket to the number of open sockets.]

6.4 IMUX

This section explores a new algorithm that takes advantage of multiple connections while respecting the ConnLimit imposed by Tor and preventing the attacks discussed above in Section 6.3. Both Torchestra and PCTCP can be seen as heuristically derived instances of a more general resource allocation scheme with two components, one determining how many connections to open between relays, and the second selecting a connection to schedule cells on. Torchestra's heuristic is to fix the number of connections at two, designating one for light traffic and the other for heavy, then scheduling cells based on the traffic classification of each circuit. PCTCP keeps a connection open for each circuit between two relays, with each connection devoted to a single circuit that schedules cells on it.
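One way to make this generalization concrete is as a pair of callbacks; the sketch below uses hypothetical opaque types standing in for Tor's own structures rather than any real interface.

    typedef struct channel_t channel_t;
    typedef struct circuit_t circuit_t;
    typedef struct connection_t connection_t;

    typedef struct channel_policy_t {
        /* How many connections should a channel keep open when it is carrying
         * ncircs circuits?  Torchestra always answers two, PCTCP answers
         * ncircs, and IMUX (below) answers a value proportional to the
         * channel's share of active circuits, capped by the relay's limit. */
        int (*target_conns)(const channel_t *chan, int ncircs);

        /* Which of the channel's open connections should carry this circuit's
         * next cell?  Torchestra picks the light or heavy connection based on
         * the circuit's traffic class; PCTCP picks the circuit's dedicated
         * connection. */
        connection_t *(*pick_conn)(channel_t *chan, const circuit_t *circ);
    } channel_policy_t;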
While there is a demonstrable advantage to being able to open multiple connections between two communicating relays, it is important to have an upper limit on the number of connections allowed in order to prevent anonymous socket exhaustion attacks against relays, as shown in Section 6.3. In this section we introduce IMUX, a new heuristic for handling multiple connections between relays that is able to dynamically manage open connections while taking into consideration the internal connection limit in Tor.

6.4.1 Connection Management

Similar to PCTCP, we want to ensure the allocation of connections each channel has is proportional to the number of active circuits each channel is carrying, dynamically adjusting as circuits are opened and closed across all channels on the relay. PCTCP can easily accomplish this by dedicating a connection to each circuit every time one is opened or closed, but since IMUX enforces an upper limit on connections, the connection management requires more care, especially since both ends of the channel may have different upper limits.

We first need a protocol dictating how and when relays can open and close connections. During the entire time the channel is open, only one relay is allowed to open connections, initially set to the relay that creates the channel. However, at any time either relay may close a connection if it detects the number of open sockets approaching the total connection limit. When a relay decides to close a connection, it must first decide which connection should be closed. To pick which connection to close, the algorithm uses three criteria for prioritizing available connections: (1) always pick connections that haven't fully opened yet; (2) among connections with state OPENING, pick the one that was created most recently; and (3) if all connections are open, pick the one that was least recently used. In order to close a connection C, the relay must first make sure the relay on the other end of the channel is aware C is being closed, to prevent data from being written to C during the closing process. Once a connection C is chosen, the initiating relay sends out an empty cell with a new command, CLOSING_CONN_BEGIN, and marks C for close to prevent any more data from being written to it. Once the responding relay on the other end of the channel receives the cell, it flushes any data remaining in the buffer for C, sends back a CLOSING_CONN_END cell to the initiating relay, and closes the connection. Once the initiating relay receives the CLOSING_CONN_END, it knows that it has received all data and is then able to proceed with closing the socket.

Once a channel has been established, a housekeeping function is called on the channel every second to determine whether to open or close any connections. The function to determine the maximum number of connections that can be open on a channel can be seen in Algorithm 1.

Algorithm 1 Function to determine the maximum number of connections that can be open on a channel.

    function getMaxConns(nconns, ncircs)
        totalCircs ← len(globalActiveCircList)
        if ncircs is 0 or totalCircs is 0 then
            return 1
        end if
        frac ← ncircs / totalCircs
        totalMaxConns ← ConnLimit · τ
        connsLeft ← totalMaxConns − n_open_sockets()
        maxconns ← frac · totalMaxConns
        maxconns ← MIN(maxconns, nconns · 2)
        maxconns ← MIN(maxconns, nconns + connsLeft)
        return maxconns
    end function
We calculate a soft upper limit on the total number of allowed open connections on the relay by taking ConnLimit and multiplying it by the parameter τ ∈ (0, 1). ConnLimit is an internal variable that determines the maximum number of sockets allowed to be open on the relay. On Linux-based relays this is set by calling getrlimit() to get the file limit on the machine; otherwise it is fixed at 15,000. The parameter τ is a threshold value between 0 and 1 that sets a soft upper limit on the number of open connections. Since all connect() calls will fail once the number of open connections exceeds ConnLimit, we want some breathing room so new channels and connections can still be opened, temporarily going past the soft limit, until other connections can be closed to bring the relay back under the limit. To calculate the limit for the channel we simply take this soft limit on the total of all open connections and multiply it by the fraction of active circuits using the channel. This gives us an upper bound on the connection limit for the channel. We then take the minimum of this upper bound and the number of currently open connections on the channel multiplied by two. This is done to prevent rapid connection opening when a channel is first created, particularly when the other relay has a much lower connection limit. Finally we take the minimum of that calculation and the number of open connections plus the number of connections that can still be opened on the relay before hitting the soft limit (which could be negative, signaling that connections need to be closed). Otherwise the channel could create too many connections, driving the number of open connections past ConnLimit.

The housekeeping function is called every second and determines if any connections need to be opened or closed. If we have too few connections and, based on the protocol discussed in the previous paragraph, the relay is allowed to open connections on this channel, enough connections are created to match the maximum connections allowed. If we have too many connections open, we simply close enough until we are at the connection limit. We use the previously discussed protocol for selecting and closing connections, prioritizing newly created connections in an attempt to close unneeded connections before the TLS handshake is started, preventing unnecessary overhead. In addition to the housekeeping function, whenever a channel accepts an incoming connection, it also checks to see if the number of connections exceeds the maximum allowed returned by Algorithm 1. If so, it simply notifies the relay at the other end of the channel that it is closing the connection, preventing that relay from opening any more connections as dictated by the protocol.

6.4.2 Connection Scheduler

For connection scheduling, PCTCP assigns each circuit a dedicated connection to schedule cells on. Torchestra schedules cells from a circuit to either the light or heavy connection depending on how much data is being sent through the circuit. A circuit starts out on the light connection, and if at some point its EWMA value crosses a threshold it is switched to the heavy connection. To accommodate this, a switching protocol is introduced so the relay downstream can be notified when a circuit has switched and on which connection it can expect to receive cells.
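As a small illustration of the switching decision just described, the sketch below reclassifies a circuit by comparing its EWMA against the average EWMA of each connection scaled by a threshold, which is how the Torchestra design (revisited in Section 6.5.2) makes the decision; the names and threshold parameters are illustrative rather than taken from the actual implementation.

    #include <stdbool.h>

    typedef enum { CONN_LIGHT, CONN_HEAVY } conn_class_t;

    typedef struct {
        double ewma;            /* recent cell rate for this circuit */
        conn_class_t assigned;  /* which of the two connections it writes to */
    } circuit_state_t;

    /* Returns true when the circuit should be moved to the other connection,
     * which also triggers the switching notification to the downstream relay. */
    static bool should_switch(const circuit_state_t *c,
                              double light_avg_ewma, double heavy_avg_ewma,
                              double up_thresh, double down_thresh) {
        if (c->assigned == CONN_LIGHT)
            return c->ewma > light_avg_ewma * up_thresh;    /* now "heavy" */
        else
            return c->ewma < heavy_avg_ewma * down_thresh;  /* quieted down */
    }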
While scheduling cells from a circuit on a single connection makes in-order delivery easier by relying on TCP, with multiple connections per channel it is not necessary to do so, and in fact it may not be optimal to keep this constraint. Similar to the uTLS implementation in Tor [93] and Conflux [40], we embed an 8-byte sequence number in the relay header of all cells. This allows the channel to schedule cells across multiple connections, with the cells then reordered at the other end of the channel. With sequence numbers in place and the capability to schedule cells from a circuit across multiple connections, we can evaluate different scheduling algorithms attempting to increase throughput or "fairness", for example, where low traffic circuits have better performance than high traffic ones. We briefly cover the different algorithms and heuristics we can use.

Circuit Round Robin: The first scheduler, shown in Figure 6.3a, emulates PCTCP by assigning each circuit a single connection to transmit its cells. When circuits are added to a channel, the scheduler iterates round-robin style through the circuit list, assigning circuits to successive connections. If there are more circuits than connections, some circuits will share a single connection. When a connection is closed, any circuit assigned to it will be given a new one to schedule through, with the remaining connections iterated in the same round-robin style as before.

EWMA Mapping: Internal to Tor is a circuit scheduling algorithm proposed by Tang and Goldberg [94] that uses an exponentially weighted moving average (EWMA) to compute how "noisy" circuits are being, and then schedules them from quietest to loudest when choosing which circuits to flush. Using the same algorithm, we compute the EWMA value for each connection as well as for the circuits. Then, as seen in Figure 6.3b, the circuits and connections are ordered from lowest to highest EWMA value and we attempt to map each circuit to a connection with a similar EWMA value. More specifically, after sorting the circuits, we take the rank of each circuit, 1 ≤ r_i ≤ n_circs, and compute its percentile p_i = r_i / n_circs. We do the same thing with the connections, computing their percentiles, denoted C_j. Then, to determine which connection to map a circuit to, we pick the connection j such that C_{j−1} < p_i ≤ C_j.

Shortest Queue: While the EWMA mapping scheduler is built around the idea of "fairness", where we penalize high usage circuits by scheduling them on busier connections, we can construct an algorithm aimed at increasing overall throughput by always scheduling cells in an opportunistic manner. The shortest queue scheduler, shown in Figure 6.3c, calculates the queue length of each connection and schedules cells on connections with the shortest queue. This is done by taking the length of the internal output buffer queue that Tor keeps for each connection and adding to it the amount of data in the kernel TCP buffer for the socket; the latter is obtained using the ioctl function call (see http://man7.org/linux/man-pages/man2/ioctl.2.html), passing it the socket descriptor and TIOCOUTQ.

[Figure 6.3: The three different connection scheduling algorithms used in IMUX: (a) Circuit Round Robin Scheduler, (b) EWMA Mapping Scheduler, (c) Shortest Queue Scheduler.]

6.4.3 KIST: Kernel-Informed Socket Transport
Recent research by Jansen et al. [4] showed that by minimizing the amount of data that gets buffered inside the kernel, and instead keeping it local to Tor, better scheduling decisions can be made and connections can be written in an opportunistic manner, increasing performance. The two main components of the algorithm are global scheduling and autotuning. In vanilla Tor, libevent iterates through the connections in a round-robin fashion, notifying Tor that a connection can write. When Tor receives this notification, it performs circuit scheduling only on circuits associated with that connection. Global scheduling takes a list of all connections that can write and schedules among the circuits associated with every connection, making circuit prioritization more effective. Once a circuit is chosen to write to a connection, autotuning then determines how much data should be flushed to the output buffer and on to the kernel. By using a variety of socket and TCP statistics, it attempts to write just enough to keep data in the socket buffer at all times, without flushing everything, giving Tor more control over scheduling decisions.

KIST can be used in parallel with Torchestra, PCTCP, and IMUX, and also as a connection manager within the IMUX algorithm. After KIST selects a circuit and decides how much data to flush, Torchestra and PCTCP make their own determination of which connection to write to based on their internal heuristics. IMUX can also take into account the connection selected by the KIST algorithm and use that to schedule cells from the circuit. Similar to the other connection schedulers, this means that circuits can be scheduled across different connections in a more opportunistic fashion.

6.5 Evaluation

In this section we discuss our experimental setup and the details of our implementations of Torchestra and PCTCP (for comparison with IMUX), evaluate how the dynamic connection management in IMUX is able to protect against potential denial of service attacks by limiting the number of open connections, and finally compare performance across the multiple connection schedulers, along with both Torchestra and PCTCP.

6.5.1 Experimental Setup

We perform experiments in Shadow v1.9.2 [69, 67], a discrete event network simulator capable of running real Tor code in a simulated network. Shadow allows us to create large-scale network deployments that can be run locally and privately on a single machine, avoiding privacy risks associated with running on the public network that many active users rely on for anonymity. Because Shadow runs the Tor software, we are able to implement our performance enhancements as patches to Tor v0.2.5.2-alpha. We also expect that Tor running in Shadow will exhibit realistic application-level performance effects, including those studied in this chapter. Finally, Shadow is deterministic; therefore our results may be independently reproduced by other researchers. Shadow also enables us to isolate performance effects and attribute them to a specific set of configurations, such as variations in scheduling algorithms or parameters. This isolation means that our performance comparisons are meaningful independent of our ability to precisely model the complex behaviors of the public Tor network. We initialize a Tor network topology and node configuration and use it as a common Shadow deployment base for all experiments in this section. For this common base, we use the large Tor configuration that is distributed with Shadow.
The techniques for producing this model are discussed in detail in [95] and updated in [4]. It consists of 500 relays, 1350 web clients, 150 bulk clients, 300 perf clients, and 500 file servers. Web clients repeatedly download a 320 KiB file while pausing between 1 and 60 seconds after every download. Bulk clients continuously download 5 MiB files with no pausing between downloads. The perf clients download a file every 60 seconds, with 100 downloading a 50 KiB file, 100 downloading a 1 MiB file, and 100 downloading a 5 MiB file. The Shadow perf clients are configured to mimic the behavior of the TorPerf clients that run in the public Tor network to measure Tor performance over time. Since the Shadow and Tor perf clients download files of the same size, we verified that the performance characteristics in our Shadow model were reasonably similar to the public network.

6.5.2 Implementations

In the original Torchestra design discussed in [2], the algorithm uses EWMA in an attempt to classify each circuit as "light" or "heavy". Since the EWMA value will depend on many external network factors (available bandwidth, network load, congestion, etc.), the algorithm uses the average EWMA values of the light and heavy connections as benchmarks. Using separate threshold values for the light and heavy connections, when a circuit goes above or below the relevant average multiplied by the threshold, the circuit is reclassified and swapped to the other connection. The issue with this, as noted in [26], is that web traffic tends to be bursty, causing temporary spikes in circuit EWMA values. When this occurs it increases the circuit's chance of becoming misclassified and assigned to the bulk connection. Doing so will in turn decrease the average EWMA of both the light and bulk connections, making it easier for circuits to exceed the light connection's threshold and harder for circuits to drop below the heavy connection's threshold, meaning web circuits that get misclassified temporarily will find it more difficult to get reassigned to the light connection. A better approach would be to use a more complex classifier such as DiffTor [26] to determine whether a circuit is carrying web or bulk traffic. For our implementation, we have Torchestra use an idealized version of DiffTor in which relays have perfect information about circuit classification. When a circuit is first created by a client, the client sends either a CELL_TRAFFIC_WEB or CELL_TRAFFIC_BULK cell notifying each relay of the type of traffic that will be sent through the circuit. Obviously this would be unrealistic to have in the live Tor network, but it lets us examine Torchestra under an ideal situation.

For PCTCP there are two main components of the algorithm. First is the dedicated connection that gets assigned to each circuit, and second is replacing per-connection TLS encryption with a single IPSec layer between the relays, preventing an attacker from monitoring a single TCP connection to learn information about a circuit. For our purposes we are interested in the first component, the performance gains from dedicating a connection to each circuit. IPSec has some potential to increase performance, since each connection no longer requires a TLS handshake, which adds some overhead, but there are a few obstacles noted in [3] that could hinder deployment. Furthermore, it can be deployed alongside any algorithm looking to open multiple connections between relays.
For simplicity, our PCTCP implementation simply opens a new TLS connection for each circuit created, and that circuit then uses the new connection exclusively for transferring cells.

6.5.3 Connection Management

One of the main goals of the dynamic connection manager in IMUX is to avoid denial of service attacks that consume all available open sockets. To achieve this IMUX has a soft limit that caps the total number of connections at ConnLimit · τ, where τ is a parameter between 0 and 1. If this is set too low, we may leave potential performance gains unrealized, while if it is too high we risk exceeding the hard limit ConnLimit, causing new connections to error out and leaving the relay open to denial of service attacks. During our experiments we empirically observed that τ = 0.9 was the highest the parameter could be set without risking crossing ConnLimit, particularly when circuits were being created rapidly, causing high connection churn.

In order to find a good candidate for τ we set up an experiment in Shadow to see how different values act under heavy circuit creation. The experiment consists of 5 clients, 5 guard relays, and 1 exit relay. Each client was configured to create 2-hop circuits through a unique guard to the lone exit relay. The clients were configured to start 2 minutes apart, starting 250 circuits all at once, with the final client creating 1,000 circuits. The guards all had ConnLimit set to 4096 while the exit relay had ConnLimit set to 1024. The experiment was run with IMUX enabled using τ values of 0.80, 0.90, and 0.95. In addition, one run was conducted using PCTCP, contrasting it against the connection management in IMUX. For these experiments we disabled the code in Tor that throws an error once ConnLimit is actually passed, in order to demonstrate the number of sockets each algorithm would attempt to keep open under the different scenarios.

[Figure 6.4: Number of open sockets at the exit relay in PCTCP compared with IMUX with varying τ parameters.]

Figure 6.4 shows the number of open sockets at the exit relay as clients are started and circuits created. As expected we see PCTCP opening connections as each circuit is created, spiking heavily when the last client opens 1,000 circuits. IMUX, on the other hand, is more aggressive in opening connections initially, quickly approaching the soft limit of ConnLimit · τ as soon as the first set of circuits is created and then plateauing, temporarily dropping as circuits are created and connections need to be closed and opened. For most of the experiment the different values of τ produce the same trend, but there are some important differences. When using τ = 0.95, after the third client starts at around time 240 seconds we see a quick spike that goes above the ConnLimit of 1024. This happens because, while we can open connections fairly quickly, closing connections requires the CLOSING_CONN_BEGIN and CLOSING_CONN_END cells to be sent and processed, which can take some time. Neither τ = 0.90 nor τ = 0.80 has this issue; both reliably stay under 1,000 open connections the entire time. We can therefore use τ = 0.90 in our experiments in order to maximize the number of connections we can have open while minimizing the risk of surpassing ConnLimit.
Figure 6.5 shows the effects of the socket exhaustion attack discussed in Section 6.3 with Torchestra and IMUX included. Since Torchestra simply opens two connections between each relay pair, the attack is not able to consume all available sockets, leaving the relay unaffected. The IMUX results show an initial slight drop in throughput as many connections are being created and destroyed. However, throughput recovers to the levels achieved with vanilla Tor and Torchestra as the connection manager stabilizes.

[Figure 6.5: Comparing the socket exhaustion attack in vanilla Tor with Torchestra, PCTCP, and IMUX.]

6.5.4 Performance

First we wanted to determine how the various connection schedulers covered in Section 6.4.2 perform compared to each other and against vanilla Tor. Figure 6.6 shows the results of large scale experiments run with each of the connection schedulers operating with IMUX. Round robin clearly performs better than both the EWMA mapping and shortest queue connection schedulers, at least with respect to time to first byte and download time for web clients, shown in Figures 6.6a and 6.6b. This isn't completely surprising for the shortest queue scheduler, as it indiscriminately tries to push as much data as possible, favoring higher bandwidth traffic at the potential cost of web traffic performance. The EWMA mapping scheduler produces slightly better results for web traffic compared to shortest queue scheduling, but it still ends up performing worse than vanilla Tor. This is related to the issue with using EWMA in Torchestra for classification: web traffic tends to be sent in large bursts, causing the EWMA value to spike rapidly and then decrease over time. So while under the EWMA mapping scheme the first data to be sent will be given high prioritization, as the EWMA value climbs the data gets sent to busier connections, causing the total time to last byte to increase as a consequence.

[Figure 6.6: Performance comparison of IMUX connection schedulers. (a) Time to first byte; (b) Time to last byte of 320 KiB; (c) Time to last byte of 5 MiB.]

Figure 6.7 shows the download times when using IMUX with round robin connection scheduling against vanilla Tor, Torchestra, and PCTCP. While Torchestra and PCTCP actually perform identically to vanilla Tor, IMUX sees an increase in performance for both web and bulk downloads. Half of the web clients see an improvement of at least 12% in their download times, with 29% experiencing more than 20% improvement, and the biggest reduction in download time is seen at the 75th percentile, dropping from 4.5 seconds to 3.3 seconds. Gains for bulk clients are seen too, although not as large, with around 10% of clients seeing improvements of 10-12%. Time to first byte across all clients improves slightly, as shown in Figure 6.7a, with 26% of clients seeing reductions ranging from 20-23% when compared to vanilla Tor; this drops to 11% of clients seeing the same level of improvement when compared to Torchestra and PCTCP.

[Figure 6.7: Performance comparison of IMUX to Torchestra [2] and PCTCP [3]. (a) Time to first byte; (b) Time to last byte of 320 KiB; (c) Time to last byte of 5 MiB.]

We then ran large experiments with KIST enabled, with IMUX using KIST for connection scheduling.
While overall download times improved from the previous experiments, IMUX saw slower download times for web clients and faster downloads for bulk clients, as seen in Figure 6.8. 40% of clients see an increase in time to first byte, while 87% of bulk clients see their download times decrease by 10-28%. This is due to the fact that one of the main advantages of using multiple connections per channel is that it prevents bulk circuits from forcing web circuits to hold off on sending data due to packet loss caused by the bulk circuit. In PCTCP, for example, such packet loss will merely cause the bulk circuit's connection to become throttled while still allowing all web circuits to send data. Since KIST forces Tor to hold on to cells for longer and only writes a minimal amount to the kernel, it is able to make better scheduling decisions, preventing web traffic from unnecessarily buffering behind bulk traffic. Furthermore, KIST is able to take packet loss into consideration since it uses the TCP congestion window in calculating how much to write to the socket. Since the congestion window is reduced when there is packet loss, KIST will end up writing a smaller amount of data whenever it occurs.

[Figure 6.8: Performance comparison of IMUX to Torchestra [2] and PCTCP [3], all using the KIST performance enhancements [4]. (a) Time to first byte; (b) Time to last byte of 320 KiB; (c) Time to last byte of 5 MiB.]

6.6 Replacing TCP

Torchestra, PCTCP, and IMUX attempt to avoid the problem of cross-circuit interference by establishing multiple connections over which to balance transmitted data. While these approaches can sometimes overcome the obstacles, they serve more as a workaround than an actual solution. In this section we explore actually replacing TCP with a different transport protocol that might be better suited for inter-relay communication.

6.6.1 Micro Transport Protocol

The benefit of increasing the number of TCP connections between relays is to limit the effect of cross-circuit interference. Each connection gets its own kernel-level TCP stack and state, minimizing the effects that congestion experienced on one circuit has on all the other circuits. However, as discussed in Section 6.3, making it easier for an adversary to open sockets on a relay can make launching a socket exhaustion attack trivial. Since the only thing we gain when opening multiple TCP connections is separate TCP stacks, it is feasible to gain the performance boost while maintaining a single open connection between relays by using a user-space stack instead. Reardon and Goldberg [15, 16] did just this, using a user-space TCP stack over a UDP connection, with DTLS for encryption. One major issue with this approach is the lack of a complete and stable user-space TCP stack, impacting performance and making it difficult to fully integrate into Tor. Beyond the lack of a stable user-space TCP stack, it is not clear that TCP is the best protocol to use for inter-relay communication. Murdoch [96] covers several other potential protocols that could be used instead of TCP. Due to its use in the BitTorrent file sharing protocol, one of the more mature user-space protocols is the micro transport protocol (uTP).
It provides reliable, in-order delivery, using the low extra delay background transport (LEDBAT) [97] algorithm for congestion control, designed to utilize the maximum amount of bandwidth while minimizing latency. Instead of using packet loss to signal congestion, uTP uses queuing delay on the sending path to inform the transfer rate on the connection. The algorithm has a target queuing delay, and with the end-points collaborating to actively measure delays at each end, the sender can increase or decrease its congestion window based on how far it is from the target delay.

6.6.2 xTCP

While using such a protocol might seem promising, replacing the transport layer in Tor is a very difficult task. As Karsten, Murdoch, and Jansen note in [98], "One conclusion from experimenting with our libutp-based datagram transport implementation is that its far from trivial to replace the transport in Tor." With this large overhead in integrating different transport protocols into Tor, even stable stacks will be difficult to implement and evaluate for their performance effects in Tor. To get around this engineering hurdle we take advantage of the fact that Tor is agnostic to what transport protocol is used for inter-relay communication, as long as the protocol guarantees reliable, in-order delivery. We develop xTCP, a stand-alone library that can be used to replace TCP with any other transport protocol. The design is shown in Figure 6.9.

[Figure 6.9: Operation flows for (a) vanilla Tor and (b) Tor using xTCP.]

Using LD_PRELOAD, xTCP can intercept all socket-related functions. Instead of passing them on to the kernel to create kernel-space TCP stacks, xTCP creates UDP connections which can be used by other transport libraries (e.g., uTP, SCTP, QUIC) for establishing connections, sending and receiving data, congestion control, and data recovery. Table 6.1 lists the libc functions that xTCP needs to intercept in order to replace TCP as used in Tor.

Table 6.1: Functions that xTCP needs to intercept and how xTCP must handle them.

    socket      Create UDP socket and register it with transport library
    bind        No operation, UDP sockets do not bind to an address/port
    listen      No operation, UDP sockets do not listen to an address/port
    accept      Check for new incoming connections, return new descriptor if found
    connect     Create and send CONNECT message to address/port
    send        Append data to write buffer and flush maximum bytes from buffer
    recv        Read and append data to buffer queue, return at most len bytes
    close       Send any close messages that might be needed
    epoll_ctl   Call libc epoll_ctl on all descriptors related to socket
    epoll_wait  Call libc epoll_wait on all descriptors and read in from all that are
                readable. For every existing socket, if the read buffer is non-empty
                set EPOLLIN, and if the write buffer is empty set EPOLLOUT.
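As a rough illustration of the interception mechanism (not the actual xTCP code), the sketch below overrides socket(2) from a preloaded shared object: calls for ordinary protocols are passed straight to libc, while requests for a hypothetical IPPROTO_XTCP value are turned into a UDP socket handed to the user-space transport. The xtcp_register_fd() hook and the protocol number are illustrative assumptions.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #define IPPROTO_XTCP 253   /* example value from the experimental range */

    extern void xtcp_register_fd(int fd);   /* hypothetical bookkeeping hook */

    int socket(int domain, int type, int protocol) {
        int (*real_socket)(int, int, int) =
            (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

        if (protocol != IPPROTO_XTCP)
            return real_socket(domain, type, protocol);  /* untouched TCP/UDP */

        /* The application asked for an xTCP socket: create a plain UDP socket
         * and register the descriptor with the user-space transport library. */
        int fd = real_socket(domain, SOCK_DGRAM, IPPROTO_UDP);
        if (fd >= 0)
            xtcp_register_fd(fd);
        return fd;
    }

Compiled with -shared -fPIC and loaded with LD_PRELOAD, a library like this replaces the libc symbol for the whole process, which is why the remaining functions in Table 6.1 can be handled the same way.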
All functions involved with creating, managing, and communicating with sockets need to be handled. For scheduling between sockets Tor uses libevent [22], which is built on top of epoll, so xTCP also needs to intercept the epoll functions so that Tor can be properly notified when sockets can be read from and written to. To support other applications that do not use epoll, other polling functions such as poll and select would need to be intercepted and handled internally by xTCP. Finally, to allow applications the ability to use both TCP and xTCP sockets, we add an option for the socket function to only intercept calls when the parameter protocol is set to a newly defined IPPROTO_XTCP.

6.6.3 Experimental Setup

With the xTCP library we built a module utilizing the uTP library (https://github.com/bittorrent/libutp) for reliable transmission over UDP. For each function in Table 6.1 there exists a corresponding uTP library function, excluding the epoll functions. Additionally, Tor was modified to open IPPROTO_XTCP sockets when creating OR connections. This enables us to use any protocol for inter-relay communication, while still using TCP for external connections (e.g., edge connections such as exit to server). We also added the option to establish a separate connection for each circuit, similar to PCTCP. While any deployed implementation would want to internally manage the connection stacks within one socket to prevent socket exhaustion attacks, for our purposes this achieves the same performance benefit. Experiments are performed in Shadow v1.10.2 running Tor v0.2.5.10 with the xTCP and PCTCP modifications. We used the large experimental setup deployed with Shadow, similar to the one described in Section 6.5.1, using 500 relays and 1800 clients. Web clients still download 320 KiB files while randomly pausing, and bulk clients download 5 MiB files continuously. The vanilla experiments used TCP for inter-relay communication with PCTCP disabled, while the uTP experiments had PCTCP enabled to ensure each circuit had its own stack.

6.6.4 Performance

Figure 6.10 shows the results of the experiments, comparing vanilla Tor using TCP with Tor using uTP. One of the biggest performance differences is in time to first byte, shown in Figure 6.10a. When using uTP almost every download retrieved the first byte in under 2 seconds, while the vanilla experiments only achieved this for 70% of downloads. Furthermore, when using TCP for inter-relay communication, 8% of downloads had to wait more than 10 seconds for the first byte. This is due to a combination of the separate uTP stack for each circuit (and thus each client download) and the fact that uTP is built to ensure low latency performance. While the time to first byte improved dramatically for all clients, Figure 6.10b shows that total download times for web clients were fairly identical in both cases, with a very small number of web clients experiencing downloads that took 2-3 minutes to complete, compared to the median download time of 12-13 seconds. On the other hand, bulk clients almost universally saw increased performance when using uTP, as seen in Figure 6.10c. The majority of bulk clients experienced 50-250% faster download times compared to vanilla Tor.

[Figure 6.10: Performance comparison of using TCP connections in vanilla Tor and uTP for inter-relay communication. (a) Time to first byte; (b) Time to last byte of 320 KiB; (c) Time to last byte of 5 MiB.]

6.7 Discussion

In this section we discuss the potential for an adversary to game the IMUX algorithm, along with the limitations of the protections against the denial of service attacks discussed in Section 6.3 and some possible ways to protect against them.
Active Circuits: The IMUX connection manager distributes connections to channels based on the fraction of active circuits carried on each channel. An adversary could game the algorithm by artificially increasing the number of active circuits on a channel they are using, heavily shifting the distribution of connections to that channel and increasing its throughput. The ease of the attack depends heavily on how we define an "active circuit", which we can do in one of three ways: (1) the number of open circuits that haven't been destroyed; (2) the number of circuits that have sent a minimal number of cells, measured in raw counts or using an EWMA with a threshold; or (3) using DiffTor to consider only circuits classified as web or bulk. No matter which definition is used, an adversary will still technically be able to game the algorithm; the major difference is the amount of bandwidth the adversary needs to expend to accomplish the task. If we just count the number of open circuits, an adversary could very easily restrict all other channels to only one connection, while the rest are dedicated to the channel they are using. Using an EWMA threshold or DiffTor classifier requires the adversary to actually send data over the circuits, with the amount determined by what thresholds are in place. So while the potential to game the algorithm will always exist, the worst an adversary can do is reduce all other IMUX channels to one connection, the same as how vanilla Tor operates.

Defense Limitations: By taking into account the connection limit of the relay, the dynamic connection manager in IMUX is able to balance the performance gains realized by opening multiple connections while protecting against the new attack surface made available by PCTCP, which leads to a low-bandwidth denial of service attack against any relay in the network. However, there still exist potential socket exhaustion attacks inherent to how Tor operates. The simplest of these merely requires opening streams through a targeted exit, causing sockets to be opened to any chosen destination. Since this is a fundamental part of how an exit relay operates, there is little that can be done to directly defend against this attack, although it can be made potentially more difficult to perform. Exit relays can attempt to keep sockets short-lived and close ones that have been idle for a short period of time, particularly when close to the connection limit. They can also attempt to prioritize connections between relays over connections exiting to external servers. While preventing access at all is undesirable, this may be the lesser of two evils, as it will still allow the relay to participate in the Tor network, possibly preventing adversarial relays from being chosen. This kind of attack only affects exit relays; however, the technique utilizing the Socks5Proxy option can target any relay in the network.
Since this is performed by tunneling OR connections through a circuit, the attack is in effect anonymous, meaning relays cannot simply utilize IP blocking to protect against it. One potential solution is to require clients to solve computationally intense puzzles in order to create a circuit, as proposed by Barbera et al. [99]. This reduces the ease with which a single adversary is able to mass-produce circuits, resulting in socket descriptor consumption. Additionally, since this attack requires the client to send EXTEND cells through the exit to initiate a connection through the targeted relay, exits could simply disallow connections back into the Tor network for circuit creation. This would force an adversary to connect directly to whichever non-exit relay they were targeting, in which case IP blocking becomes a viable strategy to protect against such an attack once it is detected.

6.8 Conclusion

In this chapter we present a new class of socket exhaustion attacks that allows an adversary to anonymously perform denial of service attacks against relays in the Tor network. We outline how PCTCP, a new transport proposal, introduces a new attack surface in this class of attacks. In response, we introduce a new protocol, IMUX, generalizing the designs of PCTCP and Torchestra, that is able to take advantage of opening multiple connections between relays while still defending against these socket exhaustion attacks. Through large scale experiments we evaluate a series of connection schedulers operating within IMUX, look at the performance of IMUX with respect to vanilla Tor, Torchestra, and PCTCP, and investigate how all these algorithms operate with a newly proposed prototype, KIST. Finally, we introduce xTCP, a library that can replace TCP connections with any reliable in-order transport protocol. Using the library we show how the micro transport protocol (uTP) can be used instead of TCP to achieve increased performance while protecting against the circuit-opening socket exhaustion attack introduced by PCTCP.

Chapter 7

Anarchy in Tor: Performance Cost of Decentralization

7.1 Introduction

The price of anarchy refers to the performance cost borne by systems where individual agents act in a locally selfish manner. In networking this specific problem is referred to as selfish routing [19, 17], with users making decisions locally in accordance with their own interests. In this case the network as a whole is left in a sub-optimal state, with network resources underutilized, degrading client performance. While Tor clients are not necessarily selfish, circuit selections are made locally without consideration of the state of the global network. To this end, several researchers have investigated improvements to path selection [36], such as considering latency [34] or relay congestion [35] in an attempt to improve client performance.

In this chapter we explore the price of anarchy in Tor, examining how much performance is sacrificed to accommodate decentralized decision making. To this end we design and implement a central authority in charge of circuit selection for all clients. With a global view of the network the central authority can more intelligently route clients around bottlenecks, increasing network utilization. Additionally we perform a competitive analysis against the online algorithm used by the central authority. With an offline genetic algorithm for circuit selection we can examine how close to an "ideal" solution the online algorithm is.
Using the same concepts from the online algorithm, we develop a decentralized avoiding bottleneck relay algorithm (ABRA) allowing clients to make smarter circuit selection decisions. Finally we perform a privacy analysis of using ABRA for circuit selection, considering both information leakage and how an active adversary can abuse the protocol to reduce client anonymity.

7.2 Background

When a client first joins the Tor network, it downloads a list of relays with their relevant information from a directory authority. The client then creates 10 circuits through a guard, middle, and exit relay. The relays used for each circuit are selected at random weighted by bandwidth. For each TCP connection the client wishes to make, Tor creates a stream which is then assigned to a circuit; each circuit will typically handle multiple streams concurrently. The Tor client will find a viable circuit that can be used for the stream, which will then be used for the next 10 minutes or until the circuit becomes unusable. After the 10 minute window a new circuit is selected to be used for the following 10 minutes, with the old circuit being destroyed after any stream still using it has ended.

To improve client performance a wide range of work has been done on improving relay and circuit selection in Tor. Akhoondi, Yu, and Madhyastha developed LASTor [34], which builds circuits based on inter-relay latency to further reduce client latency when using circuits. Wang et al. built a congestion-aware path selection algorithm [35] that can use active and passive round trip time measurements to estimate circuit congestion. Clients can then select circuits in an attempt to avoid circuits with congested relays. AlSabah et al. created a traffic splitting algorithm [40] that is able to load balance between two circuits with a shared exit relay, allowing clients to dynamically adapt to relays that become temporarily congested.

7.3 Maximizing Network Usage

To explore the price of anarchy in Tor we want to try and maximize network usage by using a central authority for all circuit selection decisions. To this end we design and implement online and offline algorithms with the sole purpose of achieving maximal network utilization. The online algorithm is confined to processing requests as they are made, meaning that at any one time it can know all current active downloads and which circuits they are using, but has no knowledge of how long the download will last or the start times of any future downloads. The offline algorithm, on the other hand, has full knowledge of the entire set of requests that will be made, both the times that the downloads will start and how long they will last.

7.3.1 Central Authority

The online circuit selection algorithm used by the central authority is based on the delay-weighted capacity (DWC) routing algorithm, designed with the goal of minimizing the "rejection rate" of requests, in which the bandwidth requirements of a request cannot be satisfied. In the algorithm, the network is represented as an undirected graph, where each vertex is a router and edges represent links between routers, with the bandwidth and delay of each link assigned to the edge. For an ingress-egress pair of routers (s, t), the algorithm continually extracts a least delay path $LP_i$, adds it to the set of all least delay paths $LP$, and then removes all edges in $LP_i$ from the graph. This step is repeated until no paths exist between s and t in the graph, leaving us with a set of least delay paths $LP = \{LP_1, LP_2, \ldots, LP_k\}$.
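As a concrete illustration of this extraction step, the following minimal Python sketch uses a networkx graph whose edges are assumed to carry a delay attribute; the graph construction and the helper name are illustrative assumptions and not part of the original DWC specification.

import networkx as nx

def least_delay_paths(G, s, t):
    # Repeatedly pull out the least-delay s-t path and remove its edges,
    # mirroring the DWC path-extraction step described above.
    H = G.copy()
    paths = []
    while nx.has_path(H, s, t):
        lp = nx.shortest_path(H, s, t, weight="delay")  # least delay path LP_i
        paths.append(lp)
        H.remove_edges_from(zip(lp, lp[1:]))            # delete its edges from the graph
    return paths                                        # LP = {LP_1, ..., LP_k}

Each extracted path's residual bandwidth and end-to-end delay would then be read off its edges, as described next.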
For each path $LP_i$ we have the residual bandwidth $B_i$, which is the minimum bandwidth across all links, and the end-to-end delay $D_i$ across the entire path. The delay-weighted capacity (DWC) of the ingress-egress pair (s, t) is then defined as $DWC = \sum_{i=1}^{k} B_i / D_i$. The link that determines the bandwidth for a path is the critical link, with the set of all critical links represented by $C = \{C_1, C_2, \ldots, C_k\}$. The basic idea is that when picking a path, we want to avoid critical links as much as possible. To do so, each link is assigned a weight based on how many times it is a critical link in a path: $w_l = \sum_{i : l \in C_i} \alpha_i$. The alpha value can take one of three forms: (1) the number of times a link is critical ($\alpha_i = 1$); (2) the overall delay of the path ($\alpha_i = 1/D_i$); or (3) the delay and bandwidth of the path ($\alpha_i = 1/(B_i \cdot D_i)$). With each link assigned a weight, the routing algorithm simply chooses the path with the lowest sum of weights across all links in the path. Once a path is chosen, the bandwidth of the path is subtracted from the available bandwidth of all links in the path. Any links with no remaining available bandwidth are removed from the graph and the next path can be selected.

The main difference between the DWC routing algorithm and circuit selection in Tor is that we do not need to extract viable paths; when a download begins Tor clients have a set of available circuits and we simply need to select the best one. For the online circuit selection algorithm the main takeaway is that we want to avoid low bandwidth relays which are a bottleneck on many active circuits. Since the central authority has a global view of the network it can, in real time, make these calculations and direct clients around potential bottlenecks. For this we need an algorithm to identify bottleneck relays given a set of active circuits. The algorithm has a list of relays and their bandwidth, along with the list of active circuits. First we extract the relay with the lowest bandwidth per circuit, defined as $r_{circBW} = r_{bw} / |\{c \in circuits \mid r \in c\}|$. This relay is the bottleneck on every circuit it is on in the circuit list, so we can assign it a weight of $r_w = 1/r_{bw}$. Afterwards, for each circuit that r appears on we can assign the circuit a bandwidth value of $r_{circBW}$. Additionally, for every other relay r′ that appears in those circuits, we decrement their available bandwidth, $r'_{bw} = r'_{bw} - r_{circBW}$, accounting for the bandwidth being consumed by the circuit. Once this is completed all circuits that r appears on are removed from the circuit list. Additionally, any relay that has a remaining bandwidth of 0 (which will always include r) is removed from the list of relays. This is repeated until there are no remaining circuits in the circuit list. The pseudocode for this is shown in Algorithm 2 and an example of this process is shown in Figure 7.1.

Figure 7.1: Example of the circuit bandwidth estimate algorithm. In each step the relay with the lowest bandwidth per circuit is selected and its bandwidth evenly distributed among its remaining circuits, with each relay on those circuits having its bandwidth decremented by the circuit bandwidth.

Note that when removing relays from the list when they run out of available bandwidth we want to be sure that there are no remaining active circuits that the relay appears on. For the bottleneck relay r this is trivial since we actively remove any circuit that r appears on. But we also want to make sure this holds if another relay r′ is removed.
Algorithm 2 Calculate relay weight and active circuit bandwidth
 1: function CalcCircuitBW(activeCircuits)
 2:   activeRelays ← GetRelays(activeCircuits)
 3:   while not activeCircuits.empty() do
 4:     r ← GetBottleneckRelay(activeRelays)
 5:     r.weight ← 1/r.bw
 6:     circuits ← GetCircuits(activeCircuits, r)
 7:     circuitBW ← r.bw / circuits.len()
 8:     for c ∈ circuits do
 9:       c.bw ← circuitBW
10:       for circRelay ∈ c.relays do
11:         circRelay.bw −= c.bw
12:         if circRelay.bw = 0 then
13:           activeRelays.remove(circRelay)
14:         end if
15:       end for
16:       activeCircuits.remove(c)
17:     end for
18:   end while
19: end function

So let relays r and r′ have bandwidths b and b′ and be on c and c′ circuits respectively. Relay r is the one selected as the bottleneck relay, and both relays have 0 bandwidth after the bandwidths are updated. By definition r is selected such that $b/c \le b'/c'$. When iterating through the c circuits let n be the number of circuits that r′ is on; note that this means that $n \le c'$. After the circuits have been iterated through and the operations performed, r′ will have $b' - \frac{b}{c} \cdot n$ bandwidth left and be on $c' - n$ remaining circuits. We assume that r′ has 0 bandwidth, so $b' - \frac{b}{c} \cdot n = 0$. We want to show this means that r′ is on no more circuits, so that $c' - n = 0$. We have

$b' - \frac{b}{c} \cdot n = 0 \;\Rightarrow\; b' = \frac{b}{c} \cdot n \;\Rightarrow\; n = \frac{b' \cdot c}{b}$

So the number of circuits r′ is left on is

$c' - n = c' - \frac{b' \cdot c}{b} = \frac{b \cdot c' - b' \cdot c}{b}$

However, r was picked such that

$\frac{b}{c} \le \frac{b'}{c'} \;\Rightarrow\; b \cdot c' \le b' \cdot c \;\Rightarrow\; b \cdot c' - b' \cdot c \le 0$

and since $b > 0$ we know that

$c' - n = \frac{b \cdot c' - b' \cdot c}{b} \le 0$

and from how n was defined we have $n \le c' \Rightarrow 0 \le c' - n$, which implies that $0 \le c' - n \le 0 \Rightarrow c' - n = 0$. So we know if r′ has no remaining bandwidth it will not be on any remaining active circuit.

Now when a download begins, the algorithm knows the set of circuits the download can use and a list of all currently active circuits. The algorithm to compute relay weights is run and the circuit with the lowest combined weight is selected. If there are multiple circuits with the same lowest weight then the central authority selects the circuit with the highest bandwidth, defined as the minimum remaining bandwidth (after the weight computation algorithm has completed) across all relays in the circuit.
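For concreteness, the following is a minimal executable Python sketch of Algorithm 2; the dictionary-based data structures and helper logic are simplifications assumed for illustration and are not taken from the actual Tor or Shadow code.

def calc_circuit_bw(active_circuits, relay_bw):
    # active_circuits: list of circuits, each a list/tuple of relay names
    # relay_bw: dict mapping relay name -> configured bandwidth
    bw = dict(relay_bw)                                    # remaining bandwidth per relay
    weight = {r: 0.0 for r in relay_bw}
    circ_bw = {}
    circuits = [tuple(c) for c in active_circuits]
    while circuits:
        candidates = {x for c in circuits for x in c if x in bw}
        # bottleneck relay: lowest remaining bandwidth per active circuit it is on
        r = min(candidates, key=lambda x: bw[x] / sum(1 for c in circuits if x in c))
        weight[r] = 1.0 / bw[r]                            # relay weight r_w = 1 / r_bw
        mine = [c for c in circuits if r in c]
        share = bw[r] / len(mine)                          # r's bandwidth split evenly
        for c in mine:
            circ_bw[c] = share                             # circuit bandwidth estimate
            for other in c:
                if other in bw:
                    bw[other] -= share                     # bandwidth consumed by this circuit
        circuits = [c for c in circuits if r not in c]     # remove circuits r was on
        bw = {x: b for x, b in bw.items() if b > 1e-9}     # drop exhausted relays (incl. r)
    return weight, circ_bw

The central authority would then sum the returned relay weights over each candidate circuit and pick the minimum, as described above.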
7.3.2 Offline Algorithm

An offline algorithm is allowed access to the entire input set while operating. For circuit selection in Tor this means prior knowledge of the start and end time of every download that will occur. While obviously impossible on a live network, even most Tor client models [70] consist of client start times, along with file sizes to download and how long to pause between downloads. To produce a comparison, we use the offline algorithm to rerun circuit selection on observed download start and end times. We then rerun the downloads with the same exact start and end times with the new circuit selection produced by the offline algorithm to see if network utilization improves. As input the algorithm takes a list of downloads $d_i$ with their corresponding start and end times $(s_i, e_i)$ and the set of circuits $C_i = \{c_{i1}, c_{i2}, \ldots\}$ available to the download, along with the list of relays $r_j$ and their bandwidth $bw_j$. The algorithm then outputs a mapping $d_i \to c_{ij} \in C_i$ with $c_{ij} = \langle r_1, r_2, r_3 \rangle$. While typical routing problems are solved in terms of graph theory algorithms (e.g. max-flow min-cut), the offline circuit selection problem more closely resembles typical job scheduling problems [100]. There are some complications when it comes to precisely defining the machine environment, but the biggest issue is that the scheduling problems that most closely resemble circuit selection are NP-hard. Due to these complications we develop a genetic algorithm, an approach which has been shown to perform well on other job scheduling problems, in order to compute a lower bound on the performance of an optimal offline solution.

The important pieces of a genetic algorithm are the seed solutions, how to breed solutions, how to mutate a solution, and the fitness function used to score and rank solutions. In our case a solution consists of a complete mapping of downloads to circuits. To breed two parent solutions, for each download we randomly select a parent solution and use the circuit selected for the download in that solution as the circuit for the child solution. For mutation, after a circuit has been selected for the child solution, with a small probability m% we randomly select a relay on the circuit to be replaced with a different random relay. Finally, to score a solution we want a fitness function that estimates the total network utilization for a given circuit-to-download mapping.

To estimate network utilization for a set of downloads, we first need a method for calculating the bandwidth being used for a fixed set of active circuits. For this we can use the weight calculation algorithm from the previous section to estimate each active circuit's bandwidth, which can then be aggregated into total network bandwidth. To compute the entire amount of bandwidth used across an entire set of downloads, we extract all start and end times from the downloads and sort them in ascending order. We then iterate over the times, and at each time $t_i$ we calculate the bandwidth being used as $bw_i$ and then increment the total bandwidth, $totalBW = totalBW + bw_i \cdot (t_{i+1} - t_i)$. The idea here is that between $t_i$ and $t_{i+1}$ the same set of circuits will be active consuming $bw_i$ bandwidth, so we add the product to the total bandwidth usage. This is used as the fitness function for the genetic algorithm, scoring a solution based on the estimated total bandwidth usage. We consider two different methods for seeding the population, explained in the next section.
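To make the breeding and mutation steps concrete, a rough Python sketch follows; solutions are represented as dicts mapping a download ID to a circuit (a tuple of three relays), and the function names, parameter defaults, and relay list are assumptions for illustration only, with the fitness function being the total-bandwidth estimate described above.

import random

def breed(parent_a, parent_b):
    # Child takes each download's circuit from a randomly chosen parent.
    return {d: random.choice((parent_a[d], parent_b[d])) for d in parent_a}

def mutate(solution, relays, m=0.01):
    # With probability m, replace one relay of a download's circuit with a random relay.
    out = {}
    for d, circuit in solution.items():
        circuit = list(circuit)
        if random.random() < m:
            circuit[random.randrange(3)] = random.choice(relays)
        out[d] = tuple(circuit)
    return out

def next_generation(population, relays, fitness, elite=0.1, m=0.01):
    # Keep the top-scoring solutions and refill the population by breeding them.
    ranked = sorted(population, key=fitness, reverse=True)
    keep = ranked[:max(2, int(elite * len(ranked)))]
    children = [mutate(breed(*random.sample(keep, 2)), relays, m)
                for _ in range(len(population) - len(keep))]
    return keep + children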
7.3.3 Circuit Sets

In both the online and offline circuit selection algorithms, each download has a set of potential circuits that can be selected from. For the online algorithm we want to use the full set of all valid circuits. We consider a circuit $\langle r_1, r_2, r_3 \rangle$ valid if $r_3$ is an actual exit relay, and if $r_1 \neq r_2$, $r_2 \neq r_3$, and $r_1 \neq r_3$. Note we do not require that $r_1$ actually has the guard flag set. This is because while there are mechanisms internal to Tor that prevent a non-exit relay from being used as an exit relay, there is nothing stopping a client from using a non-guard relay as the guard in a circuit. We also remove any "duplicate" circuits that contain the same set of relays, just in a different order. This is done since neither algorithm considers inter-relay latency, only relay bandwidth, so there is no need to consider circuits with identical relays.

For the offline algorithm we need to carefully craft the set of circuits so as to prevent the genetic algorithm from getting stuck at a local maximum. The motivation behind generating this pruned set can be seen in Figure 7.2, which shows that we can add an active circuit and actually reduce the amount of bandwidth being pushed through the Tor network. In the example shown it will always be better to use one of the already active circuits shown on the left instead of the new circuit seen on the right side of the graphic.

Figure 7.2: Example showing that adding an active circuit resulted in network usage dropping from 20 to 15 MBps.

There are two main ideas behind building the pruned circuit set: (1) when building circuits always use non-exit relays for the guard and middle if possible, and (2) select relays with the highest bandwidth. To build a circuit to add to the pruned set, the algorithm first finds the exit relay with the highest bandwidth. If none exists, the algorithm can stop as no new circuits can be built. Once it has an exit, the algorithm then searches for the non-exit relay with the highest bandwidth, and this relay is used as the middle relay in the circuit. If one is not found it searches the remaining exit relays for the highest bandwidth relay to use as the middle in the circuit. If it still cannot find a relay, the algorithm stops as there are not enough relays left to build another circuit. The search process for the middle is replicated for the guard, again stopping if it cannot find a suitable relay. Now that the circuit has a guard, middle, and exit relay, the circuit is added to the pruned circuit set and the algorithm calculates the circuit bandwidth as the minimum bandwidth across all relays in the circuit. Each relay then decrements its relay bandwidth by the circuit bandwidth. Any relay that now has a bandwidth of 0 is permanently removed from the relay list. This is repeated until the algorithm can no longer build valid circuits.

Using these sets allows us to explore the full potential performance that can be realized when considering both relay and circuit selection. To analyze the impact of circuit selection alone, we also run the algorithms using the original circuit set. For each download, we extract the original circuits available in the vanilla experiments. Then when selecting a circuit for the download in the online and offline algorithm we only pick from those original circuits.

7.4 ABRA for Circuit Selection

While using a central authority for circuit selection allows us to explore the potential performance increases of better network utilization, using such an entity would have severe privacy and anonymity implications for clients using Tor. Instead we would like to take the key insights from the online algorithm and adapt it into a decentralized algorithm that allows clients themselves to make more intelligent circuit selection decisions. The main point of using the central authority with the online algorithm is to directly route clients around bottlenecks. For the decentralized version we introduce the avoiding bottleneck relay algorithm (ABRA), which has relays themselves estimate which circuits they are a bottleneck on. With this they can locally compute and gossip their DWC weight, with clients then selecting circuits based on the combined relay weight values. The main obstacle is a method for relays to accurately estimate how many circuits they are bottlenecks on. To locally identify bottleneck circuits, we make three observations: (1) to be a bottleneck on any circuit the relay's bandwidth should be fully consumed; (2) all bottleneck circuits should be sending more cells than non-bottleneck circuits; and (3) the number of cells being sent on bottleneck circuits should be fairly uniform.
The last point is due to the flow control built into Tor, where the sending rate on a circuit should converge to the capacity of the bottleneck relay. All of this first requires identifying circuit bandwidth, which needs to be done in a way that does not overestimate bulky circuits or underestimate quiet but bursty circuits. To accomplish this we consider two parameters. First is the bandwidth window, in which we keep track of all traffic sent on a circuit for the past w seconds. The other parameter is the bandwidth granularity, a sliding window of g seconds that we scan over the w second bandwidth window to find the maximum number of cells transferred during any g second period. That maximum value is then assigned as the circuit bandwidth.

With circuits assigned a bandwidth value we need a method for clustering them into bottleneck and non-bottleneck groups. For this we consider several clustering algorithms.

Head/Tail: The head/tail clustering algorithm [101] is useful when the underlying data has a long tail, which could be useful for bottleneck identification, as we expect to have a tight clustering around the bottlenecks with other circuits randomly distributed amongst the lower bandwidth values. The algorithm first splits the data into two sets: the tail set contains all values less than the arithmetic mean, while everything greater than or equal to the mean is put in the head set. If the percent of values that ended up in the head set is less than some threshold, the process is repeated using the head set as the new data set. Once the threshold is passed the function returns the very last head set as the head cluster of the data. This algorithm is shown in Algorithm 3. For bottleneck identification we simply pass in the circuit bandwidth data and the head cluster returned contains all the bottleneck circuits.

Algorithm 3 Head/Tail clustering algorithm
 1: function HeadTail(data, threshold)
 2:   m ← sum(data)/len(data)
 3:   head ← {d ∈ data | d ≥ m}
 4:   if len(head)/len(data) < threshold then
 5:     head ← HeadTail(head, threshold)
 6:   end if
 7:   return head
 8: end function

Kernel Density Estimator: The idea behind the kernel density estimator [102, 103] is to fit a multimodal distribution based on a Gaussian kernel to the circuits. Instead of using the bandwidth estimate for each circuit, the estimator takes as input the entire bandwidth history seen across all circuits, giving the estimator more data points to build a more accurate density estimate. For the kernel bandwidth we initially use the square root of the mean of all values. Once we have a density we compute the set of local minima $\{m_1, \ldots, m_n\}$ and classify every circuit with bandwidth above $m_n$ as a bottleneck. If the resulting density estimate is unimodal and we do not have any local minima, we repeat this process, halving the kernel bandwidth until we get a multimodal distribution.
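The window-based bandwidth estimate and the head/tail clustering step can be sketched compactly in Python; the cell-count representation (one entry per 100 ms interval) and the parameter defaults are illustrative assumptions rather than the exact implementation.

def circuit_bandwidth(history, g, interval=0.1):
    # history: cells observed per 100 ms interval over the last w seconds.
    # Return the maximum number of cells seen in any sliding g-second span.
    k = max(1, int(g / interval))
    if len(history) <= k:
        return sum(history)
    return max(sum(history[i:i + k]) for i in range(len(history) - k + 1))

def head_tail(data, threshold=0.4):
    # Keep splitting off the above-mean "head" until it holds at least
    # `threshold` of the values; the final head is returned as the cluster.
    m = sum(data) / len(data)
    head = [d for d in data if d >= m]
    if len(head) / len(data) < threshold:
        return head_tail(head, threshold)
    return head

def bottleneck_circuits(circuit_histories, g):
    # Circuits whose bandwidth estimate falls in the head cluster are bottlenecks.
    est = {c: circuit_bandwidth(h, g) for c, h in circuit_histories.items()}
    head = set(head_tail(list(est.values())))
    return [c for c, bw in est.items() if bw in head]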
After relays have identified the circuits they are bottlenecks on, they compute their DWC weight as the sum of the inverse of those circuits' bandwidths, $weight = \sum_i 1/bw_i$. This weight is then periodically gossiped to all clients that have a circuit through the relay. In the centralized DWC algorithm paths are always selected with the lowest combined DWC weight across all nodes in the path. However, since there is a delay between when circuits become bottlenecks and when clients are notified, we want to avoid clients over-burdening relays that might be congested. Instead clients select circuits at random based on the DWC weight information. First the client checks if there are any circuits with a DWC weight of 0, indicating that no relay on the circuit is a bottleneck on any circuit. If there are any, all zero weight circuits are added to a list which the client selects from randomly weighted by the circuit bandwidth, which is the lowest advertised bandwidth of the relays in the circuit. If there are no zero-weight circuits, the client defaults to selecting amongst the circuits randomly weighted by the inverse of the DWC weight, since we want to bias selection towards circuits with lower weights.

7.5 Experimental Setup

In this section we discuss our experimental setup and some implementation details of the algorithms used.

7.5.1 Shadow

To empirically test how our algorithms would perform in the Tor network, we use Shadow [69, 67], a discrete event network simulator with the capability to run actual Tor code in a simulated network environment. Shadow allows us to set up a large scale network configuration of clients, relays, and servers, which can all be simulated on a single machine. This lets us run experiments privately without operating on the actual Tor network, avoiding potential privacy concerns of dealing with real users. Additionally, it lets us have a global view and precise control over every aspect of the network, which would be impossible in the live Tor network. Most importantly, Shadow performs deterministic runs, allowing for reproducible results and letting us isolate exactly what we want to test for performance effects.

Our experiments are configured to use Shadow v1.9.2 and Tor v0.2.5.10. We use the large network configuration distributed with Shadow, which consists of 500 relays, 1350 web clients, 150 bulk clients, 300 performance clients, and 500 servers. The default client model in Shadow [67, 70, 27] downloads a file of a specific size, and after the download is finished it chooses how long to pause until it starts the next download. Web clients download 320 KB files and randomly pause between 1 and 60,000 milliseconds before starting the next download; bulk clients download 5 MB files with no break between downloads; and performance clients are split into three groups, downloading 50 KB, 1 MB, and 5 MB files, respectively, with 1 minute pauses between downloads. Additionally, since the offline algorithm uses deterministic download times, we introduce a new fixed download model. In this model each client has a list of downloads it will perform, with each download containing a start time, end time, and optional circuit to use for the download. Instead of downloading a file of fixed size it downloads non-stop until the end time is reached. In order to generate a fixed download experiment, we use the download start and end times seen during an experiment running vanilla Tor using the default client model.

To measure network utilization, every 10 seconds each relay logs the total number of bytes it has sent during the previous 10 second window. This allows us to calculate the total relay bandwidth being used during the experimental run, with higher total bandwidth values indicating better utilization of network resources. When using the default client model we also examine client performance, with clients reporting the time it took them to download the first byte, and the time to download the entire file.
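For concreteness, the utilization metric can be aggregated from those per-relay logs roughly as follows; the log format (per-relay lists of (time, bytes) pairs at 10 second intervals) is an assumption made purely for illustration.

from collections import defaultdict

def network_utilization(relay_logs, interval=10):
    # relay_logs: dict relay -> list of (time, bytes_sent_in_previous_interval)
    # Returns {time: total MiB/s across all relays} for each 10 s bin.
    totals = defaultdict(int)
    for entries in relay_logs.values():
        for t, sent in entries:
            totals[t] += sent
    return {t: b / interval / 2**20 for t, b in sorted(totals.items())}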
7.5.2 Implementations

Online and Offline Algorithms: When using the online and offline algorithms with the fixed download client model we create a tool that precomputes circuit selection, using as input the list of downloads, with their respective start and stop times, along with which set of circuits described in Section 7.3.3 to consider for the downloads. The computed mapping of downloads to circuits is used by Shadow to manage which circuits clients need to build and maintain, along with which circuit is selected each time a download starts. For the offline algorithm the genetic algorithm was run for 100 rounds with a breed percentile of b = 0.2 and an elite percentile of e = 0.1. This was done for each circuit set, with mutation set to m = 0 for the pruned and original circuit sets, and m = 0.01 when using the full circuit set. After 100 rounds the highest-scoring solution in the population saved its circuit selection for every download. While the offline algorithm can only operate in the fixed download model, the online algorithm can also work in the default client model. For this we introduce a new central authority (CA) node which controls circuit selection for every client. Using the Tor control protocol the CA listens on the CIRC and STREAM events to know what circuits are available at each client and when downloads start and end. Every time a STREAM NEW event is received it runs the algorithm described in Section 7.3.1 over the set of active circuits. This results in each relay having an assigned DWC weight, and the CA simply picks the circuit on the client with the lowest combined weight.

Avoiding Bottleneck Relays Algorithm: To incorporate ABRA in Tor we first implemented the local weight calculation method described in Section 7.4. Each circuit keeps track of the number of cells received over each 100 millisecond interval for the past w seconds. We created a new GOSSIP cell type which relays send with their DWC weight downstream on each circuit they have. To prevent gossip cells from causing congestion, relays send the gossip cells on circuits every 5 seconds based on the circuit ID, specifically when now() ≡ circID (mod 5). For clients we modified the circuit_get_best function to select circuits randomly based on the sum of relay weights as discussed in Section 7.4.

Congestion and Latency Aware: Previous work has been done on using congestion [35] and latency [34] to guide circuit selection. For congestion aware path selection we use the implementation described in [36]. Clients get measurements on circuit round-trip times from three sources: (1) the time it takes to send the last EXTEND cell when creating the circuit; (2) a special PROBE cell sent right after the circuit has been created; and (3) the time it takes to create a connection when using a circuit. For each circuit the client keeps track of the minimum RTT seen, $rtt_{min}$, and whenever it receives an RTT measurement it calculates the congestion time $t_c = rtt - rtt_{min}$. When a client selects a circuit it randomly chooses 3 available circuits and picks the one with the lowest congestion time. Additionally we add an active mode where PROBE cells are sent every n seconds, allowing for more up to date measurements on all circuits the client has available. Instead of using network coordinates to estimate latency between relays as done in [34], we create a latency aware selection algorithm that uses directly measured RTT times via PROBE cell measurements. With these measurements the client either selects the circuit with the lowest $rtt_{min}$, or selects amongst available circuits randomly weighted by each circuit's $rtt_{min}$.
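A condensed Python sketch of the ABRA pieces described in this subsection follows (the real implementation lives in Tor's C code; the data structures here are assumptions made for illustration). It shows the gossip pacing rule and the weighted random circuit choice from Section 7.4.

import random, time

def should_gossip(circ_id, now=None):
    # Relays send a GOSSIP cell on a circuit every 5 seconds, staggered by circuit ID.
    now = int(time.time() if now is None else now)
    return now % 5 == circ_id % 5

def select_circuit(circuits):
    # circuits: list of (circuit, summed_dwc_weight, min_advertised_bw) tuples.
    zero = [(c, bw) for c, w, bw in circuits if w == 0]
    if zero:
        choices, weights = zip(*zero)                                     # favour higher-bandwidth circuits
    else:
        choices, weights = zip(*[(c, 1.0 / w) for c, w, _ in circuits])   # favour lower DWC weights
    return random.choices(choices, weights=weights, k=1)[0]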
7.5.3 Consistency

To keep the experiments as identical as possible, every experiment was configured with a central authority and had relays gossiping their weight to clients. If the experiment was testing a local circuit selection algorithm, the central authority would return a circuit ID of 0, which indicates that the client itself should select a circuit. At this point the client would use either the vanilla, congestion aware, latency aware, or ABRA algorithm to select a circuit. This way any latency or bandwidth overhead incurred from the various algorithms is present in every experiment, and differences between experimental results are all due to the different circuit selection strategies.

7.6 Performance

In this section we compare the performance of using ABRA for circuit selection to vanilla Tor, congestion-based and latency-aware circuit selection, and our centralized algorithms.

7.6.1 Competitive Analysis

To get an understanding of how well the online algorithm performs compared to a more optimal offline solution, we first perform competitive analysis running both the online and offline algorithms on the fixed download client model. We extracted all download start and end times from the vanilla experiment, along with the original circuits available to each download. Both the online algorithm (with the original and full circuit sets) and the offline algorithm (seeded with the original and pruned circuit sets) were run over the downloads to precompute a circuit selection. Experiments were then configured using each of the computed circuit selections to analyze the exact network utilization impact each of the circuit selections had.

Figure 7.3a shows the total network bandwidth achieved with the online and offline algorithms when they use the full and pruned circuit sets respectively. Both algorithms produce almost identical network utilization, with median network bandwidth going from 148 MBps in vanilla Tor up to 259 MBps, a 75% increase. To see why, we can look at the relay capacity shown in Figure 7.3c, which plots the percent of bandwidth being used on a relay compared to its configured BandwidthRate. Note that this can go higher than 100% because Tor also has a BandwidthBurst setting that allows a relay to temporarily send more than its BandwidthRate over short periods of time. This shows us that in the best algorithm runs relays are operating very close to optimal utilization. Half of the time relays are using 90% or more of their configured bandwidth, which was only happening about 20% of the time in vanilla Tor, meaning the centralized algorithms were able to take advantage of resources that were otherwise being left idle. Interestingly, while both algorithms had the highest performance gain when using the larger circuit sets, they still produced improved performance when restricted to the original circuit sets, as seen in Figure 7.3b.
The online algorithm produced results close to those seen when using the full circuit set, with median network bandwidth at 248 MBps compared to 259 MBps for the full set. The offline algorithm, while still performing better than vanilla Tor, did not produce as much improvement when using the original circuit set, with network bandwidth at 206 MBps.

Figure 7.3: Total network bandwidth when using the online and offline circuit selection algorithms while changing the set of available circuits, along with the relay bandwidth capacity of the best results for both algorithms: (a) full and pruned circuits; (b) original circuits; (c) relay capacity.

7.6.2 ABRA Parameters

With the various parameters and clustering methods available to the algorithm, the first thing we are interested in is which parameters result in the most accurate bottleneck estimation. To measure this we configured an experiment to run with the central authority described in Section 7.5.2 making circuit selections. Every time the central authority selects a circuit for a client, for every relay it outputs how many circuits the relay is a bottleneck on. Additionally, during the run every relay periodically outputs the entire 10 second bandwidth history of every circuit it is on. With this information we can compute offline which circuits every relay would have classified itself as a bottleneck on, depending on the parameters used. For comparing estimates, every time the central authority outputs its bottleneck figures, every relay that outputs its circuit history within 10 simulated milliseconds will have its local bottleneck estimate compared to the estimate produced by the central authority.

Table 7.1 shows the mean-squared error between the bottleneck estimations produced by the central authority and the local estimates with varying parameters and clustering methods. The bandwidth granularity was set to either 100 milliseconds or 1 second, and the bandwidth window was picked to be either 1, 2, 5, or 10 seconds.

  Granularity   Window   Jenks Two   Jenks Best   Head/Tail   KernelDensity
  100 ms        1 s         1.67        1.11        0.74          0.81
  100 ms        2 s         1.79        1.14        0.92          0.82
  100 ms        5 s         3.56        1.91        1.66          0.85
  100 ms        10 s        8.69        3.87        4.36          0.93
  1 second      1 s         1.80        1.00        0.90          0.79
  1 second      2 s         2.39        1.18        1.13          0.79
  1 second      5 s         5.28        2.25        2.45          0.95
  1 second      10 s       13.25        4.70        6.14          1.62

Table 7.1: The bottleneck clustering methods' mean square error across varying bandwidth granularity and window parameters, with red values indicating scores less than the weighted random estimator.

Every value in black performed worse than a weighted random estimator, which simply picked a random value from the set of all estimates computed by the central authority. While the kernel-density estimator produced the lowest mean-squared error on average, the lowest error value came from using the head/tail clustering algorithm with the bandwidth granularity set to 100 milliseconds and the bandwidth window at 1 second.

7.6.3 ABRA Performance

To evaluate the performance of Tor using ABRA for circuit selection, an experiment was configured to use ABRA with g = 100ms and w = 1s. Additionally we ran experiments using congestion aware, latency aware, and latency weighted circuit selection. For congestion aware circuit selection three separate experiments were run, one using only passive probing, and two with clients actively probing circuits either every 5 or 60 seconds. Finally, an experiment was run with the central authority using the online algorithm to make circuit selection decisions for all clients. Figure 7.4 shows the CDF of the total relay bandwidth observed during the experiment.
Congestion and latency aware network utilization, shown in Figures 7.4a and 7.4b, actually drops compared to vanilla Tor, pushing at best half as much data through the network. Clients in the congestion aware experiments were generally responding to out of date congestion information, with congested relays being overweighted by clients, causing those relays to become even more congested. Since latency aware circuit selection does not take relay bandwidth into account, low bandwidth relays with low latencies between each other are selected more often than they should be and become extremely congested.

Figure 7.4: Total network bandwidth utilization when using vanilla Tor, ABRA for circuit selection, the congestion and latency aware algorithms, and the central authority: (a) congestion aware; (b) latency aware; (c) central authority.

While the congestion and latency aware algorithms performed worse than vanilla Tor, using both ABRA and the central authority for circuit selection produced increased network utilization. Figure 7.4c shows that ABRA and the central authority produced on average 14% and 20% better network utilization respectively when compared to vanilla Tor. Figure 7.5 looks at client performance using ABRA and centralized circuit selection. Download times universally improved, with some web clients performing downloads almost twice as fast, and bulk clients consistently seeing a 5-10% improvement. While all web clients had faster download times, about 20% saw even better results when using the central authority compared to ABRA for circuit selection. The only slight degradation in performance was with ABRA, where about 12-13% of downloads had a slower time to first byte compared to vanilla Tor. The central authority, however, consistently resulted in circuit selections that produced faster times to first byte across all client downloads.

Figure 7.5: Download times and client bandwidth compared across circuit selection in vanilla Tor, using ABRA, and the centralized authority: (a) time to first byte; (b) web download times; (c) bulk download times.

7.7 Privacy Analysis

With the addition of gossip cells and the use of ABRA for circuit selection, there are some potential avenues for abuse an adversary could take advantage of to reduce client anonymity. In this section we cover each of these and examine how effective the methods are for an adversary.

7.7.1 Information Leakage

The first issue is that relays advertising their locally computed weight could leak information about other clients to an adversary. Mittal et al. [58] showed how an adversary could use throughput measurements to identify bottleneck relays in circuits. Relay weight values could be used similarly, where an adversary could attempt to correlate start and stop times of connections with the weight values of potential bottleneck relays. To examine how much information is leaked by the weight values, every time a relay sent a GOSSIP cell we recorded the weight of the relay and the number of active circuits using the relay.
Then for every two consecutive GOSSIP cells sent by a relay we recorded the difference in weight and in the number of active circuits, which should give us a good idea of how well the changes in these two values are correlated over a short period of time. Figure 7.6 shows the distribution of weight differences across various changes in the number of clients actively using the relay. Since we are interested in times when the relay is a bottleneck on new circuits, we excluded times when the weight difference was 0, as this is indicative that the relay was not a bottleneck on any of the new circuits. This shows an almost nonexistent correlation, with an $R^2$ value of 0.00021. In a situation similar to the one outlined in [58], where an adversary is attempting to identify bottleneck relays used in a circuit, we are particularly interested in the case where the number of active circuits using the relay as a bottleneck increases by 1. If there were large (maybe temporary) changes noticeable to an adversary they could identify the bottleneck relay. But as we can see in Figure 7.6, the distribution of weight changes when the client difference is 1 is not significantly different from the distribution for larger client differences, meaning it would be extremely difficult to identify bottleneck relays by correlating weight changes.

Figure 7.6: Weight difference on relays compared to the difference in the number of bottleneck circuits on the relay.

7.7.2 Colluding Relays Lying

With relays self-reporting their locally calculated weights, adversarial relays could lie about their weight, consistently telling clients they have a weight of 0 and increasing their chances of being on a selected circuit. Note that construction of circuits using ABRA is still unchanged, so the number of compromised circuits built by a client will not change; it is only when a client assigns streams to circuits that malicious relays could abuse GOSSIP cells to improve the probability of compromising a stream. Furthermore, this attack has a "self-damping" effect, in that attracting streams away from non-adversarial relays will decrease the bottleneck weight of those relays. Still, colluding relays, in an effort to increase the percent of circuits they are able to observe, could in a joint effort lie about their weights in an attempt to reduce the anonymity of clients. To determine how much of an effect such an attack would have, while running the experiment using ABRA for circuit selection we recorded the circuits available to the client, along with the respective weight and bandwidth information. A random set of relays were then marked as the "adversary". For each circuit selection decision made, if the adversary had a relay on an available circuit we would rerun the circuit selection assuming all the adversarial relays had a reported weight of 0. We would then record
the fraction of total bandwidth controlled by the adversary, the fraction of streams they saw with and without lying, and the fraction of compromised streams, with and without lying. A stream is considered compromised if the adversary controls both the guard and exit in its selected circuit. This process was repeated over 2,000,000 times to produce a range of expected streams seen and compromised as the fraction of bandwidth controlled by an adversary increases. Figures 7.7a and 7.7b show the median fraction of streams observed, bounded by the 10th and 90th percentiles. As the adversary controls more bandwidth, and thus more relays that can lie about their weight, an adversary is able to observe about 10% more streams than they would have seen if they were not lying. Figure 7.7b shows the fraction of streams that are compromised with and without adversarial relays lying. When lying an adversary sees an increase of roughly 2-3% in the streams they are able to compromise. For example, an adversary who controls 20% of the bandwidth sees their percent of streams compromised go from 4.7% to 6.9%. Note that these serve as an upper bound on the actual effect, as this analysis is done statically and ignores effects on the weight of non-adversarial relays.

Figure 7.7: (a) fraction of circuits seen by colluding relays; (b) fraction of circuits where colluding relays appear as guard and exit.

7.7.3 Denial of Service

While adversarial relays are limited in the number of extra streams they can observe by lying about their weight, they could still have the ability to reduce the chances that other relays are selected. To achieve this they would need to artificially inflate the weight of other relays in the network, preventing clients from selecting circuits that the target relays appear on. Recall that the local weight calculation is based on how many bottleneck circuits the relay estimates it is on. This means that an adversary cannot simply create inactive circuits through the relay to inflate its weight, since those circuits would never be labeled as bottlenecks. So to actually cause the weight to increase the adversary needs to actually send data through the circuits. To test the effectiveness of this attack, we configured an experiment to create 250 one-hop circuits through a target relay. After 15 minutes the one-hop circuits were activated, downloading as much data as they could. Note that we want as many circuits through the relay as possible to make its weight as large as possible. The relay weight is summed across all bottleneck circuits, $weight = \sum_i bw(c_i)^{-1}$. If we have n one-hop circuits through a relay of bandwidth bw, each circuit will have a bandwidth of roughly $bw/n$, so the weight on the relay will be $\sum_{i=1}^{n} \left(\frac{bw}{n}\right)^{-1} = \frac{n^2}{bw}$. Figure 7.8a looks at the weight of the target relay along with how much bandwidth the attacker is using, with the shaded region noting when the attack is active. We see that the attacker is able to push through close to the maximum bandwidth that the relay can handle, around 35 MB/s. When the circuits are active the weight spikes to almost 100 times what it was. After the attack is started the number of clients using the target for a circuit plummets to almost 0, down from the 30-40 it was previously (Figure 7.8b). But while the attack does succeed in inflating the weight of the target, note that the adversary has to fully saturate the bandwidth of the target. Doing this in vanilla Tor would have almost the same effect, essentially running a denial of service by consuming all the available bandwidth.
Figure 7.8c looks at the number of completed downloads using the target relay in both vanilla Tor and when using ABRA for circuit selection. Both experiments see the number of successful downloads drop to 0 while the attack is running. So even though the addition of the relay weight adds another mechanism that can be used to run a denial of service attack, the avenue (saturating bandwidth) is the same.

7.8 Conclusion

In this paper we introduced an online algorithm that is capable of centrally managing circuit selection for clients. Additionally we developed the decentralized version, ABRA, the avoiding bottleneck relay algorithm. This algorithm lets relays estimate which circuits they are bottlenecks on, allowing them to compute a weight that can be gossiped to clients for more coordinated circuit selection. ABRA is compared to congestion and latency aware circuit selection algorithms, showing that while these algorithms tend to actually underperform vanilla Tor, ABRA results in a 14% increase in network utilization. We show that the decentralized approach ABRA takes produces utilization close to what even a centralized authority can provide. Using competitive analysis we show that an online algorithm employed by a central authority matches a lower-bound offline genetic algorithm. Finally, we examine potential ways an adversary could abuse ABRA, finding that while information leakage is minimal, there exist small increases in the percent of streams compromised based on the bandwidth controlled by an adversary acting maliciously.

Figure 7.8: (a) attacker's bandwidth and target's weight when an adversary is running the denial of service attack; (b) number of clients using the target for a circuit; (c) number of downloads completed by clients using the target relay before and after the attack is started, shown both for vanilla Tor and using ABRA circuit selection.

Chapter 8

Future Work and Final Remarks

8.1 Future Work

In this section we discuss open questions related to this dissertation and ideas for future research on Tor and anonymous communication systems.

8.1.1 Tor Transports

In Section 6.6 we introduced xTCP, which was able to replace TCP with any other in-order reliable transport protocol. With this we were able to have Tor use the micro transport protocol (uTP) for all inter-relay communication, allowing each circuit to have its own user space protocol stack. We showed that using uTP instead of TCP was able to increase client performance, with especially large improvements for time to first byte. While these initial results are promising, a large amount of work remains in investigating different potential protocols to use [104, 98]. Along with uTP, the stream control transmission protocol (SCTP) [105] could be a good candidate for inter-relay communication. Its main benefit is that it allows for partially ordered data delivery, eliminating the issue of head-of-line blocking that can cause performance issues in Tor. Another possibility is to use Google's QUIC [106] for Tor [107]. QUIC offers a combination of TCP reliability and congestion control, along with TLS for encryption and SPDY/HTTP2 for efficient and fast web page loading. In [104] Murdoch additionally recommends CurveCP [108], a secure in-order reliable transport protocol.
While other protocols are more mature and tested, it is possible that a stand-alone Tor-specific protocol could provide the best performance. Tschorsch and Scheuermann [63] recently implemented their own protocol for better flow and congestion control based on back pressure, while using traditional techniques from TCP for packet loss detection and retransmission. A protocol even more customized for Tor could deliver even better performance guarantees. Jansen et al. showed with KIST [27] that allowing Tor to buffer more data internally allows for better cell handling and prioritization, increasing performance for latency sensitive applications. A Tor-specific protocol would then have direct control over traffic management and prioritization, allowing it to manage stream, circuit, and connection level flow and congestion control in a global context.

8.1.2 Simulation Accuracy

With more work being done on the low level networking side of Tor [27, 2, 3, 109, 93, 63], it is becoming increasingly important that the network stack in Shadow be as accurate as possible. While PlanetLab or ExperimenTor [65] let us use the actual kernel TCP stack, these solutions have problems with scaling to full Tor topologies. In [27] a large amount of work went into updating the Shadow TCP stack to replicate the Linux kernel TCP implementation, but it is very difficult to guarantee equivalent performance. Recently a couple of projects have been released that attempt to create a user space networking library based on the Linux operating system code. NUSE [110], the network stack in userspace, is part of the library operating system (LibOS) project [111]. A host backend layer handles the interactions from the kernel layer, with applications making calls through the POSIX layer sitting above the kernel layer. The Linux kernel library (LKL) [112, 113] is another library that applications can use to incorporate Linux operating system functionality into their program. The advantage of LKL is that the kernel code is compiled into a single object file which the application can directly link against. Tschorsch and Scheuermann [63] evaluated their Tor back pressure protocol by integrating it into ns-3 [114], a network simulator with a wide variety of protocols built into it. While this has the advantage of comparing their back pressure protocol against an actual TCP implementation, it abstracts away many details of Tor, which could reduce simulation accuracy. Instead the ns-3 TCP implementation could be ported over into either a stand-alone library or directly into Shadow. This would reduce the overhead of having a library compiled from the entire kernel, in addition to relying on a more mature code base that has for decades been used for network simulations and experimentation.

8.2 Final Remarks

Since this dissertation began Tor has grown enormously, from an initial network capacity of 5 GB/s servicing a few hundred thousand clients to almost 200 GB/s of relay bandwidth handling over 2 million clients daily. Additionally, during this period, with higher network bandwidth and more efficient data processing, the time it takes to download a 1 MB file has dramatically decreased, going from about a 20 second download time to roughly 2.5 seconds currently. This large expansion in both network capacity and client performance has allowed users and political dissidents in oppressive regimes to access the internet and globally communicate anonymously, an impressive and important achievement.
With an ever growing number of Tor users it is increasingly important to ensure that the system properly balances performance and privacy. While this balance is an inherently subjective question, this dissertation has worked to demonstrate some of the trade-offs that arise when an adversary is able to take advantage of mechanisms intended to increase client performance. To this end we first explored an induced throttling attack able to abuse flow and traffic admission control, allowing a low resource attack on guard identification which can be used to reduce the client anonymity set. We next introduced a socket exhaustion attack in which an adversary consumes all available sockets. The attack is made even worse by systems like PCTCP which allow an adversary to anonymously open sockets just by creating circuits through a target. With this in mind we introduced IMUX, which is able to balance opening more connections to increase performance while preventing these easy denial of service attacks. Additionally we examined the effects of replacing TCP entirely for inter-relay communication, and found that using uTP simultaneously prevents certain socket exhaustion attacks while improving client performance. Finally we examined the price of anarchy in Tor and found that potentially 40-50% of network resources are being left idle to the detriment of client performance. In an attempt to close this gap we introduced ABRA for circuit selection, a decentralized solution that can take advantage of these idle resources with minimal potential loss of client anonymity.

References

[1] Rob Jansen, Paul Syverson, and Nicholas Hopper. Throttling Tor Bandwidth Parasites. In Proceedings of the 21st USENIX Security Symposium, 2012.
[2] Deepika Gopal and Nadia Heninger. Torchestra: Reducing interactive traffic delays over Tor. In Proceedings of the Workshop on Privacy in the Electronic Society (WPES 2012). ACM, October 2012.
[3] Mashael AlSabah and Ian Goldberg. PCTCP: Per-Circuit TCP-over-IPsec Transport for Anonymous Communication Overlay Networks. In Proceedings of the 20th ACM Conference on Computer and Communications Security (CCS 2013), November 2013.
[4] Rob Jansen, John Geddes, Chris Wacek, Micah Sherr, and Paul Syverson. Never Been KIST: Tor's Congestion Management Blossoms with Kernel-Informed Socket Transport. In Proceedings of the 23rd USENIX Security Symposium. USENIX Association, 2014.
[5] Ceki Gulcu and Gene Tsudik. Mixing E-mail with Babel. In Proceedings of the Symposium on Network and Distributed System Security, 1996.
[6] Michael J. Freedman and Robert Morris. Tarzan: A Peer-to-Peer Anonymizing Network Layer. In Proceedings of the 9th ACM Conference on Computer and Communications Security (CCS 2002), November 2002.
[7] George Danezis, Roger Dingledine, and Nick Mathewson. Mixminion: Design of a type III anonymous remailer protocol. In Proceedings of the IEEE Symposium on Security and Privacy, 2003.
[8] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation onion router. In Proceedings of the 13th USENIX Security Symposium, August 2004.
[9] David Goldschlag, Michael Reed, and Paul Syverson. Onion routing. Communications of the ACM, 42(2):39–41, 1999.
[10] Michael G Reed, Paul F Syverson, and David M Goldschlag. Anonymous connections and onion routing. Selected Areas in Communications, IEEE Journal on, 16(4):482–494, 1998.
[11] Adam Back, Ulf Möller, and Anton Stiglic. Traffic Analysis Attacks and Trade-Offs in Anonymity Providing Systems. In Ira S. Moskowitz, editor, Information Hiding: 4th International Workshop, IH 2001, pages 245–257, Pittsburgh, PA, USA, April 2001. Springer-Verlag, LNCS 2137.
[12] Mashael AlSabah, Kevin Bauer, Ian Goldberg, Dirk Grunwald, Damon McCoy, Stefan Savage, and Geoffrey Voelker. DefenestraTor: Throwing out windows in Tor. In PETS, 2011.
[13] Rob Jansen, Paul Syverson, and Nicholas Hopper. Throttling Tor Bandwidth Parasites. In Proceedings of the 21st USENIX Security Symposium, August 2012.
[14] Nicholas Hopper, Eugene Y. Vasserman, and Eric Chan-Tin. How Much Anonymity does Network Latency Leak? ACM Transactions on Information and System Security, 13(2):13–28, February 2010.
[15] Joel Reardon and Ian Goldberg. Improving Tor using a TCP-over-DTLS tunnel. In Proceedings of the 18th USENIX Security Symposium, pages 119–134. USENIX Association, 2009.
[16] Joel Reardon. Improving Tor using a TCP-over-DTLS Tunnel. Master's thesis, University of Waterloo, September 2008.
[17] Tim Roughgarden. Selfish routing and the price of anarchy, volume 174. MIT Press, Cambridge, 2005.
[18] Tim Roughgarden and Éva Tardos. How bad is selfish routing? Journal of the ACM (JACM), 49(2):236–259, 2002.
[19] Tim Roughgarden. Selfish routing. PhD thesis, Cornell University, 2002.
[20] George Danezis. The traffic analysis of continuous-time mixes. In International Workshop on Privacy Enhancing Technologies, pages 35–50. Springer, 2004.
[21] Steven J Murdoch and Piotr Zieliński. Sampled traffic analysis by internet-exchange-level adversaries. In International Workshop on Privacy Enhancing Technologies, pages 167–183. Springer, 2007.
[22] libevent: an event notification library. http://libevent.org/.
[23] Can Tang and Ian Goldberg. An Improved Algorithm for Tor Circuit Scheduling. In Angelos D. Keromytis and Vitaly Shmatikov, editors, Proceedings of the 2010 ACM Conference on Computer and Communications Security (CCS 2010). ACM, October 2010.
[24] Damon McCoy, Kevin Bauer, Dirk Grunwald, Tadayoshi Kohno, and Douglas Sicker. Shining Light in Dark Places: Understanding the Tor Network. In Nikita Borisov and Ian Goldberg, editors, Privacy Enhancing Technologies: 8th International Symposium, PETS 2008, pages 63–76, Leuven, Belgium, July 2008. Springer-Verlag, LNCS 5134.
[25] Abdelberi Chaabane, Pere Manils, and Mohamed Ali Kaafar. Digging into anonymous traffic: A deep analysis of the Tor anonymizing network. In Network and System Security (NSS), 2010 4th International Conference on, 2010.
[26] Mashael AlSabah, Kevin Bauer, and Ian Goldberg. Enhancing Tor's Performance using Real-time Traffic Classification. In Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012), October 2012.
[27] Rob Jansen, John Geddes, Chris Wacek, Micah Sherr, and Paul Syverson. Never Been KIST: Tor's Congestion Management Blossoms with Kernel-Informed Socket Transport. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14), San Diego, CA, August 2014. USENIX Association.
[28] W Brad Moore, Chris Wacek, and Micah Sherr. Exploring the potential benefits of expanded rate limiting in Tor: Slow and steady wins the race with Tortoise. In Proceedings of the 27th Annual Computer Security Applications Conference, pages 207–216. ACM, 2011.
[29] Florian Tschorsch and Björn Scheuermann. Tor is unfair - and what to do about it. In Local Computer Networks (LCN), 2011 IEEE 36th Conference on, pages 432–440. IEEE, 2011.
[30] Kevin Bauer, Damon McCoy, Dirk Grunwald, Tadayoshi Kohno, and Douglas Sicker. Low-Resource Routing Attacks Against Tor. In Proceedings of the Workshop on Privacy in the Electronic Society (WPES 2007), October 2007.
Low-Resource Routing Attacks Against Tor. In Proceedings of the Workshop on Privacy in the Electronic Society (WPES 2007), October 2007. [31] Robin Snader and Nikita Borisov. EigenSpeed: secure peer-to-peer bandwidth evaluation. In IPTPS, page 9, 2009. [32] Micah Sherr, Matt Blaze, and Boon Thau Loo. Scalable link-based relay selection for anonymous routing. In Privacy Enhancing Technologies, pages 73–93. Springer, 2009. [33] Micah Sherr, Andrew Mao, William R Marczak, Wenchao Zhou, Boon Thau Loo, and Matthew A Blaze. A3 : An Extensible Platform for Application-Aware Anonymity. 2010. [34] Masoud Akhoondi, Curtis Yu, and Harsha V. Madhyastha. LASTor: A LowLatency AS-Aware Tor Client. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, May 2012. [35] Tao Wang, Kevin Bauer, Clara Forero, and Ian Goldberg. Congestion-aware Path Selection for Tor. In Proceedings of Financial Cryptography and Data Security (FC’12), February 2012. [36] Christopher Wacek, Henry Tan, Kevin Bauer, and Micah Sherr. An Empirical Evaluation of Relay Selection in Tor. In Proceedings of the Network and Distributed System Security Symposium - NDSS’13. Internet Society, February 2013. 126 [37] Michael Backes, Aniket Kate, Sebastian Meiser, and Esfandiar Mohammadi. (Nothing else) MATor(s): Monitoring the Anonymity of, tor’s path selection. In Proceedings of the 21th ACM conference on Computer and Communications Security (CCS 2014), November 2014. [38] SJ Herbert, Steven J Murdoch, and Elena Punskaya. Optimising node selection probabilities in multi-hop M/D/1 queuing networks to reduce latency of Tor. Electronics Letters, 50(17):1205–1207, 2014. [39] Robin Snader and Nikita Borisov. Improving security and performance in the Tor network through tunable path selection. Dependable and Secure Computing, IEEE Transactions on, 8(5):728–741, 2011. [40] Mashael Alsabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. The Path Less Travelled: Overcoming Tor’s Bottlenecks with Traffic Splitting. In Proceedings of the 13th Privacy Enhancing Technologies Symposium (PETS 2013), July 2013. [41] Roger Dingledine and Steven J Murdoch. Performance Improvements on Tor or, Why Tor is slow and what were going to do about it. Online: http://www. torproject. org/press/presskit/2009-03-11-performance. pdf, 2009. [42] Camilo Viecco. UDP-OR: A fair onion transport design. Proceedings of Hot Topics in Privacy Enhancing Technologies (HOTPETS08), 2008. [43] Michael F Nowlan, Nabin Tiwari, Janardhan Iyengar, Syed Obaid Amin, and Bryan Ford. Fitting square pegs through round pipes. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012. [44] Michael F Nowlan, David Wolinsky, and Bryan Ford. Reducing latency in Tor circuits with unordered delivery. In USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2013. [45] Elli Androulaki, Mariana Raykova, Shreyas Srivatsan, Angelos Stavrou, and Steven M Bellovin. PAR: Payment for anonymous routing. In International Symposium on Privacy Enhancing Technologies Symposium, pages 219–236. Springer, 2008. 127 [46] Yao Chen, Radu Sion, and Bogdan Carbunar. XPay: Practical anonymous payments for Tor routing and other networked services. In Proceedings of the 8th ACM workshop on Privacy in the electronic society, pages 41–50. ACM, 2009. [47] Tsuen-Wan “Johnny” Ngan, Roger Dingledine, and Dan S. Wallach. Building Incentives into Tor. In Radu Sion, editor, Proceedings of Financial Cryptography (FC ’10), January 2010. [48] Rob Jansen, Nicholas Hopper, and Yongdae Kim. 
Recruiting New Tor Relays with BRAIDS. In Angelos D. Keromytis and Vitaly Shmatikov, editors, Proceedings of the 2010 ACM Conference on Computer and Communications Security (CCS 2010). ACM, October 2010. [49] Rob Jansen, Aaron Johnson, and Paul Syverson. LIRA: Lightweight Incentivized Routing for Anonymity. In Proceedings of the Network and Distributed System Security Symposium - NDSS’13. Internet Society, February 2013. [50] Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan, and Weijia Jia. A new cell counter based attack against tor. In Proceedings of the 16th ACM conference on Computer and communications security, pages 578–589. ACM, 2009. [51] Xinwen Fu, Zhen Ling, J Luo, W Yu, W Jia, and W Zhao. One cell is enough to break tors anonymity. In Proceedings of Black Hat Technical Security Conference, pages 578–589. Citeseer, 2009. [52] Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Weijia Jia, and Wei Zhao. Protocollevel attacks against Tor. Computer Networks, 57(4):869–886, 2013. [53] Nikita Borisov, George Danezis, Prateek Mittal, and Parisa Tabriz. Denial of Service or Denial of Security? How Attacks on, reliability can compromise anonymity. In Proceedings of CCS 2007, October 2007. [54] Rob Jansen, Florian Tschorsch, Aaron Johnson, and Björn Scheuermann. The Sniper Attack: Anonymously Deanonymizing and Disabling the Tor Network. In Proceedings of the Network and Distributed Security Symposium - NDSS ’14. IEEE, February 2014. 128 [55] Nick Feamster and Roger Dingledine. Location diversity in anonymity networks. In Proceedings of the 2004 ACM workshop on Privacy in the electronic society, pages 66–76. ACM, 2004. [56] Matthew Edman and Paul Syverson. AS-awareness in Tor path selection. In Proceedings of the 16th ACM conference on Computer and communications security, pages 380–389. ACM, 2009. [57] Yixin Sun, Anne Edmundson, Laurent Vanbever, Oscar Li, Jennifer Rexford, Mung Chiang, and Prateek Mittal. RAPTOR: routing attacks on privacy in tor. In 24th USENIX Security Symposium (USENIX Security 15), pages 271–286, 2015. [58] Prateek Mittal, Ahmed Khurshid, Joshua Juen, Matthew Caesar, and Nikita Borisov. Stealthy traffic analysis of low-latency anonymous communication using throughput fingerprinting. In Proceedings of the 18th ACM conference on Computer and communications security, pages 215–226. ACM, 2011. [59] Steven J. Murdoch and George Danezis. Low-Cost Traffic Analysis of Tor. In 2005 IEEE Symposium on Security and Privacy,(IEEE S&P, 2005) Proceedings, pages 183–195. IEEE CS, May 2005. [60] Nathan S Evans, Roger Dingledine, and Christian Grothoff. A practical congestion attack on Tor using long paths. In Proceedings of the 18th USENIX Security Symposium, 2009. [61] Brent Chun, David Culler, Timothy Roscoe, Andy Bavier, Larry Peterson, Mike Wawrzoniak, and Mic Bowman. PlanetLab: an overlay testbed for broad-coverage services. SIGCOMM Computer Communication Review, 33, 2003. [62] Aaron Johnson, Chris Wacek, Rob Jansen, Micah Sherr, and Paul Syverson. Users Get Routed: Traffic Correlation on Tor by Realistic Adversaries. In Proceedings of the 20th ACM conference on Computer and Communications Security (CCS 2013), November 2013. [63] Florian Tschorsch and Björn Scheuermann. Mind the gap: towards a backpressure-based transport protocol for the Tor network. In 13th USENIX 129 Symposium on Networked Systems Design and Implementation (NSDI 16), pages 597–610, 2016. [64] George F Riley and Thomas R Henderson. The ns-3 network simulator. 
In Modeling and Tools for Network Simulation, pages 15–34. Springer, 2010. [65] Kevin S Bauer, Micah Sherr, and Dirk Grunwald. ExperimenTor: A Testbed for Safe and Realistic Tor Experimentation. In CSET, 2011. [66] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostić, Jeff Chase, and David Becker. Scalability and accuracy in a large-scale network emulator. ACM SIGOPS Operating Systems Review, 36(SI):271–284, 2002. [67] Rob Jansen and Nicholas Hopper. Shadow: Running Tor in a Box for Accurate and Efficient Experimentation. In Proc. of the 19th Network and Distributed System Security Symposium, 2012. [68] Rob Jansen, Kevin Bauer, Nicholas Hopper, and Roger Dingledine. Methodically Modeling the Tor Network. In Proceedings of the 5th Workshop on Cyber Security Experimentation and Test, August 2012. [69] Shadow Homepage and Code Repositories. https://shadow.github.io/, https: //github.com/shadow/. [70] Rob Jansen, Kevin S Bauer, Nicholas Hopper, and Roger Dingledine. Methodically Modeling the Tor Network. In CSET, 2012. [71] The Tor Project. The Tor Metrics Portal. http://metrics.torproject.org/. [72] Damon Mccoy, Tadayoshi Kohno, and Douglas Sicker. Shining light in dark places: Understanding the Tor network. In Proc. of the 8th Privacy Enhancing Technologies Symp., 2008. [73] Félix Hernández-Campos, Kevin Jeffay, and F Donelson Smith. Tracking the evolution of web traffic: 1995-2003. In Modeling, Analysis and Simulation of Computer Telecommunications Systems, 2003. MASCOTS 2003. 11th IEEE/ACM International Symposium on, pages 16–25. IEEE, 2003. 130 [74] Andrei Serjantov and George Danezis. Towards an information theoretic metric for anonymity. In Privacy Enhancing Technologies, 2003. [75] Claudia Diaz, Stefaan Seys, Joris Claessens, and Bart Preneel. Towards measuring anonymity. In Privacy Enhancing Technologies, 2003. [76] Ulf Möller, Lance Cottrell, Peter Palfrader, and Len Sassaman. Mixmaster protocolversion 2. Draft, July, 2003. [77] Can Tang and Ian Goldberg. An improved algorithm for Tor circuit scheduling. In Proc. of the 17th ACM conf. on Computer and Communications Security, 2010. [78] W. Brad Moore, Chris Wacek, and Micah Sherr. Exploring the Potential Benefits of Expanded Rate Limiting in Tor: Slow and Steady Wins the Race With Tortoise. In Proceedings of 2011 Annual Computer Security Applications Conference, 2011. [79] Deepika Gopal and Nadia Heninger. Torchestra: Reducing interactive traffic delays over Tor. In Proc. of the Workshop on Privacy in the Electronic Society, 2012. [80] Prateek Mittal, Ahmed Khurshid, Joshua Juen, Matthew Caesar, and Nikita Borisov. Stealthy traffic analysis of low-latency anonymous communication using throughput fingerprinting. In Proc. of the 18th ACM conf. on Computer and Communications Security, 2011. [81] Nicholas Hopper, Eugene Y. Vasserman, and Eric Chan-tin. How much anonymity does network latency leak. In In CCS 07: Proceedings of the 14th ACM conference on Computer and communications security. ACM, 2007. [82] Steven J. Murdoch and George Danezis. Low-Cost Traffic Analysis of Tor. In Proceedings of the 2005 IEEE Symposium on Security and Privacy, SP ’05, 2005. [83] E.L. Hahne. Round-robin scheduling for max-min fairness in data networks. IEEE Journal on Selected Areas in Communications, 9(7):1024–1039, 1991. [84] The clients Tor Project. by entry Research guards. problem: adaptive throttling of Tor https://blog.torproject.org/blog/ research-problem-adaptive-throttling-tor-clients-entry-guards. 
131 [85] Mashael AlSabah, Kevin Bauer, and Ian Goldberg. Enhancing Tor’s performance using real-time traffic classification. In Proceedings of the 2012 ACM conference on Computer and communications security, 2012. [86] Amir Houmansadr and Nikita Borisov. SWIRL: A Scalable Watermark to Detect Correlated Network Flows. In Proc. of the Network and Distributed Security Symp., 2011. [87] Sambuddho Chakravarty, Angelos Stavrou, and Angelos D. Keromytis. Traffic Analysis Against Low-Latency Anonymity Networks Using Available Bandwidth Estimation. In Proceedings of the European Symposium Research Computer Security - ESORICS’10, 2010. [88] Rob Jansen. The Shadow Simulator. http://shadow.cs.umn.edu/. [89] Rob Jansen and Nick Hopper. Shadow: Running Tor in a Box for Accurate and Efficient Experimentation. In Proceedings of the 19th Network and Distributed System Security Symposium, 2012. [90] Bram Cohen. Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer systems, 2003. [91] Trevor J Hastie and Robert J Tibshirani. Generalized additive models, volume 43. 1990. [92] John R. Douceur. The Sybil Attack. In IPTPS ’01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, 2002. [93] Michael F Nowlan, David Wolinsky, and Bryan Ford. Reducing Latency in Tor Circuits with Unordered, delivery. In USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2013. [94] Can Tang and Ian Goldberg. An Improved Algorithm for Tor Circuit Scheduling. In Proc. of the 17th Conference on Computer and Communication Security, 2010. [95] Rob Jansen, Kevin Bauer, Nicholas Hopper, and Roger Dingledine. Methodically Modeling the Tor Network. In Proceedings of the USENIX Workshop on Cyber Security Experimentation and Test (CSET 2012), August 2012. 132 [96] Steven J Murdoch. Comparison of Tor datagram designs. Technical report, Technical report, Nov. 2011. www.cl.cam.ac.uk/~sjm217/papers/ tor11datagramcomparison.pdf, 2011. [97] Stanislav Shalunov. Low Extra Delay Background Transport (LEDBAT). [98] Karsten Loesing, Steven J. Murdoch, and Rob Jansen. Evaluation of a libutpbased Tor Datagram Implementation. Technical report, Tech. Rep. 2013-10-001, The Tor Project, 2013. [99] Marco Valerio Barbera, Vasileios P Kemerlis, Vasilis Pappas, and Angelos D Keromytis. CellFlood: Attacking Tor onion routers on the cheap. In Computer Security–ESORICS 2013, pages 664–681. Springer, 2013. [100] Ronald L Graham, Eugene L Lawler, Jan Karel Lenstra, and AHG Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of discrete mathematics, 5:287–326, 1979. [101] Bin Jiang. Head/tail breaks: A new classification scheme for data with a heavytailed distribution. The Professional Geographer, 65(3):482–494, 2013. [102] Murray Rosenblatt et al. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956. [103] Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962. [104] Steven J. Murdoch. Comparison of Tor Datagram Designs. Tech Report 2011-11-001, Tor, November 2011. http://www.cl.cam.ac.uk/~sjm217/papers/ tor11datagramcomparison.pdf. [105] Randall Stewart. Stream control transmission protocol. Technical report, 2007. [106] QUIC, a multiplexed stream transport over UDP. https://www.chromium.org/ quic. [107] Xiaofin Li and Kevein Ku. Help with TOR on UDP/QUIC. https://www. 
mail-archive.com/tor-dev@lists.torproject.org/msg07823.html. 133 [108] Daniel Bernstein. CurveCP: Usable security for the Internet. URL: http://curvecp. org, 2011. [109] John Geddes, Rob Jansen, and Nicholas Hopper. IMUX: Managing tor connections from two to infinity, and, beyond. In Proceedings of the 12th Workshop on Privacy in the Electronic Society (WPES), November 2014. [110] NUSE: Network stack in USerspacE. http://libos-nuse.github.io/. [111] Hajime Tazaki, Ryo Nakamura, and Yuji Sekiya. Library Operating System with Mainline Linux Network Stack. Proceedings of netdev. [112] Octavian Purdila, Lucian Adrian Grijincu, and Nicolae Tapus. LKL: The Linux kernel library. In Proceedings of the RoEduNet International Conference, pages 328–333, 2010. [113] Linux kernel library project. https://github.com/lkl. [114] Thomas R Henderson, Mathieu Lacage, George F Riley, C Dowell, and J Kopena. Network simulations with the ns-3 simulator. SIGCOMM demonstration, 15:17, 2008.