CS 265 Project Report

Anonymous Peer-to-Peer File Sharing

Student: Jianning Yang (tomyang13@yahoo.com)
Advisor: Dr. Mark Stamp (stamp@cs.sjsu.edu)

1 Introduction

Peer-to-Peer (P2P) networks, a distributed framework, have recently received a great deal of attention. A P2P network consists of a large number of interconnected peers that share all kinds of digital content. In a P2P network, there is no central or dedicated server, and all peers are independent of each other.

A key weakness of most existing P2P systems is the lack of anonymity. Without anonymity, third parties can identify the participants involved. As a result, a considerable amount of research has been done in the field of anonymity. In their seminal article, Bennett and Grothoff [4] state the following anonymity requirements for P2P systems:

1. An anonymous P2P system should make it impossible for third parties to identify the participants involved.
2. An anonymous P2P system should guarantee that only the content receiver knows the content.
3. An anonymous P2P system should allow the content publisher to plausibly deny that the content originated from him or her.

The aim of this project is to design an Anonymous P2P File Sharing System (APTPFS) using a variety of crypto techniques learned in CS 265. The system operates as a distributed file system across many computers and allows files to be stored and requested anonymously.

2 Overview of P2P Network

2.1 Network Topology

There are two major P2P network topologies: centralized and decentralized. In a centralized topology, network functionality depends largely on a central server, which each peer accesses to upload and download information. The central server coordinates the communication among peers.

A decentralized P2P topology does not contain any central server. In this kind of topology, peers are normally organized in an unstructured fashion.
The most popular decentralized P2P systems are Freenet, GNUnet, Gnutella, and MUTE. Decentralized networks have several advantages over centralized ones: they eliminate single points of failure, and they scale well.

2.2 Peer Addressing

Like other kinds of networks, a P2P network provides an addressing scheme to identify each individual peer. Because communication is always initiated by the individual peer, peers are not required to have static addresses. For example, each peer in the MUTE network is identified by a randomly generated virtual address [9].

2.3 Discovering Peers

In a P2P network, a new peer can discover existing peers via host listing and host caching. Host listing uses a centralized server to maintain a list of active peers. When a new peer wants to join the network, it first registers with the server. Following the registration, the server returns the contact information of some active peers to which the new peer should connect. Because the server selects active peers at random, host listing encourages sparseness and creates a nearly optimal network structure [1] [6].

In contrast, host caching does not require any centralized server. In host caching, information about active peers is exchanged among peers, stored in a local cache, and used later. However, the local cache is initially empty, so host listing must be used for the initial peer discovery. Once the initial peers have been discovered, host caching can be used exclusively. Therefore, discovering peers in P2P networks always relies on some degree of centralization [1].

3 APTPFS: An Anonymous P2P File Sharing System

3.1 System Architecture

Figure 1 depicts the basic system architecture of APTPFS.
When a new peer wants to join the APTPFS system, it first establishes an SSL connection to a Peer Listing Server (PLS) and sends a registration request to that server. The main purpose of the PLS is to maintain a list of the peers currently active in the system. Upon receiving the request, the PLS adds the new peer’s information (e.g. IP address and port) to the list. The PLS also randomly selects a few peers from the list and sends their information back to the new peer. The new peer then creates and maintains SSL connections to the selected peers, which thereby become its neighbors.

Because APTPFS is a decentralized P2P system, no single server coordinates the file sharing process. The PLS is not considered a server in this sense, since it only serves as a registry and is consulted only when a peer wants to join or leave the system. Each peer, identified by a randomly generated virtual address, keeps direct connections only with its neighbors. All queries are propagated through the network hop by hop, and responses are routed back along the paths discovered by the queries.

[Figure 1: APTPFS architecture. Peers A to F register with the Peer Listing Server (PLS) over SSL and exchange queries and files with their neighbors over SSL connections.]

3.1.1 Virtual Address

Inspired by MUTE [9], APTPFS hides each peer’s identity behind a virtual address. During start-up, the peer generates a random number, appends its IP address to the random number, and hashes the result with SHA1 (i.e. SHA1 (RNG () + IP), where RNG is a random number generation function) to produce its virtual address. The virtual address is the sole identification of a peer in APTPFS. Because the virtual address is calculated through one-way hashing, peers cannot recover each other’s true identity (e.g. IP address) from a virtual address.
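As a rough illustration, the computation SHA1 (RNG () + IP) might look like the following Python sketch. The function name `make_virtual_address`, the 16-byte nonce length, and the byte-level concatenation are assumptions made for this example; the report only specifies that a random number and the IP address are hashed with SHA1.

```python
import hashlib
import secrets


def make_virtual_address(ip: str, nonce_bytes: int = 16) -> str:
    """Return SHA1(random nonce + IP) as a 40-character hex string."""
    # RNG(): a fresh random value drawn at start-up (the length is an assumption).
    nonce = secrets.token_bytes(nonce_bytes)
    # "+" is taken to mean byte concatenation of the nonce and the textual IP.
    return hashlib.sha1(nonce + ip.encode("ascii")).hexdigest()
```

Because the nonce is drawn afresh on every call, the same IP address yields a different virtual address each time the peer starts up.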
In support of virtual addresses, each peer maintains a routing table that maps virtual addresses to its neighbor connections (i.e. SSL connections). To populate the routing table, each APTPFS peer records the neighbor connections on which it has recently received messages from other peers. When the routing table is full, the least used entry is purged. When a peer receives a message, it checks the virtual address inside the message header and tries to match it against the routing table entries. If it finds an entry, it randomly chooses a neighbor connection from that entry and forwards the message to that neighbor; otherwise it rejects the message.

[Figure 2: Example network with peers A, B, W, X, Y, and Z.]

In Figure 2, assume that peer X has received messages sent by peer A through two of its neighbors (peer Y and peer Z). Peer X then populates its routing table as shown in Table 1. Later, when peer X receives a message from peer B, it checks the virtual address of the target peer inside the message header. Suppose that the target peer is peer A. Based on its routing table, peer X knows that it should forward this message to peer Z or peer Y. Peer X randomly chooses one of these two neighbors and forwards the message to it.

Table 1: Peer X’s routing table

Neighbor connection          Peer A’s virtual address
SSL connection to Peer Z     √
SSL connection to Peer Y     √
SSL connection to Peer W

Note: √ indicates that messages addressed to a particular peer can be forwarded over that neighbor connection.

3.1.2 Hop-by-Hop Connection

Each APTPFS peer maintains SSL connections to all of its neighbors, and these connections are used for transporting APTPFS messages (both queries and responses) hop by hop. In other words, peers have no direct connections other than those to their neighbors. In Figure 2, peer A maintains two connections to its neighbors (peer Z and peer Y).
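The routing table from Section 3.1.1 that drives this hop-by-hop forwarding could be sketched as follows. This is only an illustration: the class and method names are hypothetical, neighbor connections are represented by plain labels rather than SSL sockets, and "least used" is approximated by least recently used, which the report does not specify.

```python
import random
from collections import OrderedDict


class RoutingTable:
    """Maps a virtual address to the neighbor connections it was heard on."""

    def __init__(self, capacity=100, rng=None):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual address -> list of neighbors
        self.rng = rng or random.Random()

    def learn(self, source_address, neighbor):
        """Record that a message from source_address arrived via neighbor."""
        neighbors = self.entries.setdefault(source_address, [])
        if neighbor not in neighbors:
            neighbors.append(neighbor)
        self.entries.move_to_end(source_address)
        if len(self.entries) > self.capacity:
            # Table full: purge the least recently used entry (assumption).
            self.entries.popitem(last=False)

    def next_hop(self, target_address):
        """Pick a random neighbor for target_address, or None to reject."""
        neighbors = self.entries.get(target_address)
        if not neighbors:
            return None
        self.entries.move_to_end(target_address)
        return self.rng.choice(neighbors)
```

In the Table 1 scenario, peer X would call `learn` with peer A's virtual address once for the connection to peer Y and once for the connection to peer Z, after which `next_hop` returns one of the two at random.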
Suppose peer A wants to send a message to peer B. It cannot establish a direct connection to peer B, since it only knows peer B’s virtual address. Therefore, it inserts peer B’s virtual address into the message header and forwards the message to one of its neighbors. Assume that it forwards the message to peer Z, which in turn forwards the message to peer X. Upon receiving the message from peer Z, peer X forwards it to peer B, which happens to be the message destination. In this example, the route of the message is A->Z->X->B.

3.1.3 Query Hashing

In APTPFS, queries are hashed so that only the query sender knows the query content. Suppose that peer A in Figure 2 wants to download a file called “dreams.mid”. Peer A first calculates the hash value of “dreams.mid” and then broadcasts the query along with the hash value. When an intermediary peer (e.g. peer Z or peer X) receives the query, it does not know which file peer A is looking for, since hashing is a one-way function. When the query is propagated to peer B, peer B finds a match, since it registered a file under the same hash value. Peer B likewise does not know the name of the file that peer A is looking for.

One way for an attacker to reveal the original query is a dictionary attack: he or she pre-computes the hash values of popular keywords and saves them in a dictionary for lookup. To thwart dictionary attacks, APTPFS uses multiple rounds of SHA1 hashing (i.e. Hash(filename) = SHA1(SHA1(filename)+VirtualAddress)). Because an attacker with enough computing power could still reveal the original query by re-computing the dictionary, query hashing only prevents normal users from revealing the query contents. However, exposing the query content does not break anonymity as long as the identities of the query sender and query receiver are not exposed; this holds in APTPFS, since it uses virtual addresses to hide peer identities.
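A minimal sketch of the formula Hash(filename) = SHA1(SHA1(filename)+VirtualAddress) is given below. The function names and the byte encoding of the concatenation are assumptions. The sketch also assumes that a holding peer stores the inner value SHA1(filename) and re-hashes it with the sender's virtual address in order to compare; the report does not spell out this matching step.

```python
import hashlib


def name_hash(filename: str) -> bytes:
    """Inner round: SHA1(filename)."""
    return hashlib.sha1(filename.encode("utf-8")).digest()


def query_hash(filename: str, virtual_address: str) -> str:
    """Outer round: SHA1(SHA1(filename) + VirtualAddress)."""
    salted = name_hash(filename) + virtual_address.encode("ascii")
    return hashlib.sha1(salted).hexdigest()


def find_match(query: str, sender_address: str, stored_name_hashes):
    """Return the stored SHA1(filename) that matches the query, or None."""
    for h in stored_name_hashes:
        salted = h + sender_address.encode("ascii")
        if hashlib.sha1(salted).hexdigest() == query:
            return h
    return None
```

Mixing in the sender's virtual address means that a dictionary pre-computed for one sender cannot be reused for another, although, as noted above, it can still be recomputed.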
3.1.4 FORWARD_HASH Flag

One problem with the broadcast scheme is that it may expose the query sender’s identity. For example, if the maximum TTL is 7, then receiving a query with a TTL value of 7 from a neighbor means that the neighbor is the query’s originator. MUTE provides a clever approach to this issue. The idea is to send each broadcast message along with a hash value in FORWARD mode for the first few hops. Each intermediary peer rehashes the hash value and uses the last 2 digits of the hash as a random value to determine whether or not the message should be switched out of FORWARD mode. If FORWARD mode continues, the new hash value replaces the old hash value. While in FORWARD mode, the TTL is kept intact [9].

APTPFS uses a similar approach to protect the query sender’s privacy. FORWARD mode is signaled by a FORWARD_ flag followed by a 20-byte SHA1 hash value. Each intermediary peer simply rehashes the current hash value, which generates another hash. A trigger value is taken from the last byte of the new hash. For example, to leave FORWARD mode with a 1 in 3 chance, a peer looks at the last byte of the newly generated hash and switches the mode if the last byte is less than or equal to 85 (there are 256 possible values for the last byte). SHA1 is a cryptographically secure one-way function, so it is very difficult to obtain previous values from the current value, and thus it is difficult to determine how far a message has already traveled [9].

Figure 3 shows an example of how FORWARD_HASH works in APTPFS. For simplicity, it shows only the last byte of the forward hash value. The shaded nodes are colluding nodes, while the dotted lines indicate hop-by-hop message propagation paths.

[Figure 3: FORWARD_HASH example with peers A, B, C, W, X, Y, and Z; edge labels FORWARD_231, FORWARD_155, FORWARD_155, FORWARD_211, TTL: 5, and TTL: 4.]

Suppose that peer A is the query sender in Figure 3.
Peer A generates a random forward hash and sends the query with the forward hash to peer Z and peer Y. Peer Z and peer Y cannot determine the origin of this query from the forward hash. To decide what to do with the query, they rehash the forward hash to generate a new forward hash. Because the last byte of the new forward hash is 155, which is greater than the trigger value 85, they forward the query to peer X with the new forward hash. Once it receives the query, peer X again rehashes the received forward hash to generate a new forward hash. It then sends the query with the new forward hash to peer B, since the last byte of the new forward hash is 211, which is still greater than the trigger value 85. Now suppose peer B rehashes the received forward hash and finds that the last byte of the new hash is 76, which is less than the trigger value 85. Peer B then removes the forward flag from the query and sends the query to peer C in normal TTL mode.

3.1.5 DROP_CHAIN Flag

The FORWARD_HASH scheme ensures that an attacker cannot easily determine the sender of a message, but it does not ensure the anonymity of peers that respond to messages. For example, an attacker could send a search request with a TTL value of 0 to force the neighbor to send back its own results without passing the message further [9].

MUTE provides an efficient way to thwart this kind of attack. Once a MUTE node receives a message with a TTL value of 0, it switches the message into CHAIN mode, and the message still travels through several peers before being dropped. Each of these peers could send back results, so the results the attacker receives cannot be associated with the neighbor node. Whenever a peer starts up, it decides randomly whether it will drop chain messages or pass them. Each peer also picks one neighbor to which it passes chain messages for as long as that neighbor remains connected [9].

APTPFS takes the same approach to protect the message responder’s anonymity.
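The per-hop decisions described in Sections 3.1.4 and 3.1.5 can be summarized in one small sketch. The mode labels, the function name `process_hop`, and the tuple layout are hypothetical. The sketch also assumes that the TTL is decremented once per hop in TTL mode and that the peer that switches a message into CHAIN mode always passes it on; neither detail is stated in the report.

```python
import hashlib

TRIGGER = 85  # last byte <= 85: 86 of 256 values, roughly a 1 in 3 chance


def process_hop(mode, forward_hash, ttl, passes_chain):
    """Return the (mode, forward_hash, ttl) to send onward, or None to drop.

    mode         -- "FORWARD", "TTL", or "CHAIN" (labels chosen for this sketch)
    forward_hash -- 20-byte SHA1 value carried in FORWARD mode, else None
    passes_chain -- this peer's start-up decision to pass chain messages
    """
    if mode == "FORWARD":
        new_hash = hashlib.sha1(forward_hash).digest()
        if new_hash[-1] <= TRIGGER:
            # Leave FORWARD mode; the TTL has been kept intact until now.
            return ("TTL", None, ttl)
        # Stay in FORWARD mode and replace the old hash with the new one.
        return ("FORWARD", new_hash, ttl)
    if mode == "TTL":
        if ttl > 0:
            return ("TTL", None, ttl - 1)
        # A message arriving with TTL 0 is switched into CHAIN mode.
        return ("CHAIN", None, 0)
    # CHAIN mode: pass to the fixed chain neighbor, or drop the message.
    return ("CHAIN", None, 0) if passes_chain else None
```

Since every peer applies the same deterministic rehash, two neighbors that receive the same forward hash (such as peer Z and peer Y in Figure 3) reach the same decision and forward the same new hash.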
The DROP_CHAIN flag indicates CHAIN mode. Whenever a peer starts up, it randomly chooses one neighbor to which it will later forward chain messages. In APTPFS, each peer accepts (passes on) chain messages with a probability of 70%. A chain message therefore survives two consecutive peers with probability 70% * 70% = 49%, so the typical tail chain is about 2 peers long. Note that query messages are broadcast only in FORWARD mode and TTL mode.

Figure 4 shows an example of how DROP_CHAIN works in APTPFS. The shaded node is a colluding node, and the dotted lines indicate the drop chain path.

[Figure 4: DROP_CHAIN example with peers A, B, S, X, Y, and Z. Peer S sends a query with TTL: 0 to peer A; the DROP_CHAIN path runs from peer A through peer Z and peer X to peer B.]

Suppose that peer S sends a query with a TTL value of 0 to peer A. Instead of immediately sending the response back to peer S, peer A sets DROP_CHAIN mode inside the query message and forwards the message to one of its neighbors (e.g. peer Z). The query message then travels through peer X before being dropped by peer B. Because any one of the peers on the drop chain path can send back the response, peer S cannot link the response to peer A simply by setting the TTL value to 0 inside the query message.

Putting it all together, a typical APTPFS query message travels through a random number of hops (ranging from 1 to 3) in FORWARD mode, switches to TTL mode and travels through a maximum of 5 hops, and then continues through a random number of hops (ranging from 1 to 4) in CHAIN mode before being dropped.

3.1.6 Anonymous Content Publishing

To publish a file, the user only needs to specify the name of the file, which must be inside a special APTPFS folder. Upon receiving the publishing request, the peer uses the file name to calculate a hash value H1. It then generates a random number to serve as the AES secret key SK, encrypts the file with it, and uses the encrypted file to calculate a hash value H2. Finally, it invokes a two-step process to distribute the encrypted file and the secret key to other peers.
In the first step, it randomly chooses two peers from its routing table and sends the hash values H1/H2 and the secret key SK to each of those peers. In the second step, it randomly chooses another two peers from its routing table and sends the hash values H1/H2 and the encrypted file to each of those peers. If the publishing process is successful, it deletes the encrypted file from the local APTPFS folder; otherwise, it prompts the user with an error message.

In support of content publishing, each peer keeps two tables: a Hash-to-Key table and a Hash-to-File table. The Hash-to-Key table maps each hash pair (i.e. H1/H2) to the secret key (i.e. SK) used to decrypt the file. The Hash-to-File table maps each hash pair (i.e. H1/H2) to an encrypted file. This approach is similar to GNUnet’s content publishing approach [10]. The difference is that GNUnet splits the file into small chunks and sends those chunks to selected peers, so that each of those peers keeps only a portion of the file, while in APTPFS each selected peer keeps a complete encrypted file.

Figure 5 illustrates an anonymous content publishing scenario in which peer A distributes an encrypted file to peer F and peer H and distributes the secret key to peer G and peer J.

[Figure 5: Anonymous content publishing among peers A to J. Peer A sends H1/H2 and AES (file, RN) to peer F and peer H, and H1/H2 and RN to peer G and peer J.
H1: the hash of the file name
H2: the hash of the encrypted file
RN: the random AES key (SK in the text)
AES (file, RN): the encrypted file
Note: dotted lines form file distribution paths, while thick links form key distribution paths.]

Using this scheme, the content publisher can plausibly deny that the content originated from him or her. This holds because the key and the file reside only on other peers.
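The two-step distribution described above might be planned as in the sketch below. It is an illustration under several assumptions: the function name `plan_publish` and the message tuples are hypothetical, H1 and H2 are computed with SHA1 (the report does not name the hash used here), the AES encryption is assumed to have been done already, and actual delivery over SSL is left out.

```python
import hashlib
import random


def plan_publish(filename, encrypted_file, secret_key, known_peers, rng=None):
    """Return (h1, h2, messages); each message is (peer, kind, payload)."""
    rng = rng or random.Random()
    h1 = hashlib.sha1(filename.encode("utf-8")).hexdigest()  # hash of file name
    h2 = hashlib.sha1(encrypted_file).hexdigest()            # hash of ciphertext
    # Two peers for the key and another two for the file: four distinct peers.
    chosen = rng.sample(list(known_peers), 4)
    messages = [(peer, "KEY", secret_key) for peer in chosen[:2]]
    messages += [(peer, "FILE", encrypted_file) for peer in chosen[2:]]
    return h1, h2, messages
```

`random.Random.sample` raises `ValueError` when fewer than four peers are known, which in this sketch would correspond to the case where publishing fails and the user is shown an error message.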
Administrators of the peers on which the key or the encrypted file resides can similarly deny any knowledge of the published file, since the key or the file is sent to those peers without the administrator’s awareness. Furthermore, because it is unlikely that a single peer holds both the key and the file, they can claim that they lack the combination of key and encrypted file needed for decryption.

3.1.7 Anonymous Content Retrieval

To download a file, the content receiver CR first broadcasts a search query with the hashed file name SHA1(SHA1(name), CR’s virtual address). When a peer in the network receives such a query, it consults its Hash-to-File table to see if there is a match. If it finds an H1 value that matches SHA1(SHA1(name), CR’s virtual address) in the Hash-to-File table, it sends the relevant file information (e.g. file size) along with the H1/H2 hash pair back to CR. Needless to say, the search result is routed back hop by hop. Because CR may receive query results from many peers, it prompts the user to select one file to download.

After the user selects a file, CR generates two queries: the download file query and the download key query. CR first broadcasts the download key query along with SHA1(H1, CR’s virtual address) and H2. When a peer in the network receives such a query, it consults its Hash-to-Key table to see if there is a match. If it finds an entry that matches the SHA1(H1, CR’s virtual address)/H2 pair in the Hash-to-Key table, it sends the corresponding secret key back to the content receiver. After receiving the key, CR sends a download file query along with SHA1(H1, CR’s virtual address) and H2 to the peer that is holding the selected file. Upon receiving the download file query, the peer consults its Hash-to-File table and sends the corresponding encrypted file to CR hop by hop.
After CR receives both the encrypted file and the secret key, it decrypts the file and copies the decrypted file into a public folder so that the user can use it. Finally, it randomly chooses to keep either the secret key or the encrypted file in order to serve future requests from other peers.

3.1.8 Miscellaneous

All communication links in APTPFS are secured by Secure Sockets Layer (SSL) to prevent network eavesdropping. The size of each APTPFS message is fixed at 1KB. Messages larger than 1KB are split into 1KB chunks that are delivered one by one. Messages smaller than 1KB are padded to 1KB. The purpose of the uniform message size is to make it very difficult for an attacker to distinguish different kinds of messages through traffic analysis. In addition, each peer randomly generates bogus messages and forwards them to its neighbors to further thwart traffic analysis attacks.

4 Conclusion

This paper presents an anonymous P2P file sharing system that ensures anonymity for both producers and consumers of files. By using SSL, the APTPFS system prevents network eavesdropping. Because each APTPFS peer uses a virtual address to hide its identity and always avoids direct connections to non-neighbor peers, it is highly unlikely that third parties can identify the participants involved in a file sharing session [4]. Additionally, APTPFS’s anonymous content publishing allows the publisher to plausibly deny that the content originated from him or her [5] [3]. Finally, APTPFS takes several measures (e.g. forward hash [9] and drop chain [9]) to protect anonymity against traffic analysis attacks.

5 References

1. Andy Oram, Peer-To-Peer: Harnessing the Power of Disruptive Technologies, O’Reilly & Associates, 2001
2. Brendon Wilson, JXTA, New Riders Publishing, 2002
3. Christian Grothoff, An Excess-Based Economic Model for Resource Allocation in Peer-to-Peer Networks, Purdue University, 2003
4. Christian Grothoff, Ioana Patrascu, Krista Bennett, Tiberiu Stef, and Tzvetan Horozov, The GNet Whitepaper, Purdue University, 2002
5. David Goldschlag, Michael Reed, and Paul Syverson, Onion Routing for Anonymous and Private Internet Connections, Naval Research Laboratory, 1999
6. Dana Moore and John Hebeler, Peer-to-Peer: Building Secure, Scalable, and Manageable Networks, McGraw-Hill Osborne Media, 2001
7. Dennis Kügler, An Analysis of GNUnet and the Implications for Anonymous, Censorship-Resistant Networks, Federal Office for Information Security, 2003
8. Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong, Freenet: Distributed Anonymous Information Storage and Retrieval System, In Proc. of the ICSI Workshop on Design Issues in Anonymity and Unobservability, Berkeley, CA, 2000
9. Jason Rohrer, MUTE: Simple, Anonymous File Sharing, http://mutenet.sourceforge.net, 2004
10. Krista Bennett, Christian Grothoff, Tzvetan Horozov, Ioana Patrascu, and Tiberiu Stef, GNUNET – A truly anonymous networking infrastructure, Purdue University, 2002
11. Michael J. Freedman, Emil Sit, Josh Cates, and Robert Morris, Introducing Tarzan: A Peer-To-Peer Anonymizing Network Layer, MIT Laboratory for Computer Science, 2002
12. Michael K. Reiter and Aviel D. Rubin, Crowds: Anonymity for Web Transactions, AT&T Labs, 1998
13. National Security Agency, CNSS Policy No. 15, Fact Sheet No. 1, National Security Agency, 2003
14. Rags Srinivas, Using AES with Java Technology, http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html, 2003
15. RSA Laboratories, RSA Laboratories' Frequently Asked Questions About Today's Cryptography, Version 4.1, http://www.rsasecurity.com/rsalabs/node.asp?id=2152, 2004
16. Taher Elgamal, The Secure Sockets Layer Protocol (SSL), Danvers IETF Meeting, 1995