Peer-to-Peer Computing
CS587x Lecture
Department of Computer Science
Iowa State University

What to Cover
  - Review of some P2P applications
    - Napster
    - Gnutella
    - Freenet
  - Discussion and summary

Resource Sharing
  - Questions to answer in order to design a resource-sharing network
    - How to add new nodes to the network
    - How can one node know about others
    - How can a node find and retrieve data
    - How to manage the shared data and users

Client/Server Architecture
  - Create a server to store the information that the nodes want to share
    - The server is the only data source
    - Clients request data from the server
  - Example: mp3.com
    - A client registers with mp3.com and uploads its music files to the server
    - The songs are then stored and indexed on a server that is part of the web site
    - Other users can connect to the web site and download the songs they are interested in

Limitations of the C/S Model
  - Scalability is hard to achieve
  - Presents a single point of failure
  - Requires administration
  - Leaves resources at the network edge unused

Some P2P Applications
  - Napster
  - Gnutella
  - Freenet

Napster
  - Each node registers with napster.com and provides a list of its song titles
    - The Napster server knows the song titles and the sites that hold them
    - The songs themselves are still stored locally
  - To download a song, a node contacts the server
    - The server returns a list of nodes that have the song
    - The requesting node selects one node from the list and downloads the file directly from it

Highlights of Napster
  - Main innovation: a client downloads a music file directly from another client, i.e., P2P communication
    - After a client downloads a file, it can serve other clients
  - The Napster server itself does not hold any music files
    - It acts as a directory or broker
  - Advantages
    - Each consumer contributes its resources (disk and bandwidth) and content to the community
    - Content is more reliably available because the same file is stored on many geographically distributed nodes
    - Administration and service costs are minimal
  - Drawback
    - Napster is a hybrid P2P system since
      a central server is required to coordinate file sharing
    - The central server presents a single point of failure

Gnutella
  - Creating a Gnutella network
    - A node joins the network by sending a PING that announces itself: IP address, port, number/size of shared files
    - Receivers forward the PING to their neighbors
    - Receivers back-propagate a PONG to announce themselves
    - Each PONG includes the sender's IP address and the number/size of its shared files
  - Maintaining a Gnutella network
    - PING neighbors periodically
    - PING well-known root nodes if starting from scratch

Search Protocol
  - To request a file (of any kind), node A
    - creates a query (A, S, N, T), where S is the search string, N a unique request ID, and T the Time-to-Live
    - checks its local system and, if the file is not found, sends (A, S, N, T) to all of its Gnutella neighbors
  - When node B receives a query (A, S, N, T)
    - If B has already received query N, or T = 0, B drops the query
    - Otherwise, B looks up S locally and sends (N, Result) to A if anything is found
      - Any kind of lookup will do (a simple grep, or a constructed SQL command)
    - If nothing is found locally, B sends (B, S, N, T-1) to all of its Gnutella neighbors
    - B records the fact that A has made request N
    - When B receives a response of the form (N, Result) from one of its neighbors, it forwards the response to A

Gnutella Messages
  - PING: asks every transitively connected node to identify itself, essentially asking "Are you there?"
  - PONG: the response of a node that receives a PING; the responding node provides its IP address and the number of sharable files it holds. This gives the answer "Yes, I am here."
  - QUERY: a request to locate a set of files matching some filter criteria. These messages state "I am looking for x".
  - HITS: the response to a query, listing the files that match the filter criteria and the IP address of the provider; a single query can produce many HITS.
  - GET/PUSH: request a file provider to contact the requester.
    This provides a simple mechanism to attempt to get through firewalls.

Partial Map of a Gnutella Network
  [Figure: partial map of a Gnutella network]

Highlights of Gnutella
  - Pure P2P
    - Unlike Napster, fully decentralized, with no single point of failure
  - Limitations
    - Scalability: if you send out a request with a TTL of 10 and each site contacts six other sites, up to 6^1 + 6^2 + ... + 6^10 = 72,559,410 messages could be exchanged
    - Not anonymous: since the result contains the URL string, the source provider can be tracked; this is addressed in Freenet

Freenet
  - Freenet is a pure P2P system mainly designed to support
    - distributed information storage and retrieval
    - anonymity for producers, consumers, and holders of information
    - adaptive response to usage patterns
  - Freenet differs from Gnutella mainly in
    - Retrieving data
    - Storing data
    - Managing data

Architecture
  - Each file is identified by a binary key
    - The key is generated using some hash function
    - Every file is stored, retrieved, and maintained by its file key
  - Each node maintains a local data store and a routing table
    - The data store holds a set of files
    - The routing table keeps information about neighboring nodes and the keys that they are thought to hold
      - A sequence of (file key, node address) pairs
      - Used for file retrieval

      key    neighbor
      30     123.234.456.1
      100    888.234.456.2
      65     999.234.456.3

Retrieving data
  - A user first obtains or calculates a key
  - The user sends a search request message (key + TTL) to the local node
  - When a node receives a request, it checks its own data store
    - If the specified data is found, the node returns it
    - Otherwise, the node looks up its routing table and forwards the request to the node that has the nearest key
      - Why do this, given that the similarity of two keys has nothing to do with the similarity of their corresponding files?
  - If the request is successful, the node that has the target data returns it along the search path; each node on that path caches the file in its own data store and creates a new entry in its routing table

      key    neighbor
      30     123.234.456.1
      100    888.234.456.2
      65     999.234.456.3

Example
  [Figure: a file request routed among nodes A, B, C, D, and E. Recoverable labels:]
  - At the requester: Calculate binary file key; Check routing table for node with nearest key
  - At each node: Check datastore for file; Check routing table for node with nearest key to requested one; Try the node with second nearest key
  - On the return path: Cache file in datastore; Create new entry in routing table
  - Outcomes shown: FOUND, NOT FOUND, FAILURE
  - Legend: File request (key, hops to live); Data reply + actual data source; Failure message

Effect of Retrieving Mechanism
  - Anonymity
    - Uncontrolled replication allows a node to deny responsibility for holding a file
  - Quality of routing improves over time
    - Nodes specialize in locating sets of similar keys
    - Files with similar keys end up clustered on the same nodes (why?)
    - Files are clustered by key, not by subject
  - Transparent replication of popular data
    - Improved data availability
    - Replication degree depends on data popularity
  - Increasing connectivity
    - The graph becomes more and more connected

Effect of Retrieving Mechanism (continued)
  - Major differences from Gnutella searching
    - Breadth-first search (Gnutella) vs. depth-first search (Freenet)
    - Replication along the retrieval path
  - Limitation
    - What happens when searching for a document that does not exist?
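The retrieval steps above can be tried out in a few lines of Python. This is only a minimal in-memory sketch under stated assumptions, not Freenet's actual implementation: keys are plain integers, "nearest" means smallest absolute difference, and a shared `visited` set stands in for loop detection. The names `Node`, `request`, `store`, and `routes` are hypothetical and chosen for illustration.

```python
class Node:
    """One Freenet-style node: a local data store plus a routing table."""

    def __init__(self, name):
        self.name = name
        self.store = {}   # file key -> file contents
        self.routes = {}  # file key -> Node thought to hold that key

    def request(self, key, htl, visited=None):
        """Depth-first lookup; returns (data, source_node) or None on failure."""
        visited = set() if visited is None else visited
        visited.add(self.name)
        if key in self.store:
            return self.store[key], self
        if htl <= 0:  # hops-to-live exhausted: report failure upstream
            return None
        # Try known nodes in order of key closeness (nearest first, then
        # second nearest, ...), skipping nodes this request already visited.
        by_closeness = sorted(self.routes.items(), key=lambda kv: abs(kv[0] - key))
        for _, neighbor in by_closeness:
            if neighbor.name in visited:
                continue
            result = neighbor.request(key, htl - 1, visited)
            if result is not None:
                data, source = result
                self.store[key] = data     # cache the file on the return path
                self.routes[key] = source  # new routing table entry
                return result
        return None


# Hypothetical topology: A knows B (key 30) and C (key 100); C knows D (key 45).
a, b, c, d = (Node(n) for n in "ABCD")
a.routes = {30: b, 100: c}
c.routes = {45: d}
d.store[40] = "file-40"

# A asks for key 40: B (nearest key, 30) fails, so A backtracks to C, which reaches D.
result = a.request(40, htl=3)
print(result[0], "found at", result[1].name)
print("cached at:", [n.name for n in (a, b, c, d) if 40 in n.store])
```

After the request, both A and C hold a copy of the file and a routing entry pointing at D, which illustrates the replication and growing connectivity listed on the "Effect of Retrieving Mechanism" slides.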
Storing data
  - Calculate the binary file key and send an insert message, which looks like a request (key, hops to live)
  - When a node receives an insert proposal, it first checks its own data store
    - If the key already exists, the user needs to try again using a different key
    - Otherwise, the node looks up the nearest key in its routing table and forwards the insert to the corresponding node
    - If a key collision occurs at the adjacent node, that node notifies the inserter to try another key
  - If the TTL expires without a key collision, an "all clear" result is propagated back to the original inserter

Storing data (continued)
  - Effects of the insert mechanism
    - New files are placed on nodes that already hold files with similar keys
  - Limitations
    - How long does it take to insert a file?
    - How about version management?
    - Two different files could have the same key, and both may exist in the network
      - Different users must have different name spaces
      - The same user must use different file descriptions (e.g., keywords) for different files
    - Security is a concern

Managing data
  - File replacement is done using LRU
    - Data items are sorted in decreasing order by time of most recent request/insert
    - Outdated documents fade away naturally, although their routing table entries remain for a time
  - File lifetime
    - How long a file is kept is unknown
    - You cannot delete a file from Freenet; a file disappears only after it has not been accessed for a while
    - There is no guarantee that a document you submit today will exist tomorrow

Highlights of Freenet
  - Pure P2P, similar to Gnutella
  - Provides anonymity
    - Neither the data producer nor the retriever can be identified
  - Searching, storing, and managing all differ from Gnutella, for the sake of anonymity and performance

P2P Advantages
  - Efficient use of resources
    - Client/server architecture cannot take advantage of the unused bandwidth, storage, and processing power at the edge of the network
  - Scalability
    - Each user contributes its resources to the entire community, instead of just being a burden
  - Reliability
    - Replicas
    - Geographic distribution
    - No single point of failure
  - Ease of Administration
    - Nodes self
      organize
    - No need to deploy servers to satisfy demand
    - Built-in fault tolerance, replication, and load balancing

P2P Computing Summary
  - P2P computing is the sharing of computer resources by direct exchange between systems
    - Such resources include information, processing cycles, storage, etc.
  - A P2P network has the following characteristics
    - Each node behaves as client, server, and router
    - Nodes are autonomous (no administrative authority)
    - The network is dynamic: nodes enter and leave the network frequently
    - Nodes collaborate directly with each other (not through well-known servers)
    - Nodes have widely varying capabilities

Homework 3 (Due 04/20)
  - Implement a Gnutella network
    - Network maintenance (60/100 points): ping and pong
    - Nodes being able to retrieve files (40/100 points): query, hit, get
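Before writing any networking code, the rules from the Search Protocol slide can be checked with an in-memory simulation. The sketch below is one hedged reading of that slide, not a reference implementation: messages are tuples in a queue instead of socket traffic, a lookup is a substring match on file names, and the names `Peer`, `connect`, `deliver_hit`, and `search` are hypothetical.

```python
from collections import deque


class Peer:
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []
        self.seen = {}  # request ID N -> peer the query came from (None at the origin)
        self.hits = []  # (provider name, matching files) received as the origin


def connect(a, b):
    a.neighbors.append(b)
    b.neighbors.append(a)


def deliver_hit(peer, n, result):
    """Forward (N, Result) hop by hop along the recorded reverse path."""
    while peer.seen[n] is not None:
        peer = peer.seen[n]
    peer.hits.append(result)


def search(origin, s, n, ttl):
    """Flood query (S, N, T) from origin; return the number of query messages sent."""
    origin.seen[n] = None
    local = sorted(f for f in origin.files if s in f)
    if local:  # found locally, nothing to send
        origin.hits.append((origin.name, local))
        return 0
    queue = deque((origin, nb, ttl) for nb in origin.neighbors)
    sent = 0
    while queue:
        sender, peer, t = queue.popleft()
        sent += 1
        if n in peer.seen or t == 0:
            continue  # duplicate request ID or expired TTL: drop the query
        peer.seen[n] = sender  # remember who asked, for the reply path
        matches = sorted(f for f in peer.files if s in f)
        if matches:
            deliver_hit(peer, n, (peer.name, matches))
        else:
            # As on the slide, forward to all neighbors (the sender will drop it).
            queue.extend((peer, nb, t - 1) for nb in peer.neighbors)
    return sent


# Hypothetical five-node network; only E shares a matching file.
a, b, c, d = Peer("A"), Peer("B"), Peer("C"), Peer("D")
e = Peer("E", ["song.mp3"])
for x, y in [(a, b), (a, c), (b, d), (c, d), (d, e)]:
    connect(x, y)

messages = search(a, "song", n=1, ttl=3)
print(messages, "query messages;", "hits at A:", a.hits)
```

Rerunning the search with a fresh request ID and a TTL of 2 finds nothing, because the query expires one hop short of E. Even this small network shows how many query messages are dropped as duplicates, which is the scalability concern raised under "Highlights of Gnutella".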