P2P: An Overview
Dr. Tony White
Carleton University

Outline
• Introduction
• Evolution of Network Computing
  – Definitions
  – The Rise of Edge Computing
  – Why Peer-to-Peer? What is it?
• Applications
  – Cycle Sharing
  – Content Delivery
  – …
• Open Problems
• Summary

Evolution of Network Computing
• Client/server:
  – Introduced inequalities
  – Required homogeneity
• Web introduced:
  – A common protocol: HTTP
  – A common document format: HTML
  – A universal client: the browser

P2P Definition
• Peer-to-peer computing is the location and sharing of computer resources and services by direct exchange between servents.
• A servent is a peer that can adopt the roles of both server and client when operating.

P2P Definition
“P2P is a class of applications that takes advantage of resources -- storage, cycles, content, human presence -- available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, P2P nodes must operate outside the DNS system and have significant or total autonomy from central servers.”
Clay Shirky, February 2000

Definitions I
• Pure peer-to-peer is completely decentralized and characterized by the lack of a central server or central entity; clients make direct contact with one another.
• Computational peer-to-peer uses P2P technology to disseminate computational tasks over multiple clients; peers do not have a direct connection to one another.

Definitions II
• Datacentric peer-to-peer makes information and data residing on systems or devices accessible to others when users connect. It is sometimes called peer-assisted or grid-assisted delivery. Applications include distributed file and content sharing.
• Usercentric/hybrid peer-to-peer involves clients contacting others via a central server or entity to communicate, share data, or process data. Often used in collaboration applications.

What is a P2P network?
• It is an overlay network.
• Peer applications know the IP addresses of other peer applications.
• A link between two nodes is actually an application-level connection.

What matters?
• Topology of the overlay matters
• Where content is stored matters
• Search protocol matters
• Gnutella results in:
  – Poor performance
  – Poor reliability

The Rise of Edge Computing …
• In P2P, clients are also servers, hence are peers.
• Driving P2P is the abundance of:
  – Computing power
  – Non-volatile storage
  – Network bandwidth
  – (This seems to turn thin clients on their heads.)
• Sharing from the edge:
  – Physical Resources: cycles, disk
  – Information Resources: files, database access
  – Services: code mobility implied

P2P Enables Complete Access
• P2P file swapping is the obvious application
  – Text, audio, video, executables, …
• Searching and sharing
  – Resources
    • Information
    • Information processing capacity
  – Searches
    • More current than Google™
      – Indexing web logs (blogs, klogs …)
    • More focused: search within a “peer group”

P2P Enables Complete Access …
• Searching and sharing:
  – Instant messaging
    • Locate a user quickly, independent of service provider.
  – Buyers and sellers
    • P2P auctions – compete with eBay.
  – Blogging
    • Sharing of “self”.
• Edge-based multimedia streaming:
  – Web radio
  – Web TV
• Peer shells:
  – Script complex P2P applications from simpler ones.
  – Service creation using service composition.

P2P Enables Complete Access …
• A New Style of Distributed Computing
• P2P applications tolerate peers coming/going.
• Result depends on which peers are available.
• High availability comes from the probability that some peers are available.
  – Not from load-balancing and fail-over schemes.
  – Must avoid “tragedy of the commons”.

Examples of Early P2P
• Some new Internet applications are different:
  – SETI@home
  – Instant messaging services (AIM, MS Messenger, …)
  – P2P applications – no central authority/server.
    • Napster – quasi-P2P
    • Gnutella
    • Freenet
• These applications are vertically integrated:
  – Non-standard protocols
  – Closed namespaces
  – Stand alone

Problems I
• Topology
  – Bandwidth usage
  – Fault tolerance
  – Search efficiency
• Identity
  – Trust
  – Anonymity
• Security
  – Authorization
  – Privacy

Problems II
• Namespaces
• Community Management
  – Overlaps traditional enterprise groups
  – Highly dynamic, user controlled
• Firewall traversal
• Political
  – IT loses control of content distribution
  – No control of information flow!
• Legal
  – DRM

What is needed?
• Interoperability (common protocols & standards):
  – Communication protocols (e.g. JXTA, Jabber, …)
  – Representation of identity (or not!)
  – Semantic content (meta-data)
• Secure information exchange:
  – Must be able to guarantee trust within a network
  – Prevent unauthorized access to network
  – Policy-based control of information exchange
• Ubiquity
  – Buy-in from large groups of users

Securing Distributed Computations in a Commercial Environment
Philippe Golle, Stanford University
Stuart Stubblebine, CertCo

Example of a Distributed Computation
• 580,000 active participants
• 565,800 years of CPU time since 1996
• 26.1 TeraFLOPS

Commercialization: supply
• A dozen companies have recruited thousands of participants
• $100 million in venture funding in 2000
www.mithral.com
www.dcypher.net (with www.processtree.com)
www.distributed.net
www.entropia.com
www.parabon.com
www.uniteddevices.com
www.popularpower.com
www.distributedsciences.com
www.datasynapse.com
www.juno.com

Commercialization: demand
• Super-computing market: $2 billion / year
• Computationally intensive parallelizable projects:
  – Drug design research
  – Mathematical research
  – Economic simulations
  – Digital entertainment

Cheaters!
“Fifty percent of the project's resources have been spent dealing with security problems”
“The really hard part has to do with verifying computational results”
David Anderson, SETI@home's director.
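
One common way to approach the verification problem raised in these quotes is redundancy: hand the same task to several participants and accept a result only when every copy agrees. The sketch below is a minimal illustration under stated assumptions, not the method of any particular project. The function names are hypothetical, and the number of copies N is assumed to follow a geometric distribution Pr[N = i] = (1 - c) * c^(i - 1) for some parameter c between 0 and 1.

```python
import random


def sample_copies(c, rng):
    """Draw the number of copies N with Pr[N = i] = (1 - c) * c**(i - 1), i >= 1.

    `c` is an assumed tuning parameter in [0, 1); larger c means more redundancy.
    """
    n = 1
    while rng.random() < c:
        n += 1
    return n


def verify_by_agreement(results):
    """Return the common result if every copy agrees, otherwise None.

    None signals that the task should be re-assigned to a new set of participants.
    """
    first = results[0]
    return first if all(r == first for r in results) else None


def expected_overhead(c):
    """Expected extra work beyond one copy per task: E[N] - 1 = c / (1 - c)."""
    return c / (1.0 - c)
```

With c = 0.5, for example, each task is computed twice on average, so the expected overhead is one extra copy per task. A single cheater among the copies is caught whenever at least one honest participant holds the same task.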
Cycle Sharing Participants
• Trusted supervisor
  – Maintains a pool of registered participants
  – Bids for large computations
  – Divides the computation into tasks that are assigned to participants
  – Collects the results and distributes payment to the participants
  – Examples: Distributed.net, Entropia.com, etc.
• Untrusted participants
  – May range from large companies to individual users
  – Participants are anonymous (no “real world” leverage)
  – Participants may collude. We distinguish between real-world entities (agents) and anonymous participants.
  – Participants may leave the computation at any time, either temporarily or for good.

Organization
• Distribution of tasks
  – The unit of computation is a task
  – Assumption: all tasks have the same size and can be run by any participant within the same time bounds.
  – The supervisor runs a probabilistic algorithm to assign tasks to participants.
  – The supervisor keeps track of who did what

Security
• Definition: a computation is secure if no rational, non-risk-seeking participant ever cheats.
• Collusion may occur only before tasks are assigned.
• A participant has 3 choices:
  – Request a computation and do it
  – Request a computation and NOT do it
  – Take a leave
• Assumption: all errors are malicious

Utility function of an agent
• Run the computation: payoff α
• Cheat and “guess” the result:
  – Cheating detected: payoff –L
  – Cheating undetected: payoff α + E
• α: Payment received per task
• E: Benefit of defecting (E = e α)
• L: Cost of getting caught cheating
• Security condition: (α+E)P – L(1–P) < 0, where P is the probability that cheating is undetected

Basic scheme
• Registration:
  – Participant performs d+1 unpaid tasks
  – The supervisor verifies them (at limited cost)
  – The participant is accepted iff all the results are correct
• Assignment of a task:
  – A task is given to N participants chosen uniformly and independently at random
  – The number N is chosen according to the probability distribution $Q(N = i) = (1 - c)\,c^{\,i-1}$
  – Payment: a constant amount α per task if all the results agree
  – If not, the task is re-assigned to a new set of participants
• Severance: a participant is paid an amount d·α

Properties
• Computational overhead = $c/(1-c)$
• Security condition: (α+E)P – L(1–P) < 0 [the closed form in c, p, d and e is garbled in the source: “1c1 p 2 … 1 p d 1 e d”]

Computational overhead   Setup time   Maximum coalition size   Maximum e
10%                      10           1%                       1
17%                      10           10%                      1
46%                      10           1%                       10
243%                     10           1%                       100

• Overhead = [garbled in the source: “1 e 1 d”] for “small” p

Content Delivery Networks
• Swarmcast/OnionNetworks
  – File is stored in multiple locations
  – Idea is to retrieve portions of file from separate hosts:
    • File is split into small (32k) pieces
    • Requests are random
    • Space of packets bigger than file
    • Only subset of packets required
    • Technique is Forward Error Correction
• Kazaa/Morpheus
• MojoNation (HiveCache)
  – Distributed backup and restore system

Privacy Networks: Publius
• Publius
  – Publishers: want to publish anonymously
  – Servers: host random-looking content
• Storage
  – The publisher takes the key K used to encrypt the file and splits it into n shares such that any k of them can reproduce the original K, but k–1 give no hint as to the key.
  – Each server receives the encrypted Publius content and one of the shares.
• Retrieval
  – A retriever must get the encrypted Publius content from some server and k of the shares.
  – Content is tied to a URL that is used to recover the data and the shares.

Privacy Networks: Free Haven
• Anonymity:
  – Publishers that insert documents,
  – Readers that retrieve documents,
  – Servers that store documents.
  – Uses a free, low-latency, two-way mixnet for forward-anonymous communication.
• Accountability:
  – Reputation and micropayment schemes, which allow us to limit the damage done by servers that misbehave.
• Persistence:
  – Publisher of a document determines its lifetime.
• Flexibility:
  – System functions smoothly as peers dynamically join or leave

OceanStore
Toward Global-Scale, Self-Repairing, Secure and Persistent Storage
John Kubiatowicz

OceanStore Context: Ubiquitous Computing
• Computing everywhere:
  – Desktop, Laptop, Palmtop
  – Cars, Cellphones
  – Shoes? Clothing? Walls?
• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net
  – Broadband to the home and office
  – Wireless technologies such as CDMA, satellite, laser
• Where is persistent data????

Utility-based Infrastructure?
[Figure: storage federation with providers labelled Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
• Data service provided by storage federation
• Cross-administrative domain
• Pay for Service

OceanStore: Everyone’s Data, One Big Utility
“The data is just out there”
• How many files in the OceanStore?
  – Assume $10^{10}$ people in world
  – Say 10,000 files/person (very conservative?)
  – So $10^{14}$ files in OceanStore!
  – If 1 gig files (ok, a stretch), get 1 mole of bytes!
Truly impressive number of elements…
… but small relative to physical constants
Aside: new results: 1.5 Exabytes/year ($1.5 \times 10^{18}$)

OceanStore Assumptions
• Untrusted Infrastructure:
  – The OceanStore is composed of untrusted components
  – Individual hardware has finite lifetimes
  – All data encrypted within the infrastructure
• Responsible Party:
  – Some organization (i.e. service provider) guarantees that your data is consistent and durable
  – Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
  – Data may be cached anywhere, anytime

The Peer-To-Peer View: Irregular Mesh of “Pools”

Key Observation: Want Automatic Maintenance
• Can’t possibly manage billions of servers by hand!
• System should automatically:
  – Adapt to failure
  – Exclude malicious elements
  – Repair itself
  – Incorporate new elements
• System should be secure and private
  – Encryption, authentication
• System should preserve data over the long term (accessible for 1000 years):
  – Geographic distribution of information
  – New servers added from time to time
  – Old servers removed from time to time
  – Everything just works

Attack Resistant P2P
• Content can be compromised by:
  – Attack by malicious agents
  – Censorship
  – Faulty nodes
• Remember:
  – Nodes have finite resources
[Figure: Gnutella query; Morpheus/Kazaa with super peers]
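
The figure contrasts Gnutella-style query propagation with a super-peer design. As a rough illustration of the former, the sketch below floods a query hop by hop with a time-to-live counter and counts the messages it generates. The function name `flood_query`, the adjacency-dictionary graph, and the `has_item` predicate are assumptions made for this example; they are not taken from the Gnutella protocol itself.

```python
from collections import deque


def flood_query(graph, start, ttl, has_item):
    """Flood a query breadth-first from `start`, forwarding at most `ttl` hops.

    graph:    dict mapping each peer to a list of neighbouring peers
    has_item: predicate telling whether a peer holds the requested item
    Returns (list of peers holding the item, number of messages sent).
    """
    seen = {start}
    hits = [start] if has_item(start) else []
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if hops_left == 0:
            continue  # time-to-live exhausted, do not forward further
        for neighbour in graph[peer]:
            messages += 1  # every forwarded copy costs one message
            if neighbour in seen:
                continue  # duplicate query is dropped by the receiver
            seen.add(neighbour)
            if has_item(neighbour):
                hits.append(neighbour)
            frontier.append((neighbour, hops_left - 1))
    return hits, messages
```

Even on a small graph the message count grows faster than the number of peers reached, because duplicates are still sent before they are dropped. That cost, and the finite resources of each node, are what make flooding vulnerable to overload.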
Examples
• Napster shut down by attacks on central server
• Gnutella spammed by Flatplanet
• Removal of a few peers shatters Gnutella
  – 63 from 1800 in figures

Performance
After deletion of 2/3 of peers, 99% of the remainder can still access 99% of the data items

DRN design [Jared Saia]
• Topology based upon butterfly network (constant-degree version of hypercube)
• Each vertex of the butterfly is called a supernode
• Each supernode represents a set of peers
• Each peer is in multiple supernodes

DRN Topology
N peers, n supernodes
Each peer participates in C log n randomly chosen supernodes
Supernode X connected to supernode Y means all nodes in X are connected to all nodes in Y

Conclusion
• P2P systems popular today
  – Limewire, Kazaa …
• Existing P2P systems vulnerable and inefficient
• Many challenges ahead:
  – Search
  – Resource Management
  – Security and Privacy
Lots of good research to be done …

Appendix I
Open Problems in P2P Data Sharing

Open Problems in Data Sharing Peer-To-Peer Systems
Hector Garcia-Molina
ICDT Conference, January 10, 2003
Contributors: Mayank Bawa, Brian Cooper, Arturo Crespo, Neil Daswani, Prasanna Ganesan, Sergio Marti, Qi Sun, Beverly Yang and others

P2P Challenges
• Search
• Resource Management
• Security & Privacy
Not independent challenges!

Search
• Search Options
  – Query Expressiveness
  – Comprehensiveness
  – Topology
  – Data Placement
  – Message Routing

Comparison
                     Gnutella     CAN        Others?
Topology             power law
Data Placement       arbitrary
Message Routing      flooding
Expressiveness
Comprehensiveness
Autonomy
Efficiency
Robustness

Content Addressable Network (CAN)
[Figure: CAN diagram with labels Nodes, Data, 1, 2]
A distributed hash table on Internet scales …

Comparison
                     Gnutella     CAN        Others?
Topology             power law    grid
Data Placement       arbitrary    hashing
Message Routing      flooding     directed
Expressiveness
Comprehensiveness
Autonomy
Efficiency
Robustness
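
The CAN column of the comparison (grid topology, hashed placement, directed routing) can be illustrated with a deliberately simplified sketch. It assumes a fixed square grid with one node per cell and no wrap-around, which is an assumption of this example rather than the CAN design, where zones are split dynamically as nodes join. The names `key_to_cell` and `route` are hypothetical.

```python
import hashlib


def key_to_cell(key, side):
    """Hash a key to a cell (x, y) of a side-by-side grid (hashed data placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big") % side
    y = int.from_bytes(digest[8:16], "big") % side
    return x, y


def route(src, dst):
    """Directed routing: step to a grid neighbour that is closer to dst.

    Returns the list of cells visited, including both endpoints.
    """
    x, y = src
    path = [(x, y)]
    while (x, y) != dst:
        if x != dst[0]:
            x += 1 if dst[0] > x else -1  # close the gap along x first
        else:
            y += 1 if dst[1] > y else -1  # then along y
        path.append((x, y))
    return path
```

Unlike flooding, a lookup here touches only the cells on one path, so the hop count equals the grid distance between the querying node and the cell that owns the key. The price is reduced autonomy: a node stores whatever the hash function assigns to its cell.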
Challenge: Exploring the Space
[Figure: design space with axes + autonomy, + efficiency, + robustness; labels: gnutella, can, SIL model, a lot of research]

Search Index Link (SIL) Model
• Forwarding search link (FSL)
• Non-forwarding search link (NSL)
• Forwarding index link (FIL)
• Non-forwarding index link (NIL)
[Figure: query Q passing among nodes A, B, C, D, E]

SIL Model
[Figure: query Q passing among nodes A, B, C, D, E, F, G, H]

Super-Peer Network
[Figure: nodes A, B, C, D, E, F, G, H and a core]

SIL Challenges
• Desirable graph properties
• Desirable features
• Dynamic configuration

Example Property: Redundancy
[Figure: nodes A, B, C]
• Redundancy exists in a SIL graph if a link can be removed without reducing coverage

Example: Undesirable Feature
[Figure: nodes A, B, C, D, E]
• One-index cycle: Node A has an index link to B, and there is a search path from B to A

Avoiding Undesirable Features
• Node D is joining the system:
  – what neighbors should it connect to?
  – what type of links should it use?
[Figure: nodes A, B, C, E with joining node D and its undecided links]

Open Problems: Security
• Availability (e.g., coping with DOS attacks)
• Authenticity
• Anonymity
• Access Control (e.g., IP protection, payments,...)

Authenticity
[Figure: a document with a question mark]
title: origin of species
author: charles darwin
date: 1859
body: In an island far, far away ...

More than Just File Integrity
[Figure: the same document with a checksum and a question mark]
title: origin of species
author: charles darwin
date: 1859
body: In an island far, far away ...

More than Fetching One File
[Figure: query T=origin, Y=?, A=darwin, B=? and candidate records (T=origin, Y=1800, A=darwin), (T=origin, Y=1859, A=darwin), (T=origin, Y=1859, A=darwin, B=abcd), (T=origin, Y=1859, A=darwin)]

Solutions
• Authenticity Function A(doc): T or F
  – at expert sites, at all sites?
  – can use signature
[Figure: expert, sig(doc), user]
• Voting Based
  – authentic is what majority says
• Time Based
  – e.g., oldest version (available) is authentic

Added Challenge: Efficiency
• Example: Current music sharing
  – everyone has authenticity function
  – but downloading files is expensive
• Solution: Track peer behavior
[Figure: good peer, good peer, bad peer]

How to Track Peer Behavior?
• Trust Vector [ v1, v2, v3, v4 ] for peers a, b, c, d
• Single value between 0 and 1?
• Pair of values: [ total downloads, good downloads ] ?

Trust Operations
[Figure: peers a, b, c, d, e with trust vectors [1, .9, .5, 0, 0], [1, 1, 0, .3, 1], [1, 0, 1, 1, .2], edge weights .5, .9, .3, .3, 1, .2, and the label “update?”]

Issues
• Trust computations in dynamic system
• Overloading good nodes
• Bad nodes provide good content sometimes
• Bad nodes can build up reputation
• Bad nodes can form collectives
• ...

Sample Results
[Figure: results plotted against the fraction of malicious peers]

P2P Challenges
• Search
• Resource Management
• Security and Privacy

Resource Management
[Figure: nodes 1, 2, 3 with capacity = C1, capacity = C2, capacity = C3]
• Local work: ρ Ci
• Remote work: (1 – ρ) Ci

Incentives for Remote Work
• What is the best value for ρ?
• How do I get remote nodes to work for me?
[Figure: nodes 1, 2, 3 with capacities C1, C2, C3; local work ρ Ci, remote work (1 – ρ) Ci]

Conclusion
• P2P systems popular today
  – Limewire, Kazaa …
• Existing P2P systems vulnerable and inefficient
• Many challenges ahead:
  – Search
  – Resource Management
  – Security and Privacy
Lots of good research to be done …

For Additional Information
• Google:
  – “Stanford Peers”, OceanStore, Tapestry, Chord
• http://www-db.stanford.edu/peers/
• http://www.freehaven.net/
• http://cs1.cs.nyu.edu/~waldman/publius/
• http://www.onionnetworks.com

Appendix II
P2P Architectures

Peer-to-Peer is Not Always Decentralized
…when Centralization is Good
Nelson Minar <nelson@monkey.org>
http://www.media.mit.edu/~nelson/

Talk Overview
• Topologies of distributed systems
• Strengths and weaknesses
• Conclusions
Warning: Broad generalizations ahead

What is P2P Anyway?
• Decentralized Systems: no
  – Popular Power fails test
  – Napster fails test
  – Most Instant Messaging fails test
  – Confuses topology with function
• Edge Resources: yes
  – Small computers on edges contribute back
  – All peers are active participants

Distributed Systems Topologies
• Get away from fundamentalism
  – “Pure P2P”, “True P2P”, etc
• Focus instead on system architecture
  – How do the pieces fit together?
• Concentrate on connection topology
• Which topology for which problem?

Centralized
• Client/server
• Web servers
• Databases
• Napster search
• Instant Messaging
• Popular Power

Ring
• Fail-over clusters
• Simple load balancing
• Assumption
  – Single owner

Hierarchical
• DNS
• NTP
• Usenet (sort of)

Decentralized
• Gnutella
• Freenet
• Hive
• Internet routing

Centralized + Centralized
• N-tier apps
• Database heavy systems
• Web services gateways
• Grand Central

Centralized + Ring
• Serious web applications
• High availability servers

Centralized + Decentralized
• Clip2 Gnutella Reflector
• FastTrack
  – KaZaA
  – Morpheus
• Email

What about other topologies?
• Centralized + Hierarchical?
  – Back end tree of information
  – Caching architectures
• Decentralized + Ring?
  – P2P network of fail-over clusters
• Decentralized + Hierarchical?
• Decentralized + Centralized?

Strengths and Weaknesses
• Plenty of topologies to choose from
• What is each kind good for?
• Need a set of properties to measure
• Caution: What follows is very high level

Things to Measure
• Manageability
  – How hard is it to keep working?
• Information coherence
  – How authoritative is info? (Auditing, non-repudiation)
• Extensibility
  – How easy is it to grow?
• Fault tolerance
  – How well can it handle failures?
• Security
  – How hard is it to subvert?
• Resistance to legal or political intervention
  – How hard is it to shut down? (Can be good or bad)
• Scalability
  – How big can it grow?

Centralized
Property         Rating   Comment
Manageable       ✓        System is all in one place
Coherent         ✓        All information is in one place
Extensible       X        No one can add on to system
Fault Tolerant   X        Single point of failure
Secure           ✓        Simply secure one host
Lawsuit-proof    X        Easy to shut down
Scalable         ?        One machine. But in practice?

Ring
Property         Rating   Comment
Manageable       ✓        Simple rules for relationships
Coherent         ✓        Easy logic for state
Extensible       X        Only ring owner can add
Fault Tolerant   ✓        Fail-over to next host
Secure           ✓        As long as ring has one owner
Lawsuit-proof    X        Shut down owner
Scalable         ✓        Just add more hosts

Hierarchical
Property         Rating   Comment
Manageable       ½        Chain of authority
Coherent         ½        Cache consistency
Extensible       ½        Add more leaves, rebalance
Fault Tolerant   ½        Root is vulnerable
Secure           X        Too easy to spoof links
Lawsuit-proof    X        Just shut down the root
Scalable         ✓        Hugely scalable – DNS

Decentralized
Property         Rating   Comment
Manageable       X        Very difficult, many owners
Coherent         X        Difficult, unreliable peers
Extensible       ✓        Anyone can join in!
Fault Tolerant   ✓        Redundancy
Secure           X        Difficult, open research
Lawsuit-proof    ✓        No one to sue! (…but follow $)
Scalable         ?        Theory – yes : Practice – no

Centralized + Ring
Property         Rating   Comment
Manageable       ✓        Just manage the ring
Coherent         ✓        As coherent as ring
Extensible       X        No more than ring
Fault Tolerant   ✓        Ring is a huge win
Secure           ✓        As secure as ring
Lawsuit-proof    X        Still single place to shut down
Scalable         ✓        Ring is a huge win
Common architecture for web applications

Centralized + Decentralized
Property         Rating   Comment
Manageable       X        Same as decentralized
Coherent         ½        Better than decentralized
Extensible       ✓        Anyone can still join!
Fault Tolerant   ✓        Plenty of redundancy
Secure           X        Same as decentralized
Lawsuit-proof    ✓        Still no one to sue
Scalable         ?        Looking very hopeful
Best architecture for P2P networks?

Centralized vs. Decentralized
• Centralized is pretty good!
  – Manageable
  – Coherent
  – Security
• Decentralized is exciting
  – Extensible
  – Massive fault tolerance
  – Lawsuit-proof
• Scalability is the big question

Conclusions
• Centralized is easy to deal with
  – Major architecture for distributed systems
  – Combines well with rings
• Decentralized is good, needs research
  – Coherence, Manageability, Security
  – Scalability
• Hierarchical is overlooked
• Combining architectures is powerful

Peer-to-Peer is Not Always Decentralized
…when Centralization is Good
Nelson Minar <nelson@monkey.org>
http://www.media.mit.edu/~nelson/
Thanks to Marc Hedlund, Raffi Krikorian, Tony White

Appendix III
P2P Industry

P2P Industry Outline
“There’s no peer-to-peer market any more than there’s a client/server market” – Anne Manes, Sun Microsystems
• Peer-to-peer encompasses a wide range of technologies centered around decentralizing computing
• Business and revenue models are still fuzzy
• There are clear opportunities and research excitement

Distribution of P2P Companies
Category                                 Examples                    Industry Share
Distributed Computing                    Entropia, United Devices    35%
Collaboration / Knowledge Management     Groove Networks, Engenia    20%
Content Distribution                     Akamai, Proksim             10%
Infrastructure / Platform                Akavi, Xdegrees             10%
File Sharing                             Kazaa, Napster              10%
Distributed Search                       OpenCola, Thinkstream       5%
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)

Major Features of P2P Industry
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)
• Lack of experienced, quality management teams
• Lack of detailed business models
• Skeptical investors
• 150+ active companies
• Estimated 95% failure rate
“The elephant in the room is the fact that most companies here are not commercially viable.” - Heard from a speaker at O’Reilly

Current P2P Business Models
• Sell P2P products to end-users
  – No current revenue-generating business model
  – Sometimes coupled with content-sale models
• Sell content through P2P
  – Subscription-based – I buy content from you
  – Sponsor-based – Someone pays you to give me content
  – Ad-based – You give me content and sell ads

Current P2P Business Models
• Sell something which lets others profit from P2P
  – Solve a critical problem for decentralized applications
  – Offer support and enhanced services for free tools
  – Specialized packages for particular industries
  – Tools and libraries for P2P infrastructure
“The people most likely to make money during a Gold Rush are the ones selling pickaxes and shovels.” Andy Oram, The O’Reilly Network