CITO TechTalk 25-02-2003 Final

P2P: An Overview
Dr. Tony White
Carleton University
Outline
• Introduction
• Evolution of Network Computing
– Definitions
– The Rise of Edge Computing
– Why Peer-to-Peer? What is it?
• Applications
– Cycle Sharing
– Content Delivery
– …
• Open Problems
• Summary
Evolution of Network Computing
• Client/server:
- Introduced inequalities
- Required homogeneity
• Web introduced:
- A common protocol: HTTP
- A common document format: HTML
- A universal client: the browser
P2P Definition
• Peer-to-peer computing is the location and
sharing of computer resources and services by
direct exchange between servents.
• A servent is a peer that can adopt the roles of
both server and client when operating.
P2P Definition
“P2P is a class of applications that takes
advantage of resources -- storage, cycles,
content, human presence -- available at the
edges of the Internet. Because accessing these
decentralized resources means operating in an
environment of unstable connectivity and
unpredictable IP addresses, P2P nodes must
operate outside the DNS system and have
significant or total autonomy from central
servers.” Clay Shirky, February 2000
Definitions I
• Pure peer-to-peer is completely
decentralized and characterized by lack of
a central server or central entity; clients
make direct contact with one another.
• Computational peer-to-peer uses P2P
technology to disseminate computational
tasks over multiple clients; peers do not
have a direct connection to one another.
Definitions II
• Datacentric peer-to-peer makes information and
data residing on systems or devices accessible to
others when users connect. It is
sometimes called peer-assisted or grid-assisted
delivery. Applications include distributed file
and content sharing.
• Usercentric/hybrid peer-to-peer involves
clients contacting others via a central server or
entity to communicate, share data, or process
data. Often used in collaboration applications.
What is a P2P network?
• It is an overlay network
• Peer applications know
IP addresses of other
peer applications.
• Link between two nodes
is actually an
application-level
connection.
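Since the deck contains no code, here is a minimal sketch (not from the talk) of what an application-level overlay link amounts to: each peer keeps a table of other peers' IP addresses and ports and opens ordinary TCP connections to them. The `Peer` class and its methods are illustrative names, not any real P2P implementation.

```python
import socket

class Peer:
    """Illustrative overlay node: a 'link' is just a TCP connection
    the application opens to another peer's IP address and port."""

    def __init__(self, known_peers):
        # The overlay topology is defined by this table,
        # not by physical network links.
        self.known_peers = list(known_peers)   # [(ip, port), ...]

    def connect(self, ip, port, timeout=2.0):
        """Open an application-level connection to one neighbour."""
        return socket.create_connection((ip, port), timeout=timeout)

    def ping_neighbours(self):
        """Check which overlay links are currently usable."""
        alive = []
        for ip, port in self.known_peers:
            try:
                with self.connect(ip, port):
                    alive.append((ip, port))
            except OSError:
                pass   # peer offline or unreachable: the overlay link is gone
        return alive
```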
What matters?
• Topology of overlay matters
• Where content is stored matters
• Search protocol matters
• Gnutella results in:
– Poor performance
– Poor reliability
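To make the Gnutella point concrete, here is a small simulation sketch (assumptions mine, not from the talk): a query is flooded with a TTL over a random overlay, and the message count shows how expensive each hit is, which is why topology and search protocol matter.

```python
import random
from collections import deque

def flood_search(adjacency, start, have_item, ttl=4):
    """Simplified Gnutella-style flooding: forward the query to every
    neighbour, decrementing a TTL, and count the messages generated.
    Returns (peers that answered, total messages sent)."""
    messages = 0
    seen = {start}
    hits = set()
    queue = deque([(start, ttl)])
    while queue:
        node, t = queue.popleft()
        if have_item(node):
            hits.add(node)
        if t == 0:
            continue
        for nbr in adjacency[node]:
            messages += 1              # every forward costs bandwidth
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, t - 1))
    return hits, messages

# Tiny random overlay: 200 peers, 4 random neighbours each.
random.seed(1)
peers = range(200)
adjacency = {p: random.sample([q for q in peers if q != p], 4) for p in peers}
hits, msgs = flood_search(adjacency, start=0,
                          have_item=lambda p: p % 50 == 0, ttl=4)
print(f"{len(hits)} hits cost {msgs} messages")
```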
The Rise of Edge Computing …
• In P2P, clients also are servers, hence are peers.
• Driving P2P is the abundance of:
– Computing power
– Non-volatile storage
– Network bandwidth
(This seems to turn thin clients on their heads.)
• Sharing from the edge:
– Physical Resources: cycles, disk
– Information Resources: files, database access
– Services: code mobility implied
P2P Enables Complete Access
• P2P file swapping is the obvious application
– Text, audio, video, executables, …
• Searching and sharing
– Resources
• Information
• Information processing capacity
– Searches
• More current than Google™
– Indexing web logs (blogs, klogs …)
• More focused: search within a “peer group”
P2P Enables Complete Access …
• Searching and sharing:
– Instant messaging
• locate user quickly independent of service provider.
– Buyers and sellers
• P2P auctions – compete with eBay.
– Blogging
• Sharing of “self”.
• Edge-based multi-media streaming:
– Web radio
– Web TV
• Peer shells:
– Script complex P2P applications from simpler ones.
– Service creation using service composition.
P2P Enables Complete Access …
• A New Style of Distributed Computing
• P2P applications tolerate peers coming/going.
• Result depends on which peers are available.
• High availability comes from the probability that some peers are available:
– Not on load-balancing and fail-over schemes.
– Must avoid the “tragedy of the commons”.
Examples of Early P2P
• Some new Internet applications are different:
– SETI@home
– Instant messaging services (AIM, MS Messenger, …)
– P2P applications – no central authority/server.
• Napster – quasi-P2P
• Gnutella
• Freenet
• These applications are vertically integrated:
– Non-standard protocols
– Closed namespaces
– Stand alone
Problems I
• Topology
– Bandwidth usage
– Fault tolerance
– Search efficiency
• Identity
– Trust
– Anonymity
• Security
– Authorization
– Privacy
Problems II
• Namespaces
• Community Management
– Overlaps traditional enterprise groups
– Highly dynamic, user controlled
• Firewall traversal
• Political
– IT loses control of content distribution
– No control of information flow!
• Legal
– DRM
What is needed?
• Interoperability (common protocols & standards):
– Communication protocols (e.g. JXTA, Jabber, …)
– Representation of identity (or not!)
– Semantic content (meta-data)
• Secure information exchange:
– Must be able to guarantee trust within a network
– Prevent unauthorized access to network
– Policy-based control of information exchange
• Ubiquity
– Buy-in from large groups of users
Securing Distributed Computations
in a Commercial Environment
Philippe Golle, Stanford University
Stuart Stubblebine, CertCo
Example of a Distributed Computation
• 580,000 active participants
• 565,800 years of CPU time since 1996
• 26.1 teraFLOPS
Commercialization: supply
• A dozen companies have recruited thousands of participants
• $100 million in venture funding in 2000
www.mithral.com
www.dcypher.net
(with www.processtree.com)
www.distributed.net
www.entropia.com
www.parabon.com
www.uniteddevices.com
www.popularpower.com
www.distributedsciences.com
www.datasynapse.com
www.juno.com
Commercialization: demand
• Super-computing market: $2 billion / year
• Computationally intensive parallelizable projects:
– Drug design research
– Mathematical research
– Economic simulations
– Digital entertainment
Cheaters!
"Fifty percent of the project's resources have been spent
dealing with security problems"
“The really hard part has to do with verifying computational
results"
– David Anderson, SETI@home's director
Cycle Sharing Participants
• Trusted supervisor
– Maintains a pool of registered participants
– Bids for large computations
– Divides the computation into tasks that are assigned to participants
– Collects the results and distributes payment to the participants
– Examples: Distributed.net, Entropia.com, etc.
• Untrusted participants
– May range from large companies to individual users
– Participants are anonymous (No “real world” leverage)
– Participants may collude. We distinguish between real-world entities
(agents) and anonymous participants.
– Participants may leave the computation at any time, either temporarily
or for good.
Organization
• Distribution of tasks
– The unit of computation is a task
– Assumption: all tasks have the same size and can be run by any
participant within the same time bounds.
– The supervisor runs a probabilistic algorithm to assign tasks to
participants.
– The supervisor keeps track of who did what
Security
• Definition: a computation is secure if no rational, non-risk-seeking participant ever cheats.
• Collusion may occur only before tasks are assigned.
• A participant has 3 choices:
– Request a computation and do it
– Request a computation and NOT do it
– Take a leave
• Assumption: all errors are malicious
Utility function of an agent
• Run the computation: payoff α
• Cheat and “guess” the result:
– Cheating detected: payoff –L
– Cheating undetected: payoff α + E
• α: payment received per task
• E: benefit of defecting (E = e·α)
• L: cost of getting caught cheating
• Security condition: (α+E)P – L(1-P) < 0
where P is the probability that cheating is undetected
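A quick numeric check of this condition, with hypothetical values (not from the talk) plugged into the slide's formula, shows how a large enough penalty L makes cheating irrational once detection is likely enough.

```python
def cheating_is_irrational(alpha, e, L, P):
    """Security condition from the slide: the expected payoff of cheating,
    (alpha + E)*P - L*(1 - P), must be negative, where E = e*alpha and
    P is the probability that cheating goes undetected."""
    E = e * alpha
    return (alpha + E) * P - L * (1 - P) < 0

# Hypothetical numbers: payment 1 per task, 50% benefit from defecting,
# and a deposit of 10 forfeited when caught.
for P in (0.5, 0.9, 0.99):
    print(P, cheating_is_irrational(alpha=1.0, e=0.5, L=10.0, P=P))
# Cheating stops being rational once P is small enough that the expected
# loss L*(1 - P) outweighs the expected gain (alpha + E)*P.
```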
Basic scheme
• Registration:
– Participant performs d+1 unpaid tasks
– The supervisor verifies them (at limited cost)
– The participant is accepted iff all the results are correct
• Assignment of a task:
– A task is given to N participants chosen uniformly independently at
random
– The number N is chosen according to the probability distribution
Q(N = i) = (1 − c) · c^(i−1)
– Payment: a constant amount α per task if all the results agree
– If not, the task is re-assigned to a new set of participants
• Severance: a participant is paid an amount d·α
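Below is a minimal sketch (not the authors' implementation) of the assignment step just described: draw N from Q(N = i) = (1 − c)·c^(i−1), hand the same task to N participants chosen uniformly at random, and accept only if every result agrees. The participant objects and their `run(task)` method are a hypothetical interface.

```python
import random

def draw_replication_count(c, rng=random):
    """Sample N from Q(N = i) = (1 - c) * c**(i - 1), i.e. a geometric
    distribution with success probability (1 - c)."""
    n = 1
    while rng.random() < c:   # with probability c, add another copy
        n += 1
    return n

def assign_task(task, participants, c, rng=random):
    """Give the same task to N participants chosen uniformly at random;
    the supervisor accepts (and pays) only if every result agrees."""
    n = min(draw_replication_count(c, rng), len(participants))
    chosen = rng.sample(participants, n)
    results = [p.run(task) for p in chosen]   # hypothetical participant API
    all_agree = len(set(results)) == 1
    return chosen, results, all_agree
```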
Properties
• Computational overhead = c / (1 − c)
• Security condition: (α + E)P − L(1 − P) < 0 becomes
((1 − c) / (1 − c·p))² ≤ (1 − p) · d / (1 + e + d)

Computational overhead   Setup time   Maximum coalition size   Maximum e
10%                      10           1%                       1
17%                      10           10%                      1
46%                      10           1%                       10
243%                     10           1%                       100

• Overhead ≈ √(1 + (1 + e)/d) − 1 for “small” p
Content Delivery Networks
• Swarmcast/OnionNetworks
– File is stored in multiple locations
– Idea is to retrieve portions of the file from separate hosts (see the sketch below):
• File is split into small (32 KB) pieces
• Requests are random
• Space of packets is bigger than the file
• Only a subset of packets is required
• Technique is Forward Error Correction
• Kazaa/Morpheus
• MojoNation (HiveCache)
– Distributed backup and restore system
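Real Swarmcast-style delivery uses a proper erasure code; as a deliberately tiny stand-in (assumptions mine), the sketch below shows only the core idea that the encoded packet space is larger than the file and any sufficiently large subset of blocks suffices, using a single XOR parity block so that any k of k + 1 blocks rebuild the file.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k, block_size=32 * 1024):
    """Split `data` into k equal blocks (zero-padded) and append one
    parity block (the XOR of all data blocks).  Any k of the k + 1
    blocks are enough to rebuild the file: a tiny (k+1, k) erasure code."""
    padded = data.ljust(k * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    parity = blocks[0]
    for b in blocks[1:]:
        parity = xor_bytes(parity, b)
    return blocks + [parity]

def decode(received, k, block_size=32 * 1024):
    """`received` maps block index -> bytes; at most one of the first k
    data blocks may be missing, and is rebuilt from the parity block."""
    missing = [i for i in range(k) if i not in received]
    if missing:
        (m,) = missing                # exactly one data block to rebuild
        rebuilt = received[k]         # start from the parity block
        for i in range(k):
            if i != m:
                rebuilt = xor_bytes(rebuilt, received[i])
        received = {**received, m: rebuilt}
    return b"".join(received[i] for i in range(k))

original = b"hello swarm" * 1000
blocks = encode(original, k=4)
partial = {i: b for i, b in enumerate(blocks) if i != 2}  # block 2 never fetched
assert decode(partial, k=4).rstrip(b"\0") == original
```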
Privacy Networks: Publius
• Publius
– Publishers: want to publish anonymously
– Servers: host random-looking content
• Storage
– The publisher takes the key K that is used to encrypt the
file and splits it into n shares, such that any k of them can
reproduce the original K, but k − 1 give no hints as to the key (a sketch follows below).
– Each server receives the encrypted Publius content and one
of the shares.
• Retrieval
– A retriever must get the encrypted Publius content from
some server and k of the shares.
– Content is tied to URL that is used to recover the data and
the shares.
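Publius splits the encryption key into n shares so that any k recover it and k − 1 reveal nothing; the standard construction for this is Shamir secret sharing. Here is a minimal sketch over a prime field; the prime, key, and parameters are illustrative and not Publius's actual values.

```python
import random

PRIME = 2**127 - 1   # a Mersenne prime, large enough for a small integer key

def split_key(key, n, k, rng=random.SystemRandom()):
    """Split integer `key` into n shares; any k reconstruct it (Shamir).
    Uses a random degree-(k-1) polynomial with constant term = key."""
    coeffs = [key] + [rng.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def recover_key(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

shares = split_key(key=123456789, n=5, k=3)
assert recover_key(shares[:3]) == 123456789    # any 3 shares suffice
assert recover_key(shares[2:5]) == 123456789
```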
Privacy Networks: Freehaven
• Anonymity:
– Publishers that insert documents
– Readers that retrieve documents
– Servers that store documents
– Uses a free, low-latency, two-way mixnet for forward-anonymous communication
• Accountability:
– Reputation and micropayment schemes, which allow us to
limit the damage done by servers that misbehave.
• Persistence:
– Publisher of a document determines its lifetime.
• Flexibility:
– System functions smoothly as peers dynamically join or
leave
OceanStore
Toward Global-Scale, Self-Repairing,
Secure and Persistent Storage
John Kubiatowicz
OceanStore Context:
Ubiquitous Computing
• Computing everywhere:
– Desktop, Laptop, Palmtop
– Cars, Cellphones
– Shoes? Clothing? Walls?
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
• Where is persistent data????
Utility-based Infrastructure?
[Figure: a storage utility federated across providers such as a Canadian OceanStore, Sprint, AT&T, Pac Bell, and IBM]
• Data service provided by storage federation
• Cross-administrative domain
• Pay for Service
OceanStore:
Everyone’s Data, One Big Utility
“The data is just out there”
• How many files in the OceanStore?
– Assume 10^10 people in the world
– Say 10,000 files/person (very conservative?)
– So 10^14 files in OceanStore!
– If 1 GB files (OK, a stretch), get 1 mole of bytes!
Truly impressive number of elements…
… but small relative to physical constants
Aside: new results: 1.5 exabytes/year (1.5 × 10^18 bytes)
OceanStore Assumptions
• Untrusted Infrastructure:
– The OceanStore is comprised of untrusted components
– Individual hardware has finite lifetimes
– All data encrypted within the infrastructure
• Responsible Party:
– Some organization (i.e. service provider) guarantees that your
data is consistent and durable
– Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
– Data may be cached anywhere, anytime
The Peer-To-Peer View:
Irregular Mesh of “Pools”
Key Observation:
Want Automatic Maintenance
• Can’t possibly manage billions of servers by hand!
• System should automatically:
– Adapt to failure
– Exclude malicious elements
– Repair itself
– Incorporate new elements
• System should be secure and private
– Encryption, authentication
• System should preserve data over the long term (accessible
for 1000 years):
– Geographic distribution of information
– New servers added from time to time
– Old servers removed from time to time
– Everything just works
Attack Resistant P2P
• Content can be compromised by:
– Attack by malicious agents
– Censorship
– Faulty nodes
• Remember:
– Nodes have finite resources
Gnutella
[Figure: a query floods from peer to peer across the overlay]
Morpheus/Kazaa
[Figure: ordinary peers attach to super peers, which handle queries on their behalf]
Examples
• Napster shut down by attacks on central server
• Gnutella spammed by Flatplanet
• Removal of a few peers shatters Gnutella
– 63 out of 1,800 peers in the figures
Performance
After deletion of 2/3 of peers, 99% of remainder can still access
99% of the data items
DRN design [Jared Saia]
• Topology based upon butterfly network
(constant degree version of hypercube)
• Each vertex of butterfly called a supernode
• Each supernode represents a set of peers
• Each peer is in multiple supernodes
DRN Topology
N peers, n supernodes
Each peer participates in C·log n randomly chosen supernodes
Supernode X connected to supernode Y means all nodes in X
connected to all nodes in Y
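The slide gives the membership rule but not the butterfly wiring itself, so the sketch below (assumptions mine, not Saia's code) implements only that rule: each peer joins C·log₂(n) random supernodes, and each supernode-level edge expands into all-to-all peer links. A simple ring of supernodes stands in for the butterfly edge set, which is hypothetical here.

```python
import math
import random
from itertools import product

def build_drn(num_peers, num_supernodes, supernode_edges, C=2, rng=random):
    """Sketch of the DRN membership rule: each peer joins C*log2(n)
    randomly chosen supernodes; a supernode-level edge (X, Y) expands
    into peer-level links between every member of X and every member of Y."""
    per_peer = min(num_supernodes, max(1, int(C * math.log2(num_supernodes))))
    members = {s: set() for s in range(num_supernodes)}
    for peer in range(num_peers):
        for s in rng.sample(range(num_supernodes), per_peer):
            members[s].add(peer)
    peer_links = set()
    for x, y in supernode_edges:
        for a, b in product(members[x], members[y]):
            if a != b:
                peer_links.add((a, b))
    return members, peer_links

# Illustrative only: a ring of supernodes stands in for the butterfly wiring.
n = 16
ring_edges = [(i, (i + 1) % n) for i in range(n)]
members, links = build_drn(num_peers=200, num_supernodes=n,
                           supernode_edges=ring_edges)
print(len(links), "peer-level links")
```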
Conclusion
• P2P systems popular today
– Limewire, Kazaa …
• Existing P2P systems vulnerable and inefficient
• Many challenges ahead:
– Search
– Resource Management
– Security and Privacy
Lots of good research to be done …
Appendix I
Open Problems in P2P Data Sharing
Open Problems in Data
Sharing Peer-To-Peer Systems
Hector Garcia-Molina
ICDT Conference, January 10, 2003
Contributors: Mayank Bawa, Brian Cooper, Arturo Crespo,
Neil Daswani, Prasanna Ganesan, Sergio Marti,
Qi Sun, Beverly Yang and others
P2P Challenges
• Search
• Resource Management
• Security & Privacy
(not independent challenges!)
Search
• Search Options
– Query Expressiveness
– Comprehensiveness
– Topology
– Data Placement
– Message Routing
Comparison
Criteria: Expressiveness, Comprehensiveness, Autonomy, Efficiency, Robustness
Gnutella: power-law topology, arbitrary data placement, flooding message routing
(CAN and others are compared on the next slide)
Content Addressable Network
(CAN)
[Figure: nodes and data items hashed into a shared coordinate space]
A distributed hash table on Internet scales …
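CAN itself partitions a d-dimensional coordinate space among nodes and routes greedily between neighbouring zones; as a much-simplified stand-in (my sketch, not the CAN algorithm), the one-dimensional version below shows the essential contrast with flooding: data placement by hashing, and lookups that go directly to the zone owner.

```python
import hashlib
from bisect import bisect_right

def point(key):
    """Hash a key to a point in [0, 1): the 'coordinate space'."""
    h = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

class TinyDHT:
    """1-D stand-in for a CAN-style DHT: each node owns the zone between
    its point and the next node's point, and lookups go straight to the
    owner instead of being flooded."""

    def __init__(self, node_ids):
        self.points = sorted(point(n) for n in node_ids)
        self.store = {p: {} for p in self.points}

    def owner(self, key):
        p = point(key)
        i = bisect_right(self.points, p) - 1   # zone containing p (wraps around)
        return self.points[i % len(self.points)]

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def get(self, key):
        return self.store[self.owner(key)].get(key)

dht = TinyDHT([f"node{i}" for i in range(8)])
dht.put("origin-of-species.txt", "darwin")
assert dht.get("origin-of-species.txt") == "darwin"
```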
Comparison
Criteria: Expressiveness, Comprehensiveness, Autonomy, Efficiency, Robustness
Gnutella: power-law topology, arbitrary data placement, flooding message routing
CAN: grid topology, hashed data placement, directed message routing
Others?
Challenge: Exploring the Space
[Figure: a design space with autonomy on one axis and efficiency/robustness on the other; Gnutella sits at the high-autonomy end, CAN at the high-efficiency/robustness end, and the SIL model explores the space in between, where a lot of research remains]
Search Index Link (SIL) Model
• Forwarding search link (FSL)
• Non-forwarding search link (NSL)
• Forwarding index link (FIL)
• Non-forwarding index link (NIL)
[Figure: an example SIL graph over nodes A–E, with queries (Q) travelling along search links]
SIL Model
• Forwarding search link (FSL)
• Non-forwarding search link (NSL)
• Forwarding index link (FIL)
• Non-forwarding index link (NIL)
[Figure: a larger example SIL graph over nodes A–H, with queries (Q) travelling along search links]
Super-Peer Network
[Figure: a super-peer network with a core of interconnected super peers and leaf peers A–H attached to it]
SIL Challenges
• Desirable graph properties
• Desirable features
• Dynamic configuration
Example Property: Redundancy
[Figure: three nodes A, B, C where one of the links is redundant]
• Redundancy exists in a SIL graph if a link
can be removed without reducing coverage
Example: Undesirable Feature
[Figure: nodes A–E where A has an index link to B and a search path leads from B back to A]
• One-index cycle: Node A has an index link to
B, and there is a search path from B to A
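The check described on this slide is easy to state as code. The sketch below (my illustration, with an ad-hoc graph representation) finds every one-index cycle: a pair (A, B) where A indexes B and a search path leads from B back to A. For simplicity it treats all search links as forwarding links.

```python
from collections import deque

def has_search_path(search_links, src, dst):
    """BFS over search links only."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in search_links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def one_index_cycles(index_links, search_links):
    """Return every (A, B) where A has an index link to B and a search
    path leads from B back to A: the undesirable feature above."""
    return [(a, b)
            for a, targets in index_links.items()
            for b in targets
            if has_search_path(search_links, b, a)]

# Tiny example: A indexes B, and searches flow B -> C -> A.
index_links = {"A": ["B"]}
search_links = {"B": ["C"], "C": ["A"]}
print(one_index_cycles(index_links, search_links))   # [('A', 'B')]
```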
Avoiding Undesirable Features
• Node D is joining the system:
– what neighbors should it connect to?
– what type of links should it use?
[Figure: node D joining an existing SIL graph of nodes A, B, C, E; which neighbors and link types should it choose?]
Open Problems: Security
• Availability (e.g., coping with DoS attacks)
• Authenticity
• Anonymity
• Access Control (e.g., IP protection, payments, ...)
Authenticity
title: origin of species
author: charles darwin
date: 1859
body: In an island far, far away ...
(authentic?)
More than Just File Integrity
title: origin of species
author: charles darwin
date: 1859
body: In an island far, far away ...
checksum
(authentic?)
More than Fetching One File
[Figure: a query (T = origin, A = darwin, Y = ?, B = ?) returns several versions of the record with conflicting years (1800 vs. 1859) and bodies; which versions are authentic?]
Solutions
• Authenticity function A(doc): T or F
– at expert sites? at all sites?
– can use signatures: an expert signs sig(doc), which users verify
• Voting based
– authentic is what the majority says
• Time based
– e.g., the oldest version (available) is authentic
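As a small illustration of the voting-based option (my sketch, not from the slides): query several peers for the document, identify each returned copy by a digest, and accept the version a strict majority agrees on.

```python
from collections import Counter

def vote_authentic(responses):
    """Voting-based authenticity: given document versions returned by
    several peers (identified here by a content digest), accept the
    version a strict majority agrees on; otherwise refuse to decide."""
    if not responses:
        return None
    version, votes = Counter(responses).most_common(1)[0]
    return version if votes > len(responses) / 2 else None

peers_say = ["sha1:aaa", "sha1:aaa", "sha1:bbb", "sha1:aaa"]
print(vote_authentic(peers_say))   # sha1:aaa
```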
Added Challenge: Efficiency
• Example: current music sharing
– everyone has the authenticity function
– but downloading files is expensive
• Solution: track peer behavior
[Figure: peers labelled “good peer” and “bad peer”]
How to Track Peer Behavior?
• Trust vector [v1, v2, v3, v4] (one entry per peer a, b, c, d)?
• A single value between 0 and 1?
• A pair of values: [total downloads, good downloads]?
Trust Operations
[Figure: peers a–e each hold a trust vector (e.g. a: [1, .9, .5, 0, 0], c: [1, 1, 0, .3, 1], e: [1, 0, 1, 1, .2]); edges carry pairwise trust weights (.9, .5, .3, .3, 1, .2); how should a peer update its own vector from its neighbours'?]
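One common way to answer the "update?" question, in the spirit of reputation systems such as EigenTrust though not necessarily the exact scheme behind these slides, is to blend neighbours' trust vectors weighted by how much you trust each neighbour. The sketch below uses the vector values from the figure; the 0.5 self-weight is an arbitrary illustrative choice.

```python
def update_trust(my_trust, neighbour_vectors, weights):
    """Illustrative trust update: blend neighbours' trust vectors,
    weighted by how much we trust each neighbour (weights sum to 1),
    then mix with our own direct observations."""
    size = len(my_trust)
    blended = [0.0] * size
    for w, vec in zip(weights, neighbour_vectors):
        for i in range(size):
            blended[i] += w * vec[i]
    return [0.5 * mine + 0.5 * b for mine, b in zip(my_trust, blended)]

mine = [1.0, 0.9, 0.5, 0.0, 0.0]
others = [[1.0, 1.0, 0.0, 0.3, 1.0],
          [1.0, 0.0, 1.0, 1.0, 0.2]]
print(update_trust(mine, others, weights=[0.7, 0.3]))
```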
Issues
• Trust computations in a dynamic system
• Overloading good nodes
• Bad nodes provide good content sometimes
• Bad nodes can build up reputation
• Bad nodes can form collectives
• ...
Sample Results
[Figure: simulation results plotted against the fraction of malicious peers]
P2P Challenges
• Search
• Resource Management
• Security and Privacy
Resource Management
[Figure: three peers with capacities C1, C2, C3]
• Local work: ρ · Ci
• Remote work: (1 − ρ) · Ci
(ρ: the fraction of its capacity a peer devotes to local work)
Incentives for Remote Work
• What is the best value for ρ?
• How do I get remote nodes to work for me?
Local work: ρ · Ci
Remote work: (1 − ρ) · Ci
[Figure: three peers with capacities C1, C2, C3]
Conclusion
• P2P systems popular today
– Limewire, Kazaa …
• Existing P2P systems vulnerable and inefficient
• Many challenges ahead:
– Search
– Resource Management
– Security and Privacy
Lots of good research to be done …
For Additional Information
• Google:
– “Stanford Peers”, OceanStore, Tapestry, Chord
• http://www-db.stanford.edu/peers/
• http://www.freehaven.net/
• http://cs1.cs.nyu.edu/~waldman/publius/
• http://www.onionnetworks.com
Appendix II
P2P Architectures
Peer-to-Peer is Not Always
Decentralized
…when Centralization is Good
Nelson Minar
<nelson@monkey.org>
http://www.media.mit.edu/~nelson/
Talk Overview
• Topologies of distributed systems
• Strengths and weaknesses
• Conclusions
Warning: Broad generalizations ahead
What is P2P Anyway?
• Decentralized Systems: no
– Popular Power fails test
– Napster fails test
– Most Instant Messaging fails test
– Confuses topology with function
• Edge Resources: yes
– Small computers on edges contribute back
– All peers are active participants
Distributed Systems Topologies
• Get away from fundamentalism
– “Pure P2P”, “True P2P”, etc
• Focus instead on system architecture
– How do the pieces fit together?
• Concentrate on connection topology
• Which topology for which problem?
Centralized
• Client/server
• Web servers
• Databases
• Napster search
• Instant Messaging
• Popular Power
Ring
• Fail-over clusters
• Simple load balancing
• Assumption
– Single owner
Hierarchical
• DNS
• NTP
• Usenet (sort of)
Decentralized
• Gnutella
• Freenet
• Hive
• Internet routing
Centralized + Centralized
• N-tier apps
• Database heavy systems
• Web services gateways
• Grand Central
Centralized + Ring
• Serious web
applications
• High availability servers
Centralized + Decentralized
• Clip2 Gnutella Reflector
• FastTrack
– KaZaA
– Morpheus
• Email
What about other topologies?
• Centralized + Hierarchical?
– Back end tree of information
– Caching architectures
• Decentralized + Ring?
– P2P network of fail-over clusters
• Decentralized + Hierarchical?
• Decentralized + Centralized?
Strengths and Weaknesses
• Plenty of topologies to choose from
• What is each kind good for?
• Need a set of properties to measure
• Caution: What follows is very high level
Things to Measure
• Manageability
– How hard is it to keep working?
• Information coherence
– How authoritative is info? (Auditing, non-repudiation)
• Extensibility
– How easy is it to grow?
• Fault tolerance
– How well can it handle failures?
• Security
– How hard is it to subvert?
• Resistance to legal or political intervention
– How hard is it to shut down? (Can be good or bad)
• Scalability
– How big can it grow?
Centralized
Manageable: ✓ – System is all in one place
Coherent: ✓ – All information is in one place
Extensible: X – No one can add on to system
Fault Tolerant: X – Single point of failure
Secure: ✓ – Simply secure one host
Lawsuit-proof: X – Easy to shut down
Scalable: ? – One machine. But in practice?
Ring
Manageable: ✓ – Simple rules for relationships
Coherent: ✓ – Easy logic for state
Extensible: X – Only ring owner can add
Fault Tolerant: ✓ – Fail-over to next host
Secure: ✓ – As long as ring has one owner
Lawsuit-proof: X – Shut down owner
Scalable: ✓ – Just add more hosts
Hierarchical
Manageable: ½ – Chain of authority
Coherent: ½ – Cache consistency
Extensible: ½ – Add more leaves, rebalance
Fault Tolerant: ½ – Root is vulnerable
Secure: X – Too easy to spoof links
Lawsuit-proof: X – Just shut down the root
Scalable: ✓ – Hugely scalable – DNS
Decentralized
Manageable: X – Very difficult, many owners
Coherent: X – Difficult, unreliable peers
Extensible: ✓ – Anyone can join in!
Fault Tolerant: ✓ – Redundancy
Secure: X – Difficult, open research
Lawsuit-proof: ✓ – No one to sue! (…but follow the $)
Scalable: ? – Theory – yes; Practice – no
Centralized + Ring
Manageable: ✓ – Just manage the ring
Coherent: ✓ – As coherent as ring
Extensible: X – No more than ring
Fault Tolerant: ✓ – Ring is a huge win
Secure: ✓ – As secure as ring
Lawsuit-proof: X – Still single place to shut down
Scalable: ✓ – Ring is a huge win
Common architecture for web applications
Centralized + Decentralized
Manageable: X – Same as decentralized
Coherent: ½ – Better than decentralized
Extensible: ✓ – Anyone can still join!
Fault Tolerant: ✓ – Plenty of redundancy
Secure: X – Same as decentralized
Lawsuit-proof: ✓ – Still no one to sue
Scalable: ? – Looking very hopeful
Best architecture for P2P networks?
Centralized vs. Decentralized
• Centralized is pretty good!
– Manageable
– Coherent
– Security
• Decentralized is exciting
– Extensible
– Massive fault tolerance
– Lawsuit-proof
• Scalability is the big question
Conclusions
• Centralized is easy to deal with
– Major architecture for distributed systems
– Combines well with rings
• Decentralized is good, needs research
– Coherence, Manageability, Security
– Scalability
• Hierarchical is overlooked
• Combining architectures is powerful
Peer-to-Peer is Not Always
Decentralized
…when Centralization is Good
Nelson Minar
<nelson@monkey.org>
http://www.media.mit.edu/~nelson/
Thanks to Marc Hedlund, Raffi Krikorian, Tony White
Appendix III
P2P Industry
P2P Industry Outline
“There’s no peer-to-peer market any more than there’s a
client/server market” – Anne Manes, Sun Microsystems
• Peer-to-peer encompasses a wide range of
technologies centered around decentralizing
computing
• Business and revenue models are still fuzzy
• There are clear opportunities and research
excitement
Distribution of P2P Companies
Category                               Examples                    Industry Share
Distributed Computing                  Entropia, United Devices    35%
Collaboration / Knowledge Management   Groove Networks, Engenia    20%
Content Distribution                   Akamai, Proksim             10%
Infrastructure / Platform              Akavi, Xdegrees             10%
File Sharing                           Kazaa, Napster              10%
Distributed Search                     OpenCola, Thinkstream       5%
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)
Major Features of P2P Industry
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)
• Lack of experienced, quality management teams
• Lack of detailed business models
• Skeptical investors
• 150+ active companies
• Estimated 95% failure rate
“The elephant in the room is the fact that most companies here
are not commercially viable.” - Heard from a speaker at O’Reilly
Current P2P Business Models
• Sell P2P products to end-users
– No current revenue-generating business model
– Sometimes coupled with content-sale models
• Sell content through P2P
– Subscription-based – I buy content from you
– Sponsor-based – Someone pays you to give me content
– Ad-based – You give me content and sell ads
Current P2P Business Models
• Sell something which lets others profit from P2P
– Solve a critical problem for decentralized applications
– Offer support and enhanced services for free tools
– Specialized packages for particular industries
– Tools and libraries for P2P infrastructure
“The people most likely to make money during a Gold
Rush are the ones selling pickaxes and shovels.”
Andy Oram, The O’Reilly Network