A Portal-based P2P System for the
Distribution and Management of Large
Data Sets
Rahim Lakhoo (Raz) and
Prof Mark Baker
ACET, University of Reading
E-mail: r.n.lakhoo@rdg.ac.uk
Web: http://acet.rdg.ac.uk/~rnl
May, 07
r.n.lakhoo@rdg.ac.uk
Outline
• Motivation.
• A Portal-based P2P System:
– High-level View,
– Overview,
– Components.
• P2P Simulators:
– Our requirements,
– Simulators investigated,
– Issues,
– Experiences.
• Summary.
• Conclusions.
Motivation
• Sloan Digital Sky Survey (SDSS) - uses a telescope to
take optical images of the sky.
• Scientific projects such as SDSS are producing and
working with very large data sets.
• Current methods for distributing the content involve:
– Physically shipping disk drives,
– Splitting the data and transferring it point-to-point from one
location to another.
• Data sets are growing for projects like SDSS.
– Currently, 5 Tbytes,
– Set to be ~15 Tbytes by the end of the project.
• Storage and bandwidth are costly and limited, and the
data sets will inevitably get larger.
• Managing and maintaining these large data sets is
difficult, and will only become harder over time.
Motivation
• P2P is widely used by the general public to download
multimedia.
• A popular example is BitTorrent.
• Its success stems from its protocol, which makes users
share their bandwidth with other people trying to
download the same file.
• BitTorrent Concepts:
– Files are split into small pieces called ‘chunks’,
– Chunks are seeded (uploaded) by a user,
– Users download a ‘torrent’ file which has information about a
file,
– A user loads the ‘torrent’ into an application which then
downloads chunks from different peers,
– A ‘tracker’ tracks which peers have what chunks.
• Peer-to-Peer (P2P) systems offer a potential way to
manage and distribute data sets.
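The BitTorrent concepts above can be illustrated with a minimal sketch. It is a toy model, not the BitTorrent protocol itself: the peer names, the 4-byte chunk size and the helper functions `split_into_chunks` and `download` are all assumptions made for illustration.

```python
def split_into_chunks(data, chunk_size):
    """Split a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


original = b"0123456789abcdef"
chunks = split_into_chunks(original, 4)

# Each (hypothetical) peer seeds only some chunks: chunk index -> bytes.
peers = {
    "peer-a": {0: chunks[0], 1: chunks[1]},
    "peer-b": {1: chunks[1], 2: chunks[2]},
    "peer-c": {3: chunks[3]},
}

# The tracker records which peers have which chunks, not the data itself.
tracker = {}
for name, held in peers.items():
    for index in held:
        tracker.setdefault(index, []).append(name)


def download(tracker, peers, total_chunks):
    """Fetch every chunk from the first peer the tracker lists for it."""
    pieces = []
    for index in range(total_chunks):
        source = tracker[index][0]
        pieces.append(peers[source][index])
    return b"".join(pieces)


rebuilt = download(tracker, peers, len(chunks))
```

No single peer holds the whole file here, yet the downloader can rebuild it by asking the tracker where each chunk lives.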
High-level View
• Data sets such as SDSS are currently kept in a
storage mechanism, such as a RAID array.
• A bootstrapping service is set up and has access to
the SDSS data.
• The data is split into chunks and distributed to the
Portal P2P services, hosted by different portals.
• Users who access the portal can contribute resources
to help store and distribute the data. These are the
Mini Peers.
• The Portal P2P services populate the Mini Peers with
parts of the data set.
• Any other project partners who want a copy of the
data can join the P2P network and download parts of
the data set from Portal and Mini Peers.
Overview
• Ideas are loosely based around the concepts
of BitTorrent and Freenet.
• The P2P System consists of:
– A distributed registry, which stores information for
the network peers and also provides a tracker,
– A Bootstrapping Service, which splits the data set
into chunks to be distributed by the peers,
– A Portal P2P Service, which provides storage and
management of the data:
• This service also propagates chunks to the Mini Peers.
– Mini Peers, which donate bandwidth and disk space to the
network.
Overview
• The registry (VR) provides the distributed
tracker:
– A tracker helps peers locate other peers with
chunks to download.
• The Bootstrapper initiates the propagation of
the data set to the peers.
• The Portal P2P service manages the Mini
Peers.
• The portal has management and monitoring
tools for the data set.
• All peers volunteer resources to the P2P
network.
The Virtual Registry
• The Virtual Registry (VR) is provided by
Tycho.
• Tycho is a wide-area asynchronous message
passing system with an integrated distributed
registry.
• The VR can store information which can be
searched and retrieved by peers on the
network.
• Tycho uses HTTP/HTTPS and Sockets/SSL for
communications.
• The VR will provide the distributed P2P
tracker service, for finding peers with chunks
to download.
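The idea of a registry that peers can publish to and search can be sketched as follows. This is an in-memory stand-in and does not use or reflect Tycho's actual API; the class name `VirtualRegistrySketch`, its `publish` and `search` methods, and the record fields are all hypothetical.

```python
class VirtualRegistrySketch:
    """In-memory stand-in for a searchable registry (not Tycho's API)."""

    def __init__(self):
        self._records = []

    def publish(self, **attributes):
        """Store one record, e.g. a peer advertising a chunk it holds."""
        self._records.append(dict(attributes))

    def search(self, **criteria):
        """Return every record whose attributes match all the criteria."""
        return [
            record for record in self._records
            if all(record.get(key) == value for key, value in criteria.items())
        ]


registry = VirtualRegistrySketch()
registry.publish(peer="portal-1", chunk=7, data_set="sdss")
registry.publish(peer="mini-42", chunk=7, data_set="sdss")
registry.publish(peer="portal-2", chunk=8, data_set="sdss")

# Tracker-style query: which peers advertise chunk 7?
holders = [record["peer"] for record in registry.search(chunk=7)]
```

A real deployment would distribute these records across mediators; the sketch only shows the publish and lookup pattern a tracker relies on.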
The Virtual Registry
• Tycho has a Service Oriented Architecture
that uses the concept of producers and
consumers.
• In our system, each Tycho mediator has a
consumer and producer, for communications.
• Mediators provide the VR with a distributed
data store, which uses HSQLDB as its
database.
• Local communications are via Sockets/SSL and
wide-area communications via HTTP/HTTPS.
The Bootstrapper
• A bootstrapping service is needed to
populate the Portal P2P service with parts of
the data set.
• This service splits the data set into chunks.
• Each chunk has an associated hash value,
which is stored in the Virtual Registry.
• The bootstrapping service needs access to the
original data set(s).
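Splitting a data set into chunks and recording a hash per chunk might look like the sketch below. The slides do not name a hash algorithm or a metadata layout, so SHA-256, the `build_manifest` helper and the field names are assumptions.

```python
import hashlib


def build_manifest(data, chunk_size):
    """Split data into chunks and describe each one by index, size and hash."""
    manifest = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        manifest.append({
            "index": offset // chunk_size,
            "size": len(chunk),
            # SHA-256 is an assumed choice; the source only says "hash value".
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
    return manifest


# Ten bytes split into 4-byte chunks gives chunks of 4, 4 and 2 bytes.
manifest = build_manifest(b"x" * 10, 4)
```

Entries like these are the kind of per-chunk metadata that could be stored in the registry so that peers can later verify what they download.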
The Bootstrapper
• The bootstrapping service needs to propagate
different chunks to different Portals
concurrently.
• Hash values and metadata about the data set
and chunks are stored in the VR.
• This service is also used if a requested chunk
is not found on the P2P network, due to
chunk corruption. In this case, the missing
chunk needs to be replaced in the P2P system.
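The two behaviours above, concurrent propagation to several portals and falling back to the bootstrapper when a chunk is corrupt, can be sketched as below. The round-robin assignment, the use of threads, SHA-256 and all function and portal names are assumptions for illustration, not the system's actual design.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def assign_round_robin(chunk_count, portals):
    """Give each portal a different subset of chunk indices."""
    assignment = {portal: [] for portal in portals}
    for index in range(chunk_count):
        assignment[portals[index % len(portals)]].append(index)
    return assignment


def push(portal_store, chunks, indices):
    """Copy the assigned chunks into one portal's store."""
    for index in indices:
        portal_store[index] = chunks[index]
    return len(indices)


def fetch_verified(index, network_copy, expected_hash, original_chunks):
    """Use the network copy if its hash matches, else the original data."""
    if (network_copy is not None
            and hashlib.sha256(network_copy).hexdigest() == expected_hash):
        return network_copy, "network"
    return original_chunks[index], "bootstrapper"


chunks = [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee"]
hashes = [hashlib.sha256(chunk).hexdigest() for chunk in chunks]
portals = ["portal-1", "portal-2"]
stores = {portal: {} for portal in portals}

assignment = assign_round_robin(len(chunks), portals)
# One worker per portal, so the portals are populated concurrently.
with ThreadPoolExecutor(max_workers=len(portals)) as pool:
    futures = [pool.submit(push, stores[p], chunks, assignment[p])
               for p in portals]
    pushed = sum(future.result() for future in futures)

good = fetch_verified(1, stores["portal-2"][1], hashes[1], chunks)
repaired = fetch_verified(2, b"cXcc", hashes[2], chunks)
```

Each worker writes only to its own portal's store, so no locking is needed in this toy version.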
The Portal P2P Service
• The Portal P2P service is a plug-in component
for portals.
• This service stores and serves chunks of the
data set to other peers in the network.
• The portal service propagates chunks to the
Mini peers.
• The monitoring and management of the data
set is handled by the portlet tools and the P2P
service.
• The portal service uses Tycho to synchronise
management tools across all portals in the
network.
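One way a portal service could hand chunks to Mini Peers is sketched below. The slides do not describe a placement policy, so the greedy "most free space first" rule, the `propagate` function and the byte figures are assumptions chosen only to show the idea of respecting donated disk space.

```python
def propagate(chunk_sizes, donated_space):
    """Place chunks on Mini Peers without exceeding their donated space.

    chunk_sizes: chunk index -> size in bytes
    donated_space: Mini Peer name -> bytes of disk space donated
    Returns (placement, unplaced): peer -> chunk indices, and leftovers.
    """
    remaining = dict(donated_space)
    placement = {peer: [] for peer in donated_space}
    unplaced = []
    for index in sorted(chunk_sizes):
        size = chunk_sizes[index]
        candidates = [peer for peer, free in remaining.items() if free >= size]
        if not candidates:
            unplaced.append(index)
            continue
        # Prefer the peer with the most free space; ties go to the first listed.
        target = max(candidates, key=lambda peer: remaining[peer])
        placement[target].append(index)
        remaining[target] -= size
    return placement, unplaced


placement, unplaced = propagate(
    {0: 4, 1: 4, 2: 4, 3: 4},
    {"mini-a": 8, "mini-b": 4},
)
```

Chunks that do not fit anywhere stay on the portal, which still holds its own copy.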
The Portal P2P Service
• Each Portal P2P service needs access to a
storage mechanism, for parts of the data set.
• The storage resources provided by the portals
provide space for a copy of the large data
set.
• The Portal P2P service also provides parts of
the data set to other peers in the P2P
network.
• The Portal provides users with an environment
for managing and monitoring the data set
collaboratively between peers.
The Mini Peers
• Mini peers donate bandwidth and storage
space to the network.
• Mini peers will interact with the P2P network
via their Web browser.
• Mini peers will store chunks that are useful
for other peers.
• Mini peers aim to help other peers download
and distribute the data set.
The Mini Peers
[Figure: P2P swarm of Portals and Mini Peers linked via Socket/SSL. Labelled components: Web Browser, P2P Service, Tycho, Storage (chunks).]
The Mini Peers
• Client-side Web browser technologies such as
Ajax and JavaScript, will be used for the Mini
Peer.
• They will utilise the VR to publish parts of the
data set, to share with other peers in the
network.
• Mini Peers will store chunks locally on a user's
machine.
P2P Simulators - Requirements
• We wanted to use a simulator to help test and develop
our P2P system with greater assurance.
• Running the P2P system in a simulator would allow us
to configure scenarios for studying system behaviour.
• Our requirements for a simulator were:
– Have support for customised P2P protocols,
– Provide facilities for hierarchical topologies,
– Provide visualisations,
– Provide reasonably accurate results in terms of ‘real-world’
performance,
– Have good support and documentation,
– Be capable of interfacing with Java.
P2P Simulators
• There are many network simulators; some are
more suited to P2P than others.
• Simulators investigated include:
– NS-2 with NAM,
– PeerSim,
– PlanetSim,
– OMNeT++ and OverSim,
– General Purpose Simulator (GPS),
– AgentJ,
– P2PSim.
Issues
• We shortlisted three simulators:
– General Purpose Simulator (GPS),
– AgentJ,
– OverSim.
• GPS
– Difficult to implement our own protocol as the simulator is
tightly coupled to the BitTorrent protocol,
– Stability issues were seen with larger simulations.
• AgentJ
– Requires a normal Java application,
– Does not support TCP in the simulation environment.
• OverSim
– Java support is limited and restrictive. It is not possible to
implement a whole simulation with the provided Java support.
Experiences
• No simulator completely fulfilled our requirements.
• We could not successfully implement our Portal-based
P2P system in these simulators.
• Some of the simulators are complex and take
extensive time to learn.
• Stability issues were seen with some of the
simulators.
• Code written for a simulation is specific to a
particular simulator. The code cannot be reused in the
later stages of development.
• The time needed to implement our P2P system in a
simulator is not justified by the advantages gained.
Summary
• We are developing a Portal-based P2P system to help the scientific
community to manage, store and distribute large data sets.
• Our Portal-based P2P system introduces the concept of data sets
being collaboratively downloaded and managed.
• The Portal-based P2P system has four main components:
– Virtual Registry,
– Bootstrapping service,
– Portal P2P service,
– Mini peers.
• We attempted to simulate our design with one of the P2P
simulators.
• We investigated and tested several P2P simulators for their
suitability to emulate our design.
• We found that the simulators we studied were inflexible, unstable,
and not easy to use. We would have spent more time fixing
them than implementing and testing our design
on a cluster.
Conclusions
• Distributing and managing large data sets is difficult for
projects such as SDSS.
• P2P simulators are not as useful as first thought.
• We will implement our Portal-based P2P system and test
it on a suitable test bed, i.e. a cluster.
• Once the development of our P2P system has reached a
suitable stage, we may consider systems such as
PlanetLab.
– PlanetLab provides time on a real network with hundreds of nodes,
hosted by academic institutions.
• P2P systems are known to be an efficient way to
distribute files and are becoming increasingly popular.
• Implementation should be at a suitable stage for
preliminary testing in a few months.
References
Tycho: http://acet.rdg.ac.uk/projects/tycho
Further Information: http://acet.rdg.ac.uk/projects/vre/docs.php
Thank you for listening
Questions?