A Portal-based P2P System for the Distribution and Management of Large Data Sets
Rahim Lakhoo (Raz) and Prof Mark Baker
ACET, University of Reading
E-mail: r.n.lakhoo@rdg.ac.uk
Web: http://acet.rdg.ac.uk/~rnl
May, 07 r.n.lakhoo@rdg.ac.uk

Outline
• Motivation.
• A Portal-based P2P System:
  – High-level View,
  – Overview,
  – Components.
• P2P Simulators:
  – Our requirements,
  – Simulators investigated,
  – Issues,
  – Experiences.
• Summary.
• Conclusions.

Motivation
• Sloan Digital Sky Survey (SDSS) - uses a telescope to take optical images of the sky.
• Scientific projects such as SDSS are producing and working with very large data sets.
• Current methods for distributing the content involve:
  – Physically shipping disk drives,
  – Splitting the data and transferring it point-to-point from one location to another.
• Data sets are growing for projects like SDSS:
  – Currently, 5 Tbytes,
  – Set to be ~15 Tbytes by the end of the project.
• Storage and bandwidth are costly and limited, and the data sets will inevitably get larger.
• Managing and maintaining these large data sets is difficult and will only become harder over time.

Motivation
• P2P is widely used by the general public to download multimedia.
• A popular example is BitTorrent.
• Its success stems from its protocol, which makes users share their bandwidth with other people trying to download the same file.
• BitTorrent Concepts:
  – Files are split into small pieces called ‘chunks’,
  – Chunks are seeded (uploaded) by a user,
  – Users download a ‘torrent’ file, which has information about a file,
  – A user loads the ‘torrent’ into an application, which then downloads chunks from different peers,
  – A ‘tracker’ tracks which peers have which chunks.
• Peer-to-Peer (P2P) systems offer a potential way to manage and distribute data sets.

High-level View
• Data sets such as SDSS are currently kept in a storage mechanism, such as a RAID array.
• A bootstrapping service is set up and has access to the SDSS data.
• The data is split into chunks and distributed to the Portal P2P services, hosted by different portals.
• Users who access the portal can contribute resources to help store and distribute the data. These are the Mini Peers.
• The Portal P2P services propagate parts of the data set to the Mini Peers.
• Any other project partners who want a copy of the data can join the P2P network and download parts of the data set from Portal and Mini Peers.

Overview
• Ideas are loosely based around the concepts of BitTorrent and Freenet.
• The P2P System consists of:
  – A distributed registry, which stores information about the network peers and also provides a tracker,
  – A Bootstrapping Service, which splits the data set into chunks to be distributed by the peers,
  – A Portal P2P Service, which provides storage and management of the data:
    • This service also propagates chunks to the Mini Peers.
  – Mini Peers, which donate bandwidth and disk space to the network.

Overview
• The Virtual Registry (VR) provides the distributed tracker:
  – A tracker helps peers locate other peers with chunks to download.
• The Bootstrapper initiates the propagation of the data set to the peers.
• The Portal P2P service manages the Mini Peers.
• The portal has management and monitoring tools for the data set.
• All peers volunteer resources to the P2P network.

The Virtual Registry
• The Virtual Registry (VR) is provided by Tycho.
• Tycho is a wide-area asynchronous message passing system with an integrated distributed registry.
• The VR can store information which can be searched and retrieved by peers on the network.
• Tycho uses HTTP/HTTPS and Sockets/SSL for communications.
• The VR will provide the distributed P2P tracker service, for finding peers with chunks to download.
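The chunking and tracker roles described so far can be illustrated with a minimal sketch. It is not the Tycho VR API or the authors' implementation: the names `split_into_chunks`, `chunk_id` and `Tracker` are hypothetical, the tracker is a plain in-memory dictionary rather than a distributed registry, and the use of SHA-256 and a 4-byte chunk size are assumptions chosen only to keep the example small.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split a data set into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (the hash choice is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


class Tracker:
    """Toy in-memory stand-in for the tracker role of the registry."""

    def __init__(self) -> None:
        # Maps a chunk identifier to the set of peers that hold that chunk.
        self._holders: Dict[str, Set[str]] = defaultdict(set)

    def announce(self, peer: str, cid: str) -> None:
        """Record that `peer` holds the chunk identified by `cid`."""
        self._holders[cid].add(peer)

    def peers_for(self, cid: str) -> Set[str]:
        """Return a copy of the set of peers known to hold the chunk."""
        return set(self._holders.get(cid, set()))


# Hypothetical data set and peer names, for illustration only.
data = b"0123456789"
chunks = split_into_chunks(data)
ids = [chunk_id(c) for c in chunks]

tracker = Tracker()
tracker.announce("portal-a", ids[0])
tracker.announce("mini-1", ids[0])
tracker.announce("portal-b", ids[1])
```

A peer wanting the first chunk would call `tracker.peers_for(ids[0])` and then fetch the chunk from any of the returned peers, checking the downloaded bytes against the identifier before accepting them.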
The Virtual Registry
• Tycho has a Service Oriented Architecture that uses the concept of producers and consumers.
• In our system, each Tycho mediator has a consumer and a producer for communications.
• Mediators provide the VR with a distributed data store, which uses HSQLDB as its database.
• Local communications are via Sockets/SSL and wide-area communications are via HTTP/HTTPS.

The Bootstrapper
• A bootstrapping service is needed to propagate parts of the data set to the Portal P2P services.
• This service splits the data set into chunks.
• Each chunk has an associated hash value, which is stored in the Virtual Registry.
• The bootstrapping service needs access to the original data set(s).

The Bootstrapper
• The bootstrapping service needs to propagate different chunks to different Portals concurrently.
• Hash values and metadata about the data set and chunks are stored in the VR.
• This service is also used if a requested chunk cannot be found on the P2P network due to chunk corruption. In this case, the missing chunk needs to be replaced in the P2P system.

The Portal P2P Service
• The Portal P2P service is a plug-in component for portals.
• This service stores and serves chunks of the data set to other peers in the network.
• The portal service propagates chunks to the Mini Peers.
• The monitoring and management of the data set is handled by the portlet tools and the P2P service.
• The portal service uses Tycho to synchronise management tools across all portals in the network.

The Portal P2P Service
• Each Portal P2P service needs access to a storage mechanism for parts of the data set.
• The storage resources provided by the portals provide space for a copy of the large data set.
• The Portal P2P service also provides parts of the data set to other peers in the P2P network.
• The Portal provides users with an environment for managing and monitoring the data set collaboratively between peers.

The Mini Peers
• Mini Peers donate bandwidth and storage space to the network.
• Mini Peers will interact with the P2P network via their Web browser.
• Mini Peers will store chunks that are useful for other peers.
• Mini Peers aim to help other peers download and distribute the data set.

The Mini Peers
[Figure: Portals and Mini Peers connected over Socket/SSL links in a P2P swarm. Labels: Web Browser, P2P Service, Tycho, Storage (chunks).]

The Mini Peers
• Client-side Web browser technologies such as Ajax and JavaScript will be used for the Mini Peer.
• They will utilise the VR to publish parts of the data set, to share with other peers in the network.
• Mini Peers will store chunks locally on a user's machine.

P2P Simulators - Requirements
• We wanted to use a simulator to help test and develop our P2P system with greater assurance.
• Running the P2P system in a simulator would allow us to configure scenarios for studying system behaviour.
• Our requirements for a simulator were:
  – Have support for customised P2P protocols,
  – Provide facilities for hierarchical topologies,
  – Provide visualisations,
  – Provide reasonably accurate results in terms of ‘real-world’ performance,
  – Have good support and documentation,
  – Be capable of interfacing with Java.

P2P Simulators
• There are many network simulators; some are better suited to P2P than others.
• Simulators investigated include:
  – NS-2 with NAM,
  – PeerSim,
  – PlanetSim,
  – OMNeT++ and OverSim,
  – General Purpose Simulator (GPS),
  – AgentJ,
  – P2PSim.

Issues
• We shortlisted three simulators:
  – General Purpose Simulator (GPS),
  – AgentJ,
  – OverSim.
• GPS
  – Difficult to implement our own protocol, as the simulator is tightly coupled to the BitTorrent protocol,
  – Stability issues were seen with larger simulations.
• AgentJ
  – Requires a normal Java application,
  – Does not support TCP in the simulation environment.
• OverSim
  – Java support is limited and restrictive. It is not possible to implement a whole simulation with the provided Java support.

Experiences
• No simulator completely fulfilled our requirements.
• We could not successfully implement our Portal-based P2P system in these simulators.
• Some of the simulators are complex and take extensive time to learn.
• Stability issues were seen with some of the simulators.
• Code written for a simulation is specific to a particular simulator. The code cannot be reused in the later stages of development.
• The time needed to implement our P2P system in a simulator is not justified by the advantages gained.

Summary
• We are developing a Portal-based P2P system to help the scientific community to manage, store and distribute large data sets.
• Our Portal-based P2P system introduces the concept of data sets being collaboratively downloaded and managed.
• The Portal-based P2P system has four main components:
  – Virtual Registry,
  – Bootstrapping service,
  – Portal P2P service,
  – Mini Peers.
• We attempted to simulate our design and idea with one of the P2P simulators.
• We have investigated and tested several P2P simulators for their suitability to emulate our design.
• We found that the simulators we studied were inflexible, unstable and not easy to use. We would have spent more time fixing them than implementing and testing our design on a cluster.

Conclusions
• Distributing and managing large data sets is difficult for projects such as SDSS.
• P2P simulators are not as useful as first thought.
• We will implement our Portal-based P2P system and test it on a suitable test bed, i.e. a cluster.
• Once the development of our P2P system has reached a suitable stage, we may consider systems such as PlanetLab.
  – PlanetLab provides time on a real network with hundreds of nodes, hosted by academic institutions.
• P2P systems are known to be an efficient way to distribute files and are becoming increasingly popular.
• Implementation should be at a suitable stage for preliminary testing in a few months.

References
Tycho: http://acet.rdg.ac.uk/projects/tycho
Further Information: http://acet.rdg.ac.uk/projects/vre/docs.php

Thank you for listening
Questions?