Sector & Sphere
An Introduction to Sector/Sphere
Yunhong Gu
Univ. of Illinois at Chicago and VeryCloud LLC
@CHUG, June 22, 2010
Sector: Distributed File System
Sphere: Simplified Parallel Data Processing Framework
Goal: handling big data on commodity clusters
Open source software, BSD license, written in C++.
Started in 2006; current version is 2.3
http://sector.sf.net
Super-computer model:
Expensive, data IO bottleneck
Sector/Sphere model:
Inexpensive, parallel data IO, data locality
Parallel/Distributed Programming with MPI, etc.:
Flexible and powerful, but application development is very complicated
Sector/Sphere model (cloud model):
Clusters regarded as a single entity to the developer, simplified programming interface.
Limited to certain data parallel applications.
[Diagram: in a single-data-center system, a data provider in the US uploads to one data center, while a data user in the US and a data reader in Asia must move the data again to process or download it.]
Systems for single data centers:
Require additional effort to locate and move data.
[Diagram: in the Sector/Sphere model, data providers in the US and Europe and a data reader in Asia upload to and download from one wide-area Sector/Sphere system.]
Sector/Sphere model:
Supports wide-area data collection and distribution.
A distributed file system designed to work on commodity hardware: racks of computers with internal hard disks and high-speed network connections
File system level fault tolerance via replication
Support wide area networks
Can be used for data collection and distribution
Not POSIX-compatible yet
[Architecture diagram]
Security server: user accounts, data protection, system security
Masters: metadata, scheduling, service provider; communicate with the security server and clients over SSL
Clients: system access tools, application programming interfaces
Slaves: storage and processing; exchange data with clients over UDT (encryption optional)
User accounts, permissions, IP access control lists
Uses independent accounts, but can connect to an existing account database via a simple “driver”, e.g., Linux accounts, LDAP, etc.
Single security server; the system continues to run when the security server is down, but new users cannot log in
Maintain file system metadata
Metadata is a customizable module; currently there are two implementations, one in-memory and one on disk
Authenticate users, slaves, and other masters (via the security server)
Maintain and manage file replication, data IO and data processing requests
Topology aware
Multiple active masters can dynamically join and leave; load balancing between masters
Store Sector files
A Sector file is not split into blocks
One Sector file is stored on the “native” file system (e.g., EXT, XFS, etc.) of one or more slave nodes
Process Sector data
Data is processed on the same storage node, or on the nearest possible storage node
Input and output are Sector files
Sector file system client API
Access Sector files in applications using the C++ API
Sector system tools
File system access tools
FUSE
Mount the Sector file system as a local directory (see the sketch after this list)
Sphere programming API
Develop parallel data processing applications that process Sector data with a set of simple APIs
Clients communicate with slaves directly for data IO, via UDT
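Once Sector is mounted through FUSE, ordinary file I/O is enough to read Sector data. Below is a minimal sketch; the mount point /mnt/sector and the file path are hypothetical, and real applications would use the Sector C++ API or Sphere for processing at scale.

// Minimal sketch: with the Sector file system mounted via FUSE at a
// hypothetical mount point, applications can read Sector files with
// ordinary file I/O, no Sector-specific API required.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Hypothetical FUSE mount point and file name.
    std::ifstream in("/mnt/sector/sdss/image0001.dat", std::ios::binary);
    if (!in) {
        std::cerr << "cannot open file through the Sector mount\n";
        return 1;
    }

    // Read the file in 1 MB chunks, as a placeholder for real processing.
    std::string buf(1 << 20, '\0');
    std::size_t total = 0;
    while (in.read(&buf[0], buf.size()) || in.gcount() > 0)
        total += static_cast<std::size_t>(in.gcount());

    std::cout << "read " << total << " bytes\n";
    return 0;
}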
http://udt.sf.net
Open source, UDP-based data transfer protocol with reliability control and congestion control
Fast, firewall friendly, easy to use
Already used in many commercial and research systems for large data transfer
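As a rough illustration of how the UDT socket-style API is used (the host, port, and payload below are placeholders; error handling is trimmed and the program links against -ludt):

// Minimal UDT client sketch: the API deliberately mirrors BSD sockets.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <cstring>
#include <iostream>
#include <udt.h>

int main()
{
    UDT::startup();                                   // initialize the UDT library

    UDTSOCKET sock = UDT::socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in serv;
    std::memset(&serv, 0, sizeof(serv));
    serv.sin_family = AF_INET;
    serv.sin_port = htons(9000);                      // placeholder port
    inet_pton(AF_INET, "192.0.2.10", &serv.sin_addr); // placeholder host

    if (UDT::connect(sock, (sockaddr*)&serv, sizeof(serv)) == UDT::ERROR) {
        std::cerr << "connect: " << UDT::getlasterror().getErrorMessage() << "\n";
        return 1;
    }

    const char* msg = "hello over UDT";
    UDT::send(sock, msg, (int)std::strlen(msg), 0);   // reliable, congestion-controlled send

    UDT::close(sock);
    UDT::cleanup();
    return 0;
}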
Files are not split into blocks
Users are responsible for using properly sized files
Directory and File Family
Sector will keep related files together during upload and replication
In-memory object
Data parallel applications
Data is processed where it resides, or on the nearest possible node (locality)
The same user-defined function (UDF) is applied to all elements (records, blocks, files, or directories)
Processing output can be written to Sector files or sent back to the client
Transparent load balancing and fault tolerance
Example: find brown dwarfs in SDSS image files.

for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, …);

The same loop expressed with the Sphere client API:

SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc.run(sdss, "findBrownDwarf", …);

UDF signature: findBrownDwarf(char* image, int isize, char* result, int rsize);

[Diagram: the application's Sphere client splits the input stream into segments (n, n+1, …, n+m), locates and schedules SPEs (Sphere Processing Engines) on the slaves, and collects the results into the output stream.]
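For illustration only, a UDF following the simplified signature above might look like the stub below. The "detection" logic, the result encoding, and the return code are placeholders; the slide only specifies the signature, not the full Sphere UDF interface.

// Illustrative stub of a Sphere UDF following the simplified signature
// shown above. The image scan and the result encoding are placeholders.
#include <cstdio>

int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
    int candidates = 0;

    // Placeholder "detection": count pixels above an arbitrary threshold.
    for (int i = 0; i < isize; ++i)
        if (static_cast<unsigned char>(image[i]) > 200)
            ++candidates;

    // Write a small text record into the caller-provided result buffer.
    std::snprintf(result, rsize, "candidates=%d", candidates);

    return 0; // placeholder success code
}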
Sphere output options:
Slave -> Slave Local
Slave -> Slaves (Hash/Buckets): each output record is assigned an ID; all records with the same ID are sent to the same “bucket” file
Slave -> Client
[Diagram: the input stream (segments n, n+1, …, n+m) is processed by a first set of SPEs, whose output records are hashed into bucket files (0, 1, 2, 3, …, b) forming an intermediate stream; a second set of SPEs then processes the buckets into the output stream.]
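The bucketing rule can be pictured with a small helper: hash a record's key to one of num_buckets bucket IDs, so all records with the same key land in the same bucket file. The hash below is illustrative only and not part of the Sphere API.

// Conceptual sketch of the hash/bucket rule: map a record's key to one of
// `num_buckets` bucket IDs. All records that hash to the same ID end up in
// the same bucket file on some slave.
#include <string>

int assignBucket(const std::string& key, int num_buckets)
{
    unsigned long h = 5381;                       // djb2-style string hash
    for (std::size_t i = 0; i < key.size(); ++i)
        h = h * 33 + static_cast<unsigned char>(key[i]);
    return static_cast<int>(h % static_cast<unsigned long>(num_buckets));
}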
A client application
Specify input, output, and name of UDF
Inputs and outputs are usually Sector directories or collections of files
May have multiple rounds of computation if necessary (iterative/combinative processing)
A UDF
A C++ function following the Sphere specification (parameters and return value)
Compiled into a dynamic library
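A sketch of that packaging step, under the assumption that the UDF is exported with C linkage so a slave can look it up by name in the shared library; the g++ command in the comment is a typical way to build a .so, not a quote from the Sector documentation.

// findBrownDwarf.cpp -- illustrative packaging of a Sphere UDF.
// Assumption: the UDF is exported with C linkage so it can be located by
// name after the shared library is loaded on each slave.
//
// Build into a dynamic library (typical g++ invocation, shown as an example):
//   g++ -fPIC -shared -o findBrownDwarf.so findBrownDwarf.cpp

extern "C" int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
    // ... processing logic as sketched earlier ...
    (void)image; (void)isize; (void)result; (void)rsize;
    return 0;
}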
Map = UDF
MapReduce = 2x UDF
The first UDF generates bucket files and the second processes them (see the sketch after this list).
Sphere is more flexible and efficient
UDF can be applied directly on records, blocks, files, and even directories
Supports multiple inputs/outputs with better data locality, including certain legacy applications that process files and directories
Native binary data support w/ permanent index files
Sorting is required by Reduce, but it is optional in Sphere
Output locality allows Sphere to combine multiple operations more efficiently
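To make the "MapReduce = 2x UDF" idea concrete, the self-contained sketch below simulates the two stages with ordinary C++ containers: stage 1 routes word records into buckets by hashing the key, and stage 2 aggregates each bucket independently. This mimics the data flow only; it is not the Sphere API, and the word-count example is made up.

// Local simulation of the two-UDF pattern: partition by key hash, then
// aggregate per bucket.
#include <functional>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stage 1 (map-like UDF): split text into words and route each word to a
// bucket chosen by a hash of the word.
void stage1(const std::string& text, std::vector<std::vector<std::string>>& buckets)
{
    std::istringstream iss(text);
    std::string word;
    std::hash<std::string> h;
    while (iss >> word)
        buckets[h(word) % buckets.size()].push_back(word);
}

// Stage 2 (reduce-like UDF): count the words inside one bucket.
std::map<std::string, int> stage2(const std::vector<std::string>& bucket)
{
    std::map<std::string, int> counts;
    for (const auto& w : bucket)
        ++counts[w];
    return counts;
}

int main()
{
    std::vector<std::vector<std::string>> buckets(4);   // 4 bucket "files"
    stage1("the quick brown fox jumps over the lazy dog the fox", buckets);

    for (std::size_t b = 0; b < buckets.size(); ++b)
        for (const auto& kv : stage2(buckets[b]))
            std::cout << "bucket " << b << ": " << kv.first << " = " << kv.second << "\n";
    return 0;
}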
Terasort: sort 1 TB of data over distributed servers
Malstone: detect malware websites from billions of transactions
Graph processing: analyze very large social networks at billions of vertices (BFS and enumerating cliques)
Genome pipeline: analyze genome sequences
Satellite image processing: compare satellite images taken at different times, for disaster relief
On these benchmarks, Sphere is about 2 to 4 times faster than Hadoop
15 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2)
10 Gb/s inter-site connections on CiscoWave
1-2 Gb/s inter-rack connections
Per node: two dual-core AMD CPUs, 8-16 GB RAM, 1-4 TB RAID-0 disk
[Network diagram: the sites are connected over NLR CiscoWave links through the SAND, LOSA, CHIC, and WASH nodes (VLANs 2151 and 2560).]
San Diego, Calit2 rack (32 nodes): IP 67.58.56.66-97/26
Chicago, StarLight rack (32 nodes): IP 206.220.241.90-121/24
Chicago, UIC rack (32 nodes): IP 192.168.136.5-36/26
Baltimore, JHU rack (32 nodes): IP 192.168.136.70-101/26
Current version 2.3; all core functions are ready, and work continues to improve code quality and details for certain modules.
Partly funded by the NSF (for NCDM/UIC)
Commercial support via VeryCloud LLC
Next step: support column-based data tables (similar to BigTable)
Open source contributors are welcome
Sector Website: http://sector.sourceforge.net
Email: gu@lac.uic.edu