sibdata.ppt

The Sibdata Revolution
Nick Roussopoulos
DCS & UMIACS
Univ. of Maryland
September 2009
Data Management: Past to Current
• Structured Data
Emp
• ename: Gary, Shirley, Christos, Robin, Uma, Tim
• sal: 30K, 35K, 37K, 22K, 30K, 12K
• dept: toy, candy, shoe, toy, shoe
Dept
• dept: candy, toy, men, shoe
• floor: 1, 2, 2, 1
• mgr: Irene, Jim, John, George
• Structured architectures
[Figure: CLIENTS, Processors, Memory]
Data Management: Huh???
The Landscape
Bell’s Law: Every decade, a new, lower-cost
class of computers emerges, defined
by platform, interface, and interconnect
• Mainframes 1960s
• Minicomputers 1970s
• Microcomputers/PCs 1980s
• Web-based computing 1990s
• Devices (smart phones, PDAs, wireless sensors,
RFID) 2000s
Enabling a new generation of applications that
mandate new data management methods & tools.
Data Then and Now
• The “Data Industrial Revolution”:
Data used to be “hand-crafted”,
now it’s generated by computers!!!
• The Data Integration quagmire: 40
years of continuous successes (sic)
and still a long way to the end.
• Structure provides crucial
understanding for making data
usable and leads to
discovery/innovation.
Data Streaming → Data Explosion
• Exponential data growth
• New challenges: continuous, interconnected, distributed, physical
• Shrinking business cycles
• More complex decisions
[Figure: data sources: PoS System, Barcodes, Phones, RFID, Sensors, Inventory, Transactional Systems, Clickstream, Telematics]
The Structure Spectrum
• Structured data (schema-first)
• regular, known, conforming, …
• e.g., Relational database
• Unstructured data (schema-never)
• freeform, irregular, …
• e.g., plain text, images, audio, …
• Semi-structured data (schema-later)
• Provides structural information, but less
constrained. e.g., XML, tagged text/media
Data Integration
• Integration is the ultimate schema-first
problem.
• Requires complete understanding &
disambiguation
• Structure (semantics) is both a key
enabler and a key impediment here.
Structured Data: How Much?
• Conventional Wisdom: ~20% of data is
structured currently.
• Consumer apps, enterprise search,
multimedia apps are placing downward
pressure on this.
State of the Art:
Integration-in-the-large
• Team work, huge & expensive effort,
excruciating pain
• Extremely long time lag between data
generation and availability
• Custom-coded implementations that are often
unsuccessful
• Clearing house of already discovered knowledge
(the high overhead is for disambiguating the
semantics of the heterogeneous data)
Future:
Integration-in-the-small
• End-user, limited in scope, requires training
• Continuous as the data sources and equipment
evolve
• End-user tools are needed
• Small cost, enormous opportunity for discovery
and innovation
Sibling Data
• Aggregation and naming of disparate data
regardless of location
• Includes actual data, references to external
data, queries that generate data, & programs
to process data
• May include other sibdata
• Open vs Closed
• Open: continuous accumulation
• Closed: fixed snapshot (archival)
• Location Independent semantics
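The definition above can be pictured as a small container type. The following is only a minimal sketch under stated assumptions: the class names `Sibdata`, `Reference`, and `Query`, and the `flatten` helper, are hypothetical and are not part of any sibdata specification in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Pointer to external data (hypothetical: just a URL string)."""
    url: str


@dataclass
class Query:
    """A stored query that generates data when executed."""
    text: str


@dataclass
class Sibdata:
    """Named aggregate of data, references, queries, programs, and other sibdata."""
    name: str
    is_open: bool = True                 # open: keeps accumulating; closed: fixed snapshot
    members: list = field(default_factory=list)

    def add(self, member):
        if not self.is_open:
            raise ValueError(f"sibdata {self.name!r} is closed")
        self.members.append(member)

    def close(self):
        # Closing this sibdata does not close nested sibdata (a choice of this sketch).
        self.is_open = False

    def flatten(self):
        """Yield every non-sibdata member, descending into nested sibdata."""
        for member in self.members:
            if isinstance(member, Sibdata):
                yield from member.flatten()
            else:
                yield member


links = Sibdata("moore-links")
links.add(Reference("http://www.michaelmoore.com/"))

moore = Sibdata("moore")
moore.add(Query("SELECT m.title FROM Yahoo_Movies m WHERE m.title LIKE '%Moore%'"))
moore.add(links)       # a sibdata may include other sibdata
moore.close()          # archival snapshot: no further additions
```

Programs could be stored the same way as any other member (for example as Python callables); the sketch leaves that open.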
Web search results
Content vs URL
• Content 
• http://www.michaelmoore.com/
Deep-Web Queries
SELECT m.title
FROM Yahoo_Movies m
WHERE m.title LIKE '%Moore%';
Result vs. Query
• Results are associated with the time the
query was run
• Queries can be captured in sibdata and
executed at will; such a sibdata is open
and may capture a different result each
time the query executes
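The difference between a captured result and a captured query can be shown with a small in-memory database. This is an illustrative sketch only: the `movies` table and its placeholder titles are assumptions, standing in for a deep-web source such as the Yahoo_Movies relation used earlier.

```python
import sqlite3

# Hypothetical stand-in for a remote source; table and titles are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.execute("INSERT INTO movies VALUES ('Moore film A')")

STORED_QUERY = "SELECT title FROM movies WHERE title LIKE '%Moore%' ORDER BY title"


def run(query):
    """Execute the stored query and return the titles it produces right now."""
    return [row[0] for row in conn.execute(query)]


snapshot = run(STORED_QUERY)    # a result: tied to the time the query was run

conn.execute("INSERT INTO movies VALUES ('Moore film B')")   # the source evolves

fresh = run(STORED_QUERY)       # the same stored query, executed again later
```

Here `snapshot` corresponds to what a closed sibdata would hold, while keeping `STORED_QUERY` itself corresponds to an open sibdata whose contents follow the source.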
Queries to Relational Databases
Yahoo_Actors
Sibdata
• Deal with all the data from everywhere &
in whatever form they come
• Data co-existence: no integrated schema,
no single warehouse
• Expand-as-you-go
• Integrate little by little as you need
• ETL Data mapping-integrating as you add
more data
Sibdata Properties
• Lightweight: metadata captures the encapsulation, name, and
provenance data
• Location-independent: accessible from anywhere
• Isolated: generated with no interference
• Durable: persists until dropped
• Secure: guarantees security defined by the creators and sources;
composes multiple levels of security to its components
Comparison to Transactions
• Transactions
• grouping of many actions into an atomic
transaction- ACID properties
• Substrate: database
• Sibdata
• Grouping of data into an atomic sibdata –
LLADS
• Substrate: actions/transactions/data
generators
Sibdata Infrastructure
Sibdata Servers
• Establishes a global sibdata ID and name
• Creates and maintains metadata with
provenance, users, security, etc.
• Provides a searchable catalog
• Provides storage for non-sib-compliant
data sources
• Fault tolerance (replication)
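A rough sketch of the first three server duties (global ID, metadata, searchable catalog) might look like the following. Everything here is an assumption made for illustration: the `SibCatalog` class, its method names, the use of UUIDs as global IDs, and the metadata fields. Storage and replication are not modeled.

```python
import uuid
from datetime import datetime, timezone


class SibCatalog:
    """Hypothetical in-memory catalog kept by a sibdata server."""

    def __init__(self):
        self._entries = {}      # global sibdata ID -> metadata record

    def register(self, name, creator, sources=()):
        """Assign a global ID and record name and provenance metadata."""
        sib_id = str(uuid.uuid4())
        self._entries[sib_id] = {
            "name": name,
            "creator": creator,
            "sources": list(sources),
            "created": datetime.now(timezone.utc),
        }
        return sib_id

    def metadata(self, sib_id):
        return self._entries[sib_id]

    def search(self, text):
        """Return the IDs of all sibdata whose name contains the text."""
        text = text.lower()
        return sorted(
            sib_id
            for sib_id, meta in self._entries.items()
            if text in meta["name"].lower()
        )


catalog = SibCatalog()
moore_id = catalog.register("Moore movies", creator="nick",
                            sources=["http://www.michaelmoore.com/"])
dept_id = catalog.register("Dept floor plan", creator="nick")
```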
Sib Protocols
• Establish Sibdata protocol
• Concurrency-Consistency issues (?)
• Sharing of data
• Name conventions
• Dispute resolution
• Distributed Logging
• Security Using chits
• Group and multi-valued ownership and
visibility
User Interface
• Simple OS support
• Query Languages
• Graphical Languages
• ETL tools
• Extra functionality: high-dimensional indexing, mining
Conclusions
• Need to build Sib Infrastructure
• Refine the sibdata semantics
• Refine the security protocols
• For data aggregates
• User groups
• Great opportunities for innovation
Presentations & Project
• 3 X 7 students = 21 presentations ~2 per lecture
• Lecture dates
• Sep: 15, 22, 29
• Oct: 6, 13, 20, 27
• Nov: 3, 10, 17, 24
• Dec: 1, 8
• Project: Proposal due Sep 29
• Discussion: At every lecture, be prepared to give a
2-3 min progress report (papers found, etc.)
Network Data Independence
Hellerstein, Berkeley
• Physical Data Independence
• Decoupling data from layout (no hard-coded
applications)
• Permits reorganization of data w/o affecting the apps
• Declarative query languages
• Using the schema
• Distributed Databases
• Transparency hides location from the user, who acts as
if he is accessing a centralized database
• Limited to a fixed set of sites: cannot accommodate mobility
and constant change of the configuration
Pillars of Data Independence
• Indexes- offer indirection allowing
modification of the underlying structure
[Figure: index over table R pointing into an occurrence file]
• Schema based and declarative query
languages & optimization
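The indirection that an index provides can be sketched in a few lines. This is an illustrative toy, not a description of any particular system: the `IndexedTable` class is hypothetical, keys are assumed unique, and the "physical layout" is just a Python list.

```python
class IndexedTable:
    """Toy table whose callers go through an index, never through row positions."""

    def __init__(self):
        self._rows = []      # physical storage: (key, value) pairs in arrival order
        self._index = {}     # key -> position in _rows (keys assumed unique)

    def insert(self, key, value):
        self._index[key] = len(self._rows)
        self._rows.append((key, value))

    def get(self, key):
        # Callers name the key; the index resolves it to a physical position.
        return self._rows[self._index[key]][1]

    def reorganize(self):
        """Change the physical layout (sort by key) and rebuild the index."""
        self._rows.sort(key=lambda pair: pair[0])
        self._index = {key: pos for pos, (key, _) in enumerate(self._rows)}


table = IndexedTable()
for key, value in [(5, "e"), (1, "a"), (9, "i")]:
    table.insert(key, value)

before = [table.get(k) for k in (1, 5, 9)]
table.reorganize()                         # rows physically move
after = [table.get(k) for k in (1, 5, 9)]  # lookups are unaffected
```

Because applications only ever call `get`, the reorganization is invisible to them, which is the point of the indirection.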
Sibdata Independence
• Encapsulation of dissimilar data
• Data can be moved, rearranged, altered
• Additional indices on top of sibdata become
part of the sibdata
• Naming and provenance data are fixed
• Do not change to the outside world
• Containment information (sibdata
encapsulation within other sibdata) is
guaranteed
DHT (Chord)
• Data centric distribution
• according to content- total data
independence
• very large number of distributed servers
• Configuration changes rapidly (although this
may not be really that important)
• Fault-tolerance (extra machines)
• Limited to single-key searches (not range or
join queries)
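The content-based placement and the single-key limitation can be seen in a minimal sketch of Chord-style key assignment, where each key is stored at its successor node on an identifier ring. The sketch makes several simplifying assumptions: a 32-bit identifier space, a `Ring` class that knows every node, and no finger tables or multi-hop routing.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 32   # assumption for this sketch; real deployments use a larger space


def ident(name: str) -> int:
    """Hash a node name or key onto the identifier ring."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


class Ring:
    """Global view of the ring (a real node only knows a few other nodes)."""

    def __init__(self, nodes):
        self.ids = sorted((ident(n), n) for n in nodes)

    def successor(self, key: str) -> str:
        """Return the node responsible for key: first node ID >= hash(key)."""
        pos = bisect_left(self.ids, (ident(key), ""))
        return self.ids[pos % len(self.ids)][1]    # wrap around the ring


nodes = [f"node{i}" for i in range(8)]
ring = Ring(nodes)
keys = [f"key{i}" for i in range(200)]
placement = {k: ring.successor(k) for k in keys}

# When a node leaves, only the keys it held move to another node.
smaller = Ring(nodes[:-1])
moved = [k for k in keys if smaller.successor(k) != placement[k]]
```

Because placement depends only on the hash of the exact key, adjacent key values land on unrelated nodes, which is why range and join queries need extra machinery.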
Network Names & Services
• Internet Indirection Infrastructure (i3)
• Triggers (id,r) where id = global ID and r is an address
to forward packets
• When a mobile user moves to r’, he modifies his
trigger to (id,r’)
• It also supports 1-to-n mappings (anycast)
• Content Distribution Networks (Akamai)
• Replicates heavy data (images, videos) to multiple
sites and redirects user accesses to those that are
closer (indirection via location independence)
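The i3 trigger mechanism described above can be sketched as a table from IDs to forwarding addresses. This is a toy model under stated assumptions: the `TriggerTable` class and its method names are hypothetical, the table lives in one process rather than on an overlay, and a packet is delivered to every address registered under the ID.

```python
class TriggerTable:
    """Toy rendezvous point holding (id, r) triggers."""

    def __init__(self):
        self._triggers = {}     # id -> set of forwarding addresses

    def insert_trigger(self, ident, addr):
        self._triggers.setdefault(ident, set()).add(addr)

    def remove_trigger(self, ident, addr):
        self._triggers.get(ident, set()).discard(addr)

    def move(self, ident, old_addr, new_addr):
        """A mobile receiver replaces (id, r) with (id, r')."""
        self.remove_trigger(ident, old_addr)
        self.insert_trigger(ident, new_addr)

    def forward(self, ident, packet):
        """Return (address, packet) pairs for every trigger matching the ID."""
        return [(addr, packet) for addr in sorted(self._triggers.get(ident, ()))]


table = TriggerTable()
table.insert_trigger("id42", "10.0.0.1")
first = table.forward("id42", "hello")       # delivered to the original address

table.move("id42", "10.0.0.1", "10.0.0.9")   # receiver moved; senders are unaffected
second = table.forward("id42", "hello")

table.insert_trigger("id42", "10.0.0.5")     # a second receiver: 1-to-n mapping
third = table.forward("id42", "hello")
```

Senders address packets to `id42` throughout and never learn the receiver's address, which is the indirection the slide refers to.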
Relevant DB Technologies
• Distributed Aggregation: monitoring networks (collecting stats);
computing synopses and passing them along
• Adaptive execution plans: feedback to the execution;
commutative tasks to avoid extended delays
• Range search over DHT: trie hashing; still limited
• P2P & Mobile Databases
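For the distributed aggregation item in the list above, "computing synopses and passing them along" can be illustrated with partial aggregates that merge up a tree of nodes. The sketch is an assumption-laden toy: the `Synopsis` class, the (count, total) summary, and the nested-tuple tree encoding are all chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synopsis:
    """Fixed-size summary of a set of readings: enough to compute a mean."""
    count: int = 0
    total: float = 0.0

    @staticmethod
    def of(values):
        return Synopsis(len(values), float(sum(values)))

    def merge(self, other):
        return Synopsis(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def aggregate(node):
    """node = (local_readings, children); returns the synopsis of the subtree."""
    local_readings, children = node
    synopsis = Synopsis.of(local_readings)
    for child in children:
        # Each child sends one small synopsis upward instead of its raw readings.
        synopsis = synopsis.merge(aggregate(child))
    return synopsis


leaf_a = ([4], [])
leaf_b = ([5, 6], [])
root = ([1, 2, 3], [leaf_a, leaf_b])
result = aggregate(root)
```

The design point is that each link carries a constant-size summary regardless of how many readings sit below it.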
Pier: A P2P in situ Query Engine
Goals
• Massively distributed processing
• Scalability
• Relaxed consistency (best effort)
Architecture
• P2P Built on top of DHT
• Multicast to all related nodes (lscan)
• Pipelining the intermediate results
Pier Joins
• Stored in DHT: namespace = relation (NR, NS); resourceID = primary key (PK);
instanceID = tuple # if not a PK
• Assume R and S are already DHT hashed using <NR,PKR,1> and
<NS,PKS,1>
• Symmetric join, building phase: lscan NR and NS, eliminating unqualified
tuples and unneeded attributes; rehash the remaining tuples using
namespace NQ and resourceID=R.pkey*S.pkey; tuples are tagged with
their relation name
• Symmetric join, probing phase: probing runs locally in parallel with building
(with callbacks); satisfying tuples are either sent to the Qsite or DHT-ed
for the pipelined op
• Consumes a lot of bandwidth
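The build and probe phases can be sketched as a single-process symmetric hash join. This is a simplified illustration, not PIER's implementation: there is no DHT or network, the two local hash tables stand in for the NQ namespace, and tuples are hashed on the join-attribute value, which is an assumption of this sketch.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples as they arrive.

    stream yields ("R", tuple) or ("S", tuple) in arrival order, so each
    tuple carries its relation tag; key_r and key_s extract the join value.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for relation, tup in stream:
        other = "S" if relation == "R" else "R"
        join_value = keys[relation](tup)
        tables[relation][join_value].append(tup)           # build
        for match in tables[other].get(join_value, []):    # probe the other side
            yield (tup, match) if relation == "R" else (match, tup)


arrivals = [
    ("R", (1, "r1")),
    ("S", (1, "s1")),
    ("S", (2, "s2")),
    ("R", (2, "r2")),
    ("R", (1, "r3")),
]
first_field = lambda tup: tup[0]
results = list(symmetric_hash_join(arrivals, first_field, first_field))
```

Results are produced as soon as both matching tuples have arrived, which is what allows the intermediate results to be pipelined.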
Better Joins
• Fetch Matches: hash only S; lscan R and fetch the matching NS tuples
• Rewriting join using 2-way semijoin: project R & S on their PK and
joining attribute; do a symmetric join on these projections
• Rewriting join using Bloom filters: create and DHT the Bloom filters;
do an lscan and access the Bloom filter to eliminate non-joinable
tuples
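The Bloom filter rewrite can be illustrated with a small local sketch: build a filter over the join values of S, then drop R tuples that cannot possibly join before shipping anything. The `BloomFilter` class, its size parameters, and the `bloom_prefilter` helper are assumptions made for illustration; distributing the filter through the DHT is not modeled.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0            # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_prefilter(r_tuples, s_tuples, key_r, key_s):
    """Keep only the R tuples whose join value may appear in S."""
    bloom = BloomFilter()
    for s in s_tuples:
        bloom.add(key_s(s))
    return [r for r in r_tuples if key_r(r) in bloom]


# Join values taken from the dept names on the opening slide; "book" is made up.
dept = [("candy",), ("toy",), ("men",), ("shoe",)]
emp = [("Gary", "toy"), ("Shirley", "candy"), ("Pat", "book")]
survivors = bloom_prefilter(emp, dept, key_r=lambda e: e[1], key_s=lambda d: d[0])
```

Tuples that survive still have to be joined exactly, since a Bloom filter can report false positives; it only guarantees that no joinable tuple is dropped.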
Conclusions for Pier
• P2P brings massive parallelism
• Repetitive data comparison over DHT
brings along massive waste of bandwidth
• Smarter in situ distillation (2-way
semijoins, Bloom filters) works better