PetDB
A Petabyte in Your Pocket
David Maier
Oregon Graduate Institute
with help from
D. DeWitt, J. Naughton, L. Delcambre, K.
Tufte, V. Papadimos, P. Tucker
Your PetDB
It’s 2015.
For $300 a year, you can have a personal
petabyte database (PetDB).
You can talk to it from anywhere.
Organizes any kind of digital data.
– Doesn’t lose structure, can restructure
– Queryable
– Handles streams
– Organized by type, content, associations, multiple
categorizations and groupings
Locate items by
– How or where you encountered them
– What you’ve done with them
– Where you were when you accessed them
What Would I Put in a Petabyte?
A lot.
Fill my office floor to ceiling with books: about 100 GB
What do I do with 10,000 times as much?
Many possibilities:
– Contents of every book and magazine I read
– Every web page I visit
– All email I send or receive
– Every TV program I watch
– Every version of every piece of software I use
– Maps of everywhere I go
– Notes from every class or seminar I attend
– All the telephone calls I make
– My “Lifestream” (Freeman and Gelernter)
Streams and Restructuring
Can incorporate streamed data on the fly.
– MD: Vital signs from patients in ICU
– Factory supervisor: status, output rate of all
machines; finished products; rejects
Can restructure data if desired.
– Combined list of conferences in my area
– Info sheets on autos I’m considering buying
– Comparable salaries of faculty at my rank in
similar departments
Anything I Might Want to Refer Back to
Personally indexed for me.
Can be located in a thousand different ways.
What is the company in Massachusetts I read
about in the article on factory tours when I was
on the plane to the sales meeting in Atlanta
last spring?
Or Things I Might Want in the Future
• Histories of news groups and mailing lists
• Parts of the web I might want to browse,
including past snapshots
• Descriptions and prices for any item I
might want to buy
• Papers I’ve been meaning to read
• Historical data on stocks I’m interested in
Functions as a personal web portal
“Database” Not Completely Apt
• Didn’t have to define a schema for it
• Doesn’t need to know the datatypes I want to
store in advance
• Doesn’t chop data into rows and columns
Unless I ask
• Can query over information streams
• Don’t need to write and run applications to add
data
Anything I’ve touched is there
Or expressed an interest in
• Not on a particular computer
• Doesn’t have an “outside”
My PetDB is Good to Me
• I don’t move data between environments
I’m never on the “wrong” machine
• Never go back to my office to grab a paper,
never have the wrong folder at a meeting
• Don’t worry a lot about filing systems–PetDB
organizes itself by ways I like to look for
information
• Anticipates what data I’ll be using
How to Do This?
On $300/year
Plan A: Pack my office floor to ceiling with
disk drives.
About $1 million.
Plan B: Be clever.
– Share
– Stage
– Reconstitute
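The gap between Plan A and the budget can be made concrete with a little arithmetic. The sketch below uses only the slide's own estimates (100 GB per office of books, 10,000 times that, about $1 million of disk, $300 a year) and assumes decimal units (1 PB = 10^15 bytes); the variable names are illustrative.

```python
GB = 10**9           # decimal gigabyte, an assumption about the units
PB = 10**15          # decimal petabyte

office_of_books = 100 * GB             # slide estimate
petabyte = 10_000 * office_of_books    # "10,000 times as much"

plan_a_cost = 1_000_000    # dollars of disk drives, slide estimate
budget_per_year = 300      # dollars per year, slide figure

# Cost per gigabyte implied by Plan A, and how many years of the
# $300 budget it would take to pay for it outright.
cost_per_gb = plan_a_cost / (petabyte / GB)
years_to_pay = plan_a_cost / budget_per_year
```

Under these assumptions Plan A works out to about a dollar per gigabyte and more than three thousand years of subscription fees, which is why Plan B has to avoid storing a private copy of everything.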
Share
Most of the information in my PetDB isn’t
unique to me: magazine article, web page,
stock quote.
Store one copy.
Information Paradox: What’s too expensive
for one may be affordable for all.
Others’ PetDBs
My PetDB
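One way to picture "store one copy" is a content-addressed store shared by many PetDBs. This is a minimal sketch, not the talk's design: the `SharedStore` class, its method names, and the use of SHA-256 digests as keys are all assumptions made for illustration.

```python
import hashlib


class SharedStore:
    """Keeps one physical copy of each distinct item, keyed by content hash."""

    def __init__(self):
        self._blobs = {}    # digest -> bytes (the single stored copy)
        self._owners = {}   # digest -> set of PetDB owners referencing it

    def put(self, owner, data):
        """Record that `owner` holds `data`; store the bytes only once."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        self._owners.setdefault(digest, set()).add(owner)
        return digest

    def get(self, digest):
        return self._blobs[digest]

    def physical_bytes(self):
        """Bytes actually stored."""
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self):
        """Bytes that would be stored if every owner kept a private copy."""
        return sum(len(self._blobs[d]) * len(owners)
                   for d, owners in self._owners.items())


store = SharedStore()
article = b"Magazine article on factory tours"
key_mine = store.put("my PetDB", article)
key_theirs = store.put("another PetDB", article)
```

Both owners receive the same key, and the store holds the article once even though two PetDBs logically contain it.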
Stage
Not all data has to be at my current point of
connection.
Mainly resides in shared and private servers
on the Internet.
Staged to me on a series of data managers.
Access time depends on context, likely use
– Current itinerary: 1 second
– Upcoming trips: 5 seconds
– Past trips: 30 seconds
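The itinerary example above can be sketched as a function that maps an item's context to a target access time. The function name, its date-based rule, and the sample dates are hypothetical; only the 1, 5, and 30 second targets come from the slide.

```python
from datetime import date


def target_access_seconds(trip_start, trip_end, today):
    """Return the access-time target for an itinerary, given today's date."""
    if trip_start <= today <= trip_end:
        return 1     # current itinerary: keep it closest to the user
    if today < trip_start:
        return 5     # upcoming trip: stage it nearby
    return 30        # past trip: may be fetched from a remote server


today = date(2015, 6, 10)
current = target_access_seconds(date(2015, 6, 8), date(2015, 6, 12), today)
upcoming = target_access_seconds(date(2015, 7, 1), date(2015, 7, 5), today)
past = target_access_seconds(date(2015, 3, 1), date(2015, 3, 4), today)
```

A stager could use such targets to decide which data manager along the path to the user should hold each item.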
Reconstitute
“If I found it once, PetDB can find it again”
Remember what procedure or search
constructed or located data originally.
Use the same method to get it again.
Need to ensure base data is archived.
Plus a small amount of unique content
– Stuff I’ve created
– Foreground information that superimposes my
personal perspective: selections, annotations,
responses, manipulations, groupings
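A rough sketch of reconstitution: keep the procedure and its parameters rather than the result, and re-run it against archived base data on demand. The `Reconstitutor` class, the `search_articles` procedure, and the sample archive are all made up for illustration.

```python
class Reconstitutor:
    """Remembers how an item was found instead of storing the item itself."""

    def __init__(self, archive):
        self.archive = archive   # base data, assumed to be archived elsewhere
        self.recipes = {}        # item name -> (procedure, parameters)

    def remember(self, name, procedure, **params):
        self.recipes[name] = (procedure, params)

    def reconstitute(self, name):
        # Use the same method to get the data again.
        procedure, params = self.recipes[name]
        return procedure(self.archive, **params)


def search_articles(archive, keyword):
    """Hypothetical search procedure: keyword match over archived documents."""
    return [doc for doc in archive if keyword in doc]


archive = ["factory tours in Massachusetts", "sales meeting agenda", "stock quotes"]
petdb = Reconstitutor(archive)
petdb.remember("factory article", search_articles, keyword="factory")
found_again = petdb.reconstitute("factory article")
```

Only the recipe is private state here. The approach works only as long as the base data stays archived, which is the caveat the slide raises.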
What Infrastructure Do I Need?
Net Data Managers
• Network-centric vs. disk-centric
– Data movement vs. data storage
• Work on live streams as well as stored
data
• Deal with data of arbitrary types
• Run queries over thousands of sites
• Locate data by external contexts as well
as internal content
• Large-scale monitoring
Data Management Space
                 Query                      No Query
Disk Centric     DBMS                       File System
Network Centric  Net Data Managers (NDMs)   Web Servers
Why Net Data Managers?
File systems won’t work
– No queries, disk centric
Web Servers won’t work
– No structural query, no combining of data
– No support for optimization and execution of
high-level queries spanning 1000s of sites
– No support for triggers
– In reality, nothing more than “page servers”
Limitations of Current DBMSs
• Schema-first
• Load then query
• Data in the box
• Scale
• Search by content, not by context
Key Elements of NDM
• Self-describing data (e.g., XML)
• NetQueries
• Algebraic basis
• Stream-processing components
Oil refinery vs. book-order warehouse
Want to do for net-centric, data-intensive
applications what relational DBs did for
business data processing:
Reduce the coding effort to produce such
applications, while improving performance,
scalability and reliability.
Codd’s Contribution
What’s the most important aspect of the
relational model?
– Calculus?
– Algebra?
– Equivalence?
My opinion: Observing that business data processing (BDP) programs
only do about 6-7 different things:
– scan files
– select records
– combine records
– concatenate files
– remove fields
– remove duplicates
– [aggregate records]
What are the building blocks of net data
management?
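To illustrate how few building blocks are involved, the operations listed for business data processing can be sketched as composable generators over records. This is an illustrative sketch, not Codd's formulation or any DBMS's API; the function names mirror the slide's list, records are modeled as Python dicts, and the sample data is invented.

```python
from itertools import chain


def scan(records):
    yield from records


def select(records, predicate):
    return (r for r in records if predicate(r))


def combine(left, right, key):
    """Combine records from two inputs that agree on `key` (a simple hash join)."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


def concatenate(*files):
    return chain(*files)


def remove_fields(records, drop):
    return ({k: v for k, v in r.items() if k not in drop} for r in records)


def remove_duplicates(records):
    seen = set()
    for r in records:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield r


orders = [{"cust": 1, "item": "book"}, {"cust": 2, "item": "pen"},
          {"cust": 1, "item": "book"}]
customers = [{"cust": 1, "city": "Portland"}, {"cust": 2, "city": "Madison"}]

joined = combine(remove_duplicates(scan(orders)), customers, "cust")
portland = select(joined, lambda r: r["city"] == "Portland")
result = list(remove_fields(portland, {"cust"}))
```

Because each operator consumes and produces a stream of records, the same style of composition carries over to live data, which is where the question about net data management building blocks picks up.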
Without NDMs
[Figure: application architecture. Boxes: Data Sources; Users; Browser (x3); Format Conversion (x3); Push Receiver (x3); Alert Service; Profiles; Data Product Generation; Algorithm; Parameter File; Accumulator + Query Eng. Legend: Generic Component, Custom Software.]
With NDMs
[Figure: the same architecture. Boxes: Sources; Users; Browser (x3); Format Conversion (x3); Push Receiver (x3); Alert Service; Profiles; Data Product Generation; Algorithm; Parameter File; Accumulator + Query Eng. Legend: Generic Component, Custom Software.]
Kinds of Components
• Stream-based query processors
• Alerters
• Accumulators
• Remote monitoring/indexing
• Semantic Routers
• Replicators: lazy, eager, just-in-time
• Semantic caches
• Splitters
• Access-mode adapters
• Partial evaluators
Alerting vs. Querying
Data Centric (DBMS): stream of queries past a store of data
Net Centric (Alerter): stream of data past a store of queries
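The net-centric half of this contrast can be sketched as a component that holds standing queries and evaluates each arriving item against them. The `Alerter` class and its methods are hypothetical, queries are modeled as plain Python predicates, and the vital-sign readings are invented sample data.

```python
class Alerter:
    """A store of standing queries that a stream of data flows past."""

    def __init__(self):
        self.queries = {}   # query name -> predicate over one data item

    def register(self, name, predicate):
        self.queries[name] = predicate

    def feed(self, stream):
        """Yield (query name, item) for every stored query an item satisfies."""
        for item in stream:
            for name, predicate in self.queries.items():
                if predicate(item):
                    yield name, item


alerter = Alerter()
alerter.register("high_pulse", lambda v: v["pulse"] > 120)
alerter.register("patient_A", lambda v: v["patient"] == "A")

vitals = [{"patient": "A", "pulse": 72},
          {"patient": "B", "pulse": 131},
          {"patient": "A", "pulse": 125}]
alerts = list(alerter.feed(vitals))
```

A DBMS inverts the two roles: the data sits still and each query passes over it once. Here the queries sit still and each data item passes over them once.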
Access Modes: Who Decides
                     What Data Moves:
When Data Moves:     Consumer    Producer
  Producer           Post        Push
  Consumer           Pull        Poll
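An access-mode adapter converts one of these modes into another. As a hedged sketch, the class below wraps a source that must be asked for its value and turns it into one that notifies subscribers when the value changes. The `PullToPushAdapter` name, the explicit `tick` method (standing in for a timer), and the sample quote values are assumptions for illustration.

```python
class PullToPushAdapter:
    """Wraps a pull-style source and pushes changed values to subscribers."""

    def __init__(self, pull):
        self._pull = pull          # zero-argument callable returning the current value
        self._subscribers = []
        self._last = object()      # sentinel that compares unequal to any real value

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def tick(self):
        """Ask the source once; notify subscribers only if the value changed."""
        value = self._pull()
        if value != self._last:
            self._last = value
            for callback in self._subscribers:
                callback(value)


quotes = iter([10, 10, 11])
adapter = PullToPushAdapter(lambda: next(quotes))
received = []
adapter.subscribe(received.append)
for _ in range(3):
    adapter.tick()
```

The consumer behind the adapter never decides when data moves; the adapter's schedule and the source's changes decide for it.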
Assembling Applications from Components
Akamai FreeFlow (see NASDAQ site)
Splitting + Replication + Merge + Adapters
[Figure: Base Server holding Web Content; Split into Text and Graphics; Replicate to Field Servers (x3); Merge at the Browser. Edges labeled Pull (x2) and Push.]
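The composition named on this slide (splitting, replication, merge) can be sketched in a few lines. This is a toy model under loud assumptions, not a description of how Akamai's service works: pages are dicts with a text part and a graphics part, field servers are plain dicts, and all function names and sample content are invented.

```python
def split(page):
    """Separate a page into its text part and its graphics part."""
    return page["text"], page["graphics"]


def replicate(items, servers):
    """Copy the same items to every field server."""
    for server in servers:
        server.update(items)


def merge(text, graphics):
    """Reassemble a page at the browser from its two parts."""
    return {"text": text, "graphics": graphics}


page = {"text": "Market summary", "graphics": {"logo.gif": b"GIF89a"}}
field_servers = [{}, {}, {}]

text, graphics = split(page)
replicate(graphics, field_servers)          # graphics move out to the field servers
rebuilt = merge(text, dict(field_servers[1]))  # browser takes graphics from a nearby server
```

Each step is a generic component; the application is the wiring between them.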
NIAGARA Project
Initial investigation of NDM based on XML
University of Wisconsin and OGI
• Stream-oriented XML-QL evaluator
• “Text-in-context” search
• NiagaraCQ
• Merge operator (and rest of algebra)
• XML Firehose
Use of NDM for PetDB
• NetQueries encode procedures for
reconstituting data
• Monitoring sources of interest
• Replication, splitting, push, accumulators,
semantic routing for staging data
• NetQuery to inform an archive server what
to save
• Archives, semantic caches express what
they already hold with a NetQuery
Building the PetDB System
[Figure: PetDB system architecture. Labels, in extraction order: Context Mgr.; Stager; Task Analyzer; Stager; PetDB; Petster; Profiler; Stager; Secure Local Cache; Replicate; Server; Private Archive; IP Server; Back Quote Data; Kennel; Indexer; Stream Processor; WebSnap; Internet; Monitor; Public Archives.]
What Else is Needed?
• Superimposed Information
Much of my unique content is an organizational
overlay on base data
• Small-footprint data managers
• Presentation model of stream data
• Authorization and Authentication
• QoS control, content scaling
• Intelligent prediction, learning
• Secure staging areas