What's Next for Database?
Jim Gray
Microsoft
http://research.microsoft.com/~Gray

Outline
  Looking at the past: old problems now look easy
  Looking forward:
    data avalanche here
    integrate ALL kinds of data
  Watershed: The new world
    Programs + data: Info Ecosystem
    All data classes (Objectifying Information)
    Approximate answers

Keynote ▪ 30 September 2005 ▪ 9:00

Old Problems Now Look Easy
  1985 goal: 1,000 transactions per second
    Couldn’t do it at the time
    At the time: 100 transactions/second
    50 M$ for the computer (y2005 dollars)
  Now: easy
    Laptop does 8,200 debitcredit tps
    ~$400 desktop
  Thousands of DebitCredit Transactions-Per-Second: Easy and Inexpensive, Gray & Levine, MSR-TR-2005-39,
  ftp://ftp.research.microsoft.com/pub/tr/TR-2005-39.doc

Hardware & Software Progress
  Throughput 2x per 2 years
  Throughput/$ 2x per 1.5 years
    tracks MHz
    40%/y hardware, 20%/y software
  [Figure: three trend charts. "X86&X64 tpmC per CPU over time", 1995-2006 (30x in 10 years, 41%/year, double every 2 years); "X86&X64 tpmC per Mhz over time", 1995-2006; "TPC-A and TPC-C tps/$ Trends", Throughput / k$, 1990-2004 (~100x in 10 years, ~2x per 1.5 years).]
  No obvious end in sight!
  A Measure of Transaction Processing 20 Years Later
  ftp://ftp.research.microsoft.com/pub/tr/TR-2005-57.doc
  IEEE Data Engineering Bulletin, V. 28.2, pp. 3-4, June 2005

100x Improvement Every Decade
  $1B job becomes $10M job
  $1M job becomes 10K$ job
  Terabytes common now (~500$ today)
  Petabytes in a decade.
  Challenge: We can capture & store everything.
    What’s interesting?
    What can you tell me about X?

Q: How Much is “Everything”?
A: About 15 Exabytes
Q: How much is digital?
A: 70% and growing
Q: Where does it come from?
A: Video, voice, sensors,…
Q: How fast is it growing?
A: Growing 10%/y now, 55%/y when ALL digital

Information Growth vs Storage Media
  Media       PB/y     Growth (%/y)
  print        0.2      2%
  film         427      4%
  video        300      5%
  computer   1,693     55%
  Source: Lyman & Varian, “How Much Information”, as of 2003
  http://www.sims.berkeley.edu/research/projects/how-much-info/

Where is the Data? Smart Objects Everywhere
  Phones, PDAs, Cameras,… have small DBs.
  Disk drives have enough cpu, memory to run a full-blown DBMS.
  All these devices want-need to share data.
  Need a simple-but-complete dbms
  They need an Esperanto: a data exchange language and paradigm.
  Billions of Clients, Millions of Servers

The Perfect System
  Knows everything
  Knows what you want to know
  Tells you the answer…
    in an easy-to-understand way;
    just before you ask
  Tells you what you should have asked
  And…
    It is inexpensive to buy
    It is inexpensive to own.
  Well, maybe not everyone wants this… but every organization does.

Oh! And the PEOPLE COSTS are HUGE!
  People costs have always exceeded IT capital.
  But now that hardware is “free” …
  Self-managing, self-configuring, self-healing, self-organizing and … is key goal.
  No DBAs for cell phones or cameras.
  Requires
    Clear and simple knobs on modules
    Software manages these knobs

Our Challenge
  Capture, Store, Organize, Search, Display All information.
    Personal
    Organizational
    Societal
  There is a huge gap between what we have today and what we need.
  Data capture is relatively easy
  Curate, Organize, Search, Display still too hard.
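
The growth figures quoted in the "Hardware & Software Progress" and "100x Improvement Every Decade" slides can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names annual_rate and factor_over are assumptions introduced here, not part of the talk, and it assumes smooth compound growth between the quoted data points.

```python
# Illustrative arithmetic only; function names are hypothetical, not from the talk.
# Assumes smooth compound growth between the quoted data points.

def annual_rate(doubling_years: float) -> float:
    """Annual growth rate implied by doubling every `doubling_years` years."""
    return 2 ** (1 / doubling_years) - 1


def factor_over(years: float, doubling_years: float) -> float:
    """Total growth factor after `years` at the given doubling time."""
    return 2 ** (years / doubling_years)


# Throughput: 2x per 2 years
throughput_rate = annual_rate(2.0)        # about 0.41, matching the quoted 41%/year
throughput_decade = factor_over(10, 2.0)  # 2**5 = 32, roughly the quoted 30x in 10 years

# Throughput/$: 2x per 1.5 years
price_perf_decade = factor_over(10, 1.5)  # about 100, matching ~100x in 10 years

if __name__ == "__main__":
    print(f"2x per 2 years   -> {throughput_rate:.0%}/year, "
          f"{throughput_decade:.0f}x per decade")
    print(f"2x per 1.5 years -> {annual_rate(1.5):.0%}/year, "
          f"{price_perf_decade:.0f}x per decade")
```

At roughly 100x per decade in throughput per dollar, a $1B job becoming a $10M job is the same factor read the other way around.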

Outline
  Looking at the past: old problems now look easy
  Looking forward:
    data avalanche here
    integrate ALL kinds of data
  Watershed: The new world
    Programs + data: Info Ecosystem
    All data classes (Objectifying Information)
    Approximate answers

DBMS Re-conceptualization
  Re-Unification of Programs & Data
  Allows Objectification of Information, e.g.:
    What is a gene? What properties & methods?
    What is a person? What properties & methods?
    What is an X? What properties & methods?
  Need to “glue” all these models together
    Time, Space, text,… are core types
    Person, event, document, gene,.. are extensions.
    The “Action” is in these extensions.

Code and Data: Separated at Birth
  COBOL
    IDENTIFICATION: document
      AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
    ENVIRONMENT: OS
      CONFIGURATION SECTION. INPUT-OUTPUT SECTION.
    DATA: Files/Records (“data”)
      FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.
    PROCEDURE: code (“knowledge”)

CODASYL - DBTG
  COnference on DAta SYstems Languages, Data Base Task Group
  Defined DDL for a network data model
    Set-Relationship semantics
    Cursor Verbs
  Isolated from procedures.
  No encapsulation
  Niklaus Wirth: Programs = Algorithms + Data Structures

The Object-Relational World
  marry programming languages and DBMSs
  Stored procedures evolve to “real” languages: VB, Java, C#,.. with real object models.
  Data encapsulated: a class with methods
  Tables are enumerable & indexable record sets with foreign keys
  Records are vectors of objects
    Opaque or transparent types
  Set operators on transparent classes
  Transactions:
    Preserve invariants
    A composition strategy
    An exception strategy
  Ends Inside-DB Outside-DB dichotomy
  [Diagram label: Business Objects]

Ask not “How to add objects to databases?”, Ask “What kind of object is a database?”
  Q: Given an object model, what is a DB?
  A: DataSet class and methods (nested relation with metadata)
  The basis for the ecosystem
    Distributed DB
    Extensible DB
    Interoperable DB
    ….
  implicit in ODBC, OleDB
  explicit within the DBMS ecosystem
  Input: Command (any language)
  Output: Dataset
  [Diagram labels: Question; Dataset; Tables or Text or cube or…]

DB System Architecture
  [Diagram labels: sets, records, os, utilities]
  but applications need to query other data types
  Added:
    + Text, Time, Space
    + Triggers and queues
    + Replication, Pub/sub
    + Extract-Transform-Load
    + Cubes, Data mining
    + XML, XQuery
    + Programming Languages
    + Many more extensions coming
  A Mess?
  [Diagram labels: sets, …, records, os, utilities, Notification, Space, Time, Data Mine, Cubes, Text, ETL, Replication, XML, Queues, Procedures]

The classic DBMS model
Evolving to be Information Services Container
  develop, deploy, and execution environment
    + Programming Languages
    + Triggers and queues
    + Replication, Pub/sub
    + Extract-Transform-Load
    + Text, Time, Space
    + Cubes, Data mining
    + XML, XQuery
    + Many more extensions coming
  [Diagram labels: sets, records, os, utilities, Classic ++]
  DBMS is an ecosystem
  OO is the key structuring strategy:
    Everything is a class
    Database is a complex object
    Core object is DataSet
    Classes publish/consume them
  Depends on strong Object Model

What’s Outside?
  [Diagram label: DataSet]
  [Diagram labels: Remote Node (x2), Internet, Other us (x5), Applications, Our API, Buffer Pool, catalogs, iterators, Query Processor, data]

Classic: What’s Outside? Three Tier Computing
  Clients
    gather input,
    do presentation
    do some workflow (script)
    Send high-level requests to ORB (Object Request Broker)
  ORB dispatches workflows, orchestrates flows & queues
  Workflows invoke business objects
  Business objects read/write database
  [Diagram layers: Presentation, workflows, Business Objects, Databases]

DBMS is Web Service!
  Client/server is back; the revenge of TP-lite
  Web servers and runtimes (Apache, IIS, J2EE, .NET) displaced TP monitors & ORBs
  Give persistent objects
  Holistic programming model & environment
  Web services (soap, wsdl, xml) are displacing current brokers
  DBMS listening to Port 80, publishing WSDL, DISCO, WS-Sec, servicing SOAP calls.
  DBMS is a web service
  Basis for distributed systems.
  A consequence of OR DBMS
  [Diagram layers: Presentation, workflows, Business Objects, DBMS, Databases]

Queues & Workflows
  Apps are loosely connected via Queued messages
  Queues are databases.
  Basis for workflow
  Queues: the first class to add to an OR DBMS
  Workflow: Script, Execute, Administer & Expedite, all built on queues
  Queues fire triggers.
  Active databases
  Synergy with DBMS: security, naming, persistence, types, query,…

What’s new here?
  DBMS have tight-integration with language classes (Java, C#, VB,.. )
  The DB is a class
  You can add classes to DB.
  Adding indices is “easy” if you have a new idea.
  Now have solid queue systems
  Adding workflow is “easy” if you have a new idea.
  This is a vehicle for publishing data on the Web.
  [Diagram labels: Question; Dataset; Internet; Web service; Tables or Text or cube or…]

Text, Temporal, and Spatial Data Access
  Q: What comes after queues?
  A: Basic types: text, time, space,…
  Great application of OR technology
  Key idea: table valued functions == indices
    An index is a table, organized differently
    Query executor uses index to map: Key → set (aka sequence of rows)
    Table valued function can do this map
    Optimizer can use it.
    + extras: cost function, cardinality,…

    select Title, Abstract, T.Rank
    from Books join FreeTextTable(Title, Abstract, 'XML semistructured') T
      on BookID = T.Key

    select galaxy, distance
    from GetNearbyObjEQ(22,37)

    select store, holiday, sum(sales)
    from Sales join HolidayDates(2004) T
      on Sales.day = T.day
    group by store, holiday

  BIG DEAL: Approximate answers: Rank and Support

Data Mining and Machine Learning
  Tasks: classification, association, prediction
  Tools: Decision trees, Bayes, Apriori, clustering, regression, Neural net,… now unified with DBs
  Create table T (x,y,z,u,v,w)
    Learn “x,y,z” from “u,v,w” using <algorithm>
  Train T with data.
  Then can ask:
    Probability x,y,z,u,v,w
    What are the u,v,w probabilities given x,y,z
  Example: Learn height from age.
  Anyone with a data mining algorithm has full access to the DBMS infrastructure.
  Challenge: Better learning algorithms.

Notification: Stream and Sensor Processing
  Traditionally: Query billions of facts
  Streams: millions of queries, one new fact
    New protein: compare to all DNA
    Change in price or time
  Implications
    New aggregation operators (extension)
    New programming style
  Streams in products:
    Queries represented as records
    New query optimizations.
  Sensor networks push queries out to sensors.
    Simpler programming model
    Optimizes power & bandwidth
  [Diagram labels: Q? A!; fact, fact, fact…; facts; Q Q Q Q Q Q Q; Notification]

Semi-Structured Data
  “Everyone starts with the same schema: <stuff/>. Then they refine it.” J. Widom
  “Strong schema” has pros-and-cons.
  Files <stuff/> and XML <<foo/> <bar/>> are here to stay.
  Get over it!
  File directories are databases;
    Pivot on any attribute
    Folders are standing queries.
  Freetext+schema search (better precision/recall)
  Cohabit with row-stores

Publish-Subscribe, Replication, Extract-Transform-Load (ETL)
  Data has many users
  Replicas for availability and/or performance
  Mobile users do local updates, synchronize later.
  Classic Warehouse
    Replicate to data warehouse
    Data marts subscribe to publications
  Disaster Recovery geoplex
  ETL is a major application & component
    Data loading
    Data scrubbing
    Publish/subscribe workflows.
  Key to data integration (capture / scrub)

Restatement: DB Systems evolved to be containers for information services
  develop, deploy, and execution environment
  Everything is a class
  Database is a complex object
  Core object is DataSet
  Approximate answers
  This architecture lets you add your new ideas.
  DBMS is an ecosystem
  Key structuring strategy: DataSet
  [Diagram labels: sets, records, os, utilities]

Summary:
  Looking at the past: old problems now look easy
  Looking forward:
    data avalanche here
    integrate ALL kinds of data
  Watershed: The new world
    Programs + data: Info Ecosystem
    All data classes (Objectifying Information)
    Approximate answers

Additional Resources
  Papers at: http://research.microsoft.com/~gray/JimGrayPublications.htm
  Talks at: http://research.microsoft.com/~gray/JimGrayTalks.htm
  Basis for this talk: “The Revolution in Database Architecture”
    http://research.microsoft.com/research/pubs/view.aspx?tr_id=735
  Very interesting & related: David Campbell, “Service Oriented Database Architecture: App Server-Lite?”
    http://research.microsoft.com/research/pubs/view.aspx?tr_id=983

Thank you!
  Thank you for attending this session and the 2005 PASS Community Summit in Grapevine!
  Please help us improve the quality of our conference by completing your session evaluation form.
  Completed evaluation forms may be given to the room monitor as you exit or to staff at the registration desk.
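
Backup sketch: table valued functions == indices

As a rough illustration of the "table valued functions == indices" idea from the Text, Temporal, and Spatial Data Access slide, the sketch below mimics the HolidayDates query in plain Python. Everything in it is an assumption made for illustration: the holiday_dates function, the sample rows, and the dictionary-based join are hypothetical and do not reflect any DBMS API. The only point is that a function returning rows can stand in for an index that maps a key to a set of rows.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical sample data: (store, day, sales). Not from the talk.
SALES: List[Tuple[str, str, float]] = [
    ("Seattle", "2004-07-04", 120.0),
    ("Seattle", "2004-07-05", 80.0),
    ("Redmond", "2004-07-04", 50.0),
    ("Redmond", "2004-12-25", 30.0),
]


def holiday_dates(year: int) -> Iterator[Tuple[str, str]]:
    """Table-valued function: yields (day, holiday) rows for a year.

    Stands in for HolidayDates(2004) on the slide; the holiday list is made up.
    """
    yield (f"{year}-07-04", "Independence Day")
    yield (f"{year}-12-25", "Christmas")


def sales_by_holiday(year: int) -> Dict[Tuple[str, str], float]:
    """Rough equivalent of:
    select store, holiday, sum(sales)
    from Sales join HolidayDates(year) T on Sales.day = T.day
    group by store, holiday
    """
    # Materialize the key -> row map that an index would otherwise provide.
    day_to_holiday = dict(holiday_dates(year))
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for store, day, amount in SALES:
        holiday = day_to_holiday.get(day)
        if holiday is not None:  # inner join: keep only days that match
            totals[(store, holiday)] += amount
    return dict(totals)


if __name__ == "__main__":
    for (store, holiday), total in sorted(sales_by_holiday(2004).items()):
        print(store, holiday, total)
```

A real optimizer would also want the "extras" the slide mentions (cost function, cardinality) before choosing such a function over a scan; this sketch leaves those out.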