To the Internet and Beyond: Database Challenges for New/Advanced Applications
May 21, 2001
Propel Confidential

Agenda
• The story of Infoseek
• Why Propel?
• The problems that arise for us

Scoring Framework
• Classification problem
  • Separate relevant from non-relevant documents
  • Bayes’ decision rule: d is relevant if $P(x(d) \mid R)\,P(R) > P(x(d) \mid \neg R)\,P(\neg R)$, where x(d) is the observed representation of d
  • The independence assumption leads to $S(d) = \sum_{t \in d} \log \frac{p(t)\,(1 - q(t))}{(1 - p(t))\,q(t)}$, where $p(t) = P(t \mid R)$ and $q(t) = P(t \mid \neg R)$

The original Infoseek vision
• Stolen from Bill Gates… “Information at your fingertips”
• To find any piece of information on any computer in the world within 1 second

How we got started
• Finding information was too expensive and too hard
• Our field of dreams
  • “If you provide useful information at bargain prices, they will come”
• In January 1995 we launched Infoseek
  • Register with a credit card
  • First month free
  • 10 cents a transaction

What happened...
• “I thought you said it was FREE to try it?”
• “You’ve got to be kidding!”
• “I already pay $10 a month for my access!”
• “I can’t afford it.”
• “Go to …”
• “Why should I pay when the information is available free elsewhere on the net?”
• “I don’t like to be nickeled and dimed.”

Even more advice...
• “You should only charge me per query”
• “You should only charge for document”
• “I’ll only sign up for a flat fee”
• “I refuse to pay a flat fee”
• “I don’t have a credit card”
• “Your legal agreement is too long”

What we did
• Dropped the credit card registration for a free trial
• Made it very clear you can’t get most of this stuff for free anywhere on the net
• Made the pricing easier to understand
• Advertised it on our free Net Search

“So…. How would you like to provide a free Net Search?”
• First reactions
  • “Are you joking?”
  • “How would we make money? By making it up in volume?”
• Strategy
  • It would be free advertising for Pro
  • Limit the search results to 100 hits
  • Want more? Refer to Infoseek Pro

Infoseek Guide
• 25M hits/day (200 queries/sec at times)
• #1 search engine on the Net
• 1,000 signups/day for Infoseek Pro
• Discovered advertising sponsorship
  • 1.5 cents per query
• Discovered TV math
  • We make more money giving away information than selling it

Four years later…

How to find Barney pages suitable for your kids
+Barney +dinosaur -bash -kill -maim -destroy -hate

What people ask about (and why)

Unofficial SIGMOD survey question
How many people here search the web for “adult sites”?

Top 15 queries on the WWW
• sex
• Playboy
• Penthouse
• chat
• Hustler
• nude
• porn
• erotica
• games
• pornography
• porno
• adult
• ESPN
• pussy
• Pamela Anderson
* I am not making this up! This list is real!
• What does that mean?
• “Uhh… I was just testing!”

Unofficial SIGMOD trivia question
• Q: What famous IR researcher asked in 1995 “Is this because of the Communications Decency Act (CDA)?”
• A: Bruce Croft

Why it happens (possible explanations)
• Research on CDA
• Curious what others are looking at
• Many new technologies are driven by sex:
  • VCR
  • Hotel movies on demand
• People are naturally horny

What it means
• Human race in no danger of extinction
• Corporate libraries doing a great job in technical areas
• Traditional sex education inadequate
• Some of you are not telling the truth
• Audience surveys are not always accurate
• Bill Gates should admit to Congress that Pamela Anderson is more important than he is
• If you didn’t raise your hand, you may need professional help!

The secret Infoseek backup bizplan
• Selling our list of porn sites

We never pursued it…
… But other companies did!
  • Sinfoseek
  • Infoseak
  • Nymfoseek
  • Infopeek
  • ...

Relevance ranking Web sites

Facts about Queries
• Most queries are short
  • Average length approx. 2.2
  • 10% use query syntax (usually incorrectly)
  • 1% use advanced search
  • Noun phrases only
• Precision more important than recall
  • Users expect precision in top results

Relevance ranking objectives
Must use several techniques to determine “relevance”:
• Page has query term(s)
• Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ...
• Page quality
• Page/site popularity
• Spam reduction/elimination
• Porn reduction

Relevance ranking factors
• Query terms: tf*idf
• Usage: Hyperlink text, thesaurus
• Quality: site quality, dates, depth, …
• Popularity: External link count, proxy stats
• Spam: unusual word/phrase statistics (tf limiting)
• Porn: site exclusion list, naughty phrase list
Relative weighting of these factors is tricky and subjective.
Should “evil empire” return Microsoft as the top hit?

Living in a world of an infinite number of documents

The problem (user view)
• Too hard to find things even though only 100M documents indexed
• Often precision and relevance, NOT recall
• “intel” in the title search gives over 200 hits just like this:
  Index of /CPANlocal/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/
• Query ambiguity, e.g., “baby Bells”

The problem (vendor view)
• Speed
• Size
• Cost
• Freshness
• Load on the Internet/bandwidth (both sides)
• Quality (Spam/porn)
• Will people be able to find what they are looking for as the net grows?

Today’s approach sucks
Suck all content into a centralized search engine
[Diagram labels: Infoseek; All the world’s content]

Is there a better way?
• We might start by asking the question: “How do people find information today?”

Centralized searching techniques are rarely used in real life...
• Ask God (and pray for an answer)
• Ask DIALOG …and pray...
• WWW search (new!)

What people DO use is decentralized searching
[Diagram labels: Question; Source 1, Source 2, ..., Source N; Answers and more sources]

Human distributed searching attributes
• Faster than a computer!!!
• Complete
• Accurate
• Can be used to validate an answer
• Will always find an answer (eventually)
• No specialized hardware
• All humans have the same CPU speed/RAM
So can’t we design a computer distributed search network that is as fast and accurate and complete as our human distributed search network?

Our goal
• Don’t necessarily mimic the process, but adapt the process to the medium

One approach
• User types query
• System searches databases of popular pages as well as meta descriptions of other databases
• Repeat until all websites have been searched
NOTE: This is the fastest way to search an infinite amount of data

What we learned
• Relatively weak engines with no proximity got a wide following: people couldn’t see through the hype
• Bigger was better

People lie
• We have “concept searching”
• We’re growing faster than the net
• We’ve indexed 95% of the net
• We have more URLs than anyone else

What we learned
• Competing for the Internet customer is not always a case of who really has:
  • the best engine
  • the highest quality content or the most content
  • the best price, best GUI, or the best product
• It’s more about:
  • brand name
  • convincing the customer you are the best

What we learned
• Ads don’t sell themselves
• If you do 1M ads per day, you’re cookin’
• Lots of competition
• Switching costs are low
• User behavior can be tracked
• Seemingly identical pages can have dramatically different click-through

Mistakes we made
• Not pressing for branding
• Slow to recognize ad model
• No ! at the end of our name

Ultra required new thinking
• The traditional IDF formula breaks down for 1 billion documents
• Existing data structures would never work
• “Managing Gigabytes” didn’t go far enough
• Inktomi approach was too inefficient
• No sacred cows

Ultra: Designed for speed
• Speed/space tradeoff
• Architected from the ground up for 1 billion docs and 1,000 queries/sec
• Everything is done in parallel and multithreaded
• Limited disk I/O
• Small in-RAM tables
• Stable connections

Ultra size
• Parallel worms (multi-process, multithreaded)
• Proprietary database required; OODBs too slow
• Change frequency monitored
• >50M URLs

Feature set
• Natural language queries
• +, -, “phrases”
• Fields (link, url, site, title)
• Case sensitivity
• Stemming
• Approximate matching for phrases
• Gets faster the longer the query
• 8 shortest term lists
• Space invariant, i.e., CD-ROM = CDROM

Spamming
• A royal pain
• Real-time made it worse
• Used “site stop lists” and repeated words in document
• Interesting problems arise:
  • If dup contents submitted, do you remove dups?
  • If docs ranked the same, which do you show first?

My favorite search engine today: Google
• Big
• Fast
• Relevant: Uses referral network of link statistics to determine best page to return for a query. This greatly reduces the spam problem too.
• Simple: assumes “and” between words

Google
• By “ordering” pages based on “importance”, you can construct databases of pages:
  • Best 10M web pages
  • 2nd best 10M web pages
  • 3rd best …
  • …
• To search 1 trillion pages, you might only search 1 database since most people never look past top 100 hits
• There is a technical term for this technique…

Why start Propel?
• We spent too much time at Infoseek building infrastructure (same as everyone else)
• Isn’t it time we stopped re-inventing the wheel?
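
The tiered-database idea from the Google slides above (best 10M pages first, lower tiers only if needed) can be illustrated with a small sketch. This is not Google's or Infoseek's implementation: the `Page` class, the `importance` score, the tier size, and the word-matching rule are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    importance: float  # hypothetical link-based score; higher is better
    text: str


def build_tiers(pages, tier_size):
    """Sort pages by importance and cut them into fixed-size tiers, best first."""
    ranked = sorted(pages, key=lambda p: p.importance, reverse=True)
    return [ranked[i:i + tier_size] for i in range(0, len(ranked), tier_size)]


def matches(page, query):
    """Implicit AND between query words (every word must appear in the page)."""
    words = set(page.text.lower().split())
    return all(term in words for term in query.lower().split())


def tiered_search(tiers, query, wanted=100):
    """Search the best tier first; visit lower tiers only while short of hits.

    Returns the hits (in importance order) and the number of tiers searched.
    """
    hits = []
    tiers_searched = 0
    for tier in tiers:
        tiers_searched += 1
        hits.extend(p for p in tier if matches(p, query))
        if len(hits) >= wanted:
            break
    return hits[:wanted], tiers_searched
```

With `wanted=100` and a top tier that already yields 100 matches, the loop stops after one tier, which is the point of the slide: most queries never need to touch the lower-ranked databases.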

Desktop Application Development
[Diagram: “The old MSFT way” vs. “The new MSFT way”; labels: C, C++; MSDOS; MS Windows]

Internet Application Development
[Diagram: “Current Way” vs. “Propel Way: Total Systems Approach”; labels: Web Servers; App Servers; RAS; Integration; Systems Mgmt; Search; Cache; Messaging; DB; Directory; Legacy Systems; …; Propel Distributed Services Platform; RAS, Performance, Quality; TTM, IT & Labor Costs]

Propel Long Term Direction
[Diagram labels: Retail Applications (B2C E-Commerce, Point of Sale, …); Business Applications (B2B E-Commerce, Marketplace, …); 3rd Party Applications (CRM, ERP, Financials, …); Propel Distributed Services Platform]

My directives
• Design it right: complete systems architecture.
• Give me the best possible architecture
• Forget standards if you must
• Make it infinitely scalable and arbitrarily reliable with the highest possible performance and super-easy maintainability and high levels of data abstraction.
• No bottlenecks. Nothing that doesn’t scale.
• Must handle worst case nightmares, e.g., eBay for search, …

My directives
• Ship in a year.
• Memory is now cheap. Take advantage.
• TCP/IP packets are like 747s.
• Touching disk or Oracle is death.
• I’m willing to sacrifice a tiny amount of latency for greater system maintainability.
• Handle data and code versioning cleanly.
• No system downtime ever.

My question
Q: “So how hard can that be?”
A: Mike and crew couldn’t make it to SIGMOD this year

Unofficial Propel Jingle
You need that new site up by yesterday... so it's got to be Java... and your RAS is on the line! That's exactly why we founded Propel.

Propel Distributed Services Platform
• Java-based distributed server architecture
• Eliminate traditional infrastructure bottlenecks
• Built-in scalability and fault-tolerance
[Diagram: Propel Distributed Services Platform; components: Scalable, Fault-tolerant, Clustered Messaging System; Distributed Data Managers; In-Memory Database; Java Object Mapping; Global Cache; Integrated Search & Queuing; Advanced Deployment; Performance Mgmt & System Admin]

Current Internet Architecture
[Diagram: Web Server Tier (Web Servers); Application Server Tier (E-Commerce Applications, Custom Application); Data Management Tier (Messaging Package, Database, Cache, Systems Management, Search Engine, Integration, Directory Server, Legacy Systems)]
Limitations
• Complex to set up for:
  • Scalability
  • Fault-tolerance
  • Manageability
• Expensive high-end servers (4-64 CPU)
• Customize application to handle RAS
• High development & IT costs

The Propel Architecture
[Diagram: Web Server Tier (Web Servers); Application Server Tier (E-Commerce Applications: Merchandising, Order Management, Customer Relationship Management, Analytics & Reporting; Custom Applications); Propel Distributed Services Platform (Scalable, Fault-tolerant, Clustered Messaging System; Distributed Data Managers; In-Memory Database; Java Object Mapping; Global Cache; Integrated Search & Queuing; Advanced Deployment; Performance Mgmt & System Admin); Database Server; Directory Server; Legacy Systems]
Benefits
• Designed for built-in:
  • Scalability
  • Fault-tolerance
  • Manageability of end-to-end system
• Multiple, low cost servers (1-4 CPU)
• Integrated RAS
• Reduced development effort & IT resources

Propel Technology Innovations
• Clustered Messaging System
• Distributed Data Management
  • In-memory databases
  • Persistent Queuing
  • Integrated Search
• Global Cache
• Advanced Deployment
• Java Object Mapping Layer
• Distributed Services Manager

Clustered Messaging System
• Centralized hub for all inter-process communication
• Transforms all applications and database managers into clustered distributed services
• Enables transparent replication of any service
• Dynamic load balancing across all services
• Incremental s/w upgrades without site downtime

Distributed Data Management
• Main-memory speeds
• Transparent single view of all distributed data managers
  • In-memory & disk-based databases
• Consistency & durability of e-commerce transactions
• Unified query and search capabilities spanning data, text, and persistent queues
• Persistent queues integrated into data mgmt framework
  • Flexible programming model extends database functions (queries, joins, etc.) to queues
• Integrated parametric & keyword search
  • Relevance ranking, custom dictionary, index refreshing, …

Distributed Data Management Example: Partitioned Query Processing
[Diagram: User, Order, and Product Data; Multi-Master Routing and Master-Slave Routing; nodes: User (Third-Party DBMS Node); Order (In Memory Database Nodes); Prod (In Memory Database Nodes); ProdN (In Memory Database Nodes)]

Persistent Queues
Unique ability to conduct queries & joins across queues & tables
Order Queue:
  Order    Customer  Status        Destination
  #98045   #3742     “urgent”      “ERP”
  ...
  #97996   #16       “in process”  “ERP”
  #97993   #831      “normal”      “Financial Svcs
[Diagram label: Database Tables]

Java Object Mapping Layer
• Java bean-based programming model for unified access to all distributed data mgmt capabilities
• Unifies access to DB tables, text data, persistent queues
• Automatically generates Java beans based on XML specifications

Java Object Mapping Layer
[Diagram: Code Generation and Runtime Environment; labels: XML; Object Relational Descriptions; Beans; APIs; Application; OM Runtime; Java Object Mapping; Platform Data]

Global Cache
• Data cache for dynamically generated page-fragments
• Easily incorporated into JSP pages through simple tags (with name, scope, lifetime)
• Multi-level distributed architecture
  • L1 – local application server caches
  • L2 – scalable system-wide cache(s)

Global Cache
[Diagram: User Requests (“Top Jazz CDs”, “Today’s Music News”) arriving at Application Server 1 and Application Server 2; Application Server Caches (L1) holding fragments such as Top Jazz CDs, Music News [TODAY], Billboard Top 10 CDs, Top Opera CDs; Global Cache Servers (L2) holding Top Jazz CDs, Billboard Top 10 CDs, Top Blues CDs, Top Opera CDs, Music News [TODAY], Music News [YESTERDAY]]

Advanced Deployment
[Diagram labels: Propel Merchant Tool; Content Management Tool; Propel Distributed Services Management; Offline Data (Product Data, Customer Profiles, Business Rules); Deployment Version Attributes; Deployment Type; Live Production Site; Versioned Database Tables Support in live production site; Distributed Services Manager]

Benefits: Propel Distributed Services Platform
• Reduced development costs
• Faster time to market with new features
• Incremental capacity on demand and fault-tolerance through built-in replication
• Transparent data tier scalability through distributed in-memory databases
• Lower cost server hardware infrastructure

Future challenges

Major challenges: Endure Mike Carey jokes
Kirsch: I don’t think we should bump their stock; we have to worry about stock parity, you know.
Carey: Yes, you wouldn’t want to make a parity error.
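
As a rough illustration of the Global Cache slides above (L1 caches local to each application server, backed by a shared L2 cache, with a lifetime per named fragment), here is a minimal sketch. It is not Propel's implementation: the class names, the `fetch` method, and the use of plain in-process dictionaries in place of networked cache servers are all assumptions made for the example.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after a per-entry lifetime."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # fragment name -> (expires_at, fragment)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, fragment = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # drop the stale fragment
            return None
        return fragment

    def put(self, name, fragment, lifetime):
        self._entries[name] = (self._clock() + lifetime, fragment)


class TwoLevelCache:
    """One application server's view: a private L1 in front of a shared L2."""

    def __init__(self, l2, clock=time.monotonic):
        self.l1 = TTLCache(clock)
        self.l2 = l2  # hypothetical stand-in for the system-wide cache servers

    def fetch(self, name, lifetime, render):
        """Return (fragment, source), where source is 'L1', 'L2', or 'rendered'."""
        fragment = self.l1.get(name)
        if fragment is not None:
            return fragment, "L1"
        fragment = self.l2.get(name)
        if fragment is not None:
            # Simplification: the L1 copy gets a fresh full lifetime.
            self.l1.put(name, fragment, lifetime)
            return fragment, "L2"
        fragment = render()  # generate the page fragment only on a full miss
        self.l2.put(name, fragment, lifetime)
        self.l1.put(name, fragment, lifetime)
        return fragment, "rendered"
```

In this sketch a fragment such as "Top Jazz CDs" is rendered once by whichever application server misses first; other servers then pick it up from the shared L2 and keep a local copy until its lifetime runs out.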

Mike Carey jokes: A major challenge
• Kirsch: I like the idea of a cache with a back-end store.
• Carey: Yeah, you wouldn’t want to go to the store without cache.
• Kirsch: I think we need more people working on our caching architecture.
• Carey: Maybe I can recruit some people from our finance dept; they are experts in cash management. They can manage large caches and do real-time cash management.

Less serious problems
• Caching: when/where/how, objects vs. query results, consistency models, etc.
• Content mgmt: time to revisit the work done in the 80's on CAD database version/configuration schemes?
• Replication: need to develop satisfactory schemes for geographic replication (current ones don’t work)

Less serious problems
• Transactions: Need to provide simple and effective models.
  • Many application developers don't truly understand the different consistency levels in SQL and how/when to use them
  • Developers have to work too hard to write consistent/reliable applications (e.g., when talking across the web, where you can't leave transactions open, so you have to accept other notions of atomicity and approaches to achieving them)

Future work
• Performance: latency, CPU, DB interface
• Minimizing communication:
  • Local object caching with attribute callbacks (possibly with optional min update frequency)
  • Expanded use of UOIDs
• How can distributed applications perform as fast as local applications?
• Adaptive decisions on applications, data replication, partitioning, master/slave, caching, etc. made by the software, not the programmer
• Better software licensing
  • Per message vs. per CPU

Future work
• Many paradigms, tools, etc. needed to write apps in that world today:
  • SQL, then Java, and XML at the top
• Some of the W3C standards (schema, query, protocol) need more oversight/attention and input from the database community to ensure that they're done “right.”

Summary
• Google got it right: simple, big, fast, relevant, focus
• Propel is trying to make RAMPS transparent to application programmers, including transparent database scalability and geographical distribution
• Get involved in W3C standards

Questions?
• Mike Carey has answers!
• Mike.carey@propel.com