SIGMOD talk: "Internet and Beyond: Database Challenges"

To the Internet and Beyond: Database
Challenges for New/Advanced
Applications
May 21, 2001
Propel Confidential
Agenda
• The story of Infoseek
• Why Propel?
• The problems that arise for us
Scoring Framework
• Classification Problem
•Separate relevant from non-relevant
documents
•Bayes’ Decision rule: Relevant if
$P(x(d) \mid R)\,P(R) > P(x(d) \mid \neg R)\,P(\neg R)$
where x(d) is the observed representation of
d
•Independence assumption leads to
$S(d) = \sum_{t \in d} \log \frac{p(t)\,(1 - q(t))}{(1 - p(t))\,q(t)}$
where p(t) = P(t|R) and q(t) = P(t|~R)
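A minimal sketch of the scoring rule above, in Python. The probability estimates, the term names, and the choice to sum only over query terms that occur in the document are all assumptions made for illustration; the slide does not say how p(t) and q(t) were estimated.

```python
import math


def term_weight(p: float, q: float) -> float:
    """Log-odds weight of one term: log[p(1-q) / ((1-p)q)]."""
    return math.log(p * (1.0 - q) / ((1.0 - p) * q))


def score(doc_terms: set[str], query_terms: set[str],
          p: dict[str, float], q: dict[str, float]) -> float:
    """S(d): sum the weights of the query terms that occur in the document."""
    return sum(term_weight(p[t], q[t]) for t in query_terms & doc_terms)


# Hypothetical estimates of p(t) = P(t|R) and q(t) = P(t|~R).
p = {"barney": 0.8, "dinosaur": 0.6}
q = {"barney": 0.1, "dinosaur": 0.2}

query = {"barney", "dinosaur"}
doc_a = {"barney", "dinosaur", "purple"}
doc_b = {"dinosaur", "fossil"}

print(score(doc_a, query, p, q))  # both query terms present
print(score(doc_b, query, p, q))  # only one query term present
```

A term with p(t) = q(t) contributes nothing, which is why terms that are equally likely in relevant and non-relevant documents can be left out of the sum.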
The original Infoseek vision
• Stolen from Bill Gates…
“Information at your fingertips”
• To find any piece of information on any
computer in the world within 1 second
How we got started
• Finding information was too expensive
and too hard
• Our field of dreams
• “If you provide useful information at bargain
prices, they will come”
• In January 1995 we launched Infoseek
• Register with a credit card
• First month free
• 10 cents a transaction
What happened...
• “I thought you said it was FREE to try it?”
• “You’ve got to be kidding!”
• “I already pay $10 a month for my access!”
• “I can’t afford it.”
• “Go to …”
• “Why should I pay when the information is
available free elsewhere on the net?”
• “I don’t like to be nickeled and dimed.”
Even more advice...
• “You should only charge me per query”
• “You should only charge for document”
• “I’ll only sign up for a flat fee”
• “I refuse to pay a flat fee”
• “I don’t have a credit card”
• “Your legal agreement is too long”
What we did
• Dropped the credit card registration for a
free trial
• Made it very clear you can’t get most of
this stuff for free anywhere on the net
• Made the pricing easier to understand
• Advertised it on our free Net Search
“So…. How would you like to
provide a free Net Search?”
• First reactions
• “Are you joking?”
• “How would we make money? By making it
up in volume?”
• Strategy
• It would be free advertising for Pro
• Limit the search results to 100 hits
• Want more? Refer to Infoseek Pro
Infoseek Guide
• 25M hits/day (200 queries/sec at times)
• #1 search engine on the Net
• 1,000 signups/day for Infoseek Pro
• Discovered advertising sponsorship
• 1.5 cents per query
• Discovered TV math
• we make more money giving away
information than selling it
Four years later…
How to find Barney pages
suitable for your kids
+Barney
+dinosaur
-bash -kill -maim
-destroy -hate
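A rough sketch of how a query like the one above could be evaluated: `+` marks a word that must appear and `-` marks a word that must not. The function names and the whitespace tokenization are assumptions for illustration, not Infoseek's parser; unprefixed words are simply ignored here.

```python
def parse_query(query: str) -> tuple[set[str], set[str]]:
    """Split a query into required (+word) and excluded (-word) terms."""
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
    return required, excluded


def matches(page_text: str, query: str) -> bool:
    """True if the page has every required word and no excluded word."""
    required, excluded = parse_query(query)
    words = set(page_text.lower().split())
    return required <= words and not (excluded & words)


QUERY = "+Barney +dinosaur -bash -kill -maim -destroy -hate"

print(matches("Barney the purple dinosaur sings", QUERY))
print(matches("I hate Barney the dinosaur", QUERY))
```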
What people ask about
(and why)
Unofficial SIGMOD survey question
How many people here
search the web for
“adult sites”?
Top 15 queries on the WWW
• sex
• Playboy
• Penthouse
• chat
• Hustler
• nude
• porn
• erotica
• games
• pornography
• porno
• adult
• ESPN
• pussy
• Pamela Anderson
* I am not making this up! This list is real!
• What does that mean?
• “Uhh… I was just testing!”
Unofficial SIGMOD trivia question
• Q: What famous IR researcher asked in
1995 “Is this because of the
Communications Decency Act (CDA)?”
• A: Bruce Croft
Why it happens
(possible explanations)
• Research on CDA
• Curious what others are looking at
• Many new technologies are driven by sex:
•VCR
•Hotel movies on demand
• People are naturally horny
What it means
• Human race in no danger of extinction
• Corporate libraries doing a great job in
technical areas
• Traditional sex education inadequate
• Some of you are not telling the truth
• Audience surveys are not always accurate
• Bill Gates should admit to Congress that
Pamela Anderson is more important than he is
• If you didn’t raise your hand, you may need
professional help!
The secret Infoseek backup bizplan
• Selling our list of porn sites
We never pursued it…
… But other companies did!
•Sinfoseek
•Infoseak
•Nymfoseek
•Infopeek
•...
Relevance ranking Web sites
Facts about Queries
• Most queries are short
•Average length approx. 2.2
•10% use query syntax (usually incorrectly)
•1% use advanced search
•Noun phrases only
• Precision more important than recall
•Users expect precision in top results
Relevance ranking objectives
Must use several techniques to determine
“relevance”:
• Page has query term(s)
• Popular usage of the term, e.g., penthouse,
java, adult, “evil empire”, ...
• Page quality
• Page/site popularity
• Spam reduction/elimination
• Porn reduction
Relevance ranking factors
• Query terms: tf*idf
• Usage: Hyperlink text, thesaurus
• Quality: site quality, dates, depth, …
• Popularity: External link count, proxy stats
• Spam: unusual word/phrase statistics (tf
limiting)
• Porn: site exclusion list, naughty phrase
list
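To make the first and fifth factors concrete, here is a small tf*idf sketch with a cap on term frequency. Reading "tf limiting" as a simple cap, the `tf_cap` value, and the toy documents are all assumptions for illustration; the slides do not give Infoseek's actual formula.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs, tf_cap=3):
    """Score each document as sum over query terms of min(tf, tf_cap) * idf.

    docs maps a document name to its list of tokens.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        total = 0.0
        for term in query:
            if df[term] == 0:
                continue                # term appears nowhere in the collection
            idf = math.log(n / df[term])
            total += min(tf[term], tf_cap) * idf
        scores[name] = total
    return scores


docs = {
    "java_tutorial": "java language tutorial java classes".split(),
    "coffee_shop": "coffee java espresso".split(),
    "spam_page": ("java " * 50).split(),   # one word repeated 50 times
    "travel": "island travel guide".split(),
}
scores = tf_idf_scores(["java", "tutorial"], docs)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

With the cap in place, repeating a word 50 times earns no more credit than repeating it three times, so the page that matches both query terms still ranks first.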
Relative weighting of these factors is
tricky and subjective
Should “evil empire” return
Microsoft as the top hit?
Living in a world of an infinite number
of documents
The problem (user view)
• Too hard to find things even though only
100M documents indexed
• Often precision and relevance, NOT recall
• “intel” in the title search gives over 200 hits
just like this:
Index of /CPANlocal/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/
• Query ambiguity, e.g., “baby Bells”
The problem (vendor view)
• Speed
• Size
• Cost
• Freshness
• Load on the Internet/bandwidth (both sides)
• Quality (Spam/porn)
• Will people be able to find what they are looking
for as the net grows?
Today’s approach sucks
Suck all content into a
centralized search engine
Infoseek
All the world’s
content
Is there a better way?
• We might start by asking the question:
“How do people find information today?”
Centralized searching techniques
are rarely used in real life...
• Ask God (and pray for an answer)
• Ask DIALOG …and pray...
• WWW search (new!)
What people DO use is
decentralized searching
Source 1
Source 2
Question
...
Source N
Answers and
more sources
Human distributed searching
attributes
• Faster than a computer!!!
• Complete
• Accurate
• Can be used to validate an answer
• Will always find an answer (eventually)
• No specialized hardware
• All humans have the same CPU speed/RAM
So can’t we design a computer
distributed search network that is as
fast and accurate and complete as
our human distributed search
network?
Our goal
• Don’t necessarily mimic the process, but
adapt the process to the medium
One approach
• User types query
• System searches databases of popular
pages as well as meta descriptions of
other databases
• Repeat until all websites have been
searched
NOTE: This is the fastest way to search an
infinite amount of data
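One way to picture the approach above, as a sketch only: each database carries a meta description of what it covers, and a query is forwarded only to databases whose description mentions the query term. The `Database` class, its fields, and the pruning rule are hypothetical; "repeat" is read here as "repeat until no further matching database is known".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Database:
    name: str
    description: set[str]                 # meta description of its coverage
    pages: dict[str, set[str]]            # page name -> words on the page
    knows: list[str] = field(default_factory=list)  # databases it can refer to


def distributed_search(term: str, start: str, databases: dict[str, Database]) -> list[str]:
    """Search the starting database, then follow matching referrals breadth-first."""
    hits, seen, queue = [], {start}, deque([start])
    while queue:
        db = databases[queue.popleft()]
        hits += [page for page, words in db.pages.items() if term in words]
        for other in db.knows:
            # Forward the query only where the meta description matches.
            if other not in seen and term in databases[other].description:
                seen.add(other)
                queue.append(other)
    return hits


databases = {
    "popular": Database("popular", {"news"},
                        {"portal": {"news", "jazz"}, "home": {"weather"}},
                        ["music", "sports"]),
    "music": Database("music", {"jazz", "opera"},
                      {"jazz_top10": {"jazz", "cds"}}, ["archive"]),
    "sports": Database("sports", {"espn", "scores"}, {"scores": {"espn"}}),
    "archive": Database("archive", {"jazz"}, {"old_jazz": {"jazz", "1959"}}),
}

print(distributed_search("jazz", "popular", databases))
```

The sports database is never contacted for a jazz query, which is the point: the work done depends on how many sources are relevant, not on how many exist.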
What we learned
• Relatively weak engines with no proximity
got a wide following: people couldn’t see
through the hype
• Bigger was better
People lie
• We have “concept searching”
• We’re growing faster than the net
• We’ve indexed 95% of the net
• We have more URLs than anyone else
What we learned
• Competing for the Internet customer is
not always a case of who really has:
• the best engine
• the highest quality content or the most
content
• the best price, best GUI, or the best product
• It’s more about:
• brand name
• convincing the customer you are the best
What we learned
• Ads don’t sell themselves
• If you do 1M ads per day, you’re cookin’
• Lots of competition
• Switching costs are low
• User behavior can be tracked
• Seemingly identical pages can have
dramatically different click through
Mistakes we made
• Not pressing for branding
• Slow to recognize ad model
• No ! at the end of our name
Ultra required new thinking
• The traditional IDF formula breaks down
for 1 Billion documents
• Existing data structures would never work
• “Managing Gigabytes” didn’t go far
enough
• Inktomi approach was too inefficient
• No sacred cows
Ultra: Designed for speed
• Speed/space tradeoff
• Architected from the ground up for
1 billion docs and 1,000 queries/sec
• Everything is done in parallel and multithreaded
• Limited disk I/O
• Small in-RAM tables
• Stable connections
Ultra size
• Parallel worms (multi-process, multithreaded)
• Proprietary database required; OODBs
too slow
• Change frequency monitored
• >50M URLs
Feature set
• Natural language queries
• +, -, “phrases”
• Fields (link, url, site, title)
• Case sensitivity
• Stemming
• Approximate matching for phrases
• Gets faster the longer the query
• 8 shortest term lists
• Space invariant, i.e., CD-ROM = CDROM
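The slide does not explain why longer queries get faster. One plausible reading, sketched below, is that an all-terms query intersects only the shortest term lists, rarest first, so every extra term shrinks the candidate set early. The index layout, the function name, and the use of `max_lists=8` are assumptions for illustration, not a description of Ultra's internals.

```python
def and_query(terms, index, max_lists=8):
    """Intersect the posting lists of the query terms, shortest lists first.

    index maps a term to a list of document ids. Only the max_lists
    shortest lists are used, so very common terms may be skipped.
    """
    lists = sorted((index.get(t, []) for t in terms), key=len)[:max_lists]
    if not lists:
        return []
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
        if not result:
            break                      # nothing can match; stop early
    return sorted(result)


index = {
    "the": list(range(1000)),          # very common term, long list
    "infoseek": [3, 17, 42, 99],
    "ultra": [17, 42, 500],
}

print(and_query(["the", "infoseek", "ultra"], index))
```

Starting from the three-entry list for "ultra" means the thousand-entry list for "the" is only ever checked against a handful of candidates.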
Spamming
• A royal pain
• Real-time made it worse
• Used “site stop lists” and repeated words
in document
• Interesting problems arise:
• If dup contents submitted, do you remove
dups?
• If docs ranked the same, which do you show
first?
My favorite search engine today:
Google
• Big
• Fast
• Relevant: Uses referral network of link
statistics to determine best page to return
for a query. This greatly reduces the spam
problem too.
• Simple: assumes “and” between words
Google
• By “ordering” pages based on “importance”, you
can construct databases of pages:
•Best 10M web pages
•2nd best 10M web pages
•3rd best …
•…
• To search 1 trillion pages, you might only search
1 database since most people never look past
top 100 hits
• There is a technical term for this technique…
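A small sketch of the tiered idea above: databases are ordered from most to least important, and the search stops as soon as enough hits have been collected. The data layout and the `wanted` parameter are hypothetical.

```python
def tiered_search(term, tiers, wanted=100):
    """Search importance-ordered databases until `wanted` hits are found.

    tiers is a list of databases, best first; each maps a page name to
    the set of words on that page. Returns (hits, databases_searched).
    """
    hits = []
    searched = 0
    for tier in tiers:
        searched += 1
        hits.extend(page for page, words in tier.items() if term in words)
        if len(hits) >= wanted:
            break                      # the user never looks further
    return hits[:wanted], searched


tiers = [
    {"a": {"x"}, "b": {"x"}},          # best pages
    {"c": {"x"}},                      # 2nd best pages
    {"d": {"x"}},                      # 3rd best pages
]

print(tiered_search("x", tiers, wanted=2))
print(tiered_search("x", tiers, wanted=3))
```

When the first database already yields enough hits, the remaining databases are never touched, however many of them exist.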
Why start Propel?
• We spent too much time at Infoseek
building infrastructure (same as everyone
else)
• Isn’t it time we stopped re-inventing the
wheel?
Desktop Application Development
[Figure: two stacks. The old MSFT way: C, C++ on MSDOS. The new MSFT way: C, C++ on MS Windows on MSDOS.]
Internet Application Development
[Figure: two stacks compared.
Current Way: Web Servers and App Servers alongside RAS, Integration, Systems Mgmt, Search, Cache, Messaging, DB, Directory, Legacy Systems, …
Propel Way (Total Systems Approach): Web Servers and App Servers on the Propel Distributed Services Platform, over DB, Directory, Legacy Systems, …
Callouts: RAS, Performance, Quality; TTM, IT & Labor Costs]
Propel Long Term Direction
[Figure: Retail Applications (B2C E-Commerce, Point of Sale, …), Business Applications (B2B E-Commerce, Marketplace, …), and 3rd Party Applications (CRM, ERP, Financials, …), all on the Propel Distributed Services Platform]
My directives
• Design it right: complete systems architecture.
• Give me the best possible architecture
• Forget standards if you must
• Make it infinitely scalable and arbitrarily
reliable with the highest possible performance
and super-easy maintainability and high levels
of data abstraction.
• No bottlenecks. Nothing that doesn’t scale.
• Must handle worst case nightmares, e.g., eBay
for search, …
My directives
• Ship in a year.
• Memory is now cheap. Take advantage.
• TCP/IP packets are like 747s.
• Touching disk or Oracle is death
• I’m willing to sacrifice a tiny amount of
latency for greater system maintainability.
• Handle data and code versioning cleanly.
• No system downtime ever.
My question
Q: “So how hard can that be?”
A: Mike and crew couldn’t make it to
SIGMOD this year
Unofficial Propel Jingle
You need that new site up by yesterday...
so it's got to be Java...
and your RAS is on the line!
That's exactly why we founded Propel.
Propel Distributed Services Platform
• Java based distributed server architecture
• Eliminate traditional infrastructure
bottlenecks
• Built-in scalability and fault-tolerance
Scalable, Fault-tolerant, Clustered Messaging System
Distributed Data Managers
In-Memory
Database
Java Object Mapping
Global Cache
Integrated Search & Queuing
Advanced Deployment
Performance Mgmt & System Admin
Propel Distributed Services Platform
Current Internet Architecture
[Figure: three tiers.
Web Server Tier: 8 Web Servers
Application Server Tier: 4 E-Commerce Applications, 1 Custom Application
Data Management Tier: Messaging Package, Database, Cache, Systems Management, Search Engine, Integration, Directory Server, Legacy Systems]
Limitations
• Complex to set up for
• Scalability
• Fault-tolerance
• Manageability
• Expensive high-end
servers (4-64 CPU)
• Customize application
to handle RAS
• High development &
IT costs
The Propel Architecture
[Figure: Propel architecture, top to bottom.
Web Server Tier: 8 Web Servers
Application Server Tier: E-Commerce Applications and Custom Applications (Merchandising, Order Management, Customer Relationship Management, Analytics & Reporting)
Propel Distributed Services Platform: Scalable, Fault-tolerant, Clustered Messaging System; Distributed Data Managers; In-Memory Database; Java Object Mapping; Global Cache; Integrated Search & Queuing; Advanced Deployment; Performance Mgmt & System Admin
Below the platform: Database Server, Directory Server, Legacy Systems]
Benefits
• Designed for built-in:
• Scalability
• Fault-tolerance
• Manageability
of end-to-end system
• Multiple, low cost
servers (1-4 CPU)
• Integrated RAS
• Reduced development
effort & IT resources
Propel Technology Innovations
• Clustered Messaging System
• Distributed Data Management
•In-memory databases
•Persistent Queuing
•Integrated Search
• Global Cache
• Advanced Deployment
• Java Object Mapping Layer
• Distributed Services Manager
Clustered Messaging System
• Centralized hub for all inter-process
communication
• Transforms all applications and database
managers into clustered distributed services
• Enables transparent replication of any service
• Dynamic load balancing across all services
• Incremental s/w upgrades without site downtime
Distributed Data Management
• Main-memory speeds
• Transparent single view of all distributed data managers
•In-memory & disk-based databases
• Consistency & durability of e-commerce transactions
• Unified query and search capabilities spanning data,
text, and persistent queues
• Persistent queues integrated into data mgmt framework
•Flexible programming model extends database functions
(queries, joins, etc.) to queues
• Integrated parametric & keyword search
•Relevance ranking, custom dictionary, index refreshing, …
Distributed Data Management
[Figure: Example of partitioned query processing over User, Order, and Product data. Multi-master and master-slave routing spread requests across nodes: one User node on a third-party DBMS, and Order, Prod, and ProdN in-memory database nodes (two of each).]
Persistent Queues
[Figure: an Order Queue shown next to Database Tables. Queue entries:
Order #98045, Customer #3742, Status “urgent”, Destination “ERP”, …
Order #97996, Customer #16, Status “in process”, Destination “ERP”, …
Order #97993, Customer #831, Status “normal”, Destination “Financial Svcs”, …
Callout: Unique ability to conduct queries & joins across queues & tables]
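To illustrate what querying and joining across queues and tables can look like, here is a sketch that models the order queue as an ordinary table in SQLite and joins it to a customer table. This is a stand-in for the idea, not Propel's API; the table names, column names, and customer names are hypothetical, while the order numbers, statuses, and destinations come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_queue (
        position    INTEGER PRIMARY KEY AUTOINCREMENT,  -- queue order
        order_id    INTEGER,
        customer_id INTEGER,
        status      TEXT,
        destination TEXT
    );
    INSERT INTO customers VALUES
        (3742, 'Customer A'), (16, 'Customer B'), (831, 'Customer C');
    INSERT INTO order_queue (order_id, customer_id, status, destination) VALUES
        (98045, 3742, 'urgent', 'ERP'),
        (97996, 16, 'in process', 'ERP'),
        (97993, 831, 'normal', 'Financial Svcs');
""")

# Join the queue against a regular table, filtering on queue attributes.
urgent_for_erp = conn.execute("""
    SELECT q.order_id, c.name
    FROM order_queue AS q
    JOIN customers AS c ON c.id = q.customer_id
    WHERE q.destination = 'ERP' AND q.status = 'urgent'
    ORDER BY q.position
""").fetchall()

print(urgent_for_erp)
```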
Java Object Mapping Layer
• Java bean-based programming model for
unified access to all distributed data mgmt
capabilities
• Unifies access to DB tables, text data,
persistent queues
• Automatically generates Java beans
based on XML specifications
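A sketch of the generate-classes-from-XML idea, written in Python rather than Java to keep the examples in one language. The XML vocabulary (`bean`, `property`, `name`, `type`) is invented for illustration and is not Propel's specification format.

```python
import xml.etree.ElementTree as ET
from dataclasses import make_dataclass

# Hypothetical specification of one bean.
SPEC = """
<bean name="Order">
  <property name="order_id" type="int"/>
  <property name="customer_id" type="int"/>
  <property name="status" type="str"/>
</bean>
"""

TYPES = {"int": int, "str": str, "float": float}


def generate_bean(xml_spec: str):
    """Build a data class whose fields mirror the <property> elements."""
    root = ET.fromstring(xml_spec)
    fields = [(p.get("name"), TYPES[p.get("type")])
              for p in root.findall("property")]
    return make_dataclass(root.get("name"), fields)


Order = generate_bean(SPEC)
order = Order(order_id=98045, customer_id=3742, status="urgent")
print(order)
```

Application code then works with `Order` objects and never sees the underlying table, text index, or queue.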
Java Object Mapping Layer
[Figure: two panels. Code Generation: XML object-relational descriptions feed the Java Object Mapping, which produces Beans. Runtime Environment: the Application calls Bean APIs; the Beans run on the OM Runtime over Platform Data.]
Global Cache
• Data cache for dynamically generated
page-fragments
• Easily incorporated into JSP pages thru
use of simple tags (w/name, scope,
lifetime)
• Multi-level distributed architecture
•L1 – local application server caches
•L2 – scalable system-wide cache(s)
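A minimal sketch of the two-level lookup described above: check the local (L1) cache, then the shared (L2) cache, and render the fragment only if both miss or have expired. The class, the use of a plain dict for L2, and the injected clock are assumptions for illustration; the tag's scope attribute is not modeled.

```python
import time


class TwoLevelCache:
    """Per-server L1 cache in front of a shared L2 cache."""

    def __init__(self, shared_l2: dict, clock=time.monotonic):
        self.l1 = {}            # local to this application server
        self.l2 = shared_l2     # shared by all application servers
        self.clock = clock

    def get(self, name, lifetime, render):
        """Return the cached fragment, rendering it only on a full miss."""
        now = self.clock()
        for level in (self.l1, self.l2):
            entry = level.get(name)
            if entry is not None and entry[1] > now:
                self.l1[name] = entry      # promote L2 hits into L1
                return entry[0]
        entry = (render(), now + lifetime)  # (fragment, expiry time)
        self.l1[name] = entry
        self.l2[name] = entry
        return entry[0]


# Demonstration with a fake clock so expiry is deterministic.
now = [0.0]
calls = []


def render_top_jazz():
    calls.append(1)
    return "<ul>Top Jazz CDs</ul>"


shared = {}
server1 = TwoLevelCache(shared, clock=lambda: now[0])
server2 = TwoLevelCache(shared, clock=lambda: now[0])

server1.get("top_jazz", 60, render_top_jazz)   # miss everywhere: renders
server2.get("top_jazz", 60, render_top_jazz)   # L1 miss, L2 hit: no render
renders_before_expiry = len(calls)
now[0] = 61.0
server1.get("top_jazz", 60, render_top_jazz)   # expired: renders again
print(renders_before_expiry, len(calls))
```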
Global Cache
[Figure: two application servers, each with a local cache (L1), in front of the global cache servers (L2). User requests shown: “Top Jazz CDs” and “Today’s Music News”. Cached page fragments shown include Top Jazz CDs, Top Blues CDs, Top Opera CDs, Billboard Top 10 CDs, Music News [TODAY], and Music News [YESTERDAY].]
Advanced Deployment
[Figure: deployment flow from Offline Data (Product Data, Customer Profiles, Business Rules) to the Live Production Site. Labels: Propel Merchant Tool; Content Management Tool; Propel Distributed Services Management; Deployment Version, Attributes, Deployment Type; Versioned Database Tables; Support in live production site; Distributed Services Manager.]
Distributed Services Manager
Benefits:
Propel Distributed Services Platform
• Reduced development costs
• Faster time to market with new features
• Incremental capacity on demand and fault-tolerance through built-in replication
• Transparent data tier scalability through
distributed in-memory databases
• Lower cost server hardware infrastructure
Future challenges
Major challenges: Endure Mike
Carey jokes
Kirsch: I don’t think we should bump their
stock; we have to worry about stock parity,
you know.
Carey: Yes, you wouldn’t want to make a
parity error.
Mike Carey jokes: A major challenge
• Kirsch: I like the idea of a cache with a back-end store.
• Carey: Yeah, you wouldn’t want to go to the store without
cache.
• Kirsch: I think we need more people working on our
caching architecture.
• Carey: Maybe I can recruit some people from our
finance dept; they are experts in cash management.
They can manage large caches and do real-time cash
management.
Less serious problems
• Caching: when/where/how, objects vs
query results, consistency models, etc.
• Content mgmt: time to revisit the work
done in the 80's on CAD database
version/configuration schemes?
• Replication: need to develop satisfactory
schemes for geographic replication
(current ones don’t work)
Less serious problems
• Transactions: Need to provide simple and
effective models.
•Many application developers don't truly understand
the different consistency levels in SQL and
how/when to use them
•Developers have to work too hard to write
consistent/reliable applications (e.g., when talking
across the web, where you can't leave transactions
open, so you have to accept other notions of
atomicity and approaches to achieving them)
Future work
• Performance: latency, CPU, DB interface
• Minimizing communication:
•Local object caching with attribute callbacks
(possibly with optional min update frequency)
•Expanded use of UOIDs
• How can distributed applications perform as fast
as local applications
• Adaptive decisions on applications, data
replication, partitioning, master/slave, caching,
etc. made by the software, not the programmer
• Better software licensing
•Per message v. per CPU
Future work
• Many paradigms, tools, etc. needed to
write apps in that world today:
• SQL, then Java, and XML at the top
• Some of the W3C standards (schema,
query, protocol) need more
oversight/attention and input from the
database community to ensure that they're
done “right.”
Summary
• Google got it right: simple, big, fast,
relevant, focus
• Propel is trying to make RAMPS
transparent to application programmers
including transparent database scalability
and geographical distribution
• Get involved in W3C standards
Questions?
•Mike Carey has answers!
•Mike.carey@propel.com