Search-Based Applications: the Maturation of Search
Gregory Grefenstette
Exalead
Exalead S.A. © 2009

Maturation of Search

1960-1995
• Full text indexing
• Term weighting
• Stemming, morphological analysis
• Phrasal indexing
• Numerical fields
• Format extractors
• Indexing schemes

1995-2005
• Web search
• Link analysis
• Freshness
• Spam detection
• Facets, Categories

2005-2010
• Processing schemes
• Suggested queries
• Multimedia
• Database connectors
• XML, structured data
• Reporting
• Search as a Service

www.exalead.com/search
8 billion URLs, 2 billion images, 200 million videos
Wikipedia, cloud tags
also Labs.exalead.com

Two ways to find information
DATABASES VS SEARCH ENGINES

Recent Past

DATABASES
• Structured Data
• Transactions
• Precise
• All tuples
• SQL
• Slow

SEARCH ENGINES
• Text
• Similarity
• Ranking
• Intuitive
• Fast
• Partial

More Recent

DATABASES
• Structured Data
• Transactions
• Precise
• All tuples
• SQL
• Slow
• Top-K
• Column store
• Map Reduce
• Data Cube

SEARCH ENGINES
• Text
• Similarity
• Ranking
• Connectors
• Facets
• Map Reduce
• Tables
• Intuitive
• Fast
• Partial

NOW
[Diagram: DATABASES, SEARCH BASED APPLICATIONS, SEARCH ENGINES]

Search based Application
An application which uses a search engine component, but whose final purpose is not searching for a document, but rather a domain-oriented process result
– Examples:
• Custom response management
• Logistic tracking and tracing
• Contextual Advertising
• Database reporting after offloading

Current situation
Databases are the backbone of search in information systems
[Diagram: Database, Data Warehouse, BI reports, Business processes, DataMart, Front-office users]

Search-enabled application
Optimized solution for information access
[Diagram: Database, Data Warehouse, BI reports, Business processes, Search Engine, Front-office users]

Drawbacks of Using Database Search As a Component

Standard Architecture
Search Based Architecture

How does a Search Based Application work?

Database converted to Business Items
Stored as structured documents
• Business items are concrete objects directly understandable by end-users
– Product, Customer, Purchase order, Technical support call
• Each business item becomes a document
• The simple, straightforward format of the document index gives both performance and ease of use
• The search engine can offer a rich query language that supports queries as complex and advanced as SQL, despite the flat data model
• The search engine must support
– typed fields, intra-field scope search, categories/facets

Database into structured documents

Product_ID | Product_Name
123 | control switch
124 | red warning light

Product_ID | Manufacturer_ID
123 | 345
123 | 8574
123 | 4483

Manufacturer_ID | Manufacturer_NAME
345 | ACME Inc.
8574 | The Control Switch Company
4483 | Karl GmbH

All the manufacturers of a product are aggregated into a single flat document…

Product_ID | Product_Name | Manufacturer_Names
123 | control switch | ACME Inc ; The Control Switch Company; Karl GmbH
124 | red warning light | …

Scope Search
… but the manufacturer names can still be searched as individual records with scope search ("ACME GmbH" does not match the document here)

Hierarchical categories

Product_ID | Color | Brand | Fragile | Nb of wheels | Wheel type
123 | Red | ACME | Y | 3 | 2

Multiple kinds of attributes can be mixed in the same category field. The hierarchical tree structure of the categories preserves the differences between attribute types.

Product_ID | Country
123 | France
123 | UK
123 | Germany

Multi-valued attributes can also be represented by categories. A single category field can be used to store hundreds or thousands of attribute columns.
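
As a rough illustration of how attribute columns and multi-valued attributes might be folded into one hierarchical category field, here is a minimal Python sketch. The function names `to_category_paths` and `values_of` are hypothetical, and the code is not Exalead's indexing API; it only mimics the `Type/Value` path notation used on the slide.

```python
def to_category_paths(single_valued, multi_valued):
    """Fold attribute columns into one list of 'Type/Value' category paths.

    The attribute type becomes the parent node of the path, so values of
    different attribute types stay distinguishable inside a single field.
    """
    paths = [f"{name}/{value}" for name, value in single_valued.items()]
    for name, values in multi_valued.items():
        # A multi-valued attribute simply contributes one path per value.
        paths.extend(f"{name}/{value}" for value in values)
    return paths


def values_of(paths, attribute_type):
    """Recover the values stored under one attribute type."""
    prefix = attribute_type + "/"
    return [p[len(prefix):] for p in paths if p.startswith(prefix)]


# Product 123 from the slide: one attribute row plus three country rows.
product_123 = to_category_paths(
    {"Color": "Red", "Brand": "ACME", "Fragile": "Y",
     "Nb_wheels": 3, "Wheel_type": 2},
    {"Country": ["France", "UK", "Germany"]},
)

# Single category field, as it could be stored in the flat document.
attributes_field = " ; ".join(product_123)
print(attributes_field)
print(values_of(product_123, "Country"))
```

Adding a new attribute type in this scheme only adds new path prefixes; the document layout itself does not change, which is the point the slide makes about storing many attribute columns in one field.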

Product_ID | Attributes
123 | Color/Red ; Brand/ACME ; Fragile/Y ; Nb_wheels/3 ; Wheel_type/2; Country/France ; Country/UK; Country/Germany
124 | …

Multi-dimensional facets
• Search result facets provide aggregate values computed on the fly with the search results list
– One single search query can return the equivalent of dozens of "GROUP BY" SQL clauses
– Numerical values associated with facets (count, score, …) can be used to perform complex computations on the results list
• Search performance is not affected by the size of the category tree
– Thousands of attribute types can be represented by categories
– Facets are dynamically selected by the search results: the displayed attributes are always consistent with the search query (e.g. "color" and "engine type" when searching for a car, "screen size" and "CPU speed" when searching for a laptop)

CASE STUDY
LOGISTICS TRACK & TRACE

Gefco overview
• A subsidiary of French car maker PSA (Peugeot, Citroën)
– Now does most of its business outside of PSA
• Logistics operator
– Carries cars from factories to dealers (road, rail)
– Carries freight (parcels; originally spare parts)
– Supply chain and logistic platform design
• 3.5B€, 10 000 employees, 100 countries

The original pain
• Classical multi-criteria search over Oracle, 2 million rows
• Poor performance despite 2 years of optimization
– Response times on the order of minutes
– Users were asked to run only simple queries, preferably at set hours

From forms to a search box

New application
With operational reporting

Partner: French Post Office

Context
• Project part of the strategic plan of La Poste to improve customer service
• Tracing of the mail
• Tracing of the resolution of incidents

Stakes
• Management of high volumes: 60 million daily records with a 14-day history
• Management of peaks of 1 000 updates per second
• Internal and external access to the information, respecting confidentiality

Exalead Choice
• Database offloading
• High scalability and management of large volumes
• Open API to provide high-level applications

• Tracing of incidents
• Real-time system
• Used as an internal audit tool for the mail
• Suggestion of addresses for customers
• Search in file numbers, addresses, names, etc.

Case Study: RightMove

Rightmove

Stats
• 2 million real estate ads, 29 million monthly visitors
• Peak throughput 400 queries per second (QPS)
• 99.99% availability rate

Gains
• Reduce Costs and Improve Performance through Database
– Replaced 30 Oracle CPUs with 9 search CPUs
– Reduced cost of search per 100 queries from £0.06 to £0.01
• Rapid Time to Market and Development Independence
– New platform to market in 3 months
– Data connections handled by a built-in ODBC connector
– Application customization via open, standards-based APIs
– IT staff achieved independence to modify or expand functionality
– Easy scaling by adding inexpensive commodity hardware

Exalead Choice
• Improve the End User Experience, with Simpler, More Robust Search and More Timely Data
– Rightmove's new SBA provides more intuitive, more powerful search and navigation features and automatically incorporates data facets
– Data refresh rate of less than 2 minutes

Advantages of Search Based Applications

Conclusions
• Search engines mature
– Structured data, high volume, high speed
• Search based Applications offer
– Usage: Search interface familiar to user
– Performance: Search engine geared to search, eases load on database platform
– Agility: Original database design untouched, reconfiguring output lightweight
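
The scope search and facet ideas from the earlier slides can be sketched in a few lines of Python. This is a toy in-memory model, not Exalead's engine or API: the function names `scope_search`, `flat_search` and `facet_counts` are hypothetical, and the manufacturer and category values for product 124 are invented sample data (the slides leave them as "…").

```python
from collections import Counter

# Flat business-item documents. Multi-valued fields are lists, with one
# entry per original database record. Product 124's values are made up.
DOCS = [
    {"id": 123, "name": "control switch",
     "manufacturers": ["ACME Inc", "The Control Switch Company", "Karl GmbH"],
     "categories": ["Color/Red", "Brand/ACME", "Country/France", "Country/UK"]},
    {"id": 124, "name": "red warning light",
     "manufacturers": ["Karl GmbH"],
     "categories": ["Color/Red", "Brand/Karl", "Country/Germany"]},
]


def flat_search(docs, field, query):
    """Naive search: all terms must occur anywhere in the aggregated field."""
    terms = query.lower().split()
    return [d for d in docs
            if all(t in " ".join(d[field]).lower().split() for t in terms)]


def scope_search(docs, field, query):
    """Scope search: all terms must occur inside one single value of the field."""
    terms = query.lower().split()
    return [d for d in docs
            if any(all(t in value.lower().split() for t in terms)
                   for value in d[field])]


def facet_counts(results):
    """Count results per facet value, one pass over the result list.

    Each facet plays the role of one SQL 'GROUP BY ... COUNT(*)' clause.
    """
    counts = {}
    for d in results:
        for path in d["categories"]:
            facet, value = path.split("/", 1)
            counts.setdefault(facet, Counter())[value] += 1
    return counts


# "ACME GmbH" spans two different manufacturers of product 123, so the
# naive search matches it while the scope search does not.
print([d["id"] for d in flat_search(DOCS, "manufacturers", "ACME GmbH")])
print([d["id"] for d in scope_search(DOCS, "manufacturers", "ACME GmbH")])
print(facet_counts(scope_search(DOCS, "manufacturers", "Karl GmbH")))
```

A real engine would compute the same aggregates from its index rather than by scanning documents, but the shape of the result is the same: one query returns both the hit list and the per-facet counts.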