FAST Data Search
Stein Jørgen Ryan, PhD
Senior software architect, FAST Search and Transfer ASA
21sep05

A brief history of time
• 1992: Hardware based engine. Arne Halaas et al, NTNU.
  – Working system was demonstrated at IFI.
  – Searched Encyclopaedia Britannica preloaded into 320 MB of RAM!
• 1994-1997: FTP-search. Tor Egge et al, NTNU.
  – Initially used MS-160.
  – MS-160 phased out by software based solution.
  – Software-based FTP-search still in use today.
• 1997: FAST incorporated.
  – Fast web search.
  – Scalable cluster.
  – Picture and video compression.
• 1998: FAST signs agreements with Lycos and DELL.
• 2003: Overture acquires FAST web search division.
  – FAST shifts focus to Intranet search.

FAST architecture key points
• Do not assume responsibility for storing data.
  – Index stores pointers to the data (URL/filename).
• Mother of all simplifications: Index is read-only.
  – May add new documents, but existing documents hardly ever change.
  – Perfect for document archives.
  – Dramatic simplification compared to traditional DBMS.
  – Transactions are of no concern.
• If the data source is ever updated:
  – Rebuild index at regular time intervals.
• Ideal for clustering:
  – Indexing: Each node builds an index for part of the universe.
  – Searching: Just ask each node and merge their answers.
  – Embarrassingly parallel?

Overview of FAST ESP
[Architecture diagram. Block labels: Security Access Module, Content API, Query API, Content Connectors, Query Connectors, Management & Application Services, Tools & Tool Building Framework. Other labels: ACL Monitor, User Monitor, Web, Application, Database, File, Portlets, Applications, SFE Manager, Custom, Deployment, Business Application, Administration.]

Logical data model
• The answer to a query is a sequence of documents sorted by relevance.
• Documents consist of
  – Attributes of atomic data types. Integer, date etc.
  – Nested sections of words.
  – NOTE: A document with only attributes can be seen as a relational tuple.
• Predicates on document attributes:
  – <, >, = etc
• Predicates on sections:
  – Section contains a given word.
  – Section satisfies a given relation to another section in the same document (ref XPATH): Child, Parent, Ancestor, Descendant, Following, Preceding.
• All data fed to the system is normalized to fit this data model.
  – XML is used to represent normalized documents.

Document processing stage
[Diagram not recoverable.]

Typical FAST ESP installation
[Installation diagram. Labels: Data Source, Content Distributor, Doc Proc Cluster, Crawler Cluster, Indexing Cluster, Anchor Cluster, Search Cluster, Status Server, Admin Server, QR Cluster, Load Balancer, App Front End.]

Relevance
• Results from search nodes are merged and then sorted by relevance.
• Several aspects contribute to the relevance of a document:
  – Term frequency (document/global).
  – Freshness (document age).
  – Context (e.g. title section).
  – Proximity of occurrences in multiple word query.
  – Authority (number of references from other documents).
  – Rank tuning (pay hard cash for high relevance).

Aggregation and navigation
• Aggregator: Function of an attribute over all returned documents:
  – Min
  – Max
  – Average
  – Histogram
• Navigation:
  – Include only documents that satisfy a given predicate on an aggregator.
  – Example: Find all candidates with above average test score that live in Oslo.
  – When used with histograms this is Query By Example (QBE).

Demo
• http://www.thisistravel.co.uk
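
Sketch: partitioned indexing and merged search

The "Ideal for clustering" and "Relevance" slides describe each node indexing part of the universe, and a query being answered by asking every node and merging the results by relevance. The following is a minimal sketch of that idea, not FAST's implementation. The names `SearchNode` and `cluster_search` are hypothetical, and raw term frequency stands in for the real relevance function, which the slides do not specify.

```python
import heapq
from collections import defaultdict
from itertools import islice


class SearchNode:
    """One node holding a read-only inverted index over its share of the documents."""

    def __init__(self, docs):
        # docs maps a document pointer (URL/filename) to its text.
        # Only the pointer is kept in the index; the text itself is not stored.
        self.index = defaultdict(dict)
        for pointer, text in docs.items():
            for word in text.lower().split():
                self.index[word][pointer] = self.index[word].get(pointer, 0) + 1

    def search(self, word):
        """Return (score, pointer) hits, best first. Score is plain term frequency (assumption)."""
        postings = self.index.get(word.lower(), {})
        return sorted(((count, pointer) for pointer, count in postings.items()), reverse=True)


def cluster_search(nodes, word, limit=10):
    """Ask every node, then merge the already-sorted partial results by descending score."""
    partial_results = [node.search(word) for node in nodes]
    merged = heapq.merge(*partial_results, reverse=True)
    return [pointer for _, pointer in islice(merged, limit)]


# Each node indexes a disjoint part of the document universe.
nodes = [
    SearchNode({"a.txt": "fast search fast"}),
    SearchNode({"b.txt": "fast fast fast", "c.txt": "slow search"}),
]
top_hits = cluster_search(nodes, "fast")
```

Because every node returns its hits already sorted, the merge step only has to interleave the lists; no node needs to know about any other node's documents.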
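
Sketch: aggregators and navigation

The "Aggregation and navigation" slide defines an aggregator as a function of an attribute over all returned documents, and navigation as filtering on a predicate over such an aggregator. A minimal sketch of the slide's own example (candidates with above-average test score living in Oslo) could look as follows. The functions `aggregate` and `navigate`, the attribute names, and the sample data are all illustrative assumptions, not part of FAST ESP.

```python
from collections import Counter


def aggregate(docs, attribute):
    """Compute the four aggregators from the slide over one attribute.

    Assumes docs is non-empty and the attribute is numeric.
    """
    values = [doc[attribute] for doc in docs]
    return {
        "min": min(values),
        "max": max(values),
        "average": sum(values) / len(values),
        "histogram": Counter(values),  # value -> number of documents
    }


def navigate(docs, predicate):
    """Keep only the documents that satisfy the predicate."""
    return [doc for doc in docs if predicate(doc)]


# Made-up result set: documents that carry only attributes (relational tuples).
candidates = [
    {"name": "A", "city": "Oslo", "score": 80},
    {"name": "B", "city": "Oslo", "score": 55},
    {"name": "C", "city": "Bergen", "score": 90},
    {"name": "D", "city": "Trondheim", "score": 55},
]

stats = aggregate(candidates, "score")
above_average_in_oslo = navigate(
    candidates,
    lambda doc: doc["score"] > stats["average"] and doc["city"] == "Oslo",
)
```

The aggregator is computed once over the full result set, and the navigation predicate then refers to it, which is what lets a user narrow results relative to the data actually returned.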