FAST Data Search Stein Jørgen Ryan, PhD 21sep05

FAST Data Search
Stein Jørgen Ryan, PhD
Senior software architect, FAST Search and Transfer ASA
21sep05
A brief history of time
• 1992: Hardware-based engine. Arne Halaas et al., NTNU.
– Working system was demonstrated at IFI.
– Searched Encyclopaedia Britannica preloaded into 320 MB of RAM!
• 1994-1997: FTP-search. Tor Egge et al., NTNU.
– Initially used MS-160.
– MS-160 phased out by software-based solution.
– Software-based FTP-search still in use today.
• 1997: FAST incorporated.
– Fast web search.
– Scalable cluster.
– Picture and video compression.
• 1998: FAST signs agreements with Lycos and DELL.
• 2003: Overture acquires FAST web search division.
– FAST shifts focus to Intranet search.
FAST architecture key points
• Do not assume responsibility for storing data.
– Index stores pointers to the data (URL/filename).
• Mother of all simplifications: Index is read-only.
– May add new documents, but existing documents hardly ever change.
– Perfect for document archives.
– Dramatic simplification compared to traditional DBMS.
– Transactions are of no concern.
• If the data source is ever updated:
– Rebuild index at regular time intervals.
• Ideal for clustering:
– Indexing: Each node builds an index for part of the universe.
– Searching: Just ask each node and merge their answers.
– Embarrassingly parallel?
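The key points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FAST's implementation: the names Node, build_cluster and search_cluster are hypothetical, the score is a raw term count, and documents are split round-robin across nodes.

```python
from collections import defaultdict
import heapq


class Node:
    """One cluster node holding a read-only inverted index for its share of the documents."""

    def __init__(self, docs):
        # docs: list of (url, text). Only the URL (a pointer) is stored, never the text.
        index = defaultdict(dict)
        for url, text in docs:
            for word in text.lower().split():
                index[word][url] = index[word].get(url, 0) + 1
        self._index = dict(index)  # never modified after the build: no transactions needed

    def search(self, word, k):
        # Return this node's top-k (score, url) pairs; the score is a raw term count.
        postings = self._index.get(word.lower(), {})
        return heapq.nlargest(k, ((count, url) for url, count in postings.items()))


def build_cluster(docs, n_nodes):
    # Indexing: each node builds an index for part of the universe (round-robin split).
    return [Node(docs[i::n_nodes]) for i in range(n_nodes)]


def search_cluster(nodes, word, k=10):
    # Searching: ask each node, then merge the partial answers.
    partial = [hit for node in nodes for hit in node.search(word, k)]
    return [url for _, url in heapq.nlargest(k, partial)]


docs = [
    ("file:///a.txt", "fast search fast index"),
    ("file:///b.txt", "search cluster"),
    ("file:///c.txt", "fast cluster node"),
]
nodes = build_cluster(docs, n_nodes=2)
hits = search_cluster(nodes, "fast")
```

Because no node ever needs to see another node's documents, the same query gives the same answer whether the universe is indexed on one node or on many. Rebuilding after a source update amounts to calling build_cluster again.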
Overview of FAST ESP
[Figure: FAST ESP block diagram. Recovered labels: Security Access Module (ACL Monitor, User Monitor); Content API; Query API; Content Connectors; Query Connectors; Management & Application Services; Tools & Tool Building Framework. Boxes labelled Web, Application, Database, File, Portlets, Applications, SFE Manager, Custom, Deployment, Business, Application, Administration.]
Logical data model
• The answer to a query is a sequence of documents sorted by relevance.
• Documents consist of
– Attributes of atomic data types. Integer, date etc.
– Nested sections of words.
– NOTE: A document with only attributes can be seen as a relational tuple.
• Predicates on document attributes:
– <, >, = etc.
• Predicates on sections:
– Section contains a given word.
– Section satisfies a given relation to another section in the same document (ref XPATH):
    • Child
    • Parent
    • Ancestor
    • Descendant
    • Following
    • Preceding
• All data fed to the system is normalized to fit this data model.
– XML is used to represent normalized documents.
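A minimal sketch of this data model in Python, for illustration only. The class and function names (Section, Document, attr_pred, section_contains, descendant_contains) are hypothetical, and only the Descendant relation is shown.

```python
import operator
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Section:
    name: str
    words: list = field(default_factory=list)
    children: list = field(default_factory=list)  # nested sections

    def descendants(self):
        # Yield every section below this one, depth first.
        for child in self.children:
            yield child
            yield from child.descendants()


@dataclass
class Document:
    attributes: dict  # atomic values: integers, dates, ...
    root: Section


def attr_pred(name, op, value):
    # Predicate on a document attribute: <, > or =.
    ops = {"<": operator.lt, ">": operator.gt, "=": operator.eq}
    return lambda doc: name in doc.attributes and ops[op](doc.attributes[name], value)


def section_contains(section_name, word):
    # Predicate: some section with this name contains the word.
    def pred(doc):
        sections = [doc.root, *doc.root.descendants()]
        return any(s.name == section_name and word in s.words for s in sections)
    return pred


def descendant_contains(ancestor_name, word):
    # Predicate: some descendant of a section named ancestor_name contains the word.
    def pred(doc):
        for s in [doc.root, *doc.root.descendants()]:
            if s.name == ancestor_name and any(word in d.words for d in s.descendants()):
                return True
        return False
    return pred


doc = Document(
    attributes={"size": 1200, "modified": date(2005, 9, 21)},
    root=Section("doc", children=[
        Section("title", ["fast", "search"]),
        Section("body", children=[Section("para", ["read", "only", "index"])]),
    ]),
)
```

A Document whose root section is empty carries only attributes, which is the relational-tuple case noted above.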
Document processing stage
Typical FAST ESP installation
[Figure: deployment diagram. Recovered component labels: Data Source, Content Distributor, Doc Proc Cluster, Crawler Cluster, Indexing Cluster, Anchor Cluster, Search Cluster, Status Server, Admin Server, QR Cluster, Load Balancer, App Front End.]
Relevance
• Results from search nodes are merged and then sorted by relevance.
• Several aspects contribute to the relevance of a document:
– Term frequency (document/global).
– Freshness (document age).
– Context (e.g. title section).
– Proximity of occurrences in multiple-word query.
– Authority (number of references from other documents).
– Rank tuning (pay hard cash for high relevance).
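One way to picture how these aspects could combine is a weighted sum. The sketch below is an assumption made for illustration: the formula, the weights, the collection size N_DOCS and the Hit fields are all hypothetical and say nothing about how FAST actually computes relevance.

```python
import math
from dataclasses import dataclass

N_DOCS = 1_000_000  # assumed size of the whole collection


@dataclass
class Hit:
    url: str
    term_freq: int      # occurrences of the query term in this document
    doc_freq: int       # number of documents containing the term (global)
    age_days: float     # document age
    in_title: bool      # the term occurs in the title section
    min_gap: int        # smallest word distance between query terms (1 = adjacent)
    inlinks: int        # references from other documents
    boost: float = 0.0  # rank tuning


def relevance(hit):
    tf_idf = hit.term_freq * math.log(N_DOCS / (1 + hit.doc_freq))
    freshness = math.exp(-hit.age_days / 365.0)  # decays with age
    context = 1.0 if hit.in_title else 0.0
    proximity = 1.0 / hit.min_gap                # closer terms score higher
    authority = math.log1p(hit.inlinks)
    # Arbitrary illustrative weights.
    return tf_idf + 2.0 * freshness + 3.0 * context + 2.0 * proximity + authority + hit.boost


def merge_and_rank(per_node_hits, k=10):
    # Merge the answers from all search nodes, then sort by relevance.
    merged = [hit for hits in per_node_hits for hit in hits]
    return sorted(merged, key=relevance, reverse=True)[:k]


a = Hit("a", 3, 1000, 30, False, 2, 10)
b = Hit("b", 3, 1000, 30, True, 2, 10)    # same as a, but the term is in the title
c = Hit("c", 3, 1000, 400, False, 2, 10)  # same as a, but older
ranked = [hit.url for hit in merge_and_rank([[a, c], [b]])]
```

With these weights the title match ranks first and the older document last.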
Aggregation and navigation
• Aggregator: Function of an attribute over all returned documents:
– Min
– Max
– Average
– Histogram
• Navigation:
– Include only documents that satisfy a given predicate on an aggregator.
– Example: Find all candidates with above average test score that live in Oslo.
– When used with histograms this is Query By Example (QBE).
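The candidate example above can be worked through in Python. The data, field names and helper functions (aggregate, histogram, navigate) are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical result set: documents that consist of attributes only.
candidates = [
    {"name": "Kari", "city": "Oslo", "score": 82},
    {"name": "Ola", "city": "Oslo", "score": 55},
    {"name": "Per", "city": "Bergen", "score": 91},
    {"name": "Anne", "city": "Oslo", "score": 74},
]


def aggregate(docs, attr):
    # Aggregators: functions of one attribute over all returned documents.
    values = [d[attr] for d in docs]
    return {"min": min(values), "max": max(values), "average": mean(values)}


def histogram(docs, attr):
    # Count the returned documents per attribute value.
    return Counter(d[attr] for d in docs)


def navigate(docs, predicate):
    # Navigation: keep only the documents that satisfy the predicate.
    return [d for d in docs if predicate(d)]


stats = aggregate(candidates, "score")
by_city = histogram(candidates, "city")

# Above-average test score and living in Oslo.
selected = navigate(
    candidates,
    lambda d: d["score"] > stats["average"] and d["city"] == "Oslo",
)
names = [d["name"] for d in selected]
```

Picking the "Oslo" bucket from by_city and passing it back as the predicate is the histogram-driven, query-by-example style of navigation.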
Demo
• http://www.thisistravel.co.uk