chapter4

Query Languages
Information Retrieval
Concerned with the:
• Representation of
• Storage of
• Organization of, and
• Access to
Information items.
Recap 1
Important points
– Data retrieval vs. information retrieval
– Users specify their needs via an intermediary
language.
– Documents are represented by an abstraction of their
content.
– Traditional model vs. berry-picking model
– Evaluation (precision/recall, single-value measures,
human measures)
– Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering)
– Collection characteristics (size, document relations)
Recap 2
Important points
– Content types (text, descriptive/semantic
metadata, multimedia)
– Metadata (formats & sets)
– Information Theory (entropy)
– Models of symbol distribution (Zipf’s law,
Heaps’ law)
– Distance measures (Hamming distance,
Levenshtein distance)
– Markup languages (SGML, HTML, XML)
– Multimedia formats (header + data)
Query Languages
Query language determines which queries
can be formulated
– User-oriented languages
– System-oriented languages (protocols)
Language dependent on underlying
information retrieval model
Systems may enhance the query using
– Word expansion (thesaurus & stemming)
– removing stopwords
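A minimal sketch of query enhancement, assuming a toy stopword list and a toy thesaurus (both hypothetical; a real system would load much larger resources and might also apply a stemmer):

```python
# Hypothetical resources, chosen only for illustration.
STOPWORDS = {"the", "of", "a", "an", "in"}
THESAURUS = {"car": ["automobile"]}


def enhance_query(query):
    """Drop stopwords, then add thesaurus entries after each kept term."""
    terms = [w for w in query.lower().split() if w not in STOPWORDS]
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    return expanded


print(enhance_query("the car of the year"))
```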
Keyword-Based Querying
A query is composed of keywords
– Documents containing keywords are returned
– Intuitive, easy to express, fast ranking
– Single-word and multi-word queries
Classification of keyword-based queries
– single-word queries
– context queries
– Boolean queries
– natural language queries
Single-Word Queries
Word query
– the most elementary query that can be formulated
– a word is a sequence of letters surrounded by
separators
– in many models, words are the only type of query
allowed
Result of word queries
– the set of documents containing at least one of the
words of the query
– the resulting documents are ranked according to a
degree of similarity to the query
• use term frequency and inverse document frequency
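One way such a ranking could work, as a sketch only: score each document by the sum of tf × idf over the query words it contains. The function name, whitespace tokenization, and the plain log(N/df) weighting are assumptions made for this example, not a prescribed formula.

```python
import math
from collections import Counter


def rank_by_tf_idf(query, docs):
    """Return (doc_id, score) pairs for documents containing any query word."""
    query_words = query.lower().split()
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    ranked = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        matched = [w for w in query_words if tf[w] > 0]
        if matched:
            score = sum(tf[w] * math.log(n_docs / df[w]) for w in matched)
            ranked.append((doc_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


docs = [
    "query languages for retrieval",
    "retrieval retrieval models",
    "storage of items",
]
print(rank_by_tf_idf("retrieval models", docs))
```

The third document contains no query word, so it is left out of the result.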
Context Queries
Phrase query
– a sequence of single-word queries
– ignore separators and uninteresting words
example: “…enhance the retrieval…”
– ranked in a fashion analogous to single words
Proximity query
– a phrase query with a maximum allowed distance
(character or word) between words in the query
example: distance = 4 “enhance the power of
retrieval”
– physical proximity has semantic value: the words in
the same paragraph are related in some way
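A sketch of both context queries over whitespace-separated words. The stopword set and the function names are assumptions for illustration, and distance is counted in words rather than characters.

```python
STOPWORDS = {"the", "of", "a", "an"}  # hypothetical "uninteresting words"


def phrase_match(text, phrase):
    """True if the phrase occurs in the text once stopwords are ignored."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    target = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


def proximity_match(text, first, second, max_distance):
    """True if `second` occurs at most `max_distance` words after `first`."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= max_distance for i in first_pos for j in second_pos)


print(phrase_match("systems enhance the retrieval quality", "enhance retrieval"))
print(proximity_match("enhance the power of retrieval", "enhance", "retrieval", 4))
```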
Boolean Queries
Boolean query
– composed of atoms (basic queries) that retrieve
documents, and of Boolean operators which work on
their operands (sets of documents)
Query syntax tree
– compositional scheme
– leaves: basic queries
– internal nodes: operators
[Figure: query syntax tree for translation AND (syntax OR syntactic)]
Boolean Queries (cont.)
Operators in Boolean queries
– e1 OR e2 : selecting all docs satisfying e1 or e2
– e1 AND e2 : selecting all docs satisfying both e1 and
e2
– e1 BUT e2 : selecting all docs satisfying e1 but not e2
Classic Boolean system
– no ranking of the retrieved docs
– does not allow partial matching
– alternative: fuzzy Boolean set of operators
• the meaning of AND and OR can be relaxed (e.g., retrieving
documents that match only some of the operands)
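A sketch of classic (unranked) Boolean evaluation over a query syntax tree. The nested-tuple encoding of the tree and the toy inverted index are assumptions for this example; leaves are words and internal nodes are operators, and each node evaluates to a set of document ids.

```python
def evaluate(node, index):
    """Evaluate a query tree against an index mapping word -> set of doc ids."""
    if isinstance(node, str):          # leaf: a basic (single-word) query
        return index.get(node, set())
    operator, left, right = node       # internal node: a Boolean operator
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "AND":
        return left_docs & right_docs
    if operator == "OR":
        return left_docs | right_docs
    if operator == "BUT":              # e1 BUT e2: e1 and not e2
        return left_docs - right_docs
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical inverted index.
index = {"translation": {1, 2, 3}, "syntax": {2, 4}, "syntactic": {3, 5}}
tree = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(tree, index))
```

The result is a plain set: a document either satisfies the query or it does not, which is why the classic model gives no ranking and no partial matches.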
Natural Language
Natural language query
– blurring the distinction between AND and OR → query
becomes an enumeration of words and context queries
– higher ranking is assigned to those documents
matching more parts of the query
Characteristics
– retrieving all the documents close to the query
– a complete document can be used as a query → leads
to the use of relevance feedback techniques (user
selects a document from the result, and submits it as a
new query)
– example system: AskJeeves
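A sketch of the "more matching parts, higher rank" idea, assuming the query is treated as a plain enumeration of words (context queries are left out for brevity; the function name is hypothetical):

```python
def rank_by_matched_parts(query, docs):
    """Rank documents by how many distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, doc in enumerate(docs):
        matched = len(query_words & set(doc.lower().split()))
        if matched:
            scored.append((doc_id, matched))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    "boolean query operators",
    "natural language query processing",
    "image formats",
]
print(rank_by_matched_parts("natural language query", docs))
```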
Query Languages: Patterns & Structures
Pattern Matching
Pattern
– a set of syntactic features that must occur in a text
segment
Types of patterns
– Words: string (sequence of characters) in the text
– Prefixes: string forming the beginning of a text word
(e.g., comput → computer, computation)
– Suffixes: string forming the termination of a text word
(e.g., ters → computers, painters)
– Substrings: string appearing within a text word, or
across words when word separators are allowed
(e.g., tal → talk, metallic; any flow → many flowers)
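These pattern types map directly onto built-in string operations. A sketch, with a hypothetical word list and function name:

```python
def match_words(words, pattern, kind):
    """Return the words matching `pattern` as a prefix, suffix, or substring."""
    if kind == "prefix":
        return [w for w in words if w.startswith(pattern)]
    if kind == "suffix":
        return [w for w in words if w.endswith(pattern)]
    if kind == "substring":
        return [w for w in words if pattern in w]
    raise ValueError(f"unknown pattern kind: {kind}")


words = ["computer", "computation", "painters", "computers", "talk", "metallic"]
print(match_words(words, "comput", "prefix"))
print(match_words(words, "ters", "suffix"))
print(match_words(words, "tal", "substring"))

# A substring pattern that spans a word separator is matched against the text itself.
print("any flow" in "many flowers")
```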
Pattern Matching (cont.)
More Types of Patterns
– Ranges: a pair of strings that matches any word lying
between them in lexicographical order
(e.g., held to hold → hoax, hissing)
– Allowing errors: retrieving words similar to a given word
(e.g., flower → flo wer [edit distance = 1])
– Regular expressions: general patterns built up from
simple strings and operators
(e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* )
→ problem02, proteins
– Extended patterns: classes of characters, conditional
expressions, wild characters, combinations
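A sketch of three of these pattern types. The range test uses Python's lexicographic string comparison, the error-tolerant match uses a standard dynamic-programming Levenshtein distance, and the regular expression is a Python rendering of the slide's example (`s?` stands for "s | ε"). The function names are assumptions.

```python
import re


def in_range(word, low, high):
    """True if `word` lies between `low` and `high` in lexicographical order."""
    return low <= word <= high


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print([w for w in ["hoax", "hissing", "hat"] if in_range(w, "held", "hold")])
print(edit_distance("flower", "flo wer"))
print([w for w in ["problem02", "proteins", "protest"] if PATTERN.fullmatch(w)])
```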
Structural Queries
Structural query
– mixing contents and structure in queries
• content constraints (words, phrases, patterns)
• structural constraints (containment, proximity) and
restrictions on structural elements (chapters,
sections)
Type of structures of text
– form-like fixed structure
– hypertext structure
– hierarchical structure
Structural Queries (cont.)
[Figure: the three types of structure: form, hypertext, hierarchical]
Fixed Structure
Traditional restrictions
– documents had a fixed set of fields
– each field had some text inside
– fields could only rarely appear in any order or repeat
– fields were not allowed to nest or overlap
– retrieval: specifying a given basic pattern to be found
only in a given field
Characteristics
– well suited to text collections with a fixed
structure (e.g., a mail archive), but inadequate for
representing hierarchical structure such as that of
HTML docs
– can be extended to the relational DB model
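A sketch of field-restricted retrieval over a fixed structure, using a toy mail archive in which every document is a flat dictionary of fields. The field names and the function name are hypothetical.

```python
def search_field(documents, field, word):
    """Return ids of documents whose given field contains the word."""
    word = word.lower()
    return [
        doc_id
        for doc_id, doc in enumerate(documents)
        if word in doc.get(field, "").lower().split()
    ]


mails = [
    {"from": "alice", "subject": "query languages", "body": "see the retrieval notes"},
    {"from": "bob", "subject": "lunch", "body": "query me later"},
]
print(search_field(mails, "subject", "query"))
print(search_field(mails, "body", "query"))
```

Because fields neither nest nor overlap, a flat mapping is enough here; nested markup such as HTML does not fit this shape.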
Hypertext
Hypertext (navigational)
– a directed graph where the nodes hold some text and
the links represent connections between nodes or
between positions inside the nodes
Browsing / Searching in hypertext
– retrieval from a hypertext: browsing (traversing the
hypertext nodes by following links → navigational activity)
– even on the Web, one can search by the text contents of
the nodes, but not by their structural connectivity
– some search engines now allow searching for specific
source or destination anchors (but not general
structure + content queries)
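A sketch of a hypertext as a directed graph, with browsing modeled as a breadth-first traversal of links from a start node. The node names and the adjacency-list encoding are assumptions for illustration.

```python
from collections import deque


def browse(links, start):
    """Return the nodes reachable from `start`, in breadth-first visiting order."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


# Hypothetical hypertext: each node maps to the nodes it links to.
links = {
    "home": ["intro", "index"],
    "intro": ["models"],
    "index": ["models", "home"],
    "models": [],
}
print(browse(links, "home"))
```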
Hierarchical Structure
Hierarchical structure
– an intermediate structuring model lying between fixed
structure and hypertext
– represents a recursive decomposition of the text
– a natural model for many text collections, e.g., books,
articles, legal documents, structured programs, etc.
Hierarchical models
– PAT Expressions, Overlapped Lists, List of References,
Proximal Nodes, Tree Matching
Issues in hierarchical models
– static or dynamic structure, restrictions on the structure,
integration with text, query language
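A sketch of a query mixing content and hierarchical structure: the text is a recursive tree of elements, and the search returns the containment path of every element whose own text holds a given word. This is a generic tree walk under assumed names and encoding, not one of the models listed above.

```python
def find_elements(node, word, path=()):
    """Yield the path (outermost to innermost) of each element containing `word`."""
    here = path + (node["name"],)
    if word in node.get("text", "").lower().split():
        yield here
    for child in node.get("children", []):
        yield from find_elements(child, word, here)


# Hypothetical document tree.
book = {
    "name": "book",
    "children": [
        {
            "name": "chapter 1",
            "text": "query languages",
            "children": [{"name": "section 1.1", "text": "boolean query"}],
        },
        {"name": "chapter 2", "text": "indexing"},
    ],
}
print(list(find_elements(book, "query")))
```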
Query Protocols
Query protocols
– query languages used to query text databases
– standards intended not for human use but for
querying library systems and querying CD-ROMs
Some important query protocols
– Z39.50
– WAIS
– CCL
– CD-RDx
– SFQL