Query Languages

Information Retrieval
Concerned with the:
• Representation of
• Storage of
• Organization of, and
• Access to
information items.

Recap 1
Important points
– Data retrieval vs. information retrieval
– Users specify their needs via an intermediary language.
– Documents are represented by an abstraction of their content.
– Traditional model vs. berry-picking model
– Evaluation (precision/recall, single-value measures, human measures)
– Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering)
– Collection characteristics (size, document relations)

Recap 2
Important points
– Content types (text, descriptive/semantic metadata, multimedia)
– Metadata (formats & sets)
– Information Theory (entropy)
– Models of symbol distribution (Zipf’s law, Heaps’ law)
– Distance measures (Hamming distance, Levenshtein distance)
– Markup languages (SGML, HTML, XML)
– Multimedia formats (header + data)

Query Languages
The query language determines which queries can be formulated
– User-oriented languages
– System-oriented languages (protocols)
The language depends on the underlying information retrieval model
Systems may enhance the query using
– Word expansion (thesaurus & stemming)
– Stopword removal

Keyword-Based Querying
A query is composed of keywords
– Documents containing keywords are returned
– Intuitive, easy to express, fast ranking
– Single-word and multi-word queries
Classification of keyword-based queries
– single-word queries
– context queries
– Boolean queries
– natural language queries

Single-Word Queries
Word query
– the most elementary query that can be formulated
– a word is a sequence of letters surrounded by separators
– in many models, single words are the only type of query allowed
Result of word queries
– the set of documents containing at least one of the words of the query
– the resulting documents are ranked according to a degree of similarity to the query
  • use term frequency and inverse document frequency

Context Queries
Phrase query
– a sequence of single-word queries
– ignores separators and uninteresting words
  example: “…enhance the retrieval…”
– ranked in a fashion analogous to single words
Proximity query
– a phrase query with a maximum allowed distance (in characters or words) between the words of the query
  example: distance = 4, “enhance the power of retrieval”
– physical proximity has semantic value: words in the same paragraph are related in some way

Boolean Queries
Boolean query
– composed of atoms (basic queries) that retrieve documents, and of Boolean operators that work on their operands (sets of documents)
Query syntax tree
– compositional scheme
– leaves: basic queries
– internal nodes: operators
[Figure: query syntax tree with operator nodes AND and OR and leaves “translation”, “syntax”, “syntactic”]

Boolean Queries (cont.)
Operators in Boolean queries
– e1 OR e2: selects all docs satisfying e1 or e2
– e1 AND e2: selects all docs satisfying both e1 and e2
– e1 BUT e2: selects all docs satisfying e1 but not e2
Classic Boolean system
– no ranking of the retrieved docs
– does not allow partial matching
– alternative: fuzzy Boolean set of operators
  • the meaning of AND and OR can be relaxed (e.g., matching only some of the operands)

Natural Language
Natural language query
– blurs the distinction between AND and OR → the query becomes an enumeration of words and context queries
– higher ranking is assigned to documents matching more parts of the query
Characteristics
– retrieves all the documents close to the query
– a complete document can be used as a query → leads to the use of relevance feedback techniques (the user selects a document from the result and submits it as a new query)
– example system: AskJeeves

Query Languages: Patterns & Structures

Pattern Matching
Pattern
– a set of syntactic features that must occur in a text segment
Types of patterns
– Words: a string (sequence of characters) in the text
– Prefixes: a string forming the beginning of a text word (e.g., comput → computer, computation)
– Suffixes: a string forming the end of a text word (e.g.,
ters → computers, painters)
– Substrings: a string appearing within a text word; word separators are allowed (e.g., tal → talk, metallic; any flow → many flowers)

Pattern Matching (cont.)
More types of patterns
– Ranges: a pair of strings that matches any word lying between them in lexicographical order (e.g., held to hold → hoax, hissing)
– Allowing errors: retrieves words similar to a given word (e.g., flower → flo wer [edit distance = 1])
– Regular expressions: general patterns built up from simple strings and operators (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins)
– Extended patterns: classes of characters, conditional expressions, wild characters, combinations

Structural Queries
Structural query
– mixes contents and structure in queries
  • content constraints (words, phrases, patterns)
  • structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections)
Types of text structure
– form-like fixed structure
– hypertext structure
– hierarchical structure

Structural Queries (cont.)
[Figure: types of structures (form, hypertext, hierarchical)]

Fixed Structure
Traditional restrictions
– documents had a fixed set of fields
– each field had some text inside
– only rarely could the fields appear in any order or repeat
– fields were not allowed to nest or overlap
– retrieval: specifying a given basic pattern to be found only in a given field
Characteristics
– reasonable for retrieving text collections that have a fixed structure (e.g.
mail archive) → inadequate for representing hierarchical structure such as HTML docs
– expansion to the relational DB model

Hypertext
Hypertext (navigational)
– a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes
Browsing / Searching in hypertext
– retrieval from a hypertext: browsing (traversing the hypertext nodes by following links → navigational activity)
– even on the web, one can search by the text contents of the nodes, but not by their structural connectivity
– some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)

Hierarchical Structure
Hierarchical structure
– an intermediate structuring model lying between fixed structure and hypertext
– represents a recursive decomposition of the text
– a natural model for many text collections, e.g., books, articles, legal documents, structured programs, etc.
Hierarchical models
– PAT Expressions, Overlapped Lists, List of References, Proximal Nodes, Tree Matching
Issues in hierarchical models
– static or dynamic structure, restrictions on the structure, integration with text, query language

Query Protocols
Query protocols
– query languages used to query text databases
– standards intended not for human use but for querying library systems and CD-ROMs
Some important query protocols
– Z39.50
– WAIS
– CCL
– CD-RDx
– SFQL
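
Worked example: ranking single-word queries. The Single-Word Queries slide says results are ranked using term frequency and inverse document frequency but gives no formula. The sketch below assumes the common weighting tf × log(N / df), summed over the query words; the collection `DOCS` and the function `rank` are hypothetical names made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are invented for this example.
DOCS = {
    "d1": "query languages for text retrieval",
    "d2": "retrieval of text and retrieval of images",
    "d3": "markup languages",
}

def rank(query, docs):
    """Return ids of docs containing at least one query word, best first.

    Assumed score: sum over query words of tf * log(N / df).
    """
    n = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    scores = {}
    for word in set(query.lower().split()):
        df = sum(1 for c in counts.values() if word in c)  # document frequency
        if df == 0:
            continue  # word occurs nowhere, contributes nothing
        idf = math.log(n / df)
        for d, c in counts.items():
            if word in c:
                scores[d] = scores.get(d, 0.0) + c[word] * idf
    # Sort by descending score; break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(rank("retrieval images", DOCS))
```

Here d2 ranks above d1 because it contains "retrieval" twice and is the only document containing "images"; d3 is not returned because it contains neither query word.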
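
Worked example: context queries. The phrase and proximity queries from the Context Queries slide can be sketched as follows. This is a minimal illustration under stated assumptions: words are split on whitespace only, distance is measured in words as the difference between word offsets, word order is not enforced for proximity, and the stopword list `STOPWORDS` is a hypothetical stand-in for the "uninteresting words".

```python
# Hypothetical stopword list; real systems use a much longer one.
STOPWORDS = frozenset({"the", "of", "to", "a"})

def content_words(text, stopwords=STOPWORDS):
    """Lower-case the text, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]

def phrase_match(text, phrase, stopwords=STOPWORDS):
    """True if the phrase's content words occur consecutively in the text
    once separators and stopwords are ignored."""
    words = content_words(text, stopwords)
    target = content_words(phrase, stopwords)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def proximity_match(text, first, second, max_distance):
    """True if the two words occur within max_distance words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(
        0 < abs(q - p) <= max_distance
        for p in first_pos
        for q in second_pos
    )

print(phrase_match("a way to enhance the retrieval quality", "enhance retrieval"))
print(proximity_match("enhance the power of retrieval", "enhance", "retrieval", 4))
print(proximity_match("enhance the power of retrieval", "enhance", "retrieval", 3))
```

With the slide's example text, "enhance" and "retrieval" are four word offsets apart, so the query matches at distance 4 and fails at distance 3.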
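
Worked example: Boolean queries. The query syntax tree from the Boolean Queries slides can be evaluated bottom-up over an inverted index, with each operator acting on sets of documents. The sketch below is illustrative only: the collection `DOCS` is invented, and the tree shape translation AND (syntax OR syntactic) is an assumed reading of the node labels that survive from the slide's figure.

```python
from collections import defaultdict

# Hypothetical toy collection; the texts are made up for illustration.
DOCS = {
    1: "syntax directed translation of programs",
    2: "syntactic analysis without translation",
    3: "machine translation of poetry",
    4: "syntax of natural languages",
}

def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def evaluate(node, index):
    """Evaluate a query syntax tree bottom-up.

    A leaf is a plain string (a basic single-word query); an internal
    node is a tuple (operator, left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b   # docs satisfying both operands
    if op == "OR":
        return a | b   # docs satisfying either operand
    if op == "BUT":
        return a - b   # docs satisfying the first operand but not the second
    raise ValueError(f"unknown operator: {op}")

INDEX = build_index(DOCS)
# Assumed tree shape: translation AND (syntax OR syntactic)
QUERY = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(QUERY, INDEX)))
print(sorted(evaluate(("BUT", "translation", "syntax"), INDEX)))
```

As in the classic Boolean system described on the slides, the result is an unranked set and a document either matches the whole expression or is excluded.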
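
Worked example: pattern matching. The pattern types from the Pattern Matching slides can be tried against a small vocabulary. This is a sketch, not a description of how any particular system implements them: the word list `VOCAB` and all function names are hypothetical, "allowing errors" is taken to mean Levenshtein distance, and the slide's regular expression is transcribed into Python `re` syntax.

```python
import re

# Hypothetical vocabulary assembled from the words used on the slides.
VOCAB = ["computer", "computation", "painters", "computers",
         "talk", "metallic", "hoax", "hissing", "hot", "flower"]

def with_prefix(p, words):
    return [w for w in words if w.startswith(p)]

def with_suffix(s, words):
    return [w for w in words if w.endswith(s)]

def with_substring(s, words):
    return [w for w in words if s in w]

def in_range(low, high, words):
    """Words lying between low and high in lexicographical order."""
    return [w for w in words if low <= w <= high]

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

# pro (blem | tein) (s | ε) (0 | 1 | 2)* in Python regex syntax
PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print(with_prefix("comput", VOCAB))
print(with_suffix("ters", VOCAB))
print(with_substring("tal", VOCAB))
print(in_range("held", "hold", VOCAB))
print(edit_distance("flower", "flo wer"))
print([w for w in ["problem02", "proteins", "protein3"] if PATTERN.fullmatch(w)])
```

`fullmatch` is used so that the whole word must fit the pattern; with `search`, a word such as "protein3" would match on its prefix.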