Database Management Systems & Programming
LIS 558 - Week 8
Information Retrieval & File Structures
Faculty of Information & Media Studies
Summer 2000

Lecture Outline
• Guest Speaker
  – Trevor Richards
  – LIS Grad.
  – Product/Technical Knowledge Training Coordinator
  – EMCO Ltd.
• Demonstration of InMagic
• Break
• Database Categories
• Advantages/Disadvantages between DBMS and IR Systems

Types of Information Systems
6 types of information systems
• Information Retrieval Systems
• Database Management Systems
• Management Information Systems
• Question Answering Systems
• Decision Support Systems
• Artificial Intelligence Systems

Types of Information Systems
[Figure: the six system types (IR, QA, DBMS, DS, MIS, AI)]
Many systems are hybrids

Types of Information Systems: Database Management Systems
• Concerned with storage, maintenance, and retrieval of data facts available in the system in explicit form, e.g.,
  – books
    • authors
    • titles
    • call number
  – products
    • orders
    • items
    • sales
  – widgets
    • colour
    • size
    • shape
• Items being retrieved are typically attribute-value pairs that either match or do not match
• Calculations may be performed on values present in the database or on values returned by queries

Types of Information Systems: Information Retrieval Systems
• IRSs deal with representation, storage, and access to information items, typically as documents or document surrogates (or more recently multimedia documents), e.g.,
  – newspaper articles
  – magazines
  – research reports
  – books
  – bibliographic references
  – references with abstracts
  – web documents?
Types of Information Systems: Management Information Systems
• Basically, database management systems designed to meet information needs of managers
• Provide complex views and manipulations of corporate information

Types of Information Systems: Question Answering Systems
• Provide access to factual information in a natural language setting
  – e.g., http://debra.dgbt.doc.ca/chat/chat.html

Types of Information Systems: Decision Support Systems
• Integration of a variety of systems, including IR systems, expert systems, databases, computer graphics systems, which are normally thought to be needed for decision-making purposes

Types of Information Systems: Artificial Intelligence Information Systems
• Interdisciplinary approach to designing systems
• Includes expert systems, neural networks, intelligent agents, information filtering, etc.
• Increasingly, AI systems are being built integrated with DBMS or information retrieval facilities

IR Database Types
• Bibliographic
• Full-text
• Image
• Numeric/statistical
• Descriptive (text)
• Directories (reference sources)

Functional View of IR
[Figure: flow diagram of an IR system. Document side: text documents, assign doc ID numbers, break into words, stoplist, *stemming, *term weighting; the database stores term weights, document numbers, and *field numbers. Query side: users, interface, parse query, *stemming, Boolean operations, *ranking, *relevance judgements; retrieved and ranked documents are returned to users through the interface.]
* indicates attribute is optional

Functional View of IR
• File structures
• Query operations
• Term operations
• Document operations

File structures: Linear List
– newest item is inserted at the end of list of items (or list of variables)
• advantages
  – simple to create
  – easy to update
  – saves space
• disadvantages
  – no indexing
  – speed of searching is very slow
  – must make comparisons with every item in the list

File structures: Ordered Sequential File
– e.g., file of information ordered by author
• advantages
  – faster to search
• disadvantages
  – updating difficult and slow
• read entire file into RAM and then do a binary search
  – becomes problematic when the file is very large
  – searching is quite fast, but updating remains slow

File structures: Index File
– Data file is accompanied by index file
– Index file provides pointers to the beginning of sections of the data file beginning with a new letter
• advantages
  – index is very small and can be read into RAM
  – binary search done on index which is extremely fast
• disadvantages
  – number of records at each letter may be unevenly distributed so in places searching could be slow
  – index could be more detailed but this increases the space required
  – updating is difficult and slow because both file and index must be revised

Information Retrieval Systems
• Now, in addition to thousands of commercial and public domain databases, we have the World Wide Web
• Web = huge full-text multimedia database
• One billion pages as of January 2000
• With all this information available, how does a person find what they are looking for?
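Looking back at the three file structures above (linear list, ordered sequential file, index file), the trade-offs can be made concrete with a small sketch. This is only an illustration under stated assumptions: the author names are made-up sample data, the "files" are in-memory Python lists, and the function names (`linear_search`, `binary_search`, `build_letter_index`, `indexed_search`) are hypothetical, not part of any system mentioned in the lecture.

```python
from bisect import bisect_left

def linear_search(records, key):
    """Linear list: compare the key with every item in turn.
    Returns (position, number of comparisons made)."""
    for comparisons, record in enumerate(records, start=1):
        if record == key:
            return comparisons - 1, comparisons
    return -1, len(records)

def binary_search(sorted_records, key):
    """Ordered sequential file held in RAM: binary search on sorted data."""
    pos = bisect_left(sorted_records, key)
    if pos < len(sorted_records) and sorted_records[pos] == key:
        return pos
    return -1

def build_letter_index(sorted_records):
    """Index file: map each first letter to the position where that
    letter's section of the data file begins."""
    index = {}
    for pos, record in enumerate(sorted_records):
        index.setdefault(record[0].upper(), pos)
    return index

def indexed_search(sorted_records, index, key):
    """Jump to the key's letter section via the index, then scan only
    that section of the data file."""
    letter = key[0].upper()
    if letter not in index:
        return -1
    letters = sorted(index)
    i = letters.index(letter)
    start = index[letter]
    end = index[letters[i + 1]] if i + 1 < len(letters) else len(sorted_records)
    for pos in range(start, end):
        if sorted_records[pos] == key:
            return pos
    return -1

# Hypothetical sample data: a file of records ordered by author.
authors = ["Atwood", "Borges", "Calvino", "Carver", "Eco", "Munro", "Woolf"]
letter_index = build_letter_index(authors)

print(linear_search(authors, "Eco"))                    # (4, 5)
print(binary_search(authors, "Carver"))                 # 3
print(indexed_search(authors, letter_index, "Carver"))  # 3
```

Note how the letter index stays small (one entry per initial letter), while the scan inside a section grows with the number of records under that letter, which is the uneven-distribution problem listed among the disadvantages.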
• A Telephone Directory
• A Periodical Index
• A Cookbook
• A Textbook

Information Retrieval Systems
• Concept of index as a mechanism for providing access to information is nearly as old as the printed book itself
• Cornerstone of information retrieval systems
• Provide fast and efficient access

[Figure: index file (letters A to F) pointing into an alphabetically ordered document file: Alfalfa, Apples, Apricots, Beans, Broccoli, Carrots, Cocoa, Fudge, Oatmeal]

Inverted Index Files
• advantages
  – updating is easy since records can be added to end of file
  – searching is fast
• e.g., Suppose we have a large database for baking and cooking information and we want to locate recipes using the ingredients oatmeal, raisins, apple, and perhaps cocoa

Inverted Index Files
Document File
  Document 1: word1 oatmeal apple word4 raisins word6
  Document 2: word1 cocoa word3 oatmeal word5 word6
  Document 3: cocoa word2 word3 oatmeal raisins word6
  Document 4: word1 raisins cocoa word4 word5 word6
  Document 5: oatmeal word2 raisins word4 word5 word6
Inverted Index File
  Topic     #occurrences   Recipe Document #
  apple     1              1
  cocoa     3              2 3 4
  oatmeal   4              1 2 3 5
  raisins   4              1 3 4 5

[Figure: alphabetical index (A to Z) pointing into the inverted file (keyword, hits, link: apples 1, cocoa 3, oatmeal 4, raisins 4); the links point to Documents #1 to #5 in the document file. A second view lists topic, #occurrences, and information item numbers, matching the inverted index file table above.]

Query Operations
Boolean Queries
– most systems offer boolean query capabilities:
  • AND OR NOT
– To identify documents containing a particular term only inverted index file needs to be used
– Results are determined through the creation and manipulation of sets (just a different type of file)

  Topic     #Postings   Information Item#
  apple     1           1
  cocoa     3           2 3 4
  oatmeal   4           1 2 3 5
  raisins   4           1 3 4 5

Boolean OR Operator
oatmeal OR raisins
[Venn diagram: oatmeal only: 2; both: 1, 3, 5; raisins only: 4]
OR broadens a search

Boolean AND Operator
oatmeal AND cocoa
[Venn diagram: oatmeal only: 1, 5; both: 2, 3; cocoa only: 4]
AND narrows a search

[Figure: alphabetical index (A to Z) into the inverted file]
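The inverted index and the Boolean operators above can be sketched in a few lines. The five documents are the ones from the Document File slide; everything else is an assumption made for illustration: whitespace tokenization, Python sets standing in for postings lists, and the hypothetical helper names `build_inverted_index` and `postings`.

```python
from collections import defaultdict

# The five documents from the Document File slide.
documents = {
    1: "word1 oatmeal apple word4 raisins word6",
    2: "word1 cocoa word3 oatmeal word5 word6",
    3: "cocoa word2 word3 oatmeal raisins word6",
    4: "word1 raisins cocoa word4 word5 word6",
    5: "oatmeal word2 raisins word4 word5 word6",
}

def build_inverted_index(docs):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(documents)

def postings(term):
    """Return the postings set for a term (empty set if the term is absent)."""
    return index.get(term, set())

# Topic / #postings / information item numbers, as in the slide's table.
for term in ("apple", "cocoa", "oatmeal", "raisins"):
    print(term, len(postings(term)), sorted(postings(term)))

# Boolean operators implemented as set operations on postings.
print(sorted(postings("oatmeal") | postings("raisins")))  # OR  -> [1, 2, 3, 4, 5]
print(sorted(postings("oatmeal") & postings("cocoa")))    # AND -> [2, 3]
print(sorted(postings("oatmeal") - postings("cocoa")))    # NOT -> [1, 5]
```

Only the index is consulted to answer these queries; the document file is needed only when the matching documents are displayed.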
[Figure: inverted file and document file. Each inverted file entry holds a keyword and its hits (apples 2, cocoa 3, oatmeal 4, raisins 4) plus postings fields: field, position (Pos), weight (Wt), and link, with sample values 1, 4, .5. The links point to Documents #1 to #5.]

Functional View of IR
• File structures
• Query operations
• Term operations
• Document operations

Query Operations
• Typically querying is independent of file structures
Boolean Queries
• most commercial systems offer boolean query capabilities:
  – AND OR NOT
• To identify documents containing a particular term only the inverted index file needs to be used
• Results are determined through the creation of sets and the boolean results are determined by the implementation of boolean set intersection, set union, and set difference operations (see descriptions in Kroenke Chapter on SQL)

Query Operations
Adjacency / Proximity Operators
• very expensive for indexing and storage
• location, field information also stored in postings file
Truncation
• query locates all index terms matching the word stem
• right and left truncation require separate indexes to be built
Relevance ranking
• many newer systems provide a facility for ranking documents.
  Typically this is based on a measure of word frequency within and between documents

Functional View of IR
• File structures
• Query operations
• Term operations
• Document operations

Term Operations
Stemming
• Conflation of related words, usually reducing to a common root (do not confuse with truncation)
  – e.g., psycholog (psychologist, psychological, psychology)
• Sometimes this is done automatically on-the-fly at query time
• Also done at the time the index is created (index is consequently much smaller)
Term Weighting
• (weights are determined for each term)
• used for relevance and ranking determinations
• Sometimes done on-the-fly
• term weights (based on inter-document and inter-database frequencies) are computed and stored at time of indexing (recalculation = $$$)

Term Operations
Stop Lists
• many of the most frequently occurring words make ineffective search terms (i.e., discrimination value is low), e.g.,
  – like, the, and, to, of, an, out, a, …
• frequently occurring words are filtered out during the processing of the index and/or during querying
• generally stop lists should be employed conservatively

Functional View of IR
• File structures
• Query operations
• Term operations
• Document operations

Document Operations
• assignment of unique ID numbers
• parsing of fields or segments
• masking of fields for searching or display
• indexing of search terms
• creation of inverted index and postings files
• user interface display of documents

IR versus Relational DBMS
• Both have sophisticated file access and file management utilities
• Both employ complex indexing structures (e.g., B+trees)
• Provide query facilities
• Provide similar user interface features (e.g., menus, commands, etc.)
  and flexible views of data

IR versus Relational DBMS
Information retrieval systems
• provide access to content of entire documents
• semi- or non-structured information
• retrieval is probabilistic
• applications range from small to very large
Database Management Systems
• provide access to tables of structured data
• retrieval is deterministic
• applications range from very small to very large
Critical to consider differences during design

Database Software
Generic Database Management Software
– Access, Paradox, dBASE, FoxPro, Filemaker
– Oracle, Informix, Ingres, DB2, MiniSQL, MySQL
• used to create relational databases
• can handle textual and numeric data with limitations
• can provide limited IR and arithmetic functions
• can handle images, sound files, video clips, etc. in digital format

Next Week
• No Lecture
• No Lab
• Don’t forget to start working on your project!
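Appendix sketch: term operations. For anyone who wants to experiment with the term operations covered earlier (stop lists, stemming, term weighting), here is a minimal sketch. Everything beyond the slide content is an assumption: the stop list is just the example words from the Stop Lists slide, `stem` is a toy suffix stripper (not a real stemming algorithm), the weighting uses a tf-idf style formula as one common way to combine within-document and between-document frequency, and the three sample documents are invented.

```python
import math
from collections import Counter

# Stop list taken from the example words on the Stop Lists slide.
STOP_WORDS = {"like", "the", "and", "to", "of", "an", "out", "a"}

# Toy suffix list (assumption): just enough to conflate the slide's example.
SUFFIXES = ("ical", "ist", "y")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Break text into words, filter stop words, then stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def term_weights(docs):
    """Compute a tf-idf style weight for every term in every document:
    (frequency within the document) * log(N / number of documents with the term)."""
    tokenized = {doc_id: index_terms(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for terms in tokenized.values():
        doc_freq.update(set(terms))
    weights = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return weights

# Hypothetical sample documents.
docs = {
    1: "the psychologist and the psychology of memory",
    2: "psychological study of memory",
    3: "a study of the cocoa trade",
}

print(stem("psychologist"), stem("psychological"), stem("psychology"))
weights = term_weights(docs)
for doc_id, doc_weights in weights.items():
    print(doc_id, {t: round(w, 3) for t, w in doc_weights.items()})
```

Because the weights depend on how many documents contain each term, adding a document changes the weights of existing ones, which is why recalculating stored weights is costly.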