Chapter One: Information Retrieval

Contents
• Definition of Information Retrieval (IR)
• Information retrieval system
• Purpose/goal/objective of information retrieval
• Database retrieval vs. information retrieval
• History of information retrieval
• Components of an information retrieval system (IRS)
• Basic concepts in information retrieval
• The information retrieval process
• The structure of an IR system
• Challenges in information retrieval (IR)
• Application areas of IR

Information Retrieval - Definition
Information retrieval (IR)
• It deals with the representation, storage, and organization of information items, and with access to them, in order to satisfy the user's interest or information need.
  – Information items: usually text, but possibly also images, audio, video, etc.
  – Text items are often referred to as documents, and may be of different scope (books, articles, paragraphs, etc.)
  – The user's information need is translated into a query consisting of keywords (word forms) that summarizes the information needed.
• Conceptually, IR covers all problems related to finding needed information
• Technically, IR refers to (text) string manipulation, indexing, matching, querying, etc.
• IR involves helping users find information that matches their information needs
• Its techniques and applications have reached many fields where processing large amounts of information is essential.

Information Retrieval - Storing/Retrieving
• Information storage
  – How and where is information stored?
• Retrieving information
  – How is information recovered from storage?
  – How to find needed information
  – Linked with the accessing/filtering stage

Information Retrieval - Accessing/Filtering
• Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)

More on IR
• IR is concerned with the retrieval of relevant documents from a large collection of documents
• Relevant documents are identified according to specific criteria (usually called the query)
• IR usually deals with natural language (NL) text, which is not always well structured and can be semantically ambiguous
• IR deals with very large sets of documents
  – This demands a high degree of robustness and efficiency
  – Domain independence and multilinguality
• IR considers NL text mainly from a lexical view

Key Issues in IR
• How to describe information resources or information-bearing objects so that they can be used effectively by those who need them
  – Organizing
• How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs
  – Retrieving
• Revised task statement
  – Build a system that retrieves documents that users are likely to find relevant to their queries
  – This set of assumptions underlies the field of IR

Information Retrieval System (IRS)
• A system capable of storing, retrieving, and maintaining information items
• The two most important entities in information retrieval systems are
  – Information needs (queries): on one hand we have users
  – Information items (documents): on the other hand we have the documents that we collect
• The task of an IR system is to match two abstractions
  – Data abstracted in the system
  – Queries abstracted from users' information needs
[Figure: Need and Docs, with the IR system matching the two sides]
• The goal of IR systems is to help users find information that satisfies their information needs.
• The user's information need is translated into a query consisting of keywords (word forms) that summarizes the information needed.
• Given the user query, the key goal of an IR system is to retrieve information that might be useful or relevant to the user.

Basic Structure of Information Retrieval System
[Figure labels: Info. need, Query, Retrieval, Document collection, IR system, Answer list (ranked documents)]

Overview of IR
[Figure labels: Document, indexing, Representation (keywords); Query, indexing (query analysis), Representation (keywords); Query evaluation]

Purpose of an Information Retrieval System
• The main purpose of an IRS is to find wanted items (information) in a large document set and to filter out unwanted information
1. Writers' ideas
  • Information is generated by authors
  • Authors generate large quantities of information every day
  • It is represented in the form of documents
2. Information need
  • At the other end of the chain of communication
  • We have readers, each with an individual need for information, which has to be selected from the mass available
3. IRS
  • Helps to organize the information sources in (1)
  • Helps to bring (1) and (2) together (writer and information seeker)

Information Retrieval – Objective/Purpose/Goal
• Minimize the search overhead of a user who is locating needed information
• Overhead: time spent in all steps leading to the reading of items containing the needed information (query generation, query execution, scanning results, reading non-relevant items, etc.)
• Needed information can be either
  – All information in the system relevant to the user's needs
  – Sufficient information in the system to complete a task
• Support user search, providing tools to overcome obstacles such as:
  – Ambiguities inherent in languages
    • Homographs: words with identical spelling but multiple meanings
  – Limits to the user's ability to express needs
    • Lack of system experience or aptitude
  – Lack of experience in the area being searched
    • Initially only a vague concept of the information sought
    • Differences between the user's vocabulary and authors' vocabulary: different words with similar meanings

Database Retrieval Vs. Information Retrieval Systems
• Information
  – DBMS: structured data (often homogeneous records), semantically unambiguous
  – IR systems: unstructured data (free text), ambiguous
• Answers
  – DBMS:
    • Records (tuples)
    • Perfect precision and recall; each item is relevant (no ranking)
    • Well-defined results
  – IR systems:
    • Documents
    • Imperfect precision and recall; each item has a specific degree of relevance (ranking)
    • Fuzzy (vague) or ill-defined results
• Relationship
  – The systems complement each other
    • DB grew out of files and traditional business systems
    • IR grew out of library science and the need to categorize/group/access books/articles

Comparison of Data Retrieval and Information Retrieval

                       Data retrieval       Information retrieval
Content                Data                 Information
Data object            Table                Document
Matching               Exact match          Partial match, best match
Items wanted           Matching             Relevant
Query language         SQL (artificial)     Natural
Query specification    Complete             Incomplete
Model                  Deterministic        Probabilistic
Structure              Highly structured    Less structured

• Data retrieval
  – Which docs contain a set of keywords?
  – Well-defined semantics
  – A single erroneous object implies failure!
• Information retrieval
  – Information about a subject or topic
  – Semantics are frequently loose
  – Small errors are tolerated
• IR system:
  – Interprets the contents of information items
  – Generates a ranking that reflects relevance
  – The notion of relevance is most important
• Information retrieval is much more difficult than data retrieval

History of Information Retrieval
• Organizing and storing knowledge for ease of access is centuries old; the history of recording knowledge goes back thousands of years.
• Important events
  – Development of writing
  – Books and printing technology
  – News publishing
  – Journal publishing (for economic reasons: books are not economical in terms of money and time)
  – Libraries (to put publications in one centre)
• As the size of a collection grows, access to documents becomes more difficult without a proper mechanism
• Therefore, access mechanisms were necessary in order to reach documents in libraries or other collections
• Simple methods to facilitate access to a single document
  – Table of contents
  – Keyword index
• Classical methods to facilitate access to collections of documents
  – Index (keywords, authors)
  – Hierarchical classification (Dewey Decimal Classification)
• The increasing demand for information access created information science as a discipline.
• IR then became an important subdiscipline concerned with developing theories and methods of access to information
• Now: the World Wide Web
  – A gigantic distributed collection of heterogeneous information items (web pages)
  – Many classical IR techniques apply
  – New challenges:
    • Magnitude of content (explosive rate of growth)
    • Number of users (everybody is searching the web)
    • Highly dynamic in nature (what is there today may disappear tomorrow)
    • Varying quality (no librarians at the gate)
• The mechanized era (Sparck Jones and Willett, 1997)
  – IR systems were mainly used by librarians
    • for carrying out bibliographic searches in place of manual tools such as card catalogues and universal classification systems
• The advent of word processing technology (software + hardware) led to rapid, widespread growth in the usage of IR
• Increased interest in Web-based distributed information processing and in the application of IR techniques to non-textual information
• The growth of knowledge led to the creation of disciplines
• The discipline-oriented era, e.g.
  – Science from philosophy
  – Physics from science
  – Electricity from physics
  – Electronics from electricity
  – Similarly, information retrieval from the wider discipline of information science
• Then came the problem-oriented era
  – Disciplines are merged to form a new subject
  – This is due to seepage
  – E.g. molecular biology from physics and biology (Fosket, 1988)
• The mission-oriented era
  – Spanning the whole range of disciplines
• Such growth of knowledge gave birth to disciplines (domain knowledge), which in turn brought about the need for classification and indexing
  – Putting related knowledge together
    • E.g.
Science, Arts, and Humanities
  – Creation of subclasses within classes
• Designing ways and means of accessing information (IR is one discipline that provides this)

Components of IR Systems
[Figure: IR systems consist of human components and system components]
• Human components
  – Users: who create the need for the system (the user)
  – Organization: who makes it possible to have the system (the funder)
  – Information professionals: who operate the system and provide the services (the server)
• System components
  – Data: the content of the system
  – Devices & media: the hardware of the system
  – Algorithms & procedures: the software of the system

The User Task
[Figure labels: Retrieval, Browsing, Database]
• The user task might be one of retrieval or browsing
  – Retrieval
    • Of information or data
    • The information need (retrieval goal) is focused and crystallized
    • Purposeful
    • The user is often sophisticated
  – Browsing
    • The information need (retrieval goal) is vague and imprecise
    • Glancing around
    • The user is often naive
• Both are initiated by the user

Users
• The user: anyone who needs to find some information
• User groups
  – Grouped by their knowledge of the system
    • Novice (beginner) users vs. experienced users
    • End users vs. information specialists
  – Grouped by their domain knowledge
    • Domain experts vs. general public
  – Grouped by information needs
    • Need to locate a particular item
    • Need some information
    • Need all information on a subject

User's Information Needs
• At all levels of our life we need information (e.g. crossing the road, health, nutrition, travel, ...)
• An information need is the desire to know, the desire to fill a gap in knowledge
• Example problem: one wants to cross a road in a high-traffic area
• What information does this person need?
  – About the direction in which people drive (left or right)
  – About the meanings of the traffic lights (green, yellow, and red)
  – Sign posts, etc.
• People depend on information to carry out their activities of daily life.
  – Need to accomplish some goals
  – Need to solve some problems
• People realize a lack of information
• Perceive a gap in their knowledge state
• Desire to fill the gap

User's Information Needs
[Figure labels: Reality, Goals, Info. Needs, Info. Systems, Data, Problems, Request, Queries; First Abstraction Principle; Second Abstraction Principle]

Basic Concepts - Logical View of Documents
• The logical view of documents
  – Full text
  – Set of index terms
  – Full text + structure
[Figure: document processing steps, from the "Modern IR" textbook. Labels: Docs, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing; Full text, Index terms]
• Document representation can be viewed as a continuum: the logical view of docs might shift

Other Central Concepts in IR - Documents
• Document retrieval model: are IR systems better called document retrieval systems?
[Figure labels: User's information need, Query formulation, Formal language, Retrieval, Retrieved documents, Relevance feedback, Document representation, Indexing, Documents]
• Document: a long string of characters contained in a single file
• What do we mean by a document?
  – Full document?
  – Document surrogates?
  – Pages?
A document is a representation of some aggregation of information, treated as a unit

The Retrieval Process
[Figure: the retrieval process. Labels: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database; flows: user need, logical view, query, user feedback, retrieved docs, ranked docs. Small numbers printed next to some components (4, 10; 6, 7; 5; 8; 2) are part of the original figure.]
• The user interface: think of it as the user interface available with current IR systems, including
  – Web search engines
  – Web browsers
• The retrieval process can be interpreted in terms of component subprocesses whose study yields many of the topics that will be covered in the course
• The figure above will be used to describe the retrieval process
• It is necessary to define the text database before any of the retrieval processes are initiated
• This is usually done by the manager of the database and includes specifying the following
  – The documents to be used
  – The operations to be performed on the text
  – The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and generate a logical view of them
• Once the logical view of the documents is defined, the database manager builds an index of the text
• An index is a critical data structure
• It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file, as indicated in the figure
• Once the document database is indexed, the retrieval process can be initiated
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
• Then query operations might be applied before the actual query, which provides the system representation of the user need, is generated
• The query is then processed to obtain the retrieved documents
•
Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
• The user then examines the set of ranked documents in search of useful information
• Two choices for the user
  – Reformulate the query and run it on the entire collection
  – Reformulate the query and run it on the result set
• At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the user to change the query formulation
• Hopefully, this modified query is a better representation of the real user need
• What about the small numbers?

What Kind of Data does Information Retrieval Deal With?
• Unformatted or unstructured data (as opposed to a relational database), as written by the authors
  – Textual data: papers, technical reports, newspaper articles
  – Web pages (HTML and XML files): semi-structured
  – Non-textual data: images, graphics, video
  – Completely untagged, plain text
• Most popular IR application nowadays: WWW search engines, e.g., Google, AltaVista, Yahoo!, etc.
[Figure labels: spider, Web pages, Index, Search engine, User queries]

Examples of IR Systems
• Most people have used IR systems one way or another:
  – Library systems to search for books, papers and course information
  – World Wide Web search engines like Google (also AltaVista, Yahoo!), which retrieve documents on the Web containing the keywords and return a ranked list of references to relevant documents
    • Such search engines are word-form based and often analyze the link structure of the WWW
  – Electronic encyclopaedias (online or CD-ROM)
  – Electronic manual systems such as Sun Microsystems' AnswerBook
    • Input data: various sets of manuals
    • Query format: supports AND/OR (softened form), phrases, etc.
    • Presentation of results: ranked results and document types

Why is IR Important?
• Most available information is in textual form and has no predefined format (e.g., emails and newsgroup articles).
• Text retrieval capability is being integrated into most relational database systems. SQL already supports limited search capability, such as pattern matching with LIKE:
  – select * from Employee where Name like '%Lee%'
• Increasing number of online documentation systems (no more hardcopy!)
• Of course, the boom of the World Wide Web

Challenges of IR
• Translating information needs into queries (the user searches/selects)
• Matching queries to stored information
• Query result evaluation: does the information found match the user's information needs?

Why is IR a Difficult Problem?
• The size of the web is doubling every year:
  – 50 million pages in November 1995
  – 320 million pages in December 1997
  – 800 million pages in February 1999
  – 1 billion pages in 2000
  – and growing every day
• The huge amount of data (e.g., the WWW) makes efficiency, effectiveness and user-friendliness essential. Thus:
  – Any IR system needs the capability of large-scale data processing
  – Indexes and various representations are required
• Unstructured data: it is difficult to capture semantics in documents. Compare:
  – "select * from Employee where Salary > 100000"
  – "retrieve all news items about corporate takeover"
  – Why is the second query more difficult to answer? The following query is even more difficult:
    • "retrieve all news items about corporate takeover involving an internet company"
• Documents have unrestricted domains
  – It is hard to predefine or pre-categorize the subject domains of documents
  – In particular, the subject is related to several major fields, including linguistics, psychology, cybernetics, communications, information system design, engineering & technology, networking, computer science, mathematics, economics, management science, education, ...
• Diversified user base: expert to casual users
  – The users of information retrieval systems include
    • Research scientists (who seek articles related to particular experiments)
    • Engineers (who try to determine whether a patent covering some new idea has previously been obtained)
    • Attorneys (who search for legal precedents)
    • Buyers in general (who try to obtain new product information)
  – Information retrieval users (one size cannot fit all!)
    • Have a wide variety of different information needs (interests)
    • Exhibit many different backgrounds
    • May be led by many different reasons to use the retrieval facilities
    • As a result, they require a variety of services and end products
  – In other words, a system may be clumsy for an expert user but difficult to use for a casual user
  – A system may return information too general to be useful for an expert in the subject but too narrow for a general user
• The intention behind information and user queries is hard to capture
• Distributed and interlinked information (e.g., hypertext and the WWW)
  – Where to start a search? In a centralized database, by contrast, you have only one (or a few) database(s) to search.
  – How is the information related?
• Efficiency (how fast) vs. effectiveness (how good)
  – With a limited amount of resources, one can only improve efficiency and effectiveness to a certain degree. Moreover, improving efficiency often means degrading effectiveness, and vice versa.

Structure of an IR System
[Figure: information storage and retrieval system.
 Search line: interest profiles & queries; formulating the query in terms of descriptors; storage of profiles (Store 1: profiles/search requests).
 Storage line: documents & data; indexing (descriptive and subject); storage of documents (Store 2: document representations).
 Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language).
 Comparison/matching of Store 1 and Store 2.]
Adapted from Soergel, p.
19
[Figure output of comparison/matching: Potentially Relevant Documents]

Application Areas of IR
IR is applicable in many different areas, including:
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
• Other: cross-language information retrieval, information extraction, music retrieval.
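
Worked sketch: the retrieval process in code
The retrieval process described earlier in this chapter (text operations, inverted file, query processing, ranking) and the notions of precision and recall used in the database vs. IR comparison can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the stopword list, the five sample documents, the relevance judgments, and the function names are hypothetical, and the scoring rule (sum of matching term frequencies) is just one very simple stand-in for a real ranking model.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "by", "about"}


def text_operations(text):
    """Turn raw text into index terms: strip accents, lowercase, drop stopwords."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Inverted file: map each index term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text_operations(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Best-match retrieval: score = summed frequency of matching query terms."""
    scores = Counter()
    for term in text_operations(query):  # same text operations as the documents
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def precision_recall(retrieved, relevant):
    """Precision = hits / number retrieved; recall = hits / number relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed toy collection.
docs = {
    "d1": "Corporate takeover of an internet company",
    "d2": "Internet café opens in the city",
    "d3": "Takeover rumours: corporate takeover talks continue",
    "d4": "Employee salary survey",
    "d5": "Web firm acquired by rival",
}
index = build_inverted_index(docs)
ranked = search(index, "internet company takeover")
print(ranked)  # ['d1', 'd3', 'd2']

# Assumed relevance judgments: d5 is relevant but uses different vocabulary.
precision, recall = precision_recall(ranked, {"d1", "d5"})
print(round(precision, 2), recall)  # 0.33 0.5
```

In this toy run d5 is judged relevant but is never retrieved, because its author wrote "web firm acquired" where the user wrote "internet company takeover". This is the vocabulary mismatch mentioned among the obstacles above, and it is one reason IR answers have imperfect precision and recall, unlike the exact-match answers of a DBMS.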