Uploaded by Anteneh Abebe

Chapter 1 - Copy

Chapter one
Information Retrieval
Contents
• Definition of Information Retrieval (IR)
• Information retrieval system
• Purpose/goal/objective of Information Retrieval
• Database retrieval vs. information retrieval
• History of Information retrieval
• Components of information retrieval system (IRS)
• Basic concepts in Information retrieval
• The information retrieval process
• The Structure of an IR system
• Challenges in Information Retrieval (IR)
• Application areas of IR
Information retrieval - Definition
Information retrieval (IR)
• It deals with the representation, storage, and organization of, and access to, information
items to satisfy the user's interest or information need.
– Information items: usually text, but possibly also image, audio, video, etc.
– Text items are often referred to as documents, and may be of different scope
(books, articles, paragraphs, etc.)
– Information needs are translated to a query consisting of keywords (word
forms) which summarize the description of the information the user needs.
• Conceptually, IR is used to cover all related problems in finding needed
information
• Technically, information retrieval refers to (text) string manipulation, indexing,
matching, querying, etc.
• IR involves helping users to find information that matches their information
needs
• Its techniques and applications have reached many fields where processing large
amount of information is essential.
Information retrieval - Storing/Retrieving
• Information storage
– How and where is information stored?
• Retrieving information
– How is information recovered (retrieved) from storage?
– How to find needed information
– Linked with accessing/filtering stage
Information Retrieval - Accessing/Filtering
• Using the organization created in the O/I stage to:
– Select desired (or relevant) information
– Locate that information
– Retrieve the information from its storage location
(often via a network)
More on IR
• IR is concerned with retrieval of relevant documents
from a large collection of documents
• Relevant documents are identified according to specific
criteria (usually called query)
• IR usually deals with NL text which is not always well
structured and could be semantically ambiguous
• IR deals with very large sets of documents
– High degree of robustness and efficiency
– Domain independence and multilinguality
• IR considers NL text mainly from a lexical view
Key Issues in IR
• How to describe information resources or information-bearing objects in
ways so that they may be effectively used by those who need to use them
– Organizing
• How to find the appropriate information resources or information-bearing
objects for someone’s (or your own) needs
– Retrieving
• Revised task statement
– Build a system that retrieves documents that users are likely to find
relevant to their queries
– This set of assumptions underlies the field of IR
Information Retrieval System (IRS)
• Is a system that is capable of storage, retrieval, and maintenance of
information items
• The two most important entities in information retrieval systems are
– Information need (queries)…on one hand we have users
– Information items (documents)…on the other hand we have DOCs
that we collect
• The process of an IR system is to match two abstractions
– Data abstracted in the system
– Queries abstracted from user’s information needs
[Figure: the information need on one side and the documents on the other, with the IR system matching the two sides]
• The goal of IR systems is to help users find information that satisfies their
information needs.
• Information needs are translated to a query consisting of keywords (word
forms) which summarize the description of the information the user needs.
• Given the user query, the key goal of an IR system is to retrieve
information which might be useful or relevant to the user.
Basic Structure of Information Retrieval System
[Figure: the user's information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list of ranked documents]
Overview of IR
[Figure: documents go through indexing and the query goes through indexing (query analysis); each yields a representation (keywords), and the two representations are compared in query evaluation]
Purpose of an Information Retrieval System
• The main purpose of an IRS is to find wanted items (information) in a
large document set and to filter out unwanted information
1. Writers' ideas
• Information is generated by authors
• Authors generate large quantities of information every day
• This information is represented in the form of documents
2. Information need
• At the other end of the communication chain
• We have readers, each with their own individual need for information,
which has to be selected from the mass available
3. IRS
• Helps to organize the information sources in (1)
• Helps to bring (1) and (2) together (writer and information seeker)
Information Retrieval – Objective/Purpose/Goal
• Minimize the search overhead of a user who is locating needed
information
• Overhead: time spent in all steps leading to the reading of
items containing the needed information (query generation,
query execution, scanning results, reading non-relevant items,
etc.)
• Needed information can be either
– All information in the system relevant to the user needs
– Sufficient information in the system to complete a task
Information Retrieval – Objective/Purpose/Goal
• Support user search, providing tools to overcome
obstacles such as :
– Ambiguities inherent in natural languages
• Homographs – words with identical spelling but
with multiple meanings
– Limit to user’s ability to express needs
• Lack of system experience or aptitude
– Lack of experience in the area being searched
• Initially only vague concept of information sought
• Differences between user’s vocabulary and authors’
vocabulary: different words with similar meanings
Database Retrieval Vs. Information Retrieval Systems
• Information
– DBMS: structured data (often homogeneous records), semantic
unambiguity
– IR systems: unstructured (free text), ambiguity
• Answers
– DBMS:
• Records (tuples)
• Perfect precision and recall, each item is relevant (no ranking)
• Well defined results
– IR systems
• Documents
• Imperfect precision and recall; each item has a specific relevance
(ranking)
• Fuzzy (vague) or ill-defined results
• Relationship
– Systems complement each other
• DB grew out of files and traditional business system
• IR grew out of library science and need to categorize/group/access
books/articles
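Precision and recall, used above to contrast DBMS answers with IR answers, can be made concrete with a small sketch. The document identifiers and relevance judgments below are made up for illustration; only the two standard ratios are assumed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 relevant in the collection.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(p, r)
```

A DBMS query that returns exactly the matching tuples would score 1.0 on both measures; an IR system typically trades one against the other.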
Comparison of Data retrieval and Information retrieval
                      Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Model                 Deterministic       Probabilistic
Structure             Highly structured   Less structured
Database Retrieval Vs. Information Retrieval Systems
• Data retrieval
– which docs contain a set of keywords?
– Well defined semantics
– a single erroneous object implies failure!
• Information retrieval
– Information about a subject or topic
– Semantics is frequently loose
– Small errors are tolerated
• IR system:
– Interpret contents of information items
– Generate a ranking which reflects relevance
– Notion of relevance is most important
• Information retrieval is much more difficult than data retrieval
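The contrast between exact match and best match can be illustrated with a minimal sketch. The records, the news snippets, and the word-overlap score are all assumptions chosen for illustration; real IR systems use more refined relevance models.

```python
def exact_match(records, field, value):
    """Data retrieval: return only the records whose field equals value."""
    return [r for r in records if r.get(field) == value]


def best_match(documents, query):
    """IR-style retrieval: rank documents by the number of shared query words."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical data for illustration only.
employees = [{"name": "Lee", "dept": "Sales"}, {"name": "Kim", "dept": "IT"}]
news = {
    "n1": "corporate takeover of an internet company announced",
    "n2": "local sports results",
    "n3": "takeover rumours denied",
}
print(exact_match(employees, "name", "Lee"))
print(best_match(news, "corporate takeover"))
```

The exact-match query either finds the record or fails, whereas the best-match query still returns a partially matching item ranked below the better one.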
History of Information Retrieval
• Organization and storage of knowledge for ease of access is centuries
old
• That is, the history of recording knowledge goes as far as thousands
of years.
• Important events
– Development of writing
– Books and printing technology
– News publishing
– Journal publishing (economic reasons: books are not economical
in terms of money and time)
– Libraries (to put publications in one centre)
• As the size of the collection grows, access to documents becomes
more difficult without proper mechanism
History of Information Retrieval
• Therefore, in order to reach documents in libraries or
other collections, access mechanisms were necessary
• Simple methods to facilitate access to single document
– Table of contents
– Keyword index
• Classical methods to facilitate access to collections of
documents
– Index (keywords, authors)
– Hierarchical (Dewey-Decimal classification)
• The increasing demand for information access created
information science as a discipline.
• IR then becomes an important sub discipline that is
concerned with developing theories and methods of
access to information
History of Information Retrieval
• Now- The World-Wide-web
– A gigantic distributed collection of heterogeneous
information items (web pages)
– Many classical information techniques apply
– New challenges
• New challenges:
– Magnitude of content (explosive rate of growth)
– Number of users (everybody is searching the web)
– Highly dynamic in nature (what is there today, may
disappear tomorrow)
– Varying quality (no librarians at the gate)
History of Information Retrieval
• The mechanized era (Sparck Jones and Willett, 1997)
– IR systems were mainly used by librarians
• for carrying out bibliographic searches in place of manual tools such as card
catalogues and universal classification systems
• The advent of word processing technology (software + hardware) led to
a rapid, widespread growth in the usage of IR
• Increased interest in Web-based distributed information processing and in the
application of IR techniques to non-textual information
– The growth of knowledge led to the creation of disciplines
• Discipline-oriented era
e.g. Science from philosophy
physics from science
electricity from physics
electronics from electricity
Similarly, information retrieval from the wider discipline of information science
History of Information Retrieval
• Then came the Problem Oriented Era
– Disciplines are merged to form a new subject
– This is due to seepage
– E.g. Molecular Biology from physics and Biology (Fosket, 1988)
• The Mission Oriented Era
– Spanning the whole range of disciplines
• Such growth of knowledge gave birth to the creation of disciplines
(domain knowledge) which then brought about the need for
classification and indexing
– Putting related knowledge together
• E.g. Science, Arts, and Humanities
– Creation of subclasses within classes
• Designing ways and means of accessing information (IR is
one discipline that provides this)
Components of IR Systems
IR Systems
Human Components
System Components
• Human Components
– Users -- who create the needs of the system (the user)
– Organization -- who makes it possible to have the system (the
funder)
– Information professionals -- who operate the system and provide
the services (the server)
• System Components
– Data -- the content of the system
– Device & media -- hardware of the system
– Algorithms & procedures -- software of the system
The User Task
[Figure: the user reaches the database through retrieval or browsing]
• The user task might be one of retrieval or browsing
– Retrieval of information or data
• Information need (retrieval goal) is focused and crystallized
• Purposeful
• Often the user is sophisticated
– Browsing
• Information need (retrieval goal) is vague and imprecise
• Glancing around
• Often the user is naive
• Both are initiated by the user
Users
• The user
– anyone who needs to find some information
• The user groups
– Group by their knowledge of the system
• Novice (beginner) users vs. experienced users
• End users vs. information specialists
– Group by their domain knowledge
• Domain experts vs. general public
– Group by information needs
• Need to locate a particular item
• Need some information
• Need all information on a subject
User’s Information Needs
• At all levels of our life we need information (e.g. crossing the road, health,
nutrition, travel,…)
• Information need is the desire to know, the desire to fill a gap of knowledge
• Example- problem: one wants to cross a road in a high traffic area
• What is the information he needs? He needs information
• About the direction people drive (left or right)
• About the meanings of the traffic light (green, yellow, and red)
• Sign posts, etc
• People depend on information to carry out their activities of daily life.
– Need to accomplish some goals
– Need to solve some problems
• People realize a lack of information
• Perceive a gap in their knowledge state
• Desire to fill the gap
[Figure: User's information needs. A gap (?) between reality and goals gives rise to information needs, which information systems try to meet. Other labels: First Abstraction Principle, Second Abstraction Principle, Data, Problems, Request, Queries]
Basic Concepts - Logical view of documents
•The logical view of documents
•Full text
•Set of index terms
•Full text + structure
[Figure: document processing steps, from the "Modern IR" textbook. Labels along the pipeline: docs, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing; the resulting logical views range from full text (plus structure) to a set of index terms]
Document representation viewed as a continuum: logical view of docs might shift
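The shift from full text to a set of index terms can be sketched with a few of the text operations named in the figure (spacing, stopwords, stemming). This is a minimal illustration, not the textbook's procedure: the stopword list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and accents and noun groups are not handled.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> lowercase tokens -> stopwords removed -> stemmed terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


terms = index_terms("The indexing of documents is needed for searching.")
print(terms)
```

Each step discards information, which is why the logical view moves away from the full text as more operations are applied.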
Other Central Concepts in IR - Documents
• Document Retrieval Model
Are IR systems better called Document Retrieval systems?
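[Figure: document retrieval model. The user's information need leads to query formulation in a formal language; documents are indexed into document representations; retrieval compares the two and returns retrieved documents; relevance feedback loops back to query formulation]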
• Document: a long string of characters contained in a single file
• What do we mean by a document?
− Full document?
− Document surrogates?
− Pages?
• A document is a representation of some aggregation of information, treated
as a unit
The Retrieval Process
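[Figure: the retrieval process. The user states a need through the user interface (e.g. a Web search engine or Web browser); text operations produce a logical view of the text and of the user need; query operations build the query; the DB manager module handles indexing of the text database into an index (inverted file); searching the index yields retrieved docs, which ranking turns into ranked docs; user feedback flows back into the query operations. Small numbers shown on the components: 2, 4, 5, 6, 7, 8, 10]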
The Retrieval Process
• The user interface – think of it as the user interface available with current IR
systems including
• Web search engines
• Web browsers
• The retrieval process can be interpreted in terms of component subprocesses whose study
yields many of the topics that will be covered in the course
• The figure in the previous slide will be used to describe the retrieval process
• It is necessary to define the text database before any of the retrieval processes
are initiated
• This is usually done by the manager of the database and includes specifying the
following
– the documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what elements can be
retrieved)
• The text operations transform the original documents and generate a
logical view of them
The Retrieval Process
• Once the logical view of the documents is defined, the
database manager builds an index of the text
• An index is a critical data structure
• It allows fast searching over large volumes of data
• Different index structures might be used, but the most
popular one is the inverted file, as indicated in the slide
• Once the document database is indexed, the retrieval
process can be initiated
• The user first specifies a user need, which is then parsed
and transformed by the same text operations applied to the
text
• Then the query operations might be applied before the
actual query, which provides the system representation for
the user need, is generated
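An inverted file maps each term to the documents that contain it, so a query only touches the entries for its own terms. The sketch below is a minimal in-memory version under simplifying assumptions: whitespace tokenization, lowercase only, and a made-up three-document collection.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical document collection for illustration.
docs = {
    1: "information retrieval system",
    2: "database retrieval",
    3: "information need and query",
}
index = build_inverted_index(docs)
print(index["retrieval"])
print(search_all_terms(index, "information retrieval"))
```

The query is tokenized the same way as the documents, mirroring the point that the same text operations are applied to both.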
The Retrieval Process
• The query is then processed to obtain the retrieved documents
• Before the retrieved documents are sent to the user, the retrieved
documents are ranked according to the likelihood of relevance
• The user then examines the set of ranked documents in the search for
useful information
• Two choices for the user
– Reformulate query, run on entire collection
– Reformulate query, run on result set
• At this point, he might pinpoint a subset of the documents seen as
definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the user to
change the query formulation
• Hopefully, this modified query is a better representation of the real user
need
• What about the small numbers?
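The ranking and user feedback cycle described above can be sketched as follows. This is one possible illustration, not the method the slides prescribe: the score is a simple TF-IDF variant with a smoothed logarithm, and `expand_query` is a hypothetical feedback step that adds the most frequent new term from the documents the user selected. The three documents are made up.

```python
import math
from collections import Counter


def rank(docs, query_terms):
    """Rank documents by a simple TF-IDF score; return ids, best first."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df:
                score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def expand_query(query_terms, docs, selected_ids, extra=1):
    """Feedback step: add the most frequent new terms from the selected documents."""
    counts = Counter()
    for doc_id in selected_ids:
        counts.update(docs[doc_id].lower().split())
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:extra]


# Hypothetical collection: "jaguar" is ambiguous between a car and a cat.
docs = {
    "a": "jaguar car speed",
    "b": "jaguar cat habitat",
    "c": "car engine speed",
}
first = rank(docs, ["jaguar"])
new_query = expand_query(["jaguar"], docs, selected_ids=["a"])
second = rank(docs, new_query)
print(first, new_query, second)
```

After the user marks document "a" as of interest, the reformulated query also retrieves document "c", which the original query missed.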
What Kind of Data does Information Retrieval Deal With?
• Unformatted or unstructured data (as opposed to relational
database)- as written by the authors
– Textual data: papers, technical reports, newspaper articles (completely untagged, plain text)
– Web pages (HTML and XML files) (semi-structured)
– Non-textual data: images, graphics, video
• Most popular IR application nowadays: WWW search engines, e.g.,
Google, AltaVista, Yahoo!, etc.
[Figure: a spider collects Web pages into an index; the search engine answers user queries from that index]
Examples of IR systems
• Most people have used IR systems one way or the other:
– Library systems to search for books, papers and course information
– World Wide Web search engines like Google (also AltaVista and
Yahoo!), which retrieve documents on the Web containing the
keywords and return a ranked list of relevant indices to documents.
• Such search engines are word-form based and often analyze the
link structure of the WWW
• Google, AltaVista and Yahoo! are the most popular examples of IR
applications nowadays
– Electronic encyclopaedia (online or CDROM)
– Electronic manual systems such as Sun Microsystems'
AnswerBook
• input data: various set of manuals
• query format: supports AND/OR (soften form), phrase, etc.
• presentation of results: results ranked and document types
Why is IR Important?
• Most information available is in textual form and has no predefined format
(e.g., emails and newsgroup articles).
• Integration of text retrieval capability in most relational database systems.
SQL already supports limited search capability such as search based on
pattern matching:
– select * from Employee where Name like '%Lee%'
• Increasing number of online documentation systems
(no more hardcopy!)
• Of course, the blooming of World Wide Web
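The limited search capability of SQL mentioned above can be tried with Python's built-in `sqlite3` module. The table contents are invented for illustration; only the `Employee` table and the `'%Lee%'` pattern come from the slide.

```python
import sqlite3

# In-memory database with a hypothetical Employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee (Name, Salary) VALUES (?, ?)",
    [("Bruce Lee", 120000), ("John Smith", 90000), ("Ann Leeds", 105000)],
)

# LIKE with % wildcards matches any name containing the substring "Lee".
rows = conn.execute(
    "SELECT Name FROM Employee WHERE Name LIKE ? ORDER BY Name", ("%Lee%",)
).fetchall()
names = [name for (name,) in rows]
print(names)
conn.close()
```

The pattern also matches "Ann Leeds": substring matching has no notion of relevance or ranking, which is the gap IR techniques aim to fill.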
Challenges of IR
[Figure: challenges of IR, linking the user and the stored information]
• Translating info. needs to queries: the user's information needs must be expressed as queries
• Matching queries to stored information (search/select)
• Query result evaluation: does the information found match the user's
information needs?
Why is IR a Difficult Problem?
• The size of the web is doubling every year:
– 50 million pages in November 1995
– 320 million pages in December 1997
– 800 million pages in February 1999
– 1 billion pages in 2000
– and growing every day
• Huge amount of data (e.g., WWW) dictates efficiency,
effectiveness and user-friendliness
Thus :
– Any IR system needs the capability of large scale data
processing
– Indexes and various representations are required
Why is IR a Difficult Problem? (Cont.)
• Unstructured data: difficult to capture semantics in documents.
Compare:
– “select * from Employee where Salary > 100000”
– “retrieve all news items about corporate takeover”
– Why is the second query more difficult to answer? The following
query is even more difficult:
• “retrieve all news items about corporate takeover involving an
internet company”
• Documents have unrestricted domains
– it is hard to predefine or pre-categorize the subject domains of
documents
– In particular, the subject is related to several major topics, including
linguistics, psychology, cybernetics, communications, information
system design, engineering and technology, networking, computer
science, mathematics, economics, management science,
education …
Why is IR a Difficult Problem? (Cont.)
• Diversified user base: expert to casual users
– The users of information retrieval systems include
• Research scientists (that seek articles related to particular experiments)
• Engineers (who try to determine whether a patent covering some new idea has
previously been obtained)
• Attorneys (who search for legal precedents)
• Buyers in general (who try to obtain new product information)
– Information retrieval users (one size cannot fit all!)
• Have a wide variety of different information needs (interests)
• Exhibit many different backgrounds
• May be led by many different reasons to use the retrieval facilities
• As a result, they require a variety of services and end products
– In other words, a system may be clumsy for an expert user but difficult to
use for a casual user
– a system may return information too general to be useful for an expert in the
subject but too narrow for a general user
• Intention of information and user query is hard to capture
Why is IR a Difficult Problem?
• Distributed and interlinked (e.g., Hypertext and WWW)
– Where to start a search? Unlike a centralized database, where you have only one
(or a few) database(s) to search.
– How is the information related?
• Efficiency (how fast) vs. effectiveness (how good)
With a limited amount of resources, one can only improve efficiency and
effectiveness to a certain degree. Moreover, improving efficiency often
means degrading effectiveness, and vice versa.
Structure of an IR System
[Figure: structure of an information storage and retrieval system, adapted from Soergel, p. 19. Search line: interest profiles and queries are formulated in terms of descriptors and stored as profiles (Store 1: profiles/search requests). Storage line: documents and data go through indexing (descriptive and subject) and are stored as document representations (Store 2). Both lines follow the rules of the game: rules for subject indexing plus a thesaurus (which consists of a lead-in vocabulary and an indexing language). Comparison/matching of the two stores yields potentially relevant documents.]
Application Areas of IR
• IR is applicable in many areas, including:
– Conventional (library catalog)
• Search by keyword, title, author, etc.
– Text-based (Lexis-Nexis, Google, FAST)
• Search by keywords. Limited search using queries in natural
language.
– Multimedia (QBIC, WebSeek, SaFe)
• Search by visual appearance (shapes, colors).
– Question answering systems (AskJeeves, Answerbus)
• Search in (restricted) natural language
– Other:
• Cross-language information retrieval, information
extraction, music retrieval