Database Management Systems & Programming
LIS 558 - Week 8
Information Retrieval & File Structures
Faculty of Information & Media Studies
Summer 2000
Lecture Outline

Guest Speaker
–Trevor Richards
–LIS Grad.
–Product/Technical Knowledge Training Coordinator
–EMCO Ltd.

Demonstration of InMagic

Break


Database Categories
Advantages/Disadvantages between
DBMS and IR Systems
Types of Information Systems

6 types of information systems
• Information Retrieval Systems
• Database Management Systems
• Management Information Systems
• Question Answering Systems
• Decision Support Systems
• Artificial Intelligence Systems
Types of Information Systems
[Diagram relating the six system types: IR, QA, DBMS, DS, MIS, AI]
Many systems are hybrids
Types of Information Systems

Database Management Systems
• Concerned with storage, maintenance, and
retrieval of data facts available in the system in
explicit form, e.g.,
–books
•authors
•titles
•call number
–products
•orders
•items
•sales
–widgets
•colour
•size
•shape
Items being retrieved are typically attribute-value pairs
that either match or do not match
Calculations may be performed on values present in the
database or on values returned by queries
Types of Information Systems

Information Retrieval Systems
• IRSs deal with representation, storage,
and access to information items, typically
as documents or document surrogates (or
more recently multimedia documents), e.g.,
–newspaper articles
–magazines
–research reports
–books
–bibliographic references
–references with abstracts
–web documents?
Types of Information Systems

Management Information Systems
• Basically, database management systems
designed to meet information needs of
managers
• Provide complex views and manipulations
of corporate information
Types of Information Systems

Question Answering Systems
• Provide access to factual information in a
natural language setting
–e.g., http://debra.dgbt.doc.ca/chat/chat.html
Types of Information Systems

Decision Support Systems
• Integration of a variety of systems,
including IR systems, expert systems,
databases, computer graphics systems,
which are normally thought to be needed
for decision-making purposes
Types of Information Systems

Artificial Intelligence Information Systems
• Interdisciplinary approach to designing
systems
• Includes expert systems, neural networks,
intelligent agents, information filtering, etc.
• Increasingly, AI systems are being built
integrated with DBMS or information
retrieval facilities
IR Database Types

Bibliographic
Full-text
Image
Numeric/statistical
Descriptive (text)
Directories (reference sources)
Functional View of IR
[Figure: data-flow diagram of an IR system, labels as extracted]
Documents side: documents -> text -> break into words -> words -> stoplist -> filtered words -> *stemming -> stemmed words -> *term weighting -> term weights -> Database
Also on the documents side: assign doc ID numbers -> document numbers and *field numbers -> Database
Query side: Users -> queries -> Interface -> query -> parse query -> query terms -> *stemming -> stemmed words -> Boolean operations -> retrieved documents -> *ranking -> ranked documents -> Interface -> documents -> Users
Other labels: relevant documents, *relevance judgements
* indicates attribute is optional
Functional View of IR




File structures
Query operations
Term operations
Document operations
File structures

Linear List
–newest item is inserted at the end of list of
items (or list of variables)
• advantages
–simple to create
–easy to update
–saves space
• disadvantages
–no indexing
–speed of searching is very slow
–must make comparisons with every item in the
list
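
The linear list trade-off can be sketched in a few lines of Python. This is a minimal illustration only: an in-memory list stands in for the file, and the sample authors and titles are hypothetical.

```python
# Linear list sketch: an in-memory list stands in for the file (assumption).
records = []

def append_record(author, title):
    """Insert the newest item at the end of the list."""
    records.append((author, title))

def linear_search(author):
    """Scan every item; return the matches and the number of comparisons."""
    comparisons = 0
    matches = []
    for record in records:
        comparisons += 1
        if record[0] == author:
            matches.append(record)
    return matches, comparisons

# Hypothetical sample data, stored in arrival order (not sorted).
append_record("Young", "Title A")
append_record("Adams", "Title B")
append_record("Clark", "Title C")

matches, comparisons = linear_search("Adams")
print(matches, comparisons)
```

Appending is a single cheap step, but every search touches all records, which is why the comparison count always equals the length of the list.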
File structures

Ordered Sequential File
–e.g., file of information ordered by author
• advantages
–faster to search
• disadvantages
–updating difficult and slow
• read entire file into RAM and then do a
binary search -- becomes problematic
when the file is very large -- searching is
quite fast, but updating remains slow
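
A rough sketch of the ordered sequential file, assuming the whole file fits in memory as a sorted Python list (the author names are hypothetical). It uses the standard `bisect` module for the binary search.

```python
import bisect

# Hypothetical file of authors, already ordered and read into RAM.
authors = ["Adams", "Baker", "Clark", "Evans", "Young"]

def find(author):
    """Binary search: return the position of author, or -1 if absent."""
    i = bisect.bisect_left(authors, author)
    if i < len(authors) and authors[i] == author:
        return i
    return -1

def insert(author):
    """Keep the file ordered; every later entry has to shift by one."""
    bisect.insort(authors, author)

print(find("Evans"))
insert("Davis")
print(find("Evans"))
```

Searching needs only about log2(n) comparisons, but the insert moves every entry after the insertion point, which is the slow update the slide refers to.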
File structures

Index File
–Data file is accompanied by index file
–Index file provides pointers to the start of each section
of the data file, where a new section begins at each new initial letter
• advantages
–index is very small and can be read into RAM
–binary search done on index which is extremely fast
• disadvantages
–number of records at each letter may be unevenly
distributed so in places searching could be slow
–index could be more detailed but this increases the space
required
–updating is difficult and slow because both file and index
must be revised
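
One possible sketch of a letter index over a sorted data file. The term list and the helper names `build_letter_index` and `lookup` are assumptions made for illustration; a real system would store byte offsets into a file on disk rather than list positions.

```python
import bisect

# Hypothetical data file, sorted alphabetically.
data_file = ["Alfalfa", "Apples", "Apricots", "Beans", "Broccoli",
             "Carrots", "Cocoa", "Fudge", "Oatmeal"]

def build_letter_index(records):
    """Return (letter, position of the first record starting with it)."""
    index = []
    for pos, record in enumerate(records):
        letter = record[0].upper()
        if not index or index[-1][0] != letter:
            index.append((letter, pos))
    return index

letter_index = build_letter_index(data_file)
letters = [letter for letter, _ in letter_index]

def lookup(term):
    """Binary search the small index, then scan only that letter's section."""
    first = term[0].upper()
    i = bisect.bisect_left(letters, first)
    if i == len(letters) or letters[i] != first:
        return -1
    start = letter_index[i][1]
    end = letter_index[i + 1][1] if i + 1 < len(letter_index) else len(data_file)
    for pos in range(start, end):
        if data_file[pos] == term:
            return pos
    return -1

print(letter_index)
print(lookup("Cocoa"))
```

The scan inside a section is still linear, so a letter with many records is slow to search, and adding a record means revising both `data_file` and `letter_index`.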
Information Retrieval Systems


Now, in addition to thousands of
commercial and public domain databases,
we have the World Wide Web
Web = huge full-text multimedia
database
One Billion pages as of January 2000

With all this information available how
does a person find what they are looking
for?
[Images: a telephone directory, a periodical index, a cookbook]
Information Retrieval Systems



Concept of index as a mechanism for
providing access to information is
nearly as old as the printed book
itself
Cornerstone of information retrieval
systems
Provide fast and efficient access
[Image: a textbook]
[Figure: an Index File with entries A to F pointing into a Document File of alphabetically ordered terms: Alfalfa, Apples, Apricots, Beans, Broccoli, Carrots, Cocoa, Fudge, Oatmeal]
Inverted Index Files

advantages
–updating is easy since records can be added to
end of file
–searching is fast
e.g., Suppose we have a large database for
baking and cooking information and we want to
locate recipes using the ingredients oatmeal,
raisins, apple, and perhaps cocoa
Inverted Index Files
Document File
Document 1: word1 oatmeal apple word4 raisins word6
Document 2: word1 cocoa word3 oatmeal word5 word6
Document 3: cocoa word2 word3 oatmeal raisins word6
Document 4: word1 raisins cocoa word4 word5 word6
Document 5: oatmeal word2 raisins word4 word5 word6
Inverted Index File
Topic      #occurrences   Recipe Document #
apple      1              1
cocoa      3              2 3 4
oatmeal    4              1 2 3 5
raisins    4              1 3 4 5
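
As a sketch of how such an inverted index could be built, the five recipe documents from the slide are indexed below. Restricting the index to the four ingredient topics is an assumption taken from what the slide shows; the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

documents = {
    1: "word1 oatmeal apple word4 raisins word6",
    2: "word1 cocoa word3 oatmeal word5 word6",
    3: "cocoa word2 word3 oatmeal raisins word6",
    4: "word1 raisins cocoa word4 word5 word6",
    5: "oatmeal word2 raisins word4 word5 word6",
}
topics = {"apple", "cocoa", "oatmeal", "raisins"}

def build_inverted_index(docs, vocabulary):
    """Map each topic to the sorted list of document numbers containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].split()):
            if word in vocabulary:
                postings[word].append(doc_id)
    return dict(postings)

index = build_inverted_index(documents, topics)
for term in sorted(index):
    # topic, number of occurrences, document numbers
    print(term, len(index[term]), index[term])
```

Adding a sixth document only appends its number to the relevant postings lists; the document file itself is never reordered.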
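[Figure: an alphabetical Index (A to Z) points into the Inverted File, and each Inverted File entry links to the Document File]
Inverted File
Keyword    Hits
apples     1
cocoa      3
oatmeal    4
raisins    4
Link: pointers from each keyword to entries in the Document File
Document File
Document #1
Document #2
Document #3
Document #4
Document #5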
Topic      #occurrences   Information Item#
apples     1              1
cocoa      3              2 3 4
oatmeal    4              1 2 3 5
raisins    4              1 3 4 5

Query Operations
Boolean Queries
–most systems offer boolean query capabilities:
•AND
•OR
•NOT
–To identify documents containing a particular term, only
the inverted index file needs to be used
–Results are determined through the creation and
manipulation of sets (just a different type of file)
Topic      #Postings   Information Item#
apple      1           1
cocoa      3           2 3 4
oatmeal    4           1 2 3 5
raisins    4           1 3 4 5
Boolean OR Operator
oatmeal OR raisins
[Venn diagram: oatmeal only = 2; both = 1, 3, 5; raisins only = 4]
OR broadens a search
Boolean AND Operator
oatmeal AND cocoa
[Venn diagram: oatmeal only = 1, 5; both = 2, 3; cocoa only = 4]
AND narrows a search
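
The Boolean operators map directly onto set operations. The sketch below uses Python sets for the postings of the four topics; the function names are hypothetical, and NOT is treated as set difference (A NOT B).

```python
# Postings from the recipe example, held as sets of document numbers.
postings = {
    "apple": {1},
    "cocoa": {2, 3, 4},
    "oatmeal": {1, 2, 3, 5},
    "raisins": {1, 3, 4, 5},
}

def boolean_or(a, b):
    """Set union: documents containing either term (broadens)."""
    return postings[a] | postings[b]

def boolean_and(a, b):
    """Set intersection: documents containing both terms (narrows)."""
    return postings[a] & postings[b]

def boolean_not(a, b):
    """Set difference: documents containing a but not b."""
    return postings[a] - postings[b]

print(sorted(boolean_or("oatmeal", "raisins")))   # [1, 2, 3, 4, 5]
print(sorted(boolean_and("oatmeal", "cocoa")))    # [2, 3]
print(sorted(boolean_not("oatmeal", "cocoa")))    # [1, 5]
```

Only the inverted index is consulted; the documents themselves are not read until the result set is displayed.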
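[Figure: an alphabetical Index (A to Z) points into the Inverted File, and each Inverted File entry links to the Document File]
Inverted File
Columns: Keyword, Hits, Field, Pos, Wt, Link
Keyword    Hits
apples     2
cocoa      3
oatmeal    4
raisins    4
Sample Field / Pos / Wt values shown: 1 / 4 / .5
Document File
Document #1
Document #2
Document #3
Document #4
Document #5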
Query Operations


Typically querying is independent of
file structures
Boolean Queries
• most commercial systems offer boolean query
capabilities:
–AND
–OR
–NOT
• To identify documents containing a particular term
only the inverted index file needs to be used
• Results are determined by creating sets of matching
documents and applying set intersection (AND),
set union (OR), and set difference (NOT) (see
descriptions in Kroenke Chapter on SQL)
Query Operations

Adjacency / Proximity Operators
• very expensive for indexing and storage
• location, field information also stored in postings file
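
A minimal sketch of why proximity searching costs more: the postings must record word positions, not just document numbers. The documents are the five recipe examples used earlier in the lecture; `build_positional_index` and `within` are hypothetical names.

```python
documents = {
    1: "word1 oatmeal apple word4 raisins word6",
    2: "word1 cocoa word3 oatmeal word5 word6",
    3: "cocoa word2 word3 oatmeal raisins word6",
    4: "word1 raisins cocoa word4 word5 word6",
    5: "oatmeal word2 raisins word4 word5 word6",
}

def build_positional_index(docs):
    """Map term -> {document number -> list of word positions (from 1)}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def within(index, term_a, term_b, max_distance):
    """Documents where the two terms occur at most max_distance words apart."""
    a_docs = index.get(term_a, {})
    b_docs = index.get(term_b, {})
    hits = []
    for doc_id in sorted(a_docs.keys() & b_docs.keys()):
        if any(abs(pa - pb) <= max_distance
               for pa in a_docs[doc_id] for pb in b_docs[doc_id]):
            hits.append(doc_id)
    return hits

positional = build_positional_index(documents)
print(within(positional, "oatmeal", "raisins", 1))  # adjacency
print(within(positional, "oatmeal", "raisins", 2))
```

Every occurrence of every word gets a stored position, so the postings file grows with the total length of the collection.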

Truncation
• query locates all index terms matching the word stem
• right and left truncation require separate indexes to be
built
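
A sketch of truncation over a sorted term list, assuming a small hypothetical vocabulary. Right truncation is a prefix range in the ordinary index; left truncation uses a second index of reversed terms, which is the separate index the slide mentions. The `"\uffff"` sentinel assumes no term contains a character that high.

```python
import bisect

# Hypothetical index vocabulary.
terms = sorted(["biology", "psychiatry", "psychological",
                "psychologist", "psychology", "sociology"])
# Separate index of reversed terms, needed for left truncation.
reversed_terms = sorted(term[::-1] for term in terms)

def prefix_range(sorted_terms, prefix):
    """All entries that start with prefix, found by two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

def right_truncation(stem):
    """e.g. psycholog* matches terms beginning with the stem."""
    return prefix_range(terms, stem)

def left_truncation(suffix):
    """e.g. *ology matches terms ending with the suffix."""
    matches = prefix_range(reversed_terms, suffix[::-1])
    return sorted(match[::-1] for match in matches)

print(right_truncation("psycholog"))
print(left_truncation("ology"))
```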

Relevance ranking
• many newer systems provide facility for ranking
documents. Typically this is based on a measure of word
frequency within and between documents
Term Operations

Stemming
• Conflation of related words, usually reducing to a common
root -- do not confuse with truncation
–e.g., psycholog (psychologist, psychological, psychology)
• Sometimes this is done automatically on-the-fly at query time
• Also done at the time the index is created (index is consequently
much smaller)
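
A toy illustration of conflation at index time. The three suffix rules below are an assumption chosen only to reproduce the psycholog example; they are not a real stemming algorithm and would mis-stem many other words.

```python
# Hypothetical, deliberately tiny rule set (checked in this order).
SUFFIXES = ("ical", "ist", "y")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stemmed_index(docs):
    """Index documents under stems rather than full word forms."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index

# Hypothetical one-word documents.
docs = {1: "psychologist", 2: "psychological", 3: "psychology"}
stemmed_index = build_stemmed_index(docs)
print(stemmed_index)
```

Three word forms collapse into one index entry, which is why a stemmed index is smaller. Stemming at query time instead means applying `stem` to each query term before lookup.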

Term Weighting
• weights are determined for each term
• used for relevance and ranking determinations
• Sometimes done on-the-fly
• term weights (based on inter-document and inter-database
frequencies) are computed and stored at time of indexing
(recalculation = $$$)
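
The lecture does not name a weighting formula. As one common choice, the sketch below uses tf-idf (term frequency times the log of inverse document frequency), swapped in here purely for illustration, on the five recipe documents from earlier. The function names are hypothetical.

```python
import math

documents = {
    1: "word1 oatmeal apple word4 raisins word6",
    2: "word1 cocoa word3 oatmeal word5 word6",
    3: "cocoa word2 word3 oatmeal raisins word6",
    4: "word1 raisins cocoa word4 word5 word6",
    5: "oatmeal word2 raisins word4 word5 word6",
}

def term_weights(docs):
    """Compute a tf-idf weight for every (term, document) pair."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    doc_freq = {}
    for words in tokenized.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = {}
    for doc_id, words in tokenized.items():
        for word in set(words):
            tf = words.count(word)
            weights[(word, doc_id)] = tf * math.log(len(docs) / doc_freq[word])
    return weights

def rank(weights, doc_ids, query_terms):
    """Order documents by the summed weights of the query terms."""
    scores = {d: sum(weights.get((t, d), 0.0) for t in query_terms)
              for d in doc_ids}
    return sorted(doc_ids, key=lambda d: (-scores[d], d))

weights = term_weights(documents)
ranking = rank(weights, list(documents), ["oatmeal", "raisins", "apple", "cocoa"])
print(ranking)
```

A term that occurs in every document (word6 here) gets weight zero, while a rare term such as apple dominates the score. Because each weight depends on collection-wide counts, adding documents changes the stored weights, which is the recalculation cost the slide flags.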
Term Operations

Stop Lists
• many of the most frequently occurring
words make ineffective search terms (i.e.,
discrimination value is low) e.g.,
–like, the, and, to, of, an, out, a, …
• frequently occurring words are filtered
out during the processing of the index
and/or during querying
• generally stop lists should be employed
conservatively
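
A small sketch of stop-list filtering, using the example words from the slide as the stop list. The same filter would be applied both when indexing documents and when parsing queries; the sample sentence is hypothetical.

```python
# Stop list taken from the examples above.
STOP_WORDS = {"like", "the", "and", "to", "of", "an", "out", "a"}

def filter_terms(text):
    """Lowercase, split into words, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(filter_terms("The taste of oatmeal and raisins"))
```

Keeping the list short is the conservative approach: any word placed on it can no longer be searched at all.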
Document Operations
• assignment of unique ID numbers
• parsing of fields or segments
• masking of fields for searching or display
• indexing of search terms
• creation of inverted index and postings
files
• user interface display of documents
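
Several of these document operations can be sketched together. The "Field: value" record layout, the field names, and the choice of searchable fields are all assumptions made for illustration.

```python
import itertools

_next_id = itertools.count(1)               # source of unique ID numbers
SEARCHABLE_FIELDS = {"title", "abstract"}   # hypothetical search mask

def parse_record(raw):
    """Assign a unique ID and split 'Field: value' lines into a dict."""
    fields = {}
    for line in raw.strip().splitlines():
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return next(_next_id), fields

def index_record(index, doc_id, fields):
    """Add postings (document number, field) for words in unmasked fields."""
    for name, value in fields.items():
        if name not in SEARCHABLE_FIELDS:
            continue  # field is masked from searching
        for word in value.lower().split():
            index.setdefault(word, set()).add((doc_id, name))

# Hypothetical record.
raw = """Title: Oatmeal raisin cookies
Author: Smith
Abstract: A recipe with cocoa"""

index = {}
doc_id, fields = parse_record(raw)
index_record(index, doc_id, fields)
print(doc_id, sorted(index))
```

The author field is stored with the record but contributes nothing to the index, so it can be displayed without being searchable.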
IR versus Relational DBMS




Both have sophisticated file access and
file management utilities
Both employ complex indexing structures
(e.g., B+trees)
Provide query facilities
Provide similar user interface features
(e.g., menus, commands, etc.) and flexible
views of data
IR versus Relational DBMS

Information retrieval systems
• provide access to content of entire documents
• semi- or non-structured information
• retrieval is probabilistic
• applications range from small to very large
Database Management Systems
• provide access to tables of structured data
• retrieval is deterministic
• applications range from very small to very large

Critical to consider differences during
design
Database Software

Generic Database Management Software
–Access, Paradox, dBASE, FoxPro, Filemaker
–Oracle, Informix, Ingres, DB2, MiniSQL, MySQL
• used to create relational databases
• can handle textual and numeric data with
limitations
• can provide limited IR and arithmetic
functions
• can handle images, sound files, video clips,
etc. in digital format
Next Week



No Lecture
No Lab
Don’t forget to start working on your
project!