IT522-lec1

IT-522: Web Databases and Information Retrieval
By Dr. Syed Noman Hasany
Course Contents (Provided)
• Modelling
• Query operations
• Markup languages
• XML technologies and their applications
• Searching the Web
• IR models and languages
• Indexing and searching
• Digital libraries
• Project: Designing and developing parts of IR systems.
A correction regarding the book
• Williams, H. E. and Lane, D. “Building Effective Database-Driven Web Sites”, 2004, ISBN-13: 9780596005436.
• Reference
• “Web Database Applications with PHP and MySQL”, 2nd Edition
Sessional Marks
• Mid-1: 20 marks
• Mid-2: 20 marks
• Assignment: 10 marks
• Project: 10 marks
• Final: 40 marks
Course Description
• This course has two major inter-related portions:
– Information retrieval (more towards theoretical discussion and formulae)
– Web databases (more towards the practical side)
• Web theory
• PHP and MySQL
Definition of Information Retrieval
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large
collections (usually stored on computers).
Types of Information
• Structured: databases
• Semi-structured: XML, RDF
• Unstructured: text documents
Information Retrieval
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide Web is
the most recent and perhaps the most widely
used IR application.
• Concerned firstly with retrieving documents
that are relevant to a query.
• Concerned secondly with retrieving efficiently
from large sets of documents.
Relevance
• Relevance is a subjective judgment and may
include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended
use of the information (information need).
• Main relevance criterion: an IR system should
fulfill the user’s information need.
Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to
the query.
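The task above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus contents, the `tokenize` and `rank` helpers, and the scoring rule (count of query-term occurrences) are all hypothetical choices made for this example, not a model prescribed by the course.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank(corpus, query):
    """Return (doc_id, score) pairs, best match first.

    Assumed scoring rule: a document's score is the number of times
    any query term occurs in it. Documents with score 0 are dropped.
    """
    query_terms = set(tokenize(query))
    scores = []
    for doc_id, text in corpus.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores.append((doc_id, score))
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document corpus, used only for illustration.
corpus = {
    "Doc1": "web databases store structured data",
    "Doc2": "information retrieval finds relevant text documents",
    "Doc3": "searching the web is an information retrieval application",
}

ranked = rank(corpus, "information retrieval web")
for position, (doc_id, score) in enumerate(ranked, start=1):
    print(position, doc_id, score)
```

With this toy corpus, Doc3 matches all three query terms and is listed first. Real IR models replace the raw count with weighting schemes that also account for how informative each term is.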
Typical IR System Architecture
[Figure: the document corpus and the query string are inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Key Terms Used in IR
• QUERY: a representation of what the user is
looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user
wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes
querying easier
• TERM: word or concept that appears in a
document or a query
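To make these terms concrete, here is a minimal sketch of an inverted index, one common form of INDEX that maps each TERM to the DOCUMENTs containing it. The collection contents and the `build_index` and `search` helpers are assumptions introduced for illustration; the whitespace tokenization and the "all terms must match" query semantics are simplifications.

```python
from collections import defaultdict


def build_index(collection):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection of three short documents.
collection = {
    "d1": "web search engine",
    "d2": "database search",
    "d3": "web database",
}

index = build_index(collection)
print(search(index, "web search"))
print(sorted(search(index, "database")))
```

The index is built once over the collection, so answering a query only touches the sets for the query's terms instead of rescanning every document.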
Web Search System
[Figure: a spider crawls the Web to build the document corpus; the corpus and the query string are inputs to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
Spider
• A spider is a program that visits Web sites and reads their pages
and other information in order to create entries for a search
engine index. The major search engines on the Web all have
such a program, also known as a "crawler" or a "bot."
• Spiders are typically programmed to visit sites that have been
submitted by their owners as new or updated. Entire sites or
specific pages can be selectively visited and indexed.
• Spiders are called spiders because they usually visit many sites
in parallel, their "legs" spanning a large area of the "web."
Spiders can crawl through a site's pages in several ways. One way
is to follow all the hypertext links in each page until all the
pages have been read.
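The link-following strategy described above can be sketched as a breadth-first traversal. To keep the example self-contained, it crawls a hypothetical in-memory site (the `SITE` dictionary) instead of fetching real pages; the page names and the `crawl` helper are assumptions made for illustration, and a real spider would also need to fetch pages over HTTP, parse links out of HTML, and respect each site's crawling rules.

```python
from collections import deque

# Hypothetical site: each page maps to the hypertext links it contains.
SITE = {
    "/index": ["/about", "/courses"],
    "/about": ["/index"],
    "/courses": ["/courses/it522", "/about"],
    "/courses/it522": [],
}


def crawl(start, get_links):
    """Visit every page reachable from start by following links.

    Pages are visited breadth-first, and each page is read only once
    even when several pages link to it.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real spider would index the page here
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/index", lambda page: SITE.get(page, []))
print(visited)
```

The `seen` set is what stops the crawl from looping forever on pages that link back to each other, such as `/index` and `/about` here.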