IT-522: Web Databases and Information Retrieval
By Dr. Syed Noman Hasany

Course Contents (Provided)
• Modelling
• Query operations
• Mark up languages
• XML technologies and its applications
• Searching the Web
• IR models and languages
• Indexing and searching
• Digital libraries
• Project: Designing and developing parts of IR Systems.

A correction regarding book
• Williams, H. E. and Lane, D., “Building Effective Database-Driven Web Sites”, 2004, ISBN-13: 9780596005436.
• Reference: “Web Database Applications with PHP and MySQL”, 2nd Edition

Sessional Marks
• Mid-1: 20 marks
• Mid-2: 20 marks
• Assignment: 10 marks
• Project: 10 marks
• Final: 40 marks

Course Description
• This course has two major inter-related portions:
– Information retrieval (more towards theoretical discussion and formulae)
– Web databases (more towards the practical side)
   • Web theory
   • PHP and MySQL

Definition of Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Types of Information
• Structured: databases
• Semi-structured: XML, RDF
• Unstructured: text documents

Information Retrieval
• The indexing and retrieval of textual documents.
• Searching for pages on the World Wide Web is the most recent and perhaps the most widely used IR application.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving efficiently from large sets of documents.

Relevance
• Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the information (information need).
• Main relevance criterion: an IR system should fulfill the user’s information need.

Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.

Typical IR System Architecture
[Diagram: the document corpus and the query string are inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Key Terms Used in IR
• QUERY: a representation of what the user is looking for; can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve.
• COLLECTION: a set of documents.
• INDEX: a representation of information that makes querying easier.
• TERM: a word or concept that appears in a document or a query.

Web Search System
[Diagram: as above, with a spider gathering pages from the Web into the document corpus; the IR system outputs ranked pages: 1. Page1, 2. Page2, 3. Page3, ...]

Spider
• A spider is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, also known as a "crawler" or a "bot."
• Spiders are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed.
• Spiders are called spiders because they usually visit many sites in parallel, their "legs" spanning a large area of the "web."
• Spiders can crawl through a site's pages in several ways. One way is to follow all the hypertext links in each page until all the pages have been read.
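
The "follow all the hypertext links until all the pages have been read" strategy can be sketched as a breadth-first traversal. The example below is only an illustration under stated assumptions: the `SITE` dictionary, the `LinkExtractor` class, and the `crawl` function are hypothetical names, and the pages are held in memory rather than fetched over HTTP as a real spider would do.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL path -> HTML text. A real spider
# would download pages; this mapping is an assumption for illustration.
SITE = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "/b.html": '<a href="/index.html">Home</a>',
    "/c.html": "No links here.",
    "/orphan.html": "Not linked from anywhere.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch):
    """Breadth-first crawl from `start`; returns pages in visit order."""
    visited = []          # pages that were actually read
    seen = {start}        # pages already queued, to avoid revisiting
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page does not exist; skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl("/index.html", SITE.get)
print(pages)  # ['/index.html', '/a.html', '/b.html', '/c.html']
```

Note that `/orphan.html` is never visited: a spider that only follows links cannot discover a page that nothing links to, which is one reason site owners submit pages to search engines directly.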