
Syllabus

COSC 462: Introduction to Information Retrieval

Goals and Objectives:

This class is a specialized data mining class; we restrict our problem domain to large collections of text files. This is an area of major research today, as the theory developed has major implications for search engine technologies (e.g. Google, Ask, Yahoo), recommender systems (e.g. Netflix, Amazon), and information aggregation (Feedly, Google News). Upon completion of this course, students will:

- know the differences between the major models of information retrieval and identify the various weaknesses of each model

- master the major algorithms for aggregating large collections of text files

- master many data mining algorithms as applied to information retrieval, including entropy-minimizing search (ID3), Bayesian inference, clustering algorithms, and regression techniques

- implement a major sub-component of a search engine

Textbook: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley Professional, 2nd edition, 2011. ISBN 0321416910

Other references:

 

" Mining  the  Web:  Analysis  of  Hypertext  and  Semi  Structured  Data ",  Soumen  Chakrabarti,  Morgan  

Kaufmann  Publishers,  2003  

" Modern  Information  Retrieval ",  by  Ricardo  Baeza-­‐Yates  and  Bertier  Ribeiro-­‐Neto,  Addison  Wesley  ,  

1999  

" Managing  Gigabytes:  Compressing  and  Indexing  Documents  and  Images ",  Witten,  Moffat,  Bell,  

Morgan  Kaufman  Publishers  1999  

What you will do this term:

- homework (10% of grade). Most of the specific assignments are included below in the topics. All assignments should be typed.

- paper presentation to class (20% of grade). Topics below give references to online papers. You must sign up for a specific paper and provide a 20-30 minute presentation on it. Your grade will be based upon how well you have demonstrated mastery of the theory presented in the paper (50% of the assigned grade) and how well you explain the theory to the class (50% of the assigned grade).

- term project (team project – 3 or 4 people per team) (35% of grade). A web page of potential programming projects will be posted. Alternatively, if you have a suggestion for a programming project you should get my approval. Programming assignments are intended to underscore the theory we develop from lectures and readings. Surprisingly, we will discover that some of the results cited in publications do not actually work! Caveat Emptor!! All submitted programs should adhere to standards as established in lower COSC courses, including proper documentation, refactored code, descriptive variable naming conventions, etc.

- final exam (35% of grade): a comprehensive exam based upon lectures and readings.

Approximate Grading Scale

92-100  A
90-91   A-
88-89   B+
82-87   B
80-81   B-
77-79   C+
68-76   C
65-67   C-
63-64   D+
53-62   D
50-52   D-
<50     F

Grades will be “rounded up on the half” (e.g. 91.5 becomes 92, 91.4 becomes 91)

Attendance: is not required, but you miss class at your own risk. It is your responsibility to find out what work you missed; I suggest you get the phone number of a classmate.

Cheating: It violates University policy, you know.... so don't do it. Cheating is defined as representing all or part of someone else’s work as your own. While you are certainly encouraged to seek the advice of others in this class on assignments, the work you hand in should represent your own efforts. Violation of this rule will be dealt with according to University policy. If you are really stuck on a problem, come see the instructor!

Topics:

1) Crawlers, stopwords, stemming and other preprocessing.

Here is a very simple perl script that accepts a URL on the command line, opens a socket to the URL and reads the corresponding web page. Some simple changes (commented in the code) can be implemented to scan for other URLs. This can be modified (by you) to make a simple crawler.
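The script itself is only linked above, not reproduced here. As a rough sketch of the same idea in Python (the regex-based link scanning and the timeout are illustrative simplifications, not necessarily what the Perl script does):

# A minimal page fetcher / link scanner, sketched in Python rather than Perl.
# Assumes the target URL is reachable and serves HTML; the regex is a crude
# approximation of link extraction, not a robust HTML parser.
import re
import sys
import urllib.request

def fetch(url):
    """Fetch a web page and return its text (decoded loosely)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(html):
    """Scan the page for absolute href targets -- the small change that turns a reader into a crawler."""
    return re.findall(r'href=["\'](https?://[^"\']+)["\']', html, flags=re.IGNORECASE)

if __name__ == "__main__":
    page = fetch(sys.argv[1])          # URL supplied on the command line
    for link in extract_links(page):   # to crawl, push these onto a queue and repeat
        print(link)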

Traditional Information Retrieval (IR) consisted of the following preprocessing steps:

- Convert all text to the same case

- Eliminate numbers and hyphenated words

- Eliminate punctuation

- Eliminate unimportant words. These were typically prepositions and articles; such words are called stopwords.

- Stem the words. That is, get different versions of the word into a standard form (for example, observation, observes, and observe might all be represented as observ-).

We shall see that modern IR doesn't do many of these steps. Unix has many built-in functions for simple preprocessing; we consider an example here performed on this data file (the data file contains e-mail postings from an AI users group). Here is a list of English stopwords, and a list in German is here. Stemming is usually performed with the famous Porter's Algorithm. You can get the code for it here.
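A rough sketch of this traditional pipeline in Python (assuming NLTK is installed for the Porter stemmer; the stopword list below is a tiny illustrative subset, not the full list linked above):

# Case folding, dropping numbers/punctuation, stopword removal, Porter stemming.
import re
from nltk.stem.porter import PorterStemmer

STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are"}  # tiny subset

def preprocess(text):
    text = text.lower()                           # convert all text to the same case
    text = re.sub(r"[^a-z\s]", " ", text)         # eliminate numbers, hyphens, punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]   # drop stopwords
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]      # observation/observes/observe -> observ

print(preprocess("The observer observes; observations were recorded in 1999."))
# -> something like ['observ', 'observ', 'observ', 'were', 'record']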

Homework Question: do these ideas carry over well to languages other than English?

2) Structure of the Web

Read the paper "Graph Structure of the Web". There are two main points covered in this paper. First, the number of links into a page, as well as the number of links out of a page, can be described probabilistically via a power law (such as Zipf's law). Be careful to note that power laws include a constant of proportionality that makes the probabilities sum to 1. Second, the paper demonstrates that the web can be decomposed into 4 distinct components.
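A quick sketch of how that constant of proportionality is computed for a finite power law (the exponent and range below are illustrative choices, not values taken from the paper):

# Zipf-style power law over ranks 1..N: P(k) is proportional to k^(-alpha).
# The constant C is chosen so that the probabilities sum to 1.
alpha = 2.1          # an exponent near 2 is typical for web in-degree distributions
N = 1_000_000        # number of pages / ranks considered (arbitrary here)

C = 1.0 / sum(k ** -alpha for k in range(1, N + 1))    # constant of proportionality

def prob(k):
    return C * k ** -alpha

print(C)
print(sum(prob(k) for k in range(1, N + 1)))           # ~1.0, as required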

Homework Problem #1 (You all must do this. Hand in typed responses)

Some experiments (don't spend a huge amount of time on this):

1) Is there a path from the White House (www.whitehouse.gov) to Eastern Michigan University (www.emich.edu)? Is there a path back? If so, please provide it.

2) Is there a path from EMU (www.emich.edu) to the main page of the German railroad system (www.bahn.de)? Is there a path back? If so, please provide it. Do you think that the EMU website and the White House website are in the same component of the web graph? Consider the 4 main components of the web discussed in the paper: what portions do you think Google crawls?

3) Google provides a mechanism where you specify a web page and Google will tell you which pages it thinks point to it. To see which pages Google thinks points to the EMU main web page, type the following in the search window:

link:www.emich.edu

Do some experiments (with other pages of course) and determine if you think the "link" command is accurate. Please note that Google’s "link" command in conjunction with our Sphinx program permits us to roughly determine in-degree and out-degree of any web page.

Homework Problem 1b:

What is the "deep web"? How can one search it? Do a Google search on "deep web" and "invisible web". This should be a 15-20 minute talk and a short paper.

3) Models of Information Retrieval

This is really important and provides much of the foundation for the rest of the class!!

- Boolean Model

- Vector Space Model

- Relevance Feedback systems

Homework Problem #2 (You all must hand this in!!)

Consider a vector space model of IR with document similarity defined by either cosine similarity or vector distance. A web 'hacker' decides his/her web page can receive a higher ranking by repeating every word in the document three times (we assume a vector space model). Will this result in a higher ranking? Provide examples to support your answer.
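If you want to experiment, here is a minimal sketch of a vector space model using raw term counts and cosine similarity (the toy query and documents are made up):

# Documents as term-count vectors, ranked by cosine similarity to a query.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "information retrieval with the vector space model",
    "d2": "cooking pasta with tomato sauce",
}
query = "vector space retrieval"

qvec = Counter(query.split())
for name, text in docs.items():
    print(name, round(cosine(qvec, Counter(text.split())), 3))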

4) Information Retrieval in other languages (other than English!) 15-20 minute presentation and a short paper

The paper here suggests that IR (information retrieval) is more difficult in Swedish than in English. Why?

Homework Problem #3

Visit the MovieLens (www.movielens.umn.edu) web site and rate movies. Write a 1-page paper describing whether the system worked for you or not. That is, was the system able to select movies you liked?

5) Pagerank - Google's ranking system (volunteer needed)

Read the papers "Anatomy of a Search Engine" and "Google's Pagerank Explained". The first paper is lengthy; read it all, but pay particular attention to the issues of ranking pages and the discussions of other models. The second is a Pagerank tutorial (it might be easier to read the second paper first).

Please Note: the first paper contains the following *incorrect* statement:

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
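Here is a minimal power-iteration sketch of Pagerank on a made-up three-page graph. It uses the (1 - d)/N variant of the update with the usual damping factor d = 0.85; treat the graph and the constants as illustrative, not as the paper's exact formulation:

# Power iteration for Pagerank on a tiny made-up graph.
links = {                  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                   # damping factor
N = len(links)
rank = {p: 1.0 / N for p in links}            # start from a uniform distribution

for _ in range(50):                           # iterate until (roughly) converged
    new_rank = {}
    for page in links:
        incoming = sum(rank[q] / len(links[q]) for q in links if page in links[q])
        new_rank[page] = (1 - d) / N + d * incoming
    rank = new_rank

print(rank)
print(sum(rank.values()))                     # ~1.0 under this variant of the update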

Homework Problems

1) (A Math Problem) Suppose every web page on the web is linked to by at least one other page. What is the total Pagerank of the web?

2) Do META tags impact the results from search engines? Read about META tags here.

3) (Dated question....the following link no longer exists) Go to the web site www.linkstoyou.com. How does this site claim to increase your search engine ranking?

6) Alternative Ranking Systems

Read about MMR and Kleinberg's HITS algorithm. Neither of these algorithms has been commercially implemented, but each addresses serious problems with existing search engines. MMR attempts to retrieve documents that are dissimilar, since current search engines tend to return documents that are redundant.

HITS attempts to find the highest-quality documents (the so-called authorities). We will visit MMR again when we study summarization. For this section, only sections 1-3 of the paper are relevant.
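A small sketch of the HITS hub/authority iteration on a made-up link graph (the graph and the iteration count are illustrative):

# Hub score: sums the authority scores of pages it points to.
# Authority score: sums the hub scores of pages that point to it.
import math

links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(30):
    # authority update: who do good hubs point to?
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub update: who points to good authorities?
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the scores don't blow up
    anorm = math.sqrt(sum(v * v for v in auth.values()))
    hnorm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / anorm for p, v in auth.items()}
    hub = {p: v / hnorm for p, v in hub.items()}

print("authorities:", auth)    # C should come out as the strongest authority here
print("hubs:", hub)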

7) Some Data Mining Algorithms

We digress from the web for a moment and focus on some machine learning/data mining algorithms. First, we study clustering with a focus on K-means and hierarchical clustering algorithms. Next, we give a short discussion of decision trees and entropy-reducing decision trees (Quinlan's ID3 and C4.5 algorithms). Finally, we discuss Naive Bayesian Inference. These three concepts will be employed in our next two sections.
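As a preview, here is a compact K-means sketch on made-up 2-D points; in our applications the points will be document vectors, but the loop is the same (assign each point to its nearest centroid, then recompute the centroids):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)                  # initialize from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assignment step
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, cluster in enumerate(clusters):            # update step
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)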

8) TDT - Topic Detection and Tracking

TDT has received much attention in the United States. Every year, research institutions submit programs that compete to complete various tasks, usually involving newspaper articles as test data. The number of distinct text processing tasks has increased over the years and currently stands at 5. Topic tracking involves keeping a list of articles that pertain to the same topic (for example, get all articles on World Cup Soccer). First Story Detection is having the computer determine when a new (unrelated) story has appeared. In most of these applications, the stories are presented sequentially and solutions must be computed online (that is, one cannot go back and re-read the data). Read about the different TDT tasks here.

We'll read two papers for this section; both deal with single-pass clustering. In the first paper, our primary focus is on the clustering algorithm, so just pay particular attention to section 3.1. Paper two attempts to provide some efficiencies (Question: what does "deferral of zero" mean in this paper?).
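The flavor of single-pass clustering can be captured in a few lines. In this sketch the similarity threshold, the bag-of-words representation, and the toy stream are illustrative assumptions, not the papers' actual settings:

# Each arriving story is compared to existing cluster centroids; if the best
# similarity is below a threshold, it is flagged as a new (first) story.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

THRESHOLD = 0.2
clusters = []                       # each cluster is a Counter of term counts

stream = [
    "world cup soccer final tonight",
    "soccer world cup results",
    "stock market falls sharply",
]

for story in stream:                # documents arrive one at a time (online)
    vec = Counter(story.split())
    sims = [cosine(vec, c) for c in clusters]
    if sims and max(sims) >= THRESHOLD:
        clusters[sims.index(max(sims))] += vec      # add the story to its best cluster
    else:
        print("first story:", story)                # no close cluster -> new topic
        clusters.append(vec)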

9) Applications of Clustering Algorithms to Information Retrieval

We consider two papers now. In the first paper, Italian researchers employ a hierarchical clustering algorithm on web pages. In the second paper, Swedish researchers employ a k-means clustering algorithm to cluster newspaper articles. We will need to define some terms for you to read these papers (cohesion, coupling, precision, recall, etc.), but you should be able to get the main ideas presented on your own.
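Two of those terms, precision and recall, have standard definitions that are easy to state in code (the document IDs below are made up):

# precision = fraction of retrieved documents that are relevant
# recall    = fraction of relevant documents that were retrieved
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/4 = 0.5
recall = len(hits) / len(relevant)       # 2/3, about 0.67
print(precision, recall)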

10) Another Clustering Application - Summarization

Here is another application of the document-keyword matrix and cosine similarity. Simply, sentences play the role of documents. TF-IDF quantities are calculated for each word in each sentence, and a score (typically the sum of all the constituent TF-IDF scores) is calculated. The top N (N is specified by the user) scoring sentences are retained as a summary. Variants on this include clustering, scoring by the relative position of a sentence in an article, and deletion of related sentences. Our readings will begin with a paper we saw before, concerning MMR. For this reading, we concentrate on section 4 of the paper. The next 3 papers are all very related (and even share some of the same authors). The common theme is this: articles that have previously been clustered (implying a similarity of articles, like news reports) need to be summarized. This is called the multi-document summarization problem (recall Topic Detection and Tracking from above).

Centroid-based Summarization

Experiments in Single and Multi-Document Summarization

LexRank: Graph-based Lexical Centrality
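As a rough sketch of the sentence-scoring approach described above (the sentence splitting, the TF-IDF weighting, and the toy text are deliberately simplified):

# Extractive summarization: treat each sentence as a "document", weight its
# words by TF-IDF, score the sentence by the sum of those weights, and keep
# the top N sentences.
import math
import re
from collections import Counter

def summarize(text, n=2):
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    term_counts = [Counter(s.lower().split()) for s in sentences]
    num_sents = len(sentences)
    df = Counter()                       # in how many sentences does each word occur?
    for counts in term_counts:
        df.update(counts.keys())
    def score(counts):
        return sum(tf * math.log(num_sents / df[w]) for w, tf in counts.items())
    ranked = sorted(zip(sentences, term_counts), key=lambda pair: score(pair[1]), reverse=True)
    return [s for s, _ in ranked[:n]]

text = ("The election results were announced today. Voters turned out in record "
        "numbers. Officials said turnout broke every previous record. The weather was mild.")
print(summarize(text, n=2))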

11) And yet another Clustering Application - Recommender Systems and Collaborative Filtering

Go to www.movielens.umn.edu and try out MovieLens (we had this earlier as a homework assignment).

Recommender Systems (that is, systems that make recommendations to users.....ever visit Amazon??) typically employ one of two methods for recommendation: content (attribute) based or ratings based (collaborative filtering). Content-based recommendation requires extracting knowledge about a user's specific likes (for example, "I like mystery novels", or "I like sports"). Such information is extracted via textual processing (usually). With collaborative filtering, users are asked to rate items on a scale (for example, 1 to 5, with 1 meaning "I like this least" and 5 meaning "I like this most"). Recommendation on an item is then based upon finding individuals whose ratings on other items closely match yours.

One of the early collaborative filtering systems was GroupLens, a system that made recommendations of news articles. The GroupLens system employs a variation of the correlation coefficient to determine similarity (much like the cosine similarity of the vector space model). Several alternatives are considered here. Commercial applications of recommender systems are considered here. A separate problem from making a specific recommendation is to make a recommendation of a set of objects. A discussion of this problem can be found here. (Note: this is a LONG paper. You don't need to read it all, but the first 4 or 5 pages provide a good introduction to the problem.)
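A minimal sketch of ratings-based (collaborative) filtering in the GroupLens spirit: Pearson correlation over co-rated items, with a mean-offset prediction. The ratings matrix is made up and the formulas are simplified:

import math

ratings = {                     # user -> {item: rating on a 1..5 scale}
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def pearson(u, v):
    # correlation of two users' ratings over the items they have both rated
    common = [i for i in ratings[u] if i in ratings[v]]
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = (math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)) *
           math.sqrt(sum((ratings[v][i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def mean_rating(u):
    return sum(ratings[u].values()) / len(ratings[u])

def predict(user, item):
    # offset each neighbor's rating by their own mean, weight by similarity,
    # then add back the target user's mean rating
    num, den = 0.0, 0.0
    for v in ratings:
        if v == user or item not in ratings[v]:
            continue
        w = pearson(user, v)
        num += w * (ratings[v][item] - mean_rating(v))
        den += abs(w)
    return mean_rating(user) + num / den if den else None

print(pearson("alice", "bob"))      # strong positive correlation
print(predict("alice", "m4"))       # predicted rating for a movie alice hasn't rated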

12) SPAM!!!

Some intro probability stuff (slides 1-32)

Abduction

- Bayesian Filters for e-mail spam

If only we could stop this stuff.

One of the experts here is Paul Graham. In his paper "A Plan for Spam" he extends the idea of a Naive Bayesian inference filter. Read about it here.

For those of you considering writing your own spam filters, Graham provides some nice links to spam sources here.
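For those curious about the mechanics, here is a minimal naive Bayesian word-probability sketch. It uses the generic naive Bayes approach with add-one smoothing rather than Graham's exact scoring formula, and the tiny training set is made up:

# Estimate P(word | spam) and P(word | ham) from labeled training messages,
# then compare log-probabilities for a new message.
import math
from collections import Counter

spam_train = ["cheap viagra offer", "win money now", "cheap offer now"]
ham_train = ["meeting agenda attached", "lunch tomorrow", "project meeting notes"]

spam_counts = Counter(w for msg in spam_train for w in msg.split())
ham_counts = Counter(w for msg in ham_train for w in msg.split())
vocab = set(spam_counts) | set(ham_counts)

def log_prob(word, counts):
    # add-one (Laplace) smoothing so unseen words don't zero out the product
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def classify(message):
    # equal priors assumed, since the two training sets are the same size
    spam_score = sum(log_prob(w, spam_counts) for w in message.split())
    ham_score = sum(log_prob(w, ham_counts) for w in message.split())
    return "spam" if spam_score > ham_score else "ham"

print(classify("cheap money offer"))        # -> spam
print(classify("notes from the meeting"))   # -> ham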

- Link spam

Raise your pagerank by increasing the number of inbound links to your page. Really raise your pagerank by having all your friends start blogs and create a link to your page daily (have you all tried searching for "miserable failure" on Google?). Can we automate the process of detecting such fraudulent links (I use fraudulent to mean that the links serve no useful purpose other than raising pagerank)? Read Brian Davison's paper here.

13) Tracking Infections - How Does Information Move in BlogSpace?

A relatively new topic! Since blogspace content is crawled more frequently (daily!!), the dynamic nature of the web can be studied more closely. Frequently, bloggers do not give a reference to where they received information. These papers attempt to employ Artificial Intelligence techniques to backtrack where bloggers received their information. A blog site is said to be infected by another blog site when it receives new information from that site. These papers attempt to track this spread of information [paper1] [paper2]. Visit www.blogpulse.com and use the available tools.
