Goals and Objectives:
This class is a specialized data mining class; we restrict our problem domain to large collections of text files. This is an area of active research today, as the theory developed has major implications for search engine technologies (e.g. Google, Ask, Yahoo), recommender systems (e.g. Netflix, Amazon), and information aggregation (e.g. Feedly, Google News). Upon completion of this course, students will:
- know the differences between the major models of information retrieval and identify the various weaknesses of each model
- master the major algorithms for aggregating large collections of text files
- master many data mining algorithms as applied to information retrieval, including entropy-minimizing search (ID3), Bayesian inference, clustering algorithms, and regression techniques
- implement a major sub-component of a search engine
Textbook: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley Professional, 2nd edition, 2011. ISBN 0321416910
Other references:
" Mining the Web: Analysis of Hypertext and Semi Structured Data ", Soumen Chakrabarti, Morgan
Kaufmann Publishers, 2003
" Modern Information Retrieval ", by Ricardo Baeza-‐Yates and Bertier Ribeiro-‐Neto, Addison Wesley ,
1999
" Managing Gigabytes: Compressing and Indexing Documents and Images ", Witten, Moffat, Bell,
Morgan Kaufman Publishers 1999
What you will do this term:
- homework (10% of grade). Most of the specific assignments are included below in the topics. All assignments should be typed.
- paper presentation to class (20% of grade). Topics below give references to online papers. You must sign up for a specific paper and provide a 20-30 minute presentation on it. Your grade will be based upon how well you have demonstrated mastery of the theory presented in the paper (50% of the assigned grade) and how well you explain the theory to the class (50% of the assigned grade)
- term project (team project – 3 or 4 people per team) (35% of grade). A web page of potential programming projects will be posted. Alternatively, if you have a suggestion for a programming project you should get my approval. Programming assignments are intended to underscore the theory we develop from lectures and readings. Surprisingly, we will discover that some of the results cited in publications do not actually work! Caveat Emptor!! All submitted programs should adhere to standards as established in lower COSC courses, including proper documentation, refactored code, descriptive variable naming conventions, etc.
- final exam (35% of grade): a comprehensive exam based upon lectures and readings.
Approximate Grading Scale
92-100 A
90-91 A-
88-89 B+
82-87 B
80-81 B-
77-79 C+
68-76 C
65-67 C-
63-64 D+
53-62 D
50-52 D-
<50 F
Grades will be “rounded up on the half” (e.g. 91.5 becomes 92, 91.4 becomes 91)
Attendance: not required, but you miss class at your own risk. It is your responsibility to find out what work you missed; I suggest you get the phone number of a classmate.
Cheating: It violates University policy, you know.... so don't do it. Cheating is defined as representing all or part of someone else’s work as your own. While you are certainly encouraged to seek the advice of others in this class on assignments, the work you hand in should represent your own efforts. Violation of this rule will be dealt with according to University policy. If you are really stuck on a problem, come see the instructor!
Here is a very simple Perl script that accepts a URL on the command line, opens a socket to the corresponding host, and reads the web page. Some simple changes (commented in the code) can be made to scan for other URLs; you can modify this yourself to make a simple crawler.
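The script itself is linked rather than reproduced on this page. For reference, a rough sketch of the same idea in Python (standard library only; the regex-based link extraction is a deliberate simplification of what a real crawler does):

    import re
    import sys
    from urllib.request import urlopen

    def fetch(url):
        """Download a web page and return its text (crudely decoded)."""
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def extract_links(html):
        """Pull absolute href targets out of the page with a simple regex."""
        return re.findall(r'href="(http[^"]+)"', html)

    if __name__ == "__main__":
        page = fetch(sys.argv[1])          # URL supplied on the command line
        print(page[:500])                  # show the start of the page
        for link in extract_links(page):   # the seed of a simple crawler
            print(link)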
Traditional Information Retrieval (IR) consisted of the following preprocessing steps:
- Convert all text to the same case
- Eliminate numbers and hyphenated words
- Eliminate punctuation
- Eliminate unimportant words, typically prepositions and articles. Such words are called stopwords.
- Stem the words. That is, reduce different forms of a word to a standard form (for example, observation, observes, and observe might all be represented as observ-).
We shall see that modern IR skips many of these steps. Unix has many built-in tools for simple preprocessing; we consider an example here performed on this data file (the data file contains e-mail postings from an AI users group). Here is a list of English stopwords, and a list in German is here. Stemming is usually performed with the famous Porter's Algorithm; you can get the code for it here.
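A minimal end-to-end sketch of the traditional pipeline in Python (the stopword list is a tiny illustrative sample, and the stemmer is a crude suffix-stripper standing in for Porter's Algorithm):

    import re

    # Tiny illustrative stopword list; a real list (like the one linked above) is much longer.
    STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

    def crude_stem(word):
        """Strip a few common English suffixes. This is NOT Porter's algorithm,
        just a placeholder so the pipeline runs end to end."""
        for suffix in ("ations", "ation", "ing", "es", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(text):
        text = text.lower()                      # 1. convert to the same case
        text = re.sub(r"\S*\d\S*", " ", text)    # 2. drop tokens containing digits
        text = re.sub(r"[^\w\s]|_", " ", text)   # 3. drop punctuation (also splits hyphenated words)
        tokens = text.split()
        tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
        return [crude_stem(t) for t in tokens]               # 5. stem

    print(preprocess("The 3 observers observed an observation, re-checking it."))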
Homework Question: do these ideas carry over well to languages other than English?
Read the paper " Graph Structure of the Web ". There are two main points covered in this paper. First, the number of links into a page as well as the number of links out of a page can be described probabilistically via a power law (such as Zipf's law). Be careful to note that power laws include a constant of proportionality to make the sum of the probabilities sum to 1. Second, the paper demonstrates that the web can be decomposed into 4 distinct classes.
Some experiments (don't spend a huge amount of time on this):
1) Is there a path from the White House (www.whitehouse.gov) to Eastern Michigan University (www.emich.edu)? Is there a path back? If so, please provide it.
2) Is there a path from EMU (www.emich.edu) to the main page of the German railroad system (www.bahn.de)? Is there a path back? If so, please provide it. Do you think that the EMU website and the White House web site are in the same component of the web graph? Consider the 4 main components of the web discussed in the paper: what portions do you think Google crawls?
3) Google provides a mechanism where you specify a web page and Google will tell you which pages it thinks point to it. To see which pages Google thinks point to the EMU main web page, type the following in the search window:
link:www.emich.edu
Do some experiments (with other pages of course) and determine if you think the "link" command is accurate. Please note that Google’s "link" command in conjunction with our Sphinx program permits us to roughly determine in-degree and out-degree of any web page.
Homework Problem 1b:
What is the "deep web" ? How can one search it? Do a google on "deep web" and
"invisible web". This should be a 15-20 minute talk and a short paper.
This is really important and provides much of the foundation for the rest of the class!!
- Boolean Model
- Vector Space Model
- Relevance Feedback systems
Consider a vector space model of IR with document similarity defined by either cosine similarity or vector distance. A web 'hacker' decides his/her web page can receive a higher ranking by repeating every word in the document three times. Will this result in a higher ranking? Provide examples to support your answer.
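If you want to experiment with this question, here is a minimal sketch of the two similarity measures over raw term-count vectors (the toy vocabulary and vectors are made up; no TF-IDF weighting):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Toy vocabulary: ["cheap", "flights", "hotel"]; vectors hold raw term counts.
    query = [1, 1, 0]
    doc   = [2, 1, 1]

    print("cosine   :", cosine(query, doc))
    print("distance :", euclidean(query, doc))
    # Try replacing `doc` with [3 * x for x in doc] and compare how each measure changes.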
The paper here suggests that IR (information retrieval) is more difficult in Swedish than in English. Why?
Visit the MovieLens (www.movielens.umn.edu) web site and rate movies. Write a 1-page paper describing whether the system worked for you. That is, was the system able to select movies you liked?
Read the papers "Anatomy of a Search Engine" and "Google's Pagerank Explained". The first paper is lengthy; read it all, but pay particular attention to issues of ranking pages and discussions of other models. The second is a Pagerank tutorial (it might be easier to read the second paper first).
Please Note: the first paper contains the following *incorrect* statement:
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages'
PageRanks will be one.
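For reference while working the problems below, here is a minimal sketch of the iterative computation exactly as the first paper writes it, PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)); the three-page link graph is made up:

    # T1..Tn are the pages linking to A, C(T) is the number of outbound links on T,
    # and d is the damping factor.
    d = 0.85
    links = {                      # made-up three-page web: page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }
    pages = list(links)
    pr = {p: 1.0 for p in pages}   # the paper initializes every page to 1

    for _ in range(50):            # iterate until the values settle
        new_pr = {}
        for p in pages:
            incoming = [q for q in pages if p in links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new_pr

    print(pr)                      # compare the total against the quoted claim (Problem 1)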
1) (A Math Problem) Suppose every web page on the web is linked to by at least one other page. What is the total Pagerank of the web?
2) Do META tags impact the results from search engines? Read about META tags here.
3) (Dated question....the following link no longer exists) Go to the web site www.linkstoyou.com . How does this site claim to increase your search engine ranking?
Read about MMR and Kleinberg's HITS algorithm. Neither of these algorithms has been commercially implemented, but each addresses serious problems with existing search engines. MMR attempts to retrieve documents that are dissimilar, since current search engines tend to return documents that are redundant. HITS attempts to find the highest-quality documents (the so-called authorities). We will visit MMR again when we study summarization. For this section, only sections 1-3 of the paper are relevant.
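To make the hub/authority idea concrete, here is a minimal sketch of the HITS iteration on a made-up link graph (a real implementation works on the neighborhood graph built around a query's result set):

    import math

    links = {                      # made-up link graph: page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(30):
        # A page's authority score is the sum of the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print("authorities:", auth)
    print("hubs       :", hub)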
We digress from the web for a moment and focus on some machine learning/data mining algorithms. First, we study clustering, with a focus on K-means and hierarchical clustering algorithms. Next we give a short discussion of decision trees and entropy-reducing decision trees (Quinlan's ID3 and C4.5 algorithms). Finally, we discuss Naive Bayesian inference. These three concepts will be employed in our next two sections.
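As a quick reference for the clustering discussion, here is a minimal K-means sketch on 2-D points (the data and k are made up; document clustering would instead use term vectors, typically with cosine similarity):

    import math
    import random

    def kmeans(points, k, iterations=20):
        """Plain K-means on 2-D points with Euclidean distance."""
        centroids = random.sample(points, k)            # pick k initial centroids from the data
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                    sum(y for _, y in cluster) / len(cluster))
        return centroids, clusters

    data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)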
TDT has received much attention in the United States. Every year research institutions submit programs that compete to complete various tasks, usually involving newspaper articles as test data. The number of distinct text processing tasks has increased over the years and currently stands at 5. Topic tracking involves keeping a list of articles that pertain to the same topic (for example, get all articles on World Cup Soccer). First Story Detection is having the computer determine when a new (unrelated) story has appeared. In most of these applications, the stories are presented sequentially and solutions must be computed online (that is, one cannot go back and re-read the data). Read about the different TDT tasks here.
We'll read two papers for this section; both deal with single-pass clustering. In the first paper, our primary focus is on the clustering algorithm, so pay particular attention to section 3.1. Paper two attempts to provide some efficiencies. (Question: what does "deferral of zero" mean in this paper?)
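The core single-pass idea, roughly: each incoming story is compared to the existing clusters, joins the best match if it is similar enough, and otherwise starts a new cluster. A minimal sketch (the similarity function, threshold, and stories are made up; the papers use TF-IDF weighted vectors):

    def single_pass_cluster(docs, similarity, threshold=0.5):
        """Single-pass (online) clustering: each document is seen exactly once."""
        clusters = []                       # each cluster is a list of documents
        for doc in docs:
            best, best_sim = None, 0.0
            for cluster in clusters:
                sim = similarity(doc, cluster[0])   # compare to the cluster seed (a centroid is also common)
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= threshold:
                best.append(doc)            # close enough: same topic
            else:
                clusters.append([doc])      # first story of a new topic
        return clusters

    # Toy similarity: Jaccard overlap of word sets (real systems use TF-IDF + cosine).
    def jaccard(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)

    stories = ["world cup soccer final", "soccer world cup result", "new mars rover lands"]
    print(single_pass_cluster(stories, jaccard, threshold=0.3))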
We consider two papers now. In the first paper, Italian researchers apply a hierarchical clustering algorithm to web pages. In the second paper, Swedish researchers employ a k-means clustering algorithm to cluster newspaper articles. We will need to define some terms for you to read these papers (cohesion, coupling, precision, recall, etc.), but you should be able to get the main ideas on your own.
Here is another application of the document-keyword matrix and cosine similarity. Simply put, sentences play the role of documents. TF-IDF quantities are calculated for each word in each sentence, and a score (typically the sum of all the constituent TF-IDF scores) is calculated for each sentence. The top N scoring sentences (N is specified by the user) are retained as a summary. Variants on this include clustering, scoring by the relative position of a sentence in an article, and deletion of related sentences. Our readings will begin with a paper we saw before, concerning MMR; for this reading, we concentrate on section 4 of the paper. The next 3 papers are all very related (and even share some of the same authors). The common theme is this: articles that have previously been clustered (implying a similarity of articles, like news reports) need to be summarized. This is called the multi-document summarization problem (recall Topic Detection and Tracking from above). A small sketch of the basic sentence-scoring scheme follows the reading list below.
Centroid-based Summarization
Experiments in Single and Multi-Document Summarization
LexRank: Graph-based Lexical Centrality
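Here is a minimal sketch of the basic sentence-scoring scheme described above (the sentence splitter and example article are deliberately crude; each sentence plays the role of a document when computing IDF):

    import math
    import re
    from collections import Counter

    def sentences(text):
        """Very crude sentence splitter."""
        return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

    def tokenize(sentence):
        return re.findall(r"[a-z]+", sentence.lower())

    def summarize(text, n=2):
        """Score each sentence by the sum of TF-IDF weights of its words; keep the top n."""
        sents = sentences(text)
        token_lists = [tokenize(s) for s in sents]
        # Document frequency: here each sentence plays the role of a document.
        df = Counter(word for tokens in token_lists for word in set(tokens))
        N = len(sents)
        def score(tokens):
            tf = Counter(tokens)
            return sum(tf[w] * math.log(N / df[w]) for w in tf)
        ranked = sorted(range(N), key=lambda i: score(token_lists[i]), reverse=True)
        keep = sorted(ranked[:n])                     # keep the top-n sentences in original order
        return ". ".join(sents[i] for i in keep) + "."

    article = ("The rover landed on Mars today. Engineers cheered at mission control. "
               "The rover will search Mars for signs of ancient water. Weather was calm.")
    print(summarize(article, n=2))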
Go to www.movielens.umn.edu and try out MovieLens (we had this earlier as a homework assignment).
Recommender Systems (that is, systems that make recommendations to users.....ever visit Amazon??) typically employ one of two methods for recommendation: content (attribute) based or ratings based (collaborative filtering). Content-based recommendation requires extracting knowledge about a user's specific likes (for example, "I like mystery novels" or "I like sports"). Such information is usually extracted via textual processing. With collaborative filtering, users are asked to rate items on a scale (for example, 1 to 5 with 1 meaning "I like this least" and 5 meaning "I like this most"). Recommendation of an item is then based upon finding individuals that closely match your ratings on other items.
One of the early collaborative filtering systems was GroupLens, a system that made recommendations of news articles. The GroupLens system employs a variation of the correlation coefficient to determine similarity (much like the cosine similarity of the vector space model). Several alternatives are considered here. Commercial applications of recommender systems are considered here. A separate problem from making a specific recommendation is to make a recommendation of a set of objects. A discussion of this problem can be found here. (Note: this is a LONG paper. You don't need to read it all, but the first 4 or 5 pages provide a good introduction to the problem.)
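A minimal sketch in the GroupLens spirit: Pearson correlation over co-rated items, and a prediction formed from the user's mean rating plus correlation-weighted deviations (the users and ratings are made up):

    import math

    ratings = {                      # made-up ratings on a 1-5 scale; missing items were not rated
        "alice": {"Alien": 5, "Amelie": 1, "Heat": 4},
        "bob":   {"Alien": 4, "Amelie": 2, "Heat": 5, "Fargo": 4},
        "carol": {"Alien": 1, "Amelie": 5, "Fargo": 2},
    }

    def pearson(u, v):
        """Correlation between two users over the items both have rated."""
        common = set(ratings[u]) & set(ratings[v])
        if len(common) < 2:
            return 0.0
        xs = [ratings[u][i] for i in common]
        ys = [ratings[v][i] for i in common]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = math.sqrt(sum((x - mx) ** 2 for x in xs)) * math.sqrt(sum((y - my) ** 2 for y in ys))
        return num / den if den else 0.0

    def mean_rating(user):
        vals = ratings[user].values()
        return sum(vals) / len(vals)

    def predict(user, item):
        """Predict the user's mean plus a correlation-weighted average of other
        users' deviations from their own means."""
        others = [v for v in ratings if v != user and item in ratings[v]]
        pairs = [(pearson(user, v), ratings[v][item] - mean_rating(v)) for v in others]
        total = sum(abs(w) for w, _ in pairs)
        if total == 0:
            return mean_rating(user)
        return mean_rating(user) + sum(w * dev for w, dev in pairs) / total

    print(pearson("alice", "bob"))      # similar tastes -> strong positive correlation
    print(predict("alice", "Fargo"))    # pulled toward bob's view, since bob's ratings match alice's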
- Bayesian Filters for e-mail spam
If only we could stop this stuff.
One of the experts here is Paul Graham. In his paper "A Plan for Spam" he extends the idea of a Naive Bayesian inference filter. Read about it here.
For those of you considering writing your own spam filters, Graham provides some nice links to spam sources here.
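For orientation, here is a plain textbook Naive Bayes filter with Laplace smoothing; this is not Graham's exact weighting scheme, and the two training corpora are tiny made-up examples:

    import math
    import re
    from collections import Counter

    # Tiny made-up training corpora; a real filter trains on thousands of messages.
    spam_msgs = ["cheap meds buy now", "win money now", "cheap money offer"]
    ham_msgs  = ["meeting notes attached", "lunch tomorrow", "project status notes"]

    def words(text):
        return re.findall(r"[a-z]+", text.lower())

    spam_counts = Counter(w for m in spam_msgs for w in words(m))
    ham_counts  = Counter(w for m in ham_msgs for w in words(m))
    vocab = set(spam_counts) | set(ham_counts)

    def spam_probability(message):
        """Naive Bayes with Laplace smoothing, scored in log space."""
        log_spam = math.log(len(spam_msgs) / (len(spam_msgs) + len(ham_msgs)))
        log_ham  = math.log(len(ham_msgs) / (len(spam_msgs) + len(ham_msgs)))
        spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
        for w in words(message):
            log_spam += math.log((spam_counts[w] + 1) / (spam_total + len(vocab)))
            log_ham  += math.log((ham_counts[w] + 1) / (ham_total + len(vocab)))
        # Convert the two log scores back to P(spam | message).
        return 1 / (1 + math.exp(log_ham - log_spam))

    print(spam_probability("cheap money now"))        # close to 1
    print(spam_probability("meeting notes tomorrow")) # close to 0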
- Link spam
Raise your pagerank by increasing the number of inbound links to your page. Really raise your pagerank by having all your friends start blogs and create a link to your page daily (have you all tried searching for "miserable failure" on Google?). Can we automate the process of detecting such fraudulent links (I use fraudulent to mean that the links serve no useful purpose other than raising pagerank)? Read Brian Davison's paper here.
A relatively new topic! Since blogspace content is crawled more frequently (daily!!), the dynamic nature of the web can be studied more closely. Frequently, bloggers do not give a reference to where they received information. A blog site is said to be infected by another blog site when it receives new information from that site. These papers attempt to employ Artificial Intelligence techniques to backtrack where bloggers received their information and to track this spread of information [paper1] [paper2]. Visit www.blogpulse.com and use the available tools.