Rainbow Tool Kit



Matt Perry

Global Information Systems

Spring 2003



Introduction to Rainbow


Description of Bow Library



Description of Rainbow methods

Naïve Bayes




K Nearest Neighbor


Probabilistic Indexing



Demonstration of Rainbow

20 newsgroups example

What is Rainbow?

 Publicly available executable program that performs document classification

 Part of the Bow (or libbow) library

A library of C code useful for writing statistical text analysis, language modeling and information retrieval programs

Developed by Andrew McCallum of Carnegie Mellon University

About Bow Library

 Provides facilities for

Recursively descending directories, finding text files.

Finding `document' boundaries when there are multiple documents per file.

Tokenizing a text file, according to several different methods.

Including N-grams among the tokens.

Mapping strings to integers and back again, very efficiently.

Building a sparse matrix of document/token counts.

Pruning vocabulary by word counts or by information gain.

Building and manipulating word vectors.

About Bow Library

 Provides facilities for

Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.

Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.

Scoring queries for retrieval or classification.

Writing all data structures to disk in a compact format.

Reading the document/token matrix from disk in an efficient, sparse fashion.

Performing test/train splits, and automatic classification tests.

Operating in server mode, receiving and answering queries over a socket.

About Bow Library

 Does Not

Have English parsing or part-of-speech tagging facilities.

Do smoothing across N-gram models.

Claim to be finished.

Have good documentation.

Claim to be bug-free.

Run on a Windows Machine.

About Bow Library

 In Addition to Rainbow, Bow contains 3 other executable programs

Crossbow - does document clustering

Arrow - does document retrieval – TFIDF

Archer - does document retrieval

Supports AltaVista-type queries (+, -, “”, etc.)

Back to Rainbow

 Classification Methods used by Rainbow

Naïve Bayes (mostly designed for this)

TFIDF/Rocchio

K-Nearest Neighbor

Probabilistic Indexing

Description of Naïve Bayes

Bayesian reasoning provides a probabilistic approach to learning.

The idea of Naïve Bayes classification is to assign a new instance the most probable target value, given the attribute values of that instance.

 How?

Description of Naïve Bayes

 Based on Bayes Theorem

 Notation

P(h) = probability that a hypothesis h holds

Ex. Pr (document1 fits the sports category)

P(D) = probability that training data D will be observed

Ex. Pr (we will encounter document1)

Description of Naïve Bayes

 Notation Continued

P(D|h) = probability of observing data D given that hypothesis h holds

Ex. Probability that we will observe document 1 given that document 1 is about sports

P(h|D) = probability that h holds given training data D


This is what we want

Probability that document 1 is a sports document given the training data D

Description of Naïve Bayes

 Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Description of Naïve Bayes

 Bayes Theorem

 Provides a way to calculate P(h|D) from P(h), together with P(D) and P(D|h).

P(h|D) increases with P(D|h) and P(h)

P(h|D) decreases as P(D) increases

The more probable it is that D is observed independent of h, the less evidence D provides in support of h.
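As a small worked illustration of Bayes Theorem, the sketch below computes P(h|D) for the sports-document example. All the numbers are hypothetical and chosen only for illustration, and the function name `posterior` is an assumption of this sketch, not part of Rainbow.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Return P(h|D) = P(D|h) * P(h) / P(D)."""
    if evidence_d <= 0:
        raise ValueError("P(D) must be positive")
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical values: h = "the document is about sports",
# D = "the document contains the word 'ball'".
p_sports = 0.2             # P(h)
p_ball_given_sports = 0.5  # P(D|h)
p_ball_given_other = 0.05  # P(D|not h)

# P(D) by the law of total probability.
p_ball = p_ball_given_sports * p_sports + p_ball_given_other * (1 - p_sports)

p_sports_given_ball = posterior(p_sports, p_ball_given_sports, p_ball)
print(round(p_sports_given_ball, 3))
```

Raising P(D) while holding the other two terms fixed lowers the result, which is the "decreases with P(D)" behavior described above.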

Description of Naïve Bayes

Approach: Assign the most probable target value given the attribute values

$$\mathrm{val} = \arg\max_{v_j} P(v_j \mid a_1, \ldots, a_n)$$

Description of Naïve Bayes

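Simplification based on Bayes Theorem

$$\mathrm{val} = \arg\max_{v_j} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)} = \arg\max_{v_j} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)$$

The denominator does not depend on v_j, so it can be dropped without changing which target value maximizes the expression.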

Description of Naïve Bayes

Naïve Bayes assumes (incorrectly) that the attribute values are conditionally independent given the target value

$$\mathrm{val} = \arg\max_{v_j} P(v_j) \prod_i P(a_i \mid v_j)$$

Rainbow Algorithm

Let P(v_j) = probability that a document belongs to class v_j

Let P(w_k | v_j) = probability that a randomly drawn word from a document in class v_j will be the word w_k

Rainbow Algorithm

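Estimate

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |\mathrm{Vocabulary}|}$$

where n_k is the number of times word w_k occurs in the training documents of class v_j, and n is the total number of word occurrences in those documents.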
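The estimate above can be sketched in a few lines. This is a minimal illustration, not Rainbow's C implementation; the function name `laplace_word_probs` and the toy vocabulary are assumptions made for the example.

```python
from collections import Counter


def laplace_word_probs(class_tokens, vocabulary):
    """Estimate P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|) for one class.

    class_tokens: all tokens from the training documents of class v_j.
    vocabulary:   the set of distinct tokens seen in all training documents.
    """
    # Only tokens that belong to the vocabulary are counted.
    tokens = [t for t in class_tokens if t in vocabulary]
    counts = Counter(tokens)                # n_k for each word
    denominator = len(tokens) + len(vocabulary)  # n + |Vocabulary|
    return {w: (counts[w] + 1) / denominator for w in vocabulary}


# Hypothetical toy data for a "sports" class.
vocab = {"ball", "game", "vote"}
probs = laplace_word_probs(["ball", "game", "ball"], vocab)
print(probs["ball"], probs["vote"])
```

Adding 1 to every count means a word that never occurs in a class, such as "vote" here, still gets a small nonzero probability, so a single unseen word cannot drive the whole product to zero.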

Rainbow Algorithm




Collect all words, punctuation, and other tokens that occur in the examples

Calculate the required probability terms P(v_j) and P(w_k | v_j)

Return the estimated target value for the document Doc

$$\mathrm{val} = \arg\max_{v_j} P(v_j) \prod_i P(a_i \mid v_j)$$
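The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Rainbow's actual code: the function names `train_naive_bayes` and `classify` and the toy documents are hypothetical, tokens are assumed to be already split, and log probabilities are summed instead of multiplying raw probabilities to avoid numeric underflow on long documents.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns (priors, word_probs, vocabulary)."""
    # Step 1: collect every token that occurs in the examples.
    vocabulary = {t for tokens, _ in examples for t in tokens}

    # Step 2: estimate P(v_j) and P(w_k | v_j).
    docs_per_class = Counter(label for _, label in examples)
    tokens_per_class = defaultdict(list)
    for tokens, label in examples:
        tokens_per_class[label].extend(tokens)

    priors = {v: docs_per_class[v] / len(examples) for v in docs_per_class}
    word_probs = {}
    for v in docs_per_class:
        counts = Counter(tokens_per_class[v])
        denominator = len(tokens_per_class[v]) + len(vocabulary)
        word_probs[v] = {w: (counts[w] + 1) / denominator for w in vocabulary}
    return priors, word_probs, vocabulary


def classify(tokens, priors, word_probs, vocabulary):
    """Step 3: return the class maximizing log P(v_j) + sum_i log P(a_i | v_j)."""
    best_label, best_score = None, -math.inf
    for v in priors:
        score = math.log(priors[v])
        for t in tokens:
            if t in vocabulary:  # tokens never seen in training are skipped
                score += math.log(word_probs[v][t])
        if score > best_score:
            best_label, best_score = v, score
    return best_label


# Hypothetical training set.
training = [
    ("ball game score".split(), "sports"),
    ("game team win".split(), "sports"),
    ("vote election party".split(), "politics"),
]
model = train_naive_bayes(training)
print(classify("team game".split(), *model))
print(classify("election vote".split(), *model))
```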



TFIDF/Rocchio

A major component of the Rocchio algorithm is the TFIDF (term frequency / inverse document frequency) word weighting scheme.

 TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.

 DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.


The inverse document frequency is calculated as

$$IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$$



Based on word weight heuristics, the word w_i is an important indexing term for a document d if it occurs frequently in that document

However, words that occur frequently in many documents spanning many categories are rated as less important


Each document d is represented as a vector within a given vector space V:

$$\vec{d} = (d^{(1)}, \ldots, d^{(|F|)})$$



The value d^(i) of feature w_i for a document d is calculated as the product

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

d^(i) is called the weight of the word w_i in the document d.
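The weighting scheme above can be sketched as follows. This is an illustrative sketch only: the function name `tfidf_vectors` and the toy documents are assumptions, |D| is taken to be the number of documents in the collection, and the natural logarithm is used because the base of the log is not specified here.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns (features, weight vectors)."""
    # The feature set F is every distinct word, in a fixed order.
    features = sorted({t for doc in documents for t in doc})

    # DF(w): number of documents in which w occurs at least once.
    df = Counter(t for doc in documents for t in set(doc))
    idf = {w: math.log(len(documents) / df[w]) for w in features}

    vectors = []
    for doc in documents:
        tf = Counter(doc)  # TF(w, d): occurrences of w in this document
        vectors.append([tf[w] * idf[w] for w in features])
    return features, vectors


# Hypothetical toy collection.
docs = [["ball", "game", "game"], ["game", "vote"], ["vote", "party"]]
features, vectors = tfidf_vectors(docs)
for doc_vector in vectors:
    print([round(x, 3) for x in doc_vector])
```

In this toy collection "game" occurs twice in the first document but also appears in a second document, so its IDF is lower than that of "ball", which appears in only one.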

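TFIDF/Rocchio

[Figure: document vectors d plotted in a vector space with term axes t1, t2, and t3; φ marks an angle between two of the vectors.]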

Documents that are “close together” in vector space talk about the same things.



Distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.

Note: this is similarity, not distance.

[Figure: vectors d1 and d2 in a space with term axes t1, t2, and t3, separated by the angle θ.]

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt

TFIDF/Rocchio

$$sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\, |\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$

 Cosine of angle between two vectors

The denominator involves the lengths of the vectors

So the cosine measure is also known as the normalized inner product

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt
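A minimal sketch of the cosine measure for two weight vectors of equal length. The function name `cosine_similarity` is an assumption of this sketch, and returning 0.0 for an all-zero vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Return the normalized inner product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
```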


A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length

This maps vectors onto the unit circle:


Longer documents don’t get more weight

For normalized vectors, the cosine is simply the dot product:

$$\cos(\vec{d_j}, \vec{d_k}) = \vec{d_j} \cdot \vec{d_k}$$

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt
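To illustrate why normalization keeps longer documents from getting more weight, the sketch below normalizes two hypothetical vectors that point in the same direction but differ in length by a factor of ten. The helper names `normalize` and `dot` are assumptions of this example.

```python
import math


def normalize(v):
    """Divide each component by the vector's length, giving a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        return list(v)  # an all-zero vector cannot be normalized
    return [x / length for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


short_doc = [1.0, 2.0]    # hypothetical term weights
long_doc = [10.0, 20.0]   # same proportions, ten times the length

# After normalization both documents map to the same unit vector,
# so their dot product equals their cosine.
similarity = dot(normalize(short_doc), normalize(long_doc))
print(round(similarity, 6))
```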

Rainbow Algorithm

 Construct a set of prototype vectors

 One vector for each class

This set of vectors serves as the learned model

 Model is used to classify a new document D

 D is assigned to the class with the most similar vector
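A sketch of the prototype approach follows. It assumes each class prototype is simply the average (centroid) of that class's training vectors; the full Rocchio formulation can also subtract a weighted contribution from documents of other classes, which this sketch omits. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def build_prototypes(vectors, labels):
    """Return one prototype vector per class: the centroid of its documents."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def classify(vec, prototypes):
    """Assign the document to the class with the most similar prototype."""
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))


# Hypothetical TFIDF-style vectors over three features.
train_vectors = [[2, 0, 1], [3, 1, 0], [0, 2, 3], [0, 1, 4]]
train_labels = ["a", "a", "b", "b"]
prototypes = build_prototypes(train_vectors, train_labels)
print(classify([1, 0, 0], prototypes), classify([0, 0, 1], prototypes))
```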

K Nearest Neighbor

 Features

All instances correspond to points in an n-dimensional Euclidean space

Classification is delayed until a new instance arrives

Classification done by comparing feature vectors of the different points

Target function may be discrete or real-valued

K Nearest Neighbor

 1 Nearest Neighbor

K Nearest Neighbor

An arbitrary instance x is represented by the feature vector

$$(a_1(x), a_2(x), a_3(x), \ldots, a_n(x))$$

a_i(x) denotes features

Euclidean distance between two instances

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

Find the k-nearest neighbors whose distance from your test case falls within a threshold p.

If x of those k-nearest neighbors are in category c_i, then assign the test case to c_i; otherwise it is unmatched.

Rainbow Algorithm

 Construct a model of points in n-dimensional space for each category

 Classify a document D based on the k nearest points
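The k-nearest-neighbor procedure described above (Euclidean distance, a distance threshold p, and a minimum of x agreeing neighbors) can be sketched as follows. This is an illustration rather than Rainbow's implementation; the function names, the default values for `p` and `x`, and the toy points are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_classify(query, training, k, p=math.inf, x=1):
    """training: list of (vector, label).

    Takes the k nearest training points, keeps those within distance p,
    and returns the most common label if at least x of them share it.
    Returns None when the test case is unmatched.
    """
    by_distance = sorted(training, key=lambda item: euclidean(query, item[0]))
    neighbors = [label for vec, label in by_distance[:k]
                 if euclidean(query, vec) <= p]
    if not neighbors:
        return None
    label, votes = Counter(neighbors).most_common(1)[0]
    return label if votes >= x else None


# Hypothetical two-dimensional training points.
points = [([0, 0], "a"), ([0, 1], "a"),
          ([5, 5], "b"), ([6, 5], "b"), ([5, 6], "b")]
print(knn_classify([0.2, 0.3], points, k=3))
print(knn_classify([100, 100], points, k=3, p=1.0))
```

Because nothing is computed until a query arrives, all the work happens inside `knn_classify`, which matches the "classification is delayed" property of the method.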

Probabilistic Indexing

 Idea

Quantitative model for automatic indexing based on some statistical assumptions about word distribution.

2 Types of words: function words, specialty words

Function words = words with no importance for defining classes (the, it, etc.)

Specialty words = words that are important in defining classes (war, terrorist, etc.)

Probabilistic Indexing

 Idea

Function words follow a Poisson distribution over the set of all documents

Specialty words do not follow a Poisson distribution over the set of all documents

Specialty word distribution can be described by a

Poisson process within its class

Specialty words distinguish more than one class of documents
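One rough way to see the difference between the two word types is to compare the variance of a word's per-document counts with its mean, since a Poisson distribution has variance equal to its mean. The sketch below is only an illustration of that property, not the indexing model itself; the function names and the per-document counts are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k occurrences under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dispersion_index(counts):
    """Variance divided by mean; close to 1 for Poisson-like counts."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance / mean


# Hypothetical occurrences per document across eight documents.
function_word = [0, 0, 1, 2, 2, 3, 4, 4]   # spread evenly over all documents
specialty_word = [0, 0, 0, 0, 0, 0, 8, 8]  # concentrated in one class

print(dispersion_index(function_word))
print(dispersion_index(specialty_word))
```

Both hypothetical words have the same mean of 2 occurrences per document, but the specialty word's counts are far more dispersed than a single Poisson over all documents would predict.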

Rainbow Method

Goal is to estimate P(C | s_i, d_m)

Probability that the assignment of term s_i to the document d_m is correct

Once terms have been identified, assign

Form Of Occurrence (FOC)

Certainty that term is correctly identified

Significance of Term

Rainbow Method

If term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator

 Set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c

Rainbow Demonstration

 20 newsgroups example

Rainbow Commands

 Create a model for the classes:

 rainbow -d ~/model --index training directory

 Classifying Documents:

Pick Method (naivebayes, knn, tfidf, prind)

 rainbow -d ~/model --method=tfidf --test=1

Automatic Test:

 rainbow -d ~/model --test-set=0.4 --test=3

Test 1 at a time:

 rainbow -d ~/model --query test file

Rainbow Demonstration

 Can also run as a server:

 rainbow -d ~/model --query-server=port

Use telnet to classify new documents

 Diagnostics:

List the words with the highest mutual info:

 rainbow -d ~/model -I 10

Perl script for printing stats:

 rainbow -d ~/model --test-set=0.4 --test=2 | rainbowstats.pl
