Charniak Chapter 9

9.1 Clustering

Grouping words into classes that reflect commonality of some property works as follows:
1) Define the properties one cares about, and be able to give numerical values to them.
2) For each item to be classified, create a vector of length n holding the n numerical values.
3) Viewing each n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another.

The procedure described above leaves the following points open to variation:
1) The properties used in the vector.
2) The distance metric used to measure how close two points are.
3) The clustering algorithm.
Of these three aspects, the first seems to have the largest effect on the results.

9.2 Clustering by Next Word

Let C(x) denote the vector of properties of x (intuitively, x's context), where in this chapter x is a word type. We can think of the vector below for wi as a vector of counts, one for each word wj, recording how often wj followed wi in the corpus:

    C(wi) = < |w1|, |w2|, ..., |ww| >

Instead of applying Euclidean or cosine distance measures to these vectors, a new metric based on mutual information is introduced. The mutual information I(x;y) of two particular outcomes x and y is the amount of information one outcome gives us about the other (see the Information Theory Report). The equation is:

    I(x;y) = (-log P(x)) - (-log P(x|y)) = log [ P(x,y) / (P(x) P(y)) ]

For example, suppose we want to know how much information the word "pancake" gives us about the following word "syrup". This can be written as:

    I(pancake;syrup) = log [ P(pancake,syrup) / (P(pancake) P(syrup)) ]

where P(pancake,syrup) is the probability of the pair (wi = pancake, wi+1 = syrup), and P(w) is the probability of the occurrence of the word w in our text.

The average mutual information of the random variables X and Y, I(X;Y), is defined as the amount of information we get about X from knowing the value of Y, on average. In other words, it is the average over the mutual information of the individual combinations. (The assumption made in the formula below is that both random variables range over the possible values {w1, ..., ww}.)

    I(X;Y) = sum over x = 1..w and y = 1..w of P(wx,wy) I(wx;wy)

As we cluster words together, average mutual information decreases. Therefore, the metric used is the minimal loss of average mutual information. Suppose we consider clustering the words "big" and "large". First, we compute I(Wi;Wi-1) with the two words kept separate. Then we would create a class "big-large" whose vector C(big-large) is derived by summing the individual components of C(big) and C(large). We would then change all other vectors, e.g. C(the), so that they have w-1 components rather than the original w (two components lost for "big" and "large", one gained for "big-large"). The idea is to find groups for which the loss of average mutual information is small. Generally, the loss is smaller when the members of the group have similar vectors.

A typical clustering algorithm is the greedy algorithm. In this case, we start with w clusters, one for each word, and repeatedly combine the two groups whose merge causes the minimal loss of average mutual information, until the desired number of clusters is reached. However, with a large vocabulary this strategy is too expensive. Instead, the algorithm starts with 1000 clusters, each containing one of the most common words in the corpus, and adds the remaining words to these clusters using the greedy method. In several cases, this algorithm puts misspelled words into the same group.
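The merge criterion of Section 9.2 can be made concrete with a short sketch. This is not Charniak's actual implementation: the toy corpus and the helper names (bigram_counts, avg_mutual_info, merge_loss) are illustrative assumptions. The sketch computes the average mutual information between adjacent words and the loss that results from collapsing two words into a single class, which is the quantity the greedy algorithm tries to minimize at each merge.

```python
# Sketch, under the assumptions stated above: average mutual information over
# adjacent-word pairs, and the loss incurred by merging two words into a class.
import math
from collections import Counter

def bigram_counts(words):
    """Count adjacent pairs (w_i, w_{i+1}) in a token list."""
    return Counter(zip(words, words[1:]))

def avg_mutual_info(pairs):
    """I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) )."""
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (x, y), c in pairs.items():
        left[x] += c
        right[y] += c
    info = 0.0
    for (x, y), c in pairs.items():
        p_xy = c / total
        info += p_xy * math.log2(p_xy / ((left[x] / total) * (right[y] / total)))
    return info

def merge(pairs, a, b, cls):
    """Replace words a and b by the single class label cls in the pair counts."""
    merged = Counter()
    for (x, y), c in pairs.items():
        x = cls if x in (a, b) else x
        y = cls if y in (a, b) else y
        merged[(x, y)] += c
    return merged

def merge_loss(pairs, a, b):
    """Loss of average mutual information caused by clustering a and b together."""
    return avg_mutual_info(pairs) - avg_mutual_info(merge(pairs, a, b, a + "-" + b))

corpus = "the big dog saw the large dog and the big cat saw the large cat".split()
pairs = bigram_counts(corpus)
# Merging distributionally similar words should cost little mutual information;
# merging dissimilar ones should cost more.
print(merge_loss(pairs, "big", "large"))
print(merge_loss(pairs, "big", "saw"))
```

On this toy corpus, "big" and "large" have identical left and right contexts, so merging them loses essentially no average mutual information, whereas merging "big" with "saw" does.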
9.3 Clustering with Syntactic Information

This section describes another clustering experiment, restricted to nouns, performed by Pereira and Tishby. They used a partial parser to extract examples of verb-object relations from a corpus, and the context vector for a noun contained the number of times each verb took the noun as its direct object. In the book, however, rather than using actual distributions, each component is given the value 1 if the verb normally takes the noun as a direct object, or 0 if doing so would give a meaningless sentence. The associated vector C(wi) for the word wi is thus the distribution of verbs for which it served as direct object. Three example vectors (where the columns are verbs) are:

    C(door)    = <1,0,1,1,1,0,0>
    C(closet)  = <1,0,1,1,1,0,0>
    C(meeting) = <0,1,0,0,1,1,1>

(A table of the noun-verb pairs from the book belongs here.)

From the vectors, we can see that door and closet look exactly the same; the sample text simply lacked verbs such as "enter" that would make them slightly different. Normally, the vectors are not counts of how often each verb had the noun as an object; rather, each is a vector of probabilities:

    C(n) = < P(v1 | n), P(v2 | n), ..., P(vk | n) >

Relative entropy is used as the metric for comparing two nouns, but the details are not given here.
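Since the details are not spelled out in the notes, here is a minimal sketch of the named metric, relative entropy (Kullback-Leibler divergence), D(p || q) = sum_i p_i log(p_i / q_i), applied to the verb distributions of two nouns. The count vectors and the smoothing constant below are hypothetical, not taken from Pereira and Tishby.

```python
# Sketch: relative entropy between two nouns' verb-object distributions.
# The example counts and the eps smoothing are illustrative assumptions.
import math

def relative_entropy(p, q, eps=1e-9):
    """D(p || q) in bits; eps guards against zero components in q."""
    return sum(pi * math.log2(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def normalize(counts):
    """Turn verb-object counts for a noun into the distribution <P(v1|n), ...>."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical verb-object count vectors (columns are verbs, as in Section 9.3).
door    = normalize([3, 0, 2, 4, 1, 0, 0])
closet  = normalize([2, 0, 1, 3, 1, 0, 0])
meeting = normalize([0, 4, 0, 0, 2, 3, 1])

# Similar nouns give a small relative entropy; dissimilar nouns give a large one.
print(relative_entropy(door, closet))
print(relative_entropy(door, meeting))
```

Note that relative entropy is asymmetric (D(p || q) differs from D(q || p)), which is one of the details a full treatment of the clustering procedure would have to address.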