
Charniak Chapter 9
9.1 Clustering
Grouping words into classes that reflect commonality of some property works as follows:
1) Define the properties one cares about, and be able to assign numerical values to them.
2) Create a vector of length n containing the n numerical values for each item to be classified.
3) Viewing each n-dimensional vector as a point in an n-dimensional space, cluster points that are
near one another.
The procedure described above leaves the following points open to variation:
1) The properties used in the vector.
2) The distance metric used to measure how close two points are.
3) The clustering algorithm.
Of these three aspects, the first seems to have the largest effect on the results.
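For concreteness, here is a minimal sketch of the three steps, using made-up two-dimensional property vectors, Euclidean distance, and a deliberately naive threshold-based clustering rule; none of these particular choices come from the book:

    import math

    def euclidean(p, q):
        # One possible distance metric between property vectors (point 2 above).
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cluster(points, threshold):
        # A deliberately naive rule for step 3: put a point into the first cluster
        # whose founding member is within `threshold`, otherwise start a new cluster.
        clusters = []
        for name, vec in points.items():
            for members in clusters:
                if euclidean(vec, members[0][1]) < threshold:
                    members.append((name, vec))
                    break
            else:
                clusters.append([(name, vec)])
        return clusters

    # Steps 1-2: each item gets an n-dimensional vector of property values (toy data).
    points = {"big": (1.0, 0.1), "large": (0.9, 0.2), "meeting": (0.1, 0.9)}
    print(cluster(points, threshold=0.5))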
9.2 Clustering by Next Word
Let C(x) denote the vector of properties of x (intuitively, x's context), where in this chapter x is a
word type. We can think of the vector below for wi as a vector of counts, one for each word wj, recording
how often wj follows wi in the corpus:
C(wi) = < |w1|, |w2|, ..., |ww| >
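As a sketch, these context vectors can be collected in one pass over the corpus; here tokens stands for the corpus as a list of words (toy data, not from the book):

    from collections import defaultdict, Counter

    def next_word_counts(tokens):
        # C(w): for each word w, count how often each word follows it in the corpus.
        context = defaultdict(Counter)
        for w, nxt in zip(tokens, tokens[1:]):
            context[w][nxt] += 1
        return context

    # Toy usage (illustrative data, not from the book):
    tokens = "the big dog saw the large dog near the big cat".split()
    C = next_word_counts(tokens)
    print(C["the"])   # Counter({'big': 2, 'large': 1})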
Instead of applying Euclidean or cosine distance measures to these vectors, a new metric called
mutual information is introduced. The mutual information I(x;y) of two particular
outcomes x and y is the amount of information one outcome gives us about the other (see the Information
Theory Report). So the equation is:
I(x;y) = (-log P(x)) - (-log P(x|y)) = log [ P(x,y) / (P(x) · P(y)) ]
For example, suppose we want to know how much information the word “pancake” gives us about the
following word “syrup”. This can be written as:
I(pancake;syrup) = log [ P(pancake,syrup) / (P(pancake) · P(syrup)) ]
where P(pancake,syrup) is the probability of the pair (wi = pancake, wi+1 = syrup), and P(w) is the
probability of the occurrence of the word w in our text.
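A minimal sketch of estimating I(x;y) from raw counts, using simple relative frequencies over adjacent word pairs (unsmoothed maximum-likelihood estimates, which a real system would smooth):

    import math
    from collections import Counter

    def pointwise_mi(tokens, x, y):
        # I(x;y) = log[ P(x,y) / (P(x) * P(y)) ], with probabilities estimated
        # as relative frequencies of unigrams and adjacent word pairs.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        p_xy = bigrams[(x, y)] / (n - 1)
        if p_xy == 0:
            return float("-inf")   # the pair never occurs in the sample
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        return math.log2(p_xy / (p_x * p_y))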
The average mutual information of the random variables X and Y, I(X;Y), is defined as the
amount of information we get about X from knowing the value of Y, on average. In other words, it is
the average over the mutual information of the individual combinations of outcomes. (The
assumption made in the formula below is that both random variables have the possible values {w1,...,ww}.)
I(X;Y) = Σ(x=1..w) Σ(y=1..w) P(wx,wy) · I(wx;wy)
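As a sketch, the average mutual information can be computed directly from a joint distribution over word pairs; here joint is assumed to be a dictionary mapping (wx, wy) pairs to joint probabilities that sum to one:

    import math

    def average_mutual_information(joint):
        # I(X;Y) = sum over (x,y) of P(x,y) * log2[ P(x,y) / (P(x)P(y)) ].
        # The marginals P(x) and P(y) are recovered from the joint distribution.
        p_x, p_y = {}, {}
        for (x, y), p in joint.items():
            p_x[x] = p_x.get(x, 0.0) + p
            p_y[y] = p_y.get(y, 0.0) + p
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in joint.items() if p > 0)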
As we cluster words together, the average mutual information decreases. Therefore, the metric
used is the minimal loss of average mutual information.
Suppose we consider clustering the words “big” and “large”. First, we compute I(Wi ; Wi-1) for
the separate words. Then, we would create a class “big-large” whose vector C(big-large) is derived by
summing the individual components of C(big) and C(large). We would then change all other vectors,
e.g. C(the), so that they have w-1 components rather than the original w (the two components
for “big” and “large” are replaced by one for “big-large”). The idea is to find groups for which the loss of
mutual information is small. Generally, the loss is smaller when the members of the group have similar
vectors.
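A small sketch of the merge step, assuming the context vectors are stored as Counters keyed by the following word, as produced by the earlier sketch; the new class's vector is the component-wise sum, and in every vector the two components for the merged words collapse into one:

    def merge_words(context, w1, w2, merged_name):
        # `context` maps each word to its Counter of next-word counts, C(w).
        # The merged class's vector is the component-wise sum of C(w1) and C(w2).
        context[merged_name] = context.pop(w1) + context.pop(w2)
        for vec in context.values():
            # Collapse the two components for w1 and w2 into one for the new class.
            combined = vec.pop(w1, 0) + vec.pop(w2, 0)
            if combined:
                vec[merged_name] = combined
        return context

    # Toy usage, building on next_word_counts() from above:
    C = next_word_counts("the big dog saw the large dog".split())
    merge_words(C, "big", "large", "big-large")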
A typical clustering algorithm is the greedy algorithm. In this case, we start with w clusters, one for
each word. Then we repeatedly combine the two groups whose merger causes the minimal loss of average
mutual information, until the desired number of clusters is reached. However, with a large vocabulary,
this strategy is too expensive.
Instead, the algorithm starts with 1000 clusters, each containing one of the most
common words in the corpus, and adds the remaining words to one of these clusters using the same greedy
criterion. In several cases, this algorithm puts misspelled words into the same group.
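The naive greedy procedure can be sketched as follows. This is schematic only: average_mi and merge are hypothetical helpers standing in for the corpus statistics described above, and the search over all pairs at every step is exactly what makes this too expensive for a full vocabulary:

    def greedy_cluster(words, target_k, average_mi, merge):
        # Start with one cluster per word; repeatedly apply the merge that keeps
        # the most average mutual information (i.e. loses the least), until only
        # `target_k` clusters remain.
        clusters = [[w] for w in words]
        while len(clusters) > target_k:
            best_score, best_clusters = None, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    candidate = merge(clusters, i, j)
                    score = average_mi(candidate)
                    if best_score is None or score > best_score:
                        best_score, best_clusters = score, candidate
            clusters = best_clusters
        return clusters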
9.3 Clustering with Syntactic Information
Another clustering experiment, restricted to nouns and performed by Pereira and Tishby, is
explained in this section. They used a partial parser to extract examples of verb-object relations from a
corpus, and the context vector for a noun contained the number of times each verb took the noun as its
direct object. In this book, however, rather than using the actual distributions, each component is given
the value “1” if the verb normally takes the noun as a direct object, or “0” if doing so would give a
meaningless sentence.
The vector C(wi) associated with the word wi is the distribution of verbs for which it served as
direct object. Three example vectors, in which the columns are verbs, are:
C(door) = <1,0,1,1,1,0,0>
C(closet) = <1,0,1,1,1,0,0>
C(meeting) = <0,1,0,0,1,1,1>
(A table of noun-verb pairs will be inserted here once the page has been scanned.)
From the vectors, we can see that door and closet look exactly the same; however, the sample
training text lacked words such as enter that would make them slightly different. Normally, the vector
is not one of counts showing how often each verb had the noun as an object; it is a vector of probabilities,
as shown below:
C(n) = < P(v1 | n), P(v2 | n), ..., P(vk | n) >
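A small sketch of turning raw verb-object counts into the probability vector C(n); the verb list fixing the column order and the counts are hypothetical illustrations:

    from collections import Counter

    def verb_distribution(verb_counts, verbs):
        # C(n) = <P(v1|n), ..., P(vk|n)>: normalize the raw verb-object counts
        # for a noun into a probability distribution over the fixed verb columns.
        total = sum(verb_counts[v] for v in verbs)
        return [verb_counts[v] / total for v in verbs]

    verbs = ["open", "attend", "close", "lock", "enter", "hold", "schedule"]
    counts_meeting = Counter({"attend": 5, "hold": 3, "schedule": 2})
    print(verb_distribution(counts_meeting, verbs))   # [0.0, 0.5, 0.0, 0.0, 0.0, 0.3, 0.2]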
Relative entropy is used as the metric to compare two nouns, but the details are not given here.
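For completeness, a minimal sketch of relative entropy (KL divergence) between two such verb distributions; the toy distributions below are hypothetical, and in practice zero components in the second distribution must be smoothed, since they make the divergence infinite:

    import math

    def relative_entropy(p, q):
        # D(p || q) = sum_i p_i * log2(p_i / q_i), defined over two probability
        # vectors of equal length; terms with p_i = 0 contribute nothing.
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Toy verb distributions for two hypothetical nouns (the second lightly smoothed):
    p_door    = [0.25, 0.00, 0.25, 0.25, 0.25, 0.00, 0.00]
    p_meeting = [0.01, 0.24, 0.01, 0.01, 0.24, 0.24, 0.25]
    print(relative_entropy(p_door, p_meeting))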