Answers to Questions

Given a large corpus of text:
1) How can you learn relations between two words? Devise methods for learning such relations.
2) How can you determine when a sentence ends? Try to come up with solutions.
3) How can you determine when a context ends? Context in the sense of a semantic topic. Any
solutions?
4) How can you guess which word comes next? Try to come up with solutions.
A-2) If punctuation rules are obeyed in the text, we can look for “. | ? | !” followed by a capital
letter to determine the end of a sentence. If not, we can parse the text according to
grammar rules. To illustrate, we can write a simple grammar for a sentence:
<sentence> --> <noun> <verb> <noun>
Then an example of parsing a simple word sequence without punctuation is:

i       eat     meal       she     went to school
<noun>  <verb>  <noun>  ||  <noun>  <verb>
After the word “meal”, the next word is a noun, which violates the simple grammar, so we can say that a
new sentence starts at that point.
Note: To find whether a word is a noun, verb, etc., we can use WordNet. For example:
$ wn apple
Information available for noun apple
        -hypen          Hypernyms
        -hypon, -treen  Hyponyms & Hyponym Tree
        ...
No information available for verb apple
No information available for adj apple
No information available for adv apple
Looking at the output above, WordNet has a definition only for the noun “apple”, so we can assume that “apple”
can only be a noun.
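The grammar-based idea above can be sketched in Python. This is a minimal sketch: the tiny noun/verb lexicon below is a hypothetical stand-in for a real WordNet part-of-speech lookup such as the `wn` command shown above.

```python
# Sketch of grammar-based sentence segmentation for the simple
# <sentence> --> <noun> <verb> <noun> grammar. LEXICON is a
# hypothetical stand-in for a WordNet part-of-speech lookup.
LEXICON = {
    "i": "noun", "she": "noun", "meal": "noun", "school": "noun",
    "eat": "verb", "went": "verb",
    "to": "other",  # function words are ignored by the toy grammar
}

def segment(words):
    """Start a new sentence whenever a noun appears right after a
    completed <noun> <verb> <noun> clause (a grammar violation)."""
    sentences, current, pattern = [], [], []
    for w in words:
        pos = LEXICON.get(w, "other")
        if pos == "noun" and pattern[-1:] == ["noun"] and len(pattern) >= 3:
            # a noun after a full noun-verb-noun clause -> sentence boundary
            sentences.append(current)
            current, pattern = [], []
        current.append(w)
        if pos in ("noun", "verb"):
            pattern.append(pos)
    if current:
        sentences.append(current)
    return sentences

print(segment("i eat meal she went to school".split()))
# -> [['i', 'eat', 'meal'], ['she', 'went', 'to', 'school']]
```

On the example from the text, the noun “she” arriving after the complete clause “i eat meal” triggers the sentence boundary.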
A-3) We can represent the corpus as a network of clusters. We can store each word seen in the clusters
along with how frequently it occurs. We can also treat the clusters as nodes of a graph and, using a
distance metric, compute the distance between each pair of clusters, where smaller distances represent
more similar clusters.
After constructing this graph of clusters, we can begin to determine the end of a context. For
example, suppose the text we are examining is about space. Then the words likely to occur would
be “astronaut, spaceship, space, moon, planet, America, Russia, etc.”. We would see that
most of the words in the space context are either in the same cluster or in clusters that are near each
other according to our distance metric. Then, if we start to encounter words such as “apple, genetically,
DNA, etc.”, we can say that the context we were examining previously has ended.
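The cluster-based context detection above can be sketched as follows. The cluster assignments and inter-cluster distances are hypothetical placeholders; in practice both would be learned from the corpus.

```python
# Sketch of context-end detection with word clusters. CLUSTER and
# DIST are hypothetical; real values would be learned from the corpus.
CLUSTER = {
    "astronaut": "space", "spaceship": "space", "moon": "space",
    "planet": "space", "apple": "bio", "dna": "bio", "genetically": "bio",
}
DIST = {("space", "space"): 0.0, ("bio", "bio"): 0.0,
        ("space", "bio"): 0.9, ("bio", "space"): 0.9}

def context_breaks(words, threshold=0.5, window=2):
    """Report positions where the last `window` in-vocabulary words are
    all far (distance > threshold) from the current context's cluster."""
    breaks, context, recent = [], None, []
    for i, w in enumerate(words):
        c = CLUSTER.get(w)
        if c is None:
            continue  # unknown word: no cluster evidence
        if context is None:
            context = c
            continue
        recent = (recent + [(i, c)])[-window:]
        if (len(recent) == window
                and all(DIST[(context, rc)] > threshold for _, rc in recent)):
            breaks.append(recent[0][0])  # context ended where the shift began
            context, recent = c, []
    return breaks

words = "astronaut moon planet apple dna genetically".split()
print(context_breaks(words))  # -> [3], the position of "apple"
```

Requiring a window of consecutive far-away words, rather than a single one, keeps one stray off-topic word from falsely ending the context.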
A-4) First, we could index the words in our corpus in the following format:
word1 -> Document1: p1, p2, p3; Document7: p219
word2 -> Document3: p4, p5
...
where wordi represents a word, Documenti represents a document in which the word is encountered,
and pi represents the position of the word in that document. To estimate which word comes after
wordi, we can look at the positions of wordi in a document and record the word that follows at each
position. Our estimate is then the word that followed wordi most frequently.
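The positional index and the next-word estimate can be sketched like this, over a hypothetical toy corpus of two documents:

```python
from collections import Counter, defaultdict

# Sketch of the positional inverted index described above, plus a
# next-word estimate built from it. `docs` is a hypothetical toy corpus.
docs = {
    "Document1": "i eat an apple i eat a meal".split(),
    "Document2": "she will eat an orange".split(),
}

# word -> {document -> [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc, words in docs.items():
    for pos, w in enumerate(words):
        index[w][doc].append(pos)

def predict_next(word):
    """Return the word that most frequently follows `word` in the corpus."""
    followers = Counter()
    for doc, positions in index[word].items():
        for p in positions:
            if p + 1 < len(docs[doc]):  # skip the last position in a document
                followers[docs[doc][p + 1]] += 1
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("eat"))  # -> 'an' (follows "eat" twice, "a" only once)
```

This stores one positions list per (word, document) pair, which is exactly the representation that differs from the bigram implementation discussed in the note below.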
NOTE: After reading the N-gram documents, I realized that the method discussed above calculates
probabilities the same way as the bigram method, but it differs in how the
information is stored. In an N-gram implementation, the stored information is different from the method
discussed above. Details will be explained in the document that discusses N-grams.
A-1) We can use WordNet to find the relations between two words. We can get the hierarchy trees
(especially the hypernym trees) for the two words and compare them to find a common tree node. The
common node may describe the relation between the two words if it is not too low in the tree.
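The common-node idea can be sketched as follows. The `PARENT` map is a hypothetical miniature of WordNet's hypernym hierarchy, standing in for the real trees returned by a WordNet lookup.

```python
# Sketch of finding a common ancestor in a hypernym hierarchy.
# PARENT is a hypothetical miniature of WordNet's hypernym tree.
PARENT = {
    "apple": "fruit", "orange": "fruit",
    "fruit": "food", "bread": "food",
    "food": "entity",
}

def ancestors(word):
    """Chain of hypernyms from the word up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def common_node(w1, w2):
    """Lowest node shared by both hypernym chains, if any."""
    up1 = ancestors(w1)
    for node in ancestors(w2):
        if node in up1:
            return node
    return None

print(common_node("apple", "orange"))  # -> 'fruit' (a close relation)
print(common_node("apple", "bread"))   # -> 'food' (a looser relation)
```

The deeper the common node sits in the tree, the more specific and informative the relation it describes.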
Other than WordNet, we can use our corpus to find relations between two words. We can
cluster the data we have, and we can say that if two words appear in the same or related concepts, then
these words are related within that context. Furthermore, we can use N-grams to find the
relatedness of two words. Suppose we are observing words x and y, and we have found X N-grams that
contain the word x. Then we can replace the occurrences of x with y, and if the frequencies
of the modified N-grams are close to the frequencies of the original N-grams containing
x, we can say that the two words are synonyms. If the frequencies are not very close but still considerable,
we can define a metric to measure the strength of the relationship between the two words.
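The N-gram substitution idea can be sketched with bigrams (N = 2). The corpus below is a hypothetical toy example, and the overlap score is one possible choice for the metric mentioned above:

```python
from collections import Counter

# Sketch of the N-gram substitution idea: replace x with y in the
# bigrams that contain x and compare frequencies. The corpus is a
# hypothetical toy example; a real corpus would be far larger.
corpus = ("i eat an apple . i eat an orange . "
          "she ate an apple . she ate an orange .").split()
bigrams = Counter(zip(corpus, corpus[1:]))

def substitution_similarity(x, y):
    """Fraction of x's bigram occurrences that are matched, count for
    count, by the same bigrams with x replaced by y."""
    total, shared = 0, 0
    for (a, b), count in bigrams.items():
        if x in (a, b):
            swapped = (y if a == x else a, y if b == x else b)
            total += count
            shared += min(count, bigrams.get(swapped, 0))
    return shared / total if total else 0.0

# In this toy corpus "apple" and "orange" occur in identical bigram
# contexts with identical frequencies, so the score is maximal.
print(substitution_similarity("apple", "orange"))  # -> 1.0
```

A score near 1 suggests the two words are interchangeable (near-synonyms); intermediate scores can serve as the graded relatedness metric described above.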