Given a large corpus of text:

1) How can you learn relations between two words? Devise methods for learning the relations.
2) How can you determine when a sentence ends? Try to come up with solutions.
3) How can you determine when a context ends? Context in the sense that a semantic topic ends. Any solutions?
4) How can you guess which word comes next? Try to come up with solutions.

A-2) If the punctuation rules are obeyed in the text, we can look for ". | ? | !" followed by a capital letter to determine the end of a sentence. If not, we can parse the text according to grammar rules. To illustrate, we can write a simple grammar for a sentence:

    <sentence> --> <noun> <verb> <noun>

An example of parsing a simple word sequence without punctuation:

    i       eat     meal        she     went    to school
    <noun>  <verb>  <noun>  ||  <noun>  ...

After the word "meal", the next word is again a noun, which violates the simple grammar, so we can say that a new sentence starts at that point.

Note: To find whether a word is a noun, verb, etc., we can use WordNet. For example:

    $ wn apple

    Information available for noun apple
        -hypen          Hypernyms
        -hypon, -treen  Hyponyms & Hyponym Tree
        ...
    No information available for verb apple
    No information available for adj apple
    No information available for adv apple

Looking at the output above, WordNet has definitions only for the noun "apple", so we can assume that "apple" can only be a noun.

A-3) We can represent the corpus as a network of clusters. We store each word seen in the clusters together with how frequently it occurs. We can also treat the clusters as a graph and, using a distance metric, compute the distance between each pair of clusters, where smaller distances represent more similar clusters. After constructing this graph of clusters, we can begin to determine the end of a context. For example, suppose the text we are examining is about space. The words that will probably occur are "astronaut, spaceship, space, moon, planet, America, Russia, etc.".
We would see that most of the words in the space context are either in the same cluster or in clusters that are near according to our distance metric. If we then start to encounter words such as "apple, genetically, DNA, etc.", we can say that the context we were examining has ended.

A-4) First, we could index the words in our corpus in the following format:

    word1 -> Document1: p1, p2, p3 , Document7: p219
    word2 -> Document3: p4, p5
    ...

where wordi is a word, Documenti is a document in which that word occurs, and pi is the position of the word in the given document. To estimate which word comes after wordi, we look at the positions of wordi in a document and record the word that follows it at each position. Our estimate is then the word that followed wordi most frequently.

NOTE: After reading the N-gram documents, I realized that the method discussed above calculates the same probabilities as the bigram method, but it differs in how the information is stored. Details will be explained in the document that discusses N-grams.

A-1) We can use WordNet to find the relations between two words. We can get the hierarchy trees (especially for synonymy) of the two words and compare them to find a common tree node. That common node may describe the relation between the two words, provided it is not very low in the tree. Besides WordNet, we can use our corpus to find relations between two words. We can cluster the data we have, and if two words appear in the same or related clusters, we can say that the words are related within that context. Furthermore, we can use N-grams to measure the relatedness of words. Suppose we are observing the words x and y, and we have found X N-grams that contain the word x.
We can then replace the occurrences of x with y; if the frequencies of the modified N-grams are close to the frequencies of the original N-grams containing x, we can say that x and y are synonyms. If the frequencies are not very close but still considerable, we can define a metric over the frequency difference to measure the strength of the relationship between the two words.
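This substitution test can be sketched with trigram contexts, i.e. the words immediately before and after each occurrence. The function names, the scoring rule, and the toy corpus below are my own illustrative choices; a real implementation would use smoothed counts over a large corpus and longer N-grams.

```python
from collections import Counter

def trigram_contexts(tokens, target):
    """Collect (left, right) context pairs around each occurrence of target."""
    return Counter(
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == target
    )

def substitution_score(tokens, x, y):
    """Frequency-weighted fraction of x's contexts in which y also appears.
    1.0 means y occurs in each of x's contexts at least as often as x does;
    0.0 means the two words never share a context."""
    ctx_x = trigram_contexts(tokens, x)
    ctx_y = trigram_contexts(tokens, y)
    total = sum(ctx_x.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ctx_y[c]) for c, n in ctx_x.items())
    return matched / total

corpus = ("i eat the apple every day i eat the banana every day "
          "i drive the car to work").split()
print(substitution_score(corpus, "apple", "banana"))  # -> 1.0 (shared context)
print(substitution_score(corpus, "apple", "car"))     # -> 0.0 (no shared context)
```

A score near 1 suggests near-synonymy, while intermediate scores can serve as the graded relatedness metric mentioned above.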