Language Technology (2016)
Assignment 5: Semantic Analysis
Marco Kuhlmann, Robin Kurtz
February 22, 2016

Goal  Use state-of-the-art NLP libraries to experiment with a vector space model of word meaning and evaluate it on a word analogy task.

Preparations  Read and understand this instruction in its entirety.

Report  Solve the problems (such as P01) and describe your solutions in a short report. Some problems require you to write Python code. Send the code and your report, as well as your individual reflection forms, to your teaching assistant by email.

Deadline  2016-02-26

1 Introduction

1.01  Recall from class that a vector space model of word meaning represents words as vectors in a high-dimensional space. In this assignment, you will experiment with a vector space model trained on the Swedish Wikipedia using the tool word2vec:

https://code.google.com/archive/p/word2vec/

1.02  To experiment with the model you will use the gensim library (https://radimrehurek.com/gensim/). Since this library is not included in the standard Python installation, we provide you with a so-called virtual environment. To activate it, run the following command:

source /home/729G17/labs/lab5/lab5-env/bin/activate

You will now see the tag (lab5-env) at the beginning of your prompt, indicating that the virtual environment is activated in your current terminal session. To deactivate it again, simply run the command deactivate.

1.03  Start a new interactive Python session and import the gensim library:

>>> import gensim

1.04  Next, load the Wikipedia model:

>>> MODEL = "/home/729G17/labs/lab5/wikipedia-sv.bin"
>>> model = gensim.models.Word2Vec.load_word2vec_format(MODEL, binary=True)

Problem Set A

P01  In the Wikipedia model, every word is represented by a vector of some fixed size 𝑛; this size is specified at training time.
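As a side note, the idea of a fixed vector size can be illustrated without the Wikipedia model. The following is a minimal sketch using a made-up toy vocabulary (the words, vectors, and the name toy_model are invented for illustration; the real model maps tens of thousands of Swedish words to vectors that all share the same dimensionality 𝑛):

```python
import numpy as np

# Toy stand-in for a vector space model: a dict mapping words to
# fixed-size vectors. All vectors in one model have the same length.
toy_model = {
    "student": np.array([0.2, 0.8, 0.1, 0.4]),
    "lärare":  np.array([0.3, 0.7, 0.2, 0.3]),
}

# The dimensionality n is simply the length of any word's vector.
n = len(toy_model["student"])
print(n)  # → 4 for this toy model
```

In the same way, looking at the length of a vector returned by the Wikipedia model answers P01.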
The following command shows the vector representation for the word student:

>>> model['student']

What is 𝑛 for this model? Include the number in your report.

P02  As mentioned in class, in a vector space model of word meaning, semantic similarity between words can be quantified using cosine similarity. As an example, the following command computes the similarity of the words student and lärare:

>>> model.similarity('student', 'lärare')

What is the cosine similarity of two words that are identical? What is the cosine similarity of two words that are completely dissimilar? Include the answers to these questions in your report.

2 Word Analogy Tasks

2.01  In a word analogy task we are given two pairs of words that share some common relation; a famous example is man ∶ kvinna and kung ∶ drottning. The task is to predict the fourth word (here: drottning) from the other three. In doing so, we are essentially trying to answer the question “man is to kvinna as kung is to —?”

2.02  Mikolov et al. (2013) showed that word analogy tasks can be solved by adding and subtracting word vectors. In particular, given a suitable vector space model, the word vector of the fourth word in the example, drottning, will be close (in terms of cosine similarity) to the vector kung − man + kvinna.

2.03  To implement this idea using gensim, you can use the function

model.most_similar(positive, negative, topn)

This function finds the top-𝑛 most similar words for the vector that is obtained by adding the vectors of the words specified in the positive list and subtracting the vectors of the words specified in the negative list. The parameter topn is a number that specifies how many of the closest words should be returned. Thus

>>> model.most_similar(['kung', 'kvinna'], ['man'], 3)

returns the three most similar words for the vector kung − man + kvinna.
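The vector arithmetic described in 2.02–2.03 can be sketched in plain numpy over a toy vocabulary. The vectors below are made up for illustration only (gensim works on the trained Wikipedia vectors, and additionally normalises them), but the mechanics are the same: build the combined vector, then rank all other words by cosine similarity to it.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 for vectors pointing in the same
    # direction, 0.0 for orthogonal ones.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up toy vectors for illustration only.
vectors = {
    "man":       np.array([1.0, 0.0, 1.0]),
    "kvinna":    np.array([0.0, 1.0, 1.0]),
    "kung":      np.array([1.0, 0.0, 3.0]),
    "drottning": np.array([0.0, 1.0, 3.0]),
    "prinsessa": np.array([0.1, 0.9, 2.5]),
    "lärare":    np.array([0.5, 0.5, 0.5]),
}

# kung − man + kvinna should land near drottning.
target = vectors["kung"] - vectors["man"] + vectors["kvinna"]

# Rank all words except the three input words (as most_similar does).
candidates = [w for w in vectors if w not in ("kung", "man", "kvinna")]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(best)  # → 'drottning'
```

This is only a sketch of the idea; for the assignment itself you should query the Wikipedia model through gensim as shown above.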
Problem Set B

P03  When testing many examples, using the most_similar() function off-the-shelf is inconvenient. Implement your own version of this function. Your function should accept a string representation of the reference vector as an argument and return the three most similar words in model, like so:

>>> my_most_similar("+kung -man +kvinna")
[('drottning', 0.7311), ('tronföljare', 0.7307), ('prinsessa', 0.7277)]

P04  Write a function complete() that takes the first three words of an analogy quadruple as its input and predicts the fourth word based on cosine similarity:

>>> complete("man", "kvinna", "kung")
'drottning'

P05  The file analogies.txt (in the lab directory) contains a list of ten analogy pairs. How good is the vector space model at predicting these analogies? Report the model’s accuracy and discuss the result. Recall that word vectors are computed based on co-occurrence counts: words that appear in similar contexts will receive similar vectors. Can you use this knowledge to explain the results?

P06  The example analogies have been picked from ten different categories. How would you name these categories? Are all categories meaning-related? Are all categories equally easy/hard?

P07  Complete the list with two new examples from each of the ten categories: one example where the model makes the right prediction and one where it does not. Submit the new list with your lab report.

P08  Try to come up with word analogies from new categories and use your code to test whether the model ‘gets’ them. These examples may get you inspired:

Frankrike  vin         Sverige  ?
Jesus      kristendom  Buddha   ?
Tyskland   Hitler      Italien  ?

Submit at least ten examples. Do you see any fundamental differences between the examples you selected and the ones that you considered for Problem P05?
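For P05, one way to structure the evaluation is to separate the accuracy computation from the prediction function, so it can be tested independently of the model. The sketch below assumes that each line of analogies.txt holds four whitespace-separated words (a b c d); adjust the parsing if the actual file differs. The stub complete() and the toy lines are invented so that the sketch is self-contained; in your solution you would pass in your P04 function and the real file.

```python
def complete(a, b, c):
    # Trivial stub standing in for the real P04 implementation,
    # which would query the gensim model instead.
    toy_answers = {("man", "kvinna", "kung"): "drottning"}
    return toy_answers.get((a, b, c), "?")

def accuracy(lines, predict):
    # Fraction of analogy quadruples (a b c d) for which
    # predict(a, b, c) returns the gold word d.
    quadruples = [line.split() for line in lines if line.strip()]
    correct = sum(1 for a, b, c, gold in quadruples
                  if predict(a, b, c) == gold)
    return correct / len(quadruples)

# With the real data: lines = open("analogies.txt", encoding="utf-8")
toy_lines = ["man kvinna kung drottning",
             "Frankrike Paris Sverige Stockholm"]
print(accuracy(toy_lines, complete))  # → 0.5 with the stub above
```

Keeping accuracy() independent of gensim also makes it easy to rerun the same evaluation on your new examples for P07 and P08.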