Assignment 5: Semantic Analysis

Language Technology (2016)
Marco Kuhlmann, Robin Kurtz
Goal
Use state-of-the-art NLP libraries to experiment with a vector space
model of word meaning and evaluate it on a word analogy task.
Preparations
Read and understand these instructions in their entirety.
Report
Solve the problems (such as P01) and describe your solutions in a
short report. Some problems require you to write Python code. Send
the code and your report, as well as your individual reflection forms, to
your teaching assistant by email.
Deadline
2016-02-26
1 Introduction
1.01
Recall from class that a vector space model of word meaning represents words as
vectors in a high-dimensional space. In this assignment, you will experiment with a
vector space model trained on the Swedish Wikipedia using the tool word2vec:
https://code.google.com/archive/p/word2vec/
1.02
To experiment with the model you will use the gensim library¹. Since this library
is not included in the standard Python installation, we provide you with a so-called
virtual environment. To activate it, run the following command:
source /home/729G17/labs/lab5/lab5-env/bin/activate
You will now see the tag (lab5-env) at the beginning of your prompt, indicating that
the virtual environment is activated in your current terminal session. To deactivate it
again, simply run the command deactivate.
1.03
Start a new interactive Python session and import the gensim library:
>>> import gensim
¹ https://radimrehurek.com/gensim/
February 22, 2016
1.04
Next, load the Wikipedia model:
>>> MODEL = "/home/729G17/labs/lab5/wikipedia-sv.bin"
>>> model = gensim.models.Word2Vec.load_word2vec_format(MODEL, binary=True)
Problem Set A
P01
In the Wikipedia model, every word is represented by a vector of some fixed size 𝑛;
this size is specified at training time. The following command shows the vector
representation for the word student:
>>> model['student']
What is 𝑛 for this model? Include the number in your report.
P02
As mentioned in class, in a vector space model of word meaning, semantic similarity
between words can be quantified using cosine similarity. As an example, the following
command computes the similarity of the words student and lärare:
>>> model.similarity('student', 'lärare')
What is the cosine similarity of two words that are identical? What is the cosine
similarity of two words that are completely dissimilar? Include the answers to these
questions in your report.
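To build intuition for these questions, here is a minimal sketch (plain NumPy, independent of gensim and the Wikipedia model) showing how cosine similarity behaves for two made-up vectors that point in the same direction versus two that are orthogonal:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u (scaled by 2)
w = np.array([-3.0, 0.0, 1.0])  # orthogonal to u (dot product is 0)

print(cosine_similarity(u, v))  # 1.0
print(cosine_similarity(u, w))  # 0.0
```

Note that cosine similarity depends only on the direction of the vectors, not on their length, which is why `u` and its scaled copy `v` score 1.0.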
2 Word Analogy Tasks
2.01
In a word analogy task we are given two pairs of words that share some common
relation; a famous example is man ∶ kvinna and kung ∶ drottning. The task is to
predict the fourth word (here: drottning) from the other three. In doing so, we are
essentially trying to answer the question “man is to kvinna as kung is to —?”
2.02
Mikolov et al. (2013) showed that word analogy tasks can be solved by adding and
subtracting word vectors. In particular, given a suitable vector space model, the word
vector of the fourth word in the example, drottning, will be close (in terms of cosine
similarity) to the vector kung − man + kvinna.
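The mechanics of this arithmetic can be sketched with a toy example. The four three-dimensional vectors below are invented for illustration only (real word2vec vectors have hundreds of dimensions learned from corpus data); the sketch shows how one ranks candidate words by cosine similarity to the vector kung − man + kvinna:

```python
import numpy as np

# Invented toy vectors; NOT real word2vec output.
toy_model = {
    'man':       np.array([1.0, 0.0, 0.0]),
    'kvinna':    np.array([1.0, 1.0, 0.0]),
    'kung':      np.array([1.0, 0.0, 1.0]),
    'drottning': np.array([1.0, 1.0, 1.0]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Target vector for the analogy: kung - man + kvinna.
target = toy_model['kung'] - toy_model['man'] + toy_model['kvinna']

# Rank the remaining vocabulary by cosine similarity to the target,
# excluding the three query words themselves.
candidates = [w for w in toy_model if w not in ('kung', 'man', 'kvinna')]
best = max(candidates, key=lambda w: cosine(toy_model[w], target))
print(best)  # drottning
```

Excluding the query words from the candidates matters in practice: the input vectors themselves are often among the nearest neighbours of the target vector.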
2.03
To implement this idea using gensim, you can use the function
model.most_similar(positive, negative, topn)
This function finds the top-𝑛 most similar words for the vector that is obtained by
adding the vectors of the words specified in the positive list and subtracting the
vectors of the words specified in the negative list. The parameter topn is a number
that specifies how many closest words should be returned. Thus
>>> model.most_similar(['kung', 'kvinna'], ['man'], 3)
returns the three most similar words for the vector kung − man + kvinna.
Problem Set B
P03
When testing many examples, using the most_similar() function off the shelf is
inconvenient. Implement your own version of this function. Your function should
accept a string representation of the reference vector as an argument and return the
three most similar words in the model, like so:
>>> my_most_similar("+kung -man +kvinna")
[('drottning', 0.7311), ('tronföljare', 0.7307), ('prinsessa', 0.7277)]
P04
Write a function complete() that takes the first three words of an analogy quadruple
as its input and predicts the fourth word based on cosine similarity:
>>> complete("man", "kvinna", "kung")
'drottning'
P05
The file analogies.txt (in the lab directory) contains a list of ten analogy pairs. How
good is the vector space model at predicting these analogies? Report the model's
accuracy and discuss the result. Recall that word vectors are computed based on
co-occurrence counts: words that appear in similar contexts will receive similar
vectors. Can you use this knowledge to explain the results?
P06
The example analogies have been picked from ten different categories. How would
you name these categories? Are all categories meaning-related? Are all categories
equally easy/hard?
P07
Complete the list with two new examples from each of the ten categories: one example
where the model makes the right prediction and one where it does not. Submit the
new list with your lab report.
P08
Try to come up with word analogies from new categories and use your code to test
whether the model ‘gets’ them. These examples may inspire you:
Frankrike vin Sverige ?
Jesus kristendom Buddha ?
Tyskland Hitler Italien ?
Submit at least ten examples. Do you see any fundamental differences between the
examples you selected and the ones that you considered for Problem P05?