7. Vector Semantics
Definition, Applications
Words again
• Structure of words → Morphology
• Distribution of words → Language modeling
• Now, we start looking at the meaning of words
– Also called lexical semantics
• We will mainly focus on vectorial word representations
Outline
• Lexical semantics
• Distributional word similarities
• Word embeddings
Word Representation
• How is the meaning of a word represented for computational models?
What do words mean?
• All past models (LM, classification, …): words are character strings
• At higher levels, when we focus on meaning, we have two possibilities:
– Formal semantics: a block of text is converted into a formula in a logical language
– Capturing the actual meaning of words: words are converted into a conceptual representation centered around their meaning
What do words mean?
• All past models (LM, classification, …): words are character strings
• At higher levels, when we are concerned about meaning, we have two possibilities:
– Formal semantics: a block of text is converted into a formula in a logical language
– e.g. First Order Logic (FOL) representations:
● Predicates
– Primarily verbs, VPs, sentences
– Sometimes nouns and NPs
● Arguments
– Primarily nouns, nominals, NPs, PPs
FOL representation
• It allows for:
– The analysis of truth conditions
● Allows us to answer yes/no questions
– Supports the use of variables
● Allows us to answer questions through the use of variable binding
– Supports inference
● Allows us to answer questions that go beyond what we know explicitly
FOL representation
• Allows logical inferences:
– ∀x MAN(x) ⟶ MORTAL(x)
– MAN(socrates)
– therefore MORTAL(socrates)
• Important for inference in well-defined domains:
– A knowledge base is constructed manually
– Automated inference techniques are used to draw conclusions from the given facts
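To make the inference above concrete, here is a minimal Python sketch (not a real theorem prover; the fact/rule encoding and the forward_chain helper are illustrative assumptions) that derives MORTAL(socrates) from the knowledge base by forward chaining over unary predicates:

```python
# Minimal forward chaining over unary predicates (illustrative only).
facts = {("MAN", "socrates")}                 # MAN(socrates)
rules = [("MAN", "MORTAL")]                   # forall x: MAN(x) -> MORTAL(x)

def forward_chain(facts, rules):
    """Apply rules to facts until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, arg in list(derived):
                if pred == premise and (conclusion, arg) not in derived:
                    derived.add((conclusion, arg))
                    changed = True
    return derived

print(("MORTAL", "socrates") in forward_chain(facts, rules))  # True
```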
Meaning-centered representations
• Roughly speaking, there are two approaches:
– The lexicographic tradition aims to capture the information represented in lexicons, dictionaries, etc.
– The distributional tradition aims to capture the meaning of words based on large amounts of raw text
LEXICAL SEMANTICS
Lexical semantics
• Uses resources such as lexicons, thesauri, ontologies, etc. that capture explicit knowledge about word meanings.
– Example resources: WordNet
• Assumes words have discrete word senses:
– bank1 = financial institution; bank2 = river bank, etc.
• May capture explicit relations between word senses: “dog” is a “mammal”, “cars” have “wheels”, etc. Word representations should reflect word meaning and relationships to other words:
– Which words have similar meaning
– Which words have opposite meaning
– Which words have positive or negative connotations
– Etc.
Lexical Semantics: Lexicon entries
Lexical Semantics: Lemmas and Senses
• Example (modified from the online thesaurus WordNet):
– lemma: mouse (N)
– sense 1: any of numerous small rodents...
– sense 2: a hand-operated device that controls a cursor...
• A sense (or concept) is the meaning component of a word
• Lemmas can be polysemous (have multiple senses)
– A lemma is polysemous if it has different related senses
– e.g. bank: the financial institution or the building of that institution
Lexical Semantics: Homonymy
• Homonyms: different (unrelated) senses with the same spelling or pronunciation
– Remember: bank (financial institution); bank of a river
• There are two types of homonymy:
– Homophones: words that have the same pronunciation but different meanings and, often, different spellings.
● "bare" (without covering) and "bear" (the animal)
● "flower" (a bloom) and "flour" (used in baking)
● "write" (to put words on paper) and "right" (opposite of left)
Lexical Semantics: Homonymy
• Homonyms: different senses with the same spelling or pronunciation
– Remember: the bank example
• There are two types of homonymy:
– Homophones: words that have the same pronunciation but different meanings and, often, different spellings.
– Homographs: words that have the same spelling but different meanings and may or may not have the same pronunciation.
● "tear" (to rip) and "tear" (to cry)
● "lead" (to guide) and "lead" (a heavy metal)
● "bass" (low-frequency sound) and "bass" (a type of fish)
Relations between senses: Synonymy
• Synonyms: words that have the same meaning in some or all contexts
– couch / sofa
– filbert / hazelnut
– big / large
– car / automobile
– water / H2O
– vomit / throw up
– etc.
Relations between senses: Synonymy
• Synonyms: words that have the same meaning in some or all contexts
• Note that there are probably no examples of perfect synonymy.
– Even if many aspects of meaning are identical, words may still differ in politeness, slang, register, or genre:
– e.g. ask & request, inform & tell (politeness)
– cool & awesome, crazy & insane (slang)
– commence & start & begin (register)
– utilize & employ & use (technical writing genre)
– uncover & reveal & discover (creative writing genre)
Relations between senses: Synonymy
• Synonyms: words that have the same meaning in
some or all contexts
• Note that there are probably no examples of
perfect synonymy.
• Difference in form → difference in meaning
Lexical Semantics: Similarity
• Similar words: words with similar meanings. Not synonyms, but sharing some element of meaning
– car, bicycle
– cow, horse
Lexical Semantics: Similarity
• Similar words: words with similar meanings. Not synonyms, but sharing some element of meaning
• A dataset manually created by asking annotators to score the similarity between word pairs on a scale from 1 to 10: the SimLex-999 dataset
Lexical Semantics: Relatedness
• Also called "word association"
• Words can be related in any way, perhaps via a semantic frame or field
– coffee, tea: similar
– coffee, cup: related, not similar
Lexical Semantics: Relatedness
• Semantic field:
– words that cover a particular semantic domain and are associated with a specific concept, idea, or theme
• e.g.
– hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
– restaurants: waiter, menu, plate, food, chef
– houses: door, roof, kitchen, family, bed
Lexical Semantics: Antonymy
• Antonyms: senses that are opposite with respect to only one feature of meaning; otherwise they are very similar!
– dark/light, short/long, fast/slow, hot/cold, rise/fall, up/down, in/out, …
Lexical Semantics: Antonymy
• Antonyms: senses that are opposite with respect to only one feature of meaning; otherwise they are very similar!
– dark/light, short/long, fast/slow, hot/cold, rise/fall, up/down, in/out, …
• More formally, antonyms can
– define a binary opposition or be at opposite ends of a scale: long/short, fast/slow
– be reversives: rise/fall, up/down
Lexical Semantics: Hierarchical relations
• Hypernyms & hyponyms:
– e.g. pet/dog
– The hyponym (dog) is more specific than the hypernym (pet)
• Holonyms and meronyms:
– e.g. car/wheel
– The meronym (wheel) is a part of the holonym (car)
Lexical semantics resources: WordNet
• Very large, publicly available lexical database of English:
– 110K nouns, 11K verbs, 22K adjectives, 4.5K adverbs (WordNets for many other languages exist or are under construction)
• Each word has a POS tag and one or more word senses. Avg. # of senses: 1.23 for nouns, 2.16 for verbs, 1.41 for adjectives, 1.24 for adverbs
• Word senses are grouped into synonym sets (“synsets”)
– 81K noun synsets, 13K verb synsets, 19K adj. synsets, 3.5K adverb synsets
• Synsets are connected in a hierarchy/network defined via conceptual-semantic relations
– hypernym/hyponym relation (IS-A)
– holonym/meronym relation (HAS-A)
• Also lexical relations (derivational morphology), and lemmatization
• It is accessible at https://wordnet.princeton.edu/
• It can also be used programmatically using nltk, as in the sketch below
– https://www.nltk.org/howto/wordnet.html
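As a rough illustration of the nltk interface linked above, here is a minimal sketch (assuming nltk is installed and the WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')):

```python
# Minimal sketch using NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

# Senses (synsets) of the lemma "mouse"
for syn in wn.synsets('mouse', pos=wn.NOUN):
    print(syn.name(), '-', syn.definition())

# Hierarchical relations for one sense
dog = wn.synset('dog.n.01')
print(dog.hypernyms())        # more general concepts (IS-A)
print(dog.hyponyms()[:5])     # more specific concepts
print(wn.synset('car.n.01').part_meronyms())  # parts of a car (HAS-PART)
```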
WordNet
• WordNet total:
– 155,327 words organized in 175,979 synsets for a total of 207,016 word-sense pairs; in compressed form, it is about 12 megabytes in size.
• Check visualizers:
– https://www.visual-thesaurus.com/wordnet.php
– https://wordvis.com/about.html
WordNet
• Example of hierarchical relations from WordNet
• IS-A relations (hyponymy):
– Hypernym/hyponym (between concepts)
● meal is a hypernym (superordinate) of breakfast
● breakfast is a hyponym (subordinate) of meal
● dog is a hypernym (superordinate) of poodle
● poodle is a hyponym (subordinate) of (IS-A) dog
– Instance hypernym/hyponym (concepts and instances)
● composer is the instance hypernym of (HAS-INSTANCE) Bach
● Bach is an instance hyponym of (IS-INSTANCE-OF) composer
WordNet
• Example of part-whole relations (meronymy):
• Member holonym/meronym (groups and members)
– crew is a member holonym of (HAS-MEMBER) co-pilot
– co-pilot is a member meronym of (IS-MEMBER-OF) crew
• Part holonym/meronym (wholes and parts)
– car is a part holonym of (HAS-PART) wheel
– wheel is a part meronym of (IS-PART-OF) car
• Substance holonym/meronym (substances and components)
– bread is a substance holonym of (HAS-COMPONENT) flour
– flour is a substance meronym of (IS-COMPONENT-OF) bread
WordNet-based word similarity
• There have been many attempts to exploit resources like WordNet to compute word (sense) similarities.
• Classic approaches use the distance (path length) between synsets (these paths typically only consider hypernym/hyponym relations), possibly augmented with corpus statistics
• More recent (neural) approaches aim to learn (non-Euclidean) embeddings that capture the hierarchical hypernym/hyponym structure of WordNet.
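A minimal sketch of a classic path-length-based similarity using nltk's WordNet interface (path_similarity, which nltk defines as 1/(1 + shortest hypernym/hyponym path length); assumes the WordNet corpus is installed):

```python
# Path-based word-sense similarity with NLTK's WordNet.
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

print(dog.path_similarity(cat))   # relatively high: both are mammals
print(dog.path_similarity(car))   # much lower: only distantly related
```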
Word-sense similarity
• Similarity can be interpreted as:
• Synonymy:
– Do the two words/senses have the same meaning? (In WordNet, members of a synset are synonyms (similarity = 1), but hypernyms/hyponyms (dog/poodle) are also more similar to each other than unrelated words)
– e.g. sim(couch, sofa) > sim(poodle, dog) > sim(poodle, pug), …
• Relatedness:
Word-sense similarity
• Similarity can be interpreted as:
• Synonymy:
• Relatedness:
– How related are the two words/senses to each other?
– coffee and cup are strongly associated, but not synonyms
– “Semantic fields”: sets of words that are topically related (in WordNet, holonyms/meronyms etc. capture some associations)
Word similarity examples & problems
• Path length is just the distance between synsets
– pathlen(nickel, dime) = 2 (nickel—coin—dime)
– pathlen(nickel, money) = 5 (nickel—…—medium of exchange—money)
– pathlen(nickel, budget) = 7 (nickel—…—medium of exchange—…—budget)
• But do we really want the following?
– pathlen(nickel, coin) < pathlen(nickel, dime)
– pathlen(nickel, Richter scale) = pathlen(nickel, budget)
Problems with thesaurus similarity
• We need to have a thesaurus! (not available for all
languages)
• We need to have a thesaurus that contains the
words we’re interested in.
• We need a thesaurus that captures a rich
hierarchy of hypernyms and hyponyms.
• Most thesaurus-based similarities depend on the
specifics of the hierarchy that is implemented in
the thesaurus.
DISTRIBUTIONAL SEMANTICS
Distributional approach to semantics
• Uses large corpora of raw text to learn the meaning
of words from the contexts in which they occur.
• Maps words to (sparse) vectors that capture corpus
statistics
• Contemporary variant: use neural nets to learn dense vectors (“embeddings”) from very large corpora
– (this is a prerequisite for most neural approaches
to NLP)
• If each word type is mapped to a single vector, this
ignores the fact that words have multiple senses or
parts-of-speech
Distributional approach to semantics
• Language understanding requires knowing when words have similar meanings
• e.g. Question answering:
– Q: “How tall is Mt. Everest?”
– Candidate A: “The official height of Mount Everest is 29029 feet”
– “tall” is similar to “height”
Distributional approach to semantics
• Language understanding requires knowing when words have similar meanings
• e.g. Plagiarism detection
• Which representation will capture such similarities?
– As atomic symbols? [e.g. as in a traditional n-gram language model, or when we use them as explicit features in a classifier]
– This is equivalent to very high-dimensional one-hot vectors: aardvark=[1,0,…,0], bear=[0,1,0,…,0], …, zebra=[0,…,0,1]
– No: height/tall would be as different as height/cat
Which representation will capture such similarities?
• As atomic symbols? [e.g. as in a traditional n-gram language model, or when we use them as explicit features in a classifier]
• This is equivalent to very high-dimensional one-hot vectors: aardvark=[1,0,…,0], bear=[0,1,0,…,0], …, zebra=[0,…,0,1]
• No: height/tall would be as different as height/cat
• Better alternatives (covered next):
– tf-idf
– Word2Vec
What similarities are captured by vector representations?
• Vector representations of words were originally motivated by attempts to capture lexical semantics (the meaning of words), so that words with similar meanings have similar representations
• These representations may also capture some morphological or syntactic properties of words (parts of speech, inflections, stems, etc.).
What similarities are captured by vector representations?
• The meaning of a word is represented as a vector, called an "embedding" because it is embedded into a vector space
• This is the standard way to represent meaning in NLP
• Every modern NLP algorithm uses embeddings as the representation of word meaning
The Distributional Hypothesis
• Vector representations are based on the distributional hypothesis
• Zellig Harris (1954):
– “oculist and eye-doctor … occur in almost the same environments”
– “If A and B have almost identical environments we say that they are synonyms.”
• John R. Firth (1957):
– “You shall know a word by the company it keeps.”
The Distributional Hypothesis
• Context is used for semantics in many ways in NLP
• Distributional similarities (vector-space semantics):
– Use the set of all contexts in which words (= word types) appear to measure their similarity
– Assumption: words that appear in similar contexts (tea, coffee) have similar meanings.
• Word sense disambiguation:
– Use the context of a particular occurrence of a word (token) to identify which sense it has.
– Assumption: if a word has multiple distinct senses (e.g. plant: factory or green plant), each sense will appear in different contexts
Distributional Similarities
• Basic idea:
– Measure the semantic similarity of words in terms of the similarity of the contexts in which they appear
• How?
– Represent words as vectors such that
● each vector element (dimension) corresponds to a different context
● the vector for any particular word captures how strongly it is associated with each context
– Compute the semantic similarity of words as the similarity of their vectors.
Distributional Similarities
• Distributional similarities use the set of contexts in
which words appear to measure their similarity
Term-Document matrix (information retrieval)
• We search a collection of N documents
– We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in.
– Conversely, we can represent each document as a V-dimensional vector indicating which words appear in it.
• Finding the most relevant document for a query:
– Queries are also (short) documents
– Use the similarity of a query’s vector and the documents’ vectors to compute which document is most relevant to the query.
• Intuition: documents are similar to each other if they contain the same words.
Term-Document matrix (information retrieval)
• A term-document matrix is a 2D table:
– Each cell contains the frequency (count) of term (word) t in document d: tf_{t,d}
– Each column is a vector of counts over words, representing a document
– Each row is a vector of counts over documents, representing a word
Term-Document matrix (information retrieval)
• Each column vector = a document (Each entry
corresponds to one word in the vocabulary)
• Each row vector = a word (Each entry corresponds
to one document in the corpus)
• Two documents are similar if their vectors are
similar
• Two words are similar if their vectors are similar
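A minimal sketch of building such a term-document count matrix, here with scikit-learn's CountVectorizer on a made-up three-document corpus (the documents and their names are illustrative assumptions):

```python
# Term-document count matrix on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = {
    "doc1": "the battle of wits begins",
    "doc2": "the good soldier wins the battle",
    "doc3": "a fool and his wit are soon parted",
}
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs.values())   # shape: (num_docs, |V|)

# Rows of X are document vectors; rows of X.T are word vectors over
# documents, matching the term-document matrix described above.
print(vectorizer.get_feature_names_out())
print(X.toarray().T)                           # one row per term
```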
Term-Document matrix (information retrieval)
• This model can be adapted to implement a model
of the distributional hypothesis if we treat each
context as a column in our matrix.
Term-Context matrix
• A more common convention is to use a word-word frequency matrix: each row is a target word, each column is a context word, and each cell counts how often they co-occur.
What is a context?
• There are many different interpretations of context that yield different kinds of similarities:
• Contexts defined by nearby words:
– How often does w appear near the word drink?
– Near = “drink appears within a window of ±k words of w”, or “drink appears in the same document/sentence as w”
– This yields fairly broad thematic similarities.
• Contexts defined by grammatical relations:
– How often is (the noun) w used as the subject (object) of the verb drink? (Requires a parser.)
– This gives more fine-grained similarities.
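A minimal sketch of the first option, counting word-word co-occurrences within a ±k word window over a toy corpus (the sentences are made up for illustration):

```python
# Term-context (word-word) co-occurrence counts with a +/- k window.
from collections import Counter, defaultdict

corpus = [
    "i like to drink tea in the morning".split(),
    "she likes to drink coffee from a cup".split(),
]
k = 2
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - k), min(len(sent), i + k + 1)):
            if i != j:
                cooc[w][sent[j]] += 1

print(cooc["drink"])   # context counts for the target word "drink"
```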
From vectors to similarities
• First idea: the dot product of the two word vectors:
– dot(v, w) = v · w = Σ_i v_i w_i
From vectors to similarities
• First idea: dot product
• However, it has an issue:
• It favors vectors with high values in many
dimensions
• Frequent words (of, the, in) occur with many other
words
• Therefore, their dot product with other words is
higher
From vectors to similarities
• Alternative idea: cosine similarity, which normalizes the dot product by the vector lengths:
– cos(v, w) = (v · w) / (|v| |w|)
• Example: see the sketch below
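A minimal numpy sketch of cosine similarity between context-count vectors (the three words and their counts are toy values for illustration):

```python
# Cosine similarity between word vectors (toy context counts).
import numpy as np

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| |w|)"""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

cherry  = np.array([442, 8, 2])       # counts in three contexts (toy values)
digital = np.array([5, 1683, 1670])
info    = np.array([5, 3982, 3325])

print(cosine(cherry, digital))   # low similarity
print(cosine(digital, info))     # high similarity
```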
Frequent words vs. important words
• Raw frequency can be a bad representation
• The co-occurrence matrices we have seen
represent each cell by word frequencies.
• Frequency is useful: if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, of are not
very informative about the context
• Solution: weight words (so that the frequent ones
receive lower scores)
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Frequent words have low idf
– Pointwise Mutual Information (PMI):
● Needs estimates of the probabilities p(w), p(c), and p(w, c)
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Raw term frequency: tf_{t,d} = count(t,d)
● In practice, instead of using the raw count:
– tf_{t,d} = log10(count(t,d) + 1)
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Term frequency: tf_{t,d} = log10(count(t,d) + 1)
● Document frequency: df_t is the number of documents in which the term t occurs.
● Example: “Romeo” is very special to one Shakespeare document
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Term frequency: tf_{t,d} = log10(count(t,d) + 1)
● Document frequency: df_t is the number of documents in which the term t occurs.
● Inverse document frequency: idf_t = log10(N / df_t), where N is the total number of documents
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Term frequency: tf_{t,d} = log10(count(t,d) + 1)
● Document frequency: df_t (and the resulting per-term idf_t values; table omitted)
● The tf-idf weight combines the two: w_{t,d} = tf_{t,d} × idf_t
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● In practice, a document can be anything: we often call each paragraph a document!
Frequent words vs. important words
• Two common weighting approaches:
– Tf-idf:
● Example: a table of raw counts and the corresponding tf-idf weighted values (tables omitted); see also the sketch below
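A minimal numpy sketch of the tf-idf weighting defined above, applied to a small term-document count matrix (the counts are toy values, not real corpus statistics):

```python
# tf-idf weighting of a small term-document count matrix (toy counts).
import numpy as np

counts = np.array([
    [114,  80,  62,  89],   # a frequent term
    [ 36,  58,   1,   4],   # a mid-frequency term
    [  1, 114,   0,   0],   # a term concentrated in one document
], dtype=float)

N = counts.shape[1]                   # number of documents
tf = np.log10(counts + 1)             # tf_{t,d} = log10(count + 1)
df = (counts > 0).sum(axis=1)         # document frequency per term
idf = np.log10(N / df)                # idf_t = log10(N / df_t)
tfidf = tf * idf[:, None]             # w_{t,d} = tf_{t,d} * idf_t
print(tfidf.round(3))
```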
Frequent words vs. important words
• Two common weighting approaches:
– PMI:
● PMI(w, c) = log2( p(w, c) / (p(w) p(c)) )
● PMI ranges from −∞ to +∞
● In practice, we just use positive PMI values, and replace all negative PMI values with 0
● Therefore, we use Positive PMI: PPMI(w, c) = max(PMI(w, c), 0)
Frequent words vs. important words
• Two common weighting approaches:
– PPMI:
● Example: a worked PPMI computation over a word-word count matrix (tables omitted); see the sketch below
Frequent words vs. important words
• Two common weighting approaches:
– PPMI:
● PMI is biased toward infrequent events: very rare words get very high PMI values
● A solution is to use add-1 smoothing
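A minimal numpy sketch of PPMI weighting over a small word-word co-occurrence matrix (toy counts; the guard against rows or columns with zero counts is an implementation detail, not part of the definition):

```python
# PPMI weighting of a word-word co-occurrence matrix (toy counts).
import numpy as np

counts = np.array([
    [0, 2, 1, 1],
    [2, 0, 1, 6],
    [1, 1, 0, 4],
    [1, 6, 4, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                      # joint probabilities p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal p(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))      # PMI(w, c)
ppmi = np.maximum(pmi, 0)                  # clip negative values (and -inf) to 0
ppmi = np.nan_to_num(ppmi)                 # guard against all-zero rows/columns
print(ppmi.round(3))
```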
Word embeddings
• tf-idf (or PMI) vectors are:
– long (length |V| = 20,000 to 50,000)
– sparse (most elements are zero)
• Alternative: learn vectors which are:
– short (length 50-1000)
– dense (most elements are non-zero)
Word embeddings
• A (static) word embedding is a function that maps each word type to a single vector
– These vectors are typically dense and have much lower dimensionality than the size of the vocabulary (fewer dimensions than tf-idf/PPMI vectors)
– This mapping function typically ignores that the same string of letters may have different senses (dining table vs. a table of contents) or parts of speech (to table a motion vs. a table)
Methods for embeddings
• “Neural language model”-inspired models: Word2vec (skip-gram, CBOW), GloVe
• Singular Value Decomposition (SVD)
• Alternative to these "static embeddings":
– Contextual embeddings (ELMo, BERT)
– Compute distinct embeddings for a word in its context
– Separate embeddings for each token of a word
Methods for embeddings
• Word2vec (Mikolov et al 2013)
https://code.google.com/archive/p/word2vec/
• GloVe (Pennington, Socher, Manning, 2014)
http://nlp.stanford.edu/projects/glove/
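As a rough illustration of working with such pre-trained vectors, here is a sketch using gensim's downloader API (assumes gensim is installed and can fetch the "glove-wiki-gigaword-100" vectors; any other pre-trained KeyedVectors model would work the same way):

```python
# Loading pre-trained embeddings via gensim's downloader API.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object
print(wv.most_similar("coffee", topn=5))   # nearest neighbors by cosine
print(wv.similarity("tall", "height"))     # pairwise cosine similarity
```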
Word2vec
• Very popular and influential
• Simple and fast
• Two ways to think about Word2Vec:
–
a simplification of neural language models
–
a binary logistic regression classifier
Word2vec
• Instead of counting how often each word w occurs near "apricot":
– Train a classifier on a binary prediction task: is w likely to show up near "apricot"?
• We don’t actually care about this task
– But we'll take the learned classifier weights as the word embeddings
• Big idea: self-supervision:
– A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
– No need for human labels
Word2vec
• Instead of counting how often each word w occurs
near "apricot"
– Train a classifier on a binary prediction task: Is w
likely to show up near "apricot"?
• The parameters of that classifier provide a dense
vector representation of the target word
(embedding)
• Words that appear in similar contexts (that have
high distributional similarity) will have very similar
vector representations.
• These models can be trained on large amounts of
raw text (and pre-trained embeddings can be
downloaded)
Word2vec: skip-gram with negative sampling
• Train a binary classifier that decides whether a target word t appears in the context of other words c1..k
– Context: the set of k words near (surrounding) t
– Treat the target word t and any word that actually appears in its context in a real corpus as positive examples
– Treat the target word t and randomly sampled words that don’t appear in its context as negative examples
– Train a binary logistic regression classifier to distinguish these cases
– The weights of this classifier depend on the similarity of t and the words in c1..k
• Use the learned weights of this classifier as embeddings for t
Skip-gram training
• Given a tuple (t, c) = (target, context)
– (apricot, jam)
– (apricot, aardvark)
• where some context words c are from real data (jam) and others (aardvark) are randomly sampled from the vocabulary…
• … decide whether c is a real context word for the target t (a positive example):
• c is real if: P(D = + | t, c) > P(D = − | t, c) = 1 − P(D = + | t, c)
Skip-gram training
• How to compute P(D = + | t, c)?
• Intuition: words are likely to appear near similar words
• Idea:
– Model similarity with a dot product of vectors: Similarity(t, c) = f(t · c)
• Problem:
– The dot product is not a probability!
– (Neither is cosine)
Skip-gram training
• To turn the dot product into a probability, we use the sigmoid function σ(x) = 1 / (1 + e^(−x)):
– P(D = + | t, c) = σ(t · c)
– P(D = − | t, c) = 1 − σ(t · c) = σ(−t · c)
Skip-gram training
• Training sentence:
... lemon, a [tablespoon of apricot jam a] pinch ...
(c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a)
• Training data: input/output pairs centering on apricot. Assume a +/- 2 word window
• Positive examples (D=+):
– (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
• Negative examples (D=-):
– (apricot, aardvark), (apricot, puddle), …
• For each positive example, sample k noise words
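A minimal sketch of generating these training pairs from a sentence with a ±2 word window and k uniformly sampled noise words per positive pair (the toy vocabulary and uniform sampling are simplifying assumptions; the next slide discusses frequency-based sampling):

```python
# Generating skip-gram positive and negative training pairs (toy setup).
import random

sentence = "lemon a tablespoon of apricot jam a pinch".split()
vocab = ["aardvark", "puddle", "coaxial", "jam", "tablespoon",
         "lemon", "of", "a", "pinch", "apricot"]
window, k = 2, 2

positives, negatives = [], []
for i, t in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i == j:
            continue
        positives.append((t, sentence[j]))
        # Sample k noise words uniformly here; frequency^0.75 sampling
        # would weight these choices instead.
        negatives.extend((t, random.choice(vocab)) for _ in range(k))

print([p for p in positives if p[0] == "apricot"])
```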
Skip-gram training
• Negative sampling:
– Where do we get D=− from?
– Lots of options:
● Word2Vec: for each good pair (w, c), sample k words as negative examples
● Words can be sampled according to their corpus frequency, or according to a smoothed variant where freq′(w) = freq(w)^0.75 (this gives more weight to rare words)
Skip-gram training
• Assume that t and c are represented as vectors t, c, so that their dot product t · c captures their similarity
• Use logistic regression to predict whether the pair (t, c) (target t and context word c) is a positive or negative example:
– P(D = + | t, c) = σ(t · c)
• Skip-gram learns two (sets of) vectors (i.e. two matrices):
– target embeddings/vectors t and context embeddings/vectors c
Skip-gram training objective
• Find a model that maximizes the log-likelihood of the training data D+ ∪ D−:
– L = Σ_{(t,c) ∈ D+} log σ(t · c) + Σ_{(t,c) ∈ D−} log σ(−t · c)
• This forces the target and context embeddings of positive examples to be similar ...
• … and the target and context embeddings of negative examples to be dissimilar
• All words appear with positive and negative contexts.
Summary of skip-gram training
• For a vocabulary of size V: start with 2*V random vectors (typically 300-dimensional) as initial embeddings
• Train a logistic regression classifier to distinguish words that co-occur in the corpus from those that don’t
– Pairs of words that co-occur are positive examples
– Pairs of words that don't co-occur are negative examples
• Learning is performed using SGD
• During training, target and context vectors of positive examples will become similar, and those of negative examples will become dissimilar.
• This returns two embedding matrices T and C, where each word in the vocabulary is mapped to a 300-dim. vector, in both T and C.
• It's common to just add them together, representing word i as the vector t_i + c_i
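A hedged numpy sketch of a single SGD update for this objective (the function name, learning rate, and calling convention are my own; real word2vec implementations add many optimizations):

```python
# One SGD step for skip-gram with negative sampling (illustrative).
import numpy as np

def sgns_update(T, C, t_idx, pos_idx, neg_idx, lr=0.025):
    """T, C: target and context embedding matrices (V x d).
    t_idx: id of the target word; pos_idx: id of one real context word;
    neg_idx: ids of k sampled negative (noise) words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    t = T[t_idx].copy()
    # Positive pair: push sigma(t . c_pos) toward 1
    g = sigmoid(np.dot(t, C[pos_idx])) - 1.0
    grad_t = g * C[pos_idx]
    C[pos_idx] -= lr * g * t
    # Negative pairs: push sigma(t . c_neg) toward 0
    for ni in neg_idx:
        g = sigmoid(np.dot(t, C[ni]))
        grad_t += g * C[ni]
        C[ni] -= lr * g * t
    T[t_idx] -= lr * grad_t

# Usage: V = 10 words, d = 4 dimensions, random initial embeddings
rng = np.random.default_rng(0)
T, C = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
sgns_update(T, C, t_idx=3, pos_idx=5, neg_idx=[1, 7])
```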
Properties of the embeddings
• Small windows (C = +/- 2): the nearest words are syntactically similar words in the same taxonomy
– Hogwarts' nearest neighbors are other fictional schools: Sunnydale, Evernight, Blandings
• Large windows (C = +/- 5): the nearest words are related words in the same semantic field
– Hogwarts' nearest neighbors are from the Harry Potter world: Dumbledore, half-blood, Malfoy
Properties of the embeddings
• Analogy: embeddings capture relational meaning
– vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
– vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(?)
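A sketch of checking these analogies with gensim's most_similar (assumes a pre-trained, lower-cased vector model such as the GloVe vectors used earlier; results vary with the model):

```python
# Analogy queries with pre-trained embeddings via gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
# king - man + woman ~= ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Paris - France + Italy ~= ?
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```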