Vector Semantics: Word Embeddings & Lexical Semantics

Vector Semantics
Definition, Applications
Words again
• Structure of words →Morphology
• Distribution of words →Language modeling
• Now, we start looking at the meaning of words
Also called lexical semantics
• We will mainly focus on vectorial word representations
• Lexical semantics
• Distributional word similarities
• Word embeddings
Word Representation
• How is the meaning of a word represented
for computational models?
What do words mean?
• All past models (LM, classification…): words are
character strings
• At higher levels, when we are concerned with meaning,
we have two possibilities:
– Formal semantics: A block of text is converted
into a formula in a logical language
e.g. First Order Logic (FOL) representations:
Predicates: primarily Verbs, VPs, Sentences;
sometimes Nouns and NPs
Arguments: primarily Nouns, Nominals, NPs, PPs
– Capturing the actual meaning of words: Words are
converted into conceptual representations
centered around their meaning
FOL representation
• It allows for:
The analysis of truth conditions
Supports the use of variables
Allows us to answer questions through the use of variables
Supports inference
Allows us to answer yes/no questions
Allows us to answer questions that go beyond what we know
FOL representation
• Allow logical inferences
∀x MAN(x) ⟶ MORTAL(x)
• Important for inference in well-defined domains
• A knowledge base is manually constructed
• Automated inference techniques are used to draw
conclusions from given facts.
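As a rough illustration of drawing conclusions from given facts, the sketch below applies the rule ∀x MAN(x) ⟶ MORTAL(x) to a tiny hand-written knowledge base. The fact about "socrates", the data layout, and the function name `forward_chain` are assumptions made for this example; it only handles one-argument predicates.

```python
# Toy knowledge base: facts are (predicate, argument) pairs.
facts = {("MAN", "socrates")}

# Each rule (premise, conclusion) encodes: for all x, premise(x) -> conclusion(x).
rules = [("MAN", "MORTAL")]


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, arg in list(derived):
                new_fact = (conclusion, arg)
                if predicate == premise and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived


kb = forward_chain(facts, rules)
# A yes/no question that goes beyond the stated facts: is Socrates mortal?
print(("MORTAL", "socrates") in kb)
```

The answer is obtained by inference: MORTAL(socrates) was never stated, only derived.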
Meaning-centered representations
• Roughly speaking, there are two approaches:
The lexicographic tradition aims to capture the
information represented in lexicons, dictionaries, etc.
The distributional tradition aims to capture the
meaning of words based on large amounts of raw text
Lexical semantics
• Uses resources such as lexicons, thesauri,
ontologies etc. that capture explicit knowledge
about word meanings.
Example resources: WordNet
• Assumes words have discrete word senses:
bank1 = financial institution; bank2 = river bank, etc.
• May capture explicit relations between words (senses):
“dog” is a “mammal”, “cars” have “wheels”, etc.
• Word representations should reflect word meaning and
relationships to other words:
Which words have similar meaning
Which words have opposite meaning
Which words have positive or negative connotations
Lexical Semantics
Lexicon entries
Lexical Semantics: Lemmas and Senses
mouse (N)
1. any of numerous small rodents...
2. a hand-operated device that controls a
cursor...
(Modified from the online thesaurus WordNet)
• A sense (or concept) is the meaning component of a word
• Lemmas can be polysemous (have multiple senses)
A lemma is polysemous if it has different related senses
e.g. bank: financial institution or building
Lexical Semantics: Homonymy
• Homonyms: different (unrelated) senses with the
same spelling or pronunciation
Remember: bank (financial institution); bank of a river
• There are two types of homonymy:
– Homophones: words that have the same pronunciation
but different meanings and, often, different spellings.
"bare" (without covering) and "bear" (the animal)
"flower" (a bloom) and "flour" (used in baking)
"write" (to put words on paper) and "right" (opposite
of left)
– Homographs: words that have the same spelling but different
meanings and may or may not have the same pronunciation.
"tear" (to rip) and "tear" (to cry)
"lead" (to guide) and "lead" (a heavy metal)
"bass" (low-frequency sound) and "bass" (a type of fish)
Relations between senses: Synonymy
• Synonyms: words that have the same meaning
in some or all contexts
– couch / sofa
– filbert / hazelnut
– big / large
– car / automobile
– water / H2O
– vomit / throw up
– etc.
• Note that there are probably no examples of perfect
synonymy.
Even if many aspects of meaning are identical, words still
may differ based on politeness, slang, register, genre, etc.
e.g. ask & request, inform & tell (politeness)
Cool & awesome, crazy & insane (slang)
Commence & start & begin (register)
Utilize & employ & use (technical writing genre)
Uncover & reveal & discover (creative writing genre)
• Difference in form → difference in meaning
Lexical Semantics: Similarity
• Similar words: Words with similar meanings.
Not synonyms, but sharing some element of meaning
car, bicycle
cow, horse
• A dataset manually created by asking
annotators to score the similarity between
words on a scale from 0 to 10: SimLex-999
Lexical Semantics: Relatedness
• Also called "word association"
• Words can be related in any way, perhaps via
a semantic frame or field
coffee, tea: similar
coffee, cup: related, not similar
• Semantic field:
words that cover a particular semantic domain and are
associated with a specific concept, idea, or theme
• e.g.
surgeon, scalpel, nurse, anaesthetic, hospital
waiter, menu, plate, food, chef
door, roof, kitchen, family, bed
Lexical Semantics: Antonymy
• Antonyms: senses that are opposite with respect to
only one feature of meaning;
otherwise they are very similar!
– dark/light, short/long, fast/slow, hot/cold, rise/fall,
up/down, in/out…
• More formally: antonyms can
– define a binary opposition or be at opposite ends of a
scale:
long/short, fast/slow
– be reversives:
rise/fall, up/down
Lexical Semantics: Hierarchical relations
• Hypernyms & Hyponyms:
• e.g. pet/dog
The hyponym (dog) is more specific than the
hypernym (pet)
• Holonyms and meronyms:
• e.g. car/wheel
The meronym (wheel) is a part of the
holonym (car)
Lexical semantics resources: WordNet
Very large, publicly available lexical database of English:
110K nouns, 11K verbs, 22K adjectives, 4.5K adverbs (WordNets for many other
languages exist or are under construction)
Each word has a POS tag and one or more word senses. Avg. # of senses: 1.23 nouns,
2.16 verbs, 1.41 adj, 1.24 adverbs
Word senses are grouped into synonym sets (“synsets”)
81K noun synsets, 13K verb synsets, 19K adj. synsets, 3.5K adverb synsets
Synsets are connected in a hierarchy/network
defined via conceptual-semantic relations
hypernym/hyponym relation (IS-A)
holonym/meronym relation (HAS-A)
Also lexical relations (derivational morphology), and lemmatization
It is accessible at https://wordnet.princeton.edu/
It can also be used programmatically through NLTK
• WordNet total
155,327 words organized in 175,979 synsets for a total
of 207,016 word-sense pairs; in compressed form, it is
about 12 megabytes in size.
• Example of hierarchical relations from WN
• IS-A relations (hyponymy):
Hypernym/hyponym (between concepts)
meal is a hypernym (superordinate) of breakfast
breakfast is a hyponym (subordinate) of meal
dog is a hypernym (superordinate) of poodle
poodle is a hyponym (subordinate) of (IS-A) dog
Instance hypernym/hyponym (concepts and instances)
composer is the instance hypernym of (HAS-INSTANCE) Bach
Bach is an instance hyponym of (IS-INSTANCE-OF) composer
• Example of Part-Whole relations (meronymy):
• Member holonym/meronym (groups and members)
crew is a member holonym of (HAS-MEMBER) co-pilot
co-pilot is a member meronym of (IS-MEMBER-OF) crew
• Part holonym/meronym (wholes and parts)
car is a part holonym of (HAS-PART) wheel
wheel is a part meronym of (IS-PART-OF) car
• Substance holonym/meronym (substances and components)
bread is a substance holonym of (HAS-COMPONENT) flour
flour is a substance meronym of (IS-COMPONENT-OF) bread
WordNet-based word similarity
• There have been many attempts to exploit
resources like WordNet to compute word (sense)
similarity
• Classic approaches use the distance (path length)
between synsets (these paths typically only
consider hypernym/hyponym relations), possibly
augmented with corpus statistics
• More recent (neural) approaches aim to learn
(non-Euclidean) embeddings that capture the
hierarchical hypernym/hyponym structure of
WordNet
Word-sense similarity
• Similarity can be interpreted as:
• Synonymy:
sim(couch, sofa) > sim(poodle, dog) >
sim(poodle, pug), …
Do the two words/senses have the same
meaning? (WordNet: synsets are synonyms
(similarity=1), but hypernym/hyponyms
(dog/poodle) are also more similar to each
other than unrelated words)
• Relatedness:
– How related are the two words/senses to each
other?
– coffee and cup are strongly associated, but not
similar
– “Semantic fields”: sets of words that are
topically related (WordNet:
holonyms/meronyms etc. capture some
of this)
Word Similarity examples & problems
• Path length is just the distance between synsets
pathlen(nickel, dime) = 2 (nickel—coin—dime)
pathlen(nickel, money) = 5 (nickel—…—medium of exchange—money)
pathlen(nickel, budget) = 7 (nickel—…—medium of exchange—…—budget)
• But do we really want the following?
pathlen(nickel, coin) < pathlen(nickel, dime)
pathlen(nickel, Richter scale) = pathlen(nickel, budget)
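A minimal sketch of path-length similarity, assuming a tiny hand-made IS-A hierarchy rather than the real WordNet graph. The `hypernym` table and the helper names are hypothetical; the path length is the number of IS-A edges through the closest shared ancestor.

```python
# Hypothetical toy hierarchy: each word points to its hypernym (IS-A parent).
hypernym = {
    "poodle": "dog",
    "pug": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}


def ancestors(word, hypernym):
    """Map the word and each of its ancestors to its distance in IS-A edges."""
    dist = {word: 0}
    while word in hypernym:
        parent = hypernym[word]
        dist[parent] = dist[word] + 1
        word = parent
    return dist


def pathlen(a, b, hypernym):
    """Length of the shortest path between a and b via a shared ancestor."""
    up_a = ancestors(a, hypernym)
    up_b = ancestors(b, hypernym)
    common = set(up_a) & set(up_b)
    if not common:
        return None  # the two words are not connected in this hierarchy
    return min(up_a[node] + up_b[node] for node in common)


print(pathlen("poodle", "dog", hypernym))  # hypernym: one edge
print(pathlen("poodle", "pug", hypernym))  # siblings: two edges via dog
print(pathlen("poodle", "cat", hypernym))  # three edges via mammal
```

The toy data reproduces the concern raised above: a word ends up closer to its hypernym (poodle, dog) than to its sibling (poodle, pug).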
Problems with thesaurus similarity
• We need to have a thesaurus! (not available for all languages)
• We need to have a thesaurus that contains the
words we’re interested in.
• We need a thesaurus that captures a rich
hierarchy of hypernyms and hyponyms.
• Most thesaurus-based similarities depend on the
specifics of the hierarchy that is implemented in
the thesaurus.
Distributional approach to semantics
• Uses large corpora of raw text to learn the meaning
of words from the contexts in which they occur.
• Maps words to (sparse) vectors that capture corpus
statistics
• Contemporary variant: use neural nets to learn
dense vectors (“embeddings”) from very large
corpora
– (this is a prerequisite for most neural approaches
to NLP)
• If each word type is mapped to a single vector, this
ignores the fact that words have multiple senses or
parts of speech
Distributional approach to semantics
• Language understanding requires knowing when
words have similar meanings
• e.g. Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount
Everest is 29029 feet”
“tall” is similar to “height”
• e.g. Plagiarism detection
Which representation will capture similarities?
• As atomic symbols?
[e.g. as in a traditional n-gram language model,
or when we use them as explicit features in a
classifier]
• This is equivalent to very high-dimensional one-hot
vectors: aardvark=[1,0,…,0], bear=[0,1,0,…,0], …
• No: height/tall are as different as height/cat
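The point about atomic symbols can be checked with a small sketch. The five-word vocabulary is a made-up example; any two distinct one-hot vectors have a dot product of zero, so no pair of different words is ever more similar than another.

```python
# Hypothetical five-word vocabulary; each word gets its own dimension.
vocab = ["aardvark", "bear", "cat", "height", "tall"]


def one_hot(word):
    """Vector with a 1 in the word's own position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(dot(one_hot("height"), one_hot("tall")))    # 0: no shared dimension
print(dot(one_hot("height"), one_hot("cat")))     # 0: exactly as "different"
print(dot(one_hot("height"), one_hot("height")))  # 1: only identity matches
```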
What similarities are captured by vector representations?
• Vector representations of words were originally
motivated by attempts to capture lexical semantics
(the meaning of words) so that words that have
similar meanings have similar representations
• These representations may also capture some
morphological or syntactic properties of words
(parts of speech, inflections, stems etc.).
What similarities are captured by vector representations?
• The meaning of a word is represented as a vector
Called an "embedding" because it's embedded into a
space
• It is the standard way to represent meaning in NLP
• Every modern NLP algorithm uses embeddings as
the representation of word meaning
The Distributional Hypothesis
• Vector representations are based on the
distributional hypothesis
• Zellig Harris (1954):
“oculist and eye-doctor … occur in almost the
same environments”
“If A and B have almost identical environments
we say that they are synonyms.”
• John R. Firth (1957):
“You shall know a word by the company it keeps.”
The Distributional Hypothesis
• Context for semantics is used in many ways in NLP
• Distributional similarities (vector-space semantics):
Use the set of all contexts in which words (= word
types) appear to measure their similarity
Assumption: Words that appear in similar contexts
(tea, coffee) have similar meanings.
• Word sense disambiguation:
Use the context of a particular occurrence of a word
(token) to identify which sense it has.
Assumption: If a word has multiple distinct senses
(e.g. plant: factory or green plant), each sense will
appear in different contexts
Distributional Similarities
• Basic idea:
– Measure the semantic similarity of words in terms
of the similarity of the contexts in which they
appear
• How?
– Represent words as vectors such that
each vector element (dimension) corresponds
to a different context
the vector for any particular word captures how
strongly it is associated with each context
– Compute the semantic similarity of words as the
similarity of their vectors.
Distributional Similarities
• Distributional similarities use the set of contexts in
which words appear to measure their similarity
Term-Document matrix (information retrieval)
• We search a collection of N documents
We can represent each word in the vocabulary V as an
N-dimensional vector indicating which documents it appears in.
Conversely, we can represent each document as a |V|-dimensional
vector indicating which words appear in it.
• Finding the most relevant document for a query:
Queries are also (short) documents
Use the similarity of a query’s vector and the
documents’ vectors to compute which document is
most relevant to the query.
• Intuition: Documents are similar to each other if they
contain the same words.
Term-Document matrix (information retrieval)
• A Term-Document Matrix is a 2D table:
Each cell contains the frequency (count) of the
term (word) t in document d: $\mathrm{tf}_{t,d}$
Each column is a vector of counts over words,
representing a document
Each row is a vector of counts over documents,
representing a word
Term-Document matrix (information retrieval)
• Each column vector = a document (Each entry
corresponds to one word in the vocabulary)
• Each row vector = a word (Each entry corresponds
to one document in the corpus)
• Two documents are similar if their vectors are similar
• Two words are similar if their vectors are similar
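The row/column view can be sketched on a toy collection. The three short "documents" and the helper names `word_vector` and `doc_vector` are invented for this illustration; tokenization is plain whitespace splitting.

```python
from collections import Counter

# Hypothetical mini-collection of N = 3 documents.
docs = {
    "d1": "the battle was won by the soldier",
    "d2": "the fool made a good joke",
    "d3": "the soldier fought the battle",
}

doc_ids = sorted(docs)
counts = {d: Counter(text.split()) for d, text in docs.items()}
vocab = sorted({w for c in counts.values() for w in c})

# Term-document matrix: one row per word, one column per document.
matrix = [[counts[d][w] for d in doc_ids] for w in vocab]


def word_vector(word):
    """Row of the matrix: counts of the word over all documents."""
    return matrix[vocab.index(word)]


def doc_vector(doc_id):
    """Column of the matrix: counts of all vocabulary words in one document."""
    j = doc_ids.index(doc_id)
    return [row[j] for row in matrix]


print(word_vector("battle"))   # occurs in d1 and d3
print(word_vector("soldier"))  # same row as "battle"
print(word_vector("fool"))     # occurs only in d2
```

In this toy data "battle" and "soldier" get identical rows, which is the sense in which two words are similar if their vectors are similar.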
Term-Document matrix (information retrieval)
• This model can be adapted to implement a model
of the distributional hypothesis if we treat each
context as a column in our matrix.
Term-Context matrix
• A more common convention is to use a word-word
(term-context) matrix
What is a context?
• There are many different interpretations of context
that yield different kinds of similarities:
• Contexts defined by nearby words:
– How often does w appear near the word drink?
Near = “drink appears within a window of ±k
words of w”, or “drink appears in the same
document/sentence as w”
– This yields fairly broad thematic similarities.
• Contexts defined by grammatical relations:
– How often is (the noun) w used as the subject
(object) of the verb drink? (Requires a parser).
– This gives more fine-grained similarities.
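A minimal sketch of the "nearby words" definition of context, assuming a ±k word window inside each sentence. The three-sentence corpus and the function name `cooccurrence` are made up for the example.

```python
from collections import defaultdict


def cooccurrence(sentences, k=2):
    """Count how often each context word appears within +/- k words of w."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            lo = max(0, i - k)
            hi = min(len(tokens), i + k + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[w][tokens[j]] += 1
    return counts


# Hypothetical toy corpus.
corpus = ["i drink hot tea", "i drink hot coffee", "i park the car"]
counts = cooccurrence(corpus, k=2)

print(dict(counts["tea"]))
print(dict(counts["coffee"]))
print(dict(counts["car"]))
```

Here "tea" and "coffee" share exactly the same contexts (drink, hot), while "car" shares none of them.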
From vectors to similarities
• First idea: dot product
$$\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{N} v_i w_i$$
• However, it has an issue:
• It favors vectors with high values in many
dimensions
• Frequent words (of, the, in) occur with many other
words
• Therefore, their dot product with other words is
high, regardless of meaning
From vectors to similarities
• Alternative idea: cosine (the dot product normalized
by the lengths of the two vectors)
$$\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
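The difference between the two measures can be seen on a few made-up count vectors. The four context dimensions and all counts below are hypothetical: "of" stands in for a frequent word that co-occurs with everything.

```python
import math


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def cosine(v, w):
    """Dot product divided by the product of the two vector lengths."""
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))


# Hypothetical counts over the contexts (drink, cup, engine, road).
tea = [4, 3, 0, 0]
coffee = [8, 6, 0, 0]
of = [30, 30, 30, 30]

print(dot(tea, coffee), dot(tea, of))        # dot product prefers "of"
print(cosine(tea, coffee), cosine(tea, of))  # cosine prefers "coffee"
```

Because cosine ignores vector length, the frequent word no longer wins just by having large counts in every dimension.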
Frequent words vs. important words
• Raw frequency can be a bad representation
• The co-occurrence matrices we have seen
represent each cell by word frequencies.
• Frequency is useful: if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, of are not
very informative about the context
• Solution: weight words (so that the frequent ones
receive lower scores)
Frequent words vs. important words
• Two common weighting approaches:
tf-idf: frequent words have low idf
Point-wise Mutual Information (PMI): need to determine
the probabilities p(w), p(c) and p(w, c)
Frequent words vs. important words
• tf-idf:
Term frequency: $\mathrm{tf}_{t,d} = \mathrm{count}(t,d)$
In practice, instead of using the raw count:
$\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t,d)+1)$
Document frequency: $\mathrm{df}_t$ is the number of
documents where the term t occurs.
“Romeo” is very special for one Shakespeare play
Inverse Document Frequency:
$$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$$
N is the total number of documents
tf-idf weight:
$$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$
In practice, a document can be anything: we
often call each paragraph a document!
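A small sketch of tf-idf weighting, using the log-scaled term frequency and the usual idf definition log10(N / df). The three toy documents are invented for the example, and `idf` is only meant to be called on words that occur in at least one document.

```python
import math
from collections import Counter

# Hypothetical collection of N = 3 tokenized documents.
docs = [
    "the romeo and the juliet".split(),
    "the fool and the king".split(),
    "the king and the queen".split(),
]
N = len(docs)
counts = [Counter(doc) for doc in docs]


def tf(t, d):
    """Log-scaled term frequency of term t in document number d."""
    return math.log10(counts[d][t] + 1)


def df(t):
    """Number of documents that contain term t."""
    return sum(1 for c in counts if t in c)


def idf(t):
    """Inverse document frequency (t must occur in at least one document)."""
    return math.log10(N / df(t))


def tfidf(t, d):
    return tf(t, d) * idf(t)


print(tfidf("the", 0))    # in every document, so idf is 0
print(tfidf("romeo", 0))  # in only one document, so it gets a positive weight
print(tfidf("king", 1))
```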
Frequent words vs. important words
• Point-wise Mutual Information (PMI): do the word w and
the context c co-occur more often than if they were
independent?
$$\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$
PMI ranges from -inf to +inf
In practice, we just use positive PMI values,
and replace all negative PMI values with 0
Therefore, we use Positive PMI:
$$\mathrm{PPMI}(w, c) = \max\left(\log_2 \frac{P(w, c)}{P(w)\,P(c)},\ 0\right)$$
• PMI is biased toward infrequent events: Very
rare words have very high PMI values
A solution is to use add-1 smoothing
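A minimal PPMI sketch over a tiny word-context count table, assuming the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))) with probabilities estimated from relative counts. The counts are made up, no smoothing is applied, and a zero count is mapped directly to a PPMI of 0.

```python
import math

# Hypothetical word-context co-occurrence counts.
counts = {
    "apricot": {"sugar": 3, "data": 0},
    "computer": {"sugar": 1, "data": 4},
}
total = sum(sum(row.values()) for row in counts.values())


def ppmi(w, c):
    """Positive PMI of word w and context c, estimated from the count table."""
    p_wc = counts[w][c] / total
    p_w = sum(counts[w].values()) / total
    p_c = sum(row[c] for row in counts.values()) / total
    if p_wc == 0:
        return 0.0  # log of 0 is undefined; PPMI clips such cells to 0
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)


print(ppmi("apricot", "sugar"))   # co-occur more often than chance
print(ppmi("computer", "sugar"))  # less often than chance, clipped to 0
print(ppmi("computer", "data"))
```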
Word embeddings
• tf-idf (or PMI) vectors are:
long (length |V|= 20,000 to 50,000)
sparse (most elements are zero)
• Alternative: learn vectors which are:
short (length 50-1000)
dense (most elements are non-zero)
Word embeddings
• A (static) word embedding is a function that maps
each word type to a single vector
These vectors are typically dense and have
much lower dimensionality than the size of the
vocabulary (fewer dimensions than tf-idf/PPMI)
This mapping function typically ignores that the
same string of letters may have different senses
(dining table vs. a table of contents) or parts of
speech (to table a motion vs. a table)
Methods for embeddings
• “Neural Language Model”-inspired models:
Word2vec (skipgram, CBOW), GloVe
• Singular Value Decomposition (SVD)
• Alternative to these "static embeddings":
Contextual Embeddings (ELMo, BERT)
Compute distinct embeddings for a word in its context
Separate embeddings for each token of a word
Methods for embeddings
• Word2vec (Mikolov et al 2013)
• GloVe (Pennington, Socher, Manning, 2014)
• Very popular and influential
• Simple and fast
• Two ways to think about Word2Vec:
a simplification of neural language models
a binary logistic regression classifier
• Instead of counting how often each word w occurs
near "apricot"
– Train a classifier on a binary prediction task: Is w
likely to show up near "apricot"?
• We don’t actually care about this task
– But we'll take the learned classifier weights as the
word embeddings
• Big idea: self-supervision:
– A word c that occurs near apricot in the corpus
acts as the gold "correct answer" for supervised
learning
– No need for human labels
• The parameters of that classifier provide a dense
vector representation of the target word
• Words that appear in similar contexts (that have
high distributional similarity) will have very similar
vector representations.
• These models can be trained on large amounts of
raw text (and pre-trained embeddings can be
downloaded)
Word2vec: skip-gram with negative sampling
• Train a binary classifier that decides whether a target word t
appears in the context of other words c1..k
Context: the set of k words near (surrounding) t
Treat the target word t and any word that actually appears
in its context in a real corpus as positive examples
Treat the target word t and randomly sampled words that
don’t appear in its context as negative examples
Train a binary logistic regression classifier to distinguish
these cases
The weights of this classifier depend on the similarity of t
and the words in c1..k
Use the learned weights of this classifier as embeddings for t
Skip-gram training
• Given a tuple (t, c) = target, context
(apricot, jam)
(apricot, aardvark)
• where some context words c are from real data
(jam) and others (aardvark) are randomly sampled
from the vocabulary…
• … decide whether c is a real context word for the
target t (a positive example):
• c is real if:
P(D = + ∣ t, c) > P(D = - ∣ t, c) = 1 − P(D = + ∣ t, c)
Skip-gram training
• How to compute P(D = + ∣ t, c)?
• Intuition:
Words are likely to appear near similar words
• Idea:
Model similarity with a dot-product of vectors:
Similarity(t, c) = f(t · c)
• Problem:
The dot product is not a probability!
(Neither is cosine)
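Skip-gram training
• To turn the dot product into a probability, we use
the sigmoid function σ(x):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
$$P(D = + \mid t, c) = \sigma(\mathbf{t} \cdot \mathbf{c}) = \frac{1}{1 + e^{-\mathbf{t} \cdot \mathbf{c}}}$$
$$P(D = - \mid t, c) = 1 - P(D = + \mid t, c) = \sigma(-\mathbf{t} \cdot \mathbf{c})$$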
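As a sketch of how the sigmoid turns a dot product into P(D = + | t, c), the example below uses hand-picked two-dimensional vectors. The vector values are hypothetical and chosen only so that "jam" points in a similar direction to "apricot" while "aardvark" does not.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def p_positive(t, c):
    """P(D = + | t, c): probability that c is a real context word of t."""
    return sigmoid(dot(t, c))


def p_negative(t, c):
    """P(D = - | t, c) = 1 - P(D = + | t, c)."""
    return 1.0 - p_positive(t, c)


# Hypothetical embeddings.
apricot = [1.0, 2.0]
jam = [1.5, 1.0]
aardvark = [-2.0, -0.5]

print(p_positive(apricot, jam))       # dot product 3.5, probability near 1
print(p_positive(apricot, aardvark))  # dot product -3.0, probability near 0
```

With these values the decision rule P(D = + | t, c) > P(D = - | t, c) accepts (apricot, jam) and rejects (apricot, aardvark).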
Skip-gram training
• Training sentence:
... lemon, a [tablespoon of apricot jam a] pinch ...
c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a
• Training data: input/output pairs centering on apricot.
Assume a +/- 2 word window
• Positive examples (D=+):
(apricot, tablespoon), (apricot, of), (apricot, jam),
(apricot, a)
• Negative examples (D=-):
(apricot, aardvark), (apricot, puddle)…
• For each positive example, sample k noise words
Skip-gram training
• Negative sampling:
Where do we get D=- from?
Lots of options:
Word2Vec: for each good pair (w, c), sample
k words as negative examples
Words can be sampled according to corpus
frequency or according to smoothed variant
where $\mathrm{freq}'(w) = \mathrm{freq}(w)^{0.75}$
(This gives more weight to rare words)
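The construction of training pairs can be sketched as follows, assuming a ±2 window and the smoothed unigram distribution with exponent 0.75. The function names are hypothetical, the "corpus" is just the one training sentence, and skipping both words of the positive pair when drawing noise words is a simplification made for this sketch.

```python
import random
from collections import Counter


def positive_pairs(tokens, window=2):
    """Pair every target word with each word at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def noise_distribution(tokens, alpha=0.75):
    """Unigram distribution with frequencies raised to the power alpha."""
    weights = {w: f ** alpha for w, f in Counter(tokens).items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


def sample_negatives(target, context, dist, k, rng):
    """Draw k noise words for one positive pair, skipping the pair's own words."""
    words = [w for w in dist if w not in (target, context)]
    probs = [dist[w] for w in words]
    return [(target, w) for w in rng.choices(words, weights=probs, k=k)]


tokens = "lemon a tablespoon of apricot jam a pinch".split()
apricot_pairs = [p for p in positive_pairs(tokens) if p[0] == "apricot"]
dist = noise_distribution(tokens)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = sample_negatives("apricot", "jam", dist, k=2, rng=rng)

print(apricot_pairs)
print(negatives)
```

In this sentence "a" occurs twice, so its raw probability is 2/8; after smoothing it drops below that, and each word that occurs once rises slightly above 1/8.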
Skip-gram training
• Assume that t and c are represented as vectors t,
c, so that their dot product t · c captures their
similarity
• Use logistic regression to predict whether the pair
(t, c) (target t and context word c), is a positive or
negative example:
$$P(D = + \mid t, c) = \sigma(\mathbf{t} \cdot \mathbf{c}), \qquad P(D = - \mid t, c) = 1 - \sigma(\mathbf{t} \cdot \mathbf{c})$$
• Skip-Gram learns two (sets of) vectors (i.e. two
embedding matrices):
– target embeddings/vectors t and context
embeddings/vectors c
Skip-gram training objective
• Find a model that maximizes the log-likelihood of
the training data D+ ∪ D-:
$$L(\theta) = \sum_{(t,c) \in D^{+}} \log P(D = + \mid t, c) + \sum_{(t,c) \in D^{-}} \log P(D = - \mid t, c)$$
• This forces the target and context embeddings of
positive examples to be similar ...
• … and the target and context embeddings of
negative examples to be dissimilar
• All words appear with positive and negative
examples
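A minimal sketch of this objective for a single target word, with one gradient-ascent step on the log-likelihood. It assumes P(D = + | t, c) = σ(t · c); the starting vectors, the learning rate, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_likelihood(t, positives, negatives):
    """Sum of log P(D=+|t,c) over positives and log P(D=-|t,c) over negatives."""
    ll = sum(math.log(sigmoid(dot(t, c))) for c in positives)
    ll += sum(math.log(sigmoid(-dot(t, c))) for c in negatives)
    return ll


def gradient_step(t, positives, negatives, lr=0.1):
    """One gradient-ascent step on the target and all context vectors."""
    grad_t = [0.0] * len(t)
    new_pos, new_neg = [], []
    for c in positives:
        g = 1.0 - sigmoid(dot(t, c))  # derivative of log sigmoid(x) at x = t.c
        grad_t = [gt + g * ci for gt, ci in zip(grad_t, c)]
        new_pos.append([ci + lr * g * ti for ci, ti in zip(c, t)])
    for c in negatives:
        g = -sigmoid(dot(t, c))  # derivative of log sigmoid(-x) at x = t.c
        grad_t = [gt + g * ci for gt, ci in zip(grad_t, c)]
        new_neg.append([ci + lr * g * ti for ci, ti in zip(c, t)])
    new_t = [ti + lr * gt for ti, gt in zip(t, grad_t)]
    return new_t, new_pos, new_neg


# Hypothetical starting vectors: one target, one positive and one negative context.
t = [0.1, 0.2]
pos = [[0.3, -0.1]]
neg = [[0.2, 0.4]]

before = log_likelihood(t, pos, neg)
t2, pos2, neg2 = gradient_step(t, pos, neg)
after = log_likelihood(t2, pos2, neg2)
print(before, after)
```

After the step the target is slightly closer to its positive context and slightly further from the negative one, so the log-likelihood goes up.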
Summary of Skip-gram training
• For a vocabulary of size V: Start with 2*V random
vectors (typically 300-dimensional) as initial
embeddings
• Train a logistic regression classifier to distinguish
words that co-occur in the corpus from those that
don't
Pairs of words that co-occur are positive
examples
Pairs of words that don't co-occur are
negative examples
• During training, target and context vectors of
positive examples will become similar, and those
of negative examples will become dissimilar.
• This returns two embedding matrices T and C,
where each word in the vocabulary is mapped to
a 300-dim. vector, in both T and C.
• It's common to just add them together,
representing word i as the vector t_i + c_i
• Learning is performed using stochastic gradient
descent
Properties of the embeddings
Small windows (C = +/- 2): nearest words
are syntactically similar words in the same
taxonomy
Hogwarts nearest neighbors are other fictional
schools: Sunnydale, Evernight, Blandings
Large windows (C = +/- 5): nearest words
are related words in the same semantic field
Hogwarts nearest neighbors are Harry Potter
world: Dumbledore, half-blood, Malfoy
Properties of the embeddings
Analogy: Embeddings capture
relational meaning
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
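The analogy computation can be sketched with hand-made vectors. The two dimensions (roughly "royalty" and "gender") and all values are hypothetical and chosen so the arithmetic works out exactly; real learned embeddings only satisfy such relations approximately.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def analogy(a, b, c, emb):
    """Word whose vector is closest (by cosine) to emb[a] - emb[b] + emb[c]."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]  # skip the input words
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Hypothetical 2-dimensional embeddings: (royalty, gender).
emb = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

print(analogy("king", "man", "woman", emb))
print(analogy("queen", "woman", "man", emb))
```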