7 Vector Semantics: Definition, Applications

Words again
• Structure of words → Morphology
• Distribution of words → Language modeling
• Now we start looking at the meaning of words
  – Also called lexical semantics
• We will mainly focus on vectorial word representations

Outline
• Lexical semantics
• Distributional word similarities
• Word embeddings

Word Representation
• How is the meaning of a word represented for computational models?

What do words mean?
• In all past models (LM, classification, …), words are character strings
• At higher levels, when we focus on meaning, we have two possibilities:
  – Formal semantics: a block of text is converted into a formula in a logical language, e.g. First-Order Logic (FOL) representations
    ● Predicates: primarily verbs, VPs, and sentences; sometimes nouns and NPs
    ● Arguments: primarily nouns, nominals, NPs, and PPs
  – Capturing the actual meaning of words: words are converted into a conceptual representation centered around their meaning

FOL representation
• It allows for:
  – The analysis of truth conditions
    ● Lets us answer yes/no questions
  – The use of variables
    ● Lets us answer questions through variable binding
  – Inference
    ● Lets us answer questions that go beyond what we know explicitly

FOL representation
• Allows logical inferences:
  – ∀x MAN(x) ⟶ MORTAL(x)
  – MAN(socrates)
  – Therefore: MORTAL(socrates)
• Important for inference in well-defined domains:
  – A knowledge base is constructed manually
  – Automated inference techniques draw conclusions from the given facts

Meaning-centered representations
• Roughly speaking, there are two approaches:
  – The lexicographic tradition aims to capture the information represented in lexicons, dictionaries, etc.
  – The distributional tradition aims to capture the meaning of words based on large amounts of raw text

LEXICAL SEMANTICS

Lexical semantics
• Uses resources such as lexicons, thesauri, ontologies, etc. that capture explicit knowledge about word meanings
  – Example resource: WordNet
• Assumes words have discrete word senses:
  – bank1 = financial institution; bank2 = river bank, etc.
• May capture explicit relations between words (senses): a “dog” is a “mammal”, “cars” have “wheels”, etc.
• Word representations should reflect word meaning and relationships to other words:
  – Which words have similar meaning
  – Which words have opposite meaning
  – Which words have positive or negative connotations
  – Etc.

Lexical Semantics: Lemmas and Senses
• A lexicon entry pairs a lemma with its senses, e.g. mouse (N):
  1. any of numerous small rodents...
  2. a hand-operated device that controls a cursor...
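The mouse entry above can be reproduced programmatically. Below is a minimal sketch using NLTK's interface to WordNet (the resource introduced later in these slides); it assumes nltk is installed and that the WordNet data can be downloaded.

```python
# Minimal sketch (assumes nltk and a downloadable WordNet corpus):
# list the noun senses (synsets) of the lemma "mouse" with their glosses.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for synset in wn.synsets("mouse", pos=wn.NOUN):
    # Each synset is one sense of the lemma, with a definition (gloss).
    print(synset.name(), "-", synset.definition())
```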
Lexical Semantics: Senses
(Modified from the online thesaurus WordNet)
• A sense (or concept) is the meaning component of a word
• Lemmas can be polysemous (have multiple senses)
  ● A lemma is polysemous if it has different but related senses
  ● e.g. bank: the financial institution or the building it occupies

Lexical Semantics: Homonymy
• Homonyms: different (unrelated) senses with the same spelling or pronunciation
  – Remember: bank (financial institution) vs. bank of a river
• There are two types of homonymy:
  – Homophones: words that have the same pronunciation but different meanings and, often, different spellings
    ● "bare" (without covering) and "bear" (the animal)
    ● "flower" (a bloom) and "flour" (used in baking)
    ● "write" (to put words on paper) and "right" (opposite of left)
  – Homographs: words that have the same spelling but different meanings and may or may not have the same pronunciation
    ● "tear" (to rip) and "tear" (to cry)
    ● "lead" (to guide) and "lead" (a heavy metal)
    ● "bass" (low-frequency sound) and "bass" (a type of fish)

Relations between senses: Synonymy
• Synonyms: words that have the same meaning in some or all contexts
  – couch / sofa
  – filbert / hazelnut
  – big / large
  – car / automobile
  – water / H2O
  – vomit / throw up
  – etc.
• Note that there are probably no examples of perfect synonymy
  – Even when many aspects of meaning are identical, words may still differ in politeness, slang, register, or genre:
    ● ask / request, inform / tell (politeness)
    ● cool / awesome, crazy / insane (slang)
    ● commence / start / begin (register)
    ● utilize / employ / use (technical writing genre)
    ● uncover / reveal / discover (creative writing genre)
• Difference in form → difference in meaning

Lexical Semantics: Similarity
• Similar words: words with similar meanings; not synonyms, but sharing some element of meaning
  – car, bicycle
  – cow, horse
• SimLex-999: a dataset created manually by asking annotators to score the similarity of word pairs on a scale from 0 to 10

Lexical Semantics: Relatedness
• Also called "word association"
• Words can be related in any way, perhaps via a semantic frame or field
  – coffee, tea: similar
  – coffee, cup: related, not similar
• Semantic field: words that cover a particular semantic domain and are associated with a specific concept, idea, or theme, e.g.
  – hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
  – restaurants: waiter, menu, plate, food, chef
  – houses: door, roof, kitchen, family, bed
Lexical Semantics: Antonymy
• Antonyms: senses that are opposite with respect to only one feature of meaning; otherwise they are very similar!
  – dark/light, short/long, fast/slow, hot/cold, rise/fall, up/down, in/out, …
• More formally, antonyms can:
  – define a binary opposition or be at opposite ends of a scale: long/short, fast/slow
  – be reversives: rise/fall, up/down

Lexical Semantics: Hierarchical relations
• Hypernyms & hyponyms:
  – e.g. pet/dog: the hyponym (dog) is more specific than the hypernym (pet)
• Holonyms & meronyms:
  – e.g. car/wheel: the meronym (wheel) is a part of the holonym (car)

Lexical semantics resources: WordNet
• Very large, publicly available lexical database of English:
  – 110K nouns, 11K verbs, 22K adjectives, 4.5K adverbs
    (WordNets for many other languages exist or are under construction)
• Each word has a POS tag and one or more word senses
  – Avg. # of senses: 1.23 for nouns, 2.16 for verbs, 1.41 for adjectives, 1.24 for adverbs
• Word senses are grouped into synonym sets (“synsets”)
  – 81K noun synsets, 13K verb synsets, 19K adjective synsets, 3.5K adverb synsets
• Synsets are connected in a hierarchy/network defined via conceptual-semantic relations:
  – hypernym/hyponym relation (IS-A)
  – holonym/meronym relation (HAS-A)
• Also includes lexical relations (derivational morphology) and lemmatization
• It is accessible at https://wordnet.princeton.edu/
• It can also be used programmatically via nltk:
  – https://www.nltk.org/howto/wordnet.html

WordNet
• WordNet total:
  ● 155,327 words organized in 175,979 synsets, for a total of 207,016 word-sense pairs; in compressed form it is about 12 megabytes in size
• Check the visualizers:
  ● https://www.visual-thesaurus.com/wordnet.php
  ● https://wordvis.com/about.html

WordNet
• Examples of hierarchical relations from WordNet
• IS-A relations (hyponymy):
  – Hypernym/hyponym (between concepts)
    ● meal is a hypernym (superordinate) of breakfast
    ● breakfast is a hyponym (subordinate) of meal
    ● dog is a hypernym (superordinate) of poodle
    ● poodle is a hyponym (subordinate) of (IS-A) dog
  – Instance hypernym/hyponym (concepts and instances)
    ● composer is the instance hypernym of (HAS-INSTANCE) Bach
    ● Bach is an instance hyponym of (IS-INSTANCE-OF) composer

WordNet
• Examples of part-whole relations (meronymy):
• Member holonym/meronym (groups and members)
  – crew is a member holonym of (HAS-MEMBER) co-pilot
  – co-pilot is a member meronym of (IS-MEMBER-OF) crew
• Part holonym/meronym (wholes and parts)
  – car is a part holonym of (HAS-PART) wheel
  – wheel is a part meronym of (IS-PART-OF) car
• Substance holonym/meronym (substances and components)
  – bread is a substance holonym of (HAS-COMPONENT) flour
  – flour is a substance meronym of (IS-COMPONENT-OF) bread

WordNet-based word similarity
• There have been many attempts to exploit resources like WordNet to compute word (sense) similarities
• Classic approaches use the distance (path length) between synsets (these paths typically consider only hypernym/hyponym relations), possibly augmented with corpus statistics
• More recent (neural) approaches aim to learn (non-Euclidean) embeddings that capture the hierarchical hypernym/hyponym structure of WordNet
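As a concrete illustration of the classic path-length approach, here is a minimal sketch using NLTK's WordNet interface (assuming nltk and the WordNet data, as in the earlier sketch); path_similarity scores two synsets by the length of the shortest hypernym/hyponym path between them.

```python
# Minimal sketch (assumes nltk + WordNet data): path-based synset similarity
# and navigation of the IS-A / HAS-PART relations discussed above.
from nltk.corpus import wordnet as wn

dog, poodle, cat = wn.synset("dog.n.01"), wn.synset("poodle.n.01"), wn.synset("cat.n.01")

# path_similarity = 1 / (1 + length of the shortest hypernym/hyponym path)
print(dog.path_similarity(poodle))            # short path in the hierarchy -> higher score
print(dog.path_similarity(cat))               # longer path -> lower score
print(dog.hypernyms())                        # IS-A parents of the synset
print(wn.synset("car.n.01").part_meronyms())  # HAS-PART relations (wheel, ...)
```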
Word-sense similarity
• Similarity can be interpreted as:
• Synonymy: do the two words/senses have the same meaning?
  – sim(couch, sofa) > sim(poodle, dog) > sim(poodle, pug), …
  – (WordNet: synsets are synonyms (similarity = 1), but hypernyms/hyponyms (dog/poodle) are also more similar to each other than unrelated words)
• Relatedness: how related are the two words/senses to each other?
  – coffee and cup are strongly associated, but not synonyms
  – “Semantic fields”: sets of words that are topically related
    (WordNet: holonyms/meronyms etc. capture some of these associations)

Word similarity: examples & problems
• Path length is just the distance between synsets:
  – pathlen(nickel, dime) = 2 (nickel—coin—dime)
  – pathlen(nickel, money) = 5 (nickel—…—medium of exchange—money)
  – pathlen(nickel, budget) = 7 (nickel—…—medium of exchange—…—budget)
• But do we really want the following?
  – pathlen(nickel, coin) < pathlen(nickel, dime)
  – pathlen(nickel, Richter scale) = pathlen(nickel, budget)

Problems with thesaurus similarity
• We need to have a thesaurus! (not available for all languages)
• We need a thesaurus that contains the words we are interested in
• We need a thesaurus that captures a rich hierarchy of hypernyms and hyponyms
• Most thesaurus-based similarities depend on the specifics of the hierarchy that is implemented in the thesaurus

DISTRIBUTIONAL SEMANTICS

Distributional approach to semantics
• Uses large corpora of raw text to learn the meaning of words from the contexts in which they occur
• Maps words to (sparse) vectors that capture corpus statistics
• Contemporary variant: use neural nets to learn dense vectors (“embeddings”) from very large corpora
  – (this is a prerequisite for most neural approaches to NLP)
• If each word type is mapped to a single vector, this ignores the fact that words have multiple senses or parts of speech

Distributional approach to semantics
• Language understanding requires knowing when words have similar meanings
• e.g. Question answering:
  – Q: “How tall is Mt. Everest?”
  – Candidate A: “The official height of Mount Everest is 29029 feet”
  – “tall” is similar to “height”
• e.g. Plagiarism detection

Which representation will capture similarities?
• As atomic symbols? (e.g. as in a traditional n-gram language model, or when we use words as explicit features in a classifier)
• This is equivalent to very high-dimensional one-hot vectors:
  aardvark = [1,0,…,0], bear = [0,1,0,…,0], …, zebra = [0,…,0,1]
• No: with one-hot vectors, height/tall are as different as height/cat (see the sketch below)
• Better alternatives, covered next: tf-idf weighted count vectors and Word2Vec embeddings
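To make the point concrete, here is a minimal sketch (assuming numpy; the three-word vocabulary is made up) showing that any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0.

```python
# Minimal sketch (assumes numpy): one-hot vectors cannot encode similarity.
import numpy as np

vocab = ["height", "tall", "cat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Every pair of distinct words is orthogonal: cosine similarity 0.
print(cosine(one_hot["height"], one_hot["tall"]))  # 0.0
print(cosine(one_hot["height"], one_hot["cat"]))   # 0.0
```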
What similarities are captured by vector representations?
• Vector representations of words were originally motivated by attempts to capture lexical semantics (the meaning of words), so that words with similar meanings have similar representations
• These representations may also capture some morphological or syntactic properties of words (parts of speech, inflections, stems, etc.)

Word embeddings
• The meaning of a word is represented as a vector
  – Called an "embedding" because it is embedded into a vector space
• This is the standard way to represent meaning in NLP
• Every modern NLP algorithm uses embeddings as the representation of word meaning

The Distributional Hypothesis
• Vector representations are based on the distributional hypothesis
• Zellig Harris (1954):
  – “oculist and eye-doctor … occur in almost the same environments”
  – “If A and B have almost identical environments we say that they are synonyms.”
• John R. Firth (1957):
  – “You shall know a word by the company it keeps.”

The Distributional Hypothesis
• Context is used for semantics in many ways in NLP
• Distributional similarities (vector-space semantics):
  – Use the set of all contexts in which words (= word types) appear to measure their similarity
  – Assumption: words that appear in similar contexts (tea, coffee) have similar meanings
• Word sense disambiguation:
  – Use the context of a particular occurrence of a word (token) to identify which sense it has
  – Assumption: if a word has multiple distinct senses (e.g. plant: factory or green plant), each sense will appear in different contexts

Distributional Similarities
• Basic idea:
  – Measure the semantic similarity of words in terms of the similarity of the contexts in which they appear
• How?
  – Represent words as vectors such that
    ● each vector element (dimension) corresponds to a different context
    ● the vector for any particular word captures how strongly it is associated with each context
  – Compute the semantic similarity of words as the similarity of their vectors

Term-Document matrix (information retrieval)
• We search a collection of N documents
  – We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in
  – Conversely, we can represent each document as a |V|-dimensional vector indicating which words appear in it
• Finding the most relevant document for a query:
  – Queries are also (short) documents
  – Use the similarity of a query’s vector and the documents’ vectors to compute which document is most relevant to the query
• Intuition: documents are similar to each other if they contain the same words

Term-Document matrix (information retrieval)
• A term-document matrix is a 2D table:
  – Each cell contains the frequency (count) of term (word) t in document d: tf(t,d)
  – Each column is a vector of counts over words, representing a document
  – Each row is a vector of counts over documents, representing a word
• Each column vector = a document (each entry corresponds to one word in the vocabulary)
• Each row vector = a word (each entry corresponds to one document in the corpus)
• Two documents are similar if their vectors are similar
• Two words are similar if their vectors are similar
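The following minimal sketch builds a tiny term-document matrix and compares documents by the words they share. It assumes scikit-learn (not mentioned in the slides) and uses made-up toy documents.

```python
# Minimal sketch (assumes scikit-learn): a small term-document matrix, and
# document similarity computed from the shared words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the wit of the fool",          # hypothetical toy "documents"
    "the fool and the battle",
    "a battle and a soldier",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # shape (n_documents, |V|): rows = documents

# X.T is the term-document matrix as defined above: rows = words, columns = documents.
print(vectorizer.get_feature_names_out())
print(X.toarray().T)

# Documents that share many words get a higher cosine similarity.
print(cosine_similarity(X[0], X[1]))  # share "the", "fool" -> nonzero
print(cosine_similarity(X[0], X[2]))  # no shared terms     -> 0
```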
Term-Document matrix (information retrieval)
• This model can be adapted to implement the distributional hypothesis if we treat each context as a column in our matrix

Term-Context matrix
• A more common convention is to use word–word (co-occurrence) frequencies: each cell counts how often the row word occurs in the context of the column word
• What is a context? There are many different interpretations of context that yield different kinds of similarities:
  – Contexts defined by nearby words:
    ● How often does w appear near the word drink?
      Near = “drink appears within a window of ±k words of w”, or “drink appears in the same document/sentence as w”
    ● This yields fairly broad thematic similarities
  – Contexts defined by grammatical relations:
    ● How often is (the noun) w used as the subject (object) of the verb drink? (Requires a parser)
    ● This gives more fine-grained similarities

From vectors to similarities
• First idea: dot product
  – dot-product(v, w) = v · w = v1 w1 + v2 w2 + … + vN wN
• However, it has an issue:
  – It favors vectors with high values in many dimensions
  – Frequent words (of, the, in) occur with many other words
  – Therefore, their dot product with other words is higher
• Alternative idea: cosine
  – cosine(v, w) = (v · w) / (|v| |w|) = Σi vi wi / ( sqrt(Σi vi²) · sqrt(Σi wi²) )
  – i.e. the dot product of the length-normalized vectors

Frequent words vs. important words
• Raw frequency can be a bad representation
• The co-occurrence matrices we have seen represent each cell by word frequencies
• Frequency is useful: if sugar appears a lot near apricot, that's useful information
• But overly frequent words like the, it, of are not very informative about the context
• Solution: weight words (so that the frequent ones receive lower scores)
• Two common weighting approaches:
  – tf-idf: frequent words have low idf
  – Pointwise Mutual Information (PMI): requires estimating the marginal and joint probabilities p(w), p(c), and p(w, c)

Frequent words vs. important words: tf-idf
• Term frequency: tf(t,d) = count(t,d)
  – In practice, instead of the raw count we use: tf(t,d) = log10(count(t,d) + 1)
• Document frequency: df(t) is the number of documents in which the term t occurs
  – Example: “Romeo” is very specific to one Shakespeare document
• Inverse document frequency: idf(t) = log10(N / df(t)), where N is the total number of documents
• tf-idf weight: w(t,d) = tf(t,d) × idf(t)
• In practice, a “document” can be anything: we often call each paragraph a document!
• Example (matrix not reproduced here): a term-document matrix of raw counts over Shakespeare plays vs. the same matrix after tf-idf weighting, where very common words are down-weighted and document-specific words like “Romeo” stand out
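A minimal sketch of these formulas, assuming numpy; the raw counts below are hypothetical stand-ins for the Shakespeare example.

```python
# Minimal sketch (assumes numpy): tf-idf weighting with the formulas from the slides:
#   tf(t, d) = log10(count(t, d) + 1)
#   idf(t)   = log10(N / df(t)),  df(t) = number of documents containing t
#   tf-idf   = tf(t, d) * idf(t)
import numpy as np

# Hypothetical raw counts: rows = terms, columns = documents.
terms = ["romeo", "battle", "good"]
counts = np.array([
    [113,  0,  0,  0],    # "romeo" only occurs in one document
    [  0, 12, 36,  5],
    [ 80, 62, 89, 34],    # "good" occurs everywhere
], dtype=float)

N = counts.shape[1]                 # number of documents
tf = np.log10(counts + 1)           # dampened term frequency
df = (counts > 0).sum(axis=1)       # document frequency per term
idf = np.log10(N / df)              # inverse document frequency
tfidf = tf * idf[:, None]           # weight each row by its idf

print(dict(zip(terms, idf)))        # "romeo" gets a high idf, "good" gets idf 0
print(np.round(tfidf, 2))
```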
Frequent words vs. important words: PMI
• Pointwise Mutual Information between a word w and a context word c:
  – PMI(w, c) = log2( P(w, c) / ( P(w) P(c) ) )
• PMI ranges from −∞ to +∞
• In practice, we use only positive PMI values and replace all negative values with 0; this gives Positive PMI:
  – PPMI(w, c) = max( PMI(w, c), 0 )
• Example (table not reproduced here): a word–context count matrix is converted to joint and marginal probabilities, and each cell is then replaced by its PPMI value
• PMI is biased toward infrequent events: very rare words get very high PMI values
  – A solution is to use add-1 smoothing of the counts
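A minimal sketch of PPMI weighting, assuming numpy; the word–context counts are made up.

```python
# Minimal sketch (assumes numpy): PPMI weighting of a word-context count matrix.
#   PMI(w, c)  = log2( P(w, c) / (P(w) * P(c)) )
#   PPMI(w, c) = max(PMI(w, c), 0)
import numpy as np

words    = ["apricot", "digital"]
contexts = ["sugar", "data", "computer"]
counts = np.array([   # hypothetical co-occurrence counts
    [4, 0, 1],        # apricot
    [0, 6, 5],        # digital
], dtype=float)

total = counts.sum()
p_wc = counts / total                   # joint probabilities P(w, c)
p_w  = p_wc.sum(axis=1, keepdims=True)  # marginal P(w)
p_c  = p_wc.sum(axis=0, keepdims=True)  # marginal P(c)

with np.errstate(divide="ignore"):      # log2(0) -> -inf, clipped to 0 below
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)               # keep only positive associations

print(np.round(ppmi, 2))
```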
Word embeddings
• tf-idf (or PPMI) vectors are:
  – long (length |V| = 20,000 to 50,000)
  – sparse (most elements are zero)
• Alternative: learn vectors which are:
  – short (length 50–1000)
  – dense (most elements are non-zero)

Word embeddings
• A (static) word embedding is a function that maps each word type to a single vector
  – These vectors are typically dense and have much lower dimensionality than the size of the vocabulary (fewer dimensions than tf-idf/PPMI vectors)
  – This mapping function typically ignores that the same string of letters may have different senses (dining table vs. a table of contents) or parts of speech (to table a motion vs. a table)

Methods for embeddings
• “Neural language model”-inspired models: Word2vec (skip-gram, CBOW), GloVe
• Singular Value Decomposition (SVD)
• Alternative to these "static embeddings":
  – Contextual embeddings (ELMo, BERT)
  – Compute distinct embeddings for a word in its context
  – Separate embeddings for each token of a word
• Word2vec (Mikolov et al., 2013): https://code.google.com/archive/p/word2vec/
• GloVe (Pennington, Socher, Manning, 2014): http://nlp.stanford.edu/projects/glove/

Word2vec
• Very popular and influential
• Simple and fast
• Two ways to think about Word2vec:
  – a simplification of neural language models
  – a binary logistic regression classifier
• Instead of counting how often each word w occurs near "apricot":
  – Train a classifier on a binary prediction task: is w likely to show up near "apricot"?
• We don’t actually care about this task
  – But we take the learned classifier weights as the word embeddings
• Big idea: self-supervision
  – A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
  – No need for human labels
• The parameters of the classifier provide a dense vector representation of the target word (its embedding)
• Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations
• These models can be trained on large amounts of raw text (and pre-trained embeddings can be downloaded)

Word2vec: skip-gram with negative sampling
• Train a binary classifier that decides whether a target word t appears in the context of other words c1..ck
  – Context: the set of k words near (surrounding) t
  – Treat the target word t and any word that actually appears in its context in a real corpus as positive examples
  – Treat the target word t and randomly sampled words that don’t appear in its context as negative examples
  – Train a binary logistic regression classifier to distinguish these cases
  – The weights of this classifier depend on the similarity of t and the words in c1..ck
  – Use the learned weights of this classifier as embeddings for t

Skip-gram training
• Given a tuple (t, c) = (target, context), e.g. (apricot, jam), (apricot, aardvark)
• where some context words c are from real data (jam) and others (aardvark) are randomly sampled from the vocabulary…
• … decide whether c is a real context word for the target t (a positive example):
  – c is real if: P(D = + | t, c) > P(D = − | t, c) = 1 − P(D = + | t, c)
• How to compute P(D = + | t, c)?
  – Intuition: words are likely to appear near similar words
  – Idea: model similarity with a dot product of vectors: similarity(t, c) = f(t · c)
  – Problem: the dot product is not a probability! (Neither is the cosine)
• To turn the dot product into a probability, we use the sigmoid function:
  – σ(x) = 1 / (1 + e^(−x))
  – P(D = + | t, c) = σ(t · c)

Skip-gram training
• Training sentence: … lemon, a tablespoon of apricot jam a pinch …
  – with a ±2 word window: c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a
• Training data: input/output pairs centering on apricot
• Positive examples (D = +): (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
• Negative examples (D = −): (apricot, aardvark), (apricot, puddle), …
  – for each positive example, sample k noise words
• Negative sampling: where do we get D = − from?
  – Lots of options; Word2vec: for each good pair (w, c), sample k words as negative examples
  – Words can be sampled according to corpus frequency, or according to a smoothed variant where freq′(w) = freq(w)^0.75 (this gives more weight to rare words)

Skip-gram training
• Assume that t and c are represented as vectors t and c, so that their dot product t · c captures their similarity
• Use logistic regression to predict whether the pair (t, c) (target t and context word c) is a positive or negative example:
  – P(D = + | t, c) = σ(t · c) = 1 / (1 + e^(−t·c))
  – P(D = − | t, c) = 1 − σ(t · c) = σ(−t · c)
• Skip-gram learns two (sets of) vectors (i.e. two matrices):
  – target embeddings/vectors t and context embeddings/vectors c

Skip-gram training objective
• Find a model that maximizes the log-likelihood of the training data D+ ∪ D−:
  – L = Σ(t,c)∈D+ log σ(t · c) + Σ(t,c)∈D− log σ(−t · c)
• This forces the target and context embeddings of positive examples to be similar…
• … and the target and context embeddings of negative examples to be dissimilar
• All words appear with both positive and negative contexts

Summary of skip-gram training
• For a vocabulary of size V: start with 2·V random vectors (typically 300-dimensional) as initial embeddings
• Train a logistic regression classifier to distinguish words that co-occur in the corpus from those that don’t
  – Pairs of words that co-occur are positive examples
  – Pairs of words that don't co-occur are negative examples
• During training, target and context vectors of positive examples become similar, and those of negative examples become dissimilar
• Learning is performed using SGD
• This returns two embedding matrices T and C, where each word in the vocabulary is mapped to a 300-dim. vector in both T and C
  – It is common to just add them together, representing word i as the vector wi + ci
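To tie the pieces together, here is a minimal sketch of this training loop, assuming numpy. The toy corpus, dimensions, and hyperparameters are made up, negatives are sampled uniformly for simplicity (rather than with the freq^0.75 distribution mentioned above), and this is not the optimized word2vec implementation, just the idea from the slides.

```python
# Minimal sketch (assumes numpy): skip-gram with negative sampling as binary
# logistic regression over dot products, trained with SGD.
import numpy as np

rng = np.random.default_rng(0)
corpus = "a tablespoon of apricot jam a pinch of salt".split()   # toy corpus
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}

dim, window, k, lr = 10, 2, 2, 0.05        # embedding size, +/- window, negatives, step
T = rng.normal(scale=0.1, size=(len(vocab), dim))   # target embeddings
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, word in enumerate(corpus):
        t = w2i[word]
        start, end = max(0, pos - window), min(len(corpus), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos == pos:
                continue
            c_pos = w2i[corpus[ctx_pos]]
            # one positive example plus k uniformly sampled negatives
            samples = [(c_pos, 1.0)] + [(int(rng.integers(len(vocab))), 0.0) for _ in range(k)]
            for c, label in samples:
                score = sigmoid(T[t] @ C[c])       # P(D = + | t, c)
                grad = score - label               # gradient of the logistic loss
                g_t, g_c = grad * C[c], grad * T[t]
                T[t] -= lr * g_t                   # SGD updates on both matrices
                C[c] -= lr * g_c

embeddings = T + C          # common choice: add target and context vectors
```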
Properties of the embeddings
• Small windows (C = ±2): nearest words are syntactically similar words in the same taxonomy
  – Hogwarts' nearest neighbors are other fictional schools: Sunnydale, Evernight, Blandings
• Large windows (C = ±5): nearest words are related words in the same semantic field
  – Hogwarts' nearest neighbors are Harry Potter-world terms: Dumbledore, half-blood, Malfoy

Properties of the embeddings
• Analogy: embeddings capture relational meaning
  – vector(‘king’) − vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
  – vector(‘Paris’) − vector(‘France’) + vector(‘Italy’) ≈ vector(?)
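The analogy arithmetic can be tried directly on pre-trained vectors. Here is a minimal sketch using the gensim library and its downloader (an assumption; the slides do not prescribe a toolkit, and any set of pre-trained embeddings would do). It needs an internet connection the first time to fetch the vectors.

```python
# Minimal sketch (assumes gensim and a one-time download of pre-trained GloVe vectors):
# analogy via vector arithmetic on word embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # 50-dim GloVe embeddings (lowercased vocab)

# Roughly: find the word whose vector is closest (by cosine) to
# vector('king') - vector('man') + vector('woman').
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```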