Indexing and Searching Timed Media: the Role of Mutual Information Models
Tony Davis (StreamSage/Comcast)
IPAM, UCLA, 1 Oct. 2007

A bit about what we do
- StreamSage (now a part of Comcast) focuses on indexing and retrieval of "timed" media (video and audio, aka "multimedia" or "broadcast media").
- A variety of applications, now centered on cable TV.
- This is joint work with many members of the research team: Shachi Dave, Abby Elbow, David Houghton, I-Ju Lai, Hemali Majithia, Phil Rennert, Kevin Reschke, Robert Rubinoff, Pinaki Sinha, and Goldee Udani.

Overview
- Theme: the use of term association models to address the challenges of timed media.
- Problems addressed:
  - Retrieving all and only the relevant portions of timed media for a query
  - Lexical semantics (word sense disambiguation, term expansion, compositionality of multi-word units)
  - Ontology enhancement
  - Topic detection, clustering, and similarity among documents

Automatic indexing and retrieval of streaming media
- Streaming media presents particular difficulties:
  - Its timed nature makes navigation cumbersome, so the system must extract relevant intervals of documents rather than present a list of documents to the user.
  - Speech recognition is error-prone, so the system must compensate for noisy input.
- We use a mix of statistical and symbolic NLP. Various modules factor into calculating the relevant intervals for each term:
  1. Word sense disambiguation
  2. Query expansion
  3. Anaphor resolution
  4. Name recognition
  5. Topic segmentation and identification

Statistical and NLP Foundations: the COW and YAKS Models
- Two types of models:
  - COW (Co-Occurring Words): based on proximity of terms, using a sliding window
  - YAKS (Yet Another Knowledge Source): based on grammatical relationships between terms

The COW Model
- A large mutual information model of word co-occurrence:

  MI(X,Y) = P(X,Y) / (P(X) P(Y))

- Thus, COW values greater than 1 indicate correlation (a tendency to co-occur); values less than 1 indicate anticorrelation.
- Values are adjusted for centrality (salience).
- Two main COW models:
  - New York Times, based on 325 million words (about 6 years) of text
  - Wikipedia, more recent, roughly the same amount of text
- We also have specialized COW models (for medicine, business, and others), as well as models for other languages.

The COW (Co-Occurring Words) Model
- The COW model is at the core of what we do:
  - Relevance interval construction
  - Document segmentation and topic identification
  - Word sense disambiguation (large-scale and unsupervised, based on clustering the COWs of an ambiguous word)
  - Ontology construction
  - Determining the semantic relatedness of terms
  - Determining the specificity of a term

The COW (Co-Occurring Words) Model
- An example: the top 10 COWs of railroad//N

  post//J        165.554
  shipper//N     123.568
  freight//N     121.375
  locomotive//N  119.602
  rail//N         73.7922
  railway//N      64.6594
  commuter//N     63.4978
  pickups//N      48.3637
  train//N        44.863
  burlington//N   41.4952
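As an illustration of how a COW-style table can be built, here is a minimal sketch that counts term co-occurrences in a sliding window and scores each pair with the MI ratio above. It is not the StreamSage implementation: the window size, the normalization by window counts, and the omission of the centrality (salience) adjustment are all assumptions for illustration.

```python
# Minimal sketch of a COW-style co-occurrence model: count how often term pairs
# fall within a sliding window, then score each pair with the MI ratio
# P(x,y) / (P(x)P(y)).  Names (build_cow, WINDOW) are illustrative; the real
# system also adjusts values for centrality/salience, which is omitted here.
from collections import Counter
from itertools import combinations

WINDOW = 10  # assumed window size; the actual size is not given in the talk

def build_cow(sentences):
    """sentences: iterable of token lists (already lemmatized/POS-filtered)."""
    term_counts = Counter()
    pair_counts = Counter()
    windows = 0
    for tokens in sentences:
        for start in range(max(1, len(tokens) - WINDOW + 1)):
            window = tokens[start:start + WINDOW]
            windows += 1
            term_counts.update(set(window))
            for x, y in combinations(sorted(set(window)), 2):
                pair_counts[(x, y)] += 1
    cow_table = {}
    for (x, y), joint in pair_counts.items():
        p_xy = joint / windows
        p_x = term_counts[x] / windows
        p_y = term_counts[y] / windows
        cow_table[(x, y)] = p_xy / (p_x * p_y)   # >1 correlation, <1 anticorrelation
    return cow_table

# Example:
# cow = build_cow([["railroad", "freight", "shipper"], ["railroad", "commuter", "train"]])
# print(sorted(cow.items(), key=lambda kv: -kv[1])[:10])
```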
Multi-Word Units (MWUs)
- Why do we care about MWUs? Because they act like single words in many cases, but also:
  - MWUs are often powerful disambiguators of the words within them (see, e.g., Yarowsky (1995) and Pedersen (2002) for WSD methods that exploit this):
    'fuel tank', 'fish tank', 'tank tread'
    'indoor pool', 'labor pool', 'pool table'
  - They are useful in query expansion:
    'Dept. of Agriculture' / 'Agriculture Dept.'
    'hookworm in dogs' / 'canine hookworm'
  - They provide many terms that can be added to ontologies:
    'commuter railroad', 'peanut butter'

Multi-Word Units (MWUs) in our models
- We extract nominal MWUs, using a simple procedure based on POS-tagging. The patterns are:
  1. ({N,J}) ({N,J}) N
  2. N Prep ('the') ({N,J}) N
     where Prep is 'in', 'on', 'of', 'to', 'by', 'with', 'without', 'for', or 'against'
- For the most common 100,000 or so MWUs in our corpus, we calculate COW values, just as we do for words. (A toy sketch of the patterns follows.)
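For concreteness, here is a toy sketch of the two extraction patterns, run over a POS-tagged sentence. The collapsed tag set, the helper names, and the regex encoding are illustrative assumptions, not the production extractor.

```python
# Toy sketch of the two nominal MWU patterns, run over a POS-tagged sentence
# given as (token, tag) pairs with tags collapsed to "N" (noun), "J" (adjective),
# "P" (one of the listed prepositions), "D" (the word "the"), or "O" (other).
# This illustrates the patterns only; it is not the production extractor.
import re

PREPS = {"in", "on", "of", "to", "by", "with", "without", "for", "against"}

# Pattern 1: ({N,J}) ({N,J}) N         e.g. "commuter railroad", "peanut butter"
# Pattern 2: N Prep ('the') ({N,J}) N  e.g. "hookworm in dogs", "Dept. of Agriculture"
PATTERNS = [re.compile(r"[NJ]?[NJ]?N"), re.compile(r"NPD?[NJ]?N")]

def extract_mwus(tagged):
    tokens = [tok for tok, _ in tagged]
    tagstr = "".join(tag for _, tag in tagged)
    mwus = set()
    for pat in PATTERNS:
        for i in range(len(tagstr)):
            m = pat.match(tagstr, i)
            if m and m.end() - i >= 2:          # keep only multi-word matches
                mwus.add(" ".join(tokens[i:m.end()]))
    return mwus

sent = [("hookworm", "N"), ("in", "P"), ("dogs", "N"), ("and", "O"),
        ("commuter", "N"), ("railroad", "N")]
print(extract_mwus(sent))   # {'commuter railroad', 'hookworm in dogs'}
```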
COWs of MWUs
- An example: the top ten COWs of 'commuter railroad'

  post//J            1234.47
  pickups//N          315.005
  rail//N             200.839
  sanitation//N       186.99
  weekday//N          135.609
  transit//N          134.329
  commuter//N         119.435
  subway//N            86.6837
  transportation//N    86.487
  railway//N           86.2851

COWs of MWUs
- Another example: the top ten COWs of 'underground railroad'

  abolitionist//N     845.075
  slave//N            401.732
  gourd//N            266.538
  runaway//J          226.163
  douglass//N         170.459
  slavery//N          157.654
  harriet//N          131.803
  quilt//N            109.241
  quaker//N            94.6592
  historic//N          86.0395

The YAKS model
- Motivations:
  - COW values reflect simple co-occurrence or association, but no particular relationship beyond that.
  - For some purposes, it's useful to measure the association between two terms in a particular syntactic relationship.
- Construction:
  - Parse a lot of text (the same 325 million words of New York Times text used to build our NYT COW model); however, long sentences (>25 words) were discarded, as parsing them was slow and error-prone.
  - The parser's output provides information about the grammatical relations between words in a clause. To measure the association of a verb (say 'drink') and a noun as its object (say 'beer'), we consider the set of all verb-object pairs and calculate mutual information over that set.
  - We also calculate MI for broader semantic classes of terms, e.g., food, substance. Semantic classes were taken from the Cambridge International Dictionary of English (CIDE); there are about 2,000 of them, arranged in a shallow hierarchy.

YAKS Examples
- Some objects of 'eat' (OBJ, head = eat):

  arg1 = hamburger     139.249
  arg1 = pretzel        90.359
  arg1 = :Food          18.156
  arg1 = :Substance      7.89
  arg1 = :Sound          0.324
  arg1 = :Place          0.448

Relevance Intervals (RIs)
- Each RI is a contiguous segment of audio/video deemed relevant to a term.
  1. RIs are calculated for all content words (after lemmatization) and multi-word expressions.
  2. RI basis: the sentence containing the term.
  3. Each RI is expanded forward and backward to capture relevant material, using techniques including:
     - topic boundary detection by changes in COW values across sentences
     - topic boundary detection via discourse markers
     - synonym-based query expansion
     - anaphor resolution
  4. Nearby RIs for the same term are merged.
  5. Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI and the COW values of the other words in the RI with that term.

Relevance Intervals: an Example
- Index term: squatter
- Among the sentences containing this term are these two, near each other; we build an RI for squatter around each of them (shown here with the system's expansion and boundary annotations):

  Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary]
  In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand]
  Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.
  In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand]
  Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals]
  NPR's Kenneth Walker has a report. [merge nearby intervals]
  Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.
  “Must give us a place…” [topic segment boundary]

- The two occurrences of squatter thus produce a single complete merged interval.

Relevance Intervals and Virtual Documents
- The set of all the RIs for a term in a document constitutes the virtual document (VD) for that term.
- In effect, the VD for a term is intended to approximate the document that would have been produced had the authors focused solely on that term.
- A VD is assigned a magnitude equal to the highest magnitude of the RIs in it, with a bonus if more than one RI has a similarly high magnitude.

Merging RIs for multiple terms
- [Figure: RIs built around occurrences of the original terms 'Russia' and 'Iran' on a program timeline; activation spreading from each occurrence; and the merged intervals returned for the query 'Russia and Iran'.]
- Note that this merging can only be done at query time, so it needs to be fairly quick and simple (a simplified sketch follows).
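The talk does not spell out the merging algorithm (the figure shows activation spreading), so this is only a simplified sketch of query-time merging: pool each query term's RIs and union any intervals that overlap or lie within a small gap. The (start, end, magnitude) representation and the gap threshold are assumptions.

```python
# Simplified sketch of quick query-time merging of relevance intervals for a
# multi-term query such as "Russia and Iran": pool the terms' RIs, then merge
# any intervals that overlap or fall within a small gap.  The RI representation
# (start, end, magnitude) and max_gap are illustrative assumptions.

def merge_ris(ri_lists, max_gap=5.0):
    """ri_lists: one list of (start, end, magnitude) tuples per query term."""
    pooled = sorted(ri for ris in ri_lists for ri in ris)
    merged = []
    for start, end, mag in pooled:
        if merged and start - merged[-1][1] <= max_gap:
            last_start, last_end, last_mag = merged[-1]
            merged[-1] = (last_start, max(last_end, end), max(last_mag, mag))
        else:
            merged.append((start, end, mag))
    return merged

russia = [(12.0, 45.0, 0.8), (300.0, 330.0, 0.4)]
iran   = [(40.0, 95.0, 0.9)]
print(merge_ris([russia, iran]))   # [(12.0, 95.0, 0.9), (300.0, 330.0, 0.4)]
```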
Evaluating RIs and VDs
- Evaluation of retrieval effectiveness in timed media raises further issues:
  - Building a gold standard is painstaking, and potentially more subjective.
  - It's necessary to measure how closely the system's RIs match the gold standard's.
  - What's a reasonable baseline?
- We created a gold standard of about 2,300 VDs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI in a VD on a scale of 1 (highly relevant) to 3 (marginally relevant).
- Testing of the system was performed on speech recognizer output.

Evaluating RIs and VDs
- We measure the amounts of extraneous and missed content.
- [Figure: an ideal RI and a system RI aligned on the timeline; material in the system RI but not the ideal RI is "extraneous", material in the ideal RI but not the system RI is "missed".]

Evaluating RIs and VDs
- Comparison of the median percentages of extraneous and missed content over all queries, between the system using COWs and a system using only the sentences in which the query terms are present:

                   most relevant    relevant        most relevant    marginally
                                                    and relevant     relevant
                   extra   miss     extra   miss    extra   miss     extra   miss
  With COWs         9.3    12.7     39.2    27.5    29.9    21.6     61.7    15.0
  Without COWs      0.0    57.7      0.3    64.7     0.0    63.4     18.8    60.2

MWUs and compositionality

MWUs, idioms, and compositionality
- Several partially independent factors are in play here (Calzolari et al. 2002):
  1. reduced syntactic and semantic transparency;
  2. reduced or lack of compositionality;
  3. more or less frozen or fixed status;
  4. possible violation of some otherwise general syntactic patterns or rules;
  5. a high degree of lexicalization (depending on pragmatic factors);
  6. a high degree of conventionality.

MWUs, idioms, and compositionality
- In addition, there are two kinds of "mixed" cases:
  - Ambiguous MWUs, with one meaning compositional and the other not: 'end of the tunnel', 'underground railroad'
  - "Normal" use of some component words, but not others: 'flea market' (a kind of market), 'peanut butter' (a spread made from peanuts)

Automatic detection of non-compositionality
- Previous work
  - Lin (1999): "based on the hypothesis that when a phrase is noncompositional, its mutual information differs significantly from the mutual informations [sic] of phrases obtained by substituting one of the words in the phrase with a similar word."
  - For instance, the distribution of 'peanut' and 'butter' should differ from that of 'peanut' and 'margarine'.
  - Results are not very good yet, because semantically related words often have quite different distributions, and many compositional collocations are "institutionalized", so that substituting words within them will change the distributional statistics.

Automatic detection of non-compositionality
- Previous work
  - Baldwin et al. (2002): "use latent semantic analysis to determine the similarity between a multiword expression and its constituent words"; "higher similarities indicate greater decomposability".
  - "Our expectation is that for constituent word-MWE pairs with higher LSA similarities, there is a greater likelihood of the MWE being a hyponym of the constituent word." (for head words of MWEs)
  - This "correlate[s] moderately with WordNet-based hyponymy values."

Automatic detection of non-compositionality
- We use the COW model for a related approach to the problem:
  - The COWs (and COW values) of an MWU and its component words will be more alike if the MWU is compositional.
- We use a measure of occurrences of a component word near an MWU as another criterion of compositionality:
  - The more often the words in the MWU appear near it, but not as a part of it, the more likely it is that the MWU is compositional.

COW pair sum measure
- Get the top n COWs of an MWU, and of one of its component words.
- For each pair of COWs (one from each of these lists), find their COW value.

  railroad:           post//J   shipper//N   freight//N   ...
  commuter railroad:  post//J   pickups//N   rail//N      ...

- Then sum up these values. This provides a measure of how similar the contexts are in which the MWU and its component word appear. (A sketch follows.)
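A rough sketch of the COW pair sum measure, assuming two hypothetical lookups that the talk does not specify: cow(x, y), returning the COW value of a term pair, and top_cows(term, n), returning a term's n highest-valued COWs.

```python
# Rough sketch of the COW pair sum measure.  cow(x, y) and top_cows(term, n)
# are assumed lookups, not actual StreamSage interfaces.
from itertools import product

def cow_pair_sum(mwu, component, cow, top_cows, n=10):
    """Sum COW values over all pairs drawn from the MWU's top COWs and the
    component word's top COWs; higher sums suggest more similar contexts,
    i.e. a more compositional MWU with respect to that component."""
    mwu_cows = top_cows(mwu, n)
    comp_cows = top_cows(component, n)
    return sum(cow(a, b) for a, b in product(mwu_cows, comp_cows))

# e.g. cow_pair_sum("commuter railroad", "railroad", cow, top_cows)
#      vs. cow_pair_sum("flea market", "flea", cow, top_cows)
```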
Feature overlap measure
- Get the top n COWs (and values) of an MWU, and of one of its component words.
- For each COW with a value greater than some threshold, treat that COW as a feature of the term.
- Then compute the overlap coefficient (Jaccard coefficient) for the two sets of features A and B:

  |A ∩ B| / |A ∪ B|

Occurrence-based measure
- For each occurrence of an MWU, determine whether a given component word occurs in a window around that occurrence, but not as part of that MWU.
- Calculate the proportion of occurrences for which this is the case, compared to all occurrences of the MWU.

Testing the measures
- We extracted all MWUs tagged as idiomatic in the Cambridge International Dictionary of English (about 1,000 expressions).
- About 112 of these conform to our MWU patterns and occur with sufficient frequency in our corpus that we have calculated COWs for them. For example:

  fashion victim
  flea market
  flip side

Testing the measures
- We then searched the 100,000 MWUs for which we have COW values, choosing compositional MWUs containing the same terms. In some cases this is difficult or impossible, as no appropriate MWUs are present.
- About 144 MWUs are on the compositional list. For example:

  fashion victim  ->  fashion designer, crime victim
  flea market     ->  [flea collar], market share
  flip side       ->  [coin flip], side of the building

Results: basic statistics
- The idiomatic and compositional sets are quite different in aggregate, though there is a large variance:

                   COW pair sum        Feature overlap     Occurrence measure
  Non-idiomatic    mean  575.478       mean 0.297          mean 37.877
                   s.d.  861.754       s.d. 0.256          s.d. 23.470
  Idiomatic        mean -236.92        mean 0.109          mean 16.954
                   s.d.  502.436       s.d. 0.180          s.d. 16.637

Results: discriminating the two sets
- How well does each measure discriminate between idioms and non-idioms?

  COW pair sum        negative   positive
  Non-idiomatic           75        213
  Idiomatic              178         46

  Feature overlap      < 0.12    >= 0.12
  Non-idiomatic          100        188
  Idiomatic              175         49

  Occurrence measure    < 25%     >= 25%
  Non-idiomatic           94        194
  Idiomatic              174         50

Results: discriminating the two sets
- Can we do better by combining the measures? We used the decision-tree software C5.0 to check.
- Rule: if COW pair sum <= -216.739, or (COW pair sum <= 199.215 and occurrence measure < 27.74%), then idiomatic; otherwise non-idiomatic.

                   rule: idiomatic   rule: non-idiomatic
  Non-idiomatic          50                 238
  Idiomatic             184                  40

Results: discriminating the two sets
- Some cases are "split": classified as idiomatic with respect to one component word but not the other:
  - 'bear hug' is idiomatic w.r.t. 'bear' but not 'hug'
  - 'flea market' is idiomatic w.r.t. 'flea' but not 'market'

Other methods to improve performance on this task
- MWUs often come in semantic "clusters": 'almond tree', 'peach tree', 'blackberry bush', 'pepper plant', etc.
- Corresponding components in these MWUs can be localized in a small area of WordNet (Barrett, Davis, and Dorr (2001)) or UMLS (Rosario, Hearst, and Fillmore (2002)).
- "Outliers" that don't fit the pattern are potentially idiomatic or noncompositional ('plane tree' is such an outlier, whereas 'rubber tree' is compositional).

Clustering and topic detection

Clustering by similarities among segments
- The content of a segment is represented by its topically salient terms.
- The COW model is used to calculate a similarity measure for each pair of segments.
- Clustering on the resulting matrix of similarities (using the CLUTO package) yields topically distinct clusters of results. (A sketch follows.)
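A sketch of the segment-clustering step. SciPy's hierarchical clustering stands in here for CLUTO, and the similarity function (average pairwise COW value between the segments' salient terms) is an assumption; the actual similarity measure is not specified in the talk.

```python
# Illustrative sketch of clustering segments by COW-based similarity.  SciPy's
# hierarchical clustering is used as a stand-in for the CLUTO package; the
# similarity function, salient-term lists, and cow() lookup are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def segment_similarity(terms_a, terms_b, cow):
    """Average COW value over cross-pairs of the two segments' salient terms."""
    pairs = [(a, b) for a in terms_a for b in terms_b if a != b]
    return sum(cow(a, b) for a, b in pairs) / max(1, len(pairs))

def cluster_segments(segments, cow, n_clusters=2):
    """segments: list of salient-term lists, one per segment."""
    n = len(segments)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = segment_similarity(segments[i], segments[j], cow)
    # Convert similarities to distances for the clustering routine.
    dist = sim.max() - sim
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                      t=n_clusters, criterion="maxclust")
    return labels   # e.g. the ten "crane" segments should split into birds vs. machinery
```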
Clustering Results
- Example: crane
- 10 segments form 2 well-defined clusters, one relating to birds, the other to machinery (and to the cleanup of the WTC debris in particular).
- [Figure: the two crane clusters.]

Cluster Labeling
- Topic labels for clusters improve usability. Candidate cluster labels can be obtained from:
  - Topic terms of segments in a cluster
  - Multi-word units containing the query term(s)
  - Outside sources (taxonomies, Wikipedia, …)

Topics through latent Dirichlet allocation
- LDA (Blei, Ng, and Jordan 2003) models documents as probability distributions over a set of underlying topics, with each topic defined as a probability distribution over terms.
- We've generated topic models for the SR output of about 20,000 news programs (from 400 to 1000 topics).
- Nouns, verbs, and MWUs only; all other words are discarded.
- Impressionistically, the "best" topics seem to be those in which a third or more of the probability mass is in 10-50 terms.

Topics through latent Dirichlet allocation
- Are there better ways to gauge topic coherence (and relatedness)? We've tried two (the first is sketched below):
  1. A COW sum measure (using the Wikipedia COW table) over the 20 most probable words in a topic
  2. The average "distance" between the CIDE semantic domain codes of those terms (also the top 20, using the similarity measure of Wu and Palmer 1994)
- Excerpt of the CIDE hierarchy:

  43 Building and Civil Engineering
     66 Buildings
        68 Buildings: names and types of
           754 Houses and homes
           755 Public buildings
        365 Rooms
     194 Furniture and Fittings
        834 Bathroom fixtures and fittings
        811 Curtains and wallpaper
        804 Tables
        805 Chairs and seats
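A sketch of the first coherence measure: score a topic by summing the pairwise COW values among its 20 most probable terms (from the Wikipedia COW table) and rank topics by that score. The cow() lookup and the topic representation are assumed interfaces, not the actual StreamSage ones.

```python
# Sketch of the COW sum coherence measure: sum the pairwise COW values among a
# topic's 20 most probable terms, then rank topics by that score.  cow() and
# the topic term lists are assumptions for illustration.
from itertools import combinations

def topic_cow_score(top_terms, cow, n=20):
    terms = top_terms[:n]
    return sum(cow(a, b) for a, b in combinations(terms, 2))

def rank_topics(topics, cow):
    """topics: dict mapping topic id -> list of terms, most probable first."""
    return sorted(topics, key=lambda t: topic_cow_score(topics[t], cow), reverse=True)
```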
Topics through latent Dirichlet allocation
- The two rankings are quite different (these are with 1000 topics).
- Top topics by the COW sum measure:

  Rank in WikiCOW = 1 (rank in CIDE = 232):
    return home war talk stress month post-traumatic rack deal think stress$disorder night psychological face understand pt experience combat series mental$health
  Rank in WikiCOW = 2 (rank in CIDE = 773):
    research cell embryo stem$cell disease destroy embryonic$stem scientist parkinson researcher federal funding potential cure today human$embryo medical human$life california create
  Rank in WikiCOW = 3 (rank in CIDE = 924):
    giuliani new$hampshire run talk gingrich february former$new thompson social mormon massachusetts mayor iowa newt opening york choice rudy$giuliani jim committee
  Rank in WikiCOW = 4 (rank in CIDE = 650):
    palestinian israeli prime$minister israelis arafat west$bank jerusalem peace settlement gaza$strip government violence israeli$soldier hama leader move force security peace$process downtown
  Rank in WikiCOW = 5 (rank in CIDE = 283):
    japan japanese tokyo connecticut lieberman lucy run lose joe$lieberman support lemont disaster war twenty-first$century anti-war define peterson$cbs senator$joe future figure
  Rank in WikiCOW = 6 (rank in CIDE = 710):
    california schwarzenegger union break office view arnold$schwarzenegger tuesday politician budget rosie poll agenda o'donnell political battle california$governor help rating maria

- Top topics by the CIDE distance measure:

  Rank in CIDE = 1 (rank in WikiCOW = 278):
    play show broadway actor stage star performance theatre perform audience act career tony sing character production young role review performer
  Rank in CIDE = 2 (rank in WikiCOW = 476):
    war u.n. refugee united$nation international defend support kill u.s. week flee allow chief board cost remain desperate innocent humanitarian conference
  Rank in CIDE = 3 (rank in WikiCOW = 151):
    competition win team skate compete figure stand finish performance gold watch champion hard-on score skater talk meter lose fight gold$medal
  Rank in CIDE = 4 (rank in WikiCOW = 858):
    set move show pass finish mind send-up expect outcome address defence person responsibility wish independent system woman salute term pres
  Rank in CIDE = 5 (rank in WikiCOW = 478):
    character play think kind film actor mean mind scene act guy script role good$morning real read nice wonderful stuff interesting
  Rank in CIDE = 6 (rank in WikiCOW = 892):
    warren purpose rick hunter week drive hundred influence allow poor peace night leader train sign walk training goal team rally

YAKS and Ontology Enhancement

YAKS (Yet Another Knowledge Source)
- What YAKS is:
  - Like the COW model, it measures how much more (or less) frequently words occur in certain environments than would be expected by chance.
  - Unlike the COW model, it does not measure co-occurrence within a fixed window size; it measures how often words stand in one of the syntactic relations tracked by the model.
  - Example: SUBJ(eat, dog) would be an entry in the YAKS table recording how many times the verb eat occurs in the corpus with dog as its subject.

Syntax and Semantics: YAKS
- YAKS infrastructure:
  - Corpus: the New York Times corpus (six years), all sentences under 25 words (~60% of sentences, ~50% of words); about 160 million words.
  - Good parsing is needed. We used the Xerox Linguistics Environment (XLE) parser from PARC, based on Lexical-Functional Grammar.

Syntax and Semantics: YAKS
- Sample YAKS relations (YAKS tracks 13 different relations):

  Relation        Meaning                                     Example
  SUBJ(V, X)      X is the subject of verb V                  "Dan ate soup": SUBJ(eat, Dan)
  OBJ(V, X)       X is the direct object of verb V            "Dan ate soup": OBJ(eat, soup)
  COMP(V, X)      Verb V and complementizer X                 "Dan believes that soup is good": COMP(believe, that)
  POBJ(P, X)      Preposition P and its object X              "Eli sat on the ground": POBJ(on, the ground)
  CONJ(C, X, Y)   Coordinating conjunction C and two
                  conjuncts (either nouns or verbs) X and Y   "Dan ate bread and soup": CONJ(and, bread, soup)

- For ontology extension, we use only the OBJ relation.

YAKS in ontology enhancement
- Seed items: hat, sock, shirt, dress (items from one neighborhood in the ontology).
- Which verbs are most associated with these nouns, in a large corpus? wear, wash, take off, iron.
- Which other nouns are most associated as objects of those verbs? sweater, clothes, pants, jacket, blouse, pajamas: these become new candidates for that area of the ontology.

YAKS in ontology enhancement
- The YAKS technique:
  - The noun-verb-noun cycle reflects the selectional restrictions of verbs that are strongly associated with the seed nouns.
  - This lets us use the information in the statistical properties of the large corpus to find nouns that are semantically close to the seeds in the ontology.
  - This is why we use OBJ (the object relation) for this YAKS technique. (SUBJ found the same information, but with slightly more noise.)

YAKS in ontology enhancement
- The YAKS technique finds new items for a neighborhood, but where exactly do they go?
- [Figure: a candidate term X belongs somewhere in this neighborhood of the ontology, but its exact position is not yet known.]
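A sketch of the noun-verb-noun candidate-generation cycle just described, assuming a YAKS-style OBJ table exposed through two hypothetical lookups, verbs_for_object(noun) and objects_of(verb), each returning (term, score) pairs; the cutoffs are illustrative.

```python
# Sketch of the noun-verb-noun expansion cycle.  verbs_for_object(noun) and
# objects_of(verb) are hypothetical lookups over a YAKS-style OBJ table,
# each returning (term, association score) pairs.
from collections import defaultdict

def expand_seeds(seeds, verbs_for_object, objects_of, n_verbs=4, n_candidates=10):
    # 1. Find the verbs most strongly associated with the seed nouns as objects
    #    (e.g. hat, sock, shirt, dress -> wear, wash, take off, iron).
    verb_scores = defaultdict(float)
    for noun in seeds:
        for verb, score in verbs_for_object(noun):
            verb_scores[verb] += score
    top_verbs = sorted(verb_scores, key=verb_scores.get, reverse=True)[:n_verbs]

    # 2. Find other nouns that are strong objects of those verbs; these are the
    #    candidates for the same neighborhood of the ontology.
    noun_scores = defaultdict(float)
    for verb in top_verbs:
        for noun, score in objects_of(verb):
            if noun not in seeds:
                noun_scores[noun] += score
    return sorted(noun_scores, key=noun_scores.get, reverse=True)[:n_candidates]
```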
YAKS in ontology enhancement
- Techniques to precisely locate a candidate term in the ontology:
  1. Wikipedia check (already described): does the Dog page contain "dog is a mammal"? Does the Mammal page contain "mammal is a dog"?
  2. Yahoo pattern technique

YAKS in ontology enhancement
- Yahoo check:
  1. Use known pattern information to form a Yahoo search. If we suspect that a dog is-a mammal, search Yahoo for "dog is a mammal" and "mammal is a dog". Also search using other is-a patterns (e.g., from Hearst 1994).
  2. Use relative counts (absolute counts are hard to interpret).

YAKS in ontology enhancement: evaluation
- Four levels of is-a: beef, meat, food, substance.
  1. Seed pairs were beef-meat, meat-food, and food-substance.
  2. YAKS induction and siblings were added to produce the candidate pool.
  3. The candidate list was then pared using a version of the Wikipedia check.

YAKS in ontology enhancement: evaluation
- Beef, meat, food, substance — the Wikipedia check:
  - Where might candidate term T fit, relative to seed term S?
  - Count the mentions of T on the S page, M_T, and the mentions of S on the T page, M_S. Mentions in the first paragraph are weighted doubly.
  - Subtract: M_T - M_S. If T is an S, then M_T - M_S should be > 0.

YAKS in ontology enhancement: evaluation
- Beef, meat, food, substance — results:
  - After YAKS induction and expansion via siblings, we had >800 candidate terms to add to the ontology.
  - The Wikipedia check pared this to 38 terms, each with its best is-a relationship.
  - Of these, on hand inspection, all but two were very high quality.

YAKS in ontology enhancement: evaluation
- YAKS testing: the MILO transportation ontology
  1. In each trial, the seed terms were three items from near the bottom of the ontology, linked by is-a; for instance, sedan, car, and vehicle. Selecting terms from the bottom of the ontology decreases the lexicalization problems inherent in MILO content.
  2. New candidate terms were induced using the YAKS technique, adding siblings as well.
  3. The candidate list was then pared using a Yahoo is-a pattern check (sketched below):
     A. For a candidate term T, compare it to each of the three seeds: is it a sedan? Is it a car? Is it a vehicle?
     B. Do this via a Yahoo search for Hearst is-a patterns involving T and each of the seeds. This trial used the patterns "Y such as X" and "X and other Y": i.e., search for "... sedans such as T ...", "... cars such as T ...", "... vehicles such as T ...", "... T and other sedans ...", and so on.
     C. Does the candidate have a significant number of Yahoo hits for one of the seeds with one of the is-a patterns? Is there a significant difference between that highest number of hits and the number of hits with some other seed? That is, is there a clear indication that the candidate is-a for one particular seed?
     D. If not, discard this candidate.
     E. If so, then this candidate most likely is-a member of the class named by that seed.
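A sketch of the pattern check's decision logic, with the web search abstracted behind a caller-supplied hit_count(query) function (the talk used Yahoo; no particular search API is assumed here). The thresholds for "significant" hits and a "significant difference" are illustrative guesses, not values from the talk.

```python
# Sketch of the is-a pattern check.  hit_count(query) is a caller-supplied
# function returning a web hit count for a quoted query string; the pattern
# templates, pluralization, and thresholds are illustrative assumptions.

PATTERNS = ['"{seed}s such as {cand}"', '"{cand} and other {seed}s"']

def isa_check(cand, seeds, hit_count, min_hits=50, min_ratio=3.0):
    """Return the seed class the candidate most clearly is-a, or None."""
    totals = {}
    for seed in seeds:
        totals[seed] = sum(hit_count(p.format(seed=seed, cand=cand)) for p in PATTERNS)
    best = max(totals, key=totals.get)
    others = [v for s, v in totals.items() if s != best]
    # min_hits and min_ratio stand in for "significant number" and
    # "significant difference" in the procedure above.
    if totals[best] >= min_hits and totals[best] >= min_ratio * max(others, default=1):
        return best
    return None

# e.g. isa_check("truck", ["sedan", "car", "vehicle"], hit_count) -> "vehicle",
# matching the "a truck is a vehicle" row in the results table below.
```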
YAKS in ontology enhancement: evaluation
- Results — the MILO transportation ontology:

  candidate      Yahoo hits for is-a:                   result
                 sedan      car      vehicle
  truck              0      532       152000            a truck is a vehicle
  van                0       95         5590            a van is a vehicle
  automobile         0       63        14700            an automobile is a vehicle
  minivan            1       76          561            a minivan is a vehicle
  BMW                3      360           66            a BMW is a car
  Jeep               0     3600          849            a Jeep is a car
  Corolla            6       66            3            a Corolla is a car
  convertible        0       25           76            a convertible is a vehicle

Conclusions and future work
- MI models can help compensate for the noisiness in timed media search and retrieval.
- MI models can help in building knowledge sources from large text collections.
- We're still looking for better ways to combine handcrafted semantic resources with statistical ones.