Mining Domain-Specific Words from Hierarchical Web Documents

Jing-Shin Chang (張景新)
Department of Computer Science & Information Engineering
National Chi-Nan (暨南) University
1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
jshin@csie.ncnu.edu.tw

CJNLP-04, 2004/11/10~15, City U., H.K.

TOC
- Motivation
- What are DSWs?
- Why DSW mining? (applications)
  - WSD with DSWs, without a sense-tagged corpus
  - Constructing a hierarchical lexicon tree without clustering
  - Other applications
- How to mine DSWs from hierarchical web documents
- Preliminary results
- Error sources
- Remarks

Motivation
- "Is there a quick and easy (engineering) way to construct a large-scale WordNet or something like it, now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)?"
- This question triggers a new view of constructing a lexicon tree with hierarchical semantic links.
- DSW identification turns out to be a key to such construction, and the mined DSWs can be used in various applications, including DSW-based WSD without sense-tagged corpora.

What Are Domain-Specific Words (DSWs)?
- Words that appear frequently in particular domains:
  (a) multiple-sense words that are frequently used with a special meaning or usage in particular domains, e.g., "piston": "活塞" (the machine part, in mechanics) or "活塞隊" (the Pistons, in sports);
  (b) single-sense words that are frequently used in particular domains.
- A DSW suggests that some words in the current document may be related to its particular sense.
- DSWs therefore serve as "anchor words/tags" in the context for disambiguating other multiple-sense words.

What to Do in DSW Mining
- The DSW mining task: find lists of words that occur frequently in the same domain, and associate each list (and the words within it) with a domain (implicit sense) tag.
  E.g., entertainment: 'singer', 'pop songs', 'rock & roll', 'Chang Hui-Mei' ('Ah-Mei'), 'album', …
- As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSWs.
- When applied to mining the DSWs associated with each node of a hierarchical directory/document tree,
  each node is annotated with a domain tag.

DSW Applications (1)
- Technical term extraction: W(d) = { w | w ∈ DSW(d) }, for d ∈ {computer, traveling, food, …}

DSW Applications (2)
- Generic WSD based on DSWs:
  s* = argmax_s Σ_d P(s|d,W) P(d|W) = argmax_s Σ_d P(s|d,W) P(W|d) P(d)
  This is useful if a large-scale sense-tagged corpus is not available, which is often the case.
- Machine translation: helps select translation lexicon candidates.
  E.g., money bank (when used with "payment", "loan", etc.), river bank, memory bank (in PC, Intel, MS Windows domains).

DSW Applications: Generic WSD Based on DSWs
- Notation: w_0^n = w_0 … w_n is the input context.
- Sense-based models (need sense-tagged corpora for training, which are not widely available):
  s* = argmax_s P(s | w_0, w_1^n) = argmax_s P(w_0^n | s) P(s)
- Domain-based models (implicitly domain-tagged corpora are widely available on the web):
  s* = argmax_s Σ_d P(s, d | w_0, w_1^n)
     = argmax_s Σ_d P(s | d, w_0^n) P(d | w_0^n)
     = argmax_s Σ_d P(s | d, w_0^n) P(w_0^n | d) P(d)
  The sum runs over the domains d in which w_0 is a DSW.
- For a given (d, w_0, s), the term P(s | d, w_0^n) is almost deterministic ("one sense per context").

DSW Applications (3)
- Document classification: N-class classification based on DSWs.
- Anti-spamming (two-class classification): words in spamming (uninteresting) mails vs. normal (interesting) mails help block spamming mails; interesting domains vs. uninteresting domains.
  Compare P(W|S)P(S) vs.
  P(W|~S)P(~S).

DSW Applications (3.a)
- Document classification based on DSWs:
  d: document class label; w_1^n: bag of words in the document; |D| = n ≥ 2: number of document classes.
  d* = argmax_d P(d | w_1^n) = argmax_d P(w_1^n | d) P(d)   [class-based models]
- Anti-spamming based on DSWs: |D| = n = 2 (two-class classification).

DSW Applications (4)
- Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents.
- Membership: semantic links among words of the same domain are close (context), similar (synonym, thesaurus), or negated concepts (antonym).
- Hierarchy: the hierarchy of the lexicon suggests some ontological relationships.

Conventional Methods for Constructing Lexicon Trees
- Construction by clustering:
  1. Collect words from a large corpus.
  2. Evaluate word association as a distance (or closeness) measure for all word pairs.
  3. Use clustering criteria to build the lexicon hierarchy.
  4. Adjust the hierarchy and assign semantic/sense tags to the nodes of the lexicon tree, thus assigning sense tags to the members of each node.

Clustering Methods for Constructing Lexicon Trees
[Figure: the word lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) are clustered into a binary hierarchy, grouping B with C under A1 and D with E under A2, with A0 at the root.]

Clustering Methods for Constructing Lexicon Trees
- Disadvantages:
  - They do not take advantage of the hierarchical information of the document tree (it is flattened when collecting words).
  - The word association and clustering criteria are not related directly to human perception:
    - Most clustering algorithms conduct binary merging (or division) in each step, for simplicity.
    - The automatically generated semantic hierarchy may not reflect human perception.
    - Hierarchy boundaries are not clearly and automatically detected.
    - Adjusting the hierarchy may not be easy (since human perception is not used to guide the clustering).
  - Pairwise association evaluation is costly.

Hierarchical Information Loss when Collecting Words
[Figure: the directory tree over the word lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) is flattened into a single bag of words, losing the hierarchical grouping.]

Clustering Methods for Constructing Lexicon
Trees
[Figure: the same binary clustering hierarchy, annotated with open questions: does it reflect human perception? why binary merges? is the resulting hierarchy meaningful?]

Alternative View for Constructing Lexicon Trees
- Construction by retaining DSWs:
  - Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which has already been mildly confirmed by the webmasters.
  - Associate each node with DSWs as members, and tag each DSW with the directory/domain name.
  - Optionally adjust the tree hierarchy and the members of each node.

Constructing Lexicon Trees by Preserving DSWs
[Figure: a document tree whose words are marked O (+DSW) or X (-DSW).]

Constructing Lexicon Trees by Preserving DSWs
[Figure: removing the X-marked (non-DSW) words from each node leaves a lexicon tree of DSWs with the same hierarchy.]

Constructing Lexicon Trees by Preserving DSWs
- Advantages:
  - Hierarchy adjustment could be easier, if necessary, since the directory hierarchy reflects human perception and the directory names are highly correlated to sense tags.
  - A domain-based model can be used when sense-tagged corpora are not available.
  - Pairwise word association evaluation is replaced by computing "domain specificity" against the domains: O(|W|×|D|) instead of O(|W|×|W|).
- Requirements: a well-organized web site, and a method for mining DSWs from such a site.

Constructing Lexicon Trees by Preserving DSWs
[Figure: the preserved hierarchy provides candidate semantic relationships: membership within a node (closeness, similarity), synonym/antonym links among members, and is_a/hypernym links between a node and its parent, e.g., "B is_a X (or A1)", "Y is_a X?".]

Alternative View for Constructing Lexicon Trees
- Benefits:
  - No similarity computation: closeness (incl.
    similarity) is already implicitly encoded by the human judges.
  - No binary clustering: the clustering has already been done (implicitly) by human judgment.
  - Hierarchical links available: some well-developed relationships are already in place, although not perfect.

Proposed Method for Mining the Web Hierarchy as a Large Document Tree
- Assume each document was generated by applying DSWs to some generic document templates.
- Remove the non-specific words from the documents, leaving a lexicon tree with DSWs associated with each node:
  - leaving only the domain-specific words,
  - forming a lexicon tree from the document tree,
  - labeling the domain-specific words.
- Characteristics: associated words are found by measuring their domain specificity against a known and common domain, instead of measuring pairwise association plus clustering.

Mining Criteria: Cross-Domain Entropy
- Domain-independent terms tend to be distributed evenly across all domains. This distributional "evenness" can be measured with the Cross-Domain Entropy (CDE), defined as:
  H*_i = H*(w_i) = -Σ_j P_ij log P_ij,  with P_ij = f_ij / Σ_j f_ij
  where P_ij is the probability of word w_i in domain d_j, and f_ij is its normalized frequency.

Mining Criteria: Cross-Domain Entropy
- Example: w_i = "piston", with frequencies (normalized to [0,1]) in various domains:
  f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
  The word is domain-specific (unevenly distributed), concentrated in the 2nd and the 4th domains.

Mining Algorithm – Step 1
- Step 1 (Data Collection): Acquire a large collection of web documents using a web spider, preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages.

Mining Algorithm – Step 2
- Step 2 (Word Segmentation or Chunking): Identify word (or compound-word) boundaries in the documents, by applying a word segmentation process such as (Chiang 92; Lin 93) to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest.
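The cross-domain entropy defined above is easy to compute. The following is a minimal Python sketch (the function name `cross_domain_entropy` and the evenly distributed `common` word are illustrative assumptions; the "piston" frequencies are taken from the example slide):

```python
import math

def cross_domain_entropy(freqs):
    """H*(w_i) = -sum_j P_ij * log(P_ij), with P_ij = f_ij / sum_j f_ij."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log(p) for p in probs)

# "piston" from the example slide: concentrated in the 2nd and 4th domains
piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]
# an assumed domain-independent word: evenly spread over 5 domains
common = [0.2, 0.2, 0.2, 0.2, 0.2]

print(cross_domain_entropy(piston))  # low entropy: domain-specific
print(cross_domain_entropy(common))  # maximal entropy (= log 5): domain-independent
```

A uniform distribution over |D| domains yields the maximal entropy log |D|, so a threshold between 0 and log |D| separates specific from non-specific terms.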
Mining Algorithm – Step 3
- Step 3 (Acquiring Normalized Term Frequencies for All Words in the Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all its documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size N_j = Σ_i n_ij of that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list over all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j.

Mining Algorithm – Step 3
- Input: <w_i, d_j, f_ij> (word, domain, normalized frequency) triples, where:
  n_ij: frequency of w_i in domain d_j
  N_j = Σ_i n_ij: number of words in domain d_j
  f_ij = n_ij / N_j: normalized frequency of w_i in domain d_j

Mining Algorithm – Step 4
- Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms which are distributed evenly across all domains, i.e., terms with a large Cross-Domain Entropy (CDE):
  H*_i = H*(w_i) = -Σ_j P_ij log P_ij,  with P_ij = f_ij / Σ_j f_ij
- Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2).
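Steps 3 and 4 can be sketched end to end as a small Python pipeline (an illustrative sketch, not the original implementation: the `counts` toy data, the `mine_dsws` name, and the `cde_threshold` value are assumptions for the example; the top-k retention follows the "top-1 and top-2" remark above):

```python
import math
from collections import defaultdict

def mine_dsws(counts, cde_threshold=0.5, top_k=1):
    """counts: {word: {domain: n_ij}} raw occurrence counts.
    Step 3: f_ij = n_ij / N_j, with N_j = sum_i n_ij (total size of domain j).
    Step 4: remove words whose cross-domain entropy exceeds the threshold;
    retain the rest in their top-k domains by normalized frequency."""
    # Step 3: normalized term frequencies per domain
    domain_size = defaultdict(int)
    for per_domain in counts.values():
        for d, n in per_domain.items():
            domain_size[d] += n
    f = {w: {d: n / domain_size[d] for d, n in pd.items()}
         for w, pd in counts.items()}

    # Step 4: cross-domain entropy filter
    dsws = defaultdict(list)  # domain -> retained domain-specific words
    for w, fd in f.items():
        total = sum(fd.values())
        probs = [x / total for x in fd.values() if x > 0]
        cde = -sum(p * math.log(p) for p in probs)
        if cde >= cde_threshold:
            continue  # evenly distributed: treat as domain-independent
        for d in sorted(fd, key=fd.get, reverse=True)[:top_k]:
            dsws[d].append(w)
    return dict(dsws)

# toy counts: "pitcher" and "engine" are domain-specific, "the" is not
counts = {
    "pitcher": {"baseball": 40, "car": 1},
    "the":     {"baseball": 50, "car": 45, "finance": 60},
    "engine":  {"car": 30, "baseball": 2},
}
print(mine_dsws(counts))  # {'baseball': ['pitcher'], 'car': ['engine']}
```

Note that this replaces O(|W|×|W|) pairwise association scores with one O(|W|×|D|) pass over (word, domain) frequencies, as argued in the earlier slides.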
Experiments
- Domains: news articles from a local news site.
  - 138 distinct domains, including the leaf nodes of the directory tree and their parents; leaves with the same name are considered the same domain.
  - Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星+花園, "Meteor Garden"), finance, food (大魚大肉 "lavish meals", 干貝 "dried scallop", 木耳 "wood-ear fungus", 錫箔紙 "foil", …), …
- Size: 200M bytes (HTML files); 16K+ unique words after word segmentation.

Domains (Hierarchy Not Shown)
afternoon-news entertainment ilan listed-elec personal taiwan-china all-baseball europe important local-scene pintung taoyuan america-topic europe2 important2 lotto pl tax-law autumn family important3 main politics ti basketball finance important4 mainland public-forum todaynews bnext fish important5 management readexcellent topic broadcasting-tv focus infotech medical-news readtopic topic2 buybooks focusnews insurance medical shopping trade car food interest-prose miaoli sitemap travel card fund-futures internal-sport middle-taiwan sitemap_title travelwindow changhwa game international-sport middlesouth-taiwan social-forum udn-supplement chiayi global international miscellaneous society udn college golf internet mixtravel south-taiwan udnbw communication happy_worker japan movie special ue culture hardware kaoshiung-city music sport usa-stock daily health-care kaoshiung-sentry nantou star world-econ day_starnews health-club keelung national-travel stock writers digital hot-news life-topic02 newbooks taichung-city yunlin domestic hot-topic life-topic03 north-taiwan taichung-sentry dswa.crp <root> hot-topic2 life-topic1 opinion taiex east-taiwan hot-topic3 life otc tainan ec hot lifestyle out-activity taipei-city edict hsinchu life_newtopic oversea-star taipei-sentry edu hwalen listed-co performance taitung

Sample Output (4 Selected Domains)
baseball | broadcast-TV | basketball | car
日本職棒 | 有線電視 | 一分 | 千西西
棒球賽 | 東風 | 三秒 | 小型車
熱身 | 開工 | 女子組 | 中古
運動 | 節目中 | 包夾 | 引擎蓋
場次 | 廣電處 | 外線 | 水箱
價碼 | 收視 | 犯規 | 加裝
球團 | 和信 | 投籃 | 市場買氣
部長 | 新聞局 | 男子組 | 目的地
練球 | 開獎 | 防守 | 交車
興農 | 頻道 | 冠軍戰 | 同級
球場 | 電視 | 後衛 | 合作開發
投手 | 電影 | 活塞 | 安全系統
球季 | 熱門 | 國男 | 行李
賽程 | 影視 | 華勒 | 行李廂
太陽 | 娛樂 | 費城 | 西西
Table 1. Sampled domain-specific words with low entropies.

Preliminary Results
- The domain-specific words and their assigned domain tags are well associated (e.g., "投手" (pitcher) is used specifically in the "baseball" domain), so extraction with the cross-domain entropy (CDE) metric is well founded.
- Domain-independent (or irrelevant) words, such as those in the webmaster's advertisements, are correctly rejected as DSW candidates because of their high cross-domain entropy.
- The DSWs are mostly nouns and verbs (open-class words).

Preliminary Results
- Low cross-domain entropy words (DSWs) within a domain are generally highly correlated (e.g., "日本職棒" (Japanese professional baseball) and "部長").
- New usages of words, such as "活塞" (Pistons) in the "basketball" sense, can also be identified.
- Both properties are good for WSD tasks that use the DSWs as contextual evidence.

Error Sources
- A single CDE metric may not be sufficient to capture all characteristics of "domain specificity":
  - Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0), probably due to low occurrence counts (a kind of estimation error).
  - Type I error: some multiple-sense words may have too many senses and thus be mis-recognized as non-specific in every domain (although each sense is unique in its respective domain).

Error Sources
- The "well-organized website" assumption may not hold all the time:
  - The hierarchical directory tags may not be appropriate representatives of the document words within a website.
  - The hierarchies may not be consistent from website to website.

Future Works
- Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]:
  - e.g., other term-weighting metrics;
  - e.g., a stop-list acquisition metric for identifying common words (for Type II errors).
- Explore methods and criteria to adjust the hierarchy of a single directory tree.
- Explore methods to merge directory trees from
  different sites.

Concluding Remarks
- A simple metric for the automatic/semi-automatic identification of DSWs:
  - at low sense-tagging cost: rich web resources are almost free, with implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free);
  - a simple method to build semantic links and degrees of closeness among DSWs;
  - may be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets;
  - a good knowledge source for WSD-related applications: WSD, machine translation, document classification, anti-spamming, …

Thanks for your attention!!