Mining Domain-Specific Words from Hierarchical Web Documents

Jing-Shin Chang (張景新)
Department of Computer Science & Information Engineering
National Chi-Nan (暨南) University
1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
jshin@csie.ncnu.edu.tw

CJNLP-04, 2004/11/10~15, City U., H.K.

TOC
- Motivation
- What are DSWs?
- Why DSW mining? (applications)
  - WSD with DSWs, without a sense-tagged corpus
  - Constructing a hierarchical lexicon tree without clustering
  - Other applications
- How to mine DSWs from hierarchical web documents
- Preliminary results
- Error sources
- Remarks

Motivation
- "Is there a quick and easy (engineering) way to construct a large-scale WordNet or something like it, now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)?"
- This question triggers a new view of constructing a lexicon tree with hierarchical semantic links.
- DSW identification turns out to be a key to such construction, and the mined DSWs can be used in various applications, including DSW-based WSD without sense-tagged corpora.

What Are Domain-Specific Words (DSWs)?
- Words that appear frequently in particular domains:
  (a) multiple-sense words that are frequently used with a special meaning or usage in particular domains, e.g., "piston": "活塞" (the machine part, in mechanics) or "活塞隊" (the Pistons, in sports);
  (b) single-sense words that are frequently used in particular domains.
- A DSW suggests that some words in the current document may be related to its particular sense.
- DSWs therefore serve as "anchor words/tags" in the context for disambiguating other multiple-sense words.

What to Do in DSW Mining
- The DSW mining task: find lists of words that occur frequently in the same domain, and associate each list (and the words within it) with a domain (implicit sense) tag.
  E.g., entertainment: 'singer', 'pop songs', 'rock & roll', 'Chang Hui-Mei' ('Ah-Mei'), 'album', …
- As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSWs.
- When applied to mining the DSWs associated with each node of a hierarchical directory/document tree,
  each node is annotated with a domain tag.

DSW Applications (1)
- Technical term extraction: W(d) = { w | w ∈ DSW(d) }, for d ∈ {computer, traveling, food, …}

DSW Applications (2)
- Generic WSD based on DSWs:
  s* = argmax_s Σ_d P(s|d,W) P(d|W) = argmax_s Σ_d P(s|d,W) P(W|d) P(d)
  This is useful if a large-scale sense-tagged corpus is not available, which is often the case.
- Machine translation: helps select translation lexicon candidates.
  E.g., money bank (when used with "payment", "loan", etc.), river bank, memory bank (in PC, Intel, MS Windows domains).

DSW Applications: Generic WSD Based on DSWs
- Notation: w_0^n = w_0 … w_n is the input context.
- Sense-based models (need sense-tagged corpora for training, which are not widely available):
  s* = argmax_s P(s | w_0, w_1^n) = argmax_s P(w_0^n | s) P(s)
- Domain-based models (implicitly domain-tagged corpora are widely available on the web):
  s* = argmax_s Σ_d P(s, d | w_0, w_1^n)
     = argmax_s Σ_d P(s | d, w_0^n) P(d | w_0^n)
     = argmax_s Σ_d P(s | d, w_0^n) P(w_0^n | d) P(d)
  The sum runs over the domains d in which w_0 is a DSW.
- For a given (d, w_0, s), the term P(s | d, w_0^n) is almost deterministic ("one sense per context").

DSW Applications (3)
- Document classification: N-class classification based on DSWs.
- Anti-spamming (two-class classification): words in spamming (uninteresting) mails vs. normal (interesting) mails help block spamming mails; interesting domains vs. uninteresting domains.
  Compare P(W|S)P(S) vs.
  P(W|~S)P(~S).

DSW Applications (3.a)
- Document classification based on DSWs:
  d: document class label; w_1^n: bag of words in the document; |D| = n ≥ 2: number of document classes.
  d* = argmax_d P(d | w_1^n) = argmax_d P(w_1^n | d) P(d)   [class-based models]
- Anti-spamming based on DSWs: |D| = n = 2 (two-class classification).

DSW Applications (4)
- Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents.
- Membership: semantic links among words of the same domain are close (context), similar (synonym, thesaurus), or negated concepts (antonym).
- Hierarchy: the hierarchy of the lexicon suggests some ontological relationships.

Conventional Methods for Constructing Lexicon Trees
- Construction by clustering:
  1. Collect words from a large corpus.
  2. Evaluate word association as a distance (or closeness) measure for all word pairs.
  3. Use clustering criteria to build the lexicon hierarchy.
  4. Adjust the hierarchy and assign semantic/sense tags to the nodes of the lexicon tree, thus assigning sense tags to the members of each node.

Clustering Methods for Constructing Lexicon Trees
[Figure: the word lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) are clustered into a binary hierarchy, grouping B with C under A1 and D with E under A2, with A0 at the root.]

Clustering Methods for Constructing Lexicon Trees
- Disadvantages:
  - They do not take advantage of the hierarchical information of the document tree (it is flattened when collecting words).
  - The word association and clustering criteria are not related directly to human perception:
    - Most clustering algorithms conduct binary merging (or division) in each step, for simplicity.
    - The automatically generated semantic hierarchy may not reflect human perception.
    - Hierarchy boundaries are not clearly and automatically detected.
    - Adjusting the hierarchy may not be easy (since human perception is not used to guide the clustering).
  - Pairwise association evaluation is costly.

Hierarchical Information Loss when Collecting Words
[Figure: the directory tree over the word lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) is flattened into a single bag of words, losing the hierarchical grouping.]

Clustering Methods for Constructing Lexicon
Trees
[Figure: the same binary clustering hierarchy, annotated with open questions: does it reflect human perception? why binary merges? is the resulting hierarchy meaningful?]

Alternative View for Constructing Lexicon Trees
- Construction by retaining DSWs:
  - Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which has already been mildly confirmed by the webmasters.
  - Associate each node with DSWs as members, and tag each DSW with the directory/domain name.
  - Optionally adjust the tree hierarchy and the members of each node.

Constructing Lexicon Trees by Preserving DSWs
[Figure: a document tree whose words are marked O (+DSW) or X (-DSW).]

Constructing Lexicon Trees by Preserving DSWs
[Figure: removing the X-marked (non-DSW) words from each node leaves a lexicon tree of DSWs with the same hierarchy.]

Constructing Lexicon Trees by Preserving DSWs
- Advantages:
  - Hierarchy adjustment could be easier, if necessary, since the directory hierarchy reflects human perception and the directory names are highly correlated to sense tags.
  - A domain-based model can be used when sense-tagged corpora are not available.
  - Pairwise word association evaluation is replaced by computing "domain specificity" against the domains: O(|W|×|D|) instead of O(|W|×|W|).
- Requirements: a well-organized web site, and a method for mining DSWs from such a site.

Constructing Lexicon Trees by Preserving DSWs
[Figure: the preserved hierarchy provides candidate semantic relationships: membership within a node (closeness, similarity), synonym/antonym links among members, and is_a/hypernym links between a node and its parent, e.g., "B is_a X (or A1)", "Y is_a X?".]

Alternative View for Constructing Lexicon Trees
- Benefits:
  - No similarity computation: closeness (incl.
    similarity) is already implicitly encoded by the human judges.
  - No binary clustering: the clustering has already been done (implicitly) by human judgment.
  - Hierarchical links available: some well-developed relationships are already in place, although not perfect.

Proposed Method for Mining the Web Hierarchy as a Large Document Tree
- Assume each document was generated by applying DSWs to some generic document templates.
- Remove the non-specific words from the documents, leaving a lexicon tree with DSWs associated with each node:
  - leaving only the domain-specific words,
  - forming a lexicon tree from the document tree,
  - labeling the domain-specific words.
- Characteristics: associated words are found by measuring their domain specificity against a known and common domain, instead of measuring pairwise association plus clustering.

Mining Criteria: Cross-Domain Entropy
- Domain-independent terms tend to be distributed evenly across all domains. This distributional "evenness" can be measured with the Cross-Domain Entropy (CDE), defined as:
  H*_i = H*(w_i) = -Σ_j P_ij log P_ij,  with P_ij = f_ij / Σ_j f_ij
  where P_ij is the probability of word w_i in domain d_j, and f_ij is its normalized frequency.

Mining Criteria: Cross-Domain Entropy
- Example: w_i = "piston", with frequencies (normalized to [0,1]) in various domains:
  f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
  The word is domain-specific (unevenly distributed), concentrated in the 2nd and the 4th domains.

Mining Algorithm – Step 1
- Step 1 (Data Collection): Acquire a large collection of web documents using a web spider, preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages.

Mining Algorithm – Step 2
- Step 2 (Word Segmentation or Chunking): Identify word (or compound-word) boundaries in the documents, by applying a word segmentation process such as (Chiang 92; Lin 93) to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest.
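The cross-domain entropy defined above is easy to compute. The following is a minimal Python sketch (the function name `cross_domain_entropy` and the evenly distributed `common` word are illustrative assumptions; the "piston" frequencies are taken from the example slide):

```python
import math

def cross_domain_entropy(freqs):
    """H*(w_i) = -sum_j P_ij * log(P_ij), with P_ij = f_ij / sum_j f_ij."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log(p) for p in probs)

# "piston" from the example slide: concentrated in the 2nd and 4th domains
piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]
# an assumed domain-independent word: evenly spread over 5 domains
common = [0.2, 0.2, 0.2, 0.2, 0.2]

print(cross_domain_entropy(piston))  # low entropy: domain-specific
print(cross_domain_entropy(common))  # maximal entropy (= log 5): domain-independent
```

A uniform distribution over |D| domains yields the maximal entropy log |D|, so a threshold between 0 and log |D| separates specific from non-specific terms.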
Mining Algorithm – Step 3
- Step 3 (Acquiring Normalized Term Frequencies for All Words in the Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all its documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size N_j = Σ_i n_ij of that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list over all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j.

Mining Algorithm – Step 3
- Input: <w_i, d_j, f_ij> (word, domain, normalized frequency) triples, where:
  n_ij: frequency of w_i in domain d_j
  N_j = Σ_i n_ij: number of words in domain d_j
  f_ij = n_ij / N_j: normalized frequency of w_i in domain d_j

Mining Algorithm – Step 4
- Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms which are distributed evenly across all domains, i.e., terms with a large Cross-Domain Entropy (CDE):
  H*_i = H*(w_i) = -Σ_j P_ij log P_ij,  with P_ij = f_ij / Σ_j f_ij
- Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2).
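Steps 3 and 4 can be sketched end to end as a small Python pipeline (an illustrative sketch, not the original implementation: the `counts` toy data, the `mine_dsws` name, and the `cde_threshold` value are assumptions for the example; the top-k retention follows the "top-1 and top-2" remark above):

```python
import math
from collections import defaultdict

def mine_dsws(counts, cde_threshold=0.5, top_k=1):
    """counts: {word: {domain: n_ij}} raw occurrence counts.
    Step 3: f_ij = n_ij / N_j, with N_j = sum_i n_ij (total size of domain j).
    Step 4: remove words whose cross-domain entropy exceeds the threshold;
    retain the rest in their top-k domains by normalized frequency."""
    # Step 3: normalized term frequencies per domain
    domain_size = defaultdict(int)
    for per_domain in counts.values():
        for d, n in per_domain.items():
            domain_size[d] += n
    f = {w: {d: n / domain_size[d] for d, n in pd.items()}
         for w, pd in counts.items()}

    # Step 4: cross-domain entropy filter
    dsws = defaultdict(list)  # domain -> retained domain-specific words
    for w, fd in f.items():
        total = sum(fd.values())
        probs = [x / total for x in fd.values() if x > 0]
        cde = -sum(p * math.log(p) for p in probs)
        if cde >= cde_threshold:
            continue  # evenly distributed: treat as domain-independent
        for d in sorted(fd, key=fd.get, reverse=True)[:top_k]:
            dsws[d].append(w)
    return dict(dsws)

# toy counts: "pitcher" and "engine" are domain-specific, "the" is not
counts = {
    "pitcher": {"baseball": 40, "car": 1},
    "the":     {"baseball": 50, "car": 45, "finance": 60},
    "engine":  {"car": 30, "baseball": 2},
}
print(mine_dsws(counts))  # {'baseball': ['pitcher'], 'car': ['engine']}
```

Note that this replaces O(|W|×|W|) pairwise association scores with one O(|W|×|D|) pass over (word, domain) frequencies, as argued in the earlier slides.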
Experiments
- Domains: news articles from a local news site.
  - 138 distinct domains, including the leaf nodes of the directory tree and their parents; leaves with the same name are considered the same domain.
  - Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星+花園, "Meteor Garden"), finance, food (大魚大肉 "lavish meals", 干貝 "dried scallop", 木耳 "wood-ear fungus", 錫箔紙 "foil", …), …
- Size: 200M bytes (HTML files); 16K+ unique words after word segmentation.

Domains (Hierarchy Not Shown)
afternoon-news entertainment ilan listed-elec personal taiwan-china all-baseball europe important local-scene pintung taoyuan america-topic europe2 important2 lotto pl tax-law autumn family important3 main politics ti basketball finance important4 mainland public-forum todaynews bnext fish important5 management readexcellent topic broadcasting-tv focus infotech medical-news readtopic topic2 buybooks focusnews insurance medical shopping trade car food interest-prose miaoli sitemap travel card fund-futures internal-sport middle-taiwan sitemap_title travelwindow changhwa game international-sport middlesouth-taiwan social-forum udn-supplement chiayi global international miscellaneous society udn college golf internet mixtravel south-taiwan udnbw communication happy_worker japan movie special ue culture hardware kaoshiung-city music sport usa-stock daily health-care kaoshiung-sentry nantou star world-econ day_starnews health-club keelung national-travel stock writers digital hot-news life-topic02 newbooks taichung-city yunlin domestic hot-topic life-topic03 north-taiwan taichung-sentry dswa.crp <root> hot-topic2 life-topic1 opinion taiex east-taiwan hot-topic3 life otc tainan ec hot lifestyle out-activity taipei-city edict hsinchu life_newtopic oversea-star taipei-sentry edu hwalen listed-co performance taitung

Sample Output (4 Selected Domains)
baseball | broadcast-TV | basketball | car
日本職棒 | 有線電視 | 一分 | 千西西
棒球賽 | 東風 | 三秒 | 小型車
熱身 | 開工 | 女子組 | 中古
運動 | 節目中 | 包夾 | 引擎蓋
場次 | 廣電處 | 外線 | 水箱
價碼 | 收視 | 犯規 | 加裝
球團 | 和信 | 投籃 | 市場買氣
部長 | 新聞局 | 男子組 | 目的地
練球 | 開獎 | 防守 | 交車
興農 | 頻道 | 冠軍戰 | 同級
球場 | 電視 | 後衛 | 合作開發
投手 | 電影 | 活塞 | 安全系統
球季 | 熱門 | 國男 | 行李
賽程 | 影視 | 華勒 | 行李廂
太陽 | 娛樂 | 費城 | 西西
Table 1. Sampled domain-specific words with low entropies.

Preliminary Results
- The domain-specific words and their assigned domain tags are well associated (e.g., "投手" (pitcher) is used specifically in the "baseball" domain), so extraction with the cross-domain entropy (CDE) metric is well founded.
- Domain-independent (or irrelevant) words, such as those in the webmaster's advertisements, are correctly rejected as DSW candidates because of their high cross-domain entropy.
- The DSWs are mostly nouns and verbs (open-class words).

Preliminary Results
- Low cross-domain entropy words (DSWs) within a domain are generally highly correlated (e.g., "日本職棒" (Japanese professional baseball) and "部長").
- New usages of words, such as "活塞" (Pistons) in the "basketball" sense, can also be identified.
- Both properties are good for WSD tasks that use the DSWs as contextual evidence.

Error Sources
- A single CDE metric may not be sufficient to capture all characteristics of "domain specificity":
  - Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0), probably due to low occurrence counts (a kind of estimation error).
  - Type I error: some multiple-sense words may have too many senses and thus be mis-recognized as non-specific in every domain (although each sense is unique in its respective domain).

Error Sources
- The "well-organized website" assumption may not hold all the time:
  - The hierarchical directory tags may not be appropriate representatives of the document words within a website.
  - The hierarchies may not be consistent from website to website.

Future Works
- Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]:
  - e.g., other term-weighting metrics;
  - e.g., a stop-list acquisition metric for identifying common words (for Type II errors).
- Explore methods and criteria to adjust the hierarchy of a single directory tree.
- Explore methods to merge directory trees from
  different sites.

Concluding Remarks
- A simple metric for the automatic/semi-automatic identification of DSWs:
  - at low sense-tagging cost: rich web resources are almost free, with implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free);
  - a simple method to build semantic links and degrees of closeness among DSWs;
  - may be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets;
  - a good knowledge source for WSD-related applications: WSD, machine translation, document classification, anti-spamming, …

Thanks for your attention!!