Mining Domain Specific Words from Hierarchical Web Documents
Jing-Shin Chang (張景新)
Department of Computer Science & Information Engineering
National Chi-Nan (暨南) University
1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
jshin@csie.ncnu.edu.tw
CJNLP-04, 2004/11/10~15, City U., H.K.
TOC
- Motivation
- What are DSW's?
- Why DSW Mining? (Applications)
  - WSD with DSW's without a sense-tagged corpus
  - Constructing a hierarchical lexicon tree w/o clustering
  - Other applications
- How to Mine DSW's from Hierarchical Web Documents
- Preliminary Results
- Error Sources
- Remarks
3
Motivation
- "Is there a quick and easy (engineering) way to construct a large-scale WordNet or something like that ... now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it) ...?"
- ... this triggers a new view for constructing a lexicon tree with hierarchical semantic links ...
- ... DSW identification turns out to be a key to such construction ...
- ... and can be used in various applications, including DSW-based WSD without using sense-tagged corpora ...
4
What Are Domain Specific Words (DSW's)?
- Words that appear frequently in some particular domains:
  - (a) Multiple-sense words that are frequently used with special meanings or usages in particular domains
    - E.g., piston: "活塞" (piston, in mechanics) or "活塞隊" (the Pistons, in sports)
  - (b) Single-sense words that are used frequently in particular domains
    - Suggesting that some words in the current document might be related to this particular sense
    - Serving as "anchor words/tags" in the context for disambiguating other multiple-sense words
5
What to Do in DSW Mining
- DSW Mining Task
  - Find lists of words that occur frequently in the same domain, and associate with each list (and the words within it) a domain (implicit sense) tag
    - E.g., entertainment: 'singer', 'pop songs', 'rock & roll', 'Chang Hui-Mei' ('Ah-Mei'), 'album', ...
  - As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSW's
    - When applied to mining DSW's associated with each node of a hierarchical directory/document tree
    - Each node being annotated with a domain tag
6
DSW Applications (1)
- Technical term extraction:
  - W(d) = { w | w ∈ DSW(d) }, d ∈ {computer, traveling, food, ...}
10
DSW Applications (2)
- Generic WSD based on DSW's
  - argmax_s Σ_d P(s|d,W) P(d|W) = argmax_s Σ_d P(s|d,W) P(W|d) P(d)
  - Useful if a large-scale sense-tagged corpus is not available, which is often the case
- Machine translation
  - Helps select translation lexicon candidates
  - E.g., money bank (when used with "payment", "loan", etc.), river bank, memory bank (in PC, Intel, MS Windows domains)
11
DSW Applications
- Generic WSD based on DSW's (a sketch of this decision rule follows below)

  s* = argmax_s P(s | w_0, w_1^n)
     ≈ argmax_s P(w_0^n | s) P(s)                       [sense-based models]
     = argmax_s Σ_d P(s, d | w_0, w_1^n)
     ≈ argmax_s Σ_d P(s | d, w_0^n) P(d | w_0^n)
     = argmax_s Σ_d P(s | d, w_0^n) P(w_0^n | d) P(d)   [domain-based models]

  - The sum is over domains d where w_0 is a DSW
  - Sense-based models need sense-tagged corpora for training (*not widely available)
  - Implicitly domain-tagged corpora are widely available on the web
  - P(s | d, w_0^n) ≈ δ(d, w_0, s): almost deterministic ("one sense per context")
12
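A minimal sketch of how the domain-based decision rule above could be implemented, assuming the probability tables were estimated beforehand (e.g., from mined DSW lists); all function and parameter names here are illustrative, not from the talk:

def disambiguate(w0, context, senses, domains, p_s_given_dw, p_w_given_d, p_d):
    # s* = argmax_s sum_d P(s|d,w0) * P(w_0^n|d) * P(d), where P(w_0^n|d)
    # is approximated by a bag-of-words product over the context words.
    best_sense, best_score = None, float("-inf")
    for s in senses:
        score = 0.0
        for d in domains:                                # sum over domains where w0 is a DSW
            p_words = 1.0
            for w in [w0] + context:
                p_words *= p_w_given_d.get((w, d), 1e-9) # small floor for unseen words
            score += p_s_given_dw.get((s, d, w0), 0.0) * p_words * p_d[d]
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

Under the "one sense per context" assumption, p_s_given_dw can degenerate to a 0/1 indicator read directly off the DSW lists, so no sense-tagged corpus is needed.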
DSW Applications (3)
- Document classification
  - N-class classification based on DSW's
- Anti-spamming (two-class classification)
  - Words in spamming (uninteresting) mails vs. normal (interesting) mails help block spamming mails
  - Interesting domains vs. uninteresting domains
  - P(W|S)P(S) vs. P(W|~S)P(~S)
13
DSW Applications (3.a)
- Document classification based on DSW's (see the sketch below)
  - d: document class label
  - w_1^n: bag of words in the document
  - |D| = n >= 2: number of document classes

  d* = argmax_d P(d | w_1^n)
     ≈ argmax_d P(w_1^n | d) P(d)   [class-based models]

- Anti-spamming based on DSW's
  - |D| = n = 2 (two-class classification)
14
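The class-based rule is multinomial naive Bayes restricted to DSW features. A minimal sketch, assuming training documents arrive as (label, word-list) pairs; the class and method names are mine:

import math
from collections import Counter, defaultdict

class DSWClassifier:
    def fit(self, labeled_docs, dsw_vocab):
        self.vocab = dsw_vocab                        # keep only domain-specific words
        self.doc_counts = Counter()                   # for the prior P(d)
        self.word_counts = defaultdict(Counter)       # per-class word counts
        for label, words in labeled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(w for w in words if w in dsw_vocab)
        self.n_docs = sum(self.doc_counts.values())
        return self

    def predict(self, words):
        best, best_lp = None, float("-inf")
        for d in self.doc_counts:
            counts = self.word_counts[d]
            n_d = sum(counts.values())
            lp = math.log(self.doc_counts[d] / self.n_docs)   # log P(d)
            for w in words:
                if w in self.vocab:                           # sum_i log P(w_i|d)
                    lp += math.log((counts[w] + 1) / (n_d + len(self.vocab)))  # add-one smoothing
            if lp > best_lp:
                best, best_lp = d, lp
        return best

Anti-spamming is the |D| = 2 special case: the same argmax compares P(W|S)P(S) against P(W|~S)P(~S).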
DSW Applications (4)
- Building a large lexicon tree or WordNet lookalike (semi-)automatically from hierarchical web documents
  - Membership: semantic links among words of the same domain are close (context), similar (synonym, thesaurus), or negated concepts (antonym)
  - Hierarchy: the hierarchy of the lexicon suggests some ontological relationships
15
Conventional Methods for Constructing Lexicon Trees
- Construction by Clustering
  - Collect words in a large corpus
  - Evaluate word association as a distance (or closeness) measure for all word pairs
  - Use clustering criteria to build the lexicon hierarchy
  - Adjust the hierarchy and assign semantic/sense tags to the nodes of the lexicon tree
    - Thus assigning sense tags to the members of each node
16
Clustering Methods for Constructing Lexicon Trees
[Figure: four documents with word lists {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E} are clustered bottom-up into a binary tree: B and C merge under a node dominated by A1 (x2), D and E under A2 (x2), and both subtrees under the root dominated by A0 (x4).]
17
Clustering Methods for Constructing Lexicon Trees
- Disadvantages
  - Does not take advantage of the hierarchical information in the document tree (flattened when collecting words)
  - Word association & clustering criteria are not related directly to human perception
  - Most clustering algorithms conduct binary merging (or division) in each step for simplicity
    - The automatically generated semantic hierarchy may not reflect human perception
    - Hierarchy boundaries are not clearly & automatically detected
    - Adjustment of the hierarchy may not be easy (since human perception is not used to guide clustering)
  - Pairwise association evaluation is costly
18
Hierarchical Information Loss when Collecting Words
[Figure: the four documents {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E} sit under two subdirectories with word sets {A0 x2, A1 x2, B, C} and {A0 x2, A2 x2, D, E}; flattening everything into one bag {A0 x4, A1 x2, A2 x2, B, C, D, E} discards this hierarchy before clustering starts.]
19
Clustering Methods for Constructing Lexicon Trees
[Figure: the same binary tree induced from the four documents, annotated with open questions: do the clusters reflect human perception? Why must merging be binary? Where do the hierarchy boundaries lie?]
20
Alternative View for Constructing Lexicon Trees
- Construction by Retaining DSW's
  - Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters
  - Associate each node with DSW's as members, and tag each DSW with the directory/domain name
  - Optionally adjust the tree hierarchy and the members of each node
21
Constructing Lexicon Trees by Preserving DSW's
O: +DSW, X: -DSW
[Figure: a document tree in which each node's document is a word list such as O,O,O,O or O,X,O,X; O marks a domain-specific word and X a non-specific word.]
22
Constructing Lexicon Trees by Preserving DSW's
O: +DSW, X: -DSW
[Figure: the same tree after the X (non-DSW) words are removed; each node keeps only its O (DSW) members, turning the document tree into a lexicon tree.]
23
Constructing Lexicon Trees by Preserving DSW's
- Advantages
  - Hierarchy reflects human perception
    - Adjustment could be easier if necessary
  - Directory names are highly correlated with sense tags
    - The domain-based model can be used if sense-tagged corpora are not available
  - Pairwise word association evaluation is replaced by computation of "domain specificity" against domains
    - O(|W|x|W|) vs. O(|W|x|D|)
- Requirements:
  - A well-organized web site
  - Mining DSW's from such a site
24
Constructing Lexicon Trees by Preserving DSW's
[Figure: the preserved hierarchy viewed as a semantic network: a parent node X {A0 x4, A1 x2, A2 x2, B, C, D, E} and child nodes such as Y {A0 x2, A1 x2, B, C} and {A0 x2, A2 x2, D, E}, linked by is_a/hypernym-like relationships (Y is_a X?, B is_a X or A1?); words within a node carry membership links (closeness, similarity), including synonym and antonym relations.]
25
Alternative View for Constructing Lexicon Trees
- Benefits:
  - No similarity computation: closeness (incl. similarity) is already implicitly encoded by human judges
  - No binary clustering: clustering is already done (implicitly) with human judgment
  - Hierarchical links available: some well-developed relationships are already in place
    - Although not perfect ...
26
Proposed Method for Mining
- Web hierarchy as a large document tree
  - Each document was generated by applying DSW's to some generic document templates
- Remove non-specific words from the documents, leaving a lexicon tree with DSW's associated with each node
  - Leaving only domain-specific words
  - Forming a lexicon tree from a document tree
  - Label the domain-specific words
- Characteristics:
  - Get associated words by measuring domain specificity against a known and common domain, instead of measuring pairwise association plus clustering
28
Mining Criteria: Cross-Domain Entropy
- Domain-independent terms tend to be distributed evenly across all domains.
- Distributional "evenness" can be measured with the Cross-Domain Entropy (CDE), defined as follows:
  - P_ij: probability of word i in domain j
  - f_ij: normalized frequency

  H_i = H*(w_i) = -Σ_j P_ij log P_ij
  P_ij = f_ij / Σ_j f_ij
29
Mining Criteria: Cross-Domain Entropy
- Example (computed in the sketch below):
  - w_i = "piston", with frequencies (normalized to [0,1]) in various domains:
  - f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
  - Domain-specific (unevenly distributed) in the 2nd and the 4th domains
30
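A minimal sketch of the CDE computation, run on the frequency vector from this slide. The log base is unspecified on the slide; base 2 is assumed here, and the names are mine:

import math

def cross_domain_entropy(freqs):
    # H*(w) = -sum_j P_j log2 P_j, with P_j = f_j / sum_j f_j.
    # 0 for a word seen in a single domain; log2(#domains) for an even spread.
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]   # zero-frequency domains contribute nothing
    return -sum(p * math.log2(p) for p in probs)

f_piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]    # the "piston" example above
print(cross_domain_entropy(f_piston))             # ~1.01 bits, far below log2(5) ~ 2.32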
Mining Algorithm – Step 1
- Step 1 (Data Collection): Acquire a large collection of web documents using a web spider while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages (see the sketch below).
31
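A minimal sketch of the tag-stripping part of Step 1, using Python's standard html.parser module; the crawling itself is omitted, and the class name is illustrative:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects the text content of a page, discarding all markup.
    # (Script/style contents would need extra handling in practice.)
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def text(self):
        return " ".join(" ".join(self.chunks).split())

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

print(strip_tags("<html><body><h1>棒球</h1><p>投手 練球</p></body></html>"))  # 棒球 投手 練球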
Mining Algorithm – Step 2
- Step 2 (Word Segmentation or Chunking): Identify word (or compound word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest. (A baseline segmenter is sketched below.)
32
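(Chiang 92; Lin 93) are statistical segmenters; as a stand-in, here is a sketch of the classic forward maximum-matching baseline, which assumes a word list is available (this is not the method cited on the slide):

def fmm_segment(text, lexicon, max_len=6):
    # Forward maximum matching: at each position take the longest lexicon
    # word that matches; fall back to a single character.
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

lexicon = {"日本職棒", "投手", "練球"}
print(fmm_segment("日本職棒投手練球", lexicon))   # ['日本職棒', '投手', '練球']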
Mining Algorithm – Step 3
- Step 3 (Acquiring Normalized Term Frequencies for All Words in Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all the documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size, N_j = Σ_i n_ij, in that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list for all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j.
33
Mining Algorithm – Step 3
Input:
  <w_i, d_j, f_ij>: word, domain, normalized-frequency triple
where
  n_ij: frequency of w_i in domain d_j
  N_j = Σ_i n_ij: number of words in domain d_j
  f_ij = n_ij / N_j: normalized frequency of w_i in domain d_j
(A sketch of this step follows below.)
34
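A minimal sketch of Step 3, assuming the crawl is stored on disk with one segmented document (whitespace-separated words) per file and the directory name serving as the domain tag. For simplicity each directory is treated as one domain, whereas the talk also counts parent directories:

import os
from collections import Counter, defaultdict

def term_frequencies(root):
    # Returns {domain: {word: f_ij}} with f_ij = n_ij / N_j.
    counts = defaultdict(Counter)                  # n_ij per domain
    for dirpath, _, filenames in os.walk(root):
        domain = os.path.basename(dirpath)         # directory name = domain tag
        for name in filenames:
            with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                counts[domain].update(fh.read().split())
    freqs = {}
    for domain, c in counts.items():
        n_total = sum(c.values())                  # N_j = sum_i n_ij
        freqs[domain] = {w: n / n_total for w, n in c.items()}
    return freqs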
Mining Algorithm – Step 4
- Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms which are distributed evenly across all domains, that is, terms with a large Cross-Domain Entropy (CDE), defined as follows:

  H_i = H*(w_i) = -Σ_j P_ij log P_ij
  P_ij = f_ij / Σ_j f_ij

- Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2). (See the sketch below.)
35
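Putting Steps 3 and 4 together, a minimal sketch of the filtering step, reusing cross_domain_entropy from the earlier sketch; the threshold and top-k values are illustrative, not the ones used in the experiments:

def mine_dsw(freqs, cde_threshold=1.5, top_k=2):
    # freqs: {domain: {word: f_ij}} as produced by the Step 3 sketch.
    # Returns {domain: set of DSW's}: words with CDE below the threshold,
    # kept only in their top_k highest-frequency domains.
    domains = sorted(freqs)
    dsw = {d: set() for d in domains}
    vocabulary = {w for d in domains for w in freqs[d]}
    for w in vocabulary:
        f = [freqs[d].get(w, 0.0) for d in domains]
        if cross_domain_entropy(f) >= cde_threshold:
            continue                               # evenly spread: domain-independent
        ranked = sorted(domains, key=lambda d: freqs[d].get(w, 0.0), reverse=True)
        for d in ranked[:top_k]:
            if freqs[d].get(w, 0.0) > 0:
                dsw[d].add(w)
    return dsw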
Experiments
- Domains:
  - News articles from a local news site
  - 138 distinct domains
    - Including leaf nodes of the directory tree and their parents
    - Leaves with the same name are considered the same domain
  - Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星+花園 "Meteor Garden"), finance, food (大魚大肉 "lavish meals", 干貝 "dried scallops", 木耳 "wood ear fungus", 錫箔紙 "foil paper", ...), ...
- Size: 200M bytes (HTML files)
  - 16K+ unique words after word segmentation
42
Domains (Hierarchy not shown)
afternoon-news, all-baseball, america-topic, autumn, basketball, bnext, broadcasting-tv, buybooks, car, card, changhwa, chiayi, college, communication, culture, daily, day_starnews, digital, domestic, dswa.crp <root>, east-taiwan, ec, edict, edu, entertainment, europe, europe2, family, finance, fish, focus, focusnews, food, fund-futures, game, global, golf, happy_worker, hardware, health-care, health-club, hot-news, hot-topic, hot-topic2, hot-topic3, hot, hsinchu, hwalen, ilan, important, important2, important3, important4, important5, infotech, insurance, interest-prose, internal-sport, international-sport, international, internet, japan, kaoshiung-city, kaoshiung-sentry, keelung, life-topic02, life-topic03, life-topic1, life, lifestyle, life_newtopic, listed-co, listed-elec, local-scene, lotto, main, mainland, management, medical-news, medical, miaoli, middle-taiwan, middlesouth-taiwan, miscellaneous, mixtravel, movie, music, nantou, national-travel, newbooks, north-taiwan, opinion, otc, out-activity, oversea-star, performance, personal, pintung, pl, politics, public-forum, readexcellent, readtopic, shopping, sitemap, sitemap_title, social-forum, society, south-taiwan, special, sport, star, stock, taichung-city, taichung-sentry, taiex, tainan, taipei-city, taipei-sentry, taitung, taiwan-china, taoyuan, tax-law, ti, todaynews, topic, topic2, trade, travel, travelwindow, udn-supplement, udn, udnbw, ue, usa-stock, world-econ, writers, yunlin
43
Sample Output (4 Selected Domains)

baseball:     日本職棒, 棒球賽, 熱身, 運動, 場次, 價碼, 球團, 部長, 練球, 興農, 球場, 投手, 球季, 賽程, 太陽
broadcast-TV: 有線電視, 東風, 開工, 節目中, 廣電處, 收視, 和信, 新聞局, 開獎, 頻道, 電視, 電影, 熱門, 影視, 娛樂
basketball:   一分, 三秒, 女子組, 包夾, 外線, 犯規, 投籃, 男子組, 防守, 冠軍戰, 後衛, 活塞, 國男, 華勒, 費城
car:          千西西, 小型車, 中古, 引擎蓋, 水箱, 加裝, 市場買氣, 目的地, 交車, 同級, 合作開發, 安全系統, 行李, 行李廂, 西西

Table 1. Sampled domain specific words with low entropies.
44
Preliminary Results
- Domain-specific words and the assigned domain tags are well associated (e.g., "投手" (pitcher) is specifically used in the "baseball" domain.)
  - Extraction with the cross-domain entropy (CDE) metric is well founded.
  - Domain-independent (or irrelevant) words (such as those from webmasters' advertisements) are well rejected as DSW candidates due to their high cross-domain entropy
- DSW's are mostly nouns and verbs (open-class words)
46
Preliminary Results
- Low cross-domain-entropy words (DSW's) in the respective domains are generally highly correlated (e.g., "日本職棒" (Japanese pro baseball), "部長" (minister))
- New usages of words, such as "活塞" (Pistons) in the "basketball" sense, could also be identified
  - Both properties make the DSW's good contextual evidence for WSD tasks
47
Error Sources
- A single CDE metric may not be sufficient to capture all characteristics of "domain specificity"
  - Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0)
    - Probably due to low occurrence counts (a kind of estimation error)
  - Type I error: some multiple-sense words may have too many senses and thus be mis-recognized as non-specific in each domain (although the senses are unique in their respective domains)
48
Error Sources
- The "well-organized website" assumption may not hold all the time
  - The hierarchical directory tags may not be appropriate representatives of the document words within a website
  - The hierarchies may not be consistent from website to website
49
Future Works
- Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]
  - E.g., with other term-weighting metrics
  - E.g., a stop-list acquisition metric for identifying common words (for Type II errors)
- Explore methods and criteria to adjust the hierarchy of a single directory tree
- Explore methods to merge directory trees from different sites
50
Concluding Remarks
- A simple metric for automatic/semi-automatic identification of DSW's
  - At low sense-tagging cost
    - Rich web resources, almost free
    - Implicit semantic tagging implied by the directory hierarchy (imperfect hierarchy, but free)
- A simple method to build semantic links and degrees of closeness among DSW's
  - May be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets
- A good knowledge source for WSD-related applications
  - WSD, machine translation, document classification, anti-spamming, ...
51
Thanks for your attention!!
Thanks!!
52