A Supervised Machine Learning Approach to Conjunction
Disambiguation in Named Entities
Paweł Mazur
(Wrocław University of Technology, Poland)
Pawel.Mazur@pwr.wroc.pl
and
Robert Dale
(Macquarie University, Sydney, Australia)
Robert.Dale@mq.edu.au
Analytics for Noisy Unstructured Text Data, Hyderabad, 08/01/2007
Agenda
• Conjunction in Named Entities
• Our approach
• Experiments
• Results of the experiment
• Results’ analysis
• Conclusions
• Further work
Conjunction in Candidate Named Entity String
Fujitsu Australia and New Zealand
Australia and New Zealand Banking Group Limited
Peter Smith and Ann Arbor Software Council
• Candidate named entity string:
– Any sequence of words starting with initial capitals
– Contains a single instance of the conjunction and or &
• In a sample of 45 of the 13,460 documents, 5.7% of candidate named entity strings
contained a conjunction; in some documents the frequency is as high as 23%;
in MUC-7 it is 4.5%
• Many candidate named entity strings in this domain contain company
names and person names
Our Approach - A Classification Task
We distinguish 4 categories of conjunction in a candidate NE string:
– Category A: Name Internal Conjunction
Copper Mines and Metals Limited
Herbert P Cooper & Son, Ernst and Young
– Category B: Name External Conjunction (the most common)
Proxy Form and Explanatory Memorandum
Hardware & Operating Systems
EchoStar and News Corporation
– Category C: Right-Copy Separator
William and Alma Ford, Connel and Bent Streets,
Eastern and Western Australia
– Category D: Left-Copy Separator
Hospital Equipment & Systems
J H Blair Company Secretary & Corporate Counsel
(Categories C and D could be seen as one linguistic category.)
Our Approach - Candidate NE String Pattern
String: Australia and New Zealand Banking Group Limited
Pattern: (Loc and Loc Org CompDesig)
String: Peter Smith and Ann Arbor Software Council
Pattern: (GivenName FamilyName and GivenName FamilyName Noun Org)
Patterns are created using gazetteers and simple keyword-based heuristics.
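To illustrate, here is a minimal Python sketch of this kind of pattern construction; the gazetteer entries and helper names are invented for the example, not the authors' actual resources:

```python
# Minimal sketch of gazetteer-based pattern construction.
# Gazetteer contents are illustrative only, not the authors' actual resources.
GAZETTEERS = {
    "Loc": {"Australia", "Zealand"},
    "GivenName": {"Peter", "Ann"},
    "FamilyName": {"Smith"},
    "Org": {"Group", "Council"},
    "CompDesig": {"Limited", "Ltd", "GmbH", "plc"},
}

def tag_token(token: str) -> str:
    """Map a token to a tag; unknown capitalised words fall back to InitCapped."""
    for tag, entries in GAZETTEERS.items():
        if token in entries:
            return tag
    return token if token in ("and", "&") else "InitCapped"

def make_pattern(string: str) -> list:
    """Token-level tagging; real patterns would also need multi-word lookup."""
    return [tag_token(tok) for tok in string.split()]

print(make_pattern("Australia and New Zealand Banking Group Limited"))
# ['Loc', 'and', 'InitCapped', 'Loc', 'InitCapped', 'Org', 'CompDesig']
```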
Tag Set
Tag         Count   Share
InitCapped    925  42.24%
Loc           245  11.19%
Org           175   7.99%
FamilyName    164   7.49%
CompDesig     138   6.30%
Initial       108   4.93%
CompPos        99   4.52%
GivenName      89   4.06%
Of             76   3.47%
Abbrev         73   3.33%
PersDesig      39   1.78%
Det            31   1.42%
Dir            12   0.55%
Son             7   0.32%
Month           6   0.27%
AlphaNum        3   0.14%
PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir, Madam, Messrs, and Jnr.

CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and many more, as well as
Investments Pty Ltd, Management Pty Ltd, Corporate Pty Ltd, Associates Pty Ltd,
Family Trust, Co Limited, Partners, Partners Limited, Capital Limited, and
Capital Pty Ltd.

CompPos: Director, Secretary, Manager, Counsel, Managing Director, Member,
Chairman, Chief Executive, Chief Executive Officer, and CEO, and also
some bodies within organizations, such as Board and Committee.
Data Encoding
Each instance is encoded with 33 attributes:
• 1 binary attribute per tag for each conjunct, signaling that tag's presence
in the conjunct (2 x 16 = 32 attributes in total)
• 1 binary attribute, ConjForm, encoding the lexical form of the
conjunction in the string (0 for &, 1 for and)
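As a sketch of this encoding (the attribute ordering is my own choice; only the 0-for-& and 1-for-and convention comes from the slide):

```python
# Sketch of the 33-attribute encoding described above.
# TAGS follows the tag set slide; the ordering here is my own choice.
TAGS = ["InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
        "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det",
        "Dir", "Son", "Month", "AlphaNum"]

def encode(pattern: list) -> list:
    """2 x 16 tag-presence bits (left and right conjunct) plus ConjForm."""
    conj = next(i for i, t in enumerate(pattern) if t in ("and", "&"))
    left, right = set(pattern[:conj]), set(pattern[conj + 1:])
    bits = [int(t in left) for t in TAGS] + [int(t in right) for t in TAGS]
    bits.append(1 if pattern[conj] == "and" else 0)  # ConjForm: 0 for &, 1 for and
    return bits

# "Australia and New Zealand Banking Group Limited" -> (Loc and Loc Org CompDesig)
vec = encode(["Loc", "and", "Loc", "Org", "CompDesig"])
assert len(vec) == 33 and vec[-1] == 1
```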
Corpus & Data Sets
• Corpus: 13,460 text documents, from 8 to 1,000 lines long
• Our corpus is a subcorpus drawn from a collection of company announcements from the
Australian Stock Exchange
• Selection of candidate named entity strings:
a sequence of initial-capped words with a single conjunction (and or &);
the lowercase words of, a, an, and the are also allowed (see the regex sketch below)
• We obtained a set of 10,925 strings, 6,437 of which are unique
• Hand elimination of strings wrongly identified due to typographic features of the
documents (tables)
• Random selection of 600 examples from the unique set
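A rough regular-expression rendering of this selection rule (my own sketch, not the authors' extraction code):

```python
import re

# An initial-capped token, optionally preceded by of/a/an/the.
WORD = r"(?:(?:of|a|an|the)\s+)?[A-Z][A-Za-z.'-]*"
# One or more such words, a single "and" or "&", then one or more such words.
CANDIDATE = re.compile(rf"(?:{WORD}\s+)+(?:and|&)\s+(?:{WORD}(?:\s+|$))+")

text = "He met Peter Smith and Ann Arbor Software Council yesterday."
match = CANDIDATE.search(text)
print(match.group(0).strip())  # Peter Smith and Ann Arbor Software Council
```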
Category distribution over the 600 examples:

Category       Count  Share
Name Internal    185  30.8%
Name External    350  58.3%
Right-Copy        39   6.5%
Left-Copy         26   4.3%
Sum              600   100%
Machine-learned Classifiers
• Naïve Bayes
• Multilayer Perceptron
• IBk
• K*
• Random Tree
• Logistic Model Trees (LMT)
• J4.8
• SMO

Implementations from WEKA (Waikato Environment for Knowledge Analysis),
developed at the University of Waikato, New Zealand.
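As a rough stand-in for the WEKA setup (IBk is essentially k-nearest neighbours), a comparable cross-validation run could be sketched with scikit-learn; the data below is a random placeholder, and the fold count and k are assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier  # stand-in for WEKA's IBk

# Placeholder data with the shapes from the encoding slide:
# 600 instances x 33 binary attributes, four category labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 33))
y = rng.choice(["Internal", "External", "Right-Copy", "Left-Copy"], size=600)

# 10 folds and k=1 are assumptions; the slides mention k-fold CV without details.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.4f}")
```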
Baseline
• Determined with the 0-R algorithm, which always assigns the most common category
(Name External): 58.33%
• A better baseline is given by the 1-R algorithm:
IF ConjForm=& THEN PredCat←Internal
IF ConjForm=and THEN PredCat←External
baseline = 69.83%
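In code, the 1-R baseline is just a direct transcription of this rule:

```python
def one_r(conj_form: str) -> str:
    """1-R baseline: the conjunction's lexical form alone decides the category."""
    return "Name Internal" if conj_form == "&" else "Name External"

# On the 600-example set this rule is right 419 times: 419 / 600 = 69.83%.
```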
Results
Algorithm                 Correctly classified
IBk                       504 (84.00%)
Random Tree               503 (83.83%)
K*                        501 (83.50%)
SMO (quadratic kernel)    494 (82.33%)
Multilayer Perceptron     493 (82.17%)
LMT                       487 (81.17%)
J4.8                      477 (79.50%)
SMO (linear kernel)       468 (78.00%)
Naïve Bayes               424 (70.67%)
Baseline                  419 (69.83%)
Accuracy by Conjunction Category
Category       Precision  Recall  F-Measure
Name Internal      0.814   0.876      0.844
Name External      0.872   0.897      0.885
Right-Copy         0.615   0.410      0.492
Left-Copy          0.800   0.462      0.585
Weighted mean      0.834   0.840      0.833
Confusion Matrix
(Rows: classified as; columns: actual category.)

                 Name Internal  Name External  Right-Copy  Left-Copy
Name Internal              162             28           6          3
Name External               18            314          17         11
Right-Copy                   4              6          16          0
Left-Copy                    1              2           0         12
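The per-category scores on the previous slide follow directly from this matrix; a quick check over the counts:

```python
# Rows: classified as; columns: actual category (same order in both).
labels = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]
matrix = [
    [162,  28,  6,  3],   # classified as Name Internal
    [ 18, 314, 17, 11],   # classified as Name External
    [  4,   6, 16,  0],   # classified as Right-Copy
    [  1,   2,  0, 12],   # classified as Left-Copy
]

for i, label in enumerate(labels):
    tp = matrix[i][i]
    precision = tp / sum(matrix[i])                  # row sum: classified as i
    recall = tp / sum(row[i] for row in matrix)      # column sum: actually i
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label}: P={precision:.3f} R={recall:.3f} F={f1:.3f}")
# Name Internal: P=0.814 R=0.876 F=0.844  -- matches the table
```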
Results Analysis: Conjunction Cat. Indicators
For Name External conjunction:
- Month & X
- X & Month
- CompDesig & X
- X & PersDesig
- X & GivenName
- X & Dir
- X & Det
- Abbrev & X
- X & Abbrev
For Name Internal conjunction:
- X & Son
(note: Sons of Gwalia Ltd and Gwalia Consolidated Ltd)
Error Analysis: InitCapped
38 of the 96 misclassified examples (~40%) had patterns consisting of InitCapped tags only.
In these cases the classification ended up being determined by the ConjForm attribute
alone (just as the baseline is).
There were 134 InitCapped-only patterns in the data set;
96 of them (71.64%) were classified correctly (compared to the overall baseline
result of 69.83%).
There were also 11 misclassified examples whose patterns consisted mainly of InitCapped tags. Example:
Australian Labor Party and Independent Members
Loc InitCapped Org and InitCapped InitCapped
Error Analysis: Long Patterns
In 2 cases the misclassification was due to the examples' long patterns:
Fellow of the Australian Institute of Geoscientists and The Australasian Institute of Mining
CompPos Of Det Loc Org Of InitCapped and Det Loc Org Of InitCapped
(Left-Copy, misclassified as Name Internal)
Fellow of the Royal College of Pathologists of Australasia and Chairman of Scientific Services
Limited
CompPos Of Det Org Of InitCapped Of Loc and CompPos Of InitCapped InitCapped CompDesig
(Name External, misclassified as Name Internal)
Error Analysis: Other Cases
• 2 cases of extended patterns, where a pattern is another (common) pattern plus an
additional tag:
WD & HO Wills Holdings Limited
Initial Initial & Initial Initial FamilyName CompDesig
(Name Internal)
vs
Initial Initial & Initial Initial FamilyName
(Right-Copy)
• A conjunction of a person name and a company name:
Wayne Jones and Topsfield Pty Ltd
– ambiguous even for humans without contextual information
• A conjunction of two person names: in our domain there is only one case where
this is of the name external type
• There are around 20 examples where it is difficult to judge the reason for
misclassification; perhaps the reason lies in the model we have built
• Influence of k-fold evaluation: different classifications for the same pattern in
different folds
Conclusions
• Distinguished 4 categories of conjunctions in NEs
• Presented the problem as one of classification
• Experimented with machine-learned classifiers
• Results: F = 0.833
• Simple tag set used
• Some examples are truly ambiguous even for humans
Further Work
• Multiple conjunctions
• Human-supervised N-gram-based preprocessing
• Abbreviation preprocessing
• Limit the number of InitCapped tags
• Take into account the syntactic number of tokens
• Use contextual information (e.g. the syntactic number of an associated verb)
• Extend the evaluation data
• Evaluation with full named entity recognition process