Ontology-Driven Automatic Entity
Disambiguation in Unstructured Text
Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar
5th International Semantic Web Conference
Athens, GA, Nov. 5 – 9, 2006
Acknowledgement: NSF-ITR-IDM Award #0325464
‘SemDIS: Discovering Complex Relationships in the Semantic Web’
The Question is …
• How to determine the most likely match of
a named-entity in unstructured text?
• Example:
Which “A. Joshi” is this text referring to?
• out of, say, 20 candidate entities (in a populated
ontology)
“likely match” = confidence score
• Idea is to spot entity names in text and
assign each potential match a confidence
score
• The confidence score represents a degree
of certainty that a spotted entity refers to
a particular object in the ontology
Our Approach, three steps
1. Spot Entity Names
   – assign initial confidence score
2. Adjust confidence scores using:
   – proximity relationships (text)
   – co-occurrence relationships (text)
   – connections (graph)
   – popular entities (graph)
3. Iterate to propagate results (see the sketch below)
   – finish when confidence scores are no longer updated
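A minimal Python sketch of this loop, assuming hypothetical helpers spot() and adjust() that stand in for the spotting and scoring steps of the next slides (the names, types, and convergence test are illustrative only, not the paper's implementation):

  from typing import Callable, Dict

  Scores = Dict[str, float]                  # candidate URI -> confidence score

  def disambiguate(text: str,
                   spot: Callable[[str], Dict[str, Scores]],
                   adjust: Callable[[str, Scores, str], None]) -> Dict[str, str]:
      # Step 1: spot entity names and assign initial confidence scores
      candidates = spot(text)                # {spotted name: {uri: score}}
      # Steps 2-3: adjust scores using the clues below, iterate until stable
      while True:
          before = {name: dict(s) for name, s in candidates.items()}
          for name, scores in candidates.items():
              adjust(name, scores, text)     # proximity, co-occurrence, graph, popularity
          if candidates == before:           # confidence scores no longer updated
              break
      # Report the best-scoring ontology entity for each spotted name
      return {name: max(s, key=s.get) for name, s in candidates.items() if s}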
Spotting Entity Names
• Search document for entity names within the
ontology
• Each match becomes a “candidate entity”
• Assign initial confidence scores
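A toy version of the spotting step (the ontology fragment, URIs, and the flat initial score of 1.0 are made up for illustration):

  # Toy ontology fragment: URI -> names the entity is known by (illustrative only)
  ONTOLOGY_NAMES = {
      "http://example.org/person/joshi-1": ["A. Joshi", "Anupam Joshi"],
      "http://example.org/person/joshi-2": ["A. Joshi", "Aravind Joshi"],
  }

  def spot_entities(text):
      """Return {spotted name: {uri: initial confidence}} for every ontology
      name that occurs in the document; each match is a candidate entity."""
      candidates = {}
      for uri, names in ONTOLOGY_NAMES.items():
          for name in names:
              if name in text:
                  candidates.setdefault(name, {})[uri] = 1.0   # initial score
      return candidates

  print(spot_entities("PC members include A. Joshi and many others."))
  # {'A. Joshi': {'http://example.org/person/joshi-1': 1.0,
  #               'http://example.org/person/joshi-2': 1.0}}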
Using Text-proximity Relationships
• Relationships that can be expected to appear in near text-proximity of the entity
  – Measured in terms of character spaces (see the sketch below)
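One way this can be realized, sketched under assumptions (the window size, the boost value, and the use of an affiliation literal are illustrative, not the paper's actual parameters):

  def proximity_boost(text, spot_pos, related_literal, window=60, boost=0.5):
      """Raise confidence when a literal related to the candidate in the
      ontology (e.g., its affiliation) appears within a window of character
      spaces around the spotted name."""
      pos = text.find(related_literal)
      if pos != -1 and abs(pos - spot_pos) <= window:
          return boost
      return 0.0

  doc = "PC members include A. Joshi (UMBC) and many others."
  print(proximity_boost(doc, doc.find("A. Joshi"), "UMBC"))   # 0.5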
Using Co-occurrence Relations
• Similar to text-proximity, with the exception that proximity is not relevant
  – i.e., location within the document does not matter (see the sketch below)
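The co-occurrence case only asks whether the related literal appears anywhere in the document; a minimal sketch (the boost value is again an assumption):

  def cooccurrence_boost(text, related_literal, boost=0.25):
      """Raise confidence if the related literal occurs anywhere in the
      document, regardless of its distance from the spotted name."""
      return boost if related_literal in text else 0.0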
Using Popular Entities (graph)
• Intention: bias the choice of the right entity toward the most popular entity
• This should be used with care, depending on the domain
  – good for tie-breaking (sketched below)
• DBLP scenario: the entity with more papers
  – e.g., only two “A. Joshi” entities with >50 papers
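A possible tie-breaking rule in this spirit, with paper count as the popularity measure (the boost value and the tie test are assumptions for illustration):

  def popularity_tiebreak(scores, paper_counts, boost=0.1):
      """If the candidates for a spotted name are tied, nudge the one with
      the most papers (the 'popular' entity in the DBLP scenario)."""
      if scores and len(set(scores.values())) == 1:
          most_popular = max(scores, key=lambda uri: paper_counts.get(uri, 0))
          scores[most_popular] += boost

  scores = {"http://example.org/person/joshi-1": 1.0,
            "http://example.org/person/joshi-2": 1.0}
  popularity_tiebreak(scores, {"http://example.org/person/joshi-1": 120,
                               "http://example.org/person/joshi-2": 60})
  print(scores)   # joshi-1 is now slightly ahead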
Using Relations to other Entities
• Entities can be related to one another through their
collaboration network
– ‘neighboring’ entities get a boost in their confidence score
• i.e., propagation
– This is the ‘iterative’ step in our approach (see the sketch below)
  • It starts with the entities having the highest confidence score
– Example:
  “Conference Program Committee Members:”
  - Professor Smith
  - Professor Smith’s co-editor of a recent book
  - Professor Smith’s recently graduated Ph.D. advisee
  ...
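A rough sketch of this propagation step (the graph encoding, the confidence threshold, and the boost value are illustrative assumptions):

  def propagate(candidates, coauthors, threshold=1.5, boost=0.3):
      """candidates: {spotted name: {uri: score}}; coauthors: {uri: set of uris}.
      Entities already resolved with high confidence pass a boost to their
      graph neighbours that are still ambiguous candidates elsewhere."""
      resolved = {uri
                  for scores in candidates.values()
                  for uri, score in scores.items() if score >= threshold}
      changed = False
      for scores in candidates.values():
          for uri in scores:
              if uri not in resolved and resolved & coauthors.get(uri, set()):
                  scores[uri] += boost       # neighbour of a resolved entity
                  changed = True
      return changed                         # caller iterates while this is True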
In Summary, ontology-driven
• Using “clues”
– from the text where the entity appears
– from the ontology
Example: RDF/XML snippet of a person’s metadata
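A hypothetical FOAF-style fragment of the kind of person metadata meant here (URIs and property values are made up for illustration):

  <!-- illustrative only; not the actual ontology's schema or data -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:foaf="http://xmlns.com/foaf/0.1/">
    <foaf:Person rdf:about="http://example.org/person/joshi-1">
      <foaf:name>Anupam Joshi</foaf:name>
      <foaf:workplaceHomepage rdf:resource="http://www.cs.umbc.edu/"/>
    </foaf:Person>
  </rdf:RDF>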
Overview of System Architecture
Once no more iterations are needed
• Output of results: XML format
– URI
– Confidence score
– Entity name (as it appears in the text)
– Start and end position (location in the document)
• Can easily be converted to other formats
– Microformats, RDFa, ...
Sample Output
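A hypothetical record with the four fields listed on the previous slide (element names and values are invented for illustration, not the system's actual schema):

  <spottedEntity>
    <uri>http://example.org/person/joshi-1</uri>
    <confidence>2.4</confidence>
    <name>A. Joshi</name>
    <startPosition>118</startPosition>
    <endPosition>125</endPosition>
  </spottedEntity>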
Sample Output - Microformat
Evaluation: Gold Standard Set
• We evaluate our method using a gold
standard set of documents
– Randomly chose 20 consecutive posts from DBWorld
– Set of manually disambiguated documents
  • two humans validated the ‘right’ entity match
– We used precision and recall as the evaluation measures for our system
Evaluation, sample DBWorld post
Sample disambiguated document
Using DBLP data as ontology
• Converted DBLP’s bibliographic data to RDF
  – 447,121 authors
  – A SAX parser converts DBLP’s XML data to RDF (sketched below)
  – Created relationships such as “co-author”
  – Added:
    • Affiliations (for a subset of authors)
    • Areas of interest (for a subset of authors)
    • Spellings for international characters
• Lessons learned led us to create SwetoDblp (containing many improvements)
[SwetoDblp] http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
[DBLP] http://www.informatik.uni-trier.de/~ley/db/
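A minimal illustration of such a SAX pass, using Python's xml.sax to collect author names from the DBLP dump (only name extraction is shown; building RDF triples and co-author relationships is omitted):

  import xml.sax

  class AuthorHandler(xml.sax.ContentHandler):
      """Collect the text content of <author> elements from dblp.xml."""
      def __init__(self):
          super().__init__()
          self.in_author = False
          self.buffer = []
          self.authors = set()

      def startElement(self, name, attrs):
          if name == "author":
              self.in_author, self.buffer = True, []

      def characters(self, content):
          if self.in_author:
              self.buffer.append(content)

      def endElement(self, name):
          if name == "author":
              self.authors.add("".join(self.buffer).strip())
              self.in_author = False

  # handler = AuthorHandler()
  # xml.sax.parse("dblp.xml", handler)   # dblp.xml: the DBLP dump (not included here)
  # print(len(handler.authors), "distinct author names")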
Evaluation, Precision & Recall
• We define set A as the set of unique
names identified using the disambiguated
dataset (i.e., exact results)
• We define set B as the set of entities
found by our method
• A ∩ B represents the set of entities correctly identified by our method
Evaluation, Precision & Recall
• Precision is the
proportion of correctly
disambiguated entities
with regard to B
• Recall is the
proportion of correctly
disambiguated entities
with regard to A
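Written out, and with the counts from the results slide plugged in:
  Precision = |A ∩ B| / |B| = 602 / 620 ≈ 97.1%
  Recall    = |A ∩ B| / |A| = 602 / 758 ≈ 79.4%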
Evaluation, Results
• Precision and recall (compared to gold standard)
  Correct disambiguations:  602
  Found entities (B):       620
  Total entities (A):       758
  Precision:                97.1%
  Recall:                   79.4%
• Precision and recall on a per document basis:
[Chart: “Precision and Recall” per document; y-axis: Percentage (0-100), x-axis: Documents 1-20, with one line each for precision and recall]
Related Work
• Semex Personal Information Management:
– Disambiguation results are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities (much like our approach does)
– Takes advantage of a predictable structure
such as fields where an email or name is
expected to appear
• Our approach works with unstructured data
[Semex] Dong, Halevy, Madhavan, SIGMOD-2005
Related Work
• KIM
– Contains an entity recognition portion that
uses natural language processing
– Evaluations performed on human annotated
corpora
• SCORE Technology
(now, http://www.fortent.com/)
– Uses associations from a knowledge base, yet
implementation details are not available
(commercial product)
[KIM] Popov et al., ISWC-2003
[SCORE] Sheth et al., IEEE Internet Computing, 6(4), 2002
Conclusions
• Our method uses relationships between entities
in the ontology to go beyond traditional
syntactic-based disambiguation techniques
• This work is among the first to successfully use
relationships for identifying named-entities in
text without relying on the structure of the text
Future Work
• Improvements on spotting
– e.g., canonical names (Tim = Timothy)
• Integration/deployment as a UIMA component
  – allows analysis across a document collection
  – for applications such as semantic annotation and search
• Further evaluations
– Using different datasets and document sets
– Comparing with respect to other methods
– Determining the best contributing factor in disambiguation
– Measuring how far down the list we missed the ‘right’ entity
[UIMA] IBM’s Unstructured Information Management Architecture
Scalability, Semantics, Automation
• Usage of background knowledge in the
form of a (large) populated ontology
• Flexibility to use a different ontology, but,
– the ontology must ‘fit’ the domain
• It’s an ‘automatic’ approach, yet …
– Human defines threshold values (and some
weights)
References
1. Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide Web Conference, Edinburgh, Scotland (2006)
2. DBWorld. http://www.cs.wisc.edu/dbworld/ (accessed April 9, 2006)
3. Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proc. of SIGMOD, Baltimore, MD (2005)
4. Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proc. of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002) 1-10
5. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proc. of the 2nd International Semantic Web Conference, Sanibel Island, Florida (2003)
6. Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing semantic content for the Web. IEEE Internet Computing, 6(4) (2002) 80-87
7. Zhu, J., Uren, V., Motta, E.: ESpotter: Adaptive Named Entity Recognition for Web Browsing. 3rd Professional Knowledge Management Conference, Kaiserslautern, Germany (2005)
Evaluation datasets at: http://lsdis.cs.uga.edu/~aleman/publications/Hassell_ISWC2006/