DYNAMIC BUILDING OF DOMAIN SPECIFIC LEXICONS USING EMERGENT SEMANTICS
Final Presentation
Matt Selway 100079967
Supervisor: Professor Markus Stumptner
Knowledge and Software Engineering Laboratory
School of Computer and Information Science
CONTENTS
Motivations and Goals
Research Questions
Method
Experiments and Results
Summary and Conclusions
Limitations and Future Work

MOTIVATIONS AND GOALS

Kleiner et al. (2009) developed a very different approach to Natural Language Processing (NLP)
  Treat NLP as a Model Transformation problem
  Utilise Configuration as a model transformation
Model transformation is the process of taking input models and creating output models from them
  Foundation of Model Driven Engineering
Configuration is a constraint-based search technique
  In this case the constraints are conformance to the desired metamodel
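To make the idea of configuration as constraint-based search concrete, here is a minimal Python sketch. It is not the implementation of Kleiner et al.: the `LEXICON` contents, the `ALLOWED_PATTERNS` stand-in for a metamodel, and the `configure` function are hypothetical names chosen for illustration. The sketch enumerates the candidate category assignments for each word and keeps only those that conform to the toy metamodel.

```python
from itertools import product

# Hypothetical lexicon: each word maps to its candidate lexical categories.
LEXICON = {
    "the": ["article"],
    "customer": ["noun"],
    "orders": ["noun", "verb"],
    "product": ["noun"],
}

# Toy stand-in for a metamodel (an assumption for illustration):
# the set of category sequences that a valid sentence may have.
ALLOWED_PATTERNS = {
    ("article", "noun", "verb", "article", "noun"),
}


def conforms(categories):
    """Constraint check: does the assignment conform to the toy metamodel?"""
    return tuple(categories) in ALLOWED_PATTERNS


def configure(words):
    """Search all category assignments and return those that conform."""
    candidates = [LEXICON[word] for word in words]
    return [list(choice) for choice in product(*candidates) if conforms(choice)]


sentence = ["the", "customer", "orders", "the", "product"]
solutions = configure(sentence)
print(solutions)  # [['article', 'noun', 'verb', 'article', 'noun']]
```

The search can only start once every word has candidate categories, which is why an approach of this kind depends on a lexicon.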
MOTIVATIONS AND GOALS

Overview of Process (Kleiner et al. 2009)

Method shows promising results
However, it requires the use of a predefined lexicon
MOTIVATIONS AND GOALS
Issues for practical applications:
1. Manually building a complete lexicon can take a long time, even for a specific domain
2. Predefined lexicon is static
3. Reduces level of automation
MOTIVATIONS AND GOALS
Short-range Goals:
1. At least partially automated creation of domain specific lexicons directly from the input text and external resources to retrieve lexical data
2. Make updates a natural part of the system
3. Allow sharing/reuse of lexical information
Long-range Goals:
1. Improve the automated analysis of specifications
2. Support research into semantic interoperability
3. Develop global agreement on lexicons/ontologies
RESEARCH QUESTIONS

Can we reduce or eliminate the need to manually
predefine a lexicon by dynamically building a
lexicon based on the input text?

How much of a reduction can be gained?

How well does it work? (i.e. accuracy of retrieved
data, how much data is automatically retrieved)

What are its limitations?
METHOD
Developed an experimental system
Attempted to use emergent semantics and semiotic dynamics in a similar way to that described by Steels and Hanappe (2006) for the interoperability of collective information systems
They propose a multi-agent system that uses communication to arrive at an agreement on the meaning of the data, its tags, and its categories
They take advantage of the semiotic triad between data, tags, and categories in user taxonomies (e.g. bookmarks in a web browser)
  A semiotic triad implies a meaningful relationship between its three components
METHOD
[Figure: Basic semiotic triad (Steels & Hanappe, 2006)]
Similarly, there exists a semiotic triad between a word, its use, and the domain it is used in.
The idea is that this triad can be used to dynamically develop domain specific lexicons between information agents.

METHOD (DESIGN)
Multi-agent System
Lexical information retrieved from other agents
Initial data downloaded from online sources
User feedback adjusts the retrieved data
Agents update their lexicons and associations to lexicons based on user feedback (using the semiotic relationship)
  Many changes indicate that the agents are actually using different domains
  Few changes indicate updates to the lexicon in the same domain
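The feedback-driven update can be sketched in Python as follows. This is only an illustration under stated assumptions, not the experimental system itself: the class name `LexiconLink`, the method names, and all numeric values (initial strength, reward, penalty, threshold) are hypothetical. The slides do not give the actual strength adjustment or threshold values.

```python
from dataclasses import dataclass, field

# Illustrative values only (assumptions, not taken from the project).
INITIAL_STRENGTH = 0.5
REWARD = 0.05
PENALTY = 0.1
SAME_DOMAIN_THRESHOLD = 0.25


@dataclass
class LexiconLink:
    """An agent's association to a lexicon obtained from another agent."""
    strength: float = INITIAL_STRENGTH
    entries: dict = field(default_factory=dict)  # word -> lexical category

    def apply_feedback(self, word, retrieved, corrected):
        """Record the user's answer and adjust the association strength."""
        if corrected == retrieved:
            # User accepted the retrieved data: strengthen the association.
            self.strength = min(1.0, self.strength + REWARD)
        else:
            # User changed the retrieved data: weaken the association.
            self.strength = max(0.0, self.strength - PENALTY)
        self.entries[word] = corrected

    def same_domain(self):
        """Many changes push the strength below the threshold."""
        return self.strength >= SAME_DOMAIN_THRESHOLD


link = LexiconLink()
link.apply_feedback("order", retrieved="noun", corrected="verb")
print(link.same_domain())  # True: a single change is treated as an update

for word in ("rental", "branch"):
    link.apply_feedback(word, retrieved="verb", corrected="noun")
print(link.same_domain())  # False: repeated changes suggest another domain
```

In this sketch a single correction is treated as an ordinary lexicon update, while repeated corrections drive the strength below the threshold and signal that the two agents are probably working in different domains.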

METHOD (ONLINE SOURCES)
Surveyed online lexicons/ontologies (CYC, WordNet, EDR) and dictionaries (Oxford, ‘The Free Dictionary’, ‘Your Dictionary’)
Excluded CYC, WordNet, EDR as not suitable
Turned to standard online dictionaries
Official dictionaries (Oxford/Harvard) not suitable (access requires payment)
Discovered ‘The Free Dictionary’
  Large number of entries
  Enough detail in definitions (Transitive/Intransitive Verbs, Definite/Indefinite Articles, etc.)
  Reasonably standard pages for parsing

METHOD (LEXICON)
METHOD (AGENT COMMUNICATION)
EXPERIMENTS AND RESULTS
[Figure: Percentage of Words with Retrieved Data, for Proposal Words and SBVR Words; data labels: 82.35% and 91.67%]
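A metric of this kind can be computed as in the following Python sketch. The function name `coverage` and the toy word list are assumptions for illustration; they are not the word lists or the code used in the experiments.

```python
def coverage(words, lexicon):
    """Return the percentage of words with at least one retrieved category."""
    if not words:
        return 0.0
    found = sum(1 for word in words if lexicon.get(word))
    return 100.0 * found / len(words)


# Toy data, not the Proposal or SBVR word lists.
word_list = ["customer", "rental", "car", "branch"]
retrieved = {
    "customer": ["noun"],
    "rental": ["noun", "adjective"],
    "car": ["noun"],
    "branch": [],  # looked up, but nothing was retrieved
}

print(f"{coverage(word_list, retrieved):.2f}%")  # prints "75.00%"
```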
EXPERIMENTS AND RESULTS
[Figure: Number of Categories per Word (Mode, Median, Ave., Max, Min), for Proposal Word List and SBVR Word List; axis: 0 to 6]
EXPERIMENTS AND RESULTS
[Figure: Frequencies for No. Categories per Word (0 to 5 categories), for Proposal Word List and SBVR Word List; axis: 0% to 70%]
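The per-word category statistics and their frequency distribution could be derived from a lexicon along the following lines. This is a sketch using Python's standard `statistics` and `collections` modules; the function names and the toy lexicon are hypothetical and do not reproduce the experimental data.

```python
from collections import Counter
from statistics import mean, median, mode


def category_stats(lexicon):
    """Summarise the number of categories per word (lexicon must be non-empty)."""
    counts = [len(categories) for categories in lexicon.values()]
    return {
        "mode": mode(counts),
        "median": median(counts),
        "average": mean(counts),
        "max": max(counts),
        "min": min(counts),
    }


def category_frequencies(lexicon):
    """Return, for each category count, the percentage of words with that count."""
    counts = Counter(len(categories) for categories in lexicon.values())
    total = sum(counts.values())
    return {n: 100.0 * k / total for n, k in sorted(counts.items())}


# Toy lexicon for illustration only.
toy_lexicon = {
    "customer": ["noun"],
    "orders": ["noun", "verb"],
    "the": ["article"],
    "rental": ["noun", "adjective"],
    "is": ["verb"],
}

stats = category_stats(toy_lexicon)
frequencies = category_frequencies(toy_lexicon)
print(stats["mode"], stats["median"], stats["max"], stats["min"])  # 1 1 2 1
print(frequencies)  # {1: 60.0, 2: 40.0}
```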
EXPERIMENTS AND RESULTS
[Figure: Correct and Additional Categories (Additional Cat., Words w/ Additional Cat., Words w/ Correct Cat.), for Proposal Word List and SBVR Word List; axis: 0% to 90%]
SUMMARY AND CONCLUSIONS
It works!
How well?
  A high percentage of words had data retrieved; however, too much unnecessary data reduces the effectiveness
Accuracy is impacted by many factors
  Incomplete/incorrect parsing of the web page
  Small SBVR specification sample
  SBVR keywords
Believe it is worth pursuing and improving
  Fix parsing, use multiple sources
  Define keyword lexicons, dynamically generate the rest
  Fill in gaps/cull using words with only one category
  Etc.
LIMITATIONS AND FUTURE WORK
Choice of dictionary
Potentially use multiple data sources
Joint words, i.e. most SBVR keywords
Implementation not perfect
Parsing of the data source
No synonyms
Communication Protocol
Errors in adjusting association strengths
Strength adjustment values and threshold values used for lexicon classifiers need more research to find more appropriate values
Etc.

REFERENCES
Kleiner, M, Albert, P & Bézivin, J 2009, ‘Configuring Models for (Controlled) Languages’, in Proceedings of the IJCAI–09 Workshop on Configuration (ConfWS–09), Pasadena, CA, USA, pp. 61-68.
Farlex 2010, The Free Dictionary, viewed 11 September 2010, <www.thefreedictionary.com>.
Steels, L & Hanappe, P 2006, ‘Interoperability Through Emergent Semantics: A Semiotic Dynamics Approach’, in Journal on Data Semantics VI, vol. 4090, Springer Berlin / Heidelberg, pp. 143-167.
