WP 2: Semi-automatic metadata generation driven by Language Technology Resources
Lothar Lemnitzer
Project review, Utrecht, 1 Feb 2007
Our Background
• Experience in corpus annotation and
information extraction from texts
• Experience in grammar development
• Experience in statistical modelling
• Experience in eLearning
WP2 Dependencies
• WP 1: collection and preparation of LOs
• WP 3: WP 2 results are input to this WP
• WP 4: Integration of tools
• WP 5: Evaluation and validation
M12 Deliverables
Language Resources:
• > 200 000 running words per language
• With structural and linguistic annotation
• > 1000 manually selected keywords
• > 300 manually selected definitions
• Local grammars for definitory contexts
M12 Deliverables
Documentation:
• Guidelines for linguistic annotation
• Guidelines for keyword annotation
• Guidelines for annotation of definitions
• (Guidelines for evaluation)
M12 Deliverables
Tools:
• Prototype Keyword Extractor
• Prototype Glossary Candidate Detector
[Architecture diagram: user documents (PDF, DOC, HTML, SCORM, XML) pass through Convertor 1 and Convertor 2 into pseudo-structured SCORM documents and basic XML; the linguistic processor (lemmatizer, POS tagger, partial parser) produces linguistically annotated XML and keyword metadata; per-language lexicons (BG, CZ, DT, EN, GE, MT, PL, PT, RO) support crosslingual retrieval over the repository, glossary, and ontology, connected to the LMS and the user profile.]
Linguistically annotated learning objects
• Structural annotation: par, s, chunk, tok
• Linguistic annotation: base, ctag, msd attributes (Example 1)
• Specific annotation: marked term, defining text (see the sketch below)
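As a minimal sketch (assuming the par/s/chunk/tok element structure and the base/ctag/msd attributes listed above; the file name is hypothetical), the annotation can be read like this in Python:

import xml.etree.ElementTree as ET

# Iterate over all tokens of an annotated LO and print the surface
# form together with its lemma (base), coarse tag (ctag) and
# morphosyntactic description (msd).
tree = ET.parse("annotated_lo.xml")
for tok in tree.iter("tok"):
    surface = (tok.text or "").strip()
    print(surface, tok.get("base"), tok.get("ctag"), tok.get("msd"))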
Part of the DTD

<!ELEMENT markedTerm (chunk | tok | markedTerm)+>
<!ATTLIST markedTerm
    %a.ana;
    kw       (y|n)   "n"
    dt       (y|n)   "n"
    status   CDATA   #IMPLIED
    comment  CDATA   #IMPLIED>

<!ELEMENT definingText (chunk | tok | markedTerm)+>
<!ATTLIST definingText
    id        ID      #IMPLIED
    xml:lang  CDATA   #IMPLIED
    lang      CDATA   #IMPLIED
    rend      CDATA   #IMPLIED
    type      CDATA   #IMPLIED
    wsd       CDATA   #IMPLIED
    def       IDREF   #IMPLIED
    continue  CDATA   #IMPLIED
    part      CDATA   #IMPLIED
    status    CDATA   #IMPLIED
    comment   CDATA   #IMPLIED>
Linguistically annotated learning objects
Use:
• Linguistically annotated texts are input to the extraction tools
• Marked terms and defining texts are used as training material and/or as gold standard for the evaluation (see the sketch below)
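A hedged sketch of how such a gold standard could be collected, using the markedTerm element and its kw attribute from the DTD above (the function name and file handling are illustrative, not the project's actual tooling):

import xml.etree.ElementTree as ET

def gold_keywords(path):
    # Collect the text of every markedTerm annotated as a keyword
    # (kw="y") to serve as the gold standard for evaluation.
    tree = ET.parse(path)
    return {"".join(mt.itertext()).strip()
            for mt in tree.iter("markedTerm") if mt.get("kw") == "y"}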
Characteristics of keywords
1. Good keywords have a typical, non-random distribution within and across LOs
2. Keywords tend to appear more often at certain places in texts (headings etc.)
3. Keywords are often highlighted / emphasised by authors
Distributional features of keywords
We use the following metrics to measure "keywordiness" by distribution (a sketch follows this list):
• term frequency / inverse document frequency (tf*idf),
• residual inverse document frequency (RIDF), and
• an adjusted version of RIDF (adjusted by term frequency)
to model the inter-text distribution of keywords, and
• term burstiness
to model the intra-text distribution of keywords.
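A sketch of these metrics in their common textbook forms (tf*idf, Church and Gale's residual IDF, and burstiness as the average frequency of a term in the documents where it occurs); the project's exact formulas, including the tf-adjusted RIDF, may differ:

import math

def tfidf(term, doc, docs):
    # docs: list of documents, each a list of tokens
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def ridf(term, docs):
    # Residual IDF: observed IDF minus the IDF expected under a
    # Poisson model with the term's average rate.
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    theta = sum(d.count(term) for d in docs) / n  # mean occurrences per doc
    return -math.log2(df / n) + math.log2(1 - math.exp(-theta))

def burstiness(term, docs):
    # Average frequency of the term in the documents containing it.
    df = sum(1 for d in docs if term in d)
    cf = sum(d.count(term) for d in docs)
    return cf / df if df else 0.0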
Structural and layout features of keywords
We will use:
• knowledge of text structure to identify salient regions (e.g., headings)
• layout features of texts to identify emphasised words (Example 2)
Words with such features will be weighted higher.
Complex keywords
• Complex, multi-word keywords are relevant; the proportions differ markedly between languages
• The keyword extractor allows the extraction of n-grams of any length
• Evaluation showed that including bigrams or even trigrams improves the results; with larger n-grams the performance begins to drop
• The maximum keyword length can be specified as a parameter (see the sketch below)
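A minimal sketch of n-gram candidate generation with such a length parameter (names are illustrative, not the extractor's actual API):

def candidate_ngrams(tokens, max_len=3):
    # Yield every n-gram of the token sequence up to max_len words.
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# candidate_ngrams(["style", "of", "learning"], max_len=2) yields
# ("style",), ("of",), ("learning",), ("style", "of"), ("of", "learning")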
Complex keywords

Language    Single-word keywords    Multi-word keywords
German      91 %                    9 %
Polish      35 %                    65 %
Language settings for the keyword extractor
• Selection of single keywords is restricted to a few ctag categories and/or msd values (nouns, proper nouns, unknown words and, for most languages, some verbs)
• Multiword patterns are restricted with respect to the position of function words ("style of learning" is acceptable; "of learning behaviours" is not); see the sketch below
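A hedged sketch of these filters (the tag sets are illustrative placeholders, not the project's actual ctag inventory):

CONTENT_TAGS = {"N", "NP", "UNK", "V"}    # nouns, proper nouns, unknown words, some verbs
FUNCTION_TAGS = {"PREP", "ART", "CONJ"}   # function-word categories (assumed)

def keep_candidate(tagged):
    # tagged: list of (token, ctag) pairs for one candidate keyword.
    if len(tagged) == 1:
        return tagged[0][1] in CONTENT_TAGS
    # Multiword candidates may not start or end with a function word:
    # "style of learning" passes, "of learning behaviours" does not.
    return tagged[0][1] not in FUNCTION_TAGS and tagged[-1][1] not in FUNCTION_TAGS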
Output of Keyword Extractor
A list, ordered by "keywordiness" value, whose elements contain:
• the normalized form of the keyword
• (statistical figures)
• the list of attested forms of the keyword
(Example 3)
Evaluation strategy
We will proceed in three steps (a sketch of step 1 follows):
1. Manually assigned keywords will be used to measure precision and recall of the keyword extractor
2. Human annotators will judge results from the extractor and rate them
3. The same document(s) will be annotated by many test persons in order to estimate inter-annotator agreement on this task
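A minimal sketch of the set-based precision/recall computation of step 1 (function name is illustrative):

def precision_recall(extracted, gold):
    # extracted: keywords proposed by the extractor;
    # gold: manually assigned keywords for the same document.
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall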
Summary
With the keyword extractor,
1. we use several known statistical metrics in combination with qualitative and linguistic features,
2. we place special emphasis on multiword keywords,
3. we evaluate the impact of these features on the performance of these tools for eight languages,
4. we integrate this tool into an eLearning application, and
5. we have a prototype user interface to this tool.
Identification of definitory contexts
• Empirical approach based on the linguistic annotation of LOs
• Workflow:
  – Definitory contexts are searched for and marked in LOs
  – Recurrent patterns are characterized quantitatively and qualitatively (Example 4)
  – Local grammars are drafted on the basis of these recurrent patterns
  – Extraction of definitory contexts is performed by lxtransduce (University of Edinburgh, LTG)
Characteristics of local grammars
• Grammar rules match and wrap subtrees of the XML document tree
• A grammar rule refers to subrules which match substructures
• Rules can refer to lexical lists to constrain categories further
• The defined term should be identified and marked
Example
Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.
("A bold letter is a letter that is printed blacker than the other letters.")
The defined term ("Een vette letter") is matched by a rule for simple noun phrases:
<rule name="simple_NP" >
<seq>
<and>
<ref name="art"/>
<ref name="cap"/>
</and>
<ref name="adj" mult="*"/>
<ref name="noun" mult="+"/>
</seq>
</rule>
The copula "is" is matched by a query on tokens whose base form is "zijn" with an auxiliary/copula (hulpofkopp) morphosyntactic tag:
<query match="tok[@ctag='V' and @base='zijn'
and @msd[starts-with(.,'hulpofkopp')]]"/>
The defining phrase ("een letter ...") is matched by a general noun phrase rule:
<rule name="noun_phrase">
<seq>
<ref name="art" mult="?"/>
<ref name="adj" mult="*" />
<ref name="noun" mult="+" />
</seq>
</rule>
The full definitory pattern combines these subrules:
<rule name="is_are_def">
<seq>
<ref name="simple_NP"/>
<query match="tok[@ctag='V' and @base='zijn' and
@msd[starts-with(.,'hulpofkopp')]]"/>
<ref name="noun_phrase" />
<ref name="tok_or_chunk" mult="*"/>
</seq>
</rule>
Applied to the example sentence, is_are_def matches the whole definitory context.
Output of Glossary Candidate Detector
• an ordered list of words
• the defined term marked
• (a larger context: one preceding and one following sentence)
(Example 10)
Evaluation Strategy
We will proceed in two steps:
1. Manually marked definitory contexts will be used to measure precision and recall of the glossary candidate detector
2. Human annotators will judge results from the glossary candidate detector and rate their quality / completeness
Results

            Own LOs    Verbs only
Precision   21.5 %     34.1 %
Recall      34.9 %     29.0 %
Evaluation
Questions to be answered by a user-centered evaluation:
• Is there a preference for higher recall or for higher precision?
• Do users profit from seeing a larger context?
Integration of functionalities
[Architecture diagram: code and data for the keyword/definitory-context (KW/DC) tools, the ontology, and LOs live on a development server (CVS) and the ILIAS content portal; a migration tool pushes nightly updates to a Java web server (Tomcat) hosting the application logic, third-party tools, and the KW/DC/ontology Java classes and data. These functionalities are exposed as web services: the ILIAS server (nuSoap/Axis) uses them through SOAP to evaluate them in ILIAS, while a servlets/JSP user interface accesses them directly.]
User Interface Prototypes
[Screenshots of the user interface prototypes]