Automatic glossary detector for Bulgarian

(Semi)automatic glossary detection for Bulgarian
Laska Laskova
Linguistic Modelling Laboratory
Institute for Parallel Processing
Bulgarian Academy of Sciences
The project concerned
Language Technology for eLearning (http://www.lt4el.eu/)
Goal: improving the retrieval of learning material
Domain: Computer Science for non-computer scientists
For: Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian
Via: implementing multilingual language technology tools in Learning Management Systems (ILIAS):
• key word extractor
• glossary candidate detector
• semantic knowledge (ontologies)
26 September 2006
RANLP 2007
Content
• Introduction
• Preprocessing steps
• Definition. Definitory context types
• Grammar types
• Conclusion
Introduction

Bulgarian corpus:
54 learning objects (.html, .htm, .doc format), approx. 207 000 tokens:
• 12 LO: Calimera Guidelines, section 3
• 15 LO: Common MS Office programs tutorials
• 27 LO: Internet Technologies tutorials for beginners and advanced users

Problematic features:
• High number of non-standard words
• 32 % of the terms include English words
• Unusual text structure
Preprocessing steps

Non-standard words
Some examples:
TCP/IP
#FF00FF
.zip
"
Не-Е-Число ("Not-a-Number")
име_на_рамка ("frame_name")
getNodeName()

Problematic steps: tokenization, lemmatization
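To illustrate why such strings are hard to tokenize, here is a minimal Python sketch of a regular-expression tokenizer that keeps a few of the example forms together. The patterns and the function name `tokenize` are assumptions made for this illustration; they are not the tokenizer used in the project.

```python
import re

# Hypothetical patterns, tried in order: specific non-standard forms
# come before ordinary words and single punctuation characters.
TOKEN_RE = re.compile(
    r"""
    \#[0-9A-Fa-f]{6}\b        # hex colour code, e.g. #FF00FF
  | \.[A-Za-z0-9]+\b          # file extension, e.g. .zip
  | \w+(?:[/_-]\w+)+          # joined forms, e.g. TCP/IP, Не-Е-Число
  | \w+                       # ordinary word or number
  | \S                        # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping the non-standard forms above intact."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("TCP/IP #FF00FF .zip име_на_рамка Не-Е-Число"))
```

A naive split on punctuation would break each of these strings into several tokens, which in turn would mislead lemmatization.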

Unusual text structure
Problematic steps: sentence segmentation, definition mark-up
Definition. Definitory context types


Definitory context (definition):
• Canonical structure: definiendum – connector – definiens
• Canonical content: genus proximum – differentia

Some “departures from the norm”:
• Structure. No definiendum and no connector:
“Показва текущия адрес на документа, който разглеждате в Web.”
(“Shows the current address of the document you are browsing on the Web.”) (Address bar)
• Content. Definition by function (“what it does”). No genus proximum (and no connector):
“Рамките разделят страницата на отделни части, във всяка от които се показва друга страница.”
(“Frames divide the web page into parts, such that each part in turn displays another page.”)
Manually annotated definitory contexts: 767 (826 sentences)

Type         Definiendum    Connector      Manually annotated
is_def       NP/pronoun     aux 3sg/3pl    220
verb_def     NP/pronoun/    verb 3sg/3pl   404
layout_def                                 70
punct_def    NP             : - =          32

• Some definitions span more than one sentence / define one definiendum
• Pron_def type definitions (54 manually annotated) are included in the is_def and verb_def groups, depending on the connector
• Other_def type (59 manually annotated): definitions with various patterns, each with fewer than two occurrences
Grammar tool

The implemented tool: LXtransduce (Tobin, University of Edinburgh)
• Transducer which processes an XML input stream and rewrites it according to a set of rules provided in the grammar file
• Rules:
  • regular expressions
  • references to other rules
  • top-down grammar
Definitory context types
1. is_def
“XML language is a set of specific elements for standard text mark-up.”

Structure:
• NP is/are NP
• Coordinated NP is/are NP
• Pronoun is/are NP
Grammar
1. is_def

Some characteristics of the is_def grammar:
• 36 simple rules for matching basic tokens: N, Adj, Adv, “,”, “-”, etc.
• Rule for the verb “съм” (“to be”) in 3sg and 3pl present tense.
• Rules for 4 types of premodified NP structures.
• Complex rule for NP postmodifier.
• Rule for any kind of token, used in the main rule:

(NP)+ is/are (premodified NP (postmodifier)+)+ (ANY)+

Why bother with the rule for NP postmodifier? The simpler alternative would be:

(NP)+ is/are NP (ANY)+
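The contrast between the two rules can be sketched in Python by treating a sentence as a string of one-letter POS tags and each rule as a regular expression. Everything here is an assumption made for illustration: the tag alphabet (N noun, A adjective, B copula, P preposition, X other), the simplified NP and postmodifier patterns, and the names `FULL_RULE`, `SIMPLE_RULE` and `match_is_def`. It is not the LXtransduce grammar itself.

```python
import re

# Hypothetical, simplified building blocks over a one-letter tag alphabet.
NP = r"(?:A*N)+"             # premodified NP: adjectives followed by nouns
POSTMOD = rf"(?:P{NP})"      # postmodifier reduced to a prepositional phrase

# Main rule with postmodifiers, and the simpler variant without them.
FULL_RULE = re.compile(
    rf"^(?P<definiendum>{NP})B(?P<definiens>{NP}{POSTMOD}*).+$"
)
SIMPLE_RULE = re.compile(
    rf"^(?P<definiendum>{NP})B(?P<definiens>{NP}).+$"
)


def match_is_def(tags, rule=FULL_RULE):
    """Return (definiendum tags, definiens tags) or None if the rule fails."""
    m = rule.match("".join(tags))
    return (m.group("definiendum"), m.group("definiens")) if m else None


if __name__ == "__main__":
    # Hand-assigned tags for a sentence shaped like the XML example:
    # N N B N P A N P A N N X
    tags = list("NNBNPANPANNX")
    print(match_is_def(tags, FULL_RULE))
    print(match_is_def(tags, SIMPLE_RULE))
```

In this sketch both rules accept the sentence, but only the rule with postmodifiers assigns the whole noun phrase to the definiens; the simpler rule stops after the head noun and leaves the rest to the catch-all part. This is one possible reading of why the postmodifier rule matters, particularly for token-level matching.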
Definitory context types
2. verb_def
“Reversed” definition: “(We) call this place insertion point”

Structure:
• NP V NP
• (null subject) V NP
• Pronoun V NP
Grammar
2. verb_def

Some characteristics of the verb_def grammar:
• Open list of verb forms (3sg, 3pl present tense): 99 items. Why so many?
  • “Logical equivalence”: “означава” (means), “представлява” (represents), “се нарича” (is called), etc.
  • “What it does” (often in sentences that are part of multi-sentence definitions): “вмъква” (inserts), “изобразява” (depicts), etc.
• Constraints on the structure of the verb complex.
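A lookup against such a verb list can be sketched as follows. The dictionary holds only the five forms quoted above, not the full list of 99 items; the function name `find_connector` and the example sentences are invented for this illustration.

```python
# Only the forms quoted on the slide; the real list is open-ended.
CONNECTOR_VERBS = {
    "означава": "logical equivalence",      # means
    "представлява": "logical equivalence",  # represents
    "се нарича": "logical equivalence",     # is called
    "вмъква": "what it does",               # inserts
    "изобразява": "what it does",           # depicts
}


def find_connector(tokens):
    """Return (index, form, group) for the first connector verb, or None."""
    lowered = [t.lower() for t in tokens]
    for i in range(len(lowered)):
        for n in (2, 1):  # try two-token forms such as "се нарича" first
            window = lowered[i:i + n]
            form = " ".join(window)
            if len(window) == n and form in CONNECTOR_VERBS:
                return i, form, CONNECTOR_VERBS[form]
    return None


if __name__ == "__main__":
    # Invented example: "This place is called insertion point."
    print(find_connector("Това място се нарича точка на вмъкване".split()))
```

A lexical lookup like this only locates a candidate connector; the constraints on the verb complex mentioned above would still have to be checked separately.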


Definitory context types
3. layout_def

Structure:


 -  - NP
no main verb
Grammar
3. layout_def

Some characteristics of the layout_def grammar:
• How do we match a sentence without a main verb?
<rule name="tok_not_verb">
<first> … <query match="tok [not (@ctag = 'V')
and not (@ctag = 'P' and contains (@msd, 'Pr'))
and not (. = '.') and not (. = '[') and not (. = ':')
and not (. = '=') and not (@class = 'internet')
and not (@class = 'num')]"/> … </first> </rule>
• Main rule: premodified NP (postmodifier) + (NOT_VERB)+
• Fewer rules for matching tokens with different POS tags.
• Simplified premodified NP structure.
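The token test in the tok_not_verb rule can be restated as a Python predicate over `tok` elements. This is a sketch under assumptions: the helper names `is_not_verb` and `has_no_main_verb` are invented, the sample markup is a guess at the input format, and the sentence-level check with `all()` is a simplification of the main rule.

```python
import xml.etree.ElementTree as ET

EXCLUDED_TEXT = {".", "[", ":", "="}
EXCLUDED_CLASS = {"internet", "num"}


def is_not_verb(tok):
    """Mirror the conditions of the tok_not_verb query on one <tok> element."""
    ctag = tok.get("ctag")
    if ctag == "V":
        return False
    if ctag == "P" and "Pr" in tok.get("msd", ""):
        return False
    if (tok.text or "") in EXCLUDED_TEXT:
        return False
    if tok.get("class") in EXCLUDED_CLASS:
        return False
    return True


def has_no_main_verb(sentence):
    """Simplified check: every token of the sentence passes is_not_verb."""
    toks = sentence.findall("tok")
    return bool(toks) and all(is_not_verb(t) for t in toks)


if __name__ == "__main__":
    # Hypothetical input markup for a verbless phrase ("current address").
    s = ET.fromstring(
        '<s id="s2"><tok ctag="A">текущ</tok><tok ctag="N">адрес</tok></s>'
    )
    print(has_no_main_verb(s))
```

The predicate is written in plain Python rather than as an XPath string because `xml.etree.ElementTree` supports only a limited XPath subset.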

Definitory context types
4. punct_def

Structure:
• NP : / – / = NP
• The definiendum may be preceded by an adjunct (PP)
Grammar
4. punct_def

Some characteristics of the punct_def grammar:
• Rule matching the first NP premodifier (if necessary, after a bullet in a list).
Example: “222. Тематични карти (…)” (“222. Thematic maps (…)”)
<rule name="first_tok"> …
<ref name="bullet_02" suppress="true"/>
<query match="s [not(@id='s1')]/tok [3]
[.~'^[А-Я]\S*$']
[@ctag='A' or … or @ctag='M']"/> … </rule>
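The conditions of the first_tok query can be sketched in Python as below. The sketch assumes that the bullet occupies the first two tokens, as in the example “222.”; the function name `first_premodifier` and the token representation are invented for illustration. The slide elides part of the tag list, so only the two visible tags are used.

```python
import re

# Same pattern as in the query: a token starting with a capital Cyrillic letter.
CAPITALISED = re.compile(r"^[А-Я]\S*$")

# The slide elides part of the tag list; only the visible tags appear here.
VISIBLE_PREMODIFIER_TAGS = {"A", "M"}


def first_premodifier(sentence_id, tokens):
    """tokens is a list of (text, ctag) pairs.

    Return the text of the third token if the sentence is not 's1' and the
    token is capitalised and carries one of the visible tags; else None.
    """
    if sentence_id == "s1" or len(tokens) < 3:
        return None
    text, ctag = tokens[2]  # XPath tok[3] is 1-based
    if CAPITALISED.match(text) and ctag in VISIBLE_PREMODIFIER_TAGS:
        return text
    return None


if __name__ == "__main__":
    # Hypothetical tagging of "222. Тематични карти".
    toks = [("222", "M"), (".", "PUNCT"), ("Тематични", "A"), ("карти", "N")]
    print(first_premodifier("s7", toks))
```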
Definitory context types. Evaluation
• Recall is more important than precision
• Grammars performance

Type         Matching level   Precision   Recall   F2-score
is_def       sentence         37.58       72.27    55.27
is_def       token            20.16       70.54    38.49
verb_def     sentence         36.86       70.32    53.98
verb_def     token            16.65       56.68    31.47
punct_def    sentence         22.77       71.87    41.81
punct_def    token            12.67       62.55    27.05
layout_def   sentence         21.69       65.71    39.20
layout_def   token            26.97       55.59    41.07
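The slide does not state which formula lies behind the F2-score. The reported figures are consistent with the weighted form F = (1 + α)·P·R / (α·P + R) with α = 2: recomputing from the rounded precision and recall values agrees to within 0.01. The more common F-beta definition with β = 2 would instead give about 61.0 for is_def at sentence level. The Python sketch below reproduces that check; the function name `f_alpha` is an assumption.

```python
def f_alpha(precision, recall, alpha=2.0):
    """Weighted F-measure: (1 + alpha) * P * R / (alpha * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + alpha) * precision * recall / (alpha * precision + recall)


# (type, matching level, precision, recall, reported F2) from the slide.
REPORTED = [
    ("is_def", "sentence", 37.58, 72.27, 55.27),
    ("is_def", "token", 20.16, 70.54, 38.49),
    ("verb_def", "sentence", 36.86, 70.32, 53.98),
    ("verb_def", "token", 16.65, 56.68, 31.47),
    ("punct_def", "sentence", 22.77, 71.87, 41.81),
    ("punct_def", "token", 12.67, 62.55, 27.05),
    ("layout_def", "sentence", 21.69, 65.71, 39.20),
    ("layout_def", "token", 26.97, 55.59, 41.07),
]

if __name__ == "__main__":
    for name, level, p, r, reported in REPORTED:
        print(f"{name:10s} {level:8s} computed={f_alpha(p, r):.2f} reported={reported:.2f}")
```

With α = 2 this measure weights recall more heavily than precision, in line with the statement that recall matters more for this task.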
Conclusion. The good outcome example
• ILIAS – extended with Glossary Candidate Detector

Questions?
Thank you!