(Semi)automatic glossary detection for Bulgarian

Laska Laskova
Linguistic Modelling Laboratory
Institute for Parallel Processing
Bulgarian Academy of Sciences

26 September 2006 RANLP 2007


The project concerned

Language Technology for eLearning (http://www.lt4el.eu/)
• Goal: improving the retrieval of learning material
• Domain: Computer Science for non-computer scientists
• For: Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian
• Via: implementing multilingual language technology tools in Learning Management Systems (ILIAS):
  - key word extractor
  - glossary candidate detector
  - semantic knowledge (ontologies)


Content

• Introduction
• Preprocessing steps
• Definition. Definitory context types
• Grammar types
• Conclusion


Introduction

Bulgarian corpus: 54 learning objects (.html, .htm, .doc format), approx. 207 000 tokens
  - 12 LO: Calimera Guidelines, section 3
  - 15 LO: Common MS Office programs tutorials
  - 27 LO: Internet Technologies tutorials for beginners and advanced users

Problematic features:
  - High number of non-standard words
  - 32% of terms include English words
  - Unusual text structure


Preprocessing steps: Non-standard words

Some examples:
  TCP/IP
  #FF00FF
  .zip
  &quot;
  Не-Е-Число (“Not-A-Number”)
  име_на_рамка (“frame_name”)
  getNodeNam ()

Problematic steps:
  - tokenization
  - lemmatization


Preprocessing steps: Unusual text structure

Problematic steps:
  - sentence segmentation
  - definition mark-up


Definition. Definitory context types

Definitory context (definition):
  - Canonical structure: definiendum – connector – definiens
  - Canonical content: genus proximum – differentia

Some “departures from the norm”:
  - Structure. No definiendum and no connector:
    “Показва текущия адрес на документа, който разглеждате в Web.”
    (“Shows the current address of the document you are browsing.”) (Address bar)
  - Content. Definition by function (“what it does”). No genus proximum (and no connector):
    “Рамките разделят страницата на отделни части, във всяка от които се показва друга страница.”
    (“Frames divide the web page into parts, such that each part in turn displays another page.”)


Definition. Definitory context types

Manually annotated definitory contexts: 767, spanning 826 sentences

  Type         Definiendum    Connector       Manually annotated
  is_def       NP/pronoun     aux 3sg/3pl     220
  verb_def     NP/pronoun/    verb 3sg/3pl    404
  layout_def                                  70
  punct_def    NP             : - =           32

• Some definitions span more than one sentence or define more than one definiendum.
• Pron_def type (54 manually annotated): these definitions are included in the is_def and verb_def groups, depending on the connector.
• Other_def type (59 manually annotated): definitions with various patterns, fewer than two occurrences.


Grammar tool

The tool used: LXtransduce (Tobin, University of Edinburgh)
A transducer that processes an XML input stream and rewrites it according to a set of rules provided in the grammar file.

Rules:
  - regular expressions
  - references to other rules
  - top-down grammar


Definitory context types 1. is_def

“XML language is a set of specific elements for standard text mark-up.”

Structure:
  - NP is/are NP
  - Coordinated NP is/are NP
  - Pronoun is/are NP


Grammar 1. is_def

Some characteristics of the is_def grammar:
  - 36 simple rules for matching basic tokens: N, Adj, Adv, “,”, “-”, etc.
  - Rule for the verb “съм” (“to be”) in 3sg and 3pl present tense.
  - Rules for 4 types of premodified NP structures.
  - Complex rule for the NP postmodifier.
  - Rule for any kind of token (ANY), used in the main rule:
      (NP)+ is/are (premodified NP (postmodifier)+)+ (ANY)+

Why bother with the rule for the NP postmodifier? Compare:
      (NP)+ is/are NP (ANY)+


Definitory context types 2. verb_def

“Reversed” definition: “(We) call this place insertion point”

Structure:
  - NP V NP
  - (null subject) V NP
  - Pronoun V NP


Grammar 2. verb_def

Some characteristics of the verb_def grammar:
  - Open list of verb forms (3sg, 3pl present tense): 99 items. Why so many?
      “Logical equivalence”: “означава” (means), “представлява” (represents), “се нарича” (is called), etc.
      “What it does” (often in sentences that are part of multi-sentence definitions): “вмъква” (inserts), “изобразява” (depicts), etc.
  - Constraints on the structure of the verb complex.


Definitory context types 3. layout_def

Structure:
  - NP
  - no main verb


Grammar 3. layout_def

Some characteristics of the layout_def grammar:
  - How do we match a sentence without a main verb?

      <rule name="tok_not_verb">
        <first>
          …
          <query match="tok[not(@ctag = 'V')
                            and not(@ctag = 'P' and contains(@msd, 'Pr'))
                            and not(. = '.') and not(. = '[')
                            and not(. = ':') and not(. = '=')
                            and not(@class = 'internet')
                            and not(@class = 'num')]"/>
          …
        </first>
      </rule>

  - Main rule: premodified NP (postmodifier)+ (NOT_VERB)+
  - Fewer rules for matching tokens with different POS tags.
  - Simplified premodified NP structure.


Definitory context types 4. punct_def

Structure: NP  : / – / =  NP
The definiendum may be preceded by an adjunct (PP).


Grammar 4. punct_def

Some characteristics of the punct_def grammar:
  - Rule matching the first NP premodifier (if necessary, after a bullet in a list).
    Example: “222. Тематични карти (…)” (“222. Thematic maps (…)”)

      <rule name="first_tok">
        …
        <ref name="bullet_02" suppress="true"/>
        <query match="s[not(@id='s1')]/tok[3][.~'^[А-Я]\S*$'][@ctag='A' or … or @ctag='M']"/>
        …
      </rule>


Definitory context types. Evaluation

• Recall is more important than precision
• Grammars' performance:

  Type         Matching level   Precision   Recall   F2-score
  is_def       sentence         37.58       72.27    55.27
  is_def       token            20.16       70.54    38.49
  verb_def     sentence         36.86       70.32    53.98
  verb_def     token            16.65       56.68    31.47
  punct_def    sentence         22.77       71.87    41.81
  punct_def    token            12.67       62.55    27.05
  layout_def   sentence         21.69       65.71    39.20
  layout_def   token            26.97       55.59    41.07


Conclusion. The good outcome example

ILIAS, extended with the Glossary Candidate Detector

Questions?


Thank you!
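
A note on the F2-score column of the evaluation table: the slides do not state which formula was used. The sketch below is a check under an assumption, not the authors' documented method. It assumes the weighting F = (1 + β)·P·R / (β·P + R) with β = 2, because that form reproduces all eight reported values to within rounding, whereas the more common (1 + β²)·P·R / (β²·P + R) form does not (it gives about 61.0 for the first row instead of 55.27). The function and variable names are illustrative only.

```python
# Assumed weighting (NOT stated in the slides):
#   F = (1 + beta) * P * R / (beta * P + R), with beta = 2.
# It is used here only because it matches the reported F2 column.

def f_score(precision, recall, beta=2.0):
    """F-score with the assumed (1 + beta) weighting; inputs in percent."""
    denom = beta * precision + recall
    return (1 + beta) * precision * recall / denom if denom else 0.0


def f_score_squared(precision, recall, beta=2.0):
    """The more common (1 + beta^2) weighting, shown for comparison."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


# (type, matching level, precision, recall, reported F2) copied from the table.
REPORTED = [
    ("is_def", "sentence", 37.58, 72.27, 55.27),
    ("is_def", "token", 20.16, 70.54, 38.49),
    ("verb_def", "sentence", 36.86, 70.32, 53.98),
    ("verb_def", "token", 16.65, 56.68, 31.47),
    ("punct_def", "sentence", 22.77, 71.87, 41.81),
    ("punct_def", "token", 12.67, 62.55, 27.05),
    ("layout_def", "sentence", 21.69, 65.71, 39.20),
    ("layout_def", "token", 26.97, 55.59, 41.07),
]


def max_deviation(rows=REPORTED):
    """Largest absolute gap between the recomputed and the reported F2."""
    return max(abs(f_score(p, r) - f) for _, _, p, r, f in rows)


if __name__ == "__main__":
    for name, level, p, r, reported in REPORTED:
        print(f"{name:<11}{level:<9}{f_score(p, r):6.2f}  (reported {reported:.2f})")
    print(f"max deviation: {max_deviation():.3f}")
```

With this weighting recall counts twice as much as precision, which is consistent with the slide's remark that recall is more important than precision.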