Preliminary Consideration on Word Segmentation ISO Standards

Sun Maosong
Dept. of Computer Science, Tsinghua University
23 July 2005, Okinawa, Japan

Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 1: General principles and methods

1. Scope

The word segmentation standard series (Part 1, Part 2 and Part 3) is aimed at any natural language in which the word tokens of written texts cannot be fully identified by typographic properties (such as white space in English). The standards specify what the output of word segmentation is for a given input text, pursuing consistency in word segmentation within and among texts to the maximum extent, so as to meet the requirements of a variety of natural language information processing applications.

The standards are designed to be used in close conjunction with the metamodel presented in the Morpho-syntactic Annotation Framework; with ISO 16642:2003, Terminological Markup Framework; with the revision of ISO 12620, Terminology and other language resources – Data categories for electronic lexical resources (DCR); and with the working draft ISO WD 24613:2004, Language resource management – Lexical markup framework (LMF). They also try to be consistent with existing lexical resource models such as the EAGLES International Standards for Language Engineering (ISLE) and the Multilingual ISLE Lexical Entry (MILE) model.

2. Normative references

The following normative documents contain provisions that, through reference in this text, constitute provisions of this part of the ISO word segmentation standards. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on the ISO word segmentation standards are encouraged to investigate the possibility of applying the most recent editions of the normative documents indicated below.
For undated references, the latest edition of the normative document referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

ISO 639-1:2002, Codes for the representation of names of languages – Part 1: Alpha-2 code
ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 code
ISO 639-3:200?, Codes for the representation of names of languages – Part 3: Alpha-3 code for comprehensive coverage of languages
ISO 1087-1:2000, Terminology work – Vocabulary – Part 1: Theory and application
ISO 1087-2:1999, Terminology work – Vocabulary – Part 2: Computer applications
ISO/IEC 10646-1:2003, Information technology – Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 11179-3:2003, Information technology – Data management and interchange – Metadata Registries (MDR) – Part 3: Registry metamodel (MDR3)
ISO 15924:200?, Codes for the representation of names of scripts
ISO 16642:2003, Computer applications in terminology – Terminological Markup Framework (TMF) (check!)

3. Related terms/concepts

character

morpheme

bound morpheme

free morpheme

word
in the context of a given language, a description composed of at least a /part of speech/ and a lemmatised form
NOTE: The description can be made more complete with further morphological information and/or syntactic and semantic information. A word is either a single word or a multiword expression.

single word
word that does not contain any other word

derivation
change in the form of a word to create a new word, usually by modifying the base/root or by affixation
NOTE: Sometimes derivation signals a change in part of speech, such as "nation" to "nationalize". Sometimes the part of speech remains the same, as in “nationalization” vs. “denationalization”.

affix
collective term for bound formatives or word-forming elements that constitute subcategories of word classes

affixation
process of word formation in which the stem is expanded by the addition of an affix
prefix
subclass of bound word-forming elements that precede the stem

prefixation
essential process of word formation in which an affix is attached to the beginning of a stem

suffix
morphological element that is attached finally to free morpheme constructions, but does not occur as a rule as a free morpheme

suffixation
formation of complex words or word forms through the addition of a suffix to the word stem

infix
word formation morpheme that is inserted into the stem

inflection
to be regarded as a phenomenon at the syntactic level, not at the morphological level

lemmatization
to be regarded as a phenomenon at the syntactic level, not at the morphological level

compounding

compound

idiom (multiword expression)
word/phrase composed of an ordered group of words (characters) that has properties that are not predictable from the properties of the individual words (characters) or their normal mode of combination

fixed phrase
phrase that is tightly combined and steadily used in the language, even though its meaning may be predictable from its components
Example: 猪肉 (pork) in Chinese

tokenization (word breaker)

proper noun

full form of a word
the complete representation of a word for which there is an abbreviated form [ISO 12620]

abbreviated form of a word

word-formation

word

word forms

lexeme

morphological (internal) structure of a word (word structure)

word segmentation unit (WSU)
1. word
2. phrases that are tightly combined or frequently used

word phrase, bunsetsu
- content word + function word; e.g., in Korean, the word phrase is one of the spacing units
- e.g., Seoul-eui (KR), Seoul-no (JP)
- prosodic structure is related

wordlist

word type
- dictionary

word token
- text

corpus

representative corpus for a language
corpus that is large enough and well balanced, appropriate for word frequency estimation of a wordlist
domain-specific corpus

raw corpus

morphologically annotated corpus
- manually …

word segmentation
process that performs morphological analysis on any input text

morphological analysis
process that marks up a token (= morpheme) based on a given set of tags (or parts of speech)

word frequency
- from the representative corpus

(adjusted) word frequency
more precise estimate of a word frequency obtained by considering its usage, for example, its distribution over domain and time
- from a domain-specific corpus

word frequency approximation
word frequency acquired by an estimation method based on a raw corpus, an automatically segmented corpus, a manually segmented corpus with some degree of inconsistency, or a mixture of them
- estimation vs. approximation: approximation is the superordinate concept of estimation

word segmentation ambiguity (optional)

type of word segmentation ambiguities (optional)

unknown words

type of unknown words

part of speech

degree of lexicalization

4. General principles and methodologies

Principles in determining WSUs

(1) from the linguistic point of view:
all the linguistic principles and rules regarding word-formation hold; applicable with much broader linguistic coverage;
- Principle of unpredictability of the word meaning from its subparts
- Principle of idiomatization
- Principle of conventionalization, evaluated by frequency
- Principle of the scope maximization of affixation
  - /반 파쇼 주의 자/ (Korean, "anti-fascist")
- Principle of the scope maximization of compounding: with respect to the wordlist
  - 기계번역 (Korean, "machine translation")
  - /기계번역 시스템/ (Korean, "machine translation system")
- Principle of … (the issue of verb conjugation): bound morpheme applicable under a condition
- Principle of Lexical Integrity Hypothesis: syntactic rules may not refer to the internal structure of words.
  - 水杯 (Chinese, "water cup"; Korean 물잔)
  - 洗澡 (Chinese, "to take a bath")

(2) from the mental and economical point of view:
fixed phrases are also considered for inclusion in the wordlist, as determined by the representative corpus;

(3) from the practical point of view:
大中小学 (Chinese, "universities, secondary schools and primary schools"), 외교부장 (Korean, "foreign minister": not 외교|부장, but 외교부 + 부장)

Principles in using the standards to process the text
- Exhaustivity and consistency.
- WSUs identified in the text may have inner structures; segmentation is not simply the insertion of a space in between.
- Flexibility in granularity, e.g., pork = pig meat

Methodology in word frequency estimation

Basic framework
- a wordlist (entries of the lexicon) with high coverage of texts and, possibly, with a morphological structure for some words
  Principle of the inclusion of words into the wordlist: lexemes generated by lexical morphology (derivation and compounding), but not inflection.
- word formation rules/meta-rules: both derivational and compounding
- a complete prefix/semi-prefix list
- a complete suffix/semi-suffix list
- a complete free morpheme list
- a complete bound morpheme list
- special morpheme lists that have special functions in the process of word segmentation, for example, *** in Japanese.
- word endings of verbs:
- corpus: to support the quantitative analysis of the wordlist

Work flow of word segmentation

Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 2: Word Segmentation for Chinese, Japanese and Korean

- detailed word formation for each of CJK: both derivational and compounding
- a wordlist for each of CJK, with high coverage of each language
- the most frequent 1000 words (from CJK)
- a complete prefix list for each of CJK
- a complete suffix list for each of CJK
- a corpus for each of CJK: quantitative analysis of the wordlist

Appendix: Some Concepts possibly concerned in the Standards (Reference)

For the purposes of this International Standard, the terms and definitions given in ISO 1087-1, ISO 1087-2 and the following apply:

abbreviated form
representation formed by omitting words or letters from a longer form

automaton
abstract mechanism characterized by a certain number of states and whose behavior is governed by a finite number of rules

autonomous word
word that can appear as a single word or as a component of a multiword expression
Example: “Father” in the multiword expression “father-in-law”
NOTE: Opposed to non-autonomous word.

closed data category
data category whose content is constrained by a list of permissible instances which comprise its conceptual domain
NOTE: A typical closed data category might be /grammatical number/, which can have as its content the values: /singular/, /plural/ or /dual/.

combination of morphological features
unlimited number of associations of distinct morphological features
NOTE: An example of a combination of morphological features would be the pair: /grammatical number/ and /grammatical gender/.

complex data category
data category that can have content values
NOTE: Complex data categories include both closed data categories and open data categories.
conceptual domain
set of permissible values associated with a closed data category
NOTE: The conceptual domain of the data category /grammatical number/ can be defined as /singular/, /plural/ and /dual/.

data category
result of the specification of a given data field or the content of a closed data field
NOTE: A data category is to be used as an elementary descriptor in a linguistic structure or an annotation scheme. Examples are: /term/, /definition/, /part of speech/ and /grammatical gender/. Data categories for the management of lexical resources and terminology are comparable to data element concepts in ISO/IEC 11179-3:200?.

derivation
change in the form of a word to create a new word, usually by modifying the base/root or by affixation
NOTE: Sometimes derivation signals a change in part of speech, such as "nation" to "nationalize". Sometimes the part of speech remains the same, as in “nationalization” vs. “denationalization”.

electronic lexical resource
ELR
lexical database
lexical resource
database collection consisting of individual data entries each of which documents a word and provides data pertinent to the senses associated with that word, as well as in some cases equivalent words in one or more languages [adapted from ISO 1087]
NOTE: Lexical databases and lexical resources are generally corpus-driven and provide more predictable, machine-processable information than machine readable dictionaries, which only mirror the layout of print dictionaries, although many today provide some degree of knowledge organization. Lexical resources can include features for spellchecking and grammar checking, parsing, concordancing, speech recognition and generation, semantic taxonomies and disambiguation, text segmentation, knowledge management, and other NLP functions.
electronic terminological resource
ETR
database collection consisting of individual data entries each of which documents a concept and provides data pertinent to the terms associated with that concept in one or more languages

form
any sequence of letters, numerals and pictograms used to write or pronounce a word

form operation
any modification of the form

full form
the complete representation of a word for which there is an abbreviated form [ISO 12620]

grammatical category
See part of speech.

homograph
word that is written like another word, but that has a different pronunciation, meaning, and/or origin [adapted from ISO 12620]
NOTE: An example of difference in meaning for the same spelling of a word is bark: 1) the sound made by a dog; 2) outside covering of the trunk or branches of woody plants; 3) a sailing vessel.

homonym
word that sounds the same and is written the same as another word, but is different in meaning
NOTE: An example is “bear” as a /noun/ and “bear” as a /verb/.

homophone
word that sounds like another word but is different in writing or meaning
NOTE: An example of difference in spelling is “pair” compared to “pear” or “pare” in “The cook used a knife to pare the pair of pears”.

human language technology
HLT
technology as applied to natural languages

idiom
See multiword expression.

inflected form
form that a word can take when used in a sentence or a phrase
NOTE: An inflected form of a word is associated with a combination of morphological features, such as grammatical number or case.

inflectional paradigm
set of form operations that builds the various inflected forms of a lemmatised form
NOTE: An inflectional paradigm is not the explicit list of inflected forms.

lemmatised form
lemma
conventional form chosen to represent the word
NOTE: In European languages, the lemmatised form is the /singular/ if there is a variation in /number/, the /masculine/ form if there is a variation in /gender/ and the /infinitive/ for all verbs.
In some languages, certain nouns are defective in the singular form, in which case the /plural/ is chosen. Certain words are also defective in the /masculine/, in which case the /feminine/ is chosen. The lemmatised form can be graphical or phonetic.

lexical database
lexical resource
See electronic lexical resource.

machine translation lexicon
lexical resource in which the individual entries contain equivalents in two or more languages together with semantic information to facilitate automatic or semi-automatic processing of lexical units during machine translation [ISO 1087]

morphological feature
category induced from the inflected form of a word
NOTE: ISO 12620 provides a comprehensive list of values for European languages. An example of a morphological feature is: /grammatical gender/.

morphology of a word
morpho-syntax of a word
description comprising the lemmatised form or forms of a word, its /part of speech/ data categories, possibly its inflectional paradigm or paradigms, and possibly its explicitly listed inflected forms
NOTE: Despite the reference to syntax, morpho-syntactic information does not include syntactic information.

multiword expression
MWE
idiom
word composed of an ordered group of words that has properties that are not predictable from the properties of the individual words or their normal mode of combination
NOTE: The group of words making up an MWE can be continuous or discontinuous.
Example: to take an exam. "He took the exam." (continuous). "What did you do with the exam that he supposedly took last Thursday?" (discontinuous).

natural language processing
NLP
field covering knowledge and techniques involved in the processing of linguistic data

non-autonomous word
word that appears in multiword expressions but cannot appear alone
Example: In French “au fur et à mesure”, the component “fur” cannot appear alone. In English “to take umbrage”, the component “umbrage” cannot appear alone.
See also autonomous word.

object language
language of the lexical object being described [ISO 16642 definition 3.10]

open data category
data category whose content is completely optional
Example: Typical open data categories might include /term/, /lemma/, /definition/.

orthography
way of spelling or writing words that conforms to a specified standard
NOTE: Aside from standardized spellings of alphabetical languages, such as standard UK or US English, or reformed German spelling, there can be variations such as transliterations or romanizations of languages in non-native scripts, stenographic renderings, or representations in the International Phonetic Alphabet. In this regard, orthographic information in a lexical entry can describe a kind of transformation applied to the form that is the object of the entry. The specific value /native/ is dedicated to representing the absence of transformation.
Examples: /transliteration/ and /romanization/ are possible values for a transformation.

part of speech
grammatical category
word class
category assigned to a word based on its grammatical and semantic properties
NOTE: ISO 12620 provides a comprehensive list of values for European languages. Examples of such values are: /noun/ and /verb/.

romanization
transliteration using Latin script

script
set of graphic characters used for the written form of one or more languages (ISO/IEC 10646-1, 4.14)
NOTE: The description of scripts ranges from a high-level classification, such as hieroglyphic or syllabary writing systems vs. alphabets, to a more precise classification, such as Roman vs. Cyrillic. Scripts are defined by a list of values taken from ISO 15924. Examples are: Hiragana, Katakana, Latin, Cyrillic.
semantics of a word
description of the meanings of the word

simple data category
data category that is itself the possible content of a closed data category, but that cannot itself have content
Example: /masculine/, /feminine/, and /neuter/ are possible simple data categories associated with the conceptual domain of the closed data category /grammatical gender/ as it is associated with the German language.

single word
word that does not contain any other word

splitting conditions
criteria that determine whether a linguistic phenomenon is described by one element or by several elements
Example: the criteria specifying why a word is described by one entry or two entries in a lexicon

syntax of a word
description of the behaviour of the word with respect to other words in a sentence or a phrase

transcription
form resulting from a coherent method of writing down speech sounds

transliteration
form resulting from the conversion of one writing system into another

word
in the context of a given language, a description composed of at least a /part of speech/ and a lemmatised form
NOTE: The description can be made more complete with further morphological information and/or syntactic and semantic information. A word is either a single word or a multiword expression.

word class
See part of speech.

word frequency
number of occurrences of a particular word in a certain corpus, divided by the number of words in this corpus

working language
language used to describe objects in a lexical resource [ISO 16642 definition 3.21]
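
As an illustration of the word frequency definition in this appendix (occurrences of a word divided by the number of words in the corpus), the following is a minimal Python sketch. It assumes the corpus has already been segmented into word tokens. The function name `word_frequencies` and the toy corpus are hypothetical and are not part of the draft standard.

```python
from collections import Counter


def word_frequencies(tokens):
    """Map each word to its relative frequency: occurrences / total tokens."""
    total = len(tokens)
    if total == 0:
        # An empty corpus has no defined frequencies.
        return {}
    counts = Counter(tokens)
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy corpus, assumed to be already segmented into word tokens.
corpus = ["机器", "翻译", "系统", "机器", "学习"]
freq = word_frequencies(corpus)
print(freq["机器"])  # 2 occurrences out of 5 tokens
```

This sketch counts word tokens and reports one value per word type. An adjusted word frequency, in the sense used earlier in the draft, would additionally need per-domain or per-period counts, which are not modelled here.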