Preliminary Consideration on Word Segmentation ISO Standards

Sun Maosong
Dept. of Computer Science, Tsinghua University,
23 July 2005, Okinawa, Japan
Language resource management - Word segmentation of written
texts for mono-lingual and multi-lingual information
processing - Part 1: General principles and methods
1. Scope
The word segmentation standard series (Part 1, Part 2 and Part 3) is aimed at any natural language
in which the word tokens of written texts cannot be fully identified by typographic properties
(such as white space in English).
The standards specify what the output of word segmentation should be for an input text,
pursuing consistency of word segmentation within and among texts to the
maximum extent, so as to meet the requirements of a variety of applications in natural
language information processing.
The standards are designed to be used in close conjunction with the metamodel presented in
the Morpho-syntactic Annotation Framework; with ISO 16642:2003, Terminological Markup Framework;
with the revision of ISO 12620, Terminology and other language resources – Data categories for
electronic lexical resources (DCR); and with the working draft ISO WD 24613:2004,
Language resource management – Lexical markup framework (LMF). They also aim to be
consistent with existing lexical resource models such as the EAGLES International
Standards for Language Engineering (ISLE) and the Multilingual ISLE Lexical Entry (MILE) model.
2. Normative references
The following normative documents contain provisions that, through reference in this text,
constitute provisions of this part of ISO word segmentation standards. For dated references,
subsequent amendments to, or revisions of, any of these publications do not apply. However,
parties to agreements based on ISO word segmentation standards are encouraged to
investigate the possibility of applying the most recent editions of the normative documents
indicated below. For undated references, the latest edition of the normative document referred
to applies. Members of ISO and IEC maintain registers of currently valid International
Standards.
ISO 639-1:2002, Codes for the representation of names of languages – Part 1: Alpha-2 Code.
ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 Code.
ISO 639-3:200?, Codes for the representation of names of languages – Part 3: Alpha-3 Code for
comprehensive coverage of languages
ISO 1087-1:2000, Terminology – Vocabulary – Part 1: Theory and application
ISO 1087-2:1999, Terminology – Vocabulary – Part 2: Computer application.
ISO/IEC 10646-1:2003, Information technology – Universal
Multiple-Octet Coded Character Set (UCS)
ISO/IEC 11179-3:2003, Information Technology – Data management and interchange –
Metadata Registries (MDR) – Part 3: Registry Metamodel (MDR3)
ISO 15924:200?, Code for the representation of names of scripts
ISO 16642:2003, Computer applications in terminology – TMF (Terminological Markup
Framework)—check!
3. Related terms/concepts
character
morpheme
bound morpheme
free morpheme
word
in the context of a given language, is a description composed of at least a /part of speech/ and
a lemmatised form
NOTE: The description can be more complete with more morphological information and/or
syntactic and semantic information. A word is either a single word or a multiword expression.
single word
word that does not contain any other word
derivation
change in the form of a word to create a new word, usually by modifying the base/root or
affixation
NOTE: Sometimes derivation signals a change in part of speech, such as "nation" to
"nationalize". Sometimes the part of speech remains the same, as in “nationalization” vs.
“denationalization”.
affix
Collective term for bound formatives or word-forming elements that constitute subcategories of
word classes.
affixation
Process of word formation in which the stem is expanded by the addition of an affix.
prefix
A subclass of bound-forming elements that precede the stem.
prefixation
Essential process of word formation in which an affix is attached to the beginning of a stem.
suffix
morphological element that is attached finally to free morpheme constructions, but does not
occur as a rule as a free morpheme.
suffixation
The formation of complex words or word forms through the addition of a suffix to the word
stem.
infix
word formation morpheme that is inserted into the stem.
inflection
to be regarded as a phenomenon at the syntactic level, not at the morphological level.
lemmatization
to be regarded as a phenomenon at the syntactic level, not at the morphological level.
compounding
compound
idiom (multiword expression)
a word/phrase composed of an ordered group of words (characters) that has properties that
are not predictable from the properties of the individual words (characters) or their normal
mode of combination.
fixed phrase
a phrase whose components are tightly combined and which is in stable use in the language, even
though its meaning may be predictable from its components.
Example: 猪肉 (pork) in Chinese
tokenization (word breaker)
proper noun
full form of a word
the complete representation of a word for which there is an abbreviated form [ISO 12620].
abbreviated form of a word
word-formation
word
word forms
lexeme
morphological (internal) structure of a word (word structure)
word segmentation unit (WSU)
1. word
2. phrases that are tightly combined or frequently used
Word phrase, Bunsetsu
- content word + function word; e.g., in Korean, the word phrase is one of the spacing
units. E.g., Seoul-eui (KR), Seoul-no (JP)
- Prosodic structure is related.
wordlist
word type - dictionary
word token - text
Corpus
Representative corpus for a language
large enough and well balanced, appropriate for word frequency estimation of a wordlist.
Domain-specific corpus
raw corpus
morphologically annotated corpus
- manually …
word segmentation
a process that performs morphological analysis on any input text.
Morphological analysis
A process to mark up a token (= morpheme) based on a given set of tags (or parts of speech)
word frequency
from the representative corpus
(adjusted) word frequency
more precise estimation of a word frequency by considering its usage, for example, its
distribution over domain and time. – from domain-specific corpus
word frequency approximation
word frequency acquired by an estimation method based on a raw corpus, an automatically
segmented corpus, a manually segmented corpus with some degree of inconsistency, or a
mixture of them.
- estimation vs. approximation: approximation is the superordinate concept of estimation
word segmentation ambiguity (optional)
type of word segmentation ambiguities (optional)
unknown words
type of unknown words
part of speech
degree of lexicalization
4. General principles and methodologies
Principles in determining WSUs
(1) from a linguistic point of view: all the linguistic principles and rules regarding word-formation
hold;

applicable with much broader linguistic coverage;
Principle of unpredictability of the word meaning from its subparts
Principle of idiomatization
Principle of conventionalization evaluated by frequency
Principle of the scope maximization of affixation
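- /반 파쇼 주의 자/ (Korean, 'anti-fascist': 반 'anti-' + 파쇼 'fascist' + 주의 '-ism' + 자 '-ist')
Principle of the scope maximization of compounding: with respect to the wordlist
- 기계번역 (Korean, 'machine translation')
- /기계번역 시스템/ (Korean, 'machine translation system')
Principle of …. the problem of verb conjugation – bound morpheme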

applicable in a condition
Principle of the Lexical Integrity Hypothesis: syntactic rules may not refer to the internal structure
of words.
- 水杯 (Chinese, 'water glass'; Korean 물잔)
- 洗澡 (Chinese, 'to take a bath')
(2) from a mental and economic point of view: fixed phrases are also to be
included in the wordlist, as determined by the representative corpus;
(3) from a practical point of view: 大中小学 (Chinese, 'universities, secondary and primary schools'), 외교부장 (Korean, 'foreign minister'; not 외교|부장 but 외교부+부장, 'foreign ministry' + 'head')
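The principle of scope maximization with respect to the wordlist, listed under (1), can be illustrated with a small sketch. It uses forward maximum matching, a common baseline technique that this draft does not prescribe, and a toy wordlist invented for the illustration; the function name `forward_maximum_match` is likewise hypothetical.

```python
def forward_maximum_match(text, wordlist):
    """Greedy left-to-right segmentation that prefers the longest wordlist entry.

    Illustrative baseline only; the draft does not mandate this algorithm.
    """
    max_len = max((len(w) for w in wordlist), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in wordlist:
                units.append(candidate)
                i += size
                break
    return units


# Toy wordlist (an assumption for this example, not taken from any standard).
WORDLIST = {"기계", "번역", "기계번역", "시스템"}

# The compound 기계번역 is kept whole because it is in the wordlist.
segments = forward_maximum_match("기계번역시스템", WORDLIST)
print(segments)
```

With 기계번역 in the wordlist the compound is kept as one unit; with only 기계 and 번역 listed, the same text would be split into the two shorter units.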
Principle in using the standards to process the text
Exhaustivity and consistency.
WSUs identified in the text may have inner structures; segmentation is not simply a matter of inserting a
space between units.
Flexibility in granularity
- e.g., pork = pig meat
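The two points above (WSUs may carry inner structure, and granularity is flexible) can be sketched as a small data structure. This is a minimal illustration under assumptions: the class name `WSU`, its fields and the `segments` method are hypothetical and are not defined by the draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WSU:
    """A word segmentation unit that may contain smaller units (hypothetical model)."""
    form: str
    gloss: str = ""
    parts: List["WSU"] = field(default_factory=list)

    def segments(self, fine: bool = False) -> List[str]:
        """Return the unit as one segment, or expand its inner structure if fine=True."""
        if fine and self.parts:
            out: List[str] = []
            for part in self.parts:
                out.extend(part.segments(fine=True))
            return out
        return [self.form]


# 猪肉 (pork) analysed as 猪 (pig) + 肉 (meat), following the example above.
pork = WSU("猪肉", "pork", [WSU("猪", "pig"), WSU("肉", "meat")])

coarse = pork.segments()         # one unit
fine = pork.segments(fine=True)  # inner structure exposed
print(coarse, fine)
```

Keeping the inner structure on the unit lets an application choose the granularity it needs without re-segmenting the text.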
Methodology in word frequency estimation
Basic framework

a wordlist (lexicon entries) with high coverage of texts and, possibly, with a
morphological structure for some of the words
Principle of the inclusion of words into the wordlist: lexemes generated by lexical
morphology (derivation and compounding) are included; inflected forms are not.

word formation rules/meta rules: both derivational and compounding

a complete prefix/semi-prefix list

a complete suffix/semi-suffix list

a complete free morpheme list

a complete bound morpheme list

special morpheme lists that have special functions in the process of word segmentation,
for example, *** in Japanese.
- Word ending of verbs:

corpus: to support the quantitative analysis of the wordlist
Work Flow of word segmentation
Language resource management - Word segmentation of written
texts for mono-lingual and multi-lingual information
processing - Part 2: Word Segmentation
for Chinese, Japanese and Korean

detailed word formation for each CJK: both derivational and compounding

a wordlist for each of the CJK languages, with high coverage of that language
- most frequent 1000 words (from CJK)

a complete prefix list of each CJK

a complete suffix list of each CJK

corpus of each CJK: quantitative analysis of the wordlist
Appendix: Some Concepts possibly concerned in the Standards (Reference)
For the purposes of this International Standard, the terms and definitions given in ISO 1087-1,
ISO 1087-2 and the following apply:
abbreviated form
representation formed by omitting words or letters from a longer form
automaton
abstract mechanism characterized by a certain number of states and whose behavior is
governed by a finite number of rules
autonomous word
word that can appear as a single word or as a component of a multiword expression
Example: “Father” in the multiword expression “father-in-law”
Note: opposed to non-autonomous word
closed data category
data category whose content is constrained by a list of permissible instances which comprise
its conceptual domain
NOTE: A typical closed data category might be /grammatical number/, which can have as its
content the values: /singular/, /plural/ or /dual/.
combination of morphological features
unlimited number of associations of distinct morphological features
NOTE: An example of a combination of morphological features would be the pair: /grammatical
number/ and /grammatical gender/.
complex data category
data category that can have content values
NOTE: Complex data categories include both closed data categories and open data
categories.
conceptual domain
set of permissible values associated with a closed data category
Note: The conceptual domain of the data category /grammatical number/ can be defined as
/singular/, /plural/ and /dual/.
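As a rough illustration of a closed data category and its conceptual domain, the /grammatical number/ example can be modelled as an enumeration. This is only a sketch: the names `GrammaticalNumber` and `is_permissible` are hypothetical and do not come from ISO 12620 or the DCR.

```python
from enum import Enum


class GrammaticalNumber(Enum):
    """Closed data category; its members form the conceptual domain."""
    SINGULAR = "singular"
    PLURAL = "plural"
    DUAL = "dual"


def is_permissible(value: str) -> bool:
    """Check whether a value belongs to the conceptual domain."""
    return value in {member.value for member in GrammaticalNumber}


print(is_permissible("dual"))
print(is_permissible("paucal"))
```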
data category
result of the specification of a given data field or the content of a closed data field
NOTE: A data category is to be used as an elementary descriptor in a linguistic structure or an
annotation scheme. Examples are: /term/, /definition/, /part of speech/ and /grammatical
gender/. Data categories for the management of lexical resources and terminology are
comparable to data element concepts in ISO/IEC 11179-3:200?.
derivation
change in the form of a word to create a new word, usually by modifying the base/root or
affixation
NOTE: Sometimes derivation signals a change in part of speech, such as "nation" to
"nationalize". Sometimes the part of speech remains the same, as in “nationalization” vs.
“denationalization”.
electronic lexical resource
ELR
lexical database
lexical resource
database collection consisting of individual data entries each of which documents a word and
provides data pertinent to the senses associated with that word, as well as in some cases
equivalent words in one or more languages [adapted from ISO 1087]
NOTE: Lexical databases and lexical resources are generally corpus-driven and provide more
predictable, machine-processable information than machine readable dictionaries, which only
mirror the layout of print dictionaries, although many today provide some degree of knowledge
organization. Lexical resources can include features for spellchecking and grammar checking,
parsing, concordancing, speech recognition and generation, semantic taxonomies and
disambiguation, text segmentation, knowledge management, and other NLP functions.
electronic terminological resource
ETR
database collection consisting of individual data entries each of which documents a concept
and provides data pertinent to the terms associated with that concept in one or more
languages
form
any sequence of letters, numerals and pictograms used to write or pronounce a word
form operation
any modification of the form
full form
the complete representation of a word for which there is an abbreviated form [ISO 12620].
grammatical category
See part of speech.
homograph
word that is written like another word, but that has a different pronunciation, meaning, and/or
origin [adapted from ISO 12620]
NOTE: An example of difference in meaning for the same spelling of a word is bark: 1) the
sound made by a dog; 2) outside covering of the trunk or branches of woody plants; 3) a
sailing vessel.
homonym
word that sounds the same and is written the same as another word, but is different in
meaning
NOTE: An example is “bear” as a /noun/ and “bear” as a /verb/.
homophone
word that sounds like another word but is different in writing or meaning
NOTE: An example of difference in spelling is “pair” compared to “pear” or “pare” in “The cook
used a knife to pare the pair of pears”.
human language technology
HLT
technology as applied to natural languages
idiom
See multiword expression
inflected form
form that a word can take when used in a sentence or a phrase
NOTE: An inflected form of a word is associated with a combination of morphological features,
such as grammatical number or case.
inflectional paradigm
set of form operations that builds the various inflected forms of a lemmatised form
NOTE: An inflectional paradigm is not the explicit list of inflected forms.
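The distinction in the note (a paradigm is a set of form operations, not a list of forms) can be sketched as follows. The paradigm below is a deliberately simplified assumption covering only regular English nouns, and the names `REGULAR_NOUN` and `inflect` are hypothetical.

```python
from typing import Callable, Dict

# A form operation maps a lemmatised form to an inflected form.
FormOperation = Callable[[str], str]

# Simplified paradigm for regular English nouns (an assumption for illustration).
REGULAR_NOUN: Dict[str, FormOperation] = {
    "singular": lambda lemma: lemma,
    "plural": lambda lemma: lemma + "s",
}


def inflect(lemma: str, paradigm: Dict[str, FormOperation]) -> Dict[str, str]:
    """Apply every form operation of the paradigm to the lemmatised form."""
    return {feature: operation(lemma) for feature, operation in paradigm.items()}


forms = inflect("table", REGULAR_NOUN)
print(forms)
```

The same paradigm can be applied to any lemma of the class, which is why it need not enumerate the inflected forms explicitly.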
lemmatised form
lemma
conventional form chosen to represent the word
NOTE: In European languages, the lemmatised form is the /singular/ if there is a variation in
/number/, the /masculine/ form if there is a variation in /gender/ and the /infinitive/ for all verbs.
In some languages, certain nouns are defective in the singular form, in which case, the /plural/
is chosen. Certain words are also defective in the /masculine/ in which case, the /feminine/ is
chosen. The lemmatised form can be graphical or phonetic.
lexical database
lexical resource
See electronic lexical resource
machine translation lexicon
lexical resource in which the individual entries contain equivalents in two or more languages
together with semantic information to facilitate automatic or semi-automatic processing of
lexical units during machine translation [ISO 1087]
morphological feature
category induced from the inflected form of a word
NOTE: ISO 12620 provides a comprehensive list of values for European languages. An
example of a morphological feature is: /grammatical gender/.
morphology of a word
morpho-syntax of a word
description comprising the lemmatised form or forms of a word, its /part of speech/ data
categories, possibly its inflectional paradigm or paradigms, and possibly its explicitly listed
inflected forms.
NOTE: Despite the reference to syntax, morpho-syntactic information does not include
syntactic information.
multiword expression
MWE
idiom
a word composed of an ordered group of words that has properties that are not predictable
from the properties of the individual words or their normal mode of combination.
Note: The group of words making up an MWE can be continuous or discontinuous.
Example: “to take an exam”. “He took the exam.” (continuous). “What did you do with the exam
that he supposedly took last Thursday?” (discontinuous).
natural language processing
NLP
field covering knowledge and techniques involved in the processing of linguistic data
non-autonomous word
a word that appears in multiword expressions but cannot appear alone
Example: In French “au fur et à mesure”, the component “fur” cannot appear alone. In English
“to take umbrage”, the component “umbrage” cannot appear alone.
See also autonomous word
object language
language of the lexical object being described [ISO 16642 definition 3.10]
open data category
data category whose content is completely optional
Example: Typical open data categories might include /term/, /lemma/, /definition/.
orthography
a way of spelling or writing words that conforms to a specified standard
Note: Aside from standardized spellings of alphabetical languages, such as standard UK or
US English, or reformed German spelling, there can be variations such as transliterations or
romanizations of languages in non-native scripts, stenographic renderings, or representations
in the International Phonetic Alphabet. In this regard, orthographic information in a lexical
entry can describe a kind of transformation applied to the form that is the object of the entry.
The specific value /native/ is dedicated to represent the absence of transformation.
Examples: /transliteration/ and /romanization/ are possible values for a transformation.
part of speech
grammatical category
word class
category assigned to a word based on its grammatical and semantic properties
NOTE: ISO 12620 provides a comprehensive list of values for European languages.
Examples of such values are: /noun/ and /verb/.
romanization
transliteration using Latin script
script
set of graphic characters used for the written form of one or more languages
(ISO/IEC 10646-1, 4.14)
Note: The description of scripts ranges from a high level classification such as hieroglyphic or
syllabary writing systems vs alphabets to a more precise classification like Roman vs Cyrillic.
Scripts are defined by a list of values taken from ISO 15924. Examples are: Hiragana,
Katakana, Latin, Cyrillic.
semantics of a word
description of the meanings of the word
simple data category
data category that is itself the possible content of a closed data category, but that cannot
itself have content
Example: /masculine/, /feminine/, and /neuter/ are possible simple data categories associated
with the conceptual domain of the closed data category /grammatical gender/ as it is
associated with the German language.
single word
word that does not contain any other word
splitting conditions
the criteria that determine whether a linguistic phenomenon is described by one element or by several elements
Example: the criteria specifying why a word is described by one entry or two entries in a
lexicon
syntax of a word
description of the behaviour of the word with respect to other words in a sentence or a phrase
transcription
form resulting from a coherent method of writing down speech sounds
transliteration
form resulting from the conversion of one writing system into another
word
in the context of a given language, is a description composed of at least a /part of speech/ and
a lemmatised form
NOTE: The description can be more complete with more morphological information and/or
syntactic and semantic information. A word is either a single word or a multiword expression.
word class
See part of speech.
word frequency
number of occurrences of a particular word in a certain corpus, divided by the number of words
in this corpus
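The definition above amounts to a relative frequency: the count of a word divided by the total number of word tokens in the corpus. A minimal sketch, assuming the corpus has already been segmented into a list of tokens (the function name `word_frequencies` and the toy token list are illustrative only):

```python
from collections import Counter
from typing import Dict, List


def word_frequencies(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each word: occurrences / total tokens in the corpus."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# Toy segmented corpus (an assumption for this example).
tokens = ["the", "cat", "saw", "the", "dog"]
freq = word_frequencies(tokens)
print(freq["the"])  # 2 occurrences out of 5 tokens
```

The result depends on the segmentation used to produce the tokens, which is one reason consistent segmentation matters for frequency estimation.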
working language
language used to describe objects in a lexical resource [ISO 16642 definition 3.21]