Beyond Text Representation
Building on Unicode to Implement
a Multilingual Text Analysis
Framework
Thomas Hampp – IBM Germany
Content Management Development
Basic Text Analysis Tasks






Code page conversion and text representation
Segmentation (tokens, sentences,
paragraphs)
Morphological analysis / dictionary lookup
Compound word decomposition
Spell Checking/Spell Aid
…
Thomas Hampp IBM
18th International Unicode Conference
Advanced Text Analysis Tasks
Summarization
 Categorization/Clustering
 Extraction of names, terms or relations
 Information extraction
 Parsing
All tasks should be provided for all
languages

A Library for Text Analysis




The same text analysis tasks are needed in
different multilingual contexts/systems
The same software library should be used in
all contexts/systems to perform the analysis
The library should work in a language-neutral way
The text analysis tasks required for a given
context/system should be an input parameter
for the library
Two Problems and One
Solution

The realization of such a library faces
two kinds of challenges:
A. Implementing the actual language-specific
   analysis tasks
B. Encapsulating the language-specific
   processing by representing input and
   output in a language-neutral fashion
Unicode plays a major role in solving
problem B
A Software Design for a Text
Analysis Library




Single API towards the application
Separated but combinable language-specific processing modules
Central representation system for
linguistic information
Centralized flow of control driven by
linguistic analysis targets
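
The four design points above can be illustrated with a small sketch. All names here (Target, InfoStore, Plugin, StubPlugin, Engine) are hypothetical and are not taken from the actual TAF API, which the slides do not show. The sketch assumes that each plugin declares which analysis target it produces and which targets it depends on, and that plugins are registered in an order that respects those dependencies.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical analysis targets an application can request.
enum class Target { Tokens, Sentences, DictionaryLookup };

// Central store shared by all plugins; reduced here to the text
// plus a log of which plugins ran.
struct InfoStore {
    std::u16string text;
    std::vector<std::string> ranPlugins;
};

// One combinable processing module.
class Plugin {
public:
    virtual ~Plugin() = default;
    virtual Target produces() const = 0;
    virtual std::vector<Target> dependsOn() const = 0;
    virtual void process(InfoStore& store) = 0;
};

// Stand-in plugin that only records that it ran.
class StubPlugin : public Plugin {
public:
    StubPlugin(std::string name, Target out, std::vector<Target> deps)
        : name_(std::move(name)), out_(out), deps_(std::move(deps)) {}
    Target produces() const override { return out_; }
    std::vector<Target> dependsOn() const override { return deps_; }
    void process(InfoStore& store) override { store.ranPlugins.push_back(name_); }
private:
    std::string name_;
    Target out_;
    std::vector<Target> deps_;
};

// Single entry point: the requested targets decide which plugins run.
class Engine {
public:
    void add(std::unique_ptr<Plugin> p) {
        assert(p != nullptr);
        plugins_.push_back(std::move(p));
    }

    void analyze(InfoStore& store, const std::set<Target>& requested) {
        // Expand the request with everything the needed plugins depend on.
        std::set<Target> needed = requested;
        bool grew = true;
        while (grew) {
            grew = false;
            for (const auto& p : plugins_) {
                if (needed.count(p->produces()) == 0) continue;
                for (Target t : p->dependsOn())
                    if (needed.insert(t).second) grew = true;
            }
        }
        // Run needed plugins in registration order (assumed dependency-safe).
        for (const auto& p : plugins_)
            if (needed.count(p->produces()) != 0) p->process(store);
    }

private:
    std::vector<std::unique_ptr<Plugin>> plugins_;
};
```

In this sketch an application that requests only dictionary lookup gets tokenization run first, while sentence detection is skipped because nothing requested depends on it.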
Text Analysis Framework (TAF) - High Level Design
[Diagram. Labels: TAF Application, Application API, TAF Engine, Plugin Control API, TAF Plugins (Tokenization, Dictionary Lookup, Term- / Name Identifier, ...), TAF Information Store, Document Representation, Annotation Structure, Table Structure]
Text Analysis Framework (TAF) Document Buffer & Annotation Structure
[Diagram. The document buffer holds the text "IBM software is great! ..." with numbered character positions (1 to 22, ...). Typed annotations, each with its own attributes, are layered over the buffer: Token, Term, Sentence, Paragraph, Document]
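
The idea of typed annotations layered over a document buffer can be sketched as follows. The Annotation and Document types are hypothetical and much simpler than whatever TAF actually uses. Unlike the 1-based positions in the diagram, this sketch assumes 0-based, half-open [begin, end) offsets counted in UTF-16 code units.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-off annotation: a type name plus a span in the buffer.
struct Annotation {
    std::string type;   // e.g. "Token", "Sentence"
    std::size_t begin;  // index of the first UTF-16 code unit (0-based)
    std::size_t end;    // one past the last code unit
};

struct Document {
    std::u16string buffer;                // the text itself, never modified
    std::vector<Annotation> annotations;  // all layers, in insertion order

    // Attach an annotation to the span [begin, end).
    void annotate(const std::string& type, std::size_t begin, std::size_t end) {
        assert(begin <= end && end <= buffer.size());
        annotations.push_back({type, begin, end});
    }

    // Text covered by one annotation.
    std::u16string covered(const Annotation& a) const {
        return buffer.substr(a.begin, a.end - a.begin);
    }

    // All annotations of one type, in insertion order.
    std::vector<Annotation> ofType(const std::string& type) const {
        std::vector<Annotation> result;
        for (const Annotation& a : annotations)
            if (a.type == type) result.push_back(a);
        return result;
    }
};
```

Because annotations only reference spans, several plugins can add layers (tokens, terms, sentences) over the same buffer without copying or altering the text.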
Implementation






Implemented as a C++ DLL/shared library
Provides an extensive object oriented API for
applications and plugins
Uses Unicode (ICU based) for all text content
Ported to 9 platforms (therefore no
platform-dependent solutions acceptable)
Strong focus on performance because of
use in search/indexing
Supports 30+ languages and 90+ code pages
Enter Unicode



Used as internal character
representation format (character set)
Converters from/to over 90 external
code pages had to be written/integrated
A decision had to be made on the
Unicode encoding format: we chose
UTF-16
The Pros of UTF-16




We started out without knowledge of
surrogate issues
False assumption: Fixed length
encoding
Good balance between size and
straightforward representation
Efficient interoperability with Windows,
Java, XML4C APIs etc
The Cons of UTF-16




Not a fixed length encoding because of
surrogates
Cannot be passed to legacy functions
(C library, OS APIs)
Character classification functions have
to work on pointers (not on single 16-bit
values) to handle surrogates
Wastes some space with western
languages
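
The surrogate issue can be made concrete with a short sketch. The function names are hypothetical. The decoding arithmetic follows the UTF-16 definition: a lead surrogate (0xD800 to 0xDBFF) followed by a trail surrogate (0xDC00 to 0xDFFF) encodes one code point above 0xFFFF. This is why a classification function needs a position in the string rather than a single 16-bit value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at index i and advance i by one or two
// code units. An unpaired surrogate is returned unchanged.
char32_t nextCodePoint(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    char16_t lead = s[i++];
    if (lead >= 0xD800 && lead <= 0xDBFF && i < s.size()) {
        char16_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            return 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10)
                           + (static_cast<char32_t>(trail) - 0xDC00);
        }
    }
    return lead;
}

// Number of code points, which can be smaller than s.size().
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) nextCodePoint(s, i);
    return count;
}
```

Code that indexes a UTF-16 string by code unit keeps working for most text, which is how the fixed-length assumption can go unnoticed for a long time.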
ANSI C/C++ Compatibility



ANSI C++ does define a type wchar_t for
“wide” character representation (and a
matching wide string class wstring)
Unfortunately size and encoding of wchar_t
are not standardized
So we combined the ANSI C++
basic_string template class with the
Unicode character data type from ICU to
create a C++ and Unicode conformant string
class
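
A minimal sketch of this combination, written for a current compiler rather than the one used at the time: UniChar is a hypothetical stand-in for ICU's 16-bit character type, and char16_t is assumed to be an acceptable substitute. With C++11 or later the standard library already provides character traits for char16_t, so no custom traits class is needed here. The helper widenAscii is also hypothetical.

```cpp
#include <cassert>
#include <string>

// Assumption: a 16-bit code unit type standing in for ICU's UChar.
using UniChar = char16_t;

// The standard string template instantiated for UTF-16 code units.
using UniString = std::basic_string<UniChar>;

// Widen 7-bit ASCII (e.g. from a command line) to UTF-16.
// Anything outside ASCII would need a real code page converter.
UniString widenAscii(const std::string& ascii) {
    UniString result;
    result.reserve(ascii.size());
    for (char c : ascii) {
        assert(static_cast<unsigned char>(c) < 0x80);
        result.push_back(static_cast<UniChar>(c));
    }
    return result;
}
```

The resulting type supports the usual basic_string operations (concatenation, find, substr), all of which work on code units, not on code points.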
Impact Beyond Character
Representation





Tokenization
Finite state processing
Dictionary formats
“Environmental” issues
Development tools support
Impact:
Tokenization




Tokenization needs access to character
properties
Most but not all relevant properties are
provided by the Unicode Character Database
For application-defined properties there
is no longer a fast and simple 256-entry
character property lookup
Approach limited to western scripts
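
One common replacement for a flat 256-entry table is a two-stage table, sketched below. This is an illustration of the general technique, not the approach the slides describe. The class name PropertyTable and the meaning of the property bits are hypothetical. The sketch covers 16-bit code units only. Supplementary characters would need a further stage.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-stage lookup: the high byte selects a block, the low byte selects
// an entry. Blocks with no application-defined properties share block 0.
class PropertyTable {
public:
    PropertyTable() : index_(256, 0), blocks_(1) { blocks_[0].fill(0); }

    void set(char16_t c, std::uint8_t props) {
        std::size_t hi = c >> 8;
        if (index_[hi] == 0) {  // still pointing at the shared default block
            blocks_.emplace_back();
            blocks_.back().fill(0);
            index_[hi] = static_cast<std::uint16_t>(blocks_.size() - 1);
        }
        blocks_[index_[hi]][c & 0xFF] = props;
    }

    // Two array accesses per lookup, independent of table contents.
    std::uint8_t get(char16_t c) const {
        assert(index_[c >> 8] < blocks_.size());
        return blocks_[index_[c >> 8]][c & 0xFF];
    }

    std::size_t blockCount() const { return blocks_.size(); }

private:
    std::vector<std::uint16_t> index_;
    std::vector<std::array<std::uint8_t, 256>> blocks_;
};
```

Lookup stays cheap, and memory grows only with the number of 256-character blocks that actually carry properties.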
Impact:
Finite State Processing



Finite state character processing in C
usually works with transition tables
encoded as arrays
This is easy to implement and very fast
in execution
To cover the full range of all Unicode
characters, more sophisticated
transition tables are required
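
One way to keep array-based transition tables while covering a large character set is to index the table by character class instead of by raw character. The sketch below shows that technique under stated assumptions and is not the design used in TAF. The classify function is a toy that knows only ASCII letters and digits and the Greek small letters. A real implementation would take its classes from Unicode character properties.

```cpp
#include <cassert>
#include <string>

enum CharClass { Letter = 0, Digit = 1, Other = 2, NumClasses = 3 };
enum State { Start = 0, InWord = 1, Dead = 2, NumStates = 3 };

// Toy classifier (assumption): maps a code unit to a small class id.
CharClass classify(char16_t c) {
    if (c >= u'0' && c <= u'9') return Digit;
    if ((c >= u'A' && c <= u'Z') || (c >= u'a' && c <= u'z')) return Letter;
    if (c >= 0x03B1 && c <= 0x03C9) return Letter;  // Greek small alpha..omega
    return Other;
}

// Transition table indexed by [state][class]: 3 x 3 entries instead of
// one column per possible character.
const State kTransitions[NumStates][NumClasses] = {
    /* Start  */ {InWord, Dead,   Dead},
    /* InWord */ {InWord, InWord, Dead},
    /* Dead   */ {Dead,   Dead,   Dead},
};

// Accepts a letter followed by any number of letters or digits.
bool isWord(const std::u16string& s) {
    State state = Start;
    for (char16_t c : s) {
        CharClass cls = classify(c);
        assert(cls < NumClasses);
        state = kTransitions[state][cls];
    }
    return state == InWord;
}
```

The automaton itself stays a plain array lookup, and the cost of supporting more scripts moves into the classification step.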
Impact:
Dictionaries





Dictionaries tend to be large
As much of them as possible has to be loaded
in memory for performance reasons
For multilingual (server) applications multiple
dictionaries will be in memory
Therefore dictionary size matters a lot
Doubling dictionary size might not be a
viable option
Impact:
“Environmental” Issues



There is always a residue of single byte
string data (from message catalog, command
line, library calls etc.) which sometimes has to
be mixed with Unicode string data
Interfaces for console, messages, logs etc.
are mostly single byte
Configuration files should be platform-neutral,
easily editable and support the full Unicode
character set
Impact:
Development Tools Support




Only specialized editors can handle
Unicode text
Most debuggers don’t display Unicode
Source code string constants are hard
to maintain
Message catalog compilers on some
platforms are not Unicode enabled
A Word About Unicode
Normalization Forms




For reasons of efficient interoperability a
fixed Unicode normalization had to be
specified
Early normalization is performance critical
Since round-trip convertibility was not a
design goal, Unicode Normalization Form KC
(compatibility composition, NFKC) has been chosen
Normalization and code page conversion can
and should be done in one step
Benefits of Unicode Use




No more code page troubles within the
boundaries of the application
Very often algorithms can be
established for groups of languages
Multilanguage document collections and
even mixed language documents are no
problem to represent
Easy and efficient Java (JNI) integration
Summing Up:
Building on Unicode…





…solves only the basic character
representation problem for multilingual text
analysis
…sets a solid foundation for a multilingual
system
…enables algorithms to be reused for groups
of languages.
…can have impact on the system far beyond
the character representation level
…has been worth the trouble