Towards a Language-Independent Universal Digital Library

The Second International Conference on Universal Digital Libraries
(ICUDL 2006), 17-19 November 2006, Alexandria, Egypt
Sameh Alansary
Magdy Nagi
Noha Adly
sameh.alansary@bibalex.org
magdy.nagi@bibalex.org
noha.adly@bibalex.org
Bibliotheca Alexandrina
Introduction
• IT has made the full text of libraries’ assets available
digitally (independent of time, place and copy).
e.g.
- Million Book Project.
- Nasser Digital Library.
UDL
• Digitization alone does not lead to “universality”
in its fullest sense.
• A new dimension of universality should be added:
language independence.
Language-dependency blocks information
dissemination
• Language dependency keeps language barriers in place.
• If everyone could always read in their own
mother tongue, this would help in:
- Dissemination of knowledge.
- Preservation of nationality and identity.
- Preventing cultural hegemony.
• 80% of books and e-materials are written in English
and 20% are written in other languages.
Attempts to break language barriers
• Translation systems have been introduced (NLP):
Approaches:
1- Direct translation approach.
2- Transfer approach.
3- Interlingual approach.
Examples of Systems:
- Google translation:
http://www.google.ch/language_tools
- Fujitsu systems:
http://www.fujitsu.com/global/services/translation
Drawbacks of MT systems
1- The quality of results is often inadequate.
2- They work for only a limited number of language
combinations.
3- They place a heavy resource load on the network:
To translate from and to only 10 languages, 10
grammars, 10 lexicons, 90 translation dictionaries
and 90 sets of translation rules are needed (one per
ordered language pair, 10 × 9 = 90), plus
semantic processing in each language.
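The figures on this slide follow from simple counting: a pairwise design needs one translation dictionary and one rule set per ordered language pair. The sketch below reproduces the numbers for 10 languages. The function names are illustrative, and the comparison with a pivot design (one enconverter and one deconverter per language) is an assumption based on the UNL architecture presented later in these slides.

```python
def pairwise_resources(n_languages: int) -> dict:
    """Resources needed to translate between every ordered pair of languages."""
    directed_pairs = n_languages * (n_languages - 1)
    return {
        "grammars": n_languages,
        "lexicons": n_languages,
        "translation_dictionaries": directed_pairs,
        "translation_rule_sets": directed_pairs,
    }


def pivot_converters(n_languages: int) -> int:
    """Assumed pivot design: one enconverter plus one deconverter per language."""
    return 2 * n_languages


if __name__ == "__main__":
    print(pairwise_resources(10))
    print(pivot_converters(10))
```

The pairwise count grows quadratically with the number of languages, while the assumed pivot count grows linearly.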
Towards a universal
system for knowledge
representation
Some questions come to mind:
• How can we represent natural language materials in
a language-independent format? (a format is required)
• What system is suitable for representing
knowledge in the selected format? (a system is required)
• How is this system going to work?
Requirements for a universal representation
of knowledge:
1- The content of the original material (meaning)
must not be lost.
2- This universal format should be understandable
by various platforms over the network.
3- This universal format should be decodable to
any natural language.
UNL System
What is UNL? (1)
• The Universal Networking Language (UNL) is an artificial
language for computers to express information and
knowledge that can be expressed in natural language.
• Started in 1996, as an initiative of the UNU/IAS in Japan
• R&D in UNL
- Development on 15 languages: Arabic, Chinese,
English, French, German, Hindi, Indonesian, Italian,
Japanese, Korean, Portuguese, Russian, Spanish, Thai,
Swahili.
- Transferred to the UNDL Foundation in 2001.
What is UNL? (2)
• It expresses information or knowledge of natural language
(NL) in the form of a semantic network with hyper-nodes.
Example:
The boy who works here went to school
UNL expression:
{UNL}
agt(go(icl>move).@entry.@past, :01)
plt(go(icl>move).@entry.@past, school(icl>institution))
agt:01(work(icl>do), boy(icl>person).@entry)
plc:01(work(icl>do),here)
{/UNL}
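[Figure: UNL hypergraph of "The boy who works here went to school". The node go(icl>move), marked @entry and @past, has an agt arc to the scope :01 and a plt arc to school(icl>institution). Inside scope :01, work(icl>do) has an agt arc to boy(icl>person), marked @entry, and a plc arc to here.]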
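Each line of a UNL expression is a binary relation of the form label(source, target), optionally tagged with a scope such as :01. As a minimal sketch, the parser below splits such lines into their parts. The names Relation, split_top_level and parse_relation are illustrative, and the accepted syntax is inferred only from the example on this slide, not from the full UNL specification.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    label: str            # three-letter relation label, e.g. "agt"
    scope: Optional[str]  # scope (hyper-node) tag such as "01", if any
    source: str
    target: str


# label, optional ":scope", then the parenthesised argument list
HEAD = re.compile(r"^([a-z]{3})(?::([0-9A-Za-z]+))?\((.*)\)$")


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def parse_relation(line: str) -> Relation:
    match = HEAD.match(line.strip())
    if match is None:
        raise ValueError(f"not a UNL relation: {line!r}")
    label, scope, body = match.groups()
    args = split_top_level(body)
    if len(args) != 2:
        raise ValueError(f"expected two arguments: {line!r}")
    return Relation(label, scope, args[0], args[1])


if __name__ == "__main__":
    for text in ["agt(go(icl>move).@entry.@past, :01)",
                 "plc:01(work(icl>do),here)"]:
        print(parse_relation(text))
```

Splitting only on top-level commas matters because universal words can themselves contain commas inside their constraint lists.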
The UNL System
• System components
• Formalism (knowledge representation)
The UNL-system components
[Figure: the UNL system over the Internet. Each UNL language server pairs an EnConverter (EnCO) with a DeConverter (DeCO). Language servers are shown for UNL <-> Arabic, UNL <-> Chinese, UNL <-> English, UNL <-> Hindi, UNL <-> Japanese and UNL <-> Spanish. The user reaches UNL documents through the UNL Editor, UNL Viewer and UNL Proxy.]
A) Language servers:
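[Figure: a UNL language server. The EnConverter turns natural language (NL) into UNL using analysis rules; the DeConverter turns UNL into NL using generation rules. Other components shown: the UNL-language dictionary, the co-occurrence dictionary, the knowledge base, and a web server holding UNL documents.]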
B) UNL Tools:
1- UNL viewer.
2- UNL editor.
3- UNL verifier.
C) UNL Proxy Server:
• Searches for UNL documents on the web, sends them to the language
server, and displays them in the user’s chosen language.
Mechanism of conversion between NL and UNL
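[Figure: conversion between NL and UNL. Natural language texts pass through the Annotation Editor to become annotated natural language texts, which the EnConverter turns into a UNL document checked by the UNL Verifier. The DeConverter turns UNL documents back into natural language texts. Resources shown: grammatical rules, word dictionary, co-occurrence dictionary, UW dictionary, UNL KB and the Universal Parser. UNL documents are published by a web server as HTML+XML.]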
UNL as a formal language:
How does it represent knowledge?
1- Universal words (UW): to represent concepts.
Example:
boy(icl>person)
hear(icl>perceive(agt>person,obj>thing))
2- Relations: 38 semantic relations are distinguished.
Example: agt, aoj, bas, con, coo, dur, … etc.
3- Attributes: to express subjectivity of the speaker.
Example: @past, @emphasis, @def, @not, … etc.
4- Knowledge base (UNLKB).
• Defines the universal words.
• Provides linguistic knowledge of concepts.
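A universal word as written in these slides has up to four parts: a headword, a constraint list in parentheses, an optional two-character node ID after a colon, and attributes introduced by .@. The following sketch pulls those parts apart. The class and function names are illustrative, and the suffix grammar is an assumption inferred from the examples in this presentation rather than taken from the UNL specification.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UniversalWord:
    headword: str
    constraints: str = ""
    node_id: str = ""
    attributes: list = field(default_factory=list)


# Optional ":ID" (two characters, as in the slides) followed by ".@attr" items.
SUFFIX = re.compile(r"^(?::([0-9A-Z]{2}))?((?:\.@[a-z_]+)*)$")


def parse_uw(text: str) -> UniversalWord:
    text = text.strip()
    open_at = text.find("(")
    if open_at == -1:
        # No constraint list: the headword ends where the suffix begins.
        found = re.search(r":[0-9A-Z]{2}|\.@", text)
        cut = found.start() if found else len(text)
        headword, constraints, rest = text[:cut], "", text[cut:]
    else:
        depth = 0
        for close_at in range(open_at, len(text)):
            if text[close_at] == "(":
                depth += 1
            elif text[close_at] == ")":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError(f"unbalanced parentheses: {text!r}")
        headword = text[:open_at]
        constraints = text[open_at + 1:close_at]
        rest = text[close_at + 1:]
    suffix = SUFFIX.match(rest)
    if suffix is None:
        raise ValueError(f"unexpected suffix: {rest!r}")
    attributes = re.findall(r"@([a-z_]+)", suffix.group(2))
    return UniversalWord(headword, constraints, suffix.group(1) or "", attributes)


if __name__ == "__main__":
    print(parse_uw("boy(icl>person)"))
    print(parse_uw("hear(icl>perceive(agt>person,obj>thing))"))
    print(parse_uw("go(icl>move).@entry.@past"))
```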
Ibrahim Shihata UNL Arabic Center (ISUAC)
• It was established at Bibliotheca Alexandrina.
• It is responsible for designing, implementing, and
maintaining the various components of the Arabic
language server.
• The Arabic language server will be capable of:
- Enconverting the Arabic texts to the universal format.
- Deconverting the universal materials produced by
other language centers to Arabic.
The Achievements of the ISUAC
A) Arabic language resources and tools.
B) Developing tools.
C) Arabic language-based universal materials.
A) Arabic language resources and tools:
1- The Arabic Dictionary:
It is a repository of information for all UNL Arabic
grammars.
Each dictionary entry combines:
- Head words (vocabulary of Arabic)
- Universal words (vocabulary of UNL)
- Linguistic features (linguistic information about the head words)
2- Arabic EnConversion Rules:
• They are responsible for enconverting Arabic to UNL.
• Arabic EnConversion Rules are able to:
1- Perform morphological analysis to extract the concepts
the Arabic words refer to.
2- Assign the exact semantic relations between concepts as
expressed in the context of the Arabic sentence.
• Simulation of how Enconverter works
ولد جمال عبد الناصر في 15 يناير 1918 في 18 شارع قنوات حي باكوس بالإسكندرية.
(Gamal Abdel Nasser was born on 15 January 1918 at 18 Kanawat Street, Bakous district, Alexandria.)
Segmented input: /ولد/ /جمال عبد الناصر/ /في/ /15/ /يناير/ /1918/ /في/ /18/ /شارع/ /قنوات/ /حي/ /باكوس/ /بال/ /إسكندرية/.
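[Figure: step-by-step enconversion of the sentence. Two delete operations are applied, and the remaining words are mapped to the nodes Gamal Abdelnaser(iof>person), January, 15, 1918, 18, street(icl>road), Kanawat(iof>street), territory(icl>region), Bakous(iof>place) and Alexandria(iof>city), linked by obj, tim, plc and mod relations under the node born(obj>thing).@past.]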
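UNL Network:
[Figure: the resulting UNL network. The node born(obj>thing), marked @past, is linked by obj to Gamal Abdelnaser(iof>person), marked @topic, and by tim and plc relations to the date and address nodes: January, 15, 1918, 18, street(icl>road), Kanawat(iof>street), territory(icl>region), Bakous(iof>place) and Alexandria(iof>city), which are further connected by mod and plc relations.]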
3- Arabic DeConversion Rules:
• They are responsible for generating Arabic sentences from
UNL networks.
• Arabic DeConversion Rules are able to:
1- Select Arabic words that represent universal concepts.
2- Arrange the concepts of the UNL network in a
syntactically well-formed sentence.
• Simulation of how the Deconverter works
[Figure: a UNL network given to the DeConverter. Nodes: outcome(icl>result).@entry, description(icl>action), Egypt, collaboration(icl>action), more(aoj>thing), 150, scientist(icl>scholar).@entry, and, scholar(icl>person), prominent(aoj>thing), accompany(agt>thing,obj>thing), Bonaparte(iof>person), Egypt and 1798, linked by obj, aoj, agt, bas, gol, tim and mod relations.]
Generated Arabic sentence:
وصف مصر محصلة تعاون أكثر من 150 باحث و عالم مرموق الذين صاحبوا بونابرت في 1798 إلى مصر
(The Description of Egypt is the outcome of the collaboration of more than 150 prominent scholars and scientists who accompanied Bonaparte to Egypt in 1798.)
4- A Corpus for Modern Standard Arabic:
• A representative sample (100 million) that
reflects the empirical usage of Modern Standard
Arabic.
• It plays a principal role in enhancing and
updating both EnConversion and DeConversion
rules.
B) Developing tools:
1- Integrated Development Environment (IDE)
2- Corpus analysis software (GATE)
C) Arabic language-based universal materials.
- Library of Alexandria: the Fourth Pyramid.
- The Encyclopaedia of Famous Persons.
- Abu Simbel: The Temple of the Sun.
- Nasser Digital Library.
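An example of an Arabic sentence in
UNL (universal) format
وكان جمال عبد الناصر الابن الأكبر لعبد الناصر حسين الذي ولد في عام 1888 في قرية بني مر في صعيد مصر في أسرة من الفلاحين، ولكنه حصل على قدر من التعليم سمح له بأن يلتحق بوظيفة في مصلحة البريد بالإسكندرية، وكان مرتبه يكفي بصعوبة لسداد ضرورات الحياة.
(Gamal Abdel Nasser was the eldest son of Abdel Nasser Hussein, who was born in 1888 in the village of Bani Morr in Upper Egypt to a family of farmers, but obtained enough education to allow him to join a job in the postal service in Alexandria, and his salary was hardly enough to cover the necessities of life.)
{unl}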
aoj(son(icl>person):0I.@def.@entry,
Gamal Abdel Nasser(iof>person):00)
mod(son(icl>person):0I.@def.@entry,
Abd El-Naser Hossain(iof>person):23.@topic)
aoj(old(aoj>thing):1J,
son(icl>person):0I.@def)
man(old(aoj>thing):1J,
most(icl>how):15)
obj(born(obj>thing):31.@past,
Abd El-Naser Hossain(iof>person):23.@topic)
and(get(agt>thing,obj>thing):6S.@past.@contrast,
born(obj>thing):31.@past)
scn(born(obj>thing):31.@past,
family(icl>group):5Q)
plc(born(obj>thing):31.@past,
village(icl>region):4D)
tim(born(obj>thing):31.@past,
year(icl>period):3M)
mod(year(icl>period):3M, 1888:41)
plc(village(icl>region):4D, upper Egypt(iof>place):58)
mod(village(icl>region):4D, Bani Morr(iof>village):4S)
mod(family(icl>group):5Q, farmer(icl>person):65.@pl.@def)
obj(get(agt>thing,obj>thing):6S.@past.@contrast,
degree(icl>abstract thing):7N)
agt(allow(agt>thing,gol>thing,obj>thing):8M.@past,
degree(icl>abstract thing):7N)
mod(degree(icl>abstract thing):7N,
education(icl>activity):82.@def)
gol(allow(agt>thing,gol>thing,obj>thing):8M.@past,
join(agt>person,obj>thing):9I.@present)
obj(allow(agt>thing,gol>thing,obj>thing):8M.@past,
his(pos>he):97)
and(suffice(aoj>thing,obj>thing):CM.@present,
join(agt>person,obj>thing):9I.@present)
obj(join(agt>person,obj>thing):9I.@present, job(icl>work):A7)
plc(job(icl>work):A7,
postal service(icl>service):AN)
plc(postal service(icl>service):AN,
Alexandria(iof>city):BB)
aoj(suffice(aoj>thing,obj>thing):CM.@present,
salary(icl>money):BV)
mod(salary(icl>money):BV, his(pos>he):CB)
obj(suffice(aoj>thing,obj>thing):CM.@present,
satisfy(agt>thing,obj>thing):DQ)
man(suffice(aoj>thing,obj>thing):CM.@present,
hardly:DA)
obj(satisfy(agt>thing,obj>thing):DQ,
demand(icl>wants):E6.@pl.@def)
mod(demand(icl>wants):E6.@pl.@def,
life(icl>activity):EV.@def)
{/unl}
Language Independent
Format
Is it going to work this way?
• Are there language servers ready to work?
• Are the universal materials deconvertible to
other languages?
What about Arabic?
• Is the Arabic language server able to enconvert
Arabic texts to universal format?
• Is it also able to deconvert the universal
materials back to Arabic?
A proof of concept
UNL-based Library Information
System (UNL-LIS)
• It is a system for searching digital library catalogs.
• It is built on the UNL KI, therefore:
- The query is in natural language (two languages).
- The answer is also in natural language (7 languages).
UNL LIS Core Architecture
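[Figure: UNL LIS core architecture. The user's question in NL goes through the enconversion process (language server: Enco rules + dictionary) to become a question in UNL. The query engine answers it from the UNL KB, which holds LIS MARC21 records (loaded by the MARC21 importing process), the encyclopedia, concepts and definitions. The answer in UNL goes through the deconversion process (language server: Deco rules + dictionary) to become an answer in NL.]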
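The architecture above is a three-stage pipeline: enconvert the question, match it against the knowledge base, and deconvert the answer into the requested language. The toy sketch below only illustrates that flow. Every name in it is hypothetical, the lookup tables stand in for the real Enco/Deco rules and dictionaries, and the three knowledge-base triples are taken from the demo output shown on the screen-shot slide.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]  # (relation, source UW, target UW)

# Toy stand-in for the UNL KB, using relations from the demo output.
KB = [
    ("plc", "born(aoj>thing)", "Cairo(iof>city)"),
    ("tim", "born(aoj>thing)", "1911"),
    ("aoj", "born(aoj>thing)", "Naguib Mahfouz(iof>person)"),
]

# Hypothetical stand-in for Enco rules + dictionary: question -> UNL pattern.
QUESTION_PATTERNS = {
    "where was naguib mahfouz born?": ("plc", "born(aoj>thing)"),
    "when was naguib mahfouz born?": ("tim", "born(aoj>thing)"),
}

# Hypothetical stand-in for Deco rules + dictionary, one table per language.
LEXICON = {
    "en": {"Cairo(iof>city)": "Cairo"},
    "es": {"Cairo(iof>city)": "El Cairo"},
}
TEMPLATES = {
    "en": "{subject} was born in {value}.",
    "es": "{subject} nació en {value}.",
}


def enconvert(question: str) -> Tuple[str, str]:
    return QUESTION_PATTERNS[question.strip().lower()]


def query_engine(pattern: Tuple[str, str]) -> Optional[Triple]:
    label, source = pattern
    for triple in KB:
        if triple[0] == label and triple[1] == source:
            return triple
    return None


def surface(uw: str, lang: str) -> str:
    # Fall back to the bare headword when the toy lexicon has no entry.
    return LEXICON[lang].get(uw, uw.split("(")[0])


def deconvert(answer: Triple, lang: str) -> str:
    _, source, target = answer
    subject = next(t for (rel, s, t) in KB if rel == "aoj" and s == source)
    return TEMPLATES[lang].format(subject=surface(subject, lang),
                                  value=surface(target, lang))


def ask(question: str, lang: str) -> Optional[str]:
    found = query_engine(enconvert(question))
    return None if found is None else deconvert(found, lang)


if __name__ == "__main__":
    print(ask("Where was Naguib Mahfouz born?", "en"))
    print(ask("Where was Naguib Mahfouz born?", "es"))
```

Because the question and the answer both pass through the same UNL form, adding an output language in this sketch only means adding one lexicon and one template table; the query step is unchanged.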
Demo:
Screen Shots
{unl}
agt(begin(agt>thing,obj>action):12.@past.@entry, Naguib Mahfouz(iof>person):0N.@topic)
obj(begin(agt>thing,obj>action):12.@past.@entry, writing(icl>action):18)
tim(begin(agt>thing,obj>action):12.@past.@entry, year old:1S.@past)
aoj(year old:1S.@past, Naguib Mahfouz(iof>person):0N.@topic)
qua(year old:1S.@past, 17)
plc(born(aoj>thing):00, Cairo(iof>city):08)
aoj(born(aoj>thing):00, Naguib Mahfouz(iof>person):0N.@topic)
tim(born(aoj>thing):00, 1911:0H)
{/unl}
[/S]
;;Time 1.4 Sec
;;Done!
{unl}
and(write(agt>thing,obj>thing):1K.@past.@entry, publish(agt>thing,obj>thing):0K.@past)
obj(write(agt>thing,obj>thing):1K.@past.@entry, novel(icl>tale):1B.@pl.@topic)
tim(write(agt>thing,obj>thing):1K.@past.@entry, before(icl>how(obj>thing)):1S)
aoj(more(icl>additional):1A, novel(icl>tale):1B.@pl.@topic)
qua(novel(icl>tale):1B.@pl.@topic, 10:16)
[/S]
1. Enter query.
2. Press to search Encyclopedia.
3. Specify result language (Arabic).
4. View results here (Naguib Mahfouz). Click for more information.
5. A link to the UNL document.
Conclusion
• Language independence is a very important dimension
that should be considered in storing and retrieving texts
for a UDL.
• The UNL system is a promising formalism for
representing knowledge in a universal format.
• The ISUAC is less than 2 years old; however, it is one of
the most active language centres in designing and
implementing UNL materials and tools.
• The UNL LIS has demonstrated the feasibility of the concept of
language independence.
Thank You
Questions are welcome.