WinnipegPresentation - University of Manitoba

Introducing CRL
Computing
Research
Laboratory
The Computing Research Laboratory at NMSU
Jim Cowie – Director
Steve Helmreich – Deputy Director / 505-646-2141
shelmrei@nmsu.edu http://crl.nmsu.edu
• Established in 1983 by New Mexico
Legislature as a Center of Technical
Excellence
• CRL is a Research Department in the College
of Arts and Sciences at New Mexico State
University
• From 1983 to 1989 received more than $6.5
million in state funding.
• Since 1990, entirely self-supporting on
research grants and contracts.
CRL Capabilities and Expertise
• Multi-lingual text processing
• Speech processing and generation
• Human Computer Interaction
• Team of Computer Scientists, Psychologists,
Linguists, Computational Linguists,
Geographers, Biochemists and
Mathematicians (~40)
• capable of delivering complex, working,
prototype systems.
Language Engineering at CRL
• Information retrieval
• Language learning and language
teaching
• Automatic translation
• Summarization
• Question answering
• Dictionary development
• Knowledge discovery
Overview of Talk
• Projects related to Machine Translation
– Pragmatics-based Machine Translation
– Jargon analysis project
– IL Annotation project
• Projects using Machine Translation
– Expedition / Boas
– MOQA (Question / Answering)
Machine Translation triangle
(Figure: Source Language and Target Language at the base, joined by Direct Translation; Transfer at mid-height; Interlingual (IL) at the apex, reached by Analysis and left by Generation.)
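Machine Translation triangle
(Figure: the same triangle labelled for both directions: Analysis & Generation on one side, Generation & Analysis on the other, with Source & Target Language and Target & Source Language at the base, Transfer at mid-height, Direct Translation along the base and Interlingual (IL) at the apex.)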
CRL Machine Translation
Projects
• XTRA – Chinese-English IL, 1986-88
• ULTRA – five languages IL, 1988-90
• Pangloss – multi-site Spanish-English IL,
1992-95
• Mikrokosmos – Spanish-English IL, 1995-98
• Corelli – multi-lingual transfer, 1998-2001
Characteristics of IL MT
• Analysis to and generation from “meaning”
of the input
• Disambiguation to an unambiguous
language-independent representation (IL)
• Use of world knowledge to disambiguate
• World knowledge stored and manipulated
through an Ontology
Jesus of Montreal
• Woman to priest guiltily coming out of her
bedroom (in French): “Come on out, we’re
not playing a scene from Feydeau.”
• English subtitle: “Come on out. This isn’t a
bedroom farce.”
Which floor is this?
• In a Spanish newspaper article about
expensive real estate rental in Moscow:
“Nothing’s available on the ‘segundo
piso’ but there’s still some space left on
the ‘tercero piso.’”
• T1: second floor / third floor
• T2: third floor / fourth floor
Earthquakes – who is to blame?
• Acumulación de víveres por anuncios
sísmicos en Chile
• Hoarding Caused by Earthquake Predictions
in Chile
• STOCKPILING OF PROVISIONS
BECAUSE OF PREDICTED
EARTHQUAKES IN CHILE
Pragmatics-based MT hypothesis
• Translations are made on the basis of
interpretations
• Interpretations are a set of coherent inferences
about the content and the context of the message
• These inferences are based on
– Beliefs of the translator about the beliefs of the author
– Beliefs of the translator about the beliefs of the target
audience
– Beliefs of the translator about the world
Interpretations
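Machine Translation triangle
(Figure: the bidirectional triangle again, with Interlingual (IL) at the apex, Transfer at mid-height, Direct Translation along the base, and Analysis & Generation on each side between Source & Target Language and Target & Source Language.)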
Terrorist/Freedom Fighter
• sindicalistas: Union Members / Labor
Leaders
• asesino: killer / assassin
• asesinados: murdered / assassinated
• campesinos: small farmers / peasants
• sin tierras: without land / landless
• terrateniente: landowner / landholder
Hypothesis
• It is possible to identify an author's
viewpoint from the vocabulary (jargon)
used, particularly in the use of alternate
lexical items referring to the same concept
or object
Hypothesis
• Social groups are organized not just around topics
but also around points of view and
• develop jargons to express those points of view
• Members of those social groups generally hold to
those points of view and
• Use the jargons to express themselves
• THUS identifying an author’s jargon also
identifies the groups he/she belongs to and the
beliefs he/she is likely to hold
Training Corpus
• Issue: Abortion
• Text Size: approximately 8000 tokens each
• Text Size (types): 2273 pro-choice / 2168
pro-life
• Significant unique vocabulary: 79 pro-choice / 68 pro-life
• Significant common vocabulary: 113 / 37
Approach
• Unique vocabulary: 1581 pro-choice/1476 pro-life
• Common vocabulary: 692
• Significant unique vocabulary:
– 79 pro-choice
– 68 pro-life
• Significant common vocabulary: 113 (37)
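The split into unique and common vocabulary can be sketched as set arithmetic over the word types of the two corpora. The figures above are consistent with such a split: 1581 + 692 = 2273 pro-choice types and 1476 + 692 = 2168 pro-life types. The sketch below is a minimal illustration, not the project's code: the tokenizer and the names `tokenize` and `split_vocabulary` are assumptions, the two sample texts are toy stand-ins for the corpora, and the criterion used to call a word "significant" is not stated in the slides, so it is left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase words, keeping inner hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def split_vocabulary(text_a, text_b):
    """Return (types unique to A, types unique to B, common types), each with counts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    types_a, types_b = set(counts_a), set(counts_b)
    unique_a = {w: counts_a[w] for w in types_a - types_b}
    unique_b = {w: counts_b[w] for w in types_b - types_a}
    common = {w: (counts_a[w], counts_b[w]) for w in types_a & types_b}
    return unique_a, unique_b, common

# Toy texts standing in for the two corpora (not the project's data).
text_a = "The clinic blockade was anti-democratic. The clinic stayed open."
text_b = "The unborn child is blessed. The clinic closed."
unique_a, unique_b, common = split_vocabulary(text_a, text_b)
```

Each common word keeps both counts, so a later step can compare how often each side uses it.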
Unique Vocabulary – Pro-life
• abnormalities, aborted, abortifacient,
abortifacients, abortion-inducing,
abortionist, abortionists, adultery, amniotic,
bible, blessed, cancer-causing,
chastisement, chastisements, chastises,
complication, complications, contrite,
creator, depression
Unique Vocabulary – Pro-choice
• activism, activists, alley, anti-abortion, anti-choice, anti-democratic, antiabortionists,
arson, arsonist, arsons, attorney, attorney’s,
blockade, blockaders, blocked, blocking,
bomb, bombing, bombings
Significant Common Vocabulary
Word          Pro-life   Pro-choice
clinic(s)         3          46
fetus            22           7
parenthood        2          14
planned           2          15
unborn           15           1
week(s)          37           8
woman('s)         9          27
One-year project
• Using sounder statistical measurements
– Base line corpus
– Statistically significant differences
– Other methods of measuring differences
• Using collocations as well as single words
• Looking for “synonymous” terms
– WordNet
– Ontology
– Rogets
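One candidate for a "statistically significant difference" test is a two-term log-likelihood (G²) statistic, a form often used to compare word frequencies across two corpora. The slides do not say which measure was planned, so this choice is an assumption. The example counts (15 and 1 occurrences of "unborn" in roughly 8000 tokens per side) are read from the training-corpus figures reported earlier.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """G2 for a word seen count_a times in corpus A and count_b times in corpus B."""
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    total = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2.0 * total

# "unborn": 15 uses on one side, 1 on the other, about 8000 tokens each.
score = log_likelihood(15, 8000, 1, 8000)
# 3.84 is the 5% critical value of chi-square with one degree of freedom.
is_significant = score > 3.84
```

A word used equally often on both sides scores zero, so the measure ranks words by how strongly they separate the two viewpoints.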
Experiments
• Differentiate opinions in a binary opposition
within texts on the subject of opposition
• Differentiate opinions among a plurality of views
within texts on the subject
• Differentiate opinions in a binary opposition
within texts on a different subject
• Differentiate opinions among a plurality of views
within texts on a different subject
• Differentiate multiple viewpoints in any article
Problems with IL approach
• Idiosyncratic – no common understanding
of what IL should be or look like
• Limited automatic acquisition – most of the
knowledge base and lexicon is hand-coded
Interlingual Annotation of
Multilingual Text Corpora
Computing Research Laboratory – NMSU
Mitre Corporation
UMIACS – U Maryland
Columbia University
Language Technologies Institute – CMU
Information Sciences Institute – USC
Approach
• Collection of texts in six languages
• Three translations of each into English
• Tools to analyze grammatical aspects
– Morphological analysis
– Name recognition
– Chunking
Develop IL Representation
• Through study of texts
• Through examination of current ILs
• Develop formal definition
– Rich representation
– Compatible with under-specification
• Develop coding manuals and guarantee
inter-coder reliability
Annotate the Corpus
• All sites / all texts
• One site in charge of one aspect of IL
• Frequent interaction
• Regular joint meetings
Evaluate the results
• Inter-coder reliability
• Growth rate
• Grain size
• Quality of generation
Trends in HLT Research Funding
• Focus on sub-tasks
– Named entity recognition
– Coreference resolution
– Word sense disambiguation
• Bring multi-lingual capabilities to parallel
technologies
– Multi-lingual IR/IE/summarization
• Bring multiple technologies into one project
Three such projects at CRL
• Expedition / Boas
• MOQA – Meaning-Oriented
Question/Answering
• Personal Profiler
Expedition:
A tool for building Machine Translation systems
The Problem
Given two people, a linguist who knows a language,
and a programmer, provide a support system
which allows them to build a machine translation
system for that language in six months.
Project is completed and we are now using it to build
translation systems for Somali and Urdu.
You can try out the system at http://aiaia.nmsu.edu
Boas: “A Linguist in the Box”
Boas is a semi-automatic knowledge elicitation system that
guides a language speaker through the process of
developing the static knowledge sources for
a moderate-quality, broad-coverage MT system from any
“low-density” language into English in about six months.
Some of the tasks include providing a list of characters and
morphological features, paradigms for inflected classes,
equivalents of closed-class items, translation of place
names and open class items from English into the source
language.
Language knowledge acquisition has been a bottleneck for MT
development and deployment for over 40 years. At the same time, the
dearth of data resources has strongly limited the deployment of any of the
recent corpus-based techniques in practical MT environments.
Expedition is a “quick ramp-up” MT environment between “low density”
languages and English which is a step to alleviating these problems.
Boas, the main knowledge acquisition module inside Expedition,
includes resident knowledge about
•a set of potential source languages
•generalized parametric typological knowledge about languages in
general and
•methods and configurations for human-computer interaction.
It is designed for use by a team which does not include trained
computational linguists.
Boas contains knowledge about human language and means
of realization of its phenomena in a number of specific
languages and is, thus, a kind of a “linguist in the box” that
helps non-professional acquirers with the task whose
complexity is well-known.
The ethnologist and linguist Franz Boas was the founder of the American
school of descriptive linguistics.
In this photo, circa 1900?, he is shown posing for a model which was being
made of a Kwakiutl Winter Ceremonial dancer in which the dancer emerges
from within a circular hole cut in the dancing screen.
Meaning-Oriented Question-Answering with
Ontological Semantics
An AQUAINT Project from
ILIT
Development Strategy
• Meaning oriented question answering
• Rapid Prototyping using pre-existing components
• Evaluation of end-to-end system performance for
specific tasks (collaboration with AWARE project,
Bill Ogden, CRL)
• Project commenced August 2002
• Current system runs on Linux or Windows 2000
Meaning-Oriented Question-Answering with
Ontological Semantics
• Initial Domain: travel and meetings
– question understanding and interpretation
– determining the answer and
– presenting the answer
• two kinds of data source
– Structured Fact Repository containing instances of
ontological entities
– open text (in English, Arabic and Farsi)
System Overview (V0)
(Diagram components: Document Sources, Document Retrieval, Text Analyzer, Human Acquisition, Fact Repository, and Query Interface & Answer Formulation. Legend: human, machine.)
System Overview (V1 now)
(Diagram components: Document Sources, Document Retrieval, Text Analyzer, Fact Repository, and Query Interface & Answer Formulation. Legend: human, real-time, batch. Label: questions.)
System Overview (V2)
(Diagram components: Document Sources, Document Retrieval, Text Analyzer, Fact Repository, and Query Interface & Answer Formulation. Legend: human, real-time, batch. Label: questions & texts.)
Batch Processing Overview
Web Spider -> Documents -> Keizai Indexing -> Document Collection -> Keizai Retrieval -> Document Subset -> Text Analysis -> Text Meaning Representation -> TMR to FR Converter -> Fact Repository
Batch Mode - Fact Repository Population
• Spidered contemporary text
• Retrieval done using Keizai retrieval
system (Unicode based)
• Uses a list of interesting people and travel
keywords
• Selected documents saved and
automatically processed using UMBC’s
analyzer (which produces text meaning
representations)
• Instances of concepts from TMR
extracted and stored in Fact Repository
Interactive Processing Overview
(Diagram labels: Query Interface, Answer formulation, XML Answer, NL Query, Information Server, Analyzer, TMR, Instance Finder, Instances, Fact Repository.)
Interactive Mode – Question Answering
Question submitted – text or structured query
Routed to Fact Repository (Structured Queries) or
To Text Analyzer (NL queries)
Question converted to TMR
TMR to:
• Structured query (if good match and sent to user for
validation), or
• Converted to a direct Fact Repository query
Answer retrieved from FR and displayed
• Fall back queries if basic query cannot be answered
Follow up queries can be further questions or use the
multi-modal facilities of the interface.
A trace of the dialog is maintained.
Information Server
• Mediates between User Interface and all System
Components
• Fact Repository
• Question Analysis
• TMR Production
• Uses XML to communicate with Answer
Formulation Component
• Java structures communicate with fact repository
interface
• Java-lisp interface communicates with text
analyzer
Structured Fact Repository
• Uniform format for all kinds of data
• Uniform support for multiple applications
and tools
• Semantically anchored in general ontology
• Implemented using PostgreSQL
(REQUEST-INFO-842
(THEME (VALUE (MEMBER-OF-842.DOMAIN)))
(INSTANCE-OF (VALUE (REQUEST-INFO)))
)
(MEMBER-OF-842
(TIME (VALUE ((FIND-ANCHOR-TIME))))
(RANGE (VALUE (POLITICAL-ENTITY-842)))
(INSTANCE-OF (VALUE (MEMBER-OF)))
)
(POLITICAL-ENTITY-842
(OBJECT-NAME (VALUE ("Al Qaeda")))
(INSTANCE-OF (VALUE (POLITICAL-ENTITY)))
)
TMR for “Who is in al Qaeda?”
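The TMR above is a set of parenthesized frames, so it can be read with a small s-expression parser. This is a minimal sketch for illustration only; the functions `parse_sexprs` and `slot_value` are hypothetical helpers, not part of the MOQA system.

```python
import re

def parse_sexprs(text):
    """Parse a string of parenthesized expressions into nested Python lists."""
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()"]+', text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        return token.strip('"')

    frames = []
    while pos < len(tokens):
        frames.append(read())
    return frames

def slot_value(frame, slot):
    """Return the VALUE filler of a slot in a parsed frame, or None if absent."""
    for item in frame[1:]:
        if item[0] == slot:
            return item[1][1][0]
    return None

tmr = '''
(REQUEST-INFO-842
 (THEME (VALUE (MEMBER-OF-842.DOMAIN)))
 (INSTANCE-OF (VALUE (REQUEST-INFO))))
(MEMBER-OF-842
 (TIME (VALUE ((FIND-ANCHOR-TIME))))
 (RANGE (VALUE (POLITICAL-ENTITY-842)))
 (INSTANCE-OF (VALUE (MEMBER-OF))))
(POLITICAL-ENTITY-842
 (OBJECT-NAME (VALUE ("Al Qaeda")))
 (INSTANCE-OF (VALUE (POLITICAL-ENTITY))))
'''
frames = {frame[0]: frame for frame in parse_sexprs(tmr)}
entity = slot_value(frames["MEMBER-OF-842"], "RANGE")
name = slot_value(frames[entity], "OBJECT-NAME")
```

Following RANGE from the MEMBER-OF frame to the POLITICAL-ENTITY frame recovers the name the question is about.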
try-v3
  syn-struc
    root   try            cat v
    subj   root $var1     cat n
    xcomp  root $var2     cat v     form OR infinitive gerund
  sem-struc
    set-1     element-type refsem-1    cardinality >=1
    refsem-1  sem event    agent ^$var1    effect refsem-2
    modality  modality-type epiteuctic    modality-scope refsem-2    modality-value < 1
    refsem-2  value ^$var2    sem event
REQUEST-INFO-130
  THEME          DEVELOP-2601.PURPOSE DEVELOP-2601.REASON
  TEXT-POINTER   why
  INSTANCE-OF    REQUEST-INFO
DEVELOP-2601
  THEME          SET-2555
  AGENT          NATION-97
  PHASE          CONTINUOUS
  TIME           FIND-ANCHOR-TIME
  INSTANCE-OF    DEVELOP
  TEXT-POINTER   developing
NATION-97
  HAS-NAME       Iraq
  INSTANCE-OF    NATION
  TEXT-POINTER   Iraq
SET-2555
  ELEMENT-TYPE   WEAPON
  CARDINALITY    > 1
  INSTRUMENT-OF  KILL-1864
  THEME-OF       DEVELOP-2601
  INSTANCE-OF    WEAPON
  TEXT-POINTER   weapons
KILL-1864
  THEME          SET-2556
  INSTRUMENT     SET-2555
  INSTANCE-OF    KILL
  TEXT-POINTER   destruction
SET-2556
  THEME-OF       KILL-1225
  ELEMENT-TYPE   HUMAN
  CARDINALITY    > 100
  INSTANCE-OF    HUMAN
  TEXT-POINTER   mass
TMR for “Why is Iraq developing weapons of mass destruction?”
Resume Generator
Generating a resume for an individual:
1. Collect and prepare the data
Gather documents from the web in English, Russian and Spanish.
Filter the documents to reduce the data to a collection of related documents.
2. Individual Document Summarization
(This is done for each document in the collection)
Determine a date for the document
Select concise relevant pieces of information from the filtered collection of
documents.
Determine a date for each of the selected extracts.
Translate the pieces of text into English (our target language).
3. Profile Generation
Merge the translated text extracts in chronological order to produce the cross
document summary.
Generate the output form for the end user.
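The profile generation step, merging dated extracts from many documents into chronological order, can be sketched as a sort over dated records. This is a minimal sketch under assumptions: the `Extract` record and `build_profile` function are hypothetical names, and the sample extracts are invented placeholders, not output of the Resume Generator.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Extract:
    text: str         # extract, already translated into English
    when: date        # date determined for the extract
    source_lang: str  # language of the original document

def build_profile(extracts):
    """Merge extracts from all documents into one chronological summary."""
    ordered = sorted(extracts, key=lambda e: e.when)
    return [f"{e.when.isoformat()}: {e.text}" for e in ordered]

# Invented placeholder extracts from three source languages.
extracts = [
    Extract("Appointed regional director.", date(2001, 3, 15), "es"),
    Extract("Graduated from university.", date(1989, 6, 1), "ru"),
    Extract("Spoke at a trade conference.", date(1997, 10, 20), "en"),
]
profile = build_profile(extracts)
```

Sorting on the extract date rather than the document date matters when one document reports events from several years.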
Language Engineers in Short Supply
• Emerging field – combining Linguistics,
Computational Linguistics, Computer Science,
Systems Analysis, and Human Factors
• Masters Degrees being offered at
– University of Southern California
– Arizona University
– University of Colorado
– Carnegie Mellon University
• Potential both for supporting research and
for developing applications.
Former CRL Staff and Students are working
on language applications at
• Microsoft Natural Language Group
• Systran
• AT&T
• Telelogue (talking yellow pages)
• Westlaw (Spanish language processing group)
• General Electric
• Motorola Chinese Telephony Group
• The Institute for Genomic Research (TIGR) (bioinformatics)
• University of Maryland Baltimore County
• University of Sheffield
Appendix
Ecology Development
Challenges and Needs
• Research into appropriate processing methods for
language ecology is needed. Only a tiny handful
of languages have had any kind of
research/evaluation effort for these topics
(English, French, Japanese, Spanish).
• Research corpora need to be produced and made
publicly available. Main source of materials at
the moment is the Linguistic Data Consortium
(http://www.ldc.upenn.edu/)
• Language processing resources such as proper
name lists (onomastica), lexicons, morphological
analyzers, and patterns of features for names need
to be produced
Requirements for Basic Analysis
• Corpora
• Markup
• Character sets
• Punctuation
• Part of Speech Tagging
• Noun Phrase Recognition
Requirements (Continued)
• Numbers and Dates
• Onomastica
• Un-attributed Proper Names
• Syntax
• General Guidelines
• Challenges and Problems
Why do We Need Corpora?
• Ground our development in reality
• Provide basis for statistical processing
– testing
– learning
Types of Corpus
• Raw - only markup from the source - e.g. newswire
• Cleaned - standardized markup - e.g. TREC corpora
• NLP specific markup - e.g. Penn treebank, Wall Street
Journal
• Parallel Corpora with alignment markup
Sources - Standard
• Linguistic Data Consortium - LDC
• European Language Resources Association - ELRA
• National Institute of Standards and Technology - NIST
• Gutenberg Archive
• Oxford Text Archive
• International Computer Archive for Modern English - ICAME
• + Many national initiatives
Sources - Do It Yourself
• Participation in evaluations
– TREC, MUC, Amaryllis, Semeval
benefits are tagged corpora focused on a specific task
• Web spidering
– Site grabbing – web spiders
– Language grabbing - CRL language recognizing web spider
• Newswire capture
• Parallel Corpora
– Embassies, Company web sites
– United Nations, Pan American Health Organization
Character Sets
• For 8 bit
– Various ISO standards – Latin 1 to Latin 5
– Microsoft variants
– Others – e.g. KOI8 for Russian
• Various 16 bit Japanese and Chinese
standards – EUC, SJIS, Big5….
• Unicode
– UTF8 – variable width, one to four 8 bit units per character
– UCS2 – 16 bit (although many characters can
be composed of multiple characters)
Character Sets
• Eight bit character sets may be simpler if
processing only one language, or one
language + English
• Unicode offers the possibility of universal
tokenization (recognizing words), based on
character classifications
• Key is to make sure resources and data
being processed use the same character set
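Both points, converting data to a single character set and tokenizing by character class, can be sketched with the Python standard library. This is a minimal sketch, not CRL's tooling: `to_utf8` and `tokenize` are illustrative names, and treating letters, marks and digits as word characters is an assumption.

```python
import unicodedata

def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a legacy character set and re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")

def tokenize(text):
    """Split text into words using Unicode classes: letters (L), marks (M), digits (N)."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Russian text stored in the 8 bit KOI8-R character set, converted to UTF-8.
koi8_bytes = "Москва, Россия".encode("koi8_r")
utf8_bytes = to_utf8(koi8_bytes, "koi8_r")
words = tokenize(utf8_bytes.decode("utf-8") + " and Tokyo 東京")
```

The same tokenizer handles Cyrillic, Latin and Chinese characters without per-language rules. Scripts written without spaces between words still need a separate segmentation step.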
Sentence Segmentation
• Essential step in analysis
• Complicated by ambiguous use of punctuation and
by document headings and sub-headings (which
should be processed separately)
• For languages that use “.” as an abbreviation
marker, this needs a list of abbreviations plus automatic
recognition of abbreviations using a lexicon
• Still requires heuristics to handle abbreviations at
the end of a sentence.
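Abbreviation-aware segmentation can be sketched as a splitter that consults an abbreviation list before treating "." as a sentence end. This is a minimal sketch: the `ABBREVIATIONS` set and the sample text are illustrative assumptions. It deliberately does not handle an abbreviation that also ends a sentence, which is the case noted above as needing further heuristics.

```python
ABBREVIATIONS = {"co.", "corp.", "inc.", "mr.", "dr."}  # illustrative list only

def split_sentences(text):
    """Split on '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = ("Bridgestone Sports Co. said Friday it has set up a joint venture. "
        "The new company will make golf clubs.")
sentences = split_sentences(text)
```

Without the abbreviation list, the text would be cut after "Co." and yield three fragments instead of two sentences.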
Part of Speech Tagging
• Either statistical, based on tagged corpora, or rule
based. Tags here are based on the Penn Treebank
('november','NP') ('24','CD') (',',',')
('1989','CD') (',',',') ('friday','NP')
('bridgestone','NP') ('sports','NPS') ('co','NP')
('said','VBD') ('friday','NP') ('it','PP')
('has','VBZ') ('set','VBN') ('up','RP') ('a','DT')
('joint','JJ') ('venture','NN') ('in','IN')
Phrase Recognition
• Goal is to reduce the complexity of text processed
by Semantic Analysis to processing heads of
phrases
• To recognize, for example, noun phrases
describing companies
– “the third Japanese electric appliance concern”
– “the new company”
• and to recognize noun phrases in general
– “golf clubs”, “metal woods”
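Reducing a tagged sentence to phrases and their heads can be sketched as a pattern over part-of-speech tags. This is a minimal sketch under assumptions: the tag sets, the "modifiers then nouns" pattern and the rule that the last noun is the head are illustrative choices for English, and `noun_phrases` is a hypothetical name. The input reuses the tagged tokens from the tagging example.

```python
NOUN_TAGS = {"NN", "NNS", "NP", "NPS"}
MODIFIER_TAGS = {"DT", "JJ", "CD"}

def noun_phrases(tagged):
    """Return (phrase, head) pairs for runs of modifiers followed by nouns."""
    phrases, i = [], 0
    while i < len(tagged):
        start = i
        while i < len(tagged) and tagged[i][1] in MODIFIER_TAGS:
            i += 1
        noun_start = i
        while i < len(tagged) and tagged[i][1] in NOUN_TAGS:
            i += 1
        if i > noun_start:  # at least one noun: the last noun is taken as head
            words = [word for word, _ in tagged[start:i]]
            phrases.append((" ".join(words), words[-1]))
        elif i == start:
            i += 1  # neither modifier nor noun: skip this token
    return phrases

tagged = [("bridgestone", "NP"), ("sports", "NPS"), ("co", "NP"),
          ("said", "VBD"), ("friday", "NP"), ("it", "PP"),
          ("has", "VBZ"), ("set", "VBN"), ("up", "RP"), ("a", "DT"),
          ("joint", "JJ"), ("venture", "NN"), ("in", "IN")]
phrases = noun_phrases(tagged)
```

Semantic analysis can then work on the heads ("co", "friday", "venture") instead of every token.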
Morphological Analysis
• Inflection analysis + part of speech tagging
• Needed to detect various features
– Number, tense, gender, role …..
• And to produce a citation form for lexical
lookup
• MORE?
Numbers and Dates
• Numbers in numeric and alphabetic form can be
recognized and grouped with punctuation and
qualifiers using simple regular expressions
– Percentages, money, temperatures, weights etc.
• Dates can also be recognized by regular
expressions by adding months and a few separator
characters to the set of tokens used by the regular
expressions
– Thus NUM SLASH NUM SLASH NUM would be an
acceptable date expression; tests on ranges could be
added
• Many languages support multiple calendars and
these all need to be supported (Japanese, Arabic)
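Date recognition over token classes can be sketched by mapping each token to a class such as NUM, SLASH or MONTH and matching class sequences. This is a minimal sketch: fixed class sequences stand in for full regular expressions over token classes, the two patterns and the English month list are assumptions, and no range tests or alternative calendars are included.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|[/,.%$-]")
PUNCT_CLASSES = {"/": "SLASH", ",": "COMMA", "-": "DASH"}

# Accepted date shapes, written as sequences of token classes.
DATE_PATTERNS = [
    ("NUM", "SLASH", "NUM", "SLASH", "NUM"),
    ("MONTH", "NUM", "COMMA", "NUM"),
]

def classify(token):
    """Map a token to its class name."""
    if token.isdigit():
        return "NUM"
    if token.lower() in MONTHS:
        return "MONTH"
    return PUNCT_CLASSES.get(token, "OTHER")

def find_dates(text):
    """Return the token lists of every date expression found in the text."""
    tokens = TOKEN_RE.findall(text)
    classes = [classify(t) for t in tokens]
    found, i = [], 0
    while i < len(tokens):
        for pattern in DATE_PATTERNS:
            n = len(pattern)
            if tuple(classes[i:i + n]) == pattern:
                found.append(tokens[i:i + n])
                i += n
                break
        else:
            i += 1  # no pattern starts here
    return found

dates = find_dates("Released on November 24, 1989 and revised 11/24/1989.")
```

Supporting another date format means adding one more class sequence, and another language means swapping the month list.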
Onomastica
• Lists of proper names are an essential resource for
text processing.
• Do not need to be huge as many can be recognized
automatically using context patterns – e.g. “we
enjoyed our visit to Plaster, Texas”
• A large list of place names + well known people
and company names that can be regularly found in
abbreviated form (Ford, Bush etc.)
• Transliteration software may be useful to help
understanding in translated texts
Un-attributed Proper Names
For each language the following resources are
required
• databases of proper name components
– Human names, company terminators, company start and end
words, all the contents of the Onomasticon
• patterns to combine proper name components
– Mostly regular expressions
• name abbreviation algorithms
Toyota Motor Corporation -> Toyota Motor -> Toyota
International Business Machines -> IBM
• context based patterns
– A spokesman for eBay said ….
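The abbreviation algorithms and context patterns above can be sketched in a few lines. This is a minimal sketch: `shortened_forms`, `acronym` and `name_from_context` are hypothetical helpers, and the single "spokesman for X said" pattern is only one illustrative context pattern.

```python
import re

def shortened_forms(name):
    """List a name and its shorter forms, dropping trailing words one at a time."""
    words = name.split()
    return [" ".join(words[:n]) for n in range(len(words), 0, -1)]

def acronym(name):
    """Build an acronym from the initial letter of each word."""
    return "".join(word[0].upper() for word in name.split())

# Context pattern: whatever stands between "spokesman for" and "said" is a name.
CONTEXT_RE = re.compile(r"\bspokes(?:man|woman|person) for (\S+(?: \S+)*?) said\b")

def name_from_context(sentence):
    """Return the name introduced by the context pattern, or None."""
    match = CONTEXT_RE.search(sentence)
    return match.group(1) if match else None

forms = shortened_forms("Toyota Motor Corporation")
short = acronym("International Business Machines")
found = name_from_context("A spokesman for eBay said the site was back up.")
```

Once a full name has been seen, its shortened forms and acronym can be added to the list of known names for the rest of the document.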
Syntax
• Simple syntax probably sufficient before
semantic (user oriented) steps
– Noun phrases
– Compound verbs
– Subordinate clauses
General Guidelines
• The main guideline is to preserve a “reasonable”
amount of ambiguity for resolution by the
semantic analysis process
– Toyota – might be a product or a company
– Washington – might be a place or a person
– Taj Mahal – might be a mausoleum, or a casino
• But definite decisions should be made where
possible to reduce the load on the analyzer