Describing and Discovering Language Resources School of Informatics University of Edinburgh

Describing and Discovering
Language Resources
David Illsley, Ewan Klein, Steve Renals
School of Informatics
University of Edinburgh
Overview
• Goals: availability and interoperability
• Service oriented architecture and workflow
• NLP Components
• Service description and discovery
• NLP and the Grid
What are Language Resources?
• Language Resources (LRs) of two kinds:
• Static resources:
– corpora (text, speech, multimodal)
– lexicons, terminologies, ontologies
– grammars, declarative rule-sets
• Processing resources:
– segmenters, tokenizers, zoners, taggers, entity
classifiers, chunkers, parsers, …
Goals
• Maximize availability of static LRs
for automatic processing
• Maximize interoperability of
processing LRs
LRs on the WWW, 1
• Can use the WWW to locate corpora
• Example: OLAC (Open Language
Archive Community)
– Provides query interface to search for
corpora across multiple repositories
– Requires standard metadata record for
harvesting.
– Does not provide access to corpora.
LRs on the WWW,2
• Can use the WWW to directly search
corpora
• Many examples
• BNC Online Search
– words (with regular expressions)
– tag strings
• Typically search is limited
(expressiveness, number of results)
LRs on the WWW, 3
• Can use the WWW to download tools
• Some tools offer a demo web interface
• No interoperability:
– you cannot take the output of one web-interfaced tool and feed it as input to
another tool
LRs on the WWW, 4
• Challenges for accessing static LRs for
automatic processing:
– licensing restrictions
– file (or database) structure
– data format
– data transfer
• What about processing LRs?
– can download,
– but not execute in an interoperable manner
Web Services (WS)
• A WS is a self-contained software resource
• Can be located and invoked across the web:
– identified by a URL
– public interfaces and bindings are defined and
described using XML
• Other applications interact with it in a
prescribed manner
– XML-based messages conveyed by internet
protocols (e.g. HTTP)
• Web services can be composed into complex,
distributed applications
Service Oriented Architecture (SOA)
[Figure: a Service Provider publishes a service description to Discovery Agencies; a Service Requester (client) locates the description and then interacts with the service over the WWW. Source: Berners-Lee]
Web Service: Key Ideas
• Interaction with Web Services is described by, and conducted using,
XML documents exchanged over the internet
• SOAP protocol
– describes the form of messages and how to
process them
– a way of representing Remote Procedure Calls
over HTTP
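To make the message form concrete, here is a minimal sketch of a SOAP 1.1 envelope wrapping an RPC-style call, built with Python's standard library only. The operation name `tokenize`, its `text` parameter and the `urn:example:nlp` namespace are hypothetical and chosen purely for illustration; they do not come from any service described in these slides.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:nlp"  # hypothetical service namespace (assumption)


def build_soap_request(operation: str, params: dict) -> bytes:
    """Wrap an RPC-style call in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The procedure being called becomes a single element inside the Body.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    # Each argument becomes a child element of the call element.
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


request = build_soap_request(
    "tokenize", {"text": "Colourless green ideas sleep furiously."}
)
```

The resulting bytes would typically be sent as the body of an HTTP POST to the service URL; that transport step is left out here so the sketch stays self-contained.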
The Appeal of Web Services
• A means of building distributed systems
• virtualization — not dependent on any one
programming language, OS, development
environment
• based on well-understood underlying
protocols
• components can be developed independently
• decentralized (apart from DNS)
NLP Services
• Fairly easy to wrap legacy code as web
services
• Allows us to deploy tools across the
web as part of a larger application
• Corpora can also be deployed as
services
• Helps with availability and interoperability
• But still many challenges
Building NLP Applications
• Many NLP applications involve relatively
few ‘conceptual’ components:
– tokenizers, taggers, named entity
recognizers, parsers, etc
– often different versions of the same
components
– much repeated (and messy) labour in wiring
the components together to interoperate
Issues in Component Approach
• Granularity
– What is appropriate ‘grain size’ of
functionality?
• Too fine: heavy overheads in communication,
lose ease of use
• Too gross: loss of flexibility
• Hierarchical decomposition is possible
• Compatibility
– informational, functional, formal
Linguistic Annotation
• Makes information in raw text explicit:
– Classification of words and phrases
– Detection of structural relationships
– Annotation with general and domain-specific
semantic labels
• Usually proceeds from more concrete to
more abstract
• Earlier stages of annotation feed into the
later stages
• Assumed that annotation is represented as
XML
Idealized View
Compatible NLP Services: Substitution
[Figure: service boxes labelled tokenize, POS tag (twice), parse]
Compatible NLP Services: Sequencing
[Figure: service boxes labelled tokenize, POS tag and parse, each appearing more than once]
WSDL File
• XML document, usually on same machine as
server
• Describes everything involved in calling a
web service:
– The service URL and namespace
– The type of web service
– List of available functions
– Arguments for each function
– Data type of each argument
– Return value of each function and data type of
each return value
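As an illustration of what such a description contains, the sketch below reads a heavily abbreviated WSDL 1.1 fragment and lists each function with its arguments, return values and their data types. The tagger service, its `tag` operation and the `urn:example:tagger` namespace are invented for this example; a real WSDL file would also carry binding and service sections.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, abbreviated WSDL 1.1 description of a tagger service.
WSDL_DOC = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:tns="urn:example:tagger"
             targetNamespace="urn:example:tagger">
  <message name="tagRequest">
    <part name="text" type="xsd:string"/>
  </message>
  <message name="tagResponse">
    <part name="tagged" type="xsd:string"/>
  </message>
  <portType name="TaggerPort">
    <operation name="tag">
      <input message="tns:tagRequest"/>
      <output message="tns:tagResponse"/>
    </operation>
  </portType>
</definitions>
"""


def list_operations(wsdl_text: str) -> dict:
    """Map each operation name to its argument and return parts as (name, type)."""
    root = ET.fromstring(wsdl_text)
    # Index the parts of every message by message name.
    messages = {
        msg.get("name"): [
            (part.get("name"), part.get("type"))
            for part in msg.findall("wsdl:part", WSDL_NS)
        ]
        for msg in root.findall("wsdl:message", WSDL_NS)
    }
    operations = {}
    for op in root.iter("{http://schemas.xmlsoap.org/wsdl/}operation"):
        # Message references look like "tns:tagRequest"; keep the local name.
        inp = op.find("wsdl:input", WSDL_NS).get("message").split(":")[-1]
        out = op.find("wsdl:output", WSDL_NS).get("message").split(":")[-1]
        operations[op.get("name")] = {"args": messages[inp], "returns": messages[out]}
    return operations


print(list_operations(WSDL_DOC))
```

Note that both the argument and the return value are typed only as `xsd:string`: nothing in the description says that one is raw text and the other is tagged text.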
Processor Input and Output Types
• Composition of NL processors
constrained by input and output types
• Candidates for types?
• WSDL provides simple data types:
– strings, integers, booleans
– not expressive enough
• Can we build on notion of metadata for
LRs?
IMDI Catalogue Specification
Catalogue.Title: Arabic Treebank
Catalogue.Subject-Language: ara
Catalogue.Content-Type: written
Catalogue.Format.Text: UTF-8
Catalogue.Smallest Annotation Unit: word
Catalogue.Publisher: LDC
Catalogue.Size: 266 Mb
LR Metadata Standards
• Advantages
– consistency
– software knows what to expect
– can be designed according to agreed principles
• Challenges
– no generally agreed ontology for LRs
– hard to get agreement (and who gets to decide?)
– categorizations of LRs influenced by favourite
linguistic theory
• Other people are addressing this issue
What’s missing: tool metadata
• What kind of metadata would enable
us to ensure tool interoperability?
• Neither OLAC nor IMDI provides an
answer.
Discovering Resources
• Who cares about discovering LRs?
– researchers who are searching for LRs that
meet specific research criteria
– information providers
– teachers, journalists, casual browsers
– …
• Current focus: automatic discovery by
software agents
Service Description & Discovery
• What LRs can be discovered depends on how
the LRs are described.
• How LRs are described depends on the
requirements for discovery.
• Composability:
– If an agent (human or software) has already
selected component P, what other components Q
can provide well-formed input to P ?
– Query for all Q such that Q’s output type is
compatible with P’s input type
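The composability query can be sketched as a lookup over a registry of typed processors. This is only an illustration under stated assumptions: the type names, the small hierarchy in `PARENT` and the three registry entries are hypothetical, and "compatible" is read here as "Q's output type is the same as, or more specific than, P's input type".

```python
from dataclasses import dataclass

# Hypothetical document-type hierarchy (assumption): child -> parent.
PARENT = {
    "TokenizedDocument": "Document",
    "TaggedDocument": "TokenizedDocument",
    "ParsedDocument": "TaggedDocument",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` is `general` itself or one of its descendants."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)  # climb one level; None at the root
    return False


@dataclass(frozen=True)
class Processor:
    name: str
    input_type: str
    output_type: str


# Illustrative registry of processing resources.
REGISTRY = [
    Processor("tokenizer", "Document", "TokenizedDocument"),
    Processor("tagger", "TokenizedDocument", "TaggedDocument"),
    Processor("parser", "TaggedDocument", "ParsedDocument"),
]


def providers_for(p: Processor, registry=REGISTRY) -> list:
    """All Q (other than P) whose output type is compatible with P's input type."""
    return [
        q for q in registry
        if q is not p and subsumes(p.input_type, q.output_type)
    ]


parser = REGISTRY[2]
print([q.name for q in providers_for(parser)])  # prints ['tagger']
```

Matching on a hierarchy rather than on exact type names is a design choice: it lets a processor that needs tokenized input also accept more richly annotated documents.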
Some Versions of BNC
name: British National Corpus, Version 1.0
type: text
size: 2866 MB

name: British National Corpus, Version 1.0, marked up in XML
type: text
size: 815 MB

name: British National Corpus, Version 1.0, parsed with Charniak parser
type: text
size: 419 MB

name: British National Corpus, Version 1.0, parsed with IMS parser
type: text
size: 2088 MB

name: British National Corpus, Version 1.0, parsed with Minipar
type: text
size: 448 MB
Corpus Request Scenario
• Agent A requests corpus C with property
[key = val].
• If C with [key = val] exists, serve it to A.
• Otherwise,
– find processor P such that output of P(C) satisfies
[key = val]
– apply P to C
– serve result to A
– store result for future requests
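The scenario above can be sketched as a small service that either serves a stored corpus version or generates and caches it on demand. The class name `CorpusService`, the property `("annotation", "tokens")` and the whitespace tokenizer are all hypothetical stand-ins, not part of the system described in these slides.

```python
class CorpusService:
    """Serve corpus versions by property, generating them on demand."""

    def __init__(self, base_corpus, processors):
        self.base = base_corpus       # the raw corpus C
        self.processors = processors  # (key, val) -> processor P
        self.store = {}               # (key, val) -> stored result of P(C)

    def request(self, key, val):
        prop = (key, val)
        # If C with [key = val] already exists, serve it.
        if prop in self.store:
            return self.store[prop]
        # Otherwise find a processor P whose output satisfies [key = val].
        processor = self.processors.get(prop)
        if processor is None:
            raise LookupError(f"no processor yields [{key} = {val}]")
        result = processor(self.base)  # apply P to C
        self.store[prop] = result      # store result for future requests
        return result                  # serve result to the agent


# Hypothetical processor: a whitespace tokenizer standing in for a real tool.
service = CorpusService(
    "the dog barks",
    {("annotation", "tokens"): lambda text: text.split()},
)
first = service.request("annotation", "tokens")
second = service.request("annotation", "tokens")  # served from the store
```

From the requesting agent's point of view the two calls are indistinguishable; only the first one triggers processing.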
Service Description
• Standard approach
– WSDL: describes service inputs/outputs in
terms of simple data types
– Doesn’t support semantically-based
service discovery
• Alternatives from Semantic Web
– inputs and outputs specified in an
ontology language
– OWL and RDF both possible
NLP as Document Annotation
• NL Processor
– takes a partially annotated document as input
– yields a more richly annotated document as
output
Tagging as document annotation
• Part of Speech Tagger
– takes in a document with markup of words
– yields a document with additional markup of part of
speech
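A minimal sketch of this view of tagging, assuming words are marked up as `<w>` elements and part of speech is recorded in a `pos` attribute. Both names, and the three-word lexicon, are illustrative assumptions; a real tagger would use a statistical model rather than a lookup table.

```python
import copy
import xml.etree.ElementTree as ET

# Toy lexicon for illustration only.
TOY_LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}


def pos_tag(doc: ET.Element) -> ET.Element:
    """Return a copy of `doc` in which every <w> element carries a pos attribute."""
    tagged = copy.deepcopy(doc)  # leave the input annotation untouched
    for w in tagged.iter("w"):
        # Unknown words receive the placeholder tag UNK.
        w.set("pos", TOY_LEXICON.get((w.text or "").lower(), "UNK"))
    return tagged


tokenized = ET.fromstring("<s><w>The</w><w>dog</w><w>barks</w></s>")
print(ET.tostring(pos_tag(tokenized), encoding="unicode"))
# prints <s><w pos="DT">The</w><w pos="NN">dog</w><w pos="VBZ">barks</w></s>
```

The output keeps all of the input markup and only adds to it, which is what allows later stages to build on earlier ones.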
Document Class
NB This is just corpus metadata!
Subsumption over the Document class
Subsumption over Processors
Grid & NLP
• Parallelism
– distribute processes over many machines
– use parallel algorithms within process
– redundancy and fault tolerance
• Distributed data
– multiple corpora
– distributed annotation of single corpus
• Distributed processing pipeline
– different components hosted at different sites
Implementation
• Based on Globus Toolkit 3.2 middleware
• Corpus Services and Transformation Services provide
interfaces for corpora and tools
• Service Data Elements describe properties of
services
– properties are aggregated by Index Service, can be queried
by clients
• Index Service extended by Model Service
– provides richer description of services using RDF triples
• Backward chaining used to construct pipelines that
will produce a requested resource
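The idea of backward chaining from a requested resource to a pipeline can be sketched as follows. This is not the Globus-based implementation: it matches resource types by exact name instead of querying RDF descriptions, and the processor table and type names (`raw`, `tokenized`, `tagged`, `parsed`) are hypothetical.

```python
def plan_pipeline(goal, available, processors, _seen=frozenset()):
    """Return processor names that produce `goal`, in execution order, or None."""
    if goal in available:
        return []        # the resource already exists: nothing to run
    if goal in _seen:
        return None      # guard against cyclic processor definitions
    for name, (needs, produces) in processors.items():
        if produces == goal:
            # Chain backwards: first work out how to obtain this processor's input.
            prefix = plan_pipeline(needs, available, processors, _seen | {goal})
            if prefix is not None:
                return prefix + [name]
    return None          # no processor chain yields the goal


# Hypothetical processors: name -> (input type, output type).
PROCESSORS = {
    "tokenizer": ("raw", "tokenized"),
    "tagger": ("tokenized", "tagged"),
    "parser": ("tagged", "parsed"),
}

print(plan_pipeline("parsed", {"raw"}, PROCESSORS))
# prints ['tokenizer', 'tagger', 'parser']
```

If an intermediate resource is already available, for example a tagged corpus, the planner stops chaining there and returns only the remaining steps.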
Summary
• Corpus query
– for user, no obvious distinction between raw and
processed data
• Corpus service
– either provide existing resource, or generate it
• Need to have metadata for tools which
allows automatic composition
• Metadata needs to allow subsumption
matching
– using shared controlled vocabulary