Describing and Discovering Language Resources
David Illsley, Ewan Klein, Steve Renals
School of Informatics
University of Edinburgh

Overview
• Goals: availability and interoperability
• Service oriented architecture and workflow
• NLP Components
• Service description and discovery
• NLP and the Grid

What are Language Resources?
• Language Resources (LRs) of two kinds:
• Static resources:
  – corpora (text, speech, multimodal)
  – lexicons, terminologies, ontologies
  – grammars, declarative rule-sets
• Processing resources:
  – segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …

Goals
• Maximize availability of static LRs for automatic processing
• Maximize interoperability of processing LRs

LRs on the WWW, 1
• Can use the WWW to locate corpora
• Example: OLAC (Open Language Archives Community)
  – Provides query interface to search for corpora across multiple repositories
  – Requires standard metadata record for harvesting
  – Does not provide access to corpora

LRs on the WWW, 2
• Can use the WWW to directly search corpora
• Many examples
• BNC Online Search
  – words (with regular expressions)
  – tag strings
• Typically search is limited (expressiveness, number of results)

LRs on the WWW, 3
• Can use the WWW to download tools
• Some tools offer a demo web interface
• No interoperability:
  – you cannot take the output of one web-interfaced tool and feed it as input to another tool

LRs on the WWW, 4
• Challenges for accessing static LRs for automatic processing:
  – licensing restrictions
  – file (or database) structure
  – data format
  – data transfer
• What about processing LRs?
  – can download,
  – but not execute in an interoperable manner

Web Services (WS)
• A WS is a self-contained software resource
• Can be located and invoked across the web:
  – identified by a URL
  – public interfaces and bindings are defined and described using XML
• Other applications interact with it in a prescribed manner
  – XML-based messages conveyed by internet protocols (e.g.
HTTP)
• Web services can be composed into complex, distributed applications

Service Oriented Architecture (SOA)
[Figure: SOA diagram. Labels: Discovery Agencies (description); Service Requester (client); Service Provider (description, service); arrows: publish, locate, interact; WWW. Source: Berners-Lee]

Web Service: Key Ideas
• Interaction with Web Services is both described by and conducted using XML documents exchanged over the internet
• SOAP protocol
  – describes the form of messages and how to process them
  – a way of representing Remote Procedure Calls over HTTP

The Appeal of Web Services
• A means of building distributed systems
• virtualization: not dependent on any one programming language, OS, or development environment
• based on well-understood underlying protocols
• components can be developed independently
• decentralized (apart from DNS)

NLP Services
• Fairly easy to wrap legacy code as web services
• Allows us to deploy tools across the web as part of a larger application
• Corpora can also be deployed as services
• Helps with availability and interoperability
• But still many challenges

Building NLP Applications
• Many NLP applications involve relatively few ‘conceptual’ components:
  – tokenizers, taggers, named entity recognizers, parsers, etc.
  – often different versions of the same components
  – much repeated (and messy) labour in wiring the components together to interoperate

Issues in Component Approach
• Granularity
  – What is appropriate ‘grain size’ of functionality?
    • Too fine: heavy overheads in communication, lose ease of use
    • Too gross: loss of flexibility
    • Hierarchical decomposition is possible
• Compatibility
  – informational, functional, formal

Linguistic Annotation
• Makes information in raw text explicit:
  – Classification of words and phrases
  – Detection of structural relationships
  – Annotation with general and domain-specific semantic labels
• Usually proceeds from more concrete to more abstract
• Earlier stages of annotation feed into the later stages
• Assumed that annotation is represented as XML

Idealized View

Compatible NLP Services: Substitution
[Figure; box labels: tokenize, POS tag, POS tag, parse]

Compatible NLP Services: Sequencing
[Figure; box labels: tokenize, parse, POS tag, parse, POS tag, tokenize]

WSDL File
• XML document, usually on same machine as server
• Describes everything involved in calling a web service:
  – The service URL and namespace
  – The type of web service
  – List of available functions
  – Arguments for each function
  – Data type of each argument
  – Return value of each function and data type of each return value

Processor Input and Output Types
• Composition of NL processors constrained by input and output types
• Candidates for types?
• WSDL provides simple data types:
  – strings, integers, booleans
  – not expressive enough
• Can we build on notion of metadata for LRs?

IMDI Catalogue Specification
Catalogue.Title                      Arabic Treebank
Catalogue.Subject-Language           ara
Catalogue.Content-Type               written
Catalogue.Format.Text                UTF-8
Catalogue.Smallest Annotation Unit   word
Catalogue.Publisher                  LDC
Catalogue.Size                       266 Mb

LR Metadata Standards
• Advantages
  – consistency
  – software knows what to expect
  – can be designed according to agreed principles
• Challenges
  – no generally agreed ontology for LRs
  – hard to get agreement (and who gets to decide?)
  – categorizations of LRs influenced by favourite linguistic theory
• Other people are addressing this issue

What’s missing: tool metadata
• What kind of metadata would enable us to ensure tool interoperability?
• Neither OLAC nor IMDI provides an answer

Discovering Resources
• Who cares about discovering LRs?
  – researchers who are searching for LRs that meet specific research criteria
  – information providers
  – teachers, journalists, casual browsers
  – …
• Current focus: automatic discovery by software agents

Service Description & Discovery
• What LRs can be discovered depends on how the LRs are described
• How LRs are described depends on the requirements for discovery
• Composability:
  – If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P?
  – Query for all Q such that Q’s output type is compatible with P’s input type

Some Versions of BNC
name: British National Corpus, Version 1.0
type: text
size: 2866 MB

name: British National Corpus, Version 1.0, marked up in XML
type: text
size: 815 MB

name: British National Corpus, Version 1.0, parsed with Charniak parser
type: text
size: 419 MB

name: British National Corpus, Version 1.0, parsed with IMS parser
type: text
size: 2088 MB

name: British National Corpus, Version 1.0, parsed with Minipar
type: text
size: 448 MB

Corpus Request Scenario
• Agent A requests corpus C with property [key = val]
• If C with [key = val] exists, serve it to A
• Otherwise,
  – find processor P such that output of P(C) satisfies [key = val]
  – apply P to C
  – serve result to A
  – store result for future requests

Service Description
• Standard approach
  – WSDL: describes service inputs/outputs in terms of simple data types
  – Doesn’t support semantically-based service discovery
• Alternatives from Semantic Web
  – inputs and outputs specified in an ontology language
  – OWL and RDF both possible

NLP as Document Annotation
• NL Processor
  – takes a partially annotated document as input
  – yields a more richly annotated document as output

Tagging as document annotation
• Part of Speech Tagger
  – takes in a document with markup of words
  – yields a document with additional markup of part of speech

Document Class
NB This is just corpus metadata!

Subsumption over the Document class

Subsumption over Processors

Grid & NLP
• Parallelism
  – distribute processes over many machines
  – use parallel algorithms within process
  – redundancy and fault tolerance
• Distributed data
  – multiple corpora
  – distributed annotation of single corpus
• Distributed processing pipeline
  – different components hosted at different sites

Implementation
• Based on Globus Toolkit 3.2 middleware
• Corpus Services and Transformation Services provide interfaces for corpora and tools
• Service Data Elements describe properties of services
  – properties are aggregated by Index Service, can be queried by clients
• Index Service extended by Model Service
  – provides richer description of services using RDF triples
• Backward chaining used to construct pipelines that will produce a requested resource

Summary
• Corpus query
  – for user, no obvious distinction between raw and processed data
• Corpus service
  – either provide existing resource, or generate it
• Need to have metadata for tools which allows automatic composition
• Metadata needs to allow subsumption matching
  – using shared controlled vocabulary
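
Sketch: Corpus Request with Backward Chaining
The corpus request scenario and the backward-chaining pipeline construction described above can be illustrated with a small Python sketch. It is not the Globus-based implementation from the slides. It assumes, purely for illustration, that a document type is just the set of annotation layers a corpus carries, that subsumption is set inclusion, and that applying a processor only adds layers. The names Processor, plan, CorpusService and PROCESSORS, and the layer names "words", "pos" and "parse", are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    """Hypothetical tool metadata: layers needed on input, layers added on output."""
    name: str
    requires: frozenset
    adds: frozenset


def plan(goal, have, processors, _seen=frozenset()):
    """Backward chaining: pick a candidate last step, regress the goal, recurse.

    Returns a list of processors that turns a corpus with layers `have`
    into one satisfying `goal`, or None if no such pipeline is found.
    """
    if goal <= have:  # subsumption modelled as set inclusion
        return []
    missing = goal - have
    for p in processors:
        if p.name in _seen or not (p.adds & missing):
            continue
        # What must hold before p runs: p's requirements plus
        # every goal layer that p itself does not supply.
        subgoal = (goal - p.adds) | p.requires
        prefix = plan(subgoal, have, processors, _seen | {p.name})
        if prefix is not None:
            return prefix + [p]
    return None


class CorpusService:
    """Serves a stored corpus version if one fits, otherwise generates and stores it."""

    def __init__(self, processors):
        self.processors = processors
        self.store = {}  # corpus name -> list of stored layer sets

    def add(self, name, layers):
        self.store.setdefault(name, []).append(frozenset(layers))

    def request(self, name, goal):
        versions = self.store.get(name, [])
        for layers in versions:  # an existing version already satisfies the request
            if goal <= layers:
                return layers, []
        for layers in versions:  # otherwise try to build a pipeline
            steps = plan(goal, layers, self.processors)
            if steps is not None:
                result = layers
                for p in steps:  # "applying" a processor is abstracted to adding layers
                    result = result | p.adds
                versions.append(result)  # store result for future requests
                return result, [p.name for p in steps]
        raise LookupError(f"no version of {name!r} can satisfy {sorted(goal)}")


# Hypothetical processor registry for the usage example.
PROCESSORS = [
    Processor("tokenizer", frozenset(), frozenset({"words"})),
    Processor("tagger", frozenset({"words"}), frozenset({"pos"})),
    Processor("parser", frozenset({"words", "pos"}), frozenset({"parse"})),
]

if __name__ == "__main__":
    service = CorpusService(PROCESSORS)
    service.add("BNC", frozenset())  # raw text, no annotation layers
    print(service.request("BNC", frozenset({"words", "pos", "parse"})))
    print(service.request("BNC", frozenset({"words", "pos"})))
```

Under these assumptions, the first request finds no stored version with a parse layer, so the service plans the pipeline tokenizer, tagger, parser and stores the result. The second request, for a POS-tagged corpus, is then served from the stored parsed version with no processing, because the richer document subsumes the weaker requirement. A real system would match types against a shared ontology (for example in RDF or OWL) rather than plain sets.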