GATE Overview Presentation

CSC 9010:
Text Mining Applications
Fall, 2012
Introduction to GATE
Dr. Paula Matuszek
Paula.Matuszek@gmail.com
Taken partially from a presentation by Lin Lin.
http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastucture/Presentation/GATE.ppt
©2012 Paula Matuszek
What is GATE?



Stands for General Architecture for Text Engineering.
Developed at the University of Sheffield.
Component-based architecture with data separated from applications; many discrete capabilities are included as plugins.
Who Uses GATE?




Scientists performing experiments that involve processing human language
Developers building applications with language processing components
Teachers and students of courses about language and language computation
Us :-)
How Can GATE Help?



Specifies an architecture, or organizational structure, for language processing software
Provides a framework that implements the architecture and can be used to embed language processing capabilities in applications
Provides a development environment, built on top of the framework, made up of convenient tools for developing components (plugins)
Really?




Yeah, really.
It’s been under development for 15 years and is still under very active development
Open-source, with dozens of developers, some of whom have been involved since the beginning
Active community that provides good support
– Mailing list: lists.sourceforge.net/lists/listinfo/gate-users
– Twitter: twitter.com/#!/GateAcUk
– LinkedIn: http://www.linkedin.com/groups/GATE-2230077


Many other text mining capabilities have been integrated with it.
An almost overwhelming amount of documentation
GATE Architecture Overview
http://gate.ac.uk/overview.html
GATE Product Family






GATE Developer: IDE for language processing, with information extraction and other plugins.
GATE Embedded: object library which can be included in applications
GATE Teamware: collaborative annotation environment
GATE Mimir: a “multiparadigm index” which supports semantic indexing and search
GATE Wiki: “controllable wiki” based on Grails and Subversion
GATE Cloud: GATE Embedded running on supercomputer hardware
GATE Components


We will deal primarily with GATE Developer.
It has four components:
– Applications: groups of processes to be run on a document or corpus.
– LanguageResources (LRs): entities such as lexicons, documents, corpora, annotation schemas, ontologies.
– ProcessingResources (PRs): tools that operate on unstructured text, such as parsers and tokenizers. These are mostly plugins.
– DataStores: saved processed documents and resources.
Overview of GATE Developer


GATE Developer
Resources Pane
– applications: groups of processes to run on a document or corpus
– language resources: corpora, ontologies, schemas
– processing resources: tools that operate on unstructured text
– datastores: saved documents and resources

Display Pane: whatever you’re currently working with.
Setup Options

Configuration
– Appearance: font, skin
– Advanced:
– add space on markup (to make HTML and XML more readable)
– Save options and session on exit
– Insert append or prepend (for annotations)
– default browser (for user guide)

Input (?)
– default language
Language Resources

Language Resources can be of four kinds:
– Documents are modeled as content plus annotations plus features.
– A Corpus is a Java Set whose members are Documents.
– Annotations are organized in graphs, which are modeled as Java sets of Annotation.
– Schemas are XML schemas describing allowable annotations and features.
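
The model just described (content plus annotations plus features) can be sketched outside GATE. This is only an illustration in Python, not GATE's Java API: the class names, field names, and the sample sentence are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document content (hypothetical model, not GATE's class)."""
    start: int
    end: int
    type: str
    features: tuple = ()  # (key, value) pairs; a tuple keeps the annotation hashable


@dataclass
class Document:
    """Content plus a set of annotations plus document-level features."""
    name: str
    content: str
    annotations: set = field(default_factory=set)
    features: dict = field(default_factory=dict)

    def text_of(self, annotation):
        """Return the text covered by an annotation."""
        return self.content[annotation.start:annotation.end]


doc = Document(name="demo", content="GATE was developed in Sheffield.")
place = Annotation(start=22, end=31, type="Location", features=(("kind", "city"),))
doc.annotations.add(place)
doc.features["language"] = "en"

# GATE models a corpus as a Java Set of documents. A list is used here only
# because this mutable Document class is not hashable in Python.
corpus = [doc]
```

Keeping annotations as offsets into unchanged content, rather than as inline tags, is what lets several annotation sets coexist over the same text.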
Document Processing in GATE

Document:
– Formats including XML, RTF, email, HTML, SGML, and plain text.
– Identified and converted into GATE annotation format.
– Processed by Processing Resources.
– Results stored in a serial data store (based on Java serialization) or indexed in a Lucene database.
– Can also be exported as XML.
New Document




Documents are converted to GATE format; they can be saved for future use or exported.
Language Resources --> New --> Document
Name: can be left blank, in which case it is created automatically (no spaces) from filename+UniqueID
Checkmarks: mark required parameters.
– just leave defaults
– sourceURL
– can be a file (click the folder icon for browse)
– or actual URL (GATE will fetch it)
– or set to stringContent to put content in directly.


Encoding will probably be UTF-8.
markupAware: process XML and HTML tags
Document Display

Double-click document
– Text (minus the markup tags if you chose markupAware)
– Annotation Sets
– from XML, HTML, previous annotation work
– different colors for different categories
– Annotations list
– annotations chosen in Sets pane
Creating a Corpus



To import new documents, we name the corpus and create it without any documents.
Language Resources --> New --> Corpus
Right-click and populate
– choose directory, extensions, encoding

This will create the corpus and show the corpus and the individual documents in the Resources Pane.
GATE Corpus

Corpus Display Pane:
– Add documents to a corpus with the + button, which appears when a corpus is displayed.
– Remove with -. (Note: this removes them from the corpus, not from Developer.)


Documents can be included in multiple corpora.
A corpus can be created from a single concatenated file by specifying the documentRootElement. This makes sense for XML documents, for instance.
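
To make the idea of a document root element concrete, here is a small Python sketch that splits one concatenated XML file into separate documents. It only imitates the concept with the standard library and is not how GATE implements it; the sample XML, the `article` element name, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical concatenated file: each <article> should become one document.
CONCATENATED = """<collection>
  <article><title>First</title><body>GATE overview.</body></article>
  <article><title>Second</title><body>ANNIE components.</body></article>
</collection>"""


def split_by_root_element(xml_text, document_root_element):
    """Return one XML string per element named document_root_element."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.iter(document_root_element)
    ]


documents = split_by_root_element(CONCATENATED, "article")
```

Each string in `documents` could then be loaded as its own document and added to a corpus.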
CREOLE




A Collection of REusable Objects for Language Engineering
The set of resources integrated with GATE
All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data.
Managed in the CREOLE Plugin Manager
Processing Resources: ANNIE



A family of Processing Resources for language analysis included with GATE
Stands for A Nearly-New Information Extraction system.
Uses finite-state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.
ANNIE IE Modules
http://gate.ac.uk/sale/tao/splitch6.html#chap:annie
Some ANNIE Components




Tokenizer
Gazetteer: lists of entities
Sentence Splitter
Part of Speech Tagger
– produces a part-of-speech tag as an annotation on each word or symbol.

Semantic Tagger
ANNIE Component: Tokenizer

Token Types
– word, number, symbol, punctuation, and spaceToken.

A tokenizer rule has a left-hand side (LHS) and a right-hand side (RHS).
Tokenizer Rule

Operations used on the LHS:
– | (or)
– * (0 or more occurrences)
– ? (0 or 1 occurrences)
– + (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format:
{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
Example Tokenizer Rule
– "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
– The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
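
The same rule can be imitated with a regular expression to see what it matches. This is a rough Python sketch, not the GATE tokenizer: the ASCII character classes stand in for the rule's letter categories, and the function name and sample text are assumptions.

```python
import re

# Assumed ASCII stand-ins for the character classes named in the rule.
UPPERCASE_LETTER = "[A-Z]"
LOWERCASE_LETTER = "[a-z]"

# LHS: one uppercase letter followed by zero or more lowercase letters,
# bounded on both sides so that it covers a whole word.
LHS = re.compile(rf"\b{UPPERCASE_LETTER}{LOWERCASE_LETTER}*\b")


def apply_rule(text):
    """Annotate each LHS match with the type and attributes from the RHS."""
    return [
        {"type": "Token", "start": match.start(), "end": match.end(),
         "orth": "upperInitial", "kind": "word"}
        for match in LHS.finditer(text)
    ]


text = "Text mining with GATE in Sheffield"
tokens = apply_rule(text)
matched = [text[token["start"]:token["end"]] for token in tokens]
```

Here `matched` holds "Text" and "Sheffield". "GATE" is skipped because its later letters are uppercase, and the lowercase words never match the first symbol; a full tokenizer would cover those with other rules.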
ANNIE Component: Gazetteer


The gazetteer lists used are plain text files, with one entry per line.
Each list represents a set of names, such as names of cities, organizations, days of the week, etc.
Example Gazetteer List




A small section of the list for units of currency:
……
Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars
……
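
A gazetteer lookup over a list like this one can be sketched in a few lines of Python. This is a simplified illustration, not ANNIE's gazetteer: the longest-match strategy, the boundary check, the annotation layout, and the `currency_unit` label are assumptions for the sketch.

```python
# Entries copied from the list excerpt above (one entry per line in the file).
CURRENCY_UNITS = [
    "Ecu", "European Currency Units", "FFr", "Fr",
    "German mark", "German marks",
    "New Taiwan dollar", "New Taiwan dollars",
    "NT dollar", "NT dollars",
]


def is_boundary(text, index):
    """True if index is outside the text or not on a letter or digit."""
    return index < 0 or index >= len(text) or not text[index].isalnum()


def lookup(text, entries, major_type):
    """Return Lookup annotations, preferring the longest entry at each position."""
    by_length = sorted(entries, key=len, reverse=True)
    found = []
    pos = 0
    while pos < len(text):
        for entry in by_length:
            end = pos + len(entry)
            if (text.startswith(entry, pos)
                    and is_boundary(text, pos - 1)
                    and is_boundary(text, end)):
                found.append({"type": "Lookup", "majorType": major_type,
                              "start": pos, "end": end, "text": entry})
                pos = end
                break
        else:
            pos += 1  # no entry starts here; move on one character
    return found


sample = "The loan was 5 million German marks, not NT dollars."
lookups = lookup(sample, CURRENCY_UNITS, "currency_unit")
```

Trying longer entries first is why "German marks" is reported rather than "German mark".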
ANNIE Component: Semantic Tagger


Based on the JAPE language, whose rules act on annotations assigned in earlier phases.
Produces annotated entities as output.
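
The idea of a rule that acts on earlier annotations can be shown with a toy example. JAPE has its own rule syntax; the Python below only imitates the concept, and the rule itself, the annotation layout, and the `Money` and `currency_unit` names are assumptions made for this sketch.

```python
def tag_money(annotations):
    """Combine a number Token directly followed by a currency Lookup into Money."""
    ordered = sorted(annotations, key=lambda a: a["start"])
    entities = []
    for left, right in zip(ordered, ordered[1:]):
        is_number = left["type"] == "Token" and left.get("kind") == "number"
        is_currency = (right["type"] == "Lookup"
                       and right.get("majorType") == "currency_unit")
        adjacent = 0 <= right["start"] - left["end"] <= 1  # at most one space apart
        if is_number and is_currency and adjacent:
            entities.append({"type": "Money",
                             "start": left["start"], "end": right["end"]})
    return entities


text = "It cost 50 German marks."
# Annotations as an earlier tokenizer and gazetteer phase might have left them.
earlier = [
    {"type": "Token", "kind": "number", "start": 8, "end": 10},
    {"type": "Lookup", "majorType": "currency_unit", "start": 11, "end": 23},
]
money = tag_money(earlier)
```

The rule never looks at the raw text, only at the annotations produced before it, which is why the order of phases matters.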
ANNIE Component: Sentence Splitter



Segments the text into sentences.
This module is required for the tagger.
The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
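
The role of the abbreviation list can be seen in a small sketch. This is a deliberately naive Python illustration, not ANNIE's splitter: the abbreviation set, the function name, and the sample text are assumptions, and cases such as decimal numbers are not handled.

```python
import re

# Assumed sample of an abbreviation list; a real list would be much longer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc."}


def split_sentences(text):
    """Split on . ! ? unless the mark ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        word_start = text.rfind(" ", 0, end) + 1  # start of the word ending here
        if text[word_start:end] in ABBREVIATIONS:
            continue  # a full stop inside an abbreviation does not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sentences = split_sentences("Dr. Matuszek teaches text mining. GATE is used in lab.")
```

Without the abbreviation check, the full stop after "Dr" would be treated as the end of a sentence.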
Example Using ANNIE

http://services.gate.ac.uk/annie/

More next week.
Viewing and Editing Annotations


We have looked at annotations, both those added by ANNIE and those extracted from tags in the document.
It is sometimes useful to examine these annotations closely and edit them, for example when:
– you are using a small corpus and want them correct before you proceed with other tools
– you have a sample set that will be used for training or for quality assurance and they need to be accurate
– you are still developing the resources being used to tag documents.
Unrestricted Annotation Editing


We can change an annotation to an arbitrary different annotation type.
The process is:
– choose text to be annotated
– hover over it or right-click. The annotation editor pops up.
– if you’re changing it, delete the existing annotation
– add the new annotation, by choosing or typing it in
Restricted Annotation Editing




Typically we want better consistency and control for our editing.
Use a schema to specify allowable annotation types and features.
GATE includes many predefined schemas
Located at <root>/plugins/ANNIE/resources/schema
Schema Annotation Editor





CREOLE resource to let us use the schema for annotation editing
Enable in Manage CREOLE Plugins window (under File menu)
Select an annotation, hover or right-click
Different editor window, specifying allowable types and features
Choose new type or feature.
More on Schemas and Editing


You can also initiate editing by right-clicking on an annotation in the annotations list.
You can use multiple schemata in processing one document.
Create an Application with Processing Resources (PRs)




Applications model a control strategy for the execution of PRs.
Simple pipelines: group a set of PRs together in order and execute them in turn.
Corpus pipelines: open each document in the corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document
We will do this during lab.
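
The control strategy of a corpus pipeline can be sketched as two nested loops. This is a conceptual Python illustration rather than GATE's application classes: the dictionary-based documents, the two toy PRs, and the function names are assumptions, and opening and closing documents is left out.

```python
def tokenize_pr(document):
    """First PR: store whitespace-separated tokens on the document."""
    document["tokens"] = document["content"].split()


def count_pr(document):
    """Second PR: relies on the tokens produced by the previous PR."""
    document["features"]["token_count"] = len(document["tokens"])


def run_corpus_pipeline(corpus, processing_resources):
    """Run every PR, in order, on each document of the corpus in turn."""
    for document in corpus:
        for pr in processing_resources:
            pr(document)  # the document plays the role of the runtime parameter


corpus = [
    {"content": "GATE is a rich system", "features": {}},
    {"content": "ANNIE is included with GATE", "features": {}},
]
run_corpus_pipeline(corpus, [tokenize_pr, count_pr])
token_counts = [document["features"]["token_count"] for document in corpus]
```

Swapping the two PRs would fail, since `count_pr` needs the tokens that `tokenize_pr` creates; the pipeline order is part of the application.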
Saving GATE Language Resources and Applications

Data Stores:
– save processed documents for additional use
– specialized folder on a hard drive
– Lucene database
– improve processing times for large collections of documents
Types of Data Store

Serial Data Store:
– based on Java’s serialization system.
– stored in a directory

Lucene Data Store (Lucene is an open-source indexing and search tool.)
– searchable repository
– Lucene-based indexing
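
A directory-based store built on object serialization can be imitated in a few lines. The sketch below uses Python's `pickle` as a stand-in for Java serialization; it does not reproduce GATE's on-disk format, and the function names, file naming, and sample document are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_resource(store_dir, name, resource):
    """Serialize a resource into its own file inside the store directory."""
    path = Path(store_dir) / f"{name}.pkl"
    with path.open("wb") as handle:
        pickle.dump(resource, handle)
    return path


def load_resource(store_dir, name):
    """Read a previously saved resource back from the store directory."""
    with (Path(store_dir) / f"{name}.pkl").open("rb") as handle:
        return pickle.load(handle)


document = {
    "content": "GATE was developed in Sheffield.",
    "annotations": [{"type": "Location", "start": 22, "end": 31}],
}

# A temporary directory stands in for the data store folder.
with tempfile.TemporaryDirectory() as store_dir:
    save_resource(store_dir, "demo", document)
    restored = load_resource(store_dir, "demo")
```

Saving processed documents this way avoids re-running the processing, but unlike an indexed store it offers no search over the saved content.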
Saving in a datastore




Create a folder.
Right-click to get Create Datastore menu
This only creates the store. Save corpora or documents in the Language Resources pane.
Once saved, they can be
Saving as XML

Individual documents can also be saved directly.
– Special GATE XML format
– annotations are appended to the document, locations for tags are embedded in body
– Preserve original format
– use for XML or HTML.
– will save all original tags and everything selected in the annotations
– For a plain text file, embeds inline tags.
Saving Applications

Save a set of processing resources and their parameters.
– Right-click, save application state.
– Append .xgapp to the name

To export as a standalone, export as Teamware
– bundles all needed files
– intended for Teamware but can be used for sharing directly.
And LOTS more

GATE is an extraordinarily rich system. Some of the other CREOLE resources included in the standard distribution:
– Annotation Merging, Quality assurance summarizer for comparing annotations
– Web crawler, Information Retrieval, Key Phrase Extraction
– Machine learning
– Domain-specific taggers (e.g., chemistry)
– Resources for many languages

CREOLE plugins for integrating with many other systems, e.g.:
– UIMA
– WordNet
– Penn BioTagger
– OpenCalais
– OpenNLP
– LingPipe
More details at http://gate.ac.uk/gate/doc/plugins.html
Some Links






Home page is http://gate.ac.uk/
Some good short tutorial videos for getting started: http://gate.ac.uk/demos/developer-videos/ . These are only a few minutes each, so they’re fast. They are for Version 6, but it doesn’t seem to be very different.
User Guide: http://gate.ac.uk/sale/tao/index.html . This is apparently for version 7.1, which is a development build, but again it seems to be fine.
Lots of documentation (“acres” of it): http://gate.ac.uk/documentation.html
The wiki: http://gate.ac.uk/wiki/
Some very nice course materials, with a lot more detail than we will cover, including a unit on sentiment analysis: http://gate.ac.uk/wiki/training-materials-2011.html
What Next?




In lab we will create a simple application and use it.
Next week we will go into a lot more detail on using ANNIE for information extraction.
Homework. (You knew that was coming...)
I’m not going to get into programming in GATE or the more advanced applications. This might be the best tool for some of your projects, though.