An Introduction to Edison
Vivek Srikumar
17th April 2012
Curator gives us easy access to several layers of
annotation over text
What can we do with these?
Outline
• What is Edison?
• Installing Edison
• Using Edison
– Creating Edison objects
– Accessing the Curator
– Adding and using views
What is Edison?
1. A uniform representation of diverse NLP
annotations
2. A library of NLP data structures
3. A Java client to the Curator
NLP Annotations
John Smith bought the car.
Part-of-speech
NNP John
NNP Smith
VBD bought
DT the
NN car
. .
Named Entities
PER John Smith
Shallow parse
NP John Smith
VP bought
NP the car
Parse tree
(S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))
Semantic roles
Predicate buy
A0 John Smith
A1 the car
And many others….
A uniform representation
• Main ideas
– All the annotations over text are graphs
– Nodes: Labeled spans of text
• Spans indexed by tokens in the text
– Edges: Relations between the nodes
• Edison terminology
– TextAnnotation: A container of tokens and views
– View: A graph that denotes a specific annotation
– Constituent: A labeled span of text (nodes)
– Relation: A labeled directed edge between Constituents
A uniform representation
TextAnnotation
Raw text: John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Views
Name: SENTENCE
Constituents: {…}
Relations: {…}
Name: POS
Constituents: {…}
Relations: {…}
Name: PARSE_CHARNIAK
Constituents: {…}
Relations: {…}
and other views….
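The layout above can be sketched in a few lines of plain Java. This is a minimal illustration of the idea only: the nested classes below are hypothetical, simplified stand-ins that borrow the slide's terminology, not Edison's actual classes, and the view name "NER" is used as a plain string key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the four concepts; NOT Edison's classes.
final class DataModelSketch {

    // A labeled span of text; 'end' is the index of the last token plus one.
    static final class Constituent {
        final String label;
        final int start;
        final int end;

        Constituent(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // A labeled directed edge between two constituents.
    static final class Relation {
        final String name;
        final Constituent source;
        final Constituent target;

        Relation(String name, Constituent source, Constituent target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    // A graph: constituents are the nodes, relations are the edges.
    static final class View {
        final List<Constituent> constituents = new ArrayList<>();
        final List<Relation> relations = new ArrayList<>();
    }

    // A container of tokens and of views indexed by name.
    static final class TextAnnotation {
        final String[] tokens;
        final Map<String, View> views = new LinkedHashMap<>();

        TextAnnotation(String... tokens) {
            this.tokens = tokens;
        }

        // The tokens covered by a constituent, joined with single spaces.
        String surface(Constituent c) {
            return String.join(" ", Arrays.copyOfRange(tokens, c.start, c.end));
        }
    }

    // Builds the running example with a single named-entity view.
    static TextAnnotation build() {
        TextAnnotation ta = new TextAnnotation("John", "Smith", "bought", "the", "car", ".");
        View ner = new View();
        ner.constituents.add(new Constituent("PER", 0, 2));
        ta.views.put("NER", ner);
        return ta;
    }

    public static void main(String[] args) {
        TextAnnotation ta = build();
        for (Map.Entry<String, View> entry : ta.views.entrySet()) {
            for (Constituent c : entry.getValue().constituents) {
                System.out.println(entry.getKey() + ": " + c.label + " " + ta.surface(c));
            }
        }
    }
}
```

Because spans are indexed by tokens, every view can share the same token array and only store labels and indices.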
Getting started with Edison
• Download the jar from
http://cogcomp.cs.illinois.edu/page/software_view/Edison
– Click the download link and follow instructions
– Add the edison jar and its dependencies to your class path
• Dependencies
– Cogcomp core utilities
– Apache commons libraries
– Thrift (to communicate with the Curator)
– Porter stemmer
– LBJ Library
– Java WordNet interface
• Javadoc available under “User Guide”
Edison using Maven
• Add the following repository definition to your pom.xml file
<repositories>
<repository>
<id>CogcompSoftware</id>
<name>CogcompSoftware</name>
<url>http://cogcomp.cs.illinois.edu/m2repo/</url>
</repository>
</repositories>
• Add Edison as a dependency
<dependency>
<groupId>edu.illinois.cs.cogcomp</groupId>
<artifactId>edison</artifactId>
<version>0.2.9</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
So far…
1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!
Three ways to create TextAnnotations
1. When you don’t know the tokenization
– Use this for raw text, if you don’t want to use the
Curator
2. When you know the tokenization
– Use this for pre-tokenized text
3. Using the Curator
– Use this for raw text
– If your text is pre-tokenized, you can still use the
Curator for adding views
Creating TextAnnotations (1)
• When to use this approach
– If you don’t know the tokenization (i.e. words)
– Want to use the LBJ tokenizer and sentence
splitter
• Note: Every TextAnnotation has a textId and a corpusId; these
could be used in the future for book-keeping
Creating TextAnnotations (1)
String corpus = "2001_ODYSSEY";
String textId = "001";
String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";
TextAnnotation ta1 = new TextAnnotation(corpus, textId, text1);
System.out.println(ta1.getText());
System.out.println(ta1.getTokenizedText());
// Print the sentences. The `Sentence` class has the same
// methods as a `TextAnnotation`.
List<Sentence> sentences = ta1.sentences();
System.out.println(sentences.size() + " sentences found.");
for (int i = 0; i < sentences.size(); i++) {
    Sentence sentence = sentences.get(i);
    System.out.println(sentence);
}
Creating TextAnnotations (2)
• When to use this approach
– When you know the tokenization
• That is, when some external source specifies the tokens
of the text
• After creating it, it can be used as before
Creating TextAnnotations (2)
String corpus = "2001_ODYSSEY";
String textId = "002";
List<String> tokenizedSentences =
    Arrays.asList("Good afternoon , gentlemen .",
                  "I am a HAL-9000 computer .");
TextAnnotation ta2 = new TextAnnotation(corpus, textId,
    tokenizedSentences);
System.out.println(ta2.getText());
System.out.println(ta2.getTokenizedText());
// Print the sentences. The `Sentence` class has the same
// methods as a `TextAnnotation`.
List<Sentence> sentences = ta2.sentences();
System.out.println(sentences.size() + " sentences found.");
for (int i = 0; i < sentences.size(); i++) {
    Sentence sentence = sentences.get(i);
    System.out.println(sentence);
}
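To see what pre-tokenized input gives you, here is a small plain-Java sketch of how a list of tokenized sentences can be flattened into one token sequence with sentence boundaries. It assumes tokens are separated by whitespace, as in the example above. The class and method names are hypothetical, and this is not a description of how Edison does it internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; NOT how Edison implements this internally.
final class PreTokenizedSketch {

    // Concatenates the whitespace-separated tokens of all sentences.
    static List<String> tokens(List<String> tokenizedSentences) {
        List<String> tokens = new ArrayList<>();
        for (String sentence : tokenizedSentences) {
            tokens.addAll(Arrays.asList(sentence.trim().split("\\s+")));
        }
        return tokens;
    }

    // For each sentence, the index one past its last token.
    static List<Integer> sentenceEnds(List<String> tokenizedSentences) {
        List<Integer> ends = new ArrayList<>();
        int count = 0;
        for (String sentence : tokenizedSentences) {
            count += sentence.trim().split("\\s+").length;
            ends.add(count);
        }
        return ends;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Good afternoon , gentlemen .",
                "I am a HAL-9000 computer .");
        System.out.println(tokens(sentences));
        System.out.println(sentenceEnds(sentences));
    }
}
```

The sketch does not handle empty sentences; it only illustrates that the external tokenization fixes the token indices that all later annotations refer to.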
Connecting to the Curator (1)
If you don’t know anything about your text, the Curator can tokenize it for you.
String text = "Good afternoon, gentlemen. I am a HAL-9000 "
    + "computer. I was born in Urbana, Il. in 1992";
String corpus = "2001_ODYSSEY";
String textId = "001";
// We need to specify a host and a port where the Curator server is
// running.
String curatorHost = "my-curator-server.cs.uiuc.edu";
int curatorPort = 9090;
// Create a Curator client.
CuratorClient client = new CuratorClient(curatorHost, curatorPort);
// Should the Curator's cache be forcibly updated?
boolean forceUpdate = false;
// Create a TextAnnotation: get it from the Curator, which splits the
// text into sentences and tokenizes it.
TextAnnotation ta = client.getTextAnnotation(corpus, textId, text,
    forceUpdate);
Connecting to the Curator (2)
If you know the tokenization and want all the Curator’s annotators to respect
this tokenization:
String corpus = "2001_ODYSSEY";
String textId = "002";
// Create your TextAnnotation as before.
List<String> tokenizedSentences =
    Arrays.asList("Good afternoon , gentlemen .",
                  "I am a HAL-9000 computer .");
TextAnnotation ta2 = new TextAnnotation(corpus, textId,
    tokenizedSentences);
// We need to specify a host and a port where the Curator server is
// running.
String curatorHost = "my-curator-server.cs.uiuc.edu";
int curatorPort = 9090;
// The Curator should respect the tokenization: pass true as the third
// argument.
CuratorClient client = new CuratorClient(curatorHost, curatorPort, true);
Note: A CuratorClient in this mode cannot create TextAnnotations. Doing so will
trigger an exception!
Views
• Views are graphs: Constituents are nodes and Relations
are edges
• Every TextAnnotation can be seen as a container for
views, indexed by their name
• View is a Java class that represents any graph over
constituents
– Specializations of the View class to deal with specific types
• TokenLabelView, SpanLabelView, TreeView,
PredicateArgumentView, CoreferenceView
– You can create your own views or specializations too!
Example: Part-of-speech
John Smith bought the car.
Tokens {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Part-of-speech
NNP John
NNP Smith
VBD bought
DT the
NN car
. .
Constituents
0-1: NNP
1-2: NNP
2-3: VBD
3-4: DT
4-5: NN
5-6: .
Each constituent is associated with a span. The convention is to denote a span
using the first token and the (last + 1)th one.
No Relations!
This specialization of the View class is called a TokenLabelView, where
each constituent assigns a label to a token and there are no relations.
Use for part-of-speech, stem/lemma, etc.
Adding part-of-speech from the Curator
// Suppose we have a CuratorClient called 'client' and a TextAnnotation
// called 'ta'.
// Should the Curator forcibly update the part-of-speech annotation?
boolean forceUpdate = false;
// Curator call: add the part-of-speech view from the Curator.
client.addPOSView(ta, forceUpdate);
// Get the part-of-speech view from the TextAnnotation. This view will
// be filed under the name 'ViewNames.POS'. Also, we know that
// this view will be a TokenLabelView.
TokenLabelView posView = (TokenLabelView) ta.getView(ViewNames.POS);
// Iterate through the text and get the POS label for each token.
for (int tokenId = 0; tokenId < ta.size(); tokenId++) {
    String token = ta.getToken(tokenId);
    // getLabel(tokenId) is available for TokenLabelViews.
    String posLabel = posView.getLabel(tokenId);
    System.out.println(token + "\t" + posLabel);
}
Example: Shallow parse
John Smith bought the car.
Tokens {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Shallow parse
NP John Smith
VP bought
NP the car
Constituents
0-2: NP
2-3: VP
3-5: NP
Each constituent is associated with a span. The convention is to denote a span
using the first token and the (last + 1)th one.
No Relations!
This specialization of the View class is called a SpanLabelView, where
each constituent assigns a label to a span of text and there are no
relations. Use for named entities, shallow parse, Wikifier, etc.
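A span-labeled annotation amounts to a list of labeled spans plus queries over them. The plain-Java sketch below shows one such query, returning the spans contained in a given range. The class and method names are hypothetical and are not Edison's SpanLabelView API. "the car" covers tokens 3 and 4, so its span is written 3-5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a span-level labeling; NOT Edison's SpanLabelView.
final class SpanLabelSketch {

    // A label attached to the token span [start, end).
    static final class LabeledSpan {
        final String label;
        final int start;
        final int end;

        LabeledSpan(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Chunks of "John Smith bought the car ."
    static List<LabeledSpan> shallowParse() {
        return Arrays.asList(
                new LabeledSpan("NP", 0, 2),
                new LabeledSpan("VP", 2, 3),
                new LabeledSpan("NP", 3, 5));
    }

    // All spans that lie entirely inside [start, end).
    static List<LabeledSpan> containedIn(List<LabeledSpan> spans, int start, int end) {
        List<LabeledSpan> result = new ArrayList<>();
        for (LabeledSpan s : spans) {
            if (s.start >= start && s.end <= end) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (LabeledSpan s : containedIn(shallowParse(), 0, 2)) {
            System.out.println(s.label + " " + s.start + "-" + s.end);
        }
    }
}
```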
Adding shallow parse from the Curator
// Suppose we have a CuratorClient called 'client' and a TextAnnotation
// called 'ta'.
// Should the Curator forcibly update the shallow parse annotation?
boolean forceUpdate = false;
// Curator call: add the shallow parse/chunk view from the Curator.
client.addChunkView(ta, forceUpdate);
// Get the shallow parse view from the TextAnnotation. This view will
// be filed under the name 'ViewNames.SHALLOW_PARSE'. Also, we know that
// this view will be a SpanLabelView.
SpanLabelView chunkView = (SpanLabelView) ta.getView(ViewNames.SHALLOW_PARSE);
// Get all constituents whose span is contained in the span (0, 2).
// getSpanLabels is available for SpanLabelViews.
List<Constituent> constituents = chunkView.getSpanLabels(0, 2);
// Iterate over them and print their labels.
for (Constituent c : constituents) {
    String label = c.getLabel();
    System.out.println(label);
}
Other SpanLabel views in the Curator
• Shallow parse
– ViewNames.SHALLOW_PARSE
– Use ‘client.addChunkView(ta, forceUpdate)’
• Named entities
– ViewNames.NER
– Use ‘client.addNamedEntityView(ta, forceUpdate)’
• Wikifier
– ViewNames.WIKIFIER
– Use ‘client.addWikifierView(ta, forceUpdate)’
Note: For these function calls to work, the corresponding annotator
should exist in your instance of the Curator. Otherwise, an
exception will be triggered.
Example: Parse view
John Smith bought the car.
Tokens {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Parse tree
(S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))
Constituents (rest of the tree not shown)
0-5: S
0-2: NP
2-5: VP
0-1: NNP
Relations
ParentOf: S → NP
ParentOf: S → VP
ParentOf: NP → NNP
This specialization of the View class is called a TreeView, where the
graph represents a tree. Use for full parse and dependency trees.
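A tree can be stored as labeled spans plus ParentOf edges. The plain-Java sketch below encodes the example parse that way and walks the edges. The arrays and method names are hypothetical and are not Edison's TreeView API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a tree stored as nodes plus ParentOf edges; NOT Edison's TreeView.
final class TreeViewSketch {
    // Parse nodes of "John Smith bought the car" (final period omitted).
    static final String[] LABELS = {"S", "NP", "VP", "NNP", "NNP", "VBD", "NP", "DT", "NN"};
    // SPANS[i] is the (first, last + 1) token span of node i.
    static final int[][] SPANS = {
        {0, 5}, {0, 2}, {2, 5}, {0, 1}, {1, 2}, {2, 3}, {3, 5}, {3, 4}, {4, 5}};
    // PARENT[i] is the source of the ParentOf edge into node i; -1 marks the root.
    static final int[] PARENT = {-1, 0, 0, 1, 1, 2, 2, 6, 6};

    // Labels from the given node up to the root, following ParentOf edges backwards.
    static List<String> pathToRoot(int node) {
        List<String> path = new ArrayList<>();
        for (int n = node; n != -1; n = PARENT[n]) {
            path.add(LABELS[n]);
        }
        return path;
    }

    // Indices of the nodes that the given node is the parent of.
    static List<Integer> children(int node) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < PARENT.length; i++) {
            if (PARENT[i] == node) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Node 8 is the NN over "car".
        System.out.println(pathToRoot(8));
        System.out.println(children(0));
    }
}
```

Path features between two parse nodes can be built from walks like pathToRoot, by climbing from each node to their lowest common ancestor.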
Adding Charniak parse from the Curator
// Suppose we have a CuratorClient called 'client' and a TextAnnotation
// called 'ta'.
// Should the Curator forcibly update the parse annotation?
boolean forceUpdate = false;
// Curator call: add the Charniak parse view from the Curator.
client.addCharniakParse(ta, forceUpdate);
// Get the Charniak parse view from the TextAnnotation. This view will
// be filed under the name 'ViewNames.PARSE_CHARNIAK'. Also, we know
// that this view will be a TreeView.
TreeView parseView = (TreeView) ta.getView(ViewNames.PARSE_CHARNIAK);
// Get all parse nodes.
List<Constituent> treeNodes = parseView.getConstituents();
// Get the tree structure for the first sentence (i.e. sentence #0).
Tree<String> parseTree = parseView.getTree(0);
// Do interesting things, e.g. get the path between parse tree nodes
// (a common feature).
String parsePath = PathFeatureHelper.getFullParsePathString(
    treeNodes.get(0), treeNodes.get(1), 400);
Tree views from the Curator
• Charniak parser
– ViewNames.PARSE_CHARNIAK
– client.addCharniakParse(ta, forceUpdate)
• Easy-first dependency parser
– ViewNames.DEPENDENCY
– client.addEasyFirstDependencyView(ta, forceUpdate)
• Stanford parser
– ViewNames.PARSE_STANFORD
– client.addStanfordParse(ta, forceUpdate)
• Stanford dependency parser
– ViewNames.DEPENDENCY_STANFORD
– client.addStanfordDependencyView(ta, forceUpdate)
Other Curator calls
• Verb semantic roles
– View name: ViewNames.SRL
– client.addSRLView(ta, forceUpdate)
• Adds a view of type PredicateArgumentView, which is a subclass of
the View class
• Nominal semantic roles
– View name: ViewNames.NOM
– client.addNOMView(ta, forceUpdate)
• Adds a view of type PredicateArgumentView
• Coreference
– View name: ViewNames.COREF
– client.addCorefView(ta, forceUpdate)
• Adds a view of type CoreferenceView, which is a subclass of the View
class
Using views
• All views provide access to
– Constituents:
• getConstituents, getConstituentsCoveringToken,
getConstituentsCoveringSpan
– Relations: getRelations
• Allows us to manipulate several different views
– E.g., get the parse tree nodes that contain each named entity constituent
whose label is “PER”:
for (Constituent c : namedEntityView.getConstituents()) {
    if (c.getLabel().equals("PER")) {
        List<Constituent> parseConstituents = parseView
            .getConstituentsCovering(c);
        // do something with these
    }
}
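The cross-view query above works because all views index spans over the same tokens. A plain-Java sketch of the "covering" test follows. The class and method names are hypothetical and are not Edison's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a cross-view "covering" query; NOT Edison's API.
final class CoveringSketch {

    // A label attached to the token span [start, end).
    static final class Node {
        final String label;
        final int start;
        final int end;

        Node(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Parse nodes of "John Smith bought the car" (final period omitted).
    static List<Node> parseNodes() {
        return Arrays.asList(
                new Node("S", 0, 5), new Node("NP", 0, 2), new Node("NNP", 0, 1),
                new Node("NNP", 1, 2), new Node("VP", 2, 5), new Node("VBD", 2, 3),
                new Node("NP", 3, 5), new Node("DT", 3, 4), new Node("NN", 4, 5));
    }

    // Labels of the parse nodes whose span contains the whole span [start, end).
    static List<String> coveringLabels(int start, int end) {
        List<String> labels = new ArrayList<>();
        for (Node n : parseNodes()) {
            if (n.start <= start && n.end >= end) {
                labels.add(n.label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // The PER entity "John Smith" has span 0-2 in the named entity view.
        System.out.println(coveringLabels(0, 2));
    }
}
```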
Using constituents and relations
• Each constituent belongs to a view
• Constituents provide the following methods:
– getLabel(): gets the label of the constituent
– getSpan(): gets the span of the constituent
– getIncomingRelations(): gets the list of Relations whose
target is this constituent in this view
– getOutgoingRelations(): gets the list of Relations whose
source is this constituent in this view
• Relations provide the following accessors:
– getRelationName(), getSource(), getTarget()
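Incoming and outgoing relations can be illustrated with the semantic role example, where the predicate "buy" has A0 and A1 edges to its arguments. This is a plain-Java sketch with hypothetical class and method names. It is not Edison's Constituent or Relation API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of incoming/outgoing edge lookup; NOT Edison's API.
final class RelationSketch {

    static final class Node {
        final String label;

        Node(String label) {
            this.label = label;
        }
    }

    // A named directed edge from source to target.
    static final class Edge {
        final String name;
        final Node source;
        final Node target;

        Edge(String name, Node source, Node target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    static final Node PREDICATE = new Node("buy");
    static final Node JOHN = new Node("John Smith");
    static final Node CAR = new Node("the car");
    static final List<Edge> EDGES = Arrays.asList(
            new Edge("A0", PREDICATE, JOHN),
            new Edge("A1", PREDICATE, CAR));

    // Edges whose source is the given node.
    static List<Edge> outgoing(Node node) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : EDGES) {
            if (e.source == node) {
                result.add(e);
            }
        }
        return result;
    }

    // Edges whose target is the given node.
    static List<Edge> incoming(Node node) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : EDGES) {
            if (e.target == node) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Edge e : outgoing(PREDICATE)) {
            System.out.println(e.source.label + " -" + e.name + "-> " + e.target.label);
        }
    }
}
```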
Other useful functionality
• Supports
– Top-K views
– Custom views, for your application
• Provides helper functions for common tasks
– Look at the functions in classes in the package
edu.illinois.cs.cogcomp.edison.features.helpers
• Provides interface to WordNet
– WordNetManager
• Collins’ head-finding rules
• Several feature extraction utilities
– Look at the classes in edu.illinois.cs.cogcomp.edison.features
Links
• Edison download
http://cogcomp.cs.illinois.edu/page/software_view/Edison
• Example code
http://cogcomp.cs.illinois.edu/software/edison/
• API documentation
http://cogcomp.cs.illinois.edu/software/edison/apidocs