Extracting Rich Event Structure from Text
Models and Evaluations
Code Workshop!
Nate Chambers
US Naval Academy
Available Code
Publicly available on GitHub:
https://github.com/nchambers/probschemas
Requirements
Download WordNet
http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz
Unzip to a permanent directory: your/path/dict/
tar xzvf wn3.1.dict.tar.gz
Download config file
http://blog.roland-kluge.de/?p=430
properties.xml
Edit config element to point to WordNet
<param name="dictionary_path" value="your/path/dict/"/>
Setup the code
Create a directory
mkdir probschemas
Clone the repository
git clone https://github.com/nchambers/probschemas.git
Compile the code
mvn compile
Eclipse Help
Setup Eclipse to Recognize Maven Projects
• Help -> Install New Software
• Click "Available Software Sites"
• Click "Add..."
• Name: "maven"
• Location: http://download.eclipse.org/technology/m2e/releases
• Hit OK to return to the Install page, then type "maven" into the "Work with:" box.
• Select "m2e" from the checkbox options that then appear.
Load Project into Eclipse
• Right-click in the Package Explorer
• Select "Import..."
• Select "Existing Maven Projects"
• Browse to your root directory for the project’s code.
Demo
Sailing Domain
• The code comes with 75 news documents about sailing.
– src/main/resources/sailing-docs.txt
• We will first parse and run coreference on these documents.
Pre-process the text
./runallparser.sh -output sailout -input src/main/resources/sailing-docs.txt -type giga
(-type giga means the input is XML in the Gigaword format. The script is just a wrapper for mvn; take a look and feel free to change settings.)
Demo
Pre-processed output
• You should now see a directory sailout/ with these files:
sailing-docs.txt.deps
sailing-docs.txt.events
sailing-docs.txt.ner
sailing-docs.txt.parse
• These files contain NER, event words, typed dependencies, and
syntactic parse trees.
• If you gave the pre-processor a directory instead of a single file, it
will generate these four files for each file in the directory.
Demo
Pre-processed output
• If you pre-processed multiple files instead of one, append them all together into single files:
mkdir all
cat *.parse > all/parse.out
cat *.ner > all/ner.out
cat *.deps > all/deps.out
cat *.events > all/events.out
• You should now have a directory (sailout/ or all/) containing:
parse.out  ner.out  deps.out  events.out
Demo
Run the Gibbs Sampler to Learn a Model
./runlearner.sh
-topics 12
-plates 3
-train <dir-from-preprocessor>
*** This learns 3 templates with 4 slots (topics) each.
Options
• Topics are the number of slots to learn across all templates.
• If plates == 0, all slots (topics) are flat, with no template structure.
• If plates > 0, the number of slots (topics) per template will be topics/plates (e.g., 12/3 = 4).
Demo
Output
A serialized model will be saved at the end of the sampling
iterations:
sampler-sailout-ir0-plates3-topics12-jp0-jt0.model
Labeling New Data
./runlearner.sh -isamp -model sampler-sailout-…-jt0.model -test <dir>
Output from the Labeler
---- DOCUMENT LABELS (APW_ENG_20070703.1346) ----
[8] { zealand ( OTHER ): zealand/nsubj--think zealand/nn-government zealand/nsubj--reach it/nsubj--sail it/dobj--win kiwi/nsubj--cross }
[5] { alinghi ( PERSON LOCATION ORG ): alinghi/nsubj--win }
[4] { seventh ( OTHER ): seventh/amod--race }

Reading the output:
• [8] is the slot/topic ID.
• zealand is the head word of the entity.
• ( OTHER ) gives the NER types of the entity.
• zealand/nsubj--think is the typed dependency relation of one of the entity’s mentions.
Interpret Output Topics
• If you learned with 3 templates and 12 topics:
Template A: topics 0-3
Template B: topics 4-7
Template C: topics 8-11
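That mapping is just integer division; a hypothetical helper (not a function in the repo) makes it concrete:

import java.util.stream.IntStream;

// Hypothetical helper, not part of the repo: map a topic/slot ID to its
// template, assuming topics are split evenly across templates as above.
public class TopicToTemplate {
    static int templateOf(int topic, int numTopics, int numPlates) {
        int slotsPerTemplate = numTopics / numPlates; // e.g., 12 / 3 = 4
        return topic / slotsPerTemplate;              // 0-3 -> A, 4-7 -> B, 8-11 -> C
    }

    public static void main(String[] args) {
        IntStream.range(0, 12).forEach(z ->
            System.out.println("topic " + z + " -> template " + templateOf(z, 12, 3)));
    }
}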
Code Tips
1. NLP pre-processing of the text
2. Pulling out events and relevant info
3. Gibbs Sampling for learning
NLP pre-processing of the text
• Key file: src/main/java/nate/AllParser.java
• Flags:
-output <dir> : Directory to create and dump the parse trees and other files into.
-input <dir or file> : Directory with text files, or a single file.
-type <type> : Type of the files given to the -input flag. "text" is plain text, one document per file. "giga" is XML in the Gigaword format; see src/main/resources/sailing-docs.txt for an example. If you need a new type, you’ll have to edit the code.
Pulling out events and relevant info
Key file:
src/main/java/nate/probschemas/DataSimplifier.java
• Learner.java calls a function loadFile()
– This checks for a "cache" folder to see if the documents have already been processed for their events.
– If no "cache" is found, it uses DataSimplifier to pull out the events from the pre-processed NLP files.
Pulling out events and relevant info
• DataSimplifier.java
getEntityList(ProcessedData data, List<String> docsnames)
getEntityListCurrentDoc(ProcessedData data)
This is the main function that reads parse trees, looks for entities, and fills in each entity’s subject/object arguments based on the parse trees. It also uses the coreference information to collect mentions.
Look here if the entities in the Learner look bad or are missing attributes that you think should be there. The class contains several methods that trim out information, such as reporting verbs and entities that are only pronouns.
Are these thresholds too high/low for your domain?
private double _minDepCounts = 10; // number of times a dep must be seen
private int _minDocCounts = 10; // number of docs a verb must occur in
(these are set in Learner.java!)
Pulling out events and relevant info
• DataSimplifier.java
• TextEntity.java
TextEntity is the class that represents a single entity in the text. It
holds all mentions of the entity and syntactic information for
each mention. The Gibbs Sampler receives a list of these entities.
Important: a document is now a list of TextEntity instances. All
other document information is ignored.
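As a rough sketch (illustrative names only; the real class in src/main/java/nate/probschemas/TextEntity.java has more fields and methods), the data the sampler receives is shaped like this:

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the idea behind TextEntity.
class MentionSketch {
    String headWord;    // e.g., "zealand"
    String depRelation; // the mention’s typed dependency, e.g., "nsubj--win"
}

class TextEntitySketch {
    List<String> nerTypes = new ArrayList<>();        // e.g., PERSON, LOCATION, ORG
    List<MentionSketch> mentions = new ArrayList<>(); // all coreferent mentions
}
// A document, as the sampler sees it, is just a List<TextEntitySketch>.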
Gibbs Sampling for learning
Key file:
src/main/java/nate/probschemas/GibbsSamplerEntities.java
• Instantiated and called by Learner.java
• Can be run to learn a model and save to disk.
./runlearner.sh -topics 15 -plates 5 -train <dir>
Learner.java
Lots of flags:
-train : The directory of pre-processed text files.
-plates : The number of templates to learn.
-jplates : The number of junk templates to learn.
-topics : The number of slots across all templates (e.g., with 5 templates, 20 topics is 4 per template).
-jtopics : The number of junk slots to include in sampling.
-dtheta : Use thetas per document, not a single global distribution.
-n : The number of Gibbs sampling iterations.
-d : The number of training documents to use (if not given, all training docs are used).
-ir : The number of documents to retrieve (if not present, no IR is used).
-avg : If present, sample the training set multiple times, infer, and average the results.
-isamp : If present, use a saved sampler model, and infer using its sampled labels.
-noent : If present, don't use NER features in the sampler.
-sw : Dirichlet smoothing parameter for words in the sampler.
-sd : Dirichlet smoothing parameter for deps in the sampler.
-sv : Dirichlet smoothing parameter for verbs in the sampler. If given, this also turns on the verbs variable in the graphical model.
-sf : Dirichlet smoothing parameter for entity features in the sampler.
-c : Cutoff for dep counts to keep a mention.
-cdoc : Cutoff for verb counts to keep its mentions.
GibbsSamplerEntities.java
Learner.java
• createSampler(docnames, docEntities)
GibbsSamplerEntities.java
• initializeModelFromData(names, entities)
– Allocates memory for arrays
– Randomly assigns z values to entities!
• runSampler(numSteps)
– Most important: getTopicDistribution(doc, entity)
• printWordDistributionsPerTopic()
GibbsSamplerEntities.java
runSampler()
– Loops over all docs and entities.
– For each entity: unlabel the z, then resample the z
getTopicDistribution(int doc, int entity);
– Loops over the topics z
– Computes all P() factors for each z
– Returns the z distribution
That’s it!
Almost all other functions serve getTopicDistribution(); the rest are for debugging…
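To make the sweep concrete, here is a minimal, self-contained sketch of that loop. Names are illustrative, and only the per-document z-count factor is shown; the real sampler also updates word, dep, and feature count tables when it unlabels and relabels an entity.

import java.util.Random;

public class GibbsSweepSketch {
    static final Random rand = new Random(42);

    // z[doc][entity] = current topic; counts[doc][t] = #entities in doc with topic t.
    static void sweep(int[][] z, int[][] counts, int numTopics, double alpha) {
        for (int doc = 0; doc < z.length; doc++) {
            for (int ent = 0; ent < z[doc].length; ent++) {
                counts[doc][z[doc][ent]]--;             // unlabel the current z
                double[] dist = new double[numTopics];  // stand-in for getTopicDistribution()
                for (int t = 0; t < numTopics; t++)
                    dist[t] = counts[doc][t] + alpha;   // only the P(z | doc) factor, smoothed
                z[doc][ent] = sample(dist);             // resample z from the distribution
                counts[doc][z[doc][ent]]++;             // relabel with the new z
            }
        }
    }

    // Draw an index proportional to the unnormalized weights.
    static int sample(double[] w) {
        double sum = 0;
        for (double x : w) sum += x;
        double r = rand.nextDouble() * sum;
        for (int i = 0; i < w.length; i++) {
            r -= w[i];
            if (r <= 0) return i;
        }
        return w.length - 1;
    }
}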
z’s as slots/templates
Z =  0  1  2  3   → Template 0
     4  5  6  7   → Template 1
     8  9 10 11   → Template 2
Don’t keep track of template counts.
Just track z counts per document.
πΆπ‘œπ‘’π‘›π‘‘(𝑧 = 2)
𝑃 π‘ π‘™π‘œπ‘‘ = 2 π‘‘π‘’π‘šπ‘π‘™π‘Žπ‘‘π‘’ = 0) = 3
𝑖=0 πΆπ‘œπ‘’π‘›π‘‘(𝑧 = 𝑖)
(see probOfTemplateTopicGivenDoc() for this code)
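In code, that computation is roughly the following sketch (illustrative names, unsmoothed; probOfTemplateTopicGivenDoc() is the real implementation):

public class SlotProbSketch {
    // P(slot = z | template, doc): this z’s count divided by the summed counts
    // of all the z’s belonging to the template (here, z’s firstZ..lastZ).
    static double probSlotGivenTemplate(int[] zCountsInDoc, int z, int firstZ, int lastZ) {
        double sum = 0;
        for (int i = firstZ; i <= lastZ; i++) sum += zCountsInDoc[i];
        return sum == 0 ? 0.0 : zCountsInDoc[z] / sum; // e.g., Count(z=2) / Σ Count(z=0..3)
    }
}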
GibbsSamplerEntities.java
getTopicDistribution()
[Plate diagram: θ per document; a template t and slot s per entity; observed per mention: head word h, dependency d, and entity feature p.]
getTopicDistribution() returns, for each candidate z, the product:
P(t | θ)   (computed by probOfTopicGivenDoc())
* P(s | t)
* P(h | s)
* P(d1 | s) * P(d2 | s) …
* P(p1 | s) * P(p2 | s) …
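Written out, the product looks roughly like this sketch (illustrative names; the real getTopicDistribution() computes these factors from its count tables and repeats this for every z):

public class TopicScoreSketch {
    static double scoreTopic(double pTemplateGivenDoc,   // P(t | theta)
                             double pSlotGivenTemplate,  // P(s | t)
                             double pHeadGivenSlot,      // P(h | s)
                             double[] pDepsGivenSlot,    // P(d1 | s), P(d2 | s), ...
                             double[] pFeatsGivenSlot) { // P(p1 | s), P(p2 | s), ...
        double score = pTemplateGivenDoc * pSlotGivenTemplate * pHeadGivenSlot;
        for (double pd : pDepsGivenSlot) score *= pd;    // one factor per mention dep
        for (double pf : pFeatsGivenSlot) score *= pf;   // one factor per entity feature
        return score;
    }
}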
GibbsSamplerEntities.java vs. GibbsSamplerWorkshop.java
[Side-by-side plate diagrams: both models have θ per document, a template t and slot s per entity, and observed variables on each mention; the workshop version omits the predicate variable f that the full sampler includes.]
Your Task
GibbsSamplerWorkshop.java
[Diagram: the same plate model (θ per document; t, s per entity; h, d, p per mention), with one new observed variable, shown in green, added to each mention.]
Add the green observed variable: the predicate over each entity mention.
P( pred | template )
Your Task
Keep track of verb counts per slot (not per template).
Remember, the code does not have a t variable. It only has a flat list of z variables (topics).
How? Count verbs with z’s. Duplicate what the code does for head words with z’s. (A rough sketch of that bookkeeping follows below.)
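A hedged sketch of that bookkeeping (illustrative names, not the repo’s; mirror whatever the head-word count tables actually look like in the code):

public class VerbCountSketch {
    final int[][] verbCountsByZ; // [z][verb id], analogous to head-word counts
    final int[] verbTotalsByZ;   // total verbs counted per z, for normalization

    VerbCountSketch(int numZ, int vocabSize) {
        verbCountsByZ = new int[numZ][vocabSize];
        verbTotalsByZ = new int[numZ];
    }

    void addVerb(int z, int verbId)    { verbCountsByZ[z][verbId]++; verbTotalsByZ[z]++; }
    void removeVerb(int z, int verbId) { verbCountsByZ[z][verbId]--; verbTotalsByZ[z]--; }

    // P(verb | z) with Dirichlet smoothing sv (cf. the -sv flag), the same shape
    // as the sampler’s P(head word | slot).
    double probVerbGivenZ(int z, int verbId, double sv, int vocabSize) {
        return (verbCountsByZ[z][verbId] + sv) / (verbTotalsByZ[z] + sv * vocabSize);
    }
}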
Workshop Code
• Use GibbsSamplerWorkshop:
./runlearner.sh -topics 10 -plates 2 -train sailout/ -workshop
• Use GibbsSamplerEntities:
./runlearner.sh -topics 10 -plates 2 -train sailout/
Full Model
• Stuck?
• Not sure if you did it correctly?
diff GibbsSamplerEntities.java GibbsSamplerWorkshop.java