Extracting Rich Event Structure from Text: Models and Evaluations
Code Workshop!
Nate Chambers, US Naval Academy

Available Code
Publicly available on GitHub: https://github.com/nchambers/probschemas

Requirements
• Download WordNet:
    http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz
• Unzip it to a permanent directory (your/path/dict/):
    tar xzvf wn3.1.dict.tar.gz
• Download the config file properties.xml:
    http://blog.roland-kluge.de/?p=430
• Edit the config element to point to WordNet:
    <param name="dictionary_path" value="your/path/dict/"/>

Setup the Code
• Create a directory:
    mkdir probschemas
• Clone the repository:
    git clone https://github.com/nchambers/probschemas.git
• Compile the code:
    mvn compile

Eclipse Help

Setup Eclipse to Recognize Maven Projects
• Help -> Install New Software
• Click "Available Software Sites"
• Click "Add..."
• Name: "maven"
  Location: http://download.eclipse.org/technology/m2e/releases
• Hit OK, and back on the Install page, type "maven" into the "Work with:" box.
• Select "m2e" from the checkbox options that then appear.

Load Project into Eclipse
• Right-click in the Package Explorer
• Select "Import..."
• Select "Existing Maven Projects"
• Browse to the root directory of the project's code.

Demo: Sailing Domain
• The code comes with 75 news documents about sailing:
    src/main/resources/sailing-docs.txt
• We will first parse and run coreference on these documents.
• Pre-process the text:
    ./runallparser.sh -output sailout -input src/main/resources/sailing-docs.txt -type giga
• The script is just a wrapper around mvn. Take a look and feel free to change settings.
• The -type giga flag means the input is XML in Gigaword format.

Demo: Pre-processed Output
• You should now see a directory sailout/ with these files:
    sailing-docs.txt.deps
    sailing-docs.txt.events
    sailing-docs.txt.ner
    sailing-docs.txt.parse
• These files contain NER, event words, typed dependencies, and syntactic parse trees.
• If you gave the pre-processor a directory instead of a single file, it generates these four files for each file in the directory.
• If you pre-processed multiple files instead of one, append them into single files:
    mkdir all
    cat *.parse > all/parse.out
    cat *.ner > all/ner.out
    cat *.deps > all/deps.out
    cat *.events > all/events.out
• You should now have a directory (sailout/ or all/) containing:
    parse.out ner.out deps.out events.out

Demo: Run the Gibbs Sampler to Learn a Model
    ./runlearner.sh -topics 12 -plates 3 -train <dir-from-preprocessor>
*** This learns 3 templates with 4 slots (topics) each.

Options
• -topics gives the total number of slots to learn across all templates.
• If plates == 0, all slots (topics) are flat, with no template structure.
• If plates > 0, the number of slots (topics) per template will be (topics/plates).

Demo: Output
A serialized model is saved at the end of the sampling iterations:
    sampler-sailout-ir0-plates3-topics12-jp0-jt0.model

Labeling New Data
    ./runlearner.sh -isamp -model sampler-sailout-…-jt0.model -test <dir>

Output from the Labeler
    ---- DOCUMENT LABELS (APW_ENG_20070703.1346) ----
    [8] { zealand ( OTHER ): zealand/nsubj--think zealand/nn--government zealand/nsubj--reach it/nsubj--sail it/dobj--win kiwi/nsubj--cross }
    [5] { alinghi ( PERSON LOCATION ORG ): alinghi/nsubj--win }
    [4] { seventh ( OTHER ): seventh/amod--race }
Reading each entry:
• [8] is the slot/topic ID.
• zealand is the head word of the entity.
• ( OTHER ) gives the NER types of the entity.
• zealand/nsubj--think is the typed dependency relation of one of the entity's mentions.

Interpret Output Topics
• If you learned with 3 templates and 12 topics:
    Template A: topics 0-3
    Template B: topics 4-7
    Template C: topics 8-11
  (A small sketch of this arithmetic follows below.)
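To make this arithmetic concrete, here is a tiny sketch (a hypothetical helper of my own naming, not a class shipped in the repository) mapping a learned topic id back to its template, assuming -topics divides evenly by -plates:

    // Hypothetical helper illustrating the topic-to-template arithmetic;
    // the repository does not ship this class.
    public class TopicMath {
      static int templateOf(int topic, int numTopics, int numPlates) {
        int slotsPerTemplate = numTopics / numPlates; // e.g., 12 / 3 = 4
        return topic / slotsPerTemplate;              // topics 4-7 -> template 1
      }

      static int slotWithinTemplate(int topic, int numTopics, int numPlates) {
        return topic % (numTopics / numPlates);       // topic 5 -> slot 1 of template 1
      }

      public static void main(String[] args) {
        // With -topics 12 -plates 3: Template A = 0-3, B = 4-7, C = 8-11.
        System.out.println(templateOf(5, 12, 3));         // prints 1 (Template B)
        System.out.println(slotWithinTemplate(5, 12, 3)); // prints 1
      }
    }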
Code Tips
1. NLP pre-processing of the text
2. Pulling out events and relevant info
3. Gibbs sampling for learning

NLP Pre-processing of the Text
• Key file: src/main/java/nate/AllParser.java
• Flags:
    -output <dir>         Directory to create and dump the parse trees and other files into.
    -input <dir or file>  Directory with text files, or a single file.
    -type <type>          Type of the files given by the -input flag. "text" is plain text, one document per file. "giga" is XML in the Gigaword format; see src/main/resources/sailing-docs.txt for an example. If you need a new type, you'll have to edit the code.

Pulling Out Events and Relevant Info
• Key file: src/main/java/nate/probschemas/DataSimplifier.java
• Learner.java calls a function loadFile():
  – It checks for a "cache" folder to see if the documents have already been processed for their events.
  – If no cache is found, it uses DataSimplifier to pull the events out of the pre-processed NLP files.
• DataSimplifier.java:
    getEntityList(ProcessedData data, List<String> docsnames)
    getEntityListCurrentDoc(ProcessedData data)
  These are the main functions that read parse trees, look for entities, and fill in each entity's subject/object arguments based on the parse trees. They also use the coreference information to collect mentions. Look here if the entities in the Learner look bad, or are missing attributes that you think should be there.
• DataSimplifier also contains several methods that trim out information, such as reporting verbs and entities that are just pronouns. Are these thresholds too high or too low for your domain?
    private double _minDepCounts = 10; // number of times a dep must be seen
    private int _minDocCounts = 10;    // number of docs a verb must occur in
  (These are set in Learner.java!) A sketch of this filtering appears below.
• TextEntity.java is the class that represents a single entity in the text. It holds all mentions of the entity and syntactic information for each mention. The Gibbs sampler receives a list of these entities.
• Important: a document is now a list of TextEntity instances. All other document information is ignored. (See the data-structure sketch below.)

Gibbs Sampling for Learning
• Key file: src/main/java/nate/probschemas/GibbsSamplerEntities.java
• Instantiated and called by Learner.java.
• Can be run to learn a model and save it to disk:
    ./runlearner.sh -topics 15 -plates 5 -train <dir>
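The two thresholds above can be pictured with a small sketch (illustrative code under my own naming, not DataSimplifier's actual methods, which do more trimming of reporting verbs and pronouns): a mention survives only if its dependency and its governing verb are frequent enough in the corpus.

    import java.util.Map;

    // Illustrative filter mirroring DataSimplifier's two thresholds.
    class MentionFilter {
      static final double MIN_DEP_COUNTS = 10; // _minDepCounts in DataSimplifier
      static final int MIN_DOC_COUNTS = 10;    // _minDocCounts in DataSimplifier

      // depCounts: corpus count of each typed dependency.
      // verbDocCounts: number of documents each verb occurs in.
      static boolean keep(String dep, String verb,
                          Map<String, Integer> depCounts,
                          Map<String, Integer> verbDocCounts) {
        // Lower the two constants (set in Learner.java) for small domains.
        return depCounts.getOrDefault(dep, 0) >= MIN_DEP_COUNTS
            && verbDocCounts.getOrDefault(verb, 0) >= MIN_DOC_COUNTS;
      }
    }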
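To picture the data structure the sampler consumes, here is a simplified stand-in for TextEntity (class and field names are mine, not the repository's actual API): an entity is a bag of coreferent mentions, each carrying a head word and a typed dependency, plus the entity's NER types.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified stand-in for nate.probschemas.TextEntity; names are illustrative.
    class SimpleMention {
      String headWord;    // e.g., "zealand"
      String dependency;  // e.g., "nsubj--think" (relation--governing predicate)
    }

    class SimpleEntity {
      List<String> nerTypes = new ArrayList<>();        // e.g., ["OTHER"]
      List<SimpleMention> mentions = new ArrayList<>(); // all coreferent mentions
    }

    // After DataSimplifier runs, a document is just a List<SimpleEntity>;
    // every other piece of document structure is discarded before sampling.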
Learner.java
Lots of flags:
    -train    The directory containing the pre-processed text files.
    -plates   The number of templates to learn.
    -jplates  The number of junk templates to learn.
    -topics   The number of slots across all templates (e.g., with 5 templates, 20 topics is 4 per template).
    -jtopics  The number of junk slots to include in sampling.
    -dtheta   Use thetas per document, not a single global distribution.
    -n        The number of Gibbs sampling iterations.
    -d        The number of training documents to use (if not given, all training docs are used).
    -ir       The number of documents to retrieve (if not present, no IR is used).
    -avg      If present, sample the training set multiple times, infer, and average the results.
    -isamp    If present, use a saved sampler model and infer using its sampled labels.
    -noent    If present, don't use NER features in the sampler.
    -sw       Dirichlet smoothing parameter for words in the sampler.
    -sd       Dirichlet smoothing parameter for deps in the sampler.
    -sv       Dirichlet smoothing parameter for verbs in the sampler. If given, it also turns on the verbs variable in the graphical model.
    -sf       Dirichlet smoothing parameter for entity features in the sampler.
    -c        Cutoff on dep counts to keep a mention.
    -cdoc     Cutoff on verb counts to keep its mentions.

GibbsSamplerEntities.java
• Learner.java calls:
    createSampler(docnames, docEntities)
• GibbsSamplerEntities.java provides:
    initializeModelFromData(names, entities)
      – Allocates memory for arrays.
      – Randomly assigns z values to entities!
    runSampler(numSteps)
      – Most important: getTopicDistribution(doc, entity)
    printWordDistributionsPerTopic()

runSampler()
• Loops over all docs and entities. For each entity: unlabel the z, then resample the z.
• getTopicDistribution(int doc, int entity):
  – Loops over the topics z.
  – Computes all the P() factors for each z.
  – Returns the z distribution.
• That's it! Almost all other functions serve getTopicDistribution(); the rest are for debugging.

z's as Slots/Templates
    Z = 0 1 2 3     Template 0
    Z = 4 5 6 7     Template 1
    Z = 8 9 10 11   Template 2
• Don't keep track of template counts. Just track z counts per document:
    P(slot = 2 | template = 0) = Count(z = 2) / Σ_{i=0..3} Count(z = i)
  (See probOfTemplateTopicGivenDoc() for this code.)

getTopicDistribution()
[Plate diagram: documents have a distribution θ; each entity has a template variable t and a slot variable s; each mention has observed head word h, dependency d, and features p.]
For each entity, getTopicDistribution() multiplies together:
    P(t | θ)                                    probOfTopicGivenDoc()
    * P(s | t)
    * P(h | s) * P(d1 | s) * P(d2 | s) * …
    * P(p1 | s) * P(p2 | s) * …
(A simplified sketch of this computation appears at the end of this document.)

GibbsSamplerEntities.java vs. GibbsSamplerWorkshop.java
[Two plate diagrams side by side: the GibbsSamplerEntities.java model, and the GibbsSamplerWorkshop.java model, which adds an observed predicate variable f to each mention.]

Your Task
• Add the green observed variable from the diagram: the predicate over each entity mention, with factor P(pred | template).
• Keep track of verb counts per slot (not per template). Remember, the code does not have a t variable; it only has a flat list of z variables (topics).
• How? Count verbs with z's. Duplicate what the code does for head words with z's. (A sketch of this bookkeeping appears at the end of this document.)

Workshop Code
• Use GibbsSamplerWorkshop:
    ./runlearner.sh -topics 10 -plates 2 -train sailout/ -workshop
• Use GibbsSamplerEntities:
    ./runlearner.sh -topics 10 -plates 2 -train sailout/

Full Model
• Stuck? Not sure if you did it correctly?
    diff GibbsSamplerEntities.java GibbsSamplerWorkshop.java
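As promised above, here is a simplified sketch of the getTopicDistribution() computation. This is illustrative code under my own naming, not the repository's implementation: it shows only the slot-given-template factor (cf. probOfTemplateTopicGivenDoc()) times a precomputed emission product, and it omits the P(t | θ) factor and the real Dirichlet smoothing.

    // Illustrative sketch only; the real code lives in GibbsSamplerEntities.java.
    class TopicDistributionSketch {
      // zCounts[doc][z]: entities in doc currently labeled with topic z
      // (the entity being resampled has already been unlabeled).
      // emission[z]: precomputed product P(h|z) * P(d1|z) * ... for this entity.
      static double[] topicDistribution(int doc, int numTopics, int slotsPerTemplate,
                                        int[][] zCounts, double[] emission) {
        double[] dist = new double[numTopics];
        double smooth = 0.1; // illustrative smoothing to avoid zero denominators
        for (int z = 0; z < numTopics; z++) {
          int start = (z / slotsPerTemplate) * slotsPerTemplate; // template's first topic
          // P(slot = z | template, doc): z's count normalized over the counts
          // of the topics inside the same template.
          double denom = 0.0;
          for (int i = start; i < start + slotsPerTemplate; i++)
            denom += zCounts[doc][i] + smooth;
          dist[z] = ((zCounts[doc][z] + smooth) / denom) * emission[z];
        }
        return dist; // unnormalized; sample the new z in proportion to dist[z]
      }
    }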
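And here is a minimal sketch of the verb bookkeeping the task asks for, assuming invented names (VerbCounts, probVerbGivenSlot); the actual arrays in GibbsSamplerEntities.java are named differently. It mirrors the head-word pattern: per-slot count tables updated when an entity's z is unlabeled or relabeled, plus a smoothed verb factor analogous to the -sv Dirichlet parameter.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of per-slot verb counts; duplicate whatever the
    // real code does for head-word counts, but over each mention's predicate.
    class VerbCounts {
      final Map<String, Integer>[] verbCountsBySlot; // [z] -> verb -> count
      final int[] totalVerbsBySlot;
      final double smoothing; // analogous to the -sv Dirichlet parameter
      final int vocabSize;    // number of distinct verbs seen in training

      @SuppressWarnings("unchecked")
      VerbCounts(int numTopics, double smoothing, int vocabSize) {
        verbCountsBySlot = new HashMap[numTopics];
        for (int z = 0; z < numTopics; z++) verbCountsBySlot[z] = new HashMap<>();
        totalVerbsBySlot = new int[numTopics];
        this.smoothing = smoothing;
        this.vocabSize = vocabSize;
      }

      void add(int z, String verb) {    // call when labeling an entity with z
        verbCountsBySlot[z].merge(verb, 1, Integer::sum);
        totalVerbsBySlot[z]++;
      }

      void remove(int z, String verb) { // call when unlabeling before resampling
        verbCountsBySlot[z].merge(verb, -1, Integer::sum);
        totalVerbsBySlot[z]--;
      }

      double probVerbGivenSlot(String verb, int z) { // smoothed P(pred | slot)
        int c = verbCountsBySlot[z].getOrDefault(verb, 0);
        return (c + smoothing) / (totalVerbsBySlot[z] + smoothing * vocabSize);
      }
    }

Multiply probVerbGivenSlot() into the factor product inside getTopicDistribution(), just as the head-word factor is multiplied in. When you are done, diff against GibbsSamplerEntities.java as suggested above.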