mallet-eval

MALLET
MAchine Learning for LanguagE Toolkit
Outline
• About MALLET
• Representing Data
• Command Line Processing
• Simple Evaluation
• Conclusion
About MALLET
• “MALLET: A Machine Learning for Language Toolkit.”
• Written by Andrew McCallum
• http://mallet.cs.umass.edu, 2002
• Implemented in Java, currently version 2.0.6
• Motivation:
• Text classification and information extraction
• Commercial machine learning
• Analysis and indexing of academic publications
About MALLET
• Main idea
• Text focus: data is discrete rather than continuous, even when
values could be continuous
• How to use it
• Command line scripts:
• bin/mallet [command] --[option] [value] …
• Text User Interface (“tui”) classes
• Direct Java API
• http://mallet.cs.umass.edu/api
Representations
• Transform text documents to
vectors x1 , x2 …
• Elements of vector are called
feature values
• Example: “Feature at row 345 is the
number of times ‘dog’ appears in
the document”
• Retain meaning of vector
indices
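The document-to-vector mapping described above can be sketched in a few lines of Python. This is only an illustration, not MALLET's implementation: the helper names `build_alphabet` and `to_vector`, the whitespace tokenizer and the two sample documents are all assumptions made for the example.

```python
from collections import Counter

def build_alphabet(documents):
    """Assign each distinct token a fixed integer index (hypothetical helper)."""
    alphabet = {}
    for doc in documents:
        for token in doc.lower().split():
            # The first time a token is seen it receives the next free index.
            alphabet.setdefault(token, len(alphabet))
    return alphabet

def to_vector(document, alphabet):
    """Return a count vector; position i is the count of the token with index i."""
    counts = Counter(document.lower().split())
    # Dicts keep insertion order, so iteration follows the index order.
    return [counts.get(token, 0) for token in alphabet]

# Made-up documents, for illustration only.
docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_vector(d, alphabet) for d in docs]
```

Keeping the `alphabet` mapping alongside the vectors is what retains the meaning of each vector index: `alphabet["dog"]` says which position counts occurrences of "dog".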
Documents to Vectors
Instances
Command Line
• Importing Data
• Classification
• Sequence Tagging
• Topic Modeling
Importing Data
• One Instance per file
• files in the folder:
sample-data/web/en or sample-data/web/de
• command line:
bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
• file format:
[URL] [language] [text of the page...]
• command line:
bin/mallet import-file --input /data/web/data.txt --output web.mallet
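To make the one-instance-per-line layout concrete, here is a minimal Python sketch that splits such a line into its three fields. It assumes the fields are separated by whitespace; the pattern, the function name `parse_instance` and the sample URL are illustrative assumptions and do not reproduce MALLET's own parsing.

```python
import re

# Assumed layout: name (URL), label (language), then free text,
# separated by whitespace.
LINE_PATTERN = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")

def parse_instance(line):
    """Split one line into name, label and text (hypothetical helper)."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"line does not have three fields: {line!r}")
    name, label, text = match.groups()
    return {"name": name, "label": label, "text": text}

# Made-up example line, not taken from the sample data.
example = "http://example.org/page1 en This is the text of the page"
instance = parse_instance(example)
```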
Classification
• Training a classifier
bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
• MaxEnt, NaiveBayes, C45, DecisionTree and many others.
bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
• Randomly split the data into 90% training instances, which are used to train the
classifier, and 10% test instances.
bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
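The random 90/10 split can be pictured with the following Python sketch. It only illustrates the idea of a shuffled split; the function name `split_instances` and the fixed seed are assumptions for the example, and MALLET's internal procedure may differ.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle a copy of the instances and cut it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 dummy instances: 90 go to training, 10 to testing.
train, test = split_instances(range(100), training_portion=0.9)
```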
Sequence Tagging
• Sequence algorithms
• hidden Markov models (HMMs)
• linear chain conditional random fields (CRFs).
• SimpleTagger
• a command line interface to the MALLET Conditional Random
Field (CRF) class
SimpleTagger
• Input file: [feature1 feature2 ... featuren label]
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
• Train a CRF
• An input file “sample”
• A trained CRF in the file "nouncrf"
java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
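The training file layout, features first and label last, can be read back with a short Python sketch. The function name `parse_tagged_line` is hypothetical, and treating a blank line as a sequence separator is an assumption of this example.

```python
def parse_tagged_line(line):
    """Split one training line into (features, label); the label is the last field."""
    fields = line.split()
    if not fields:
        return None  # blank line: assumed here to separate sequences
    return fields[:-1], fields[-1]

# The three training lines shown on the slide.
sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun"""

tokens = [parse_tagged_line(line) for line in sample.splitlines()]
```

Note that the token itself ("Bill", "slept", "here") is simply the first feature on its line.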
SimpleTagger
• A file “stest” that needs to be labeled
CAPITAL Al
slept
here
• Label the input
java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
• Output
Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here
Topic Modeling
• Building Topic Models
bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
--input [FILE]
--num-topics [NUMBER] The number of topics to use. The best number depends on
what you are looking for in the model.
--num-iterations [NUMBER] The number of sampling iterations. This is a trade-off
between the time taken to complete sampling and the quality of the topic model.
--output-state [FILENAME] This option outputs a compressed text file containing the
words in the corpus with their topic assignments.
Demo
Methodology
• Focus on sequence tagging module in MALLET
• CRF-based implementation
• Some scripts written for importing data and evaluating results
• Small corpora collected from web
• Each divided into two parts: 80% for training, 20% for testing
• Evaluate both POS Tagging and Named Entity Recognition
• Training performance (time taken)
• Accuracy (POS Tagging) and Precision, Recall and FB1 (NER)
• All scripts, corpora and results can be found here
• http://mallet-eval.googlecode.com
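For reference, the metrics named above can be computed as in the following Python sketch. It assumes entity-level exact matching for precision and recall, with FB1 taken as the F-measure with beta = 1; the function names and the toy data are illustrative and are not taken from the evaluation scripts.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level scores; an entity counts only if it matches exactly."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data: entities are (start, end, type) triples, made up for the example.
acc = accuracy(["NN", "VBD", "RB", "DT"], ["NN", "VBD", "NN", "DT"])
gold = {(0, 3, "Ni"), (5, 6, "Nh")}
pred = {(0, 3, "Ni"), (7, 8, "Ns")}
p, r, f1 = precision_recall_f1(gold, pred)
```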
A Survey of Named Entity Corpora
• Well known named entity corpora
• Language-Independent Named Entity Recognition at CoNLL-2003
• A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
• free and public, but requires the RCV1 raw texts as input
• Message Understanding Conference (MUC) 6 / 7
• not free
• Automatic Content Extraction (ACE) Training Corpus
• not free
• Other special purpose corpora
• Enron Email Dataset
• email messages in this corpus are tagged with person names, dates and times.
• A variety of biomedical corpora
• some corpora in this collection are tagged with entities in the biomedical domain,
such as gene names
Small Corpora
• Two small corpora collected from web
• Penn Treebank Sample
• English POS tagging corpus, ~5% fragment of the Penn Treebank, (C)
LDC 1995.
• raw, tagged, parsed and combined data from the Wall Street Journal
• 148120 tokens, 36 standard Treebank POS tags
• http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
• HIT CIR LTP Corpora Sample
• Chinese NER corpora integrated
• 10% of the whole corpus (open to the public)
• 23751 tokens, 7 kinds of named entities
• http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
Environment
• Hardware
• CPU: Q8300 Quad Core 2.50 GHz
• Memory: 3GB
• Software
• Fedora 13 x86_64
• Java 1.6.0_18
• MALLET 2.0.6
Data Format and Labels
• Data Format
• One token per row, one feature per column
Bill noun
slept non-noun
Here non-noun
• Labels
• Standard Treebank POS tags
• CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there |
FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective,
comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun,
singular or mass | NNS Noun, plural … … (36 tags in all)
• HIT Named Entity
• O: not an NE | S-: a single-token NE | B-: beginning of an NE | I-: middle of an NE | E-: end of an NE
• Nm: numeral | Ni: organization name | Ns: place name | Nh: person name | Nt: time | Nr: date | Nz: proper noun
• Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (“United States”, “Los Angeles”, “Police Department”: one organization name)
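The position prefixes and entity types combine as in the following Python sketch, which groups tagged tokens back into whole entities. It assumes well-formed tags of the form `O` or `prefix-type`; the function name `decode_entities` is hypothetical and the sketch is not part of the evaluation scripts.

```python
def decode_entities(tokens, tags):
    """Collect (text, type) pairs from B-/I-/E-/S- tagged tokens."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            # A single token forms a complete entity.
            entities.append((token, entity_type))
            current, current_type = [], None
        elif prefix == "B":
            current, current_type = [token], entity_type
        elif prefix in ("I", "E") and current and entity_type == current_type:
            current.append(token)
            if prefix == "E":
                # Chinese tokens are joined without spaces.
                entities.append(("".join(current), entity_type))
                current, current_type = [], None
        else:
            current, current_type = [], None  # malformed sequence: drop it
    return entities

# The example from the slide: three tokens forming one organization name.
tokens = ["美国", "洛杉矶", "警察局"]
tags = ["B-Ni", "I-Ni", "E-Ni"]
entities = decode_entities(tokens, tags)
```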
Evaluation
Training
Task      Instances  Tokens  Time
pos       3982       95767   308m 23s
chunking  8936       211727  190m 50s
ner       1286       20913   17m 13s
Test
Task      Tokens  Accuracy  Precision  Recall  FB1    Time
pos       46452   85.67%    -          -       -      15.80s
chunking  47377   93.97%    90.54%     89.89%  90.21  4.43s
ner       2829    98.55%    86.89%     86.89%  86.89  0.8s
DEMO
Q&A