MALLET: MAchine Learning for LanguagE Toolkit

Outline
• About MALLET
• Representing Data
• Command Line Processing
• Simple Evaluation
• Conclusion

About MALLET
• "MALLET: A Machine Learning for Language Toolkit."
  • written by Andrew McCallum
  • http://mallet.cs.umass.edu. 2002.
• Implemented in Java, currently version 2.0.6
• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications

About MALLET (continued)
• Main idea
  • Text focus: data is discrete rather than continuous, even when values could be continuous
• How to
  • Command line scripts:
    • bin/mallet [command] --[option] [value] …
    • Text User Interface ("tui") classes
  • Direct Java API
    • http://mallet.cs.umass.edu/api

Representations
• Transform text documents to vectors x1, x2, …
• Elements of a vector are called feature values
  • Example: the feature at row 345 is the number of times "dog" appears in the document
• Retain the meaning of vector indices

Documents to Vectors [figure]

Instances [figure]

Command Line
• Importing Data
• Classification
• Sequence Tagging
• Topic Modeling

Importing Data
• One instance per file
  • files in the folder: sample-data/web/en or sample-data/web/de
  • command line:
      bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
  • file format: [URL] [language] [text of the page...]
  • command line:
      bin/mallet import-file --input /data/web/data.txt --output web.mallet

Classification
• Training a classifier
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others.
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
    bin/mallet train-classifier --input labeled.mallet --training-portion 0.9

Sequence Tagging
• Sequence algorithms
  • hidden Markov models (HMMs)
  • linear-chain conditional random fields (CRFs)
• SimpleTagger
  • a command-line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger
• Input file: one token per line, in the form [feature1 feature2 ... featuren label]
    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun
• Train a CRF
  • Input: the file "sample"
  • Output: a trained CRF in the file "nouncrf"
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample

SimpleTagger (continued)
• A file "stest" that needs to be labeled
    CAPITAL Al
    slept
    here
• Label the input
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
• Output
    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here

Topic Modeling
• Building topic models
    bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
• --input [FILE]
• --num-topics [NUMBER]
  The number of topics to use. The best number depends on what you are looking for in the model.
• --num-iterations [NUMBER]
  The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model.
• --output-state [FILENAME]
  Outputs a compressed text file containing the words in the corpus with their topic assignments.
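
The file written by --output-state can be summarized with a few lines of Java. The sketch below is only an illustration and is not part of MALLET: the class name TopicStateSummary is hypothetical, and the code assumes that header lines in the state file start with "#" and that every other line ends with the topic id assigned to that token. Check this assumption against your own output before relying on it.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not a MALLET class.
public class TopicStateSummary {

    /**
     * Counts how many tokens are assigned to each topic.
     * Assumption: lines starting with '#' are headers, and the last
     * whitespace-separated field of every other line is the topic id.
     */
    public static Map<Integer, Integer> countTokensPerTopic(BufferedReader reader)
            throws IOException {
        Map<Integer, Integer> counts = new TreeMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and header lines
            }
            String[] fields = line.split("\\s+");
            int topic = Integer.parseInt(fields[fields.length - 1]);
            counts.merge(topic, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the train-topics command above.
        String path = args.length > 0 ? args[0] : "topic-state.gz";
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))) {
            countTokensPerTopic(reader).forEach(
                (topic, n) -> System.out.println("topic " + topic + ": " + n + " tokens"));
        }
    }
}
```

Running it with the state file as the only argument prints one line per topic. Very uneven token counts across topics can be a hint that --num-topics or --num-iterations deserves another look.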

Demo

Methodology
• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results
• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing
• Evaluate both POS tagging and Named Entity Recognition (NER)
  • Training performance
  • Accuracy (POS tagging) and precision, recall and FB1 (NER)
• All scripts, corpora and results can be found here
  • http://mallet-eval.googlecode.com

A Survey of Named Entity Corpora
• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • free and public, but needs the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • not free
  • Automatic Content Extraction (ACE) Training Corpus
    • not free
• Other special-purpose corpora
  • Enron Email Dataset
    • email messages in this corpus are tagged with person names, dates and times
  • A variety of biomedical corpora
    • some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora
• Two small corpora collected from the web
  • Penn Treebank Sample
    • English POS tagging corpus, ~5% fragment of Penn Treebank, (C) LDC 1995.
    • raw, tagged, parsed and combined data from the Wall Street Journal
    • 148120 tokens, 36 standard Treebank POS tags
    • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  • HIT CIR LTP Corpora Sample
    • Chinese corpus with integrated NER annotation
    • 10% of the whole corpora (open to the public)
    • 23751 tokens, 7 kinds of named entities
    • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Environment
• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3GB
• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6

Data Format and Labels
• Data format
  • One token per row, one feature per column
      Bill noun
      slept non-noun
      here non-noun
• Labels
  • Standard Treebank POS tags
    • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … … (36 tags in all)
  • HIT named entity tags
    • Position prefixes: O not an NE | S- single-token NE | B- beginning of an NE | I- inside an NE | E- end of an NE
    • Entity types: Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    • Example (the Los Angeles Police Department, USA): 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni

Evaluation

  Stage      Measure       pos         chunking    ner
  Training   Instance #    3982        8936        1286
             Tokens #      95767       211727      20913
             Time          308m 23s    190m 50s    17m 13s
  Test       Tokens #      46452       47377       2829
             Accuracy      85.67%      93.97%      98.55%
             Precision     -           90.54%      86.89%
             Recall        -           89.89%      86.89%
             FB1           -           90.21       86.89
             Time          15.80s      4.43s       0.8s

Demo

Q&A
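
For reference, precision, recall and FB1 for NER are commonly computed over whole entities rather than single tokens. The sketch below shows one way to do this for the B-/I-/E-/S-/O tagging scheme described under Data Format and Labels. It is an illustration under stated assumptions, not the evaluation script used for the numbers above: the class NerScore and its methods are hypothetical, and an entity counts as correct only if its boundaries and type both match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of MALLET or of the evaluation scripts.
public class NerScore {

    /** Collects entity spans, encoded as "start:end:type", from B-/I-/E-/S- tags. */
    public static Set<String> spans(List<String> tags) {
        Set<String> result = new HashSet<>();
        int start = -1;      // index of the open B- tag, or -1 if no span is open
        String type = null;  // entity type of the open span
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("S-")) {
                result.add(i + ":" + i + ":" + tag.substring(2));
                start = -1;
            } else if (tag.startsWith("B-")) {
                start = i;
                type = tag.substring(2);
            } else if (tag.startsWith("E-") && start >= 0 && tag.substring(2).equals(type)) {
                result.add(start + ":" + i + ":" + type);
                start = -1;
            } else if (!(tag.startsWith("I-") && start >= 0 && tag.substring(2).equals(type))) {
                start = -1; // "O" or an inconsistent tag drops any open span
            }
        }
        return result;
    }

    /** Returns {precision, recall, F1} as fractions between 0 and 1. */
    public static double[] score(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predSpans = spans(predicted);
        Set<String> correct = new HashSet<>(goldSpans);
        correct.retainAll(predSpans); // spans that match in boundaries and type
        double p = predSpans.isEmpty() ? 0.0 : (double) correct.size() / predSpans.size();
        double r = goldSpans.isEmpty() ? 0.0 : (double) correct.size() / goldSpans.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("B-Ni", "I-Ni", "E-Ni", "O", "S-Ns");
        List<String> pred = Arrays.asList("B-Ni", "I-Ni", "E-Ni", "O", "O");
        double[] s = score(gold, pred);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " F1=" + s[2]);
    }
}
```

In the small example in main, the predicted tags recover the organization but miss the single-token place name, so precision is 1.0, recall is 0.5 and F1 is 2/3.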