Document Classification using the Natural Language Toolkit
Ben Healey
http://benhealey.info
@BenHealey

The Need for Automation
Take ur pick!

Class:
Features:
• # Words
• % ALLCAPS
• Unigrams
• Sender
• And so on.

The Development Set → Classification Algo. → Trained Classifier (Model)
New Document (Class Unknown) → Document Features → Trained Classifier (Model) → Classified Document

Relevant NLTK Modules
• Feature Extraction
  – from nltk.corpus import words, stopwords
  – from nltk.stem import PorterStemmer
  – from nltk.tokenize import WordPunctTokenizer
  – from nltk.collocations import BigramCollocationFinder
  – from nltk.metrics import BigramAssocMeasures
  – See http://text-processing.com/demo/ for examples
• Machine Learning Algos and Tools
  – from nltk.classify import NaiveBayesClassifier
  – from nltk.classify import DecisionTreeClassifier
  – from nltk.classify import MaxentClassifier
  – from nltk.classify import WekaClassifier
  – from nltk.classify.util import accuracy

NaiveBayesClassifier
P(label|features) = P(label) × P(features|label) / P(features)
With the naive independence assumption over features f1…fn:
P(label|features) = P(label) × P(f1|label) × … × P(fn|label) / P(features)
http://61.153.44.88/nltk/0.9.5/api/nltk.classify.naivebayes-module.html

517,431 Emails

Prep: Extract and Load
• Sample* of 20,581 plaintext files
• import MySQLdb, os, random, string
• MySQL via Python ODBC interface
• File and string manipulation
• Key fields separated out: To, From, CC, Subject, Body
* Folders for 7 users with a large number of email, so not representative!
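The Bayes rule above is just counting: estimate P(label) and each P(f|label) from the development set, then take the label with the highest product (P(features) is the same for every label, so it can be ignored). A minimal pure-Python sketch of that idea, with toy data and add-one smoothing as illustrative assumptions (NLTK's NaiveBayesClassifier does the same thing with proper probability distributions):

```python
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter(label for _, label in labeled_features)
    feature_counts = defaultdict(Counter)  # label -> Counter of feature names
    for features, label in labeled_features:
        for name, value in features.items():
            if value:
                feature_counts[label][name] += 1
    return label_counts, feature_counts

def classify(features, label_counts, feature_counts):
    """argmax over labels of P(label) * prod P(f_i|label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, 0.0
    for label, count in label_counts.items():
        score = count / total  # P(label)
        for name, value in features.items():
            if value:
                # add-one smoothing so an unseen feature doesn't zero the product
                score *= (feature_counts[label][name] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

dev_set = [
    ({'free': True, 'meeting': False}, 'spam'),
    ({'free': True, 'meeting': False}, 'spam'),
    ({'free': False, 'meeting': True}, 'ham'),
    ({'free': False, 'meeting': True}, 'ham'),
]
model = train_naive_bayes(dev_set)
print(classify({'free': True}, *model))  # → spam
```

The feature dicts deliberately mirror NLTK's convention of `{feature_name: value}` mappings, so the same `extract_features` output could feed either implementation.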
Prep: Extract and Load
• Allocation of random number
• Some feature extraction: #To, #CCd, #Words, %digits, %CAPS
• Note: more cleaning could be done
• Code at benhealey.info

From: james.steffes@enron.com
To: louise.kitchen@enron.com
Subject: Re: Agenda for FERC Meeting RE: EOL

Louise --
We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose. As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act.
Thanks, Jim

From: pete.davis@enron.com
To: pete.davis@enron.com
Subject: Start Date: 1/11/02; HourAhead hour: 5;

Start Date: 1/11/02; HourAhead hour: 5; No ancillary schedules awarded. No variances detected.
LOG MESSAGES:
PARSING FILE -->> O:\Portland\WestDesk\California Scheduling\ISO Final Schedules\2002011105.txt

Class[es] assigned for 1,000 randomly selected messages:
External Relations           45
Social/Personal              68
Human Resources             134
Other/Unclear               141
Admin/Planning               158
Info Tech                    167
Regulatory/Accounting        172
Deals, Trading, Modelling    247

Prep: Show us ur Features
• NLTK toolset
  – from nltk.corpus import words, stopwords
  – from nltk.stem import PorterStemmer
  – from nltk.tokenize import WordPunctTokenizer
  – from nltk.collocations import BigramCollocationFinder
  – from nltk.metrics import BigramAssocMeasures
• Custom code
  – def extract_features(record,stemmer,stopset,tokenizer): …
• Code at benhealey.info

Prep: Show us ur Features
• Features in boolean or nominal form

if record['num_words_in_body'] <= 20:
    features['message_length'] = 'Very Short'
elif record['num_words_in_body'] <= 80:
    features['message_length'] = 'Short'
elif record['num_words_in_body'] <= 300:
    features['message_length'] = 'Medium'
else:
    features['message_length'] = 'Long'

text = record['msg_subject'] + " " + record['msg_body']
tokens = tokenizer.tokenize(text)
words = [stemmer.stem(x.lower()) for x in tokens
         if x not in stopset and len(x) > 1]
for word in words:
    features[word] = True

Sit. Say. Heel.

random.shuffle(dev_set)
cutoff = len(dev_set) * 2 // 3
train_set = dev_set[:cutoff]
test_set = dev_set[cutoff:]
classifier = NaiveBayesClassifier.train(train_set)
print('accuracy for >', subject, ':', accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

Most Important Features

Performance: ‘IT’ Model
IMPORTANT: These are ‘cheat’ scores!
Decile  Mean Prob.  % IT
9       1.0000      95
8       0.7364      49
7       0.0000       2
6       0.0000       2
5       0.0000       4
4       0.0000       4
3       0.0000       6
2       0.0000       2
1       0.0000       2
0       0.0000       2
[Remaining per-class columns (% PR, % Social, % HR, % Other, % Admin, % Legal, % Deal) not cleanly recoverable from extraction]

Performance: ‘Deal’ Model
IMPORTANT: These are ‘cheat’ scores!
Decile  Mean Prob.
9       1.0000
8       1.0000
7       0.9971
6       0.1680
5       0.0000
4       0.0000
3       0.0000
2       0.0000
1       0.0000
0       0.0000
[Per-class percentage columns not cleanly recoverable from extraction]

Performance: ‘Social’ Model
IMPORTANT: These are ‘cheat’ scores!
Decile  Mean Prob.
9       1.0000
8       1.0000
7       1.0000
6       1.0000
5       1.0000
4       1.0000
3       1.0000
2       1.0000
1       0.7382
0       0.0001
[Per-class percentage columns not cleanly recoverable from extraction]

Don’t get burned
• Biased samples
• Accuracy and rare events
• Features and prior knowledge
• Good modelling is iterative!
• Resampling and robustness
• Learning cycles

Resources
• NLTK:
  – www.nltk.org/
  – http://www.nltk.org/book
• Enron email datasets:
  – http://www.cs.umass.edu/~ronb/enron_dataset.html
• Free online Machine Learning course from Stanford:
  – http://ml-class.com/ (starts in October)
• StreamHacker blog by Jacob Perkins:
  – http://streamhacker.com
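The feature-extraction steps shown earlier (length buckets plus boolean unigram features) can be combined into one runnable sketch. The stemmer, stopword set, and tokenizer below are crude pure-Python stand-ins for NLTK's PorterStemmer, stopwords corpus, and WordPunctTokenizer, so the example runs without the NLTK data downloads; the record fields follow the slides:

```python
import re

STOPSET = {'the', 'a', 'an', 'and', 'to', 'of', 're', 'no'}  # stand-in for nltk stopwords

def simple_stem(word):
    """Crude suffix stripper; a stand-in for nltk.stem.PorterStemmer."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extract_features(record):
    """Build the feature dict used for classification: a nominal
    message-length bucket plus boolean unigram features."""
    features = {}
    n = record['num_words_in_body']
    if n <= 20:
        features['message_length'] = 'Very Short'
    elif n <= 80:
        features['message_length'] = 'Short'
    elif n <= 300:
        features['message_length'] = 'Medium'
    else:
        features['message_length'] = 'Long'
    text = record['msg_subject'] + ' ' + record['msg_body']
    tokens = re.findall(r'\w+', text)  # stand-in for WordPunctTokenizer
    words = [simple_stem(t.lower()) for t in tokens
             if t.lower() not in STOPSET and len(t) > 1]
    for word in words:
        features[word] = True
    return features

record = {'msg_subject': 'Meeting agenda',
          'msg_body': 'Schedules awarded, no variances detected',
          'num_words_in_body': 5}
print(extract_features(record)['message_length'])  # → Very Short
```

Because unigram presence and the length bucket live in one flat dict, the output plugs straight into NaiveBayesClassifier.train as the features half of each (features, label) pair in the development set.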