Text Analysis Fundamentals

Objective

To extract and examine metadata from collections of documents that contain text. If you want to develop intelligent agents that classify or make predictions based on text, this is an important prerequisite! Most machine learning algorithms require quantitative input, so text analysis helps you reduce and transform your textual content into numbers that describe that content.

This example uses the tm and wordcloud packages in R, plus data and a little bit of the code from Machine Learning for Hackers by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite). You should follow both of these guys on Twitter. They are very cool and they love data analysis. Their code is downloadable from https://github.com/johnmyleswhite/ML_for_Hackers.

Background

Text analysis is a powerful mechanism for knowledge discovery and extraction, providing techniques for identifying associations and establishing categorization schemes. Typically, data mining means working with structured data, where you know what types of information are contained within the various locations in your data repository (such as your database tables). Text mining, in contrast, involves working with unstructured data and can be much more complicated. It can be costly and time consuming to convert a dataset consisting of unstructured text into a more structured form suitable for mining with other techniques.

There are three data structures typically associated with text analysis and text mining: the basic container for the unstructured text, and two matrices that contain information about the frequency of appearance of words and terms. They are:

Corpus - A list containing a collection of documents. Although the tm package in R allows you to build a corpus from several different file types, we will only be working with text files here. Although I've only used collections containing up to 10k documents, it is reported that R performs well with collections up to 50k documents, and sometimes even up to 100k documents, depending upon how much memory you have available.

Document-Term Matrix (dtm) - Organized with the documents as rows and the terms as columns, this matrix contains counts of the appearance of each term within each document of a corpus.

Term-Document Matrix (tdm) - Organized with the terms as rows and the documents as columns, this matrix contains counts of the appearance of each term within each document of a corpus.

In this chapter, we step through the process of building each of these objects in R, then use them to analyze the content of the emails and compare the composition of spam emails to the composition of legitimate non-spam emails ("ham").

Data Format

For this exercise we will use the "spam and ham" dataset of emails from Chapter 3 of Machine Learning for Hackers. (I've zipped up this data set and posted it to Blackboard under the Lab Exercises menu option. It's called spam-and-ham.zip. I unzipped this data into the directory C:/ISAT 344/MLH/03-Classification/data on my computer. Be sure you note what directory you unzip the data into, because you'll need to use that path.)

spam.path <- "C:/ISAT 344/MLH/03-Classification/data/spam/"
ham.path <- "C:/ISAT 344/MLH/03-Classification/data/easy_ham/"

The /spam folder contains 501 files. 500 of these files contain one email (header + message) each. The filename of the first email starts with 00001, the filename of the second email starts with 00002, and so forth. An additional file called cmds contains the text of Unix commands.

The /easy_ham folder contains 2501 files. 2500 of these files contain one email (header + message) each. The filename numbering scheme for emails in this directory is the same as for the /spam directory. An additional file called cmds is also present in this directory.

You should try to open up one of the email files just to see what it looks like. (Note that several of the emails are written in HTML, containing all of the tags that will make these messages pretty when you open them in your email program.)
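Before going further, it's worth verifying that R can actually see the data where you unzipped it. This quick check is not part of the original walkthrough, but it will catch a mistyped path immediately:

length(dir(spam.path))   # should be 501 (500 emails plus the cmds file)
length(dir(ham.path))    # should be 2501 (2500 emails plus cmds)

If either count comes back as 0, R can't find that directory, and you should fix the path before continuing.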
Our first challenge is to build a corpus for each of the two collections (spam and ham), and then a document-term matrix (dtm) and term-document matrix (tdm) for each, so we can execute the analysis algorithms.

First, we need to cut and paste this "helper function," written by Conway and White, into R. It opens one of the files and pulls out just the content of the message, ignoring all the email header information:

get.msg <- function(path) {
    con <- file(path, open = "rt", encoding = "latin1")
    text <- readLines(con)
    # The message body begins after the first blank line, which
    # separates the email header from the email content
    msg <- text[seq(which(text == "")[1] + 1, length(text))]
    close(con)
    return(paste(msg, collapse = "\n"))
}

Next, we need to create a vector data structure in R to contain the content of each of the emails. The first line pulls out all of the filenames in the directory located at spam.path and creates a list. The second line adjusts that list by keeping only the filenames which do not (!=) match the string "cmds". The sapply command applies the same function to every element of a list, so the third line takes each filename contained within spam.docs and gets the message (get.msg) contained within that file. As a result, the content of each email message ends up in the list that we call all.spam.

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
    function(p) get.msg(paste(spam.path, p, sep = "")))

Then, we do the same thing for all the files containing "ham" email:

ham.docs <- dir(ham.path)
ham.docs <- ham.docs[which(ham.docs != "cmds")]
all.ham <- sapply(ham.docs,
    function(p) get.msg(paste(ham.path, p, sep = "")))

The name of each element in the new list conveniently corresponds to the beginning of the filename containing the email content. For example, all.spam[1] can be typed at the R prompt to retrieve the content of the email in the file with the filename 00001.7848dde101aa985090474a91ec93fcf0.

Now we can create our corpus, document-term matrix (dtm), and term-document matrix (tdm) objects. We need to load the tm package for text mining first, so be sure you have already installed the package into your instance of R by going to Packages -> Install Package(s) on the menu bar of the R GUI. Then do this:

library(tm)
control <- list(stopwords = TRUE, removePunctuation = TRUE,
    removeNumbers = TRUE, minDocFreq = 2)

We will use the control variable to create our tdm and dtm objects. What we've done is to establish that we want to remove all of the most common words in the English language (stopwords), numbers, and punctuation. Furthermore, terms that appear in fewer than two documents will be ignored (thanks to minDocFreq=2). More options are summarized in Table X.1.
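If you're curious exactly which words stopwords=TRUE strips out, the tm package exposes the list directly. This peek isn't required for the exercise, but it demystifies that option:

head(stopwords("en"), 10)   # the first few of tm's built-in English stopwords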
spam.corpus <- Corpus(VectorSource(all.spam))
spam.tdm <- TermDocumentMatrix(spam.corpus, control)
spam.dtm <- DocumentTermMatrix(spam.corpus, control)

Now, we do the same thing for our collection of ham emails:

ham.corpus <- Corpus(VectorSource(all.ham))
ham.tdm <- TermDocumentMatrix(ham.corpus, control)
ham.dtm <- DocumentTermMatrix(ham.corpus, control)

Parameter                What it does
stopwords=TRUE           Removes all common words from the document, such as a, and, or, but, etc.
removePunctuation=TRUE   Eliminates punctuation marks from the corpus
removeNumbers=TRUE       Eliminates numbers from the corpus
minDocFreq=5             Ignores all words that appear in fewer than 5 documents in your corpus
minWordLength=2          Ignores all "words" shorter than 2 characters

Table X.1: Optional arguments to TermDocumentMatrix and DocumentTermMatrix to control what ends up in the matrices

Finally, we can check to make sure that our objects have been successfully created, and get a sense for what's contained within each one. There are several options available to us, including summary, head, and inspect. The summary command gives us a look at the structure of a corpus:

> summary(spam.corpus)
A corpus with 500 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

> summary(ham.corpus)
A corpus with 2500 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

The second way to get information about our objects is to use head, which shows just the first few elements of an object. We can ask to see the head of a tdm or dtm, but not the head of a corpus. Notice that the maximal term length in the dtm is 996 characters. This might be bad! (We might want to clean the data more before we examine it, or we might want to look and see what kind of term is 996 characters long. It could, for example, be an encryption key or an encoded image! We don't know at this point.)

> head(spam.tdm)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 18/2982
Sparsity           : 99%
Maximal term length: 25
Weighting          : term frequency (tf)

> head(spam.dtm)
A document-term matrix (6 documents, 27053 terms)

Non-/sparse entries: 433/161885
Sparsity           : 100%
Maximal term length: 996
Weighting          : term frequency (tf)
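The head output above confirms that 996-character term. If you want to see what that monster term actually is before deciding how to clean it, one quick check (a sketch, not part of the original walkthrough) uses tm's Terms accessor:

spam.terms <- Terms(spam.dtm)                # all terms in the matrix
spam.terms[which.max(nchar(spam.terms))]     # print the single longest term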
Notice also that head tells us how sparse the matrix is. Sparse means that there are a lot of zeroes, or non-occurrences of terms. It can take a long time for algorithms to search through matrices with a whole lot of zeroes, so you may want to consider removing sparse terms from your tdm (but not your dtm!) before you proceed. Fortunately, this is easy to do with removeSparseTerms. Its second argument is the maximum sparsity a term may have and still be kept: a term is dropped if it is absent from more than that fraction of the documents, so a higher value keeps more terms. For example:

> new.spam.tdm.1 <- removeSparseTerms(spam.tdm, 0.2)
> head(new.spam.tdm.1)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 0/3000
Sparsity           : 100%
Maximal term length: 2
Weighting          : term frequency (tf)

> new.spam.tdm.2 <- removeSparseTerms(spam.tdm, 0.8)
> head(new.spam.tdm.2)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 909/2091
Sparsity           : 70%
Maximal term length: 11
Weighting          : term frequency (tf)

In the first case, we keep only the terms that appear in at least 80% of the documents. You can see this is a bad thing to do, because it reduces the number of actual data values ("non-sparse entries") to zero. We have essentially eliminated all of our data (oops). In the second case, we set up a new tdm called new.spam.tdm.2 and only wipe out the terms that are missing from more than 80% of the documents in the corpus. The results are much better. Now, only 70% of our matrix containing word counts is empty, and the longest term is 11 characters, so we have also eliminated that odd 996-character term (and any that were like it). This also tells us we should use new.spam.tdm.2 whenever we can when we start to analyze our data.

Similarly, we should check the structure of the tdm we created from our ham data, and apply removeSparseTerms if need be. (Try this yourself! A sketch follows.)
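The association examples later in this chapter rely on an object called new.ham.tdm.2. A minimal sketch, assuming the 0.8 threshold that worked well for the spam data is also reasonable for the ham collection:

new.ham.tdm.2 <- removeSparseTerms(ham.tdm, 0.8)   # drop terms missing from >80% of ham docs
head(new.ham.tdm.2)                                # then check its structure, as we did for spam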
The third means to peek into our data is to use inspect. We can inspect elements of a corpus if we call them by the number they are indexed at (which will show us the content of each embedded document in the collection):

> inspect(spam.corpus[40])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

$`00040.949a3d300eadb91d8745f1c1dab51133`
Dear Sir or Madam:

Please reply to
Receiver: China Enterprise Management Co., Ltd. (CMC)
E-mail: unido@chinatop.net

As one technical organization supported by China Investment and Technical Promotion Office of United Nation Industry Development Organization (UNIDO), we cooperate closely with the relevant Chinese Quality Supervision and Standardization Information Organization. We provide the most valuable consulting services to help you to open Chinese market within the shortest time:

1. Consulting Service on Mandatory National Standards of The People's Republic of China.
2. Consulting Service on Inspection and Quarantine Standards of The People's Republic of China.
3. Consulting Service for Permission to Enter Chinese Market

We are very sorry to disturb you!

More information, please check our World Wide Web: http://www.chinatop.net

Sincerely yours

--
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie

Alternatively, we can inspect rows and columns of a tdm or dtm, which will give us portions of the (sometimes large) tables that contain the counts of our terms. These calls to inspect give us rows 1 through 3 (the first argument in brackets) and columns 1 through 3 (the second argument in brackets). (The odd \033b "terms" appear to be escape characters that survived from the raw spam messages.)

> inspect(spam.tdm[1:3, 1:3])
A term-document matrix (3 terms, 3 documents)

Non-/sparse entries: 0/9
Sparsity           : 100%
Maximal term length: 22
Weighting          : term frequency (tf)

                                     Docs
Terms                                 00001.7848dde101aa985090474a91ec93fcf0
  \033b\033b                          0
  \033bal\033bvipmail                 0
  \033bckkonsh\033bpt\033bwlshy\033b  0
                                     Docs
Terms                                 00002.d94f1b97e48ed3b553b3508d116e6a09
  \033b\033b                          0
  \033bal\033bvipmail                 0
  \033bckkonsh\033bpt\033bwlshy\033b  0
                                     Docs
Terms                                 00003.2ee33bc6eacdb11f38d052c44819ba6c
  \033b\033b                          0
  \033bal\033bvipmail                 0
  \033bckkonsh\033bpt\033bwlshy\033b  0

> inspect(spam.dtm[1:3, 1:3])
A document-term matrix (3 documents, 3 terms)

Non-/sparse entries: 0/9
Sparsity           : 100%
Maximal term length: 22
Weighting          : term frequency (tf)

                                         Terms
Docs                                      \033b\033b  \033bal\033bvipmail
  00001.7848dde101aa985090474a91ec93fcf0  0           0
  00002.d94f1b97e48ed3b553b3508d116e6a09  0           0
  00003.2ee33bc6eacdb11f38d052c44819ba6c  0           0
                                         Terms
Docs                                      \033bckkonsh\033bpt\033bwlshy\033b
  00001.7848dde101aa985090474a91ec93fcf0  0
  00002.d94f1b97e48ed3b553b3508d116e6a09  0
  00003.2ee33bc6eacdb11f38d052c44819ba6c  0

Code and Results

You may have even forgotten that we haven't started the analysis part yet, because there's been so much work that's needed to be done with the data itself. First, let's get a sense of the overall frequency of terms in our collections. We get identical results whether we use our tdm or our dtm here. First, let's check to see how many terms have been mentioned at least 100 times:

> length(findFreqTerms(spam.dtm, 100))
[1] 120

That's a lot of terms, so let's check and see how many have been mentioned even more frequently - at least 300 times:

> length(findFreqTerms(spam.dtm, 300))
[1] 29

Since 29 is a manageable number, we can look at the terms themselves:

> findFreqTerms(spam.dtm, 300)
 [1] "arial"      "body"            "border"    "borderd"    "business"
 [6] "center"     "div"             "email"     "facedarial" "faceverdanafont"
[11] "font"       "free"            "height"    "heightd"    "helvetica"
[16] "html"       "input"           "list"      "money"      "option"
[21] "people"     "please"          "receive"   "sansserif"  "size"
[26] "sized"      "table"           "width"     "widthd"

We can also check to see what terms appear with great frequency BOTH in the ham and the spam, meaning that the presence of these terms cannot be used to distinguish one type of email from the other:

> intersect(findFreqTerms(spam.dtm, 100), findFreqTerms(ham.dtm, 100))
 [1] "access"      "address"     "business"    "call"        "company"     "computer"
 [7] "contenttype" "day"         "email"       "free"        "government"  "help"
[13] "here"        "home"        "html"        "information" "internet"    "life"
[19] "link"        "list"        "mail"        "mailing"     "message"     "million"
[25] "money"       "name"        "people"      "please"      "report"      "send"
[31] "sent"        "service"     "size"        "software"    "time"        "web"
[37] "you"
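The complement of that intersection is also interesting: terms that are frequent in the spam but NOT frequent in the ham are candidate spam indicators. Base R's setdiff makes this a one-line experiment (a suggestion beyond the original walkthrough, so your mileage may vary):

setdiff(findFreqTerms(spam.dtm, 100), findFreqTerms(ham.dtm, 100))   # frequent in spam only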
For association analysis, we're going to work with the matrices that we stripped many of the sparse terms from. Let's first take a look at all the terms that appear in each matrix, and the number of times each of those common terms appears in the first document of our collection:

> inspect(new.ham.tdm.2[, 1])
A term-document matrix (6 terms, 1 documents)

Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 7
Weighting          : term frequency (tf)

         Docs
Terms     00001.7c53336b37003a9286aba55d2945844c
  date    1
  email   0
  list    4
  mailing 1
  url     0
  wrote   0

> inspect(new.spam.tdm.2[, 1])
A term-document matrix (22 terms, 1 documents)

Non-/sparse entries: 12/10
Sparsity           : 45%
Maximal term length: 11
Weighting          : term frequency (tf)

              Docs
Terms          00001.7848dde101aa985090474a91ec93fcf0
  address      0
  body         1
  click        1
  company      0
  contenttype  0
  email        2
  font         3
  form         0
  free         3
  head         0
  html         1
  information  0
  list         1
  message      0
  meta         2
  please       1
  receive      0
  removed      1
  table        4
  time         0
  wish         1
  you          0

Since the word "list" appears 4 times in the first ham document, let's see what that term is most commonly associated with throughout the corpus:

> findAssocs(new.ham.tdm.2, "list", 0)
   list mailing   wrote   email
   1.00    0.68    0.26    0.24

Similarly, let's see what's most commonly associated with the word "free" in the spam:

> findAssocs(new.spam.tdm.2, "free", 0)
       free information     receive        time     address        list         you
       1.00        0.32        0.31        0.27        0.17        0.17        0.15
      email      please       click        font        head        html        wish
       0.14        0.14        0.12        0.09        0.09        0.04        0.03

Looks like the spammers are most frequently offering free information.

We can also examine word clouds, which can be very fun. But be advised: many data analysts and statisticians hate word clouds, because the same information (and usually better information) can be found in frequency tables and matrices. That said, word clouds are still fun to make and can be very revealing nonetheless. So let's make a few. If you have not done it already, install and then load the wordcloud package:

library(wordcloud)

In this example, our goal is to create word clouds that compare and contrast the content of the ham corpus with respect to the spam corpus. As a result, we need to squish all the documents in all.spam and all.ham into a data frame with two columns, one for each email type:

allspam <- paste(all.spam, sep = "", collapse = " ")
allham <- paste(all.ham, sep = "", collapse = " ")
tmpText = data.frame(c(allham, allspam), row.names = c("HAM", "SPAM"))
ds <- DataframeSource(tmpText)

This special data structure can be used to create a new corpus. We'll remove punctuation, convert all the words to lowercase, remove the numbers, and eliminate the stopwords in a slightly different way than we did it last time:

corp = Corpus(ds)
corp = tm_map(corp, removePunctuation)
corp = tm_map(corp, tolower)
corp = tm_map(corp, removeNumbers)
corp = tm_map(corp, function(x){ removeWords(x, stopwords()) })

Once we have the tdm, we can create a word cloud:

tdm <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(tdm)
v <- sort(rowSums(term.matrix), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq, max.words = 150)

Figure X.1: Word cloud created with wordcloud(d$word, d$freq, max.words=150)

For the grand finale, let's create word clouds that show the words most commonly used in BOTH ham and spam, as well as the words most distinctively used in each category. The former is called a commonality cloud, and the latter is called a comparison cloud. We can generate our clouds like this (the first line sets up a graphics panel with 1 row and 2 columns, so we can get the clouds to appear side by side). The clouds are in Figure X.2 below.

par(mfrow = c(1, 2))
comparison.cloud(term.matrix, max.words = 200, random.order = FALSE,
    colors = c("#999999", "#000000"), main = "Differences Between HAM and SPAM")
commonality.cloud(term.matrix, max.words = 100, random.order = FALSE,
    color = "#000000", asp = 2, main = "Similarities Between HAM and SPAM")

Figure X.2: Comparison cloud (left) and commonality cloud (right) for ham and spam.

From these clouds, some patterns stand out. First, you might want to consider eliminating the words in the commonality cloud in further analyses. It looks like all kinds of email that show up at the inbox contain words like email, list, and people, so these could make a useful collection of additional stopwords (see the sketch below). Second, the comparison cloud on the left shows that there are tons of HTML tags in spam. Perhaps HTML emails should be examined more closely in the final spam filter than non-HTML emails?
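If you act on that first observation, one hypothetical way to fold those shared words into the cleaning step is to extend the stopword list before rebuilding the corpus. The three words below are just examples read off the commonality cloud, not a definitive list:

extra.stopwords <- c("email", "list", "people")   # candidates spotted in the commonality cloud
corp = tm_map(corp, function(x){ removeWords(x, c(stopwords(), extra.stopwords)) })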
Conclusions

With the examples you have just completed, you should be able to define, create, and examine a corpus, plus a Term-Document Matrix or Document-Term Matrix, whichever is needed. You can reduce the data in your matrix to make it less sparse, examine frequencies and associations, and create various kinds of word clouds.

Other Resources:

http://en.wikipedia.org/wiki/Text_mining
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf - Feinerer's introduction to the tm package
http://www.jstatsoft.org/v25/i05/paper - "Text Mining Infrastructure in R" by Feinerer et al. (probably the BEST and most comprehensive guide to using the tm package)
http://www.jstatsoft.org/v25/i05/ - Feinerer's example code from the paper above can be downloaded here
http://www.jstatsoft.org/v25/i05/ - Zinkov's text mining with R
http://www.r-bloggers.com/simple-text-mining-with-r/ - Simple text mining with R at R Bloggers
http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ - Text mining with Twitter by Jeffrey Breen