Text Analysis Fundamentals
Objective
To extract and examine metadata from collections of documents that contain text. If
you want to develop intelligent agents that classify or make predictions based on text,
this is an important prerequisite! Most machine learning algorithms require quantitative
input, so text analysis helps you reduce and transform your textual content into
numbers that describe the content. This example uses the tm and wordcloud
packages in R, plus data and a little bit of the code from Machine Learning for Hackers by
Drew Conway (@drewconway) and John Myles White (@johnmyleswhite). You should
follow both of these guys on Twitter. They are very cool and they love data analysis. And
their code is downloadable from https://github.com/johnmyleswhite/ML_for_Hackers.
Background
Text analysis is a powerful mechanism for knowledge discovery and extraction, providing
techniques for identifying associations and establishing categorization schemes.
Typically, data mining means working with structured data where you know what types
of information are contained within various locations in your data repository (such as
your database tables). Text mining, in contrast, involves working with unstructured data
and can be much more complicated. It can be costly and time consuming to convert a
dataset consisting of unstructured text into a more structured form suitable for mining
with other techniques. There are three data structures typically associated with text
analysis and text mining: the basic container for the unstructured text, and matrices that
contain information about the frequency of appearance of words and terms. They are:

Corpus - A list containing a collection of documents. Although the tm package in
R allows you to build a corpus from several different file types, we will only be
working with text files here. Although I've only used collections containing up to
10k documents, it is reported that R performs well with collections up to 50k
documents, and sometimes even up to 100k documents depending upon how
much memory you have available.

Document-Term Matrix (dtm) - Organized with the documents as rows and the
terms as columns, this matrix contains counts of the appearance of each term
within each document of a corpus.

Term-Document Matrix (tdm) - Organized with the terms as rows and the
documents as columns, this matrix contains counts of the appearance of each
term within each document of a corpus. (A tiny made-up example contrasting the
dtm and the tdm follows this list.)
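If the difference between the two matrices isn't clear yet, here is a minimal sketch using a
made-up three-document "corpus" (the toy.docs object below is purely hypothetical and has
nothing to do with the spam data we use later in the chapter):

library(tm)

# Three tiny one-line "documents"
toy.docs <- c("free money now", "meeting agenda attached", "free meeting invite")
toy.corpus <- Corpus(VectorSource(toy.docs))

toy.dtm <- DocumentTermMatrix(toy.corpus)   # rows = documents, columns = terms
toy.tdm <- TermDocumentMatrix(toy.corpus)   # rows = terms, columns = documents

inspect(toy.dtm)   # a 3-row table of term counts, one row per document
inspect(toy.tdm)   # the same counts, transposed

The two matrices hold exactly the same counts; the only difference is which dimension holds
the documents and which holds the terms.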
In this chapter, we step through the process of building each of these objects in R, and
using them to analyze the content of the emails, comparing the composition of spam
emails to the composition of legitimate non-spam emails ("ham").
Data Format
For this exercise we will use the "spam and ham" dataset of emails from Chapter 3 of
Machine Learning for Hackers. (I've zipped up this data set and posted it to Blackboard
under the Lab Exercises menu option. It's called spam-and-ham.zip. I unzipped this data
into the directory C:/ISAT 344/MLH/03-Classification/data on my computer. Be sure you
note what directory you unzip the data into, because you'll need to use that path.)
spam.path <- "C:/ISAT 344/MLH/03-Classification/data/spam/"
ham.path <- "C:/ISAT 344/MLH/03-Classification/data/easy_ham/"
The /spam folder contains 501 files. 500 of these files contain one email (header +
message) each. The filename of the first email starts with 00001, the filename of the
second email starts with 00002, and so forth. An additional file called cmds contains the
text of Unix commands. The /easy_ham folder contains 2501 files. 2500 of these files
contain one email (header + message) each. The filename numbering scheme for emails
in this directory is the same as for the /spam directory. An additional file called cmds is
also present in this directory. You should try to open up one of the email files just to see
what it looks like. (Note that there are several emails written in HTML, containing all of
the tags that will make these messages pretty when you open them in your email
program.)
Our first challenge is to build a Corpus for each of the two collections (spam and ham),
and then a document-term matrix (dtm) and term-document matrix (tdm) for each so
we can execute the analysis algorithms.
First, we need to cut and paste into R this "helper function" written by Conway and
White. It will open one of the files and pull out just the content of the message,
ignoring all of the email header information.
get.msg <- function(path) {
  # Open the file for reading (latin1 encoding handles odd characters in some emails)
  con <- file(path, open="rt", encoding="latin1")
  text <- readLines(con)
  # The message body starts after the first blank line, which marks the end of the header
  msg <- text[seq(which(text=="")[1]+1, length(text))]
  close(con)
  return(paste(msg, collapse="\n"))
}
Next, we need to create a data structure in R to contain the content of each of
the emails. The first line pulls out all of the filenames in the directory given by
spam.path and creates a character vector. The second line adjusts that vector by keeping
only the file names which do not (!=) match the string "cmds". The sapply command
applies the same function to every element of a vector, so the third line takes each
filename contained within spam.docs and gets the message (get.msg) contained
within that file. As a result, the content of each email message is contained in the named
vector that we call all.spam.
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs,
                   function(p) get.msg(paste(spam.path, p, sep="")))
Then, we do the same thing for all the files containing "ham" email:
ham.docs <- dir(ham.path)
ham.docs <- ham.docs[which(ham.docs!="cmds")]
all.ham <- sapply(ham.docs,
                  function(p) get.msg(paste(ham.path, p, sep="")))
The name of each element in the new vector conveniently corresponds to the filename
containing the email content, so, for example, all.spam[1] can be typed into the R
prompt to retrieve the content of the email in the file with the filename
00001.7848dde101aa985090474a91ec93fcf0.
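If you just want to peek at a message without flooding the console, one quick way (a small
sketch using base R) is to look at only the first few hundred characters:

# Show the first 200 characters of the first spam message
substr(all.spam[1], 1, 200)

# The element names are the filenames, so you can also see which file it came from
names(all.spam)[1]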
Now we can create our Corpus, document-term matrix (dtm) and term-document
matrix (tdm) objects. We need to load the tm package for text mining first, so be sure
you have already installed the package into your instance of R by going to Packages ->
Install Package(s) on the menu bar of the R GUI. Then do this:
library(tm)
control <- list(stopwords=TRUE, removePunctuation=TRUE,
                removeNumbers=TRUE, minDocFreq=2)
We will use the control variable to create our tdm and dtm objects. What we've done is
to establish that we want to remove all of the most common words in the English
language (stopwords), numbers, and punctuation. Furthermore, words that appear in
only one document will be ignored (a word has to show up in at least 2 documents to be
reported, thanks to minDocFreq=2). More options are summarized in Table X.1.
spam.corpus <- Corpus(VectorSource(all.spam))
spam.tdm <- TermDocumentMatrix(spam.corpus,control)
spam.dtm <- DocumentTermMatrix(spam.corpus,control)
Now, we do the same thing for our collection of ham emails:
ham.corpus <- Corpus(VectorSource(all.ham))
ham.tdm <- TermDocumentMatrix(ham.corpus,control)
ham.dtm <- DocumentTermMatrix(ham.corpus,control)
Parameters available to control    What they do
stopwords=TRUE                     Removes all common words from the document,
                                   such as a, and, or, but, etc.
removePunctuation=TRUE             Eliminate punctuation marks from the corpus
removeNumbers=TRUE                 Eliminate numbers from the corpus
minDocFreq=5                       Ignore all words that appear in less than 5
                                   documents in your corpus
minWordLength=2                    Ignore all "words" less than 2 characters

Table X.1: Optional arguments to TermDocumentMatrix and
DocumentTermMatrix to control what ends up in the matrices
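As a sketch of how you might use the other options in Table X.1 (the exact values here are
only illustrative, and newer versions of tm replace these control names with wordLengths and
bounds), you could build a stricter term-document matrix like this:

# A stricter control list: also drop terms shorter than 2 characters and terms
# that appear in fewer than 5 documents (values chosen only for illustration)
strict.control <- list(stopwords=TRUE, removePunctuation=TRUE,
                       removeNumbers=TRUE, minDocFreq=5, minWordLength=2)
strict.spam.tdm <- TermDocumentMatrix(spam.corpus, strict.control)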
Finally, we can check to make sure that our objects have been successfully created, and
get a sense for what's contained within each one. There are several options available to
us, including summary, head, and inspect. The summary command gives us a look at the
structure of a corpus:
> summary(spam.corpus)
A corpus with 500 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
> summary(ham.corpus)
A corpus with 2500 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
The second way to get information about our objects is to use head, which shows just
the first few entries (the "head") of an object. We can ask to see the head of a tdm or
dtm, but not the head of a corpus. Notice that the maximum term length in the dtm is
996 characters. This might be bad! (We might want to clean the data more before we
examine it, or we might want to look and see what kind of term is 996 characters long. It
could, for example, be an encryption key or an encoded image! We don't know at this point.)
> head(spam.tdm)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 18/2982
Sparsity           : 99%
Maximal term length: 25
Weighting          : term frequency (tf)

> head(spam.dtm)
A document-term matrix (6 documents, 27053 terms)

Non-/sparse entries: 433/161885
Sparsity           : 100%
Maximal term length: 996
Weighting          : term frequency (tf)
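If you want to hunt down that suspicious 996-character term yourself, a minimal sketch
(assuming your version of tm provides the Terms() accessor; dimnames(spam.dtm)$Terms
returns the same vector) would be:

# Pull all the terms out of the dtm and look at the longest ones
spam.terms <- Terms(spam.dtm)
spam.terms[which.max(nchar(spam.terms))]          # the single longest term
head(sort(nchar(spam.terms), decreasing=TRUE))    # lengths of the longest few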
Notice also that head tells us how sparse the matrix is. Sparse means that there are a lot
of zeroes, or non-occurrences of terms. It can take a long time for algorithms to search
through matrices with a whole lot of zeroes, so you may want to consider removing
sparse terms in your tdm (but not your dtm!) before you proceed. Fortunately, this is
easy to do. All you need to supply is a sparsity threshold: a term is dropped if it is missing
from a larger fraction of the documents than the threshold allows, so a higher value keeps
more terms and is usually the safer choice. For example:
> new.spam.tdm.1 <- removeSparseTerms(spam.tdm,0.2)
> head(new.spam.tdm.1)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 0/3000
Sparsity           : 100%
Maximal term length: 2
Weighting          : term frequency (tf)

> new.spam.tdm.2 <- removeSparseTerms(spam.tdm,0.8)
> head(new.spam.tdm.2)
A term-document matrix (6 terms, 500 documents)

Non-/sparse entries: 909/2091
Sparsity           : 70%
Maximal term length: 11
Weighting          : term frequency (tf)
In the first case, the threshold of 0.2 keeps a term only if it appears in at least 80% of the
documents. You can see this is a bad thing to do here, because it reduces the number of
actual data values ("non-sparse entries") to zero. We have essentially eliminated all of
our data (oops). In the second case, we set up a new tdm called new.spam.tdm.2 with a
threshold of 0.8, so a term is only wiped out if it is missing from more than 80% of the
documents in the corpus. The results are much better. Now, only 70% of our matrix
containing word counts is empty, and the longest term is 11 characters, so we have also
eliminated that odd 996-character term (and any that were like it).
This also tells us we should use new.spam.tdm.2 whenever we can when we start to
analyze our data. Similarly, we should check the structure of the tdm we created from
our ham data, and run removeSparseTerms on it if need be. (Try this yourself!)
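The association examples later in this chapter assume a pruned ham matrix called
new.ham.tdm.2 exists. A minimal sketch, assuming the 0.8 threshold that worked for the
spam collection is also reasonable for ham:

# Prune the ham term-document matrix the same way we pruned the spam one.
# The 0.8 threshold is carried over from the spam example as an assumption;
# run head(new.ham.tdm.2) afterward and adjust if too much (or too little) survives.
new.ham.tdm.2 <- removeSparseTerms(ham.tdm, 0.8)
head(new.ham.tdm.2)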
The third means to peek into our data is to use inspect. We can inspect elements of a
corpus if we call them by the number they are indexed at (which will show us the
content of each embedded document in the collection):
> inspect(spam.corpus[40])
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`00040.949a3d300eadb91d8745f1c1dab51133`
Dear Sir or Madam:
Please reply to
Receiver: China Enterprise Management Co., Ltd. (CMC)
E-mail: unido@chinatop.net
As one technical organization supported by China Investment and Technical
Promotion Office of United Nation Industry Development Organization (UNIDO), we
cooperate closely with the relevant Chinese Quality Supervision and
Standardization Information Organization. We provide the most valuable consulting
services to help you to open Chinese market within the shortest time:
1. Consulting Service on Mandatory National Standards of The People's Republic of
China.
2. Consulting Service on Inspection and Quarantine Standards of The People's
Republic of China.
3. Consulting Service for Permission to Enter Chinese Market
We are very sorry to disturb you!
More information, please check our World Wide Web: http://www.chinatop.net
Sincerely yours
-Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie
Alternatively, we can inspect rows and columns of a tdm or dtm, which will give us
portions of the (sometimes large) tables that contain the counts of our terms. These
calls to inspect give us rows 1 through 3 (the first argument in brackets), and then
columns 1 through 3 (the second argument in brackets).
> inspect(spam.tdm[1:3,1:3])
A term-document matrix (3 terms, 3 documents)

Non-/sparse entries: 0/9
Sparsity           : 100%
Maximal term length: 22
Weighting          : term frequency (tf)

                                     Docs
Terms                                 00001.7848dde101aa985090474a91ec93fcf0
  \033b\033b                                                               0
  \033bal\033bvipmail                                                      0
  \033bckkonsh\033bpt\033bwlshy\033b                                       0
                                     Docs
Terms                                 00002.d94f1b97e48ed3b553b3508d116e6a09
  \033b\033b                                                               0
  \033bal\033bvipmail                                                      0
  \033bckkonsh\033bpt\033bwlshy\033b                                       0
                                     Docs
Terms                                 00003.2ee33bc6eacdb11f38d052c44819ba6c
  \033b\033b                                                               0
  \033bal\033bvipmail                                                      0
  \033bckkonsh\033bpt\033bwlshy\033b                                       0
> inspect(spam.dtm[1:3,1:3])
A document-term matrix (3 documents, 3 terms)

Non-/sparse entries: 0/9
Sparsity           : 100%
Maximal term length: 22
Weighting          : term frequency (tf)

                                         Terms
Docs                                      \033b\033b \033bal\033bvipmail \033bckkonsh\033bpt\033bwlshy\033b
  00001.7848dde101aa985090474a91ec93fcf0           0                   0                                  0
  00002.d94f1b97e48ed3b553b3508d116e6a09           0                   0                                  0
  00003.2ee33bc6eacdb11f38d052c44819ba6c           0                   0                                  0
Code and Results
You may have even forgotten that we haven't started the analysis part yet, because
there's been so much work to do with the data itself. First, let's get a
sense of the overall frequency of terms in our collections. We get identical results
whether we use our tdm or our dtm here. To begin, let's check to see how many terms have
been mentioned at least 100 times:
> length(findFreqTerms(spam.dtm,100))
[1] 120
That's a lot of terms, so let's check and see how many have been mentioned even more
frequently - at least 300 times:
> length(findFreqTerms(spam.dtm,300))
[1] 29
Since 29 is a manageable number, we can look at the terms themselves:
> findFreqTerms(spam.dtm,300)
 [1] "arial"           "body"            "border"   "borderd"     "business"
 [6] "center"          "div"             "email"    "facedarial"  "faceverdanafont"
[11] "font"            "free"            "height"   "heightd"     "helvetica"
[16] "html"            "input"           "list"     "money"       "option"
[21] "people"          "please"          "receive"  "sansserif"   "size"
[26] "sized"           "table"           "width"    "widthd"
We can also check to see what terms appear with greatest frequency BOTH in the ham
and the spam, meaning that the presence of these terms cannot be used to distinguish
one type of email from the other:
> intersect(findFreqTerms(spam.dtm,100),findFreqTerms(ham.dtm,100))
 [1] "access"      "address"     "business"    "call"        "company"     "computer"
 [7] "contenttype" "day"         "email"       "free"        "government"  "help"
[13] "here"        "home"        "html"        "information" "internet"    "life"
[19] "link"        "list"        "mail"        "mailing"     "message"     "million"
[25] "money"       "name"        "people"      "please"      "report"      "send"
[31] "sent"        "service"     "size"        "software"    "time"        "web"
[37] "you"
For association analysis, we're going to work with the matrices that we stripped many of
the sparse terms out of. First, let's take a look at all the terms that appear in each
matrix, and the number of times each of those common terms appears in the first
document of our collection:
> inspect(new.ham.tdm.2[,1])
A term-document matrix (6 terms, 1 documents)

Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 7
Weighting          : term frequency (tf)

         Docs
Terms     00001.7c53336b37003a9286aba55d2945844c
  date                                         1
  email                                        0
  list                                         4
  mailing                                      1
  url                                          0
  wrote                                        0
> inspect(new.spam.tdm.2[,1])
A term-document matrix (22 terms, 1 documents)

Non-/sparse entries: 12/10
Sparsity           : 45%
Maximal term length: 11
Weighting          : term frequency (tf)

             Docs
Terms         00001.7848dde101aa985090474a91ec93fcf0
  address                                          0
  body                                             1
  click                                            1
  company                                          0
  contenttype                                      0
  email                                            2
  font                                             3
  form                                             0
  free                                             3
  head                                             0
  html                                             1
  information                                      0
  list                                             1
  message                                          0
  meta                                             2
  please                                           1
  receive                                          0
  removed                                          1
  table                                            4
  time                                             0
  wish                                             1
  you                                              0
Since the word "list" appears 4 times in the first document, let's see what that term is
most commonly associated with throughout the corpus:
> findAssocs(new.ham.tdm.2,"list",0)
   list mailing   wrote   email
   1.00    0.68    0.26    0.24
Similarly, let's see what's most commonly associated with the word "free" in the spam:
> findAssocs(new.spam.tdm.2,"free",0)
       free information     receive        time     address        list         you       email
       1.00        0.32        0.31        0.27        0.17        0.17        0.15        0.14
     please       click        font        head        html        wish
       0.14        0.12        0.09        0.09        0.04        0.03
Looks like the spammers are most frequently offering free information.
We can also examine word clouds, which can be very fun. But be advised: many data
analysts and statisticians hate word clouds because the same information (and usually
better information) can be found in frequency tables and matrices. That said, word
clouds are still fun to make and can be very revealing nonetheless. So let's make a few.
If you have not done it already, install and then load the wordcloud package:
library(wordcloud)
In this example, our goal is to create word clouds that compare and contrast the content
of the ham corpus with respect to the spam corpus. As a result, we need to squish all the
documents in all.spam and all.ham into a data frame with two rows, one for
each email type:
allspam <- paste(all.spam,sep="",collapse=" ")
allham <- paste(all.ham,sep="",collapse=" ")
tmpText = data.frame(c(allham,allspam),row.names=c("HAM","SPAM"))
ds <- DataframeSource(tmpText)
This special data structure can be used to create a new corpus. We'll remove
punctuation, convert all the words to lowercase, remove the numbers, and eliminate
the stopwords in a slightly different way than we did last time:
corp = Corpus(ds)
corp = tm_map(corp,removePunctuation)
corp = tm_map(corp,tolower)
corp = tm_map(corp,removeNumbers)
corp = tm_map(corp,function(x){removeWords(x,stopwords())})
Once we have the tdm, we can create a word cloud:
tdm <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(tdm)
v <- sort(rowSums(term.matrix),decreasing=TRUE)
d <- data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words=150)
Figure X.1: Word cloud created with wordcloud(d$word,d$freq,max.words=150)
For the grand finale, let's create word clouds that show the words most commonly used
in BOTH ham and spam, as well as the words whose usage differs most between the two
categories. The former is called a commonality cloud, and the latter is called a comparison
cloud. We can generate our clouds like this (the first line sets up a graphics panel with 1 row
and 2 columns, so we can get the clouds to appear side by side). The clouds are in Figure
X.2 below.
par(mfrow=c(1,2))
comparison.cloud(term.matrix,max.words=200,random.order=FALSE,
                 colors=c("#999999","#000000"),
                 main="Differences Between HAM and SPAM")
commonality.cloud(term.matrix,max.words=100,random.order=FALSE,
                  color="#000000",asp=2,
                  main="Similarities Between HAM and SPAM")
From these clouds, there are some patterns that stand out. First, you might want to
consider eliminating the words in the commonality cloud in further analyses. It looks like
all kinds of email that show up in the inbox contain words like email, list, and people, so
these could make a useful collection of additional stopwords. The comparison cloud at
the left shows that there are tons of HTML tags in spam. Perhaps HTML emails should
be examined more closely in the final spam filter than non-HTML emails?
Figure X.2: Comparison cloud (left) and commonality cloud (right) for ham and spam.
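If you do decide to treat those shared words as extra stopwords, a minimal sketch (the three
words below are just the examples mentioned above, not a definitive list) is to strip them out
with removeWords and rebuild the matrix before redrawing the clouds:

# Treat words shared by ham and spam as additional stopwords and remove them
# from the corpus before rebuilding the term-document matrix
extra.stopwords <- c("email", "list", "people")   # illustrative list only
corp <- tm_map(corp, function(x) removeWords(x, extra.stopwords))
tdm <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(tdm)   # then re-run the cloud commands above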
Conclusions
With the examples you have just completed, you should be able to define, create, and
examine a corpus and a Term-Document Matrix or Document-Term Matrix, whichever is
needed. You can reduce the data in your matrix to make it less sparse, examine
frequencies and associations, and create various kinds of word clouds.
Other Resources:
http://en.wikipedia.org/wiki/Text_mining
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf - Feinerer's intro to the tm package
http://www.jstatsoft.org/v25/i05/paper - Text Mining Infrastructure in R by Feinerer et al.
(probably the BEST and most comprehensive guide to using the tm package)
http://www.jstatsoft.org/v25/i05/ - Feinerer's example code from the paper above can be
downloaded here
http://www.jstatsoft.org/v25/i05/ - Zinkov's text mining with R
http://www.r-bloggers.com/simple-text-mining-with-r/ - Simple text mining with R at R Bloggers
http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ - Text mining with
Twitter by Jeffrey Breen