Lab 8 – Clustering Text Documents (due M 4/23)

In this lab we continue to use the R text mining package tm, created by I. Feinerer in 2008 (the most recent version is 0.5-7.1).[i] We discussed some of the preprocessing tasks in the Intro to Text Mining Tutorial with the Wizard of Oz data. In this lab we will be looking at emails. Steps for converting, preprocessing, and clustering the emails are discussed below. For this we will need an additional plug-in for tm.[ii] Information on this plug-in can be found in e-reserve.

> install.packages("tm.plugin.mail")
> library(tm.plugin.mail)

Creating the corpus

1. Obtain a text file of emails (in mbox format). For example, the r-help emails are archived here: https://mailman.stat.ethz.ch/pipermail/r-help/ Pick a recent month and download it to your R working directory. Notice that this is a gzipped file.

2. Convert the mbox file into separate text emails. This creates a separate text file for each email and places them in a folder of your choice in the R working directory.

> convert_mbox_eml(gzfile("2012-March.txt.gz"), "2012-March/")

Go to the folder. How many emails are there?

3. Create the corpus.

> rhelp <- Corpus(DirSource("2012-March/"), list(reader=readMail))
> rhelp

The corpus should contain the same number of documents as there are emails in the directory you created. You can examine the first email with:

> rhelp[[1]]
> meta(rhelp[[1]])

4. Detect threads. This is a one-pass algorithm for extracting thread information. It returns a thread ID and a thread depth (how many emails are in the thread). Note that since it is a one-pass algorithm, it does not detect replies that appear in the list before their base emails (it marks these with ID "NA" and depth 2).

> rhelp.thread <- threads(rhelp)
> rhelp.thread

Preprocessing

First we perform several transformations on the files.

5. Remove citations, i.e. lines beginning with >.

> rhelp <- tm_map(rhelp, removeCitation)
> rhelp[[1]]

Remove signatures, i.e.
anything below "-- " (dash dash blank, the official signature start mark) and any other characters of your choice. Note: this will have some trouble with more than two dashes.

> rhelp <- tm_map(rhelp, removeSignature)

(or: > rhelp[[1]] <- removeSignature(rhelp[[1]], marks="^[+]-*[+]$"))

Initial Exploration[iii]

6. Determine the most active writers. First extract all of the authors, then normalize multiple entries (collapse those spanning multiple lines into one line).

> authors <- lapply(rhelp, Author)
> authors <- sapply(authors, paste, collapse=' ')

Sort by number of emails authored and look at the top 10.

> sort(table(authors), decreasing=T)[1:10]

Who are the 10 most active writers? How many emails did each generate?

7. Do the same thing with "Heading" for another way to look at thread/topic information.

> headings <- lapply(rhelp, Heading)
> headings <- sapply(headings, paste, collapse=' ')
> sort(table(headings), decreasing=T)[1:10]
> big.topics <- names(sort(table(headings), decreasing=T)[1:10])
> big.topics
> unique(sapply(rhelp[headings==big.topics[1]], Author))
> unique(sapply(rhelp[headings==big.topics[2]], Author))

8. Determine how many of the emails discuss a certain term. Search the emails for a specific term, say "problem," to estimate how many of the emails deal with problems in R.

> p.filter <- tm_filter(rhelp, FUN=searchFullText, "problem", doclevel=T)
> p.filter

How many of the emails contain the term "problem"? Determine the 10 most active authors for the term "problem".

> p.authors <- lapply(p.filter, Author)
> p.authors <- sapply(p.authors, paste, collapse=' ')
> sort(table(p.authors), decreasing=T)[1:10]

Who are they? How many emails did they author containing "problem"? Repeat with a different term. What are your results?

Count-based Evaluation[iv]

The emails need to be in plain text format in order to create the document term matrix. It is also useful to remove common words and punctuation before creating the document term matrix.

9.
Convert the emails to plain text. Show what this does to the metadata.

> rhelp <- tm_map(rhelp, as.PlainTextDocument)
> meta(rhelp[[1]])

Convert to lowercase.

> rhelp <- tm_map(rhelp, tolower)
> rhelp[[1]]

10. Remove common words. (The loops below use seq_along(rhelp) rather than a hard-coded document count, so they work for whichever month you downloaded.)

> for(i in seq_along(rhelp)){rhelp[[i]] <- removeWords(rhelp[[i]], stopwords("en"))}

Remove numbers.

> for(i in seq_along(rhelp)){rhelp[[i]] <- removeNumbers(rhelp[[i]])}

Remove punctuation.

> for(i in seq_along(rhelp)){rhelp[[i]] <- removePunctuation(rhelp[[i]])}
> rhelp[[1]]

11. Create the document term matrix.

> rhelp.tdm <- DocumentTermMatrix(rhelp)
> rhelp.tdm

How many non-zero entries are there? How many total terms?

12. Find the most frequent terms, i.e. the terms that occur at least 200 times.

> findFreqTerms(rhelp.tdm, 200)

What are the words?

13. Find associations. Find the words that are associated with "archive" (correlation coefficient over 0.85). Note: this may take several minutes.

> findAssocs(rhelp.tdm, "archive", .85)

Data Mining with Text Data[iv]

Note that many of the algorithms below may be time-consuming with this particular data set. You may wish to choose a smaller data set, or a sample of the current one, for the following. Another approach is to remove sparse columns.

14. Determine the number of clusters. One way to determine the number of clusters is to look at the within-cluster sum of squares. Compute the within-cluster sum of squares for k = 1 to 20 sample k-means clusterings and see where the sum of squares seems to level off. (If these functions complain about the sparse matrix class, convert first with as.matrix(rhelp.tdm).)

> wss <- (nrow(rhelp.tdm)-1)*sum(apply(rhelp.tdm, 2, var))
> for (i in 2:20) wss[i] <- sum(kmeans(rhelp.tdm, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Though there isn't an obvious choice for k, make a selection to use for all of the methods below.

15. Hierarchical Clustering.
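Before clustering the full term matrix, it may help to see what hclust and cutree do on a tiny example. The sketch below uses a made-up 4-document, 3-term count matrix, and the choice of k = 2 is purely illustrative:

```r
# Toy hierarchical clustering: rows are "documents", columns are
# made-up term counts. Documents d1/d2 share terms, as do d3/d4.
m <- rbind(d1 = c(5, 0, 0),
           d2 = c(4, 1, 0),
           d3 = c(0, 0, 6),
           d4 = c(0, 1, 5))
hc <- hclust(dist(m), method = "ward")  # Euclidean distance, Ward linkage
groups <- cutree(hc, k = 2)             # cut the dendrogram into 2 clusters
table(groups)                           # cluster sizes: 2 and 2
```

The same pattern (hclust on a distance matrix, then cutree at a chosen k) is what the commands below apply to the document term matrix.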
> rhelp.hclust <- hclust(dist(rhelp.tdm), method="ward")
> rhelp.hclust
> rhelp.hclust.15 = cutree(rhelp.hclust, k=15)
> rhelp.hclust.10 = cutree(rhelp.hclust, k=10)

To examine the size of the clusters:

> table(rhelp.hclust.15)
> table(rhelp.hclust.10)

16. k-Means. We'll use pam instead of kmeans as it is more robust.

> library(cluster)
> rhelp.pam <- pam(rhelp.tdm, 15)

Examine the clusters:

> rhelp.pam$clusinfo

How might you change any of the steps above? Discuss ways to evaluate the clusters.

For this lab, turn in a description of the preprocessing you did and the results. Answer any questions posed above. For the exploration, the count-based evaluation, and the data mining pieces, make sure to include results/output directly from R. Additionally, discuss the results and what they mean, and answer any questions related to them. It is easiest if you cut and paste your results into a word processor as you obtain them.

Extra Functions:

Remove non-text parts from multipart e-mail messages.

> rhelp <- tm_map(rhelp, removeMultipart)

Strip extra whitespace.

> rhelp <- tm_map(rhelp, stripWhitespace)

Step 8 above creates a sub-corpus of the "problem" emails. Alternatively, you could simply create a TRUE/FALSE vector denoting whether or not each email contains the term "problem". From this you can find the percentage of such emails.

> p <- tm_index(rhelp, FUN=searchFullText, "problem", doclevel=T)
> sum(p)/length(rhelp)

Cosine similarity

The default distance in hclust is Euclidean distance. Cosine similarity is often used for document data. The function dissimilarity creates a dissimilarity matrix using any of the metrics/similarities available in the proxy package. To use cosine similarity as the proximity measure in clustering, download and install the proxy R package.

> install.packages("proxy")
> library(proxy)

Repeat hclust with the cosine similarity proximity measure.
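Before running it on the full matrix, here is a small base-R sketch of what the cosine measure computes. The three-document count matrix is made up, and the cosine dissimilarity used for clustering is 1 minus this similarity:

```r
# Cosine similarity by hand for a tiny made-up term matrix
# (rows are documents, columns are term counts).
m <- rbind(doc1 = c(2, 0, 1),
           doc2 = c(4, 0, 2),   # same proportions as doc1, twice the length
           doc3 = c(0, 3, 0))   # no terms in common with doc1
cos.sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos.sim(m["doc1", ], m["doc2", ])  # 1: identical direction despite length
cos.sim(m["doc1", ], m["doc3", ])  # 0: no shared terms
```

This is why cosine is popular for document data: a long email and a short one about the same topic end up close together, which plain Euclidean distance on raw counts would not give you.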
> d <- dissimilarity(rhelp.tdm, method="cosine")
> rhelp.cosine <- hclust(d, method="ward")
> rhelp.cosine.15 = cutree(rhelp.cosine, k=15)

References

i. I. Feinerer. tm: Text Mining Package, 2012. URL http://CRAN.R-project.org/package=tm. R package version 0.5-7.1.
ii. tm.plugin.mail. URL http://cran.r-project.org/web/packages/tm.plugin.mail/index.html.
iii. I. Feinerer. An introduction to text mining in R. R News, 8(2):19–22, Oct. 2008. URL http://CRAN.R-project.org/doc/Rnews/.
iv. I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.