DHTK 2012 Lab #4

The Digital Historian's Toolkit
Lab #4: Distant Reading

Earlier this week we discussed Mark Twain's Roughing It (1872) through a process of close reading: carefully examining specific passages and chapters. Today we will instead apply a process of "distant reading" through a technique called topic modeling. First we will topic model the entirety of Twain's Roughing It. Then we will use topic modeling to examine four other "classic" nineteenth-century novels published during Twain's era: Nathaniel Hawthorne's The Scarlet Letter (1850), Harriet Beecher Stowe's Uncle Tom's Cabin (1852), Louisa May Alcott's Little Women (1868), and Twain's own The Adventures of Huckleberry Finn (1884).

I. Topic Modeling One Novel

We will use a program called MALLET to topic model. MALLET itself runs from the command line, but for ease of use we will instead use a Graphical User Interface (GUI) that repackages MALLET in a much more user-friendly form. Think of the GUI as the body of a car and MALLET as its engine. The engine is what allows the car to function, but the body makes it (more) usable.

Download the GUI (TopicModelingTool.jar) at http://code.google.com/p/topic-modeling-tool/ and save it into your personal lab folder. Copy the lab's data from the class folder into your personal lab folder on the server. Look through the data files. There are two main directories, Classics_FivePercent and RoughingIt_TwoPercent. Open up RoughingIt_TwoPercent. The folder contains fifty text files, each representing 2% of the entire text of the book. This is because topic modeling tends to work better with smaller chunks of text that deal with discrete kinds of content. Chapters might work best, but since we don't have Roughing It separated by chapter, I've used an approximation of two-percent "chunks" of the novel.

Open the TopicModelingTool.jar file that you downloaded.
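For those curious how the two-percent chunking might be produced, here is a minimal Python sketch. The chunks for this lab were prepared in advance, so this is only an illustration of the idea, not the exact script used; splitting on whitespace keeps words intact.

```python
# Split a plain-text novel into fifty roughly equal "chunks" (2% each),
# approximating the pre-chunked files provided for this lab.

def chunk_text(text, n_chunks=50):
    """Return a list of n_chunks substrings of roughly equal word count,
    splitting on whitespace so no word is ever cut in half."""
    words = text.split()
    chunk_size = len(words) // n_chunks
    chunks = []
    for i in range(n_chunks):
        start = i * chunk_size
        # The final chunk absorbs any leftover words from integer division.
        end = (i + 1) * chunk_size if i < n_chunks - 1 else len(words)
        chunks.append(" ".join(words[start:end]))
    return chunks
```

Each chunk could then be written out as its own numbered text file for MALLET to read.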
The first dialog is the Input folder, which tells the program where to look for the data it is going to analyze. In this case, select RoughingIt_TwoPercent. The second dialog is the Output folder, where you want the program to dump all of the results it generates. This folder has not been created yet, so first create a new folder inside your lab folder, name it RoughingIt_Output, and then select it within the topic modeling tool as your output folder.

There are several other options you can change. First, you can specify the number of topics you want the program to generate. Usually this is a balance between too few topics (making topics too broad) and too many topics (making them unwieldy to analyze). For this lab, use 30 topics. Under the Advanced button, increase the iterations to 1,000. This tells the program to loop through its text files 1,000 times to generate its topics. You might tweak this if you were analyzing a huge number of files so that it wouldn't take as long to run; with only one novel, we don't need to worry about it as much. Finally, change the Topic Proportion Threshold to 0. This allows us to see the score of every topic in every single text chunk (even ones with a minuscule or negligible score for that topic).

Click Learn Topics and watch the program run. After it has finished, navigate to the output directory you created. Notice that the program has created two folders, output_html and output_csv. Under the output_html folder, open all_topics.html. Explore this file and try to figure out what kind of information it is telling you about the topics and the documents.

Start to analyze the topics themselves. Which topics are expected or unsurprising? Do any of them surprise you? Select two topics that you find interesting to share with the class and come up with a label for each (ex. MINING or VIOLENCE).
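To see what the Topic Proportion Threshold is doing, here is a tiny conceptual sketch in Python. The topic numbers and scores below are invented for illustration; the point is simply that the threshold filters which topics get reported for each document, and a threshold of 0 keeps them all.

```python
# Conceptual sketch of the Topic Proportion Threshold: for each document,
# only topics whose proportion meets the threshold are reported.
# These scores are invented for illustration, not taken from real output.

doc_scores = {11: 0.21, 29: 0.02, 3: 0.001}  # topic id -> proportion of the document

def visible_topics(scores, threshold):
    """Return the topic ids whose proportion is at least the threshold."""
    return sorted(t for t, s in scores.items() if s >= threshold)
```

With a threshold of 0, even the 0.001 topic appears in the output for this document; with a higher threshold, only the dominant topics survive.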
Remember: because the program uses a sampling process, everyone's list of topics will be slightly different (and if you run it again, your own list will change).

II. Topic Modeling Multiple Novels

Now let's apply this same process to four "classic" American novels from the mid-nineteenth century. Instead of fifty text files consisting of 2% chunks of one novel, you will use eighty text files consisting of 5% chunks of each novel, stored in the Classics_FivePercent folder.

This time, however, you are going to groom the text files a bit more. Specifically, we are going to tell the computer to ignore certain words, known as "stop words." If you remember from the first text analysis lab, stop words are words you don't want to analyze, most often common words such as and, the, and of. Sometimes, however, we want the computer to ignore additional words. If you click Advanced in the Topic Modeling Tool, you will notice that Remove stopwords is checked. This removes common English function words (pronouns, articles, prepositions, and the like). But we are also going to specify a list of additional words, consisting of proper names and some other miscellaneous words that create problematic "noise" in the data. This list is found in a text file named expandedStopWords.txt.

You will re-run the topic modeler on the Classics_FivePercent folder using the same settings as above. Before you do so, however, create a new output folder. After running the topic model, take a look at your newly generated output data for the four novels. Explore the different topics and their prevalence in different documents. What patterns do you recognize? Which topics tend to be dominated by a single novel? Which topics are spread more evenly across different novels? We will then discuss your findings as a group and compare what everyone found.

III. Processing Your Results

Return to the results from Roughing It.
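Stop-word removal itself is simple to picture. Here is a miniature Python sketch: the "expanded" names below merely stand in for the proper names in expandedStopWords.txt, whose exact contents may differ.

```python
# A miniature version of stop-word filtering: before topics are learned,
# every token that appears on the stop list is simply dropped.
# The character names below are stand-ins for the real expanded list.

COMMON_STOPWORDS = {"and", "the", "of", "a", "to", "in", "it", "was"}
EXPANDED_STOPWORDS = COMMON_STOPWORDS | {"tom", "huck", "jo", "hester"}

def remove_stopwords(text, stopwords):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stopwords]
```

Dropping proper names this way prevents each novel's topics from being dominated by its own cast of characters, which is exactly the "noise" the expanded list is meant to remove.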
You will now track how two topics behave over the course of the novel by charting their relative "score" within each of the fifty chunks of text. If each chunk of text is thought of as a pie, the "score" of a topic corresponds to the percentage of the pie assigned to that topic. The topic modeler assumes that the entirety of the text can be grouped into however many topics you select: if you specified 20 topics and each topic were distributed exactly evenly across the text, the "score" of each topic would be 5% in every document.

While you were working on Part II, I saved a new file into your personal folder: TopicsInDocs.csv_reformatted.txt. Although it is a text file, its data is organized in a tabular format so that you can open it in Excel. Open it with Excel and re-save it as an Excel Workbook rather than a text file, named Twain_RoughingIt_Output.xlsx. Before you begin, copy and paste your data into a new sheet so that you have a "sandbox" to play around in.

Your assignment is to select two interesting topics that you found in Part I and create both a table and a chart of their trends over the entire novel. To do this, you will have to sort the table so that the chunks of text are sequential, in their order of appearance in the novel. We are going to do so using the metadata contained within each filename (ex. Twain_RoughingIt_1872_12.txt): author, title, publication date, and the sequence number of the text segment. With this information, we can process the data in various ways. If, for instance, there were multiple novels by Twain, we could aggregate them together (because each filename contains "Twain"); or, if we had a list of 100 novels, we could order them by publication date or combine Roughing It with all the other novels published in 1872.

You will create a new column that contains only the last number in the filename, representing the text segment's sequence (1, 2, 3…50), so that you can then sort your table.
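The same extract-and-sort logic you will build with Excel's Text to Columns, LEFT, RIGHT, and LEN can be sketched in a few lines of Python; the filenames below follow the pattern described above, and the sketch shows why a plain alphabetical sort is not enough (alphabetically, 10 would come before 2).

```python
# Derive the text-segment sequence number from each filename so the rows
# can be sorted in reading order. This mirrors the Excel exercise in Python.

def segment_number(filename):
    """Return the trailing sequence number encoded in a chunk filename,
    e.g. 12 for 'Twain_RoughingIt_1872_12.txt'."""
    stem = filename.rsplit(".", 1)[0]    # drop the .txt extension
    return int(stem.rsplit("_", 1)[1])   # last underscore-separated field

filenames = ["Twain_RoughingIt_1872_10.txt",
             "Twain_RoughingIt_1872_2.txt",
             "Twain_RoughingIt_1872_1.txt"]

# Sorting by the extracted number puts the chunks in reading order,
# which an alphabetical sort of the raw filenames would not.
ordered = sorted(filenames, key=segment_number)
```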
You could do this manually for every single row, but what if there weren't 50 chunks of text, but 5,000? Rather than typing in "1", "2", and so on, try to figure out how to create this new column using the Text to Columns option (under Data) and the LEFT, RIGHT, and/or LEN functions. If you are unfamiliar with Excel, I recommend searching for help on these functions online. Let me know if you're having trouble. Ultimately, you should end up with a new table, pasted into a new worksheet, that looks like the following example. Hint: research the Paste Special option if you're having trouble copying and pasting cells that contain formulas.

Text Segment    MINING (Topic 11)    VIOLENCE (Topic 29)
1               0.21                 0.02
2               0.04                 0.07
…               …                    …

Once you have done so, use the table to construct a chart, with the X-axis as chunks of text and the Y-axis as topic scores, and insert it into a new sheet. Format the chart so that it is clean, clearly labeled, and directly conveys the relevant information. Export the chart as a JPG, save your Excel file, and email both to yourself so that you can include them as attachments when you submit your lab report to me.

LAB REPORT

Part I: Roughing It

In one paragraph, describe the pattern of two topics across Roughing It. Include why they are interesting or significant and possible explanations for why they behave the way they do. After the paragraph, insert the JPG of your chart along with an explanatory caption. Include the Excel file as a separate attachment. 50% of your lab grade will be based on your table and chart in this file, judged on the following criteria:

1. Did you manage to manipulate the table correctly using automated functions?
2. How effective is your chart at conveying the information? Does it have too much noise? Is it missing labels?

Part II: Topic Modeling

Write two paragraphs on the process of topic modeling. Answer the following questions:

1.
What observations or analysis could you draw out of topic modeling four "classic" American novels? Were there any surprising results?

2. What are possible historical questions you could begin to answer using topic modeling? What kind of sources might you use? What would be the advantages and disadvantages of using topic modeling rather than, say, traditional close reading or other tools such as Wordle and Voyant?