Assignment - Cameron Blevins

DHTK 2012 Lab #4
The Digital Historian’s Toolkit
Lab #4: Distant Reading
Earlier this week we discussed Mark Twain’s Roughing It (1872) through a process of close
reading: carefully looking at specific passages and chapters. Today we will instead apply a
process of “distant reading” through a technique called topic modeling. First we will topic model
the entirety of Twain’s Roughing It. Then we will use topic modeling to examine four other
“classic” nineteenth-century novels published during Twain’s era: Nathaniel Hawthorne’s The
Scarlet Letter (1850), Harriet Beecher Stowe’s Uncle Tom’s Cabin (1852), Louisa May Alcott’s
Little Women (1868) and Twain’s own The Adventures of Huckleberry Finn (1884).
I. Topic Modeling One Novel
We will use a program called MALLET to topic model. MALLET itself runs from the command
line, but for ease of use we will instead use a Graphical User Interface (GUI) that repackages
MALLET in a much more user-friendly interface. Think of the GUI as the body of a car and
MALLET as its engine. The engine is what allows the car to function, but the body of the car
makes it (more) usable. Download the GUI (TopicModelingTool.jar) at
http://code.google.com/p/topic-modeling-tool/ and save it into your personal lab folder.
Copy the lab’s data from the class folder into your personal lab folder on the server. Look
through the data files. There are two main directories, Classics_FivePercent and
RoughingIt_TwoPercent. Open up RoughingIt_TwoPercent. The folder contains fifty text files,
each representing 2% of the entire text of the book. This is because topic modeling tends to work
better with smaller chunks of text that deal with discrete kinds of content. Chapters might work
best, but since we don’t have Roughing It separated, I’ve used an approximation of two-percent
“chunks” of the novel.
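The chunking itself is easy to reproduce. Here is a minimal sketch in Python (the function name and the fifty-way split are illustrative assumptions matching the lab data, not the script actually used to prepare it):

```python
def chunk_text(text, n_chunks=50):
    """Split a text into n_chunks roughly equal pieces, breaking on word
    boundaries so each chunk is about 1/n of the whole (2% when n=50)."""
    words = text.split()
    size = len(words) // n_chunks
    chunks = []
    for i in range(n_chunks):
        start = i * size
        # The final chunk absorbs any leftover words.
        end = start + size if i < n_chunks - 1 else len(words)
        chunks.append(" ".join(words[start:end]))
    return chunks

# Usage: write each chunk to its own numbered file, e.g.
# for i, chunk in enumerate(chunk_text(open("RoughingIt.txt").read()), start=1):
#     with open(f"Twain_RoughingIt_1872_{i}.txt", "w") as f:
#         f.write(chunk)
```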
Open the TopicModelingTool.jar file that you downloaded. The first dialog is the Input folder
which tells the program where to look for the data it’s going to analyze. In this case, select
RoughingIt_TwoPercent. The second dialog is the Output folder – where you want the program
to dump all of the results that it generates. This folder has not been created, so first create a new
folder in your lab folder and name it RoughingIt_Output, then select this folder within the topic
modeling tool for your output folder.
There are several other options you can change. First, you can specify the number of topics you
want the program to generate. Usually this is a balance between too few topics (making topics
too broad) and too many topics (making them unwieldy to analyze). For this lab, use 30 topics.
Under the Advanced button, increase the iterations to 1,000 – this tells the program to loop
through its text files 1,000 times to generate its topics. You might tweak this if you were
analyzing a huge number of files so that it wouldn’t take as long to run. With only one novel, we
don’t need to worry about this as much. Finally, change the Topic Proportion Threshold to 0 –
this allows us to see the content of every topic in every single text chunk (even ones with a
minuscule or negligible score for that topic). Click Learn Topics and watch the program run.
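To see what the Topic Proportion Threshold does, consider this sketch (the topic numbers and scores are invented values for illustration): only topics whose share of a chunk meets the threshold appear in the output, so a threshold of 0 lists every topic for every chunk.

```python
# Hypothetical topic scores for one 2% chunk of the novel (invented values).
chunk_scores = {11: 0.21, 29: 0.02, 3: 0.005, 7: 0.765}

def visible_topics(scores, threshold):
    """Keep only topics whose proportion in the chunk meets the threshold."""
    return {topic: p for topic, p in scores.items() if p >= threshold}

print(visible_topics(chunk_scores, 0.05))  # only the dominant topics survive
print(visible_topics(chunk_scores, 0.0))   # threshold 0: every topic is listed
```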
After it has finished running, navigate to your output directory that you created. Notice that the
program has created two folders, output_html and output_csv. Under the output_html folder,
open all_topics.html. Explore this file and try to figure out what kind of information it is telling
you about the topics and the documents.
Start to analyze the topics themselves. Which topics are expected or unsurprising? Do any of
them surprise you? Select two topics that you find interesting to share with the class and come
up with a label for them (ex. MINING, or VIOLENCE). Remember: because the program uses a
sampling process, everyone’s list of topics will be slightly different (and if you run it again, your
own list will change).
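Why do results differ from run to run? The sampler starts from random draws, so two runs begin from different states. A toy illustration in Python (this is not MALLET's algorithm, just a demonstration that random starting assignments vary with the seed):

```python
import random

def random_topic_assignment(words, num_topics, seed):
    """Give each word a random starting topic, as a sampler's first pass might."""
    rng = random.Random(seed)
    return {w: rng.randrange(num_topics) for w in words}

words = ["silver", "mine", "claim", "stagecoach"]
print(random_topic_assignment(words, 30, seed=1))
print(random_topic_assignment(words, 30, seed=2))
# Different seeds generally yield different assignments, which is why every
# run of the topic modeler produces a slightly different list of topics.
```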
II. Topic Modeling Multiple Novels
Now let’s apply this same process to four “classic” American novels from the mid-nineteenth
century. Instead of fifty text files consisting of 2% chunks of one novel, you will use eighty text
files consisting of 5% chunks of each novel stored in the Classics_FivePercent folder.
This time, however, you are going to groom the text files a bit more. Specifically, we are going
to tell the computer to ignore certain words, known as “stop words.” If you remember from the
first text analysis lab, stop words consist of words you don’t want to analyze, most often
common words such as: and, the, of, etc. However, sometimes we want to include additional
words for the computer to ignore. If you click under Advanced in the Topic Modeling Tool, you
will notice that Remove stopwords is checked. This removes a built-in list of common English
words, including pronouns. But we
are also going to specify a list of additional words consisting of proper names and some other
miscellaneous words that create problematic “noise” in the data. This list is found in a text file
named expandedStopWords.txt.
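The effect of a stop-word list is easy to sketch (the words below are invented examples standing in for expandedStopWords.txt, which mixes common words with proper names):

```python
# A few ordinary stop words plus project-specific additions such as character
# names -- invented examples, not the real contents of expandedStopWords.txt.
stopwords = {"and", "the", "of", "a", "huck", "tom", "jo"}

def remove_stopwords(tokens, stopwords):
    """Drop any token that appears in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the raft and Huck drifted down the river".split()
print(remove_stopwords(tokens, stopwords))
# -> ['raft', 'drifted', 'down', 'river']
```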
You will re-run the topic modeler on the Classics_FivePercent folder using the same settings as
above. Before you do so, however, create a new output folder.
After running the topic model, take a look at your newly generated output data for the four
novels. Explore the different topics and their prevalence in different documents. What patterns
do you recognize? Which topics tend to be dominated by a single novel? Which topics are spread
more evenly across different novels? We will then discuss your findings as a group and compare
what everyone found.
III. Processing Your Results
Return to the results from Roughing It. You will now track how two topics behave over the
course of the novel by charting their relative “score” within each of the fifty chunks of text. If
each chunk of text is thought of as a pie, the “score” of a topic corresponds to the percentage of
the pie that the topic is assigned to. The topic modeler assumes that the entirety of the text can be
grouped into however many topics you select – if you specified 20 topics and each topic was
distributed exactly evenly across the text, the “score” of each topic would be 5% for each
document.
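Put differently, the scores for a single chunk behave like slices of one pie: they are proportions that sum to 1. A small sketch with invented values:

```python
# Hypothetical scores for one chunk under a 20-topic model (invented values).
even = [0.05] * 20  # a perfectly even distribution: every topic gets 5%
assert abs(sum(even) - 1.0) < 1e-9  # the whole pie is accounted for

# A more realistic chunk concentrates its weight in a few topics,
# but the proportions still sum to 1.
uneven = [0.40, 0.25, 0.15] + [0.20 / 17] * 17
assert abs(sum(uneven) - 1.0) < 1e-9
```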
While you were working on Part II, I saved a new file in your personal
folder: TopicsInDocs.csv_reformatted.txt. Although it is a text file, its data is organized in a
tabular format so that you can open it in Excel. Open with Excel and re-save it as an Excel
Worksheet rather than a text file, named Twain_RoughingIt_Output.xlsx. Before you begin,
copy and paste your data into a new sheet so that you have a “sandbox” you can play around in.
Your assignment is to select two interesting topics that you found in Part I and create both a
table and chart of their trends over the entire novel. To do this you’re going to have to sort the
table so that each chunk of text is sequential in the order of appearance in the novel. We are
going to do so using the metadata contained within each filename (ex.
Twain_RoughingIt_1872_12.txt): author, title, publication date, and the sequence of the text
segment. With this information, we can process the data in various ways. If, for instance, there
were multiple novels by Twain we could aggregate them together (because each filename
contains “Twain”), or if we had a list of 100 novels we could order them by publication date or
combine Roughing It with all the other novels published in 1872.
You will create a new column that contains only the last number in the filename, representing
the text segment’s sequence (1, 2, 3…50) so that you can then sort your table. You could do this
manually for every single row, but what if there weren’t 50 chunks of text, but instead 5,000?
Rather than typing in “1,” “2,” etc., try to figure out how to create this new column using the
Text to Columns option (under Data) and the LEFT, RIGHT, and/or LEN functions. If you
are unfamiliar with Excel I would recommend searching for help on these functions online. Let
me know if you’re having trouble.
Ultimately, you should end up with a new table pasted into a new worksheet that looks like the
following example. Hint: research the Paste Special option if you’re having trouble copying and
pasting cells with formulas in them.
Text Segment    MINING (Topic 11)    VIOLENCE (Topic 29)
1               0.21                 0.02
2               0.04                 0.07
…               …                    …
Once you have done so, use the table to construct a chart, with the X-axis as chunks of text and
the Y-axis as topic scores, and insert it into a new sheet. Format the chart so that it is clean,
clearly labeled, and directly conveys the relevant information. Export the chart as a JPG and
save your Excel file; you will attach both to the email you send me when you submit your lab
report.
LAB REPORT
Part I: Roughing It
In one paragraph, describe the pattern of two topics across Roughing It. Include why they are
interesting or significant and possible explanations for why they behave the way they do. After
the paragraph, insert the JPG of your chart along with an explanatory caption.
Include the Excel file as a separate attachment. 50% of your lab grade will be based on your
table and chart in this file, judged on the following criteria:
1. Did you manage to manipulate the table correctly using automated functions?
2. How effective is your chart at conveying the information? Does it have too much noise, is
it missing labels, etc.?
Part II: Topic Modeling
Write two paragraphs on the process of topic modeling. Answer the following questions:
1. What observations or analysis could you draw out of topic modeling four “classic”
American novels? Were there any surprising results?
2. What are possible historical questions you could begin to answer using topic modeling?
What kind of sources might you use? What would be the advantages and disadvantages
of using topic modeling rather than, say, traditional close reading or other tools such as
Wordle and Voyant?