CAP2015Workshop - University of Washington

advertisement
slides
CAP 2015 Text as Data Workshop
John Wilkerson, Andreu Casas
Department of Political Science
University of Washington
jwilker@uw.edu
acasas2@uw.edu
Transparency initiatives have dramatically increased available digitized government data. Much of this data
takes the form of text. What information about the activities of government can we get from text and how do
we get it? What are the different analytic options? How do we assess whether an analytic method is doing a
good job of capturing a document’s meaning?
When it comes to interpreting text, humans are more perceptive but probably less reliable than
computer algorithms (so far). The main advantage of computational methods is scale. They can assist the
process of discovering patterns, formulating hypotheses, constructing measures for hypothesis testing where
human based approaches would be too time-consuming. They are never a substitute for careful theorizing and
users should be attentive to questions of validity and reliability as they would be for any method.
Python is the most popular language for large scale text processing. R offers more analytic and
presentation options. People often build datasets (term document matrices) in Python and then analyze them
in R. Below are some example scripts to illustrate both programming languages. If you are in it for the long
term, learning Python and R is still the way to go compared to commercial products because you’ll eventually
discover there’s something you want that is not offered.
Glossary of Terms at the bottom.
Troubleshooting
Problem solving is an inherent part of programming. Expect the code that you have adapted to your task to fail
the first time you try it! The first thing to check for is typos – code is not forgiving. Also:





Test code in small chunks to more easily identify where it is breaking
Test code on a simple dataset first (faster)
Your problem is not unique. Copy and paste the error message to Google. Consult common error
messages or (searchable) Python Tutorial
Walk away for a while. Often the problem will occur to you
Even if your program runs it may not be producing the desired output. Be sure to check that the output
is what you expected
Resources

Grimmer, Justin and Brandon Stewart. 2013. “Text as Data: The Promises and Pitfalls of Automated Content
Analysis.” Political Analysis, 1-31.

Bird, Klein, Loper, Natural Language Processing (NLTK) with Python
o There is a free downloadable version of this book, but it can also be purchased
Lutz, Learning Python (a very readable, big, introduction to the Python programming language)


Aaron Erlich drafted some notes about similarities and differences between R and Python
o “Python for R Users” “Python for R – Strings” “Python for R – Dicts and Tuples”
1
o
Another Python-R perspective
R
Some basic examples of R code for text analysis. We’ll do the first one in the workshop
 R: tm example
Data (you want to put this in a folder on your machine)



R: TextTools
R: Capturing Twitter Feeds
R LDA Topic Model
Getting Started has a nice example application that uses provided data
Example output: Hottest/Warmest Year (5mb zip file)
Data (you want to put this in a folder on your machine)
Python Resources
 The Khan Academy Introduction to Python videos are a very helpful introduction
 Pythex https://pythex.org/ (good way to test regular expressions)
Python
These scripts work through the process of getting and processing text, as well as more common analytic
methods. Most of the scripts are Python. Apologies in advance to people who can write cleaner code!

Install Anaconda. Test by opening this Notebook in I-Python. This usually goes smoothly but not always on Macs.
I am not much help unfortunately….
1. Getting Text (IP* refers to I-Python module)
IP1: Python Basics
IP2: Importing and Preparing Text Data: FOMC2 datafile
IP3: Scraping URLs and APIs
IP4: Scraping PDFs
2. Organizing and Analyzing Text
IP5: Preprocessing and Summarizing
IP6: Tokenizing
IP7: Building and Exporting Corpora
IP8: Topic Models (LDA) May need to install the gensim package (the introduction is worth reading)
Supervised Machine Learning: RTextTools Getting Started document includes script to run
IP9: Natural Language Processing (introduction)
IP10: Text Reuse (Smith Waterman local alignment algorithm)
WCopyFind (user friendly plagiarism detection software)
Glossary of Terms
Precision – proportion of predicted cases of X’s that are true Xs. (errors are false positives)
Recall – proportion of true Xs that are predicted Xs (errors are false negatives)
F-Score – the harmonic mean of precision and recall
Validity – a measure is valid if on average it accurately captures the concept to be measured
Reliability –a measure is reliable to the extent that it produces the same result each time
Bias – Reliability is not validity; a measure can reliably invalid
Confusion matrix – crosstabulation of actual versus predicted results. Used to examine prediction success
(precision, recall) overall and within specific categories.
2
Annotate = Classify = Code = Label (verb) Annotation = Class = Code = Label (noun)
Token – any element of a document (e.g. a word; space; semicolon).
Tokenization (aka text segmentation) - the process of breaking up a stream of text characters into meaningful
elements (e.g. presence of a space is used to designate ’ the bird’ as two tokens,’ the’ and ‘bird’
Feature – a token that the researcher judges to be relevant to the text task.
Parsing – Generally, the process of systematically disassembling a text into meaningful components (such as
sentences or words). In NLP, a formal methodology for labeling specific words in a sentence according
to linguistic rules (see Stanford Parser).
Normalization – eliminating differences in punctuation, such as removing capitalization
Stemming - process for reducing words to their stem, base or root form (e.g. fishing–fish)
Stopword – common words that are not considered to be valuable features of a text and are therefore
excludes (e.g. the)
Concordance, Collocation or Cooccurence – incorporating the context in which a word is used into its
meaning, for example by examining word sequences (n-grams) instead of just words in isolation.
Regular expression – a concise but often flexible pattern intended to recognize strings of text (such as any
date or any url)
Disambiguation – process of linking references to a single entity or topic. For example, in blogs, references to
President Obama might take different forms ‘Barack,’ ‘ Obama,’ ‘The One,’ ‘the President’ etc.
Alternately, reconciling different spellings of the same word.
Named entities - elements of text that are to be classified into predefined categories (e.g. person names,
organizations, locations, subjects, percentages, etc).
Semantic – Broadly, text that has meaning (e.g. a word rather than a hyperlink). Usually used in reference to
natural language processing approaches that are concerned with linguistic structure
Sentiment – refers to polarity in classification (e.g. from hate to love, liberal to conservative, etc). Not
necessarily single-dimensional.
Algorithm – a mathematical set of instructions about how to convert a set of inputs to an output. In
automated content analysis, researchers select from a wide variety of off the shelf algorithms suited
to different tasks (one of the most popular is SVM).
Machine learning – Generally, the ability of a computer program to get better at a task with more
information. The clearest examples are supervised machine learning algorithms. A set of hand labeled
examples (e.g.) is used to train an algorithm (how does the text of the examples in one category differ
from the text of the examples in other categories?). The algorithm then predicts the categories of
unseen events based on their text.
Bag of Words (BoW) –text analysis approaches that consider words as features in isolation (as opposed to NLP
or concordance approaches that value relationships among words)
Natural Language Processing (NLP) –text analysis approaches that value grammatical (linguistic) information
such as word order or sentence structure (subject-verb-object).
3
Download