Tools for Unstructured Text
NICAR 2012
Loretta Auvil, UIUC
Chase Davis, CIR
February 2012
University of Illinois

Hottest Analytics

Overview

Sites that List Tools
• DiRT, Digital Research Tools
  • http://digitalresearchtools.pbworks.com
• KDnuggets
  • http://www.kdnuggets.com/software/text.html
• text-processing.com
• Discussion Groups
  • Text Analytics on LinkedIn
    • http://www.linkedin.com/groups/Text-Analytics-115439
  • Visual Analytics on LinkedIn
    • http://www.linkedin.com/groups/Visual-Analytics-80552

Bad News
• There is no open and freely available tool that is going to solve all your problems!!!

Good News
• There is a variety of tools that can be beneficial and must be used in combination to accomplish the goal!!!

Natural Language Processing (NLP)
• Tokenization
• Part of speech tagging
• Stemming
• Stop word removal
• Other transformations

NLP Tools
• NLTK
  • http://www.nltk.org/
  • http://text-processing.com
• OpenNLP
  • http://incubator.apache.org/opennlp/
• Stanford CoreNLP
  • http://nlp.stanford.edu/software/corenlp.shtml
• Mallet
  • http://mallet.cs.umass.edu/
• GATE
  • http://gate.ac.uk/
• LingPipe
  • http://alias-i.com/lingpipe/

Entity Extraction
• Finding entities, like People, Locations, Time, etc.
• Some have the ability to add your own entities (with seed terms)
• Tools
  • OpenNLP
  • Stanford CoreNLP
  • OpenCalais
  • GATE

Journalism Application
• Structuring unstructured data
• Social networks of entities
• Clustering
• Plotting data on a timeline
• Plotting locations on a map

Information Extraction
• Automatically identifies and extracts binary relationships from English sentences
• TextRunner
• NACTEM, MEDIE
  • http://www.nactem.ac.uk/medie/
• ReVerb
  • http://reverb.cs.washington.edu/

Information Extraction: ReVerb
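To make "binary relationships" on the Information Extraction slide concrete, here is a toy Python sketch that pulls (argument 1, relation, argument 2) tuples out of very simple sentences. It is not how TextRunner or ReVerb work; the relation phrase list and the `extract_triples` function are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, hand-picked relation phrases (an assumption for this sketch).
# Real open information extraction tools do not rely on a fixed list like this.
RELATION_PHRASES = ("founded", "acquired", "was born in", "is located in")


def extract_triples(sentence):
    """Return (arg1, relation, arg2) tuples found by naive pattern matching.

    Everything before the relation phrase becomes arg1 and everything after
    it becomes arg2, so this only works on short, active-voice sentences.
    """
    triples = []
    text = sentence.strip()
    for phrase in RELATION_PHRASES:
        pattern = re.compile(
            r"^(.+?)\s+" + re.escape(phrase) + r"\s+(.+?)[.!?]?$"
        )
        match = pattern.match(text)
        if match:
            triples.append((match.group(1), phrase, match.group(2)))
    return triples


print(extract_triples("Bill Gates founded Microsoft."))
# → [('Bill Gates', 'founded', 'Microsoft')]
```

Tuples in this shape are what make the journalism uses listed earlier possible: once relationships are structured, they can be loaded into a table, a network graph, or a timeline.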
Question and Answer
• Parses the question to determine what type of information needs to be returned
• Leverages approaches like information extraction to retrieve the results

Journalism Application
• Engagement! Help users find facts relevant to their situations.

Document Classification
• Starts with a training set
• Predicts which class a document belongs to
• Leverages pure data mining approaches like Naïve Bayes, Decision Trees, Neural Networks
• Tools
  • NLTK
  • Mallet
  • Weka
  • Rapid Miner
  • GATE
  • Meandre

Journalism Application
• IBM's ManyBills project
  • Identifies the topic of each section in a Congressional bill for the purposes of identifying outliers.
  • For example, if a Congressman proposes a bill about the environment, but it has a section deep down about banking regulation, ManyBills would identify that as an outlier and highlight it.

Document Similarity/Clustering
• TF-IDF (term frequency * inverse document frequency)
• Overview project (AP)
• Tools
  • GATE
  • Rapid Miner

Journalism Application
• Identifying copycat legislation from year to year
• Clustering documents to find trends

Topic Modeling
• Exploratory approach to find patterns by finding words that frequently occur together
• A document can have multiple topics
• Words can exist in multiple topics
• Tools
  • Mallet uses LDA (latent Dirichlet allocation)
  • Other implementations as well…

Topic Exploration
• Topical Guide
  • http://tg.byu.edu/
• Tmve (Topic Model Visualization Engine)
  • http://code.google.com/p/tmve/

Topical Guide

Tmve

Journalism Application
• Reporting tool for making sense of a corpus
• Isolating topics allows the user to focus only on the documents in a corpus that are relevant.
• There is clear potential for more data visualization.
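The TF-IDF weighting named on the Document Similarity/Clustering slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization, unsmoothed IDF, and cosine similarity as the comparison measure; the function names and sample documents are made up for the example and are not taken from GATE, Rapid Miner, or the Overview project.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample "bills" for illustration only.
docs = [
    "tax bill raises income tax rates",
    "bill raises income tax rates for banks",
    "river cleanup protects local wildlife",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # the two tax bills share weighted terms
print(cosine(vectors[0], vectors[2]))  # no shared terms between these two
```

Pairs of documents with a high score are candidates for closer reading, which is the idea behind spotting copycat legislation; a clustering step would group documents using the same pairwise scores.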
Automatic Summarization
• Identifies sentences from among the documents
• Identifies common information conveyed across all the documents and then reformulates new sentences expressing that information
• Aims to combine the main themes with completeness, readability, and conciseness
• Lots of algorithms, but not really software tools to download and run on your collection
• Meandre implements a HITS algorithm that identifies sentences but does not reformat them

Journalism Application
• Newsblaster
  • Summarizing all the news on the web
  • Every night, the system crawls a series of Web sites, downloads articles, groups them together into "clusters" about the same topic, and summarizes each cluster.
  • http://newsblaster.cs.columbia.edu/
• Ultimate Research Assistant
  • http://ultimate-research-assistant.com

Sentiment Analysis
• NLTK
  • http://www.text-processing.com/demo/sentiment
• APIs
  • AlchemyAPI
  • Open Dover
  • Lexalytics
  • Saplo
• Meandre (concept tracking)
• Sentiment Analysis Symposium
  • http://sentimentsymposium.com/agenda.html
  • May 8, 2012 in New York

Journalism Application
• Tracking Twitter sentiment about political candidates
• Comparing the tone of political statements over time or between candidates

Analysis Frameworks
• Meandre
  • http://seasr.org/meandre
• DocumentCloud
  • www.documentcloud.org/
• Rapid Miner
  • http://rapid-i.com/
• Weka
  • http://www.cs.waikato.ac.nz/ml/weka/index.html

Meandre Workbench
• Web-based UI
• Components and flows are retrieved from the server
• Additional locations of components and flows can be added to the server
• Create a flow using a graphical drag and drop interface
• Change property values
• Execute the flow
[Screenshot labels: Components, Flows, Locations]

Meandre Services from Firefox Plugin
• Readability Analysis
• Date Entity to Simile Timeline
• Network Analysis
• Tag Cloud Analysis
• Location Entity to Google Map
• Automatic Summarization
• Example: Zotero, SEASR, Protovis, Google Maps, Simile

Topic Modeling
• Uses Mallet topic modeling to cluster nouns from almost 4000 documents from the 19th century, with 10 segments per document
• Example (figure): clustering the Bible, showing 8 topics with at most 200 keywords per topic

Concept Mapping
• Sentiment analysis: six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

Correlation Analysis
• Corrected OCR errors with spellchecking