Tools for Unstructured Text
NICAR 2012
Loretta Auvil, UIUC
Chase Davis, CIR
February 2012
University of Illinois
Hottest Analytics
Overview
Sites that List Tools
• DiRT, Digital Research Tools
• http://digitalresearchtools.pbworks.com
• KDnuggets
• http://www.kdnuggets.com/software/text.html
• text-processing.com
• Discussion Groups
• Text Analytics on LinkedIn
• http://www.linkedin.com/groups/Text-Analytics-115439
• Visual Analytics on LinkedIn
• http://www.linkedin.com/groups/Visual-Analytics-80552
Bad News
• There is no open and freely available tool that
is going to solve all your problems!!!
Good News
• There is a variety of tools that can be beneficial
and must be used in combination to
accomplish the goal!!!
Natural Language Processing (NLP)
• Tokenization
• Part of speech tagging
• Stemming
• Stop word removal
• Other transformations
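To make the preprocessing steps above concrete, here is a minimal sketch in plain Python covering tokenization, stop word removal, and stemming. The stop word list and the suffix-stripping rule are toy assumptions made up for illustration; they are not what NLTK or any other tool listed here actually uses, and part of speech tagging is left out because it needs a trained model.

```python
import re

# Tiny illustrative stop list (an assumption); real tools ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to",
              "is", "are", "was", "were"}

# Toy suffix list (an assumption); this is NOT the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The reporters were filing stories on the budgets"))
```

The crude stems ("fil", "stori") show why a real stemmer or lemmatizer from one of the NLP tools is usually the better choice.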
NLP Tools
• NLTK
• http://www.nltk.org/
• http://text-processing.com
• OpenNLP
• http://incubator.apache.org/opennlp/
• Stanford CoreNLP
• http://nlp.stanford.edu/software/corenlp.shtml
• Mallet
• http://mallet.cs.umass.edu/
• GATE
• http://gate.ac.uk/
• LingPipe
• http://alias-i.com/lingpipe/
Entity Extraction
• Finding entities, like People, Locations, Time, etc.
• Some tools let you add your own entities (with seed terms)
• Tools
• OpenNLP
• Stanford CoreNLP
• OpenCalais
• GATE
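As a rough illustration of the seed-term idea, the sketch below matches a hand-built list of names against text. The seed terms and the `find_entities` helper are hypothetical and invented for this example; the tools above rely on trained statistical models rather than plain lookup, so treat this only as the simplest possible starting point.

```python
import re

# Hypothetical seed terms; replace with names relevant to your own reporting.
SEED_TERMS = {
    "PERSON": ["Jane Doe", "John Smith"],
    "LOCATION": ["Springfield", "Cook County"],
}


def find_entities(text, seeds=SEED_TERMS):
    """Return (start_offset, label, matched_text) tuples sorted by position."""
    hits = []
    for label, terms in seeds.items():
        for term in terms:
            # Word boundaries prevent matches inside longer words.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text):
                hits.append((match.start(), label, match.group()))
    return sorted(hits)


for start, label, name in find_entities("Jane Doe met John Smith in Springfield."):
    print(start, label, name)
```

The offsets make it possible to link each entity back to its place in the document, which is what a timeline or map view would need.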
Journalism Application
• Structuring unstructured data
• Social networks of entities
• Clustering
• Plotting data on time line
• Plotting locations on a map
Information Extraction
• Automatically identifies and extracts binary
relationships from English sentences
• TextRunner
• NACTEM, MEDIE
• http://www.nactem.ac.uk/medie/
• ReVerb
• http://reverb.cs.washington.edu/
Information Extraction:
ReVerb
Question and Answer
• Parses the question to determine what type of information needs to be returned
• Leverages approaches like information extraction to retrieve the results
Journalism Application
• Engagement! Help users find facts relevant to
their situations.
Document Classification
• Starts with a training set
• Predicts which class a document belongs to
• Leverages pure data mining approaches like Naïve Bayes, Decision Trees, and Neural Networks
• Tools
• NLTK
• Mallet
• Weka
• Rapid Miner
• GATE
• Meandre
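For a feel of what "starts with a training set, predicts a class" means in practice, here is a minimal multinomial Naïve Bayes sketch written from scratch with add-one smoothing. The function names and the four-document training set are made up for illustration; the tools listed above provide tested implementations that should be preferred for real work.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: iterable of (list_of_tokens, class_label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training set for illustration only.
training = [
    ("budget tax spending".split(), "finance"),
    ("bank loan interest".split(), "finance"),
    ("river pollution wildlife".split(), "environment"),
    ("forest wildlife habitat".split(), "environment"),
]
model = train(training)
print(predict(model, "bank tax".split()))
```

With only four training documents the predictions are fragile; the point is the shape of the workflow, not the accuracy.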
Journalism Application
• IBM's ManyBills project
• Identifies the topic of each section in a Congressional bill for
the purposes of identifying outliers.
• For example, if a Congressman proposes a bill about the
environment, but it has a section deep down about banking
regulation, ManyBills would identify that as an outlier and
highlight it.
Document Similarity/Clustering
• TF-IDF (Term frequency * inverse document frequency)
• Overview project (AP)
• Tools
• GATE
• Rapid Miner
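The TF-IDF weighting named above can be sketched in a few lines: each term is weighted by its frequency in the document times the log of how rare it is across the collection, and documents are then compared with cosine similarity. The exact formula below (length-normalized term frequency, unsmoothed log IDF) is one common variant chosen as an assumption; GATE, Rapid Miner, and the Overview project may weight terms differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example documents.
docs = [
    "tax bill raises income tax".split(),
    "tax bill raises sales tax".split(),
    "river cleanup protects wildlife".split(),
]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1), cosine(v0, v2))
```

Pairs with a high score are candidates for near-duplicate text; a clustering step would group documents using the same similarity measure.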
Journalism Application
• Identifying copycat legislation from year to year
• Clustering documents to find trends
Topic Modeling
• Exploratory approach to find patterns by
finding words that frequently occur together
• Document can have multiple topics
• Words can exist in multiple topics
• Tools
• Mallet uses LDA (latent Dirichlet allocation)
• Other implementations as well…
Topic Exploration
• Topical Guide:
• http://tg.byu.edu/
• Tmve (Topic Model Visualization Engine)
• http://code.google.com/p/tmve/
Topical Guide
Tmve
Journalism Application
• Reporting tool for making sense of corpus
• Isolating topics allows the user to focus only on the documents
in a corpus that are relevant.
• There exists a clear potential for more data visualization.
Automatic Summarization
• Identifies sentences from among the documents
• Identifies common information conveyed across all the
documents and then reformulates new sentences expressing
that information
• Aims to combine the main themes with completeness,
readability, and conciseness
• Lots of algorithms, but not really software tools to download
to run on your collection
• Meandre implements a HITS algorithm that identifies sentences but does not reformulate them
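One way a HITS-style algorithm can pick out sentences is to treat sentences as hubs and words as authorities: a sentence scores highly if it contains important words, and a word is important if it appears in high-scoring sentences. The sketch below is an assumption about the general technique, not Meandre's actual implementation; all function names are made up, and stop words are deliberately left in to keep the example short.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def hits_rank(sentences, iterations=20):
    """Return one hub score per sentence using sentence-word HITS."""
    word_sets = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    vocab = set().union(*word_sets)
    hub = [1.0] * len(sentences)
    for _ in range(iterations):
        # Authority update: a word collects the hub scores of its sentences.
        auth = {word: 0.0 for word in vocab}
        for i, words in enumerate(word_sets):
            for word in words:
                auth[word] += hub[i]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {word: v / norm for word, v in auth.items()}
        # Hub update: a sentence collects the authority of its words.
        hub = [sum(auth[word] for word in words) for words in word_sets]
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return hub


def summarize(text, k=1):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = hits_rank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = ("The council approved the budget. "
        "The budget funds roads and schools. A cat slept.")
print(summarize(text, k=1))
```

Like the approach described above, this only selects existing sentences; it does not rewrite or merge them.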
Journalism Application
• Newsblaster
• Summarizing all the news on the web
• Every night, the system crawls a series of Web sites, downloads
articles, groups them together into "clusters" about the same
topic, and summarizes each cluster.
• http://newsblaster.cs.columbia.edu/
• Ultimate Research Assistant
• http://ultimate-research-assistant.com
Sentiment Analysis
• NLTK
• http://www.text-processing.com/demo/sentiment
• APIs
• AlchemyAPI
• Open Dover
• Lexalytics
• Saplo
• Meandre (concept tracking)
• Sentiment Analysis Symposium
• http://sentimentsymposium.com/agenda.html
• May 8, 2012 in New York
Journalism Application
• Tracking Twitter sentiment about political
candidates
• Comparing the tone of political statements over
time or between candidates
Analysis Frameworks
• Meandre
• http://seasr.org/meandre
• DocumentCloud
• www.documentcloud.org/
• Rapid Miner
• http://rapid-i.com/
• Weka
• http://www.cs.waikato.ac.nz/ml/weka/index.html
Meandre Workbench
• Web-based UI
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag and drop interface
• Change property values
• Execute the flow
(Screenshot callouts: Components, Flows, Locations)
Meandre Services from Firefox Plugin
• Readability Analysis
• Date Entity to Simile Timeline
• Network Analysis
• Tag Cloud Analysis
• Location Entity to Google Map
• Automatic Summarization
Example: Zotero, SEASR, Protovis, Google Maps, Simile
Topic Modeling
Uses Mallet Topic Modeling to cluster nouns from almost 4000 documents from the 19th century, with 10 segments per document
Example below clusters the Bible and shows 8 topics with at most 200 keywords per topic
Concept Mapping
Sentiment Analysis
six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)
Correlation Analysis
• Corrected OCR errors with spellchecking