Intro to Text Mining and Analytics

CSC 594 Topics in AI –
Text Mining and Analytics
Fall 2015/16
1. Introduction
Unstructured Data
• “80% of business-relevant information originates in unstructured
form, primarily text.” (a quote in 2008)
• “Based on the industry’s current estimations, unstructured data will
occupy 90% of the data by volume in the entire digital space over
the next decade.” (a quote in 2010)
• “The possibilities for data mining from large text collections are
virtually untapped. Text expresses a vast, rich range of information,
but encodes this information in a form that is difficult to decipher
automatically. For example, it is much more difficult to graphically
display textual content than quantitative data.” (Marti Hearst, UC Berkeley, 2007)
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data
Text Mining and Analytics
• We use the terms text analytics, text data mining,
and text mining almost synonymously in this course.
• Text analytics uses algorithms for turning free-form text
(unstructured data) into data that can be analyzed (structured data)
by applying statistical and machine learning methods, as well as
Natural Language Processing (NLP) techniques.
• Once structured data is obtained, the same mining and analytic
techniques can be applied.
• So the most significant part of Text Mining/Analytics is how to
convert text into structured data.
Converting Text into Structured Data
• A huge amount of preprocessing is required to convert text into
structured data (a small sketch follows this list).
– Cleaning up ‘dirty’ texts
• Removing mark-up tags from web documents, encoded symbols such as
emoticons/emojis, and extraneous strings such as “AHHHHHHHHHHHHHHHHHHHHH”
• Correcting misspelled words.
– Tokenization
• Removing punctuation, normalizing upper/lower case, etc.
– Sentence splitting
– Identifying multi-word expressions (e.g. “as well as”, “radio wave”) and Named
Entities (e.g. “Allied Waste”, “Super Mario Bros.”)
• Adding other linguistic information
– Parts-of-speech (e.g. noun, verb, adjective, adverb, preposition)
• Filtering non-significant/irrelevant words – to reduce dimensions
– Filtering non-content words using a stop-list (e.g. “the”, “a”, “an”, “and”)
– Combining tokens by stemming/lemmatizing or using synonyms
• Other NLP features/techniques, e.g. n-grams, syntax trees
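Below is a minimal Python sketch of a few of the clean-up and tokenization steps above; the regular expressions, the tiny stop-list, and the example string are simplified, illustrative assumptions rather than a complete preprocessing pipeline.

# Minimal preprocessing sketch (pure Python; regexes and stop-list are illustrative only).
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of"}   # tiny illustrative stop-list

def preprocess(raw_text):
    text = re.sub(r"<[^>]+>", " ", raw_text)               # strip mark-up tags
    text = re.sub(r"(.)\1{3,}", r"\1", text)                # collapse runs like "AHHHH..."
    tokens = re.findall(r"[a-z]+", text.lower())            # tokenize, lower-case, drop punctuation
    return [t for t in tokens if t not in STOP_WORDS]       # filter non-content words

print(preprocess("<p>The radio wave is AMAZZZZING and unsafe!</p>"))
# -> ['radio', 'wave', 'amazing', 'unsafe']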
Text Mining Process Pipeline
• The process is essentially a linear pipeline.
• Feedback from the results of Text Mining might affect earlier
preprocessing steps (back to parsing, or even to data collection).
Text Mining Paradigm
Data Mining – Two Broad Areas
– Pattern Discovery/Exploratory Analysis (Unsupervised
Learning)
• There is no target variable, and some form of analysis is
performed to do the following:
– identify or define homogeneous groups, clusters, or segments
– find links or associations between entities,
as in market basket analysis
– Prediction (Supervised Learning)
• A target variable is used, and some form of predictive or
classification model is developed.
• Input variables are associated with values of a target variable, and the
model produces a predicted target value for a given set of inputs
(the two settings are contrasted in the short sketch below).
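The two settings can be contrasted with a toy scikit-learn sketch; the data, the cluster count, and the model choices here are arbitrary assumptions for illustration.

# Unsupervised vs. supervised on toy numeric data (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # input variables only
y = [0, 0, 1, 1]                        # target variable (used only in the supervised case)

# Pattern discovery: no target variable; k-means finds homogeneous groups.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Prediction: a target variable is used to fit a classification model.
model = DecisionTreeClassifier().fit(X, y)

print(clusters)                  # cluster ids (labels are arbitrary)
print(model.predict([[2, 1]]))   # predicted target value for a new set of inputs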
Text Mining Applications – Unsupervised
– Information retrieval (IR)
• finding documents with relevant content of interest
• used for researching medical, scientific, legal, and news
documents such as books and journal articles
– Document categorization for organizing
• clustering documents into naturally occurring groups
• extracting themes or concepts
– Anomaly detection
• identifying unusual documents that might be associated with
cases requiring special handling such as unhappy customers,
fraud activity, and so on
Text Mining Applications – Unsupervised (cont.)
• Text clustering (a small sketch follows the table below)
• Trend analysis
Trend for the Term “text mining” from Google Trends
Cluster No. | Comments | Key Words
1           | 1, 3, 4  | doctor, staff, friendly, helpful
2           | 5, 6, 8  | treatment, results, time, schedule
3           | 2, 7     | service, clinic, fast
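Below is a hedged sketch of text clustering with TF-IDF and k-means (scikit-learn); the comments are invented stand-ins, not the actual survey data summarized in the table above.

# Toy text clustering with TF-IDF + k-means (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "The doctor and staff were friendly and helpful",          # invented examples
    "Staff were helpful and the doctor was friendly",
    "Treatment results were good but the schedule took time",
    "Hard to schedule treatment; results took time",
    "Fast service at the clinic",
]

X = TfidfVectorizer(stop_words="english").fit_transform(comments)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels)   # comments sharing key words tend to get the same cluster id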
Text Mining Applications – Supervised
– Many typical predictive modeling or
classification applications can be
enhanced by incorporating textual data in
addition to traditional input variables.
• churn propensity models that include customer call center notes,
website forms, emails, and Twitter messages
• hospital admission prediction models
incorporating medical records notes as a
new source of information
• insurance fraud modeling using adjustor
notes
• sentiment categorization (next page)
• stylometry or forensic applications that
identify the author of a particular writing
sample
Sentiment Analysis
• The field of sentiment analysis deals with the categorization (or
classification) of opinions expressed in textual documents (a toy sketch
follows the figure description below).
Green color represents positive tone, red color represents negative tone, and
product features and model names are highlighted in blue and brown, respectively.
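Below is a toy sketch of sentiment categorization as supervised text classification; the labelled reviews and the bag-of-words/Naive Bayes choices are illustrative assumptions, not the method behind the figure.

# Toy sentiment classification with a bag-of-words model (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["great battery and screen", "terrible battery, poor support",
           "love this phone", "screen broke, awful quality"]   # invented examples
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(reviews), labels)
print(clf.predict(vec.transform(["awful battery"])))   # -> ['neg']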
Structured + Text Data in Predictive Models
• Both structured and textual data can be used together when building
predictive models.
ROC Chart of Models With and Without Textual Comments
Discussion
• What sort of pattern discovery or predictive modeling application do you
have in mind that can incorporate text data?
Typical Text Pre-processing Steps
• Given a raw text (in a corpus), we typically pre-process
the text by applying the following tasks in order:
1. Part-Of-Speech (POS) tagging – assign a POS to every word
in a sentence in the text
2. Named Entity Recognition (NER) – identify named entities
(proper nouns and some common nouns which are relevant in
the domain of the text)
3. Shallow Parsing – identify the phrases (mostly verb phrases)
which involve named entities
4. Information Extraction (IE) – identify relations between
phrases, and extract the relevant/significant “information”
described in the text
Source: Andrew McCallum, UMass Amherst
1. Part-Of-Speech (POS) Tagging
• POS tagging is the process of assigning a POS or lexical
class marker to each word in a sentence (and to all
sentences in a corpus).
Input:  the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
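Below is a hedged sketch of POS tagging with NLTK, one common toolkit; note that it uses Penn Treebank tags rather than the Det/N/V/Adj labels above, and the resource names required by nltk.download may vary across NLTK versions.

# POS tagging with NLTK.
import nltk
nltk.download("punkt", quiet=True)                        # tokenizer model (name may vary by version)
nltk.download("averaged_perceptron_tagger", quiet=True)   # default POS tagger model

tokens = nltk.word_tokenize("the lead paint is unsafe")
print(nltk.pos_tag(tokens))
# e.g. [('the', 'DT'), ('lead', 'NN'), ('paint', 'NN'), ('is', 'VBZ'), ('unsafe', 'JJ')]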
2. Named Entity Recognition (NER)
• NER processes a text and identifies the named entities in each
sentence
– e.g. “U.N. official Ekeus heads for Baghdad.”
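Below is a hedged NER sketch using spaCy, assuming the small English model is installed (e.g. via python -m spacy download en_core_web_sm); the exact entity labels depend on the model.

# Named Entity Recognition with spaCy (requires the en_core_web_sm model).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("U.N. official Ekeus heads for Baghdad.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. U.N. ORG / Ekeus PERSON / Baghdad GPE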
3. Shallow Parsing
• Shallow (or Partial) parsing identifies the (base) syntactic phrases in
a sentence.
[NP He] [V saw] [NP the big dog]
• After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
[PER Bill Gates] founded [ORG Microsoft].
Dependency Relations:
nsubj(founded, Bill Gates)
dobj(founded, Microsoft)
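Below is a hedged sketch of both steps with spaCy: noun chunks as a form of shallow parsing, then the dependency relations; spaCy's relation names closely match the nsubj/dobj labels above.

# Shallow (chunk) parsing and dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bill Gates founded Microsoft.")

print([chunk.text for chunk in doc.noun_chunks])   # e.g. ['Bill Gates', 'Microsoft']

for token in doc:
    print(f"{token.dep_}({token.head.text}, {token.text})")
# e.g. nsubj(founded, Gates), dobj(founded, Microsoft), ...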
4. Information Extraction (IE)
• Identify specific pieces of information (data) in an
unstructured or semi-structured text.
• Transform unstructured information in a corpus of texts
or web pages into a structured database (or templates).
• Applied to various types of text, e.g.
– Newspaper articles
– Scientific articles
– Web pages
– etc.
Source: J. Choi, CSE842, MSU
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a
local concern and a Japanese trading house to produce golf clubs to be supplied
to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production in January 1990 with production of 20,000
iron and “metal wood” clubs a month.
template filling:

TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
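Below is a toy information-extraction sketch: a single hand-written regex that fills two slots (company and amount) from the example text. The pattern and field names are illustrative assumptions; real IE systems combine NER, parsing, and learned extraction rules.

# Toy template filling with a hand-written pattern (illustrative only).
import re

text = ("The joint venture, Bridgestone Sports Taiwan Co., capitalized at "
        "20 million new Taiwan dollars, will start production in January 1990.")

m = re.search(r"(?P<company>[A-Z][\w.]*(?: [A-Z][\w.]*)+), capitalized at "
              r"(?P<amount>[\d,]+ million) new Taiwan dollars", text)
if m:
    print({"Joint Venture Company": m.group("company"),
           "Amount": "NT$" + m.group("amount")})
# -> {'Joint Venture Company': 'Bridgestone Sports Taiwan Co.', 'Amount': 'NT$20 million'}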