Unstructured Data & Text Mining

D. Silver
Unstructured Data
• Definition: Information that
either does not have a predefined data model or is not
organized in a predefined
manner
• Imprecise for several reasons:
– Structure of data may be easily
implied, but not explicit
– Data may be explicitly structured,
but not for the task at hand
– Data may have some underlying
structure that is not understood
80% of Data is Unstructured
• Much of it is text based:
– Business data:
• Call center transcripts
• Other CRM
– Email
– Open-ended survey responses
– Web pages
– NewsGroups
– Organizational documents
– Regulatory information
Copyright 2003-4, SPSS Inc.
Growth of Unstructured Data
Unstructured Information
Management Architecture (UIMA)
• Architecture for the development, discovery,
composition, and deployment of analytics on
unstructured data
• Provides a common framework for processing
unstructured data to extract meaning and
create structured data and information
• IBM’s Watson uses UIMA for real-time content
analytics
References:
• Text to attributes, p. 328-329
• Text mining, Section 9.5
• Web mining and beyond, Sections 9.6-9.8
• String conversion, p. 439
• http://en.wikipedia.org/wiki/Unstructured_data
• http://en.wikipedia.org/wiki/UIMA
• http://bigdataintegration.blogspot.ca/2012/02/unstructured-data-is-myth.html
Text Mining
• Text is:
– Unstructured, amorphous and challenging to parse
– Most common form of information exchange
– Motivation to extract information is compelling
• Text Mining differs from Data Mining
– Most authors strive to clearly inform the reader
– But humans do not have time to read/interpret
everything
– TM focuses on extracting information ready for rapid
machine or human consumption
Text Mining
• Two broad approaches:
– Natural Language Processing (Comp. Linguistics)
• Extracts concepts based on semantics
• Relies heavily on language morphology, syntax, and
semantics
– Information Retrieval
• Exploits the bag-of-words approach
• Term weighting and text similarity measures
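A minimal sketch of the bag-of-words idea, assuming plain term counts as weights and cosine similarity as the text similarity measure. The function names and sample sentences are illustrative only and are not taken from the slides.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("The bank raised interest rates")
doc2 = bag_of_words("Interest rates were raised by the bank")
doc3 = bag_of_words("The team won the game")

# Word order is ignored, so doc1 and doc2 score as similar.
print(cosine_similarity(doc1, doc2))
# doc1 and doc3 share only the word "the", so the score is lower.
print(cosine_similarity(doc1, doc3))
```

Because word order is discarded, this representation cannot tell "the bank raised rates" from "rates raised the bank"; that is the trade-off against the NLP approach.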
Text Mining is a Variant of DM
NLP Approach
[Figure: NLP approach diagram. Labels: Data Collection (Surveys, Web, Channel, Operational Systems, Customer Data); Text, NLP, Information Extraction, Concepts, Attributes; Categorization, Clustering, Concept Maps, Trending, Prediction; Outcomes and Actions (Attract, Grow, Retain, Fraud, Attitudes); Expert UI, Business UI, Business User]
NLP Relies on the
Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Objective is to go from syntactic phrase:
– Using a tool like Text Mining is a great idea for any
organization that is interested in maintaining
information on competitive intelligence.
• To semantic concept:
– Competitive Intelligence
Morphology
• Understanding words
– Stems
– Affixes
• Prefix
• Suffix
– Inflectional elements
• Reduces complexity of analysis
• Reduces complexity of representation
• Supports text mining
• Example: in- (prefix) + dispute (noun stem) + -able (suffix)
Syntax
• The Bank of Canada will curb inflation with
higher interest rates
• Parse tree of the sentence (labels as given on the slide):
– Noun phrase: Adjective "The", Noun "Bank of Canada"
– Verb phrase: Aux "will", Verb "curb", Noun "inflation"
– Prepositional phrase: "with" + Noun phrase (Adjective "higher", Noun "interest rates")
Semantics
• The meaning of it all
• Approaches to meaning
– Semantic networks
– Deductive logic
– Rule-based systems
• Useful for classification of documents
Problems with NLP
• Limitations of Natural Language Processing
– Correctly identifying the role of noun phrases
– Representing abstract concepts
– Classifying synonyms
– Representing the number of concepts
• Limitations of technology
– Language specific designs are required
– Classification speed
– Classifying hybrid words and sentences
IR Approach
• Statistics applied to syntax yields pretty good
results for:
– Information Filtering
– Text Categorization
– Document/Term Clustering
– Text Summarization
Generality of Basic Techniques
[Figure: processing pipeline. Raw text → stemming & stop words → tokenized text → term weighting → document-term weight matrix]
      t1    t2    …   tn
d1    w11   w12   …   w1n
d2    w21   w22   …   w2n
…     …     …         …
dm    wm1   wm2   …   wmn
[The matrix supports: term similarity and doc similarity → CLUSTERING; sentence selection → SUMMARIZATION; vector centroid → CATEGORIZATION; META-DATA/ANNOTATION]
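The document-term matrix above can be sketched in a few lines. The slides do not say which term weighting scheme is used, so this example assumes TF-IDF (term frequency times log of inverse document frequency), and the stop-word list is a tiny hypothetical one chosen for illustration.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "on"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the weight of term j in doc i."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for tokens in tokenized for t in tokens})
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)
        matrix.append([tf[t] * math.log(m / df[t]) for t in terms])
    return terms, matrix


docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
terms, W = tfidf_matrix(docs)
print(terms)
for row in W:
    print([round(w, 2) for w in row])
```

Each row of `W` is one document vector (d1 … dm) and each column is one term (t1 … tn). Terms that occur in fewer documents receive larger weights, and a term found in every document would receive weight zero.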
Stemming
• General:
– http://en.wikipedia.org/wiki/Stemming
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/
– http://snowball.tartarus.org/texts/introduction.html *READ*
• Julie B. Lovins (1968)
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
– http://snowball.tartarus.org/algorithms/lovins/stemmer.html
• Martin Porter (1979)
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/porter.htm
• Snowball (~2000)
– Framework for writing stemming algorithms
– Language and compiler for stemming algorithms
– http://snowball.tartarus.org
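To make the idea of stemming concrete, here is a toy suffix stripper. It is not the Lovins, Porter, or Snowball algorithm: the suffix list and the minimum stem length are assumptions made up for this sketch, and the published stemmers use far more careful rules.

```python
# Hypothetical suffix list, ordered longest first so that the longest
# matching suffix is tried first. Not the rule set of any published stemmer.
SUFFIXES = ["ations", "ation", "ions", "ings", "ion", "ing",
            "ers", "ed", "er", "es", "s"]
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip the first suffix that matches and leaves a long enough stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ["connected", "connecting", "connections", "connects", "bus"]:
    print(w, "->", toy_stem(w))
```

All four "connect" variants collapse to the same stem, which shrinks the term vocabulary. Short words such as "bus" are left alone by the minimum-length check.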
Information Filtering
• Stable & long term interest, dynamic info source
• System must make a delivery decision
immediately as a document “arrives”
• Two Methods: Content-based vs. Collaborative
[Figure: a filtering system driven by the user's stated interest ("my interest: …")]
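A minimal sketch of the content-based method, assuming the long-term interest is stored as a term-count vector and a document is delivered when its cosine similarity to that profile reaches a threshold. The class name, the threshold value, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ContentFilter:
    """Delivers a document when it is similar enough to a fixed interest profile."""

    def __init__(self, interest_text, threshold=0.3):
        self.profile = vectorize(interest_text)
        self.threshold = threshold  # hypothetical cut-off; would be tuned in practice

    def deliver(self, document):
        """Decide immediately, using only the profile and this one document."""
        return cosine(self.profile, vectorize(document)) >= self.threshold


f = ContentFilter("text mining and information retrieval")
stream = [
    "New methods for text mining and retrieval of information",
    "Local team wins the hockey final",
]
for doc in stream:
    print(f.deliver(doc), "-", doc)
```

The decision for each document depends only on the stored profile, which matches the requirement that the system decide as each document arrives. A collaborative method would instead rely on the judgments of other users with similar interests.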
Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alert
• And many others
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering (current topic)
• Text Summarization
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
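One simple way to group similar documents is single-pass (leader) clustering, sketched below. The slides do not name a specific clustering algorithm, so the choice of method, the cosine threshold, and the sample texts are assumptions for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose leader is similar enough.

    Each cluster is a list of indices into texts; the first index is the leader.
    """
    vectors = [vectorize(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing leader was close enough: start a new cluster.
            clusters.append([i])
    return clusters


texts = [
    "stock market prices fall",
    "hockey team wins final game",
    "market prices rise as stock rally continues",
    "team loses game in overtime",
]
print(single_pass_cluster(texts))
```

The same routine would work for terms or passages, provided each object is first turned into a vector. The result depends on the order of the input and on the threshold, which is a known weakness of single-pass methods.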
Similarity-induced Structure
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization (current topic)
“Retrieval-based” Summarization
• Observation: term vector → summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document
vector
– NOTE: Similarity can be measured by inner product of vectors of
term frequencies
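The basic approach can be sketched as follows, assuming sentences are split on end punctuation and scored by the inner product of their term-frequency vector with the document's term-frequency vector. The function names and the sample document are illustrative.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def inner_product(a, b):
    """Inner product of two sparse term-frequency vectors."""
    return sum(a[t] * b[t] for t in a if t in b)


def summarize(document, n=1):
    """Return the n sentences whose term vectors best match the document vector."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_vec = term_frequencies(document)
    scored = [(inner_product(term_frequencies(s), doc_vec), i)
              for i, s in enumerate(sentences)]
    # Highest score first; ties are broken by position in the document.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    # Present the selected sentences in their original order.
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]


document = (
    "Text mining extracts information from text. "
    "It is useful. "
    "Text mining builds on information retrieval."
)
print(summarize(document, n=2))
```

A raw inner product favours long sentences, since they contain more terms; normalising by sentence length or combining the score with sentence position are common adjustments.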
Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic
label creation for clusters)
Sample Applications
• Information Filtering
• Text Categorization (current topic)
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document
examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a categorization system, trained on labeled examples from categories such as Sports, Business, Education, and Science, assigns new documents to those categories]
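As a sketch of this supervised setting, the example below builds one centroid vector per category from labeled documents and assigns a new document to the category with the most similar centroid. The slides do not prescribe a classifier, so the centroid method and the tiny training set are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled_docs):
    """Sum the term vectors of each category's training documents.

    Cosine similarity ignores vector length, so the sum behaves like the mean.
    """
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(vectorize(text))
    return dict(centroids)


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


training = [
    ("the team won the hockey game", "Sports"),
    ("the player scored a late goal", "Sports"),
    ("the company reported higher profits", "Business"),
    ("shares fell as the market closed", "Business"),
]
centroids = train_centroids(training)
print(classify("company profits and market shares", centroids))
print(classify("a late goal won the game", centroids))
```

With realistic data the documents would first be stemmed, stop-word filtered, and term weighted, and a hierarchy of categories would need a classifier at each level.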
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification
References
• http://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf
• http://disi.unitn.it/~bernardi/Courses/CL/Slides/ir.pdf
• Multinomial Distribution
– http://onlinestatbook.com/2/probability/multinomial.html
– http://onlinestatbook.com/2/probability/binomial.html
WEKA Tutorials
• https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf
• http://www.unal.edu.co/diracad/einternacional/Weka.pdf