Large-scale information extraction and integration infrastructure
for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
Text Mining and Text Stream
Mining Tutorial
Miha Grčar
miha.grcar@ijs.si
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
http://kt.ijs.si
Text and text stream mining tutorial
Simple • Pragmatic • Focused
• Part I: Text mining
• Part II: Text stream mining
Lucca, Oct 2012
Miha Grčar: Text and text stream mining
Part I: Text mining
What is text mining?
• Text mining provides a set of methodologies and tools for
discovering, presenting, and evaluating knowledge from
large collections of textual documents
• Text mining adopts and adapts methodologies and tools from …
– Data mining (DM)
– Machine learning (ML)
– Information retrieval (IR)
– Natural language processing (NLP)
– Visualization
– Social network analysis and graph mining
– Knowledge management
– …
Typical text mining process
Data acquisition
- Acquisition
- Cleaning
Text preprocessing
- Transformation
Modeling
- Discover
- Extract
- Organize knowledge
Evaluation / validation
- Performance and utility assessment
- Feedback loop
Application
- Presentation
- Interaction
[Diagram: the stages are connected by a feedback loop]
What do we cover in Part I?
Data acquisition
Text preprocessing
- Vector space model (bags-of-words)
Modeling
- Machine learning
- Classification
- Clustering
Evaluation / validation
- Cross validation
- Precision
- Recall …
Application
- Search & browse
- Categorization
- Recommendation
- Advertising
- Spam detection
- Summarization
- Visualization …
[Diagram: the stages are connected by a feedback loop]
Bags of words
Text: The quick brown dog jumps over the lazy dog.
• Tokenize: the | quick | brown | dog | jumps | over | the | lazy | dog
• Remove stop words: quick | brown | dog | jumps | lazy | dog
• Lemmatize: jumps -> jump
• Compute weights:
quick  brown  dog  jump  lazy
1      1      2    1     1
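The four steps on the bag-of-words slide can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation: the stop word list (`STOP_WORDS`) and the lemma lookup table (`LEMMAS`) are tiny hand-made assumptions that cover only the example sentence.

```python
import re
from collections import Counter

# Assumptions for illustration only: a tiny stop word list and lemma table.
STOP_WORDS = {"the", "over"}
LEMMAS = {"jumps": "jump"}

def bag_of_words(text):
    """Tokenize, remove stop words, lemmatize, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    lemmas = [LEMMAS.get(t, t) for t in tokens]           # lemmatize
    return Counter(lemmas)                                # compute weights (TF)

bow = bag_of_words("The quick brown dog jumps over the lazy dog.")
print(dict(bow))
```

A real system would plug in a full stop word list and a proper lemmatizer or stemmer for the language at hand.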
Bags of words
Tokenization & stop word removal
Original text:
After ripping 14% higher from
June until the first week of
October, stocks ran headfirst into
a wall of worry seemingly too
large to climb. Europe, China, the
fiscal cliff, etc aren't new
concerns but that doesn't mean
they aren't real. Investors
suddenly care and are behaving
accordingly, selling some of their
more aggressive names and
rotating into defensives.
Simple tokenizer (alphanumeric
strings only):
After | ripping | 14 | higher | from
| June | until | the | first | week |
of | October | stocks | ran |
headfirst | into | a | wall | of |
worry | seemingly | too | large |
to | climb | Europe | China | the |
fiscal | cliff | etc | aren | t | new |
concerns | but | that | doesn | t |
mean | they | aren | t | real |
Investors | suddenly | care | and |
are | behaving | accordingly |
selling | some | of | their | more |
aggressive | names | and |
rotating | into | defensives
Bags of words
Tokenization & stop word removal
Original text:
After ripping 14% higher from
June until the first week of
October, stocks ran headfirst into
a wall of worry seemingly too
large to climb. Europe, China, the
fiscal cliff, etc aren't new
concerns but that doesn't mean
they aren't real. Investors
suddenly care and are behaving
accordingly, selling some of their
more aggressive names and
rotating into defensives.
Regex tokenizer ([\p{L}']+):
After | ripping | higher | from |
June | until | the | first | week |
of | October | stocks | ran |
headfirst | into | a | wall | of |
worry | seemingly | too | large |
to | climb | Europe | China | the
| fiscal | cliff | etc | aren't | new
| concerns | but | that | doesn't
| mean | they | aren't | real |
Investors | suddenly | care | and
| are | behaving | accordingly |
selling | some | of | their | more
| aggressive | names | and |
rotating | into | defensives
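The difference between the two tokenizers can be reproduced with Python's standard `re` module. This is a sketch under one assumption worth stating: `re` does not support `\p{L}`, so the character class `[^\W\d_]` (a word character that is neither a digit nor an underscore) is used here as an approximation of "any letter". The sample sentence is made up for the illustration.

```python
import re

TEXT = "After ripping 14% higher, stocks aren't climbing."

# Simple tokenizer: maximal runs of ASCII letters and digits.
simple_tokens = re.findall(r"[A-Za-z0-9]+", TEXT)

# Letters-and-apostrophes tokenizer. [^\W\d_] stands in for \p{L},
# which the standard `re` module does not provide.
regex_tokens = re.findall(r"(?:[^\W\d_]|')+", TEXT)

print(simple_tokens)
print(regex_tokens)
```

The first tokenizer keeps the number and splits the contraction into two tokens; the second drops the number and keeps the contraction as one token.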
Bags of words
Lemmatization
Original text:
After ripping 14% higher from
June until the first week of
October, stocks ran headfirst into
a wall of worry seemingly too
large to climb. Europe, China, the
fiscal cliff, etc aren't new
concerns but that doesn't mean
they aren't real. Investors
suddenly care and are behaving
accordingly, selling some of their
more aggressive names and
rotating into defensives.
Lemmatized:
After | rip | high | from | June |
until | the | first | week | of |
October | stock | run | headfirst
| into | a | wall | of | worry |
seemingly | too | large | to |
climb | Europe | China | the |
fiscal | cliff | etc | aren't | new |
concern | but | that | doesn't |
mean | they | aren't | real |
Investor | suddenly | care | and |
are | behave | accordingly | sell |
some | of | their | more |
aggressive | name | and | rotate
| into | defensive
Bags of words
Lemmatization
Original text:
È uno dei punti più contestati
della legge di Stabilità approvata
da poco dal governo: il taglio alle
detrazioni fiscali, ossia gli "sconti"
che ogni contribuente può
vantare sulla propria
dichiarazione dei redditi. Secondo
una bozza aggiornata del disegno
di legge, il taglio si applicherebbe
a decorrere dal periodo di
imposta al 31 dicembre 2012. Un
dettaglio che aveva creato, nei
giorni scorsi, non poche
polemiche.
Lemmatized:
E | uno | dei | puntare | più |
contestato | della | legge | di |
Stabilità | approvare | da | poco |
dal | governo | il | tagliare | alle |
detrazione | fiscale | ossia | gli |
scontare | che | ogni | contribuire |
può | vantare | sulla | proprio |
dichiarazione | dei | reddito |
Secondo | una | bozzare |
aggiornare | del | disegnare | di |
legge | il | tagliare | si | applicare | a
| decorrere | dal | periodare | di |
impostare | al | dicembre | Un |
dettagliare | che | aveva | creare |
nei | giorno | scorrere | non | poca |
polemico
Computing weights
• TF
– Term Frequency
– The number of times a lemma (stem) occurs in a document
• DF
– Document Frequency
– The number of documents in which a lemma (stem) occurs at least once
• TFIDF
$$\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF} = \mathit{TF} \times \log_e \frac{|D|}{\mathit{DF}}$$
where |D| is the number of documents in the collection
• Higher TF means higher TFIDF
• Higher DF means lower TFIDF
Example: "The quick brown dog jumps over the lazy dog." has the TF vector
quick  brown  dog  jump  lazy
1      1      2    1     1
Computing weights
Collection with |D| = 1: "The quick brown dog jumps over the lazy dog."
$$\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF}, \qquad \mathit{IDF} = \log_e \frac{|D|}{\mathit{DF}}$$
Lemma  TF  DF  IDF  TFIDF
quick  1   1   0    0
brown  1   1   0    0
dog    2   1   0    0
jump   1   1   0    0
lazy   1   1   0    0
Computing weights
Collection with |D| = 2: "The quick brown dog jumps over the lazy dog." and "jump"
$$\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF}, \qquad \mathit{IDF} = \log_e \frac{|D|}{\mathit{DF}}$$
Weights of the first document:
Lemma  TF  DF  IDF   TFIDF
quick  1   1   0.69  0.69
brown  1   1   0.69  0.69
dog    2   1   0.69  1.39
jump   1   2   0     0
lazy   1   1   0.69  0.69
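The two weighting examples (|D| = 1 and |D| = 2) can be checked with a short script. This is a minimal sketch assuming documents are already tokenized and lemmatized; the function name `tfidf` and the list-of-lemmas input format are choices made for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of lemma lists. Returns one {lemma: TF * ln(|D| / DF)} dict per document."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # each document counts once per lemma
    n_docs = len(docs)
    return [
        {lemma: tf * math.log(n_docs / df[lemma]) for lemma, tf in Counter(doc).items()}
        for doc in docs
    ]

doc1 = ["quick", "brown", "dog", "jump", "lazy", "dog"]
doc2 = ["jump"]

single = tfidf([doc1])          # |D| = 1: every IDF is ln(1) = 0
weights = tfidf([doc1, doc2])   # |D| = 2: "jump" occurs in both documents
print({lemma: round(w, 2) for lemma, w in weights[0].items()})
```

With a single document every weight is zero, which is why TF-IDF only becomes informative once the collection contains documents that differ in vocabulary.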
Cosine similarity
$$\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_2) = \cos\varphi = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\|\mathbf{d}_1\| \, \|\mathbf{d}_2\|}$$
[Figure: vectors d1 and d2 drawn from the origin 0, with the angle φ between them]
Cosine similarity
$$\mathbf{d}_i' = \frac{\mathbf{d}_i}{\|\mathbf{d}_i\|}$$
$$\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_2) = \cos\varphi = \frac{\mathbf{d}_1' \cdot \mathbf{d}_2'}{\|\mathbf{d}_1'\| \, \|\mathbf{d}_2'\|} = \mathbf{d}_1' \cdot \mathbf{d}_2'$$
Cosine similarity is equal to dot product if BOWs are normalized to unit length (faster to compute)
[Figure: vectors d1 and d2 and their unit-length versions d1' and d2' on a circle of radius 1 around the origin 0]
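The equivalence between cosine similarity and the dot product of unit-length vectors is easy to verify numerically. The sketch below assumes sparse vectors stored as `{term: weight}` dictionaries; the helper names and the two toy vectors are made up for this illustration.

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a                      # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def normalize(v):
    """Scale a sparse vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return {t: w / norm for t, w in v.items()} if norm > 0 else dict(v)

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

d1 = {"quick": 1.0, "dog": 2.0}
d2 = {"dog": 1.0, "lazy": 1.0}
print(cosine(d1, d2), dot(normalize(d1), normalize(d2)))
```

Normalizing once, when a vector is created, means that every later similarity computation needs only the dot product.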
Centroids
$$\mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{d}_i, \qquad \mathbf{C}' = \frac{\mathbf{C}}{\|\mathbf{C}\|}$$
• Determine characteristic words in a cluster
• Nearest centroid classifier
• k-means clustering
• …
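A centroid and the characteristic words of a cluster can be computed as follows. This is a sketch assuming a non-empty list of sparse `{term: weight}` vectors; the function names and the two toy documents are assumptions for this illustration, and "characteristic words" is read here as the terms with the highest centroid weights.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Average a non-empty list of sparse vectors, then normalize to unit length."""
    total = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    c = {term: w / len(vectors) for term, w in total.items()}
    norm = math.sqrt(sum(w * w for w in c.values()))
    return {term: w / norm for term, w in c.items()} if norm > 0 else c

def characteristic_words(c, top=3):
    """Terms with the highest weights in the centroid."""
    return sorted(c, key=c.get, reverse=True)[:top]

cluster = [{"stock": 1.0, "price": 1.0}, {"stock": 1.0, "dividend": 1.0}]
c = centroid(cluster)
print(characteristic_words(c, top=1))
```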
Machine learning
• Machine learning is concerned with the development of
algorithms that allow computer programs to learn from past
experience [Mitchell]
• Machine learning refers to a collection of algorithms that take
as input empirical data (e.g., from databases or sensors) and
try to discover some characteristics (rules, constraints,
patterns, features) of the process that generated the data
[Wikipedia]
• Learning from past experience = learning from past examples
• Examples (instances) = document vectors (normalized sparse
vectors)
Machine learning
• We will look at two commonly used
machine learning techniques
– Classification
• Assigning instances (documents) to two or
more predefined (discrete) classes
• Supervised learning method
– Clustering
• Arranging instances (documents) into
groups (clusters) so that instances in the
same group are more similar to each other
than to those in other groups
• Unsupervised learning method
Classification
• Labeled documents
Mergers & Acquisitions • Ingram Wraps Up Brightpoint Buyout
Mergers & Acquisitions • State Street completes acquisition of Goldman Sachs Administration Services
Economy & Government • Gasoline fuels inflation, but Fed policy seen steady
Economy & Government • Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
...
Investing Picks • Smith & Wesson Holding Corp. Enters Oversold Territory
Investing Picks • The Fresh Market: A Strong Buy
• Learn to classify
Labeled dataset -> Training Algorithm -> Classification Model
• Classify unlabeled documents
Unlabeled dataset -> Classification Algorithm (uses the Classification Model) -> Predictions (Labels)
Example: "Fresh Del Monte Produce Inc. Enters Oversold Territory" -> Investing Picks
Classification with k-Nearest Neighbors
[Figure: an unlabeled document among labeled documents from three classes: Investing Picks, Mergers & Acquisitions, Economy & Government]
Votes of the nearest neighbors:
Investing Picks: 4
Mergers & Acquisitions: 1
Economy & Government: 0
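The voting scheme of k-nearest neighbors can be sketched as follows. The code assumes unit-length sparse vectors (so the dot product equals cosine similarity) and a plain majority vote; the toy training vectors and the function name `knn_classify` are assumptions for this illustration.

```python
from collections import Counter

def dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_classify(query, labeled, k=5):
    """labeled: list of (unit-length vector, label). Majority vote among the k most similar."""
    nearest = sorted(labeled, key=lambda item: dot(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

labeled = [
    ({"oversold": 1.0}, "Investing Picks"),
    ({"buy": 1.0}, "Investing Picks"),
    ({"acquisition": 1.0}, "Mergers & Acquisitions"),
    ({"inflation": 1.0}, "Economy & Government"),
]
label, votes = knn_classify({"oversold": 0.8, "buy": 0.6}, labeled, k=3)
print(label, dict(votes))
```

Note that the "model" is the whole labeled dataset, which is why k-NN needs no training but is slow at classification time.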
Classification with Nearest Centroid Classifier
[Figure: similarities s1, s2, s3 between an unlabeled document and the centroids of the three classes]
Similarity s2 > s1 > s3
s2: Mergers & Acquisitions
s1: Investing Picks
s3: Economy & Government
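A nearest centroid classifier needs only one unit-length centroid per class. The sketch below assumes sparse `{term: weight}` vectors; the toy training data and the function names are assumptions for this illustration.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def unit(v):
    norm = math.sqrt(dot(v, v))
    return {t: w / norm for t, w in v.items()} if norm > 0 else dict(v)

def train_centroids(labeled):
    """One unit-length centroid per class. Summing instead of averaging is
    enough here because the result is normalized afterwards."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in labeled:
        for t, w in vec.items():
            sums[label][t] += w
    return {label: unit(s) for label, s in sums.items()}

def classify(vec, centroids):
    """Return the label of the most similar centroid and all similarities."""
    sims = {label: dot(vec, c) for label, c in centroids.items()}
    return max(sims, key=sims.get), sims

train = [
    ({"oversold": 1.0}, "Investing Picks"),
    ({"buy": 1.0}, "Investing Picks"),
    ({"acquisition": 1.0}, "Mergers & Acquisitions"),
    ({"buyout": 1.0}, "Mergers & Acquisitions"),
    ({"inflation": 1.0}, "Economy & Government"),
]
centroids = train_centroids(train)
label, sims = classify(unit({"acquisition": 1.0, "buy": 0.5}), centroids)
print(label)
```

The similarities themselves can be shown to the user as an explanation of the decision, and the centroid weights as an explanation of the model.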
Classification
with Support Vector Machine (SVM)
[Figure: a hyperplane separating Investing Picks documents from Mergers & Acquisitions documents, with the margin labeled w]
• Maximize w
• Minimize ξ (slack of instances that violate the margin)
The two objectives form a tradeoff
Classification algorithms
                      k-NN   Nearest centroid  SVM (linear kernel)
Multiclass?           yes    yes               no
Explains decisions?   no     yes               yes
Explains model?       no     yes               yes
Number of parameters  1      0                 1
Model size            big    small             small
Training speed        0      fast              slow
Classification speed  slow   fast              fast
Accuracy (on texts)   low    medium            high
Clustering
• k-means clustering
• Agglomerative hierarchical clustering
k-means clustering
Input: k
Output: k clusters (and their centroids)
1. Randomly select k instances for initial centroids
2. Assign step: assign each instance to the nearest centroid
3. If the assignments did not change, end the algorithm
4. Update step: recompute (update) centroids
5. Go back to Step 2
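The five steps of k-means translate almost line by line into code. This is a sketch assuming unit-length sparse vectors and cosine similarity as the notion of "nearest"; the `seed` and `max_iter` parameters and the handling of empty clusters are choices made for this illustration.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def unit(v):
    norm = math.sqrt(dot(v, v))
    return {t: w / norm for t, w in v.items()} if norm > 0 else dict(v)

def centroid(vectors):
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return unit(total)

def kmeans(vectors, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]        # Step 1
    assignment = None
    for _ in range(max_iter):
        new_assignment = [                                       # Step 2: assign
            max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:                         # Step 3: converged
            break
        assignment = new_assignment
        for c in range(k):                                       # Step 4: update
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:          # an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    return assignment, centroids                                 # Step 5 is the loop

vectors = [unit(v) for v in (
    {"stock": 1.0, "price": 1.0}, {"stock": 1.0},
    {"rain": 1.0, "sun": 1.0}, {"sun": 1.0},
)]
assignment, centroids = kmeans(vectors, k=2)
print(assignment)
```

The result depends on the random initialization in general; on larger datasets it is common to run the algorithm several times and keep the best clustering.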
Agglomerative hierarchical clustering
1. Find the two most similar instances
2. Connect them
3. Replace them with their centroid
Repeat …
The resulting tree of connections is called a “Dendrogram”
Evaluation
• Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
– 10-fold cross validation
– Stratified
• Accuracy
• Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall |
http://en.wikipedia.org/wiki/F1_Score)
• Micro and macro-averaging (http://nlp.stanford.edu/IRbook/html/htmledition/evaluation-of-text-classification-1.html |
http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
• Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
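Precision, recall, F1 and the two averaging schemes can be computed directly from the per-class counts of true positives, false positives and false negatives. The sketch below is one common formulation (macro-averaged F1 is taken as the mean of per-class F1 values); the function names and the toy labels are assumptions for this illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_micro(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    per_class = [prf(*counts[c]) for c in labels]
    # Macro: average the per-class scores (every class counts equally).
    macro = tuple(sum(scores[i] for scores in per_class) / len(labels) for i in range(3))
    # Micro: pool the counts first (every instance counts equally).
    micro = prf(*(sum(counts[c][i] for c in labels) for i in range(3)))
    return macro, micro

y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
macro, micro = macro_micro(y_true, y_pred)
print(macro, micro)
```

In single-label classification the micro-averaged precision, recall and F1 all coincide with accuracy, while the macro average gives small classes the same influence as large ones.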
Applications
• Enhanced Web search (SearchPoint)
• Social browsing (LiveNetLife)
• Content categorization
• Content-based recommender systems
• Advertising
• Blogging assistance (Zemanta)
• Spam detection
• Visualization / summarization of large corpora
• Text summarization
Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
• Sentiment analysis (demo later)
• News aggregation
http://emm.newsexplorer.eu
• Knowledge engineering
http://ontogen.ijs.si
• …
[Screenshot] Enhanced Web search (http://www.searchpoint.com)
[Screenshot] Social browsing (http://www.livenetlife.com) @ http://videolectures.net
[Screenshot] Content categorization @ http://videolectures.net
[Screenshot] Recommender system @ http://videolectures.net
[Screenshot] Contextualized advertising
Blogging assistant (http://www.zemanta.com)
Pump & dump
Siering, Muntermann, Grčar (2012)
Visualizations
• Document space
visualization
• Canyon flows
• Tag clouds
http://www.jasondavies.com/wordcloud/
Recap
• Basics
– What is text mining?
– TF-IDF bag-of-words vectors
– Cosine similarity
– Centroids
• Machine learning
– k-NN
– Nearest centroid classifier
– SVM
– k-means
– Agglomerative clustering
• Applications
– Enhanced Web search (SearchPoint)
– Social browsing (LiveNetLife)
– Content categorization
– Content-based recommender systems
– Advertising
– Writing assistance (Zemanta)
– Spam detection
– Visualization / summarization of large corpora …
Part II: Text stream mining
What is text stream mining?
Same as text mining but on streams
Text stream mining provides a set of
methodologies and tools for discovering,
presenting, and evaluating knowledge from
streams of textual documents
Typical text stream mining process
Stream data acquisition
- Acquisition
- Cleaning
Text preprocessing
- Transformation
Modeling
- Discover
- Extract
- Organize knowledge
Evaluation / validation
- Performance and utility assessment
- Obtaining new labels
- Feedback loop
Application
- Presentation
- Interaction
[Diagram: the stages are connected by a feedback loop]
Text stream mining pipelines
• Pipelining and parallelization
– Enables concurrent processing
– Increases throughput
– Enables distributed execution (cluster)
• Near-realtime online systems
– Stream cannot be paused or slowed down (e.g., newsfeeds)
– [Near-realtime] Time between reception and utilization of data should be as short as possible
– [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted
[Figure: a stream processed by a pipeline of components, some of them running in parallel]
What do we cover in Part II?
Stream data acquisition and text preprocessing
- RSS feeds
- Boilerplate remover
- Language detection
- Online BOW
Modeling
- Online ML
- Incr. NCC
- Incr. k-means
- Incr. SVM
Evaluation / validation
Application
- Online document space visualization
- Online Twitter sentiment classification
[Diagram: the stages are connected by a feedback loop]
Text stream acquisition and preprocessing
[Figure: pipeline diagram with the following components: several RSS readers, Load balancing, several parallel Preprocessing pipelines (Boilerplate remover -> Language detector), Sync, Online BOW, ...]
RSS (Really Simple Syndication)
<rss version="2.0">
<channel>
<generator>NFE/1.0</generator>
<title>Top Stories - Google News</title>
<link>http://news.google.com/news?pz=1&ned=us&hl=en</link>
<language>en</language>
<webMaster>news-feedback@google.com</webMaster>
<copyright>©2011 Google</copyright>
<item>
<title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster Bloomberg</title>
<link>http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&url=http://www.bloomberg.com/news/201102-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
<category>Top Stories</category>
<pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
<description>The ouster of Hosni Mubarak from Egypt’s presidency today, after
protests that started Jan. 25, prompted the following comments from analysts:
“The army needs to move quickly to remove obstacles to ...</description>
</item>
...
</channel>
</rss>
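Reading the items of an RSS 2.0 feed takes only a few lines with Python's standard library. This is a sketch, not the RSS reader used in the tutorial: the feed below is a shortened, hypothetical example, and the function name `read_items` is an assumption. Note that a well-formed feed must escape "&" as "&amp;" inside URLs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed used only for this illustration.
FEED = """<rss version="2.0">
<channel>
<title>Example feed</title>
<item>
<title>First story</title>
<link>http://example.com/story?id=1&amp;lang=en</link>
<pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
<description>Short summary ...</description>
</item>
</channel>
</rss>"""

def read_items(xml_text):
    """Extract title, link, publication date and description of every <item>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]

items = read_items(FEED)
print(items[0]["title"], items[0]["link"])
```

The description in a feed is usually only a teaser, so a stream mining system typically follows the link and downloads the full page, which is where boilerplate removal becomes necessary.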
Boilerplate removal
[Screenshot] Example page: http://www.bbc.co.uk/news/world-us-canada-15051554
Boilerplate removal
URL tree
URL structure: protocol :// domain / path / file ? query
Example: http:// kt.ijs.si /a/b/ c.html ?pg=0
Tree branch (root -> domain -> path): # -> si -> ijs -> kt -> a -> b
Example: http://www.bbc.co.uk/news/world-us-canada-15051554
Tree branch: # -> uk -> co -> bbc -> www -> news
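The mapping from a URL to its tree branch can be sketched with the standard library. The function name `url_branch` is an assumption, and the code assumes absolute URLs; it reproduces the two branches shown on the URL tree slide.

```python
from urllib.parse import urlsplit

def url_branch(url):
    """Return the tree branch for an absolute URL: the root '#', the domain
    labels in reverse order, then the path folders (without the file name)."""
    parts = urlsplit(url)
    domain = parts.hostname.split(".")[::-1]
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]          # the last segment is the file
    return ["#"] + domain + segments

print(url_branch("http://kt.ijs.si/a/b/c.html?pg=0"))
print(url_branch("http://www.bbc.co.uk/news/world-us-canada-15051554"))
```

Reversing the domain labels puts the most general part first, so pages from the same site, and from the same section of a site, share a common prefix in the tree.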
Boilerplate removal
URL tree
[Figure: documents from the stream are inserted into the URL tree: Root (#) -> Domain -> Path]
How many times did I see “About Us” in this part of the tree?
This method is …
• Unsupervised
• Online
• Incremental (consumes one document at a time)
Language detection
• Motivation: language-specific text analysis
components and applications
• Solutions based on word lists and word or
character sequences (n-grams)
• Character n-gram model
– Build character n-gram histograms for many
languages (language models)
– Compare text document histogram to language
models
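The character n-gram approach can be sketched as follows. The slides do not say how the document histogram is compared to the language models, so this example assumes a rank-based "out-of-place" distance; the two training texts are tiny made-up samples, whereas real language models are built from large corpora.

```python
from collections import Counter

def ngram_ranks(text, max_n=2, top=300):
    """Rank character n-grams (n = 1..max_n) by frequency; spaces become '_'."""
    text = "_".join(text.upper().split())
    counts = Counter(
        text[i:i + n] for n in range(1, max_n + 1) for i in range(len(text) - n + 1)
    )
    counts.pop("_", None)                  # ignore the separator on its own
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_ranks, model_ranks):
    """Sum of rank differences; n-grams missing from the model get the maximum penalty."""
    penalty = len(model_ranks)
    return sum(
        abs(rank - model_ranks[gram]) if gram in model_ranks else penalty
        for gram, rank in doc_ranks.items()
    )

def detect(text, models):
    """Return the language whose model is closest to the text."""
    doc = ngram_ranks(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))

# Toy training texts (assumptions for this sketch).
EN = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
DE = "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte"
models = {"en": ngram_ranks(EN), "de": ngram_ranks(DE)}

print(detect("the dog and the cat", models))
```

Because the models are just ranked lists, adding a new language only requires a sample of text in that language.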
Language detection
N-gram ranks in the English and German language models:
Rank  English  German
1     E        E
2     T        N
3     O        R
4     A        I
5     N        T
6     I        S
7     H        A
8     S        D
9     R        U
10    D        EN
11    E_       G
12    L        ER
13    _T       H
14    TH       L
15    HE       N_
16    U        O
17    W        M
18    C        _D
19    M        C
...   ...      ...
Annotations on the slide: THE (English); DER, DEN (German)
Language detection
Article “Egypt rejoices at Mubarak departure”
[Figure: two scatter plots of n-gram ranks. Left: English article (n-gram rank) against English language model (n-gram rank). Right: English article (n-gram rank) against German language model (n-gram rank).]
Online BOW
[Figure: documents from the stream enter a queue of TF vectors. Each added document updates the DF values (Add); each outdated document that leaves the queue is subtracted from the DF values (Remove).]
[Figure: the TF vector of a document is combined with the current DF values to produce its TF-IDF vector.]
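The queue of TF vectors with incrementally maintained DF values can be sketched as a small class. This is an illustration under assumptions: the window is limited by the number of documents (`max_docs`), whereas a real system might expire documents by age, and the class name `OnlineBow` is made up.

```python
import math
from collections import Counter, deque

class OnlineBow:
    """Sliding window of TF vectors with incrementally maintained DF values."""

    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.queue = deque()     # TF vectors currently inside the window
        self.df = Counter()      # document frequencies over the window

    def add(self, lemmas):
        tf = Counter(lemmas)
        self.queue.append(tf)
        self.df.update(tf.keys())                # Add: DF + 1 for each lemma
        if len(self.queue) > self.max_docs:
            outdated = self.queue.popleft()
            self.df.subtract(outdated.keys())    # Remove: DF - 1 for each lemma
            self.df += Counter()                 # drop lemmas whose DF fell to 0
        return self.tfidf(tf)

    def tfidf(self, tf):
        """TF-IDF of a document in the window, using the current DF values."""
        n_docs = len(self.queue)
        return {t: f * math.log(n_docs / self.df[t]) for t, f in tf.items()}

bow = OnlineBow(max_docs=2)
bow.add(["dog", "jump"])
bow.add(["jump"])
latest = bow.add(["cat"])     # pushes the first document out of the window
print(latest, dict(bow.df))
```

Only the DF table is updated on every arrival; TF-IDF vectors are computed on demand from the stored TF vectors, so they always reflect the current window.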
Batch, incremental, offline, online
• Batch learning
Consuming all training examples at once
• Incremental learning
Consuming one example at a time
• Mini-batch learning
Consuming several examples at a time
• Offline learning (for datasets/finite streams)
All data is stored and can be accessed repeatedly
• Online learning (for infinite streams)
Each example is discarded after being processed
Incremental nearest centroid classifier
For each instance from the stream:
1. Classify / predict
2. Obtain actual label
3. Update centroids
[Figure: animation of the three steps on a stream of instances; labels on the figure include "Outdated instance", "(green)" and "(red)"]
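The classify, obtain label, update cycle can be sketched by keeping one running sum vector per class and subtracting instances as they become outdated. The window size (`max_instances`), the class name and the toy instances are assumptions for this illustration.

```python
import math
from collections import defaultdict, deque

class IncrementalNCC:
    """Nearest centroid classifier over a sliding window of labeled instances."""

    def __init__(self, max_instances):
        self.max_instances = max_instances
        self.window = deque()
        self.sums = defaultdict(lambda: defaultdict(float))   # per-class sum vectors

    def predict(self, vec):
        """Label of the class whose centroid has the highest cosine similarity."""
        best, best_sim = None, -1.0
        for label, s in self.sums.items():
            norm = math.sqrt(sum(w * w for w in s.values()))
            if norm == 0:
                continue                       # class currently has no instances
            sim = sum(w * s.get(t, 0.0) for t, w in vec.items()) / norm
            if sim > best_sim:
                best, best_sim = label, sim
        return best

    def update(self, vec, label):
        """Add a newly labeled instance and subtract the outdated one, if any."""
        self.window.append((vec, label))
        for t, w in vec.items():
            self.sums[label][t] += w
        if len(self.window) > self.max_instances:
            old_vec, old_label = self.window.popleft()
            for t, w in old_vec.items():
                self.sums[old_label][t] -= w

ncc = IncrementalNCC(max_instances=2)
ncc.update({"soar": 1.0}, "POS")
ncc.update({"bankrupt": 1.0}, "NEG")
first = ncc.predict({"bankrupt": 1.0})
ncc.update({"plunge": 1.0}, "NEG")      # the oldest (POS) instance becomes outdated
second = ncc.predict({"plunge": 1.0})
print(first, second)
```

Each update touches only the nonzero terms of two vectors, so its cost does not depend on how many instances the window holds. With long streams, repeated floating-point additions and subtractions can accumulate small errors, so the sums may need to be recomputed from the window occasionally.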
Incremental k-means clustering
Converges in only a few iterations (warm start)
Other incremental methods
• Incremental SVM
A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
• Incremental perceptron
www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
• Incremental winnow
http://en.wikipedia.org/wiki/Winnow_%28algorithm%29
Document space visualization
[Figure: documents are projected from a space with several 1000 dimensions to 2D]
Document space visualization
Stages of the layout pipeline:
• Document corpus
• Corpus preprocessing
• k-means clustering
• Stress majorization
$$\mathbf{p}_i = \frac{\sum_{i<j} d_{i,j}^{-2} \left( \mathbf{p}_j + d_{i,j} \frac{\mathbf{p}_i - \mathbf{p}_j}{\|\mathbf{p}_i - \mathbf{p}_j\|} \right)}{\sum_{i<j} d_{i,j}^{-2}}$$
• Neighborhoods computation
• Least-squares interpolation
$$\arg\min_{\mathbf{X}} \{ \|\mathbf{A}\mathbf{X} - \mathbf{B}\|^2 \}$$
• Layout
Document space visualization
[Figure: the same layout pipeline (corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, layout) adapted to document streams. Annotations on the figure: Online BOW, Warm start (at three stages), Maintaining sorted lists, Parallelization, Pipelining]
Twitter
• Platform for sending
short messages
(similar to SMS)
• Est. 225 million users
• 100 million accounts
added in 2010
• 65 million tweets per day
Financial tweets
• Informal $ sign convention
• Some examples (March 19):
– User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash.We expect a dividend announcement
– User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
– User#3: Will there be any other news besides $AAPL dividend?
• We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT…)
• We analyze tweets to determine whether they contain positive or negative vocabulary
Sentiment classification
• Labeled documents
POS Financial markets are now officially open :)
POS market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
POS $AAPL : trust me -- AAPL will soar tomorrow
NEG Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
NEG omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
NEG @aekins that's just too bad
...
• Learn to classify
Labeled dataset -> Training Algorithm -> Classification Model
• Classify unlabeled documents
Unlabeled dataset -> Classification Algorithm (uses the Classification Model) -> Predictions (Labels)
Example: "So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last." -> NEG
Sentiment classification
• SVM
–
classifier
Goodnight everyoneeee :) Love yall
I have a good feeling about today ;)
& emoticons
ooo the ice cream van is here... yaaaaaay :D
–
–
0
+
0
–
–
in the garden in the sun! Just about to fill the pool!
–
happy days! :D
– coming :)
Finally got JSON in #processing to work. More playing around
0
+
+
@oanhLove
I hate when that happens... :-/ –
• Neutral
zone
–
No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
+
I hate when I have to call and wake people up :(
I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
–
0
+
– good all 0
UGHHHHHHHHHHHHHHH.. life is NOT
the time!!!!!! ;(
0
–
+
0
+
+
+
+
+
+
+
+
Lucca, Oct 2012
78
PART I • PART II
INTRO • DACQ • BOW • ML • APP
Sentiment classification
• SVM classifier & emoticons
• Neutral zone
• Explanations
“Sovereign debt and unemployment are big issues in EU.”
Word groups shown for this sentence: unemployed, issues, debt, eu | sovereign, big
• Accuracy
Preprocessing options (each row of the table corresponds to one combination, marked with X on the slide):
– Replace usernames with a token
– Replace URLs with a token
– Remove letter repetition
– Replace negations with a token
– Replace exclamation marks with a token
– Replace question marks with a token
Accuracy  Precision/recall  Average accuracy (10-fold cross validation)
81.06%    81.32%/81.32%     77.43%
80.22%    82.08%/78.02%     77.10%
79.94%    77.78%/84.62%     77.53%
79.94%    76.70%/86.81%     76.85%
79.67%    80.79%/78.57%     76.98%
78.83%    77.60%/81.87%     77.29%
78.55%    75.86%/84.62%     76.91%
78.55%    77.78%/80.77%     76.93%
78.27%    80.23%/75.82%     76.93%
78.27%    76.53%/82.42%     77.04%
77.44%    75.12%/82.97%     76.86%
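Several of the tweet preprocessing options can be sketched with regular expressions. The token names (`URL`, `USERNAME`, `EXCLAMATION`, `QUESTION`) and the choice to squeeze repeated letters down to two are assumptions for this illustration, since the slides only name the options; negation handling is omitted because it needs a language-specific word list.

```python
import re

def preprocess_tweet(text):
    """Apply a few of the preprocessing options to one tweet."""
    text = re.sub(r"https?://\S+", "URL", text)           # replace URLs with a token
    text = re.sub(r"@\w+", "USERNAME", text)              # replace usernames with a token
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)     # squeeze 3+ repeated letters to 2
    text = re.sub(r"!+", " EXCLAMATION ", text)           # replace exclamation marks
    text = re.sub(r"\?+", " QUESTION ", text)             # replace question marks
    return " ".join(text.split())                         # normalize whitespace

cleaned = preprocess_tweet("@aekins yaaaaaay!!! see http://t.co/x?a=1 ok?")
print(cleaned)
```

URLs are replaced first so that question marks inside a URL are not turned into tokens. Whether an option helps is an empirical question, which is why the tutorial compares combinations of options by accuracy.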
[Figure: Netflix stock closing price and tweet sentiment over time]
Grey: Netflix stock closing price
Green dots: Relevant events concerning Netflix
Blue: The number of positive tweets
Yellow: The difference between the positive and negative tweets
Red: The number of negative tweets
Annotations on the chart:
• First-quarter earnings release
• Plans to launch in 43 countries in Latin America and the Caribbean
• Netflix loses TV shows and films, Netflix loses the Starz deal
• Volume peaks likely represent important events
• Sentiment cross-over happens before price plunge
Presidential elections
http://predsedniskevolitve.si
Recap
• Basics
– What is text stream mining?
– Pipelining, parallelization
– Web data acquisition
– Online BOWs
• Machine learning
– Batch, incremental, offline, online
– Incremental nearest centroid classifier
– Incremental k-means
– Warm start
• Applications
– Online document space visualization
– Online Twitter sentiment classifier
• Stock sentiment monitoring
• Presidential elections