Natural Language Processing

Text Mining Overview
Piotr Gawrysiak
gawrysia@ii.pw.edu.pl
Warsaw University of Technology
Data Mining Group
22 November 2001
Topics
1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
   • Language processing methods
   • Single document processing
   • Document corpora processing
4. Document categorization – a closer look
5. Applications
   • Classic
   • Profiled document delivery
   • Related areas
     • Web Content Mining & Web Farming
Natural Language Processing
• Natural language – a test for Artificial Intelligence (Alan Turing)
• NLP and NLU
  • Natural language processing (NLP) – anything that deals with text content
  • Natural language understanding (NLU) – semantics and logic
• Linguistics – exploring the mysteries of a language
  • William Jones
  • Comparative linguistics – Jakob Grimm, Rasmus Rask
  • Noam Chomsky
    • I-Language and E-Language
    • poverty of stimulus
• Statistical approaches – Markov and Shannon
Information explosion
[Chart: number of books published weekly and number of articles published monthly, 1970–2000, log scale from 1 to 100000]
• Increasing popularity of the Internet as a publishing medium
• Electronic media's minimal duplication costs
Yet the available information retrieval and data management tools remain primitive.
Data Mining
Data Mining is understood as a process of automatically
extracting meaningful, useful, previously unknown and
ultimately comprehensible information from large
databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization
Knowledge pyramid
[Figure: pyramid with layers, bottom to top: Signals, Data, Information, Knowledge, Wisdom; semantic level increases towards the top, resources occupied increase towards the bottom; the Data Mining area is marked on the middle layers]
Text Mining – a definition
Text Mining is understood as a process of automatically
extracting meaningful, useful, previously unknown and
ultimately comprehensible information from textual
document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics
Text Mining tools
Language tools
• Linguistic analysis
• Thesauri, dictionaries, grammar analysers etc.
• Machine translation
Single document tools
• Automatic feature extraction
• Automatic summarization
Multiple document tools
• Document categorization
• Document clustering
• Information retrieval
• Visualization methods
Language analysis
• Syntactic analyser construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation
This is not as simple as it seems – consider for example:
   This is a delicious butter – here „butter” is a noun
   You should butter your toast – here „butter” is a verb
Implemented with rule-based systems or self-learning classification
systems (using VMM and HMM)
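A minimal sketch of part-of-speech tagging in Python (the slides name no toolkit; NLTK and its default tagger are an assumption for this illustration):

```python
# A POS-tagging sketch using NLTK (an assumption; not named in the slides).
# Requires: pip install nltk, plus the tokenizer and tagger models:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

for sentence in ["This is a delicious butter.", "You should butter your toast."]:
    tokens = nltk.word_tokenize(sentence)
    # The tagger typically labels "butter" NN (noun) in the first sentence
    # and VB (verb) in the second, resolving the ambiguity from context.
    print(nltk.pos_tag(tokens))
```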
Thesaurus construction
A thesaurus (semantic network) stores information about relationships
between terms:
• Ascriptor – Descriptor (AD) relations
• „Broader term” – „Narrower term” (BT) relations
• „Related term” (RT) relations
Construction can be manual (but this is a laborious process) or automatic.
[Figure: example thesaurus fragment linking „Cell phone”, „Telephone”, „Fax machine”, „Electronic mail”, „Telecommunications”, „Data transmission network” and „Post and telecom” with AD, BT and RT relations]
Example – knowing that „warship” is a broader term for the U.S.S. Nashville links these two sentences:
   The U.S.S. Nashville arrived in Colon harbour with 42 marines.
   With the warship in Colon harbour, the Colombian troops withdrew.
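As an illustration, a thesaurus can be stored as a simple relation table; the terms and edges below are assumptions for the sketch, not the slide's actual network:

```python
# A minimal sketch of a thesaurus as a relation store.
thesaurus = {
    ("cell phone", "telephone"): "AD",          # ascriptor -> descriptor
    ("telephone", "telecommunications"): "BT",  # broader term
    ("telephone", "fax machine"): "RT",         # related term
    ("u.s.s. nashville", "warship"): "BT",
}

def broader_terms(term: str) -> list[str]:
    """All terms related to `term` by a 'broader term' relation."""
    return [b for (a, b), rel in thesaurus.items() if a == term and rel == "BT"]

# "warship" in the second sentence can be linked back to the Nashville:
print(broader_terms("u.s.s. nashville"))  # ['warship']
```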
Machine translation
Problems:
• Different vocabularies
• Different grammars and flexion rules
• Even different character sets
Translation levels (source: Polish, target: English; the Polish „szybka” is
both the noun „window-pane” and the feminine adjective „quick”):
• Word level:                W łóżku jest szybka → In bed is window-pane
• Syntactic level:           W łóżku jest szybka → She is a window-pane in bed
• Semantic level:            W łóżku jest szybka → She is quick in bed
• Knowledge representation:  W łóżku jest szybka → She is quick in bed
  (via a formal knowledge representation language)
Fully automatic approach
Based on learning word usage patterns from
large corpora of translated documents (bitext)
Problems:
• Still quite few bitexts exist
• Sentences must be aligned prior to learning
  • Keyword matching
  • Sentence length based alignment
• Parameterisation is necessary, e.g.
  Książka okazała się <adjective> ↔ The book turned out to be <adjective>
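A toy sketch of sentence-length-based alignment (a much simplified relative of the Gale–Church idea; the midpoint-matching heuristic and the example sentences are assumptions):

```python
# Toy sentence-length-based alignment: 1-1 pairings only.
def align_by_length(src_sents: list[str], tgt_sents: list[str]):
    """Pair each source sentence with the target sentence whose relative
    position in cumulative character length is closest."""
    def cum_positions(sents):
        total = sum(len(s) for s in sents)
        pos, out = 0, []
        for s in sents:
            out.append((pos + len(s) / 2) / total)  # normalised midpoint
            pos += len(s)
        return out

    src_pos, tgt_pos = cum_positions(src_sents), cum_positions(tgt_sents)
    return [
        (s, min(zip(tgt_pos, tgt_sents), key=lambda t: abs(t[0] - p))[1])
        for s, p in zip(src_sents, src_pos)
    ]

pairs = align_by_length(
    ["Książka okazała się nudna.", "Odłożyłem ją."],
    ["The book turned out to be boring.", "I put it away."],
)
for s, t in pairs:
    print(s, "↔", t)
```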
Feature extraction
Not all words are equally important. Interesting features include:
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers
Examples: „Data bases” → „Databases”, „Knowledge discovery in databases”,
„Micro$oft” → „Microsoft”, „MineIT”
Discovering important terms:
• Finding lexical affinities („Knowledge discovery in databases”)
• Gap variance measurement („Knowledge discovery in large databases”,
  „Knowledge discovery in big databases”)
• Dictionary-based methods
• Grammar based heuristics
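A sketch of gap variance measurement on the example phrases above: word pairs that co-occur at a near-constant distance are multiword-term candidates (the thresholding is an assumption):

```python
# Gap variance sketch: stable-distance word pairs suggest multiword terms.
from itertools import combinations
from statistics import pvariance
from collections import defaultdict

sentences = [
    "knowledge discovery in databases",
    "knowledge discovery in large databases",
    "knowledge discovery in big databases",
]

gaps = defaultdict(list)
for s in sentences:
    words = s.split()
    for (i, a), (j, b) in combinations(enumerate(words), 2):
        gaps[(a, b)].append(j - i)  # positional gap between the two words

# Pairs seen more than once, sorted by gap variance (low variance first):
repeated = {p: g for p, g in gaps.items() if len(g) > 1}
for pair, g in sorted(repeated.items(), key=lambda kv: pvariance(kv[1])):
    print(pair, "gaps:", g, "variance:", round(pvariance(g), 2))
# ('knowledge', 'discovery') has variance 0 - a strong term candidate.
```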
Document summarization
Summaries can be divided along two axes:
• Abstracts vs. extracts
• Indicative vs. informative summaries
Summary creation methods:
• statistical analysis of sentence and word frequency + dictionary
  analysis (e.g. „abstract”, „conclusion” words etc.)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question-answer patterns, formatting,
  vocabulary shifts etc.)
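A minimal sketch of the first method, frequency-based extractive summarisation (the scoring heuristic is an assumption):

```python
# Extractive summarisation sketch: score each sentence by the document-level
# frequency of its words and keep the top-scoring ones as the extract.
import re
from collections import Counter

def summarise(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence: str) -> float:
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original order of the selected sentences.
    return " ".join(s for s in sentences if s in top)
```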
Document categorization & clustering
NOV 2001
Clustering – dividing set of documents into groups
Categorization – grouping based on predefined category scheme
Repository
Typical categorization scenario
Step 1 : Create training hierarchy
Step 2 : Perform training
Class 1
Class 2
Step 3 : Actual classification
Unknown document
Class
fingerprints
categorization
Categorization/clustering system
Processing pipeline:
Documents → Representation conversion → Representation processing →
Deriving metrics → Classic DM algorithm
• Clustering – k-means, agglomerative, ...
• Categorization – kNN, DT, Bayes, ...
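A minimal sketch of this pipeline using scikit-learn (a modern library assumed for illustration; the slides name only the algorithm families):

```python
# Pipeline sketch: bag-of-words conversion + TF-IDF scaling + kNN classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["stocks fell sharply", "the match ended in a draw",
              "bonds rallied today", "the striker scored twice"]
train_labels = ["finance", "sport", "finance", "sport"]

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(train_docs, train_labels)
print(model.predict(["shares rallied"]))  # -> ['finance']
```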
Information retrieval
Two types of search methods:
• exact match – in most cases uses some simple Boolean
  query specification language
• fuzzy – uses statistical methods to estimate the relevance of
  a document
Modern IR tools seem to be very effective...
• 1999 data – Scooter (AltaVista's crawler): 1.5 GB RAM, 30 GB disk, 4x533 MHz
  Alpha, 1 GB/s I/O – about 1 month needed to recrawl
• 2000 data – only 40-50% of the Web indexed at all
IR – exact match
Most popular method – inverted files
[Figure: alphabetical index (a, b, c, d, ..., z) mapping each term to the list of documents containing it]
• Very fast
• Boolean queries very easy to process
• Very simple
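A minimal sketch of an inverted file, with Boolean queries as set operations (the toy documents are assumptions):

```python
# Inverted file sketch: map each term to the set of document ids
# containing it, so Boolean queries become set operations.
from collections import defaultdict

docs = {0: "text mining overview", 1: "data mining tools", 2: "mining text data"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean AND query: intersect posting lists.
print(index["text"] & index["mining"])    # {0, 2}
# Boolean OR query: union of posting lists.
print(index["data"] | index["overview"])  # {0, 1, 2}
```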
IR – fuzzy search
Documents are represented as vectors over a word (feature) space.
A query can be a set of keywords, a document, or even a set of
documents – also represented as a vector.

$\mathrm{sim}(D_i, Q) = \cos(D_i, Q) = \dfrac{\sum_{l=1}^{k} d_{il}\, q_l}{\sqrt{\sum_{l=1}^{k} d_{il}^2}\ \sqrt{\sum_{l=1}^{k} q_l^2}}$
[Diagram: the initial query is run by the IR engine against the repository; the user selects relevant documents from the output, and the refined query produces new output]
It is possible to perform this iteratively – relevance feedback.
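A direct sketch of the cosine similarity formula above, for a document and query represented as term-frequency vectors over a shared word space:

```python
# Cosine similarity over a fixed word (feature) space.
import math

def cosine(d: list[float], q: list[float]) -> float:
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

# word space: ["text", "mining", "data"]
doc   = [2.0, 1.0, 0.0]
query = [1.0, 1.0, 1.0]
print(round(cosine(doc, query), 3))  # ~0.775
```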
Document visualization
[Figure: document landscape]
• An island represents several documents sharing a similar
  subject and separated from the others – hence forming a group
  of interest
• Water represents assorted documents, creating semantic
  noise
• A peak represents many strongly related documents
Document categorization
A closer look
Measuring quality
The binary categorization scenario is analogous to document retrieval:
• DB – document database
• dr – relevant documents
• ds – documents labelled as relevant
[Figure: Venn diagram of ds and dr inside DB]

$PR = \dfrac{|ds \cap dr|}{|ds|}$  (precision)

$R = \dfrac{|ds \cap dr|}{|dr|}$  (recall)

$A = \dfrac{|ds \cap dr| + |DB \setminus (ds \cup dr)|}{|DB|}$  (accuracy)

$FO = \dfrac{|ds \setminus dr|}{|DB \setminus dr|}$  (fallout)
Metrics
For a classifier f evaluated against ground truth g, with contingency
counts a (true positives), b (false positives), c (false negatives)
and d (true negatives):

$R(f, g) = \dfrac{a}{a+c}; \quad a+c > 0 \Rightarrow 0 \le R(f,g) \le 1$

$PR(f, g) = \dfrac{a}{a+b}; \quad a+b > 0 \Rightarrow 0 \le PR(f,g) \le 1$

$FO(f, g) = \dfrac{b}{b+d}; \quad b+d > 0 \Rightarrow 0 \le FO(f,g) \le 1$

$A(f, g) = \dfrac{a+d}{a+b+c+d}$

$F_\alpha = \dfrac{1}{\alpha\frac{1}{PR} + (1-\alpha)\frac{1}{R}}$
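A small sketch computing these metrics from the four contingency counts:

```python
# Quality metrics from binary contingency counts.
def metrics(a: int, b: int, c: int, d: int, alpha: float = 0.5):
    """a=true pos., b=false pos., c=false neg., d=true neg."""
    precision = a / (a + b)
    recall = a / (a + c)
    fallout = b / (b + d)
    accuracy = (a + d) / (a + b + c + d)
    f_alpha = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, fallout, accuracy, f_alpha

# alpha = 0.5 gives the usual F1 (harmonic mean of precision and recall).
print(metrics(a=40, b=10, c=20, d=30))
```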
Multiple class scenario
With classes $M = \{M_1, M_2, \dots, M_l\}$, each class $M_k$ has its own
contingency table and per-class metrics, e.g. $PR = \{PR_1, PR_2, \dots, PR_l\}$.
• Micro-averaging – sum the contingency tables over all classes, then
  compute the metric from the pooled counts
• Macro-averaging – compute the metric per class and average:

$PR^{ma}(f, g) = \dfrac{\sum_{i=1}^{l} PR_i}{l}$
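A sketch contrasting the two averaging schemes on assumed per-class counts:

```python
# Micro- vs. macro-averaged precision (a = true pos., b = false pos. per class).
classes = {"finance": (40, 10), "sport": (5, 15), "weather": (30, 0)}

# Macro: average of per-class precisions (each class weighs equally).
macro = sum(a / (a + b) for a, b in classes.values()) / len(classes)

# Micro: pool the counts first (frequent classes dominate).
A = sum(a for a, _ in classes.values())
B = sum(b for _, b in classes.values())
micro = A / (A + B)

print(round(macro, 3), round(micro, 3))  # 0.683 vs. 0.75
```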
Categorization example
[Figure: categorization example]
Document representations
• unigram representations (bag-of-words)
  • binary
  • multivariate
• n-gram representations
• -gram representation
• positional representation
WUT
DMG
Bigram example
Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
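A minimal sketch of extracting the bigram representation of this couplet:

```python
# Bigram extraction: adjacent word pairs of the text above.
text = ("Twas brillig, and the slithy toves "
        "Did gyre and gimble in the wabe")

words = [w.strip(",.").lower() for w in text.split()]
bigrams = list(zip(words, words[1:]))
print(bigrams[:5])
# [('twas', 'brillig'), ('brillig', 'and'), ('and', 'the'),
#  ('the', 'slithy'), ('slithy', 'toves')]
```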
Probabilistic interpretation
Operations:
• R(D) – creating representation R from document D
• G(R) – generating a document based on representation R

$R(G(R(D))) \approx R(D)$

Text generated from a unigram model:
   said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to
   brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and
   her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only
   sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut
   prisoner no
Text generated from a bigram model:
   Consider your white queen shook his head and rang through my punishments. She ought to me and alice
   said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one
   on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and
   without the room in a thing that a king and butter.
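A sketch of G(R) for a bigram representation (the tiny training text is a stand-in for the slides' corpus, which appears to be Lewis Carroll):

```python
# Generating text from a bigram representation built by R(D).
import random
from collections import defaultdict

def build_bigram_model(words: list[str]) -> dict[str, list[str]]:
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)  # successors kept with multiplicity = frequency
    return model

def generate(model: dict[str, list[str]], start: str, length: int = 12) -> str:
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

words = "the queen shook her head and the king said nothing to the queen".split()
print(generate(build_bigram_model(words), start="the"))
```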
Positional representation
[Plot: position in the text (0–35000) of successive occurrences (0–60) of the words „any” and „Dumpty”]
Creating positional representation
For each word $v_i \in V$ and text position $k$, occurrences are counted within
a window of radius $r$ and normalised by the window width $2r$ and the number
of occurrences $n_i$:

$f_{v_i}(k) = \dfrac{1}{2r\, n_i} \sum_{j=k-r}^{k+r} \delta_j, \qquad \delta_j = \begin{cases} 1 & \text{when } w_j = v_i,\ v_i \in V \\ 0 & \text{otherwise} \end{cases}$

[Figure: word occurrences of $v_i$ along the position axis; the window around $k$ covers two occurrences, so $f(k) = 2$ before normalisation]
WUT
DMG
Examples
[Plots: positional density $f_{any}$ (values up to about 0.00025) and $f_{dumpty}$ (values up to about 0.0004), each computed for r=500 and r=5000]
Processing representations
Zipf's law – a few words are extremely frequent, while most words are rare.
The most frequent words in the corpus:

Word  Frequency
The   1664
And   940
To    789
A     788
It    683
You   666
I     658
She   543
Of    538
said  473

[Plot: word frequency vs. word ID, log scale, for about 3500 distinct words]

Stopwords? Removing them can destroy meaning:
   „There is no information about penguins in this document”
   → „information penguins document”
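A sketch of checking Zipf's law on a corpus (corpus.txt is a hypothetical input file): rank times frequency should stay roughly constant:

```python
# Zipf's law check: rank words by frequency and inspect rank * frequency.
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()  # hypothetical file
counts = Counter(text.split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:2d} {word:10s} {freq:6d} rank*freq={rank * freq}")
```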
Expanding and trimming
• Expanding
• Trimming
• Scaling functions
• Attribute selection
• Remapping attribute space
Representation processing
Expanding – smoothing the n-gram probability estimates:

Laplace: $P_{lap}(v_i \mid w_{k+1}, \dots, w_{k+n-1}) = \dfrac{M_{x,y} + 1}{\sum_{j=1}^{s} M_{x,j} + s}$

Lidstone: $P_{lid}(v_i \mid w_{k+1}, \dots, w_{k+n-1}) = \dfrac{M_{x,y} + \lambda}{\sum_{j=1}^{s} M_{x,j} + s\lambda}$

Scaling – TF/IDF:
term frequency $tf_i$, document frequency $df_i$,
$N$ – number of all documents in the system

$\omega_{lln}(w_i, d_j) = (1 + \log tf_{ij}) \cdot \log \dfrac{N}{df_i}$

Attribute present in only one document:
$\omega_{lln}(w_i, d_j) = (1 + \log tf_{ij}) \cdot \log \dfrac{N}{1} = (1 + \log tf_{ij}) \cdot \log N$

Attribute present in all documents:
$\omega_{lln}(w_i, d_j) = (1 + \log tf_{ij}) \cdot \log \dfrac{N}{N} = (1 + \log tf_{ij}) \cdot 0 = 0$
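A sketch of the lln weighting, showing the two boundary cases above:

```python
# TF/IDF (lln variant): terms present in every document get weight 0,
# rare terms get boosted.
import math

def lln_weight(tf: int, df: int, n_docs: int) -> float:
    """(1 + log tf) * log(N / df); 0 when the term is absent (tf = 0)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)

N = 1000
print(lln_weight(tf=3, df=1, n_docs=N))  # rare term: large weight
print(lln_weight(tf=3, df=N, n_docs=N))  # term in all documents: 0.0
```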
Attribute selection
Example – Information Gain:

$IG(w_i) = -\sum_{j=1}^{l} P(k_j) \log P(k_j) + P(w_i) \sum_{j=1}^{l} P(k_j \mid w_i) \log P(k_j \mid w_i) + P(\bar{w}_i) \sum_{j=1}^{l} P(k_j \mid \bar{w}_i) \log P(k_j \mid \bar{w}_i)$

P(wi) – probability of encountering attribute wi in a randomly selected
document
P(kj) – probability that a randomly selected document belongs to class kj
P(kj|wi) – probability that a document selected from those containing wi
belongs to class kj

Statistical tests can also be applied to check whether a
feature – class correlation exists
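A sketch of Information Gain computed via the equivalent entropy form (the toy documents and labels are assumptions):

```python
# Information Gain of attribute w over labelled documents:
# IG(w) = H(K) - P(w) H(K|w) - P(not w) H(K|not w).
import math
from collections import Counter

def information_gain(docs: list[set[str]], labels: list[str], w: str) -> float:
    n = len(docs)
    def entropy(lbls):
        return -sum((c / len(lbls)) * math.log(c / len(lbls))
                    for c in Counter(lbls).values()) if lbls else 0.0
    with_w = [lab for doc, lab in zip(docs, labels) if w in doc]
    without_w = [lab for doc, lab in zip(docs, labels) if w not in doc]
    return (entropy(labels)
            - len(with_w) / n * entropy(with_w)
            - len(without_w) / n * entropy(without_w))

docs = [{"stocks", "fell"}, {"match", "draw"},
        {"stocks", "rallied"}, {"striker", "scored"}]
labels = ["finance", "sport", "finance", "sport"]
print(information_gain(docs, labels, "stocks"))  # perfectly separating: ~0.69
```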
Attribute space remapping
• Attribute clustering
  • Semantic clustering
  • Attribute – class clustering
  • Clustering according to density function similarity
• Representation matrix processing (example – SVD)
Applications
• Classic
  • Mail analysis and mail routing
  • Event tracking
• Internet related
  • Web Content Mining and Web Farming
  • Focused crawling and assisted browsing
Thank you