Text Mining Overview
Piotr Gawrysiak
gawrysia@ii.pw.edu.pl
Warsaw University of Technology, Data Mining Group
22 November 2001

Topics
1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
   • Language processing methods
   • Single document processing
   • Document corpora processing
4. Document categorization – a closer look
5. Applications
   • Classic
   • Profiled document delivery
   • Related areas
      • Web Content Mining & Web Farming

Natural Language Processing
• Natural language as a test for Artificial Intelligence – Alan Turing
• NLP and NLU
   • Natural language processing (NLP) – anything that deals with text content
   • Natural language understanding (NLU) – semantics and logic
• Linguistics – exploring the mysteries of language
   • William Jones
   • Comparative linguistics – Jakob Grimm, Rasmus Rask
   • Noam Chomsky – I-Language and E-Language, poverty of stimulus
• Statistical approaches – Markov and Shannon

Information explosion
[Chart: number of books published weekly and number of articles published monthly, 1970–2000, on a logarithmic scale from 1 to 100,000.]
• Increasing popularity of the Internet as a publishing medium
• Electronic media's minimal duplication costs
• Primitive information retrieval and data management tools

Data Mining
Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization

Knowledge pyramid
[Diagram: a pyramid with layers Signals, Data, Information, Knowledge and Wisdom; the semantic level rises towards the top while the resources occupied shrink. Data Mining operates in the middle layers.]

Text Mining – a definition
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics

Text Mining tools
• Language tools
   • Linguistic analysis
   • Thesauri, dictionaries, grammar analysers etc.
   • Machine translation
• Single document tools
   • Automatic feature extraction
   • Automatic summarization
• Multiple document tools
   • Document categorization
   • Document clustering
   • Information retrieval
   • Visualization methods

Language analysis
• Syntactic analyser construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation
This is not that simple – consider for example:
   "This is a delicious butter" – noun
   "You should butter your toast" – verb
Rule-based systems or self-learning classification systems (using VMM and HMM) are used; a toy sketch of the statistical approach follows below.
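The ambiguity above is what a statistical tagger resolves from context. Below is a minimal sketch, not from the original talk: a hand-made hidden Markov model whose probabilities are invented solely so that Viterbi decoding tags "butter" as a noun in the first slide sentence and as a verb in the second.

```python
# Toy HMM for part-of-speech disambiguation; all probabilities are invented.
TAGS = ["PRON", "VERB", "MODAL", "DET", "ADJ", "NOUN"]
START = {"PRON": 1.0}                                    # P(first tag)
TRANS = {("PRON", "VERB"): 0.5, ("PRON", "MODAL"): 0.5,  # P(tag | previous tag)
         ("VERB", "DET"): 1.0, ("MODAL", "VERB"): 1.0,
         ("DET", "ADJ"): 0.5, ("DET", "NOUN"): 0.5,
         ("ADJ", "NOUN"): 1.0}
EMIT = {"PRON": {"this": 0.5, "you": 0.5},               # P(word | tag)
        "VERB": {"is": 0.5, "butter": 0.5},
        "MODAL": {"should": 1.0},
        "DET": {"a": 0.5, "your": 0.5},
        "ADJ": {"delicious": 1.0},
        "NOUN": {"butter": 0.5, "toast": 0.5}}

def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # best[t] = (probability, tag path) of the best partial parse ending in t
    best = {t: (START.get(t, 0.0) * EMIT[t].get(words[0], 0.0), [t])
            for t in TAGS}
    for w in words[1:]:
        step = {}
        for t in TAGS:
            p, prev = max((best[s][0] * TRANS.get((s, t), 0.0), s)
                          for s in TAGS)
            step[t] = (p * EMIT[t].get(w, 0.0), best[prev][1] + [t])
        best = step
    return max(best.values())[1]

print(viterbi("this is a delicious butter".split()))
# ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']   -> butter tagged as a noun
print(viterbi("you should butter your toast".split()))
# ['PRON', 'MODAL', 'VERB', 'DET', 'NOUN'] -> butter tagged as a verb
```

Real taggers learn these tables from annotated corpora instead of hand-coding them; the decoding step is the same.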
Thesaurus construction
A thesaurus (semantic network) stores information about relationships between terms:
• Ascriptor – Descriptor relations
• "Broader term" – "Narrower term" relations
• "Related term" relations
[Diagram: "Telephone" connected through AD, BT and RT links with "Cell phone", "Fax machine", "Electronic mail", "Telecommunications", "Data transmission network" and "Post and telecom".]
An example of why such knowledge matters: connecting "The U.S.S. Nashville arrived in Colon harbour with 42 marines" with "With the warship in Colon harbour, the Colombian troops withdrew" requires knowing that the Nashville is a warship.
Construction can be manual (but this is a laborious process) or automatic.

Machine translation
Problems:
• Different vocabularies
• Different grammars and inflection rules
• Even different character sets
Translating the Polish sentence "W łóżku jest szybka" into English ("szybka" is both the noun "window-pane" and the feminine adjective "quick"):
• Word level: "In bed is window-pane"
• Syntactic level: "She is a window-pane in bed"
• Semantic level: "She is quick in bed"
• Knowledge representation level (via a formal knowledge representation language): "She is quick in bed"

Fully automatic approach
Based on learning word usage patterns from large corpora of translated documents (bitext).
Problems:
• Relatively few bitexts exist as yet
• Sentences must be aligned prior to learning
   • Keyword matching
   • Sentence-length-based alignment
• Parameterisation is necessary, e.g. "Książka okazała się <adjective>" ↔ "The book turned out to be <adjective>"

Feature extraction
Not all words are equally important – consider "data bases" vs. "databases", "knowledge discovery in databases", "Micro$oft", "MineIT", "Microsoft". Interesting features include:
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers
Discovering important terms:
• Finding lexical affinities
• Gap variance measurement (the same term appears as "knowledge discovery in databases", "knowledge discovery in large databases", "knowledge discovery in big databases")
• Dictionary-based methods
• Grammar-based heuristics

Document summarization
Summaries can be divided into:
• abstracts vs. extracts
• indicative vs. informative summaries
Summary creation methods:
• statistical analysis of sentence and word frequency + dictionary analysis (e.g. "abstract", "conclusion" and similar cue words)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question–answer patterns, formatting, vocabulary shifts etc.)

Document categorization & clustering
• Clustering – dividing a set of documents into groups
• Categorization – grouping based on a predefined category scheme
Typical categorization scenario:
• Step 1: Create the training hierarchy (repository split into classes, e.g. Class 1, Class 2)
• Step 2: Perform training (compute a fingerprint for each class)
• Step 3: Actual classification (match an unknown document against the class fingerprints)

Categorization/clustering system
Documents → representation conversion → representation processing → deriving metrics → classic DM algorithm:
• Clustering – k-means, agglomerative, ...
• Categorization – kNN, decision trees, Bayes, ...

Information retrieval
Two types of search methods:
• exact match – in most cases uses some simple Boolean query specification language
• fuzzy – uses statistical methods to estimate the relevance of documents
Modern IR tools seem to be very effective...
• 1999 data – Scooter (AltaVista's crawler): 1.5 GB RAM, 30 GB disk, 4x533 MHz Alpha, 1 GB/s I/O – about 1 month needed to recrawl
• 2000 data – only 40–50% of the Web indexed at all

IR – exact match
The most popular method – inverted files: for every term (a, b, c, d, ..., z) the index stores the list of documents that contain it.
• Very fast
• Boolean queries very easy to process
• Very simple

IR – fuzzy search
Documents are represented as vectors over a word (feature) space. A query can be a set of keywords, a document, or even a set of documents – also represented as a vector. Similarity is measured by the cosine:

$\mathrm{sim}(D_i, Q) = \cos(D_i, Q) = \dfrac{\sum_{l=1}^{k} d_{il}\, q_l}{\sqrt{\sum_{l=1}^{k} d_{il}^2}\;\sqrt{\sum_{l=1}^{k} q_l^2}}$

[Diagram: initial query → repository → IR → output → selection → refined query; the loop can be performed iteratively – relevance feedback.]
Minimal sketches of both search methods follow below.
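First, a minimal sketch of the inverted file from the exact-match slide, with an invented three-document corpus; a Boolean AND query becomes an intersection of posting sets.

```python
from collections import defaultdict

# Invented toy corpus: document id -> text.
docs = {1: "knowledge discovery in databases",
        2: "discovery of text patterns",
        3: "text mining and data mining"}

# Build the inverted file: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def query_and(*terms):
    """Boolean AND: intersect the posting sets of all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(query_and("discovery", "databases"))  # {1}
print(query_and("text", "mining"))          # {3}
```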
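And a companion sketch of the fuzzy search: the cosine formula above applied to raw term-frequency vectors. The corpus is again invented; a real system would use scaled weights (e.g. the TF/IDF scaling discussed later) rather than raw counts.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector of a text, as a sparse Counter."""
    return Counter(text.split())

def cosine(d, q):
    """sim(D, Q) = sum(d_l * q_l) / (|D| * |Q|), as on the slide."""
    dot = sum(d[t] * q[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

corpus = ["text mining extracts knowledge from text",
          "data mining works on large databases",
          "recipes for toast with butter"]
query = vectorize("text mining")

for doc in sorted(corpus, key=lambda d: cosine(vectorize(d), query),
                  reverse=True):
    print(round(cosine(vectorize(doc), query), 3), "|", doc)
```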
Document visualization
A document landscape:
• a peak represents many strongly related documents
• an island represents several documents sharing a similar subject and separated from the others – hence forming a group of interest
• water represents assorted documents, creating semantic noise
[Figure: an example document-landscape visualization.]

Document categorization – a closer look

Measuring quality
The binary categorization scenario is analogous to document retrieval:
• DB – document database
• $d_r$ – relevant documents
• $d_s$ – documents labelled as relevant

$PR = \dfrac{|d_s \cap d_r|}{|d_s|}$,  $R = \dfrac{|d_s \cap d_r|}{|d_r|}$,  $FO = \dfrac{|d_s \setminus d_r|}{|DB \setminus d_r|}$,  $A = \dfrac{|d_s \cap d_r| + |DB \setminus (d_s \cup d_r)|}{|DB|}$

Metrics
With $a = |d_s \cap d_r|$, $b = |d_s \setminus d_r|$, $c = |d_r \setminus d_s|$, $d = |DB \setminus (d_s \cup d_r)|$, for a classifier $f$ and ground truth $g$:
• recall: $R(f,g) = \frac{a}{a+c}$, taken as 1 when $a+c=0$
• precision: $PR(f,g) = \frac{a}{a+b}$, taken as 1 when $a+b=0$
• fallout: $FO(f,g) = \frac{b}{b+d}$, taken as 1 when $b+d=0$
• accuracy: $A(f,g) = \frac{a+d}{a+b+c+d}$
• F-measure: $F_\alpha = \dfrac{1}{\alpha\frac{1}{PR} + (1-\alpha)\frac{1}{R}}$

Multiple class scenario
With $l$ classes we get one contingency matrix $M_k$ per class, $M = \{M_1, M_2, \dots, M_l\}$, and per-class metrics, e.g. $PR = \{PR_1, PR_2, \dots, PR_l\}$.
• Macro-averaging – average the per-class values: $^{ma}PR(f,g) = \frac{1}{l}\sum_{i=1}^{l} PR_i$
• Micro-averaging – merge the contingency matrices first, then compute the metric once

Categorization example
[Figure: a sample categorization result.]

Document representations
• unigram representations (bag-of-words)
   • binary
   • multivariate
• n-gram representations
• -gram representation
• positional representation

Bigram example
"'Twas brillig, and the slithy toves / Did gyre and gimble in the wabe"

Probabilistic interpretation
Operations:
• R(D) – creating representation R from document D
• G(R) – generating a document based on representation R
$R(G(R(D))) = R(D)$
Text generated from a unigram model: "said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no"
Text generated from a bigram model: "Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter."

Positional representation
[Chart: positions (0–35,000) of the successive occurrences (0–60) of the words "any" and "Dumpty" within a document.]

Creating positional representation
For a document $w_1 \dots w_n$ and a word $v_i \in V$, define the indicator

$I_{v_i}(j) = \begin{cases} 1 & \text{if } w_j = v_i \\ 0 & \text{otherwise} \end{cases}$

and the window density

$f_{v_i}(k) = \frac{1}{2r} \sum_{j=k-r}^{k+r} I_{v_i}(j)$

[Chart: occurrences of $v_i$ along the position axis and the resulting window; in the illustrated window $f(k)=2$ before normalisation.]
A sketch of this computation follows below.
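A small sketch of the density just defined, under the reading reconstructed above: a rectangular window of radius r around each position, with the final document-level normalisation omitted. The sample sentence is invented.

```python
def positional_density(words, v, r):
    """f_v(k): occurrences of word v within radius r of position k,
    scaled by 1/(2r); document-level normalisation is omitted."""
    n = len(words)
    return [sum(1 for w in words[max(0, k - r):min(n, k + r + 1)] if w == v)
            / (2 * r)
            for k in range(n)]

text = "humpty dumpty sat on a wall humpty dumpty had a great fall"
print([round(f, 2) for f in positional_density(text.split(), "dumpty", r=3)])
# Values peak where occurrences of "dumpty" cluster and fall to 0 elsewhere.
```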
Examples
[Charts: the densities $f_{any}$ and $f_{dumpty}$ computed with window radii r=500 and r=5000.]

Processing representations
Zipf's law – a few words account for most occurrences:

Word   Frequency
The    1664
And    940
To     789
A      788
It     683
You    666
I      658
She    543
Of     538
said   473

[Chart: frequency vs. word ID on a log scale – the characteristic Zipf curve.]
Stopwords? Removing the most frequent words shrinks the representation, but meaning can be lost:
"There is no information about penguins in this document" → "information penguins document"

Expanding and trimming
• Expanding
• Trimming
   • Scaling functions
   • Attribute selection
   • Remapping attribute space

Representation processing – expanding
Laplace and Lidstone smoothing of n-gram counts, where $M_{x,y}$ is the number of times word $y$ follows context $x = (w_{k-1},\dots,w_{k-n+1})$ and $s$ is the vocabulary size:

$P_{lap}(v_i \mid w_{k-1},\dots,w_{k-n+1}) = \dfrac{M_{x,y} + 1}{\sum_{j=1}^{s} M_{x,j} + s}$

$P_{lid}(v_i \mid w_{k-1},\dots,w_{k-n+1}) = \dfrac{M_{x,y} + \lambda}{\sum_{j=1}^{s} M_{x,j} + \lambda s}$

Scaling – TF/IDF
With term frequency $tf_{ij}$, document frequency $df_i$ and $N$ – the number of all documents in the system:

$lln(w_i, d_j) = (1 + \log tf_{ij}) \cdot \log\dfrac{N}{df_i}$

• attribute present in only one document: $lln = (1 + \log tf_{ij}) \log N$
• attribute present in all documents: $lln = (1 + \log tf_{ij}) \cdot 0 = 0$
(A worked sketch appears as backup 1 at the end of the deck.)

Attribute selection
Example – Information Gain:

$IG(w_i) = -\sum_{j=1}^{l} P(k_j)\log P(k_j) + P(w_i)\sum_{j=1}^{l} P(k_j \mid w_i)\log P(k_j \mid w_i) + P(\overline{w_i})\sum_{j=1}^{l} P(k_j \mid \overline{w_i})\log P(k_j \mid \overline{w_i})$

• $P(w_i)$ – probability of encountering attribute $w_i$ in a randomly selected document
• $P(k_j)$ – probability that a randomly selected document belongs to class $k_j$
• $P(k_j \mid w_i)$ – probability that a document selected from those containing $w_i$ belongs to class $k_j$
Statistical tests can also be applied to check whether a feature–class correlation exists. (A worked sketch appears as backup 2 at the end of the deck.)

Attribute space remapping
• Attribute clustering
   • Semantic clustering
   • Attribute–class clustering
   • Clustering according to density function similarity
• Representation matrix processing (example – SVD)

Applications
• Classic
   • Mail analysis and mail routing
   • Event tracking
• Internet related
   • Web Content Mining and Web Farming
   • Focused crawling and assisted browsing

Thank you
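Backup 1 – a minimal sketch of the lln scaling from the TF/IDF slide, on an invented three-document corpus; it reproduces the two boundary cases noted there (a term in a single document gets weight $(1+\log tf)\log N$, a term in every document gets weight 0).

```python
import math
from collections import Counter

# Invented corpus of three tiny documents.
corpus = ["knowledge discovery in databases",
          "text databases",
          "databases store data"]
N = len(corpus)
tf = [Counter(doc.split()) for doc in corpus]           # tf_ij per document
df = Counter(term for doc_tf in tf for term in doc_tf)  # df_i across corpus

def lln(term, j):
    """lln(w_i, d_j) = (1 + log tf_ij) * log(N / df_i)."""
    if tf[j][term] == 0:
        return 0.0
    return (1 + math.log(tf[j][term])) * math.log(N / df[term])

print(round(lln("knowledge", 0), 3))  # in one document only: log(3) ~ 1.099
print(round(lln("databases", 0), 3))  # in every document: weight 0.0
```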
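Backup 2 – a minimal sketch of information gain from the attribute-selection slide, computed on invented labelled documents in the equivalent entropy form $IG(w) = H(K) - P(w)H(K \mid w) - P(\overline{w})H(K \mid \overline{w})$.

```python
import math

# Invented labelled documents: (set of terms, class).
docs = [({"money", "offer"}, "spam"), ({"money", "meeting"}, "spam"),
        ({"meeting", "agenda"}, "ham"), ({"agenda", "offer"}, "ham")]

def entropy(labels):
    """H(K) = -sum over classes k of P(k) * log2 P(k)."""
    n = len(labels)
    return -sum(p * math.log(p, 2)
                for c in set(labels)
                for p in [labels.count(c) / n])

def information_gain(word):
    labels = [c for _, c in docs]
    with_w = [c for terms, c in docs if word in terms]
    without_w = [c for terms, c in docs if word not in terms]
    p_w = len(with_w) / len(docs)
    ig = entropy(labels)
    if with_w:                      # subtract P(w) * H(K | w)
        ig -= p_w * entropy(with_w)
    if without_w:                   # subtract P(~w) * H(K | ~w)
        ig -= (1 - p_w) * entropy(without_w)
    return ig

print(round(information_gain("money"), 3))  # 1.0 -- perfectly separates classes
print(round(information_gain("offer"), 3))  # 0.0 -- appears once in each class
```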