Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu

Text Mining and Text Stream Mining Tutorial
Miha Grčar, miha.grcar@ijs.si
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
http://kt.ijs.si
Lucca, Oct 2012

Text and text stream mining tutorial
• Part I: Text mining
• Part II: Text stream mining
Simple, pragmatic, focused

Part I: Text mining

What is text mining?
• Text mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from large collections of textual documents
• Text mining adopts and adapts methodologies and tools from …
– Data mining (DM)
– Machine learning (ML)
– Information retrieval (IR)
– Natural language processing (NLP)
– Visualization
– Social network analysis and graph mining
– Knowledge management
– …

Typical text mining process
• Data acquisition: acquisition, cleaning
• Text preprocessing: transformation
• Modeling: discover, extract, organize knowledge
• Evaluation / validation: performance and utility assessment, feedback loop
• Application: presentation, interaction
• Feedback loop

What do we cover in Part I?
• Data acquisition
• Text preprocessing: vector space model (bags-of-words)
• Modeling: machine learning, classification, clustering
• Evaluation / validation: cross validation, precision, recall, …
• Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization, …
• Feedback loop

Bags of words
The quick brown dog jumps over the lazy dog.
• Tokenize
• Remove stop words
• Lemmatize (jumps -> jump)
• Compute weights
Resulting bag of words: quick 1, brown 1, dog 2, jump 1, lazy 1

Bags of words: tokenization & stop word removal
Original text:
After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.

Simple tokenizer (alphanumeric strings only):
After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives

Regex tokenizer ([\p{L}']+):
After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives

Bags of words: lemmatization
Lemmatized (same original text):
After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive

Bags of words: lemmatization (Italian example)
Original text:
È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.
Lemmatized:
E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico

Computing weights
• TF (term frequency): the number of times a lemma (stem) occurs in a document
• DF (document frequency): the number of documents in which a lemma (stem) occurs at least once
• TFIDF: $\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF} = \mathit{TF} \times \log_e \frac{|D|}{\mathit{DF}}$, where $|D|$ is the number of documents
• Higher TF means higher TFIDF
• Higher DF means lower TFIDF
Example: “The quick brown dog jumps over the lazy dog.” gives TF values quick 1, brown 1, dog 2, jump 1, lazy 1

Computing weights: one document, |D| = 1
Document: The quick brown dog jumps over the lazy dog.
Term    TF   DF   IDF   TFIDF
quick   1    1    0     0
brown   1    1    0     0
dog     2    1    0     0
jump    1    1    0     0
lazy    1    1    0     0
$\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF}$, $\mathit{IDF} = \log_e \frac{|D|}{\mathit{DF}}$

Computing weights: two documents
Document 1: jump
Document 2: The quick brown dog jumps over the lazy dog.
Weights for “The quick brown dog jumps over the lazy dog.” with |D| = 2 (the other document is “jump”):
Term    TF   DF   IDF    TFIDF
quick   1    1    0.69   0.69
brown   1    1    0.69   0.69
dog     2    1    0.69   1.39
jump    1    2    0      0
lazy    1    1    0.69   0.69
$\mathit{TFIDF} = \mathit{TF} \times \mathit{IDF}$, $\mathit{IDF} = \log_e \frac{|D|}{\mathit{DF}}$

Cosine similarity
$\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_2) = \cos\varphi = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\|\mathbf{d}_1\| \, \|\mathbf{d}_2\|}$

Cosine similarity with normalized vectors
$\mathbf{d}_i' = \frac{\mathbf{d}_i}{\|\mathbf{d}_i\|}$
$\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_2) = \cos\varphi = \frac{\mathbf{d}_1' \cdot \mathbf{d}_2'}{\|\mathbf{d}_1'\| \, \|\mathbf{d}_2'\|} = \mathbf{d}_1' \cdot \mathbf{d}_2'$
Cosine similarity is equal to the dot product if BOWs are normalized to unit length (faster to compute)

Centroids
$\mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{d}_i$, $\mathbf{C}' = \frac{\mathbf{C}}{\|\mathbf{C}\|}$
• Determine characteristic words in a cluster
• Nearest centroid classifier
• k-means clustering
• …
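The TFIDF weighting, cosine similarity, and centroid formulas from the bag-of-words slides can be tried out with a minimal Python sketch. It assumes documents are already tokenized, stop-word filtered, and lemmatized, and it represents sparse vectors as plain dictionaries; the function names are illustrative and not taken from the tutorial's software.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists, already tokenized, stop-word filtered and lemmatized."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                 # DF: each lemma counts once per document
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]

def normalize(vec):
    """Scale a sparse vector (dict) to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    """Cosine similarity computed as the dot product of unit-length vectors."""
    return dot(normalize(a), normalize(b))

def centroid(vectors):
    """Mean of the vectors, normalized to unit length (C' in the slides)."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return normalize({t: w / len(vectors) for t, w in total.items()})

# The two-document example from the slides.
docs = [
    ["quick", "brown", "dog", "jump", "lazy", "dog"],
    ["jump"],
]
vectors = tf_idf_vectors(docs)
print(round(vectors[0]["dog"], 2))   # 2 * ln(2/1), about 1.39 as in the |D| = 2 table
print(vectors[0]["jump"])            # 0.0, because "jump" occurs in both documents
```

Normalizing once and reusing the unit vectors is what makes the plain dot product sufficient later on, for example in a nearest centroid classifier or in k-means.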
Machine learning
• Machine learning is concerned with the development of algorithms that allow computer programs to learn from past experience [Mitchell]
• Machine learning refers to a collection of algorithms that take as input empirical data (e.g., from databases or sensors) and try to discover some characteristics (rules, constraints, patterns, features) of the process that generated the data [Wikipedia]
• Learning from past experience = learning from past examples
• Examples (instances) = document vectors (normalized sparse vectors)

Machine learning
• We will look at two commonly used machine learning techniques
– Classification
   • Assigning instances (documents) to two or more predefined (discrete) classes
   • Supervised learning method
– Clustering
   • Arranging instances (documents) into groups (clusters) so that instances in the same group are more similar to each other than to those in other groups
   • Unsupervised learning method

Classification
• Labeled documents
– Mergers & Acquisitions: Ingram Wraps Up Brightpoint Buyout
– Mergers & Acquisitions: State Street completes acquisition of Goldman Sachs Administration Services
– Economy & Government: Gasoline fuels inflation, but Fed policy seen steady
– Economy & Government: Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
– ...
– Investing Picks: Smith & Wesson Holding Corp. Enters Oversold Territory
– Investing Picks: The Fresh Market: A Strong Buy
• Learn to classify: labeled dataset -> training algorithm -> classification model
• Classify unlabeled documents: unlabeled dataset + classification model -> classification algorithm -> predictions (labels)
– Example: “Fresh Del Monte Produce Inc. Enters Oversold Territory” -> Investing Picks

Classification with k-Nearest Neighbors
Classes: Investing Picks, Mergers & Acquisitions, Economy & Government
Votes of the nearest neighbors: Investing Picks: 4, Mergers & Acquisitions: 1, Economy & Government: 0

Classification with Nearest Centroid Classifier
• Similarities to the class centroids: s1 (Investing Picks), s2 (Mergers & Acquisitions), s3 (Economy & Government)
• Similarity s2 > s1 > s3
– s2: Mergers & Acquisitions
– s1: Investing Picks
– s3: Economy & Government

Classification with Support Vector Machine (SVM)
Figure: w between the classes Investing Picks and Mergers & Acquisitions
• Maximize w
• Minimize tradeoff

Classification algorithms
                        k-NN    Nearest centroid   SVM (linear kernel)
Multiclass?             yes     yes                no
Explains decisions?     no      yes                yes
Explains model?         no      yes                yes
Number of parameters    1       0                  1
Model size              big     small              small
Training speed          0       fast               slow
Classification speed    slow    fast               fast
Accuracy (on texts)     low     medium             high

Clustering
• k-means clustering
• Agglomerative hierarchical clustering

k-means clustering
Input: k
Output: k clusters (and their centroids)
1. Randomly select k instances for initial centroids
2. Assign step: assign each instance to the nearest centroid
3. If the assignments did not change, end the algorithm
4. Update step: recompute (update) centroids
5. Repeat from Step 2

Agglomerative hierarchical clustering
1. Find the two most similar instances
2. Connect them
3. Replace them with their centroid
Repeat …
The resulting tree is a “dendrogram”

Evaluation
• Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
– 10-fold cross validation
– Stratified
• Accuracy
• Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
• Micro and macro-averaging (http://nlp.stanford.edu/IRbook/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
• Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
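The evaluation measures listed on the Evaluation slide (precision, recall, F1, and their micro- and macro-averages) can be illustrated with a short Python sketch. It follows the standard definitions for single-label multiclass classification; the function names and the toy labels are made up for illustration.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1   # predicted class gains a false positive
            fn[actual] += 1      # actual class gains a false negative
    return tp, fp, fn

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in labels}
    # Macro-averaging: average the per-class scores (every class counts equally).
    macro = tuple(sum(scores[i] for scores in per_class.values()) / len(labels) for i in range(3))
    # Micro-averaging: pool the counts first (every document counts equally).
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro

y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
per_class, macro, micro = evaluate(y_true, y_pred)
print(micro)   # (0.75, 0.75, 0.75): for single-label data, micro-averaged scores equal accuracy
```

In a 10-fold cross validation, the same function would be applied to the predictions of each held-out fold and the scores averaged over the folds.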
Applications
• Enhanced Web search (SearchPoint)
• Social browsing (LiveNetLife)
• Content categorization
• Content-based recommender systems
• Advertising
• Blogging assistance (Zemanta)
• Spam detection
• Visualization / summarization of large corpora
• Text summarization: Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
• Sentiment analysis (demo later)
• News aggregation: http://emm.newsexplorer.eu
• Knowledge engineering: http://ontogen.ijs.si
• …

Application examples (screenshots)
• Enhanced Web search (http://www.searchpoint.com)
• Social browsing (http://www.livenetlife.com) @ http://videolectures.net
• Content categorization @ http://videolectures.net
• Recommender system @ http://videolectures.net
• Contextualized advertising
• Blogging assistant (http://www.zemanta.com)
• Pump & dump: Siering, Muntermann, Grčar (2012)

Visualizations
• Document space visualization
• Canyon flows
• Tag clouds (http://www.jasondavies.com/wordcloud/)

Recap
• Basics
– What is text mining?
– TF-IDF bag-of-words vectors
– Cosine similarity
– Centroids
• Machine learning
– k-NN
– Nearest centroid classifier
– SVM
– k-means
– Agglomerative clustering
• Applications
– Enhanced Web search (SearchPoint)
– Social browsing (LiveNetLife)
– Content categorization
– Content-based recommender systems
– Advertising
– Writing assistance (Zemanta)
– Spam detection
– Visualization / summarization of large corpora
– …

Part II: Text stream mining

What is text stream mining?
• Same as text mining but on streams
• Text stream mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from streams of textual documents

Remember: typical text mining process
Data acquisition, text preprocessing, modeling, evaluation / validation, application, connected by a feedback loop

Typical text stream mining process
• Stream data acquisition: acquisition, cleaning
• Text preprocessing: transformation
• Modeling: discover, extract, organize knowledge
• Evaluation / validation: performance and utility assessment, obtaining new labels, feedback loop
• Application: presentation, interaction
• Feedback loop

Text stream mining pipelines
• Pipelining and parallelization
– Enables concurrent processing
– Increases throughput
– Enables distributed execution (cluster)
• Near-realtime online systems
– Stream cannot be paused or slowed down (e.g., newsfeeds)
– [Near-realtime] Time between reception and utilization of data should be as short as possible
– [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted

What do we cover in Part II?
• Stream data acquisition: RSS feeds, boilerplate remover, language detection
• Text preprocessing: online BOW
• Modeling: online ML (incr. NCC, incr. k-means, incr. SVM)
• Evaluation / validation
• Application: online document space visualization, online Twitter sentiment classification
• Feedback loop

Text stream acquisition and preprocessing
Figure (pipeline diagram): RSS readers; load balancing; parallel preprocessing pipelines, each with a boilerplate remover and a language detector; sync; online BOW; …

RSS (Really Simple Syndication)
  <rss version="2.0">
    <channel>
      <generator>NFE/1.0</generator>
      <title>Top Stories - Google News</title>
      <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
      <language>en</language>
      <webMaster>news-feedback@google.com</webMaster>
      <copyright>&amp;copy;2011 Google</copyright>
      <item>
        <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster Bloomberg</title>
        <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
        <category>Top Stories</category>
        <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
        <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after protests that started Jan. 25, prompted the following comments from analysts: “The army needs to move quickly to remove obstacles to ...</description>
      </item>
      ...
    </channel>
  </rss>
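Reading items from an RSS 2.0 feed like the one shown on the RSS slide can be sketched with Python's standard library. The sample feed, the function names, and the de-duplication by link are assumptions made for illustration; the tutorial does not say how its RSS reader tracks items it has already seen.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal feed with the same structure as the slide's example.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/a?x=1&amp;y=2</link>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
      <description>Short summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>"""

def read_items(rss_xml):
    """Extract title, link, publication date and description of every <item>."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

def new_items(items, seen_links):
    """Keep only items not seen before; polling a feed repeatedly returns old items again."""
    fresh = [it for it in items if it["link"] not in seen_links]
    seen_links.update(it["link"] for it in fresh)
    return fresh

seen = set()
items = read_items(SAMPLE_FEED)
first_poll = new_items(items, seen)    # both items are new
second_poll = new_items(items, seen)   # nothing new on the second poll
print(len(first_poll), len(second_poll))
```

The XML parser decodes entities such as `&amp;`, so the extracted links can be passed directly to a downloader and then on to the boilerplate remover.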
Boilerplate removal
Example page: http://www.bbc.co.uk/news/world-us-canada-15051554

Boilerplate removal: URL tree
• URL structure: protocol :// domain / path / file ? query
– Example: http:// kt.ijs.si /a/b/ c.html ?pg=0
– Tree branch: # (root), si, ijs, kt (domain), a, b (path)
• http://www.bbc.co.uk/news/world-us-canada-15051554
– Tree branch: #, uk, co, bbc, www, news

Boilerplate removal: URL tree
• How many times did I see “About Us” in this part of the tree?
• Figure labels: path, domain, root, stream (#)
• This method is …
– Unsupervised
– Online
– Incremental (consumes one document at a time)
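The URL tree idea can be sketched in Python as follows. The branch construction reproduces the two examples from the slides; the counting scheme, the `min_count` threshold, and the class and method names are assumptions made for illustration, since the slides only pose the question of how often a text block was seen in a part of the tree.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def url_branch(url):
    """Map a URL to its tree branch: '#', reversed domain labels, then path folders."""
    parts = urlsplit(url)
    domain = list(reversed(parts.hostname.split("."))) if parts.hostname else []
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]            # the last segment is treated as the file
    return ["#"] + domain + segments

class UrlTreeCounter:
    def __init__(self):
        self.counts = defaultdict(int)      # (tree node, text block) -> documents seen

    def seen(self, url, block, depth=None):
        """How many earlier documents under this tree node contained the block?"""
        branch = url_branch(url)
        node = tuple(branch if depth is None else branch[:depth])
        return self.counts.get((node, block), 0)

    def observe(self, url, blocks):
        """Update the counts at every node along the document's branch."""
        branch = url_branch(url)
        for depth in range(1, len(branch) + 1):
            node = tuple(branch[:depth])
            for block in set(blocks):
                self.counts[(node, block)] += 1

    def split(self, url, blocks, min_count=2):
        """Incremental use: keep blocks seen fewer than min_count times, then update counts."""
        content = [b for b in blocks if self.seen(url, b) < min_count]
        self.observe(url, blocks)
        return content

print(url_branch("http://kt.ijs.si/a/b/c.html?pg=0"))
# ['#', 'si', 'ijs', 'kt', 'a', 'b']
```

Because counts are kept at every level of the branch, a block that is rare under one path can still be checked at the domain level by passing a smaller `depth`.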
Language detection
• Motivation: language-specific text analysis components and applications
• Solutions based on word lists and word or character sequences (n-grams)
• Character n-gram model
– Build character n-gram histograms for many languages (language models)
– Compare text document histogram to language models

Language detection: character n-grams ranked by frequency
Rank   English   German
1      E         E
2      T         N
3      O         R
4      A         I
5      N         T
6      I         S
7      H         A
8      S         D
9      R         U
10     D         EN
11     E_        G
12     L         ER
13     _T        H
14     TH        L
15     HE        N_
16     U         O
17     W         M
18     C         _D
19     M         C
...    ...       ...
Annotations on the slide: THE (English); DER, DEN (German)

Language detection
Article “Egypt rejoices at Mubarak departure”
Figure: two plots of the English article's n-gram ranks (vertical axis) against the n-gram ranks of the English language model (left) and of the German language model (right)
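Comparing a document's character n-gram ranks with per-language models can be sketched in Python. The slides do not name the distance measure, so the "out-of-place" rank distance used here is an assumption (it is one common choice for rank-based language identification), and the two tiny training texts are hypothetical stand-ins for real language models built from large corpora.

```python
import re
from collections import Counter

def ngram_ranks(text, max_n=2, top_k=300):
    """Rank character n-grams (n = 1..max_n) by frequency; '_' marks word boundaries."""
    words = re.findall(r"[^\W\d_]+", text.upper())     # letters only
    padded = "_" + "_".join(words) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip("_"):                        # skip n-grams made only of boundaries
                counts[gram] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k), start=1)}

def out_of_place(doc_ranks, model_ranks):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(model_ranks) + 1
    return sum(
        abs(rank - model_ranks[gram]) if gram in model_ranks else penalty
        for gram, rank in doc_ranks.items()
    )

def detect(text, models):
    """Return the language whose model is closest to the text's n-gram ranking."""
    doc_ranks = ngram_ranks(text)
    return min(models, key=lambda lang: out_of_place(doc_ranks, models[lang]))

# Hypothetical toy training texts; real models need far more data.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun"
DE_SAMPLE = "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne"
models = {"en": ngram_ranks(EN_SAMPLE), "de": ngram_ranks(DE_SAMPLE)}

print(detect(EN_SAMPLE, models))   # en (distance 0 to its own model)
```

A text that matches a language model well produces points close to the diagonal in a rank-versus-rank plot, which corresponds to a small distance here.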
Online BOW
• Stream -> queue of TF vectors -> outdated
• Add: the TF vector of a new document enters the queue and the DF values are updated
• Remove: an outdated TF vector leaves the queue and the DF values are updated

Online BOW
• TF comes from the queue of TF vectors, DF from the maintained DF values
• TF and DF are combined into TF-IDF

Batch, incremental, offline, online
• Batch learning: consuming all training examples at once
• Incremental learning: consuming one example at a time
• Mini-batch learning: consuming several examples at a time
• Offline learning (for datasets/finite streams): all data is stored and can be accessed repeatedly
• Online learning (for infinite streams): each example is discarded after being processed

Incremental nearest centroid classifier
• Classify / predict (red)
• Obtain actual label (green)
• Update centroids
• Figure also marks an outdated instance

Incremental k-means clustering
Converges in only a few iterations (warm start)

Other incremental methods
• Incremental SVM: A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
• Incremental perceptron: www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
• Incremental winnow: http://en.wikipedia.org/wiki/Winnow_%28algorithm%29

Document space visualization
Several 1000 dimensions -> 2D

Document space visualization
Figure (pipeline): document corpus, corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, layout
• Stress majorization: $\mathbf{p}_i = \frac{\sum_{i<j} d_{i,j}^{-2} \left( \mathbf{p}_j + d_{i,j} \frac{\mathbf{p}_i - \mathbf{p}_j}{\|\mathbf{p}_i - \mathbf{p}_j\|} \right)}{\sum_{i<j} d_{i,j}^{-2}}$
• Least-squares interpolation: $\arg\min_{\mathbf{X}} \{ \|\mathbf{A}\mathbf{X} - \mathbf{B}\|^2 \}$

Document space visualization (online)
Figure (online pipeline): the same stages with the online BOW added; annotations: maintaining sorted lists, warm start (at three stages), parallelization, pipelining
• Stress majorization: $\mathbf{p}_i = \frac{\sum_{i<j} d_{i,j}^{-2} \left( \mathbf{p}_j + d_{i,j} \frac{\mathbf{p}_i - \mathbf{p}_j}{\|\mathbf{p}_i - \mathbf{p}_j\|} \right)}{\sum_{i<j} d_{i,j}^{-2}}$
• Least-squares interpolation: $\arg\min_{\mathbf{X}} \{ \|\mathbf{A}\mathbf{X} - \mathbf{B}\|^2 \}$
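The warm start mentioned for incremental k-means and for the online visualization pipeline can be sketched in Python: when the document set changes slightly, clustering restarts from the previous centroids instead of from random instances. The sketch uses dense lists and cosine similarity on unit-length vectors; the function names, the `seed` parameter, and the toy data are assumptions for illustration.

```python
import math
import random

def unit(v):
    """Normalize a dense vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(docs, k, centroids=None, max_iter=100, seed=0):
    """Return (centroids, assignments, iterations).

    Passing the centroids of a previous run is the warm start; otherwise k random
    instances are used as initial centroids, as in the batch algorithm from Part I.
    """
    docs = [unit(d) for d in docs]
    if centroids is None:
        centroids = [list(d) for d in random.Random(seed).sample(docs, k)]
    else:
        centroids = [list(c) for c in centroids]       # do not modify the caller's list
    assign = None
    for iteration in range(1, max_iter + 1):
        # Assign step: nearest centroid = highest dot product for unit vectors.
        new_assign = [max(range(k), key=lambda c: dot(d, centroids[c])) for d in docs]
        if new_assign == assign:
            return centroids, assign, iteration
        assign = new_assign
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centroids[c] = unit([sum(col) / len(members) for col in zip(*members)])
    return centroids, assign, max_iter

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cents, assign, cold_iters = kmeans(docs, k=2)

# A new document arrives: restart from the previous centroids (warm start).
warm_cents, warm_assign, warm_iters = kmeans(docs + [[0.8, 0.2]], k=2, centroids=cents)
print(cold_iters, warm_iters)
```

With a warm start, a small change to the data typically needs only one assign/update pass plus one pass to confirm that nothing changed, which is the behavior the slide summarizes as converging in only a few iterations.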
Twitter
• Platform for sending short messages (similar to SMS)
• Est. 225 million users
• 100 million accounts added in 2010
• 65 million tweets per day

Financial tweets
• Informal $ sign convention
• Some examples (March 19):
– User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash.We expect a dividend announcement
– User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
– User#3: Will there be any other news besides $AAPL dividend?
• We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT…)
• We analyze tweets to determine whether they contain positive or negative vocabulary

Sentiment classification
• Labeled documents
– POS: Financial markets are now officially open :)
– POS: market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
– POS: $AAPL : trust me -- AAPL will soar tomorrow
– NEG: Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
– NEG: omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
– NEG: @aekins that's just too bad ...
• Learn to classify: labeled dataset -> training algorithm -> classification model
• Classify unlabeled documents: unlabeled dataset + classification model -> classification algorithm -> predictions (labels)
– Example: “So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last.” -> NEG

Sentiment classification
• SVM classifier & emoticons
• Neutral zone
Example tweets (marked –, 0, or + in the figure):
– Goodnight everyoneeee :) Love yall
– I have a good feeling about today ;)
– ooo the ice cream van is here... yaaaaaay :D
– in the garden in the sun! Just about to fill the pool! happy days! :D
– Finally got JSON in #processing to work. More playing around coming :)
– @oanhLove I hate when that happens... :-/
– No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
– I hate when I have to call and wake people up :(
– I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
– UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(

Sentiment classification
• SVM classifier & emoticons
• Neutral zone
• Explanations
– “Sovereign debt and unemployment are big issues in EU.”
– unemployed, issues, debt, eu, sovereign, big
• Accuracy
Preprocessing options (each configuration is a combination of options, marked with X on the slide):
– Replace usernames with a token
– Replace URLs with a token
– Remove letter repetition
– Replace negations with a token
– Replace exclamation marks with a token
– Replace question marks with a token
Configuration   Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
1               81.06%     81.32%/81.32%      77.43%
2               80.22%     82.08%/78.02%      77.10%
3               79.94%     77.78%/84.62%      77.53%
4               79.94%     76.70%/86.81%      76.85%
5               79.67%     80.79%/78.57%      76.98%
6               78.83%     77.60%/81.87%      77.29%
7               78.55%     75.86%/84.62%      76.91%
8               78.55%     77.78%/80.77%      76.93%
9               78.27%     80.23%/75.82%      76.93%
10              78.27%     76.53%/82.42%      77.04%
11              77.44%     75.12%/82.97%      76.86%

Netflix: tweet sentiment and stock price (figure legend)
• Grey: Netflix stock closing price
• Green dots: relevant events concerning Netflix
• Blue: the number of positive tweets
• Yellow: the difference between the positive and negative tweets
• Red: the number of negative tweets

Netflix: volume peaks likely represent important events (figure annotations)
• First-quarter earnings release
• Plans to launch in 43 countries in Latin America and the Caribbean
• Netflix loses TV shows and films, Netflix loses the Starz deal
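The tweet preprocessing options compared in the sentiment classification experiments (replacing usernames, URLs, negations, exclamation marks, and question marks with tokens, and removing letter repetition) can be sketched with regular expressions in Python. The token names, the negation word list, and the choice to shorten repeated letters to two are assumptions; the slides list the options but not how they were implemented.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")          # a letter repeated three or more times
NEGATION_RE = re.compile(r"\b(?:not|no|never|cannot|\w+n't)\b", re.IGNORECASE)

def preprocess(tweet, usernames=True, urls=True, repetition=True,
               negations=True, exclamations=True, questions=True):
    """Apply the selected options; each flag corresponds to one option on the slide."""
    text = tweet
    if urls:                                   # first, because URLs may contain '?' and '@'
        text = URL_RE.sub(" URL ", text)
    if usernames:
        text = USER_RE.sub(" USERNAME ", text)
    if repetition:
        text = REPEAT_RE.sub(r"\1\1", text)    # "yaaaaaay" -> "yaay" (assumed: keep two)
    if negations:
        text = NEGATION_RE.sub(" NEGATION ", text)
    if exclamations:
        text = re.sub(r"!+", " EXCLAMATION ", text)
    if questions:
        text = re.sub(r"\?+", " QUESTION ", text)
    return " ".join(text.split())              # collapse the whitespace added around tokens

print(preprocess("UGHHHHHH.. life is NOT good!!!"))
# UGHH.. life is NEGATION good EXCLAMATION
```

Switching individual flags off reproduces the kind of configuration grid evaluated in the table, where each row enables a different subset of options before the bag-of-words vectors are built.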
Netflix: sentiment cross-over happens before price plunge
Figure annotation: sentiment cross-over

Presidential elections
http://predsedniskevolitve.si

Recap
• Basics
– What is text stream mining?
– Pipelining, parallelization
– Web data acquisition
– Online BOWs
• Machine learning
– Batch, incremental, offline, online
– Incremental nearest centroid classifier
– Incremental k-means
– Warm start
• Applications
– Online document space visualization
– Online Twitter sentiment classifier
   • Stock sentiment monitoring
   • Presidential elections
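As a closing illustration of the online BOW component from Part II, the queue of TF vectors and the incrementally maintained DF values can be sketched in Python. Treating "outdated" as "older than the newest `max_docs` documents" is an assumption (a time-based window would work the same way), and the class and method names are illustrative.

```python
import math
from collections import Counter, deque

class OnlineBow:
    def __init__(self, max_docs):
        self.max_docs = max_docs     # assumed definition of "outdated": beyond the newest max_docs
        self.queue = deque()         # queue of TF vectors
        self.df = Counter()          # DF values for the documents currently in the queue

    def add(self, tokens):
        """Add one document, drop outdated ones, and return its unit-length TF-IDF vector."""
        tf = Counter(tokens)
        self.queue.append(tf)
        self.df.update(tf.keys())                # each lemma counts once per document
        while len(self.queue) > self.max_docs:
            self._remove_oldest()
        return self.tfidf(tf)

    def _remove_oldest(self):
        outdated = self.queue.popleft()
        for term in outdated:
            self.df[term] -= 1
            if self.df[term] == 0:
                del self.df[term]                # keep the DF table from growing forever

    def tfidf(self, tf):
        n_docs = len(self.queue)
        vec = {t: f * math.log(n_docs / self.df[t]) for t, f in tf.items() if self.df[t]}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm > 0 else vec

bow = OnlineBow(max_docs=2)
bow.add(["dog", "jump"])
bow.add(["jump"])
latest = bow.add(["cat", "cat"])     # the first document is now outdated and removed
print(sorted(bow.df.items()))        # [('cat', 1), ('jump', 1)]
```

Because TF vectors are stored unweighted, the TF-IDF vector of any queued document can be recomputed at any time against the current DF values, which keep changing as documents enter and leave the queue.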