Mining Unstructured Text at Gigabyte per Second Speeds
Alan Ratner

Aim
• Identify language and topic of streaming web documents
  – blogs, newsgroup posts
• In the presence of noise
  – bytes of HTML & JavaScript >> bytes of content → everything looks like English
• At network speeds
  – this year at 2.5, 10, & 40 Gb/s
  – next year at 100 Gb/s

The Nature of Text
• Ideas
• Sequence of Words – Topic, Language (variations: >>10^6 words, ~4,000 languages)
• Graphical Characters – Script, native & transliterated (29+ scripts)
• Sequences of Bits – Encoding (ASCII, WCP, ISO, UTF, language-specific)

6 Fundamental Questions in Mining Text
Language
1. Detection: Is there language?
2. Identification: What language is it?
3. Clustering: Is this language like any other I’ve seen?
Topic
4. Labeling: How can I describe the topic?
5. Identification: Is it on a specific topic of interest?
6. Clustering: Is this topic like any other I’ve seen?

Agenda
• Computing Platforms ◄
• Language Detection
• Language Identification
• Topic Identification

High Speed Processing
• Processing time
  – O(nanosecond per byte)
  – O(microsecond per document)
• Platforms
  – FPGA*
  – CAM*
  – NPU
  – GPU
  – N-core CPU (N >> 1)

Content Addressable Memory
• If a string is in memory, return its address
• CAM is typically ternary (base 3)
  – specified bits/bytes can be ignored
  – ignoring the 6th bit in many encodings makes the comparison case-insensitive (sketched below)
• 10^13 18-byte string compares per second
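The case-insensitivity trick above is easy to illustrate in software. Below is a minimal sketch of a ternary (masked) compare, not the CAM hardware itself; the token, mask, and function names are hypothetical examples. Clearing the 6th bit (0x20) of an ASCII letter erases the case distinction (‘a’ = 0x61 vs. ‘A’ = 0x41), so a per-byte mask of 0xDF makes the compare case-insensitive.

# Software emulation of a ternary (masked) compare, as a TCAM does in
# hardware: bits set in the mask must match; cleared mask bits are ignored.
CASE_MASK = 0xDF  # every bit significant except bit 6 (0x20)

def ternary_match(data: bytes, pattern: bytes, mask: bytes) -> bool:
    """True if every byte of data equals the pattern byte under the mask."""
    return len(data) == len(pattern) and all(
        (d & m) == (p & m) for d, p, m in zip(data, pattern, mask)
    )

# Hypothetical token: match "hockey" in any mix of upper and lower case.
pattern = b"hockey"
mask = bytes([CASE_MASK]) * len(pattern)

assert ternary_match(b"HoCkEy", pattern, mask)
assert not ternary_match(b"jockey", pattern, mask)

A real TCAM stores the mask alongside each pattern, so per-bit “don’t cares” cost nothing at match time. Note that masking 0x20 at non-letter positions would also conflate pairs such as ‘@’/‘`’, so a careful pattern table would apply the mask only where letters are expected.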
Architecture
Streaming Data → TCAM (Tokens) → FPGA (Weights) → RAM (Sums of Weights) → CPU (Choose winning class)

Agenda
• Computing Platforms
• Language Detection ◄
• Language Identification
• Topic Identification

Does a Given File (or Streaming Data) Contain Language?
• Language ≡ vocabulary AND grammar
• Shannon showed that, using a 27-letter alphabet, we can correctly guess the next character of English text 69% of the time
• Text is typically more predictable than well-compressed non-text (zip, audio, video, images)

Spaceyness vs. Highness (on noiseless text)
[Figure: Highness (0–100%) vs. Spaceyness (0–60), separating Non-Text from Latin Text and Non-Latin Text]

Agenda
• Computing Platforms
• Language Detection
• Language Identification ◄
• Topic Identification

English Word Occurrence (≥ 3 letters)
[Figure: occurrence (log scale, 0.1%–10%) vs. rank for the top 100 English words of three or more letters, each point labeled with its word (AND, THE, FOR, YOU, NOT, WITH, …)]

English Cumulative Word Occurrence
[Figure: cumulative occurrence (0–25%) vs. rank for the top 25 English words, from THE and AND out to YOUR and WHAT]

Comparative Cumulative Word Occurrence
[Figure: cumulative probability (0–60%) vs. number of top-ranked words (0–500) for Arabic, ArabicChat, Russian, RussianChat, English, and Dutch]

Selecting Patterns for Language Id
• Spaced languages
  – Common words ≥ 3 characters with spaces before & after
  – Avoid
    • HTML keywords
    • JavaScript keywords
    • International words (proper nouns)
• Unspaced languages
  – Common character pairs

Agenda
• Computing Platforms
• Language Detection
• Language Identification
• Topic Identification ◄

Topic Id & Search
• How do you find the 10–100 relevant docs that satisfy a user’s need for information (out of 10^12 candidates)?
• Search engines force keyword search onto users
• What users generally want is topic id
  – “Show me docs on topic X”
  – “Show me docs that look like this and not like this”
• False negatives OK
  – User doesn’t know what you have missed (unlike anti-spam)
• Precision needs to be high
  – Don’t annoy the user!
  – Extremely low false positive rate (0.0000001%)

Algorithms for Topic ID
• Bayes
• Markov
• Markov Orthogonal Sparse Bi-word
• Hyperspace
• Correlative
• Entropy
• Term Frequency * Inverse Doc Frequency
• Minimum Description Length
• Morphological
• Centroid-based
• Logistic Regression (similar to SVM & single-layer NN)
• …
Most of these use either additive or multiplicative weights of detected tokens.

Topic ID Implementation
1. Select on-topic & off-topic docs for training
2. Select tokens (typically 1K on-topic / 10K off-topic)
3. Compute additive weights that minimize false positives with acceptable recall (a minimal software sketch of this additive-weight scoring follows)
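The scoring these steps imply is simple to state: scan the document for the selected tokens, sum the additive weights of the ones found, and flag the document on-topic if the total clears a threshold tuned for the required false-positive rate. A minimal sketch, assuming space-delimited tokens as on the pattern-selection slide; the tokens, weights, and threshold here are hypothetical examples, and the deck does not specify how its weights are actually trained beyond minimizing false positives:

# Minimal sketch of additive-weight topic scoring: sum the weights of the
# selected tokens found in a document and flag it on-topic if the total
# clears a threshold. All tokens, weights, and the threshold are made up.
WEIGHTS = {
    b" hockey ": 4.0,   # on-topic tokens carry positive weight
    b" puck ": 3.0,
    b" goal ": 2.0,
    b" stock ": -2.0,   # off-topic tokens carry negative weight
}
THRESHOLD = 5.0

def on_topic(doc: bytes) -> bool:
    """Sum additive weights over token occurrences; threshold the total."""
    score = sum(w * doc.count(tok) for tok, w in WEIGHTS.items())
    return score >= THRESHOLD

print(on_topic(b" the puck crossed the goal line at the hockey game "))  # True
print(on_topic(b" the stock market fell "))                              # False

In the pipeline on the Architecture slide, the token scan would fall to the TCAM, the weight lookup and accumulation to the FPGA and RAM, and only the final choice of winning class to the CPU.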
Topic Id Evaluation
• Consistent labeling of informal documents by topic is difficult
• Need millions of labeled docs to cover the major languages and a wide variety of topics
• Used 30K newsgroup posts on 32 topics
  – Topic assumed to be the name of the newsgroup
  – Considerable overlap in topics, especially those on sports and religion

[Figure: ROC curves, true positive rate (0–30%) vs. false positive rate (0.00%–0.10%); legend: Hockey, Space, Hockey Blog, Space Wiki]

[Figure: ROC curves, true positive rate (0–30%) vs. false positive rate (0.00%–0.10%); legend: Hockey, Space, Hockey Markov, Space Markov]

Conclusions
1. Language detection solved on clean documents but unsolved on noisy web documents
2. Language identification works well at gigabyte per second speeds
3. Topic identification should work well at gigabyte per second speeds; further testing needed on larger data sets
4. Topic clustering needs further work