Big, Bigger, Biggest
Large scale issues:
  Phrase queries and common words
  OCR
Tom Burton West
Hathi Trust Project
Lucid Imagination, Inc. – http://www.lucidimagination.com

Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
  Must scale to 20 million full-text volumes
  Very long documents compared to most large-scale search applications
  Multilingual collection
  OCR quality varies

Index Size, Caching, and Memory
Our documents average about 300 pages, which is about 700 KB of OCR.
Our 5 million document index is between 2 and 3 terabytes.
  About 300 GB per million documents
Large index means disk I/O is the bottleneck
Tradeoff: JVM vs. OS memory
  Solr uses OS memory (disk I/O caching) for caching of postings
  Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming)
Fitting the entire index in memory is not feasible with a terabyte-size index

Response time varies with query
  Average: 673
  Median: 91
  90th percentile: 328
  99th percentile: 7,504

Slowest 5% of queries
[Figure: Response Time 95th percentile (seconds); y-axis: Response Time (seconds), x-axis: Query number]
The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect the response time of other queries
  Cache pollution
  Contention for resources
Slowest queries are phrase queries containing common words

Query processing
Phrase queries use position index (Boolean queries do not).
Position index accounts for 85% of index size
Position list for common words such as “the” can be many GB in size
This causes lots of disk I/O
  Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query
  I/O from phrase queries containing common words pollutes the cache

Slow Queries
Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
4 MB of data read for the Boolean query. 9,000+ MB read for the phrase query.

WORD        NUMBER OF DOCUMENTS  POSTINGS LIST (SIZE MB)  TOTAL TERM OCCURRENCES (MILLIONS)  POSITION LIST (SIZE MB)
the         800,000              0.8                      4,351                              4,351
of          892,000              0.89                     2,795                              2,795
and         769,000              0.77                     1,870                              1,870
literature  435,000              0.44                     9                                  9
generation  414,000              0.41                     5                                  5
lives       432,000              0.43                     5                                  5
beat        278,000              0.28                     1                                  1
TOTAL                            4.02                                                        9,036

Why not use Stop Words?
The word “the” occurs more than 4 billion times in our 1 million document index.
Removing “stop” words (“the”, “of”, etc.) is not desirable for our use cases.
Couldn’t search for many phrases
  “to be or not to be”
  “the who”
  “man in the moon” vs. “man on the moon”
Stop words in one language are content words in another language
  German stop words “war” and “die” are content words in English
  English stop words “is” and “by” are content words (“ice” and “village”) in Swedish

“CommonGrams”
Ported Nutch “CommonGrams” algorithm to Solr
Create bigrams selectively for any two-word sequence containing common terms
Slowest query: “The lives and literature of the beat generation”
  “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”

Standard index vs. CommonGrams

Standard Index
WORD        TOTAL OCCURRENCES IN CORPUS (MILLIONS)  NUMBER OF DOCS (THOUSANDS)
the         2,013                                   386
of          1,299                                   440
and         855                                     376
literature  4                                       210
lives       2                                       194
generation  2                                       199
beat        0.6                                     130
TOTAL       4,176

Common Grams
TERM            TOTAL OCCURRENCES IN CORPUS (MILLIONS)  NUMBER OF DOCS (THOUSANDS)
of-the          446                                     396
generation      2.42                                    262
the-lives       0.36                                    128
literature-of   0.35                                    103
lives-and       0.25                                    115
and-literature  0.24                                    77
the-beat        0.06                                    26
TOTAL           450

Comparison of Response time (ms)

                AVERAGE  MEDIAN  90th  99th   SLOWEST QUERY
Standard Index  459      32      146   6,784  120,595
Common Grams    68       3       71    2,226  7,800

Other issues
Analyze your slowest queries
  We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list.
  We used the Solr Admin panel to run the slowest queries from our logs with the “debug” flag checked.
  We discovered that words such as “l’art” were being split into two-token phrase queries.
  We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.

Other issues
We broke Solr … temporarily
  Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms
  Solr/Lucene index size was limited to 2.1 billion unique terms
  Patched: Now it’s 274 billion
Dirty OCR is difficult to remove without removing “good” words.
Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
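
The bigram rule from the “CommonGrams” slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Nutch or Solr implementation: the function names, the three-word common list, and the hyphen separator (borrowed from the slide's notation) are all hypothetical, and token positions are ignored.

```python
from typing import List, Set

# Assumed common-word list for illustration only; the deck does not give the real list.
COMMON_WORDS: Set[str] = {"the", "of", "and"}


def common_grams_index(tokens: List[str],
                       common: Set[str] = COMMON_WORDS,
                       sep: str = "-") -> List[str]:
    """Index side: keep every word, and add a bigram for each adjacent
    pair in which at least one word is common."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}{sep}{tokens[i + 1]}")
    return out


def common_grams_query(tokens: List[str],
                       common: Set[str] = COMMON_WORDS,
                       sep: str = "-") -> List[str]:
    """Query side: emit the bigrams, and keep a single word only if it is
    not already part of a bigram."""
    out: List[str] = []
    in_previous_bigram = False  # current token was the right half of a bigram
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        makes_bigram = nxt is not None and (tok in common or nxt in common)
        if makes_bigram:
            out.append(f"{tok}{sep}{nxt}")
        elif not in_previous_bigram:
            out.append(tok)
        in_previous_bigram = makes_bigram
    return out


if __name__ == "__main__":
    query = "the lives and literature of the beat generation".split()
    print(common_grams_query(query))
```

For the slowest query, `common_grams_query` produces the seven terms listed on the slide, in query order. The phrase query then reads the much shorter position lists of terms like “the-lives” instead of the position list for “the”.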
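
The totals on the “Slow Queries” slide can be checked with a few lines of arithmetic. The sketch below copies the per-term sizes from that table; the variable and function names are illustrative, and it assumes each term's list is read once per query.

```python
from typing import Dict, Tuple

# (postings list MB, position list MB) per term, copied from the "Slow Queries" table.
SLOW_QUERY_TERMS: Dict[str, Tuple[float, float]] = {
    "the": (0.80, 4351),
    "of": (0.89, 2795),
    "and": (0.77, 1870),
    "literature": (0.44, 9),
    "generation": (0.41, 5),
    "lives": (0.43, 5),
    "beat": (0.28, 1),
}


def list_totals(terms: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Return (total postings MB, total position MB) over all terms."""
    postings_mb = sum(postings for postings, _ in terms.values())
    position_mb = sum(positions for _, positions in terms.values())
    return postings_mb, position_mb


boolean_mb, phrase_mb = list_totals(SLOW_QUERY_TERMS)

if __name__ == "__main__":
    print(f"Boolean query reads about {boolean_mb:.2f} MB of postings")
    print(f"Phrase query reads about {phrase_mb:,.0f} MB of positions")
    print(f"Ratio: roughly {phrase_mb / boolean_mb:,.0f}x")
```

Nearly all of the position data comes from the three common words, which is why the deck targets them rather than the rarer terms.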