Phrase Queries and Common Words OCR

Big, Bigger, Biggest
Large scale issues:
Phrase queries and common words
OCR
Tom Burton West
Hathi Trust Project
Lucid Imagination, Inc. – http://www.lucidimagination.com
Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
Must scale to 20 million full-text volumes
Very long documents compared to most large-scale search applications
Multilingual collection
OCR quality varies
Index Size, Caching, and Memory
Our documents average about 300 pages, which is about 700 KB of OCR.
Our 5 million document index is between 2 and 3 terabytes.
About 300 GB per million documents
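As a rough back-of-the-envelope check, the sizes above can be projected to the 20 million volume target. The sketch below is only arithmetic on the figures from this slide. The function name is hypothetical, and it assumes index size grows linearly with the number of volumes, which the slide does not establish. The per-million figure and the 5 million document figure imply different sizes, so both are projected.

```python
def project_index_size_gb(volumes_millions, gb_per_million):
    """Linear projection of index size in GB (assumes linear growth)."""
    return volumes_millions * gb_per_million

TARGET_MILLIONS = 20  # upper end of the stated goal

# Projection from "about 300 GB per million documents".
from_per_million = project_index_size_gb(TARGET_MILLIONS, 300)

# Projection from "5 million documents is between 2 and 3 terabytes",
# i.e. 400 to 600 GB per million (taking 1 TB = 1,000 GB).
low = project_index_size_gb(TARGET_MILLIONS, 2000 / 5)
high = project_index_size_gb(TARGET_MILLIONS, 3000 / 5)

print(f"From 300 GB per million: {from_per_million:,.0f} GB")
print(f"From 2 to 3 TB at 5 million: {low:,.0f} to {high:,.0f} GB")
```

Either way the projected index is several terabytes, which is why holding it entirely in memory is not an option.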
A large index means disk I/O is the bottleneck
Tradeoff: JVM vs. OS memory
Solr uses OS memory (disk I/O caching) for caching of postings
Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming)
Fitting the entire index in memory is not feasible with a terabyte-size index
Response time varies with query
Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
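The gap between the average and the median comes from a small number of very slow queries. A minimal sketch of how such a summary could be computed from a query log follows. The function names and the sample values (in milliseconds) are hypothetical and are not the measurements reported here. The nearest-rank percentile is one of several common definitions.

```python
import math
import statistics

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize(response_times):
    """Return average, median, 90th and 99th percentile response times."""
    ordered = sorted(response_times)
    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "90th": percentile(ordered, 90),
        "99th": percentile(ordered, 99),
    }

# Hypothetical sample: 98 fast queries and 2 very slow ones.
sample = [50] * 98 + [10_000, 120_000]
summary = summarize(sample)
print(summary)
```

In this made-up sample the two slow queries pull the average far above the median, the same pattern as in the figures above.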
Slowest 5% of queries
The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
[Chart: Response Time 95th percentile (seconds). Y-axis: Response Time (seconds), 0 to 1,000. X-axis: Query number, 940 to 1,000.]
These queries affect response time of other queries:
Cache pollution
Contention for resources
Slowest queries are phrase queries containing common words
Query processing
Phrase queries use position index (Boolean queries do not).
Position index accounts for 85% of index size
Position lists for common words such as “the” can be many GB in size
This causes lots of disk I/O.
Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query
I/O from phrase queries containing common words pollutes the cache
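Cache pollution can be illustrated with a toy simulation. The sketch below uses a small least-recently-used (LRU) cache as a stand-in for the operating system's disk cache. That is an assumption made for illustration, since real page caches use more elaborate policies, and the class, block names, and sizes are all hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache of block ids; a stand-in for a disk cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                   # would be a disk read
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

def run_workload(pollute):
    """Return (hits, misses) for popular queries re-run after warming."""
    cache = LRUCache(capacity=100)
    hot_blocks = [("hot", i) for i in range(50)]
    for block in hot_blocks:          # warm the cache
        cache.read(block)
    if pollute:
        # One phrase query streaming a very large position list.
        for i in range(1000):
            cache.read(("the-positions", i))
    cache.hits = cache.misses = 0     # count only the re-run below
    for block in hot_blocks:          # re-run the popular queries
        cache.read(block)
    return cache.hits, cache.misses

print("without large read:", run_workload(pollute=False))
print("with large read:   ", run_workload(pollute=True))
```

In this toy model a single large sequential read evicts every block the popular queries rely on, so those queries go back to disk.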
Slow Queries
Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
4MB data read for Boolean query.
9,000+ MB read for Phrase query.
WORD         NUMBER OF   POSTINGS LIST   TOTAL TERM OCCURRENCES   POSITION LIST
             DOCUMENTS   (SIZE MB)       (MILLIONS)               (SIZE MB)
the          800,000     0.8             4,351                    4,351
of           892,000     0.89            2,795                    2,795
and          769,000     0.77            1,870                    1,870
literature   435,000     0.44            9                        9
generation   414,000     0.41            5                        5
lives        432,000     0.43            5                        5
beat         278,000     0.28            1                        1
TOTAL                    4.02                                     9,036
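The two totals can be reproduced by summing the per-term sizes. The sketch below assumes a simplified model in which a Boolean query reads only the postings lists of its terms, while a phrase query also reads their position lists. The sizes are copied from the table above, and the function names are hypothetical.

```python
# (postings list MB, position list MB) per term, from the table above.
TERM_SIZES_MB = {
    "the": (0.8, 4351),
    "of": (0.89, 2795),
    "and": (0.77, 1870),
    "literature": (0.44, 9),
    "generation": (0.41, 5),
    "lives": (0.43, 5),
    "beat": (0.28, 1),
}

def postings_mb(terms):
    """MB of postings lists for the given terms (Boolean query)."""
    return sum(TERM_SIZES_MB[t][0] for t in terms)

def positions_mb(terms):
    """MB of position lists for the given terms (extra cost of a phrase query)."""
    return sum(TERM_SIZES_MB[t][1] for t in terms)

query = "the lives and literature of the beat generation"
terms = set(query.split())  # each distinct term is counted once

print(f"Postings lists:  {postings_mb(terms):.2f} MB")
print(f"Position lists:  {positions_mb(terms):,} MB")
```

Almost all of the data comes from the position lists of just three words: “the”, “of”, and “and”.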
Why not use Stop Words?
The word “the” occurs more than 4 billion times in our 1 million document index.
Removing “stop” words (“the”, “of”, etc.) is not desirable for our use cases.
Couldn’t search for many phrases
“to be or not to be”
“the who”
“man in the moon” vs. “man on the moon”
Stop words in one language are content words in another language
German stop words “war” and “die” are content words in English
English stop words “is” and “by” are content words (“ice” and “village”) in Swedish
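The effect of stop word removal on the example phrases can be shown with a few lines of code. This is a minimal sketch with a hypothetical stop list and a whitespace tokenizer. It does not reflect any particular Solr configuration.

```python
# Hypothetical stop list, chosen only to cover the example phrases.
STOP_WORDS = frozenset({"to", "be", "or", "not", "the", "in", "on", "of", "and"})

def remove_stop_words(phrase):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in phrase.lower().split() if word not in STOP_WORDS]

for phrase in ["to be or not to be", "the who",
               "man in the moon", "man on the moon"]:
    print(f"{phrase!r} -> {remove_stop_words(phrase)}")
```

With this list, “to be or not to be” loses every word, “the who” becomes a search for “who”, and the two “moon” phrases can no longer be told apart.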
“CommonGrams”
Ported Nutch “CommonGrams” algorithm to Solr
Create bi-grams selectively for any two-word sequence containing common terms
Slowest query: “The lives and literature of the beat generation”
“the-lives”
“lives-and”
“and-literature”
“literature-of”
“of-the”
“the-beat”
“generation”
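The token list above can be reproduced with a simplified sketch of the idea. This is not the Nutch or Solr implementation. The common word list, the “-” separator, the whitespace tokenizer, and the function names are assumptions made for illustration. At index time every word is kept and a bi-gram is added wherever a pair contains a common word. At query time the bi-gram replaces the single words it covers.

```python
COMMON_WORDS = frozenset({"the", "of", "and"})  # assumed common word list

def forms_gram(first, second, common=COMMON_WORDS):
    """A pair becomes a bi-gram if either word is a common word."""
    return first in common or second in common

def index_tokens(text, common=COMMON_WORDS):
    """Index side: keep every word, add a bi-gram after qualifying pairs."""
    words = text.lower().split()
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i + 1 < len(words) and forms_gram(word, words[i + 1], common):
            tokens.append(f"{word}-{words[i + 1]}")
    return tokens

def query_tokens(text, common=COMMON_WORDS):
    """Query side: prefer bi-grams, drop single words they already cover."""
    words = text.lower().split()
    tokens = []
    previous_was_gram = False
    for i, word in enumerate(words):
        next_is_gram = i + 1 < len(words) and forms_gram(word, words[i + 1], common)
        if next_is_gram:
            tokens.append(f"{word}-{words[i + 1]}")
        elif not previous_was_gram:
            tokens.append(word)  # not covered by any bi-gram
        previous_was_gram = next_is_gram
    return tokens

print(query_tokens("The lives and literature of the beat generation"))
```

The query then looks up the much rarer bi-gram terms rather than the very long position lists for “the”, “of”, and “and”.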
Standard index vs. CommonGrams
Standard Index
WORD         TOTAL OCCURRENCES       NUMBER OF DOCS
             IN CORPUS (MILLIONS)    (THOUSANDS)
the          2,013                   386
of           1,299                   440
and          855                     376
literature   4                       210
lives        2                       194
generation   2                       199
beat         0.6                     130
TOTAL        4,176

Common Grams
TERM             TOTAL OCCURRENCES       NUMBER OF DOCS
                 IN CORPUS (MILLIONS)    (THOUSANDS)
of-the           446                     396
generation       2.42                    262
the-lives        0.36                    128
literature-of    0.35                    103
lives-and        0.25                    115
and-literature   0.24                    77
the-beat         0.06                    26
TOTAL            450
Comparison of Response time (ms)
                 AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
Standard Index   459       32       146    6,784   120,595
Common Grams     68        3        71     2,226   7,800
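The improvement is easier to compare as a ratio per measure. The sketch below only divides the standard index response times by the CommonGrams response times reported on this slide. The variable names are hypothetical.

```python
MEASURES = ["average", "median", "90th", "99th", "slowest"]
STANDARD_MS = [459, 32, 146, 6784, 120595]
COMMON_GRAMS_MS = [68, 3, 71, 2226, 7800]

# Ratio of standard index time to CommonGrams time for each measure.
speedups = {
    name: standard / common_grams
    for name, standard, common_grams in zip(MEASURES, STANDARD_MS, COMMON_GRAMS_MS)
}

for name, ratio in speedups.items():
    print(f"{name:>8}: {ratio:.1f}x faster with CommonGrams")
```

The largest gain is on the slowest query, which is the phrase query with the most common words.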
Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list.
We used the Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked.
We discovered that words such as “l’art” were being split into two-token phrase queries.
We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
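The “l’art” problem can be reproduced outside Solr with two toy tokenizers. This sketch does not use the actual analyzer. Both regular expressions and the function names are assumptions, chosen only to show how splitting on an apostrophe turns one word into a two-token phrase.

```python
import re

def split_on_punctuation(text):
    """Naive tokenizer: every run of word characters is a token."""
    return re.findall(r"\w+", text.lower())

def keep_inner_apostrophes(text):
    """Tokenizer that keeps apostrophes found between word characters."""
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())

def is_phrase_query(tokens):
    """A single query word that yields several tokens is run as a phrase."""
    return len(tokens) > 1

for tokenizer in (split_on_punctuation, keep_inner_apostrophes):
    tokens = tokenizer("l’art")
    print(tokenizer.__name__, tokens, "phrase query:", is_phrase_query(tokens))
```

With the first tokenizer a one-word search turns into the phrase query “l art”, which needs the position index.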
Other issues
We broke Solr … temporarily
Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms
Solr/Lucene index size was limited to 2.1 billion unique terms
Patched: Now it’s 274 billion
Dirty OCR is difficult to remove without removing “good” words.
Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
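The slide does not say where the two limits come from. One plausible reading, offered here as an assumption, is that the old limit is the largest signed 32-bit integer and that the patched limit is that value multiplied by a term index interval of 128. The arithmetic below only checks that this reading matches the numbers quoted above.

```python
MAX_SIGNED_INT32 = 2**31 - 1   # largest value of a signed 32-bit integer
TERM_INDEX_INTERVAL = 128      # assumption: term index interval of 128

old_limit = MAX_SIGNED_INT32
new_limit = MAX_SIGNED_INT32 * TERM_INDEX_INTERVAL

print(f"Old limit: {old_limit:,} unique terms")
print(f"New limit: {new_limit:,} unique terms")
```

Under that reading the limits are about 2.1 billion and about 274 billion, and the 2.4 billion unique terms in this index exceed the first but fit well within the second.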