The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS
Vuk Ercegovac
David J. DeWitt
Raghu Ramakrishnan
Applications Combining Text and Relational Data
Query:
SELECT SCORE, P.id
FROM Products P
WHERE P.type = 'PDA' and
      CONTAINS(P.complaint, 'short battery life', SCORE)
ORDER BY SCORE DESC
ProductComplaints
Score    P.id
0.9      123
0.87     987
0.82     246
...      ...
How should such an application be
expected to perform?
Possibilities for Benchmarking
Workload            Quality               Response Time / Throughput
Relational          N/A                   TPC [3], AS3AP [10], Set Query [8]
Text                TREC [2], VLC2 [1]    FTDR [4], VLC2 [1]
Relational + Text   ??                    TEXTURE
1. http://es.csiro.au/TRECWeb/vlc2info.html
2. http://trec.nist.gov
3. http://www.tpc.org
4. S. DeFazio. Full-text Document Retrieval Benchmark, chapter 8. Morgan Kaufmann, 2nd edition, 1993.
8. P. O'Neil. The Set Query Benchmark. The Benchmark Handbook, 1991.
10. C. Turbyfill, C. Orji, and D. Bitton. AS3AP: A Comparative Relational Database Benchmark. IEEE Compcon, 1989.
Contributions of TEXTURE
- Design a micro-benchmark to compare response time using a mixed relational + text query workload
- Develop TextGen to synthetically grow a text collection given a real text collection
- Evaluate TEXTURE on 3 commercial systems
Why a Micro-benchmark Design?
- A fine level of control for experiments is needed to differentiate effects due to:
  - How text data is stored
  - How documents are assigned a score
  - Optimizer decisions
Why use Synthetic Text?
- Allows for systematic scale-up
  - User's current data set may be too small
- Users may be more willing to share synthetic data
- Measurements on synthetic data are empirically shown (by us) to be close to the same measurements on real data
A Note on Quality
- Measuring quality is important!
  - Easy to quickly return poor results
- We assume that the three commercial systems strive for high quality results
  - Some participated at TREC
  - Large overlap between result sets
Outline
- TEXTURE Components
- Evaluation
- Synthetic Text Generation
TEXTURE Components
[Diagram: QueryGen derives queries 1..n from the query templates; each query is run on System A and System B to obtain per-system response times. DBGen populates the relational attributes (num_id as the primary key; num_u, num_05, num_5, num_50 with un-clustered indexes) and TextGen populates the text attributes (txt_short, txt_long, display, body).]
Overview of Data
- Schema based on the Wisconsin Benchmark [5]
  - Used to control relational predicate selectivity
  - Relational attributes populated by DBGen [6] (see the illustrative sketch below)
  - Text attributes populated by TextGen (new)
    - Input: D: document collection, m: scale-up factor
    - Output: D': document collection with |D| x m documents
    - Goal: same response times for workloads on D' and the corresponding real collection
5. D. DeWitt. The Wisconsin Benchmark: Past, Present, and Future. The Benchmark Handbook, 1991.
6. J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger.
Quickly Generating Billion-record Synthetic Databases. ACM SIGMOD, 1994
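
To make the selectivity control concrete, here is a minimal Python sketch of DBGen-style population of the relational attributes. The attribute semantics (num_id as a unique key; num_05, num_5, num_50 as low-cardinality columns whose equality predicates select roughly 0.5%, 5%, and 50% of the tuples) are assumptions inferred from the names, not taken from the slides.

import random

def generate_row(num_id, total_rows):
    # One Wisconsin-style tuple (relational attributes only); text attributes
    # such as txt_short and txt_long would be filled in by TextGen.
    # Assumed semantics: an equality predicate on num_05 / num_5 / num_50
    # selects roughly 0.5% / 5% / 50% of the rows; num_u is uniformly random.
    return {
        "num_id": num_id,                       # unique key
        "num_u": random.randrange(total_rows),  # uniform random value
        "num_05": random.randrange(200),        # 200 distinct values -> ~0.5% per value
        "num_5": random.randrange(20),          # 20 distinct values  -> ~5% per value
        "num_50": random.randrange(2),          # 2 distinct values   -> ~50% per value
    }

rows = [generate_row(i, 84678) for i in range(84678)]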
Overview of Queries
- Query workloads derived from query templates with the following parameters (see the sketch after this list):
  - Text expressions:
    - Vary number of keywords, keyword selectivity, and type of expression (i.e., phrase, Boolean, etc.)
    - Keywords chosen from the text collection
  - Relational expression:
    - Vary predicate selectivity, join condition selectivity
  - Sort order:
    - Choose between relational attribute or score
  - Retrieve ALL or TOP-K results
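
As a rough illustration of how QueryGen might instantiate these template parameters, here is a hypothetical Python sketch. The function name, the CONTAINS syntax (copied from the slides' examples), and the FETCH FIRST clause for top-k retrieval are illustrative; the actual syntax varies across the commercial systems.

import random

def instantiate_query(keywords, text_expr="phrase", rel_attr="num_5",
                      rel_value=3, sort_by="SCORE", top_k=None):
    # keywords : terms drawn from the text collection (controls keyword selectivity)
    # text_expr: 'phrase', 'and', or 'or' text expression
    # rel_attr : relational attribute in the predicate (controls relational selectivity)
    # sort_by  : 'SCORE' (relevance) or a relational attribute name
    # top_k    : None retrieves ALL results, otherwise only the top k
    if text_expr == "phrase":
        text = " ".join(keywords)
    else:
        text = f" {text_expr.upper()} ".join(keywords)
    order = "SCORE DESC" if sort_by == "SCORE" else f"{sort_by} ASC"
    limit = "" if top_k is None else f" FETCH FIRST {top_k} ROWS ONLY"
    return (f"SELECT SCORE, num_id, txt_short FROM R "
            f"WHERE {rel_attr} = {rel_value} "
            f"AND CONTAINS(R.txt_long, '{text}', SCORE) "
            f"ORDER BY {order}{limit}")

# Example: a two-keyword phrase query with terms chosen from the collection
print(instantiate_query(random.sample(["foo", "bar", "baz"], 2)))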
Example Queries
- Example of a single-relation, mixed relational and text query that sorts according to a relevance score:

SELECT SCORE, num_id, txt_short
FROM R
WHERE num_5 = 3 and
      CONTAINS(R.txt_long, 'foo bar', SCORE)
ORDER BY SCORE DESC

- Example of a join query, sorting according to a relevance score on S.txt_long:

SELECT S.SCORE, S.num_id, S.txt_short
FROM R, S
WHERE R.num_id = S.num_id and S.num_05 = 2
      and CONTAINS(S.txt_long, 'foo bar', S.SCORE)
ORDER BY S.SCORE DESC
Outline
- TEXTURE Components
- Evaluation
- Synthetic Text Generation
Overview of Experiments
- How is response time affected as the database grows in size?
- How is response time affected by sort order and top-k optimizations?
- How do the results change when the input collection to TextGen differs?
Data and Query Workloads
- TextGen input is TREC AP Vol. 1 [2] and VLC2 [1]
  - Output: relations with {1, 2.5, 5, 7.5, 10} x 84,678 tuples
  - Corresponds to ~250 MB to 2.5 GB of text data
- Text-only queries:
  - Low (<0.03%) vs. high (<3%) keyword selectivity
  - Phrases, OR, AND
- Mixed, single relation queries:
  - Low (<0.01%) vs. high (5%) relational predicate selectivity
  - Pair with all text-only queries
- Mixed, multi relation queries:
  - 2, 3 relations, vary text attribute used, vary selectivity
- Each query workload consists of 100 queries
1. http://es.csiro.au/TRECWeb/vlc2info.html
2. http://trec.nist.gov
Methodology for Evaluation
- Set up the database and query workloads
- Run each workload multiple times per system to obtain warm numbers (see the timing sketch below)
  - Discard the first run, report the average of the remaining runs
- Repeat for all systems (A, B, C)
- Platform: Microsoft Windows 2003 Server, dual-processor 1.8 GHz AMD, 2 GB of memory, 8 x 120 GB IDE drives
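
A minimal sketch of the warm-run timing described above, assuming a caller supplies a run_query function that executes one benchmark query on the system under test (the helper names are hypothetical, not the benchmark's actual harness):

import time

def time_workload(run_query, queries, runs=3):
    # Run the workload several times; discard the first (cold) run and
    # report the average of the remaining (warm) runs.
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        for q in queries:
            run_query(q)  # execute one benchmark query on the system under test
        elapsed.append(time.perf_counter() - start)
    warm = elapsed[1:]
    return sum(warm) / len(warm)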
Scaling: Text-Only Workloads
How does response time vary per system as the data set scales up?
- Query workload: low text selectivity (0.03%)
- Text data: synthetic, based on TREC AP Vol. 1
[Chart: response time in seconds (0-60) vs. scale factor (1, 2.5, 5, 7.5, 10) for Systems A, B, and C]
Mixed Text/Relational Workloads
- Drill down on scale factor 5 (~450K tuples)
  - Query workload Low: text selectivity (0.03%)
  - Query workload High: text selectivity (3%)
- Do the systems take advantage of the relational predicate for mixed workload queries?
  - Query workload Mix: high text selectivity, low relational selectivity (0.01%)

Workload:    Low     High    Mix
System A     2.8     71      69 (97%)
System B     30      140     97 (69%)
System C     2.6     28      21 (75%)
Seconds per system and workload (synthetic TREC)
Top-k vs. All Results
- Compare retrieving all vs. top-k results
- Query workload is Mix from before
  - High selectivity text expression (3%)
  - Low selectivity relational predicate (0.01%)

Workload:    All     Top-k
System A     69      2.6
System B     97      96
System C     28      2.2
Seconds per system and workload (450K tuples, synthetic TREC)
Varying Sort Order
- Compare sorting by score vs. sorting by a relational attribute
  - When retrieving all results, the results are similar to the previous experiment
  - Results for retrieving top-k are shown below

Workload:    Score    Relational
System A     2.6      2.7
System B     96       715
System C     2.2      2.2
Seconds per system and workload (450K tuples, synthetic TREC)
Varying the Input Collection
- What is the effect of different input text collections on response time?
  - Query workload: low text selectivity (0.03%)
  - All results retrieved
  - Text data: synthetic TREC and VLC2

Collection:  Synthetic TREC    Synthetic VLC2
System A     2.9               1.2
System B     30                3.6
System C     2.5               1.6
Seconds per system and collection (450K tuples)
Outline
- Benchmark Components
- Evaluation
- Synthetic Text Generation
Synthetic Text Generation
- TextGen:
  - Input: document collection D, scale-up factor m
  - Output: document collection D' with |D| x m documents
  - Goal: same response times for workloads on D' and a corresponding real collection C, with |C| = |D'|
- Problem: given documents D, how do we add documents to obtain D'?
- Approach: extract "features" from D and draw |D'| samples according to the features
Document Collection Features
- Features considered (standard forms of the named models are given below):
  - W(w, c): word distribution
  - G(n, v): vocabulary growth
  - U, L: number of unique and total words per document
  - C(w1, w2, ..., wn, c): co-occurrence of word groups
- Each feature is estimated by a model
  - e.g., Zipf [11] or an empirical distribution for W
  - e.g., Heaps' law for G [7]
7. H. S. Heaps, Information Retrieval, Computational and Theoretical Aspects. Academic Press, 1978.
11. G. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Hafner Publications, 1949.
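
For reference, the two named models can be written in their standard textbook forms (the formulas below are not taken from the slides):

Zipf's law (word distribution W):   f(r) \propto \frac{1}{r^{s}}, \quad s \approx 1
    where f(r) is the frequency of the word with frequency rank r.
Heaps' law (vocabulary growth G):   V(n) = K\, n^{\beta}, \quad 0 < \beta < 1
    where V(n) is the vocabulary size after n total words, and K and \beta are collection-dependent constants.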
Process to Generate D'
- Pre-process: estimate the features
  - Depends on the model used for each feature
- Generate |D'| documents (see the sketch below)
  - Generate each document by sampling W according to U and L
  - Grow the vocabulary according to G
- Post-process: swap words between documents in order to satisfy the co-occurrence of word groups C
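
A minimal Python sketch of this generation loop for a Synthetic1-style generator (Zipf word distribution, average document length). The vocabulary-growth, unique-word, and co-occurrence steps are omitted, and all names are illustrative rather than the authors' implementation.

import random
import string

def random_word(length=7):
    # A synthetic "word" is a random string of letters
    # (see the later slides on randomized words).
    return "".join(random.choices(string.ascii_lowercase, k=length))

def zipf_weights(vocab_size, s=1.0):
    # Zipf-like sampling weights for ranks 1..vocab_size.
    return [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]

def generate_collection(num_docs, vocab_size=50000, avg_len=300):
    # Draw each document's words from the Zipf word distribution W.
    # Vocabulary growth (G), unique-word counts (U), and co-occurrence (C)
    # are ignored here to keep the sketch short.
    vocab = [random_word() for _ in range(vocab_size)]
    weights = zipf_weights(vocab_size)
    return [" ".join(random.choices(vocab, weights=weights, k=avg_len))
            for _ in range(num_docs)]

docs = generate_collection(num_docs=1000)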
Feature-Model Combinations
- Considered 3 instances of TextGen, each a combination of features/models
  - Features: W (word distr.), G (vocab), L (length), U (unique), C (co-occur.)
  - Synthetic1: Zipf for W, Heaps for G
  - Synthetic2: empirical distribution for W
  - Synthetic3: empirical distribution for W and for C
  - Remaining features are modeled by averages (L, U) or are not modeled (N/A)
Which TextGen is a Good Generator?
- Goal: response times measured on synthetic (S) and real (D) collections should be similar across systems
- Does the use of randomized words in D' affect response time accuracy?
- How does the choice of features and models affect response time accuracy as the data set scales?
Use of Random Words
- Words are strings composed of a random permutation of letters
- Random words are useful for:
  - Vocabulary growth
  - Sharing text collections
- Do randomized words affect measured response times?
  - What is the effect on stemming, compression, and other text processing components?
Effect of Randomized Words
- Experiment: create two TEXTURE databases and compare across systems (see the sketch below)
  - Database AP: based on TREC AP Vol. 1
  - Database R-AP: randomize each word in AP
  - Query workload: low & high selectivity keywords
- Result: response times differ on average by < 1%, never exceeding 4.4%
- Conclusion: using random words is reasonable for measuring response time
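
A minimal sketch of how the R-AP database could be derived from AP, assuming each distinct word is mapped once to a random permutation of its letters so that term frequencies (and hence keyword selectivities) are roughly preserved; this mapping is my reading of the slide, not the authors' exact procedure.

import random

_mapping = {}

def randomize_word(word):
    # Map each distinct word to a fixed random permutation of its letters,
    # so every occurrence of a word is replaced by the same random string.
    if word not in _mapping:
        letters = list(word)
        random.shuffle(letters)
        _mapping[word] = "".join(letters)
    return _mapping[word]

def randomize_document(text):
    return " ".join(randomize_word(w) for w in text.split())

print(randomize_document("short battery life short battery"))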
Effect of Features and Models
- Experiment: compare response times over same-sized synthetic (S) and real (D) collections
  - Sample s documents of D
  - Use TextGen to produce S at several scale factors: |S| = 10, 25, 50, 75, and 100% of |D|
  - Compare response times across systems
  - Must repeat for each type of text-only query workload
- Used as a framework for picking features/models
TextGen Evaluation Results
How does response time measured on real data compare to the synthetic TextGen collections?
- Query workload: low selectivity, text-only queries (0.03%)
- Graph is for System A
  - Similar results obtained for the other systems
[Chart: elapsed time in seconds (0-1.6) vs. scale factor (10, 25, 50, 75, 100% of |D|) for the real collection and Synthetic-1, Synthetic-2, Synthetic-3]
Future Work
- How should quality measurements be incorporated?
- Extend the workload to include updates
- Allow correlations between attributes when generating the database
Conclusion
- We propose TEXTURE to fill the gap faced by applications that use mixed relational and text queries
- We can scale up a text collection through synthetic text generation in such a way that response time is accurately reflected
- The results of our evaluation illustrate significant differences between current commercial relational systems
References
1. http://es.csiro.au/TRECWeb/vlc2info.html
2. http://trec.nist.gov
3. http://www.tpc.org
4. S. DeFazio. Full-text Document Retrieval Benchmark, chapter 8. Morgan Kaufmann, 2nd edition, 1993.
5. D. DeWitt. The Wisconsin Benchmark: Past, Present, and Future. The Benchmark Handbook, 1991.
6. J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-record Synthetic Databases. ACM SIGMOD, 1994.
7. H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, 1978.
8. P. O'Neil. The Set Query Benchmark. The Benchmark Handbook, 1991.
9. K. A. Shoens, A. Tomasic, and H. Garcia-Molina. Synthetic Workload Performance Analysis of Incremental Updates. In Research and Development in Information Retrieval, 1994.
10. C. Turbyfill, C. Orji, and D. Bitton. AS3AP: A Comparative Relational Database Benchmark. IEEE Compcon, 1989.
11. G. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Hafner Publications, 1949.
Questions?