Exploiting Multithreaded Architectures to Improve the Hash Join Operation
Layali Rashid, Wessam M. Hassanein, and Moustafa A. Hammad*
The Advanced Computer Architecture Group @ U of C (ACAG)
Department of Electrical and Computer Engineering
*Department of Computer Science
University of Calgary
Outline
• The SMT and the CMP Architectures
• The Hash Join Database Operation
• Motivation
• Architecture-Aware Hash Join
• Experimental Methodology
• Timing and Memory Analysis
• Conclusions
The SMT and the CMP Architectures
• Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor.
• Chip Multiprocessor (CMP): more than one processor is integrated on a single chip.
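From the software's point of view, both SMT hardware contexts and CMP cores appear as additional logical processors onto which threads can be scheduled. The following minimal C/OpenMP sketch (our own illustration, not taken from the paper) queries that hardware parallelism and runs one software thread per logical processor:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Both SMT thread contexts and CMP cores show up here as logical processors. */
    int logical = omp_get_num_procs();

    /* Run one software thread per logical processor. */
    #pragma omp parallel num_threads(logical)
    {
        #pragma omp single
        printf("%d logical processors, %d threads in the team\n",
               logical, omp_get_num_threads());
    }
    return 0;
}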
The Hash Join Database Operation
• The hash join process
• The partition-based hash join algorithm (a simplified sketch follows below)
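Since the slide's figures are not reproduced here, the following minimal, single-threaded C sketch illustrates the partition-based (Grace-style) hash join: both relations are first partitioned on a hash of the join key, and each partition pair is then joined by building a hash table on the build-side partition and probing it with the matching probe-side partition. The relation contents, sizes, and hash function are illustrative assumptions, not the paper's code.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; int payload; } tuple_t;

enum { NPART = 16, RSIZE = 1000, SSIZE = 2000 };

static unsigned hash_key(int k) { return (unsigned)k * 2654435761u; }

int main(void)
{
    /* Illustrative input: R keys 0..RSIZE-1, S keys drawn from the same range. */
    tuple_t *R = malloc(RSIZE * sizeof *R), *S = malloc(SSIZE * sizeof *S);
    for (int i = 0; i < RSIZE; i++) R[i] = (tuple_t){ i, i };
    for (int i = 0; i < SSIZE; i++) S[i] = (tuple_t){ i % RSIZE, i };

    /* Step 1: partition both relations on hash(key), so that matching keys
     * always land in the same partition pair. */
    tuple_t *Rp[NPART], *Sp[NPART];
    size_t rn[NPART] = {0}, sn[NPART] = {0};
    for (int p = 0; p < NPART; p++) {
        Rp[p] = malloc(RSIZE * sizeof *R);
        Sp[p] = malloc(SSIZE * sizeof *S);
    }
    for (int i = 0; i < RSIZE; i++) { unsigned p = hash_key(R[i].key) % NPART; Rp[p][rn[p]++] = R[i]; }
    for (int i = 0; i < SSIZE; i++) { unsigned p = hash_key(S[i].key) % NPART; Sp[p][sn[p]++] = S[i]; }

    /* Steps 2-3: per partition, build a chaining hash table on the R
     * partition, then probe it with the S partition. */
    size_t matches = 0;
    for (int p = 0; p < NPART; p++) {
        size_t nbuckets = rn[p] ? 2 * rn[p] : 1;
        long *head = malloc(nbuckets * sizeof *head);
        long *next = malloc((rn[p] + 1) * sizeof *next);
        for (size_t b = 0; b < nbuckets; b++) head[b] = -1;
        for (size_t i = 0; i < rn[p]; i++) {               /* build */
            unsigned b = hash_key(Rp[p][i].key) % nbuckets;
            next[i] = head[b];
            head[b] = (long)i;
        }
        for (size_t i = 0; i < sn[p]; i++) {               /* probe */
            unsigned b = hash_key(Sp[p][i].key) % nbuckets;
            for (long j = head[b]; j != -1; j = next[j])
                if (Rp[p][j].key == Sp[p][i].key) matches++;
        }
        free(head); free(next);
    }
    printf("matches: %zu (expected %d)\n", matches, SSIZE);
    return 0;
}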
Motivation
• Multithreaded architectures create new opportunities for improving essential DBMS operations.
• Hash join is one of the most important operations in current commercial DBMSs.
• The L2 cache load miss rate is a critical factor in main-memory hash join performance.
• Therefore, we have two goals:
  • Utilize the multiple threads.
  • Decrease the L2 miss rate.
[Figure: Characterizing the Grace hash join on a multithreaded machine — L1 and L2 load miss rates vs. tuple size (Bytes).]
Architecture-Aware Hash Join (AA_HJ)
1. The R-relation index partition phase
   • Tuples are divided equally between threads; each thread has its own set of L2-cache-sized clusters.
2. The build and S-relation index partition phase
   • One thread builds a hash table from each key range.
   • The other threads index-partition the probe relation.
(A combined sketch of all three phases follows after the probe phase on the next slide.)
Architecture-Aware Hash Join (cont’d)
3. The probe phase
   • The random accesses to any hash table whenever there is a search for a potential match are a challenge.
   • Threads probe hash tables with a similar key range simultaneously to increase temporal and spatial locality (a simplified sketch of all three phases follows below).
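The sketch below is a minimal, single-file illustration of the three AA_HJ phases in C with OpenMP. The key-range count, cluster sizing, hash-table layout, and relation contents are our own simplifying assumptions (the paper sizes clusters to the L2 cache and overlaps the build with the S-relation index partitioning); it is meant to show the thread structure, not to reproduce the authors' implementation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

typedef struct { int key; int payload; } tuple_t;
typedef struct { size_t *idx; size_t n; } cluster_t;   /* tuple indexes, no copies */
typedef struct { long *head; long *next; size_t *tup; size_t nbuckets; } htable_t;

enum { NRANGE = 8, KEY_MAX = 1 << 16, RSIZE = 50000, SSIZE = 100000 };

static int key_range(int key) { return (int)(((long long)key * NRANGE) / KEY_MAX); }

/* Phase 1 (and the S side of phase 2): each thread scans an equal share of
 * the relation and appends tuple indexes into its own per-key-range
 * clusters.  In the paper each cluster is sized to fit in the L2 cache. */
static cluster_t *index_partition(const tuple_t *rel, size_t n, int nthreads)
{
    cluster_t *cl = calloc((size_t)nthreads * NRANGE, sizeof *cl);
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        size_t lo = n * (size_t)t / nthreads, hi = n * (size_t)(t + 1) / nthreads;
        cluster_t *mine = &cl[(size_t)t * NRANGE];
        for (int r = 0; r < NRANGE; r++)
            mine[r].idx = malloc((hi - lo + 1) * sizeof *mine[r].idx);
        for (size_t i = lo; i < hi; i++) {
            int r = key_range(rel[i].key);
            mine[r].idx[mine[r].n++] = i;
        }
    }
    return cl;
}

int main(void)
{
    int nthreads = omp_get_max_threads();
    tuple_t *R = malloc(RSIZE * sizeof *R), *S = malloc(SSIZE * sizeof *S);
    for (size_t i = 0; i < RSIZE; i++) R[i] = (tuple_t){ (int)(i % KEY_MAX), (int)i };
    for (size_t i = 0; i < SSIZE; i++) S[i] = (tuple_t){ (int)(i % RSIZE), (int)i };

    /* Phase 1: R-relation index partitioning. */
    cluster_t *Rcl = index_partition(R, RSIZE, nthreads);

    /* Phase 2: build one hash table per key range.  In AA_HJ a subset of the
     * threads builds while the remaining threads index-partition S; here the
     * two steps simply run as successive parallel passes. */
    cluster_t *Scl = index_partition(S, SSIZE, nthreads);
    htable_t ht[NRANGE];
    #pragma omp parallel for num_threads(nthreads)
    for (int r = 0; r < NRANGE; r++) {
        size_t total = 0;
        for (int t = 0; t < nthreads; t++) total += Rcl[t * NRANGE + r].n;
        ht[r].nbuckets = total ? 2 * total : 1;
        ht[r].head = malloc(ht[r].nbuckets * sizeof *ht[r].head);
        ht[r].next = malloc((total + 1) * sizeof *ht[r].next);
        ht[r].tup  = malloc((total + 1) * sizeof *ht[r].tup);
        memset(ht[r].head, -1, ht[r].nbuckets * sizeof *ht[r].head);
        size_t pos = 0;
        for (int t = 0; t < nthreads; t++)
            for (size_t i = 0; i < Rcl[t * NRANGE + r].n; i++, pos++) {
                size_t tu = Rcl[t * NRANGE + r].idx[i];
                size_t b = (unsigned)R[tu].key % ht[r].nbuckets;
                ht[r].tup[pos] = tu;
                ht[r].next[pos] = ht[r].head[b];
                ht[r].head[b] = (long)pos;
            }
    }

    /* Phase 3: probe.  All threads work on the same key range at the same
     * time, so that range's hash table stays hot in the shared cache. */
    size_t matches = 0;
    for (int r = 0; r < NRANGE; r++) {
        #pragma omp parallel for num_threads(nthreads) reduction(+:matches)
        for (int t = 0; t < nthreads; t++) {
            cluster_t *c = &Scl[t * NRANGE + r];
            for (size_t i = 0; i < c->n; i++) {
                int key = S[c->idx[i]].key;
                for (long j = ht[r].head[(unsigned)key % ht[r].nbuckets];
                     j != -1; j = ht[r].next[j])
                    if (R[ht[r].tup[j]].key == key) matches++;
            }
        }
    }
    printf("threads=%d matches=%zu (expected %d)\n", nthreads, matches, SSIZE);
    return 0;
}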
Experimental Methodology
We ran our algorithms on two machines with the following specifications:
[Table: machine specifications — Machine 1: a Pentium 4 with Hyper-Threading; Machine 2: a quad Intel Xeon dual-core server.]
Experimental Methodology (cont’d)
• All algorithms are implemented in C.
• We employed the built-in OpenMP C/C++ library to manage parallelism.
• For Machine 1 we used a 50 MByte build relation and a 100 MByte probe relation.
• For Machine 2 we used a 250 MByte build relation and a 500 MByte probe relation.
• We used the Intel VTune Performance Analyzer for Linux 9.0 to collect the hardware events.
(A small setup sketch follows below.)
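A small sketch of the setup arithmetic, under our own assumptions (MByte taken as 2^20 bytes, and an illustrative thread count): the tuple count of each relation follows from its size divided by the tuple size, and the 2 to 16 threads per run are selected through OpenMP.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Illustrative values from the Machine 1 setup: 50 MByte build and
     * 100 MByte probe relations, tuple sizes of 20-140 Bytes.
     * Assumption: 1 MByte = 2^20 bytes. */
    const size_t build_bytes = 50u  * 1024 * 1024;
    const size_t probe_bytes = 100u * 1024 * 1024;
    const size_t tuple_sizes[] = { 20, 60, 100, 140 };

    for (size_t i = 0; i < sizeof tuple_sizes / sizeof tuple_sizes[0]; i++)
        printf("tuple=%3zu B  build tuples=%zu  probe tuples=%zu\n",
               tuple_sizes[i],
               build_bytes / tuple_sizes[i],
               probe_bytes / tuple_sizes[i]);

    /* Thread counts are set per run; the experiments use 2-16 threads. */
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}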
AA_HJ Timing Results
• We achieved speedups ranging from 2 to 4.6 compared to Grace hash join on the quad Intel Xeon dual-core server (Machine 2).
• Speedups for the Pentium 4 with HT (Machine 1) range from 2.1 to 2.9 compared to Grace hash join.
[Figure: Execution time (seconds) vs. tuple size (Bytes) for PT, NPT, Index PT, and AA_HJ with 2, 4, 8, 12, and 16 threads.]
• PT: copy-partitioning hash join
• NPT: non-partitioning hash join
• Index PT: index-partitioning hash join
• 2, 4, 8, 12, 16: number of threads
Memory Analysis for Multithreaded AA_HJ
• The decrease in the L2 load miss rate is due to the cache-sized index partitioning, constructive cache sharing, and group prefetching (a group-prefetching sketch follows below).
• There is a minor increase in the L1 data cache load miss rate, from 1.5% to 4%, on Machine 2.
[Figure: L2 and L1 load miss rates vs. tuple size (Bytes) for NPT and AA_HJ with 2, 4, 8, 12, and 16 threads.]
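The slide credits part of the L2 improvement to group prefetching. The fragment below is a simplified, single-stage illustration of that idea (issue the prefetches for a whole group of probe keys before visiting their buckets, so the cache misses overlap); the table layout, group size, and the use of the GCC/Clang __builtin_prefetch intrinsic are our own assumptions, not the authors' implementation.

#include <stdio.h>

enum { GROUP = 16 };

typedef struct { long head[1 << 16]; } table_t;   /* bucket heads only; -1 = empty */

static size_t bucket_of(int key) { return (unsigned)key & 0xFFFF; }

/* Count probe keys whose bucket is non-empty, prefetching in groups. */
static size_t probe_grouped(const table_t *ht, const int *keys, size_t n)
{
    size_t hits = 0, buckets[GROUP];
    for (size_t base = 0; base < n; base += GROUP) {
        size_t g = (n - base < GROUP) ? n - base : GROUP;
        for (size_t i = 0; i < g; i++) {          /* stage 1: compute and prefetch */
            buckets[i] = bucket_of(keys[base + i]);
            __builtin_prefetch(&ht->head[buckets[i]], 0 /* read */, 1);
        }
        for (size_t i = 0; i < g; i++)            /* stage 2: visit the buckets */
            if (ht->head[buckets[i]] != -1)
                hits++;
    }
    return hits;
}

int main(void)
{
    static table_t ht;
    int keys[1000];
    for (size_t b = 0; b < (1 << 16); b++) ht.head[b] = (b % 2) ? 0 : -1;
    for (int i = 0; i < 1000; i++) keys[i] = i;
    printf("non-empty buckets hit: %zu\n", probe_grouped(&ht, keys, 1000));
    return 0;
}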
Conclusions
• Revisiting the join implementation to take advantage of state-of-the-art hardware improvements is an important direction for boosting the performance of DBMSs.
• We confirmed previous findings that the hash join is bound by the L2 miss rates, which range from 29% to 62%.
• We proposed an Architecture-Aware Hash Join (AA_HJ) that relies on sharing critical structures between working threads at the cache level.
• We find that AA_HJ decreases the L2 cache miss rate from 62% to 11% and from 29% to 15% for tuple sizes of 20 Bytes and 140 Bytes, respectively.
The End
Time Breakdown Comparison (Machine 2)
[Figure: Time breakdown (seconds) into partition, build index partition, probe index partition, build, and probe phases for PT, Index PT, and AA_HJ with 2, 4, 8, 12, and 16 threads, at tuple sizes of 20, 60, 100, and 140 Bytes; two bars exceeding the axis are annotated at 35.91 s and 27.70 s.]