Presented By: Seungbeom Ma (ID 125)
Professor: Dr. T. Y. Lin
Computer Science Department
San Jose State University
Previously , most of algorithms are required two passes.
There is a case that we need more than two passes.
Case : Data is too big to store in main memory.
We have to hash or sort the relation with multipass algorithms.
1. Multipass Sort-Based Algorithm
2.
Multipass Hash-Based Algorithm
Multipass sort-based algorithm.
M: Number of Memory Buffers
R: Relation
B(R) : Number of blocks for holding relation.
BASIS:
1. If R fits in M block (B (R) <= M).
2. Reading R into main memory.
3. Sorting R in the main memory with any sorting algorithm.
4. Write the sorted relation to disk.
Multipass sort-based algorithm.
INDUCTION: (B(R)> M)
1. If R does not fit into main memory then partitioning the blocks hold R into M groups, which call R
1
, R
2
, …, R
M
2.Recursively sorting R i from i =1 to M
3.Once sorting is done, the algorithm merges the M sorted sublists.
Performance: Multipass Sort-Based Algorithms
1) Each pass of a sorting algorithm:
1.Reading data from the disk.
2. Sorting data with any sorting algorithms
3. Writing data back to the disk.
2-1) (k)-pass sorting algorithm needs
2k B(R) disk I/O’s
2-2)To calculate (Multipass)-pass sorting algorithm needs
= > A+ B
A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists]
B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the final pass]
Total: (2k-1)(B(R)+B(S)) disk I/O’s
Multipass Hash-Based Algorithms
1. Hashing the relations into M-1 buckets, where M is number of memory buffers.
2. Unary case:
It applies the operation to each bucket individually.
1.Duplicate elimination (
δ
) and grouping (
γ
).
1) Grouping: Min, Max, Count , Sum , AVG , which can group the data in the table
2) Duplicate elimination: Distinct
Basis:
If the relation fits in M memory block,
-> Reading relation into memory and perform the operations.
3. Binary case: It applies the operation to each corresponding pair of buckets.
Query operations: union, intersection, difference , and join
If either relations fits in M-1 memory blocks,
-> Reading that relation into main memory M-1 blocks
-> Reading next relation to 1 block at a time into the M th block
Then performing the operations.
INDUCTION
If Unary and Binary relation does not fit into the main memory buffers.
1.
Hashing each relation into M-1 buckets.
2.
Recursively performing the operation on each bucket or corresponding pair of buffers.
3.
Accumulating the output from each buckets or pair.
Hash-Based Algorithms : Unary Operatiors
Perfermance: Hash-Based Algorithms
R: Realtion.
Operations are like δ and γ
M: Buffers
U(M, k): Number of blocks in largest relation with k-pass hashing algorithm.
Performance: Induction
Induction:
1. Assuming that the first step divides relation R into M-1 equal buckets.
2. The buckets for the next pass must be small enough to handle in k-1 passes
3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1).
Sort-Based VS Hash-Based
1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time
2. Hash-based depends on buckets being of equal size. For binary operations, hash-based only limits size of smaller relation. Therefore, hash-based can be faster than sort-based for small size of relation.
THANKS