CS257_125_Ch.15.8 - Department of Computer Science

advertisement

15.8 Algorithms using more than two passes

Presented By: Seungbeom Ma (ID 125)

Professor: Dr. T. Y. Lin

Computer Science Department

San Jose State University

Multipass Algorithms

 Previously , most of algorithms are required two passes.

 There is a case that we need more than two passes.

 Case : Data is too big to store in main memory.

 We have to hash or sort the relation with multipass algorithms.

Agenda

 1. Multipass Sort-Based Algorithm

 2.

Multipass Hash-Based Algorithm

Multipass sort-based algorithm.

 M: Number of Memory Buffers

 R: Relation

 B(R) : Number of blocks for holding relation.

 BASIS:

 1. If R fits in M block (B (R) <= M).

 2. Reading R into main memory.

 3. Sorting R in the main memory with any sorting algorithm.

 4. Write the sorted relation to disk.

Multipass sort-based algorithm.

 INDUCTION: (B(R)> M)

 1. If R does not fit into main memory then partitioning the blocks hold R into M groups, which call R

1

, R

2

, …, R

M

 2.Recursively sorting R i from i =1 to M

 3.Once sorting is done, the algorithm merges the M sorted sublists.

Performance: Multipass Sort-Based Algorithms

1) Each pass of a sorting algorithm:

1.Reading data from the disk.

2. Sorting data with any sorting algorithms

3. Writing data back to the disk.

2-1) (k)-pass sorting algorithm needs

2k B(R) disk I/O’s

2-2)To calculate (Multipass)-pass sorting algorithm needs

= > A+ B

A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists]

B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the final pass]

Total: (2k-1)(B(R)+B(S)) disk I/O’s

Multipass Hash-Based Algorithms

 1. Hashing the relations into M-1 buckets, where M is number of memory buffers.

 2. Unary case:

 It applies the operation to each bucket individually.

 1.Duplicate elimination (

δ

) and grouping (

γ

).

 1) Grouping: Min, Max, Count , Sum , AVG , which can group the data in the table

 2) Duplicate elimination: Distinct

Basis:

If the relation fits in M memory block,

-> Reading relation into memory and perform the operations.

 3. Binary case: It applies the operation to each corresponding pair of buckets.

 Query operations: union, intersection, difference , and join

 If either relations fits in M-1 memory blocks,

 -> Reading that relation into main memory M-1 blocks

 -> Reading next relation to 1 block at a time into the M th block

 Then performing the operations.

INDUCTION

 If Unary and Binary relation does not fit into the main memory buffers.

1.

Hashing each relation into M-1 buckets.

2.

Recursively performing the operation on each bucket or corresponding pair of buffers.

3.

Accumulating the output from each buckets or pair.

Hash-Based Algorithms : Unary Operatiors

Perfermance: Hash-Based Algorithms

 R: Realtion.

 Operations are like δ and γ

 M: Buffers

 U(M, k): Number of blocks in largest relation with k-pass hashing algorithm.

Performance: Induction

Induction:

1. Assuming that the first step divides relation R into M-1 equal buckets.

2. The buckets for the next pass must be small enough to handle in k-1 passes

3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1).

Sort-Based VS Hash-Based

1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time

2. Hash-based depends on buckets being of equal size. For binary operations, hash-based only limits size of smaller relation. Therefore, hash-based can be faster than sort-based for small size of relation.

THANKS

Download