Targil 8 Notes

advertisement
Tirgul 6
B-Trees – Another kind of balanced trees
Problem set 1 - some solutions
Motivation
• Primary memory (RAM) : very fast, but costly
Secondary storage (disk) : very cheap, but slow
• Problem: a large D.B. must reside partially on disk. But disk
operations are very slow.
• Solution: take advantage of important disk property -Basic
read/write unit is a page (2-4 Kb) - can’t read/write less.
• Thus when analyzing D.B. performance, we consider two
different measures: CPU time and number of times we need
to access the disk.
• Besides, B-trees are an interesting type of balanced trees...
B-Trees
B-Tree: a balanced search tree whose nodes can have many
children:
• A node x contains n[x] keys, and has n[x]+1 children (c1[x], c2[x], … ,
cn[x]+1[x]).
• The keys in each node are ordered, and relate to their left and right sub-trees
like regular search trees: if ki is any key stored in the sub-tree rooted at ci[x],
then:
k1  key1 x   k 2  key2 x     keyn x  x   k n x 1
• All leaves have the same depth h (the tree’s height)
• There is a fixed integer t (the minimum degree) :
– Every node (besides the root) has at least t-1 keys (i.e. t children)
– Every node can contain at most 2t-1 keys (2t children).
Example
50
65 83 89
10 25
3
7
12 17 20 22
34 39
t=3
54 61
70 82
85 86
90 93
B-Trees and disk access (last time...)
• Each node contains as many keys as possible without being
larger than a single page on disk.
• Whenever we need to access a node – load it from the disk (one
read operation), after changing a node – rewrite it to the disk.
• For example, say each node contains 1000 keys – then the root
has 1001 children, each of which also has 1001 children. Thus
with just 2 disk accesses we are able to access ~10003 records.
• Operations are designed to work in one pass from the root to
the leaves – we do not need to backtrack our steps. This further
reduces the number of disk accesses we make.
The height of a B-Tree
Theorem: If n  1, then for any B-tree of height h with n keys and
minimum degree t  2:
h  log t ( (n+1) / 2 )
Proof: Each child of the root has at least t children, each of them also
has at least t children, and so on. Thus in every sub-tree of the root
there are at least ih1 t i 1 nodes. Each of them contains at least t-1
keys. The root contains at least one key and has at least two children,
so we have:
h
n  1  2(t  1)  t i 1
i 1
 th 1
 1  2(t  1)
  2t h  1
 t 1 
B-Tree Search
• Search is done in the regular way: In each node, we
find the sub-tree in which our value might be, and
recursively find it there.
• Performance:
O(th) = O(tlogtn) - total run-time, out of which:
O(h) = O(logtn) - disk access operations
B-Tree Split
• Used for insertion. This operation verifies that a node will
have less than 2t-1 keys.
• What we do is split the node into two nodes, each with
t-1 keys. The extra key goes into the node’s parent (We
assume the parent is not full)
• To split a node x (look at the next slide for illustration), take
keyt[x] (notice it is the median key). All smaller keys (exactly
t-1 of them) form one new (legal) node, the same with all
larger keys. keyt[x] goes into x’s parent.
• If the node we split is the root, then a new root is created. This
new root contains only one key.
B-tree split
x y
t-1 keys...
...
m
(parent)
t-1 keys...
...
(full node)
x m y
t-1 keys...
...
t-1 keys...
...
Notice that the parent has many other sub-trees that don’t change.
Example
50
10 25
50 89
65 83 89 95 96
A full node (t=3)
10 25
65 83
95 96
B-Tree Insert
• We insert a key only to a leaf. We start from the root and
go down to the appropriate leaf.
• On the way, before going down to the next node, we check
if it is full. If so, we split it (its father is non-full because
we checked this before going down to the father).
• When we reach the right leaf, we know that the leaf is not
full, so we can simply insert the new value to the leaf.
• Notice that we may need to split the root, if it is full. In
this case, the tree’s height increases (but the tree remains
completely balanced!). That’s why we say that a B-tree
grows from the root, in contrast to most of the trees, who
grow from the leaves...
Example
We start with an empty tree (t=3)
(I) Inserting 3,7,34,10,39
3
7
(II) Inserting 25
splits the root
10 34 39
3
(III) Inserting 40 and 20
7
7
25 34 39
(IV) Inserting 17 splits the right leaf
10
3
10
10 34
20 25 34 39 40
3
7
17 20 25
39 40
B-Tree Insert (cont.)
• Performance:
– Split:
• three disk accesses (to write the 2 new nodes, and the
parent)
• O(t) - total run time
– Insert:
• O(h) - disk accesses
• O(tlogtn) - total run time
• Requires O(1) pages in main memory.
Problem Set 1 Solutions
• 2.b T  n   3T  n / 2  n log  n  , T 1  2
• Solution
i
– This recurrence is defined for power of 2, n  2 , i  1
– We use here the general Master theorem which deals
with
T  n   aT  n / b   f  n 
– In this case f  n   n log  n 
1.5
n
log
n

O
n
– We will claim that
   
(next next slide)
Problem Set 1 Solutions (cont)
– The first case of Master theorem applies:
f  n   O n1.5  O nlog2 3 for some   0 ,
because log 2  3  1.5
 



– Using the Master theorem we have T  n    n
log 2  3

Problem Set 1 Solutions (cont)
– Claim n log  n   O  n1.5  :
• Recall that log  m  O  m
• This implies that for any constant c  0,
log  n   O  n c 
substitute m  nc, and get log  n c   c log  n   O  n c 
• Now set c  1/ 2 and we get that log  n   O  n1/ 2 
• If we multiply by n we get
n log  n   n  O  n1/ 2   O  n1.5 
Problem Set 1 Solutions (cont)
• 3. Finding the missing integer
– All the integers from 0 to n except one, are stored in an array
A[1…n]. We want to find the missing integer.
– In this problem we cannot access an entire integer in A with a
single operation.
– The elements in A are represented as binary strings.
– The only operation to access the integers is to ask for the ith
bit of A[j].
– Each such query takes constant time.
– Give an algorithm that finds the missing integer using   n 
queries.
Problem Set 1 Solutions (cont)
• Assumptions:
– All integers are presented with the same
number of bits. 0’s are put to the left of
smaller numbers.
k
N

n

1

2
–
for some integer k.
• Let x be the missing integer.
Problem Set 1 Solutions (cont)
• The principle:
–
–
–
–
–
–
A binary search
Run on the first bit of all the integers in A (takes n queries).
Count the number of 0’s you see on the way.
If x<N/2 then one zero is missing (N/2-1 0’s).
If x>=N/2 then no zero is missing (N/2 0’s).
After one pass of A we know:
• The first bit of x
• Whether x<N/2 or not
• We have only half the problem to search. (only elements
with the same bit as x)
– One pass of A is one step in the binary search.
– Proceed to query the next bit.
Problem Set 1 Solutions (cont)
• The problem:
– Is to keep track of all the elements of A that are still relevant
for the search.
• Solution:
– Use a linked list L
– Each element y in L contains two fields
(except for the usual pointers in a linked list ).
• y.pA – a pointer to an element in A
• y.bit – one bit of that element in A.
– Run through A by running over the pointers in the linked list.
– Updates the bits in the elements of L as you run
(in every pass).
– When the number of 0’s counted at the end of the pass,
delete from the list the elements pointing to non relevant
elements of A.
Problem Set 1 Solutions (cont)
• The run time of the algorithm:
– Each run goes
• a constant number of times through the liked list
• Performs a constant number of operations for each
element.
• O(n) steps.
– Each step the list is shortened by half.
– T  n   dn  T  n / 2 , T 1  1 for some constant d.
– Solution with Master theorem rule number 3:
(a=1, b=2)
T  n    n
Problem Set 1 Solutions (cont)
• 2.d T  n   T  n / 2  T  n / 4  n, T 1  4, T  2  8
• Solution
i
– This recurrence is defined for power of 2, n  2 , i  1
– Les us try to see how it behaves.
Opening it up, we have: T 1  4, T  2  8,
T  4  4  8  4  16, T 8  16  8  8  32
– We see that the answers are power of 2.
Problem Set 1 Solutions (cont)
– We now have to observe that if we present n as a
power of 2 as well, we get a simpler rule.
0
2
1
3
2
4
T  2   2 , T  2   2 , T  2   2 ...
– We thus can guess:
T 2
k
2
k 2
– The recurrence presented in term of power of 2, is:
T  2k   T  2k 1   T  2k  2   2k
T  20   22 , T  21   23
– Let us prove by induction on k that T  2k   2k  2
is indeed the solution.
Problem Set 1 Solutions (cont)
– Base case:
For k=0,1 it is true because of the initialization of
the recurrence.
– Reduction assumption: T  2k 1   2k 1
T  2k  2   2k
– Reduction proof:
T  2k   T  2k 1   T  2k  2   2k 
2
k 1
2 2 2
k
k
k 1
 22  2
k
k 1
2
k 1
2
k 2
Download