External Memory Data Structures Srinivasa Rao Satti Workshop on Recent Advances in Data Structures December 20, 2011 Fundamental Algorithmic Problems • Searching: Given a list (sequence) L of elements x1, x2, .., xn and query element x, check whether x is present in L. – When L is not sorted, we use linear search – scan the list to check if x is present in it. – When L is sorted, we use binary search – divide the remaining list to be searched in half with every comparison. Also insert and delete elements to/from L. • Sorting: Given a sequence of elements, sort them in increasing (or decreasing) order. – Insertion sort, bubble sort, quick sort, merge sort 2 Random Access Machine (RAM) Model • Standard theoretical model of computation: – Infinite memory – Uniform access cost R A M • Unit-cost RAM model: All the basic operations (reading/writing a location from/to the memory, standard arithmetic and Boolean operations) take one unit of time. • Simple model crucial for success of computer industry. 3 Hierarchical Memory L 1 L 2 R A M • Modern machines have complicated memory hierarchy – Levels get larger and slower further away from CPU – Data moved between levels using large blocks 4 Hard disk drive 5 Slow I/O • Disk access is 106 times slower than main memory access track read/write head read/write arm “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on magnetic surface one’s desk or by taking an airplane to the other side of – Disk systems try to amortize large access time largea the transferring world and using contiguous blocks of data (8-16Kbytes) sharpener on someone else’s • Important to store/access data to take advantage ofdesk.” blocks(D. (locality) Comer) 6 External Memory Model D Block I/O M P [Aggarwal-Vitter 1988] N = # of items in the problem instance B = # of items per disk block M = # of items that fit in main memory T = # of items in output I/O: Move block between memory and disk Performance measures: Space: # of disk blocks used by the structure Time: # of I/Os performed by the algorithm (CPU time is “free”) 8 Scalability Problems: Block Access Matters • Example: Traversing linked list – Array size N = 10 elements – Disk block size B = 2 elements – Main memory size M = 4 elements (2 blocks) 1 5 2 6 3 8 9 4 7 10 Algorithm 1: N=10 I/Os 1 2 10 9 5 6 3 4 8 7 Algorithm 2: N/B=5 I/Os • Large difference between N and N/B since block size is large – Example: N = 256 x 106, B = 8000 , 1ms disk access time N I/Os take 256 x 103 sec = 4266 min = 71 hr N/B I/Os take 256/8 sec = 32 sec 9 Queues and Stacks • Queue: – Maintain push and pop blocks in main memory Push Pop O(1/B) Push/Pop operations • Stack: – Maintain push/pop blocks in main memory O(1/B) Push/Pop operations 10 Fundamental Bounds • Scanning: • Sorting: • Searching: Internal N N log N log2 N External N B N B logM B N B logB N • Note: – Linear I/O: O(N/B) – B factor VERY important: – Cannot sort optimally with search tree 11 Search trees: API • Given a set S of keys, support the operations: – search(x) : return TRUE if x is in S, and FALSE otherwise – insert(x) : insert x into S (error if x is already in S) – delete(x) : delete x from S (error if x is not in S) – rangesearch(x,y) : return all the keys z such that x ≤ z ≤ y 14 Binary Search Trees • Binary search tree: – Standard method for search among N elements – We assume elements in leaves (log2 N ) – Search traces a root-to-leaf path – If nodes are stored arbitrarily on disk Search in O (log2 N ) I/Os Rangesearch in O(log2 N T ) I/Os 15 External Search Trees (log2 B) (B ) • BFS blocking: – Block height O(log2 N ) / O(log2 B) O(logB N ) – Output elements blocked Rangesearch in (logB N T B) I/Os • Optimal: O(N/B) space and (logB N T B) query 16 External Search Trees • Maintaining BFS blocking during updates? – Balance is normally maintained in search trees using rotations x y y x • Seems very difficult to maintain BFS blocking during rotation – Also need to make sure output (leaves) is blocked! 17 B-trees • BFS-blocking naturally corresponds to tree with fan-out (B ) • B-trees balanced by allowing node degree to vary – Rebalancing performed by splitting and merging nodes 18 (a,b)-tree • T is an (a,b)-tree (a≥2 and b≥2a-1) – All leaves on the same level and contain between a and b elements – Except for the root, all nodes have degree between a and b – Root has degree between 2 and b (2,4)-tree • (a,b)-tree uses linear space and has height O (loga N ) Choosing a,b =(B) each node/leaf stored in one disk block (N/B) space and (logB N T B) query 19 (a,b)-Tree Insert • Insert: Search and insert element in leaf v DO { if v has b+1 elements/children Split v: make nodes v’ and v’’ with b21 b and b21 a elements insert element (ref) in parent(v) (make new root if necessary) v=parent(v) } v b 1 v’ v’’ b21 b21 • Insert touches (loga N ) nodes 20 (2,4)-Tree Insert 21 (a,b)-Tree Delete • Delete: Search and delete element from leaf v DO { if v has a-1 elements/children Fuse v with sibling v’: move children of v’ to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and ≤ a+b-1<2b) children split v v=parent(v) } v’ v a -1 v 2a -1 • Delete touches O (loga N ) nodes 22 (2,4)-Tree Delete 23 Summary/Conclusion: B-tree • B-trees: (a,b)-trees with a,b = (B ) – O(N/B) space – O(logB N+T/B) I/Os for search and rangesearch – O(logB N) I/Os for insert and delete • B-trees with elements in the leaves sometimes called B+-tree • Construction in O( NB logM B NB ) I/Os – Sort elements and construct leaves – Build tree level-by-level bottom-up 24 B-tree Construction • In internal memory we can sort N elements in O(N log N) time using a balanced search tree: – Insert all elements one-by-one (construct tree) – Output in sorted order using in-order traversal • Same algorithm using B-tree use O( N logB N ) I/Os log MB – A factor of O ( B log B ) non-optimal • As discussed we could build B-tree bottom-up in O( NB logM B NB ) I/Os – In general we would like to have dynamic data structure to use in O( NB logM B NB )algorithms O( B1 logM B NB ) I/O operations 25 Flash memory 30 Flash memory 31 Flash memory • Non-volatile memory which can be erased and programmed • Characteristics: – Lighter – Provides better shock resistance – Provides more throughput – Consumes less power – More denser (uses less space) compared to magnetic disks • Commonly used in digital cameras, handheld computers, mobile phones, portable music players etc. • Also used in embedded systems, sensor networks; and even replacing magnetic disks in PCs. 32 HDD vs SSD The disassembled components of a hard disk drive (left) and of the PCB and components of a solid-state drive (right) 33 Limitations of flash memory • Memory cells in a flash memory device can be written only a limited number of times – between 10,000 and 1,000,000, after which they wear out and become unreliable. • The only way to set bits (change their value from 0 to 1) is to erase an entire region memory. These regions have fixed size in a given device, typically ranging from several kilobytes to hundreds of kilobytes, and are called erase units. • Two different types of Flash memories: NOR and NAND – they have slightly different characteristics 34 Flash memory • The memory space of the chip is partitioned into blocks called erase blocks. The only way to change a bit from 0 to 1 is to erase the entire unit containing the bit. • Each block is further partitioned into pages, which usually store 2048 bytes of data and 64 bytes of meta-data. Erase blocks typically contain 32 or 64 pages. • Bits are changed from 1 to 0 by programming (writing) data onto a page. An erased page can be programmed only a small number of times (1 to 3) before it must be erased again. 35 Flash memory • Reading data takes tens of microseconds for the first access to a page, plus tens of nanoseconds per byte. • Writing a page takes hundreds of microseconds, plus tens of nanoseconds per byte. • Erasing a block takes several milliseconds. • Each block can sustain only a limited number of erasures. Algorithms/data structures designed for I/O model do not always work well when implemented on flash memory. 36 Flash memory models (I) • General flash model: BR M c Flash BW • The complexity of an algorithm is x + c · y, where x and y are the number of read and write I/Os respectively, and c is a penalty factor for writing. • Typically, we assume that BR < BW << M, and c ≥ 1. 37 Flash memory models (II) • Unit-cost flash model: BR Flash M BW • General flash model augmented with the assumption of an equal access time per element for reading and writing. • The cost of an algorithm performing x read I/Os and y write I/Os is given by x.BR + y.BW. • This simplifies the model considerably, as it becomes easier to adapt external-memory results. 38 B-trees on flash memory • An insertion in a B-tree updates a single leaf (unless the leaf splits) • Since we cannot perform an in-place update in flash memory, we need to create a new copy of the leaf, with the new element inserted. • Since the parent of this leaf has to update its pointer to the leaf, we need to create a new copy of the parent. And so on..up to the root. • Thus the write performance is quite bad for the naïve implementation. 39 Flash Translation Layer (FTL) • Software layer on the flash disk which performs logical to physical block mapping. • Distributes writes uniformly across blocks. • B-tree with FTL: – All nodes contain just the logical address of other nodes – Allows any update to write just the target node • Achieves one erase per update (amortized) 40 μ-tree [Kang, Jung, Kang, Kim, 2007] • Minimally Updated tree • Achieves similar performance as ‘B-tree with FTL’ on raw flash • Sizes of the nodes decreases exponentially from leaf to the root • Each block corresponds to a leaf-to-root path, and stores the nodes on a prefix of this path • Works only when log2 B ≥ logB N 41 FD-tree [Li, He, Yang, Luo, Yi, 2010] • Flash Disk aware tree index • Transforms random writes into sequential writes • Limits random writes to within a small region 42 FD-tree •Flash Disk aware tree index •Transforms random writes into sequential writes •Contains a head tree and a few levels of sorted runs of increasing sizes •O(logk N) levels, where k is the size ratio between levels 43 Other B-tree indexes for flash memory • BFTL [Wu, Luo, Chang, 2007] • Lazy Adaptive tree [Agrawal, Ganesan, Sitaraman, Diao, Singh, 2009] • Lazy Update tree [On, Hu, Li, Xu, 2009] • In-page Logging approach [Lee, Moon, 2007] • … All these are designed to get better practical performance, and take different aspects of flash characteristics into consideration. -- not easy to compare with each other 44 Comparison of tree indexes on flash N – number of elements BR – read block size BW – write block size BU – size of buffer h – height of the tree k - parameter 45 Directions for further research • The area is still in its infancy. • Not much is work has been done apart from the development of some file systems and tree indexing structures • Open problems: – Efficient tree indexes for flash memory – Tons of other (practically significant) algorithmic problems – Better memory model. 46 Thank You 47