LSM TREE
What makes NoSQL databases so fast and efficient
Submitted By 205223019

BEFORE WE BEGIN…
A typical Database Management System (DBMS) consists of multiple components, each responsible for a different aspect of data storage, retrieval and management. One such component is the storage engine, which provides a reliable interface for reading and writing data efficiently from and to the underlying storage device. It is the component that implements two of the four ACID properties: Atomicity and Durability. Its performance also matters a lot in the choice of a database, because it is the component closest to the storage device in use. Two popular data structures for implementing storage engines are B+ Trees and LSM Trees.

THE TWO COMPONENT LOG STRUCTURED MERGE TREE ALGORITHM
[Figure: the two components of an LSM tree. Memory: a balanced binary tree. Disk: SS (Sorted String) Tables.]

WRITE
    def put(self, key, value):
        # Insert a key-value pair into the LSM Tree.
        # The write goes to the in-memory memtable only.
        self.memtable.add(key, value)
        # Trigger compaction once the memtable reaches the size threshold.
        if self.memtable.size() >= self.compaction_threshold:
            self.compact()

[Figure: write path in an LSM tree based database like Cassandra, with a memory size threshold of 3. The server writes Aman-200, Ram-700 and Ankit-320 into an in-memory red-black tree, at which point the threshold is reached. Final panel, "SSTable Creation": on the write of Rohan-133 the tree is flushed to SSD/HDD.]

When the MemTable is flushed, it is persisted to disk as an immutable SSTable that contains the sorted key-value pairs.

The amortized time complexity of a write is O(1), because a write goes only to the in-memory MemTable.
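The write path above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the class name `LSMTree`, the `flush` method and the `threshold` parameter are hypothetical, a plain dict that is sorted at flush time stands in for the red-black tree, and SSTables are kept as in-memory tuples instead of files.

```python
class LSMTree:
    """Toy write path: an in-memory MemTable flushed to immutable SSTables."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed: max MemTable entries before a flush
        self.memtable = {}          # assumption: dict stands in for the red-black tree
        self.sstables = []          # flushed tables, oldest first (kept in memory here)

    def put(self, key, value):
        # A write (or an update of an existing key) only touches the MemTable.
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self):
        # Persist the MemTable as an immutable, key-sorted SSTable
        # (a tuple here; a real engine would write a file), then start afresh.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


tree = LSMTree(threshold=3)
for name, amount in [("Aman", 200), ("Ram", 700), ("Ankit", 320), ("Rohan", 133)]:
    tree.put(name, amount)

print(tree.sstables)  # one sorted SSTable holding Aman, Ankit, Ram
print(tree.memtable)  # Rohan waits in the fresh MemTable
```

With the threshold of 3 used in the slides, the third write triggers the flush, so the first three keys end up in one sorted SSTable and the fourth lands in a new, empty MemTable.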
UPDATE
[Figure]

DELETE
[Figure]

READ
    def get(self, key):
        # Retrieve the value associated with the given key.
        # Search the memtable first, then the SSTables from newest to oldest.
        value = self.memtable.get(key)
        if value is None:
            for table in reversed(self.sstable):
                value = table.get(key)
                if value is not None:
                    break
        return value

[Figure labels: a=3, a=2]

All reads in an LSM Tree are served first from the MemTable. If the key is not found in the MemTable, it is looked up in the most recent level, L0, then in L1, L2 and so on, until we either find the key or return a null value. Since the SSTables are already sorted, we can run a binary search on each file to quickly narrow down the range within which the key may be found. The time complexity of a read is O(log N), where N is the number of records on disk.

READ OPTIMISATION
Even though search is fast on sorted data, going through all the on-disk SSTables consumes a lot of I/O. So summary tables, which hold the min/max key range of each disk block at every level, are kept in memory. They allow the system to skip the disk blocks whose range cannot contain the key, which saves a lot of I/O.

WHAT HAPPENS WHEN THERE ARE SO MANY SSTABLES
As the number of SSTables grows, it takes an increasingly long time to look up a key. As the SSTables accumulate, they also hold more and more outdated entries, because keys are updated and tombstones are added. These take up precious disk space.

COMPACTION
Compaction is the process of merging SSTables and eliminating redundant or obsolete data in the LSM Tree. Its core algorithm is the k-way merge sort, adapted to SSTables. It helps manage disk space efficiently. Compaction involves the following steps:
Overlapping Key Ranges: During the merge process, SSTables with overlapping key ranges are identified. These overlapping ranges are resolved to eliminate redundant data.
Tombstones: In some LSM Trees, tombstones are used to mark keys that have been deleted.
During compaction, entries that have been marked as deleted by a tombstone can be safely discarded, freeing up disk space.

WHAT IF THE KEY DOESN’T EXIST
Keep a Bloom Filter at each level. A Bloom Filter is a space-efficient data structure that returns a firm "no" if the key does not exist, and a "probably yes" if the key might exist. This allows the system to skip a level entirely if the key does not exist there, which reduces the number of random I/Os required.

B+ TREE VS LSM TREE
AVERAGE CASE TIME COMPARISON

OPERATION    B+ TREE     LSM TREE
SEARCH       O(log n)    O(log n)
INSERTION    O(log n)    O(1)
DELETION     O(log n)    O(1)
READ         O(log n)    O(log n)

DATABASES THAT USE LSM TREE
[Figure]

THANK YOU
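The compaction step from the COMPACTION slide (a k-way merge that keeps the newest version of each key and drops deleted keys) can be sketched as below. This is only an illustration under stated assumptions: the function name `compact`, the `TOMBSTONE` marker and the representation of an SSTable as a key-sorted list of `(key, value)` pairs are all hypothetical. The sketch merges every table at once, which is what makes it safe to drop tombstones; a partial compaction would have to keep them.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker stored as the value of a deleted key


def compact(sstables):
    """Merge key-sorted SSTables (ordered oldest first) into one sorted table."""

    def tagged(table, age):
        # Attach the negated age so that, for equal keys,
        # the entry from the newest table sorts first.
        for key, value in table:
            yield key, -age, value

    streams = [tagged(table, age) for age, table in enumerate(sstables)]
    merged = []
    last_key = object()  # sentinel that equals no real key
    # heapq.merge performs the k-way merge lazily over the sorted streams.
    for key, _, value in heapq.merge(*streams, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older, outdated version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))  # tombstoned keys are dropped entirely
    return merged


older = [("a", 2), ("b", 5)]
newer = [("a", 3), ("b", TOMBSTONE), ("c", 1)]
print(compact([older, newer]))  # [('a', 3), ('c', 1)]
```

In this run the newer `a=3` shadows the older `a=2`, `b` disappears because its newest entry is a tombstone, and `c` is carried over unchanged.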