SSTables and LSM-Trees
SSTables and LSM-Trees
SStable stands for Sorted String Table, in which key sequence of key-value pairs is sorted by key.
- A compaction process is running in the background to merge SSTables and remove old values of keys. mergesort is used.
- In memory index is sparse, keys between two consecutive indexed keys need to be scanned.
- Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
Constructing and maintaining SSTables
- When a write comes in, add it to an in-memory balanced tree data structure. This in-memory tree is sometimes called a memtable.
- When the memtable gets bigger than some threshold, write it out to disk as an SSTable file.
- In order to server a read request, first try to find the key in the memtable, then in on-disk segment
- run a compaction process to combine segment files and discard overwritten or deleted values.
- use a append log to recover from crash.
Making an LSM-tree out of SSTables
- use Bloom filter to efficiently detect keys that does not exist
- size-tiered and leveled compaction
- size-tiered: newer and smaller SSTables are successively merged into older and larger SSTables.
- leveled compaction: the key range is split up into smaller SSTables and older data is moved into separate 'levels'
Making B-trees reliable
- write-ahead log (WAL)
- protecting the tree's data structures with latches (lightweight locks)
- copy-on-write. A modified page is witten to a different location, and a new version of the parent pages is created, pointing at the new location.
- save space by abbreviating keys
- additional pointers
- borrow log-structured ideas
Comparing B-Trees and LSM-Trees
LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
Need to test systems with particular workload in order to make a valid comparison.
Advantages of LSM-trees
write amplification: one write to the database resulting in multiple writes to the disk over the course of the database's lifetime.
In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk. The more that a storage engine writes to disk, the fewer writes per second it can handle within the available disk bandwidth.
LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (depends on the storage engine configuration and workload), and partly because they sequentially write SSTable files
LSM-trees can be compressed better
SSD firmware may use a log-structured algorithm to turn random writes into sequential wirtes
Downsides of LSM-trees
- The compaction process can sometimes interfere with the performance of ongoing reads and writes. the response time of queries to log-structured storage engines can sometimes be quite high
- At high write throughput, the compaction may cannot keep up with the rate of incoming writes.
- An advantage of B-trees is that each key exists in exactly one place, so transaction isolation is easier.