B-tree

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, insertions, deletions, and sequential access in logarithmic amortized time. The B-tree is a generalization of a binary search tree in that more than two paths diverge from a single node. Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It is most commonly used in databases and filesystems.

In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.

Each internal node of a B-tree will contain a number of keys. Usually, the number of keys is chosen to vary between $$d$$ and $$2d$$. In practice, the keys take up the most space in a node. The factor of 2 will guarantee that nodes can be split or combined. If an internal node has $$2d$$ keys, then adding a key to that node can be accomplished by splitting the $$2d$$-key node into two $$d$$-key nodes and promoting the middle key to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have $$d$$ keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would leave the internal node with $$d-1$$ keys; joining the neighbor would add $$d$$ keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of $$2d$$ keys.

The branches (or child nodes) from a node will be one more than the number of keys stored in the node. In a 2-3 B-tree, the internal nodes will store either one key (with two child nodes) or two keys (with three child nodes). A B-tree is sometimes described with the parameters $$(d+1)$$ — $$(2d+1)$$ or simply with the highest branching order, $$(2d+1)$$.

A B-tree is kept balanced by requiring that all leaf nodes are at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node further away from the root.

B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when the nodes are in secondary storage such as disk drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage want a large number of child nodes to improve performance.

The term B-tree may refer to a specific design or it may refer to a general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+-tree and the B*-tree. In the B+-tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf may include a pointer to the next leaf to speed sequential access. The B*-tree balances more neighboring internal nodes to keep the internal nodes more densely packed. For example, a non-root node of a B-tree need only be half full, but a non-root node of a B*-tree must be at least two-thirds full.

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing in 1971, but did not explain what, if anything, the B stands for. Douglas Comer suggests a number of possibilities:
 * "Balanced," "Broad," or "Bushy" might apply [since all leaves are at the same level]. Others suggest that the "B" stands for Boeing [since the authors worked at Boeing Scientific Research Labs in 1972]. Because of his contributions, however, it seems appropriate to think of B-trees as "Bayer"-trees.

Time to search a sorted file
Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed, expressed in order notation. A binary search of a sorted table with $$N$$ records, for example, can be done in $$O(\log_2 N)$$ comparisons. If the table had 1,000,000 records, then a specific record could be located with about 20 comparisons: $$\log_2 1{,}000{,}000 \approx 19.93$$.

Large databases have historically been kept on disk drives. The time to read a record on a disk drive can dominate the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about 1/2 the rotation period. For a 7200 RPM drive, the rotation period is 8.333 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds. For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, locating one record out of a million would take 20 disk reads at 10 milliseconds per disk read, which is 0.2 seconds.

The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads—the comparisons are all within the last disk block read.

To speed the search further, the first 13 to 14 comparisons (each of which required a disk access) must be sped up.
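The arithmetic above can be reproduced with a rough sketch. The function name `binary_search_cost` and its parameters are illustrative, and the model is the same simplification used in the text: every comparison costs one disk read except those that fall inside the final block.

```python
import math

def binary_search_cost(num_records, records_per_block, ms_per_read=10):
    """Estimate (comparisons, disk reads, milliseconds) for a binary search of a sorted file."""
    comparisons = math.ceil(math.log2(num_records))
    # Comparisons confined to the last block read need no further disk access.
    in_block = math.floor(math.log2(records_per_block))
    disk_reads = comparisons - in_block
    return comparisons, disk_reads, disk_reads * ms_per_read

# One million records, 100 records per block, 10 ms per read (figures from the text).
print(binary_search_cost(1_000_000, 100))
```

Under these assumptions the estimate is 20 comparisons, of which about 14 need a disk read.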

An Index speeds the search
A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block. This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. As in the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.

Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of taking 150 milliseconds to get the record, we have it in 30 milliseconds.

The auxiliary indices have turned the search problem from a binary search requiring roughly $$\log_2 N$$ disk reads to one requiring only $$\log_b N$$ disk reads where $$b$$ is the blocking factor.
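As a minimal sketch of how the index hierarchy shrinks, the following counts the blocks at each level until a level fits in a single block. The function name `index_levels` is illustrative; it assumes every index entry is the same size as a record, so each level is smaller by the blocking factor.

```python
def index_levels(num_records, blocking_factor):
    """Return the number of blocks at each level, from the main file up to a one-block index."""
    blocks = -(-num_records // blocking_factor)  # ceiling division
    levels = [blocks]
    while blocks > 1:
        # One index entry per block of the level below.
        blocks = -(-blocks // blocking_factor)
        levels.append(blocks)
    return levels

levels = index_levels(1_000_000, 100)
print(levels)       # main file, aux index, aux-aux index
print(len(levels))  # disk reads needed: one block per level
```

For the example in the text this gives 10,000 main blocks, 100 aux-index blocks, and 1 aux-aux block, hence 3 reads.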

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

Insertions and Deletions cause trouble
If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.

Deleting records from a database doesn't cause much trouble. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there are a lot of deletions, then searching and storage become less efficient.

Insertions are a disaster in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical.

A trick is to leave some space lying around to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. The free slots would be marked as if they were "deleted" records.

Now, both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is nearby that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

The B-tree uses all those ideas
The B-tree uses all the above ideas. It keeps the records in sorted order so they may be sequentially traversed. It uses a hierarchical index to minimize the number of disk reads. The index is elegantly adjusted with a recursive algorithm. The B-tree uses partially full blocks to speed insertions and deletions. In addition, a B-tree minimizes waste by making sure the interior nodes are at least 1/2 full. A B-tree can handle an arbitrary number of insertions and deletions.

Technical Description
A B-tree of order m (the maximum number of children for each node) is a tree which satisfies the following properties:
 * 1) Every node has at most m children.
 * 2) Every node (except root and leaves) has at least $$\lceil m/2 \rceil$$ children.
 * 3) The root has at least two children if it is not a leaf node.
 * 4) All leaves appear in the same level, and carry information.
 * 5) A non-leaf node with k children contains k–1 keys.

Each internal node's elements act as separation values which divide its subtrees. For example, if an internal node has three child nodes (or subtrees) then it must have two separation values or elements a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.

Internal nodes in a B-tree — nodes which are not leaf nodes — are usually represented as an ordered set of elements and child pointers. Every internal node contains a maximum of U children and — other than the root — a minimum of L children. For all internal nodes other than the root, the number of elements is one less than the number of child pointers; the number of elements is between L-1 and U-1. The number U must be either 2L or 2L-1; thus each internal node is at least half full. This relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there is room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.

Leaf nodes have the same restriction on the number of elements, but have no children, and no child pointers.

The root node still has the upper limit on the number of children, but has no lower limit. For example, when there are fewer than L-1 elements in the entire tree, the root will be the only node in the tree, and it will have no children at all.
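The rules on L, U, ordering, and leaf depth can be collected into a small checker. This is a sketch under stated assumptions: the `Node` class and the `check` function are illustrative names, a leaf is represented by an empty child list, and keys are assumed to be distinct.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted list of elements
        self.children = children or []  # empty for a leaf

def check(node, L, U, is_root=True, lo=None, hi=None):
    """Return the depth of the leaves below node; raise AssertionError if a rule is broken."""
    n = len(node.keys)
    assert n <= U - 1, "too many elements"
    if not is_root:
        assert n >= L - 1, "too few elements"
    assert node.keys == sorted(node.keys), "elements out of order"
    # Every element must lie strictly between the separators inherited from above.
    assert all((lo is None or lo < k) and (hi is None or k < hi) for k in node.keys)
    if not node.children:
        return 0
    assert n >= 1 and len(node.children) == n + 1, "children must be elements + 1"
    bounds = [lo] + node.keys + [hi]
    depths = {check(c, L, U, False, bounds[i], bounds[i + 1])
              for i, c in enumerate(node.children)}
    assert len(depths) == 1, "leaves at different depths"
    return depths.pop() + 1

# A small 2-3 tree (L = 2, U = 3): one separator, two leaves.
root = Node([10], [Node([3, 7]), Node([15])])
print(check(root, L=2, U=3))
```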

A B-tree of depth n+1 can hold about U times as many items as a B-tree of depth n, but the cost of search, insert, and delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than the number of elements.

Some balanced trees store values only at the leaf nodes, and so have different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, a specialized structure for leaf nodes in B-trees will improve performance.

Best case and worst case heights
The best case height of a B-Tree is:
 * $$\log_{M} n.\ $$

The worst case height of a B-Tree is:
 * $$\log_{M/2}n\ $$

where $$M$$ is the maximum number of children a node can have.
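To get a feel for these bounds, the two logarithms can be evaluated directly. The helper name `height_bounds` is illustrative, and the results are the approximate real-valued bounds given above, not exact integer heights.

```python
import math

def height_bounds(n, M):
    """Return the (best case, worst case) height estimates log_M(n) and log_{M/2}(n)."""
    best = math.log(n, M)
    worst = math.log(n, M / 2)
    return best, worst

best, worst = height_bounds(1_000_000, 100)
print(round(best, 2), round(worst, 2))
```

With a million elements and up to 100 children per node, both estimates stay between 3 and 4 levels, which illustrates how little the minimum-occupancy rule costs in height.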

Algorithms
Warning: the discussion below uses "element", "value", "key", "separator", and "separation value" to mean essentially the same thing. The terms are not clearly defined. There are some subtle issues at the root and leaves.

Search
Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search chooses the child pointer (subtree) whose separation values are on either side of the search value.

Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.
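A minimal sketch of the search, using Python's standard `bisect` module for the binary search within a node. The `Node` class and the `search` function are illustrative names; a leaf is assumed to have an empty child list.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted separation values
        self.children = children or []  # empty for a leaf

def search(node, key):
    """Return (node, index) locating key, or None if key is not in the tree."""
    while True:
        # Index of the first element >= key; also the index of the child to descend into.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if not node.children:
            return None
        node = node.children[i]

root = Node([10, 20], [Node([3, 7]), Node([15]), Node([25, 30])])
found = search(root, 15)
print(found[0].keys, found[1])
print(search(root, 8))
```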

Insertion


All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:


 * 1) If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node's elements ordered.
 * 2) Otherwise the node is full, so evenly split it into two nodes.
 * 3) A single median is chosen from among the leaf's elements and the new element.
 * 4) Values less than the median are put in the new left node and values greater than the median are put in the new right node, with the median acting as a separation value.
 * 5) Insert the separation value in the node's parent, which may cause it to be split, and so on. If the node has no parent (i.e., the node was the root), create a new root above this node (increasing the height of the tree).

If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root. The maximum number of elements per node is U-1. When a node is split, one element moves to the parent, but one element is added. So, it must be possible to divide the maximum number U-1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U-2)/2 = L-1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U-1 is even, then U=2L-1, so there are 2L-2 elements in the node. Half of this number is L-1, which is the minimum number of elements allowed per node.
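The split arithmetic can be checked with a short sketch. The function name `split_full_node` is illustrative; it takes the U-1 elements of a full node plus the new element and returns the left node, the median, and the right node.

```python
def split_full_node(elements, new_element):
    """Split a full node's elements plus one new element around a single median."""
    merged = sorted(elements + [new_element])
    mid = len(merged) // 2
    # The median moves up to the parent; the rest are divided between two new nodes.
    return merged[:mid], merged[mid], merged[mid + 1:]

# U = 4, L = 2: a full node holds 3 elements; the halves get L and L-1 elements.
print(split_full_node([10, 20, 30], 25))
# U = 3, L = 2: a full node holds 2 elements; both halves get L-1 elements.
print(split_full_node([10, 20], 15))
```

In both cases each half holds at least L-1 elements, so both new nodes are legal.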

An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U-2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L-1, which accounts for why some textbooks impose this requirement in defining B-trees.
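The single-pass variant can be sketched as follows, assuming U = 2L as the paragraph requires. The class and method names (`Node`, `BTree`, `_split_child`, `inorder`) are illustrative, and the sketch keeps every node in memory rather than on secondary storage.

```python
from bisect import bisect_right, insort

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

class BTree:
    def __init__(self, L):
        self.L = L          # minimum number of children of a non-root internal node
        self.U = 2 * L      # maximum number of children (assumed U = 2L)
        self.root = Node()

    def _split_child(self, parent, i):
        """Split the full child parent.children[i], sending its median up to parent."""
        L = self.L
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        median = full.keys[L - 1]
        right.keys = full.keys[L:]       # L-1 elements
        full.keys = full.keys[:L - 1]    # L-1 elements
        if not full.leaf:
            right.children = full.children[L:]
            full.children = full.children[:L]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def insert(self, key):
        if len(self.root.keys) == self.U - 1:
            # A full root is split first, which is the only way the tree grows taller.
            new_root = Node(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        node = self.root
        while not node.leaf:
            i = bisect_right(node.keys, key)
            if len(node.children[i].keys) == self.U - 1:
                # Split full nodes on the way down so the parent never needs revisiting.
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        insort(node.keys, key)

def inorder(node):
    """Return all elements below node in sorted order."""
    if node.leaf:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])

tree = BTree(L=2)
for k in [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]:
    tree.insert(k)
print(inorder(tree.root))
```

Because a node is split before the descent enters it, the parent of any node being split is guaranteed to have room for the promoted median.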

Deletion
There are two popular strategies for deletion from a B-tree:

 * locate and delete the item, then restructure the tree to regain its invariants, or
 * do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring

The algorithm below uses the former strategy.

There are two special cases to consider when deleting an element:
 * 1) the element in an internal node may be a separator for its child nodes
 * 2) deleting an element may put its node under the minimum number of elements and children.

Each of these cases will be dealt with in order.

Deletion from a leaf node

 * Search for the value to delete.
 * If the value is in a leaf node, it can simply be deleted from the node, perhaps leaving the node with too few elements; so some additional changes to the tree will be required.

Deletion from an internal node
Each element in an internal node acts as a separation value for two subtrees, and when such an element is deleted, two cases arise. In the first case, both of the two child nodes to the left and right of the deleted element have the minimum number of elements, namely L-1. They can then be joined into a single node with 2L-2 elements, a number which does not exceed U-1 and so is a legal node. Unless it is known that this particular B-tree does not contain duplicate data, we must then also (recursively) delete the element in question from the new node.

In the second case, one of the two child nodes contains more than the minimum number of elements. Then a new separator for those subtrees must be found. Note that the largest element in the left subtree is the largest element which is still less than the separator. Likewise, the smallest element in the right subtree is the smallest element which is still greater than the separator. Both of those elements are in leaf nodes, and either can be the new separator for the two subtrees.


 * If the value is in an internal node, choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
 * This has deleted an element from a leaf node, and so is now equivalent to the previous case.
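Finding the replacement separator can be sketched as a walk down the edge of a subtree. The names `Node`, `largest_in_subtree`, `smallest_in_subtree`, and `replace_separator` are illustrative; the sketch only swaps in the new separator and leaves any rebalancing of the affected leaf to the caller.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty for a leaf

def largest_in_subtree(node):
    """Return (leaf, element) for the largest element below node."""
    while node.children:
        node = node.children[-1]
    return node, node.keys[-1]

def smallest_in_subtree(node):
    """Return (leaf, element) for the smallest element below node."""
    while node.children:
        node = node.children[0]
    return node, node.keys[0]

def replace_separator(parent, i):
    """Delete parent.keys[i] by overwriting it with the largest element of its left subtree."""
    leaf, key = largest_in_subtree(parent.children[i])
    leaf.keys.pop()       # remove the new separator from its leaf
    parent.keys[i] = key  # the old separator is now gone
    return leaf           # the caller checks whether this leaf became deficient

root = Node([10], [Node([3, 7]), Node([15])])
leaf = replace_separator(root, 0)
print(root.keys, leaf.keys)
```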

Rebalancing after deletion
If deleting an element from a leaf node has brought it under the minimum size, some elements must be redistributed to bring all nodes up to the minimum. In some cases the rearrangement will move the deficiency to the parent, and the redistribution must be applied iteratively up the tree, perhaps even to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:


 * If the right sibling has more than the minimum number of elements:
   * 1) Add the separator to the end of the deficient node.
   * 2) Replace the separator in the parent with the first element of the right sibling.
   * 3) Append the first child of the right sibling as the last child of the deficient node.
 * Otherwise, if the left sibling has more than the minimum number of elements:
   * 1) Add the separator to the start of the deficient node.
   * 2) Replace the separator in the parent with the last element of the left sibling.
   * 3) Insert the last child of the left sibling as the first child of the deficient node.
 * Otherwise, if both immediate siblings have only the minimum number of elements:
   * 1) Create a new node with all the elements from the deficient node, all the elements from one of its siblings, and the separator in the parent between the two combined sibling nodes.
   * 2) Remove the separator from the parent, and replace the two children it separated with the combined node.
   * 3) If that brings the number of elements in the parent under the minimum, repeat these steps with that deficient node, unless it is the root, since the root may be deficient.

The only other case to account for is when the root has no elements and one child. In this case it is sufficient to replace it with its only child.
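One step of the rebalancing procedure can be sketched as below. The names `Node` and `rebalance` are illustrative, and `min_keys` stands for L-1. As a simplification, the merge case reuses the left-hand node instead of allocating a new one, and the caller is responsible for repeating the step on the parent and for replacing an empty root with its only child.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty for a leaf

def rebalance(parent, i, min_keys):
    """Repair the deficient child parent.children[i] by rotating or merging."""
    child = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None

    if right is not None and len(right.keys) > min_keys:
        # Borrow through the parent from the right sibling.
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if right.children:
            child.children.append(right.children.pop(0))
    elif left is not None and len(left.keys) > min_keys:
        # Borrow through the parent from the left sibling.
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if left.children:
            child.children.insert(0, left.children.pop())
    else:
        # Merge: combine two siblings and the separator between them.
        if right is None:
            i, child, right = i - 1, left, child
        child.keys += [parent.keys.pop(i)] + right.keys
        child.children += right.children
        del parent.children[i + 1]

# 2-3 tree, min_keys = 1: the middle child has lost its only element.
parent = Node([10, 20], [Node([5]), Node([]), Node([25, 30])])
rebalance(parent, 1, min_keys=1)
print(parent.keys, [c.keys for c in parent.children])
```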

Initial construction
In applications, it's frequently useful to build a B-tree to represent a large existing collection of data and then update it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is not to insert every element in the initial collection successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these. This approach to B-tree construction is called bulkloading. Initially, every leaf but the last one has one extra element, which will be used to build the internal nodes.

For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 5 leaf nodes containing 5 values each (except the last, which contains 4):

 * [1 2 3 4 5] [6 7 8 9 10] [11 12 13 14 15] [16 17 18 19 20] [21 22 23 24]

We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

 * [5 10 15] [20]

This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:

 * [15]
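The level-by-level construction can be sketched as follows. The names `build_level` and `bulk_load` are illustrative, and each level is returned as a list of element lists rather than as linked nodes. The sketch assumes, as in the example, that the input size leaves a legal last node at every level; it does not redistribute elements when the last node would come out empty or underfull.

```python
def build_level(items, max_size):
    """Cut sorted items into nodes of max_size, promoting one separator between neighbors."""
    nodes, separators = [], []
    i = 0
    while True:
        nodes.append(items[i:i + max_size])
        i += max_size
        if i >= len(items):
            break
        separators.append(items[i])  # the "extra" element, moved up a level
        i += 1
    return nodes, separators

def bulk_load(sorted_items, max_leaf, max_internal):
    """Return the levels of the tree, leaves first and root last."""
    levels = []
    items, size = list(sorted_items), max_leaf
    while True:
        nodes, separators = build_level(items, size)
        levels.append(nodes)
        if not separators:  # a single node remains: this is the root
            return levels
        items, size = separators, max_internal

for level in bulk_load(range(1, 25), max_leaf=4, max_internal=2):
    print(level)
```

For the integers 1 through 24 this yields five leaves of four values each, the internal nodes [5, 10] and [20], and the root [15].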

B-trees in Filesystems
The B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the address of logical block $$i$$ within the file into a physical disk block address (or perhaps into a cylinder, head, and sector).

Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. To convert a logical block address to a physical block, the operating system just adds the logical block address to the starting physical block of the file. The scheme is simple, but the file cannot exceed its created size.

Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.

MS-DOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each physical disk block, and that entry identifies the next physical disk block of a file. The result is that the disk blocks of a file form a linked list. In order to find the physical address of block $$i$$, the operating system must sequentially search the FAT. For MS-DOS, that was not a huge penalty because the disks were small and the FAT had few entries. In the FAT12 filesystem, there were only 4,096 entries, and the FAT would usually be resident. As disks got bigger, the FAT architecture confronted penalties: it may be necessary to perform disk reads to learn the physical address of a block the user wants to read.
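The cost of the linked-list layout shows up in a sketch of the lookup. The function name `fat_lookup`, the dictionary representation of the table, and the use of `None` as the end-of-file marker are all assumptions made for illustration; they do not reproduce the on-disk FAT format.

```python
def fat_lookup(fat, first_block, i):
    """Follow i links from first_block to find the physical block of logical block i."""
    block = first_block
    for _ in range(i):
        block = fat[block]  # each entry names the next physical block of the file
        if block is None:
            raise IndexError("logical block is past the end of the file")
    return block

# A three-block file stored in physical blocks 7, 12, and 3, in that order.
fat = {7: 12, 12: 3, 3: None}
print(fat_lookup(fat, first_block=7, i=2))
```

Reaching logical block i takes i steps through the table, in contrast to the single addition needed for a contiguous file.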

TOPS-20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a B-tree. A disk block was 512 36-bit words. If the file fit in a 512 ($$2^9$$) word block, then the file directory would point to that physical disk block. If the file fit in $$2^{18}$$ words, then the directory would point to an aux index; the 512 words of that index would either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in $$2^{27}$$ words, then the directory would point to a block holding an aux-aux index; each entry would either be NULL or point to an aux index. Consequently, the physical disk block for a $$2^{27}$$ word file could be located in two disk reads and read on the third.
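The sizes above suggest that a word address in the largest kind of file divides into three 9-bit fields, one per level. The following is only an illustration of that arithmetic, inferred from the block sizes given here; the function name `index_path` is hypothetical and nothing in it is taken from actual TOPS-20 code.

```python
def index_path(word_address):
    """Split a 27-bit word address into (aux-aux slot, aux slot, offset within the block)."""
    aux_aux_slot = (word_address >> 18) & 0x1FF  # which entry of the aux-aux index
    aux_slot = (word_address >> 9) & 0x1FF       # which entry of the chosen aux index
    offset = word_address & 0x1FF                # which word of the 512-word data block
    return aux_aux_slot, aux_slot, offset

print(index_path(0))
print(index_path(512))        # first word of the second data block
print(index_path(2**27 - 1))  # last word of the largest file
```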

Apple Computer's filesystem and Microsoft's NTFS use B-trees.

Multi-way combining and splitting
It is possible to modify the above algorithm so that, when trying to find extra elements for a deficient node, it examines other siblings and, if one of them has more than the minimum number of values, rearranges values across a larger number of siblings to make up the deficit.

Similarly, when a node is split, extra elements can be moved to nearby, less populated siblings; or the split can involve a number of siblings, redistributing elements among them rather than splitting a node.

In practice, the most common use of B-trees involves keeping the nodes on secondary storage, where it is slow to access a node which is not already being used. Using only two-way splits and combines helps decrease the number of nodes that must be accessed in many common situations, but multi-way splits and combines may be useful in others.

Relationship between U and L
It is almost universal to split nodes by choosing a single median and creating two new nodes. This constrains the relationship between L and U. Trying to insert an element into a full node, which already holds U-1 elements, involves redistributing U elements. One of these, the median, will move to the parent, and the remaining elements will be split as equally as possible among the two new nodes.

For example, in a 2-3 B-tree, adding an element to a node with three child nodes, and thus two separator values, involves three values: the two separators and the new value. The median becomes the new separator in the parent, and each of the other two becomes the sole element in a node with one value and two children. Generally, if U is odd, each of the two new nodes has (U+1)/2 children. If U is even, one has U/2 children and the other U/2+1.

If full nodes are split into exactly two nodes, L must be small enough to allow for the sizes after a node is split. But it is possible to split full nodes into more than two new nodes. Choosing to split a node into more than two nodes would require a lower value of L for the same value of U.

As L gets smaller, it allows for more unused space in the nodes. This might decrease the frequency of node splitting, but it is also likely to increase the amount of memory needed to store the same number of values, and the number of nodes that have to be examined for any particular operation.

Theoretical results
Robert Tarjan proved that the amortized number of splits/merges is 2.

Access concurrency
Lehman and Yao showed that linking the tree blocks at each level together with a next pointer results in a tree structure where read locks on the tree blocks can be avoided as the tree is descended from the root to the leaf for both search and insertion. Write locks are only required as a tree block is modified. Minimizing locking to a single node held only during its modification helps to maximize access concurrency by multiple users, an important consideration for databases and/or other B-Tree based ISAM storage methods.