232-3 Clarifications on B-tree

Hi I need some help on the B-tree course(https://app.dataquest.io/m/232/performance-boosts-of-using-a-b-tree/3/inserting-into-a-non-full-node). I find it hard to follow. I am going to rewrite the materials (copied as figures from the course) in the way that makes sense to me. I hope someone can let me know if I got it right or not. Here we go,

To make things easier, we will start with a root node already established that has 3 keys reference to 4 children, thus the degree of the tree is 4, which means that nodes other than the root must contain 3 to 7 keys per node. Let’s take a look at our example b-tree.

Notice that keys in the root and child nodes are in sorted order and that the order to reference children matter. The index of the key on the children list references the left child. If you reference to index+1 it will point to the next child on the right.

Suppose we want to insert the key 4 into our tree. First, we check if the Node is a leaf or not. If it is not, then we set the index to 0 and enumerate through the root’s keys to see if 4 is greater. If it is greater, then we increment the index by 1. If not, we break the loop and call insert on the child nodes.

If the Node is a leaf, then all we have to do is insert it at the correct location in the keys using the insert method from the list class, which takes in an index and a value and inserts the value into the list. For example,
image

Hi @xuehong.liu.pdx,

You are right that there are some inconsistencies in the content here. We are currently working hard on updating the DE path to match our new standards. Improving this mission is part of our roadmap but as you can imagine, it takes some time to go over the content and improve it.

Your understanding of it seems to be correct, to be one the safe side, I will explain it in other words.

A node in a b-tree has two components:

  1. A list of keys
  2. A list of children nodes

For a node with keys 4, 15, 37 we want to have one child for:

  1. keys lower than 4
  2. keys between 4 and 15
  3. keys between 15 and 37
  4. keys higher than 37

As you can see, we need four children in this case. In general, if a node has k keys, then we will need k + 1 children to represents all ranges of keys.

Here is a figure illustrating this:

If you think about it, this is a generalization of what happened with binary trees. In a binary tree, we had a single key per node. Therefore we only needed to represent two ranges of values in the children: the values less than that key and the values larger than that key. When allowing more keys in a node, we need more children to represent the ranges of keys in between.

In terms of implementation, the child at position i in the children array of a node will represent the keys that lie between the keys at indexes i - 1 and i. Let’s see a concrete example:

image

In the above figure, you see that the child at index 3 will represent keys that lie between keys[3 - 1] = keys[2] = 37 and keys[3] = 55.

When we add a new key, we start from the root and search of a leaf where that key fits. This search is similar to the one used for binary trees. The only difference here is that, since a node represents several ranges, we need to iterate over the keys to find the right one.

The following figure shows where keys 5, 12 and 20 would be inserted in an example b-tree.

Starting at the root node, we compare the new key with the keys until we find the range were the new key fits. Then we go to the child that represents that key and continue the search in the same way until we reach a leaf node (a node with no children).

The resulting three of inserting these values would be:

Now, in order to have efficient operations, we need to limit the number of keys a node can hold. Otherwise, we would simply insert everything on the root and get a list structure rather than a tree. This is the t factor mentioned in the mission.

When we create a b-tree, we set a factor t and set a rule that no node can contain 2t keys or more. When a node reaches 2t keys, we split it into two giving the middle value to the parent node. Imagine that our b-tree from the previous example has t = 2. Then when a node reaches 2t = 4 keys we will split it.

Let’s add the value 14 to the b-tree to see this.

This result in a node with four keys: 11, 12, 13, 14 which violates our b-tree condition:

To fix it, we will take the middle key and bring it to the parent node. Here there are two middle keys: 12 and 13. We can choose any of the in them. Let’s say we move 13 up.

This operation will result in the following b-tree:

The mission mentions a minimum number of keys that a non-root node can contain. This value is t - 1. But this actually is more of a consequence of the implementation than a restriction. Since we split a node when it reaches 2t keys, the resulting nodes will have at least t - 1 keys. This is a simpler way of thinking about it.

Note than when pushing a value to its parent, it might be the case that this node also becomes over populated. In this case a new split needs to happen. This process goes on going up the tree and fixing the nodes along the way.

I hope this helps you better understanding b-tree. I will be happy to answer any further questions that you might have.

3 Likes

@Francois,

Thank you so much for taking the time explain the concept. Your version is much better than what is in the course at the moment. I am pretty clear conceptually about the B-tree. However, I am not sure I am able to translating that understanding on the level of code. At the present form, I am not able to complete the exercises on my own, which is not good, and having trouble follow the logic of the code is even worse. Could you help me go over the greater_than() function below? I put some comments in to clarify as much as I can, where the confusion come from.
Thanks in advance! Xuehong

 def greater_than(self, node, term, upper_bound=None, inclusive=False):
        if not self.root:  # if not root (added during insert), list empty
            return []
        index = 0
        values = []
        for key in node.keys:
            # first check for upper bound 
            if upper_bound is not None:
                if inclusive and key == upper_bound:
                    values.append(key)
                if key >= upper_bound:
                    break
            # if key < term, moving on to the next key
            if term > key:  
                index += 1
                continue

            if inclusive and key == term: # I thought key == term should be sufficient here
                values.append(key)
            # if key satisfies the condition, add it to the final list
            if term < key:
                values.append(key)

            # continue to find values in child    
            if not node.is_leaf():
                values += self.greater_than(
                    node.children[index],
                    term,
                    upper_bound,
                    inclusive
                )
            index += 1
        
        # I don't follow the logic of doing another checking whether the node is a leaf or not.
        if not node.is_leaf():
            values += self.greater_than(
                node.children[index],
                term,
                upper_bound,
                inclusive
            )
        return values

Essentially, as far as I can see, we have three blocks of code here,

  1. check to see whether the root is None or not, if it is, return an empty list,
  2. looping through keys in the root, then its child to find values matching the condition (key>=term)
  3. if the input node is not a leaf node, looking for matching values in child.

The last conditional statement following the for loop is wired. I know this is a tricky thing due to the recursive nature of the function. But I am having a hard time thinking through this process. It seemed redundant because the child node has been checked within the for loop for each key. Also, I don’t see any statements checking the degree of the tree. I would appreciate if you could find a way to make it understandable.

@xuehong.liu.pdx,

This code is indeed a bit more complicated than it needs to be :sweat_smile:

Let’s break down the contents of the for loop over the keys in the node:


# first check for upper bound 
if upper_bound is not None:
    if inclusive and key == upper_bound:
        values.append(key)
    if key >= upper_bound:
         break

This first condition is here to deal with the upper bound and stop the search if we reach the end of the range defined by it. The first if condition deal with the inclusive condition. It will make sure we add the upper bound if inclusive is set to True. The second if condition is the one that stops the for search by breaking out of it.


# if key < term, moving on to the next key
if term > key:
    index += 1
    continue

This if condition, as you mentioned, is here to ignore the current key if it does not lie in the range. The continue statement will stop the current execution of the for loop and make it loop go to the next key.


if inclusive and key == term: # I thought key == term should be sufficient here
    values.append(key)
# if key satisfies the condition, add it to the final list
if term < key:
    values.append(key)

This part is the one that adds the current key to the list of results. By default, this method will return all keys that are strictly larger than term. However if inclusive is set to True, then it will also return a key that is equal to term. This is what the first if condition checks. If you remove the inclusive part of the condition then this option becomes always active.

The second if, as you mentioned, checks the general condition that the key is larger than term to add the key to the list.


So far everything in the for loop was there to deal with the keys inside the current node. But we also need to search the other nodes in the b-tree. This is done by searching the appropriate children nodes.

Before the for loop there was a variable named index that was created and set to 0. This variable represents the current child.

# continue to find values in child    
if not node.is_leaf():
  values += self.greater_than(
      node.children[index],
      term,
      upper_bound,
      inclusive
  )

First we check whether the child exists by checking that the current node is not a leaf. If it exists, we search it recursively using the same function.


Finally, at the end of the for loop we update the child index:

index += 1

After the for loop is over we are nearly done but not quite yet. Remember that there is more child than keys. One more to be precise. After the loop, index will point to this yet to be visited child. We need to search to be sure that we don’t miss any keys.

# I don't follow the logic of doing another checking whether the node is a leaf or not.
if not node.is_leaf():
    values += self.greater_than(
        node.children[index],
        term,
        upper_bound,
        inclusive
    )

We recheck the leaf condition because maybe the previous for loop was stopped before this condition was even checked.


Let me know if this helps. I believe that it would be good exercise for you to implement your own function without the inclusive and upper_bound conditions. So simply a method that returns all keys that are greater than a given value.

To code should be more clear and without so many distracting conditions. If you do it I would be happy to check it if yo want.

2 Likes