Robert's B-Tree Blog

I am keeping this blog while I create my own solution to homework 10/11 for CSCI 2270, Fall 2009. The goal is to give you an idea of my thought process while solving this problem, as well as the concrete steps I break the problem into.

You may use any of the code posted here (with appropriate acknowledgment).


11/18/2009

Preliminaries

I started off with the following steps which are probably familiar to you. I often see many of you start your homework in a similar way:

  1. Create vset.cxx: I just created an empty implementations of all the functions from the header file. I added "return false;" at the end of all the functions that return a bool because I always resolve all compiler warnings.
  2. Create vsettest.cxx: This is going to be my interactive test program. Right now it doesn't have anything other than necessary includes and an empty main function.
  3. Create makefile: My makefile contains the targets all, vsetexam, vsettest, vsetexam.o, vsettest.o, vset.o, and clean.
  4. Compile: I ran 'make' to check that I had done all of the above correctly.

At this point I also changed MINIMUM to 2 in vset.h so that small trees would still have multiple levels.

Finally, before starting any coding, I wrote out my class invariant. I based this on the six B-Tree rules on pages 542-543 of the textbook. The difference between what I wrote and what is in the book is that I used the member variables of the set class to define my invariant. To give you an idea of what I mean and to get you started, here is what I wrote for the first rule:

    // CLASS INVARIANT:
    // 1) data.size() >= MINIMUM except possibly at the root.  
Phew! That was a lot of work and I haven't even written any code yet! I am now ready to start implementing the member functions tomorrow.

11/19/2009

Test Program, print, insert and clear

After several hours, I have my insert and clear functions working. Even though insert is the hardest function to write, I thought it was the best place to start because you can't really test anything until insert is working. If it took me a few hours, it will probably take you almost as long! I hope my description here helps you with your implementation.

Test Program

Actually, the first thing that I did was write the print function and a short interactive test program so that I could see what the tree was doing while my insert function flailed about. My test program gives the following options:

Select from the following:
  I: Insert
  M: Multiple Insert
  P: Print
  Q: Quit 
I thought that multiple insert was especially useful for testing, because you need to insert a lot of numbers into the tree to get it to create new levels even with min=2, max=4. Specifically, you need to insert the numbers 1 through 17 to get 3 levels. My multiple insert just runs through a loop and inserts a range of numbers, here is the code:
            ...
            case 'M':
                cout<<"Enter (min, max): "<<endl;
                cin>>x>>y;
                for(i = x; i <= y; i++)
                  s.insert(i);
                break;
            ...  

Print

Next I wrote a simple print function to aid with testing. Michael suggested the format that I used, which is just a pre-order traversal of the tree, but also using indentation to help show levels. Here is the code for my print function (in my header file I gave indent a default argument of 0):
    // SUGGESTED FUNCTION FOR DEBUGGING
    void set::print(size_t indent) const
    {
        const int width = 3;
        size_t i = 0;

        // print indent spaces
        for(i = 0; i < indent; i++)
            cout<<" ";

        // print the data from this node
        for(i = 0; i < data.size(); i++)
            cout<<setw(width)<<data[i];

        // print endl at end of the root data
        cout<<endl;

        // recursively print children
        for(i = 0; i < child.size(); i++)
          child[i]->print(indent + 4);
    } 
And here is what the output looks like after doing my multiple insert from 1 to 17:
  9
      3  6
          1  2
          4  5
          7  8
     12 15
         10 11
         13 14
         16 17 
You can interpret this as: The root is 9 and it has two children, one containing the numbers 3 and 6, the other containing the numbers 12 and 15. The child containing 3 and 6 has three children: the first contains 1 and 2, the second contains 4 and 5, the third contains 7 and 8. Etc.

Insert

At last I arrived at the dreaded insert function. In my mind, this is the meat of homework 10. I followed the textbook very closely in my implementation of insert. The textbook's discussion of insert starts on page 551 and continues all the way through page 556. (Note that in the text, the array of pointers to children is called 'subset' instead of 'child'. In the text, subset[i] refers to the child at index i.) I wrote my insert function in the following steps:

  1. I wrote a find_index function:
        std::vector<value_type>::size_type find_index(const value_type& entry) const; 
    It returns the index of the first element greater than or equal to entry. You will need this in several member functions, so even if you have a simple implementation of find_index (e.g., using lower_bound), it is still nice to have this as a separate helper function.
  2. Next I wrote the loose_insert function as described in text (in fact, I followed the pseudocode provided in the text very closely). Here is the function prototype I used:
        bool loose_insert(const value_type& entry); 
    By the way, in step 2c where the text says "if so, then fix that problem", it is referring to the next function that we are about to write:
  3. The fix_excess function was where I had the most errors when I was debugging my code:
        void fix_excess(std::vector<value_type>::size_type i); 
    This function is responsible for splitting a child with too many entries and bringing the median entry up to the parent. I wrote this function in the following steps:
    1. Copy the upper half of the data over to the new node we are creating. If the node we are splitting is not a leaf, copy the corresponding child pointers over too.
    2. Insert the median entry from the node we are splitting into the parent, and add a child pointer to the parent pointing to the new child.
    3. Remove any unused entries from the split node's data and child vectors.
    If you use some of the vector member functions that you learned about during recitation, you can actually make this function pretty short (and you don't even need any loops). Just be more careful with your indices than I was!
  4. Finally, I wrote my insert function. This is very short and mainly uses the helper functions. In fact, after creating the new root and copying data, I was even able to use a call to fix_excess(0) as suggested in the text.

Debugging

There were several problems with my first attempt. I got a segfault the first time that my fix_excess function was called. I used gdb to find the line that my code was crashing on. I eventually figured out that the bug was that I needed to handle leaves slightly differently in my fix_excess function, as described above. Then, I thought that my insert function was working but when I inserted the 17th node (which creates the 3rd level) the new tree did not look correct. In this case gdb wasn't very much help because the problem wasn't a crash, it was just that there was a mistake in my algorithm somewhere. You might not make the exact same mistake that I did, but you may find my debugging technique useful. I added calls to the print() function at the beginning and end of fix_excess (where I suspected there may be a bug). This showed me what the tree looked like immediately before and after every call to fix_excess. This gave me enough information to realize what the problem was and my insert function is finished now (as far as I know).

Clear

I also wrote my clear function for good measure, because it's always important to deallocate dynamic memory. At first my program was crashing every time I exited. Any time this happens it's almost certain that you have a bug in your destructor. In my case the error was very simple although it took me awhile to realize what was happening (and I finally figured it out by watching the call stack in gdb). Here is what I saw by typing 'bt' at the gdb prompt shortly before my program crashed:

#0  csci2270::set::clear () at vset.cxx:90
#1  0x080491fc in csci2270::set::~set (this=0x804e068) at vset.h:57
#2  0x08049ef1 in csci2270::set::clear () at vset.cxx:95
#3  0x08049ec6 in csci2270::set::clear () at vset.cxx:94
#4  0x080491fc in csci2270::set::~set (this=0xbffff948) at vset.h:57
#5  0x08048e88 in main () at vsettest.cxx:55 
The way my destructor was written it called clear recursively and then called delete on each child to free the dynamic memory. Of course, I overlooked a major problem with this. Looking at the call stack above, I thought, "Wait a second. The destructor is calling clear, which is calling clear recursively, and then when I use delete to free dynamic memory it is calling clear AGAIN!" Oops! This actually results in clear getting called multiple times on the same node. In fact, there is no need to call clear() from clear(), instead you can just use delete which not only frees the dynamic memory, but also calls the destructor of the child node, which ends up making more calls to clear() as well. I deleted the recursive call to clear(), keeping only the delete, and all was well.

11/20/2009

contains, operator =, copy constructor

Once you have insert done, this part should seem relatively easy.

contains

For contains I used the find_index function that I had already written yesterday. Based on the return you know which recursive call to make. I must have been thinking about break because I messed this up multiple times while writing it. Don't forget to check that the node actually has children before attempting to make a recursive call to them.

copy contructor

Technically this isn't the most efficient thing to do, but I just implemented this by calling the assignment operator.

operator =

I broke this down into 3 steps:
  1. Check for self-assignment.
  2. Call clear() in case this set already contains a tree, in which case we need to free all dynamic memory we are using before copying source.
  3. Copy all the data and children from source. However, make sure not to only copy the child pointers (which is what's stored in the child vector). If you did this, the root of the new set and the root of source would be different but would point to the exact same children in memory. Instead we need to use the 'new' operator, and also make recursive calls to either the copy constructor or the assignment operator to initialize these children (and their children, and their grandchildren, etc). If you get confused, look at how the tree_copy function works in chapter 10. In fact, you could create a helper btree_copy function to call both here and in the copy constructor. That's not what I did, but if you are confused about using the assignment operator to create the copy constructor, or about how you would call either of these recursively, a helper function is a nice (and possibly more human-readable) way to do this.

Testing

I passed all of my initial tests and was feeling pretty clever, so I tried running vsetexam expecting that I would pass all the tests except for the erase function (which is homework 11). Not quite:

frohardt@csel-04:~/CSCI-2270/vset$ ./vsetexam 
Running set tests:
TEST 1:
A. Testing empty and contains for an empty set.
Segmentation fault

Oops! Looks like I got caught not checking a boundary case. Even worse, I got caught not using assert to check the preconditions of my helper functions. I added the asserts and found out that:

Running set tests:
TEST 1:
A. Testing empty and contains for an empty set.
vsetexam: vset.cxx:174: size_t csci2270::set::find_index(const int&) const: Assertion `data.size() > 0' failed.
Aborted 

It turns out that my contains function was violating the precondition of my find_index function, which is that there must be at least one element in data. Otherwise, the return from find_index would be nonsense. After fixing up some things I passed all the tests in vsetexam except for the tests related to the erase function. Even better, I was informed that:

If you submit the set now, you will have
90 points out of the possible 75 points.  
I guess we better fix this (maybe you would prefer that we didn't). Stay tuned for more info on the scoring.

12/7/2009

erase

This is harder than insert because there are more cases, but on the other hand the description in the book is even more explicit than for insert. Also, you may find this part to be easier than insert in some ways because now you are familiar with using the vector member functions to do things like move values from one vector to another and erase values. I actually haven't finished debugging my implementation of erase and still have one bug I have to track down, so I will provide more information about this once I finish. Here are some of the issues I ran into while debugging and some tips for testing:


12/8/2009

more erase

correction

There is a mistake in the pseudocode for erase() on p. 557 of the book. The line that reads:
if(loose_erase(target))
  Return zero since loose_erase did not remove an entry from the set.
This is incorrect, because we want to return false if loose_erase fails. This should be:
if(!loose_erase(target))
  Return zero since loose_erase did not remove an entry from the set.

success

My last bug was hard to track down, and I eventually did it by stepping through erase line by line and drawing a picture of what my tree looked like as I went. It was very useful here that I had an interactive test program and knew of a specific case that my erase() function always failed on. This is much easier to work with than just knowing that vsetexam fails, and I would encourage you to have an interactive test program and to try to find specific cases where your functions fail.

Since my bug was pretty hard to track down I'll let you know what I did in my forgetful programming (a common problem for me). In cases 3 and 4 of fix_shortage (which are both done using my combine() helper function), there is a step where we have to move all of the children from one child to the other. The mistake I made is that right after you copy the child pointers over, there are now two sets of pointers to the same children. I then wrote 'delete child[i+1]' but that causes the whole subtree to be deleted recursively. Make sure that you empty out child[i+1]'s child vector before calling delete in this case.

Grading

I have confirmed with Prof. Main that we will use MINIMUM = 2 to run vsetexam and determine your homework 11 grade. With MINIMUM=200 the tree never grows more than 2 levels in vsetexam, so not all cases are considered. Make sure that you test with MINIMUM = 2 since this is how you will be graded.

counter on blogger