Your job in this problem will be to implement your own working version of the decision tree learning algorithm sketched in Chapter 18 of Russell and Norvig. We'll approach this problem in several stages.4.1.1 (10 points)
First, devise your own sample domain for decision tree learning. The basic idea is that there should be an interesting, nontrivial choice involved--nontrivial in the sense that many factors may enter into your decision. (Examples might include: whether or not to trust the weather enough to go to an outdoor concert; whether or not to apply to a particular graduate school; whether or not to go home for Thanksgiving.) You need to devise a set of perhaps 8-12 binary (yes/no) attributes that characterize your examples, and create about 15 training examples and 5 test examples.4.1.2 (20 points)
Write and test several procedures that will be used for the Choose-Attribute function of the decision tree learning algorithm. The procedures are:Information-value (yes-number, no-number)
This procedure should return the quantitative value of the information associated with a probability distribution ( y/(y+n), n/(y+n) ) as described in the text.Information-gain (yes-examples, no-examples, attribute)
This procedure should return the information gain resulting from using a particular attribute to partition a root with the set of starting yes- and no-examples into distinct subtrees.Choose-best-attribute-by-information-value (examples, attributes)
This procedure should find the best attribute in the set of attributes (using the information value) to partition the set of examples into distinct subtrees.
Show how your procedures work on some sample input.
4.1.3 (5 points)
Using your information-value procedure, answer the following question:
Suppose we start with a set of 100 yes-examples and 100 no-examples; and suppose that one attribute would partition the set into two subtrees, one with 70 yes-examples and 40 no-examples, and the other with 30 yes-examples and 60 no-examples. Now suppose that we have a second attribute that partitions the set into two subtrees, one with N yes-examples and 0 no-examples, and one with 100-N yes-examples and 100 no-examples. What is the smallest value of N such that the second attribute is superior to the first for partitioning the original set?
4.1.4 (30 points)
Implement the full decision-tree learning algorithm (using the information-based procedures that you created in 4.1.2, or procedures like them); and then run the algorithm on your training sample to produce a decision tree. Once that tree is created, verify that it correctly classifies your training examples. How close is it to the tree that you might have expected (based on introspection)?
4.1.5 (15 points)
Implement a procedure which, when given an unclassified example and a decision tree, uses the decsion tree to classify the example. (In other words, we now want to use the tree created in 4.1.4 to classify new examples.) Show how your decision tree from 4.1.4 classifies the test examples that you devised in 4.1.1.
4.1.6 (20 points)
Now experiment with the algorithm that you created in 4.1.4 so that it uses two alternative methods for choosing the next attribute to use in partitioning a tree into subtrees. First, use a method that simply chooses an attribute at random; now run your algorithm on the training set and show the tree that this new version of the algorithm would produce. Second, use a method that chooses the attribute on the basis of the following rule:
a. Define the yes-no-difference for a given set (yes-examples, no-examples) as the absolute value of the difference between the number of yes-examples and the number of no-examples.
b. Define the yes-no-subtree-difference for a given attribute as the sum of the yes-no-differences of the two subtrees produced by a given binary attribute.
c. Choose the "best" attribute by a method which maximizes the yes-no-subtree-difference for a given attribute choice.
Once more, run this new version of the algorithm on the training set and show the tree that this new version would produce. How do these two alternative tree-producing algorithms compare with the original information-based algorithm in terms of (i) the size of the tree produced, and (ii) the quality or understandability of the tree produced?