Information Gain in Decision Trees - Drawbacks

Drawbacks

Although information gain is usually a good measure of an attribute's relevance, it is not perfect. A notable problem occurs when it is applied to attributes that can take on a large number of distinct values. For example, suppose that we are building a decision tree from data describing the customers of a business. Information gain is often used to decide which attributes are the most relevant, so that they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high information gain because it uniquely identifies each customer: splitting on it places every customer in a subset of their own, and each such subset is trivially pure. We do not want to include it in the decision tree, however, because deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we have not seen before (overfitting).
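The effect can be reproduced on a toy dataset. The following is a minimal sketch, not a reference implementation: the eight customers, the attribute names card_number and late_payments, and the target churned are all made up for illustration. It computes information gain as the entropy of the target minus the weighted entropy of the subsets produced by a split.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: eight customers, invented for this example.
card_number = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]  # unique per customer
late_payments = [1, 1, 1, 1, 1, 0, 0, 0]                        # a plausible predictor
churned = [1, 1, 1, 1, 0, 0, 0, 0]                              # target class

# The unique identifier achieves the maximum possible gain (the full
# entropy of the target, 1 bit here), beating the meaningful attribute.
print("card_number  :", information_gain(card_number, churned))
print("late_payments:", information_gain(late_payments, churned))
```

On this data the identifier scores a gain of 1 bit, while the made-up late_payments attribute scores roughly 0.55 bits, so a tree that greedily maximizes information gain would split on the credit card number first.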

The information gain ratio is sometimes used instead. It normalizes information gain by the entropy of the split itself, which biases the decision tree against attributes with a large number of distinct values. However, attributes with very low information values can then receive an unfair advantage. Methods such as permutation tests have also been proposed to correct the bias.
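As a rough illustration of the gain ratio, the sketch below divides information gain by the entropy of the attribute's own value distribution (often called split information). The dataset and the names card_number, late_payments and churned are hypothetical, and returning 0.0 when the split information is zero is a convention chosen here, not something the text above prescribes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute's own values."""
    split_info = entropy(values)
    if split_info == 0:
        # Assumed convention: a constant attribute cannot split the data.
        return 0.0
    return information_gain(values, labels) / split_info


# Hypothetical toy data: eight customers, invented for this example.
card_number = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
late_payments = [1, 1, 1, 1, 1, 0, 0, 0]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

# The identifier has 3 bits of split information, so its ratio drops
# to 1/3, below the ratio of the two-valued attribute.
print("card_number  :", gain_ratio(card_number, churned))
print("late_payments:", gain_ratio(late_payments, churned))
```

Here the identifier's ratio falls to about 0.33 while the two-valued attribute scores about 0.57, so the ranking is reversed relative to plain information gain. Because the denominator can be very small for an attribute whose values are almost all identical, the ratio can also become large for reasons unrelated to predictive value.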
