Decision Trees


12. Decision Trees

Learning outcomes

  • Define the entropy of a set
  • Compute the entropy of a given set
  • Define the information gain for a given feature
  • Define the Gini Impurity of a set
  • Implement the ID3 and CART algorithms

Making Decisions

  • A decision tree is a classifier, though it can also be used for regression.
  • It is built from non-metric data: each node asks a question with a yes/no (or categorical) answer, the data are split on the answer, and at the leaves we receive a classification decision.
  • The decision is the class a given point belongs to, and we have to work out that decision from the features.
  • We examine each feature in turn to see how well it separates the classes.
  • Q: based on the features we have, how should we construct the tree? In what order should we test the features to be most efficient?
    • A: the best feature to split on next is the one that gives us the most information.

History

  • Interestingly, it is largely a one-man show: most decision tree algorithms are based on the work of Ross Quinlan.
  • His book describes the first version of the algorithm, which he then developed further over many later versions.

Entropy and information

  • This is based on information theory; the central concept is entropy.
  • The information carried by a message is a measure of 'surprise': how much should the message surprise me? If I expected it, I gain no information; if I did not, I learn something new.
    • The information in an outcome x with probability p(x) is I(x) = -log2 p(x).
  • Entropy of a distribution
    • The expected value of the information is called the entropy, E[I].
    • Each outcome's information is weighted by the probability of that outcome occurring.
    • H = E[I] = -Σ_x p(x) log2 p(x)  (see the Python sketch after this list).
    • If we only have two classes, the graph of entropy against probability is an inverted U: the entropy is 0 when we are certain which of the two events will win,
    • and reaches its maximum of 1 when the two events have the same probability.
    • With N possible outcomes, the maximum value of the entropy is log2(N).
  • Information and the distribution are tightly coupled.
    • E.g. if my sister and I play table tennis, who do you bet on? If you know nothing about us, your chances are 50/50, i.e. the same probability for each. But if someone tells you she trains twice a week, the probabilities might shift from 0.5/0.5 to 0.8/0.2 in her favour; that message changed your distribution.
    • The more a message changes your distribution, the more important (valuable) the information it carries.
    • (Historically, much of probability theory grew out of analysing betting.)
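
To make the definitions above concrete, here is a minimal Python sketch of the entropy calculation (the function name and the example probabilities are illustrative, not from the lecture):

```python
import math

def entropy(probs):
    """Entropy H = -sum_x p(x) * log2 p(x) of a discrete distribution.
    Terms with p = 0 contribute nothing (0 * log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: two equally likely outcomes, maximum surprise
print(entropy([0.8, 0.2]))  # ~0.72: the lopsided table-tennis bet carries less surprise
print(entropy([1.0, 0.0]))  # 0.0: a certain outcome carries no information
```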

Fruits

  • A simple but important classification problem: we have a basket of fruit, we pick one at random, and we want to classify it.
  • If I pick at random and apples, oranges, and pears are equally likely, the entropy is at its maximum (log2 3 ≈ 1.58 for three equally likely classes), because we have no idea what we will get.
    • If the basket contained only oranges, the entropy would be 0, because we know we are going to get an orange.
  • (images from the slides: worked calculation of the entropy of the fruit basket; see the sketch below)
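
The exact counts in the slide are not reproduced here, but as a sketch, the same entropy calculation applied to class counts (using the 4 apples / 3 oranges / 3 pears split that appears again in the Gini example later) would look like this:

```python
from collections import Counter
import math

def entropy_of_labels(labels):
    """Entropy of a set of labelled items, using the empirical class frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# A mixed basket gives high entropy: we have little idea what we will pull out.
print(entropy_of_labels(["apple"] * 4 + ["orange"] * 3 + ["pear"] * 3))  # ~1.57
# A basket of only oranges gives zero entropy: no surprise at all.
print(entropy_of_labels(["orange"] * 10))                                # 0.0
```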

Apples and Oranges

  • We can consider two features:
    • Size: if we decide to split based on size, we still need to go deeper down the tree.
    • Colour: we do not need any further branches; we can separate the green fruit from the orange fruit in one step.
  • We want the most informative feature. We judge this by looking at the distributions of the two classes before and after the split: if the entropy of the distribution has decreased, we have reduced surprise and become more certain.
  • A reduction in entropy = an increase in information.

Entropy of the set

  • Information is measured by how much the distribution you hold in your head changes.
  • Entropy measures how much information there is still to acquire:
    • if I already know which outcome will win, the entropy is zero;
    • if all outcomes have equal probability, the entropy is at its maximum.
  • The features are treated as independent, so we do not worry about conditional probabilities.
  • We want the split with the maximum information gain, i.e. the one that tells us the most. We want the entropy to go from high before the split to low after it:
    • from not knowing what we will get to being sure of what we will get.

Entropy of the set II

  • (image: entropy of the subsets produced by splitting on colour and on size)
  • Hcolour = P(green)·H(green subset) + P(orange)·H(orange subset) = 0 + 0 = 0 entropy, because once we split by colour we know exactly whether we will get an apple or an orange.
  • Hsize = P(large)·H(large subset) + P(medium)·H(medium subset) ≈ 0.98, because if apples and oranges come in the same sizes we have almost no idea what we are getting.
  • So splitting on colour is highly informative, while size is not very informative, because the subsets it produces are still uncertain.

Entropy of the set III

  • We measure the weighted average entropy of the subsets we could obtain by splitting, and we want that average to be as far below the original entropy as possible.
  • The original entropy of the basket was H = 0.985.
    • To get the average entropy after splitting on colour, we add (fraction that are green × entropy of the green subset) + (fraction that are orange × entropy of the orange subset), which gives Hcolour below.
    • Hcolour = 0: exact certainty.
    • Hsize = 0.98.
  • Subtracting each of these from the original entropy gives the reduction:
    • information gain of colour = 0.985 - 0 = 0.985, the maximum possible gain (we end up with exact certainty);
    • information gain of size = 0.985 - 0.98 = 0.005, almost no gain.
  • We then ask which feature is the second most informative, then the third, and so on, until we reach the leaves.

Information gain

  • Information gain is how much the entropy decreases if we choose a particular feature for the next classification step.
  • We look at the difference between the original entropy and the weighted average entropy of all the subsets produced by the split.
  • The bigger this difference, the more information we have gained, and this is what we want to maximise (see the sketch below).
  • Gain(S, F) = H(S) - Σ_f (|S_f| / |S|) · H(S_f), where the sum runs over the possible values f of feature F and S_f is the subset of S with value f.
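
As a sketch of the calculation (the seven-fruit dataset below is hypothetical, chosen so that the starting entropy is roughly 0.985 as in the example above; it is not the lecture's exact basket):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels, from their empirical frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, F) = H(S) - sum_f |S_f|/|S| * H(S_f) for a list of dict-valued examples."""
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical basket: 4 green apples and 3 orange oranges, in mixed sizes.
examples = [
    {"colour": "green",  "size": "large"},
    {"colour": "green",  "size": "medium"},
    {"colour": "green",  "size": "large"},
    {"colour": "green",  "size": "medium"},
    {"colour": "orange", "size": "large"},
    {"colour": "orange", "size": "medium"},
    {"colour": "orange", "size": "large"},
]
labels = ["apple"] * 4 + ["orange"] * 3

print(entropy(labels))                               # ~0.985, the starting entropy
print(information_gain(examples, labels, "colour"))  # ~0.985: colour removes all uncertainty
print(information_gain(examples, labels, "size"))    # ~0.02: size tells us very little
```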

The ID3 algorithm

  • ID3 computes the information gain for each feature and chooses the one that produces the highest value.
  • It searches the space of possible trees greedily, always choosing the feature with the highest information gain.
  • At each stage the best feature is selected and removed from the feature set, and the algorithm is called recursively on the rest.
  • Each leaf of the tree is then assigned the class of the majority of the training points whose feature values lead to that leaf. In pseudocode (a Python sketch follows):
If all examples have the same label:
- Return a leaf with that label

Else, if there are no features left to test:
- Return a leaf with the most common label

Else:
- Choose the feature F that maximises the information gain of S to be the next node
- Add a branch from the node for each possible value f of F
- For each branch:
  - Calculate S_f, the subset of examples with value f, removing F from the set of features
  - Recursively call the algorithm with S_f to compute the gain relative to the current set of examples
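
A minimal Python sketch of the pseudocode above, assuming discrete feature values and representing the tree as nested dictionaries (no handling of values unseen during training):

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def id3(examples, labels, features):
    """Build a tree as {feature: {value: subtree_or_label}}; leaves are class labels."""
    # All examples have the same label: return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No features left to test: return a leaf with the most common label.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Choose the feature that maximises the information gain to be the next node.
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Add a branch for each possible value of the chosen feature and recurse.
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return tree

# Tiny hypothetical example: colour alone separates apples from oranges.
examples = [
    {"colour": "green",  "size": "large"},
    {"colour": "green",  "size": "medium"},
    {"colour": "orange", "size": "large"},
    {"colour": "orange", "size": "medium"},
]
labels = ["apple", "apple", "orange", "orange"]
print(id3(examples, labels, ["colour", "size"]))
# e.g. {'colour': {'green': 'apple', 'orange': 'orange'}}
```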

Visualizing splits

  • (figure: decision boundaries produced by successive splits on two continuous features)
  • This is the kind of split the algorithm above produces. The figure shows continuous variables, but it works the same way for discrete ones.
  • Because each node tests a single feature (the features are considered independently), all the decision boundaries are parallel to the axes: each split caps the region along one feature or another.
  • So the algorithm first finds the single line (a split on one feature) that separates the classes best, then repeats within each region, and so on (see the sketch below).
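
If scikit-learn is available, a quick (purely illustrative) way to see that every test is a threshold on a single feature, i.e. an axis-parallel boundary:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Every printed rule has the form "feature <= threshold": one feature at a
# time, so each decision boundary is parallel to a coordinate axis.
print(export_text(tree, feature_names=list(iris.feature_names)))
```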

Characteristics

  • ID3 has been developed further and has now been superseded by C4.5.
  • What are the main characteristics of decision trees and ID3?
    • It is greedy with respect to information gain: at each step it picks the feature with the most information gain.
    • It can deal with noisy data: if there are a few elements that do not belong, and there are not too many of them, they will not change the label assigned to that set, so it is robust with respect to noise.
    • On the other hand, it always uses all of the features, which may lead to overfitting: the extra splits may improve classification on the training set, but not necessarily on the validation set.
  • Improvements in C4.5:
    • It does pruning: after building the tree it removes some of the nodes, checking whether re-merging the sets that were split improves generalisation.
      • This may introduce a little bias on the training set (the error on that particular set might increase slightly) in exchange for reduced variance, which allows better generalisation.
    • The main idea remains the use of information gain.

A different criterion: Gini Impurity

  • We now consider a criterion for splitting that is different from entropy.
  • If I assign classes at random according to their frequency in the data (i.e. based on the distribution of elements in the set), and I choose an element at random, how likely am I to label it wrongly?
  • E.g.:
    • 4/10 apples, 3/10 oranges, 3/10 pears.
    • If I pick at random, we look at the probability of 'not apple', 'not orange', 'not pear', and so on.
    • For this basket that works out as 1 - (0.4² + 0.3² + 0.3²) = 0.66.
    • This is the probability of being wrong (e.g. of saying 'apple' for something that is not an apple) when both the item and the label are picked at random using the distribution.
  • The 'impurity' in the name reflects the aim of the decision tree: each leaf node should represent a set of data points that are all in the same class, so that there are no mismatches → purity.
  • If a leaf is pure, then all of the training data within it have just one class.
  • To choose a feature to split on, the algorithm loops over the different features and checks how many points in each resulting subset belong to each class.

  • Gini impurity: G(S) = Σ_i p_i (1 - p_i) = 1 - Σ_i p_i², where p_i is the fraction of the elements of S that belong to class i (see the sketch below).
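
A minimal sketch of the Gini impurity calculation for the fruit basket above (the function name is illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity G = 1 - sum_i p_i^2: the probability of mislabelling a randomly
    drawn item if the label is also drawn at random from the class frequencies."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

basket = ["apple"] * 4 + ["orange"] * 3 + ["pear"] * 3
print(gini_impurity(basket))           # 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.66
print(gini_impurity(["orange"] * 10))  # 0.0: a pure set
```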

Classification and Regression Trees (CART)

  • CART uses the Gini impurity instead of entropy: we again compare the value before and after the split, and we choose the split that maximises the decrease in impurity.
  • Can be used for both classification and regression.

Random forests

  • In practice, trees are often used in ensembles called forests: sets of decision trees built by introducing random differences between them, either by
    • using a random (bootstrap) subset of the data to build each tree, repeated for as many trees as we want (bagging), or
    • choosing a random subset of the features, so that each tree only uses certain features.
  • Each tree votes for a classification, and the class with the most votes is returned by the random forest.
  • The individual trees are built in the same way, using either Gini impurity or entropy; it does not usually make a big difference, and it is a matter of finding out what works best for your dataset (see the sketch below).
  • In ML you cannot rely on just one method; you have to experiment with different methods to see what works, because it depends on the dataset.
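
As an illustration (assuming scikit-learn is installed; the dataset and hyperparameters are arbitrary), a random forest that bags the data and samples a random subset of features at each split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 trees, each grown on a bootstrap sample of the data and restricted to a
# random subset of the features (sqrt of the total) at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                criterion="gini", random_state=0)

# Classification is by majority vote over the trees; swap criterion to
# "entropy" to compare the two splitting criteria on your own data.
print(cross_val_score(forest, X, y, cv=5).mean())
```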
