Decision Tree is a type of supervised learning algorithm that can be used in both regression and classification problems.
Goal in Decision Tree
Goal in Decision tree is to create a training model which can be used to predict the value or the class of the target variable by learning simple decision rules and terminologies inferred from a training data.
Idea behind Decision Tree
At each point or a training sample, We consider a set of questions that can partition our dataset.
We choose the question that provides the best split and again find the best question for the partition.
We stop once all the points you’re considering are of same class.
This way decision tree helps to segregate data or a population based on the values of input variables and identify the variable which creates the best homogeneous sets of data/population (which are heterogeneous to each other).
Suppose, Our dataset consists of n no of attributes in the training sample out of which we have to find the most suitable variable to split our dataset into more subsets. We can’t choose these variables randomly as it may give us bad results with low accuracy.
In decision trees, There are some methods to determine the splitting criteria as We need something to measure and compare the impurity (initially all the training samples are impure). Some of the methods are Gini Impurity, Entropy, Information Gain, Gain Ratio, Reduction in variance, Chi square .
Splitting criteria and Algorithm selection is based on the type of target variable.
Decision Tree makes decision on the basis of homogeneity(purity).
step 1 : Calculate Gini impurity for all the given variables i.e. Gini impurity for the left node and Gini impurity for the right node.
step2 : Now, Calculating total Gini impurity for a variable to segregate data on the basis of target variable.
Calculate the weighted average of the leaf node impurities.
Variable with the lowest Gini Impurity is selected for the root node.
Entropy measures the level of impurity in a group. A decision tree is build top down from a root node and involves partitioning the data into subsets that contain similar instances (instances with similar values).
So, Higher the entropy, It becomes difficult to infer conclusions from that information. Attribute with lowest Entropy is always preferred.
A is a more impure node, C is a pure node and B is a less impure node
So, we can say that impure node i.e. less impure node requires less info to describe it and more impure node requires more info to describe.
Information Gain is based on the decrease in entropy after a dataset is split on the basis of a variable.
step1 : Calculate entropy of the target variable
step2 : Dataset is then split on the different variables. The entropy for each branch is calculated. Then it is added proportionally, to get the total entropy for the split. The resulting entropy is subtracted from the entropy before split
step3 : Choose the attribute with the largest info gain as the decision node, again divide the dataset by its branch and repeat the same process on every branch.
step4 : Branch with entropy of 0 is a leaf node and a branch with more than 0 entropy needs further splitting.
We always prefer to use attribute with largest information gain.
Data is sorted at every node of the tree in order to determine the best splitting attribute . At each node of the tree, C4.5 algorithm makes the use of Gain ratio to choose one attribute of the data that most effectively splits its set of samples into subsets.
Split Info is calculated with respect to all the classes in the target variable and attribute with the maximum gain ratio is selected as the splitting attribute.
Attribute with maximum gain ratio is selected for the splitting criteria.
Reduction in Variance
Reduction in variance is mainly used for continuous target variable(regression problems). This algorithm uses the standard formula of variance to choose the best split.
The split with lowest variance is selected as a criteria to split the training data.
Chi Square is an algorithm used to find out the statistical significance between the sub nodes and parent node by calculating the difference between sub nodes and parent node.
We measure their relationship by sum of square of standardized difference between observed and expected frequencies of a target variable. It build CHAID tree and performs multi level splits when computing the classification trees.
step1 : Calculate chi square for the individual nodes by calculating the deviation of the given classes in target variable.
step2 : Calculate chi square of split using sum of all chi square of the classes of each node of the split.
Algorithms used in Decision Tree
It is a core algorithm mainly used in decision tree. ID3 algorithm uses entropy and information gain of the attribute to split the original training sample. It then select the attribute with smallest entropy or largest information gain.
But When information gain encounters with a variables that have a large no of values or classes. It becomes biased towards choosing attributes and prefer the attribute with a large no of distinct values.
C4.5 algorithm uses Gain Ratio (a successor of ID3) for building decision trees. It is an improvement of ID3 or modification of ID3 that reduces the bias and overcomes the problems with information gain. It corrects Information Gain by taking the useful information of a split or taking in account the no of branches that would result before making the split
CART algorithm uses Gini impurity criteria to split the node and identify the best variable for the splitting criteria. It is also used in both classification and regression problems.
Advantages And Disadvantages Of Decision Tree
There are several advantages and disadvantages associated with decision tree models.
Normalization or Scaling of data is not required
Handles data with missing value
Easy to explain and visualize
Automatic feature selection i.e. irrelevant features won’t affect decision trees.
Prone to Overfitting
Sensitive to data i.e. If data changes slightly, outcome also change to a very large extend.
Requires more time in training
Not fit for continuous variables, while working with continuous numeric variables, decision tree often looses some info when it categorizes variables in different categories.
Please check out overfitting and underfitting in CART and methods to overcome them, if you’re interested in exploring the topic further.