Welcome, readers. If you have landed here directly, I strongly suggest going back and reading through this link first.
Introduction to the problem: In this blog, I would like to help you build a Machine Learning model based on the Decision Tree algorithm. Here, we shall be working on a small dataset (taken from archive). We shall first train our model on the given data and then perform multi-class classification using the built model.
Let’s begin by exploring the dataset. Please note that there are 4 independent variables and 1 dependent variable. Merely observing the data doesn’t tell us anything about the domain as such.
Next, let’s see the unique values of our dependent variable. This is the feature we shall be predicting going forward, based on the other 4 input variables. There are 3 possible values for the dependent variable, i.e. <B, L, R>, and the given dataset contains <49, 288, 288> records with these values respectively.
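To make the exploration step concrete, here is a minimal sketch. The data file itself is not reproduced in this post, so the sketch builds a stand-in table instead of loading one. It assumes the data is the UCI Balance Scale set (625 rows, four attributes in the range 1 to 5, class given by comparing two products), which is consistent with the counts quoted above. The column names and the `label` helper are hypothetical.

```python
from itertools import product

import pandas as pd


def label(lw, ld, rw, rd):
    """Assumed Balance Scale rule: compare left and right products."""
    left, right = lw * ld, rw * rd
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"


# Stand-in for the real file: every combination of four attributes in 1..5.
# Column names are hypothetical, chosen only for this illustration.
rows = [(label(*x), *x) for x in product(range(1, 6), repeat=4)]
balance_data = pd.DataFrame(rows, columns=["Class", "LW", "LD", "RW", "RD"])

print(balance_data.shape)                                  # rows, columns
print(balance_data.head())                                 # first few records
print(balance_data["Class"].unique())                      # distinct labels
print(balance_data["Class"].value_counts().sort_index())   # records per label
```

With a real file you would replace the stand-in with a `pd.read_csv(...)` call; the inspection calls stay the same.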
Next, we break our dataset into a training set and a test set, splitting on the 70-30 rule. Thus, out of the 625 given records, we are left with 437 records that we shall use for training; the remaining 188 form the test set.
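The split can be sketched with sklearn's `train_test_split`. As before, the data here is a regenerated stand-in (assuming the Balance Scale labelling rule), and `random_state=100` is my own choice so that the split is repeatable; the post does not state which seed was used.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))

# 70-30 split; random_state is an arbitrary seed chosen for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

print(X_train.shape, X_test.shape)
```

Since 30% of 625 is 187.5, sklearn rounds the test set up to 188 records, which leaves 437 for training.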
Introduction to DecisionTreeClassifier: This is sklearn's classifier class for decision trees. It is the main entry point for implementing the algorithm. Some important parameters of this class are:
- criterion: It defines the function used to measure the quality of a split. Sklearn supports “gini” for the Gini index and “entropy” for information gain. By default, it takes the value “gini”.
- splitter: It defines the strategy used to choose the split at each node. It supports “best” to choose the best split and “random” to choose the best random split. By default, it takes the value “best”.
- max_features: It defines the number of features to consider when looking for the best split. We can pass an integer, a float, a string or None. If an integer is given, that many features are considered at each split. If a float is given, it is treated as the fraction of features considered at each split. If “sqrt” is given, then max_features=sqrt(n_features) (older sklearn releases also accept “auto” with the same meaning). If “log2” is given, then max_features=log2(n_features). If None, then max_features=n_features. By default, it takes the value None.
- max_depth: It denotes the maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
- min_samples_split: This tells about the minimum number of samples required to split an internal node. If an integer is given, it is used as the minimum number. If a float is given, it is treated as a fraction of the total number of samples. By default, it takes the value 2.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is given, it is used as the minimum number. If a float is given, it is treated as a fraction of the total number of samples. By default, it takes the value 1.
- max_leaf_nodes: It defines the maximum number of possible leaf nodes. If None, the number of leaf nodes is unlimited. By default, it takes the value None.
- min_impurity_split: It defines the threshold for stopping tree growth early. A node will split if its impurity is above the threshold; otherwise it is a leaf. Note that this parameter is only available in older sklearn releases; it was deprecated and later removed in favour of min_impurity_decrease.
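To give a feel for what the two criterion options measure, here is a small sketch in plain Python that computes the Gini index and the entropy of a node from its class counts. The helper names `gini` and `entropy` are mine, not sklearn's; the formulas are the standard definitions. The class counts of our whole dataset are used as the root-node example.

```python
from math import log2


def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)


root = [49, 288, 288]   # records labelled B, L, R in the full dataset
pure = [10, 0, 0]       # a node holding a single class

print(gini(root), entropy(root))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and grow as the classes get more mixed, which is why a split that lowers them is considered a good split.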
Now, let’s proceed to build our model. There are 2 types of criteria. First, we shall use the default criterion, i.e. ‘gini’:
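A minimal sketch of this step follows. The data is again a regenerated stand-in, and the hyperparameters other than the criterion (`random_state=100`, `max_depth=3`, `min_samples_leaf=5`) are assumptions of mine for illustration; the text above does not say which values were used.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# "gini" is the default criterion; it is spelled out here for clarity.
# The remaining settings are illustrative assumptions, not taken from the post.
clf_gini = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5
)
clf_gini.fit(X_train, y_train)

print(clf_gini.get_depth(), clf_gini.get_n_leaves())
```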
Next, let’s build our model using ‘entropy’ as the criterion:
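The entropy-based model differs only in the `criterion` argument. This sketch uses the same stand-in data and the same assumed hyperparameters (`random_state=100`, `max_depth=3`, `min_samples_leaf=5`), none of which are stated in the text.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Same illustrative settings as a gini model would use, but splits are now
# scored by information gain.
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5
)
clf_entropy.fit(X_train, y_train)

print(clf_entropy.get_depth(), clf_entropy.get_n_leaves())
```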
Next, let’s perform the prediction (i.e. multi-class classification) using the built models. We shall perform the classification task with both of the above models. The predicted results are stored in ‘y_pred_gini’ and ‘y_pred_entrophy’.
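The prediction step might look like the sketch below. It retrains both models on the stand-in data so that it runs on its own; the hyperparameters are assumed as before, and the variable names follow the ones used in the text. Note that `predict` returns NumPy arrays with one label per test record.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Hyperparameters are illustrative assumptions.
params = dict(random_state=100, max_depth=3, min_samples_leaf=5)
clf_gini = DecisionTreeClassifier(criterion="gini", **params).fit(X_train, y_train)
clf_entropy = DecisionTreeClassifier(criterion="entropy", **params).fit(X_train, y_train)

# One predicted label (B, L or R) per record in the test set.
y_pred_gini = clf_gini.predict(X_test)
y_pred_entrophy = clf_entropy.predict(X_test)

print(y_pred_gini[:10])
print(y_pred_entrophy[:10])
```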
First, we shall import the metrics module from the sklearn package.
Next, let’s evaluate the performance of both models. For this, we shall compare the predicted results for the test dataset with the actual values of the test dataset.
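Here is a sketch of the evaluation, using `accuracy_score`, `confusion_matrix` and `classification_report` from `sklearn.metrics`. Because it runs on stand-in data with assumed hyperparameters, the numbers it prints will not necessarily match the figures quoted in this post.

```python
from itertools import product

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

labels = ["B", "L", "R"]
accuracies, matrices = {}, {}
for criterion in ("gini", "entropy"):
    # Hyperparameters are illustrative assumptions.
    clf = DecisionTreeClassifier(
        criterion=criterion, random_state=100, max_depth=3, min_samples_leaf=5
    ).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Compare predictions with the actual test labels.
    accuracies[criterion] = accuracy_score(y_test, y_pred)
    matrices[criterion] = confusion_matrix(y_test, y_pred, labels=labels)

    print(criterion, "accuracy:", round(accuracies[criterion] * 100, 2), "%")
    print(matrices[criterion])
    print(classification_report(y_test, y_pred, zero_division=0))
```

The confusion matrix is worth a look in addition to the accuracy, since the B class is much smaller than the other two and a model can score well overall while rarely predicting it.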
Note that, from the above report, the accuracy of the model using gini is around 73% and the accuracy of the model using entropy is around 71%. We can also see the values predicted by both models:
The following is how the decision tree looks for our model based on gini:
The following is how the decision tree looks for our model based on entropy:
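If you want to reproduce such pictures yourself, one lightweight option is sklearn's `export_text`, which prints a tree as indented text rules. The sketch below does this for both models. It is not necessarily how the original figures were drawn, and it again relies on stand-in data, assumed hyperparameters and hypothetical feature names.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): four attributes in 1..5, label from two products.
X = np.array(list(product(range(1, 6), repeat=4)))
left, right = X[:, 0] * X[:, 1], X[:, 2] * X[:, 3]
y = np.where(left > right, "L", np.where(left < right, "R", "B"))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

feature_names = ["LW", "LD", "RW", "RD"]  # hypothetical column names
tree_text = {}
for criterion in ("gini", "entropy"):
    # Hyperparameters are illustrative assumptions.
    clf = DecisionTreeClassifier(
        criterion=criterion, random_state=100, max_depth=3, min_samples_leaf=5
    ).fit(X_train, y_train)

    # Each line of the output is one split rule or one leaf with its class.
    tree_text[criterion] = export_text(clf, feature_names=feature_names)
    print("Tree based on", criterion)
    print(tree_text[criterion])
```

For a graphical version, `sklearn.tree.plot_tree` draws the same structure with matplotlib.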