Multi-class Classification using Decision-Tree Model

Welcome, readers. If you have landed here directly, I strongly suggest going back and reading through this link first.

Introduction to the problem :- In this blog, I would like to help you build a Machine Learning model based on the Decision Tree algorithm. Here, we shall be working on a small dataset (taken from the archive). We shall first train our model using the given data and then perform multi-class classification using the trained model.

Let’s begin by exploring the data-set. Please note that there are 4 independent variables and 1 dependent variable. Merely observing the data doesn’t tell us anything about the domain as such.
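As a rough sketch of this step, the data can be loaded with pandas. The file name and column labels below are my own assumptions, based on the UCI Balance Scale data, which matches the record counts described in this post:

```python
import pandas as pd

# Assumed file and column names (UCI Balance Scale layout: class label first,
# followed by the 4 input attributes)
col_names = ['class', 'left_weight', 'left_distance', 'right_weight', 'right_distance']
df = pd.read_csv('balance-scale.data', header=None, names=col_names)

print(df.shape)    # (625, 5): 4 independent variables + 1 dependent variable
print(df.head())
```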

Next, let’s see the unique values of our dependent variable. This is the feature we shall be predicting going forward, based upon the other 4 input variables. As shown below, there are 3 possible values for the dependent variable, i.e. <B, L, R>, and in the given data-set there are <49, 288, 288> records having these values, respectively.
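Something along these lines shows the class distribution (again using the assumed 'class' column name from the sketch above):

```python
# The dependent variable takes 3 values: B (balanced), L (left) and R (right)
print(df['class'].unique())         # e.g. ['B' 'L' 'R'] (order may vary)
print(df['class'].value_counts())   # L: 288, R: 288, B: 49
```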

Next, we split our dataset into a training set & a test set, using a 70–30 rule. Thus, out of the 625 given records, we are left with 437 records to use for training.
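A minimal sketch of the split, assuming the column names above (the random_state value is arbitrary, included only for reproducibility):

```python
from sklearn.model_selection import train_test_split

X = df.drop('class', axis=1)   # the 4 input variables
y = df['class']                # the dependent variable

# 70-30 split: 437 training records and 188 test records out of 625
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

print(X_train.shape, X_test.shape)
```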

Introduction to DecisionTreeClassifier: This is scikit-learn's classifier class for decision trees, and it is the main class we use to implement the algorithm. Some important parameters of this class are criterion (the impurity measure used for splitting, 'gini' or 'entropy'), max_depth (the maximum depth of the tree), min_samples_leaf (the minimum number of samples required at a leaf node) and random_state (for reproducible results).

Now, let’s proceed to build our model. There are 2 choices of criterion. First, we shall use the default criterion, i.e. ‘gini’ :-
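A sketch of how this model might be built; the hyperparameters other than criterion (max_depth, min_samples_leaf, random_state) are illustrative choices, not necessarily the ones used in the original post:

```python
from sklearn.tree import DecisionTreeClassifier

# 'gini' is the default impurity criterion
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
```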

Next, let’s build the model using ‘entropy’ as the criterion :-
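The same construction, only swapping in the information-gain ('entropy') criterion (again with illustrative hyperparameters):

```python
from sklearn.tree import DecisionTreeClassifier

# Identical setup to the gini model; only the splitting criterion changes
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
```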

Next, let’s perform the prediction (i.e. the multi-class classification) using the built models. We shall run the classification task for both of the above models, and the predicted results are stored in the ‘y_pred_gini’ & ‘y_pred_entropy’ arrays.
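A sketch of the prediction step with the two fitted models:

```python
# Predict the class (B, L or R) for every record in the test set
y_pred_gini = clf_gini.predict(X_test)
y_pred_entropy = clf_entropy.predict(X_test)

print(y_pred_gini[:10])
print(y_pred_entropy[:10])
```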

First, we shall import the metrics utilities from the sklearn package.
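For instance, the usual helpers are:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```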

Next, let’s evaluate the performance of both models. For this, we shall compare the predicted results on the test data-set with the actual values of the test data-set.
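A sketch of the evaluation, comparing each model's predictions against y_test:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

for name, y_pred in [('gini', y_pred_gini), ('entropy', y_pred_entropy)]:
    print(f"--- criterion: {name} ---")
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```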

Note that, from the above report, the accuracy of the model using gini is around 73% and the accuracy of the model using entropy is around 71%. We can also see the predicted values from both of the models :-

Following is how the decision tree looks for our model based upon gini :-
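One way to render such a tree is scikit-learn's plot_tree; this is only a sketch, and the figure size is an arbitrary choice:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 8))
plot_tree(clf_gini, feature_names=list(X.columns),
          class_names=list(clf_gini.classes_), filled=True)
plt.show()
```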

Following is how the decision tree looks for our model based upon entropy :-
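The entropy-based tree can be drawn the same way, just swapping in the second model:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 8))
plot_tree(clf_entropy, feature_names=list(X.columns),
          class_names=list(clf_entropy.classes_), filled=True)
plt.show()
```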

