Building ANN model post exploratory data analysis

Introduction to the problem :-

In this blog, we would work with one of the popular data-set i.e. of LendingClub. Its a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world’s largest peer-to-peer lending platform.

Objective of the Blog :-

In this blog, Given historical data on loans given out with information on whether or not the borrower defaulted (charge-off), we shall be building a model that can predict whether or nor a borrower will pay back their loan? This way in the future when we get a new potential customer we can assess whether or not they are likely to pay back the loan.

Dependent and Independent variable :-

The “loan_status” attribute is our dependent variable. It is also called as Target variable. This would be of binary type and therefore this problem is of Binary-Classification. “loan_status” is what, we shall be predicting through the help of our ANN model. Rest all are dependent attributes, as these attributes plays role in deciding whether loan_status should be T or F. Let’s first see the information regarding these dependent variables, present in our data-set.

Part 1 : Exploratory Data Analysis :-

First, let’s begin with Jupyter notebook and start importing the LendingClub’s data :-

Next, let’s see the distribution of the different categories under the “loan_status” attribute. Let’s plot the graph(using matplotlib library) for different values for target/dependent-variable i.e. ‘loan_status’ :-

The aforesaid task can also be performed using Seaborn library as well :-

Now, let’s analyse the distribution of the loan-amount vs it’s frequency with the help of histogram. Here, we see, there are way higher number of users who have taken loan of 10,000 USD, as compared to number of users, who have taken loan of 30,000 USD & above.

Continuous Variable :- If we start counting a given random-variable now and never, ever, ever finish (i.e. the numbers go on and on until infinity), we have what’s called a continuous variable. For Example: if our RV is “Temperature in Sahara Desert,” how long would it take us to write every possible temperature? It would take us literally forever: 50°, 50.1°, 50.11°, 50.111°, 50.1111°, …etc. Now, In our use-case, we have numerous continuous variables. Let’s plot the correlation graph between these continuous variables present into our data-frame :-

Now, We can visualise the correlation between these attributes more easily using the heat-map as well. Here, highest co-relation is of ‘YELLOW’ colour and so on. From below picture, attributes ‘loan_amount’ and ‘instalment’ have highest co-relation and therefore we can drop one of the variable.

Categorical variable is one that can only take a fixed number of values. Ex — -> limited number of sports, the gender is either male or female, and medals are gold, silver or bronze.

There are two major classes of categorical-data i.e. nominal and ordinal.

  • In any nominal — categorical data attribute, there is no concept of ordering amongst the values of that attribute. Weather is a nominal-categorical attribute. Similarly movie, music and video game genres, country names, food and cuisine types are other examples of nominal categorical attributes.
  • In any ordinal — categorical attributes, there do exists some sense or notion of order amongst its values. For instance : shirt sizes. It is quite evident that order or in this case ‘size’ matters when thinking about shirts (S is smaller than M which is smaller than L and so on).

Please note that, we generally use Chi-Square test for establishing the co-relation analysis between categorical variables.

Next, back to our main problem, we saw above that attributes ‘loan_amount’ and ‘instalment’ were positively co-related to each other. Now, there must be a linear-relationship between these 2 variables themselves. Let’s plot the same using scatterplot graph. Clearly below graph shows that, there exists Multi-Co-Linearity amongst these input variables.

Next, lets see the relationship (using BoxPlot summary) between attributes ‘loan_status’ and ‘loan_amount’. Dark line on the top, represents outliers above ~37K USD of loan_amount.

Next, We have a categorical attribute ‘grade’. Let’s see the relationship between the attributes ‘grade’ and ‘loan_status’. We shall be using scatterplot here along with Hue. The hue allows us to specify a categorical variable in a different color. The order parameter allows usto determine the sequence of the categorical variables. And finally, the palette parameter allows us to specify colors for the different levels of the hue variable. From the below graph, Grade C and D both have highest loan-defaulting rate. Even Grade B also has good enough number of defaulting-rate.

Next, let’s see how many number of loans were taken each sub-grade wise. There are two ways of looking at this information graphically :-

  • Through the use of Matplotlib.
  • Through the use of SNS countplot.

We can see that, people in sub-grade ‘B3’ have taken highest number of loans and similarly, people in sub-grade ‘G5’ have taken lowest number of loans. Please note here that, we had performed the ordering on the basis of the count of the loan-accounts.

Now, let’s see how many number of loans were fully-paid & charged-off under each sub-grade wise. Basically we shall be using ‘hue’ property here. We can note here that, for sub-grade E4, E5, F4 the count of ‘Charged-Off’ cases are equal to the count of ‘Fully-Paid’ cases, whereas for sub-grades B3, B4, C1, C2, etc. the percentage of ‘Charged-Off’ cases are quite less in number.

We can also do the ordering here on the basis of the sub-grade too. Please note here that, the ordering is based on the custom user defined ‘subgrade_order’.

For the last six grades (i.e. Grade F & G), we are unable to see the data because of the comparative loan-accounts to be too less. Let’s see the count of loan-accounts for grade F & G separately at more granular level. From below graph, we see that, count of people who have defaulted on the loan (i.e. “Charged-Off” loan-accounts) are more than the count of people who have paid the loan (i.e. “Fully-Paid” loan-accounts) for G3 sub-grade.

Next, we would form a new boolean-type attribute(‘loan_repaid’) from the existing column ‘loan_status’. We shall be mapping the value “Fully-Paid” to 1 and the value “Charged-Off” to 0. Post this, we shall also be dropping the column ‘loan_status’. Here, we have converted the categorical variable into the numeric variable.

Part 2: Data Pre-processing :-

Let’s see the top view of this data-set. Also, now find out how many total rows we have in this data-set. We have around 3,96,030 records in this data-set.

Next, let’s see, each column wise, how many rows have the value as NULL. From below data, around 23K records have NULL value for column ‘emp_title’. Similarly, around 18K records have NULL value for column ‘emp_length’ and around 37K records have NULL value for column ‘mort_acc’.

Now, let’s see how many are the unique values for the ‘emp_title’ attribute and following are some of the unique values of the title :-

Please note here that, there are too many unique ‘emp_title’s and it would lead us to too many columns in our data-set if we try to convert the to ‘emp_titles’ attribute to some dummy variable feature. Therefore, let’s remove this ‘emp_title’ column.

df = df.drop(‘emp_title’,axis=1)

Next, let’s see the unique values for ‘emp_length’ attribute :-

Next, let’s see employees with how much tenure have procured maximum count of loans. Below graph indicates that, employees with length of 10+ years have sought maximum count of loans.

Let’s see, if there is any relationship between ‘emp_length’ and count of loans-repaid. We would be using the ‘hue’ parameter to group the count amongst 2 different sub-groups. From below visualisation, we can observe that employees with employees with ‘emp_length’ more than 10 years are taking more loans as well as they are defaulting more. In another sense, below graph doesn’t tells anything meaningful. Thus, we are yet not much clear on whether, there is a strong relationship between employment length and loans going defaulters.

Thus, let’s find-out the percentage of charged-offs (Loans getting defaulted) per category (i.e. employee-length). Essentially we want to find out the informing us what percent of people per employment category didn’t pay back their loan.

First, let’s find out only those records which have are of type ‘Charged Off’ category only :-

Next, For the ‘Charged Off’ category of customers, let’s find out the count of records per ‘emp_length’ wise, by performing ‘group-by’ operation on it. We even plotted this data above.

Next, For all types of customers, let’s find out the count of records per ‘emp_length’ wise, by performing ‘group-by’ operation on it.

Next, let’s find out the percentage of defaulters under each ‘emp_length’ category wise. From the below graph, it is clear that under each type of ‘emp_length’, rate of defaulting is almost similar around ~19 or 20%. Although this information is meaningful, but we can conclude that, there is no specific category (i.e. no such ‘emp_length’ tenure) for which percentage of defaulting is too high. Under each category, around 20% people are defaulting on the loans. Therefore, this attribute actually would not help us in predicting / classifying whether a given loan would go defaulter or not ? Thus, this input-feature-attribute can be easily dropped.

Next, let’s again see, each column wise, how many rows have the value as NULL. In below data, please note that the column ‘emp_length’ is no more, because just above, we dropped this feature, as it was a redundant attribute for us. From below data, around 23K records have NULL value for column ‘emp_title’. Similarly, around around 37K records have NULL value for column ‘mort_acc’.

Next, let’s see what are the top values of the purpose of the loans :-

Next, let’s see how much is the count-of-loans taken for each type of purpose. Please note that, out of net-total count of records (i.e. 3,96,030) around 2,34,507 customers have taken the loan for purpose of ‘debt_consolidation’.

Next, we can observe that the attribute ‘title’ is also holding almost the same information as that of purpose and also it is also holding some NULL values as well.

Therefore, we can also drop this ‘title’ attribute as well. This seems to be more duplicate and basically it appears that its a sub-category of the ‘purpose’ attribute itself. Again, let’s see, each column wise, how many rows have the value as NULL. In below data, please note that the column ‘title’ is no more.

Next, let’s see what are the unique values of the attribute ‘mort_acc’. By the way, the ‘mort_acc’ stands for : “Number of mortgage accounts” a given customer is holding. From below data, there are around 18,194 such customers who are holding the 5 mortgage accounts.

Next, let’s see if we this input-attribute is co-related to any other attribute in the given data ? From the below data, it appears that the input-feature (“mort_acc”) is mostly co-related with input-feature (“total_acc”) with value of 38%.

We have around 37K records having value of ‘mort_acc’ as NULL. Therefore, we would now be using the attribute ‘total_acc’ in order to fill the null value in column ‘mort_acc’. Let’s begin by grouping our data-frame by ‘total_acc’ attribute and then calculate the mean value for the ‘mort_acc’ per ‘total_acc’ entry.

For example, from the below data, we can note that, customers having the ‘total_acc’ value as 135, have 3 as the value of the attribute ‘mort_acc’.

Next, let’s fill in the missing ‘mort_acc’ values based on their ‘total_acc’ value. If the mort_acc is missing, then we will fill in that missing value with the mean value corresponding to its total_acc value from the Series
we created above. This involves using an .apply() method with two columns.

Now, again let’s see, each column wise, how many rows have the value as NULL. In below data, please note that the column ‘mort_acc’ doesn’t have any NULL values anymore.

Next, observe that two columns ‘revol_until’ and ‘pub_rec_bankruptcies’ also have sum-total of 811 such records which have NULL values, but since the count of such entries is quite less as compared to the entire data-set, therefore we can happily let-go-off such records fully. Let’s go ahead and delete all of these 811 records :-

Now, we can observe that, we are left with 3,95,219 odd entries and no record have any NULL value.

Now, let’s list down all those columns which have categorical value. Out of all categorical values, the ‘term’ attribute have following values :-

Handling of ‘term’ categorical attribute :- Next, let’s now convert the ‘term’ attribute into the numeric attribute.

Handling of ‘grade’ categorical attribute :- Next, we can also drop the ‘grade’ attribute because we already have ‘sub_grade’ attribute, which is rather more granular.

df = df.drop(‘grade’,axis=1)

Handling of ‘sub_grade’ categorical attribute :- Next, let’s convert the ‘sub_grade’ categorical attribute into dummy attributes. Please note here that, we are using One Hot Encoding technique to convert the categorical attribute into an Dummy variable.

Now, we shall be concatenating these new columns to the original dataframe and dropping the original ‘sub_grade’ column.

Finally, let’s investigate the values of all of these columns. Each row would have value as 1 for only one of the newly introduced attributes.

Handling of following categorical attribute :- For attributes in the input-data i.e. (verification_status’, ‘application_type’, ‘initial_list_status’, ‘purpose’ ), we shall convert these columns as well into dummy variables and concatenate them with the original data-frame AND we shall be dropping the original categorical columns.

Handling of ‘home_ownership’ categorical attribute :- Next, lets investigate ‘home_ownership’ categorical attribute :-

Next, let’s convert these-values to dummy variables, but replace NONE and ANY with OTHER, so that we end up with just 4 new dummy-attributes, MORTGAGE, RENT, OWN, OTHER. Also, we shall be concatenating these new attributes with the original data_frame and dropping the original columns.

New feature extraction from ‘address’ categorical attribute :- Next, lets feature-engineer a zip code column from the address in the data set. Create a column called ‘zip_code’ that extracts the zip code from the address column.

Next, let’s see how many total number of zip-codes we have :-

Next, convert the newly introduced ‘zip_code’ attribute into a dummy-variable using One-Hot-Encoding :-

Now, we would be dropping the ‘address’ & ‘zipcode’ attributes, since we have already added the dummy variables for ‘zipcode’.

Handling of ‘issue_d’ categorical attribute :- Next, lets drop this attribute, because we wouldn’t know beforehand whether or not a loan would be issued when using our model, so in theory we wouldn’t have an issue_date, drop this feature.

Handling of ‘earliest_cr_line’ categorical attribute :- Next, this attribute appears to be a historical time stamp feature.

We extract the year from this feature using a .apply function, then convert it to a numeric feature. Set this new data to a feature column called ‘earliest_cr_year’. Then drop the earliest_cr_line feature.

Handling of ‘loan_status’ categorical attribute :- Now, let’s convert the ‘loan_status’ categorical attribute into the numeric attribute. We would be dropping the ‘loan_status’ column. Finally, we are not left with any categorical attribute anymore.

Part 3: Building a ANN based Model :-

First, let’s understand what is meaning of train & test data. Training set — a subset which is sued to train a model. Test set — a subset used to test the trained model. Now, let’s break our original data-set into X & y features. Here ‘X’ represents the input variables and ‘y’ represents the output variable.

Next, let’s see the data-frame at once. We have around 4 lakhs records and 80 attributes in this data-set.

Now, we break our data-set into train/test data by the rule of 80% split. In below demonstration, we showcase that count of records, combined together in Train & Test dataset comes out to be equal to the count of records in our original dataframe.

Next, we use a MinMaxScaler to normalize the attributes i.e. X_train and X_test. We can also use StandardScaler in order to normalize the attributes. Recall we don’t want data leakage from the test set so we only fit on the X_train data.

Next, Please note that we shall be using Keras & TensorFlows in order to build a ANN based model, therefore let’s proceed and install ‘tensorflow’ & keras package and then import it into our notebook. Please note that, most of the things in the ANN are experimental and data-driven.

Introduction to Perceptron :- ANN is nothing but consists of numerous Neurons(aka Perceptron) tied to each other. To each perceptron, inputs are taken along with the bias and it produces an output based upon the activation function. Further the output of this Neuron acts as feeder/input for the next layer.

Introduction to Activation-Functions :- Next, there are many Activation-Functions like Relu, TanH, Sigmoid, etc. Activation function actually fires/activates a neuron. In our context below (i.e. Logistic Regression → We perform the Binary-Classification), we have used ‘Relu’ as the activation-function for hidden layers AND we have used ‘Sigmoid’ as the activation-function for output layer.

Building a Model :- We have built a Sequential Model in our below context. ‘Dropout’ would help us to protect from overfitting. It won’t connect to some of the nodes of the hidden layer. Since, we have 80 attributes in total, therefore we have kept value for density as 80. Usually, the number of independent-features shall be equal to the number of neurons(aka perceptrons) at the input layer. In our use-case, since we have 80 final independent features, the number of perceptrons will be 80.

Next, we can have as many as hidden layers possible, but increasing the hidden layer, would also increase the complexity & time to train. More no. of hidden layers would also require more computational power. Also, it would increase the accuracy of our model upto a certain extent. We have kept 2 hidden layers each with density equal to half of the original density for input layer. But, please note that, there is no thumb rule for deciding upon the density for hidden layers. Hidden Layers are actually blackbox. As the size of the data increases, having the hidden layers would help us to reduce the complexity. Example → Usually in big dataset of Speech-Recognition, there can be as many as >1000 hidden layers as well. The time taken for such training can be as high as 10 days on a 64GB RAM of gPU. Whenever the data-variation is high, we should be using the more number of hidden layers. Please note from below snapshot that, in our first hidden layer, we have chosen 40 number of neurons and for our second hidden layer, we have chosen 20 number of neurons.

Next, we have output layer with only 1 as the value for density, because we only have 1 attribute to do predict for. Also, since our problem is a binary-classification problem, we are using the ‘Sigmoid’ as Activation-Function. Please note that, the advantage of sigmoid is such that, its output would be either 0 or 1.

Finally, we compile the model. Please note that, there are various loss-factors as well. Since our problem is binary-classification problem, we have chosen our loss as ‘binary_crossentropy’. Also, most of the times, ‘adam’ optimiser is the best used optimiser. There is also ‘learning_grade’ but we shall be considering it as default.

Next, we now keep the value of ‘epochs’ as 25. 1 epoch means → 1 full (forward + backward) propagation. Usually, the value of batch_size is kept as 2^n. In our context we have kept it as : 256.

Next, our model-training is over and therefore let’s save our model.

Now, let’s evaluate the performance of our model, with the help of training_loss and validation_loss.

Now, let’s perform our actual task of prediction (i.e. in this case classification) upon the test data. Please recall, we had 20% of the data-records as test-data-set. So, we now perform the classification on the test-data-set, basis of the input-attributes i.e. ‘X_test’.

After the above step, we have our classification-results ready with us. We would now compare our classification-results with the actual class from the ‘y_test’ dataset. We now generate the classification report based upon our Classification AND Actuals. We can see that, accuracy is of around 89%.

In the last, now let’s see to predict the output (whether loan shall be paid or not) basis upon some random input.

Let’s create a new_customer with the help of our own data-frame and using randomness :-

Next, let’s predict the class for this newly randomly generated customer using the ANN model :-

That’s all in this blog. Please share your experiences, if you liked this.

References :-