Multi-class Classification of Mathematical-Numbers | Vanilla DNN | Part1

Nov 1, 2021

In this blog, we shall use the Keras framework with the TensorFlow backend to create a Deep Neural Network. We shall use the MNIST dataset, which consists of 28*28 grayscale images of hand-written numeric digits 0 to 9. The dataset is partitioned into 2 parts :-

Training dataset of 60,000 images.
Testing dataset of 10,000 images.

This is a classic example of classification, where we classify each input image into 1 of the 10 classes of decimal digits (0 to 9). The Deep Neural Network that we have designed has 3 hidden layers (with 50, 60 and 30 neurons), followed by an output layer of 10 neurons, as demonstrated below.

First, we import the important libraries that we shall use throughout our demonstration. We shall use TensorFlow V1 throughout the setup, so we also explicitly disable the TF V2 behaviour.

Next, we initialise the random-number generator and set the seed, so that we get the same sequence of random numbers in every run of the program.
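The effect of seeding can be sketched with Python's standard `random` module. This is only an illustration of the idea: the article's own code presumably seeds NumPy and TensorFlow instead, and the helper name `sample_numbers` and the seed values are assumptions made for this example.

```python
import random


def sample_numbers(seed, count=5):
    """Return `count` pseudo-random floats drawn after seeding with `seed`."""
    rng = random.Random(seed)  # independent generator with a fixed seed
    return [rng.random() for _ in range(count)]


first_run = sample_numbers(seed=42)
second_run = sample_numbers(seed=42)
other_seed = sample_numbers(seed=7)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

With the seed fixed, the randomly initialised weights of a network are the same in every run, which makes training results easier to compare.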

We shall use the MNIST hand-written digits dataset, which is provided as a built-in dataset in the Keras framework. We therefore first create an instance of the MNIST dataset.

Next, we load the data (from the MNIST dataset) into the current environment.

Let’s first understand the meaning of the 4 variables created above :-

The training set is the subset of the dataset used to train a model.

Xtrain is the training data set.
Ytrain is the set of labels for all the data in Xtrain.

The test set is the subset of our dataset that we shall use to test our model, after the model has gone through initial vetting by the validation set.

Xtest is the test data set.
Ytest is the set of labels for all the data in Xtest.

Coming back to the MNIST dataset, let’s go ahead and print the shape of all of the above 4 datasets. We have the following configuration :-

Training dataset of 60,000 images, stored in Xtrain.
Testing dataset of 10,000 images, stored in Xtest.

Also, please note that this is a set of hand-written numbers, stored as grayscale images. Each image is stored as a matrix of size 28 * 28, with each pixel taking a value in the range (0 to 255).

Let’s go ahead and print any sample image :

As we have learnt so far, there are 60,000 images in the training dataset. We print the last image from the training dataset, i.e. the image at index 59,999 (indices start from 0) :-

Understand that each image is of size 28 * 28, i.e. a 2d array. Each pixel can vary within the range (0 to 255), and each value represents the intensity of the gray level at that pixel. Let’s go ahead and see what the pixel values of this image (index 59,999) are :-

Let’s see what the pixel values are for the 20th row of this image (index 59,999) :-

Similarly, as we have also learnt, there are 10,000 images in the test dataset. We print the last image from the test dataset, i.e. the image at index 9,999 :-

Now, let’s go ahead and see the label for this particular image, as given in the MNIST test dataset :-

Next, we need to perform the pre-processing on the images. Each input image is converted from a 28 * 28 matrix into a tensor of size 784, using the “reshape()” function. Then, each image is normalised, in order to convert the pixel values from the range (0 to 255) to the range (0 to 1). In the process, the values are also converted into float type.

Data-Pre-Processing :- Next, we need to perform the pre-processing on the images.

Remember from our previous screenshots above that each image was originally a 2d matrix of size 28*28. Let’s now go ahead and convert each input image from 2d into 1d of size 784, using the “reshape()” function. Note that, after this reshape operation, each image is a 1d array of size 784. Also note that, after this operation, we can no longer view the image or plot it using pyplot, unless we reshape it back to 28*28.

Next, let’s perform the normalisation operation. Note that, initially, each pixel in the image has a value in the range (0 to 255). After the normalisation operation, the value of each pixel lies in the range (0 to 1).

Since the division produces fractional values, each pixel value is now of float type :-

Next, let’s perform the same operations on the Xtest dataset :-

Reshape → We shall be reshaping each image in the test dataset from 2d (each image of size 28 * 28) into 1d (each image of size 784).
Normalisation → We then perform normalisation where each pixel value is converted from range of (0 to 255) to the range of (0 to 1).
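The two pre-processing steps can be sketched in plain Python on a single stand-in image. The article's own code works on whole arrays, presumably with a call along the lines of `Xtrain.reshape(60000, 784)` followed by `astype('float32') / 255`. The helper names `flatten_image` and `normalise`, and the synthetic pixel values, are assumptions made for this illustration and are not real MNIST data.

```python
def flatten_image(image):
    """Turn a 2d image (a list of rows) into a 1d list, row by row."""
    return [pixel for row in image for pixel in row]


def normalise(pixels, max_value=255):
    """Scale integer pixel values from (0 to max_value) to floats in (0 to 1)."""
    return [pixel / max_value for pixel in pixels]


# Stand-in 28 * 28 image whose pixel values cycle through 0..255.
image = [[(row * 28 + col) % 256 for col in range(28)] for row in range(28)]

flat = flatten_image(image)
scaled = normalise(flat)

print(len(image), len(image[0]))  # 28 28
print(len(flat))                  # 784
print(min(scaled), max(scaled))   # 0.0 1.0
```

The flattened list keeps the pixels in row order, so element 28 of `flat` is the first pixel of the second row of the original image.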

Note that the labels in the dataset are categorical in nature, with values in the range (0 to 9). Therefore, these categorical labels are first converted into vectors using the One-Hot-Encoding approach :-

The Keras library provides an out-of-the-box method for this, “to_categorical”.

The above step completes the pre-processing for both the input images and the output categories. Recall that :-
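To make the encoding concrete, here is a minimal plain-Python sketch of what one-hot encoding does to a digit label. It mimics the result of `to_categorical` for 10 classes, but it is not the Keras implementation. The helper name `one_hot` and the sample labels are assumptions for this example.

```python
def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


labels = [5, 0, 9]  # stand-in digit labels, not taken from MNIST
encoded = [one_hot(label) for label in labels]

for label, vector in zip(labels, encoded):
    print(label, vector)
```

Each label becomes a vector of length 10, which matches the 10 neurons of the output layer.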

Xtrain contains the 60,000 images, which we shall be using for training.
Ytrain contains the corresponding labels for 60,000 images in Xtrain dataset.
Xtest contains the 10,000 images, which we shall be using for testing/validation.
Ytest contains the corresponding labels for 10,000 images in Xtest dataset.

Let’s now build the simplest sequential DNN using Keras. We are planning to use 4 layers in this Neural Network.

First Layer :- The first layer receives the input tensor of size 784 (i.e. each input image to our model is of size 784*1). Recall that, earlier above, we had converted (reshaped) all of our input images from 2d to 1d.

Note that only this first layer specifies the input size. It is defined as a dense layer with 50 neurons and the ‘relu’ activation function. A dense layer implies that all neurons of one layer are connected to all neurons of the next layer.

Second Layer :- In this second layer, we have 60 neurons and the ‘relu’ activation function.

Third Layer :- In this third layer, we have 30 neurons and the ‘relu’ activation function.

Fourth Layer :- In this fourth and output layer, we have 10 neurons, and it uses the ‘softmax’ activation function, which turns the outputs into a probability for each of the 10 classes.

Summary of the Model :- Here is how our entire model looks :-
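The four layers described above could be written in Keras roughly as follows. This is a sketch rather than the article's exact code: it assumes TensorFlow 2.x is installed, imports Keras through `tensorflow.keras`, and does not set up the TF V1 compatibility mode mentioned earlier.

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential()
# First layer: 50 neurons, and the only layer that declares the input size (784).
model.add(Dense(50, activation="relu", input_shape=(784,)))
# Second and third layers: 60 and 30 neurons.
model.add(Dense(60, activation="relu"))
model.add(Dense(30, activation="relu"))
# Output layer: one neuron per digit class, softmax gives class probabilities.
model.add(Dense(10, activation="softmax"))

model.summary()
```

Recent Keras versions may print a warning about passing `input_shape` to a layer. The model is still built with the same layer sizes.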

Computation of no. of variables involved in First hidden layer :- The input to the first hidden layer is an image of size (784*1), and the first hidden layer has a total of 50 neurons, so the total number of weights is 784*50. There are also 50 biases, one for each of the 50 neurons in the first hidden layer. Therefore, the total number of parameters in this layer is : (784*50 + 50 == 39250).

Note that we have taken only 50 neurons in the first hidden layer, for the following reason :-

Computation of no. of variables involved in Second hidden layer :- The input to the second hidden layer is of size (50*1), and the second hidden layer has a total of 60 neurons, so the total number of weights is 50*60. There are also 60 biases, one for each of the 60 neurons in the second hidden layer. Therefore, the total number of parameters in this layer is : (50*60 + 60 == 3060).

Computation of no. of variables involved in Third hidden layer :- The input to the third hidden layer is of size (60*1), and the third hidden layer has a total of 30 neurons, so the total number of weights is 60*30. There are also 30 biases, one for each of the 30 neurons in the third hidden layer. Therefore, the total number of parameters in this layer is : (60*30 + 30 == 1830).

Computation of no. of variables involved in Fourth layer :- The input to the fourth layer is of size (30*1), and the fourth layer has a total of 10 neurons, so the total number of weights is 30*10. There are also 10 biases, one for each of the 10 neurons in the fourth layer. Therefore, the total number of parameters in this layer is : (30*10 + 10 == 310).

The overall summary of this deep neural network model can be viewed as follows :-

Note that we have a total of 44,450 parameters (39,250 + 3,060 + 1,830 + 310).
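The per-layer arithmetic above can be checked with a few lines of Python. The layer sizes come from the article. The helper name `dense_params` is an assumption for this sketch.

```python
# Input size followed by the number of neurons in each of the four layers.
layer_sizes = [784, 50, 60, 30, 10]


def dense_params(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [39250, 3060, 1830, 310]
print(total)      # 44450
```

Most of the parameters sit in the first layer, because it is the one connected to all 784 input pixels.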

Configuring Model :- Let’s now configure our recently created DNN model for training.

The cost function is defined as the loss function, and in this case we are using “Categorical Cross-Entropy” as the loss function.

The metric used for evaluating the model is “accuracy”.
We are planning to use the ADAM optimiser, which builds on SGD (Stochastic Gradient Descent), in order to minimise the cost function.
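As a rough illustration of the loss named above, categorical cross-entropy for a single sample can be computed as minus the sum of target * log(predicted) over the classes. The sketch below uses 3 classes instead of 10 to keep the vectors short. The function name, the `eps` guard and the probability values are assumptions for this example and are not outputs of the article's model.

```python
import math


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """Loss for one sample: -sum(target * log(predicted)) over all classes."""
    # `eps` guards against log(0) when a predicted probability is exactly zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


target = [0.0, 1.0, 0.0]        # one-hot label: the true class is index 1
confident = [0.1, 0.8, 0.1]     # prediction that favours the true class
unsure = [0.4, 0.3, 0.3]        # prediction that favours a wrong class

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)

print(loss_confident < loss_unsure)  # the better prediction has the lower loss
```

Because the target is one-hot, only the probability assigned to the true class contributes to the loss, so the optimiser is pushed to raise that probability.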

Usually, below values are adopted for hyper-parameters Delta and Gamma :-

Here is a ready comparative analysis for various types of Optimisers :-

Once the configuration of our model is complete, we can proceed to training.

Training of Model :- Before we proceed to train our model, here is the concept :-

We are using the Gradient-Descent approach here, which can be described as follows :-

Forward pass looks something like :-

Backward pass looks something like :-

Here, we have used the “Mini Batch Gradient Descent Approach”, in order to minimise the LOSS :-

On a general note, following is the understanding :-

We can proceed to train the model using the “MiniBatch-Gradient-Descent approach”, as shown below :-

Let’s understand a few things from the above step :-
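The structure of a mini-batch gradient descent loop can be sketched on a toy problem. The example below fits a straight line y = w * x + b instead of the DNN, so that it runs in plain Python. It is not the article's training code (Keras performs these steps internally), and the function name, data and hyper-parameter values are assumptions chosen for this illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=16, epochs=300, lr=0.1, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction errors on this batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Backward pass: gradients of the batch-averaged squared error.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # Update step: move the parameters against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 3x + 1 (noise-free, for illustration only).
xs = [i / 100 for i in range(100)]
ys = [3 * x + 1 for x in xs]

w, b = minibatch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # close to the true values 3 and 1
```

The parameters are updated once per batch rather than once per epoch, which is what distinguishes the mini-batch approach from full-batch gradient descent.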

Epoch :- One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE. Note that a single epoch (one forward + backward pass over the dataset) is usually not enough, as it may leave the model under-fitted. As the number of epochs increases, the weights in the neural network are updated more times, and the fit usually goes from under-fitting to optimal to over-fitting.
Batch size :- Since we have limited memory, we probably cannot process all the training instances (i.e. 60,000 instances) in one forward pass. So, what is commonly done is to split the training instances into subsets (i.e. batches), perform one pass over the selected subset (i.e. batch), and then optimise the network through back-propagation. The number of training instances within a subset (i.e. batch) is called the batch_size. The higher the batch size, the more memory we need. The batch_size is usually specified as a power of 2.

Example #1: Let’s say we have 2000 training examples. If we divide this dataset into batches of 500, it will take 4 iterations to complete 1 epoch.

Example #2: Coming back to our example above, we have 60,000 training examples and our batch size is 64. Since 60,000 / 64 = 937.5, it takes 938 iterations to complete 1 epoch, where the last batch holds the remaining 32 examples. Note that, in each epoch, all 60,000 instances are processed. Also observe in the training log that the loss generally reduces and the accuracy improves from one epoch to the next.
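The iteration counts in both examples follow from a ceiling division, sketched below. The helper name `iterations_per_epoch` is an assumption for this illustration.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


print(iterations_per_epoch(2000, 500))   # 4
print(iterations_per_epoch(60000, 64))   # 938

# Size of the final, partially filled batch in the MNIST case.
last_batch_size = 60000 - 937 * 64
print(last_batch_size)                   # 32
```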

As the training of the DNN model progresses, both the accuracy and the loss are stored as lists in the history object. The final training loss and accuracy can be obtained from the history object.

The last element of the history object (model_fit.history[‘loss’]) gives us the final loss after the training process.
The last element of the history object (model_fit.history[‘acc’]) gives us the final accuracy after the training process.

Next, we can evaluate the model we have built with the help of the testing dataset of size 10,000. The “evaluate” function gives us the testing loss and the testing accuracy as the output.

From the above snapshot, we can observe that our DNN model gives a testing accuracy of 97.35%. We can try to improve this accuracy by varying the following parameters :-

Changing the number of neurons in each of the hidden layers (the output layer must keep 10 neurons, one per class).
Changing the number of hidden layers as well.
Trying different optimisers, such as RMSProp.

Thanks for reading through this and we shall meet you in another article.

Written by aditya goel