Multi-class Classification of Mathematical-Numbers | CNN | Part 4

Below is the simple CNN-based architecture that we shall be building as part of this blog :-

A colourful representation of the CNN would look something like below :-

In case you are landing here directly, it’s highly recommended to have a look at this story first.

In this blog, we shall be using the Keras framework with a TensorFlow backend in order to create a Convolutional Neural Network. We shall be using the MNIST dataset, which consists of 28*28 grayscale images representing hand-written numeric digits from 0 to 9. The dataset is partitioned into 2 parts :-

  • Training dataset of 60,000 images.

  • Testing dataset of 10,000 images.

Question: What is the purpose of Convolutional-Neural-Networks ?

Question: What is the input given to Convolutional-Neural-Networks ?

Question: How do Convolutional-Neural-Networks work ?

Question: What are the other applications, where Convolutional-Neural-Networks can be used ?

Question: Why is Image Classification such a hard area of work ?

Question: How do we actually use “Convolution-Operation” in CNN to extract features ?

Question: What’s the purpose of performing Convolution-Operation (i.e. Applying Kernel(aka filter) on the input-image) ?

Question: Does applying Kernels/Filters on the input-image NOT spoil the image pixels ? What advantage does it provide otherwise ?

Question: How do we perform the blurring at the boundaries ?

Question: What does a simple 3*3 kernel (a.k.a. filter) look like ?

Question: Can a kernel be more complex as well ? And what does a Gaussian Filter look like ?

Question:- Are there some examples of filters (aka kernels) to sharpen the Image ?

Question:- Are there some examples of filters (aka kernels) to blur the Image ?

Question:- Are there some examples of filters (aka kernels) to detect the edges of the Image ?

Let’s get back to our problem statement of multi-class classification, where we would be classifying the input image into 1 amongst 10 classes of decimal digits (0 to 9). First, we import the important libraries that we shall be using throughout our demonstration. We shall be using TensorFlow V1 throughout the setup, therefore we also explicitly disable the TF V2 behaviour.

Next, we initialise the random-number generator & set the seed so that we keep getting the same set of random numbers in every run of the program.
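A minimal sketch of this setup might look like the following (the exact seed value here is an assumption, any fixed value works) :-

```python
import numpy as np
import tensorflow.compat.v1 as tf

# We stick to TF V1 semantics, so explicitly disable the TF V2 behaviour
tf.disable_v2_behavior()

# Fix the seeds so every run of the program draws the same random numbers
np.random.seed(42)       # assumption: seed value not stated in the original
tf.set_random_seed(42)
```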

We would be using the MNIST hand-written dataset, which is already provided as a built-in dataset in the Keras framework. We therefore first load the MNIST dataset and also divide the overall dataset into Training & Testing parts.
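A sketch of loading and splitting the built-in dataset :-

```python
from keras.datasets import mnist

# Keras ships MNIST as a built-in dataset, already split into train/test parts
(Xtrain, Ytrain), (Xtest, Ytest) = mnist.load_data()
```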

Let’s first understand the meaning of the 4 variables created above :- The training set is the subset of the dataset used to train the model.

  • Xtrain is the training data set.

  • Ytrain holds the corresponding labels for the training images.

The test set is a subset of our dataset that we shall be using to test our model, after the model has gone through initial vetting by the validation set.

  • Xtest is the test data set.

  • Ytest holds the corresponding labels for the test images.

Coming back to the MNIST dataset, let’s go ahead and print the shape of all of the above 4 datasets. We have the following configuration :-

  • Training dataset of 60,000 images, stored in Xtrain.

  • Testing dataset of 10,000 images, stored in Xtest.
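A quick sketch of printing these shapes :-

```python
print(Xtrain.shape)  # (60000, 28, 28) -> 60,000 training images of 28*28 pixels
print(Ytrain.shape)  # (60000,)        -> one label per training image
print(Xtest.shape)   # (10000, 28, 28) -> 10,000 testing images
print(Ytest.shape)   # (10000,)        -> one label per testing image
```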

Also, please note that this is a set of hand-written numbers, stored as gray-scale images. Each image is stored as a matrix of size 28 * 28, with pixel values in the range 0 to 255.

Let’s go ahead and visually see some of the images in our training data-set :-
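For instance, a small matplotlib sketch to view the first few images :-

```python
import matplotlib.pyplot as plt

# Display the first 9 training images in a 3*3 grid
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(Xtrain[i], cmap='gray')
    plt.axis('off')
plt.show()
```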

Let’s see the shape of a random image (say we want to investigate the 7321st image) from the training dataset and also view the pixel values of this random image :-
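A sketch of inspecting the 7321st image :-

```python
print(Xtrain[7321].shape)  # (28, 28) -> a plain 2-d matrix, before any reshaping
print(Xtrain[7321])        # raw pixel values, integers in the range 0 to 255
```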

Data Pre-Processing Operation :- Here, we shall be performing 2 operations on the entire dataset (i.e. Training + Test data-set) :-

  • Reshaping.

  • Normalisation.

Step #1.) Note that each gray-scale image usually has a depth value of 1, therefore let’s perform the reshape() operation on the entire training dataset :-
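A sketch of this reshape :-

```python
# Add an explicit depth-channel of 1 to every gray-scale training image:
# (60000, 28, 28) -> (60000, 28, 28, 1)
Xtrain = Xtrain.reshape(Xtrain.shape[0], 28, 28, 1)
```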

Step #2.) Now, let’s view the pixel values of this random image again. Note that the image has now changed to a 3-d representation.

Step #3.) Now, let’s perform the Normalisation operation on this random image. Note that the shape would not change post the normalisation operation :-

Step #4.) Again, let’s re-observe the pixel values of this random image. Note that the pixel values have duly changed to decimal values now :-
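A sketch covering the inspections and the normalisation described in Steps #2 to #4 :-

```python
print(Xtrain[7321].shape)   # (28, 28, 1) -> the image is now 3-d

# Normalisation: scale the 0-255 integer pixels down to the 0.0-1.0 range
Xtrain = Xtrain.astype('float32') / 255.0

print(Xtrain[7321].shape)   # still (28, 28, 1) -> shape is unchanged
print(Xtrain[7321])         # pixel values are now decimals between 0 and 1
```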

Step #5.) Observe that, even after the reshaping + normalisation operations, we can still visually see any image. Let’s go ahead and see our own random image, i.e. the 7321st image :-

Step #6.) Let’s also perform the same aforesaid operations on the test dataset too :-
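A sketch of the same two operations on the test images :-

```python
# The same reshape + normalisation, applied to the test dataset
Xtest = Xtest.reshape(Xtest.shape[0], 28, 28, 1)
Xtest = Xtest.astype('float32') / 255.0
```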

Step #7.) Next, note that the output values in the dataset are categorical in nature, with values in the range 0 to 9. Therefore, this categorical data is first converted into a vector using the One-Hot-Encoding approach. The Keras library provides us the out-of-the-box method “to_categorical” for this.
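A sketch of this conversion, applied to both the training and the testing labels :-

```python
from keras.utils import to_categorical

# One-hot-encode the digit labels, e.g. 2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
Ytrain = to_categorical(Ytrain, 10)
Ytest = to_categorical(Ytest, 10)
```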

Step #8.) Now, let’s view the value stored in the output (Ytrain) for this random image again. Note that the one-hot vector has a 1 at the index corresponding to the digit shown in the 7321st image, and 0 at every other index.

Step #9.) At last, here is what our total dataset looks like :-

Let’s now build the simplest & sequential DNN using Keras. We are planning to use 6 layers in this Neural Network.

Step#1.) The First Layer is a Conv2D layer, which is defined as follows :- This is a Conv2D layer of 32 Kernels (aka filters), each of size 3 * 3, with the RELU activation-function.

  • Conv2D is a layer which performs 2D convolutions, with a filter of size 3*3. There are 32 such filters (aka kernels) that we have specified in the initial go here. In this convolutional operation, a stride size of 1 is being considered. All of these 32 filters are initialised randomly on their own.

Question:- What does a simple convolution, i.e. Cross-Correlation (the operation of applying a Filter/Kernel over the Input Image), look like ?

Question:- How is the above filter(kernel) actually applied on the input-image ?

Question:- What’s the objective of a CNN model ?

Question:- What does the architecture look like for the first layer of this CNN ?

Note: It basically means that on each input-image from the training dataset, we are applying 32 different filters (aka kernels), and therefore, for each input-image, 32 feature-maps shall be generated.

Question:- What does each convolutional-plane in the above architecture represent ?

Question:- Since our original input-image was of dimension (28*28*1), why does the size of each output-feature-map shrink to (26*26*1) ?

Note: The size of the output-feature-map is governed by the formula below. It depends upon the following factors :-

  • Input size of the image (W).
  • Kernel size (K).
  • Padding (P).
  • Stride (S).

O = (W − K + 2*P)/S + 1

In our case, we have W=28, K=3, P=0, S=1; therefore, the computation for the Output-Feature-Map size is : O = (28 − 3 + 2*0)/1 + 1 = 26.

Question:- What would be the immediate impact if we have a Stride value of more than 1 ?

Question:- Can we visualise how our output-feature-maps look, post the first layer’s convolutional operations being applied ?

Note: As we explained above :-

  • Each of the 32 output-feature-maps would be of size (26*26), with depth 1.

Question:- Can we know the computation behind the 320 trainable parameters in Layer 1 ?

  • See here, we have a single input-image (with 1 feature-map, aka depth 1) as Input to the 1st Convolutional layer.
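Concretely, each kernel of size 3*3 operating on a depth-1 input carries 3*3*1 = 9 weights plus 1 bias, and we have 32 such kernels :- (3*3*1 + 1) * 32 = 320 trainable parameters.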

Step#2.) Layer-1’s MaxPooling operation downsamples the feature-map by taking the max of each window. Here, a stride of 2*2 is used, which indicates the step size with which the pooling window moves.

Question:- What’s the importance/objective of applying Sub-Sampling-Operation (aka Pooling) ?

Question:- What are the various types of the Sub-Sampling (a.k.a. Pooling) Operations available ?

Question:- Consider an example of Pooling Operations : given the below output-feature-map of size (6*6), what shall be the output of :-

  • The Max-Pooling-Operation ? The output of the Max-Pooling-Operation would look something like below :-
  • The Mean-Pooling-Operation ? The output of the Mean-Pooling-Operation would look something like below :-
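For a concrete feel, here is a minimal NumPy sketch (the 6*6 values are dummy data) of how 2*2 max- and mean-pooling reduce a (6*6) feature-map to (3*3) :-

```python
import numpy as np

fmap = np.arange(36, dtype=float).reshape(6, 6)   # a dummy 6*6 feature-map

# Split into non-overlapping 2*2 windows, then reduce each window
pooled_max = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))
pooled_mean = fmap.reshape(3, 2, 3, 2).mean(axis=(1, 3))

print(pooled_max)    # (3, 3) output of max-pooling
print(pooled_mean)   # (3, 3) output of mean-pooling
```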

Question:- Can we visualise how our output-feature-maps now look, post the first layer’s Max-Pooling-Operation being applied ?

Note: The size of the output-feature-map is now reduced to (13*13), because of the 2*2 pooling being applied over the (26*26) maps.

Step 3#.) Coming back to the Second Layer, which is again another Conv2D layer, defined as follows :- This is a Conv2D layer of 64 Kernels (aka filters), each of size 3 * 3, with the RELU activation-function.

  • Conv2D is a layer which performs 2D convolutions, with a filter of size 3*3. There are 64 such filters (aka kernels) that we have specified in this layer. Each of these 64 kernels convolves across all 32 incoming feature-maps and produces one output feature-map, so 64 output feature-maps are generated in total.

Question:- Can we visualise how our output-feature-maps look, post the 2nd layer’s convolutional operations being applied ?

Question:- Can we know the computation behind the 18,496 trainable parameters in Layer 2 ?

  • See here, we have 32 incoming feature-maps (aka 32 dimensions, i.e. depth 32) as Input to the 2nd Convolutional layer.
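Concretely, each kernel now spans all 32 incoming feature-maps, so it carries 3*3*32 = 288 weights plus 1 bias, and we have 64 such kernels :- (3*3*32 + 1) * 64 = 18,496 trainable parameters.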

Step#4.) Layer-2’s MaxPooling operation downsamples the feature-map by taking the max of each window. Here, a stride of 2*2 is used, which indicates the step size with which the max-pooling window moves.

Question:- Can we visualise how our output-feature-maps’ size now looks, post the second layer’s Max-Pooling-Operation being applied ?

Step 5#.) Coming back to the Third Layer, which is again another Conv2D layer, defined as follows :- This is a Conv2D layer of 64 Kernels (aka filters), each of size 3 * 3, with the RELU activation-function.

  • Conv2D is a layer which performs 2D convolutions, with a filter of size 3*3. There are again 64 such new filters (aka kernels) specified in this layer too. Each of these 64 kernels convolves across all 64 incoming feature-maps and produces one output feature-map, so 64 output feature-maps are generated in total.

Question:- Can we visualise how our output-feature-maps look, post the 3rd layer’s convolutional operations being applied ?

Question:- Can we know the computation behind the 36,928 trainable parameters post Layer-3’s convolutional operation ?

  • See here, we have 64 incoming feature-maps (aka 64 dimensions, i.e. depth 64) as Input to the 3rd Convolutional layer.
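Concretely, each kernel spans all 64 incoming feature-maps, so it carries 3*3*64 = 576 weights plus 1 bias, and we have 64 such kernels :- (3*3*64 + 1) * 64 = 36,928 trainable parameters.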

Step#6.) Layer-3’s Flattening Operation : The Keras Flatten() class is very important when we have to deal with multi-dimensional inputs such as image datasets. keras.layers.Flatten() flattens the multi-dimensional input tensors into a single dimension, so that we can then pass the data into every single neuron of the fully-connected part of the model effectively.
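In our case, the third convolutional layer’s output of shape (3*3*64) is flattened into a single vector of 3*3*64 = 576 values, which matches the input size of the next Dense layer.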

Step 7#.) Coming back to the Fourth Layer which is a Dense-Layer with 64 neurons and RELU activation-function.

  • This layer also acts as the input layer of the fully-connected part, with an input-tensor of size (576*1), i.e. each flattened input to our fully-connected model is of size 576*1.

Computation of no. of variables involved in First hidden layer (aka 4th layer of the CNN model) :-

  • Input to the first hidden layer (aka the 4th layer of the CNN) is a flattened feature-map of size (576*1).
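Concretely :- 576 inputs * 64 neurons = 36,864 weights, plus 64 biases, giving 36,864 + 64 = 36,928 trainable parameters.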

Step 8#.) Coming back to the Fifth Layer (aka 2nd hidden layer) which is again a Dense-Layer having 32 neurons and RELU activation-function.

  • This layer acts as a hidden layer, with an input-tensor of size (64*1). Recall that the output of the first hidden layer was a vector of size 64*1.

Computation of no. of variables involved in Second hidden layer (aka 5th layer of the CNN model) :-

  • Input to the second hidden layer (aka the 5th layer of the CNN) is a vector of size (64*1).
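Concretely :- 64 inputs * 32 neurons = 2,048 weights, plus 32 biases, giving 2,048 + 32 = 2,080 trainable parameters.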

Step 9#.) Coming back to the Sixth Layer which is also an output Layer with 10 neurons and SOFTMAX activation-function.

  • This layer acts as the output layer, with an input-tensor of size (32*1). Recall that the output of the second hidden layer was a vector of size 32*1.

Computation of no. of variables involved in Output layer (aka 6th layer of the CNN model) :-

  • Input to the 6th layer of the CNN is a vector of size (32*1).
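Concretely :- 32 inputs * 10 neurons = 320 weights, plus 10 biases, giving 320 + 10 = 330 trainable parameters. Summing across all six layers :- 320 + 18,496 + 36,928 + 36,928 + 2,080 + 330 = 95,082 parameters, which matches the total below.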

Note that we have a total of 95,082 parameters involved in this 6-layered CNN model. In order to generalise the model better, it’s always advisable to have a large training dataset.

Finally, our Six-Layered-CNN-model, thus formed so far, can also be visualised as below :-
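Putting all of the above layers together, a minimal sketch of the model definition might look like this (a reconstruction from the layer descriptions and parameter counts above, not the author’s verbatim code) :-

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# Layer 1: 32 filters of 3*3 -> output (26, 26, 32), 320 parameters
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))      # -> (13, 13, 32)
# Layer 2: 64 filters of 3*3 -> output (11, 11, 64), 18,496 parameters
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))      # -> (5, 5, 64)
# Layer 3: 64 filters of 3*3 -> output (3, 3, 64), 36,928 parameters
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Flatten())                           # -> 576 values
model.add(Dense(64, activation='relu'))        # Layer 4: 36,928 parameters
model.add(Dense(32, activation='relu'))        # Layer 5: 2,080 parameters
model.add(Dense(10, activation='softmax'))     # Layer 6: 330 parameters

model.summary()  # should report 95,082 trainable parameters in total
```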

Step 10#.) Configuring Model :- Let’s now configure our recently created DNN model for training.

  • The Cost function is defined via the Loss-Function, and in this case, we are using “Categorical Cross-Entropy” as the Loss function (see the compile sketch below).
  • The metric used for evaluating the model is “accuracy”.
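A minimal sketch of this configuration step (the choice of the Adam optimiser here is an assumption; the loss and metric come from the discussion above) :-

```python
model.compile(optimizer='adam',                   # assumption: optimiser not stated
              loss='categorical_crossentropy',    # for 10-way one-hot labels
              metrics=['accuracy'])
```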

Usually, the below values are adopted for the hyper-parameters Delta and Gamma :-

Here is a ready comparative analysis for various types of Optimisers :-

Question: What should be our choice of Cost/Loss/Divergence Function ?

Once the configuration of our model is complete, we can proceed for Training.

Step 11#.) Training of Model :- Given the enormous (60K) dataset that we have got, we shall not be using “Full-Batch-Gradient-Descent” because of the expensive computation involved there; therefore we are planning to use the “Mini-Batch-Gradient-Descent” approach in order to minimise the LOSS.
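A minimal sketch of the training call (batch size 64 and the 0.1 validation split are discussed below; the epoch count here is an assumption) :-

```python
model_fit = model.fit(Xtrain, Ytrain,
                      batch_size=64,        # mini-batches of 64 images each
                      epochs=10,            # assumption: exact count not stated
                      validation_split=0.1) # hold out 10% for validation
```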

Let’s understand a few things from the above step :-

  • Epoch :- One Epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE. We usually need many epochs in order to arrive at an optimal learning curve. Here, in this example, in one full epoch (i.e. forward + backward pass), all 938 batches are passed. The weights are iteratively tuned, the loss is gradually reduced, and the model accuracy is gradually improved.

Question:- Why do we need more number of EPOCHS ?

One epoch is not enough, as it may lead to under-fitting of the curve. As the number of epochs increases, the weights are adjusted more times across the entire neural network, and the curve usually goes from under-fitting to optimal to over-fitting.

  • Batch size: Since we have limited memory, we probably shall not be able to process all the training instances (i.e. 60,000 instances) in one forward pass. So what is commonly done is splitting up the training instances into subsets (i.e. batches), performing one pass over the selected subset (i.e. batch), and then optimising the network through back-propagation. So we have divided our Training-Data-Set into small chunks of 64 each. Therefore, there are 60,000 / 64 ≈ 938 batches formed in this process.

Question: What’s suggestible regarding Batch-Size ?

The higher the batch size, the more memory space we would need. The batch_size is usually specified as a power of 2.

Example #1: Let’s say we have 2000 training examples that we are going to use; if we divide the dataset of 2000 example-instances into batches of 500, then it will take 4 iterations to complete 1 epoch.

Example #2: Coming back to our above example: we have 60,000 training examples and our batch size is 64; therefore, it will take us 938 iterations to complete 1 epoch. Note that in each epoch, all 60,000 instances shall be processed for sure. Also observe that after every epoch, the loss reduces and the accuracy improves.

  • Validation Split of 0.1: It means that out of the total training data, 10% is set aside as the cross-validation set. Its value is a float between 0 and 1 and stands for the fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.

Question:- So, we know that Layer 1 had 32 filters/kernels overall, and during the course of training our model, our goal was basically to learn these kernels themselves. Post the training process, can we know what weights have been learnt by our CNN model for the 31st kernel of Layer 1 ?
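A minimal sketch, assuming the Conv2D layer sits at index 0 of the model :-

```python
# get_weights() returns [kernels, biases]; kernels has shape (3, 3, 1, 32)
kernels, biases = model.layers[0].get_weights()
print(kernels[:, :, 0, 30])   # index 30 -> the 31st learnt kernel, as a 3*3 matrix
```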

Step 12#.) Question: Can we visualise the Validation-Loss as the epochs progress ? Yes, we can also plot the graph for Validation-Loss. The below graph signifies that :

As the no. of epochs progresses, validation-loss-value also decreases. Note that :-

  • Post completion of the 1st EPOCH, the net validation loss of our model was 6.34%.

Step 13#.) Question: Can we visualise the Validation-Accuracy as the epochs progress ? Yes, we can also plot the graph for Validation-Accuracy. The below graph signifies that, as the no. of epochs progresses, the validation-accuracy also increases. Note that :-

  • Post completion of the 1st EPOCH, the net total accuracy of our model was 98.02%.
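A sketch of plotting both validation curves from the history object (older Keras/TF-V1 versions store accuracy under the key 'val_acc' rather than 'val_accuracy') :-

```python
import matplotlib.pyplot as plt

# Per-epoch validation metrics recorded by model.fit()
plt.plot(model_fit.history['val_loss'], label='validation loss')
plt.plot(model_fit.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()
```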

Step 14#.) As the training of the CNN model progresses, both the accuracy and the loss are stored as lists in the history object. The final Training Loss and Accuracy can be obtained from this history object.

  • The last element of the history object (model_fit.history['loss']) gives us the final loss after the training process.
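For example (the accuracy key may be 'accuracy' instead of 'acc', depending on the Keras version) :-

```python
print(model_fit.history['loss'][-1])   # final training loss
print(model_fit.history['acc'][-1])    # final training accuracy
```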

From the above snapshot, we can observe that our vanilla CNN model is giving a training accuracy of 99.87%, and the overall training loss is 0.0039.

Step 15#.) Recall that, in the beginning, we had divided the entire MNIST dataset into Training and Testing datasets. We had 10K records in our Testing-Dataset. Can we see the same please ?
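A one-line sketch :-

```python
print(Xtest.shape)   # expected: (10000, 28, 28, 1), after the reshape above
```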

The above output indicates that Xtest, the testing dataset, has got 10,000 images, and each image is of dimension 28*28 with depth 1.

Step 16#.) Can we see a sample random image from the testing dataset (Xtest) and its corresponding label in the test labels (Ytest) ?

  • Note that, earlier above, we had converted Ytest as well into the vector representation, using the one-hot-encoding approach. Therefore, the corresponding label for the sample random image appears in the vector-format representation.
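A sketch, using the 7349th test image that we revisit below :-

```python
plt.imshow(Xtest[7349].reshape(28, 28), cmap='gray')
plt.show()
print(Ytest[7349])   # the label, in one-hot vector format
```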

Step 17#.) As our CNN model is all ready now, let’s proceed to perform our long-desired task of predicting the numbers.

Question:- We had 10K images in our testing dataset. Are we sure that we have got predictions for all 10K images ?
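A sketch confirming this :-

```python
# Predict over the full test set in one go
predictions = model.predict(Xtest)
print(predictions.shape)   # expected: (10000, 10) -> one softmax vector per image
```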

Question:- Can you show us, for example, what result is being predicted by our respected CNN model for the same random image, i.e. the 7349th image, for which we saw the actual results above ?
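For example :-

```python
print(predictions[7349])   # the 10-way softmax output for the 7349th test image
```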

Note that this is again a vector-formatted result, and the value at index 2 is very high, whereas the values at all the other 9 indices are very small.

  • The high value at index 2 (i.e. the 3rd position) indicates that our model is very confident about this particular number being a 2.

Step 18#.) This one-hot-encoded vector is getting difficult for me to visualise. Can you kindly convert this back to plain old mathematical numbers AND also show the result being predicted by our respected CNN model for the same random image, i.e. the 7349th image ?
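A sketch using NumPy’s argmax :-

```python
import numpy as np

# argmax picks the index of the highest softmax score, i.e. the predicted digit
print(np.argmax(predictions[7349]))   # expected: 2
```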

Step 19#.) The output of Step 18 looks so cool. Alright, let’s convert our entire Ytest dataset back to plain old mathematical numbers AND also show the actual class for the same random image, i.e. the 7349th image.
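A sketch (the variable names here are illustrative assumptions) :-

```python
# Convert the one-hot-encoded labels and predictions back to plain digits
Ytest_classes = np.argmax(Ytest, axis=1)
predicted_classes = np.argmax(predictions, axis=1)
print(Ytest_classes[7349])   # the actual digit for the 7349th test image
```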

Step 20#.) Alright, we were originally doing the Multi-Class-Classification of mathematical numbers. As we now have both the actual & predicted classes for Ytest, can we visualise the classification report please ?
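A sketch using scikit-learn :-

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1-score and support for the 10 digits
print(classification_report(Ytest_classes, predicted_classes))
```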

Question:- For Number 7, what does a precision of 0.98 indicate ?

Precision → Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the numbers that our model labeled as 7, how many are actually 7 ? High precision relates to a low false-positive rate. We have got 0.98 precision, which is pretty good. Precision = TP / (TP + FP). In other words :-

When our model predicts that any number is 7, it is correct around 98% of the time.

Question:- For Number 8, what does a recall of 0.98 indicate ?

Recall → Recall is the ratio TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. Recall is intuitively the ability of the classifier to find all the positive samples. In other words :-

For all the numbers 8 in our Test-Set, recall tells us how many we correctly identified as 8.

F1-score → The F1-score gives us the harmonic mean of precision and recall. The score corresponding to every class tells us the accuracy of the classifier model in classifying the data points of that particular class, compared to all other classes.

Support → Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.

Step 21#.) Can we kindly visualise the Confusion-Matrix for this Multi-Class-Classification please ?

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.
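A sketch, again using scikit-learn :-

```python
from sklearn.metrics import confusion_matrix

# Rows = actual digits, columns = predicted digits, for all 10 classes
print(confusion_matrix(Ytest_classes, predicted_classes))
```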

Step 22#.) Next, we can evaluate our thus-built model with the help of the testing dataset of size 10,000. The “evaluate” function gives us the testing loss and testing accuracy as output.
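A minimal sketch :-

```python
# evaluate() returns the loss followed by the configured metrics
test_loss, test_accuracy = model.evaluate(Xtest, Ytest)
print(test_loss, test_accuracy)
```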

From the above snapshot, we can observe that our vanilla CNN model is giving a testing accuracy of 99.06%, and the overall testing loss is 0.052.

We can try to improve this accuracy by playing with the following parameters :-

  • By changing the number of Kernels being applied in the initial layers.

Question:- Having understood all of the concepts above, what’s the ultimate requirement for being successful in getting good outputs from a CNN ?

Thanks for reading through this and we shall meet you in another article.
