Hands-On NLP with TensorFlow

In this blog, we shall study how to generate word encodings using the TensorFlow library.

Question: What is NLP?

Question: What is the idea behind using deep learning for NLP?

Question: Why do we have to encode textual data?

Answer → Computers understand numbers, not text characters, so we have to find a way to encode the characters as numbers. One option is to use an existing character encoding for each character, for example its ASCII value.

Question: What are the disadvantages of using ASCII?

• For example, consider the word “Arms” and the ASCII value of each character in it. You might think you could encode a word like “Arms” using these values.
• But the problem with this, of course, is that the semantics of the word are not encoded in the letters. This can be demonstrated with the word “Mars”, which has a very different meaning but is made of exactly the same letters.

The disadvantage is that two words can be anagrams of each other. Ignoring letter case, both are then made up of exactly the same ASCII codes, only in a different order.
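The anagram problem can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `ascii_codes` is made up for illustration, and both words are written in lowercase so that the letter case does not get in the way of the comparison.

```python
def ascii_codes(word):
    """Return the ASCII code of every character in the word."""
    return [ord(ch) for ch in word]


arms = ascii_codes("arms")
mars = ascii_codes("mars")

print(arms)  # [97, 114, 109, 115]
print(mars)  # [109, 97, 114, 115]

# Both words use the same set of codes, just in a different order,
# so the codes by themselves carry no hint of the different meanings.
print(sorted(arms) == sorted(mars))  # True
```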

So, training a neural network on just the letters could be a daunting task.

Question: What is the alternative way of performing the encoding?

Answer → Let’s consider whole words instead. We can give each word a unique number (this too comes with its own disadvantages), as follows:

• We’ll give each word a value and use those values to train a network. For example, consider the sentence “It is a sunny day”. We assign a value to each word; what that value actually is doesn’t matter.
• The value is the same for the same word every time. So, a simple encoding for the sentence would be to give the word “It” the value one, “is” the value two, “a” the value three, “sunny” the value four, and so on. This is how we can start training a neural network based on words.
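The idea in the two points above can be sketched in plain Python before bringing in any library. The helper `build_word_index` is a hypothetical name used only for this illustration, and lowercasing the words is an assumption made so that “It” and “it” share one value.

```python
def build_word_index(sentence):
    """Give every distinct word a number, in order of first appearance."""
    word_index = {}
    for word in sentence.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index


sentence = "It is a sunny day"
word_index = build_word_index(sentence)

# Replace every word in the sentence by its number.
encoded = [word_index[word] for word in sentence.lower().split()]

print(word_index)  # {'it': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5}
print(encoded)     # [1, 2, 3, 4, 5]
```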

Question: Are there libraries available to generate the word encodings?

Answer → Yes, TensorFlow is available as a Python library, and we can use it to perform NLP tasks.

Step #1.) Let’s first import the necessary libraries.

We also check the version of TensorFlow that we shall be using.

Step #2.) Now, let’s define the training sentences that we have to encode.

Step #3.) In this step, we train the tokenizer on the training sentences supplied above:

• We first instantiate the Tokenizer, specifying 100 as the maximum number of words to keep. We also specify the out-of-vocabulary token as <oov>.
• We next fit the tokenizer on these sentences.
• We finally store the word index.
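Steps 1 to 3 can be sketched as below. The training sentences here are hypothetical stand-ins chosen for illustration, not the ones from the original post. The sketch also assumes a TensorFlow 2.x install in which the legacy `tf.keras.preprocessing.text.Tokenizer` class is still available (it is marked as deprecated in recent releases).

```python
import tensorflow as tf

# Step 1: check which TensorFlow version is installed.
print(tf.__version__)

# Step 2: hypothetical training sentences, used only for illustration.
train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

# Step 3: keep at most 100 words and reserve "<oov>" for unseen words.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)

# The word index maps every word seen during fitting to an integer.
word_index = tokenizer.word_index
print(word_index)
```

With its default settings the tokenizer lowercases the text and strips punctuation, so “It” and “it” share one entry and “today?” is stored as “today”.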

Step #4.) Let’s look at the word index. This is also our dictionary.

Step #5.) Next, we generate the encoded form of the word sequences.

Let’s look at one sentence and its corresponding encoded sequence.
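A minimal sketch of this step, again using hypothetical training sentences rather than the ones from the original post, calls `texts_to_sequences` on the fitted tokenizer:

```python
import tensorflow as tf

# Hypothetical training sentences, used only for illustration.
train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# Replace every word of every sentence by its index in the dictionary.
sequences = tokenizer.texts_to_sequences(train_sentences)

# Show one sentence next to its encoded sequence.
print(train_sentences[0])
print(sequences[0])
```

Each sentence becomes a list with one integer per word, so sentences of different lengths give lists of different lengths.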

Question: Imagine that some new sentences come in. How do they fit into this dictionary?

Answer → Let’s tokenize the new sentence as well, using the tokenizer that we trained in the step above.

Note that we didn’t have the word ‘pleasant’ in our dictionary (as prepared above). So the word ‘pleasant’ is automatically encoded as 1, which is the index of <oov>.
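The out-of-vocabulary behaviour can be reproduced with a small sketch. Both the training sentences and the new sentence below are hypothetical; the only detail taken from the text above is that the new sentence contains the unseen word “pleasant”.

```python
import tensorflow as tf

# Hypothetical training sentences; none of them contains "pleasant".
train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# Hypothetical new sentence with one word the tokenizer has never seen.
new_sentences = ["It is a pleasant day"]
new_sequences = tokenizer.texts_to_sequences(new_sentences)

print(word_index["<oov>"])  # 1
print(new_sequences[0])     # the position of "pleasant" holds 1
```

Without an `oov_token`, the tokenizer would simply drop unknown words, which would make the encoded sentence shorter than the original.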

Question: The sentences above are unequal in length. Is there some way to make sure that all the sentences have the same length?

Answer → Yes, the way to achieve this is padding:

• We wish to perform post-padding, i.e. if a sequence falls short of the required length, padding values are appended after it.
• The maximum length of a sequence is 6; anything longer than that is truncated.
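The two settings above can be sketched with `pad_sequences`. The integer sequences are hypothetical stand-ins for tokenizer output. Passing `truncating="post"` is an assumption, since the text does not say which end is cut; the Keras default is `"pre"`, which drops tokens from the start of a long sequence.

```python
import tensorflow as tf

# Hypothetical encoded sentences of different lengths.
sequences = [
    [2, 3, 4, 6, 5],           # 5 tokens: needs one padding value
    [8, 2, 9, 10],             # 4 tokens: needs two padding values
    [2, 3, 4, 6, 5, 7, 8, 9],  # 8 tokens: longer than maxlen
]

padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences,
    padding="post",     # append zeros after the sequence
    truncating="post",  # assumption: cut extra tokens from the end
    maxlen=6,
)

print(padded)
# [[ 2  3  4  6  5  0]
#  [ 8  2  9 10  0  0]
#  [ 2  3  4  6  5  7]]
```

The result is a NumPy array with one row per sentence and exactly 6 columns.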

Thus, we have padded the short sequences with zeros as the padding value.