In this blog, we shall study how to generate word encodings using the TensorFlow library.
Question: Can you explain what NLP is ?
Answer → NLP (Natural Language Processing) is the field concerned with making computers understand and process human language, for tasks such as classification, translation and sentiment analysis.
Question: What’s the idea behind using Deep Learning for NLP ?
Answer → Rather than hand-crafting linguistic rules, we encode text numerically and feed it to neural networks, which learn the patterns in the language from the data itself.
Question: Why do we have to encode textual data ?
Answer → Computers understand numbers, not textual characters, so we have to find ways of encoding characters numerically. We could take a character encoding for each character in a set, for example the ASCII values.
Question: What are the disadvantages of using ASCII ?
- For example, consider the word “Arms” and the ASCII value of each of its characters. You might think you could encode the word “Arms” using these values.
- The problem, of course, is that the semantics of the word are not captured in the letters. This can be demonstrated with the word “Mars”, which has a very different meaning but exactly the same letters.
The disadvantage is that two words can be anagrams of each other, and both would be built from exactly the same ASCII codes :-
So it seems that training a neural network on just the letters could be a daunting task.
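The anagram problem can be seen directly in Python, where the built-in `ord` returns a character’s code point (the ASCII value for these letters):

```python
# ASCII codes of each character in "Arms" vs. "Mars".
arms = [ord(c) for c in "Arms"]
mars = [ord(c) for c in "Mars"]
print(arms)  # [65, 114, 109, 115]
print(mars)  # [77, 97, 114, 115]

# Lower-cased, the two words are anagrams: they are built from
# exactly the same character codes, yet mean completely different things.
print(sorted(ord(c) for c in "arms") == sorted(ord(c) for c in "mars"))  # True
```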
Question: What is an alternative way of performing the encoding ?
Answer → Let’s consider whole words instead. We can use a Unique Number Word Encoding, though this too comes with its own disadvantages :-
- We give each word a value and use those values for training a network. For example, consider the sentence “It is a sunny day”. We assign a value to each word, where what that value actually is doesn’t matter.
- The value is the same for the same word every time. A simple encoding for the sentence would be, for example, to give the word “It” the value one, “is” the value two, “a” the value three, “sunny” the value four, and so on. This is how we can start training a neural network based on words.
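A minimal plain-Python sketch of this scheme (the particular value assignment here is just illustrative):

```python
sentence = "It is a sunny day"

# Assign each distinct word the next available integer, in order of
# first appearance; the actual values don't matter, only consistency.
word_values = {}
for word in sentence.split():
    if word not in word_values:
        word_values[word] = len(word_values) + 1

encoded = [word_values[w] for w in sentence.split()]
print(word_values)  # {'It': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5}
print(encoded)      # [1, 2, 3, 4, 5]
```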
Question: Are there libraries available for generating the word encodings ?
Answer → Yes, we do have TensorFlow available as a library in Python, with which we can perform NLP tasks.
Step #1.) Let’s first import the necessary libraries :-
Here is the version of TensorFlow that we shall be using :-
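The import cell isn’t reproduced in this post, so here is a minimal sketch; the exact version string printed will depend on your installation:

```python
import tensorflow as tf

# The Keras preprocessing utilities used in the steps below.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Print the TensorFlow version in use.
print(tf.__version__)
```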
Step #2.) Now, let’s imagine that the following are our training sentences, which we have to encode :-
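The original sentence list isn’t shown here, so the list below is an illustrative stand-in, built around “It is a sunny day” from the discussion above:

```python
# Illustrative training sentences (the blog's exact sentences are assumed).
train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]
```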
Step #3.) In this step, we train the tokenizer on the training sentences supplied above :-
- We first instantiate the Tokenizer, specifying 100 as the maximum number of words to keep. We also specify <oov> as the Out-of-Vocabulary token.
- We next fit the tokenizer on the aforesaid sentences.
- We finally store the word index.
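The three bullet points above can be sketched with the Keras `Tokenizer` (using the illustrative sentences from Step #2):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]

# 1. Instantiate the tokenizer: keep at most 100 words and map
#    unseen words to the special out-of-vocabulary token '<oov>'.
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")

# 2. Fit the tokenizer on the training sentences.
tokenizer.fit_on_texts(train_sentences)

# 3. Store the word index (word -> integer id).
word_index = tokenizer.word_index
```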
Step #4.) Let’s inspect the word index. This is also our dictionary :-
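With the illustrative sentences from Step #2, printing the word index gives something like the output below; the tokenizer lower-cases words and strips punctuation, and the `<oov>` token always receives index 1:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)

# The dictionary: every distinct word mapped to an integer id,
# more frequent words getting the smaller ids.
print(tokenizer.word_index)
# e.g. {'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6, ...}
```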
Step #5.) Here is how we can generate the encoded word sequences :-
Let’s look at one sentence and its corresponding encoded sequence :-
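This step uses `texts_to_sequences`, again sketched with the illustrative sentences from Step #2:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)

# Replace every word with its id from the word index.
sequences = tokenizer.texts_to_sequences(train_sentences)

print(train_sentences[0])  # It is a sunny day
print(sequences[0])        # its encoding, e.g. [2, 3, 4, 6, 5]
```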
Question: Imagine that some new sentences come in. How do they fit into this dictionary ?
Answer → Let’s tokenize the new sentence as well, using our trained tokenizer (the one we fitted in the step above) :-
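A sketch of this, assuming a new sentence containing the unseen word “pleasant”:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)

# A new sentence containing 'pleasant', a word the tokenizer has never seen.
new_sequences = tokenizer.texts_to_sequences(["It is a pleasant day"])

# 'pleasant' is not in the word index, so it is encoded as 1 — the '<oov>' id.
print(new_sequences[0])
```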
Note that we didn’t have the word “pleasant” in our vocabulary (as prepared above), so the word “pleasant” is automatically encoded as 1, which corresponds to <oov>.
Question: Now, the above sequences are unequal in length. Is there some way to make sure that all of them have the same length ?
Answer → Yes, the way to achieve this is Padding :-
- We wish to perform the post type of padding, i.e. if a sequence falls short, zeros are appended after it.
- The max length of a sequence will be 6, beyond which it shall be truncated.
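These two choices map directly onto the arguments of Keras’ `pad_sequences`, sketched here with the illustrative sentences from Step #2:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today ?",
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(train_sentences)
sequences = tokenizer.texts_to_sequences(train_sentences)

# padding='post'    -> zeros are appended AFTER each sequence,
# truncating='post' -> sequences longer than maxlen lose their tail,
# maxlen=6          -> every row of the result has exactly 6 entries.
padded = pad_sequences(sequences, maxlen=6, padding="post", truncating="post")
print(padded)
```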
Thus, we have padded the short sequences with ZERO as the padding value.
Thanks for reading this far. If you liked it, please do clap for this page. We shall see you in the next blog.