Hands-On with NLP using TensorFlow

In this blog, we shall study how to generate word encodings using the TensorFlow library.

Question: Can you explain what NLP is?

Question: What is the idea behind using deep learning for NLP?

Question: Why do we have to encode the textual data?

Answer → Computers understand numbers, not textual characters, so we have to find a way to encode the characters as numbers. One option is to use a character encoding for each character, for example its ASCII value.

Question: What are the disadvantages of using ASCII?

Answer →

  • For example, consider the word “Arms” and the ASCII value of each character in it. You might think that a word like “Arms” could be encoded using these values.

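The disadvantage is that two words can be anagrams of each other (for example, “arms” and “mars”). Both are then made of exactly the same ASCII values and differ only in the order of those values, even though their meanings are unrelated.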
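
To make the anagram problem concrete, here is a small sketch in plain Python. The helper name `ascii_codes` and the lowercase word pair "arms" / "mars" are chosen only for illustration and are not from the original post.

```python
def ascii_codes(word):
    """Return the ASCII value of every character in the word."""
    return [ord(ch) for ch in word]


arms = ascii_codes("arms")
mars = ascii_codes("mars")

print(arms)  # [97, 114, 109, 115]
print(mars)  # [109, 97, 114, 115]

# Same values overall, only the order differs.
same_values = sorted(arms) == sorted(mars)
same_order = arms == mars
print(same_values, same_order)  # True False
```

The two words share every value, so the letters alone say very little about the meaning of the word.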

So, training a neural network on just the letters could be a daunting task.

Question: What is the alternative way of performing the encoding?

Answer → Let’s consider whole words instead. We can use a unique-number word encoding, although this too comes with its own disadvantages.

  • We give each word a value and use those values to train the network. For example, consider the sentence “It is a sunny day”. We assign a value to each word; what that value is does not matter, as long as the same word always gets the same value.
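
The idea of giving each word its own number can be sketched in a few lines of plain Python. The function names `build_word_index` and `encode`, and the second example sentence, are hypothetical and used only to illustrate the approach.

```python
def build_word_index(sentences):
    """Assign the next free integer to every word not seen before."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(sentence, word_index):
    """Replace every word of the sentence by its number."""
    return [word_index[word] for word in sentence.lower().split()]


sentences = ["It is a sunny day", "It is a cloudy day"]
word_index = build_word_index(sentences)

first = encode(sentences[0], word_index)
second = encode(sentences[1], word_index)
print(first)   # [1, 2, 3, 4, 5]
print(second)  # [1, 2, 3, 6, 5]
```

The two encodings differ only at the position where the sentences differ, which is a far more useful signal than raw character codes.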

Question: Are there libraries available to generate the word encodings?

Answer → Yes, TensorFlow is available as a Python library, and we can use it to perform NLP tasks.

Step #1.) Let’s first import the necessary libraries :-

Here is the version of TensorFlow that we shall be using :-
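
The original listing for this step is not reproduced in the text, so the following is only a sketch of what the import and version check might look like. The two aliases are an assumption about which utilities are needed later.

```python
import tensorflow as tf

# Assumed aliases for the two Keras utilities used for tokenizing and padding.
Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

# Print the installed TensorFlow version.
print(tf.__version__)
```

Note that recent TensorFlow documentation marks `Tokenizer` as a legacy utility and recommends `tf.keras.layers.TextVectorization` for new code.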

Step #2.) Now, let’s imagine that the following are the training sentences that we have to encode :-

Step #3.) In this step, we train the tokenizer on the training sentences supplied above :-

  • We first instantiate the Tokenizer, specifying 100 as the maximum number of words to keep. We also specify the out-of-vocabulary token as <oov>.
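
A minimal sketch of this step, assuming the Keras `Tokenizer` class. The training sentences are hypothetical, since the original ones are not shown in the text.

```python
import tensorflow as tf

# Hypothetical training sentences (assumption, not from the original post).
sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

# num_words caps the vocabulary used when encoding (by word frequency);
# oov_token is the placeholder used for words the tokenizer has never seen.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")

# Build the vocabulary from the training sentences.
tokenizer.fit_on_texts(sentences)

# Number of sentences the tokenizer was fitted on.
print(tokenizer.document_count)
```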

Step #4.) Let’s see the word index. This is also our dictionary :-
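
As a sketch, the dictionary can be read from the `word_index` attribute of a fitted tokenizer. The sentences below are hypothetical stand-ins for the original training data.

```python
import tensorflow as tf

# Hypothetical training sentences (assumption, not from the original post).
sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)

# Mapping from each word to its integer index.
word_index = tokenizer.word_index
print(word_index)
```

With the default settings the tokenizer lowercases the text and strips punctuation, so "today?" is stored as "today". The `<oov>` token receives index 1, and the remaining words are numbered by decreasing frequency.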

Step #5.) Here is how we can generate the encoded word sequences :-

Let’s see one sentence and its corresponding encoded sequence :-
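
A sketch of the encoding step using `texts_to_sequences`, again with hypothetical training sentences in place of the original ones.

```python
import tensorflow as tf

# Hypothetical training sentences (assumption, not from the original post).
sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)

# Replace every word by its index from the word index.
sequences = tokenizer.texts_to_sequences(sentences)

# Show one sentence next to its encoded sequence.
print(sentences[0])
print(sequences[0])  # [2, 3, 4, 6, 5]
```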

Question: Imagine that some new sentences come in. How do they fit into this dictionary?

Answer → Let’s tokenize the new sentences as well, using our trained tokenizer (the one we fitted in the step above) :-

Note that we didn’t have the word ‘pleasant’ in our dictionary (as prepared above). So, the word ‘pleasant’ is automatically encoded as 1, which is the index of <oov>.
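
The out-of-vocabulary behaviour can be sketched as follows. Both the training sentences and the new sentences are hypothetical; only the word ‘pleasant’ comes from the text.

```python
import tensorflow as tf

# Hypothetical training sentences (assumption, not from the original post).
sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)

# New sentences containing words the tokenizer has never seen
# ("pleasant" and "be").
new_sentences = [
    "It is a pleasant day",
    "Will it be sunny today?",
]

# Unknown words are replaced by the index of the <oov> token, which is 1.
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(new_sequences)  # [[2, 3, 4, 1, 5], [8, 2, 1, 6, 10]]
```

Without an `oov_token`, unknown words would simply be dropped, and the encoded sentence would be shorter than the original.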

Question: Now, the above sentences are unequal in length. Is there a way to make sure that all the sentences have equal length?

Answer → Yes, the way to achieve this is to use padding :-

  • We wish to perform the post type of padding, i.e. if a sequence falls short of the longest one, zeros are appended after it.

Thus, we have padded the short sequences with zeros as the padding value.
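
A sketch of post-padding with the Keras `pad_sequences` utility, using hypothetical training sentences in place of the original ones.

```python
import tensorflow as tf

# Hypothetical training sentences (assumption, not from the original post).
sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# padding="post" appends zeros after a short sequence until it is as long
# as the longest sequence in the batch.
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")
print(padded)
```

The result is a 2-D NumPy array with one row per sentence. The default is `padding="pre"`, which would place the zeros in front of the sequence instead. Index 0 is never assigned to a word, so it is safe to use as the padding value.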

Thanks for reading till here. If you liked it, please do clap on this page. We shall see you in the next blog.
