NLP for Machine Learning | Part-1

aditya goel
7 min read · Mar 5, 2023


In this blog, we shall learn about the following aspects :-

  • Installation of NLTK Library on our Machine.
  • Verification of NLTK Library.
  • Downloading the NLTK data packages.
  • Reading a Semi-Structured File in Python.
  • Reading a Semi-Structured File through Pandas.
  • Exploratory Data Analysis.
  • Flow of a simple Machine-Learning Pipeline.

Question: How to install NLTK Library on our own machine ?

Question: How to verify whether NLTK has been installed ?
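The original commands for these two questions are not shown here, so what follows is only a minimal sketch. Installing usually comes down to running `pip install nltk` in a terminal. To verify, we can ask Python whether the package is importable and which version is present. The helper name `nltk_status` is our own and not part of NLTK.

```python
import importlib.metadata
import importlib.util


def nltk_status():
    """Return the installed NLTK version string, or None if NLTK is not importable."""
    if importlib.util.find_spec("nltk") is None:
        return None
    try:
        return importlib.metadata.version("nltk")
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no package metadata was found for it.
        return "unknown"


version = nltk_status()
if version is None:
    print("NLTK is not installed; try: pip install nltk")
else:
    print(f"NLTK is installed (version {version})")
```

A quicker interactive check is simply `import nltk` in a Python shell: if no ImportError is raised, the library is available.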

Question: How to download the NLTK data packages ?
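NLTK ships its corpora and models separately from the library, and they are fetched with `nltk.download`. The sketch below wraps that call; the wrapper name `download_nltk_packages`, its `downloader` parameter and the choice of the `stopwords` package are our own assumptions, picked only for illustration.

```python
def download_nltk_packages(packages, downloader=None):
    """Download each named NLTK data package and report success per package.

    `downloader` defaults to nltk.download; it is a parameter only so that the
    function can be exercised without network access.
    """
    if downloader is None:
        import nltk  # imported lazily, so defining this function does not need NLTK

        downloader = nltk.download
    return {name: bool(downloader(name)) for name in packages}


# Example call (needs NLTK installed and network access), left commented out:
# download_nltk_packages(["stopwords"])
```

Calling `nltk.download()` with no arguments opens an interactive downloader instead, which is handy for browsing what is available.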

Question: How to read a semi-structured file ?

Answer →

Step #1.) We read the first 500 characters of the file “SMSSpamCollection.tsv” :-

We can see that it is just a block of text, i.e. a single String. Each record ends with a “\n” character, and within a record the label and the message are separated by a “\t” character.
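A minimal sketch of this step is shown below. So that it runs anywhere, it first writes three made-up rows to a temporary file of the same name; with the real dataset, only the second `open` call is needed. The variable names are our own.

```python
import os
import tempfile

# Made-up stand-in rows in the same label<TAB>text layout; not the real dataset.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "SMSSpamCollection.tsv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # The actual reading step: load the whole file as one string.
    with open(path, encoding="utf-8") as handle:
        raw_data = handle.read()

# repr() makes the \t and \n characters visible instead of rendering them.
print(repr(raw_data[:500]))
```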

Step #2.) Let’s first replace the \n’s with \t’s, which allows us to split the whole string into a list on the basis of ‘\t’.

  • The labels thus end up at positions 0, 2, 4, 6 and so on.
  • The texts thus end up at positions 1, 3, 5, 7 and so on.

This gives us some sort of structure.

Step #3.) Now, we separate the labels and the texts into two lists :-

  • We pick up the zeroth element and then every 2nd element after it from the original list, and put them into the labelList.
  • Similarly, we pick up the first element and then every 2nd element after it from the original list, and put them into the textList.
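Steps #2 and #3 can be sketched as follows, using three made-up rows in place of the real file. The names `labelList` and `textList` follow the text above, while `raw_data` and `parsed_data` are our own.

```python
# Made-up stand-in for the file contents; not the real dataset.
raw_data = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

# Step 2: turn every newline into a tab, then split on tabs.
parsed_data = raw_data.replace("\n", "\t").split("\t")

# Step 3: even positions hold the labels, odd positions hold the message texts.
labelList = parsed_data[0::2]
textList = parsed_data[1::2]

print(labelList)  # ['ham', 'spam', 'ham', '']
print(textList)   # ['See you at lunch', 'Win a free prize now', 'Call me later']
```

The slice `[0::2]` reads as "start at index 0 and take every 2nd element". The trailing empty string in the labelList comes from the newline at the very end of the data.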

Step #4.) Next, let’s study our data :-

  • Note that there are 5571 records in the labelList.
  • However, there are only 5570 records in the textList.

The extra record is the last element of the labelList, which is an empty string left over from the “\n” at the very end of the file, so we simply leave it behind.

Step #5.) Now, let’s convert this list-form data into a pandas DataFrame :-

Note that we explicitly leave behind the last element of the labelList. It originally has 5571 elements, and we capture everything except the last one, so that both columns have the same length. Now, let’s see this data-frame :-

With the data in a DataFrame, we now have a clean structure. Earlier we had the data as one block of text, and now we have it in a far more readable, tabular format.
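Steps #4 and #5 can be sketched as below, again on three made-up rows, so the lengths are 4 and 3 rather than 5571 and 5570. The column names `label` and `text` and the variable name `fullCorpus` are our assumptions.

```python
import pandas as pd

# Made-up stand-in for the file contents; not the real dataset.
raw_data = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"
parsed_data = raw_data.replace("\n", "\t").split("\t")
labelList = parsed_data[0::2]
textList = parsed_data[1::2]

# Step 4: the labelList is one element longer, and that extra element is empty.
print(len(labelList), len(textList))  # 4 3
print(repr(labelList[-1]))            # ''

# Step 5: drop the trailing empty label so that both columns have equal length.
fullCorpus = pd.DataFrame({"label": labelList[:-1], "text": textList})
print(fullCorpus.head())
```

Without the `[:-1]` slice, pandas would refuse to build the DataFrame, because the two columns would have different lengths.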

Question: Can we read a semi-structured file directly using Pandas ?

Answer → Yes, we can read the file directly through Pandas as well :-

Step #1.) Let’s read the semi-structured file directly into a Pandas DataFrame. Here is how the same can be achieved :-

Step #2.) Now, we need to assign the column-names in this data-frame :
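Both steps can be sketched together with `pd.read_csv`, telling it that the separator is a tab and that the file has no header row. The sample rows, the `StringIO` buffer and the column names `label` and `text` are our assumptions. With the real file, we would pass its path in place of the buffer.

```python
import io

import pandas as pd

# Made-up stand-in for the file contents; not the real dataset.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

# Step 1: read tab-separated data that has no header row.
dataset = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# Step 2: the columns are numbered 0 and 1 by default, so give them names.
dataset.columns = ["label", "text"]
print(dataset.head())
```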

Question: How many entries have the label as HAM and how many entries have the label as SPAM ?

Answer → We can count the number of entries for each label, i.e. how many are HAM and how many are SPAM :-
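One way to get these counts is `value_counts` on the label column. The sketch below uses a small made-up DataFrame with assumed column names `label` and `text`.

```python
import pandas as pd

# Made-up rows; the real dataset has thousands of entries.
dataset = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham"],
        "text": ["See you at lunch", "Win a free prize now", "Call me later"],
    }
)

# Count how many rows carry each label.
label_counts = dataset["label"].value_counts()
print(label_counts)
```

Passing `normalize=True` to `value_counts` returns fractions instead of raw counts, which is useful for spotting class imbalance.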

Question: How can we check, if some entry is Null or Empty ?

Answer → Luckily, we don’t have any entries with Null values :-

  • We can check whether there are any entries where the label is NULL :-
  • Similarly, we can check whether there are any entries where the text is NULL :-
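Both checks can be sketched with `isnull().sum()`, which counts the missing values per column. Since the question also mentions empty entries, the sketch adds a check for blank strings. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up rows without missing values, mirroring the situation described above.
dataset = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham"],
        "text": ["See you at lunch", "Win a free prize now", "Call me later"],
    }
)

# Number of NULL (NaN/None) values in each column.
null_counts = dataset.isnull().sum()
print(null_counts)

# Number of texts that are empty or contain only white space.
empty_texts = int((dataset["text"].str.strip() == "").sum())
print(empty_texts)
```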

Question: What does a general Machine-Learning pipeline look like ?

Answer → The following are the stages of the Machine-Learning pipeline :-

Step #1.) We first obtain the raw text, which the Model doesn’t understand at all.

  • It’s important to note that at this stage, the computer has no idea what it’s looking at. All it sees is a collection of characters.
  • It doesn’t know the word ham from the word spam. The characters mean nothing. It doesn’t even know a space from a number or a letter. They’re all the same.

Step #2.) The next important thing that we need to do is to tokenize our text. We can do so by splitting on white space or special characters.

  • Example → The sentence “I am learning NLP” would be split into a list with four tokens: I, am, learning and NLP.
  • So, instead of seeing one long string of characters as before, Python will now see a list with four distinct tokens, so it knows what to look at.
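A minimal sketch of this step splits on runs of non-word characters with the standard `re` module. The function name `tokenize` is our own. NLTK also offers dedicated tokenizers, which handle punctuation and contractions more carefully.

```python
import re


def tokenize(text):
    """Split on runs of non-word characters and drop any empty strings."""
    return [token for token in re.split(r"\W+", text) if token]


tokens = tokenize("I am learning NLP")
print(tokens)  # ['I', 'am', 'learning', 'NLP']
```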

Step #3.1) Removal of Stop-Words →

Question → What are Stop-Words, and what value do they add ?

Answer → Note that some words are more important than others. For instance, the words the, and, of, or appear very frequently but don’t really offer much information about the sentence itself. These are what’s called stop words.

Question → What’s the treatment that we give to Stop-Words ?

Answer → Typically, we remove the Stop-Words, to allow Python to really focus on the most pivotal words in our sentence.

  • For example → From the tokens “I, am, learning, NLP”, once we remove the stop words, we are left with just learning and NLP.
  • This still gets across the most important point of the sentence, but now we are only looking at half the number of tokens.
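The example above can be sketched with a tiny hand-picked stop-word set. This set is our own assumption for illustration. NLTK provides a much longer English list in its `stopwords` corpus.

```python
# Tiny hand-picked stop-word set, for illustration only.
STOP_WORDS = {"i", "am", "the", "and", "of", "or"}


def remove_stop_words(tokens):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


filtered = remove_stop_words(["I", "am", "learning", "NLP"])
print(filtered)  # ['learning', 'NLP']
```

Lower-casing before the lookup matters here, because stop-word lists are normally stored in lower case while the tokens keep their original casing.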

Step #3.2) Applying Stemming

Question → What is the meaning of Stemming ?

Answer → Stemming reduces each word to its stem, which helps Python realize that words like learn, learned and learning all have basically the same meaning.
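To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy of our own and not the algorithm a real library uses. In practice, a library stemmer such as NLTK’s `PorterStemmer` would be the usual choice.

```python
# Suffixes the toy stemmer knows about, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


stems = [toy_stem(word) for word in ["learn", "learned", "learning"]]
print(stems)  # ['learn', 'learn', 'learn']
```

All three variants collapse to the same token, so the later steps treat them as one word instead of three.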

Question → Is Stemming really very significant ?

Answer → In a small sample, stemming may not help much. But when we suddenly have a million text messages and a corpus of 150,000 words, every group of word variants that we can collapse into a single stem lets Python focus on the most pivotal words, and that can really make a big difference.

Question → What’s the advantage of Stemming ?

Answer → Once Stemming has been applied, Python sees a compact list of the tokens we care about, i.e. the keywords that we think are useful for building a machine learning model.

Step #4.) Vectorisation

Question → What’s the need for Vectorisation ?

Answer → Even though Python now knows what you care about, it still only sees characters. It doesn’t know what learning or NLP even means.

Question → What’s the meaning of Vectorisation ?

Answer → We have to convert the words/characters into a numeric-format that a machine learning algorithm can actually ingest and use to build a model. This is a process called vectorizing.

Question → How is Vectorisation done ?

Answer → Vectorizing is the process of converting the text to a numeric representation of that text. Here, we essentially count the occurrences of each word in each text message, using a matrix with one row per text message and one column per word.
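The counting described above can be sketched in plain Python as follows. The function name `count_vectorize` and the sample tokens are our own. In a real project, scikit-learn’s `CountVectorizer` is a common ready-made option.

```python
def count_vectorize(messages):
    """Return (vocabulary, matrix): one row per message, one column per word."""
    vocabulary = sorted({token for tokens in messages for token in tokens})
    column_of = {word: index for index, word in enumerate(vocabulary)}
    matrix = []
    for tokens in messages:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[column_of[token]] += 1  # count each occurrence of the word
        matrix.append(row)
    return vocabulary, matrix


# Made-up, already tokenized messages.
messages = [["free", "prize", "free"], ["learn", "nlp"], ["free", "nlp"]]
vocabulary, matrix = count_vectorize(messages)

print(vocabulary)  # ['free', 'learn', 'nlp', 'prize']
for row in matrix:
    print(row)
# [2, 0, 0, 1]
# [0, 1, 1, 0]
# [1, 0, 1, 0]
```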

Step #5.) ML Model Training

  • Once we have this numeric matrix, we can now fit our actual machine learning model by feeding in our vectorized data along with our spam or ham labels.
  • The model then learns the relationships between the words and the labels, so that it can make predictions on text messages that it has never seen before and determine whether they are spam or not.
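This blog does not name a particular model, so purely as an illustration, here is a tiny multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It works on token lists directly, which carries the same information as the rows of the count matrix. All names and the training messages are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(token_lists, labels):
    """Count messages per label and word occurrences per label."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_messages)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up, already tokenized training messages with their labels.
train_tokens = [
    ["free", "prize", "now"],
    ["win", "free", "cash"],
    ["see", "you", "lunch"],
    ["call", "me", "later"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_tokens, train_labels)
print(predict(model, ["free", "cash", "prize"]))  # spam
print(predict(model, ["lunch", "later"]))         # ham
```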

Question → How many ML models do we usually test ?

Answer → There are various types of machine learning models. It will be up to us to select a few of them to try out. We shall tailor our choices based on :-

  • The type of input data we’re giving to the Model.
  • What we’re trying to predict.
  • How much compute power we have, and similar constraints.

We’ll typically test out a number of what’s called candidate models before selecting which model performs best.

  • Once we select the best model, we evaluate it on a holdout test set. This is a set of data that we set aside at the very beginning, purely for testing our final model, to see how it performs on data that it has never seen or touched before.
  • If it passes this final test, then we’ll prepare to implement it within whatever framework we’re working with.
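The flow above can be sketched as follows. The two "candidate models" are hypothetical hand-written rules standing in for real trained models, and the records are made up. For brevity, the candidates are compared on the training portion, whereas a real workflow would typically use a separate validation split or cross-validation.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and set aside a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(model, records):
    """Fraction of (text, label) records for which model(text) equals the label."""
    return sum(model(text) == label for text, label in records) / len(records)


# Made-up (text, label) records.
records = [
    ("Free prize waiting for you", "spam"),
    ("Claim your free ticket", "spam"),
    ("Free cash offer", "spam"),
    ("Get a free ringtone", "spam"),
    ("You won a free holiday", "spam"),
    ("See you at lunch", "ham"),
    ("Call me later", "ham"),
    ("Running late today", "ham"),
    ("Did you finish the report", "ham"),
    ("Happy birthday", "ham"),
]

# Set the holdout test set aside before doing anything else.
train_set, test_set = holdout_split(records)

# Hypothetical stand-ins for candidate models.
candidates = {
    "always_ham": lambda text: "ham",
    "keyword_rule": lambda text: "spam" if "free" in text.lower() else "ham",
}

# Select the best candidate without touching the holdout set.
best_name = max(candidates, key=lambda name: accuracy(candidates[name], train_set))

# Only the selected model is scored on the holdout set, and only once.
print(best_name, accuracy(candidates[best_name], test_set))
```

The key design point is that the test set plays no part in choosing between candidates, so its score is an honest estimate of performance on unseen messages.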

That’s all in this blog. If you liked reading, do clap on this page and we shall see you in the next blog.
