Sentiment Classification in NLP | Part-1 | Feature Extraction
If you are landing here directly, it may be good to read through this blog first.
Question → What are Supervised Learning algorithms, and what is their goal ?
Answer → In supervised machine learning you have input features X and a set of labels Y.
Goal → To get the most accurate predictions for the labels from your data, you minimize your error rate, or cost, as much as possible.
Question → How do we perform Supervised Learning ?
Answer → You start with a prediction function that takes in parameters and maps your features X to output labels Y hat.
- The best mapping from features to labels is achieved when the difference between the expected values Y and the predicted values Y hat is minimized. The cost function measures this difference.
- You then update your parameters and repeat the whole process until your cost is minimized.
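The predict, measure, update loop described above can be sketched in a few lines. This is only a minimal illustration, assuming a logistic regression model trained with plain gradient descent on a made-up toy dataset; the names predict, cost and train, the learning rate and the number of steps are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Prediction function: maps features x to Y hat using parameters theta."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, Y):
    """Average cross-entropy between the labels Y and the predictions Y hat."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = predict(theta, x)
        total -= y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps)
    return total / len(X)

def train(X, Y, learning_rate=0.1, steps=200):
    """Repeat: predict, compute the gradient of the cost, update the parameters."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grads = [0.0] * len(theta)
        for x, y in zip(X, Y):
            error = predict(theta, x) - y
            for j, xj in enumerate(x):
                grads[j] += error * xj / len(X)
        theta = [t - learning_rate * g for t, g in zip(theta, grads)]
    return theta

# Toy data (assumed): each row is [bias, feature]; the label is 1 when the feature is positive.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
Y = [1, 1, 0, 0]
theta = train(X, Y)
```

After training, cost(theta, X, Y) should be lower than the cost at the all-zero starting parameters, which is the whole point of the loop.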
Question → How do we solve the problem of Tweet Classification ?
Answer → For this task, we will use a Logistic Regression classifier, which assigns each observation to one of two distinct classes, either positive or negative :-
- You will first process the raw tweets in your training sets and extract useful features.
- By processing the raw tweets, we mean that we have to represent the text (e.g. “I am happy because I am learning NLP”) as features.
- Then you will train your logistic regression classifier while minimising the cost.
- And finally you’ll be able to make your predictions.
Question → How to represent a text as a vector ?
Answer → In order to represent a text as a vector, we have to perform the following steps :-
Step #1.) Vocabulary Formation → First, you have to build a vocabulary and that will allow you to encode any text or any tweet as an array of numbers.
- Imagine that, we have a list of tweets.
- All the unique words from all the tweets would form your vocabulary, V.
- In the above example, you’d have the word “I”, then the word, “am” and “happy”, “because”, and so forth.
- But note that the word “I” and the word “am” would not be repeated in the vocabulary.
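A minimal sketch of the vocabulary step, assuming simple whitespace tokenisation. The function name build_vocabulary and the second tweet are hypothetical and only serve to make the example runnable.

```python
def build_vocabulary(tweets):
    """Collect every unique word across all tweets, in order of first appearance."""
    vocabulary = []
    seen = set()
    for tweet in tweets:
        for word in tweet.split():
            if word not in seen:  # repeated words such as "I" and "am" are added once
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

tweets = [
    "I am happy because I am learning NLP",
    "I hated the movie",  # hypothetical second tweet
]
vocabulary = build_vocabulary(tweets)
# ['I', 'am', 'happy', 'because', 'learning', 'NLP', 'hated', 'the', 'movie']
```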
Step #2.) Feature Extraction → Now, from all of these tweets, we extract features using the vocabulary formed above. To do so, you check whether each word from your vocabulary appears in the tweet.
- If a particular word does appear in the tweet, you would assign a value of 1 to that feature.
- If it doesn’t appear, you’d assign a value of 0.
- In this example, the representation of your tweet would have length equal to the number of words (features) in the dictionary.
- In other words, we can say that, this representation would have a number of features equal to the size of your entire vocabulary.
- In this representation of your tweet, we would have six ones and many zeros.
Now, this type of representation, with a relatively small number of non-zero values, is called a sparse representation.
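The sparse representation can be sketched as follows. The vocabulary here is a small hypothetical one (a real vocabulary would be far larger, so the vector would contain many more zeros), and extract_sparse_features is a name chosen for illustration.

```python
def extract_sparse_features(tweet, vocabulary):
    """Return one 0/1 entry per vocabulary word: 1 if the word occurs in the tweet."""
    words = set(tweet.split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical vocabulary, assumed for this example only.
vocabulary = ["I", "am", "happy", "because", "learning", "NLP", "hated", "the", "movie"]
features = extract_sparse_features("I am happy because I am learning NLP", vocabulary)
# [1, 1, 1, 1, 1, 1, 0, 0, 0]: six ones, and the length equals the vocabulary size
```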
Question → What are the problems associated with the sparse representation of the text ?
Answer → The main problem comes during the Model Learning phase. With the sparse representation, a logistic regression model would have to learn n plus 1 parameters, where n would be equal to the size of your vocabulary.
- For large vocabulary sizes, this would be problematic. It would take an excessive amount of time to train your model.
- Also, this format would take much more time than necessary to make predictions/inferences.
Question → How do we generate positive and negative counts, which we can then use as features in the logistic regression classifier ?
Answer → Here, our objective is to keep track of the number of times that a given word shows up in the positive class and in the negative class.
Step #1.) Vocabulary Formation → Here, you could have a corpus consisting of four tweets as shown below. With this corpus, let’s form our dictionary. In this example, your vocabulary would have eight unique words :-
Step #2.) Identifying Classes → For this particular example of sentiment analysis, you have two classes.
- One class associated with positive sentiment and the other with negative sentiment.
- So taking your corpus, you’d have a set of two tweets that belong to the positive class, and a set of two tweets that belong to the negative class.
Step #3.) Positive-Frequency → Let’s take the set of positive tweets. Now, take a look at your vocabulary. For each word in our dictionary, we find its frequency in the positive tweets. For example → the word happy appears one time in the first positive tweet, and another time in the second positive tweet. So its positive frequency is two.
Step #4.) Negative-Frequency → Similarly, let’s take the set of negative tweets. Now, take a look at your vocabulary. For each word in our dictionary, we find its frequency in the negative tweets. For example → the word am appears two times in the first negative tweet and another time in the second one. So its negative frequency is three.
Step #5.) Overall-Frequency → Now, take a look at the entire table for positive + negative frequencies and feel free to check its values. So this is the entire table with the positive and negative frequencies for your corpus.
Note → In practice, when coding, this table is a dictionary mapping from a (word, class) pair to its frequency. So, it maps each word and its corresponding class to the number of times that word showed up in that class.
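A minimal sketch of such a frequency dictionary. The four-tweet corpus below is an assumption: it was chosen so that it agrees with the counts quoted in the text (eight unique words, happy appearing twice in the positive class, am appearing three times in the negative class). The name build_freqs and the 1/0 label encoding are also assumptions.

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    """Map each (word, label) pair to the number of times the word occurs in that class."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.replace(",", "").split():  # crude tokenisation for the sketch
            freqs[(word, label)] += 1
    return dict(freqs)

# Assumed corpus: label 1 = positive, label 0 = negative.
tweets = [
    "I am happy because I am learning NLP",
    "I am happy",
    "I am sad, I am not learning NLP",
    "I am sad",
]
labels = [1, 1, 0, 0]
freqs = build_freqs(tweets, labels)
# freqs[("happy", 1)] is 2 and freqs[("am", 0)] is 3
```

Keying the dictionary on the (word, class) tuple keeps both tables in one structure, and a missing pair simply means a count of zero.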
How this is helpful → Using both of those counts, you can then extract features and feed those features into your logistic regression classifier.
Question → In the above example of sentiment classification, we represented a tweet as a vector of dimension V, where V is the size of the entire vocabulary. Can you now show how to encode a tweet as a vector of dimension 3 ?
Answer → Here is how we can represent a tweet as a vector of dimension 3 :-
1.) Goal → Your logistic regression classifier will train much faster, because instead of learning V features, it only has to learn three.
2.) How do we do this → Now that you’ve built your frequencies dictionary in the above step, you can use it to extract useful features for sentiment analysis.
3.) What does a feature look like → Let’s look at an arbitrary tweet m.
- The first feature would be a bias unit equal to 1.
- The second is the sum of the positive frequencies for every unique word in tweet m.
- The third is the sum of the negative frequencies for every unique word in tweet m.
Example #1 → Computation of Second Feature (Sum of Positive Frequencies) →
- For instance, take the example of following tweet : “I am sad, I am not learning NLP”.
- Now let’s look at the frequencies for the positive class from the above step.
- To get this value, you sum the positive frequencies of the vocabulary words that appear in the tweet. At the end, you get a value equal to eight.
Example #2 → Computation of Third Feature (Sum of Negative Frequencies) →
- Once again, take the example of following tweet : “I am sad, I am not learning NLP”.
- Now let’s look at the frequencies for the negative class from the above step.
- To get this value, you sum the negative frequencies of the vocabulary words that appear in the tweet. At the end, you get a value equal to eleven.
Conclusion → So, for this tweet : “I am sad, I am not learning NLP”, you end up with the feature vector [1, 8, 11].
- 1 corresponds to the bias.
- 8 corresponds to the positive feature.
- 11 corresponds to the negative feature.
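The computation of [1, 8, 11] can be checked with a short sketch. The frequency values below are assumed: they were reconstructed to be consistent with the counts quoted in the text, and extract_features is a name chosen for illustration.

```python
def extract_features(words, freqs):
    """Return [bias, sum of positive frequencies, sum of negative frequencies]."""
    unique_words = set(words)  # each word is counted once, however often it repeats
    positive = sum(freqs.get((word, 1), 0) for word in unique_words)
    negative = sum(freqs.get((word, 0), 0) for word in unique_words)
    return [1, positive, negative]

# Assumed frequency dictionary: (word, class) -> count, with 1 = positive, 0 = negative.
freqs = {
    ("I", 1): 3, ("am", 1): 3, ("happy", 1): 2,
    ("because", 1): 1, ("learning", 1): 1, ("NLP", 1): 1,
    ("I", 0): 3, ("am", 0): 3, ("sad", 0): 2,
    ("not", 0): 1, ("learning", 0): 1, ("NLP", 0): 1,
}
words = ["I", "am", "sad", "I", "am", "not", "learning", "NLP"]
features = extract_features(words, freqs)
# positive: 3 + 3 + 0 + 0 + 1 + 1 = 8, negative: 3 + 3 + 2 + 1 + 1 + 1 = 11
```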
Question → How do we preprocess the data ?
Answer → Now, we shall look into the major concepts of preprocessing.
- The first thing is called stop words.
- The second thing is called stemming.
We have to perform stemming and remove stop words to preprocess our texts.
Question → Can you show an example of pre-processing a tweet ?
Answer → Let’s process this tweet : “@YMourri and @AndrewNG are tuning a GREAT AI model at https://deeplearning.ai”.
Step #1 → First, I remove all the words that don’t add significant meaning to the tweets, aka stop words and punctuation marks.
- In practice, you would have to compare your tweet against two lists. One with stop words in English and another with punctuation. These lists are usually much larger, but for the purpose of this example, they will do just fine.
- Every word from the tweet that also appears on the list of stop words should be eliminated. So you’d have to eliminate the words : “and”, “are”, “a” and “at” :-
Step #2 → Next, I eliminate every punctuation mark. In this example, there are only exclamation points.
Step #3 → Next, I eliminate the handles and URLs too. Tweets and other types of texts often have handles and URLs, but these don’t add any value for the task of sentiment analysis. Let’s eliminate these two handles and this URL.
At the end of this process, the resulting tweet contains all the important information related to its sentiment. “Tuning GREAT AI model” is clearly a positive tweet and a sufficiently good model should be able to classify it.
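The three cleaning steps can be sketched with the standard library alone. Note that the order differs from the prose: the URL and the handles are stripped first, because removing punctuation first would break them into ordinary-looking words. The stop-word set is a tiny illustrative list (it includes “are” on the assumption that it is a stop word, since the cleaned tweet above drops it), and clean_tweet is a hypothetical name.

```python
import re
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "are", "a", "at"}

def clean_tweet(tweet):
    """Remove URLs, handles, punctuation and stop words; return the remaining words."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)          # drop handles such as @YMourri
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    return [word for word in tweet.split() if word.lower() not in STOP_WORDS]

tweet = "@YMourri and @AndrewNG are tuning a GREAT AI model at https://deeplearning.ai"
cleaned = clean_tweet(tweet)
# ['tuning', 'GREAT', 'AI', 'model']
```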
Step #4 → Next, we shall perform stemming for every word. Stemming in NLP is simply transforming any word to its base stem, which you could define as the set of characters that are used to construct the word and its derivatives. Let’s take the first word from the example. Its stem is tun.
- Adding the letter e, it forms the word tune.
- Adding the suffix ed, forms the word tuned.
- Adding the suffix ing, it forms the word tuning.
How is Stemming helpful →
- After you perform stemming on your corpus, the words tune, tuned, and tuning will all be reduced to the stem tun. So, your vocabulary would be significantly reduced when you perform this process for every word in the corpus.
- To reduce your vocabulary even further without losing valuable information, you’d have to lowercase every one of your words. So the words GREAT, Great and great would be treated as the exact same word.
So, back to our example above, this is the final preprocessed tweet as a list of words : [tun, great, ai, model]
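To make the stemming and lowercasing step concrete, here is a deliberately simplified sketch. The toy_stem function and its three suffix rules are hypothetical and exist only to reproduce the tun example; it is not a real stemming algorithm, and an established stemmer applies many more rules and may return different stems.

```python
# Toy suffix rules, assumed for illustration only (checked in this order).
SUFFIXES = ("ing", "ed", "e")

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(word) for word in ["tune", "tuned", "tuning"]]
# ['tun', 'tun', 'tun']

processed = [toy_stem(word) for word in ["tuning", "GREAT", "AI", "model"]]
# ['tun', 'great', 'ai', 'model']
```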
Question → How do you now generate the matrix of features from the data ?
Answer → Here is what the process looks like :-
Step #1.) First, we preprocess a tweet, to get a list of words that contain all the relevant information for the sentiment analysis tasks in NLP.
Step #2.) With that list of words, you can get a compact representation using the frequency dictionary mapping. You end up with a vector containing a bias unit and two additional features : the sum of the number of times that every word in your processed tweet appears in positive tweets, and the sum of the number of times those words appear in negative tweets.
Step #3.) Note that, in practice, you would have to perform this process on a set of m tweets, one by one, to get a list of words for each of your tweets. You would then extract the features for each list using the frequencies dictionary mapping.
Step #4.) At the end, you would have a matrix, X with m rows and three columns where every row would contain the features for each one of your tweets.
The code for the same would look as follows :-
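A minimal sketch of these steps, under stated assumptions: the tweets are taken as already preprocessed lists of words, the small corpus is the assumed four-tweet example that matches the counts used earlier, and the names build_freqs and build_feature_matrix are chosen for illustration.

```python
from collections import defaultdict

def build_freqs(processed_tweets, labels):
    """Map each (word, label) pair to its count across the corpus."""
    freqs = defaultdict(int)
    for words, label in zip(processed_tweets, labels):
        for word in words:
            freqs[(word, label)] += 1
    return dict(freqs)

def build_feature_matrix(processed_tweets, freqs):
    """Return X with one row [bias, positive sum, negative sum] per tweet."""
    X = []
    for words in processed_tweets:
        unique_words = set(words)
        positive = sum(freqs.get((word, 1), 0) for word in unique_words)
        negative = sum(freqs.get((word, 0), 0) for word in unique_words)
        X.append([1, positive, negative])
    return X

# Assumed corpus, already split into words: label 1 = positive, label 0 = negative.
processed_tweets = [
    ["I", "am", "happy", "because", "I", "am", "learning", "NLP"],
    ["I", "am", "happy"],
    ["I", "am", "sad", "I", "am", "not", "learning", "NLP"],
    ["I", "am", "sad"],
]
labels = [1, 1, 0, 0]

freqs = build_freqs(processed_tweets, labels)
X = build_feature_matrix(processed_tweets, freqs)
# X has m = 4 rows and 3 columns; the third row is [1, 8, 11]
```

In a real pipeline each row of X would be produced from the output of the preprocessing step, and X would then be passed to the logistic regression classifier together with the labels.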
That’s all for this blog. Please stay tuned for more parts of this series.