Sentiment Classification in NLP | Part-2 | Logistic Regression
If you are landing here directly, it may be good to read through this blog first.
Question → What are we going to do here next?
Answer → In the previous blog of this series, you learned how to extract features from tweets. Now you will use those extracted features to predict whether a tweet has a positive sentiment or a negative sentiment.
Question → What is Logistic Regression?
Answer → Logistic regression makes use of a sigmoid function, which outputs a probability between zero and one.
- In supervised machine learning, you have input features and a set of labels.
- To make predictions based on your data, you use a function with some parameters to map your features to output labels.
- To get an optimum mapping from your features to labels, you minimize the cost function, which works by comparing how close your output “Y-hat” is to the true labels “Y” from your data.
- After this, the parameters are updated and you repeat the process until your cost is minimized.
Question → What is the Sigmoid Function?
Answer → For logistic regression, this function F is the sigmoid function, and it depends on the following 2 parameters :-
- Theta, the parameter vector.
- The feature vector x^(i), where the superscript i denotes the ith observation or data point. In the context of tweets, that’s the ith tweet.
Note the 2 important properties of the sigmoid function :-
- The sigmoid approaches zero as the dot product of Theta transpose and X approaches minus infinity.
- The sigmoid approaches one as the dot product of Theta transpose and X approaches plus infinity. (Both properties are visible in the sketch below.)
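Here is a minimal sketch of the sigmoid in Python (the use of NumPy is my assumption), showing both limit properties:

```python
import numpy as np

def sigmoid(z):
    """Map z = Theta^T x (any real number) to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(-10))  # ~0.000045 : approaches 0 as z -> minus infinity
print(sigmoid(0))    # 0.5       : the decision boundary
print(sigmoid(10))   # ~0.999955 : approaches 1 as z -> plus infinity
```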
Question → How do we do Classification with the help of the Sigmoid Function?
Answer → For classification, a threshold is needed. Usually, it is set to be 0.5 and this value corresponds to a dot product between Theta transpose and X equal to zero.
- So whenever the dot product is greater than or equal to zero, the prediction is positive.
- And whenever the dot product is less than zero, the prediction is negative. (A small sketch of this decision rule follows below.)
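As a quick illustration, reusing the sigmoid sketch above (the helper name is mine), the decision rule can be written as:

```python
import numpy as np

def predict_label(theta, x):
    """Return 1 (positive) when Theta^T x >= 0, i.e. sigmoid >= 0.5; else 0 (negative)."""
    return 1 if np.dot(theta, x) >= 0 else 0
```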
Question → Can you showcase an example of Classification in the context of our Tweet Sentiment Analysis, with the help of the Sigmoid Function?
Answer → Look at the following tweet: “@YMourri and @AndrewNG are tuning a GREAT AI model at https://deeplearning.ai”.
Step #1.) After preprocessing, you should end up with a list like this: “tun, ai, great, model”
- Note that handles are deleted.
- Everything is in lowercase.
- The word tuning is reduced to its stem, tun.
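One possible preprocessing implementation, sketched with NLTK (the library choice and the helper name `process_tweet` are my assumptions; Part-1 may have used slightly different tooling):

```python
import re
from nltk.corpus import stopwords        # needs nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Strip URLs and handles, lowercase, tokenize, drop stopwords, and stem."""
    tweet = re.sub(r'https?://\S+', '', tweet)       # remove URLs
    tokenizer = TweetTokenizer(preserve_case=False,  # lowercase everything
                               strip_handles=True,   # delete @handles
                               reduce_len=True)
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokenizer.tokenize(tweet)
            if t.isalpha() and t not in stop_words]

# Note: the exact stems depend on the stemmer; Porter may give 'tune' rather than 'tun'.
print(process_tweet("@YMourri and @AndrewNG are tuning a GREAT AI model at https://deeplearning.ai"))
```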
Step #2.) Then you would be able to extract features, given a frequencies dictionary, and arrive at a vector similar to the following (please refer to Part-1 of this blog series to know more about this; a sketch also follows below) :-
- A bias unit.
- And two features that are the sum of the positive and negative frequencies of all the words in your processed tweet.
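A sketch of that feature extraction, assuming a `freqs` dictionary keyed by (word, sentiment) pairs as built in Part-1 (the exact key format is my assumption):

```python
import numpy as np

def extract_features(tweet, freqs):
    """Build x = [bias, sum of positive freqs, sum of negative freqs]."""
    x = np.zeros(3)
    x[0] = 1                                # bias unit
    for word in process_tweet(tweet):       # preprocessing sketch from above
        x[1] += freqs.get((word, 1), 0)     # positive-class frequency
        x[2] += freqs.get((word, 0), 0)     # negative-class frequency
    return x
```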
Step #3.) Now, assuming that you already have an optimum set of parameters Theta, you would be able to compute the dot product of Theta transpose and X, in this case equal to 4.92. Its sigmoid is close to 1, well above the 0.5 threshold, so you finally predict this tweet as having a positive sentiment.
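Putting the pieces above together (the theta values and the toy `freqs` here are purely illustrative, not trained ones):

```python
tweet = "@YMourri and @AndrewNG are tuning a GREAT AI model at https://deeplearning.ai"
freqs = {('great', 1): 2, ('model', 1): 1, ('tune', 0): 1}   # toy frequency dictionary
theta = np.array([0.00003, 0.00150, -0.00120])               # hypothetical parameters

x = extract_features(tweet, freqs)   # from the sketch above
z = np.dot(theta, x)                 # with the real freqs from Part-1, this is 4.92
print(sigmoid(z) >= 0.5)             # sigmoid(4.92) ~ 0.99 -> positive sentiment
```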
Question → How do we get the theta parameters from scratch?
Answer → To train your logistic regression classifier, you iterate until you find the set of parameters theta that minimizes your cost function.
- Suppose that your loss depends only on the parameters theta1 and theta2; you would then have a cost function that looks like the contour plot in the left-hand part of the picture below.
- On the right, you can see the evolution of the cost function as you iterate. As the number of iterations increases, the cost comes down. After a certain number of iterations, the cost cannot be reduced any further.
Question → What’s the algorithm to find the values of the Theta variables?
Answer → The algorithm used to derive the values of these Theta variables is known as Gradient-Descent :-
- First, you initialise your parameter vector theta.
- Then, you use the logistic function to get values for each of your observations. After that, you calculate the gradient of your cost function and update your parameters in the direction opposite to that gradient, which is what drives the cost down.
- Finally, you compute your cost J(Theta) and determine if more iterations are needed, according to a stopping criterion or a maximum number of iterations. A compact sketch of this loop follows below.
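A compact NumPy sketch of this loop (the variable names and default learning rate are my own; raw frequency features are large, so a tiny alpha is assumed):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=1e-9, num_iters=1500):
    """Batch gradient descent for logistic regression.
    X: (m, n) feature matrix, y: (m,) label vector, alpha: learning rate."""
    m = X.shape[0]
    for _ in range(num_iters):
        h = 1 / (1 + np.exp(-(X @ theta)))   # sigmoid predictions for all examples
        gradient = (X.T @ (h - y)) / m       # gradient of the cost J(theta)
        theta = theta - alpha * gradient     # step opposite to the gradient
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return theta, cost
```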
Question → How do you judge whether your classifier is a good classifier or a bad one?
Answer → Now that you have the optimised values of theta, you will use this theta to make predictions on new data points.
Step #1.) For example, given a new tweet, you will use this theta to say whether the tweet is positive or negative. To do so, you will need X_val and Y_val, the data that was set aside during training (also known as the validation set), as well as Theta, the set of optimum parameters that you got from training on your data.
Step #2.) Next, you will compute the sigmoid function for X_val with parameters Theta, then you will evaluate if each value of h(Theta) is greater than or equal to a threshold value, often set to 0.5.
Step #3.) We shall do the aforementioned computation for all the records present in our validation set, and then assert, for each record, whether the output is greater than or equal to 0.5. In simpler words, we are going to predict whether each tweet is positive or negative :-
At the end, you will have a vector populated with zeros and ones indicating predicted negative and positive examples, respectively.
Step #4.) After building the predictions vector, you can compute the accuracy of your model over the validation set.
- To do so, you will compare the predictions you have made with the true value of each observation from your validation data.
- If the values are equal, your prediction is correct and you’ll get a value of 1; otherwise, you’ll get a 0.
Step #5.) You can then compute the accuracy of your model as follows :-
- First, compute the sum of all the ones in the vector of comparisons.
- Finally, divide that number by the total number m of observations in your validation set.
This metric gives an estimate of how often your logistic regression will work correctly on unseen data.
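The whole evaluation, sketched (assuming the X_val, Y_val names from above, with Y_val as a flat array):

```python
import numpy as np

def test_logistic_regression(X_val, Y_val, theta):
    """Accuracy of the trained classifier on the held-out validation set."""
    h = 1 / (1 + np.exp(-(X_val @ theta)))   # sigmoid for every validation example
    y_pred = (h >= 0.5).astype(int)          # 1 = positive, 0 = negative
    return np.mean(y_pred == Y_val)          # fraction of correct predictions
```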
Question → What’s the intuition behind the Cost Function used in Logistic Regression ?
Answer → The cost function used in Logistic Regression is also called the Log-Loss Function OR Binary Cross-Entropy :-
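Written out, the standard binary cross-entropy cost that the parts below walk through is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\,\log h\big(x^{(i)},\theta\big) + \big(1-y^{(i)}\big)\,\log\big(1-h\big(x^{(i)},\theta\big)\big) \right]
```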
Part #1.) First, we have a sum over the m training examples in your training set. This indicates that you’re going to sum the cost of each training example. Out front, there is a -1/m, indicating that, when combined with the sum, this will be some kind of average. The minus sign ensures that your overall cost will always be a positive number.
Part #2.) Inside the square brackets, the equation has two terms that are added together. The term on the left is :-
- y^i → The label for each training example.
- log(h(x^i, Theta)) → Log of the prediction, which is the logistic regression function applied to each training example.
Case-2.1) Consider the case when your label is 0. In this case, the function h can return any value, and the entire term will be 0 because 0 times anything is just 0.
Case-2.2) Consider the case when your label is 1. Now, if your prediction is close to 1, then the log of your prediction will be close to 0 :-
Case-2.3) Consider the case when your label is 1. Now, if your prediction is close to 0, then the log of your prediction will approach negative infinity :-
Intuitively, you can now see that this term (i.e. the left-hand term shown above) is the relevant term in your cost function when your label is 1.
- When your prediction is close to the label value, the loss is small.
- And when your label and prediction disagree, the overall cost goes up.
Note: In this case, J(Theta) simplifies to just -log(h(x, Theta)).
In the below plot, you have your prediction value (as produced by the logistic function) on the horizontal axis and the cost J(Theta) associated with a single training example on the vertical axis. For the case when the label is 1, the loss curve looks something like this :-
- When your prediction is close to 1, the loss is close to 0, because your prediction agrees well with the label.
- And when the prediction is close to 0, the loss approaches infinity, because your prediction and the label disagree strongly.
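A few concrete numbers make the y = 1 curve tangible (a quick sketch, not from the original blog):

```python
import numpy as np

# When the label y = 1, the single-example loss reduces to -log(h):
for h in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"h = {h:<4}  loss = {-np.log(h):.3f}")
# h near 1 -> loss near 0; h near 0 -> loss grows without bound
```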
Part #3.) Now, consider the term on the right hand side of the cost function equation :-
Case-3.1) Consider the case when your label is 1. In this case, the (1 - y) term goes to 0; although the function h can return any value, the entire term will be 0, because 0 times anything is just 0.
Case-3.2) Consider the case when your label is 0, and assume that the logistic function h returns a value close to zero. In that case, log(1 - 0) approaches log(1) = 0, and hence the entire term will again be close to 0.
Case-3.3) Consider the case when your label is 0, and now say that the logistic function h returns a value close to one. In that case, log(1 - 1) approaches negative infinity :-
Intuitively, you can now see that this term (i.e. the right-hand term shown above) is the relevant term in your cost function when your label is 0.
- When your prediction is close to the label value, the loss is small.
- And when your label and prediction disagree, the overall cost goes up.
Note: In this case, J(Theta) reduces to just -log(1 - h(x, Theta)).
In the below plot, you again have your prediction value on the horizontal axis and the cost associated with a single training example on the vertical axis. For the case when the label is 0, the loss curve looks something like this :-
- Now when your prediction is close to 0, the loss is also close to 0.
- And when your prediction is close to 1, the loss approaches infinity.
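And the mirror-image numbers for the y = 0 case (same sketch, other term):

```python
import numpy as np

# When the label y = 0, the single-example loss reduces to -log(1 - h):
for h in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"h = {h:<4}  loss = {-np.log(1 - h):.3f}")
# h near 0 -> loss near 0; h near 1 -> loss grows without bound
```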
Conclusion about Cost-Function →
- One term in the cost function is relevant when your label is 0 and another that is relevant when the label is 1.
- In each of these terms, you’re taking the log of a value between 0 and 1, which will always return a negative number, and so the minus sign out front ensures that the overall cost will always be a positive number.
That’s all in this blog. We shall see you in the next one.