Sentiment Classification in NLP | Part-3 | Vector Space Models
If you are landing at this page directly, it’s advisable that you go through this page first.
Question → What's the need for Vector Spaces ?
Answer → Let's understand the reasons why we need Vector-Spaces :-
Aspect #1.) Say, suppose you have two questions :- the first one is "Where are you heading?" and the second one is "Where are you from?". These sentences have identical words except for the last ones, yet they have different meanings.
Aspect #2.) On the other hand, say you have two more questions whose words are completely different but which mean the same thing, for example "How old are you?" and "What is your age?".
Conclusion #.) Vector space models will help you identify that the second pair of questions is similar in meaning even though the two questions do not share the same words, and that the first pair is not, even though they share almost all of theirs.
Question → Can Vector Spaces also help to capture the dependencies between words ?
Answer → Yes, vector space models also allow you to capture dependencies between words :-
Aspect #1.) Consider the sentence "you eat cereal from a bowl"; here you can see that the word "cereal" and the word "bowl" are related.
Aspect #2.) Now let’s look at this other sentence, “you buy something and someone else sells it”. The second half of the sentence is dependent on the first half.
With vector-based models, you will be able to capture this and many other types of relationships among different sets of words.
Question → What are the various applications of Vector Space Models ?
Answer → Vector space models have an extensive set of applications :-
1.) These are used in information extraction to answer questions in the style of who, what, where, how, etc.
2.) These are also used in machine translation and in chatbot programming, and in many, many other applications.
3.) They can be used to identify similarity for question answering, paraphrasing and summarization.
Question → What's the fundamental idea on which Vector Space Models work ?
Answer → Vector space models are best summed up by this quote from John Firth, a famous English linguist :- "You shall know a word by the company it keeps."
This is one of the most fundamental concepts in NLP.
- When using vector space models, representations are made by identifying the context around each word in the text; this is what captures the relative meaning.
- When learning these vectors, you usually make use of the neighboring words to extract meaning and information about the center word.
- If you were to cluster these vectors together, you will see that adjectives, nouns, verbs, etc. tend to be near one another.
- Another cool fact is that synonyms and antonyms are also very close to one another. This is because you can easily interchange them in a sentence and they tend to have similar neighboring words.
- V. V. Imp Point → Vector space models allow you to represent words and documents as vectors. This captures the relative meaning.
Question → How would you represent a word in the form of a vector using the Word-2-Word design ?
Answer → The way the word-2-word design works is :-
- It builds a co-occurrence matrix and extracts vector representations for the words in your corpus.
- The co-occurrence of two different words is the number of times that they appear in your corpus together within a certain word distance k. For instance →
Step #1.) Say that your corpus has two sentences, for instance "I like simple data" and "I prefer simple raw data", with k = 2. For the words "simple" and "data", you'd get a value equal to 2, because "data" and "simple" co-occur in the first sentence within a distance of one word and in the second sentence within a distance of two words.
Step #2.) Now, the vector representation corresponding to the word "data", with respect to the words "simple", "raw", "like" and "I", would be equal to 2, 1, 1, 0 (see the code sketch after the notes below).
Important Notes :-
- With a word by word design, you can get a representation with n entries, with n equal to the size of your entire vocabulary.
- The word “data” can be represented as a vector, v = [2,1,1,0].
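To make the counting concrete, here is a minimal Python sketch of the word-by-word design described above. The two toy sentences and the window size k = 2 are assumptions chosen only to reproduce the vector [2, 1, 1, 0] for the word "data"; this is not a production tokenizer or a full co-occurrence matrix.

```python
from collections import defaultdict

# Assumed toy corpus; any two sentences with the same co-occurrence pattern would do.
corpus = [
    "I like simple data".split(),
    "I prefer simple raw data".split(),
]
k = 2  # word-distance window

def cooccurrence_vector(center, vocab, corpus, k):
    """Count how often each vocab word appears within k words of `center`."""
    counts = defaultdict(int)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word != center:
                continue
            # Look at the neighbors within distance k on both sides of the center word.
            for j in range(max(0, i - k), min(len(sentence), i + k + 1)):
                if j != i and sentence[j] in vocab:
                    counts[sentence[j]] += 1
    return [counts[w] for w in vocab]

vocab = ["simple", "raw", "like", "I"]
print(cooccurrence_vector("data", vocab, corpus, k))  # -> [2, 1, 1, 0]
```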
Question → How would you represent a word in the form of a vector using the Word-2-Document design ?
Answer → The way the word-2-document design works is :-
Step #1.) It counts the number of times that words from your vocabulary appear in documents that belong to specific categories.
- For instance, you could have a corpus consisting of documents on different topics like entertainment, economy, and machine learning. Here, you'd count the number of times that your words appear in the documents that belong to each of the three categories.
- In the example shown below, suppose that the word "data" appears 500 times in documents from your corpus related to entertainment, 6620 times in economy documents and 9320 times in documents related to machine learning.
- Similarly, the word "film" appears 7000, 4000, and 1000 times respectively in those same categories.
Important Notes about the above representation →
- The rows could correspond to words and the columns to documents.
- The numbers in the matrix correspond to the number of times each word showed up in the document.
- You can represent the entertainment category as a vector, v = [500,7000].
Step #2.) Once you've constructed the representations for multiple sets of documents or words, you get your vector space.
- Here you could take the representations for the words "data" and "film" from the rows of the table.
- The vector space will have two dimensions : the number of times the words "data" and "film" appear in each type of document.
Important Notes →
- We can compare categories, as shown above, by doing a simple plot.
- In this space it is easy to see that the economy and machine-learning documents are much more similar to each other than either is to the entertainment category.
- In the above example, we have shown 2 dimensions ("film" and "data"), whereas there are 3 vectors being shown : "Entertainment", "Economy" and "ML" (see the code sketch below).
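Here is a small sketch of the word-by-document table above as a NumPy array; the row and column ordering is an assumption made for illustration.

```python
import numpy as np

# Rows = words ("data", "film"); columns = categories
# (entertainment, economy, machine learning), using the counts above.
counts = np.array([
    [500, 6620, 9320],   # "data"
    [7000, 4000, 1000],  # "film"
])

# Each category is a column: a vector in the 2-D space spanned by "data" and "film".
entertainment, economy, machine_learning = counts.T
print(entertainment)      # -> [ 500 7000]
print(machine_learning)   # -> [9320 1000]
```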
Question → Now that we have constructed the category vectors, how would you compare these vectors ?
Answer → There is a metric known as Euclidean-Distance that allows you to identify how far two points or two vectors are apart from each other. Once we find the Euclidean-Distance between the two vectors, we can then generalize that notion to vector spaces in higher dimensions.
- In the example shown above & below, there were two dimensions i.e. “film” & “data”.
- In the example shown above & below, there are 2 vectors being shown : “Entertainment” and “ML”.
- Let's call the point given by the number of times (frequency) that the word "data" and the word "film" appear in the entertainment corpus, Corpus-A.
- Let's call the point given by the number of times (frequency) that the word "data" and the word "film" appear in the machine-learning corpus, Corpus-B.
Now let’s represent those vectors (“Entertainment” and “ML”) as points in the vector space.
The Euclidean distance is the length of the straight line segment connecting them. To get that value, you should use the Pythagorean formula, d(A, B) = √((B₁ − A₁)² + (B₂ − A₂)²) :-
- The first term is their horizontal distance squared.
- The second term is their vertical distance squared.
As you see, this formula is an example of the Pythagorean theorem. If you plug in the values for each of the terms, you get √((9320 − 500)² + (1000 − 7000)²) = √(8820² + 6000²), which is a Euclidean distance approximately equal to 10,667.
Question → How do we generalise this concept of Euclidean distance to higher dimensions ?
Answer → Let’s walk through an example of computing the Euclidean distance between the vectors using the following co-occurrence matrix :-
- Suppose that you want to know the Euclidean distance between the vector representation v of the word "ice-cream" and the vector representation w of the word "boba". Let's also imagine that we have 3 dimensions (category corpora) : "AI", "drinks", and "food".
- The word "ice-cream" (vector v) appears 1 time in the AI corpus, 6 times in the drinks corpus, and 8 times in the food corpus.
- Similarly, the word "boba" (vector w) appears 0 times in the AI corpus, 4 times in the drinks corpus, and 6 times in the food corpus.
- Now, in order to find the distance between the vectors v and w, we use the same Pythagorean idea generalized to n dimensions: d(v, w) = √(Σᵢ (vᵢ − wᵢ)²). The vector v (ice-cream) can be represented in 3-dimensional space as the point (1, 6, 8) and the vector w (boba) as the point (0, 4, 6), so d(v, w) = √((1 − 0)² + (6 − 4)² + (8 − 6)²) = √(1 + 4 + 4) = 3.
Notes :- This process is the generalization of the one from the 2-d-space example.
Question → How do we find the Euclidean-Distance in Python ?
Answer → If you have two vector representations like the ones from the previous example shown above, you can use the “linalg” module from NumPy to get the norm of the difference between them.
- The norm function works for n-dimensional spaces.
- The primary takeaways here are that the Euclidean distance is basically the length of the straight line that connects two vectors, and that to get the Euclidean distance you calculate the norm of the difference between the vectors that you are comparing, as sketched below.
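A minimal sketch of that computation with NumPy, reusing the vectors from the two examples above:

```python
import numpy as np

# 2-D example: entertainment vs. machine-learning category vectors.
entertainment = np.array([500, 7000])
machine_learning = np.array([9320, 1000])
print(np.linalg.norm(entertainment - machine_learning))  # ~10667.35

# 3-D example: "ice-cream" vs. "boba" word vectors.
v = np.array([1, 6, 8])   # ice-cream
w = np.array([0, 4, 6])   # boba
print(np.linalg.norm(v - w))  # 3.0
```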
Conclusion → By using this metric, you can get a sense of how similar two documents or words are.
Question → What are the scenarios where Euclidean Distance metric should not be used ?
Answer → If you have two documents/corpora of very different sizes, then taking the Euclidean distance is not ideal, and the cosine similarity metric is useful in those scenarios. In other words, when comparing large documents to smaller ones with Euclidean distance, one could get an inaccurate result. Example :-
Assumption #1.) Imagine that we have 2 dimensions : “eggs” and “disease”.
Assumption #2.) Assume that, we are into the vector space where the Corpora are represented by the occurrence of the words disease and eggs.
Assumption #3.) Also, suppose that, we have got 3 different corpuses and if we represent these corpuses in the form of vectors, then their vector representation looks like below :-
- The Food-Corpus can be represented in vector format as (5, 15) → This means that, in the food corpus, the word "eggs" appears 5 times and the word "disease" appears 15 times.
- The Agricultural-Corpus can be represented in vector format as (20, 40) → This means that, in the agricultural corpus, the word "eggs" appears 20 times and the word "disease" appears 40 times.
- The History-Corpus can be represented in vector format as (30, 20) → This means that, in the history corpus, the word "eggs" appears 30 times and the word "disease" appears 20 times.
Conclusions #.) Let d1 be the Euclidean distance between the agriculture and food corpora, and d2 the Euclidean distance between the agriculture and history corpora. As per Euclidean distance, d2 < d1, and hence we would conclude that the agriculture and history corpora are more similar than the agriculture and food corpora (see the NumPy check after this list).
- In the aforementioned situation, it's evident that the sizes of these vectors are quite different, and hence using Euclidean distance is not advisable.
- In other words, the word totals in the corpora differ from one another. In fact, the agriculture and the history corpus have a similar number of words, while the food corpus has a relatively small number.
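A quick NumPy check of the two distances, using the corpus vectors listed above:

```python
import numpy as np

# (eggs, disease) counts per corpus, as in the example above.
food = np.array([5, 15])
agriculture = np.array([20, 40])
history = np.array([30, 20])

d1 = np.linalg.norm(agriculture - food)     # ~29.15
d2 = np.linalg.norm(agriculture - history)  # ~22.36
print(d1, d2)  # d2 < d1: Euclidean distance says agriculture is "closer" to history
```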
Question → What‘s the solution to the aforementioned problem ?
Answer → The cosine similarity metric could help you overcome this problem.
- It basically makes use of the cosine of the angle between two vectors. Based on that, it tells you whether two vectors are close or not.
- If the angle is small, the cosine would be close to one. As the angle approaches 90 degrees, the cosine approaches zero.
Back to our above example :-
- The angle Alpha between food and agriculture corpus is smaller than the angle Beta between agriculture and history.
- This means that the vectors food and agriculture are closer/more similar to each other than the vectors agriculture and history.
- The cosine similarity uses the angle between the documents and is thus not dependent on the size of the corpora.
Question → How is the Cosine-Similarity-Score between two vectors computed ?
Answer → Let's consider the corpora from the aforementioned example. Recall that in this example, you have a vector space where the representations of the corpora are given by the number of occurrences of the words "disease" and "eggs". The cosine similarity between two vectors v and w is the cosine of the angle between them: (v · w) / (‖v‖ × ‖w‖).
Computation #1.) Cosine-Similarity-Score between the 2 vectors "Agriculture" and "History". The angle between those vector representations is denoted by Beta. The agriculture corpus is represented by the vector v, and the history corpus by the vector w :- ((20 × 30) + (40 × 20)) / (√(20² + 40²) × √(30² + 20²)) ≈ 0.87
Computation #2.) Cosine-Similarity-Score between the 2 vectors "Agriculture" and "Food". The angle between those vector representations is denoted by Alpha :- ((5 × 20) + (15 × 40)) / (√(5² + 15²) × √(20² + 40²)) ≈ 0.99. (A short Python check of both scores follows below.)
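A minimal Python sketch that reproduces both scores (the helper name cosine_similarity is just for illustration):

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between v and w: (v . w) / (||v|| * ||w||)."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

food = np.array([5, 15])
agriculture = np.array([20, 40])
history = np.array([30, 20])

print(cosine_similarity(agriculture, history))  # ~0.87 (angle Beta)
print(cosine_similarity(agriculture, food))     # ~0.99 (angle Alpha)
```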
Question → What does this Cosine-Similarity-Score metric tell you ?
Answer → This score tells us about the similarity of two different vectors.
Case #1.) Consider the case when two vectors are orthogonal. Since the vector components here are word counts (and therefore non-negative), the maximum possible angle between a pair of vectors is 90 degrees. In that case, the cosine is equal to 0, which means that the two vectors have orthogonal directions, i.e. they are maximally dissimilar.
Case #2.) Consider the case where the vectors have the same direction. In this case, the angle between them is 0 degrees and the cosine is equal to 1, because the cosine of 0 is just 1.
Notes :- As you can see :-
1.) The cosine similarity takes values between 0 and 1 (for vectors of word counts). The closer the cosine of the angle between two vectors is to 1, the closer their directions are.
2.) The higher the cosine-similarity-score between two vectors, the more similar they are to each other.
- In other words → This metric (cosine-similarity-score) is proportional to the similarity between the directions of the vectors.
- Put it simply → Similar vectors have higher cosine-similarity-scores.
3.) If you take the cosine similarity score of a vector with itself, you will get 1. If the vectors are perpendicular, it will give you 0 (see the quick check below).
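A quick check of both edge cases, reusing the same cosine_similarity helper sketched above:

```python
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([3, 4])
w = np.array([-4, 3])  # perpendicular to v (their dot product is 0)

print(cosine_similarity(v, v))  # 1.0 -> a vector compared with itself
print(cosine_similarity(v, w))  # 0.0 -> perpendicular vectors
```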
That’s all in this blog. We shall see you in next blog.