Concept of TF*IDF Vectors

aditya goel
4 min read · Oct 22, 2023

Question → Explain the concept of TF-IDF ?

Answer → TF-IDF is a sparse-vector method: most entries of the vector are zeros, with only an occasional non-zero value.

1.) TF stands for Term-Frequency. Given a query and a document (a sentence OR a paragraph), it tells you → how frequent the query is in that document, compared to the length of the document. That is, TF(q, D) = f(q, D) / (sum of f(t, D) over all terms t), where :-

  • q → Stands for Query.
  • D → Stands for Document.
  • f(q, D) → The frequency of the query “q” in document “D”, i.e. how many times “q” occurs in “D”.
  • f(t, D) → The frequency of a term “t” in document “D”. Summed over all terms “t”, this gives the total number of terms in the document.
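As a minimal sketch of the TF calculation described above (the function name, the naive whitespace tokenizer and the sample sentence are all assumptions made for illustration, not the author's code):

```python
def term_frequency(query, document):
    """TF(q, D) = f(q, D) / (total number of terms in D)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    terms = document.lower().split()
    if not terms:
        return 0.0  # an empty document contains no terms at all
    return terms.count(query.lower()) / len(terms)


# Hypothetical document, not taken from the article.
doc = "the dog chased the cat"
print(term_frequency("the", doc))  # "the" occurs 2 times out of 5 terms
print(term_frequency("dog", doc))  # "dog" occurs 1 time out of 5 terms
```

Dividing by the document length is what keeps a long document from scoring higher simply because it has more words.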

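2.) IDF stands for Inverse-Document-Frequency. It is calculated for each term across the whole set of documents (not for each document separately), and is commonly computed as log(N / N(q)). A document here means either a sentence OR a paragraph. In the example below, we have 3 different documents.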

  • N → Total number of documents.
  • N (q = ‘is’) → Total number of documents that contain the query “is”.
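Here is a small sketch of the IDF calculation using N and N(q) as defined above. The base-10 logarithm is one common choice and is an assumption here, as are the function name and the three sample documents:

```python
import math


def inverse_document_frequency(query, documents):
    """IDF(q) = log10(N / N(q)), one common variant of the formula."""
    n_docs = len(documents)  # N
    # N(q): number of documents that contain the query as a whole word.
    n_containing = sum(1 for d in documents if query.lower() in d.lower().split())
    if n_containing == 0:
        return 0.0  # assumption: a term found in no document gets no weight
    return math.log10(n_docs / n_containing)


# Hypothetical documents, not taken from the article.
docs = [
    "the sky is blue",
    "the grass is green",
    "a dog is chasing the cat",
]
print(inverse_document_frequency("is", docs))   # in all 3 documents, log10(3/3)
print(inverse_document_frequency("dog", docs))  # in 1 of 3 documents, log10(3/1)
```

A word that occurs in every document gets an IDF of zero, while a word that occurs in only one document gets the largest IDF in the set.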

3.) Let’s see a couple of calculations for TF*IDF :-

Notes :-

==> The most common words (those that appear in almost every document) are often not very relevant to our query, and their IDF is close to zero.

==> Less common words are often more relevant to our query, and their IDF is higher.

4.) Now, let’s see the entire calculation :-

Question → How do we compute TF-IDF in Python ?

Step #1.) Here is what our set of 3 documents looks like :-

Step #2.) Next, this is how we can compute the term [TF * IDF] :-

Step #3.1) Now let’s check the value of [TF * IDF] for this term :-

Step #3.2) Let’s also check the value of [TF * IDF] for another term :-
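Steps #1 to #3 can be sketched roughly as follows. The three documents, the variable names `a`, `b`, `c` and the `tf_idf` helper are hypothetical stand-ins, and the base-10 logarithm for IDF is an assumption:

```python
import math

# Step 1: three hypothetical documents (not the article's originals).
a = "the dog barks at the cat"
b = "the cat sleeps on the mat"
c = "the bird sings in the tree"
docs = [a, b, c]


# Step 2: TF * IDF for one word in one document.
def tf_idf(word, sentence, corpus):
    tokens = sentence.split()
    tf = tokens.count(word) / len(tokens)  # f(q, D) / number of terms in D
    n_containing = sum(1 for doc in corpus if word in doc.split())  # N(q)
    if n_containing == 0:
        return 0.0  # assumption: unseen words score zero
    idf = math.log10(len(corpus) / n_containing)  # log10(N / N(q))
    return tf * idf


# Step 3.1: a word that appears in every document scores zero.
print(tf_idf("the", a, docs))

# Step 3.2: a word that appears only in document a scores above zero there.
print(tf_idf("dog", a, docs))
print(tf_idf("dog", b, docs))  # "dog" is absent from b, so TF is zero
```

Even though “the” is the most frequent word in document `a`, its score is zero because it appears in all three documents.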

Question → TF-IDF gives sparse vectors, but whatever we have done above doesn’t look much like a vector. So, how do we turn these scores into vectors ?

Step #1.) Let’s build the vocabulary :-

Step #2.) Now, we take this vocabulary and mirror it into a vector: for each document, we calculate the TF*IDF of every word in the vocabulary, so that each document becomes a vector with one entry per vocabulary word.

And here is the vector that we get for “a” :-

And here is the vector that we get for “b” :-
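The two steps above can be sketched as follows. This is only an illustration: the documents, the labels “a”, “b”, “c”, the sorted vocabulary order and the base-10 logarithm are all assumptions:

```python
import math

# Hypothetical documents; the labels "a", "b", "c" are assumed for illustration.
documents = {
    "a": "the dog barks at the cat",
    "b": "the cat sleeps on the mat",
    "c": "the bird sings in the tree",
}
tokenized = {name: text.split() for name, text in documents.items()}

# Step 1: the vocabulary is every distinct word across all documents.
vocab = sorted({word for tokens in tokenized.values() for word in tokens})


def tf_idf(word, tokens, corpus):
    """TF * IDF of `word` in one tokenized document, using log10 for IDF."""
    tf = tokens.count(word) / len(tokens)
    n_containing = sum(1 for doc in corpus if word in doc)
    if n_containing == 0:
        return 0.0
    return tf * math.log10(len(corpus) / n_containing)


# Step 2: one vector per document, with one entry per vocabulary word.
corpus = list(tokenized.values())
vectors = {
    name: [tf_idf(word, tokens, corpus) for word in vocab]
    for name, tokens in tokenized.items()
}

print(vocab)
print([round(v, 3) for v in vectors["a"]])
print([round(v, 3) for v in vectors["b"]])
```

Every vector has the same length as the vocabulary, and a word gets a zero entry when it is missing from the document or present in every document. With a realistic vocabulary of many thousands of words, almost all entries are zero, which is why these vectors are called sparse.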

Question → What’s the disadvantage of (TF * IDF) ?

Answer → As the number of times a query term occurs in a document increases, the score increases linearly. For example →

  • Imagine we have a 1000-word document (document-1) in which the word “dog” appears 10 times. There is a good chance that this document is talking about dogs, right !! It gets some score == TF*IDF(“dog”, document-1).
  • Now, if the word “dog” appears 20 times in another document of the same length (document-2), then the TF*IDF(“dog”, document-2) score shall double, because the TF*IDF score is directly proportional to the Term-Frequency of the word “dog”.

Now, this doesn’t mean that document-2 is twice as relevant as document-1 in the context of the query “dog”. Rather, it may well be that document-2 is only slightly more relevant than document-1. BM25 addresses this problem, and we shall discuss it in the next blog.
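To make the linear growth concrete, here is a small sketch with two synthetic 1000-word documents. The filler words, the third document (added so that the IDF of “dog” is non-zero) and the base-10 logarithm are assumptions made purely for the demonstration:

```python
import math


def make_doc(n_dog, length=1000):
    """Build a synthetic tokenized document with `n_dog` occurrences of 'dog'."""
    return ["dog"] * n_dog + ["filler"] * (length - n_dog)


doc1 = make_doc(10)  # "dog" appears 10 times in 1000 words
doc2 = make_doc(20)  # "dog" appears 20 times in 1000 words
doc3 = make_doc(0)   # no "dog" at all, so IDF("dog") is not zero
corpus = [doc1, doc2, doc3]


def tf_idf(word, tokens, corpus):
    tf = tokens.count(word) / len(tokens)
    n_containing = sum(1 for doc in corpus if word in doc)
    if n_containing == 0:
        return 0.0
    return tf * math.log10(len(corpus) / n_containing)


score1 = tf_idf("dog", doc1, corpus)
score2 = tf_idf("dog", doc2, corpus)
print(score1, score2)
print(score2 / score1)  # the ratio of the two scores
```

The IDF part is identical for both documents, so doubling the count of “dog” doubles the TF and therefore the whole score.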
