Bloom Filter Algorithm for Search

aditya goel
8 min readJan 29, 2023

Question:- What is a Bloom Filter ?

Answer → A Bloom-Filter is a Space-Efficient probabilistic Data-Structure. It’s been around for 50 years. It’s used to answer the question like : Is this element in the Set ?

Question:- What are the practical applications of Bloom Filter ?

Answer → A Bloom Filter is a Data-Structure with many practical applications. It can be found in Browsers, Network-Routers and Databases just to name a few.

Question:- What can be the practical application, where Bloom Filters can be used ?

Answer → A Bloom Filter is used to answer this question : If this Element is present in the Set ? The Bloom-Filter would answer with either a “Firm No” or “Probably Yes”. This “Probably Yes” part makes the Bloom-Filters probabilistic.

  • The False-Positives are possible i.e. The element is NOT present in the in Set, but Bloom Filter says, it is present indeed.
  • The False-Negatives are Not possible i.e. The element is present in the Set, but Bloom Filter says, it is NOT present.

Question:- What’s the TradeOff here with Bloom Filters ?

Answer → In exchange of providing sometimes an Incorrect False-Positive Answer, a Bloom-Filter consumes lot less memory than a Data-Structure like a Hash-Table, that would provide correct answer all the times.

Question:- What’s that thing, that we must take care while using the Bloom Filters ?

Answer → If our use-case can tolerate Some False-Positives and can’t tolerate the False-Negatives, then we can go for the Bloom-Filters.

Question:- How does Bloom Filters works ?

Answer → The critical ingredient to a Bloom-Filter is some good hash-functions.

  • These Hash-functions should be fast and they should produce outputs that are evenly and randomly distributed.
  • Collisions are OK, as soon as they are rare.

Understanding → A Bloom-Filter is a large set of buckets, with each bucket containing a single bit and they all start with a Zero. Let’s imagine that, we want to keep track of the food I like. For this example :-

Step #1.) We would start with a Bloom Filter of size 10, labelled from 0 to 9 :-

Step #2.) Now, for every incoming element :-

  • We shall be passing it through THREE distinct Hash-Functions.
  • Each Hash-Function would end-up returning a value from range 0 to 9.

For example, we want to put the element “Ribeye” a type of meat, into the Bloom-Filter. So, upon passing this element through the THREE Hash-Functions :-

  • Imagine that upon passing the element “Ribeye” through Hash-Function-1, we got the value as 1. This means that, the value at Index 1 would be marked as 1.
  • Imagine that upon passing the element “Ribeye” through Hash-Function-2, we got the value as 3. This means that, the value at Index 3 would be marked as 1.
  • Imagine that upon passing the element “Ribeye” through Hash-Function-3, we got the value as 4. This means that, the value at Index 4 would be marked as 1.

Step #3.) Let’s take another example, we want to put the element “Potato” into the Bloom-Filter. So, upon passing this element through the THREE Hash-Functions :-

  • Imagine that upon passing the element “Potato” through Hash-Function-1, we got the value as 0. This means that, the value at Index 0 would be marked as 1.
  • Imagine that upon passing the element “Potato” through Hash-Function-2, we got the value as 4. This means that, the value at Index 4 would be marked as 1.
  • Imagine that upon passing the element “Potato” through Hash-Function-3, we got the value as 8. This means that, the value at Index 8 would be marked as 1.

Test #1.) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Potato” ?

  • Since the same input always hashes to the same output, through the same Hash-Functions, element “Potato” still hashes to values : [0, 4, 8].
  • Next, we check all those buckets i.e. we check whether the value at the Indices [0, 4, 8] are 1 or not ?

Upon checking all these indices with the Bloom-Filter, since all of these indices are ONE, therefore Bloom-Filter declares that : Yes, I like the element “Potato”.

Test #2.) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Pork Chop” ?

  • As a first thing, we pass the element “Pork Chop” through the THREE Hash-Functions, say that three hash-functions returns the values : [1, 5, 8]. Note that, these values can be anything else as well.
  • Next, we check all those buckets i.e. we check whether the value at the Indices [1, 5, 8] are 1 or not ?

Upon checking all these indices with the Bloom-Filter, since all of these indices are NOT ONE, therefore Bloom-Filter declares confidently that : NO, I don’t like the element “Pork Chop”.

Test #3 → Case of False-Positive) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Lemon” ? As a first thing, we pass the element “Lemon” through the THREE Hash-Functions, say that three hash-functions returns the values : [1, 4, 8]. Next, we check all those buckets i.e. we check whether the value at the Indices [1, 4, 8] are 1 or not :-

  • Recall that, The value at Index 1, was set as ONE, when “ribeye” element was ingested into our Bloom-Filter.
  • The value at Index 4, was set as ONE, when elements “Potato” and “ribeye” were ingested into our Bloom-Filter.
  • The value at Index 8, was set as ONE, when “Potato” element was ingested into our Bloom-Filter.

Upon checking all these indices with the Bloom-Filter, since all of these indices are already ONE (being set 1 by some other elements), therefore Bloom-Filter still ends-up declaring that : Yes, I like the element “Lemon”. This is the case of False-Positive. Here, The element(“Lemon”) is NOT present in the Set, but Bloom Filter still says, it is present indeed.

Question:- What’s the size of the Bloom Filters in real-life ?

Answer → The size of the Bloom Filters in practical life is very big, as compared to the size being shown here in the example above.

Question:- Explain how Akamai’s CDN uses Bloom Filters ?

Answer → Akamai’s CDN uses Bloom-Filters to prevent Caching “One Hit Wonders”.

Question:- What is One Hit Wonders ?

Answer → These are web-pages, which are only requested once. According to the Akamai, 75% of the pages are One Hit Wonders.

Question:- How does Akamai prevent caching of One Hit Wonders ?

Answer → Bloom-Filter basically tracks all the URLs seen.

  • For every request (say the web-page is being viewed for the first time), it ends-up ingesting an entry into the Bloom-Filter.
  • Basically, the process is : The Webpage is ran through multiple Hash-Functions and the index-values being returned by these multiple hash-functions are set to ONE.
  • For every subsequent request, it checks whether this entry is present into the Bloom-Filter. Only, if it’s present, then only the web-page is Cached otherwise Not.

Question:- What’s the advantage of preventing the caching of One Hit Wonders ?

Answer → Preventing the caching of One Hit Wonders, significantly reduces the Caching Workload at Akamai’s end and increases the Cache-Hit-Rate as well.

Question:- Explain how Modern Browsers uses Bloom Filters ?

Answer → Web Browsers like Google-Chrome used to use Bloom-Filters :-

  • To identify, whether a particular URL is malicious OR NOT ?
  • Only when Bloom-Filter declares that, an URL is probably malicious, then only it performs the fully blown check of verifying whether the URL is really malicious or not ?

Question:- Does Browsers still uses Bloom Filters, even today ?

Answer → Since there are millions of malicious URLs possible, we need a more efficient solution than this.

Question:- Explain how Password Validators uses Bloom Filters ?

Answer → Some Password-Validators uses Bloom-Filters to prevent users from using weak passwords. This is how it would work :-

  • All the possible weak passwords would be first indexed into the Bloom-Filter.
  • Whenever any User intends to create the password, that same password is first checked into the Bloom-Filter.

==> If that password is well present into the Bloom-Filter, simply we can ask the user to come up with a new password.

==> If that password is NOT present into the Bloom-Filter, it means that it’s indeed not a weak password and user may proceed ahead with it.

Note that, at times, even some of the strong passwords might become the victim of False-Positive scenario and a Strong password might be called out as a Weak password.

Question:- Explain how does NoSql databases uses Bloom Filters ?

Answer → Many NoSql-databases uses Bloom-Filters to reduce the disk reads for the keys, that doesn’t exist. With LSM-Tree based database, searching for a key that doesn’t exists requires looking through many files and is very costly operation.

That’s all in this blog. We shall see you in the next document..

--

--

aditya goel

Software Engineer for Big Data distributed systems