Bloom Filter Algorithm for Search
Question:- What is a Bloom Filter ?
Answer → A Bloom-Filter is a Space-Efficient probabilistic Data-Structure. It’s been around for 50 years. It’s used to answer the question like : Is this element in the Set ?
Question:- What are the practical applications of Bloom Filter ?
Answer → A Bloom Filter is a Data-Structure with many practical applications. It can be found in Browsers, Network-Routers and Databases just to name a few.
Question:- What can be the practical application, where Bloom Filters can be used ?
Answer → A Bloom Filter is used to answer this question : If this Element is present in the Set ? The Bloom-Filter would answer with either a “Firm No” or “Probably Yes”. This “Probably Yes” part makes the Bloom-Filters probabilistic.
- The False-Positives are possible i.e. The element is NOT present in the in Set, but Bloom Filter says, it is present indeed.
- The False-Negatives are Not possible i.e. The element is present in the Set, but Bloom Filter says, it is NOT present.
Question:- What’s the TradeOff here with Bloom Filters ?
Answer → In exchange of providing sometimes an Incorrect False-Positive Answer, a Bloom-Filter consumes lot less memory than a Data-Structure like a Hash-Table, that would provide correct answer all the times.
Question:- What’s that thing, that we must take care while using the Bloom Filters ?
Answer → If our use-case can tolerate Some False-Positives and can’t tolerate the False-Negatives, then we can go for the Bloom-Filters.
Question:- How does Bloom Filters works ?
Answer → The critical ingredient to a Bloom-Filter is some good hash-functions.
- These Hash-functions should be fast and they should produce outputs that are evenly and randomly distributed.
- Collisions are OK, as soon as they are rare.
Understanding → A Bloom-Filter is a large set of buckets, with each bucket containing a single bit and they all start with a Zero. Let’s imagine that, we want to keep track of the food I like. For this example :-
Step #1.) We would start with a Bloom Filter of size 10, labelled from 0 to 9 :-
Step #2.) Now, for every incoming element :-
- We shall be passing it through THREE distinct Hash-Functions.
- Each Hash-Function would end-up returning a value from range 0 to 9.
For example, we want to put the element “Ribeye” a type of meat, into the Bloom-Filter. So, upon passing this element through the THREE Hash-Functions :-
- Imagine that upon passing the element “Ribeye” through Hash-Function-1, we got the value as 1. This means that, the value at Index 1 would be marked as 1.
- Imagine that upon passing the element “Ribeye” through Hash-Function-2, we got the value as 3. This means that, the value at Index 3 would be marked as 1.
- Imagine that upon passing the element “Ribeye” through Hash-Function-3, we got the value as 4. This means that, the value at Index 4 would be marked as 1.
Step #3.) Let’s take another example, we want to put the element “Potato” into the Bloom-Filter. So, upon passing this element through the THREE Hash-Functions :-
- Imagine that upon passing the element “Potato” through Hash-Function-1, we got the value as 0. This means that, the value at Index 0 would be marked as 1.
- Imagine that upon passing the element “Potato” through Hash-Function-2, we got the value as 4. This means that, the value at Index 4 would be marked as 1.
- Imagine that upon passing the element “Potato” through Hash-Function-3, we got the value as 8. This means that, the value at Index 8 would be marked as 1.
Test #1.) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Potato” ?
- Since the same input always hashes to the same output, through the same Hash-Functions, element “Potato” still hashes to values : [0, 4, 8].
- Next, we check all those buckets i.e. we check whether the value at the Indices [0, 4, 8] are 1 or not ?
Upon checking all these indices with the Bloom-Filter, since all of these indices are ONE, therefore Bloom-Filter declares that : Yes, I like the element “Potato”.
Test #2.) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Pork Chop” ?
- As a first thing, we pass the element “Pork Chop” through the THREE Hash-Functions, say that three hash-functions returns the values : [1, 5, 8]. Note that, these values can be anything else as well.
- Next, we check all those buckets i.e. we check whether the value at the Indices [1, 5, 8] are 1 or not ?
Upon checking all these indices with the Bloom-Filter, since all of these indices are NOT ONE, therefore Bloom-Filter declares confidently that : NO, I don’t like the element “Pork Chop”.
Test #3 → Case of False-Positive) Now, let’s test our Bloom-Filter, by inquiring from it, whether I liked the “Lemon” ? As a first thing, we pass the element “Lemon” through the THREE Hash-Functions, say that three hash-functions returns the values : [1, 4, 8]. Next, we check all those buckets i.e. we check whether the value at the Indices [1, 4, 8] are 1 or not :-
- Recall that, The value at Index 1, was set as ONE, when “ribeye” element was ingested into our Bloom-Filter.
- The value at Index 4, was set as ONE, when elements “Potato” and “ribeye” were ingested into our Bloom-Filter.
- The value at Index 8, was set as ONE, when “Potato” element was ingested into our Bloom-Filter.
Upon checking all these indices with the Bloom-Filter, since all of these indices are already ONE (being set 1 by some other elements), therefore Bloom-Filter still ends-up declaring that : Yes, I like the element “Lemon”. This is the case of False-Positive. Here, The element(“Lemon”) is NOT present in the Set, but Bloom Filter still says, it is present indeed.
Question:- What’s the size of the Bloom Filters in real-life ?
Answer → The size of the Bloom Filters in practical life is very big, as compared to the size being shown here in the example above.
Question:- Explain how Akamai’s CDN uses Bloom Filters ?
Answer → Akamai’s CDN uses Bloom-Filters to prevent Caching “One Hit Wonders”.
Question:- What is One Hit Wonders ?
Answer → These are web-pages, which are only requested once. According to the Akamai, 75% of the pages are One Hit Wonders.
Question:- How does Akamai prevent caching of One Hit Wonders ?
Answer → Bloom-Filter basically tracks all the URLs seen.
- For every request (say the web-page is being viewed for the first time), it ends-up ingesting an entry into the Bloom-Filter.
- Basically, the process is : The Webpage is ran through multiple Hash-Functions and the index-values being returned by these multiple hash-functions are set to ONE.
- For every subsequent request, it checks whether this entry is present into the Bloom-Filter. Only, if it’s present, then only the web-page is Cached otherwise Not.
Question:- What’s the advantage of preventing the caching of One Hit Wonders ?
Answer → Preventing the caching of One Hit Wonders, significantly reduces the Caching Workload at Akamai’s end and increases the Cache-Hit-Rate as well.
Question:- Explain how Modern Browsers uses Bloom Filters ?
Answer → Web Browsers like Google-Chrome used to use Bloom-Filters :-
- To identify, whether a particular URL is malicious OR NOT ?
- Only when Bloom-Filter declares that, an URL is probably malicious, then only it performs the fully blown check of verifying whether the URL is really malicious or not ?
Question:- Does Browsers still uses Bloom Filters, even today ?
Answer → Since there are millions of malicious URLs possible, we need a more efficient solution than this.
Question:- Explain how Password Validators uses Bloom Filters ?
Answer → Some Password-Validators uses Bloom-Filters to prevent users from using weak passwords. This is how it would work :-
- All the possible weak passwords would be first indexed into the Bloom-Filter.
- Whenever any User intends to create the password, that same password is first checked into the Bloom-Filter.
==> If that password is well present into the Bloom-Filter, simply we can ask the user to come up with a new password.
==> If that password is NOT present into the Bloom-Filter, it means that it’s indeed not a weak password and user may proceed ahead with it.
Note that, at times, even some of the strong passwords might become the victim of False-Positive scenario and a Strong password might be called out as a Weak password.
Question:- Explain how does NoSql databases uses Bloom Filters ?
Answer → Many NoSql-databases uses Bloom-Filters to reduce the disk reads for the keys, that doesn’t exist. With LSM-Tree based database, searching for a key that doesn’t exists requires looking through many files and is very costly operation.
That’s all in this blog. We shall see you in the next document..