Intro to Hashing and need for Consistent-Hashing
Question:- What is Hashing ?
Answer → With the process of Hashing, we pass some kind of data into a function, which gives us a result known as a Hash/HashCode.
- The data (being hashed) can be either String, Object, etc.
- Generally speaking, the hash-code value is an Integer, but it can be anything.
Question:- What is the real-life purpose of Hashing ?
Answer → Say, we have considered some Strings/Names and we pass them through Hash-Function in order to generate the HashCode.
Imagine that, we have these names being stored in database and we have to search through these names, then :-
- If the names are stored in Alphabetical Order (i.e. Sorted Order), then our search cost shall be that of Binary-Search i.e. O(Log(n)).
- If the names are NOT stored in Sorted Order, then our search cost shall be too huge i.e. O(n) because in Worst-Case, we might have to check through all the entries, in order to look for a particular string.
Now, the solution to this problem is that :- We generate the hash-codes for all of these names and try to access the names through these hash-codes. For example :-
- Whenever some-one comes and asks for details for the name : “Ankit”, we just look for a number 12.
- Whenever some-one comes and asks for details for the name : “Ankur”, we just look for a number 34.
- These hash-values (12, 42, 34, 67) actually are Array-Indices.
- The access-time for these indices is therefore constant i.e. O(1), because we are accessing elements directly by index.
Therefore, if we have some hash-values generated for these Strings, we can access them in quite a fast manner.
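The idea above can be sketched in Python. The hash codes here (12, 34, 42, 67) are the article's illustrative values, not the output of a real hash function:

```python
# A minimal sketch of hash-based lookup. The hash codes below are the
# article's illustrative values, hard-coded as an assumption.
hash_codes = {"Ankit": 12, "Ankur": 34, "Neha": 42, "Ruchi": 67}

# Imagine an array large enough to be indexed directly by the hash code.
table = [None] * 100
for name, code in hash_codes.items():
    table[code] = name  # store the record at its hash index

def lookup(name):
    # A single index access -- constant time, no scanning through entries.
    return table[hash_codes[name]]

print(lookup("Ankit"))
```

Compared with scanning an unsorted list (O(n)) or binary-searching a sorted one (O(log n)), the lookup here is a single array access.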
Question:- What can be the range of these Hash-values ?
Answer → Because, above in the blog, we have said that the Hash-Value would be an Integer, the range for these hash-values would be the range of a 32-bit Integer i.e. 0 to 2³² − 1.
Question:- What are the problems with this large range of Integer ?
Answer → Following are some of the problems with such large range :-
- In order to store these names, we can’t have such a huge array. Even if we only have the requirement to store 10,000 names, this array (of size 2³²) would be far too large.
- If we are going to map the hash-values (for these names) to the indexes of the array, we would have to have an array this large (of size 2³²), which would be a waste of a lot of space.
Question:- What’s the solution to this problem of very large array ?
Answer → After computing the Hash-Value of the name, we take the Modulus. For example →
- Let’s revisit our original simple example as quoted above : We have 4 different names. So, we go ahead and take the size of the array as twice the number of names (2 × 4), so as to accommodate future growth. Thus, the overall size of our Array shall be 8.
- Now, we shall be performing Modulus-8 upon every hash-value. We have these 4 hash-values : 12, 42, 34, 67. Once we perform the (modulus 8) operation, we get the following numbers : 4, 2, 2, 3.
- Next, we would place the above names on the indices we have got (recall that the size of our array was 8) :- “Ankit” at index 4, “Neha” and “Ankur” at index 2, and “Ruchi” at index 3.
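A minimal sketch of this modulus step, using the article's example hash codes:

```python
# Reducing the huge hash range to a small array via the modulus operation.
# The hash codes are the article's illustrative values.
hash_codes = {"Ankit": 12, "Neha": 42, "Ankur": 34, "Ruchi": 67}
ARRAY_SIZE = 8  # twice the number of names, leaving room for growth

indices = {name: code % ARRAY_SIZE for name, code in hash_codes.items()}
print(indices)  # {'Ankit': 4, 'Neha': 2, 'Ankur': 2, 'Ruchi': 3}
```

Note that "Neha" and "Ankur" both land on index 2, which is exactly the collision problem discussed next.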
Question:- What’s the problem with this approach of Modulus ?
Answer → We can see that : after performing the modulus operation on the names, it may so happen that same value is generated. For example →
- After we perform modulus-8 operation on hashcode of “Neha” (i.e. 42), then the value generated is 2.
- After we perform modulus-8 operation on hash-code of “Ankur” (i.e. 34), then again the value generated is 2.
This problem is called a Hash Collision. In this case, we save a list at that particular index. For example, we have stored both “Neha” & “Ankur” in the form of a list at Index 2 in the Array above.
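This chaining scheme can be sketched like so; again the hash codes are the article's example values:

```python
# Collision handling by chaining: each slot holds a list of entries.
hash_codes = {"Ankit": 12, "Neha": 42, "Ankur": 34, "Ruchi": 67}
ARRAY_SIZE = 8

buckets = [[] for _ in range(ARRAY_SIZE)]
for name, code in hash_codes.items():
    buckets[code % ARRAY_SIZE].append(name)

# "Neha" (42 % 8 == 2) and "Ankur" (34 % 8 == 2) collide at index 2,
# so both live in the same list at that slot.
print(buckets[2])  # ['Neha', 'Ankur']
```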
Question:- What are the use-cases of Hashing ?
Answer → One of the popular use-case of Hashing is : Saving Key-Value pairs, which helps in Caching.
- If the size of the data (sum total size of all Key-Value pairs) is small enough, then all those (Key-Value pairs) can be accommodated on ONE machine.
- If the size of the data (sum total size of all Key-Value pairs) is larger than memory of one-server, then all those (Key-Value pairs) can be accommodated on multiple machines i.e. distributed across multiple machines.
Question:- What is Horizontal Scaling and where do we use it ?
Answer → In large-scale distributed systems, data doesn’t fit on a single server. The data is distributed across many machines. This is called Horizontal Scaling.
Question:- What is the important aspect, about which we should care, while using Horizontal Scaling ?
Answer → The data should be distributed evenly across the data-servers.
Question:- Explain a use-case where we have got multiple machines to store the data ?
Answer → Imagine that, we have got 4 servers across which we are going to save the Key-Value pairs in a distributed fashion.
- Imagine that, we are using a different hash-function this time. Again taking the example of the 4 names, once we generate the hash-code of these 4 names, we get 4 values : [16, 25, 30, 23].
- Next, we have to accommodate the (Key-Value pairs) on one of these 4 servers, therefore we shall be performing the Modulus-4 operation on the afore-generated hash-values. We get the values : [0, 1, 2, 3], i.e. servers [S0, S1, S2, S3].
These values (S0, S1, S2, S3) are nothing but the server-names, where we are going to accommodate these names.
- Ankit → This key shall be stored on Server S0.
- Neha → This key shall be stored on Server S1.
- Ankur → This key shall be stored on Server S2.
- Ruchi → This key shall be stored on Server S3.
Note that, there can be multiple names being stored on a single-server, at any given moment of time.
Question:- How to access Keys from these Servers ?
Answer → Imagine that, we want to read a particular key “Ankit”.
- We shall be running this key (“Ankit”) through the same hash-function and generate the hash-code. Say, we obtain 16 as the hash-code.
- Then, we perform the Modulo-4 operation on the afore-obtained hash-code. We obtain 0, i.e. server S0, as the value.
Now, since the value thus obtained is S0, we would redirect to S0 server.
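The store-and-read routing described above can be sketched as follows, with the article's example hash codes (16, 25, 30, 23) hard-coded as an assumption:

```python
# Routing keys to servers with "hash mod number-of-servers".
# The hash codes are the article's illustrative values.
hash_codes = {"Ankit": 16, "Neha": 25, "Ankur": 30, "Ruchi": 23}
servers = ["S0", "S1", "S2", "S3"]

def route(key):
    # The same key always hashes to the same code, so it always lands on
    # the same server -- as long as the server count is unchanged.
    return servers[hash_codes[key] % len(servers)]

# Both stores and reads use the same routing function.
print(route("Ankit"))  # S0  (16 % 4 == 0)
print(route("Ruchi"))  # S3  (23 % 4 == 3)
```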
Question:- What approach do we use, in order to distribute the data equally ?
Answer → The common method to distribute the data as evenly as possible among the servers is : Simple Hashing.
Step #1.) First, for each object, we hash the object_key using the MD5 algorithm OR Murmur algorithm. This step, maps the object_key to the long range of numerical values.
Step #2.) A good hash-function distributes the hashes evenly across the entire range.
Step #3.) Next, we perform the Modulo-Operation on the Hash against the number-of-servers.
Step #4.) This step determines which server the object belongs to. As long as the number of servers stays the same, an “object_key” will always map to the same server.
Imagine that, we have got 256 servers and say, we have got an incoming key “object_key_1” :-
- First, we generate the hash-code of this “object_key_1”.
- Next, we perform the modulus-256 operation on the hash-code thus obtained.
Now, every time we have to either search OR store the object with key “object_key_1”, the same value shall be generated each time. This value is nothing but the index in the array of 256 servers.
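These steps can be sketched with Python's standard `hashlib` (using MD5, one of the two algorithms mentioned above); the key name "object_key_1" and the server count of 256 come from the article's example:

```python
import hashlib

NUM_SERVERS = 256  # the article's example cluster size

def server_index(object_key: str) -> int:
    # Step 1: hash the key with MD5, mapping it to a large numeric range.
    digest = hashlib.md5(object_key.encode()).hexdigest()
    # Step 3: take the hash modulo the number of servers.
    return int(digest, 16) % NUM_SERVERS

# Step 4: the same key always maps to the same server index,
# as long as NUM_SERVERS is unchanged.
idx = server_index("object_key_1")
print(idx, server_index("object_key_1") == idx)
```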
Question:- Demonstrate the data-distribution with an example ?
Answer → Say, we have got 8 objects to be stored across 4 different servers.
- In that case, we would first compute the hash-code of these 8 object_key and then compute the modulo-4 operation.
- The answer of the modulo-4 operation is nothing but the server index. For example, Key0 shall be stored on Server S1 and so on.
Question:- What’s the problem with this approach of Hashing ?
Answer → This approach works well only while the size of the cluster stays fixed and the data-distribution is even. The trouble begins as soon as servers are added or removed.
Question:- What happens in case of Load being increasing OR decreasing, when we use Hashing in this way ?
Answer → Imagine that, we have got 4 servers.
- Now imagine that the load is increasing, so we have to add more servers (S4, S5) to our cluster.
- Now Imagine that the load is decreasing (i.e. we have lesser number of Keys), in that case we shall have to remove one of the existing server, say S3 from our cluster.
Question:- What happens when new servers get added to meet new demand, OR when existing servers get removed ?
Answer → Say one server goes down, then we are left with 3 Servers :-
- Even though the hashes for the object keys stay the same, we are now applying the modulo operation against a different number of servers (“n”).
- The impact of this situation is quite drastic. Most of the keys would get re-distributed. This affects almost all the objects, not just the objects originally stored in the server that is now offline.
- This triggers a storm of cache misses and a lot of objects to be moved.
For situations where servers often come and go, this design is untenable.
Question:- Explain with an example, where server-count is being increased OR decreased ?
Answer → Let’s first discuss the use-case of load being reduced i.e. Number of keys being reduced.
- In this case, we would reduce the server-count by 1 and we are now left with three servers : S0, S1 and S2. Recall that the key (“Ruchi”) was earlier stored on Server S3. Now, this key would have to move to one of the other servers : S0, S1 or S2.
- Therefore, to accommodate the (Key-Value pairs) on one of these 3 servers, we shall be performing the Modulus-3 operation on the afore-generated hash-values i.e. (16, 25, 30, 23). Upon performing the modulus-3 operation, we get the values : [1, 1, 0, 2], i.e. servers [S1, S1, S0, S2].
It means that, following keys shall be stored across these servers :-
- Ankit → This key shall be stored on Server S1.
- Neha → This key shall be stored on Server S1.
- Ankur → This key shall be stored on Server S0.
- Ruchi → This key shall be stored on Server S2.
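We can verify this churn programmatically. Using the article's example hash codes, three of the four keys relocate even though only one server was removed:

```python
# How many keys move when the cluster shrinks from 4 servers to 3?
# The hash codes are the article's illustrative values.
hash_codes = {"Ankit": 16, "Neha": 25, "Ankur": 30, "Ruchi": 23}

before = {name: code % 4 for name, code in hash_codes.items()}  # 4 servers
after = {name: code % 3 for name, code in hash_codes.items()}   # 3 servers

# Keys whose server assignment changed after the resize.
moved = [name for name in hash_codes if before[name] != after[name]]
print(moved)  # ['Ankit', 'Ankur', 'Ruchi'] -- 3 of 4 keys relocate,
              # even though only server S3 was removed
```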
Question:- What’s disadvantage of moving keys in this case ?
Answer → Note that, in this case, we were only required to move the key (“Ruchi”) away from Server S3, because Server S3 would be removed. But since we changed from the modulus-4 to the modulus-3 operation, the following reorganisation of keys also happens, unnecessarily :-
- Earlier key (“Ankur”) was stored on server S2 and now this key would be stored on Server S0.
- Earlier key (“Ankit”) was stored on server S0 and now this key would be stored on Server S1.
Note that upscaling and down-scaling are basic requirements in the design of scalable systems. The main disadvantage here is that, with the normal hashing process using the modulo operation, during upscaling/down-scaling of the servers in the cluster, we have to move around a lot of keys.
Question:- How can we get rid of this movement of keys during upscaling/down-scaling of the servers ?
Answer → Consistent-Hashing is an effective technique to mitigate this issue.
Question:- Who all uses Consistent Hashing ?
Answer → Consistent Hashing is widely used in industry, for example in Amazon’s Dynamo (and DynamoDB), Apache Cassandra, and CDNs such as Akamai, for distributing data and load across nodes. We shall dive into how it works in the next part.
That’s all in this blog and we shall see you in next part of this document.