Scalability & HA with Redis Cluster
Welcome to this blog. If you are coming here directly, it’s highly recommended to read through this story first. We shall be looking at the following topics in this blog :-
- General Understanding of Scalability.
- Types of Scalability.
- Need for Scalability.
- Why Horizontal-Scaling is preferred in the context of Redis.
- Algorithmic Sharding with Redis and its disadvantage.
- The problem of Resharding and the naive solution.
- Addressing the Resharding problem with HashSlots.
- High-Availability with Redis Cluster.
- Automatic Failover with Redis Cluster.
- Split-Brain Problem with Redis Cluster and its solution.
Part #1 : Scalability Aspects with Redis Cluster
Question:- What’s the meaning of Scalability ?
Answer :- Scalability is the property of a system to handle a growing amount of work by adding resources to the system.
Question:- What are the scaling strategies available ?
Answer :- The two most common scaling strategies are :-
- Vertical scaling.
- Horizontal scaling.
Question:- What does Vertical-Scaling mean ?
Answer:- Vertical scaling, also called scaling up, means adding more resources such as CPUs or memory to your server.
Question:- What does Horizontal-Scaling mean ?
Answer:- Horizontal scaling, or scaling out, means adding more servers to your pool of resources.
Note:- Put simply, Vertical Scaling means getting a bigger server, while Horizontal Scaling means deploying a whole fleet of servers.
Question:- Can you showcase a practical example for Vertical Vs Horizontal Scaling ?
Answer:- Suppose you have a server with 128 gigabytes of RAM, but you know that your database will need to store 300 gigabytes of data. In this case, you’ll have two choices :-
- You can either add more RAM to your server, so it can fit the 300 gigabyte data set.
- You can add two more servers and split the 300 gigabytes of data between the three of them.
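The arithmetic behind the second option can be sketched in a few lines. This is only a back-of-the-envelope illustration: the helper name `servers_needed` is made up for this post, and it ignores real-world overhead such as replication buffers or memory fragmentation.

```python
import math

def servers_needed(dataset_gb: float, ram_per_server_gb: float) -> int:
    """Smallest number of equally sized servers whose combined RAM fits the dataset."""
    return math.ceil(dataset_gb / ram_per_server_gb)

count = servers_needed(300, 128)
print(count)        # number of 128 GB servers needed for 300 GB
print(300 / count)  # gigabytes each server holds if the data is split evenly
```

With three servers, each one holds about 100 gigabytes, which fits comfortably within 128 gigabytes of RAM.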
Question:- What are the usual practical reasons for which we need Scaling ?
Answer:- There are two usual reasons for Scaling :-
- Hitting your server’s RAM limit.
- Reaching the performance limits of a single server in terms of throughput, i.e. operations per second.
Question:- Why should we go for Horizontal-Scaling (Sharding) rather than Vertical-Scaling ?
Answer:- Let’s see why Horizontal-Scaling is preferable beyond a certain point :-
- Redis is mostly single-threaded, so a single Redis Server instance cannot make use of the multiple cores of your server’s CPU for command processing.
- If we split the data between two Redis instances, our system can process requests in parallel, effectively doubling the throughput.
- In fact, performance scales close to linearly as more Redis instances are added to the system.
This pattern of splitting data between multiple servers for the purpose of scaling is called Sharding.
Question:- Can you explain, what is a Shard ?
Answer:- Whenever we split the overall data between multiple Redis-Servers for the purpose of scaling, these resulting servers or processes that hold these chunks of the data are called shards.
Question:- With multiple Redis-Shards, how will a client know where to look for a given key ?
Answer:- We need to have a way to consistently map a key to a specific shard. There are multiple ways to do this. The one Redis uses is called “Algorithmic Sharding”.
Question:- How does “Algorithmic Sharding” work ?
Answer:- We define the shard for a given key, i.e. we map the key to a shard, by hashing the key and then taking the result modulo the total number of shards.
- Because the hash function is deterministic, a given key is always assigned to the same shard, as long as the number of shards stays the same.
- For example, say we have 2 Shards. We first hash the key and then mod the hash-result by 2, which gives either 0 or 1, i.e. the index of the shard.
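The hash-then-mod idea can be sketched as follows. This is a minimal illustration, not what Redis itself runs: the function name `shard_for` is hypothetical, and CRC32 is used only as a convenient deterministic hash (Python's built-in `hash()` is randomized per process for strings, so it would not work here).

```python
import zlib

def shard_for(key: str, num_shards: int) -> int:
    """Return the index of the shard that owns `key` (hash, then mod)."""
    # zlib.crc32 is deterministic: the same bytes always give the same number.
    return zlib.crc32(key.encode("utf-8")) % num_shards

# With 2 shards, every key lands on shard 0 or shard 1,
# and repeated lookups of the same key always agree.
for key in ("foo", "bar", "user:42"):
    print(key, "->", shard_for(key, 2))
```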
Question:- Now, say our business is growing, we have more data to handle and therefore we want to increase our shard count (i.e. ReSharding). How is our data-querying affected ?
Answer:- Let’s say we add one new shard, so that our total number of shards is three. Now, when a client tries to read the key foo, it runs the hash function and mods by the number of shards as before. This time, the number of shards is THREE, so we’re modding by three instead of two. Understandably, the result may be different, pointing us to a different shard. For example, the data for key “foo” was originally housed in Shard-0, but after re-sharding we are directed to Shard-1, where the key was never stored. Thus, the read may not find the data at all.
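To get a feel for how many keys are affected, here is a small experiment, assuming the same hash-then-mod scheme with CRC32 as a stand-in hash. The key names and the helper `moved_fraction` are made up for illustration.

```python
import zlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash the key, then mod by the shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard index changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
print(moved_fraction(keys, 2, 3))
```

For a well-mixed hash, a key keeps its shard only when `h % 2 == h % 3`, which happens for about one third of all hash values. So roughly two thirds of the keys end up pointing at a different shard after going from two shards to three.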
Question:- How can the Resharding issue be solved manually with the Algorithmic Sharding strategy ?
Answer:- It can be solved by rehashing all the keys in the keyspace and moving each one to the shard that matches the new shard count. This is not a trivial task, though. It can require a lot of time and resources, during which the database will not be able to reach its full performance or might even become unavailable.
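A toy version of this manual rehash is shown below, with each shard modelled as a plain Python dict. The `reshard` helper is hypothetical; a real migration would move data over the network between live servers, which is exactly where the time and resource cost comes from.

```python
import zlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash the key, then mod by the shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def reshard(old_shards: list, new_count: int) -> list:
    """Rehash every key and place it on the shard matching the new shard count."""
    new_shards = [{} for _ in range(new_count)]
    for shard in old_shards:
        for key, value in shard.items():  # every single key has to be visited
            new_shards[shard_for(key, new_count)][key] = value
    return new_shards

# Start with two shards, then grow to three.
old = [{}, {}]
for i in range(100):
    key = f"user:{i}"
    old[shard_for(key, 2)][key] = i

new = reshard(old, 3)
print([len(s) for s in old], "->", [len(s) for s in new])
```

Note that the loop touches every key in the database, even the ones that happen to stay on the same shard.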
Question:- Is there not some smarter approach for solving the problem of ReSharding with Algorithmic Sharding ?
Answer:- Redis uses a clever approach to solve this problem :-
==> Redis introduces a logical unit called a hashslot, which sits between a key and a shard.
==> The total number of hashslots in a database is always 16,384, or 16K.
Note that the hashslots are divided roughly evenly across the shards. For example :- Say we have 2 shards overall; then slots 0 through 8,191 might be assigned to shard 1, and slots 8,192 through 16,383 might be assigned to shard 2.
==> In a Redis cluster, we actually mod by the number of hashslots, not by the number of shards. Each key is assigned to a hashslot. When we do need to reshard, we simply move hashslots from one shard to another, distributing the data as required across the different Redis instances.
For example, in the above diagram, the 3rd Shard (i.e. Shard #2) was added recently, but since we always mod by the hashslot-count rather than the shard-count, the hashslot computed for a key always remains the same. Only the slot-to-shard assignment changes.
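Here is a rough sketch of the key-to-hashslot-to-shard lookup. Redis Cluster derives the slot from a CRC16 of the key modulo 16,384; Python's `binascii.crc_hqx` with an initial value of 0 computes that CRC16 variant. The helpers `even_slot_ranges` and `owner_of` are made up for this post, and the sketch simplifies two things: it ignores hash tags (the `{...}` syntax), and it models resharding by recomputing even ranges, whereas a real cluster migrates individual slots between shards.

```python
import binascii

NUM_SLOTS = 16384  # fixed number of hashslots in a Redis cluster

def hash_slot(key: str) -> int:
    """Map a key to a hashslot: CRC16 of the key, modulo 16384."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % NUM_SLOTS

def even_slot_ranges(num_shards: int) -> list:
    """Split slots 0..16383 into contiguous, roughly equal (start, end) ranges."""
    base, extra = divmod(NUM_SLOTS, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def owner_of(slot: int, ranges: list) -> int:
    """Return the index of the shard whose range contains `slot`."""
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= slot <= hi:
            return shard
    raise ValueError(f"slot {slot} is not assigned to any shard")

slot = hash_slot("foo")          # depends only on the key, never on the shard count
two_shards = even_slot_ranges(2)
three_shards = even_slot_ranges(3)
print("slot:", slot)
print("owner with 2 shards:", owner_of(slot, two_shards))
print("owner with 3 shards:", owner_of(slot, three_shards))
```

The slot for a key never changes when shards are added. What changes is the slot-to-shard table, so clients only need an up-to-date copy of that table to find the key.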
Part #2 : High-Availability with Redis Cluster
Question:- How does Redis-Cluster provide High-Availability ?
Answer:- High availability refers to the cluster’s ability to remain operational even in the face of certain failures. For example, the cluster can detect when a primary shard fails and promote a replica to a primary without any manual intervention from the outside.
Question:- How does Redis-Cluster provide Automatic-Failover ?
Answer:- Redis-Cluster can quickly detect when a primary shard has failed and promote its replica to be the new primary.
- Say we have one replica for every primary shard. If all our data is divided between three Redis Servers, we would need a six-member cluster, with three primary shards and three replicas.
- All six shards are connected to each other over TCP and constantly ping each other and exchange messages. These messages allow the cluster to determine which shards are alive.
- When enough shards report that a given primary shard is not responding to them, they can agree to trigger a fail-over and promote that shard’s replica to become the new primary. The number of shards that need to agree that a fellow shard is offline before a fail-over is triggered is configurable at the time of cluster-creation.
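The agreement step can be pictured with a toy model. This is not the actual Redis Cluster protocol: the function `should_fail_over`, the shard names and the threshold value are all assumptions made for illustration.

```python
def should_fail_over(failure_reports: dict, required_agreement: int) -> bool:
    """Decide on a fail-over once enough shards report the primary as unreachable.

    `failure_reports` maps a reporting shard's name to True when that shard
    cannot reach the primary in question.
    """
    agreeing = sum(1 for unreachable in failure_reports.values() if unreachable)
    return agreeing >= required_agreement

# Six-member cluster: primary-A stops responding, the other five report.
reports = {
    "primary-B": True,
    "primary-C": True,
    "replica-A": False,  # still believes primary-A is reachable
    "replica-B": True,
    "replica-C": True,
}
print(should_fail_over(reports, required_agreement=3))
```

A single shard losing its connection to the primary is not enough to trigger a fail-over; several shards have to observe the same failure.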
Question:- How can the Split-Brain situation happen with Redis-Cluster ?
Answer:- Let’s walk through an example.
- Imagine that we have a Redis-Cluster with THREE primary shards and one replica for every primary shard. Overall, our Redis cluster is a six-member cluster, with three primary shards and three replicas.
- Further imagine that a Network Partition has happened, i.e. the shards in the group on the left side are not able to talk to the shards in the group on the right side.
- Now, each group will think that the shards on the other side are offline, and each will trigger a fail-over for any primary shards it can no longer reach. As a result, the left side ends up with all primary shards, and so does the right side.
- Both sides, thinking they have all the primaries, will continue to receive client requests that modify data. And that is a problem, because maybe client A sets the key foo to bar on the left side, but client B sets the same key’s value to baz on the right side.
- When the network partition is removed and the shards try to rejoin, we will have a conflict, because we have two shards holding different data, both claiming to be the primary, and we wouldn’t know which data is valid. This is called a split brain situation, and it’s a very common issue in the world of distributed systems.
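The conflict can be reproduced with two plain dicts standing in for the two sides of the partition. This is a toy model only; the variable names are made up and no real Redis behaviour is simulated.

```python
# Before the partition, both sides hold the same data.
left_side = {"foo": "initial"}
right_side = dict(left_side)

# During the partition, each side believes it owns the primary and accepts writes.
left_side["foo"] = "bar"    # client A writes on the left
right_side["foo"] = "baz"   # client B writes on the right

# After the partition heals, find keys whose values diverged.
conflicts = {
    key: (left_side[key], right_side[key])
    for key in left_side.keys() & right_side.keys()
    if left_side[key] != right_side[key]
}
print(conflicts)
```

Both values were accepted by a shard that considered itself the primary, so there is no principled way to pick a winner afterwards.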
Question:- What’s the solution to the Split-Brain situation ?
Answer:- Here is the solution to this problem :-
- To prevent a split brain situation in a Redis cluster, always keep an odd number of primary shards in your cluster.
- Now, when a Network-Split happens, the left and right groups each do a count and check whether they are in the bigger group (majority) or the smaller group (minority).
- If a particular group is in the Minority, it shall NOT try to trigger a fail-over and shall NOT accept any client write requests.
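The majority check itself is simple arithmetic, sketched below. The helper `has_majority` is hypothetical and only counts voting members; it is meant to show why an odd count helps, not how Redis implements the check.

```python
def has_majority(group_size: int, total_members: int) -> bool:
    """A group is in the majority only if it holds strictly more than half the members."""
    return group_size > total_members // 2

# Odd count (3 voting members) split 2 / 1: exactly one side may continue.
print(has_majority(2, 3), has_majority(1, 3))

# Even count (4 voting members) split 2 / 2: neither side has a majority,
# so the whole cluster would stop accepting writes.
print(has_majority(2, 4), has_majority(2, 4))
```

With an odd number of voting members, a split into two groups always leaves exactly one of them in the majority, so one side can keep serving writes while the other side stands down.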
That’s all for this section. If you liked reading this blog, kindly press the clap button multiple times to show your appreciation. We will see you in the next part of this series, a Hands-On with Redis-Cluster.