Deep Dive with Cassandra || Part-1
Question #1.) What’s the design of Cassandra ?
Answer → Cassandra is a distributed database, which means :-
- Decentralised → Our application can connect to any of the DB nodes. For example, picture a Cassandra cluster with 6 nodes, organised in a ring.
- Ubiquitous → Our application doesn’t need to connect to one specific node; it can connect to any node, because all nodes are peers and there is no dedicated master. That’s why there is no single point of failure (SPOF) while working with Cassandra.
- Replication → There are multiple copies of your data, and those copies are stored on different nodes/servers.
- Fault-Tolerant → Even if one of the servers goes down, our application can continue to operate as before.
- Geographically distributed data centres → Usually the best way to set up a Cassandra cluster is to spread it across data centres in different geographical regions, so that users are served from a nearby data centre and experience low latency.
Question #2.) How does Cassandra replicate data ?
Step #1.) With Cassandra, multiple copies of a single record are stored on different nodes. Because these copies are not all updated in one atomic step, Cassandra does not offer full ACID guarantees.
Step #2.) Let’s say we have configured the replication factor as 3. In that case, 3 copies of the same data are stored across 3 different servers. However, at a low consistency level such as ONE (a common default), Cassandra returns success as soon as the data has been written to just one node. This behaviour is configurable.
Step #3.) This can lead to the problem of stale reads (sometimes loosely called dirty reads). For example :-
- Say the data has been written to Node-1, and this update has not yet reached Node-2 and Node-3.
- Meanwhile, a read request for this record lands at Node-3.
- In this scenario, the older data is returned.
Step #4.) In order to avoid this problem, Cassandra provides tunable consistency.
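The idea behind tunable consistency can be sketched in a few lines of Python. This is only a toy simulation, not how Cassandra is implemented: the `Versioned` class, the `read` helper and the seat values are made up for illustration. It assumes a replication factor of 3 and a write that was acknowledged by a quorum (Node-1 and Node-2) while Node-3 is still lagging. A read that contacts only one replica may return the old value, whereas a read that contacts a quorum always overlaps with the write and returns the newest value.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Versioned:
    value: str
    timestamp: int  # write timestamp; the newest timestamp wins on read


def quorum(replication_factor: int) -> int:
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def sets_overlap(write_acks: int, read_acks: int, replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return read_acks + write_acks > replication_factor


def read(replicas, contacted):
    """Return the newest value among the replicas that were contacted."""
    answers = [replicas[i] for i in contacted]
    return max(answers, key=lambda v: v.timestamp).value


RF = 3
# Hypothetical state after a QUORUM write: two replicas updated, one lagging.
replicas = [
    Versioned("CUST002", timestamp=2),  # Node-1 (updated)
    Versioned("CUST002", timestamp=2),  # Node-2 (updated)
    Versioned("CUST001", timestamp=1),  # Node-3 (lagging)
]

# Read at ONE that happens to land on Node-3: the old value comes back.
stale_read = read(replicas, contacted=[2])

# Read at QUORUM: try every possible pair of replicas.
quorum_reads = {
    read(replicas, pair) for pair in combinations(range(RF), quorum(RF))
}

print(stale_read)    # the old value
print(quorum_reads)  # only the new value, whichever pair was contacted
```

The rule of thumb the sketch encodes is that reads see the latest write when the number of replicas read plus the number of replicas written is greater than the replication factor.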
Question #3.) What are the benefits of Cassandra ?
Question #4.) In which scenarios, should we be using Cassandra ?
Question #5.) What does Columnar data mean ?
Answer :- In order to understand a columnar database, let’s imagine that we have an eCommerce platform with millions of products.
Step #1.) Now, each product may have different attributes :-
Step #2.) If we store these products in an RDBMS, many columns will be empty for many of the rows.
Step #3.) Some columns don’t make sense for some of the products.
Below are the problems with this kind of storage :-
Step #4.) Below is how we can store the data about products in columnar format :-
Step #5.) Here is what the columnar database looks like :-
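To make Steps #2 and #3 concrete, here is a small Python sketch. The three products and their attribute names are hypothetical, invented only for this illustration. It compares a fixed-schema layout, where every row carries every column, with a sparse layout, where each row keeps only the attributes it actually has.

```python
# Hypothetical products; the attribute names are illustrative only.
products = {
    "P1": {"name": "T-shirt", "size": "M", "colour": "blue"},
    "P2": {"name": "Laptop", "ram_gb": 16, "screen_inches": 14},
    "P3": {"name": "Novel", "author": "A. Writer", "pages": 320},
}

# Fixed-schema layout: the table needs the union of all attributes as columns,
# and every row holds a cell for each of them (None when it does not apply).
all_columns = sorted({col for attrs in products.values() for col in attrs})
fixed_rows = {
    pid: [attrs.get(col) for col in all_columns]
    for pid, attrs in products.items()
}
total_cells = len(all_columns) * len(products)
empty_cells = sum(cell is None for row in fixed_rows.values() for cell in row)

# Sparse layout: each row stores only the columns it really has.
stored_cells = sum(len(attrs) for attrs in products.values())

print(f"fixed schema: {total_cells} cells, {empty_cells} of them empty")
print(f"sparse rows:  {stored_cells} cells, none empty")
```

With only three products, more than half of the fixed-schema cells are already empty; with millions of products and many product types, the share of empty cells grows further.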
Question #6.) What’s the advantage of Columnar database ?
Question #7.) What does the columnar database look like for our use case of a product catalog ?
Question #8.) Where should we NOT use Cassandra ?
Question #8.1.) Show by example, how data is written in Cassandra, assuming a cluster of 3 nodes ?
Answer → This is how Partition Key based Hashing Works in Cassandra (Murmur3Partitioner) → When a row is inserted in Cassandra, the partition key is hashed using the Murmur3 hashing function, and the resulting hash determines which node stores the data.
Let’s assume we have a train seat booking system with the following table:
CREATE TABLE seat_bookings (
    train_id    TEXT,  -- partition key
    seat_no     TEXT,  -- clustering key
    customer_id TEXT,
    PRIMARY KEY (train_id, seat_no)
);
Here:
- Partition Key → train_id
- Clustering Key → seat_no
- Primary Key → (train_id, seat_no)
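The roles of the three keys can be mimicked with plain Python dictionaries. This is a rough mental model only, not Cassandra’s storage engine, and the sample bookings are made up. The partition key groups rows together, the clustering key orders rows inside a partition, and the full primary key identifies exactly one row.

```python
from collections import defaultdict

# Hypothetical bookings as (train_id, seat_no, customer_id).
rows = [
    ("TRAIN123", "B2", "CUST003"),
    ("TRAIN456", "A1", "CUST004"),
    ("TRAIN123", "A1", "CUST001"),
    ("TRAIN123", "A2", "CUST002"),
]

# partition key -> {clustering key -> remaining columns}
partitions = defaultdict(dict)
for train_id, seat_no, customer_id in rows:
    # Writing the same (train_id, seat_no) again would overwrite the row,
    # because the primary key identifies a single row.
    partitions[train_id][seat_no] = customer_id


def read_partition(train_id):
    """All rows of one partition, ordered by the clustering key seat_no."""
    return sorted(partitions.get(train_id, {}).items())


print(read_partition("TRAIN123"))
```

Reading by `train_id` touches a single partition and returns its seats in `seat_no` order, which is why queries that filter on the partition key are the natural fit for this table.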
1. Hashing the Partition Key
If we insert data like:
INSERT INTO seat_bookings (train_id, seat_no, customer_id)
VALUES ('TRAIN123', 'A1', 'CUST001');
- The partition key is "TRAIN123".
- Cassandra applies the Murmur3 hashing function to "TRAIN123", generating a hashed token.
2. Mapping the Token to a Node
- Suppose the Murmur3 hash of "TRAIN123" produces the token -543210987654321.
- Cassandra maintains a ring of token ranges, where each node owns a range of tokens.
- The hashed token -543210987654321 falls within a specific node’s range, so that node stores the data.
Example with Multiple Nodes
Let’s say we have three nodes with these token ranges:
Node | Token Range Start | Token Range End
Node 1 | -9,223,372,036,854,775,808 | -3,000,000,000,000,000,000
Node 2 | -2,999,999,999,999,999,999 | 3,000,000,000,000,000,000
Node 3 | 3,000,000,000,000,000,001 | 9,223,372,036,854,775,807
- Our hashed token -543210987654321 falls in Node 2’s range.
- So, Node 2 will store the data.
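The lookup in this example can be sketched with a sorted list and a binary search. This is a simplified model, assuming one contiguous range per node exactly as in the three-node example; the `RING` list and the `owner` function are names chosen for this sketch, and the token is the assumed value from the example rather than a real Murmur3 output.

```python
from bisect import bisect_left

# (inclusive end of the token range, node), sorted by range end.
# The values are the ones from the three-node example.
RING = [
    (-3_000_000_000_000_000_000, "Node 1"),
    (3_000_000_000_000_000_000, "Node 2"),
    (9_223_372_036_854_775_807, "Node 3"),
]
RANGE_ENDS = [end for end, _ in RING]


def owner(token: int) -> str:
    """Return the node whose token range contains the given token."""
    if not -2**63 <= token <= 2**63 - 1:
        raise ValueError("token must fit in a signed 64-bit integer")
    # First range whose end is greater than or equal to the token.
    return RING[bisect_left(RANGE_ENDS, token)][1]


# The assumed token for "TRAIN123" from the example.
print(owner(-543210987654321))
```

Because the range ends are kept sorted, finding the owner is a binary search rather than a scan over all nodes.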
3. What Happens on Read?
If we run:
SELECT * FROM seat_bookings WHERE train_id = 'TRAIN123';
- Cassandra hashes "TRAIN123" again.
- It checks which node owns the token range where the hash falls.
- It retrieves data from that node.
4. What Happens if We Add a New Node?
When adding a new node, Cassandra:
- Reassigns some token ranges to the new node.
- Moves some data from existing nodes to balance the cluster.
- Ensures that future writes distribute data more evenly.
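Here is a small sketch of why adding a node moves only part of the data. It makes several assumptions: the hash is a stand-in built from SHA-256 (not Cassandra’s Murmur3), the hypothetical Node 4 takes over the upper half of Node 2’s range, and each node owns a single contiguous range (a real cluster may assign many smaller ranges per node).

```python
import hashlib
from bisect import bisect_left


def stand_in_token(partition_key: str) -> int:
    """Stand-in hash: NOT Murmur3, just a deterministic signed 64-bit token."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(ring, token):
    """Return the node whose (inclusive) range end is the first one >= token."""
    ends = [end for end, _ in ring]
    return ring[bisect_left(ends, token)][1]


MAX_TOKEN = 2**63 - 1
ring_before = [
    (-3 * 10**18, "Node 1"),
    (3 * 10**18, "Node 2"),
    (MAX_TOKEN, "Node 3"),
]
# Hypothetical Node 4 takes over the upper half of Node 2's old range.
ring_after = [
    (-3 * 10**18, "Node 1"),
    (0, "Node 2"),
    (3 * 10**18, "Node 4"),
    (MAX_TOKEN, "Node 3"),
]

keys = [f"TRAIN{i}" for i in range(1000)]
moved = [
    key for key in keys
    if owner(ring_before, stand_in_token(key)) != owner(ring_after, stand_in_token(key))
]

print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that moves goes from Node 2 to Node 4; keys owned by Node 1 and Node 3 stay where they are, so only a slice of the data has to be streamed to the new node.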
Key Takeaways
✅ Partition Key is Hashed → Helps determine which node stores the data.
✅ Token Ranges Distribute Data → Ensures even load balancing.
✅ Efficient Reads → Cassandra knows exactly where to look based on the hash.
✅ Automatic Rebalancing → When nodes are added, Cassandra shifts some token ranges.
Question #9.) Summarise Cassandra ?
Question #10.) Who is using Cassandra ?
Question #11.) What’s the underlying technology for HBase ?
Question #12.) Where do HDFS & HBase sit in the entire Hadoop Ecosystem ?
Question #13.) What is Hadoop ?
Question #14.) Compare Cassandra with HBase ?
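Answer → Cassandra is comparable to HBase: both are wide-column stores. The main architectural difference is that Cassandra is a masterless ring of peer nodes, whereas HBase runs on top of HDFS within the Hadoop ecosystem.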
Question #15.) Can we use Cassandra with Hadoop ?
Question #16.) Where should we use Hadoop ?
Question #17.) In which use-cases would Cassandra OR HBase be preferred ?
That’s all in this section. If you liked reading this blog, please press the clap button to show your appreciation. We will see you in the next blog.