Deep Dive with Cassandra || Part-1

aditya goel
Jan 14, 2024

Question #1.) What’s the design of Cassandra ?

Answer → Cassandra is a distributed database, which means :-

  • DeCentralised → Our application can connect to any of the DB nodes. For example, a Cassandra cluster with 6 nodes is organised as a ring.
  • Ubiquitous → Our application doesn’t need to connect to one specific node; it can connect to any node, because every node acts as a master (there is no master/slave split). That’s why there is no single point of failure (SPOF) while working with Cassandra.
  • Replication → There are multiple copies of your data, and these copies are stored on different nodes/servers.
  • Fault-Tolerant → Even if any one of the servers goes down, our application can continue to operate as usual.
  • Geographically distributed DataCentres → Usually the best way to set up a Cassandra cluster is to spread it across multiple data centres in different geographical regions.

This keeps latency low for our users, because their requests can be served from a data centre close to them.

Question #2.) How does Cassandra replicate data ?

Step #1.) With Cassandra, multiple copies of a single record are stored on different nodes. Because these copies are not updated together in one atomic transaction, Cassandra does not offer full ACID guarantees across replicas.

Step #2.) Let’s say we have configured the replication factor as 3; in that case, 3 copies of the same data are stored across 3 different servers. With the default consistency level, however, Cassandra reports success as soon as the data has been written to only one of those nodes. This behaviour is configurable.
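To make the idea of "3 copies on 3 different servers" concrete, here is a minimal sketch of how replicas could be picked on a ring. It is an illustration under assumptions, not Cassandra's actual implementation: the node names, the `token` and `replicas_for` helpers, and the use of MD5 as a stand-in hash are all hypothetical. The key is hashed to a position on the ring, and the copies go to the next nodes found while walking clockwise.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Return the nodes that hold a copy of `key`, walking the ring clockwise."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    # Assumption: each node's ring position is derived from its name.
    ring = sorted((token(name), name) for name in nodes)
    positions = [position for position, _ in ring]
    # First node clockwise from the key's position (wrapping around the ring).
    start = bisect_right(positions, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


nodes = [f"node-{i}" for i in range(1, 7)]  # the 6-node ring
owners = replicas_for("product:42", nodes, replication_factor=3)
print(owners)  # three distinct node names
```

Because the placement depends only on the key and the ring, any node that receives a request can work out which nodes own the data.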

Step #3.) This can lead to the problem of stale reads (loosely called Dirty-Reads). For example :-

  • Say the data has been written to Node-1, and this update has not yet been replicated to Node-2 and Node-3.
  • Meanwhile, a read request for this record lands at Node-3.
  • In this scenario, the older data is returned.
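The scenario above can be reproduced with a toy model. This is only a sketch: the three replicas are plain dictionaries, and the `write` and `read` helpers are hypothetical names, not a Cassandra API.

```python
# Three replicas of the same record, all starting with the old value.
replicas = {"node-1": {}, "node-2": {}, "node-3": {}}
for store in replicas.values():
    store["price"] = 100


def write(key, value, acked_by):
    """Apply the write only on the replicas that have received it so far."""
    for name in acked_by:
        replicas[name][key] = value


def read(key, from_node):
    """Read the value as seen by one specific replica."""
    return replicas[from_node][key]


# The write has reached node-1 only; node-2 and node-3 are still behind.
write("price", 120, acked_by=["node-1"])

stale = read("price", from_node="node-3")
fresh = read("price", from_node="node-1")
print(stale, fresh)  # 100 120
```

Once the update reaches the remaining replicas, every node returns the new value; the window in between is where the older data can be served.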

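Step #4.) To avoid this problem, Cassandra provides tunable consistency: each read and write can specify how many replicas must respond before the operation is considered successful.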
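A common rule of thumb for choosing these settings is that a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge a write plus the number of replicas consulted on a read is greater than the replication factor. The sketch below checks that rule; the function names are hypothetical and the code only does the arithmetic, it does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_acks: int, read_acks: int, replication_factor: int) -> bool:
    """True when every read set must overlap every write set in at least one replica."""
    return write_acks + read_acks > replication_factor


rf = 3
print(reads_see_latest_write(1, 1, rf))                    # False: write to one, read from one
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: quorum writes and quorum reads
print(reads_see_latest_write(rf, 1, rf))                   # True: write to all, read from one
```

The trade-off is latency and availability: the more replicas an operation has to wait for, the slower it is and the fewer node failures it can tolerate.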

Question #3.) What are the benefits of Cassandra ?

Question #4.) In which scenarios should we use Cassandra ?

Question #5.) What does columnar data mean ?

Answer :- To understand a columnar database, let’s imagine that we have an eCommerce platform with millions of products.

Step #1.) Each product may have different attributes.

Step #2.) If we store these products in an RDBMS, many columns will be empty for many of the rows.

Step #3.) Some columns simply don’t make sense for some of the products.
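A small, made-up catalogue shows how quickly the empty cells add up. The products and column names below are hypothetical; the sketch forces every product into one fixed set of columns, as a single relational table would, and counts the cells left empty.

```python
# One fixed schema shared by all products (hypothetical columns).
columns = ["name", "screen_size", "ram_gb", "fabric", "shoe_size"]

# Each product only has values for the attributes that apply to it.
products = [
    {"name": "Phone", "screen_size": 6.1, "ram_gb": 8},
    {"name": "T-shirt", "fabric": "cotton"},
    {"name": "Sneakers", "shoe_size": 42},
]

# Lay the products out as rows of a fixed-width table; missing values become None.
table = [[product.get(column) for column in columns] for product in products]

empty_cells = sum(cell is None for row in table for cell in row)
total_cells = len(columns) * len(products)
print(f"{empty_cells} of {total_cells} cells are empty")  # 8 of 15 cells are empty
```

With millions of products and many product categories, the share of empty cells in such a table grows further.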

Below are the problems with this kind of storage :-

Step #4.) Below is how we can store the data about products in columnar format :-

Step #5.) Here is what the columnar database looks like :-
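As a rough conceptual sketch (not Cassandra's actual storage engine), such a layout can be pictured as a map from a row key to only the columns that the row actually has. The row keys, column names and the `get_column` helper are hypothetical.

```python
# Each row stores only its own columns; nothing is reserved for absent ones.
catalogue = {
    "product:1": {"name": "Phone", "screen_size": 6.1, "ram_gb": 8},
    "product:2": {"name": "T-shirt", "fabric": "cotton"},
    "product:3": {"name": "Sneakers", "shoe_size": 42},
}


def get_column(row_key, column, default=None):
    """Look up one column of one row; absent columns fall back to `default`."""
    return catalogue.get(row_key, {}).get(column, default)


# Giving one product a new attribute does not touch the other rows.
catalogue["product:2"]["colour"] = "blue"

stored_cells = sum(len(row) for row in catalogue.values())
print(stored_cells)                          # 8
print(get_column("product:3", "fabric"))     # None
print(get_column("product:2", "colour"))     # blue
```

Only the values that exist are stored, so a product with two attributes costs two cells regardless of how many attributes other products have.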

Question #6.) What’s the advantage of a columnar database ?

Question #7.) What does the columnar database look like for our use case of a product catalogue ?

Question #8.) Where should we NOT use Cassandra ?

Question #9.) Can you summarise Cassandra ?

Question #10.) Who is using Cassandra ?

Question #11.) What’s the underlying technology for HBase ?

Question #12.) Where do HDFS & HBase sit in the entire Hadoop ecosystem ?

Question #13.) What is Hadoop ?

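Question #14.) How does Cassandra compare with HBase ?

Answer → Cassandra is comparable to HBase: both are wide-column stores inspired by Google’s Bigtable. The main difference is architectural: HBase runs on top of HDFS and relies on a master node, whereas Cassandra is masterless.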

Question #15.) Can we use Cassandra with Hadoop ?

Question #16.) Where should we use Hadoop ?

Question #17.) In which use cases would Cassandra or HBase be preferred ?

That’s all for this section. If you liked reading this blog, kindly press the clap button to show your appreciation. We will see you in the next blog.

