Fastness of Kafka

aditya goel
5 min readFeb 18, 2023

Question :- What does the statement “Kafka is fast” actually mean ?

Answer → Here is what Kafka’s fastness means :-

  • It refers to Kafka’s ability to move a lot of data efficiently. Kafka is designed to move a very large number of records in a very short amount of time.
  • Think of it as a very large pipe moving liquid : the bigger the diameter of the pipe, the larger the volume of liquid that can move through it.

Question :- What design decisions help Kafka move data so quickly ?

Answer → There are primarily two design decisions that contribute to the high speed offered by Kafka :-

Design #1.) Kafka’s reliance on Sequential I/O → Kafka uses an Append-Only-Log as its primary data-structure. An Append-Only-Log adds new data only to the end of a file, so the access-pattern is purely Sequential.
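The append-only idea can be seen in a minimal sketch. This is not Kafka’s actual log format (which is richer, with batching, indexes and checksums) — just a toy class, assuming a local file `mylog.log` and byte-string records :-

```python
# A toy append-only log: every record goes to the END of the file,
# so writes are purely sequential -- no seeking, no in-place updates.
class AppendOnlyLog:
    def __init__(self, path):
        # "ab" = append-binary: the OS guarantees every write lands
        # at the current end of file.
        self.f = open(path, "ab")

    def append(self, record: bytes) -> int:
        # Return the byte offset at which this record was written.
        offset = self.f.tell()
        self.f.write(record + b"\n")
        self.f.flush()
        return offset

log = AppendOnlyLog("mylog.log")
first = log.append(b"event-1")
second = log.append(b"event-2")
print(first, second)  # offsets only ever grow: writes move forward
```

Because offsets only grow, readers can consume the log sequentially from any saved offset — the same idea behind Kafka’s consumer offsets.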

Design #2.) Kafka’s focus on Efficiency → Kafka moves a lot of data :-

  • From Network to Disk.
  • From Disk to Network.

Kafka is able to achieve this using the principle of Read with Zero-Copy.

Question :- Is Disk-Access really slower than Memory-Access ?

Answer → There is a common mis-conception that disk-access is always slower than memory-access, but this largely depends upon the disk-access-pattern. There are two types of access-patterns :-

  • Random → For an HDD, it takes time to physically move the Arm to a different location on the Magnetic-Disk. This is what makes random-access slow.
  • Sequential → Since the arm doesn’t need to jump around, it is much faster to read & write blocks of data one after another.
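The two patterns differ only in whether the writer seeks between writes. The sketch below illustrates this with a local scratch file `scratch.bin` (a hypothetical name); on an HDD the random variant would force an arm seek before every write :-

```python
# Toy illustration of sequential vs. random write patterns.
import random

BLOCK_SIZE = 4096
NUM_BLOCKS = 256
block = b"\x00" * BLOCK_SIZE

# Sequential: blocks land one after another, back to back -- the
# disk head never has to reposition between writes.
with open("scratch.bin", "wb") as f:
    for _ in range(NUM_BLOCKS):
        f.write(block)

# Random: jump to an arbitrary block position before each write --
# on an HDD, each seek() is a physical arm movement.
with open("scratch.bin", "r+b") as f:
    for i in random.sample(range(NUM_BLOCKS), NUM_BLOCKS):
        f.seek(i * BLOCK_SIZE)
        f.write(block)
```

Both loops write exactly the same bytes; only the access-pattern differs, and on rotating media that difference dominates throughput.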

Question :- What does the performance look like for Sequential and Random access-patterns with Modern Disks ?

Answer → On modern hardware with an array of disks :-

  • Sequential Writes reach hundreds of MegaBytes per second.
  • Random Writes are measured in hundreds of KiloBytes per second.

Similarly, Sequential-Access is several orders of magnitude faster for reads.

Question :- How does the cost compare between Hard-Disks and SSDs ?

Answer → Using Hard-Disks has its cost-advantage too :- Compared to an SSD, a hard-disk comes at about 1/3rd of the price, with about three times the capacity.
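A quick back-of-the-envelope check of what that ratio implies per byte, using purely illustrative prices (hypothetical numbers, not current market data) :-

```python
# If an HDD costs 1/3rd as much and holds 3x as much as an SSD,
# its cost per terabyte is roughly 9x lower.
ssd_price, ssd_capacity_tb = 300.0, 1.0              # e.g. $300 for 1 TB SSD
hdd_price, hdd_capacity_tb = ssd_price / 3, ssd_capacity_tb * 3

ssd_cost_per_tb = ssd_price / ssd_capacity_tb        # 300.0
hdd_cost_per_tb = hdd_price / hdd_capacity_tb        # ~33.3

print(round(ssd_cost_per_tb / hdd_cost_per_tb))      # 9
```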

Question :- What benefit does Kafka get from making use of HDDs ?

Answer → Giving Kafka a large pool of cheap disks, without any performance penalty, means that Kafka can retain messages for a long period of time. This was uncommon among Messaging-Systems before Kafka.
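That retention is something you configure explicitly on the broker (or per topic). For example, with illustrative values (these are real Kafka setting names, but the numbers below are just examples, not recommendations) :-

```properties
# server.properties (broker-level defaults):
# keep data for 7 days, or until a partition's log reaches ~1 GB,
# whichever limit is hit first.
log.retention.hours=168
log.retention.bytes=1073741824

# Per-topic override (set via topic configuration), in milliseconds:
# retention.ms=604800000
```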

Question :- How does Kafka move tons of data to & fro between Network & Disk ?

Answer → It is critically important to eliminate excess copies when moving lots of pages between Disk & Network. Kafka achieves this using the Zero-Copy principle.

Question :- What is the Zero-Copy Principle ?

Answer → Modern Unix Operating-Systems are highly optimised to transfer data from Disk to Network without copying the data excessively.

Question :- What does the flow of data look like with “Read without Zero Copy” ?

Answer → Here is the sequence of steps :-

Step #1.) First, the data is produced by the Producer to the Kafka-brokers.

Step #2.) Next, the data is written from the Kafka-application to the OS-Cache.

Step #3.) Next, the data is written from the OS-Cache to the Disk.

Step #4.) Next, the data is loaded from Disk to the OS-Cache.

Step #5.) Next, the data is copied from OS-Cache into the Kafka-application.

Step #6.) The data is now copied from Kafka to the Socket-Buffer.

Step #7.) The data is now copied from Socket-Buffer to the Network-Interface-Card-Buffer.

Step #8.) The data is finally sent over the network to the consumer.

This is clearly inefficient, because FOUR copies of the data are made and TWO system-calls are involved.
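Steps 5–7 can be sketched in user-space code. This is not Kafka’s actual implementation — just a minimal Python stand-in for what any server does without zero-copy, where a local socketpair plays the role of the consumer connection :-

```python
# Sketch of the NON-zero-copy read path: the application read()s file
# bytes into its own user-space buffer, then send()s them to the socket,
# paying an extra copy plus two system calls per chunk.
import os
import socket
import tempfile

def send_file_with_copies(path, sock, chunk_size=64 * 1024):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):   # OS-Cache -> user-space buffer
            sock.sendall(chunk)              # user-space buffer -> socket buffer

# Demo over a local socket pair (hypothetical consumer connection).
path = os.path.join(tempfile.mkdtemp(), "segment.log")
with open(path, "wb") as f:
    f.write(b"hello-consumer")

producer_side, consumer_side = socket.socketpair()
send_file_with_copies(path, producer_side)
producer_side.close()
print(consumer_side.recv(1024))  # b'hello-consumer'
```

Every chunk crosses the kernel/user-space boundary twice — once on `read()` and once on `sendall()` — which is exactly the overhead zero-copy removes.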

Question :- What does the flow of data look like with “Read with Zero Copy” ?

Answer → Here is the sequence of steps :-

Step #1.) First, the data is produced by the Producer to the Kafka-brokers.

Step #2.) Next, the data is written from the Kafka-application to the OS-Cache.

Step #3.) Next, the data is written from the OS-Cache to the Disk.

Step #4.) Next, the data is loaded from Disk to the OS-Cache.

Step #5.) With Zero-copy, the Kafka application uses a System-Call called “sendfile()”, to tell the Operating-System to directly copy the data from the OS-Cache to the Network-Interface-Card-Buffer.

In this path, the only copy is from the OS-Cache into the NIC-Buffer.
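The same `sendfile()` system call is exposed in Python as `os.sendfile` (Linux and macOS), so the zero-copy path can be sketched directly; again, a local socketpair stands in for the consumer connection :-

```python
# Sketch of the zero-copy read path: os.sendfile() asks the kernel to
# move bytes from the OS-Cache straight to the socket -- the data never
# enters this process's user-space buffers.
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "segment.log")
with open(path, "wb") as f:
    f.write(b"zero-copy-payload")

producer_side, consumer_side = socket.socketpair()
with open(path, "rb") as f:
    # out_fd, in_fd, offset, count: one syscall replaces the whole
    # read()/send() loop of the non-zero-copy version.
    sent = os.sendfile(producer_side.fileno(), f.fileno(), 0,
                       os.path.getsize(path))
producer_side.close()
print(sent, consumer_side.recv(1024))
```

One system call replaces the per-chunk `read()`/`send()` pair, and the payload bytes never touch the application’s own buffers.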

Question :- How is “Read with Zero Copy” performed ?

Answer → With a modern Network-Card, this copying is done using DMA (Direct Memory Access). When DMA is used, the CPU is not involved, making it even more efficient.

That’s all in this blog. If you liked reading it, do clap on this page. We shall see you in the next blog.
