Deep dive into simplified Distributed Object Storage GCS

aditya goel

7 min read · Jan 21, 2024

Question → What is an Object Storage Solution ?

Answer → AWS’s S3 bucket or GCP’s GCS bucket is a type of Object Storage solution, which means that :-

It is a managed solution in the cloud which is designed for scaling. Here, we can read objects, write objects and delete objects.
Here, we first create a Bucket, which is where we end up storing the objects. A Bucket is just a logical unit of storage which keeps its objects in a scope separate from the objects of other buckets. A Bucket name must be unique across the whole of S3. The maximum size of a single Object that can be stored here is 5 TB.
An Object could be any file: a text file, a binary file, an audio file, a video file, anything.

Question → What is the main difference between Object Storage Solution AND Block Storage Solution ?

Answer → The main difference between the Object-Storage and Block-Storage is that →

If you change even one character in a huge 1 GB file then, in the case of Block-Storage, only those blocks which contain that particular character need to be rewritten, whereas in Object Storage the whole file has to be written again.
This difference affects Throughput, Efficiency and Cost as well, because in Block-Storage it is a very small change, whereas in Object Storage it is a whole-file change.
So, for such in-place updates, Block Storage is faster and uses less bandwidth than Object Storage, because the change is minimal.
The cost of Block-Storage is higher than that of Object Storage, because of the higher efficiency offered by Block-Storage.

Question → What are the Functional requirements & scale at which we need to build this system ?

Answer → Daily scale looks like this :-

Question → What are the Non Functional Requirements ?

Question → Explain the overall working of the S3 File Upload System ?

Answer → As the data comes-in, it is received by the API-Service. It performs 2 important things :-

It writes the data to the Data-Storage.
It writes the meta-data to the Metadata-Storage.

Question → Explain about the Metadata Service ?

Answer → Since we have a lot of files, we would need to shard our metadata database :-

We would first create a unique hashID (referred to as the UUID below) on the basis of : (UserId, FileName, Bucket, Prefix). This hashID uniquely identifies a particular file from a particular user in a particular bucket under a particular prefix.
This hashID would in turn be used with Consistent-Hashing to find the server (shard) on which the metadata of this file shall be stored. Note that in this design we are not using an off-the-shelf NoSQL DB; rather, we are building our own NoSQL DB.

Metadata-Service keeps track of the files.
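
The hashID and Consistent-Hashing steps above can be sketched roughly as below. This is only a minimal illustration and not the actual implementation: the use of SHA-256, the virtual-node count and the names `file_hash_id`, `ConsistentHashRing` and `meta-1`, `meta-2`, `meta-3` are all assumptions made for this example.

```python
import bisect
import hashlib


def file_hash_id(user_id: str, bucket: str, prefix: str, file_name: str) -> str:
    """Deterministic hashID for (UserId, FileName, Bucket, Prefix). SHA-256 is an assumption."""
    key = "\x00".join([user_id, bucket, prefix, file_name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class ConsistentHashRing:
    """Hypothetical ring; each server owns `vnodes` points to spread the load."""

    def __init__(self, servers, vnodes=64):
        self._ring = sorted(
            (self._point(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _point(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, hash_id: str) -> str:
        # Walk clockwise to the first ring point after the key, wrapping around.
        idx = bisect.bisect_right(self._points, self._point(hash_id)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["meta-1", "meta-2", "meta-3"])
hash_id = file_hash_id("user-42", "photos", "2024/jan/", "cat.jpg")
shard = ring.server_for(hash_id)  # the metadata shard for this file
```

The useful property here is that when a server leaves the ring, only the keys that were on that server move; keys on the other servers stay where they were.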

Question → Explain how the actual-file (i.e. data-bytes) is actually being uploaded to the System ?

Answer → As we receive the data from the Customer :-

It goes through the API-Service and is then streamed to the Data-Manager-Service.
Data-Manager-Service first computes the UUID of the file on the basis of : (UserId, FileName, Bucket, Prefix) and sends the unique UUID to the Placement Service.
Placement Service decides upon the actual server, where a given file shall be stored on the basis of modulo-operation.
It computes the Primary-Node where the file would be stored as well as replica-nodes where the replica of the data would go to.
Once the primary & secondary data-nodes have been decided, the Data-Manager-Service streams the data-file to those nodes.
The Data-Node first saves the file on the primary-node & then makes sure the data-file is stored successfully on the replica-nodes as well.
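
A rough sketch of the modulo-based placement described above is given below. The function name `place` and the choice of taking the next nodes in the list as replica-nodes are assumptions made for illustration; a real Placement Service would also look at AZs, racks and capacity.

```python
import hashlib


def place(file_uuid: str, data_nodes: list[str], replication_factor: int = 3):
    """Return (primary_node, replica_nodes) for a file UUID using a modulo operation."""
    if replication_factor > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication factor")
    digest = int(hashlib.sha256(file_uuid.encode("utf-8")).hexdigest(), 16)
    primary_idx = digest % len(data_nodes)
    # Assumption: replicas are simply the next nodes in the list, wrapping around.
    chosen = [
        data_nodes[(primary_idx + i) % len(data_nodes)]
        for i in range(replication_factor)
    ]
    return chosen[0], chosen[1:]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
primary, replicas = place("some-file-uuid", nodes)
```

One thing to keep in mind with a plain modulo is that the result depends on the number of nodes, so adding or removing a node changes the placement of most files.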

Question → How much is the replication factor considered ?

Answer → We can store each file with a replication-factor of 3, i.e. 3 copies of the file overall. This costs 200% additional space.

We can store 2 copies of the data in the same Region but in different AZs. Having the 2 copies of the data in different AZs can provide enough durability to the overall system.
The third copy can go to a different Region asynchronously. This copy would come in handy in case an entire Region fails.

Once the 2 copies of the file have been saved successfully, we can return 200 OK from the Data-Manager-Service to the Customer.
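
The acknowledgement flow above (2 synchronous copies, third copy asynchronous) could look roughly like this. It is an in-memory sketch only; `InMemoryNode`, `upload` and `drain_cross_region` are hypothetical names and the queue stands in for whatever asynchronous replication mechanism is actually used.

```python
from collections import deque


class InMemoryNode:
    """Stand-in for a data node; it just keeps blobs in a dict."""

    def __init__(self, name: str):
        self.name = name
        self.blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data


def upload(key, data, same_region_nodes, cross_region_queue) -> int:
    # Write synchronously to the 2 nodes in different AZs of the same Region.
    for node in same_region_nodes:
        node.put(key, data)
    # The third copy is only queued here; it is shipped to another Region later.
    cross_region_queue.append((key, data))
    return 200  # acknowledged once the 2 local copies are durable


def drain_cross_region(cross_region_queue, remote_node) -> None:
    """Background step that ships queued copies to the other Region."""
    while cross_region_queue:
        key, data = cross_region_queue.popleft()
        remote_node.put(key, data)


az1, az2, remote = InMemoryNode("az-1"), InMemoryNode("az-2"), InMemoryNode("region-2")
queue = deque()
status = upload("file-1", b"payload", [az1, az2], queue)
```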

Question → Explain about the Placement-Service aka Cluster Manager ?

Answer → Placement-Service → This is responsible for maintaining the state of the cluster of the data nodes.

Placement Service decides upon the actual server, where a given file shall be stored. It computes the Primary-Node where the file would be stored as well as replica-nodes where the replica of the data would go to.
The Placement Service is responsible for knowing the physical & logical architecture & location of the servers.
We can have data nodes in different data centres, different racks, different AZs & different Regions. It also knows about their Data Capacity, CPU & RAM utilisation and overall machine utilisation.
As new nodes are added, they register with the Placement Service when they join the cluster. They send all the information about their state, location, etc. to the Placement-Service.

This way, we can add capacity as much as we want and Placement-Service will start using that for Data-Storage.

In case we lose a Node :-

The Placement-Service would stop receiving the heartbeat & Placement-Service would then de-register it from the cluster.
The Placement-Service would also start to replicate the data to other nodes, so as to maintain the replication-factor.

That’s how the Placement-Service manages the Server-Pool.
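
The register / heartbeat / de-register cycle described above can be sketched as follows. The class and method names and the 10-second timeout are assumptions for this example, and the current time is passed in explicitly to keep the sketch easy to follow.

```python
class PlacementService:
    """Hypothetical cluster manager that tracks live data nodes via heartbeats."""

    def __init__(self, heartbeat_timeout: float = 10.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen = {}  # node name -> time of the last heartbeat
        self.node_info = {}  # node name -> state/location reported by the node

    def register(self, node: str, info: dict, now: float) -> None:
        self.node_info[node] = info
        self.last_seen[node] = now

    def heartbeat(self, node: str, now: float) -> None:
        if node in self.last_seen:  # ignore nodes that never registered
            self.last_seen[node] = now

    def evict_dead(self, now: float) -> list[str]:
        """De-register silent nodes and return them so their data can be re-replicated."""
        dead = [n for n, seen in self.last_seen.items() if now - seen > self.heartbeat_timeout]
        for node in dead:
            del self.last_seen[node]
            del self.node_info[node]
        return dead

    def live_nodes(self) -> list[str]:
        return sorted(self.last_seen)


service = PlacementService(heartbeat_timeout=10.0)
service.register("node-a", {"az": "az-1"}, now=0.0)
service.register("node-b", {"az": "az-2"}, now=0.0)
service.heartbeat("node-a", now=8.0)
lost = service.evict_dead(now=12.0)  # node-b has been silent for 12 seconds
```

The list returned by `evict_dead` is the starting point for the re-replication step: every file that had a copy on those nodes needs one more copy elsewhere to get back to the replication-factor.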

Question → How do we read the file ?

Answer → Below is how the process happens :-

Step #1.) As we get the (UserId, FileName, Bucket, Prefix) from the user, we first fetch the unique hashId from the MetaData-Service. This operation has to be very fast and to make it fast, we can either hit the Cache (sitting on top of the Database) OR in worst-case-scenario, we would fetch it from the database.

Step #2.) Once we have the hashId, we go to the Data-Manager-Service, which in turn calls the Placement-Service to get the list of servers where the data is stored. The server-list can be computed based upon the load average of the servers, the availability of the servers and their proximity to the user.
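
The two read steps can be sketched as below: a cache-first lookup of the hashId, followed by ordering the candidate servers. The field names `available`, `distance_ms` and `load_avg`, and the choice of sorting by proximity first and load second, are assumptions made purely for illustration.

```python
def lookup_hash_id(key: tuple, cache: dict, database: dict) -> str:
    """Step 1: try the cache first and fall back to the metadata database."""
    hash_id = cache.get(key)
    if hash_id is None:
        hash_id = database[key]  # worst case: go to the database
        cache[key] = hash_id     # populate the cache for the next read
    return hash_id


def rank_servers(servers: list[dict]) -> list[dict]:
    """Step 2: drop unavailable servers, then prefer the closest and least loaded ones."""
    available = [s for s in servers if s["available"]]
    return sorted(available, key=lambda s: (s["distance_ms"], s["load_avg"]))


key = ("user-42", "cat.jpg", "photos", "2024/jan/")
database = {key: "hash-abc"}
cache = {}
hash_id = lookup_hash_id(key, cache, database)

servers = [
    {"name": "node-a", "available": True, "distance_ms": 40, "load_avg": 0.2},
    {"name": "node-b", "available": False, "distance_ms": 5, "load_avg": 0.1},
    {"name": "node-c", "available": True, "distance_ms": 5, "load_avg": 0.9},
]
read_order = [s["name"] for s in rank_servers(servers)]
```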

Question → How do we optimise the Disk ?

Answer → We do so with the approach of Erasure-Coding :-
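
To give a feel for the idea, below is a minimal sketch of the simplest possible erasure code: the data is split into k chunks and one extra XOR parity chunk is stored, so any single lost chunk can be rebuilt from the rest. With k = 4 this costs 25% additional space instead of the 200% of three full copies. This is only an illustration under that assumption; production systems generally use stronger codes such as Reed-Solomon, which tolerate more than one lost chunk. The names `encode` and `recover` are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int = 4) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) and append one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\x00")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def recover(pieces: list, missing: int) -> bytes:
    """Rebuild the piece at index `missing` by XOR-ing all the surviving pieces."""
    survivors = [p for i, p in enumerate(pieces) if i != missing]
    return reduce(xor_bytes, survivors)


pieces = encode(b"hello object storage!", k=4)
lost_copy = pieces[2]
pieces[2] = None              # pretend the node holding chunk 2 died
rebuilt = recover(pieces, 2)  # equals the lost chunk
```

Since the last chunk is zero-padded, the original length of the file has to be kept in the metadata so that the padding can be stripped on read.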

Question → Explain the approach for storing the large & small files ?

Answer → Let’s understand the approach for storing our files :-

Case #1.) Files which are smaller than 8 KB → Assume the block-size of the filesystem is 8 KB (the exact value depends on the filesystem; 4 KB is also common). If we store a file of 1 KB in its own block, we waste 7 KB of disk-space. In order to deal with this problem, we shall use an approach similar to a WAL (Write-Ahead-Log) :-

We shall store these small files sequentially in a large Read-Write-File and, as this large file grows to a specific size, we rotate it and convert it into a Read-Only-File. We then create a new file for Write-And-Read operations.
We shall also keep a state on the Data-Node that stores the location of each specific file.
Important → Thus, when we store a small file, we are given the Write-Ahead-File it was written to, the location/offset of the small file within that Write-Ahead-File and the size of the small file.
This way, we shall be able to retrieve our small files from the large Write-Ahead-File.
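
The small-file approach above can be sketched as follows. It is an in-memory illustration only: `SmallFileStore` is a hypothetical name, `io.BytesIO` buffers stand in for the large files on disk and the tiny segment size is chosen just to make the rotation visible.

```python
import io


class SmallFileStore:
    """Packs small files back to back into large segment files and indexes them."""

    def __init__(self, max_segment_bytes: int = 64 * 1024 * 1024):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [io.BytesIO()]  # the last segment is the active Read-Write one
        self.index = {}                 # file id -> (segment number, offset, size)

    def put(self, file_id: str, data: bytes):
        active = self.segments[-1]
        end = active.seek(0, io.SEEK_END)
        if end > 0 and end + len(data) > self.max_segment_bytes:
            # Rotate: the old segment is treated as Read-Only from here on.
            active = io.BytesIO()
            self.segments.append(active)
            end = 0
        active.write(data)
        location = (len(self.segments) - 1, end, len(data))
        self.index[file_id] = location
        return location

    def get(self, file_id: str) -> bytes:
        segment_no, offset, size = self.index[file_id]
        segment = self.segments[segment_no]
        segment.seek(offset)
        return segment.read(size)


store = SmallFileStore(max_segment_bytes=8)
loc_a = store.put("a.txt", b"12345")  # fits in segment 0
loc_b = store.put("b.txt", b"6789")   # would overflow segment 0, so we rotate
loc_c = store.put("c.txt", b"xy")     # appended to segment 1 after b.txt
```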

Case #2.) Files which are larger → For the larger files, we think about the Multi-Part-Upload.

Imagine we have a file of, say, 4 GB. We shall break that file into 8 parts of 512 MB each and upload those parts in parallel.
For every part of the file, we shall get an ETag, which is a checksum of that particular part.
Once all the pieces are uploaded, we send a Multi-Part-Upload-Complete request to the File-System mentioning following parts → (Upload-Id, Number-of-parts, Checksums-Of-Parts).
The system on the Backend pulls all those pieces/parts, re-assembles the whole file and stores it using the UploadId.
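
A rough sketch of the Multi-Part-Upload flow is given below. The function names, the in-memory `backend` dict and the use of SHA-256 for the per-part checksum are assumptions for illustration only and do not describe how any particular cloud provider computes its ETags.

```python
import hashlib


def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def upload_part(backend: dict, upload_id: str, part_number: int, part: bytes) -> str:
    """Store one part and return its checksum (the ETag of that part)."""
    backend[(upload_id, part_number)] = part
    return hashlib.sha256(part).hexdigest()


def complete_upload(backend: dict, upload_id: str, etags: list[str]) -> bytes:
    """Verify every part against its checksum and re-assemble the whole file."""
    assembled = bytearray()
    for part_number, etag in enumerate(etags, start=1):
        part = backend.pop((upload_id, part_number))
        if hashlib.sha256(part).hexdigest() != etag:
            raise ValueError(f"checksum mismatch for part {part_number}")
        assembled += part
    return bytes(assembled)


backend = {}
original = bytes(range(256)) * 10  # 2560 bytes standing in for a large file
parts = split_into_parts(original, part_size=1024)
# Parts are independent of each other, so a real client could send them in parallel.
etags = [upload_part(backend, "upload-1", n, p) for n, p in enumerate(parts, start=1)]
stored = complete_upload(backend, "upload-1", etags)
```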

That’s all in this blog. We shall see you in the next blog.
