Deep dive into AWS for developers | Part 6 — DataLake

aditya goel
10 min read · Mar 9, 2024

In case you are landing here directly, it is recommended that you visit this page first.

In this particular blog, we shall see the end-to-end process: Firehose delivering a data stream into S3, a Glue Crawler cataloguing that data into a DataLakeFormation table, and finally querying the same using Athena.

Question → Show the step-by-step process for creating a DataLake from Batch Data.

Answer → Here is what the process looks like for launching the DataLake using Batch Data :-

Part #1 : Creating the Firehose Stream

Step #1.) We first create the Firehose stream named “adi-firehose-stream”, choosing “Direct PUT” as the source and S3 as the destination.

Note that there are 3 types of sources which we can specify here :-

  • Kinesis Data Streams.
  • Amazon MSK.
  • Direct PUT.
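With “Direct PUT”, a producer writes records straight to the stream. Below is a minimal, hedged sketch of what such a record payload looks like; the sample record fields are hypothetical (loosely modelled on the console’s test-data generator), and the actual boto3 delivery call is shown only as a comment since it needs live AWS credentials.

```python
import json

# Hypothetical sample record, similar in spirit to the console's test data.
record = {"ticker_symbol": "ADI", "sector": "TECHNOLOGY", "price": 102.5}

# Firehose accepts raw bytes; a trailing newline keeps the objects it
# delivers to S3 parseable as JSON-lines.
payload = (json.dumps(record) + "\n").encode("utf-8")

# With boto3 (not executed here), the Direct PUT would look like:
#   import boto3
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="adi-firehose-stream",
#       Record={"Data": payload},
#   )

print(payload.decode(), end="")
```

Because Firehose simply concatenates the bytes of successive records into one S3 object, the explicit newline delimiter is what later lets the Glue Crawler split the object back into individual records.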

Step #2.) Let’s create the folder inside our existing S3 bucket where this Firehose stream shall save the data, i.e. “bronze/ingest/real-time-ingest” :-

Step #3.) Let’s configure the S3 Bucket location/folder inside the Firehose-stream configuration settings :-

Step #4.) Let’s now generate the test data into our Firehose stream, which shall automatically send the data to the S3 location specified above :-

Step #5.) Let’s verify from our S3 Bucket folder whether we have got the test-data generated into it :-

The above folder has been automatically created by the test data coming from our Firehose stream. Let’s go inside it. Since we have run the test-data generation 3 times, we can see that there are 3 files generated in our S3 bucket folder :-
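Those files land under a date-partitioned key: by default, Firehose prefixes its S3 object keys with the UTC delivery time in `YYYY/MM/dd/HH/` form, which is why an hour-wise folder appeared automatically. A small sketch of how such a key prefix is built (the folder name is from this blog; the timestamp is purely illustrative):

```python
from datetime import datetime, timezone

def firehose_key_prefix(base: str, ts: datetime) -> str:
    """Mimic Firehose's default YYYY/MM/dd/HH/ object-key prefix."""
    return f"{base}/{ts:%Y/%m/%d/%H}/"

# Illustrative delivery timestamp (UTC), not taken from the actual run.
ts = datetime(2024, 3, 9, 7, 13, tzinfo=timezone.utc)
print(firehose_key_prefix("bronze/ingest/real-time-ingest", ts))
# → bronze/ingest/real-time-ingest/2024/03/09/07/
```

This hour-wise layout is also why, later on, new test records show up under a fresh hourly folder rather than alongside the old files.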

Part #2 : Creating the Glue-Crawler

Step #1.) We now create the Glue Crawler named “Ingest-RealTimeData-Crawler”, whose job shall be to read the data from the S3 bucket folder (shown above) and create the DataLakeFormation table out of that data :-

Step #2.) Here we specify the folder location from which our Glue Crawler shall read the data: our S3 bucket “adilakeformation”, location “bronze/real-time-ingest” :-

Finally, here is what the configuration of the Glue Crawler looks like :-

Step #3.) We now create a new IAM-ServiceRole which shall be used by this Crawler in order to create the DataLakeFormation tables (we shall look at the permissions part later) :-

Here is what this configuration looks like :-

Step #4.) We now select the output location, i.e. once this crawler runs, it is going to create the table inside this particular target database “ingest” :-

Step #5.) Finally, we hit the “Create Crawler” button at the Review screen :-

Part #3 : Grant required Permissions to Crawler

Step #1.) We can now see that our crawler is ready here :-

Step #2.) We now head back to the DataLakeFormation dashboard to grant the necessary permissions (i.e. to create tables) to this Crawler’s IAM-Role “AWSGlueServiceRole-RealTimeIngestS3” :-

The second part of the permissions looks like this :-

Part #4 : Running the Crawler & verifying the data

Step #1.) We can now run our crawler, which shall read the data from the S3 folder and create the table by parsing the data automatically. Note that we have run the Crawler 2 times, the latest run starting at 7:13:03 and finishing at 7:14:30.

Step #2.) We can now verify that, once our crawler runs, it automatically creates the table and even infers the schema automatically. Note that the name of the table our crawler has created is “real_time_ingest”, which corresponds to the S3 folder name “real-time-ingest” with the hyphens replaced by underscores :-
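That renaming is no accident: Glue/Athena table names do not support hyphens, so the crawler normalises the folder name when deriving the table name. The normalisation amounts to something like this small sketch:

```python
def crawler_table_name(s3_folder: str) -> str:
    """Derive a Glue-style table name from an S3 folder name:
    lowercase, with hyphens replaced by underscores."""
    return s3_folder.lower().replace("-", "_")

print(crawler_table_name("real-time-ingest"))  # → real_time_ingest
```

Keeping this mapping in mind helps when you later query the table in Athena and need the underscore form, not the folder’s hyphenated form.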

Step #3.) Let’s also cross-verify from the DataLakeFormation dashboard whether our table has been created by the Crawler :-

Step #4.) Let’s now query our table using Athena to check whether the data is present. Note that there are 3 records in our table; recall from Part-1 that we ran the test-data-generator script 3 times, hence the 3 records :-
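The check itself is plain SQL in the Athena console. The queries below are representative of what we run; the database and table names are the ones from this blog, and the boto3 submission path is mentioned only as a comment.

```python
# Representative Athena SQL for this verification step. Programmatically,
# such queries could be submitted via
# boto3.client("athena").start_query_execution(...), which is not done here.
queries = [
    'SELECT * FROM "ingest"."real_time_ingest" LIMIT 10;',
    'SELECT COUNT(*) AS record_count FROM "ingest"."real_time_ingest";',
]
for q in queries:
    print(q)
```

The `COUNT(*)` query is the quickest way to confirm the 3-record expectation without scanning the full result set.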

Part #5 : Reverify the entire process again

Step #1.) We can now run our Test-Data-Generator again. Let’s run it 6 times. We can then verify from the S3 bucket that, overall, 6 new records have been created under the hour-wise folder in our S3 bucket :-

Step #2.) We can now run our Crawler again & wait for it to run successfully.

Below, we can see that the Crawler has run successfully, i.e. it started at 7:56:09 and finished at 7:57:35 :-

Step #3.) We can verify from the Glue Data-Catalog-Tables dashboard that the table “real_time_ingest” was updated at 7:57:34 :-

Step #4.) We can also cross-verify from the AWSLakeFormation dashboard that the table “real_time_ingest” was updated at 7:57:34 :-

Step #5.) Let’s see the data in this table now, using Athena. We can verify that the 6 new records are present in the table :-

Part #6 : Verifying the Data and moving it to the Silver bucket in Parquet format

In this part, we shall be creating a GlueJob which shall read the data from the LakeFormation DB-Table (real_time_ingest), transform the data schema a little, and dump the data into an S3 bucket in Parquet format.

Step #1.) Let’s go ahead and create the new database named “silver” in our AWS DataLakeFormation :-

The database has been created now :-

Step #2.) Let’s go ahead and create the GlueJob. Here, we have to select the Sources, series of transformations, if any and targets where data shall be stored.

We need to create an IAM-Role which shall be assumed by the aforementioned GlueJob. Note that our GlueJob would read the data from the LakeFormation DB-Table, transform the data, and dump it into S3 in Parquet format. Therefore, we have given GlueConsole and LakeFormation access to this IAM-Role.
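For reference, alongside those permission policies the role also needs a trust policy that lets the Glue service assume it. Below is a sketch of that standard trust-policy document (the permission policies themselves are the ones attached via the console, as described above):

```python
import json

# Standard trust policy allowing the AWS Glue service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

When the console creates a Glue service role for you, this trust relationship is set up automatically; it only needs manual attention if you create the role by hand or via IaC.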

Let’s associate the newly created IAM-Role with the GlueJob that we have just created :-

Step #3.) Now, we create the folder “silver” inside our bucket “adilakeformation”, in which we shall store the data.

Note that :-

  • In the DataLake world, verified data is known as silver, whereas raw, unprocessed data is called bronze.
  • Thus, inside the silver folder, we have created a new folder as shown below. In this folder, our GlueJob shall be storing the output :-

Step #4.) Next, we shall be adding the Transform step, where we have simply changed the dataType of some fields from double to string format :-
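Conceptually, this transform just stringifies the chosen fields record by record (in a Glue script it would typically be expressed with an `ApplyMapping`-style schema change). A local sketch of the per-record logic; the field names here are hypothetical:

```python
def cast_doubles_to_strings(record: dict, fields: list) -> dict:
    """Return a copy of the record with the given double (float) fields
    converted to their string representation."""
    out = dict(record)
    for name in fields:
        if name in out and isinstance(out[name], float):
            out[name] = str(out[name])
    return out

# Hypothetical input row with two double fields.
row = {"ticker_symbol": "ADI", "price": 102.5, "change": -0.35}
print(cast_doubles_to_strings(row, ["price", "change"]))
# → {'ticker_symbol': 'ADI', 'price': '102.5', 'change': '-0.35'}
```

In the visual Glue Studio editor, the same effect is achieved by editing the target data types in the transform node rather than writing this code by hand.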

Step #5.) Next, we have the Target as the S3 bucket where we shall store the data in Parquet format. Here, we have selected the S3 bucket folder that we created in the step above :-

Step #6.) The overall Job would now look like this. Let’s execute this Job :-

Let’s run this GlueJob now to make sure that the data reaches the S3 bucket from our LakeFormation DB-Table. In the below screenshot, you can see that our Job was failing earlier, but we finally made it work by addressing various issues :-

We can see that the data has now reached the S3 bucket. There were a total of 9 records in the source LakeFormation DB-Table, as shown below :-

And thus, we can see that 9 records have been created in our S3 bucket folder in Parquet format :-

Part #7 : Crawling the data from S3 Bucket and saving it to LakeFormation Table :-

In this part, we shall be creating a GlueCrawler which shall read the data from the S3 bucket in Parquet format and sink it into the LakeFormation DB “silver”.

Step #1.) Let’s go ahead and create the Crawler :-

Step #2.) Next, we specify the S3 bucket folder from which we shall read the data :-

This configuration now looks like this :-

Step #3.) Next, we specify the IAM-Role :-

Here are the permissions assigned to this IAM-Role. Note that this IAM-Role would be reading the data from S3 and sinking it into the LakeFormation DB :-

We also grant permissions to this IAM-Role from the LakeFormation dashboard, so that it can create the tables inside this DB :-

Step #4.) Next, we specify the output location where this crawler shall store the data :-

The final configuration of Crawler looks like this :-

When we hit the Create button, the crawler is created :-

Step #5.) Now, we run the crawler.

Once this Crawler runs successfully, it creates the table inside the LakeFormation DB :-

Step #6.) Next, we verify the data in this table with an Athena query :-

That’s all in this blog. We shall see you in the next blog. If you liked reading it, please do clap on this page.

References :- Data being used in this blog.


aditya goel

Software Engineer for Big Data distributed systems