
Deep dive into AWS for developers | Part 6 — DataLake

10 min read · Mar 9, 2024

In case you are landing here directly, it is recommended to visit this page first.

In this particular blog, we shall see the end-to-end process of Firehose delivering the data stream into S3, a Glue Crawler reading that data into a DataLakeFormation table, and finally querying the same using Athena.


Question → Show the step-by-step process for creating the DataLake from the Batch Data.

Answer → Here is what the process looks like for launching the DataLake using the Batch Data :-

Part #1 : Creating the Firehose Stream

Step #1.) We first create the Firehose-Stream named “adi-firehose-stream”. We have chosen “Direct PUT” as the source and S3 as the destination (a boto3 sketch of this setup follows the list of source types below).


Note that there are 3 types of sources which we can specify here :-

  • Kinesis Data Streams.
  • MSK.
  • Direct PUT.
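
As an aside, the same stream can also be created programmatically instead of via the console. Below is a minimal boto3 sketch, assuming the bucket used later in this blog (“adilakeformation”) and a placeholder delivery-role ARN; buffering and compression are left at their defaults :-

```python
import boto3

firehose = boto3.client("firehose")

# Minimal sketch: a Direct PUT stream that delivers into our bronze S3 prefix.
firehose.create_delivery_stream(
    DeliveryStreamName="adi-firehose-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        # Placeholder role ARN — use a role that Firehose can assume to write to S3.
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::adilakeformation",
        "Prefix": "bronze/ingest/real-time-ingest/",
    },
)
```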

Step #2.) Let’s create the folder inside our already existing S3 Bucket where this Firehose-stream shall be saving the data, i.e. “bronze/ingest/real-time-ingest” :-


Step #3.) Let’s configure the S3 Bucket location/folder inside the Firehose-stream configuration settings :-


Step #4.) Let’s now generate the test data into our Firehose-stream, which shall automatically send the data to the S3 location specified above :-
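
If you prefer scripting over the console’s test-data button, a record can also be pushed into the stream from code. This is only a sketch; the payload fields are made up for illustration :-

```python
import json
import boto3

firehose = boto3.client("firehose")

# Send one JSON record; Firehose buffers it and delivers it to the S3 prefix above.
payload = {"ticker_symbol": "TEST", "price": 42.17}  # illustrative fields only
firehose.put_record(
    DeliveryStreamName="adi-firehose-stream",
    Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
)
```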


Step #5.) Let’s verify from our S3 Bucket folder whether the test-data has landed in it :-


The above folder has been automatically created by the test-data coming from our Firehose-stream. Let’s go inside it. Since we have run the test-data generation 3 times, we can see that there are 3 files generated in our S3 bucket folder :-
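
The same verification can be done from code as well, assuming the prefix we configured earlier :-

```python
import boto3

s3 = boto3.client("s3")

# List whatever Firehose has delivered under our bronze ingest prefix.
resp = s3.list_objects_v2(Bucket="adilakeformation", Prefix="bronze/ingest/real-time-ingest/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```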


Part #2 : Creating the Glue-Crawler

Step #1.) We now create the Glue-Crawler named “Ingest-RealTimeData-Crawler”, whose job shall be to read the data from the S3 Bucket folder (as shown above) and create the DataLakeFormation table out of that data :-


Step #2.) Here, we specify the folder location from where our Glue-Crawler shall be reading the data: our S3 Bucket “adilakeformation” and location “bronze/real-time-ingest” :-


Finally, here is how the configuration of the Glue-Crawler looks :-


Step #3.) We now create a new IAM-ServiceRole which shall be used by this Crawler in order to create the DataLakeFormation tables (we shall look at the permissions part later) :-


Here is how this configuration looks :-


Step #4.) We now select the Output location, i.e. once this crawler runs, it is going to create the Table inside this particular Target-Database “ingest” :-


Step #5.) Finally, we hit the “Create Crawler” button on the Review screen :-
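
For reference, the crawler configured through the wizard above roughly corresponds to the following boto3 call (using the role created in Step #3 and the “ingest” database chosen in Step #4) :-

```python
import boto3

glue = boto3.client("glue")

# Rough equivalent of the console wizard: crawl the bronze folder and
# register the resulting table in the "ingest" database.
glue.create_crawler(
    Name="Ingest-RealTimeData-Crawler",
    Role="AWSGlueServiceRole-RealTimeIngestS3",
    DatabaseName="ingest",
    Targets={"S3Targets": [{"Path": "s3://adilakeformation/bronze/real-time-ingest/"}]},
)
```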


Part #3 : Grant required Permissions to Crawler

Step #1.) We can now see that our crawler is ready here :-


Step #2.) We now head back to the DataLakeFormation dashboard to grant the necessary permissions (i.e. to create tables) to this Crawler’s IAM-Role “AWSGlueServiceRole-RealTimeIngestS3” :-


The second part of the permissions looks like this :-
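
The console grants above can also be expressed through the Lake Formation API. A sketch, assuming a placeholder account id in the role ARN and a permission set that should mirror whatever you ticked in the console :-

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Allow the crawler's role to create and alter tables in the "ingest" database.
lakeformation.grant_permissions(
    Principal={
        # Placeholder account id — use your crawler role's actual ARN.
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-RealTimeIngestS3"
    },
    Resource={"Database": {"Name": "ingest"}},
    Permissions=["CREATE_TABLE", "ALTER", "DESCRIBE"],
)
```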


Part #4 : Running the Crawler & verifying the data

Step #1.) We can now run our crawler, which shall read the data from the S3 folder and create the table by parsing the data automatically. Note that we have run the Crawler 2 times, the latest run having started at 7:13:03 and finished at 7:14:30 hours.
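
The crawler can also be started and monitored from code; a minimal sketch :-

```python
import time
import boto3

glue = boto3.client("glue")

# Kick off the crawler and wait until it returns to the READY state.
glue.start_crawler(Name="Ingest-RealTimeData-Crawler")
while glue.get_crawler(Name="Ingest-RealTimeData-Crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)
print("Crawler run finished")
```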


Step #2.) We can now verify that, once our crawler runs, it shall automatically create the table and even infer the schema. Note that the name of the table (the one that our crawler has created) is “real_time_ingest”, which mirrors the folder name in S3, “real-time-ingest” :-


Step #3.) Let’s also cross-verify from the DataLakeFormation dashboard whether our table has been created by the Crawler :-


Step #4.) Let’s now query our table using Athena to check whether the data is present. Note that there are 3 records in our Table because, as you may recall from Part #1, we ran the test-data-generator script 3 times :-
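
The same check can be scripted against Athena too. A minimal sketch; the query-results location is a placeholder path in our bucket :-

```python
import time
import boto3

athena = boto3.client("athena")

# Count the records the crawler exposed in the "ingest" database.
qid = athena.start_query_execution(
    QueryString='SELECT COUNT(*) FROM "real_time_ingest"',
    QueryExecutionContext={"Database": "ingest"},
    ResultConfiguration={"OutputLocation": "s3://adilakeformation/athena-results/"},  # placeholder
)["QueryExecutionId"]

# Poll until the query finishes, then print the raw result rows.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)
print(athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"])
```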


Part #5 : Re-verifying the entire process

Step #1.) We can now run our Test-Data-Generator again. Let’s run it 6 times. We can then verify from the S3 bucket that, overall, 6 new records have been created under the hour-wise folder in our S3 Bucket :-


Step #2.) We can now run our Crawler again & wait for it to run successfully.


Below, we can see that the Crawler has run successfully, i.e. it started at 7:56:09 and finished at 7:57:35 :-


Step #3.) We can verify using the Glue Data-Catalog Tables dashboard that the table “real_time_ingest” has indeed been updated at 7:57:34 hours :-


Step #4.) We can also cross-verify using the AWSLakeFormation dashboard that the table “real_time_ingest” has indeed been updated at 7:57:34 hours :-


Step #5.) Let’s now see the data in this table using Athena. We can verify that the 6 new records are indeed present in the Table :-


Part #6 : Verifying the Data and moving it to the Silver bucket in Parquet format

In this part, we shall be creating a GlueJob which would read the data from the LakeFormation DB-Table (real_time_ingest), transform the data-schema a little and dump the data into the S3-bucket in Parquet format.

Step #1.) Let’s go ahead and create the new database named “silver” in our AWS DataLakeFormation :-
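
Since Lake Formation databases live in the Glue Data Catalog, the same database can also be created with a single boto3 call; a minimal sketch :-

```python
import boto3

glue = boto3.client("glue")

# Lake Formation databases are Glue Data Catalog databases underneath,
# so this is equivalent to creating "silver" from the console.
glue.create_database(DatabaseInput={"Name": "silver"})
```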


The database has been created now :-


Step #2.) Let’s go ahead and create the GlueJob. Here, we have to select the sources, the series of transformations (if any), and the targets where the data shall be stored.


We need to create an IAM-Role which shall be assumed by the aforementioned GlueJob. Note that our GlueJob would read the data from the LakeFormation DB-Table, transform the data and dump it into S3 in Parquet format. Therefore, we have given GlueConsole and LakeFormation access to this IAM-Role.


Let’s associate the newly created IAM-Role with the GlueJob that we have just created :-


Step #3.) Now, we create the folder “silver” inside our Bucket “adilakeformation”, into which we shall be storing the data.


Note that :-

  • In the DataLake world, verified data is also known as silver, whereas raw, unprocessed data is called bronze.
  • Thus, inside the silver folder, we have created a new folder as shown below. In this folder, our GlueJob shall be storing the output :-

Step #4.) Next, we add the Transform step, where we have simply changed the data type of some fields from double to string :-


Step #5.) Next, we set the Target as the S3 Bucket where we shall be storing the data in Parquet format. Here, we have selected the S3 Bucket-folder that we just created in the above step :-


Step #6.) The overall Job would now look like this. Let’s execute this Job :-
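
Under the hood, Glue Studio generates a PySpark script for this job. The sketch below shows roughly what it boils down to; it is not the exact generated script, and the field names in the mapping as well as the silver sub-folder name are assumptions (they depend on your demo data and on the folder created in Step #3) :-

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the bronze table that our first crawler created.
source = glueContext.create_dynamic_frame.from_catalog(
    database="ingest", table_name="real_time_ingest"
)

# Transform: cast a couple of double fields to string (field names are illustrative).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("ticker_symbol", "string", "ticker_symbol", "string"),
        ("price", "double", "price", "string"),
    ],
)

# Target: write Parquet files into the silver folder (sub-folder name is assumed).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://adilakeformation/silver/real-time-ingest/"},
    format="parquet",
)

job.commit()
```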


Let’s run this GlueJob now to make sure that the data reaches the S3-Bucket from our LakeFormation DB-Table. In the below screenshot, you can see that our Job was failing earlier, but we finally got it working by addressing the various issues :-


We can see that the data has now reached the S3-Bucket. There were a total of 9 records in the source LakeFormation DB-Table, as shown below :-


And thus, we can see that 9 records have been created in our S3 Bucket-folder in the Parquet format :-


Part #7 : Crawling the data from the S3 Bucket and saving it to a LakeFormation Table

In this part, we shall be creating a GlueCrawler which shall read the data from the S3 Bucket in Parquet format and sink it into the LakeFormation DB “silver”.

Step #1.) Let’s go ahead and create the Crawler :-


Step #2.) Next, we specify the S3 Bucket folder from where we shall be reading the data :-


The configuration now looks like this :-


Step #3.) Next, we specify the IAM-Role :-


Here are the permissions assigned to this IAM-Role. Note that this IAM-Role would be reading the data from S3 and sinking the same into the LakeFormation DB :-


We also grant the permissions to this IAM-Role from the LakeFormation dashboard, so that this IAM-Role can create the tables inside this DB :-


Step #4.) Next, we specify the output location, where this crawler shall be storing the data :-


The final configuration of the Crawler looks like this :-


When we hit the Create button, the crawler is created :-


Step #5.) Now, we run the crawler.


Once this Crawler runs successfully, it creates the table inside the LakeFormation DB :-


Step #6.) Next, we verify the data in this table using an Athena query :-
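
As in Part #4, this query can also be issued programmatically. The table name below is hypothetical, since the crawler derives it from the name of the silver sub-folder it crawled :-

```python
import boto3

athena = boto3.client("athena")

# Adjust the table name to match whatever the crawler created in the "silver" DB.
athena.start_query_execution(
    QueryString='SELECT * FROM "real_time_ingest" LIMIT 10',
    QueryExecutionContext={"Database": "silver"},
    ResultConfiguration={"OutputLocation": "s3://adilakeformation/athena-results/"},  # placeholder
)
```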


That’s all in this blog. We shall see you in the next blog. If you liked reading it, please do clap on this page.

References :- Data being used in this blog.
