Deep dive into AWS for developers | Part5 — DataLake

aditya goel
6 min read · Mar 6, 2024


In case you are landing here directly, it is recommended that you first visit this page.

In this particular blog, we shall walk through the end-to-end process of reading static data from an S3 bucket, loading it into an AWSLakeFormation table using a Glue Crawler, and querying it using Athena.

Question → What is the step-by-step process for creating a DataLake from Batch Data ?

Answer → Here is what the process of launching the DataLake from Batch Data looks like :-

Part #1 : Launching the IAM User

Step #1.) We first create an IAM user named “adiLakeFormation2”, with which we shall perform all subsequent steps :-

Step #2.) We would associate the following policy (“Admin”) to this IAM user :-

We also associate the AWSLakeFormationDataAdmin policy to this IAM user :-
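For readers who prefer the SDK over the console, the two policy attachments above can be sketched as parameter dicts for `iam.attach_user_policy()`. This is a minimal sketch — the assumption that the “Admin” policy is the AWS-managed `AdministratorAccess` policy is mine; only the `AWSLakeFormationDataAdmin` name comes from the screenshots:

```python
def lake_admin_policy_arns():
    """ARNs of the AWS-managed policies attached to the Lake Formation user.
    Assumes the "Admin" policy in the blog is AdministratorAccess."""
    managed = ["AdministratorAccess", "AWSLakeFormationDataAdmin"]
    return [f"arn:aws:iam::aws:policy/{name}" for name in managed]

def attach_params(user_name):
    """Build the parameter dicts one would pass to iam.attach_user_policy()."""
    return [{"UserName": user_name, "PolicyArn": arn}
            for arn in lake_admin_policy_arns()]

for p in attach_params("adiLakeFormation2"):
    print(p)
```

With a configured boto3 IAM client, each dict would be passed as `iam.attach_user_policy(**p)`.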

Part #2 : Launching the S3 Bucket

Step #1.) We now create an S3 bucket named “adibucketlakeformation”. Inside the bucket, we also create 3 folders :-

  • bronze → Indicating the place for raw data.
  • silver → Indicating the place for verified data.
  • gold → Indicating the place for processed data.

Step #2.) We would now upload a file to the S3 bucket; the file looks like this :-

Step #3.) Let’s upload this file to the S3 bucket under the “bronze/ingest/batch-person” folder :-

Step #4.) Let’s hit the upload button :-
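The bucket layout above follows the common bronze/silver/gold (“medallion”) convention. A small sketch of how the object keys compose — the file name `person.csv` is purely illustrative, since the blog does not name the uploaded file:

```python
BUCKET = "adibucketlakeformation"
LAYERS = {"bronze": "raw data", "silver": "verified data", "gold": "processed data"}

def ingest_key(layer, dataset, filename):
    """Compose the object key for a batch file, e.g. bronze/ingest/batch-person/<file>."""
    return f"{layer}/ingest/{dataset}/{filename}"

# Hypothetical file name; the actual upload would be:
#   boto3.client("s3").upload_file("person.csv", BUCKET, key)
key = ingest_key("bronze", "batch-person", "person.csv")
print(f"s3://{BUCKET}/{key}")
```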

Part #3 : Launching the AWS LakeFormation

Step #1.) Let’s go to AWSLakeFormation and add the IAM user “adiLakeFormation2” as Administrator. Note that this step has to be performed while we are logged in as the root user.

Step #2.) Let’s now log in to the AWS Portal using this IAM user : “adiLakeFormation2”. Now, grant it the permission to create databases :-

Step #3.) Finally the dashboard of AWSLakeFormation looks like this. We can see here that :-

  • The IAM user “adiLakeFormation2” is registered as a DataLake Administrator.
  • This IAM user “adiLakeFormation2” also has the necessary permissions to create databases.

Step #4.) Next, we shall register the S3 bucket created above as the DataLake location :-

Step #5.) Here is how our dashboard looks after registering the S3 bucket :-
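The registration step above maps to the `register_resource` call of the Lake Formation API. A hedged sketch of the parameters (using the service-linked role, which is what the console does by default when you don’t pick a custom role):

```python
def register_location_params(bucket):
    """Parameters one would pass to lakeformation.register_resource()."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}",
        "UseServiceLinkedRole": True,  # let Lake Formation manage the access role
    }

print(register_location_params("adibucketlakeformation"))
```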

Part #4 : Creating the LakeFormation Database

Step #1.) Let’s first create an AWSLakeFormation database. This DB shall serve as the Ingest database at the Bronze layer. We name it “Ingest” :-

Step #2.) Finally, we have it created :-
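Lake Formation databases live in the Glue Data Catalog, so the console action above corresponds to `glue.create_database()`. A minimal sketch — note that the catalog stores database names in lower case, so “Ingest” becomes `ingest`:

```python
def create_db_params(name="ingest"):
    """Parameters one would pass to glue.create_database().
    Glue lower-cases catalog names, hence "ingest" rather than "Ingest"."""
    return {"DatabaseInput": {"Name": name}}

print(create_db_params())
```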

Part #5 : Configuring the AWS Glue Crawler

Step #1.) Let’s go to the crawler dashboard under AWS Glue :-

Step #2.) Now, start creating the Crawler with name “Ingest-batch-crawler” :-

Step #3.) Now, let’s add the datasource :-

Here is how the S3 bucket path has been specified, up to the folder “batch-person”, which indicates that the DB table shall be created with this name :-

Step #4.) Next, let’s create the IAM role. Note that AWS prefixes the role name with the keyword “AWSGlueServiceRole-” :-

Step #5.) The job of the Crawler is to read the data from the text files stored in our S3 bucket, parse it, and dump/sink it into the AWSLakeFormation database. Note that the crawler itself shall create the tables for us :-

  • Thus, let’s now specify the output place in the configuration of Glue Crawler.
  • Also, note that, the schedule of this crawler has been specified as “On Demand”.

Step #6.) Finally, we are at the Review Screen :-

Step #7.) We now hit the “Create Crawler” button, and our Glue Crawler is created :-
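The crawler configured in Steps #2–#7 can be sketched as the parameters for `glue.create_crawler()`. The names below are the ones shown in the screenshots; the assumption that the target database is the lower-cased `ingest` is mine. Omitting the `Schedule` key is how “On Demand” is expressed in the API:

```python
def crawler_params():
    """Parameters one would pass to glue.create_crawler() — a sketch of the console setup."""
    return {
        "Name": "Ingest-batch-crawler",
        "Role": "AWSGlueServiceRole-IngestBatchPerson",
        "DatabaseName": "ingest",  # assumed lower-cased catalog name
        "Targets": {"S3Targets": [
            {"Path": "s3://adibucketlakeformation/bronze/ingest/batch-person/"}
        ]},
        # No "Schedule" key: a crawler without a schedule runs on demand.
    }

print(crawler_params())
```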

Step #8.) Now, we need to grant the AWSLakeFormation permissions to the newly created service role “AWSGlueServiceRole-IngestBatchPerson” :-

Step #9.) Let’s go inside Data Lake Permissions and grant access to this IAM role. In the step below, we are granting the permissions on the following database :-

In the below step, we are selecting the permissions :-

Step #10.) Finally, we can see here that the permissions have been granted to this IAM role :-
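The grant in Steps #8–#10 corresponds to `lakeformation.grant_permissions()`. A sketch under stated assumptions: the role ARN’s account id is a placeholder, and the exact permission set chosen in the console is not visible in the screenshots, so the list below is illustrative:

```python
def grant_params(role_arn):
    """Parameters one would pass to lakeformation.grant_permissions() on the ingest DB.
    The Permissions list is illustrative, not taken from the screenshots."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": "ingest"}},
        "Permissions": ["CREATE_TABLE", "ALTER", "DESCRIBE"],
    }

# Placeholder account id — substitute your own.
role_arn = "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-IngestBatchPerson"
print(grant_params(role_arn))
```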

Part #6 : Running the Crawler and Table creation

Step #1.) Let’s go to the crawler and run it. This run may take a few minutes :-

Step #2.) The crawler has run successfully and has created ONE table :-

Step #3.) We can also verify from the AWSLakeFormation dashboard that the table named “batch_person” has now been successfully created :-

Step #4.) We can also verify from the AWSGlue dashboard (“Data Catalog Tables”) that our crawler has created a table. Note that the crawler was able to infer the schema automatically :-
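Running the crawler and waiting for it to finish can also be scripted. The sketch below assumes `glue` is a boto3 Glue client; the crawler cycles through `RUNNING` → `STOPPING` → `READY` states, and `READY` means the run is over:

```python
import time

def run_crawler_sketch(glue, name="Ingest-batch-crawler", poll_seconds=30):
    """Start the crawler and poll until it returns to READY.
    `glue` is assumed to be a boto3 Glue client (or any object with the same methods)."""
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":  # run finished; tables are now in the catalog
            return state
        time.sleep(poll_seconds)
```

In a real session one would call `run_crawler_sketch(boto3.client("glue"))`.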

Part #7 : Querying the Table using AWSAthena

Step #1.) Let’s go inside the Athena dashboard; we need to set up the query-results location first :-

Step #2.) Let’s create another folder named “AthenaResults” inside our S3 bucket, which shall serve as the results location :-

Step #3.) Finally, we can write the SQL Query and see the results :-

One more query with a WHERE clause, for demonstration purposes :-
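Programmatically, the same queries go through `athena.start_query_execution()`. A sketch — the SQL text below is illustrative (the blog’s exact queries are only shown in screenshots), while the database, table, and results-folder names come from the steps above:

```python
def athena_query_params(sql):
    """Parameters one would pass to athena.start_query_execution() against the ingest DB."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": "ingest"},  # assumed lower-cased name
        "ResultConfiguration": {
            "OutputLocation": "s3://adibucketlakeformation/AthenaResults/"
        },
    }

# Illustrative query against the crawler-created table.
print(athena_query_params("SELECT * FROM batch_person LIMIT 10"))
```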

That’s all in this section. We shall see you in Part-6 of this AWS series. If you liked reading it, do clap on this blog.
