Deep dive into AWS for developers | Part5 — DataLake

aditya goel
6 min read · Mar 6, 2024

If you are landing here directly, it is recommended to visit this page first.

In this blog, we shall walk through the end-to-end process of reading static data from an S3 bucket, exposing it as an AWSLakeFormation table using a Glue crawler, and querying it using Athena.

Question → Show the step-by-step process for creating a DataLake from batch data.

Answer → Here is what the process looks like for launching the DataLake using batch data :-

Part #1 : Launching the IAM User

Step #1.) We first create an IAM user named “adiLakeFormation2”, with which we shall perform all the subsequent steps :-

Step #2.) We associate the following policies (“Admin”) with this IAM user :-

We also associate the AWSLakeFormationDataAdmin policy with this IAM user :-
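For readers who prefer scripting over the console, the user setup above could be sketched with boto3 roughly as follows. This is only a sketch under assumptions, not the method used in this blog: it assumes that the “Admin” policy refers to the AWS managed policy AdministratorAccess, and the helper names `attach_policy_requests` and `create_lake_user` are hypothetical.

```python
# Managed policy ARNs. AdministratorAccess is an ASSUMPTION for the "Admin"
# policy mentioned above; AWSLakeFormationDataAdmin is named in the text.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AdministratorAccess",
    "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin",
]


def attach_policy_requests(user_name, policy_arns=POLICY_ARNS):
    """Build the keyword arguments for one attach_user_policy call per policy."""
    return [{"UserName": user_name, "PolicyArn": arn} for arn in policy_arns]


def create_lake_user(user_name="adiLakeFormation2"):
    """Create the IAM user and attach the policies (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    iam = boto3.client("iam")
    iam.create_user(UserName=user_name)
    for request in attach_policy_requests(user_name):
        iam.attach_user_policy(**request)
```

Calling `create_lake_user()` with suitably privileged credentials would perform both steps; console access (a login password) for the user is not covered by this sketch.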

Part #2 : Launching the S3 Bucket

Step #1.) We now create the S3 Bucket named “adibucketlakeformation”. Inside the bucket, we have also created 3 folders :-

  • bronze → Indicating the place for raw data.
  • silver → Indicating the place for verified data.
  • gold → Indicating the place for processed data.

Step #2.) We shall now upload a file to the S3 bucket. The file looks like this :-

Step #3.) Let’s upload this file to S3 Bucket under “bronze/ingest/batch-person” folder :-

Step #4.) Let’s hit the upload button :-
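The folder layout and the upload can also be scripted. The sketch below assumes the bucket already exists and that the local file is called `persons.csv`; that file name is hypothetical, since the blog does not name the file. The helper names are also illustrative.

```python
import posixpath

BUCKET = "adibucketlakeformation"
LAYERS = ("bronze", "silver", "gold")
INGEST_PREFIX = "bronze/ingest/batch-person"


def folder_keys(layers=LAYERS):
    """S3 has no real folders; a zero-byte key ending in '/' acts as one."""
    return [f"{layer}/" for layer in layers]


def object_key(local_path, prefix=INGEST_PREFIX):
    """Key under which a local file lands inside the ingest folder."""
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return f"{prefix}/{file_name}"


def upload_batch_file(local_path, bucket=BUCKET):
    """Create the folder markers and upload the file (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    s3 = boto3.client("s3")
    for key in folder_keys():
        s3.put_object(Bucket=bucket, Key=key)
    s3.upload_file(local_path, bucket, object_key(local_path))
```

For example, `upload_batch_file("persons.csv")` would place the file at `bronze/ingest/batch-person/persons.csv` inside the bucket.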

Part #3 : Launching the AWS LakeFormation

Step #1.) Let’s go to AWSLakeFormation and add the IAM user “adiLakeFormation2” as an Administrator. Note that this step has to be performed while we are logged in as the root user.

Step #2.) Let’s now log in to the AWS console using this IAM user : “adiLakeFormation2”. Now, grant the permissions to create databases :-

Step #3.) Finally the dashboard of AWSLakeFormation looks like this. We can see here that :-

  • The IAM user “adiLakeFormation2” is listed as a DataLake Administrator.
  • This IAM user “adiLakeFormation2” also has the necessary permissions to create databases.
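The two settings shown on the dashboard above correspond, as far as we can tell, to the Lake Formation API calls `put_data_lake_settings` and `grant_permissions`. The sketch below is an approximation of the console steps; the account id `123456789012` in the usage note is a placeholder and the helper names are hypothetical. Because `put_data_lake_settings` replaces the whole settings object, the sketch first reads the current settings and appends to them.

```python
def with_admin(settings, principal_arn):
    """Return a copy of the data lake settings with principal_arn added as admin."""
    admins = list(settings.get("DataLakeAdmins", []))
    entry = {"DataLakePrincipalIdentifier": principal_arn}
    if entry not in admins:
        admins.append(entry)
    return {**settings, "DataLakeAdmins": admins}


def create_database_grant(principal_arn):
    """Keyword arguments for grant_permissions: CREATE_DATABASE on the catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Catalog": {}},
        "Permissions": ["CREATE_DATABASE"],
    }


def make_admin(principal_arn):
    """Step #1: add the principal as data lake admin (needs boto3 and credentials)."""
    import boto3

    lf = boto3.client("lakeformation")
    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(DataLakeSettings=with_admin(current, principal_arn))


def allow_database_creation(principal_arn):
    """Step #2: let the principal create databases (needs boto3 and credentials)."""
    import boto3

    boto3.client("lakeformation").grant_permissions(**create_database_grant(principal_arn))
```

Both functions would be called with the user ARN, for example `arn:aws:iam::123456789012:user/adiLakeFormation2`, using the credentials of whichever identity performs that step in the console flow.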

Step #4.) Next, we shall register our previously created S3 bucket as the DataLake location :-

Step #5.) Here is how our dashboard looks after registering the S3 bucket :-
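Registering the bucket can be sketched with the `register_resource` call of the Lake Formation API. The blog does not say which role was used for the registration, so `UseServiceLinkedRole=True` below is an assumption, and the helper names are hypothetical.

```python
def bucket_arn(bucket):
    """ARN of an S3 bucket (bucket ARNs carry no region or account id)."""
    return f"arn:aws:s3:::{bucket}"


def register_request(bucket):
    """Keyword arguments for lakeformation.register_resource."""
    return {
        "ResourceArn": bucket_arn(bucket),
        # ASSUMPTION: register with the Lake Formation service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_location(bucket="adibucketlakeformation"):
    """Register the bucket as a data lake location (needs boto3 and credentials)."""
    import boto3

    boto3.client("lakeformation").register_resource(**register_request(bucket))
```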

Part #4 : Creating the LakeFormation Database

Step #1.) Let’s first create an AWSLakeFormation database. This DB shall be the ingest database at the Bronze layer. We name it “Ingest” :-

Step #2.) Finally, we have it created :-
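Lake Formation databases live in the Glue Data Catalog, so the same database could presumably be created through the Glue API. The sketch below is a minimal illustration; the description text and helper names are hypothetical. Note that the Glue documentation states database names are folded to lowercase when stored, so “Ingest” would appear as `ingest` in the catalog.

```python
def database_input(name, description=""):
    """Build the DatabaseInput structure expected by glue.create_database."""
    db = {"Name": name}
    if description:
        db["Description"] = description
    return db


def create_ingest_database(name="Ingest"):
    """Create the Bronze-layer database (needs boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    # The description is illustrative only; it is not taken from the blog.
    glue.create_database(DatabaseInput=database_input(name, "Bronze layer ingest database"))
```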

Part #5 : Configuring the AWS Glue Crawler

Step #1.) Let’s go to the crawler dashboard under AWS Glue :-

Step #2.) Now, start creating the Crawler with name “Ingest-batch-crawler” :-

Step #3.) Now, let’s add the datasource :-

Here the S3 bucket path has been specified up to the folder “batch-person”, which indicates that the DB table shall be created with this name :-

Step #4.) Next, let’s create the IAM role. Note that AWS prefixes the role name with “AWSGlueServiceRole-” :-

Step #5.) The job of the crawler is to read the text files stored in our S3 bucket, infer their schema, and record the resulting table definition in the AWSLakeFormation database. The data itself stays in S3; only the metadata is written to the catalog. Note that this crawler itself shall be creating the tables for us :-

  • Thus, let’s now specify the output place in the configuration of Glue Crawler.
  • Also, note that, the schedule of this crawler has been specified as “On Demand”.
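The crawler configuration described in the steps above can be summarised as a single `create_crawler` request. This is a sketch, assuming the IAM role already exists (the console creates it in Step #4) and that the database is stored under the lowercase name `ingest`; the helper names are hypothetical. Leaving out the `Schedule` argument is what keeps the crawler on demand.

```python
def crawler_request(name, role, database, s3_path):
    """Keyword arguments for glue.create_crawler; no Schedule means on demand."""
    return {
        "Name": name,
        "Role": role,  # role name or full role ARN
        "DatabaseName": database,  # where the crawler writes table definitions
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


REQUEST = crawler_request(
    name="Ingest-batch-crawler",
    role="AWSGlueServiceRole-IngestBatchPerson",
    database="ingest",  # ASSUMPTION: lowercase form of "Ingest"
    s3_path="s3://adibucketlakeformation/bronze/ingest/batch-person/",
)


def create_crawler(request=REQUEST):
    """Create the crawler (needs boto3 and credentials)."""
    import boto3

    boto3.client("glue").create_crawler(**request)
```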

Step #6.) Finally, we are at the Review Screen :-

Step #7.) And we now hit the “Create Crawler” button and our Glue-Crawler is created :-

Step #8.) Now, we need to grant AWSLakeFormation permissions to the newly created service role “AWSGlueServiceRole-IngestBatchPerson” :-

Step #9.) Let’s go inside the Data Lake Permissions and grant the access to this IAM Role. In the below step, we are granting the permissions on the following database :-

In the below step, we are selecting the permissions :-

Step #10.) Finally, we can see that the permissions have been granted to this IAM role :-
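The database grant for the crawler role can be sketched with `grant_permissions` as well. The exact permissions ticked in the console are not spelled out in the text, so the set used below (CREATE_TABLE, ALTER, DESCRIBE) is an assumption. The account id in the role ARN is a placeholder and the helper names are hypothetical.

```python
# ASSUMPTION: the permissions a crawler typically needs on its target database.
CRAWLER_DB_PERMISSIONS = ("CREATE_TABLE", "ALTER", "DESCRIBE")


def database_grant(role_arn, database, permissions=CRAWLER_DB_PERMISSIONS):
    """Keyword arguments for grant_permissions on a single database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": list(permissions),
    }


def grant_to_crawler_role(account_id="123456789012", database="ingest"):
    """Grant the database permissions to the role (needs boto3 and credentials)."""
    import boto3

    # Placeholder account id; console-created service roles may sit under a
    # path such as service-role/, so copy the real ARN from the IAM console.
    role_arn = f"arn:aws:iam::{account_id}:role/AWSGlueServiceRole-IngestBatchPerson"
    boto3.client("lakeformation").grant_permissions(**database_grant(role_arn, database))
```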

Part #6 : Running the Crawler and Table creation

Step #1.) Let’s go to the crawler and run it. The run may take a few minutes :-

Step #2.) The crawler has run successfully and has created one table :-

Step #3.) We can also verify from the AWSLakeFormation dashboard that the table named “batch_person” has now been successfully created :-
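Since the run takes a while, a script would have to start the crawler and then poll its state until it returns to READY. The sketch below shows one way of doing that; the polling interval, the timeout and the helper names are arbitrary choices made for illustration.

```python
import time


def wait_until_ready(get_state, poll_seconds=15, timeout_seconds=900, sleep=time.sleep):
    """Poll get_state() until it returns "READY"; return the seconds waited."""
    waited = 0
    while True:
        state = get_state()  # a Glue crawler is READY, RUNNING or STOPPING
        if state == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"crawler still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


def run_crawler(name="Ingest-batch-crawler"):
    """Start the crawler and wait for it to finish (needs boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    wait_until_ready(lambda: glue.get_crawler(Name=name)["Crawler"]["State"])
    # LastCrawl carries the outcome of the run, e.g. its Status field.
    return glue.get_crawler(Name=name)["Crawler"].get("LastCrawl", {})
```

The `get_state` and `sleep` parameters are injected so that the polling loop can be exercised without AWS access.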

Step #4.) We can also verify from the AWSGlue dashboard (“Data Catalog Tables”) that our crawler has created a table. Note that our crawler was able to infer the schema automatically :-
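The inferred schema can also be read back through the Glue `get_table` call, whose response lists the columns under `Table.StorageDescriptor.Columns`. The sketch below assumes the database is stored as `ingest`; the helper names are hypothetical.

```python
def table_schema(get_table_response):
    """Extract (column name, column type) pairs from a glue.get_table response."""
    columns = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return [(column["Name"], column["Type"]) for column in columns]


def fetch_schema(database="ingest", table="batch_person"):
    """Read the crawler-created schema (needs boto3 and credentials)."""
    import boto3

    response = boto3.client("glue").get_table(DatabaseName=database, Name=table)
    return table_schema(response)
```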

Part #7 : Querying the Table using AWSAthena

Step #1.) Let’s go inside the Athena dashboard. We need to set up the query results location first :-

Step #2.) Let’s create another folder named “AthenaResults” inside our S3 Bucket, which shall serve as Results Location :-

Step #3.) Finally, we can write the SQL Query and see the results :-

One more query with a WHERE clause for demonstration purposes :-
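The same kind of query can be submitted programmatically through the Athena API. The sketch below is an illustration under assumptions: the database name `ingest` is the assumed lowercase form of “Ingest”, the SQL in the usage note uses a hypothetical column, and only the first page of results is read. The helper names are hypothetical too.

```python
import time

RESULTS_LOCATION = "s3://adibucketlakeformation/AthenaResults/"


def rows_to_dicts(result_set):
    """Convert an Athena ResultSet (first row is the header) into dicts."""
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result_set["Rows"]
    ]
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, values)) for values in data]


def run_query(sql, database="ingest", output=RESULTS_LOCATION, poll_seconds=2):
    """Run a query and return the first page of rows (needs boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    return rows_to_dicts(athena.get_query_results(QueryExecutionId=query_id)["ResultSet"])
```

For example, `run_query("SELECT * FROM batch_person LIMIT 10")` would list a few rows, and a filter such as `WHERE id = 1` could be appended, where `id` stands in for whichever column the file actually contains.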

That’s all in this section. We shall see you in Part-6 of this AWS Series. If you liked reading it, clap on this blog.

aditya goel

Software Engineer for Big Data distributed systems