Deep dive into AWS for developers | Part 5 — DataLake

aditya goel
8 min read · Mar 6, 2024

In case you are landing here directly, it is recommended to visit this page first.

In this particular blog, we shall see the end-to-end process of reading static data from an S3 Bucket, loading it into a DataLakeFormation Table using a Glue Crawler, and querying it using Athena.

Question → What is a DataLakeFormation Table and How is it Different from an RDS Table ?

Answer → A DataLakeFormation Table in AWS is a metadata definition of structured or semi-structured data stored in Amazon S3, managed through AWS Glue Data Catalog. It is part of AWS Lake Formation, which helps create and manage a secure data lake. This allows users to query large-scale datasets efficiently using Athena, Redshift Spectrum, and EMR without needing to move the data.

On the other hand, an RDS Table exists within a relational database (e.g., MySQL, PostgreSQL, SQL Server) hosted on Amazon RDS. It follows a strict schema, supports ACID transactions, and is designed for OLTP (Online Transaction Processing) workloads.

Key Differences:

  • Storage → A DataLakeFormation Table only holds metadata; the actual data lives in Amazon S3. An RDS Table stores its data inside the relational database engine itself.
  • Schema → DataLakeFormation Tables follow schema-on-read and tolerate semi-structured data; RDS Tables enforce a strict schema-on-write.
  • Workload → DataLakeFormation Tables are built for large-scale analytics (OLAP); RDS Tables are built for transactional (OLTP) workloads with ACID guarantees.
  • Query engines → DataLakeFormation Tables are queried via Athena, Redshift Spectrum, or EMR; RDS Tables are queried through the database engine (MySQL, PostgreSQL, SQL Server, etc.).

When to Use What?

  • Use DataLakeFormation Table if you are dealing with large-scale analytical workloads, require cost-effective storage, and need schema flexibility.
  • Use RDS Table if you need real-time transactions, structured data with strong consistency, and indexing for fast lookups.

Question → Explain where we used the DataLakeFormation in our journey ?

Answer → We had to perform the Clickstream Analysis (Ad Impressions, Clicks, Conversions) in our Ads-ecosystem.

  • Every user action (ad viewed, clicked, ignored) generates terabytes of raw log data.
  • This needs to be analyzed to understand which ads perform best.
  • Data is semi-structured (JSON, Parquet, CSV) and huge (petabytes).

Here, our best choice for performing the Click-Analysis was AWS Glue DataLakeFormation + Athena (or Hive on S3) :-

  • Store raw clickstream data in Amazon S3.
  • Use Glue Crawler to define a schema on read.
  • Query using Athena (serverless SQL on S3).

1.) Example Clickstream Data (Stored in S3 — JSON Format)
📂 s3://adtech-clickstream-data/2025/02/11/

{
  "UserID": 12345,
  "AdID": 6789,
  "PageURL": "https://www.tokopedia.com/product",
  "EventType": "Click",
  "Timestamp": "2025-02-11T12:30:00Z",
  "Device": "Mobile"
}

2.) Create a Table in AWS Glue (Schema-on-Read)

CREATE EXTERNAL TABLE clickstream_events (
  UserID BIGINT,
  AdID BIGINT,
  PageURL STRING,
  EventType STRING,
  `Timestamp` STRING,
  Device STRING
)
-- the raw events above are JSON, so we use a JSON SerDe rather than STORED AS PARQUET
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://adtech-clickstream-data/';

3.) Query in Athena (Find Top-Performing Ads)

SELECT AdID, COUNT(*) AS Clicks
FROM clickstream_events
WHERE EventType = 'Click'
GROUP BY AdID
ORDER BY Clicks DESC
LIMIT 10;

4.) Why Glue + Athena?

  • Scalability: Can process petabytes of clickstream data.
  • Low cost: No cluster needed; pay only for queries.
  • Schema-on-read: New data formats can be added dynamically.

5.) Why Not RDS?

  • RDS can’t handle billions of rows efficiently for analytics.
  • Storage & compute cost is too high for large-scale logs.

Question → Show the step-by-step process for creating a DataLake from the Batch Data ?

Answer → Here is what the process looks like for launching the DataLake using the Batch Data :-

Part #1 : Launching the IAM User

Step #1.) We first create an IAM user named “adiLakeFormation2”, with which we shall perform all the subsequent steps :-

Step #2.) We would associate following policies (“Admin”) to this IAM-User :-

We also associate AWSLakeFormationDataAdmin policy to this IAM-User :-
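For readers who prefer to script these console steps, here is a minimal boto3 sketch of Part #1. It assumes the “Admin” policy above refers to the AWS-managed AdministratorAccess policy; adjust the ARNs if your setup differs :-

import boto3

iam = boto3.client("iam")

# Step 1: create the IAM user used throughout this walkthrough
iam.create_user(UserName="adiLakeFormation2")

# Step 2: attach the Admin policy (assumed to be AdministratorAccess here)
# along with the AWSLakeFormationDataAdmin policy mentioned above
for policy_arn in [
    "arn:aws:iam::aws:policy/AdministratorAccess",        # assumption for "Admin"
    "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin",
]:
    iam.attach_user_policy(UserName="adiLakeFormation2", PolicyArn=policy_arn)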

Part #2 : Launching the S3 Bucket

Step #1.) We now create the S3 Bucket named “adibucketlakeformation”. Inside the bucket, we have also created 3 folders :-

  • bronze → Indicating the place for raw data.
  • silver → Indicating the place for verified data.
  • gold → Indicating the place for processed data.

Step #2.) We would now upload a file to the S3 Bucket, which looks like this :-

Step #3.) Let’s upload this file to S3 Bucket under “bronze/ingest/batch-person” folder :-

Step #4.) Let’s hit the upload button :-
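If you would rather script Part #2 as well, a rough boto3 equivalent is shown below. The local file name batch-person.csv is only a hypothetical placeholder, since the exact file shown above is not named here :-

import boto3

s3 = boto3.client("s3")
bucket = "adibucketlakeformation"

# Step 1: create the bucket (assumes us-east-1; other regions need a
# CreateBucketConfiguration) and the bronze/silver/gold "folders"
s3.create_bucket(Bucket=bucket)
for prefix in ["bronze/", "silver/", "gold/"]:
    s3.put_object(Bucket=bucket, Key=prefix)   # zero-byte objects act as folders

# Steps 2-4: upload the batch file under bronze/ingest/batch-person/
# ("batch-person.csv" is a hypothetical local file name)
s3.upload_file("batch-person.csv", bucket, "bronze/ingest/batch-person/batch-person.csv")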

Part #3 : Launching the AWS LakeFormation

Step #1.) Let’s go to AWSLakeFormation and add the IAM-User “adiLakeFormation2” as Administrator. Note that this step has to be performed while we are logged in as the root user.

Step #2.) Let’s now login to AWS Portal using this IAM-User : “adiLakeFormation2”. Now, grant the permissions to create databases :-

Step #3.) Finally the dashboard of AWSLakeFormation looks like this. We can see here that :-

  • The IAM user “adiLakeFormation2” has been added as the DataLake Administrator.
  • This IAM user “adiLakeFormation2” also has the necessary permissions to create the databases.

Step #4.) Next, we shall register our previously created S3 Bucket as the DataLake Location :-

Step #5.) Here is how our dashboard looks after registering the S3 Bucket :-
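For reference, the same Part #3 setup can be sketched with boto3 as below. The account id is a placeholder, and put_data_lake_settings replaces the whole settings object, so in practice it must be issued by a principal that already has LakeFormation admin rights (the root/console session in our case) :-

import boto3

lf = boto3.client("lakeformation")
account_id = "123456789012"   # placeholder account id

# Steps 1-3: make the IAM user a DataLake Administrator
# (note: this call overwrites the existing data-lake settings)
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:user/adiLakeFormation2"}
        ]
    }
)

# Steps 4-5: register the S3 bucket as the DataLake location
lf.register_resource(
    ResourceArn="arn:aws:s3:::adibucketlakeformation",
    UseServiceLinkedRole=True,
)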

Part #4 : Creating the LakeFormation Database

Step #1.) Let’s first create an AWSLakeFormation Database. This DB shall be the Ingest database at the Bronze Layer. We name it “Ingest” :-

Step #2.) Finally, we have it created :-
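Because LakeFormation databases live in the Glue Data Catalog, the same database can also be created programmatically; a minimal sketch :-

import boto3

glue = boto3.client("glue")

# Create the Bronze-layer ingest database in the Glue Data Catalog
# (catalog names are stored in lower case, hence "ingest")
glue.create_database(
    DatabaseInput={"Name": "ingest", "Description": "Bronze layer ingest database"}
)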

Part #5 : Configuring the AWS Glue Crawler

Step #1.) Let’s go to the crawler dashboard under AWS Glue :-

Step #2.) Now, start creating the Crawler with name “Ingest-batch-crawler” :-

Step #3.) Now, let’s add the datasource :-

Here is how the S3 bucket path has been specified up to the folder “batch-person”, which indicates that the DB-Table shall be created with this name :-

Step #4.) Next, let’s create the IAM-Role. Note that, AWS prefixes the keyword : “AWSGlueServiceRole-” :-

Step #5.) The job of the Crawler is to read the data from the text files stored in our S3 Bucket, parse it, and dump/sink it into the AWSLakeFormation database. Note that the crawler itself shall be creating the tables for us :-

  • Thus, let’s now specify the output place in the configuration of Glue Crawler.
  • Also, note that, the schedule of this crawler has been specified as “On Demand”.

Step #6.) Finally, we are at the Review Screen :-

Step #7.) And we now hit the “Create Crawler” button and our Glue-Crawler is created :-
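The crawler definition from Steps #2 to #7 can also be expressed through the Glue API; a sketch, assuming the service role created above and no Schedule (which keeps it “On Demand”) :-

import boto3

glue = boto3.client("glue")

# Crawler that reads bronze/ingest/batch-person and writes the inferred
# table into the "ingest" database
glue.create_crawler(
    Name="Ingest-batch-crawler",
    Role="AWSGlueServiceRole-IngestBatchPerson",
    DatabaseName="ingest",
    Targets={"S3Targets": [{"Path": "s3://adibucketlakeformation/bronze/ingest/batch-person/"}]},
    # omitting Schedule leaves the crawler "On Demand"
)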

Step #8.) Now, we need to provide the AWSLakeFormation permissions to the newly created ServiceRole “AWSGlueServiceRole-IngestBatchPerson” :-

Step #9.) Let’s go inside the Data Lake Permissions and grant the access to this IAM Role. In the below step, we are granting the permissions on the following database :-

In the below step, we are selecting the permissions :-

Step #10.) Finally, we can see here that the permissions have been granted to this IAM Role :-
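Steps #8 to #10 map to a single grant_permissions call; a sketch, with the account id again a placeholder and the permission set assumed to be what a crawler typically needs on the target database :-

import boto3

lf = boto3.client("lakeformation")
account_id = "123456789012"   # placeholder

# Grant the Glue service role the rights it needs on the "ingest" database
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/AWSGlueServiceRole-IngestBatchPerson"
    },
    Resource={"Database": {"Name": "ingest"}},
    Permissions=["CREATE_TABLE", "ALTER", "DESCRIBE"],   # assumed permission set
)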

Part #6 : Running the Crawler and Table creation

Step #1.) Let’s go to the crawler and run it. This run may take a few minutes :-

Step #2.) The crawler has run successfully and has created ONE Table :-

Step #3.) We can also verify from the AWSLakeFormation dashboard that the Table named “batch_person” has now been successfully created :-

Step #4.) We can also verify from the AWSGlue Dashboard (“Data Catalog Tables”) that our crawler has created a Table. Note that our crawler was able to parse the schema automatically :-
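Running the crawler and verifying the resulting table can likewise be scripted; a minimal sketch :-

import boto3

glue = boto3.client("glue")

# Step 1: run the crawler (it crawls the S3 prefix and infers the schema)
glue.start_crawler(Name="Ingest-batch-crawler")

# Steps 2-4: once the crawler has finished, list the tables it created
tables = glue.get_tables(DatabaseName="ingest")["TableList"]
print([t["Name"] for t in tables])   # we expect to see "batch_person" here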

Part #7 : Querying the Table using AWSAthena

Step #1.) Let’s go inside the Athena Dashboard; we need to set up the query-results location first :-

Step #2.) Let’s create another folder named “AthenaResults” inside our S3 Bucket, which shall serve as Results Location :-

Step #3.) Finally, we can write the SQL Query and see the results :-

One more query with where clause for demonstration purpose :-
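These Athena steps can also be driven from code; a minimal boto3 sketch, using the AthenaResults folder created above as the query-results location :-

import boto3

athena = boto3.client("athena")

# Fire a query against the crawled table; results land in the AthenaResults folder
response = athena.start_query_execution(
    QueryString="SELECT * FROM batch_person LIMIT 10",
    QueryExecutionContext={"Database": "ingest"},
    ResultConfiguration={"OutputLocation": "s3://adibucketlakeformation/AthenaResults/"},
)

# Poll get_query_execution() until the state is SUCCEEDED, then fetch the rows
print(response["QueryExecutionId"])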

That’s all in this section. We shall see you in Part-6 of this AWS Series. If you liked reading it, clap on this blog.
