Deep dive into AWS for developers | Part5 — DataLake
In case you are landing here directly, it is recommended that you visit this page first.
In this blog, we shall see the end-to-end process of reading static data from an S3 Bucket, loading it into a DataLakeFormation Table using a Glue Crawler, and querying it using Athena.
Question → What is a DataLakeFormation Table and how is it different from an RDS Table?
Answer → A DataLakeFormation Table in AWS is a metadata definition of structured or semi-structured data stored in Amazon S3, managed through AWS Glue Data Catalog. It is part of AWS Lake Formation, which helps create and manage a secure data lake. This allows users to query large-scale datasets efficiently using Athena, Redshift Spectrum, and EMR without needing to move the data.
On the other hand, an RDS Table exists within a relational database (e.g., MySQL, PostgreSQL, SQL Server) hosted on Amazon RDS. It follows a strict schema, supports ACID transactions, and is designed for OLTP (Online Transaction Processing) workloads.
Key Differences:
- Storage & metadata → A DataLakeFormation Table is only a metadata definition in the Glue Data Catalog; the actual data lives as files in Amazon S3. An RDS Table stores its data inside the database engine itself.
- Schema → Schema-on-read and flexible for the DataLakeFormation Table; strict, predefined schema for the RDS Table.
- Workloads → The DataLakeFormation Table targets large-scale analytics via Athena, Redshift Spectrum, and EMR; the RDS Table targets OLTP workloads with ACID transactions and fast indexed lookups.
When to Use What?
- Use DataLakeFormation Table if you are dealing with large-scale analytical workloads, require cost-effective storage, and need schema flexibility.
- Use RDS Table if you need real-time transactions, structured data with strong consistency, and indexing for fast lookups.
Question → Explain where we used DataLakeFormation in my journey?
Answer → We had to perform the Clickstream Analysis (Ad Impressions, Clicks, Conversions) in our Ads-ecosystem.
- Every user action (ad viewed, clicked, ignored) generates terabytes of raw log data.
- This needs to be analyzed to understand which ads perform best.
- Data is semi-structured (JSON, Parquet, CSV) and huge (petabytes).
Here, our best choice for performing the clickstream analysis was AWS Glue + DataLakeFormation + Athena (or Hive on S3) :-
- Store raw clickstream data in Amazon S3.
- Use Glue Crawler to define a schema on read.
- Query using Athena (serverless SQL on S3).
1.) Example Clickstream Data (Stored in S3 — JSON Format)
📂 s3://adtech-clickstream-data/2025/02/11/
{
"UserID": 12345,
"AdID": 6789,
"PageURL": "https://www.tokopedia.com/product",
"EventType": "Click",
"Timestamp": "2025-02-11T12:30:00Z",
"Device": "Mobile"
}
2.) Create a Table in AWS Glue (Schema-on-Read)
CREATE EXTERNAL TABLE clickstream_events (
  UserID STRING,
  AdID STRING,
  PageURL STRING,
  EventType STRING,
  `Timestamp` STRING,
  Device STRING
)
-- The raw events above are JSON, so we read them with the JSON SerDe
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://adtech-clickstream-data/';
3.) Query in Athena (Find Top-Performing Ads)
SELECT AdID, COUNT(*) AS Clicks
FROM clickstream_events
WHERE EventType = 'Click'
GROUP BY AdID
ORDER BY Clicks DESC
LIMIT 10;
4.) Why Glue + Athena?
- Scalability: Can process petabytes of clickstream data.
- Low cost: No cluster needed; pay only for queries.
- Schema-on-read: New data formats can be added dynamically.
5.) Why Not RDS?
- RDS can’t handle billions of rows efficiently for analytics.
- Storage & compute cost is too high for large-scale logs.
Question → Show the step-by-step process for creating a DataLake from Batch Data?
Answer → Here is what the process looks like for launching the DataLake using the Batch Data :-
Part #1 : Launching the IAM User
Step #1.) We first create an IAM user named “adiLakeFormation2”, with which we shall perform all the subsequent steps :-
Step #2.) We would associate the following (“Admin”) policies with this IAM-User :-
We also associate the AWSLakeFormationDataAdmin policy with this IAM-User :-
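For reference, the same Part #1 setup can also be scripted. Below is a minimal boto3 sketch, assuming the AWS managed policies “AdministratorAccess” and “AWSLakeFormationDataAdmin” are the ones attached in the screenshots above :-
import boto3

iam = boto3.client("iam")

# Step #1: create the IAM user that shall own the whole data-lake setup
iam.create_user(UserName="adiLakeFormation2")

# Step #2: attach the two managed policies (ARNs assumed from the policy names above)
for policy_arn in [
    "arn:aws:iam::aws:policy/AdministratorAccess",
    "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin",
]:
    iam.attach_user_policy(UserName="adiLakeFormation2", PolicyArn=policy_arn)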
Part #2 : Launching the S3 Bucket
Step #1.) We now create the S3 Bucket named “adibucketlakeformation”. Inside the bucket, we have also created 3 folders :-
- bronze → Indicating the place for raw data.
- silver → Indicating the place for verified data.
- gold → Indicating the place for processed data.
Step #2.) We would now upload a file to the S3 Bucket, which looks like this :-
Step #3.) Let’s upload this file to S3 Bucket under “bronze/ingest/batch-person” folder :-
Step #4.) Let’s hit the upload button :-
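If you prefer to script Part #2 instead of clicking through the console, here is a minimal boto3 sketch; the file name “person-data.csv” is only a placeholder for the batch file shown above :-
import boto3

s3 = boto3.client("s3")

# Step #1: create the bucket and the three layer "folders" (zero-byte keys ending with "/")
# (in regions other than us-east-1, a CreateBucketConfiguration would also be required)
s3.create_bucket(Bucket="adibucketlakeformation")
for prefix in ["bronze/", "silver/", "gold/"]:
    s3.put_object(Bucket="adibucketlakeformation", Key=prefix)

# Steps #2-#4: upload the batch file under the bronze/ingest/batch-person folder
s3.upload_file(
    "person-data.csv",
    "adibucketlakeformation",
    "bronze/ingest/batch-person/person-data.csv",
)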
Part #3 : Launching the AWS LakeFormation
Step #1.) Let’s go to AWSLakeFormation and add the IAM-User “adiLakeFormation2” as an Administrator. Note that this step has to be performed while we are logged in as the root user.
Step #2.) Let’s now log in to the AWS Portal using this IAM-User, “adiLakeFormation2”, and grant it the permissions to create databases :-
Step #3.) Finally the dashboard of AWSLakeFormation looks like this. We can see here that :-
- The IAM user “adiLakeFormation2” has been added as a DataLake Administrator.
- This IAM user “adiLakeFormation2” also has the necessary permissions to create databases.
Step #4.) Next, we shall register our previously created S3 Bucket as the DataLake Location :-
Step #5.) Here is how our dashboard looks after registering the S3 Bucket :-
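The same Part #3 configuration is available through the Lake Formation API as well. A minimal boto3 sketch, where 123456789012 is a placeholder for your AWS account ID :-
import boto3

lf = boto3.client("lakeformation")

# Step #1: make the IAM user a DataLake Administrator
# (note: this call replaces the existing data-lake settings wholesale)
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/adiLakeFormation2"}
        ]
    }
)

# Steps #4-#5: register the S3 bucket as the DataLake location
lf.register_resource(
    ResourceArn="arn:aws:s3:::adibucketlakeformation",
    UseServiceLinkedRole=True,
)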
Part #4 : Creating the LakeFormation Database
Step #1.) Let’s first create an AWSLakeFormation Database. This DB shall be the Ingest database at the Bronze Layer. We name it “Ingest” :-
Step #2.) Finally, we have it created :-
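Since Lake Formation databases live in the Glue Data Catalog, the same database can be created with a single Glue API call. A minimal boto3 sketch :-
import boto3

glue = boto3.client("glue")

# Create the Bronze-layer ingest database in the Glue Data Catalog
# (catalog names are stored in lower case, hence "ingest")
glue.create_database(DatabaseInput={"Name": "ingest"})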
Part #5 : Configuring the AWS Glue Crawler
Step #1.) Let’s go to the crawler dashboard under AWS Glue :-
Step #2.) Now, start creating the Crawler with name “Ingest-batch-crawler” :-
Step #3.) Now, let’s add the datasource :-
Here is how the S3 bucket path has been specified, up to the folder “batch-person”, which indicates that the DB-Table shall be created with this name :-
Step #4.) Next, let’s create the IAM-Role. Note that AWS prefixes the name with the keyword “AWSGlueServiceRole-” :-
Step #5.) The job of the Crawler is to read the data from the text files stored in our S3 Bucket, parse the data from these files, and dump/sink it into the AWSLakeFormation database. Note that this crawler itself shall create the tables for us :-
- Thus, let’s now specify the output location in the configuration of the Glue Crawler.
- Also, note that the schedule of this crawler has been specified as “On Demand”.
Step #6.) Finally, we are at the Review Screen :-
Step #7.) And we now hit the “Create Crawler” button and our Glue-Crawler is created :-
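The crawler definition from Steps #2-#7 can also be expressed as a single boto3 call. This is a sketch, assuming the full role name “AWSGlueServiceRole-IngestBatchPerson” that shows up in the next step :-
import boto3

glue = boto3.client("glue")

# Read bronze/ingest/batch-person from S3, infer the schema,
# and create the table inside the "ingest" database
glue.create_crawler(
    Name="Ingest-batch-crawler",
    Role="AWSGlueServiceRole-IngestBatchPerson",
    DatabaseName="ingest",
    Targets={
        "S3Targets": [
            {"Path": "s3://adibucketlakeformation/bronze/ingest/batch-person/"}
        ]
    },
    # leaving out the Schedule parameter makes the crawler "On Demand"
)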
Step #8.) Now, we need to grant the AWSLakeFormation permissions to the newly created ServiceRole, “AWSGlueServiceRole-IngestBatchPerson” :-
Step #9.) Let’s go inside the Data Lake Permissions and grant access to this IAM Role. In the below step, we are granting the permissions on the following database :-
In the below step, we are selecting the permissions :-
Step #10.) Finally, we can see here that this IAM-Role has the required permissions :-
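Steps #8-#10 map to a single grant_permissions call on the Lake Formation API. A sketch, with 123456789012 again as a placeholder account ID and the role path assumed to be the default “service-role” path that the Glue console uses :-
import boto3

lf = boto3.client("lakeformation")

# Allow the crawler's service role to create and describe tables in "ingest"
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": (
            "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-IngestBatchPerson"
        )
    },
    Resource={"Database": {"Name": "ingest"}},
    Permissions=["CREATE_TABLE", "ALTER", "DESCRIBE"],
)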
Part #6 : Running the Crawler and Table creation
Step #1.) Let’s go to the crawler and run it. This run may take a few minutes :-
Step #2.) The crawler has run successfully and has created ONE Table :-
Step #3.) We can also verify from the AWSLakeFormation dashboard that the Table named “batch_person” has now been successfully created :-
Step #4.) We can also verify from the AWSGlue Dashboard (“Data Catalog Tables”) that our crawler has created a Table. Note that our crawler was able to parse the schema automatically :-
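Running the crawler and verifying the resulting table can also be scripted end to end. A minimal boto3 sketch :-
import time
import boto3

glue = boto3.client("glue")

# Step #1: run the crawler (same as hitting "Run" in the console)
glue.start_crawler(Name="Ingest-batch-crawler")

# Steps #2-#4: wait for the crawl to finish, then confirm the "batch_person" table exists
while glue.get_crawler(Name="Ingest-batch-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

table = glue.get_table(DatabaseName="ingest", Name="batch_person")
print(table["Table"]["StorageDescriptor"]["Columns"])  # the schema the crawler inferred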
Part #7 : Querying the Table using AWSAthena
Step #1.) Let’s go inside the Athena Dashboard; we need to set up the query-results location first :-
Step #2.) Let’s create another folder named “AthenaResults” inside our S3 Bucket, which shall serve as Results Location :-
Step #3.) Finally, we can write the SQL Query and see the results :-
One more query, with a WHERE clause, for demonstration purposes :-
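The same Athena queries can be fired programmatically as well. Below is a sketch using boto3; the WHERE clause on a “name” column is purely illustrative, since the actual columns depend on the file you uploaded :-
import boto3

athena = boto3.client("athena")

# Step #3: run the query and write the results to the AthenaResults folder
response = athena.start_query_execution(
    QueryString="SELECT * FROM batch_person WHERE name = 'Adi' LIMIT 10",
    QueryExecutionContext={"Database": "ingest"},
    ResultConfiguration={"OutputLocation": "s3://adibucketlakeformation/AthenaResults/"},
)
print(response["QueryExecutionId"])  # use get_query_results() with this id to fetch the rows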
That’s all in this section. We shall see you in Part-6 of this AWS Series. If you liked reading it, please clap for this blog.