Data Lakehouse on AWS: Step-by-Step Tutorial (Part 1)

Welcome to the hands-on tutorial series “End-to-End AWS Lakehouse Pipeline”. Over the coming weeks, we’ll guide you through each step needed to design, build, and run a modern Lakehouse architecture on AWS. Whether you’re a data engineer, architect, or analytics specialist, this series offers practical guidance and clear examples you can apply directly in your organization.

Why a Lakehouse on AWS?

The modern Lakehouse architecture combines a traditional data lake (raw, low-cost, often unstructured data) with a classic data warehouse (structured, curated analytics). A lakehouse gives you:

  • Unified Storage: Keep raw, cleaned, and curated data in a single object store (Amazon S3), reducing data duplication.
  • Flexible Schema (Schema Evolution): Support schema-on-read for exploration and schema enforcement for production analytics.
  • Open Formats: Leverage Apache Iceberg or Delta Lake tables for ACID transactions, time travel, and faster queries.
  • Scalable Processing: Use serverless engines like AWS Glue, Amazon EMR, or Amazon Athena to process data at petabyte scale.

By the end of this series, you will have a solid understanding of the Data Lakehouse architecture and of how it combines the flexibility of a data lake with the performance of a data warehouse.

Overall architecture of the series

Over the course of the series, we will implement the following architecture:

Key Components and Terminology

Before diving in, let’s align on the main AWS services and patterns as shown above:

  • Amazon S3: Durable, scalable object store for raw and processed data.
  • AWS Glue: Serverless ETL and Data Catalog service (Jobs, Crawlers, Workflows).
  • EventBridge: Event bus for triggering workflows on data arrival or status changes.
  • Apache Iceberg: Table format for ACID, partition evolution, and time travel on S3 data.
  • Amazon Athena: Serverless, interactive query service reading Iceberg tables.
  • Amazon Redshift Spectrum: Federated querying of S3-backed tables from Redshift.
  • AWS Lake Formation: Centralized security, governance, and fine-grained access control.
  • Amazon CloudWatch: Monitoring, logging, and alerting for all pipeline components.

Additionally, there are security services such as IAM, KMS, Secrets Manager, and VPC that should be provisioned as the foundation of the whole pipeline. We will walk through them in the coming articles.

Implementation Timeline

Our provisioning and implementation will follow these eight phases:

  1. Phase 1: Foundation Infrastructure
    • IAM roles, KMS keys, SSM parameters
    • VPC, subnets, security groups
    • S3 buckets for Bronze/Silver/Gold layers + access logs
  2. Phase 2: Core Data Services
    • Register S3 locations in Lake Formation
    • Set up Glue Data Catalog and Crawlers
  3. Phase 3: Processing Layer
    • Configure Apache Iceberg tables
    • Develop Glue ETL jobs for medallion transformations
  4. Phase 4: Analytics Layer
    • Provision Redshift cluster and Spectrum integration
    • Create Athena workgroups and resource configurations
  5. Phase 5: Governance & Monitoring
    • Define DataZone domain for lineage/governance
    • Build CloudWatch dashboards, alarms, and SNS topics
    • Create EventBridge rules for pipeline events
  6. Phase 6: Data Pipeline
    • Extract: Copy legacy data into raw S3
    • Transform: Bronze→Silver→Gold processing
    • Load: Push curated data into Redshift
  7. Phase 7: Testing & Quality
    • Implement data quality checks with Great Expectations
    • Develop unit/integration tests using pytest
  8. Phase 8: CI/CD
    • Automate deployments with GitHub Actions and CodePipeline

This timeline ensures dependencies—like security, S3, and the Glue catalog—are in place before later phases such as ETL, governance, and CI/CD.

Hands-On Prerequisites

To follow along, you’ll need:

  • AWS account with permissions for IAM, S3, Glue, EventBridge, Lake Formation, Redshift, Athena, CloudWatch, SNS, and DataZone.
  • Familiarity with AWS CLI or Console.
  • Python or Scala experience for Spark/Glue scripting.
  • Sample dataset (e.g., CSV or Parquet) for ingestion—details to come in Part 1.

Tip: Use a dedicated sandbox AWS account to isolate tutorial resources.

Foundation and Core Services Implementation

Before taking a deeper look at the internal architecture of the AWS Lakehouse, we need to initialize the foundation services:

Part A: Security Stack Setup

1. Create KMS Keys for Data Encryption

1.1 Create S3 Data Encryption Key

  1. Navigate to AWS KMS service in the console
  2. Click Create key
  3. Configure key:
    • Key type: Symmetric
    • Key usage: Encrypt and decrypt
    • Click Next
  4. Add alias and description:
    • Alias: ecom-lakehouse-s3-key
    • Description: KMS key for S3 data encryption in ecom lakehouse
    • Click Next
  5. Define key administrative permissions:
    • Keep default root account permissions
    • Click Next
  6. Define key usage permissions:
    • Keep default root account permissions
    • Click Next
  7. Review and click Finish

1.2 Create Glue/ETL Encryption Key

  1. Repeat steps 1-7 above with:
    • Alias: ecom-lakehouse-glue-key
    • Description: KMS key for Glue ETL jobs encryption
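
If you prefer scripting over clicking through the console, here is a minimal boto3 sketch that creates both symmetric keys and their aliases with the names and descriptions used above. Credentials and region are assumed to come from your environment; when created via the API, the default key policy grants the account root full access, matching the console defaults.

```python
import boto3

kms = boto3.client("kms")

def create_key_with_alias(alias: str, description: str) -> str:
    """Create a symmetric encrypt/decrypt KMS key and attach an alias to it."""
    key = kms.create_key(
        Description=description,
        KeyUsage="ENCRYPT_DECRYPT",
        KeySpec="SYMMETRIC_DEFAULT",
    )
    key_id = key["KeyMetadata"]["KeyId"]
    kms.create_alias(AliasName=f"alias/{alias}", TargetKeyId=key_id)
    return key_id

s3_key_id = create_key_with_alias(
    "ecom-lakehouse-s3-key", "KMS key for S3 data encryption in ecom lakehouse"
)
glue_key_id = create_key_with_alias(
    "ecom-lakehouse-glue-key", "KMS key for Glue ETL jobs encryption"
)
print(s3_key_id, glue_key_id)
```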

2. Create IAM Roles

2.1 Create Glue Service Role

  1. Navigate to IAM service
  2. Click Roles → Create role
  3. Select trusted entity:
    • AWS service
    • Use case: Glue
    • Click Next
  4. Add permissions policies:
    • AWSGlueServiceRole
    • AmazonS3FullAccess
    • AWSLakeFormationDataAdmin
    • Click Next
  5. Role details:
    • Role name: EcomLakehouseGlueRole
    • Description: Service role for Glue ETL jobs in ecom lakehouse
    • Click Create role
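
The same role can be created with boto3. The sketch below uses the standard Glue service trust policy and attaches the three managed policies listed above; treat it as an illustration rather than a least-privilege setup (AmazonS3FullAccess in particular is broad). The roles in 2.2 to 2.4 follow the same pattern with their own trust policies and managed policies.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the AWS Glue service to assume the role.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="EcomLakehouseGlueRole",
    AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
    Description="Service role for Glue ETL jobs in ecom lakehouse",
)

# Attach the managed policies chosen in the console steps above.
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin",
]:
    iam.attach_role_policy(RoleName="EcomLakehouseGlueRole", PolicyArn=policy_arn)
```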

2.2 Create Lake Formation Admin Role

  1. Click Create role
  2. Select trusted entity:
    • AWS service
    • Use case: Lake Formation
    • Click Next
  3. Add permissions policies:
    • AWSLakeFormationDataAdmin
    • AmazonS3FullAccess
    • IAMReadOnlyAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseLakeFormationRole
    • Description: Admin role for Lake Formation data governance
    • Click Create role

2.3 Create Athena Query Role

  1. Click Create role
  2. Select trusted entity:
    • Custom trust policy
    • Paste this trust policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "athena.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
    • Click Next
  3. Add permissions policies:
    • AmazonAthenaFullAccess
    • AmazonS3ReadOnlyAccess
    • AWSGlueConsoleFullAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseAthenaRole
    • Description: Role for Athena query execution
    • Click Create role

2.4 Create Redshift Role

  1. Click Create role
  2. Select trusted entity:
    • AWS service
    • Use case: Redshift – Customizable
    • Click Next
  3. Add permissions policies:
    • AmazonRedshiftAllCommandsFullAccess
    • AmazonS3ReadOnlyAccess
    • AWSGlueConsoleFullAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseRedshiftRole
    • Description: Role for Redshift cluster operations
    • Click Create role

3. Create SSM Parameters for Configuration

  1. Navigate to AWS Systems Manager
  2. Click Parameter Store → Create parameter

3.1 Create Database Configuration Parameters

Create the following parameters one by one:

Parameter 1:

  • Name: /ecom-lakehouse/config/s3-bucket-name
  • Type: String
  • Value: ecom-analyzer-lakehouse
  • Description: Main S3 bucket for lakehouse data

Parameter 2:

  • Name: /ecom-lakehouse/config/glue-role-arn
  • Type: String
  • Value: arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseGlueRole
  • Description: ARN of Glue service role

Parameter 3:

  • Name: /ecom-lakehouse/config/kms-s3-key-id
  • Type: String
  • Value: alias/ecom-lakehouse-s3-key
  • Description: KMS key for S3 encryption

Parameter 4:

  • Name: /ecom-lakehouse/config/database-name
  • Type: String
  • Value: ecom_lakehouse_db
  • Description: Glue catalog database name
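
Since the four parameters share the same shape, a short boto3 loop can create them all. Replace YOUR-ACCOUNT-ID with your own account ID, as above.

```python
import boto3

ssm = boto3.client("ssm")

parameters = [
    ("/ecom-lakehouse/config/s3-bucket-name", "ecom-analyzer-lakehouse",
     "Main S3 bucket for lakehouse data"),
    ("/ecom-lakehouse/config/glue-role-arn",
     "arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseGlueRole",
     "ARN of Glue service role"),
    ("/ecom-lakehouse/config/kms-s3-key-id", "alias/ecom-lakehouse-s3-key",
     "KMS key for S3 encryption"),
    ("/ecom-lakehouse/config/database-name", "ecom_lakehouse_db",
     "Glue catalog database name"),
]

for name, value, description in parameters:
    # Standard String parameters; Overwrite lets you re-run the script safely.
    ssm.put_parameter(Name=name, Value=value, Type="String",
                      Description=description, Overwrite=True)
```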

Part B: Network Stack Setup

1. Create VPC for Redshift and Secure Resources

1.1 Create VPC

  1. Navigate to VPC service
  2. Click Create VPC
  3. VPC configuration:
    • Resources to create: VPC and more
    • Name tag: ecom-lakehouse-vpc
    • IPv4 CIDR block: 10.0.0.0/16
    • IPv6 CIDR block: No IPv6 CIDR block
    • Tenancy: Default
    • Number of Availability Zones: 2
    • Number of public subnets: 2
    • Number of private subnets: 2
    • NAT gateways: In 1 AZ
    • VPC endpoints: None
    • DNS options: Enable both DNS hostnames and DNS resolution
  4. Click Create VPC
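
The “VPC and more” wizard also provisions subnets, route tables, and a NAT gateway, which is hard to reproduce in a few lines. As a minimal sketch, the snippet below creates only the VPC itself with the same CIDR block and DNS options; subnets, NAT, and endpoints are left to the wizard or to infrastructure-as-code tooling.

```python
import boto3

ec2 = boto3.client("ec2")

# Create the VPC with the same CIDR block and Name tag used in the wizard.
vpc = ec2.create_vpc(
    CidrBlock="10.0.0.0/16",
    TagSpecifications=[{
        "ResourceType": "vpc",
        "Tags": [{"Key": "Name", "Value": "ecom-lakehouse-vpc"}],
    }],
)
vpc_id = vpc["Vpc"]["VpcId"]

# Enable DNS resolution and DNS hostnames (one attribute per API call).
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})
```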

1.2 Create Security Groups

Security Group 1: Redshift Security Group

  1. In VPC console, click Security Groups → Create security group
  2. Basic details:
    • Name: ecom-lakehouse-redshift-sg
    • Description: Security group for Redshift cluster
    • VPC: Select ecom-lakehouse-vpc
  3. Inbound rules:
    • Type: Redshift
    • Protocol: TCP
    • Port: 5439
    • Source: Custom → Select the VPC CIDR 10.0.0.0/16
  4. Outbound rules: Keep default (All traffic)
  5. Click Create security group

Security Group 2: Glue Connection Security Group

Step 1: Create an empty SG for Glue

  1. Create another security group:
    • Name: ecom-lakehouse-glue-sg
    • Description: Security group for Glue connections
    • VPC: Select ecom-lakehouse-vpc
  2. Inbound rules: Leave it empty for now
  3. Outbound rules: Keep default
  4. Click Create security group

Step 2: Edit the SG to Add Self-Reference

  1. Find your newly created security group in the list
  2. Click Actions → Edit inbound rules (or click the Inbound rules tab → Edit inbound rules)
  3. Click Add rule
  4. Configure the rule:
    • Type: All traffic
    • Protocol: All
    • Port range: All
    • Source: Custom
    • In the source field, start typing ecom-lakehouse-glue-sg
    • Select the security group from dropdown (it will show: sg-xxxxxxxxx – ecom-lakehouse-glue-sg)
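
The self-referencing rule is the part that usually trips people up, so here is a boto3 sketch that creates the Glue security group and adds the all-traffic rule pointing back at the group itself. The vpc_id value is a placeholder for the ID of ecom-lakehouse-vpc.

```python
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-xxxxxxxxx"  # assumed: the ID of ecom-lakehouse-vpc

# Create the (initially empty) security group for Glue connections.
sg = ec2.create_security_group(
    GroupName="ecom-lakehouse-glue-sg",
    Description="Security group for Glue connections",
    VpcId=vpc_id,
)
sg_id = sg["GroupId"]

# Self-referencing inbound rule: allow all traffic from members of this same group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols and ports
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```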

Part C: Enhanced S3 Configuration

1. Configure S3 Bucket Policies and Encryption

We assume the S3 bucket ecom-analyzer-lakehouse has already been created with the folders raw, bronze, silver, and gold.

1.1 Enable S3 Bucket Encryption

  1. Navigate to S3 service
  2. Click on bucket ecom-analyzer-lakehouse
  3. Go to Properties tab
  4. Scroll to Default encryption
  5. Click Edit
  6. Configure encryption:
    • Encryption type: Server-side encryption with AWS KMS keys (SSE-KMS)
    • AWS KMS key: Choose from your AWS KMS keys
    • Select: ecom-lakehouse-s3-key
    • Bucket Key: Enable
  7. Click Save changes

1.2 Enable S3 Versioning

  1. In the same bucket, go to Properties
  2. Find Versioning section
  3. Click Edit
  4. Select Enable
  5. Click Save changes
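
Both settings from 1.1 and 1.2 can be applied with two boto3 calls; the KMS alias is the one created in Part A.

```python
import boto3

s3 = boto3.client("s3")
bucket = "ecom-analyzer-lakehouse"

# Default encryption: SSE-KMS with the lakehouse key, plus S3 Bucket Keys to reduce KMS calls.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/ecom-lakehouse-s3-key",
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Turn on object versioning.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)
```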

1.3 Configure S3 Lifecycle Rules (Optional)

  1. Go to Management tab
  2. Click Create lifecycle rule
  3. Lifecycle rule configuration:
    • Name: ecom-lakehouse-lifecycle
    • Status: Enabled
    • Rule scope: Apply to all objects in the bucket
  4. Lifecycle rule actions:
    • Transition current versions of objects between storage classes
    • Transition noncurrent versions of objects between storage classes
    • Delete noncurrent versions of objects
  5. Transition current versions:
    • Days after object creation: 30
    • Storage class: Standard-IA
    • Add another transition:
      • Days: 90
      • Storage class: Glacier Flexible Retrieval
  6. Delete noncurrent versions:
    • Days after objects become noncurrent: 30
  7. Click Create rule
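
The same lifecycle rule can be expressed as a single API call. This sketch covers the current-version transitions and the noncurrent-version expiration configured above; in the API, Glacier Flexible Retrieval is the GLACIER storage class.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ecom-analyzer-lakehouse",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "ecom-lakehouse-lifecycle",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to all objects in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    },
)
```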

2. Create Additional S3 Buckets

2.1 Create Logs Bucket

  1. Click Create bucket
  2. Bucket configuration:
    • Name: ecom-analyzer-lakehouse-logs
    • Region: Same as main bucket
    • Object Ownership: ACLs disabled
    • Block Public Access: Block all public access
    • Versioning: Enable
    • Default encryption: SSE-KMS with ecom-lakehouse-s3-key
  3. Click Create bucket

2.2 Create Athena Query Results Bucket

  1. Click Create bucket
  2. Bucket configuration:
    • Name: ecom-analyzer-athena-results
    • Region: Same as main bucket
    • Object Ownership: ACLs disabled
    • Block Public Access: Block all public access
    • Versioning: Enable
    • Default encryption: SSE-KMS with ecom-lakehouse-s3-key
  3. Click Create bucket
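
Both buckets share the same settings, so a small helper makes the setup repeatable. The region below is an assumption; use the same region as your main bucket, and note that create_bucket takes no LocationConstraint in us-east-1.

```python
import boto3

region = "ap-southeast-1"  # assumed: same region as the main bucket
s3 = boto3.client("s3", region_name=region)

def create_secured_bucket(name: str) -> None:
    """Create a bucket with public access blocked, versioning, and SSE-KMS defaults."""
    s3.create_bucket(
        Bucket=name,
        CreateBucketConfiguration={"LocationConstraint": region},  # omit in us-east-1
    )
    s3.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True, "IgnorePublicAcls": True,
            "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
        },
    )
    s3.put_bucket_versioning(Bucket=name, VersioningConfiguration={"Status": "Enabled"})
    s3.put_bucket_encryption(
        Bucket=name,
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/ecom-lakehouse-s3-key",
                },
                "BucketKeyEnabled": True,
            }]
        },
    )

for bucket in ["ecom-analyzer-lakehouse-logs", "ecom-analyzer-athena-results"]:
    create_secured_bucket(bucket)
```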

Part D: Integrated Data Catalog & Lake Formation Setup

1. Configure Lake Formation Foundation

  1. Navigate to AWS Lake Formation
  2. Administrative roles and tasks → Data lake administrators → Add
  3. Add EcomLakehouseLakeFormationRole
  4. Data lake locations → Register location:
    • Path: s3://ecom-analyzer-lakehouse/
    • IAM role: EcomLakehouseLakeFormationRole
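
Registering the location and adding the administrator can also be scripted; replace YOUR-ACCOUNT-ID with your account ID. Be aware that put_data_lake_settings replaces the whole settings object, so in a real environment you would merge with the output of get_data_lake_settings first.

```python
import boto3

lf = boto3.client("lakeformation")
role_arn = "arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseLakeFormationRole"

# Make the role a data lake administrator (overwrites the existing admin list).
lf.put_data_lake_settings(
    DataLakeSettings={"DataLakeAdmins": [{"DataLakePrincipalIdentifier": role_arn}]}
)

# Register the S3 location with Lake Formation using that role instead of the service-linked role.
lf.register_resource(
    ResourceArn="arn:aws:s3:::ecom-analyzer-lakehouse",
    UseServiceLinkedRole=False,
    RoleArn=role_arn,
)
```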

2. Create Glue Database with Lake Formation Integration

  1. Navigate to AWS Glue → Databases → Create database
  2. Configure:
    • Name: ecom_lakehouse_db
    • Location: s3://ecom-analyzer-lakehouse/
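
Creating the catalog database is a single boto3 call:

```python
import boto3

glue = boto3.client("glue")

# Create the Glue Data Catalog database backed by the lakehouse bucket.
glue.create_database(
    DatabaseInput={
        "Name": "ecom_lakehouse_db",
        "LocationUri": "s3://ecom-analyzer-lakehouse/",
    }
)
```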

3. Grant Unified Permissions

  1. Lake Formation → Permissions → Data permissions → Grant
  2. For Glue Role:
    • Principal: EcomLakehouseGlueRole
    • Database: ecom_lakehouse_db
    • Database Permissions: All
  3. For Athena Role:
    • Principal: EcomLakehouseAthenaRole
    • Database: ecom_lakehouse_db
    • Database Permissions: All
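
The two grants map to lakeformation.grant_permissions calls; the "ALL" value corresponds to the "All" database permission selected in the console. Replace YOUR-ACCOUNT-ID as before.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant full database permissions to the Glue and Athena roles.
for role in ["EcomLakehouseGlueRole", "EcomLakehouseAthenaRole"]:
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": f"arn:aws:iam::YOUR-ACCOUNT-ID:role/{role}"
        },
        Resource={"Database": {"Name": "ecom_lakehouse_db"}},
        Permissions=["ALL"],
    )
```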

Part E: Medallion Architecture Crawlers

1. Create Multi-Layer Crawler (Consolidated Approach)

  1. Glue → Crawlers → Create crawler

  2. Properties:

    • Name: ecom-medallion-crawler
    • Description: Unified crawler for all medallion layers
  3. Data sources (Add multiple):

    • Source 1: s3://ecom-analyzer-lakehouse/bronze/
    • Source 2: s3://ecom-analyzer-lakehouse/silver/
    • Source 3: s3://ecom-analyzer-lakehouse/gold/
  4. Security: IAM role EcomLakehouseGlueRole

  5. Output:

    • Database: ecom_lakehouse_db
    • Table name prefix: (Empty)
    • Schedule: On demand
  6. Run crawler and verify tables created
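
The consolidated crawler can be created and started with boto3 as well; the three S3 targets correspond to the medallion layers listed above, and no schedule is set, so it runs on demand.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ecom-medallion-crawler",
    Description="Unified crawler for all medallion layers",
    Role="EcomLakehouseGlueRole",
    DatabaseName="ecom_lakehouse_db",
    Targets={"S3Targets": [
        {"Path": "s3://ecom-analyzer-lakehouse/bronze/"},
        {"Path": "s3://ecom-analyzer-lakehouse/silver/"},
        {"Path": "s3://ecom-analyzer-lakehouse/gold/"},
    ]},
)

# Kick off the first crawl, then verify the tables in the Glue console or via get_tables.
glue.start_crawler(Name="ecom-medallion-crawler")
```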

Part F: Athena Workgroup Setup

1. Configure Athena Workgroup

  1. Amazon Athena → Workgroups → Create workgroup
  2. Configure:
    • Name: ecom-lakehouse-workgroup
    • Query results: s3://ecom-analyzer-athena-results/
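
The equivalent boto3 call points the workgroup's query results at the bucket created in Part C:

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="ecom-lakehouse-workgroup",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://ecom-analyzer-athena-results/"
        }
    },
)
```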

2. Test Integration

  1. Query editor → Select ecom-lakehouse-workgroup
  2. Database: ecom_lakehouse_db
  3. Test query: SHOW TABLES;
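
You can run the same smoke test from code: submit SHOW TABLES through the new workgroup, wait for the query to finish, and print the table names.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit the query against the lakehouse database via the new workgroup.
query_id = athena.start_query_execution(
    QueryString="SHOW TABLES",
    QueryExecutionContext={"Database": "ecom_lakehouse_db"},
    WorkGroup="ecom-lakehouse-workgroup",
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # Each result row holds one table name.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print(row["Data"][0].get("VarCharValue", ""))
else:
    print(f"Query ended in state {state}")
```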

Next Steps

In Part 2, we will create Glue Jobs and configure the Iceberg table format for our S3 bucket. If you have questions or need clarification, drop a comment below – let’s build this AWS Lakehouse pipeline together!

Truong Nguyen Duc

With a solid foundation in software engineering and a growing expertise in big data technologies, I bring 5+ years of experience across roles in data engineering, backend development, and product ownership. I'm currently a Data Engineer at NashTech, where I work on building scalable data solutions to empower business insights using technologies like AWS, Databricks, and Microsoft Azure.
