Data Lakehouse on AWS: Step-by-Step Tutorial (Part 1)

Welcome to the hands-on tutorial series “End-to-End AWS Lakehouse Pipeline”. Over the coming weeks, we’ll guide you through each step needed to design, build, and run a modern Lakehouse architecture on AWS. Whether you’re a data engineer, architect, or analytics specialist, this series offers practical guidance and clear examples you can apply directly in your organization.

Why a Lakehouse on AWS?

The modern Lakehouse architecture combines a traditional data lake (raw, low-cost, often unstructured data) with a classic data warehouse (structured, curated analytics). A lakehouse gives you:

  • Unified Storage: Keep raw, cleaned, and curated data in a single object store (Amazon S3), reducing data duplication.
  • Flexible Schema (Schema Evolution): Support schema-on-read for exploration and schema enforcement for production analytics.
  • Open Formats: Leverage Apache Iceberg or Delta Lake tables for ACID transactions, time travel, and faster queries.
  • Scalable Processing: Use serverless engines like AWS Glue, Amazon EMR, or Amazon Athena to process data at petabyte scale.

By the end of this series, you will have a solid understanding of the Data Lakehouse architecture and of how it combines the flexibility of a data lake with the performance of a data warehouse.

Overall architecture of the series

Over the course of the series, we will implement the following architecture:

Key Components and Terminology

Before diving in, let’s align on the main AWS services and patterns as shown above:

  • Amazon S3: Durable, scalable object store for raw and processed data.
  • AWS Glue: Serverless ETL and Data Catalog service (Jobs, Crawlers, Workflows).
  • EventBridge: Event bus for triggering workflows on data arrival or status changes.
  • Apache Iceberg: Table format for ACID, partition evolution, and time travel on S3 data.
  • Amazon Athena: Serverless, interactive query service reading Iceberg tables.
  • Amazon Redshift Spectrum: Federated querying of S3-backed tables from Redshift.
  • AWS Lake Formation: Centralized security, governance, and fine-grained access control.
  • Amazon CloudWatch: Monitoring, logging, and alerting for all pipeline components.

Additionally, there are security services such as IAM, KMS, Secrets Manager, and VPC that should be provisioned as the foundation of the whole pipeline. We will walk through them in the coming articles.

Implementation Timeline

Our provisioning and implementation will follow these eight phases:

  1. Phase 1: Foundation Infrastructure
    • IAM roles, KMS keys, SSM parameters
    • VPC, subnets, security groups
    • S3 buckets for Bronze/Silver/Gold layers + access logs
  2. Phase 2: Core Data Services
    • Register S3 locations in Lake Formation
    • Set up Glue Data Catalog and Crawlers
  3. Phase 3: Processing Layer
    • Configure Apache Iceberg tables
    • Develop Glue ETL jobs for medallion transformations
  4. Phase 4: Analytics Layer
    • Provision Redshift cluster and Spectrum integration
    • Create Athena workgroups and resource configurations
  5. Phase 5: Governance & Monitoring
    • Define DataZone domain for lineage/governance
    • Build CloudWatch dashboards, alarms, and SNS topics
    • Create EventBridge rules for pipeline events
  6. Phase 6: Data Pipeline
    • Extract: Copy legacy data into raw S3
    • Transform: Bronze→Silver→Gold processing
    • Load: Push curated data into Redshift
  7. Phase 7: Testing & Quality
    • Implement data quality checks with Great Expectations
    • Develop unit/integration tests using pytest
  8. Phase 8: CI/CD
    • Automate deployments with GitHub Actions and CodePipeline

This timeline ensures dependencies—like security, S3, and the Glue catalog—are in place before later phases such as ETL, governance, and CI/CD.

Hands-On Prerequisites

To follow along, you’ll need:

  • AWS account with permissions for IAM, S3, Glue, EventBridge, Lake Formation, Redshift, Athena, CloudWatch, SNS, and DataZone.
  • Familiarity with AWS CLI or Console.
  • Python or Scala experience for Spark/Glue scripting.
  • Sample dataset (e.g., CSV or Parquet) for ingestion—details to come in Part 1.

Tip: Use a dedicated sandbox AWS account to isolate tutorial resources.

Foundation and Core Services Implementation

Before taking a deeper look at the internal architecture of the AWS Lakehouse, we need to initialize the foundation services:

Part A: Security Stack Setup

1. Create KMS Keys for Data Encryption

1.1 Create S3 Data Encryption Key

  1. Navigate to AWS KMS service in the console
  2. Click Create key
  3. Configure key:
    • Key type: Symmetric
    • Key usage: Encrypt and decrypt
    • Click Next
  4. Add alias and description:
    • Alias: ecom-lakehouse-s3-key
    • Description: KMS key for S3 data encryption in ecom lakehouse
    • Click Next
  5. Define key administrative permissions:
    • Keep default root account permissions
    • Click Next
  6. Define key usage permissions:
    • Keep default root account permissions
    • Click Next
  7. Review and click Finish

1.2 Create Glue/ETL Encryption Key

  1. Repeat steps 1-7 above with:
    • Alias: ecom-lakehouse-glue-key
    • Description: KMS key for Glue ETL jobs encryption
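
If you prefer scripting over clicking through the console, here is a minimal boto3 sketch that creates both symmetric keys and their aliases with the names and descriptions used above. Credentials and region are assumed to come from your environment; when created via the API, the default key policy grants the account root full access, matching the console defaults.

```python
import boto3

kms = boto3.client("kms")

def create_key_with_alias(alias: str, description: str) -> str:
    """Create a symmetric encrypt/decrypt KMS key and attach an alias to it."""
    key = kms.create_key(
        Description=description,
        KeyUsage="ENCRYPT_DECRYPT",
        KeySpec="SYMMETRIC_DEFAULT",
    )
    key_id = key["KeyMetadata"]["KeyId"]
    kms.create_alias(AliasName=f"alias/{alias}", TargetKeyId=key_id)
    return key_id

s3_key_id = create_key_with_alias(
    "ecom-lakehouse-s3-key", "KMS key for S3 data encryption in ecom lakehouse"
)
glue_key_id = create_key_with_alias(
    "ecom-lakehouse-glue-key", "KMS key for Glue ETL jobs encryption"
)
print(s3_key_id, glue_key_id)
```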

2. Create IAM Roles

2.1 Create Glue Service Role

  1. Navigate to IAM service
  2. Click Roles → Create role
  3. Select trusted entity:
    • AWS service
    • Use case: Glue
    • Click Next
  4. Add permissions policies:
    • AWSGlueServiceRole
    • AmazonS3FullAccess
    • AWSLakeFormationDataAdmin
    • Click Next
  5. Role details:
    • Role name: EcomLakehouseGlueRole
    • Description: Service role for Glue ETL jobs in ecom lakehouse
    • Click Create role
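
The same role can be created with boto3. The sketch below uses the standard Glue service trust policy and attaches the three managed policies listed above; treat it as an illustration rather than a least-privilege setup (AmazonS3FullAccess in particular is broad). The roles in 2.2 to 2.4 follow the same pattern with their own trust policies and managed policies.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the AWS Glue service to assume the role.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="EcomLakehouseGlueRole",
    AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
    Description="Service role for Glue ETL jobs in ecom lakehouse",
)

# Attach the managed policies chosen in the console steps above.
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin",
]:
    iam.attach_role_policy(RoleName="EcomLakehouseGlueRole", PolicyArn=policy_arn)
```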

2.2 Create Lake Formation Admin Role

  1. Click Create role
  2. Select trusted entity:
    • AWS service
    • Use case: Lake Formation
    • Click Next
  3. Add permissions policies:
    • AWSLakeFormationDataAdmin
    • AmazonS3FullAccess
    • IAMReadOnlyAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseLakeFormationRole
    • Description: Admin role for Lake Formation data governance
    • Click Create role

2.3 Create Athena Query Role

  1. Click Create role
  2. Select trusted entity:
    • Custom trust policy
    • Paste this trust policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "athena.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
    • Click Next
  3. Add permissions policies:
    • AmazonAthenaFullAccess
    • AmazonS3ReadOnlyAccess
    • AWSGlueConsoleFullAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseAthenaRole
    • Description: Role for Athena query execution
    • Click Create role

2.4 Create Redshift Role

  1. Click Create role
  2. Select trusted entity:
    • AWS service
    • Use case: Redshift – Customizable
    • Click Next
  3. Add permissions policies:
    • AmazonRedshiftAllCommandsFullAccess
    • AmazonS3ReadOnlyAccess
    • AWSGlueConsoleFullAccess
    • Click Next
  4. Role details:
    • Role name: EcomLakehouseRedshiftRole
    • Description: Role for Redshift cluster operations
    • Click Create role

3. Create SSM Parameters for Configuration

  1. Navigate to AWS Systems Manager
  2. Click Parameter Store → Create parameter

3.1 Create Database Configuration Parameters

Create the following parameters one by one:

Parameter 1:

  • Name: /ecom-lakehouse/config/s3-bucket-name
  • Type: String
  • Value: ecom-analyzer-lakehouse
  • Description: Main S3 bucket for lakehouse data

Parameter 2:

  • Name: /ecom-lakehouse/config/glue-role-arn
  • Type: String
  • Value: arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseGlueRole
  • Description: ARN of Glue service role

Parameter 3:

  • Name: /ecom-lakehouse/config/kms-s3-key-id
  • Type: String
  • Value: alias/ecom-lakehouse-s3-key
  • Description: KMS key for S3 encryption

Parameter 4:

  • Name: /ecom-lakehouse/config/database-name
  • Type: String
  • Value: ecom_lakehouse_db
  • Description: Glue catalog database name
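
Since the four parameters share the same shape, a short boto3 loop can create them all. Replace YOUR-ACCOUNT-ID with your own account ID, as above.

```python
import boto3

ssm = boto3.client("ssm")

parameters = [
    ("/ecom-lakehouse/config/s3-bucket-name", "ecom-analyzer-lakehouse",
     "Main S3 bucket for lakehouse data"),
    ("/ecom-lakehouse/config/glue-role-arn",
     "arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseGlueRole",
     "ARN of Glue service role"),
    ("/ecom-lakehouse/config/kms-s3-key-id", "alias/ecom-lakehouse-s3-key",
     "KMS key for S3 encryption"),
    ("/ecom-lakehouse/config/database-name", "ecom_lakehouse_db",
     "Glue catalog database name"),
]

for name, value, description in parameters:
    # Standard String parameters; Overwrite lets you re-run the script safely.
    ssm.put_parameter(Name=name, Value=value, Type="String",
                      Description=description, Overwrite=True)
```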

Part B: Network Stack Setup

1. Create VPC for Redshift and Secure Resources

1.1 Create VPC

  1. Navigate to VPC service
  2. Click Create VPC
  3. VPC configuration:
    • Resources to create: VPC and more
    • Name tag: ecom-lakehouse-vpc
    • IPv4 CIDR block: 10.0.0.0/16
    • IPv6 CIDR block: No IPv6 CIDR block
    • Tenancy: Default
    • Number of Availability Zones: 2
    • Number of public subnets: 2
    • Number of private subnets: 2
    • NAT gateways: In 1 AZ
    • VPC endpoints: None
    • DNS options: Enable both DNS hostnames and DNS resolution
  4. Click Create VPC
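
The “VPC and more” wizard also provisions subnets, route tables, and a NAT gateway, which is hard to reproduce in a few lines. As a minimal sketch, the snippet below creates only the VPC itself with the same CIDR block and DNS options; subnets, NAT, and endpoints are left to the wizard or to infrastructure-as-code tooling.

```python
import boto3

ec2 = boto3.client("ec2")

# Create the VPC with the same CIDR block and Name tag used in the wizard.
vpc = ec2.create_vpc(
    CidrBlock="10.0.0.0/16",
    TagSpecifications=[{
        "ResourceType": "vpc",
        "Tags": [{"Key": "Name", "Value": "ecom-lakehouse-vpc"}],
    }],
)
vpc_id = vpc["Vpc"]["VpcId"]

# Enable DNS resolution and DNS hostnames (one attribute per API call).
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})
```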

1.2 Create Security Groups

Security Group 1: Redshift Security Group

  1. In VPC console, click Security Groups → Create security group
  2. Basic details:
    • Name: ecom-lakehouse-redshift-sg
    • Description: Security group for Redshift cluster
    • VPC: Select ecom-lakehouse-vpc
  3. Inbound rules:
    • Type: Redshift
    • Protocol: TCP
    • Port: 5439
    • Source: Custom → Select the VPC CIDR 10.0.0.0/16
  4. Outbound rules: Keep default (All traffic)
  5. Click Create security group

Security Group 2: Glue Connection Security Group

Step 1: Create an empty SG for Glue

  1. Create another security group:
    • Name: ecom-lakehouse-glue-sg
    • Description: Security group for Glue connections
    • VPC: Select ecom-lakehouse-vpc
  2. Inbound rules: Leave it empty for now
  3. Outbound rules: Keep default
  4. Click Create security group

Step 2: Edit the SG to Add Self-Reference

  1. Find your newly created security group in the list
  2. Click Actions → Edit inbound rules (or click the Inbound rules tab → Edit inbound rules)
  3. Click Add rule
  4. Configure the rule:
    • Type: All traffic
    • Protocol: All
    • Port range: All
    • Source: Custom
    • In the source field, start typing ecom-lakehouse-glue-sg
    • Select the security group from dropdown (it will show: sg-xxxxxxxxx – ecom-lakehouse-glue-sg)
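
The self-referencing rule is the part that usually trips people up, so here is a boto3 sketch that creates the Glue security group and adds the all-traffic rule pointing back at the group itself. The vpc_id value is a placeholder for the ID of ecom-lakehouse-vpc.

```python
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-xxxxxxxxx"  # assumed: the ID of ecom-lakehouse-vpc

# Create the (initially empty) security group for Glue connections.
sg = ec2.create_security_group(
    GroupName="ecom-lakehouse-glue-sg",
    Description="Security group for Glue connections",
    VpcId=vpc_id,
)
sg_id = sg["GroupId"]

# Self-referencing inbound rule: allow all traffic from members of this same group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols and ports
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```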

Part C: Enhanced S3 Configuration

1. Configure S3 Bucket Policies and Encryption

We assume the S3 bucket ecom-analyzer-lakehouse has already been created with the folders raw, bronze, silver, and gold.

1.1 Enable S3 Bucket Encryption

  1. Navigate to S3 service
  2. Click on bucket ecom-analyzer-lakehouse
  3. Go to Properties tab
  4. Scroll to Default encryption
  5. Click Edit
  6. Configure encryption:
    • Encryption type: Server-side encryption with AWS KMS keys (SSE-KMS)
    • AWS KMS key: Choose from your AWS KMS keys
    • Select: ecom-lakehouse-s3-key
    • Bucket Key: Enable
  7. Click Save changes

1.2 Enable S3 Versioning

  1. In the same bucket, go to Properties
  2. Find Versioning section
  3. Click Edit
  4. Select Enable
  5. Click Save changes
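
Both settings from 1.1 and 1.2 can be applied with two boto3 calls; the KMS alias is the one created in Part A.

```python
import boto3

s3 = boto3.client("s3")
bucket = "ecom-analyzer-lakehouse"

# Default encryption: SSE-KMS with the lakehouse key, plus S3 Bucket Keys to reduce KMS calls.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/ecom-lakehouse-s3-key",
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Turn on object versioning.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)
```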

1.3 Configure S3 Lifecycle Rules (Optional)

  1. Go to Management tab
  2. Click Create lifecycle rule
  3. Lifecycle rule configuration:
    • Name: ecom-lakehouse-lifecycle
    • Status: Enabled
    • Rule scope: Apply to all objects in the bucket
  4. Lifecycle rule actions:
    • Transition current versions of objects between storage classes
    • Transition noncurrent versions of objects between storage classes
    • Delete noncurrent versions of objects
  5. Transition current versions:
    • Days after object creation: 30
    • Storage class: Standard-IA
    • Add another transition:
      • Days: 90
      • Storage class: Glacier Flexible Retrieval
  6. Delete noncurrent versions:
    • Days after objects become noncurrent: 30
  7. Click Create rule
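
The same lifecycle rule can be expressed as a single API call. This sketch covers the current-version transitions and the noncurrent-version expiration configured above; in the API, Glacier Flexible Retrieval is the GLACIER storage class.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ecom-analyzer-lakehouse",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "ecom-lakehouse-lifecycle",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to all objects in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    },
)
```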

2. Create Additional S3 Buckets

2.1 Create Logs Bucket

  1. Click Create bucket
  2. Bucket configuration:
    • Name: ecom-analyzer-lakehouse-logs
    • Region: Same as main bucket
    • Object Ownership: ACLs disabled
    • Block Public Access: Block all public access
    • Versioning: Enable
    • Default encryption: SSE-KMS with ecom-lakehouse-s3-key
  3. Click Create bucket

2.2 Create Athena Query Results Bucket

  1. Click Create bucket
  2. Bucket configuration:
    • Name: ecom-analyzer-athena-results
    • Region: Same as main bucket
    • Object Ownership: ACLs disabled
    • Block Public Access: Block all public access
    • Versioning: Enable
    • Default encryption: SSE-KMS with ecom-lakehouse-s3-key
  3. Click Create bucket
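
Both buckets share the same settings, so a small helper makes the setup repeatable. The region below is an assumption; use the same region as your main bucket, and note that create_bucket takes no LocationConstraint in us-east-1.

```python
import boto3

region = "ap-southeast-1"  # assumed: same region as the main bucket
s3 = boto3.client("s3", region_name=region)

def create_secured_bucket(name: str) -> None:
    """Create a bucket with public access blocked, versioning, and SSE-KMS defaults."""
    s3.create_bucket(
        Bucket=name,
        CreateBucketConfiguration={"LocationConstraint": region},  # omit in us-east-1
    )
    s3.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True, "IgnorePublicAcls": True,
            "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
        },
    )
    s3.put_bucket_versioning(Bucket=name, VersioningConfiguration={"Status": "Enabled"})
    s3.put_bucket_encryption(
        Bucket=name,
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/ecom-lakehouse-s3-key",
                },
                "BucketKeyEnabled": True,
            }]
        },
    )

for bucket in ["ecom-analyzer-lakehouse-logs", "ecom-analyzer-athena-results"]:
    create_secured_bucket(bucket)
```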

Part D: Integrated Data Catalog & Lake Formation Setup

1. Configure Lake Formation Foundation

  1. Navigate to AWS Lake Formation
  2. Administrative roles and tasks → Data lake administrators → Add
  3. Add EcomLakehouseLakeFormationRole
  4. Data lake locations → Register location:
    • Path: s3://ecom-analyzer-lakehouse/
    • IAM role: EcomLakehouseLakeFormationRole
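
Registering the location and adding the administrator can also be scripted; replace YOUR-ACCOUNT-ID with your account ID. Be aware that put_data_lake_settings replaces the whole settings object, so in a real environment you would merge with the output of get_data_lake_settings first.

```python
import boto3

lf = boto3.client("lakeformation")
role_arn = "arn:aws:iam::YOUR-ACCOUNT-ID:role/EcomLakehouseLakeFormationRole"

# Make the role a data lake administrator (overwrites the existing admin list).
lf.put_data_lake_settings(
    DataLakeSettings={"DataLakeAdmins": [{"DataLakePrincipalIdentifier": role_arn}]}
)

# Register the S3 location with Lake Formation using that role instead of the service-linked role.
lf.register_resource(
    ResourceArn="arn:aws:s3:::ecom-analyzer-lakehouse",
    UseServiceLinkedRole=False,
    RoleArn=role_arn,
)
```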

2. Create Glue Database with Lake Formation Integration

  1. Navigate to AWS Glue → Databases → Create database
  2. Configure:
    • Name: ecom_lakehouse_db
    • Location: s3://ecom-analyzer-lakehouse/
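
Creating the catalog database is a single boto3 call:

```python
import boto3

glue = boto3.client("glue")

# Create the Glue Data Catalog database backed by the lakehouse bucket.
glue.create_database(
    DatabaseInput={
        "Name": "ecom_lakehouse_db",
        "LocationUri": "s3://ecom-analyzer-lakehouse/",
    }
)
```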

3. Grant Unified Permissions

  1. Lake Formation → Permissions → Data permissions → Grant
  2. For Glue Role:
    • Principal: EcomLakehouseGlueRole
    • Database: ecom_lakehouse_db
    • Database Permissions: All
  3. For Athena Role:
    • Principal: EcomLakehouseAthenaRole
    • Database: ecom_lakehouse_db
    • Database Permissions: All
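
The two grants map to lakeformation.grant_permissions calls; the "ALL" value corresponds to the "All" database permission selected in the console. Replace YOUR-ACCOUNT-ID as before.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant full database permissions to the Glue and Athena roles.
for role in ["EcomLakehouseGlueRole", "EcomLakehouseAthenaRole"]:
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": f"arn:aws:iam::YOUR-ACCOUNT-ID:role/{role}"
        },
        Resource={"Database": {"Name": "ecom_lakehouse_db"}},
        Permissions=["ALL"],
    )
```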

Part E: Medallion Architecture Crawlers

1. Create Multi-Layer Crawler (Consolidated Approach)

  1. Glue → Crawlers → Create crawler

  2. Properties:

    • Name: ecom-medallion-crawler
    • Description: Unified crawler for all medallion layers
  3. Data sources (Add multiple):

    • Source 1: s3://ecom-analyzer-lakehouse/bronze/
    • Source 2: s3://ecom-analyzer-lakehouse/silver/
    • Source 3: s3://ecom-analyzer-lakehouse/gold/
  4. Security: IAM role EcomLakehouseGlueRole

  5. Output:

    • Database: ecom_lakehouse_db
    • Table name prefix: (Empty)
    • Schedule: On demand
  6. Run crawler and verify tables created
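
The consolidated crawler can be created and started with boto3 as well; the three S3 targets correspond to the medallion layers listed above, and no schedule is set, so it runs on demand.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ecom-medallion-crawler",
    Description="Unified crawler for all medallion layers",
    Role="EcomLakehouseGlueRole",
    DatabaseName="ecom_lakehouse_db",
    Targets={"S3Targets": [
        {"Path": "s3://ecom-analyzer-lakehouse/bronze/"},
        {"Path": "s3://ecom-analyzer-lakehouse/silver/"},
        {"Path": "s3://ecom-analyzer-lakehouse/gold/"},
    ]},
)

# Kick off the first crawl, then verify the tables in the Glue console or via get_tables.
glue.start_crawler(Name="ecom-medallion-crawler")
```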

Part F: Athena Workgroup Setup

1. Configure Athena Workgroup

  1. Amazon Athena → Workgroups → Create workgroup
  2. Configure:
    • Name: ecom-lakehouse-workgroup
    • Query results: s3://ecom-analyzer-athena-results/
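
The equivalent boto3 call points the workgroup's query results at the bucket created in Part C:

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="ecom-lakehouse-workgroup",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://ecom-analyzer-athena-results/"
        }
    },
)
```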

2. Test Integration

  1. Query editor → Select ecom-lakehouse-workgroup
  2. Database: ecom_lakehouse_db
  3. Test query: SHOW TABLES;
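
You can run the same smoke test from code: submit SHOW TABLES through the new workgroup, wait for the query to finish, and print the table names.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit the query against the lakehouse database via the new workgroup.
query_id = athena.start_query_execution(
    QueryString="SHOW TABLES",
    QueryExecutionContext={"Database": "ecom_lakehouse_db"},
    WorkGroup="ecom-lakehouse-workgroup",
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # Each result row holds one table name.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print(row["Data"][0].get("VarCharValue", ""))
else:
    print(f"Query ended in state {state}")
```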

Next Steps

In Part 2, we will create Glue Jobs and configure the Iceberg table format for our S3 bucket. If you have questions or need clarification, drop a comment below – let’s build this AWS Lakehouse pipeline together!

Truong Nguyen Duc

With a solid foundation in software engineering and a growing expertise in big data technologies, I bring 5+ years of experience across roles in data engineering, backend development, and product ownership. I'm currently a Data Engineer at NashTech, where I work on building scalable data solutions to empower business insights using technologies like AWS, Databricks, and Microsoft Azure.
