Question 1

What AWS services form the core of an enterprise data lakehouse?

Accepted Answer

The core AWS lakehouse stack has six components: Amazon S3 with Apache Iceberg for ACID table storage; AWS Glue for ETL, data cataloguing (Glue Data Catalog), and data quality (Glue DataBrew and Deequ); Amazon Athena for serverless SQL on S3 Iceberg tables; Amazon Redshift Serverless for high-performance warehouse workloads; AWS Lake Formation for column-level access control via LF-Tag ABAC and row-level access control via data filters; and Amazon EMR or Databricks on AWS for large-scale Spark processing. Together they form the Bronze-to-Gold medallion stack for most enterprise AWS lakehouse implementations.

Question 2

What is AWS Lake Formation and how does it differ from standard S3 bucket policies?

Accepted Answer

AWS Lake Formation is a centralised data lake governance service that provides fine-grained access control, down to the column and row level, for S3 data registered with it and catalogued in the Glue Data Catalog. S3 bucket policies alone cannot do this: they operate on buckets, prefixes, and objects, and have no notion of tables, columns, or rows. Lake Formation uses LF-Tags for ABAC: a tag such as 'sensitivity=PII' on a column automatically restricts access to principals that have been granted permissions on that tag value, without modifying individual bucket policies. Row-level restrictions are defined separately through data filters. Lake Formation integrates with Glue Data Catalog, Athena, Redshift Spectrum, and EMR, so access control stays consistent regardless of which AWS service queries the data.
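The tag-matching idea behind LF-Tag ABAC can be illustrated with a small toy model. This is a sketch of the matching logic only, not the Lake Formation API: the table, column names, tag values, and grants are all hypothetical, and it assumes a grant expression in which tag keys are ANDed together while the values listed for one key are ORed.

```python
# Toy model of LF-Tag matching; this is NOT the Lake Formation API.
# All column names, tag values, and grants below are hypothetical.


def column_matches(column_tags, grant_expression):
    """Return True if the column's tags satisfy every key in the grant.

    grant_expression maps a tag key to the set of values the principal
    is allowed to see. Keys are ANDed; values within one key are ORed.
    A column that lacks a required tag key never matches.
    """
    return all(
        column_tags.get(key) in allowed_values
        for key, allowed_values in grant_expression.items()
    )


def visible_columns(table_columns, grant_expression):
    """Return the names of the columns a principal may SELECT."""
    return [
        name
        for name, tags in table_columns.items()
        if column_matches(tags, grant_expression)
    ]


# Hypothetical table: each column carries one 'sensitivity' tag.
customers = {
    "customer_id": {"sensitivity": "internal"},
    "email": {"sensitivity": "PII"},
    "country": {"sensitivity": "public"},
}

# Hypothetical grants for two principals.
analyst_grant = {"sensitivity": {"public", "internal"}}
privacy_officer_grant = {"sensitivity": {"public", "internal", "PII"}}

print(visible_columns(customers, analyst_grant))
print(visible_columns(customers, privacy_officer_grant))
```

In this model the analyst never sees the email column, and tagging a new column as PII hides it from the analyst without touching any grant, which is the property that makes tag-based control easier to maintain than per-resource policies.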

Question 3

What are the five Amazon Bedrock Agents deployed in the AWS lakehouse platform?

Accepted Answer

Five Bedrock Agents operate across the platform: (1) Pipeline Repair Agent: uses CloudWatch logs and Glue job metrics as a Knowledge Base to diagnose and fix pipeline failures, with Action Groups invoking Lambda for automated remediation; (2) Quality Triage Agent: routes Glue DataBrew and Deequ quality alerts to resolution workflows based on severity and downstream SLA impact; (3) Cost Optimisation Agent: analyses Cost Explorer data and identifies S3 storage tier optimisation opportunities within pre-approved thresholds; (4) Catalog Enrichment Agent: auto-generates Glue Data Catalog descriptions, tags, and quality assessments for newly crawled tables; (5) GDPR Compliance Agent: monitors Macie findings, tracks Data Subject Registry entries, and orchestrates Article 17 erasure workflows against a 72-hour completion target. Note that GDPR itself (Article 12(3)) requires action on erasure requests without undue delay and within one month; the statutory 72-hour limit applies to breach notification under Article 33.
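The routing rule described for the Quality Triage Agent (severity plus downstream SLA impact) can be sketched as a plain function. This is an illustrative sketch only: the `QualityAlert` fields, the four-hour urgency threshold, and the workflow names are hypothetical and do not come from the guide or from the Bedrock Agents API.

```python
from dataclasses import dataclass


@dataclass
class QualityAlert:
    """Hypothetical shape of a data quality alert."""
    dataset: str
    severity: str        # "low", "medium", or "high"
    hours_to_sla: float  # hours until the nearest downstream SLA is breached


def route_alert(alert, sla_urgent_hours=4.0):
    """Pick a resolution workflow from severity and SLA proximity.

    The threshold and the workflow names are assumptions for illustration.
    """
    sla_at_risk = alert.hours_to_sla <= sla_urgent_hours
    if alert.severity == "high" and sla_at_risk:
        return "page_on_call"
    if alert.severity == "high" or sla_at_risk:
        return "open_priority_ticket"
    if alert.severity == "medium":
        return "open_ticket"
    return "log_only"


alerts = [
    QualityAlert("silver.orders", "high", 2.0),
    QualityAlert("silver.customers", "medium", 1.5),
    QualityAlert("bronze.clickstream", "low", 48.0),
]
for alert in alerts:
    print(alert.dataset, "->", route_alert(alert))
```

Keeping the rule in one deterministic function makes the agent's escalation behaviour easy to review and test, independent of whichever model drives the agent.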

Question 4

How does S3 Intelligent-Tiering help with lakehouse cost management?

Accepted Answer

S3 Intelligent-Tiering automatically moves objects between access tiers based on observed access patterns. Three tiers are automatic (Frequent Access, Infrequent Access, Archive Instant Access) and two are opt-in (Archive Access, Deep Archive Access). No retrieval fees apply when objects move between tiers; the service charges a small monthly per-object monitoring and automation fee instead, and objects smaller than 128 KB are not auto-tiered. For a lakehouse, this is particularly valuable for Bronze layer historical data that is accessed frequently during initial processing but rarely thereafter. Combined with S3 storage class analysis and lifecycle policies, Intelligent-Tiering typically reduces S3 costs by 20–40% for mature lakehouse deployments.
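As a rough illustration of how access patterns map to tiers, the sketch below models only the three automatic tiers, using the documented thresholds of 30 and 90 consecutive days without access and the 128 KB minimum size for auto-tiering. The opt-in archive tiers are deliberately left out, and the object keys, ages, and sizes are hypothetical sample data.

```python
from collections import Counter

# Objects below this size are not auto-tiered and stay in Frequent Access.
MIN_AUTO_TIER_BYTES = 128 * 1024


def intelligent_tiering_tier(days_since_last_access, size_bytes):
    """Estimate the automatic Intelligent-Tiering tier for one object.

    Simplified model: ignores the opt-in Archive Access and
    Deep Archive Access tiers.
    """
    if days_since_last_access < 0 or size_bytes < 0:
        raise ValueError("inputs must be non-negative")
    if size_bytes < MIN_AUTO_TIER_BYTES:
        return "Frequent Access"
    if days_since_last_access < 30:
        return "Frequent Access"
    if days_since_last_access < 90:
        return "Infrequent Access"
    return "Archive Instant Access"


# Hypothetical Bronze layer objects: (key, days since last access, size).
bronze_objects = [
    ("bronze/orders/2024-01.parquet", 400, 256 * 1024 * 1024),
    ("bronze/orders/2025-05.parquet", 45, 256 * 1024 * 1024),
    ("bronze/orders/latest.parquet", 2, 256 * 1024 * 1024),
    ("bronze/orders/_manifest.json", 400, 4 * 1024),
]

tier_counts = Counter(
    intelligent_tiering_tier(days, size) for _, days, size in bronze_objects
)
print(dict(tier_counts))
```

The small manifest file stays in Frequent Access despite being idle for over a year, which is one reason lakehouses with many tiny files tend to see smaller savings than those with compacted Parquet data.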

Question 5

What is the difference between AWS Glue ETL and Amazon EMR for data processing?

Accepted Answer

AWS Glue is a fully serverless ETL service: you write PySpark or Python scripts, and AWS manages all cluster provisioning, scaling, and maintenance. It is ideal for scheduled batch ETL where operational simplicity matters more than fine-grained cluster control. Amazon EMR gives full control over a Spark cluster, including instance types, cluster size, and Spark configuration, and can run long-lived clusters for interactive workloads. EMR is preferred for complex Spark workloads that require specific configurations, for ML training jobs, and for high-throughput stream processing. For most enterprise lakehouse Silver layer transformations, Glue is the recommended default.

Question 6

How do you implement real-time streaming in an AWS data lakehouse?

Accepted Answer

The AWS lakehouse streaming architecture uses Amazon Kinesis Data Streams for sub-second ingestion, Amazon MSK for Kafka-compatible high-throughput event streaming, and Amazon Managed Service for Apache Flink for stateful exactly-once processing. Flink writes directly to S3 Iceberg Bronze tables via the Iceberg Flink connector. This unified design avoids the complexity of a Lambda architecture (separate batch and speed layers, not the AWS Lambda service): streaming data becomes queryable through Athena as soon as a Flink checkpoint commits a new Iceberg snapshot, alongside historical batch data in the same Iceberg tables and without separate pipelines.
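The link between checkpoints, exactly-once delivery, and query visibility can be shown with a toy model. This is a simplified analogue in plain Python, not the Flink or Iceberg API: the class names and the idea of keying commits by checkpoint ID are assumptions made for illustration. Rows are buffered until a checkpoint, then published atomically as a new snapshot, and a replayed checkpoint is ignored.

```python
class ToyIcebergTable:
    """Toy table: readers only ever see the latest committed snapshot."""

    def __init__(self):
        self.snapshots = []                 # each entry is the full row list
        self.committed_checkpoints = set()  # checkpoint IDs already applied

    def commit(self, checkpoint_id, rows):
        """Atomically publish rows; skip checkpoints committed before."""
        if checkpoint_id in self.committed_checkpoints:
            return False  # replay after recovery: do not duplicate rows
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(rows))
        self.committed_checkpoints.add(checkpoint_id)
        return True

    def scan(self):
        """Return the rows visible to a query right now."""
        return list(self.snapshots[-1]) if self.snapshots else []


class ToyStreamingWriter:
    """Toy sink: buffers rows and commits them only at checkpoints."""

    def __init__(self, table):
        self.table = table
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)

    def on_checkpoint(self, checkpoint_id):
        committed = self.table.commit(checkpoint_id, self.buffer)
        self.buffer = []
        return committed


table = ToyIcebergTable()
writer = ToyStreamingWriter(table)

writer.write({"event_id": 1})
writer.write({"event_id": 2})
rows_before_checkpoint = table.scan()    # nothing is visible yet

writer.on_checkpoint(1)
rows_after_checkpoint = table.scan()     # both rows are now visible

# Simulate recovery: the same records are replayed for checkpoint 1.
writer.write({"event_id": 1})
writer.write({"event_id": 2})
replay_committed = writer.on_checkpoint(1)
rows_after_replay = table.scan()         # still two rows, no duplicates

print(len(rows_before_checkpoint), len(rows_after_checkpoint), len(rows_after_replay))
```

The practical consequence is that end-to-end latency for queries is bounded below by the checkpoint interval, so that interval is the main tuning knob when trading freshness against the number of small files and snapshots written.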

AWS Enterprise Lakehouse Data Platform — Definitive Architecture Guide (Parts 1–3)

What This Guide Covers

Part 1 — Platform Vision, Network, Security and IAM

Part 2 — Data Engineering, Quality, Catalogue, BI and ML/AI

Part 3 — Streaming, APIs, FinOps, SRE and Implementation Roadmap

SimuPro Data Solutions — Cloud Data Engineering & AI Consultancy