Question 1

What is a data lakehouse and how does it differ from a data lake or data warehouse?

Accepted Answer

A data lakehouse combines the low storage cost and schema flexibility of a data lake with the ACID transactions, schema enforcement, and query performance of a data warehouse, all in a single architecture. It does this by layering an open table format such as Apache Iceberg, Delta Lake, or Apache Hudi over object storage, so one system provides cheap storage, transactional guarantees, and fast SQL analytics, and separate lake and warehouse systems are no longer needed.

Question 2

What are the three layers of medallion architecture (Bronze, Silver, Gold)?

Accepted Answer

Bronze (raw ingestion) stores data exactly as received — immutable, append-only, preserved for reprocessing and audit. Silver (cleansed and conformed) applies standardisation, deduplication, null handling, and business rules to produce reliable, schema-consistent datasets. Gold (business-ready) applies business logic, aggregations, and domain-specific models to produce datasets optimised for specific consumers — BI dashboards, ML feature stores, API endpoints. Each layer builds on the previous, and Bronze can always regenerate downstream layers if transformations need correction.
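
The flow through the three layers can be sketched in a few lines of plain Python. This is a minimal illustration only, not how a production lakehouse is implemented: the sample rows, the `to_silver` and `to_gold` function names, and the rule of dropping rows with a missing amount are all assumptions made for the example. A real platform would run equivalent logic in an engine such as Spark or SQL against table-format tables.

```python
from collections import defaultdict

# Bronze: raw events exactly as received (append-only, never edited in place).
# The rows and field names are hypothetical sample data.
BRONZE = [
    {"order_id": "A1", "country": " gb ", "amount": "10.50"},
    {"order_id": "A1", "country": " gb ", "amount": "10.50"},  # duplicate delivery
    {"order_id": "A2", "country": "DE", "amount": None},        # missing amount
    {"order_id": "A3", "country": "de", "amount": "4.00"},
]


def to_silver(bronze_rows):
    """Cleanse and conform: deduplicate, handle nulls, standardise types."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row["order_id"] in seen:
            continue  # deduplicate on the business key
        seen.add(row["order_id"])
        if row["amount"] is None:
            continue  # assumed rule: drop rows that have no amount
        silver.append({
            "order_id": row["order_id"],
            "country": row["country"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Business-ready: aggregate revenue per country for a dashboard."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


# Bronze is only read, never modified, so Silver and Gold can be rebuilt
# from it at any time if a transformation rule needs correcting.
silver = to_silver(BRONZE)
gold = to_gold(silver)
```

Because `to_silver` and `to_gold` are pure functions of their input, fixing a rule and rerunning both against the untouched Bronze rows is enough to regenerate the downstream layers.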

Question 3

What is the difference between Apache Iceberg, Delta Lake and Apache Hudi?

Accepted Answer

All three add ACID transactions to tables stored as files on object storage. Apache Iceberg has broad engine and cloud support, with services on AWS, Azure, and GCP able to work with it, and it supports partition evolution (changing a table's partitioning without rewriting existing data) alongside snapshot-based time travel. Delta Lake is the default table format on Databricks and has excellent streaming support. Apache Hudi specialises in record-level upsert performance, which suits CDC workloads where individual records must be updated efficiently at scale. For most greenfield deployments, this guide recommends Iceberg as the default.

Question 4

How do you implement GDPR Right to Erasure in an immutable data lake?

Accepted Answer

The solution is crypto-shredding: each data subject's data is encrypted with a unique per-subject key stored in the cloud KMS (AWS KMS, Azure Key Vault, GCP Cloud KMS). To erase a subject, you delete their encryption key. Once the key is destroyed, all of their encrypted data across the Bronze layer becomes permanently unreadable, which fulfils the erasure obligation without physically modifying immutable files. A Data Subject Registry tracks every subject's key ID and all Bronze locations containing their data, so an erasure request can be completed within 72 hours. That 72-hour figure is best read as an internal target: Article 17 itself requires erasure "without undue delay", and Article 12(3) allows up to one month to respond to the request.
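
The crypto-shredding flow described above can be sketched as follows. This is a simplified model under stated assumptions, not a production design: `SubjectRegistry` is a hypothetical in-memory stand-in for both the cloud KMS and the Data Subject Registry, the Bronze layer is a plain dictionary, and the cipher is a toy HMAC-based keystream used only so the example runs with the standard library. A real system would use an authenticated cipher such as AES-GCM with keys held in the KMS.

```python
import hashlib
import hmac
import secrets


class KeyErasedError(Exception):
    """Raised when a subject's key is missing, for example after erasure."""


class SubjectRegistry:
    """Hypothetical stand-in for the KMS plus the Data Subject Registry."""

    def __init__(self):
        self._keys = {}       # subject_id -> per-subject key (KMS stand-in)
        self.locations = {}   # subject_id -> Bronze locations holding their data

    def register(self, subject_id):
        self._keys.setdefault(subject_id, secrets.token_bytes(32))
        self.locations.setdefault(subject_id, [])

    def key_for(self, subject_id):
        if subject_id not in self._keys:
            raise KeyErasedError(subject_id)
        return self._keys[subject_id]

    def erase(self, subject_id):
        """Crypto-shred: delete the key and leave Bronze files untouched."""
        self._keys.pop(subject_id, None)
        return self.locations.get(subject_id, [])


def _keystream(key, nonce, length):
    # Toy keystream (HMAC-SHA256 over a counter). Illustration only.
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def write_bronze(registry, bronze, location, subject_id, payload):
    """Encrypt a payload with the subject's key and record where it landed."""
    registry.register(subject_id)
    nonce = secrets.token_bytes(16)
    stream = _keystream(registry.key_for(subject_id), nonce, len(payload))
    ciphertext = bytes(a ^ b for a, b in zip(payload, stream))
    bronze[location] = (subject_id, nonce, ciphertext)
    registry.locations[subject_id].append(location)


def read_bronze(registry, bronze, location):
    """Decrypt a Bronze record; fails once the subject's key is gone."""
    subject_id, nonce, ciphertext = bronze[location]
    stream = _keystream(registry.key_for(subject_id), nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


registry, bronze = SubjectRegistry(), {}
write_bronze(registry, bronze, "bronze/orders/part-0001", "subject-42",
             b"alice@example.com")
snapshot = dict(bronze)  # copy of Bronze before the erasure request
plaintext = read_bronze(registry, bronze, "bronze/orders/part-0001")
affected = registry.erase("subject-42")  # locations now unreadable
```

After `erase`, the Bronze dictionary is byte-for-byte identical to `snapshot`, but any further `read_bronze` call for that subject raises `KeyErasedError`. The list returned by `erase` is the kind of evidence the registry could supply for an audit trail.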

Question 5

What are the six AI agents deployed in the enterprise lakehouse platform?

Accepted Answer

Six autonomous AI agents operate across the platform:
(1) Pipeline Repair Agent: monitors health, diagnoses failures using log analysis, and applies automated fixes for known failure patterns.
(2) Quality Triage Agent: routes data quality alerts to resolution paths based on severity and downstream impact.
(3) Cost Optimisation Agent: analyses cloud spend, identifies waste, and implements reductions within pre-approved bounds.
(4) Catalog Enrichment Agent: auto-generates metadata, tags, and quality assessments for new data assets.
(5) Security Anomaly Agent: monitors access patterns for anomalous behaviour.
(6) Compliance Agent: monitors GDPR and ISO 27001 obligations and generates audit evidence automatically.
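
As an illustration of the routing idea behind the Quality Triage Agent, the sketch below picks a resolution path from an alert's severity and its downstream impact. Everything in it is hypothetical: the `QualityAlert` fields, the three path names, and the threshold of five downstream consumers are assumptions chosen for the example, not details taken from the guide.

```python
from dataclasses import dataclass


@dataclass
class QualityAlert:
    dataset: str
    severity: str              # assumed values: "low", "medium" or "high"
    downstream_consumers: int  # dashboards, models or APIs reading the dataset


def triage(alert):
    """Choose a resolution path from severity and downstream impact."""
    # High severity with at least one live consumer: escalate immediately.
    if alert.severity == "high" and alert.downstream_consumers > 0:
        return "page-on-call"
    # Either high severity or a wide blast radius: ticket for the data owner.
    if alert.severity == "high" or alert.downstream_consumers >= 5:
        return "owner-ticket"
    # Everything else waits in the backlog.
    return "backlog"


urgent = triage(QualityAlert("gold.sales_daily", "high", 12))
wide = triage(QualityAlert("silver.customers", "medium", 7))
minor = triage(QualityAlert("bronze.clickstream", "low", 0))
```

Keeping the rules in one small pure function makes them easy to review and to bound, in the same spirit as the pre-approved limits applied to the Cost Optimisation Agent.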

Question 6

How many people are needed to build an enterprise data lakehouse?

Accepted Answer

The 24–28 week implementation roadmap requires 85–150 people spanning 17 role categories including Platform Architects, Data Engineers, ML Engineers, DataOps/SRE Engineers, Security Engineers, IAM Specialists, Data Governance leads, Data Quality Engineers, BI Developers, Business Analysts, Product Managers, and Programme Managers. A focused Phase 1 foundation can be built with 20–30 people over 8–10 weeks. The 85–150 figure reflects a full enterprise programme delivering all 17 domains simultaneously.

Generic Cloud — Enterprise Lakehouse Data Platform — Definitive Architecture Guide

What This Guide Covers

The 17 Architecture Domains

Medallion Architecture — Bronze, Silver and Gold Layers

Zero Trust Security Architecture — Four-Zone Network Design

AI Agents Across the Platform

Topics Covered in This Guide

Frequently Asked Questions

Brief Summary

Extended Summary

SimuPro Data Solutions — Cloud Data Engineering & AI Consultancy