Question 1

What GCP services form the core of an enterprise data lakehouse?

Accepted Answer

The core stack: GCS with Apache Iceberg via BigLake for ACID transactions; Cloud Dataflow for batch and streaming ETL; BigQuery as the analytical query engine; Dataplex for governance, cataloguing, and quality; Cloud Composer for orchestration; and Pub/Sub with Dataflow for real-time streaming. BigQuery's serverless architecture scales compute automatically and, under on-demand pricing, charges per byte scanned, so cost tracks actual usage. That makes it a good fit for variable analytics workloads.
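To make the per-byte charging model concrete, here is a minimal sketch of the arithmetic. The price constant is an assumption for illustration only; check current pricing for your region before relying on it. The function name is hypothetical.

```python
# Assumed on-demand price in USD per TiB scanned. This is an illustrative
# figure, not an authoritative one; pricing varies by region and over time.
ASSUMED_USD_PER_TIB = 6.25
BYTES_PER_TIB = 2 ** 40


def on_demand_cost_usd(bytes_scanned: int,
                       usd_per_tib: float = ASSUMED_USD_PER_TIB) -> float:
    """Estimate the on-demand cost of a query from the bytes it scans."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TIB * usd_per_tib


# A query that scans half a TiB, under the assumed price.
half_tib_cost = on_demand_cost_usd(2 ** 39)
```

In practice the bytes figure would come from a query dry run, which reports how much data a query would process without executing it. Selecting fewer columns or pruning partitions lowers the bytes scanned and therefore the cost.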

Question 2

What is Google Cloud Dataplex and how does it manage lakehouse governance?

Accepted Answer

Dataplex is GCP's intelligent data fabric, providing unified governance, discovery, and quality management across GCS, BigQuery, and Bigtable from a single control plane. Tag-based ABAC propagates access control automatically to all assets within a Dataplex lake. Automated data discovery builds a live catalogue, built-in DQ rules track quality scores over time, and data lineage is captured across Dataflow, Spark, and BigQuery jobs. Together these remove the need for manual per-bucket and per-dataset policy management.
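The idea behind rule-based quality scoring can be sketched in plain Python. This is not the Dataplex API; the rule helpers and the `quality_score` function are hypothetical names used only to show how a per-rule pass ratio might be computed and then trended over time.

```python
from typing import Any, Callable, Dict, Iterable

Row = Dict[str, Any]
Rule = Callable[[Row], bool]


def non_null(column: str) -> Rule:
    """Rule that passes when the column is present and not None."""
    return lambda row: row.get(column) is not None


def in_range(column: str, low: float, high: float) -> Rule:
    """Rule that passes when the column value lies in [low, high]."""
    return lambda row: row.get(column) is not None and low <= row[column] <= high


def quality_score(rows: Iterable[Row], rules: Dict[str, Rule]) -> Dict[str, float]:
    """Return the fraction of rows passing each named rule."""
    materialised = list(rows)
    if not materialised:
        # No rows means nothing failed; treat the score as perfect.
        return {name: 1.0 for name in rules}
    total = len(materialised)
    return {
        name: sum(1 for row in materialised if rule(row)) / total
        for name, rule in rules.items()
    }


sample_rows = [
    {"id": 1, "amount": 10},
    {"id": None, "amount": -5},
    {"id": 3, "amount": 50},
    {"id": 4, "amount": None},
]
scores = quality_score(
    sample_rows,
    {"id_not_null": non_null("id"), "amount_in_range": in_range("amount", 0, 100)},
)
```

Storing one such score dictionary per run is enough to plot a quality trend per rule.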

Question 3

What is BigLake and how does it enable Apache Iceberg on GCS?

Accepted Answer

BigLake extends BigQuery's security and query capabilities to data in GCS, including Apache Iceberg tables. Iceberg tables on GCS can be queried through BigQuery SQL with the same column-level security, row-level security, and data masking as native BigQuery tables. BigLake also enables fine-grained access control at the table level on GCS, making it possible to share specific Iceberg tables without exposing the entire bucket.
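As an illustration, the following sketch renders the kind of BigQuery DDL used to register an existing Iceberg table on GCS as an external table through a BigLake connection. The helper name and every project, dataset, connection, and bucket name are hypothetical placeholders; treat the statement as a starting point to check against the current BigQuery documentation rather than a definitive template.

```python
def iceberg_external_table_ddl(table: str, connection: str, metadata_uri: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for an Iceberg table on GCS.

    table:        fully qualified BigQuery table name (hypothetical here)
    connection:   BigLake connection ID, e.g. project.region.connection
    metadata_uri: gs:// path to the Iceberg metadata JSON file
    """
    if not metadata_uri.startswith("gs://"):
        raise ValueError("metadata_uri must be a gs:// path")
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  uris = ['{metadata_uri}']\n"
        ")"
    )


# All identifiers below are placeholders, not names from the guide.
ddl = iceberg_external_table_ddl(
    table="example-project.silver.orders",
    connection="example-project.eu.lake-connection",
    metadata_uri="gs://example-lake/orders/metadata/v1.metadata.json",
)
```

Because queries run through the connection's service account, users can be granted access to the table in BigQuery without being granted read access to the underlying bucket.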

Question 4

What are the five Vertex AI Agents deployed in the GCP lakehouse?

Accepted Answer

(1) Pipeline Repair Agent: monitors Composer DAG failures and triggers automated Dataflow restarts; (2) Quality Triage Agent: routes Dataplex DQ findings to resolution workflows; (3) Cost Optimisation Agent: analyses GCP billing data in BigQuery and recommends committed use discounts and storage class transitions; (4) Catalog Enrichment Agent: auto-generates Dataplex metadata descriptions for newly discovered assets using Gemini; (5) GDPR Compliance Agent: monitors Cloud DLP findings, tracks the Firestore Data Subject Registry, and orchestrates crypto-shredding workflows within a 72-hour erasure window (stricter than the one-month response period that GDPR Article 12(3) allows for erasure requests).
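The deadline tracking performed by the GDPR Compliance Agent can be sketched as follows. Only the 72-hour figure comes from the text; the function names and status labels are hypothetical, and a real agent would read request timestamps from the Data Subject Registry rather than take them as arguments.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 72-hour window is the figure stated in the guide.
ERASURE_WINDOW = timedelta(hours=72)


def erasure_due(requested_at: datetime) -> datetime:
    """Return the time by which an erasure request must be completed."""
    if requested_at.tzinfo is None:
        raise ValueError("requested_at must be timezone-aware")
    return requested_at + ERASURE_WINDOW


def erasure_status(requested_at: datetime,
                   now: datetime,
                   completed_at: Optional[datetime] = None) -> str:
    """Classify a request as pending, overdue, or completed on time or late."""
    due = erasure_due(requested_at)
    if completed_at is not None:
        return "completed_on_time" if completed_at <= due else "completed_late"
    return "pending" if now <= due else "overdue"


requested = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
due = erasure_due(requested)
```

An agent could poll open requests with `erasure_status` and escalate any that approach the due time without a recorded completion.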

Question 5

How does BigQuery ML integrate with the lakehouse ML platform?

Accepted Answer

BigQuery ML trains ML models and serves predictions directly on BigQuery data using SQL, so standard ML use cases need no data extraction or pipeline management. Supported model types include linear and logistic regression, boosted trees (XGBoost), DNNs, k-means, ARIMA_PLUS time-series forecasting, and remote models that call Vertex AI hosted endpoints. In the Gold layer, BigQuery ML suits demand forecasting, customer segmentation, and anomaly detection. More complex workflows use Vertex AI Pipelines, which can read BigQuery tables directly as training data, for example through Vertex AI managed datasets.
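For the demand forecasting case, a minimal sketch of the SQL involved is shown below, rendered from Python so the pieces are easy to parameterise. The helper names and all table, model, and column names are hypothetical; the statements follow the general shape of BigQuery ML's `CREATE MODEL` and `ML.FORECAST` syntax and should be checked against the current documentation.

```python
def arima_plus_model_sql(model: str, source_table: str, timestamp_col: str,
                         value_col: str, series_id_col: str) -> str:
    """Render a CREATE MODEL statement for an ARIMA_PLUS forecasting model."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        "OPTIONS (\n"
        "  model_type = 'ARIMA_PLUS',\n"
        f"  time_series_timestamp_col = '{timestamp_col}',\n"
        f"  time_series_data_col = '{value_col}',\n"
        f"  time_series_id_col = '{series_id_col}'\n"
        ") AS\n"
        f"SELECT {timestamp_col}, {value_col}, {series_id_col}\n"
        f"FROM `{source_table}`"
    )


def forecast_sql(model: str, horizon: int = 30, confidence_level: float = 0.9) -> str:
    """Render an ML.FORECAST query against a trained time-series model."""
    return (
        "SELECT *\n"
        f"FROM ML.FORECAST(MODEL `{model}`, "
        f"STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))"
    )


# Placeholder names for a Gold-layer sales table and model.
train_sql = arima_plus_model_sql(
    model="gold.demand_forecast",
    source_table="gold.daily_sales",
    timestamp_col="sale_date",
    value_col="units_sold",
    series_id_col="product_id",
)
predict_sql = forecast_sql("gold.demand_forecast", horizon=14)
```

Both statements run as ordinary BigQuery jobs, which is why no separate training infrastructure is needed for this class of model.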

Question 6

What is the difference between Dataflow and Dataproc for GCP data processing?

Accepted Answer

Cloud Dataflow is fully serverless: you define an Apache Beam pipeline, and GCP manages all worker provisioning, scaling, and teardown, with exactly-once processing semantics for both batch and streaming. It is the recommended default for most GCP ETL workloads. Cloud Dataproc is a managed Spark and Hadoop cluster service in which you manage the cluster lifecycle yourself. Dataproc is preferred for workloads that require specific Spark versions or complex Spark configurations, or that rely on existing PySpark code that cannot easily be ported to Apache Beam.
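The selection rule described above can be written down as a small decision helper. This is only a sketch of the criteria named in the answer; the `Workload` fields and the `choose_engine` function are hypothetical, and real decisions usually weigh cost, team skills, and latency as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Traits of a processing workload relevant to engine choice."""
    needs_specific_spark_version: bool = False
    needs_complex_spark_config: bool = False
    has_pyspark_code_hard_to_port: bool = False


def choose_engine(workload: Workload) -> str:
    """Return 'Dataproc' when any Spark-specific constraint applies, else 'Dataflow'."""
    spark_bound = (
        workload.needs_specific_spark_version
        or workload.needs_complex_spark_config
        or workload.has_pyspark_code_hard_to_port
    )
    # Dataflow is the default; Dataproc is chosen only for Spark-bound workloads.
    return "Dataproc" if spark_bound else "Dataflow"


new_streaming_etl = choose_engine(Workload())
legacy_pyspark_job = choose_engine(Workload(has_pyspark_code_hard_to_port=True))
```

Keeping Dataflow as the fall-through branch mirrors the guide's position that it is the default and Dataproc is the exception.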

GCP Enterprise Lakehouse Data Platform — Definitive Architecture Guide (Parts 1–3)

What This Guide Covers

Part 1 — Platform Vision, Network, Security and IAM

Part 2 — Data Engineering, BigQuery, Dataplex and Vertex AI

Part 3 — Streaming, Apigee, FinOps, SRE and Implementation Roadmap

Topics Covered in This Guide

Frequently Asked Questions

Brief Summary

Extended Summary

SimuPro Data Solutions — Cloud Data Engineering & AI Consultancy