You migrated your analytics from a managed warehouse to a self-managed Iceberg lakehouse on S3. The migration went smoothly. Initial Trino query times were faster. Your team celebrated. Six months later, the head of analytics files a ticket: “The customer segmentation query that used to take 8 seconds now takes 4 minutes and 12 seconds.”
Nobody ever scheduled compaction jobs. The orders table has piled up 340,000 small files from months of streaming micro-batches. Every query opens each file to read its metadata before scanning a single row of actual data.
A library where it takes longer to walk between shelves than to read the book. The fix takes an afternoon. The performance debt built up silently for months.
Key takeaways
Compaction is not optional. Hundreds of thousands of small files from streaming micro-batches turn an 8-second query into a 4-minute one. Hundreds of pamphlets instead of one bound volume. Schedule it or watch performance degrade quietly.
Open table formats (Delta Lake, Iceberg, Hudi) bring ACID to object storage through metadata layers that track file-level changes. Powerful, but operationally demanding.
The lakehouse vs managed warehouse decision depends on your team’s operational capacity, not on feature comparison charts.
Partition strategy determines query performance more than engine choice. Wrong partitioning on a lakehouse is slower than correct partitioning on a managed warehouse.
Time travel and snapshot isolation are genuine differentiators when audit trails or reproducible analytics matter. But the storage cost of keeping old files compounds without vacuuming.
Delta Lake and Iceberg bring ACID to object storage. Two questions matter more than feature comparisons: does your team have the operational discipline to maintain them? Does the complexity justify itself over a managed alternative?
How ACID Works on Object Storage
S3, GCS, and Azure Blob were not designed for concurrent transactional writes: operations are object-level, and there are no native atomic multi-file commits. Getting ACID on infrastructure not built for it required clever engineering.
Open table formats solve this with transaction logs. Every write stages new Parquet files, then commits by appending an entry to the log. The log records which files were added and which were logically deleted (tombstoned). Readers check the log to find the current consistent set of files before scanning data. Concurrent writers use optimistic concurrency: if two writers try to commit overlapping changes, one succeeds and the other gets a conflict error and retries.
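You can inspect this log directly. Iceberg, for instance, exposes metadata tables that are queried like ordinary tables. A minimal sketch, assuming a Spark SQL session with an Iceberg catalog and the analytics.orders table used later in this piece:
-- Each committed write shows up as a snapshot, tagged with the operation that produced it
SELECT committed_at, snapshot_id, operation, summary
FROM analytics.orders.snapshots
ORDER BY committed_at DESC;
-- The history table shows which snapshot readers currently resolve to
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM analytics.orders.history;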
Prerequisites
Object storage with atomic single-object writes (S3, GCS, Azure Blob all qualify)
Catalog service for table registration and metadata (Hive Metastore, AWS Glue, or Iceberg REST Catalog)
At least one compute engine compatible with your chosen table format
Monitoring for commit latency, file counts per table, and storage growth rate
Team capacity for ongoing compaction, vacuum, and partition management
In practice, readers always see a consistent snapshot, concurrent writes fail cleanly rather than producing corrupt data, and uncommitted writes are never visible. The overhead is 50-200ms per commit operation. Tiny for batch workloads. It compounds for streaming micro-batches writing every 30 seconds.
Choosing between Iceberg, Delta Lake, and Hudi depends on your query engine ecosystem, update frequency, and operational maturity. A solid data engineering practice evaluates these against your actual workload, not the last conference talk.
Z-Ordering and Data Layout
ACID on object storage is the foundation, but ACID alone doesn’t make queries fast. Query performance lives and dies by how data is physically organized within files.
Z-ordering co-locates related data so the query engine can skip entire files that contain no matching values, turning full scans into targeted reads. A query filtering on customer_id against a 500GB unordered table scans all 500GB. The same query against a Z-ordered table scans a small fraction because the engine reads file-level min/max statistics and skips files that can’t contain the target value.
| Layout | Query: WHERE customer_id = 55000 | Data scanned | Files touched |
| --- | --- | --- | --- |
| Unordered table | Every file contains customer_id values 1-1M mixed randomly; no file can be skipped | All 500GB | All files |
| Z-ordered on customer_id | File 6 contains customer_id 50001-60000; files 1-5 and 7+ don’t contain the target range | 15GB (3% of total) | 1 file |
Z-ordering sorts data within files by the specified columns. Min/max statistics per file let the query engine skip files that cannot contain matching rows. 97% file skipping on a single-column filter is typical.
The trade-off is compute at write time. Z-ordering is applied during compaction and requires rewriting files. For a large table, a full Z-order rewrite can take tens of minutes on a medium Spark cluster. It pays off on high-read, lower-write tables where the write compute cost is spread across thousands of query executions. Re-alphabetizing the library takes a weekend. Worth it if 1,000 people search the shelves every day.
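In Iceberg, Z-ordering is applied through the same rewrite procedure that handles compaction. A hedged sketch, assuming Spark SQL, an Iceberg catalog, and that query-log analysis has pointed at customer_id and order_date as the hot filter columns (names illustrative):
-- Rewrite data files sorted along a Z-order curve over the two most-filtered columns
CALL system.rewrite_data_files(
  table => 'analytics.orders',
  strategy => 'sort',
  sort_order => 'zorder(customer_id, order_date)'
);
-- Delta Lake equivalent (on a Delta table): OPTIMIZE analytics.orders ZORDER BY (customer_id, order_date)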
Anti-pattern
Don’t: Z-order on every column in the table. This burns write budget for tiny query improvement on columns nobody filters on. Alphabetizing by author, title, ISBN, publisher, and color. Pick 2-3.
Do: Analyze your query logs to identify the most common filter predicates. Z-order on those columns only. For intermediate pipeline tables read by downstream jobs, not analysts, skip Z-ordering entirely.
This is one area where platform engineering investment in optimized compaction scheduling pays back directly through reduced query costs.
The Compaction Requirement
Every write creates new files. Every update marks old files as deleted and creates new ones. Every streaming micro-batch produces a file. After weeks of frequent writes, a table piles up tens of thousands of small files, each requiring the query engine to open it and read its metadata before scanning data. Hundreds of pamphlets where there should be one bound volume.
The Compaction Cliff
The point at which accumulated small files degrade query performance past the user’s tolerance. For most tables, this hits once file counts reach the tens of thousands: the query engine spends more time opening files and reading metadata than scanning actual data, and performance drops non-linearly from there. An 8-second query becomes a 4-minute query, not a 16-second one.
The surprising part: file count is a performance multiplier independent of total data volume. A table with 10GB in 50,000 files queries slower than a table with 1TB in 200 files. A library with 50,000 pamphlets is slower to search than one with 200 encyclopedias. Every team that skips compaction learns this the hard way.
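Checking how close a table is to the cliff takes one query against the format’s file-level metadata. A sketch using Iceberg’s files metadata table, assuming the same analytics.orders table:
-- File count and average file size are the two numbers to watch
SELECT count(*) AS data_files,
       round(avg(file_size_in_bytes) / 1048576, 1) AS avg_file_mb
FROM analytics.orders.files;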
Compaction rewrites many small files into fewer, larger, well-sorted files:
Schedule compaction on a cadence matched to write frequency (daily for streaming tables, weekly for batch)
Tune target file size: 256MB per file for high-write tables, 512MB for batch-loaded tables
Decide whether to apply Z-ordering during compaction (yes for analyst-facing tables, usually no for intermediate pipeline tables)
Run vacuum to expire old snapshots and recover storage (keep 7 days of snapshots for time travel, delete everything older)
-- Iceberg compaction: rewrite small files into optimal-sized ones
CALL system.rewrite_data_files(
  table => 'analytics.orders',
  options => map(
    'target-file-size-bytes', '268435456',        -- 256MB target
    'min-file-size-bytes', '67108864',            -- don't touch files >64MB
    'max-concurrent-file-group-rewrites', '5'
  )
);
-- Expire old snapshots to reclaim storage
CALL system.expire_snapshots('analytics.orders', TIMESTAMP '2026-03-14 00:00:00', 100);
-- Remove orphan files left by failed writes
CALL system.remove_orphan_files(table => 'analytics.orders', older_than => TIMESTAMP '2026-03-07');
What the Industry Gets Wrong About Lakehouse Architecture
“Open table formats eliminate the need for a warehouse.” Open table formats bring ACID and schema evolution to object storage. They don’t eliminate the need for query optimization, compaction scheduling, metadata management, or storage lifecycle policies. A library without a librarian. A lakehouse without operational discipline degrades to a swamp.
“Managed services are just more expensive versions of the same thing.” Managed lakehouse services handle compaction, vacuum, and metadata optimization automatically. The cost premium buys back the engineering time you’d spend maintaining these operations yourself. For teams without dedicated data platform engineers, managed is the pragmatic choice.
“Just use Iceberg for everything.” Iceberg has the broadest engine compatibility, but Delta Lake has deeper Spark integration and a larger install base. Hudi handles streaming upserts more efficiently. The right choice depends on your engine ecosystem and update patterns, not on which format has the most conference talks.
Our take
If your team can’t dedicate at least half an engineer’s time to lakehouse operations (compaction scheduling, vacuum, partition management, file size tuning), use a managed warehouse or managed lakehouse service. The capabilities of open table formats are genuine. The operational burden is equally genuine. Choosing self-managed without the capacity to sustain it produces the worst of both worlds: the complexity of a lakehouse with the reliability of an unmaintained system.
That customer segmentation query. Eight seconds to four minutes because nobody scheduled compaction. With compaction matched to write patterns, automated partition management, and file size targets enforced in the pipeline, the debt never piles up. The query stays fast because the maintenance runs before anyone notices it needs to. The lakehouse delivers when operational discipline matches architectural ambition.
Your Lakehouse Queries Are Getting Slower Every Week
Open table formats unlock genuine capabilities, but compaction debt, vacuum scheduling, and file size tuning degrade silently until queries that took seconds take minutes. Lakehouse architectures need operational discipline from day one to stay performant as data volume grows.
How does ACID work on object storage when S3 is not transactional?
Open table formats build ACID through a transaction log. Writes stage new Parquet files, then commit by appending atomically to the log. Readers check the log to find the current consistent snapshot. Concurrent writers use optimistic concurrency with conflict detection, producing serializable behavior without a lock manager. This adds roughly 50-200ms of commit overhead per write operation.
What is time travel in a lakehouse and what does it cost?
Time travel queries a table’s state at a past point in time, using the transaction log to reconstruct which files were valid then. The cost: old data files must be retained after updates and deletes, consuming storage in proportion to change volume and retention window. Without regular vacuuming, time travel storage grows without limit. A table with 10GB of daily updates and 30-day retention piles up 300GB of time travel overhead.
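A minimal sketch of what a time-travel read looks like in Spark SQL against an Iceberg table (syntax varies slightly across engines and formats; the snapshot ID is illustrative):
-- Query the table as it existed at a past timestamp
SELECT count(*) FROM analytics.orders TIMESTAMP AS OF '2026-03-01 00:00:00';
-- Or pin to an exact snapshot ID taken from the snapshots metadata table
SELECT count(*) FROM analytics.orders VERSION AS OF 8744736658442914487;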
What is the difference between Delta Lake, Apache Iceberg, and Apache Hudi?
Delta Lake has the deepest Spark integration and the largest install base. Apache Iceberg has the broadest engine interoperability (Spark, Trino, Flink, Athena, DuckDB), making it the strongest multi-engine choice. Apache Hudi is optimized for streaming upserts, handling high-throughput record ingestion efficiently. Iceberg has gained the most momentum as a neutral standard, with AWS, Google, and Apple running it at scale.
What is compaction and how often should it run?
Every write creates new small files. After weeks of frequent writes, tables pile up tens of thousands of files, each needing a metadata read before data scanning. Compaction rewrites small files into fewer, larger files. High-write tables need daily compaction with 256MB target files. Lower-write tables can run weekly with 512MB targets.
When is a managed data warehouse still the right answer?
When your workload is structured SQL analytics without streaming, or your team can’t dedicate half an engineer’s time to lakehouse operations. The operational cost of self-managed compaction, vacuum tuning, and file optimization is real. Teams underestimating it see months of gradual performance decay before tracing it to operational debt that started piling up on day one.