
Lakehouse Architecture: Compaction and Operational Reality

Metasphere Engineering · 7 min read

You migrated your analytics from Redshift to a self-managed Iceberg lakehouse on S3. The migration went smoothly. Initial Trino query times were faster than Redshift. Your team celebrated. Champagne was involved. Six months later, the head of analytics files a ticket: “The customer segmentation query that used to take 8 seconds now takes 4 minutes and 12 seconds.” The post-mortem reveals that nobody ever scheduled compaction jobs. The orders table has accumulated 340,000 small files from months of streaming micro-batches. Every query must open each file and read its footer metadata before scanning a single row of data. The fix takes an afternoon of running CALL system.rewrite_data_files(). The performance debt had accumulated silently for months until it became impossible to ignore.

This pattern breaks teams regularly. The open table formats (Delta Lake, Apache Iceberg, Apache Hudi) are genuinely excellent engineering. They solved the hard problem of bringing ACID semantics to object storage. The architectural decision is not whether they work. They do. The decision is whether your team has the operational discipline to manage them well, and whether the capabilities justify the complexity relative to a managed warehouse like Snowflake or BigQuery for your specific workload.

[Figure: File Compaction Lifecycle. Animated timeline showing small files accumulating from weeks of streaming micro-batches — from 4 files of 256MB each at 200ms query latency to 1,000+ files of 1-4MB each at timeout — then compaction (CALL system.rewrite_data_files()) merging 1,000+ files into 8 large files and restoring 200ms latency. Compaction is not optional: without it, every streaming table degrades to unusable within months.]

How ACID Works on Object Storage

S3, GCS, and Azure Blob were not designed for concurrent transactional writes. They are object-level systems with no native atomic multi-file operations: you can replace a single object atomically, but you cannot commit a set of files as one transaction. So how do you get ACID on top of something that was built to store cat photos and log files?

Open table formats solve this with transaction logs. Every write stages new Parquet files, then commits by appending an entry to the transaction log. The log records which files were added and which were logically deleted (tombstoned). Readers consult the log to determine the current consistent set of files before scanning data. Concurrent writers use optimistic concurrency: if two writers attempt to commit overlapping changes, one succeeds and the other gets a conflict error and retries. It is elegant engineering.
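The commit protocol can be sketched as a toy model. This is a minimal in-memory illustration of the idea, not any format's actual on-disk layout — real transaction logs are persisted as files in object storage, and the atomic-append primitive (here a simple lock) varies by format and store:

```python
import threading

class TransactionLog:
    """Toy model of an open-table-format transaction log (illustrative only)."""

    def __init__(self):
        self._entries = []           # committed log entries, in order
        self._lock = threading.Lock()

    def snapshot(self):
        """Readers: replay the log to determine the current consistent file set."""
        live = set()
        for entry in self._entries:
            live |= set(entry["added"])
            live -= set(entry["removed"])   # tombstoned files drop out
        return live

    def commit(self, base_version, added, removed):
        """Writers: optimistic concurrency. The commit succeeds only if no one
        else committed since the writer read base_version; otherwise the caller
        must re-read the log and retry."""
        with self._lock:  # stands in for the store's atomic append/CAS primitive
            if base_version != len(self._entries):
                return False  # conflict detected
            self._entries.append({"added": added, "removed": removed})
            return True

log = TransactionLog()
v = len(log._entries)
assert log.commit(v, added=["part-001.parquet"], removed=[])      # first writer wins
assert not log.commit(v, added=["part-002.parquet"], removed=[])  # stale base: conflict
print(sorted(log.snapshot()))
```

Note how the loser fails cleanly before anything becomes visible to readers — the staged Parquet files exist in storage, but until a commit lands in the log they are simply invisible.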

In practice, this means readers always see a consistent snapshot, concurrent writes fail cleanly rather than producing corrupt data, and uncommitted writes are never visible. The overhead is 50-200ms per commit operation. Negligible for batch workloads. It adds up for real-time streaming micro-batches writing every 30 seconds. Choosing between Iceberg, Delta Lake, and Hudi depends on your query engine ecosystem, update frequency, and operational maturity. A solid data engineering practice evaluates these trade-offs against your specific workload profile rather than picking whatever the last conference talk recommended.

Z-Ordering and Query Performance

ACID on object storage is the foundation, but ACID alone does not make queries fast. Query performance in a lakehouse depends heavily on how data is physically organized within files.

Open table formats support layout optimizations that traditional data lakes lack entirely. Z-ordering (Delta Lake) and sort orders (Iceberg) co-locate related data within the same files, enabling data skipping: when a query filters on the sort-key columns, the engine consults file-level min/max statistics and skips entire files that cannot contain matching values, turning full scans into targeted reads.

The impact is dramatic. A query filtering on customer_id against a 500GB unordered table scans all 500GB. The same query against a Z-ordered table scans 15GB because the engine reads file-level min/max statistics and skips the 97% of files that cannot contain the target value. That is a direct reduction in both latency (minutes to seconds) and cost (fewer bytes scanned means lower compute bills on Athena and BigQuery). When your analytics team sees query times drop by an order of magnitude, they stop complaining about the migration.
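A minimal sketch of the pruning idea — file names and statistics here are invented for illustration, not read from any real table:

```python
# Each data file carries min/max statistics for the sort-key column;
# the planner prunes files whose range cannot satisfy the filter.
files = [
    {"path": "part-000.parquet", "min_customer_id": 1,      "max_customer_id": 10_000},
    {"path": "part-001.parquet", "min_customer_id": 10_001, "max_customer_id": 20_000},
    {"path": "part-002.parquet", "min_customer_id": 20_001, "max_customer_id": 30_000},
]

def files_to_scan(target_id, file_stats):
    """Keep only files whose [min, max] range could contain target_id."""
    return [f["path"] for f in file_stats
            if f["min_customer_id"] <= target_id <= f["max_customer_id"]]

# A point lookup touches one file instead of all three.
print(files_to_scan(15_000, files))  # ['part-001.parquet']
```

The pruning only pays off when data is clustered: in an unordered table, every file's min/max range tends to span the whole key domain, so nothing can be skipped — which is exactly the problem Z-ordering fixes.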

The trade-off is compute at write time. Z-ordering is applied during compaction and requires rewriting files. For a 1TB table, a full Z-order rewrite might take 30-45 minutes on a medium Spark cluster. It is most valuable for high-read, lower-write tables where the write compute cost is amortized across thousands of query executions. This is one area where platform engineering investment in optimized compaction scheduling pays back through substantially reduced query costs. Pick your Z-order columns carefully: choose the 2-3 dimensions your analysts filter on most, not every column in the table. Z-ordering on everything is the wrong approach. It burns write budget for negligible query improvement on columns nobody filters on.
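The interleaving behind Z-ordering can be sketched with a classic Morton code — a simplified stand-in for what engines do internally, not Delta's or Iceberg's actual implementation:

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two column values into a single Morton code.
    Sorting rows by this code keeps rows that are close in BOTH dimensions
    close together on disk, which is what makes multi-column skipping work."""
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)       # x bits land at even positions
        z |= (y >> i & 1) << (2 * i + 1)   # y bits land at odd positions
    return z

# Rows close in both dimensions cluster; the outlier in x sorts far away.
rows = [(5, 9), (6, 8), (100, 2), (5, 8)]
rows.sort(key=lambda r: z_value(*r))
print(rows)
```

This also shows why adding columns dilutes the benefit: each extra dimension halves the bits available to every other column in the interleaved code, so clustering (and therefore skipping) weakens on all of them.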

The Compaction Requirement

This is the section that matters most if you are running a lakehouse in production. Read it carefully.

Every write creates new files. Every update marks old files as deleted and creates new ones. Every streaming micro-batch produces a file. After weeks of frequent writes, a table accumulates tens of thousands of small files. Each file requires the query engine to open it and read its metadata before scanning data. The file count is a direct performance multiplier independent of total data volume. A table with 10GB of data in 50,000 files queries slower than a table with 1TB of data in 200 files. That is not intuitive, but it is the reality that catches every team that skips compaction.

Compaction rewrites many small files into fewer, larger, well-sorted files. For data engineering teams managing lakehouse tables, this means:

  • Scheduling compaction on a cadence matched to write frequency (daily for streaming tables, weekly for batch)
  • Tuning target file size: 256MB per file for high-write tables, 512MB for batch-loaded tables
  • Deciding whether to apply Z-ordering during compaction (yes for analyst-facing tables, usually no for intermediate pipeline tables)
  • Running vacuum to expire old snapshots and recover storage (keep 7 days of snapshots for time travel, delete everything older)
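Those guidelines can be folded into a simple trigger heuristic — a hedged sketch with made-up thresholds, not any engine's built-in policy:

```python
# Illustrative compaction trigger: flag a table when a majority of its files
# are well below the target size. Thresholds are invented defaults to tune.
def needs_compaction(file_sizes_mb, target_mb=256, small_file_ratio=0.5):
    """Return True when more than small_file_ratio of the table's files
    are smaller than half the target file size."""
    if not file_sizes_mb:
        return False
    small = sum(1 for size in file_sizes_mb if size < target_mb / 2)
    return small / len(file_sizes_mb) > small_file_ratio

streaming_table = [2, 3, 1, 4, 2, 256]   # mostly micro-batch output
batch_table = [256, 240, 260, 250]       # well-sized batch loads

print(needs_compaction(streaming_table))  # True
print(needs_compaction(batch_table))      # False
```

A check like this, run daily against table metadata, turns "schedule compaction on a cadence" into "compact only when the table actually needs it" — which matters once you have hundreds of tables.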

The operational burden of compaction is the most cited reason teams prefer managed services. Databricks Unity Catalog, AWS Lake Formation with Iceberg, and Google BigLake handle compaction automatically. The trade-off is cost (a 20-40% premium over self-managed) and vendor coupling. For teams with strong data engineering depth that need multi-engine access patterns no single managed service covers, self-managed is practical. For teams that want lakehouse benefits without the operational overhead, a managed service is the pragmatic choice.

Do not make this decision based on ideology. Make it based on an honest assessment of your team’s engineering capacity. If you cannot dedicate at least half an engineer’s time to lakehouse operations, go managed. You will sleep better. And your analysts will stop filing performance tickets at 3 PM every Friday.

The lakehouse delivers genuine value when the operational discipline matches the architectural ambition. Teams that schedule compaction, tune file sizes, and run vacuum on a cadence matched to their write patterns maintain the query performance that justified the migration. Teams that skip this maintenance discover the performance debt six months later when their flagship query goes from 8 seconds to 4 minutes. The technology works. The question is whether you will do the work to keep it working.

Design a Lakehouse Architecture That Stays Manageable

Open table formats unlock genuine capabilities, but without operational discipline, deferred compaction, unscheduled vacuums, and untuned file sizes quietly degrade query performance until queries that took seconds take minutes. Metasphere designs lakehouse architectures that perform well and stay manageable as data volume grows.

Architect Your Lakehouse

Frequently Asked Questions

How does ACID work on object storage when S3 is not transactional?


Open table formats implement ACID through a transaction log. Writes stage new Parquet files, then commit by appending atomically to the log. Readers consult the log to determine the current consistent snapshot. Concurrent writers use optimistic concurrency with conflict detection, producing serializable behavior without a lock manager. This adds roughly 50-200ms of commit overhead per write operation.

What is time travel in a lakehouse and what does it cost?


Time travel queries a table’s state at a past point in time using the transaction log to reconstruct which files were valid. The cost: old files must be retained on update, consuming storage proportional to change volume and retention window. Without regular vacuuming, time travel storage grows unbounded. A table with 10GB of daily updates and 30-day retention accumulates 300GB of time travel overhead.
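As a sanity check on those numbers, the retention arithmetic is just churn times window:

```python
# Back-of-envelope check: time-travel storage overhead grows linearly
# with daily churn and the snapshot retention window.
daily_update_gb = 10    # data rewritten per day
retention_days = 30     # snapshot retention window
time_travel_overhead_gb = daily_update_gb * retention_days
print(time_travel_overhead_gb)  # 300
```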

What is the difference between Delta Lake, Apache Iceberg, and Apache Hudi?


Delta Lake has the deepest Spark integration and largest install base. Apache Iceberg has the broadest engine interoperability (Spark, Trino, Flink, Athena, DuckDB) making it the strongest multi-engine choice. Apache Hudi is optimized for streaming upserts, handling 100K+ records per second efficiently. Iceberg has gained the most momentum as a neutral standard, with AWS, Google, and Apple running it at scale.

What is compaction and how often should it run?


Every write creates new small files. After weeks of frequent writes, tables accumulate tens of thousands of files, each requiring a metadata read before data scanning. Compaction rewrites small files into fewer, larger files. Without it, 8-second queries become 4-minute queries within months. High-write tables need daily compaction with 256MB target files. Lower-write tables can run weekly with 512MB targets.

When is a managed data warehouse still the right answer?


Snowflake, BigQuery, or Redshift are right when your workload is structured SQL analytics without streaming, or your team can’t dedicate 0.5 FTE to lakehouse operations. The operational cost of self-managed compaction, vacuum tuning, and file optimization is real. Teams underestimating it see 6-12 months of gradual performance degradation before tracing it to operational debt that started accumulating on day one.