
A pipeline ingests data into an Iceberg table every hour. Over six months, query performance has dropped and S3 storage cost has spiked due to millions of tiny data and metadata files. What approach would reduce cost and improve pipeline performance?

🔴 Hard · System Design · Senior level
Times asked: 1
Last seen: May 2026
First seen: May 2026

💡 Model Answer

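First, identify the root cause: every hourly commit adds a new snapshot, new manifests, a new metadata file, and usually several small data files. After six months that is thousands of commits, which inflates S3 object count, request cost, and query planning time.
1) Write larger files: If freshness requirements allow, commit less often than hourly, and in any case make each commit produce a few large files instead of many small ones. Aim for a few hundred MiB per data file (Iceberg's default target is 512 MiB), for example by repartitioning the data by partition key before the write.
2) Use Iceberg's delete files with care: For incremental updates, merge-on-read writes delete files instead of rewriting whole data files, which keeps writes cheap. Delete files are themselves small files and slow down reads until they are compacted, so this only helps when paired with step 3.
3) Compaction and metadata maintenance: Schedule a regular (for example, nightly) job that merges small data and delete files into larger ones and rewrites manifests so that planning reads fewer metadata files.
4) Partitioning strategy: Partition on a low- to moderate-cardinality key that matches common filters, such as day rather than hour. Over-partitioning on a high-cardinality column is a frequent cause of tiny files, because each commit writes at least one file per partition it touches.
5) Snapshot expiration and orphan cleanup: Expire old snapshots and remove orphan files through Iceberg's own maintenance operations, and cap the number of retained metadata files. Do not rely on S3 lifecycle rules to delete objects under the table path, since they can remove files that a live snapshot still references. Lifecycle rules are safer for cleaning up incomplete multipart uploads or for data that is already outside the table.
6) Query engine tuning: Use an engine with native Iceberg support (e.g., Athena, Presto) so that predicate pushdown and partition pruning can use the file statistics stored in the manifests.
7) Monitoring: Track file size distribution, snapshot and manifest counts, query latency, and S3 request counts to validate improvements.
By writing larger files, compacting regularly, expiring old snapshots, and partitioning sensibly, you reduce the number of tiny files, lower S3 storage and request costs, and improve query performance.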
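The compaction and snapshot-cleanup steps above can be sketched as a small Python helper that builds the SQL for Iceberg's Spark maintenance procedures (`rewrite_data_files`, `rewrite_manifests`, `expire_snapshots`, `remove_orphan_files`). This is a minimal sketch, not a definitive job: the catalog name `my_catalog`, the table `db.events`, the 7-day retention window, and the `min-input-files` threshold are all assumptions chosen for illustration, and the right values depend on the workload.

```python
from datetime import datetime, timedelta, timezone

# Assumed default: 512 MiB target size for compacted data files.
TARGET_FILE_BYTES = 512 * 1024 * 1024


def metadata_retention_statement(table, max_previous_versions=20):
    """ALTER TABLE statement that caps how many old metadata.json files are kept."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'write.metadata.delete-after-commit.enabled'='true', "
        f"'write.metadata.previous-versions-max'='{max_previous_versions}')"
    )


def maintenance_statements(catalog, table, now, retention_days=7, retain_last=10):
    """Return the nightly maintenance statements in the order they should run."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S.000")
    return [
        # 1. Merge small data files into files close to the target size.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{TARGET_FILE_BYTES}', "
        f"'min-input-files', '5'))",
        # 2. Rewrite manifests so query planning reads fewer metadata files.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # 3. Expire old snapshots; files only they reference become deletable.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})",
        # 4. Remove files that no snapshot references (e.g. from failed writes).
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]


if __name__ == "__main__":
    # Hypothetical catalog and table names, for illustration only.
    print(metadata_retention_statement("my_catalog.db.events") + ";")
    for stmt in maintenance_statements(
        "my_catalog", "db.events", datetime.now(timezone.utc)
    ):
        print(stmt + ";")
```

In a Spark session configured with the Iceberg runtime and SQL extensions, each statement could be passed to `spark.sql(...)` from a scheduled job. Compaction runs before snapshot expiration so that the small files it replaces are released once the older snapshots expire.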

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
