
We currently use a copy‑on‑write strategy, but cluster calls are increasing because full data files are rewritten for a single row update. How would you address this?

🟡 Medium | Conceptual | Mid level
Times asked: 1
Last seen: May 2026
First seen: May 2026

💡 Model Answer

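The root cause is write amplification: under copy‑on‑write (CoW), updating a single row rewrites every data file that contains an affected row, so the bytes written are far larger than the bytes changed. To mitigate this, consider: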

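  1. Switch to merge‑on‑read (MoR), which stores updates as small delta or delete files and merges them with the base files at read time. Apache Hudi MoR tables, Apache Iceberg v2 row‑level deletes, and Delta Lake deletion vectors all follow this pattern; these modes are generally opt‑in, since copy‑on‑write is the usual default.
  2. Partition the table on a column that update predicates filter on (e.g., date or region) so that an update only touches files in one small partition. Avoid very high‑cardinality partition keys, which produce many tiny files and slow down reads.
  3. Use incremental writes: write only the changed rows to a new file and maintain a manifest that points to both base and delta files.
  4. Schedule regular compaction jobs to merge delta files into base files, so that read‑time merging stays cheap and file sizes stay near the target.
  5. If the workload is mostly point updates, consider a table format with native upsert support, such as Apache Hudi (upsert writes backed by a record‑key index) or Delta Lake (MERGE INTO), so that the writer can locate the affected files without scanning the whole table.
  6. Tune the target file size against the update‑to‑read ratio. Under CoW, smaller files mean fewer bytes rewritten per update, while larger files (e.g., 200–400 MB) reduce the file count and improve scan performance.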
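
The interaction between points 1, 3, and 4 can be sketched with a toy in‑memory table. This is only an illustration under simplifying assumptions: the `MorTable` class, its method names, and the `rows_written` counter are hypothetical, "files" are plain dictionaries, and no real table format's API is being modeled.

```python
from dataclasses import dataclass, field


@dataclass
class MorTable:
    """Toy merge-on-read table (hypothetical): base files plus small delta files."""
    base_files: list = field(default_factory=list)   # each file: {key: row}
    delta_files: list = field(default_factory=list)  # oldest first, newest last
    rows_written: int = 0  # rows physically written, to make amplification visible

    def load_base(self, rows: dict, rows_per_file: int) -> None:
        # Split the initial data into fixed-size base files.
        items = sorted(rows.items())
        for i in range(0, len(items), rows_per_file):
            self.base_files.append(dict(items[i:i + rows_per_file]))
        self.rows_written += len(items)

    def upsert(self, key, row: dict) -> None:
        # Write only the changed row to a new delta file; base files are untouched.
        self.delta_files.append({key: row})
        self.rows_written += 1

    def read(self, key):
        # Merge at read time: the newest delta wins, then fall back to base files.
        for delta in reversed(self.delta_files):
            if key in delta:
                return delta[key]
        for base in self.base_files:
            if key in base:
                return base[key]
        return None

    def compact(self) -> None:
        # Fold all deltas into the base files, rewriting only files with changes.
        merged = {}
        for delta in self.delta_files:
            merged.update(delta)  # later deltas override earlier ones
        for i, base in enumerate(self.base_files):
            changed = {k: merged.pop(k) for k in list(merged) if k in base}
            if changed:
                self.base_files[i] = {**base, **changed}
                self.rows_written += len(base)
        if merged:  # keys absent from every base file are brand-new rows
            self.base_files.append(merged)
            self.rows_written += len(merged)
        self.delta_files.clear()


table = MorTable()
table.load_base({i: {"v": 0} for i in range(1000)}, rows_per_file=250)
for key in (3, 4, 5):
    table.upsert(key, {"v": 1})
written_before_compaction = table.rows_written  # 1000 base rows + 3 delta rows
table.compact()  # rewrites only the one base file that holds keys 3, 4 and 5
```

The three updates cost three row writes up front, and the 250‑row rewrite of the affected base file is paid once at compaction rather than once per update. The trade‑off is that `read` has to check the delta files first until compaction runs.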

By adopting a MoR or delta approach together with sensible partitioning, a single‑row update writes roughly one row's worth of data instead of a whole file. Regular compaction then keeps query performance close to that of a copy‑on‑write table.
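
To put rough numbers on the saving, here is a back‑of‑the‑envelope estimate of write amplification (bytes written divided by bytes changed). The 256 MiB file size, the 1 KiB row size, and the function names are assumptions chosen for illustration; the MoR figure also assumes all updates land in one base file, touch distinct rows, and ignores metadata overhead.

```python
FILE_BYTES = 256 * 1024**2  # assumed base file size: 256 MiB
ROW_BYTES = 1024            # assumed row size: 1 KiB


def cow_amplification(file_bytes: int, row_bytes: int) -> float:
    # Copy-on-write: every single-row update rewrites the whole file.
    return file_bytes / row_bytes


def mor_amplification(file_bytes: int, row_bytes: int, updates_per_compaction: int) -> float:
    # Merge-on-read: each update writes one row, then one compaction
    # rewrites the base file after `updates_per_compaction` updates.
    changed = row_bytes * updates_per_compaction
    return (changed + file_bytes) / changed


cow = cow_amplification(FILE_BYTES, ROW_BYTES)        # 262144.0
mor = mor_amplification(FILE_BYTES, ROW_BYTES, 1000)  # about 263.1
print(f"CoW: {cow:,.0f}x, MoR with compaction every 1000 updates: {mor:,.1f}x")
```

Under these assumptions the amortized cost drops by about three orders of magnitude, and it keeps falling as more updates are batched between compactions, at the price of more delta files to merge on each read.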

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
