We have a Delta table `retail_prod.analytics.sales_silver` that is about 600 GB. It is partitioned by `event_date` (daily) and contains many small files (5–20 MB) with no Z‑Order. Dashboards filter on the last 30 days of `event_date` and `merchant_id`, and the p95 query time is above 12 seconds. We cannot change the upstream schema. How can we bring the p95 query time under 12 seconds for the 30‑day window?
💡 Model Answer
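The key to faster queries on a large Delta table with many small files is to cut the number of files each query opens and to make data skipping effective. First, compact and cluster the recent data by running `OPTIMIZE retail_prod.analytics.sales_silver WHERE event_date >= date_sub(current_date(), 30) ZORDER BY (merchant_id)`. The `WHERE` clause limits the rewrite to the partitions the dashboards actually read, and compaction merges the 5–20 MB files into far fewer, larger ones. Z-Order only by `merchant_id`: `event_date` is the partition column, so it cannot be used as a Z-Order column, and the 30-day filter is already served by partition pruning. Within each daily partition, Z-Ordering co-locates rows with the same `merchant_id`, which tightens the per-file min/max statistics so that a `merchant_id` filter skips most files. Second, the file format needs no change: Delta already stores data as Parquet, so column pruning applies as long as the dashboards select only the columns they need. Third, stop small files from building up again. On Databricks, set the table properties `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact` to `true`, and schedule the `OPTIMIZE` command to run regularly (for example, daily) over the recent partitions. Finally, keep the hot data close to compute. `CACHE TABLE` on the full table would try to cache all 600 GB, so prefer the Databricks disk cache, optionally pre-warmed with a `CACHE SELECT` restricted to the last 30 days, or a materialized view (for example, in Delta Live Tables) scoped to the same window. None of these steps changes the schema. Together they reduce the number of files scanned and improve data skipping, which should bring p95 latency below the 12-second target; confirm this by measuring p95 before and after.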
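The statements behind these steps can be sketched as follows. This is a minimal illustration, not a definitive runbook. It assumes a Databricks runtime, since the `delta.autoOptimize.*` table properties and `CACHE SELECT` are Databricks features. The helper names `build_maintenance_sql` and `run_maintenance` are hypothetical, and the 30-day default simply mirrors the dashboard window in the question.

```python
from typing import List


def build_maintenance_sql(table: str, window_days: int = 30) -> List[str]:
    """Return the SQL statements for compaction, auto-compaction, and cache warm-up.

    Hypothetical helper: it only builds strings, so it can be inspected
    without a cluster.
    """
    # Partition predicate shared by OPTIMIZE and the cache warm-up.
    recent = f"event_date >= date_sub(current_date(), {window_days})"
    return [
        # Compact small files in the recent partitions and cluster by merchant_id.
        # event_date is the partition column, so it is not a Z-Order column.
        f"OPTIMIZE {table} WHERE {recent} ZORDER BY (merchant_id)",
        # Databricks table properties that limit small files on future writes.
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.autoOptimize.optimizeWrite' = 'true', "
        "'delta.autoOptimize.autoCompact' = 'true')",
        # Pre-warm the Databricks disk cache for the dashboard window only.
        f"CACHE SELECT * FROM {table} WHERE {recent}",
    ]


def run_maintenance(spark, table: str, window_days: int = 30) -> None:
    """Run each statement on an existing SparkSession (assumed to be Databricks)."""
    for statement in build_maintenance_sql(table, window_days):
        spark.sql(statement)


if __name__ == "__main__":
    # Print the statements so they can be reviewed or pasted into a SQL editor.
    for statement in build_maintenance_sql("retail_prod.analytics.sales_silver"):
        print(statement + ";")
```

In a scheduled job, `run_maintenance(spark, "retail_prod.analytics.sales_silver")` would typically run once a day after the upstream load finishes. The `ALTER TABLE` statement only needs to succeed once, but repeating it is harmless.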
This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.