You have a massive sales fact table and a smaller product dimension table. The join is taking too long because specific product IDs are skewed, appearing in 90% of sales records. How would you address this issue?

Question

Assisting AI · Accepted Answer

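Data skew occurs when a few keys dominate the join, so all rows for those keys are shuffled to the same reducer or executor, which becomes the bottleneck while the other tasks sit idle. In Spark or Hive you can mitigate this in several ways:

1) Use a broadcast (map-side) join if the product table is small enough to fit in each executor's memory. The large sales table is then never shuffled on `product_id`, so the hot keys cannot pile up on one task. In Spark this is the `broadcast()` hint or `spark.sql.autoBroadcastJoinThreshold`; in Hive it is a map join (`hive.auto.convert.join`).
2) Salt the skewed keys. Repartitioning the sales table before the join does not help on its own, because the join shuffles by `product_id` again. Instead, add a salt column to the sales rows for the skewed keys (e.g., `floor(rand() * N)`), replicate the matching product rows N times, once per salt value, and join on `(product_id, salt)`. The heavy key is then spread across up to N partitions.
3) Let the engine handle it. In Hive, set `hive.optimize.skewjoin=true` and tune `hive.skewjoin.key`, the row-count threshold above which a key is treated as skewed; Hive then processes those keys in a follow-up map join. In Spark 3.x, adaptive query execution can split oversized shuffle partitions when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` are on.
4) For very large skewed keys, split the work manually: join the sales rows for the hot product IDs in a separate pass (for example against a broadcast of just those product rows), join the remaining rows normally, and union the two results.
5) Verify the fix. Compare per-task durations and shuffle read sizes for the join stage before and after; once the skew is alleviated, the slowest task should be close to the median instead of far above it.

These techniques spread the rows for the heavy key across executors instead of concentrating them on one task, which is what improves overall join performance.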
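The salting idea in point 2 can be sketched without a cluster. The following is a minimal pure-Python simulation, not Spark or Hive code: `partition_of` is a hypothetical stand-in for an engine's hash partitioner, and the table contents, the hot key `"P1"`, and the bucket count are made-up values for illustration. It shows that salting leaves the join result unchanged while lowering the row count on the busiest partition.

```python
import random
import zlib
from collections import Counter


def partition_of(product_id, salt, num_partitions):
    """Hypothetical stand-in for an engine's hash partitioner."""
    return (zlib.crc32(product_id.encode()) + salt) % num_partitions


def salted_join(sales, products, hot_keys, buckets, num_partitions=8, seed=0):
    """Inner-join sales to products on (product_id, salt).

    Returns the joined rows and the number of sales rows sent to each
    partition. With buckets=1 every salt is 0, i.e. the unsalted join.
    """
    rng = random.Random(seed)

    # Dimension side: replicate each hot product once per salt value.
    salted_products = {}
    for product_id, name in products.items():
        copies = buckets if product_id in hot_keys else 1
        for salt in range(copies):
            salted_products[(product_id, salt)] = name

    joined, load = [], Counter()
    for sale_id, product_id in sales:
        # Fact side: only hot keys get a random salt; other keys keep salt 0.
        salt = rng.randrange(buckets) if product_id in hot_keys else 0
        load[partition_of(product_id, salt, num_partitions)] += 1
        name = salted_products.get((product_id, salt))
        if name is not None:
            joined.append((sale_id, product_id, name))
    return joined, load


# Toy data (assumed): product "P1" appears in 90% of the sales rows.
products = {f"P{i}": f"product-{i}" for i in range(1, 12)}
sales = [(i, "P1") for i in range(900)]
sales += [(900 + i, f"P{2 + i % 10}") for i in range(100)]

plain_rows, plain_load = salted_join(sales, products, {"P1"}, buckets=1)
salted_rows, salted_load = salted_join(sales, products, {"P1"}, buckets=8)

print("max rows per partition, unsalted:", max(plain_load.values()))
print("max rows per partition, salted:  ", max(salted_load.values()))
```

The trade-off is that the dimension rows for the hot keys are duplicated once per salt value, so the bucket count should stay small relative to the size of the product table.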
