
You have a massive sales fact table and a smaller product dimension table. The join is taking too long because specific product IDs are skewed, appearing in 90% of sales records. How would you address this issue?

Medium, Conceptual, Mid level
Times asked: 1
Last seen: Jul 2026
First seen: Jul 2026

Model Answer

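Data skew occurs when a few keys dominate the join, so the single reducer or executor that receives those keys becomes the bottleneck. In Spark or Hive you can mitigate this in several ways:
1) Use a broadcast (map-side) join if the product table is small enough to fit in each executor's memory. The sales table is then never shuffled on product_id, so the skew no longer concentrates work on one task.
2) Salt the skewed keys. On the sales side, append a random suffix in the range 0 to N-1 to product_id for the hot keys; on the product side, replicate each hot product row N times, once per suffix; then join on (product_id, salt). Each heavy key is spread across up to N partitions. Repartitioning the sales table beforehand (e.g., sales.repartitionByRange(100, product_id, rand())) is not enough on its own, because a shuffle join redistributes the rows by the join key again.
3) In Hive, enable hive.optimize.skewjoin and tune hive.skewjoin.key, the row-count threshold above which a key is treated as skewed, so that the engine handles skewed keys separately in a follow-up map join.
4) For very large skewed keys, split the work manually: filter the hot keys into their own dataset, join that part on its own (with salted pseudo-keys or a broadcast of just the matching product rows), join the remaining keys normally, and then union the results.
5) Monitor shuffle sizes and per-task executor metrics to confirm the skew has been alleviated.
These techniques reduce the time spent on the heavy keys and balance the load across executors, improving overall join performance.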
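To make the salting idea concrete, here is a minimal pure-Python sketch rather than Spark or Hive code. It assumes a toy partitioner and made-up data; `partition_of`, `salted_join`, `NUM_PARTITIONS`, and `NUM_SALTS` are hypothetical names chosen for illustration, not part of any library. It shows that salting the hot key leaves the join result unchanged while spreading its rows over several partitions.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical shuffle partition count
NUM_SALTS = 8       # hypothetical number of salt buckets per hot key


def partition_of(product_id, salt=0):
    # Toy stand-in for a hash partitioner on the (product_id, salt) join key.
    return (product_id * 31 + salt) % NUM_PARTITIONS


def salted_join(sales, products, hot_keys, rng):
    """Join sales to products on product_id, salting only the hot keys.

    Returns (joined_rows, rows_per_partition).
    """
    # Dimension side: replicate each hot product once per salt value;
    # every other product keeps the single salt 0.
    dim = {}
    for pid, name in products:
        salts = range(NUM_SALTS) if pid in hot_keys else (0,)
        for s in salts:
            dim[(pid, s)] = name

    joined, load = [], Counter()
    for sale_id, pid in sales:
        # Fact side: a hot key gets a random salt, which spreads its rows out.
        salt = rng.randrange(NUM_SALTS) if pid in hot_keys else 0
        load[partition_of(pid, salt)] += 1
        joined.append((sale_id, pid, dim[(pid, salt)]))
    return joined, load


products = [(1, "widget"), (2, "gadget"), (3, "gizmo")]
# 90% of sales reference product 1, mirroring the skew in the question.
sales = [(i, 1 if i % 10 else 2 + (i // 10) % 2) for i in range(10_000)]

rng = random.Random(0)
plain, plain_load = salted_join(sales, products, hot_keys=set(), rng=rng)
salted, salted_load = salted_join(sales, products, hot_keys={1}, rng=rng)

print("max rows per partition, unsalted:", max(plain_load.values()))
print("max rows per partition, salted:  ", max(salted_load.values()))
```

Without salting, all 9,000 rows for product 1 land in one partition. With salting, they are divided among the salt buckets, at the cost of replicating the single hot product row `NUM_SALTS` times. In a real engine the same trade-off applies: more salt buckets mean better balance but more duplicated dimension rows.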

This answer was generated by AI for study purposes. Use it as a starting point and personalize it with your own experience.
