How would you handle data skew in Spark when performing a broadcast join?

Question

Assisting AI · Accepted Answer

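Data skew occurs when one key has many more records than the others, so the partition holding that key becomes a bottleneck. A broadcast join does not shuffle the large side by join key, so key skew mostly matters when the small side is too big to broadcast and Spark falls back to a shuffle join. I handle it by first identifying skewed keys from a sample or a key-frequency histogram. Then I apply one of the following techniques: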
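1. **Salting** – add a random suffix (salt) to the skewed key on the large side, replicate the matching rows on the other side once per salt value, join on the salted key, then drop the salt (or re-aggregate by the original key) after the join.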
2. **Repartitioning** – use a custom partitioner that distributes skewed keys across multiple partitions.
3. **Broadcast join with a threshold** – only broadcast tables that are truly small (controlled by `spark.sql.autoBroadcastJoinThreshold`); for larger tables, use a shuffle join combined with salting or adaptive skew handling.
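4. **Adaptive skew join** – in Spark 3.x, with adaptive query execution turned on (`spark.sql.adaptive.enabled`), the setting `spark.sql.adaptive.skewJoin.enabled` lets the engine split skewed partitions automatically in sort-merge joins. This is a configuration setting rather than a query hint.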
5. **Map-side combine** – aggregate data locally before shuffling, which helps when the skewed input can be reduced before the join.
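
To make the salting idea concrete, here is a minimal sketch in plain Python rather than the Spark API. It simulates the join on in-memory lists so the mechanics are visible: the large side gets a random salt on hot keys, the small side is replicated once per salt, and the salt is dropped after the join. The names `salt_large_side`, `explode_small_side`, `hash_join`, and `NUM_SALTS` are hypothetical and only serve the illustration; in a real job the same steps would be expressed as DataFrame operations.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # assumed salt fan-out; tune to the degree of skew


def salt_large_side(rows, hot_keys, num_salts=NUM_SALTS, seed=0):
    """Attach a random salt to hot keys on the large (skewed) side."""
    rng = random.Random(seed)
    salted = []
    for key, value in rows:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        salted.append(((key, salt), value))
    return salted


def explode_small_side(rows, hot_keys, num_salts=NUM_SALTS):
    """Replicate hot-key rows once per salt so every salted key finds a match."""
    exploded = []
    for key, value in rows:
        salts = range(num_salts) if key in hot_keys else range(1)
        for salt in salts:
            exploded.append(((key, salt), value))
    return exploded


def hash_join(left, right):
    """Plain in-memory equi-join on the first element of each row."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# Toy data: one hot key with 1000 rows, one cold key with 10 rows.
large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]
hot_keys = {"hot"}

salted_large = salt_large_side(large, hot_keys)
exploded_small = explode_small_side(small, hot_keys)

# Join on the salted key, then drop the salt to recover the original key.
joined = [
    (key, lv, rv)
    for (key, _salt), lv, rv in hash_join(salted_large, exploded_small)
]
baseline = hash_join(large, small)

# Rows per join key before and after salting: the hot key is now split up.
rows_per_key_before = Counter(key for key, _ in large)
rows_per_key_after = Counter(key for key, _ in salted_large)
```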

I also monitor per-task shuffle read and write sizes in the Spark UI to confirm that partitions are balanced. When a single skewed task dominates a stage, these steps can shorten the job considerably, though the gain depends on how severe the skew is.
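
As a rough way to reason about whether partitions are balanced, the sketch below flags a partition as skewed when it is both larger than a multiple of the median size and larger than an absolute threshold. This mirrors, in simplified form, how I understand Spark 3.x adaptive execution to decide (`spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`); the default values used here (5 and 256 MB) are my assumption and should be checked against the docs for your version. The function name `find_skewed_partitions` and the sample sizes are hypothetical.

```python
from statistics import median

MB = 1024 * 1024


def find_skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return indices of partitions larger than factor * median AND the threshold."""
    if not sizes_bytes:
        return []
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


# Hypothetical shuffle partition sizes, e.g. read off the stage metrics.
partition_sizes = [64 * MB, 70 * MB, 58 * MB, 2048 * MB, 66 * MB]
skewed = find_skewed_partitions(partition_sizes)
```

Requiring both conditions avoids flagging partitions that are large in relative terms but still small enough to process quickly.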
