
Let's say you used repartition and it increased shuffling, but it isn't helping. How would you handle it?

🟡 Medium · Debugging · Junior level
Times asked: 1
Last seen: Jul 2026
First seen: Jul 2026

💡 Model Answer

When repartition in Spark adds shuffle cost without improving performance, start by checking the shuffle read/write sizes and per-task durations for each stage in the Spark UI to see whether the problem is data volume, partition count, or skew. If the goal was only to reduce the number of partitions, use coalesce instead of repartition: it merges existing partitions without a full shuffle, although it does not rebalance uneven ones. If a few partitions are much larger than the rest, the data is skewed; add a salt column to the hot keys or, with RDDs, apply a custom partitioner to even out the distribution. For joins that trigger shuffles, broadcast the smaller dataset if it fits in memory so the larger side is not shuffled. Cache intermediate RDDs or DataFrames that are reused by several actions so the work upstream of them is not recomputed. Make sure the partitioning key matches the operations that follow; for example, repartition by the join key before a join rather than by an unrelated column. Finally, tune spark.sql.shuffle.partitions (default 200), which sets the partition count for DataFrame and SQL shuffles, to a value that balances parallelism against per-task overhead. Together, these steps reduce shuffle volume and improve job runtime.
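The salting idea in the answer above can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: it hash-partitions records roughly the way a shuffle would and compares the largest partition before and after a salt is appended to each key. The key names, record counts, partition count, and salt range are made-up values for illustration, and `zlib.crc32` is only a stand-in for Spark's own hash partitioner.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # hypothetical shuffle partition count
NUM_SALTS = 8        # hypothetical number of salt values per key


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash partitioning (crc32 stands in for Spark's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def salted(keys, num_salts: int = NUM_SALTS):
    """Append a round-robin salt so one hot key becomes num_salts distinct keys."""
    return [f"{k}_{i % num_salts}" for i, k in enumerate(keys)]


# Skewed input: one hot key dominates, plus many small keys.
keys = ["hot_customer"] * 8000 + [f"customer_{i}" for i in range(2000)]

before = partition_sizes(keys)          # all hot records share one partition
after = partition_sizes(salted(keys))   # hot records spread over salted keys

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

In an actual Spark job the same idea usually means adding a salt column, aggregating or joining on the pair (key, salt), and then either combining the partial results per original key or replicating the smaller join side once per salt value. Real jobs typically draw the salt at random; round-robin is used here only to keep the sketch deterministic.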

This answer was generated by AI for study purposes. Use it as a starting point and personalize it with your own experience.
