Why does coalesce not perform a full shuffle?

Question

Assisting AI · Accepted Answer

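In Apache Spark, coalesce reduces the number of partitions in an RDD or DataFrame. Unlike repartition, it does not trigger a full shuffle, because it creates a narrow dependency: each output partition is defined as the union of a group of existing partitions, so every parent partition feeds exactly one child and no record has to be re-hashed or redistributed individually. Spark tries to group parent partitions that live on the same executor, so most of the merging happens locally. Some data can still cross the network when the grouped partitions sit on different executors, but there is no shuffle stage and no shuffle files are written. For RDDs, repartition(n) is simply coalesce(n, shuffle = true).

Because partitions are merged whole, coalesce preserves whatever imbalance already exists. It is therefore efficient for decreasing the partition count when the data is already roughly evenly distributed and you want to avoid the overhead of a full shuffle. If the data is skewed, if you need more balanced partitions, or if you want to increase the partition count, use repartition, which performs a full shuffle to redistribute records evenly across the new partitions. One caveat: since coalesce adds no stage boundary, a drastic reduction such as coalesce(1) can also cut the parallelism of the computation that feeds it.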
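The difference between merging whole partitions and re-routing every record can be sketched with a toy model in plain Python. This is not Spark code and not Spark's actual grouping algorithm: the functions toy_coalesce and toy_repartition are hypothetical names, the contiguous grouping is an assumption (real Spark also weighs data locality), and round-robin stands in for whatever distribution a real shuffle would use.

```python
from typing import List

Partition = List[int]


def toy_coalesce(parents: List[Partition], n: int) -> List[Partition]:
    """Merge whole parent partitions into at most n children.

    Models a narrow dependency: each parent feeds exactly one child.
    Assumption: parents are grouped in contiguous runs; n must be >= 1.
    """
    n = min(n, len(parents))  # without a shuffle the count cannot grow
    children: List[Partition] = [[] for _ in range(n)]
    for i, parent in enumerate(parents):
        # Map parent index i onto one of the n child slots.
        children[i * n // len(parents)].extend(parent)
    return children


def toy_repartition(parents: List[Partition], n: int) -> List[Partition]:
    """Route every record individually (round-robin), as a shuffle would."""
    children: List[Partition] = [[] for _ in range(n)]
    k = 0
    for parent in parents:
        for record in parent:
            children[k % n].append(record)
            k += 1
    return children


# Skewed input: the first partition holds 90 of the 100 records.
parents = [
    list(range(0, 90)),
    list(range(90, 94)),
    list(range(94, 97)),
    list(range(97, 100)),
]

coalesced = toy_coalesce(parents, 2)
repartitioned = toy_repartition(parents, 2)

print([len(p) for p in coalesced])      # [94, 6]  -> skew is preserved
print([len(p) for p in repartitioned])  # [50, 50] -> records are rebalanced
```

In the toy model, coalescing touches each partition once as a unit, which is why it is cheap, and also why the 90-record partition still dominates afterwards. On a real cluster you could inspect the same effect with rdd.glom(), which returns the contents of each partition as a list.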
