Why does coalesce not perform a full shuffle?
💡 Model Answer
In Apache Spark, the coalesce transformation reduces the number of partitions in an RDD or DataFrame. Unlike repartition, it does not trigger a full shuffle, because it creates a narrow dependency: each output partition is simply the union of several whole parent partitions, so records are never redistributed one by one across the cluster. Spark tries to group parent partitions that already sit on the same executor, so most of the merging is local, although a parent partition held on another executor may still be read over the network. This avoids the shuffle write, network transfer, and shuffle read that repartition incurs (on RDDs, repartition(n) is implemented as coalesce(n, shuffle = true)). Coalesce is therefore efficient for decreasing the partition count when the data is already roughly evenly distributed. The trade-offs are that it preserves whatever skew the parent partitions had, it cannot increase the partition count without a shuffle, and a drastic reduction (for example, to a single partition) can also limit the parallelism of the upstream computation in the same stage. If the data is skewed or you need balanced partitions, use repartition, which performs a full shuffle to redistribute data evenly across the new partitions.
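The difference between merging whole partitions and redistributing individual records can be sketched with a toy model. This is plain Python, not Spark: lists stand in for partitions, and the function names `toy_coalesce` and `toy_repartition` are hypothetical. Spark's real coalescer also takes executor locality into account, which this sketch ignores.

```python
# Toy model only: plain Python lists stand in for Spark partitions.
# This is NOT Spark's implementation; both function names are hypothetical.

def toy_coalesce(partitions, n):
    """Merge whole parent partitions into n groups (models a narrow dependency)."""
    if n >= len(partitions):
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous runs of parent partitions go to the same output partition;
        # records always travel together with the rest of their partition.
        merged[i * n // len(partitions)].extend(part)
    return merged

def toy_repartition(partitions, n):
    """Redistribute every record individually (models a full shuffle)."""
    out = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for rec in part:
            out[k % n].append(rec)  # round-robin over all records
            k += 1
    return out

parents = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]  # skewed on purpose
coalesced = toy_coalesce(parents, 2)
repartitioned = toy_repartition(parents, 2)

print([len(p) for p in coalesced])      # sizes after merging whole partitions
print([len(p) for p in repartitioned])  # sizes after per-record redistribution
```

In this model the coalesced output keeps the skew of its parents (sizes 7 and 3), while the per-record redistribution comes out balanced (5 and 5). In real Spark, one way to observe the same distinction is to compare the physical plans from `df.coalesce(2).explain()` and `df.repartition(2).explain()`: the latter should show an Exchange (shuffle) step, while the former should not.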
This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.