In a scenario where you ingest around 100 GB of data from a source and call the collect function, what is wrong with this approach, and what is the correct process?

🟡 Medium · Conceptual · Mid level
Times asked: 1
Last seen: Jul 2026
First seen: Jul 2026

💡 Model Answer

Calling collect() on a dataset of around 100 GB makes Spark serialize every partition and send it to the driver, which must then hold the entire result in its own memory. The driver is a single process, usually with far less memory than the dataset, so the job will almost certainly fail: either the driver throws an OutOfMemoryError, or Spark aborts the job because the result exceeds spark.driver.maxResultSize (1 GB by default).

The correct approach is to avoid collect() on large data and use operations that keep the work distributed: df.write to persist results to distributed storage or a database (for example Parquet or CSV files), foreachPartition() to process each partition in parallel on the executors, or aggregations that run on the executors and return only a small result. To inspect a sample, use take(n) or limit(n) to fetch a small subset. For debugging, df.show(n) prints only the first n rows (20 by default). In production, keep the data on the cluster and write it out or push it to downstream systems instead of pulling it into the driver.
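The alternatives above can be sketched in PySpark roughly as follows. This is a minimal illustration, not a production job: the `amount` column, the threshold of 100, the function names, and the input and output paths are all hypothetical, and the input is assumed to be Parquet. It uses mapPartitions(), a sibling of foreachPartition() that returns values, so that the per-partition logic produces a small result that is safe to collect.

```python
def count_large_orders(rows):
    """Runs on an executor: consumes one partition's rows lazily and
    yields a single integer instead of the rows themselves.

    `amount` is a hypothetical column name used for illustration."""
    yield sum(1 for row in rows if row["amount"] > 100)


def run(spark, input_path, output_path):
    """`spark` is an existing SparkSession; both paths are hypothetical."""
    df = spark.read.parquet(input_path)

    # Anti-pattern (do not do this on large data):
    #   rows = df.collect()   # ships every row to the driver

    # 1. Keep results distributed: executors write their partitions in parallel.
    df.filter(df["amount"] > 100).write.mode("overwrite").parquet(output_path)

    # 2. Reduce on the executors first. collect() is acceptable here because
    #    only one integer per partition travels back to the driver.
    per_partition_counts = df.rdd.mapPartitions(count_large_orders).collect()
    total = sum(per_partition_counts)

    # 3. Inspect a bounded sample instead of the whole dataset.
    sample = df.take(5)   # at most 5 Row objects on the driver
    df.show(5)            # prints at most 5 rows

    return total, sample
```

To try it, pass in a session created with `SparkSession.builder.getOrCreate()` from `pyspark.sql`. The point of the design is that the size of anything returned to the driver depends on the number of partitions or on a fixed n, never on the size of the dataset.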
