How do you optimize Apache Spark jobs?
💡 Model Answer
Optimizing Spark jobs involves several layers: data layout, the execution plan, and cluster configuration.

1. Data layout: partition data by the join or aggregation key to reduce shuffles, and use bucketing so downstream joins on that key can skip the shuffle entirely; coalesce is useful for reducing the number of small output partitions.
2. Caching and joins: persist DataFrames or RDDs that are reused across multiple actions, and use broadcast joins for small dimension tables.
3. Shuffle tuning: set spark.sql.shuffle.partitions (default 200) to roughly 2-3x the total number of executor cores, and adjust spark.sql.autoBroadcastJoinThreshold (default 10 MB) to control when joins are automatically broadcast.
4. Code generation: Tungsten and whole-stage code generation (spark.sql.codegen.wholeStage) are enabled by default since Spark 2.0; verify they have not been disabled, and prefer built-in SQL functions over Python UDFs so the optimizer and codegen can apply.
5. Memory: monitor GC and memory usage in the executors; size spark.executor.memory appropriately, and tune spark.memory.fraction only if the defaults cause spilling or GC pressure.
6. Profiling: inspect jobs in the Spark UI or Spark History Server, identify long or skewed stages, and rewrite expensive transformations (for example, replace per-row Python UDFs with built-in functions).

Together these steps reduce runtime and memory pressure and improve throughput.
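The bucketing, caching, broadcast-join, and shuffle-tuning points above can be sketched in PySpark. This is a minimal sketch, not a drop-in job: it assumes a running Spark cluster, and the paths, table names (orders, customers), and join key (customer_id) are hypothetical placeholders.

```python
# Sketch of several of the tuning steps above. Assumes a live Spark
# cluster; all data paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("optimized-job")
    # Roughly 2-3x total executor cores; the default of 200 is often
    # too high for small clusters and too low for large ones.
    .config("spark.sql.shuffle.partitions", "96")
    # Raise the auto-broadcast threshold to ~50 MB so slightly larger
    # dimension tables are still broadcast instead of shuffled.
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    .getOrCreate()
)

orders = spark.read.parquet("s3://bucket/orders")        # large fact table
customers = spark.read.parquet("s3://bucket/customers")  # small dimension table

# Explicit broadcast hint: ships the small table to every executor,
# turning a shuffle join into a map-side join.
enriched = orders.join(broadcast(customers), on="customer_id")

# Cache only when the result is reused by multiple downstream actions;
# the first action materializes the cache.
enriched.cache()
enriched.count()

# Bucketing by the join key at write time lets later joins on that key
# avoid a shuffle (bucketBy requires saveAsTable, not plain save).
(orders.write
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("orders_bucketed"))
```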
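The cluster-level memory and shuffle settings can also be supplied at submit time. The flags below are a hedged config sketch: the executor counts, memory sizes, and thresholds are assumptions to be sized against the actual workload, not recommendations.

```shell
# Illustrative spark-submit configuration (values are assumptions).
# 8 executors x 4 cores = 32 cores; shuffle partitions set to ~3x that.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=96 \
  --conf spark.memory.fraction=0.6 \
  --conf spark.sql.autoBroadcastJoinThreshold=52428800 \
  my_job.py
```

Note that spark.memory.fraction=0.6 is the default; it is shown here only to indicate where the unified-memory split is tuned if spilling or GC pressure appears.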