
Explain how to perform a left join in PySpark and how broadcast join optimization can be used when the right table is small.

🟡 Medium Conceptual Junior level
Times asked: 1
Last seen: Jun 2026
First seen: Jun 2026

💡 Model Answer

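In PySpark, a left join is performed with the DataFrame API: df_left.join(df_right, on=join_cols, how='left'). Every row of df_left is kept, and columns from df_right are NULL where no match exists. By default, a join of two large tables shuffles both sides across the cluster so that matching keys land in the same partition. When the right DataFrame is small enough to fit in memory, Spark can instead broadcast a full copy of it to every executor, which removes the shuffle of the left table. To request this explicitly, import broadcast from pyspark.sql.functions and wrap the right DataFrame: df_left.join(broadcast(df_right), on=join_cols, how='left'). In a left join only the right side can be broadcast, because every left row must be preserved.

Broadcast joins reduce network I/O and usually improve performance, but they should be used only when the right side is small. Spark broadcasts automatically when its size estimate for a table is below spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB; setting it to -1 disables automatic broadcasting. An explicit broadcast hint takes precedence over that size-based decision, so use it only when you know the table is small (a common rule of thumb is well under 100 MB). If the right side is large, Spark falls back to a shuffle-based join, typically a sort-merge join. The broadcast table is collected on the driver before being sent to the executors, so monitor both driver and executor memory.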

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
