Write code to perform a broadcast join between sales data and stores data.
Model Answer
Assuming you have two Spark DataFrames, sales_df (sales_id, store_id, amount) and stores_df (store_id, store_name), and the stores table is small enough to fit in memory, you can broadcast it to all executors and perform an efficient join:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Load data
sales_df = spark.read.parquet("/path/to/sales.parquet")
stores_df = spark.read.parquet("/path/to/stores.parquet")

# Mark the smaller DataFrame for broadcast
broadcast_stores = broadcast(stores_df)

# Perform the join; the large sales table stays in place
joined_df = sales_df.join(broadcast_stores, on="store_id", how="inner")

# Show result
joined_df.select("sales_id", "store_name", "amount").show()
```

Key points:
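- `broadcast()` hints to Spark that the DataFrame should be sent to all executors. Spark also broadcasts automatically when it estimates a table to be smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).
- The join type can be `inner`, `left`, etc., depending on requirements. For a `left` join, the broadcast table must be on the right-hand side, as it is here.
- This avoids shuffling the large sales table across the network.
- Ensure the broadcast table is small (e.g., < 100MB) to avoid memory pressure, since a full copy is held on the driver and on every executor.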
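Complexity: Each executor builds an in-memory hash table from the broadcast table, which costs O(M) for M rows in the small DataFrame, and then probes it once per row of the large DataFrame, which costs O(N) in total. The small table is sent over the network once per executor, and neither side is shuffled, so the large sales table is joined where it already resides.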
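To make the cost claim concrete, here is a plain-Python sketch of what a broadcast hash join does conceptually on each executor. It is an illustration rather than Spark's actual implementation: the function name and sample rows are made up, and it assumes `store_id` is unique in the stores table.

```python
def broadcast_hash_join(sales_partition, stores_rows):
    """Inner-join one partition of sales rows against the full stores table."""
    # Build phase: one pass over the small table, O(M).
    # Assumes store_id is unique in stores_rows.
    store_names = {store_id: store_name for store_id, store_name in stores_rows}

    # Probe phase: one expected O(1) dictionary lookup per sales row, O(N) overall.
    joined = []
    for sales_id, store_id, amount in sales_partition:
        if store_id in store_names:
            joined.append((sales_id, store_names[store_id], amount))
    return joined


# The small table, copied in full to every executor.
stores = [(10, "Downtown"), (20, "Airport")]

# The large table stays split across partitions; no rows move between them.
partitions = [
    [(1, 10, 25.0), (2, 20, 40.0)],
    [(3, 10, 15.5), (4, 99, 5.0)],  # store 99 has no match, so the inner join drops it
]

result = [row for part in partitions for row in broadcast_hash_join(part, stores)]
print(result)
# [(1, 'Downtown', 25.0), (2, 'Airport', 40.0), (3, 'Downtown', 15.5)]
```

Each partition is processed independently against its own copy of the lookup table, which is why the large side never needs to be repartitioned by `store_id`.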
This answer was generated by AI for study purposes. Use it as a starting point and personalize it with your own experience.