Write code to perform a broadcast join between sales data and stores data.

Question

Assisting AI · Accepted Answer

Assuming you have two Spark DataFrames, `sales_df` (sales_id, store_id, amount) and `stores_df` (store_id, store_name), and the stores table is small enough to fit in memory, you can broadcast it to all executors and perform an efficient join:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Load data
sales_df = spark.read.parquet("/path/to/sales.parquet")
stores_df = spark.read.parquet("/path/to/stores.parquet")

# Mark the smaller DataFrame for broadcast (lazy: nothing is sent until an action runs)
broadcast_stores = broadcast(stores_df)

# Perform the join
joined_df = sales_df.join(broadcast_stores, on="store_id", how="inner")

# Show result
joined_df.select("sales_id", "store_name", "amount").show()
```

Key points:
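1. `broadcast()` is a hint that asks Spark to ship a full copy of the DataFrame to every executor. Spark already does this on its own for tables it estimates to be smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); the explicit hint requests it for larger tables too.
2. With the broadcast table on the right, as here, the join type can be `inner`, `left`, `left_semi`, or `left_anti`. A `full` outer join cannot run as a broadcast hash join, and in a `left` join only the right side can be broadcast.
3. This avoids shuffling the large sales table across the network.
4. Keep the broadcast table small (e.g., < 100 MB). It is collected on the driver first and then held in memory on every executor, so an oversized table can cause out-of-memory errors.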

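Complexity: Building the hash table on the small side is O(M) and probing it with the large side is O(N), so the join runs in O(N + M), where N and M are the row counts of the large and small DataFrames. On top of that comes the one-time network cost of sending the small table to each executor. Neither side is shuffled: each partition of the large table is joined in place against its local copy of the small table.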
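
The build and probe phases behind that cost can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation; the function name `broadcast_hash_join` and the sample rows are hypothetical, and it assumes `store_id` is unique in the stores table.

```python
def broadcast_hash_join(sales_rows, stores_rows):
    """Inner-join sales to stores on store_id using a hash map of the small side."""
    # Build phase, O(M): hash the small table by its join key.
    # Assumes store_id is unique in stores_rows.
    store_names = {store_id: store_name for store_id, store_name in stores_rows}

    # Probe phase, O(N): stream over the large table once, one lookup per row.
    joined = []
    for sales_id, store_id, amount in sales_rows:
        store_name = store_names.get(store_id)
        if store_name is not None:  # inner join: drop sales with no matching store
            joined.append((sales_id, store_name, amount))
    return joined


# Hypothetical sample data: (sales_id, store_id, amount) and (store_id, store_name).
sales = [(1, 10, 25.0), (2, 20, 40.0), (3, 99, 5.0)]
stores = [(10, "Downtown"), (20, "Airport")]

joined = broadcast_hash_join(sales, stores)
```

In Spark, each executor holds its own copy of the hash map and runs the probe loop over its local partitions of the large table, which is why no shuffle is needed.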
