Can you explain a PySpark transformation you personally developed? What were the input DataFrames? What type of joins did you use? Did you save the input and output as views?

Question

Assisting AI · Accepted Answer

I built a transformation that enriched user clickstream data with user profile information. The input DataFrames were:
1. `click_df` – raw click events with columns (user_id, event_time, page_id, device).
2. `user_df` – user profiles with columns (user_id, age, country, signup_date).

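I first applied a left semi-join to keep only the clicks whose `user_id` appears in the profile table, then a left join to bring in the profile attributes. Both joins carry a broadcast hint on `user_df`, which assumes the profile table is small enough to fit in each executor's memory. Because the semi-join has already removed clicks without a profile, the left join returns the same rows a single inner join would. The code looked like this: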
```python
from pyspark.sql.functions import broadcast

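# Keep only clicks whose user_id exists in user_df; no profile columns are added yet.
filtered_clicks = click_df.join(broadcast(user_df), "user_id", "left_semi")
# Attach the profile columns (age, country, signup_date) to each remaining click.
joined = filtered_clicks.join(broadcast(user_df), "user_id", "left")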
```
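To make the semantics of the two joins concrete, here is a minimal plain-Python model on hypothetical toy rows. It is not Spark code: the helpers `left_semi` and `left_join` are illustrative names, not PySpark APIs, and the sample data is invented for the sketch. It suggests why the left join cannot introduce null profile columns once the semi-join has run.

```python
# Plain-Python model of the join semantics; all rows below are hypothetical.
clicks = [
    {"user_id": 1, "page_id": "home", "device": "mobile"},
    {"user_id": 2, "page_id": "cart", "device": "desktop"},
    {"user_id": 9, "page_id": "home", "device": "mobile"},  # no matching profile
]
users = [
    {"user_id": 1, "age": 31, "country": "DE"},
    {"user_id": 2, "age": 45, "country": "US"},
]


def left_semi(left, right, key):
    """Keep left rows whose key exists on the right; add no right columns."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def left_join(left, right, key):
    """Keep every left row; attach right columns, or None when unmatched."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    right_cols = [col for col in right[0] if col != key]
    out = []
    for row in left:
        matches = by_key.get(row[key])
        if matches is None:
            out.append({**row, **{col: None for col in right_cols}})
        else:
            out.extend({**row, **match} for match in matches)
    return out


filtered = left_semi(clicks, users, "user_id")
joined = left_join(filtered, users, "user_id")

# Every click that survived the semi-join has a profile, so the left join
# fills all profile columns and the result matches an inner join.
for row in joined:
    print(row)
```

Under these assumptions, the click from `user_id` 9 is dropped by the semi-join, and every remaining row carries non-null `age` and `country` values. In Spark, a single inner join with the same broadcast hint would be expected to give the same rows in one pass.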
After the join, I used a window function to attach the number of clicks per user per day to every row, rather than collapsing the rows as a `groupBy` would:
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import count, to_date

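# One window partition per (user, calendar day); no ordering is needed for a plain count.
w = Window.partitionBy("user_id", to_date("event_time"))
# count("page_id") ignores rows where page_id is null.
result = joined.withColumn("clicks_per_day", count("page_id").over(w))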
```
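The effect of that window can be sketched in plain Python on hypothetical rows. This is only a model of the semantics, not Spark code: the sample data and the helper `partition_key` are invented for illustration. It shows that each input row is kept and annotated with the count for its (user, day) partition, and that null `page_id` values are not counted.

```python
from collections import Counter
from datetime import datetime

# Hypothetical joined rows; only the columns the window touches are shown.
rows = [
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 9, 0), "page_id": "home"},
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 17, 30), "page_id": "cart"},
    {"user_id": 1, "event_time": datetime(2024, 1, 2, 8, 15), "page_id": "home"},
    {"user_id": 2, "event_time": datetime(2024, 1, 1, 12, 0), "page_id": None},
]


def partition_key(row):
    """Mirror Window.partitionBy("user_id", to_date("event_time"))."""
    return (row["user_id"], row["event_time"].date())


# Like count("page_id"), tally only rows whose page_id is not null.
counts = Counter(partition_key(r) for r in rows if r["page_id"] is not None)

# A window function keeps every row and adds the partition-level value.
annotated = [{**r, "clicks_per_day": counts[partition_key(r)]} for r in rows]

for r in annotated:
    print(r["user_id"], r["event_time"].date(), r["clicks_per_day"])
```

The row count is unchanged, which is the main difference from a `groupBy` aggregation that would return one row per user and day.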
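I registered the final DataFrame as a temporary view for downstream SQL queries. A temporary view is scoped to the current SparkSession and stores no data, so this names the result for SQL access rather than persisting it: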
```python
result.createOrReplaceTempView("user_click_summary")
```
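With a broadcast hash join, the join cost is roughly O(n + m), where n is the number of click rows and m is the number of profile rows: Spark builds a hash table from `user_df` in O(m) and probes it once per click in O(n). The window aggregation is O(n log n) in the worst case, because Spark shuffles and sorts the rows by the partition keys (`user_id` and the event date) before evaluating the window. This transformation is reusable and can be scheduled as a daily job.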
