Can you explain a PySpark transformation you personally developed? What were the input DataFrames? What type of joins did you use? Did you save the input and output as views?
💡 Model Answer
I built a transformation that enriched user clickstream data with user profile information. The input DataFrames were:
click_df – raw click events with columns (user_id, event_time, page_id, device).
user_df – user profiles with columns (user_id, age, country, signup_date).
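I performed a left semi-join to keep only the clicks whose user_id exists in the profile table, then a broadcast left join to bring in the profile attributes. Broadcasting assumes the profile table is small enough to fit in executor memory. Because the semi-join already guarantees a match, the second join returns the same rows an inner join would. The code looked like this: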
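from pyspark.sql.functions import broadcast
# Keep only clicks whose user_id exists in user_df (no profile columns are added yet)
filtered_clicks = click_df.join(broadcast(user_df), "user_id", "left_semi")
# Attach the profile columns; user_df is broadcast so the clicks are not shuffled
joined = filtered_clicks.join(broadcast(user_df), "user_id", "left")
After the join, I added a window function to compute the number of clicks per user per day: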
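from pyspark.sql.window import Window
from pyspark.sql.functions import count, to_date
# One window partition per user per calendar day
w = Window.partitionBy("user_id", to_date("event_time"))
# count("page_id") skips rows where page_id is null
result = joined.withColumn("clicks_per_day", count("page_id").over(w))
I registered the final DataFrame as a temporary view, which is scoped to the current SparkSession, for downstream SQL queries: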
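result.createOrReplaceTempView("user_click_summary")
With a broadcast hash join, the join cost is roughly O(n + m), where n and m are the row counts of the click and profile DataFrames: Spark builds a hash table over the broadcast side and probes it once per click row. The window aggregation is closer to O(n log n) because Spark shuffles on the partition keys and sorts rows within each partition before evaluating the window. The transformation is reusable and can be scheduled as a daily job.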
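To make the O(n + m) join cost concrete, here is a minimal plain-Python sketch of the same three steps on a handful of rows. It is not Spark code and not the original job: the sample rows and the enrich_clicks helper are hypothetical, and it assumes user_id is unique in the profile table and page_id is never null.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows mirroring the click_df and user_df columns.
clicks = [
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 9, 0), "page_id": "home", "device": "mobile"},
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 17, 30), "page_id": "cart", "device": "mobile"},
    {"user_id": 1, "event_time": datetime(2024, 1, 2, 8, 15), "page_id": "home", "device": "web"},
    {"user_id": 2, "event_time": datetime(2024, 1, 1, 12, 0), "page_id": "home", "device": "web"},
    # user 9 has no profile row, so the semi-join step drops this click
    {"user_id": 9, "event_time": datetime(2024, 1, 1, 13, 0), "page_id": "home", "device": "web"},
]
users = [
    {"user_id": 1, "age": 31, "country": "IN", "signup_date": "2023-05-01"},
    {"user_id": 2, "age": 45, "country": "US", "signup_date": "2022-11-12"},
]


def enrich_clicks(clicks, users):
    """Plain-Python analogue of the semi-join, the left join, and the per-day count."""
    # Build phase: one pass over the m profile rows (assumes unique user_id).
    profiles = {u["user_id"]: u for u in users}

    # Probe phase: one pass over the n click rows.
    # Skipping misses plays the role of the semi-join; merging the
    # matching profile plays the role of the join that follows it.
    joined = [
        {**c, **profiles[c["user_id"]]}
        for c in clicks
        if c["user_id"] in profiles
    ]

    # Window analogue: count rows per (user_id, calendar day), then attach
    # that count to every row without collapsing the rows.
    per_day = Counter((r["user_id"], r["event_time"].date()) for r in joined)
    return [
        {**r, "clicks_per_day": per_day[(r["user_id"], r["event_time"].date())]}
        for r in joined
    ]


result = enrich_clicks(clicks, users)
for row in result:
    print(row["user_id"], row["event_time"].date(), row["page_id"], row["clicks_per_day"])
```

The dictionary build and the single probe pass are where the n + m comes from. The sketch counts with a hash map, so its per-day step is linear; Spark's window operator sorts within partitions instead, which is where the log factor in the estimate above comes from.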
This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.