
Can you explain a PySpark transformation you personally developed? What were the input DataFrames? What type of joins did you use? Did you save the input and output as views?

🟡 Medium · Coding · Mid level
Times asked: 1
Last seen: Jun 2026
First seen: Jun 2026

💡 Model Answer

I built a transformation that enriched user clickstream data with user profile information. The input DataFrames were:

  1. click_df – raw click events with columns (user_id, event_time, page_id, device).
  2. user_df – user profiles with columns (user_id, age, country, signup_date).

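I first performed a left semi-join to keep only the clicks from users present in the profile table, then a broadcast left join to bring in the profile attributes. Both joins broadcast user_df, which assumes the profile table is small enough to fit in executor memory. The code looked like this: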

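python
from pyspark.sql.functions import broadcast

# Keep only clicks whose user_id exists in user_df; no profile columns are added yet.
filtered_clicks = click_df.join(broadcast(user_df), "user_id", "left_semi")
# Attach the profile columns (age, country, signup_date) to the remaining clicks.
joined = filtered_clicks.join(broadcast(user_df), "user_id", "left")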
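To make the two join types concrete, here is a minimal plain-Python sketch of what a broadcast hash join does conceptually: the small side is turned into an in-memory lookup table and each row of the large side probes it. The helper names (`build_lookup`, `left_semi_join`, `left_join`) and the sample rows are hypothetical, and the sketch models only the join semantics, not Spark's actual execution engine.

```python
from typing import Any

Row = dict[str, Any]


def build_lookup(small_side: list[Row], key: str) -> dict[Any, Row]:
    """Model the broadcast step: index the small side by the join key.

    Assumes the key is unique on the small side (one profile per user).
    """
    return {row[key]: row for row in small_side}


def left_semi_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Keep left rows that have a match on the right; add no right columns."""
    lookup = build_lookup(right, key)
    return [dict(row) for row in left if row[key] in lookup]


def left_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Keep every left row; right columns become None when there is no match."""
    lookup = build_lookup(right, key)
    right_cols = sorted({c for r in right for c in r} - {key})
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        out = dict(row)
        for c in right_cols:
            out[c] = match.get(c)
        joined.append(out)
    return joined


# Hypothetical sample data: user 9 has clicks but no profile.
clicks = [
    {"user_id": 1, "page_id": "home", "device": "ios"},
    {"user_id": 2, "page_id": "cart", "device": "web"},
    {"user_id": 9, "page_id": "home", "device": "web"},
]
users = [
    {"user_id": 1, "country": "IN"},
    {"user_id": 2, "country": "US"},
]

filtered = left_semi_join(clicks, users, "user_id")
joined_rows = left_join(filtered, users, "user_id")
```

Because the semi-join has already removed clicks without a profile, the left join that follows never produces null profile columns. Taken together, the two steps return the same rows as a single inner join on user_id.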

After the join, I added a window function to compute the number of clicks per user per day:

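python
from pyspark.sql.window import Window
from pyspark.sql.functions import count, to_date

# One window partition per user per calendar day (event_time truncated to a date).
w = Window.partitionBy("user_id", to_date("event_time"))
# Every click row is kept; the count of non-null page_id values in its partition is attached.
result = joined.withColumn("clicks_per_day", count("page_id").over(w))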
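A window aggregate differs from a groupBy in that every input row is kept and the per-partition value is attached to it. The following is a minimal plain-Python sketch of that behaviour, assuming event_time is a `datetime`; the helper `add_clicks_per_day` and the sample rows are hypothetical and only illustrate the semantics.

```python
from collections import Counter
from datetime import datetime
from typing import Any

Row = dict[str, Any]


def add_clicks_per_day(rows: list[Row]) -> list[Row]:
    """Attach the click count of each (user_id, calendar day) group to every row."""

    def partition_key(row: Row) -> tuple:
        return (row["user_id"], row["event_time"].date())

    # First pass: count rows with a non-null page_id per partition,
    # mirroring count("page_id"), which ignores nulls.
    counts = Counter(
        partition_key(r) for r in rows if r["page_id"] is not None
    )
    # Second pass: copy each row and attach its partition's count.
    return [{**r, "clicks_per_day": counts[partition_key(r)]} for r in rows]


# Hypothetical sample data: user 1 clicks twice on one day and once on the next.
sample_clicks = [
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 9, 0), "page_id": "home"},
    {"user_id": 1, "event_time": datetime(2024, 1, 1, 18, 30), "page_id": "cart"},
    {"user_id": 1, "event_time": datetime(2024, 1, 2, 8, 15), "page_id": "home"},
    {"user_id": 2, "event_time": datetime(2024, 1, 1, 12, 0), "page_id": "home"},
]

with_counts = add_clicks_per_day(sample_clicks)
```

The sketch groups rows with a hash-based counter for brevity. Spark's window operator instead shuffles rows by the partition keys and sorts them within each partition, so its cost profile differs from this model.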

I registered the final DataFrame as a temporary view, which is scoped to the current SparkSession, for downstream SQL queries:

python
result.createOrReplaceTempView("user_click_summary")

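Because user_df is broadcast, each join runs as a hash join in roughly O(n + m) time, where n is the number of click rows and m is the number of profile rows: building the hash table takes O(m) and probing it takes O(n). The window aggregation costs O(n log n) in the worst case because Spark sorts the rows within each window partition. The transformation is reusable and can be scheduled as a daily job.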

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
