Give PySpark code to calculate the friend acceptance rate for each date when friend requests were sent. A request is sent if action = 'sent' and accepted if action = 'accepted'. If a request is not accepted, there is no record of it being accepted in the table. The output should only include dates where requests were sent and at least one of them was accepted (acceptance can occur on any date after the request is sent).
💡 Model Answer
To compute the acceptance rate per date, first read the data into a DataFrame. Filter the sent and accepted actions into separate DataFrames, count sent requests per date, and count accepted requests matched back to the date on which the request was sent. Because an acceptance may be recorded on a later date, the sent and accepted rows must be joined on the (sender, receiver) pair; an inner join then naturally limits the output to dates with at least one accepted request.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
spark = SparkSession.builder.getOrCreate()
# Assume df has columns: user_id_sender, user_id_receiver, date, action
# Separate sent and accepted actions; alias each side so join columns can be disambiguated
sent = df.filter(col("action") == "sent").alias("sent")
accepted = df.filter(col("action") == "accepted").alias("accepted")
# Count sent requests per date
sent_counts = sent.groupBy("date").agg(count("*").alias("sent_count"))
# Join sent with accepted on the sender/receiver pair; acceptance may occur on any later date
accepted_sent = sent.join(
    accepted,
    (col("sent.user_id_sender") == col("accepted.user_id_sender"))
    & (col("sent.user_id_receiver") == col("accepted.user_id_receiver")),
    "inner",
)
# Count accepted requests per *sent* date
accepted_counts = (
    accepted_sent.select(col("sent.date").alias("date"))
    .groupBy("date")
    .agg(count("*").alias("accepted_count"))
)
# Combine counts and compute the rate; the inner join keeps only dates
# with at least one accepted request
result = sent_counts.join(accepted_counts, "date", "inner")
result = result.withColumn("acceptance_rate", col("accepted_count") / col("sent_count"))
# Select desired columns
final = result.select("date", "sent_count", "accepted_count", "acceptance_rate")
final.show()

Complexity: Filtering and aggregation are linear in the number of rows, and the join on the (sender, receiver) pair requires a shuffle. The final result contains one row per date on which at least one request was sent and later accepted, so this approach scales to the large datasets typical of Spark workloads.
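To sanity-check the logic without a running Spark session, the same computation can be sketched in plain Python on a toy dataset (the rows and dates below are invented for illustration): count sent requests per date, mark a sent request accepted if its (sender, receiver) pair ever appears with an accepted action, and keep only dates where at least one request was accepted.

```python
from collections import Counter

# Toy rows: (sender, receiver, date, action) -- illustrative data only
rows = [
    (1, 2, "2024-01-01", "sent"),
    (1, 2, "2024-01-02", "accepted"),  # accepted one day later
    (3, 4, "2024-01-01", "sent"),      # never accepted
    (5, 6, "2024-01-03", "sent"),
    (5, 6, "2024-01-03", "accepted"),  # accepted the same day
]

sent = [(s, r, d) for s, r, d, a in rows if a == "sent"]
accepted_pairs = {(s, r) for s, r, d, a in rows if a == "accepted"}

# Sent requests per date, and accepted requests attributed to the sent date
sent_counts = Counter(d for _, _, d in sent)
accepted_counts = Counter(d for s, r, d in sent if (s, r) in accepted_pairs)

# Keep only dates with at least one acceptance, mirroring the inner join
rates = {
    d: accepted_counts[d] / sent_counts[d]
    for d in sent_counts
    if accepted_counts[d] > 0
}
print(rates)  # {'2024-01-01': 0.5, '2024-01-03': 1.0}
```

On 2024-01-01 two requests were sent and one was eventually accepted (rate 0.5); 2024-01-03's single request was accepted the same day (rate 1.0); the never-accepted pair contributes to the sent count but produces no extra output date.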