
How can you compute a cumulative sum (running balance) ordered by date using an unbounded preceding window specification in a Spark DataFrame?

🟡 Medium · Conceptual · Junior level
Times asked: 1
First seen: Mar 2026
Last seen: Mar 2026

💡 Model Answer

In Spark SQL or the DataFrame API, you define a window that orders rows by the date column and spans from the first row of the partition to the current row. Example:

```python
from pyspark.sql import Window
from pyspark.sql.functions import sum as _sum

# No partitionBy(): the whole DataFrame becomes one window partition, so
# Spark will warn that all rows move to a single task. Add
# partitionBy('account') (or similar) when a per-group balance is wanted.
w = (
    Window.orderBy('date')
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df_with_balance = df.withColumn('running_balance', _sum('amount').over(w))
```

This adds a running_balance column holding the cumulative sum of amount up to and including each row, ordered by date. Because rowsBetween with Window.unboundedPreceding defines a physical-row frame, two rows sharing the same date still get distinct running totals; with the default range-based frame (what Spark uses when only orderBy is given), tied dates would all receive the same total. The result can then be displayed or written to storage.
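The rows-based frame semantics can be checked without a Spark cluster. The sketch below is plain Python (not Spark code): it mimics what a rowsBetween(unboundedPreceding, currentRow) frame computes after sorting by date. The sample rows and values are illustrative, including a deliberate duplicate date.

```python
from itertools import accumulate

# Illustrative sample rows: (date, amount); note the tied dates.
rows = [
    ("2026-03-01", 100.0),
    ("2026-03-02", -40.0),
    ("2026-03-02", 25.0),
    ("2026-03-03", 10.0),
]

# Sort by date (the window's orderBy); Python's sort is stable, so
# tied dates keep their input order. A rows-based frame from
# unboundedPreceding to currentRow is then just a running sum over
# the sorted physical rows.
rows_sorted = sorted(rows, key=lambda r: r[0])
running = list(accumulate(amount for _, amount in rows_sorted))

for (date, amount), bal in zip(rows_sorted, running):
    print(date, amount, bal)
# The two 2026-03-02 rows get distinct running totals (60.0, then 85.0),
# mirroring rowsBetween; a range-based frame would assign both 85.0.
```

This is only a mental model for the frame boundaries; the real computation is distributed and should be done with the Spark window shown above.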

