You need to pull 200 million rows from a source JDBC system into Databricks. The current PySpark job uses a single partition and exhausts the driver. How would you redesign the read for parallelism (choose a partitionColumn, set lowerBound/upperBound/numPartitions, handle nulls/outliers, tune fetch size) and ensure the ingest is idempotent before writing to Bronze?

Question

Assisting AI · Accepted Answer

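To parallelize the JDBC read, partition on a numeric, date, or timestamp column whose values are spread evenly, such as an auto‑incrementing primary key. For example, `spark.read.format('jdbc').option('partitionColumn', 'id').option('lowerBound', 1).option('upperBound', 200000000).option('numPartitions', 200).option('fetchsize', 5000)` (plus the usual `url`, `dbtable`, and credentials) splits the read into up to 200 range queries of about one million ids each. `numPartitions` also caps the number of concurrent connections to the source, so size it to what the database can tolerate. `fetchsize` sets how many rows the JDBC driver fetches per round trip; the best value depends on the driver and on row width.

Note that `lowerBound` and `upperBound` only determine the partition stride; they do not filter rows. Values below the lower bound and rows where the partition column is NULL are read by the first partition, and values above the upper bound by the last, so nulls and outliers cause skew rather than data loss. Do not filter nulls out unless those rows should genuinely be excluded. Instead, take the bounds from the column's `MIN`/`MAX` (or from percentiles if there are extreme outliers), and if the key range has large gaps, compute split points from a histogram and pass explicit range predicates so each partition has roughly the same row count.

For idempotency, an append alone is not enough: Delta Lake does not enforce unique constraints, so a re‑run of `df.write.mode('append').saveAsTable('bronze_table')` would duplicate rows. (`checkpointLocation` is a Structured Streaming option and has no effect on a batch write, so leave it out.) Either `MERGE` into Bronze on the primary key, or overwrite only the key range loaded by this run using `replaceWhere`. For incremental loads, maintain a separate metadata table that records the last processed primary key, and update it only after the Bronze write succeeds. On each run, read that value and push the filter into the source query, for example with a subquery in `dbtable` such as `(SELECT * FROM src WHERE id > <last_key>) t`; setting `lowerBound` to the last processed key + 1 does not exclude older rows. This ingests only new rows on re‑runs as long as the key is monotonically increasing and existing rows are not updated in place; capturing updates needs a modification timestamp or CDC.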
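As a rough illustration of handling skew and nulls explicitly, the sketch below builds one WHERE clause per partition from a list of split points and passes them to `spark.read.jdbc` through its `predicates` argument, then merges the result into Bronze on the key. The helper names (`build_predicates`, `read_parallel`, `merge_into_bronze`), the `id` key, the `bronze_table` target, and the split points are all assumptions made for the example; in practice the split points would come from a histogram or percentile query against the source. Note that a merge on `t.id = s.id` never matches NULL keys, so rows with a NULL key would still be re‑inserted on a re‑run.

```python
from typing import List, Sequence


def build_predicates(column: str, split_points: Sequence[int]) -> List[str]:
    """Turn split points into non-overlapping WHERE clauses.

    Each clause becomes one JDBC partition. The first and last ranges are
    open-ended so outliers are still read, and NULLs get their own clause.
    """
    points = sorted(set(split_points))
    if not points:
        return [f"{column} IS NOT NULL", f"{column} IS NULL"]
    clauses = [f"{column} < {points[0]}"]
    for lo, hi in zip(points, points[1:]):
        clauses.append(f"{column} >= {lo} AND {column} < {hi}")
    clauses.append(f"{column} >= {points[-1]}")
    clauses.append(f"{column} IS NULL")
    return clauses


def read_parallel(spark, url, table, predicates, user, password):
    """Read with one task per predicate; fetchsize is rows per round trip."""
    props = {"user": user, "password": password, "fetchsize": "5000"}
    return spark.read.jdbc(
        url=url, table=table, predicates=list(predicates), properties=props
    )


def merge_into_bronze(spark, df, target="bronze_table", key="id"):
    """Insert only keys not already in the target, so a re-run adds nothing."""
    # Imported lazily: the delta package ships with Databricks runtimes.
    from delta.tables import DeltaTable

    (
        DeltaTable.forName(spark, target)
        .alias("t")
        .merge(df.alias("s"), f"t.{key} = s.{key}")
        .whenNotMatchedInsertAll()
        .execute()
    )
```

With split points `[100, 200]`, `build_predicates("id", [100, 200])` yields four clauses: `id < 100`, `id >= 100 AND id < 200`, `id >= 200`, and `id IS NULL`. Choosing split points so that each range holds a similar row count is what balances the tasks; the number of clauses plays the role of `numPartitions`.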

