You need to pull 200 million rows from a source JDBC system into Databricks. The current PySpark job uses a single partition and exhausts the driver. How would you redesign the read for parallelism (choose a partitionColumn, set lowerBound/upperBound/numPartitions, handle nulls/outliers, tune fetch size) and ensure the ingest is idempotent before writing to Bronze?

🟡 Medium · Coding · Mid level
Times asked: 1
Last seen: May 2026
First seen: May 2026

💡 Model Answer

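To parallelize the JDBC read, pick a partition column that is indexed and roughly evenly distributed, such as an auto-incrementing primary key or a timestamp. Spark accepts numeric, date, and timestamp columns for partitionColumn. Configure the read with spark.read.format('jdbc').option('partitionColumn', 'id').option('lowerBound', 1).option('upperBound', 200000000).option('numPartitions', 200).option('fetchsize', 5000), which splits the read into 200 range queries of about one million rows each. Keep numPartitions within the number of concurrent connections the source database can tolerate. fetchsize sets how many rows the JDBC driver pulls per round trip; a few thousand is a reasonable starting point to tune from.

lowerBound and upperBound only determine the partition stride; they do not filter rows. Rows below lowerBound and rows with a NULL partition column land in the first partition, and rows above upperBound land in the last, so nulls and outliers cause skew rather than data loss. Filter nulls out only if those rows are genuinely not needed. Derive the bounds from the source's MIN and MAX of the column, or from a histogram if the keys are clustered, so each partition covers roughly the same row count.

For idempotency, maintain a separate metadata table that records the last processed primary key. On each run, read that key and push the filter id > last_key into the source query through a dbtable subquery, because raising lowerBound alone does not exclude earlier rows. Also capture the source's current maximum key as an upper cutoff so that a retry of the same run reads the same range. Write to Bronze with df.write.mode('append').saveAsTable('bronze_table'). checkpointLocation is a Structured Streaming option and has no effect on a batch write, and Delta tables do not enforce unique constraints, so neither protects against duplicates. To make retries safe, either MERGE into Bronze on the primary key or advance the stored key only after the write succeeds. Provided the key increases monotonically, re-runs then ingest only new rows and do not create duplicates.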
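The following is a minimal sketch of the approach above, not a definitive implementation. The helper names (jdbc_read_options, stride_predicates, ingest_increment), the jdbc_increment view, and the `AS src` alias are hypothetical. It assumes an integer key that only increases, an existing Delta table as the Bronze target, and credentials supplied separately. stride_predicates is a simplified approximation of how Spark turns the bounds into per-partition WHERE clauses, included only to show where nulls and outliers end up.

```python
from typing import Dict, List


def jdbc_read_options(url: str, table: str, key: str, last_key: int,
                      max_key: int, num_partitions: int = 200,
                      fetch_size: int = 5000) -> Dict[str, str]:
    """Build options for a partitioned, incremental Spark JDBC read."""
    # The bounds only shape the stride, so the incremental filter is pushed
    # to the source inside a dbtable subquery. Alias syntax varies by
    # database (assumption: the source accepts "AS src").
    subquery = (
        f"(SELECT * FROM {table} "
        f"WHERE {key} > {last_key} AND {key} <= {max_key}) AS src"
    )
    return {
        "url": url,
        "dbtable": subquery,
        "partitionColumn": key,
        "lowerBound": str(last_key + 1),
        "upperBound": str(max_key + 1),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


def stride_predicates(column: str, lower: int, upper: int,
                      num_partitions: int) -> List[str]:
    """Approximate the per-partition WHERE clauses derived from the bounds."""
    if num_partitions < 2:
        return []  # a single partition needs no range predicate
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        lo = f"{column} >= {current}" if i > 0 else None
        current += stride
        hi = f"{column} < {current}" if i < num_partitions - 1 else None
        if lo is None:
            # First partition is open below and also collects NULL keys.
            predicates.append(f"{hi} OR {column} IS NULL")
        elif hi is None:
            # Last partition is open above, so high outliers land here.
            predicates.append(lo)
        else:
            predicates.append(f"{lo} AND {hi}")
    return predicates


def ingest_increment(spark, opts: Dict[str, str], key: str,
                     target: str = "bronze_table") -> None:
    """Read one increment in parallel and merge it into the Bronze table."""
    df = spark.read.format("jdbc").options(**opts).load()
    df.createOrReplaceTempView("jdbc_increment")
    # Inserting only unmatched keys makes a retry of the same range a no-op.
    spark.sql(
        f"MERGE INTO {target} AS t USING jdbc_increment AS s "
        f"ON t.{key} = s.{key} WHEN NOT MATCHED THEN INSERT *"
    )
```

A run would read last_key from the metadata table, query the source for its current maximum key, call jdbc_read_options and ingest_increment, and store max_key as the new last processed key only after the merge succeeds. Printing stride_predicates for the chosen bounds is a quick way to check that no single partition is open to a large block of outliers.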

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
