You must implement near‑real‑time ingestion from S3 with schema drift (occasional new columns) while avoiding full listings of the bucket. Malformed records should land in a quarantine column for triage. Which design is most appropriate on Databricks?
💡 Model Answer
Use Auto Loader (the `cloudFiles` source) with Structured Streaming into Delta Lake. Run it in file notification mode (`cloudFiles.useNotifications = true`) so new objects are discovered through S3 event notifications rather than repeated directory listings, which keeps discovery cost flat as the bucket grows. Point `cloudFiles.schemaLocation` at a durable path so Auto Loader persists the inferred schema, and set `cloudFiles.schemaEvolutionMode` to `addNewColumns` so drifted columns are added automatically (the stream stops when a new column appears and picks it up on restart, which a Databricks job's retry policy handles). Records that don't match the tracked schema, such as type mismatches or unparseable fields, are captured in the `_rescued_data` column, which serves as the quarantine column for triage. On the write side, enable `mergeSchema` on the Delta sink so new columns propagate into the target table; if you also need business-rule validation beyond parsing, a `foreachBatch` function can route failing rows to a dedicated quarantine table.
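A minimal PySpark sketch of this design, assuming a Databricks notebook where `spark` is already in scope; the bucket paths, schema and checkpoint locations, and the `bronze.events` table name are hypothetical placeholders:

```python
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # File notification mode: discover new files via S3 event notifications
    # instead of listing the prefix on every micro-batch.
    .option("cloudFiles.useNotifications", "true")
    # Auto Loader persists the inferred schema here and appends new
    # columns to it as they appear in the source data.
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/events")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://bucket/path")
)

(
    stream.writeStream
    .option("checkpointLocation", "s3://bucket/_checkpoints/events")
    # Allow the Delta sink to accept the new columns Auto Loader adds.
    .option("mergeSchema", "true")
    .trigger(processingTime="30 seconds")
    .toTable("bronze.events")
)

# Triage: records that couldn't be fully parsed carry a non-null
# _rescued_data payload and can be pulled into a quarantine table.
quarantine = spark.read.table("bronze.events").where("_rescued_data IS NOT NULL")
```

Note that file notification mode requires one-time AWS permissions to set up the SQS/SNS resources; without it, Auto Loader falls back to directory listing, and it is the notification mode that makes file discovery cost independent of bucket size.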