You must implement near‑real‑time ingestion from S3 with schema drift (occasional new columns) while avoiding full listings of the bucket. Malformed records should land in a quarantine column for triage. Which design is most appropriate on Databricks?
💡 Model Answer
Use Auto Loader (the `cloudFiles` source) with Structured Streaming into Delta Lake. Run it in file notification mode (`cloudFiles.useNotifications = true`) so new objects are discovered through S3 event notifications rather than repeated directory listings, which keeps discovery cost flat as the bucket grows. Point `cloudFiles.schemaLocation` at a durable path so Auto Loader persists the inferred schema, and set `cloudFiles.schemaEvolutionMode` to `addNewColumns` so drifted columns are added automatically (the stream stops when a new column appears and picks it up on restart, which a Databricks job's retry policy handles). Records that don't match the tracked schema, such as type mismatches or unparseable fields, are captured in the `_rescued_data` column, which serves as the quarantine column for triage. On the write side, enable `mergeSchema` on the Delta sink so new columns propagate into the target table; if you also need business-rule validation beyond parsing, a `foreachBatch` function can route failing rows to a dedicated quarantine table.
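A minimal PySpark sketch of this design, assuming a Databricks notebook where `spark` is already in scope; the bucket paths, schema and checkpoint locations, and the `bronze.events` table name are hypothetical placeholders:

```python
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # File notification mode: discover new files via S3 event notifications
    # instead of listing the prefix on every micro-batch.
    .option("cloudFiles.useNotifications", "true")
    # Auto Loader persists the inferred schema here and appends new
    # columns to it as they appear in the source data.
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/events")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://bucket/path")
)

(
    stream.writeStream
    .option("checkpointLocation", "s3://bucket/_checkpoints/events")
    # Allow the Delta sink to accept the new columns Auto Loader adds.
    .option("mergeSchema", "true")
    .trigger(processingTime="30 seconds")
    .toTable("bronze.events")
)

# Triage: records that couldn't be fully parsed carry a non-null
# _rescued_data payload and can be pulled into a quarantine table.
quarantine = spark.read.table("bronze.events").where("_rescued_data IS NOT NULL")
```

Note that file notification mode requires one-time AWS permissions to set up the SQS/SNS resources; without it, Auto Loader falls back to directory listing, and it is the notification mode that makes file discovery cost independent of bucket size.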