
You must implement near‑real‑time ingestion from S3 with schema drift (occasional new columns) while avoiding full listings of the bucket. Badly formed records should land in a quarantine column for triage. Which design is most appropriate on Databricks?

🟡 Medium · Conceptual · Junior level
Times asked: 1
First seen: May 2026 · Last seen: May 2026

💡 Model Answer

Use Auto Loader (Structured Streaming with the `cloudFiles` source) writing to Delta Lake. Auto Loader discovers new files incrementally rather than re-listing the bucket, and with `cloudFiles.useNotifications` set to `true` it switches to S3 event notifications, avoiding listings entirely. For schema drift, supply a `cloudFiles.schemaLocation` so the inferred schema is tracked across runs, and set `cloudFiles.schemaEvolutionMode` to `addNewColumns` so occasional new columns are picked up automatically (the stream fails once on a schema change and picks up the new column on restart). Records that do not match the tracked schema are not dropped: their unparsed fields are captured in the rescued-data column (`_rescued_data` by default), which serves as the quarantine column for later triage. On the write side, enable `mergeSchema` on the Delta sink so evolved columns are accepted. This design avoids full bucket listings, handles schema drift, and keeps malformed data queryable alongside the good records.
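A minimal PySpark sketch of this pattern follows. The bucket paths, schema location, and table name are hypothetical placeholders, and the write section is shown commented out since it requires a live Databricks environment:

```python
# Sketch: Auto Loader ingestion with schema evolution and a rescued-data
# quarantine column. All S3 paths and table names are hypothetical.

AUTOLOADER_OPTIONS = {
    # Source file format for the incremental loader.
    "cloudFiles.format": "json",
    # Use S3 event notifications instead of directory listing.
    "cloudFiles.useNotifications": "true",
    # Persist the inferred schema so evolution is tracked across restarts.
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/events",
    # Automatically add newly appearing columns to the tracked schema.
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    # Mismatched/unparsed data lands here instead of failing the stream.
    "cloudFiles.rescuedDataColumn": "_rescued_data",
}


def build_stream(spark):
    """Return a streaming DataFrame that reads new files via Auto Loader."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in AUTOLOADER_OPTIONS.items():
        reader = reader.option(key, value)
    return reader.load("s3://my-bucket/events/")  # hypothetical source path


# Writing to a Delta table; mergeSchema lets the sink accept evolved columns.
# (build_stream(spark).writeStream
#     .format("delta")
#     .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
#     .option("mergeSchema", "true")
#     .toTable("bronze.events"))
```

Downstream, bad records can be triaged with a simple filter such as `WHERE _rescued_data IS NOT NULL`, or routed to a separate quarantine table in a `foreachBatch` step if physical separation is required.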

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
