Suppose multiple CSV files are coming daily into an S3 bucket, and sometimes the schema changes: new columns are added or data types change. How would you handle this situation in PySpark? Explain the steps and logic behind your approach.

🟡 Medium Conceptual Mid level
Times asked: 1
Last seen: Jun 2026
First seen: Jun 2026

💡 Model Answer

First, I would maintain a canonical schema in a schema registry (e.g., AWS Glue Schema Registry or a simple JSON file in S3). Each day, I read the incoming CSVs with spark.read.option('header', 'true').option('inferSchema', 'true').csv(path) and compare the inferred schema with the canonical one. If new columns appear, I add them to the canonical schema with a safe default type (usually a nullable String). If a column's data type has changed, I cast it back to the canonical type using withColumn and cast. For columns that are missing from the new files, I add them as typed nulls using withColumn.

After normalizing the schema, I append the data to a Delta Lake table with schema evolution enabled (the mergeSchema write option set to true), so newly added columns are merged into the table schema. mergeSchema does not reconcile incompatible type changes, which is why the casts happen before the write. Delta records every schema change in its transaction log, so future reads see the latest schema. I also set up a daily job that validates the incoming schema against the registry and alerts on any unexpected change. This approach keeps downstream consumers stable while allowing the source to evolve.
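The comparison and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (SchemaDiff, diff_schemas, normalize), the sample column names, and the choice to represent the canonical schema as a dict of column name to Spark type string are all assumptions made for the example. The pyspark import is placed inside normalize so the diff logic can be exercised without a Spark session.

```python
from typing import Dict, List, NamedTuple, Tuple


class SchemaDiff(NamedTuple):
    new_columns: List[str]                   # in the incoming file, not in the canonical schema
    missing_columns: List[str]               # in the canonical schema, not in the incoming file
    type_changes: Dict[str, Tuple[str, str]]  # column -> (incoming type, canonical type)


def diff_schemas(canonical: Dict[str, str], incoming: Dict[str, str]) -> SchemaDiff:
    """Compare two {column: type string} mappings (hypothetical helper)."""
    new_columns = [c for c in incoming if c not in canonical]
    missing_columns = [c for c in canonical if c not in incoming]
    type_changes = {
        c: (incoming[c], canonical[c])
        for c in incoming
        if c in canonical and incoming[c] != canonical[c]
    }
    return SchemaDiff(new_columns, missing_columns, type_changes)


def normalize(df, canonical: Dict[str, str], new_column_type: str = "string"):
    """Align a PySpark DataFrame to the canonical schema (hypothetical helper).

    Returns the normalized DataFrame, the updated canonical schema, and the diff.
    Assumes plain column names (no dots) and exact-case matching.
    """
    from pyspark.sql import functions as F  # lazy import: only needed with Spark

    incoming = dict(df.dtypes)  # df.dtypes is a list of (name, type string) pairs
    diff = diff_schemas(canonical, incoming)

    # New columns join the canonical schema with a default type.
    target = dict(canonical)
    for name in diff.new_columns:
        target[name] = new_column_type

    # Cast present columns to the target type; add missing ones as typed nulls.
    columns = [
        (F.col(name) if name in incoming else F.lit(None)).cast(dtype).alias(name)
        for name, dtype in target.items()
    ]
    return df.select(columns), target, diff


if __name__ == "__main__":
    # Sample schemas are made up for illustration.
    canonical = {"id": "int", "name": "string", "amount": "double"}
    incoming = {"id": "string", "name": "string", "coupon": "string"}
    print(diff_schemas(canonical, incoming))
```

With a Spark session and Delta Lake available, the normalized DataFrame could then be appended with df.write.format('delta').mode('append').option('mergeSchema', 'true').save(path), where path is whatever table location the pipeline uses. The returned diff is what the daily validation job would inspect before deciding whether to alert.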

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
