Suppose multiple CSV files are coming daily into an S3 bucket, and sometimes the schema changes: new columns are added or data types change. How would you handle this situation in PySpark? Explain the steps and logic behind your approach.

Question

Assisting AI · Accepted Answer

First, I would maintain a canonical schema in a schema registry (e.g., AWS Glue Schema Registry or a simple JSON file in S3). Each day, I read the incoming CSVs with `spark.read.option('header', 'true').option('inferSchema', 'true')` and compare the inferred schema with the canonical one, column by column. If new columns appear, I add them to the canonical schema with a default type (usually a nullable String, since that is the safest type for unvetted CSV data). If a data type changes, I cast the column back to the canonical type using `withColumn` and `cast`. For columns that are missing from the new files, I add them as nulls of the canonical type using `withColumn`. After normalizing the schema, I write the data to a Delta Lake table with schema evolution enabled (`mergeSchema=true`). `mergeSchema` lets the write add new columns to the table, but it does not reconcile incompatible type changes, which is why the casting step comes first. Delta records every schema change in its transaction log, so future reads see the latest schema. I also set up a daily job that validates the incoming schema against the registry and alerts if an unexpected change occurs. This approach keeps downstream consumers stable while allowing the source to evolve.
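The compare-and-normalize step described above can be sketched without a running cluster. This is a minimal illustration, not a definitive implementation: the column names, the `CANONICAL` mapping, and the helpers `diff_schema` and `build_select_exprs` are all hypothetical, and the incoming types are written in the `{column: type}` shape that `dict(df.dtypes)` would give for an inferred CSV schema.

```python
# Canonical schema as {column: Spark SQL type name}.
# These columns are hypothetical; in practice this would be loaded from the registry.
CANONICAL = {"order_id": "bigint", "amount": "double", "region": "string"}


def diff_schema(canonical, incoming):
    """Compare two {column: type} mappings, e.g. canonical vs dict(df.dtypes)."""
    added = sorted(c for c in incoming if c not in canonical)
    missing = sorted(c for c in canonical if c not in incoming)
    changed = sorted(
        c for c in canonical if c in incoming and incoming[c] != canonical[c]
    )
    return {"added": added, "missing": missing, "changed": changed}


def build_select_exprs(canonical, incoming, new_column_type="string"):
    """Build Spark SQL expressions that project a DataFrame onto the canonical schema."""
    exprs = []
    for col, dtype in canonical.items():
        if col in incoming:
            # Present in the file: cast to the canonical type.
            exprs.append(f"CAST(`{col}` AS {dtype}) AS `{col}`")
        else:
            # Missing from the file: fill with a typed NULL.
            exprs.append(f"CAST(NULL AS {dtype}) AS `{col}`")
    # Columns the registry has not seen yet are kept with a default type.
    for col in sorted(c for c in incoming if c not in canonical):
        exprs.append(f"CAST(`{col}` AS {new_column_type}) AS `{col}`")
    return exprs


# One day's inferred schema, in the shape dict(df.dtypes) would return.
incoming = {"order_id": "int", "amount": "string", "channel": "string"}

report = diff_schema(CANONICAL, incoming)
exprs = build_select_exprs(CANONICAL, incoming)
print(report)
for expr in exprs:
    print(expr)

# With a live SparkSession the projection would be applied roughly as
# (df and target_path are placeholders):
#   normalized = df.selectExpr(*exprs)
#   normalized.write.format("delta").mode("append") \
#       .option("mergeSchema", "true").save(target_path)
```

The `report` dictionary is what the daily validation job could alert on, while `exprs` drives the actual normalization. One caveat on the casts: depending on the `spark.sql.ansi.enabled` setting, a value that cannot be converted either becomes NULL or raises an error, so it is worth counting nulls before and after the cast when a type change is detected.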
