How would you design an ETL pipeline that uses AWS DMS CDC to capture change data from an Oracle database, stores the raw change data as JSON in a raw zone, and then processes it into a target system?


Assisting AI · Accepted Answer

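First, set up an AWS Database Migration Service (DMS) task in CDC mode (or full load plus CDC) that connects to the Oracle source. DMS captures inserts, updates, and deletes by reading Oracle's redo logs through LogMiner or Binary Reader, so the source must run in ARCHIVELOG mode with supplemental logging enabled on the replicated tables.

Next, land the changes in the raw zone, an Amazon S3 bucket. Note that the DMS S3 target endpoint writes CSV or Parquet, not JSON. To store raw JSON, point the task at an Amazon Kinesis Data Streams or Kafka target, where DMS emits each change as a JSON message carrying the operation type, table name, and column values (including the primary key), and deliver the stream to S3 with Amazon Data Firehose. Alternatively, write CSV or Parquet to S3 and convert it to JSON in the first processing step.

Then create an AWS Glue job or Lambda function that reads the S3 objects, parses the JSON, and writes the rows into a staging table in Amazon Redshift or a curated layer of the data lake. Register the tables in the Glue Data Catalog (or a schema registry) so that schema drift is detected rather than silently loaded. Finally, build a downstream transformation layer (e.g., Redshift SQL or Spark) that applies the changes in order, cleans and aggregates the data, and loads it into the target schema.

Throughout the pipeline, monitor lag and error rates with DMS task logs and the CloudWatch latency metrics CDCLatencySource and CDCLatencyTarget. This design uses DMS for log-based CDC, S3 for durable raw storage, and Glue or Lambda for flexible processing, so that change data is captured, stored, and transformed reliably.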
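To make the processing step concrete, here is a minimal Python sketch of how a Glue or Lambda job might fold raw JSON change events into the current state of a table. It rests on several assumptions: the record layout (a `metadata` object with `operation` and `timestamp`, plus a `data` object with the column values) is modeled on the JSON messages DMS emits to streaming targets and may differ from what your raw zone holds; timestamps are assumed to be uniformly formatted ISO-8601 strings; and the names `parse_changes`, `apply_changes`, `ORDER_ID`, and `STATUS` are hypothetical. The input is an in-memory list so the logic can run without AWS access.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional

# Assumed record layout (illustrative only, verify against your own raw zone):
# {"metadata": {"operation": "insert" | "update" | "delete" | "load",
#               "table-name": "...", "timestamp": "<ISO-8601 string>"},
#  "data": {"<column>": <value>, ...}}


def parse_changes(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse newline-delimited JSON change events, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def apply_changes(
    changes: Iterable[Dict[str, Any]],
    key_column: str,
    state: Optional[Dict[Any, Dict[str, Any]]] = None,
) -> Dict[Any, Dict[str, Any]]:
    """Replay change events in timestamp order onto a dict keyed by primary key."""
    state = {} if state is None else state
    # sorted() is stable, so events with equal timestamps keep their file order.
    ordered = sorted(changes, key=lambda c: c["metadata"]["timestamp"])
    for change in ordered:
        operation = change["metadata"]["operation"]
        row = change["data"]
        key = row[key_column]
        if operation == "delete":
            # Deleting a key that was never seen is treated as a no-op.
            state.pop(key, None)
        elif operation in ("insert", "update", "load"):
            # Merge so that a partial update keeps previously known columns.
            state[key] = {**state.get(key, {}), **row}
        else:
            raise ValueError(f"Unknown operation: {operation!r}")
    return state


# Hypothetical sample events, deliberately out of order.
raw_lines = [
    '{"metadata": {"operation": "insert", "table-name": "ORDERS", '
    '"timestamp": "2024-01-01T10:00:00Z"}, "data": {"ORDER_ID": 1, "STATUS": "NEW"}}',
    '{"metadata": {"operation": "update", "table-name": "ORDERS", '
    '"timestamp": "2024-01-01T10:05:00Z"}, "data": {"ORDER_ID": 1, "STATUS": "SHIPPED"}}',
    '{"metadata": {"operation": "insert", "table-name": "ORDERS", '
    '"timestamp": "2024-01-01T10:01:00Z"}, "data": {"ORDER_ID": 2, "STATUS": "NEW"}}',
    '{"metadata": {"operation": "delete", "table-name": "ORDERS", '
    '"timestamp": "2024-01-01T10:06:00Z"}, "data": {"ORDER_ID": 2}}',
]

current = apply_changes(parse_changes(raw_lines), key_column="ORDER_ID")
print(current)
```

In a real job the same replay would more likely be expressed as a MERGE from the staging table into the target, or as a Spark window that keeps the latest event per key; the dictionary here only illustrates the ordering and delete handling.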
