Suppose 100 GB of files arrive in an S3 bucket every day and need to be processed and loaded into a data warehouse. Which AWS services would you use and why? Consider factors such as flexibility, cost, scalability, and error recovery.

Question

Assisting AI · Accepted Answer

I would use S3 as the landing zone and run an AWS Glue crawler to catalog the incoming data. For processing, Glue ETL jobs are cost-effective at this volume and scale without any servers to manage; if the data needs heavy transformations or machine learning, I'd spin up an EMR cluster with Spark instead. After transformation, I'd load the data into Amazon Redshift with the COPY command, which ingests files from S3 in parallel across the cluster.

Flexibility: Glue's serverless model removes cluster management overhead, while EMR gives more control over the Spark environment when a job needs it. Cost: Glue bills per DPU-hour, metered per second with a one-minute minimum on current Glue versions, so you pay only while a job runs. EMR on EC2 is also billed per second with a one-minute minimum, but you pay for every instance for as long as the cluster is up, so I'd choose Glue for lighter workloads and EMR for heavier, long-running ones. Scalability: Glue auto-scales workers within a job, and EMR clusters can be resized or scaled automatically. Error recovery: Glue job bookmarks track which data has already been processed, so a rerun after a failure does not reprocess it, and Step Functions orchestrates the pipeline, retrying failed steps and raising alerts (for example, through CloudWatch alarms that notify an SNS topic).

This architecture balances flexibility, cost, scalability, and robust error handling.
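As a rough illustration of the retry-and-alert orchestration described above, here is a minimal sketch that builds a Step Functions definition (Amazon States Language) as a plain Python dict, plus a helper that formats a Redshift COPY statement. The state names, retry settings, and the assumption that the Glue job writes Parquet are all hypothetical choices for the example, not part of the original answer. The Redshift load step is left out of the state machine to keep the sketch short.

```python
import json


def build_copy_sql(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Format a Redshift COPY statement that loads Parquet files from S3.

    Assumes the transformed output is Parquet; other formats need other options.
    """
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS PARQUET;"
    )


def build_state_machine(glue_job_name: str, alert_topic_arn: str) -> dict:
    """Return a Step Functions definition: run a Glue job, retry, alert on failure."""
    return {
        "Comment": "Daily S3 -> Glue transform (illustrative sketch)",
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the Glue job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                # Hypothetical retry policy: 2 retries, 60 s then 120 s apart.
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                # Once retries are exhausted, route to the alert step.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": alert_topic_arn,
                    "Message": "Daily load failed after retries",
                },
                "Next": "Failed",
            },
            "Failed": {"Type": "Fail"},
        },
    }
```

Calling `json.dumps(build_state_machine(job_name, topic_arn))` yields a JSON document that could be supplied as the state machine definition. The COPY statement from `build_copy_sql` would then be run against Redshift by whatever load step you add.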
