Which data model would you use for large datasets stored in Amazon S3?

Question

Assisting AI · Accepted Answer

For large datasets in Amazon S3, a data lake architecture is typically used. Strictly speaking, the storage format and the data model are separate choices: the files are usually stored in a columnar format such as Parquet or ORC, which compresses well and speeds up analytical queries because engines read only the columns they need. On top of that, you would design a logical schema, typically a star or snowflake schema for analytics workloads, storing the fact and dimension tables as files in S3 and registering their metadata (schemas and locations) in a catalog such as the AWS Glue Data Catalog. Query engines such as Amazon Athena or Redshift Spectrum then read the data directly in S3 without first loading it into a separate database. If you need transactional (OLTP) capabilities, a relational model on Amazon RDS or Aurora is the better fit, although that data then lives in the database rather than in S3; for big-data analytics the columnar data lake model is preferred. This approach separates storage from compute, tolerates schema evolution (for example, adding columns), and supports scalable, cost-effective querying.
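As a rough illustration of how the pieces fit together, here is a minimal Python sketch that lays out a fact table under an S3 prefix and generates the kind of `CREATE EXTERNAL TABLE` statement Athena accepts for Parquet data. The bucket name, table name, and columns are hypothetical, and the Hive-style `year=/month=` partitioning is an assumption of this sketch rather than something the answer above prescribes. It only builds strings; it does not call AWS.

```python
from datetime import date


def partition_prefix(base: str, table: str, day: date) -> str:
    """Return a Hive-style partition prefix, e.g. .../year=2024/month=03/."""
    return f"{base.rstrip('/')}/{table}/year={day.year}/month={day.month:02d}/"


def athena_ddl(table: str, columns: dict[str, str],
               partitions: dict[str, str], location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for a Parquet-backed table."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())
    parts = ", ".join(f"{name} {typ}" for name, typ in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical bucket and fact table, for illustration only.
BASE = "s3://example-data-lake/warehouse"

# Where the Parquet files for March 2024 would be written.
prefix = partition_prefix(BASE, "sales_fact", date(2024, 3, 15))

# Partition columns are declared as strings so that a value such as
# "03" in the S3 path matches the declared type exactly.
ddl = athena_ddl(
    "sales_fact",
    {"order_id": "string", "customer_id": "string", "amount": "double"},
    {"year": "string", "month": "string"},
    f"{BASE}/sales_fact/",
)

print(prefix)
print(ddl)
```

Running the generated statement in Athena would register the table in the Glue Data Catalog; partitions would still need to be loaded afterwards, for example with `MSCK REPAIR TABLE sales_fact`. Dimension tables would follow the same pattern, usually without partitions because they tend to be small.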
