Suppose 500 million transactions occur per day. Design a pipeline to consume these transactions, perform processing, and every hour submit analytics extracted from the data. How would you architect this system?

Question

Assisting AI · Accepted Answer

500 M transactions per day works out to about 5.8 k/s on average, and peak traffic will be higher than that, so the design starts with a dedicated ingestion layer that can absorb bursts. Use a high-throughput, durable stream such as Kafka or AWS Kinesis to buffer incoming events. Partition the topic by a hash of the transaction key to get parallelism and avoid hot spots.

A stream processor (Apache Flink, Spark Structured Streaming, or Kinesis Data Analytics) consumes the stream, applies transformations, and aggregates the data in hourly windows. The processor writes intermediate results to a fast, scalable store such as Amazon DynamoDB or a time-series database (InfluxDB) for low-latency queries. For batch analytics, materialized views or a data lake (S3) can be populated nightly, and a compute engine (EMR, Databricks) can run the more complex queries. Use a scheduler (Airflow) to trigger hourly analytics jobs that read from the stream-processed tables, compute metrics, and publish dashboards via Grafana or QuickSight.

Key decisions:
1) Partitioning strategy to balance load across partitions.
2) Exactly-once semantics (or at-least-once delivery with idempotent writes) to avoid double counting.
3) Retention policy to keep raw logs for audit.
4) Auto-scaling of consumer groups to match peak throughput.
5) Monitoring with Prometheus and alerting on consumer lag.

Complexity is O(1) per transaction for ingestion and O(1) per record for processing, while the hourly aggregation is O(n) in the number of records in that hour (about 20.8 M on average). This design targets low latency, high throughput, and fault tolerance for the required analytics pipeline.
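The core mechanics of the answer (the average-rate arithmetic, hash partitioning, and hourly windows with duplicate suppression) can be sketched in plain Python. This is a minimal in-memory illustration, not how Kafka or Flink implement these steps: the names `average_rate`, `partition_for`, `window_start`, and `HourlyAggregator` are hypothetical, SHA-256 stands in for whatever hash the real partitioner uses, and the count and total-amount metrics are assumed for the example. A production stream processor would also bound the deduplication state (for example with a TTL) rather than keep every ID in memory.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field

SECONDS_PER_DAY = 86_400
WINDOW_SECONDS = 3_600  # one-hour tumbling windows


def average_rate(transactions_per_day: int) -> float:
    """Average events per second implied by a daily volume."""
    return transactions_per_day / SECONDS_PER_DAY


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition.

    SHA-256 is an illustrative choice, not the hash a real broker client uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def window_start(event_ts: int) -> int:
    """Start (epoch seconds) of the hourly window that contains event_ts."""
    return event_ts - (event_ts % WINDOW_SECONDS)


@dataclass
class HourlyAggregator:
    """Tracks count and total amount per hourly window, skipping duplicate IDs."""

    seen_ids: set = field(default_factory=set)
    counts: dict = field(default_factory=lambda: defaultdict(int))
    totals: dict = field(default_factory=lambda: defaultdict(float))

    def add(self, txn_id: str, event_ts: int, amount: float) -> bool:
        """Apply one transaction; return False if it was a duplicate."""
        if txn_id in self.seen_ids:
            return False  # redelivered event: ignore so it is not double counted
        self.seen_ids.add(txn_id)
        start = window_start(event_ts)
        self.counts[start] += 1
        self.totals[start] += amount
        return True


if __name__ == "__main__":
    print(f"average rate: {average_rate(500_000_000):.0f} events/s")
    agg = HourlyAggregator()
    agg.add("txn-1", 3_600, 10.0)
    agg.add("txn-1", 3_600, 10.0)  # duplicate delivery, ignored
    agg.add("txn-2", 7_199, 5.0)   # last second of the same hour
    agg.add("txn-3", 7_200, 1.0)   # first second of the next hour
    print(dict(agg.counts), dict(agg.totals))
```

Deduplicating on the transaction ID before updating the window is what makes redelivery harmless, which is the practical point behind the exactly-once decision above. Each `add` call does constant work, matching the O(1) per-record claim.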
