Suppose a data pipeline has retrieved 8 million records and then stopped. How would you recover to retrieve the remaining 2 million records?

Question

Assisting AI · Accepted Answer

First, I would find the failure point: review logs, error messages, and system metrics to determine whether the pipeline stopped because of a timeout, resource exhaustion, or data corruption. If the pipeline has a checkpointing mechanism (e.g., Spark checkpoints or committed Kafka offsets), I would resume from the last successful checkpoint and make the writes idempotent, so that any records replayed between the checkpoint and the failure do not create duplicates. If no checkpoint exists, I would use the source's change data capture (CDC) stream or a timestamp column to reprocess only the missing 2 million records. I would wrap the resumed run in a retry strategy with exponential backoff, with alerting so that persistent failures surface instead of retrying silently.

To prevent future stalls, I would monitor throughput and latency, alert on threshold breaches, and consider scaling resources or partitioning the workload further. Finally, I would validate data quality, for example by reconciling record counts against the source, to confirm that all 10 million records arrived and that the pipeline is fully operational.
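The resume logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `fetch_batch`, `sink`, and `checkpoint` are hypothetical stand-ins (a callable that returns records with an `id` greater than a given value, a dict keyed by record id, and a dict holding the last committed id). A real pipeline would use its own source client, target store, and durable checkpoint storage.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # persistent failure: re-raise so alerting can pick it up
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def resume_pipeline(fetch_batch, sink, checkpoint, batch_size=1000, sleep=time.sleep):
    """Resume from checkpoint["last_id"] and load the remaining records.

    fetch_batch(after_id, limit) is assumed to return records ordered by id,
    each a dict with an "id" key, and an empty list once the source is drained.
    """
    while True:
        last_id = checkpoint.get("last_id", 0)
        batch = retry_with_backoff(
            lambda: fetch_batch(last_id, batch_size), sleep=sleep
        )
        if not batch:
            return
        for record in batch:
            # Keyed upsert: writing the same record twice leaves one copy,
            # so replaying records after a stale checkpoint is harmless.
            sink[record["id"]] = record
        # Advance the checkpoint only after the batch has been written.
        checkpoint["last_id"] = batch[-1]["id"]
```

Because the checkpoint advances only after a batch is written, a crash can leave it slightly behind the data already in the target. The keyed upsert is what makes re-reading that overlap safe.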
