Home › Interview Questions › We have a raw folder in an S3 bucket and want to p…

We have a raw folder in an S3 bucket and want to perform data cleansing. How would you identify duplicates in a particular file?

🟔 Medium Conceptual Junior level
1Times asked
Jul 2026Last seen
Jul 2026First seen

šŸ’” Model Answer

To identify duplicates in a raw file stored in S3, I would first load the file into a processing framework such as AWS Glue or Spark. The file is read as a DataFrame, and I would identify a set of columns that uniquely define a record (e.g., a composite key). Using a window function or a groupBy on those key columns, I can count occurrences. Rows with a count greater than one are duplicates. If the file is large, I would use partitioning on the key columns to parallelize the scan. After identifying duplicates, I can either drop them, keep the first occurrence, or write them to a separate S3 location for audit. Complexity is O(n) for the scan, with additional overhead for shuffling if using Spark. This approach scales to terabytes of data and can be scheduled via Glue jobs or Lambda triggers on S3 events.

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.

šŸŽ¤ Get questions like this answered in real-time

Assisting AI listens to your interview, captures questions live, and gives you instant AI-powered answers — invisible to screen sharing.

Get Assisting AI — Starts at ₹500