Interview Question: Data Processing / JSON / Deduplication

Given a file containing JSON records, de‑duplicate the contents based on event_id. If multiple records share the same event_id, keep only one.

🟡 Medium · Coding · Junior level · Asked 1 time · First seen: Mar 2026 · Last seen: Mar 2026

💡 Model Answer

To de‑duplicate the file, read each line, parse the JSON, and use a dictionary keyed by event_id to keep the first occurrence. If a duplicate event_id is encountered, skip it or replace it based on the desired policy. This approach runs in O(n) time and O(k) space, where n is the number of lines and k is the number of unique event_ids. In Python:

```python
import json

# Keep only the first record seen for each event_id.
unique = {}
with open('events.txt', 'r') as f:
    for line in f:
        record = json.loads(line)
        eid = record['event_id']
        if eid not in unique:
            unique[eid] = record

# Write back or process unique.values()
```

If the file is large, stream the output directly to a new file instead of holding all records in memory: only the set of seen event_ids needs to stay resident. For very large or distributed datasets, consider a database or a streaming framework such as Apache Beam, where you can key each record by event_id and apply a deduplication transform (for example, Beam's Distinct works on whole elements, so you would first extract the key, or use a per-key deduplication transform). The key idea is the same throughout: maintain a hash set or map of seen event_ids and filter duplicates in a single pass.
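The streaming variant described above can be sketched as follows. This is a minimal example, assuming newline-delimited JSON and hypothetical file paths (`in_path`, `out_path`); only the seen-id set is kept in memory, so memory use is proportional to the number of unique event_ids rather than the file size:

```python
import json

def dedupe_stream(in_path, out_path):
    """Copy records from in_path to out_path, keeping only the
    first record seen for each event_id. Returns the count of
    unique event_ids written."""
    seen = set()
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            eid = json.loads(line)['event_id']
            if eid not in seen:
                seen.add(eid)
                dst.write(line + '\n')
    return len(seen)
```

Because duplicates are dropped as they are read, the output file can be written incrementally and the input never needs to fit in memory.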

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
