Interview Question: Data Processing / JSON / Deduplication

Given a file containing JSON records, de‑duplicate the contents based on event_id. If multiple records share the same event_id, keep only one.

🟡 Medium · Coding · Junior level · Asked 1 time · First seen: Mar 2026 · Last seen: Mar 2026

💡 Model Answer

To de‑duplicate the file, read each line, parse the JSON, and use a dictionary keyed by event_id to keep the first occurrence. If a duplicate event_id is encountered, skip it or replace it based on the desired policy. This approach runs in O(n) time and O(k) space, where n is the number of lines and k is the number of unique event_ids. In Python:

```python
import json

# Keep only the first record seen for each event_id.
unique = {}
with open('events.txt', 'r') as f:
    for line in f:
        record = json.loads(line)
        eid = record['event_id']
        if eid not in unique:
            unique[eid] = record

# Write back or process unique.values()
```

If the file is large, stream the output directly to a new file instead of holding all records in memory: only the set of seen event_ids needs to stay resident. For very large or distributed datasets, consider a database or a streaming framework such as Apache Beam, where you can key each record by event_id and apply a deduplication transform (for example, Beam's Distinct works on whole elements, so you would first extract the key, or use a per-key deduplication transform). The key idea is the same throughout: maintain a hash set or map of seen event_ids and filter duplicates in a single pass.
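The streaming variant described above can be sketched as follows. This is a minimal example, assuming newline-delimited JSON and hypothetical file paths (`in_path`, `out_path`); only the seen-id set is kept in memory, so memory use is proportional to the number of unique event_ids rather than the file size:

```python
import json

def dedupe_stream(in_path, out_path):
    """Copy records from in_path to out_path, keeping only the
    first record seen for each event_id. Returns the count of
    unique event_ids written."""
    seen = set()
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            eid = json.loads(line)['event_id']
            if eid not in seen:
                seen.add(eid)
                dst.write(line + '\n')
    return len(seen)
```

Because duplicates are dropped as they are read, the output file can be written incrementally and the input never needs to fit in memory.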

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
