HomeInterview QuestionsWrite a Python function using Apache Spark Structu…

Write a Python function using Apache Spark Structured Streaming that reads messages from a Kafka topic, counts occurrences of each 'event_type' in real time, and writes the result to a BigQuery table. Handle connection and streaming errors gracefully.

🔴 Hard Coding Mid level
1Times asked
Jun 2026Last seen
Jun 2026First seen

💡 Model Answer

To solve this, create a SparkSession with the BigQuery connector and Kafka packages. Read the Kafka stream with spark.readStream.format("kafka"), specifying the topic, bootstrap servers, and starting offsets. The Kafka value is a binary column; cast it to string and parse the JSON payload to extract the event_type field. Use groupBy("event_type").count() to aggregate per micro‑batch. Write the aggregated DataFrame to BigQuery using the writeStream API with the BigQuery connector, specifying a checkpoint location for fault tolerance. Wrap the streaming query in a try/except block and use foreachBatch to handle errors per batch: inside the batch function, attempt to write to BigQuery and catch exceptions, logging them and optionally retrying with exponential back‑off. Use spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints") to persist state. Complexity is O(n) per micro‑batch, where n is the number of events in that batch. The solution is robust because checkpointing and error handling ensure the stream can recover from transient network or BigQuery outages without data loss.

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.

🎤 Get questions like this answered in real-time

Assisting AI listens to your interview, captures questions live, and gives you instant AI-powered answers — invisible to screen sharing.

Get Assisting AI — Starts at ₹500