Given employee ID, name, department, salary, and location, process the data in Spark and handle invalid records.

🟡 Medium · Coding · Junior level
Times asked: 1
Last seen: Jul 2026
First seen: Jul 2026

💡 Model Answer

In Spark you can read the data with a predefined schema to catch type mismatches. For example:

python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("location", StringType(), True)
])

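# Create or reuse a SparkSession before reading.
spark = SparkSession.builder.appName("employee-cleaning").getOrCreate()

# Read with the explicit schema so values that do not match the declared types surface as nulls.
df = spark.read.csv("employees.csv", header=True, schema=schema)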
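
The read above relies on Spark's default parse mode. As a minimal sketch, assuming the same five columns, the variant below sets the mode explicitly and adds one extra string column, `_corrupt_record`, so the raw text of any line that fails to parse is kept rather than only showing up as nulls. The helper name `read_employees`, the DDL schema string, and the file path are illustrative assumptions, not part of the original answer; `mode` and `columnNameOfCorruptRecord` are options of Spark's CSV reader.

```python
# DDL form of the schema, plus one extra string column that receives the
# raw text of any line Spark could not parse (null for well-formed lines).
CORRUPT_COL = "_corrupt_record"
EMPLOYEE_DDL = (
    "emp_id INT, name STRING, department STRING, "
    "salary DOUBLE, location STRING, "
    f"{CORRUPT_COL} STRING"
)


def read_employees(spark, path):
    """Read the employee CSV, keeping malformed lines instead of losing them.

    `spark` is an existing SparkSession; `path` is a hypothetical file location.
    """
    df = spark.read.csv(
        path,
        header=True,
        schema=EMPLOYEE_DDL,
        mode="PERMISSIVE",  # the default: keep bad rows, null out bad fields
        columnNameOfCorruptRecord=CORRUPT_COL,
    )
    # Cache before querying the corrupt-record column: Spark restricts queries
    # on raw CSV files that reference only that internal column.
    return df.cache()
```

With this in place, `read_employees(spark, "employees.csv")` should return a DataFrame in which rows with a non-null `_corrupt_record` are the ones Spark could not fully parse.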

With Spark's default PERMISSIVE parse mode, a field that is missing or cannot be parsed as its declared type (e.g., an empty emp_id or a non-numeric salary) is loaded as null. You can filter such rows out:

python
clean_df = df.filter(df.emp_id.isNotNull() & df.salary.isNotNull())

If you need to log or store the bad rows, use df.filter(df.emp_id.isNull() | df.salary.isNull()).write....
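
The filter-and-store step can be sketched as one helper that returns both halves, so the bad rows stay available for logging. This is a minimal sketch, assuming `emp_id` and `salary` are the only required columns; the names `any_null_predicate` and `split_valid_invalid` are hypothetical, and the write destination is left to the caller because the original does not specify one. It uses the SQL-string form of `DataFrame.filter`, so no column imports are needed.

```python
REQUIRED_COLUMNS = ("emp_id", "salary")  # assumption: only these must be non-null


def any_null_predicate(columns):
    """Build a SQL predicate that is true when at least one column is null."""
    return " OR ".join(f"{c} IS NULL" for c in columns)


def split_valid_invalid(df, required=REQUIRED_COLUMNS):
    """Return (valid_df, invalid_df) based on nulls in the required columns."""
    bad = any_null_predicate(required)
    invalid_df = df.filter(bad)
    # IS NULL never evaluates to NULL itself, so negating the predicate is safe.
    valid_df = df.filter(f"NOT ({bad})")
    return valid_df, invalid_df
```

A caller might then run `clean_df, bad_df = split_valid_invalid(df)` and write each DataFrame to whichever target suits the pipeline.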

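After cleaning, you can apply further transformations or write the cleaned data to a target system. Reading and filtering take a single pass over the data, so the work is O(n) in the number of rows, and because a filter needs no shuffle, Spark runs it in parallel across partitions.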

This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.
