Given employee ID, name, department, salary, and location, process the data in Spark and handle invalid records.
💡 Model Answer
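In Spark, read the data with an explicit schema so that type mismatches surface at load time instead of being hidden by schema inference. For example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Create (or reuse) the session that the read below depends on.
spark = SparkSession.builder.getOrCreate()

# Explicit schema: every column is nullable so that bad values can load as null.
schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("location", StringType(), True)
])

df = spark.read.csv("employees.csv", header=True, schema=schema)
```

With the default PERMISSIVE read mode, a value that is missing or cannot be parsed as its declared type (for example, an empty emp_id or a non-numeric salary) is loaded as null. You can filter those rows out:

```python
# Keep only rows where both required fields loaded successfully.
clean_df = df.filter(df.emp_id.isNotNull() & df.salary.isNotNull())
```

If you need to log or store the bad rows, use `df.filter(df.emp_id.isNull() | df.salary.isNull()).write...`.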
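After cleaning, you can apply further transformations or write the cleaned data to a target system. The null filter is a single pass over the data, so it is O(n) in the number of rows, and because it is a narrow transformation Spark runs it in parallel across partitions without a shuffle.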
This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.