Given employee ID, name, department, salary, and location, process the data in Spark and handle invalid records.

Question

Assisting AI · Accepted Answer

In Spark you can read the data with a predefined schema to catch type mismatches. For example:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("location", StringType(), True)
])

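# `spark` is predefined only in shells and notebooks; create or reuse it elsewhere.
spark = SparkSession.builder.getOrCreate()

# An explicit schema skips inference, so values that do not match the declared type stand out.
df = spark.read.csv("employees.csv", header=True, schema=schema)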
```
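In the default `PERMISSIVE` read mode, a field that cannot be parsed as its declared type (e.g., a non-numeric salary) is loaded as null, and so is a missing value such as an empty emp_id. You can filter those rows out: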
```python
clean_df = df.filter(df.emp_id.isNotNull() & df.salary.isNotNull())
```
If you need to log or store the bad rows, use `df.filter(df.emp_id.isNull() | df.salary.isNull()).write...`.
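After cleaning, you can perform transformations or write the cleaned data to a target system. Each filter is a narrow transformation, so the approach is a single O(n) pass over the rows with no shuffle, distributed across partitions. If you materialize both the clean and the bad rows, cache `df` first so the file is not parsed twice.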