What is Spark architecture and how does it work?
💡 Model Answer
Apache Spark is a distributed data processing engine with a master-worker architecture. The driver program runs the user's code, builds the execution plan, and requests resources from a cluster manager (Spark standalone, YARN, Kubernetes, or Mesos, which is deprecated in recent releases). The cluster manager allocates those resources and launches executors on worker nodes. Each executor is a JVM process that runs multiple tasks and holds cached data; the driver schedules tasks onto the executors directly.
Spark's core abstraction is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of objects. Transformations (map, filter, join) are lazy: they only record lineage, which forms a Directed Acyclic Graph (DAG) of dependencies. An action (count, collect, saveAsTextFile) triggers execution, at which point the DAG scheduler splits the graph into stages at shuffle boundaries and runs each stage as a set of tasks, one per partition. Between stages, data is shuffled: written to local disk and fetched over the network.
For DataFrames and Datasets, the Catalyst optimizer applies logical and physical plan optimizations to the query before it runs, and the Tungsten execution engine improves memory and CPU efficiency. Spark can cache RDDs in memory, on disk, or both for reuse. Together this gives in-memory processing, fault tolerance through lineage (lost partitions are recomputed from their parent data), and integration with Hadoop, Hive, and other storage systems.
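The lazy-transformation and lineage ideas above can be sketched with a toy model in plain Python. This is not Spark's API or implementation: `ToyRDD` and its methods are hypothetical names invented for illustration, everything runs in one process, and there is no shuffle or scheduling. It only shows that transformations record lineage without doing work, that an action triggers computation per partition, and that a single partition can be rebuilt by replaying its lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, partitions=None, parent=None, fn=None, name="source"):
        self._partitions = partitions  # only set on the source dataset
        self._parent = parent          # lineage link to the parent dataset
        self._fn = fn                  # per-partition function applied to the parent
        self.name = name

    # Transformations: return a new dataset that records lineage; no work is done.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)], name="filter")

    def lineage(self):
        """Return the chain of operations from the source to this dataset."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node._parent
        return list(reversed(chain))

    def num_partitions(self):
        node = self
        while node._parent is not None:
            node = node._parent
        return len(node._partitions)

    def compute_partition(self, index):
        """Build one partition by replaying lineage from the source data."""
        if self._parent is None:
            return list(self._partitions[index])
        return self._fn(self._parent.compute_partition(index))

    # Actions: trigger computation of every partition.
    def collect(self):
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute_partition(i))
        return out

    def count(self):
        return sum(len(self.compute_partition(i)) for i in range(self.num_partitions()))


calls = []  # records every element the map function actually processes


def square(x):
    calls.append(x)
    return x * x


source = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens = source.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)        # still 0: nothing has executed yet
result = evens.collect()                # the action runs the whole chain
lineage = evens.lineage()
rebuilt = evens.compute_partition(1)    # "recover" one partition from lineage alone
```

In real Spark the same pattern appears as `rdd.map(...).filter(...)` followed by an action such as `collect()` or `count()`, with partitions computed as tasks on executors rather than in a local loop.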
This answer was generated by AI for study purposes. Use it as a starting point — personalize it with your own experience.