Which AWS services are used to run Spark transformation jobs, and what parameters are required to trigger a Spark transformation?
💡 Model Answer
AWS Glue and Amazon EMR are the two primary services that run Apache Spark workloads in the AWS ecosystem. Glue is a fully managed ETL service that uses a serverless Spark runtime; you submit a Glue job that points to a Python or Scala script, and Glue handles cluster provisioning, scaling, and job scheduling. EMR, on the other hand, gives you a managed Hadoop ecosystem where you can launch a Spark cluster, install custom libraries, and run Spark jobs in a more traditional cluster‑centric way.
To trigger a Spark transformation in either service you need to provide several key parameters:
- Job name – a unique identifier for the job.
- Script location – the S3 path to the Spark script (Python/Scala).
- IAM role – a role that grants Glue/EMR permissions to read/write S3, CloudWatch, etc.
- Spark configuration – optional key/value pairs such as --conf spark.executor.memory=4g or --conf spark.sql.shuffle.partitions=200.
- Deployment mode – --deploy-mode cluster for EMR or the Glue job type (Spark, Python shell, etc.).
- Resource specifications – for EMR you set the number of EC2 instances, instance type, and EBS size; for Glue you set the DPUs (in current Glue versions, through the worker type and number of workers).
- Parameters – any job‑specific arguments you want to pass to the script.
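As a rough illustration of how the parameters above map onto the Glue API, here is a minimal sketch using boto3. It assumes the job is defined once with create_job (script location, IAM role, job type, workers) and then triggered with start_job_run (job-specific arguments). The function names, the Glue version, and any job name, bucket path, or role ARN passed in are hypothetical placeholders, not values from the original answer.

```python
def glue_create_job_args(job_name, script_s3_path, role_arn,
                         worker_type="G.1X", number_of_workers=2):
    """Build keyword arguments for boto3's glue.create_job.

    The job definition carries the name, script location, IAM role,
    job type, and resource settings.
    """
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                 # Spark ETL job type
            "ScriptLocation": script_s3_path,  # S3 path to the PySpark script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",                  # assumed version for this sketch
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


def glue_start_job_run_args(job_name, script_args):
    """Build keyword arguments for boto3's glue.start_job_run.

    Glue expects job argument names to be prefixed with "--" and
    argument values to be strings.
    """
    return {
        "JobName": job_name,
        "Arguments": {f"--{key}": str(value) for key, value in script_args.items()},
    }


def create_and_start(create_args, run_args):
    """Create the job and start one run; returns the Glue JobRunId.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3  # imported lazily so the builders work without boto3 installed

    glue = boto3.client("glue")
    glue.create_job(**create_args)
    return glue.start_job_run(**run_args)["JobRunId"]
```

Keeping the argument builders separate from the API call makes it possible to inspect or unit-test the request before anything is sent to AWS.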
Once these parameters are supplied, the job is submitted via the AWS console, the CLI (aws glue start-job-run or aws emr add-steps), or an SDK, and the Spark engine executes the transformation on the data in S3 or other data stores. Glue returns a job run ID and writes logs to CloudWatch; EMR returns a step ID and writes step logs to the cluster's S3 log location when one is configured. Either ID can be used to monitor the run.
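For the EMR path, a comparable sketch is shown here. It assumes an already running cluster and submits the script as a step that invokes spark-submit through command-runner.jar, which is the usual way to run commands as EMR steps. The function names, the step name, the cluster ID, and the S3 paths are hypothetical.

```python
def build_spark_step(step_name, script_s3_path, spark_conf=None, script_args=None):
    """Build one EMR step definition that runs spark-submit in cluster mode."""
    command = ["spark-submit", "--deploy-mode", "cluster"]
    # Each Spark setting becomes its own "--conf key=value" pair.
    for key, value in (spark_conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    command.append(script_s3_path)       # the application script
    command += list(script_args or [])   # job-specific arguments follow the script
    return {
        "Name": step_name,
        "ActionOnFailure": "CONTINUE",   # keep the cluster running if the step fails
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": command},
    }


def submit_step(cluster_id, step):
    """Submit the step to an existing cluster; returns the EMR step ID.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3  # imported lazily so build_spark_step works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

The returned step ID can be passed to describe_step to poll the step's state.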