Remote Spark Submission using Apache Livy

What is Apache Livy?

Apache Livy is an open-source REST service for managing long-running Apache Spark jobs. It allows you to submit, manage, and interact with Spark jobs over a RESTful interface, making it easier to integrate Spark with other applications and services.

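To get a feel for that interface: once a Livy server is running (setup is covered in the next section), a client can create an interactive PySpark session and execute a snippet of code purely over HTTP using Livy's /sessions and /statements endpoints. The short sketch below is only an illustration of that idea (replace <livy-server> with your own host); the rest of this article focuses on submitting whole scripts as batch jobs via /batches.

    import time
    import requests

    LIVY = "http://<livy-server>:8998"
    HEADERS = {"Content-Type": "application/json"}

    # Create an interactive PySpark session.
    session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}, headers=HEADERS).json()
    session_url = f"{LIVY}/sessions/{session['id']}"

    # Wait until the session is ready to accept code.
    while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
        time.sleep(2)

    # Run a snippet of PySpark code remotely and fetch its output.
    stmt = requests.post(f"{session_url}/statements",
                         json={"code": "print(sc.parallelize(range(100)).count())"},
                         headers=HEADERS).json()
    stmt_url = f"{session_url}/statements/{stmt['id']}"

    while True:
        result = requests.get(stmt_url, headers=HEADERS).json()
        if result["state"] == "available":
            print(result["output"])
            break
        time.sleep(2)

    # Close the session when done.
    requests.delete(session_url, headers=HEADERS)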

Setting Up Apache Livy

  1. Install Apache Livy:

    • Download the Livy release from the official Apache Livy repository.

    • Extract the archive and place it in a directory on your cluster.

    tar -zxvf livy-<version>.tar.gz
  2. Configure Livy:

    • Navigate to the Livy configuration directory.
    cd livy-<version>/conf
    • Copy the template configuration files.
    cp livy.conf.template livy.conf
    cp livy-env.sh.template livy-env.sh
    • Edit livy.conf to configure Livy. At a minimum, set the server host and port and tell Livy which Spark master and deploy mode to use:
    livy.server.port = 8998
    livy.server.host = <hostname>
    livy.spark.master = yarn
    livy.spark.deploy-mode = cluster
    • Edit livy-env.sh to set environment variables as needed. For example:
    export SPARK_HOME=/path/to/spark
    export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
  3. Start Livy Server:

    • Start the Livy server by running the following command:
    ./bin/livy-server start
    • Check the logs to ensure Livy started correctly:
    tail -f logs/livy-<user>-server.out
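
Once the server is up, you can also verify from any client machine that the REST endpoint is reachable. A minimal check (assuming the host and port configured above) is to list the currently known batch sessions:

    import requests

    # Replace <livy-server> with the host configured in livy.conf.
    response = requests.get("http://<livy-server>:8998/batches")

    # A healthy server answers with HTTP 200 and a JSON list of batch sessions
    # (empty right after startup).
    print(response.status_code)
    print(response.json())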

Submitting a PySpark Job via Livy

  1. Prepare Your PySpark Script:

    • Place your PySpark script in a location accessible by the cluster (e.g., HDFS or S3).
  2. Submit Job via REST API:

    • From your client machine, you can submit a PySpark job using HTTP requests.

Example PySpark Script (example.py):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

    # Sample PySpark code
    data = [("James", "Smith"), ("Anna", "Rose"), ("Robert", "Williams")]
    columns = ["First Name", "Last Name"]
    df = spark.createDataFrame(data, columns)
    df.show()

    spark.stop()

Submitting the Job:

    import requests
    import json

    # Define the Livy URL
    livy_url = 'http://<livy-server>:8998/batches'

    # Define the payload for the POST request.
    # "className" is only needed for Java/Scala jars; for a .py file it is omitted.
    # The "conf" block is optional and shown here with an example package dependency.
    payload = {
        "file": "hdfs:///path/to/example.py",
        "args": [],
        "conf": {
            "spark.jars.packages": "com.databricks:spark-avro_2.11:3.0.1"
        }
    }

    # Define the headers
    headers = {'Content-Type': 'application/json'}

    # Send the POST request
    response = requests.post(livy_url, data=json.dumps(payload), headers=headers)

    # Print the response
    print(response.json())
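
The JSON response includes the new batch's id and its current state. If a job fails, the driver output can be pulled through the same REST API; the sketch below (reusing the response object and headers from the snippet above, together with Livy's /batches/{id}/log endpoint) prints the most recent log lines:

    # Fetch the driver log lines for the batch that was just submitted.
    batch_id = response.json()['id']
    log_url = f"http://<livy-server>:8998/batches/{batch_id}/log"

    log_response = requests.get(log_url, headers=headers)

    # The response carries a "log" field with the latest output lines.
    for line in log_response.json().get("log", []):
        print(line)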

Managing and Monitoring Jobs

  1. Check Job Status:

    • You can check the status of your submitted job by sending a GET request to the Livy server (a simple polling loop built on this call is sketched after this list).
    batch_id = response.json()['id']
    status_url = f"http://<livy-server>:8998/batches/{batch_id}"

    response = requests.get(status_url)
    print(response.json())
  2. List All Jobs:

    • List all running and completed jobs by sending a GET request.
    list_url = 'http://<livy-server>:8998/batches'

    response = requests.get(list_url)
    print(response.json())
  3. Kill a Job:

    • You can kill a running job by sending a DELETE request.
    delete_url = f"http://<livy-server>:8998/batches/{batch_id}"

    response = requests.delete(delete_url)
    print(response.json())
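
As noted under "Check Job Status", these calls combine naturally into a small polling loop that waits for a batch to finish. The sketch below reuses status_url and batch_id from the examples above; the set of terminal states is an assumption based on Livy's batch session states, with "success" and "dead" being the usual outcomes:

    import time

    # Assumed terminal states for a Livy batch session.
    terminal_states = {"success", "dead", "error", "killed"}

    while True:
        state = requests.get(status_url).json()["state"]
        print(f"Batch {batch_id} is in state: {state}")
        if state in terminal_states:
            break
        time.sleep(10)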

Benefits of Using Apache Livy

  1. Ease of Integration: Livy provides a REST interface that can be easily integrated with other applications, services, and programming languages.

  2. Remote Job Submission: Allows you to submit Spark jobs from any machine without needing direct access to the Spark cluster.

  3. Job Management: Provides features to manage, monitor, and control Spark jobs.

  4. Multi-language Support: Supports multiple languages including Python, Scala, and Java.

Conclusion

Using Apache Livy allows you to submit and manage Spark jobs remotely through a RESTful interface, making it a robust solution when you need to call a PySpark application from a client machine without having the code on the client. This setup not only simplifies interaction with the Spark cluster but also enhances the flexibility and scalability of job submissions.
