What is Apache Livy?
Apache Livy is an open-source REST service for Apache Spark. It lets you submit, manage, and interact with Spark jobs (both interactive sessions and batch applications) over a RESTful interface, making it easy to integrate Spark with other applications and services.
Setting Up Apache Livy
Install Apache Livy:
- Download the Livy release from the official Apache Livy downloads page.
- Extract the archive into a directory on your cluster:
tar -zxvf livy-<version>.tar.gz
Configure Livy:
- Navigate to the Livy configuration directory.
cd livy-<version>/conf
- Copy the template configuration files.
cp livy.conf.template livy.conf
cp livy-env.sh.template livy-env.sh
- Edit livy.conf to configure Livy. At a minimum, set the Spark master and deploy mode for the sessions Livy will launch:
livy.server.port = 8998
livy.server.host = <hostname>
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
- Edit livy-env.sh to set environment variables as needed. For example:
export SPARK_HOME=/path/to/spark
export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
Start Livy Server:
- Start the Livy server by running the following command:
./bin/livy-server start
- Check the logs to ensure Livy started correctly:
tail -f logs/livy-<user>-server.out
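Once the server is running, you can sanity-check it from any machine with HTTP access. A minimal check (replace <livy-server> with your Livy host) queries the batches endpoint, which returns an empty listing on a fresh install:
import requests

# A fresh server typically responds with something like
# {"from": 0, "total": 0, "sessions": []}
response = requests.get('http://<livy-server>:8998/batches')
print(response.status_code)  # 200 means Livy is reachable
print(response.json())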
Submitting a PySpark Job via Livy
Prepare Your PySpark Script:
- Place your PySpark script in a location accessible by the cluster (e.g., HDFS or S3).
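For example, on an HDFS-backed cluster you might upload the script like this (the target path is a placeholder):
hdfs dfs -put example.py hdfs:///path/to/example.py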
Submit Job via REST API:
- From your client machine, you can submit a PySpark job using HTTP requests.
Example PySpark Script (example.py):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
# Sample PySpark code
data = [("James", "Smith"), ("Anna", "Rose"), ("Robert", "Williams")]
columns = ["First Name", "Last Name"]
df = spark.createDataFrame(data, columns)
df.show()
spark.stop()
Submitting the Job:
import requests
import json
# Define the Livy URL
livy_url = 'http://<livy-server>:8998/batches'
# Define the payload for the POST request. "file" points to the PySpark
# script on HDFS; "className" is only needed for Java/Scala jar jobs,
# so it is omitted for a Python application. "conf" is optional and
# passes Spark configuration through to the job.
payload = {
    "file": "hdfs:///path/to/example.py",
    "args": [],
    "conf": {
        "spark.executor.memory": "2g"
    }
}
# Define the headers
headers = {'Content-Type': 'application/json'}
# Send the POST request
response = requests.post(livy_url, data=json.dumps(payload), headers=headers)
# Print the response
print(response.json())
Managing and Monitoring Jobs
Check Job Status:
- You can check the status of your submitted job by sending a GET request to the Livy server.
batch_id = response.json()['id']
status_url = f"http://<livy-server>:8998/batches/{batch_id}"
response = requests.get(status_url)
print(response.json())
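A batch moves through states such as starting and running before reaching a terminal state (success, dead, or killed), so in practice you usually poll until the job finishes. A minimal polling sketch, reusing batch_id and status_url from above:
import time

# Poll until the batch reaches a terminal state.
terminal_states = {"success", "dead", "killed"}
while True:
    state = requests.get(status_url).json()["state"]
    print(f"Batch {batch_id}: {state}")
    if state in terminal_states:
        break
    time.sleep(10)  # wait 10 seconds between polls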
List All Jobs:
- List all running and completed jobs by sending a GET request.
list_url = 'http://<livy-server>:8998/batches'
response = requests.get(list_url)
print(response.json())
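If many batches have been submitted, you can page through the results; the list endpoint accepts from and size query parameters:
# Fetch at most 10 batches, starting from the first one
response = requests.get(list_url, params={"from": 0, "size": 10})
print(response.json())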
Kill a Job:
- You can kill a running job by sending a DELETE request.
delete_url = f"http://<livy-server>:8998/batches/{batch_id}"
response = requests.delete(delete_url)
print(response.json())
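To confirm the kill took effect, you can re-list the batches; a deleted batch should no longer appear (assuming, as in the Livy versions I have seen, the listing nests batches under a "sessions" key):
# Verify the batch no longer appears in the listing
response = requests.get(list_url)
remaining_ids = [b["id"] for b in response.json().get("sessions", [])]
print(batch_id in remaining_ids)  # expect False after deletion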
Benefits of Using Apache Livy
- Ease of Integration: Livy provides a REST interface that can be easily integrated with other applications, services, and programming languages.
- Remote Job Submission: Lets you submit Spark jobs from any machine with HTTP access, without needing direct access to the Spark cluster.
- Job Management: Provides endpoints to manage, monitor, and stop Spark jobs.
- Multi-language Support: Supports multiple languages, including Python, Scala, and Java.
Conclusion
Apache Livy lets you submit and manage Spark jobs remotely through a RESTful interface, making it a good fit when you need to run a PySpark application from a client machine without keeping the code on the client. This setup simplifies interaction with the Spark cluster and makes job submission more flexible and scalable.