Usage with Apache Spark on YARN

venv-pack can be used to distribute virtual environments for use with Apache Spark jobs when deploying on Apache Hadoop YARN. Bundling your environment lets you ship custom packages with your job and ensures they're provided consistently on every node. This relies on YARN's resource localization: environments are distributed as archives, which YARN automatically unpacks on every node. For this to work, the archive must be in tar.gz or zip format.
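
As a sketch of how the localization naming works (the complete commands appear below): the fragment after the # in an --archives entry sets the name of the directory the archive is unpacked into inside each container's working directory.

# Illustrative fragment only; see the full spark-submit commands below.
# YARN unpacks environment.tar.gz into ./environment on every node.
$ spark-submit --archives environment.tar.gz#environment ...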

Example

Create an environment:

# Using venv (Python 3 only)
$ python -m venv example

# Or using virtualenv
$ virtualenv example

Activate the environment:

$ source example/bin/activate

Install some packages into the environment:

(example) $ pip install numpy pandas scikit-learn scipy
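
Optionally, sanity-check that the packages import cleanly before packing (a quick check, not required):

(example) $ python -c "import numpy, pandas, scipy, sklearn; print('ok')"
ok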

Package the environment into a tar.gz archive:

(example) $ venv-pack -o environment.tar.gz
Collecting packages...
Packing environment at '/home/jcrist/example' to 'environment.tar.gz'
[########################################] | 100% Completed |  16.6s
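
If your deployment expects a zip archive instead, venv-pack can also produce one. As an assumption to verify against your venv-pack version, the format is inferred from the output file's extension:

(example) $ venv-pack -o environment.zip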

Write a PySpark script, for example:

# script.py
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    # Packages are imported and available from your bundled environment.
    import sklearn
    import pandas
    import numpy as np

    # Use the libraries to do work
    return np.sin(x)**2 + 2

result = (sc.parallelize(range(1000))
            .map(some_function)
            .take(10))

print(result)
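
Before submitting to the cluster, you may want to smoke-test the script locally from the activated environment (this assumes a local Spark installation with spark-submit on your PATH):

(example) $ spark-submit --master local[2] script.py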

Submit the job to Spark using spark-submit. The --archives environment.tar.gz#environment argument tells YARN to unpack the archive into a directory named environment in each container's working directory, which is why PYSPARK_PYTHON points at the relative path ./environment/bin/python. In YARN cluster mode:

$ PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode cluster \
--archives environment.tar.gz#environment \
script.py

Or in YARN client mode. Here the driver runs on the submitting machine and uses a local Python interpreter (PYSPARK_DRIVER_PYTHON), while the executors use the packed environment:

$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py

You can also start an interactive PySpark session using the following:

$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
pyspark \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment
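
Once the shell is up, a quick way to confirm that the executors are using the packed environment is to import one of the bundled packages on the cluster; the call below returns the numpy version shipped in the archive (a minimal check, assuming the session started above):

>>> sc.parallelize([0]).map(lambda _: __import__('numpy').__version__).first()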