Running Spark Job On Minikube Kubernetes Cluster

Kubernetes is one of the industry buzzwords these days, and I have been trying a few different things with it. On Feb 28th, 2018, Apache Spark released v2.3.0. I already work with Apache Spark, and this release adds a new Kubernetes scheduler backend that supports native submission of Spark jobs to a cluster managed by Kubernetes. The feature is still experimental, but I wanted to try it out.

I am writing this post because I had to search through a lot of material to get this working, and there is no single place with exact information on how to run a Spark job on a local Kubernetes cluster. I hope that writing up my findings will spare others the same pain.

In this post, we will look at how to set up a Kubernetes cluster using Minikube on a local machine and how to run an Apache Spark job on top of it. We are not going to write any code for this: the Apache Spark distribution already ships with a few examples, so we will use the bundled example jar and run its SparkPi program on our Kubernetes cluster.

Prerequisites

Before we start, please make sure you have the following prerequisites installed on your local machine.

I am running these tools on macOS High Sierra; you can verify the installed versions with the commands shown right after the list below.

  • Apache Spark (v2.3.0)

  • Kubernetes (v1.9.3)

  • Docker

  • Minikube

  • VirtualBox
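To quickly confirm everything is in place, you can check each tool from the terminal; these are the standard version flags, and the exact version strings will of course differ on your machine.

> spark-submit --version
> kubectl version
> docker --version
> minikube version
> VBoxManage --version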

Start Minikube

We can start Minikube locally by simply running the command below.

> minikube start
Starting local Kubernetes v1.9.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.

Minikube is a tool that makes it easy to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on a local machine, which lets us try out Kubernetes or develop with it day-to-day.

When Minikube starts, it comes up in a single-node configuration with 1 GB of memory and 2 CPU cores by default. For running Spark this is not enough and the jobs will fail (they did for me, multiple times), so we have to increase both the memory and the CPU cores for our Minikube cluster.

There are three different ways to do this. Not all of them worked for me, which is how I came to learn about the different ways to change the Minikube configuration. You can use any one of the approaches below.

Pass Configuration To Minikube Start Command

You can directly pass the memory and CPU options to the minikube start command like:

> minikube start --memory 8192 --cpus 4
Starting local Kubernetes v1.9.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.

This will start a Minikube cluster with 8 GB of memory and 4 CPU cores (somehow, this didn't work for me).

Modify Minikube Config

We can change the Minikube configuration with the config command of the Minikube CLI. The config command allows setting different options such as memory, cpus, vm-driver, disk-size, etc. To see all the available options, run > minikube config; it will list everything that can be modified.

> minikube config set memory 8192

## These changes will take effect upon a minikube delete and then a minikube start
> minikube config set cpus 4

## These changes will take effect upon a minikube delete and then a minikube start

After setting the configuration with the config command, we need to delete the previously running cluster and start a new one. To delete the Minikube cluster run > minikube delete, and then rerun the minikube start command.
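Putting it together, the restart after changing the config is just:

> minikube delete
> minikube start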

Change Configuration Using VirtualBox

For me, the above two options didn't work. If you are in the same situation, you can use this last option, which worked for me; hopefully it works for you too. Open the VirtualBox app on your machine and select the Minikube VM. Right-click -> Settings on the VM; this opens the configuration page for the Minikube VM.

Note: to change the configuration using VirtualBox, you first need to shut down the VM if it is already running.

Under System -> Motherboard, you can change the memory of the VM using the slider; in my case, I gave it 8192 MB of memory.

To change the CPU configuration, go to the Processor tab and set the CPU count to 4. You can try other values too; 4 works just fine for me. Don't set it to less than 2, otherwise things won't work. :p
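If you prefer the command line to the VirtualBox GUI, the VBoxManage tool that ships with VirtualBox can make the same changes. This is a minimal sketch, assuming the VM is named minikube (the default name Minikube gives it) and has been shut down first:

## stop the cluster so the VM is powered off before reconfiguring it
> minikube stop

## give the VM 8192 MB of memory and 4 CPU cores
> VBoxManage modifyvm minikube --memory 8192 --cpus 4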

After the configuration changes are done, start the VM using the VirtualBox app or the minikube start command given above.

Your cluster is now running; next, you need a Docker image for Spark. Let's see how to build that image.

Creating Docker Image For Spark

Make sure Docker is installed on your machine and that the Spark distribution is extracted.

Go inside the extracted Spark folder and run the command below to create a Spark Docker image (some of the output logs are excluded).

> ./bin/docker-image-tool.sh -m -t spark-docker build
./bin/docker-image-tool.sh: line 125: ================================================================================: command not found
Sending build context to Docker daemon  256.5MB
Step 1/16 : FROM openjdk:8-alpine
 ---> 224765a6bdbe
Step 2/16 : ARG spark_jars=jars
 ---> Using cache
 ---> dd42fdc28d7a
Step 3/16 : ARG img_path=kubernetes/dockerfiles
 ---> Using cache
 ---> 570fc343f883
Step 4/16 : RUN set -ex &&     apk upgrade --no-cache &&     apk add --no-cache bash tini libc6-compat &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/work-dir     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd
 ---> Using cache
 ---> 705ae89cb075
Step 5/16 : COPY ${spark_jars} /opt/spark/jars
 ---> Using cache
 ---> 506ab8c02abf
Step 9/16 : COPY ${img_path}/spark/entrypoint.sh /opt/
 ---> Using cache
 ---> 03959bfa5250
Step 10/16 : COPY examples /opt/spark/examples
 ---> Using cache
 ---> 5a2f91a7ce3e
Step 11/16 : COPY data /opt/spark/data
 ---> Using cache
 ---> 58090cef2be4
Step 13/16 : RUN echo $(ls /opt/spark/examples/jars/)
 ---> Using cache
 ---> 0a6c27628ac7
Step 14/16 : ENV SPARK_HOME /opt/spark
 ---> Using cache
 ---> 7f32a22e0196
Step 16/16 : ENTRYPOINT [ "/opt/entrypoint.sh" ]
 ---> Using cache
 ---> 01daf3302719
Successfully built 01daf3302719
Successfully tagged spark:spark-docker

> docker image ls
REPOSITORY                                                TAG                 IMAGE ID            CREATED             SIZE
spark-docker                                              v0.1                01daf3302719        2 days ago          347MB

Now if you run > docker image ls you will see the built image available on your local machine. Make a note of the image name; we need to pass it to the spark-submit command.

The above command also has a push option that lets you push the Docker image to your own repository; this, in turn, enables your production Kubernetes cluster to pull the image from the configured Docker registry. Run the same command without any options to see its usage.
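For example, building against your own registry and pushing the image would look roughly like this; docker.io/myrepo is a placeholder and has to be replaced with your own repository address:

> ./bin/docker-image-tool.sh -r docker.io/myrepo -t spark-docker build
> ./bin/docker-image-tool.sh -r docker.io/myrepo -t spark-docker push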

It might happen that the command does not work and gives an error like:

./bin/docker-image-tool.sh: line 125: ================================================================================: command not found
"docker build" requires exactly 1 argument.
See 'docker build --help'.
Usage: docker build [OPTIONS] PATH | URL | - [flags]
Build an image from a Dockerfile

This is because of an issue in the docker-image-tool.sh script. I have raised a bug for this in the Apache Spark JIRA; you can see it here.

The issue is being fixed, but to continue with this post you can open the docker-image-tool.sh file inside the bin folder, add BUILD_ARGS=() right after line 59, save the file, and run the command again; it will work this time.
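In other words, the whole workaround is a single added line (the exact line number may differ slightly in your copy of the script):

## bin/docker-image-tool.sh, right after line 59
BUILD_ARGS=()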

If the above command worked for you without the workaround, the issue has probably been fixed by the time you are reading this post and you don't have to do anything. Hurrah!

Submit Spark Job

Now, let's submit our SparkPi job to the cluster. Our cluster is ready and we have the Docker image, so run the command below to submit the Spark job to the Kubernetes cluster. The spark-submit script takes care of setting up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes that Spark offers.

> spark-submit \                                      
--master k8s://https://192.168.99.100:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=spark-docker \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

The options used to run on Kubernetes are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

  • --master: The master URL for the Kubernetes cluster (e.g. k8s://https://192.168.99.100:8443); see the note right after this list for how to find this URL

  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

  • --conf spark.executor.instances=3: Configuration property that specifies how many executor instances to use while running the Spark job.

  • --conf spark.kubernetes.container.image=spark-docker: Configuration property to specify which docker image to use, here provide the same docker name from docker image ls command.

  • local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a local:// path that is present on all nodes.
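If you are not sure what to put after the k8s:// prefix, kubectl can print the API server address of the cluster it is currently pointing at; for Minikube it is typically the same https://192.168.99.100:8443 used above.

> kubectl cluster-info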

The application jar should be available cluster-wide, either through HDFS or HTTP, or it should be packaged inside every Docker image so that it is available to all the executor pods of our Spark program. The local:// scheme means the jar is already present as a local file on every pod the Spark driver initializes, so no IO is needed to pull it from anywhere; this works well when the application jar is large and has been pushed to every worker node or shared through a shared filesystem.
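For comparison, if the application jar were served from HDFS instead of being baked into the image, only the last argument of the spark-submit command above would change; the namenode host and path here are purely hypothetical:

## hypothetical HDFS location passed as the last argument to spark-submit
hdfs://namenode-host:9000/jars/spark-examples_2.11-2.3.0.jar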

Luckily, our Spark Docker image already packages the example jar inside the container, so we can use it directly. How to package our own application code and ship it either via HDFS or as a separate Docker image is something I will cover in a separate post.

To check whether the pods have started and the Spark job is running, open the Kubernetes dashboard that comes with Minikube.

View Minikube Dashboard & Kubernetes Logs

To check the status of our submitted job, we can use either the Kubernetes dashboard or the Kubernetes logs. Minikube ships with the dashboard as an add-on, and it can be opened using the following command.

> minikube dashboard
Opening kubernetes dashboard in default browser...

OR

> minikube service list
|-------------|------------------------------------------------------|--------------------------------|
|  NAMESPACE  |                         NAME                         |              URL               |
|-------------|------------------------------------------------------|--------------------------------|
| default     | kubernetes                                           | No node port                   |
| default     | neo4j                                                | No node port                   |
| default     | neo4j-public                                         | http://192.168.99.100:30388    |
|             |                                                      | http://192.168.99.100:30598    |
| default     | spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver-svc | No node port                   |
| kube-system | kube-dns                                             | No node port                   |
| kube-system | kubernetes-dashboard                                 | http://192.168.99.100:30000    |
|-------------|------------------------------------------------------|--------------------------------|

Kubernetes Dashboard

Navigate to the URL given by the above command to view the dashboard. The dashboard provides lots of information about cluster memory usage, CPU usage, pods, services, replica sets, etc. We can also view service logs directly through the dashboard. However, if you don't want to use the dashboard, you can view the Spark driver log with the > kubectl logs <pod name> command:

> kubectl logs spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver 
.....
2018-03-07 13:10:35 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2018-03-07 13:10:35 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.091617 s
Pi is roughly 3.1389956949784747
2018-03-07 13:10:35 INFO  AbstractConnector:318 - Stopped Spark@53e211ee{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-03-07 13:10:35 INFO  SparkUI:54 - Stopped Spark web UI at http://spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver-svc.default.svc:4040
2018-03-07 13:10:35 INFO  KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2018-03-07 13:10:35 INFO  KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2018-03-07 13:10:35 INFO  KubernetesClusterSchedulerBackend:54 - Closing kubernetes client
2018-03-07 13:10:35 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-03-07 13:10:35 INFO  MemoryStore:54 - MemoryStore cleared
2018-03-07 13:10:35 INFO  BlockManager:54 - BlockManager stopped
2018-03-07 13:10:35 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-03-07 13:10:35 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-03-07 13:10:35 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-03-07 13:10:35 INFO  ShutdownHookManager:54 - Shutdown hook called
......

As you can see in the logs, the program has calculated the value of Pi, and the container is stopped at the end.
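If you prefer the terminal to the dashboard, you can also find the driver pod and follow its logs with kubectl; the generated pod name will be different for every run:

## list all pods to find the generated driver pod name
> kubectl get pods

## stream the driver logs while the job runs
> kubectl logs -f spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver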

Shutdown Cluster

Shutting down the cluster is very easy: run > minikube stop and it will stop the cluster.


I hope this helps you try running Apache Spark on a local Kubernetes cluster. Please comment if any of the steps are unclear or if you hit errors while following along, and I will try to add more detail to the post. Also, let me know if you liked it, and help others by sharing.

Happy Coding...

References

  1. Apache Spark Documentation

  2. Minikube Github
