Version: Next

Run TensorFlow Jobs

This guide gives an overview of how to set up the training-operator and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.

Install training-operator

Use the following command to install the training-operator. By default, it is installed in the kubeflow namespace. If you run into problems with the installation, please refer to this doc for details.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
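Before moving on, it can help to confirm that the install finished. The following is a minimal sketch, not part of the official instructions: it assumes the standalone overlay registered the `tfjobs.kubeflow.org` CRD and created a Deployment named `training-operator` in the `kubeflow` namespace. The function name `check_training_operator` is hypothetical.

```shell
# Assumption: namespace and Deployment name match the standalone overlay defaults.
NAMESPACE="kubeflow"

check_training_operator() {
  # The TFJob CRD must exist before any TFJob can be created.
  kubectl get crd tfjobs.kubeflow.org || return 1
  # Block until the operator Deployment reports a completed rollout (or time out).
  kubectl rollout status deployment/training-operator -n "$NAMESPACE" --timeout=120s
}
```

Run `check_training_operator` after the `kubectl apply` command; a non-zero exit status suggests the operator is not ready yet.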

Prepare the docker image

Before you start running a TensorFlow job on Kubernetes, you'll need to build the docker image.

  1. Download the files from deployments/examples/tfjob
  2. Build the docker image with the following command
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
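To check that the build produced the tag the TFJob refers to, something like the sketch below may be useful. It only confirms that the image exists in the local Docker image store; whether your cluster nodes can pull or see that image depends on your setup and is not covered here. The function name `image_built` is hypothetical.

```shell
# The tag must match the "image" field used in the TFJob spec.
IMAGE="kubeflow/tf-dist-mnist-test:1.0"

image_built() {
  # "docker image inspect" exits non-zero when the tag is not present locally;
  # on success it prints the image ID.
  docker image inspect "$IMAGE" --format '{{.Id}}'
}
```

Calling `image_built` after `docker build` should print an image ID if the tag exists.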

Run a TensorFlow job

Here is a TFJob YAML for the MNIST example.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist-for-e2e-test
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0

Create the TFJob

kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
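Once the job is created, you may want to check from the command line that its pods were handed to YuniKorn. The sketch below uses the job name, namespace, and `applicationId` label from the TFJob above, and assumes the labels in the pod template end up on the pods the operator creates. The function name `show_tfjob_status` is hypothetical.

```shell
# Values taken from the TFJob manifest above.
JOB_NAME="dist-mnist-for-e2e-test"
NAMESPACE="kubeflow"
APP_ID="tf_job_20200521_001"

show_tfjob_status() {
  # Show the TFJob resource itself.
  kubectl get tfjob "$JOB_NAME" -n "$NAMESPACE" || return 1
  # List the job's pods together with the scheduler assigned to each one.
  kubectl get pods -n "$NAMESPACE" -l "applicationId=$APP_ID" \
    -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName,STATUS:.status.phase
}
```

With the manifest above, `show_tfjob_status` would be expected to list two PS pods and four Worker pods, each with `yunikorn` in the SCHEDULER column.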

You can view the job info from the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the document here.
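One common way to reach the UI is a port-forward. This is a sketch under assumptions, not a substitute for the linked document: it assumes YuniKorn was installed with its defaults, so that the service is named `yunikorn-service`, lives in the `yunikorn` namespace, and serves the web UI on port 9889. The function name `forward_yunikorn_ui` is hypothetical.

```shell
# Assumptions: default YuniKorn service name, namespace, and web UI port.
YK_NAMESPACE="yunikorn"
YK_SERVICE="yunikorn-service"
YK_UI_PORT=9889

forward_yunikorn_ui() {
  # Forward the local port to the UI port of the service; runs until interrupted.
  kubectl port-forward "svc/$YK_SERVICE" "$YK_UI_PORT:$YK_UI_PORT" -n "$YK_NAMESPACE"
}
```

While `forward_yunikorn_ui` is running, the UI should be reachable at http://localhost:9889 under these assumptions.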

Figure: the TFJob displayed in the YuniKorn UI.