Howto: Low Cost Ray on Kubernetes with KubeRay on Rackspace Spot

Rackspace Spot provides the industry's most cost-effective Kubernetes clusters with high availability and auto-scaling. This combination of low cost and simplicity makes it a good platform for MLOps teams interested in Ray. This HOWTO walks you through deploying KubeRay on Spot instances for batch processing, real-time ML prediction/inference, ML model deployment, and similar workloads.

KubeRay Introduction

KubeRay is an open-source Kubernetes operator designed to simplify the deployment and management of Ray applications (e.g. ML model training/deployment using Ray libraries) on Kubernetes clusters. KubeRay provides three custom resources (CRs) for training and deploying ML workloads: RayCluster, RayJob, and RayService.

  • RayCluster: KubeRay fully manages the lifecycle of RayCluster, including cluster creation/deletion, autoscaling, and ensuring fault tolerance.
  • RayJob: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes.
  • RayService: RayService is made up of two parts: a RayCluster and a Ray Serve deployment graph. RayService offers zero-downtime upgrades for RayCluster and high availability.

KubeRay can also be used for distributed job processing. To learn more about Ray, see the Ray documentation listed under References.

Minimum System Requirements

KubeRay requires a minimum of 4 CPUs and 4 GB of memory for the head node and the worker nodes. Rackspace Spot provides several machine types that meet these needs.

Step 1: Create a Spot Cloudspace

To deploy the KubeRay operator and RayClusters, you need a Kubernetes cluster. Rackspace Spot provides Cloudspaces, which are complete Kubernetes clusters with load balancing, persistent storage, and auto-scaling for your Ray applications. In this HOWTO, we use a General Virtual Server.Extra Large machine to deploy the KubeRay operator and a sample RayCluster CR.

Deploy a Cloudspace

You can use an existing Cloudspace if it satisfies the minimum requirements for KubeRay, or you can create a new Cloudspace by clicking "Add Cloudspace" in the top right corner.

Choose a data center location.

Choose your preferred Server Configuration. For this HOWTO, we place a bid on a General Virtual Server.Extra Large machine.

Configure your Cloudspace based on your preference and type in your Cloudspace name.

Proceed with "Go to Summary", which summarizes the configuration and pricing for your Cloudspace, and then deploy your Cloudspace. If this is the first Cloudspace in your Rackspace Spot account, you will see a screen with a Billing option in the left panel. Provide the PAYMENT METHOD and BILLING INFORMATION.

To learn more about creating Cloudspaces and placing bids, see the Rackspace Spot article listed under References.

Step 2: Install the KubeRay Operator

Once the Cloudspace (K8s cluster) is provisioned in Spot, download the kubeconfig file from the Cloudspace view in the Spot dashboard.

Deploy the KubeRay operator using its Helm chart repository.
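
Before installing anything, it can help to confirm that kubectl reaches the new Cloudspace and that the nodes meet the minimum requirements. The following is a minimal sketch, not part of the original instructions: the kubeconfig path and the helper name `show_node_capacity` are hypothetical, so adjust the path to wherever you saved the downloaded file.

```shell
# Hypothetical path: point this at the kubeconfig downloaded from the dashboard.
export KUBECONFIG="${KUBECONFIG:-$HOME/Downloads/cloudspace-kubeconfig.yaml}"

# Hypothetical helper: print each node with its allocatable CPU and memory so
# the values can be compared against the KubeRay minimum requirements.
show_node_capacity() {
  kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
}
```

After pasting the snippet into your shell, run `show_node_capacity`. If no nodes are listed, the bid may not have been fulfilled yet.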

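
The original command listing did not survive on this page, so the following is a sketch of the operator installation as described in the upstream KubeRay Helm instructions. The helper name `install_kuberay_operator` and the `KUBERAY_VERSION` variable are assumptions made for illustration; leaving the variable empty installs the latest chart version.

```shell
# Assumption: optionally pin a chart version, e.g. KUBERAY_VERSION=1.0.0.
# Empty means "install the latest chart".
KUBERAY_VERSION="${KUBERAY_VERSION:-}"

# Hypothetical helper wrapping the Helm steps for the KubeRay operator.
install_kuberay_operator() {
  # Register the KubeRay Helm chart repository and refresh the local index.
  helm repo add kuberay https://ray-project.github.io/kuberay-helm/ || return 1
  helm repo update || return 1
  # Install the operator; --version is passed only when KUBERAY_VERSION is set.
  helm install kuberay-operator kuberay/kuberay-operator \
    ${KUBERAY_VERSION:+--version "$KUBERAY_VERSION"} || return 1
  # List pods so you can confirm that the operator pod reaches Running.
  kubectl get pods
}
```

Run `install_kuberay_operator` with `KUBECONFIG` pointing at your Cloudspace. Wrapping the steps in a function keeps them repeatable if you later recreate the Cloudspace.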

The instructions above are based on the KubeRay documentation listed under References.

Step 3: Deploy a RayCluster

Now you can deploy a RayCluster in the default namespace using the KubeRay operator.

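
As a sketch of this step, the RayCluster can be installed from the same Helm repository. This assumes the `kuberay` repository was already added in Step 2 and that the release is named `raycluster`, in which case the chart creates a cluster called `raycluster-kuberay`. The helper name and the `KUBERAY_VERSION` variable are hypothetical.

```shell
# Assumption: optionally pin the same chart version used for the operator.
KUBERAY_VERSION="${KUBERAY_VERSION:-}"

# Hypothetical helper: install a sample RayCluster with the ray-cluster chart.
install_raycluster() {
  # Release name "raycluster" yields a RayCluster named "raycluster-kuberay".
  helm install raycluster kuberay/ray-cluster \
    ${KUBERAY_VERSION:+--version "$KUBERAY_VERSION"} || return 1
  # Show the head and worker pods that belong to this RayCluster.
  kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
}
```

Run `install_raycluster` and repeat the `kubectl get pods` command until the head and worker pods report Running.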

Helm deploys the RayCluster into the default namespace as a RayCluster custom resource (CR).

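
The original listing of the custom resource is not reproduced here. Instead, the sketch below prints whatever manifest the chart actually created, assuming the cluster is named `raycluster-kuberay` (the name produced by a Helm release called `raycluster`). The helper name `show_raycluster` is hypothetical.

```shell
# Hypothetical helper: inspect the RayCluster custom resource created by Helm.
show_raycluster() {
  # List the RayCluster resources in the default namespace.
  kubectl get rayclusters --namespace default || return 1
  # Print the full manifest, including the head and worker group specs.
  kubectl get raycluster raycluster-kuberay --namespace default -o yaml
}
```

Running `show_raycluster` lets you check the CPU and memory requests of the head and worker groups against the node size you bid on.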

Step 4: Run an application on your RayCluster

Once the head node(s) and worker node(s) are all in the Running state, you can submit a RayJob by following the steps below. There are several ways to interact with the head node; here we review one way to submit a RayJob. Other ways of interacting with the RayCluster are described in the Ray documentation listed under References.

We create a RayJob that prints the cluster resources of the RayCluster.

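
One possible way to run such a job is sketched below. It is an assumption about the method, since the original listing is missing: it submits a Ray job through the Ray job CLI from inside the head pod rather than creating a RayJob custom resource. The helper name is hypothetical, and the address assumes the dashboard listens on its default port 8265 inside the head pod.

```shell
# Hypothetical helper: submit a job that prints the cluster resources.
submit_cluster_resources_job() {
  # Look up the head pod of the RayCluster by its node-type label.
  head_pod="$(kubectl get pods --selector=ray.io/node-type=head \
    -o custom-columns=POD:metadata.name --no-headers)" || return 1
  [ -n "$head_pod" ] || { echo "no Ray head pod found" >&2; return 1; }
  # Submit the job from inside the head pod (assumes the default port 8265).
  kubectl exec "$head_pod" -- ray job submit --address http://127.0.0.1:8265 \
    -- python -c 'import ray; ray.init(); print(ray.cluster_resources())'
}
```

Run `submit_cluster_resources_job`; the job output should include the CPU and memory totals that Ray sees across the head and worker pods.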

Step 5: Inspect the KubeRay Dashboard

KubeRay also exposes the Ray Dashboard, which lets you monitor RayJobs, RayServices, RayCluster metrics, and more. To access the dashboard, you only need to run two commands:

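
A sketch of the two commands follows. It assumes the head service is named `raycluster-kuberay-head-svc`, which is the name the chart generates for a Helm release called `raycluster`, and that the dashboard uses its default port 8265. The helper name `open_ray_dashboard` is hypothetical.

```shell
# Hypothetical helper: expose the Ray Dashboard on http://localhost:8265.
open_ray_dashboard() {
  # Confirm that the head service exists and check its ports.
  kubectl get service raycluster-kuberay-head-svc || return 1
  # Forward local port 8265 to the dashboard port; blocks until interrupted.
  kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265
}
```

Run `open_ray_dashboard`, then open http://localhost:8265 in a browser. Press Ctrl+C to stop the port forward.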

The Ray Dashboard shows the job executed in the previous step.

Step 6: Autoscaling of Nodes for KubeRay

Ray applications often require a variable amount of capacity. KubeRay's auto-scaling increases or decreases the number of worker pods, while auto-scaling in Rackspace Spot dynamically provisions the nodes your Kubernetes cluster needs to run them.
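
To observe how the two layers interact, you can compare the worker pods with the nodes that back them. This is a minimal sketch with a hypothetical helper name; it only reads cluster state and does not change any auto-scaling settings.

```shell
# Hypothetical helper: show Ray worker pods next to the cluster's nodes.
show_scaling_state() {
  # Worker pods of the RayCluster; a Pending pod is still waiting for a node.
  kubectl get pods --selector=ray.io/node-type=worker -o wide || return 1
  # Nodes currently provisioned for the Cloudspace.
  kubectl get nodes
}
```

Run `show_scaling_state` while a job is scaling up: worker pods that stay Pending suggest that the Cloudspace is still provisioning nodes for them.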

References

https://spot.rackspace.com/docs/create-rackspace-spot-cloudspace

https://github.com/ray-project/kuberay

https://docs.ray.io/en/master/cluster/key-concepts.html#cluster-key-concepts

https://docs.ray.io/en/master/cluster/kubernetes/getting-started/raycluster-quick-start.html
