Howto: maximize uptime (99.99%+) at the lowest cloud cost with Rackspace Spot

This guide shows how to achieve the highest possible application uptime (>99.99%) at the lowest possible cloud cost with Rackspace Spot. Please note that we do not guarantee that this method is foolproof, but in practice the approach should work.

Problem statement

The problem is clear. Rebecca's organization is trying to reduce costs and has asked her to find the lowest-cost cloud architecture for running certain applications. To make things harder, she is allowed little, if any, downtime: her application is expected to meet a 99.99% availability SLA.
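To make the 99.99% target concrete, it helps to turn it into a downtime budget. The short sketch below does that arithmetic; the function name and the choice of a 365-day year and a 30-day month are illustrative assumptions, not part of any Rackspace Spot tooling.

```python
# Downtime budget implied by an availability target (illustrative helper).
MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600 minutes in a non-leap year
MINUTES_PER_30_DAYS = 30 * 24 * 60    # 43,200 minutes in a 30-day month


def downtime_budget_minutes(availability_pct: float, period_minutes: int) -> float:
    """Return the minutes of downtime allowed in a period at the given availability."""
    return period_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    yearly = downtime_budget_minutes(99.99, MINUTES_PER_YEAR)
    monthly = downtime_budget_minutes(99.99, MINUTES_PER_30_DAYS)
    print(f"Allowed downtime per year: {yearly:.2f} minutes")
    print(f"Allowed downtime per 30 days: {monthly:.2f} minutes")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, or about 4.3 minutes per 30-day month, which is why losing capacity to a pre-emption has to be handled quickly and automatically.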

To visualize the problem, consider that Rebecca configured her Cloudspace with the following parameters:

  1. Region: US Central, Chicago
  2. Server Class: General Purpose Virtual Server Extra Large (8 vCPUs, 30 GB RAM)
  3. Server bid count: 2 nodes
  4. Total capacity: 16 vCPUs, 60 GB of RAM
  5. Current market price: $0.001/hr
  6. Maximum bid price: $0.003/hr

Along came a price spike

As every user of Rackspace Spot knows, prices change with the market auction. As new users join the market, as usage changes, or as new capacity is added, prices go up or down. These server pools have substantial capacity (and they are growing all the time), but users should still expect prices to fluctuate. Suppose a sudden influx of new users all decide that they need that exact server configuration: General Purpose Virtual Server Extra Large in US Central, Chicago.

The auction cutoff first increased to $0.002/hr, and then to $0.003/hr, which matched Rebecca's maximum bid. At that point, her Cloudspace was impacted and lost one of its two nodes.

As shown in Figure 2, Rebecca's Cloudspace lost one of the two nodes in its original bid. Thankfully, Rebecca had planned for this scenario: she had registered to be notified when a node was pre-empted, which gave her the chance to respond.

Using multiple bids to optimize capacity and cost

Rebecca had previously identified alternative server configurations that would work just as well. She had noticed that the General Purpose Large server class was still going at the reserve price of $0.001/hr. So, she had automation in place to provision a second server pool which would still give her the capacity she required. Using Terraform, she provisioned a second pool with these parameters:

  1. Region: US Central, Chicago
  2. Server Class: General Purpose Virtual Server Large
  3. Server bid count: 2 nodes
  4. Total capacity: 8 vCPUs, 30 GB of RAM
  5. Current price: $0.001/hr
  6. Maximum bid price: $0.002/hr
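The sizing of this second pool can be reasoned about as follows. This is a minimal sketch, not Rebecca's actual automation: the `NodeSize` type and `replacement_nodes` helper are hypothetical, and the per-node size of the Large class (4 vCPUs, 15 GB) is inferred from the pool total listed above (2 nodes giving 8 vCPUs and 30 GB) rather than stated directly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    vcpus: int
    ram_gb: int


# Extra Large is stated in the article. Large is an assumption inferred from
# the second pool's total (2 nodes = 8 vCPUs, 30 GB).
EXTRA_LARGE = NodeSize(vcpus=8, ram_gb=30)
LARGE = NodeSize(vcpus=4, ram_gb=15)


def replacement_nodes(lost: NodeSize, lost_count: int, replacement: NodeSize) -> int:
    """Smallest number of replacement nodes that covers both lost vCPUs and lost RAM."""
    by_cpu = math.ceil(lost.vcpus * lost_count / replacement.vcpus)
    by_ram = math.ceil(lost.ram_gb * lost_count / replacement.ram_gb)
    return max(by_cpu, by_ram)


if __name__ == "__main__":
    # One Extra Large node was pre-empted; how many Large nodes replace it?
    print(replacement_nodes(EXTRA_LARGE, 1, LARGE))
```

Taking the larger of the CPU-based and RAM-based counts matters when the replacement class has a different CPU-to-memory ratio than the class that was lost.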

With this second server pool in place, Kubernetes in her Cloudspace automatically rescheduled the pods across the two pools of capacity.

By using a different server class, the Cloudspace still gets the total capacity it needs (16 vCPUs and 60 GB of RAM), while keeping the total cost of infrastructure as low as possible.
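The combined capacity and cost of the two pools after the price spike can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Pool` type and `totals` function are hypothetical, prices are assumed to be per node per hour, and the Large node size (4 vCPUs, 15 GB) is inferred from the pool total given earlier.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Pool:
    name: str
    nodes: int
    vcpus_per_node: int
    ram_gb_per_node: int
    market_price: float  # USD per node-hour (assumed unit)
    max_bid: float       # USD per node-hour (assumed unit)


def totals(pools: Iterable[Pool]) -> Tuple[int, int, float, float]:
    """Return (vCPUs, RAM in GB, current hourly cost, worst-case hourly cost)."""
    pools = list(pools)
    vcpus = sum(p.nodes * p.vcpus_per_node for p in pools)
    ram_gb = sum(p.nodes * p.ram_gb_per_node for p in pools)
    current = sum(p.nodes * p.market_price for p in pools)
    worst_case = sum(p.nodes * p.max_bid for p in pools)
    return vcpus, ram_gb, current, worst_case


# State after the spike: one Extra Large node survives at the $0.003 market
# price, and the new Large pool runs two nodes at $0.001.
AFTER_SPIKE = [
    Pool("gp-extra-large-chicago", 1, 8, 30, market_price=0.003, max_bid=0.003),
    Pool("gp-large-chicago", 2, 4, 15, market_price=0.001, max_bid=0.002),
]

if __name__ == "__main__":
    vcpus, ram_gb, current, worst_case = totals(AFTER_SPIKE)
    print(f"{vcpus} vCPUs, {ram_gb} GB RAM")
    print(f"current ${current:.3f}/hr, worst case ${worst_case:.3f}/hr")
```

Under these assumptions, the Cloudspace is back at 16 vCPUs and 60 GB of RAM. The worst-case figure is what the pools would cost if both market prices rose to their maximum bids.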

Multiple bids are most effective when using different server classes, because they insulate your Cloudspace from a market price surge in any one server class.

While a user could place multiple bids for the same server class at two different price thresholds (e.g. $0.003 and $0.005), that is no different from having just one bid at the higher price of $0.005. Since users always pay the market price, up to their maximum bid, a single bid at the higher price works just as well as two different bids for the same server class.
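A toy model makes this point easier to see. It is a deliberate simplification and not the actual Rackspace Spot auction: it assumes a node keeps running while the market price is at or below its maximum bid, that every running node pays the market price, and it does not model how ties at exactly the bid price are resolved. The function names are hypothetical.

```python
from typing import List, Tuple

Bid = Tuple[float, int]  # (maximum bid in USD per node-hour, node count)


def running_nodes(market_price: float, bids: List[Bid]) -> int:
    """Nodes still running: a bid holds while the market price is at or below its maximum."""
    return sum(count for max_bid, count in bids if market_price <= max_bid)


def hourly_cost(market_price: float, bids: List[Bid]) -> float:
    """Every running node pays the market price, regardless of its maximum bid."""
    return running_nodes(market_price, bids) * market_price


SPLIT_BIDS = [(0.003, 1), (0.005, 1)]  # two bids, same server class
SINGLE_BID = [(0.005, 2)]              # one bid at the higher price

if __name__ == "__main__":
    for price in (0.002, 0.004, 0.006):
        print(
            f"market ${price:.3f}: "
            f"split keeps {running_nodes(price, SPLIT_BIDS)} node(s), "
            f"single keeps {running_nodes(price, SINGLE_BID)} node(s)"
        )
```

In this model the split bids never keep more nodes than the single higher bid, and the price per running node is the same in both cases, so splitting bids within one server class saves nothing.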

Using a third server pool from another region

Rebecca noticed that her preferred server configuration, the General Purpose Virtual Server Extra Large class, was available at lower prices in the Ashburn (IAD) region. To further hedge against a market price surge in Chicago, she decided to add a third pool from the Ashburn region. Rackspace Spot Cloudspaces use a hosted Kubernetes control plane that can manage servers across multiple sites. In addition, Rackspace's robust interconnect infrastructure means that network latency across these sites is not a significant penalty (and there is no cost for network traffic between Rackspace sites).

Rebecca's final Cloudspace configuration now looks like this:

Upcoming features to make this even easier

The Spot public roadmap describes a few features that will make this kind of optimization even easier:

More useful alerts: alert on market capacity risk (before pre-emption)

Allow Spot users to use some non-pre-emptible capacity

Automatic bid failover: node replacement from other server classes or regions
