Introduction
This guide walks through how to deploy Airflow on Kubernetes, with a focus on using spot instances for maximum cost savings.
Spot instances are preemptible and interruptions are expected. In this guide we design around that reality at every level: diversified node pools and multi-bid pricing keep capacity available when spot markets fluctuate, replicated control plane components survive node preemptions, and application patterns like retries, idempotent tasks, and remote logging ensure workloads recover automatically.
We’ll be using Rackspace Spot for this setup, which uses an auction-based pricing model. Because you control your maximum bid, you have visibility into your cost ceiling and more control over when capacity gets reclaimed. To give you a sense of how this holds up in the real world, here's an experience from a user running Airflow entirely on spot instances:
“Yes, and this is working fine for me. The core Airflow components on k8s are fault tolerant, so preemption is not a concern. I'm running three node clusters per environment/cloudspace, and wouldn't want to lose all three at the same time, but after 45 days or so, I haven't had any failures related to infra.”
No preemption for that long while running entirely on Spot is a pretty sweet deal, especially for a tool like Airflow where most workloads are batch and fault tolerant.
A common approach is to run Airflow control-plane components on on-demand nodes and worker components on spot instances, because the control plane is expected to run on more stable nodes. You can still use the examples in this article to achieve that hybrid setup: just replace the control-plane spot node pools with on-demand node pools, while keeping worker pools on spot instances.
In this article we go through how to run Airflow on Rackspace Spot using the Kubernetes Executor for workers. We also cover cost optimization strategies, compute sizing, and how to right-size each component for better efficiency.
This is the first article in a series on running Airflow on spot instances. This article focuses solely on infrastructure setup and Airflow installation. The next article will dive into Directed Acyclic Graph (DAG) design patterns optimized for spot instances and handling task-level interruptions. Future articles will cover intentional preemption testing, fault tolerance validation, and database management options like CloudNativePG (CNPG) and Longhorn for spot-heavy architectures.
What is Airflow?
Apache Airflow is an open-source tool for defining, scheduling, and running workflows using code. It lets you model processes as a series of tasks with clear dependencies and execution order, so pipelines that need to run in sequence can be orchestrated, retried, and monitored from a central system.
A common example, and the one we will use throughout this article series, is a machine-learning retraining pipeline. In this workflow, Airflow fetches loan data from PostgreSQL, trains a Random Forest model, evaluates its performance, and stores artifacts in S3.
This scenario came from an ML engineer who shared that their Airflow environment ran entirely on on-demand instances even though the retraining job only executed once every 10 days. The infrastructure stayed up continuously while heavy compute was only needed for a short window.
Now imagine that same pipeline running primarily on spot instances at 70-80% lower compute costs, with a retry-friendly, fault-tolerant architecture that handles interruptions automatically.
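Retries are the first line of defense against spot interruptions; Airflow configures them per task via its `retries` and `retry_delay` settings. As a loose, stdlib-only sketch of the underlying pattern (the `flaky_task` function and delay values here are hypothetical illustrations, not Airflow internals):

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0):
    """Re-run a task up to `retries` extra times, doubling the delay each attempt."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # backoff: 1s, 2s, 4s, ...

# A task that fails twice before succeeding, roughly what a preempted
# spot task looks like after being rescheduled onto a new node.
calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("preempted")
    return "ok"

print(run_with_retries(flaky_task, retries=3, base_delay=0.01))  # → ok
```

For this pattern to be safe, the task itself must be idempotent: re-running it after a preemption should produce the same result as running it once.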
Airflow components
Before configuring anything, we first need to understand the Airflow components in this setup. Airflow's architecture is divided into two major categories: control plane components and workers.

Scheduler
The scheduler is Airflow's core orchestration engine. It watches all DAGs and tasks, creating new DAG runs when schedule conditions are met and marking tasks as ready when their dependencies are satisfied. For each new run, it pulls the latest DAG version from the database.
Executor
The executor sits between the scheduler and workers. When the scheduler marks a task as ready, it hands it to the executor. The executor decides how to run that task. Examples include KubernetesExecutor, CeleryExecutor, and LocalExecutor.
API Server
The API server handles requests from various Airflow components and serves as the central access point to the metadata database. Workers communicate with the API server during task execution rather than connecting to the database directly. The API server processes these requests and manages all database operations on their behalf.
Airflow metadata database
The metadata database is Airflow's central source of truth. It stores everything Airflow needs to function: connections, serialized DAG definitions, XCom outputs, and the complete history of DAG runs and task instances, including their current and past states.
DAG processor
The DAG processor continuously scans the configured DAG directory for DAG files. It parses each file to understand the task structure and dependencies, then stores a serialized representation in the metadata database. This parsing happens separately from actual task execution.
Triggerer
The triggerer handles tasks that need to wait for external events or conditions. Instead of blocking a worker while waiting for a database migration to finish or an API endpoint to become healthy, the task delegates the waiting to the triggerer. This frees the worker to execute other tasks while the triggerer monitors the event asynchronously.
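Conceptually, the triggerer is async concurrency: one event loop supervises many waits at once, which is why a single triggerer process can handle large numbers of deferred tasks. A rough stdlib sketch (the event names and delays are hypothetical, and this is not Airflow's actual trigger API):

```python
import asyncio

async def wait_for_event(name, delay):
    """Stand-in for a trigger: wait for an external condition without blocking a worker."""
    await asyncio.sleep(delay)  # e.g. polling an API until it reports healthy
    return f"{name}: ready"

async def triggerer():
    # One event loop supervises many waits concurrently; no worker slot
    # is held while these conditions are pending.
    return await asyncio.gather(
        wait_for_event("db_migration", 0.02),
        wait_for_event("api_healthcheck", 0.01),
    )

print(asyncio.run(triggerer()))
```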
Workers
Workers are responsible for executing the actual task code. How they operate depends entirely on which executor is configured in your Airflow deployment:
- CeleryExecutor: Tasks run on Celery workers distributed across separate processes or machines.
- KubernetesExecutor: Each task gets its own dedicated Kubernetes pod.
- LocalExecutor: Tasks run as separate Python processes on the same machine as the scheduler.
How Airflow 3 executes a DAG
- The DAG processor parses the Python file and writes a serialized version to the metadata database.
- The scheduler scans serialized DAGs and checks if any are ready to run based on schedules, dataset triggers, or external events.
- The executor determines where tasks should run and creates pod specs (for KubernetesExecutor) or publishes to a queue (for CeleryExecutor).
- Workers poll the queue and pick up tasks to execute.
- The worker communicates with the API server for all metadata operations. This includes fetching runtime configuration, reporting task state, and storing task outputs. Logs are written directly to the configured storage backend.
- The scheduler monitors task completion in the database and queues downstream tasks as soon as dependencies are met.
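The scheduler's dependency-driven queuing in the last step is essentially a topological sort over the task graph. As a loose, stdlib-only illustration (the four-task DAG below is hypothetical, and this is not Airflow's internal implementation):

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: fetch feeds both train and validate; evaluate needs both.
deps = {
    "train": {"fetch"},
    "validate": {"fetch"},
    "evaluate": {"train", "validate"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose upstreams are all done
    print("queue:", ready)          # the scheduler hands these to the executor
    ts.done(*ready)                 # mark complete; unblocks downstream tasks
```

Running this prints three "waves": `fetch` first, then `train` and `validate` together, then `evaluate`, mirroring how the scheduler queues downstream tasks as soon as dependencies are met.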
Prerequisites
Before diving in, make sure you have:
- A Rackspace Account
- Helm installed locally for deployment of Airflow
- An AWS S3 bucket for storing model artifacts (or any S3-compatible storage), required for S3 remote logging and ML artifact storage
- spotctl installed: Rackspace's CLI tool for managing cloudspaces and retrieving kubeconfig files. Download and install spotctl following the official instructions
Generate a Terraform API token
You'll need an API key to authenticate with the Rackspace Terraform provider.
- In your cloudspace, go to API Access > Terraform
- Click Get new token
- Name the token and copy the generated value
Terraform infrastructure setup
Start by creating a dedicated directory for your Terraform configuration.
mkdir airflow-terraform
cd airflow-terraform
Step 1: Configure credentials for Terraform access to Rackspace Spot
Create a secrets.auto.tfvars file and store your API token there so Terraform can authenticate with the Rackspace Spot API.
rackspace_spot_token = "XXXXXXXXXXXXXXX"
Important: Add secrets.auto.tfvars to your .gitignore to prevent committing credentials to version control.
For local development and testing, secrets.auto.tfvars with .gitignore is fine. For production or shared environments, use a proper secret management solution.
Alternative approaches for production:
- Use environment variables:
export TF_VAR_rackspace_spot_token="your-token"
- Store the token in a secret manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault)
- Use Terraform Cloud workspace variables with the sensitive flag enabled
- Use encrypted backends such as S3 with encryption at rest
Step 2: Create a providers.tf file
terraform {
  required_version = ">= 1.0"

  required_providers {
    spot = {
      source  = "rackerlabs/spot"
      version = ">= 0.1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "spot" {
  token = var.rackspace_spot_token
}

provider "aws" {
  region = var.aws_region
}

The AWS provider here manages the S3 bucket for storing Airflow logs and artifacts.
Step 3: Create the variable.tf file
variable "rackspace_spot_token" {
  description = "Rackspace Spot authentication token"
  type        = string
  sensitive   = true
}

variable "aws_region" {
  description = "AWS region for S3 bucket"
  type        = string
  default     = "us-east-1"
}
Step 4: Create the rackspace.tf file
You can retrieve the list of available server classes and their specifications from the Terraform data source documentation (Rackspace Spot Server Classes), and quickly view current spot market prices for different server classes in the Rackspace Spot pricing documentation.
Note: Server class names follow the pattern gp.vs1.large-dfw:
- gp = General Purpose family
- vs1 = Generation-1 Virtual Servers
- large = Size tier (small, medium, large, xlarge, etc.)
- dfw = Dallas region
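Unpacking such a name can be done mechanically. A small illustrative sketch (the family expansions are limited to the gp and ch families used in this article; other families are not covered here):

```python
def parse_server_class(name):
    """Split a server class like 'gp.vs1.large-dfw' into its parts."""
    family, generation, rest = name.split(".")
    size, region = rest.rsplit("-", 1)
    families = {"gp": "General Purpose", "ch": "Compute Heavy"}  # from this article
    return {
        "family": families.get(family, family),
        "generation": generation,
        "size": size,
        "region": region,
    }

print(parse_server_class("gp.vs1.large-dfw"))
# → {'family': 'General Purpose', 'generation': 'vs1', 'size': 'large', 'region': 'dfw'}
```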
resource "spot_cloudspace" "airflow_cluster" {
  cloudspace_name    = "airflow-cluster"
  region             = "us-central-dfw-1"
  hacontrol_plane    = false
  wait_until_ready   = true
  kubernetes_version = "1.31.1"
  cni                = "calico"
}

# Spot control plane node pool
resource "spot_spotnodepool" "control_plane" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "gp.vs1.large-dfw" # General purpose Large: 4 vCPU, 15 GiB
  bid_price            = 0.15
  desired_server_count = 2

  labels = {
    "workload" = "airflow-control-plane"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Spot worker node pool
resource "spot_spotnodepool" "workers" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "ch.vs1.medium-dfw" # Compute Heavy Medium: 2 vCPU, 3.75 GiB
  bid_price       = 0.05
  autoscaling     = { min_nodes = 2, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
  }
}
Diversified Server Classes (Single Region, Multiple Capacity Markets)
To deal with capacity shortages, you can use multiple server classes within the same region to diversify across different spot markets:
resource "spot_cloudspace" "airflow_cluster" {
  cloudspace_name    = "airflow-cluster"
  region             = "us-central-dfw-1"
  hacontrol_plane    = false
  wait_until_ready   = true
  kubernetes_version = "1.31.1"
  cni                = "calico"
}

# Control plane pool 1: General purpose instances
resource "spot_spotnodepool" "control_plane_gp" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "gp.vs1.large-dfw" # 4 vCPU, 15 GiB
  bid_price            = 0.15
  desired_server_count = 1

  labels = {
    "workload" = "airflow-control-plane"
    "pool"     = "gp"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Control plane pool 2: Compute-heavy instances (different spot market)
resource "spot_spotnodepool" "control_plane_ch" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "ch.vs1.large-dfw" # 4 vCPU, 7.5 GiB
  bid_price            = 0.15
  desired_server_count = 1

  labels = {
    "workload" = "airflow-control-plane"
    "pool"     = "ch"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Worker pool 1: Compute-heavy instances
resource "spot_spotnodepool" "workers_compute_heavy" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "ch.vs1.medium-dfw"
  bid_price       = 0.05
  autoscaling     = { min_nodes = 1, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
    "pool"     = "compute-heavy"
  }
}

# Worker pool 2: General purpose instances (different capacity market)
resource "spot_spotnodepool" "workers_general" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "gp.vs1.medium-dfw"
  bid_price       = 0.08
  autoscaling     = { min_nodes = 1, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
    "pool"     = "general-purpose"
  }
}
General overview of Terraform configuration
Let's take a look at the diversified server class approach. This Terraform configuration provisions a single Kubernetes cluster with diversified node pools to improve spot capacity resilience and cost optimization.
It creates:
- A cloudspace (the Kubernetes cluster and hosted control plane) with its primary region set to us-central-dfw-1.
- Multiple control-plane node pools using different server classes (gp.vs1.large-dfw and ch.vs1.large-dfw) to diversify across spot capacity markets. These pools are intended for core Airflow components such as the scheduler, webserver, API server, and triggerer.
- Multiple worker node pools using different server classes (ch.vs1.medium-dfw and gp.vs1.medium-dfw) that can autoscale based on task demand. By diversifying across server class families, workers can scale even when individual spot markets experience capacity shortages.
Taints and tolerations for resource isolation
As we've seen earlier, the infrastructure required to run Airflow is divided into two categories: the infrastructure needed to run the control plane (scheduler, API server, DAG processor, triggerer, and webserver) and the infrastructure required to run the Airflow workers.
For this architecture, we use Kubernetes taints and tolerations to ensure control plane pods get scheduled on specific nodes.
Tainting control plane nodes and adding tolerations to control plane pods ensures each component type lands on appropriately sized nodes based on their resource requirements. Worker pods lack the control plane toleration, so they're automatically excluded from control plane nodes and land on worker nodes instead.
In this all-spot architecture, we run multiple replicas of critical components (2 scheduler replicas, 2 webserver replicas, 2 API server replicas).
Compute Sizing
Don't use a one-size-fits-all resource configuration: different tasks have different needs, and different node pools have different resource profiles.
Airflow workloads vary significantly in their compute requirements. Some DAGs primarily orchestrate external systems: submitting Spark jobs, calling APIs, triggering cloud functions, and require minimal resources. Others perform compute-intensive work directly within task pods, such as data transformations, machine learning training, or heavy data processing.
Understanding your workload patterns is critical to right-sizing your infrastructure.
When choosing worker node sizes, start with a reasonable baseline, then monitor actual usage.
You're looking for balance. If your worker pods consistently hit their CPU or memory limits, they're undersized and tasks will run slow or fail. If they're only using 20% of available capacity, they're oversized and you're overpaying for compute you don't need. The goal is right-sizing based on observed behavior.
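That balance check can be made concrete. The sketch below classifies a resource from observed usage; the 20%/90% thresholds and the sample numbers are illustrative assumptions, not platform recommendations:

```python
def sizing_verdict(used, allocated, low=0.20, high=0.90):
    """Classify a resource as under/over/right-sized from observed usage."""
    utilization = used / allocated
    if utilization >= high:
        return "undersized"   # consistently near the limit: tasks may slow or fail
    if utilization <= low:
        return "oversized"    # paying for idle capacity
    return "right-sized"

# Hypothetical observations for a worker pod (CPU cores used vs. allocated)
print(sizing_verdict(used=1.9, allocated=2.0))  # → undersized
print(sizing_verdict(used=0.3, allocated=2.0))  # → oversized
print(sizing_verdict(used=1.1, allocated=2.0))  # → right-sized
```

In practice the "used" numbers would come from your metrics stack (e.g. `kubectl top pods` or Prometheus) observed over a representative window, not a single sample.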
In our setup using KubernetesExecutor, the control plane (scheduler and API server) handles the constant work of creating pods, monitoring their state, and cleaning up afterward. This makes them more resource-intensive than you might expect. Workers, on the other hand, are ephemeral pods that spin up per-task and terminate immediately after.
We also factor in the type of workload the workers will execute when provisioning worker node resources and choosing compute sizes.
For the control plane nodes, which run multiple critical Airflow components, including the scheduler (2 replicas), triggerer (2 replicas), webserver (2 replicas), API server (2 replicas), DAG processor (2 replicas), statsd, and PostgreSQL, we use two node pools with similar compute specifications but from different server class families:
- gp.vs1.large-dfw (4 vCPU, 15 GiB memory): 1 node
- ch.vs1.large-dfw (4 vCPU, 7.5 GiB memory): 1 node
This gives the control plane components adequate headroom to operate reliably while diversifying across spot capacity markets.
For the worker pools, we use multiple server classes to ensure capacity availability:
- ch.vs1.medium-dfw (2 vCPU, 3.75 GiB): Compute-optimized for CPU-bound tasks
- gp.vs1.medium-dfw (2 vCPU, 7.5 GiB): General-purpose for balanced workloads
The choice of these specific sizes is based on the workload characteristics in this example. For your use case, you should benchmark your tasks and adjust accordingly.
Server class diversification and multi-bid strategy
We provision multiple node pools with different server classes and bid prices to improve spot capacity resilience and optimize costs.
Why different server classes?
Each server class family (gp.vs1.*, ch.vs1.*) represents a separate spot market. If ch.vs1.medium runs out of capacity or experiences price spikes, workers can still scale using gp.vs1.medium nodes.
Why different bid prices?
Different bid prices allow you to access different price tiers within the spot market. By bidding $0.05 on one pool and $0.08 on another, you're diversifying across price-sensitive segments of the market. When spot prices fluctuate, your lower-bid pools might get reclaimed while higher-bid pools remain stable. This means you're not putting all your capacity in one price tier.
For control plane pools, we use higher bids ($0.15) to ensure stability, since control plane downtime stops all orchestration. For workers, we use lower, varied bids ($0.05-$0.08) to maximize savings while maintaining capacity options across different price points.
This strategy provides capacity resilience and cost optimization without requiring geographic distribution.
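To see what the bid prices above imply, here is the arithmetic for a worst-case monthly compute ceiling at the pools' minimum node counts. Note this is a ceiling set by the bids, not necessarily what you pay, and the figures are purely illustrative:

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Pools and max bids from the diversified configuration above:
# (name, $/hr bid, node count at minimum scale)
pools = [
    ("control-plane gp", 0.15, 1),
    ("control-plane ch", 0.15, 1),
    ("workers ch (min)", 0.05, 1),
    ("workers gp (min)", 0.08, 1),
]

ceiling = sum(bid * count * HOURS_PER_MONTH for _, bid, count in pools)
print(f"worst-case monthly compute ceiling: ${ceiling:.2f}")  # → $313.90
```

Worker autoscaling up to max_nodes would raise this ceiling, and actual spend is governed by the market price at any given time, bounded above by your bid.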
Configure Airflow with Helm values
Create a file kubernetes/airflow-values.yaml and add these Helm values.
executor: KubernetesExecutor

scheduler:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow metadata database (required for real installs)
postgresql:
  enabled: true
  persistence:
    enabled: true
    storageClass: ssd
    size: 10Gi
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "latest"
  primary:
    nodeSelector:
      workload: airflow-control-plane
    tolerations:
      - key: "workload"
        operator: "Equal"
        value: "airflow-control-plane"
        effect: "NoSchedule"

# Redis isn't required for KubernetesExecutor
redis:
  enabled: false

# Persist logs (optional; you can also do remote logging later)
logs:
  persistence:
    enabled: false
    storageClassName: sata
    size: 10Gi

# DAGs from Git repository using git-sync
dags:
  persistence:
    enabled: false
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: <replace-with-branch>
    subPath: "dags/"
    wait: 60
    maxFailures: 0
    depth: 1

triggerer:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  persistence:
    enabled: false

webserver:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  startupProbe:
    failureThreshold: 20
    periodSeconds: 10

apiServer:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  startupProbe:
    failureThreshold: 20
    periodSeconds: 10

# DAG Processor (processes DAG files)
dagProcessor:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Statsd exporter for metrics (does not support multiple replicas)
statsd:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow create user job
createUserJob:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow database migration job
migrateDatabaseJob:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

secret:
  - envName: "AIRFLOW_CONN_AWS_DEFAULT"
    secretName: "aws-connection"
    secretKey: "connection-uri"
  - envName: "AIRFLOW_CONN_POSTGRES_DEFAULT"
    secretName: "postgres-connection"
    secretKey: "connection-uri"

# Remote logging to S3 (required for KubernetesExecutor)
config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://your-airflow-logs-bucket/logs"
    remote_log_conn_id: "aws_default"
  core:
    task_log_reader: "s3.task"

Here’s a summary of what’s in the airflow-values.yaml file:
- KubernetesExecutor
- Control-plane components (scheduler/webserver/triggerer/API server/DAG processor) pinned to control plane nodes via nodeSelector + tolerations for the control plane taint
- Task pods land on worker nodes via Airflow's KubernetesExecutor settings (worker pods lack the control plane toleration, so they're automatically excluded from control plane nodes)
- GitSync for DAGs (no RWX dependency)
- Remote logging to S3
- PostgreSQL in-cluster with persistent storage on control plane nodes (for testing/dev environments; for production, consider managed databases such as Rackspace DBaaS)
Executor
executor: KubernetesExecutor
This tells Airflow how to run tasks.
- Each task gets its own dedicated Kubernetes pod
- Pod spins up when task starts, terminates when task completes
- No persistent worker pools (unlike CeleryExecutor)
- Tasks are isolated from each other (separate containers)
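Because every task is its own pod, resources can also be tuned per task. In DAG code this is typically done through Airflow's executor_config with a pod_override; as a dependency-free sketch, the override boils down to a partial pod spec like the one below (the `pod_override` helper and the values are hypothetical illustrations):

```python
def pod_override(cpu, memory):
    """Build a partial pod spec, as the KubernetesExecutor merges into the task pod.
    In DAG code this shape is passed via executor_config={"pod_override": ...}."""
    return {
        "spec": {
            "containers": [{
                "name": "base",  # Airflow's task container is named "base"
                "resources": {
                    "requests": {"cpu": cpu, "memory": memory},
                    "limits": {"cpu": cpu, "memory": memory},
                },
            }],
        }
    }

# A heavy training task gets more room than a light API-call task.
print(pod_override("2", "3Gi")["spec"]["containers"][0]["resources"]["requests"])
# → {'cpu': '2', 'memory': '3Gi'}
```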
Airflow control-plane configuration
scheduler:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

The same pattern applies to webserver, apiServer, dagProcessor, and triggerer.
All core Airflow control plane components (scheduler, webserver, API server, DAG processor, triggerer) share the same configuration pattern. Each runs 2 replicas for high availability, uses nodeSelector to land only on nodes labeled workload: airflow-control-plane, and includes tolerations to schedule on tainted control plane nodes. This ensures all critical components run on properly-sized control plane nodes (4 vCPU, 15 GiB) rather than smaller worker nodes, and prevents worker pods from landing on control plane nodes since they lack the required toleration. The webserver and API server also include extended startupProbe settings (20 failures × 10 seconds = 200 seconds total) to give them enough time to initialize without Kubernetes killing them prematurely.
Airflow metadata database configuration
postgresql:
  enabled: true
  persistence:
    enabled: true
    storageClass: ssd
    size: 10Gi
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "latest"
  primary:
    nodeSelector:
      workload: airflow-control-plane
    tolerations:
      - key: "workload"
        operator: "Equal"
        value: "airflow-control-plane"
        effect: "NoSchedule"

This deploys a single-replica PostgreSQL instance. It works fine for development and testing, but for production you should consider Rackspace DBaaS (fully managed, zero ops overhead).
Rackspace Spot currently provides RWO (ReadWriteOnce) block storage, meaning volumes attach to one node at a time. For HA PostgreSQL in-cluster, you'd need Longhorn for distributed storage with CloudNativePG managing streaming replication across multiple replicas (each gets its own volume).
DAGs (GitSync)
dags:
  persistence:
    enabled: false
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: master
    subPath: "dags/"
    wait: 60
    maxFailures: 0
    depth: 1

This configures DAG loading for Airflow. We use GitSync to pull DAGs from a Git repository rather than relying on PersistentVolumes, so deployments are handled entirely through Git.
Secrets
secret:
  - envName: "AIRFLOW_CONN_AWS_DEFAULT"
    secretName: "aws-connection"
    secretKey: "connection-uri"
  - envName: "AIRFLOW_CONN_POSTGRES_DEFAULT"
    secretName: "postgres-connection"
    secretKey: "connection-uri"

This injects Airflow connections as environment variables from Kubernetes Secrets.
Note: This postgres connection secret will be used later for the ML pipeline.
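Airflow connections supplied through environment variables are expressed as URIs. Here is a small sketch of building one safely with hypothetical credentials; URL-encoding keeps special characters in the password from corrupting the URI:

```python
from urllib.parse import quote

def postgres_conn_uri(user, password, host, port, dbname):
    """Build an Airflow-style connection URI, percent-encoding the credentials."""
    return f"postgres://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{dbname}"

# Hypothetical credentials -- in the cluster this value lives in the
# 'postgres-connection' Kubernetes Secret, not in code.
uri = postgres_conn_uri("airflow", "p@ss/word", "postgres.airflow.svc", 5432, "loans")
print(uri)  # → postgres://airflow:p%40ss%2Fword@postgres.airflow.svc:5432/loans
```

The resulting string is what you would store under the Secret's connection-uri key.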
Deploy the infrastructure
Initialize Terraform to download the required providers:
terraform init
Review the planned changes before applying them:
terraform plan
Apply the configuration to provision the infrastructure:
terraform apply --auto-approve
Connect to the Kubernetes cluster
mkdir -p ~/.kube/rackspace-spot
spotctl cloudspaces get-config \
--name airflow-cluster \
--file ~/.kube/rackspace-spot
# Merge with existing kubeconfig
export KUBECONFIG=~/.kube/config:~/.kube/rackspace-spot/airflow-cluster.yaml
kubectl config view --flatten > ~/.kube/config.tmp
mv ~/.kube/config.tmp ~/.kube/config
chmod 600 ~/.kube/config
# Switch to the new context
kubectl config use-context airflow-cluster
# Verify connection
kubectl get nodes
Install Airflow on the Kubernetes cluster
kubectl create ns airflow
helm install airflow apache-airflow/airflow \
-n airflow \
-f airflow-values.yaml \
--timeout 15m \
--debug
Access the airflow UI:
kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
Then access the Airflow UI at http://localhost:8080. The default credentials are Username: admin, Password: admin.
Conclusion
In this article, we walked through deploying Airflow on Rackspace Spot instances with a cost-optimized, fault-tolerant architecture. We covered infrastructure setup with Terraform, server class diversification to mitigate capacity risks, multi-bid strategies to balance cost and stability, and Helm-based Airflow deployment with proper resource isolation using taints and tolerations. We set up:
- All-spot Airflow cluster with control plane and workers running on spot instances
- Server class diversification across multiple instance families (gp.vs1, ch.vs1) to tap into independent spot markets
- Multi-bid pricing strategy: higher bids for control plane stability, lower bids for worker cost savings
- GitSync for version-controlled DAG deployment
- Remote logging to S3 for worker task pods
- PostgreSQL with persistent storage (with production alternatives discussed)
This infrastructure is ready to run workloads, but we haven't covered how to write DAGs that are optimized for spot interruptions. In the next article, we'll dive into:
- DAG design patterns for spot-tolerant workflows (idempotent tasks, checkpointing, external state management)
- Handling interruptions at the task level (retries, task timeouts, graceful shutdown)
- Real-world ML pipeline example running the loan prediction model end-to-end