Introduction
This guide walks through how to deploy Airflow on Kubernetes, with a focus on using spot instances for maximum cost savings.
Spot instances are preemptible and interruptions are expected. In this guide we design around that reality at every level: diversified node pools and multi-bid pricing keep capacity available when spot markets fluctuate, replicated control plane components survive node preemptions, and application patterns like retries, idempotent tasks, and remote logging ensure workloads recover automatically.
We’ll be using Rackspace Spot for this setup, which uses an auction-based pricing model. Because you control your maximum bid, you have visibility into your cost ceiling and more control over when capacity gets reclaimed. To give you a sense of how this holds up in the real world, here's an experience from a user running Airflow entirely on spot instances:
“Yes, and this is working fine for me. The core Airflow components on k8s are fault tolerant, so preemption is not a concern. I'm running three node clusters per environment/cloudspace, and wouldn't want to lose all three at the same time, but after 45 days or so, I haven't had any failures related to infra.”
No preemption for that long while running entirely on Spot is a pretty sweet deal, especially for a tool like Airflow where most workloads are batch and fault tolerant.
A common approach is to run Airflow control-plane components on on-demand nodes and worker components on spot instances, because the control plane is expected to run on more stable nodes. You can still use the examples in this article to achieve that hybrid setup: just replace the control-plane spot node pools with on-demand node pools, while keeping worker pools on spot instances.
In this article we go through how to run Airflow on Rackspace Spot using the Kubernetes Executor for workers. We also cover cost optimization strategies, compute sizing, and how to right-size each component for better efficiency.
This is the first article in a series on running Airflow on spot instances. This article focuses solely on infrastructure setup and Airflow installation. The next article will dive into Directed Acyclic Graph (DAG) design patterns optimized for spot instances and handling task-level interruptions. Future articles will cover intentional preemption testing, fault tolerance validation, and database management options like CloudNativePG (CNPG) and Longhorn for spot-heavy architectures.
What is Airflow?
Apache Airflow is an open-source tool for defining, scheduling, and running workflows using code. It lets you model processes as a series of tasks with clear dependencies and execution order, so pipelines that need to run in sequence can be orchestrated, retried, and monitored from a central system.
A common example, and the one we will use throughout this article series, is a machine-learning retraining pipeline. In this workflow, Airflow fetches loan data from PostgreSQL, trains a Random Forest model, evaluates its performance, and stores artifacts in S3.
This scenario came from an ML engineer who shared that their Airflow environment ran entirely on on-demand instances even though the retraining job only executed once every 10 days. The infrastructure stayed up continuously while heavy compute was only needed for a short window.
Now imagine that same pipeline running primarily on spot instances at 70-80% lower compute costs, with a retry-friendly, fault-tolerant architecture that handles interruptions automatically.
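Retries are the first line of defense against spot interruptions; Airflow configures them per task via its `retries` and `retry_delay` settings. As a loose, stdlib-only sketch of the underlying pattern (the `flaky_task` function and delay values here are hypothetical illustrations, not Airflow internals):

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0):
    """Re-run a task up to `retries` extra times, doubling the delay each attempt."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # backoff: 1s, 2s, 4s, ...

# A task that fails twice before succeeding, roughly what a preempted
# spot task looks like after being rescheduled onto a new node.
calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("preempted")
    return "ok"

print(run_with_retries(flaky_task, retries=3, base_delay=0.01))  # → ok
```

For this pattern to be safe, the task itself must be idempotent: re-running it after a preemption should produce the same result as running it once.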
Airflow components
Before configuring anything, we first need to understand the Airflow components in this setup. Airflow's architecture is divided into two major categories: control plane components and workers.

Scheduler
The scheduler is Airflow's core orchestration engine. It watches all DAGs and tasks, creating new DAG runs when schedule conditions are met and marking tasks as ready when their dependencies are satisfied. For each new run, it pulls the latest DAG version from the database.
Executor
The executor sits between the scheduler and workers. When the scheduler marks a task as ready, it hands it to the executor. The executor decides how to run that task. Examples include KubernetesExecutor, CeleryExecutor, and LocalExecutor.
API Server
The API server handles requests from various Airflow components and serves as the central access point to the metadata database. Workers communicate with the API server during task execution rather than connecting to the database directly. The API server processes these requests and manages all database operations on their behalf.
Airflow metadata database
The metadata database is Airflow's central source of truth. It stores everything Airflow needs to function: connections, serialized DAG definitions, XCom outputs, and the complete history of DAG runs and task instances, including their current and past states.
DAG processor
The DAG processor continuously scans the configured DAG directory for DAG files. It parses each file to understand the task structure and dependencies, then stores a serialized representation in the metadata database. This parsing happens separately from actual task execution.
Triggerer
The triggerer handles tasks that need to wait for external events or conditions. Instead of blocking a worker while waiting for a database migration to finish or an API endpoint to become healthy, the task delegates the waiting to the triggerer. This frees the worker to execute other tasks while the triggerer monitors the event asynchronously.
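Conceptually, the triggerer is async concurrency: one event loop supervises many waits at once, which is why a single triggerer process can handle large numbers of deferred tasks. A rough stdlib sketch (the event names and delays are hypothetical, and this is not Airflow's actual trigger API):

```python
import asyncio

async def wait_for_event(name, delay):
    """Stand-in for a trigger: wait for an external condition without blocking a worker."""
    await asyncio.sleep(delay)  # e.g. polling an API until it reports healthy
    return f"{name}: ready"

async def triggerer():
    # One event loop supervises many waits concurrently; no worker slot
    # is held while these conditions are pending.
    return await asyncio.gather(
        wait_for_event("db_migration", 0.02),
        wait_for_event("api_healthcheck", 0.01),
    )

print(asyncio.run(triggerer()))
```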
Workers
Workers are responsible for executing the actual task code. How they operate depends entirely on which executor is configured in your Airflow deployment:
- CeleryExecutor: Tasks run on Celery workers distributed across separate processes or machines.
- KubernetesExecutor: Each task gets its own dedicated Kubernetes pod.
- LocalExecutor: Tasks run as separate Python processes on the same machine as the scheduler.
How Airflow 3 executes a DAG
- The DAG processor parses the Python file and writes a serialized version to the metadata database.
- The scheduler scans serialized DAGs and checks if any are ready to run based on schedules, dataset triggers, or external events.
- The executor determines where tasks should run and creates pod specs (for KubernetesExecutor) or publishes to a queue (for CeleryExecutor).
- Workers poll the queue and pick up tasks to execute.
- The worker communicates with the API server for all metadata operations. This includes fetching runtime configuration, reporting task state, and storing task outputs. Logs are written directly to the configured storage backend.
- The scheduler monitors task completion in the database and queues downstream tasks as soon as dependencies are met.
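The scheduler's dependency-driven queuing in the last step is essentially a topological sort over the task graph. As a loose, stdlib-only illustration (the four-task DAG below is hypothetical, and this is not Airflow's internal implementation):

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: fetch feeds both train and validate; evaluate needs both.
deps = {
    "train": {"fetch"},
    "validate": {"fetch"},
    "evaluate": {"train", "validate"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose upstreams are all done
    print("queue:", ready)          # the scheduler hands these to the executor
    ts.done(*ready)                 # mark complete; unblocks downstream tasks
```

Running this prints three "waves": `fetch` first, then `train` and `validate` together, then `evaluate`, mirroring how the scheduler queues downstream tasks as soon as dependencies are met.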
Prerequisites
Before diving in, make sure you have:
- A Rackspace Account
- Helm installed locally for deployment of Airflow
- An AWS S3 bucket for storing model artifacts (or any S3-compatible storage), required for S3 remote logging and ML artifact storage
- spotctl installed: Rackspace's CLI tool for managing cloudspaces and retrieving kubeconfig files. Download and install spotctl following the official instructions
Generate a Terraform API token
You'll need an API key to authenticate with the Rackspace Terraform provider.
- In your cloudspace, go to API Access > Terraform
- Click Get new token
- Name the token and copy the generated value
Terraform infrastructure setup
Start by creating a dedicated directory for your Terraform configuration.
mkdir airflow-terraform
cd airflow-terraform
Step 1: Configure credentials for Terraform access to Rackspace Spot
Create a secrets.auto.tfvars file and store your API token there so Terraform can authenticate with the Rackspace Spot API.
rackspace_spot_token = "XXXXXXXXXXXXXXX"
Important: Add secrets.auto.tfvars to your .gitignore to prevent committing credentials to version control.
For local development and testing, secrets.auto.tfvars with .gitignore is fine. For production or shared environments, use a proper secret management solution.
Alternative approaches for production:
- Use environment variables:
export TF_VAR_rackspace_spot_token="your-token"
- Store the token in a secret manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault)
- Use Terraform Cloud workspace variables with the sensitive flag enabled
- Use encrypted backends such as S3 with encryption at rest
Step 2: Create a providers.tf file
terraform {
  required_version = ">= 1.0"

  required_providers {
    spot = {
      source  = "rackerlabs/spot"
      version = ">= 0.1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "spot" {
  token = var.rackspace_spot_token
}

provider "aws" {
  region = var.aws_region
}

The AWS provider here manages the S3 bucket for storing Airflow logs and artifacts.
Step 3: Create the variable.tf file
variable "rackspace_spot_token" {
  description = "Rackspace Spot authentication token"
  type        = string
  sensitive   = true
}

variable "aws_region" {
  description = "AWS region for S3 bucket"
  type        = string
  default     = "us-east-1"
}
Step 4: Create the rackspace.tf file
You can retrieve the list of available server classes and their specifications from the Terraform data source documentation (Rackspace Spot Server Classes), and quickly view current spot market prices for different server classes in the Rackspace Spot pricing documentation.
Note: Server class names follow the pattern gp.vs1.large-dfw:
- gp = General Purpose family
- vs1 = Generation-1 Virtual Servers
- large = Size tier (small, medium, large, xlarge, etc.)
- dfw = Dallas region
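Unpacking such a name can be done mechanically. A small illustrative sketch (the family expansions are limited to the gp and ch families used in this article; other families are not covered here):

```python
def parse_server_class(name):
    """Split a server class like 'gp.vs1.large-dfw' into its parts."""
    family, generation, rest = name.split(".")
    size, region = rest.rsplit("-", 1)
    families = {"gp": "General Purpose", "ch": "Compute Heavy"}  # from this article
    return {
        "family": families.get(family, family),
        "generation": generation,
        "size": size,
        "region": region,
    }

print(parse_server_class("gp.vs1.large-dfw"))
# → {'family': 'General Purpose', 'generation': 'vs1', 'size': 'large', 'region': 'dfw'}
```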
resource "spot_cloudspace" "airflow_cluster" {
  cloudspace_name    = "airflow-cluster"
  region             = "us-central-dfw-1"
  hacontrol_plane    = false
  wait_until_ready   = true
  kubernetes_version = "1.31.1"
  cni                = "calico"
}

# Spot control plane node pool
resource "spot_spotnodepool" "control_plane" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "gp.vs1.large-dfw" # General purpose Large: 4 vCPU, 15 GiB
  bid_price            = 0.15
  desired_server_count = 2

  labels = {
    "workload" = "airflow-control-plane"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Spot worker node pool
resource "spot_spotnodepool" "workers" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "ch.vs1.medium-dfw" # Compute Heavy Medium: 2 vCPU, 3.75 GiB
  bid_price       = 0.05
  autoscaling     = { min_nodes = 2, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
  }
}
Diversified Server Classes (Single Region, Multiple Capacity Markets)
To deal with capacity shortages, you can use multiple server classes within the same region to diversify across different spot markets:
resource "spot_cloudspace" "airflow_cluster" {
  cloudspace_name    = "airflow-cluster"
  region             = "us-central-dfw-1"
  hacontrol_plane    = false
  wait_until_ready   = true
  kubernetes_version = "1.31.1"
  cni                = "calico"
}

# Control plane pool 1: General purpose instances
resource "spot_spotnodepool" "control_plane_gp" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "gp.vs1.large-dfw" # 4 vCPU, 15 GiB
  bid_price            = 0.15
  desired_server_count = 1

  labels = {
    "workload" = "airflow-control-plane"
    "pool"     = "gp"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Control plane pool 2: Compute-heavy instances (different spot market)
resource "spot_spotnodepool" "control_plane_ch" {
  cloudspace_name      = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class         = "ch.vs1.large-dfw" # 4 vCPU, 7.5 GiB
  bid_price            = 0.15
  desired_server_count = 1

  labels = {
    "workload" = "airflow-control-plane"
    "pool"     = "ch"
  }

  taints = [{
    key    = "workload"
    value  = "airflow-control-plane"
    effect = "NoSchedule"
  }]
}

# Worker pool 1: Compute-heavy instances
resource "spot_spotnodepool" "workers_compute_heavy" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "ch.vs1.medium-dfw"
  bid_price       = 0.05
  autoscaling     = { min_nodes = 1, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
    "pool"     = "compute-heavy"
  }
}

# Worker pool 2: General purpose instances (different capacity market)
resource "spot_spotnodepool" "workers_general" {
  cloudspace_name = spot_cloudspace.airflow_cluster.cloudspace_name
  server_class    = "gp.vs1.medium-dfw"
  bid_price       = 0.08
  autoscaling     = { min_nodes = 1, max_nodes = 3 }

  labels = {
    "workload" = "airflow-workers"
    "pool"     = "general-purpose"
  }
}
General overview of Terraform configuration
Let's take a look at the diversified server class approach. This Terraform configuration provisions a single Kubernetes cluster with diversified node pools to improve spot capacity resilience and cost optimization.
It creates:
- A cloudspace (the Kubernetes cluster and hosted control plane) with its primary region set to us-central-dfw-1.
- Multiple control-plane node pools using different server classes (gp.vs1.large-dfw and ch.vs1.large-dfw) to diversify across spot capacity markets. These pools are intended for core Airflow components such as the scheduler, webserver, API server, and triggerer.
- Multiple worker node pools using different server classes (ch.vs1.medium-dfw and gp.vs1.medium-dfw) that can autoscale based on task demand. By diversifying across server class families, workers can scale even when individual spot markets experience capacity shortages.
Taints and tolerations for resource isolation
As we've seen earlier, the infrastructure required to run Airflow is divided into two categories: the infrastructure needed to run the control plane (scheduler, API server, DAG processor, triggerer, and webserver) and the infrastructure required to run the Airflow workers.
For this architecture, we use Kubernetes taints and tolerations to ensure control plane pods get scheduled on specific nodes.
Tainting control plane nodes and adding tolerations to control plane pods ensures each component type lands on appropriately sized nodes based on their resource requirements. Worker pods lack the control plane toleration, so they're automatically excluded from control plane nodes and land on worker nodes instead.
In this all-spot architecture, we run multiple replicas of critical components (2 scheduler replicas, 2 webserver replicas, 2 API server replicas).
Compute Sizing
Don't use a one-size-fits-all resource configuration: different tasks have different needs, and different node pools have different resource profiles.
Airflow workloads vary significantly in their compute requirements. Some DAGs primarily orchestrate external systems: submitting Spark jobs, calling APIs, triggering cloud functions, and require minimal resources. Others perform compute-intensive work directly within task pods, such as data transformations, machine learning training, or heavy data processing.
Understanding your workload patterns is critical to right-sizing your infrastructure.
When choosing worker node sizes, start with a reasonable baseline, then monitor actual usage.
You're looking for balance. If your worker pods consistently hit their CPU or memory limits, they're undersized and tasks will run slow or fail. If they're only using 20% of available capacity, they're oversized and you're overpaying for compute you don't need. The goal is right-sizing based on observed behavior.
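That balance check can be made concrete. The sketch below classifies a resource from observed usage; the 20%/90% thresholds and the sample numbers are illustrative assumptions, not platform recommendations:

```python
def sizing_verdict(used, allocated, low=0.20, high=0.90):
    """Classify a resource as under/over/right-sized from observed usage."""
    utilization = used / allocated
    if utilization >= high:
        return "undersized"   # consistently near the limit: tasks may slow or fail
    if utilization <= low:
        return "oversized"    # paying for idle capacity
    return "right-sized"

# Hypothetical observations for a worker pod (CPU cores used vs. allocated)
print(sizing_verdict(used=1.9, allocated=2.0))  # → undersized
print(sizing_verdict(used=0.3, allocated=2.0))  # → oversized
print(sizing_verdict(used=1.1, allocated=2.0))  # → right-sized
```

In practice the "used" numbers would come from your metrics stack (e.g. `kubectl top pods` or Prometheus) observed over a representative window, not a single sample.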
In our setup using KubernetesExecutor, the control plane (scheduler and API server) handles the constant work of creating pods, monitoring their state, and cleaning up afterward. This makes them more resource-intensive than you might expect. Workers, on the other hand, are ephemeral pods that spin up per-task and terminate immediately after.
We also factor in the type of workload the workers will execute when provisioning worker node resources and choosing compute sizes.
For the control plane nodes, which run multiple critical Airflow components, including the scheduler (2 replicas), triggerer (2 replicas), webserver (2 replicas), API server (2 replicas), DAG processor (2 replicas), statsd, and PostgreSQL, we use two node pools with similar compute specifications but from different server class families:
- gp.vs1.large-dfw (4 vCPU, 15 GiB memory): 1 node
- ch.vs1.large-dfw (4 vCPU, 7.5 GiB memory): 1 node
This gives the control plane components adequate headroom to operate reliably while diversifying across spot capacity markets.
For the worker pools, we use multiple server classes to ensure capacity availability:
- ch.vs1.medium-dfw (2 vCPU, 3.75 GiB): Compute-optimized for CPU-bound tasks
- gp.vs1.medium-dfw (2 vCPU, 7.5 GiB): General-purpose for balanced workloads
The choice of these specific sizes is based on the workload characteristics in this example. For your use case, you should benchmark your tasks and adjust accordingly.
Server class diversification and multi-bid strategy
We provision multiple node pools with different server classes and bid prices to improve spot capacity resilience and optimize costs.
Why different server classes?
Each server class family (gp.vs1.*, ch.vs1.*) represents a separate spot market. If ch.vs1.medium runs out of capacity or experiences price spikes, workers can still scale using gp.vs1.medium nodes.
Why different bid prices?
Different bid prices allow you to access different price tiers within the spot market. By bidding $0.05 on one pool and $0.08 on another, you're diversifying across price-sensitive segments of the market. When spot prices fluctuate, your lower-bid pools might get reclaimed while higher-bid pools remain stable. This means you're not putting all your capacity in one price tier.
For control plane pools, we use higher bids ($0.15) to ensure stability, since control plane downtime stops all orchestration. For workers, we use lower, varied bids ($0.05-$0.08) to maximize savings while maintaining capacity options across different price points.
This strategy provides capacity resilience and cost optimization without requiring geographic distribution.
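To see what the bid prices above imply, here is the arithmetic for a worst-case monthly compute ceiling at the pools' minimum node counts. Note this is a ceiling set by the bids, not necessarily what you pay, and the figures are purely illustrative:

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Pools and max bids from the diversified configuration above:
# (name, $/hr bid, node count at minimum scale)
pools = [
    ("control-plane gp", 0.15, 1),
    ("control-plane ch", 0.15, 1),
    ("workers ch (min)", 0.05, 1),
    ("workers gp (min)", 0.08, 1),
]

ceiling = sum(bid * count * HOURS_PER_MONTH for _, bid, count in pools)
print(f"worst-case monthly compute ceiling: ${ceiling:.2f}")  # → $313.90
```

Worker autoscaling up to max_nodes would raise this ceiling, and actual spend is governed by the market price at any given time, bounded above by your bid.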
Configure Airflow with Helm values
Create a file kubernetes/airflow-values.yaml and add these Helm values.
executor: KubernetesExecutor

scheduler:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow metadata database (required for real installs)
postgresql:
  enabled: true
  persistence:
    enabled: true
    storageClass: ssd
    size: 10Gi
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "latest"
  primary:
    nodeSelector:
      workload: airflow-control-plane
    tolerations:
      - key: "workload"
        operator: "Equal"
        value: "airflow-control-plane"
        effect: "NoSchedule"

# Redis isn't required for KubernetesExecutor
redis:
  enabled: false

# Persist logs (optional; you can also do remote logging later)
logs:
  persistence:
    enabled: false
    storageClassName: sata
    size: 10Gi

# DAGs from Git repository using git-sync
dags:
  persistence:
    enabled: false
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: <replace-with-branch>
    subPath: "dags/"
    wait: 60
    maxFailures: 0
    depth: 1

triggerer:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  persistence:
    enabled: false

webserver:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  startupProbe:
    failureThreshold: 20
    periodSeconds: 10

apiServer:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"
  startupProbe:
    failureThreshold: 20
    periodSeconds: 10

# DAG Processor (processes DAG files)
dagProcessor:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Statsd exporter for metrics (does not support multiple replicas)
statsd:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow create user job
createUserJob:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

# Airflow database migration job
migrateDatabaseJob:
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

secret:
  - envName: "AIRFLOW_CONN_AWS_DEFAULT"
    secretName: "aws-connection"
    secretKey: "connection-uri"
  - envName: "AIRFLOW_CONN_POSTGRES_DEFAULT"
    secretName: "postgres-connection"
    secretKey: "connection-uri"

# Remote logging to S3 (required for KubernetesExecutor)
config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://your-airflow-logs-bucket/logs"
    remote_log_conn_id: "aws_default"
  core:
    task_log_reader: "s3.task"

Here’s a summary of what’s in the airflow-values.yaml file:
- KubernetesExecutor
- Control-plane components (scheduler/webserver/triggerer/API server/DAG processor) pinned to control plane nodes via nodeSelector + tolerations for the control plane taint
- Task pods land on worker nodes via Airflow's KubernetesExecutor settings (worker pods lack the control plane toleration, so they're automatically excluded from control plane nodes)
- GitSync for DAGs (no RWX dependency)
- Remote logging to S3
- PostgreSQL in-cluster with persistent storage on control plane nodes (for testing/dev environments; for production, consider managed databases such as Rackspace DBaaS)
Executor
executor: KubernetesExecutor
This tells Airflow how to run tasks.
- Each task gets its own dedicated Kubernetes pod
- Pod spins up when task starts, terminates when task completes
- No persistent worker pools (unlike CeleryExecutor)
- Tasks are isolated from each other (separate containers)
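Because every task is its own pod, resources can also be tuned per task. In DAG code this is typically done through Airflow's executor_config with a pod_override; as a dependency-free sketch, the override boils down to a partial pod spec like the one below (the `pod_override` helper and the values are hypothetical illustrations):

```python
def pod_override(cpu, memory):
    """Build a partial pod spec, as the KubernetesExecutor merges into the task pod.
    In DAG code this shape is passed via executor_config={"pod_override": ...}."""
    return {
        "spec": {
            "containers": [{
                "name": "base",  # Airflow's task container is named "base"
                "resources": {
                    "requests": {"cpu": cpu, "memory": memory},
                    "limits": {"cpu": cpu, "memory": memory},
                },
            }],
        }
    }

# A heavy training task gets more room than a light API-call task.
print(pod_override("2", "3Gi")["spec"]["containers"][0]["resources"]["requests"])
# → {'cpu': '2', 'memory': '3Gi'}
```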
Airflow control-plane configuration
scheduler:
  replicas: 2
  nodeSelector:
    workload: airflow-control-plane
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow-control-plane"
      effect: "NoSchedule"

The same pattern applies to webserver, apiServer, dagProcessor, and triggerer.
All core Airflow control plane components (scheduler, webserver, API server, DAG processor, triggerer) share the same configuration pattern. Each runs 2 replicas for high availability, uses nodeSelector to land only on nodes labeled workload: airflow-control-plane, and includes tolerations to schedule on tainted control plane nodes. This ensures all critical components run on properly-sized control plane nodes (4 vCPU, 15 GiB) rather than smaller worker nodes, and prevents worker pods from landing on control plane nodes since they lack the required toleration. The webserver and API server also include extended startupProbe settings (20 failures × 10 seconds = 200 seconds total) to give them enough time to initialize without Kubernetes killing them prematurely.
Airflow metadata database configuration
postgresql:
  enabled: true
  persistence:
    enabled: true
    storageClass: ssd
    size: 10Gi
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "latest"
  primary:
    nodeSelector:
      workload: airflow-control-plane
    tolerations:
      - key: "workload"
        operator: "Equal"
        value: "airflow-control-plane"
        effect: "NoSchedule"

This deploys a single-replica PostgreSQL instance. It works fine for development and testing, but for production you should consider Rackspace DBaaS (fully managed, zero ops overhead).
Rackspace Spot currently provides RWO (ReadWriteOnce) block storage, meaning volumes attach to one node at a time. For HA PostgreSQL in-cluster, you'd need Longhorn for distributed storage with CloudNativePG managing streaming replication across multiple replicas (each gets its own volume).
DAGs (GitSync)
dags:
  persistence:
    enabled: false
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: master
    subPath: "dags/"
    wait: 60
    maxFailures: 0
    depth: 1

This configures DAG loading for Airflow. We use GitSync to pull DAGs from a Git repository rather than relying on PersistentVolumes, so deployments are handled entirely through Git.
Secrets
secret:
  - envName: "AIRFLOW_CONN_AWS_DEFAULT"
    secretName: "aws-connection"
    secretKey: "connection-uri"
  - envName: "AIRFLOW_CONN_POSTGRES_DEFAULT"
    secretName: "postgres-connection"
    secretKey: "connection-uri"

This injects Airflow connections as environment variables from Kubernetes Secrets.
Note: This postgres connection secret will be used later for the ML pipeline.
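Airflow connections supplied through environment variables are expressed as URIs. Here is a small sketch of building one safely with hypothetical credentials; URL-encoding keeps special characters in the password from corrupting the URI:

```python
from urllib.parse import quote

def postgres_conn_uri(user, password, host, port, dbname):
    """Build an Airflow-style connection URI, percent-encoding the credentials."""
    return f"postgres://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{dbname}"

# Hypothetical credentials -- in the cluster this value lives in the
# 'postgres-connection' Kubernetes Secret, not in code.
uri = postgres_conn_uri("airflow", "p@ss/word", "postgres.airflow.svc", 5432, "loans")
print(uri)  # → postgres://airflow:p%40ss%2Fword@postgres.airflow.svc:5432/loans
```

The resulting string is what you would store under the Secret's connection-uri key.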
Deploy the infrastructure
Initialize Terraform to download the required providers:
terraform init
Review the planned changes before applying them:
terraform plan
Apply the configuration to provision the infrastructure:
terraform apply --auto-approve
Connect to the Kubernetes cluster
mkdir -p ~/.kube/rackspace-spot
spotctl cloudspaces get-config \
--name airflow-cluster \
--file ~/.kube/rackspace-spot
# Merge with existing kubeconfig
export KUBECONFIG=~/.kube/config:~/.kube/rackspace-spot/airflow-cluster.yaml
kubectl config view --flatten > ~/.kube/config.tmp
mv ~/.kube/config.tmp ~/.kube/config
chmod 600 ~/.kube/config
# Switch to the new context
kubectl config use-context airflow-cluster
# Verify connection
kubectl get nodes
Install Airflow on the Kubernetes cluster
kubectl create ns airflow
helm install airflow apache-airflow/airflow \
-n airflow \
-f airflow-values.yaml \
--timeout 15m \
--debug
Access the airflow UI:
kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
Then access the Airflow UI at http://localhost:8080. The default credentials are Username: admin, Password: admin.
Conclusion
In this article, we walked through deploying Airflow on Rackspace Spot instances with a cost-optimized, fault-tolerant architecture. We covered infrastructure setup with Terraform, server class diversification to mitigate capacity risks, multi-bid strategies to balance cost and stability, and Helm-based Airflow deployment with proper resource isolation using taints and tolerations. We set up:
- All-spot Airflow cluster with control plane and workers running on spot instances
- Server class diversification across multiple instance families (gp.vs1, ch.vs1) to tap into independent spot markets
- Multi-bid pricing strategy: higher bids for control plane stability, lower bids for worker cost savings
- GitSync for version-controlled DAG deployment
- Remote logging to S3 for worker task pods
- PostgreSQL with persistent storage (with production alternatives discussed)
This infrastructure is ready to run workloads, but we haven't covered how to write DAGs that are optimized for spot interruptions. In the next article, we'll dive into:
- DAG design patterns for spot-tolerant workflows (idempotent tasks, checkpointing, external state management)
- Handling interruptions at the task level (retries, task timeouts, graceful shutdown)
- Real-world ML pipeline example running the loan prediction model end-to-end