Building Fault-Tolerant Airflow Pipelines on Spot Infrastructure
This article explores what actually happens when Apache Airflow runs on spot instances, using experiments that simulate node preemption on both control plane and worker nodes. It walks through how tasks recover through retries, how S3 enables checkpointing so that completed steps are not rerun, and how validation strategies such as success markers guard against partial outputs. It also highlights the limitations of this approach, particularly around the Airflow metadata database, and outlines the architectural patterns required to build a fault-tolerant Airflow system on interruptible infrastructure.
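To make the success-marker idea concrete, here is a minimal sketch rather than the article's actual implementation. It assumes each step writes its output under a key prefix and writes an empty marker object last, so a retry after preemption can tell finished output from partial output. `InMemoryStore`, `run_step`, and the `_SUCCESS` marker name are hypothetical; the in-memory store stands in for an S3 bucket so the example runs without credentials.

```python
from typing import Callable, Dict

SUCCESS_MARKER = "_SUCCESS"  # hypothetical marker object name


class InMemoryStore:
    """Stand-in for an S3 bucket: maps object keys to bytes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def exists(self, key: str) -> bool:
        return key in self.objects


def run_step(store: InMemoryStore, prefix: str, compute: Callable[[], bytes]) -> bool:
    """Run `compute` unless a success marker already exists under `prefix`.

    Returns True if the step executed, False if it was skipped.
    """
    marker_key = f"{prefix}/{SUCCESS_MARKER}"
    if store.exists(marker_key):
        # A previous attempt finished completely; do not redo the work.
        return False

    # No marker: any output under the prefix may be partial, so recompute
    # and overwrite it.
    data = compute()
    store.put(f"{prefix}/part-0000", data)

    # Write the marker last, only after the output is fully written.
    store.put(marker_key, b"")
    return True
```

In a real deployment the `exists` check would be an object lookup against S3 (for example a `head_object` call in boto3), and the function would run inside a task configured with Airflow's `retries` setting, so a preempted attempt is retried and already-completed steps are skipped.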





