Giorgos Myrianthous

Summary

The article provides troubleshooting solutions for the "Task received SIGTERM signal" error in Apache Airflow, which can cause DAGs to fail.

Abstract

The article addresses a common issue encountered when migrating DAGs from Airflow 1 to Airflow 2, where tasks receive a SIGTERM signal leading to failures. The author explores several potential causes and solutions for this problem, including increasing the dagrun_timeout value for DAGs with long-running tasks, ensuring sufficient memory for Airflow tasks to prevent out-of-memory errors, checking for high CPU usage in the metadata database and adjusting the job_heartbeat_sec configuration, and disabling the "Mini Scheduler" feature due to a bug that may cause tasks to be prematurely terminated. The author emphasizes that the appropriate solution may vary based on the specific use case and that it's possible to encounter multiple issues simultaneously, necessitating a combination of fixes.

Opinions

  • The author suggests that the error message "Task received SIGTERM signal" is not informative, indicating a need for better error handling or messaging in Apache Airflow.
  • The author implies that the default dagrun_timeout may not be suitable for all DAGs, especially those with long-running tasks, and that users should adjust this setting according to their needs.
  • There is an opinion that deploying Airflow on the Cloud requires special attention to memory usage and Kubernetes pod eviction events, which could be a source of the SIGTERM issue.
  • The author points out that using SQLite for the metadata database in a production environment is not recommended due to potential CPU overutilization issues.
  • The article conveys that the "Mini Scheduler" feature, while intended to improve performance, has a known bug that can lead to tasks being killed, and thus the author recommends disabling it until the bug is resolved in a future Airflow release.
  • The author encourages readers to consider applying multiple solutions if their configuration is affected by more than one of the discussed issues.

How To Fix Task received SIGTERM signal In Airflow

Fixing the SIGTERM signal in Apache Airflow tasks

Photo by Jeremy Perkins on Unsplash

Introduction

While recently migrating DAGs from Airflow 1 (v1.10.15) to Airflow 2 (v2.2.5), I spent a lot of time trying to figure out an error that some of the DAGs were raising and that wasn’t informative at all.

WARNING airflow.exceptions.AirflowException: Task received SIGTERM signal
INFO - Marking task as FAILED.

I spent some time trying out possible solutions that I found online, but none of them seemed to work for me.

In today’s article I will go through a few potential solutions to the SIGTERM signal that is sent to tasks, causing Airflow DAGs to fail. Depending on your configuration and your specific use case, a different solution may work for you, so make sure to go through each proposed solution carefully and try it out.

DAG run timeout

One reason why your task may be receiving a SIGTERM signal is a dagrun_timeout value that is too short. The DAG class takes this argument, which specifies how long a DagRun should be up before timing out and failing, so that new DagRuns can be created. Note that the timeout is only enforced for scheduled DagRuns.

For DAGs containing many long-running tasks, there’s a chance that dagrun_timeout is exceeded. The actively running tasks will then receive a SIGTERM signal so that the DAG can fail and a new DagRun can be created.

You can check the duration of a DagRun in the Airflow UI. If you observe that it reaches or exceeds the dagrun_timeout value specified when creating the DAG instance, you can increase the timeout to a reasonable amount of time for your specific use case.

Note that this setting applies to the DAG as a whole, so you need to come up with a value that allows enough time for all the tasks included in your DAG to run.

from datetime import datetime, timedelta

from airflow.models.dag import DAG

with DAG(
    'my_dag',
    start_date=datetime(2016, 1, 1),
    schedule_interval='0 * * * *',
    # Maximum time a scheduled DagRun may stay up before it is failed
    dagrun_timeout=timedelta(minutes=60),
) as dag:
    ...
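
To judge whether a given run hit the limit, you can compare its duration with the timeout. The following is a minimal sketch in plain Python, not part of Airflow: the function name exceeded_timeout and the sample timestamps are hypothetical, and in practice the start and end times would be the ones shown for the DagRun in the Airflow UI.

```python
from datetime import datetime, timedelta


def exceeded_timeout(start: datetime, end: datetime, timeout: timedelta) -> bool:
    """Return True if a run lasting from `start` to `end` went past `timeout`."""
    return (end - start) > timeout


# Hypothetical run: started at 10:00 and ended at 11:05, with a 60-minute limit.
run_start = datetime(2022, 5, 1, 10, 0)
run_end = datetime(2022, 5, 1, 11, 5)
print(exceeded_timeout(run_start, run_end, timedelta(minutes=60)))
```

If this kind of check returns True for the failed runs, a larger dagrun_timeout is worth trying first.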

Running out of memory

Another possibility is that the machine running an Airflow task runs out of memory. Depending on how you deployed Airflow, you may need to inspect the memory usage of the workers and make sure that they have sufficient memory.

For instance, if your deployment is in the cloud, you may have to check whether any of the Kubernetes pods were evicted. Pods are usually evicted from resource-starved nodes, so this may be the reason why your Airflow task is receiving a SIGTERM signal.
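
One way to get a feel for how much memory a task needs is to log its peak memory usage from inside the task. The sketch below uses Python’s standard resource module (available on Unix-like systems only); the names peak_memory_mib and my_task_callable are hypothetical, and the task body is a stand-in for your own logic.

```python
import resource
import sys


def peak_memory_mib() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def my_task_callable():
    # Hypothetical task body: allocate some data, then report peak memory.
    data = [0] * 1_000_000
    print(f"Peak memory so far: {peak_memory_mib():.1f} MiB")
    return len(data)
```

Comparing the logged value with the memory available to the worker (or the pod’s memory limit) can indicate whether the task is close to the ceiling.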

Metadata Database draining the CPU

Another commonly reported issue that may be causing Airflow Tasks to receive SIGTERM signals is the CPU usage on the metadata database.

By default, Airflow uses SQLite, which is intended for development purposes only. For anything beyond that, Airflow was designed to support PostgreSQL, MySQL, or MSSQL as the database backend.

There’s a chance that the CPU usage on the database is at 100% and this may be the reason why your Airflow tasks are receiving a SIGTERM signal. If this is the case, then you should consider increasing the value of the job_heartbeat_sec configuration (or the AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC environment variable), which is set to 5 seconds by default.

job_heartbeat_sec

Task instances listen for external kill signal (when you clear tasks from the CLI or the UI), this defines the frequency at which they should listen (in seconds).

- Airflow Documentation

In the Airflow configuration file airflow.cfg make sure to specify this configuration under the scheduler section as illustrated below.

[scheduler]
job_heartbeat_sec = 20

Alternatively, you can modify the value of this configuration through the corresponding environment variable:

export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=20

If CPU consumption at the database level was the issue, then increasing the above configuration should significantly reduce CPU usage.
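
The environment variable name follows the pattern AIRFLOW__{SECTION}__{KEY}, and a value set this way takes precedence over airflow.cfg. The sketch below only illustrates that naming and precedence using the standard library; it is not Airflow’s own implementation, and the helper names env_var_name and resolve_option are hypothetical.

```python
import configparser
import os


def env_var_name(section: str, key: str) -> str:
    """Build the AIRFLOW__{SECTION}__{KEY} environment variable name."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def resolve_option(cfg_text: str, section: str, key: str, environ=None) -> str:
    """Illustrative lookup: the environment variable wins over the cfg text."""
    environ = os.environ if environ is None else environ
    name = env_var_name(section, key)
    if name in environ:
        return environ[name]
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get(section, key)


cfg = "[scheduler]\njob_heartbeat_sec = 20\n"
print(env_var_name("scheduler", "job_heartbeat_sec"))
print(resolve_option(cfg, "scheduler", "job_heartbeat_sec", environ={}))
```

This matters when both are set: if the deployment exports the environment variable, editing airflow.cfg alone will not change the effective value.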

Disable “Mini Scheduler”

By default, the task supervisor process attempts to schedule more tasks of the same Airflow DAG in order to improve performance and help the DAG finish in less time.

This behaviour is controlled by the schedule_after_task_execution setting, which defaults to True.

schedule_after_task_execution

Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of the same DAG. Leaving this on will mean tasks in the same DAG execute quicker, but might starve out other dags in some circumstances.

- Airflow Documentation

Due to a bug in Airflow, the chances of tasks being killed by the LocalTaskJob heartbeat were pretty high. Therefore, one possible solution is to simply disable the mini scheduler.

In your Airflow configuration file airflow.cfg, you need to set schedule_after_task_execution to False.

[scheduler]
schedule_after_task_execution = False

Alternatively, this configuration can be overridden through the AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION environment variable:

export AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False
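
To confirm which value is actually in effect, you can ask Airflow’s configuration object from the environment the workers run in. This is a small sketch, assuming Airflow 2 is installed there; the wrapper name mini_scheduler_enabled is hypothetical, and it returns None when Airflow cannot be imported.

```python
def mini_scheduler_enabled():
    """Return the effective setting, or None if Airflow is not importable."""
    try:
        from airflow.configuration import conf
    except ImportError:
        return None
    # Reads the value after airflow.cfg and environment overrides are applied.
    return conf.getboolean("scheduler", "schedule_after_task_execution")


print(mini_scheduler_enabled())
```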

If this was the problem in your case, then you may also want to consider upgrading Airflow to a version in which this bug has been fixed.

Final Thoughts

In today’s tutorial we discussed the SIGTERM signal that is occasionally sent to Airflow tasks, causing DAGs to fail. We went through a few potential reasons why this may be happening and showed how to overcome the problem depending on your specific use case.

Note that there’s also a chance that your configuration suffers from more than one of the problems discussed in this tutorial, so you may have to apply a combination of the solutions we discussed today in order to get rid of the SIGTERM signal.
