developmentseed/ecs-task-shield

ECS Task Shield

A robust, framework-agnostic Python library for managing AWS ECS Task Scale-In Protection.

🛡️ Perfect Pairing: If your architecture specifically uses Celery with an AWS SQS broker, we highly recommend pairing this library with Celery SQS Extender. While ECS Task Shield protects your worker container from being killed by auto-scaling, the SQS Extender automatically protects the underlying message from timing out and being duplicate-processed by another worker!

For background on why this feature is essential for dynamic workloads, check out the AWS announcement blog post on Amazon ECS task scale-in protection.

Originally intended for Celery but usable in other synchronous task processing frameworks, this library handles the complexities of running concurrent background workers inside ECS. It uses file-based locking to guarantee that shared task protection state is safely toggled, debounced, and healed across multiple isolated processes running in the same container.

Use Cases

AWS ECS is a fantastic environment for auto-scaling background workers, but scale-in events can interrupt long-running or critical tasks. This library is perfect for protecting workloads such as:

  • AI/ML Training & Inference: Protecting expensive GPU instances from spinning down while mid-way through generating outputs, running batch inference, or training an epoch.
  • Video & Audio Transcoding: Safeguarding heavy media processing jobs that take significant time and compute resources to complete.
  • Non-Idempotent Operations: Financial transactions, payment processing, or third-party API synchronizations where being SIGKILLed halfway through corrupts external state.

Features

  • Framework Agnostic: Designed to wrap any Python function, so it works with any synchronous background worker framework.
  • Multi-Process Safe: Safely tracks concurrent task counts across multiple worker processes using filelock.
  • API Debouncing: Intelligently debounces AWS ECS Agent API calls to prevent rate-limiting when many short tasks start concurrently.
  • Dynamic Extension: Allows long-running tasks to dynamically extend their protection window mid-execution.
  • Self-Healing: Automatically recovers from hard worker crashes (e.g., SIGKILL or OOM errors) to prevent "zombie" protection locks.
  • Strict/Relaxed Modes: Configurable behavior to strictly fail tasks if protection cannot be acquired, or to relax enforcement during temporary ECS agent outages.
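For context on the "ECS Agent API calls" mentioned above: AWS documents the task protection endpoint as an HTTP PUT to `$ECS_AGENT_URI/task-protection/v1/state`. The sketch below only builds such a request with the standard library and does not send it. The helper name `build_protection_request` is hypothetical and is not part of this library, which may issue the call differently.

```python
import json
import os
import urllib.request
from typing import Optional


def build_protection_request(
    enabled: bool, expires_in_minutes: Optional[int] = None
) -> urllib.request.Request:
    """Build (but do not send) a task protection request for the ECS agent.

    Hypothetical helper for illustration only.
    """
    # ECS injects ECS_AGENT_URI into the container environment.
    agent_uri = os.environ["ECS_AGENT_URI"]
    body = {"ProtectionEnabled": enabled}
    if enabled and expires_in_minutes is not None:
        body["ExpiresInMinutes"] = expires_in_minutes
    return urllib.request.Request(
        url=f"{agent_uri}/task-protection/v1/state",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Inside a running task, passing the result to `urllib.request.urlopen` would perform the call. Every such call is a network round trip to the agent, which is why debouncing matters when many short tasks start at once.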

IAM Configuration

To allow your ECS tasks to manage their own protection state, you must grant the task's Task Role (not the Task Execution Role) the following IAM permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecs:GetTaskProtection", "ecs:UpdateTaskProtection"],
      "Resource": ["*"]
    }
  ]
}

Note: You can restrict the Resource ARN to your specific cluster(s) (e.g., arn:aws:ecs:region:account:task/my-cluster/*) if required by your organization's security posture. To restrict this further to specific task definitions, you can append an IAM Condition block to the statement.
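As a rough illustration of the cluster-scoped variant described in the note, the following sketch generates the same policy with the Resource narrowed to one cluster. The function name `scoped_protection_policy` and the sample region, account ID, and cluster name are placeholders, not values from this project.

```python
import json


def scoped_protection_policy(region: str, account_id: str, cluster: str) -> dict:
    """Return the task protection policy limited to tasks in one cluster."""
    # Task ARNs follow arn:aws:ecs:<region>:<account>:task/<cluster>/<task-id>.
    resource = f"arn:aws:ecs:{region}:{account_id}:task/{cluster}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ecs:GetTaskProtection", "ecs:UpdateTaskProtection"],
                "Resource": [resource],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own region, account, and cluster.
    policy = scoped_protection_policy("us-east-1", "123456789012", "my-cluster")
    print(json.dumps(policy, indent=2))
```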

Quick Start

The easiest way to use the library is via the @protect_function decorator. You can place this under your framework's task decorator.

Example: Celery

from datetime import timedelta
from ecs_task_shield import protect_function

@celery_app.task()
@protect_function(
    duration=timedelta(minutes=15),
    disable_after=True,
    ignore_exception=True, # Set to False for critical, non-idempotent tasks
)
def process_video():
    print("Processing heavy workload. Safe from ECS scale-in!")

Note: Always ensure that your worker framework's decorator (if it uses one) wraps the outside of @protect_function.
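To make the strict versus relaxed distinction concrete, here is a minimal sketch of how a flag like `ignore_exception` could behave. It is not this library's implementation: `shielded`, `ProtectionError`, and the `acquire`/`release` callables are hypothetical stand-ins, and the reading of `ignore_exception=True` as "run unprotected if protection fails" is an assumption based on the Features list.

```python
import functools
import logging

logger = logging.getLogger(__name__)


class ProtectionError(RuntimeError):
    """Hypothetical error raised when protection cannot be acquired."""


def shielded(acquire, release, ignore_exception=True):
    """Wrap a function so it runs between acquire() and release()."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            acquired = False
            try:
                acquire()
                acquired = True
            except ProtectionError:
                if not ignore_exception:
                    # Strict mode: fail before the task body runs.
                    raise
                # Relaxed mode: log and continue without protection.
                logger.warning("Protection unavailable; running unprotected")
            try:
                return func(*args, **kwargs)
            finally:
                # Only release what was actually acquired.
                if acquired:
                    release()

        return wrapper

    return decorator
```

Under this reading, relaxed mode suits idempotent jobs that can be retried safely, while strict mode suits jobs that must never start without protection.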

Advanced Usage

Dynamic Protection Extension

If you have a task where the total execution time is unpredictable, you can inject the protection_ctx into your function signature to extend the protection window dynamically.

from datetime import timedelta
from ecs_task_shield import Injected, ProtectionContext, protect_function

@protect_function(duration=timedelta(minutes=10))
def massive_batch_job(
    data_list,
    # Injected() satisfies static type checkers (like mypy/pyright) by providing a
    # mock default, since the decorator injects the real context at runtime.
    protection_ctx: ProtectionContext = Injected()
):
    for index, item in enumerate(data_list):
        # Total runtime is unpredictable, so every 100 items request
        # 10 more minutes of protection, counted from right now.
        if index % 100 == 0:
            protection_ctx.extend(timedelta(minutes=10))
        process_item(item)

How Concurrency Works

If you run a worker node with concurrency (e.g., Celery with --concurrency=4 or an RQ worker pool), multiple isolated processes are handling jobs simultaneously in the same ECS container. This library coordinates them using a lightweight /tmp/ecs_protection_state.json file and an fcntl file lock.

  1. When Worker A starts a task, it calls the ECS Agent to protect the container.
  2. When Worker B starts a concurrent task, it increments the local active task count but skips the ECS API call (debouncing).
  3. When Worker A finishes, it decrements the count but skips unprotecting the container.
  4. When Worker B finishes, the active count hits 0, and it safely calls the ECS Agent to remove protection.
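The four steps above can be approximated with a reference count stored in a locked JSON file. This is a simplified sketch, not the library's code: the class `SharedCounter`, its method names, and the `{"active": n}` state layout are assumptions, and it uses the standard library's `fcntl.flock` (Unix only) directly. The `protect` and `unprotect` callables stand in for the ECS Agent calls.

```python
import contextlib
import fcntl
import json
import os


class SharedCounter:
    """Reference-count active tasks across processes via a locked file."""

    def __init__(self, path, protect, unprotect):
        self.path = path
        self.protect = protect      # called when the count goes 0 -> 1
        self.unprotect = unprotect  # called when the count returns to 0

    @contextlib.contextmanager
    def _locked_state(self):
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o600)
        with os.fdopen(fd, "r+") as f:
            # Exclusive lock; released when the file is closed.
            fcntl.flock(f, fcntl.LOCK_EX)
            raw = f.read()
            state = json.loads(raw) if raw else {"active": 0}
            yield state
            # Persist the updated state while still holding the lock.
            f.seek(0)
            f.truncate()
            json.dump(state, f)

    def task_started(self):
        with self._locked_state() as state:
            if state["active"] == 0:
                self.protect()  # first active task enables protection
            state["active"] += 1

    def task_finished(self):
        with self._locked_state() as state:
            state["active"] = max(state["active"] - 1, 0)
            if state["active"] == 0:
                self.unprotect()  # last active task removes protection
```

Making the callbacks while the lock is held keeps the count and the protection state from drifting apart if two workers start and finish at nearly the same moment. This sketch does not cover the self-healing or expiry handling described under Features.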
