Detect container registry rate limits uniformly#6732

Open
agners wants to merge 1 commit into main from
improve-docker-container-registry-toomanyrequests-detection

agners commented Apr 13, 2026

Proposed change

Container registry rate limits reach Supervisor in three distinct shapes:

  1. HTTP 429 from the daemon — currently recognised, but the resulting exception and resolution issue are both hardcoded to "Docker Hub". Since Supervisor/Core/plugin images all live on ghcr.io now, virtually every 429 we see in the field is actually a GHCR throttle that we mislabel. Issue SUPERVISOR-16BK (>115k events, >93k users) is exactly this — the image in the event context is ghcr.io/home-assistant/amd64-hassio-supervisor:latest, yet the user sees a "log into Docker Hub" suggestion.
  2. HTTP 500 with toomanyrequests in the body — not currently recognised. Docker daemons before 28.3.0 wrap an upstream 429 into a 500 to the client. This was fixed upstream by moby/moby commit 23fa0ae74a ("Cleanup http status error checks").
  3. JSON error event during a streaming pull — not currently recognised. POST /images/create returns 200 OK and streams progress/error events, so rate limits that land during layer download arrive as plain text inside the stream and have no HTTP status to key off of. Happens on all recent daemon versions. Issues SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are examples.

Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in docker/interface.py:install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and generate large amounts of Sentry noise. Case 1 is handled but routes every GHCR 429 through Docker-Hub-specific messaging and suggestions.
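
As a rough illustration of how the three shapes differ, the sketch below classifies each one. It is not the code in this PR: `is_rate_limit_response` and `is_rate_limit_stream_event` are hypothetical names, and it assumes the HTTP error exposes a status code and message, and that pull stream events are decoded into dicts carrying an `error` key.

```python
from __future__ import annotations

from http import HTTPStatus

# Marker string named in the description above.
RATE_LIMIT_MARKER = "toomanyrequests"


def is_rate_limit_response(status: int, message: str) -> bool:
    """Cases 1 and 2: decide from the HTTP status and the error body."""
    if status == HTTPStatus.TOO_MANY_REQUESTS:
        # Case 1: the daemon passes the upstream 429 through.
        return True
    # Case 2: older daemons report the upstream 429 as a 500.
    return (
        status == HTTPStatus.INTERNAL_SERVER_ERROR
        and RATE_LIMIT_MARKER in message.lower()
    )


def is_rate_limit_stream_event(event: dict) -> bool:
    """Case 3: an error event inside a 200 OK pull stream (assumed dict shape)."""
    error = event.get("error")
    return isinstance(error, str) and RATE_LIMIT_MARKER in error.lower()
```

Keeping the stream check separate from the status check mirrors the point made above: case 3 has no HTTP status to inspect, so only the event text is available.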

This PR addresses all three shapes and splits the registry-specific handling so ghcr.io failures produce a new GITHUB_RATELIMIT issue with appropriate guidance (no misleading Docker Hub login suggestion), while Docker Hub failures keep their existing behaviour.

Summary of the changes:

  • New DockerRegistryRateLimitExceeded base exception with DockerHubRateLimitExceeded and a new GithubContainerRegistryRateLimitExceeded as subclasses. All extend APITooManyRequests so callers and future retry logic can key off a single type.
  • New GITHUB_RATELIMIT IssueType (no REGISTRY_LOGIN suggestion, since GHCR authentication is different from Docker Hub).
  • PullLogEntry.exception now maps stream errors containing toomanyrequests to DockerRegistryRateLimitExceeded (case 3).
  • docker/interface.py:install() routes all three cases through a single _registry_rate_limit_exception() helper that picks the right resolution issue, suggestion and exception subclass based on the image's registry.
  • utils/sentry.py filters APITooManyRequests (and anything wrapping it via __cause__) in both capture_exception and async_capture_exception. Single policy point, every caller benefits, no per-site changes needed.
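
A minimal sketch of the exception hierarchy and registry-based routing described in the list above. The class names come from this description; `APITooManyRequests` is a stand-in for the existing Supervisor class, and `registry_of` and `rate_limit_exception_for` are hypothetical helpers, not the `_registry_rate_limit_exception()` implementation. The fallback to the base class for other registries is an assumption.

```python
class APITooManyRequests(Exception):
    """Stand-in for Supervisor's existing exception of the same name."""


class DockerRegistryRateLimitExceeded(APITooManyRequests):
    """Common base for registry rate limits."""


class DockerHubRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """Rate limit reported for a Docker Hub image."""


class GithubContainerRegistryRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """Rate limit reported for a ghcr.io image."""


def registry_of(image: str) -> str:
    """Return the registry host of an image reference (hypothetical helper)."""
    first, sep, _ = image.partition("/")
    # The first component is a registry host only if it looks like one;
    # otherwise the reference implicitly points at Docker Hub.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def rate_limit_exception_for(image: str) -> DockerRegistryRateLimitExceeded:
    """Pick the exception subclass from the image's registry."""
    registry = registry_of(image)
    if registry == "ghcr.io":
        return GithubContainerRegistryRateLimitExceeded(f"{image}: rate limited")
    if registry == "docker.io":
        return DockerHubRateLimitExceeded(f"{image}: rate limited")
    # Assumption: other registries fall back to the common base class.
    return DockerRegistryRateLimitExceeded(f"{image}: rate limited")
```

Because every subclass extends `APITooManyRequests`, a caller can catch that single type regardless of which registry throttled the pull.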

Callers (supervisor.update(), plugin manager, HA core update) are intentionally unchanged: UPDATE_FAILED issues still get created alongside the registry-specific rate limit issue, giving users both the cause (rate limit) and the effect (update failed) in the resolution center.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Container registry rate limits reach Supervisor in three distinct shapes:

  1. HTTP 429 from the daemon - recognised today, but the exception and
     resolution issue are hardcoded to Docker Hub. Since Core/Supervisor/
     plugin images all live on ghcr.io now, virtually every 429 we see in
     the field is actually a GHCR throttle that we mislabel. The biggest
     Sentry issue (SUPERVISOR-16BK) has >115k events / >93k users, all
     pulling a ghcr.io image, yet each user is told to "log into
     Docker Hub".
  2. HTTP 500 with 'toomanyrequests' in the body - not recognised. Docker
     daemons before 28.3.0 wrap upstream 429s as 500 (fixed upstream by
     moby/moby 23fa0ae74a, "Cleanup http status error checks"). The large
     fleet on older daemons still produces this shape.
  3. JSON error event during a streaming pull - not recognised. Once the
     daemon starts writing the 200 OK response body the status is locked
     in, so rate limits that land during layer download arrive as plain
     text in the pull stream. Happens on all recent daemon versions -
     SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are
     two large examples.

Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in
install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and
generate large amounts of Sentry noise. Case 1 is detected but routes
every GHCR 429 through Docker-Hub-specific messaging and suggestions.

Changes:

- Add DockerRegistryRateLimitExceeded as the common base class and
  GithubContainerRegistryRateLimitExceeded alongside the existing
  DockerHubRateLimitExceeded. All extend APITooManyRequests so callers
  and retry logic can key off a single type.
- Add GITHUB_RATELIMIT IssueType so GHCR failures don't show the
  "log in to Docker Hub" suggestion that DOCKER_RATELIMIT carries.
- PullLogEntry.exception now maps stream errors containing
  'toomanyrequests' to DockerRegistryRateLimitExceeded (case 3).
- docker/interface.py:install() routes all three cases through a single
  _registry_rate_limit_exception() helper that picks the right issue
  type, suggestion and exception subclass based on the image's registry.
- utils/sentry.py filters APITooManyRequests (and anything wrapping it
  via __cause__) in capture_exception / async_capture_exception. One
  point of policy, every caller benefits.

Callers (supervisor.update(), plugin manager, homeassistant core) are
unchanged - UPDATE_FAILED issues still get created alongside the
registry-specific rate limit issue, giving users the full picture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agners commented Apr 13, 2026

Maybe the GITHUB_RATELIMIT issue type is not really all that useful since it is not actionable 🤔. Maybe simply log (for the task case) or raise errors to the caller if it happens on a Supervisor API request?

@agners agners requested a review from mdegat01 April 13, 2026 18:14
@agners agners added the new-feature A new feature label Apr 13, 2026

mdegat01 left a comment

Small things. Going to approve so you don't have to wait for another review as I assume the fixes are small (or possibly rejected if the merge I suggested in the second one is impossible). If significant changes result I'll take another look 👍

suggestions=[SuggestionType.REGISTRY_LOGIN],
)
raise DockerHubRateLimitExceeded(_LOGGER.error) from err
# Pre-28.3.0 daemons wrap registry rate limits as HTTP 500

So I just checked and we updated HAOS to Docker 28.3.0 in 16.0. So 15.2 is the last version of HAOS that used a pre-28.3.0 version of Docker, released April 14th, 2025. Is this even in our support window anymore? If so, the comment here should name the Supervisor version in which this can be dropped.

Comment on lines +82 to +83
if _is_rate_limit(err):
    return

Can we move this to our existing filter method here:

def filter_data(coresys: CoreSys, event: Event, hint: Hint) -> Event | None:
    """Filter event data before sending to sentry."""
    # Ignore some exceptions
    if "exc_info" in hint:
        _, exc_value, _ = hint["exc_info"]
        if isinstance(exc_value, (AppConfigurationError)):
            return None

Then we can do one isinstance check like isinstance(exc_value, (AppConfigurationError, APITooManyRequests)) instead of multiple as they're kind of expensive. Or even if it can't be merged due to the err = err.__cause__ bit it still feels like our filtering out of exception noise should all be in one place. Or is the err = err.__cause__ why it can't be merged into there?
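
One way the single check and the `__cause__` walk could coexist is sketched below. This is an assumption-laden illustration, not code from the PR or from `filter_data`: both exception classes are stand-ins, and `IGNORED_EXCEPTIONS` and `is_ignored` are hypothetical names.

```python
from __future__ import annotations


class APITooManyRequests(Exception):
    """Stand-in for Supervisor's existing exception of the same name."""


class AppConfigurationError(Exception):
    """Stand-in for the exception already ignored in filter_data."""


# One tuple, so each exception in the chain costs a single isinstance call.
IGNORED_EXCEPTIONS = (AppConfigurationError, APITooManyRequests)


def is_ignored(exc: BaseException | None) -> bool:
    """Return True if exc, or anything in its __cause__ chain, is ignored."""
    seen: set[int] = set()
    # Track visited exceptions to stay safe against a cyclic cause chain.
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, IGNORED_EXCEPTIONS):
            return True
        seen.add(id(exc))
        exc = exc.__cause__
    return False
```

A filter such as `filter_data` could then call `is_ignored(exc_value)` and return `None` on a match. Note that in this sketch the cause walk also applies to `AppConfigurationError`, which may or may not be desired.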
