Container registry rate limits reach Supervisor in three distinct shapes:
1. HTTP 429 from the daemon - recognised today, but the exception and
resolution issue are hardcoded to Docker Hub. Since Core/Supervisor/
plugin images all live on ghcr.io now, virtually every 429 we see in
the field is actually a GHCR throttle that we mislabel. The biggest
Sentry issue (SUPERVISOR-16BK) has >115k events / >93k users, all
pulling a ghcr.io image, yet each user is told to "log into
Docker Hub".
2. HTTP 500 with 'toomanyrequests' in the body - not recognised. Docker
daemons before 28.3.0 wrap upstream 429s as 500 (fixed upstream by
moby/moby 23fa0ae74a, "Cleanup http status error checks"). The large
fleet on older daemons still produces this shape.
3. JSON error event during a streaming pull - not recognised. Once the
daemon starts writing the 200 OK response body the status is locked
in, so rate limits that land during layer download arrive as plain
text in the pull stream. Happens on all recent daemon versions -
SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are
two large examples.
Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in
install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and
generate large amounts of Sentry noise. Case 1 is detected but routes
every GHCR 429 through Docker-Hub-specific messaging and suggestions.
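The three shapes can be told apart with two small predicates. The sketch below is illustrative only: DaemonError, is_rate_limit_response and is_rate_limit_event are hypothetical names, not Supervisor code, and it assumes that a stream event carries its failure text under an "error" key.

```python
from dataclasses import dataclass

# Marker string that registries put in rate limit error text.
RATE_LIMIT_MARKER = "toomanyrequests"


@dataclass
class DaemonError:
    """Hypothetical stand-in for an error response from the Docker daemon."""

    status: int
    message: str


def is_rate_limit_response(err: DaemonError) -> bool:
    """Cases 1 and 2: a real 429, or a 500 whose body carries the marker."""
    if err.status == 429:
        return True
    return err.status == 500 and RATE_LIMIT_MARKER in err.message.lower()


def is_rate_limit_event(event: dict) -> bool:
    """Case 3: an error event inside a pull stream that already returned 200 OK.

    Assumes the event is a decoded JSON object with an optional "error" key.
    """
    return RATE_LIMIT_MARKER in str(event.get("error", "")).lower()
```

Matching on the marker string rather than the status code is what lets cases 2 and 3 be recognised at all, since neither of them carries a 429.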
Changes:
- Add DockerRegistryRateLimitExceeded as the common base class and
GithubContainerRegistryRateLimitExceeded alongside the existing
DockerHubRateLimitExceeded. All extend APITooManyRequests so callers
and retry logic can key off a single type.
- Add GITHUB_RATELIMIT IssueType so GHCR failures don't show the
"log in to Docker Hub" suggestion that DOCKER_RATELIMIT carries.
- PullLogEntry.exception now maps stream errors containing
'toomanyrequests' to DockerRegistryRateLimitExceeded (case 3).
- docker/interface.py:install() routes all three cases through a single
_registry_rate_limit_exception() helper that picks the right issue
type, suggestion and exception subclass based on the image's registry.
- utils/sentry.py filters APITooManyRequests (and anything wrapping it
via __cause__) in capture_exception / async_capture_exception. One
point of policy, every caller benefits.
Callers (supervisor.update(), plugin manager, homeassistant core) are
unchanged - UPDATE_FAILED issues still get created alongside the
registry-specific rate limit issue, giving users the full picture.
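A minimal sketch of the exception hierarchy and the registry-based selection described above. The class names come from this change, but their bodies are simplified stand-ins, and registry_of and pick_rate_limit_exception are hypothetical helpers rather than the actual _registry_rate_limit_exception() implementation. The host detection follows the usual Docker reference convention that a first path component containing "." or ":" (or equal to "localhost") is a registry host.

```python
class APITooManyRequests(Exception):
    """Simplified stand-in for Supervisor's existing exception."""


class DockerRegistryRateLimitExceeded(APITooManyRequests):
    """Common base for any container registry rate limit."""


class DockerHubRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """Rate limit raised for Docker Hub images."""


class GithubContainerRegistryRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """Rate limit raised for ghcr.io images."""


def registry_of(image: str) -> str:
    """Return the registry host of an image reference (hypothetical helper)."""
    first, sep, _ = image.partition("/")
    # Without a slash the whole string is a repository name, e.g. "ubuntu:22.04".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def pick_rate_limit_exception(image: str) -> DockerRegistryRateLimitExceeded:
    """Choose the exception subclass from the image's registry."""
    registry = registry_of(image)
    if registry == "ghcr.io":
        return GithubContainerRegistryRateLimitExceeded(image)
    if registry == "docker.io":
        return DockerHubRateLimitExceeded(image)
    # Any other registry falls back to the generic base class.
    return DockerRegistryRateLimitExceeded(image)
```

Because every subclass extends APITooManyRequests, a caller that only cares about "was this a rate limit" can catch the single base type and ignore which registry produced it.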
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Maybe the |
    suggestions=[SuggestionType.REGISTRY_LOGIN],
)
raise DockerHubRateLimitExceeded(_LOGGER.error) from err
# Pre-28.3.0 daemons wrap registry rate limits as HTTP 500
So I just checked: HAOS moved to Docker 28.3.0 in 16.0, which makes 15.2 (released April 14th, 2025) the last HAOS version that shipped a pre-28.3.0 Docker. Is that even in our support window anymore? If it is, the comment here should name the Supervisor version in which this workaround can be dropped.
if _is_rate_limit(err):
    return
Can we move this to our existing filter method here:
supervisor/supervisor/misc/filter.py
Lines 44 to 50 in a504d85
Then we can do one isinstance check like isinstance(exc_value, (AppConfigurationError, APITooManyRequests)) instead of several, as they're kind of expensive. Or, even if it can't be merged because of the err = err.__cause__ bit, it still feels like all our filtering of exception noise should live in one place. Or is the err = err.__cause__ bit the reason it can't be merged into there?
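For illustration, one way the two ideas could coexist in a single filter is to walk the __cause__ chain and do one tuple-based isinstance check per link. This is only a sketch under assumptions: should_drop is a hypothetical name, the two exception classes are stand-ins for the real ones, and nothing here reflects the actual contents of misc/filter.py.

```python
from __future__ import annotations


class APITooManyRequests(Exception):
    """Stand-in for the Supervisor exception of the same name."""


class AppConfigurationError(Exception):
    """Stand-in for the Supervisor exception of the same name."""


# A single tuple keeps it to one isinstance call per exception in the chain.
FILTERED = (AppConfigurationError, APITooManyRequests)


def should_drop(exc: BaseException | None) -> bool:
    """Return True if exc, or anything it was raised from, is filtered noise."""
    seen: set[int] = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, FILTERED):
            return True
        seen.add(id(exc))  # guard against a cyclic __cause__ chain
        exc = exc.__cause__
    return False
```

With this shape, an error raised with "from err" on top of a rate limit is dropped just like the rate limit itself, while unrelated exceptions still reach Sentry.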
Proposed change
Container registry rate limits reach Supervisor in three distinct shapes:
1. HTTP 429 from the daemon - recognised today, but the exception and resolution issue are hardcoded to Docker Hub. Since Core/Supervisor/plugin images all live on ghcr.io now, virtually every 429 we see in the field is actually a GHCR throttle that we mislabel. Issue SUPERVISOR-16BK (>115k events, >93k users) is exactly this: the image in the event context is ghcr.io/home-assistant/amd64-hassio-supervisor:latest, yet the user sees a "log into Docker Hub" suggestion.
2. HTTP 500 with toomanyrequests in the body - not currently recognised. Docker daemons before 28.3.0 wrap an upstream 429 into a 500 to the client. This was fixed upstream by moby/moby commit 23fa0ae74a ("Cleanup http status error checks").
3. JSON error event during a streaming pull - not currently recognised. POST /images/create returns 200 OK and streams progress/error events, so rate limits that land during layer download arrive as plain text inside the stream and have no HTTP status to key off of. Happens on all recent daemon versions. Issues SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are examples.

Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in docker/interface.py:install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and generate large amounts of Sentry noise. Case 1 is handled but routes every GHCR 429 through Docker-Hub-specific messaging and suggestions.

This PR addresses all three shapes and splits the registry-specific handling so ghcr.io failures produce a new GITHUB_RATELIMIT issue with appropriate guidance (no misleading Docker Hub login suggestion), while Docker Hub failures keep their existing behaviour.

Summary of the changes:
- DockerRegistryRateLimitExceeded base exception with DockerHubRateLimitExceeded and a new GithubContainerRegistryRateLimitExceeded as subclasses. All extend APITooManyRequests so callers and future retry logic can key off a single type.
- GITHUB_RATELIMIT IssueType (no REGISTRY_LOGIN suggestion, since GHCR authentication is different from Docker Hub).
- PullLogEntry.exception now maps stream errors containing toomanyrequests to DockerRegistryRateLimitExceeded (case 3).
- docker/interface.py:install() routes all three cases through a single _registry_rate_limit_exception() helper that picks the right resolution issue, suggestion and exception subclass based on the image's registry.
- utils/sentry.py filters APITooManyRequests (and anything wrapping it via __cause__) in both capture_exception and async_capture_exception. Single policy point, every caller benefits, no per-site changes needed.

Callers (supervisor.update(), plugin manager, HA core update) are intentionally unchanged: UPDATE_FAILED issues still get created alongside the registry-specific rate limit issue, giving users both the cause (rate limit) and the effect (update failed) in the resolution center.

Type of change
Additional information
Checklist
ruff format supervisor tests)
If API endpoints or add-on configuration are added/changed: