feat: operator-driven retry and error taxonomy for ceremonies#105
Merged
Conversation
Introduce an in-run error model so a ceremony can recover from transient
failures (a loose cable, an absent token) without losing a long-running
session, while ensuring procedural failures are never silently re-run.
Error classification on the types:
- `Retriability` (`Transient` / `Fatal`) with `BackendError::retriability()`
and `ActionError::retriability()`, classified by an exhaustive, wildcard-free
match so every future variant must be deliberately classified. Environmental
conditions are transient; malformed data, unsupported operations, and
terminal device states are fatal. A procedural failure such as a verification
mismatch is never retriable.
Transcript taxonomy:
- `ErrorClass` (environmental / procedural / integrity / abort) recorded on
`ErrorRecord`, so an auditor can tell a failure's nature without parsing the
message and abort is distinguishable from failure.
- New `StepAttemptFailed { step, attempt, error }` fact; reports group attempts
under their step.
Runtime retry:
- A transient error pauses the step and prompts the operator to retry or abort,
recorded through the existing prompt machinery. A conservative
re-executability gate refuses to retry once an attempt has emitted any
evidence, which also keeps retried attempts from reusing entropy.
DSL `retry:` field:
- `retry: never` and `retry: { attempts: N }` constrain the retry-by-default;
an absent field prompts the operator unlimited times. A zero attempt budget
is rejected with a resolver diagnostic.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Introduce an in-run error model so a ceremony can recover from transient failures (a loose cable, an absent token) without losing a long-running session, while ensuring procedural failures are never silently re-run.
Error classification on the types:
Retriability(Transient/Fatal) withBackendError::retriability()andActionError::retriability(), classified by an exhaustive, wildcard-free match so every future variant must be deliberately classified. Environmental conditions are transient; malformed data, unsupported operations, and terminal device states are fatal. A procedural failure such as a verification mismatch is never retriable.Transcript taxonomy:
ErrorClass(environmental / procedural / integrity / abort) recorded onErrorRecord, so an auditor can tell a failure's nature without parsing the message and abort is distinguishable from failure.StepAttemptFailed { step, attempt, error }fact; reports group attempts under their step.Runtime retry:
DSL
retry:field:retry: neverandretry: { attempts: N }constrain the retry-by-default; an absent field prompts the operator unlimited times. A zero attempt budget is rejected with a resolver diagnostic.