Skip to content

feat: operator-driven retry and error taxonomy for ceremonies#105

Merged
lomigmegard merged 1 commit into
mainfrom
feat/error-retry-and-taxonomy
Jun 20, 2026
Merged

feat: operator-driven retry and error taxonomy for ceremonies#105
lomigmegard merged 1 commit into
mainfrom
feat/error-retry-and-taxonomy

Conversation

@lomigmegard

Copy link
Copy Markdown
Contributor

Introduce an in-run error model so a ceremony can recover from transient failures (a loose cable, an absent token) without losing a long-running session, while ensuring procedural failures are never silently re-run.

Error classification on the types:

  • Retriability (Transient / Fatal) with BackendError::retriability() and ActionError::retriability(), classified by an exhaustive, wildcard-free match so every future variant must be deliberately classified. Environmental conditions are transient; malformed data, unsupported operations, and terminal device states are fatal. A procedural failure such as a verification mismatch is never retriable.

Transcript taxonomy:

  • ErrorClass (environmental / procedural / integrity / abort) recorded on ErrorRecord, so an auditor can tell a failure's nature without parsing the message and abort is distinguishable from failure.
  • New StepAttemptFailed { step, attempt, error } fact; reports group attempts under their step.

Runtime retry:

  • A transient error pauses the step and prompts the operator to retry or abort, recorded through the existing prompt machinery. A conservative re-executability gate refuses to retry once an attempt has emitted any evidence, which also keeps retried attempts from reusing entropy.

DSL retry: field:

  • retry: never and retry: { attempts: N } constrain the retry-by-default; an absent field prompts the operator unlimited times. A zero attempt budget is rejected with a resolver diagnostic.

Introduce an in-run error model so a ceremony can recover from transient
failures (a loose cable, an absent token) without losing a long-running
session, while ensuring procedural failures are never silently re-run.

Error classification on the types:
- `Retriability` (`Transient` / `Fatal`) with `BackendError::retriability()`
  and `ActionError::retriability()`, classified by an exhaustive, wildcard-free
  match so every future variant must be deliberately classified. Environmental
  conditions are transient; malformed data, unsupported operations, and
  terminal device states are fatal. A procedural failure such as a verification
  mismatch is never retriable.

Transcript taxonomy:
- `ErrorClass` (environmental / procedural / integrity / abort) recorded on
  `ErrorRecord`, so an auditor can tell a failure's nature without parsing the
  message and abort is distinguishable from failure.
- New `StepAttemptFailed { step, attempt, error }` fact; reports group attempts
  under their step.

Runtime retry:
- A transient error pauses the step and prompts the operator to retry or abort,
  recorded through the existing prompt machinery. A conservative
  re-executability gate refuses to retry once an attempt has emitted any
  evidence, which also keeps retried attempts from reusing entropy.

DSL `retry:` field:
- `retry: never` and `retry: { attempts: N }` constrain the retry-by-default;
  an absent field prompts the operator unlimited times. A zero attempt budget
  is rejected with a resolver diagnostic.
@lomigmegard lomigmegard self-assigned this Jun 19, 2026
@lomigmegard lomigmegard merged commit 51112e5 into main Jun 20, 2026
9 checks passed
@lomigmegard lomigmegard deleted the feat/error-retry-and-taxonomy branch June 20, 2026 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant