
fix(workflow): preserve event.output for cached LlmAgent results#5587

Open
ValentinCordonnier wants to merge 1 commit into google:v2 from ValentinCordonnier:fix/5553-preserve-event-output-on-resume

Conversation

@ValentinCordonnier

Fixes #5553


Summary

Runner._consume_event_queue was nulling event.output before persisting any non-partial event whose node_info.message_as_output was True. The intent looked like a denormalization step ("the LLM text already lives in event.content, no need to store it twice"), but it broke resume rehydration. On the live path, process_llm_agent_output populates event.output with the validated dict (or raw text, when no output_schema); after the strip, the persisted event lost that value. On resume, _reconstruct_node_state walked the persisted events, found event.output is None, and fell through to child.output = event.content — handing the orchestrator a raw genai.types.Content instead of the dict. Anything subscripting that result (lab["findings"], Schema.model_validate(out), …) crashed with 'Content' object is not subscriptable only after a HITL pause, even though the same code worked on the initial pass.
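
The failure mode can be sketched with simplified stand-ins — Event, Content, and reconstruct_child_output below are hypothetical reductions of the real ADK/genai types for illustration, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Content:
    # Stand-in for genai.types.Content: no __getitem__, so it can't be subscripted.
    parts: List[Any] = field(default_factory=list)

@dataclass
class Event:
    # Stand-in for a persisted workflow event.
    content: Content
    output: Any = None

def reconstruct_child_output(event: Event) -> Any:
    # Mirrors the branch order described for _reconstruct_node_state:
    # primary branch returns event.output; the fallback hands back raw content.
    if event.output is not None:
        return event.output
    return event.content

# Live path: process_llm_agent_output stored the validated dict on the event.
live = Event(content=Content(parts=['{"findings": ["x"]}']),
             output={"findings": ["x"]})
print(reconstruct_child_output(live)["findings"])  # subscripting works

# Buggy persisted path: the strip nulled output before persisting.
persisted = Event(content=Content(parts=['{"findings": ["x"]}']))
try:
    reconstruct_child_output(persisted)["findings"]
except TypeError as exc:
    print(exc)  # 'Content' object is not subscriptable
```

The point of the sketch is the type mismatch: the fallback returns a different object type than the live path, so identical orchestrator code succeeds pre-pause and crashes post-resume.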

The diagnosis in #5553 by @samarth1224 already pinned the two functions involved; this PR is the minimal change.

Changes

  • src/google/adk/runners.py: removed the four-line strip block in _consume_event_queue that copied the event and assigned event.output = None for non-partial message_as_output events. With the strip gone, event.output survives the persist / rehydrate roundtrip and _reconstruct_node_state's primary branch (if event.output is not None: child.output = event.output) returns the same Python object type the live path produced.
  • tests/unittests/workflow/test_workflow_hitl.py: added a regression test that pauses a rerun_on_resume=True orchestrator on a RequestInput, resumes, and subscripts the cached LlmAgent output. Fails on v2 HEAD with the reported TypeError; passes after this change.
  • tests/unittests/workflow/test_workflow_dynamic_nodes.py: updated test_workflow_resume_does_not_rerun_completed_llm_agent. Its previous assertion agent_event.output is None (commented "Verify that runners.py cleared the output") was asserting the buggy behavior as if intentional. Replaced with a positive check that the persisted output equals the agent's emitted text. The test's actual semantic guarantee — the LLM is not re-fired on resume — still holds via the existing agent_runs_again == [] assertion.

Motivation

Any @node(rerun_on_resume=True) orchestrator that subscripts an LlmAgent result via ctx.run_node and survives a RequestInput pause is silently broken on v2. Graph-edge propagation of LlmAgent outputs across resume has the same failure mode (the rehydrated child.output is a Content instead of the schema dict), so this isn't only a dynamic-orchestrator problem. The fix unblocks the entire HITL-with-structured-output pattern, which is a core v2 use case.

What does not change

  • Live-path semantics. _node_runner._enqueue_event runs before this strip in the queue and was already reading event.output correctly; non-resume runs are unaffected.
  • Caching policy. rerun_on_resume, dedup of completed nodes, and the _dynamic_node_scheduler state machine are all unchanged. The fix only changes what value gets returned from the cache, not whether the cache is hit.
  • The event.content fallback in _reconstruct_node_state is preserved (it now handles only legacy persisted events).
  • No public API surface changes. Event.output: Any | None is unchanged; no contract said it had to be None for LLM events.

Downsides

  • Slightly larger persisted events for LlmAgent final responses. The validated dict is stored alongside the original text in content.parts. For output_schema agents that's roughly the JSON serialization of the dict; for plain-text agents it's roughly the response text duplicated. The previous saving was illusory — rehydration was returning a different object type than the live path, which is the bug we're fixing — but storage-cost-sensitive deployments will see the bytes-per-event grow.
  • Plugin behavior change. Plugins that read event.output for LLM events used to see None; they will now see the dict / string. Any plugin that branched on event.output is None to detect "this is the final LLM event" needs to switch to event.node_info.message_as_output (unchanged, set by the live path). I couldn't find any in-tree plugin doing this, but third-party plugins exist.
  • Old persisted sessions stay broken. Sessions persisted by pre-fix runners still have output=None for these events; resuming them post-fix continues to hit the event.content fallback. That's not a regression — those sessions were broken before — but cross-version resume is not magically fixed. The fallback branch could be removed in a follow-up once legacy sessions age out.
  • One existing test asserted the buggy behavior and had to be updated (see the third bullet under Changes). Reviewers should sanity-check that the replacement assertion still expresses the test's true semantic intent.
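
The plugin migration described in the second bullet might look like the following, assuming simplified stand-ins for Event and NodeInfo; the predicate names are hypothetical, not an in-tree API:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class NodeInfo:                       # simplified stand-in
    message_as_output: bool = False

@dataclass
class Event:                          # simplified stand-in
    node_info: Optional[NodeInfo] = None
    output: Any = None

def is_final_llm_event_old(event: Event) -> bool:
    # Pre-fix heuristic a third-party plugin may have relied on:
    # the final LLM event was the one whose output had been stripped to None.
    return event.node_info is not None and event.output is None

def is_final_llm_event_new(event: Event) -> bool:
    # Post-fix: branch on the explicit flag, which the live path still sets.
    return event.node_info is not None and event.node_info.message_as_output

post_fix_event = Event(node_info=NodeInfo(message_as_output=True),
                       output={"findings": ["x"]})
# The old predicate now misclassifies the final LLM event; the new one holds.
print(is_final_llm_event_old(post_fix_event),
      is_final_llm_event_new(post_fix_event))
```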

The runner stripped event.output to None before persisting any
non-partial event with node_info.message_as_output=True. On resume,
_reconstruct_node_state fell back to event.content (a raw
genai.types.Content) for the cached child output, so subscripting the
result of ctx.run_node on an LlmAgent across a HITL pause crashed with
"'Content' object is not subscriptable". Removing the strip lets the
validated dict produced by process_llm_agent_output survive the
persist/rehydrate roundtrip.

Closes google#5553

google-cla Bot commented May 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the core [Component] This issue is related to the core interface and implementation label May 4, 2026

adk-bot commented May 4, 2026

Response from ADK Triaging Agent

Hello @ValentinCordonnier, thank you for creating this detailed pull request!

It looks like the CLA check is failing. Before we can review your contribution, you'll need to sign the Contributor License Agreement. You can do so by following the instructions at https://cla.developers.google.com/.

Thanks!


@samarth1224 left a comment


Key Points:

Does this Fix the issue?

  • Yes

Is this a correct way to fix this?

  • I believe no

Hey, while your change fixes the issue, as far as my knowledge of this codebase goes, I believe this might not be the correct way to fix it.

As mentioned in my issue, in your PR, in the test cases, and as inferred from the codebase, event.output should ideally remain None when message_as_output == True, since the output can be inferred from event.content.

  • Therefore, I believe the required change should be made within _reconstruct_node_states(), not _consume_event_queue().

  • Specifically, the change should be either a helper function that processes event.content into a suitable output before it is assigned to child.output, or direct extraction of the output during the assignment to child.output.

I feel a helper function would be the better choice, since in it we could also validate the output (stored in event.content) against the agent's output schema. (Validation might not be needed here, since the output has already been validated during the fresh run by process_llm_agent_output().)

In a nutshell:

  • I believe your changes defeat the purpose of the message_as_output attribute of the NodeInfo class.
  • Your changes would need to be accompanied by additional changes in multiple files and functions, including _reconstruct_node_states(), _track_event_in_context(), etc., so that the codebase remains consistent.
  • The issue could simply be resolved by processing event.content before assigning it to child.output in _reconstruct_node_states().

Thanks.
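
The helper-function alternative proposed here could be sketched as below. Part, Content, and output_from_content are illustrative assumptions, and the JSON re-parse stands in for whatever processing _reconstruct_node_states() would actually perform; real schema validation is omitted:

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Part:                 # simplified stand-in for genai.types.Part
    text: str = ""

@dataclass
class Content:              # simplified stand-in for genai.types.Content
    parts: List[Part] = field(default_factory=list)

def output_from_content(content: Content,
                        output_schema: Optional[type] = None) -> Any:
    # Join the persisted text parts back into the LLM response.
    text = "".join(p.text for p in content.parts)
    if output_schema is None:
        return text          # plain-text agents: return the raw string
    # Schema agents: re-parse the stored JSON into a dict. The schema here
    # only gates parsing; validation could be added if it turns out to be
    # needed on the rehydration path.
    return json.loads(text)

content = Content(parts=[Part(text='{"findings": ["x"]}')])
print(output_from_content(content, output_schema=dict))
```

Under this approach the strip in _consume_event_queue would stay, and _reconstruct_node_states() would call something like `child.output = output_from_content(event.content, agent.output_schema)` instead of falling back to the raw Content.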
