fix: record workflow gRPC call latency for monitoring #2702

Open
rsd-darshan wants to merge 2 commits into NVIDIA:main from rsd-darshan:fix/record-workflow-latency-metrics
Conversation

rsd-darshan commented Jun 19, 2026

Fixes #2649

Summary

Adds latency measurement and logging around gRPC calls in workflow activities. The workflow latency metrics were defined in site-agent's coregrpc manager but never recorded. This change instruments the activities so that the duration of each gRPC call to NICo is measured and logged.

Changes

  • Introduces latency.go with a shared helper logGrpcCallLatency() to standardize duration logging across activities
  • Instruments the Create, Update, and Delete activities in vpcprefix.go to measure gRPC call duration
  • Instruments the UpdateInstanceConfig activity in instance.go with the same latency measurement
  • Logs include operation name, duration, and error state for observability

Benefits

  • Visibility into API response times for troubleshooting performance issues
  • Foundation for future metric recording to Prometheus/telemetry systems
  • Enables monitoring of workflow SLA compliance and identification of slow operations

Testing

All existing tests pass. Verified that duration logging appears in test output with actual measured latencies.

rsd-darshan requested a review from a team as a code owner June 19, 2026 11:24
copy-pr-bot Bot commented Jun 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 14aef867-c351-4061-9bdc-4789e6e2aee9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Adds latency measurement and logging to gRPC calls in workflow activities. Introduces a shared helper function to standardize how gRPC call durations are logged across activities, enabling better observability of workflow performance.

This addresses issue NVIDIA#2649 where workflow latency metrics were defined but not recorded. The duration is now captured and logged for each gRPC call, providing visibility into API response times for troubleshooting and performance monitoring.

Fixes NVIDIA#2649

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
rsd-darshan force-pushed the fix/record-workflow-latency-metrics branch from b4fbd86 to 0ae41ea June 19, 2026 11:30

Development

Successfully merging this pull request may close these issues.

bug: site-agent defines but does not record workflow metrics
