Skip to content

[OpenTelemetry] Metrics - beter handling for double.NaN v2#7100

Open
Kielek wants to merge 3 commits intoopen-telemetry:mainfrom
Kielek:nanv2
Open

[OpenTelemetry] Metrics - beter handling for double.NaN v2#7100
Kielek wants to merge 3 commits intoopen-telemetry:mainfrom
Kielek:nanv2

Conversation

@Kielek
Copy link
Copy Markdown
Member

@Kielek Kielek commented Apr 16, 2026

Fixes #7072 (comment)

Changes

This PR mitigates the NaN/Infinity spin risk without changing existing histogram/gauge semantics.

We reject non-finite double measurements (NaN, +Inf, -Inf) only for sum instruments (Counter, UpDownCounter, ObservableCounter, ObservableUpDownCounter) at the SDK callback boundary (MeasurementRecordedDouble). This prevents invalid values from reaching the sum aggregation path where Compare-And-Swap-based accumulation is used.

This is valid because InterlockedHelper.Add is used in MetricPoint only for DoubleSumIncomingDelta (sum path), not histogram updates. So filtering sum-instrument inputs is the right targeted mitigation for the original finding.

We intentionally preserve histogram behavior (including NaN test expectations), and added tests covering:

non-finite drop for counter/updown counter,

histogram NaN behavior unchanged.

This keeps the fix focused on the vulnerable path while avoiding regressions in existing histogram/gauge behavior.


Discussed with @utpilla in private channel keeping/dropign changes realted to bitwise comparison:
Consistency of message. The whole point of the boundary fix is to say "NaN is not a valid input." Keeping the InterlockedHelper change contradicts that — it's defensive code for a scenario the boundary fix makes unreachable. Reverting reinforces the stance.
Simplicity. returnedValue != currentValue is immediately readable. The AreSame + BitConverter.DoubleToInt64Bits indirection adds complexity for a case that can no longer happen.

But we can reconsider this with other feedback also.

@utpilla, great to see you here, always looking for your opinion!

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • [ ] Appropriate CHANGELOG.md files updated for non-trivial changes
  • [ ] Changes in public API reviewed (if applicable)

@Kielek Kielek requested a review from a team as a code owner April 16, 2026 05:18
@github-actions github-actions bot added the pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package label Apr 16, 2026
@Kielek Kielek changed the title Nanv2 [OpenTelemetry] Metrics - beter handling for double.NaN v2 Apr 16, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.79%. Comparing base (1f3582a) to head (98759f5).
✅ All tests successful. No failed tests found.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7100      +/-   ##
==========================================
+ Coverage   88.78%   88.79%   +0.01%     
==========================================
  Files         270      270              
  Lines       12927    12933       +6     
==========================================
+ Hits        11477    11484       +7     
+ Misses       1450     1449       -1     
Flag Coverage Δ
unittests-Project-Experimental 88.75% <100.00%> (+0.02%) ⬆️
unittests-Project-Stable 88.53% <100.00%> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/OpenTelemetry/Internal/InterlockedHelper.cs 100.00% <100.00%> (ø)
src/OpenTelemetry/Metrics/MeterProviderSdk.cs 93.90% <100.00%> (+0.17%) ⬆️

... and 1 file with indirect coverage changes

@Kielek Kielek marked this pull request as draft April 16, 2026 05:28
@Kielek Kielek marked this pull request as ready for review April 16, 2026 12:54
@rajkumar-rangaraj
Copy link
Copy Markdown
Member

@utpilla Could you please review this PR when you have a moment?

@utpilla
Copy link
Copy Markdown
Contributor

utpilla commented Apr 17, 2026

Before merging, I think we need to answer a core question: how should the Metrics SDK handle non-finite values (NaN, +Inf, -Inf)?

Right now, we don't seem to have a deliberate answer. The behavior varies by instrument. Counters get their sum poisoned, explicit bucket histograms put NaN in the +Inf bucket, exponential histograms filter NaN from buckets but still corrupt sum/min/max, and gauges just store NaN directly. None of this is documented, and this PR would change behavior for some instruments but not others.

The spec (Numerical limits handling) says the SDK MUST handle NaN/Inf but leaves the how unspecified. For exponential histograms specifically, the spec says implementations SHOULD NOT incorporate non-normal values into sum/min/max. There's no equivalent guidance for explicit bucket histograms.

So, we have a choice. There are broadly three stances:

Reject at boundary No boundary checks Per-aggregation handling
Approach IsFinite check in MeasurementRecordedDouble for all instruments Let NaN/Inf propagate naturally; fix only bugs (e.g., InterlockedHelper hang) Protect sum/min/max/buckets per path; drop for counters/gauges; for histograms, decide whether to count non-finite values or drop entirely
InterlockedHelper fix Not needed Needed (the hang is a bug) Not needed
Complexity One guard, all paths Already done (InterlockedHelper fix shipped in #7072) Each aggregation has its own logic and sub-choices
Consistency Uniform Mostly uniform, but exp. histograms already filter NaN from buckets while explicit ones don't Inconsistent: counters/gauges have no count to preserve
Exporter impact None Must handle NaN/Inf If histograms count non-finite values, count won't match sum of bucket counts
Gap from current state Add IsFinite check for all instrument types; revert InterlockedHelper bitwise fix; update Prometheus tests Already in place after #7072 Guards in each path; fix exp. histogram sum/min/max; decide on explicit histogram counting; revert InterlockedHelper; update tests

This PR lands somewhere between these. It rejects NaN/Inf for counters at the boundary but lets them through for histograms (corrupted sum/min/max, NaN in +Inf bucket) and gauges (stores NaN as the value). The Prometheus serializer already has tests expecting Nan/+Inf/-Inf in output, which other exporters may not handle.

I'm just sharing some pointers here since I happened to look at this. I'm no longer an active maintainer so I'll leave the decision and review to you all. I think the first step is to pick a stance, then implement and document it consistently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants