[OpenTelemetry] Metrics - beter handling for double.NaN v2#7100
[OpenTelemetry] Metrics - beter handling for double.NaN v2#7100Kielek wants to merge 3 commits intoopen-telemetry:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7100 +/- ##
==========================================
+ Coverage 88.78% 88.79% +0.01%
==========================================
Files 270 270
Lines 12927 12933 +6
==========================================
+ Hits 11477 11484 +7
+ Misses 1450 1449 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
|
@utpilla Could you please review this PR when you have a moment? |
|
Before merging, I think we need to answer a core question: how should the Metrics SDK handle non-finite values (NaN, +Inf, -Inf)? Right now, we don't seem to have a deliberate answer. The behavior varies by instrument. Counters get their sum poisoned, explicit bucket histograms put The spec (Numerical limits handling) says the SDK MUST handle So, we have a choice. There are broadly three stances:
This PR lands somewhere between these. It rejects I'm just sharing some pointers here since I happened to look at this. I'm no longer an active maintainer so I'll leave the decision and review to you all. I think the first step is to pick a stance, then implement and document it consistently. |
Fixes #7072 (comment)
Changes
This PR mitigates the NaN/Infinity spin risk without changing existing histogram/gauge semantics.
We reject non-finite double measurements (NaN, +Inf, -Inf) only for sum instruments (Counter, UpDownCounter, ObservableCounter, ObservableUpDownCounter) at the SDK callback boundary (MeasurementRecordedDouble). This prevents invalid values from reaching the sum aggregation path where Compare-And-Swap-based accumulation is used.
This is valid because InterlockedHelper.Add is used in MetricPoint only for DoubleSumIncomingDelta (sum path), not histogram updates. So filtering sum-instrument inputs is the right targeted mitigation for the original finding.
We intentionally preserve histogram behavior (including NaN test expectations), and added tests covering:
non-finite drop for counter/updown counter,
histogram NaN behavior unchanged.
This keeps the fix focused on the vulnerable path while avoiding regressions in existing histogram/gauge behavior.
Discussed with @utpilla in private channel keeping/dropign changes realted to bitwise comparison:
Consistency of message. The whole point of the boundary fix is to say "NaN is not a valid input." Keeping the InterlockedHelper change contradicts that — it's defensive code for a scenario the boundary fix makes unreachable. Reverting reinforces the stance.
Simplicity. returnedValue != currentValue is immediately readable. The AreSame + BitConverter.DoubleToInt64Bits indirection adds complexity for a case that can no longer happen.
But we can reconsider this with other feedback also.
@utpilla, great to see you here, always looking for your opinion!
Merge requirement checklist
[ ] AppropriateCHANGELOG.mdfiles updated for non-trivial changes[ ] Changes in public API reviewed (if applicable)