Skip to content

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724

Open
CRZbulabula wants to merge 5 commits into
masterfrom
confignode-readonly-disk-check
Open

Add ConfigNode ReadOnly heartbeat self-check (DiskFull/DiskCrash)#17724
CRZbulabula wants to merge 5 commits into
masterfrom
confignode-readonly-disk-check

Conversation

@CRZbulabula
Copy link
Copy Markdown
Contributor

Description

Why

ConfigNode currently has no notion of being ReadOnly — only DataNode samples its disk and self-marks. The heartbeat from the leader to other ConfigNodes is effectively just a liveness ping, so a ConfigNode with a full or crashed disk would silently keep accepting writes from peers / clients. This PR extends ReadOnly tracking to ConfigNode and adds a new disk-failure reason (DiskCrash) on top of the existing DiskFull.

What changes

  • New status reason. NodeStatus.DISK_CRASH constant added next to DISK_FULL (both are still string identifiers stored in statusReason).
  • Shared utility. New org.apache.iotdb.commons.cluster.DiskChecker in node-commons: probes a list of directories via test-write + free-space ratio, then runs a single state-machine apply on CommonConfig with priority DiskCrash > DiskFull > Normal. Recovery to Running only fires when the prior reason was disk-related; other ReadOnly reasons (e.g. manual maintenance) are left untouched.
  • ConfigNode leader. Self-checks [systemDir, consensusDir] at the top of HeartbeatService#heartbeatLoopBody (before fanning out heartbeats).
  • ConfigNode follower. Self-checks on every received heartbeat in ConfigNodeRPCServiceProcessor#getConfigNodeHeartBeat, and reports status + statusReason back via newly-added optional fields 4 and 5 on TConfigNodeHeartbeatResp (forward-compatible — old peers simply leave them unset).
  • Leader's self entry in the cache. ConfigNodeHeartbeatCache#updateCurrentStatistics no longer short-circuits for CURRENT_NODE_ID; instead, the self entry mirrors CommonConfig, so show confignodes reflects the leader's own disk state.
  • DataNode. FolderManager now registers each instance into a weak-ref static list and exposes static boolean hasAnyAbnormalFolder(). DataNodeInternalRPCServiceImpl#sampleDiskLoad consults that aggregator and, when any folder is ABNORMAL, maps to DiskCrash (which outranks the existing free-ratio DiskFull check). State-machine application is delegated to DiskChecker.apply so DataNode and ConfigNode follow identical transition rules.

Design notes

  • DataNode keeps its current system-wide free-ratio check (SystemMetric.SYS_DISK_AVAILABLE_SPACE) for DiskFull. The new DiskCrash signal is path-scoped — it just observes already-recorded write failures rather than probing IO itself. ConfigNode runs both checks per-directory through DiskChecker (File.getUsableSpace/getTotalSpace + a tiny Files.createTempFile/write/delete probe), giving symmetric behavior on the two nodes from the cluster's perspective.
  • Thrift fields are optional to keep rolling upgrade safe: an older ConfigNode that doesn't populate them parses as Running with no reason.
  • ABNORMAL on DataNode is sticky: once a folder fails a business write it stays ABNORMAL until restart. This PR intentionally does not introduce an auto-recovery path for ReadOnly(DiskCrash) on DataNode (would require new ABNORMAL -> HEALTHY transitions inside FolderManager and is left as follow-up). ConfigNode does auto-recover, because testWrite reruns every heartbeat.

i18n

New disk health messages live in CommonMessages under both src/main/i18n/en and src/main/i18n/zh:

  • DISK_FULL_SET_READ_ONLY
  • DISK_CRASH_SET_READ_ONLY
  • DISK_CRASH_PROBE_FAILED
  • DISK_RECOVERED_SET_RUNNING

The existing inline English log in sampleDiskLoad is retained (just descriptive context); the state-change log itself is routed through CommonMessages so the Chinese build works out of the box.


This PR has:

  • been self-reviewed.
  • added Javadocs for the new public surface (`DiskChecker`, `FolderManager.hasAnyAbnormalFolder`, `ConfigNodeConfig.getCriticalDirs`).
  • added comments explaining the why where non-obvious (cache self-entry rewrite, optional Thrift fields, priority ordering).
  • added unit tests — `DiskCheckerTest` covers 17 cases including: NORMAL/DISK_FULL/DISK_CRASH detection, crash-wins-over-full, recovery clearing reason, non-disk `ReadOnly` reason left untouched, idempotency, probe-file cleanup.
  • added integration tests. (Heartbeat-driven state transitions are exercised at unit-test level; cluster-level disk failure injection felt heavier than this PR warrants — happy to add an IT in a follow-up if reviewers prefer.)
  • been tested in a test IoTDB cluster.

Known follow-ups
  • DataNode cannot auto-recover from `ReadOnly(DiskCrash)` once any folder is marked ABNORMAL by `FolderManager` — would need an ABNORMAL→HEALTHY re-probe path.
  • Consider unifying DataNode's free-ratio source with ConfigNode's per-directory `File.getUsableSpace` (currently DataNode goes through `MetricService`).

Key changed/added classes in this PR

New

  • `org.apache.iotdb.commons.cluster.DiskChecker`
  • `org.apache.iotdb.commons.cluster.DiskCheckerTest`

Modified

  • `org.apache.iotdb.commons.cluster.NodeStatus`
  • `org.apache.iotdb.commons.i18n.CommonMessages` (en + zh)
  • `org.apache.iotdb.confignode.rpc.thrift.TConfigNodeHeartbeatResp` (Thrift IDL)
  • `org.apache.iotdb.confignode.conf.ConfigNodeConfig`
  • `org.apache.iotdb.confignode.manager.load.service.HeartbeatService`
  • `org.apache.iotdb.confignode.manager.load.cache.node.ConfigNodeHeartbeatCache`
  • `org.apache.iotdb.confignode.manager.load.cache.node.NodeHeartbeatSample`
  • `org.apache.iotdb.confignode.service.thrift.ConfigNodeRPCServiceProcessor`
  • `org.apache.iotdb.db.storageengine.rescon.disk.FolderManager`
  • `org.apache.iotdb.db.protocol.thrift.impl.DataNodeInternalRPCServiceImpl`

…check

ConfigNode now reports its own NodeStatus.ReadOnly when its critical
directories (systemDir, consensusDir) are unwritable or near-full,
mirroring the existing DataNode behavior. NodeStatus reasons are
extended with a new DISK_CRASH constant alongside DISK_FULL, and the
ConfigNode heartbeat carries status/statusReason back to the leader.

- node-commons: new DiskChecker utility (probe + state-machine apply),
  with priority DiskCrash > DiskFull and recovery to Running when the
  reason was disk-related. i18n messages added in en + zh.
- thrift-confignode: TConfigNodeHeartbeatResp gains optional status
  and statusReason fields (forward-compatible).
- confignode: leader self-checks before fanning out heartbeats;
  follower self-checks on receive and reports back; cache reads from
  CommonConfig for the leader's self entry, otherwise from the sample.
- datanode: FolderManager exposes a static hasAnyAbnormalFolder()
  aggregator; sampleDiskLoad treats any ABNORMAL folder as DiskCrash
  (which wins over DiskFull) and reuses DiskChecker.apply.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 35.02538% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.55%. Comparing base (34cc346) to head (6d662fd).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...g/apache/iotdb/consensus/ratis/RatisConsensus.java 0.00% 28 Missing ⚠️
.../confignode/manager/load/service/EventService.java 20.58% 27 Missing ⚠️
...db/db/storageengine/rescon/disk/FolderManager.java 11.11% 16 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java 0.00% 9 Missing ⚠️
...fignode/manager/load/service/HeartbeatService.java 0.00% 8 Missing ⚠️
...a/org/apache/iotdb/commons/cluster/NodeStatus.java 0.00% 8 Missing ⚠️
.../org/apache/iotdb/consensus/ratis/utils/Utils.java 36.36% 7 Missing ⚠️
...ager/load/cache/node/ConfigNodeHeartbeatCache.java 62.50% 6 Missing ⚠️
.../service/thrift/ConfigNodeRPCServiceProcessor.java 0.00% 6 Missing ⚠️
...e/manager/load/cache/node/NodeHeartbeatSample.java 0.00% 5 Missing ⚠️
... and 5 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17724      +/-   ##
============================================
+ Coverage     40.41%   40.55%   +0.13%     
- Complexity     2574     2577       +3     
============================================
  Files          5179     5181       +2     
  Lines        349767   350066     +299     
  Branches      44714    44768      +54     
============================================
+ Hits         141373   141978     +605     
+ Misses       208394   208088     -306     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…n after fanout

- HeartbeatService: the leader's DiskChecker.checkAndApply now runs after
  pingRegisteredConfigNodes/DataNodes/AINodes so its (blocking) probe IO
  does not delay the async heartbeat dispatch. Gated on a snapshot of
  heartbeatCounter taken at loop entry so the leader self-samples on the
  same iterations DataNode/AINode load sampling fires. The 10-iteration
  cadence is lifted into a shared LOAD_SAMPLING_INTERVAL constant and
  reused by the existing setNeedSamplingLoad callsites.
- ConfigNodeRPCServiceProcessor: follower side gates its disk check on a
  local heartbeatReceivedCounter (every Nth receive, matching the leader
  cadence) instead of running on every received heartbeat.
- Tidy comments per code-review feedback.
…sive crash detect

Three changes that let ReadOnly state actually shape Raft behavior on ConfigNode:

- Utils.rejectWrite / stallApply now match ConfigRegion in addition to DataRegion,
  so a ReadOnly ConfigNode leader hits the same forceStepDownLeader path that
  DataRegion leaders already use. Comment at RatisConsensus.write updated.
- New NodeStatus.priorityForStatus maps Running=0, ReadOnly(DiskFull)=-1,
  ReadOnly(DiskCrash)=-2. HeartbeatService runs a reconciliation step on the
  leader (same cadence as the load-sampling pass, after async fanout) that pushes
  each ConfigNode peer's desired priority into Ratis. Unknown/Removing/manual
  ReadOnly are left empty so transient blips do not churn the group config.
  IConsensus gains a default-no-op reconfigurePeerPriorities; RatisConsensus
  overrides it to rebuild the peer list and call sendReconfiguration.
- Replace DiskChecker.check (active testWrite probe) with a passive observer
  threaded through Ratis. ApplicationStateMachineProxy gains a diskFailureListener
  parameter and fires it from the applyTransaction catch when Utils.isDiskFailure
  matches the cause (IOError / FileSystemException). RatisConsensus also tags
  IOException out of writeLocallyWithRetry / writeRemotelyWithRetry so log-write
  failures register as DiskCrash. DiskChecker keeps only checkFreeRatio (for the
  DiskFull path) and apply (for the state machine); DiskCrash is now sticky on
  both DataNode and ConfigNode until restart.

DiskCheckerTest trimmed to drop testWrite-specific cases and to assert that
NORMAL no longer recovers DiskCrash; 14 cases pass.
…rtbeatService

- TConfigNodeHeartbeatReq carries needSamplingLoad; ConfigNodeRPCServiceProcessor
  gates its disk-health check on that flag instead of a local counter, matching
  the DataNode/AINode heartbeat shape.
- HeartbeatService now bumps heartbeatCounter exactly once at the top of
  heartbeatLoopBody and threads the iterationIndex into genHeartbeatReq /
  genConfigNodeHeartbeatReq / genAIHeartbeatReq / addConfigNodeLocationsToReq;
  the gen methods no longer touch the counter.
- Cross-peer priority reconciliation moved out of HeartbeatService into a new
  ConfigRegionPriorityBalancer that implements IClusterStatusSubscriber. It is
  registered on the EventService alongside RouteBalancer/TopologyService and only
  reacts when NodeStatisticsChangeEvent reports a transition whose priority
  bucket actually moved (filtered to ConfigNode peers, gated on isLeader).
- RatisConsensus.reconfigurePeerPriorities is now a mechanical merge — it
  rebuilds the peer list from the requested priorities and unconditionally calls
  sendReconfiguration. Decisions about "did the priority change" live in the
  balancer, not at the consensus layer.
…ndex threading

- HeartbeatService bumps heartbeatCounter once at the tail of heartbeatLoopBody;
  the gen* methods read heartbeatCounter.get() directly again instead of taking
  an iterationIndex parameter.
- Remove the standalone ConfigRegionPriorityBalancer. Priority reconciliation now
  lives in EventService, fired from checkAndBroadcastNodeStatisticsChangeEventIfNecessary
  exactly when ConfigNode statistics change. It is leader-gated, filtered to
  ConfigNode peers, and only pushes peers whose priority bucket actually moved.
  EventService now takes the IManager to reach the consensus impl.
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant