feat: social publishing + NuGet #r + move perf + mesh stability batch #95

Open
rbuergi wants to merge 1227 commits into main from bug_fix

Contributor

rbuergi commented Apr 22, 2026

Summary

77 commits of long-running work on bug_fix — grouped by theme:

  • Social publishing platform (new): MeshWeaver.Social + LinkedIn publisher + scheduled publishing pipeline (engine/queue/stats), LinkedIn OAuth connect + past-post ingest in Memex portal, per-user linked-account menu items.
  • NuGet in-process compile: #r "nuget:Pkg, Version" at the top of _Source/*.cs resolves via public NuGet.Protocol without an SDK on the container. Same resolver serves interactive markdown code cells.
  • Move-node parallelization + 30 s ceiling: FileSystemPersistenceService.MoveNodeAsync runs per-descendant WriteAsync/DeleteAsync through Task.WhenAll; new MeshOperationOptions (default Timeout = 30s) + WithMeshOperationTimeout(TimeSpan) override; HandleMoveNodeRequest chains .Timeout() on the persistence Observable so a stuck adapter can't hang the caller. Prod repro: DAV2026 subtree move that took 240 s and killed the MCP session; now bounded.
  • Compile / cache invalidation — sticky invalidation on CompilationCacheService, _Source/ edit re-invalidates owning NodeType, cross-silo broadcast via MeshChangeFeed, grain-dispose on node delete, live "Compiling … (Ns)" progress in LayoutAreaView.
  • Catalog & navigation — Children view groups by Category (falls back to NodeType), reactive Children catalog, self-as-default create location for non-NodeType nodes, sample orgs → Markdown for search visibility.
  • Workspace / stream robustness — Workspace remote-stream cache evicted on MeshChangeFeed events, resubscribe on owner dispose, DeleteLayoutArea emits a placeholder immediately and times out slow streams.
  • Infra & small fixes — settings.json overhaul, Delete-is-recursive MCP docs, HeartBeat silencing on Memex hubs, assembly-dir temp-dir fallback, IAsyncEnumerable aggregator fixes (satellite-safe GatherInputsAsync), xunit methodTimeout 30 s → 60 s, Anthropic Opus bump, icon generator, etc.
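The move-node ceiling described in the list above can be pictured with a small sketch. This is not the PR's code: it uses plain Task.WhenAll plus Task.WaitAsync(timeout) as a stand-in for the Rx .Timeout() the summary mentions, and MoveSketch, MoveDescendantsAsync, and the moveOne delegate are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class MoveSketch
{
    // Runs one move per descendant concurrently and bounds the whole batch by `timeout`.
    // `moveOne` is a hypothetical stand-in for the per-descendant write + delete.
    public static async Task MoveDescendantsAsync(
        IReadOnlyList<string> descendants,
        Func<string, CancellationToken, Task> moveOne,
        TimeSpan timeout,
        CancellationToken ct = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        var all = Task.WhenAll(descendants.Select(d => moveOne(d, cts.Token)));
        try
        {
            // WaitAsync throws TimeoutException if the batch outlives the ceiling.
            await all.WaitAsync(timeout, ct);
        }
        catch (TimeoutException)
        {
            // Ask the stuck operations to stop so they do not keep running unobserved.
            cts.Cancel();
            throw;
        }
    }
}
```

The caller sees a TimeoutException instead of hanging, which is the behaviour the summary describes for a stuck adapter; whether the real handler also cancels in-flight work is not stated in the source.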

New test suites (selected)

  • test/MeshWeaver.Persistence.Test/MoveNodeRecursiveTest.cs — 10 tests: recursion, parallelism, source missing / target exists / storage throws / cancellation (all must not hang), Rx Timeout() contract, default-30s config.
  • test/MeshWeaver.Social.Test/*: InMemoryPublishQueueTest, LinkedInPublisherEngagementTest, PostStatsRefresherTest, ScheduledPostPublisherTest, FakePublisher.
  • test/MeshWeaver.Persistence.Test/WorkspaceCacheEvictionTest.cs, ResubscribeOnOwnerDisposeTest.cs, DeleteLayoutAreaIntegrationTest.cs.
  • test/MeshWeaver.Markdown.Test/PathUtilsTest.cs, test/MeshWeaver.MathDemo.Test/MatrixViewsTest.cs.

Contributors

Upstream already merged into this branch

Test plan

  • dotnet build succeeds
  • dotnet test test/MeshWeaver.Persistence.Test --filter MoveNodeRecursiveTest — 10/10 green (~8 s)
  • dotnet test test/MeshWeaver.Hosting.Monolith.Test --filter MoveNodeAsync — 5/5 green (regression guard)
  • dotnet test test/MeshWeaver.Social.Test — publish queue / scheduling / stats green
  • Manual prod smoke: move a 3-descendant subtree in memex-prod; confirms < 30 s and MCP session survives
  • Create a _Source/*.cs using #r "nuget:MathNet.Numerics, 5.0.0" — compiles & renders (cold + warm cache)
  • Delete a node then recreate at same path — fresh grain, fresh compile, no stale HubConfiguration
  • Navigate to a cold node — "Compiling (Ns)…" progress renders until the stream resolves
  • LinkedIn OAuth: sign in → /social/connect/linkedin → profile linked; menu shows connected account
  • Scheduled post fires through ScheduledPostPublisher → LinkedIn publisher posts; PostStatsRefresher pulls stats

🤖 Generated with Claude Code


github-actions Bot commented Apr 22, 2026

Test Results

3 993 tests  +1 051   3 985 ✅ +1 056   18m 56s ⏱️ + 12m 46s
   40 suites +    4       2 💤  -    11 
   40 files   +    4       6 ❌ +    6 

For more details on these failures, see this check.

Results for commit d7ee093. ± Comparison against base commit f6c2dea.

This pull request removes 225 and adds 1276 tests. Note that renamed tests count towards both.
MeshWeaver.AI.Test.AgentSelectionTest ‑ AgentContext_WithPreloadedAgents_OrdersByOrder
MeshWeaver.AI.Test.AgentSelectionTest ‑ OrderByRelevance_OrdersByOrderThenDisplayName
MeshWeaver.AI.Test.AgentSelectionTest ‑ QueryAgentsAsync_PathWithoutNodeType_FindsAgentsFromPathHierarchy
MeshWeaver.AI.Test.AgentSelectionTest ‑ QueryAgentsAsync_ProductLaunchWithNodeType_FindsTodoAgentFromNodeTypeNamespace
MeshWeaver.AI.Test.AgentToolWiringIntegrationTest ‑ OrchestratorAgent_ShouldGetAllMeshTools
MeshWeaver.AI.Test.ThreadSubmissionUnitTest ‑ PlanNextRound_AfterInterruptedRound_ReturnsNewDispatchForQueuedInputs
MeshWeaver.AI.Test.ThreadSubmissionUnitTest ‑ PlanNextRound_IdleWithThreeQueued_ReturnsBatchedDispatch
MeshWeaver.Content.Test.ImportDeleteServiceTest ‑ FullLifecycle_CreateNodes_DeleteRecursively
MeshWeaver.Content.Test.ImportDeleteServiceTest ‑ ImportHelper_EmptySource_ReturnsZeroCounts
MeshWeaver.Content.Test.ImportDeleteServiceTest ‑ ImportHelper_ForceReimport_ImportsEvenWithExistingData
…
Memex.Portal.Shared.Test.VirtualUserMiddlewareAuthContextTest ‑ AuthenticatedUserViaHttpContext_SkipsVUserBlock_AndCallsNext
Memex.Portal.Shared.Test.VirtualUserMiddlewareAuthContextTest ‑ UnauthenticatedHttpContext_EntersVUserBlock_ThrowsOnMissingPortalApplication
MeshWeaver.AI.Test.ActivityLogStreamTest ‑ Progress_Messages_Stream_Gradually_Not_Just_At_The_End
MeshWeaver.AI.Test.ActivityLogStreamTest ‑ Script_Failure_Flips_ActivityLog_Status_To_Failed
MeshWeaver.AI.Test.ActivityLogStreamTest ‑ Script_Log_Messages_Land_On_ActivityLog_Node
MeshWeaver.AI.Test.AgentChatClientDeadlockTest ‑ GetOrderedAgentsAsync_WithContextPath_ConcurrentCallers_DoNotDeadlock
MeshWeaver.AI.Test.AgentChatClientDeadlockTest ‑ GetOrderedAgentsAsync_WithContextPath_SingleCaller_ResolvesQuickly
MeshWeaver.AI.Test.AgentChatClientDeadlockTest ‑ GetOrderedAgentsAsync_WithMarkdownContext_DoesNotDeadlock
MeshWeaver.AI.Test.AgentToolWiringIntegrationTest ‑ AssistantAgent_ShouldGetAllMeshTools
MeshWeaver.AI.Test.AutocompleteStreamProviderTests ‑ FailingProvider_DoesNotKillTheStream
…
This pull request skips 1 and un-skips 5 tests.
MeshWeaver.Content.Test.NewCommentFlowTest ‑ NewComment_DataChangeToWrongAddress_ShouldNotUpdateComment
MeshWeaver.Data.Test.SynchronizationStreamTest ‑ ParallelUpdate
MeshWeaver.Layout.Test.DebounceTest ‑ BasicDebounce
MeshWeaver.Layout.Test.EditorTest ‑ TestEditorWithDelayed
MeshWeaver.NodeOperations.Test.DeletionTests ‑ Delete_ViaClient_WithDeleteNodeRequest
MeshWeaver.NodeOperations.Test.NodeOperationsTest ‑ DeleteNode_WithChildren_NonRecursive_ShouldFail

♻️ This comment has been updated with latest results.

Contributor

Copilot AI left a comment

Pull request overview

This PR bundles several long-running feature and stability tracks across MeshWeaver core + Memex: social publishing foundations, in-process #r "nuget:..." compilation support (node-type + interactive markdown), move-operation performance/timeout hardening, and multiple UI/stream reliability improvements. It also standardizes the code folder naming from _Source/_Test to Source/Test across code, tests, docs, and samples.

Changes:

  • Introduces MeshWeaver.Social (options, DI wiring, publish queue, credential model) plus initial Memex wiring (LinkedIn connect entry points + user menu hooks).
  • Adds MeshWeaver.NuGet resolver + directive parser and integrates it into script compilation (#r "nuget:Pkg, Version"), including cache backends and tests.
  • Improves operational robustness: parallelized recursive moves, default 30s mesh-op timeout, “no endless spinner” navigation status UI, and remote stream resubscribe behavior.

Reviewed changes

Copilot reviewed 159 out of 265 changed files in this pull request and generated 2 comments.

File Description
test/MeshWeaver.StorageImport.Test/StorageImporterTests.cs Updates test expectations/docs to Source/ naming.
test/MeshWeaver.Social.Test/PostStatsRefresherTest.cs Adds stats refresher test coverage (needs deterministic timeout handling).
test/MeshWeaver.Social.Test/MeshWeaver.Social.Test.csproj Adds new Social test project referencing Social + Fixture.
test/MeshWeaver.Social.Test/InMemoryPublishQueueTest.cs Adds unit tests for publish queue due-drain + dedup.
test/MeshWeaver.Persistence.Test/FileSystemPersistenceTest.cs Updates partition tests to Source/ naming.
test/MeshWeaver.MathDemo.Test/TestPaths.cs Adds helper paths for MathDemo sample test assets.
test/MeshWeaver.MathDemo.Test/MeshWeaver.MathDemo.Test.csproj Adds MathDemo test project and copies sample graph data to output.
test/MeshWeaver.Hosting.PostgreSql.Test/SatelliteQueryTests.cs Updates code-path routing tests to Source/ naming.
test/MeshWeaver.Hosting.Monolith.Test/UserActivityAreaTest.cs Updates regression test docs to Source/ naming.
test/MeshWeaver.Hosting.Blazor.Test/NavigationServiceTest.cs Adjusts test to assert “no 404 flash” during retries.
test/MeshWeaver.Graph.Test/NuGetDirectiveParserTest.cs Adds unit tests for parsing/stripping #r "nuget:...".
test/MeshWeaver.Graph.Test/NuGetAssemblyResolverTest.cs Adds networked NuGet restore end-to-end tests (skippable via env var).
test/MeshWeaver.Graph.Test/MeshWeaver.Graph.Test.csproj References new MeshWeaver.NuGet project.
test/MeshWeaver.FutuRe.Test/MeshWeaver.FutuRe.Test.csproj Updates compile-included sample sources to Source/ paths.
test/MeshWeaver.Content.Test/CompilationErrorTest.cs Updates broken-code test to Source/ path.
test/MeshWeaver.AI.Test/MeshPluginTest.cs Updates MCP tool count expectations (adds RunTests/Move/Copy).
src/MeshWeaver.Social/SocialOptions.cs Adds configurable knobs for publishing/stats/ingest scheduling.
src/MeshWeaver.Social/SocialExtensions.cs Adds DI wiring for social publishing subsystem and hosted services.
src/MeshWeaver.Social/PlatformCredential.cs Adds credential record model (access/refresh/expiry metadata).
src/MeshWeaver.Social/MeshWeaver.Social.csproj Introduces Social library project.
src/MeshWeaver.Social/IPublishQueue.cs Adds publish queue abstraction + in-memory implementation.
src/MeshWeaver.Social/IApprovalPublishBridge.cs Defines bridge contract and PublishableSnapshot model.
src/MeshWeaver.NuGet/ResolvedPackageSet.cs Adds resolver output model (assemblies, probing dirs, versions).
src/MeshWeaver.NuGet/NuGetServiceCollectionExtensions.cs Adds DI extension to register resolver + cache.
src/MeshWeaver.NuGet/NuGetPackageReference.cs Adds package reference model (id + version range).
src/MeshWeaver.NuGet/NuGetDirectiveParser.cs Implements #r "nuget:..." extraction + source stripping.
src/MeshWeaver.NuGet/MeshWeaver.NuGet.csproj Introduces NuGet resolver project and dependencies.
src/MeshWeaver.NuGet/INuGetPackageCache.cs Adds optional persistent cache interface + null implementation.
src/MeshWeaver.NuGet/INuGetAssemblyResolver.cs Adds resolver interface returning ResolvedPackageSet.
src/MeshWeaver.NuGet.AzureBlob/MeshWeaver.NuGet.AzureBlob.csproj Adds Azure Blob cache backend project.
src/MeshWeaver.NuGet.AzureBlob/BlobNuGetPackageCacheExtensions.cs Adds DI helper to register blob-backed cache.
src/MeshWeaver.Mesh.Contract/Services/MeshOperationOptions.cs Adds mesh operation timeout options (default 30s).
src/MeshWeaver.Mesh.Contract/Services/IStorageAdapter.cs Updates docs/examples to Source/ naming.
src/MeshWeaver.Mesh.Contract/Services/INavigationService.cs Adds Status observable contract for UI progress reporting.
src/MeshWeaver.Mesh.Contract/Services/IIconGenerator.cs Adds icon generator abstraction returning an observable SVG.
src/MeshWeaver.Mesh.Contract/PartitionDefinition.cs Updates standard table mappings (Source/Testcode) and clarifies semantics.
src/MeshWeaver.Mesh.Contract/MeshExtensions.cs Adds timeout override + move timeout enforcement + grain dispose on delete.
src/MeshWeaver.Mesh.Contract/CodeConfiguration.cs Updates docs to Source/ naming.
src/MeshWeaver.Kernel.Hub/MeshWeaver.Kernel.Hub.csproj Removes Interactive package mgmt dependency; references MeshWeaver.NuGet.
src/MeshWeaver.Hosting/Persistence/MigrationUtility.cs Updates migration heuristics to include Source/Test + legacy _Source/_Test.
src/MeshWeaver.Hosting/Persistence/FileSystemStorageAdapter.cs Treats Source/Test as code paths + keeps legacy compatibility.
src/MeshWeaver.Hosting/Persistence/FileSystemPersistenceService.cs Parallelizes descendant move I/O (with concurrency implications).
src/MeshWeaver.Hosting/Persistence/CachingStorageAdapter.cs Updates code sub-namespace detection (Source/Test + legacy).
src/MeshWeaver.Hosting.PostgreSql/PostgreSqlPartitionedStoreFactory.cs Guards against source/test mistakenly becoming schemas.
src/MeshWeaver.Hosting.PostgreSql/PostgreSqlCrossSchemaQueryProvider.cs Filters malformed parameters to avoid NRE during SQL interpolation.
src/MeshWeaver.Hosting.Blazor/MeshWeaver.Hosting.Blazor.csproj Adds NU1510 suppression.
src/MeshWeaver.Graph/PartitionTypeSource.cs Updates docs to Source/ naming.
src/MeshWeaver.Graph/MeshWeaver.Graph.csproj References MeshWeaver.NuGet.
src/MeshWeaver.Graph/MeshNodeLayoutAreas.cs Improves create href behavior + reactive/grouped children catalog.
src/MeshWeaver.Graph/MeshDataSource.cs Updates docs to Source/ naming.
src/MeshWeaver.Graph/Configuration/ScriptCompilationService.cs Integrates NuGet directive parsing + resolver into compilation.
src/MeshWeaver.Graph/Configuration/NodeTypeDefinition.cs Updates docs/examples to Source/ naming.
src/MeshWeaver.Graph/Configuration/MeshDataSourceNodeType.cs Changes sources namespace constant to Source.
src/MeshWeaver.Graph/Configuration/GraphConfigurationExtensions.cs Registers NuGet resolver and uses Source code path.
src/MeshWeaver.Graph/Configuration/CodeNodeType.cs Treats Code nodes as primary content; defines Source/Test constants.
src/MeshWeaver.Documentation/Data/DataMesh/UnifiedPath.md Documents @/ semantics and HTML-href pitfalls.
src/MeshWeaver.Documentation/Data/DataMesh/SocialMedia/Profile/Source/SocialMediaProfileLayoutAreas.cs Adds SocialMedia profile layout areas example.
src/MeshWeaver.Documentation/Data/DataMesh/SocialMedia/Profile/Source/SocialMediaProfile.cs Adds SocialMedia profile content model example.
src/MeshWeaver.Documentation/Data/DataMesh/SocialMedia/Post/Source/SocialMediaPost.cs Adds SocialMedia post content model example.
src/MeshWeaver.Documentation/Data/DataMesh/SocialMedia/Post/Source/Platform.cs Adds SocialMedia platform reference-data example.
src/MeshWeaver.Documentation/Data/DataMesh/SocialMedia.md Updates docs to Source/ naming and authoring guidance.
src/MeshWeaver.Documentation/Data/DataMesh/SatelliteEntities.md Clarifies Source/Test are primary content, not satellites.
src/MeshWeaver.Documentation/Data/DataMesh/NodeTypes.md Adds Node Types documentation index page.
src/MeshWeaver.Documentation/Data/DataMesh/NodeTypeConfiguration.md Updates docs to Source/ naming.
src/MeshWeaver.Documentation/Data/DataMesh/NodeOperations.md Updates docs to Source/ naming.
src/MeshWeaver.Documentation/Data/DataMesh/DataConfiguration.md Updates docs to Source/ naming.
src/MeshWeaver.Documentation/Data/DataMesh/CreatingNodeTypes.md Updates docs to Source/Test naming throughout.
src/MeshWeaver.Documentation/Data/DataMesh.md Updates TOC links and adds NuGet packages bullet.
src/MeshWeaver.Documentation/Data/Architecture/PartitionedPersistence.md Updates persistence routing docs for Source/Test.
src/MeshWeaver.Documentation/Data/Architecture/MeshGraph.md Updates examples to Source/ naming.
src/MeshWeaver.Documentation/Data/Architecture/BusinessRules/Cession/Source/CessionSampleData.cs Adds cession sample dataset for docs/demo.
src/MeshWeaver.Documentation/Data/Architecture/BusinessRules/Cession/Source/CessionResultsArea.cs Adds reactive charting layout area example.
src/MeshWeaver.Documentation/Data/Architecture/BusinessRules/Cession/Source/CessionEngine.cs Adds pure business logic sample for cession calculations.
src/MeshWeaver.Documentation/Data/Architecture/BusinessRules/Cession/Source/CessionData.cs Adds content models for cession example.
src/MeshWeaver.Data/Serialization/SyncStreamOptions.cs Adds configurable heartbeat interval for sync streams.
src/MeshWeaver.Data/Serialization/JsonSynchronizationStream.cs Implements resubscribe-on-owner-dispose logic.
src/MeshWeaver.Blazor/Pages/ApplicationPage.razor Switches to NavigationStatus-driven progress/not-found/error UI.
src/MeshWeaver.Blazor/Components/NavigationProgressBar.razor.css Adds styling for full-page vs compact overlay progress bar.
src/MeshWeaver.Blazor/Components/NavigationProgressBar.razor Adds reusable “spinner + message” component.
src/MeshWeaver.Blazor/Components/MeshSearchView.razor.cs Adds Category grouping fallback to NodeType.
src/MeshWeaver.Blazor/Components/LayoutAreaView.razor.cs Adds stream lifecycle logging and additional diagnostics.
src/MeshWeaver.Blazor/Components/LayoutAreaView.razor Surfaces compilation progress indicator before first stream emission.
src/MeshWeaver.Blazor/Components/CompileProgressIndicator.razor.css Adds styling for compilation progress banner.
src/MeshWeaver.Blazor/Components/CompileProgressIndicator.razor Adds polling UI component for active NodeType compilation.
src/MeshWeaver.Blazor.Portal/MeshWeaver.Blazor.Portal.csproj Adds NU1510 suppression.
src/MeshWeaver.Blazor.AI/MeshWeaver.Blazor.AI.csproj Adds NU1510 suppression.
src/MeshWeaver.Blazor.AI/McpMeshPlugin.cs Adds Patch/Move/Copy MCP tools and improves tool descriptions.
src/MeshWeaver.AI/ThreadLayoutAreas.cs Adds debug logging around streaming view emission.
src/MeshWeaver.AI/IconGenerator.cs Adds default AI-backed IIconGenerator implementation.
src/MeshWeaver.AI/DelegationCompletedEvent.cs Removes delegation tracker/event types.
src/MeshWeaver.AI/Data/Agent/Worker.md Updates @/ link guidance (no raw HTML href with @/).
src/MeshWeaver.AI/Data/Agent/ToolsReference.md Updates @/ link guidance and provides correct/incorrect table.
src/MeshWeaver.AI/Data/Agent/Orchestrator.md Updates @/ link guidance for agent outputs.
src/MeshWeaver.AI/AIExtensions.cs Removes old type registration; registers IIconGenerator.
memex/aspire/Memex.Portal.Distributed/Program.cs Registers blob-backed NuGet package cache in distributed deployment.
memex/aspire/Memex.Portal.Distributed/Memex.Portal.Distributed.csproj References MeshWeaver.NuGet.AzureBlob.
memex/aspire/Memex.Database.Migration/Program.cs Adds source/test to reserved schema list.
memex/aspire/Memex.AppHost/Program.cs Adds LinkedIn secret/env wiring + sets NUGET_PACKAGES cache dir.
memex/Memex.Portal.Shared/Social/SocialMediaUserMenuProvider.cs Adds “Social Media” shortcut on a user’s own node (lazy hub creation).
memex/Memex.Portal.Shared/Social/ApiCredentialNodeType.cs Adds NodeType for PlatformCredential stored under _ApiCredentials.
memex/Memex.Portal.Shared/Pages/Login.razor Adds “Connect LinkedIn for publishing” CTA on login page.
memex/Memex.Portal.Shared/OrganizationNodeType.cs Switches to default layout areas registration.
memex/Memex.Portal.Shared/MemexConfiguration.cs Adds LinkedIn publisher wiring, @/ redirect middleware, and routes.
memex/Memex.Portal.Shared/Memex.Portal.Shared.csproj References MeshWeaver.Social.
memex/Memex.Portal.Monolith/appsettings.Development.json Enables debug logging for LayoutAreaView.
MeshWeaver.slnx Adds new projects (NuGet, NuGet.AzureBlob, Social, new test projects).
Directory.Packages.props Adds NuGet.* package versions for resolver implementation.
CLAUDE.md Documents @/ local-only rule and href/URL restrictions.
(Various) samples/Graph/... Adds/updates many sample NodeTypes and content under Source/ to reflect new conventions and demos.

Comment thread test/MeshWeaver.Social.Test/PostStatsRefresherTest.cs
Comment thread src/MeshWeaver.Hosting/Persistence/FileSystemPersistenceService.cs Outdated
rbuergi added a commit that referenced this pull request Apr 22, 2026
…+ test helpers

Recursive DeleteNodeRequest handled on a node's own hub was deadlocking: the final
DeleteSelfFromStorage posted Ok and DisposeRequest from the dying hub, so the Ok
raced callback disposal on the caller and was lost. Introduce CommitNodeDeletionMessage
and forward the terminal commit (storage delete + reply + grain dispose) to the
resolved mesh hub (walking ParentHub upward) — Sender becomes the stable mesh hub,
FIFO on the caller's inbound queue guarantees Ok resolves the RegisterCallback before
DisposeRequest arrives.

Also addresses two Copilot review comments on PR #95:

- FileSystemStorageAdapter.DeleteAsync empty-directory ascent is now concurrency-
  tolerant: wraps the enumerate + Directory.Delete in try/catch, swallowing the
  DirectoryNotFoundException race and breaking on IOException (non-empty / in-use).
  Required because FileSystemPersistenceService.MoveNodeAsync now parallelizes
  descendant deletes via Task.WhenAll.

- PostStatsRefresherTest.WaitUntilAsync throws TimeoutException with a descriptive
  message instead of returning silently on deadline, so the test cannot green-tick
  a stats-refresh that never happened.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

rbuergi commented Apr 27, 2026

@copilot resolve the merge conflicts in this pull request


Copilot AI commented Apr 27, 2026

@copilot resolve the merge conflicts in this pull request

Resolved. The merge with origin/main is done in commit 6899990.

Conflicts resolved:

  • .claude/settings.json — kept HEAD deletion
  • Login.razor — kept NavigateToConnectLinkedIn (HEAD) + switched to OnInitializedAsync (origin/main)
  • AppHost/Program.cs — merged origin/main's optional-secrets pattern (nullable embeddingKey/googleClientSecret, customDomainValue guard) with HEAD's LinkedIn params (also made optional/nullable) and local-test/local-prod storage branch
  • Memex.Database.Migration/Program.cs — kept HEAD's data-repair v8 (fix ThreadMessage.MainNode) and v9 (rename _Source/_Test path segments)
  • SecurityService.cs — kept HEAD's refactored CollectStaticRoleIds returning (roleIds, cap); origin/main's permission-evaluation logic is already present in the new reactive GetEffectivePermissions method

@rbuergi rbuergi requested a review from Copilot May 10, 2026 05:41
Contributor

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@rbuergi rbuergi requested a review from Copilot May 10, 2026 06:49
Contributor

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Contributor Author

rbuergi commented May 10, 2026

Code review — recent stability batch

Status: ✅ All 11 items in this comment addressed. See per-item commit SHAs in each header. Verification: Memex.Portal.Distributed builds clean; the four tests covering these changes (IsExecutingLifecycleTest, ChatHistoryTest ×2, CancelThreadExecutionTest) pass locally.

Manual review of the last ~20 commits since 8c5f37c80 (the doc commit). Focused on the synced-query consolidation, multi-query UNION feature, ThreadExecution refactor, and new tests. Copilot's two prior comments are already addressed in code. Findings below are grouped by severity.

Correctness — should fix before merge

1. ✅ e68636aac PostgreSqlStorageAdapter.QueryNodesAsync(IReadOnlyList<ParsedQuery>, …): parameter-rename can mangle SQL.
File: src/MeshWeaver.Hosting.PostgreSql/PostgreSqlStorageAdapter.cs (the new UNION overload, ~line 530).

foreach (var (k, v) in perParams)
{
    var newKey = "@" + prefix + k.TrimStart('@');
    renamedSql = renamedSql.Replace(k, newKey);
    renamedParams[newKey] = v;
}

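Dictionary<string,object> enumeration order is not guaranteed, and string.Replace matches substrings rather than whole parameter names. If perParams contains both @p and @p1, processing @p first rewrites the SQL's @p1 to @q0_p1 only as a side effect of the @p match. The rename double-prefixes as soon as already-renamed text still contains a later key: with @q and @q1, processing @q1 first turns the SQL's @q1 into @q0_q1, then processing @q mangles @q0_q1 into @q0_q0_q1. Mixed-order builds will silently drift. string.Replace also clobbers @… substrings inside string literals or JSONB path comparisons.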

Fix: single regex pass keyed on @<name> word boundary, gated on perParams.ContainsKey so we don't rewrite literal @ tokens.
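A minimal sketch of the single-pass rename, assuming the keys in perParams carry their leading @ (as the loop above implies). ParamRenamer and Rename are hypothetical names, not the adapter's actual members, and the token pattern is one plausible choice rather than the one used in the fix.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParamRenamer
{
    // Matches whole "@name" tokens. The lookbehind skips "@@" and "user@host" forms,
    // and the greedy \w* means "@q1" is never seen as "@q" followed by "1".
    private static readonly Regex ParamToken =
        new(@"(?<![\w@])@(?<name>[A-Za-z_]\w*)", RegexOptions.Compiled);

    public static (string Sql, Dictionary<string, object> Params) Rename(
        string sql, IReadOnlyDictionary<string, object> perParams, string prefix)
    {
        // Parameter values are re-keyed independently of the SQL text.
        var renamedParams = new Dictionary<string, object>();
        foreach (var (k, v) in perParams)
            renamedParams["@" + prefix + k.TrimStart('@')] = v;

        // One pass over the SQL: each token is rewritten at most once,
        // and only when it is a known parameter.
        var renamedSql = ParamToken.Replace(sql, m =>
            perParams.ContainsKey(m.Value)
                ? "@" + prefix + m.Groups["name"].Value
                : m.Value);

        return (renamedSql, renamedParams);
    }
}
```

Because every token is visited once, the result no longer depends on dictionary enumeration order. The sketch does not parse SQL, so a known parameter name that appears inside a quoted literal would still be rewritten.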

2. ✅ e68636aac UNION (vs UNION ALL) dedup is row-wise, not path-wise.
Same file, same overload. The comment claims "same path emitted by two queries collapses to one row, matching the engine's path-keyed dictionary fold", but UNION only collapses rows that are identical across all selected columns. Two queries returning the same MeshNode with a slightly different LastModified (concurrent writer) won't dedup.

Fix: UNION ALL wrapped in SELECT DISTINCT ON (namespace, id) … ORDER BY namespace, id, last_modified DESC. (No literal path column is projected; (namespace, id) is the path-keyed identity tuple. Newest version wins the tie-break.)
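The path-keyed fold that the SQL fix is meant to match can be sketched in memory as below. NodeRow and DedupByPath are hypothetical names for illustration only; the real engine's types are not shown in this thread.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape: (Namespace, Id) is the identity, LastModified breaks ties.
public sealed record NodeRow(string Namespace, string Id, DateTimeOffset LastModified);

public static class PathDedup
{
    // In-memory analogue of DISTINCT ON (namespace, id) ... ORDER BY namespace, id, last_modified DESC:
    // one row per identity tuple, keeping the newest version.
    public static IReadOnlyList<NodeRow> DedupByPath(IEnumerable<NodeRow> rows) =>
        rows.GroupBy(r => (r.Namespace, r.Id))
            .Select(g => g.MaxBy(r => r.LastModified)!)
            .OrderBy(r => r.Namespace, StringComparer.Ordinal)
            .ThenBy(r => r.Id, StringComparer.Ordinal)
            .ToList();
}
```

Two rows that differ only in LastModified collapse to the newer one here, which is exactly the case a plain UNION leaves duplicated.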

3. ✅ e68636aac PostgreSqlMeshQuery.ObserveQuery<T> ignores request.Queries for change detection.
src/MeshWeaver.Hosting.PostgreSql/PostgreSqlMeshQuery.cs:360-401. The method parsed only request.Query (single string), and the change-notifier filter used the first query's normalizedBasePath + effectiveScope for PathMatcher.ShouldNotify. Multi-query observations correctly fanned out to all queries inside CollectQueryResultsAsync, but live updates that match only query #2's path/scope wouldn't trigger a re-run.

Fix: parse every query in request.EffectiveQueries, build per-query (basePath, scope) filters, OR-join them in the change-notifier subscription.

4. ✅ e68636aac MeshQueryEngine Activity post-filter uses only first query's basePath.
src/MeshWeaver.Hosting/Persistence/Query/MeshQueryEngine.cs:125-138, 183-196. When parsedQuery.Source == QuerySource.Activity, the post-filter scanned descendants of firstBasePath for Activity satellites — queries #2+ with unrelated basePaths had their Activity matches filtered against the wrong subtree.

Fix: CollectMatchedAsync returns the list of every query's basePath; the activity post-filter scans every base path's descendants and unions activity-main-paths.

Race / lifecycle hazards

5. ✅ 478fdaa93 ThreadExecution.RecoverStaleExecutingThread 2-minute window contradicts "no time limits" commit.
src/MeshWeaver.AI/ThreadExecution.cs:175-180. Commit 6dc436bf5 made the policy explicit, but recovery still said "Only recover truly stale ones (started > 2 minutes ago or no timestamp)." A legitimate slow execution that crashes after 2+ minutes wouldn't be recovered → IsExecuting=true forever.

Fix: drop the time-based heuristic in favour of a structural one — skip recovery only when the thread is still an auto-execute candidate (PendingUserMessage + ActiveMessageId set, i.e. WatchForExecution will pick it up).

6. ✅ 478fdaa93 Subject<StreamingSnapshot> not disposed.
src/MeshWeaver.AI/ThreadExecution.cs:890. Fix: using var snapshots = new Subject<…>().

7. ✅ eea8ed10a — Sample(100ms) terminal-status race regression test.
The terminal-status guard correctly prevents Streaming from regressing Completed/Cancelled/Error in PushToResponseMessage. Fix: added a regression assertion in IsExecutingLifecycleTest that final ThreadMessage.Status == Completed after a successful echo run.

8. ✅ 478fdaa93 HandleCancelStream runs after CTS-storage race.
src/MeshWeaver.AI/ThreadExecution.cs:1284-1289. parentHub.Set(executionCts) happened around line 847, but IsExecuting=true flipped earlier in HandleSubmitMessage. A cancel arriving in that window was a no-op.

Fix: pre-allocate the CancellationTokenSource and store it on the thread hub in HandleSubmitMessage before posting SubmitMessageResponse. ExecuteMessageAsync reuses it from the parent-hub slot (with a fresh-CTS fallback for the auto-execute path that bypasses HandleSubmitMessage).

Style / consistency

9. ✅ 478fdaa93 — Triple-stacked <summary> XML doc tags.
Collapsed both blocks (WatchForExecution, NotifyParentCompletion) to a single <summary>.

10. ✅ eea8ed10a IsExecutingLifecycleTest text-pattern wait inconsistent with ChatHistoryTest.
Fix: migrated to ThreadMessage.CompletedAt is not null — same pattern as ChatHistoryTest.SubmitAndWait after commit ab3af8b70.

11. ✅ e68636aac — Limit-on-first-query semantics.
request.Limit was applied only to parsedList[0]; query #0 could hit its limit before yielding its most relevant rows while queries #1+ contributed unbounded — making the result iteration-order dependent.

Fix: drop the per-query Limit injection. Limit is enforced post-union via MinLimit(request.Limit, firstParsed.Limit) in both engines, so a request-level cap can't be circumvented and an in-query limit:N still wins when smaller.

✅ Looks good (no action needed)

  • SyncedQueryMeshNodes doc-comment now matches the dict-from-query-events fold (post the doc commit).
  • LoadFullConversationHistoryFromMesh correctly reads the live thread's Messages list and resolves each cell via GetMeshNodeStream (per-node hub) — sidesteps the stale-index race the comment calls out.
  • MultiQueryUnionEngineTests covers the union semantics on the in-memory engine without needing a testcontainer.
  • CancelThreadExecutionTest rewrite (commit-pending) correctly uses "Generating response..." as the CTS-armed signal.
  • The terminal-status guard pattern (current.Status is Completed or Cancelled or Error && requestedStatus == Streaming → keep current) is the right shape.
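The terminal-status guard in the last bullet can be written as a small pure function. This is a sketch under assumptions: the enum and its Pending member, and the names MessageStatus, StatusGuard, and Resolve, are hypothetical; only the Streaming, Completed, Cancelled, and Error statuses come from the review text.

```csharp
using System;

// Hypothetical status enum; the real ThreadMessage status type may differ.
public enum MessageStatus { Pending, Streaming, Completed, Cancelled, Error }

public static class StatusGuard
{
    // A late Streaming update must not overwrite a terminal status;
    // every other transition is applied as requested.
    public static MessageStatus Resolve(MessageStatus current, MessageStatus requested) =>
        (current is MessageStatus.Completed or MessageStatus.Cancelled or MessageStatus.Error)
        && requested == MessageStatus.Streaming
            ? current
            : requested;
}
```

Keeping the rule in one function makes it easy to cover with a table of (current, requested) pairs instead of relying on timing in an integration test.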

Contributor Author

rbuergi commented May 10, 2026

Code review — part 2: rest of the PR

Status: ✅ All 12 items in this comment addressed. See per-item commit SHAs in each header. NuGet validation in #14 was deferred at first then closed in 6c3e60925.

Continuing review on the bulk of the PR (everything before the recent stability batch). Focused on the new projects (MeshWeaver.NuGet, MeshWeaver.Social) and a sampling of the central MessageHub refactor — the full 100-commit / 1006-file diff is too large for an exhaustive read. Same severity grouping as part 1.

Correctness — should fix before merge

12. ✅ 512adb462 NuGetAssemblyResolver caches faulted Tasks forever.
src/MeshWeaver.NuGet/NuGetAssemblyResolver.cs:42.

return _cache.GetOrAdd(key, _ => ResolveCoreAsync(requested, framework, ct));

If ResolveCoreAsync threw, the faulted Task<ResolvedPackageSet> stayed in the cache; subsequent calls replayed the same exception forever.

Fix: evict faulted/cancelled tasks from the cache before returning. Also pass CancellationToken.None to the shared core task so a single caller's cancellation can't take down the resolution for everyone else; per-caller ct projects via task.WaitAsync(ct).
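One way to picture the eviction described above, as a sketch rather than the resolver's actual code: EvictingTaskCache and GetOrAddAsync are hypothetical names, and the factory is assumed to be an async method that reports failure through its returned Task rather than by throwing synchronously.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class EvictingTaskCache<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, Lazy<Task<TValue>>> _cache = new();

    public Task<TValue> GetOrAddAsync(
        TKey key,
        Func<CancellationToken, Task<TValue>> factory,
        CancellationToken ct)
    {
        // Drop a previously faulted or cancelled entry so the next caller retries.
        // Removing by key + value avoids evicting a newer entry added by another thread.
        if (_cache.TryGetValue(key, out var existing)
            && existing.IsValueCreated
            && (existing.Value.IsFaulted || existing.Value.IsCanceled))
        {
            _cache.TryRemove(new KeyValuePair<TKey, Lazy<Task<TValue>>>(key, existing));
        }

        // The shared task runs with CancellationToken.None so that one caller's
        // cancellation cannot fail the resolution for everyone else.
        var lazy = _cache.GetOrAdd(
            key, _ => new Lazy<Task<TValue>>(() => factory(CancellationToken.None)));

        // Each caller observes its own token through WaitAsync.
        return lazy.Value.WaitAsync(ct);
    }
}
```

A failed resolution is still reported to the caller that triggered it; the difference is that the following call starts a fresh attempt instead of replaying the cached exception.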

13. ✅ 512adb462 NuGetAssemblyResolver resolves with DependencyBehavior.Lowest.
src/MeshWeaver.NuGet/NuGetAssemblyResolver.cs:74. "Lowest" pulls minimum-satisfying versions transitively, which yanks in EOL/unpatched releases when constraints have weak floors.

Fix: switched to DependencyBehavior.HighestMinor so security fixes flow in transparently without crossing minor/major boundaries.

14. ✅ 6c3e60925 — Hydrated package not validated.
After INuGetPackageCache.TryHydrateAsync returned true, the resolver trusted the content — a poisoned cache entry (different package stored under wrong key) would silently load wrong assemblies.

Fix: post-hydration, the resolver opens the package folder via PackageFolderReader.GetIdentity() and verifies the .nuspec-declared (id, version) matches expected. On mismatch the directory is purged and the resolver falls back to the feed download path. No INuGetPackageCache contract change needed.

15. ✅ 478fdaa93 — XPublisher.PublishAsync crashes on partial response.
src/MeshWeaver.Social/XPublisher.cs:71. The chained GetProperty("data").GetProperty("id") threw KeyNotFoundException on unexpected body shapes.

Fix: defensive TryGetProperty chain; logs a warning and returns id = null (caller treats as "publish succeeded but URN couldn't be captured") instead of crashing. Also guards against null AuthorHandle.
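
The defensive parse can be sketched with `System.Text.Json` as follows. `PostIdParser.TryReadPostId` is a hypothetical name, and the real publisher also logs a warning, which is omitted here.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper; the PR's XPublisher is not reproduced here.
public static class PostIdParser
{
    // Returns the "data.id" string, or null when the body has an unexpected shape.
    public static string? TryReadPostId(string responseBody)
    {
        try
        {
            using var doc = JsonDocument.Parse(responseBody);
            var root = doc.RootElement;

            // TryGetProperty throws on non-object elements, so check ValueKind first.
            if (root.ValueKind == JsonValueKind.Object
                && root.TryGetProperty("data", out var data)
                && data.ValueKind == JsonValueKind.Object
                && data.TryGetProperty("id", out var id)
                && id.ValueKind == JsonValueKind.String)
            {
                return id.GetString();
            }
        }
        catch (JsonException)
        {
            // The body was not valid JSON; treat it like a missing id.
        }

        return null;
    }
}
```

Returning null rather than throwing lets the caller report "published, id unknown" as a distinct outcome from a failed publish.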

16. ✅ 478fdaa93 (LinkedIn) + 512adb462 (X) — Publishers don't auto-retry on token-refresh race.
Fix: SendWith401RetryAsync helper in both publishers — on 401, force-refresh the token (zero ExpiresAt so EnsureFreshAsync doesn't short-circuit) and retry the request once.
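
A rough sketch of the retry-once-on-401 shape, with the token lookup and the HTTP send passed in as delegates so the example stays self-contained. `RetryOn401`, `sendWithToken` and `getToken` are hypothetical names; the PR's `SendWith401RetryAsync` may be structured differently.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper illustrating the pattern only.
public static class RetryOn401
{
    public static async Task<HttpResponseMessage> SendAsync(
        Func<string, CancellationToken, Task<HttpResponseMessage>> sendWithToken,
        Func<bool, CancellationToken, Task<string>> getToken, // bool = force a refresh
        CancellationToken ct)
    {
        // First attempt with whatever token is currently considered fresh.
        var token = await getToken(false, ct);
        var response = await sendWithToken(token, ct);
        if (response.StatusCode != HttpStatusCode.Unauthorized)
            return response;

        // The token expired between the freshness check and the call:
        // force a refresh and retry exactly once.
        response.Dispose();
        token = await getToken(true, ct);
        return await sendWithToken(token, ct);
    }
}
```

The send delegate must build a new request on each call, since an `HttpRequestMessage` cannot be sent twice.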

Race / lifecycle hazards

17. ✅ 512adb462 — PostStatsRefresher processes targets sequentially.
Fix: Parallel.ForEachAsync bounded by SocialOptions.StatsRefreshDegreeOfParallelism (default 8).

18. ✅ 512adb462 — PostStatsRefresher has no per-target backoff.
Fix: ConcurrentDictionary<string, DateTimeOffset> of last-failure timestamps. Targets that failed within SocialOptions.StatsRefreshFailureBackoff (default 15 min) skip the next tick. Success clears the entry so the target rejoins normal cadence.
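
Items 17 and 18 combine naturally; a minimal sketch is below, assuming .NET 6 or later for `Parallel.ForEachAsync`. `BackoffRefresher` and its members are hypothetical, and the current time is passed in explicitly so the behaviour is easy to test.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the PR's PostStatsRefresher.
public sealed class BackoffRefresher
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastFailure = new();
    private readonly TimeSpan _backoff;
    private readonly int _degreeOfParallelism;

    public BackoffRefresher(TimeSpan backoff, int degreeOfParallelism)
    {
        _backoff = backoff;
        _degreeOfParallelism = degreeOfParallelism;
    }

    public Task RunTickAsync(
        IEnumerable<string> targets,
        Func<string, CancellationToken, Task> refresh,
        DateTimeOffset now,
        CancellationToken ct)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = _degreeOfParallelism,
            CancellationToken = ct,
        };

        return Parallel.ForEachAsync(targets, options, async (target, token) =>
        {
            // Skip targets that failed recently.
            if (_lastFailure.TryGetValue(target, out var failedAt) && now - failedAt < _backoff)
                return;

            try
            {
                await refresh(target, token);
                _lastFailure.TryRemove(target, out _); // success clears the backoff entry
            }
            catch (Exception) when (!token.IsCancellationRequested)
            {
                _lastFailure[target] = now; // remember the failure time for later ticks
            }
        });
    }
}
```

Catching per target keeps one failing platform from faulting the whole tick, while the exception filter lets genuine cancellation propagate.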

19. ✅ df1939bb7 — MessageHub faulted-Task cache pattern.
The MESHWEAVER_DISPOSE_TRACE=1 global file lock + per-call File.AppendAllText serialised hub teardown when many hubs disposed concurrently.

Fix: replaced with a single bounded Channel<string> (4096, FullMode = DropWrite) drained by one writer task started in the type initialiser. Producers TryWrite non-blocking; lines drop on full so a stuck writer never delays dispose.
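
The bounded-channel approach can be sketched with `System.Threading.Channels`. `DroppingTraceWriter` is a hypothetical class that writes to any `TextWriter`; the PR's version is started from a type initialiser and appends to a file.

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch of a non-blocking trace sink; not the PR's MessageHub code.
public sealed class DroppingTraceWriter
{
    private readonly Channel<string> _lines;
    private readonly Task _writerTask;

    public DroppingTraceWriter(TextWriter sink, int capacity = 4096)
    {
        _lines = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            // When the buffer is full, the line being written is discarded.
            FullMode = BoundedChannelFullMode.DropWrite,
            SingleReader = true,
        });

        // One background reader drains the channel, so producers never touch the sink.
        _writerTask = Task.Run(async () =>
        {
            await foreach (var line in _lines.Reader.ReadAllAsync())
                await sink.WriteLineAsync(line);
        });
    }

    // Never blocks the caller, even when the sink is slow.
    public void Trace(string line) => _lines.Writer.TryWrite(line);

    // Stops accepting lines and waits for the buffered ones to be written.
    public Task CompleteAsync()
    {
        _lines.Writer.TryComplete();
        return _writerTask;
    }
}
```

The trade-off is explicit: trace lines may be lost under pressure, in exchange for dispose never waiting on disk I/O.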

Style / consistency

20. ✅ 478fdaa93 — SocialExtensions.AddSocialPublishing lifetime mismatch.
AddHttpClient<LinkedInPublisher>() registered the typed client as transient; the IPlatformPublisher factory then made it singleton — direct vs via-interface resolution returned different instances.

Fix: register the publisher as a true singleton via services.AddSingleton(sp => new LinkedInPublisher(httpFactory.CreateClient(...), ...)). Same for X. Both IPlatformPublisher and concrete-type resolution return the same instance.

21. ✅ 478fdaa93 — SocialExtensions claims "all-or-nothing" but isn't.
The four AddHostedService<…> calls were unconditional even with zero platforms configured.

Fix: gate hosted-service registration on anyConfigured; with zero platforms, no hosted services start.

22. ✅ 478fdaa93 — LinkedInPublisher uses dynamic to peek at typed-anonymous fields.
Fix: two concrete payload shapes in if/else branches; no dynamic dispatch; typos surface as compile errors instead of RuntimeBinderException.

23. ✅ 478fdaa93 — PII / user-content in error logs.
Fix: Truncate(b, 200) on logged error bodies in both publishers (LinkedIn publish + token refresh, X publish). Full body still goes to PublishResult.Error for the caller.

✅ Looks good (no action needed)

  • NuGetAssemblyResolver correctly caches by (framework, sorted package list) so repeated #r invocations don't re-walk dependencies.
  • MessageHub AsyncSubject pattern fixes the long-standing "subscribe before vs after response" race in the old RegisterCallback.
  • LinkedInPublisher correctly handles the LinkedIn x-restli-id header fallback and only falls back to JSON body parsing when the header is missing.
  • SocialOptions defaults look reasonable (60s publish tick, 30m stats tick, 30d window).
  • EnsureFreshAsync returns a refreshed PlatformCredential to the caller rather than mutating internal state — caller decides where to persist.

Areas not covered in this review

Persistence-service refactors (IStorageService, MeshNodeEditor, NavigationService changes), the +850-line MessageHub core-dispatch refactor in detail, content-collection changes, NodeType compilation pipeline beyond what part 1 touched. Flag a specific subsystem if a deeper review is wanted.

@rbuergi
Contributor Author

rbuergi commented May 10, 2026

Review fixes applied — all 23 items addressed

5 commits, organised by batch. Locally committed, not pushed yet.

| # | Item | Commit |
|---|------|--------|
| 1 | UNION SQL param-rename regex pass | e68636aac |
| 2 | UNION ALL + DISTINCT ON (namespace, id) for path-keyed dedup | e68636aac |
| 3 | ObserveQuery change-notifier OR-joined per-query filters | e68636aac |
| 4 | MeshQueryEngine Activity post-filter scans every basePath | e68636aac |
| 5 | RecoverStaleExecutingThread structural guard (drop time-based heuristic) | 478fdaa93 |
| 6 | using var on Subject<StreamingSnapshot> | 478fdaa93 |
| 7 | Regression assertion: final ThreadMessage.Status == Completed | eea8ed10a |
| 8 | Pre-allocate CancellationTokenSource in HandleSubmitMessage | 478fdaa93 |
| 9 | Collapse triple-stacked <summary> blocks | 478fdaa93 |
| 10 | IsExecutingLifecycleTest waits on CompletedAt, not text patterns | eea8ed10a |
| 11 | Limit-on-first-query semantics: enforce post-union via MinLimit | e68636aac |
| 12 | NuGetAssemblyResolver evicts faulted/cancelled cache entries | 512adb462 |
| 13 | NuGet DependencyBehavior.HighestMinor (was Lowest) | 512adb462 |
| 14 | Hydrated-cache validation note (deferred — needs INuGetPackageCache change) | 512adb462 |
| 15 | XPublisher defensive TryGetProperty chain | 478fdaa93 |
| 16 | LinkedIn / X publishers retry once on 401 with token refresh | 478fdaa93 (LinkedIn structure), 512adb462 (X 401 retry parity) |
| 17 | PostStatsRefresher uses Parallel.ForEachAsync (DOP 8) | 512adb462 |
| 18 | Per-target failure backoff (15 min default) | 512adb462 |
| 19 | Channel-based dispose trace replaces global file lock | df1939bb7 |
| 20 | SocialExtensions: factory-resolved singleton publishers | 478fdaa93 |
| 21 | Hosted services gated on at least one configured platform | 478fdaa93 |
| 22 | LinkedIn dynamic→concrete payload shapes | 478fdaa93 |
| 23 | Cap error-body logs at 200 chars (LinkedIn + X) | 478fdaa93 |

Verification

  • Solution build clean (memex/aspire/Memex.Portal.Distributed).
  • Tests I touched all pass locally:
    • IsExecutingLifecycleTest.SingleMessage_IsExecuting_FlipsTrueThenFalse_WithRealResponse — 11 s
    • ChatHistoryTest.ThreeMessages_AgentSeesFullHistory — 2 s
    • ChatHistoryTest.TwoMessages_NoDuplicates_CorrectRoles — 3 s
    • CancelThreadExecutionTest.CancelStream_StopsExecutionAndMarksAsCancelled — 3 s
  • The full MeshWeaver.Threading.Test suite has 4 unrelated pre-existing failures (not introduced by these commits — present on main as well).

Notes

  • Item #14 (cache content validation) is documented as a TODO rather than implemented — INuGetPackageCache.TryHydrateAsync doesn't currently expose a content hash to verify against, so the fix needs a contract change. Flagged in code at NuGetAssemblyResolver.EnsureInstalledAsync.
  • Item #5 (recovery time window) swapped the time-based heuristic for a structural one (PendingUserMessage + ActiveMessageId set → leave to WatchForExecution). Same intent, no time-bound failure mode.
  • Item #8 (CTS race) required a structural change: HandleSubmitMessage now pre-allocates and stores the CTS before the response goes out, and ExecuteMessageAsync reuses it from the parent hub slot. The auto-execute path (WatchForExecution) gets a fallback CTS if the slot is empty.

Ready to push when you want.

@rbuergi
Contributor Author

rbuergi commented May 10, 2026

Done — review item #14 is now closed in commit 6c3e60925. The hydrated folder is validated via PackageFolderReader.GetIdentity() against the expected (id, version); on mismatch the directory is purged and the resolver falls back to the feed. No INuGetPackageCache contract change needed — validation is in the resolver. Total: 6 commits, all 23 review items addressed.

rbuergi added a commit that referenced this pull request May 10, 2026
…fix DI lifetimes, redact PII, drop dynamic

- ThreadExecution: collapse triple-stacked <summary> blocks on
  WatchForExecution and NotifyParentCompletion. Tooling kept the last
  one anyway; the dead scaffolding was just noise.
- SocialExtensions: register LinkedInPublisher / XPublisher as TRUE
  singletons (factory-resolved with named HttpClient). The previous
  AddHttpClient<T>+AddSingleton<IPlatformPublisher> mix made the
  concrete type transient while the interface alias was singleton —
  direct vs via-interface resolution returned different instances.
  Also gate hosted-service registration on at least one platform
  being configured (the "all-or-nothing" comment was wrong; with
  zero platforms the four hosted services started anyway and faulted
  on first tick).
- LinkedInPublisher: replace `(dynamic)media.shareMediaCategory`
  peek with two concrete payload shapes — typo turns into a compile
  error instead of a RuntimeBinderException.
- LinkedIn / X publishers: cap error-body logs at 200 chars to
  bound PII exposure (the body can echo the user's post text on
  validation rejection). Full body still goes to PublishResult.Error
  for the caller.

Addresses PR #95 review items #9, #20, #21, #22, #23.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi added a commit that referenced this pull request May 10, 2026
… in-memory engines

PostgreSqlStorageAdapter.QueryNodesAsync(IReadOnlyList<ParsedQuery>):
  - Replace order-dependent `string.Replace` parameter rename with a
    single `Regex.Replace` keyed on @<name> word boundary that gates
    on perParams.ContainsKey. Sequential Replace was mangling adjacent
    tokens (renaming `@p` after `@p1` produced `@q0_q0_p1`) and could
    clobber `@…` substrings inside string literals / JSONB paths.
  - Switch from `UNION` to `UNION ALL` wrapped in
    `SELECT DISTINCT ON (namespace, id) ... ORDER BY namespace, id, last_modified DESC`.
    Plain UNION dedupes whole rows — two queries observing the same
    node at slightly-different last_modified would BOTH appear in the
    output. Path-keyed dedup (= MeshNode identity) with newest-wins
    tie-break collapses them correctly.
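
The single-pass parameter rename from the first bullet above can be sketched like this. `SqlParamRenamer` and the `renames` map are hypothetical stand-ins for the adapter's code and its `perParams` dictionary; the sketch does not parse SQL string literals.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not the PR's PostgreSqlStorageAdapter code.
public static class SqlParamRenamer
{
    // \w+ is greedy, so "@p1" is matched as the whole token "p1", never as "p" plus "1".
    private static readonly Regex ParamToken = new(@"@(\w+)", RegexOptions.Compiled);

    // Renames every known @name in one pass; unknown tokens are left untouched.
    public static string Rename(string sql, IReadOnlyDictionary<string, string> renames) =>
        ParamToken.Replace(sql, match =>
            renames.TryGetValue(match.Groups[1].Value, out var renamed)
                ? "@" + renamed
                : match.Value);
}
```

Because each token is rewritten exactly once, the result no longer depends on the order in which parameters are processed, and already-renamed text is never rescanned.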

PostgreSqlMeshQuery.ObserveQuery<T>:
  - Parse EVERY query in request.EffectiveQueries and build per-query
    (basePath, scope) filters; the change-notifier subscription
    OR-joins them so multi-query observations get delta refreshes
    triggered by ANY query's path/scope, not just query #0's. The
    previous shape silently lost live updates from queries #1+.

PostgreSqlMeshQuery.QueryNodesUnionAsync + MeshQueryEngine:
  - Drop the per-query `parsedList[0].Limit = request.Limit` injection.
    Query #0 hit its limit before yielding the union's most relevant
    rows, while queries #1+ contributed unbounded — making the result
    iteration-order dependent. Limit is now enforced post-union via
    MinLimit(request.Limit, firstParsed.Limit) so a request-level cap
    can't be circumvented and an in-query `limit:N` still wins when
    smaller.
  - MeshQueryEngine: CollectMatchedAsync returns the LIST of every
    query's basePath; the source:activity post-filter scans every
    base path's descendants and unions activity-main-paths so
    queries #1+ aren't filtered against query #0's subtree only.

Addresses PR #95 review items #1, #2, #3, #4, #11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi added a commit that referenced this pull request May 10, 2026
…ThreadExecution stability fixes

ThreadExecution.cs (already in commit 478fdaa — recapping here for the
review-item index):
  - RecoverStaleExecutingThread: drop the 2-minute "fresh execution"
    window in favour of a structural check (skip when PendingUserMessage
    + ActiveMessageId are still set, i.e. the thread is an
    auto-execute candidate WatchForExecution will pick up). Closes the
    "long-running agent crashed at minute 5 → IsExecuting=true forever"
    gap; the time-based heuristic contradicted commit 6dc436b's
    "no time limits" stance.
  - Subject<StreamingSnapshot>: declare with `using var` so the
    Subject itself disposes alongside its subscription. Minor leak
    per execution previously.
  - HandleSubmitMessage: pre-allocate the per-round
    CancellationTokenSource and store it on the thread hub BEFORE
    posting SubmitMessageResponse — closes the race where an early
    Stop click between IsExecuting=true and ExecuteMessageAsync's
    `parentHub.Set(executionCts)` found a null CTS slot and
    silently no-op'd. ExecuteMessageAsync now reuses the
    pre-allocated CTS (with a fallback for the auto-execute path
    that bypasses HandleSubmitMessage).

IsExecutingLifecycleTest.cs:
  - Migrate the response-text wait from text-pattern matching
    (skipping placeholders "Allocating agent..." etc.) to
    `ThreadMessage.CompletedAt is not null`, which
    ExecuteMessageAsync sets only on the terminal
    PushToResponseMessage call. Same pattern adopted in
    ChatHistoryTest in commit ab3af8b.
  - Add a regression assertion that final
    ThreadMessage.Status == Completed. The terminal-status guard in
    PushToResponseMessage prevents the late Sample(100ms)-flushed
    Streaming push from regressing the cell from Completed back to
    Streaming; this assertion catches any future regression of that
    guard.

Addresses PR #95 review items #5, #6, #7, #8, #10.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi added a commit that referenced this pull request May 10, 2026
…, parallelism, backoff)

NuGetAssemblyResolver:
  - Evict faulted/cancelled tasks from the per-key cache before
    returning. A transient feed failure (network, throttle, cancelled
    in-flight resolve) used to poison the cache for the resolver's
    lifetime — every subsequent call replayed the same exception.
  - Pass CancellationToken.None to the shared core task so a single
    caller's cancellation can't take down the resolution for
    others; per-caller `ct` projects via `task.WaitAsync(ct)`.
  - Switch DependencyBehavior from `Lowest` to `HighestMinor` so
    `#r` directives pick up patch-level security fixes via
    transitive dependencies without silently jumping major/minor.
  - Document that hydrated cache content is trusted to match
    (id, version) — flag for future content-hash verification if
    cache poisoning becomes a concern.

LinkedInPublisher / XPublisher (LinkedIn already committed in batch A
for the dynamic+PII parts; this commit adds the 401 retry):
  - SendWith401RetryAsync: on the FIRST 401 response from a publish,
    force-refresh the token (zero ExpiresAt before EnsureFreshAsync)
    and retry once. Closes the race where the access token's TTL
    expired between EnsureFreshAsync and the actual API call.

PostStatsRefresher:
  - Process due-refresh targets via Parallel.ForEachAsync bounded
    by SocialOptions.StatsRefreshDegreeOfParallelism (default 8),
    so a slow API + large refresh window can't let one tick
    overshoot the next interval.
  - Per-target failure backoff via a ConcurrentDictionary of
    last-failure timestamps — targets that failed within
    StatsRefreshFailureBackoff (default 15 min) skip the next tick.
    Stops a degraded platform from generating thousands of repeat
    warnings every cycle while the underlying issue is fixed.
    Success clears the backoff entry.

SocialOptions: add StatsRefreshDegreeOfParallelism (8) and
StatsRefreshFailureBackoff (15 min) knobs.

Addresses PR #95 review items #12, #13, #14, #16, #17, #18.
(#15 XPublisher defensive parse + the LinkedIn dynamic / PII items
were already in commit 478fdaa.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi added a commit that referenced this pull request May 10, 2026
… file lock

The MESHWEAVER_DISPOSE_TRACE=1 trace took a global lock per call
(`File.AppendAllText` under `lock (DisposeTraceLogLock)`), serialising
hub teardown under load when many hubs disposed concurrently.

Replaced with a single bounded `Channel<string>` (capacity 4096,
FullMode = DropWrite) drained by one writer task started in the
type initialiser. Producers `TryWrite` non-blocking — if the disk is
slow / locked, lines drop on full instead of putting back-pressure
on dispose. Single-reader semantics avoid contention on the file
handle.

Addresses PR #95 review item #19.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi added a commit that referenced this pull request May 10, 2026
Replaces the TODO from commit 512adb4. After a successful
INuGetPackageCache.TryHydrateAsync, the resolver now opens the
hydrated folder via PackageFolderReader and compares the package's
own .nuspec-declared (id, version) against the expected (id, version).
On mismatch the directory is purged and the resolver falls back to
the feed.

This catches the failure modes #14 was about: wrong package stored
under right key (cross-tenant blob, accidental copy, drift after a
manual edit). The .nuspec is the canonical NuGet source of truth, so
a tampered cache entry can't fake the identity without rewriting the
nuspec — which we'd then catch at hydration time.

No INuGetPackageCache contract change; validation lives entirely in
the resolver.

Closes the last open item from PR #95 review (item #14).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi and others added 7 commits May 22, 2026 15:06
…e key

Revisits the per-user keying from ed784e7. That commit changed the
synced-query cache from `id` → `(id, userId)` which fixed cross-user
leak but broke the "same id → same observable" contract that
LanguageModelSyncedQueryTest pins (chat view re-subscribes need to hit
the cached upstream, not allocate a fresh one each time).

New shape: cache by raw `id` (legacy contract preserved), wrap each
GetQuery return value with a per-subscriber RLS filter
(WrapWithPerUserRls). The filter captures the subscriber's AccessContext
at Subscribe time and uses ISecurityService.HasPermission to drop nodes
the subscriber can't Read. Two users sharing the same `id` each get
their own filtered VIEW over the SAME shared upstream — no duplicate
subscriptions, no cross-user leak.

System / no-AsyncLocal callers bypass the filter (infrastructure paths
get the full snapshot — SecurityService's own _Access walks etc.).

Updated LanguageModelSyncedQueryTest to assert "same upstream snapshot"
(BeEquivalentTo paths) rather than ReferenceEquals — the Defer wrapper
varies per call site by design (per-subscriber Subscribe-time capture),
but the cached upstream is the same instance via Replay(1).RefCount.

SyncedQueryRegistry gains:
  - RegisterAlias for dual-key entries (not used after this revert but
    kept for the structural extension path).
  - FindAnyById for the loose-match lookup fallback (defensive — no
    longer hit by GetQuery but useful for diagnostics).

SyncedQueryPerUserIsolationTest still 4/4 — the cross-user-leak guard
moved from cache key to subscriber wrapper but the invariant holds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… alarm

Introduces "cache/mesh-node-cache" as the first fine-grained sanctioned
identity per the new access-context propagation model. The MeshNodeStreamCache
hydrator runs under this dedicated address instead of ImpersonateAsSystem; the
identity is granted ONLY Permission.Read in SecurityService — Create/Update/
Delete fall through to normal RLS and deny (no AccessAssignment exists).

Boundary tests (MeshNodeCacheIdentityTest, 6/6 pass) verify:
 - GetEffectivePermissions returns exactly Permission.Read
 - Create / Update / Delete under cache identity all throw
   UnauthorizedAccessException
 - Read still succeeds (hydration unaffected)

Framework changes:
 - AccessService.SetContext / SetCircuitContext: log Error + stack when a
   hub-shaped principal (sync/, mesh/, node/, activity/, portal/) lands on
   AsyncLocal. Diagnostic alarm for the identity-baton model; CI parses for
   it to catch regressions.
 - MeshService.DeleteNode: NodeDeletionRejectionReason.Unauthorized now maps
   to UnauthorizedAccessException (previously fell through to
   InvalidOperationException — inconsistent with Create/Update).
 - SyncedQueryDataSourceExtensions.WrapWithPerUserRls: defer ISecurityService
   resolution to Subscribe time. Wrap-time resolution recursed through
   Autofac when SecurityService.ctor called workspace.GetQuery → ~200-level
   stack overflow at test discovery.

Documentation:
 - AccessContextPropagation.md rewritten around the piecewise single-threaded
   identity-baton model: Mermaid sequence diagram, 6-phase contract, security
   guarantees table, sanctioned exceptions with define/grant/test contract.
 - AccessControl.md cross-referenced; "ImpersonateAsHub" section reframed as
   "Sanctioned dedicated identities".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MessageHub.HandleMessageAsync and UserServiceDeliveryPipeline previously
stamped AccessService.Context (AsyncLocal) from delivery.AccessContext
unconditionally. When the delivery carried a hub-shaped principal (sync/,
mesh/, node/, activity/, portal/) — legitimately set by ImpersonateAsHub at
the producer for AccessControl purposes — that principal leaked into
AsyncLocal at the receiver. From there it propagated as fake user identity
into every downstream post and watcher emission, producing the
"CreatedBy=sync/xxx" symptom on stored MeshNodes.

Fix: only stamp AsyncLocal when delivery.AccessContext is a USER identity
(not hub-shaped). Hub-shaped principals still ride delivery.AccessContext
for the AccessControl check on this message; they just don't propagate
beyond it. AccessService.LooksLikeHubPrincipal helper centralises the
predicate; the existing SetContext error log catches anything that still
slips through.

Two callsites updated, identical guard:
 - MessageHub.HandleMessageAsync (per-handler dispatch boundary)
 - UserServiceDeliveryPipeline (per-delivery boundary)

See Doc/Architecture/AccessContextPropagation.md → "Security guarantees"
table for the full model — this fix backs the "no write is attributed to
a hub address by accident" guarantee.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elf-persist

Two hub-internal infrastructure flows previously stamped ImpersonateAsHub at
the post site purely to suppress the PostPipeline "no AccessContext" warning.
The stamped hub address (sync/, node/) then leaked into AsyncLocal at the
receiver and propagated as fake user identity into downstream writes — the
"CreatedBy=sync/xxx" symptom on user-driven sync-stream updates.

Both messages are now marked [SystemMessage], which exempts them from the
PostPipeline warning and lets the post go through with whatever AsyncLocal
holds (typically the originating user via the CarryAccessContext path, or
null for genuine background protocol traffic).

 * SetCurrentRequest (SynchronizationStream protocol): receiver does not gate
   on AccessControl (HandleSetCurrent). User-data binding pushes through this
   path now correctly carry the user's AccessContext.
 * SaveMeshNodeRequest / DeleteMeshNodeRequest (per-node hub auto-persistence):
   hub posting to itself to flush its own data. No end-user write semantics.

Backs the "user identity is never lost across an async hop" guarantee in
Doc/Architecture/AccessContextPropagation.md → Security guarantees.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HeartBeatEvent is already [SystemMessage], so the PostPipeline accepts a null
AccessContext without warning. The ImpersonateAsHub stamp served no purpose
once the warning was suppressed by the attribute — it just polluted the
delivery with a hub-shaped principal that the new AsyncLocal guard would
filter out anyway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CompilationCacheService's default CacheDirectory is '.mesh-cache' relative
to the test assembly — SHARED across every test in the testhost. Two test
classes compiling the same NodeType name (e.g. 'BrokenType', 'DynamicType')
race on the same DLL file: test-N's lingering ALC pins the file open while
test-N+1 tries to write the next version, causing test-N+1 to time out.

Symptom traced 2026-05-22:
 * CodeEditRecompileTest.FailedCompile_PreservesErrorLogAndDoesNotCreateRelease
 * LinkedInTelemetryImportTest.LinkedInTelemetryImport_CompilesAndRendersImportArea
Both pass in isolation (no contention when running alone). Both time out in
the full Monolith sweep. Memory-delta trace ruled out memory pressure —
in-sweep instances show normal ~45 MiB RSS deltas.

Fix: configure CompilationCacheOptions.EnableDiskCache = false in
ConfigureMeshBase. The option exists precisely for this scenario — doc on
CompilationCacheOptions.cs:25 reads "Useful for tests to avoid file locking
issues." In-memory compilation bypasses the disk file entirely;
FileSystemAssemblyStore (already configured at _assemblyStoreRoot, unique
per ConfigureMeshBase call via Process+Guid) provides any disk-backed
assembly storage the test actually needs.

Acme tests that opt back to true via services.Configure<CompilationCacheOptions>
are preserved — options-pattern composes, last writer wins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ry.ImpersonateAsNode

Two cleanups:

1. CompilationCacheService disk cache is now isolated per test class via a
   unique temp directory (_compilationCacheDir, Process+Guid). Previously
   the default '.mesh-cache' (relative to assembly) was SHARED across
   every test in the testhost — multiple classes compiling NodeTypes with
   the same name (e.g. 'BrokenType', 'DynamicType') raced on the same DLL
   file, causing test-N+1 to time out waiting on a lock held by test-N's
   lingering ALC. Symptom (traced 2026-05-22 via /tmp/meshweaver-memory-delta.log):
    * CodeEditRecompileTest.FailedCompile_PreservesErrorLogAndDoesNotCreateRelease
    * LinkedInTelemetryImportTest.LinkedInTelemetryImport_CompilesAndRendersImportArea
   Both passed isolated, timed out in full Monolith sweep.
   First attempt (EnableDiskCache=false) regressed 6 compile tests that
   depend on disk-backed DLL loading; reverted to per-test directory which
   keeps the cache working but eliminates cross-test contention.
   Result: Monolith 9 flakes (CodeEditRecompile + LinkedInTelemetry +
   6 disk-cache regressions) → 1 flake (MeshHubRemoteStream, separate
   issue) + 1 pre-existing (DeleteNodeBehavior).

2. IMeshQuery.ImpersonateAsNode() removed. Zero implementations, zero
   callers (verified via grep). Documented as legacy in AccessControl.md
   2026-05-22 — modern code uses sanctioned dedicated identities
   (cache/mesh-node-cache) or ImpersonateAsSystem.

The PostOptions.ImpersonateAsHub / AccessService.ImpersonateAsHub APIs
stay — HubDataSource (used by FutuRe Group + similar redistributor hubs)
relies on them for the documented "hub-as-redistributor" pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rbuergi and others added 30 commits May 27, 2026 01:22
PageLoadingTest hangs in CI were 3 separate flakes traced to the same root
cause: cold Roslyn compile of custom NodeTypes (Cornerstone/Insured + Pricing +
Article, ACME/Article + Project + Todo) is much slower on the CI Linux runners
than locally. Diagnosed by running each failing test locally with Trace logging
per DebuggingMessageFlow.md — every one passed in 300 ms or less.

Stream-level Timeout: 20s → 50s.
Per-test [Theory/Fact(Timeout)]: 60s → 120s.

The wider budget is only burned on cold compile; cache-hit runs (every test
after the first activation per NodeType) still finish in milliseconds. A
genuine hub-activation hang still surfaces within 120s rather than running
indefinitely.

Same change applied to ConcurrentRequestsTest (sibling class in the file)
since it depends on the same NodeType compilations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ions extensions

Match the shape of hub.CancelActivity / hub.StartThread / hub.SubmitMessage:
application code asks the hub for an answer; the extension resolves the
process-wide ISecurityService and forwards. No more layout areas reaching
into DI for ISecurityService, no more PermissionHelper.GetEffectivePermissions
static-class calls, no more hand-rolled `namespace:.../_Access` queries.

Four methods on IMessageHub:

  IObservable<Permission> hub.GetEffectivePermissions(path)            // ambient user
  IObservable<Permission> hub.GetEffectivePermissions(path, userId)    // explicit user
  IObservable<bool>       hub.CheckPermission(path, permission)
  IObservable<bool>       hub.CheckPermission(path, userId, permission)

All return IObservable<T> end-to-end. Tests bridge to Task at their edge.

Behind the scenes ISecurityService composes against the process-wide
IMeshNodeStreamCache under WellKnownUsers.System identity — one shared
sync subscription per scope (AccessAssignment subtree + PartitionAccessPolicy
chain), held alive via Observable.Using(ImpersonateAsSystem, …) so the
identity scope doesn't exit before the subscription emits. Zero per-hub
synced-query subscriptions for access lookups → zero "hub-shaped principal
set as AccessContext — must never happen" errors (the CI 59-occurrence
flake culprit, traced to SecurityService.ObserveScopeAssignments leaking
the hub identity through the synced-query subscription thread).

Documentation: PermissionApi.md.

Follow-ups (not in this commit):
- Wire SecurityService's two ObserveScope* methods through IMeshNodeStreamCache
  with held system impersonation.
- Migrate the 16 application-code callers of PermissionHelper.GetEffectivePermissions
  to hub.CheckPermission.
- Decide on PermissionHelper deprecation timeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Header section directs application code to the hub.CheckPermission /
hub.GetEffectivePermissions extensions (PermissionApi.md) and clarifies
the rest of AccessControl.md covers the internals that back them.

Removes the implicit invitation to resolve ISecurityService from DI in
layout areas — that surface stays for framework-internal callers (the
storage adapter's secured query path, the RLS node validator, the
access-control pipeline) but is no longer the documented public API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NodeTypeAsync

The two methods were [Obsolete]-marked legacy shims preserved only for
the AgentSelectionTest class, which mocked IMeshService.ObserveQuery
directly. Both production callers migrated to
AgentPickerProjection.ObserveAgents (workspace-backed synced source) some
time ago.

Delete both methods and the test class. Coverage for the real flow lives
in AgentChatClientNoSuitableAgentTest + AgentPickerProjectionTest, which
exercise ObserveAgents end-to-end via a real workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test-infrastructure leaks rolled into one fix:

1. CreateClientAddress() returned a fixed `client/1` for every call.
   Under ShareMeshAcrossTests, each test's GetClient() overwrote
   streams[client/1] in RoutingService — the prior client hub stayed
   alive on the mesh, its server-side LayoutAreaReference /
   MeshNodeReference sync streams kept emitting DataChangedEvents
   addressed to client/1, and those events queued on the LATEST
   client/1's action block ahead of new SubscribeAcks + initial-state
   emissions. PageLoadingTest.ConcurrentRequests (commit 02dd88f)
   was the first symptom; the AI/Threading suite 6-min CI timeouts
   are the same shape at scale.

   Switched to `client/{guid12}` per call. Routing tables now partition
   per client; leaked traffic from a prior test lands at a dead slot
   and is dropped harmlessly.

2. GetClient() didn't track the hubs it created. The shared-mesh
   DisposeAsync skipped Mesh.Dispose entirely, so client hubs from
   prior tests stayed alive indefinitely (until process exit), each
   holding its routing-stream registration + workspace subscriptions.

   Added _clientsCreated list + DisposeTestClients() in DisposeAsync
   on both the shared-mesh and per-test paths. Each tracked client is
   disposed at end-of-test — the framework's Dispose hook unregisters
   the routing stream and cancels in-flight subscriptions, so the
   server-side sync streams complete cleanly without orphaned emission.

Same fix shape applies to HubTestBase.CreateClientAddress (the sibling
fixture base for HubTestBase-derived tests). Both call sites switched.
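
A minimal sketch of the per-call address shape described above, assuming `guid12` means the first 12 hex characters of a fresh GUID. The class name `TestAddresses` is hypothetical and is not the fixture's real code.

```csharp
using System;

public static class TestAddresses
{
    // Hypothetical helper: "client/" followed by the first 12 hex
    // characters of a new GUID, so every call yields its own routing slot.
    public static string CreateClientAddress()
    {
        var id = Guid.NewGuid().ToString("N").Substring(0, 12);
        return "client/" + id;
    }
}
```

Two calls return different addresses, so a stale registration from an earlier test cannot be overwritten by a later one.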

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of commit 8cc3479 (MonolithMeshTestBase) on the Orleans side:

- CreateClientAddress() now returns `client/{12-char-guid}` instead of
  the fixed `client/1`. Each test gets its own routing slot; leaked
  traffic from a prior test's client lands at a dead address and is
  dropped harmlessly.

- GetClientAsync() appends each created hub to a per-test list;
  DisposeAsync disposes them before tearing down the cluster. Closes
  the synchronization streams paired with each client cleanly, so the
  silo's hosted-hub registry doesn't carry stale per-node subscriptions
  across tests within a shared cluster.

Same root cause as the Monolith fix; mirror cleanup. The Orleans dispose
runs BEFORE Cluster.DisposeAsync so the hubs unsubscribe before grain
teardown — otherwise grain-driven re-emissions during shutdown can race
the cluster's own shutdown sequence and produce NullReferenceExceptions
in Orleans.Streams.PersistentStreamPullingManager.Stop (the 82-NRE
batch we saw in CI logs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mission, hold ImpersonateAsSystem across SecurityService synced subscriptions

Three coupled fixes, single commit:

1. **Observable.Using on SecurityService synced subscriptions.**
   ObserveScopeAssignments + ObserveScopePolicies opened the
   ImpersonateAsSystem scope INSIDE a `using` block and assigned the
   resulting `workspace.GetQuery(...)` observable to a variable — but the
   observable subscription happens LATER (Replay(1).RefCount fires on
   first subscriber). By the time the upstream change-feed handlers run,
   the `using` block has long exited and AsyncLocal AccessContext is
   whatever the caller's context happens to be — usually the hub's own
   address. AccessService.SetContext then logs that as `[Error] SetContext:
   hub-shaped principal ... set as AccessContext — must never happen`
   (the 59-occurrence CI flake on commit 8af66d8).

   Switched both methods to `Observable.Using(() => accessSvc.ImpersonateAsSystem(), _ => workspace.GetQuery(...))`.
   The impersonation scope opens on Subscribe and disposes on the
   observable's Dispose — alive for every emission, every change-feed
   callback, every re-query.

2. **Marked ISecurityService [EditorBrowsable(Advanced)].** Application
   code MUST go through hub.CheckPermission / hub.GetEffectivePermissions
   (the extension surface introduced in commit 2ef5a8b). The interface
   stays public because framework-internal consumers
   (AccessControlPipeline, RlsNodeValidator, StorageAdapterMeshQueryProvider)
   still resolve it; the IDE just hides it from default IntelliSense so
   new callers reach for the extension first.

3. **Killed PermissionHelper entirely.** Static-class wrapper around
   `_securityService.GetEffectivePermissions(path)`; redundant with the
   hub extension and a competing surface. Migrated all 17 application-code
   call sites in src/MeshWeaver.Graph + MarkdownExportMenuProvider:

     PermissionHelper.GetEffectivePermissions(hub, path)  → hub.GetEffectivePermissions(path)
     PermissionHelper.CanEdit(hub, path)                  → hub.CheckPermission(path, Permission.Update)
     PermissionHelper.CanCreate(hub, parentPath)          → hub.CheckPermission(parentPath, Permission.Create)
     PermissionHelper.CanDelete(hub, path)                → hub.CheckPermission(path, Permission.Delete)

   Deleted PermissionHelper.cs. Test-file comments updated.

Solution builds clean (0 warnings, 0 errors). CI run will validate
that the AccessContext-leak fix dropped the 59 "hub-shaped principal"
errors and the cascade of test timeouts they triggered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds IMessageHub extension that mirrors the existing IWorkspace.GetQuery
overloads (cached single-arg + get-or-create multi-arg). Same shape as
hub.GetMeshNodeStream / hub.CheckPermission / hub.StartThread —
application code resolves the hub once and chains everything off it
instead of also threading workspace through call sites.

Internally delegates to hub.GetWorkspace().GetQuery(...) — zero
behavior change, single-line wrapper. The follow-up commits centralise
the synced-query registry on IMeshNodeStreamCache so all GetQuery calls
share one process-wide cache hosted on the cache hub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eStreamCache

Replaces the legacy per-workspace ConditionalWeakTable<IWorkspace,
SyncedQueryRegistry> with a process-wide registry on the
IMeshNodeStreamCache singleton. Every workspace.GetQuery / hub.GetQuery
call now delegates to the same cache, regardless of which hub the
caller originates from.

Key changes:
- IMeshNodeStreamCache.GetQuery(id, queries) — new method, registers on
  the cache hub's workspace so the upstream subscription runs under
  MeshNodeCacheIdentity (Permission.Read only). The secured query
  surface short-circuits to raw upstream; no per-hub AsyncLocal
  AccessContext can leak in.
- IMeshNodeStreamCache.GetQuery(id) — lookup-only overload.
- IMeshNodeStreamCache.GetQuery(id, options, queries) — typed-content
  overload, round-trips each emitted MeshNode's Content through the
  caller's JsonSerializerOptions at the cache boundary (same shape as
  GetStream(path, options)).
- workspace.GetQuery / hub.GetQuery → delegate to the cache via
  workspace.Hub.ServiceProvider.GetRequiredService<IMeshNodeStreamCache>().
- WithMeshQuery no longer registers into a per-workspace registry —
  the typesource attaches directly to its data source, and lookups
  from other hubs go through the central cache.
- Deleted SyncedQueryRegistry class entirely (was only used by the
  removed ConditionalWeakTable path).

Query execution itself was already on TaskPoolScheduler.Default via
.SubscribeOn(...) in StorageAdapterMeshQueryProvider — nothing here
moves it back onto a hub action block.

Build: clean. Graph tests: 296/296 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ice + RlsSecurityService

Application code never reached into ISecurityService directly — it always went
through hub.CheckPermission/GetEffectivePermissions, and the interface only
existed to bridge the Mesh.Contract↔Hosting project boundary. The interface
form was hostile to the canonical `hub.GetService<SecurityService>()` shape
shared by every other framework service.

Replace with:
- `public abstract class SecurityService` in MeshWeaver.Mesh.Contract.Security
  (same namespace as before; same public surface as the old interface)
- `public sealed class RlsSecurityService : SecurityService` in
  MeshWeaver.Hosting.Security — the concrete RLS implementation
- `public sealed class NullSecurityService : SecurityService` in
  MeshWeaver.Mesh.Contract.Security — fall-through "permission granted"
  used by satellite access rules when RLS isn't configured
- DI: `services.TryAddScoped<SecurityService, RlsSecurityService>()`

45 consumer sites that referenced ISecurityService now reference the abstract
class directly; no semantic change. Solution builds clean, AccessAssignment
tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Replay(1).RefCount + TaskPoolScheduler

ConcurrentDictionary.GetOrAdd does NOT serialize its value factory across
threads — when two callers race for the same id, both factories run, both
allocate, both subscribe upstream, and only one wins the cache slot; the
loser's work leaks. Replace with `ImmutableDictionary<object, IObservable<...>>`
swapped via `Interlocked.CompareExchange` — losers see the winner's stream
on the next iteration's TryGetValue and discard the unused closure.
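
The swap loop can be sketched roughly as follows. This is an illustration under assumptions, not the cache's actual code: `CasCache<TValue>` is a hypothetical name, and the value type is generic here where the real registry stores observables.

```csharp
using System;
using System.Collections.Immutable;
using System.Threading;

public sealed class CasCache<TValue>
{
    private ImmutableDictionary<object, TValue> _entries =
        ImmutableDictionary<object, TValue>.Empty;

    public TValue GetOrAdd(object id, Func<TValue> factory)
    {
        while (true)
        {
            var snapshot = Volatile.Read(ref _entries);
            if (snapshot.TryGetValue(id, out var existing))
                return existing;              // a winner already published a value

            var candidate = factory();        // should be cheap and side-effect free
            var updated = snapshot.Add(id, candidate);

            // Publish only if nobody replaced the dictionary in the meantime.
            var seen = Interlocked.CompareExchange(ref _entries, updated, snapshot);
            if (ReferenceEquals(seen, snapshot))
                return candidate;

            // Lost the race: drop the candidate and retry against the new snapshot.
        }
    }
}
```

The factory can still run more than once under contention, so this only avoids leaks when the candidate does no work until someone uses it, which is the case for a cold observable that has not been subscribed.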

The cached observable is `Observable.Defer(...).SubscribeOn(TaskPoolScheduler).Replay(1).RefCount()`:
- First subscriber triggers the Defer body on a thread-pool thread, never
  on the calling hub's action block — concurrent GetQuery callers across
  many hubs no longer queue behind one SyncedQueryMeshNodes construction.
- Replay(1) caches the latest snapshot for all later subscribers (this is
  the "cache" — earlier callers asked for it explicitly).
- RefCount shares one upstream subscription.

Docs: bulk-rename ISecurityService → SecurityService (the abstract class
shipped in the prior commit) across AccessControl.md, AccessContextPropagation.md,
ExtensibleDefaults.md, PermissionApi.md, and the 3_0_0-preview2 release notes.

Tests: SyncedQueryCrossSiloTest migrated off the deleted per-workspace
registry — `workspace.GetQuery(id, query)` (get-or-create) replaces
`workspace.GetQuery(id)!` after `WithMeshQuery(query)`. All 22 SyncedQuery
tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tatic PermissionEvaluator + config-driven evaluator delegate

Algorithm moved from the 1084-line RlsSecurityService class to a single
static `PermissionEvaluator` in Mesh.Contract. Per-scope AccessAssignment
and PartitionAccessPolicy walks now share one IMeshNodeStreamCache.GetQuery
subscription per (scope) across the whole process — no per-hub IMemoryCache
layer, no per-hub _warmupSubscriptions, no per-hub scoped service.

Configuration is hub-level via the standard MessageHubConfiguration
property bag: AddRowLevelSecurity() on the builder calls
config.AddRowLevelSecurity() (Mesh.Contract extension) which sets an
EffectivePermissionsDelegate that HubPermissionExtensions resolves on
every check. When no delegate is configured, the default returns
Permission.All — same lambda flows through whether RLS is on or off.

Application code only sees hub.CheckPermission / hub.GetEffectivePermissions.
Internal framework callers (RlsNodeValidator, AccessControlPipeline,
SatelliteAccessRule, StorageAdapterMeshQueryProvider) go through the same
extensions; no separate framework-internal surface.

8 of 228 Security tests still failing (Menu / HubPermissionRuleSet edge
cases) — down from 99 after the algorithm port. Triage separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ry to thread pool

IMeshNodeStreamCache.GetQuery used Replay(1).RefCount() — when subscriber
count dropped to 0 between calls the Replay buffer was retained but the
upstream synced query disconnected. The next FirstAsync after a runtime
AccessAssignment write saw the STALE cached snapshot before the change
feed's Added event landed. RuntimeCreateNode test hung 8m45s in silent
deadlock under [Fact(Timeout = 20000)] because the xUnit timeout is
cooperative cancellation and the test ignored the ct.

Switch to .Replay(1).AutoConnect(0): upstream connects on the first
accessor call and stays connected for the cache singleton's lifetime.
Live AccessAssignment writes propagate to Replay(1) in real time.

Also wrap MeshQuery.ObserveQuery / Query / IMeshQueryCore.ObserveQuery
with .SubscribeOn(TaskPoolScheduler.Default) so DB connection pool /
change feed subscriptions open on the thread pool, not on the calling
hub's action block. Doc: OrleansTaskScheduler.md updated with the
grain-hosted-cache rationale.

Suite: Security.Test from 13m03s / 8 failures → 4m27s / 7 failures.
The runtime AccessAssignment propagation tests now pass; remaining
failures are Menu / HubPermissionRuleSet / SpaceCreation edge cases
that require separate triage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t / 60s hard)

Tests targeted behavior explicitly removed in 77d9941 (Organization → Space):
- HubPermissionRuleSetTest.WithPublicRead_AllowsAuthenticatedUserRead — Space
  NodeType doesn't have WithPublicRead
- OrganizationHubAccessTest.Admin_HasReadOnOrganization_WithoutClaimBasedRoles —
  Organization NodeType doesn't exist
- PartitionAccessTest.SpaceCreation_CreatesPartitionNode — per-tenant Partition
  auto-emission was explicitly removed in the rename commit

Suite now 225/225 green.

Also add a watchdog in MonolithMeshTestBase.DisposeAsync that catches the
silent-deadlock pattern xUnit v3 misses: when a test ignores its
CancellationToken, [Fact(Timeout=N)] is cooperative cancellation only — the
await blocks past the deadline and xUnit reports Passed with the actual
(often multi-minute) duration. The watchdog records the test-method start
timestamp at the end of InitializeAsync and computes elapsed at the start of
DisposeAsync; >30s logs a warning, >60s throws a TimeoutException naming the
likely cause (uncooperative cancellation).
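
The two thresholds can be sketched as a small pure function. The names `TestWatchdog` and `Check` are hypothetical; the real watchdog lives in `DisposeAsync` and measures elapsed time itself.

```csharp
#nullable enable
using System;

public static class TestWatchdog
{
    public static readonly TimeSpan SoftDeadline = TimeSpan.FromSeconds(30);
    public static readonly TimeSpan HardDeadline = TimeSpan.FromSeconds(60);

    // Returns null when within budget, a warning text past the soft deadline,
    // and throws once the hard deadline is exceeded.
    public static string? Check(string testName, TimeSpan elapsed)
    {
        if (elapsed > HardDeadline)
            throw new TimeoutException(
                $"{testName} ran {elapsed.TotalSeconds:F0}s; likely cause: " +
                "the test ignored its CancellationToken.");

        return elapsed > SoftDeadline
            ? $"{testName} ran {elapsed.TotalSeconds:F0}s (soft deadline exceeded)"
            : null;
    }
}
```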

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Error

HostedHubsCollection.DisposeHosted catches an outer disposal exception and
then dumps the status of every task. The dump was at LogError unconditionally
— including for tasks that completed cleanly (IsCompleted=True, IsFaulted=
False, IsCanceled=False). One CI run produced thousands of these per test
class; App Insights ingest cost + xUnit test-log size both blow up under it.

Split the per-task arm:
  - IsFaulted → LogError (unchanged, with the actual exception)
  - IsCanceled → LogWarning
  - otherwise → LogDebug (this is the diagnostic-noise case)

The outer parent-exception logging at LogError is unchanged — the real error
signal stays.
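
A sketch of the per-task level selection, assuming a local `Level` enum as a stand-in for the logging framework's level type. `TaskStatusLogging` and `LevelFor` are hypothetical names.

```csharp
using System.Threading.Tasks;

public enum Level { Debug, Warning, Error }

public static class TaskStatusLogging
{
    // Faulted tasks keep the error level, cancelled ones drop to warning,
    // and everything else (including clean completion) is debug-only noise.
    public static Level LevelFor(Task task) =>
        task.IsFaulted ? Level.Error
        : task.IsCanceled ? Level.Warning
        : Level.Debug;
}
```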

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TestMemTrace at end of every test's DisposeAsync ran two passes of
GC.Collect(2, Forced, blocking: true, compacting: true) +
WaitForPendingFinalizers. ~1.5s × 225 tests = ~5 minutes of pure GC per
suite. The forced GC is a leak-detection diagnostic; useful when chasing
a memory regression, dead weight on every other run.

Default OFF; set MESHWEAVER_TEST_FORCE_GC=1 to re-enable when memory
delta lines need to reflect retained allocations rather than in-flight
collectible garbage.
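
A minimal sketch of the opt-in gate, assuming the variable is compared against the literal string "1". `TestMemTraceSketch` is a hypothetical name and does not reproduce the real TestMemTrace output.

```csharp
using System;

public static class TestMemTraceSketch
{
    public static bool ForceGcEnabled() =>
        Environment.GetEnvironmentVariable("MESHWEAVER_TEST_FORCE_GC") == "1";

    // Reports managed heap size; only pays for the two blocking,
    // compacting collections when the opt-in variable is set.
    public static long MeasureBytes()
    {
        if (ForceGcEnabled())
        {
            for (var pass = 0; pass < 2; pass++)
            {
                GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
                GC.WaitForPendingFinalizers();
            }
        }
        return GC.GetTotalMemory(forceFullCollection: false);
    }
}
```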

Security.Test: 7m11s → 2m52s, 225/225 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-safe permission evaluation

Two real bugs across the async boundary I shifted in commit 9eb62c0
(SubscribeOn(TaskPoolScheduler.Default) on cache.GetQuery + MeshQuery):

1. PermissionEvaluator.GetEffectivePermissions reads accessService.Context
   (AsyncLocal) inside .Select lambdas that run on TaskPool emission threads.
   AsyncLocal does NOT flow through SubscribeOn — claim-based Roles from a
   user's AccessContext silently dropped, IsApiToken gating bypassed.

   Capture Context + CircuitContext snapshots at GetEffectivePermissions
   entry (caller's thread, AsyncLocal still valid). Pass to ComputeRoleState
   and the inner Select lambda as closure values. ComputeRoleState's
   accessService parameter replaced by AccessContext? capturedContext +
   AccessContext? capturedCircuitContext.

2. IMeshNodeStreamCache.GetQuery used .Replay(1).AutoConnect(0). AutoConnect(0)
   eagerly Connect()s at observable construction; under CAS contention the
   ImmutableDictionary swap loop builds N observables, only one wins the
   slot, but all N already opened upstream IMeshQueryCore subscriptions.

   Switch to .Replay(1).AutoConnect(1) — lazy connect on first Subscribe.
   The CAS loser's discarded chain has no subscribers and never connects.
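
The capture-at-entry idea in item 1 can be illustrated without Rx. The sketch below simulates an emission thread that did not inherit the caller's ExecutionContext by suppressing flow while starting a thread; a plain string stands in for AccessContext, and `ContextCaptureDemo` is a hypothetical name.

```csharp
#nullable enable
using System;
using System.Threading;

public static class ContextCaptureDemo
{
    private static readonly AsyncLocal<string?> Ambient = new();

    // Returns what a foreign thread sees: the ambient AsyncLocal value
    // and the value that was snapshotted into a closure beforehand.
    public static (string? ambient, string? captured) Run()
    {
        Ambient.Value = "alice";
        var captured = Ambient.Value;          // snapshot on the caller's thread
        string? seenAmbient = "unset";
        string? seenCaptured = null;

        Thread worker;
        using (ExecutionContext.SuppressFlow()) // worker starts without our context
        {
            worker = new Thread(() =>
            {
                seenAmbient = Ambient.Value;   // AsyncLocal is empty here
                seenCaptured = captured;       // closure value still available
            });
            worker.Start();
        }
        worker.Join();

        Ambient.Value = null;                  // leave the caller's context clean
        return (seenAmbient, seenCaptured);
    }
}
```

`Run()` returns `(null, "alice")`: the ambient value is gone on the foreign thread, while the closure still carries the snapshot.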

Also bulk-completes the Organization → Space rename across remaining test
files (NodeType strings, node-type permission seeds, log comparisons).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ery — serialize batches via Concat

With SubscribeOn(TaskPoolScheduler.Default) shifting query work onto the
thread pool, overlapping batch executions inside ObserveQuery's debounce
pipeline could race on the per-subscription currentItems Dictionary that
ProcessBatch mutates.

Before:
    changeBuffer.Buffer(...).Subscribe(batch =>
        disposables.Add(RunQuery().Subscribe(newResults => ProcessBatch(...))));
    // outer Subscribe runs sequentially, but each .Subscribe(batch=>...)
    // body fires an async RunQuery and continues; batch #2's RunQuery starts
    // before batch #1's ProcessBatch completes -> currentItems race.

After:
    changeBuffer.Buffer(...)
        .Select(batch => RunQuery().Select(newResults => (batch, newResults)))
        .Concat()                  // next batch's RunQuery waits for prev
        .Subscribe(t => ProcessBatch(...));

Concat() guarantees one RunQuery (= one DB connection acquired from the
NpgsqlDataSource pool, one read, one ProcessBatch mutation of currentItems)
completes before the next starts. Strict unit-of-work per batch.

Also fix the early-backlog drain path: instead of running a parallel
RunQuery() that could race the first live batch, push the backlog through
the same changeBuffer subject so it queues behind the live pipeline's
Concat.

The connection-level safety was always fine — _dataSource.CreateCommand()
is pooled and thread-safe. The hazard was in the Rx orchestration: each
.Subscribe(batch=>RunQuery()...) is non-awaiting, so the outer Rx
"sequential" guarantee didn't extend to the inner async work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ueryProvider.ObserveQuery

Same overlapping-RunQuery race as PostgreSqlMeshQuery (commit 7f418e1):
   changeBuffer.Buffer(...).Subscribe(batch =>
       disposables.Add(RunQuery(ct).Subscribe(newResults =>
           ProcessBatch(...))));

The outer Subscribe is sequential, but each .Subscribe(batch=>...) body
fires an async RunQuery and returns. Batch #2's async read can start
before batch #1's ProcessBatch mutates currentItems.

Switch to Select+Concat so the next batch's RunQuery doesn't subscribe
until the previous batch's read completes AND ProcessBatch has finished
mutating the shared dictionary.

Also push the early-backlog drain through the same changeBuffer
(scheduled on Scheduler.Default to avoid stack recursion), so the
backlog queues behind the live Concat pipeline instead of running a
parallel RunQuery that races the first live batch.

Security.Test: 225/225 green at 1:47 after the PG fix; verifying.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nce-in-depth + explicit contract docs

Orleans grains are re-entrant; the IMeshNodeStreamCache singleton is hit
by many grains concurrently. Lock the contract down with three tests:

  GetQuery_ManyConcurrentCallersSameId_AllSeeSameSnapshot
      64 threads racing GetQuery(sameId, query). Asserts every subscriber
      sees the same MeshNode snapshot — if CAS-loser observables had
      leaked Connect (the AutoConnect(0) bug fixed in 04fae84), we'd
      see divergent snapshots from racing initial queries.

  GetQuery_ReturnsLiveUpdatesAfterRuntimeCreate
      Eventual-consistency check: subscriber attaches before any writes,
      then nodes are created at runtime. Both the held-open subscription
      and a late-arriving subscriber must see the live state, not the
      stale Initial Replay buffer.

  GetQuery_ConcurrentDifferentIds_AllResolveIndependently
      32 threads racing with distinct ids. Stresses the ImmutableDictionary
      CAS retry loop with N keys hitting _queries simultaneously — every
      caller's chain must converge.

Add .Synchronize() at the public surface of GetQuery for defence-in-depth:
ReplaySubject already serialises OnNext/Subscribe internally, but wrapping
the returned observable makes the single-threaded-callback contract
explicit at the cache's API.

Inline the thread-safety contract (creation, CAS, subscription, emission,
eventual consistency) as comments on _queries — future readers don't have
to know Rx internals to trust the cache is safe under fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rialises emissions

ReplaySubject<T> (backing Replay(1)) is internally synchronised — OnNext +
Subscribe coordinate via lock. Wrapping with .Synchronize() added a second
gate that contended under concurrent subscriber load.

Security.Test suite: 3:30 → 1:44 after this revert. The contract docs stay
in place — readers don't have to know ReplaySubject's internal sync, the
comment now points at it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uery providers

Buffer(DefaultDebounceInterval=100ms) in PostgreSqlMeshQuery.ObserveQuery
and StorageAdapterMeshQueryProvider.ObserveQuery was the source of
order-dependent flakes:

  T+0    Test commits CreateNode(AccessAssignment) → persistence.Write
         → adapter._changes Subject fires DataChangeNotification.
  T+0    Notification lands in changeBuffer Subject.
  T+10   Test calls hub.CheckPermission → cache.GetQuery first subscriber
         → AutoConnect(1) Connect → ObserveQuery's existing Replay(1)
         buffer holds the PRE-WRITE snapshot.
  T+10   Subscriber returns Permission.None — wrong.
  T+100  Buffer flushes, RunQuery diffs, ProcessBatch emits Added →
         Scan updates → Replay(1) caches new state — too late.

The 100ms debounce window IS the race. Subscribers attaching during it
see stale Replay(1).

Switch both providers to process every change immediately:

    changeBuffer
        .Select(n => RunQuery().Select(newResults => (batch: new[] { n }, newResults)))
        .Concat()
        .Subscribe(t => ProcessBatch(...))

Concat preserves the unit-of-work guarantee — next RunQuery doesn't start
until previous ProcessBatch completes — but the per-change RunQuery
means the Replay(1) buffer reflects every commit within milliseconds of
its persistence write, not 100 ms later.

Trade-off: throughput cost is one RunQuery per change instead of one
per batch. For prod load that's bounded by the connection pool; for
test correctness it eliminates the entire flake class.

Security.Test: 225/225 green locally at 2:04 (was 222-225 / 3 Menu flakes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oseInChildren=true

Commit 95f840f flipped ExposeInChildren default from true to false (to
fix wire-serialisation drop on false values). The AddFileSystemContentCollection
builder doesn't set ExposeInChildren on the config it produces, so the new
default of false silently took effect — GetAllCollectionConfigs filters by
ExposeInChildren, returns empty, and tests that list configs fail
("Expected configs to have an item matching c.Name == 'test-content' …
but found 0").

Set ExposeInChildren = true on the config produced by this builder — these
are user-facing filesystem collections and the whole point of registering
them is to surface them to children.

ContentService_ListsCollectionConfigs now passes in isolation; AI suite
flake count drops as a result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three regressions surfaced after the per-change persistence rewrite that
removed the 100ms debounce window:

1. PostgreSqlMeshQuery.Test.ObserveQueryTests.ObserveQuery_MultipleRapidChanges_AreBatched
   — `List<T>` accumulator + polling-lambda enumeration raced the Subscribe
   handler's `.Add(c)` once changes started arriving one-per-emission instead
   of one batched-per-debounce. Threw "Collection was modified" mid-poll.
   Guard both ends with the same `lock(changes)` and snapshot via `ToArray()`
   under the lock — the test's assertion already accepts either shape
   (one Added emission with 3 items OR three separate Added emissions).

2. NodeOperations.Test.DeletionTests.Delete_FromNodeHub_Succeeds
   — `TestTimeout` had been reverted from 90s → 45s by 195d1b6 and the
   Linux CI per-message-hub activation now routinely takes >45s when the
   suite is mid-run; STALE-CALLBACK at GetDataRequest@{nodePath}(44+s)
   re-appeared. Restore the 90s TestTimeout that the earlier revert had
   undone, and bump the [Fact(Timeout)] from 60s → 120s so xUnit doesn't
   kill the test before the inner CT fires.

3. NodeOperations.Test.DeletionTests.Delete_DeeplyNested_DeletesBottomToTop
   — inner `.Timeout(15s)` on the empty-subtree poll loop is too tight for
   Linux CI after the unit-of-work change made deletion fan-out emit more
   small batches (instead of one debounced 100ms tick). Bump to 30s.

Local: all 3 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ped on Linux

The synced-query path through CreatableTypesProvider has a 15 s per-query
inner Timeout(15s, Empty) on each merged ObserveQuery (see
QueryTypeNodes). With a 20 s xUnit ceiling, a single slow query that
trips the inner timeout left no margin for the Aggregate to flush and
the downstream emission to land.

Local: passes in ~14s. The bump leaves the happy-path finish time
unchanged while giving the slow path room to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26557749128 caught Delete_FromNodeHub_Succeeds tripping the base-class
60s hard deadline despite the earlier [Fact(Timeout=120000)] and 90s
TestTimeout bumps — the MonolithMeshTestBase watchdog (in DisposeAsync)
fails any test whose body-elapsed exceeds TestHardDeadline regardless of
the xUnit budget.

Lift both ceilings for this class so the watchdog matches what the test
budgets allow: 60s soft (warn), 120s hard (fail). Local runs still finish
in ~10s; CI's slow-hub-activation path now has the room it needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… handler

Run 26559166360 caught MeshHub_RemoteStream_ReceivesNodeUpdate with
'Expected names {"V1", "V2"} to contain "V2"' — FluentAssertions printed
the post-failure snapshot, but at the moment of assertion the list only
had ["V1"].

The test has two independent observers on the cached stream:
  1. `await stream.Where(V2).FirstAsync()` — the synchronisation point
  2. `using var sub = ...Subscribe(ci => names.Add(...))` — the accumulator

Under the new per-change emission shape (486e8d2: Buffer→Concat), the
synchronisation observer can resolve BEFORE the accumulator observer has
appended V2. Locally batched emissions hid this; CI exposes it.

Fix: lock both ends + poll the accumulator until it contains V1 AND V2
before snapshotting under the same lock for the assertion. The
`ToList()` → `ToArray()` switch is a workaround for the Observable.ToList
overload winning argument-inference in this file.
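
The "lock both ends, snapshot under the lock" pattern might look like the sketch below. `LockedAccumulator<T>` is a hypothetical helper, not the test's actual code.

```csharp
using System.Collections.Generic;

public sealed class LockedAccumulator<T>
{
    private readonly List<T> _items = new();

    // Called from the subscription callback.
    public void Add(T item)
    {
        lock (_items) _items.Add(item);
    }

    // Called from the polling/assertion side; the copy is taken under the
    // same lock, so enumeration never overlaps a concurrent Add.
    public T[] Snapshot()
    {
        lock (_items) return _items.ToArray();
    }
}
```

`List<T>.ToArray()` is the instance method, so the call cannot bind to an observable `ToList` extension.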

Local: passes in 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… backing

Three races + one footgun across the AI suite:

1) MeshNodeStreamCache: concurrent mirror-side Updates on the same path
   race their `current` snapshot — each lambda runs against the same
   pre-patch baseline, so each emits a JSON-merge patch that REPLACES
   ImmutableList fields (RFC 7396 merges JSON objects by key but
   replaces arrays). Symptom: 3 rapid SubmitMessage calls land only
   1 entry in MeshThread.UserMessageIds at the owner; analogous
   clobbering for Messages / IngestedMessageIds.

   Fix: per-path `Subject<UpdateRequest>` → `Concat` serial queue +
   wait for the cache's shared stream to emit an echo (LastModified
   >= the just-written value) before subscribing the next inner
   observable. 3-second echo timeout — TimeoutException is logged at
   Debug and does NOT propagate to the caller (the local OnNext
   already fired); the next Update still benefits from the action-
   block ordering on the owner. Per-stage Debug/Trace logs
   (ENQUEUE / START / LOCAL_EMIT / ECHO_CANDIDATE / ECHO_RECEIVED /
   ECHO_TIMEOUT / COMPLETE / EVICTED) make hangs visible — flip
   `MeshWeaver.Hosting.MeshNodeStreamCache` to Trace to see them.

   Queue storage: `MemoryCache` with 10-minute sliding expiration,
   not a long-lived `ConcurrentDictionary`. Paths that go quiet
   release their Subject + Concat subscription via eviction callback;
   a fresh write transparently recreates the queue. The cached VALUE
   is a `Lazy<UpdateQueueEntry>(ExecutionAndPublication)` because
   `MemoryCache.GetOrCreate` is NOT atomic — the factory can run
   more than once under contention, and only one result wins; losers
   would orphan a Subject + subscription whose eviction callback is
   never registered. Same pattern as the existing `_streams`
   Lazy<Entry>.

2) DelegationTool: the sub-thread drain was running on the caller's
   SynchronizationContext (Orleans grain scheduler in prod, the
   single-threaded pump in `DelegationDeadlockTest`). Adding
   `.SubscribeOn(TaskPoolScheduler.Default)` between
   `executeAsync(...)` and `.Subscribe(...)` hops the Subscribe to
   ThreadPool, so the `Observable.Create<async>` body's MoveNextAsync
   continuations no longer capture the grain scheduler and wedge it
   when sub-thread continuations post back through the same scheduler.

3) AgentPickerProjection.BuildQueries: per-NodeType inheritance was
   `path:{nodeTypePath} scope:ancestors`, which finds agents whose
   PATH is an ancestor of the NodeType — only `ACME`, `""`, etc.
   TodoAgent.md at `ACME/Project/TodoAgent` (namespace `ACME/Project`)
   was missed entirely. Correct semantic: agents inherit DOWN the
   NAMESPACE hierarchy, so query is
   `namespace:{nodeTypePath} scope:selfAndAncestors`. TodoAgent's
   namespace equals the NodeType path = self match; agents at parent
   namespaces (`ACME`, `""`) still inherit via the ancestor scope.
   Fixes AgentChatClient_InitializeAsync_FindsTodoAgentFromNodeTypeNamespace.

4) QueryParser: `selfAndDescendants` was silently falling through to
   `QueryScope.Exact` (only `selfAndAncestors`/`ancestorsAndSelf`
   were aliased). Added the symmetric alias to `QueryScope.Subtree`
   so the same footgun doesn't bite future callers — matches the
   pattern documented in feedback_query_scope_children.md.

Suite impact: AI 442/445 in ~7m (was 437/445 with 8 race failures);
Security 225/225; both stable on repeated runs. Remaining 3 AI
failures are pre-existing flakes unrelated to these races
(Submit_SingleSubmit watcher double-dispatch, NuGet feed test,
CodeNode lastExecution stamps).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssContext

The 2026-05-22 revert made CarryAccessContext a pass-through "until we
have a leak-free design," and the docs (AsynchronousCalls.md:1120-1137 +
CqrsAndContentAccess.md:309) kept promising "AccessContext rides for
free on every framework primitive's cold observable." Those two
realities have been diverging ever since — and every Subscribe-callback
that lands on a non-caller scheduler (workspace emission thread,
TaskPool, the new per-path Concat queue inside MeshNodeStreamCache)
has been silently reading the wrong AsyncLocal.

This commit closes the gap. CarryAccessContext now:

  1. Captures `AccessService.Context` by VALUE at invocation time
     (NOT CircuitContext — PostPipeline picks that up itself; the wrap
     deliberately doesn't synthesise identity from a Blazor session
     value the caller didn't explicitly opt into).

  2. Wraps the source observable so every OnNext / OnError /
     OnCompleted callback is delivered inside an
     AccessService.SwitchAccessContext(captured) `using` scope.

  3. Disposes the scope as the callback returns — AsyncLocal is
     touched ONLY for the duration of the subscriber's body, never
     stamped into the surrounding logical execution context. This
     closes the McpUpdate user1/user2 cross-contamination bug that
     drove the 2026-05-22 revert (the previous impl called
     access.SetContext(captured) without restoring, so the captured
     value leaked into the caller's logical execution context
     indefinitely).

Both IServiceProvider and AccessService overloads now use the same
per-callback RestoringObserver implementation; the AccessService
overload short-circuits the DI lookup when the caller already holds
a reference.
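
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `Ambient`, `CarryAmbient`, and the string-typed identity are hypothetical stand-ins for `AccessService`, `CarryAccessContext`, and the real access-context type, and only BCL types are used.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for AccessService: an AsyncLocal-backed ambient identity.
public static class Ambient
{
    private static readonly AsyncLocal<string?> Slot = new();

    public static string? Context => Slot.Value;

    // Sets the ambient value and returns a scope that puts the previous value back.
    public static IDisposable Switch(string? identity)
    {
        var previous = Slot.Value;
        Slot.Value = identity;
        return new RestoreScope(previous);
    }

    private sealed class RestoreScope : IDisposable
    {
        private readonly string? _previous;
        private bool _disposed;

        public RestoreScope(string? previous) => _previous = previous;

        public void Dispose()
        {
            if (_disposed) return;
            _disposed = true;
            Slot.Value = _previous; // restore, so nothing leaks past the callback
        }
    }
}

// Delivers every callback inside a scope that holds the captured identity.
public sealed class RestoringObserver<T> : IObserver<T>
{
    private readonly IObserver<T> _inner;
    private readonly string? _captured;

    public RestoringObserver(IObserver<T> inner, string? captured)
    {
        _inner = inner;
        _captured = captured;
    }

    public void OnNext(T value)
    {
        using (Ambient.Switch(_captured)) _inner.OnNext(value);
    }

    public void OnError(Exception error)
    {
        using (Ambient.Switch(_captured)) _inner.OnError(error);
    }

    public void OnCompleted()
    {
        using (Ambient.Switch(_captured)) _inner.OnCompleted();
    }
}

public static class CarryExtensions
{
    // Captures the ambient identity by value at invocation time.
    public static IObservable<T> CarryAmbient<T>(this IObservable<T> source)
        => new Carried<T>(source, Ambient.Context);

    private sealed class Carried<T> : IObservable<T>
    {
        private readonly IObservable<T> _source;
        private readonly string? _captured;

        public Carried(IObservable<T> source, string? captured)
        {
            _source = source;
            _captured = captured;
        }

        public IDisposable Subscribe(IObserver<T> observer)
            => _source.Subscribe(new RestoringObserver<T>(observer, _captured));
    }
}
```

In this sketch the subscriber body always sees the captured identity, while the thread that delivers the callback gets its own ambient value back as soon as the callback returns.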

Tests:
- test/MeshWeaver.Messaging.Hub.Test/AccessContextSurvivesSubscribeTest.cs
  Rewrites the old "PassThrough_Does_Not_Restore" test into
  "Captured_Context_Restored_Per_Wrap_Even_After_AmbientCleared" —
  asserts the new per-callback restore AND the no-leak contract
  (after all callbacks return, the test thread's AsyncLocal must
  be back to what it was before any emission).

- test/MeshWeaver.Security.Test/MeshNodeCacheIdentityTest.cs
  Adds two new canaries for the cross-cutting boundary:
    * CacheUpdate_Concat_PreservesCallerIdentity — the per-path
      Concat queue added in 1787345 was the most acute gap; the
      Subject → Concat → Subscribe chain runs the inner observable
      on a ThreadPool thread, so without the wrap the OnNext
      callback observes null/sync identity, never the caller.
    * CacheUpdate_AfterCallerScopeDisposed_StillCarriesCapturedIdentity —
      pins the capture-by-value semantic (Subscribing after the
      caller's using-scope is disposed must still observe the
      captured identity, NOT whatever ambient ended up on
      AsyncLocal post-dispose).

Verification:
- All 6 AccessContextSurvivesSubscribeTest tests green (5 unchanged,
  1 renamed + rewritten).
- All 227 Security.Test green locally (incl. the 2 new cache canaries).
- AI test suite 445/445 green at 8m14s — previously failing CI
  canaries (MeshPluginTest.FullCrudWorkflow, ThreadStreamingIdentityTest.SubmitMessage_*,
  LinkedInTelemetryImport, SubThreadHangRepro x2, LayoutAreaIdentityTest.AuthorizedUser_*)
  all pass under this wrap.

Audit deliverables (referenced by C:\Users\RolandBuergi\.claude\plans\swift-tinkering-melody.md):
  C:/tmp/claude/identity-audit/identity-boundary-audit.md
  C:/tmp/claude/identity-audit/asynccalls-vs-impl.md
  C:/tmp/claude/identity-audit/identity-test-coverage.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Exec/Compile watcher identity

The recurring silent-overwrite bug behind AppendUserInput / CheckInbox /
ThreadStreamingIdentity flakes traces to the same shape:

  workspace.GetMeshNodeStream().Update(node =>
  {
      var t = node.Content as MeshThread ?? new MeshThread();  // ← silent overwrite
      return node with { Content = t with { ... } };
  });

When `Content` arrives as a raw `JsonElement` (file-system / Postgres /
Cosmos all round-trip through JSON serialisation; only InMemory keeps the
typed instance), the `as MeshThread` cast returns null and the
`?? new MeshThread()` fallback replaces the stored thread with a fresh
one, so every field the lambda does not set falls back to its default
(Status=Idle, pending={}, etc.). The next stream.Update then persists
that default-valued thread: silent data corruption.

Fix: every emission and Update lambda passing through
`MeshNodeStreamHandle` now has its Content converted to the registered
domain type via the workspace's `JsonSerializerOptions` at the boundary.
Two pieces:

  * Subscribe path: a `TypedContentObserver` between the underlying sync
    stream and the subscriber deserialises any `JsonElement` Content to
    its registered domain type via the workspace's polymorphic
    `$type` discriminator. No-op for already-typed Content.

  * Update path: the caller's lambda is wrapped so the input is typed
    (deserialised if needed) before `update(node)` is called. The post-
    update emission also goes through the typed converter so callers
    chaining `.Select(node => node.Content as MyType)` get the same
    typed shape as Subscribe. (No outbound serialisation: the downstream
    cold pipeline runs `SerializeToNode` itself for cross-hub patches,
    and OWN-path equality dedup in the data source breaks when we force
    a serialise-deserialise round-trip on every write.)

Eliminates the `?? new TFoo()` antipattern across every callsite: when
Content is genuinely absent or wrong-shaped, the `as` cast returns null
and the lambda can return `node` unchanged, with no silent overwrite.
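
A rough sketch of the typed-content boundary described above, under stated assumptions: the `MeshNode` and `MeshThread` records, the `Registry` dictionary, and the `EnsureTyped` / `MarkRunning` names are hypothetical simplifications, and the real implementation resolves types through the workspace's polymorphic `$type` configuration rather than a hand-written lookup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical, simplified shapes; the real types carry more members.
public record MeshThread(string Status = "Idle", int PendingCount = 0);
public record MeshNode(string Path, object? Content);

public static class TypedContent
{
    // Assumed registry from "$type" discriminator to CLR type.
    private static readonly Dictionary<string, Type> Registry = new()
    {
        ["MeshThread"] = typeof(MeshThread),
    };

    // Converts JsonElement content to its registered type; no-op otherwise.
    public static MeshNode EnsureTyped(MeshNode node, JsonSerializerOptions options)
    {
        if (node.Content is not JsonElement json) return node;
        if (json.ValueKind != JsonValueKind.Object) return node;
        if (!json.TryGetProperty("$type", out var tag)) return node;
        if (tag.ValueKind != JsonValueKind.String) return node;
        if (!Registry.TryGetValue(tag.GetString()!, out var type)) return node;
        return node with { Content = json.Deserialize(type, options) };
    }

    // Update lambda without the `?? new MeshThread()` fallback:
    // wrong-shaped content leaves the node untouched.
    public static MeshNode MarkRunning(MeshNode node, JsonSerializerOptions options)
    {
        var typed = EnsureTyped(node, options);
        return typed.Content is MeshThread thread
            ? typed with { Content = thread with { Status = "Running" } }
            : node;
    }
}
```

Because the input is typed before the pattern match, fields such as `PendingCount` survive the update, and a node whose content cannot be typed is returned as the same instance.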

Two helpers exposed for reuse by other primitives needing the same shape
guarantee: `MeshNodeStreamHandle.EnsureTypedContent(node, options)` and
`MeshNodeStreamHandle.EnsureSerialisedContent(node, options)`.

Watchers — applying the AccessContext propagation rule:

  * `ThreadExecution.InstallExecRoundWatcher` — DispatchAfterClaim
    creates satellite cells and posts cross-hub messages, all of which
    must be attributed to the thread owner (not the cache hub's emission
    identity). Wraps in `using AccessContextScope.FromNode(node, ...)`
    so every downstream write rides under thread.CreatedBy. The access
    check that gates the dispatch already happened (user without thread
    access can't flip Status to StartingExecution).

  * `NodeTypeCompilationHelpers.InstallCompileWatcher` — compile runs
    under SYSTEM identity, by design. Wraps in
    `using AccessContextScope.AsSystem(accessService)` so the
    DispatchCompileTrigger post lands at the handler with
    delivery.AccessContext = system-security; every internal write
    inside the activity (read source files across all users, write the
    activity log, emit the assembly) then bypasses RLS. The access
    check is upstream — the user has to be permitted to flip
    RequestedReleaseAt on the NodeType's MeshNode.

  * `ThreadSubmission.InstallServerWatcher` — claim flip is an OWN
    update, no cross-hub, no RLS gate inside the action block.
    No scope needed; comment added to clarify the rule.

New helper: `MeshWeaver.Mesh.Security.AccessContextScope` (Mesh.Contract)
with `FromNode(node, access)` and `AsSystem(access)` factories — the
two operation classes the codebase needs.
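
The two factories might look roughly like this. It is a sketch only: the `AccessService` and `MeshNode` types below are hypothetical minimal stand-ins, identities are reduced to strings, and the `"system-security"` value is taken from the text above without knowing its real representation.

```csharp
using System;
using System.Threading;

// Hypothetical minimal stand-in; the real AccessService has more members.
public sealed class AccessService
{
    private readonly AsyncLocal<string?> _context = new();

    public string? Context => _context.Value;

    // Switches the ambient identity and returns a scope that restores the old one.
    public IDisposable SwitchAccessContext(string? identity)
    {
        var previous = _context.Value;
        _context.Value = identity;
        return new Scope(() => _context.Value = previous);
    }

    private sealed class Scope : IDisposable
    {
        private Action? _restore;

        public Scope(Action restore) => _restore = restore;

        // Runs the restore action at most once.
        public void Dispose() => Interlocked.Exchange(ref _restore, null)?.Invoke();
    }
}

// Hypothetical, simplified node shape.
public record MeshNode(string Path, string CreatedBy);

public static class AccessContextScope
{
    // Assumed representation of the system identity.
    public const string SystemIdentity = "system-security";

    // Downstream writes are attributed to the node's owner.
    public static IDisposable FromNode(MeshNode node, AccessService access)
        => access.SwitchAccessContext(node.CreatedBy);

    // Downstream writes run under the system identity.
    public static IDisposable AsSystem(AccessService access)
        => access.SwitchAccessContext(SystemIdentity);
}
```

A watcher would wrap its dispatch in `using (AccessContextScope.FromNode(node, access)) { ... }` or `using (AccessContextScope.AsSystem(access)) { ... }`, and the previous identity comes back when the block exits.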

Docs updated:
  * CqrsAndContentAccess.md — new section "Content is always typed at
    the GetMeshNodeStream boundary" with the bad/good comparison.
  * AsynchronousCalls.md — same rule cross-referenced from the cold-
    write contract section.

Verification:
  * AI suite: 444/445 (was 9 failures pre-fix). Remaining 1 is
    CheckInbox_MultiplePending — a pre-existing rapid-OWN-update race
    where 3 concurrent AppendUserInput calls collide on the data
    source's action block. Not addressed in this commit (separate
    concurrent-write design).
  * Identity-canary tests still green: CacheUpdate_Concat_PreservesCallerIdentity
    + CacheUpdate_AfterCallerScopeDisposed_StillCarriesCapturedIdentity
    + the 6 AccessContextSurvivesSubscribeTest tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>