Mount ReFS Dev Drive on Windows to speed up small-file I/O #5065

Closed
pietern wants to merge 15 commits into main from test-time-windows-devdrive


@pietern pietern commented Apr 22, 2026

Outcome: not worth pursuing

This experiment tried to speed up Windows CI by putting the workspace and caches on a ReFS Dev Drive (~127k IOPS vs ~4.3k IOPS on default C:). The data and the Windows-platform constraints don't support continuing.

What was measured

GOCACHE / GOMODCACHE / GOTMPDIR redirected to a ReFS Dev Drive (commit 164569bbf) vs main (c6168a1be):

| Job | main | PR (GOCACHE on Dev Drive) | Δ |
| --- | --- | --- | --- |
| task test (windows, terraform) | 32m 04s | 33m 30s | +1m 26s |
| task test (windows, direct) | 25m 44s | 24m 49s | −0m 55s |

Within run-to-run noise. A 30× faster volume for Go's caches produced no measurable change — the workload is not bound on small-file disk I/O.
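To put the deltas in proportion, the arithmetic can be sketched as follows. The durations and IOPS figures are copied from this description; the helper names (`seconds`, `delta`) are illustrative and not part of the repository.

```python
# Durations are taken from the table above; helper names are hypothetical.
def seconds(minutes: int, secs: int) -> int:
    """Convert a 'XXm YYs' duration to seconds."""
    return minutes * 60 + secs


# job name -> (main, PR with GOCACHE on Dev Drive), both in seconds
runs = {
    "task test (windows, terraform)": (seconds(32, 4), seconds(33, 30)),
    "task test (windows, direct)": (seconds(25, 44), seconds(24, 49)),
}


def delta(main_s: int, pr_s: int) -> tuple[int, float]:
    """Return (absolute change in seconds, change as a percentage of main)."""
    diff = pr_s - main_s
    return diff, 100.0 * diff / main_s


for job, (main_s, pr_s) in runs.items():
    diff, pct = delta(main_s, pr_s)
    print(f"{job}: {diff:+d}s ({pct:+.1f}%)")

# Ratio of the quoted IOPS figures (~127k vs ~4.3k), the basis for "30x".
iops_ratio = 127_000 / 4_300
print(f"IOPS ratio: {iops_ratio:.1f}x")
```

Both jobs move by under 5% of their baseline, in opposite directions, while the volume is roughly 30 times faster.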

Why workspace-on-Dev-Drive isn't a quick fix either

Mounting the workspace on a Dev Drive looked like the natural next step, but every approach hits a Windows-specific block:

  • Junction `$GITHUB_WORKSPACE` → Dev Drive folder — `Remove-Item` on the workspace dir fails with "being used by another process" because the runner agent holds an open handle from "Set up job" through the entire job.
  • Mount the volume directly at `$GITHUB_WORKSPACE` via `Add-PartitionAccessPath` — `actions/checkout` calls `scandir` on the volume root, which contains a `System Volume Information` folder with restrictive ACLs. `EPERM`. Stripping the folder doesn't stick: Windows recreates `WPSettings.dat` etc. immediately.
  • The canonical action `samypr100/setup-dev-drive` explicitly documents this exact `System Volume Information` failure and refuses to mount onto `$GITHUB_WORKSPACE` at all. Their supported workaround is `workspace-copy: true`: check out normally, then copy the workspace into a sibling Dev Drive location. That works but requires `working-directory` plumbing on every subsequent step, plus a one-time copy.

Recommendation

Given the GOCACHE-only baseline showed no speedup, investing in the `workspace-copy` plumbing is unlikely to pay off. Before revisiting Dev Drive, profile where the time actually goes on Windows runs:

  • Per-test timings from the `gotestsum` JSON we already upload — does a small set of tests dominate?
  • Try `fsutil devdrv trust` + Defender performance mode on the existing Dev Drive. If Defender real-time scanning is the bottleneck, this alone would move the needle without any workspace work.
  • One-off run with Defender real-time protection disabled — establishes the AV ceiling.
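The first profiling step could start from something like the sketch below. It assumes the uploaded file is the line-delimited `go test -json` event stream that `gotestsum --jsonfile` writes, with `Action`, `Package`, `Test`, and `Elapsed` fields; the function name, sample events, and file name are hypothetical.

```python
import json
from collections import defaultdict


def slowest_tests(lines, top=10):
    """Rank tests by total elapsed seconds from go test -json events."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output lines
        # Only terminal per-test events carry a meaningful Elapsed value;
        # package-level events have no "Test" field and are skipped.
        if event.get("Action") in ("pass", "fail") and event.get("Test"):
            key = (event.get("Package", ""), event["Test"])
            totals[key] += event.get("Elapsed", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up sample events, for illustration only.
sample = [
    '{"Action":"run","Package":"example/acceptance","Test":"TestSlow"}',
    '{"Action":"pass","Package":"example/acceptance","Test":"TestSlow","Elapsed":12.5}',
    '{"Action":"pass","Package":"example/acceptance","Test":"TestFast","Elapsed":0.2}',
    '{"Action":"pass","Package":"example/acceptance","Elapsed":12.9}',
    "not json",
]

ranked = slowest_tests(sample)
for (package, test), elapsed in ranked:
    print(f"{elapsed:8.1f}s  {package}  {test}")

# Against a real artifact (file name is an assumption):
#   with open("test-output.json") as f:
#       print(slowest_tests(f))
```

A parent test's elapsed time includes its subtests, so the ranking is best read per level rather than summed across levels.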

If profiling shows the workspace I/O actually matters, return to Dev Drive via `samypr100/setup-dev-drive` with `workspace-copy: true`.

Closing without merge.

@pietern pietern changed the title from "ci: mount ReFS Dev Drive on Windows to speed up small-file I/O" to "Mount ReFS Dev Drive on Windows to speed up small-file I/O" Apr 22, 2026
pietern added 8 commits May 7, 2026 11:09
Acceptance tests on Windows spend most of their wall-clock on small-file
writes: each terraform init copies providers into a per-test `.terraform/`
under `$TEMP`, and the go build/module caches see similar churn. The
default C: drive on GitHub-hosted and Databricks-protected Windows
runners is backed by remote block storage (~4.3k IOPS); a ReFS Dev Drive
is ~127k IOPS on comparable benchmarks.

This step creates a 20GB dynamic VHDX, mounts it as Z:, formats it ReFS
(with the `-DevDrive` flag where the host supports it, falling back to
plain ReFS otherwise), and redirects TEMP/TMP + GOCACHE/GOMODCACHE/
GOTMPDIR onto it. Checkout stays on C: -- moving it would be invasive
(acceptance test output normalization) for little further gain.

Placed at the top of the composite action so it applies to every caller
(test, test-exp-aitools, test-exp-ssh, test-pipelines). No-op on
non-Windows runners via `runner.os == 'Windows'`.

Co-authored-by: Isaac

Redirecting TEMP to Z: puts `t.TempDir()` (and therefore each test's
bundle cwd) on Z:, while the checkout and uv's Python package cache stay
on C:. Under `bundle/python/*` tests with older databricks-bundles
versions (e.g. PYDAB_VERSION=0.266.0), the Python mutator calls
`os.path.commonpath([os.getcwd(), path])` which raises
`ValueError: Paths don't have the same drive`. Six tests regressed:
experimental-compatibility{,-both-equal}, resource-loading,
unicode-support, restricted-execution, resolve-variable.
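The failure mode can be reproduced on any OS with `ntpath`, the module that backs `os.path` on Windows. This is a minimal sketch: both paths are hypothetical stand-ins for a test cwd under the redirected TEMP and a file that stays on C:, and the helper name is illustrative.

```python
import ntpath  # Windows path semantics, importable on any platform

# Hypothetical paths: a per-test cwd on the Dev Drive and a module on C:.
cwd = r"Z:\temp\TestBundle001"
module_path = r"C:\a\cli\cli\some_module.py"


def common_or_error(paths):
    """Return the common path, or the ValueError text if there is none."""
    try:
        return ntpath.commonpath(paths)
    except ValueError as exc:
        return f"ValueError: {exc}"


result = common_or_error([cwd, module_path])
print(result)
```

`commonpath` compares drive components before anything else, so any pair of absolute paths on different drive letters fails, regardless of what exists on disk.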

Keep only GOCACHE/GOMODCACHE/GOTMPDIR on the Dev Drive -- those benefit
Go compilation I/O without spanning drive boundaries at the Python
level. Per-test `.terraform/` speedup is lost; recover it in a follow-up
by plumbing a test-framework-specific tmpdir that callers can keep on
the same drive as the checkout.

Co-authored-by: Isaac

Rather than redirecting individual env vars (TEMP/TMP/GO*) and running
into cross-drive path issues (Python mutator's
`os.path.commonpath([cwd, path])` fails when cwd and the uv-cached
module live on different drives), we can just relocate the entire
workspace.

Create a directory junction from `$GITHUB_WORKSPACE` (still on C:) to
`Z:\workspace`. From every tool's point of view the path is unchanged --
it starts with `C:\` and `commonpath` works. Physically, all reads and
writes go to the ReFS volume.

Flow:
- Mount VHDX at Z:
- Wipe the workflow's prior checkout at $GITHUB_WORKSPACE
- Create junction to Z:\workspace
- The composite's own `actions/checkout` step re-populates the
  workspace via the junction (so setup-jfrog etc. find their files)

No Go / TEMP env gymnastics needed. bundle/python/* tests stay happy.

Co-authored-by: Isaac

Go-caches-only on Z: left the big Windows test jobs effectively flat
(windows/terraform 32m34s vs 32m33s baseline) because the dominant cost
is per-test `.terraform/` churn under TEMP, not Go compilation. Moving
TEMP onto the Dev Drive was the missing piece.

The first TEMP-on-Z: attempt broke `bundle/python/*` tests (older
databricks-bundles calls `os.path.commonpath([cwd, uv_cache_path])` and
chokes when the two live on different drives). Fix: create a directory
junction at `C:\a\_fast` (sibling to `C:\a\cli\cli`, not inside the
repo) pointing at `Z:\fast`. Path strings stay `C:\...`, so
`commonpath` is happy; I/O physically lands on Z:.
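Why the junction sidesteps the error can be sketched with `ntpath` again: `commonpath` is a purely lexical operation and never resolves reparse points, so only the drive letter in the path string matters. The per-test directory name below is hypothetical; `C:\a\_fast`, `Z:\fast`, and `C:\a\cli\cli` are the paths named in this commit message.

```python
import ntpath  # Windows path semantics, importable on any platform

checkout = r"C:\a\cli\cli"
# Same physical directory, addressed two ways (test dir name is made up).
via_junction = r"C:\a\_fast\TestBundle001"  # string still starts with C:\
via_drive_letter = r"Z:\fast\TestBundle001"  # exposes the Dev Drive letter


def same_drive(a: str, b: str) -> bool:
    """Compare drive components only, case-insensitively, without touching disk."""
    return ntpath.splitdrive(a)[0].lower() == ntpath.splitdrive(b)[0].lower()


print(same_drive(via_junction, checkout))      # junction path shares C:
print(same_drive(via_drive_letter, checkout))  # direct path does not

common = ntpath.commonpath([via_junction, checkout])
print(common)
```

The trade-off, as the later commits show, is that tools which do resolve reparse points (such as `filepath.EvalSymlinks`) see through the junction and can behave differently.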

Junction is outside the checkout to avoid `git status` pollution,
`git clean` interactions after `actions/checkout`, and unintended
traversal by repo-walking tools.

Co-authored-by: Isaac

`bundle.TestRootLookup` (unit test) called `filepath.EvalSymlinks` on
its `t.TempDir()` path, which landed under the `C:\a\_fast` directory
junction. Go's stdlib EvalSymlinks on Windows returns
"cannot find the path specified" for `IO_REPARSE_TAG_MOUNT_POINT`
(junctions) rooted at a newly mounted ReFS VHDX, but handles
`IO_REPARSE_TAG_SYMLINK` (directory symlinks) correctly.

Switching `New-Item -ItemType Junction` to `cmd /c mklink /D`
(directory symbolic link) to dodge the quirk. Symlinks require
Developer Mode, which is the default on GitHub-hosted Windows
runners.

Co-authored-by: Isaac
@pietern pietern force-pushed the test-time-windows-devdrive branch from f5621c6 to 164569b May 7, 2026 09:10
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 09:10 — with GitHub Actions Inactive
Mount the ReFS volume at C:\dd via Add-PartitionAccessPath instead of
assigning drive letter Z:. Junction $GITHUB_WORKSPACE onto C:\dd\ws and
point TMP/TEMP at C:\dd\tmp.

Every path the test process sees now starts with C:\ — file I/O still
lands on the Dev Drive, but os.path.commonpath won't trip on cross-drive
paths in older databricks-bundles under bundle/python/*.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 10:01 — with GitHub Actions Inactive
The calling workflows run actions/checkout before invoking this composite
action, so the workspace is already populated when the Mount step runs.
Robocopy /MOVE the contents to C:\dd\ws before placing the junction so
file I/O lands on the Dev Drive.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 10:39 — with GitHub Actions Inactive
The previous attempt called the mount step from the setup-build-environment
composite, which runs after the workflow's actions/checkout. Moving
17k workspace files from C: to ReFS while Defender scans every read was so
slow that the step hung past 40 min on every Windows job.

Inline the mount step into each test* job in push.yml so it runs before
checkout. The junction is placed over an empty GITHUB_WORKSPACE and the
checkout writes directly onto ReFS — no copy needed.

Local composite actions can't be referenced before checkout, so the
PowerShell is duplicated across the four test* jobs that use Windows.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 11:26 — with GitHub Actions Inactive
The pwsh shell starts with CWD set to GITHUB_WORKSPACE, so Remove-Item
on that path fails with "because it is in use". Switch to C:\ first.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 12:03 — with GitHub Actions Inactive
Previous attempt junctioned GITHUB_WORKSPACE to a folder on a single
Dev Drive, but Remove-Item on the workspace dir fails because the
runner agent holds an open handle on it ("being used by another
process"). Set-Location alone wasn't enough to release it.

Switch to two ReFS Dev Drive volumes mounted via Add-PartitionAccessPath:
- 15 GB volume mounted directly at $GITHUB_WORKSPACE (the empty folder
  becomes a reparse point overlay; open handles don't matter for that)
- 10 GB volume mounted at C:\dd for TEMP, GOCACHE, GOMODCACHE, GOTMPDIR

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 12:42 — with GitHub Actions Inactive
ReFS volumes auto-create a "System Volume Information" folder at their
root that scandir can't enumerate (EPERM), which trips actions/checkout's
pre-clone cleanup. The workspace is freshly mounted and otherwise empty,
so disabling the clean step is safe.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 14:01 — with GitHub Actions Inactive
ReFS auto-creates a System Volume Information folder at the volume root
with ACLs that block scandir for non-SYSTEM users, which trips
actions/checkout (EPERM during readdir). Take ownership and remove it
right after mount; the workspace volume is fresh and we don't need it.

Also drop `clean: false` — it didn't avoid the scandir, since checkout
walks the workspace before the clean step.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 7, 2026 14:38 — with GitHub Actions Inactive
@pietern pietern closed this May 8, 2026
@pietern pietern deleted the test-time-windows-devdrive branch May 8, 2026 09:28