Mount ReFS Dev Drive on Windows to speed up small-file I/O #5065
Closed
Acceptance tests on Windows spend most of their wall-clock time on small-file writes: each `terraform init` copies providers into a per-test `.terraform/` under `$TEMP`, and the Go build/module caches see similar churn. The default C: drive on GitHub-hosted and Databricks-protected Windows runners is backed by remote block storage (~4.3k IOPS); a ReFS Dev Drive is ~127k IOPS on comparable benchmarks.
This step creates a 20GB dynamic VHDX, mounts it as Z:, formats it ReFS (with the `-DevDrive` flag where the host supports it, falling back to plain ReFS otherwise), and redirects TEMP/TMP + GOCACHE/GOMODCACHE/GOTMPDIR onto it. Checkout stays on C: -- moving it would be invasive (acceptance test output normalization) for little further gain.
Placed at the top of the composite action so it applies to every caller (test, test-exp-aitools, test-exp-ssh, test-pipelines). No-op on non-Windows runners via `runner.os == 'Windows'`.
Co-authored-by: Isaac
Redirecting TEMP to Z: puts `t.TempDir()` (and therefore each test's
bundle cwd) on Z:, while the checkout and uv's Python package cache stay
on C:. Under `bundle/python/*` tests with older databricks-bundles
versions (e.g. PYDAB_VERSION=0.266.0), the Python mutator calls
`os.path.commonpath([os.getcwd(), path])` which raises
`ValueError: Paths don't have the same drive`. Six tests regressed:
experimental-compatibility{,-both-equal}, resource-loading,
unicode-support, restricted-execution, resolve-variable.
Keep only GOCACHE/GOMODCACHE/GOTMPDIR on the Dev Drive -- those benefit
Go compilation I/O without spanning drive boundaries at the Python
level. Per-test `.terraform/` speedup is lost; recover it in a follow-up
by plumbing a test-framework-specific tmpdir that callers can keep on
the same drive as the checkout.
Co-authored-by: Isaac
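The cross-drive failure described above can be reproduced off-runner. This is a minimal sketch, not the mutator's actual code: it uses `ntpath` (the Windows flavour of `os.path`, importable on any OS), and both paths are hypothetical stand-ins for a test cwd on Z: and a uv-cached module on C:.

```python
import ntpath  # Windows path semantics; on Windows, os.path is this module

# Hypothetical paths, chosen only to put the two inputs on different drives.
test_cwd = r"Z:\Temp\TestBundle001"                       # t.TempDir() after TEMP moved to Z:
uv_cached_module = r"C:\Users\runner\uv-cache\pkg\mod.py"  # uv cache left on C:

error = None
try:
    # Same call shape as the commit message: commonpath([cwd, path])
    ntpath.commonpath([test_cwd, uv_cached_module])
except ValueError as exc:
    # Raised because there is no common ancestor across drive letters.
    error = exc

print(type(error).__name__, error)
```

Any fix therefore has to keep the path strings on one drive letter, regardless of where the bytes physically land.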
Rather than redirecting individual env vars (TEMP/TMP/GO*) and running into cross-drive path issues (Python mutator's `os.path.commonpath([cwd, path])` fails when cwd and the uv-cached module live on different drives), we can just relocate the entire workspace.
Create a directory junction from `$GITHUB_WORKSPACE` (still on C:) to `Z:\workspace`. From every tool's point of view the path is unchanged -- it starts with `C:\` and `commonpath` works. Physically, all reads and writes go to the ReFS volume.
Flow:
- Mount VHDX at Z:
- Wipe the workflow's prior checkout at `$GITHUB_WORKSPACE`
- Create junction to `Z:\workspace`
- The composite's own `actions/checkout` step re-populates the workspace via the junction (so setup-jfrog etc. find their files)
No Go / TEMP env gymnastics needed. bundle/python/* tests stay happy.
Co-authored-by: Isaac
This reverts commit 931c8a9.
Go-caches-only on Z: left the big Windows test jobs effectively flat (windows/terraform 32m34s vs 32m33s baseline) because the dominant cost is per-test `.terraform/` churn under TEMP, not Go compilation. Moving TEMP onto the Dev Drive was the missing piece.
The first TEMP-on-Z: attempt broke `bundle/python/*` tests (older databricks-bundles calls `os.path.commonpath([cwd, uv_cache_path])` and chokes when the two live on different drives).
Fix: create a directory junction at `C:\a\_fast` (sibling to `C:\a\cli\cli`, not inside the repo) pointing at `Z:\fast`. Path strings stay `C:\...`, so `commonpath` is happy; I/O physically lands on Z:.
Junction is outside the checkout to avoid `git status` pollution, `git clean` interactions after `actions/checkout`, and unintended traversal by repo-walking tools.
Co-authored-by: Isaac
`bundle.TestRootLookup` (unit test) called `filepath.EvalSymlinks` on its `t.TempDir()` path, which landed under the `C:\a\_fast` directory junction. Go's stdlib EvalSymlinks on Windows returns "cannot find the path specified" for `IO_REPARSE_TAG_MOUNT_POINT` (junctions) rooted at a newly mounted ReFS VHDX, but handles `IO_REPARSE_TAG_SYMLINK` (directory symlinks) correctly.
Switch from `New-Item -ItemType Junction` to `cmd /c mklink /D` (directory symbolic link) to dodge the quirk. Symlinks require Developer Mode, which is the default on GitHub-hosted Windows runners.
Co-authored-by: Isaac
This reverts commit fce8d2d.
This reverts commit 9908fe7.
f5621c6 to 164569b
Mount the ReFS volume at C:\dd via Add-PartitionAccessPath instead of assigning drive letter Z:. Junction $GITHUB_WORKSPACE onto C:\dd\ws and point TMP/TEMP at C:\dd\tmp. Every path the test process sees now starts with C:\ — file I/O still lands on the Dev Drive, but os.path.commonpath won't trip on cross-drive paths in older databricks-bundles under bundle/python/*. Co-authored-by: Isaac
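A quick way to sanity-check the "everything starts with C:\" property is to compare drive components before calling `commonpath`. This is a sketch under assumptions: `drives_of` is a hypothetical helper, and the test-directory and uv-cache paths are illustrative; only `C:\dd\tmp` and `C:\a\cli\cli` come from the commit messages above.

```python
import ntpath  # Windows path semantics, usable on any OS


def drives_of(paths):
    """Return the set of drive components (upper-cased) across the given paths."""
    return {ntpath.splitdrive(p)[0].upper() for p in paths}


paths_seen_by_tests = [
    r"C:\dd\tmp\TestBundle001",   # TEMP on the Dev Drive, reached through its C:\dd mount point
    r"C:\a\cli\cli",              # checkout path
    r"C:\Users\runner\uv-cache",  # hypothetical uv cache location
]

drive_set = drives_of(paths_seen_by_tests)
common = ntpath.commonpath(paths_seen_by_tests)

print(drive_set)  # a single drive, so commonpath cannot raise the cross-drive error
print(common)
```

The check is purely lexical, which is why mounting the volume at a folder path sidesteps the `commonpath` failure even though the I/O goes to a different volume.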
The calling workflows run actions/checkout before invoking this composite action, so the workspace is already populated when the Mount step runs. Robocopy /MOVE the contents to C:\dd\ws before placing the junction so file I/O lands on the Dev Drive. Co-authored-by: Isaac
The previous attempt called the mount step from the setup-build-environment composite, which runs after the workflow's actions/checkout — moving 17k workspace files from C: to ReFS while Defender scans every read takes forever and the step hung past 40 min on every Windows job. Inline the mount step into each test* job in push.yml so it runs before checkout. The junction is placed over an empty GITHUB_WORKSPACE and the checkout writes directly onto ReFS — no copy needed. Local composite actions can't be referenced before checkout, so the PowerShell is duplicated across the four test* jobs that use Windows. Co-authored-by: Isaac
The pwsh shell starts with CWD set to GITHUB_WORKSPACE, so Remove-Item on that path fails with "because it is in use". Switch to C:\ first. Co-authored-by: Isaac
Previous attempt junctioned GITHUB_WORKSPACE to a folder on a single
Dev Drive, but Remove-Item on the workspace dir fails because the
runner agent holds an open handle on it ("being used by another
process"). Set-Location alone wasn't enough to release it.
Switch to two ReFS Dev Drive volumes mounted via Add-PartitionAccessPath:
- 15 GB volume mounted directly at $GITHUB_WORKSPACE (the empty folder
becomes a reparse point overlay; open handles don't matter for that)
- 10 GB volume mounted at C:\dd for TEMP, GOCACHE, GOMODCACHE, GOTMPDIR
Co-authored-by: Isaac
ReFS volumes auto-create a "System Volume Information" folder at their root that scandir can't enumerate (EPERM), which trips actions/checkout's pre-clone cleanup. The workspace is freshly mounted and otherwise empty, so disabling the clean step is safe. Co-authored-by: Isaac
ReFS auto-creates a System Volume Information folder at the volume root with ACLs that block scandir for non-SYSTEM users, which trips actions/checkout (EPERM during readdir). Take ownership and remove it right after mount; the workspace volume is fresh and we don't need it. Also drop `clean: false` — it didn't avoid the scandir, since checkout walks the workspace before the clean step. Co-authored-by: Isaac
Outcome: not worth pursuing
This experiment tried to speed up Windows CI by putting the workspace and caches on a ReFS Dev Drive (~127k IOPS vs ~4.3k IOPS on default C:). The data and the Windows-platform constraints don't support continuing.
What was measured
GOCACHE / GOMODCACHE / GOTMPDIR redirected to a ReFS Dev Drive (commit 164569bbf) vs main (c6168a1be), compared on two jobs:
- task test (windows, terraform)
- task test (windows, direct)
(Per-job timing table not preserved.)
Within run-to-run noise. A 30× faster volume for Go's caches produced no measurable change: the workload is not bound on small-file disk I/O.
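For scale, the two headline figures can be put side by side. This is a rough sketch using only numbers quoted earlier in this PR: the ~127k vs ~4.3k IOPS benchmark figures, and the windows/terraform timing of 32m34s vs the 32m33s baseline from the Go-caches-only commit message. `to_seconds` is a hypothetical helper.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmYs' job duration into seconds."""
    return minutes * 60 + seconds


# Benchmark figures quoted in the PR description (approximate).
dev_drive_iops = 127_000
default_c_iops = 4_300
iops_ratio = dev_drive_iops / default_c_iops

# windows/terraform job, Go caches on the Dev Drive vs baseline.
baseline_s = to_seconds(32, 33)
go_caches_on_dev_drive_s = to_seconds(32, 34)
delta_s = go_caches_on_dev_drive_s - baseline_s
delta_pct = delta_s / baseline_s * 100

print(f"IOPS ratio: ~{iops_ratio:.1f}x")
print(f"job delta: {delta_s:+d}s ({delta_pct:+.2f}%)")
```

A roughly 30× difference in volume throughput against a one-second difference on a 32-minute job is what the "within noise" conclusion rests on.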
Why workspace-on-Dev-Drive isn't a quick fix either
Mounting the workspace on a Dev Drive looked like the natural next step, but every approach hits a Windows-specific block:
Recommendation
Given the GOCACHE-only baseline showed no speedup, investing in the `workspace-copy` plumbing is unlikely to pay off. Before revisiting Dev Drive, profile where the time actually goes on Windows runs:
If profiling shows the workspace I/O actually matters, return to Dev Drive via `samypr100/setup-dev-drive` with `workspace-copy: true`.
Closing without merge.