
Optimize Firecracker snapshot resume #256

Draft: sjmiller609 wants to merge 1 commit into codex/restore-deep-trace-debug from codex/firecracker-resume-on-load

Summary

  • ask Firecracker to resume the VM as part of /snapshot/load, and skip the separate host PATCH /vm resume call when the restore has already resumed it
  • preserve deep restore tracing by keeping the explicit Resume call when HYPEMAN_RESTORE_DEEP_TRACE=1
  • replace the 50ms polling loop for Firecracker API socket readiness with Linux inotify plus a 1ms retry once the socket path exists; non-Linux hosts fall back to 1ms polling
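The socket readiness change can be illustrated with a minimal sketch of the portable polling fallback only; the Linux inotify path is not shown. `waitForSocket` and `runDemo` are hypothetical names for this illustration and are not taken from the branch. The sketch assumes "ready" means the Unix socket accepts a connection.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// waitForSocket polls every interval until the Unix socket at path accepts a
// connection or ctx is done. Hypothetical helper, not the branch's code.
func waitForSocket(ctx context.Context, path string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var d net.Dialer
	for {
		// Only dial once the path exists; the file can appear before the
		// process is listening, so a failed dial is simply retried.
		if _, err := os.Stat(path); err == nil {
			if conn, err := d.DialContext(ctx, "unix", path); err == nil {
				conn.Close()
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("socket %s not ready: %w", path, ctx.Err())
		case <-ticker.C:
		}
	}
}

// runDemo simulates a hypervisor that creates its API socket after delayMs
// and waits for it with a timeoutMs deadline and a 1ms poll interval.
func runDemo(delayMs, timeoutMs int) error {
	dir, err := os.MkdirTemp("", "fc")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "api.sock")

	go func() {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		ln, err := net.Listen("unix", path)
		if err != nil {
			return
		}
		// The listener stays open until the process exits; fine for a demo.
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Close()
		}
	}()

	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitForSocket(ctx, path, time.Millisecond)
}

func main() {
	if err := runDemo(20, 2000); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Println("socket ready")
}
```

With a 1ms interval the wait adds at most about one interval after the socket starts listening, whereas a 50ms interval can add up to about 50ms, which is in line with the start_process drop reported below.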

Remote results

Before this branch, the phase harness baseline showed steady fork latency of about 429-430ms, with hypervisor.start_process at about 50-51ms and resume_vm at about 27-28ms.

On deft-kernel-dev with both optimizations enabled, 5 iterations showed:

  • steady fork totals: 340ms, 344ms, 360ms, plus a 491ms IO/fault outlier and a cold 1235ms first iteration
  • hypervisor.start_process: 1-2ms
  • resume_vm: 0ms; the VM was resumed during snapshot load
  • fault_guest_memory_from_disk remains the dominant bucket: 224-286ms in the steady/outlier samples
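As a rough consistency check on these numbers, the sketch below compares the savings expected from the two optimized buckets against the observed change in steady fork totals. It uses only the figures quoted above, excludes the 491ms outlier and the cold first iteration, and treats each quoted range as a simple min/max; `msRange` and `saved` are hypothetical helpers for this illustration.

```go
package main

import "fmt"

// msRange is a closed range of milliseconds, [lo, hi].
type msRange struct{ lo, hi int }

// saved returns the range of possible savings when a phase drops
// from the before range to the after range.
func saved(before, after msRange) msRange {
	return msRange{before.lo - after.hi, before.hi - after.lo}
}

// add sums two ranges bound by bound.
func add(a, b msRange) msRange {
	return msRange{a.lo + b.lo, a.hi + b.hi}
}

func main() {
	startProcess := saved(msRange{50, 51}, msRange{1, 2}) // hypervisor.start_process
	resumeVM := saved(msRange{27, 28}, msRange{0, 0})     // resume_vm
	expected := add(startProcess, resumeVM)

	// Steady fork totals: baseline 429-430ms, this branch 340-360ms.
	observed := saved(msRange{429, 430}, msRange{340, 360})

	fmt.Printf("expected saving: %d-%dms\n", expected.lo, expected.hi)
	fmt.Printf("observed saving: %d-%dms\n", observed.lo, observed.hi)
}
```

The two buckets account for roughly 75-78ms, which falls inside the observed 69-90ms improvement. With only three steady samples, this is a sanity check rather than a precise attribution.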

A control run with HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD=0 confirmed the socket readiness change alone drops start_process to 1-2ms.
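The toggle behavior can be sketched as below. This is an assumption-laden illustration, not the branch's code: `shouldResumeOnLoad` and `buildLoadBody` are hypothetical names, the default of "enabled unless set to 0" is inferred from the control run, and the snapshot paths are placeholders. The `resume_vm` field on `/snapshot/load` and the `PATCH /vm` body with `"state": "Resumed"` are part of Firecracker's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memBackend and snapshotLoadRequest cover only the /snapshot/load
// fields this sketch needs.
type memBackend struct {
	BackendType string `json:"backend_type"`
	BackendPath string `json:"backend_path"`
}

type snapshotLoadRequest struct {
	SnapshotPath string     `json:"snapshot_path"`
	MemBackend   memBackend `json:"mem_backend"`
	ResumeVM     bool       `json:"resume_vm"`
}

// shouldResumeOnLoad decides whether to ask Firecracker to resume during
// load. Assumed semantics: deep tracing forces the explicit Resume call,
// and resume-on-load is on unless the env var is "0".
func shouldResumeOnLoad(getenv func(string) string) bool {
	if getenv("HYPEMAN_RESTORE_DEEP_TRACE") == "1" {
		return false
	}
	return getenv("HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD") != "0"
}

// buildLoadBody renders the JSON body for PUT /snapshot/load.
func buildLoadBody(snapshotPath, memPath string, resumeOnLoad bool) ([]byte, error) {
	return json.Marshal(snapshotLoadRequest{
		SnapshotPath: snapshotPath,
		MemBackend:   memBackend{BackendType: "File", BackendPath: memPath},
		ResumeVM:     resumeOnLoad,
	})
}

func main() {
	// Simulated environment with deep tracing enabled.
	env := map[string]string{"HYPEMAN_RESTORE_DEEP_TRACE": "1"}
	getenv := func(k string) string { return env[k] }

	resume := shouldResumeOnLoad(getenv)
	// Placeholder paths for illustration only.
	body, err := buildLoadBody("/tmp/snap/vmstate", "/tmp/snap/memory", resume)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Printf("PUT /snapshot/load %s\n", body)
	if !resume {
		// The separate resume call is only needed when load did not resume.
		fmt.Println(`PATCH /vm {"state":"Resumed"}`)
	}
}
```

Passing `getenv` as a parameter keeps the decision testable without mutating the process environment; in real code it could be `os.Getenv`.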

Validation

  • gofmt
  • git diff --check
  • GOCACHE=/private/tmp/hypeman-go-build go test ./lib/hypervisor ./lib/hypervisor/firecracker ./lib/instances -run 'TestWrapHypervisorPreservesRestoredResumed|TestShouldResumeOnSnapshotLoad|TestWaitForSocketReturnsWhenSocketAppears|TestSnapshotParamPaths|TestForkSnapshotPhaseBreakdownPerf|TestRestoreDeepTracePerf|TestPatchGuestResumeNetworkMailbox|TestForkSnapshotMapsWaitForNetwork' -count=1
  • remote: HYPEMAN_RUN_FORK_PHASE_BREAKDOWN_PERF=1 HYPEMAN_FORK_PHASE_BREAKDOWN_PERF_ITERS=5 go test ./lib/instances -run TestForkSnapshotPhaseBreakdownPerf -count=1 -timeout=20m -v
  • remote: focused hypervisor/firecracker unit tests
