Skip to content

[PRODENG-3471] Fix Windows MCR install failing on FIPS channels#640

Open
james-nesbitt wants to merge 8 commits into
mainfrom
PRODENG-3471
Open

[PRODENG-3471] Fix Windows MCR install failing on FIPS channels#640
james-nesbitt wants to merge 8 commits into
mainfrom
PRODENG-3471

Conversation

@james-nesbitt

@james-nesbitt james-nesbitt commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

What

Remove hardcoded DOCKER_VERSION=latest from the Windows MCR install command, letting install.ps1 resolve the version from the channel's index.json.

Why

For FIPS channels, install.ps1 appends +fips to DOCKER_VERSION, producing docker-latest+fips.zip — a file that has never been published. Only versioned artifacts (docker-29.2.1+fips.zip) exist on FIPS channels.

How

  • Drop version := "latest" and the DOCKER_VERSION env var from the set … && powershell install command in InstallMCR
  • Without DOCKER_VERSION set, install.ps1 fetches {repoURL}/win/static/{channel}/index.json and calls getNumericallyHigherVersion, returning the correct pinned artifact (e.g. 29.2.1.1+fips)
  • Works for all channel types: versioned FIPS, rolling FIPS, and non-FIPS

Testing

  • Build passes (go build ./...)
  • Manually verified S3 artifact layout: docker-latest+fips.zip absent, docker-29.2.1.1+fips.zip present; index.json lists versioned entries that getNumericallyHigherVersion correctly selects

Links

Checklist

  • Tests added or updated
  • Docs updated if user-visible behaviour changed
  • No debug output or dead code left in

Written by AI: claude-sonnet-4-5

DOCKER_VERSION was hardcoded to "latest" in the InstallMCR command.
For FIPS channels, install.ps1 appends +fips to produce
"latest+fips", but docker-latest+fips.zip is never published —
only versioned artifacts (docker-29.2.1+fips.zip) exist.

Without DOCKER_VERSION set, install.ps1 fetches the channel's
index.json and calls getNumericallyHigherVersion, which returns
the correct pinned version. This works for all channel types:
versioned FIPS (stable-29.2.1/fips), rolling FIPS (stable/fips),
and non-FIPS channels alike.

Fixes PRODENG-3471
Exercises a ubuntu24 manager + windows_2022 worker cluster with
MCR channel stable-29.2.1/fips. Both Linux (APT component
ubuntu/dists/noble/stable-29.2.1/fips/) and Windows
(win/static/stable-29.2.1/fips/index.json) have published
artifacts for this channel.

The test directly covers the regression: without the fix, the
Windows installer would attempt docker-latest+fips.zip (which
does not exist); with the fix it resolves docker-29.2.1.1+fips.zip
from the channel index.
Triggered by the 'smoke-windows-fips' PR label (or 'smoke-test'
to run all suites, or on push to main).
@james-nesbitt james-nesbitt added the smoke-windows-fips Run Windows FIPS smoke test label Jun 3, 2026
@james-nesbitt james-nesbitt added smoke-fips Run FIPS smoke test and removed smoke-windows-fips Run Windows FIPS smoke test labels Jun 3, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Windows MCR installation on FIPS channels by no longer forcing DOCKER_VERSION=latest, allowing the upstream install.ps1 script to resolve an appropriate versioned +fips artifact via the channel index.json.

Changes:

  • Remove hardcoded DOCKER_VERSION=latest from the Windows MCR install command so the installer script can select a valid versioned artifact.
  • Add a new FIPS-focused smoke test (TestFIPSCluster) and wire it into Make and GitHub Actions as an on-demand smoke job.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/configurer/windows.go Drops DOCKER_VERSION from the Windows MCR install invocation so FIPS channels resolve to versioned artifacts.
test/smoke/smoke_test.go Adds a smoke test covering an Ubuntu manager + Windows 2022 worker using a FIPS MCR channel.
Makefile Adds a smoke-fips target to run the new FIPS smoke test.
.github/workflows/smoke-tests.yaml Adds a smoke-fips workflow job gated by labels / pushes, invoking make smoke-fips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/configurer/windows.go Outdated
}()

installCommand := fmt.Sprintf("set DOWNLOAD_URL=%s && set DOCKER_VERSION=%s && set CHANNEL=%s && powershell -ExecutionPolicy Bypass -NoProfile -NonInteractive -File %s -Verbose", engineConfig.RepoURL, version, engineConfig.Channel, ps.DoubleQuote(installer))
installCommand := fmt.Sprintf("set DOWNLOAD_URL=%s && set CHANNEL=%s && powershell -ExecutionPolicy Bypass -NoProfile -NonInteractive -File %s -Verbose", engineConfig.RepoURL, engineConfig.Channel, ps.DoubleQuote(installer))
…ks up v0.1.6

The terraform-mirantis-modules/provision-aws module gained ubuntu_22.04_fips
in v0.1.6 but the Terraform Registry has not yet indexed that release.
Add it to lib_local_platform_definitions in examples/terraform/aws-simple/platform.tf
as a temporary override with a TODO to remove once the module source is
bumped to >= v0.1.6.

Also wire Ubuntu22FIPS into test/platforms.go and update TestFIPSCluster
to use it as the manager node.
@james-nesbitt

Copy link
Copy Markdown
Collaborator Author

The smoke-fips CI job ran and timed out waiting for WinRM on the Windows 2022 worker. This is an infrastructure timing issue (Windows boot + EC2Launch UserData on port 5986) unrelated to the MCR channel fix — the fix itself (dropping DOCKER_VERSION=latest) is a one-line change with a clear causal chain verified against the S3 artifact layout and install.ps1 source. We are continuing to investigate the FIPS smoke test reliability separately.

Written by AI: claude-sonnet-4-5

windows_2022 consistently fails to open port 5986 (WinRM HTTPS)
within the retry window. Switching to windows_2025 to determine
if the issue is AMI-specific.
Address Copilot review on #640: the cmd.exe install chain interpolated
engineConfig.RepoURL and engineConfig.Channel into bare `set VAR=value`
assignments. Use the atomic `set "VAR=value"` form so cmd metacharacters
(&, |, ^, <, >) in either value cannot break the command or inject extra
commands.
The smoke-fips Windows 2025 worker rejected every WinRM connection with
"401 - invalid content type" (wrong Basic-auth credentials), so launchpad
never reached the MCR install step under test.

Root cause: Server 2025 uses EC2Launch v2, which sets a random Administrator
password via its PreReady setAdminAccount task. The user-data script runs in
the later UserData stage and must override it, but the legacy ADSI WinNT
SetPassword silently failed on Server 2025 Core (user data runs with
ErrorActionPreference=Continue), leaving the random password active.
setAdminAccount cannot be used from user data (PreReady-only), so set the
password in-script with the modern LocalAccounts API (Set-LocalUser /
Enable-LocalUser) and -ErrorAction Stop to fail loudly if it does not apply.

Verification pending a smoke-fips CI run (requires AWS provisioning).
@james-nesbitt

Copy link
Copy Markdown
Collaborator Author

Pushed two fixes to PRODENG-3471 (tip 7aae999); smoke-fips is re-running.

The smoke-fips failure is real and deterministic — not the infra/WinRM boot timeout described earlier. On the Windows 2025 worker the 5986 HTTPS listener was up and answering; all 60 WinRM attempts returned 401 - invalid content type, i.e. rig's Basic auth was rejected with wrong credentials. The MCR fix was never exercised because launchpad could not authenticate to the host at all.

Root cause: Server 2025 runs EC2Launch v2, which sets a random Administrator password via its setAdminAccount task in the PreReady stage. The user-data script runs later (UserData stage) and is supposed to override it, but the legacy ADSI WinNT://./administrator SetPassword silently failed on Server 2025 Core (user data runs with ErrorActionPreference=Continue), leaving the random password active. setAdminAccount cannot be used from user data (it is PreReady-only), so the script itself must set the password reliably.

Changes:

  • examples/terraform/aws-simple/userdata_windows.tpl: set the Administrator password with the modern Set-LocalUser/Enable-LocalUser API and -ErrorAction Stop, so it applies reliably and fails loudly if it does not.
  • pkg/configurer/windows.go: use the atomic set "VAR=value" form for DOWNLOAD_URL/CHANNEL (addresses the Copilot review comment on cmd metacharacter / injection safety).

The Administrator-password fix is a strong candidate but unverified until this smoke-fips run completes. If it still 401s, the next diagnostic is the EC2 console / agent.log from the instance to confirm whether the password set errored.

Written by AI: claude-opus-4-8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

smoke-fips Run FIPS smoke test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants