Skip to content

Add RDP E2E tests#230

Draft
bernie-g wants to merge 14 commits into
feat/pam-rdp-recordingfrom
feat/pam-rdp-tests
Draft

Add RDP E2E tests#230
bernie-g wants to merge 14 commits into
feat/pam-rdp-recordingfrom
feat/pam-rdp-tests

Conversation

@bernie-g
Copy link
Copy Markdown
Contributor

Summary

  • Add 6 RDP E2E subtests: connection, bad credentials, unreachable target, reconnect, concurrent connections, session duration
  • xrdp Docker container (Ubuntu + xrdp + openbox) as the test target
  • FreeRDP under xvfb for headless connection verification in CI
  • CI pam-test job updated: Rust bridge build, -tags rdp, FreeRDP + xvfb install
  • Raw HTTP helpers for Windows PAM resource/account creation (no client regen needed yet)
  • openapi-cfg.yaml updated with createWindowsPamResource and createWindowsPamAccount for future client regeneration

Test plan

  • CI pam-test job builds Rust bridge and CLI with RDP support
  • xrdp container starts and accepts connections
  • All 6 RDP subtests pass in CI
  • Existing SSH/Postgres/Redis tests still pass (no regressions)

x032205 and others added 12 commits May 8, 2026 04:04
# Conflicts:
#	packages/pam/pam-proxy.go
fix: add keepalive, idle timeout, and session cleanup to gateway
goreleaser v1.x's mode: append cannot find draft releases (GitHub's
"get release by tag" API excludes drafts), so the pre-created draft
was ignored and goreleaser created a second, separate release.

Fix: remove the create-release-draft job and let the goreleaser jobs
handle release creation themselves via mode: append. The first job to
finish publishing creates the release, the others append to it.

Also replaces the broken --skip=build (v2-only flag) with separate
dry-run/release goreleaser steps.
fix(release): remove --skip=build flag unsupported in goreleaser v1.26.2
* feat(pam): Oracle database support via proxied-auth gateway

Adds Oracle DB as the 8th PAM handler. Gateway accepts client
connections with a placeholder password, proxies pre-auth bytes
verbatim to upstream, intercepts at O5Logon to swap client-supplied
placeholder-keyed material for real-password-keyed material, and
byte-relays post-auth. Credential injection works end-to-end for
JDBC thin clients (sqlcl, SQL Developer, DBeaver) and go-ora;
user never sees real Oracle credentials.

Handler lives in packages/pam/handlers/oracle/. Ports crypto
primitives (PBKDF2+SHA512, AES-CBC session-key encryption, PKCS5
padding) and TTC codec (compressed ints, CLR byte strings, KVP
encoding) from MIT-licensed sijms/go-ora. See ATTRIBUTION.md and
ORACLE_PAM_NOTES.md for architecture notes and handoff.

Known dead code remains in this commit from the earlier full-
impersonation attempt (ano.go, nego.go, nego_templates.go, parts
of o5logon*.go, upstream.go, handshake_test.go). Kept intact to
preserve history of the approach; cleanup follows as a separate
commit.

* refactor(pam-oracle): remove impersonation-era dead code

The initial attempt used server-side Oracle impersonation (see prior
commit). That design worked through authentication but hit a
state-mismatch problem post-auth: upstream (via go-ora) and client
negotiated different TTC capabilities, so relayed queries were
rejected as protocol violations.

The replacement — proxied-auth — is already in proxy_auth.go and is
the flow wired to HandleConnection. This commit removes the dead
files and vestigial symbols that supported the old path:

Removed entirely:
  - ano.go, nego.go, nego_templates.go (pre-auth TNS/TTC negotiation
    handlers; pre-auth is now forwarded verbatim to upstream)
  - upstream.go (go-ora-based upstream dial + KVP extraction;
    replaced by dialUpstreamRaw in proxy_auth.go)
  - handshake_test.go (tested the impersonation path, orphaned)

Pruned:
  - proxy.go: handleConnectionLegacy (~200 LOC)
  - o5logon.go: O5LogonServerState, NewO5LogonServerState,
    VerifyClientPassword, deriveKey11g, md5Hash, parseIntVal
  - o5logon_server.go: AuthPhaseOne, ParseAuthPhaseOne,
    BuildAuthPhaseOneResponse, BuildAuthPhaseTwoResponse,
    BuildAuthPhaseTwoResponseFromUpstream, RunServerO5Logon,
    dumpBytes, readUint32
  - tns.go: AcceptPacket, AcceptFromConnect, ConnectPacket,
    ParseConnectPacket, MarkerPacketBytes (we forward raw packet
    bytes rather than parse/build CONNECT or ACCEPT)

Kept: crypto primitives, DATA packet codec, TTC reader/builder,
query logger, prependedConn, error helpers — all live in the
proxied-auth flow.

Drops github.com/sijms/go-ora/v2 from go.mod — no longer imported.
Adds .idea and .vscode to .gitignore. Updates ORACLE_PAM_NOTES.md
with a current-state header; historical sections below retained
for context.

Net: ~1,600 LOC removed. The handler directory goes from 12 files
to 8. Build, fmt, and the Oracle SQL test matrix (SELECT, INSERT,
DDL, PL/SQL, bind vars, NLS queries) still pass against sqlcl →
gateway → AWS RDS Oracle 19c.

* fix(pam-oracle): capture SQL in session recordings

The TTC query extractor was logging empty strings for OALL8 payloads
because tryExtractSQL's "skip 6 compressed ints then read a CLR"
heuristic didn't land on the SQL text — the OALL8 wire format has
variable-length headers that differ by client driver and bind
pattern. As a result, session recordings contained only session
headers (124 bytes) with no actual query content.

Replace structured parsing with a simple scan for the longest
printable ASCII run in the payload. In practice the SQL text is
always the longest such run. Verified with sqlcl: .enc file grows
from 124 bytes (empty) to ~880 bytes (with captured queries) for
a SELECT + COMMIT session.

This only affects content — the tap, packet demultiplexing, and
encrypted file I/O were all working correctly. Fix is localised to
tryExtractSQL.

* fix(pam-oracle): capture queries bundled behind piggyback OCLOSE

sqlcl (and other JDBC thin clients) frequently bundle a piggybacked OCLOSE
for the previous cursor with the next OALL8 query in a single TTC packet.
The previous parser checked byte 0 for 0x03 (function call) and bailed when
it saw 0x11 (piggyback marker), missing the OALL8 underneath.

Scan the payload for the function-call + OALL8 byte pair instead, so the
parser finds the query regardless of any preceding piggyback prefix. Same
treatment for COMMIT and ROLLBACK, which also get piggybacked.

* feat(pam-oracle): TLS (TCPS) support for upstream

Matches the other SQL handlers' pattern: the client speaks plain TCP to
our local listener, we do TLS to the upstream database and translate in
the middle. No change to the client-facing UX — the same JDBC URL that
works against a plain-TCP Oracle resource now also works against a
TCPS-enabled one.

Implementing this correctly required discovering Oracle TCPS's two-handshake
flow (by reading go-ora's network/session.go readPacket RESEND branch —
credit there):

1. Dial TCP, do an initial TLS handshake. Forward the client's CONNECT
   through this first TLS session.
2. When the upstream RESEND packet's byte-5 flag has 0x08 set, Oracle
   expects the client to abandon the first TLS session and run a FRESH
   TLS handshake on the bare TCP socket. Server drops its first-round
   TLS state in lockstep. We mirror this via upgradeToTLS on the same
   rawConn, then continue the Oracle handshake through the new session.
3. Mask byte 5 on packets going downstream so thin clients (JDBC thin,
   python-oracledb thin) don't see the 0x08 signal and try to cast their
   local TcpNTAdapter to TcpsNTAdapter — the cast would fail because the
   client-to-proxy socket is plain TCP.
4. Accept TLS 1.0 as the floor for the upstream dial: Oracle 19c's
   second-round handshake negotiates down to 1.0 in some configurations
   (AWS RDS's SSL option being one of them). The outer ALPN mTLS tunnel
   remains TLS 1.2+.

Also removes now-dead SSLRejectUnauthorized / SSLCertificate fields from
OracleProxyConfig — the shared TLSConfig built in pam-proxy.go carries
that information already.

Verified end-to-end against AWS RDS Oracle 19c (SSL option, port 2484)
with sqlcl: authentication, DDL, DML+COMMIT, PL/SQL, DBMS_OUTPUT, bind
variables, and session recording all work. Plain-TCP Oracle path is
unchanged.

* chore(pam-oracle): tighten TCPS TLS config comment

Pure comment cleanup on buildOracleTLSConfig — no behavior change.

* chore(pam-oracle): update stale comments + attribution

- ATTRIBUTION.md: drop reference to nego.go (deleted in the
  impersonation-era cleanup) and add the upstream TCPS two-handshake
  flow adaptation from go-ora's session.readPacket RESEND branch.
- o5logon_server.go: the file header still described the
  impersonation-era architecture where the gateway acted as an Oracle
  server and drove the O5Logon exchange. Rewrite it to describe the
  current proxied-auth role — packet-layer helpers used by the byte-
  level O5Logon translation in proxy_auth.go — and drop the reference
  to upstream.go, which was removed in 3ff9cff.

* chore(pam-oracle): fix go vet warnings in proxy_auth.go

Two warnings, both from the original feat commit (4b25ec2):

- Dead self-assignment state.ServerSessKey = state.ServerSessKey on
  the phase-2 request translation path. The inline comment "no-op;
  kept for clarity" already admitted it was pointless. Delete.

- Unreachable _ = prefix after a return statement inside
  replaceKvpValueKeepingSize. The prefix variable was left over from
  a refactor — the new code uses oldStart / oldEnd instead — and the
  _ = prefix trick to silence "declared but not used" landed on the
  wrong side of the return. Delete both the dead prefix declaration
  and the unreachable suppression. Also drops valStart (only used
  by the dead prefix computation).

No behavior change; go vet is now clean on the oracle handler.

* chore(pam-oracle): drop TNS_ADMIN dance, shorten placeholder password

Two UX simplifications to the Oracle PAM access command:

1. Stop creating a per-session TNS_ADMIN directory and printing an
   `export TNS_ADMIN=...` line in the connect instructions. The
   directory only existed to hold a sqlnet.ora that set
   DISABLE_OOB=TRUE — a defence against sqlcl's out-of-band Ctrl-C
   signalling that we never actually observed breaking. If a real
   interrupt problem surfaces we can revisit with a proper fix
   instead of a per-session file dance.

2. Change ProxyPasswordPlaceholder from "infisical-pam-proxy" to
   "password". The string value is cryptographically arbitrary —
   Oracle's O5Logon needs the client and gateway to agree on SOME
   string; any works. Shorter is easier to copy-paste. The
   accompanying "not a real credential" note in the CLI output stays.

* chore(pam-oracle): drop unrelated changes from branch

Three items that snuck in but aren't part of Oracle PAM:

1. Untrack ORACLE_PAM_NOTES.md. It's a development notes/research log
   that belongs with scratch work, not in the branch. The file stays
   on disk (uncommitted) for local reference. Its content is also
   stale since the placeholder password changed from
   "infisical-pam-proxy" to "password" in 3ee7951 without matching
   updates to the notes.

2. Revert the github.com/emirpasic/gods indirect bump from v1.12.0
   to v1.18.1 in go.mod/go.sum. This was residue from when go-ora/v2
   was temporarily added as a direct dep in the initial feat commit —
   go-ora needed the newer gods, and Go module MVS held the bumped
   version even after go-ora was removed in 3ff9cff. Running
   `go mod tidy` against current HEAD (with main's go.mod/go.sum as
   the baseline) produces no further changes, confirming nothing we
   ship actually needs v1.18.1.

.gitignore additions (.vscode, .idea) stay — reasonable hygiene that
doesn't hurt the PR and will prevent future editor-file noise.

* fix(pam-oracle): AI-review findings + username parity with other handlers

- Password plaintext no longer embedded in "password mismatch" error
  (was bubbling to gateway logs via zerolog's .Err chain).
- Long Oracle passwords (≥ 96 chars) now encode correctly: replaceKVPValue
  routes AUTH_PASSWORD through TTCBuilder.PutClr so values above the
  short-form threshold emit the 0xFE chunked form instead of a truncated
  single-byte length.
- Client-supplied username rewritten to InjectUsername in phase-1 and
  phase-2 auth requests. Matches the effect of how the postgres/mysql/mssql
  handlers overwrite the startup-packet user — the client's choice becomes
  inert; upstream always looks up the configured account's verifier.
- Dead / misleading code removed: local PacketTypeResendMarker constant
  that duplicated tns.go's PacketTypeResend, package-level min() shadowing
  the Go 1.21+ builtin, the if !use32Bit branch in extractDataPayload
  where both arms assigned the same value, and the now-unused
  encodeCompressedInt helper.

* chore(pam-oracle): remove dead code + tighten attribution

Static-analysis sweep (staticcheck + manual cross-file grep) across the
oracle handler package. All removals are symbols that were defined but
never referenced anywhere:

- tns.go: markerTypeReset, markerTypeInterrupt constants.
- o5logon_server.go: TTCMsgAuthResponse, TTCMsgBreak,
  LogonModeUserAndPass, LogonModeNoNewPass constants; AuthPhaseTwo
  fields ESpeedyKey / AlterSession / ClientInfo / LogonMode (parser
  wrote them, no reader downstream); ParseAuthPhaseTwo's KVP switch
  trimmed to the two keys actually consumed (AUTH_SESSKEY,
  AUTH_PASSWORD).
- ttc.go: TTCReader.GetNullTermString(), TTCReader.SetUseBigClrChunks()
  — both uncalled.

Also added a proper attribution header to o5logon_server.go (it was
adapting go-ora's phase-2 layout + summary-object format but lacked the
same kind of header the other ported files have), and expanded
ATTRIBUTION.md to cover it plus the specific primitives borrowed by
o5logon.go.

No behavior change. go build / go vet / staticcheck clean on the package.

* chore(pam-oracle): remove dead verifier-type and error-code constants

Exhaustive dead-export sweep via `go doc -all` (a more reliable check
than my earlier regex-based grep, which missed untyped constants).
Removed:

- VerifierType10g / VerifierType11g / VerifierType12c — defined but no
  callers; the only verifier type our code implements (18453, 12c+
  PBKDF2+SHA512) is hardcoded, the three named constants were never
  referenced. The 18453-specific comments in the code retain the
  documentation.
- ORA12660EncryptionRequired — defined but no callers. Only
  ORA1017InvalidCredentials is actually emitted.

Post-sweep: every exported symbol in the oracle handler package has a
call site. `go build` / `go vet` / `staticcheck` clean.

* chore(pam-oracle): remove dead enum-block const members

staticcheck's U1000 has an exemption for const blocks: if any member is
used, sibling members aren't flagged as unused even when they are. An
exhaustive sweep that enumerates every top-level identifier via
manual grep (so const-block membership is irrelevant) caught the ones
staticcheck skipped.

Removed:
- tns.go: PacketTypeAbort, PacketTypeAck, PacketTypeAttn, PacketTypeCtrl,
  PacketTypeNull. Only the 7 PacketType values we actually dispatch on
  remain.
- query_logger.go: ttcFuncOFETCH, ttcFuncOCLOSE, ttcFuncOSTMT,
  ttcFuncOLOGOFF, ttcMsgPiggyback. Only the 4 TTC opcodes the query tap
  actually looks for remain (OALL8, OCOMMIT, ORLLBK, the outer msgFunction).

Post-sweep: exhaustive enumeration finds zero unused symbols across the
package (189 candidates checked). go build / go vet / staticcheck clean.

* fix(pam-oracle): show Infisical account name (not real DB user) in the banner

The banner was printing the real upstream DB username
(pamResponse.Metadata["username"]) in the connection URL, even though
the preceding "Account:" label already shows the Infisical account name.
Since the gateway now rewrites the client-supplied username in the
O5Logon exchange to the configured real user, the client can (and
should) connect using the account name — and the banner makes that
explicit.

Before:
  Resource: aws-oracledb
  Account:  admin2
  oracle://admin:password@localhost:53521/DATABASE   ← confusing

After:
  Resource: aws-oracledb
  Account:  admin2
  oracle://admin2:password@localhost:53521/DATABASE  ← matches the label

Scope: Oracle only for now. The postgres/mysql/mssql handlers also
overwrite the client username on the wire, so the same banner change
would work there, but that needs a separate verification pass per
dialect before we extend it.

* fix(pam-oracle): inject SERVICE_NAME from config into CONNECT packet

The client's CONNECT description string was forwarded unchanged, requiring
users to know the real Oracle service name. Now we rewrite SERVICE_NAME
to match InjectDatabase from the vault config, consistent with how
username and password are already injected.

* fix(pam-oracle): remove placeholder password verification in phase-2

The gateway no longer checks whether the client sent the placeholder
password "password". It unconditionally encrypts the real password
from the vault, regardless of what the client typed. Auth will
succeed either way since we inject the real credentials.

The placeholder password is still shown to the user in the CLI banner
and still used for key derivation in phase-1 — only the verification
check is removed.

* fix(pam-oracle): forward phase-2 response directly, remove SVR_RESPONSE regen

AUTH_SVR_RESPONSE is encrypted with encKey, which is derived from
session keys + CSK salt — not the password. The client and Oracle
derive the same encKey, so Oracle's original proof is already valid
for the client. No need to regenerate it.

Removes translatePhase2Response, BuildSvrResponse, and the
placeholderEncKey field from ProxyAuthState.

* fix(pam-oracle): replace 3s timeout peek with deterministic supplement drain

The post-ACCEPT supplement peek used a 3-second read deadline to detect
whether a go-ora client sent connect-data as a separate packet. This
added a fixed 3s delay for every non-go-ora client (sqlcl, JDBC thin).

Instead, check the CONNECT packet structure: if connect-data-length +
connect-data-offset exceeds the packet size, the data wasn't inline and
a supplement will follow. Track whether the RESEND handler already
consumed it; if not, do a blocking read (the supplement is guaranteed
to be in the TCP buffer since the client sent it before waiting for
a response).

Removes prependedConn, detectConnectDataSupplement, and the timeout.

* chore(pam-oracle): strip comments, enforce InjectDatabase, fix ora() formatting

- Remove verbose comments across all files (~20% → ~2% comment rate)
- Remove per-file go-ora attribution (ATTRIBUTION.md carries the license)
- Trim ATTRIBUTION.md to just the copyright notice + MIT text
- Make InjectDatabase mandatory (error if empty, always overwrite client's SERVICE_NAME)
- Format unknown Oracle error codes as ORA-XXXXX instead of bare "ERROR"
- Clarify ProxyPasswordPlaceholder as a decoy

* chore(pam-oracle): remove redundant banner comment and auth note

* fix(pam-oracle): forward connect-data supplement before handshake loop

JDBC thin sends the CONNECT description as a separate follow-up packet
when it doesn't fit inline (74-byte header-only CONNECT). Previously we
drained this supplement only after ACCEPT, relying on the server to
RESEND. AWS RDS sends RESEND, but OCI Autonomous DB does not — it waits
for the connect-data and eventually drops the connection.

Fix: check the CONNECT packet for non-inline data (connect-data-offset +
connect-data-length > packet size) and forward the supplement immediately.
Also forward it again after a RESEND+TLS-restart, since the client
re-sends CONNECT + supplement over the new TLS session.

* fix(pam-oracle): restore placeholder password check for clearer errors

Wrong password fails cryptographically anyway (ORA-17452 from garbage
decryption), but the error is confusing. Restore the explicit check
so the client gets a clean ORA-01017 instead.

* fix(pam-oracle): show real DB username in banner, matching other handlers

* chore(pam-oracle): update ProxyPasswordPlaceholder comment

* chore(pam-oracle): simplify ProxyPasswordPlaceholder comment

* chore(pam-oracle): rename ResourceTypeOracle to ResourceTypeOracledb

* fix(pam-oracle): cap handshake loop, zero keys after auth, remove dead comment

- Add maxHandshakeAttempts (10) to prevent infinite RESEND loops
- Zero RealKey, PlaceholderKey, ServerSessKey after auth completes
- Remove leftover debug comment from proxy_auth.go

* chore(pam-oracle): simplify key cleanup to state = nil

* fix(pam-oracle): show Easy Connect and JDBC connection strings in banner

---------

Co-authored-by: saif <11242541+saifsmailbox98@users.noreply.github.com>
* debug(pam): add diagnostic logging for SSH resource creation flake

Add detailed logging when CreateSshPamResource returns non-200 to
diagnose the intermittent 400 error seen only in CI. On failure, dumps:
- Response body (the actual backend error message)
- Request params (gatewayId, projectId, host, port)
- Relay/gateway process status and stderr/stdout
- Gateway API state (heartbeat, name)
- Backend container logs (last 3000 chars)

Also temporarily scopes CI to only PAM SSH tests in a loop (up to 10
runs, stops on first failure) to maximize chance of catching the flake.

* fix(pam): use localhost for SSH resource host to avoid iptables race

The SSH e2e tests were using getOutboundIP() (e.g. 10.1.0.34) as the
resource host. The gateway dials this address to validate the connection
during resource creation. On Linux CI with native Docker, this path goes
through iptables DNAT rules which can have a brief propagation delay
after container start — causing intermittent "connection refused" even
though testcontainers confirmed the port was listening (via localhost).

CI error: "Unable to validate connection to ssh: Connection lost before
handshake" with gateway log: "dial tcp 10.1.0.34:32774: connect:
connection refused"

Fix: use localhost instead. The gateway runs on the host, so
localhost:port reaches docker-proxy directly on the loopback interface
without needing iptables DNAT.

Diagnostic logging kept in place to catch any other failure modes.

* debug(ci): add sleep between loop iterations for port release

The relay binds to hardcoded ports (8443, 2222). When running multiple
test iterations in a loop, the previous relay's ports may still be in
TIME_WAIT when the next iteration starts, causing "Relay is reachable"
to never appear. Add 15s sleep between iterations.

* experiment: revert to getOutboundIP to confirm flake reproduces in CI

Reverting the localhost fix to verify that getOutboundIP triggers the
iptables race ("connection refused") in CI. If this fails, it confirms
the root cause. Will re-apply the fix afterward.

* fix(pam): re-apply localhost fix after confirming getOutboundIP flake

Experiment confirmed:
- getOutboundIP (10.1.0.168): FAILED run 1 — "dial tcp 10.1.0.168:32773:
  connect: connection refused" (iptables DNAT race)
- localhost: 20/20 passed across two CI runs

Re-applying the localhost fix.

* experiment: second run with getOutboundIP to confirm reproducibility

* fix(pam): re-apply localhost fix — confirmed root cause with A/B test

A/B experiment in CI confirms getOutboundIP causes the flake:

  getOutboundIP (external IP):
    - Experiment 1: FAILED run 1 (password, 10.1.0.168:32773)
    - Experiment 2: FAILED run 4 (certificate, 10.1.0.114:32799)
    - Error: "dial tcp <external-ip>:<port>: connect: connection refused"

  localhost:
    - 3 CI runs: PASSED 25/25 consecutively

Root cause: on Linux CI with native Docker, port mappings via external
IP go through iptables DNAT rules which have a brief propagation delay
after container start. localhost bypasses this via docker-proxy on
loopback.

* fix(pam): finalize — restore CI workflow, keep diagnostics

- Restore all CI jobs (CLI, Agent, PAM) to normal single-run mode
- Run full PAM test suite (not just SSH)
- Keep diagnostic dump on createSSHPamResource failure
- Log resourceHost value so any future failure shows which path was used

* test: validation loop with log correlation check

Runs SSH tests 5 times with localhost fix, then greps all logs for
"connection refused" to verify the error is completely absent.

With getOutboundIP: "connection refused" appeared in gateway logs
right before every failure.
With localhost: should see zero occurrences across all 5 runs.

* fix(pam): use localhost for SSH test resource host + add failure diagnostics

Root cause: SSH e2e tests used getOutboundIP() (e.g. 10.1.0.168) as
the resource host. During backend's validateConnection(), the gateway
dials this address to reach the SSH container. On Linux CI with native
Docker, this path goes through iptables DNAT rules which have a brief
propagation delay after container start. testcontainers confirms the
port via localhost (docker-proxy on loopback), but the iptables path
isn't ready yet — causing "connection refused" at the gateway.

Fix: use "localhost" instead. The gateway runs on the Docker host, so
localhost reaches the port mapping directly via docker-proxy without
needing iptables DNAT.

Verified with A/B experiment in CI:
  getOutboundIP: FAILED 2/2 runs ("connection refused" in gateway logs)
  localhost:     PASSED 30/30 runs (zero "connection refused" in logs)

Also adds diagnostic dump on createSSHPamResource failure (response
body, relay/gateway logs, backend container logs, gateway API state)
for visibility into any future failure modes.

* fix(pam): get SSH resource host from container.Host() like Postgres/Redis

Use container.Host(ctx) instead of getOutboundIP() or hardcoded
"localhost". This is the same pattern Postgres and Redis tests use,
and returns the correct Docker daemon host (localhost on Linux,
which routes through docker-proxy without iptables DNAT races).

* fix(pam): use container.Host() for SSH resource host, remove debug code

Get the SSH resource host from container.Host(ctx) — same as Postgres
and Redis tests — instead of getOutboundIP(). On Linux CI, Host()
returns localhost which routes through docker-proxy on loopback,
avoiding the iptables DNAT propagation race that caused intermittent
"connection refused" when using the host's external IP.

Removes all diagnostic logging added during investigation.

* chore: restore original workflow file (no changes needed)

* debug: log response body on createSSHPamResource failure

* debug: SSH-only loop to catch second flake mode with response body logging

* ci: retrigger loop to catch rare second flake

* debug: add gateway/relay stderr dump on failure to catch second flake

* ci: retrigger to catch rare flake (attempt 3)

* fix(pam): wait for sshd "Server listening" log, not just port

The SSH container's port 22 becomes listenable before sshd is fully
ready to accept connections. testcontainers' ForListeningPort confirms
the port is open, but sshd may still be generating host keys or loading
config. Connections during this window get "connection reset by peer"
(0 bytes from service), which the backend sees as "Connection lost
before handshake".

Fix: use ForAll(ForListeningPort, ForLog("Server listening")) to wait
for both port availability AND sshd's startup confirmation log.

* ci: retrigger for confidence

* fix(pam): use SSH handshake to verify sshd readiness after SIGHUP

After configureCertAuth sends SIGHUP to reload sshd config, the test
checked readiness with a TCP dial. But sshd never closes its listen
socket during SIGHUP, so the port is always reachable — the dial passes
instantly even while sshd is still processing the config reload.
Connections during this window get accepted then dropped, causing
"Connection to 127.0.0.1 closed by remote host".

Fix: do an actual SSH handshake (ssh.Dial with password auth) instead
of a TCP dial. This confirms sshd is fully serving connections, not
just listening.

* ci: retrigger confidence run

* fix(pam): fix three SSH e2e test flakes

1. Use container.Host(ctx) for resource host instead of getOutboundIP().
   On Linux CI, external IPs route through iptables DNAT which has a
   propagation delay after container start.

2. Wait for sshd "Server listening" log, not just port open.
   ForListeningPort passes before sshd finishes key generation,
   causing connection resets.

3. Use SSH handshake to verify sshd readiness after SIGHUP config
   reload. sshd never closes its listen socket during SIGHUP, so TCP
   dial passes instantly while sshd is still reloading.

---------

Co-authored-by: saif <11242541+saifsmailbox98@users.noreply.github.com>
6 subtests: connection, bad credentials, unreachable target, reconnect,
concurrent connections, and session duration. Uses an xrdp container as
the target and FreeRDP under xvfb for headless verification. CI pam-test
job updated to build the Rust bridge and CLI with -tags rdp.
@infisical-review-police
Copy link
Copy Markdown

💬 Discussion in Slack: #pr-review-cli-230-add-rdp-e2e-tests

Posted by Review Police — reviews, comments, new commits, and CI failures will stream into this channel.

bernie-g added 2 commits May 11, 2026 15:56
The backend requires a pam_project_recording_configs row to exist
before allowing Windows resource creation. Insert a dummy row directly
into the database, bypassing FK checks since we don't need a real
app connection for the E2E tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants