[bugfix] continuous action precision mismatch #584

Merged
jsuarez5341 merged 1 commit into PufferAI:4.0 from yacineMTB:fix-continuous-action-logprob-4.0
Jun 15, 2026
yacineMTB (Contributor) commented Jun 13, 2026

I think the best test that shows this improvement is an independent sweep before the fix, and then after the fix.

[Three attached images, not reproduced here]

Basically, if precision_t is bf16/fp16, the sampled fp32 action gets rounded, and that messes with things downstream in PPO. I'm not sure exactly how PPO uses the ratio, but before this fix the logratio would be nonzero even if you did not change the policy. As I understand it, if the policy did not change, the logratio should be zero. With a discrete action space it does stay zero, but with a continuous action space it does not.

Discrete doesn't have this problem because it uses integers.
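
To make the mismatch concrete, here is a minimal sketch in plain Python, not the code in src/pufferlib.cu. It assumes a 1-D Gaussian policy, emulates fp16 storage with the `struct` module's half-precision format, and uses Python doubles in place of fp32. The helper names (`round_to_fp16`, `gaussian_logprob`, `logratio_after_storage`) are hypothetical. Rounding the action before taking its log-probability is just one way to make the two sides agree and is not necessarily what this PR does.

```python
import math
import random
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def gaussian_logprob(action: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian evaluated at `action`."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def logratio_after_storage(mean, std, rng, round_before_logprob):
    """Return new_logprob - old_logprob for an UNCHANGED policy.

    old_logprob is computed at sampling time; new_logprob is recomputed
    later from the action as it was stored in half precision.
    """
    action = rng.gauss(mean, std)
    if round_before_logprob:
        # Assumed remedy: evaluate the logprob on the value that gets stored.
        action = round_to_fp16(action)
    old_logprob = gaussian_logprob(action, mean, std)

    stored_action = round_to_fp16(action)  # what the rollout buffer keeps
    new_logprob = gaussian_logprob(stored_action, mean, std)
    return new_logprob - old_logprob


rng = random.Random(0)
mismatched = [logratio_after_storage(0.3, 0.5, rng, False) for _ in range(1000)]
rng = random.Random(0)
consistent = [logratio_after_storage(0.3, 0.5, rng, True) for _ in range(1000)]

max_mismatch = max(abs(x) for x in mismatched)
max_consistent = max(abs(x) for x in consistent)
print("max |logratio|, logprob on unrounded action:", max_mismatch)
print("max |logratio|, logprob on rounded action:  ", max_consistent)
```

In the first case the logratio is nonzero even though the mean and standard deviation never change. In the second case it is exactly zero, because rounding an already-rounded value is a no-op. Small integer action indices are represented exactly, which is consistent with the discrete case not showing the problem.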

Comment thread src/pufferlib.cu Outdated
@yacineMTB yacineMTB force-pushed the fix-continuous-action-logprob-4.0 branch from 4dd4990 to af45050 Compare June 13, 2026 03:37
@yacineMTB yacineMTB force-pushed the fix-continuous-action-logprob-4.0 branch from af45050 to 21c987b Compare June 15, 2026 16:31
@yacineMTB yacineMTB changed the title from "[bug] [draft] continuous action logprob precision mismatch" to "[draft] continuous action precision mismatch" Jun 15, 2026
@yacineMTB yacineMTB changed the title from "[draft] continuous action precision mismatch" to "[bugfix] continuous action precision mismatch" Jun 15, 2026
@yacineMTB yacineMTB marked this pull request as ready for review June 15, 2026 16:37
@jsuarez5341 jsuarez5341 merged commit e90b58e into PufferAI:4.0 Jun 15, 2026