
Universal tool and model logging to leaderboard and Prisma #3

Open
RayirthDinesh wants to merge 15 commits into main from MLE-Bench_Logger

Conversation

Collaborator

@RayirthDinesh RayirthDinesh commented May 15, 2026

Logs the AI coding tool and model used for every submission, automatically and with no human input, so the tool/model leaderboards and Prisma stay accurate.

How it works

  • Detect tool + model per submit via layered fallback: env signals, the tool's own session transcript on disk (Claude Code, Codex), process-tree walk, and a universal .gym_attribution.json self-report any agent can write.
  • Reject any submission where tool or model can't be resolved, printing a copy-paste block the user hands to their AI IDE to self-report and resubmit. No blank rows reach the leaderboard.
  • Persist the resolved values to the backend (Prisma Submission / UserProgress).
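The layered fallback and the rejection step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function names, the `AttributionError` class, and the `tool` / `model` keys inside `.gym_attribution.json` are all assumptions made for the example. Only the file name and the `tool` / `ai_model` field names come from this PR.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

Attribution = Tuple[Optional[str], Optional[str]]  # (tool, model)


class AttributionError(Exception):
    """Raised when tool or model cannot be resolved (hypothetical name)."""


def from_self_report(workdir: Path) -> Attribution:
    """Read the agent-written self-report; the key names are assumed."""
    path = workdir / ".gym_attribution.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or malformed file: this layer has nothing to contribute.
        return (None, None)
    if not isinstance(data, dict):
        return (None, None)
    return (data.get("tool"), data.get("model"))


def resolve(detectors: Iterable[Callable[[], Attribution]]) -> Attribution:
    """Try each layer in order, keeping the first non-empty tool and model."""
    tool: Optional[str] = None
    model: Optional[str] = None
    for detect in detectors:
        found_tool, found_model = detect()
        tool = tool or found_tool
        model = model or found_model
        if tool and model:
            break
    return tool, model


def require_attribution(tool: Optional[str], model: Optional[str]) -> dict:
    """Reject the submission unless both values are present."""
    missing = [name for name, value in (("tool", tool), ("model", model)) if not value]
    if missing:
        raise AttributionError(
            "Could not resolve: " + ", ".join(missing)
            + ". Ask your AI agent to write .gym_attribution.json and resubmit."
        )
    return {"tool": tool, "ai_model": model}
```

A caller would pass the layers in priority order, for example `resolve([env_layer, transcript_layer, lambda: from_self_report(Path.cwd())])`, where `env_layer` and `transcript_layer` are placeholders for the other detection layers, and then hand the result to `require_attribution` before contacting the backend.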

Companion PRs

  • AICodingGym (site): leaderboard-tools-models — backend persistence + leaderboard aggregation.
  • gym-environment: test — agent instructions to write .gym_attribution.json.

🤖 Generated with Claude Code

qyli00 and others added 14 commits March 15, 2026 23:22
feat: add cr fetch/submit commands for code review challenges
- Replace shell=True subprocess calls with list-based args to avoid
  shell quoting issues across platforms
- Add _restrict_key_permissions() using icacls on Windows, chmod on Unix
- Restrict config directory permissions on Windows via icacls
- Pre-check ssh-keygen availability with Windows-specific error message
- Quote SSH key paths with forward slashes for Windows compatibility
- Use /dev/null (not os.devnull) in GIT_SSH_COMMAND for MSYS2 ssh compat
- Use platform-appropriate file viewer hint (type on Windows, cat on Unix)
- Bump version to 0.3.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
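For readers unfamiliar with the icacls/chmod split mentioned in this commit, a platform-aware permission helper might look like the sketch below. It is not the body of `_restrict_key_permissions()` from the commit. The function name here, the use of the `USERNAME` environment variable, and the exact `icacls` arguments are assumptions chosen for illustration.

```python
import os
import stat
import subprocess
import sys
from pathlib import Path


def restrict_key_permissions(key_path: Path) -> None:
    """Make a private key readable and writable by the current user only."""
    if sys.platform == "win32":
        user = os.environ.get("USERNAME", "")
        if not user:
            raise RuntimeError("USERNAME is not set; cannot restrict key permissions")
        # List-based args (no shell=True): drop inherited ACLs, then grant
        # full control to the current user only.
        subprocess.run(
            ["icacls", str(key_path), "/inheritance:r", "/grant:r", f"{user}:F"],
            check=True,
            capture_output=True,
        )
    else:
        # Equivalent of `chmod 600`.
        os.chmod(key_path, stat.S_IRUSR | stat.S_IWUSR)
```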
On Windows, System32's OpenSSH (ssh.exe) is typically found before Git
for Windows' bundled MSYS2 ssh on PATH. The native OpenSSH can trigger
GUI credential dialogs or deadlock when stdout is captured, causing
swe fetch to hang indefinitely during git clone.

- Add _find_git_ssh() to locate Git for Windows' bundled ssh.exe
- Use explicit ssh path in GIT_SSH_COMMAND to bypass System32 OpenSSH
- Add BatchMode=yes to prevent interactive prompts on all platforms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
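The approach in this commit (locate the ssh bundled with Git for Windows, then name it explicitly in `GIT_SSH_COMMAND` together with `BatchMode=yes`) could be sketched as below. The helper names and the parent-directory search are assumptions and may differ from the commit's `_find_git_ssh()`. The `usr/bin/ssh.exe` location reflects the usual Git for Windows layout rather than anything stated in this PR.

```python
import shutil
from pathlib import Path, PurePath
from typing import Optional


def find_git_ssh(git_exe: Optional[str] = None) -> Optional[Path]:
    """Look for usr/bin/ssh.exe under any parent directory of git.exe."""
    git_exe = git_exe or shutil.which("git")
    if not git_exe:
        return None
    for parent in Path(git_exe).resolve().parents:
        candidate = parent / "usr" / "bin" / "ssh.exe"
        if candidate.is_file():
            return candidate
    return None


def build_git_ssh_command(key_path: PurePath, ssh_path: Optional[PurePath] = None) -> str:
    """Build a GIT_SSH_COMMAND value with quoted, forward-slash paths."""
    ssh = ssh_path.as_posix() if ssh_path else "ssh"
    key = key_path.as_posix()
    # BatchMode=yes makes ssh fail instead of prompting interactively.
    return f'"{ssh}" -i "{key}" -o BatchMode=yes'
```

The result would typically be placed in the environment passed to `git clone`, for example `env["GIT_SSH_COMMAND"] = build_git_ssh_command(key, find_git_ssh())`. When no bundled ssh is found, the sketch falls back to whichever `ssh` is first on PATH.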
Broadens git add exclusions to ignore all dot-prefixed files/directories
(.*) and all markdown files (*.md) so scaffold metadata like .devcontainer,
.swebench, .gitignore, problem_description.md, and hints_text.md are never
included in submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Downloads AI agent config files (.claudeignore, .cursorrules, CLAUDE.md,
AGENTS.md, etc.) from AICodingGym/gym-environment into problem directories
during swe fetch, cr fetch, and mle download. Also installs to the
workspace root during configure. Downloaded files are added to .gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bump version to 0.5.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cli_env.py: allowlist-only tool/model detection (no full env dump); reads ANTHROPIC_MODEL, CLAUDE_CODE_MODEL, OPENAI_MODEL, AIDER_MODEL, GEMINI_MODEL, CURSOR_MODEL
- api.py: extend submit_notification / cr_submit_review / mlebench_submit_csv with tool/tool_version/ai_model; add notify_mle_progress to forward percentile + attribution to Prisma UserProgress
- cli.py: --tool / --tool-version / --ai-model flags on swe submit, cr submit, mle submit; MLE reads solution_log.json for accurate model record per CLAUDE.md
- bump version 0.5.1 -> 0.6.0
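As an illustration of allowlist-only detection, the sketch below reads just the model variables named in this commit and never enumerates the rest of the environment. The function name, the return shape, and the priority order of the variables are assumptions, not the contents of `cli_env.py`.

```python
from typing import Mapping, Optional, Tuple

# Variable names are taken from the commit message; the order is an assumed priority.
MODEL_ENV_ALLOWLIST = (
    "ANTHROPIC_MODEL",
    "CLAUDE_CODE_MODEL",
    "OPENAI_MODEL",
    "AIDER_MODEL",
    "GEMINI_MODEL",
    "CURSOR_MODEL",
)


def detect_model_from_env(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (variable name, model) for the first allowlisted variable that is set."""
    for name in MODEL_ENV_ALLOWLIST:
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None
```

Calling `detect_model_from_env(os.environ)` touches only the six listed keys, so unrelated secrets in the environment are never read or forwarded.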
RayirthDinesh changed the title from "MLE Bench Logger" to "feat: detect + send tool/model on submit (v0.6.0)" on May 25, 2026
Resolve the coding tool + AI model for every swe/mle/cr submit via
layered detection (env signals, tool session transcripts, process tree,
and an agent-written .gym_attribution.json self-report), then reject
submissions that lack attribution with a copy-paste fix. Values flow to
the backend for Prisma persistence and the tool/model leaderboards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RayirthDinesh changed the title from "feat: detect + send tool/model on submit (v0.6.0)" to "Universal tool and model logging to leaderboard and Prisma" on Jun 1, 2026


3 participants