agent-evals/RESULTS.md
2026-02-06 17:20:02 -05:00

Agentic Tools

Grading

Purely opinion-based, but informed by:

  • One-shot performance
  • Follow-up fixes
  • UI design

Rankings

Overall

  1. eval/pi-kimi-k2.5
  2. eval/pi-qwen3-coder-next-80b
  3. eval/pi-glm4.7
  4. eval/pi-glm4.7-flash

Parameters: < 100B (Local)

  1. eval/pi-qwen3-coder-next-80b
    • UI: 8/10
    • Fixes:
      • New File
      • Markdown Styling
  2. eval/pi-glm4.7-flash
    • UI: ?/10
    • Fixes:
      • Frontend wouldn't run on the first try
      • Had to downgrade several dependencies to get it running
      • Used legacy packages
  3. eval/pi-devstral-small-2
    • UI: 2/10
    • Fixes:
      • ?

Parameters: > 100B (Hosted)

  1. eval/pi-kimi-k2.5
    • UI: 9/10
    • Fixes:
      • Files List
      • Editor Scrolling
      • Duplicate .md Name
  2. eval/pi-glm4.7
    • UI: 7/10
    • Fixes:
      • Files List
      • Markdown Styling
      • Delete Failure