LLM Evaluation Framework

Evaluate LLMs and agentic coding tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.

Setup

# Direnv
direnv allow

# Development shell
nix develop

Running Evaluations

  1. Start evaluation:
./scripts/start-eval.sh <eval-name>

This creates a new orphan branch eval/<eval-name>, sets up the flake environment, and starts opencode. For example:
./scripts/start-eval.sh opencode-glm47

  2. Run your evaluation:

    • Set up prompts/tasks
    • Let the LLM work through the task
  3. Finish evaluation:

git checkout main

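Commits made during the evaluation remain on the eval/<eval-name> branch after you switch back to main. Uncommitted changes are not saved on that branch, so commit anything you want to keep first.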
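
To make step 1 more concrete, here is a minimal sketch of how the branch-creation part could work. It is not the contents of scripts/start-eval.sh; the function name create_eval_branch is hypothetical, and the flake setup and opencode launch are left out.

```shell
# Hypothetical sketch of the branch-creation step (not the real start-eval.sh).
create_eval_branch() {
  eval_branch="eval/$1"

  # Refuse to reuse the name of an existing evaluation.
  if git show-ref --verify --quiet "refs/heads/${eval_branch}"; then
    echo "branch ${eval_branch} already exists" >&2
    return 1
  fi

  # An orphan branch has no parent commit, so its history starts empty.
  # The working tree and index are left as they were on the previous branch.
  git checkout --orphan "${eval_branch}"
}
```

Because git checkout --orphan keeps the working tree, files such as flake.nix and .envrc are still present on the new branch and can be included in its first commit.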

Managing Evaluations

  • List all evaluations: git branch --list "eval/*"
  • View an evaluation: git checkout eval/<eval-name>
  • Compare evaluations: git diff eval/foo eval/bar
  • Delete an evaluation: git branch -D eval/<eval-name>
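
For a quick overview of all evaluations, the listing command can be extended to show how many commits each branch holds. This is an illustrative sketch only; list_evals is a hypothetical helper and is not part of the repository.

```shell
# Hypothetical helper: print each eval/* branch with its commit count,
# separated by a tab.
list_evals() {
  git for-each-ref --format='%(refname:short)' 'refs/heads/eval/' |
    while read -r branch; do
      printf '%s\t%s\n' "$branch" "$(git rev-list --count "$branch")"
    done
}
```

It uses git for-each-ref instead of parsing git branch output, so the result does not depend on the marker and indentation that git branch prints in front of each name.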

Structure

eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md

Each evaluation lives as a separate branch in the repository with its own git history.
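
One way to check that an evaluation really has its own history is to ask git whether two branches share a common ancestor. A small sketch, assuming a hypothetical helper named shares_history:

```shell
# Hypothetical helper: succeeds if the two branches have a common ancestor,
# fails if their histories are unrelated (as with orphan branches).
shares_history() {
  git merge-base "$1" "$2" >/dev/null 2>&1
}
```

For example, shares_history main eval/foo is expected to fail when eval/foo was created as an orphan branch. Comparing two evaluations with git diff eval/foo eval/bar still works, because a two-branch diff compares their trees and does not need a shared ancestor.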