LLM Evaluation Framework

Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments using git branches.

Setup

# Direnv
direnv allow

# Development shell
nix develop

Running Evaluations

Start evaluation:

./scripts/start-eval.sh <eval-name>

This creates a new orphan branch eval/<eval-name>, sets up the flake environment, and starts opencode.

Example:

./scripts/start-eval.sh opencode-glm47
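The branch-creation step described above can be sketched as follows. This is only an illustration under assumptions, not the contents of scripts/start-eval.sh: the function name create_eval_branch is hypothetical, and the flake setup and opencode launch are left out.

```shell
# Hypothetical sketch of the branch-creation step only; the real
# scripts/start-eval.sh may do this differently.
create_eval_branch() {
  local name="$1"
  local branch="eval/${name}"

  # Refuse to reuse an evaluation branch that already has commits.
  if git show-ref --verify --quiet "refs/heads/${branch}"; then
    echo "branch ${branch} already exists" >&2
    return 1
  fi

  # --orphan starts a branch with no parent commits, so the
  # evaluation's history is independent of main.
  git checkout --quiet --orphan "${branch}"
}
```

Note that git checkout --orphan keeps the current index and working tree, so files such as flake.nix would still be present to commit on the new branch.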

Run your evaluation:
- Set up prompts/tasks
- Let the LLM work through the task

Finish evaluation:

git checkout main

All commits are automatically preserved in the eval/<eval-name> branch.

Managing Evaluations

List all evaluations: git branch --list "eval/*"
View an evaluation: git checkout eval/<eval-name>
Compare evaluations: git diff eval/foo eval/bar
Delete an evaluation: git branch -D eval/<eval-name>
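
For a quick overview of all evaluations, the listing can be combined with a commit count per branch. The helper below is a sketch and not part of the repository; the name list_evals is an assumption.

```shell
# Hypothetical helper: print each eval/* branch with the number of
# commits reachable from it, separated by a tab.
list_evals() {
  git for-each-ref --format='%(refname:short)' refs/heads/eval/ |
    while read -r branch; do
      printf '%s\t%s\n' "${branch}" "$(git rev-list --count "${branch}")"
    done
}
```

Because each evaluation starts from an orphan branch, the count reflects only the commits made during that evaluation.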

Structure

eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md

Each evaluation lives as a separate branch in the repository with its own git history.
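
One way to confirm that two evaluations really have separate histories is to check that they share no common ancestor. This is a sketch; the function name are_independent is hypothetical.

```shell
# Hypothetical check: succeeds when the two branches have no common
# ancestor, as expected for two orphan evaluation branches.
are_independent() {
  local status=0
  git merge-base "$1" "$2" >/dev/null 2>&1 || status=$?
  # git merge-base exits with 1 when no common ancestor is found;
  # other non-zero codes indicate errors such as an unknown branch.
  [ "${status}" -eq 1 ]
}
```

Usage: are_independent eval/foo eval/bar && echo "separate histories"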