LLM Evaluation Framework
Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.
Setup
# Direnv
direnv allow
# Development shell
nix develop
Running Evaluations
- Start evaluation:
./scripts/start-eval.sh <eval-name>
This creates a new orphan branch eval/<eval-name>, sets up the flake environment, and starts opencode.
Example: ./scripts/start-eval.sh opencode-glm47
- Run your evaluation:
  - Set up prompts/tasks
  - Let the LLM work through the task
- Finish evaluation:
git checkout main
All commits are automatically preserved in the eval/<eval-name> branch.
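The branch-creation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/start-eval.sh: the function name `start_eval_branch` and the commit message are assumptions, and the flake setup and the opencode launch are left out.

```shell
# Hypothetical sketch of the orphan-branch step; the real start-eval.sh may differ.

# start_eval_branch NAME: create and switch to an orphan branch eval/NAME.
start_eval_branch() {
  local name="$1"
  local branch="eval/${name}"

  if [ -z "$name" ]; then
    echo "usage: start_eval_branch <eval-name>" >&2
    return 2
  fi

  # Refuse to reuse the name of an existing evaluation.
  if git show-ref --verify --quiet "refs/heads/${branch}"; then
    echo "branch ${branch} already exists" >&2
    return 1
  fi

  # An orphan branch has no parent commit, so its history is independent
  # of main and of every other evaluation.
  git checkout --orphan "$branch"

  # Assumption: record a root commit immediately. The index still holds the
  # files of the branch we came from, so they are snapshotted here;
  # --allow-empty only matters when nothing is staged.
  git commit --allow-empty --quiet -m "Start evaluation ${name}"
}
```

Because the commits an agent makes land on eval/<eval-name>, switching back with git checkout main leaves them untouched on that branch.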
Managing Evaluations
- List all evaluations:
git branch --list "eval/*"
- View an evaluation:
git checkout eval/<eval-name>
- Compare evaluations:
git diff eval/foo eval/bar
- Delete an evaluation:
git branch -D eval/<eval-name>
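For a quick overview of how much work each evaluation produced, a small helper along these lines may be useful. It is only a sketch: `list_evals` is a hypothetical name and is not part of the repository's scripts.

```shell
# Hypothetical helper; not part of the repository's scripts.

# list_evals: print every eval/* branch followed by a tab and its commit count.
list_evals() {
  git for-each-ref --format='%(refname:short)' refs/heads/eval/ |
    while read -r branch; do
      printf '%s\t%s\n' "$branch" "$(git rev-list --count "$branch")"
    done
}
```

Calling list_evals from the repository root prints one line per evaluation, sorted by branch name.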
Structure
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│ └── start-eval.sh
└── README.md
Each evaluation lives as a separate branch in the repository with its own git history.
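One way to confirm that two evaluations really have independent histories is to ask git for a common ancestor. The sketch below assumes a hypothetical helper name, `share_history`; it relies on git merge-base exiting with a non-zero status when no common ancestor exists.

```shell
# Hypothetical helper; not part of the repository's scripts.

# share_history A B: succeed only if branches A and B have a common ancestor.
share_history() {
  git merge-base "$1" "$2" >/dev/null 2>&1
}
```

For example, share_history eval/foo eval/bar || echo "independent histories" reports when two evaluations were started as separate orphan branches. git diff still works between such branches because it compares their file trees rather than their histories.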