# LLM Evaluation Framework

Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments using git branches.

## Setup

```bash
# Direnv
direnv allow

# Development shell
nix develop
```

## Running Evaluations

1. **Start evaluation:**

   ```bash
   ./scripts/start-eval.sh
   ```

   This creates a new orphan branch `eval/`, sets up the flake environment, and starts opencode.

   Example: `./scripts/start-eval.sh opencode-glm47`

2. **Run your evaluation:**
   - Set up prompts/tasks
   - Let the LLM work through the task

3. **Finish evaluation:**

   ```bash
   git checkout main
   ```

   All commits are automatically preserved in the `eval/` branch.

## Managing Evaluations

- **List all evaluations:** `git branch --list "eval/*"`
- **View an evaluation:** `git checkout eval/`
- **Compare evaluations:** `git diff eval/foo eval/bar`
- **Delete an evaluation:** `git branch -D eval/`

## Structure

```
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md
```

Each evaluation lives as a separate branch in the repository with its own git history.
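
As a rough illustration of the branch-creation step described under Running Evaluations, the sketch below shows one way an orphan `eval/` branch could be created from a name such as `opencode-glm47`. It is not the contents of `scripts/start-eval.sh`: the function names `eval_branch_name` and `start_eval_branch` and the name-validation rule are assumptions made for this example, and the flake setup and opencode launch are left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; the real scripts/start-eval.sh may differ.

# Print the branch name for an evaluation, rejecting empty or unsafe names.
# The allowed character set is an assumption for this example.
eval_branch_name() {
  local name="${1:-}"
  if [[ -z "$name" || ! "$name" =~ ^[A-Za-z0-9._-]+$ ]]; then
    echo "usage: start-eval.sh NAME (letters, digits, '.', '_', '-')" >&2
    return 1
  fi
  printf 'eval/%s\n' "$name"
}

# Create and switch to an orphan branch (one with no parent commits).
# Refuses to continue if the branch already exists.
start_eval_branch() {
  local branch
  branch="$(eval_branch_name "${1:-}")" || return 1
  if git show-ref --verify --quiet "refs/heads/$branch"; then
    echo "branch $branch already exists" >&2
    return 1
  fi
  git checkout --orphan "$branch"
}
```

With these functions sourced, `start_eval_branch opencode-glm47` would switch to a new `eval/opencode-glm47` branch whose first commit has no parent, which is what keeps each evaluation's history separate from `main`.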