Initial: branch-oriented eval framework

2026-01-30 09:32:12 -05:00
commit fece98f5ee
6 changed files with 317 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,53 @@
+# LLM Evaluation Framework
+
+Evaluate LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments using git branches.
+
+## Setup
+
+```bash
+# Approve .envrc so direnv can load the environment
+direnv allow
+
+# Development shell
+nix develop
+```
+
+## Running Evaluations
+
+1. **Start evaluation:**
+```bash
+./scripts/start-eval.sh <eval-name>
+```
+This creates a new orphan branch `eval/<eval-name>`, sets up the flake environment, and starts opencode.
+Example: `./scripts/start-eval.sh opencode-glm47`
+
+2. **Run your evaluation:**
+   - Set up prompts/tasks
+   - Let the LLM work through the task
+
+3. **Finish evaluation:**
+```bash
+git checkout main
+```
+Commits made during the evaluation stay on the `eval/<eval-name>` branch. Commit any remaining changes before switching, since uncommitted work is not part of the branch.
+
+## Managing Evaluations
+
+- **List all evaluations:** `git branch --list 'eval/*'`
+- **View an evaluation:** `git checkout eval/<eval-name>`
+- **Compare evaluations:** `git diff eval/foo eval/bar`
+- **Delete an evaluation:** `git branch -D eval/<eval-name>`
+
+## Structure
+
+```
+eval/
+├── flake.nix
+├── flake.lock
+├── .envrc
+├── scripts/
+│   └── start-eval.sh
+└── README.md
+```
+
+Each evaluation lives as a separate branch in the repository with its own git history.
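
The branch workflow this README describes (one orphan branch per evaluation, return to `main`, list the evaluations) can be tried in a throwaway repository. The sketch below is an illustration under assumptions, not the contents of `scripts/start-eval.sh`, which is not shown here and also sets up the flake environment and starts opencode. The helpers `demo_commit`, `start_eval_branch`, and `list_evals` are hypothetical names introduced for this example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: commit with a fixed identity so the demo does not
# depend on the caller's git configuration.
demo_commit() {
  git -c user.name='eval demo' -c user.email='demo@example.invalid' \
      -c commit.gpgsign=false commit --quiet --allow-empty -m "$1"
}

# Hypothetical helper: create an orphan branch eval/<name>.
# An orphan branch has no parent commit, so its history is independent.
start_eval_branch() {
  local name="$1"
  git checkout --quiet --orphan "eval/${name}"
  # --orphan keeps the previous branch's files staged; clear them so the
  # evaluation starts from an empty tree (no-op if nothing is tracked).
  git rm -rf --quiet . >/dev/null 2>&1 || true
}

# List evaluation branches, including the one currently checked out.
list_evals() {
  git branch --list 'eval/*' --format='%(refname:short)'
}

# Work in a temporary repository so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
demo_commit "main: initial commit"
git branch -M main              # normalize the default branch name

start_eval_branch demo-run      # "demo-run" is a placeholder eval name
demo_commit "eval: first step"

git checkout --quiet main       # finishing an evaluation
list_evals
git rev-list --count eval/demo-run   # commits reachable from the eval branch
```

Because the evaluation branch is an orphan, `git rev-list --count` reports only the commits made during the evaluation, and none from `main`.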