Initial: branch-oriented eval framework

2026-01-30 09:32:12 -05:00
commit fece98f5ee
6 changed files with 317 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,53 @@
# LLM Evaluation Framework
Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.
## Setup
```bash
# Direnv
direnv allow
# Development shell
nix develop
```
## Running Evaluations
1. **Start evaluation:**
```bash
./scripts/start-eval.sh <eval-name>
```
This creates a new orphan branch `eval/<eval-name>`, sets up the flake environment, and starts opencode.
Example: `./scripts/start-eval.sh opencode-glm47`
2. **Run your evaluation:**
- Set up prompts/tasks
- Let the LLM work through the task
3. **Finish evaluation:**
```bash
git checkout main
```
Commits made during the evaluation stay on the `eval/<eval-name>` branch after you switch back to `main`.
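The contents of `scripts/start-eval.sh` are not shown here. As a rough sketch of step 1, assuming the script validates the name, creates the orphan branch with `git checkout --orphan`, and launches the agent with an `opencode` command, it might look like the following. The function names `eval_branch` and `start_eval` are hypothetical, and the flake setup step is left out because its details are not described above.

```bash
#!/bin/sh
# Hypothetical sketch only; the real scripts/start-eval.sh may differ.
set -eu

# Print the branch name for an evaluation, rejecting empty names
# and names containing spaces (git refuses those as branch names).
eval_branch() {
  name="${1:-}"
  case "$name" in
    '' | *' '*)
      echo "usage: start-eval.sh <eval-name>" >&2
      return 1
      ;;
  esac
  printf 'eval/%s\n' "$name"
}

# Create the orphan branch and start the agent.
start_eval() {
  branch="$(eval_branch "${1:-}")" || return 1
  # --orphan starts a branch with no parent commits, so the
  # evaluation gets a history independent of main.
  git checkout --orphan "$branch"
  # Flake environment setup would go here (not specified in this README).
  # Assumption: the agent is started with the `opencode` command.
  opencode
}
```

With these definitions, `start_eval opencode-glm47` would switch to `eval/opencode-glm47` before launching the agent.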
## Managing Evaluations
- **List all evaluations:** `git branch --list 'eval/*'`
- **View an evaluation:** `git checkout eval/<eval-name>`
- **Compare evaluations:** `git diff eval/foo eval/bar`
- **Delete an evaluation:** `git branch -D eval/<eval-name>`
## Structure
```
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md
```
Each evaluation lives as a separate branch in the repository with its own git history.
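Because evaluation branches are created as orphans, two of them should share no common ancestor. A small check along these lines can confirm that; the helper name `shares_history` is hypothetical, and it relies only on `git merge-base`, which exits non-zero when no common commit exists.

```bash
# Hypothetical helper: succeeds only if the two branches
# have at least one commit in common.
shares_history() {
  git merge-base "$1" "$2" >/dev/null 2>&1
}
```

For example, `shares_history eval/foo eval/bar || echo "independent histories"` prints the message for two orphan branches. This is also why the two-argument form `git diff eval/foo eval/bar` is the right way to compare them: the three-dot form `git diff eval/foo...eval/bar` needs a merge base and fails without one.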