# LLM Evaluation Framework

Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.

## Setup

```bash
# Let direnv load the environment defined in .envrc
direnv allow

# Enter the Nix development shell
nix develop
```

## Running Evaluations

1. **Start evaluation:**
   ```bash
   ./scripts/start-eval.sh <eval-name>
   ```
   This creates a new orphan branch `eval/<eval-name>`, sets up the flake environment, and starts opencode.

   Example: `./scripts/start-eval.sh opencode-glm47`

2. **Run your evaluation:**
   - Set up prompts/tasks
   - Let the LLM work through the task

3. **Finish evaluation:**
   ```bash
   git checkout main
   ```
   Switching back to `main` ends the session. Every commit made during the evaluation remains on the `eval/<eval-name>` branch.
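The orphan branch mentioned in step 1 is what keeps evaluations isolated. `scripts/start-eval.sh` itself is not reproduced here, so the following is only a minimal sketch of how that branch-creation step could look. The function name `create_eval_branch` and the commit message are assumptions made for illustration, and the flake setup and opencode launch are left out.

```bash
# Hypothetical sketch of the branch-creation step only.
# Not the actual contents of scripts/start-eval.sh.
create_eval_branch() {
  local name="$1"
  local branch="eval/${name}"

  # Refuse to reuse the name of an existing evaluation.
  if git show-ref --verify --quiet "refs/heads/${branch}"; then
    echo "branch ${branch} already exists" >&2
    return 1
  fi

  # An orphan branch starts with no parent commit,
  # so it shares no history with main or other evaluations.
  git checkout --quiet --orphan "${branch}" || return 1

  # The index still holds the files of the branch we came from,
  # so this first commit snapshots them as the starting point.
  git commit --quiet --allow-empty -m "Start evaluation ${name}"
}
```

Calling `create_eval_branch opencode-glm47` from `main` would leave you on `eval/opencode-glm47` with a single root commit.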

## Managing Evaluations

- **List all evaluations:** `git branch --list 'eval/*'`
- **View an evaluation:** `git checkout eval/<eval-name>`
- **Compare evaluations:** `git diff eval/foo eval/bar`
- **Delete an evaluation:** `git branch -D eval/<eval-name>`
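For a quick overview of all evaluations, the listing command can be extended with a commit count and last-commit date per branch. This is a small sketch rather than part of the repository; the helper name `list_evals` is an assumption.

```bash
# Hypothetical helper: one line per eval branch with commit count and last commit date.
list_evals() {
  git for-each-ref --format='%(refname:short)' 'refs/heads/eval/' |
  while read -r branch; do
    printf '%s\t%s commits\tlast: %s\n' \
      "$branch" \
      "$(git rev-list --count "$branch")" \
      "$(git log -1 --date=short --format=%cd "$branch")"
  done
}
```

Run `list_evals` from anywhere inside the repository. Unlike parsing `git branch` output, `git for-each-ref` also includes the branch that is currently checked out.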

## Structure

```
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md
```

Each evaluation lives on its own orphan branch, so its git history is independent of `main` and of every other evaluation.
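To confirm that two branches really have separate histories, you can check that they have no common ancestor. The helper below is an illustrative sketch; the name `is_independent` and its exit codes are assumptions, not something the repository provides.

```bash
# Hypothetical helper.
# Exit 0: the two refs share no history.
# Exit 1: they have a common ancestor.
# Exit 2: one of the refs does not exist.
is_independent() {
  local a="$1" b="$2"

  # Fail early if either ref cannot be resolved to a commit.
  git rev-parse --verify --quiet "${a}^{commit}" >/dev/null || return 2
  git rev-parse --verify --quiet "${b}^{commit}" >/dev/null || return 2

  # git merge-base exits non-zero when no common ancestor exists.
  if git merge-base "$a" "$b" >/dev/null; then
    return 1
  fi
  return 0
}
```

For example, `is_independent main eval/<eval-name> && echo "separate history"` prints the message only for a branch that was created as an orphan.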