# LLM Evaluation Framework

Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.

## Setup

```bash
# Direnv: approve the repository's .envrc
direnv allow

# Development shell: enter the environment defined in flake.nix
nix develop
```

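Before running the setup commands, it can help to confirm that the tools this README relies on are installed. The following is a minimal sketch and not part of the repository: `require_tools` is a hypothetical helper name, and the tool list in the usage line (direnv, nix, git) is inferred from the commands used in this document.

```bash
# Hypothetical helper (not part of this repository): report every command
# that is missing from PATH. Returns non-zero if at least one is missing.
require_tools() {
  local rc=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A possible use is `require_tools direnv nix git && direnv allow`, which only proceeds when all three commands are found.
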
## Running Evaluations

1. **Start evaluation:**
   ```bash
   ./scripts/start-eval.sh <eval-name>
   ```
   This creates a new orphan branch `eval/<eval-name>`, sets up the flake environment, and starts opencode.

   Example: `./scripts/start-eval.sh opencode-glm47`

2. **Run your evaluation:**
   - Set up prompts/tasks
   - Let the LLM work through the task

3. **Finish evaluation:**
   ```bash
   git checkout main
   ```
   Every commit made during the evaluation stays on the `eval/<eval-name>` branch.

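The start step above can be sketched roughly as follows. This is not the contents of `scripts/start-eval.sh`: the function names `eval_branch_name` and `start_eval`, the name validation rule, and the commit message are all assumptions made for illustration. The sketch also assumes `git checkout --orphan` is used, since that keeps the working tree (including `flake.nix`) in place. Launching opencode is left out.

```bash
# Hypothetical sketch of the start step; names and rules are assumptions.

# Map an evaluation name to its branch name. Rejects empty names, names
# starting with "-", and names containing anything other than letters,
# digits, "_" and "-" (a deliberately conservative rule).
eval_branch_name() {
  local name="${1:-}"
  case "$name" in
    ''|-*|*[!A-Za-z0-9_-]*)
      echo "invalid eval name: '$name'" >&2
      return 1
      ;;
  esac
  printf 'eval/%s\n' "$name"
}

# Create the orphan branch and record the starting state as its root commit.
start_eval() {
  local branch
  branch="$(eval_branch_name "${1:-}")" || return 1

  # Refuse to reuse an existing evaluation branch.
  if git show-ref --verify --quiet "refs/heads/$branch"; then
    echo "branch $branch already exists" >&2
    return 1
  fi

  # --orphan starts a branch with no parent commit; the index and
  # working tree are kept, so the current files become the first commit.
  git checkout -q --orphan "$branch"
  git add -A
  git commit -q --allow-empty -m "Start evaluation $1"
}
```

For example, `start_eval opencode-glm47` would leave the repository on `eval/opencode-glm47` with a single root commit. When finishing, note that only committed work stays on the branch, so running `git status --short` before `git checkout main` is a cheap way to spot uncommitted changes.
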
## Managing Evaluations

- **List all evaluations:** `git branch --list 'eval/*'`
- **View an evaluation:** `git checkout eval/<eval-name>`
- **Compare evaluations:** `git diff eval/foo eval/bar`
- **Delete an evaluation:** `git branch -D eval/<eval-name>`

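To get a quick overview of all evaluations at once, the listing command can be combined with a commit count per branch. This is a minimal sketch; `eval_summary` is a hypothetical helper name, not something the repository provides.

```bash
# Hypothetical helper: print "<branch><TAB><commit count>" for every
# local branch under eval/, sorted by branch name.
eval_summary() {
  git for-each-ref --format='%(refname:short)' refs/heads/eval/ |
  while read -r branch; do
    printf '%s\t%s\n' "$branch" "$(git rev-list --count "$branch")"
  done
}
```

Because each evaluation is an orphan branch, the count covers only commits made within that evaluation, which makes it a rough measure of how much work each run recorded.
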
## Structure

```
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md
```

Each evaluation lives as a separate branch in the repository with its own git history.