Evan Reichard d9b143f4f8 docs: update specification to use TypeScript frontend

- Replace 'Vanilla JavaScript' with 'TypeScript' in frontend requirements
- Update flake.nix to use nodejs package instead of tailwindcss
- Clarify testing requirements to cover both frontend and backend
- Fix punctuation in evaluation checklist

2026-02-03 20:55:02 -05:00

scripts     Initial: branch-oriented eval framework                 2026-01-30 10:32:50 -05:00
.envrc      Initial: branch-oriented eval framework                 2026-01-30 10:32:50 -05:00
flake.lock  Initial: branch-oriented eval framework                 2026-01-30 10:32:50 -05:00
flake.nix   docs: update specification to use TypeScript frontend   2026-02-03 20:55:02 -05:00
README.md   Initial: branch-oriented eval framework                 2026-01-30 10:32:50 -05:00
SPEC.md     docs: update specification to use TypeScript frontend   2026-02-03 20:55:02 -05:00

README.md

LLM Evaluation Framework

Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments using git branches.

Setup

# Direnv
direnv allow

# Development shell
nix develop

Running Evaluations

Start evaluation:

./scripts/start-eval.sh <eval-name>

This creates a new orphan branch eval/<eval-name>, sets up the flake environment, and starts opencode. Example: ./scripts/start-eval.sh opencode-glm47
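
The steps described above could look roughly like the following. This is a minimal sketch, not the contents of scripts/start-eval.sh: the function names `eval_branch_name` and `start_eval` are hypothetical, the name validation rule is an assumption, and `direnv allow` is only a guess at what "sets up the flake environment" means.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the start-eval flow; the real script may differ.

# Map an evaluation name to its branch name (eval/<eval-name>).
# Assumption: names are limited to letters, digits, '.', '_' and '-'.
eval_branch_name() {
  local name="${1:-}"
  if [[ ! "$name" =~ ^[A-Za-z0-9._-]+$ ]]; then
    echo "usage: start-eval.sh <eval-name>" >&2
    return 1
  fi
  echo "eval/${name}"
}

start_eval() {
  local branch
  branch="$(eval_branch_name "${1:-}")" || return 1

  # Refuse to reuse the name of an existing evaluation branch.
  if git show-ref --verify --quiet "refs/heads/${branch}"; then
    echo "branch ${branch} already exists" >&2
    return 1
  fi

  # --orphan starts a branch with no parent commits, so the evaluation
  # gets its own history; the current files stay in the working tree.
  git checkout --orphan "$branch"

  # Assumption: the flake environment is loaded through direnv.
  direnv allow

  # Start the agent in the new branch.
  opencode
}
```

With these definitions loaded, `start_eval opencode-glm47` would check out the orphan branch eval/opencode-glm47 and then launch the agent.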

Run your evaluation:
- Set up prompts/tasks
- Let the LLM work through the task

Finish evaluation:

git checkout main

All commits are automatically preserved in the eval/<eval-name> branch.

Managing Evaluations

List all evaluations: git branch --list "eval/*"
View an evaluation: git checkout eval/<eval-name>
Compare evaluations: git diff eval/foo eval/bar
Delete an evaluation: git branch -D eval/<eval-name>
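
For scripting, it can be handy to get bare evaluation names rather than full branch names. The helpers below are a hypothetical sketch that is not part of this repository: `eval_names` and `list_evals` are made-up names, and the only git call used is `git for-each-ref`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of scripts/.

# Read branch names on stdin (one per line) and print the evaluation
# name for each eval/ branch, dropping every other line.
eval_names() {
  sed -n 's|^eval/||p'
}

# Print the names of all local evaluation branches.
list_evals() {
  git for-each-ref --format='%(refname:short)' refs/heads/eval | eval_names
}
```

Calling `list_evals` inside the repository prints one evaluation name per line, which can then be fed to commands such as `git diff eval/<a> eval/<b>`.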

Structure

eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
├── README.md
└── SPEC.md

Each evaluation lives as a separate branch in the repository with its own git history.