LLM Evaluation Framework
Evaluate different LLMs and agentic tools (opencode, Claude Code, etc.) in controlled environments, with each evaluation isolated on its own git branch.
Setup
# Direnv
direnv allow
# Development shell
nix develop
Running Evaluations
- Start evaluation:
./scripts/start-eval.sh <eval-name>
This creates a new orphan branch eval/<eval-name>, sets up the flake environment, and starts opencode.
Example: ./scripts/start-eval.sh opencode-glm47
- Run your evaluation:
  - Set up prompts/tasks
  - Let the LLM work through the task
- Finish evaluation:
git checkout main
All commits are automatically preserved in the eval/<eval-name> branch.
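The branch-creation step described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/start-eval.sh: the function name create_eval_branch and the commit message are hypothetical, and the real script may set up the flake environment differently. Starting opencode is left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the orphan-branch step; the real scripts/start-eval.sh may differ.
set -euo pipefail

# create_eval_branch NAME
# Creates the orphan branch eval/NAME and records one root commit on it.
create_eval_branch() {
  local name="$1"
  local branch="eval/${name}"

  # Refuse to reuse the name of an existing evaluation.
  if git show-ref --verify --quiet "refs/heads/${branch}"; then
    echo "branch ${branch} already exists" >&2
    return 1
  fi

  # An orphan branch has no parent commits, so its history is independent of main.
  # The index and working tree are kept, so flake.nix, .envrc, etc. carry over.
  git checkout --orphan "${branch}"

  # Record the carried-over files as the root commit of the evaluation.
  # --allow-empty covers the case where nothing is staged.
  git commit -q --allow-empty -m "Start evaluation: ${name}"

  # The real script also starts opencode at this point; omitted in this sketch.
}
```

With this sketch, running create_eval_branch opencode-glm47 from main should leave the repository on eval/opencode-glm47 with a single root commit.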
Managing Evaluations
- List all evaluations:
git branch | grep "^[* ] eval/"
- View an evaluation:
git checkout eval/<eval-name>
- Compare evaluations:
git diff eval/foo eval/bar
- Delete an evaluation:
git branch -D eval/<eval-name>
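For a quick overview across evaluations, the listing command can be extended to show how many commits each branch holds. The helper below is a hypothetical sketch (list_evals is not part of this repository); it relies only on git for-each-ref and git rev-list.

```shell
# Hypothetical helper: print every eval/* branch with its commit count,
# one "branch<TAB>count" pair per line, sorted by branch name.
list_evals() {
  git for-each-ref --format='%(refname:short)' refs/heads/eval/ |
    while read -r branch; do
      # Count the commits reachable from the branch tip.
      printf '%s\t%s\n' "$branch" "$(git rev-list --count "$branch")"
    done
}
```

Because each evaluation starts from an orphan branch, the count reflects only the commits made during that evaluation, which gives a rough measure of how much work an agent recorded.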
Structure
eval/
├── flake.nix
├── flake.lock
├── .envrc
├── scripts/
│   └── start-eval.sh
└── README.md
Each evaluation lives as a separate branch in the repository with its own git history.