Initial: branch-oriented eval framework
This commit is contained in:
53
README.md
Normal file
53
README.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# LLM Evaluation Framework
|
||||
|
||||
Evaluate different LLM models and agentic tools (opencode, claude code, etc.) in controlled environments using git branches.
|
||||
|
||||
## Setup
|
||||
|
||||
```bash
|
||||
# Direnv
|
||||
direnv allow
|
||||
|
||||
# Development shell
|
||||
nix develop
|
||||
```
|
||||
|
||||
## Running Evaluations
|
||||
|
||||
1. **Start evaluation:**
|
||||
```bash
|
||||
./scripts/start-eval.sh <eval-name>
|
||||
```
|
||||
This creates a new orphan branch `eval/<eval-name>`, sets up the flake environment, and starts opencode.
|
||||
Example: `./scripts/start-eval.sh opencode-glm47`
|
||||
|
||||
2. **Run your evaluation:**
|
||||
- Set up prompts/tasks
|
||||
- Let the LLM work through the task
|
||||
|
||||
3. **Finish evaluation:**
|
||||
```bash
|
||||
git checkout main
|
||||
```
|
||||
All commits are automatically preserved in the `eval/<eval-name>` branch.
|
||||
|
||||
## Managing Evaluations
|
||||
|
||||
- **List all evaluations:** `git branch | grep "^ eval/"`
|
||||
- **View an evaluation:** `git checkout eval/<eval-name>`
|
||||
- **Compare evaluations:** `git diff eval/foo eval/bar`
|
||||
- **Delete an evaluation:** `git branch -D eval/<eval-name>`
|
||||
|
||||
## Structure
|
||||
|
||||
```
|
||||
eval/
|
||||
├── flake.nix
|
||||
├── flake.lock
|
||||
├── .envrc
|
||||
├── scripts/
|
||||
│ └── start-eval.sh
|
||||
└── README.md
|
||||
```
|
||||
|
||||
Each evaluation lives as a separate branch in the repository with its own git history.
|
||||
118
SPEC.md
Normal file
118
SPEC.md
Normal file
@@ -0,0 +1,118 @@
|
||||
# WYSIWYG Markdown Editor - Specification
|
||||
|
||||
## Overview
|
||||
|
||||
Build a WYSIWYG markdown editor with save functionality consisting of a Go backend and vanilla JavaScript frontend with Tailwind CSS.
|
||||
Beyond the listed features, no authentication, user accounts, collaborative editing, or version history are required.
|
||||
|
||||
## Backend (Go)
|
||||
|
||||
### Requirements
|
||||
|
||||
- Use Go with standard HTTP server
|
||||
- Use Cobra library for CLI argument parsing
|
||||
- Implement CRUD operations for markdown files:
|
||||
- Create: Add new markdown files
|
||||
- Read: Retrieve/View markdown files
|
||||
- Update: Edit existing markdown files
|
||||
- Delete: Remove markdown files
|
||||
- Store markdown files on disk in a specified directory
|
||||
- The server must handle concurrent requests safely; the last successful write wins
|
||||
- Markdown file names are case-sensitive and must end in `.md`. Illegal characters for the current OS must be rejected with 400 Bad Request
|
||||
- Directory structure under `--data-dir` is flat: every `.md` file is a sibling; sub-directories are ignored
|
||||
|
||||
### CLI Options
|
||||
|
||||
The application must support the following CLI flags:
|
||||
|
||||
- `--data-dir`: Path to the directory where markdown files will be stored (default: `./data`)
|
||||
- `--port`: Port number to run the HTTP server on (default: `8080`)
|
||||
- `--host`: Host address to bind to (default: `127.0.0.1`)
|
||||
Cobra must generate help text for `--help` and `-h` that includes defaults.
|
||||
|
||||
### API Implementation
|
||||
|
||||
- Design and implement appropriate REST endpoints for CRUD operations
|
||||
- Handle error responses appropriately
|
||||
- Validate inputs appropriately
|
||||
- Any operation that fails must return an HTTP 4xx/5xx code and a JSON body containing at least an `error` string
|
||||
|
||||
## Frontend (Vanilla JavaScript + Tailwind CSS)
|
||||
|
||||
### Requirements
|
||||
|
||||
- Vanilla JavaScript (no frameworks)
|
||||
- Tailwind CSS for styling
|
||||
- Single-page application interface
|
||||
- No build step except running the Tailwind CLI once to generate the CSS file; runtime must work in a current Chrome/Edge/Firefox without polyfills
|
||||
|
||||
### Features
|
||||
|
||||
#### Markdown Editor
|
||||
|
||||
- Split-view or toggleable view with:
|
||||
- Edit pane: Textarea for markdown input
|
||||
- Preview pane: Rendered markdown preview
|
||||
- Real-time preview updates
|
||||
- Render markdown with GitHub-Flavored-Markdown (GFM) semantics
|
||||
|
||||
#### Theme Support
|
||||
|
||||
Three theme modes:
|
||||
|
||||
- **Dark**: Dark color scheme
|
||||
- **Light**: Light color scheme
|
||||
- **System**: Follows system preference
|
||||
|
||||
Theme switching requirements:
|
||||
|
||||
- Auto-detect system preference on load (prefers-color-scheme)
|
||||
- Manual theme switcher with three options: Dark, Light, System
|
||||
- Persist theme preference in localStorage under the key `wysiwyg-theme`
|
||||
- Update theme immediately when changed
|
||||
- Respect system theme changes when in "System" mode
|
||||
|
||||
#### File Management UI
|
||||
|
||||
- List all markdown files
|
||||
- Create new files
|
||||
- Open existing files for editing
|
||||
- Save changes
|
||||
- Delete files
|
||||
|
||||
#### Responsive Design
|
||||
|
||||
- Works on desktop and mobile
|
||||
- Responsive layout using Tailwind classes
|
||||
|
||||
## Development Environment
|
||||
|
||||
The repository root contains a `flake.nix` locked to `github:NixOS/nixpkgs/nixos-25.11`.
|
||||
`nix develop` has already been executed; the resulting shell provides:
|
||||
|
||||
- go, gopls, golangci-lint
|
||||
- tailwindcss
|
||||
- gnumake
|
||||
|
||||
You must not modify the flake or add packages outside this environment.
|
||||
|
||||
## General Requirements
|
||||
|
||||
- No database - use file system
|
||||
- Minimal dependencies
|
||||
- Clean, maintainable code
|
||||
- Proper error handling
|
||||
|
||||
## Testing & Observability
|
||||
|
||||
- Provide at least one automated test (unit, integration, or end-to-end) that can be run with a single command (`go test`, `make test`, etc.).
|
||||
- The test must demonstrate that a markdown file can be created, read, updated, and deleted through the REST endpoints.
|
||||
- On start-up the server must log its bound address in the format `listening on <host>:<port>` so evaluators can script against it.
|
||||
|
||||
## Evaluation Checklist
|
||||
|
||||
Evaluation will check:
|
||||
(1) CLI starts with defaults,
|
||||
(2) CRUD round-trip,
|
||||
(3) theme switch & persistence,
|
||||
(4) responsive layout on 320 px and 1920 px,
|
||||
61
flake.lock
generated
Normal file
61
flake.lock
generated
Normal file
@@ -0,0 +1,61 @@
|
||||
{
|
||||
"nodes": {
|
||||
"flake-utils": {
|
||||
"inputs": {
|
||||
"systems": "systems"
|
||||
},
|
||||
"locked": {
|
||||
"lastModified": 1731533236,
|
||||
"narHash": "sha256-l0KFg5HjrsfsO/JpG+r7fRrqm12kzFHyUHqHCVpMMbI=",
|
||||
"owner": "numtide",
|
||||
"repo": "flake-utils",
|
||||
"rev": "11707dc2f618dd54ca8739b309ec4fc024de578b",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
"owner": "numtide",
|
||||
"repo": "flake-utils",
|
||||
"type": "github"
|
||||
}
|
||||
},
|
||||
"nixpkgs": {
|
||||
"locked": {
|
||||
"lastModified": 1769318308,
|
||||
"narHash": "sha256-Mjx6p96Pkefks3+aA+72lu1xVehb6mv2yTUUqmSet6Q=",
|
||||
"owner": "NixOS",
|
||||
"repo": "nixpkgs",
|
||||
"rev": "1cd347bf3355fce6c64ab37d3967b4a2cb4b878c",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
"owner": "NixOS",
|
||||
"ref": "nixos-25.11",
|
||||
"repo": "nixpkgs",
|
||||
"type": "github"
|
||||
}
|
||||
},
|
||||
"root": {
|
||||
"inputs": {
|
||||
"flake-utils": "flake-utils",
|
||||
"nixpkgs": "nixpkgs"
|
||||
}
|
||||
},
|
||||
"systems": {
|
||||
"locked": {
|
||||
"lastModified": 1681028828,
|
||||
"narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
|
||||
"owner": "nix-systems",
|
||||
"repo": "default",
|
||||
"rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
"owner": "nix-systems",
|
||||
"repo": "default",
|
||||
"type": "github"
|
||||
}
|
||||
}
|
||||
},
|
||||
"root": "root",
|
||||
"version": 7
|
||||
}
|
||||
37
flake.nix
Normal file
37
flake.nix
Normal file
@@ -0,0 +1,37 @@
|
||||
{
|
||||
description = "Development Environment";
|
||||
|
||||
inputs = {
|
||||
nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
|
||||
flake-utils.url = "github:numtide/flake-utils";
|
||||
};
|
||||
|
||||
outputs =
|
||||
{ self
|
||||
, nixpkgs
|
||||
, flake-utils
|
||||
,
|
||||
}:
|
||||
flake-utils.lib.eachDefaultSystem (
|
||||
system:
|
||||
let
|
||||
pkgs = (
|
||||
import nixpkgs {
|
||||
system = system;
|
||||
}
|
||||
);
|
||||
in
|
||||
{
|
||||
devShells.default = pkgs.mkShell {
|
||||
packages = with pkgs; [
|
||||
go
|
||||
gopls
|
||||
golangci-lint
|
||||
tailwindcss
|
||||
gnumake
|
||||
lsof
|
||||
];
|
||||
};
|
||||
}
|
||||
);
|
||||
}
|
||||
47
scripts/start-eval.sh
Executable file
47
scripts/start-eval.sh
Executable file
@@ -0,0 +1,47 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
if [[ $# -ne 1 ]]; then
|
||||
echo "Usage: $0 <eval-name>"
|
||||
echo "Example: $0 opencode-glm47"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
EVAL_NAME="eval/$1"
|
||||
EVAL_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
|
||||
# Verify we're on main
|
||||
CURRENT_BRANCH=$(git branch --show-current)
|
||||
if [[ "${CURRENT_BRANCH}" != "main" ]]; then
|
||||
echo "Error: Must be on 'main' branch to start an evaluation."
|
||||
echo "Current branch: ${CURRENT_BRANCH}"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check if eval branch already exists
|
||||
if git show-ref --verify --quiet refs/heads/"${EVAL_NAME}"; then
|
||||
echo "Error: Evaluation branch '${EVAL_NAME}' already exists."
|
||||
echo "Use a different name or delete the existing branch first."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Creating evaluation branch: ${EVAL_NAME}"
|
||||
|
||||
# Create orphan branch
|
||||
git switch --orphan "${EVAL_NAME}"
|
||||
|
||||
# Copy only flake files from main
|
||||
git checkout main -- flake.nix flake.lock .envrc SPEC.md
|
||||
|
||||
# Initial commit
|
||||
git add .
|
||||
git commit -m "Initial: setup evaluation environment"
|
||||
|
||||
# Set up direnv
|
||||
direnv allow
|
||||
|
||||
echo ""
|
||||
echo "Evaluation environment ready!"
|
||||
echo "Working on branch: ${EVAL_NAME}"
|
||||
echo ""
|
||||
echo "Run 'git checkout main' when you're done to return to main."
|
||||
Reference in New Issue
Block a user