docs(pi): add agent knowledge capture guidance

docs(pi): tighten agent guidance
fix(llama-swap): sync qwen vllm 3090 configs
2026-05-09 10:18:13 -04:00 · 2026-05-09 10:16:34 -04:00 · 2026-05-09 10:16:32 -04:00 · 2026-05-09 10:16:29 -04:00
4 changed files with 128 additions and 71 deletions
--- a/.agents/skills/update-vllm-3090-configs/SKILL.md
+++ b/.agents/skills/update-vllm-3090-configs/SKILL.md
@@ -0,0 +1,61 @@
 ---
 name: update-vllm-3090-configs
 description: Update only the qwen3.6-27b vLLM 3090 llama-swap configs from club-3090 refs; compare diffs, present a plan, and require approval before editing.
 ---
 # Update vLLM 3090 Configs
 ## Scope
 Use only for Qwen3.6 27B vLLM 3090 configs in `modules/nixos/services/llama-swap/`.
 Do not use this skill for other models, other Qwen sizes, non-vLLM configs, or package bumps.
 Local files:
 - `modules/nixos/services/llama-swap/config.nix`
 - `modules/nixos/services/llama-swap/setup-qwen36-vllm.sh`
 Local config keys:
 - `vllm-qwen3.6-27b-tools-text`
 - `vllm-qwen3.6-27b-long-text`
 - `vllm-qwen3.6-27b-long-vision`
 ## Upstream References
 Compare against `club-3090` master:
 - `models/qwen3.6-27b/vllm/compose/single/tools-text.yml`
 - `models/qwen3.6-27b/vllm/compose/single/long-text.yml`
 - `models/qwen3.6-27b/vllm/compose/single/long-vision.yml`
 - `scripts/setup.sh` for the current `GENESIS_PIN="${GENESIS_PIN:-...}"`
 Use raw URLs or a temp clone under `_scratch/club-3090`. Prefer a temp clone when checking broad changes:
 ```bash
 mkdir -p _scratch
 git clone https://github.com/noonghunna/club-3090 _scratch/club-3090 2>/dev/null || git -C _scratch/club-3090 pull --ff-only
 ```
 ## Required Workflow
 1. Fetch/update upstream refs under `_scratch/club-3090` or fetch the raw files.
 2. Compare upstream compose files to the three local llama-swap entries. Translate docker-compose semantics into the existing `docker run`/llama-swap format.
 3. Compare upstream `scripts/setup.sh` Genesis pin to local `GENESIS_PIN` in `setup-qwen36-vllm.sh`.
 4. Check upstream compose volumes/entrypoint for sidecar patches. If patches are added, removed, renamed, or invoked differently, update both:
   - runtime mounts and `python3 /patches/...` calls in `config.nix`
   - download/install logic and summary in `setup-qwen36-vllm.sh`
 5. Ignore these diffs unless the user explicitly asks otherwise:
   - `shm_size` / shm-related compose settings
   - local timing patch `patch_timings_07351e088.py` and its mount/invocation
   - model served-name differences caused by llama-swap `${MODEL_ID}`
   - `HUGGING_FACE_HUB_TOKEN`; keep local CUDA device/env choices
   - upstream relative paths vs local `/mnt/ssd/vLLM/...` paths
   - docker-compose format vs local llama-swap/Nix format
 6. Before editing, present:
   - upstream files/commit checked
   - meaningful diffs found
   - ignored diffs
   - exact planned local changes
   Then wait for explicit user approval.
 7. After approval, edit minimally and validate:
   - `bash -n modules/nixos/services/llama-swap/setup-qwen36-vllm.sh`
   - `nix-instantiate --parse modules/nixos/services/llama-swap/config.nix`
 8. Summarize changed files and any remaining upstream differences.
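Step 3 of the workflow above compares the upstream Genesis pin with the local one. A minimal sketch of that check follows, assuming both scripts declare the pin on a single line of the form `GENESIS_PIN="${GENESIS_PIN:-...}"`. The helper names `extract_genesis_pin` and `compare_pins` are hypothetical, and the upstream pin value in the demo is made up for illustration.

```bash
# Hypothetical helper: print the default pin from a line such as
#   GENESIS_PIN="${GENESIS_PIN:-7b9fd319}"
extract_genesis_pin() {
  sed -n 's/^[[:space:]]*GENESIS_PIN="\${GENESIS_PIN:-\([^}]*\)}".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: report whether two setup scripts agree on the pin.
compare_pins() {
  local upstream_pin local_pin
  upstream_pin="$(extract_genesis_pin "$1")"
  local_pin="$(extract_genesis_pin "$2")"
  if [ "${upstream_pin}" = "${local_pin}" ]; then
    echo "in sync: ${local_pin}"
  else
    echo "differs: upstream=${upstream_pin} local=${local_pin}"
  fi
}

# Demo with throwaway files; the upstream value here is invented.
demo_dir="$(mktemp -d)"
printf '%s\n' 'GENESIS_PIN="${GENESIS_PIN:-7b9fd319}"' > "${demo_dir}/local.sh"
printf '%s\n' 'GENESIS_PIN="${GENESIS_PIN:-0000000}"' > "${demo_dir}/upstream.sh"
compare_pins "${demo_dir}/upstream.sh" "${demo_dir}/local.sh"
rm -rf "${demo_dir}"
```

Against the real files, the call would presumably be `compare_pins _scratch/club-3090/scripts/setup.sh modules/nixos/services/llama-swap/setup-qwen36-vllm.sh`, run after the temp clone has been refreshed.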
--- a/modules/home/programs/terminal/pi/config/AGENTS.md
+++ b/modules/home/programs/terminal/pi/config/AGENTS.md
@@ -1,67 +1,45 @@
 # AI Agent Guidelines
-## Important Rules
-1. **Timeout for bash tool**: The `bash` tool MUST have a timeout specified. Without a timeout, the tool will hang indefinitely and cause the task to fail.
-2. **File writing**: Do NOT use `cat` with heredocs to write files. Use the `write` tool instead (or `edit` for modifications).
-3. **Ephemeral files**: Put temporary scripts, plans, notes, and other scratch artifacts in `_scratch/`. It is gitignored, and reusable exploration/testing scripts should be iterated there instead of recreated repeatedly.
-## Example of Correct Usage
-### Incorrect (will hang):
-```bash
-bash(command="some long-running command")
-```
-### Correct (with timeout):
-```bash
-bash(command="some command", timeout=30)
-```
-### Incorrect (file writing):
-```bash
-bash(command="cat > file.txt << 'EOF'\ncontent\nEOF")
-```
-### Correct (file writing):
-```bash
-write(path="file.txt", content="content")
-```
-## Reading Files
-Prefer a **search → targeted read** pattern to minimize context usage:
-1. **Search** with `grep -n` / `rg -n` to find relevant line numbers.
-2. **Read** only the needed range using `read(path, offset, limit)` or `sed -n 'X,Yp'`.
-```bash
-# Find the relevant lines
-bash(command="rg -n 'functionName' src/", timeout=10)
-# Read just that region (e.g. lines 42-70)
-read(path="src/foo.go", offset=42, limit=29)
-```
-Full-file reads are fine when genuinely needed (small files, needing full picture), but avoid them as the default reflex.
+Be cognizant of context use; this file is loaded for all LLMs. Keep guidance concise and high-signal.
+## Critical Rules
+1. **Bash timeouts**: Every `bash` tool call MUST specify a timeout.
+   ```bash
+   bash(command="some command", timeout=30)
+   ```
+2. **File writing**: Do NOT use `cat` with heredocs to write files. Use `write` for new/rewritten files and `edit` for targeted modifications.
+3. **Scratch files**: Put temporary scripts, plans, notes, and reusable exploration artifacts in `_scratch/`. It is gitignored.
+4. **Missing commands**: If a tool is not installed, prefer `nix run` instead of installing it.
+   ```bash
+   nix run nixpkgs#python3 -- script.py
+   ```
+## Context Discipline
+Prefer a **search → targeted read** pattern:
+1. Search with `rg -n` / `grep -n` to find relevant line numbers.
+2. Read only the needed range with `read(path, offset, limit)`.
+Full-file reads are fine when genuinely needed, but avoid them as the default reflex.
 ## Principles
-1. **KISS / YAGNI** - Keep solutions simple and straightforward. Don't introduce abstractions, generics, or indirection unless there is a concrete, immediate need. Prefer obvious code over clever code.
+1. **KISS / YAGNI**: Keep solutions simple. Avoid abstractions, generics, or indirection unless there is a concrete need.
-2. **Maintain AGENTS.md** - If the project has an `AGENTS.md`, keep it up to date as conventions or architecture evolve. However, follow the **BLUF** (Bottom Line Up Front) principle: keep it concise, actionable, and context-size conscious. Don't overload it with information that belongs in code comments or external docs.
+2. **Maintain AGENTS.md**: Keep project guidance up to date, but BLUF: concise, actionable, and context-size conscious.
+3. **Knowledge Capture**: At task end, if you discovered non-obvious conventions, pitfalls, or repeatable workflows that would have saved time, briefly recommend adding them to AGENTS.md or a skill. Say whether each belongs in project-level context, global agent context, or a task-specific skill. Skip this when there is nothing meaningful.
 ## Style
 ### Comment Style
-A logical "block" of code (doesn't have to be a scope, but a cohesive group of statements responsible for something) should have a comment above it with a short "title". The title must be in **Title Case**. For example:
+A logical block of code (not necessarily a language scope) should have a short Title Case comment above it:
 ```go
 // Map Component Results
@@ -70,12 +48,10 @@ for _, comp := range components {
 }
 ```
-If the block is more complicated or non-obvious, explain _why_ it does what it does after the title:
+If the block is more complicated or non-obvious, explain _why_ after the title:
 ```go
-// Map Component Results - This is needed because downstream consumers
+// Map Component Results - Downstream consumers expect a name-keyed lookup.
-// expect a name-keyed lookup. Without it, the renderer would fall back
-// to O(n) scans on every frame.
 for _, comp := range components {
    results[comp.Name] = comp.Result
 }
--- a/modules/nixos/services/llama-swap/config.nix
+++ b/modules/nixos/services/llama-swap/config.nix
@@ -129,7 +129,7 @@ in
    };
    # https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm
-    # Synced from: club-3090 f6613c8 (2026-05-02) - docker-compose.long-text.yml
+    # Synced from: club-3090 e1137d6 (2026-05-09) - single/long-text.yml
    # Long-text variant - 180K context, text-only (no vision)
    # TurboQuant 3-bit KV + MTP n=3 + Genesis v7.69 + Cliff 2 closure recipe
    "vllm-qwen3.6-27b-long-text" = {
@@ -141,8 +141,9 @@ in
          vllmCmd = ''
            set -e; pip install xxhash pandas scipy -q;
            python3 -m vllm._genesis.patches.apply_all;
+            python3 /patches/qwen3coder_tool_parser_deferred_commit.py;
            python3 /patches/patch_timings_07351e088.py;
-            exec vllm serve
+            exec vllm serve ''${VLLM_ENFORCE_EAGER:+--enforce-eager}
            --served-model-name ''${MODEL_ID}
            --model /root/.cache/huggingface/qwen3.6-27b-autoround-int4
            --quantization auto_round
@@ -188,7 +189,6 @@ in
            -e GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 \
            -e GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 \
            -e GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING=1 \
-            -e GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 \
            -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 \
            -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
            -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 \
@@ -214,6 +214,7 @@ in
            -e GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 \
            -e GENESIS_ENABLE_PN26_SPARSE_V=1 \
            -e GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 \
+            -e GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 \
            -e GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 \
            -e GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 \
            -e GENESIS_ENABLE_PN59_STREAMING_GDN=1 \
@@ -244,10 +245,12 @@ in
            -e VLLM_USE_FLASHINFER_SAMPLER=1 \
            -e VLLM_USE_FUSED_MOE_GROUPED_TOPK=1 \
            -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
+            -e VLLM_ENFORCE_EAGER \
            -v /mnt/ssd/vLLM/Models:/root/.cache/huggingface \
            -v /mnt/ssd/vLLM/Cache/torch_compile:/root/.cache/vllm/torch_compile_cache \
            -v /mnt/ssd/vLLM/Cache/triton:/root/.triton/cache \
            -v /mnt/ssd/vLLM/Patches/genesis/vllm/_genesis:/usr/local/lib/python3.12/dist-packages/vllm/_genesis:ro \
+            -v /mnt/ssd/vLLM/Patches/qwen3coder_tool_parser_deferred_commit.py:/patches/qwen3coder_tool_parser_deferred_commit.py:ro \
            -v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
            -p ''${PORT}:8000 \
            --entrypoint /bin/bash \
@@ -265,7 +268,7 @@ in
    };
    # https://github.com/noonghunna/club-3090/tree/master/models/qwen3.6-27b/vllm
-    # Synced from: club-3090 f6613c8 (2026-05-02) - docker-compose.long-vision.yml
+    # Synced from: club-3090 e1137d6 (2026-05-09) - single/long-vision.yml
    # Long-vision variant - 145K context with vision tower active
    # TurboQuant 3-bit KV + MTP n=3 + Genesis v7.69 + Cliff 2 env vars (mem-util kept at 0.95)
    "vllm-qwen3.6-27b-long-vision" = {
@@ -277,8 +280,9 @@ in
          vllmCmd = ''
            set -e; pip install xxhash pandas scipy -q;
            python3 -m vllm._genesis.patches.apply_all;
+            python3 /patches/qwen3coder_tool_parser_deferred_commit.py;
            python3 /patches/patch_timings_07351e088.py;
-            exec vllm serve
+            exec vllm serve ''${VLLM_ENFORCE_EAGER:+--enforce-eager}
            --served-model-name ''${MODEL_ID}
            --model /root/.cache/huggingface/qwen3.6-27b-autoround-int4
            --quantization auto_round
@@ -323,7 +327,6 @@ in
            -e GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 \
            -e GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 \
            -e GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING=1 \
-            -e GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 \
            -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 \
            -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
            -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 \
@@ -349,19 +352,15 @@ in
            -e GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 \
            -e GENESIS_ENABLE_PN26_SPARSE_V=1 \
            -e GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 \
            -e GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 \
            -e GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 \
            -e GENESIS_ENABLE_PN59_STREAMING_GDN=1 \
            -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 \
            -e GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 \
            -e GENESIS_FLA_FWD_H_MAX_T=16384 \
            -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
            -e GENESIS_P82_THRESHOLD_SINGLE=0.3 \
            -e GENESIS_PN26_SPARSE_V_BLOCK_KV=8 \
            -e GENESIS_PN26_SPARSE_V_NUM_WARPS=4 \
            -e GENESIS_PN26_SPARSE_V_THRESHOLD=0.01 \
            -e GENESIS_PN32_GDN_CHUNK_SIZE=8192 \
            -e GENESIS_PN32_GDN_CHUNK_THRESHOLD=16384 \
            -e GENESIS_PREALLOC_TOKEN_BUDGET=4128 \
            -e GENESIS_PROFILE_RUN_CAP_M=4128 \
            -e NCCL_CUMEM_ENABLE=0 \
@@ -378,10 +377,12 @@ in
            -e VLLM_USE_FLASHINFER_SAMPLER=1 \
            -e VLLM_USE_FUSED_MOE_GROUPED_TOPK=1 \
            -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
+            -e VLLM_ENFORCE_EAGER \
            -v /mnt/ssd/vLLM/Models:/root/.cache/huggingface \
            -v /mnt/ssd/vLLM/Cache/torch_compile:/root/.cache/vllm/torch_compile_cache \
            -v /mnt/ssd/vLLM/Cache/triton:/root/.triton/cache \
            -v /mnt/ssd/vLLM/Patches/genesis/vllm/_genesis:/usr/local/lib/python3.12/dist-packages/vllm/_genesis:ro \
+            -v /mnt/ssd/vLLM/Patches/qwen3coder_tool_parser_deferred_commit.py:/patches/qwen3coder_tool_parser_deferred_commit.py:ro \
            -v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
            -p ''${PORT}:8000 \
            --entrypoint /bin/bash \
@@ -412,8 +413,9 @@ in
          vllmCmd = ''
            set -e; pip install xxhash pandas scipy -q;
            python3 -m vllm._genesis.patches.apply_all;
+            python3 /patches/qwen3coder_tool_parser_deferred_commit.py;
            python3 /patches/patch_timings_07351e088.py;
-            exec vllm serve
+            exec vllm serve ''${VLLM_ENFORCE_EAGER:+--enforce-eager}
            --served-model-name ''${MODEL_ID}
            --model /root/.cache/huggingface/qwen3.6-27b-autoround-int4
            --quantization auto_round
@@ -472,10 +474,12 @@ in
            -e VLLM_NO_USAGE_STATS=1 \
            -e VLLM_USE_FLASHINFER_SAMPLER=1 \
            -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
+            -e VLLM_ENFORCE_EAGER \
            -v /mnt/ssd/vLLM/Models:/root/.cache/huggingface \
            -v /mnt/ssd/vLLM/Cache/torch_compile:/root/.cache/vllm/torch_compile_cache \
            -v /mnt/ssd/vLLM/Cache/triton:/root/.triton/cache \
            -v /mnt/ssd/vLLM/Patches/genesis/vllm/_genesis:/usr/local/lib/python3.12/dist-packages/vllm/_genesis:ro \
+            -v /mnt/ssd/vLLM/Patches/qwen3coder_tool_parser_deferred_commit.py:/patches/qwen3coder_tool_parser_deferred_commit.py:ro \
            -v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
            -p ''${PORT}:8000 \
            --entrypoint /bin/bash \
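The `''${VLLM_ENFORCE_EAGER:+--enforce-eager}` added to each `vllm serve` line is a Nix-escaped form of the shell expansion `${VLLM_ENFORCE_EAGER:+--enforce-eager}`, which yields the flag only when the variable is set and non-empty. The bare `-e VLLM_ENFORCE_EAGER` in `docker run` forwards the host's value into the container when the host has one. A small sketch of the expansion's behavior follows; `eager_flag` is a hypothetical helper used only for illustration and is not part of `config.nix`.

```bash
# Hypothetical helper: print what the ${VAR:+word} expansion contributes
# to the command line for the current value of VLLM_ENFORCE_EAGER.
eager_flag() {
  # Expands to the flag only when the variable is set AND non-empty.
  printf '%s' "${VLLM_ENFORCE_EAGER:+--enforce-eager}"
}

unset VLLM_ENFORCE_EAGER
echo "unset: [$(eager_flag)]"

VLLM_ENFORCE_EAGER=""
echo "empty: [$(eager_flag)]"

VLLM_ENFORCE_EAGER=1
echo "one:   [$(eager_flag)]"

# Any non-empty value enables the flag, including 0.
VLLM_ENFORCE_EAGER=0
echo "zero:  [$(eager_flag)]"
```

Because the test is "non-empty" rather than "truthy", disabling eager mode means leaving the variable unset or empty on the host, not setting it to `0` or `false`.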
--- a/modules/nixos/services/llama-swap/setup-qwen36-vllm.sh
+++ b/modules/nixos/services/llama-swap/setup-qwen36-vllm.sh
@@ -23,6 +23,10 @@ GENESIS_PIN="${GENESIS_PIN:-7b9fd319}"
 BASE_3090_PATCH_URL="https://raw.githubusercontent.com/noonghunna/club-3090/v7.69-cliff2-test/models/qwen3.6-27b/vllm/patches"
 INPUTS_EMBEDS_PATCH="${PATCHES_DIR}/patch_inputs_embeds_optional.py"
+# Tool Parser Patch
+TOOL_PARSER_PATCH="${PATCHES_DIR}/qwen3coder_tool_parser_deferred_commit.py"
+TOOL_PARSER_PATCH_URL="${TOOL_PARSER_PATCH_URL:-https://raw.githubusercontent.com/noonghunna/club-3090/refs/heads/master/models/qwen3.6-27b/vllm/patches/local/qwen3coder_tool_parser_deferred_commit.py}"
 # Timings Patch
 TIMINGS_PATCH="${PATCHES_DIR}/patch_timings_07351e088.py"
 TIMINGS_PATCH_URL="${TIMINGS_PATCH_URL:-https://gitea.va.reichard.io/evan/nix/raw/branch/master/modules/nixos/services/llama-swap/patches/patch_timings_07351e088.py}"
@@ -83,20 +87,31 @@ download_patch() {
 download_patch "${INPUTS_EMBEDS_PATCH}"
-# ---------- Download Timing Patch ----------
-tmp_timings_patch="$(mktemp)"
-trap 'rm -f "${tmp_timings_patch}"' EXIT
-echo "Downloading patch_timings_07351e088.py from this repo..."
-curl -fsSL "${TIMINGS_PATCH_URL}" -o "${tmp_timings_patch}"
-if [ -f "${TIMINGS_PATCH}" ] && cmp -s "${tmp_timings_patch}" "${TIMINGS_PATCH}"; then
-  echo "Timing patch already current at ${TIMINGS_PATCH}, skipping."
-else
-  echo "Installing timing patch to ${TIMINGS_PATCH}..."
-  install -m 0644 "${tmp_timings_patch}" "${TIMINGS_PATCH}"
-  echo "Timing patch installed."
-fi
+# ---------- Download URL Patch ----------
+install_url_patch() {
+  local name="$1"
+  local url="$2"
+  local dest="$3"
+  local tmp_patch
+  tmp_patch="$(mktemp)"
+  echo "Downloading ${name}..."
+  curl -fsSL "${url}" -o "${tmp_patch}"
+  if [ -f "${dest}" ] && cmp -s "${tmp_patch}" "${dest}"; then
+    echo "${name} already current at ${dest}, skipping."
+  else
+    echo "Installing ${name} to ${dest}..."
+    install -m 0644 "${tmp_patch}" "${dest}"
+    echo "${name} installed."
+  fi
+  rm -f "${tmp_patch}"
+}
+# ---------- Download Boot-Time Patches ----------
+install_url_patch "qwen3coder_tool_parser_deferred_commit.py" "${TOOL_PARSER_PATCH_URL}" "${TOOL_PARSER_PATCH}"
+install_url_patch "patch_timings_07351e088.py" "${TIMINGS_PATCH_URL}" "${TIMINGS_PATCH}"
 # ---------- Summary ----------
 echo ""
@@ -116,4 +131,5 @@ echo "  └── Patches/"
 echo "      ├── genesis/                               (Genesis @ ${GENESIS_PIN})"
 echo "      │   └── vllm/_genesis/                     (mounted into container)"
 echo "      ├── patch_inputs_embeds_optional.py        (boot-time: vllm#35975 backport, text-only models)"
+echo "      ├── qwen3coder_tool_parser_deferred_commit.py (boot-time: qwen3coder SSE deferred commit fix)"
 echo "      └── patch_timings_07351e088.py             (boot-time: llama.cpp-compatible timings)"
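The `install_url_patch` function in this script downloads to a temp file, compares it with the installed copy using `cmp -s`, and installs only when the contents differ. The sketch below isolates that compare-then-install step so it can be exercised without network access. `install_if_changed` is a hypothetical name, and the demo uses local temp files in place of the `curl` download.

```bash
# Hypothetical helper: install src to dest only when the contents differ.
install_if_changed() {
  local src="$1"
  local dest="$2"

  if [ -f "${dest}" ] && cmp -s "${src}" "${dest}"; then
    # Identical bytes already installed; leave dest untouched.
    echo "current"
  else
    install -m 0644 "${src}" "${dest}"
    echo "installed"
  fi
}

# Demo: local files stand in for the downloaded patch.
demo_dir="$(mktemp -d)"
printf 'patch v1\n' > "${demo_dir}/download.py"
install_if_changed "${demo_dir}/download.py" "${demo_dir}/installed.py"  # first run installs
install_if_changed "${demo_dir}/download.py" "${demo_dir}/installed.py"  # second run is a no-op
rm -rf "${demo_dir}"
```

Skipping the install when nothing changed keeps the destination's modification time stable, which makes repeated runs of the setup script cheap and easy to read in the log.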
Author	SHA1	Message	Date
Evan Reichard	b41e9f2a84	docs(pi): add agent knowledge capture guidance	2026-05-09 10:18:13 -04:00
Evan Reichard	b25a933dd0	docs(pi): tighten agent guidance	2026-05-09 10:16:34 -04:00
Evan Reichard	37b0fae7e2	fix(llama-swap): sync qwen vllm 3090 configs	2026-05-09 10:16:32 -04:00
Evan Reichard	02410568dc	docs(skills): add vllm 3090 update workflow	2026-05-09 10:16:29 -04:00