fix(llama-swap): update vllm timings patch

2026-05-11 09:40:13 -04:00
parent 187c717383
commit ecad94aab3
6 changed files with 119 additions and 246 deletions


# vLLM Timings Patch
This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:
```text
vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
```
The patch adds a top-level llama.cpp-compatible `timings` object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.
## Files
- `patch_timings_1acd67a.py` — idempotent boot-time disk-edit patch script for the vLLM Docker container.
- `vllm-timings-1acd67a.patch` — equivalent standard unified git patch against the current image's vLLM source.
## What The Patch Adds
The patch adds a top-level `timings` object to:
- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk
## Runtime Script
Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:
```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
```
Run it before `exec vllm serve`:
```bash
python3 /patches/patch_timings_1acd67a.py;
exec vllm serve ...
```
The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
## Standard Patch
For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:
```bash
git apply --check /path/to/vllm-timings-1acd67a.patch
git apply /path/to/vllm-timings-1acd67a.patch
```
At container runtime, applying the `.patch` directly is possible if the image has `patch` or `git` installed:
```bash
cd /usr/local/lib/python3.12/dist-packages
patch -p1 < /patches/vllm-timings-1acd67a.patch
```
The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.
## Timings Fields
```json
{
  ...
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput`:
- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
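The mapping above can be sketched as a llama.cpp-shaped `timings` builder. This is an illustration under assumptions, not the patch's actual code: `build_timings` is hypothetical, and the field names (`prompt_ms`, `predicted_per_second`, `cache_n`, ...) are modeled on llama.cpp's server response.

```python
def build_timings(scheduled_ts, first_token_ts, last_token_ts,
                  prompt_tokens, completion_tokens, num_cached_tokens):
    """Map vLLM request timestamps/counters onto llama.cpp-style timings."""
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0      # prompt/prefill time
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0  # generation/decode time
    return {
        "cache_n": num_cached_tokens,
        "prompt_n": prompt_tokens,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_tokens * 1000.0 / prompt_ms if prompt_ms > 0 else 0.0,
        "predicted_n": completion_tokens,
        "predicted_ms": predicted_ms,
        "predicted_per_second": completion_tokens * 1000.0 / predicted_ms if predicted_ms > 0 else 0.0,
    }
```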
## Caveats
- Normal chat completion usage should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so clients must request usage reporting on streaming requests if they want timings in the stream.
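Per the last caveat, a streaming client has to opt into usage reporting (with the OpenAI-style API this is typically `stream_options: {"include_usage": true}`) and then read `timings` off the final chunk. A minimal client-side sketch, with illustrative chunk shapes:

```python
def timings_from_stream(chunks):
    """Return the `timings` object attached to the final usage chunk, if any."""
    timings = None
    for chunk in chunks:
        # the patch attaches timings alongside usage, which only the last chunk carries
        if chunk.get("usage") is not None:
            timings = chunk.get("timings")
    return timings
```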