feat: vllm timings patch
New file: modules/nixos/services/llama-swap/patches/README.md (112 lines)

# vLLM Timings Patch

This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible `timings` data. llama-swap already parses this `timings` object to populate cached-token counts, prompt-processing speed, and generation speed.

## Files

- `patch_timings_07351e088.py` — disk-edit patch script for running inside the vLLM Docker container before `vllm serve`.
- `vllm-timings-07351e088.patch` — standard unified git patch against `vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08`.
## What The Patch Adds

The patch adds a top-level `timings` object to:

- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk

The object matches llama.cpp's fields:
```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`:

- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
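The mapping above can be sketched in Python. This is an illustrative reconstruction, not vLLM's actual internals: the function name and parameter names are invented, and timestamps are assumed to be in seconds while the output fields are milliseconds and tokens/second, matching llama.cpp's naming.

```python
def llama_cpp_timings(scheduled_ts: float, first_token_ts: float,
                      last_token_ts: float, prompt_n: int,
                      predicted_n: int, cache_n: int) -> dict:
    """Build a llama.cpp-style `timings` object from request timestamps.

    Timestamps are in seconds; output fields are milliseconds and
    tokens/second, following llama.cpp's field names.
    """
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0      # prefill time
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0  # decode time
    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_n * 1000.0 / prompt_ms if prompt_ms else 0.0,
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_n * 1000.0 / predicted_ms if predicted_ms else 0.0,
        "cache_n": cache_n,
    }

# Example: 0.5 s prefill for 123 prompt tokens, then 1.0 s decoding 50 tokens.
t = llama_cpp_timings(10.0, 10.5, 11.5, prompt_n=123, predicted_n=50, cache_n=100)
```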
## Option 1: Runtime Docker Patch Script

Copy the script into the deployed patch directory:

```bash
cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py
```

Add the Docker bind mount in `/etc/nixos/modules/nixos/services/llama-swap/config.nix`:

```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
```
Run it before `exec vllm serve` in `vllmCmd`:

```bash
python3 /patches/patch_timings_07351e088.py
exec vllm serve ...
```

The script is idempotent: re-running it skips files that already contain the `# [patch_timings]` marker.
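The marker-based idempotency check can be illustrated with a minimal sketch. The marker string matches the one the script uses; the function and the file contents here are invented for illustration, not the script's actual logic.

```python
MARKER = "# [patch_timings]"

def patch_source(text: str) -> str:
    """Append an illustrative patch block unless the file is already patched."""
    if MARKER in text:  # already patched: leave the file untouched
        return text
    return text + f"\n{MARKER}\n# ... patched code would go here ...\n"

original = "def build_response():\n    pass\n"
once = patch_source(original)   # first run appends the patched block
twice = patch_source(once)      # second run is a no-op
```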
## Option 2: Standard Patch File

Use this for a source checkout or for future vLLM updates, where conflicts can be resolved normally.

From a vLLM checkout at commit `07351e0883470724dd5a7e9730ed10e01fc99d08`:

```bash
git apply /path/to/_scratch/vllm-timings-07351e088.patch
```

Or with `patch`:

```bash
patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch
```
For future vLLM versions, first check whether the patch still applies cleanly:

```bash
git apply --check /path/to/_scratch/vllm-timings-07351e088.patch
```

If the check fails, apply with rejects (`git apply --reject`) or by hand, and resolve the conflicts around the changed response-construction code.
## Verification Performed

The patch was checked against the Docker tag's pinned commit:

```text
vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
```

Validation done locally:

```bash
git apply --check _scratch/vllm-timings-07351e088.patch
git apply _scratch/vllm-timings-07351e088.patch
nix run nixpkgs#python3 -- -m py_compile \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  vllm/entrypoints/openai/chat_completion/serving.py \
  vllm/entrypoints/openai/completion/protocol.py \
  vllm/entrypoints/openai/completion/serving.py
```

The runtime `patch_timings_07351e088.py` script was also tested against files extracted from the pinned commit and confirmed idempotent.
## Caveats

- Normal chat-completion usage should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so streaming clients must opt into usage reporting if they want timings in the stream.
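A streaming client therefore only needs to inspect the final usage chunk. The sketch below parses a hypothetical final SSE payload; the field layout follows the patch's description, but the values and the chunk itself are invented:

```python
import json

# Hypothetical final SSE data line from a patched /v1/chat/completions stream.
final_chunk = (
    '{"id": "cmpl-1", "choices": [], '
    '"usage": {"prompt_tokens": 123, "completion_tokens": 50, "total_tokens": 173}, '
    '"timings": {"prompt_n": 123, "prompt_ms": 456.7, "prompt_per_second": 269.3, '
    '"predicted_n": 50, "predicted_ms": 1000.0, "predicted_per_second": 50.0, '
    '"cache_n": 100}}'
)

chunk = json.loads(final_chunk)
timings = chunk.get("timings")  # present only on the final usage chunk
if timings is not None:
    gen_speed = timings["predicted_per_second"]  # tokens/s during decode
```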