vLLM Timings Patch

This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible timings data. llama-swap already parses this timings object to populate cached tokens, prompt processing speed, and generation speed.

Files

  • patch_timings_07351e088.py — script that edits the installed vLLM files on disk; run it inside the vLLM Docker container before vllm serve.
  • vllm-timings-07351e088.patch — standard unified git patch against vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08.

What The Patch Adds

The patch adds a top-level timings object to:

  • /v1/chat/completions non-streaming responses
  • /v1/chat/completions streaming final usage chunk
  • /v1/completions non-streaming responses
  • /v1/completions streaming final usage chunk

The object matches llama.cpp's fields:

{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
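
After the patch, a client can read these fields straight off a non-streaming response. A minimal Python sketch (the URL and model name are placeholders):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "hello"}],
    },
).json()

timings = resp.get("timings", {})
print("cached tokens:", timings.get("cache_n"))
print("prompt tok/s:", timings.get("prompt_per_second"))
print("generation tok/s:", timings.get("predicted_per_second"))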

Data comes from vLLM's existing internal RequestStateStats and RequestOutput.num_cached_tokens:

  • prompt/prefill time: first_token_ts - scheduled_ts
  • generation/decode time: last_token_ts - first_token_ts
  • cached tokens: num_cached_tokens
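
The mapping into llama.cpp-style fields is straightforward. A minimal sketch using the timestamps listed above (the function name and exact call site inside vLLM's serving code are illustrative, not the patch's actual internals):

def build_timings(scheduled_ts, first_token_ts, last_token_ts,
                  prompt_tokens, completion_tokens, num_cached_tokens):
    # Timestamps are in seconds; the timings object reports milliseconds.
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0
    return {
        "prompt_n": prompt_tokens,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_tokens / (prompt_ms / 1000.0) if prompt_ms > 0 else 0.0,
        "predicted_n": completion_tokens,
        "predicted_ms": predicted_ms,
        "predicted_per_second": completion_tokens / (predicted_ms / 1000.0) if predicted_ms > 0 else 0.0,
        "cache_n": num_cached_tokens,
    }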

Option 1: Runtime Docker Patch Script

Copy the script into the deployed patch directory:

cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py

Add the Docker mount in /etc/nixos/modules/nixos/services/llama-swap/config.nix:

-v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \

Run it before exec vllm serve in vllmCmd:

python3 /patches/patch_timings_07351e088.py;
exec vllm serve ...

The script is idempotent. Re-running it skips files that already contain # [patch_timings].
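
The idempotency check is just a marker comment. A minimal sketch of the pattern (the file handling and helper names here are illustrative, not the script's actual code):

from pathlib import Path

MARKER = "# [patch_timings]"

def patch_file(path: Path, apply_edits) -> bool:
    """Rewrite the file once; skip it if the marker is already present."""
    text = path.read_text()
    if MARKER in text:
        return False  # already patched on a previous run
    path.write_text(apply_edits(text) + "\n" + MARKER + "\n")
    return True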

Option 2: Standard Patch File

Use this for a source checkout or future vLLM updates where conflicts can be resolved normally.

From a vLLM checkout at commit 07351e0883470724dd5a7e9730ed10e01fc99d08:

git apply /path/to/_scratch/vllm-timings-07351e088.patch

Or with patch:

patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch

For future vLLM versions, try:

git apply --check /path/to/_scratch/vllm-timings-07351e088.patch

If the check fails, apply the patch by hand or with git apply --reject and resolve the rejected hunks around the changed response-construction code.

Verification Performed

The patch was checked against the Docker tag's pinned commit:

vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08

Validation done locally:

git apply --check _scratch/vllm-timings-07351e088.patch
git apply _scratch/vllm-timings-07351e088.patch
nix run nixpkgs#python3 -- -m py_compile \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  vllm/entrypoints/openai/chat_completion/serving.py \
  vllm/entrypoints/openai/completion/protocol.py \
  vllm/entrypoints/openai/completion/serving.py

The runtime patch_timings_07351e088.py script was also tested against files extracted from the pinned commit and confirmed idempotent.

Caveats

  • Normal chat completion usage should be correct.
  • /v1/completions with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
  • Streaming timings are attached only to the final usage chunk, so clients must explicitly request usage on streaming requests if they want timings in the stream; see the sketch below.
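
As an example of opting in to usage on a stream, a minimal Python sketch (the URL and model name are placeholders; vLLM's OpenAI-compatible server accepts the standard stream_options field, but treat that as an assumption for your deployed version):

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

timings = None
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # The final usage chunk is where the patch attaches timings.
    if chunk.get("usage") is not None:
        timings = chunk.get("timings")

print(timings)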