# vLLM Timings Patch
This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:
```
vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
```
The patch adds a top-level llama.cpp-compatible timings object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.
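For reference, the object sits at the top level of the response body next to the standard OpenAI fields. The example below is illustrative, with the surrounding fields abridged:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": ["..."],
  "timings": {
    "prompt_n": 123,
    "prompt_ms": 456.7,
    "prompt_per_second": 269.3,
    "predicted_n": 50,
    "predicted_ms": 1000.0,
    "predicted_per_second": 50.0,
    "cache_n": 100
  }
}
```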
## Files
- `patch_timings_1acd67a.py`: idempotent boot-time disk-edit patch script for the vLLM Docker container.
- `vllm-timings-1acd67a.patch`: equivalent standard unified git patch against the current image's vLLM source.
## Runtime Script
Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:
```sh
-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
```
Run it before `exec vllm serve`:
```sh
python3 /patches/patch_timings_1acd67a.py;
exec vllm serve ...
```
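Put together, a container start might look like the following sketch; the model name, port, and GPU flags are placeholders, and the entrypoint override assumes the image can be launched through a plain `bash -c` command:

```sh
docker run --rm --gpus all -p 8000:8000 \
  -v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
  --entrypoint bash \
  vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657 \
  -c "python3 /patches/patch_timings_1acd67a.py && exec vllm serve <model> --port 8000"
```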
The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
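As an illustration of that marker-based approach (a minimal sketch, not the actual script), each file edit can be guarded roughly like this:

```python
from pathlib import Path

MARKER = "# [patch_timings]"

def apply_edit(path: Path, old: str, new: str) -> None:
    """Rewrite `old` to `new` in `path`, but only once across container restarts."""
    text = path.read_text()
    if MARKER in text:
        # Already patched on a previous boot; leave the file untouched.
        return
    path.write_text(text.replace(old, new) + f"\n{MARKER}\n")
```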
## Standard Patch
For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:
```sh
git apply --check /path/to/vllm-timings-1acd67a.patch
git apply /path/to/vllm-timings-1acd67a.patch
```
The `.patch` file can also be applied directly at container runtime, provided the image has `patch` or `git` installed:
```sh
cd /usr/local/lib/python3.12/dist-packages
patch -p1 < /patches/vllm-timings-1acd67a.patch
```
The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.
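If the unified patch is preferred at boot anyway, a defensive wrapper along these lines (hypothetical, untested against the image) keeps the Python script as a fallback:

```sh
if command -v patch >/dev/null 2>&1; then
  cd /usr/local/lib/python3.12/dist-packages
  # -N (--forward) tells patch to skip hunks that appear already applied
  patch -p1 -N < /patches/vllm-timings-1acd67a.patch
else
  python3 /patches/patch_timings_1acd67a.py
fi
```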
## Timings Fields
```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```
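The two `*_per_second` fields are derived values: the token count divided by the elapsed time in seconds, e.g. `123 / (456.7 / 1000) ≈ 269.3` for the prompt phase above.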
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`:
- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
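As a rough sketch of how those pieces combine into the timings dict (the attribute names and the assumption that timestamps are in seconds follow the description above, not necessarily the exact patch code):

```python
def build_timings(stats, num_cached_tokens: int,
                  prompt_tokens: int, predicted_tokens: int) -> dict:
    # stats is assumed to expose scheduled_ts, first_token_ts and last_token_ts
    # as monotonic timestamps in seconds (per RequestStateStats above).
    prompt_ms = (stats.first_token_ts - stats.scheduled_ts) * 1000.0
    predicted_ms = (stats.last_token_ts - stats.first_token_ts) * 1000.0
    return {
        "prompt_n": prompt_tokens,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_tokens / (prompt_ms / 1000.0) if prompt_ms else 0.0,
        "predicted_n": predicted_tokens,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_tokens / (predicted_ms / 1000.0) if predicted_ms else 0.0,
        "cache_n": num_cached_tokens,
    }
```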