fix(llama-swap): update vllm timings patch

2026-05-11 09:40:13 -04:00
parent 187c717383
commit ecad94aab3
6 changed files with 119 additions and 246 deletions


# vLLM Timings Patch
This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:
```text
vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
```
The patch adds a top-level llama.cpp-compatible `timings` object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.
## Files
- `patch_timings_1acd67a.py` — idempotent boot-time disk-edit patch script for the vLLM Docker container.
- `vllm-timings-1acd67a.patch` — equivalent standard unified git patch against the current image's vLLM source.
## What The Patch Adds
The patch adds a top-level `timings` object to:
- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk
## Runtime Script
Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:
```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
```
Run it before `exec vllm serve`:
```bash
python3 /patches/patch_timings_1acd67a.py;
exec vllm serve ...
```
The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
## Standard Patch
For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:
```bash
git apply --check /path/to/vllm-timings-1acd67a.patch
git apply /path/to/vllm-timings-1acd67a.patch
```
At container runtime, applying the `.patch` directly is possible if the image has `patch` or `git` installed:
```bash
cd /usr/local/lib/python3.12/dist-packages
patch -p1 < /patches/vllm-timings-1acd67a.patch
```
The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.
## Timings Fields
```json
{
  ...
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput`:
- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
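The mapping above can be sketched as a llama.cpp-shaped `timings` builder. This is an illustration under assumptions, not the patch's actual code: `build_timings` is hypothetical, and the field names (`prompt_ms`, `predicted_per_second`, `cache_n`, ...) are modeled on llama.cpp's server response.

```python
def build_timings(scheduled_ts, first_token_ts, last_token_ts,
                  prompt_tokens, completion_tokens, num_cached_tokens):
    """Map vLLM request timestamps/counters onto llama.cpp-style timings."""
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0      # prompt/prefill time
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0  # generation/decode time
    return {
        "cache_n": num_cached_tokens,
        "prompt_n": prompt_tokens,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_tokens * 1000.0 / prompt_ms if prompt_ms > 0 else 0.0,
        "predicted_n": completion_tokens,
        "predicted_ms": predicted_ms,
        "predicted_per_second": completion_tokens * 1000.0 / predicted_ms if predicted_ms > 0 else 0.0,
    }
```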
## Caveats
- Normal chat completion usage should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so clients must request usage reporting on streaming requests if they want timings in the stream.
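Per the last caveat, a streaming client has to opt into usage reporting (with the OpenAI-style API this is typically `stream_options: {"include_usage": true}`) and then read `timings` off the final chunk. A minimal client-side sketch, with illustrative chunk shapes:

```python
def timings_from_stream(chunks):
    """Return the `timings` object attached to the final usage chunk, if any."""
    timings = None
    for chunk in chunks:
        # the patch attaches timings alongside usage, which only the last chunk carries
        if chunk.get("usage") is not None:
            timings = chunk.get("timings")
    return timings
```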