feat: vllm timings patch
New file: modules/nixos/services/llama-swap/patches/README.md (112 lines)

# vLLM Timings Patch

This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible `timings` data. llama-swap already parses this `timings` object to populate cached-token counts, prompt-processing speed, and generation speed.

## Files

- `patch_timings_07351e088.py` — disk-edit patch script for running inside the vLLM Docker container before `vllm serve`.
- `vllm-timings-07351e088.patch` — standard unified git patch against `vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08`.
## What The Patch Adds

The patch adds a top-level `timings` object to:

- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk

The object matches llama.cpp's fields:
```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`:

- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
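The mapping above can be sketched in Python. This is an illustrative reconstruction, not vLLM's actual internals: the function name and parameter names are invented, and timestamps are assumed to be in seconds while the output fields are milliseconds and tokens/second, matching llama.cpp's naming.

```python
def llama_cpp_timings(scheduled_ts: float, first_token_ts: float,
                      last_token_ts: float, prompt_n: int,
                      predicted_n: int, cache_n: int) -> dict:
    """Build a llama.cpp-style `timings` object from request timestamps.

    Timestamps are in seconds; output fields are milliseconds and
    tokens/second, following llama.cpp's field names.
    """
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0      # prefill time
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0  # decode time
    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_n * 1000.0 / prompt_ms if prompt_ms else 0.0,
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_n * 1000.0 / predicted_ms if predicted_ms else 0.0,
        "cache_n": cache_n,
    }

# Example: 0.5 s prefill for 123 prompt tokens, then 1.0 s decoding 50 tokens.
t = llama_cpp_timings(10.0, 10.5, 11.5, prompt_n=123, predicted_n=50, cache_n=100)
```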
## Option 1: Runtime Docker Patch Script

Copy the script into the deployed patch directory:

```bash
cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py
```

Add the Docker bind mount in `/etc/nixos/modules/nixos/services/llama-swap/config.nix`:

```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
```
Run it before `exec vllm serve` in `vllmCmd`:

```bash
python3 /patches/patch_timings_07351e088.py
exec vllm serve ...
```

The script is idempotent: re-running it skips files that already contain the `# [patch_timings]` marker.
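The marker-based idempotency check can be illustrated with a minimal sketch. The marker string matches the one the script uses; the function and the file contents here are invented for illustration, not the script's actual logic.

```python
MARKER = "# [patch_timings]"

def patch_source(text: str) -> str:
    """Append an illustrative patch block unless the file is already patched."""
    if MARKER in text:  # already patched: leave the file untouched
        return text
    return text + f"\n{MARKER}\n# ... patched code would go here ...\n"

original = "def build_response():\n    pass\n"
once = patch_source(original)   # first run appends the patched block
twice = patch_source(once)      # second run is a no-op
```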
## Option 2: Standard Patch File

Use this for a source checkout or for future vLLM updates, where conflicts can be resolved normally.

From a vLLM checkout at commit `07351e0883470724dd5a7e9730ed10e01fc99d08`:

```bash
git apply /path/to/_scratch/vllm-timings-07351e088.patch
```

Or with `patch`:

```bash
patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch
```
For future vLLM versions, first check whether the patch still applies cleanly:

```bash
git apply --check /path/to/_scratch/vllm-timings-07351e088.patch
```

If the check fails, apply with rejects (`git apply --reject`) or by hand, and resolve the conflicts around the changed response-construction code.
## Verification Performed

The patch was checked against the Docker tag's pinned commit:

```text
vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
```

Validation done locally:

```bash
git apply --check _scratch/vllm-timings-07351e088.patch
git apply _scratch/vllm-timings-07351e088.patch
nix run nixpkgs#python3 -- -m py_compile \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  vllm/entrypoints/openai/chat_completion/serving.py \
  vllm/entrypoints/openai/completion/protocol.py \
  vllm/entrypoints/openai/completion/serving.py
```

The runtime `patch_timings_07351e088.py` script was also tested against files extracted from the pinned commit and confirmed idempotent.
## Caveats

- Normal chat-completion usage should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so streaming clients must opt into usage reporting if they want timings in the stream.
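A streaming client therefore only needs to inspect the final usage chunk. The sketch below parses a hypothetical final SSE payload; the field layout follows the patch's description, but the values and the chunk itself are invented:

```python
import json

# Hypothetical final SSE data line from a patched /v1/chat/completions stream.
final_chunk = (
    '{"id": "cmpl-1", "choices": [], '
    '"usage": {"prompt_tokens": 123, "completion_tokens": 50, "total_tokens": 173}, '
    '"timings": {"prompt_n": 123, "prompt_ms": 456.7, "prompt_per_second": 269.3, '
    '"predicted_n": 50, "predicted_ms": 1000.0, "predicted_per_second": 50.0, '
    '"cache_n": 100}}'
)

chunk = json.loads(final_chunk)
timings = chunk.get("timings")  # present only on the final usage chunk
if timings is not None:
    gen_speed = timings["predicted_per_second"]  # tokens/s during decode
```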