# vLLM Timings Patch

This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible `timings` data. llama-swap already parses this `timings` object to populate cached tokens, prompt processing speed, and generation speed.

## Files

- `patch_timings_07351e088.py` — disk-edit patch script for running inside the vLLM Docker container before `vllm serve`.
- `vllm-timings-07351e088.patch` — standard unified git patch against `vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08`.

## What The Patch Adds

The patch adds a top-level `timings` object to:

- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk

The object matches llama.cpp's fields:

```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```

Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`:

- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`

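The mapping above can be sketched as a small helper. This is an illustrative reconstruction, not vLLM's actual code; the timestamp parameters correspond to the `RequestStateStats` values listed above, in seconds:

```python
def build_timings(scheduled_ts: float, first_token_ts: float, last_token_ts: float,
                  prompt_n: int, predicted_n: int, cache_n: int) -> dict:
    """Assemble a llama.cpp-style `timings` object from request timestamps
    (seconds) and token counts."""
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0      # prefill time
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0  # decode time
    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_n / prompt_ms * 1000.0 if prompt_ms > 0 else None,
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_n / predicted_ms * 1000.0 if predicted_ms > 0 else None,
        "cache_n": cache_n,
    }
```

With the example values from the JSON above (123 prompt tokens over 456.7 ms, 50 generated tokens over 1000 ms), this reproduces the `prompt_per_second` and `predicted_per_second` figures shown.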
## Option 1: Runtime Docker Patch Script

Copy the script into the deployed patch directory:

```bash
cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py
```

Add the Docker mount in `/etc/nixos/modules/nixos/services/llama-swap/config.nix`:

```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
```

Run it before `exec vllm serve` in `vllmCmd`:

```bash
python3 /patches/patch_timings_07351e088.py
exec vllm serve ...
```

The script is idempotent: re-running it skips any file that already contains the `# [patch_timings]` marker.

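The idempotency guard can be sketched like this; a hypothetical reconstruction of the script's skip logic, not its actual code (`patch_once` and `edit` are illustrative names):

```python
from pathlib import Path

MARKER = "# [patch_timings]"

def patch_once(path: Path, edit) -> bool:
    """Rewrite `path` via `edit` and append the marker; skip already-patched files."""
    text = path.read_text()
    if MARKER in text:
        return False  # marker present: file was patched on an earlier run
    path.write_text(edit(text) + "\n" + MARKER + "\n")
    return True
```

Because the marker is written into the file itself, a second run in the same container is a no-op.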
## Option 2: Standard Patch File

Use this for a source checkout, or for future vLLM updates where conflicts can be resolved normally.

From a vLLM checkout at commit `07351e0883470724dd5a7e9730ed10e01fc99d08`:

```bash
git apply /path/to/_scratch/vllm-timings-07351e088.patch
```

Or with `patch`:

Or with `patch`:

```bash
patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch
```

For future vLLM versions, try:

```bash
git apply --check /path/to/_scratch/vllm-timings-07351e088.patch
```

If the check fails, apply with `git apply --reject` (or by hand) and resolve the conflicts around the changed response-construction code.

## Verification Performed

The patch was checked against the Docker tag's pinned commit:

```text
vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
```

Validation done locally:

```bash
git apply --check _scratch/vllm-timings-07351e088.patch
git apply _scratch/vllm-timings-07351e088.patch
nix run nixpkgs#python3 -- -m py_compile \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  vllm/entrypoints/openai/chat_completion/serving.py \
  vllm/entrypoints/openai/completion/protocol.py \
  vllm/entrypoints/openai/completion/serving.py
```

The runtime `patch_timings_07351e088.py` script was also tested against files extracted from the pinned commit and confirmed idempotent.

## Caveats

- Normal chat completion usage should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request; single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so clients must opt into usage reporting for the stream if they want timings in it.
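
One way for a client to opt in is the standard OpenAI `stream_options` field, sketched below (the model name is a placeholder):

```python
import json

# Streaming chat request that asks for the final usage chunk; with the
# patch applied, that chunk also carries the `timings` object.
payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
body = json.dumps(payload)
```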