# vLLM Timings Patch

This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:

```text
vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
```

The patch adds a top-level llama.cpp-compatible `timings` object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.
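
For illustration, the object sits at the top level of the response body next to the standard OpenAI fields. The surrounding fields below are abbreviated placeholders; only `timings` comes from the patch:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": ["..."],
  "usage": {"prompt_tokens": 123, "completion_tokens": 50},
  "timings": {"prompt_n": 123, "predicted_n": 50, "cache_n": 100}
}
```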

## Files

- `patch_timings_1acd67a.py` — idempotent patch script that edits the installed vLLM sources on disk at container boot.
- `vllm-timings-1acd67a.patch` — the same change as a standard unified git patch against the vLLM source at the image's commit.

## Runtime Script

Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:

```bash
-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
```
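
Put together, a boot-time invocation might look like the sketch below. The `--gpus` and port flags, the `--entrypoint` override, and the `&&` guard (so the server does not start unpatched if the script fails) are illustrative assumptions, not the module's actual command line:

```bash
# Sketch only: GPU, port, and model arguments depend on the deployment.
docker run --rm --gpus all -p 8000:8000 \
  -v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
  --entrypoint /bin/bash \
  vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657 \
  -c 'python3 /patches/patch_timings_1acd67a.py && exec vllm serve ...'
```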

Run it before `exec vllm serve`:

```bash
python3 /patches/patch_timings_1acd67a.py
exec vllm serve ...
```

The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
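
Conceptually, the guard looks like this minimal sketch; the `patch_once` name and the `apply_edit` hook are illustrative assumptions, and the real script's edits are specific to the `1acd67a` source tree:

```python
# Minimal sketch of the idempotency guard, not the real patch script.
from pathlib import Path

MARKER = "# [patch_timings]"

def patch_once(path: Path, apply_edit) -> bool:
    """Edit a file on disk exactly once, keyed on the marker comment."""
    text = path.read_text()
    if MARKER in text:
        return False  # already patched on an earlier boot; skip
    path.write_text(apply_edit(text) + "\n" + MARKER + "\n")
    return True
```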

## Standard Patch

For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:

```bash
git apply --check /path/to/vllm-timings-1acd67a.patch  # dry run: verify it applies cleanly
git apply /path/to/vllm-timings-1acd67a.patch
```

The `.patch` can also be applied directly at container runtime, provided the image ships `patch` or `git`:

```bash
cd /usr/local/lib/python3.12/dist-packages
patch -p1 < /patches/vllm-timings-1acd67a.patch
```
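
Here, `-p1` strips the leading `a/` and `b/` components from the diff's paths, so the hunks resolve relative to the `dist-packages` directory where the image installs vLLM.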

The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.

## Timings Fields

```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```
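
The `*_per_second` fields are derived rates: `prompt_per_second = prompt_n / prompt_ms * 1000` (here 123 tokens / 456.7 ms ≈ 269.3 tok/s), and likewise `predicted_per_second = 50 / 1000.0 ms = 50.0` tok/s.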

Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`, as sketched after this list:

- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
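
A minimal sketch of that mapping, assuming the timestamps (in seconds) and counts above are already collected per request; the actual plumbing inside vLLM differs by version:

```python
# Hedged sketch of deriving the llama.cpp-style timings object from
# per-request stats; argument names mirror the fields listed above.
def build_timings(scheduled_ts: float, first_token_ts: float,
                  last_token_ts: float, prompt_n: int, predicted_n: int,
                  num_cached_tokens: int) -> dict:
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0
    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_n / prompt_ms * 1000.0 if prompt_ms else 0.0,
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_n / predicted_ms * 1000.0 if predicted_ms else 0.0,
        "cache_n": num_cached_tokens,
    }
```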