# vLLM Timings Patch
This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:
```text
vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
```
The patch adds a top-level llama.cpp-compatible `timings` object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.
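For illustration, a patched chat completion response carries the extra object at the top level alongside the standard OpenAI fields. The values and IDs below are made up, and only a subset of the timings fields is shown (the full list is documented in the Timings Fields section below):
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "choices": [
    {"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 123, "completion_tokens": 50, "total_tokens": 173},
  "timings": {"prompt_n": 123, "predicted_n": 50, "cache_n": 100}
}
```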
## Files
- `patch_timings_1acd67a.py` — idempotent boot-time disk-edit patch script for the vLLM Docker container.
- `vllm-timings-1acd67a.patch` — equivalent standard unified git patch against the current image's vLLM source.
## Runtime Script
Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:
```shell
-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
```
Run it before `exec vllm serve`:
```bash
python3 /patches/patch_timings_1acd67a.py;
exec vllm serve ...
```
The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
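The marker-based idempotency check works like this minimal sketch. The function, file path, and edit are illustrative, not the actual script; only the `# [patch_timings]` marker string comes from the real patcher:
```python
from pathlib import Path

MARKER = "# [patch_timings]"


def patch_file(path: Path, old: str, new: str) -> bool:
    """Apply a disk edit once; skip files that already carry the marker."""
    text = path.read_text()
    if MARKER in text:
        return False  # already patched; re-running is a no-op
    # Apply the edit and stamp the file so later runs skip it.
    path.write_text(text.replace(old, new) + "\n" + MARKER + "\n")
    return True
```
Because the decision is based on the marker rather than on tracking state elsewhere, the script can safely run on every container boot.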
## Standard Patch
For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:
```bash
git apply --check /path/to/vllm-timings-1acd67a.patch
git apply /path/to/vllm-timings-1acd67a.patch
```
At container runtime, applying the `.patch` directly is possible if the image has `patch` or `git` installed:
```bash
cd /usr/local/lib/python3.12/dist-packages
patch -p1 < /patches/vllm-timings-1acd67a.patch
```
The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.
## Timings Fields
```json
{
  "prompt_n": 123,
  "prompt_ms": 456.7,
  "prompt_per_second": 269.3,
  "predicted_n": 50,
  "predicted_ms": 1000.0,
  "predicted_per_second": 50.0,
  "cache_n": 100
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`:
- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
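Given those conventions, the timings object can be assembled from the three timestamps and the token counts roughly as follows. The function and argument names are illustrative (vLLM's actual stats plumbing differs); the per-second rates are simply token counts divided by elapsed milliseconds, scaled by 1000:
```python
def build_timings(prompt_n, predicted_n, cache_n,
                  scheduled_ts, first_token_ts, last_token_ts):
    """Assemble a llama.cpp-style timings dict from per-request stats.

    Timestamps are in seconds; the timings fields use milliseconds.
    """
    prompt_ms = (first_token_ts - scheduled_ts) * 1000.0
    predicted_ms = (last_token_ts - first_token_ts) * 1000.0
    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": prompt_n / prompt_ms * 1000.0 if prompt_ms else 0.0,
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": predicted_n / predicted_ms * 1000.0 if predicted_ms else 0.0,
        "cache_n": cache_n,
    }
```
For example, a request scheduled at t=0.0 s whose first token arrives at t=0.5 s and last token at t=1.5 s yields `prompt_ms = 500.0` and `predicted_ms = 1000.0`.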