# vLLM Timings Patch
This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible `timings` data. llama-swap already parses this `timings` object to populate cached tokens, prompt processing speed, and generation speed.
## Files
- `patch_timings_07351e088.py` — patch script that edits the installed vLLM sources on disk; run it inside the vLLM Docker container before `vllm serve`.
- `vllm-timings-07351e088.patch` — standard unified git patch against `vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08`.
## What The Patch Adds
The patch adds a top-level `timings` object to:
- `/v1/chat/completions` non-streaming responses
- `/v1/chat/completions` streaming final usage chunk
- `/v1/completions` non-streaming responses
- `/v1/completions` streaming final usage chunk
The object matches llama.cpp's fields:
```json
{
"prompt_n": 123,
"prompt_ms": 456.7,
"prompt_per_second": 269.3,
"predicted_n": 50,
"predicted_ms": 1000.0,
"predicted_per_second": 50.0,
"cache_n": 100
}
```
Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.num_cached_tokens`; a sketch of the mapping follows this list:
- prompt/prefill time: `first_token_ts - scheduled_ts`
- generation/decode time: `last_token_ts - first_token_ts`
- cached tokens: `num_cached_tokens`
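A minimal sketch of that mapping, assuming objects that expose the timestamps and token counts named above (the real patch hooks vLLM's response-construction code, and the exact attribute names vary by vLLM version):
```python
# Illustrative sketch only: builds a llama.cpp-style `timings` dict from the
# per-request values described above. Timestamps are assumed to be in seconds.
def build_timings(stats, num_prompt_tokens: int, num_generated_tokens: int,
                  num_cached_tokens: int) -> dict:
    prompt_ms = (stats.first_token_ts - stats.scheduled_ts) * 1000.0
    predicted_ms = (stats.last_token_ts - stats.first_token_ts) * 1000.0
    return {
        "prompt_n": num_prompt_tokens,
        "prompt_ms": prompt_ms,
        "prompt_per_second": num_prompt_tokens / (prompt_ms / 1000.0) if prompt_ms > 0 else 0.0,
        "predicted_n": num_generated_tokens,
        "predicted_ms": predicted_ms,
        "predicted_per_second": num_generated_tokens / (predicted_ms / 1000.0) if predicted_ms > 0 else 0.0,
        "cache_n": num_cached_tokens,
    }
```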
## Option 1: Runtime Docker Patch Script
Copy the script into the deployed patch directory:
```bash
cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py
```
Add the Docker mount in `/etc/nixos/modules/nixos/services/llama-swap/config.nix`:
```nix
-v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
```
Run it before `exec vllm serve` in `vllmCmd`:
```bash
python3 /patches/patch_timings_07351e088.py;
exec vllm serve ...
```
The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
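The idempotency check is just a scan for that marker; roughly (a simplified illustration of the pattern, not the script's actual contents):
```python
# Simplified illustration of the idempotency pattern: a file is only modified
# if it does not already carry the marker comment.
from pathlib import Path

MARKER = "# [patch_timings]"

def patch_file(path: Path, apply_edit) -> bool:
    """Apply `apply_edit` to the file's text unless it was already patched."""
    source = path.read_text()
    if MARKER in source:
        return False  # already patched; skip
    path.write_text(apply_edit(source) + "\n" + MARKER + "\n")
    return True
```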
## Option 2: Standard Patch File
Use this for a source checkout or future vLLM updates where conflicts can be resolved normally.
From a vLLM checkout at commit `07351e0883470724dd5a7e9730ed10e01fc99d08`:
```bash
git apply /path/to/_scratch/vllm-timings-07351e088.patch
```
Or with `patch`:
```bash
patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch
```
For future vLLM versions, try:
```bash
git apply --check /path/to/_scratch/vllm-timings-07351e088.patch
```
If the check fails, apply the patch manually or with `git apply --reject` and resolve the rejected hunks around the changed response-construction code.
## Verification Performed
The patch was checked against the Docker tag's pinned commit:
```text
vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
```
Validation done locally:
```bash
git apply --check _scratch/vllm-timings-07351e088.patch
git apply _scratch/vllm-timings-07351e088.patch
nix run nixpkgs#python3 -- -m py_compile \
vllm/entrypoints/openai/chat_completion/protocol.py \
vllm/entrypoints/openai/chat_completion/serving.py \
vllm/entrypoints/openai/completion/protocol.py \
vllm/entrypoints/openai/completion/serving.py
```
The runtime `patch_timings_07351e088.py` script was also tested against files extracted from the pinned commit and confirmed idempotent.
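For a runtime spot check against a patched, running server, a request like the following should return a top-level `timings` object (host, port, and model name are placeholders; adjust them to the actual llama-swap/vLLM endpoint):
```python
# Quick smoke test against a running patched server. URL and model name below
# are placeholders; substitute whatever llama-swap is actually serving.
import json
import urllib.request

payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Say hi."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# A patched server includes a top-level `timings` object in the response.
print(json.dumps(body.get("timings"), indent=2))
```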
## Caveats
- Timings for normal chat completion requests should be correct.
- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
- Streaming timings are attached only to the final usage chunk, so streaming clients must explicitly request usage if they want timings in the stream (see the example below).
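As an illustration of the streaming caveat, here is a sketch using the `openai` Python client; the endpoint, API key, and model name are placeholders, and `stream_options.include_usage` is the standard OpenAI-compatible way to request the final usage chunk (confirm the deployed vLLM build honors it):
```python
# Streaming example: the final usage chunk (and therefore the patched
# `timings` object) only appears when usage is explicitly requested.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
    stream_options={"include_usage": True},  # required to receive the final usage chunk
)

for chunk in stream:
    if chunk.usage is not None:
        # The patch attaches `timings` alongside this final usage chunk; it is
        # not part of the standard schema, so read it as an extra attribute.
        print(chunk.usage)
        print(getattr(chunk, "timings", None))
```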