fix(llama-swap): update vllm timings patch

2026-05-11 09:40:13 -04:00
parent 187c717383
commit ecad94aab3
6 changed files with 119 additions and 246 deletions
--- a/modules/nixos/services/llama-swap/patches/README.md
+++ b/modules/nixos/services/llama-swap/patches/README.md
@@ -1,22 +1,54 @@
 # vLLM Timings Patch

-This scratch directory contains two ways to patch vLLM so its OpenAI-compatible responses include llama.cpp-compatible `timings` data. llama-swap already parses this `timings` object to populate cached tokens, prompt processing speed, and generation speed.
+This directory contains the custom timings patch for the current vLLM Docker image used by the llama-swap module:
+
+```text
+vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
+```
+
+The patch adds a top-level llama.cpp-compatible `timings` object to OpenAI-compatible responses so llama-swap can populate cached tokens, prompt processing speed, and generation speed.

 ## Files

-- `patch_timings_07351e088.py` — disk-edit patch script for running inside the vLLM Docker container before `vllm serve`.
-- `vllm-timings-07351e088.patch` — standard unified git patch against `vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08`.
+- `patch_timings_1acd67a.py` — idempotent boot-time disk-edit patch script for the vLLM Docker container.
+- `vllm-timings-1acd67a.patch` — equivalent standard unified git patch against the current image's vLLM source.

-## What The Patch Adds
+## Runtime Script

-The patch adds a top-level `timings` object to:
+Deploy the script under `/mnt/ssd/vLLM/Patches/` and mount it into the container:

-- `/v1/chat/completions` non-streaming responses
-- `/v1/chat/completions` streaming final usage chunk
-- `/v1/completions` non-streaming responses
-- `/v1/completions` streaming final usage chunk
+```nix
+-v /mnt/ssd/vLLM/Patches/patch_timings_1acd67a.py:/patches/patch_timings_1acd67a.py:ro \
+```

-The object matches llama.cpp's fields:
+Run it before `exec vllm serve`:
+
+```bash
+python3 /patches/patch_timings_1acd67a.py;
+exec vllm serve ...
+```
+
+The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
+
+## Standard Patch
+
+For a source checkout at commit `1acd67a795ebccdf9b9db7697ae9082058301657`:
+
+```bash
+git apply --check /path/to/vllm-timings-1acd67a.patch
+git apply /path/to/vllm-timings-1acd67a.patch
+```
+
+At container runtime, applying the `.patch` directly is possible if the image has `patch` or `git` installed:
+
+```bash
+cd /usr/local/lib/python3.12/dist-packages
+patch -p1 < /patches/vllm-timings-1acd67a.patch
+```
+
+The Python script remains the safer boot-time option because it is idempotent and does not depend on external patch tools being present in the Docker image.
+
+## Timings Fields

 ```json
 {
@@ -35,78 +67,3 @@ Data comes from vLLM's existing internal `RequestStateStats` and `RequestOutput.
 - prompt/prefill time: `first_token_ts - scheduled_ts`
 - generation/decode time: `last_token_ts - first_token_ts`
 - cached tokens: `num_cached_tokens`
-
-## Option 1: Runtime Docker Patch Script
-
-Copy the script into the deployed patch directory:
-
-```bash
-cp _scratch/patch_timings_07351e088.py /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py
-```
-
-Add the Docker mount in `/etc/nixos/modules/nixos/services/llama-swap/config.nix`:
-
-```nix
--v /mnt/ssd/vLLM/Patches/patch_timings_07351e088.py:/patches/patch_timings_07351e088.py:ro \
-```
-
-Run it before `exec vllm serve` in `vllmCmd`:
-
-```bash
-python3 /patches/patch_timings_07351e088.py;
-exec vllm serve ...
-```
-
-The script is idempotent. Re-running it skips files that already contain `# [patch_timings]`.
-
-## Option 2: Standard Patch File
-
-Use this for a source checkout or future vLLM updates where conflicts can be resolved normally.
-
-From a vLLM checkout at commit `07351e0883470724dd5a7e9730ed10e01fc99d08`:
-
-```bash
-git apply /path/to/_scratch/vllm-timings-07351e088.patch
-```
-
-Or with `patch`:
-
-```bash
-patch -p1 < /path/to/_scratch/vllm-timings-07351e088.patch
-```
-
-For future vLLM versions, try:
-
-```bash
-git apply --check /path/to/_scratch/vllm-timings-07351e088.patch
-```
-
-If it fails, apply manually or with rejects and resolve conflicts around the changed response-construction code.
-
-## Verification Performed
-
-The patch was checked against the Docker tag's pinned commit:
-
-```text
-vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
-```
-
-Validation done locally:
-
-```bash
-git apply --check _scratch/vllm-timings-07351e088.patch
-git apply _scratch/vllm-timings-07351e088.patch
-nix run nixpkgs#python3 -- -m py_compile \
-  vllm/entrypoints/openai/chat_completion/protocol.py \
-  vllm/entrypoints/openai/chat_completion/serving.py \
-  vllm/entrypoints/openai/completion/protocol.py \
-  vllm/entrypoints/openai/completion/serving.py
-```
-
-The runtime `patch_timings_07351e088.py` script was also tested against files extracted from the pinned commit and confirmed idempotent.
-
-## Caveats
-
-- Normal chat completion usage should be correct.
-- `/v1/completions` with multiple prompts returns aggregate token counts, but the timing values come from the last completed request. Single-prompt completions are the expected use case.
-- Streaming timings are attached only to the final usage chunk, so clients must request/include usage for streaming if they want timings in the stream.
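
Note on the README above: it derives prefill time as `first_token_ts - scheduled_ts` and decode time as `last_token_ts - first_token_ts`. The following is a minimal sketch of how such a `timings` object could be assembled, not the patch's actual `_compute_timings` body. `StatsSketch` and `compute_timings_sketch` are hypothetical names, timestamps are assumed to be in seconds, and the `prompt_*` field names are assumed by analogy with the `predicted_*` fields named in the script docstring.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical stand-in for per-request stats; timestamps assumed in seconds."""
    scheduled_ts: float
    first_token_ts: float
    last_token_ts: float


def compute_timings_sketch(stats, prompt_n, predicted_n, cache_n):
    """Build a llama.cpp-style timings dict from three request timestamps."""
    # Prefill: from scheduling until the first token is produced.
    prompt_ms = max(stats.first_token_ts - stats.scheduled_ts, 0.0) * 1000.0
    # Decode: from the first token until the last token.
    predicted_ms = max(stats.last_token_ts - stats.first_token_ts, 0.0) * 1000.0

    def per_second(n, ms):
        # Report 0.0 instead of dividing by zero when no time elapsed.
        return n / (ms / 1000.0) if ms > 0 else 0.0

    return {
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_second": per_second(prompt_n, prompt_ms),
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_second": per_second(predicted_n, predicted_ms),
        "cache_n": cache_n,
    }
```

Whether `prompt_n` should exclude cached tokens is a design decision this sketch leaves open; it passes the counts through unchanged.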
--- a/modules/nixos/services/llama-swap/patches/patch_timings_07351e088.py
+++ b/modules/nixos/services/llama-swap/patches/patch_timings_07351e088.py
@@ -1,5 +1,5 @@
 """
-Disk-edit patch for vLLM nightly-07351e0883470724dd5a7e9730ed10e01fc99d08:
+Disk-edit patch for vLLM nightly-1acd67a795ebccdf9b9db7697ae9082058301657:
 inject llama.cpp-compatible `timings` into chat/completion API responses.

 Adds `timings` to:
@@ -13,7 +13,7 @@ The `timings` object matches llama.cpp fields consumed by llama-swap:
  predicted_n, predicted_ms, predicted_per_second, cache_n

 Usage, before `exec vllm serve`:
-  python3 /patches/patch_timings.py
+  python3 /patches/patch_timings_1acd67a.py
 """

 import logging
@@ -85,70 +85,8 @@ def _write(path, content):

 def _replace_once(content, old, new, label):
    count = content.count(old)
-    if count == 1:
-        return content.replace(old, new, 1)
-
-    # vLLM v0.20 added system_fingerprint to response constructors. Preserve
-    # compatibility with the original dev205 anchors by retrying with that
-    # field inserted when the old anchor is not present.
-    variants = [
-        (
-            old.replace(
-                "                    usage=final_usage,\n                )",
-                "                    usage=final_usage,\n                    system_fingerprint=self.system_fingerprint,\n                )",
-            ),
-            new.replace(
-                "                    usage=final_usage,\n                )",
-                "                    usage=final_usage,\n                    system_fingerprint=self.system_fingerprint,\n                )",
-            ),
-        ),
-        (
-            old.replace(
-                "            usage=usage,\n            prompt_logprobs=",
-                "            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            prompt_logprobs=",
-            ),
-            new.replace(
-                "            usage=usage,\n            prompt_logprobs=",
-                "            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            prompt_logprobs=",
-            ),
-        ),
-        (
-            old.replace(
-                "                    usage=final_usage_info,\n                )",
-                "                    usage=final_usage_info,\n                    system_fingerprint=self.system_fingerprint,\n                )",
-            ),
-            new.replace(
-                "                    usage=final_usage_info,\n                )",
-                "                    usage=final_usage_info,\n                    system_fingerprint=self.system_fingerprint,\n                )",
-            ),
-        ),
-        (
-            old.replace(
-                "            usage=usage,\n            kv_transfer_params=kv_transfer_params,",
-                "            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            kv_transfer_params=kv_transfer_params,",
-            ),
-            new.replace(
-                "            usage=usage,\n            kv_transfer_params=kv_transfer_params,",
-                "            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            kv_transfer_params=kv_transfer_params,",
-            ),
-        ),
-    ]
-    matches = [(variant_old, variant_new) for variant_old, variant_new in variants if content.count(variant_old) == 1]
-    if len(matches) == 1:
-        variant_old, variant_new = matches[0]
-        return content.replace(variant_old, variant_new, 1)
-
-    variant_counts = [content.count(variant_old) for variant_old, _ in variants]
-    raise RuntimeError(f"{label}: anchor matched {count} times; v0.20 variants matched {variant_counts}")
-
-
-def _replace_once_any(content, replacements, label):
-    """Replace exactly one of several version-specific anchors."""
-    matches = [(old, new) for old, new in replacements if content.count(old) == 1]
-    if len(matches) != 1:
-        counts = [content.count(old) for old, _ in replacements]
-        raise RuntimeError(f"{label}: versioned anchors matched {counts}")
-    old, new = matches[0]
+    if count != 1:
+        raise RuntimeError(f"{label}: anchor matched {count} times")
    return content.replace(old, new, 1)


@@ -231,19 +169,19 @@ def _patch_chat_serving(vllm_dir):
            label,
        )

-        # Streaming Final Usage Chunk - pinned image has no system_fingerprint arg.
+        # Streaming Final Usage Chunk
        content = _replace_once(
            content,
-            '''                final_usage_chunk = ChatCompletionStreamResponse(\n                    id=request_id,\n                    object=chunk_object_type,\n                    created=created_time,\n                    choices=[],\n                    model=model_name,\n                    usage=final_usage,\n                )\n''',
-            f'''                final_usage_chunk = ChatCompletionStreamResponse(\n                    id=request_id,\n                    object=chunk_object_type,\n                    created=created_time,\n                    choices=[],\n                    model=model_name,\n                    usage=final_usage,\n                )\n                # Inject Timings  {PATCH_TAG}\n                try:\n                    _s_cached = _last_stream_res.num_cached_tokens\n                    final_usage_chunk.timings = _compute_timings(\n                        _last_stream_res.metrics,\n                        num_prompt_tokens, completion_tokens, _s_cached,\n                    )\n                except NameError:\n                    pass\n''',
+            '''                final_usage_chunk = ChatCompletionStreamResponse(\n                    id=request_id,\n                    object=chunk_object_type,\n                    created=created_time,\n                    choices=[],\n                    model=model_name,\n                    usage=final_usage,\n                    system_fingerprint=self.system_fingerprint,\n                )\n''',
+            f'''                final_usage_chunk = ChatCompletionStreamResponse(\n                    id=request_id,\n                    object=chunk_object_type,\n                    created=created_time,\n                    choices=[],\n                    model=model_name,\n                    usage=final_usage,\n                    system_fingerprint=self.system_fingerprint,\n                )\n                # Inject Timings  {PATCH_TAG}\n                try:\n                    _s_cached = _last_stream_res.num_cached_tokens\n                    final_usage_chunk.timings = _compute_timings(\n                        _last_stream_res.metrics,\n                        num_prompt_tokens, completion_tokens, _s_cached,\n                    )\n                except NameError:\n                    pass\n''',
            label,
        )

-        # Non-Streaming Response - pinned image has no system_fingerprint arg.
+        # Non-Streaming Response
        content = _replace_once(
            content,
-            '''        response = ChatCompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            prompt_logprobs=clamp_prompt_logprobs(final_res.prompt_logprobs),\n            prompt_token_ids=(\n                final_res.prompt_token_ids if request.return_token_ids else None\n            ),\n            kv_transfer_params=final_res.kv_transfer_params,\n        )\n''',
-            f'''        response = ChatCompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            prompt_logprobs=clamp_prompt_logprobs(final_res.prompt_logprobs),\n            prompt_token_ids=(\n                final_res.prompt_token_ids if request.return_token_ids else None\n            ),\n            kv_transfer_params=final_res.kv_transfer_params,\n        )\n\n        # Inject Timings  {PATCH_TAG}\n        _cached = final_res.num_cached_tokens\n        response.timings = _compute_timings(\n            final_res.metrics, num_prompt_tokens, num_generated_tokens,\n            _cached,\n        )\n''',
+            '''        response = ChatCompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            prompt_logprobs=clamp_prompt_logprobs(final_res.prompt_logprobs),\n            prompt_token_ids=(\n                final_res.prompt_token_ids if request.return_token_ids else None\n            ),\n            kv_transfer_params=final_res.kv_transfer_params,\n            prompt_routed_experts=prompt_routed_experts,\n        )\n''',
+            f'''        response = ChatCompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            prompt_logprobs=clamp_prompt_logprobs(final_res.prompt_logprobs),\n            prompt_token_ids=(\n                final_res.prompt_token_ids if request.return_token_ids else None\n            ),\n            kv_transfer_params=final_res.kv_transfer_params,\n            prompt_routed_experts=prompt_routed_experts,\n        )\n\n        # Inject Timings  {PATCH_TAG}\n        _cached = final_res.num_cached_tokens\n        response.timings = _compute_timings(\n            final_res.metrics, num_prompt_tokens, num_generated_tokens,\n            _cached,\n        )\n''',
            label,
        )
    except RuntimeError as e:
@@ -284,19 +222,19 @@ def _patch_completion_serving(vllm_dir):
            label,
        )

-        # Streaming Final Usage Chunk - pinned image has no system_fingerprint arg.
+        # Streaming Final Usage Chunk
        content = _replace_once(
            content,
-            '''                final_usage_chunk = CompletionStreamResponse(\n                    id=request_id,\n                    created=created_time,\n                    model=model_name,\n                    choices=[],\n                    usage=final_usage_info,\n                )\n''',
-            f'''                final_usage_chunk = CompletionStreamResponse(\n                    id=request_id,\n                    created=created_time,\n                    model=model_name,\n                    choices=[],\n                    usage=final_usage_info,\n                )\n                # Inject Timings  {PATCH_TAG}\n                try:\n                    _sc_cached = _last_comp_res.num_cached_tokens\n                    final_usage_chunk.timings = _compute_timings(\n                        _last_comp_res.metrics,\n                        total_prompt_tokens, total_completion_tokens,\n                        _sc_cached,\n                    )\n                except NameError:\n                    pass\n''',
+            '''                final_usage_chunk = CompletionStreamResponse(\n                    id=request_id,\n                    created=created_time,\n                    model=model_name,\n                    choices=[],\n                    usage=final_usage_info,\n                    system_fingerprint=self.system_fingerprint,\n                )\n''',
+            f'''                final_usage_chunk = CompletionStreamResponse(\n                    id=request_id,\n                    created=created_time,\n                    model=model_name,\n                    choices=[],\n                    usage=final_usage_info,\n                    system_fingerprint=self.system_fingerprint,\n                )\n                # Inject Timings  {PATCH_TAG}\n                try:\n                    _sc_cached = _last_comp_res.num_cached_tokens\n                    final_usage_chunk.timings = _compute_timings(\n                        _last_comp_res.metrics,\n                        total_prompt_tokens, total_completion_tokens,\n                        _sc_cached,\n                    )\n                except NameError:\n                    pass\n''',
            label,
        )

-        # Non-Streaming Response - pinned image has no system_fingerprint arg.
+        # Non-Streaming Response
        content = _replace_once(
            content,
-            '''        return CompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            kv_transfer_params=kv_transfer_params,\n        )\n''',
-            f'''        _comp_response = CompletionResponse(  {PATCH_TAG}\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            kv_transfer_params=kv_transfer_params,\n        )\n        # Inject Timings  {PATCH_TAG}\n        if last_final_res is not None:\n            _comp_cached = last_final_res.num_cached_tokens\n            _comp_response.timings = _compute_timings(\n                last_final_res.metrics, num_prompt_tokens,\n                num_generated_tokens, _comp_cached,\n            )\n        return _comp_response\n''',
+            '''        return CompletionResponse(\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            kv_transfer_params=kv_transfer_params,\n            prompt_routed_experts=prompt_routed_experts,\n        )\n''',
+            f'''        _comp_response = CompletionResponse(  {PATCH_TAG}\n            id=request_id,\n            created=created_time,\n            model=model_name,\n            choices=choices,\n            usage=usage,\n            system_fingerprint=self.system_fingerprint,\n            kv_transfer_params=kv_transfer_params,\n            prompt_routed_experts=prompt_routed_experts,\n        )\n        # Inject Timings  {PATCH_TAG}\n        if last_final_res is not None:\n            _comp_cached = last_final_res.num_cached_tokens\n            _comp_response.timings = _compute_timings(\n                last_final_res.metrics, num_prompt_tokens,\n                num_generated_tokens, _comp_cached,\n            )\n        return _comp_response\n''',
            label,
        )
    except RuntimeError as e:
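
Note on the script diff above: `_replace_once` now fails unless the anchor matches exactly once, and the README states that files already containing `# [patch_timings]` are skipped. The sketch below shows how those two behaviours might combine, under stated assumptions. `replace_once` mirrors the simplified helper in this commit, while `patch_text` and its return shape are hypothetical and not taken from the script.

```python
PATCH_TAG = "# [patch_timings]"


def replace_once(content, old, new, label):
    """Replace `old` with `new` only if the anchor occurs exactly once."""
    count = content.count(old)
    if count != 1:
        raise RuntimeError(f"{label}: anchor matched {count} times")
    return content.replace(old, new, 1)


def patch_text(content, old, new, label):
    """Hypothetical wrapper: return (text, changed), skipping patched text."""
    if PATCH_TAG in content:
        # Already patched on an earlier boot; leave the text untouched.
        return content, False
    return replace_once(content, old, new, label), True


if __name__ == "__main__":
    source = "resp = Response(\n    usage=usage,\n)\n"
    anchor = "    usage=usage,\n)\n"
    patched = anchor + f"resp.timings = None  {PATCH_TAG}\n"

    text, changed = patch_text(source, anchor, patched, "demo")
    print(changed)  # first run applies the edit
    text, changed = patch_text(text, anchor, patched, "demo")
    print(changed)  # second run is a no-op
```

Failing hard on zero or multiple matches means an upstream change to the response constructors surfaces as an error at boot rather than as a silently unpatched or mispatched file.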
--- a/modules/nixos/services/llama-swap/patches/vllm-timings-07351e088.patch
+++ b/modules/nixos/services/llama-swap/patches/vllm-timings-07351e088.patch
@@ -1,8 +1,8 @@
 diff --git a/vllm/entrypoints/openai/chat_completion/protocol.py b/vllm/entrypoints/openai/chat_completion/protocol.py
-index aacac38..074ca45 100644
+index 742f9cc..ade939f 100644
 --- a/vllm/entrypoints/openai/chat_completion/protocol.py
 +++ b/vllm/entrypoints/openai/chat_completion/protocol.py
-@@ -111,6 +111,9 @@ class ChatCompletionResponse(OpenAIBaseModel):
+@@ -115,6 +115,9 @@ class ChatCompletionResponse(OpenAIBaseModel):
         default=None, description="KVTransfer parameters."
     )
 
@@ -12,7 +12,7 @@ index aacac38..074ca45 100644
 
 class ChatCompletionResponseStreamChoice(OpenAIBaseModel):
     index: int
-@@ -132,6 +135,9 @@ class ChatCompletionStreamResponse(OpenAIBaseModel):
+@@ -139,6 +142,9 @@ class ChatCompletionStreamResponse(OpenAIBaseModel):
     # not part of the OpenAI spec but for tracing the tokens
     prompt_token_ids: list[int] | None = None
 
@@ -23,10 +23,10 @@ index aacac38..074ca45 100644
 class ChatCompletionToolsParam(OpenAIBaseModel):
     type: Literal["function"] = "function"
 diff --git a/vllm/entrypoints/openai/chat_completion/serving.py b/vllm/entrypoints/openai/chat_completion/serving.py
-index 12dc2cd..c15fb6d 100644
+index 1026e0a..a9c5708 100644
 --- a/vllm/entrypoints/openai/chat_completion/serving.py
 +++ b/vllm/entrypoints/openai/chat_completion/serving.py
-@@ -83,6 +83,34 @@ if TYPE_CHECKING:
+@@ -79,6 +79,34 @@ if TYPE_CHECKING:
 logger = init_logger(__name__)
 
 
@@ -61,7 +61,7 @@ index 12dc2cd..c15fb6d 100644
 class OpenAIServingChat(OpenAIServing):
     def __init__(
         self,
-@@ -633,6 +661,7 @@ class OpenAIServingChat(OpenAIServing):
+@@ -485,6 +513,7 @@ class OpenAIServingChat(OpenAIServing):
 
         try:
             async for res in result_generator:
@@ -69,9 +69,9 @@ index 12dc2cd..c15fb6d 100644
                 if res.prompt_token_ids is not None:
                     num_prompt_tokens = len(res.prompt_token_ids)
                     if res.encoder_prompt_token_ids is not None:
-@@ -1230,6 +1259,15 @@ class OpenAIServingChat(OpenAIServing):
-                     model=model_name,
+@@ -947,6 +976,15 @@ class OpenAIServingChat(OpenAIServing):
                     usage=final_usage,
+                     system_fingerprint=self.system_fingerprint,
                 )
 +                # Inject Timings  # [patch_timings]
 +                try:
@@ -85,8 +85,8 @@ index 12dc2cd..c15fb6d 100644
                 final_usage_data = final_usage_chunk.model_dump_json(
                     exclude_unset=True, exclude_none=True
                 )
-@@ -1644,6 +1682,13 @@ class OpenAIServingChat(OpenAIServing):
-             kv_transfer_params=final_res.kv_transfer_params,
+@@ -1377,6 +1415,13 @@ class OpenAIServingChat(OpenAIServing):
+             prompt_routed_experts=prompt_routed_experts,
         )
 
 +        # Inject Timings  # [patch_timings]
@@ -100,10 +100,10 @@ index 12dc2cd..c15fb6d 100644
         if self.enable_log_outputs and self.request_logger:
             for choice in choices:
 diff --git a/vllm/entrypoints/openai/completion/protocol.py b/vllm/entrypoints/openai/completion/protocol.py
-index c785d25..85928f4 100644
+index 7bb3c8d..8487e93 100644
 --- a/vllm/entrypoints/openai/completion/protocol.py
 +++ b/vllm/entrypoints/openai/completion/protocol.py
-@@ -485,6 +485,9 @@ class CompletionResponse(OpenAIBaseModel):
+@@ -489,6 +489,9 @@ class CompletionResponse(OpenAIBaseModel):
         default=None, description="KVTransfer parameters."
     )
 
@@ -113,15 +113,18 @@ index c785d25..85928f4 100644
 
 class CompletionResponseStreamChoice(OpenAIBaseModel):
     index: int
-@@ -512,3 +515,6 @@ class CompletionStreamResponse(OpenAIBaseModel):
+@@ -516,6 +519,9 @@ class CompletionStreamResponse(OpenAIBaseModel):
     model: str
     choices: list[CompletionResponseStreamChoice]
     usage: UsageInfo | None = Field(default=None)
 +
 +    # llama.cpp-compatible per-request timings  # [patch_timings]
 +    timings: dict[str, Any] | None = None
+     # Set only on the final chunk of a stream to mirror non-streaming responses
+     # without the per-chunk serialization overhead.
+     system_fingerprint: str | None = None
 diff --git a/vllm/entrypoints/openai/completion/serving.py b/vllm/entrypoints/openai/completion/serving.py
-index fb7f253..11a5350 100644
+index ee4ca9f..8b27011 100644
 --- a/vllm/entrypoints/openai/completion/serving.py
 +++ b/vllm/entrypoints/openai/completion/serving.py
@@ -48,6 +48,34 @@ if TYPE_CHECKING:
@@ -159,7 +162,7 @@ index fb7f253..11a5350 100644
 class OpenAIServingCompletion(OpenAIServing):
     def __init__(
         self,
-@@ -290,6 +318,7 @@ class OpenAIServingCompletion(OpenAIServing):
+@@ -291,6 +319,7 @@ class OpenAIServingCompletion(OpenAIServing):
 
         try:
             async for prompt_idx, res in result_generator:
@@ -167,9 +170,9 @@ index fb7f253..11a5350 100644
                 prompt_token_ids = res.prompt_token_ids
                 prompt_logprobs = res.prompt_logprobs
 
-@@ -434,6 +463,16 @@ class OpenAIServingCompletion(OpenAIServing):
-                     choices=[],
+@@ -445,6 +474,16 @@ class OpenAIServingCompletion(OpenAIServing):
                     usage=final_usage_info,
+                     system_fingerprint=self.system_fingerprint,
                 )
 +                # Inject Timings  # [patch_timings]
 +                try:
@@ -184,18 +187,18 @@ index fb7f253..11a5350 100644
                 final_usage_data = final_usage_chunk.model_dump_json(
                     exclude_unset=False, exclude_none=True
                 )
-@@ -556,7 +595,7 @@ class OpenAIServingCompletion(OpenAIServing):
-         request_metadata.final_usage_info = usage
-         if final_res_batch:
-             kv_transfer_params = final_res_batch[0].kv_transfer_params
+@@ -577,7 +616,7 @@ class OpenAIServingCompletion(OpenAIServing):
+             if pre is not None:
+                 prompt_routed_experts = pre.tolist()
+ 
 -        return CompletionResponse(
 +        _comp_response = CompletionResponse(  # [patch_timings]
             id=request_id,
             created=created_time,
             model=model_name,
-@@ -564,6 +603,14 @@ class OpenAIServingCompletion(OpenAIServing):
-             usage=usage,
+@@ -587,6 +626,14 @@ class OpenAIServingCompletion(OpenAIServing):
             kv_transfer_params=kv_transfer_params,
+             prompt_routed_experts=prompt_routed_experts,
         )
 +        # Inject Timings  # [patch_timings]
 +        if last_final_res is not None: