fix(usage): capture reasoning_tokens from completion_tokens_details on chat_completions (#57340)

normalize_usage only read output_tokens_details.reasoning_tokens (the
Responses API shape). Chat Completions providers — OpenAI, OpenRouter,
DeepSeek, and every OpenAI-compatible proxy — report it under
completion_tokens_details.reasoning_tokens, so reasoning_tokens was 0 for
every chat_completions reasoning model: hidden thinking was invisible in
session accounting, MoA traces, and the eval's per-task token columns.

Measured impact (HermesBench MoA run on deepseek-v4-flash, 4,828 advisor
calls): reasoning_tokens showed 0 everywhere while individual calls burned
up to 21.5K hidden thinking tokens to emit ~500 visible tokens. Verified
live against OpenRouter: deepseek-v4-flash returns
completion_tokens_details.reasoning_tokens=61 for a 74-completion-token
call; the field was simply never read.
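
As a rough illustration of the live OpenRouter figures above, the sketch
below computes what share of a call's completion tokens went to hidden
reasoning. It assumes completion_tokens includes reasoning tokens (as in
OpenAI's usage accounting); reasoning_share is a hypothetical helper, not
part of this change.

```python
def reasoning_share(reasoning_tokens: int, completion_tokens: int) -> float:
    """Fraction of completion tokens spent on hidden reasoning.

    Assumes completion_tokens already includes reasoning_tokens.
    """
    if completion_tokens <= 0:
        return 0.0
    return reasoning_tokens / completion_tokens


# Figures quoted in the commit message for the live OpenRouter call.
share = reasoning_share(61, 74)
visible = 74 - 61
print(f"{share:.1%} reasoning, {visible} visible tokens")
# prints "82.4% reasoning, 13 visible tokens"
```

With reasoning_tokens stuck at 0, that whole share was reported as
ordinary output.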

Responses-shape reads are unchanged; the new read fires only when the
Responses shape yields no reasoning_tokens (field missing or 0).
Teknium 2026-07-02 13:52:42 -07:00 committed by GitHub
parent ab942330fc
commit 3a122ba4ac
GPG key ID: B5690EEEBB952194

@@ -820,9 +820,22 @@ def normalize_usage(
    input_tokens = max(0, prompt_total - cache_read_tokens - cache_write_tokens)
    reasoning_tokens = 0
    # Responses API shape: output_tokens_details.reasoning_tokens.
    # Chat Completions shape (OpenAI, OpenRouter, DeepSeek, etc.):
    # completion_tokens_details.reasoning_tokens. Reading only the former
    # left reasoning_tokens=0 for every chat_completions reasoning model —
    # hidden thinking was invisible in session accounting even though it
    # dominates output spend on models like deepseek-v4-flash (measured:
    # single calls burning 21K reasoning tokens to emit 500 visible tokens).
    output_details = getattr(response_usage, "output_tokens_details", None)
    if output_details:
        reasoning_tokens = _to_int(getattr(output_details, "reasoning_tokens", 0))
    if not reasoning_tokens:
        completion_details = getattr(response_usage, "completion_tokens_details", None)
        if completion_details:
            reasoning_tokens = _to_int(
                getattr(completion_details, "reasoning_tokens", 0)
            )
    return CanonicalUsage(
        input_tokens=input_tokens,