How to Read Streaming Response Metadata
Every token comes with a receipt. Learn how to read it.
When you stream a response from an LLM, the provider sends more than just text. Hidden in the Server-Sent Events (SSE) stream is metadata that reveals cost, performance, and model behavior. Knowing how to read it separates power users from tourists.
What Metadata Is Available
Anthropic Claude
Claude streams responses as typed SSE events, most importantly message_start, content_block_delta, and message_delta. Key fields (illustrated right after the table):
| Field | Event Type | Meaning |
|---|---|---|
| input_tokens | message_start | Tokens in your prompt |
| output_tokens | message_delta | Tokens in the final response |
| stop_reason | message_delta | Why generation ended (end_turn, max_tokens, stop_sequence) |
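To make the table concrete, here is a trimmed sketch of what those events look like on the wire. The payloads are abbreviated and the token counts are placeholder values; a real stream also carries content_block_start, content_block_stop, message_stop, and ping events.
event: message_start
data: {"type":"message_start","message":{"usage":{"input_tokens":412,"output_tokens":1}}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":86}}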
OpenAI GPT-4o
OpenAI uses a simpler SSE format: text deltas arrive in choices[0].delta, and a final usage chunk follows, but only if you request it with stream_options: { include_usage: true }. A minimal extraction sketch follows the table:
| Field | Location | Meaning |
|---|---|---|
| prompt_tokens | Final usage chunk | Input token count |
| completion_tokens | Final usage chunk | Output token count |
| finish_reason | choices[0] | stop, length, content_filter |
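As a minimal sketch, assuming the official openai Node client: the model name and prompt below are placeholders, and the usage chunk only arrives because stream_options asks for it.
import OpenAI from 'openai';

const client = new OpenAI();
const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
  stream_options: { include_usage: true },  // without this, no usage chunk is sent
});

let finishReason = null;
let usage = null;
for await (const chunk of completion) {
  // Text chunks carry choices[0].delta; the final chunk has an empty
  // choices array and the usage object instead.
  if (chunk.choices[0]?.finish_reason) finishReason = chunk.choices[0].finish_reason;
  if (chunk.usage) usage = chunk.usage;
}
console.log(finishReason, usage?.prompt_tokens, usage?.completion_tokens);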
Google Gemini
Gemini streams through its streamGenerateContent endpoint, which returns JSON chunks (or SSE data: lines when you pass alt=sse). Look for usageMetadata in the final chunk; a sketch follows the table:
| Field | Meaning |
|---|---|
| promptTokenCount | Input tokens |
| candidatesTokenCount | Output tokens |
| totalTokenCount | Sum of both |
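A rough sketch of the same extraction against the REST streaming endpoint, assuming alt=sse so chunks arrive as data: lines, and an environment whose fetch body is async-iterable (Node 18+ or a recent browser). GEMINI_API_KEY and the model name are placeholders, and the chunk-boundary buffering shown in the next section is omitted for brevity.
const url = 'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent'
  + '?alt=sse&key=' + GEMINI_API_KEY;
const res = await fetch(url, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ contents: [{ parts: [{ text: 'Hello' }] }] }),
});

const decoder = new TextDecoder();
let usage = null;
for await (const chunk of res.body) {
  for (const line of decoder.decode(chunk, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const data = JSON.parse(line.slice(6));
    if (data.usageMetadata) usage = data.usageMetadata;  // the last chunk carries the final totals
  }
}
console.log(usage?.promptTokenCount, usage?.candidatesTokenCount, usage?.totalTokenCount);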
Decoding SSE in the Browser
Here is how to extract metadata from a raw Anthropic stream in JavaScript:
const decoder = new TextDecoder();
let buffer = '';         // holds a partial SSE line that spans chunk boundaries
let inputTokens = 0;
let outputTokens = 0;

for await (const chunk of stream) {
  // { stream: true } keeps multi-byte characters intact across chunks
  buffer += decoder.decode(chunk, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();  // keep the trailing (possibly incomplete) line for the next chunk

  for (const line of lines.filter(l => l.startsWith('data: '))) {
    const data = JSON.parse(line.slice(6));
    if (data.type === 'message_start') {
      // input_tokens is reported once, up front
      inputTokens = data.message.usage.input_tokens;
    }
    if (data.type === 'message_delta' && data.usage) {
      // output_tokens is cumulative; the last value is the final count
      outputTokens = data.usage.output_tokens;
    }
  }
}

// Illustrative rates: $3 per million input tokens, $15 per million output tokens
// (Claude Sonnet-class pricing at the time of writing; check current pricing).
const cost = (inputTokens / 1_000_000) * 3.0 + (outputTokens / 1_000_000) * 15.0;
console.log(`Cost: ${cost.toFixed(6)} USD`);
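One design note: data: lines can split across network chunks on long responses, which is why the example buffers the trailing partial line instead of parsing each chunk in isolation. Without that buffer, JSON.parse will intermittently throw on half-received lines.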
What the Metadata Tells You
1. Cost in Real Time
By accumulating tokens as they stream, you can display a running cost estimate to the user. This builds trust and prevents bill shock.
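A tiny helper makes that readout easy to wire up. The per-million-token rates below are illustrative placeholders (rates differ by model and change over time), and costLabel in the usage comment stands in for whatever UI element displays the number.
// Illustrative USD rates per million tokens; check your provider's pricing page.
const RATES = { inputPerM: 3.0, outputPerM: 15.0 };

function runningCost(inputTokens, outputTokens) {
  return (inputTokens / 1_000_000) * RATES.inputPerM
       + (outputTokens / 1_000_000) * RATES.outputPerM;
}

// Inside the streaming loop, update a live readout:
// costLabel.textContent = `$${runningCost(inputTokens, outputTokens).toFixed(4)}`;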
2. Detecting Truncation
If stop_reason is max_tokens, the model hit its limit. The response is incomplete. You should warn the user and suggest increasing max_tokens.
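In code this is a one-line check, assuming you also capture data.delta.stop_reason inside the message_delta handler from the earlier example:
// stopReason would be set from data.delta.stop_reason in the message_delta handler
if (stopReason === 'max_tokens') {
  console.warn('Response truncated: the model hit max_tokens before finishing.');
  // e.g. show a warning banner or retry with a larger max_tokens
}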
3. Caching Verification
For Claude, the cache_creation_input_tokens and cache_read_input_tokens fields in the usage metadata tell you whether your prompt caching actually saved money.
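Here is a rough sketch of that check. The cacheSavings helper is hypothetical, and the multipliers (cache reads at roughly a tenth of the base input rate, cache writes at a modest premium) are assumptions to verify against current Anthropic pricing.
// usage is the Claude usage object; cache fields are absent when caching is off.
// Multipliers are illustrative assumptions: reads ~0.1x, writes ~1.25x the base input rate.
function cacheSavings(usage, baseInputPerM = 3.0) {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const withoutCache = ((read + written) / 1_000_000) * baseInputPerM;
  const withCache = (read / 1_000_000) * baseInputPerM * 0.1
                  + (written / 1_000_000) * baseInputPerM * 1.25;
  return withoutCache - withCache;  // positive means caching saved money
}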
How AIWorkbench.dev Surfaces This
The workbench parses metadata from all supported providers in real time. The token counter, cost calculator, and streaming debugger all consume the same SSE stream you would parse manually — but we normalize the formats so you don't have to.
Key Takeaway
Don't treat LLM responses as plain text. They are structured data streams. Parse the metadata, surface it to users, and use it to debug why a response ended or cost what it did.