How to Read Streaming Response Metadata
Every token comes with a receipt. Learn how to read it.
When you stream a response from an LLM, the provider sends more than just text. Hidden in the Server-Sent Events (SSE) stream is metadata that reveals cost, performance, and model behavior. Knowing how to read it separates power users from tourists.
What Metadata Is Available
Anthropic Claude
Claude streams responses as typed SSE events, most importantly message_start, content_block_delta, and message_delta. Key fields (illustrated right after the table):
| Field | Event Type | Meaning |
|---|---|---|
| input_tokens | message_start | Tokens in your prompt |
| output_tokens | message_delta | Tokens in the final response |
| stop_reason | message_delta | Why generation ended (end_turn, max_tokens, stop_sequence) |
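To make the table concrete, here is a trimmed sketch of what those events look like on the wire. The payloads are abbreviated and the token counts are placeholder values; a real stream also carries content_block_start, content_block_stop, message_stop, and ping events.
event: message_start
data: {"type":"message_start","message":{"usage":{"input_tokens":412,"output_tokens":1}}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":86}}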
OpenAI GPT-4o
OpenAI uses a simpler SSE format: text deltas arrive in choices[0].delta, and a final usage chunk follows, but only if you request it with stream_options: { include_usage: true }. A minimal extraction sketch follows the table:
| Field | Location | Meaning |
|---|---|---|
| prompt_tokens | Final usage chunk | Input token count |
| completion_tokens | Final usage chunk | Output token count |
| finish_reason | choices[0] | stop, length, content_filter |
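As a minimal sketch, assuming the official openai Node client: the model name and prompt below are placeholders, and the usage chunk only arrives because stream_options asks for it.
import OpenAI from 'openai';

const client = new OpenAI();
const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
  stream_options: { include_usage: true },  // without this, no usage chunk is sent
});

let finishReason = null;
let usage = null;
for await (const chunk of completion) {
  // Text chunks carry choices[0].delta; the final chunk has an empty
  // choices array and the usage object instead.
  if (chunk.choices[0]?.finish_reason) finishReason = chunk.choices[0].finish_reason;
  if (chunk.usage) usage = chunk.usage;
}
console.log(finishReason, usage?.prompt_tokens, usage?.completion_tokens);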
Google Gemini
Gemini streams through its streamGenerateContent endpoint, which returns JSON chunks (or SSE data: lines when you pass alt=sse). Look for usageMetadata in the final chunk; a sketch follows the table:
| Field | Meaning |
|---|---|
| promptTokenCount | Input tokens |
| candidatesTokenCount | Output tokens |
| totalTokenCount | Sum of both |
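A rough sketch of the same extraction against the REST streaming endpoint, assuming alt=sse so chunks arrive as data: lines, and an environment whose fetch body is async-iterable (Node 18+ or a recent browser). GEMINI_API_KEY and the model name are placeholders, and the chunk-boundary buffering shown in the next section is omitted for brevity.
const url = 'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent'
  + '?alt=sse&key=' + GEMINI_API_KEY;
const res = await fetch(url, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ contents: [{ parts: [{ text: 'Hello' }] }] }),
});

const decoder = new TextDecoder();
let usage = null;
for await (const chunk of res.body) {
  for (const line of decoder.decode(chunk, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const data = JSON.parse(line.slice(6));
    if (data.usageMetadata) usage = data.usageMetadata;  // the last chunk carries the final totals
  }
}
console.log(usage?.promptTokenCount, usage?.candidatesTokenCount, usage?.totalTokenCount);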
Decoding SSE in the Browser
Here is how to extract metadata from a raw Anthropic stream in JavaScript:
const decoder = new TextDecoder();
let buffer = '';         // holds a partial SSE line that spans chunk boundaries
let inputTokens = 0;
let outputTokens = 0;

for await (const chunk of stream) {
  // { stream: true } keeps multi-byte characters intact across chunks
  buffer += decoder.decode(chunk, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();  // keep the trailing (possibly incomplete) line for the next chunk

  for (const line of lines.filter(l => l.startsWith('data: '))) {
    const data = JSON.parse(line.slice(6));
    if (data.type === 'message_start') {
      // input_tokens is reported once, up front
      inputTokens = data.message.usage.input_tokens;
    }
    if (data.type === 'message_delta' && data.usage) {
      // output_tokens is cumulative; the last value is the final count
      outputTokens = data.usage.output_tokens;
    }
  }
}

// Illustrative rates: $3 per million input tokens, $15 per million output tokens
// (Claude Sonnet-class pricing at the time of writing; check current pricing).
const cost = (inputTokens / 1_000_000) * 3.0 + (outputTokens / 1_000_000) * 15.0;
console.log(`Cost: ${cost.toFixed(6)} USD`);
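One design note: data: lines can split across network chunks on long responses, which is why the example buffers the trailing partial line instead of parsing each chunk in isolation. Without that buffer, JSON.parse will intermittently throw on half-received lines.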
What the Metadata Tells You
1. Cost in Real Time
By accumulating tokens as they stream, you can display a running cost estimate to the user. This builds trust and prevents bill shock.
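A tiny helper makes that readout easy to wire up. The per-million-token rates below are illustrative placeholders (rates differ by model and change over time), and costLabel in the usage comment stands in for whatever UI element displays the number.
// Illustrative USD rates per million tokens; check your provider's pricing page.
const RATES = { inputPerM: 3.0, outputPerM: 15.0 };

function runningCost(inputTokens, outputTokens) {
  return (inputTokens / 1_000_000) * RATES.inputPerM
       + (outputTokens / 1_000_000) * RATES.outputPerM;
}

// Inside the streaming loop, update a live readout:
// costLabel.textContent = `$${runningCost(inputTokens, outputTokens).toFixed(4)}`;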
2. Detecting Truncation
If stop_reason is max_tokens, the model hit its limit. The response is incomplete. You should warn the user and suggest increasing max_tokens.
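In code this is a one-line check, assuming you also capture data.delta.stop_reason inside the message_delta handler from the earlier example:
// stopReason would be set from data.delta.stop_reason in the message_delta handler
if (stopReason === 'max_tokens') {
  console.warn('Response truncated: the model hit max_tokens before finishing.');
  // e.g. show a warning banner or retry with a larger max_tokens
}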
3. Caching Verification
For Claude, the cache_creation_input_tokens and cache_read_input_tokens fields in the usage metadata tell you whether your prompt caching actually saved money.
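Here is a rough sketch of that check. The cacheSavings helper is hypothetical, and the multipliers (cache reads at roughly a tenth of the base input rate, cache writes at a modest premium) are assumptions to verify against current Anthropic pricing.
// usage is the Claude usage object; cache fields are absent when caching is off.
// Multipliers are illustrative assumptions: reads ~0.1x, writes ~1.25x the base input rate.
function cacheSavings(usage, baseInputPerM = 3.0) {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const withoutCache = ((read + written) / 1_000_000) * baseInputPerM;
  const withCache = (read / 1_000_000) * baseInputPerM * 0.1
                  + (written / 1_000_000) * baseInputPerM * 1.25;
  return withoutCache - withCache;  // positive means caching saved money
}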
How AIWorkbench.dev Surfaces This
The workbench parses metadata from all supported providers in real time. The token counter, cost calculator, and streaming debugger all consume the same SSE stream you would parse manually — but we normalize the formats so you don't have to.
Key Takeaway
Don't treat LLM responses as plain text. They are structured data streams. Parse the metadata, surface it to users, and use it to debug why a response ended or cost what it did.