Prompt Caching Deep Dive: Claude vs Gemini Implementation

The difference between "saving 90%" and "saving 0%" is in the implementation details.

Prompt caching is the single most impactful cost optimization for production LLM applications. But Anthropic and Google implement it differently, and the wrong approach can cost you more than no caching at all.

How Claude Caching Works

Anthropic uses ephemeral prefix matching. The system computes a hash of your prompt prefix. If an identical prefix was sent within the last 5 minutes (a window extendable to 1 hour), the cached portion is billed at 10% of the standard input rate; writing the cache on the first request costs 25% more than standard input at the default TTL.

The Rules

  1. Exact match required. A single character difference invalidates the cache.
  2. Prefix only. You cannot cache the middle or end of a prompt. The cacheable portion must start at message index 0.
  3. Minimum size. The cacheable block must be at least 1,024 tokens (Claude 3.5 Sonnet, Claude 3 Opus) or 2,048 tokens (Claude 3.5 Haiku, Claude 3 Haiku).
  4. Breakpoints matter. You set cache_control: { type: "ephemeral" } on specific content blocks to mark where caching should occur, as shown in the sketch below.
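
Rule 4 in practice is a single field on the content block you want cached. A minimal sketch with the Anthropic Python SDK; the model ID, prompt text, and question are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # hypothetical: assumed to be >= 1,024 tokens

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does clause 4.2 mean?"}],
)

# The usage block reports cache activity separately:
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache
```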

Claude Caching Cost Example

System prompt: 8,000 tokens (cached)
User message: 200 tokens (uncached)

Standard cost: 8,200 × $3.00/1M = $0.0246
Cached cost: (8,000 × $0.30/1M) + (200 × $3.00/1M) = $0.0030
Savings: 87.8%
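
That figure assumes a warm cache. Folding in the 25% cache-write premium on the first request shows how savings accrue over a burst; a back-of-the-envelope sketch using the Sonnet rates above:

```python
BASE = 3.00 / 1_000_000    # $/input token, Claude 3.5 Sonnet
CACHE_WRITE = BASE * 1.25  # first request writes the cache at a 25% premium
CACHE_READ = BASE * 0.10   # later requests read it at 10% of base

SYSTEM_TOKENS, USER_TOKENS = 8_000, 200

def total_cost(n_requests: int) -> float:
    """Cost of n requests sharing one cached 8K-token system prompt."""
    first = SYSTEM_TOKENS * CACHE_WRITE + USER_TOKENS * BASE
    rest = (n_requests - 1) * (SYSTEM_TOKENS * CACHE_READ + USER_TOKENS * BASE)
    return first + rest

print(total_cost(1))    # ~$0.0306, slightly above the uncached $0.0246
print(total_cost(100))  # ~$0.33 vs $2.46 uncached: roughly 87% savings
```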

How Gemini Caching Works

Google uses context caching, which is fundamentally different. Instead of ephemeral prefix matching, you create a persistent cache resource with an explicit TTL (one hour by default, extendable for as long as you are willing to pay storage). You pay a storage fee per hour and a reduced input fee when using the cached context.

The Rules

  1. Storage-based. You upload the context once and pay for storage by the hour, roughly $4.50 per million tokens per hour at Gemini 1.5 Pro rates.
  2. Reusable across sessions. The same cached context can be referenced by multiple users or conversations.
  3. No exact-match requirement. As long as you reference the cache by ID, the content is reused.
  4. Best for: large documents, video transcripts, or system prompts shared across thousands of requests. The sketch below shows the create-once, reference-many pattern.
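
Here is a minimal sketch using the google-generativeai Python SDK; the model name, TTL, and document_text are placeholders for this example:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

document_text = "..."  # hypothetical: the large shared context

# Create the cache once; storage is billed for its lifetime (the TTL).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    system_instruction="Answer questions about the attached document.",
    contents=[document_text],
    ttl=datetime.timedelta(hours=24),
)

# Any session or user can reference the same cache by handle.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("What does section 3 conclude?")
print(response.text)
```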

Gemini Caching Cost Example

Document: 100,000 tokens stored for 24 hours
Storage cost: 100K × $4.50/1M × 24h = $10.80
Per-request input cost: 100K × $0.625/1M = $0.0625

If you make 500 requests:
Total: $10.80 + (500 × $0.0625) = $42.05
Without caching: 500 × (100K × $1.25/1M) = $62.50
Savings: 32.7%

If you make 5,000 requests:
Total: $10.80 + (5,000 × $0.0625) = $323.30
Without caching: 5,000 × (100K × $1.25/1M) = $625.00
Savings: 48.3%
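
Before committing to storage fees, it is worth computing the crossover point: caching wins once the per-request savings exceed the storage bill. Using the numbers from this example:

```python
STORAGE_PER_DAY = 10.80    # 100K tokens x $4.50/1M x 24 h
UNCACHED_PER_REQ = 0.125   # 100K tokens x $1.25/1M
CACHED_PER_REQ = 0.0625    # 100K tokens x $0.625/1M

# Caching pays off once n requests save more than storage costs:
# n x (UNCACHED_PER_REQ - CACHED_PER_REQ) > STORAGE_PER_DAY
breakeven = STORAGE_PER_DAY / (UNCACHED_PER_REQ - CACHED_PER_REQ)
print(f"Break-even: {breakeven:.0f} requests per 24-hour window")  # ~173
```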

When to Use Which

| Scenario | Choose | Why |
|---|---|---|
| Same system prompt, varying user queries | Claude | Ephemeral caching handles this natively |
| Large document, thousands of Q&A sessions | Gemini | Persistent cache amortizes storage cost |
| Multi-turn chat with long history | Claude | Prefix matching captures rolling context |
| Shared knowledge base across users | Gemini | One cache resource, many consumers |
| Session-level caching (single user) | Claude | No storage overhead, instant activation |

Common Caching Mistakes

  1. Not placing the cached block first. Claude builds the cacheable prefix in order: tools, then system, then messages. Anything that varies ahead of your cache breakpoint, such as per-request tool definitions, invalidates it.
  2. Forgetting cache read is not free. Claude still charges $0.30/1M for cached input. It is cheap, not free.
  3. Over-caching short prompts. If your prompt is under 1,024 tokens, Claude caching is unavailable (see the sketch after this list for a pre-flight check). Gemini caching carries storage overhead that only pays off past the break-even volume computed earlier, roughly 173 requests per day in our example.
  4. Ignoring cache TTL. Claude's 5-minute default TTL means bursts of requests benefit, but sporadic usage does not; each cache hit refreshes the window, so steady traffic keeps the cache warm.
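
For mistake 3, one defensive pattern is to measure the block before marking it cacheable. A sketch using the Anthropic SDK's token-counting endpoint; note the count includes the placeholder user message, so treat it as an approximation:

```python
import anthropic

client = anthropic.Anthropic()
MIN_CACHEABLE = 1_024  # Claude 3.5 Sonnet minimum cacheable block

system_prompt = "..."  # hypothetical: the prompt you intend to cache

count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system=[{"type": "text", "text": system_prompt}],
    messages=[{"role": "user", "content": "placeholder"}],
)

system_block = {"type": "text", "text": system_prompt}
if count.input_tokens >= MIN_CACHEABLE:
    # Large enough to cache: mark the block as ephemeral.
    system_block["cache_control"] = {"type": "ephemeral"}
```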

How AIWorkbench.dev Helps

The workbench visualizes cache hits and misses in real time. When you enable caching, the token counter shows two rows: standard input and cached input. The cost calculator factors TTL and request frequency into its monthly estimate.

Key Takeaway

Claude caching is for session-level, high-frequency prompts with identical prefixes. Gemini caching is for infrastructure-level, shared context with high volume. Choose the provider whose caching model matches your traffic pattern, not just whose API you already use.