Prompt Caching Deep Dive: Claude vs Gemini Implementation
The difference between "saving 90%" and "saving 0%" is in the implementation details.
Prompt caching is the single most impactful cost optimization for production LLM applications. But Anthropic and Google implement it differently, and the wrong approach can cost you more than no caching at all.
How Claude Caching Works
Anthropic uses ephemeral prefix matching. The system computes a hash of your prompt prefix. If an identical prefix was sent within the last 5 minutes (extendable to 1 hour), the cached portion is billed at 10% of the standard input rate. Note that writing the cache carries a premium over the standard input rate (25% for the 5-minute tier), so the first request costs slightly more than an uncached one.
The Rules
- Exact match required. A single character difference invalidates the cache.
- Prefix only. You cannot cache the middle or end of a prompt. The cacheable portion must start at message index 0.
- Minimum size. The cacheable block must be at least 1,024 tokens (Claude 3.5 Sonnet, Claude 3 Opus) or 2,048 tokens (Claude 3 Haiku).
- Breakpoints matter. You set `cache_control: { type: "ephemeral" }` on specific blocks to mark where caching should occur.
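The rules above translate into a small change to the request body. Here is a sketch of a Messages API payload with a breakpoint on the system prompt; only the payload shape is shown (sending it needs the `anthropic` SDK and an API key), and the model name and prompt are placeholder assumptions.

```python
# Sketch: Messages API request with a cache breakpoint on the system prompt.
# The repeated string stands in for a real system prompt that clears the
# 1,024-token minimum.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 400

payload = {
    "model": "claude-3-5-sonnet-20241022",  # example model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Varies per request, so it sits after the breakpoint and is never cached.
        {"role": "user", "content": "Where is my order?"}
    ],
}
```

Because matching is exact, anything above the breakpoint must be byte-identical across requests: reorder a tool definition or edit one word of the system prompt and the next request pays full price.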
Claude Caching Cost Example
System prompt: 8,000 tokens (cached)
User message: 200 tokens (uncached)
Standard cost: 8,200 × $3.00/1M = $0.0246
Cached cost: (8,000 × $0.30/1M) + (200 × $3.00/1M) = $0.0030
Savings: 87.8%
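The arithmetic above can be checked with a small helper. The rates are the Sonnet figures from the example ($3.00/1M standard input, $0.30/1M cached reads); the cache-write premium on the first request is omitted, as in the example.

```python
def claude_request_cost(cached_tokens: int, fresh_tokens: int,
                        input_rate: float = 3.00,
                        cached_rate: float = 0.30) -> float:
    """Input cost in dollars for one request; rates are $ per 1M tokens."""
    return (cached_tokens * cached_rate + fresh_tokens * input_rate) / 1_000_000

standard = claude_request_cost(0, 8_200)    # cache miss: everything at full rate
cached = claude_request_cost(8_000, 200)    # 8K-token system prompt cached
savings = 1 - cached / standard

print(f"${standard:.4f} -> ${cached:.4f} ({savings:.1%} saved)")
# -> $0.0246 -> $0.0030 (87.8% saved)
```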
How Gemini Caching Works
Google uses context caching, which is fundamentally different. Instead of ephemeral prefix matching, you create a persistent cache resource with a TTL of up to 40 days. You pay a storage fee per hour and a reduced input fee when using the cached context.
The Rules
- Storage-based. You upload context once, pay ~$4.50 per million tokens per hour of storage.
- Reusable across sessions. The same cached context can be referenced by multiple users or conversations.
- No exact-match requirement. As long as you reference the cache by ID, the content is reused.
- Best for: Large documents, video transcripts, or system prompts shared across thousands of requests.
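The lifecycle looks different from Claude's: you create a cache resource once, then reference it by name. The sketch below shows the two request bodies as plain dicts (nothing is sent); field names follow the Generative Language API's `cachedContents` resource, but the model name, resource name, and document are placeholder assumptions to verify against current docs.

```python
# Step 1: create the cache once, paying storage for its TTL.
create_cache_body = {
    "model": "models/gemini-1.5-pro-001",   # example model name
    "contents": [
        {"role": "user", "parts": [{"text": "<100K-token document>"}]}
    ],
    "ttl": "86400s",                        # 24 hours, billed as storage
}

# Step 2: each subsequent request references the cache by resource name
# instead of resending the document.
generate_body = {
    "cachedContent": "cachedContents/abc123",  # hypothetical resource name
    "contents": [
        {"role": "user", "parts": [{"text": "Summarize section 3."}]}
    ],
}
```

The key contrast with Claude: the per-request body stays tiny no matter how large the cached document is, and any user or session that knows the resource name can reuse it.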
Gemini Caching Cost Example
Document: 100,000 tokens stored for 24 hours
Storage cost: 100K × $4.50/1M × 24h = $10.80
Per-request input cost: 100K × $0.625/1M = $0.0625
If you make 500 requests:
Total: $10.80 + (500 × $0.0625) = $42.05
Without caching: 500 × (100K × $1.25/1M) = $62.50
Savings: 32.7%
If you make 5,000 requests:
Total: $10.80 + (5,000 × $0.0625) = $323.30
Without caching: 5,000 × (100K × $1.25/1M) = $625.00
Savings: 48.3%
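The two scenarios differ only in request count, so the break-even point can be computed directly. This sketch uses the rates from the examples above ($4.50/1M/hour storage, $0.625/1M cached input, $1.25/1M standard input); substitute current pricing for your model.

```python
def gemini_cached_total(requests: int, tokens: int = 100_000, hours: float = 24,
                        storage_rate: float = 4.50,
                        cached_input: float = 0.625) -> float:
    """Total dollars with context caching; rates are $ per 1M tokens."""
    storage = tokens * storage_rate / 1_000_000 * hours
    return storage + requests * tokens * cached_input / 1_000_000

def gemini_uncached_total(requests: int, tokens: int = 100_000,
                          input_rate: float = 1.25) -> float:
    return requests * tokens * input_rate / 1_000_000

for n in (500, 5_000):
    print(f"{n} requests: ${gemini_cached_total(n):.2f} cached "
          f"vs ${gemini_uncached_total(n):.2f} uncached")

# Break-even: the $10.80 storage fee must be recouped by the per-request
# discount ($0.1250 - $0.0625 = $0.0625), so roughly 173 requests per day.
```

Below that volume, the storage fee eats the discount and caching costs you money; well above it, savings approach the 50% input-rate reduction asymptotically.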
When to Use Which
| Scenario | Choose | Why |
|---|---|---|
| Same system prompt, varying user queries | Claude | Ephemeral caching handles this natively |
| Large document, thousands of Q&A sessions | Gemini | Persistent cache amortizes storage cost |
| Multi-turn chat with long history | Claude | Prefix matching captures rolling context |
| Shared knowledge base across users | Gemini | One cache resource, many consumers |
| Session-level caching (single user) | Claude | No storage overhead, instant activation |
Common Caching Mistakes
- Not placing the system prompt first. Claude caches a strict prefix; any content that appears before the cache breakpoint and varies between requests breaks the cache hit.
- Forgetting cache read is not free. Claude still charges $0.30/1M for cached input. It is cheap, not free.
- Over-caching short prompts. If your prompt is under 1,024 tokens, Claude caching is unavailable. Gemini caching has overhead that may not pay off under 100 requests.
- Ignoring cache TTL. Claude's 5-minute default TTL means bursts of requests benefit, but sporadic usage does not.
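The last two mistakes reduce to a simple eligibility check. This is an illustrative heuristic, not a vendor API; the 300-second TTL and 1,024-token minimum are the Sonnet defaults cited above, and it relies on the fact that each cache hit refreshes the TTL.

```python
def claude_cache_helps(prompt_tokens: int, seconds_between_requests: float,
                       ttl_seconds: int = 300,
                       min_tokens: int = 1_024) -> bool:
    """Ephemeral caching only pays off when the prompt meets the minimum
    size AND requests arrive within one TTL window (each hit refreshes
    the TTL, so steady traffic keeps the cache warm indefinitely)."""
    return prompt_tokens >= min_tokens and seconds_between_requests < ttl_seconds

assert claude_cache_helps(8_000, 30)        # chat burst: hits
assert not claude_cache_helps(8_000, 900)   # one request per 15 min: always misses
assert not claude_cache_helps(500, 30)      # below the 1,024-token minimum
```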
How AIWorkbench.dev Helps
The workbench visualizes cache hits and misses in real time. When you enable caching, the token counter shows two rows: standard input and cached input. The cost calculator factors TTL and request frequency into its monthly estimate.
Key Takeaway
Claude caching is for session-level, high-frequency prompts with identical prefixes. Gemini caching is for infrastructure-level, shared context with high volume. Choose the provider whose caching model matches your traffic pattern, not just whose API you already use.