Anthropic Prompt Caching: A Practical Guide
How Anthropic prompt caching works, the 1,024 token threshold, write vs read economics, and how to find cache-eligible blocks in your prompts.
Anthropic's prompt caching lets you pay significantly less for tokens your application sends repeatedly. Used correctly, it is one of the highest-leverage cost optimizations for Claude workloads.
How caching works
When you mark a prompt block with cache_control, Anthropic stores a fingerprint of that prefix. Subsequent requests with the same prefix can read from cache instead of reprocessing full input tokens.
Economics (Sonnet-class models)
| Operation | Typical multiplier |
|---|---|
| Cache write | ~1.25× base input price (once per TTL window) |
| Cache read | ~0.10× base input price |
Break-even is often 2 reads within the TTL — after that, every hit saves ~90% on those tokens.
The 1,024 token minimum
Blocks must meet Anthropic's minimum size (1,024 tokens) to be cache-eligible. Smaller system prompts need consolidation or bundling with adjacent static content.
What to cache
Best candidates:
- System instructions — identical across requests
- Tool definitions — stable agent configurations
- RAG corpora — documents that rotate weekly, not per request
Avoid caching unique per-user content — it will never hit.
Finding cache opportunities manually is hard
Production prompts mix static and dynamic regions. Cache intelligence tools map each block, model write/read economics, and recommend cache_control placement.
Tokenistt detects cache-eligible regions automatically and projects monthly savings. Join the beta to try it.
Related: Reduce Claude API costs · Cache docs
Start monitoring LLM costs today
Join the Tokenistt waitlist for early access to AI cost management and LLM spend observability.