Anthropic Prompt Caching: A Practical Guide

How Anthropic prompt caching works, the 1,024 token threshold, write vs read economics, and how to find cache-eligible blocks in your prompts.

Anthropic's prompt caching lets you pay significantly less for tokens your application sends repeatedly. Used correctly, it is one of the highest-leverage cost optimizations for Claude workloads.

How caching works

When you mark a prompt block with cache_control, Anthropic stores a fingerprint of that prefix. Subsequent requests with the same prefix can read from cache instead of reprocessing full input tokens.

Economics (Sonnet-class models)

Operation	Typical multiplier
Cache write	~1.25× base input price (once per TTL window)
Cache read	~0.10× base input price

Break-even is often 2 reads within the TTL — after that, every hit saves ~90% on those tokens.

The 1,024 token minimum

Blocks must meet Anthropic's minimum size (1,024 tokens) to be cache-eligible. Smaller system prompts need consolidation or bundling with adjacent static content.

What to cache

Best candidates:

System instructions — identical across requests
Tool definitions — stable agent configurations
RAG corpora — documents that rotate weekly, not per request

Avoid caching unique per-user content — it will never hit.

Finding cache opportunities manually is hard

Production prompts mix static and dynamic regions. Cache intelligence tools map each block, model write/read economics, and recommend cache_control placement.

Tokenistt detects cache-eligible regions automatically and projects monthly savings. Join the beta to try it.

Related: Reduce Claude API costs · Cache docs