
How to Use Claude Prompt Caching to Reduce API Costs by 90%

Claude's prompt caching feature dramatically reduces API costs for repetitive workflows. Here's how it works, how to implement it, and what the real-world savings look like.

Luke Thompson

Co-founder, The Operations Guide

Anthropic quietly launched prompt caching last month. For API users with repetitive workflows, it's a game-changer that can reduce costs by 70-90%. If you're running Claude API calls with large repeated context - system prompts, documentation, knowledge bases - prompt caching cuts costs dramatically while improving response times.

## What Prompt Caching Is

Prompt caching lets Claude remember and reuse parts of prompts across API calls. Instead of sending the same context (documentation, examples, knowledge base) with every API request, you mark it as cacheable. Claude stores that context and reuses it for subsequent requests.

**The savings:** Cached content costs 10% of normal input token pricing:

- Regular input: $3 per million tokens (Claude 3.5 Sonnet)
- Cached input: $0.30 per million tokens (90% reduction)

(Writing content to the cache costs 25% more than base input - $3.75 per million tokens on Claude 3.5 Sonnet - but that's a one-time premium that subsequent cache reads quickly pay back.)

For workflows with large repeated context, this adds up to massive savings.

## How It Works

Prompt caching operates at the API level. When you make a Claude API request, you can mark specific content as cacheable.

**Technical flow:**

1. Send an API request with cache control markers
2. Claude processes the request and caches the marked content (5-minute TTL)
3. Subsequent requests within 5 minutes reference the cached content
4. Claude reuses the cached content instead of reprocessing it
5. You pay the cache read price (10% of normal) instead of the full input price

**Cache duration:** Caches last 5 minutes. Any request within 5 minutes of the last cache use extends the cache another 5 minutes. For active workflows, caches effectively stay warm indefinitely.

## Implementation

Add cache control to your API requests:

```python
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Large static context you want cached (must be at least 1,024 tokens)
product_documentation = "..."  # e.g. your product docs or knowledge base

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support assistant for Acme Corp.",
        },
        {
            "type": "text",
            "text": product_documentation,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```

The `cache_control` marker tells Claude to cache the prompt prefix up to and including that content block.

**What to cache:**

- System prompts and instructions
- Product documentation
- Knowledge base content
- Examples and few-shot prompts
- Company information and context
- Style guides and templates

**What not to cache:**

- User questions (they change with every request)
- Dynamic data (changes frequently)
- Short content (the cache overhead isn't worth it for small text)

## Cost Savings Examples

### Customer Support Bot

**Scenario:** A support bot with 50,000 tokens of product documentation sent with every customer question.

**Without caching:**

- 1,000 daily conversations
- 50,000 tokens documentation + ~500 tokens question = 50,500 tokens per request
- 50,500 × 1,000 = 50,500,000 tokens/day
- Cost: 50.5M × $3/1M = $151.50/day = **$4,545/month**

**With caching:**

- First request: 50,500 tokens (full price; these examples ignore the one-time cache-write premium, which is negligible at this volume)
- Subsequent 999 requests: 50,000 tokens (cached at 10%) + 500 tokens (normal)
- First request: 50,500 × $3/1M = $0.15
- Next 999: ((50,000 × $0.30/1M) + (500 × $3/1M)) × 999 = $16.48
- Total: $16.63/day = **$499/month**

**Savings: $4,046/month (89% reduction)**

### Document Analysis Pipeline

**Scenario:** Analyzing customer contracts against company policies. The policy document is 30,000 tokens, sent with each contract analysis.

**Without caching:**

- 200 contracts/day
- 30,000 tokens policy + 5,000 tokens contract on average = 35,000 tokens per analysis
- 35,000 × 200 = 7,000,000 tokens/day
- Cost: 7M × $3/1M = $21/day = **$630/month**

**With caching:**

- Policy document cached (30,000 tokens at 10% price)
- Only the contract varies (5,000 tokens at full price)
- First request: 35,000 × $3/1M = $0.105
- Next 199: ((30,000 × $0.30/1M) + (5,000 × $3/1M)) × 199 = $4.78
- Total: $4.88/day = **$146/month**

**Savings: $484/month (77% reduction)**

### API-Powered Feature

**Scenario:** A SaaS product with a Claude-powered feature. The system prompt with examples is 20,000 tokens.

**Without caching:**

- 10,000 requests/day
- 20,000 tokens system + 1,000 tokens user input on average = 21,000 tokens per request
- 21,000 × 10,000 = 210,000,000 tokens/day
- Cost: 210M × $3/1M = $630/day = **$18,900/month**

**With caching:**

- System prompt cached (20,000 tokens)
- User input varies (1,000 tokens)
- First request: 21,000 × $3/1M = $0.063
- Next 9,999: ((20,000 × $0.30/1M) + (1,000 × $3/1M)) × 9,999 = $89.99
- Total: $90.05/day = **$2,702/month**

**Savings: $16,198/month (86% reduction)**
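The arithmetic in these examples is easy to reproduce for your own traffic. Here's a minimal back-of-the-envelope calculator, assuming Claude 3.5 Sonnet pricing and ignoring output tokens and the one-time cache-write premium:

```python
def daily_input_cost(requests, cached_tokens, dynamic_tokens,
                     input_price=3.00, cache_read_price=0.30, cached=True):
    """Estimate daily input-token cost in USD.

    Prices are per million tokens (defaults: Claude 3.5 Sonnet).
    With caching, the first request pays full price and the rest
    pay the cache read price on the static portion.
    """
    if not cached:
        return requests * (cached_tokens + dynamic_tokens) * input_price / 1e6
    first = (cached_tokens + dynamic_tokens) * input_price / 1e6
    rest = (requests - 1) * (cached_tokens * cache_read_price
                             + dynamic_tokens * input_price) / 1e6
    return first + rest

# Customer support bot: 1,000 requests/day, 50k static + 500 dynamic tokens
print(daily_input_cost(1_000, 50_000, 500, cached=False))  # ~151.50
print(daily_input_cost(1_000, 50_000, 500))                # ~16.63
```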
## Best Practices

**Structure prompts for caching:** Put stable content (system prompts, documentation) before dynamic content (user input). Claude caches content in order, so placing static content first maximizes cache hits.

**Monitor cache hit rates:** Anthropic's API returns cache statistics with every response. Check your hit rate - it should be 95%+ for well-structured workflows (a sketch of this check follows this section).

**Optimize cache duration:** The 5-minute TTL works for most use cases. For very high-volume APIs, requests naturally extend the cache. For bursty traffic, expect lower hit rates.

**Test cache effectiveness:** Run a day of production traffic with and without caching enabled, and measure the actual cost reduction and latency improvement.

**Combine with model routing:** Use Haiku for simple tasks, Sonnet for moderate complexity, and Opus for complex reasoning. Add caching to all three for maximum savings (see the routing sketch below).
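Each response's `usage` object reports cache activity alongside regular input tokens: `cache_creation_input_tokens` (tokens written to the cache) and `cache_read_input_tokens` (tokens served from it). A minimal sketch of a hit-rate check - the `cache_hit_rate` helper is hypothetical, something you'd adapt to your own logging:

```python
# Inspect cache activity on a single response (from the implementation example)
usage = response.usage
print(f"uncached input tokens: {usage.input_tokens}")
print(f"cache write tokens:    {usage.cache_creation_input_tokens}")
print(f"cache read tokens:     {usage.cache_read_input_tokens}")

def cache_hit_rate(usages):
    """Fraction of cacheable tokens served from cache across many requests.

    `usages` is an iterable of response.usage objects collected from
    production traffic; aim for 95%+ in well-structured workflows.
    """
    reads = sum(u.cache_read_input_tokens or 0 for u in usages)
    writes = sum(u.cache_creation_input_tokens or 0 for u in usages)
    total = reads + writes
    return reads / total if total else 0.0
```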
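And a sketch of the routing idea - the tier names and `route` helper are hypothetical, and the classification logic is up to you. One caveat: caches are model-specific, so each model in the router warms (and pays for) its own cache, which works best when every tier sees steady traffic:

```python
# Hypothetical complexity tiers mapped to model IDs
MODEL_BY_TIER = {
    "simple": "claude-3-5-haiku-20241022",
    "moderate": "claude-3-5-sonnet-20241022",
    "complex": "claude-3-opus-20240229",
}

def route(client, tier, system_blocks, user_message, max_tokens=1024):
    """Send the request to the model for this tier, reusing the same
    system blocks (with cache_control on the static ones) across calls."""
    return client.messages.create(
        model=MODEL_BY_TIER[tier],
        max_tokens=max_tokens,
        system=system_blocks,
        messages=[{"role": "user", "content": user_message}],
    )
```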
## Performance Benefits

Prompt caching doesn't just reduce costs - it also improves response times. Claude doesn't need to reprocess cached content, which cuts latency by 20-30% on average.

**Before caching:** Large system prompt → 3-4 second response times

**After caching:** Same prompt → 2-3 second response times

For user-facing features, this latency improvement is noticeable.

## Limitations

**5-minute TTL:** Caches expire after 5 minutes of inactivity. Low-traffic applications won't benefit as much.

**Minimum cache size:** Content must be at least 1,024 tokens to cache. Small system prompts don't benefit.

**Cache write overhead:** The first request in a cache set pays the full input price plus a 25% cache-write premium. Caching only pays off when multiple requests reuse the content.

**No cross-user caching:** Caches are scoped to your API key and exact prompt prefix. You can't share caches across different customers or API keys.

## When to Use Prompt Caching

Prompt caching makes sense when:

- You have large static context (documentation, knowledge bases, examples)
- Multiple requests share the same context
- Request volume is steady (the cache stays warm)
- Input tokens are a significant cost driver

Prompt caching doesn't help when:

- Context changes with every request
- Request volume is very low (the cache is always cold)
- Context is small (<1,024 tokens)
- Output tokens dominate your costs (caching doesn't affect output pricing)

## Quick Takeaway

Prompt caching reduces Claude API costs by 70-90% for workflows with large repeated context. Mark system prompts, documentation, and knowledge bases as cacheable; cached content costs $0.30 per million tokens instead of $3.

Best for: customer support bots with documentation, document analysis with policy references, API-powered features with fixed system prompts, and any workflow sending large static context with variable user input.

Implementation is straightforward - add cache control markers to your API requests. The savings are immediate and substantial. If you're spending $500+/month on the Claude API with repeated context, prompt caching should be your first optimization.