Prompt caching
What you will achieve
Send a large fixed system prefix twice and confirm that the second call reports cachedTokens > 0 in its usage stats, using a single cache option that works across providers. Then learn the full CacheConfig union so you can pick the right strategy for your use case.
When and why you need this
Prompt caching is usually the highest-leverage cost optimisation for workloads that reuse a large stable prefix: a long system prompt, a full document, or a large tool-definition block. The provider stores the computed key-value attention states for the prefix and reuses them on subsequent calls that share the same prefix, avoiding recomputation.
Economics at a glance (Anthropic as example):
- Cache write: ~25% more than base input token price (one-time, on the call that populates the cache)
- Cache read: ~90% less than base input token price (on every subsequent cache hit)
- TTL: 5 minutes by default; extendable with the ttl option
For a 50,000-token system prompt called 20 times, at a base input cost of ~$0.75 per uncached call (that is, $15 per million input tokens): ~$0.94 write + ~$0.075 * 19 hits = ~$2.36 total, vs ~$0.75 * 20 = ~$15.00 without caching. A ~6x cost reduction for that prefix.
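The arithmetic can be reproduced with a small helper. This is a minimal sketch, not part of the SDK: the function name prefixCost is hypothetical, the multipliers are the approximate Anthropic figures from the list above, the $15 per million tokens base price is only an illustrative assumption (it makes one uncached 50,000-token call cost $0.75), and every call after the first is assumed to hit the cache within the TTL.

```typescript
// Approximate multipliers from the economics list (Anthropic as example).
const CACHE_WRITE_MULTIPLIER = 1.25; // cache write: ~25% more than base input price
const CACHE_READ_MULTIPLIER = 0.1; // cache read: ~90% less than base input price

interface PrefixCost {
  uncached: number; // cost of the prefix across all calls without caching
  cached: number; // one cache write plus (calls - 1) cache reads
  savingsFactor: number; // uncached / cached
}

// Hypothetical helper: cost in dollars of re-sending one prefix `calls` times.
function prefixCost(prefixTokens: number, calls: number, basePricePerMTok: number): PrefixCost {
  const basePerCall = (prefixTokens / 1e6) * basePricePerMTok;
  const uncached = basePerCall * calls;
  const cached =
    calls === 0
      ? 0
      : basePerCall * CACHE_WRITE_MULTIPLIER + basePerCall * CACHE_READ_MULTIPLIER * (calls - 1);
  return { uncached, cached, savingsFactor: cached === 0 ? 1 : uncached / cached };
}

const example = prefixCost(50000, 20, 15);
console.log(example.uncached.toFixed(2)); // "15.00"
console.log(example.cached.toFixed(2)); // "2.36"
```

With these multipliers the cached path is already cheaper on the second call (1.25 + 0.1 = 1.35 units versus 2 units uncached), so the break-even point is a single cache hit.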
The challenge without ORXA: Anthropic requires explicit cache_control: { type: 'ephemeral' } blocks on each content part; OpenAI and Google cache automatically with no annotation needed. There is also no unified field to read cache hit counts across providers.
Step by step
Step 1: Enable auto-caching with a shared LLMClient

    import { createLLM } from '@combycode/llm-sdk';

    const llm = createLLM({
      model: process.env.LLM_MODEL!,
      apiKey: process.env.LLM_API_KEY,
    });

    // A large stable prefix -- in practice this is a long system prompt,
    // a document, or a big tool list.
    const prefix = 'The quick brown fox jumps over the lazy dog. '.repeat(500);
    const opts = { system: prefix, cache: 'auto' as const, maxTokens: 16, stateful: false };

    // First call: populates the cache (Anthropic charges cache-write tokens).
    const r1 = await llm.complete('Reply with: ok.', opts);
    console.log(r1.usage.cachedTokens); // 0 on first call

    // Second call: hits the cache.
    const r2 = await llm.complete('Reply with: ok.', opts);
    console.log(r2.usage.cachedTokens); // > 0 -- cache hit
    console.log(r2.usage.cacheWriteTokens); // 0 -- no new cache write

cache: 'auto' is the simplest setting. On Anthropic, the SDK injects cache_control: { type: 'ephemeral' } onto the system content block, the last message, and the last tool definition automatically. On OpenAI and Google, where caching is always-on, no annotation is added and the SDK reads the provider’s cache-hit fields into usage.cachedTokens.
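To make the three injection points concrete, here is a rough sketch of what placing the Anthropic markers by hand could look like. It is not the SDK's implementation: the type names, markLast and annotateAuto are hypothetical, and the request shape is a simplified subset of an Anthropic-style request body.

```typescript
type CacheControl = { type: 'ephemeral' };

interface ContentBlock {
  type: 'text';
  text: string;
  cache_control?: CacheControl;
}

interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
}

interface Message {
  role: 'user' | 'assistant';
  content: ContentBlock[];
}

// Simplified, assumed subset of an Anthropic-style request body.
interface AnthropicRequest {
  system: ContentBlock[];
  tools: ToolDef[];
  messages: Message[];
}

const EPHEMERAL: CacheControl = { type: 'ephemeral' };

// Return a copy of the list with only its last element marked; the input is not mutated.
function markLast<T extends { cache_control?: CacheControl }>(items: T[]): T[] {
  if (items.length === 0) return items;
  const last = items[items.length - 1];
  return [...items.slice(0, -1), { ...last, cache_control: EPHEMERAL }];
}

// Mark the last system block, the last tool definition,
// and the last content block of the last message.
function annotateAuto(req: AnthropicRequest): AnthropicRequest {
  const messages = req.messages.map((msg, i) =>
    i === req.messages.length - 1 ? { ...msg, content: markLast(msg.content) } : msg,
  );
  return { system: markLast(req.system), tools: markLast(req.tools), messages };
}
```

Marking only the last element of each group is enough because a marker covers the whole prefix up to and including the block it sits on.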
Step 2: Read cache economics from the usage object
Both cachedTokens and cacheWriteTokens are always present on Usage, defaulting to 0:

    const { usage } = await llm.complete('Reply with: ok.', opts);

    console.log(`Input tokens: ${usage.inputTokens}`);
    console.log(`Cached tokens: ${usage.cachedTokens}`); // read from cache
    console.log(`Cache write tokens: ${usage.cacheWriteTokens}`); // written to cache
    console.log(`Output tokens: ${usage.outputTokens}`);
    // Anthropic: inputTokens = non-cached input; cachedTokens = cache_read_input_tokens
    // OpenAI: inputTokens = total input; cachedTokens = cached_tokens (input_tokens_details)

The inputTokens field on Anthropic represents only the non-cached portion; cachedTokens is the separately billed cache-read portion. On OpenAI, inputTokens includes everything; cachedTokens is how many of those were served from cache.
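Because inputTokens means different things on different providers, a hit ratio has to be computed against the right denominator. The sketch below is illustrative only: the Usage interface lists just the four fields named in this guide, the helper names and the InputAccounting type are hypothetical, and it assumes that cache-write tokens are also excluded from inputTokens on Anthropic-style accounting.

```typescript
// Assumed minimal shape: only the fields this guide mentions.
interface Usage {
  inputTokens: number;
  cachedTokens: number;
  cacheWriteTokens: number;
  outputTokens: number;
}

// 'excludes-cached': inputTokens counts only non-cached input (Anthropic-style).
// 'includes-cached': inputTokens already counts everything (OpenAI-style).
type InputAccounting = 'excludes-cached' | 'includes-cached';

// Total prompt size in tokens, regardless of how the provider splits it.
function totalPromptTokens(usage: Usage, accounting: InputAccounting): number {
  return accounting === 'excludes-cached'
    ? usage.inputTokens + usage.cachedTokens + usage.cacheWriteTokens
    : usage.inputTokens;
}

// Fraction of the prompt that was served from cache (0 when the prompt is empty).
function cacheHitRatio(usage: Usage, accounting: InputAccounting): number {
  const total = totalPromptTokens(usage, accounting);
  return total === 0 ? 0 : usage.cachedTokens / total;
}
```

A ratio close to 1 on the second call of Step 1 suggests almost the whole prefix was reused; a ratio of 0 means the prefix did not match or was below the provider's minimum length.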
Step 3: Cache system and tools independently
cache: 'auto' is all-or-nothing. Use the object form to control exactly what gets cached:

    import { createLLM } from '@combycode/llm-sdk';

    const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

    // largeSystemPrompt is your long, stable system prompt string, defined elsewhere.
    const opts = {
      system: largeSystemPrompt,
      cache: {
        system: true, // cache the system prompt block
        tools: false, // do NOT cache tool definitions (they change per call)
      },
      maxTokens: 256,
    };

    const { text, usage } = await llm.complete('What is the capital of France?', opts);

If your system prompt is stable but your tool list changes per user, { system: true, tools: false } avoids paying for cache writes on tool definitions that will not be reused. Be aware that Anthropic builds the cached prefix in the order tools, system, messages, so a changed tool list may still invalidate the cached system prompt on that provider.
Step 4: Extend the cache TTL
By default Anthropic caches for 5 minutes. Pass ttl to request a longer window (provider-specific string; Anthropic accepts '1h'):

    const opts = {
      system: largeSystemPrompt,
      cache: {
        system: true,
        tools: true,
        ttl: '1h', // request 1-hour cache window (provider-dependent support)
      },
      maxTokens: 256,
    };

ttl is a pass-through to the provider. OpenAI and Google do not support explicit TTL configuration through this option today, so the field is only effective on Anthropic.
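Whether the longer window is worth requesting depends mostly on the idle gap between calls that share the prefix. The helper below is a hypothetical sketch, not an SDK function: it assumes the 5-minute default and the 1-hour option described above, and it assumes the TTL window restarts on every cache hit (believed to be Anthropic's behaviour, but worth verifying against the provider documentation).

```typescript
const DEFAULT_TTL_SECONDS = 5 * 60; // default window described in this guide
const EXTENDED_TTL_SECONDS = 60 * 60; // the '1h' window

// 'default': the 5-minute window is enough, no ttl needed.
// '1h':      request the extended window.
// 'skip':    even the extended window would expire between calls.
type TtlChoice = 'default' | '1h' | 'skip';

// Hypothetical helper: pick the shortest window that outlives the expected idle gap.
function chooseTtl(expectedGapSeconds: number): TtlChoice {
  if (expectedGapSeconds < DEFAULT_TTL_SECONDS) return 'default';
  if (expectedGapSeconds < EXTENDED_TTL_SECONDS) return '1h';
  return 'skip';
}

// Example: a user who replies roughly every 12 minutes.
const choice = chooseTtl(12 * 60);
const cacheOption =
  choice === 'skip'
    ? ('off' as const)
    : { system: true, tools: true, ...(choice === '1h' ? { ttl: '1h' } : {}) };
console.log(choice); // "1h"
console.log(cacheOption);
```

The resulting cacheOption has the same shape as the cache values shown in this guide, so it could be passed as the cache field of the request options.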
Step 5: Cache in an AgentLoop for repeated turns
In a stateful agent the system prompt is sent on every step. With cache: 'auto' on the AgentLoop, the system prefix is cached from the first step onwards:

    import { createLLM, AgentLoop } from '@combycode/llm-sdk';

    const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

    const loop = new AgentLoop({
      client: llm,
      system: largeSystemPrompt,
      cache: 'auto', // applies to every step in the loop
      maxTokens: 512,
      tools: [...myTools],
    });

    const r1 = await loop.complete('Hello.');
    const r2 = await loop.complete('What did I just say?');
    // r2 uses cached system + growing history prefix where possible

For maximum cache hits, keep the system prompt text stable between turns and prefer to extend history rather than regenerating the loop from scratch.
Your options
CacheConfig: the full union (from llm/types/request.ts):

    type CacheConfig =
      | 'auto'
      | 'off'
      | {
          system?: boolean;
          tools?: boolean;
          ttl?: string;
        };

| Value | What it caches | When to use |
|---|---|---|
| 'auto' | System prompt + last message + last tool definition | Simplest option; maximises cache coverage automatically |
| 'off' | Nothing | Explicitly opt out (useful to disable engine-default caching) |
| { system: true } | System prompt block only | Stable system prompt, variable tools or messages |
| { tools: true } | Last tool definition block only | Stable tool list, variable system prompt |
| { system: true, tools: true } | Both system and tools | Both are stable; common for production agents |
| { system: true, ttl: '1h' } | System prompt + 1-hour TTL request | Long sessions where 5-min TTL causes repeated cache misses |
Provider cache behaviour at a glance:
| Provider | Annotation required | Cached field in response | Cache TTL |
|---|---|---|---|
| Anthropic | Yes, cache_control: { type: 'ephemeral' } on content blocks | cache_read_input_tokens | 5 min default; extendable |
| OpenAI (Responses API) | No, automatic on stable prefixes | input_tokens_details.cached_tokens | Managed by OpenAI |
| OpenAI (Completions) | No, automatic | usage.prompt_tokens_details.cached_tokens | Managed by OpenAI |
| Google | No, automatic context caching | Not reported per-call in standard usage | Managed by Google |
ORXA normalises all of these into usage.cachedTokens and usage.cacheWriteTokens.
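As an illustration of what such normalisation involves, the sketch below maps each provider's raw usage payload onto a single cached-token count. It is not ORXA's actual implementation: ProviderKind, CACHED_TOKEN_PATHS, readNumber and normaliseCachedTokens are hypothetical names, the field paths are the ones cited in this guide (with the Responses API field spelled input_tokens_details, which is believed to be the spelling OpenAI returns), and the Google field name is taken from this guide as-is, so the REST JSON casing may differ.

```typescript
type ProviderKind = 'anthropic' | 'openai-responses' | 'openai-chat' | 'google';

// Path to the cache-hit counter inside each provider's raw usage object (assumed).
const CACHED_TOKEN_PATHS: Record<ProviderKind, string[]> = {
  anthropic: ['cache_read_input_tokens'],
  'openai-responses': ['input_tokens_details', 'cached_tokens'],
  'openai-chat': ['prompt_tokens_details', 'cached_tokens'],
  google: ['cached_content_token_count'],
};

// Walk a key path through an unknown object; return 0 if any step is missing
// or the final value is not a number.
function readNumber(obj: unknown, path: string[]): number {
  let current: unknown = obj;
  for (const key of path) {
    if (typeof current !== 'object' || current === null) return 0;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === 'number' ? current : 0;
}

// Single entry point: raw provider usage in, cached-token count out.
function normaliseCachedTokens(provider: ProviderKind, rawUsage: unknown): number {
  return readNumber(rawUsage, CACHED_TOKEN_PATHS[provider]);
}
```

Defaulting to 0 when a field is absent mirrors the behaviour described in Step 2, where cachedTokens is always present on the usage object.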
When caching does NOT help:
- Highly variable system prompts (personalised per user on every call): the prefix never matches.
- Short system prompts (under roughly 1000 tokens): the prompt may fall below the provider's minimum cacheable length, and any savings are small.
- Single-call scripts: there is no second call to hit the cache.
- Providers or models that do not support caching.
Layered context and cache ordering:
The Layered context system (ContextRegistry) composes system prompts from multiple contributors (agent role, task context, retrieved facts). For maximum cache hits, ensure stable layers (agent role) are placed at the FRONT of the composed prompt. ORXA’s registry outputs layers in ascending priority order by default — lower priority (more stable) layers appear first, higher priority (more dynamic) layers appear later, which is also the optimal order for cache hit rates.
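The effect of that ordering can be shown with a toy composer. This is a sketch under stated assumptions, not the ContextRegistry API: the Layer shape, the composePrompt function and the sample layer texts are all hypothetical.

```typescript
// Hypothetical layer shape for illustration only.
interface Layer {
  name: string;
  priority: number; // lower = more stable, placed earlier in the prompt
  text: string;
}

// Compose layers in ascending priority order so stable text forms the prefix.
function composePrompt(layers: Layer[]): string {
  return [...layers]
    .sort((a, b) => a.priority - b.priority)
    .map((layer) => layer.text)
    .join('\n\n');
}

const role: Layer = { name: 'agent-role', priority: 0, text: 'You are a contracts analyst.' };
const task: Layer = { name: 'task', priority: 10, text: 'Summarise the indemnity clauses.' };
const factsA: Layer = { name: 'facts', priority: 20, text: 'Retrieved: clause 4.2.' };
const factsB: Layer = { name: 'facts', priority: 20, text: 'Retrieved: clause 9.1.' };

// Registration order differs, and the dynamic facts differ between the two calls.
const promptA = composePrompt([factsA, role, task]);
const promptB = composePrompt([task, factsB, role]);

// Both prompts still start with the same stable text, which is what a prefix cache needs.
const stablePrefix = role.text + '\n\n' + task.text;
console.log(promptA.startsWith(stablePrefix) && promptB.startsWith(stablePrefix)); // true
```

If the dynamic facts layer were placed first instead, the two prompts would diverge from the first character and nothing could be reused.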
Compare the SDKs
The structural difference: Anthropic requires inserting cache_control: { type: 'ephemeral' } into every content block you want cached (system, tools, messages). Without ORXA this means three separate conditional injection points that you maintain across every request builder. OpenAI and Google cache automatically, but every provider reports hit counts in a different response field (cached_tokens on OpenAI, cache_read_input_tokens on Anthropic, usage_metadata.cached_content_token_count on Google). ORXA resolves both issues: it injects cache_control for Anthropic automatically when cache is set, and normalises all provider fields into the single usage.cachedTokens field.
Gotchas and next steps
The first call is always a cache miss. It populates the cache (Anthropic calls this a cache write and charges a ~25% premium for it). Cache hits only appear from the second call onwards with the same prefix.
Any prefix change invalidates the cache. If you append to the system prompt, change a tool description, or modify a message, the cached prefix is no longer valid and the provider recomputes from scratch. Keep the stable portion at the top and variable content at the end.
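A quick way to reason about this is to count how many leading messages two consecutive requests have in common. The sketch below is a simplification: providers match on the serialised request prefix rather than on whole messages, and ChatMessage and sharedPrefixMessages are hypothetical names used only for illustration.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Count how many leading messages are identical in both requests.
function sharedPrefixMessages(prev: ChatMessage[], next: ChatMessage[]): number {
  let i = 0;
  while (
    i < prev.length &&
    i < next.length &&
    prev[i].role === next[i].role &&
    prev[i].content === next[i].content
  ) {
    i++;
  }
  return i;
}

const turn1: ChatMessage[] = [
  { role: 'system', content: 'You are a support agent.' },
  { role: 'user', content: 'Hello.' },
];

// Appending to the history keeps the whole previous request as a prefix.
const appended: ChatMessage[] = [
  ...turn1,
  { role: 'assistant', content: 'Hi!' },
  { role: 'user', content: 'What did I just say?' },
];

// Editing the system prompt changes the very first message, so nothing is shared.
const edited: ChatMessage[] = [
  { role: 'system', content: 'You are a support agent. Today is Monday.' },
  ...turn1.slice(1),
];

console.log(sharedPrefixMessages(turn1, appended)); // 2
console.log(sharedPrefixMessages(turn1, edited)); // 0
```

Volatile details such as the current date therefore belong in a late message rather than in the system prompt.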
Anthropic’s cache: 'auto' also caches the last message. This covers conversation prefix scenarios where you replay the same history — the model’s prior turns can be cached too. This is the “conversation prefix” caching pattern; the SDK injects cache_control on the last message’s last content block.
OpenAI automatic caching has a minimum prefix length. Cache hits are only reported for prompts that exceed a provider-managed minimum (typically 1024 tokens on GPT-4o). Short prompts will never show cachedTokens > 0.
cache: 'off' is useful for testing. If you want to force cache bypass to measure true non-cached latency (e.g. in benchmarks), set cache: 'off' explicitly. On providers where caching is always-on, 'off' has no effect — it only suppresses annotation injection on Anthropic.
Next steps:
- Token counting: estimate tokens before sending
- Cost tracking: use estimate() to pre-flight cost with cached token pricing
- Layered context: compose system prompts for stable cache prefixes
- Built-in tool runner: AgentLoop with per-step caching via cache config