Prompt caching

What you will achieve

Send a large fixed system prefix twice. Confirm the second call reports cachedTokens > 0 in usage stats — using one cache option that works across providers. Then learn the full CacheConfig union to pick the right strategy for your use case.

When and why you need this

Prompt caching is the single highest-leverage cost optimisation for workloads that reuse a large stable prefix — a long system prompt, a full document, a large tool-definition block. The provider stores the computed key-value attention states for the prefix and reuses them on subsequent calls that share the same prefix, avoiding recomputation.

Economics at a glance (Anthropic as example):

Cache write: ~25% more than base input token price (one-time, on the call that populates the cache)
Cache read: ~90% less than base input token price (on every subsequent cache hit)
TTL: 5 minutes by default; extendable with the ttl option

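For a 50,000-token system prompt called 20 times within the TTL, assuming an illustrative base input price of $3 per million tokens (~$0.15 per uncached call): ~$0.19 write + ~$0.015 * 19 hits = ~$0.47 total, vs ~$0.15 * 20 = ~$3.00 without caching. A ~6x cost reduction for that prefix.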
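The arithmetic can be sketched as a small calculator. This is a rough illustration only: the multipliers come from the approximate ratios listed above, the $3 per million tokens base price is a hypothetical figure, and the function name prefixCost is not part of the SDK. It also assumes every call after the first lands within the TTL and hits the cache.

```typescript
// Illustrative multipliers taken from the approximate ratios above;
// real prices vary by provider and model.
const CACHE_WRITE_MULTIPLIER = 1.25; // ~25% more than base input price
const CACHE_READ_MULTIPLIER = 0.1;   // ~90% less than base input price

interface PrefixCost {
  withCache: number;     // dollars, prefix only
  withoutCache: number;  // dollars, prefix only
  savingsFactor: number; // withoutCache / withCache
}

// Cost of sending the same prefix `calls` times (calls >= 1),
// assuming one cache write followed by cache hits on every later call.
function prefixCost(prefixTokens: number, calls: number, basePricePerMTok: number): PrefixCost {
  const basePerCall = (prefixTokens / 1_000_000) * basePricePerMTok;
  const write = basePerCall * CACHE_WRITE_MULTIPLIER;
  const read = basePerCall * CACHE_READ_MULTIPLIER;
  const withCache = write + read * Math.max(calls - 1, 0);
  const withoutCache = basePerCall * calls;
  return { withCache, withoutCache, savingsFactor: withoutCache / withCache };
}

// Hypothetical base price: $3 per million input tokens.
const cost = prefixCost(50_000, 20, 3);
console.log(cost.withCache.toFixed(2));     // 0.47
console.log(cost.withoutCache.toFixed(2));  // 3.00
console.log(cost.savingsFactor.toFixed(1)); // 6.3
```

The savings factor depends only on the two multipliers and the number of calls, not on the base price, so it approaches 10x as the call count grows.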

The challenge without ORXA: Anthropic requires explicit cache_control: { type: 'ephemeral' } blocks on each content part; OpenAI and Google cache automatically with no annotation needed. There is also no unified field to read cache hit counts across providers.

Step by step

Step 1 — Enable auto-caching with a shared LLMClient

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});

// A large stable prefix -- in practice this is a long system prompt,
// a document, or a big tool list.
const prefix = 'The quick brown fox jumps over the lazy dog. '.repeat(500);
const opts = { system: prefix, cache: 'auto' as const, maxTokens: 16, stateful: false };

// First call: populates the cache (Anthropic charges cache-write tokens).
const r1 = await llm.complete('Reply with: ok.', opts);
console.log(r1.usage.cachedTokens); // 0 on first call

// Second call: hits the cache.
const r2 = await llm.complete('Reply with: ok.', opts);
console.log(r2.usage.cachedTokens); // > 0 -- cache hit
console.log(r2.usage.cacheWriteTokens); // 0 -- no new cache write

cache: 'auto' is the simplest setting. On Anthropic, the SDK injects cache_control: { type: 'ephemeral' } onto the system content block, the last message, and the last tool definition automatically. On OpenAI and Google, where caching is always-on, no annotation is added and the SDK reads the provider’s cache-hit fields into usage.cachedTokens.
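To make the three injection points concrete, here is a minimal sketch of what such an annotation pass could look like on an Anthropic-style request body. The types and the annotateAuto function are hypothetical and written for illustration; they are not the SDK's internals, and only the cache_control: { type: 'ephemeral' } shape is taken from the text above.

```typescript
// Minimal local types for illustration; not the SDK's or Anthropic's official typings.
type CacheControl = { type: 'ephemeral' };
interface TextBlock { type: 'text'; text: string; cache_control?: CacheControl }
interface ToolDef { name: string; description: string; input_schema: object; cache_control?: CacheControl }
interface Message { role: 'user' | 'assistant'; content: TextBlock[] }
interface AnthropicRequest { system: TextBlock[]; tools: ToolDef[]; messages: Message[] }

const EPHEMERAL: CacheControl = { type: 'ephemeral' };

// Returns a copy of the request with cache breakpoints at the three positions
// described above: system block, last tool definition, last message.
function annotateAuto(req: AnthropicRequest): AnthropicRequest {
  // Marks only the final element of a list; earlier elements are left untouched.
  const markLast = <T extends { cache_control?: CacheControl }>(items: T[]): T[] =>
    items.map((item, i) => (i === items.length - 1 ? { ...item, cache_control: EPHEMERAL } : item));

  return {
    system: markLast(req.system),
    tools: markLast(req.tools),
    messages: req.messages.map((msg, i) =>
      i === req.messages.length - 1 ? { ...msg, content: markLast(msg.content) } : msg,
    ),
  };
}

const annotated = annotateAuto({
  system: [{ type: 'text', text: 'You are a helpful assistant.' }],
  tools: [],
  messages: [{ role: 'user', content: [{ type: 'text', text: 'Reply with: ok.' }] }],
});
console.log(JSON.stringify(annotated.system));
```

The function does not mutate its input, so the same base request can be reused for providers that need no annotation.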

Step 2 — Read cache economics from the usage object

Both cachedTokens and cacheWriteTokens are always present on Usage, defaulting to 0:

const { usage } = await llm.complete('Reply with: ok.', opts);

console.log(`Input tokens:       ${usage.inputTokens}`);
console.log(`Cached tokens:      ${usage.cachedTokens}`);      // read from cache
console.log(`Cache write tokens: ${usage.cacheWriteTokens}`);  // written to cache
console.log(`Output tokens:      ${usage.outputTokens}`);
// Anthropic: inputTokens = non-cached input; cachedTokens = cache_read_input_tokens
// OpenAI:    inputTokens = total input;      cachedTokens = cached_tokens (input_tokens_details)

The inputTokens field on Anthropic represents only the non-cached portion; cachedTokens is the separately billed cache-read portion. On OpenAI, inputTokens includes everything; cachedTokens is how many of those were served from cache.
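Because inputTokens means different things on the two providers, a provider-aware helper is useful when comparing hit rates. The sketch below uses a local Usage interface that mirrors the field names above rather than importing from the SDK. It assumes that on Anthropic cacheWriteTokens is also counted separately from inputTokens, which the text above does not state explicitly.

```typescript
// Local mirror of the usage fields named above; not imported from the SDK.
interface Usage {
  inputTokens: number;
  cachedTokens: number;
  cacheWriteTokens: number;
  outputTokens: number;
}

type Provider = 'anthropic' | 'openai';

// Total prompt size, regardless of how the provider splits the counts.
// Assumption: on Anthropic, cache reads and cache writes are both excluded from inputTokens.
function totalInputTokens(usage: Usage, provider: Provider): number {
  return provider === 'anthropic'
    ? usage.inputTokens + usage.cachedTokens + usage.cacheWriteTokens
    : usage.inputTokens;
}

// Fraction of the prompt served from cache (0 when the prompt is empty).
function cacheHitRatio(usage: Usage, provider: Provider): number {
  const total = totalInputTokens(usage, provider);
  return total === 0 ? 0 : usage.cachedTokens / total;
}

// Same logical request, reported in each provider's convention.
const onAnthropic: Usage = { inputTokens: 100, cachedTokens: 900, cacheWriteTokens: 0, outputTokens: 4 };
const onOpenAI: Usage = { inputTokens: 1000, cachedTokens: 900, cacheWriteTokens: 0, outputTokens: 4 };
console.log(cacheHitRatio(onAnthropic, 'anthropic')); // 0.9
console.log(cacheHitRatio(onOpenAI, 'openai'));       // 0.9
```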

Step 3 — Cache system and tools independently

cache: 'auto' is all-or-nothing. Use the object form to control exactly what gets cached:

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const opts = {
  system: largeSystemPrompt,
  cache: {
    system: true,   // cache the system prompt block
    tools: false,   // do NOT cache tool definitions (they change per call)
  },
  maxTokens: 256,
};

const { text, usage } = await llm.complete('What is the capital of France?', opts);

If your system prompt is stable but your tool list changes per user, { system: true, tools: false } avoids cache invalidation from tool list changes.

Step 4 — Extend the cache TTL

By default Anthropic caches for 5 minutes. Pass ttl to request a longer window (provider-specific string; Anthropic accepts '1h'):

const opts = {
  system: largeSystemPrompt,
  cache: {
    system: true,
    tools: true,
    ttl: '1h', // request 1-hour cache window (provider-dependent support)
  },
  maxTokens: 256,
};

ttl is a pass-through to the provider. OpenAI and Google do not support explicit TTL configuration today — this field is only effective on Anthropic.
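On the wire, Anthropic's extended TTL is expressed as an extra ttl field on the cache_control object. The helper below is a hypothetical sketch of building that object; the function name and the validation step are additions for illustration (the SDK is described above as passing the value through unchanged), and the accepted values '5m' and '1h' reflect Anthropic's documented options as best understood here.

```typescript
// TTL values assumed to be accepted by Anthropic at the time of writing.
type AnthropicTtl = '5m' | '1h';

interface CacheControl {
  type: 'ephemeral';
  ttl?: AnthropicTtl;
}

// Builds the cache_control annotation, rejecting TTL strings this sketch does not recognise.
function buildCacheControl(ttl?: string): CacheControl {
  if (ttl === undefined) return { type: 'ephemeral' }; // provider default (5 minutes)
  if (ttl !== '5m' && ttl !== '1h') {
    throw new Error(`Unsupported ttl "${ttl}"; expected '5m' or '1h'`);
  }
  return { type: 'ephemeral', ttl };
}

console.log(JSON.stringify(buildCacheControl('1h'))); // {"type":"ephemeral","ttl":"1h"}
console.log(JSON.stringify(buildCacheControl()));     // {"type":"ephemeral"}
```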

Step 5 — Cache in an AgentLoop for repeated turns

In a stateful agent the system prompt is sent on every step. With cache: 'auto' on the AgentLoop, the system prefix is cached from the first step onwards:

import { createLLM, AgentLoop } from '@combycode/llm-sdk';

const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const loop = new AgentLoop({
  client: llm,
  system: largeSystemPrompt,
  cache: 'auto',           // applies to every step in the loop
  maxTokens: 512,
  tools: [...myTools],
});

const r1 = await loop.complete('Hello.');
const r2 = await loop.complete('What did I just say?');
// r2 uses cached system + growing history prefix where possible

For maximum cache hits, keep the system prompt text stable between turns and prefer to extend history rather than regenerating the loop from scratch.

Your options

CacheConfig — the full union (from llm/types/request.ts):

type CacheConfig =
  | 'auto'
  | 'off'
  | {
      system?: boolean;
      tools?: boolean;
      ttl?: string;
    };

Value	What it caches	When to use
`'auto'`	System prompt + last message + last tool definition	Simplest option; maximises cache coverage automatically
`'off'`	Nothing	Explicitly opt out (useful to disable engine-default caching)
`{ system: true }`	System prompt block only	Stable system prompt, variable tools or messages
`{ tools: true }`	Last tool definition block only	Stable tool list, variable system prompt
`{ system: true, tools: true }`	Both system and tools	Both are stable; common for production agents
`{ system: true, ttl: '1h' }`	System prompt + 1-hour TTL request	Long sessions where 5-min TTL causes repeated cache misses
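One way to read the table is as a mapping from each CacheConfig value to a set of flags. The resolver below is a hypothetical sketch, not the SDK's implementation: the ResolvedCache shape and the resolveCache name are invented here, and it assumes that flags omitted from the object form default to false, which matches the "only" rows above.

```typescript
type CacheConfig =
  | 'auto'
  | 'off'
  | { system?: boolean; tools?: boolean; ttl?: string };

// Hypothetical internal representation of what ends up annotated.
interface ResolvedCache {
  system: boolean;
  tools: boolean;
  lastMessage: boolean;
  ttl?: string;
}

function resolveCache(config: CacheConfig): ResolvedCache {
  if (config === 'auto') return { system: true, tools: true, lastMessage: true };
  if (config === 'off') return { system: false, tools: false, lastMessage: false };
  return {
    system: config.system ?? false, // assumption: omitted flags mean "do not cache"
    tools: config.tools ?? false,
    lastMessage: false,             // the object form has no message flag
    ...(config.ttl !== undefined ? { ttl: config.ttl } : {}),
  };
}

console.log(resolveCache('auto'));
console.log(resolveCache({ system: true, ttl: '1h' }));
```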

Provider cache behaviour at a glance:

Provider	Annotation required	Cached field in response	Cache TTL
Anthropic	Yes — `cache_control: { type: 'ephemeral' }` on content blocks	`cache_read_input_tokens`	5 min default; extendable
OpenAI (Responses API)	No (automatic on stable prefixes)	`usage.input_tokens_details.cached_tokens`	Managed by OpenAI
OpenAI (Chat Completions)	No (automatic)	`usage.prompt_tokens_details.cached_tokens`	Managed by OpenAI
Google	No (automatic context caching)	`usage_metadata.cached_content_token_count`	Managed by Google

ORXA normalises all of these into usage.cachedTokens and usage.cacheWriteTokens.

When caching does NOT help:

Highly variable system prompts (personalised per user on every call) — the prefix never matches.
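Short system prompts (under roughly 1,000 tokens): providers only cache prefixes above a minimum length (for example, 1024 tokens on OpenAI's GPT-4o), so there is little or nothing to save.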
Single-call scripts — there is no second call to hit the cache.
Models or providers that do not support caching at all.

Layered context and cache ordering:

The Layered context system (ContextRegistry) composes system prompts from multiple contributors (agent role, task context, retrieved facts). For maximum cache hits, ensure stable layers (agent role) are placed at the FRONT of the composed prompt. ORXA’s registry outputs layers in ascending priority order by default — lower priority (more stable) layers appear first, higher priority (more dynamic) layers appear later, which is also the optimal order for cache hit rates.
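The ordering rule can be sketched with a few lines of code. The Layer interface and composeSystemPrompt function below are hypothetical stand-ins, not the ContextRegistry API; they only illustrate the ascending-priority ordering described above, where lower numbers are assumed to mark more stable layers.

```typescript
// Hypothetical layer shape for illustration; not the ContextRegistry API.
interface Layer {
  name: string;
  priority: number; // lower = more stable, placed earlier in the prompt
  text: string;
}

// Composes layers in ascending priority order so stable text forms the prefix.
function composeSystemPrompt(layers: Layer[]): string {
  return [...layers]                      // copy so the caller's array is not reordered
    .sort((a, b) => a.priority - b.priority)
    .map((layer) => layer.text)
    .join('\n\n');
}

const prompt = composeSystemPrompt([
  { name: 'retrieved-facts', priority: 20, text: 'Facts: the user is in Berlin.' },
  { name: 'agent-role', priority: 0, text: 'You are a travel assistant.' },
  { name: 'task-context', priority: 10, text: 'Task: plan a weekend trip.' },
]);
console.log(prompt.startsWith('You are a travel assistant.')); // true
```

When only the retrieved facts change between calls, the role and task text at the front stays byte-identical, which is what a prefix cache needs.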

Compare the SDKs

import { createLLM } from '@combycode/llm-sdk';

// One `cache` option — maps to explicit cache_control on Anthropic and is a
// no-op where caching is automatic. Normalized `usage.cachedTokens` everywhere.
const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });
const prefix = 'The quick brown fox jumps over the lazy dog. '.repeat(500);
// The large stable prefix goes in `system`; cache:'auto' caches it (explicit on
// Anthropic, automatic elsewhere). 2nd identical call reports cache-read tokens.
const opts = { system: prefix, cache: 'auto' as const, maxTokens: 16, stateful: false };

const t0 = performance.now();
await llm.complete('Reply with the single word: ok.', opts);
const r2 = await llm.complete('Reply with the single word: ok.', opts);

console.log(JSON.stringify({ result: String(r2.usage.cachedTokens), ms: Math.round(performance.now() - t0) }));

The structural difference: Anthropic requires inserting cache_control: { type: 'ephemeral' } into each content block you want cached (system, tools, messages). Without ORXA this means three separate conditional injection points that you maintain across every request builder. OpenAI and Google cache automatically, but each provider reports hit counts in a different response field (cached_tokens on OpenAI, cache_read_input_tokens on Anthropic, usage_metadata.cached_content_token_count on Google). ORXA resolves both issues: it injects cache_control for Anthropic automatically when cache is set, and normalises all provider fields into the single usage.cachedTokens field.
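The normalisation step can be pictured as a small per-provider mapping over the raw usage object. This is a sketch under assumptions, not ORXA's code: the provider keys, the helper names, and the exact field paths (in particular cache_creation_input_tokens for Anthropic writes and the snake_case Google field) reflect the providers' public response shapes as understood here and should be checked against current API references.

```typescript
interface NormalizedCacheUsage {
  cachedTokens: number;
  cacheWriteTokens: number;
}

type Provider = 'anthropic' | 'openai-responses' | 'openai-chat' | 'google';
type RawUsage = Record<string, unknown>;

// Defensive accessors: missing or malformed fields fall back to 0 / {}.
function num(value: unknown): number {
  return typeof value === 'number' ? value : 0;
}
function obj(value: unknown): RawUsage {
  return typeof value === 'object' && value !== null ? (value as RawUsage) : {};
}

// Maps a provider's raw `usage` object onto the two normalised fields.
function normalizeCacheUsage(provider: Provider, raw: RawUsage): NormalizedCacheUsage {
  switch (provider) {
    case 'anthropic':
      return {
        cachedTokens: num(raw['cache_read_input_tokens']),
        cacheWriteTokens: num(raw['cache_creation_input_tokens']),
      };
    case 'openai-responses':
      return { cachedTokens: num(obj(raw['input_tokens_details'])['cached_tokens']), cacheWriteTokens: 0 };
    case 'openai-chat':
      return { cachedTokens: num(obj(raw['prompt_tokens_details'])['cached_tokens']), cacheWriteTokens: 0 };
    case 'google':
      return { cachedTokens: num(raw['cached_content_token_count']), cacheWriteTokens: 0 };
  }
}

console.log(normalizeCacheUsage('anthropic', { input_tokens: 12, cache_read_input_tokens: 5000 }));
console.log(normalizeCacheUsage('openai-chat', { prompt_tokens: 5012, prompt_tokens_details: { cached_tokens: 4992 } }));
```

Defaulting to 0 matches the behaviour described earlier, where both fields are always present on the usage object.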

Gotchas and next steps

The first call with a new prefix is a cache miss. It populates the cache (Anthropic calls this a cache write and charges a ~25% premium for it). Cache hits appear only from the second call onwards, and only while the same prefix is still within its TTL.

Any change inside the cached prefix invalidates the cache from that point on. If you append to the system prompt, change a tool description, or modify an earlier message, the cached prefix no longer matches and the provider recomputes everything from the changed content onward. Keep the stable portion at the top and variable content at the end.
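A quick way to see why ordering matters is to measure how much of two prompts is identical from the start. The helper below is an illustrative sketch that compares characters as a stand-in for tokens; real providers match on tokens or content blocks, so the numbers are indicative only.

```typescript
// Length of the leading run of characters that two strings share.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const stable = 'RULES '.repeat(100); // 600 characters of unchanging instructions

// Stable content first: almost the whole prompt is reusable across users.
const stableFirst = sharedPrefixLength(stable + 'user: alice', stable + 'user: bob');

// Variable content first: the prompts diverge almost immediately.
const variableFirst = sharedPrefixLength('user: alice\n' + stable, 'user: bob\n' + stable);

console.log(stableFirst);   // 606
console.log(variableFirst); // 6
```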

On Anthropic, cache: 'auto' also caches the last message. The SDK injects cache_control on the last message’s last content block, so when you replay the same history, the earlier turns (including the model’s prior responses) can be served from cache. This is the “conversation prefix” caching pattern.

OpenAI automatic caching has a minimum prefix length. Cache hits are only reported for prompts that exceed a provider-managed minimum (typically 1024 tokens on GPT-4o). Short prompts will never show cachedTokens > 0.
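Before relying on cache hits, it can help to sanity-check that a prefix is plausibly long enough. The sketch below uses a rough four-characters-per-token heuristic for English text, which is an assumption and not a real tokenizer; the function names are hypothetical, and the 1024 default simply mirrors the figure mentioned above.

```typescript
// Rough heuristic only: ~4 characters per token for English text.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// minTokens defaults to the 1024-token figure mentioned above;
// adjust it per provider and model.
function likelyCacheable(prefixText: string, minTokens = 1024): boolean {
  return estimateTokens(prefixText) >= minTokens;
}

const longPrefix = 'The quick brown fox jumps over the lazy dog. '.repeat(500);
console.log(likelyCacheable(longPrefix));    // true
console.log(likelyCacheable('Be concise.')); // false
```

For prefixes close to the threshold, a real token count from the provider's tokenizer is the safer check.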

cache: 'off' is useful for testing. If you want to force cache bypass to measure true non-cached latency (e.g. in benchmarks), set cache: 'off' explicitly. On providers where caching is always-on, 'off' has no effect — it only suppresses annotation injection on Anthropic.

Next steps:

Token counting — estimate tokens before sending
Cost tracking — use estimate() to pre-flight cost with cached token pricing
Layered context — compose system prompts for stable cache prefixes
Built-in tool runner — AgentLoop with per-step caching via cache config