Reasoning / thinking
What you will achieve
Enable extended reasoning on a math question, get the correct answer, and see thinking
tokens reported in usage, using one thinking option across all reasoning-capable models.
When and why you need this
Reasoning models (OpenAI o3, o4-mini; Anthropic claude-3-7-sonnet; Google
gemini-2.0-flash-thinking; xAI grok-3-mini) spend tokens on internal reasoning
before producing a reply. This “thinking” step can substantially improve accuracy on math,
logic, multi-step planning, and code correctness.
Each provider exposes the knob differently:
- Anthropic uses thinking: { type: 'enabled', budget_tokens: N }.
- OpenAI / xAI use reasoning_effort: 'low' | 'medium' | 'high'.
- Google uses thinkingConfig: { thinkingBudget: N }.
Three shapes for one concept. If you hard-code one of them, your code breaks on every other provider.
Step by step
Step 1 — Enable thinking with mode: 'auto'

```typescript
import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});

const r = await llm.complete(
  'If 2x = 10, what is x? Reply with just the number.',
  {
    thinking: { mode: 'auto', effort: 'low' },
    maxTokens: 512,
  },
);

console.log(r.text); // '5'
console.log(r.usage.reasoningTokens); // thinking tokens used (where provider reports them)
```

mode: 'auto' means “use extended reasoning if the model supports it; be a no-op on
models that do not”. This is the safest default: the same call works on both reasoning
and non-reasoning model slugs.
Step 2 — Choose effort level
effort maps to a budget on each provider:

| effort | Anthropic budget_tokens | OpenAI/xAI reasoning_effort | Google thinkingBudget |
|---|---|---|---|
| 'low' | ~1024 | 'low' | ~1024 |
| 'medium' | ~5000 | 'medium' | ~8000 |
| 'high' | ~10000 | 'high' | ~24576 |
| 'max' | max for model | 'high' | max for model |

```typescript
const r = await llm.complete('Prove that sqrt(2) is irrational.', {
  thinking: { mode: 'auto', effort: 'high' },
  maxTokens: 4096,
});
```

Use 'low' for quick math and logic checks. Use 'high' for multi-step proofs, complex
code generation, or tasks where getting it wrong is expensive. 'max' is useful when
you do not know the reasoning budget ahead of time and want the model to use as much
as it needs.
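
The effort table above can be read as a small mapping function. The following is a minimal sketch of that idea, not the SDK's actual implementation: the function name toNativeThinking, the Provider type, and the modelMaxBudget argument are assumptions made for illustration, and the budgets are the approximate values from the table.

```typescript
// Sketch only: budgets are the approximate values from the effort table,
// not constants read from any SDK.
type Effort = 'low' | 'medium' | 'high' | 'max';
type Provider = 'anthropic' | 'openai' | 'xai' | 'google';

// The three provider-native request fragments described earlier on this page.
type NativeThinking =
  | { thinking: { type: 'enabled'; budget_tokens: number } }
  | { reasoning_effort: 'low' | 'medium' | 'high' }
  | { thinkingConfig: { thinkingBudget: number } };

const ANTHROPIC_BUDGET = { low: 1024, medium: 5000, high: 10000 } as const;
const GOOGLE_BUDGET = { low: 1024, medium: 8000, high: 24576 } as const;

// modelMaxBudget is a hypothetical, caller-supplied ceiling used for 'max'.
function toNativeThinking(
  provider: Provider,
  effort: Effort,
  modelMaxBudget: number,
): NativeThinking {
  switch (provider) {
    case 'anthropic':
      return {
        thinking: {
          type: 'enabled',
          budget_tokens: effort === 'max' ? modelMaxBudget : ANTHROPIC_BUDGET[effort],
        },
      };
    case 'google':
      return {
        thinkingConfig: {
          thinkingBudget: effort === 'max' ? modelMaxBudget : GOOGLE_BUDGET[effort],
        },
      };
    case 'openai':
    case 'xai':
      // The string enum has no 'max', so 'max' falls back to 'high'.
      return { reasoning_effort: effort === 'max' ? 'high' : effort };
  }
}
```

Keeping the mapping in one place means application code only ever names the semantic effort level, which is the property the unified thinking option relies on.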
Step 3 — Force reasoning on with mode: 'on'

```typescript
const r = await llm.complete('Design the data model for a social media app.', {
  thinking: { mode: 'on', effort: 'medium' },
  maxTokens: 2000,
});
```

mode: 'on' throws if the model does not support extended thinking. Use it when you
explicitly require reasoning and want to fail fast if the model is wrong for the task,
rather than silently getting a non-reasoning reply.
Step 4 — Disable reasoning explicitly with mode: 'off'

```typescript
const r = await llm.complete('Say hello.', {
  thinking: { mode: 'off' },
  maxTokens: 32,
});
```

mode: 'off' disables reasoning even on models that have it on by default (some
Anthropic models have a minimum thinking budget). Use it when speed and cost matter
more than accuracy, or when the task is simple enough that reasoning adds no value.
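
The behaviour of the three modes can be summarised as a small decision function. This is an illustrative sketch under stated assumptions, not the SDK's code: resolveThinking and the supportsReasoning flag are hypothetical names, and applying the 'medium' default to mode: 'on' is an assumption (this page only states that default for 'auto').

```typescript
type Effort = 'low' | 'medium' | 'high' | 'max';

type ThinkingConfig =
  | { mode: 'auto'; effort?: Effort }
  | { mode: 'on'; effort?: Effort }
  | { mode: 'off' };

type Resolved = { enabled: false } | { enabled: true; effort: Effort };

// supportsReasoning is a hypothetical capability flag for the target model.
function resolveThinking(config: ThinkingConfig, supportsReasoning: boolean): Resolved {
  // 'off' always wins, even on models that reason by default.
  if (config.mode === 'off') return { enabled: false };

  if (!supportsReasoning) {
    // 'on' fails fast; 'auto' silently becomes a no-op.
    if (config.mode === 'on') {
      throw new Error("thinking mode 'on' requires a reasoning-capable model");
    }
    return { enabled: false };
  }

  // Assumed default when effort is omitted.
  return { enabled: true, effort: config.effort ?? 'medium' };
}
```

The only difference between 'auto' and 'on' in this sketch is what happens on a model without reasoning support, which is why 'auto' suits code that targets mixed model lists and 'on' suits code that must not degrade silently.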
Step 5 — Stream thinking tokens
Thinking tokens arrive as 'thinking' events before the text:

```typescript
let thinkingText = '';
let replyText = '';

for await (const ev of llm.stream(
  'What is the 100th prime number? Show your reasoning.',
  { thinking: { mode: 'auto', effort: 'medium' }, maxTokens: 2000 },
)) {
  if (ev.type === 'thinking') thinkingText += ev.text;
  if (ev.type === 'text') replyText += ev.text;
  if (ev.type === 'usage') console.log('reasoning tokens:', ev.usage.reasoningTokens);
}

console.log('Thinking:\n', thinkingText);
console.log('Answer:\n', replyText);
```

Thinking tokens are not included in the final text field; they are delivered separately. On
providers that do not surface thinking tokens as text (xAI currently accepts
reasoning_effort but does not return the raw reasoning text) the 'thinking' event type is never
emitted.
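
If several call sites need to separate thinking from reply text, the accumulation can be factored into a pure reducer. The sketch below assumes the three event shapes used in this step ('thinking', 'text', 'usage'); the StreamEvent and Collected types and the applyEvent function are illustrative names, not SDK exports, and the sample events are made up for the example.

```typescript
// Event shapes assumed from the streaming loop shown in this step.
type StreamEvent =
  | { type: 'thinking'; text: string }
  | { type: 'text'; text: string }
  | { type: 'usage'; usage: { reasoningTokens?: number } };

interface Collected {
  thinking: string;
  reply: string;
  reasoningTokens: number | null; // null until a usage event reports a value
}

const emptyCollected: Collected = { thinking: '', reply: '', reasoningTokens: null };

// Pure reducer: returns a new state for each incoming event.
function applyEvent(state: Collected, ev: StreamEvent): Collected {
  switch (ev.type) {
    case 'thinking':
      return { ...state, thinking: state.thinking + ev.text };
    case 'text':
      return { ...state, reply: state.reply + ev.text };
    case 'usage':
      return { ...state, reasoningTokens: ev.usage.reasoningTokens ?? null };
  }
}

// Made-up sample events standing in for a real stream.
const sampleEvents: StreamEvent[] = [
  { type: 'thinking', text: 'Count primes ' },
  { type: 'thinking', text: 'up to the 100th.' },
  { type: 'text', text: '541' },
  { type: 'usage', usage: { reasoningTokens: 42 } },
];

const collected = sampleEvents.reduce(applyEvent, emptyCollected);
console.log(collected.reply); // '541'
```

Inside a real for await loop the same function can be applied one event at a time. On a provider that never emits 'thinking' events, the thinking field simply stays empty.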
Your options
ThinkingConfig has three variants:

```typescript
type ThinkingConfig =
  | { mode: 'auto'; effort?: 'low' | 'medium' | 'high' | 'max' }
  | { mode: 'on'; effort?: 'low' | 'medium' | 'high' | 'max' }
  | { mode: 'off' };
```

| mode | Behaviour |
|---|---|
| 'auto' | Enables reasoning on models that support it; no-op on others. effort defaults to 'medium' when omitted. Safe for code that targets both reasoning and non-reasoning models. |
| 'on' | Requires reasoning. Throws if the model does not support it. Use when you need the guarantee. |
| 'off' | Disables reasoning, even on models with a default budget. Use to reduce cost on simple tasks. |

| effort | When to use |
|---|---|
| 'low' | Single-step arithmetic, boolean logic, simple classification. Cheapest. |
| 'medium' | Multi-step reasoning, code review, data extraction from complex text. |
| 'high' | Mathematical proofs, algorithmic design, tasks where errors are expensive. |
| 'max' | When you do not know the complexity ahead of time. Highest cost, highest quality ceiling. |

Effect on maxTokens: thinking tokens count against maxTokens on Anthropic
(they are drawn from the same budget). If you set maxTokens: 512 with
effort: 'high' (a thinking budget of roughly 10,000 tokens), the model may exhaust
maxTokens on thinking and produce no reply. Set maxTokens significantly higher than
you expect for the reply itself; 2x-4x is a good starting point for high-effort reasoning.
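
The arithmetic behind this warning can be made explicit. The sketch below assumes thinking and reply tokens share a single maxTokens allowance, as described for Anthropic above. Both helper names are hypothetical. suggestMaxTokens is a deliberately conservative variant that reserves the whole thinking budget and then applies a reply safety factor in the 2x-4x range.

```typescript
// Assumption: thinking tokens and reply tokens are drawn from the same maxTokens.

// Worst-case tokens left for the reply if the model spends its full thinking budget.
function replyHeadroom(maxTokens: number, thinkingBudget: number): number {
  return Math.max(0, maxTokens - thinkingBudget);
}

// Conservative suggestion: full thinking budget plus a padded reply estimate.
function suggestMaxTokens(
  thinkingBudget: number,
  expectedReplyTokens: number,
  safetyFactor = 2, // the text above suggests 2x-4x for high-effort reasoning
): number {
  return thinkingBudget + Math.ceil(expectedReplyTokens * safetyFactor);
}

// The failure case from the text: maxTokens 512 with a ~10000-token budget.
console.log(replyHeadroom(512, 10000)); // 0
// A roomier setting for an expected 500-token reply.
console.log(suggestMaxTokens(10000, 500)); // 11000
```

A headroom of 0 is the situation described above, where the model can use up the whole allowance on thinking and return no reply.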
Cost of thinking tokens: reasoning tokens are billed at the same rate as output
tokens on most providers. A 'high' effort call can cost 10-20x more than a
'low' or no-reasoning call on the same input. Track usage.reasoningTokens to
understand the breakdown.
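
To turn usage.reasoningTokens into a rough cost figure, multiply by the output-token price. This is a minimal sketch assuming reasoning tokens are billed at the output rate, as stated above for most providers. The function name and the price of 15 USD per million tokens are placeholders for illustration, not real pricing.

```typescript
// Assumption: reasoning tokens are billed at the provider's output-token rate.
// outputPricePerMillionUsd is a caller-supplied placeholder, not real pricing.
function reasoningCostUsd(
  reasoningTokens: number | undefined,
  outputPricePerMillionUsd: number,
): number {
  // Providers that do not report reasoning tokens are treated as zero here.
  const tokens = reasoningTokens ?? 0;
  return (tokens * outputPricePerMillionUsd) / 1_000_000;
}

// 10000 reasoning tokens at a hypothetical 15 USD per million output tokens.
const highEffortCost = reasoningCostUsd(10000, 15);
console.log(highEffortCost.toFixed(2)); // '0.15'
```

Logging this value next to each call makes it easier to see which effort levels account for most of the bill.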
Compare the SDKs
The structural difference: each official SDK exposes a different interface for the same
concept. Anthropic’s thinking.budget_tokens is a raw integer; OpenAI’s
reasoning_effort is a string enum; Google’s thinkingConfig.thinkingBudget is an
integer with a different scale. ORXA maps the semantic effort level to each
provider’s native parameter, so switching from openai/o4-mini to
anthropic/claude-3-7-sonnet requires only changing the model string.
Gotchas and next steps
Not all models support reasoning. With mode: 'auto', standard chat models (gpt-4o,
claude-3-5-sonnet, gemini-2.0-flash) treat the thinking option as a no-op. Use mode: 'on'
if you need a hard guarantee that reasoning is active.
effort: 'max' + small maxTokens = truncated answer. The model fills its thinking
budget first, then generates a reply in the remaining token allowance. If maxTokens is
too small the reply is truncated. Always set maxTokens generously when using reasoning.
Thinking tokens are not in text. r.text and streamed 'text' events contain only
the model’s final reply. Thinking content arrives in r.thinking (non-streaming) or
'thinking' stream events (streaming). Some providers do not surface thinking text at all
even when reasoning is active.
Next steps:
- Streaming — stream thinking tokens alongside reply tokens
- Structured output — combine reasoning with JSON extraction
- Tool call — reasoning models excel at multi-step tool use