Reasoning / thinking


What you will achieve

Enable extended reasoning on a math question, get the correct answer, and see thinking tokens reported in usage — one thinking option across all reasoning-capable models.

When and why you need this

Reasoning models (OpenAI o3, o4-mini; Anthropic claude-3-7-sonnet; Google gemini-2.0-flash-thinking; xAI grok-3-mini) spend tokens on internal reasoning before producing a reply. This “thinking” step often improves accuracy substantially on math, logic, multi-step planning, and code correctness.

Each provider exposes the knob differently:

Anthropic — thinking: { type: 'enabled', budget_tokens: N }.
OpenAI / xAI — reasoning_effort: 'low' | 'medium' | 'high'.
Google — thinkingConfig: { thinkingBudget: N }.

Three shapes for one concept. Code that hard-codes one shape fails as soon as you switch to a provider that expects another.

Step by step

Step 1 — Enable thinking with `mode: 'auto'`

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});

const r = await llm.complete(
  'If 2x = 10, what is x? Reply with just the number.',
  {
    thinking: { mode: 'auto', effort: 'low' },
    maxTokens: 512,
  },
);

console.log(r.text);              // '5'
console.log(r.usage.reasoningTokens); // thinking tokens used (where provider reports them)

mode: 'auto' means “use extended reasoning if the model supports it; be a no-op on models that do not”. This is the safest default — the same call works on both reasoning and non-reasoning model slugs.

Step 2 — Choose effort level

effort maps to a budget on each provider:

effort	Anthropic budget_tokens	OpenAI/xAI reasoning_effort	Google thinkingBudget
`'low'`	~1024	`'low'`	~1024
`'medium'`	~5000	`'medium'`	~8000
`'high'`	~10000	`'high'`	~24576
`'max'`	max for model	`'high'`	max for model

const r = await llm.complete('Prove that sqrt(2) is irrational.', {
  thinking: { mode: 'auto', effort: 'high' },
  maxTokens: 4096,
});

Use 'low' for quick math and logic checks. Use 'high' for multi-step proofs, complex code generation, or tasks where getting it wrong is expensive. 'max' is useful when you do not know the reasoning budget ahead of time and want the model to use as much as it needs.
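
To make the effort table concrete, here is a minimal sketch of how one effort level could be translated into each provider's native parameter. It is not the SDK's actual implementation: `toNativeParams`, `Provider`, and the `modelMaxBudget` argument are hypothetical names introduced for illustration, and the budgets are the approximate figures from the table above.

```typescript
// Sketch only: names below are hypothetical, not SDK exports.
type Effort = 'low' | 'medium' | 'high' | 'max';
type Provider = 'anthropic' | 'openai' | 'xai' | 'google';

type NativeParams =
  | { thinking: { type: 'enabled'; budget_tokens: number } }
  | { reasoning_effort: 'low' | 'medium' | 'high' }
  | { thinkingConfig: { thinkingBudget: number } };

// Approximate token budgets taken from the effort table. 'max' depends on
// the model, so the caller supplies that ceiling (an assumption of this sketch).
const ANTHROPIC_BUDGET = { low: 1024, medium: 5000, high: 10000 } as const;
const GOOGLE_BUDGET = { low: 1024, medium: 8000, high: 24576 } as const;

function toNativeParams(
  provider: Provider,
  effort: Effort,
  modelMaxBudget: number,
): NativeParams {
  switch (provider) {
    case 'anthropic':
      return {
        thinking: {
          type: 'enabled',
          budget_tokens: effort === 'max' ? modelMaxBudget : ANTHROPIC_BUDGET[effort],
        },
      };
    case 'openai':
    case 'xai':
      // No level above 'high' exists here, so 'max' collapses to 'high'.
      return { reasoning_effort: effort === 'max' ? 'high' : effort };
    case 'google':
      return {
        thinkingConfig: {
          thinkingBudget: effort === 'max' ? modelMaxBudget : GOOGLE_BUDGET[effort],
        },
      };
  }
}
```

The point of the sketch is that the caller only ever names an effort level; the integer-versus-enum difference stays inside one function.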

Step 3 — Force reasoning on with `mode: 'on'`

const r = await llm.complete('Design the data model for a social media app.', {
  thinking: { mode: 'on', effort: 'medium' },
  maxTokens: 2000,
});

mode: 'on' throws if the model does not support extended thinking. Use it when you explicitly require reasoning and want to fail fast if the model is wrong for the task, rather than silently getting a non-reasoning reply.

Step 4 — Disable reasoning explicitly with `mode: 'off'`

const r = await llm.complete('Say hello.', {
  thinking: { mode: 'off' },
  maxTokens: 32,
});

mode: 'off' disables reasoning even on models that have it on by default (some Anthropic models have a minimum thinking budget). Use it when speed and cost matter more than accuracy, or when the task is simple enough that reasoning adds no value.

Step 5 — Stream thinking tokens

Thinking tokens arrive as 'thinking' events before the text:

let thinkingText = '';
let replyText = '';

for await (const ev of llm.stream(
  'What is the 100th prime number? Show your reasoning.',
  { thinking: { mode: 'auto', effort: 'medium' }, maxTokens: 2000 },
)) {
  if (ev.type === 'thinking') thinkingText += ev.text;
  if (ev.type === 'text')     replyText    += ev.text;
  if (ev.type === 'usage')    console.log('reasoning tokens:', ev.usage.reasoningTokens);
}

console.log('Thinking:\n', thinkingText);
console.log('Answer:\n', replyText);

Thinking tokens are not included in the final text field; they arrive separately. Some providers do not surface reasoning as text at all: xAI currently accepts reasoning_effort but does not return the raw reasoning text, so the 'thinking' event type is never emitted there.
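
If you want the streaming loop as a reusable helper, one option is to fold events through a small pure reducer. The sketch below assumes the event shapes implied by the loop above (`type`, `text`, `usage.reasoningTokens`); the `StreamEvent` and `Collected` types and both function names are hypothetical, and the SDK's real event types may carry more variants.

```typescript
// Assumed event shapes, inferred from the streaming loop; not the SDK's own types.
type StreamEvent =
  | { type: 'thinking'; text: string }
  | { type: 'text'; text: string }
  | { type: 'usage'; usage: { reasoningTokens?: number } };

interface Collected {
  thinking: string;
  reply: string;
  reasoningTokens: number | undefined;
}

const emptyCollected: Collected = { thinking: '', reply: '', reasoningTokens: undefined };

// Pure reducer: folds one event into the running result.
function reduceEvent(acc: Collected, ev: StreamEvent): Collected {
  switch (ev.type) {
    case 'thinking':
      return { ...acc, thinking: acc.thinking + ev.text };
    case 'text':
      return { ...acc, reply: acc.reply + ev.text };
    case 'usage':
      return { ...acc, reasoningTokens: ev.usage.reasoningTokens };
    default:
      // Ignore event types this sketch does not know about.
      return acc;
  }
}

// Drains any async iterable of events, for example the result of llm.stream(...).
async function collectStream(stream: AsyncIterable<StreamEvent>): Promise<Collected> {
  let acc = emptyCollected;
  for await (const ev of stream) acc = reduceEvent(acc, ev);
  return acc;
}
```

On a provider that never emits 'thinking' events, `thinking` simply stays an empty string, so downstream code does not need a separate branch for that case.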

Your options

ThinkingConfig has three variants:

type ThinkingConfig =
  | { mode: 'auto'; effort?: 'low' | 'medium' | 'high' | 'max' }
  | { mode: 'on';   effort?: 'low' | 'medium' | 'high' | 'max' }
  | { mode: 'off' };

mode	Behaviour
`'auto'`	Enables reasoning on models that support it; no-op on others. `effort` defaults to `'medium'` when omitted. Safe for code that targets both reasoning and non-reasoning models.
`'on'`	Requires reasoning. Throws if the model does not support it. Use when you need the guarantee.
`'off'`	Disables reasoning, even on models with a default budget. Use to reduce cost on simple tasks.
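
The three modes in the table reduce to a small decision rule. The sketch below is an illustration of that rule only: `shouldReason` and the `supportsReasoning` flag are hypothetical, the error message is invented, and how the SDK actually detects model capability is not covered here.

```typescript
type ThinkingMode = 'auto' | 'on' | 'off';

// `supportsReasoning` is a hypothetical capability flag for the target model.
function shouldReason(mode: ThinkingMode, supportsReasoning: boolean): boolean {
  switch (mode) {
    case 'off':
      // Never reason, even if the model could.
      return false;
    case 'auto':
      // Reason when possible, silently skip otherwise.
      return supportsReasoning;
    case 'on':
      // Fail fast instead of returning a non-reasoning reply.
      if (!supportsReasoning) {
        throw new Error("thinking.mode 'on' requires a reasoning-capable model");
      }
      return true;
  }
}
```

Only 'on' can throw, which is why 'auto' is the safer choice for code that runs against a configurable model slug.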

effort	When to use
`'low'`	Single-step arithmetic, boolean logic, simple classification. Cheapest.
`'medium'`	Multi-step reasoning, code review, data extraction from complex text.
`'high'`	Mathematical proofs, algorithmic design, tasks where errors are expensive.
`'max'`	When you do not know the complexity ahead of time. Highest cost, highest quality ceiling.

Effect on maxTokens: on Anthropic, thinking tokens count against maxTokens (both are drawn from the same budget). If you set maxTokens: 512 with effort: 'high' (a budget of about 10,000 tokens), the model may exhaust maxTokens on thinking and produce no reply. Set maxTokens well above what you expect the reply itself to need; 2x to 4x is a good starting point for high-effort reasoning.
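
The arithmetic behind that warning can be written down directly. This is a pessimistic sketch that assumes the model spends its entire thinking budget before replying (in practice it may stop earlier) and uses the approximate Anthropic budgets from the effort table. Both helper names are hypothetical, and 'max' is left out because its budget is model-specific.

```typescript
// Approximate Anthropic thinking budgets from the effort table ('max' omitted).
const APPROX_BUDGET = { low: 1024, medium: 5000, high: 10000 } as const;
type BoundedEffort = keyof typeof APPROX_BUDGET;

// Worst case: tokens left for the reply after a fully spent thinking budget.
function worstCaseReplyTokens(maxTokens: number, effort: BoundedEffort): number {
  return Math.max(0, maxTokens - APPROX_BUDGET[effort]);
}

// Smallest maxTokens that still leaves `replyTokens` in that worst case.
function minMaxTokens(effort: BoundedEffort, replyTokens: number): number {
  return APPROX_BUDGET[effort] + replyTokens;
}

console.log(worstCaseReplyTokens(512, 'high')); // 0
console.log(minMaxTokens('high', 1000));        // 11000
```

Treat the result as an upper bound for planning rather than a prediction; it does not account for any clamping the SDK or provider may apply.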

Cost of thinking tokens: reasoning tokens are billed at the same rate as output tokens on most providers. A 'high' effort call can cost 10-20x more than a 'low' or no-reasoning call on the same input. Track usage.reasoningTokens to understand the breakdown.
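
To see where the money goes, a cost breakdown can be computed from the usage object. The sketch below makes several assumptions that you should check against your provider: the `inputTokens` and `outputTokens` field names are hypothetical (only `reasoningTokens` appears in this guide), `outputTokens` is assumed to exclude reasoning tokens, reasoning is billed at the output rate, and the prices are placeholders.

```typescript
// Hypothetical usage and pricing shapes; only `reasoningTokens` comes from this guide.
interface Usage {
  inputTokens: number;
  outputTokens: number;      // assumed to exclude reasoning tokens
  reasoningTokens?: number;  // may be absent when the provider does not report it
}

interface Pricing {
  inputPerMTok: number;   // USD per million input tokens
  outputPerMTok: number;  // USD per million output tokens
}

interface CostBreakdown {
  input: number;
  reply: number;
  reasoning: number;
  total: number;
}

// Bills reasoning tokens at the output rate, as most providers do.
function costBreakdown(usage: Usage, pricing: Pricing): CostBreakdown {
  const input = (usage.inputTokens * pricing.inputPerMTok) / 1_000_000;
  const reply = (usage.outputTokens * pricing.outputPerMTok) / 1_000_000;
  const reasoning = ((usage.reasoningTokens ?? 0) * pricing.outputPerMTok) / 1_000_000;
  return { input, reply, reasoning, total: input + reply + reasoning };
}
```

Logging the `reasoning` share per call is a quick way to spot prompts where a lower effort level would be enough.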

Compare the SDKs

import { createLLM } from '@combycode/llm-sdk';

// One `thinking` option, mapped to each provider's reasoning knob (anthropic
// thinking budget, openai/xai reasoning effort, google thinkingConfig).
const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const t0 = performance.now();
const r = await llm.complete('If 2x=10 what is x? Reply with just the number.', {
  thinking: { mode: 'auto', effort: 'low' },
  maxTokens: 512,
});

console.log(JSON.stringify({ result: r.text.trim(), ms: Math.round(performance.now() - t0) }));

The structural difference: each official SDK exposes a different interface for the same concept — Anthropic’s thinking.budget_tokens is a raw integer; OpenAI’s reasoning_effort is a string enum; Google’s thinkingConfig.thinkingBudget is an integer with a different scale. ORXA maps the semantic effort level to each provider’s native parameter, so switching from openai/o4-mini to anthropic/claude-3-7-sonnet requires only changing the model string.

Gotchas and next steps

Not all models support reasoning. With mode: 'auto', standard chat models (gpt-4o, claude-3-5-sonnet, gemini-2.0-flash) treat the thinking option as a no-op. Use mode: 'on' if you need a hard guarantee that reasoning is active; on these models it throws instead of silently returning a non-reasoning reply.

effort: 'max' + small maxTokens = truncated answer. The model fills its thinking budget first, then generates a reply in the remaining token allowance. If maxTokens is too small the reply is truncated. Always set maxTokens generously when using reasoning.

Thinking tokens are not in text. r.text and streamed 'text' events contain only the model’s final reply. Thinking content arrives in r.thinking (non-streaming) or 'thinking' stream events (streaming). Some providers do not surface thinking text at all even when reasoning is active.

Next steps:

Streaming — stream thinking tokens alongside reply tokens
Structured output — combine reasoning with JSON extraction
Tool call — reasoning models excel at multi-step tool use