
Token streaming

Stream 'Count from 1 to 5.' and receive tokens as the model emits them, using a single for await loop with one event shape regardless of provider.

With a blocking complete() call, your UI or terminal shows nothing until the model finishes. For long answers (code, essays, analysis), that can mean several seconds of blank screen. Streaming lets you forward each token to the user the instant it arrives.

The challenge with raw provider APIs: each has a different chunk shape.

  • OpenAI Responses API emits typed events like response.output_text.delta with .delta text.
  • Anthropic emits content_block_delta events with .delta.text.
  • Google emits generateContentStream chunks with .candidates[0].content.parts[0].text.

Normalising these into one format requires provider-specific parsing code in your application.
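
To make that cost concrete, here is a minimal sketch of the kind of per-provider parsing you would otherwise write by hand. The chunk types are simplified assumptions built only from the field paths listed above; they are not the official SDK typings, and extractText and RawChunk are hypothetical names used for illustration.

```typescript
// Simplified, hypothetical chunk shapes (assumptions, not official typings).
type OpenAIChunk = { type: string; delta?: string };
type AnthropicChunk = { type: string; delta?: { text?: string } };
type GoogleChunk = {
  candidates?: { content?: { parts?: { text?: string }[] } }[];
};

type RawChunk =
  | { provider: 'openai'; chunk: OpenAIChunk }
  | { provider: 'anthropic'; chunk: AnthropicChunk }
  | { provider: 'google'; chunk: GoogleChunk };

// Returns the text carried by a raw chunk, or undefined if it carries none.
function extractText(raw: RawChunk): string | undefined {
  switch (raw.provider) {
    case 'openai':
      // Only the output_text delta event carries text in .delta.
      return raw.chunk.type === 'response.output_text.delta'
        ? raw.chunk.delta
        : undefined;
    case 'anthropic':
      // Text lives one level deeper, under .delta.text.
      return raw.chunk.type === 'content_block_delta'
        ? raw.chunk.delta?.text
        : undefined;
    case 'google':
      // No event type to switch on; the text is nested in the first candidate.
      return raw.chunk.candidates?.[0]?.content?.parts?.[0]?.text;
  }
}
```

Every new provider adds another branch like these, which is the maintenance burden a normalised event shape removes.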

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});

createLLM() returns an LLMClient bound to one provider and model. The client is reusable: create it once and call .stream() as many times as you need. Creating a new client per request works, but it repeats adapter resolution on every call.

Step 2 — Open the stream and iterate events

let text = '';
for await (const ev of llm.stream('Count from 1 to 5.')) {
  if (ev.type === 'text') {
    text += ev.text;
    process.stdout.write(ev.text); // forward to UI or terminal in real time
  }
}
console.log('\n--- done ---');
console.log('full text:', text);

The for await loop runs until the model closes the stream. Every provider’s events are normalised to the same StreamEvent union before they reach your loop.

Step 3 — Handle all event types you care about

Not every call emits every event type. Check what your use case needs:

for await (const ev of llm.stream('Explain async/await in TypeScript.', {
  thinking: { mode: 'auto', effort: 'low' },
  maxTokens: 512,
})) {
  switch (ev.type) {
    case 'thinking':
      // Reasoning tokens from o3, claude-3-7-sonnet, gemini-thinking models.
      process.stdout.write(`[think] ${ev.text}`);
      break;
    case 'text':
      process.stdout.write(ev.text);
      break;
    case 'usage':
      console.log('\nTokens:', ev.usage.inputTokens, '->', ev.usage.outputTokens);
      break;
    case 'done':
      console.log('\nFinish reason:', ev.finishReason);
      break;
    case 'error':
      console.error('Stream error:', ev.error.message);
      break;
  }
}

Step 4 — Stream with tools (tool-call events)

When you pass tools, the model may emit tool-call events mid-stream before the final text:

import { defineTool } from '@combycode/llm-sdk';

const getTime = defineTool({
  name: 'get_time',
  description: 'Return the current UTC time string.',
  params: {},
  execute: () => new Date().toISOString(),
});

for await (const ev of llm.stream('What time is it?', { tools: [getTime] })) {
  if (ev.type === 'tool_call_start') console.log('calling tool:', ev.name);
  if (ev.type === 'tool_call_delta') process.stdout.write(ev.arguments);
  if (ev.type === 'tool_call_end') console.log(' (args complete)');
  if (ev.type === 'text') process.stdout.write(ev.text);
}
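
Because tool_call_delta carries partial JSON, a raw stream consumer has to buffer the fragments per call id and parse them only once tool_call_end arrives. The following is a minimal sketch of that bookkeeping. The ToolEvent type is a local stand-in that mirrors the tool-call fields documented on this page, and ToolCallAccumulator is a hypothetical helper, not part of the SDK.

```typescript
// Local stand-in for the three tool-call events (an assumption, not the SDK's type).
type ToolEvent =
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; arguments: string }
  | { type: 'tool_call_end'; id: string };

interface CompletedCall {
  id: string;
  name: string;
  args: unknown;
}

// Hypothetical helper: buffers argument fragments per call id.
class ToolCallAccumulator {
  private pending = new Map<string, { name: string; json: string }>();

  // Feed one event. Returns the completed call on tool_call_end, else undefined.
  push(ev: ToolEvent): CompletedCall | undefined {
    switch (ev.type) {
      case 'tool_call_start':
        this.pending.set(ev.id, { name: ev.name, json: '' });
        return undefined;
      case 'tool_call_delta': {
        const call = this.pending.get(ev.id);
        if (call) call.json += ev.arguments; // fragments are not valid JSON yet
        return undefined;
      }
      case 'tool_call_end': {
        const call = this.pending.get(ev.id);
        if (!call) return undefined;
        this.pending.delete(ev.id);
        // Assumption: an empty buffer means the tool takes no arguments.
        return { id: ev.id, name: call.name, args: JSON.parse(call.json || '{}') };
      }
    }
  }
}
```

Keying the buffer by id keeps the fragments separate if a model interleaves more than one tool call in the same stream.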

Use an AbortController to cancel before the model finishes:

const controller = new AbortController();
setTimeout(() => controller.abort(), 2000); // cancel after 2 seconds

try {
  for await (const ev of llm.stream('Write a very long story.', {
    signal: controller.signal,
    maxTokens: 2000,
  })) {
    if (ev.type === 'text') process.stdout.write(ev.text);
  }
} catch (err) {
  if (controller.signal.aborted) console.log('stream cancelled');
  else throw err;
}

stream() accepts the same ExecuteOptions as complete(). The streaming-specific considerations:

| Option | Effect on streaming |
| --- | --- |
| maxTokens | Still required. The model can generate indefinitely without it, so set a cap. |
| thinking | Enables 'thinking' events in the stream. Only emitted by models that support extended reasoning (o3, claude-3-7-sonnet, gemini-2.0-flash-thinking). |
| signal | AbortSignal from an AbortController. The for await loop throws when aborted. |
| timeout | Milliseconds. If the stream does not complete in time, it throws a timeout error. |
| tools | Enables tool_call_start/delta/end events. The AgentLoop handles these automatically; in raw stream you must execute tools yourself and resume. |
| providerOptions | Provider-specific fields forwarded verbatim. Useful for OpenAI’s stream_options: { include_usage: true } if you need usage on older models. |

Full StreamEvent union:

| Event type | Fields | When it fires |
| --- | --- | --- |
| text | text: string | Each text token chunk. May be one word, a partial word, or several. |
| thinking | text: string | Each reasoning/thinking token chunk (only when thinking is enabled). |
| tool_call_start | id, name | Model is about to call a tool. Arguments stream in tool_call_delta events. |
| tool_call_delta | id, arguments: string | Partial JSON of the tool arguments. Accumulate to build the full args object. |
| tool_call_end | id | Arguments are complete for this tool call. |
| usage | usage: Usage | Final token counts. Fires once, near the end, when the provider reports usage. |
| done | finishReason: string | Stream is closed. finishReason is 'stop', 'tool_calls', 'length', or 'content_filter'. |
| error | error: Error | Stream-level error from the provider. The for await loop does NOT re-throw automatically; check this event type if you want to handle it without a try/catch. |
| media_start | mediaType, mimeType | Media stream opening (image/audio/video generation). |
| media_chunk | data: string, progress? | Base64 chunk of generated media. |
| media_end | mediaId? | Media stream closing. |

When to use stream() vs complete():

complete() waits for the entire response, then returns. Use it when your code needs the full text before doing anything (e.g. parsing JSON, running validation). Use stream() when you want to display partial output progressively, or when the request is long enough that first-token latency matters for user experience.

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const t0 = performance.now();
let text = '';
for await (const ev of llm.stream('Count from 1 to 5.')) {
  if (ev.type === 'text') text += ev.text;
}

console.log(JSON.stringify({ result: text.trim(), ms: Math.round(performance.now() - t0) }));
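
The snippet above times the whole stream. Since first-token latency is the main reason to prefer stream(), it can help to measure it separately. Below is a rough sketch that works on any async iterable of events with a type field and an optional text field. timeStream and TimedEvent are hypothetical names, and the injectable now clock is an assumption added so the helper can be tested deterministically.

```typescript
// Minimal event shape this helper needs (an assumption, not the SDK's StreamEvent).
type TimedEvent = { type: string; text?: string };

// Hypothetical helper: consumes a stream and reports time to first text chunk
// alongside total time. `now` defaults to Date.now and can be swapped in tests.
async function timeStream(
  events: AsyncIterable<TimedEvent>,
  now: () => number = Date.now,
) {
  const t0 = now();
  let firstTokenMs: number | undefined;
  let text = '';
  for await (const ev of events) {
    if (ev.type !== 'text' || ev.text === undefined) continue;
    // Record latency only once, on the first text chunk.
    if (firstTokenMs === undefined) firstTokenMs = now() - t0;
    text += ev.text;
  }
  return { text, firstTokenMs, totalMs: now() - t0 };
}
```

Assuming llm.stream() returns an async iterable of events as shown in the steps above, a call would look like `await timeStream(llm.stream('Count from 1 to 5.'))`.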

The structural difference: official SDKs emit provider-specific event shapes. OpenAI Responses API emits a dozen distinct event types (response.created, response.output_item.added, response.output_text.delta, etc.) that you switch on. Anthropic emits content_block_start, content_block_delta, message_delta, each with different sub-fields. ORXA collapses all of these into the single StreamEvent union above: eight text, reasoning, tool-call, and lifecycle events, plus three media events. Your for await loop handles those cases once and works on every provider.

done fires before the for await loop exits. The loop ends only when the underlying async iterable is exhausted, which happens after done is emitted. Treat done as a signal that usage stats will not change, but remember that code placed after the loop runs only once the iterator has finished.
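
One way to work with that ordering is to record usage and finishReason inside the loop and read them only after it ends. This is a sketch under assumptions: CollectedEvent is a local stand-in covering just three of the documented event types, and collect is a hypothetical helper.

```typescript
// Local stand-ins for the documented fields (assumptions, not the SDK's types).
type Usage = { inputTokens: number; outputTokens: number };
type CollectedEvent =
  | { type: 'text'; text: string }
  | { type: 'usage'; usage: Usage }
  | { type: 'done'; finishReason: string };

// Hypothetical helper: drains the stream and returns everything in one object.
async function collect(events: AsyncIterable<CollectedEvent>) {
  let text = '';
  let usage: Usage | undefined;
  let finishReason: string | undefined;
  for await (const ev of events) {
    if (ev.type === 'text') text += ev.text;
    else if (ev.type === 'usage') usage = ev.usage;
    // No break on 'done': the iterator is left to finish on its own.
    else if (ev.type === 'done') finishReason = ev.finishReason;
  }
  // Reached only after the iterable is exhausted, so the values are final.
  return { text, usage, finishReason };
}
```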

error events do not throw. If you need error propagation, check ev.type === 'error' and throw manually. A try/catch around the entire loop is still worthwhile, because the underlying fetch throws on network errors.
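
If you would rather have a single try/catch cover both cases, a thin wrapper can turn error events into exceptions. The sketch below is not an SDK feature: throwOnError is a hypothetical helper, and it assumes only that error events carry an error field as listed in the event table.

```typescript
// Hypothetical wrapper: re-yields every event, but throws when an 'error'
// event appears so the caller's try/catch sees provider errors too.
async function* throwOnError<T extends { type: string; error?: Error }>(
  events: AsyncIterable<T>,
): AsyncGenerator<T> {
  for await (const ev of events) {
    if (ev.type === 'error') {
      // Fall back to a generic Error if the event carries no error object.
      throw ev.error ?? new Error('stream reported an error');
    }
    yield ev;
  }
}
```

Wrapping the stream, as in `for await (const ev of throwOnError(llm.stream(prompt)))`, keeps the loop body free of error branches. The trade-off is that events after the error are never seen, which matches the usual expectation for a failed stream.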

Streaming structured output is not supported. structured requires the model to emit valid JSON in one block. Use complete() for structured extraction.

Next steps: