Token streaming

What you will achieve

Stream 'Count from 1 to 5.' and receive tokens as the model emits them — using a single for await loop with one event shape, regardless of provider.

When and why you need this

With a blocking complete() call, your UI or terminal shows nothing until the model finishes. For long answers (code, essays, analysis), that can mean several seconds of blank screen. Streaming lets you forward each token to the user the instant it arrives.

The challenge with raw provider APIs: each has a different chunk shape.

The OpenAI Responses API emits typed events such as response.output_text.delta, with the text in .delta.
Anthropic emits content_block_delta events with .delta.text.
Google emits generateContentStream chunks with .candidates[0].content.parts[0].text.

Normalising these into one format requires provider-specific parsing code in your application.
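
To make that cost concrete, here is a minimal sketch of the kind of parsing code an application would otherwise carry. The chunk types are simplified stand-ins that model only the fields named in the list above, and `ProviderChunk` and `extractText` are hypothetical names chosen for illustration; they are not part of any SDK.

```typescript
// Simplified stand-ins for each provider's streaming chunk. Only the fields
// mentioned above are modelled; real payloads carry many more.
type OpenAIChunk = { type: string; delta?: string };
type AnthropicChunk = { type: string; delta?: { text?: string } };
type GoogleChunk = {
  candidates?: { content?: { parts?: { text?: string }[] } }[];
};

// Hypothetical wrapper that tags a raw chunk with the provider it came from.
type ProviderChunk =
  | { provider: 'openai'; chunk: OpenAIChunk }
  | { provider: 'anthropic'; chunk: AnthropicChunk }
  | { provider: 'google'; chunk: GoogleChunk };

// Returns the text carried by a chunk, or '' when the chunk carries no text.
function extractText(input: ProviderChunk): string {
  switch (input.provider) {
    case 'openai':
      return input.chunk.type === 'response.output_text.delta'
        ? (input.chunk.delta ?? '')
        : '';
    case 'anthropic':
      return input.chunk.type === 'content_block_delta'
        ? (input.chunk.delta?.text ?? '')
        : '';
    case 'google':
      return input.chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }
}
```

Every provider added to an application means another branch like these, plus matching branches for tool calls, usage, and errors.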

Step by step

Step 1 — Create a reusable `LLMClient`

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});

createLLM() returns an LLMClient bound to one provider and model. The client is reusable: create it once and call .stream() as many times as you need. Creating a new client per request also works, but it repeats adapter resolution on every call.

Step 2 — Open the stream and iterate events

let text = '';
for await (const ev of llm.stream('Count from 1 to 5.')) {
  if (ev.type === 'text') {
    text += ev.text;
    process.stdout.write(ev.text);   // forward to UI or terminal in real time
  }
}
console.log('\n--- done ---');
console.log('full text:', text);

The for await loop runs until the model closes the stream. Every provider’s events are normalised to the same StreamEvent union before they reach your loop.

Step 3 — Handle all event types you care about

Not every call emits every event type. Check what your use case needs:

for await (const ev of llm.stream('Explain async/await in TypeScript.', {
  thinking: { mode: 'auto', effort: 'low' },
  maxTokens: 512,
})) {
  switch (ev.type) {
    case 'thinking':
      // Reasoning tokens from o3, claude-3-7-sonnet, gemini-thinking models.
      process.stdout.write(`[think] ${ev.text}`);
      break;
    case 'text':
      process.stdout.write(ev.text);
      break;
    case 'usage':
      console.log('\nTokens:', ev.usage.inputTokens, '->', ev.usage.outputTokens);
      break;
    case 'done':
      console.log('\nFinish reason:', ev.finishReason);
      break;
    case 'error':
      console.error('Stream error:', ev.error.message);
      break;
  }
}

Step 4 — Stream with tools (tool-call events)

When you pass tools, the model may emit tool-call events mid-stream before the final text:

import { defineTool } from '@combycode/llm-sdk';

const getTime = defineTool({
  name: 'get_time',
  description: 'Return the current UTC time string.',
  params: {},
  execute: () => new Date().toISOString(),
});

for await (const ev of llm.stream('What time is it?', { tools: [getTime] })) {
  if (ev.type === 'tool_call_start') console.log('calling tool:', ev.name);
  if (ev.type === 'tool_call_delta') process.stdout.write(ev.arguments);
  if (ev.type === 'tool_call_end')   console.log(' (args complete)');
  if (ev.type === 'text')            process.stdout.write(ev.text);
}
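
If you need the parsed arguments rather than the raw fragments, one possible approach is to buffer the deltas per call id and parse once the matching end event arrives. The sketch below assumes the three tool-call event shapes match the fields listed in the StreamEvent table; `ToolCallAccumulator` and `CompletedToolCall` are hypothetical helpers, not SDK exports.

```typescript
// Subset of the event shapes from the StreamEvent table (assumed, not imported).
type ToolCallEvent =
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; arguments: string }
  | { type: 'tool_call_end'; id: string };

type CompletedToolCall = { id: string; name: string; args: unknown };

// Hypothetical helper: buffers argument fragments per call id and parses the
// JSON once the matching tool_call_end arrives.
class ToolCallAccumulator {
  private pending = new Map<string, { name: string; json: string }>();

  // Returns the completed call on tool_call_end, otherwise undefined.
  push(ev: ToolCallEvent): CompletedToolCall | undefined {
    switch (ev.type) {
      case 'tool_call_start':
        this.pending.set(ev.id, { name: ev.name, json: '' });
        return undefined;
      case 'tool_call_delta': {
        const call = this.pending.get(ev.id);
        if (call) call.json += ev.arguments;
        return undefined;
      }
      case 'tool_call_end': {
        const call = this.pending.get(ev.id);
        if (!call) return undefined;
        this.pending.delete(ev.id);
        // A tool without parameters may stream no argument text at all.
        const args: unknown =
          call.json.trim() === '' ? {} : JSON.parse(call.json);
        return { id: ev.id, name: call.name, args };
      }
    }
  }
}
```

Keying the buffer by id keeps the fragments apart if a model interleaves several tool calls in one response. JSON.parse throws on malformed input, so wrap the push call in a try/catch if you want to tolerate a truncated stream.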

Step 5 — Cancel a stream

Use an AbortController to cancel before the model finishes:

const controller = new AbortController();
setTimeout(() => controller.abort(), 2000);  // cancel after 2 seconds

try {
  for await (const ev of llm.stream('Write a very long story.', {
    signal: controller.signal,
    maxTokens: 2000,
  })) {
    if (ev.type === 'text') process.stdout.write(ev.text);
  }
} catch (err) {
  if (controller.signal.aborted) console.log('stream cancelled');
  else throw err;
}

Your options

stream() accepts the same ExecuteOptions as complete(). The streaming-specific considerations:

Option	Effect on streaming
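`maxTokens`	Optional but strongly recommended. Streaming does not bound the response length; without a cap, generation continues until the model stops on its own or hits the provider's output limit.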
`thinking`	Enables `'thinking'` events in the stream. Only emitted by models that support extended reasoning (`o3`, `claude-3-7-sonnet`, `gemini-2.0-flash-thinking`).
`signal`	`AbortSignal` from an `AbortController`. The `for await` loop throws when aborted.
`timeout`	Milliseconds. If the stream does not complete in time, it throws a timeout error.
`tools`	Enables `tool_call_start/delta/end` events. The AgentLoop handles these automatically; in a raw stream() loop you must execute the tools yourself and resume.
`providerOptions`	Provider-specific fields forwarded verbatim. Useful for OpenAI’s `stream_options: { include_usage: true }` if you need usage on older models.

Full StreamEvent union:

Event type	Fields	When it fires
`text`	`text: string`	Each text token chunk. May be one word, a partial word, or several.
`thinking`	`text: string`	Each reasoning/thinking token chunk (only when `thinking` is enabled).
`tool_call_start`	`id, name`	Model is about to call a tool. Arguments stream in `tool_call_delta` events.
`tool_call_delta`	`id, arguments: string`	Partial JSON of the tool arguments. Accumulate to build the full args object.
`tool_call_end`	`id`	Arguments are complete for this tool call.
`usage`	`usage: Usage`	Final token counts. Fires once, near the end, when the provider reports usage.
`done`	`finishReason: string`	Stream is closed. `finishReason` is `'stop'`, `'tool_calls'`, `'length'`, or `'content_filter'`.
`error`	`error: Error`	Stream-level error from the provider. The `for await` loop does NOT re-throw automatically — check this event type if you want to handle it without a try/catch.
`media_start`	`mediaType, mimeType`	Media stream opening (image/audio/video generation).
`media_chunk`	`data: string, progress?`	Base64 chunk of generated media.
`media_end`	`mediaId?`	Media stream closing.
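
The table can be read as a TypeScript discriminated union. The following is a reconstruction from the rows above, not the SDK's actual declaration: field types the table does not state (`mediaType`, `progress`, and anything in `Usage` beyond `inputTokens` and `outputTokens`) are assumptions, and the names `StreamEventSketch`, `UsageSketch`, and `describeEvent` are hypothetical.

```typescript
// Reconstructed from the table above. Field types that the table does not
// state (mediaType, progress, the full shape of Usage) are assumptions.
type UsageSketch = { inputTokens: number; outputTokens: number };

type StreamEventSketch =
  | { type: 'text'; text: string }
  | { type: 'thinking'; text: string }
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; arguments: string }
  | { type: 'tool_call_end'; id: string }
  | { type: 'usage'; usage: UsageSketch }
  | { type: 'done'; finishReason: string }
  | { type: 'error'; error: Error }
  | { type: 'media_start'; mediaType: string; mimeType: string }
  | { type: 'media_chunk'; data: string; progress?: number }
  | { type: 'media_end'; mediaId?: string };

// One-line log label per event. The `never` assignment in the default branch
// makes the compiler flag any union member the switch forgets.
function describeEvent(ev: StreamEventSketch): string {
  switch (ev.type) {
    case 'text':
      return `text(${ev.text.length} chars)`;
    case 'thinking':
      return `thinking(${ev.text.length} chars)`;
    case 'tool_call_start':
      return `tool_call_start(${ev.name})`;
    case 'tool_call_delta':
      return `tool_call_delta(${ev.arguments.length} chars)`;
    case 'tool_call_end':
      return `tool_call_end(${ev.id})`;
    case 'usage':
      return `usage(${ev.usage.inputTokens} in, ${ev.usage.outputTokens} out)`;
    case 'done':
      return `done(${ev.finishReason})`;
    case 'error':
      return `error(${ev.error.message})`;
    case 'media_start':
      return `media_start(${ev.mimeType})`;
    case 'media_chunk':
      return `media_chunk(${ev.data.length} base64 chars)`;
    case 'media_end':
      return 'media_end';
    default: {
      const unreachable: never = ev;
      return unreachable;
    }
  }
}
```

Switching on `type` narrows `ev` in each branch, so accessing `ev.usage` inside the `'text'` case is a compile-time error rather than a runtime surprise.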

When to use stream() vs complete():

complete() waits for the entire response, then returns. Use it when your code needs the full text before doing anything (e.g. parsing JSON, running validation). Use stream() when you want to display partial output progressively, or when the request is long enough that first-token latency matters for user experience.
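
One way to check whether first-token latency matters for your workload is to measure it separately from total time. The helper below is a sketch under stated assumptions: `measureLatency`, `TimedEvent`, and `LatencyReport` are hypothetical names, the event shapes mirror the `text` and `done` rows of the table above, and the clock is injectable only so the function can be exercised with a fake timer.

```typescript
// Minimal event shape for this sketch; only 'text' events affect the result.
type TimedEvent =
  | { type: 'text'; text: string }
  | { type: 'done'; finishReason: string };

type LatencyReport = {
  text: string;
  firstTextMs: number | undefined; // undefined when no text chunk arrived
  totalMs: number;
};

// Drains a stream and records when the first text chunk arrived.
// `now` defaults to Date.now and can be replaced with a fake clock in tests.
async function measureLatency(
  events: AsyncIterable<TimedEvent>,
  now: () => number = () => Date.now(),
): Promise<LatencyReport> {
  const start = now();
  let text = '';
  let firstTextMs: number | undefined;
  for await (const ev of events) {
    if (ev.type !== 'text') continue;
    if (firstTextMs === undefined) firstTextMs = now() - start;
    text += ev.text;
  }
  return { text, firstTextMs, totalMs: now() - start };
}
```

If the SDK's events match these shapes, the result of llm.stream(...) could be passed in directly. A large gap between firstTextMs and totalMs is the time a user would otherwise spend looking at a blank screen.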

Compare the SDKs

import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const t0 = performance.now();
let text = '';
for await (const ev of llm.stream('Count from 1 to 5.')) {
  if (ev.type === 'text') text += ev.text;
}

console.log(JSON.stringify({ result: text.trim(), ms: Math.round(performance.now() - t0) }));

The structural difference: official SDKs emit provider-specific event shapes. The OpenAI Responses API emits more than a dozen distinct event types (response.created, response.output_item.added, response.output_text.delta, etc.) that you switch on. Anthropic emits content_block_start, content_block_delta, and message_delta, each with different sub-fields. ORXA collapses all of these into the StreamEvent union above: eight event types for text, reasoning, and tool calls, plus three for media. Your for await loop handles that one union once and works on every provider.

Gotchas and next steps

done fires before the for await loop exits. The loop ends only when the underlying async iterable is exhausted, which happens after done is emitted. Treat done as a signal that usage stats will not change, but still let the iterator run to completion so the loop exits on its own.
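
A small helper that drains the stream and returns everything at the end illustrates the ordering. This is a sketch, assuming the `text`, `usage`, and `done` events carry the fields listed in the StreamEvent table; `collectStream`, `TailEvent`, and `StreamSummary` are hypothetical names.

```typescript
// Subset of the event shapes from the StreamEvent table (assumed, not imported).
type TailEvent =
  | { type: 'text'; text: string }
  | { type: 'usage'; usage: { inputTokens: number; outputTokens: number } }
  | { type: 'done'; finishReason: string };

type StreamSummary = {
  text: string;
  usage?: { inputTokens: number; outputTokens: number };
  finishReason?: string;
};

// Drains the stream completely, then returns everything gathered on the way.
async function collectStream(
  events: AsyncIterable<TailEvent>,
): Promise<StreamSummary> {
  const summary: StreamSummary = { text: '' };
  for await (const ev of events) {
    if (ev.type === 'text') {
      summary.text += ev.text;
    } else if (ev.type === 'usage') {
      summary.usage = ev.usage;
    } else {
      // 'done': record the reason but keep iterating until the iterator ends.
      summary.finishReason = ev.finishReason;
    }
  }
  return summary;
}
```

The function returns only after the iterator is exhausted, so the caller never sees a summary from a stream that is still open.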

error events do not throw. To propagate them, check ev.type === 'error' and throw manually. Wrap the loop in a try/catch as well, because the underlying fetch still throws on network errors.
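
If you would rather have a single error path, one option is a thin wrapper that turns error events into exceptions. This is a minimal sketch: `throwOnError` and `SketchEvent` are hypothetical names, and the event shapes are assumed to match the `text`, `done`, and `error` rows of the StreamEvent table.

```typescript
// Subset of the event shapes from the StreamEvent table (assumed, not imported).
type SketchEvent =
  | { type: 'text'; text: string }
  | { type: 'done'; finishReason: string }
  | { type: 'error'; error: Error };

// Re-yields every event except 'error', which is thrown instead. Consumers of
// the wrapped stream therefore never see an 'error' event.
async function* throwOnError(
  events: AsyncIterable<SketchEvent>,
): AsyncGenerator<Exclude<SketchEvent, { type: 'error' }>> {
  for await (const ev of events) {
    if (ev.type === 'error') throw ev.error;
    yield ev;
  }
}
```

With the wrapper in place, one try/catch around the for await loop covers both provider-reported errors and network failures, at the cost of losing the ability to continue reading after an error event.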

Streaming structured output is not supported. structured requires the model to emit valid JSON in one block. Use complete() for structured extraction.

Next steps:

Quickstart — non-streaming baseline and key setup
Reasoning — stream thinking tokens alongside text
Multi-turn conversation — stream inside a conversation loop