Token streaming
What you will achieve
Stream 'Count from 1 to 5.' and receive tokens as the model emits them, using a
single for await loop with one event shape, regardless of provider.
When and why you need this
With a blocking complete() call your UI or terminal shows nothing until the model
finishes. For long answers (code, essays, analysis) that can be several seconds of
blank screen. Streaming lets you forward each token to the user the instant it arrives.
The challenge with raw provider APIs: each has a different chunk shape.
- OpenAI Responses API emits typed events like response.output_text.delta, with the text in .delta.
- Anthropic emits content_block_delta events, with the text in .delta.text.
- Google emits generateContentStream chunks, with the text in .candidates[0].content.parts[0].text.
Normalising these into one format requires provider-specific parsing code in your application.
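To make that cost concrete, here is a rough sketch of the per-provider branching an application would otherwise carry. The chunk types are reduced to the fields named in the list above (real payloads carry more), and extractText is a hypothetical helper, not part of any SDK.

```typescript
// Simplified chunk shapes, reduced to the fields named in the list above.
// Real provider payloads carry many more fields; these types are illustrative.
type OpenAIChunk = { type: 'response.output_text.delta'; delta: string };
type AnthropicChunk = { type: 'content_block_delta'; delta: { text: string } };
type GoogleChunk = { candidates: { content: { parts: { text?: string }[] } }[] };

type ProviderChunk =
  | { provider: 'openai'; chunk: OpenAIChunk }
  | { provider: 'anthropic'; chunk: AnthropicChunk }
  | { provider: 'google'; chunk: GoogleChunk };

// Hypothetical helper: pull the text fragment out of one provider chunk.
function extractText(input: ProviderChunk): string {
  switch (input.provider) {
    case 'openai':
      return input.chunk.delta;
    case 'anthropic':
      return input.chunk.delta.text;
    case 'google':
      // A chunk may arrive without candidates or parts; fall back to ''.
      return input.chunk.candidates[0]?.content.parts[0]?.text ?? '';
  }
}

console.log(
  extractText({
    provider: 'openai',
    chunk: { type: 'response.output_text.delta', delta: 'Hi' },
  }),
);
```

Every new provider adds another branch here, which is the maintenance burden a single normalised event shape is meant to remove.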
Step by step
Step 1: Create a reusable LLMClient
```typescript
import { createLLM } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
});
```
createLLM() returns an LLMClient bound to one provider and model. The client is
reusable: create it once and call .stream() as many times as you need. Creating a
new client per request works but repeats adapter resolution on every call.
Step 2: Open the stream and iterate events
```typescript
let text = '';
for await (const ev of llm.stream('Count from 1 to 5.')) {
  if (ev.type === 'text') {
    text += ev.text;
    process.stdout.write(ev.text); // forward to UI or terminal in real time
  }
}
console.log('\n--- done ---');
console.log('full text:', text);
```
The for await loop runs until the model closes the stream. Every provider's events are
normalised to the same StreamEvent union before they reach your loop.
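To exercise this loop without an API key, one option is to swap llm.stream() for a canned async generator. The sketch below assumes only the text and done event shapes described on this page; fakeStream and collectText are hypothetical names, not part of the SDK.

```typescript
// Assumed minimal event shape: only the 'text' and 'done' variants.
type MiniEvent =
  | { type: 'text'; text: string }
  | { type: 'done'; finishReason: string };

// Hypothetical stand-in for llm.stream(): yields canned chunks, no network.
async function* fakeStream(chunks: string[]): AsyncGenerator<MiniEvent> {
  for (const text of chunks) yield { type: 'text', text };
  yield { type: 'done', finishReason: 'stop' };
}

// Same accumulation logic as the Step 2 loop, wrapped so it can be reused.
async function collectText(events: AsyncIterable<MiniEvent>): Promise<string> {
  let text = '';
  for await (const ev of events) {
    if (ev.type === 'text') text += ev.text;
  }
  return text;
}

collectText(fakeStream(['1, 2', ', 3, 4', ', 5'])).then((full) => {
  console.log('full text:', full);
});
```

Because collectText accepts any AsyncIterable, the same function should work with a real stream as long as its events carry the type and text fields assumed here.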
Step 3: Handle all event types you care about
Not every call emits every event type. Check what your use case needs:
```typescript
for await (const ev of llm.stream('Explain async/await in TypeScript.', {
  thinking: { mode: 'auto', effort: 'low' },
  maxTokens: 512,
})) {
  switch (ev.type) {
    case 'thinking':
      // Reasoning tokens from o3, claude-3-7-sonnet, gemini-thinking models.
      process.stdout.write(`[think] ${ev.text}`);
      break;
    case 'text':
      process.stdout.write(ev.text);
      break;
    case 'usage':
      console.log('\nTokens:', ev.usage.inputTokens, '->', ev.usage.outputTokens);
      break;
    case 'done':
      console.log('\nFinish reason:', ev.finishReason);
      break;
    case 'error':
      console.error('Stream error:', ev.error.message);
      break;
  }
}
```
Step 4: Stream with tools (tool-call events)
When you pass tools, the model may emit tool-call events mid-stream before the
final text:
```typescript
import { defineTool } from '@combycode/llm-sdk';

const getTime = defineTool({
  name: 'get_time',
  description: 'Return the current UTC time string.',
  params: {},
  execute: () => new Date().toISOString(),
});

for await (const ev of llm.stream('What time is it?', { tools: [getTime] })) {
  if (ev.type === 'tool_call_start') console.log('calling tool:', ev.name);
  if (ev.type === 'tool_call_delta') process.stdout.write(ev.arguments);
  if (ev.type === 'tool_call_end') console.log(' (args complete)');
  if (ev.type === 'text') process.stdout.write(ev.text);
}
```
Step 5: Cancel a stream
Use an AbortController to cancel before the model finishes:
```typescript
const controller = new AbortController();
setTimeout(() => controller.abort(), 2000); // cancel after 2 seconds

try {
  for await (const ev of llm.stream('Write a very long story.', {
    signal: controller.signal,
    maxTokens: 2000,
  })) {
    if (ev.type === 'text') process.stdout.write(ev.text);
  }
} catch (err) {
  if (controller.signal.aborted) console.log('stream cancelled');
  else throw err;
}
```
Your options
stream() accepts the same ExecuteOptions as complete(). The streaming-specific
considerations are:
| Option | Effect on streaming |
|---|---|
| maxTokens | Still applies, and worth setting: without a cap, a response can run to the model's maximum output length. |
| thinking | Enables 'thinking' events in the stream. Only emitted by models that support extended reasoning (o3, claude-3-7-sonnet, gemini-2.0-flash-thinking). |
| signal | AbortSignal from an AbortController. The for await loop throws when aborted. |
| timeout | Milliseconds. If the stream does not complete in time, it throws a timeout error. |
| tools | Enables tool_call_start/delta/end events. The AgentLoop handles these automatically; in a raw stream you must execute tools yourself and resume. |
| providerOptions | Provider-specific fields forwarded verbatim. Useful for OpenAI's stream_options: { include_usage: true } if you need usage on older models. |
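
As the tools row notes, a raw stream leaves argument assembly to you. Below is a minimal sketch of that bookkeeping, assuming the tool_call_start/delta/end fields listed in the StreamEvent table on this page. createToolCallAccumulator and the get_weather tool name are hypothetical, and executing the tool and resuming the model are deliberately left out.

```typescript
// Assumed shapes, copied from the tool_call_* rows of the StreamEvent table.
type ToolEvent =
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; arguments: string }
  | { type: 'tool_call_end'; id: string };

type PendingCall = { name: string; json: string };
type ReadyCall = { id: string; name: string; args: unknown };

// Hypothetical accumulator: feed it events, get a parsed call on tool_call_end.
function createToolCallAccumulator() {
  const pending = new Map<string, PendingCall>();

  return function push(ev: ToolEvent): ReadyCall | undefined {
    if (ev.type === 'tool_call_start') {
      pending.set(ev.id, { name: ev.name, json: '' });
      return undefined;
    }
    const call = pending.get(ev.id);
    if (!call) return undefined; // delta or end for an unknown id: ignore
    if (ev.type === 'tool_call_delta') {
      call.json += ev.arguments; // partial JSON, not parseable yet
      return undefined;
    }
    // tool_call_end: the JSON is complete. Empty input means "no arguments".
    pending.delete(ev.id);
    return { id: ev.id, name: call.name, args: JSON.parse(call.json || '{}') };
  };
}

const push = createToolCallAccumulator();
push({ type: 'tool_call_start', id: 'c1', name: 'get_weather' });
push({ type: 'tool_call_delta', id: 'c1', arguments: '{"city":' });
push({ type: 'tool_call_delta', id: 'c1', arguments: '"Oslo"}' });
console.log(push({ type: 'tool_call_end', id: 'c1' }));
```

Keying the pending calls by id keeps the sketch correct even if deltas for several tool calls interleave in one stream.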
Full StreamEvent union:
| Event type | Fields | When it fires |
|---|---|---|
| text | text: string | Each text token chunk. May be one word, a partial word, or several. |
| thinking | text: string | Each reasoning/thinking token chunk (only when thinking is enabled). |
| tool_call_start | id, name | Model is about to call a tool. Arguments stream in tool_call_delta events. |
| tool_call_delta | id, arguments: string | Partial JSON of the tool arguments. Accumulate to build the full args object. |
| tool_call_end | id | Arguments are complete for this tool call. |
| usage | usage: Usage | Final token counts. Fires once, near the end, when the provider reports usage. |
| done | finishReason: string | Stream is closed. finishReason is 'stop', 'tool_calls', 'length', or 'content_filter'. |
| error | error: Error | Stream-level error from the provider. The for await loop does NOT re-throw it automatically; check for this event type to handle it without a try/catch. |
| media_start | mediaType, mimeType | Media stream opening (image/audio/video generation). |
| media_chunk | data: string, progress? | Base64 chunk of generated media. |
| media_end | mediaId? | Media stream closing. |
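
For readers who prefer types to tables, the rows above can be transcribed into a discriminated union. This is a sketch, not the SDK's actual declaration: StreamEventSketch and describeEvent are hypothetical names, the Usage fields are taken from the Step 3 example, and the types of mediaType, mimeType, progress, and mediaId are guesses where the table does not state them.

```typescript
// Usage fields as used in the Step 3 example; the real type may have more.
type Usage = { inputTokens: number; outputTokens: number };

// Assumed transcription of the table above, one variant per row.
type StreamEventSketch =
  | { type: 'text'; text: string }
  | { type: 'thinking'; text: string }
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; arguments: string }
  | { type: 'tool_call_end'; id: string }
  | { type: 'usage'; usage: Usage }
  | { type: 'done'; finishReason: string }
  | { type: 'error'; error: Error }
  | { type: 'media_start'; mediaType: string; mimeType: string }
  | { type: 'media_chunk'; data: string; progress?: number }
  | { type: 'media_end'; mediaId?: string };

// Hypothetical formatter. The `never` assignment in `default` makes the
// compiler flag any variant that is added to the union but not handled here.
function describeEvent(ev: StreamEventSketch): string {
  switch (ev.type) {
    case 'text':
    case 'thinking':
      return `${ev.type}: ${ev.text}`;
    case 'tool_call_start':
      return `tool ${ev.name} (${ev.id}) started`;
    case 'tool_call_delta':
      return `tool ${ev.id} args += ${ev.arguments}`;
    case 'tool_call_end':
      return `tool ${ev.id} args complete`;
    case 'usage':
      return `tokens ${ev.usage.inputTokens} -> ${ev.usage.outputTokens}`;
    case 'done':
      return `done (${ev.finishReason})`;
    case 'error':
      return `error: ${ev.error.message}`;
    case 'media_start':
      return `media ${ev.mediaType} (${ev.mimeType}) started`;
    case 'media_chunk':
      return `media chunk, ${ev.data.length} base64 chars`;
    case 'media_end':
      return 'media complete';
    default: {
      const unhandled: never = ev;
      throw new Error(`Unhandled event: ${JSON.stringify(unhandled)}`);
    }
  }
}

console.log(describeEvent({ type: 'done', finishReason: 'stop' }));
```

The exhaustive switch is a design choice rather than a requirement: a loop that only cares about text can keep using a single if, as in Step 2.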
When to use stream() vs complete():
complete() waits for the entire response, then returns. Use it when your code needs
the full text before doing anything (e.g. parsing JSON, running validation). Use
stream() when you want to display partial output progressively, or when the request
is long enough that first-token latency matters for user experience.
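One way to judge whether first-token latency matters for a given request is to measure it. The sketch below times the first text event against the full stream, using a canned generator with artificial delays. slowStream and measureLatency are hypothetical helpers, and the event shape is the minimal one assumed from this page.

```typescript
// Assumed minimal event shape: only the 'text' and 'done' variants.
type MiniEvent =
  | { type: 'text'; text: string }
  | { type: 'done'; finishReason: string };

// Hypothetical stand-in for a slow provider: waits delayMs before each chunk.
async function* slowStream(chunks: string[], delayMs: number): AsyncGenerator<MiniEvent> {
  for (const text of chunks) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield { type: 'text', text };
  }
  yield { type: 'done', finishReason: 'stop' };
}

// Records time to the first text event and time until the stream is exhausted.
async function measureLatency(events: AsyncIterable<MiniEvent>) {
  const start = Date.now();
  let firstTokenMs: number | undefined;
  let text = '';
  for await (const ev of events) {
    if (ev.type !== 'text') continue;
    if (firstTokenMs === undefined) firstTokenMs = Date.now() - start;
    text += ev.text;
  }
  return { firstTokenMs, totalMs: Date.now() - start, text };
}

measureLatency(slowStream(['one ', 'two ', 'three'], 50)).then((result) => {
  console.log(result);
});
```

With complete(), the user waits roughly totalMs before seeing anything; with stream(), the wait is closer to firstTokenMs. The larger the gap between the two, the stronger the case for streaming.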
Compare the SDKs
The structural difference: official SDKs emit provider-specific event shapes. The OpenAI
Responses API emits a dozen distinct event types (response.created,
response.output_item.added, response.output_text.delta, etc.) that you switch on.
Anthropic emits content_block_start, content_block_delta, and message_delta, each
with different sub-fields. ORXA collapses all of these into the StreamEvent union above:
eight core event types plus three media events. Your for await loop handles each case
once and works on every provider.
Gotchas and next steps
done fires before the for await exits. The loop ends when the underlying async
iterable is exhausted, which happens after done is emitted. You can treat done as a
signal that usage stats will not change, but the loop only exits once the iterator
itself finishes.
error events do not throw. If you need error propagation, check ev.type === 'error'
and throw manually. Keep a try/catch around the loop as well: it will not see error
events, but the underlying fetch still throws on network errors.
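If you would rather have a single failure path, one possible pattern is a small wrapper that converts error events into thrown exceptions. The sketch below assumes only the text and error event shapes from this page. throwOnError, failingStream, and run are hypothetical names, and the canned stream stands in for llm.stream().

```typescript
// Assumed minimal event shape: just the variants this sketch inspects.
type MiniEvent =
  | { type: 'text'; text: string }
  | { type: 'error'; error: Error };

// Hypothetical wrapper: re-yields every event, but turns an 'error' event
// into a thrown exception so a surrounding try/catch sees it.
async function* throwOnError(events: AsyncIterable<MiniEvent>): AsyncGenerator<MiniEvent> {
  for await (const ev of events) {
    if (ev.type === 'error') throw ev.error;
    yield ev;
  }
}

// Canned stream for demonstration only: one chunk, then a stream-level error.
async function* failingStream(): AsyncGenerator<MiniEvent> {
  yield { type: 'text', text: 'partial ' };
  yield { type: 'error', error: new Error('provider overloaded') };
}

async function run(): Promise<string> {
  let text = '';
  try {
    for await (const ev of throwOnError(failingStream())) {
      if (ev.type === 'text') text += ev.text;
    }
    return text;
  } catch (err) {
    // Text received before the failure is still available here.
    return `caught: ${(err as Error).message} (after "${text}")`;
  }
}

run().then((outcome) => console.log(outcome));
```

The trade-off is that throwing ends the loop immediately, so use this only when nothing after the error event is worth reading.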
Streaming structured output is not supported. Structured output requires the model to
emit valid JSON in one block. Use complete() for structured extraction.
Next steps:
- Quickstart — non-streaming baseline and key setup
- Reasoning — stream thinking tokens alongside text
- Multi-turn conversation — stream inside a conversation loop