
Image input (vision)

Send a small solid-red PNG with the prompt 'What color is this image? One word.' and assert that the response matches /red/i on OpenAI, Anthropic, and Google, using one call shape regardless of provider.

Any task that asks the model to reason about visual content: reading a chart, describing a photo, identifying text in a screenshot, or analysing a UI mockup.

The challenge with raw provider SDKs is that each one expects a completely different shape for image content blocks:

  • OpenAI Responses API takes { type: 'input_image', image_url: 'data:image/png;base64,...' } as a content item in the input array (the Chat Completions API instead nests the URL as { type: 'image_url', image_url: { url } }).
  • Anthropic uses { type: 'image', source: { type: 'base64', media_type, data } } inside a content array.
  • Google uses { inlineData: { mimeType, data } } as a parts entry in contents.

Each provider also expects base64 encoding done differently (data-URL prefix for some, raw string for others). With multiple images in one message the divergence multiplies.
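
To make the divergence concrete, the sketch below hand-builds the three shapes from one raw base64 string. The names `Provider` and `toWireShape` are illustrative and not part of any SDK, and the OpenAI case assumes the Responses API, where `image_url` carries the data URL as a plain string.

```typescript
// Illustrative only: `Provider` and `toWireShape` are hypothetical names,
// not exports of any provider SDK or of @combycode/llm-sdk.
type Provider = 'openai' | 'anthropic' | 'google';

function toWireShape(provider: Provider, mimeType: string, data: string): unknown {
  switch (provider) {
    case 'openai':
      // Assumes the Responses API: base64 wrapped in a data URL string.
      return { type: 'input_image', image_url: `data:${mimeType};base64,${data}` };
    case 'anthropic':
      // Raw base64 string, MIME type carried in `media_type`.
      return { type: 'image', source: { type: 'base64', media_type: mimeType, data } };
    case 'google':
      // Raw base64 string inside an `inlineData` part.
      return { inlineData: { mimeType, data } };
  }
}
```

Three branches for one image is the cost of calling the providers directly; with several images per message, every branch has to be repeated per image.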

attachments unifies all of this into one list of file paths, URLs, or bytes.

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What color is this image? One word.',
  attachments: ['./red.png'], // local file path, resolved by the SDK
  maxTokens: 32,
});

console.log(text); // e.g. "Red" (exact wording depends on the model)

attachments accepts a local file path. The SDK reads the file with loadContent(), detects the MIME type (image/png, image/jpeg, etc.) from the file extension and magic bytes, base64-encodes the bytes, and places the result into an ImagePart with a base64 DataSource. The provider adapter then translates that into the correct wire shape.
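
The extension half of that detection can be approximated as follows. This is a minimal sketch, not the SDK's loadContent(): `mimeFromExtension` and `EXTENSION_TO_MIME` are hypothetical names, and the fallback to image/png mirrors the behaviour this page describes for unrecognised formats.

```typescript
// Hypothetical helper; the real detection lives inside the SDK and also
// inspects magic bytes, which this sketch does not do.
const EXTENSION_TO_MIME: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  webp: 'image/webp',
};

function mimeFromExtension(pathOrUrl: string): string {
  // Drop any query string or fragment so URLs work too.
  const clean = pathOrUrl.replace(/[?#].*$/, '');
  const dot = clean.lastIndexOf('.');
  const ext = dot >= 0 ? clean.slice(dot + 1).toLowerCase() : '';
  // Unknown or missing extensions fall back to image/png.
  return EXTENSION_TO_MIME[ext] ?? 'image/png';
}
```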

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Describe this image in one sentence.',
  attachments: ['https://example.com/photo.jpg'], // fetched by the SDK, sent as inline bytes
  maxTokens: 128,
});

When the attachment string starts with http:// or https://, loadContent() fetches the URL, reads the response bytes, detects the MIME type from the Content-Type header and/or the URL extension, and encodes the result as base64 — the same base64 DataSource reaches every provider. The provider never sees the URL itself (it always gets inline bytes).
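
The routing rule described here is simple enough to sketch. `classifyAttachment` is a hypothetical name used for illustration only; it assumes the prefix check is an exact, case-sensitive match on http:// or https://, which is how the rule is worded above.

```typescript
// Hypothetical classifier mirroring the documented rule; not an SDK export.
type AttachmentKind = 'url' | 'path' | 'bytes';

function classifyAttachment(entry: string | Uint8Array): AttachmentKind {
  if (typeof entry !== 'string') return 'bytes'; // already in memory
  if (entry.startsWith('http://') || entry.startsWith('https://')) return 'url'; // fetched
  return 'path'; // anything else is treated as a local file path
}
```

Whatever the kind, the result is the same inline base64 payload, so the model call does not change with the attachment's origin.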

import { readFileSync } from 'fs';
import { complete } from '@combycode/llm-sdk';

// Any Uint8Array works; here the bytes happen to come from disk.
const imageBytes = new Uint8Array(readFileSync('./chart.png'));

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What is the trend shown in this chart?',
  attachments: [imageBytes],
  maxTokens: 256,
});

Pass a Uint8Array when you already have the bytes in memory (from a canvas, upload buffer, etc.). MIME type is detected from the magic bytes (PNG header, JPEG FF D8 FF, GIF, WebP/RIFF), defaulting to image/png if no signature matches.
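
Magic-byte sniffing of this kind might look like the sketch below. The function names are hypothetical and the SDK's actual checks may differ; the signatures themselves are the standard ones (PNG's 8-byte header, JPEG FF D8 FF, "GIF8", and "RIFF" followed by "WEBP" at offset 8).

```typescript
// Returns true when `bytes` contains `sig` starting at `offset`.
function hasSignature(bytes: Uint8Array, sig: number[], offset = 0): boolean {
  if (bytes.length < offset + sig.length) return false;
  return sig.every((b, i) => bytes[offset + i] === b);
}

// Hypothetical sniffer; falls back to image/png as described above.
function sniffImageMime(bytes: Uint8Array): string {
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return 'image/jpeg';
  if (hasSignature(bytes, [0x47, 0x49, 0x46, 0x38])) return 'image/gif'; // "GIF8"
  const isRiff = hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]); // "RIFF"
  const isWebp = hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8); // "WEBP"
  if (isRiff && isWebp) return 'image/webp';
  return 'image/png'; // no signature matched
}
```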

Step 4 — Send multiple images in one message

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Which of these two images shows a cat?',
  attachments: ['./image-a.jpg', './image-b.jpg'], // order is preserved
  maxTokens: 64,
});

attachments is a list — each entry resolves to one ContentPart in the user message. The SDK appends the text prompt as a TextPart first, then each image part in order. All three providers accept multi-image content; the per-provider shape is handled internally.
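
The ordering rule can be illustrated with local stand-in types. This is a sketch of the described behaviour, not the SDK's code: `buildUserContent` is a hypothetical helper, and the `TextPart` and `ImagePart` types below are simplified copies declared only for this example.

```typescript
// Simplified stand-ins for the SDK's content part types (assumption).
type Base64Source = { type: 'base64'; mimeType: string; data: string };
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; source: Base64Source };
type ContentPart = TextPart | ImagePart;

// Text prompt first, then one image part per attachment, in input order.
function buildUserContent(prompt: string, images: Base64Source[]): ContentPart[] {
  const imageParts = images.map((source): ImagePart => ({ type: 'image', source }));
  return [{ type: 'text', text: prompt }, ...imageParts];
}
```

Because order is preserved, a prompt can refer to "the first image" and "the second image" and mean the first and second entries of attachments.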

Step 5 — Build content parts manually for more control

When you need to set per-image detail or mix sources, build the content array yourself:

import { readFileSync } from 'fs';
import { complete } from '@combycode/llm-sdk';
import type { ImagePart } from '@combycode/llm-sdk';

// Read the file and encode it as raw base64 (no data: prefix).
const b64 = readFileSync('./diagram.png').toString('base64');

const imagePart: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/png', data: b64 },
  detail: 'high', // OpenAI only; ignored by Anthropic and Google
};

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe the components in this architecture diagram.' },
        imagePart,
      ],
    },
  ],
  maxTokens: 512,
});

The manual path gives you access to detail (OpenAI only) and lets you mix DataSource variants in one message.

ImagePart shape (ContentPart of type 'image'):

| Field | Type | Description |
| --- | --- | --- |
| type | 'image' | Discriminator. |
| source | DataSource | Where the bytes come from (see table below). |
| detail | 'auto' \| 'low' \| 'high' | Optional. Controls tile-level resolution for OpenAI vision models. Ignored by Anthropic and Google. Defaults to 'auto'. |

DataSource variants for images:

| type | Required fields | When to use |
| --- | --- | --- |
| 'base64' | mimeType: string, data: string (raw base64, no data: prefix) | Bytes you have in memory as a base64 string. Most common output of loadContent(). |
| 'url' | url: string | A public URL. The SDK fetches and re-encodes before sending, so the provider never sees the URL. |
| 'buffer' | mimeType: string, data: Uint8Array | Raw bytes in memory. MIME type is sniffed from magic bytes and the declared value is corrected if it mismatches. |
| 'file' | fileId: string | A file already uploaded via the Files API. Provider translates to its own file-reference format. |
| 'path' | mimeType: string, path: string | Local file path (Node/Bun only). The SDK reads and encodes the file. Use attachments instead for simpler calls. |
| 'provider_ref' | mimeType: string, refId: string | An opaque provider-specific reference (e.g. a Google Files API URI). Passed through as fileData. |
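
Read as a discriminated union, the variants look roughly like the sketch below. The type is a local reconstruction from the table rather than the SDK's exported definition, and `bytesOwner` is a hypothetical helper that only summarises who is responsible for producing the bytes in each case.

```typescript
// Local reconstruction of the table above (assumption, not the SDK export).
type DataSource =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'url'; url: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'file'; fileId: string }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'provider_ref'; mimeType: string; refId: string };

type BytesOwner = 'caller' | 'sdk' | 'provider';

function bytesOwner(source: DataSource): BytesOwner {
  switch (source.type) {
    case 'base64':
    case 'buffer':
      return 'caller'; // bytes are already in memory
    case 'url':
    case 'path':
      return 'sdk'; // the SDK fetches or reads, then encodes
    case 'file':
    case 'provider_ref':
      return 'provider'; // bytes already live on the provider side
  }
}
```

Mixing variants in one message then amounts to building one image part per source; the `type` field tells the adapter how to resolve each.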

detail trade-offs (OpenAI only):

| Value | Token cost | Quality | Use when |
| --- | --- | --- | --- |
| 'auto' | Variable | Model decides | Default; suitable for most tasks. |
| 'low' | Fixed low (~85 tokens) | Coarse (512x512 tile) | Fast queries where spatial precision is not needed (e.g. “is this a dog?”). |
| 'high' | Variable (512x512 tiles) | Full resolution | Documents, diagrams, screenshots, any task requiring fine detail. |

MIME types auto-detected by the SDK:

| Extension / Magic | MIME type |
| --- | --- |
| .png / PNG header | image/png |
| .jpg / .jpeg / FF D8 FF | image/jpeg |
| .gif / GIF87a or GIF89a | image/gif |
| .webp / RIFF+WEBP | image/webp |

Any format not in this list falls back to image/png. Override explicitly with a buffer or base64 DataSource and set the correct mimeType.

Provider support for image input:

| Provider | Supported models | Notes |
| --- | --- | --- |
| OpenAI | GPT-4o, GPT-4o mini, o1, o3, gpt-4-turbo | Responses API: input_image block. Chat Completions: image_url block. |
| Anthropic | Claude 3+ (Haiku, Sonnet, Opus) | image block with base64 source. |
| Google | Gemini 1.5+, Gemini 2.0+ | inlineData or fileData part. |

import { complete } from '@combycode/llm-sdk';

// `attachments` accepts a path / URL / bytes and builds the right image part
// per provider (base64 / inlineData / image_url) under the hood.
const t0 = performance.now();
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What color is this image? One word.',
  attachments: ['../../official-samples/_fixtures/red.png'],
  maxTokens: 32,
});

console.log(JSON.stringify({ result: text.trim(), ms: Math.round(performance.now() - t0) }));

Every official SDK builds a different content block shape and different base64 encoding convention. OpenAI’s Responses API wraps images in input_image items inside the input array; the Chat Completions path uses image_url with a data-URL string inside content[].image_url.url. Anthropic uses source.type = 'base64' with a media_type field. Google uses inlineData.mimeType / inlineData.data. ORXA resolves a single base64 DataSource into the correct shape per provider — your code does not branch.

URLs are always fetched by the SDK, not passed through. OpenAI’s Responses API can accept raw URLs natively, but ORXA’s url DataSource still fetches and re-encodes — this ensures uniform behaviour across all providers. Pass a base64 or buffer DataSource if you need to avoid the extra fetch.

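Large images cost more tokens. With detail: 'high', OpenAI (GPT-4o-class pricing) first scales the image to fit within 2048x2048, then scales it down so the shortest side is 768 px, and only then tiles it into 512x512 patches at 170 tokens each on top of an 85-token base. A 2048x2048 image is therefore reduced to 768x768 and produces 4 tiles, about 765 tokens of image overhead. Other model families use different constants. Use detail: 'low' for yes/no queries on large images.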
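
The arithmetic can be sketched as follows, assuming the tiling rules OpenAI documents for GPT-4o-class models: fit within 2048x2048, bring the shortest side down to 768 px, then count 512 px tiles at 170 tokens each plus an 85-token base. `estimateImageTokens` is a hypothetical helper, the constants are assumptions that differ for other model families, and the sketch assumes images are only ever scaled down, never up.

```typescript
// Assumed constants for GPT-4o-class models; other families differ.
const BASE_TOKENS = 85;
const TOKENS_PER_TILE = 170;

function estimateImageTokens(width: number, height: number, detail: 'low' | 'high'): number {
  if (detail === 'low') return BASE_TOKENS; // fixed cost, size-independent

  let w = width;
  let h = height;

  // Step 1: fit inside a 2048x2048 square.
  const longest = Math.max(w, h);
  if (longest > 2048) {
    const scale = 2048 / longest;
    w *= scale;
    h *= scale;
  }

  // Step 2: scale down so the shortest side is 768 px.
  const shortest = Math.min(w, h);
  if (shortest > 768) {
    const scale = 768 / shortest;
    w *= scale;
    h *= scale;
  }

  // Step 3: count 512 px tiles and add the base cost.
  const tiles = Math.ceil(Math.round(w) / 512) * Math.ceil(Math.round(h) / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}
```

Under these assumptions a 2048x2048 image and a 1024x1024 image both come out at 765 tokens, because both are reduced to 768x768 before tiling.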

Anthropic has a 5 MB per-image limit on base64-encoded size. For larger images, resize before sending.
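
A pre-flight check is cheap because base64 length follows directly from the byte count: every 3 input bytes become 4 output characters, padded up. The helpers below are hypothetical, and the sketch assumes "5 MB" means 5 x 1024 x 1024 bytes; adjust the constant if the provider defines the limit differently.

```typescript
// Assumption: the limit is 5 MiB measured on the base64-encoded payload.
const ANTHROPIC_MAX_BASE64_BYTES = 5 * 1024 * 1024;

// Length of the padded base64 encoding of `byteLength` raw bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// True when an image of `byteLength` raw bytes stays within the limit.
function fitsAnthropicLimit(byteLength: number): boolean {
  return base64Length(byteLength) <= ANTHROPIC_MAX_BASE64_BYTES;
}
```

Since base64 inflates data by about a third, a file of roughly 3.9 MB on disk is already at the limit once encoded.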

GIF animation is not understood. All three providers receive only the first frame of a GIF (or the entire still image if it is not animated).
