Image input (vision)

What you will achieve

Send a small solid-red PNG with the prompt 'What color is this image? One word.' and assert that the response matches /red/i on OpenAI, Anthropic, and Google, using one call shape regardless of provider.

When and why you need this

Any task that asks the model to reason about visual content: reading a chart, describing a photo, identifying text in a screenshot, or analysing a UI mockup.

The challenge with raw provider SDKs is that each one expects image content blocks in a completely different shape:

OpenAI Responses API wraps images as { type: 'input_image', image_url: 'data:image/png;base64,...' } in the input array; here image_url is a plain data-URL string, not a nested object.
Anthropic uses { type: 'image', source: { type: 'base64', media_type, data } } inside a content array.
Google uses { inlineData: { mimeType, data } } as a parts entry in contents.

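Each provider also expects the base64 payload in a different form: OpenAI takes a data URL (with the data:image/png;base64, prefix), while Anthropic and Google take the raw base64 string and carry the MIME type in a separate field. With multiple images in one message, the divergence multiplies.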
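
To make the divergence concrete, the sketch below builds the three wire shapes from a single base64 payload. The helper names (toOpenAIResponsesPart, toAnthropicPart, toGooglePart) and the InlineImage type are hypothetical and are not part of the SDK; the object shapes follow each provider's public API as understood here, so check them against the provider documentation before relying on them.

```typescript
// Hypothetical helpers (not part of the SDK): one image, three wire shapes.
interface InlineImage {
  mimeType: string; // e.g. 'image/png'
  data: string; // raw base64, no `data:` prefix
}

// OpenAI Responses API: a data-URL string on an `input_image` item.
function toOpenAIResponsesPart(img: InlineImage) {
  return {
    type: 'input_image' as const,
    image_url: `data:${img.mimeType};base64,${img.data}`,
  };
}

// Anthropic Messages API: raw base64 plus a snake_case `media_type`.
function toAnthropicPart(img: InlineImage) {
  return {
    type: 'image' as const,
    source: { type: 'base64' as const, media_type: img.mimeType, data: img.data },
  };
}

// Google Gemini API: raw base64 under a camelCase `inlineData` part.
function toGooglePart(img: InlineImage) {
  return { inlineData: { mimeType: img.mimeType, data: img.data } };
}

// The base64 of the 8-byte PNG signature stands in for a real image.
const sample: InlineImage = { mimeType: 'image/png', data: 'iVBORw0KGgo=' };
console.log(toOpenAIResponsesPart(sample));
console.log(toAnthropicPart(sample));
console.log(toGooglePart(sample));
```

Only the OpenAI shape carries the MIME type inside the payload string; the other two keep it in a sibling field, which is why a shared intermediate form has to store the MIME type and the raw base64 separately.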

attachments unifies all of this into one list of file paths, URLs, or bytes.

Step by step

Step 1 — Send one image by file path

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What color is this image? One word.',
  attachments: ['./red.png'],
  maxTokens: 32,
});

console.log(text); // "Red"

attachments accepts a local file path. The SDK reads the file with loadContent(), detects the MIME type (image/png, image/jpeg, etc.) from the file extension and magic bytes, base64-encodes the bytes, and places the result into an ImagePart with a base64 DataSource. The provider adapter then translates that into the correct wire shape.
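
The path-handling steps described above can be approximated as follows. This is a minimal sketch, not the SDK's loadContent() implementation: toBase64ImagePart, MIME_BY_EXTENSION, and the Base64ImagePart type are hypothetical names, only the extension lookup is shown (the magic-byte check is omitted), and the bytes are passed in rather than read from disk so the snippet runs without a fixture file.

```typescript
import { Buffer } from 'node:buffer';

// Illustrative type mirroring an ImagePart with a base64 DataSource.
interface Base64ImagePart {
  type: 'image';
  source: { type: 'base64'; mimeType: string; data: string };
}

// Extension lookup only; the real loader is described as also checking magic bytes.
const MIME_BY_EXTENSION: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  webp: 'image/webp',
};

// Hypothetical helper: turn a file name plus its bytes into an image part.
function toBase64ImagePart(fileName: string, bytes: Uint8Array): Base64ImagePart {
  const ext = fileName.split('.').pop()?.toLowerCase() ?? '';
  const mimeType = MIME_BY_EXTENSION[ext] ?? 'image/png'; // fallback named in this guide
  const data = Buffer.from(bytes).toString('base64'); // raw base64, no data: prefix
  return { type: 'image', source: { type: 'base64', mimeType, data } };
}

// The 8-byte PNG signature stands in for a real file's contents.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const part = toBase64ImagePart('./red.png', pngSignature);
console.log(part.source.mimeType, part.source.data);
```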

Step 2 — Send an image by URL

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Describe this image in one sentence.',
  attachments: ['https://example.com/photo.jpg'],
  maxTokens: 128,
});

When the attachment string starts with http:// or https://, loadContent() fetches the URL, reads the response bytes, detects the MIME type from the Content-Type header and/or the URL extension, and encodes the result as base64 — the same base64 DataSource reaches every provider. The provider never sees the URL itself (it always gets inline bytes).
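
A rough sketch of the URL branch, for illustration only. isRemoteAttachment and mimeFromResponse are hypothetical helpers, and the precedence used here (an image/* Content-Type header wins, otherwise the URL extension, otherwise image/png) is an assumption; the text above does not specify how the SDK orders the two signals.

```typescript
// Hypothetical helpers sketching the URL branch; not the SDK's actual code.
function isRemoteAttachment(attachment: string): boolean {
  return attachment.startsWith('http://') || attachment.startsWith('https://');
}

const MIME_BY_URL_EXTENSION: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  webp: 'image/webp',
};

// Assumed precedence: image/* header first, then URL extension, then image/png.
function mimeFromResponse(contentType: string | null, url: string): string {
  // Drop parameters such as "; charset=binary" before comparing.
  const headerMime = contentType?.split(';')[0]?.trim().toLowerCase();
  if (headerMime && headerMime.startsWith('image/')) return headerMime;

  // Strip the query string and fragment before reading the extension.
  const path = url.split(/[?#]/)[0] ?? url;
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  return MIME_BY_URL_EXTENSION[ext] ?? 'image/png';
}

console.log(isRemoteAttachment('https://example.com/photo.jpg'));
console.log(mimeFromResponse('application/octet-stream', 'https://example.com/photo.webp?size=large'));
```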

Step 3 — Send raw bytes

import { readFileSync } from 'fs';

const imageBytes = new Uint8Array(readFileSync('./chart.png'));

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What is the trend shown in this chart?',
  attachments: [imageBytes],
  maxTokens: 256,
});

Pass a Uint8Array when you already have the bytes in memory (from a canvas, upload buffer, etc.). MIME type is detected from the magic bytes (PNG header, JPEG FF D8 FF, GIF, WebP/RIFF), defaulting to image/png if no signature matches.
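
Magic-byte detection of this kind can be sketched in a few lines. sniffImageMime and startsWithBytes are hypothetical names and the SDK's own checks may differ; the signatures themselves are the standard ones for PNG, JPEG, GIF, and WebP.

```typescript
// Hypothetical sniffer for the four signatures named above.
function startsWithBytes(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

function sniffImageMime(bytes: Uint8Array): string {
  // PNG: fixed 8-byte signature.
  if (startsWithBytes(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  // JPEG: FF D8 FF.
  if (startsWithBytes(bytes, [0xff, 0xd8, 0xff])) return 'image/jpeg';
  // GIF: ASCII "GIF8", which covers both GIF87a and GIF89a.
  if (startsWithBytes(bytes, [0x47, 0x49, 0x46, 0x38])) return 'image/gif';
  // WebP: ASCII "RIFF" at offset 0 and "WEBP" at offset 8.
  if (
    startsWithBytes(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    startsWithBytes(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return 'image/webp';
  }
  // No signature matched: use the default named in this guide.
  return 'image/png';
}

const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);
console.log(sniffImageMime(jpegHeader));
```

Checking "WEBP" at offset 8 matters because other RIFF containers (WAV audio, for example) share the first four bytes.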

Step 4 — Send multiple images in one message

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Which of these two images shows a cat?',
  attachments: ['./image-a.jpg', './image-b.jpg'],
  maxTokens: 64,
});

attachments is a list — each entry resolves to one ContentPart in the user message. The SDK appends the text prompt as a TextPart first, then each image part in order. All three providers accept multi-image content; the per-provider shape is handled internally.
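
The ordering described above (text first, then images in list order) amounts to the following. The type definitions are simplified stand-ins for the SDK's TextPart and ImagePart, and buildUserContent is a hypothetical helper used only for illustration.

```typescript
// Simplified stand-ins for the SDK's content part types.
type TextPart = { type: 'text'; text: string };
type ImagePart = {
  type: 'image';
  source: { type: 'base64'; mimeType: string; data: string };
};
type ContentPart = TextPart | ImagePart;

// Hypothetical helper: the prompt becomes the first part, images follow in order.
function buildUserContent(prompt: string, images: ImagePart[]): ContentPart[] {
  return [{ type: 'text', text: prompt }, ...images];
}

const imageA: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/jpeg', data: 'AAAA' },
};
const imageB: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/jpeg', data: 'BBBB' },
};

const content = buildUserContent('Which of these two images shows a cat?', [imageA, imageB]);
console.log(content.map((p) => p.type)); // [ 'text', 'image', 'image' ]
```

Because the order is preserved, a prompt can refer to "the first image" and "the second image" by their position in attachments.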

Step 5 — Build content parts manually for more control

When you need to set per-image detail or mix sources, build the content array yourself:

import { complete } from '@combycode/llm-sdk';
import type { ImagePart } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

// readFileSync returns a Buffer, so it can be base64-encoded directly.
const b64 = readFileSync('./diagram.png').toString('base64');

const imagePart: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/png', data: b64 },
  detail: 'high',
};

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe the components in this architecture diagram.' },
        imagePart,
      ],
    },
  ],
  maxTokens: 512,
});

The manual path gives you access to detail (OpenAI only) and lets you mix DataSource variants in one message.

Your options

ImagePart shape (ContentPart of type 'image'):

Field	Type	Description
`type`	`'image'`	Discriminator.
`source`	`DataSource`	Where the bytes come from (see table below).
`detail`	`'auto' \| 'low' \| 'high'`	Optional. Controls tile-level resolution for OpenAI vision models. Ignored by Anthropic and Google. Defaults to `'auto'`.

DataSource variants for images:

`type`	Required fields	When to use
`'base64'`	`mimeType: string`, `data: string` (raw base64, no `data:` prefix)	Bytes you have in memory as a base64 string. Most common output of `loadContent()`.
`'url'`	`url: string`	A public URL. The SDK fetches and re-encodes before sending — the provider never sees the URL.
`'buffer'`	`mimeType: string`, `data: Uint8Array`	Raw bytes in memory. MIME type is sniffed from magic bytes and the declared value is corrected if it mismatches.
`'file'`	`fileId: string`	A file already uploaded via the Files API. Provider translates to its own file-reference format.
`'path'`	`mimeType: string`, `path: string`	Local file path (Node/Bun only). The SDK reads and encodes the file. Use `attachments` instead for simpler calls.
`'provider_ref'`	`mimeType: string`, `refId: string`	An opaque provider-specific reference (e.g. a Google Files API URI). Passed through as `fileData`.
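
Read as a TypeScript discriminated union, the table above might look like the sketch below. The field names are taken from the table, but the type itself and the resolutionStrategy helper are illustrative assumptions; the SDK's exported types may differ.

```typescript
// Illustrative union built from the table; not the SDK's exported type.
type DataSource =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'url'; url: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'file'; fileId: string }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'provider_ref'; mimeType: string; refId: string };

type Strategy = 'inline' | 'fetch' | 'read' | 'reference';

// Hypothetical helper: what has to happen before the image can be sent.
function resolutionStrategy(source: DataSource): Strategy {
  switch (source.type) {
    case 'base64':
    case 'buffer':
      return 'inline'; // bytes are already in memory
    case 'url':
      return 'fetch'; // downloaded and re-encoded first
    case 'path':
      return 'read'; // read from the local file system
    case 'file':
    case 'provider_ref':
      return 'reference'; // provider already holds the bytes
  }
}

console.log(resolutionStrategy({ type: 'url', url: 'https://example.com/photo.jpg' }));
```

Switching on the type field lets the compiler verify that every variant is handled, which is useful when mixing sources in one message.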

detail trade-offs (OpenAI only):

Value	Token cost	Quality	Use when
`'auto'`	Variable	Model decides	Default; suitable for most tasks.
`'low'`	Fixed low (~85 tokens)	Coarse (512x512 tile)	Fast queries where spatial precision is not needed (e.g. “is this a dog?”).
`'high'`	Variable (512x512 tiles)	Full resolution	Documents, diagrams, screenshots, any task requiring fine detail.

MIME types auto-detected by the SDK:

Extension / Magic	MIME type
`.png` / PNG header	`image/png`
`.jpg` / `.jpeg` / FF D8 FF	`image/jpeg`
`.gif` / GIF8 (GIF87a or GIF89a)	`image/gif`
`.webp` / RIFF+WEBP	`image/webp`

Any format not in this list falls back to image/png. Override explicitly with a buffer or base64 DataSource and set the correct mimeType.

Provider support for image input:

Provider	Supported models	Notes
OpenAI	GPT-4o, GPT-4o mini, o1, o3, GPT-4 Turbo	Responses API: `input_image` block. Chat Completions: `image_url` block.
Anthropic	Claude 3+ (Haiku, Sonnet, Opus)	`image` block with `base64` source.
Google	Gemini 1.5+, Gemini 2.0+	`inlineData` or `fileData` part.

Compare the SDKs

import { complete } from '@combycode/llm-sdk';

// `attachments` accepts a path / URL / bytes and builds the right image part
// per provider (base64 / inlineData / image_url) under the hood.
const t0 = performance.now();
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What color is this image? One word.',
  attachments: ['../../official-samples/_fixtures/red.png'],
  maxTokens: 32,
});

console.log(JSON.stringify({ result: text.trim(), ms: Math.round(performance.now() - t0) }));

Every official SDK uses a different content block shape and a different base64 convention. OpenAI’s Responses API wraps images in input_image items inside the input array, with the data URL as the image_url string; the Chat Completions path uses image_url with a data-URL string inside content[].image_url.url. Anthropic uses source.type = 'base64' with a media_type field. Google uses inlineData.mimeType / inlineData.data. ORXA resolves a single base64 DataSource into the correct shape per provider, so your code does not branch.

Gotchas and next steps

URLs are always fetched by the SDK, not passed through. OpenAI’s Responses API can accept raw URLs natively, but ORXA’s url DataSource still fetches and re-encodes — this ensures uniform behaviour across all providers. Pass a base64 or buffer DataSource if you need to avoid the extra fetch.

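Large images cost more tokens. With detail: 'high', OpenAI’s published rule for GPT-4o-class models first scales the image to fit within 2048x2048, then shrinks it so the shortest side is 768 px, and bills each 512x512 tile at 170 tokens plus a fixed 85 tokens. A 2048x2048 image is therefore reduced to 768x768, which is 4 tiles or about 765 tokens of image overhead; exact figures vary by model. Use detail: 'low' for yes/no queries on large images.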
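
For budgeting, the high-detail cost can be estimated ahead of time. The sketch below follows the rule OpenAI has published for GPT-4o-class models (fit within 2048x2048, shrink so the shortest side is 768 px, then 170 tokens per 512x512 tile plus a fixed 85). estimateHighDetailTokens is a hypothetical helper, not an SDK function; it assumes small images are never upscaled, and other model families use different constants.

```typescript
// Hypothetical estimator for detail: 'high' on GPT-4o-class models.
function estimateHighDetailTokens(width: number, height: number): number {
  // 1. Fit within a 2048x2048 square, preserving aspect ratio (downscale only).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // 2. Shrink so the shortest side is at most 768 px (assumption: no upscaling).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // 3. Count 512x512 tiles: 170 tokens each, plus a fixed 85-token base.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateHighDetailTokens(2048, 2048)); // 765
console.log(estimateHighDetailTokens(2048, 4096)); // 1105
```

Under this rule a square image never exceeds 4 tiles, so the larger costs come from elongated images such as long screenshots or scanned pages.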

Anthropic has a 5 MB per-image limit on base64-encoded size. For larger images, resize before sending.
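
A pre-flight size check avoids a rejected request. Base64 emits 4 characters for every 3 input bytes, so the encoded size can be computed without encoding anything. The helper names below are hypothetical, and treating "5 MB" as 5 * 1024 * 1024 bytes is an assumption; confirm the exact threshold against Anthropic's documentation.

```typescript
// Assumption: "5 MB" is interpreted as 5 MiB of base64 text.
const ANTHROPIC_MAX_BASE64_BYTES = 5 * 1024 * 1024;

// Base64 output length: 4 characters per 3-byte group, with padding.
function base64Length(rawByteLength: number): number {
  return 4 * Math.ceil(rawByteLength / 3);
}

// Hypothetical guard: true when the encoded image stays within the limit.
function fitsAnthropicLimit(rawByteLength: number): boolean {
  return base64Length(rawByteLength) <= ANTHROPIC_MAX_BASE64_BYTES;
}

console.log(base64Length(8)); // 12
console.log(fitsAnthropicLimit(3 * 1024 * 1024)); // true: 3 MiB raw encodes to 4 MiB
console.log(fitsAnthropicLimit(4 * 1024 * 1024)); // false: 4 MiB raw exceeds 5 MiB encoded
```

Under this assumption the largest raw file that fits is 3,932,160 bytes (3.75 MiB), noticeably less than the headline 5 MB figure.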

GIF animation is not understood. All three providers receive only the first frame of a GIF (or the entire still image if it is not animated).

Next steps:

PDF document input — same attachments API, DocumentPart shape
Audio input — send audio files to a multimodal model
File upload — persist a file server-side and reference it across calls