Image generation

Prompt 'a red circle on white background' and confirm non-empty image bytes are written to disk — same generateImage() call for OpenAI and Google (Anthropic has no image generation API).

Image generation turns a text prompt into pixel data you can display, embed in a document, or feed back into another model call as a vision input. Use cases: product mock-ups, icon sets, illustration pipelines, data augmentation.

The challenge with raw provider SDKs:

  • OpenAI calls client.images.generate({ model, prompt, size, quality, n }) and returns a list of b64_json or url items. gpt-image-1 always returns b64_json and rejects the response_format parameter; dall-e-3 returns URLs unless response_format is set to 'b64_json'. Two models, two different call shapes inside the same provider.
  • Google Imagen uses a :predict endpoint with instances and parameters; predictions contain bytesBase64Encoded fields. Google Gemini image models instead use generateContent with responseModalities: ['IMAGE'] and return inlineData parts. Two endpoints within the same provider.
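
To make the difference concrete, here is a minimal sketch of the extraction code each raw response shape would need. The type names and the fromOpenAI / fromImagen / fromGemini helpers are hypothetical and exist only for this illustration; the field names (b64_json, bytesBase64Encoded, inlineData) are the ones listed above.

```typescript
// Response shapes trimmed to the fields that carry image data.
// These types are illustrative assumptions, not SDK exports.
type OpenAIImagesResponse = { data: { b64_json?: string; url?: string }[] };
type ImagenPredictResponse = { predictions: { bytesBase64Encoded: string }[] };
type GeminiResponse = {
  candidates: {
    content: { parts: { text?: string; inlineData?: { mimeType: string; data: string } }[] };
  }[];
};

// Decode base64 into raw bytes. atob is a global in browsers, Node 16+, and Bun.
function decodeBase64(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

function fromOpenAI(res: OpenAIImagesResponse): Uint8Array[] {
  const images: Uint8Array[] = [];
  for (const item of res.data) {
    // url-only items would need a separate download step; this sketch skips them.
    if (item.b64_json) images.push(decodeBase64(item.b64_json));
  }
  return images;
}

function fromImagen(res: ImagenPredictResponse): Uint8Array[] {
  return res.predictions.map((p) => decodeBase64(p.bytesBase64Encoded));
}

function fromGemini(res: GeminiResponse): Uint8Array[] {
  const images: Uint8Array[] = [];
  for (const candidate of res.candidates) {
    for (const part of candidate.content.parts) {
      // Gemini can interleave text parts with image parts; keep only the images.
      if (part.inlineData) images.push(decodeBase64(part.inlineData.data));
    }
  }
  return images;
}
```

Three helpers for three shapes is the per-provider code that a unified call is meant to remove from application code.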

createMediaOutput() routes each provider+model combination to the correct endpoint, saves the raw bytes to disk, and returns a uniform MediaResult.

import { createMediaOutput } from '@combycode/llm-sdk';
const media = createMediaOutput({
  model: 'openai/gpt-image-1',
  apiKey: process.env.OPENAI_API_KEY,
  dir: './.media-out',
});

createMediaOutput() requires either dir (Node/Bun, for FileMediaStore) or store (any environment, e.g. new MemoryMediaStore() in the browser). The model string is 'provider/model-id' or a bare model with a separate provider field. API key falls back to engine.apiKeys[provider] when not passed explicitly.
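
The model-string convention can be sketched as a small resolver. resolveModel is a hypothetical helper written for this illustration, not an SDK function; it assumes that an explicit provider means the model string is used as-is, so bare ids that themselves contain a slash stay intact.

```typescript
type ResolvedModel = { provider: string; model: string };

// Hypothetical helper: split 'provider/model-id', or pair a bare model with
// an explicit provider field.
function resolveModel(model: string, provider?: string): ResolvedModel {
  // An explicit provider wins: the model string is treated as a bare id.
  if (provider) return { provider, model };

  const slash = model.indexOf('/');
  if (slash <= 0 || slash === model.length - 1) {
    throw new Error(`"${model}" is not namespaced; pass a provider field as well`);
  }
  return { provider: model.slice(0, slash), model: model.slice(slash + 1) };
}

// resolveModel('openai/gpt-image-1')
//   -> { provider: 'openai', model: 'gpt-image-1' }
// resolveModel('imagen-4.0-generate-001', 'google')
//   -> { provider: 'google', model: 'imagen-4.0-generate-001' }
```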

const [img] = await media.generateImage({
  prompt: 'a red circle on white background',
  params: { size: '1024x1024' },
});
console.log(`saved ${img.meta.size} bytes, id: ${img.id}`);
// img.mimeType -> 'image/png'
// img.meta.provider -> 'openai'
// img.meta.model -> 'gpt-image-1'

generateImage() returns a MediaResult[]. Each item has { id, type: 'image', mimeType, meta }. The bytes are saved to dir under the id. Load them back with media.raw.mediaStore.load(id) when needed, where media is the handle returned by createMediaOutput().
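
Once bytes are loaded back, a quick sanity check is to compare their leading magic bytes against the reported mimeType. The sketch below is a standalone helper (sniffImageMime is a hypothetical name, not part of the SDK) that recognizes the three formats mentioned on this page: PNG, JPEG, and WebP.

```typescript
type SniffedMime = 'image/png' | 'image/jpeg' | 'image/webp';

// True when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

// Hypothetical helper: identify an image format from its leading bytes.
function sniffImageMime(bytes: Uint8Array): SniffedMime | undefined {
  // PNG: fixed 8-byte signature.
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  // JPEG: starts with the SOI marker followed by another marker byte.
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return 'image/jpeg';
  // WebP: 'RIFF' at offset 0 and 'WEBP' at offset 8.
  if (
    hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return 'image/webp';
  }
  return undefined;
}
```

In a test, asserting that the sniffed type equals the result's mimeType is a stricter check than asserting a non-zero byte count alone.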

const images = await media.generateImage({
  prompt: 'a watercolor painting of a mountain',
  params: { n: 4, size: '1024x1024' },
});
for (const img of images) {
  console.log(`${img.id}: ${img.meta.size} bytes`);
}

params.n requests multiple images in one API call. On OpenAI, DALL-E 2 and gpt-image-1 accept up to 10 per call, while DALL-E 3 accepts only 1. Google Imagen accepts up to 4 (sent as sampleCount). Images are stored individually; the array length matches n.
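
When you need more images than one call allows, the request can be split into batches. This is a minimal sketch: planBatches and the MAX_IMAGES_PER_CALL table are hypothetical, and the limits in the table are assumptions taken from this page that should be verified against current provider documentation.

```typescript
// Assumed per-call limits; verify against provider docs before relying on them.
const MAX_IMAGES_PER_CALL: Record<string, number> = {
  'dall-e-2': 10,
  'dall-e-3': 1,
  'gpt-image-1': 10,
  'imagen-4.0-generate-001': 4,
};

// Hypothetical helper: split `total` images into per-call counts,
// each no larger than `maxPerCall`.
function planBatches(total: number, maxPerCall: number): number[] {
  if (!Number.isInteger(total) || total < 0) {
    throw new RangeError('total must be a non-negative integer');
  }
  if (!Number.isInteger(maxPerCall) || maxPerCall < 1) {
    throw new RangeError('maxPerCall must be a positive integer');
  }
  const batches: number[] = [];
  for (let left = total; left > 0; left -= maxPerCall) {
    batches.push(Math.min(left, maxPerCall));
  }
  return batches;
}

// planBatches(10, 4) -> [4, 4, 2]
// planBatches(3, 1)  -> [1, 1, 1]
```

Each entry can then be passed as params.n to a separate generateImage() call, and the resulting arrays concatenated.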

import { readFileSync } from 'node:fs';
const sourceBytes = new Uint8Array(readFileSync('./original.png'));
const [edited] = await media.editImage({
  prompt: 'replace the background with a sunset',
  sourceImage: { type: 'buffer', mimeType: 'image/png', data: sourceBytes },
  params: { size: '1024x1024' },
});

editImage() is available on OpenAI (gpt-image-1 via /v1/images/edits) and Google (Gemini via generateContent with the image attached as an extra part). Pass mask as a second DataSource for inpainting (OpenAI only).
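
For inpainting, OpenAI expects the mask to have the same pixel dimensions as the source image, so it can be worth checking before sending the request. The sketch below reads width and height from a PNG's IHDR chunk; readPngSize and sameDimensions are hypothetical helpers written for this illustration and are not part of the SDK.

```typescript
type PngSize = { width: number; height: number };

// Hypothetical helper: read pixel dimensions from a PNG header.
// Layout: 8-byte signature, 4-byte chunk length, 'IHDR', then
// width and height as 4-byte big-endian integers.
function readPngSize(bytes: Uint8Array): PngSize | undefined {
  if (bytes.byteLength < 24) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const isPng = view.getUint32(0) === 0x89504e47 && view.getUint32(4) === 0x0d0a1a0a;
  const isIhdr = view.getUint32(12) === 0x49484452; // 'IHDR'
  if (!isPng || !isIhdr) return undefined;
  return { width: view.getUint32(16), height: view.getUint32(20) };
}

// True only when both inputs are PNGs with identical dimensions.
function sameDimensions(source: Uint8Array, mask: Uint8Array): boolean {
  const a = readPngSize(source);
  const b = readPngSize(mask);
  return a !== undefined && b !== undefined && a.width === b.width && a.height === b.height;
}
```

Running sameDimensions(sourceBytes, maskBytes) before editImage() turns a provider-side rejection into a local, cheaper error.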

const googleMedia = createMediaOutput({
  model: 'google/imagen-4.0-generate-001',
  apiKey: process.env.GOOGLE_API_KEY,
  dir: './.media-out',
});
const [img] = await googleMedia.generateImage({
  prompt: 'a photorealistic red apple on a white table',
  params: {
    n: 1,
    aspectRatio: '1:1',
  },
});

The Google Imagen path calls :predict on the Imagen model. Google Gemini image models (e.g. gemini-2.0-flash-exp) call generateContent with responseModalities: ['IMAGE']. The SDK routes automatically based on whether the model name starts with 'imagen'.
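
The routing rule described above can be sketched in a few lines. googleImageMethod is a hypothetical helper, not the adapter's actual code; it only mirrors the stated rule that names starting with 'imagen' go to :predict and everything else goes to generateContent. Stripping a 'google/' prefix and ignoring case are assumptions of this sketch.

```typescript
type GoogleImageMethod = ':predict' | ':generateContent';

// Hypothetical helper mirroring the routing rule in the text.
function googleImageMethod(model: string): GoogleImageMethod {
  // Accept both 'google/imagen-...' and the bare model id.
  const bare = model.startsWith('google/') ? model.slice('google/'.length) : model;
  return bare.toLowerCase().startsWith('imagen') ? ':predict' : ':generateContent';
}

// googleImageMethod('google/imagen-4.0-generate-001') -> ':predict'
// googleImageMethod('gemini-2.0-flash-exp')           -> ':generateContent'
```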

createMediaOutput() options:

| Option | Type | Description |
| --- | --- | --- |
| model | string | Namespaced ('openai/gpt-image-1') or bare. Required unless provider is set and the adapter uses a default. |
| provider | ProviderName | Required when model is bare. |
| apiKey | string | Optional; falls back to engine.apiKeys[provider]. |
| dir | string | Directory for FileMediaStore (Node/Bun). |
| store | MediaStore | Custom store. Use new MemoryMediaStore() in the browser. |
| providers | Record<string, MediaProviderAdapter> | Override or extend auto-registered adapters (custom baseURL, shared instance). |
| engine | EngineHandle | Share an existing engine (hooks, catalog, fetch queue). |
| config | MediaOutputConfig | pollIntervalMs (default 5000) and maxPollWaitMs (default 600000) for async video. |

ImageGenRequest.params — full option set:

| Param | Type | Providers | Description |
| --- | --- | --- | --- |
| n | number | OpenAI, Google Imagen | Number of images to generate. OpenAI default 1; Google Imagen max 4. |
| size | string | OpenAI | Pixel dimensions string: '1024x1024', '1792x1024', '1024x1792' (DALL-E 3); '1024x1024', '1536x1024', '1024x1536' (gpt-image-1); '256x256', '512x512', '1024x1024' (DALL-E 2). |
| aspectRatio | string | Google | Aspect ratio string: '1:1', '3:4', '4:3', '9:16', '16:9'. |
| imageSize | string | Google Imagen | Sample image size ('1K', '2K'). Maps to sampleImageSize for Imagen, imageSize for Gemini. |
| quality | string | OpenAI | 'standard' or 'hd' (DALL-E 3); 'low', 'medium', 'high', 'auto' (gpt-image-1). |
| style | string | OpenAI DALL-E 3 | 'vivid' or 'natural'. Ignored by gpt-image-1 and Google. |
| background | string | OpenAI gpt-image-1 | 'transparent', 'opaque', or 'auto'. Requires PNG output. |
| outputFormat | string | OpenAI gpt-image-1 | 'png', 'jpeg', or 'webp'. Default 'png'. |
| responseFormat | 'b64_json' \| 'url' | OpenAI DALL-E 2/3 only | gpt-image-1 always returns b64_json and rejects this parameter, so the adapter omits it automatically. |
| strength | number | OpenRouter | Image-to-image strength (0-1); lower = closer to source. |
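
To illustrate how the per-model rules in the table could translate into an OpenAI request body, here is a minimal sketch. buildOpenAIImageBody and the ImageParams type are hypothetical and are not the adapter's real implementation; the snake_case field names (response_format, output_format) are the OpenAI API's own.

```typescript
// Subset of the params table that applies to OpenAI; illustrative only.
type ImageParams = {
  n?: number;
  size?: string;
  quality?: string;
  style?: string;
  background?: string;
  outputFormat?: string;
  responseFormat?: 'b64_json' | 'url';
};

// Hypothetical helper: map camelCase params to a request body,
// dropping fields the target model does not accept.
function buildOpenAIImageBody(
  model: string,
  prompt: string,
  params: ImageParams = {},
): Record<string, unknown> {
  const body: Record<string, unknown> = { model, prompt };
  if (params.n !== undefined) body['n'] = params.n;
  if (params.size !== undefined) body['size'] = params.size;
  if (params.quality !== undefined) body['quality'] = params.quality;

  if (model === 'gpt-image-1') {
    // gpt-image-1 rejects response_format, so it is never sent.
    if (params.background !== undefined) body['background'] = params.background;
    if (params.outputFormat !== undefined) body['output_format'] = params.outputFormat;
  } else {
    // DALL-E models return URLs by default; ask for inline bytes instead.
    body['response_format'] = params.responseFormat ?? 'b64_json';
    if (model === 'dall-e-3' && params.style !== undefined) body['style'] = params.style;
  }
  return body;
}
```

Keeping the model-specific branches in one function means the caller can pass the same params object regardless of which OpenAI model is configured.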

MediaResult fields:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Generated media id (img_<uuid>). Use to load bytes from the store. |
| type | 'image' | Media type discriminator. |
| mimeType | string | e.g. 'image/png'. From provider response or outputFormat. |
| meta.size | number | Byte count of the stored file. |
| meta.provider | string | Provider that generated it. |
| meta.model | string | Model id used. |
| meta.prompt | string | The prompt sent. |
| meta.revisedPrompt | string \| undefined | OpenAI may return a revised prompt when it rewrites your input. |
| meta.width / meta.height | number \| undefined | Pixel dimensions when reported by the provider. |

Cost note: Image generation is priced per image (DALL-E 2/3) or per output token (gpt-image-1, Gemini). gpt-image-1 image output tokens are billed at $0.04/1K; higher quality settings produce more tokens per image and therefore cost more. DALL-E 3 standard 1024x1024 is $0.04/image. Google Imagen pricing varies by model and region. These figures can change, so check provider pricing pages before running large batches.
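
A rough budget check before a large batch can be sketched as below. The two rates are the figures quoted in the cost note and are assumptions that may be out of date; the helper names are hypothetical, and the output-token count for token-priced models has to come from your own usage data.

```typescript
// Rates quoted in the cost note above; treat as assumptions and re-check them.
const DALLE3_STANDARD_1024_USD_PER_IMAGE = 0.04;
const GPT_IMAGE_1_USD_PER_1K_OUTPUT_TOKENS = 0.04;

// Round to whole cents to avoid floating-point noise in reports.
function roundToCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// Per-image pricing: cost scales with the number of images.
function estimatePerImageCost(images: number, usdPerImage: number): number {
  return roundToCents(images * usdPerImage);
}

// Per-token pricing: cost scales with total output tokens.
function estimatePerTokenCost(outputTokens: number, usdPer1KTokens: number): number {
  return roundToCents((outputTokens / 1000) * usdPer1KTokens);
}

// 250 DALL-E 3 standard images at $0.04 each:
const dalleBatchUsd = estimatePerImageCost(250, DALLE3_STANDARD_1024_USD_PER_IMAGE);
// 4,000 output tokens at $0.04 per 1K tokens:
const tokenBatchUsd = estimatePerTokenCost(4000, GPT_IMAGE_1_USD_PER_1K_OUTPUT_TOKENS);
console.log({ dalleBatchUsd, tokenBatchUsd });
```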

Provider and model reference:

| Provider | Models | Endpoint |
| --- | --- | --- |
| OpenAI | gpt-image-1, dall-e-3, dall-e-2 | /v1/images/generations (generate); /v1/images/edits (edit) |
| Google | imagen-4.0-generate-001, gemini-2.0-flash-exp (image) | :predict for Imagen; generateContent for Gemini image |
| xAI | Aurora models via OpenRouter | /v1/images/generations |

import { createMediaOutput } from '@combycode/llm-sdk';

// Unified media output: one handle, generateImage() defaults provider+model from
// the configured model id. (Official samples call provider-specific endpoints:
// openai images.generate vs google generateContent responseModalities:['IMAGE'].)
const media = createMediaOutput({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  dir: './.media-out',
});

const t0 = performance.now();
const [img] = await media.generateImage({
  prompt: 'a red circle on white background',
  params: { size: '1024x1024' },
});
console.log(JSON.stringify({ result: String(img?.meta.size ?? 0), ms: Math.round(performance.now() - t0) }));

OpenAI’s SDK calls client.images.generate() and returns response.data[] with b64_json or url fields — you decode base64 and save to disk yourself. Google has no official image-generation method in the Node SDK; you call client.models.generateContent() with responseModalities: ['IMAGE'] and extract inlineData.data from the response parts manually. ORXA calls generateImage() once and returns typed MediaResult[] with bytes already saved to dir — no provider-specific extraction code in your app.

gpt-image-1 always returns image bytes as b64_json (PNG by default). The adapter silently omits responseFormat for gpt-image-1 because the API rejects it. DALL-E 3 and DALL-E 2 still need it set to 'b64_json' internally, and the adapter handles this.

Revised prompts. OpenAI’s API may rewrite your prompt for safety or quality. The original prompt is stored in meta.prompt; the rewritten version (if any) is in meta.revisedPrompt. Log revisedPrompt when debugging unexpected output.

Google Imagen and Gemini image models use different endpoints. Model names starting with 'imagen' go to :predict; all other Google models go to generateContent. Set model in createMediaOutput() to the correct string; the adapter routes automatically.

Bytes are saved on every generateImage() call. If generation fails mid-call (network error after the image is returned but before mediaStore.save()), no file is written. The promise rejects cleanly — retry safely.

editImage() requires sourceImage as a DataSource, not a path string. It does not accept raw file paths, so read the file into a buffer DataSource first (see the editImage() example above).
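
Because a failed generateImage() call writes no file and rejects cleanly, it can be wrapped in a generic retry helper. The sketch below is one way to do that with capped exponential backoff; withRetry and backoffDelays are hypothetical names, the default delays are arbitrary, and nothing here is part of the SDK.

```typescript
// Delay before each retry: baseMs, 2x, 4x, ... capped at capMs.
function backoffDelays(retries: number, baseMs = 500, capMs = 8000): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(capMs, baseMs * Math.pow(2, i)));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: run `fn`, retrying up to `retries` times on rejection.
async function withRetry<T>(fn: () => Promise<T>, retries = 3, baseMs = 500): Promise<T> {
  const delays = backoffDelays(retries, baseMs);
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of retries, rethrow below
      await sleep(delays[attempt] ?? baseMs);
    }
  }
  throw lastError;
}
```

A call site would look like withRetry(() => media.generateImage({ prompt })). Note that each retry is a new provider request and may be billed separately, so keep the retry count small.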

Next steps:

  • Vision input — feed generated images back into a vision model
  • TTS — audio generation counterpart
  • File upload — upload a source image to the Files API for use in edits