POST /v1/audio/transcriptions
OpenAI Format - Transcriptions
curl --request POST \
  --url https://api.foxapi.cc/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form model=whisper-1 \
  --form language=en \
  --form 'prompt=<string>' \
  --form response_format=json \
  --form temperature=0
{
  "text": "The weather is nice today, let's go for a walk in the park."
}
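The curl call above can be mirrored in Python. A minimal sketch that assembles the same multipart fields without sending them (the API key is a placeholder; sending the request with `requests` is shown in the comments):

```python
API_KEY = "YOUR_API_KEY"  # placeholder

URL = "https://api.foxapi.cc/v1/audio/transcriptions"

def build_transcription_request(filename: str, audio_bytes: bytes,
                                model: str = "whisper-1", language: str = "en"):
    """Assemble headers and multipart form fields; sending is a separate step."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    data = {
        "model": model,
        "language": language,
        "response_format": "json",
        "temperature": "0",
    }
    files = {"file": (filename, audio_bytes)}
    return headers, data, files

# Usage (commented out so the sketch stays offline):
# import requests
# with open("speech.mp3", "rb") as f:
#     headers, data, files = build_transcription_request("speech.mp3", f.read())
# resp = requests.post(URL, headers=headers, data=data, files=files)
# print(resp.json()["text"])
```

Passing the file as a `files` entry makes the HTTP client emit `multipart/form-data` with the correct boundary, matching the curl `--form` flags above.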


Model-Specific Parameters

The OpenAI transcription endpoint exposes different fields depending on the model. The request body above only documents fields common to all models. The following sections describe model-specific or model-restricted fields.

Supported Models

| Model ID | Description |
|---|---|
| whisper-1 | Classic Whisper V2 model. Supports the broadest set of output formats and timestamp granularities |
| gpt-4o-transcribe | High-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe | Lightweight high-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe-2025-12-15 | Versioned snapshot of gpt-4o-mini-transcribe |
| gpt-4o-transcribe-diarize | Transcription with speaker diarization. Use diarized_json to receive per-segment speaker labels |

response_format Compatibility Matrix

| Model | Supported formats |
|---|---|
| whisper-1 | json / text / srt / verbose_json / vtt |
| gpt-4o-transcribe, gpt-4o-mini-transcribe(-2025-12-15) | json only |
| gpt-4o-transcribe-diarize | json / text / diarized_json (use diarized_json to receive speaker annotations) |
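The matrix above can be encoded as a small lookup so a client can reject an unsupported model/format pair before making the request. A sketch (model IDs and formats are taken verbatim from the table):

```python
# Allowed response_format values per model, mirroring the table above.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}

def check_format(model: str, response_format: str) -> None:
    """Raise locally, before the API does, if the combination is unsupported."""
    allowed = SUPPORTED_FORMATS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model}")
    if response_format not in allowed:
        raise ValueError(
            f"{model} does not support response_format={response_format}; "
            f"allowed: {sorted(allowed)}")

check_format("whisper-1", "srt")            # OK
# check_format("gpt-4o-transcribe", "vtt")  # would raise ValueError
```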

whisper-1-Only Features

  • timestamp_granularities[] — array, allowed values: word / segment, default [segment]
    • Word / segment-level timestamp granularity
    • Takes effect only when response_format=verbose_json
    • Sent as repeated form field timestamp_granularities[]
    • gpt-4o-* models cannot use this in practice (they only support json); gpt-4o-transcribe-diarize explicitly disallows it
  • Streaming not supported: stream=true is silently ignored on whisper-1.
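Because timestamp_granularities[] is a repeated form field, the multipart body needs one entry per requested value, which means building the form data as a list of tuples rather than a dict. A sketch (field names match the docs above; pairing this with an HTTP client is left to the caller):

```python
def whisper_timestamp_fields(granularities=("word", "segment")):
    """Build form fields for whisper-1 word/segment timestamps.

    timestamp_granularities[] only takes effect with
    response_format=verbose_json, so that format is set here too.
    """
    fields = [
        ("model", "whisper-1"),
        ("response_format", "verbose_json"),
    ]
    # One repeated timestamp_granularities[] entry per requested value.
    for g in granularities:
        if g not in ("word", "segment"):
            raise ValueError(f"invalid granularity: {g}")
        fields.append(("timestamp_granularities[]", g))
    return fields

# e.g. requests.post(URL, headers=..., data=whisper_timestamp_fields(),
#                    files={"file": ...})
```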

gpt-4o-* Series Parameters

Applies to gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15.
  • include[] — array, allowed value: logprobs
    • Returns the log probabilities of each token, useful for assessing model confidence
    • Only effective when response_format=json
    • Not available on whisper-1 or gpt-4o-transcribe-diarize
  • stream — boolean, default false
    • Streams transcription results via SSE (Server-Sent Events)
    • Ignored on whisper-1
  • chunking_strategy — "auto" string or server_vad object
    • Controls how the audio is split into chunks. If unset, the audio is transcribed as a single block
    • When "auto": the server normalizes loudness and then uses VAD to choose chunk boundaries
    • When a server_vad object (manual VAD tuning):

      | Field | Type | Default | Description |
      |---|---|---|---|
      | type | string | — | Required, must be "server_vad" |
      | prefix_padding_ms | integer | 300 | Audio (ms) included before VAD-detected speech |
      | silence_duration_ms | integer | 200 | Silence (ms) used to detect end of speech. Shorter values respond faster but may cut speech off on short pauses |
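When sent in a multipart body, an object-valued field like chunking_strategy is serialized as a JSON string. A sketch of producing either variant, with defaults mirroring the table above (how your HTTP client attaches the field is an assumption left to the caller):

```python
import json

def chunking_strategy(auto: bool = True, *, prefix_padding_ms: int = 300,
                      silence_duration_ms: int = 200) -> str:
    """Return the chunking_strategy form value: "auto" or a server_vad object."""
    if auto:
        return "auto"
    return json.dumps({
        "type": "server_vad",  # required discriminator
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    })

chunking_strategy()                                     # → "auto"
chunking_strategy(auto=False, silence_duration_ms=150)  # JSON object string
```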

gpt-4o-transcribe-diarize-Only Parameters

Applies only to gpt-4o-transcribe-diarize (speaker-diarization model).
  • chunking_strategy — Required for inputs longer than 30 seconds (recommended: "auto")
  • known_speaker_names[] — array, max 4
    • Identifier list for known speakers (e.g. customer, agent)
    • Maps 1-to-1 with known_speaker_references[]
  • known_speaker_references[] — array, max 4
    • Reference audio for each speaker, in data URL format
    • Each sample must be 2-10 seconds
    • Same audio formats as the file field
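A reference sample in data URL format is just the base64-encoded audio bytes behind a MIME-type prefix. A sketch of encoding one clip and pairing names with references as repeated form fields (the MIME type must match the actual audio format; the 2-10 second duration check is left to the caller):

```python
import base64

def audio_data_url(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode a reference clip as a data URL for known_speaker_references[]."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def speaker_fields(speakers: dict) -> list:
    """Map speaker names to reference clips as paired repeated form fields."""
    if len(speakers) > 4:
        raise ValueError("at most 4 known speakers are allowed")
    fields = []
    for name, clip in speakers.items():
        # The two arrays map 1-to-1, so each name is appended
        # alongside its reference.
        fields.append(("known_speaker_names[]", name))
        fields.append(("known_speaker_references[]", audio_data_url(clip)))
    return fields

# e.g. speaker_fields({"agent": open("agent.wav", "rb").read()})
```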

Fields Not Supported by gpt-4o-transcribe-diarize

The following fields are not available on gpt-4o-transcribe-diarize:
| Field | Note |
|---|---|
| prompt | Style/continuation prompt not supported |
| timestamp_granularities[] | Word / segment timestamp granularity not configurable |
| include[] | Additional returns such as logprobs not supported |
| stream | Streaming output not supported |

Authorizations

Authorization
string
header
required

All APIs require Bearer Token authentication

Add to request header:

Authorization: Bearer YOUR_API_KEY

Body

multipart/form-data
file
file
required

Audio file to transcribe

Notes:

  • Uploaded via multipart/form-data
  • Supported formats: flac / mp3 / mp4 / mpeg / mpga / m4a / ogg / wav / webm
model
string
required

Speech-to-text model ID. Allowed values: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe-diarize (see Supported Models above)

Example:

"whisper-1"

language
string

ISO-639-1 language code of the input audio (e.g. en, zh, ja). Supplying this improves accuracy and latency.

Example:

"en"

prompt
string

Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language.

response_format
enum<string>
default:json

Format of the transcription output

Available options:
json,
text,
srt,
verbose_json,
vtt
temperature
number
default:0

Sampling temperature between 0 and 1. Higher values produce more random output; 0 lets the model auto-tune.

Required range: 0 <= x <= 1

Response

Transcription response

text
string
required

Transcribed text

Example:

"The weather is nice today, let's go for a walk in the park."