POST /v1/audio/transcriptions
OpenAI Format - Transcriptions
curl --request POST \
  --url https://api.foxapi.cc/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form model=whisper-1 \
  --form language=en \
  --form 'prompt=<string>' \
  --form response_format=json \
  --form temperature=0
{
  "text": "The weather is nice today, let's go for a walk in the park."
}
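The curl call above can be mirrored in Python. A minimal sketch that assembles the same multipart fields without sending them (the API key is a placeholder; sending the request with `requests` is shown in the comments):

```python
API_KEY = "YOUR_API_KEY"  # placeholder

URL = "https://api.foxapi.cc/v1/audio/transcriptions"

def build_transcription_request(filename: str, audio_bytes: bytes,
                                model: str = "whisper-1", language: str = "en"):
    """Assemble headers and multipart form fields; sending is a separate step."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    data = {
        "model": model,
        "language": language,
        "response_format": "json",
        "temperature": "0",
    }
    files = {"file": (filename, audio_bytes)}
    return headers, data, files

# Usage (commented out so the sketch stays offline):
# import requests
# with open("speech.mp3", "rb") as f:
#     headers, data, files = build_transcription_request("speech.mp3", f.read())
# resp = requests.post(URL, headers=headers, data=data, files=files)
# print(resp.json()["text"])
```

Passing the file as a `files` entry makes the HTTP client emit `multipart/form-data` with the correct boundary, matching the curl `--form` flags above.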


Model-Specific Parameters

The OpenAI transcription endpoint exposes different fields depending on the model. The request body above only documents fields common to all models. The following sections describe model-specific or model-restricted fields.

Supported Models

| Model ID | Description |
|---|---|
| whisper-1 | Classic Whisper V2 model. Supports the broadest set of output formats and timestamp granularities |
| gpt-4o-transcribe | High-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe | Lightweight high-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe-2025-12-15 | Versioned snapshot of gpt-4o-mini-transcribe |
| gpt-4o-transcribe-diarize | Transcription with speaker diarization. Use diarized_json to receive per-segment speaker labels |

response_format Compatibility Matrix

| Model | Supported formats |
|---|---|
| whisper-1 | json / text / srt / verbose_json / vtt |
| gpt-4o-transcribe, gpt-4o-mini-transcribe(-2025-12-15) | json only |
| gpt-4o-transcribe-diarize | json / text / diarized_json (use diarized_json to receive speaker annotations) |
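The matrix above can be encoded as a small lookup so a client can reject an unsupported model/format pair before making the request. A sketch (model IDs and formats are taken verbatim from the table):

```python
# Allowed response_format values per model, mirroring the table above.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}

def check_format(model: str, response_format: str) -> None:
    """Raise locally, before the API does, if the combination is unsupported."""
    allowed = SUPPORTED_FORMATS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model}")
    if response_format not in allowed:
        raise ValueError(
            f"{model} does not support response_format={response_format}; "
            f"allowed: {sorted(allowed)}")

check_format("whisper-1", "srt")            # OK
# check_format("gpt-4o-transcribe", "vtt")  # would raise ValueError
```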

whisper-1-Only Features

  • timestamp_granularities[] — array, allowed values: word / segment, default [segment]
    • Word / segment-level timestamp granularity
    • Takes effect only when response_format=verbose_json
    • Sent as repeated form field timestamp_granularities[]
    • gpt-4o-* models cannot use this in practice (they only support json); gpt-4o-transcribe-diarize explicitly disallows it
  • Streaming not supported: stream=true is silently ignored on whisper-1.
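Because timestamp_granularities[] is a repeated form field, the multipart body needs one entry per requested value, which means building the form data as a list of tuples rather than a dict. A sketch (field names match the docs above; pairing this with an HTTP client is left to the caller):

```python
def whisper_timestamp_fields(granularities=("word", "segment")):
    """Build form fields for whisper-1 word/segment timestamps.

    timestamp_granularities[] only takes effect with
    response_format=verbose_json, so that format is set here too.
    """
    fields = [
        ("model", "whisper-1"),
        ("response_format", "verbose_json"),
    ]
    # One repeated timestamp_granularities[] entry per requested value.
    for g in granularities:
        if g not in ("word", "segment"):
            raise ValueError(f"invalid granularity: {g}")
        fields.append(("timestamp_granularities[]", g))
    return fields

# e.g. requests.post(URL, headers=..., data=whisper_timestamp_fields(),
#                    files={"file": ...})
```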

gpt-4o-* Series Parameters

Applies to gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15.
  • include[] — array, allowed value: logprobs
    • Returns the log probabilities of each token, useful for assessing model confidence
    • Only effective when response_format=json
    • Not available on whisper-1 or gpt-4o-transcribe-diarize
  • stream — boolean, default false
    • Streams transcription results via SSE (Server-Sent Events)
    • Ignored on whisper-1
  • chunking_strategy — "auto" string or server_vad object
    • Controls how the audio is split into chunks. If unset, the audio is transcribed as a single block
    • When "auto": the server normalizes loudness and then uses VAD to choose chunk boundaries
    • When a server_vad object (manual VAD tuning):

      | Field | Type | Default | Description |
      |---|---|---|---|
      | type | string | — | Required, must be "server_vad" |
      | prefix_padding_ms | integer | 300 | Audio (ms) included before VAD-detected speech |
      | silence_duration_ms | integer | 200 | Silence (ms) used to detect end of speech. Shorter values respond faster but may cut speech off on short pauses |
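When sent in a multipart body, an object-valued field like chunking_strategy is serialized as a JSON string. A sketch of producing either variant, with defaults mirroring the table above (how your HTTP client attaches the field is an assumption left to the caller):

```python
import json

def chunking_strategy(auto: bool = True, *, prefix_padding_ms: int = 300,
                      silence_duration_ms: int = 200) -> str:
    """Return the chunking_strategy form value: "auto" or a server_vad object."""
    if auto:
        return "auto"
    return json.dumps({
        "type": "server_vad",  # required discriminator
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    })

chunking_strategy()                                     # → "auto"
chunking_strategy(auto=False, silence_duration_ms=150)  # JSON object string
```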

gpt-4o-transcribe-diarize-Only Parameters

Applies only to gpt-4o-transcribe-diarize (speaker-diarization model).
  • chunking_strategy — Required for inputs longer than 30 seconds (recommended: "auto")
  • known_speaker_names[] — array, max 4
    • Identifier list for known speakers (e.g. customer, agent)
    • Maps 1-to-1 with known_speaker_references[]
  • known_speaker_references[] — array, max 4
    • Reference audio for each speaker, in data URL format
    • Each sample must be 2-10 seconds
    • Same audio formats as the file field
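A reference sample in data URL format is just the base64-encoded audio bytes behind a MIME-type prefix. A sketch of encoding one clip and pairing names with references as repeated form fields (the MIME type must match the actual audio format; the 2-10 second duration check is left to the caller):

```python
import base64

def audio_data_url(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode a reference clip as a data URL for known_speaker_references[]."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def speaker_fields(speakers: dict) -> list:
    """Map speaker names to reference clips as paired repeated form fields."""
    if len(speakers) > 4:
        raise ValueError("at most 4 known speakers are allowed")
    fields = []
    for name, clip in speakers.items():
        # The two arrays map 1-to-1, so each name is appended
        # alongside its reference.
        fields.append(("known_speaker_names[]", name))
        fields.append(("known_speaker_references[]", audio_data_url(clip)))
    return fields

# e.g. speaker_fields({"agent": open("agent.wav", "rb").read()})
```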

Fields Not Supported by gpt-4o-transcribe-diarize

The following fields are not available on gpt-4o-transcribe-diarize:
| Field | Note |
|---|---|
| prompt | Style/continuation prompt not supported |
| timestamp_granularities[] | Word / segment timestamp granularity not configurable |
| include[] | Additional returns such as logprobs not supported |
| stream | Streaming output not supported |

Authorizations

Authorization
string
header
required

All APIs require Bearer Token authentication

Add to request header:

Authorization: Bearer YOUR_API_KEY

Body

multipart/form-data
file
file
required

Audio file to transcribe

Notes:

  • Uploaded via multipart/form-data
  • Supported formats: flac / mp3 / mp4 / mpeg / mpga / m4a / ogg / wav / webm
model
string
required

Speech-to-text model ID. Allowed values: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe-diarize (see Supported Models above)

Example:

"whisper-1"

language
string

ISO-639-1 language code of the input audio (e.g. en, zh, ja). Supplying this improves accuracy and latency.

Example:

"en"

prompt
string

Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language.

response_format
enum<string>
default:json

Format of the transcription output

Available options:
json,
text,
srt,
verbose_json,
vtt
temperature
number
default:0

Sampling temperature between 0 and 1. Higher values produce more random output; 0 lets the model auto-tune.

Required range: 0 <= x <= 1

Response

Transcription response

text
string
required

Transcribed text

Example:

"The weather is nice today, let's go for a walk in the park."