### Supported Models

The primary transcription models are whisper-1 (recommended), gpt-4o-transcribe, and gpt-4o-mini-transcribe; the full list of model IDs is below.
| Model ID | Description |
|---|---|
| whisper-1 | Classic Whisper V2 model. Supports the broadest set of output formats and timestamp granularities |
| gpt-4o-transcribe | High-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe | Lightweight high-accuracy transcription. json output only. Streamable |
| gpt-4o-mini-transcribe-2025-12-15 | Versioned snapshot of gpt-4o-mini-transcribe |
| gpt-4o-transcribe-diarize | Transcription with speaker diarization. Use diarized_json to receive per-segment speaker labels |
### response_format Compatibility Matrix

| Model | Supported formats |
|---|---|
| whisper-1 | json / text / srt / verbose_json / vtt |
| gpt-4o-transcribe, gpt-4o-mini-transcribe(-2025-12-15) | json only |
| gpt-4o-transcribe-diarize | json / text / diarized_json (use diarized_json to receive speaker annotations) |
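To make the model and format choices concrete, here is a minimal request sketch using Python's requests library. The base URL https://api.foxapi.cc and the /v1/audio/transcriptions path are assumptions based on the OpenAI-style interface described on this page; substitute the endpoint from your own account.

```python
# Minimal sketch: a basic transcription request (base URL and path are assumptions).
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.foxapi.cc"  # assumed; substitute your actual endpoint

with open("meeting.mp3", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",   # path assumed from the OpenAI-style interface
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},                   # the audio file to transcribe
        data={
            "model": "whisper-1",                # any model ID from the table above
            "language": "en",                    # optional; improves accuracy and latency
            "response_format": "json",           # must be allowed for the chosen model
        },
        timeout=600,
    )
resp.raise_for_status()
print(resp.json()["text"])
```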
### whisper-1-Only Features

- timestamp_granularities[] — array, allowed values: word / segment, default [segment]. Requires response_format=verbose_json; gpt-4o-transcribe-diarize explicitly disallows it (see the sketch below)
- stream=true is silently ignored on whisper-1
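A hedged sketch of a whisper-1 request with word-level timestamps, assuming the same endpoint as above and an OpenAI-style verbose_json response with a top-level words list:

```python
# Sketch: word-level timestamps with whisper-1 (endpoint and response shape assumed).
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.foxapi.cc"  # assumed base URL, as in the earlier sketch

with open("meeting.mp3", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={
            "model": "whisper-1",
            "response_format": "verbose_json",               # required for timestamps
            # a list value is sent as repeated form fields
            "timestamp_granularities[]": ["word", "segment"],
        },
    )
resp.raise_for_status()
# Assumed OpenAI-style verbose_json shape: a top-level "words" list with start/end times.
for word in resp.json().get("words", []):
    print(word)
```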
### Parameters for gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15

- include[] — array, allowed value: logprobs. Requires response_format=json; not available on whisper-1 or gpt-4o-transcribe-diarize
- stream — boolean, default false. Not supported on whisper-1
- chunking_strategy — "auto" string or server_vad object
  - "auto": the server normalizes loudness and then uses VAD to choose chunk boundaries
  - server_vad object (manual VAD tuning; fields below, with a request sketch after the table):
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | — | Required, must be "server_vad" |
| prefix_padding_ms | integer | 300 | Audio (ms) included before VAD-detected speech |
| silence_duration_ms | integer | 200 | Silence (ms) used to detect end of speech. Shorter values respond faster but may cut on short pauses |
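The following sketch combines these parameters for gpt-4o-mini-transcribe. It assumes the same endpoint as earlier and that object-valued fields such as chunking_strategy are accepted as JSON-encoded form values; check the full API reference for the exact encoding.

```python
# Sketch: gpt-4o-mini-transcribe with logprobs and manual server_vad chunking.
import json
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.foxapi.cc"  # assumed base URL

# Manual VAD tuning, mirroring the server_vad fields in the table above.
chunking = {
    "type": "server_vad",
    "prefix_padding_ms": 300,      # audio kept before detected speech
    "silence_duration_ms": 200,    # silence that marks the end of a chunk
}

with open("call.wav", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={
            "model": "gpt-4o-mini-transcribe",
            "response_format": "json",            # the only format these models support
            "include[]": "logprobs",              # requires response_format=json
            # assumption: object-valued fields are accepted as JSON-encoded form values
            "chunking_strategy": json.dumps(chunking),
        },
    )
resp.raise_for_status()
print(resp.json()["text"])
```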
### gpt-4o-transcribe-diarize-Only Parameters

These parameters apply only to gpt-4o-transcribe-diarize (the speaker-diarization model); a request sketch follows the list.

- chunking_strategy — Required for inputs longer than 30 seconds (recommended: "auto")
- known_speaker_names[] — array, max 4. Names of speakers known in advance (e.g. customer, agent); each name corresponds to an entry in known_speaker_references[]
- known_speaker_references[] — array, max 4. Reference audio samples for the named speakers, supplied in addition to the main file field
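A minimal diarization sketch, again assuming the endpoint used above; the segment/speaker shape of the diarized_json response is an assumption beyond the per-segment labels described on this page.

```python
# Sketch: speaker-diarized transcription with gpt-4o-transcribe-diarize.
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.foxapi.cc"  # assumed base URL

with open("interview.mp3", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={
            "model": "gpt-4o-transcribe-diarize",
            "response_format": "diarized_json",   # per-segment speaker labels
            "chunking_strategy": "auto",          # required for audio longer than 30 seconds
        },
    )
resp.raise_for_status()
# Assumed diarized_json shape: a list of segments, each carrying a speaker label and text.
for seg in resp.json().get("segments", []):
    print(seg.get("speaker"), seg.get("text"))
```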
The following fields are not supported by gpt-4o-transcribe-diarize:
| Field | Note |
|---|---|
| prompt | Style/continuation prompt not supported |
| timestamp_granularities[] | Word / segment timestamp granularity not configurable |
| include[] | Additional returns like logprobs not supported |
| stream | Streaming output not supported |
### Request

Add to the request header:

Authorization: Bearer YOUR_API_KEY

Request body fields:

| Field | Description | Example |
|---|---|---|
| file | Audio file to transcribe | |
| model | Speech-to-text model ID. Allowed values: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe | "whisper-1" |
| language | ISO-639-1 language code of the input audio (e.g. en, zh, ja). Supplying this improves accuracy and latency | "en" |
| prompt | Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language | |
| response_format | Format of the transcription output. Allowed values: json, text, srt, verbose_json, vtt | |
| temperature | Sampling temperature between 0 and 1 (0 <= x <= 1). Higher values produce more random output; 0 lets the model auto-tune | |

### Transcription response

| Field | Description | Example |
|---|---|---|
| text | Transcribed text | "The weather is nice today, let's go for a walk in the park." |