# MiniMax Speech 2.8 HD Async Text-to-Speech

MiniMax asynchronous text-to-speech API. It supports voice, emotion, speed, and other parameter settings. Direct text input is limited to 50,000 characters; file input is supported up to 100,000 characters.

This is an **asynchronous** API; only the **task\_id** is returned. Use the **task\_id** to request the [**Task Result API**](/api-reference/model-apis-task-result) and retrieve the audio generation results.

## Request Headers

* `Content-Type`: Supports `application/json`.
* `Authorization`: Bearer authentication format, for example: `Bearer {{API Key}}`.

## Request Body

Text to synthesize into audio; maximum length is 50,000 characters. Either `text` or `text_file_id` is required.

* Interjection tags: Only supported when the model is `speech-2.8-hd` or `speech-2.8-turbo`. Supported interjections: `(laughs)` (laughter), `(chuckle)` (light laugh), `(coughs)` (cough), `(clear-throat)` (clear throat), `(groans)` (groan), `(breath)` (normal breathing), `(pant)` (panting), `(inhale)` (inhale), `(exhale)` (exhale), `(gasps)` (gasp), `(sniffs)` (sniff), `(sighs)` (sigh), `(snorts)` (snort), `(burps)` (burp), `(lip-smacking)` (lip smacking), `(humming)` (humming), `(hissing)` (hissing), `(emm)` (um), `(whistles)` (whistle), `(sneezes)` (sneeze), `(crying)` (crying), `(applause)` (applause)

Text file ID for audio synthesis. A single file must be less than 100,000 characters; supported file formats: txt, zip. Either `text` or `text_file_id` is required; the format is validated automatically.

* **txt file**: Length limit \<100000 characters. Supports custom pauses using the `<#x#>` tag, where x is the pause duration in seconds, range \[0.01, 99.99], with up to 2 decimal places.
  A pause must be placed between two pronounceable text segments; multiple pause tags cannot be used consecutively.
* **zip file**:
  * The compressed package must contain txt or json files of the same format.
  * json file format: Supports three fields, \[`title`, `content`, `extra`], representing the title, body, and additional information. If all three fields exist, 3 groups of results are produced, 9 files in total, stored in one folder. If a field does not exist or is empty, no corresponding result is generated.

Pitch adjustment (deep/bright), range \[-100, 100]. Values closer to -100 produce a deeper voice; values closer to 100 produce a brighter voice.

Timbre adjustment (rich/crisp), range \[-100, 100]. Values closer to -100 produce a richer voice; values closer to 100 produce a crisper voice.

Intensity adjustment (powerful/soft), range \[-100, 100]. Values closer to -100 produce a more powerful voice; values closer to 100 produce a softer voice.

Sound effect setting; only one can be selected at a time. Optional values:

* `spacious_echo`: spacious echo
* `auditorium_echo`: auditorium broadcast
* `lofi_telephone`: telephone distortion
* `robotic`: electronic

Audio output format. Options `[mp3, pcm, flac, wav, pcmu_raw, pcmu_wav, opus]`, default is `mp3`.

* `pcmu_raw` and `pcmu_wav` use G.711 μ-law encoding (sample rate 8 kHz); `pcmu_raw` is headerless raw data, and `pcmu_wav` is wrapped in a WAV container.
* `opus` uses Ogg/Opus encoding and only supports the sample rates `[8000, 12000, 16000, 24000, 48000]`; any other sample rate causes the task to fail.

Audio bitrate. Options `[32000, 64000, 128000, 256000]`, default is `128000`. This parameter only applies to the `mp3` format.

Number of audio channels. Options: `[1, 2]`, where `1` is mono and `2` is stereo; default is `1`.

Audio sample rate.
Options `[8000, 16000, 22050, 24000, 32000, 44100]`, default is `32000`.

Audio volume; a higher value means louder. Range (0, 10], default is 1.0.

Audio pitch, range `[-12, 12]`, default is 0, where 0 outputs the original voice.

Speech speed; a higher value means faster. Range `[0.5, 2]`, default is 1.0.

Controls the emotion of the synthesized speech. Optional values (9 emotions): `happy`, `sad`, `angry`, `fearful`, `disgusted`, `surprised`, `calm`, `fluent`, `whisper`

* The model automatically matches an appropriate emotion to the input text, so there is usually no need to specify one manually.
* This parameter only works for the `speech-2.6-hd`, `speech-2.6-turbo`, `speech-02-hd`, `speech-02-turbo`, `speech-01-hd`, and `speech-01-turbo` models.
* The `fluent` and `whisper` options only work for the `speech-2.6-turbo` and `speech-2.6-hd` models.

Voice ID for audio synthesis. If a mixed voice is needed, set the `timber_weights` parameter and leave this empty. Supports system voices, cloned voices, and text-generated voices. Below are some of the latest system voices (ID):

* Chinese: `moss_audio_ce44fc67-7ce3-11f0-8de5-96e35d26fb85`, `moss_audio_aaa1346a-7ce7-11f0-8e61-2e6e3c7ee85d`, `Chinese (Mandarin)_Lyrical_Voice`, `Chinese (Mandarin)_HK_Flight_Attendant`
* English: `English_Graceful_Lady`, `English_Insightful_Speaker`, `English_radiant_girl`, `English_Persuasive_Man`, `moss_audio_6dc281eb-713c-11f0-a447-9613c873494c`, `moss_audio_570551b1-735c-11f0-b236-0adeeecad052`, `moss_audio_ad5baf92-735f-11f0-8263-fe5a2fe98ec8`, `English_Lucky_Robot`
* Japanese: `Japanese_Whisper_Belle`, `moss_audio_24875c4a-7be4-11f0-9359-4e72c55db738`, `moss_audio_7f4ee608-78ea-11f0-bb73-1e2a4cfcd245`, `moss_audio_c1a6a3ac-7be6-11f0-8e8e-36b92fbb4f95`
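
Two of the request body rules can be checked on the client before a task is submitted: the `<#x#>` pause-tag rules for txt input and the sample-rate restriction for the `opus` format. The following is a minimal sketch, not part of the API. The function names are hypothetical, and "pronounceable" is approximated here as "contains at least one letter or digit", which is an assumption rather than the service's own definition.

```python
import re

# Matches anything shaped like a pause tag, whether or not its value is valid.
TAG = re.compile(r"<#([^#<>]*)#>")
# A pause value: digits with at most two decimal places.
VALUE = re.compile(r"\d+(?:\.\d{1,2})?")
# Sample rates the text lists as valid for the opus format.
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}


def check_pause_tags(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for match in TAG.finditer(text):
        value = match.group(1)
        if not VALUE.fullmatch(value):
            problems.append(f"malformed pause value in {match.group(0)}")
        elif not 0.01 <= float(value) <= 99.99:
            problems.append(f"pause out of range in {match.group(0)}")

    # Every piece of text around a tag must contain something to pronounce.
    # Assumption: one letter or digit is enough to count as pronounceable.
    segments = re.split(r"<#[^#<>]*#>", text)
    if len(segments) > 1 and not all(
        any(ch.isalnum() for ch in segment) for segment in segments
    ):
        problems.append("pause tag is not between two pronounceable segments")
    return problems


def opus_sample_rate_ok(audio_format: str, sample_rate: int) -> bool:
    """Only the opus format restricts the sample rate in this check."""
    return audio_format != "opus" or sample_rate in OPUS_SAMPLE_RATES
```

For example, `check_pause_tags("Hello<#1.5#>world")` reports no problems, while two consecutive tags or a tag at the very start of the text are flagged.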

Supports English text normalization, which can improve performance in number-reading scenarios but slightly increases latency. Default is `false`.

Controls whether to add an audio rhythm identifier at the end of the synthesized audio. Default is `false`. This parameter is only valid for non-streaming synthesis.

Whether to enhance recognition of specified minor languages and dialects. Default is `null`; set it to `auto` to let the model decide automatically. Optional values: `Chinese`, `Chinese,Yue`, `English`, `Arabic`, `Russian`, `Spanish`, `French`, `Portuguese`, `German`, `Turkish`, `Dutch`, `Ukrainian`, `Vietnamese`, `Indonesian`, `Japanese`, `Italian`, `Korean`, `Thai`, `Polish`, `Romanian`, `Greek`, `Czech`, `Finnish`, `Hindi`, `Bulgarian`, `Danish`, `Hebrew`, `Malay`, `Persian`, `Slovak`, `Swedish`, `Croatian`, `Filipino`, `Hungarian`, `Norwegian`, `Slovenian`, `Catalan`, `Nynorsk`, `Tamil`, `Afrikaans`, `auto`

Enable this parameter to make clause transitions more natural. Only supported by the `speech-2.8-hd` and `speech-2.8-turbo` models.

Defines pronunciation or replacement rules for special characters or symbols. For Chinese text, tones are represented by numbers: 1st tone = 1, 2nd tone = 2, 3rd tone = 3, 4th tone = 4, neutral tone = 5. Example: `["omg/oh my god"]`

## Response

Corresponding audio file ID returned after task creation.

* After the task completes, use `file_id` to download the audio.
* This field is not returned when the request fails.

Note: The download URL is valid for 9 hours (32400 seconds) from generation. After it expires, the file becomes invalid and the generated information is lost, so download the file in time.

Use the `task_id` to retrieve the generated outputs.

Status details.

Status code:

* `0`: Success
* `1002`: Rate limit
* `1004`: Authentication failed
* `1039`: TPM rate limit triggered
* `1042`: Invalid characters exceed 10%
* `2013`: Parameter error
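
The status codes listed above can be turned into a small lookup for logging and error handling. This is a sketch only: the helper names are hypothetical, and treating the two rate-limit codes as retryable is an assumption of this example, not something the API documentation states.

```python
# Status codes and meanings exactly as listed in the response description.
STATUS_MESSAGES = {
    0: "Success",
    1002: "Rate limit",
    1004: "Authentication failed",
    1039: "TPM rate limit triggered",
    1042: "Invalid characters exceed 10%",
    2013: "Parameter error",
}

# Assumption: rate-limit errors may succeed on a later attempt,
# while the other errors need a change to the request first.
RETRYABLE_CODES = {1002, 1039}


def describe_status(code: int) -> str:
    """Return the documented meaning of a status code, if it is listed."""
    return STATUS_MESSAGES.get(code, f"Unknown status code {code}")


def should_retry(code: int) -> bool:
    """Return True for codes this sketch treats as transient."""
    return code in RETRYABLE_CODES
```

A caller might log `describe_status(code)` for any non-zero code and back off before resubmitting only when `should_retry(code)` is true.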

Token used to complete the current task.

Billable character count.
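
Because the download URL is valid for 9 hours (32400 seconds) from generation, a client that records the generation time can work out how long it has left to download the file. The sketch below assumes the client tracks that timestamp itself; the function names are hypothetical and nothing here is returned by the API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Validity window stated in the response note: 9 hours = 32400 seconds.
DOWNLOAD_URL_TTL = timedelta(seconds=32400)


def download_deadline(generated_at: datetime) -> datetime:
    """Return the moment the download URL stops being valid."""
    return generated_at + DOWNLOAD_URL_TTL


def seconds_remaining(generated_at: datetime, now: Optional[datetime] = None) -> float:
    """Return the seconds left before expiry, or 0.0 if already expired."""
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = (download_deadline(generated_at) - now).total_seconds()
    return max(0.0, remaining)
```

Passing timezone-aware timestamps for both arguments avoids errors from mixing naive and aware `datetime` values.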