An overview of all the endpoint parameters.


This is the text to be synthesized to audio.


Available voices:

  • Dan: Young Male
  • Will: Mature Male
  • Scarlett: Young Female
  • Liv: Young Female
  • Amy: Mature Female


The bitrate defaults to 192k.

  • Use lower values for low bandwidth or to reduce the transferred file size.
  • Use higher values for higher fidelity.


The speed adjustment defaults to 0. Examples:

  • 0.5: makes the audio 50% faster (e.g. a 60-second clip becomes 42 seconds)
  • -0.5: makes the audio 50% slower (e.g. a 60-second clip becomes 90 seconds)


The pitch defaults to 1. On the landing page, however, male voices default to 0.92, because people tend to prefer deeper male voices.


The codec defaults to libmp3lame (MP3).

  • Use pcm_mulaw for phone calls.
  • Use pcm_s16le to get raw audio at 22050 Hz.
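Raw pcm_s16le output has no container header, so most players will not open it directly. The sketch below shows one way to wrap such bytes in a WAV container using Python's standard library. The 22050 Hz rate comes from the text above and the 16-bit sample width follows from the codec name; the mono channel count and the function name pcm_s16le_to_wav are assumptions made for illustration.

```python
import io
import wave

SAMPLE_RATE_HZ = 22050   # stated above for pcm_s16le
SAMPLE_WIDTH_BYTES = 2   # s16le = signed 16-bit little-endian samples
CHANNELS = 1             # assumption: the endpoint returns mono audio


def pcm_s16le_to_wav(pcm: bytes) -> bytes:
    """Wrap headerless PCM bytes in a WAV container (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    # One second of silence stands in for a real response body.
    silence = b"\x00\x00" * SAMPLE_RATE_HZ
    with open("speech.wav", "wb") as f:
        f.write(pcm_s16le_to_wav(silence))
```

If the audio plays back at the wrong speed or as noise, the channel count assumption is the first thing to check.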


Temperature defaults to 0.25.

  • Lower values make the audio more deterministic and stable.
  • Higher values make the audio more expressive and less deterministic.
  • With a high Temperature value, the audio differs on every request, but the probability of mispronunciation also increases.


By default, the endpoint returns per-sentence timestamps. Set the value to word to get per-word timestamps.

The timestamp feature is currently not supported via the /stream endpoint.
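To show how the parameters described so far fit together, here is a minimal sketch that builds a request body with the documented defaults. The default values and voice names come from the text above. The field names Text, VoiceId, Bitrate, Speed, Pitch, Codec, and TimestampType are assumptions modeled on the PascalCase style of Temperature and CallbackUrl, and build_payload is a hypothetical helper, not part of any official client.

```python
from typing import Any, Dict

# Default values as documented above; the key names are assumed.
DEFAULTS: Dict[str, Any] = {
    "Bitrate": "192k",
    "Speed": 0,
    "Pitch": 1,
    "Codec": "libmp3lame",
    "Temperature": 0.25,
}

VOICES = {"Dan", "Will", "Scarlett", "Liv", "Amy"}

# "TimestampType" is an assumed name for the timestamp setting.
ALLOWED_OVERRIDES = set(DEFAULTS) | {"TimestampType"}


def build_payload(text: str, voice: str, **overrides: Any) -> Dict[str, Any]:
    """Merge caller overrides onto the documented defaults (hypothetical helper)."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"Text": text, "VoiceId": voice, **DEFAULTS, **overrides}


if __name__ == "__main__":
    # Mirror the landing-page behaviour: a slightly lower pitch for a male voice.
    print(build_payload("Hello there.", "Dan", Pitch=0.92, TimestampType="word"))
```

Check the field names against the actual request schema before sending; only the values are grounded in this overview.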


If provided, the server sends a POST request with a JSON body to the CallbackUrl. A sample body looks like this:

   "TaskId": "8282b92d",  
   "TaskStatus": "completed", // or "failed"