Text-to-Speech Open Weights

Speech Arena Leaderboard

The text-to-speech APIs below are compared using third-party data from the Artificial Analysis leaderboard rankings (as of April 2026), production reliability metrics, and deployment flexibility. The comparison covers latency benchmarks, language coverage, and integration options.

Inworld AI TTS-1.5 Max ranks #1 with an ELO score of 1,208 based on thousands of blind user preference comparisons, with sub-250ms P90 latency.
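To make the ELO figures easier to read, here is a minimal sketch of how a rating gap maps to a preference probability and how one blind comparison shifts two ratings. It assumes the textbook Elo formulas with a 400-point scale and a hypothetical K-factor of 32; the leaderboard's actual rating procedure is not described here and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B (textbook Elo, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one blind comparison.

    k is a hypothetical K-factor chosen for illustration only.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the sum of the two ratings is preserved.
    return rating_a + delta, rating_b - delta


# Ratings taken from the open-weights table: 1,124 (rank #1) vs. 1,055 (rank #4).
p = expected_score(1124, 1055)
print(f"Expected preference rate for the higher-rated model: {p:.1%}")

new_a, new_b = update(1124, 1055, a_won=False)
print(f"After an upset: {new_a:.1f} vs. {new_b:.1f}")
```

Under these assumptions, a gap of a few dozen points corresponds to a modest preference edge rather than a decisive one, which is worth keeping in mind when comparing adjacent ranks.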

| Rank | Creator | Model | ELO | API Pricing |
|------|---------|-------|-----|-------------|
| #1 | Fish Audio | Fish Audio S2 Pro | 1,124 | $15 / 1M chars |
| #2 | StepFun | Step Audio EditX | 1,098 | N/A |
| #3 | NVIDIA | Magpie-Multilingual 357M | 1,063 | N/A |
| #4 | Kokoro | Kokoro 82M v1.0 | 1,055 | $0.7 / 1M chars |
| #5 | Mistral | Voxtral TTS | 1,053 | $16 / 1M chars |
| #6 | Maya Research | Maya1 | 1,049 | N/A |
| #7 | Fish Audio | Fish Audio 1.5 | 1,012 | $15 / 1M chars |
| #8 | Resemble AI | Chatterbox | 1,006 | $25 / 1M chars |
| #9 | Zyphra | Zonos-v0.1 | 1,000 | $20 / 1M chars |
| #10 | Microsoft | VibeVoice 7B | 957 | N/A |
| #11 | OpenVoice | OpenVoice v2 | 948 | $8.3 / 1M chars |
| #12 | Alibaba | Qwen3 TTS | 932 | N/A |
| #13 | Coqui | XTTS v2 | 885 | $40.4 / 1M chars |
| #14 | StyleTTS | StyleTTS 2 | 877 | $2.8 / 1M chars |
| #15 | MetaVoice | MetaVoice v1 | 764 | N/A |
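The per-million-character prices above can be turned into a job estimate with simple arithmetic. The sketch below is illustrative only: the 250,000-character workload is a made-up figure, the function name is hypothetical, and real bills may include minimums, tiers, or rounding that the table does not describe.

```python
def synthesis_cost_usd(num_chars: int, price_per_million_chars: float) -> float:
    """Estimated cost in USD for synthesizing num_chars characters."""
    return num_chars / 1_000_000 * price_per_million_chars


# Hypothetical workload: roughly a short audiobook chapter set.
workload_chars = 250_000

# Prices copied from the API Pricing column of the table.
prices = {
    "Kokoro 82M v1.0": 0.7,
    "Fish Audio S2 Pro": 15.0,
    "XTTS v2": 40.4,
}

for model, price in prices.items():
    cost = synthesis_cost_usd(workload_chars, price)
    print(f"{model}: ${cost:.2f}")
```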

Sub-200ms latency is now achievable with modern neural architectures, and zero-shot voice cloning from 3-15 seconds of reference audio has become a standard feature rather than a premium one.
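Latency claims like these are usually tail percentiles (such as the P90 quoted earlier) rather than averages. As a rough sketch of how one might check such a claim against one's own measurements, the code below computes a nearest-rank percentile; the sample latencies are invented for illustration, and benchmark providers may use a different percentile definition.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up time-to-first-audio measurements in milliseconds.
latencies_ms = [120, 135, 140, 150, 155, 160, 170, 180, 195, 400]

p90 = percentile_nearest_rank(latencies_ms, 90)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.1f} ms, P90 = {p90} ms")
```

In this invented sample a single slow request pulls the mean up while the P90 stays under 200 ms, which is why the two numbers should not be compared directly.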

Top 27 TTS AI Models by ELO

Top 45 TTS & Voice Generation Models from GitHub

| Model | Parameters | Languages | Features (Voice Cloning, Streaming, Pronounce, Emotion, ASR) and Other | License |
|-------|------------|-----------|------------------------------------------------------------------------|---------|
| Audio Flamingo | 7B | Multi-lingual | N/A; context up to 30 minutes | Apache |
| Chatterbox | 350M-500M | 23+ | | MIT |
| Dia | 1.6B | English | | Apache |
| FireRedTTS-2 | - | 7 (En, Zh, Jp, Ko, Fr, De, Ru) | ✅ (140ms); 4 speakers; 3 minutes | Apache |
| Fish Audio S2 Pro | 5B | 80+ (Tier 1: En, Zh, Jp) | ~10 GB (BF16) | License |
| Fish Speech | 4B | 8 (En, Jp, Ko, Zh, Fr, De, Ar, Es) | RTF ~1:7 | Apache |
| Fun-CosyVoice 3.0 | 0.5B | 9 + 18+ Chinese dialects | | Apache |
| GLM-TTS | - | Chinese, English | | Apache |
| IndexTTS2 | - | Chinese, English | 1–4 speakers | Apache |
| Irodori-TTS-500M-v2 | 500M | Japanese | 48kHz waveform | MIT |
| Kimi-Audio | 7B | Multiple | N/A | MIT & Apache |
| KittenTTS | 15M int8, 15M, 40M, 80M | English, Multiple | <25MB, no GPU | Apache |
| Kokoro-82M | 82M | 8 langs, 54 voices | <$0.06/hr audio | Apache |
| KokoClone | Base: Kokoro-ONNX | 7 (En, Hi, Fr, Ja, Zh, It, Pt, Es) | | Apache |
| KugelAudio | 7B | 23 EU langs | | MIT |
| LEMAS-TTS | 0.3B | 10 (Zh, En, Es, Ru, Fr, De, It, Pt, Id, Vi) | Word-level editing (LEMAS-Edit) | Apache |
| LFM2-Audio-1.5B | 1.5B | English | N/A | LFM Open |
| LongCat-AudioDiT | 1B / 3.5B | Chinese, English | Rate 24000 Hz | MIT |
| LuxTTS | - | | RTF 150×, 1GB VRAM | Apache |
| Maya1 | 3B | En (multi-accent) | | Apache |
| MegaTTS3 | 0.45B | Chinese, English | | Apache |
| MiMo-Audio | 7B | Multi-lingual | N/A; few-shot learner | Apache |
| MioTTS-2.6B | 2.6B | English, Japanese | RTF 0.135–0.145 | LFM Open |
| MOSS-TTS | 8B Delay, 1.7B Local | 20 langs | Max 1 hour | Apache |
| MOSS-TTS-Nano | 0.1B | 20 langs | 48 kHz Stereo | Apache |
| NeuTTS | 360M Air / 120M Nano | En/Es/De/Fr | GGUF on-device | Apache / NeuTTS |
| OmniVoice | - | 600+ langs | 581k hours | Apache |
| Orpheus-TTS | 3B | Multilingual | Llama-3b backbone | Apache |
| Qwen3-TTS | 0.6B–1.7B | 10 (Zh, En, Ja, Ko, De, Fr, Ru, Pt, Es, It) | | Apache |
| SoproTTS | 135M | English | RTF 0.05 CPU M3 | Apache |
| SoulX-Podcast | - | Mandarin, English, Cantonese, Sichuanese, Henanese | Max 90+ min | Apache |
| SoulX-Singer | - | Mandarin, English, Cantonese | Singing synthesis | Apache |
| Spark-TTS | 0.5B | Chinese, English | Qwen2.5 backbone | Apache |
| Step-Audio | 130B Chat / 3B TTS | Zh, En, Jp | | Apache |
| Step-Audio-EditX | 3B (4B BF16) | Mandarin, English, Sichuanese, Cantonese, Japanese, Korean | Audio editing | Apache |
| Supertonic 3 | 66M | 31 (Ar, Bg, Hr, Cs, Da, Nl, En, Et, Fi, Fr, De, El, Hi, Hu, Id, It, Ja, Ko, Lv, Lt, Pl, Pt, Ro, Ru, Sk, Sl, Es, Sv, Tr, Uk, Vi) | RTF 0.001, ONNX | OpenRAIL-M |
| T5Gemma-TTS | 2B-2B | English, Chinese, Japanese | 7.6-10.6 GB VRAM | MIT |
| TinyTTS | 1.6M | English | ~3.4 MB (ONNX FP16) | Apache |
| VibeVoice-Realtime | 0.5B | 50+ langs | Max ~10 min | MIT |
| VieNeu-TTS | 0.3B–0.6B | Vietnamese, English | | Apache |
| VoxCPM | 640M - 800M - 2B | 30 (Ar, My, Zh, Da, Nl, En, Fi, Fr, De, El, He, Hi, Id, It, Ja, Km, Ko, Lo, Ms, No, Pl, Pt, Ru, Es, Sw, Sv, Tl, Th, Tr, Vi) plus Chinese dialects (Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henanese, Shaanxi, Shandong, Tianjin, Hokkien) | Tokenizer-free | Apache |
| Voxtral-4B-TTS | 4B | 9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi) | 24 kHz | CC BY-NC 4.0 |
| ZipVoice | 123M | Chinese, English | Dialogue support | Apache |
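Several rows quote a real-time factor (RTF), and the notation is not uniform: values such as 0.05 and 0.135–0.145 read as synthesis time divided by audio duration, while "150×" and "~1:7" look like the inverse (a speedup over real time). The sketch below converts between the two under that assumption; the timing numbers are invented, and each project's own definition should be checked before comparing rows.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF as synthesis time / generated audio duration (below 1.0 means faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Inverse view of RTF: how many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Invented measurement: 0.5 s of compute to produce 10 s of audio.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}, about {speedup_over_real_time(rtf):.0f}x real time")
```

RTF also depends heavily on hardware and batch size, so figures measured on different devices are only loosely comparable.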