Text-to-Speech Open Weights
Speech Arena Leaderboard
The leaderboard below compares text-to-speech APIs using third-party data from the Artificial Analysis leaderboard rankings (as of April 2026), production reliability metrics, and deployment flexibility. It covers latency benchmarks, language coverage, and integration options.
Inworld AI TTS-1.5 Max ranks #1 with an ELO score of 1,208 based on thousands of blind user preference comparisons, with sub-250ms P90 latency.
| Rank | Creator | Model | ELO | API Pricing |
|---|---|---|---|---|
| #1 | Fish Audio | Fish Audio S2 Pro | 1,124 | $15 /1M chars |
| #2 | StepFun | Step Audio EditX | 1,098 | N/A |
| #3 | NVIDIA | Magpie-Multilingual 357M | 1,063 | N/A |
| #4 | Kokoro | Kokoro 82M v1.0 | 1,055 | $0.7 /1M chars |
| #5 | Mistral | Voxtral TTS | 1,053 | $16 /1M chars |
| #6 | Maya Research | Maya1 | 1,049 | N/A |
| #7 | Fish Audio | Fish Audio 1.5 | 1,012 | $15 /1M chars |
| #8 | Resemble AI | Chatterbox | 1,006 | $25 /1M chars |
| #9 | Zyphra | Zonos-v0.1 | 1,000 | $20 /1M chars |
| #10 | Microsoft | VibeVoice 7B | 957 | N/A |
| #11 | OpenVoice | OpenVoice v2 | 948 | $8.3 /1M chars |
| #12 | Alibaba | Qwen3 TTS | 932 | N/A |
| #13 | Coqui | XTTS v2 | 885 | $40.4 /1M chars |
| #14 | StyleTTS | StyleTTS 2 | 877 | $2.8 /1M chars |
| #15 | MetaVoice | MetaVoice v1 | 764 | N/A |
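
To help read the ELO column, a rating gap can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional Elo logistic formula with a 400-point scale; the leaderboard's exact rating method is not stated here and may differ, and `elo_expected_score` is a hypothetical helper name used only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of blind comparisons won by A over B.

    Assumption: standard Elo logistic curve with a 400-point scale.
    The leaderboard may use a different fitting procedure.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings taken from the table above.
    top_two = elo_expected_score(1124, 1098)    # Fish Audio S2 Pro vs Step Audio EditX
    top_bottom = elo_expected_score(1124, 764)  # Fish Audio S2 Pro vs MetaVoice v1
    print(f"#1 vs #2:  {top_two:.3f}")
    print(f"#1 vs #15: {top_bottom:.3f}")
```

Under this assumption, the 26-point gap between #1 and #2 corresponds to roughly a 54% expected preference rate, while the 360-point gap between #1 and #15 corresponds to roughly 89%. Small rating differences near the top of the table therefore imply close to even preferences.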
Sub-200ms latency is now achievable with modern neural architectures, and zero-shot voice cloning from 3-15 seconds of audio has become a standard feature rather than a premium one.
Top 45 TTS & Voice Generation Models from GitHub
| Model | Parameters | Languages | Voice Cloning | Streaming | Pronounce | Emotion | ASR | Other | License |
|---|---|---|---|---|---|---|---|---|---|
| Audio Flamingo | 7B | Multi-lingual | ❌ | ✅ | N/A | ✅ | ✅ | Context up to 30 minutes | Apache |
| Chatterbox | 350M-500M | 23+ | ✅ | ✅ | ✅ | ✅ | ❌ | | MIT |
| Dia | 1.6B | English | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| FireRedTTS-2 | _ | 7 (En, Zh, Jp, Ko, Fr, De, Ru) | ✅ | ✅ (140ms) | ✅ | ✅ | ✅ | 4 speakers; 3 minutes | Apache |
| Fish Audio S2 Pro | 5B | 80+ (Tier 1: En, Zh, Jp) | ✅ | ✅ | ✅ | ✅ | ❌ | ~10 GB (BF16) | License |
| Fish Speech | 4B | 8 (En, Jp, Ko, Zh, Fr, De, Ar, Es) | ✅ | ✅ | ✅ | ✅ | ❌ | RTF ~1:7 | Apache |
| Fun-CosyVoice 3.0 | 0.5B | 9 + 18+ Chinese dialects | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| GLM-TTS | _ | Chinese, English | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| IndexTTS2 | _ | Chinese, English | ✅ | ✅ | ✅ | ✅ | ❌ | 1–4 speakers | Apache |
| Irodori-TTS-500M-v2 | 500M | Japanese | ✅ | ❌ | ❌ | ✅ | ❌ | 48kHz waveform | MIT |
| Kimi-Audio | 7B | Multiple | ✅ | ✅ | N/A | ✅ | ✅ | | MIT & Apache |
| KittenTTS | 15M (int8) / 15M / 40M / 80M | English, Multiple | ✅ | ✅ | ❌ | ✅ | ❌ | <25MB, no GPU | Apache |
| Kokoro-82M | 82M | 8 langs, 54 voices | ✅ | ✅ | ✅ | ✅ | ❌ | <$0.06/hr audio | Apache |
| KokoClone | Base: Kokoro-ONNX | 8 (En, Hi, Fr, Ja, Zh, It, Pt, Es) | ✅ | ✅ | ❌ | ✅ | ❌ | | Apache |
| KugelAudio | 7B | 23 EU langs | ✅ | ✅ | ✅ | ✅ | ❌ | | MIT |
| LEMAS-TTS | 0.3B | 10 (Zh, En, Es, Ru, Fr, De, It, Pt, Id, Vi) | ✅ | ❌ | ✅ | ✅ | ❌ | Word-level editing (LEMAS-Edit) | Apache |
| LFM2-Audio-1.5B | 1.5B | English | ✅ | ✅ | N/A | ✅ | ✅ | | LFM Open |
| LongCat-AudioDiT | 1B / 3.5B | Chinese, English | ✅ | ❌ | ❌ | ❌ | ❌ | Rate 24000 Hz | MIT |
| LuxTTS | _ | — | ✅ | ✅ | ❌ | ❌ | ❌ | RTF 150×, 1GB VRAM | Apache |
| Maya1 | 3B | En (multi-accent) | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| MegaTTS3 | 0.45B | Chinese, English | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| MiMo-Audio | 7B | Multi-lingual | ✅ | ✅ | N/A | ✅ | ✅ | Few-shot learner | Apache |
| MioTTS-2.6B | 2.6B | English, Japanese | ✅ | ✅ | ❌ | ❌ | ❌ | RTF 0.135–0.145 | LFM Open |
| MOSS-TTS | 8B Delay, 1.7B Local | 20 langs | ✅ | ✅ | ✅ | ✅ | ❌ | Max 1 hour | Apache |
| MOSS-TTS-Nano | 0.1B | 20 langs | ✅ | ✅ | ❌ | ❌ | ❌ | 48 kHz Stereo | Apache |
| NeuTTS | 360M Air / 120M Nano | En/Es/De/Fr | ✅ | ✅ | ❌ | ❌ | ❌ | GGUF on-device | Apache / NeuTTS |
| OmniVoice | _ | 600+ langs | ✅ | ❌ | ✅ | ✅ | ❌ | 581k hours | Apache |
| Orpheus-TTS | 3B | Multilingual | ✅ | ✅ | ✅ | ✅ | ❌ | Llama-3b backbone | Apache |
| Qwen3-TTS | 0.6B–1.7B | 10 (Zh, En, Ja, Ko, De, Fr, Ru, Pt, Es, It) | ✅ | ✅ | ✅ | ✅ | ❌ | | Apache |
| SoproTTS | 135M | English | ✅ | ✅ | ❌ | ✅ | ❌ | RTF 0.05 CPU M3 | Apache |
| SoulX-Podcast | _ | Mandarin, English, Cantonese, Sichuanese, Henanese | ✅ | ✅ | ✅ | ✅ | ❌ | Max 90+ min | Apache |
| SoulX-Singer | _ | Mandarin, English, Cantonese | ✅ | ✅ | ✅ | ✅ | ❌ | Singing synthesis | Apache |
| Spark-TTS | 0.5B | Chinese, English | ✅ | ✅ | ✅ | ✅ | ❌ | Qwen2.5 backbone | Apache |
| Step-Audio | 130B Chat / 3B TTS | Zh, En, Jp | ✅ | ✅ | ✅ | ✅ | ✅ | | Apache |
| Step-Audio-EditX | 3B (4B BF16) | Mandarin, English, Sichuanese, Cantonese, Japanese, Korean | ✅ | ✅ | ✅ | ✅ | ❌ | Audio editing | Apache |
| Supertonic 3 | 66M | 31 (Ar, Bg, Hr, Cs, Da, Nl, En, Et, Fi, Fr, De, El, Hi, Hu, Id, It, Ja, Ko, Lv, Lt, Pl, Pt, Ro, Ru, Sk, Sl, Es, Sv, Tr, Uk, Vi) | ❌ | ✅ | ❌ | ❌ | ❌ | RTF 0.001, ONNX | OpenRAIL-M |
| T5Gemma-TTS | 2B-2B | English, Chinese, Japanese | ✅ | ❌ | ✅ | ❌ | ❌ | 7.6-10.6 GB VRAM | MIT |
| TinyTTS | 1.6M | English | ❌ | ✅ | ✅ | ❌ | ❌ | ~3.4 MB (ONNX FP16) | Apache |
| VibeVoice-Realtime | 0.5B | 50+ langs | ✅ | ✅ | ✅ | ✅ | ❌ | Max ~10 min | MIT |
| VieNeu-TTS | 0.3B–0.6B | Vietnamese, English | ✅ | ✅ | ✅ | ❌ | ❌ | | Apache |
| VoxCPM | 640M - 800M - 2B | 30 (Ar, My, Zh, Da, Nl, En, Fi, Fr, De, El, He, Hi, Id, It, Ja, Km, Ko, Lo, Ms, No, Pl, Pt, Ru, Es, Sw, Sv, Tl, Th, Tr, Vi) + 9 Chinese dialects (Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henanese, Shaanxi, Shandong, Tianjin, Southern Min) | ✅ | ✅ | ✅ | ✅ | ❌ | Tokenizer-free | Apache |
| Voxtral-4B-TTS | 4B | 9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi) | ✅ | ✅ | ❌ | ✅ | ❌ | 24 kHz | CC BY-NC 4.0 |
| ZipVoice | 123M | Chinese, English | ✅ | ✅ | ❌ | ❌ | ❌ | Dialogue support | Apache |
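
The "Other" column reports real-time factor (RTF) in several notations: a plain ratio (`0.05`), a speed multiple (`150×`), and a compute-to-audio ratio (`~1:7`). The sketch below normalizes these to one convention, synthesis seconds per second of audio, so that lower means faster. The interpretation of each notation is an assumption, not something the table states, and `parse_rtf` is a hypothetical helper.

```python
import re


def parse_rtf(text: str) -> float:
    """Normalize an RTF string to synthesis_seconds / audio_seconds.

    Assumed readings (not confirmed by the source table):
      "0.05"  -> already synthesis time / audio time
      "150×"  -> 150 times faster than real time, so 1/150
      "~1:7"  -> 1 second of compute per 7 seconds of audio, so 1/7
    Raises ValueError for anything else, such as a range.
    """
    # Drop an optional "RTF" prefix and a leading approximation mark.
    s = re.sub(r"^RTF\s*", "", text.strip(), flags=re.IGNORECASE)
    s = s.lstrip("~").strip()

    # Speed multiple, e.g. "150×" or "150x".
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*[x×]", s)
    if m:
        return 1.0 / float(m.group(1))

    # Compute-to-audio ratio, e.g. "1:7".
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*:\s*(\d+(?:\.\d+)?)", s)
    if m:
        return float(m.group(1)) / float(m.group(2))

    # Plain ratio, e.g. "0.05".
    return float(s)


if __name__ == "__main__":
    for raw in ["RTF 0.05", "150×", "RTF ~1:7"]:
        print(f"{raw!r:>12} -> {parse_rtf(raw):.4f}")
```

Under these assumptions, `150×` becomes about 0.0067 and `~1:7` becomes about 0.143. The figures remain hard to compare directly because each was presumably measured on different hardware.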