Text-to-Speech Open Weights

Speech Arena Leaderboard

The text-to-speech models below are compared using third-party data from the Artificial Analysis leaderboard rankings (as of April 2026), together with production reliability metrics and deployment flexibility. The comparison covers latency benchmarks, language coverage, and integration options.

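Inworld AI TTS-1.5 Max ranks #1 on the leaderboard with an ELO score of 1,208, based on thousands of blind user preference comparisons, and delivers sub-250ms P90 latency. It does not appear in the table below, whose top entry is Fish Audio S2 Pro at 1,124.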

Rank	Creator	Model	ELO	API Pricing
#1	Fish Audio	Fish Audio S2 Pro	1,124	$15 /1M chars
#2	StepFun	Step Audio EditX	1,098	N/A
#3	NVIDIA	Magpie-Multilingual 357M	1,063	N/A
#4	Kokoro	Kokoro 82M v1.0	1,055	$0.7 /1M chars
#5	Mistral	Voxtral TTS	1,053	$16 /1M chars
#6	Maya Research	Maya1	1,049	N/A
#7	Fish Audio	Fish Audio 1.5	1,012	$15 /1M chars
#8	Resemble AI	Chatterbox	1,006	$25 /1M chars
#9	Zyphra	Zonos-v0.1	1,000	$20 /1M chars
#10	Microsoft	VibeVoice 7B	957	N/A
#11	OpenVoice	OpenVoice v2	948	$8.3 /1M chars
#12	Alibaba	Qwen3 TTS	932	N/A
#13	Coqui	XTTS v2	885	$40.4 /1M chars
#14	StyleTTS	StyleTTS 2	877	$2.8 /1M chars
#15	MetaVoice	MetaVoice v1	764	N/A
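
To make the ELO gaps in the table easier to read, the sketch below converts a rating difference into an expected preference rate. It assumes the standard Elo logistic formula with a 400-point scale; the leaderboard may fit its ratings differently, so treat the numbers as a rough guide. The function name `elo_expected_score` is hypothetical and the ratings are copied from the table above.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of blind comparisons in which A is preferred over B.

    Assumption: the standard Elo logistic model with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings copied from the table above.
FISH_AUDIO_S2_PRO = 1124
STEP_AUDIO_EDITX = 1098
METAVOICE_V1 = 764

# A 26-point gap is close to a coin flip; a 360-point gap is not.
p_close = elo_expected_score(FISH_AUDIO_S2_PRO, STEP_AUDIO_EDITX)
p_far = elo_expected_score(FISH_AUDIO_S2_PRO, METAVOICE_V1)

print(f"S2 Pro vs Step Audio EditX: {p_close:.1%} expected preference")
print(f"S2 Pro vs MetaVoice v1:     {p_far:.1%} expected preference")
```

Under this model the gap between #1 and #2 corresponds to roughly a 54% preference rate, while the gap between #1 and #15 corresponds to roughly 89%.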

Sub-200ms latency is now achievable with modern neural architectures, and zero-shot voice cloning from 3-15 seconds of audio has become a standard feature rather than a premium one.
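
Latency claims such as "sub-200ms" or "P90" are only comparable when the percentile is computed the same way. The following is a minimal sketch of a nearest-rank percentile over measured request latencies. The helper name `percentile_nearest_rank` and the sample values are hypothetical and are not benchmark results for any model listed here.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-audio measurements in milliseconds.
latencies_ms = [142, 155, 161, 170, 178, 183, 190, 196, 240, 410]

p50 = percentile_nearest_rank(latencies_ms, 50)
p90 = percentile_nearest_rank(latencies_ms, 90)
print(f"P50 = {p50} ms, P90 = {p90} ms")
```

In this made-up sample the median is under 200 ms while the P90 is not, which is why a P90 figure is a stricter claim than an average or a median.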

Top 27 TTS AI Models by ELO

Top 45 TTS & Voice Generation Models from GitHub

Model	Parameters	Languages	Voice Cloning	Streaming	Pronunciation	Emotion	ASR	Other	License
Audio Flamingo	7B	Multilingual	❌	✅	N/A	✅	✅	Context up to 30 minutes	Apache
Chatterbox	350M-500M	23+	✅	✅	✅	✅	❌		MIT
Dia	1.6B	English	✅	✅	✅	✅	❌		Apache
FireRedTTS-2	_	7 (En, Zh, Jp, Ko, Fr, De, Ru)	✅	✅ (140ms)	✅	✅	✅	4 speakers; 3 minutes	Apache
Fish Audio S2 Pro	5B	80+ (Tier 1: En, Zh, Jp)	✅	✅	✅	✅	❌	~10 GB (BF16)	License
Fish Speech	4B	8 (En, Jp, Ko, Zh, Fr, De, Ar, Es)	✅	✅	✅	✅	❌	RTF ~1:7	Apache
Fun-CosyVoice 3.0	0.5B	9 + 18+ Chinese dialects	✅	✅	✅	✅	❌		Apache
GLM-TTS	_	Chinese, English	✅	✅	✅	✅	❌		Apache
IndexTTS2	_	Chinese, English	✅	✅	✅	✅	❌	1–4 speakers	Apache
Irodori-TTS-500M-v2	500M	Japanese	✅	❌	❌	✅	❌	48kHz waveform	MIT
Kimi-Audio	7B	Multiple	✅	✅	N/A	✅	✅		MIT & Apache
KittenTTS	15M (int8), 15M, 40M, 80M	English, Multiple	✅	✅	❌	✅	❌	<25MB, no GPU	Apache
Kokoro-82M	82M	8 langs, 54 voices	✅	✅	✅	✅	❌	<$0.06/hr audio	Apache
KokoClone	Base: Kokoro-ONNX	8 (En, Hi, Fr, Ja, Zh, It, Pt, Es)	✅	✅	❌	✅	❌		Apache
KugelAudio	7B	23 EU langs	✅	✅	✅	✅	❌		MIT
LEMAS-TTS	0.3B	10 (Zh, En, Es, Ru, Fr, De, It, Pt, Id, Vi)	✅	❌	✅	✅	❌	Word-level editing (LEMAS-Edit)	Apache
LFM2-Audio-1.5B	1.5B	English	✅	✅	N/A	✅	✅		LFM Open
LongCat-AudioDiT	1B / 3.5B	Chinese, English	✅	❌	❌	❌	❌	Rate 24000 Hz	MIT
LuxTTS	_	—	✅	✅	❌	❌	❌	RTF 150×, 1GB VRAM	Apache
Maya1	3B	En (multi-accent)	✅	✅	✅	✅	❌		Apache
MegaTTS3	0.45B	Chinese, English	✅	✅	✅	✅	❌		Apache
MiMo-Audio	7B	Multilingual	✅	✅	N/A	✅	✅	Few-shot learner	Apache
MioTTS-2.6B	2.6B	English, Japanese	✅	✅	❌	❌	❌	RTF 0.135–0.145	LFM Open
MOSS-TTS	8B Delay, 1.7B Local	20 langs	✅	✅	✅	✅	❌	Max 1 hour	Apache
MOSS-TTS-Nano	0.1B	20 langs	✅	✅	❌	❌	❌	48 kHz Stereo	Apache
NeuTTS	360M Air / 120M Nano	En/Es/De/Fr	✅	✅	❌	❌	❌	GGUF on-device	Apache / NeuTTS
OmniVoice	_	600+ langs	✅	❌	✅	✅	❌	581k hours	Apache
Orpheus-TTS	3B	Multilingual	✅	✅	✅	✅	❌	Llama-3b backbone	Apache
Qwen3-TTS	0.6B–1.7B	10 (Zh, En, Ja, Ko, De, Fr, Ru, Pt, Es, It)	✅	✅	✅	✅	❌		Apache
SoproTTS	135M	English	✅	✅	❌	✅	❌	RTF 0.05 CPU M3	Apache
SoulX-Podcast	_	Mandarin, English, Cantonese, Sichuanese, Henanese	✅	✅	✅	✅	❌	Max 90+ min	Apache
SoulX-Singer	_	Mandarin, English, Cantonese	✅	✅	✅	✅	❌	Singing synthesis	Apache
Spark-TTS	0.5B	Chinese, English	✅	✅	✅	✅	❌	Qwen2.5 backbone	Apache
Step-Audio	130B Chat / 3B TTS	Zh, En, Jp	✅	✅	✅	✅	✅		Apache
Step-Audio-EditX	3B (4B BF16)	Mandarin, English, Sichuanese, Cantonese, Japanese, Korean	✅	✅	✅	✅	❌	Audio editing	Apache
Supertonic 3	66M	31 (Ar, Bg, Hr, Cs, Da, Nl, En, Et, Fi, Fr, De, El, Hi, Hu, Id, It, Ja, Ko, Lv, Lt, Pl, Pt, Ro, Ru, Sk, Sl, Es, Sv, Tr, Uk, Vi)	❌	✅	❌	❌	❌	RTF 0.001, ONNX	OpenRAIL-M
T5Gemma-TTS	2B-2B	English, Chinese, Japanese	✅	❌	✅	❌	❌	7.6-10.6 GB VRAM	MIT
TinyTTS	1.6M	English	❌	✅	✅	❌	❌	~3.4 MB (ONNX FP16)	Apache
VibeVoice-Realtime	0.5B	50+ langs	✅	✅	✅	✅	❌	Max ~10 min	MIT
VieNeu-TTS	0.3B–0.6B	Vietnamese, English	✅	✅	✅	❌	❌		Apache
VoxCPM	640M - 800M - 2B	30 (Ar, My, Zh, Da, Nl, En, Fi, Fr, De, El, He, Hi, Id, It, Ja, Km, Ko, Lo, Ms, No, Pl, Pt, Ru, Es, Sw, Sv, Tl, Th, Tr, Vi) + 9 Chinese dialects (Sichuanese, Cantonese, Wu, Northeastern, Henanese, Shaanxi, Shandong, Tianjin, Hokkien)	✅	✅	✅	✅	❌	Tokenizer-free	Apache
Voxtral-4B-TTS	4B	9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi)	✅	✅	❌	✅	❌	24 kHz	CC BY-NC 4.0
ZipVoice	123M	Chinese, English	✅	✅	❌	❌	❌	Dialogue support	Apache
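
Several rows in the "Other" column quote a real-time factor (RTF) in different notations, for example "RTF 0.05", "RTF 0.135–0.145", and "RTF 150×". The sketch below assumes the common definition of RTF as synthesis time divided by audio duration, so values below 1 mean faster than real time. It also assumes that a figure written as "150×" is the inverse (a speedup), which the table itself does not confirm. The function names are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_from_rtf(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def rtf_from_speedup(speedup: float) -> float:
    """Convert a figure such as '150x' back to an RTF (assumed inverse convention)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Example: 10 seconds of audio generated in 0.5 seconds gives RTF 0.05.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=10.0)
print(f"RTF {rtf:.3f} is about {speedup_from_rtf(rtf):.0f}x real time")

# Example: a '150x' claim read as a speedup corresponds to an RTF near 0.0067.
print(f"150x real time corresponds to RTF {rtf_from_speedup(150):.4f}")
```

Normalizing every figure to one convention before comparing rows avoids mixing numbers where smaller is better with numbers where larger is better. Hardware also matters, since the table quotes some figures for CPU and others without stating the device.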

Kokoro 82M v1.0 Miniconda

Chatterbox