Exploring the World of Open-Source Text-to-Speech Models

September 19, 2024 • Written By Sherlock Xu

The demand for text-to-speech (TTS) technology has skyrocketed over the past year, thanks to its wide-ranging applications across industries such as accessibility, education, and virtual assistants. Just like advancements in LLMs and image generation models, TTS models have evolved to generate more realistic, human-like speech from text input.

If you're looking to integrate TTS into your system, open-source models are a fantastic option. They offer greater flexibility, control, and customization compared to proprietary alternatives. In this post, we’ll explore some of the most popular open-source TTS models available today. We'll dive into their strengths and weaknesses, helping you choose the best model for your needs. Finally, we'll provide answers to some frequently asked questions.

XTTS-v2

XTTS is one of the most popular voice generation models. Its latest version, XTTS-v2, can clone voices into different languages from just a 6-second audio sample. This efficiency eliminates the need for extensive training data, making it an attractive solution for voice cloning and multilingual speech generation.

The bad news is that Coqui, the company behind XTTS, shut down in early 2024, leaving the project to the open-source community. However, the source code remains available on GitHub, and XTTS-v2 continues to be one of the most downloaded TTS models on Hugging Face.

Key features:

  • Voice cloning with minimal input: XTTS-v2 allows you to clone voices across multiple languages using only a 6-second audio clip, greatly simplifying the voice cloning process.
  • Multi-language support: The model supports 17 languages, making it ideal for global, multilingual applications.
  • Emotion and style transfer: XTTS-v2 can replicate not only the voice but also the emotional tone and speaking style, resulting in more realistic and expressive speech synthesis.
  • Low-latency performance: The model can achieve less than 150ms streaming latency with a pure PyTorch implementation on a consumer-grade GPU.

Points to be cautious about:

  • Non-commercial use only: XTTS-v2 is licensed under the Coqui Public Model License, which restricts its use to non-commercial purposes. This limits its application in commercial products unless specific licensing terms are negotiated.
  • Project shutdown: With Coqui gone, the model's future development relies entirely on the open-source community.
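
For a sense of the workflow, here's a minimal sketch using the Coqui TTS Python package (`pip install TTS`). The reference clip path `speaker.wav` is a placeholder for a short sample of the target voice; the model ID is the one published for XTTS-v2.

```python
# Minimal XTTS-v2 voice cloning sketch with the Coqui TTS package.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download and load XTTS-v2.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice in speaker.wav and speak the text in French.
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="speaker.wav",  # ~6-second reference clip of the target voice
    language="fr",
    file_path="output.wav",
)
```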

ChatTTS

ChatTTS is a voice generation model designed for conversational applications, particularly for dialogue tasks in LLM assistants. It’s also ideal for conversational audio, video introductions, and other interactive tasks. Trained on approximately 100,000 hours of Chinese and English data, ChatTTS is capable of producing natural and high-quality speech in both languages.

Key features:

  • High-quality synthesis: With extensive training, it delivers natural, fluid speech with clear articulation.
  • Specialized for dialogues: ChatTTS is optimized for conversational tasks, making it an excellent choice for LLM-based assistants and dialogue systems.
  • Token-level control: It offers limited but useful token-based control over elements like laughter and pauses, allowing some flexibility in dialogue delivery.

Points to be cautious about:

  • Limited language support: Compared with other TTS models, ChatTTS currently supports only English and Chinese, which may restrict its use for applications in other languages.
  • Limited emotional control: At present, the model only supports basic token-level controls like laughter and breaks. More nuanced emotional controls are expected in future versions but are currently unavailable.
  • Stability issues: ChatTTS can sometimes run into stability issues, such as generating multi-speaker outputs or producing inconsistent audio quality. These issues are common with autoregressive models, and you may need to generate multiple samples to get the desired result.
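
If you want to try it, here's a rough sketch of basic usage with the ChatTTS Python package (assuming `pip install ChatTTS` plus `torch` and `torchaudio`; loading details can vary slightly between releases):

```python
# Basic ChatTTS sketch: generate English speech and save it to disk.
import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load()  # downloads and loads the pretrained models

texts = ["Hello, welcome to the show. It is great to have you here."]
wavs = chat.infer(texts)  # list of numpy waveforms, one per input text

# ChatTTS outputs 24 kHz audio; depending on the release, the waveform
# may or may not need the extra channel dimension added here.
torchaudio.save("chat_output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
```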

MeloTTS

MeloTTS is a high-quality, multilingual TTS library developed by MyShell.ai. It supports a wide range of languages and accents, including several English dialects (American, British, Indian, and Australian). MeloTTS is optimized for real-time inference, even on CPUs.

Currently, its English variant (MeloTTS-English) is the most downloaded TTS model on Hugging Face.

Key features:

  • Multilingual support: MeloTTS offers a broad range of languages and accents. A key highlight is the ability of the Chinese speaker to handle mixed Chinese and English speech. This makes the model particularly useful in scenarios where both languages are needed, such as in international business or multilingual media content.
  • Real-time inference: It’s optimized for fast performance, even on CPUs, making it suitable for applications requiring low-latency responses.
  • Free for commercial use: Licensed under the MIT License, MeloTTS is available for both commercial and non-commercial usage.

Points to be cautious about:

  • No voice cloning: MeloTTS does not support voice cloning, which could be a limitation for applications that require personalized voice replication.
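
Here's a minimal sketch with the `melo` Python API (assuming MeloTTS is installed per its README); `"EN-US"` is one of the model's built-in English accent keys:

```python
# Minimal MeloTTS sketch: American English accent on CPU.
from melo.api import TTS

model = TTS(language="EN", device="cpu")  # CPU is enough for real-time inference
speaker_ids = model.hps.data.spk2id      # accent keys, e.g. EN-US, EN-BR, EN-AU

model.tts_to_file(
    "MeloTTS runs in real time, even on a CPU.",
    speaker_ids["EN-US"],
    "melo_output.wav",
    speed=1.0,
)
```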

OpenVoice v2

Also developed by MyShell.ai, OpenVoice v2 is an instant voice cloning model that replicates a speaker's voice from just a short audio clip. It supports speech generation in multiple languages, providing granular control over various voice attributes like emotion, accent, rhythm, pauses, and intonation.

Key features:

  • Accurate tone color cloning: OpenVoice v2 accurately replicates the reference speaker's tone color, allowing the cloned voice to be used across multiple languages and accents.
  • Flexible voice style control: Users can control granular details of the speech output, such as emotion, accent, rhythm, pauses, and intonation, offering more customization than many other TTS models.
  • Zero-shot cross-lingual voice cloning: The model can clone a voice in a language that isn't present in the reference speech or the training data. In other words, the provided sample speech audio for OpenVoice v2 can be in any language.
  • Free for commercial use: Licensed under the MIT License, OpenVoice v2 is available for both commercial and non-commercial projects.

Points to be cautious about:

  • More involved setup: Unlike single-call TTS libraries, OpenVoice v2 runs as a two-stage pipeline. A base TTS model (MeloTTS) first generates the speech, and a separate tone color converter then applies the cloned voice, which adds setup and dependency overhead compared with simpler models.
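
Here's a sketch of that two-stage flow, following the layout of the v2 demo in the OpenVoice repo (the checkpoint paths and `reference.mp3` are assumptions; the v2 checkpoints are downloaded separately per the README):

```python
# OpenVoice v2 sketch: MeloTTS generates base speech, then the tone
# color converter re-voices it to match a short reference clip.
import torch
from melo.api import TTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"
ckpt_converter = "checkpoints_v2/converter"  # assumed checkpoint location

converter = ToneColorConverter(f"{ckpt_converter}/config.json", device=device)
converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

# Extract the target speaker's tone color embedding from a short clip.
target_se, _ = se_extractor.get_se("reference.mp3", converter, vad=False)

# Stage 1: generate base speech with MeloTTS.
model = TTS(language="EN", device=device)
speaker_id = model.hps.data.spk2id["EN-US"]
model.tts_to_file("This voice was cloned from a short clip.", speaker_id, "tmp.wav")

# Stage 2: convert the base speaker's tone color to the target's.
source_se = torch.load("checkpoints_v2/base_speakers/ses/en-us.pth", map_location=device)
converter.convert(
    audio_src_path="tmp.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned_output.wav",
)
```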

Parler-TTS

Parler-TTS is a collection of lightweight TTS models developed by Hugging Face, designed to generate high-quality, natural-sounding speech. It allows users to control various speech features, such as gender, pitch, speaking style, and even background noise. As a fully open-source release, Parler-TTS ships all of its training code, datasets, and model weights publicly under a permissive license, enabling the community to build and customize their own TTS models.

Key features:

  • Voice style control: Parler-TTS provides granular control over speech characteristics such as emotion, speaking rate, pitch, and reverberation using simple text prompts.
  • Speaker cloning: The models can replicate the style of 34 pre-defined speakers, making it useful for applications requiring consistent speaker identities.
  • Optimized for efficiency: Parler-TTS supports fast generation techniques, including SDPA and Flash Attention 2, making it computationally efficient. SDPA is used by default and speeds up generation time by up to 1.4x compared with eager attention.

Points to be cautious about:

  • Model size: Parler-TTS is available in two versions — Mini (880M parameters) and Large (2.3B parameters). The Mini version is a lightweight model ideal for quick and efficient speech generation. However, if you need more expressiveness and control over finer details of speech, the Large version provides more advanced capabilities, though it requires greater computational resources.
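
Here's a sketch of prompt-based style control with the Mini checkpoint, following the usage pattern in the Parler-TTS README (the description string is free-form and illustrative):

```python
# Parler-TTS Mini: describe the voice in plain text, then generate.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo = "parler-tts/parler-tts-mini-v1"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

prompt = "Hey, how are you doing today?"
description = (
    "A female speaker delivers a slightly expressive and animated speech "
    "with a moderate speed and pitch. The recording is very high quality."
)

# The description steers the voice style; the prompt is the text to speak.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```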

Now that we’ve explored some of the top open-source TTS models and their features, you might still have questions about how these models perform, how to deploy them, and best practices. To help, we’ve compiled a list of FAQs covering the key considerations when working with TTS models.

Any benchmarks for TTS models? And how much should I trust them?

While LLMs have well-established benchmarks that offer insights into their performance across different tasks, the same cannot be said for TTS models. Evaluating their quality is inherently more challenging due to the subjective nature of human speech perception. Metrics like Word Error Rate (WER) often fail to capture the nuances of naturalness, inflection, and emotional tone in speech.
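
To see why, note that WER only compares transcribed words. In the sketch below (using the `jiwer` package), a flat, robotic rendition and an expressive one could both transcribe to the same words and score a perfect WER of 0:

```python
# WER compares word sequences only; prosody, inflection, and emotion
# are invisible to it.
from jiwer import wer

reference = "are you really sure about that"
# An ASR transcript of a monotone rendition and of an expressive one
# may be identical, so both score 0.0 even if one sounds robotic.
hypothesis = "are you really sure about that"

print(wer(reference, hypothesis))  # 0.0
```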

I suggest treating TTS benchmarks with caution. While they provide a rough overview of performance, they may not fully reflect how a model will perform in real-world scenarios. If you're interested in exploring TTS model rankings, check out the TTS Arena leaderboard, curated by the TTS AGI community on Hugging Face. The leaderboard ranks models by how natural they sound, based on community votes.

What should I consider when deploying TTS models?

When deploying TTS models, key considerations include speech quality, inference latency, hardware requirements, and licensing. The questions below walk through several of these in more detail.

Text-to-Speech vs. Text-to-Audio. Which one should I choose?

While "text-to-speech" and "text-to-audio" may seem interchangeable, they refer to slightly different concepts depending on your use case.

  • TTS focuses on converting written text into spoken words that sound as close to human speech as possible. It is typically used for applications like virtual assistants, accessibility tools, audiobooks, and voice interfaces. The goal is to generate speech that feels natural and conversational.
  • TTA is broader and can refer to any conversion of text into an audio format, not necessarily human speech. It may include sound effects, alerts, or any type of non-verbal audio cues based on the textual input.

If you need human-like speech output, a TTS model is what you're looking for. On the other hand, if your focus is simply generating any form of audio from text, including sound effects or alerts, you may be considering text-to-audio solutions. Some popular open-source text-to-audio models include Stable Audio Open 1.0, Tango, Bark (which also functions as a TTS model), and MusicGen (often referred to as a "text-to-music" model).

What should I consider regarding speech quality?

When evaluating the speech quality of a TTS model, there are several key factors to consider to ensure the output meets your application's needs:

Naturalness and intelligibility

  • One of the most important aspects of any TTS model is how natural and human-like the generated speech sounds. Listen for smooth transitions between words, appropriate pauses, and minimal robotic or synthetic artifacts.
  • Intelligibility is equally important. Ensure that the speech is clear and easy to understand, even with complex or lengthy text inputs.

Multilingual and accent support

  • If your application is multilingual, test the model’s ability to generate high-quality speech across different languages, accents, and dialects. Some models, like MeloTTS mentioned above, are known for handling a broad range of languages, while others may specialize in fewer languages.
  • Be sure to test how well the model adapts to accents within the same language, especially for global applications requiring regional variations in speech.

Prosody and intonation

  • Prosody refers to the rhythm, stress, and intonation of speech, which play a critical role in making the generated speech sound natural. A good TTS model should replicate human-like prosody to avoid sounding monotonous or unnatural.
  • Intonation should vary naturally, reflecting questions, statements, and exclamations appropriately.

Emotional expression

  • For more advanced applications, consider a model's ability to convey different emotions in speech. Some models, such as OpenVoice v2, support granular control over emotional expression, which can be critical in customer service, virtual assistants, or entertainment applications.

Final thoughts

TTS technology has come a long way, with open-source models now offering high-quality, natural-sounding speech generation across multiple languages and applications. Whether you're looking for basic TTS functionality, voice cloning, or advanced control over speech styles and emotions, there's a wide range of models to choose from.

Ultimately, the right TTS model for you will depend on your specific use case, whether it’s for virtual assistants, multilingual applications, or interactive media. By carefully evaluating features and performance, you can harness the power of TTS to transform the way you interact with text and speech.

Check out the following resources to learn more: