Exploring the World of Open-Source Text-to-Speech Models

September 19, 2024 • Written By Sherlock Xu

The demand for text-to-speech (TTS) technology has skyrocketed over the past year, thanks to its wide-ranging applications across industries such as accessibility, education, and virtual assistants. Just like advancements in LLMs and image generation models, TTS models have evolved to generate more realistic, human-like speech from text input.

If you're looking to integrate TTS into your system, open-source models are a fantastic option. They offer greater flexibility, control, and customization compared to proprietary alternatives. In this post, we’ll explore some of the most popular open-source TTS models available today. We'll dive into their strengths and weaknesses, helping you choose the best model for your needs. Finally, we'll provide answers to some frequently asked questions.

XTTS-v2

XTTS is one of the most popular voice generation models. Its latest version, XTTS-v2, can clone voices into different languages from just a 6-second audio sample. This efficiency eliminates the need for extensive training data, making it an attractive solution for voice cloning and multilingual speech generation.

The bad news is that Coqui, the company behind XTTS, shut down in early 2024, leaving the project to the open-source community. However, the source code remains available on GitHub, and XTTS-v2 continues to be one of the most downloaded TTS models on Hugging Face.

Key features:

  • Voice cloning with minimal input: XTTS-v2 allows you to clone voices across multiple languages using only a 6-second audio clip, greatly simplifying the voice cloning process.
  • Multi-language support: The model supports 17 languages, making it ideal for global, multilingual applications.
  • Emotion and style transfer: XTTS-v2 can replicate not only the voice but also the emotional tone and speaking style, resulting in more realistic and expressive speech synthesis.
  • Low-latency performance: The model can achieve less than 150ms streaming latency with a pure PyTorch implementation on a consumer-grade GPU.

Points to be cautious about:

  • Non-commercial use only: XTTS-v2 is licensed under the Coqui Public Model License, which restricts its use to non-commercial purposes. This limits its application in commercial products unless specific licensing terms are negotiated.
  • Project shutdown: With Coqui gone, the model's future development relies entirely on the open-source community.
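
For a sense of the workflow, here's a minimal sketch using the Coqui TTS Python package (`pip install TTS`). The reference clip path `speaker.wav` is a placeholder for a short sample of the target voice; the model ID is the one published for XTTS-v2.

```python
# Minimal XTTS-v2 voice cloning sketch with the Coqui TTS package.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download and load XTTS-v2.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice in speaker.wav and speak the text in French.
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="speaker.wav",  # ~6-second reference clip of the target voice
    language="fr",
    file_path="output.wav",
)
```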

ChatTTS

ChatTTS is a voice generation model designed for conversational applications, particularly for dialogue tasks in LLM assistants. It’s also ideal for conversational audio, video introductions, and other interactive tasks. Trained on approximately 100,000 hours of Chinese and English data, ChatTTS is capable of producing natural and high-quality speech in both languages.

Key features:

  • High-quality synthesis: With extensive training, it delivers natural, fluid speech with clear articulation.
  • Specialized for dialogues: ChatTTS is optimized for conversational tasks, making it an excellent choice for LLM-based assistants and dialogue systems.
  • Token-level control: It offers limited but useful token-based control over elements like laughter and pauses, allowing some flexibility in dialogue delivery.

Points to be cautious about:

  • Limited language support: Compared with other TTS models, ChatTTS currently supports only English and Chinese, which may restrict its use for applications in other languages.
  • Limited emotional control: At present, the model only supports basic token-level controls like laughter and breaks. More nuanced emotional controls are expected in future versions but are currently unavailable.
  • Stability issues: ChatTTS can sometimes run into stability issues, such as generating multi-speaker outputs or producing inconsistent audio quality. These issues are common with autoregressive models, and you may need to generate multiple samples to get the desired result.
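
If you want to try it, here's a rough sketch of basic usage with the ChatTTS Python package (assuming `pip install ChatTTS` plus `torch` and `torchaudio`; loading details can vary slightly between releases):

```python
# Basic ChatTTS sketch: generate English speech and save it to disk.
import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load()  # downloads and loads the pretrained models

texts = ["Hello, welcome to the show. It is great to have you here."]
wavs = chat.infer(texts)  # list of numpy waveforms, one per input text

# ChatTTS outputs 24 kHz audio; depending on the release, the waveform
# may or may not need the extra channel dimension added here.
torchaudio.save("chat_output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
```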

MeloTTS

MeloTTS is a high-quality, multilingual TTS library developed by MyShell.ai. It supports a wide range of languages and accents, including several English dialects (American, British, Indian, and Australian). MeloTTS is optimized for real-time inference, even on CPUs.

Currently, its English variant (MeloTTS-English) is the most downloaded TTS model on Hugging Face.

Key features:

  • Multilingual support: MeloTTS offers a broad range of languages and accents. A key highlight is the ability of the Chinese speaker to handle mixed Chinese and English speech. This makes the model particularly useful in scenarios where both languages are needed, such as in international business or multilingual media content.
  • Real-time inference: It’s optimized for fast performance, even on CPUs, making it suitable for applications requiring low-latency responses.
  • Free for commercial use: Licensed under the MIT License, MeloTTS is available for both commercial and non-commercial usage.

Points to be cautious about:

  • No voice cloning: MeloTTS does not support voice cloning, which could be a limitation for applications that require personalized voice replication.
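
Here's a minimal sketch with the `melo` Python API (assuming MeloTTS is installed per its README); `"EN-US"` is one of the model's built-in English accent keys:

```python
# Minimal MeloTTS sketch: American English accent on CPU.
from melo.api import TTS

model = TTS(language="EN", device="cpu")  # CPU is enough for real-time inference
speaker_ids = model.hps.data.spk2id      # accent keys, e.g. EN-US, EN-BR, EN-AU

model.tts_to_file(
    "MeloTTS runs in real time, even on a CPU.",
    speaker_ids["EN-US"],
    "melo_output.wav",
    speed=1.0,
)
```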

OpenVoice v2

Also developed by MyShell.ai, OpenVoice v2 is an instant voice cloning model that replicates a speaker's voice from just a short audio clip. It supports speech generation in multiple languages, providing granular control over various voice attributes like emotion, accent, rhythm, pauses, and intonation.

Key features:

  • Accurate tone color cloning: OpenVoice v2 accurately replicates the reference speaker's tone color, allowing the cloned voice to be used across multiple languages and accents.
  • Flexible voice style control: Users can control granular details of the speech output, such as emotion, accent, rhythm, pauses, and intonation, offering more customization than many other TTS models.
  • Zero-shot cross-lingual voice cloning: The model can clone a voice in a language that isn't present in the reference speech or the training data. In other words, the provided sample speech audio for OpenVoice v2 can be in any language.
  • Free for commercial use: Licensed under the MIT License, OpenVoice v2 is available for both commercial and non-commercial projects.

Points to be cautious about:

  • More involved setup: Unlike single-call TTS libraries, OpenVoice v2 runs as a two-stage pipeline. A base TTS model (MeloTTS) first generates the speech, and a separate tone color converter then applies the cloned voice, which adds setup and dependency overhead compared with simpler models.
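
Here's a sketch of that two-stage flow, following the layout of the v2 demo in the OpenVoice repo (the checkpoint paths and `reference.mp3` are assumptions; the v2 checkpoints are downloaded separately per the README):

```python
# OpenVoice v2 sketch: MeloTTS generates base speech, then the tone
# color converter re-voices it to match a short reference clip.
import torch
from melo.api import TTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"
ckpt_converter = "checkpoints_v2/converter"  # assumed checkpoint location

converter = ToneColorConverter(f"{ckpt_converter}/config.json", device=device)
converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

# Extract the target speaker's tone color embedding from a short clip.
target_se, _ = se_extractor.get_se("reference.mp3", converter, vad=False)

# Stage 1: generate base speech with MeloTTS.
model = TTS(language="EN", device=device)
speaker_id = model.hps.data.spk2id["EN-US"]
model.tts_to_file("This voice was cloned from a short clip.", speaker_id, "tmp.wav")

# Stage 2: convert the base speaker's tone color to the target's.
source_se = torch.load("checkpoints_v2/base_speakers/ses/en-us.pth", map_location=device)
converter.convert(
    audio_src_path="tmp.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned_output.wav",
)
```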

Parler-TTS

Parler-TTS is a collection of lightweight TTS models developed by Hugging Face, designed to generate high-quality, natural-sounding speech. It allows users to control various speech features, such as gender, pitch, speaking style, and even background noise. As a fully open-source release, Parler-TTS ships all of its training code, datasets, and model weights publicly under a permissive license, enabling the community to build and customize their own TTS models.

Key features:

  • Voice style control: Parler-TTS provides granular control over speech characteristics such as emotion, speaking rate, pitch, and reverberation using simple text prompts.
  • Speaker cloning: The models can replicate the style of 34 pre-defined speakers, making it useful for applications requiring consistent speaker identities.
  • Optimized for efficiency: Parler-TTS supports fast generation techniques, including SDPA and Flash Attention 2, making it computationally efficient. SDPA is used by default and speeds up generation time by up to 1.4x compared with eager attention.

Points to be cautious about:

  • Model size: Parler-TTS is available in two versions — Mini (880M parameters) and Large (2.3B parameters). The Mini version is a lightweight model ideal for quick and efficient speech generation. However, if you need more expressiveness and control over finer details of speech, the Large version provides more advanced capabilities, though it requires greater computational resources.
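
Here's a sketch of prompt-based style control with the Mini checkpoint, following the usage pattern in the Parler-TTS README (the description string is free-form and illustrative):

```python
# Parler-TTS Mini: describe the voice in plain text, then generate.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo = "parler-tts/parler-tts-mini-v1"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

prompt = "Hey, how are you doing today?"
description = (
    "A female speaker delivers a slightly expressive and animated speech "
    "with a moderate speed and pitch. The recording is very high quality."
)

# The description steers the voice style; the prompt is the text to speak.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```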

Now that we’ve explored some of the top open-source TTS models and their features, you might still have questions about how these models perform, how to deploy them, and best practices. To help, we’ve compiled a list of FAQs covering the key considerations when working with TTS models.

Any benchmarks for TTS models? And how much should I trust them?

While LLMs have well-established benchmarks that offer insights into their performance across different tasks, the same cannot be said for TTS models. Evaluating their quality is inherently more challenging due to the subjective nature of human speech perception. Metrics like Word Error Rate (WER) often fail to capture the nuances of naturalness, inflection, and emotional tone in speech.
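
To see why, note that WER only compares transcribed words. In the sketch below (using the `jiwer` package), a flat, robotic rendition and an expressive one could both transcribe to the same words and score a perfect WER of 0:

```python
# WER compares word sequences only; prosody, inflection, and emotion
# are invisible to it.
from jiwer import wer

reference = "are you really sure about that"
# An ASR transcript of a monotone rendition and of an expressive one
# may be identical, so both score 0.0 even if one sounds robotic.
hypothesis = "are you really sure about that"

print(wer(reference, hypothesis))  # 0.0
```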

I suggest treating TTS benchmarks with caution. While they provide a rough overview of performance, they may not fully reflect how a model will perform in real-world scenarios. If you're interested in exploring TTS model rankings, check out the TTS Arena leaderboard, curated by the TTS AGI community on Hugging Face. The leaderboard ranks models by how natural they sound, based on community votes.

What should I consider when deploying TTS models?

When deploying TTS models, key considerations include speech quality, inference latency, hardware requirements, and licensing. The questions below walk through several of these in more detail.

Text-to-Speech vs. Text-to-Audio. Which one should I choose?

While "text-to-speech" and "text-to-audio" may seem interchangeable, they refer to slightly different concepts depending on your use case.

  • TTS focuses on converting written text into spoken words that sound as close to human speech as possible. It is typically used for applications like virtual assistants, accessibility tools, audiobooks, and voice interfaces. The goal is to generate speech that feels natural and conversational.
  • TTA is broader and can refer to any conversion of text into an audio format, not necessarily human speech. It may include sound effects, alerts, or any type of non-verbal audio cues based on the textual input.

If you need human-like speech output, a TTS model is what you're looking for. On the other hand, if your focus is simply generating any form of audio from text, including sound effects or alerts, you may be considering text-to-audio solutions. Some popular open-source text-to-audio models include Stable Audio Open 1.0, Tango, Bark (which also functions as a TTS model), and MusicGen (often referred to as a "text-to-music" model).

What should I consider regarding speech quality?

When evaluating the speech quality of a TTS model, there are several key factors to consider to ensure the output meets your application's needs:

Naturalness and intelligibility

  • One of the most important aspects of any TTS model is how natural and human-like the generated speech sounds. Listen for smooth transitions between words, appropriate pauses, and minimal robotic or synthetic artifacts.
  • Intelligibility is equally important. Ensure that the speech is clear and easy to understand, even with complex or lengthy text inputs.

Multilingual and accent support

  • If your application is multilingual, test the model’s ability to generate high-quality speech across different languages, accents, and dialects. Some models, like MeloTTS mentioned above, are known for handling a broad range of languages, while others may specialize in fewer languages.
  • Be sure to test how well the model adapts to accents within the same language, especially for global applications requiring regional variations in speech.

Prosody and intonation

  • Prosody refers to the rhythm, stress, and intonation of speech, which play a critical role in making the generated speech sound natural. A good TTS model should replicate human-like prosody to avoid sounding monotonous or unnatural.
  • Intonation should vary naturally, reflecting questions, statements, and exclamations appropriately.

Emotional expression

  • For more advanced applications, consider a model's ability to convey different emotions in speech. Some models, such as OpenVoice v2, support granular control over emotional expression, which can be critical in customer service, virtual assistants, or entertainment applications.

Final thoughts

TTS technology has come a long way, with open-source models now offering high-quality, natural-sounding speech generation across multiple languages and applications. Whether you're looking for basic TTS functionality, voice cloning, or advanced control over speech styles and emotions, there's a wide range of models to choose from.

Ultimately, the right TTS model for you will depend on your specific use case, whether it’s for virtual assistants, multilingual applications, or interactive media. By carefully evaluating features and performance, you can harness the power of TTS to transform the way you interact with text and speech.

Check out the following resources to learn more: