Meituan LongCat-AudioDiT: Redefining Zero-Shot Voice Cloning by Eliminating Intermediate Mel-Spectrogram Representations in TTS
Tags: Research Breakthrough, TTS, Voice Cloning, AI Research

Meituan's LongCat team has unveiled LongCat-AudioDiT, a model for zero-shot Text-to-Speech (TTS) voice cloning. Its core design choice is to drop the intermediate Mel-spectrogram used by conventional pipelines, a stage from which errors can cascade into the final audio. Instead, LongCat-AudioDiT uses a diffusion-based architecture that operates directly in a waveform latent space. By modeling sound without that intermediate step, the model aims for higher fidelity and more accurate voice replication, targeting a long-standing bottleneck in zero-shot cloning.

Meituan Technical Team

Key Takeaways

  • Elimination of Mel-Spectrograms: LongCat-AudioDiT removes the need for intermediate Mel-spectrogram representations, which are standard in traditional TTS pipelines.
  • Direct Waveform Latent Space: The model operates directly within the waveform latent space to generate audio, reducing the complexity of the synthesis process.
  • Diffusion-Based Architecture: It utilizes a diffusion model (AudioDiT) to learn the inherent laws of sound and voice patterns.
  • Reduction of Cascade Errors: By bypassing intermediate stages, the model prevents the accumulation of errors that typically occur during data conversion between different stages of a TTS system.
  • Zero-Shot Capability: The architecture is specifically designed to push the upper limits of zero-shot voice cloning, allowing for high-quality replication of unseen voices.

In-Depth Analysis

Moving Beyond Intermediate Representations

For years, the standard architecture for Text-to-Speech (TTS) systems has relied on a multi-stage process. Typically, an acoustic model first converts text into an intermediate representation, most commonly a Mel-spectrogram. A second model, known as a vocoder, then converts that Mel-spectrogram into an audio waveform. While effective, this approach has a structural weakness: cascade errors. Mistakes made in the text-to-spectrogram stage carry into the spectrogram-to-waveform stage, where the vocoder cannot correct them and may compound them, often resulting in artifacts or a loss of naturalness in the cloned voice.

Meituan’s LongCat team targets this bottleneck with LongCat-AudioDiT. The defining characteristic of the model is that it abandons Mel-spectrograms entirely. By removing this middleman, the model seeks to "directly learn the laws of sound itself." This is a move toward a more end-to-end philosophy in audio generation, in which the model works with a representation derived from the waveform itself rather than a lossy time-frequency summary of it.

The Power of Waveform Latent Space and Diffusion

LongCat-AudioDiT operates within the waveform latent space. In technical terms, the model works with a compressed, learned representation of the audio waveform itself rather than a frequency-domain representation such as a spectrogram. By performing synthesis in this space, the model aims to capture the nuances of a person's voice, such as timbre, breathiness, and prosody, more accurately than spectrogram-based pipelines.
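To make the idea of a waveform latent concrete, here is a minimal sketch in Python. It is not LongCat-AudioDiT's encoder, whose details the article does not give: a fixed DCT basis stands in for a learned neural encoder and decoder, and the names `encode`, `decode`, `frame_len`, and `latent_dim` are assumptions made for illustration. The point is only the shape of the computation: raw samples go in, a shorter sequence of latent vectors comes out, and a decoder maps latents back to samples.

```python
import math


def dct_basis(frame_len, latent_dim):
    """Rows of an orthonormal DCT-II basis (a fixed stand-in for learned weights)."""
    basis = []
    for k in range(latent_dim):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / frame_len)
        basis.append([scale * math.cos(math.pi * (n + 0.5) * k / frame_len)
                      for n in range(frame_len)])
    return basis


def encode(waveform, frame_len, latent_dim):
    """Map each non-overlapping frame of samples to latent_dim coefficients."""
    basis = dct_basis(frame_len, latent_dim)
    latents = []
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        frame = waveform[start:start + frame_len]
        latents.append([sum(b * x for b, x in zip(row, frame)) for row in basis])
    return latents


def decode(latents, frame_len):
    """Map latent vectors back to waveform samples, frame by frame."""
    latent_dim = len(latents[0])
    basis = dct_basis(frame_len, latent_dim)
    waveform = []
    for z in latents:
        waveform.extend(sum(z[k] * basis[k][n] for k in range(latent_dim))
                        for n in range(frame_len))
    return waveform


if __name__ == "__main__":
    # Assumed toy input: 0.1 s of a 440 Hz tone sampled at 16 kHz.
    wave = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
    latents = encode(wave, frame_len=64, latent_dim=16)
    print(len(wave), "samples ->", len(latents), "frames x", len(latents[0]), "dims")
```

Keeping all 64 coefficients per frame reconstructs the input up to rounding, while keeping 16 gives a 4x smaller, lossy latent. A learned codec makes the same trade-off with a representation tuned to speech rather than a fixed transform.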

Central to this process is the diffusion model at the heart of the system (the AudioDiT in its name). Diffusion models have reshaped image generation in recent years, and they are increasingly applied to audio. In LongCat-AudioDiT, the diffusion process starts from noise in the latent space and refines it step by step into a clean audio latent. This allows the model to generate high-fidelity audio that preserves the characteristics of the voice being cloned. Because the model is trained on the underlying patterns of sound directly, the team expects it to generalize better to new, unseen voices, which is the hallmark of "zero-shot" cloning.
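The iterative refinement described above can be sketched as a short sampling loop. This is a toy under stated assumptions, not the LongCat-AudioDiT sampler: the update is a simplified deterministic step toward the current estimate of the clean latent, and `oracle_denoiser` is a hypothetical stand-in that simply returns the known target where a real system would call a trained network conditioned on text and a voice prompt.

```python
import math
import random


def oracle_denoiser(noisy_latent, t, clean_latent):
    """Hypothetical stand-in for a trained network: returns the clean latent."""
    return list(clean_latent)


def distance(a, b):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample(clean_latent, num_steps=8, seed=0):
    """Start from Gaussian noise and refine it toward the denoiser's estimate."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in clean_latent]
    trajectory = [list(x)]
    for i in range(num_steps):
        x0_hat = oracle_denoiser(x, i / num_steps, clean_latent)
        remaining = num_steps - i
        # Close 1/remaining of the gap to the estimate; the last step closes it fully.
        x = [xi + (x0i - xi) / remaining for xi, x0i in zip(x, x0_hat)]
        trajectory.append(list(x))
    return trajectory


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0, 0.0]  # assumed toy "clean" latent
    for step, x in enumerate(sample(target)):
        print(step, round(distance(x, target), 4))
```

With the oracle in place the distance to the target shrinks at every step and reaches zero at the end. With a learned denoiser each estimate is imperfect, which is why the number of steps and the quality of the network both affect the final audio.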

Overcoming Technical Bottlenecks in Voice Cloning

The primary goal of LongCat-AudioDiT is to raise the "upper limit" of zero-shot voice cloning. Zero-shot cloning is particularly challenging because the model must replicate a voice it never encountered during training, often from only a very short sample of the target voice. Spectrogram-based models often struggle here because a Mel-spectrogram discards phase information and smooths over fine-grained acoustic detail.
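The phase-loss point can be checked with a few lines of Python. The sketch below is a simplification: it uses the magnitude of a plain DFT over a single frame, whereas a real Mel-spectrogram also applies windowing, a Mel filterbank, and usually a logarithm. The two test signals and the helper name `dft_magnitudes` are assumptions chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of the DFT for bins 0..N/2; the phase of each bin is thrown away."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n // 2 + 1)]


N = 64
# Two signals built from the same two frequencies (bins 3 and 5).
# The second component differs only in phase: sine versus cosine.
signal_a = [math.sin(2 * math.pi * 3 * t / N) + math.sin(2 * math.pi * 5 * t / N)
            for t in range(N)]
signal_b = [math.sin(2 * math.pi * 3 * t / N) + math.cos(2 * math.pi * 5 * t / N)
            for t in range(N)]

mag_gap = max(abs(a - b) for a, b in zip(dft_magnitudes(signal_a),
                                         dft_magnitudes(signal_b)))
wave_gap = max(abs(a - b) for a, b in zip(signal_a, signal_b))

if __name__ == "__main__":
    print("largest magnitude difference:", mag_gap)
    print("largest waveform difference:", wave_gap)
```

The two waveforms differ clearly sample by sample, yet their magnitude spectra agree up to rounding error. A vocoder that sees only magnitudes has to guess the missing phase, which is one of the places where a spectrogram pipeline can lose detail.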

By "blocking the cascade error from the source," LongCat-AudioDiT is designed to keep the signal as intact as possible throughout the generation pipeline. Generating in the waveform latent space is meant to preserve the subtle characteristics that make a human voice unique. The intended result is a system that does not just mimic the pitch and speed of a voice but also captures its identity, pushing the boundaries of what is possible in synthetic speech.

Industry Impact

The release of LongCat-AudioDiT by Meituan is a notable development in the AI audio landscape. By making the case that high-quality TTS can be achieved without intermediate Mel-spectrograms, Meituan is challenging the industry's default pipeline and pointing toward more streamlined audio models. This has implications for several sectors:

  1. Content Creation: High-fidelity zero-shot cloning allows for more realistic dubbing and voice-over work without the need for extensive recording sessions.
  2. Human-Computer Interaction: Virtual assistants and AI agents can adopt more natural and diverse personas with minimal data input.
  3. Technical Efficiency: Reducing the number of stages in the TTS pipeline can lead to more robust models that are less prone to the "robotic" artifacts associated with traditional vocoders.

As diffusion models continue to mature, latent-space audio generation is likely to become a common design choice for high-quality speech synthesis.

Frequently Asked Questions

Question: What is the main difference between LongCat-AudioDiT and traditional TTS models?

Traditional TTS models usually convert text to a Mel-spectrogram first and then use a vocoder to create sound. LongCat-AudioDiT skips the Mel-spectrogram entirely and generates audio directly in the waveform latent space using diffusion models, which prevents errors from building up between stages.

Question: Why is the removal of Mel-spectrograms considered a breakthrough?

A Mel-spectrogram is a lossy summary of sound: it discards phase and can blur fine acoustic detail. Removing it avoids "cascade errors," in which mistakes made in the first stage of generation affect the final output, and is intended to produce a more accurate and natural-sounding voice clone.

Question: What does "zero-shot" mean in the context of LongCat-AudioDiT?

Zero-shot refers to the model's ability to clone a specific person's voice using only a short audio sample, even if the model was never specifically trained on that person's voice before. LongCat-AudioDiT aims to improve the quality and realism of these clones.
