LARYBench Released: Defining the ImageNet for Embodied Action Representation and Measuring Generalization from Human Videos
Research Breakthrough | Embodied AI | Computer Vision | Machine Learning

The Meituan Technical Team has officially released LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. The benchmark is positioned as the 'ImageNet' of action representation for embodied AI. Its experimental findings indicate that general vision models significantly outperform specialized embodied AI action expert models in both action generalization and control precision. The research also reports that embodied action representations can emerge directly from large-scale human video data, which offers a new way to measure how AI systems translate visual observation into physical action capabilities.

Meituan Technical Team

Key Takeaways

  • Introduction of LARYBench: A systematic benchmark designed to evaluate latent action representations learned from large-scale visual datasets.
  • Superiority of General Models: General vision models demonstrate significantly better performance in action generalization and control precision compared to specialized embodied AI expert models.
  • Emergence from Human Videos: The benchmark provides evidence that embodied action representations can emerge from observing large-scale human video data without explicit action labels.
  • A New Industry Standard: LARYBench is positioned as the 'ImageNet' for the embodied AI field, providing a standardized metric for generalization and precision.

In-Depth Analysis

The Framework of LARYBench

LARYBench, which stands for Latent Action Representation Yielding Benchmark, represents a systematic shift in how the AI industry evaluates embodied intelligence. By focusing on "latent action representation," the benchmark addresses the critical gap between seeing an action and understanding the underlying mechanics required to replicate it. The system is designed to guide the learning process from massive visual datasets, transforming passive observation into actionable intelligence. By establishing a systematic evaluation protocol, LARYBench allows researchers to measure how effectively a model can extract action-oriented features from raw pixels, a process that is fundamental to the development of autonomous agents and robotics.
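The article does not spell out LARYBench's actual metrics or protocol, so the following is only a rough illustration of what "measuring how well features capture action" can look like in practice. It sketches a generic linear probe: features from a frozen encoder are mapped to ground-truth actions with ridge regression, and the held-out error is used as a score. The function name `ridge_probe_mse`, the 7-dimensional action vector, and the synthetic features are all assumptions made for this example and are not taken from the benchmark.

```python
import numpy as np

def ridge_probe_mse(train_z, train_a, test_z, test_a, l2=1e-3):
    """Fit a linear (ridge) probe from latent features to actions; return test MSE.

    Hypothetical helper for illustration only, not LARYBench's API.
    """
    # Append a constant column so the probe can learn an offset.
    x_train = np.hstack([train_z, np.ones((len(train_z), 1))])
    x_test = np.hstack([test_z, np.ones((len(test_z), 1))])
    # Closed-form ridge solution: W = (X^T X + l2 * I)^-1 X^T A
    gram = x_train.T @ x_train + l2 * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ train_a)
    pred = x_test @ weights
    return float(np.mean((pred - test_a) ** 2))

# Synthetic stand-in data (assumption): 400 samples of a 7-dimensional action.
rng = np.random.default_rng(0)
actions = rng.normal(size=(400, 7))

# "Informative" features encode the action linearly plus a little noise;
# "uninformative" features are pure noise and carry no action signal.
mix = rng.normal(size=(7, 32))
informative_z = actions @ mix + 0.01 * rng.normal(size=(400, 32))
uninformative_z = rng.normal(size=(400, 32))

# Train the probe on the first 300 samples, evaluate on the remaining 100.
mse_good = ridge_probe_mse(informative_z[:300], actions[:300],
                           informative_z[300:], actions[300:])
mse_bad = ridge_probe_mse(uninformative_z[:300], actions[:300],
                          uninformative_z[300:], actions[300:])

print(f"probe MSE, informative features:   {mse_good:.4f}")
print(f"probe MSE, uninformative features: {mse_bad:.4f}")
```

A lower held-out error means the frozen features carry more linearly decodable action information. In a real comparison, the two synthetic feature sets would be replaced by embeddings from the encoders under test.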

General Vision Models vs. Specialized Experts

One of the most striking revelations from the LARYBench experimental results is the performance gap between general-purpose vision models and specialized embodied AI action expert models. Traditionally, the industry has leaned toward developing niche models trained specifically for robotic control or embodied tasks. However, LARYBench demonstrates that general vision models—those trained on broad, diverse visual data—possess a superior ability to generalize actions across different scenarios. Furthermore, these general models exhibit higher control precision. This suggests that the foundational visual features learned by large-scale general models are more robust and adaptable for embodied tasks than the features captured by models with a narrower, task-specific focus.

Action Representation Emergence from Human Videos

The benchmark provides empirical evidence for a transformative concept in AI: the emergence of embodied action representations from human video data. This implies that AI models do not necessarily require direct robotic telemetry or specialized sensor data to understand physical movement. Instead, by processing large-scale videos of humans performing various tasks, these models can synthesize a latent understanding of action. This "emergence" is a critical finding, as it suggests that the vast repositories of human video content available globally can serve as a primary training ground for embodied AI, significantly lowering the barrier to training sophisticated robotic systems.

Industry Impact

The release of LARYBench is poised to redefine the development trajectory of embodied AI. By providing a standardized metric—akin to what ImageNet did for computer vision—it allows for objective comparisons between different architectural approaches. The finding that general vision models excel in this domain may lead to a consolidation of research efforts, where the focus shifts from building specialized action models to fine-tuning large-scale general vision models for physical tasks. This could accelerate the deployment of more precise and adaptable robots in real-world environments, as the industry moves toward leveraging human video data as a scalable resource for learning complex physical interactions.

Frequently Asked Questions

Question: What is the primary purpose of LARYBench?

LARYBench is a systematic evaluation benchmark designed to measure and guide the learning of general latent action representations from large-scale visual data, serving as a standard for the embodied AI industry.

Question: Why are general vision models performing better than specialized models in this benchmark?

According to the experimental results, general vision models show significantly better action generalization and control precision, suggesting that broad visual training provides a more robust foundation for understanding actions than specialized, task-specific training.

Question: Can AI learn to move just by watching human videos?

Yes, LARYBench demonstrates that embodied action representations can emerge from large-scale human video data, allowing models to learn the latent structures of action through visual observation.
