Meituan LongCat Team Launches General 365: A Rigorous New Benchmark for AI Reasoning
Research Breakthrough | Meituan | LongCat | AI Benchmarking

The Meituan LongCat team has officially released General 365, a sophisticated evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the benchmark revealed a significant performance gap across the industry. Gemini 3 Pro, currently regarded as one of the most capable models, achieved an accuracy rate of only 62.8%. More strikingly, the vast majority of the models tested failed to reach the 60% threshold, which is considered a basic passing grade. This release by Meituan sets a new, more challenging standard for AI evaluation, highlighting that complex reasoning remains a major hurdle for even the most advanced artificial intelligence systems today.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on evaluating the reasoning performance of AI models.
  • Industry-Wide Testing: The benchmark was used to evaluate 26 mainstream models to provide a comprehensive overview of the current state of AI reasoning.
  • Gemini 3 Pro Performance: Even the top-performing model in the test, Gemini 3 Pro, only reached an accuracy of 62.8%.
  • Low Success Rates: Most models evaluated failed to achieve a 60% accuracy score, indicating that current AI reasoning capabilities still fall short of this new standard.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is designed to address the growing need for more rigorous testing of reasoning capabilities in large language models. As AI development shifts from simple conversational tasks to complex problem-solving, the industry requires benchmarks that can accurately differentiate between surface-level pattern matching and deep logical reasoning. General 365 appears to be positioned as a "high bar" for the industry, focusing on areas where current models still struggle significantly.

Analyzing the Performance Gap

The results released alongside the benchmark provide a sobering look at the current state of artificial intelligence. By testing 26 mainstream models, the LongCat team has established a broad baseline for performance. That Gemini 3 Pro, a model recognized for its advanced capabilities, scored only 62.8% suggests that General 365 contains tasks significantly more difficult than those found in traditional benchmarks.

Furthermore, the observation that the majority of models could not reach the 60% "passing line" highlights a critical bottleneck in AI development. This failure rate suggests that while models are becoming better at generating fluent text, their underlying logical frameworks are not yet robust enough to handle the specific reasoning challenges posed by General 365. This data indicates that the industry may have been overestimating the reasoning maturity of current LLMs based on older, less demanding benchmarks.

Setting a New Standard for Reasoning

By establishing a benchmark where even the "strongest" models are barely passing, Meituan is effectively recalibrating the expectations for AI performance. General 365 serves as a diagnostic tool that identifies the limits of current technology. The 60% threshold mentioned by the LongCat team acts as a symbolic barrier, separating models that possess basic reasoning competency from those that do not. This rigorous approach is essential for guiding future research and development, as it provides a clear target for engineers looking to improve the logical consistency and problem-solving depth of their models.
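To make the pass/fail framing concrete, here is a minimal Python sketch of how an accuracy score could be compared against the 60% line. The article does not describe how General 365 is actually scored, so this assumes accuracy is simply the fraction of questions answered correctly. The function names, model names, and graded answers below are hypothetical placeholders, not General 365 data.

```python
from typing import Dict, List

# The 60% passing line cited in the article.
PASS_THRESHOLD = 0.60


def accuracy(graded: List[bool]) -> float:
    """Return the fraction of questions graded as correct."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def summarize(
    results: Dict[str, List[bool]], threshold: float = PASS_THRESHOLD
) -> Dict[str, dict]:
    """Map each model name to its accuracy and whether it clears the threshold."""
    summary = {}
    for model, graded in results.items():
        acc = accuracy(graded)
        summary[model] = {"accuracy": acc, "passed": acc >= threshold}
    return summary


# Hypothetical graded outputs (True = correct); NOT real benchmark results.
example_results = {
    "model_a": [True, True, True, True, False],    # 4 of 5 correct
    "model_b": [True, False, False, True, False],  # 2 of 5 correct
}

for model, row in summarize(example_results).items():
    print(f"{model}: accuracy={row['accuracy']:.1%} passed={row['passed']}")
```

Under this assumed scoring rule, the reported 62.8% for Gemini 3 Pro sits 2.8 percentage points above the passing line.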

Industry Impact

The release of General 365 is likely to have a profound impact on how AI models are marketed and developed. For years, the industry has relied on benchmarks where top models frequently score in the 80% to 90% range, leading to a perception that reasoning is a "solved" problem. General 365 shatters this illusion by showing that when the difficulty is increased, performance drops precipitously. This will likely push AI labs to focus more on the quality of reasoning rather than just the scale of the models.

Additionally, Meituan's involvement underscores the importance of real-world application providers in the AI ecosystem. As a company that relies on AI for complex logistics and consumer services, Meituan has a vested interest in ensuring that the models it uses are truly capable of logical deduction. General 365 provides a transparent metric that can be used by both developers and enterprise users to assess the true utility of an AI model in high-stakes reasoning scenarios.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a new evaluation benchmark released by the Meituan LongCat team. It is specifically designed to test and measure the reasoning capabilities of mainstream large language models, providing a more rigorous standard than many existing evaluations.

Question: How did the top models perform on General 365?

According to the initial results, Gemini 3 Pro was the top performer with an accuracy rate of 62.8%. However, the vast majority of the 26 mainstream models tested failed to reach a 60% accuracy score, which is considered the passing threshold for the benchmark.

Question: Why is General 365 significant for the AI industry?

It is significant because it reveals a major gap in the reasoning abilities of current AI models. By setting a high difficulty level where most models fail to pass, it provides a more accurate and challenging metric for the next generation of AI development, moving beyond simpler benchmarks where models already achieve high scores.

Related News

LARYBench Released: A New Benchmark Defining the ImageNet for Embodied Action Representation and Generalization
Research Breakthrough

The Meituan Technical Team has officially introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. Positioned as the 'ImageNet' for the embodied AI field, LARYBench provides a standardized way to measure how well models can understand and execute actions. The benchmark's initial experimental results reveal a significant shift in AI development: general-purpose vision models consistently outperform specialized embodied AI expert models in both action generalization and control precision. Furthermore, the research confirms that sophisticated embodied action representations can naturally emerge from training on extensive human video datasets, offering a scalable path for future robotic intelligence and autonomous systems.

Meituan Showcases AI Innovations at ACL 2026: Advancing Large Model Evaluation and Inference Optimization
Research Breakthrough

Meituan's technical team has announced the acceptance of six research papers at ACL 2026, a premier international conference for computational linguistics and natural language processing. These papers represent significant advancements in the field of AI, covering a diverse range of technical directions including large-scale model evaluation, complex process reasoning, and competition-level mathematical thinking optimization. Additionally, the research explores reinforcement learning optimization and generative recommendation systems. This selection underscores Meituan's strategic focus on building a new paradigm for generative AI, emphasizing both the rigorous assessment of model capabilities and the enhancement of inference efficiency for complex tasks.

Meituan LongCat-AudioDiT: Redefining Zero-Shot Voice Cloning by Eliminating Intermediate Mel-Spectrogram Representations in TTS
Research Breakthrough

Meituan's LongCat team has unveiled LongCat-AudioDiT, a novel model that advances the state of zero-shot Text-to-Speech (TTS) voice cloning. The core innovation lies in its departure from traditional intermediate representations, such as Mel-spectrograms, which often introduce cascade errors during the synthesis process. Instead, LongCat-AudioDiT utilizes a diffusion-based architecture that operates directly within the waveform latent space. By learning the fundamental patterns of sound without intermediate steps, the model aims to achieve higher fidelity and more accurate voice replication. This technical breakthrough addresses long-standing bottlenecks in audio generation, positioning LongCat-AudioDiT as a significant development in the field of AI-driven voice synthesis and zero-shot cloning technology.