Meituan LongCat Team Launches General 365: A Rigorous New Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning

The Meituan LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, Gemini 3 Pro, currently regarded as one of the most advanced models, posted the highest accuracy at only 62.8%. The vast majority of the models tested fell short of 60% accuracy, the level traditionally considered a passing grade. The release by Meituan's technical team sets a more demanding standard for measuring AI reasoning and shows that current models still face substantial challenges in complex logical tasks.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has introduced General 365, specifically designed to test the reasoning limits of AI models.
  • Industry-Wide Testing: The benchmark was applied to 26 mainstream models to provide a comprehensive overview of current AI capabilities.
  • Performance Ceiling: Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%.
  • Reasoning Deficit: Most tested models failed to achieve a 60% score, indicating a widespread struggle with the reasoning tasks presented in General 365.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially open-sourced General 365, positioning it as a new yardstick for the evaluation of artificial intelligence. Unlike traditional benchmarks that may focus on general knowledge or linguistic fluency, General 365 appears to target the core cognitive function of reasoning. By releasing this tool, the LongCat team provides the developer community with a rigorous framework to identify the strengths and weaknesses of various large language models (LLMs) in logical processing.

The decision to open-source this benchmark suggests a move toward greater transparency and standardization in how AI progress is measured. As models become more sophisticated, the industry requires more difficult and nuanced testing environments to differentiate between superficial pattern matching and genuine logical reasoning.

Benchmarking the Leaders: Gemini 3 Pro and Beyond

In the initial testing phase conducted by the LongCat team, 26 mainstream models were evaluated. The results offer a sobering look at the current state of AI development. Gemini 3 Pro, the strongest model in the test, reached an accuracy of 62.8%. While this is the best result among the models evaluated, it still means that more than a third of the benchmark was answered incorrectly, which leaves a significant margin for improvement.

Performance falls off below the top tier. Most of the 26 models could not reach 60% accuracy, a level often considered the minimum standard for competency, which suggests that General 365 is a highly challenging benchmark. This gap underscores the difficulty of the reasoning tasks in the set and indicates that many current LLMs may still struggle with complex, multi-step logical requirements.
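To make the arithmetic behind these figures concrete, here is a minimal sketch of how an accuracy score and a 60% pass check can be computed. It assumes accuracy is simply correct answers divided by total items; the article does not describe General 365's actual scoring method, and the model names and counts below are hypothetical, not results from the benchmark.

```python
from dataclasses import dataclass

# The 60% "passing grade" mentioned in the article.
PASS_THRESHOLD = 0.60


@dataclass
class ModelResult:
    name: str      # hypothetical model label, not a real system
    correct: int   # items answered correctly
    total: int     # items attempted

    @property
    def accuracy(self) -> float:
        # Assumed scoring rule: fraction of items answered correctly.
        return self.correct / self.total if self.total else 0.0

    @property
    def passed(self) -> bool:
        return self.accuracy >= PASS_THRESHOLD


def summarize(results):
    """Rank results by accuracy (highest first) and count how many pass."""
    ranked = sorted(results, key=lambda r: r.accuracy, reverse=True)
    n_pass = sum(r.passed for r in ranked)
    return ranked, n_pass


# Illustrative numbers only; these are not General 365 scores.
results = [
    ModelResult("model-b", 55, 100),
    ModelResult("model-a", 63, 100),
    ModelResult("model-c", 41, 100),
]
ranked, n_pass = summarize(results)

for r in ranked:
    status = "pass" if r.passed else "below threshold"
    print(f"{r.name}: {r.accuracy:.1%} ({status})")
print(f"{n_pass} of {len(ranked)} models reach {PASS_THRESHOLD:.0%}")
```

Under this assumed rule, a leaderboard where only the top entry clears the threshold looks much like the situation the article describes: one model just above 60% and the rest below it.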

Industry Impact

The release of General 365 is significant for the AI industry because it shifts attention from headline scores on familiar benchmarks to deep reasoning capabilities. By setting a benchmark where even the most advanced models score near the 60% mark, Meituan is effectively raising the bar for what constitutes a "high-performing" model. This encourages AI researchers and developers to move beyond optimizing for existing, potentially saturated benchmarks and instead focus on the fundamental challenges of machine reasoning.

Furthermore, the benchmark serves as a reality check for the industry. While marketing for AI models often emphasizes human-like capabilities, the General 365 results demonstrate that there is still a long way to go before AI can consistently master complex reasoning tasks. This new standard will likely drive a new wave of innovation focused on cognitive depth rather than just model size or data volume.

Frequently Asked Questions

Question: What is General 365?

General 365 is a new reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to provide a rigorous standard for testing the logical reasoning capabilities of large language models.

Question: How did mainstream models perform on this benchmark?

In a test of 26 mainstream models, the performance was generally low. Gemini 3 Pro led the group with a 62.8% accuracy rate, but the majority of models failed to reach a 60% score.

Question: Why is the 60% score significant in this context?

The 60% mark is often viewed as a basic passing grade or a threshold for competency. The fact that most models fell below this line indicates that General 365 is a particularly difficult test that exposes the reasoning limitations of current AI technology.
