Meituan LongCat Open Sources General 365: A New Benchmark Revealing the Reasoning Limits of Modern AI
Tags: Industry News, Meituan, AI Reasoning, Benchmarking


The Meituan LongCat team has officially released General 365, a new open-source benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the results highlight a significant gap in current AI reasoning performance. Gemini 3 Pro, currently regarded as one of the most powerful models globally, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested failed to reach the 60% threshold, which is traditionally considered a passing grade. This release by Meituan's technical team sets a rigorous new standard for the industry, emphasizing that complex reasoning remains a formidable challenge even for the most advanced artificial intelligence systems.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has open-sourced General 365, a benchmark specifically focused on general reasoning capabilities.
  • Performance Gap: Out of 26 mainstream models tested, the industry-leading Gemini 3 Pro only managed a 62.8% accuracy rate.
  • Widespread Underperformance: Most current AI models failed to reach the 60% accuracy mark on the General 365 benchmark.
  • Open Source Contribution: The release provides the AI community with a "new ruler" to measure and improve reasoning logic in large language models.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has introduced General 365 at a critical juncture in AI development. As large language models evolve, the focus is shifting from simple information retrieval to complex logical reasoning. By open-sourcing General 365, Meituan is providing a structured framework to evaluate how models handle multi-step logic and problem-solving. The title of the release, "Setting a New Ruler for Reasoning Evaluation," suggests that existing benchmarks may not be challenging or comprehensive enough to distinguish the reasoning depth of modern LLMs. General 365 aims to fill this gap by offering a more rigorous testing ground.

Analyzing the Performance of Mainstream Models

The data released alongside General 365 provides a sobering look at the current state of artificial intelligence. The LongCat team tested 26 mainstream models, representing a broad cross-section of the industry's current capabilities. The results indicate that reasoning is still a significant hurdle. Even Gemini 3 Pro, which the report describes as the "strongest on Earth" (地表最强), achieved an accuracy of only 62.8%. This score, while leading the pack, means that even the top-ranked model missed 37.2% of the reasoning tasks in the General 365 suite.

Perhaps more telling is the performance of the remaining 25 models. The report notes that the vast majority of them did not reach the 60% "passing line." This widespread shortfall indicates that, although AI has made strides in natural language processing, consistent multi-step reasoning remains immature in most current models. The results give the industry a reference point and highlight where current LLMs fall short.
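The arithmetic behind these figures can be made concrete with a minimal sketch. It assumes accuracy is simply the share of correctly answered items, and that the 60% "passing line" is a plain threshold on that share. The report does not describe its scoring code, so the function names, the item counts, and every model entry other than Gemini 3 Pro's 62.8% are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# The 60% "passing line" cited in the report.
PASSING_LINE = 0.60


def accuracy(correct: int, total: int) -> float:
    """Return the fraction of benchmark items answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def summarize(scores: Dict[str, float],
              threshold: float = PASSING_LINE) -> Tuple[str, List[str], int]:
    """Return the top model, the models at or above the threshold,
    and the number of models below it."""
    best = max(scores, key=lambda name: scores[name])
    passed = [name for name, acc in scores.items() if acc >= threshold]
    return best, passed, len(scores) - len(passed)


# Only the Gemini 3 Pro figure comes from the report.
# The other two entries are hypothetical placeholders, not reported results.
scores = {
    "Gemini 3 Pro": 0.628,
    "hypothetical-model-a": 0.55,
    "hypothetical-model-b": 0.41,
}

best, passed, failed_count = summarize(scores)
# Share of tasks the top model did not answer correctly.
error_share = 1.0 - scores[best]
print(f"best={best} passed={passed} failed={failed_count} "
      f"error_share={error_share:.1%}")
```

A threshold comparison like this is coarse by design: it separates models that clear the line from those that do not, but it says nothing about which kinds of reasoning tasks a model fails.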

Industry Impact

Redefining Success in AI Development

The introduction of General 365 is likely to shift the industry's focus toward more rigorous reasoning benchmarks. By demonstrating that even the most advanced models like Gemini 3 Pro have significant room for improvement, Meituan is encouraging a move away from superficial performance metrics toward deeper logical consistency. This "new ruler" provides a clear target for AI researchers, emphasizing that high-quality reasoning is the next frontier for model optimization.

Encouraging Transparency through Open Source

By open-sourcing the General 365 benchmark, the Meituan LongCat team is fostering a more transparent and competitive environment. Developers can now use this tool to identify specific weaknesses in their models' reasoning chains. As more teams adopt this benchmark, it could lead to a standardized way of reporting reasoning capabilities, making it easier for the industry to track progress and for users to understand the actual limitations of the AI tools they employ.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark released by the Meituan LongCat team specifically designed to evaluate and set a new standard for the reasoning capabilities of large language models.

Question: How did top-tier models perform on this benchmark?

In tests involving 26 mainstream models, Gemini 3 Pro achieved the highest accuracy at 62.8%. However, most other models failed to reach a 60% accuracy rate, indicating that reasoning remains a major challenge for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report uses the 60% mark as a metaphorical "passing line." The fact that most models failed to reach this level suggests that current AI reasoning capabilities are not yet reliable for complex tasks defined by the General 365 benchmark.
