Meituan LongCat Team Releases General 365 Benchmark Revealing Reasoning Gaps in Leading AI Models
Industry News, Meituan, AI Benchmarking, Reasoning Models

The Meituan LongCat team has officially introduced General 365, a new evaluation benchmark designed to test the reasoning capabilities of large language models. In a recent assessment of 26 mainstream models, the benchmark revealed a significant performance gap across the industry. Gemini 3 Pro, currently identified as the strongest model in the test, achieved an accuracy rate of 62.8%. However, the results indicate a broader struggle within the field, as the vast majority of the 26 models tested failed to reach the 60% accuracy threshold, which is considered the passing mark. This release by Meituan's technical team establishes a new standard for measuring AI reasoning, highlighting that even top-tier models have substantial room for improvement in complex cognitive tasks.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has launched "General 365," a specialized benchmark for evaluating AI reasoning.
  • Industry Performance Ceiling: Gemini 3 Pro emerged as the top performer among 26 mainstream models with an accuracy of 62.8%.
  • Widespread Failure to Meet Standards: Most models tested under the General 365 framework failed to achieve a 60% accuracy rate.
  • New Evaluation Standard: General 365 is positioned as a new "ruler" or scale for assessing the reasoning depth of modern AI systems.

In-Depth Analysis

The Reasoning Ceiling: Analyzing Gemini 3 Pro's Performance

The release of the General 365 benchmark by the Meituan LongCat team provides a critical look at the current state of artificial intelligence. By testing 26 of the most prominent models available today, the benchmark has established a clear performance ceiling. Gemini 3 Pro, which is noted as the strongest model currently available in this specific test suite, reached an accuracy level of 62.8%.

While 62.8% represents the pinnacle of performance within this evaluation, it also serves as a stark reminder of the limitations inherent in current large language models. The fact that the industry leader is hovering just above the 60% mark suggests that complex reasoning remains a significant challenge for even the most advanced architectures. This data point from the LongCat team indicates that while AI has made strides in generative tasks, the logical consistency and depth required to navigate the General 365 evaluation present a formidable barrier.

The 60% Threshold and the Majority Gap

One of the most significant findings from the Meituan technical team's report concerns the broader field of AI models. Of the 26 mainstream models evaluated, the vast majority did not reach the 60% accuracy threshold. In many academic and professional contexts, 60% is treated as the minimum "passing" grade, so the failure of most models to meet this mark points to a systemic gap in reasoning capabilities.

This widespread inability to cross the 60% line suggests that the General 365 benchmark is designed to be exceptionally rigorous. It moves beyond simple pattern matching or information retrieval, instead focusing on the core reasoning processes that define advanced intelligence. The results imply that for the majority of mainstream AI developers, the path to achieving reliable, human-like reasoning is still in its early stages. The data provided by Meituan serves as a reality check for the industry, shifting the focus from sheer model size to the quality of logical output.
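As a rough illustration of how the accuracy figures and the 60% passing line discussed above relate, the following minimal Python sketch computes per-model accuracy and counts how many models clear the threshold. It is not Meituan's evaluation code: the `ModelResult` class, the `summarize` helper, the model names, and the item counts are all hypothetical placeholders, and the article does not state how many items General 365 contains or how answers are graded.

```python
from dataclasses import dataclass

# The 60% "passing" line cited in the article.
PASS_THRESHOLD = 0.60


@dataclass
class ModelResult:
    """Hypothetical container for one model's graded answers."""
    name: str
    correct: int  # items answered correctly
    total: int    # items attempted

    @property
    def accuracy(self) -> float:
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.correct / self.total

    @property
    def passed(self) -> bool:
        # A model "passes" when its accuracy is at or above the threshold.
        return self.accuracy >= PASS_THRESHOLD


def summarize(results):
    """Return (best-scoring result, number of models at or above the threshold)."""
    best = max(results, key=lambda r: r.accuracy)
    n_pass = sum(r.passed for r in results)
    return best, n_pass


# Toy data only: names and counts are invented, not taken from the benchmark.
demo = [
    ModelResult("model_a", 7, 10),
    ModelResult("model_b", 5, 10),
    ModelResult("model_c", 4, 10),
]
best, n_pass = summarize(demo)
print(f"best={best.name} accuracy={best.accuracy:.1%} passed={n_pass}/{len(demo)}")
```

With real per-model results in place of the toy list, the same summary would yield the two headline numbers the report quotes: the top accuracy and the count of models that reach 60%.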

Establishing General 365 as a New Industry Scale

By introducing General 365, the Meituan LongCat team is attempting to redefine how the industry measures progress. The name "General 365" suggests a comprehensive and perhaps daily-standard approach to evaluation, aiming to be a definitive "ruler" (标尺) for the AI community. Many existing benchmarks are criticized for being "saturated," meaning models score so high that the tests no longer differentiate between them. General 365, by contrast, appears to offer a much-needed level of difficulty.

The decision to publish these results, showing that most models are currently "failing," underscores a commitment to technical transparency. It provides a baseline that the industry can use to track future improvements. As models evolve, the 37.2-percentage-point gap between the current 62.8% peak and a theoretical 100% offers a simple measure of the headroom left for next-generation reasoning engines.

Industry Impact

Redefining AI Evaluation Standards

The introduction of General 365 by Meituan's LongCat team is likely to influence how AI reasoning is evaluated globally. By setting a benchmark where even the most capable models like Gemini 3 Pro score in the low 60s, Meituan is pushing the industry away from vanity metrics and toward more rigorous, high-difficulty testing. This shift is essential for identifying the true limitations of large language models and for guiding researchers toward solving the underlying problems of logical inference and complex problem-solving.

Benchmarking the Race for Advanced Reasoning

The results of the General 365 test highlight the competitive landscape of the AI industry. With 26 models tested, the benchmark provides a broad snapshot of where different developers stand. The fact that most models currently fall short of the 60% mark will likely spur a new wave of optimization focused specifically on the criteria set by General 365. As developers strive to surpass the 62.8% score posted by Gemini 3 Pro, the industry can expect a renewed focus on the architectural and data-driven improvements necessary to enhance reasoning depth.

Frequently Asked Questions

Question: What is the General 365 benchmark?

Answer: General 365 is a new AI reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to serve as a rigorous standard or "ruler" for measuring the reasoning capabilities of large language models.

Question: Which model performed the best in the General 365 evaluation?

Answer: According to the results released by Meituan, Gemini 3 Pro was the top-performing model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.

Question: How did most AI models perform on this new benchmark?

Answer: The majority of the 26 mainstream models tested failed to reach the 60% accuracy mark, which is typically considered the passing threshold, indicating that reasoning remains a major challenge for current AI technology.
