Meituan LongCat General 365: New AI Reasoning Benchmark

The Meituan LongCat team has officially introduced General 365, a new evaluation benchmark designed to test the reasoning capabilities of large language models. In a recent assessment of 26 mainstream models, the benchmark revealed a significant performance gap across the industry. Gemini 3 Pro, currently identified as the strongest model in the test, achieved an accuracy rate of 62.8%. However, the results indicate a broader struggle within the field, as the vast majority of the 26 models tested failed to reach the 60% accuracy threshold, which is considered the passing mark. This release by Meituan's technical team establishes a new standard for measuring AI reasoning, highlighting that even top-tier models have substantial room for improvement in complex cognitive tasks.

Key Takeaways

New Benchmark Release: Meituan's LongCat team has launched "General 365," a specialized benchmark for evaluating AI reasoning.
Industry Performance Ceiling: Gemini 3 Pro emerged as the top performer among 26 mainstream models with an accuracy of 62.8%.
Widespread Failure to Meet Standards: Most models tested under the General 365 framework failed to achieve a 60% accuracy rate.
New Evaluation Standard: General 365 is positioned as a new "ruler" or scale for assessing the reasoning depth of modern AI systems.

In-Depth Analysis

The Reasoning Ceiling: Analyzing Gemini 3 Pro's Performance

The release of the General 365 benchmark by the Meituan LongCat team provides a critical look at the current state of artificial intelligence. By testing 26 of the most prominent models available today, the benchmark has established a clear performance ceiling. Gemini 3 Pro, which is noted as the strongest model currently available in this specific test suite, reached an accuracy level of 62.8%.

While 62.8% represents the pinnacle of performance within this evaluation, it also serves as a stark reminder of the limitations inherent in current large language models. The fact that the industry leader is hovering just above the 60% mark suggests that complex reasoning remains a significant challenge for even the most advanced architectures. This data point from the LongCat team indicates that while AI has made strides in generative tasks, the logical consistency and depth required to navigate the General 365 evaluation present a formidable barrier.

The 60% Threshold and the Majority Gap

One of the most significant findings from the Meituan technical team's report concerns the broader field of AI models. Of the 26 mainstream models evaluated, the vast majority were unable to reach the 60% accuracy threshold. In many academic and professional contexts, 60% is treated as the minimum "passing" grade, and the failure of most models to meet this mark highlights a systemic gap in reasoning capabilities.

This widespread inability to cross the 60% line suggests that the General 365 benchmark is designed to be exceptionally rigorous. It moves beyond simple pattern matching or information retrieval, instead focusing on the core reasoning processes that define advanced intelligence. The results imply that for the majority of mainstream AI developers, the path to achieving reliable, human-like reasoning is still in its early stages. The data provided by Meituan serves as a reality check for the industry, shifting the focus from sheer model size to the quality of logical output.
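The pass/fail reading of the 60% threshold can be sketched in a few lines of Python. This is only an illustration, not Meituan's scoring code: the function names are hypothetical, and it assumes that "reaching" the threshold means scoring at or above it. The only per-model score used is the 62.8% reported for Gemini 3 Pro, since no other individual scores are given here.

```python
PASSING_THRESHOLD = 60.0  # percent; the passing mark cited in the article


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def passes(score_pct: float, threshold: float = PASSING_THRESHOLD) -> bool:
    """Assumption: 'reaching' the threshold means score >= threshold."""
    return score_pct >= threshold


def share_below(scores: dict[str, float],
                threshold: float = PASSING_THRESHOLD) -> float:
    """Fraction of models whose score falls short of the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    failing = sum(1 for s in scores.values() if not passes(s, threshold))
    return failing / len(scores)


# The only per-model score given in the article.
reported = {"Gemini 3 Pro": 62.8}

top_score = reported["Gemini 3 Pro"]
print(passes(top_score))                        # True
print(round(top_score - PASSING_THRESHOLD, 1))  # 2.8
```

With the full table of 26 scores, `share_below` would give the fraction of models under the passing mark, which the article describes only as "the vast majority."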

Establishing General 365 as a New Industry Scale

By introducing General 365, the Meituan LongCat team is attempting to redefine how the industry measures progress. The name "General 365" may suggest a comprehensive, everyday standard of evaluation, and the benchmark aims to be a definitive "ruler" (标尺) for the AI community. Many benchmarks are now criticized for being "saturated," meaning that models score so high that the tests no longer differentiate between them; General 365 appears to offer the level of difficulty those tests lack.

The decision to publish these results, showing that most models are currently "failing," underscores a commitment to technical transparency. It also provides a baseline that the industry can use to track future improvements. As models evolve, the 37.2-percentage-point gap between the current 62.8% peak and a theoretical 100% offers a simple measure of how far next-generation reasoning engines still have to go.
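As a rough illustration of tracking progress against this baseline, the sketch below computes the remaining headroom to a perfect score and the fraction of that headroom a later score would close. Both helper names are hypothetical, and the "gap closed" ratio is an assumed way of expressing progress rather than a metric defined by the LongCat team.

```python
BASELINE_PEAK = 62.8  # percent; Gemini 3 Pro's reported General 365 accuracy


def headroom(score_pct: float, ceiling: float = 100.0) -> float:
    """Percentage points remaining between a score and the ceiling."""
    return round(ceiling - score_pct, 1)


def gap_closed(baseline_pct: float, new_pct: float,
               ceiling: float = 100.0) -> float:
    """Fraction of the baseline-to-ceiling gap closed by a newer score."""
    if baseline_pct >= ceiling:
        raise ValueError("baseline must be below the ceiling")
    return (new_pct - baseline_pct) / (ceiling - baseline_pct)


print(headroom(BASELINE_PEAK))                   # 37.2
print(gap_closed(BASELINE_PEAK, BASELINE_PEAK))  # 0.0
```

A future model's score would be passed as `new_pct`; a result of 0.5 would mean half of the current headroom had been closed.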

Industry Impact

Redefining AI Evaluation Standards

The introduction of General 365 by Meituan's LongCat team is likely to influence how AI reasoning is evaluated globally. By setting a benchmark where even the most capable models like Gemini 3 Pro score in the low 60s, Meituan is pushing the industry away from vanity metrics and toward more rigorous, high-difficulty testing. This shift is essential for identifying the true limitations of large language models and for guiding researchers toward solving the underlying problems of logical inference and complex problem-solving.

Benchmarking the Race for Advanced Reasoning

The results of the General 365 test highlight the competitive landscape of the AI industry. With 26 models tested, the benchmark provides a broad snapshot of where different developers stand. The fact that most models currently fall short of the 60% mark will likely spur a new wave of optimization aimed at the kind of reasoning General 365 measures. As developers strive to surpass the 62.8% score set by Gemini 3 Pro, the industry can expect a renewed focus on the architectural and data-driven improvements needed to deepen reasoning.

Frequently Asked Questions

Question: What is the General 365 benchmark?

Answer: General 365 is a new AI reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to serve as a rigorous standard or "ruler" for measuring the reasoning capabilities of large language models.

Question: Which model performed the best in the General 365 evaluation?

Answer: According to the results released by Meituan, Gemini 3 Pro was the top-performing model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.

Question: How did most AI models perform on this new benchmark?

Answer: The majority of the 26 mainstream models tested failed to reach the 60% accuracy mark, which is typically considered the passing threshold, indicating that reasoning remains a major challenge for current AI technology.
