
Meituan LongCat Releases General 365 Reasoning Benchmark: Top Models Struggle to Surpass 63% Accuracy
The Meituan LongCat team has open-sourced General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models. In an assessment of 26 mainstream AI models, the results show that even leading systems fall well short of mastering complex reasoning. Gemini 3 Pro, the top-performing model in this evaluation, achieved an accuracy of only 62.8%. The vast majority of the models tested failed to reach 60% accuracy, the level treated as the passing mark for this benchmark. The release aims to establish a more rigorous standard for AI reasoning and exposes the current limitations of even the most advanced models in the industry.
Key Takeaways
- New Reasoning Benchmark: Meituan's LongCat team has officially released and open-sourced "General 365," a specialized tool for evaluating AI reasoning.
- Comprehensive Testing: The benchmark was used to assess 26 mainstream large language models to determine their logical and reasoning proficiency.
- Performance Ceiling: Gemini 3 Pro emerged as the leader in the test, yet it only managed an accuracy rate of 62.8%.
- Widespread Underperformance: Most models involved in the study were unable to reach the 60% passing threshold, indicating a significant challenge in current AI reasoning capabilities.
In-Depth Analysis
The Emergence of General 365: A New Standard in Reasoning
The Meituan LongCat team has introduced General 365 at a critical juncture in the evolution of artificial intelligence. As large language models (LLMs) become increasingly integrated into complex workflows, the need for a rigorous, specialized evaluation of their reasoning capabilities has become paramount. By open-sourcing General 365, the LongCat team is providing the global AI community with a new "yardstick" to measure progress. This benchmark is specifically designed to move beyond simple knowledge retrieval and focus on the intricate logical processes that define true reasoning.
Testing 26 mainstream models provides a broad cross-section of the current AI landscape. Because of this breadth, the findings are less likely to reflect a single architecture or provider and more likely to reflect the general state of the industry. The results suggest that General 365 sets a high bar, challenging models in ways that existing benchmarks might not and revealing the limits of their reasoning abilities.
Analyzing the Performance Gap and the 60% Threshold
The data released by the LongCat team shows how far current models remain from mastering reasoning. Gemini 3 Pro, a model recognized for its advanced capabilities, achieved only 62.8% accuracy. That score is the current "ceiling" on General 365 among the 26 models tested, which suggests that even the industry's most sophisticated models have a long way to go on complex reasoning tasks.
Perhaps more concerning is the observation that the vast majority of the 26 models tested could not even reach the 60% mark. In many academic and professional contexts, 60% is considered the minimum passing grade. The failure of most mainstream models to hit this target on General 365 indicates that the benchmark has successfully identified a widespread limitation in current LLM development. This "60% barrier" serves as a clear indicator that while models are becoming more fluent and knowledgeable, their ability to consistently apply logic and reason through complex problems remains a significant hurdle.
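To make the arithmetic behind these figures concrete, here is a minimal sketch of how accuracy and a 60% pass mark could be computed from graded answers. It is not the LongCat team's evaluation code. The function names, the model names, and the toy data are all hypothetical placeholders, and the sketch assumes each answer is simply graded correct or incorrect.

```python
# Illustrative sketch only: not the official General 365 scoring code.
PASS_THRESHOLD = 0.60  # the 60% mark discussed in the article


def accuracy(graded):
    """Return the fraction of answers graded correct (True)."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def summarize(results, threshold=PASS_THRESHOLD):
    """Compute per-model accuracy, the top scorer, and models below the threshold."""
    scores = {name: accuracy(graded) for name, graded in results.items()}
    best = max(scores, key=scores.get)
    below = sorted(name for name, score in scores.items() if score < threshold)
    return {"scores": scores, "best": best, "below_threshold": below}


# Hypothetical placeholder data; these are NOT real benchmark results.
toy_results = {
    "model_a": [True, True, False, False],   # 2 of 4 correct
    "model_b": [True, False, False, False],  # 1 of 4 correct
    "model_c": [True, True, True, False],    # 3 of 4 correct
}

summary = summarize(toy_results)
for name, score in summary["scores"].items():
    status = "pass" if score >= PASS_THRESHOLD else "below threshold"
    print(f"{name}: {score:.1%} ({status})")
```

In this toy data, one model clears the threshold and two do not, which mirrors the shape of the reported result: a top scorer just above 60% and most models below it.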
Industry Impact
The introduction of General 365 is poised to have a lasting impact on the AI industry by shifting the focus of model evaluation. For a long time, the industry has prioritized scale and general knowledge, but Meituan's new benchmark highlights that reasoning is the next major frontier. By making General 365 open-source, the LongCat team is encouraging transparency and healthy competition among AI developers.
This benchmark provides a clear target for research teams worldwide. The reported data points, namely the 62.8% top score and the fact that most models scored below 60%, provide a baseline that will likely drive future work on model architecture and training methodologies. As developers try to improve on the scores recorded on General 365, we can expect a renewed focus on logical consistency and multi-step reasoning, which are essential for the next generation of AI applications.
Frequently Asked Questions
Question: What is General 365 and who developed it?
General 365 is a reasoning evaluation benchmark developed and open-sourced by the Meituan LongCat team. It is designed to provide a rigorous standard for testing the logical reasoning capabilities of large language models.
Question: How did the top AI models perform on this benchmark?
According to the test results of 26 mainstream models, Gemini 3 Pro was the top performer with an accuracy of 62.8%. However, the majority of the other models tested failed to reach the 60% accuracy threshold, highlighting a general struggle with the reasoning tasks presented in the benchmark.
Question: Why is General 365 considered a "new yardstick" for the industry?
It is considered a new yardstick because it sets a difficulty level that current mainstream models struggle to meet. By focusing specifically on reasoning and showing that most models score below 60%, it establishes a more demanding standard for evaluating the reasoning ability of AI systems.


