Meituan LongCat Unveils General 365: A Rigorous New Standard for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning Models


Meituan's LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of artificial intelligence models. The initial evaluation covered 26 mainstream models. The top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, and the vast majority of the models tested fell short of 60% accuracy, which is treated as a basic passing mark. With this release, Meituan aims to provide a more demanding measure of how well modern AI handles complex reasoning tasks. The results show that even the most advanced systems currently struggle with General 365.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has introduced General 365, a specialized evaluation tool for AI reasoning.
  • Industry Performance Gap: Out of 26 mainstream models tested, most failed to reach a 60% accuracy rate.
  • Top Performer Results: Gemini 3 Pro leads the current rankings but only managed a score of 62.8%.
  • A New Standard: General 365 is positioned as a "new ruler" or benchmark for measuring the true reasoning depth of large language models.

In-Depth Analysis

The Challenge of General 365

With General 365, the Meituan LongCat team tested 26 of the most prominent models currently available, giving a broad snapshot of the industry's reasoning capabilities. The core finding is that the majority of these models cannot reach 60% accuracy, which suggests that General 365 is considerably harder than benchmarks on which these models already score well. The 60% "passing grade" serves as a useful reference point: it suggests that current models still have difficulty with complex, multi-step reasoning tasks that go beyond simple pattern matching or data retrieval.

Benchmarking the Best: Gemini 3 Pro's Performance

One of the most notable aspects of the General 365 release is the performance of Gemini 3 Pro. Despite being recognized as one of the most powerful models globally, it achieved an accuracy of 62.8%. This score places it at the top of the 26 models tested, yet it clears the 60% threshold by only 2.8 percentage points. That narrow margin shows that even the industry leaders have substantial room for improvement. The fact that the "strongest" model sits only slightly above the passing mark underscores the difficulty of the reasoning tasks included in General 365 and gives a realistic picture of the current state of AI reasoning proficiency.
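The comparison against the passing mark is simple arithmetic, and a minimal Python sketch makes it concrete. The article does not say how General 365 scores answers, so the sketch assumes accuracy is the share of items answered correctly. The function names `accuracy_percent` and `margin_over_threshold` are hypothetical and are not part of any Meituan tooling. The only score used is the 62.8% reported for Gemini 3 Pro.

```python
# Assumption: accuracy = correct answers / total items, expressed in percent.
# The article does not describe General 365's actual scoring procedure.

PASSING_THRESHOLD = 60.0  # percent; the "passing mark" cited in the article


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def margin_over_threshold(accuracy: float,
                          threshold: float = PASSING_THRESHOLD) -> float:
    """Return percentage points above (positive) or below (negative) the threshold."""
    return round(accuracy - threshold, 1)


# The only per-model score reported in the article.
gemini_3_pro_accuracy = 62.8
print(margin_over_threshold(gemini_3_pro_accuracy))  # 2.8
```

A negative result from `margin_over_threshold` would mean the model falls short of the passing mark, which the article says is the case for most of the 26 models tested.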

Redefining Evaluation Metrics

Meituan's decision to release General 365 publicly (the release is referred to as "Open General 365") indicates a move toward standardized, transparent evaluation. By establishing a "new ruler" (标尺), the LongCat team is challenging the AI community to look beyond high scores on older, possibly saturated, benchmarks. The focus is on "General" reasoning, implying broad applicability across different domains. The results are consistent with the view that reasoning ability does not necessarily improve at the same rate as model size and complexity, which is why tools like General 365 are needed to expose these specific weaknesses.

Industry Impact

The introduction of General 365 is likely to influence how AI models are developed and marketed. For the AI industry, the benchmark serves as a wake-up call, showing that current "state-of-the-art" models still struggle with reasoning when held to a higher standard. It shifts attention from broad headline performance to reasoning accuracy specifically. By setting a benchmark on which most models currently fall short, Meituan has also created a new target for developers. This will likely drive research aimed at closing the reasoning gap, as companies work to move their models past the 60% mark and eventually beyond the 62.8% high mark set by Gemini 3 Pro.

Frequently Asked Questions

Question: What is Meituan's General 365?

General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the reasoning capabilities of mainstream AI models and is positioned as a rigorous new standard for the industry.

Question: How did mainstream AI models perform on the General 365 benchmark?

In a test of 26 mainstream models, most failed to reach a 60% accuracy rate. The highest-scoring model, Gemini 3 Pro, achieved an accuracy of 62.8%, indicating that the benchmark is highly challenging for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report notes that most models failed to reach the 60% mark, which is often viewed as a basic "passing" threshold. This highlights a significant gap in the reasoning abilities of current large language models when faced with the General 365 evaluation criteria.
