How We Evaluate AI Models: Beyond Standard Benchmarks
- Daniel Pelack
- Oct 4
- 3 min read
Why We Don't Trust Current Benchmarks
While we may adopt standardized benchmarks in the future, we do not believe the benchmarks currently used across the industry accurately represent any model's true capabilities or performance. The disconnect between claimed benchmark scores and real-world performance has led us to develop independent evaluation methods that better reflect actual model utility.
Our Independent Evaluation Process
Core Generalized Analysis Script
Our primary evaluation tool consists of 5,000 questions and tasks that vary in difficulty, category, and the number of steps required to complete them. This comprehensive script includes:
K-12 level content across multiple subjects
Professional-level questions and tasks
Math, science, writing, and factual assessments
Approximately equal distribution across five major disciplines (20% each)
Transitional Questions (TQs): The final 250 questions in each discipline are specifically designed to transition into the next subject area. These TQs often incorporate two or more subjects and rely on chat context to answer correctly, testing the model's ability to maintain coherence across topic changes.
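To make this structure concrete, here is a minimal sketch of how such a question set could be organized and ordered for a run. The file layout, field names, and helper functions are illustrative assumptions, not our production tooling.

```python
import json

DISCIPLINES = ["math", "science", "writing", "factual", "professional"]
QUESTIONS_PER_DISCIPLINE = 1000   # 5,000 total, ~20% per discipline
TRANSITIONAL_COUNT = 250          # final 250 per discipline are TQs

def load_questions(path):
    """Load the question set from a JSON Lines file (hypothetical format)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def build_run_order(questions):
    """Order questions so each discipline ends with its transitional questions,
    which bridge into the next discipline and depend on prior chat context."""
    ordered = []
    for discipline in DISCIPLINES:
        block = [q for q in questions if q["discipline"] == discipline]
        core = [q for q in block if not q.get("transitional", False)]
        tqs = [q for q in block if q.get("transitional", False)]
        ordered.extend(core + tqs)  # TQs stay at the end of each block
    return ordered
```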
Accuracy is evaluated separately and automatically via a dedicated script that reviews all responses against predetermined parameters and criteria. Critically, all models—ours and competitors'—are evaluated using the same script, questions, and review criteria to ensure fair comparison.
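A simplified sketch of what that automated review step might look like is shown below. The criteria format and grading rules here are illustrative assumptions rather than our actual review script; the key point is that the same grader runs over every model's output.

```python
def grade_response(response: str, criteria: dict) -> bool:
    """Check one response against predetermined criteria.

    Illustrative rules only: exact-match answers for closed questions,
    required keywords for open-ended ones.
    """
    if "expected_answer" in criteria:
        return response.strip().lower() == criteria["expected_answer"].lower()
    required = criteria.get("required_keywords", [])
    return all(kw.lower() in response.lower() for kw in required)

def score_model(responses: list[str], criteria_list: list[dict]) -> float:
    """Return accuracy; the identical grader is applied to every model."""
    correct = sum(grade_response(r, c) for r, c in zip(responses, criteria_list))
    return correct / len(criteria_list)
```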
Genre-Based Tests
These focused assessments are smaller in scope, typically ranging from 500 to 1,500 questions. This more manageable size allows us to manually review the performance of both our models and competing models. Genre-based tests concentrate on one specific type of questioning or task, enabling us to:
Test edge cases
Perform safety checks
Conduct experiments with third-party integrations like search APIs
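As a rough illustration, a genre test harness at this scale can simply record every response to disk for manual review and accept an optional third-party search function. The interfaces below (`model_fn`, `search_fn`, the output format) are hypothetical.

```python
import json
from pathlib import Path

def run_genre_test(model_fn, questions, search_fn=None, out_dir="genre_runs"):
    """Run a focused genre test (500-1,500 questions) and save every
    response so it can be reviewed manually.

    model_fn(prompt) -> response string (hypothetical interface)
    search_fn(query) -> list of result snippets, or None to disable search
    """
    Path(out_dir).mkdir(exist_ok=True)
    records = []
    for q in questions:
        prompt = q["prompt"]
        if search_fn is not None:
            snippets = search_fn(prompt)
            prompt = prompt + "\n\nSearch results:\n" + "\n".join(snippets)
        records.append({"id": q["id"], "prompt": prompt,
                        "response": model_fn(prompt)})
    with open(Path(out_dir) / "responses.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records
```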
Research Findings
Benchmark Performance vs. Real Performance
Our independent testing has revealed notable discrepancies. We consistently find that the claimed performance of top-tier models does not materialize within our evaluation framework. Models that score worse on public benchmarks often outperform their supposedly superior competitors in our tests, and models that advertise certain capabilities or metrics frequently fail to deliver on those promises when rigorously tested.
Chain-of-Thought (CoT) Model Limitations
We've observed particular issues with CoT models as context windows grow larger. After reviewing the visible reasoning traces of third-party models, we've found that these models often begin generating nonsensical reasoning chains, which then lead to inaccurate outputs. This suggests that the "thinking" process can become a liability rather than an asset under certain conditions.
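One way to quantify this effect is to re-ask the same questions with increasing amounts of irrelevant padding in the context and watch accuracy as length grows. The sketch below is a simplified experiment of that kind; the `model_fn` interface and filler-padding approach are assumptions for illustration.

```python
def accuracy_vs_context(model_fn, qa_pairs, filler, lengths=(1_000, 10_000, 50_000)):
    """Measure how accuracy changes as irrelevant context grows.

    Each question is prepended with n characters of filler text; if reasoning
    chains degrade at long contexts, accuracy should fall as n increases.
    model_fn(prompt) -> answer string (hypothetical interface).
    """
    results = {}
    for n in lengths:
        padding = (filler * (n // max(len(filler), 1) + 1))[:n]
        correct = 0
        for question, expected in qa_pairs:
            answer = model_fn(padding + "\n\n" + question)
            correct += expected.lower() in answer.lower()
        results[n] = correct / len(qa_pairs)
    return results
```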
The Search Dependency Problem
When models are stripped of their ability to search the web, we observe significantly degraded performance across all tasks, including those that do not require access to live information. We believe this is compelling evidence that these models are not as "intelligent" as the organizations that develop them claim. True intelligence should not depend on external search capabilities for tasks that require only reasoning and knowledge synthesis.
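A minimal version of this ablation runs the same offline question set through the model with its search tool enabled and then disabled, and compares the scores. The callables below are hypothetical interfaces, and the grading reuses the `score_model` sketch from earlier.

```python
def search_ablation(model_with_search, model_without_search, offline_questions, criteria):
    """Compare the same model with and without its search tool on questions
    that need no live information; a large gap suggests search dependence.

    Both callables take a prompt and return a response (hypothetical interface);
    `criteria` follows the same grading format used in the earlier sketch.
    """
    with_search = [model_with_search(q["prompt"]) for q in offline_questions]
    without_search = [model_without_search(q["prompt"]) for q in offline_questions]
    return {
        "with_search": score_model(with_search, criteria),
        "without_search": score_model(without_search, criteria),
    }
```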
Small Models, Big Performance
Our model operates at roughly 3 billion parameters, approximately 1/425th the average size of the competing models we test against. Despite this massive size difference, our model achieves similar and sometimes superior performance. This finding challenges the industry assumption that bigger always means better.
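For scale, that ratio implies the competing models we compare against average on the order of a trillion parameters (a back-of-the-envelope calculation, not a published figure):

```python
OUR_PARAMS = 3e9      # ~3 billion parameters
SIZE_RATIO = 425      # approximate average size ratio vs. competitors
print(f"Implied competitor average: {OUR_PARAMS * SIZE_RATIO:.2e} parameters")
# ~1.3e12, i.e. roughly 1.3 trillion parameters on average
```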
We've discovered that smaller, more focused models consistently outperform larger generalized ones. The reason appears to be that massive models suffer from internal conflicts and hallucinations caused by having access to vast amounts of irrelevant data. A focused architecture reduces noise and improves accuracy.
Conclusion
The AI industry's reliance on standardized benchmarks has created a misleading picture of model capabilities. Our independent evaluation methods reveal that real-world performance often contradicts benchmark claims. As we continue to develop and refine our evaluation processes, we remain committed to honest, rigorous testing that reflects actual model utility rather than optimized benchmark scores.
The future of AI evaluation must move beyond gameable metrics toward comprehensive assessments that measure true reasoning ability, consistency, and practical applicability.

Note: A more expansive version of this article with additional data and detailed analysis will be released following our product presentation, currently scheduled for October 15th, 2025.