
How We Evaluate AI Models: Beyond Standard Benchmarks

Why We Don't Trust Current Benchmarks


While we may evaluate our models on standardized benchmarks in the future, we do not believe the benchmarks currently used across the industry accurately represent any model's true capabilities or performance. The disconnect between claimed benchmark scores and real-world performance has led us to develop independent evaluation methods that better reflect actual model utility.


Our Independent Evaluation Process


Core Generalized Analysis Script


Our primary evaluation tool consists of 5,000 questions and tasks that vary in difficulty, category, and steps required to complete. This comprehensive script includes:


  • K-12 level content across multiple subjects

  • Professional-level questions and tasks

  • Math, science, writing, and factual assessments

  • Approximately equal distribution across five major disciplines (20% each)


Transitional Questions (TQs): The final 250 questions in each discipline are specifically designed to transition into the next subject area. These TQs often incorporate two or more subjects and rely on chat context to answer correctly, testing the model's ability to maintain coherence across topic changes.
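
To make the structure concrete, here is a minimal sketch of how a question set laid out this way might be represented and sanity-checked. The field names, discipline labels, and tolerance values are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    """One item in the generalized analysis script (illustrative fields only)."""
    prompt: str
    discipline: str                     # e.g. "math", "science", "writing"
    difficulty: str                     # e.g. "k12" or "professional"
    steps_required: int                 # rough number of steps a correct answer needs
    is_transitional: bool = False       # True for the final 250 TQs in a discipline
    blends_with: Optional[str] = None   # second subject a TQ draws on, if any

def validate_question_set(questions: list[Question], disciplines: list[str]) -> None:
    """Sanity-check the layout described above: 5,000 items, roughly 20% per
    discipline, and 250 transitional questions per discipline (the tolerance is
    an illustrative assumption)."""
    assert len(questions) == 5000, "expected 5,000 questions in total"
    for d in disciplines:
        block = [q for q in questions if q.discipline == d]
        assert abs(len(block) - 1000) <= 50, f"{d}: expected roughly 20% of the set"
        tqs = [q for q in block if q.is_transitional]
        assert len(tqs) == 250, f"{d}: expected 250 transitional questions"
```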


Accuracy is evaluated separately and automatically via a dedicated script that reviews all responses against predetermined parameters and criteria. Critically, all models—ours and competitors'—are evaluated using the same script, questions, and review criteria to ensure fair comparison.
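
For illustration, an automated review pass over the collected responses might look something like the sketch below. The JSON criteria format and the grading rule (required and forbidden terms) are assumptions for the example, not the actual parameters our script uses.

```python
import json

def score_responses(responses: dict[str, str], criteria_path: str) -> float:
    """Grade a model's responses against predetermined criteria.

    Assumes a hypothetical criteria file of the form
    {question_id: {"required": [...], "forbidden": [...]}}.
    """
    with open(criteria_path) as f:
        criteria = json.load(f)

    correct = 0
    for qid, answer in responses.items():
        spec = criteria[qid]
        text = answer.lower()
        meets_required = all(term.lower() in text for term in spec.get("required", []))
        hits_forbidden = any(term.lower() in text for term in spec.get("forbidden", []))
        if meets_required and not hits_forbidden:
            correct += 1
    return correct / len(responses)
```

Because the same scoring function and criteria file are applied to every model's responses, comparisons stay consistent across our model and competitors'.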


Genre-Based Tests


These focused assessments are smaller in scope, typically ranging from 500 to 1,500 questions. This more manageable size allows us to manually review the performance of both our models and competing models. Genre-based tests concentrate on one specific type of questioning or task, enabling us to:


  • Test edge cases

  • Perform safety checks

  • Conduct experiments with third-party integrations like search APIs
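
As a rough sketch of what one of these harnesses might look like, the example below runs a genre-specific question list against a model and optionally routes each prompt through a third-party search integration first. The callable names and prompt format are placeholders, not our internal tooling.

```python
from typing import Callable, Optional

# Placeholder signatures: ask_model sends a prompt to the model under test and
# returns its reply; search_fn is an optional third-party search integration.
AskModel = Callable[[str], str]
SearchFn = Callable[[str], str]

def run_genre_test(questions: list[str],
                   ask_model: AskModel,
                   search_fn: Optional[SearchFn] = None) -> list[dict]:
    """Run a focused genre-based test and collect transcripts for manual review."""
    transcripts = []
    for question in questions:
        prompt = question
        if search_fn is not None:
            # When experimenting with a search API, prepend retrieved context.
            context = search_fn(question)
            prompt = f"Context:\n{context}\n\nQuestion: {question}"
        transcripts.append({"question": question, "response": ask_model(prompt)})
    return transcripts
```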


Research Findings


Benchmark Performance vs. Real Performance


Our independent testing has revealed notable discrepancies between claimed and observed performance. We consistently find that the advertised performance of top-tier models does not materialize within our evaluation framework. Models that perform worse on public benchmarks often outperform their supposedly superior competitors in our tests, and models that claim certain capabilities or metrics frequently fail to deliver on those promises when rigorously tested.


Chain-of-Thought (CoT) Model Limitations


We've observed particular issues with CoT models as context windows grow larger. After reviewing the exposed reasoning traces of third-party models, we've found that these models often begin generating nonsensical reasoning chains, which then lead to inaccurate outputs. This suggests that the "thinking" process can become a liability rather than an asset under certain conditions.
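
One simple way to probe this effect is to re-ask a fixed question with progressively larger amounts of padding context and record whether the answer, and the reasoning trace where a provider exposes one, remains sound. The harness below is a hypothetical sketch; `ask_model` and its return format are assumptions.

```python
def probe_cot_degradation(ask_model, question: str, expected: str, filler: str,
                          context_sizes=(1_000, 10_000, 50_000, 100_000)):
    """Re-ask the same question with progressively more padding context.

    `ask_model` is assumed to return (answer, reasoning_trace); the trace may be
    None for providers that do not expose it. Both the callable and its return
    shape are hypothetical.
    """
    results = {}
    for n_chars in context_sizes:
        padded_prompt = filler[:n_chars] + "\n\n" + question
        answer, reasoning = ask_model(padded_prompt)
        results[n_chars] = {
            "correct": expected.lower() in answer.lower(),
            "reasoning_chars": len(reasoning or ""),
        }
    return results
```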


The Search Dependency Problem


When models are stripped of their ability to search the web, we observe significantly degraded performance across all tasks—including those that don't require access to live information. We believe this is compelling evidence that these models are not as "intelligent" as the organizations developing them claim. True intelligence should not depend on external search capabilities for tasks that require only reasoning and knowledge synthesis.
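
Isolating this effect is straightforward: run the same set of questions that require no live information twice, once with the model's search tool enabled and once with it disabled, and compare the scores. The sketch below assumes hypothetical callables for the two configurations and for grading.

```python
def search_ablation(questions, answer_with_search, answer_without_search, score):
    """Compare accuracy on questions needing no live information, with and
    without web search enabled (the three callables are hypothetical; `score`
    returns 1.0 for a correct answer and 0.0 otherwise)."""
    with_search = sum(score(q, answer_with_search(q)) for q in questions)
    without_search = sum(score(q, answer_without_search(q)) for q in questions)
    n = len(questions)
    return {"with_search": with_search / n, "without_search": without_search / n}
```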


Small Models, Big Performance


Our model operates at roughly 3 billion parameters—approximately 425 times smaller than competing models on average. Despite this massive size difference, our model achieves similar and sometimes superior performance. This finding challenges the industry assumption that bigger always means better.
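
(Taken at face value, those two figures place the average competing model at roughly 3 billion × 425 ≈ 1.3 trillion parameters.)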


We've discovered that smaller, more focused models consistently outperform larger generalized ones. The reason appears to be that massive models suffer from internal conflicts and hallucinations caused by having access to vast amounts of irrelevant data. A focused architecture reduces noise and improves accuracy.


Conclusion


The AI industry's reliance on standardized benchmarks has created a misleading picture of model capabilities. Our independent evaluation methods reveal that real-world performance often contradicts benchmark claims. As we continue to develop and refine our evaluation processes, we remain committed to honest, rigorous testing that reflects actual model utility rather than optimized benchmark scores.


The future of AI evaluation must move beyond gameable metrics toward comprehensive assessments that measure true reasoning ability, consistency, and practical applicability.


Note: A more expansive version of this article, with additional data and detailed analysis, will be released following our product presentation, currently scheduled for October 15th, 2025.

 
 
 
