
How We Evaluate AI Models: Fighting Benchmark Gaming with Independent Testing

The Crisis of Trust in AI Benchmarks


The AI industry has a dirty secret: the benchmarks everyone uses to compare models are fundamentally broken. Not because the tests themselves are poorly designed, but because they've become targets for optimization rather than measures of true capability.

When benchmark questions and tasks are publicly available, they stop measuring generalization and start measuring memorization. Models are deliberately overfitted to these specific test sets, inflating scores while real-world performance remains mediocre. The result is a marketplace flooded with misleading claims where benchmark scores have become marketing tools rather than meaningful metrics.


At Nexus, we've taken a different approach. We don't trust current industry benchmarks, and we've built an independent evaluation system specifically designed to combat the benchmark gaming that has corrupted AI model assessment.


The Overfitting Problem: How Benchmarks Became Meaningless


The Public Benchmark Trap

All of the major AI benchmarks in use today—MMLU, HumanEval, GSM8K, and others—share a fatal flaw: their test sets are public. This creates an irresistible incentive for model developers to optimize specifically for these known questions.


The process is straightforward:

  1. Benchmark questions are publicly available

  2. Companies include these exact question types in training data

  3. Models learn to pattern-match against known test cases

  4. Benchmark scores increase dramatically

  5. Real-world performance remains stagnant


This isn't speculation. We've observed this pattern repeatedly in our independent testing. Models claiming massive benchmark advantages consistently fail to demonstrate those capabilities when tested on novel questions they haven't been optimized for.


Evidence of Benchmark Gaming

Consider a recent example: Grok Code's published benchmark results showed it beating Gemini by enormous margins on coding tasks. The marketing materials showed impressive graphs with substantial performance gaps. Yet when we tested both models using our independent evaluation, Grok Code barely outperformed our own model—and we don't even focus on coding in our training datasets.


This discrepancy is not an anomaly. It's evidence of systematic overfitting. If a model truly possessed superior coding intelligence, that advantage should manifest across all coding tasks, not just public benchmark questions.


We've observed similar patterns across multiple model comparisons:


  • Models claiming state-of-the-art reasoning often produce illogical outputs on novel problems

  • Top-tier models with impressive MMLU scores frequently fail basic comprehension tasks

  • "Superior" models consistently underperform their benchmark predictions in our testing


The disconnect between claimed capability and actual performance has become so severe that public benchmarks have lost their value as evaluation tools.


Our Solution: Closed, Comprehensive Evaluation


The Core Evaluation Framework


Our primary evaluation system consists of 5,000 carefully designed questions and tasks that vary systematically across multiple dimensions:


Difficulty Distribution:

  • K-12 level content across all major subjects

  • Professional-level questions requiring domain expertise

  • Graduate-level reasoning tasks

  • Edge cases designed to test true understanding


Category Distribution:


  • Math and computational tasks

  • Science questions across physics, chemistry, biology

  • Writing assessments for coherence and accuracy

  • Factual knowledge retrieval

  • Reasoning and logic problems


Subject Balance: Approximately 20% of questions fall into each of five major disciplines, ensuring no single domain dominates the evaluation. This balance prevents models from achieving high scores through narrow specialization.
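
As a rough illustration of how this balance can be checked, the sketch below tallies a hypothetical question set by discipline and flags any category drifting from the 20% target. The record shape, category labels, and tolerance are assumptions for the example, not our internal tooling.

```python
from collections import Counter

# Hypothetical question records; only the "category" field matters for this check.
questions = [
    {"question_id": i, "category": cat}
    for i, cat in enumerate(
        ["Math", "Science", "Writing", "Factual", "Reasoning"] * 1000
    )
]

TARGET_SHARE = 0.20   # ~20% of questions per discipline
TOLERANCE = 0.02      # allowable drift; an assumed value

counts = Counter(q["category"] for q in questions)
total = sum(counts.values())

for category, count in sorted(counts.items()):
    share = count / total
    status = "OK" if abs(share - TARGET_SHARE) <= TOLERANCE else "IMBALANCED"
    print(f"{category:10s} {count:5d} ({share:.1%}) {status}")
```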


Transitional Questions: Testing Context Coherence

The final 250 questions in each discipline are Transitional Questions (TQs)—specially designed to bridge subject areas and test contextual understanding.


Key characteristics of TQs:

  • Multi-subject integration: Require knowledge from two or more disciplines simultaneously

  • Context dependency: Rely on information from previous questions in the conversation history

  • Coherence testing: Evaluate whether models maintain logical consistency across topic shifts

  • Real-world simulation: Mirror how humans actually use AI—jumping between topics within a single conversation


For example, a TQ might start with physics concepts, incorporate mathematical calculations, and require writing a technical explanation—all while referencing specific details from earlier questions in the evaluation session. This design catches models that perform well on isolated questions but struggle with sustained, contextual reasoning—a critical capability for real-world applications.
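
To make the idea concrete, here is one way a TQ record could be represented. The field names (`disciplines`, `depends_on`) and values are illustrative assumptions, not our internal schema.

```python
# Illustrative structure for a Transitional Question (TQ): it spans multiple
# disciplines and references earlier questions in the same evaluation session.
transitional_question = {
    "question_id": 4751,
    "disciplines": ["Physics", "Math", "Writing"],  # multi-subject integration
    "depends_on": [4703, 4712],                     # context dependency on earlier questions
    "prompt": (
        "Using the circuit described earlier in this session, calculate the "
        "current for the stated voltage and resistance, then write a short "
        "technical explanation of the result."
    ),
    "evaluation_focus": ["coherence", "context_retention", "correct_calculation"],
}
```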


The Closed System Advantage

Here's the crucial element that makes our evaluation system resistant to gaming: it's completely closed and anonymous.


No one knows:

  • What questions are in our evaluation set

  • How those questions are structured or phrased

  • What the distribution of difficulty levels is

  • Who we are or what organization is conducting these evaluations


Since we maintain complete control over our training data and know exactly what was fed into our model during development, we can guarantee that our evaluation questions were never part of the training set. This ensures we're testing true generalization rather than memorization.


Competing models have no ability to overfit to our evaluation because they don't know it exists. They can't optimize for questions they've never seen. This asymmetry allows us to measure actual intelligence rather than benchmark-specific pattern matching.


The Accuracy Evaluation System: Beyond Simple Scripts

Why Traditional Scripts Don't Work

Initially, many assume accuracy evaluation could be handled with a simple programmatic script—check if the output matches the expected answer, mark it correct or incorrect, move on.


This approach fails immediately because of output variability. Even when models produce correct answers, they structure those answers differently every time:

  • Different sentence structures

  • Varied explanations

  • Additional context or reasoning

  • Alternative but equivalent phrasings


A rigid script would mark many correct answers as failures simply because they don't match a predefined string exactly. This approach is fundamentally incompatible with evaluating natural language.
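
The sketch below illustrates the failure mode. A naive exact-match check rejects answers that are correct but phrased differently, and simple normalization only goes so far; the helper and examples are illustrative.

```python
def exact_match(model_output: str, expected: str) -> bool:
    """Naive check: mark correct only if the strings match exactly."""
    return model_output.strip() == expected.strip()

expected = "105"
outputs = [
    "105",
    "The answer is 105.",
    "one hundred five",
    "1.05 × 10²",
]

for out in outputs:
    verdict = "Correct" if exact_match(out, expected) else "Incorrect"
    print(f"{out!r:30s} -> {verdict}")

# Only the first output passes, even though all four are equivalent.
# Lowercasing or stripping punctuation helps slightly, but recognizing
# "one hundred five" or scientific notation requires actually understanding
# the answer -- which is why a rigid script is not enough.
```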


The Accuracy LLM: Intelligent Evaluation

Our solution is an Accuracy LLM—a specialized language model dedicated entirely to evaluating other models' outputs. This is not a general-purpose model; it's been designed and configured specifically for rigorous, consistent evaluation.


The Accuracy LLM operates with:

  • Predefined evaluation parameters: Specific criteria it must check for each question type

  • Structured review steps: A systematic process it follows to validate outputs

  • Search API integration: Access to specialized search APIs to verify factual claims

  • Known answer references: For certain question types, predefined correct answers to compare against
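
As a hedged illustration of how such an evaluator can be driven, the sketch below assembles an evaluation prompt from the elements listed above. The template wording, parameter names, and criteria are hypothetical, not our production configuration.

```python
def build_evaluation_prompt(question: str,
                            model_output: str,
                            criteria: list[str],
                            known_answer: str | None = None) -> str:
    """Assemble a prompt for a dedicated evaluator model (illustrative only)."""
    lines = [
        "You are an evaluation model. Judge the response strictly against the criteria.",
        f"Question: {question}",
        f"Candidate response: {model_output}",
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    if known_answer is not None:
        lines.append(f"Reference answer: {known_answer}")
    lines.append('Return exactly one verdict, "Correct" or "Incorrect", with a one-sentence justification.')
    return "\n".join(lines)

prompt = build_evaluation_prompt(
    question="Explain Ohm's Law.",
    model_output="Ohm's Law states that voltage equals current times resistance, V = IR.",
    criteria=[
        "States the relationship V = IR",
        "Contains no factual errors",
        "Addresses the question asked",
    ],
    known_answer="V = IR, where voltage is directly proportional to current and resistance",
)
print(prompt)
```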


How the Accuracy LLM Works

When evaluating a model's response, the Accuracy LLM follows a multi-step process:


Step 1: Parse the Question and Response

  • Identifies the question type (math, factual, reasoning, etc.)

  • Extracts the core claim or answer from the model's output

  • Determines what criteria must be satisfied for correctness

Step 2: Fact Verification (when applicable)

  • Uses specialized search APIs to verify factual claims

  • Cross-references information against authoritative sources

  • Checks for internal consistency in the response

Step 3: Answer Comparison (when applicable)

  • For questions with predetermined answers (e.g., math problems), compares the model's answer against the known correct answer

  • Accounts for equivalent formulations (e.g., "105" = "one hundred five" = "1.05 × 10²")

  • Identifies if the correct answer is present even if embedded in additional explanation

Step 4: Quality Assessment

  • Evaluates whether the response actually addresses the question

  • Checks for logical coherence

  • Assesses completeness of the answer

Step 5: Generate Structured Output

The Accuracy LLM produces output in a specific format with the following structure:

  • Model identifier: Model Name in double parentheses

  • Question number: Question ID in double curly braces

  • Original output: The model's actual response in triple square brackets

  • Expected answer: The correct answer in double square brackets

  • Evaluation result: Either "Correct" or "Incorrect" in double angle brackets


Example of structured output (rendered here with human-readable labels; in the raw format, each field is wrapped in the delimiters listed above):

Model: Gemini-Pro | Question: 0847


Original Model Output: "Ohm's Law states that voltage equals current times resistance, or V = IR. This means that if you increase the resistance in a circuit while keeping voltage constant, current will decrease proportionally."

Proper Answer: "Ohm's Law: V = IR, where voltage is directly proportional to current and resistance"

Evaluation: Correct


This structured format enables automated processing while preserving the full context for manual review if needed.


The Extraction and Analysis Pipeline

Automated Processing


The structured output from the Accuracy LLM flows into an extraction script that processes evaluation results at scale.


The extraction script:

  1. Parses each evaluation output into its component parts

  2. Extracts correctness indicators to determine if responses were correct or incorrect

  3. Calculates accuracy percentages by dividing correct responses by total questions

  4. Generates performance reports broken down by category, difficulty, and question type

  5. Creates JSON files containing all model outputs for archival and review
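
Assuming the delimiter format described earlier (model in double parentheses, question ID in double curly braces, output in triple square brackets, expected answer in double square brackets, verdict in double angle brackets), a minimal extraction sketch might look like this. The regular expression and helper names are illustrative, not our production script.

```python
import re

RECORD_PATTERN = re.compile(
    r"\(\((?P<model>.*?)\)\)\s*"             # model name in double parentheses
    r"\{\{(?P<question_id>.*?)\}\}\s*"       # question ID in double curly braces
    r"\[\[\[(?P<model_output>.*?)\]\]\]\s*"  # original output in triple square brackets
    r"\[\[(?P<expected>.*?)\]\]\s*"          # expected answer in double square brackets
    r"<<(?P<verdict>Correct|Incorrect)>>",   # evaluation result in double angle brackets
    re.DOTALL,
)

def parse_records(raw_text: str) -> list[dict]:
    """Extract structured evaluation records from the Accuracy LLM's output."""
    return [m.groupdict() for m in RECORD_PATTERN.finditer(raw_text)]

def accuracy_by_model(records: list[dict]) -> dict[str, float]:
    """Percentage of 'Correct' verdicts per evaluated model."""
    totals: dict[str, list[int]] = {}
    for rec in records:
        correct, seen = totals.setdefault(rec["model"], [0, 0])
        totals[rec["model"]] = [correct + (rec["verdict"] == "Correct"), seen + 1]
    return {model: 100.0 * c / n for model, (c, n) in totals.items()}

sample = (
    "((Gemini-Pro)) {{0847}} "
    "[[[Ohm's Law states that voltage equals current times resistance, or V = IR.]]] "
    "[[Ohm's Law: V = IR]] <<Correct>>"
)
print(accuracy_by_model(parse_records(sample)))  # {'Gemini-Pro': 100.0}
```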


JSON Output Structure


All evaluated outputs are preserved in JSON format with the following fields:

  • model: The name of the model being evaluated (e.g., "CompetitorX")

  • question_id: Unique identifier for each question (e.g., 847)

  • category: Subject area (e.g., "Physics")

  • difficulty: Difficulty level (e.g., "K12")

  • question: The actual question text (e.g., "Explain Ohm's Law")

  • model_output: The model's complete response (e.g., "Ohm's Law states that...")

  • expected_answer: The correct answer (e.g., "V = IR...")

  • accuracy_evaluation: Result marking (e.g., "Correct")

  • accuracy_llm_reasoning: Explanation of why it was marked correct/incorrect (e.g., "Response correctly identifies...")

  • timestamp: When the evaluation occurred (e.g., "2025-10-15T14:23:11Z")
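
For concreteness, the snippet below assembles and serializes a single record with these fields. The values come from the examples above; the surrounding code is a sketch rather than our actual pipeline.

```python
import json
from datetime import datetime, timezone

record = {
    "model": "CompetitorX",
    "question_id": 847,
    "category": "Physics",
    "difficulty": "K12",
    "question": "Explain Ohm's Law",
    "model_output": "Ohm's Law states that voltage equals current times resistance...",
    "expected_answer": "V = IR, where voltage is directly proportional to current and resistance",
    "accuracy_evaluation": "Correct",
    "accuracy_llm_reasoning": "Response correctly identifies the V = IR relationship.",
    "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
}

# One object per evaluated response; full runs are archived as arrays
# (or line-delimited JSON) for later review.
print(json.dumps(record, indent=2))
```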


This structured data enables:

  • Long-term performance tracking

  • Anomaly detection

  • Category-specific analysis

  • Manual spot-checking for quality assurance


Manual Review for Anomalies


While the system is largely automated, we maintain manual review capabilities for quality assurance. The JSON outputs are regularly sampled to identify:

  • Potential bugs in the Accuracy LLM's evaluation logic

  • Edge cases that might require refinement

  • Patterns of systemic errors

  • Unexpected model behaviors


Pre-Evaluation Validation: Ensuring System Accuracy

Before deploying this evaluation system at scale, we conducted extensive validation testing to ensure the Accuracy LLM itself was performing correctly.


Validation process included:

  1. Known answer testing: Running questions with objectively correct answers through the system

  2. Cross-validation: Having multiple evaluators (human and AI) assess the same outputs

  3. Edge case testing: Deliberately submitting ambiguous or borderline responses

  4. Consistency checks: Running identical responses through evaluation multiple times to ensure deterministic results (a sketch of this check follows the list)

  5. Human audits: Manual review of thousands of evaluation decisions to identify systematic biases
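
The consistency check in step 4 can be illustrated with a small harness like the one below; `evaluate_response` is a stand-in for a call to the Accuracy LLM, not a real API.

```python
def evaluate_response(question: str, response: str, expected: str) -> str:
    """Placeholder for the Accuracy LLM; returns 'Correct' or 'Incorrect'."""
    # Stand-in logic so the sketch runs: a simple containment check.
    return "Correct" if expected.lower() in response.lower() else "Incorrect"

def is_deterministic(question: str, response: str, expected: str, runs: int = 5) -> bool:
    """Re-evaluate an identical response several times and require identical verdicts."""
    verdicts = {evaluate_response(question, response, expected) for _ in range(runs)}
    return len(verdicts) == 1

print(is_deterministic(
    question="Explain Ohm's Law.",
    response="Ohm's Law states that V = IR.",
    expected="V = IR",
))  # True: the same verdict on every run
```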


The goal was to confirm that the Accuracy LLM could reliably distinguish correct from incorrect responses across diverse question types. After extensive testing and refinement, we're confident the system achieves extremely high evaluation accuracy.


Category-Specific Challenges: The Subjectivity Problem

The Writing Evaluation Challenge

While our system performs exceptionally well on objective questions, we've encountered challenges with subjective evaluation—particularly in assessing writing quality.


Initial approach (too lenient):


Early versions of the accuracy evaluation would assess writing tasks like this:

Prompt: "Write a 3 paragraph, 12 sentence long paper on Ohm's Law"

Evaluation criteria:

  • Does it discuss Ohm's Law? ✓

  • Is it 3 paragraphs? ✓

  • Is it 12 sentences? ✓

  • Is the grammar correct? ✓


Result: Nearly 100% scores across all models, even when the content quality was poor.

The problem was clear: structural requirements and grammar checking are insufficient for evaluating writing quality. A response could be technically correct while being repetitive, superficial, or poorly organized.


Refined Writing Evaluation

We've since implemented more sophisticated evaluation criteria for writing tasks:


Content quality metrics:

  • Depth of explanation: Does the writing demonstrate genuine understanding?

  • Clarity: Is the explanation accessible to the intended audience?

  • Organization: Is information presented in a logical sequence?

  • Completeness: Are all relevant aspects of the topic covered?

  • Originality: Does the writing avoid repetitive or formulaic patterns?


Implementation approach:

The Accuracy LLM now uses a rubric-based evaluation for writing tasks, scoring multiple dimensions independently and combining them into an overall assessment. This provides more granular feedback and better differentiates between adequate and excellent writing.
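
As an illustration of rubric-based scoring, the sketch below combines independently scored dimensions into an overall assessment. The dimension weights and pass threshold are assumptions for the example, not our calibrated values.

```python
# Dimension scores on a 0-5 scale, as the Accuracy LLM might assign them.
rubric_scores = {
    "depth": 3,
    "clarity": 4,
    "organization": 4,
    "completeness": 3,
    "originality": 2,
}

# Assumed weights; in practice these would be tuned per task type.
weights = {
    "depth": 0.30,
    "clarity": 0.20,
    "organization": 0.15,
    "completeness": 0.25,
    "originality": 0.10,
}

overall = sum(rubric_scores[dim] * weights[dim] for dim in rubric_scores)
print(f"Overall writing score: {overall:.2f} / 5.00")
print("Assessment:", "adequate or better" if overall >= 3.0 else "below threshold")
```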

However, we acknowledge this remains an imperfect science. Writing quality contains irreducible subjective elements, and we continue refining these evaluation methods.


Genre-Based Testing: Focused Evaluation

In addition to our comprehensive 5,000-question core evaluation, we maintain Genre-Based Tests—smaller, focused assessment sets ranging from 500 to 1,500 questions.


Purpose of Genre-Based Tests:

  • Edge case exploration: Testing unusual or boundary conditions

  • Safety validation: Ensuring models don't produce harmful outputs

  • Specialized capability testing: Deep dives into specific capabilities like coding, math, or reasoning

  • Third-party integration testing: Evaluating performance when models have access to external tools like search APIs

  • Rapid iteration: Smaller test sets enable faster experimentation and refinement


Manual review advantage:

The smaller scope of Genre-Based Tests makes comprehensive manual review feasible. We can examine every response in detail, catching nuances that automated evaluation might miss.


These focused tests complement the broad coverage of our core evaluation, providing both breadth and depth in our assessment methodology.

Real-World Findings: What We've Discovered

Benchmark Claims vs. Measured Performance


Our independent testing has revealed a consistent pattern: the claimed performance of top-tier models rarely materializes in our evaluation framework.


Specific observations:

1. Inflated benchmark scores don't translate. Models scoring at the top of public benchmarks often perform in the middle of the pack in our testing. The correlation between public benchmark performance and our evaluation results is surprisingly weak.


2. "Inferior" models outperform "superior" ones We routinely observe models that score lower on public benchmarks outperforming their supposedly superior competitors in our tests. This suggests public benchmarks are measuring something other than general intelligence.


3. Claimed capabilities don't manifest. Models marketed with specific capabilities—"best in class reasoning," "superior coding," "state-of-the-art math"—frequently fail to demonstrate those advantages when rigorously tested on novel problems.


4. Private benchmarks are also gamed. Some organizations have developed "private" benchmarks as alternatives to public ones. However, we've found that claimed performance on these private benchmarks also fails to materialize in our testing, suggesting gaming occurs even with supposedly closed evaluation sets.


Chain-of-Thought Models: A Double-Edged Sword

We've made particularly interesting observations about models that use explicit Chain-of-Thought (CoT) reasoning:


Performance degradation with context length:

As context windows grow larger—as conversations become longer and more complex—CoT models begin generating increasingly nonsensical reasoning chains. This degradation then propagates into their final outputs, producing incorrect answers.


Why this matters:

We've reviewed the "thinking" processes from third-party CoT models and found that the reasoning becomes circular, contradictory, or completely unmoored from the original question as context accumulates. The very mechanism intended to improve performance—explicit reasoning—becomes a liability.


Implication:

This suggests that current CoT implementations lack robust mechanisms for maintaining coherence over long contexts. The "thinking" process requires as much intelligence as the answering process, and current approaches haven't solved this effectively.


The Search Dependency Problem

One of our most revealing findings concerns model performance with and without web search access.


The experiment:

We evaluated models in two conditions:

  1. With access to web search APIs

  2. Without any external search capabilities


The results:

When we stripped models of search access, we observed significantly degraded performance across all tasks—including those that don't require access to current information.


Tasks that shouldn't require search:

  • Mathematical calculations

  • Logical reasoning puzzles

  • Coding problems

  • Explaining established scientific concepts

  • Language translation


Yet performance dropped substantially even on these tasks when search was disabled.

What this means:


We believe this is compelling evidence that these models are not as "intelligent" as their developers claim. True intelligence should not depend on external search for tasks that require only reasoning and knowledge synthesis.


Models appear to be using search as a crutch—compensating for gaps in genuine reasoning capability by retrieving information even when that information should already be encoded in their parameters.


Small Models, Big Performance

Perhaps our most surprising finding challenges the industry's "bigger is better" paradigm:

Our model specifications:

  • Approximately 3 billion parameters

  • Roughly 425 times smaller than competing models on average

Performance results:

  • Similar to much larger models across most categories

  • Superior to larger models in several specific domains


Why smaller models can compete:

Our hypothesis, supported by our testing data, is that massive models suffer from internal conflicts and hallucinations caused by having access to vast amounts of irrelevant data.


The noise problem:

When a model is trained on everything, it has difficulty determining what information is relevant to a given task. Contradictory training data creates internal conflicts. Irrelevant information introduces noise into reasoning processes.


The focus advantage:

A smaller, more focused model with a carefully curated training set:

  • Has less internal contradiction

  • Experiences less noise in its reasoning processes

  • Can achieve higher accuracy on the tasks it's designed for

  • Requires fewer computational resources for inference


Industry implications:

This suggests the race to build ever-larger models may be misguided. The future of AI performance may lie not in raw parameter count but in intelligent architecture design and high-quality, focused training data.


The Cost of Independent Evaluation

One significant challenge we face is the financial cost of evaluating competing models.


How costs accumulate:

To compare our model against competitors fairly, we must run their models through our entire evaluation suite. Since most competing models are only available through paid APIs, this means:

  • 5,000+ API calls per full evaluation

  • Multiple evaluations as models update

  • Testing multiple competing models

  • Genre-based test evaluations


Cost breakdown example:

For a single comprehensive evaluation of one competing model:

  • 5,000 questions × $0.01 per API call (average) = $50

  • Accuracy LLM evaluation of those outputs = additional processing costs

  • Multiple evaluation runs for consistency checking = 3-5× the base cost


When testing 10 different competing models with periodic re-evaluation as they update, costs can easily reach thousands of dollars monthly.
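
Under the assumptions above (roughly $0.01 per call, 3-5 runs per model, and additional Accuracy LLM processing), a back-of-the-envelope estimate looks like this; the exact multipliers are illustrative.

```python
QUESTIONS = 5_000
COST_PER_CALL = 0.01        # average cost per API call, in USD
RUNS_PER_MODEL = 4          # consistency checking: 3-5 runs, midpoint assumed
ACCURACY_OVERHEAD = 1.5     # assumed multiplier for Accuracy LLM processing
MODELS_TESTED = 10

per_run = QUESTIONS * COST_PER_CALL                  # $50 per full pass
per_model = per_run * RUNS_PER_MODEL * ACCURACY_OVERHEAD
monthly = per_model * MODELS_TESTED                  # assuming monthly re-evaluation

print(f"Per evaluation run:   ${per_run:,.0f}")
print(f"Per model (all runs): ${per_model:,.0f}")
print(f"Monthly, {MODELS_TESTED} models:    ${monthly:,.0f}")
```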


Why we bear this cost:

Despite the expense, we consider this investment essential. The only way to make honest claims about relative performance is to actually test those models rigorously. We refuse to rely on public benchmarks or marketing claims that we know to be misleading.

This commitment to honest evaluation differentiates us from competitors who make performance claims based solely on cherry-picked or gamed benchmarks.


Limitations and Ongoing Refinement

We acknowledge our evaluation system, while significantly more reliable than public benchmarks, is not perfect.


Current limitations:


Subjective evaluation challenges

  • Writing quality assessment remains partially subjective

  • Creative tasks are difficult to evaluate objectively

  • Style preferences vary across use cases


Coverage limitations

  • 5,000 questions, while comprehensive, cannot cover every possible task

  • Edge cases continually emerge as models evolve

  • New capabilities require new evaluation questions


Accuracy LLM dependency

  • Our evaluation quality depends on the Accuracy LLM's performance

  • We must continually validate that the Accuracy LLM remains unbiased

  • As evaluated models improve, evaluation criteria must evolve


Cost constraints

  • Comprehensive evaluation of many models is expensive

  • We must prioritize which models to evaluate most thoroughly

  • API costs limit evaluation frequency


Ongoing refinement:

We treat our evaluation system as a living framework requiring continuous improvement:

  • Regular manual audits of evaluation decisions

  • Addition of new question types as capabilities expand

  • Refinement of subjective evaluation criteria

  • Validation testing of the Accuracy LLM itself

  • Community feedback on evaluation methodology


The Future of AI Evaluation

What Needs to Change


The AI industry must move beyond benchmark gaming toward genuine, honest evaluation. This requires:


Closed evaluation sets

  • Test questions must not be public

  • Evaluation methodologies should be transparent but test content should remain private

  • Multiple independent evaluation organizations to prevent single points of gaming


Comprehensive assessment

  • Evaluations must test diverse capabilities across many domains

  • Context coherence and long-form reasoning must be assessed

  • Edge cases and failure modes must be explored systematically


Real-world task simulation

  • Evaluation should mirror actual use cases

  • Multi-turn conversations and context maintenance matter

  • Integration with tools and external resources should be tested


Transparency in limitations

  • Models should be evaluated for what they cannot do, not just successes

  • Failure modes should be documented and published

  • Confidence intervals and error bars should accompany all performance claims


Independent verification

  • Third-party evaluation should be the norm

  • Model developers' benchmark claims should be treated skeptically

  • Community-driven evaluation efforts should be supported


Our Commitment

At Nexus, we're committed to honest evaluation that reflects genuine model capabilities. We will:

  • Continue maintaining our closed, comprehensive evaluation framework

  • Publish evaluation results transparently (within the constraints of protecting our evaluation set)

  • Refine our methodology based on feedback and new findings

  • Bear the cost of independent testing rather than relying on marketing claims

  • Advocate for industry-wide adoption of more rigorous evaluation standards



Conclusion

The AI industry's reliance on gameable public benchmarks has created a crisis of trust. Performance claims have become disconnected from real-world capabilities. Models are optimized for test scores rather than genuine intelligence.


At Nexus, we've built an independent evaluation system specifically designed to combat benchmark gaming. Through closed, comprehensive testing with an intelligent Accuracy LLM and careful methodology, we can measure true model performance rather than memorization of known test sets.


Our findings challenge many industry assumptions:

  • Public benchmark scores are poor predictors of real-world performance

  • Smaller, focused models can compete with massive general-purpose ones

  • Search dependency reveals gaps in genuine reasoning capability

  • Chain-of-thought reasoning can degrade in long contexts

  • Claimed capabilities frequently fail to materialize under rigorous testing


The path forward requires the AI community to embrace honest evaluation, acknowledge the limitations of current benchmarks, and invest in rigorous, independent testing methodologies.

Because in the end, the goal is not to achieve high benchmark scores. The goal is to build AI systems that actually work—that genuinely understand, reason, and assist in the complex, nuanced ways that real-world applications demand. And that requires knowing the truth about model performance, even when that truth is uncomfortable.


This evaluation methodology represents our current approach as of October 2025. We welcome feedback, criticism, and suggestions for improvement. Contact us at nexusdevolpercontact@gmail.com

 
 
 
