
How We Evaluate AI Models: Fighting Benchmark Gaming with Independent Testing

The Crisis of Trust in AI Benchmarks


The AI industry has a dirty secret: the benchmarks everyone uses to compare models are fundamentally broken. Not because the tests themselves are poorly designed, but because they've become targets for optimization rather than measures of true capability.

When benchmark questions and tasks are publicly available, they stop measuring generalization and start measuring memorization. Models are deliberately overfitted to these specific test sets, inflating scores while real-world performance remains mediocre. The result is a marketplace flooded with misleading claims where benchmark scores have become marketing tools rather than meaningful metrics.


At Nexus, we've taken a different approach. We don't trust current industry benchmarks, and we've built an independent evaluation system specifically designed to combat the benchmark gaming that has corrupted AI model assessment.


The Overfitting Problem: How Benchmarks Became Meaningless


The Public Benchmark Trap

All of the major AI benchmarks in use today—MMLU, HumanEval, GSM8K, and others—share a fatal flaw: their test sets are public. This creates an irresistible incentive for model developers to optimize specifically for these known questions.


The process is straightforward:

  1. Benchmark questions are publicly available

  2. Companies include these exact question types in training data

  3. Models learn to pattern-match against known test cases

  4. Benchmark scores increase dramatically

  5. Real-world performance remains stagnant


This isn't speculation. We've observed this pattern repeatedly in our independent testing. Models claiming massive benchmark advantages consistently fail to demonstrate those capabilities when tested on novel questions they haven't been optimized for.


Evidence of Benchmark Gaming

Consider a recent example: Grok Code's published benchmark results showed it beating Gemini by enormous margins on coding tasks. The marketing materials showed impressive graphs with substantial performance gaps. Yet when we tested both models using our independent evaluation, Grok Code barely outperformed our own model—and we don't even focus on coding in our training datasets.


This discrepancy is not an anomaly. It's evidence of systematic overfitting. If a model truly possessed superior coding intelligence, that advantage should manifest across all coding tasks, not just public benchmark questions.


We've observed similar patterns across multiple model comparisons:


  • Models claiming state-of-the-art reasoning often produce illogical outputs on novel problems

  • Top-tier models with impressive MMLU scores frequently fail basic comprehension tasks

  • "Superior" models consistently underperform their benchmark predictions in our testing


The disconnect between claimed capability and actual performance has become so severe that public benchmarks have lost their value as evaluation tools.


Our Solution: Closed, Comprehensive Evaluation


The Core Evaluation Framework


Our primary evaluation system consists of 5,000 carefully designed questions and tasks that vary systematically across multiple dimensions:


Difficulty Distribution:

  • K-12 level content across all major subjects

  • Professional-level questions requiring domain expertise

  • Graduate-level reasoning tasks

  • Edge cases designed to test true understanding


Category Distribution:


  • Math and computational tasks

  • Science questions across physics, chemistry, biology

  • Writing assessments for coherence and accuracy

  • Factual knowledge retrieval

  • Reasoning and logic problems


Subject Balance: Approximately 20% of questions fall into each of five major disciplines, ensuring no single domain dominates the evaluation. This balance prevents models from achieving high scores through narrow specialization.
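
As a rough illustration of how this balance can be checked, the sketch below tallies a hypothetical question set by discipline and flags any category drifting from the 20% target. The record shape, category labels, and tolerance are assumptions for the example, not our internal tooling.

```python
from collections import Counter

# Hypothetical question records; only the "category" field matters for this check.
questions = [
    {"question_id": i, "category": cat}
    for i, cat in enumerate(
        ["Math", "Science", "Writing", "Factual", "Reasoning"] * 1000
    )
]

TARGET_SHARE = 0.20   # ~20% of questions per discipline
TOLERANCE = 0.02      # allowable drift; an assumed value

counts = Counter(q["category"] for q in questions)
total = sum(counts.values())

for category, count in sorted(counts.items()):
    share = count / total
    status = "OK" if abs(share - TARGET_SHARE) <= TOLERANCE else "IMBALANCED"
    print(f"{category:10s} {count:5d} ({share:.1%}) {status}")
```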


Transitional Questions: Testing Context Coherence

The final 250 questions in each discipline are Transitional Questions (TQs)—specially designed to bridge subject areas and test contextual understanding.


Key characteristics of TQs:

  • Multi-subject integration: Require knowledge from two or more disciplines simultaneously

  • Context dependency: Rely on information from previous questions in the conversation history

  • Coherence testing: Evaluate whether models maintain logical consistency across topic shifts

  • Real-world simulation: Mirror how humans actually use AI—jumping between topics within a single conversation


For example, a TQ might start with physics concepts, incorporate mathematical calculations, and require writing a technical explanation—all while referencing specific details from earlier questions in the evaluation session. This design catches models that perform well on isolated questions but struggle with sustained, contextual reasoning—a critical capability for real-world applications.
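
To make the idea concrete, here is one way a TQ record could be represented. The field names (`disciplines`, `depends_on`) and values are illustrative assumptions, not our internal schema.

```python
# Illustrative structure for a Transitional Question (TQ): it spans multiple
# disciplines and references earlier questions in the same evaluation session.
transitional_question = {
    "question_id": 4751,
    "disciplines": ["Physics", "Math", "Writing"],  # multi-subject integration
    "depends_on": [4703, 4712],                     # context dependency on earlier questions
    "prompt": (
        "Using the circuit described earlier in this session, calculate the "
        "current for the stated voltage and resistance, then write a short "
        "technical explanation of the result."
    ),
    "evaluation_focus": ["coherence", "context_retention", "correct_calculation"],
}
```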


The Closed System Advantage

Here's the crucial element that makes our evaluation system resistant to gaming: it's completely closed and anonymous.


No one knows:

  • What questions are in our evaluation set

  • How those questions are structured or phrased

  • What the distribution of difficulty levels is

  • Who we are or what organization is conducting these evaluations


Since we maintain complete control over our training data and know exactly what was fed into our model during development, we can guarantee that our evaluation questions were never part of the training set. This ensures we're testing true generalization rather than memorization.


Competing models have no ability to overfit to our evaluation because they don't know it exists. They can't optimize for questions they've never seen. This asymmetry allows us to measure actual intelligence rather than benchmark-specific pattern matching.


The Accuracy Evaluation System: Beyond Simple Scripts

Why Traditional Scripts Don't Work

Initially, many assume accuracy evaluation could be handled with a simple programmatic script—check if the output matches the expected answer, mark it correct or incorrect, move on.


This approach fails immediately because of output variability. Even when models produce correct answers, they structure those answers differently every time:

  • Different sentence structures

  • Varied explanations

  • Additional context or reasoning

  • Alternative but equivalent phrasings


A rigid script would mark many correct answers as failures simply because they don't match a predefined string exactly. This approach is fundamentally incompatible with evaluating natural language.
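
The sketch below illustrates the failure mode. A naive exact-match check rejects answers that are correct but phrased differently, and simple normalization only goes so far; the helper and examples are illustrative.

```python
def exact_match(model_output: str, expected: str) -> bool:
    """Naive check: mark correct only if the strings match exactly."""
    return model_output.strip() == expected.strip()

expected = "105"
outputs = [
    "105",
    "The answer is 105.",
    "one hundred five",
    "1.05 × 10²",
]

for out in outputs:
    verdict = "Correct" if exact_match(out, expected) else "Incorrect"
    print(f"{out!r:30s} -> {verdict}")

# Only the first output passes, even though all four are equivalent.
# Lowercasing or stripping punctuation helps slightly, but recognizing
# "one hundred five" or scientific notation requires actually understanding
# the answer -- which is why a rigid script is not enough.
```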


The Accuracy LLM: Intelligent Evaluation

Our solution is an Accuracy LLM—a specialized language model dedicated entirely to evaluating other models' outputs. This is not a general-purpose model; it's been designed and configured specifically for rigorous, consistent evaluation.


The Accuracy LLM operates with:

  • Predefined evaluation parameters: Specific criteria it must check for each question type

  • Structured review steps: A systematic process it follows to validate outputs

  • Search API integration: Access to specialized search APIs to verify factual claims

  • Known answer references: For certain question types, predefined correct answers to compare against
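
As a hedged illustration of how such an evaluator can be driven, the sketch below assembles an evaluation prompt from the elements listed above. The template wording, parameter names, and criteria are hypothetical, not our production configuration.

```python
def build_evaluation_prompt(question: str,
                            model_output: str,
                            criteria: list[str],
                            known_answer: str | None = None) -> str:
    """Assemble a prompt for a dedicated evaluator model (illustrative only)."""
    lines = [
        "You are an evaluation model. Judge the response strictly against the criteria.",
        f"Question: {question}",
        f"Candidate response: {model_output}",
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    if known_answer is not None:
        lines.append(f"Reference answer: {known_answer}")
    lines.append('Return exactly one verdict, "Correct" or "Incorrect", with a one-sentence justification.')
    return "\n".join(lines)

prompt = build_evaluation_prompt(
    question="Explain Ohm's Law.",
    model_output="Ohm's Law states that voltage equals current times resistance, V = IR.",
    criteria=[
        "States the relationship V = IR",
        "Contains no factual errors",
        "Addresses the question asked",
    ],
    known_answer="V = IR, where voltage is directly proportional to current and resistance",
)
print(prompt)
```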


How the Accuracy LLM Works

When evaluating a model's response, the Accuracy LLM follows a multi-step process:


Step 1: Parse the Question and Response

  • Identifies the question type (math, factual, reasoning, etc.)

  • Extracts the core claim or answer from the model's output

  • Determines what criteria must be satisfied for correctness

Step 2: Fact Verification (when applicable)

  • Uses specialized search APIs to verify factual claims

  • Cross-references information against authoritative sources

  • Checks for internal consistency in the response

Step 3: Answer Comparison (when applicable)

  • For questions with predetermined answers (e.g., math problems), compares the model's answer against the known correct answer

  • Accounts for equivalent formulations (e.g., "105" = "one hundred five" = "1.05 × 10²")

  • Identifies if the correct answer is present even if embedded in additional explanation

Step 4: Quality Assessment

  • Evaluates whether the response actually addresses the question

  • Checks for logical coherence

  • Assesses completeness of the answer

Step 5: Generate Structured Output

The Accuracy LLM produces output in a specific format with the following structure:

  • Model identifier: Model Name in double parentheses

  • Question number: Question ID in double curly braces

  • Original output: The model's actual response in triple square brackets

  • Expected answer: The correct answer in double square brackets

  • Evaluation result: Either "Correct" or "Incorrect" in double angle brackets


Example of structured output (rendered here with human-readable labels; in the raw format, each field is wrapped in the delimiters listed above):

Model: Gemini-Pro | Question: 0847


Original Model Output: "Ohm's Law states that voltage equals current times resistance, or V = IR. This means that if you increase the resistance in a circuit while keeping voltage constant, current will decrease proportionally."

Proper Answer: "Ohm's Law: V = IR, where voltage is directly proportional to current and resistance"

Evaluation: Correct


This structured format enables automated processing while preserving the full context for manual review if needed.


The Extraction and Analysis Pipeline

Automated Processing


The structured output from the Accuracy LLM flows into an extraction script that processes evaluation results at scale.


The extraction script:

  1. Parses each evaluation output into its component parts

  2. Extracts correctness indicators to determine if responses were correct or incorrect

  3. Calculates accuracy percentages by dividing correct responses by total questions

  4. Generates performance reports broken down by category, difficulty, and question type

  5. Creates JSON files containing all model outputs for archival and review
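
Assuming the delimiter format described earlier (model in double parentheses, question ID in double curly braces, output in triple square brackets, expected answer in double square brackets, verdict in double angle brackets), a minimal extraction sketch might look like this. The regular expression and helper names are illustrative, not our production script.

```python
import re

RECORD_PATTERN = re.compile(
    r"\(\((?P<model>.*?)\)\)\s*"             # model name in double parentheses
    r"\{\{(?P<question_id>.*?)\}\}\s*"       # question ID in double curly braces
    r"\[\[\[(?P<model_output>.*?)\]\]\]\s*"  # original output in triple square brackets
    r"\[\[(?P<expected>.*?)\]\]\s*"          # expected answer in double square brackets
    r"<<(?P<verdict>Correct|Incorrect)>>",   # evaluation result in double angle brackets
    re.DOTALL,
)

def parse_records(raw_text: str) -> list[dict]:
    """Extract structured evaluation records from the Accuracy LLM's output."""
    return [m.groupdict() for m in RECORD_PATTERN.finditer(raw_text)]

def accuracy_by_model(records: list[dict]) -> dict[str, float]:
    """Percentage of 'Correct' verdicts per evaluated model."""
    totals: dict[str, list[int]] = {}
    for rec in records:
        correct, seen = totals.setdefault(rec["model"], [0, 0])
        totals[rec["model"]] = [correct + (rec["verdict"] == "Correct"), seen + 1]
    return {model: 100.0 * c / n for model, (c, n) in totals.items()}

sample = (
    "((Gemini-Pro)) {{0847}} "
    "[[[Ohm's Law states that voltage equals current times resistance, or V = IR.]]] "
    "[[Ohm's Law: V = IR]] <<Correct>>"
)
print(accuracy_by_model(parse_records(sample)))  # {'Gemini-Pro': 100.0}
```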


JSON Output Structure


All evaluated outputs are preserved in JSON format with the following fields:

  • model: The name of the model being evaluated (e.g., "CompetitorX")

  • question_id: Unique identifier for each question (e.g., 847)

  • category: Subject area (e.g., "Physics")

  • difficulty: Difficulty level (e.g., "K12")

  • question: The actual question text (e.g., "Explain Ohm's Law")

  • model_output: The model's complete response (e.g., "Ohm's Law states that...")

  • expected_answer: The correct answer (e.g., "V = IR...")

  • accuracy_evaluation: Result marking (e.g., "Correct")

  • accuracy_llm_reasoning: Explanation of why it was marked correct/incorrect (e.g., "Response correctly identifies...")

  • timestamp: When the evaluation occurred (e.g., "2025-10-15T14:23:11Z")
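
For concreteness, the snippet below assembles and serializes a single record with these fields. The values come from the examples above; the surrounding code is a sketch rather than our actual pipeline.

```python
import json
from datetime import datetime, timezone

record = {
    "model": "CompetitorX",
    "question_id": 847,
    "category": "Physics",
    "difficulty": "K12",
    "question": "Explain Ohm's Law",
    "model_output": "Ohm's Law states that voltage equals current times resistance...",
    "expected_answer": "V = IR, where voltage is directly proportional to current and resistance",
    "accuracy_evaluation": "Correct",
    "accuracy_llm_reasoning": "Response correctly identifies the V = IR relationship.",
    "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
}

# One object per evaluated response; full runs are archived as arrays
# (or line-delimited JSON) for later review.
print(json.dumps(record, indent=2))
```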


This structured data enables:

  • Long-term performance tracking

  • Anomaly detection

  • Category-specific analysis

  • Manual spot-checking for quality assurance


Manual Review for Anomalies


While the system is largely automated, we maintain manual review capabilities for quality assurance. The JSON outputs are regularly sampled to identify:

  • Potential bugs in the Accuracy LLM's evaluation logic

  • Edge cases that might require refinement

  • Patterns of systemic errors

  • Unexpected model behaviors


Pre-Evaluation Validation: Ensuring System Accuracy

Before deploying this evaluation system at scale, we conducted extensive validation testing to ensure the Accuracy LLM itself was performing correctly.


Validation process included:

  1. Known answer testing: Running questions with objectively correct answers through the system

  2. Cross-validation: Having multiple evaluators (human and AI) assess the same outputs

  3. Edge case testing: Deliberately submitting ambiguous or borderline responses

  4. Consistency checks: Running identical responses through evaluation multiple times to ensure deterministic results (a sketch of this check follows the list)

  5. Human audits: Manual review of thousands of evaluation decisions to identify systematic biases
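
The consistency check in step 4 can be illustrated with a small harness like the one below; `evaluate_response` is a stand-in for a call to the Accuracy LLM, not a real API.

```python
def evaluate_response(question: str, response: str, expected: str) -> str:
    """Placeholder for the Accuracy LLM; returns 'Correct' or 'Incorrect'."""
    # Stand-in logic so the sketch runs: a simple containment check.
    return "Correct" if expected.lower() in response.lower() else "Incorrect"

def is_deterministic(question: str, response: str, expected: str, runs: int = 5) -> bool:
    """Re-evaluate an identical response several times and require identical verdicts."""
    verdicts = {evaluate_response(question, response, expected) for _ in range(runs)}
    return len(verdicts) == 1

print(is_deterministic(
    question="Explain Ohm's Law.",
    response="Ohm's Law states that V = IR.",
    expected="V = IR",
))  # True: the same verdict on every run
```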


The goal was to confirm that the Accuracy LLM could reliably distinguish correct from incorrect responses across diverse question types. After extensive testing and refinement, we're confident the system achieves extremely high evaluation accuracy.


Category-Specific Challenges: The Subjectivity Problem

The Writing Evaluation Challenge

While our system performs exceptionally well on objective questions, we've encountered challenges with subjective evaluation—particularly in assessing writing quality.


Initial approach (too lenient):


Early versions of the accuracy evaluation would assess writing tasks like this:

Prompt: "Write a 3 paragraph, 12 sentence long paper on Ohm's Law"

Evaluation criteria:

  • Does it discuss Ohm's Law? ✓

  • Is it 3 paragraphs? ✓

  • Is it 12 sentences? ✓

  • Is the grammar correct? ✓


Result: Nearly 100% scores across all models, even when the content quality was poor.

The problem was clear: structural requirements and grammar checking are insufficient for evaluating writing quality. A response could be technically correct while being repetitive, superficial, or poorly organized.


Refined Writing Evaluation

We've since implemented more sophisticated evaluation criteria for writing tasks:


Content quality metrics:

  • Depth of explanation: Does the writing demonstrate genuine understanding?

  • Clarity: Is the explanation accessible to the intended audience?

  • Organization: Is information presented in a logical sequence?

  • Completeness: Are all relevant aspects of the topic covered?

  • Originality: Does the writing avoid repetitive or formulaic patterns?


Implementation approach:

The Accuracy LLM now uses a rubric-based evaluation for writing tasks, scoring multiple dimensions independently and combining them into an overall assessment. This provides more granular feedback and better differentiates between adequate and excellent writing.
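
As an illustration of rubric-based scoring, the sketch below combines independently scored dimensions into an overall assessment. The dimension weights and pass threshold are assumptions for the example, not our calibrated values.

```python
# Dimension scores on a 0-5 scale, as the Accuracy LLM might assign them.
rubric_scores = {
    "depth": 3,
    "clarity": 4,
    "organization": 4,
    "completeness": 3,
    "originality": 2,
}

# Assumed weights; in practice these would be tuned per task type.
weights = {
    "depth": 0.30,
    "clarity": 0.20,
    "organization": 0.15,
    "completeness": 0.25,
    "originality": 0.10,
}

overall = sum(rubric_scores[dim] * weights[dim] for dim in rubric_scores)
print(f"Overall writing score: {overall:.2f} / 5.00")
print("Assessment:", "adequate or better" if overall >= 3.0 else "below threshold")
```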

However, we acknowledge this remains an imperfect science. Writing quality contains irreducible subjective elements, and we continue refining these evaluation methods.


Genre-Based Testing: Focused Evaluation

In addition to our comprehensive 5,000-question core evaluation, we maintain Genre-Based Tests—smaller, focused assessment sets ranging from 500 to 1,500 questions.


Purpose of Genre-Based Tests:

  • Edge case exploration: Testing unusual or boundary conditions

  • Safety validation: Ensuring models don't produce harmful outputs

  • Specialized capability testing: Deep dives into specific capabilities like coding, math, or reasoning

  • Third-party integration testing: Evaluating performance when models have access to external tools like search APIs

  • Rapid iteration: Smaller test sets enable faster experimentation and refinement


Manual review advantage:

The smaller scope of Genre-Based Tests makes comprehensive manual review feasible. We can examine every response in detail, catching nuances that automated evaluation might miss.


These focused tests complement the broad coverage of our core evaluation, providing both breadth and depth in our assessment methodology.

Real-World Findings: What We've Discovered

Benchmark Claims vs. Measured Performance


Our independent testing has revealed a consistent pattern: the claimed performance of top-tier models rarely materializes in our evaluation framework.


Specific observations:

1. Inflated benchmark scores don't translate. Models scoring at the top of public benchmarks often perform in the middle of the pack in our testing. The correlation between public benchmark performance and our evaluation results is surprisingly weak.


2. "Inferior" models outperform "superior" ones We routinely observe models that score lower on public benchmarks outperforming their supposedly superior competitors in our tests. This suggests public benchmarks are measuring something other than general intelligence.


3. Claimed capabilities don't manifest. Models marketed with specific capabilities—"best in class reasoning," "superior coding," "state-of-the-art math"—frequently fail to demonstrate those advantages when rigorously tested on novel problems.


4. Private benchmarks are also gamed. Some organizations have developed "private" benchmarks as alternatives to public ones. However, we've found that claimed performance on these private benchmarks also fails to materialize in our testing, suggesting gaming occurs even with supposedly closed evaluation sets.


Chain-of-Thought Models: A Double-Edged Sword

We've made particularly interesting observations about models that use explicit Chain-of-Thought (CoT) reasoning:


Performance degradation with context length:

As context windows grow larger—as conversations become longer and more complex—CoT models begin generating increasingly nonsensical reasoning chains. This degradation then propagates into their final outputs, producing incorrect answers.


Why this matters:

We've reviewed the "thinking" processes from third-party CoT models and found that the reasoning becomes circular, contradictory, or completely unmoored from the original question as context accumulates. The very mechanism intended to improve performance—explicit reasoning—becomes a liability.


Implication:

This suggests that current CoT implementations lack robust mechanisms for maintaining coherence over long contexts. The "thinking" process requires as much intelligence as the answering process, and current approaches haven't solved this effectively.


The Search Dependency Problem

One of our most revealing findings concerns model performance with and without web search access.


The experiment:

We evaluated models in two conditions:

  1. With access to web search APIs

  2. Without any external search capabilities


The results:

When we stripped models of search access, we observed significantly degraded performance across all tasks—including those that don't require access to current information.


Tasks that shouldn't require search:

  • Mathematical calculations

  • Logical reasoning puzzles

  • Coding problems

  • Explaining established scientific concepts

  • Language translation


Yet performance dropped substantially even on these tasks when search was disabled.

What this means:


We believe this is compelling evidence that these models are not as "intelligent" as their developers claim. True intelligence should not depend on external search for tasks that require only reasoning and knowledge synthesis.


Models appear to be using search as a crutch—compensating for gaps in genuine reasoning capability by retrieving information even when that information should already be encoded in their parameters.


Small Models, Big Performance

Perhaps our most surprising finding challenges the industry's "bigger is better" paradigm:

Our model specifications:

  • Approximately 3 billion parameters

  • Roughly 425 times smaller than competing models on average

Performance results:

  • Similar to much larger models across most categories

  • Superior to larger models in several specific domains


Why smaller models can compete:

Our hypothesis, supported by our testing data, is that massive models suffer from internal conflicts and hallucinations caused by having access to vast amounts of irrelevant data.


The noise problem:

When a model is trained on everything, it has difficulty determining what information is relevant to a given task. Contradictory training data creates internal conflicts. Irrelevant information introduces noise into reasoning processes.


The focus advantage:

A smaller, more focused model with a carefully curated training set:

  • Has less internal contradiction

  • Experiences less noise in its reasoning processes

  • Can achieve higher accuracy on the tasks it's designed for

  • Requires fewer computational resources for inference


Industry implications:

This suggests the race to build ever-larger models may be misguided. The future of AI performance may lie not in raw parameter count but in intelligent architecture design and high-quality, focused training data.


The Cost of Independent Evaluation

One significant challenge we face is the financial cost of evaluating competing models.


How costs accumulate:

To compare our model against competitors fairly, we must run their models through our entire evaluation suite. Since most competing models are only available through paid APIs, this means:

  • 5,000+ API calls per full evaluation

  • Multiple evaluations as models update

  • Testing multiple competing models

  • Genre-based test evaluations


Cost breakdown example:

For a single comprehensive evaluation of one competing model:

  • 5,000 questions × $0.01 per API call (average) = $50

  • Accuracy LLM evaluation of those outputs = additional processing costs

  • Multiple evaluation runs for consistency checking = 3-5× the base cost


When testing 10 different competing models with periodic re-evaluation as they update, costs can easily reach thousands of dollars monthly.
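
Under the assumptions above (roughly $0.01 per call, 3-5 runs per model, and additional Accuracy LLM processing), a back-of-the-envelope estimate looks like this; the exact multipliers are illustrative.

```python
QUESTIONS = 5_000
COST_PER_CALL = 0.01        # average cost per API call, in USD
RUNS_PER_MODEL = 4          # consistency checking: 3-5 runs, midpoint assumed
ACCURACY_OVERHEAD = 1.5     # assumed multiplier for Accuracy LLM processing
MODELS_TESTED = 10

per_run = QUESTIONS * COST_PER_CALL                  # $50 per full pass
per_model = per_run * RUNS_PER_MODEL * ACCURACY_OVERHEAD
monthly = per_model * MODELS_TESTED                  # assuming monthly re-evaluation

print(f"Per evaluation run:   ${per_run:,.0f}")
print(f"Per model (all runs): ${per_model:,.0f}")
print(f"Monthly, {MODELS_TESTED} models:    ${monthly:,.0f}")
```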


Why we bear this cost:

Despite the expense, we consider this investment essential. The only way to make honest claims about relative performance is to actually test those models rigorously. We refuse to rely on public benchmarks or marketing claims that we know to be misleading.

This commitment to honest evaluation differentiates us from competitors who make performance claims based solely on cherry-picked or gamed benchmarks.


Limitations and Ongoing Refinement

We acknowledge our evaluation system, while significantly more reliable than public benchmarks, is not perfect.


Current limitations:


Subjective evaluation challenges

  • Writing quality assessment remains partially subjective

  • Creative tasks are difficult to evaluate objectively

  • Style preferences vary across use cases


Coverage limitations

  • 5,000 questions, while comprehensive, cannot cover every possible task

  • Edge cases continually emerge as models evolve

  • New capabilities require new evaluation questions


Accuracy LLM dependency

  • Our evaluation quality depends on the Accuracy LLM's performance

  • We must continually validate that the Accuracy LLM remains unbiased

  • As evaluated models improve, evaluation criteria must evolve


Cost constraints

  • Comprehensive evaluation of many models is expensive

  • We must prioritize which models to evaluate most thoroughly

  • API costs limit evaluation frequency


Ongoing refinement:

We treat our evaluation system as a living framework requiring continuous improvement:

  • Regular manual audits of evaluation decisions

  • Addition of new question types as capabilities expand

  • Refinement of subjective evaluation criteria

  • Validation testing of the Accuracy LLM itself

  • Community feedback on evaluation methodology


The Future of AI Evaluation

What Needs to Change


The AI industry must move beyond benchmark gaming toward genuine, honest evaluation. This requires:


Closed evaluation sets

  • Test questions must not be public

  • Evaluation methodologies should be transparent but test content should remain private

  • Multiple independent evaluation organizations to prevent single points of gaming


Comprehensive assessment

  • Evaluations must test diverse capabilities across many domains

  • Context coherence and long-form reasoning must be assessed

  • Edge cases and failure modes must be explored systematically


Real-world task simulation

  • Evaluation should mirror actual use cases

  • Multi-turn conversations and context maintenance matter

  • Integration with tools and external resources should be tested


Transparency in limitations

  • Models should be evaluated for what they cannot do, not just successes

  • Failure modes should be documented and published

  • Confidence intervals and error bars should accompany all performance claims


Independent verification

  • Third-party evaluation should be the norm

  • Model developers' benchmark claims should be treated skeptically

  • Community-driven evaluation efforts should be supported


Our Commitment

At Nexus, we're committed to honest evaluation that reflects genuine model capabilities. We will:

  • Continue maintaining our closed, comprehensive evaluation framework

  • Publish evaluation results transparently (within the constraints of protecting our evaluation set)

  • Refine our methodology based on feedback and new findings

  • Bear the cost of independent testing rather than relying on marketing claims

  • Advocate for industry-wide adoption of more rigorous evaluation standards



Conclusion

The AI industry's reliance on gameable public benchmarks has created a crisis of trust. Performance claims have become disconnected from real-world capabilities. Models are optimized for test scores rather than genuine intelligence.


At Nexus, we've built an independent evaluation system specifically designed to combat benchmark gaming. Through closed, comprehensive testing with an intelligent Accuracy LLM and careful methodology, we can measure true model performance rather than memorization of known test sets.


Our findings challenge many industry assumptions:

  • Public benchmark scores are poor predictors of real-world performance

  • Smaller, focused models can compete with massive general-purpose ones

  • Search dependency reveals gaps in genuine reasoning capability

  • Chain-of-thought reasoning can degrade in long contexts

  • Claimed capabilities frequently fail to materialize under rigorous testing


The path forward requires the AI community to embrace honest evaluation, acknowledge the limitations of current benchmarks, and invest in rigorous, independent testing methodologies.

Because in the end, the goal is not to achieve high benchmark scores. The goal is to build AI systems that actually work—that genuinely understand, reason, and assist in the complex, nuanced ways that real-world applications demand. And that requires knowing the truth about model performance, even when that truth is uncomfortable.


This evaluation methodology represents our current approach as of October 2025. We welcome feedback, criticism, and suggestions for improvement. Contact us at nexusdevolpercontact@gmail.com

 
 
 
