Ask a developer to evaluate whether their AI tool’s output is “good,” and you will get a shrug, a maybe, and an answer that changes depending on the day. Ask them whether the output contains fewer than 300 words, and you will get a definitive yes or no.
That distinction between subjective judgment and binary measurement is one of the most important concepts in AI quality assurance today. It is also the concept that most teams building AI-powered applications completely overlook.
Binary assertions are simple true/false tests applied to AI output. Does the text contain certain formatting? Is the first line a standalone sentence? Does the response include at least one statistic? Is the word count under the limit? Each question has exactly one answer: yes or no. Pass or fail.
This simplicity is the point. When every quality criterion is binary, quality becomes a number. And when quality is a number, it can be tracked, compared, and systematically improved.
The concept is not merely theoretical. Open-source frameworks like Promptfoo and DeepEval have built entire evaluation systems around deterministic assertions, giving development teams production-ready tools for exactly this pattern. Meanwhile, research from Stanford’s DSPy project demonstrates that when assertions feed automated optimization loops, LLM systems can improve their own performance without human intervention.
The Problem with Subjective Evaluation
Most teams evaluate AI output the way they evaluate restaurant food: “this feels right” or “this does not seem quite right.” This approach has three fundamental flaws.
Non-Deterministic Results
Show the same AI output to the same evaluator on two different days and you will often get different assessments. Show it to two different evaluators and the divergence widens further. Research in the LLM evaluation space has consistently shown that human raters exhibit significant inter-rater variability when scoring open-ended text quality. When measurement is not consistent, improvement becomes impossible because you cannot tell whether a change helped or whether the evaluator’s mood shifted.
Non-Automatable Process
Subjective evaluation requires a human reading every output. This creates a bottleneck that prevents rapid iteration. If improving a prompt requires running 50 test cycles, and each cycle requires human evaluation, improvement takes weeks. If evaluation is automated, those 50 cycles can run overnight.
This is precisely why the LLM testing community has embraced deterministic assertion types. Promptfoo, one of the most widely adopted open-source evaluation tools, provides assertion types like contains, regex, equals, and custom JavaScript functions that produce binary pass/fail results with zero human involvement. DeepEval takes a similar approach with its assert_test() function, modeled after Pytest but specialized for LLM applications.
Non-Actionable Feedback
“This output is a 6 out of 10” tells the system nothing about what to change. Which aspect earned the 6? The structure? The length? The tone? The formatting? Without specific, targeted feedback, the system can only make random changes and hope one improves the score.
Binary assertions solve this by decomposing quality into individual, named criteria. When assertion number 14 (“word count under 300”) fails while assertions 1 through 13 pass, both the developer and any automated optimization loop know exactly what to fix.
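Decomposition into named criteria maps naturally onto a dictionary of checks. A minimal Python sketch (the names and checks here are illustrative, not from any framework):

```python
# Hypothetical named checks: each maps a criterion to a true/false function.
assertions = {
    "word_count_under_300": lambda t: len(t.split()) < 300,
    "contains_statistic": lambda t: any(ch.isdigit() for ch in t),
    "no_trailing_question": lambda t: not t.rstrip().endswith("?"),
}

output = "This draft is short but cites no numbers at all."
failed = [name for name, check in assertions.items() if not check(output)]
print(failed)  # -> ['contains_statistic']
```

Because each failing criterion comes back by name, the fix is immediately attributable instead of buried in an overall score.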
What Binary Assertions Look Like in Practice
Binary assertions test one specific, measurable criterion per assertion. Here are examples across different domains that teams use in production today.
Content Generation
| Assertion | Tests For |
|---|---|
| First line is a standalone sentence (not part of a paragraph) | Hook structure |
| Contains at least one specific number or statistic | Credibility signals |
| Final line is not a question | Call-to-action style |
| Total word count is under 300 | Conciseness |
| Does not contain em dashes | Brand formatting |
| Contains at least one line break creating visual separation | Readability |
| References at least one concept from the brand guidelines file | Context awareness |
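Several of the content-generation assertions above reduce to a few lines of string and pattern logic. A sketch in plain Python (function names are illustrative):

```python
import re

def first_line_standalone(text: str) -> bool:
    """First line is a standalone sentence, i.e. followed by a blank line."""
    lines = text.split("\n")
    return len(lines) > 1 and lines[0].strip() != "" and lines[1].strip() == ""

def contains_statistic(text: str) -> bool:
    """Contains at least one digit (a loose proxy for a number or statistic)."""
    return re.search(r"\d", text) is not None

def final_line_not_question(text: str) -> bool:
    """The last non-empty line does not end with a question mark."""
    lines = [line for line in text.split("\n") if line.strip()]
    return bool(lines) and not lines[-1].rstrip().endswith("?")

def word_count_under(text: str, limit: int = 300) -> bool:
    """Total word count is under the limit."""
    return len(text.split()) < limit

def no_em_dashes(text: str) -> bool:
    """Brand formatting: no em dashes anywhere in the output."""
    return "\u2014" not in text
```

Each function returns exactly True or False, which is what makes the results countable and comparable across runs.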
Code Generation
| Assertion | Tests For |
|---|---|
| Output compiles without errors | Basic correctness |
| All function names use camelCase | Naming conventions |
| No function exceeds 50 lines | Code organization |
| No hardcoded string values outside constants | Maintainability |
| Includes at least one comment per function | Documentation |
| All imports are at the top of the file | Structure |
| No console.log statements in output | Production readiness |
Email and Communication
| Assertion | Tests For |
|---|---|
| Subject line is under 50 characters | Email best practices |
| First paragraph is under 3 sentences | Scannability |

| Contains exactly one call-to-action | Focus |
| Does not use the word “synergy” | Brand voice |
| Includes the recipient’s name in the opening | Personalization |
| Total email length is under 200 words | Brevity |
In Promptfoo’s YAML configuration, these assertions translate directly into test definitions. A contains assertion checks for required strings. A regex assertion validates patterns. A javascript assertion runs custom logic that returns true or false. Every assertion type can be negated by prepending not- (for example, not-contains or not-regex), and assertions can be weighted differently based on importance.
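A small suite along these lines (the provider and exact values are illustrative, not a complete configuration) might look like:

```yaml
prompts:
  - "Write a LinkedIn post about {{topic}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      topic: why simple automations beat complex ones
    assert:
      # Deterministic, binary checks
      - type: not-contains        # negated form of contains
        value: "—"                # brand rule: no em dashes
      - type: regex               # pattern check: at least one digit
        value: "\\d"
      - type: javascript          # custom logic that returns true/false
        value: output.split(/\s+/).length < 300
        weight: 2                 # weighted higher than the other checks
```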
Designing a Binary Assertion Test Suite
The 5×5 Approach
A practical test suite uses 5 representative test prompts with 5 assertions each, creating a 25-point scoring system. This provides enough granularity to detect meaningful changes while remaining manageable for teams adopting the practice for the first time.
Step 1: Define Test Prompts
Select 5 representative prompts that cover the range of inputs your AI application handles. For a marketing copywriting tool:
- “Write a LinkedIn post about why simple automations beat complex ones”
- “Write a LinkedIn post announcing a new product feature”
- “Write a LinkedIn post about a lesson learned from a failed project”
- “Write a LinkedIn post sharing industry statistics”
- “Write a LinkedIn post about a customer success story”
Each prompt tests different aspects of the application: different topics, different content types, different structural challenges. This diversity matters because an LLM might pass all assertions on one type of prompt and fail systematically on another.
Step 2: Define Assertions Per Prompt
For each prompt, define 5 binary assertions that capture the quality dimensions you care about. Some assertions apply universally across all prompts (word count, formatting rules). Others are prompt-specific (a customer success story should reference a specific outcome metric).
Step 3: Run and Score
Execute each prompt, apply each assertion, and produce a score: 23/25, 24/25, 25/25. The denominator stays constant; the numerator reflects quality. This is the foundation of what the TDD-for-LLMs community calls “test-driven prompt engineering,” where expected output specifications are defined before the prompt is written.
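The run-and-score step reduces to a loop over prompts and assertions. A sketch assuming a caller-supplied `generate` function (stubbed here) that invokes the model:

```python
# Hypothetical scoring harness: N prompts x M assertions -> score out of N*M.
def score_suite(generate, prompts, assertions):
    """generate: fn(prompt) -> output text; assertions: list of fn(text) -> bool."""
    passed = 0
    for prompt in prompts:
        output = generate(prompt)
        passed += sum(1 for check in assertions if check(output))
    return passed, len(prompts) * len(assertions)

# Demo with a stubbed model and two toy assertions, for illustration only.
prompts = ["post about automations", "post about a product feature"]
assertions = [
    lambda t: len(t.split()) < 300,   # word count under 300
    lambda t: "\u2014" not in t,      # no em dashes
]
passed, total = score_suite(lambda p: f"A short post: {p}.", prompts, assertions)
print(f"{passed}/{total}")  # -> 4/4 for this stub
```

The denominator is fixed by the suite design, so a score like 23/25 is directly comparable across runs and across prompt revisions.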
Assertion Design Principles
Measurable without interpretation. “Contains at least one statistic” is measurable. “Uses compelling data” is not. If two evaluators could disagree on the result, the assertion is too subjective. Promptfoo’s documentation explicitly categorizes this as the “deterministic” assertion category, distinct from model-graded assessments.
Tied to quality outcomes. Every assertion should map to a real quality indicator that matters for the end result. “Word count under 300” matters because LinkedIn data consistently shows that posts in the 150 to 300 word range tend to generate higher engagement rates. Do not add assertions just to inflate the total count.
Independent. Each assertion tests one thing. If assertion 3 depends on assertion 2, a single root cause failure looks like two failures, distorting the score.
Stable across runs. The same output should always produce the same assertion results. Assertions that involve string matching, counting, or pattern detection are naturally stable. Assertions that require interpretation are not.
Binary Assertions in Autonomous Improvement Loops
Binary assertions become most powerful when combined with autonomous improvement loops: systems that modify their own instructions, test the output, and keep or revert changes based on the score.
How the Loop Works
- Baseline: Run all tests, score the current output (for example, 21/25)
- Modify: Make one change to the prompt or system instructions
- Retest: Run all tests again, score the modified output
- Decide: If the score improved (22/25), keep the change. If it dropped (20/25), revert.
- Repeat: Make a different change and loop
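The steps above can be sketched directly. Here the mutate and score functions are stubs so the keep-or-revert control flow is visible (all names are illustrative):

```python
import random

def improvement_loop(prompt, score_fn, mutate_fn, iterations=30):
    """Keep a candidate change only when the suite score improves."""
    best_score = score_fn(prompt)
    for _ in range(iterations):
        candidate = mutate_fn(prompt)          # one change per iteration
        candidate_score = score_fn(candidate)  # rerun all assertions
        if candidate_score > best_score:       # keep improvements, revert the rest
            prompt, best_score = candidate, candidate_score
    return prompt, best_score

# Stubbed demo: "score" counts required phrases; "mutate" appends one at random.
required = ["Be concise.", "Use one statistic.", "No questions at the end."]
score = lambda p: sum(r in p for r in required)
mutate = lambda p: p + " " + random.choice(required)
best, s = improvement_loop("Write a LinkedIn post.", score, mutate, iterations=50)
print(s)  # trends toward len(required) as kept changes accumulate
```

In a real system, `mutate_fn` would be an LLM proposing a prompt edit and `score_fn` would run the full assertion suite; the keep-or-revert logic stays the same.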
This pattern is not theoretical. OpenAI’s cookbook documents it as the “Self-Evolving Agents” architecture, where agents capture performance issues, learn from feedback, and promote improvements back into production workflows. Google DeepMind’s AlphaEvolve system, unveiled in May 2025, uses a similar evolutionary approach where an LLM generates candidate algorithm modifications and selection is driven by automated evaluation functions. Researchers behind the SICA framework reported 17 to 53 percent performance improvements on coding tasks through agents that edit their own prompts and heuristics using this loop pattern.
Stanford’s DSPy framework formalizes this further. DSPy optimizers automatically tune prompts and weights to maximize developer-specified metrics. The key requirement? Those metrics must be computable functions, which is exactly what binary assertions provide.
Why Binary Assertions Enable Autonomous Loops
- Automatable scoring means no human is needed in the loop
- Clear improvement signal makes 22/25 greater than 21/25 unambiguous
- Attribution through one change per iteration means you know what caused the improvement
- Convergence as the system trends toward higher scores with each kept change
Typical Improvement Patterns
Autonomous loops follow predictable trajectories:
- Iterations 1 through 5: Rapid improvement as obvious structural issues are fixed (for example, 18/25 climbing to 23/25)
- Iterations 5 through 15: Moderate improvement as subtler issues are addressed (23/25 reaching 24/25)
- Iterations 15 and beyond: Diminishing returns as the system approaches its ceiling
A system that starts at 18/25 might reach 24 or 25 out of 25 within 20 to 30 iterations, representing roughly 2 to 3 hours of autonomous execution depending on model latency. The same improvement achieved manually through human evaluation and prompt tweaking would typically take weeks.
The Boundary: What Binary Assertions Cannot Measure
Binary assertions are powerful but bounded. They excel at measuring structural, countable, and pattern-based quality dimensions. They fall short in several areas.
Tone and Voice
“Does this sound like our brand?” is fundamentally subjective. You can proxy it by checking for banned words, required phrases, and sentence length patterns, but the holistic feel of a brand voice resists binary reduction.
Creative Quality
“Is this hook engaging?” depends on the reader, the context, and the competitive landscape. No binary assertion captures whether content will actually resonate with its audience.
Contextual Appropriateness
“Is this response appropriate for this customer’s situation?” requires understanding context that binary checks cannot capture. A response might pass every structural test and still be wrong for the specific situation.
The Hybrid Solution
The industry has converged on a hybrid approach. Use binary assertions for structural quality (the 60 to 70 percent of quality that is measurable) and combine them with LLM-as-judge evaluation for qualitative dimensions.
Promptfoo implements this directly. Alongside its deterministic assertions, it offers llm-rubric and g-eval model-graded assertion types that use a secondary LLM to evaluate tone, relevance, and coherence. The same test suite can combine binary checks (“output contains fewer than 300 words”) with model-graded checks (“output maintains a professional but conversational tone”) and produce a single composite score.
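A composite score of this kind is just a weighted blend. A sketch with a stubbed judge (a real suite would call a second model where the stub sits):

```python
def composite_score(output, binary_checks, judge, binary_weight=0.7):
    """Blend a deterministic pass rate with a 0-1 model-graded score."""
    binary = sum(check(output) for check in binary_checks) / len(binary_checks)
    graded = judge(output)  # stub standing in for an llm-rubric / G-Eval call
    return binary_weight * binary + (1 - binary_weight) * graded

checks = [lambda t: len(t.split()) < 300, lambda t: "\u2014" not in t]
stub_judge = lambda t: 0.8  # pretend the judge model returned 0.8
print(composite_score("A short, professional post.", checks, stub_judge))
```

The 70/30 weighting here is an assumption for illustration; in practice the split reflects how much of your quality definition is structural versus qualitative.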
DeepEval takes a similar hybrid approach with over 50 evaluation metrics spanning both deterministic checks and LLM-evaluated criteria like answer relevancy, hallucination detection, and toxicity scoring.
This means you do not waste human attention on issues that machines can detect (formatting errors, word count violations, missing sections) and you do not pretend that machines can evaluate what they currently cannot (creativity, cultural nuance, contextual judgment).
The Real-World Tool Ecosystem
Teams implementing binary assertions today have several production-ready options.
Promptfoo
The most popular open-source option for prompt evaluation. Promptfoo uses YAML configuration files to define test suites with assertions. It supports both deterministic assertions (contains, regex, equals, JavaScript functions) and model-graded assertions (llm-rubric, G-Eval, search-rubric). Tests integrate directly into CI/CD pipelines, and assertion weights can be customized to reflect priority differences between criteria.
DeepEval
An open-source framework modeled after Pytest, specifically designed for unit testing LLM applications. DeepEval provides over 50 built-in metrics and supports CI/CD integration through standard test runners. Its assert_test() function makes binary assertion testing feel familiar to any developer who has written unit tests.
DSPy
Stanford’s framework takes a different approach: rather than testing output after generation, DSPy’s assertions constrain the generation itself. DSPy Assertions define rules that the LLM must follow, and DSPy’s optimizers automatically adjust prompts and weights to satisfy those constraints while maximizing specified metrics.
Braintrust and LangSmith
Enterprise platforms that combine evaluation with observability. Braintrust offers automated evaluation workflows with strong TypeScript/JavaScript support. LangSmith, built by the LangChain team, provides deep integration with LangChain-based applications. Both support custom scoring functions that can implement binary assertion logic at scale.
Implementing Binary Assertions: A Step-by-Step Guide
For Teams Starting Today
- List your output requirements. What must always be true about your AI tool’s output? Write them down as plain-language rules.
- Convert to binary. Rewrite each requirement as a yes/no question. “Output should be concise” becomes “Is the word count under 300?”
- Test manually first. Run your assertions by hand on 5 to 10 outputs to verify they capture real quality differences and do not produce false positives.
- Choose a framework. Promptfoo for YAML-based configuration, DeepEval for Pytest-style testing, or custom scripts if your needs are simple.
- Automate. Write your assertions as code and integrate them into your development workflow.
- Score and track. Produce a single number (pass count divided by total assertions) and log scores over time to detect regressions.
Common Mistakes to Avoid
Too many assertions. Twenty-five is a solid starting point. One hundred creates noise and makes individual improvements invisible in the score.
Too vague. “Output is well-structured” is not binary. “Output contains at least 3 subheadings” is.
Testing the wrong things. Assertions should reflect what actually matters for quality, not what is easy to measure. A perfectly formatted response that is factually wrong still fails in production.
No test prompt diversity. Running the same prompt five times does not test the application’s range. Use five different prompts that represent your actual use cases to catch systematic weaknesses.
Ignoring the qualitative gap. Binary assertions handle structural quality. You still need human review or LLM-as-judge evaluation for tone, creativity, and contextual appropriateness. Teams that rely exclusively on binary assertions develop blind spots in these areas.
Conclusion
Binary assertions transform AI quality from an opinion into a number. That number can be tracked, compared, automated, and systematically improved. The assertions themselves are simple (true or false, pass or fail) but the discipline of defining them forces teams to articulate what “quality” actually means for their specific use case.
The ecosystem has matured rapidly. Promptfoo, DeepEval, and DSPy provide production-ready implementations. The autonomous improvement loop pattern, validated by research from OpenAI, Google DeepMind, and Stanford, turns those assertions into an engine for continuous optimization. And the hybrid approach, combining deterministic assertions with LLM-as-judge evaluation, addresses the limitation that not everything worth measuring is binary.
The teams that adopt binary assertion frameworks gain two advantages: they can run autonomous improvement loops that optimize structural quality overnight, and they free their human evaluators to focus on qualitative dimensions (tone, creativity, cultural context) where human judgment is irreplaceable. Both dimensions improve. Neither is wasted.
Frequently Asked Questions
How many binary assertions should I start with for an AI application?
Start with a 5×5 framework: five representative test prompts with five assertions each, giving you a 25-point scoring system. This provides enough granularity to detect meaningful improvements without creating noise. As your team gains confidence, you can expand to 30 or 40 assertions, but resist the temptation to go beyond 50 for a single application. Too many assertions make individual score improvements invisible and increase maintenance burden. The key is that each assertion must map to a real quality outcome that matters for your users.
Can binary assertions replace human review of AI output entirely?
No. Binary assertions handle structural quality, which represents roughly 60 to 70 percent of what makes AI output good: correct formatting, appropriate length, required elements present, forbidden patterns absent. The remaining 30 to 40 percent involves qualitative dimensions like tone, creativity, cultural appropriateness, and contextual judgment that resist binary reduction. The industry best practice is a hybrid approach where binary assertions handle automatable checks and LLM-as-judge evaluation or human review covers qualitative assessment. This combination means humans focus on what they do best rather than catching formatting errors.
What is the difference between binary assertions and LLM-as-judge evaluation?
Binary assertions are deterministic: they use code-based logic (string matching, regex, counting, custom functions) to produce a definitive pass or fail result. The same output always produces the same assertion result. LLM-as-judge evaluation uses a secondary language model to score output on subjective criteria like coherence, relevance, or tone. LLM-as-judge is more flexible but less consistent, since the judge model itself is non-deterministic. Tools like Promptfoo support both in the same test suite, allowing teams to combine deterministic assertions for structural checks with model-graded assertions for qualitative assessment in a single evaluation run.
Sources & Further Reading
- Promptfoo Deterministic Metrics Documentation
- DeepEval: The LLM Evaluation Framework
- Self-Evolving Agents: Autonomous Agent Retraining
- DSPy Optimization Overview
- LLM Testing: Top Methods and Strategies
- Testing LLM Applications: A Practical Guide
- Promptfoo Assertions and Metrics
- The Complete Guide for TDD with LLMs