Autonomous Testing Agent Benchmark in the Works

AutoExplore is building a benchmark for agentic exploratory testing. We want the community’s input on the metrics that should define exploration quality, defect detection, and efficiency.

Sampo Kivistö

Founder & CEO

At AutoExplore, we’re building a benchmark to evaluate and improve agentic exploratory testing capabilities.

Our goal is simple: create a fair, transparent way to measure how well autonomous agents explore, detect issues, and learn from software systems.

We are looking for input from testing professionals and developers. Measuring the wrong things can distort behavior. Measuring the right things can unlock real progress.

What do we mean by “exploratory testing” for agents?

One of the first questions we got from the community was: how are we defining exploratory testing here?

In human testing, exploratory testing is often described as simultaneous test design and execution, guided by learning and discovery. Some people will question whether agents can truly do that today. It may be that what agents do right now is something new that sits between mechanical script execution and human-led discovery. (Ministry of Testing: What should we measure in exploratory testing?)

Our working definition for this benchmark is intentionally simple:

  • We do not tell the agent what to test (no test cases, no scripted steps, no predefined assertions).
  • The agent decides what to explore, what to try next, and what to report as issues.
  • We evaluate the results (coverage, findings, report quality, efficiency) using transparent scoring.

If you think this definition is wrong (or too broad), we want to hear that too. The definition drives the incentives.

What we are currently planning to measure

1) Exploration and coverage

Exploratory testing is about navigating unknown territory without being told exactly what to do. We want to measure how broadly and deeply an agent explores, without prescribing a strategy.

We are currently tracking metrics like:

  • Number of discovered views, pages, and URLs
  • Coverage of interactive elements
  • Links tested (count and proportion)
  • Buttons tested (count and proportion)
  • Inputs tested (count and proportion)
  • Total number of operations performed
  • Data entry variations attempted
  • Order-of-operation permutations (how many distinct sequences were exercised)

These counts are not the end goal. They are instrumentation signals that help us understand whether the agent touched the product at all, and whether the tool is stable. On their own, they do not capture the core value of exploratory testing: discovery, learning, and risk reduction.

What we want to avoid is a benchmark that rewards mindless clicking. Coverage should reflect meaningful exploration, not just activity.
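
To make the instrumentation concrete, here is a minimal sketch of how these counts could be rolled up into coverage proportions. The event schema and field names are assumptions for illustration, not our final format.

```python
from dataclasses import dataclass


@dataclass
class CoverageSnapshot:
    """Raw exploration counts from a single agent run (hypothetical schema)."""
    pages_discovered: set[str]   # unique views, pages, and URLs reached
    links_total: int
    links_tested: int
    buttons_total: int
    buttons_tested: int
    inputs_total: int
    inputs_tested: int
    operations_performed: int


def coverage_metrics(s: CoverageSnapshot) -> dict:
    """Turn raw counts into the proportions reported alongside them."""
    def ratio(tested: int, total: int) -> float:
        return tested / total if total else 0.0

    return {
        "pages_discovered": len(s.pages_discovered),
        "link_coverage": ratio(s.links_tested, s.links_total),
        "button_coverage": ratio(s.buttons_tested, s.buttons_total),
        "input_coverage": ratio(s.inputs_tested, s.inputs_total),
        "operations_performed": s.operations_performed,
    }
```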

2) Defect detection quality

Finding bugs is not enough. Reporting noise is costly, and missing critical issues is worse.

We are thinking in confusion-matrix terms:

                  Reported as issue    Not reported
  Real defect     True positive        False negative
  Not a defect    False positive       True negative

From this we can measure:

  • Precision (how many reported issues are real)
  • Recall (how many real defects get reported)
  • Signal-to-noise ratio (useful findings vs noise)

The tricky part is definitions: what counts as “the same issue”, what severity bucket it belongs in, and how to score partial reports.

In practice, our approach is: the agent produces a human-readable report, and then a judge (human, another agent, or both) classifies the outcome so we can compute precision/recall.
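
To show how the scoring could work once the judge has labeled each finding, here is a minimal sketch computing precision, recall, and signal-to-noise from the confusion-matrix counts. The function name and example numbers are illustrative only.

```python
def detection_scores(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Precision, recall, and signal-to-noise from judged findings.

    true_pos:  reported issues the judge confirmed as real defects
    false_pos: reported issues the judge rejected (noise)
    false_neg: known defects in the catalog that were never reported
    """
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    signal_to_noise = true_pos / false_pos if false_pos else float("inf")
    return {"precision": precision, "recall": recall, "signal_to_noise": signal_to_noise}


# Example: 8 confirmed findings, 2 rejected, 5 catalog defects missed
print(detection_scores(true_pos=8, false_pos=2, false_neg=5))
# -> precision 0.80, recall ~0.62, signal-to-noise 4.0
```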

An example of a strong defect report format is:

  • Expected: Clicking “Create a free account” should open a registration flow with the required fields.
  • Actual: The UI stays on the login form and the registration fields never appear.
  • Reproduction steps: Open the page, click the “Create a free account” button, observe the missing registration flow.
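
One way to keep such reports both human-readable and machine-scorable is a small structured format. The sketch below is a hypothetical schema built around the example above, not a required one.

```python
from dataclasses import dataclass, field


@dataclass
class DefectReport:
    """Hypothetical report structure mirroring the expected/actual/steps format."""
    title: str
    expected: str
    actual: str
    reproduction_steps: list[str]
    severity: str = "unclassified"
    evidence: list[str] = field(default_factory=list)  # screenshot paths, log excerpts, traces


report = DefectReport(
    title="Registration flow does not open",
    expected='Clicking "Create a free account" should open a registration flow with the required fields.',
    actual="The UI stays on the login form and the registration fields never appear.",
    reproduction_steps=[
        "Open the page",
        'Click the "Create a free account" button',
        "Observe the missing registration flow",
    ],
)
```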

3) Efficiency

Exploration is also about discovering value quickly.

We want to measure:

  • Total test time
  • Time to first defect
  • Time to first critical defect
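
As a sketch of how these timings could be derived, assume a chronologically ordered event log with timestamps in seconds from run start; the event structure below is invented for illustration.

```python
def efficiency_metrics(events: list) -> dict:
    """Derive timing metrics from a chronologically ordered event log.

    Each event is assumed to look like:
      {"t": 42.0, "type": "defect_reported", "severity": "critical"}
    """
    total_time = events[-1]["t"] if events else 0.0
    defects = [e for e in events if e["type"] == "defect_reported"]
    first_defect = defects[0]["t"] if defects else None
    criticals = [e for e in defects if e.get("severity") == "critical"]
    first_critical = criticals[0]["t"] if criticals else None
    return {
        "total_test_time": total_time,
        "time_to_first_defect": first_defect,
        "time_to_first_critical_defect": first_critical,
    }
```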

What else should we measure?

This is where we need the community. If you have opinions (or hard-won lessons), we’d love to hear them.

Here are some candidate areas we are considering, but we do not want to bake in the wrong incentives:

Example: a charter-based perspective

Exploratory testers often work with session charters, for example:

Mobile connectivity: Explore the app with a poor network connection to discover whether it still works as expected or provides helpful error messages.

If we wanted metrics for that single charter, we might measure things like:

  • Behavior under network profiles (slow, loss, intermittent offline)
  • Whether critical flows still complete (or fail safely)
  • Error message quality (clear, actionable, not misleading)
  • Recovery (retry works, state is preserved, no duplicate submissions)
  • Evidence quality (screenshots, network traces, timings, logs)
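
For the connectivity charter specifically, one way a browser-based harness could apply such network profiles is Chrome DevTools Protocol emulation. The sketch below assumes a Playwright-driven setup (Chromium only, since it uses CDP); the throughput numbers and the example URL are arbitrary placeholders.

```python
from playwright.sync_api import sync_playwright

# Illustrative network profiles; the values are placeholders, not tuned figures.
PROFILES = {
    "slow": {"offline": False, "latency": 400, "downloadThroughput": 50_000, "uploadThroughput": 20_000},
    "offline": {"offline": True, "latency": 0, "downloadThroughput": -1, "uploadThroughput": -1},
}

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    cdp = context.new_cdp_session(page)        # CDP sessions are Chromium-specific
    cdp.send("Network.enable")
    cdp.send("Network.emulateNetworkConditions", PROFILES["slow"])
    page.goto("https://example.com")           # placeholder for the app under test
    # ... let the agent explore under this profile, then switch profiles and compare behavior
    browser.close()
```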

Report usefulness

  • Reproducibility (are the steps sufficient and deterministic?)
  • Deduplication (does the agent file the same issue repeatedly?)
  • Quality of evidence (screenshots, logs, timings, network traces, console output)
  • Severity and priority classification accuracy
  • Minimization (can it reduce to the smallest set of steps that still reproduces?)
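
As an example of how deduplication could be scored, here is a sketch that collapses reports sharing a normalized signature. The normalization is deliberately naive and purely illustrative.

```python
import re


def report_signature(title: str, steps: list) -> str:
    """Naive signature: lowercase title plus first step, with digits and whitespace collapsed."""
    raw = (title + " " + (steps[0] if steps else "")).lower()
    return re.sub(r"\d+|\s+", " ", raw).strip()


def duplicate_rate(reports: list) -> float:
    """Fraction of reports that repeat an already-seen signature."""
    seen = set()
    duplicates = 0
    for r in reports:
        sig = report_signature(r["title"], r.get("steps", []))
        if sig in seen:
            duplicates += 1
        seen.add(sig)
    return duplicates / len(reports) if reports else 0.0
```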

Depth and state coverage

  • Unique UI states reached (not just pages or URLs)
  • Multi-step flows covered (for example onboarding, checkout, role-based paths)
  • Data state variation (different users, permissions, seeded datasets)
  • Negative-path exploration (invalid inputs, boundary values, empty states)
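
Counting unique UI states, rather than pages or URLs, could for example hash a structural abstraction of the screen. A rough sketch, with the choice of abstraction very much open for debate:

```python
import hashlib


def state_fingerprint(url_path: str, element_roles: list) -> str:
    """Hash a coarse abstraction of a UI state: the path plus the sorted set of
    interactive element roles/labels visible on screen. Volatile content such as
    timestamps is deliberately excluded so superficially different renders collapse."""
    canonical = url_path + "|" + ",".join(sorted(set(element_roles)))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two visits to /cart with the same controls map to one state, while /cart with
# an "empty cart" layout counts as a different one.
```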

Robustness and autonomy

  • Recovery from dead ends (modals, error pages, expired sessions)
  • Resilience to flaky UI behavior (timing, spinners, animations)
  • Ability to continue exploring after encountering failures
  • Avoidance of destructive actions in shared test environments
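
One possible way to quantify "ability to continue exploring after encountering failures" is a recovery rate over the run's event log; a sketch under the same assumed log format as the efficiency example above.

```python
def recovery_rate(events: list) -> float:
    """Fraction of failures after which the agent performed at least one more successful action.

    Assumes a chronological log where each event has a "type" such as
    "action_ok" or "action_failed" (hypothetical schema).
    """
    failures = [i for i, e in enumerate(events) if e["type"] == "action_failed"]
    if not failures:
        return 1.0
    recovered = sum(
        1 for i in failures
        if any(e["type"] == "action_ok" for e in events[i + 1:])
    )
    return recovered / len(failures)
```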

Learning across runs

  • Reduced repetition (does it avoid re-testing the same things?)
  • Novelty rate (how much new surface is discovered per unit time)
  • Regression detection (does it re-validate known risky areas after changes?)
  • Knowledge transfer (can it reuse what it learned on one app to bootstrap another?)
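
Novelty rate could be approximated by comparing state fingerprints across runs; a small sketch reusing the fingerprint idea from the state-coverage example.

```python
def novelty_rate(current_run_states: set, previous_runs_states: set,
                 run_duration_minutes: float) -> float:
    """New unique states discovered per minute, relative to everything seen in earlier runs."""
    new_states = current_run_states - previous_runs_states
    return len(new_states) / run_duration_minutes if run_duration_minutes else 0.0
```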

Cost

  • Compute cost per hour of testing
  • Cost per unique defect (or per high-severity defect)
  • Browser and API resource usage
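
Once runs are instrumented, the cost metrics reduce to simple ratios; a sketch for illustration, guarding against division by zero.

```python
def cost_metrics(total_cost_usd: float, hours: float,
                 unique_defects: int, high_severity_defects: int) -> dict:
    """Cost normalized by time and by findings."""
    return {
        "cost_per_hour": total_cost_usd / hours if hours else 0.0,
        "cost_per_unique_defect": total_cost_usd / unique_defects if unique_defects else None,
        "cost_per_high_severity_defect": (
            total_cost_usd / high_severity_defects if high_severity_defects else None
        ),
    }
```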

If you have a strong take on any of these (or you think we should ignore some entirely), email me at sampo.kivisto@autoexplore.ai.

We are also looking for benchmark applications

We are searching for test applications with a known, documented list of issues.

Ideally:

  • Publicly available
  • Designed for testing practice
  • With known defect catalogs (so we can score false positives and false negatives)

If you know applications built for this purpose, we’d love to evaluate them for inclusion in the benchmark. We will credit you and the original author.

When you suggest an app, it helps if you can include:

  • Link to repo and license
  • How to run it locally (or hosted URL)
  • Where the defect list lives (docs, issue tracker, challenge list)
  • Any notes about which issues are in scope (UI, API, accessibility, security, performance)