Abstract

Learning-enabled robotics has reached a point where system performance is no longer the only question. The harder question is whether a robot can be shown, in a structured and reproducible way, to remain acceptable under changing tasks, sensing conditions, and interaction contexts. This article argues that evaluation must move upstream from a final reporting activity to a design principle. In that framing, metrics, evidence, scenario coverage, safety margins, and intervention logic are not paperwork around the controller; they are part of the controller's engineering envelope. The article develops this idea from systems formulation to implementation practice, with emphasis on collaborative and industrial robotics. The goal is to retain research rigor while keeping the argument accessible to engineers, researchers, and technically literate readers who need to understand why trustworthy robotics depends on evaluation architecture rather than leaderboard performance alone.

Introduction

Robotics is increasingly built from components that learn: learned perception, learned policies, learned world models, learned anomaly detectors, learned calibration, and learned adaptation layers. This has changed the burden of proof. A conventional robot could often be argued from mechanism, specification, and deterministic test cases. A learning-enabled robot has to be argued from evidence under uncertainty.

That is the practical meaning of evaluation-by-design. The phrase does not mean "test more." It means that evaluation requirements shape architecture from the beginning:

  • what must be logged;
  • what counts as a successful run;
  • what counts as an unacceptable event;
  • which scenario families must be sampled before a claim is made;
  • and how runtime intervention is measured rather than ignored.

This shift matters for both research and deployment. It is increasingly aligned with public standardization and regulatory pressure. The EU AI Act, for example, formalizes obligations around risk management, data governance, technical documentation, logging, and post-market monitoring for high-risk systems [1]. Industrial robotics continues to be anchored by the ISO 10218 family and ISO/TS 15066 for collaborative operation [2, 3]. NIST has likewise pushed robotics toward structured performance measurement rather than anecdotal demonstrations [4, 5].

The central claim of this article is simple: in learning-enabled robotics, evaluation is part of system design because the object being engineered is not only a policy \(\pi\), but an evidence-backed closed-loop system.

Figure 1. Evaluation-by-design assurance map

From first principles

At the most basic level, evaluating a robot means asking two different questions.

The first is functional: did the robot achieve the intended task? The second is admissibility: did it achieve the task without entering states or behaviors that the application cannot tolerate? Classical robotics often emphasized the first question because the system logic was relatively fixed. Learning-enabled robotics makes the second question equally important because adaptive components can behave well in familiar conditions while becoming brittle under shift, ambiguity, or rare interaction patterns.

That is why evaluation-by-design begins before experiments. It starts when the engineer decides which events are unacceptable, which variables must be measured, which scenarios must be covered, and what level of evidence is required before a claim is made.

The systems formulation

Consider a robot operating in environment state \(x_t\), with observation \(o_t\), control action \(u_t\), and latent uncertainty \(w_t\). A compact closed-loop description is

\[ x_{t+1} = f(x_t, u_t, w_t), \qquad o_t = h(x_t, v_t), \qquad u_t = \pi_\theta(o_{0:t}, m_t), \]

where \(\pi_\theta\) may contain learned components and \(m_t\) denotes monitoring or supervisory state.
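The closed loop above can be rolled out in a few lines. The following is a minimal sketch, not a real controller: the scalar dynamics, Gaussian noise models, and the policy signature are illustrative assumptions, and the monitoring state \(m_t\) is threaded through as an opaque value that the policy may update.

```python
import random


def simulate_episode(f, h, policy, x0, T, rng):
    """Roll out the closed loop x_{t+1} = f(x_t, u_t, w_t) for T steps."""
    x = x0
    m = None                              # monitoring / supervisory state m_t
    history = []
    for _ in range(T):
        o = h(x, rng.gauss(0.0, 0.01))    # o_t = h(x_t, v_t), v_t ~ sensor noise
        u, m = policy(o, m)               # u_t = pi_theta(o_t, m_t)
        w = rng.gauss(0.0, 0.05)          # latent disturbance w_t
        x = f(x, u, w)                    # x_{t+1} = f(x_t, u_t, w_t)
        history.append((x, u))
    return history
```

The value of even a toy harness like this is that the disturbance and noise draws go through a single seeded generator, so a run can be replayed exactly.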

The usual performance question asks whether the task objective

\[ J(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{T} r(x_t, u_t)\right] \]

is high. Evaluation-by-design adds a second question: under what conditions is the system admissible? That requires explicit safety and evidence constraints. A useful abstraction is

\[ \max_{\pi} \; J(\pi) \quad \text{subject to} \quad \mathbb{P}\!\left(g_j(\tau) > 0\right) \le \varepsilon_j, \]

where \(g_j(\tau)\) encodes a violation condition over trajectory \(\tau\), such as collision, unsafe separation, control timing overrun, or stop failure.

The variables in this formulation matter. The state \(x_t\) represents the evolving robot-environment configuration, \(o_t\) is the information actually available to the controller, \(u_t\) is the executed action, and \(w_t\) collects unmodeled disturbances or uncertainty. The trajectory \(\tau\) denotes an entire run rather than a single time step, which is important because many safety failures are trajectory-level events.
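The chance constraint \(\mathbb{P}(g_j(\tau) > 0) \le \varepsilon_j\) can be checked empirically over sampled runs. A minimal sketch follows, assuming trajectories have already been collected and \(g\) is evaluable per trajectory; the rule-of-three bound for the zero-violation case is a standard statistical approximation chosen here for illustration, not something the formulation itself prescribes.

```python
def violation_rate(trajectories, g):
    """Empirical estimate of P(g(tau) > 0) over a set of sampled trajectories."""
    return sum(1 for tau in trajectories if g(tau) > 0) / len(trajectories)


def rule_of_three_upper(n_runs):
    """Approximate 95% upper bound on the violation probability when zero
    violations were observed across n_runs independent runs (rule of three)."""
    return 3.0 / n_runs
```

The second function matters in practice: observing no violations in \(n\) runs does not support a claim of zero risk, only a claim of roughly \(3/n\) at the 95% level.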

This formulation does two important things.

First, it separates objectives from non-negotiable constraints. A robot that is faster but violates safety envelopes is not better. Second, it forces evaluation to include distributions over scenarios, not only nominal runs. Let scenario parameters be sampled as \(\xi \sim p(\xi)\). Then any claim about performance is really a claim about performance over \(p(\xi)\), not over a single demonstration.
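A claim over \(p(\xi)\) presupposes a reproducible scenario sampler. The sketch below is illustrative: the parameter names and ranges are hypothetical, but the seeding pattern, one seed for the distribution draw and one recorded seed per run, is the point.

```python
import random


def sample_scenarios(n, seed=0):
    """Draw n scenario parameter sets xi ~ p(xi) from a seeded generator,
    so the exact sample behind a claim can be regenerated later."""
    rng = random.Random(seed)
    return [
        {
            "lighting": rng.uniform(0.2, 1.0),            # illumination factor
            "human_entry_delay_s": rng.expovariate(0.5),  # delayed human entry
            "sensor_dropout_p": rng.uniform(0.0, 0.1),    # per-frame dropout prob.
            "run_seed": rng.randrange(2**31),             # recorded per-run seed
        }
        for _ in range(n)
    ]
```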

Evidence as an engineered artifact

In practice, evaluation-by-design means building an evidence graph rather than a test report. The graph links claims to artifacts:

  • requirement definitions;
  • scenario generators;
  • runtime logs;
  • metric computation;
  • failure labels;
  • safety interventions;
  • uncertainty estimates;
  • and traceable software versions.

This is not bureaucracy. It is the only way to make a learning-enabled robot scientifically inspectable.
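Even a plain dictionary can serve as a first evidence graph, provided it is machine-checkable. The claim names, artifact keys, and fields below are hypothetical; the useful property is that a missing link is detectable automatically.

```python
# Hypothetical evidence graph: claims link to the artifacts that back them.
EVIDENCE = {
    "claims": {
        "task_success_rate": {"supported_by": ["protocol", "runtime_logs", "metric_code"]},
        "no_unsafe_events": {"supported_by": ["safety_envelope", "runtime_logs", "failure_labels"]},
    },
    "artifacts": {
        "protocol": {"version": "1.2"},
        "runtime_logs": {"path": "logs/run_001.jsonl"},
        "metric_code": {"commit": "deadbeef"},
        "safety_envelope": {"doc": "hazard_definition.md"},
        "failure_labels": {"path": "labels.csv"},
    },
}


def unsupported_claims(graph):
    """Return claims that cite an artifact absent from the artifact store."""
    return [
        claim
        for claim, spec in graph["claims"].items()
        if any(a not in graph["artifacts"] for a in spec["supported_by"])
    ]
```

A check like this can run in continuous integration, so a claim cannot silently outlive the artifact that once supported it.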

Table 1. Minimal evidence layers for a learning-enabled robotic system

Layer                 | Question answered                                     | Typical artifact
Task definition       | What is the robot supposed to achieve?                | Protocol and acceptance criteria
Hazard definition     | What must never happen?                               | Safety envelope, separation rules, stop conditions
Scenario model        | Under what operational variability is the claim made? | Scenario distribution, perturbation ranges, seeds
Runtime observability | What was actually measured?                           | Synchronized logs, traces, system timestamps
Metric layer          | How is success quantified?                            | Success rate, latency, distance margins, intervention counts
Evidence synthesis    | How are claims supported?                             | Confidence intervals, failure analysis, traceability matrix

The strongest robotics groups already work this way, even when they do not use the term. In collaborative robotics, for example, a good study does not only report task completion. It reports separation distances, stop responses, intervention frequency, and robustness under delayed human entry, occlusion, or sensing degradation [3, 4, 6].

A design pattern for evaluation-aware robotics

The architecture consequence is that the robot should expose an evaluation interface. A practical stack often includes:

  • a scenario definition layer;
  • a control layer;
  • a monitoring layer;
  • a logging and synchronization layer;
  • a metric computation layer;
  • and a claim layer that maps aggregated evidence to acceptance rules.

One can formalize an acceptance rule as

\[ \mathcal{A}(\pi) = \mathbf{1}\!\left[ \mu_{\text{task}} \ge \alpha \;\land\; \mu_{\text{unsafe}} = 0 \;\land\; \mu_{\text{intervention}} \le \beta \;\land\; \ell_{\max} \le L_{\max} \right], \]

where \(\mu_{\text{task}}\) is mean task performance, \(\mu_{\text{unsafe}}\) is the observed rate of safety-critical failures, \(\mu_{\text{intervention}}\) is the intervention rate, and \(\ell_{\max}\) is the worst-case control latency.

This rule is intentionally modest. It does not pretend to certify everything. It simply prevents the common mistake of reporting only task reward or throughput while hiding the rest of the system behavior.

Conceptually, the equation says that acceptance is conjunctive rather than compensatory. In other words, high task performance cannot compensate for unsafe behavior, excessive intervention, or unacceptable control timing.

Implementation sketch

The implementation challenge is rarely the metric formula itself. The challenge is forcing all required evidence to exist in a consistent schema. A lightweight example is below.

from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """Evidence schema for a single evaluation run."""
    scenario: str           # scenario family identifier
    success: int            # 1 if task acceptance criteria were met, else 0
    completion_time: float  # seconds from start to task completion
    min_separation: float   # minimum human-robot separation observed (m)
    interventions: int      # number of supervisory overrides triggered
    unsafe_event: int       # 1 if any safety-critical violation occurred, else 0
    max_control_ms: float   # worst-case control-loop latency (ms)


def summarize(records):
    """Aggregate per-run records into the metrics the acceptance rule needs."""
    return {
        "success_rate": mean(r.success for r in records),
        "mean_completion_time": mean(r.completion_time for r in records),
        "mean_min_separation": mean(r.min_separation for r in records),
        "mean_interventions": mean(r.interventions for r in records),
        "unsafe_rate": mean(r.unsafe_event for r in records),
        "max_control_ms": max(r.max_control_ms for r in records),
    }


def accepts(summary, success_floor=0.95, max_interventions=1.0, budget_ms=20.0):
    """Conjunctive acceptance rule: every condition must hold; none compensates."""
    return (
        summary["success_rate"] >= success_floor
        and summary["unsafe_rate"] == 0.0
        and summary["mean_interventions"] <= max_interventions
        and summary["max_control_ms"] <= budget_ms
    )
This is deliberately simple, but it captures the right engineering instinct: define the schema first, then define the claim. In many robotics projects the inverse happens, and evidence collection is retrofitted after the algorithm is already fixed.
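One refinement the schema invites: the success rate above is a point estimate, while the evidence synthesis layer of Table 1 calls for confidence intervals. A Wilson score interval is one reasonable choice here (a choice of this sketch, not mandated by the text); unlike the naive normal interval, it stays inside [0, 1] and remains informative when the observed rate is near 0 or 1.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial success rate at confidence z."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Reporting the interval alongside the rate makes the difference between 19/20 and 950/1000 visible, even though both give a 0.95 point estimate.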

Interpretation

The payoff is larger than better testing.

For researchers, it sharpens problem formulation. The question becomes not "did the method improve?" but "under which scenario distribution, under which constraints, and by what evidence standard did it improve?" For engineers, it changes integration. The robot must expose observability, timestamps, intervention hooks, and scenario metadata. For managers and reviewers, it makes claims easier to trust because the evidence path is explicit.

It also changes what counts as maturity. A flashy demo with a strong nominal success rate but no scenario coverage, no intervention accounting, and no confidence intervals is weak evidence. A quieter system with narrower claims but rigorous traceability is usually stronger science and better engineering.

Failure modes in evaluation practice

Several recurring mistakes undermine learning-enabled robotics:

  • treating simulation performance as if it were deployment evidence;
  • mixing nominal metrics and safety constraints into one opaque score;
  • using random split accuracy as a substitute for operational robustness;
  • failing to log the activation of supervisory overrides;
  • and making deployment claims without a scenario model.

These are not stylistic issues. They directly weaken the truth value of the reported result.
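The fourth failure mode, unlogged supervisory overrides, is also the cheapest to prevent. The sketch below is illustrative; the event fields and JSONL output are assumptions, but the principle is that every override activation becomes a timestamped, scenario-tagged record rather than a silent recovery.

```python
import json
import time


class InterventionLogger:
    """Append-only record of supervisory override activations, so that
    intervention counts enter the evidence instead of vanishing."""

    def __init__(self):
        self.events = []

    def record(self, kind, scenario, detail=""):
        self.events.append({
            "t_wall": time.time(),   # wall-clock timestamp for log synchronization
            "kind": kind,            # e.g. "protective_stop", "speed_override"
            "scenario": scenario,    # scenario identifier for traceability
            "detail": detail,
        })

    def dump_jsonl(self, path):
        """Persist one JSON object per line for downstream metric computation."""
        with open(path, "w") as f:
            for event in self.events:
                f.write(json.dumps(event) + "\n")
```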

Conclusion

Evaluation-by-design is the natural methodological response to learning-enabled robotics. Once the robot includes adaptive or learned components, performance can no longer be separated cleanly from evidence architecture. The engineered object becomes a monitored, traceable, distribution-aware closed-loop system. That is the right level of abstraction for collaborative, industrial, and safety-relevant robotics.

If the field takes that abstraction seriously, evaluation stops being a final chapter in a paper and becomes part of the robot's design logic. That is the change needed for robotics to remain credible as it becomes more adaptive, more autonomous, and more deeply embedded in real work.