Abstract

Intralogistics is one of the clearest settings in which the appeal and the limits of learning-based control become visible at once. Warehouses and material-flow systems are dynamic, congested, partially uncertain, and operationally expensive, so adaptive methods are attractive. At the same time, they are governed by hard constraints on collision avoidance, deadlock prevention, throughput, timing budgets, and maintainability, so unconstrained learning is rarely acceptable. This article argues that the right engineering object is neither a purely learned policy nor a purely hand-coded heuristic, but a hybrid controller in which learning is assigned a narrow and measurable role within a larger safety-preserving architecture. The article develops this position from graph-based formulation to practical implementation and evaluation.

Introduction

Modern intralogistics systems are not abstract path-planning puzzles. They are real-time resource-allocation systems in physical space. Agents share lanes, merge points, buffers, charging constraints, and compute budgets. Congestion changes rapidly, disturbances occur continuously, and the cost of poor coordination is measured in throughput loss, queue growth, stop-start motion, and occasionally unsafe behavior.

This is precisely why learning has become interesting in the domain. Static rules can be too brittle when traffic structure changes. But it is also why naive end-to-end learning is risky. A method that improves average reward while occasionally inducing deadlock or violating timing budgets is not a good intralogistics controller.

The central claim of this article is that learning should be used where it has comparative advantage: local prioritization, congestion prediction, dispatch ranking, adaptive parameter tuning, or residual correction. Global feasibility, hard safety, and timing guarantees should remain anchored in classical planning, formal constraints, and supervisory logic.

Hybrid control design space for intralogistics

From first principles

An intralogistics controller solves a flow problem before it solves a learning problem. Goods must move through a shared physical network. That network has bottlenecks, capacities, contention points, and timing constraints. If those structural facts are ignored, a learning system may appear adaptive while actually making the plant harder to reason about.

This is why the article starts from graph structure and constraints rather than from reinforcement learning alone. The graph expresses what movement is possible; the controller then decides how that movement should be coordinated under demand, uncertainty, and congestion.

Problem formulation

Let the material-flow system be modeled as a capacitated directed graph

\[ G = (V, E, c), \]

where \(V\) is the set of locations, \(E\) the admissible transitions, and \(c_e\) the capacity of edge \(e\). Let \(N\) agents move over the graph with states \(x_t^i\) and actions \(u_t^i\). The joint system state is \(x_t = (x_t^1, \dots, x_t^N, q_t)\), where \(q_t\) includes queue, reservation, and resource information.
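A minimal in-code representation of this state makes the formulation concrete. The class and field names below are illustrative, not from any particular library; the sketch only captures the capacitated-edge and occupancy structure described above.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    capacity: int  # c_e: maximum number of agents on the edge at once

@dataclass
class SystemState:
    positions: dict                                 # agent id -> node (x_t^i)
    occupancy: dict = field(default_factory=dict)   # (src, dst) -> agents on edge (part of q_t)
    reservations: set = field(default_factory=set)  # reserved (edge, time-slot) pairs

def within_capacity(state: SystemState, edge: Edge) -> bool:
    """An edge transition is admissible only if it keeps occupancy below c_e."""
    return state.occupancy.get((edge.src, edge.dst), 0) < edge.capacity
```

Keeping capacity checks as pure functions of explicit state, rather than burying them in a learned model, is what later allows the supervisor to filter actions deterministically.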

The operational objective is rarely a single reward. A more realistic formulation is

\[ \max_{\pi} \; \mathbb{E}\!\left[ \lambda_1 \, D(T) - \lambda_2 \,\overline{\tau} - \lambda_3 \,\text{Energy} \right] \]

subject to

\[ \text{DeadlockRate} = 0, \qquad \text{CollisionRate} = 0, \qquad \ell_{\max} \le L_{\max}, \]

where \(\overline{\tau}\) is mean delay and \(\ell_{\max}\) is worst-case controller latency.

Here \(D(T)\) is the number of completed deliveries over horizon \(T\), \(\overline{\tau}\) summarizes delay across completed jobs, and the coefficients \(\lambda_1, \lambda_2, \lambda_3\) express the operational tradeoff between productivity, delay, and energy or wear. The key point is that the optimization target is multi-objective, while the safety and timing constraints remain non-negotiable.

This is already enough to explain why hybridization is natural. Throughput and delay reward adaptation. Deadlock, collision, and control timing demand explicit structure.
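This split can be encoded in a few lines: the weighted objective is scored only for runs that pass the hard gates. The statistics dictionary keys and the λ values below are placeholders, not calibrated weights.

```python
def score_run(stats: dict, lambdas=(1.0, 0.1, 0.01)):
    """Return (scalarized objective, feasible) for one run.

    A run that violates a hard constraint gets no score at all:
    deadlock, collision, and latency are gates, not objective terms.
    """
    l1, l2, l3 = lambdas
    feasible = (
        stats["deadlocks"] == 0
        and stats["collisions"] == 0
        and stats["max_latency_ms"] <= stats["latency_budget_ms"]
    )
    if not feasible:
        return None, False
    objective = (l1 * stats["deliveries"]
                 - l2 * stats["mean_delay"]
                 - l3 * stats["energy"])
    return objective, True
```

The design choice worth noting is that infeasible runs return `None` rather than a heavily penalized score: a penalty can be traded away by a learner, a gate cannot.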

Where learning belongs

A hybrid controller decomposes the control problem into layers with different responsibilities.

Table 1. A practical role assignment for hybrid intralogistics control

Layer                  | Main function                                       | Best implemented as
-----------------------|-----------------------------------------------------|--------------------------------------
route feasibility      | graph-valid movement and reservation feasibility    | deterministic planner or optimizer
safety envelope        | collision, capacity, and deadlock prevention        | hard constraints, shield, supervisor
local prioritization   | who moves first under congestion                    | learned ranking or adaptive heuristic
demand adaptation      | dispatch weighting, release timing, rerouting bias  | prediction model or learned policy
performance auditing   | throughput, delay, interventions, latency budgets   | explicit evaluation pipeline

This split is not arbitrary. It reflects the fact that different subproblems have different tolerances for uncertainty.

The learned component should therefore be narrow enough that its effect can be measured. A good example is a learned prioritizer over feasible candidates. Let \(\mathcal{A}_{\text{safe}}(x_t)\) be the set of actions that preserve safety and feasibility. Then the learning problem is not to produce any action at all, but to rank admissible actions:

\[ u_t = \arg\max_{a \in \mathcal{A}_{\text{safe}}(x_t)} Q_\phi(x_t, a). \]

This is a much stronger formulation than unconstrained policy learning because the feasible set is already filtered by system logic.

Hybrid control as constrained decision making

One can express the architecture as a constrained Markov decision process,

\[ \max_{\pi} \; \mathbb{E}_\pi \left[\sum_t r_t \right] \quad \text{subject to} \quad \mathbb{E}_\pi \left[\sum_t c_j(s_t, a_t)\right] \le d_j, \]

but in deployment the stronger engineering view is often a shielded controller:

\[ a_t^{\mathrm{exec}} = \begin{cases} a_t^{\mathrm{learned}}, & a_t^{\mathrm{learned}} \in \mathcal{A}_{\text{safe}}(x_t), \\ a_t^{\mathrm{fallback}}, & \text{otherwise}. \end{cases} \]

The notation makes the engineering logic explicit. The learned module is allowed to optimize only inside the admissible action set \(\mathcal{A}_{\text{safe}}(x_t)\). The supervisor therefore changes the problem from unrestricted action generation to constrained action selection.

This representation is operationally important. It says the learned module does not own the full system. It proposes; the supervisory structure decides what may be executed.

That idea connects directly to safe reinforcement learning via shielding [1]. It also aligns with how real warehouses are engineered: autonomy is allowed where it is beneficial, but the plant still requires bounded risk and recoverable failure modes.

Implementation sketch

The following toy example shows the right software pattern. The learner scores options, but the system only permits options that pass deterministic checks.

def admissible_actions(state, graph, reservations):
    candidates = []
    for action in state["candidate_moves"]:
        if respects_graph(action, graph) and respects_reservations(action, reservations):
            candidates.append(action)
    return candidates


def select_action(state, learner, graph, reservations, fallback):
    safe_actions = admissible_actions(state, graph, reservations)
    if not safe_actions:
        # No admissible move: hand control to the deterministic fallback.
        return fallback(state)

    # Rank only inside the feasibility envelope. Using max with a key
    # avoids comparing the actions themselves when two scores tie.
    return max(safe_actions, key=lambda a: learner.score(state, a))

This design is intentionally conservative. In intralogistics that is usually a virtue. A learner that operates inside a feasibility envelope can still improve throughput, but it does so without silently taking ownership of safety-critical logic.
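A toy run of this pattern shows the envelope doing its job. The learner, graph encoding, and "wait in place" fallback below are illustrative; the two helpers are repeated in condensed form so the demo stands alone.

```python
# Condensed restatement of the sketch above so the demo runs standalone.
def admissible_actions(state, graph, reservations):
    return [a for a in state["candidate_moves"]
            if a in graph["edges"] and a not in reservations]

def select_action(state, learner, graph, reservations, fallback):
    safe_actions = admissible_actions(state, graph, reservations)
    if not safe_actions:
        return fallback(state)
    return max(safe_actions, key=lambda a: learner.score(state, a))

class QueueAwareLearner:
    """Illustrative scorer: prefer destinations with shorter queues."""
    def __init__(self, queue_len):
        self.queue_len = queue_len
    def score(self, state, action):
        return -self.queue_len.get(action[1], 0)

def wait_in_place(state):
    return ("wait", state["node"])

state = {"node": "A", "candidate_moves": [("A", "B"), ("A", "C"), ("A", "D")]}
graph = {"edges": {("A", "B"), ("A", "C")}}  # ("A", "D") is not a valid edge
reserved = {("A", "C")}                      # ("A", "C") is already held
learner = QueueAwareLearner({"B": 2, "C": 0})

chosen = select_action(state, learner, graph, reserved, wait_in_place)
# Only ("A", "B") survives both checks, so the learner's preference
# for the shorter queue at C never reaches execution.
```

Note that the learner would have ranked ("A", "C") highest; the reservation check vetoes it before scoring even matters, which is exactly the ownership split the architecture intends.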

Evaluation and acceptance criteria

The correct benchmark is system-level, not policy-level. For controller \(m\), scenario family \(s\), and seeds \(k = 1, \dots, K\), let \(d_k \in \{0, 1\}\) indicate whether run \(k\) deadlocked, \(\ell_k\) its peak controller latency, and \(\tau_i\) the delay of completed job \(i \in \mathcal{C}\). Then strong evaluation includes

\[ \text{Throughput}(T) = \frac{D(T)}{T}, \qquad \text{DeadlockRate}_{m,s} = \frac{1}{K}\sum_{k=1}^{K} d_k, \]
\[ \overline{\tau}_{m,s} = \frac{1}{|\mathcal{C}|}\sum_{i \in \mathcal{C}} \tau_i, \qquad \ell_{\max}^{m,s} = \max_k \ell_k. \]
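The aggregation is mechanical once the per-run records exist; the field names below are assumptions about what the simulator logs for each of the \(K\) seeds.

```python
def aggregate(runs):
    """Aggregate K run records for one (controller, scenario) pair.

    Each record is assumed to look like:
      {"deliveries": int, "horizon": float,
       "delays": [per-completed-job delay], "deadlocked": bool,
       "peak_latency_ms": float}
    """
    K = len(runs)
    throughput = sum(r["deliveries"] / r["horizon"] for r in runs) / K
    deadlock_rate = sum(r["deadlocked"] for r in runs) / K       # (1/K) * sum of d_k
    completed = [d for r in runs for d in r["delays"]]           # delays over C
    mean_delay = sum(completed) / len(completed) if completed else float("nan")
    max_latency = max(r["peak_latency_ms"] for r in runs)        # worst-case latency
    return {"throughput": throughput, "deadlock_rate": deadlock_rate,
            "mean_delay": mean_delay, "max_latency_ms": max_latency}
```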

A practical decision rule can be written as

\[ \Delta \text{Throughput} > 0, \qquad \Delta \text{DeadlockRate} \le 0, \qquad \ell_{\max} \le L_{\max}. \]

This is a better acceptance rule than "the reward improved" because it respects the operational structure of the problem.
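The decision rule translates directly into a gate; the metric names mirror the formulas above, and the latency budget is a deployment-specific parameter, not a learned quantity.

```python
def accept(candidate: dict, baseline: dict, latency_budget_ms: float) -> bool:
    """Acceptance rule for a hybrid controller against a baseline:
    throughput must strictly improve, the deadlock rate must not
    regress, and worst-case control latency must fit the budget."""
    return (candidate["throughput"] > baseline["throughput"]
            and candidate["deadlock_rate"] <= baseline["deadlock_rate"]
            and candidate["max_latency_ms"] <= latency_budget_ms)
```

A controller that wins on reward but fails any one of these three checks is rejected, which is the operational meaning of "non-negotiable" constraints.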

It also makes the progression from basic to advanced reasoning clear. The basic question is whether more deliveries are completed. The more advanced question is whether that gain survives safety, deadlock, and compute-budget accounting. Serious intralogistics research must answer the second question, not only the first.

Interpretation

Hybrid control is sometimes described as a compromise. That undersells it. In intralogistics, it is better understood as a principled allocation of algorithmic responsibility.

Learning is strongest where local adaptation matters and the search space is too large for fixed rules. Classical control and combinatorial structure are strongest where admissibility, interpretability, and timing discipline matter. A good hybrid controller therefore does not dilute rigor. It concentrates flexibility in the narrow region where flexibility is valuable.

What weak papers often miss

Several mistakes recur in the literature and in industrial prototypes:

  • using simulation reward as a proxy for plant performance;
  • failing to report control latency or compute variability;
  • allowing a learned policy to generate infeasible actions and correcting them too late;
  • reporting average throughput without deadlock and queue-tail statistics;
  • and treating multi-agent coordination as if it were only a local navigation problem.

These mistakes matter because intralogistics is a systems problem. The unit of success is the flow architecture, not the isolated agent.

Conclusion

Hybrid learning control is the most credible route for safe multi-agent intralogistics because it matches the structure of the engineering problem. The system needs adaptation, but it also needs hard envelopes on feasibility, safety, and control timing. Assigning those requirements to different layers is not a workaround. It is the architecture implied by the domain.

The strongest future systems will likely become more learning-enabled, not less. But they will succeed only if the learned components are embedded inside measurable, reviewable, and operationally disciplined control stacks.