Watchdog and Fault Escalation

The watchdog is how ZeroKernel makes cooperative faults visible. This page covers heartbeat rules, execution windows, failure budgets, and what actually triggers state escalation.

Why this page matters

This page explains how the watchdog and fault escalation fit into the wider ZeroKernel execution model, what problem they are meant to solve, and what trade-off you are actually accepting when you use them in production firmware. The goal is not to treat the watchdog as an isolated API call, but to understand where it sits inside bounded scheduling, queue discipline, fault visibility, and profile selection.

Read this topic as an operational contract. Start from the smallest working path, wire it into a lean profile first, and only expand into richer routing, diagnostics, or transport state after you can prove that the timing outcome is still worth the extra flash and RAM. That mindset is what keeps ZeroKernel useful on small boards instead of turning it into another bloated abstraction.

The safest pattern is always the same: define the runtime boundary, keep the hot path short, measure the effect with compare scripts, and only then scale complexity. The examples below are not filler; they show the smallest repeatable patterns you can lift into real firmware when you need clean integration instead of ad-hoc loops.

Three practical patterns

Core cadence pattern

Use one bounded task for the hot path, then let the scheduler keep the phase aligned over time.

C++
    ZeroKernel.begin(boardMillis);
    ZeroKernel.addTask("Fast", fastTask, 10, 0, true);
    ZeroKernel.tick();

Deferred work pattern

Move non-critical routing and transport out of the immediate task body so fast paths stay predictable.

C++
    const auto key = ZeroKernel.makeTopicKey("telemetry.sample");
    ZeroKernel.publishDeferredFast(key, sampleValue);
    ZeroKernel.flushEvents();

Runtime visibility pattern

Read the timing report and stats together so you can prove the cost of each abstraction layer.

C++
    const auto stats = ZeroKernel.getStats();
    const auto timing = ZeroKernel.getTimingReport();
    Serial.println(timing.maxTickMs);


What to verify while you use it

  • Validate timing before you validate aesthetics. A cleaner API is not a win if fast-path deadline misses rise.
  • Prefer the smallest profile that still matches the workload, then add optional modules only when the measured payoff is obvious.
  • Keep callbacks and transport steps bounded so watchdog, panic flow, and queue limits remain meaningful.

Common mistakes that make results misleading

  • Do not copy a demo pattern into production firmware without measuring it on the real board and real build profile you plan to ship.
  • Do not read success counters without reading queue depth, timing, and workload label next to them.
  • Do not enable heavier diagnostics and compatibility flags in a lean target just because the defaults looked convenient.

Recommended working sequence

Start from the smallest valid path

Boot the runtime, register the minimum useful task set, and prove that the baseline timing is clean before adding optional layers.

Add one layer, then measure it

Introduce routing, diagnostics, or transport one layer at a time so the cost and payoff remain obvious.

Publish only repeatable results

Update docs, charts, or public claims only after the same workload survives the same validation path more than once.

What the watchdog is actually for

The watchdog in ZeroKernel is not a magical rescue thread. It is the runtime’s way to turn silent cooperative failure into visible, measurable system state. That matters because in embedded firmware, “nothing happened” is often the most dangerous failure mode.

When a task stops heartbeating, runs too long too often, or consumes its allowed failure budget, the watchdog gives the runtime a structured way to escalate. That may mean entering degraded mode, pausing non-critical work, moving into safe mode, or explicitly triggering panic if safe progress is no longer possible.

The goal is observability first, controlled response second. That is what makes the cooperative model defensible in real device workloads.
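
The escalation path described above can be sketched as a small state machine. This is an illustrative model only, not ZeroKernel's internal implementation: the state names mirror the modes mentioned in this page, while the `FaultEscalator` type, its fields, and the budget threshold are assumptions made for the example.

```cpp
#include <cstdint>

// Illustrative states matching the modes described above.
enum class RunState { Normal, Degraded, SafeMode, Panic };

// Minimal failure-budget escalation sketch: once a task consumes its
// budget of consecutive failures, the system moves one level closer
// to panic and the budget restarts at the new level.
struct FaultEscalator {
    uint8_t consecutiveFailures = 0;
    uint8_t failureBudget = 3;  // illustrative, not a ZeroKernel default
    RunState state = RunState::Normal;

    void reportSuccess() { consecutiveFailures = 0; }

    RunState reportFailure() {
        if (++consecutiveFailures >= failureBudget) {
            // Budget consumed: escalate exactly one level.
            switch (state) {
                case RunState::Normal:   state = RunState::Degraded; break;
                case RunState::Degraded: state = RunState::SafeMode; break;
                default:                 state = RunState::Panic;    break;
            }
            consecutiveFailures = 0;  // restart the budget at the new level
        }
        return state;
    }
};
```

The one-level-at-a-time escalation is the point of the sketch: a single bad burst degrades the system, but only sustained failure across several budgets reaches panic.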

Main watchdog inputs

Signal              What it means
Heartbeat timeout   A supervised task stopped confirming life within its declared window.
Execution overrun   A task repeatedly exceeds its expected budget or window.
Failure budget      A task crossed the allowed count of consecutive or repeated failures.
Policy escalation   The configured watchdog policy decides whether the result is degraded mode, safe mode, or panic.
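
The heartbeat-timeout signal in the table reduces to a simple timestamp check. The sketch below shows the idea in isolation; the `HeartbeatSlot` struct and its field names are assumptions for illustration, not ZeroKernel internals.

```cpp
#include <cstdint>

// Illustrative bookkeeping for one supervised task. The runtime would
// refresh lastBeatMs each time the task confirms it is alive.
struct HeartbeatSlot {
    uint32_t lastBeatMs = 0;   // timestamp of the last heartbeat
    uint32_t timeoutMs  = 250; // declared liveness window
};

// A task is considered stalled once its declared window elapses without
// a fresh heartbeat. Unsigned subtraction keeps this correct across
// millisecond-counter wraparound.
bool heartbeatExpired(const HeartbeatSlot& slot, uint32_t nowMs) {
    return (nowMs - slot.lastBeatMs) > slot.timeoutMs;
}
```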

Three common watchdog setups

C++
    ZeroKernel.setTaskHeartbeatTimeout("Sensor", 250);
    ZeroKernel.heartbeatTask("Sensor");


This is the minimum supervision pattern for a critical task that should prove it is still alive during normal operation.

C++
    ZeroKernel.setTaskHeartbeatTimeout("NetPump", 750);
    ZeroKernel.setTaskExecutionContract("NetPump", contract);
    ZeroKernel.setWatchdogPolicy(policy);


This ties liveness and policy together so a noisy transport path cannot fail silently forever.

C++
    if (fatalFault) {
      ZeroKernel.triggerPanic("watchdog_escalation");
    }


This is for unrecoverable states where continuing execution would be less safe than stopping cleanly.

What the watchdog does not do

The watchdog does not preempt a blocking I2C call, break a deadlocked socket operation apart, or rewind a driver that is stuck inside a bad library function. In a cooperative runtime, it cannot forcibly interrupt a running callback.

What it can do is expose the problem immediately, record that the task has violated its contract, and drive the rest of the runtime into a state where the failure is no longer silent. That is still a major operational advantage because the alternative is often an invisible freeze.

How to read watchdog data correctly

Look at watchdog counters next to timing counters. If overruns rise while queue depth also rises, the root cause may be load concentration. If heartbeat timeouts occur without queue growth, the task itself is likely stalling. If panic is being entered from a task that also has repeated failure budget hits, your execution contract is probably doing exactly what it should.
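
The correlations above can be captured in a small triage helper. This is a sketch of the reasoning only: the `WatchdogSnapshot` struct and the counter names are assumptions made for the example, not fields of ZeroKernel's real stats types.

```cpp
#include <cstring>
#include <cstdint>

// Hypothetical snapshot of the counters the paragraph above says to
// read side by side.
struct WatchdogSnapshot {
    uint16_t overruns;
    uint16_t heartbeatTimeouts;
    uint16_t queueDepthDelta;  // queue growth since the last snapshot
};

// Classify the likely root cause using the correlations described above:
// overruns plus queue growth suggests load concentration, heartbeat
// timeouts without queue growth suggest the task itself is stalling.
const char* diagnose(const WatchdogSnapshot& s) {
    if (s.overruns > 0 && s.queueDepthDelta > 0) return "load-concentration";
    if (s.heartbeatTimeouts > 0 && s.queueDepthDelta == 0) return "task-stall";
    return "inconclusive";
}
```

In practice you would feed such a helper from `getStats()` and `getTimingReport()` rather than hand-built structs, but the classification logic stays the same.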

Watchdog FAQ

Can the watchdog save a totally blocking I2C call?

It can expose the fault and drive recovery policy, but it cannot preempt a blocking call in a cooperative model.

Should every task have a heartbeat?

No. Heartbeats are most valuable for critical or stateful tasks where silence is meaningful.

What is the best first watchdog mistake to avoid?

Do not enable heartbeats everywhere without thinking. Supervise the tasks whose silence truly matters, otherwise the signal becomes noisy and harder to trust.