Brief: Build a Recoverable Capability
Goal. Serve a native capability from two providers, invoke it by service name, kill the primary mid-run, and prove the call fails over to the standby — the "recover" half of the agent loop (Submitted Is Not Completed).
Prerequisites
- Rust toolchain; `cargo add net-mesh-sdk tokio serde`.
- No external mesh needed: this brief stands up its own two (or three) in-process `Mesh` nodes, the pattern proven by `adapters/mcp/tests/serve_end_to_end.rs` (`invoke_fails_over_when_the_primary_provider_goes_down`).
Steps
- Serve the capability from two providers. Build two `Mesh` nodes, register the same typed handler on each under one service name, and make the capability substitutable so the mesh treats them as interchangeable:

      let _h1 = primary.serve_rpc_typed("summarize", handler.clone())?;
      let _h2 = standby.serve_rpc_typed("summarize", handler.clone())?;

- Invoke by service name, not node id; this is what makes failover possible:

      let resp: SummarizeResp = caller.call_service_typed("summarize", &req, opts).await?;

- Kill the primary between two calls, then invoke again. Wrap the call in the retry helper so the transient failure during cutover is absorbed:

      use net_sdk::mesh_rpc_resilience::RetryPolicy;
      let resp = caller.call_service_typed_with_retry("summarize", &req, opts, &RetryPolicy::default()).await?;
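The routing behaviour these steps rely on can be sketched without the SDK. The block below is a toy, std-only model and not the `net-mesh-sdk` API: `ToyMesh`, `Provider`, `CallError`, `call_node`, `call_service`, and `call_with_retry` are all hypothetical names invented for illustration. It assumes the router learns a provider is down only when a call to it fails, which is one plausible source of the transient failure during cutover.

```rust
use std::collections::HashMap;

/// Only `Transient` failures are worth retrying.
#[derive(Debug, PartialEq)]
enum CallError {
    Transient(String),   // e.g. the routed provider just went away
    Application(String), // e.g. a bad request; retrying cannot help
}

type Handler = fn(&str) -> Result<String, CallError>;

struct Provider {
    node_id: u32,
    alive: bool,    // ground truth: is the node actually up?
    routable: bool, // the router's (possibly stale) belief
    handler: Handler,
}

/// Hypothetical in-process stand-in for a mesh: service name -> providers.
#[derive(Default)]
struct ToyMesh {
    services: HashMap<String, Vec<Provider>>,
}

impl ToyMesh {
    fn serve(&mut self, service: &str, node_id: u32, handler: Handler) {
        let provider = Provider { node_id, alive: true, routable: true, handler };
        self.services.entry(service.to_string()).or_default().push(provider);
    }

    /// Kill a node. The router only learns about it on the next failed call.
    fn kill(&mut self, node_id: u32) {
        for p in self.services.values_mut().flatten() {
            if p.node_id == node_id {
                p.alive = false;
            }
        }
    }

    /// Pinned call: no rerouting, so a dead node is simply an error.
    fn call_node(&self, service: &str, node_id: u32, req: &str) -> Result<String, CallError> {
        let provider = self
            .services
            .get(service)
            .and_then(|ps| ps.iter().find(|p| p.node_id == node_id))
            .ok_or_else(|| CallError::Application("unknown node".into()))?;
        if !provider.alive {
            return Err(CallError::Transient(format!("node {} is down", node_id)));
        }
        (provider.handler)(req)
    }

    /// Service-name call: route to the first provider believed to be up.
    fn call_service(&mut self, service: &str, req: &str) -> Result<String, CallError> {
        let providers = self
            .services
            .get_mut(service)
            .ok_or_else(|| CallError::Application("unknown service".into()))?;
        let provider = providers
            .iter_mut()
            .find(|p| p.routable)
            .ok_or_else(|| CallError::Transient("no live provider".into()))?;
        if !provider.alive {
            provider.routable = false; // stale route discovered; drop it
            return Err(CallError::Transient(format!("node {} is down", provider.node_id)));
        }
        (provider.handler)(req)
    }
}

/// Re-issue the call only while the failure is transient.
fn call_with_retry(mesh: &mut ToyMesh, service: &str, req: &str, max_attempts: u32) -> Result<String, CallError> {
    let mut attempt = 1;
    loop {
        match mesh.call_service(service, req) {
            Err(CallError::Transient(_)) if attempt < max_attempts => attempt += 1,
            other => return other,
        }
    }
}

fn primary(req: &str) -> Result<String, CallError> { tagged("primary", req) }
fn standby(req: &str) -> Result<String, CallError> { tagged("standby", req) }

/// Shared handler body; the tag records which provider answered.
fn tagged(tag: &str, req: &str) -> Result<String, CallError> {
    if req.is_empty() {
        return Err(CallError::Application("bad request: empty input".into()));
    }
    Ok(format!("{}:{}", tag, req))
}
```

In this model, serving `primary` as node 1 and `standby` as node 2 under `"summarize"`, calling `kill(1)`, and then calling `call_with_retry` yields a standby-tagged response: the first attempt hits the stale route and fails as transient, and the second attempt reroutes. A `call_node` pinned to node 1 keeps failing, and an empty request comes back as an application error after a single attempt.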
Expected output
- The first `call_service_typed` returns a result from the primary.
- After the primary is dropped, the retried `call_service_*` returns a result from the standby: one successful response, no caller-visible error.
Verify (acceptance)
- The pre-kill call and the post-kill call both return a valid `SummarizeResp`.
- The post-kill result demonstrably came from the standby (tag the two handlers' output so you can tell them apart).
- Calling by a pinned node id instead of the service name does not fail over — proving the failover is a property of service-name discovery, not magic.
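One way to make the "came from the standby" check concrete is to stamp each provider's output with a tag. The sketch below assumes hypothetical field layouts for `SummarizeReq` and `SummarizeResp` (the brief does not define them), and `tagged_handler` and `failed_over` are illustrative helpers, not SDK functions. Note that tagging means registering a distinct handler per provider instead of cloning one shared handler.

```rust
/// Hypothetical request shape; the real type comes from your handler.
#[derive(Debug, Clone, PartialEq)]
struct SummarizeReq {
    text: String,
}

/// Hypothetical response shape with a test-only provenance tag.
#[derive(Debug, Clone, PartialEq)]
struct SummarizeResp {
    summary: String,
    served_by: String, // identifies which provider produced the response
}

/// Build a handler whose output is stamped with `tag`.
fn tagged_handler(tag: &'static str) -> impl Fn(&SummarizeReq) -> SummarizeResp + Clone {
    move |req: &SummarizeReq| SummarizeResp {
        // Placeholder "summary": the first sentence of the input.
        summary: req.text.split('.').next().unwrap_or("").trim().to_string(),
        served_by: tag.to_string(),
    }
}

/// Acceptance check: the pre-kill response came from the primary and the
/// post-kill response came from the standby.
fn failed_over(before: &SummarizeResp, after: &SummarizeResp) -> bool {
    before.served_by == "primary" && after.served_by == "standby"
}
```

With `tagged_handler("primary")` registered on one node and `tagged_handler("standby")` on the other, the test can assert `failed_over(&pre_kill, &post_kill)` and also confirm that both summaries are identical, since the two providers are meant to be interchangeable.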
Pitfalls
- Call by service name for failover. A call pinned to a dead node id cannot reroute — it just fails.
- Retry only retryable errors. `RetryPolicy` re-issues transient/backpressure failures, not application errors; retrying a `bad request` forever is not recovery.
- For latency-sensitive paths, prefer hedging (`call_service_with_hedge`) over retry: race a second provider instead of waiting for the first to fail.
- A `CircuitBreaker` around a repeatedly-failing target stops you from burning every deadline on a provider that's already down.
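The hedging and circuit-breaker ideas above can be illustrated with a std-only sketch. This is not how `call_service_with_hedge` or the SDK's `CircuitBreaker` are implemented. `hedge` and `Breaker` are hypothetical stand-ins built on threads and a channel, and the assumed breaker policy (open after N consecutive failures, allow a trial call after a cooldown) is just one common design.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Give the primary a head start of `hedge_after`; if it has not answered by
/// then, start the secondary too and return whichever answers first.
fn hedge<T, A, B>(primary: A, secondary: B, hedge_after: Duration) -> Option<T>
where
    T: Send + 'static,
    A: FnOnce() -> T + Send + 'static,
    B: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let tx2 = tx.clone();
    thread::spawn(move || {
        let _ = tx.send(primary());
    });
    if let Ok(value) = rx.recv_timeout(hedge_after) {
        return Some(value); // primary was fast enough, no hedge needed
    }
    thread::spawn(move || {
        let _ = tx2.send(secondary());
    });
    rx.recv().ok() // first response wins; the loser's result is discarded
}

/// Minimal consecutive-failure breaker: after `threshold` failures in a row,
/// reject calls immediately until `cooldown` has elapsed.
struct Breaker {
    threshold: u32,
    cooldown: Duration,
    failures: u32,
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Breaker { threshold, cooldown, failures: 0, opened_at: None }
    }

    /// Returns `None` without running `f` while the breaker is open.
    fn call<T, E>(&mut self, f: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if let Some(opened) = self.opened_at {
            if opened.elapsed() < self.cooldown {
                return None; // fail fast: spend no deadline on a dead target
            }
            self.opened_at = None; // cooldown over: allow one trial call
        }
        let result = f();
        match &result {
            Ok(_) => self.failures = 0,
            Err(_) => {
                self.failures += 1;
                if self.failures >= self.threshold {
                    self.opened_at = Some(Instant::now());
                }
            }
        }
        Some(result)
    }
}
```

The trade-off is visible in the shapes: `hedge` spends extra work (a second in-flight call) to cut tail latency, while `Breaker` spends nothing at all on a target it already believes is down. A slow primary with a short `hedge_after` resolves to the secondary's answer, and a breaker with a threshold of 2 runs the failing closure only twice no matter how many calls follow inside the cooldown.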
See Recover a Failed Workflow and the per-SDK errors pages (e.g. Rust).