Net v0.12 — "Firestarter"
v0.12 breaks the "Black Diamond" hardening line. After two consecutive releases of pure bug-fix + audit closure (v0.10 / v0.11), Firestarter is the first feature release on the line: it ships a complete request/response RPC surface (nRPC) on top of the v0.11 mesh, plus the four-language binding pipeline that consumes it (Node, Python, Go, plus the existing Rust SDK), plus a TypeScript migration of the Node binding's hand-written modules. The hardening posture is intact — every new surface has the same handle-lifetime, panic-safety, and FFI-soundness guarantees v0.11 established for the existing surfaces — but this release is about adding capability, not just polishing the existing one.
nRPC
Folds, Codec, Mesh Glue
The architectural anchor (and the prerequisite for everything else): an RPC server is a CortEX fold over a directed channel pair. There is no new transport, no new subsystem, no new daemon — just a typed dispatch enum on EventMeta, a channel-naming convention, and small caller-side / server-side helpers.
SubscriptionMode::QueueGroupon the channel roster (adapter/net/channel/roster.rs) — the one missing channel-layer primitive. Work-distribution dispatch alongside the existingBroadcastmode.add_with_mode/dispatch_recipients/subscriber_modeAPI; back-compat shims preserve every existing call site.MembershipMsg::Subscribe.queue_group: Option<String>wire field added atchannel/membership.rswith forward-compat decode (pre-queue-group senders with zero remaining bytes after the token decode asBroadcast). Public APIsMesh::subscribe_channel_in_queue_group[_with_token]. Pinned by 13 regression tests; cross-validated end-to-end bytests/queue_group_dispatch.rs(twoQueueGroupsubscribers on different nodes divide a stream of 100 events between them with exactly-once delivery; broadcast subscriber + queue-group pool coexist on one channel).cortex::rpccodec (adapter/net/cortex/rpc.rs) — dispatch constantsDISPATCH_RPC_REQUEST/RESPONSE/CANCEL/STREAM_GRANT/STREAM_CHUNK_DROPPED, flag bits (FLAG_RPC_STREAMING_RESPONSE,FLAG_RPC_PROPAGATE_TRACE),RpcStatusenum (Net-native with documented gRPC equivalence),RpcRequestPayload/RpcResponsePayloadround-trip codec withMAX_RPC_*caps andencoded_len()helpers for buffer pre-sizing. 15 regression tests pin wire stability + decode-rejection of malformed payloads.RpcServerFold—RedexFold<()>decoding REQUEST events, dispatching the handler in tokio, emitting RESPONSE via aRpcResponseEmittercallback.RpcCancellationToken(Notify+AtomicBool wrapper, race-safe),RpcContext(caller_origin + decoded payload + cancellation),RpcHandlerasync-trait,RpcHandlerError::{Application, Internal}. Handler panic caught viacatch_unwindand surfaced asRpcStatus::Internal. Fast deadline-already-passed short-circuit. CANCEL flips the in-flight token. Malformed payloads emit a structured warn-and-skip and continue (do not kill the cortex adapter). Duplicate REQUEST for an in-flightcall_idis refused; first-wins semantics. Per-channel-hash inbound dispatch hook onMeshNode(register_rpc_inbound/unregister_rpc_inbound) lets the mesh's inbound packet path consult a dispatcher map per packet (one DashMap get); registered channel hashes route directly and skip the per-shardinboundqueue.RpcClientFold+RpcClientPending— symmetric caller side.RpcClientPending::register(call_id)returns a oneshot receiver for unary calls;register_streaming(call_id)returns an mpsc receiver ofStreamItemfor streaming calls (the sameRpcClientFolddemuxes both call kinds via aPendingEntry::{Unary | Streaming}enum). Re-register of the samecall_idcloses the prior receiver (misuse detection).Mesh::serve_rpc(service, handler)/Mesh::call(target_node_id, service, payload, opts)glue (adapter/net/mesh_rpc.rs).serve_rpcregisters an inbound dispatcher for<service>.requests's channel hash; the dispatcher pushes events into a tokio mpsc that drains through theRpcServerFold.calllazy-subscribes to<service>.replies.<caller_origin>, allocates acall_id, registers a oneshot in the per-MeshRpcClientPending, direct-sends the REQUEST viapublish_to_peerbypassing the local subscriber roster (RPC's caller-knows-target model doesn't fit the publisher-led pub/sub roster), and awaits the receiver underopts.deadline. ReturnsRpcReplyonOk,RpcErroron any failure.ServeHandleis RAII — the dispatcher unregisters on Drop and in-flight handlers complete (no abort). Per-Mesh state additions onMeshNode:rpc_client_pending,rpc_next_call_id,rpc_reply_subscriptions(bounded; refuses hash collisions instead of overwriting).- End-to-end Mesh integration test (
tests/integration_nrpc_mesh.rs, 4 tests through real network handshake): round-trip echo, multiple sequential calls reusing the lazy reply subscription with exactly-once handler invocation, server panic surfaces asInternal, deadline emits CANCEL and surfaces asTimeoutto the caller. Deadline-fire CANCEL emission is now pinned by an explicit assertion test (rpc_deadline_fires_cancel_on_the_wire).
Service Discovery + Routing Policies
- Service discovery via capability announcements.
Mesh::serve_rpcauto-registers the service in a per-Meshrpc_local_servicesset;announce_capabilities[_with]auto-mergesnrpc:<service>tags onto the announcedCapabilitySet, propagating through the existing capability-broadcast machinery. Two new public APIs:Mesh::find_service_nodes(service) -> Vec<u64>queries the local capability index for nodes carrying thenrpc:<service>tag;Mesh::call_service(service, payload, opts) -> Result<RpcReply, RpcError>finds candidates, picks one perRoutingPolicy, dispatches via the existing direct-addressedcall(target, ...). ReturnsRpcError::NoRouteif no servers advertise the tag.ServeHandle::Dropremoves the service from the local registry so subsequent announcements stop emitting the tag. RoutingPolicyenum onCallOptions(defaultRoundRobin):RoundRobinuses a dedicated per-Mesh cursor withfetch_add(no longer collides with the call-id counter);Random(xxh3 ofcall_id, modulo);Sticky { key: u64 }(xxh3 of key, modulo a sorted candidate list — same key → same target while the candidate set is stable);LowestLatency(picks the candidate with smallestlatency_usper the localProximityGraph; deterministic fallback to the lexicographically-first sorted node id when no proximity data exists).filter_unhealthy: boolonCallOptions(defaulttrue) — skips candidates whoseProximityGraphentry reports!is_available(). Pin: candidates with NO proximity entry are KEPT (absence of evidence ≠ evidence of unhealth), so a freshly-announced server isn't falsely filtered just because pingwaves haven't propagated yet.- EntityId ↔ node_id bridge —
MeshNode::entity_id_for_node(u64) -> Option<[u8; 32]>accessor consultspeer_entity_idsto map session-layer node ids to entity-layer keys. The single missing piece thatLowestLatencyandfilter_unhealthyboth flow through. - End-to-end coverage (
tests/integration_nrpc_service_discovery.rs, 6 tests): three nodes, two serve "echo", one caller usescall_service— both servers exercised by round-robin;Stickypins consistency;Randomdistributes evenly; no-servers returnsNoRoutewith diagnostic;LowestLatencyfalls back deterministically when no proximity data exists;filter_unhealthykeeps proximity-less candidates.
Streaming, Tracing, Resilience, Metrics
The biggest single chunk of new surface in this release.
- Streaming responses. Multi-fire
DISPATCH_RPC_RESPONSEevents for onecall_idmarked non-terminal vs. terminal via thenrpc-streamingheader (continue/end).RpcResponseSink(unbounded mpsc, non-blockingsend),RpcStreamingHandlerasync-trait, andRpcServerStreamingFold(parallel toRpcServerFoldbut spawns a pump task draining the sink and emitting per-chunknrpc-streaming: continueframes; handler return → terminalendframe, handlerErr→ terminal non-Okframe, handler panic caught bycatch_unwind→ terminalInternal). Per-call ordering guarantee: the streaming fold takes anRpcAsyncResponseEmitter(Arc<dyn Fn(...) -> BoxFuture<()>>) instead of the unary fold's syncRpcResponseEmitter, and the pump task.awaits each emit before reading the next sink chunk — without this, two chunks emitted in tight succession would race into the publish path via independenttokio::spawns and arrive at the caller out of order. Caller side:Mesh::call_streamingreturns anRpcStream: futures::Stream<Item = Result<Bytes, RpcError>>; terminal-Ok closes the stream, terminal-error yields one finalErr(RpcError::ServerError)then closes.RpcStream::Dropclears the pending entry and best-effort emits CANCEL via direct unicast so the server's handler observesctx.cancellation. - Per-stream window grants (closes the Phase 3 streaming backlog). Wire additions:
DISPATCH_RPC_STREAM_GRANT(caller → server, payload is 4-byte big-endianu32credit count) +HEADER_NRPC_STREAM_WINDOW_INITIAL(REQUEST header, ASCII-decimalu32initial window). Server side keeps a per-callArc<tokio::sync::Semaphore>map; pump taskacquire_owned().await+forget()per chunk. STREAM_GRANT eventsadd_permits(n). Caller side:CallOptions::stream_window_initial: Option<u32>.RpcStream::poll_nextauto-grants 1 credit per delivered chunk (in-flight credit holds near the initial window).RpcStream::grant(n)is the explicit API for batched cadence; no-op when flow control isn't enabled. Defensive caps on incoming GRANT amounts so a misbehaving caller can't overflow tokio'sMAX_PERMITS. Bounded streaming pump mpsc with drop-on-full metric so a slow caller can't unbounded-buffer the server. - W3C Trace Context propagation (
cortex::rpc::TraceContext+extract_trace_context/build_trace_headershelpers). NewCallOptions::trace_context: Option<TraceContext>andRpcContext::trace_context: Option<TraceContext>fields. When the caller setsCallOptions::trace_context, the SDK emitstraceparent/tracestateheaders and setsFLAG_RPC_PROPAGATE_TRACE; the server's fold extracts the headers and populatesRpcContext::trace_context. nRPC is transport-only — application code on both sides reads/writes via whatever tracing backend it has wired up (tracing-opentelemetry, Datadog, etc.). Emptytracestateis omitted on the wire (W3C convention). Header-name matching is case-insensitive (W3C + HTTP convention); the previous implementation usedname.as_str() == "traceparent"and silently dropped any non-lowercase variant. - Caller-side retry helper (
sdk/src/mesh_rpc_resilience.rs).RetryPolicywith full-half jitter (each backoff scaled by uniform random in[0.5, 1.0]), exponential growth (backoff_multiplier, default2.0), upper-bound cap (max_backoff), and a swappableretryable: Arc<dyn Fn(&RpcError) -> bool>predicate. Default policy: 3 attempts, 50ms initial → 1s cap.default_retryableretriesTimeout,Transport, andServerErrorfor canonical transient statuses (Internal,Backpressure, server-observedTimeout); does NOT retryNoRoute,Codec, application errors,NotFound,Unauthorized,UnknownVersion, orCancelled. Four wrappers onMesh:call_with_retry,call_service_with_retry,call_typed_with_retry,call_service_typed_with_retry. Typed variants encode once and reuse the bytes across attempts; service variants re-resolve the candidate set per attempt so failover is automatic. - Caller-side hedge helper.
HedgePolicy { delay, hedges }— fire-then-race: primary att=0, additional hedges att=delay*idx, first reply (Ok or Err) wins; if first finisher isErr, the wrapper waits for remaining hedges before surfacing the deterministic last error. Defaults: 50ms delay, 1 hedge. Four wrappers:call_with_hedge_to(targets, ...)/call_typed_with_hedge_tofor explicit-target hedging (e.g. primary + warm-standby),call_service_with_hedge/call_service_typed_with_hedgefor capability-index-driven hedging across replicas. Why service-only and explicit-targets-only, not direct-to-one-target: hedging to the same target is always wrong (same backlog, same GC pause, doubles your load for nothing). Hedge losers'UnaryCallGuard::Dropfires CANCEL to the server, which observes it onctx.cancellation(pinned byhedge_loser_handler_observes_cancellation). - Caller-side circuit breaker.
CircuitBreakerwithCircuitBreakerConfig— three-state machineClosed → Open → HalfOpen → Closed/Open. Defaults: 5 consecutive failures to trip, 30s open cooldown, 1 successful probe to close. Different shape from retry/hedge: a long-lived stateful guard the user instantiates once (typically per logical downstream — one per service, or one per(service, target)pair) and shares viaArc<CircuitBreaker>. The wrapper takes a closure:breaker.call(|| async { mesh.call_typed::<Req,Resp>(...).await }).await. Generic over the inner result type so it composes around raw, typed, retried, OR hedged calls.BreakerError::{Open | Inner(RpcError)}— pattern-matchOpento fall back,Innerto handle the underlying error.default_breaker_failurematchesdefault_retryable(transient infra failures count as health signals; application errors don't). HalfOpen semantics: at most ONE concurrent probe; other calls during HalfOpen short-circuit. Panic-safe: a probe that panics doesn't poison the breaker's mutex; a poisoned mutex is recovered intointo_inner()so the breaker keeps serving. - Unary-call CANCEL-on-drop. New
UnaryCallGuardis constructed insideMesh::callimmediately after the REQUEST is published; if the call future is dropped before resolving (hedge loser,tokio::select!losing arm, caller-sideJoinHandle::abort), the guard's Drop runspending.cancel(call_id)AND spawns a CANCEL publish to the server via the newspawn_cancel_publishhelper (shared withRpcStream::Drop). The success path flipsguard.completed = trueso a happy call doesn't fire a useless CANCEL. - Per-service metrics + Prometheus formatter (
adapter/net/mesh_rpc_metrics.rs).RpcMetricsRegistry— per-MeshDashMap<String, Arc<ServiceMetricsAtomic>>(one entry per service that's been called or served). Bounded; idle entries with no in-flight ops and zero counters get evicted alongside empty queue-group shells. Per-service counters: caller-side (calls_total,errors_no_route/errors_timeout/errors_server/errors_transport,in_flight,latency_sum_ns/latency_count, Prometheus-default cumulative bucketed histogram), server-side (handler_invocations_total,handler_panics_total,handler_in_flight,handler_duration_*,streaming_chunks_emitted_total,streaming_chunks_dropped_total).CallMetricsGuard— RAII shim built BEFORE any potential early-return bumpsin_flighton construction, balances on Drop. Snapshot + Prometheus formatter:MeshNode::rpc_metrics_snapshot()is a cheap one-DashMap-pass copy. Service names are escaped per Prometheus exposition convention (backslash, double-quote, newline,\r); negative gauges from racy decrements clamp to zero.
nRPC bindings — Node, Python, Go (B1–B7)
The seven-phase rollout from NRPC_BINDINGS_PLAN.md ships in full. Each phase landed independently; all phases pass their per-binding test suites and the cross-binding wire-format compat tests. Total ~5,800 LoC of new binding code + ~2,500 LoC of tests.
| Phase | Scope | Commit |
|---|---|---|
| B1 | Node — raw serve / call / callService / callStreaming (Buffer in/out). Validates the napi ThreadsafeFunction handler-bridging pattern. | 98967fdc |
| B2 | Node — typed wrappers + RetryPolicy / HedgePolicy / CircuitBreaker + per-service metrics. | 5741f8e2 |
| B3 | Python — raw + GIL-aware runtime.block_on + tokio::task::spawn_blocking for handler dispatch. | 4003d9bb |
| B4 | Python — typed wrappers + resilience helpers + ServeHandle context manager. | 000b53bc |
| B5 | Go C-ABI — raw lifecycle + unary call / call_service / serve / find_service_nodes (bindings/go/rpc-ffi/, separate cdylib libnet_rpc). | ea7c3836 |
| B6 | Go C-ABI — streaming + pure-Go RetryPolicy / HedgePolicy / CircuitBreaker + ABI version stamp (net_rpc_abi_version() -> u32, 0x0001 initial). | 9cf612ab |
| B7 | Cross-binding wire-format compat — shared tests/cross_lang_nrpc/golden_vectors.json fixture (6 ok cases + 3 error cases) drives parallel suites in Rust (tests/integration_nrpc_cross_lang.rs, 4 tests) + Node (bindings/node/test/cross_lang_compat.test.ts, 4 tests) + Python (bindings/python/tests/test_cross_lang_compat.py, 16 parametrized tests). 24 cross-binding compat assertions total. Drift in any binding's JSON encoding, typed-error mapping, or status-code constants now fails that binding's own CI. | 4cd7366b |
Cross-cutting decisions enforced by the fixture and the per-binding compat suites:
- Stable
nrpc:error prefix. Every binding's caller-side errors carrynrpc:<kind>: <detail>where<kind>is one ofno_route,timeout,server_error,transport,codec_encode,codec_decode,breaker_open. Each binding maps the prefix to typed exception classes viaclassifyError(e)(Node) /classify_error(e)(Python) /parseRpcError+ typed*RpcError(Go). The Node binding throws plainErrorwith the prefix (NOT typed classes) to sidestep vitest's dual-module-instance hazard; users classify at the catch site. - Canonical typed-handler status codes:
NRPC_TYPED_BAD_REQUEST = 0x8000,NRPC_TYPED_HANDLER_ERROR = 0x8001— both in the application-defined range0x8000..=0xFFFF. Re-exported from every binding alongside the typed surface. (The fixture initially used0x4001matching a stale Rust SDK comment; the fixture and Rust test were corrected to match the constant the bindings actually export. Found while writing the cross-binding compat suite.) ServeHandlelifecycle per language. Node:.close()method (finalizers are non-deterministic so callers MUST close). Python: context-manager protocol (with rpc.serve(...) as handle:) + explicit.close(). Rust:Drop. Go:(*ServeHandle).Close()+runtime.SetFinalizeras a backstop. In every case "drop / close stops new dispatch but lets in-flight handlers complete" — same contract as the Rustserve_rpc.- Caller-driven cancellation across all four bindings. Late in the cycle the bindings each grew an explicit cancellation surface beyond the existing CANCEL-on-future-drop:
- Node: AbortSignal-driven (
MeshRpc.reserveCancelToken()mints abigint; pass on the call's options; callMeshRpc.cancelCall(token)from anAbortSignallistener). Abort fires CANCEL on the wire. - Python:
Cancellablepyclass +RpcCancelledError. Pass viaopts={'cancel': cancel};cancel.cancel()from another thread aborts mid-flight. - Go:
ctx.Done()watcher goroutine wired throughnet_rpc_reserve_cancel_token/net_rpc_cancel_callC-ABI exports. Watcher pins to the stream/call's lifetime so it doesn't leak past close. Watcher self-deadlock prevention viawatcherDonechannel closed beforeClose().
- Node: AbortSignal-driven (
- Per-handler timeout configurable everywhere. Each binding's
serveaccepts an optional handler timeout (defaults to 60s for Go, no default for Rust/Node/Python — the SDK wraps user code with no timeout unless asked). Wedged handlers can't hold the in-flight slot indefinitely.
Node binding TS migration
- Single source of truth.
errors.tsandmesh_rpc.tsreplace the hand-writtenerrors.js/mesh_rpc.js+ parallel.d.tsfiles. The.d.tswas the only guard on the public type contract — and reviews of the nRPC work surfaced several places where the two had quietly diverged (theRawMeshRpcshape, thebreaker.armeddead branch, theappErrorhelper signature). Compiling from a single TS source catches that class of drift at build time. - Pipeline. New
tsconfig.build.jsonextends the existing test-onlytsconfig.json;target: ES2022,module: CommonJS,moduleResolution: node,strict,declaration,noEmitOnError.outDir/rootDirboth.so import paths don't change.package.jsongainsscripts.build:ts,scripts.typecheck, and aprepublishOnlythat runs the TS build beforenapi prepublish -t npm. Build artifacts (errors.{js,d.ts}+mesh_rpc.{js,d.ts}) are gitignored — regenerated on publish. - Module shape preserved. Stays CJS.
npm pack --dry-runproduces the same 8 files as before. Existingrequire('@ai2070/net/errors').CortexErrorkeeps working unchanged.index.js/index.d.tsstay JS forever — auto-generated by napi-rs from the Rust crate. - Test-stub conformance enforced. Turning
RawMeshRpcfrom documentation into a real type forcedStubRawMeshRpc,LoopbackHandlerRpc, andCancelTrackingRawto drop theiras unknownescape hatches and grow the missing methods. The compile error IS the win — the parallel.d.tscouldn't catch this. - Outcome. -210 LOC of duplicated
.js/.d.tscontent collapsed into single TS sources. 53/53 vitest tests pass against both source state (TS) and built state (compiled.js).
Test hygiene
- Cross-binding compat fixture — single source of truth for the canonical service contract. Every binding's compat test loads
golden_vectors.jsonand asserts the same matrix. Fixture is versioned viaabi_version_expectedmirroringNET_RPC_ABI_VERSION; bumping the ABI invalidates the fixture and forces every binding's compat test to update. - Streaming flow-control coverage (
tests/integration_nrpc_streaming.rs, 6 tests through real network): collects-all-chunks, drop-cancels-handler, terminal-error-after-partial-stream, plus the three flow-control tests (window_throttles_pump_until_grantsasserts the server'sstreaming_chunks_emitted_totalmetric is exactly the initial window after 300ms;auto_grant_drains_full_stream;explicit_grant_unblocks_pump). - Resilience helpers — 12 SDK integration tests across
mesh_rpc_retry.rs(4),mesh_rpc_hedge.rs(3),mesh_rpc_breaker.rs(5). Each pins a specific aspect: retry-then-succeed, retry-skips-app-errors, retry-exhaustion, predicate classification (retry); backup-wins, zero-degrades, empty-targets-NoRoute (hedge); full-state-machine cycle, failed-half-open-reopens, app-errors-don't-trip, reset-clears-state, error-flatten (breaker). All over real-network handshake. - Cross-language compat — 24 parametrized assertions (4 Rust + 4 Node + 16 Python) all driven from the shared fixture.
Breaking changes
Wire format additions (forward-compat from v0.11)
Unlike v0.11, v0.12 does not break wire compatibility with v0.11 for any pre-existing message type. Every change is a forward-compat addition:
- New dispatch bytes in the CortEX
EventMeta::dispatchnamespace under nRPC:DISPATCH_RPC_REQUEST,DISPATCH_RPC_RESPONSE,DISPATCH_RPC_CANCEL,DISPATCH_RPC_STREAM_GRANT,DISPATCH_RPC_STREAM_CHUNK_DROPPED. All in the CortEX-internal range0x10..=0x1F. A v0.11 receiver that doesn't know nRPC will see these as unknown dispatch values and route them to the no-op fold arm — no crash, no confusion, just a silent skip on the receiver side. MembershipMsg::Subscribegains an optionalqueue_group: Option<String>field (u8length + UTF-8 bytes after the existing token field). Forward-compat: a v0.11 sender (zero remaining bytes after the token) decodes asBroadcast. A v0.12 sender that emits aqueue_groupto a v0.11 receiver — the v0.11 receiver ignores the trailing bytes, which is benign for broadcast semantics but means queue-group dispatch silently degrades to broadcast-fan-out across mixed-version peers. Recommendation: upgrade publishers and subscribers in lockstep if you intend to useQueueGroup.publish_to_peernow stampschannel_hashon the outgoing packet header (was always0pre-fix). A v0.11 receiver doesn't consult the header for dispatch routing on the per-shard inbound path, so this is invisible there; v0.12 receivers consult the field for the per-channel-hash fast-path dispatcher hook. Mixed-version: v0.12 sender → v0.11 receiver works (header byte ignored); v0.11 sender → v0.12 receiver works (zero hash misses the dispatcher map and falls through to per-shard inbound, which is the same behavior the v0.11 sender's receiver already had).- New REQUEST headers:
nrpc-stream-window-initial(ASCII-decimalu32initial flow-control window) and the W3C tracing pairtraceparent/tracestate(whenFLAG_RPC_PROPAGATE_TRACEis set on the REQUEST). All optional; absence means "no flow control" / "no tracing context." - No changes to
IdentityEnvelope,EventMeta,CausalLink,OriginStamp,NetHeader, RedEX on-disk layout, or per-event checksum format — every v0.11 wire-format change persists unchanged into v0.12.
The summary: a v0.11 ↔ v0.12 fleet can coexist on the same mesh for the v0.11 subset of operations. nRPC traffic between mixed-version peers will silently fail (the v0.11 peer doesn't know how to dispatch nRPC), but the existing pub/sub and migration paths continue to work. Recommend lockstep upgrade if you intend to use nRPC across the fleet from day one.
Rust core (net crate) — API surface
SubscriptionModeenum is new inadapter::net::channel::roster. Match arms overSubscriptionModeneed to handle both variants;#[non_exhaustive]was added so this is forward-compatible.MembershipMsg::Subscribegains a publicqueue_group: Option<String>field. Struct-literal constructors must add it; the helper constructors (Subscribe::new, etc.) default toNoneso most call sites don't need updating.Mesh::subscribe_channel_in_queue_group/Mesh::subscribe_channel_in_queue_group_with_tokenare new public methods onMeshNodeand the SDK'sMeshenvelope.Mesh::serve_rpc/Mesh::call/Mesh::call_service/Mesh::find_service_nodesare new public methods onMeshNode. The SDK adds typed counterparts:serve_rpc_typed,call_typed,call_service_typed,serve_rpc_streaming,serve_rpc_streaming_typed,call_streaming,call_streaming_typed.adapter::net::cortex::rpcis a new public module re-exportingRpcContext,RpcHandler,RpcHandlerError,RpcRequestPayload,RpcResponseEmitter,RpcResponsePayload,RpcServerFold,RpcClientFold,RpcClientPending,RpcStatus,RpcStreamingHandler,RpcResponseSink,StreamItem,TraceContext, plus the dispatch + flag constants.adapter::net::mesh_rpcis a new public module re-exportingRpcError,RpcReply,RpcStream,CallOptions,RoutingPolicy,ServeError,ServeHandle,CodecDirection,MAX_RPC_*constants.adapter::net::mesh_rpc_metricsis a new public module re-exportingRpcMetricsRegistry,RpcMetricsSnapshot,ServiceMetrics,ServiceMetricsAtomic,CallOutcome,DEFAULT_LATENCY_BUCKETS_SECS. Snapshot viaMeshNode::rpc_metrics_snapshot(); Prometheus formatter viaRpcMetricsSnapshot::prometheus_text().MeshNode::register_rpc_inbound(channel_hash, dispatcher) -> boolandMeshNode::unregister_rpc_inbound(channel_hash)are new public methods. The dispatcher isArc<dyn Fn(StoredEvent) + Send + Sync>; registered channel hashes route directly and skip the per-shardinboundqueue.register_rpc_inboundreturnsfalseif the hash is already registered (refuses overwrites).ThreadLocalPooledBuilder::set_channel_hash(u32)is a new public method exposing the underlying packet-builder method so the publish path can stamp the channel hash.ChannelConfigRegistry::insert_prefix(prefix, config)/remove_prefix(prefix)are new public methods.get_by_name(name)falls back to a longest-prefix-first walk when no exact match exists. The exact-match hot path (DashMap get) is unaffected.
Rust SDK (net-sdk)
The SDK's nRPC surface is entirely additive — no existing SDK API changes.
- New module
mesh_rpcre-exportsRpcError,RpcReply,CallOptions,RoutingPolicy,ServeHandle,RpcContext,RpcHandler,RpcHandlerError,RpcStatus,ServeError,Codec,RpcStreamTyped,ResponseSinkTyped, plus theNRPC_TYPED_*status constants. - New module
mesh_rpc_resiliencere-exportsRetryPolicy,HedgePolicy,CircuitBreaker,CircuitBreakerConfig,BreakerError,BreakerState, plusdefault_retryable/default_breaker_failurepredicates. - New
Meshmethods (Rust SDK):serve_rpc,serve_rpc_typed,serve_rpc_streaming,serve_rpc_streaming_typed,call,call_service,call_typed,call_service_typed,call_streaming,call_streaming_typed,call_with_retry,call_service_with_retry,call_typed_with_retry,call_service_typed_with_retry,call_with_hedge_to,call_service_with_hedge,call_typed_with_hedge_to,call_service_typed_with_hedge,find_service_nodes,rpc_metrics_snapshot.
FFI / bindings
| Binding | Change |
|---|---|
| All | New nRPC surface — serve / call / callService / callStreaming / findServiceNodes plus typed wrappers + resilience helpers. Importable from @ai2070/net/mesh_rpc (Node), net.mesh_rpc (Python), bindings/go/net/ (reference; Go module ships downstream). All extend the existing binding modules; nothing pre-existing changes. |
| All | Stable nrpc: error prefix on every caller-side failure. Each binding ships a classifyError(e) / classify_error(e) helper for typed-error dispatch at catch sites. |
| Node | Hand-written errors.js / mesh_rpc.js + their .d.ts files replaced by single TypeScript sources (errors.ts, mesh_rpc.ts). Module shape and tarball contents unchanged for consumers; build pipeline now requires npm run build:ts before napi prepublish (wired into prepublishOnly). The TypeScript surface declares RawMeshRpc as a real interface — custom test stubs may need to grow methods that previously got past via as unknown escape hatches. Streaming + resilience helpers (TypedMeshRpc, RetryPolicy, HedgePolicy, CircuitBreaker) ship in the new mesh_rpc.ts. AbortSignal-driven cancellation: MeshRpc.reserveCancelToken() / MeshRpc.cancelCall(token) plus the cancelToken option on call. |
| Python | New net.mesh_rpc module ships TypedMeshRpc.from_mesh(mesh) + RetryPolicy / HedgePolicy / CircuitBreaker + the typed exception hierarchy (RpcError, RpcNoRouteError, RpcTimeoutError, RpcServerError, RpcTransportError, RpcCodecError, BreakerOpenError, RpcCancelledError). ServeHandle is a context manager (with rpc.serve(...)). Cancellation via Cancellable pyclass + opts={'cancel': cancel}. The native net.MeshRpc pyclass is the raw layer the typed wrapper sits on. GIL released across runtime.block_on(...); handler callbacks dispatch under tokio::task::spawn_blocking. |
| Go | New crate net-rpc-ffi at bindings/go/rpc-ffi/ ships the C-ABI cdylib libnet_rpc (separate from the existing compute-ffi). 21 new C entry points: lifecycle (net_rpc_new / _free), ABI-version stamp (net_rpc_abi_version()), unary call (net_rpc_call / _call_service), service discovery (net_rpc_find_service_nodes), serve (net_rpc_serve / _serve_handle_close / _serve_handle_free), streaming (net_rpc_call_streaming / _stream_next / _stream_grant / _stream_close / _stream_free / _stream_call_id), cancellation (net_rpc_reserve_cancel_token / _cancel_call), handler dispatcher registration (net_rpc_set_handler_dispatcher), free helpers (net_rpc_free_cstring / net_rpc_response_free / net_rpc_find_service_nodes_free). New error code NET_RPC_ERR_STREAM_DONE = -6 separates clean stream termination from "no chunk available right now." Reference Go consumer at bindings/go/net/mesh_rpc.go documents the cgo wiring; the Go module itself ships downstream. |
| C | nRPC is not exposed in net.h — it lives in the separate libnet_rpc cdylib (bindings/go/rpc-ffi/). The C SDK README at include/README.md § nRPC documents the entry-point listing, error codes, and ABI version stamp for downstream consumers building against the cdylib directly. |
Behavioral fixes that may surface as test breakage
MembershipMsg::Subscribeencoder emits no trailing bytes whenqueue_group: None. Tests that decoded a v0.11 Subscribe and asserted "trailing zero byte" will fail — the encoder no longer writes the length byte onNone. The decoder still accepts both shapes (forward-compat).- Hedge losers' handlers observe
ctx.cancellation. Pre-fix a hedge loser's request stayed in-flight on the server and the handler ran to completion against a caller that no longer cared. Tests that asserted "handler ran for every hedge attempt" will see the cancellation signal instead. - Caller-side
Mesh::calldropped before resolution emits CANCEL on the wire. Tests that asserted the server-side handler ran to completion despite caller drop will seectx.cancellationfire. - Server-side fold emits
RpcStatus::Cancelledon CANCEL observation. Tests that asserted "deadline + cancel surfaces asTimeout" will seeCancelledif CANCEL beat the deadline timer; the deadline path still surfacesTimeout(no behavior change for the deadline-only case). extract_trace_contextis case-insensitive. Tests that injected only-lowercase trace headers and asserted extraction will continue to work; tests that asserted capitalized variants were silently dropped will see the headers picked up.classify_publish_no_sessionmatches both publish-side and send-side error strings.call_servicefailure to a peer whose session expired between discovery and dispatch now surfacesRpcError::NoRouteinstead ofRpcError::Transport.ChannelConfigRegistryprefix-walk is longest-prefix-first. Tests that relied on insertion-order or shortest-prefix-wins to disambiguate nested prefix registrations will see the most-specific prefix match instead.- Per-handler-timeout default for the Go binding is 60s. Wedged Go-side handlers can no longer hold the in-flight slot indefinitely; tests that exercised "handler runs for >60s" will surface a timeout where they previously hung.
How to upgrade
- Bump your
Cargo.toml/package.json/requirements.txt/go.modto the v0.12 line. Recompile. - For consumers that only use the existing pub/sub + migration surfaces — no source changes required. v0.12 is forward-compatible with v0.11 wire formats for everything that existed in v0.11. The new
SubscriptionModeandMembershipMsg.queue_groupfields are additive. - For consumers that want nRPC — the typed surface is opt-in. Read
net/crates/net/README.md#nrpcfor the cross-binding contract, then per-binding READMEs for language-idiomatic usage:- Rust SDK —
net/crates/net/sdk/README.md§ nRPC. Feature-gated oncortex(already enabled by thelocalandfullumbrella features). - Node —
net/crates/net/sdk-ts/README.md§ nRPC. Import from@ai2070/net/mesh_rpc. - Python —
net/crates/net/sdk-py/README.md§ nRPC +net/crates/net/bindings/python/README.md§ nRPC. Import fromnet.mesh_rpc. - Go —
net/crates/net/include/README.md§ nRPC for the C-ABI surface. Reference cgo wrapper atbindings/go/net/mesh_rpc.go.
- Rust SDK —
- For mixed v0.11 ↔ v0.12 fleets — pub/sub and migration paths continue to work cross-version. nRPC traffic between mixed-version peers will silently fail (v0.11 doesn't know how to dispatch nRPC). Upgrade the fleet in lockstep if you intend to use nRPC across all peers from day one.
QueueGroupsubscriptions silently degrade to broadcast fan-out when crossing into a v0.11 receiver — same recommendation. - Node consumers depending on the hand-written
mesh_rpc.js/errors.jsshape — module exports andrequire()resolution are unchanged. If your test harness usedas unknowncasts to satisfyRawMeshRpcagainst a stub that didn't conform, the stub will need to grow the missing methods (or the casts switched to actual conforming shapes). The TypeScript compile error names the missing method. - Cross-binding nRPC consumers — every binding's compat suite asserts the same fixture (
tests/cross_lang_nrpc/golden_vectors.json). If you're integrating nRPC across language boundaries, your wire-level compatibility is enforced at the binding's own CI. The fixture is versioned viaabi_version_expectedmirroringNET_RPC_ABI_VERSION = 0x0001. - Go consumers — the
libnet_rpccdylib is a separate build artifact from the existinglibcompute_ffi. Build withcargo build --release -p net-rpc-ffiand link both. ABI version drift is detected vianet_rpc_abi_version()vs the consumer's compiled-inExpectedABIVersion. - If you implemented your own caller-side request/response over the existing pub/sub primitives (e.g. via two channels + correlation id) — the nRPC surface implements exactly that pattern, with deadlines, retry/hedge/breaker, response streaming, and end-to-end cancellation. Migration is a straight rewrite per the per-binding README's
## nRPCsection. - If you wired your own metrics around the existing channel publish path for RPC-shaped traffic —
MeshNode::rpc_metrics_snapshot()+RpcMetricsSnapshot::prometheus_text()ships a complete per-service counter set (caller-sidenrpc_calls_total/nrpc_errors_total{kind}/nrpc_in_flight_calls/nrpc_call_latency_seconds_*+ server-sidenrpc_handler_invocations_total/nrpc_handler_panics_total/nrpc_handler_in_flight/nrpc_handler_duration_seconds_*/nrpc_streaming_chunks_emitted_total). One snapshot covers both directions for any service the local node both calls and serves.
Released 2026-05-06.
License
See LICENSE.