RFC-NET-001 · PROTOCOL 0x4E45·54 · REV v0.14.0 / Q2 2026

net.
moves
at light.

A latency-first encrypted mesh where every computer, app and device is a first-class node. Existing networks operate in milliseconds (10⁻³). Net operates in nanoseconds (10⁻⁹).

No clients. No servers. No coordinators. The mesh propagates state, not connections. Loosely inspired by the Net from Cyberpunk 2077 — an engineering take on the concept.

§01 / why not best-effort

arpanet assumed scarcity.
net assumes abundance.

TCP was designed when nuclear war was a real possibility. Packets were precious. Bandwidth was scarce. Routes were scarce. The network had to guarantee delivery because the next packet might not get through.

That was the right design for 1969. It's the wrong design now. Sensors don't pause. Token streams don't wait. Market feeds don't care that your queue is full. The firehose doesn't have a pause button.

In a world of abundance, guaranteeing delivery is a threat — you're promising to deliver data that will bury the receiver. The bottleneck isn't delivery. It's processing. Arrival doesn't equal usefulness.

Net inverts the default. TCP starts with trust and detects abuse. Net starts with zero assumptions and lets trust emerge from consistent behavior.

Nodes reject work they can't process within a time window. Dropping a packet and re-requesting from a faster node costs nanoseconds. Waiting for a congested node's guaranteed response costs milliseconds. When dropping is cheaper than waiting, delivery guarantees become overhead.

The benchmark numbers aren't performance metrics. They're existence proofs. They demonstrate that the software layer is no longer the bottleneck. The remaining latency is physics: NIC, wire, speed of light. The software got out of the way.

§02 / topology classes

a new class of system.

Existing networking falls into two categories. Net is neither.

// best-effort · TCP / IP / HTTP / gRPC
Optimized for delivery. Queues absorb bursts. Backpressure negotiated. Connections stateful. Trust assumed. Sender slows down when receiver can't keep up.
latency floor: milliseconds
throughput: ~10K req/s · per connection

// real-time · CAN / EtherCAT / TSN
Optimized for deterministic timing. Fixed topologies. Dedicated hardware. Time-slotted access. Guarantees only because you own the wire.
latency floor: microseconds
throughput: ~100K updates/s · dedicated bus

// net · latency-first
Real-time latencies on commodity hardware over commodity networks. Drop instead of queue. Route around instead of wait. Observe instead of coordinate. Derive instead of query. Mesh transport.
latency floor: nanoseconds
throughput: ~20M events/s · per core
§03 / protocol properties

nine axioms.
one runtime.

P.01

Latency-first

Sub-nanosecond header serialization. Nanosecond heartbeats, hops, recovery. Packet scheduling at timescales reserved for local function calls.

0.20 ns fwd
sub-ns floor
P.02

Streaming-first

Data is continuous flow, not documents. Sharded ring buffers, adaptive batching. No requests and responses — everything is a stream.


░░░░░░░░░░░░░░
P.03

Zero-copy

Ring buffers, no garbage collector, native Rust. No unsafe. Forwarding doesn't allocate or copy payload data. Design principle, not optimization.

[mem]──refs──▶[wire]
   no alloc
P.04

Encrypted E2E

Noise protocol handshakes. ChaCha20-Poly1305 AEAD with counter nonces. Every packet encrypted source→dest. Intermediate nodes never see plaintext.

A ─ChaCha20──▶ B
    relay sees ░░░
P.05

Untrusted relay

Nodes forward packets without decrypting payloads. The mesh routes through infrastructure you don't trust. Networks grow through adversarial nodes.

trust := observation
not assumption
P.06

Schema-agnostic

Transport moves bytes, not structures. Raw event = payload + hash. Protocol never inspects content. Structure emerges where participants agree.

[hdr][hash][░▒▓█▓]
opaque payload
P.07

Optionally ordered

Ordering is per-stream, not global. Unordered path is the fast path. Causal ordering available where streams need it. Cost paid only by streams that require it.

e → e → e
chain.verify()
P.08

Optionally typed

The protocol doesn't care what's in the payload. Behavior plane can. Typing is a local agreement between nodes, not a network requirement.

type ∈ peer-pair
not network
P.09

Native backpressure

Nodes drop without reply. Not a failure mode — the design. The proximity graph makes silence a signal. Automatic rerouting.

silent → suspect
suspect → reroute
§04 / measured numbers

existence proofs.

All numbers measure packet scheduling — the time to process, route, encrypt, and queue a packet for transmission. They do not include NIC transfer or wire latency.

operation                        M1 Max                i9-14900K
▸ routing
routing header forward           0.57 ns · 1.75G/s     0.20 ns · 5.06G/s
header serialize                 1.98 ns · 505M/s      1.31 ns · 762M/s
routing lookup (hit)             38 ns · 26.3M/s       38 ns · 26.7M/s
▸ multi-hop forwarding
1 hop                            59 ns · 16.9M/s       53 ns · 18.7M/s
3 hops                           163 ns · 6.13M/s      121 ns · 8.29M/s
5 hops                           274 ns · 3.66M/s      190 ns · 5.27M/s
▸ failure detection & recovery
heartbeat                        29 ns · 34.5M/s       35 ns · 28.4M/s
circuit breaker check            13 ns · 74.4M/s       10 ns · 98.4M/s
full fail + recover              288 ns · 3.47M/s      255 ns · 3.92M/s
▸ swarm / discovery
pingwave roundtrip               0.93 ns · 1.07G/s     0.65 ns · 1.55G/s
new peer discovery               113 ns · 8.83M/s      152 ns · 6.59M/s
▸ capability system
filter (require GPU)             4.05 ns · 247M/s      1.78 ns · 561M/s
GPU check                        0.31 ns · 3.21G/s     0.20 ns · 5.01G/s

// scheduling floor

0.20 ns

Routing header forward on i9-14900K. Per-packet overhead. Software is not the bottleneck — physics is.


// hot path

5.06G/s

Operations per second on a single core for the forward path. Five billion. Per second. Per core.


// SDK ingest

6.97M/s

Python via PyO3 batch ingest. The "slow" binding language hits seven million events per second.


// test systems

► M1 Max macOS, aarch64
► i9-14900K @5GHz, Win11
► date 2026-04-27
► profile release + LTO + CG=1


▸ BENCHMARKS.md
§05 / mikoshi // engram transit

state moves.
connections don't.

In Cyberpunk, Mikoshi is Arasaka's construct for storing engrams — consciousness held in digital space, minds persisting outside their original hardware.

Mikoshi in Net is how daemons move between machines. A running program on one node becomes a running program on another without losing its history, its pending work, or its place in the conversation. The source packages its state, the target unpacks it, and for a brief moment the entity exists on both nodes at once — spreading, superposed, then collapsed onto the target as routing cuts over.

The daemon doesn't know it moved. Neither does anything talking to it. Observer nodes watching the stream see the same causal chain continue uninterrupted, the same sequence numbers, the same entity speaking. The hardware underneath shifted. The stream didn't notice.

A factory controller hops from a dying edge box to a healthy one mid-shift. An inference daemon follows its user from laptop to desktop. A trading agent migrates to a node closer to the exchange without dropping a single tick.

It doesn't move a copy. Mikoshi carries the thing itself across.

§06 / daemon runtime // new

compute that
lives on
the wire.

NEW · subprotocol 0x0500
Stateful programs that live on the mesh, not on a machine. They have cryptographic identity, a verifiable history, and they move between nodes mid-execution without anyone noticing.

A program on Net is called a daemon. Its identity is a public key — an origin_hash derived from ed25519, which doesn't change when the daemon moves. Its history is a causal chain — every event it produces is signed and links to the previous one, verifiable by any node. Its location is wherever in the mesh has the capabilities it asked for. When that location goes away, the daemon doesn't.
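
A minimal, self-contained sketch of the linking that makes the history verifiable by any node; std's DefaultHasher stands in for the real signature plus content hash (the shipped chain signs with ed25519), and the ChainEvent shape is an assumption for illustration.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug)]
struct ChainEvent { seq: u64, parent: u64, payload: Vec<u8>, link: u64 }

// Commitment over (seq, parent link, payload); a stand-in for sign(hash(...)).
fn link_of(seq: u64, parent: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    (seq, parent, payload).hash(&mut h);
    h.finish()
}

// Append a new event whose link commits to the previous event's link.
fn append(chain: &mut Vec<ChainEvent>, payload: Vec<u8>) {
    let parent = chain.last().map(|e| e.link).unwrap_or(0);
    let seq = chain.len() as u64;
    let link = link_of(seq, parent, &payload);
    chain.push(ChainEvent { seq, parent, payload, link });
}

// Any node can walk the chain and confirm every event links to its parent.
fn verify(chain: &[ChainEvent]) -> bool {
    let mut prev = 0u64;
    chain.iter().all(|e| {
        let ok = e.parent == prev && e.link == link_of(e.seq, e.parent, &e.payload);
        prev = e.link;
        ok
    })
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, b"tick".to_vec());
    append(&mut chain, b"tock".to_vec());
    assert!(verify(&chain));
    println!("chain of {} events verifies", chain.len());
}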

CASE · trading agent · NYSE colo

// what is a daemon

A daemon is a stateful event processor whose identity is a keypair. It holds working state, snapshots periodically, and exposes five trait methods (sketched after the list below). Everything else — placement, migration, durability — is the runtime.

  • cryptographic identity — origin_hash from ed25519. survives moves.
  • causal chain — every event signed, links to parent. self-authenticating.
  • capability requirements — daemon declares needs. mesh finds matching node.
  • snapshot + replay — state captured periodically. gap replayed on restore.
  • opaque to mesh — what the daemon does is its business. mesh just hosts.
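
A minimal Rust sketch of that five-method surface (name · requirements · process · snapshot · restore). The placeholder Event and CapabilityRequirements types and all signatures are assumptions for illustration, not the shipped trait.

// Placeholder types; the real ones live in the daemon runtime.
pub struct Event { pub payload: Vec<u8> }
pub struct CapabilityRequirements { pub tags: Vec<String> }

// Hypothetical shape of the five trait methods named above.
pub trait Daemon {
    /// Human-readable label; the cryptographic identity is the keypair / origin_hash.
    fn name(&self) -> &str;
    /// Capabilities the mesh must find on a hosting node.
    fn requirements(&self) -> CapabilityRequirements;
    /// Handle one inbound event, optionally emitting new signed events.
    fn process(&mut self, event: Event) -> Vec<Event>;
    /// Serialize working state into transferable bytes (migration phase 01).
    fn snapshot(&self) -> Vec<u8>;
    /// Reconstruct state on the target node (migration phase 03).
    fn restore(&mut self, snapshot: &[u8]);
}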

Mikoshi migration · 6 phases

zero-downtime cutover · ~280ns total
01 snapshot: source serializes daemon state into transferable bytes.
02 transfer: snapshot moves source → target over subprotocol 0x0500.
03 restore: target reconstructs daemon, identity preserved.
04 replay: target catches up on events between snapshot and now.
05 cutover: routing table updates atomically. next packet → target.
06 complete: source releases daemon. target is sole authority.
▸ GRP.01

replica

N interchangeable copies · load-balanced
member 0   event #58 → result
member 1   idle
member 2   idle

round-robin · seq=58

For horizontal scale on stateless workloads. Each replica has its own causal chain derived from a deterministic seed — fail one, spawn another with the same identity. No state to transfer.

identity: deterministic from seed · routing: round-robin · state: stateless · recovery: respawn
▸ GRP.02

fork

independent siblings · documented lineage
parent @ seq=42
   · single chain, no divergence
   · awaiting fork directive

pre-fork · monitoring

For experiments, A/B testing, scenario branching. Each fork carries a cryptographic sentinel linking back to the parent at the fork point. Forks share a past but not a future.

identity: divergent from sentinel · routing: per-fork · state: independent · recovery: resnapshot from origin
▸ GRP.03

standby

1 active · N-1 warm · zero duplicate compute
active    processing seq=102
standby   synced_through=98
standby   synced_through=101

all healthy · 3 nodes online

For stateful daemons that need fault tolerance without paying for duplicate compute. Only the active processes events — standbys are warm, not hot. Periodic snapshots track synced_through for each standby. On active failure, the standby with the highest sync point promotes and replays the gap using the same replay machinery migration uses.

identity: deterministic from seed · routing: active only · state: stateful, synced · recovery: promote + replay gap
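
A small sketch of the promotion rule described above: pick the standby with the highest synced_through and replay the gap from there. The Standby shape and the lowest-node-id tie-break are assumptions.

#[derive(Clone, Debug)]
struct Standby { node_id: u64, synced_through: u64 }

/// Choose the standby with the highest sync point; ties fall to the lowest
/// node_id (an assumed rule). That node promotes and replays the gap.
fn pick_promotion(standbys: &[Standby]) -> Option<Standby> {
    standbys
        .iter()
        .max_by(|a, b| {
            a.synced_through
                .cmp(&b.synced_through)
                .then(b.node_id.cmp(&a.node_id))
        })
        .cloned()
}

/// The events the promoted standby must replay before taking over.
fn replay_gap(synced_through: u64, active_tail: u64) -> std::ops::RangeInclusive<u64> {
    (synced_through + 1)..=active_tail
}

fn main() {
    let standbys = vec![
        Standby { node_id: 7, synced_through: 98 },
        Standby { node_id: 3, synced_through: 101 },
    ];
    let winner = pick_promotion(&standbys).unwrap();
    println!("promote {:?}, replay {:?}", winner, replay_gap(winner.synced_through, 102));
}
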
// trait surface
5 methods
name · requirements · process · snapshot · restore
// migration phases
6 · strict
snapshot → transfer → restore → replay → cutover → complete
// wire messages
10 types
orchestrator + source + target over 0x0500
// cycle time
~280 ns
full snapshot → activate, faster than a context switch
§07 / components on the mesh

four primitives.
one mesh.

The mesh moves bytes. Everything above is a thin, optional layer — local-first, feature-flagged, opt-in. Light up the ones you need; the wire doesn't care which.

▸ component.01

nRPC

// typed request/response

Request/response semantics built from a pair of streams. A server registers a handler with serve_rpc; clients dispatch with call_typed. The streams stay primitive — nRPC just wraps them in a typed handle and completes when the response lands.

TypedMeshRpc · paired streams · zero new wire
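
A self-contained sketch of the underlying idea (request/response assembled from a pair of one-way streams), using std channels in place of mesh streams; serve_rpc and call_typed appear only in the comments, and nothing here is the real API.

use std::sync::mpsc;
use std::thread;

fn main() {
    // "Request stream" and "response stream": two primitive one-way channels.
    let (req_tx, req_rx) = mpsc::channel::<(u32, String)>();   // (call id, request)
    let (resp_tx, resp_rx) = mpsc::channel::<(u32, String)>(); // (call id, response)

    // Server side: the registered handler drains the request stream and writes
    // responses onto the paired stream (the role serve_rpc wraps, per the text above).
    thread::spawn(move || {
        for (id, req) in req_rx {
            resp_tx.send((id, format!("echo: {req}"))).unwrap();
        }
    });

    // Client side: a typed call is "send on one stream, complete when the matching
    // id lands on the other" (the role call_typed wraps).
    req_tx.send((1, "ping".into())).unwrap();
    let (_id, resp) = resp_rx.recv().unwrap();
    println!("{resp}");
}
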
▸ component.02

RedEX

// append-only event log

The log unbundled and local. 20-byte index records, optional disk persistence per channel, atomic backfill-then-live tailing. A Pi keeps a tiny log of its own readings; a server keeps a huge one. No cluster consensus — log is local, replay is local, retention is local.

21.3 M append/s · 138 ns tail
▸ component.03

CortEX

// RedEX, folded

A reactive, queryable projection of the log, updated event-by-event. Your "database" isn't a process you connect to — it's a Vec<Task> or HashMap<Uuid, Memory> in your code, updating as events fold in. Queries are direct memory access.

8.98 ns find_unique · 8.87 M ingest/s
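
A self-contained sketch of the fold: a plain HashMap kept current by applying each log event in order. The Event and Task shapes are illustrative, not CortEX types.

use std::collections::HashMap;

#[derive(Debug)]
struct Task { title: String, done: bool }

enum Event {
    TaskCreated { id: u64, title: String },
    TaskCompleted { id: u64 },
}

/// The "database": direct memory, updated event-by-event as the log is tailed.
fn fold(projection: &mut HashMap<u64, Task>, event: Event) {
    match event {
        Event::TaskCreated { id, title } => {
            projection.insert(id, Task { title, done: false });
        }
        Event::TaskCompleted { id } => {
            if let Some(t) = projection.get_mut(&id) { t.done = true; }
        }
    }
}

fn main() {
    let mut tasks = HashMap::new();
    fold(&mut tasks, Event::TaskCreated { id: 1, title: "ship v0.14".into() });
    fold(&mut tasks, Event::TaskCompleted { id: 1 });
    // "Queries are direct memory access":
    println!("{:?}", tasks.get(&1));
}
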
▸ component.04

NetDB

// unified query façade

One handle bundling typed collections under db.tasks, db.memories, and friends. Prisma-style find_unique / find_many across Rust, TypeScript, and Python — whole-database snapshots round-trip between languages.

6.30 μs open · 48 KB / 1K rows
§08 / install

five languages.
one engine.

All SDKs wrap the same Rust core. The SDK is the developer experience, the engine is Rust.

// C bindings via net.h — build cdylib with . Lower-level bindings (skip SDK ergonomics, talk directly to the engine): ai2070-net, @ai2070/net, ai2070-net (PyPI binding).

§09 / target applications

everything that
can't wait.

Anywhere latency matters. Anywhere the cloud round-trip is too slow. Anywhere there's no central infrastructure to route through.

▸ 0x01 ─ ai agents

AI Agents

Tool calls, state, and memory transfer between heterogeneous GPU nodes. Token streams flow through the mesh; an agent's working memory follows it from node to node mid-conversation. The mesh is the runtime.

▸ 0x02 ─ vehicular mesh

Vehicular Sensor Mesh

Cars sharing LIDAR, radar, camera. Vehicles sync intent — braking, turning, route changes. The car behind doesn't react to braking. It knows about the braking before the brake pads touch the rotor.

▸ 0x03 ─ factory floor

Robotics Factory Floor

Robots don't need line-of-sight for networking. The mesh routes through whatever nodes are reachable. Reroute scheduled in sub-microsecond time. The assembly line doesn't stop.

▸ 0x04 ─ energy grids & extraction

Energy Grids & Extraction

Electrical substations, oil and gas pipelines, drilling rigs, mine haul trucks, distributed solar — coordinating in real time across geographies that fiber doesn't reach. Protective relays trip in single-digit milliseconds; the mesh isolates faults before they cascade. Routes through whatever radios and edge boxes survive.

▸ 0x05 ─ remote surgery

Remote Surgery

Control signals and haptic feedback routed across the mesh. If the primary compute node lags, the mesh reroutes mid-operation. The surgeon doesn't notice. The patient doesn't notice. The scalpel doesn't stop.

▸ 0x06 ─ drone swarms

Drone Swarms

Coordinated flight without a ground controller. A drone that loses a motor broadcasts the failure; the swarm adjusts formation before the drone has begun to fall.

▸ 0x07 ─ live performance

Live Performance

Lighting, audio, video, pyro synchronized across hundreds of nodes. A DMX controller dies; another node picks up the cue list. Audio sync tighter than the time sound takes to cross the venue.

▸ 0x08 ─ medical nanorobotics

Medical Nanorobotics

Swarms of nanoscale machines coordinating in vivo — drug-delivery vectors, targeted ablation, vascular monitoring. Sub-microsecond reroute when a node leaves the swarm. No cloud round-trip; the patient is the network.

§10 / the blackwall

safety isn't declared.
it's derived.

In Cyberpunk, the Blackwall isn't a wall around the threats — it's a wall around the safe zone. Net works the same way. The "safe mesh" is the part you can observe: nodes that respond within heartbeat intervals, honor their capability announcements, don't flood, respect TTL.

The wall isn't one mechanism. It's the emergent effect of every constraint working together.

▸ Backpressure

Nodes limit in-flight events, prevent overload, and apply pushback by going silent. No node can be forced to accept more than it can process.

▸ Bounded queues

No infinite buffers. Ring buffers have explicit capacity limits. A flood fills a buffer and gets evicted; it doesn't grow the buffer.
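
A tiny illustration of that rule, with VecDeque standing in for the real ring buffer; whether the newest or oldest entry loses on overflow is a policy detail not specified here, and this sketch evicts the oldest.

use std::collections::VecDeque;

struct BoundedBuffer<T> {
    buf: VecDeque<T>,
    cap: usize,
}

impl<T> BoundedBuffer<T> {
    fn new(cap: usize) -> Self {
        Self { buf: VecDeque::with_capacity(cap), cap }
    }

    /// Capacity is fixed at construction. A flood evicts; it never grows the buffer.
    fn push(&mut self, item: T) {
        if self.buf.len() == self.cap {
            self.buf.pop_front();
        }
        self.buf.push_back(item);
    }
}

fn main() {
    let mut q = BoundedBuffer::new(4);
    for i in 0..1_000 { q.push(i); }   // a flood
    assert_eq!(q.buf.len(), 4);        // the buffer never grew
}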

▸ Fanout limits

Events don't propagate to everyone. Dissemination is controlled by the proximity graph and routing table. Prevents O(n²) explosion.

▸ Deduplication

The same event doesn't explode repeatedly. Idempotency at the event level protects against loops and amplification.

▸ TTL limits

Events expire. Pingwaves have a hop radius. A misbehaving node's traffic dies at the boundary of its TTL, not the edge of the mesh.

▸ Rate limits

Per-node, per-peer limits. One node cannot flood the mesh. Its neighbors enforce their own limits independently through device autonomy rules.

Any single mechanism can be overwhelmed. All of them together form the wall. No single point to breach because the Blackwall is the mesh itself.

§11 / releases

net releases.

Every tagged release pulled directly from ai-2070/net.

v0.14.0 · Codename: The Warriors
2026.05.11

Named after Walter Hill's 1979 cult film and Rockstar Games' 2005 adaptation — a gang trying to make it home through hostile turf. Channels in this release do the same: replicas survive partitions, election storms, disk pressure, and divergent tails, and still converge on a consistent leader before the night is out.

v0.14 lands cross-node replication for RedEX channels end-to-end across the substrate and all five bindings. v0.13 ("Chippin' In") made capability the load-bearing layer; v0.14 makes replication the load-bearing layer underneath the channel surface. SUBPROTOCOL_REDEX is now a real wire codec, ReplicationCoordinator is a real tokio runtime task with a 4-state machine pinned per plan §3, leader election is deterministic nearest-RTT with a NodeId tiebreak (no broadcast, no epoch — microseconds-wide convergence), and catch-up is pull-based with bandwidth budgets and a 64 MiB hard ceiling. Every binding exposes the same enable_replication(mesh) / open_file(name, cfg.with_replication(Some(rep))) surface and the same per-channel Prometheus snapshot.

The hardening posture from the Black Diamond line continues — every new surface ships with handle-lifetime, panic-safety, FFI-soundness, lock-order, and cancel-safety guarantees consistent with v0.11 / v0.12 / v0.13 — and a sixty-four-item second-pass review (docs/misc/CODE_REVIEW_2026_05_11_REDEX_DISTRIBUTED.md) shipped its closure commits before the v0.14 branch cut.

Alongside the replication landing, v0.14 carries two cross-cutting breaking changes: capability hardware / network units switch from MB / Mbps to GB / Gbps end-to-end, and the predicate-on-the-wire header renames from cyberdeck-where: to net-where: (predicate envelope ABI bumped to 2).


RedEX Distributed (substrate)

The implementation plan in REDEX_DISTRIBUTED_PLAN.md closed all phases A–I before v0.14. The shape:

ReplicationConfig

pub struct ReplicationConfig {
    pub factor: u8,                       // replicas including leader; 1..=16, default 3
    pub placement: PlacementStrategy,     // Standard / Pinned([NodeId]) / ColocationStrict
    pub heartbeat_ms: u64,                // 100..=300_000, default 500
    pub leader_pinned: Option<NodeId>,    // pin election outcome to a specific NodeId
    pub on_under_capacity: UnderCapacity, // Withdraw (default) / EvictOldest
    pub replication_budget_fraction: f32, // share of measured NIC peak; 0.0 < f ≤ 1.0
}

PlacementStrategy::Standard defers to the v0.13 PlacementFilter axes (scope filter, proximity max-RTT, capability intent matching, anti-affinity, custom-filter callback). Pinned([NodeId]) and ColocationStrict skip the filter chain. UnderCapacity::Withdraw (the default) drops the replica role and lets the leader's other replicas absorb the redundancy responsibility; EvictOldest runs RedexFile::sweep_retention against the configured caps and stays in Replica. validate() enforces every invariant at construction; binding layers run it before crossing the FFI so a malformed config can't leak into the coordinator's hot loop.
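
A self-contained sketch of the invariants validate() is described as enforcing, with the ranges taken from the field comments above; the error shape is illustrative.

fn validate(factor: u8, heartbeat_ms: u64, budget_fraction: f32) -> Result<(), String> {
    if !(1..=16).contains(&factor) {
        return Err("factor must be in 1..=16 (replicas including leader)".into());
    }
    if !(100..=300_000).contains(&heartbeat_ms) {
        return Err("heartbeat_ms must be in 100..=300_000".into());
    }
    if !(budget_fraction > 0.0 && budget_fraction <= 1.0) {
        return Err("replication_budget_fraction must be in (0.0, 1.0]".into());
    }
    Ok(())
}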

Wire protocol — SUBPROTOCOL_REDEX

A new subprotocol family at 0x0E00. Four message types pinned at byte-level:

  • SYNC_REQUEST (0x20, replica → leader) — fixed-size { channel_id, since_seq, chunk_max }.
  • SYNC_RESPONSE (0x21, leader → replica) — variable; carries { channel_id, first_seq, leader_first_retained_seq, events: [{event_seq, payload_len, payload}] }. The new leader_first_retained_seq field lets the replica disambiguate retention-trim from split-brain divergence; legitimate trim with first_seq == leader_first_retained_seq triggers skip-ahead via RedexFile::skip_to, any other gap shape NACKs back and bumps dataforts_replication_skip_ahead_total.
  • SYNC_HEARTBEAT (0x22, bidirectional) — fixed-size { channel_id, tail_seq, role, wall_clock_ms }. Pinned at 52 bytes; the role byte is the validator-checked ReplicaRole discriminant.
  • SYNC_NACK (0x23, leader → replica) — variable; carries { channel_id, since_seq, error_code, detail_len, detail }. Error codes: 1 NotLeader / 2 BadRange / 3 Backpressure / 4 ChannelClosed. detail truncates at a UTF-8 char boundary ≤ u16::MAX so a multi-byte codepoint straddling the cap can't ship invalid UTF-8 to the peer.

Codec is hand-rolled (no serde over the wire) for byte-stable round-trips, validated by byte_layout_pinned tests per message type. Truncation errors carry (need, have) for diagnostics; need = consumed + still_needed so a peer logging the value sees an accurate frame-completion estimate.

ReplicationCoordinator — the 4-state machine

pub enum ReplicaRole { Idle, Replica, Candidate, Leader }

Transitions are matrix-validated and serialized through an outer tokio::sync::Mutex<()> so the state write + chain-tag side-effect (announce_chain / withdraw_chain against MeshNode) can't interleave. Two transition_to calls racing one another produce a deterministic sequence: T1's Replica → Candidate announce never lands after T2's Idle withdraw. The transition signals (CapabilitySelected, MissedHeartbeats, ElectionWon, ElectionLost, GracefulRelinquish, DiskPressureWithdraw, ChannelClose) are pinned per plan §3; ChannelClose is the universal escape valid from any state, used by the disk-pressure / channel-closed paths when the current role isn't Replica.
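
A self-contained sketch of matrix-validated transitions using the roles and a few of the signals named above; the exact permitted matrix is an assumption apart from the documented edges (Replica to Candidate on missed heartbeats, Candidate to Leader or back to Replica on election outcome, ChannelClose as the universal escape).

#[derive(Clone, Copy)]
enum ReplicaRole { Idle, Replica, Candidate, Leader }

#[derive(Clone, Copy)]
enum TransitionSignal { CapabilitySelected, MissedHeartbeats, ElectionWon, ElectionLost, ChannelClose }

// Returns the next role if (from, signal) is a permitted pair, else None
// (the validator rejecting the triple; state stays unchanged).
fn next_role(from: ReplicaRole, signal: TransitionSignal) -> Option<ReplicaRole> {
    use ReplicaRole::*;
    use TransitionSignal::*;
    match (from, signal) {
        (Idle, CapabilitySelected)  => Some(Replica),   // selected to carry the channel
        (Replica, MissedHeartbeats) => Some(Candidate), // leader silence detected
        (Candidate, ElectionWon)    => Some(Leader),
        (Candidate, ElectionLost)   => Some(Replica),
        (_, ChannelClose)           => Some(Idle),      // universal escape, valid from any state
        _                           => None,
    }
}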

The coordinator surfaces two error variants:

  • CoordinatorError::Transition — the validator rejected the triple. State unchanged.
  • CoordinatorError::TagSink — the state mutation already happened; only the chain-tag side-effect failed. Operator observes a divergence between local state and advertised state until the next successful announce. Runtime handlers clear the believed leader on both variants so the next tick re-enters discovery cleanly.

Replica selection vs. leader election

Two distinct subsystems per plan §4:

  • Placement consults PlacementFilter to choose which N nodes carry the channel's replica set when the channel is first opened or on roster change. Standard flows through the v0.13 scoring; Pinned skips it. The selected set is published via the causal:<hex> chain-tag layer so peers discover holders without a centralized membership view.

  • Leader election is a pure function over each healthy replica's locally-known state:

elect(replica_set, self_id, rtt_to, health_of) -> ElectionOutcome:
    R = { r ∈ replica_set : health_of(r) }
    if R = ∅: return ElectionOutcome::NoEligibleReplica
    sorted = R sorted by (rtt_to(self, r), r.node_id_lex)   // tie-break: lexicographic NodeId
    return ElectionOutcome::SelfWins           if sorted[0] == self_id
           ElectionOutcome::PeerWins(sorted[0]) otherwise

No broadcast, no epoch, no collection window. Every healthy replica computes the same winner from the same (replica_set, self_id, rtt_to, health_of) tuple, so leader-loss recovery converges in microseconds without the wire protocol getting involved. Peers with rtt_to == None (no recent ping measurement) rank at Duration::MAX rather than getting excluded — health already filtered the candidate set, and the NodeId tiebreaker keeps the outcome deterministic among any equally-unmeasured peers.
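
The same rule restated as compilable Rust, with NodeId simplified to a string and the rtt / health inputs passed as closures; a sketch of the stated behavior, not the shipped elect.

use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ElectionOutcome { SelfWins, PeerWins(String), NoEligibleReplica }

// Pure function over locally-known state: every healthy replica computes the
// same winner from the same inputs, so no broadcast round is needed.
fn elect(
    replica_set: &[String],
    self_id: &str,
    rtt_to: impl Fn(&str) -> Option<Duration>,
    health_of: impl Fn(&str) -> bool,
) -> ElectionOutcome {
    let mut healthy: Vec<&String> =
        replica_set.iter().filter(|r| health_of(r.as_str())).collect();
    if healthy.is_empty() {
        return ElectionOutcome::NoEligibleReplica;
    }
    // Unmeasured peers rank at Duration::MAX; ties break on lexicographic NodeId.
    healthy.sort_unstable_by_key(|r| (rtt_to(r.as_str()).unwrap_or(Duration::MAX), r.as_str().to_owned()));
    let winner = healthy[0];
    if winner.as_str() == self_id {
        ElectionOutcome::SelfWins
    } else {
        ElectionOutcome::PeerWins(winner.clone())
    }
}

fn main() {
    let set = vec!["node-a".to_string(), "node-b".to_string()];
    // No RTT measurements at all: both rank at Duration::MAX, NodeId breaks the tie.
    let out = elect(&set, "node-a", |_| None, |_| true);
    assert_eq!(out, ElectionOutcome::SelfWins);
}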

Pull-based catch-up

Replicas drive SYNC_REQUEST(since_seq=local_next, chunk_max=N) on every tick where is_leader_silent == false && believed_leader.is_some() && local_next < leader_tail_seq. The leader's handle_sync_request reads [since_seq, since_seq+chunk_max) from its local file, packs into a SYNC_RESPONSE, and ships. The replica's apply_sync_response validates strict monotonicity (prev.checked_add(1)), enforces a 64 MiB hard chunk ceiling even for the "admit at least one event" branch (so an oversize first event NACKs back rather than DOSing the wire), and routes typed RedexError variants (DiskPressure, Closed) to the right runtime handler.

Bandwidth budgets

BandwidthBudget is a token bucket sized at replication_budget_fraction × measured_NIC_peak. The catch-up loop calls try_consume(estimated_bytes, now) before shipping each chunk; full bucket admits; partial defers and NACKs back Backpressure. Oversize requests (a single event larger than one-second's capacity — rare but representable) admit as a one-off and drain the bucket fully, so the channel can never deadlock trying to ship an event it can never afford.
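
A self-contained token-bucket sketch matching the behavior described above; the field names, the refill shape, and the per-second sizing are assumptions.

use std::time::Instant;

struct BandwidthBudget {
    capacity_bytes: f64, // replication_budget_fraction × measured NIC peak, per second
    available: f64,
    last_refill: Instant,
}

impl BandwidthBudget {
    fn new(capacity_bytes_per_sec: f64) -> Self {
        Self {
            capacity_bytes: capacity_bytes_per_sec,
            available: capacity_bytes_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// true = ship the chunk now; false = defer and NACK Backpressure.
    fn try_consume(&mut self, estimated_bytes: f64, now: Instant) -> bool {
        // Refill proportionally to elapsed time, capped at one second of capacity.
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.available = (self.available + elapsed * self.capacity_bytes).min(self.capacity_bytes);
        self.last_refill = now;

        if estimated_bytes > self.capacity_bytes {
            // Oversize escape hatch: admit once and drain fully so the channel
            // can never deadlock on an event it could never afford.
            self.available = 0.0;
            return true;
        }
        if estimated_bytes <= self.available {
            self.available -= estimated_bytes;
            true
        } else {
            false
        }
    }
}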

Heartbeats + repair

HeartbeatTracker per channel per node holds (last_seen, role, tail_seq) for every peer. The runtime tick emits a heartbeat to every non-self peer in the replica set when role ∈ {Leader, Replica}; inbound heartbeats update the tracker and refresh the believed_leader cell. is_leader_silent trips when now - last_seen > heartbeat_ms × miss_threshold (default 3× = 1.5 s at the 500 ms heartbeat), triggering the Replica → Candidate transition and the in-tick election. heartbeat_ms is now validated to [100, 300_000] so a unit-confused config (μs instead of ms) can't saturate the silence-detection multiplication and silently disable failover.
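
The silence rule as a one-liner, using the defaults quoted above:

/// Leader is considered silent after miss_threshold missed heartbeats.
fn is_leader_silent(now_ms: u64, last_seen_ms: u64, heartbeat_ms: u64, miss_threshold: u64) -> bool {
    now_ms.saturating_sub(last_seen_ms) > heartbeat_ms.saturating_mul(miss_threshold)
}

// With the defaults: 500 ms heartbeat × 3 misses, i.e. silent after 1.5 s:
// is_leader_silent(1_501, 0, 500, 3) == true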

Failover + replica rejoin

Plan §7: leader loss → silence detection → Candidate → election → Leader/Replica per ElectionOutcome. Plan §8: a replica rejoining from a longer-than-trim outage observes first_seq > local_next on the next SYNC_RESPONSE; if first_seq == leader_first_retained_seq the gap is a legitimate retention trim and RedexFile::skip_to(first_seq) runs (bumping dataforts_replication_skip_ahead_total), any other shape is treated as divergence and NACKs back.

Cross-binding API surface

Every binding ships the same two-method extension to its existing Redex type:

  • enable_replication(mesh) — installs the SUBPROTOCOL_REDEX inbound router on the mesh and arms Redex::open_file to spawn a replication runtime when the supplied RedexFileConfig carries replication: Some(ReplicationConfig). Idempotent: a second call with the same mesh is a no-op.
  • open_file(name, cfg) — when cfg.replication.is_some(), spawns a per-channel ReplicationRuntime (tokio task + HeartbeatTracker + BandwidthBudget + ReplicationCoordinator) and registers it on the inbound router. Reopen with a structurally-different ReplicationConfig returns a typed error rather than silently reusing the original.

The substrate exposes Redex::replication_runtime_count(), Redex::replication_coordinator_for(name), Redex::replication_status_snapshot(), and Redex::replication_metrics_snapshot(). The metrics snapshot is also rendered to Prometheus text via Redex::replication_prometheus_text() for direct scraping.

Metrics

Per-channel atomic counters (ChannelMetricsAtomic) — sync_bytes_total, sync_request_total, sync_response_total, sync_nack_total, leader_changes_total, election_thrash_total, under_capacity_total, skip_ahead_total, applied_events_total, applied_bytes_total, leader_lag_micros, replica_lag_micros. Gauges (leader_lag_micros, replica_lag_micros) saturate one tick below LAG_NOT_OBSERVED = u64::MAX so a follow-up arithmetic operation can't accidentally collide with the sentinel. The Prometheus registry caps at 4096 channels to bound a hostile multi-channel scrape; entries past the cap are silently dropped at insertion.

Observability + operator ergonomics

Redex::replication_status_snapshot() returns a Vec<ChannelReplicationStatus> with channel, role, replica_set, believed_leader, tail_seq, lag_micros, under_capacity_total per channel. Plug into a Prometheus exporter via the replication_prometheus_text() text-format helper; pipe into a Grafana dashboard via the per-channel labels.


RedEX Distributed test strategy

The plan's test matrix landed in full:

  • Unit — pure-function coverage for replication_state, replication_election, replication_heartbeat, BandwidthBudget, replication_metrics, the wire codec, and replication_catchup. Every pre-fix correctness item from the second-pass review ships with at least one regression test.
  • Integration (e2e) — multi-tokio-thread tests under tests/redex_replication_e2e.rs covering two-node catch-up, leader-close → replica election, three-node fanout, lag-driven catch-up, heartbeat round-trip, and the bandwidth_budget_metric_field_is_plumbed smoke. The replication_overhead_within_30_percent_budget perf-budget test is marked #[ignore] and lives off CI's default matrix — wall-clock perf on shared CI runners isn't a stable signal.
  • DST (deterministic-simulation) — 14 scenarios under tests/redex_replication_dst.rs covering happy-path catch-up, isolated-replica no-advance, partition heal, asymmetric / symmetric failover, three-node central-peer convergence, restart-during-sync, divergence-freedom after partition-heal AND after kill-revive (the original C-2 single-path scenario expanded), election storms (the C-1 scenario; storm rounds now assert election_thrash_total bumps), and wall_clock_ms determinism. The harness derives wall-clock time from a step counter, not real Instant::now, so traces reproduce byte-identically across machines.
  • Loom — atomic-pattern models for RedexFile::close's swap-true-on-close, the record_tail_seq CAS loop, the replication metrics counters under concurrent increment (including a three-way same-counter contention case), and the try_first_close first-call-wins flag.

Hardening — redex-distributed second-pass review

A two-pass review of the replication branch (docs/misc/CODE_REVIEW_2026_05_11_REDEX_DISTRIBUTED.md) landed sixty-four numbered items (R-1..R-64) plus four coverage gaps (C-1..C-4). The first pass closed forty-four; the second-pass review on 2026-05-12 surfaced one regression in the original R-23 fix plus nineteen new items; all closed before the v0.14 branch cut. Grouped by area:

Runtime / coordinator correctness

  • Role-flip TOCTOU closed. SyncRequest and SyncResponse handlers re-check coordinator.role() immediately before the dispatcher send so a concurrent transition between the entry check and the outbound ship triggers a clean NACK NotLeader rather than a response from a node that no longer claims leadership.
  • Chain-tag side-effects serialized. The coordinator's transition_to holds a tokio::sync::Mutex<()> across the state update + metric bumps + sink call so two racing transitions can't interleave announce_chain from a stale role over a withdraw_chain from a fresher one.
  • NACK NotLeader / BadRange actually recover. NotLeader clears the believed leader so the next tick re-resolves via find_chain_holders; BadRange calls RedexFile::skip_to(since_seq + 1) and re-issues the request, rather than logging-and-dropping.
  • Post-election failure no longer strands Candidate. When the second transition_to (Candidate → Leader / Candidate → Replica) surfaces TagSink (state moved, side-effect failed) or Transition (state moved out from under us), both error branches clear the believed leader so the next tick re-enters discovery from a clean slate.
  • Disk-pressure / channel-closed pick the valid signal per current role. The transition matrix only permits DiskPressureWithdraw on Replica → Idle; Leader / Candidate variants now route through ChannelClose (the universal escape) so a Leader observing disk pressure actually withdraws rather than logging the matrix-reject and continuing to write through.
  • cancel() can't hang. Uses try_send(Shutdown) first; on Full, aborts the JoinHandle directly so a wedged task with a saturated inbox can't block the caller waiting on a buffer the task may never drain.
  • Drop on ReplicationRuntimeHandle aborts the task. The strong-reference cycle MeshNode → router → handle → task → dispatcher Arc is broken unconditionally when the handle goes out of scope, not just via the canonical ReplicationWiring::drop un-installation.
  • is_stopped consults an explicit flag flipped after cancel()'s .await returns, not the JoinHandle slot — two concurrent cancel()s racing on task.lock().take() could previously let the loser observe None and report stopped == true before the winner had finished joining.
  • Channel-id validation defense-in-depth on every inbound type. SyncRequest, SyncResponse, SyncNack, Heartbeat all gate on msg.channel_id == inputs.channel_id at the runtime boundary so mesh misroute can't poison the tracker.
  • GapBeforeChunk underflow closed. first_seq.saturating_sub(local_next) plus a debug_assert!(first_seq > local_next) belt-and-suspenders.

Catch-up correctness

  • Retention-trim vs. divergence disambiguation. SyncResponse carries leader_first_retained_seq on the wire; the replica treats first_seq == leader_first_retained_seq as a legitimate trim (skip-ahead via RedexFile::skip_to) and any other gap shape as divergence (NACK back, bump counter, log loudly).
  • Empty chunk validates first_seq. The short-circuit on response.events.is_empty() now validates first_seq >= local_next so a leader bug emitting first_seq = 999 on an empty chunk isn't silently accepted.
  • 64 MiB hard ceiling enforced for oversize first event. The "admit at least one event" branch rejects events larger than CHUNK_MAX_HARD_CEILING_BYTES rather than shipping wire bytes that the replica's local append would refuse.
  • prev + 1 strict-monotonicity uses checked_add. Practically unreachable; surrounding code used saturating_* and the asymmetry was the real bug.
  • Lag-driven SyncRequest filters believed_leader != self so a test-setup loopback or tracker misuse can't make the runtime issue a SyncRequest to itself.

Wire codec

  • SyncNack::from_bytes truncation reports correct need. The R-23 fix shipped for SyncResponse but missed the SyncNack arm in the original commit; the second-pass review caught and fixed it.
  • SyncNack::to_bytes truncates at a UTF-8 char boundary. A multi-byte codepoint straddling SYNC_NACK_DETAIL_MAX previously shipped invalid UTF-8 that the decoder rejected, losing the structured error code along with the diagnostic.
  • WireError::Truncated.need formula correct everywhere. need = consumed + still_needed in every arm — both header reads and per-event payload reads.

File / manager

  • RedexFile::skip_to panic-safe swap order. Builds the new index / timestamps into temp Vecs, calls evict_prefix_to against the segment, then assigns the new index. Pre-fix a panic between the index swap and the eviction call would leave the index referencing pre-eviction offsets.
  • Reopen with differing replication config rejects with a typed error rather than silently reusing the original. Compares against the live coordinator's config; accepts None ↔ None and Some(cfg_a) ↔ Some(cfg_b) where the two are structurally PartialEq, rejects everything else.
  • mod replication dual public surface collapsed. The flat re-exports under redex:: are now the only public path; pub mod replication is gone.
  • Lag saturation pinned with a named constant LAG_SATURATED_MICROS = LAG_NOT_OBSERVED - 1 and a test asserting the gap from the sentinel is preserved.

Mesh / dispatch

  • from_node == 0 sentinel collision rejected. The replication inbound arm mirrors the reflex handler's guard — a peer whose from_node falls back to 0 (the valid NodeId sentinel collision) is dropped rather than entering the tracker.

Bindings / FFI

  • Python replication=False with replication_* kwargs rejects with a typed RedexError rather than silently dropping the other kwargs.
  • enable_replication is a typed RedexError stub without the net feature in both Node and Python, rather than TypeError: redex.enableReplication is not a function / AttributeError. The Python replication_runtime_count / replication_prometheus_text gates the same way.
  • net_redex_enable_replication drops Box<Arc<MeshNode>> on every error path. Doc-comment now states "consumed regardless of return code."
  • net_redex_open_file and net_redex_file_tail pre-zero *out_handle / *out_cursor on entry so a cgo / C consumer reading the slot after a non-zero return sees null rather than stale stack data.
  • Python runtime.block_on paths release the GIL via py.detach across the blocking open / open-from-snapshot / tail / watch / snapshot-and-watch paths. Existing precedent (wait_for_seq, __next__) already did this; the cortex open / tail / watch paths now match.
  • Node RedexFile.sync and RedexFile.close are async — disk I/O dispatches via tokio::task::spawn_blocking onto the napi worker pool instead of running on the JS event-loop thread. The other read-side methods stay sync (in-memory only).
  • Python rejects kebab-case spellings for colocation_strict / evict_oldest; Node rejects snake_case for the same (each binding accepts only its idiomatic spelling). The FFI core remains liberal so the Go-facing JSON shape can use either.
  • Pinned([]) rejected at the binding layer with a typed error rather than falling through to the core validator.
  • leader_pinned cross-checked against pinned_nodes at the binding layer when placement == Pinned.
  • Node redex_err documents the redex: prefix contract in index.d.ts so JS-side operators can string-sniff on e.message.startsWith("redex:") against a pinned shape.
  • Go OpenFile distinguishes ErrInvalidReplicationConfig from ErrReplicationRequiresEnable. Binding-side validator covers shape errors plus Factor / HeartbeatMs ranges; only the FFI NET_ERR_REDEX for replication-not-enabled falls into the second sentinel.
  • Go RedexFile.mu uses sync.RWMutex so appends / reads aren't serialized per file. The Rust substrate's HandleGuard is a reader-counter; pre-fix the Go binding's mutex defeated that.
  • Go typedef ArcMeshNode aliases the upstream net_compute_mesh_arc_t opaque typedef so the same Arc handle works through both surfaces.

Hygiene + coverage

  • Election sort uses sort_unstable_by. The total compound key (rtt, node_id) provides determinism; stability isn't load-bearing.
  • Event-vec preallocation cap (4096) documented in the wire codec.
  • u32::try_from(payload_len).unwrap_or(u32::MAX) carries a debug_assert! so accidental misuse surfaces in debug builds rather than silently corrupting on the wire.
  • DST harness wall_clock_ms derives from the step counter, not real Instant::now. Traces reproduce byte-identically across machines.
  • DST election-storm scenario asserts election_thrash_total. The harness mirrors the production coordinator's counter locally so storm rounds can observe the gauge without rewiring the harness around the async coordinator.
  • Divergence-freedom check runs after partition_heal AND after restart_during_sync, not just on the happy path.
  • e2e flake-prone test marked #[ignore]. The replication_overhead_within_30_percent_budget 1.3× wall-clock budget is opt-in via cargo test -- --ignored rather than running on shared CI runners.
  • e2e bandwidth_budget_is_observable_in_metrics renamed to bandwidth_budget_metric_field_is_plumbed so the test name matches what the test asserts (field plumbing under the wire path, not budget engagement; the budget-fired path is unit-tested under replication_catchup).
  • Loom metrics model exercises a three-way same-counter contention case beyond the existing two-thread mixed-counter races.
  • BandwidthBudget::try_consume handles oversize requests via the full-bucket admit-once-and-drain escape hatch so a single event larger than one-second's capacity can't deadlock the channel.
  • Election ranks unmeasured-but-healthy peers at Duration::MAX rather than excluding them — health already filtered the candidate set, and the NodeId tiebreaker keeps the outcome deterministic among any equally-unmeasured peers.

CI

  • Three new CI jobs. redex-replication-e2e runs the multi-tokio-thread integration suite under --features "redex net"; redex-replication-dst runs the deterministic-simulation harness under --features redex; loom-models runs the atomic-pattern loom tests under RUSTFLAGS=--cfg loom. All three gate the redex-distributed merge.

Capability hardware units — MB → GB / Mbps → Gbps

v0.14 changes the hardware-axis numeric units from megabyte / megabit-per-second to gigabyte / gigabit-per-second across core and every binding. The tag keys, predicate builders, FFI shapes, and JSON schemas all rename. This is a breaking wire-format change for any CapabilitySet that carries hardware numerics.

The motivation is operator ergonomics — fleets in 2026 routinely advertise hundreds of GB of memory and tens of Gbps of network capacity, and the MB / Mbps wire shape forced operators to read values like 65_536 and 10_000 when 64 / 10 is what they meant. The smaller numeric range also fits cleanly in u32 for the wire encoding.

Tag / key renames

Old (v0.13) → New (v0.14)
hardware.memory_mb → hardware.memory_gb
hardware.gpu.vram_mb → hardware.gpu.vram_gb
hardware.storage_mb → hardware.storage_gb
hardware.network_mbps → hardware.network_gbps
hardware.accelerator.<i>.memory_mb → hardware.accelerator.<i>.memory_gb

Adjust values when migrating: 65_536 MB → 64 GB, 81_920 MB → 80 GB, 10_000 Mbps → 10 Gbps.

Filter / predicate renames

Old → New
min_memory_mb → min_memory_gb
min_vram_mb → min_vram_gb
min_storage_mb → min_storage_gb
min_network_mbps → min_network_gbps

The predicate builders (p.minMemory(...), p.minVram(...), etc. in TS; the p.min_memory(...) family in Python; Predicate{}.MinMemory(...) in Go) now produce NumericAtLeast tags whose key is memory_gb / vram_gb / storage_gb / network_gbps.

Binding surfaces

Binding · renamed fields / keys
Rust core · HardwareCapabilities::memory_gb, GpuCapability::vram_gb, HardwareCapabilities::storage_gb, HardwareCapabilities::network_gbps, AcceleratorCapability::memory_gb. Capabilities::with_memory(gb) takes GB; ResourceEnvelope::max_memory_gb, ResourceClaim::memory_gb, TopologyHint::{uplink_gbps, downlink_gbps} all moved to GB / Gbps.
Go · HardwareCaps.MemoryGB, GPUInfo.VRAMGB, HardwareCaps.StorageGB, HardwareCaps.NetworkGbps, AcceleratorInfo.MemoryGB.
Node · Hardware.memoryGb, Hardware.storageGb, Hardware.networkGbps, GpuInfo.vramGb, AcceleratorJs.memoryGb (all index.d.ts).
Python · dict keys memory_gb / vram_gb / storage_gb / network_gbps; accelerator dict key memory_gb. Stubs (net_sdk.*.pyi) and tests updated.
C / FFI · Capability / filter JSON uses *_gb keys (min_memory_gb, min_vram_gb, min_storage_gb) and network_gbps.

Refactors

The core schema (AXIS_SCHEMA) and tag codec emit / parse the new *_gb / *_gbps keys. Placement / scoring and proximity tiers use a 16 GB baseline (the same baseline as before, previously expressed in MB; the renames are nominal, not behavioral). Serialization APIs that took MB-shaped values now take GB. Safety types and topology hints align. Docs, benches, examples, and every test fixture / cross-binding golden vector regenerate against the new shape; the final sweep removed lingering network_mbps references across tests/cross_lang_capability/ and the per-binding compat suites.

Cross-binding fixtures

The thirteen fixtures under tests/cross_lang_capability/ regenerate against the new unit. predicate_eval, capability_set_diff, capability_validation, placement_score, and the numeric-parity fixtures all carry GB / Gbps values. predicate_nrpc_envelope.json bumps abi_version_expected: 1 → 2 (see below).


Predicate-on-the-wire header — cyberdeck-where: → net-where:

The HTTP / nRPC header carrying predicates from caller to callee was named cyberdeck-where: in v0.13 — the project umbrella on the wire. v0.14 renames to net-where: for three reasons:

  1. HTTP / nRPC convention names the protocol, not the parent organization. HTTP doesn't have w3c-content-type:; traceparent / idempotency-key use system-level prefixes, not org names. The umbrella-on-the-wire shape was an outlier.
  2. The header is not nRPC-specific even though it currently rides nRPC. Predicates are protocol-agnostic; any future predicate-bearing surface (raw channel pre-filter, subprotocol call hook, …) should ride the same name. net-where: brackets the right layer (the net crate / SDK), not a specific service inside it.
  3. Symmetric naming with the substrate crate. Net's other reserved headers and protocol identifiers carry the net- / net_ prefix; lining this one up makes the surface easier to grep and easier to teach.

RPC_WHERE_HEADER constant

Every binding exports the new name as a pinned constant:

  • Rust: net::adapter::net::behavior::predicate::RPC_WHERE_HEADER = "net-where"
  • TS: import { RPC_WHERE_HEADER } from '@ai2070/net-sdk'
  • Python: from net_sdk import RPC_WHERE_HEADER
  • Go: net.RPCWhereHeader
  • C: NET_PREDICATE_WHERE_HEADER macro in net.go.h

Server-side decoders accepting the v0.13 cyberdeck-where: name are not provided. Mixed v0.13 / v0.14 fleets cannot exchange predicates over the wire; recommend lockstep upgrade alongside the capability-unit migration.

Predicate envelope ABI version bump

tests/cross_lang_capability/predicate_nrpc_envelope.json bumps abi_version_expected: 1 → 2 to signal the wire-format change. No binding-side ABI version constants pin to 1 — none of the per-binding tests asserted on the envelope fixture's version — so the bump is informational + future-defensive. Future header / envelope changes in v0.15+ will bump to 3 against the same fixture.


Test hygiene

  • Cross-binding wire-format fixtures regenerate against the new units + header name. Thirteen fixtures under tests/cross_lang_capability/, all versioned via abi_version_expected: 2 for the predicate envelope (other fixtures continue at 1 — only the envelope carries the ABI version field today).
  • Three new CI jobs. redex-replication-e2e, redex-replication-dst, loom-models gate the merge.
  • Lib suite at 2640+ tests (was 2330+ at v0.13 release). 300+ net new tests across the replication + regression paths; every numbered review item ships with at least one regression where the shape made one possible.
  • cargo clippy --all-features --all-targets -D warnings clean across substrate + every binding crate.
  • cargo doc --all-features --no-deps clean under RUSTDOCFLAGS="-D warnings" — both rustdoc::broken_intra_doc_links and rustdoc::private_intra_doc_links enforce.

Breaking changes

Wire format — SUBPROTOCOL_REDEX is new

SUBPROTOCOL_REDEX = 0x0E00 is a new mesh subprotocol family; v0.13 nodes don't speak it. Mixed v0.13 / v0.14 fleets cannot exchange replication traffic. Channels opened with replication: None continue to work cross-version (same single-node behavior as v0.13).

Wire format — capability hardware units

v0.14 breaks wire compatibility with v0.13 for CapabilityAnnouncement / CapabilityDiff carrying hardware numerics. hardware.memory_mb / hardware.gpu.vram_mb / hardware.storage_mb / hardware.network_mbps / hardware.accelerator.<i>.memory_mb rename to the *_gb / *_gbps shape. v0.13 receivers parse v0.14 announcements as Tag::Legacy (unknown axis-prefixed tags pass through under the forward-compat rule) — the values survive the round-trip but no longer satisfy min_memory_mb / etc. filters, so placement decisions on a v0.13 receiver may produce different verdicts. Recommend lockstep upgrade.

Wire format — cyberdeck-where: → net-where:

v0.14 renames the predicate-on-the-wire HTTP header. v0.13 servers expecting cyberdeck-where: won't see v0.14 callers' header values; v0.13 callers' cyberdeck-where: won't be read by v0.14 servers. Mixed fleets must either upgrade lockstep or maintain a transitional gateway that rewrites the header on the way through.

Rust core (net crate) — API surface

  • Capabilities::with_memory(value) takes GB, not MB. Same for the resource-envelope / claim / topology types: ResourceEnvelope::max_memory_gb, ResourceClaim::memory_gb, TopologyHint::{uplink_gbps, downlink_gbps}.
  • HardwareCapabilities field renames: memory_gb, gpu.vram_gb, storage_gb, network_gbps. AcceleratorCapability::memory_gb.
  • adapter::net::redex exports — new types ReplicationConfig, PlacementStrategy, UnderCapacity, ReplicationCoordinator, ReplicationCoordinator::transition_to, ReplicaRole, TransitionSignal, StateTransition, HeartbeatTracker, PeerState, BandwidthBudget, ReplicationMetricsRegistry, ChannelMetricsAtomic, ChainTagSink, ChannelIdentity, CoordinatorError, elect, ElectionOutcome, ChannelReplicationStatus. The wire codec types (SyncRequest, SyncResponse, SyncHeartbeat, SyncNack, SyncNackError, SyncEvent, WireError, SUBPROTOCOL_REDEX, DISPATCH_SYNC_*, SYNC_NACK_DETAIL_MAX) re-export at the redex module root.
  • Redex::enable_replication(mesh) is a new method. Idempotent; pair with Redex::open_file carrying cfg.replication = Some(rep) to spawn a per-channel replication runtime.
  • Redex::open_file rejects reopen with a structurally-different ReplicationConfig with a typed RedexError::Channel. Reopen with the same config returns the existing handle (unchanged from v0.13).
  • RPC_WHERE_HEADER = "net-where" (was "cyberdeck-where" in v0.13).
  • HEARTBEAT_MS_MAX = 300_000 added; ReplicationConfig::validate rejects heartbeat_ms > HEARTBEAT_MS_MAX with a typed HeartbeatTooHigh variant.

Rust SDK (net-sdk)

  • net_sdk::capabilities::redex re-exports the substrate replication surface — ReplicationConfig, PlacementStrategy, UnderCapacity, ReplicaRole, ChannelReplicationStatus.
  • net_sdk::capabilities::predicate::RPC_WHERE_HEADER is the renamed constant.

FFI / bindings

Binding · change
All · New enable_replication(mesh) method on Redex. New replication field on RedexFileConfig; pair with ReplicationConfig constructor. New ReplicaRole / PlacementStrategy / UnderCapacity enums and ReplicationConfig builder per binding. New replication_runtime_count, replication_status_snapshot, replication_metrics_snapshot, replication_prometheus_text getters on Redex.
All · Hardware-numeric field renames — memoryGb / vramGb / storageGb / networkGbps etc. per binding's idiomatic naming. Same for the predicate min-builder family — minMemory / minVram / minStorage / minNetwork now produce GB / Gbps tags.
All · RPC_WHERE_HEADER constant renames to "net-where". Header-bearing nRPC call variants (net_rpc_call_with_headers etc.) pass the new name; v0.13 servers expecting cyberdeck-where: won't decode v0.14 callers.
Node · New Redex.enableReplication(mesh) method. New replication: ReplicationConfig field on RedexFileConfig. RedexFile.sync() / RedexFile.close() are async (return Promise<void>); callers must await. Pre-v0.14 code calling file.sync() / file.close() synchronously generates an orphan Promise warning under modern Node. The redex: JS-error prefix is pinned in index.d.ts doc-comment as the stable contract.
Python · New Redex.enable_replication(mesh) method. New replication= kwarg on Redex.open_file. replication=False with any replication_* kwarg now raises RedexError rather than silently dropping the kwarg. cortex open / tail / watch paths release the GIL via py.detach across the blocking work. enable_replication / replication_runtime_count / replication_prometheus_text are typed RedexError stubs without the net feature.
Go · New RedexManager.EnableReplication(meshArc) method. New RedexFileConfig.Replication *ReplicationConfig field. RedexFile uses sync.RWMutex so appends / reads don't serialize. OpenFile returns the matching sentinel (ErrInvalidReplicationConfig vs ErrReplicationRequiresEnable) per error class. ArcMeshNode typedef aliases the upstream net_compute_mesh_arc_t.
C · New entry points: net_redex_enable_replication(redex, mesh_arc), net_redex_replication_runtime_count(redex), net_redex_replication_prometheus_text(redex), net_free_string(ptr). net_redex_open_file / net_redex_file_tail pre-zero *out_handle / *out_cursor on entry. The replication config rides the RedexFileConfigJson.replication field; binding-side validators or the FFI core enforce numeric ranges.

Behavioral fixes that may surface as test breakage

  • ReplicationConfig::heartbeat_ms clamps at [100, 300_000]. Tests injecting u64::MAX or other pathological values to observe silence-detection behavior will see ReplicationConfigError::HeartbeatTooHigh instead.
  • PlacementFilter election no longer excludes peers with rtt_to == None. Tests that asserted NoEligibleReplica against an all-unmeasured replica set will see the smallest-NodeId healthy peer elected instead.
  • SyncNack::to_bytes truncates at a UTF-8 char boundary, so a regression test that previously expected from_bytes to fail on an oversize multi-byte payload will see the round-trip succeed at a slightly-shorter detail length.
  • Reopen with a different ReplicationConfig rejects. Tests that opened a channel with one config and reopened with another expecting silent reuse will see RedexError::Channel("different from the original").
  • bandwidth_budget_is_observable_in_metrics renamed. Tests referencing the old test name fail to find it; rename to bandwidth_budget_metric_field_is_plumbed.
  • replication_overhead_within_30_percent_budget marked #[ignore]. CI runs that included this test in the default matrix will no longer see it; run via cargo test -- --ignored.

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.14 line. Recompile / rebuild the binding cdylib (NAPI for Node, maturin for Python, cargo build -p net-compute-ffi + -p net-rpc-ffi for Go).
  2. Capability hardware-unit migration. Rename memory_mb → memory_gb, vram_mb → vram_gb, storage_mb → storage_gb, network_mbps → network_gbps, accelerator.memory_mb → accelerator.memory_gb throughout. Adjust values: 65_536 → 64, 81_920 → 80, 10_000 → 10. The predicate builders pick up the new keys automatically; tag-string literals need a manual rewrite. cargo build (and the binding-side TypeScript / Python static checks) drives the rewrite — the renames are compile errors.
  3. Predicate header migration. If your call sites reference the header name directly ("cyberdeck-where" as a string literal), replace with "net-where" or use the exported RPC_WHERE_HEADER constant. Server-side handlers consuming the v0.13 name need the same rewrite.
  4. Replication opt-in. Channels that want replication: call Redex.enable_replication(mesh) once after constructing the Redex (idempotent), then open each replicated channel with Redex.open_file(name, cfg.with_replication(Some(rep_cfg))). The per-channel ReplicationRuntime spawns automatically; consult the operator surface via Redex.replication_status_snapshot() / replication_prometheus_text().
  5. Channels that don't want replication require no changes. Single-node channels behave identically to v0.13. RedexFileConfig::replication = None is the default.
  6. Node consumers — RedexFile.sync() / RedexFile.close() are async. Add await to call sites:
    await file.sync();
    await file.close();
    
    Sync call sites compile but generate orphan Promise warnings under modern Node and may exit the process before the fsync lands.
  7. Python consumers — Redex.open_file(name, replication_*=…) requires replication=True. Pre-v0.14 code passing replication_factor=5 without replication=True produced a single-node channel; now raises RedexError. Either pass replication=True explicitly or drop the replication_* kwargs.
  8. Go consumers — RedexFileConfig.Replication is the new optional field. Pass a *ReplicationConfig for replicated channels. Numeric validation (factor / heartbeat ranges) runs on the Go side before the FFI; structurally-invalid configs return ErrInvalidReplicationConfig instead of the catch-all ErrReplicationRequiresEnable.
  9. Fleet-wide upgrade required for any deployment using capability announcements with hardware numerics. v0.13 receivers parse v0.14 announcements' hardware.memory_gb as Tag::Legacy — the value survives but no longer satisfies min_memory_mb-keyed filters. Recommend lockstep upgrade alongside the predicate-header migration.
  10. Cross-binding wire fixtures regenerated. If you have CI that asserts golden-vector parity against tests/cross_lang_capability/, the GB / Gbps shape and the net-where: header rename mean every fixture changes. predicate_nrpc_envelope.json bumps abi_version_expected: 1 → 2; future binding-side version pins should track the per-fixture version field.
  11. Operator dashboards: Redex::replication_prometheus_text() emits a per-channel snapshot in Prometheus text format; pipe into your existing scrape config under the dataforts_replication_* metric family. Per-channel labels (channel, role) carry the channel name and current role for dashboard slicing.
  12. DST harness integration — if you have channel-level DST scenarios that drive ReplicaRole directly, the harness's force_transition / tick_node now mirror the production coordinator's election_thrash_total counter onto a per-VirtualNode election_thrash_count field, so storm scenarios can assert on the gauge without rewiring around the async coordinator. The harness's wall_clock_ms derives from a step counter, not Instant::now.
v0.13.0 · Codename: Chippin' In
2026.05.10

Named after the two "Chippin' In" tracks: Samurai's original Chippin' In, and the Cyberpunk 2077 soundtrack rendition by Damian Ukeje, P.T. Adamczyk, and Kerry Eurodyne.

v0.13 lands the capability system end-to-end across the substrate and all five bindings. v0.12 ("Firestarter") shipped nRPC; v0.13 makes capability the load-bearing layer underneath. The Tag placeholder in v0.10 / v0.11, and the untyped Vec<String> shape v0.12 still carried, both go away. CapabilitySet is now a { tags: HashSet<Tag>, metadata: BTreeMap } typed-taxonomy wire shape; every binding ships the same Predicate AST + evaluator + validator + diff + trace + debug-report aggregator; and predicates ride nRPC request headers (cyberdeck-where:) so server-side filtering picks the right candidate without re-running the predicate per hop.

The hardening posture from the Black Diamond line is intact — every new surface ships with handle-lifetime, panic-safety, and FFI-soundness guarantees consistent with v0.11 / v0.12 — but this release is about replacing the placeholder with the real thing.


Capability System (substrate)

Typed taxonomy

The flat tag namespace becomes a four-axis ontology — hardware / software / devices / dataforts — backed by a typed Tag enum:

pub enum Tag {
    AxisPresent { axis: TaxonomyAxis, key: String },
    AxisValue   { axis: TaxonomyAxis, key: String, value: String, separator: AxisSeparator },
    Reserved    { prefix: String, body: String },   // scope:* / causal:* / fork-of:* / heat:*
    Legacy(String),                                  // untyped strings outside the typed taxonomy
}

Tag::parse(s) accepts every shape including reserved-prefix tags (the deserializer + substrate-internal callers); Tag::parse_user(s) rejects reserved prefixes for application input. TagKey ((axis, key)) is the half-form Predicate matches on. TaxonomyAxis::all() enumerates the four axes for iteration.

Axis values accept either = or : as the separator on the wire (hardware.gpu.vram_mb=24576 and hardware.gpu:nvidia both parse). The separator is preserved through Tag::Eq for byte-stable round-trips, and tag.semantic_eq(other) is the separator-agnostic comparison for tag matching.
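A small sketch of the separator behavior (error handling elided; the Result shape of Tag::parse is assumed):

let eq_form    = Tag::parse("software.os=linux").unwrap();
let colon_form = Tag::parse("software.os:linux").unwrap();

assert_ne!(eq_form.to_string(), colon_form.to_string()); // wire form preserves the original separator
assert!(eq_form.semantic_eq(&colon_form));               // matching ignores it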

Tag shapes for discovery

Reserved-prefix tag shapes flesh out the discovery primitive. causal:<hex> / causal:<hex>:<tip_seq> / causal:<hex>[<range>] for chain holders; fork-of:<parent_hex> for chain ancestry; heat:<chain_hex>=<rate> for hot-chain advertisement; scope:tenant:<id> / scope:region:<name> / scope:subnet-local (scope:* was already in v0.12, now formally part of the taxonomy). RESERVED_PREFIXES constant exposes the full list for binding-level enforcement.

Metadata field

CapabilitySet storage shape collapses to two fields:

pub struct CapabilitySet {
    pub tags: HashSet<Tag>,
    pub metadata: BTreeMap<String, String>,
}

HardwareCapabilities / SoftwareCapabilities / Vec<ModelCapability> / Vec<ToolCapability> / ResourceLimits are projections — derived on demand via caps.views(). Encoding scheme: hardware.cpu_cores=N / hardware.gpu / hardware.gpu.vram_mb=N / software.os=linux / software.model.0.id=... / hardware.limits.max_concurrent_requests=N. Tool JSON-Schema strings (which can't safely round-trip through the tag wire format) live in metadata under tool::<id>::input_schema / tool::<id>::output_schema. Application-defined metadata keys propagate as opaque pairs (subject to a 4 KB soft cap with a MetadataOversize warning at the validator layer).

Wire format emits tags in sorted Tag::to_string() order — the HashSet keeps O(1) membership for in-memory lookups; the serialize_with hook flattens to a sorted Vec on the way out. Without this, two ends of a signed announcement round-trip would produce different bytes (HashSet iteration is process-local random) and the verifier would reject as InvalidSignature.
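One way the flattening hook could look, as a hedged sketch rather than the actual serializer; it only illustrates why sorting restores byte-stability:

use serde::{Serialize, Serializer};
use std::collections::HashSet;

// Flatten the HashSet to a sorted Vec of canonical strings on the way out,
// so two processes with different RandomState seeds emit identical bytes.
fn serialize_sorted_tags<S: Serializer>(tags: &HashSet<Tag>, ser: S) -> Result<S::Ok, S::Error> {
    let mut flat: Vec<String> = tags.iter().map(|t| t.to_string()).collect();
    flat.sort();
    flat.serialize(ser)
}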

Bloom-filter primitive

behavior::bloom::BloomFilter ({ len_bits, k, bits: Vec<u64> }) backs compact chain-tag membership probes via xxh3-128 double-hashing. ~1% FPR at 10 K items in ≤ 500 KB per the substrate sizing target. Probe pattern: callers that match the bloom run a follow-up precise lookup (existing causal:<hex> tag membership) before issuing real reads — false positives become recoverable misses, false negatives are impossible by construction. Domain-separated via BLOOM_HASH_SEED = 0xB100_F1AC_DEAD_CAFE so callers using xxh3 elsewhere don't accidentally collide.

BloomFilter::new(expected_items, false_positive_rate) clamps degenerate inputs (expected_items == 0 → 1, p clamped to (1e-9, 0.5)); BloomFilter::with_params(len_bits, k) is the explicit-parameters constructor for cross-binding fixtures. Round-trips via serde with explicit deserialize-side validation (rejects out-of-range k, mismatched len_bits/bits.len() * 64).
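A usage sketch of the probe pattern; the insert / contains method names are assumptions (the text above only pins the constructors and sizing):

let mut probe = BloomFilter::new(10_000, 0.01);    // clamps degenerate inputs as described above
probe.insert(b"causal:deadbeef");                  // assumed method name

if probe.contains(b"causal:deadbeef") {            // assumed method name
    // Bloom hit: follow up with the precise causal:<hex> tag-membership lookup before
    // issuing real reads, so a false positive degrades to a recoverable miss.
}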

Federated query primitives

behavior::query::CapabilityQuery lifts five composable ops over CapabilityIndex:

  • filter(predicate) — predicate-driven candidate set.
  • match_axis(axis, key) — axis-shaped tag scan.
  • aggregate(key, reduction) — per-key cardinality / numeric reductions.
  • traverse(seed, edge_fn, depth) — graph-style join over peer capability links.
  • nearest(predicate, k, proximity) — combine with proximity to score the top-K best matches.

Implementations on CapabilityIndex are O(log n) for indexed predicates and O(n) for the residual scan. The Predicate AST and these five ops together are what Mesh::find_nodes_by_filter / find_best_node_scoped flow through.
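A rough composition sketch over a populated CapabilityIndex; wants_gpu and proximity are placeholders, and the exact argument types are not pinned here:

let candidates = index.filter(&wants_gpu);                          // predicate-driven candidate set
let os_keys    = index.match_axis(TaxonomyAxis::Software, "os");    // axis-shaped tag scan
let top3       = index.nearest(&wants_gpu, 3, &proximity);          // proximity-scored top-K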

PlacementFilter trait + StandardPlacement

PlacementFilter::placement_score(target, artifact) -> Option<f32> is the substrate-level placement primitive. Some(score) admits the candidate at a fitness in [0, 1]; None is a hard veto. Artifact carries the workload type — Chain (causal-chain placement), Replica (channel replica placement), Daemon (compute placement, with required + optional capability sets).

StandardPlacement is the multi-axis reference implementation: scope filter, proximity max-RTT, intent matching (AnyOfLocalCapabilities / StrictMatch / Custom), colocation policy (Ignore / SoftPreference / StrictRequired), resource axis (Storage / Compute / Both), anti-affinity config (leadership-concentration penalty), and a custom-filter axis that consumes a registered host-language PlacementFilter via with_custom_filter_id(id). Axes compose multiplicatively; None on any axis is a hard veto. Per-axis tie-breaking via the locked RTT → free-resource → lexicographic-NodeId chain (tie_break_compare).

IntentRegistry::register(intent, &[required]) registers per-intent placement requirements built from the require! / require_axis! / require_axis_value! macros. Substrate ships defaults for the four canonical intents (ml-training, inference, embedding-cache, tool-call); per-deployment overrides land via the SDK.

global_placement_filter_registry() is the process-wide singleton mapping registered IDs to Arc<dyn PlacementFilter>. Bindings register their language-specific wrappers here; the scheduler resolves an SDK ID to an impl before scoring. Registration is open-by-default — the registry refuses overwrites of an existing ID (register returns false) so two bindings can't accidentally clobber each other's filters.
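A hedged sketch of a host-side custom filter; the candidate-target type, its capability accessor, and the register argument shape are assumptions layered on the description above:

use std::sync::Arc;

struct GpuOnly;

impl PlacementFilter for GpuOnly {
    // Some(score) admits at a fitness in [0, 1]; None is a hard veto.
    fn placement_score(&self, target: &PlacementTarget, _artifact: &Artifact) -> Option<f32> {
        if target.capabilities().has_tag("hardware.gpu") { Some(1.0) } else { None }
    }
}

// register returns false on an id collision instead of overwriting.
let registered = global_placement_filter_registry().register("gpu-only", Arc::new(GpuOnly));
assert!(registered);

// Install on the scorer (default() is assumed):
let placement = StandardPlacement::default().with_custom_filter_id("gpu-only");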

Mikoshi integration

Mikoshi::select_migration_target(daemon, scope) consults PlacementFilter end-to-end. LegacyPlacement preserves the v0.12 ad-hoc selection under a feature flag for one minor version; new daemons should target StandardPlacement. ReplicaGroup::select_member_node and StandbyGroup::select_promotion_target route through the same scorer so replication / hot-standby promotion get the same axis-composed verdict as initial placement.

Daemon authors declare MeshDaemon::required_capabilities() and optional_capabilities(); the runtime publishes both as part of the daemon's identity-bound announcement so the placement scheduler — and any custom filter — can consult them. Bindings expose the same hook through their daemon-caps dispatcher (net_compute_set_daemon_caps_dispatcher at the C ABI; the equivalent Python / TS / Go callback during factory registration).


Capability Enhancements (substrate refinements)

None of these change the wire format — they sit on top of the typed-taxonomy primitive and pay for themselves at the application layer.

Lazy view projections + diff

caps.views() returns a CapabilityViews handle whose per-axis fields decode-and-cache on first access. Hot-path caps.views().hardware().memory_mb is < 50 ns post-cache; first call is the per-tag scan. Cache invalidation is compiler-enforced: the views() handle holds a &caps borrow, so the set can't be mutated while a view is live.

caps.diff(prev) returns CapabilitySetDiff { added_tags, removed_tags, changed_metadata } for cheap before/after change detection. MetadataChange::{Added, Removed, Updated} per-key with old/new values. Powers event-driven placement, capability-change dashboards, and delta-based metadata propagation.
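A short sketch of the projection + diff flow; the by-reference shape of diff and the changed_metadata iteration shape are assumptions:

// Lazy per-axis projection: first access scans the tags, later reads hit the cache.
let mem_mb = caps.views().hardware().memory_mb;

// Cheap before/after change detection.
let delta = caps.diff(&prev_caps);
for tag in &delta.added_tags {
    // e.g. feed event-driven placement or a capability-change dashboard
}
for (key, change) in &delta.changed_metadata {
    // MetadataChange::{Added, Removed, Updated} carries old/new values per key
}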

Axis schemas

AXIS_SCHEMA is the canonical per-axis schema baked into the substrate at build time: known keys per axis, value types (Presence / Number / String / Enumeration / Bool / Csv), indexed-collection shapes (software.model.<i>.* / software.tool.<i>.* / hardware.accelerator.<i>.*). validate_capabilities(caps) runs the schema against a CapabilitySet and returns a ValidationReport of errors (operator-must-fix: UnknownAxis, TypeMismatch, IndexMalformed) + warnings (forward-compat / hygiene: UnknownKey, MetadataOversize, LegacyTag). Both lists are sorted by JSON-stringified entry so cross-binding fixture comparisons stay order-independent. Each binding regenerates its language-side schema from the same authoritative CAPABILITIES_SCHEMA.md doc.

Predicate AST + nRPC headers

behavior::predicate::Predicate is the typed AST. Variants: Exists / Equals / NumericAtLeast / NumericAtMost / NumericInRange / SemverAtLeast / SemverAtMost / SemverCompatible / StringPrefix / StringMatches / MetadataExists / MetadataEquals / MetadataMatches / MetadataNumericAtLeast / And / Or / Not. Built via the pred! macro in Rust, language-idiomatic builders in every other binding (p.and([...]), p.exists(tagKey('hardware', 'gpu')), etc.). Evaluated against an EvalContext constructed from any (tags, metadata) pair.
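A direct-AST sketch; the leaf and composite constructor shapes and the EvalContext constructor are assumptions (the pred! macro and per-binding builders cover the same ground):

let wants_gpu = Predicate::And(vec![
    Predicate::Exists(TagKey::new(TaxonomyAxis::Hardware, "gpu")),
    Predicate::NumericAtLeast(TagKey::new(TaxonomyAxis::Hardware, "gpu.vram_mb"), 24_576.0),
]);

let ctx = EvalContext::new(&caps.tags, &caps.metadata);  // any (tags, metadata) pair
let fits = wants_gpu.evaluate(&ctx);                     // planned, selectivity-reordered path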

Predicates encode losslessly to a cyberdeck-where: nRPC header pair via predicate_to_rpc_header; the receiver decodes via predicate_from_rpc_headers (consumes any iterable of (name, value_bytes) pairs through the AsRpcHeader trait). Pair with net_rpc_call_with_headers / _call_service_with_headers / _call_streaming_with_headers at the C ABI so server-side filtering picks the right candidate without re-running the predicate per hop. Decode-side enforces the encode-side size cap symmetrically — oversize payloads surface as PredicateRpcDecodeError::Oversize instead of walking serde's recursive parse on attacker-shaped input. Wire format pinned by tests/cross_lang_capability/predicate_nrpc_envelope.json.
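A hedged round-trip sketch, reusing wants_gpu from the sketch above; the exact return shapes of the encode / decode helpers are assumed from the description:

// Caller side: encode once, attach via the *_with_headers nRPC call variants.
let (name, value) = predicate_to_rpc_header(&wants_gpu)?;   // header name also exported as RPC_WHERE_HEADER

// Server side: decode from whatever (name, value_bytes) iterable the handler sees,
// then filter candidates with the stateless evaluator.
let decoded = predicate_from_rpc_headers([(name, value)])?;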

Query planner

predicate.evaluate(ctx) runs the planned (selectivity-reordered) AST by default; predicate.evaluate_unplanned(ctx) exposes the raw declaration-order path for benchmarking. Planner consumes CardinalityProvider (a TTL-cached lookup over by_axis_key / by_metadata indexes via CapabilityIndex::axis_cardinality). Cost-based AND short-circuits cheap-false-first, cost-based OR cheap-true-first; structurally-equal clauses merge so duplicate work is single-counted. Cardinality casts saturate on u32::MAX so fleets with unbounded-cardinality metadata keys (session id, request id) don't wrap and mis-rank the most-selective key.

Chain composition helpers

caps.requireChain(hash) / requireAnyChain([hashes]) / excludeChain(hash) / fromFork(parent) / heatLevel(rate) are syntactic sugar over the underlying reserved-prefix tags (TS / Python builder shapes; the Rust require_axis_value! macro covers the same). Predicate-side equivalents on the pred.* builder.

Predicate debug sessions

Predicate::evaluate_with_trace(ctx) returns (bool, ClauseTrace) — every clause's verdict + skipped children for short-circuit AND/OR. PredicateDebugReport::from_evaluations(&pred, contexts) aggregates per-clause hit / miss / cost stats across a corpus; report.render() renders a multi-line text summary. Bindings ship a redact_metadata_keys(report, keys) helper for safe persistence — scrubs metadata-equality / -matches values before the report goes to disk or analytics. Wire format pinned by tests/cross_lang_capability/predicate_trace.json and predicate_debug_report.json.
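A debug-session sketch, reusing wants_gpu and ctx from above; contexts is a placeholder corpus and the redaction helper shape follows the binding-side description:

let (verdict, trace) = wants_gpu.evaluate_with_trace(&ctx);    // per-clause verdicts + skipped children

let report = PredicateDebugReport::from_evaluations(&wants_gpu, &contexts);
println!("{}", report.render());                               // per-clause hit / miss / cost summary

// Scrub metadata-equality / -matches values before the report goes to disk or analytics.
let safe = redact_metadata_keys(report, &["api_key"]);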


SDK Capability System Surface

Every binding ships the same capability surface. Total ~14 K LoC across the substrate + SDK + bindings + tests, of which the binding surface accounts for ~7 K. The substrate primitives (Tag, TagKey, CapabilitySet, CapabilityViews, Predicate, pred! macro, ValidationReport, CapabilitySetDiff, RequiredCapability + require! macros) re-export through net-sdk::capabilities. Per-binding surfaces:

  • Node / TypeScript — sdk-ts exports tagFromUserString, RESERVED_PREFIXES, requireTag, withMetadata, the p predicate builder, evaluatePredicate, predicateToRpcHeader / predicateFromRpcHeader, validateCapabilities, diffCapabilities, evaluatePredicateWithTrace, predicateDebugReport, redactMetadataKeys, renderDebugReport, placementFilterFromFn, standardPlacement.
  • Python — sdk-py exports the parallel surface as tag_from_user_string, p, evaluate_predicate, predicate_to_rpc_header, validate_capabilities, diff_capabilities, evaluate_predicate_with_trace, predicate_debug_report, redact_metadata_keys, placement_filter_from_fn, standard_placement.
  • Go — bindings/go/net/ exports Tag, Predicate{}, EvaluatePredicate, PredicateToWhereHeader, ValidateCapabilities, DiffCapabilities, EvaluatePredicateWithTrace, PredicateDebugReport, RegisterPlacementFilter, UnregisterPlacementFilter.
  • C ABI — Stateless evaluator (net_predicate_evaluate), stateless validator (net_validate_capabilities), debug-session helpers (net_predicate_evaluate_with_trace, net_predicate_aggregate_debug_report, net_predicate_redact_metadata_keys), cyberdeck-where: header builder (net_predicate_to_where_header), and header-bearing nRPC call variants (net_rpc_call_with_headers, net_rpc_call_service_with_headers, net_rpc_call_streaming_with_headers plus cancellable streaming variants).
  • All bindings — MeshDaemon capability authoring: daemons declare required_capabilities / optional_capabilities via per-binding factory hooks plumbed through net_compute_set_daemon_caps_dispatcher. Custom PlacementFilter callbacks via placement_filter_from_fn(fn) (TS / Python / Go) or global_placement_filter_registry().register(...) (Rust).

Eight cross-binding wire-format fixtures under tests/cross_lang_capability/ (predicate_eval, capability_set_diff, capability_validation, predicate_trace, predicate_debug_report, predicate_debug_report_redacted, predicate_nrpc_envelope, placement_score) pin the byte-identical contract across Rust / TS / Python / Go / C and are versioned via abi_version_expected: 1.

Cross-cutting invariants the fixtures and per-binding compat suites enforce:

  • Wire format is byte-identical across Rust / TS / Python / Go / C. A predicate authored in TS and shipped to a Go service via the cyberdeck-where: header decodes losslessly; a CapabilitySet::diff on Python reproduces the identical added_tags / removed_tags / changed_metadata shape Rust would. Drift in any binding fails that binding's own CI.
  • Numeric / semver parse semantics agree with Rust. Every binding's f64 parser accepts exactly Rust's f64::from_str set (decimal, scientific, leading +, .5, 1., inf, infinity, NaN) and rejects hex floats / digit-separator underscores. Every binding's semver parser accepts only ASCII digits with optional leading +. Validators bound Number values at u64::MAX and reject negatives; indexed-collection indices bound at u32::MAX.
  • AxisPresent tags don't satisfy value predicates. Equals(_, "") / StringPrefix(_, "") / StringMatches(_, "") never spuriously match a presence-only tag — only the Exists predicate does. CapabilitySet::diff is separator-agnostic on AxisValue tags (hardware.k=v and hardware.k:v carry identical semantics).
  • Reserved-prefix tags only via dedicated helpers. add_tag(s) parses through Tag::parse_user, which rejects reserved prefixes — applications that try to emit a scope:tenant:foo via add_tag get the tag silently dropped. Use with_tenant_scope("foo") / with_region_scope / with_subnet_local_scope / etc. Bindings opt into the unrestricted Tag::parse path so reserved tags round-trip through tags: [...]. Metadata writers gate on the same reserved-prefix list. The schema validator surfaces collisions and oversize as warnings.
  • MeshDaemon::process panic surfaces as RpcStatus::Internal — same hardening posture as v0.12's nRPC fold, applied through the daemon-caps dispatcher when caps extraction itself panics.
  • AttributeError is the only silently-swallowed Python error. Every other exception from a @property getter for required_capabilities / optional_capabilities propagates so operators see real failures instead of phantom-empty-cap daemons.

Hardening

The capability surface landed alongside two parallel audits whose fixes are integrated into the surface descriptions above. The substantive results, grouped by area:

Wire-format determinism and separator agnosticism

  • CapabilitySet::has_tag and RequiredCapability::Tag evaluate via Tag::semantic_eq so caps.has_tag("software.os:linux") matches a stored software.os=linux and vice versa. The separator field is a wire-form detail, not part of identity.
  • CapabilitySet::diff is separator-agnostic and emits ops in deterministic lexicographic-by-tag order. Pre-fix HashMap iteration randomized the op order, and an input tag with : separator that re-encoded canonically as = shipped a phantom RemoveTag without a compensating UpdateSoftware — receivers dropped the tag entirely. Same fix applied to the TS diffCapabilities rewrite (semantic comparison on (kind, axis, key, value)).
  • Capability announcements emit tags in sorted wire order so signed announcements verify byte-stably across processes (HashSet iteration is process-local random; pre-fix verification rejected multi-tag announcements crossing between two processes).
  • Forward-compat axis tags survive CapabilitySet::diff as AddTag / RemoveTag; the is_*_owned_tag predicates no longer over-claim unknown forward-compat keys.

Predicate / placement correctness

  • Custom PlacementFilter impls returning None or NaN are hard vetoes — pre-fix NaN scores poisoned the sort comparator and the highest-scoring candidate could rotate non-deterministically. StandardPlacement::saturating_score, the anti-affinity threshold, and target_axis_value_numeric all clamp NaN / out-of-range values before composition; score_resource_axis::Both collapses to whichever axis carried data (rather than diluting against a permissive 1.0 placeholder for a no-data axis).
  • score_custom_filter_axis resolves outside the with_caps closure so an FFI-registered filter that calls back into the index (index.query(...) from a LegacyPlacement shim, JS callback hitting find_nodes) can't deadlock against a concurrent index.index(...) insert.
  • Scheduler::select_migration_target carries the LocalPreferred fast-path so RTT-aware operators feeding their own TieBreakContext don't silently lose the network-hop-avoidance behavior. place_migration_v2 derives the right PlacementReason from the returned node id.
  • CapabilityQuery::traverse carries a visited-set so cycles in the peer-capability graph terminate. eval_any_in_cost_order ranks Or composites cheap-true-first; redact_label searches every separator position so metadata-equality values containing = round-trip cleanly.
  • Tag::AxisPresent no longer matches value-bearing predicates. Equals(_, "") / StringPrefix(_, "") / StringMatches(_, "") only match AxisValue tags; Predicate::Exists is the dedicated presence-check path in every binding.

Cross-binding numeric / semver agreement

  • Every binding's f64 parser accepts exactly Rust's f64::from_str accepted-set (decimal, scientific, leading +, .5, 1., inf, infinity, NaN) and rejects hex floats (0x1p3) and digit-separator underscores (1_000) that Go's strconv.ParseFloat and Python's float() would otherwise accept. Numeric leaves run through IEEE comparison so NaN never matches and ±inf compare correctly across bindings.
  • Schema Number validators bound at u64::MAX and reject negatives; indexed-collection indices bound at u32::MAX. ASCII digits only with optional leading + — Unicode digits (Arabic-Indic, fullwidth) parse cleanly under Python's int() but Rust's u64::from_str rejects them, so the predicate-side and schema-side parsers both lock to ^\+?[0-9]+$.
  • Semver parsers reject Unicode digits in the version components; 0.0.x is exact-only (every patch is a breaking change boundary per Cargo's caret rule); 0.x.y requires lhs.major == 0.
  • parse_tag_key trims whitespace around the dot, require! parses == before >= / <= so equality values containing comparison substrings parse correctly. Tag::parse_user rejects reserved prefixes consistently across bindings; with_metadata filters reserved-prefix keys at the writer.

FFI / binding hardening

  • predicate_from_rpc_headers enforces the decode-side size cap symmetrically with the encode side — parse-bomb-shaped payloads surface as PredicateRpcDecodeError::Oversize instead of walking serde's recursive parse.
  • dynamic_cost / dynamic_cost_or saturate usize cardinality to u32::MAX so long-running fleets with unbounded-cardinality metadata keys (session id, request id) don't trip the planner into treating the most-selective key as if it had only one distinct value.
  • placement_registry::register pre-creates the per-binding invocation counter only on successful insertion — id-collision register-fail paths don't leak phantom Prometheus binding-counters.
  • Bloom-filter h2 forces odd-only so power-of-2 bit-count probe cycles cover the full bit range; the rounding-saturation path is unit-tested.
  • compute-ffi's parse_side and net_compute_snapshot_bytes_free correctly free (non-NULL ptr, len == 0) malloc'd buffers.
  • rpc-ffi's run_cancellable carries a cancelled flag for register-after-spawn ordering; the cancel-token registry evicts stale orphan entries; net_predicate_to_where_header recovers from partial-write failure. Streaming-call construction is cancellable end-to-end via net_rpc_call_streaming_cancellable and net_rpc_call_streaming_with_headers_cancellable (pre-existing non-cancellable variants kept for back-compat).
  • Python announce_capabilities releases the GIL across the blocking call. Python-binding property-getter errors propagate (except AttributeError) so misbehaving daemon-caps callbacks surface real failures instead of phantom-empty-cap daemons. The Python _try_parse_float rejects whitespace-padded inputs to match Rust's strictness.
  • Go binding's RegisterPlacementFilter / UnregisterPlacementFilter serialize on the same id to close a registry-vs-substrate race; tagKeyFromWire surfaces type-assert failures.
  • Node + Python fp16_tflops_x10 passes large values through directly, bypassing the f32 round-trip that previously lost precision above 2²⁴.
  • tag_codec rejects software runtime / framework / driver names containing the separator characters = / : / . so round-trips through the canonical wire format don't silently truncate.

Go cgo surface widening — origin_hash uint32 → uint64

go/net.h declared every origin_hash parameter and return type as uint32_t, while the canonical net.go.h and the Rust extern "C" signatures use uint64_t / u64. Pre-fix the cgo boundary silently truncated the upper 32 bits of every origin_hash. Closed before merge:

  • C header — net_identity_origin_hash, net_compute_daemon_handle_origin_hash, net_compute_migration_handle_origin_hash, net_compute_fork_group_parent_origin, net_compute_standby_group_active_origin (all now uint64_t return). net_tasks_adapter_open, net_memories_adapter_open, net_compute_runtime_stop, net_compute_runtime_deliver, net_compute_runtime_snapshot, net_compute_start_migration, net_compute_expect_migration, net_compute_migration_phase, net_compute_replica_group_route_event (out_origin), net_compute_standby_group_promote (out_origin), net_compute_fork_group_spawn (parent_origin) (all now uint64_t parameter / out-parameter).
  • Production Go binding — Identity.OriginHash() uint64, DaemonHandle.OriginHash() uint64, MigrationHandle.OriginHash() uint64, ForkGroup.ParentOrigin() uint64, StandbyGroup.ActiveOrigin() uint64, StandbyGroup.Promote() uint64, ReplicaGroup.RouteEvent() uint64. DaemonRuntime.{Stop, Snapshot, Deliver, StartMigration, ExpectMigration, MigrationPhase} parameters, NewForkGroup's parentOrigin, OpenTasks / OpenMemories's originHash parameter (all uint64).
  • Public Go types — CausalEvent.OriginHash is uint64 (changed from uint32); GroupMemberInfo.OriginHash is uint64; GroupForkRecord.{OriginalOrigin, ForkedOrigin} are uint64.

Breaking change for downstream Go consumers. Code calling daemon.OriginHash() and assigning to a uint32 variable will fail to compile; drop the explicit uint32(...) cast or convert to uint64. The widening matches the Rust substrate's u64 shape.

Regression coverage

Every correctness fix above ships with a regression test. The cross-binding fixture corpus grew from five JSON files at branch start to thirteen: predicate_eval, capability_set_diff, capability_validation, predicate_trace, predicate_debug_report, predicate_debug_report_redacted, predicate_nrpc_envelope, placement_score, plus five new rows pinning numeric-parser parity, separator-strip parity, and schema range-check agreement across Rust / TS / Python / Go / C.


Test hygiene

  • Cross-binding wire-format fixtures. Thirteen golden-vector fixtures under tests/cross_lang_capability/, all versioned via abi_version_expected: 1. Drift in any binding's encode / decode / evaluate path fails that binding's CI. Each fixture drives parallel suites in Rust integration tests + Node Vitest + Python pytest + Go go-test.
  • Integration tests for the load-bearing user flows. integration_nrpc_predicate_header.rs (4 tests) composes header-bearing nRPC call variants with the stateless evaluator over a real two-node mesh — pins that the predicate-as-cyberdeck-where:-header → server-side filter flow works end-to-end. integration_placement_filter_callback.rs (3 tests) registers a custom PlacementFilter via global_placement_filter_registry(), builds StandardPlacement::with_custom_filter_id over a populated CapabilityIndex, verifies the filter's verdict reaches the composed score, and unregister-mid-flight collapses to a hard veto.
  • Lib suite at 2330+ tests (was 2289 at v0.12 release). 40+ net new tests across the regression + integration paths, every correctness fix above shipping with at least one regression.
  • cargo clippy --all-features --all-targets -D warnings clean across substrate + every binding crate.

Breaking changes

Wire format — CapabilitySet shape change

v0.13 breaks wire compatibility with v0.12 for CapabilityAnnouncement / CapabilityDiff / any payload carrying a CapabilitySet. The storage shape collapsed from seven fields (hardware, software, models, tools, tags, limits, metadata) to two (tags, metadata); typed projections decode lazily through views(). Old peers can't decode new announcements; new peers can't decode old. Per locked decision in CAPABILITY_SYSTEM_PLAN.md ("no backward-compatibility shim"), a synchronous fleet-wide upgrade is required for any deployment that uses capability announcements.

Forward-compat preserved within the new shape:

  • Unknown axis-prefixed tags pass through as Tag::Legacy on parse for forward-compat with future schema additions. The validator emits LegacyTag warnings rather than errors.
  • Unknown metadata keys propagate as opaque pairs subject to the 4 KB soft cap.
  • Reserved-prefix tag set is closed at v0.13 (scope: / causal: / fork-of: / heat:). Future reserved prefixes will land in v0.14+; v0.13 receivers will route them through Tag::Legacy until upgrade.

The signed_payload() envelope round-trip is byte-stable across processes thanks to the sorted-tag wire format — pre-fix, signature verification rejected announcements crossing between two processes (different RandomState seeds), silently dropping every multi-tag announcement at the receiver.

MembershipMsg, IdentityEnvelope, EventMeta, CausalLink, OriginStamp, NetHeader, RedEX on-disk layout, per-event checksum format, and every nRPC dispatch / header from v0.12 — all unchanged.

Rust core (net crate) — API surface

  • CapabilitySet's typed-struct fields are gone. caps.hardware, caps.software, caps.models, caps.tools, caps.limits no longer exist as fields. Read through caps.views().hardware() (etc.) — the projection is per-axis OnceCell-cached. Write through caps.set_hardware(hw) / set_software / set_models / set_tools / set_limits — these clear axis-owned tags and re-emit via the codec. The with_* builders are thin wrappers.
  • CapabilitySet::tags field type changes from Vec<String> to HashSet<Tag>. Iterations over caps.tags now yield typed Tag values; render to wire form via t.to_string(). Use caps.add_tag(s) for application-facing additions (parses through Tag::parse_user, rejects reserved prefixes); caps.with_tenant_scope / with_region_scope / with_subnet_local_scope for the dedicated reserved-tag builders.
  • adapter::net::behavior::tag is a new public module re-exporting Tag, TagKey, TaxonomyAxis, AxisSeparator, RESERVED_PREFIXES, CapabilityTagError.
  • adapter::net::behavior::tag_codec is a new public module re-exporting the round-trip codecs (hardware_to_tags / hardware_from_tags / software_to_tags / software_from_tags / models_to_tags / models_from_tags / tools_to_tags / tools_from_tags / resource_limits_to_tags / resource_limits_from_tags) plus the axis-owned-tag predicates (is_hardware_owned_tag / etc.).
  • adapter::net::behavior::predicate is a new public module re-exporting Predicate, EvalContext, ClauseTrace, PredicateDebugReport, predicate_to_rpc_header, predicate_from_rpc_headers, RPC_WHERE_HEADER, MAX_PREDICATE_RPC_HEADER_VALUE_LEN, AsRpcHeader, PredicateRpcEncodeError, PredicateRpcDecodeError, PredicateWire, PredicateNodeWire, RpcPredicateContext, filter_by_predicate. Plus the pred! macro re-exported at the crate root.
  • adapter::net::behavior::required_capability is a new public module re-exporting RequiredCapability, RequireParseError, plus the require! / require_axis! / require_axis_value! macros at the crate root.
  • adapter::net::behavior::schema is a new public module re-exporting validate_capabilities, ValidationReport, SchemaError, ValidationWarning, ValueType, KeyEntry, AxisSchema, AXIS_SCHEMA, METADATA_SOFT_CAP_BYTES.
  • adapter::net::behavior::bloom is a new public module re-exporting BloomFilter.
  • adapter::net::behavior::query is a new public module re-exporting the CapabilityQuery trait.
  • adapter::net::behavior::placement is a new public module re-exporting PlacementFilter, Artifact, StandardPlacement, LegacyPlacement, IntentRegistry, IntentMatchPolicy, ColocationPolicy, ResourceAxis, AntiAffinityConfig, PlacementMetadataKeys, compose_axis_scores, tie_break_compare, LeadershipStatsLookup, RttLookup, ScopeLabel, TieBreakContext, NodeId as PlacementNodeId.
  • adapter::net::behavior::placement_registry is a new public module re-exporting global_placement_filter_registry(), PlacementFilterRegistry.

Rust SDK (net-sdk)

The SDK's capability surface is entirely additive over the substrate re-exports — no existing SDK API changes outside the CapabilitySet shape change.

  • net_sdk::capabilities::* re-exports the substrate capability surface end-to-end. New entries since v0.12: Tag, TagKey, TaxonomyAxis, RESERVED_PREFIXES, CapabilityViews, CapabilitySetDiff, MetadataChange, CardinalityCache, CardinalityProvider, RequiredCapability, RequireParseError, LegacyPlacement, StandardPlacement, Artifact, PlacementFilter, IntentRegistry, IntentMatchPolicy, ColocationPolicy, ResourceAxis, AntiAffinityConfig, PlacementMetadataKeys, LeadershipStatsLookup, RttLookup, ScopeLabel, TieBreakContext, compose_axis_scores, tie_break_compare, global_placement_filter_registry, PlacementFilterRegistry.
  • New submodule net_sdk::capabilities::predicate re-exports Predicate, EvalContext, ClauseTrace, ClauseStats, PredicateDebugReport, predicate_to_rpc_header, predicate_from_rpc_headers, AsRpcHeader, RpcPredicateContext, filter_by_predicate, MAX_PREDICATE_RPC_HEADER_VALUE_LEN, RPC_WHERE_HEADER, plus encode / decode / wire types.
  • New submodule net_sdk::capabilities::schema re-exports validate_capabilities, ValidationReport, SchemaError, ValidationWarning, ValueType, KeyEntry, AxisSchema, AXIS_SCHEMA, METADATA_SOFT_CAP_BYTES.
  • The pred! / require! / require_axis! / require_axis_value! macros are re-exported at the SDK crate root.

FFI / bindings

Per-binding changes:

  • All — New capability-enhancements surface: typed Tag, predicate AST + builders, validator, diff, trace, debug-report aggregator, redaction. Cross-binding wire format is byte-identical and pinned by the eight golden-vector fixtures.
  • All — Reserved-prefix tag passthrough at the binding boundary now uses Tag::parse (not parse_user). SDK consumers can supply scope:* / causal:* / fork-of:* / heat:* via the tags: [...] shape; pre-fix they were silently dropped at the binding boundary.
  • All — placement_filter_from_fn(fn) / placementFilterFromFn(fn) registers a host-language predicate as a custom placement-filter callback. Pair with standardPlacement(custom_filter_id=...) / StandardPlacement::with_custom_filter_id to install. Substrate calls back per candidate.
  • All — MeshDaemon capability authoring: daemons declare required_capabilities / optional_capabilities via per-binding callbacks during factory registration. Substrate's net_compute_set_daemon_caps_dispatcher plus per-binding adapter.
  • Node — New SDK module capability-enhancements.ts exports the full surface (tagFromUserString, RESERVED_PREFIXES, requireTag, requireAxisValue, withMetadata, emptyCapabilities, p, evaluatePredicate, predicateToRpcHeader / predicateFromRpcHeader, RPC_WHERE_HEADER, validateCapabilities, isReportValid, diffCapabilities, evaluatePredicateWithTrace, predicateDebugReport, redactMetadataKeys, renderDebugReport, placementFilterFromFn, standardPlacement, plus the typed wire shapes). NAPI binding rebuild required for the new storage shape.
  • Python — New module net_sdk exports the parallel surface (tag_from_user_string, p, evaluate_predicate, predicate_to_rpc_header, validate_capabilities, diff_capabilities, evaluate_predicate_with_trace, predicate_debug_report, redact_metadata_keys, placement_filter_from_fn, standard_placement). The net._net PyO3 binding adds extract_optional_caps, daemon caps dispatcher, placement-filter callback. Rebuild via maturin develop --release for the storage-shape change.
  • Go — bindings/go/net/ adds the typed surface (Tag, Predicate{}, EvaluatePredicate, PredicateToWhereHeader, ValidateCapabilities, DiffCapabilities, EvaluatePredicateWithTrace, PredicateDebugReport, RegisterPlacementFilter, UnregisterPlacementFilter). The compute-ffi C ABI gains the placement-filter dispatcher entry points.
  • Go — origin_hash widened from uint32 to uint64 end-to-end. Public methods (Identity.OriginHash(), DaemonHandle.OriginHash(), MigrationHandle.OriginHash(), ForkGroup.ParentOrigin(), StandbyGroup.{ActiveOrigin, Promote}(), ReplicaGroup.RouteEvent()) return uint64; DaemonRuntime.{Stop, Snapshot, Deliver, StartMigration, ExpectMigration, MigrationPhase} parameters and NewForkGroup's parentOrigin take uint64; CausalEvent.OriginHash, GroupMemberInfo.OriginHash, GroupForkRecord.{OriginalOrigin, ForkedOrigin} are uint64. Pre-fix the cgo boundary silently truncated the upper 32 bits of every origin_hash. Same widening applied to the cortex adapters (OpenTasks / OpenMemories take uint64 originHash). Breaking change for downstream Go consumers — uint32 callsites need explicit uint64(...) conversion.
  • Go — Cancellable streaming-call entry points. net_rpc_call_streaming_cancellable and net_rpc_call_streaming_with_headers_cancellable add a cancel_token parameter so a parallel net_rpc_cancel_call can abort the construction block_on before the stream handle materializes. Pre-existing non-cancellable variants kept for back-compat.
  • C — net.go.h exports the new error codes (NET_COMPUTE_ERR_NO_DISPATCHER = -4, NET_COMPUTE_ERR_INVALID_UTF8 = -5) and switches mesh_arc from void* to the typed opaque handle net_compute_mesh_arc_t*. New capability entry points: net_validate_capabilities, net_predicate_to_where_header, net_predicate_evaluate, net_predicate_evaluate_with_trace, net_predicate_aggregate_debug_report, net_predicate_redact_metadata_keys, net_rpc_call_with_headers / _call_service_with_headers / _call_streaming_with_headers.

Behavioral fixes that may surface as test breakage

  • CapabilitySet field reads now decode lazily through views(). Tests that did caps.hardware.memory_mb directly fail to compile; rewrite as caps.views().hardware().memory_mb. Same for software / models / tools / limits.
  • caps.tags.contains(&"gpu".to_string()) no longer compiles. tags: HashSet<Tag> carries typed values; use caps.has_tag("hardware.gpu") (which is now separator-agnostic) or caps.tags.iter().any(|t| t.to_string() == "hardware.gpu") for the substring-style check.
  • add_tag("scope:tenant:foo") silently drops at the application layer. Use caps.with_tenant_scope("foo"). The binding-side passthrough via tags: [...] works because bindings parse via the unrestricted Tag::parse.
  • CapabilitySet::diff ops now sort deterministically. Tests that asserted specific diff-op insertion order under Vec semantics will see lexicographic-by-tag ordering instead.
  • PlacementFilter::placement_score returning None is a hard veto. Pre-fix, custom impls returning Some(0.0) and None produced indistinguishable scheduler behavior; v0.13 makes None the explicit "exclude from ranking" signal and Some(0.0) the "score floor" signal. Tests asserting "filter returns None → scheduler ranks among others" will see the candidate excluded.
  • Custom PlacementFilter impls returning NaN are now treated as a hard veto. Tests that injected NaN to observe sort behavior will see a deterministic exclusion.
  • require!("software.id == v>=1.0") parses as Equals, not NumericAtLeast. The == branch now precedes >= / <= in the require-parser to handle equality values containing comparison substrings. Tests asserting the legacy ">= claims the split first" behavior will fail.
  • parse_tag_key trims whitespace around the dot. require!("hardware. gpu == nvidia") now produces TagKey::new(Hardware, "gpu") instead of TagKey::new(Hardware, " gpu") — the latter silently mismatched every real tag.
  • semver_compatible treats 0.0.x as exact-only. Tests that asserted "^0.0.1 matches 0.0.2" will see the rejection.
  • Tag::AxisPresent no longer matches value-bearing predicates. Equals(_, "") / StringPrefix(_, "") / StringMatches(_, "") no longer accept presence-only tags. Use Predicate::Exists for key-presence checks.
  • Forward-compat axis tags survive CapabilitySet::diff. Pre-fix, is_*_owned_tag over-claimed unknown forward-compat keys (hardware.future_field=v2) and the residual filter dropped them; the typed Update* ops didn't capture them either. Real changes to forward-compat tags now ship as AddTag / RemoveTag.
  • Capability announcements emit tags in sorted wire order. Tests asserting HashSet-iteration-order on the wire will see lexicographic ordering instead. Symptom for cross-process verification: the sorted form is what makes signature verification stable.

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.13 line. Recompile / rebuild the binding cdylib (NAPI for Node, maturin for Python, cargo build -p net-compute-ffi + -p net-rpc-ffi for Go).
  2. CapabilitySet field-access migration. Direct field reads (caps.hardware, caps.software, etc.) move to caps.views().hardware() / software() / etc. Use cargo build to drive the rewrite — the compiler errors name every site. The view handle is per-axis OnceCell-cached (< 50 ns post-cache); same hot-path cost as the old direct field access. A before/after sketch follows this list.
  3. Tag iteration changes from &str to &Tag. Render to wire form via tag.to_string() (the canonical Display impl), or pattern-match on the typed variants. caps.has_tag("...") works with either separator form.
  4. Reserved-prefix tag emission moves to dedicated builders. Replace caps.add_tag("scope:tenant:foo") with caps.with_tenant_scope("foo"), etc. Application code passing reserved tags through caps.add_tag was already silently dropping them in v0.12 prerelease builds.
  5. Fleet-wide upgrade required for capability announcements. v0.12 ↔ v0.13 mixed fleets cannot exchange CapabilityAnnouncement / CapabilityDiff payloads — the storage shape change is intentional. Pub/sub, mesh transport, channels, identity, subnets, NAT traversal, nRPC (the v0.12 surface) all continue to work cross-version. Recommend lockstep upgrade.
  6. For the new capability surface — the typed taxonomy + predicate evaluator + validator + diff + trace + debug report are opt-in. Read net/crates/net/README.md#capabilities for the high-level surface, then per-binding READMEs for language-idiomatic usage:
    • Rust SDK — net/crates/net/sdk/README.md § "Capability enhancements (typed taxonomy + predicates + validation)". pred! macro + require! family in scope under net_sdk::capabilities.
    • Node — net/crates/net/sdk-ts/README.md § "Capability enhancements". Import from @ai2070/net-sdk.
    • Python — net/crates/net/sdk-py/README.md § "Capability enhancements". Import from net_sdk.
    • Go — bindings/go/net/ exports the parallel surface. C-ABI entry points documented in net/crates/net/include/README.md.
    • C — net/crates/net/include/README.md § "Mesh function families" rows "Predicate evaluation", "Predicate where: header", "Capability validation", "Predicate debug session". Worked examples: net/crates/net/docs/CAPABILITY_ENHANCEMENTS_USAGE.md.
  7. Predicate-as-cyberdeck-where:-header → server-side filter. Pair predicate_to_rpc_header with the header-bearing nRPC call variants from v0.12 (net_rpc_call_with_headers and friends; same surface in every binding). Server's nRPC handler decodes via predicate_from_rpc_headers and filters candidates with evaluate_predicate. The cyberdeck-where: header name is exported as RPC_WHERE_HEADER from every binding.
  8. Daemon capability authoring. Daemons that want to participate in capability-driven placement implement required_capabilities / optional_capabilities. The runtime publishes both as part of the daemon's identity-bound announcement. Per-binding integration via the daemon-caps dispatcher (TS / Python: factory callback; Go: RegisterDaemonCaps; C: net_compute_set_daemon_caps_dispatcher).
  9. Custom placement-filter callbacks. When the built-in StandardPlacement axes don't fit a placement rule, plug a host-language predicate via placement_filter_from_fn(closure) (TS / Python / Go) or implement PlacementFilter directly + register via global_placement_filter_registry() (Rust). Pair with StandardPlacement::with_custom_filter_id(id).
  10. Cross-binding consumers — every binding's wire format is pinned by the thirteen golden-vector fixtures under tests/cross_lang_capability/. If you're integrating predicates / capability sets / debug reports across language boundaries, your wire-level compatibility is enforced at the binding's own CI. Fixtures versioned via abi_version_expected: 1.
  11. If you wired your own placement scoring around Mikoshi::select_migration_target or scheduler internals — the v0.13 path consults StandardPlacement with optional custom-filter callback. LegacyPlacement preserves v0.12 behavior under a feature flag for one minor version; new code should target StandardPlacement.
  12. If you have caches keyed off the old CapabilitySet shape on disk — the storage shape changed. Bust the cache or rewrite via the new shape. The view-projection layer is read-only over the typed tags + metadata, so encoding via set_hardware(hw) etc. produces the canonical tag set; subsequent views().hardware() reads back identically.
  13. Go consumers — origin_hash widened to uint64. Callsites assigning daemon.OriginHash() (or Identity.OriginHash() / migration.OriginHash() / replica.RouteEvent() / fork.ParentOrigin() / standby.{ActiveOrigin, Promote}()) to a uint32 variable fail to compile. Drop the explicit cast (or convert to uint64); the canonical Rust shape is u64 and the Go binding's previous u32 silently truncated the upper 32 bits. CausalEvent.OriginHash, GroupMemberInfo.OriginHash, GroupForkRecord.{OriginalOrigin, ForkedOrigin} are now uint64; DaemonRuntime.{Stop, Snapshot, Deliver, StartMigration, ExpectMigration, MigrationPhase} parameters and OpenTasks / OpenMemories / NewForkGroup's originHash / parentOrigin take uint64.
  14. Streaming RPC consumers wanting cancellation during construction — switch from net_rpc_call_streaming / net_rpc_call_streaming_with_headers to the new *_cancellable variants and pass a cancel_token from net_rpc_reserve_cancel_token. A parallel net_rpc_cancel_call(token) now aborts the construction block_on (peer-stalled initial-frame ACK), where pre-fix net_rpc_stream_close only took effect after the stream handle was already constructed. Existing non-cancellable variants kept for back-compat.
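For step 2, a minimal before/after sketch of the field-access migration (hw is a placeholder HardwareCapabilities value):

// v0.12: direct field read; no longer compiles on v0.13.
// let mem = caps.hardware.memory_mb;

// v0.13: lazy, per-axis cached projection.
let mem = caps.views().hardware().memory_mb;

// Writes go through the setters, which clear the axis-owned tags and re-emit via the codec.
caps.set_hardware(hw);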
v0.12.0 — Codename: Firestarter
2026.05.06

v0.12 breaks the "Black Diamond" line's run of pure hardening releases. After two consecutive releases of pure bug-fix + audit closure (v0.10 / v0.11), Firestarter is the first feature release on the line: it ships a complete request/response RPC surface (nRPC) on top of the v0.11 mesh, plus the four-language binding pipeline that consumes it (Node, Python, Go, plus the existing Rust SDK), plus a TypeScript migration of the Node binding's hand-written modules. The hardening posture is intact — every new surface has the same handle-lifetime, panic-safety, and FFI-soundness guarantees v0.11 established for the existing surfaces — but this release is about adding capability, not just polishing the existing one.


nRPC

Folds, Codec, Mesh Glue

The architectural anchor (and the prerequisite for everything else): an RPC server is a CortEX fold over a directed channel pair. There is no new transport, no new subsystem, no new daemon — just a typed dispatch enum on EventMeta, a channel-naming convention, and small caller-side / server-side helpers.

  • SubscriptionMode::QueueGroup on the channel roster (adapter/net/channel/roster.rs) — the one missing channel-layer primitive. Work-distribution dispatch alongside the existing Broadcast mode. add_with_mode / dispatch_recipients / subscriber_mode API; back-compat shims preserve every existing call site. MembershipMsg::Subscribe.queue_group: Option<String> wire field added at channel/membership.rs with forward-compat decode (pre-queue-group senders with zero remaining bytes after the token decode as Broadcast). Public APIs Mesh::subscribe_channel_in_queue_group[_with_token]. Pinned by 13 regression tests; cross-validated end-to-end by tests/queue_group_dispatch.rs (two QueueGroup subscribers on different nodes divide a stream of 100 events between them with exactly-once delivery; broadcast subscriber + queue-group pool coexist on one channel).
  • cortex::rpc codec (adapter/net/cortex/rpc.rs) — dispatch constants DISPATCH_RPC_REQUEST / RESPONSE / CANCEL / STREAM_GRANT / STREAM_CHUNK_DROPPED, flag bits (FLAG_RPC_STREAMING_RESPONSE, FLAG_RPC_PROPAGATE_TRACE), RpcStatus enum (Net-native with documented gRPC equivalence), RpcRequestPayload / RpcResponsePayload round-trip codec with MAX_RPC_* caps and encoded_len() helpers for buffer pre-sizing. 15 regression tests pin wire stability + decode-rejection of malformed payloads.
  • RpcServerFoldRedexFold<()> decoding REQUEST events, dispatching the handler in tokio, emitting RESPONSE via a RpcResponseEmitter callback. RpcCancellationToken (Notify+AtomicBool wrapper, race-safe), RpcContext (caller_origin + decoded payload + cancellation), RpcHandler async-trait, RpcHandlerError::{Application, Internal}. Handler panic caught via catch_unwind and surfaced as RpcStatus::Internal. Fast deadline-already-passed short-circuit. CANCEL flips the in-flight token. Malformed payloads emit a structured warn-and-skip and continue (do not kill the cortex adapter). Duplicate REQUEST for an in-flight call_id is refused; first-wins semantics. Per-channel-hash inbound dispatch hook on MeshNode (register_rpc_inbound / unregister_rpc_inbound) lets the mesh's inbound packet path consult a dispatcher map per packet (one DashMap get); registered channel hashes route directly and skip the per-shard inbound queue.
  • RpcClientFold + RpcClientPending — symmetric caller side. RpcClientPending::register(call_id) returns a oneshot receiver for unary calls; register_streaming(call_id) returns an mpsc receiver of StreamItem for streaming calls (the same RpcClientFold demuxes both call kinds via a PendingEntry::{Unary | Streaming} enum). Re-register of the same call_id closes the prior receiver (misuse detection).
  • Mesh::serve_rpc(service, handler) / Mesh::call(target_node_id, service, payload, opts) glue (adapter/net/mesh_rpc.rs). serve_rpc registers an inbound dispatcher for <service>.requests's channel hash; the dispatcher pushes events into a tokio mpsc that drains through the RpcServerFold. call lazy-subscribes to <service>.replies.<caller_origin>, allocates a call_id, registers a oneshot in the per-Mesh RpcClientPending, direct-sends the REQUEST via publish_to_peer bypassing the local subscriber roster (RPC's caller-knows-target model doesn't fit the publisher-led pub/sub roster), and awaits the receiver under opts.deadline. Returns RpcReply on Ok, RpcError on any failure. ServeHandle is RAII — the dispatcher unregisters on Drop and in-flight handlers complete (no abort). Per-Mesh state additions on MeshNode: rpc_client_pending, rpc_next_call_id, rpc_reply_subscriptions (bounded; refuses hash collisions instead of overwriting).
  • End-to-end Mesh integration test (tests/integration_nrpc_mesh.rs, 4 tests through real network handshake): round-trip echo, multiple sequential calls reusing the lazy reply subscription with exactly-once handler invocation, server panic surfaces as Internal, deadline emits CANCEL and surfaces as Timeout to the caller. Deadline-fire CANCEL emission is now pinned by an explicit assertion test (rpc_deadline_fires_cancel_on_the_wire).
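A minimal caller / server sketch of the glue above; whether serve_rpc itself is async, and the exact payload / CallOptions shapes, are assumptions (EchoHandler stands in for an RpcHandler impl):

// Server side: register the handler for the service; ServeHandle is RAII,
// so dropping it unregisters the dispatcher while in-flight handlers complete.
let _serve = mesh.serve_rpc("echo", EchoHandler)?;

// Caller side: direct-addressed unary call under the options' deadline.
let opts = CallOptions::default();
let reply: RpcReply = mesh.call(server_node_id, "echo", request_bytes, opts).await?;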

Service Discovery + Routing Policies

  • Service discovery via capability announcements. Mesh::serve_rpc auto-registers the service in a per-Mesh rpc_local_services set; announce_capabilities[_with] auto-merges nrpc:<service> tags onto the announced CapabilitySet, propagating through the existing capability-broadcast machinery. Two new public APIs: Mesh::find_service_nodes(service) -> Vec<u64> queries the local capability index for nodes carrying the nrpc:<service> tag; Mesh::call_service(service, payload, opts) -> Result<RpcReply, RpcError> finds candidates, picks one per RoutingPolicy, dispatches via the existing direct-addressed call(target, ...). Returns RpcError::NoRoute if no servers advertise the tag. ServeHandle::Drop removes the service from the local registry so subsequent announcements stop emitting the tag.
  • RoutingPolicy enum on CallOptions (default RoundRobin): RoundRobin uses a dedicated per-Mesh cursor with fetch_add (no longer collides with the call-id counter); Random (xxh3 of call_id, modulo); Sticky { key: u64 } (xxh3 of key, modulo a sorted candidate list — same key → same target while the candidate set is stable); LowestLatency (picks the candidate with smallest latency_us per the local ProximityGraph; deterministic fallback to the lexicographically-first sorted node id when no proximity data exists).
  • filter_unhealthy: bool on CallOptions (default true) — skips candidates whose ProximityGraph entry reports !is_available(). Pin: candidates with NO proximity entry are KEPT (absence of evidence ≠ evidence of unhealth), so a freshly-announced server isn't falsely filtered just because pingwaves haven't propagated yet.
  • EntityId ↔ node_id bridgeMeshNode::entity_id_for_node(u64) -> Option<[u8; 32]> accessor consults peer_entity_ids to map session-layer node ids to entity-layer keys. The single missing piece that LowestLatency and filter_unhealthy both flow through.
  • End-to-end coverage (tests/integration_nrpc_service_discovery.rs, 6 tests): three nodes, two serve "echo", one caller uses call_service — both servers exercised by round-robin; Sticky pins consistency; Random distributes evenly; no-servers returns NoRoute with diagnostic; LowestLatency falls back deterministically when no proximity data exists; filter_unhealthy keeps proximity-less candidates.

Streaming, Tracing, Resilience, Metrics

The biggest single chunk of new surface in this release.

  • Streaming responses. Multi-fire DISPATCH_RPC_RESPONSE events for one call_id marked non-terminal vs. terminal via the nrpc-streaming header (continue / end). RpcResponseSink (unbounded mpsc, non-blocking send), RpcStreamingHandler async-trait, and RpcServerStreamingFold (parallel to RpcServerFold but spawns a pump task draining the sink and emitting per-chunk nrpc-streaming: continue frames; handler return → terminal end frame, handler Err → terminal non-Ok frame, handler panic caught by catch_unwind → terminal Internal). Per-call ordering guarantee: the streaming fold takes an RpcAsyncResponseEmitter (Arc<dyn Fn(...) -> BoxFuture<()>>) instead of the unary fold's sync RpcResponseEmitter, and the pump task .awaits each emit before reading the next sink chunk — without this, two chunks emitted in tight succession would race into the publish path via independent tokio::spawns and arrive at the caller out of order. Caller side: Mesh::call_streaming returns an RpcStream: futures::Stream<Item = Result<Bytes, RpcError>>; terminal-Ok closes the stream, terminal-error yields one final Err(RpcError::ServerError) then closes. RpcStream::Drop clears the pending entry and best-effort emits CANCEL via direct unicast so the server's handler observes ctx.cancellation.
  • Per-stream window grants (closes the Phase 3 streaming backlog). Wire additions: DISPATCH_RPC_STREAM_GRANT (caller → server, payload is 4-byte big-endian u32 credit count) + HEADER_NRPC_STREAM_WINDOW_INITIAL (REQUEST header, ASCII-decimal u32 initial window). Server side keeps a per-call Arc<tokio::sync::Semaphore> map; pump task acquire_owned().await + forget() per chunk. STREAM_GRANT events add_permits(n). Caller side: CallOptions::stream_window_initial: Option<u32>. RpcStream::poll_next auto-grants 1 credit per delivered chunk (in-flight credit holds near the initial window). RpcStream::grant(n) is the explicit API for batched cadence; no-op when flow control isn't enabled. Defensive caps on incoming GRANT amounts so a misbehaving caller can't overflow tokio's MAX_PERMITS. Bounded streaming pump mpsc with drop-on-full metric so a slow caller can't unbounded-buffer the server.
  • W3C Trace Context propagation (cortex::rpc::TraceContext + extract_trace_context / build_trace_headers helpers). New CallOptions::trace_context: Option<TraceContext> and RpcContext::trace_context: Option<TraceContext> fields. When the caller sets CallOptions::trace_context, the SDK emits traceparent / tracestate headers and sets FLAG_RPC_PROPAGATE_TRACE; the server's fold extracts the headers and populates RpcContext::trace_context. nRPC is transport-only — application code on both sides reads/writes via whatever tracing backend it has wired up (tracing-opentelemetry, Datadog, etc.). Empty tracestate is omitted on the wire (W3C convention). Header-name matching is case-insensitive (W3C + HTTP convention); the previous implementation used name.as_str() == "traceparent" and silently dropped any non-lowercase variant.
  • Caller-side retry helper (sdk/src/mesh_rpc_resilience.rs). RetryPolicy with full-half jitter (each backoff scaled by uniform random in [0.5, 1.0]), exponential growth (backoff_multiplier, default 2.0), upper-bound cap (max_backoff), and a swappable retryable: Arc<dyn Fn(&RpcError) -> bool> predicate. Default policy: 3 attempts, 50ms initial → 1s cap. default_retryable retries Timeout, Transport, and ServerError for canonical transient statuses (Internal, Backpressure, server-observed Timeout); does NOT retry NoRoute, Codec, application errors, NotFound, Unauthorized, UnknownVersion, or Cancelled. Four wrappers on Mesh: call_with_retry, call_service_with_retry, call_typed_with_retry, call_service_typed_with_retry. Typed variants encode once and reuse the bytes across attempts; service variants re-resolve the candidate set per attempt so failover is automatic.
  • Caller-side hedge helper. HedgePolicy { delay, hedges } — fire-then-race: primary at t=0, additional hedges at t=delay*idx, first reply (Ok or Err) wins; if first finisher is Err, the wrapper waits for remaining hedges before surfacing the deterministic last error. Defaults: 50ms delay, 1 hedge. Four wrappers: call_with_hedge_to(targets, ...) / call_typed_with_hedge_to for explicit-target hedging (e.g. primary + warm-standby), call_service_with_hedge / call_service_typed_with_hedge for capability-index-driven hedging across replicas. Why service-only and explicit-targets-only, not direct-to-one-target: hedging to the same target is always wrong (same backlog, same GC pause, doubles your load for nothing). Hedge losers' UnaryCallGuard::Drop fires CANCEL to the server, which observes it on ctx.cancellation (pinned by hedge_loser_handler_observes_cancellation).
  • Caller-side circuit breaker. CircuitBreaker with CircuitBreakerConfig — three-state machine Closed → Open → HalfOpen → Closed/Open. Defaults: 5 consecutive failures to trip, 30s open cooldown, 1 successful probe to close. Different shape from retry/hedge: a long-lived stateful guard the user instantiates once (typically per logical downstream — one per service, or one per (service, target) pair) and shares via Arc<CircuitBreaker>. The wrapper takes a closure: breaker.call(|| async { mesh.call_typed::<Req,Resp>(...).await }).await. Generic over the inner result type so it composes around raw, typed, retried, OR hedged calls. BreakerError::{Open | Inner(RpcError)} — pattern-match Open to fall back, Inner to handle the underlying error. default_breaker_failure matches default_retryable (transient infra failures count as health signals; application errors don't). HalfOpen semantics: at most ONE concurrent probe; other calls during HalfOpen short-circuit. Panic-safe: a probe that panics doesn't poison the breaker's mutex; a poisoned mutex is recovered via into_inner() so the breaker keeps serving. (See the composition sketch after this list.)
  • Unary-call CANCEL-on-drop. New UnaryCallGuard is constructed inside Mesh::call immediately after the REQUEST is published; if the call future is dropped before resolving (hedge loser, tokio::select! losing arm, caller-side JoinHandle::abort), the guard's Drop runs pending.cancel(call_id) AND spawns a CANCEL publish to the server via the new spawn_cancel_publish helper (shared with RpcStream::Drop). The success path flips guard.completed = true so a happy call doesn't fire a useless CANCEL.
  • Per-service metrics + Prometheus formatter (adapter/net/mesh_rpc_metrics.rs). RpcMetricsRegistry — per-Mesh DashMap<String, Arc<ServiceMetricsAtomic>> (one entry per service that's been called or served). Bounded; idle entries with no in-flight ops and zero counters get evicted alongside empty queue-group shells. Per-service counters: caller-side (calls_total, errors_no_route / errors_timeout / errors_server / errors_transport, in_flight, latency_sum_ns / latency_count, Prometheus-default cumulative bucketed histogram), server-side (handler_invocations_total, handler_panics_total, handler_in_flight, handler_duration_*, streaming_chunks_emitted_total, streaming_chunks_dropped_total). CallMetricsGuard — an RAII shim constructed before any potential early return; it bumps in_flight on construction and balances it on Drop. Snapshot + Prometheus formatter: MeshNode::rpc_metrics_snapshot() is a cheap one-DashMap-pass copy. Service names are escaped per Prometheus exposition convention (backslash, double-quote, newline, \r); negative gauges from racy decrements clamp to zero.
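
A minimal caller-side sketch of composing the pieces above, assuming the surfaces described in this list; the wrapper and option names are taken from the items, but the exact argument lists, the Default impls, and the request/response types are illustrative, not the SDK's literal signatures.

    use std::sync::Arc;

    // Hypothetical request/response pair; any serde-codable types work with the typed surface.
    #[derive(serde::Serialize, serde::Deserialize)]
    struct QuoteReq { symbol: String }
    #[derive(serde::Serialize, serde::Deserialize)]
    struct QuoteResp { bid: f64, ask: f64 }

    // One long-lived breaker per logical downstream, shared via Arc, wrapped around a
    // retried typed call. BreakerError::Open means "skip the downstream and fall back";
    // BreakerError::Inner carries the underlying RpcError.
    async fn quote(mesh: &Mesh, breaker: &Arc<CircuitBreaker>) -> Result<QuoteResp, BreakerError> {
        breaker
            .call(|| async {
                mesh.call_service_typed_with_retry::<QuoteReq, QuoteResp>(
                    "pricing",                          // capability / service name (illustrative)
                    QuoteReq { symbol: "NET".into() },  // encoded once, reused across attempts
                    CallOptions::default(),
                    RetryPolicy::default(),             // 3 attempts, 50ms initial, 1s cap
                )
                .await
            })
            .await
    }

And a streaming consumer under the same caveats; each delivered chunk auto-grants one credit back to the server, with RpcStream::grant(n) available for a batched cadence instead.

    use bytes::Bytes;
    use futures::StreamExt;

    async fn tail_ticks(mesh: &Mesh) -> Result<(), RpcError> {
        let opts = CallOptions { stream_window_initial: Some(32), ..Default::default() };
        let mut stream = mesh.call_streaming("ticks.tail", Bytes::new(), opts).await?;
        while let Some(chunk) = stream.next().await {
            let bytes = chunk?;  // a terminal server error surfaces as one final Err, then the stream closes
            let _ = bytes;       // process the chunk here
        }
        Ok(())                   // a terminal Ok closed the stream
    }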

nRPC bindings — Node, Python, Go (B1–B7)

The seven-phase rollout from NRPC_BINDINGS_PLAN.md ships in full. Each phase landed independently; all phases pass their per-binding test suites and the cross-binding wire-format compat tests. Total ~5,800 LoC of new binding code + ~2,500 LoC of tests.

Phase Scope Commit
B1 Node — raw serve / call / callService / callStreaming (Buffer in/out). Validates the napi ThreadsafeFunction handler-bridging pattern. 98967fdc
B2 Node — typed wrappers + RetryPolicy / HedgePolicy / CircuitBreaker + per-service metrics. 5741f8e2
B3 Python — raw + GIL-aware runtime.block_on + tokio::task::spawn_blocking for handler dispatch. 4003d9bb
B4 Python — typed wrappers + resilience helpers + ServeHandle context manager. 000b53bc
B5 Go C-ABI — raw lifecycle + unary call / call_service / serve / find_service_nodes (bindings/go/rpc-ffi/, separate cdylib libnet_rpc). ea7c3836
B6 Go C-ABI — streaming + pure-Go RetryPolicy / HedgePolicy / CircuitBreaker + ABI version stamp (net_rpc_abi_version() -> u32, 0x0001 initial). 9cf612ab
B7 Cross-binding wire-format compat — shared tests/cross_lang_nrpc/golden_vectors.json fixture (6 ok cases + 3 error cases) drives parallel suites in Rust (tests/integration_nrpc_cross_lang.rs, 4 tests) + Node (bindings/node/test/cross_lang_compat.test.ts, 4 tests) + Python (bindings/python/tests/test_cross_lang_compat.py, 16 parametrized tests). 24 cross-binding compat assertions total. Drift in any binding's JSON encoding, typed-error mapping, or status-code constants now fails that binding's own CI. 4cd7366b

Cross-cutting decisions enforced by the fixture and the per-binding compat suites:

  • Stable nrpc: error prefix. Every binding's caller-side errors carry nrpc:<kind>: <detail> where <kind> is one of no_route, timeout, server_error, transport, codec_encode, codec_decode, breaker_open. Each binding maps the prefix to typed exception classes via classifyError(e) (Node) / classify_error(e) (Python) / parseRpcError + typed *RpcError (Go). The Node binding throws plain Error with the prefix (NOT typed classes) to sidestep vitest's dual-module-instance hazard; users classify at the catch site. (The prefix grammar is sketched after this list.)
  • Canonical typed-handler status codes: NRPC_TYPED_BAD_REQUEST = 0x8000, NRPC_TYPED_HANDLER_ERROR = 0x8001 — both in the application-defined range 0x8000..=0xFFFF. Re-exported from every binding alongside the typed surface. (The fixture initially used 0x4001 matching a stale Rust SDK comment; the fixture and Rust test were corrected to match the constant the bindings actually export. Found while writing the cross-binding compat suite.)
  • ServeHandle lifecycle per language. Node: .close() method (finalizers are non-deterministic so callers MUST close). Python: context-manager protocol (with rpc.serve(...) as handle:) + explicit .close(). Rust: Drop. Go: (*ServeHandle).Close() + runtime.SetFinalizer as a backstop. In every case "drop / close stops new dispatch but lets in-flight handlers complete" — same contract as the Rust serve_rpc.
  • Caller-driven cancellation across all four bindings. Late in the cycle the bindings each grew an explicit cancellation surface beyond the existing CANCEL-on-future-drop:
    • Node: AbortSignal-driven (MeshRpc.reserveCancelToken() mints a bigint; pass on the call's options; call MeshRpc.cancelCall(token) from an AbortSignal listener). Abort fires CANCEL on the wire.
    • Python: Cancellable pyclass + RpcCancelledError. Pass via opts={'cancel': cancel}; cancel.cancel() from another thread aborts mid-flight.
    • Go: ctx.Done() watcher goroutine wired through net_rpc_reserve_cancel_token / net_rpc_cancel_call C-ABI exports. Watcher pins to the stream/call's lifetime so it doesn't leak past close. Watcher self-deadlock prevention via watcherDone channel closed before Close().
  • Per-handler timeout configurable everywhere. Each binding's serve accepts an optional handler timeout (defaults to 60s for Go, no default for Rust/Node/Python — the SDK wraps user code with no timeout unless asked). Wedged handlers can't hold the in-flight slot indefinitely.
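
The error-prefix contract is language-neutral; the sketch below illustrates it as a small Rust parser. Only the prefix grammar and the seven kinds come from the contract above; the enum and function are purely illustrative, and each binding ships its own classifier.

    /// Illustrative parser for the cross-binding "nrpc:<kind>: <detail>" contract.
    #[derive(Debug, PartialEq)]
    enum NrpcErrorKind {
        NoRoute,
        Timeout,
        ServerError,
        Transport,
        CodecEncode,
        CodecDecode,
        BreakerOpen,
    }

    fn classify(message: &str) -> Option<(NrpcErrorKind, &str)> {
        let rest = message.strip_prefix("nrpc:")?;
        let (kind, detail) = rest.split_once(": ")?;
        let kind = match kind {
            "no_route" => NrpcErrorKind::NoRoute,
            "timeout" => NrpcErrorKind::Timeout,
            "server_error" => NrpcErrorKind::ServerError,
            "transport" => NrpcErrorKind::Transport,
            "codec_encode" => NrpcErrorKind::CodecEncode,
            "codec_decode" => NrpcErrorKind::CodecDecode,
            "breaker_open" => NrpcErrorKind::BreakerOpen,
            _ => return None,
        };
        Some((kind, detail))
    }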

Node binding TS migration

  • Single source of truth. errors.ts and mesh_rpc.ts replace the hand-written errors.js / mesh_rpc.js + parallel .d.ts files. The .d.ts was the only guard on the public type contract — and reviews of the nRPC work surfaced several places where the two had quietly diverged (the RawMeshRpc shape, the breaker.armed dead branch, the appError helper signature). Compiling from a single TS source catches that class of drift at build time.
  • Pipeline. New tsconfig.build.json extends the existing test-only tsconfig.json; target: ES2022, module: CommonJS, moduleResolution: node, strict, declaration, noEmitOnError. outDir/rootDir both . so import paths don't change. package.json gains scripts.build:ts, scripts.typecheck, and a prepublishOnly that runs the TS build before napi prepublish -t npm. Build artifacts (errors.{js,d.ts} + mesh_rpc.{js,d.ts}) are gitignored — regenerated on publish.
  • Module shape preserved. Stays CJS. npm pack --dry-run produces the same 8 files as before. Existing require('@ai2070/net/errors').CortexError keeps working unchanged. index.js / index.d.ts stay JS forever — auto-generated by napi-rs from the Rust crate.
  • Test-stub conformance enforced. Turning RawMeshRpc from documentation into a real type forced StubRawMeshRpc, LoopbackHandlerRpc, and CancelTrackingRaw to drop their as unknown escape hatches and grow the missing methods. The compile error IS the win — the parallel .d.ts couldn't catch this.
  • Outcome. -210 LOC of duplicated .js/.d.ts content collapsed into single TS sources. 53/53 vitest tests pass against both source state (TS) and built state (compiled .js).

Test hygiene

  • Cross-binding compat fixture — single source of truth for the canonical service contract. Every binding's compat test loads golden_vectors.json and asserts the same matrix. Fixture is versioned via abi_version_expected mirroring NET_RPC_ABI_VERSION; bumping the ABI invalidates the fixture and forces every binding's compat test to update.
  • Streaming flow-control coverage (tests/integration_nrpc_streaming.rs, 6 tests through real network): collects-all-chunks, drop-cancels-handler, terminal-error-after-partial-stream, plus the three flow-control tests (window_throttles_pump_until_grants asserts the server's streaming_chunks_emitted_total metric is exactly the initial window after 300ms; auto_grant_drains_full_stream; explicit_grant_unblocks_pump).
  • Resilience helpers — 12 SDK integration tests across mesh_rpc_retry.rs (4), mesh_rpc_hedge.rs (3), mesh_rpc_breaker.rs (5). Each pins a specific aspect: retry-then-succeed, retry-skips-app-errors, retry-exhaustion, predicate classification (retry); backup-wins, zero-degrades, empty-targets-NoRoute (hedge); full-state-machine cycle, failed-half-open-reopens, app-errors-don't-trip, reset-clears-state, error-flatten (breaker). All over real-network handshake.
  • Cross-language compat — 24 parametrized assertions (4 Rust + 4 Node + 16 Python) all driven from the shared fixture.

Breaking changes

Wire format additions (forward-compat from v0.11)

Unlike the v0.10 → v0.11 upgrade, v0.12 does not break wire compatibility with v0.11 for any pre-existing message type. Every change is a forward-compat addition:

  • New dispatch bytes in the CortEX EventMeta::dispatch namespace under nRPC: DISPATCH_RPC_REQUEST, DISPATCH_RPC_RESPONSE, DISPATCH_RPC_CANCEL, DISPATCH_RPC_STREAM_GRANT, DISPATCH_RPC_STREAM_CHUNK_DROPPED. All in the CortEX-internal range 0x10..=0x1F. A v0.11 receiver that doesn't know nRPC will see these as unknown dispatch values and route them to the no-op fold arm — no crash, no confusion, just a silent skip on the receiver side.
  • MembershipMsg::Subscribe gains an optional queue_group: Option<String> field (u8 length + UTF-8 bytes after the existing token field). Forward-compat: a v0.11 sender (zero remaining bytes after the token) decodes as Broadcast. When a v0.12 sender emits a queue_group to a v0.11 receiver, the receiver ignores the trailing bytes; that is benign for broadcast semantics, but it means queue-group dispatch silently degrades to broadcast fan-out across mixed-version peers. Recommendation: upgrade publishers and subscribers in lockstep if you intend to use QueueGroup. (A decode sketch follows this list.)
  • publish_to_peer now stamps channel_hash on the outgoing packet header (was always 0 pre-fix). A v0.11 receiver doesn't consult the header for dispatch routing on the per-shard inbound path, so this is invisible there; v0.12 receivers consult the field for the per-channel-hash fast-path dispatcher hook. Mixed-version: v0.12 sender → v0.11 receiver works (header byte ignored); v0.11 sender → v0.12 receiver works (zero hash misses the dispatcher map and falls through to per-shard inbound, which is the same behavior the v0.11 sender's receiver already had).
  • New REQUEST headers: nrpc-stream-window-initial (ASCII-decimal u32 initial flow-control window) and the W3C tracing pair traceparent / tracestate (when FLAG_RPC_PROPAGATE_TRACE is set on the REQUEST). All optional; absence means "no flow control" / "no tracing context."
  • No changes to IdentityEnvelope, EventMeta, CausalLink, OriginStamp, NetHeader, RedEX on-disk layout, or per-event checksum format — every v0.11 wire-format change persists unchanged into v0.12.
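
A decode sketch of the forward-compat rule for the trailing queue_group field, assuming rest holds whatever bytes remain after the token and that the subscription modes are named Broadcast / QueueGroup; the real decoder lives in adapter::net::channel::roster.

    // Zero remaining bytes after the token (a v0.11 sender, or a v0.12 sender with
    // queue_group: None) decodes as Broadcast; otherwise a u8 length prefixes the
    // UTF-8 group name.
    enum SubscriptionMode {
        Broadcast,
        QueueGroup(String),
    }

    fn decode_queue_group(rest: &[u8]) -> Result<SubscriptionMode, &'static str> {
        if rest.is_empty() {
            return Ok(SubscriptionMode::Broadcast);
        }
        let len = rest[0] as usize;
        let bytes = rest.get(1..1 + len).ok_or("truncated queue_group")?;
        let name = std::str::from_utf8(bytes).map_err(|_| "queue_group is not UTF-8")?;
        Ok(SubscriptionMode::QueueGroup(name.to_owned()))
    }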

The summary: a v0.11 ↔ v0.12 fleet can coexist on the same mesh for the v0.11 subset of operations. nRPC traffic between mixed-version peers will silently fail (the v0.11 peer doesn't know how to dispatch nRPC), but the existing pub/sub and migration paths continue to work. Recommend lockstep upgrade if you intend to use nRPC across the fleet from day one.

Rust core (net crate) — API surface

  • SubscriptionMode enum is new in adapter::net::channel::roster. Match arms over SubscriptionMode need to handle both variants; #[non_exhaustive] was added so this is forward-compatible.
  • MembershipMsg::Subscribe gains a public queue_group: Option<String> field. Struct-literal constructors must add it; the helper constructors (Subscribe::new, etc.) default to None so most call sites don't need updating.
  • Mesh::subscribe_channel_in_queue_group / Mesh::subscribe_channel_in_queue_group_with_token are new public methods on MeshNode and the SDK's Mesh envelope.
  • Mesh::serve_rpc / Mesh::call / Mesh::call_service / Mesh::find_service_nodes are new public methods on MeshNode. The SDK adds typed counterparts: serve_rpc_typed, call_typed, call_service_typed, serve_rpc_streaming, serve_rpc_streaming_typed, call_streaming, call_streaming_typed.
  • adapter::net::cortex::rpc is a new public module re-exporting RpcContext, RpcHandler, RpcHandlerError, RpcRequestPayload, RpcResponseEmitter, RpcResponsePayload, RpcServerFold, RpcClientFold, RpcClientPending, RpcStatus, RpcStreamingHandler, RpcResponseSink, StreamItem, TraceContext, plus the dispatch + flag constants.
  • adapter::net::mesh_rpc is a new public module re-exporting RpcError, RpcReply, RpcStream, CallOptions, RoutingPolicy, ServeError, ServeHandle, CodecDirection, MAX_RPC_* constants.
  • adapter::net::mesh_rpc_metrics is a new public module re-exporting RpcMetricsRegistry, RpcMetricsSnapshot, ServiceMetrics, ServiceMetricsAtomic, CallOutcome, DEFAULT_LATENCY_BUCKETS_SECS. Snapshot via MeshNode::rpc_metrics_snapshot(); Prometheus formatter via RpcMetricsSnapshot::prometheus_text().
  • MeshNode::register_rpc_inbound(channel_hash, dispatcher) -> bool and MeshNode::unregister_rpc_inbound(channel_hash) are new public methods. The dispatcher is Arc<dyn Fn(StoredEvent) + Send + Sync>; registered channel hashes route directly and skip the per-shard inbound queue. register_rpc_inbound returns false if the hash is already registered (refuses overwrites). (Usage is sketched after this list.)
  • ThreadLocalPooledBuilder::set_channel_hash(u32) is a new public method exposing the underlying packet-builder method so the publish path can stamp the channel hash.
  • ChannelConfigRegistry::insert_prefix(prefix, config) / remove_prefix(prefix) are new public methods. get_by_name(name) falls back to a longest-prefix-first walk when no exact match exists. The exact-match hot path (DashMap get) is unaffected.
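
A registration sketch for the fast-path dispatcher hook, assuming a node: &MeshNode and a channel_hash: u32 in scope; only the method names and the dispatcher type come from the list above, and the closure body is a placeholder.

    use std::sync::Arc;

    // Route one channel hash straight to a closure, skipping the per-shard inbound queue.
    let dispatcher: Arc<dyn Fn(StoredEvent) + Send + Sync> = Arc::new(|event: StoredEvent| {
        let _ = event; // decode + dispatch the RPC event here
    });
    if !node.register_rpc_inbound(channel_hash, dispatcher) {
        // Another dispatcher already owns this hash; overwrites are refused.
    }
    // later, on teardown:
    node.unregister_rpc_inbound(channel_hash);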

Rust SDK (net-sdk)

The SDK's nRPC surface is entirely additive — no existing SDK API changes.

  • New module mesh_rpc re-exports RpcError, RpcReply, CallOptions, RoutingPolicy, ServeHandle, RpcContext, RpcHandler, RpcHandlerError, RpcStatus, ServeError, Codec, RpcStreamTyped, ResponseSinkTyped, plus the NRPC_TYPED_* status constants.
  • New module mesh_rpc_resilience re-exports RetryPolicy, HedgePolicy, CircuitBreaker, CircuitBreakerConfig, BreakerError, BreakerState, plus default_retryable / default_breaker_failure predicates.
  • New Mesh methods (Rust SDK): serve_rpc, serve_rpc_typed, serve_rpc_streaming, serve_rpc_streaming_typed, call, call_service, call_typed, call_service_typed, call_streaming, call_streaming_typed, call_with_retry, call_service_with_retry, call_typed_with_retry, call_service_typed_with_retry, call_with_hedge_to, call_service_with_hedge, call_typed_with_hedge_to, call_service_typed_with_hedge, find_service_nodes, rpc_metrics_snapshot.

FFI / bindings

Binding Change
All New nRPC surface — serve / call / callService / callStreaming / findServiceNodes plus typed wrappers + resilience helpers. Importable from @ai2070/net/mesh_rpc (Node), net.mesh_rpc (Python), bindings/go/net/ (reference; Go module ships downstream). All extend the existing binding modules; nothing pre-existing changes.
All Stable nrpc: error prefix on every caller-side failure. Each binding ships a classifyError(e) / classify_error(e) helper for typed-error dispatch at catch sites.
Node Hand-written errors.js / mesh_rpc.js + their .d.ts files replaced by single TypeScript sources (errors.ts, mesh_rpc.ts). Module shape and tarball contents unchanged for consumers; build pipeline now requires npm run build:ts before napi prepublish (wired into prepublishOnly). The TypeScript surface declares RawMeshRpc as a real interface — custom test stubs may need to grow methods that previously got past via as unknown escape hatches. Streaming + resilience helpers (TypedMeshRpc, RetryPolicy, HedgePolicy, CircuitBreaker) ship in the new mesh_rpc.ts. AbortSignal-driven cancellation: MeshRpc.reserveCancelToken() / MeshRpc.cancelCall(token) plus the cancelToken option on call.
Python New net.mesh_rpc module ships TypedMeshRpc.from_mesh(mesh) + RetryPolicy / HedgePolicy / CircuitBreaker + the typed exception hierarchy (RpcError, RpcNoRouteError, RpcTimeoutError, RpcServerError, RpcTransportError, RpcCodecError, BreakerOpenError, RpcCancelledError). ServeHandle is a context manager (with rpc.serve(...)). Cancellation via Cancellable pyclass + opts={'cancel': cancel}. The native net.MeshRpc pyclass is the raw layer the typed wrapper sits on. GIL released across runtime.block_on(...); handler callbacks dispatch under tokio::task::spawn_blocking.
Go New crate net-rpc-ffi at bindings/go/rpc-ffi/ ships the C-ABI cdylib libnet_rpc (separate from the existing compute-ffi). 21 new C entry points: lifecycle (net_rpc_new / _free), ABI-version stamp (net_rpc_abi_version()), unary call (net_rpc_call / _call_service), service discovery (net_rpc_find_service_nodes), serve (net_rpc_serve / _serve_handle_close / _serve_handle_free), streaming (net_rpc_call_streaming / _stream_next / _stream_grant / _stream_close / _stream_free / _stream_call_id), cancellation (net_rpc_reserve_cancel_token / _cancel_call), handler dispatcher registration (net_rpc_set_handler_dispatcher), free helpers (net_rpc_free_cstring / net_rpc_response_free / net_rpc_find_service_nodes_free). New error code NET_RPC_ERR_STREAM_DONE = -6 separates clean stream termination from "no chunk available right now." Reference Go consumer at bindings/go/net/mesh_rpc.go documents the cgo wiring; the Go module itself ships downstream.
C nRPC is not exposed in net.h — it lives in the separate libnet_rpc cdylib (bindings/go/rpc-ffi/). The C SDK README at include/README.md § nRPC documents the entry-point listing, error codes, and ABI version stamp for downstream consumers building against the cdylib directly.

Behavioral fixes that may surface as test breakage

  • MembershipMsg::Subscribe encoder emits no trailing bytes when queue_group: None. Tests that decoded a v0.11 Subscribe and asserted "trailing zero byte" will fail — the encoder no longer writes the length byte on None. The decoder still accepts both shapes (forward-compat).
  • Hedge losers' handlers observe ctx.cancellation. Pre-fix a hedge loser's request stayed in-flight on the server and the handler ran to completion against a caller that no longer cared. Tests that asserted "handler ran for every hedge attempt" will see the cancellation signal instead.
  • Caller-side Mesh::call dropped before resolution emits CANCEL on the wire. Tests that asserted the server-side handler ran to completion despite caller drop will see ctx.cancellation fire.
  • Server-side fold emits RpcStatus::Cancelled on CANCEL observation. Tests that asserted "deadline + cancel surfaces as Timeout" will see Cancelled if CANCEL beat the deadline timer; the deadline path still surfaces Timeout (no behavior change for the deadline-only case).
  • extract_trace_context is case-insensitive. Tests that injected only-lowercase trace headers and asserted extraction will continue to work; tests that asserted capitalized variants were silently dropped will see the headers picked up.
  • classify_publish_no_session matches both publish-side and send-side error strings. call_service failure to a peer whose session expired between discovery and dispatch now surfaces RpcError::NoRoute instead of RpcError::Transport.
  • ChannelConfigRegistry prefix-walk is longest-prefix-first. Tests that relied on insertion-order or shortest-prefix-wins to disambiguate nested prefix registrations will see the most-specific prefix match instead.
  • Per-handler-timeout default for the Go binding is 60s. Wedged Go-side handlers can no longer hold the in-flight slot indefinitely; tests that exercised "handler runs for >60s" will surface a timeout where they previously hung.

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.12 line. Recompile.
  2. For consumers that only use the existing pub/sub + migration surfaces — no source changes required. v0.12 is forward-compatible with v0.11 wire formats for everything that existed in v0.11. The new SubscriptionMode and MembershipMsg.queue_group fields are additive.
  3. For consumers that want nRPC — the typed surface is opt-in. Read net/crates/net/README.md#nrpc for the cross-binding contract, then per-binding READMEs for language-idiomatic usage:
    • Rust SDK — net/crates/net/sdk/README.md § nRPC. Feature-gated on cortex (already enabled by the local and full umbrella features).
    • Node — net/crates/net/sdk-ts/README.md § nRPC. Import from @ai2070/net/mesh_rpc.
    • Python — net/crates/net/sdk-py/README.md § nRPC + net/crates/net/bindings/python/README.md § nRPC. Import from net.mesh_rpc.
    • Go — net/crates/net/include/README.md § nRPC for the C-ABI surface. Reference cgo wrapper at bindings/go/net/mesh_rpc.go.
  4. For mixed v0.11 ↔ v0.12 fleets — pub/sub and migration paths continue to work cross-version. nRPC traffic between mixed-version peers will silently fail (v0.11 doesn't know how to dispatch nRPC). Upgrade the fleet in lockstep if you intend to use nRPC across all peers from day one. QueueGroup subscriptions silently degrade to broadcast fan-out when crossing into a v0.11 receiver — same recommendation.
  5. Node consumers depending on the hand-written mesh_rpc.js / errors.js shape — module exports and require() resolution are unchanged. If your test harness used as unknown casts to satisfy RawMeshRpc against a stub that didn't conform, the stub will need to grow the missing methods (or the casts switched to actual conforming shapes). The TypeScript compile error names the missing method.
  6. Cross-binding nRPC consumers — every binding's compat suite asserts the same fixture (tests/cross_lang_nrpc/golden_vectors.json). If you're integrating nRPC across language boundaries, your wire-level compatibility is enforced at the binding's own CI. The fixture is versioned via abi_version_expected mirroring NET_RPC_ABI_VERSION = 0x0001.
  7. Go consumers — the libnet_rpc cdylib is a separate build artifact from the existing libcompute_ffi. Build with cargo build --release -p net-rpc-ffi and link both. ABI version drift is detected via net_rpc_abi_version() vs the consumer's compiled-in ExpectedABIVersion.
  8. If you implemented your own caller-side request/response over the existing pub/sub primitives (e.g. via two channels + correlation id) — the nRPC surface implements exactly that pattern, with deadlines, retry/hedge/breaker, response streaming, and end-to-end cancellation. Migration is a straight rewrite per the per-binding README's ## nRPC section.
  9. If you wired your own metrics around the existing channel publish path for RPC-shaped traffic — MeshNode::rpc_metrics_snapshot() + RpcMetricsSnapshot::prometheus_text() ship a complete per-service counter set (caller-side nrpc_calls_total / nrpc_errors_total{kind} / nrpc_in_flight_calls / nrpc_call_latency_seconds_* + server-side nrpc_handler_invocations_total / nrpc_handler_panics_total / nrpc_handler_in_flight / nrpc_handler_duration_seconds_* / nrpc_streaming_chunks_emitted_total). One snapshot covers both directions for any service the local node both calls and serves.
v0.11.0 — Codename: Black Diamond
2026.05.05

v0.11 closes the audit work that v0.10 left open. Same shape: a hardening release with no new transports, no new SDK surfaces, no new feature gates. Every commit on this branch is a bug fix, a regression test, a triage decision, or a wire-format bump that closes a structural gap the previous release flagged but couldn't ship inside its envelope.


Addressed in this release

CortEX watermark, snapshot, and per-event integrity

  • folded_through_seq advanced past unfolded events — under Stop policy, recoverable_decode could publish a watermark for events whose state mutation never landed; wait_for_seq(seq) returned true incorrectly and downstream readers acted on never-applied state. Split the watermark in two: applied_through_seq (strict-prefix, advances only on Ok(()) AND only when seq is the immediate successor of the previous applied) and folded_through_seq (live-progress, retained for low-latency observers). snapshot() writes applied_through_seq; restore re-attempts the previously-skipped event so the post-restore state matches what fold committed, not what fold attempted.
  • Snapshot persisted last_seq for skipped events — same root cause as the watermark fix above. Once the strict-prefix watermark is the source of truth, snapshots no longer carry sequence numbers for events whose state was never applied; the on-disk log remains the source of truth on restore.
  • Per-event checksum did not cover the EventMeta header — compute_checksum(tail) was xxh3 over only the payload tail; a stray bit-flip in the 20-byte EventMeta header (e.g. dispatch: STORED → DELETED) was undetected by the per-event integrity check and silently re-routed the event to the wrong fold arm. The new compute_checksum_with_meta(&meta, tail) covers both the header (with the checksum slot zeroed) and the tail. Producers stamp v2; readers try v2 first and fall back to v1 to keep pre-fix on-disk records readable. Downgrading to a pre-v0.11 binary will skip every event written by a v0.11 producer (the legacy verifier expects xxh3(tail), which v2 records won't match) — the migration is effectively one-way. (A sketch of the v1/v2 split follows this list.)
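
A conceptual sketch of the v1 / v2 split, assuming the v0.11 EventMeta layout quoted in the upgrade notes (24-byte header, checksum slot at bytes [20..24]) and assuming the u32 value is the truncated 64-bit xxh3; the real functions are compute_checksum / compute_checksum_with_meta in adapter::net::cortex.

    use xxhash_rust::xxh3::xxh3_64;

    // v1: payload tail only (what pre-v0.11 producers stamped).
    fn checksum_v1(tail: &[u8]) -> u32 {
        xxh3_64(tail) as u32
    }

    // v2: header with the checksum slot zeroed, then the tail.
    fn checksum_v2(meta_bytes: &[u8; 24], tail: &[u8]) -> u32 {
        let mut meta = *meta_bytes;
        meta[20..24].fill(0);
        let mut buf = Vec::with_capacity(meta.len() + tail.len());
        buf.extend_from_slice(&meta);
        buf.extend_from_slice(tail);
        xxh3_64(&buf) as u32
    }

    // Read side: try v2 first, fall back to v1 so pre-fix on-disk records stay readable.
    fn verify(meta_bytes: &[u8; 24], tail: &[u8], stored: u32) -> bool {
        checksum_v2(meta_bytes, tail) == stored || checksum_v1(tail) == stored
    }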

RedEX compact_to durability + atomicity (manifest-pointer flip)

Two layered fixes; the first patches per-call durability on Windows, the second closes the cross-file mixed-state window structurally.

  • Per-rename MoveFileExW(MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH) — compact_to's rename calls used std::fs::rename, which on Windows is MoveFileExW(MOVEFILE_REPLACE_EXISTING) with no write-through — the destination metadata could be cached and lost on power-loss. Now driven through a durable_rename helper that calls MoveFileExW with MOVEFILE_WRITE_THROUGH on Windows; POSIX is unchanged (fs::rename is durable as long as the directory is fsync'd, which the surrounding code already does).

  • Cross-file atomicity via manifest-pointer layout. The old compact_to did three sequential renames (idx, dat, ts). A crash between rename N and N+1 left the on-disk channel in a mixed state (idx at gen K+1 paired with dat/ts still at gen K) that recovery could not distinguish from a clean half-finished compact. The new layout puts each generation's files under its own directory and atomically swaps a single manifest pointer:

    <channel>/manifest                        # 16-byte pointer file
    <channel>/v0000000001/{idx,dat,ts}        # current live generation
    <channel>/v0000000002/{idx,dat,ts}        # next generation (mid-compact)
    

    compact_to writes the new generation's files in full, fsyncs them, then durable_rename(manifest.tmp → manifest) is the single linearizing event. Before the rename, recovery sees the old manifest and uses v<N>/. After it, recovery sees the new manifest and uses v<N+1>/. There is no mixed state — every generation directory is either complete or orphaned, never partially live. Recovery falls back to the highest validated v<NNN>/ if the manifest is torn or missing, and sweeps every generation directory other than the live one on every open (cleaning up orphans left by a crashed prior compact).

    The post-rename fsync_dir(channel_dir) is treated as best-effort: a rare POSIX failure after the linearizing rename is logged and swallowed rather than surfaced as Err, so the cached-handle swap still proceeds and on-disk + in-memory stay aligned. Surfacing the error would have lied to the caller about whether the flip happened, leaving any in-process appends between the failed compact and process exit landing in a now-dead generation. The residual durability gap (a power loss before the next implicit dirent flush could revert the rename) is recovered by the orphan-generation sweep on next open, which converges on a single consistent live generation regardless of which side of the rename survived.

    Legacy v0.10 / v0.11 channels with the flat <channel>/{idx,dat,ts} layout migrate transparently into <channel>/v0000000001/{idx,dat,ts} on first open. The migration is one-shot per channel and idempotent. Pinned by 20 new regression tests including all 10 crash-injection points sketched in BUG_AUDIT_2026_05_03_REMAINING_PLAN.md's long-term-follow-up section, plus mid-rename partial-migration recovery, fault-injected fsync_dir failure handling, and source-shape guards against the deleted post-rename-reopen failure mode drifting back. Design recorded in docs/misc/REDEX_MANIFEST_POINTER_DESIGN.md.
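
A sketch of the flip sequence, with std::fs::rename standing in for the durable_rename helper (which adds MOVEFILE_WRITE_THROUGH on Windows); the generation naming matches the layout above, while the manifest encoding, directory fsyncs, and error handling are simplified.

    use std::fs::{self, File};
    use std::io::Write;
    use std::path::Path;

    fn flip_generation(channel: &Path, new_gen: u64, manifest_bytes: &[u8]) -> std::io::Result<()> {
        // 1. The new generation's {idx,dat,ts} files are assumed already written and
        //    fsync'd under <channel>/v<NNNNNNNNNN>/ by the compaction itself.
        let _gen_dir = channel.join(format!("v{new_gen:010}"));

        // 2. Stage the new manifest next to the live one and fsync its contents.
        let tmp = channel.join("manifest.tmp");
        let mut f = File::create(&tmp)?;
        f.write_all(manifest_bytes)?;
        f.sync_all()?;

        // 3. The single linearizing event: swap the pointer. Before this rename,
        //    recovery follows the old manifest; after it, the new one.
        fs::rename(&tmp, channel.join("manifest"))?;

        // 4. Post-flip directory fsync is best-effort; a failure is logged and
        //    swallowed in the real code, and the orphan-generation sweep on the
        //    next open converges either way.
        if let Ok(dir) = File::open(channel) {
            let _ = dir.sync_all();
        }
        Ok(())
    }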

Compute registry quiescence

  • In-flight Arc<Mutex<DaemonHost>> callers mutated through swap and unregister — replace and unregister rotated the registry's Arc slot but a concurrent caller that had already cloned the prior Arc out of the map (between get_arc and arc.lock()) would land its mutation on the now-orphaned host. The replacement was correct from the registry's point of view but the orphaned host had already been removed from delivery routing, so writes to it disappeared into nothing. Introduced a guard_identity(origin_hash, &held_arc) helper that runs after arc.lock() and re-checks Arc::ptr_eq against the current registry slot. On mismatch the helper surfaces a typed DaemonError::Stale(u32) and the caller bails before mutating; the new variant lets callers distinguish "I lost the swap race" from "the daemon was never registered" without inspecting registry state.
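
A shape sketch of the post-lock identity re-check, using stub types; only the Arc::ptr_eq comparison and the Stale variant come from the fix description, and the registry lookup itself is elided.

    use std::sync::{Arc, Mutex};

    struct DaemonHost; // stub
    enum DaemonError { Stale(u32) }

    // After arc.lock(), confirm the Arc we cloned out of the registry is still the
    // Arc the registry routes deliveries to. current_slot is whatever the registry
    // holds for this origin right now (None if it was unregistered).
    fn guard_identity(
        origin: u32,
        held: &Arc<Mutex<DaemonHost>>,
        current_slot: Option<&Arc<Mutex<DaemonHost>>>,
    ) -> Result<(), DaemonError> {
        match current_slot {
            Some(current) if Arc::ptr_eq(held, current) => Ok(()),
            // replace or unregister won the race: the held host is orphaned and no
            // longer in delivery routing, so bail instead of mutating it.
            _ => Err(DaemonError::Stale(origin)),
        }
    }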

FFI handle lifetime — cortex, mesh, identity, redis-dedup

A foreign caller (Go cgo, Python threads, Node.js workers) racing a _free against an active op against the same handle could (a) UAF on the inner Arc after _free did Box::from_raw → drop, or (b) UAF on the outer handle box itself even when the inner was held alive via an Arc<Inner> clone. The shape was filed as three separate audit items because three separate handle families exhibited it; the underlying race is one race.

  • Shared ffi::handle_guard::HandleGuard extracted with try_enter() -> Option<HandleOp<'_>> and begin_free(deadline) -> bool. Packed atomics (freeing: AtomicBool, active_ops: AtomicU32); SeqCst-ordered Dekker-style "set freeing, check active_ops" handshake; per-handle FFI_HANDLE_FREE_DEADLINE: Duration = 5s. Soundness rule: the handle box is never deallocated once handed to C — _free takes the inner out via ManuallyDrop::take only after begin_free returns true, and the outer Box (carrying HandleGuard's atomics) is intentionally leaked. Concurrent ops doing try_enter after free safely fetch_add on still-valid memory, observe freeing=true, decrement, and bail. (A minimal sketch follows this list.)
  • All 11 cortex/mesh/identity/redis-dedup handle types ported. RedexHandle, RedexFileHandle, RedexTailHandle, TasksAdapterHandle, TasksWatchHandle, MemoriesAdapterHandle, MemoriesWatchHandle (cortex side); MeshNodeHandle, MeshStreamHandle (mesh side, including the Arc::ptr_eq UAF in handles_match that audit #25 specifically called out); IdentityHandle, RedisDedupHandle. Every entry point gates on try_enter; every _free drives begin_free. _free is idempotent — a second/concurrent _free caller observes the lost CAS, returns false, and bails before the double-take that would UAF the inner allocation.
  • Per-handle regression coverage. Three pinned tests per handle: post-_free op returns ShuttingDown, _free is idempotent under concurrent callers, _free waits for an in-flight op to drain (or times out and leaks rather than UAF). Plus five tests on the HandleGuard helper itself (try_enter, post-free bail, drain-wait, drain-timeout, idempotent concurrent free).
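
A minimal sketch of the try_enter / begin_free handshake, matching the names above; the real HandleGuard lives in ffi::handle_guard and the bodies here are illustrative.

    use std::sync::atomic::{AtomicBool, AtomicU32, Ordering::SeqCst};
    use std::time::{Duration, Instant};

    pub struct HandleGuard {
        freeing: AtomicBool,
        active_ops: AtomicU32,
    }

    pub struct HandleOp<'a>(&'a HandleGuard);

    impl HandleGuard {
        /// Register an op; fails once a free has begun.
        pub fn try_enter(&self) -> Option<HandleOp<'_>> {
            self.active_ops.fetch_add(1, SeqCst);
            if self.freeing.load(SeqCst) {
                self.active_ops.fetch_sub(1, SeqCst); // undo and bail: free already started
                return None;
            }
            Some(HandleOp(self))
        }

        /// Dekker-style: set freeing, then wait for active ops to drain (or time out).
        /// Returns true only for the single caller allowed to take the inner value.
        pub fn begin_free(&self, deadline: Duration) -> bool {
            if self.freeing.swap(true, SeqCst) {
                return false; // a concurrent _free already won; do not take the inner
            }
            let start = Instant::now();
            while self.active_ops.load(SeqCst) != 0 {
                if start.elapsed() > deadline {
                    return false; // leak rather than UAF
                }
                std::thread::yield_now();
            }
            true
        }
    }

    impl Drop for HandleOp<'_> {
        fn drop(&mut self) {
            self.0.active_ops.fetch_sub(1, SeqCst);
        }
    }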

Identity & envelope

  • IdentityEnvelope wire format gains a 1-byte version prefix. Pre-fix the AEAD open() path tried v1, and on failure retried v0 — the documented rolling-upgrade fallback. The new layout puts a single IDENTITY_ENVELOPE_VERSION = 1 byte at offset 0; readers reject any other byte via EnvelopeError::UnknownVersion and skip the AEAD attempt entirely. The CPU-DoS amplification framing in the original audit was overstated (the ed25519 signature check fails fast on random ciphertext before either AEAD attempt fires; only legitimate-but-replayed v0 envelopes ever reached the second AEAD), but the structural improvement of "version byte at offset 0, deterministic dispatch, no v0 fallback at all" closes the gap with one extra byte. IDENTITY_ENVELOPE_SIZE 208 → 209; SNAPSHOT_VERSION 1 → 2.
  • origin_hash widened from u32 to u64 across the application layer. Pre-fix EntityKeypair::origin_hash() returned a 32-bit BLAKE2s projection; with ~65 K distinct daemon identities the birthday probability of two daemons aliasing the same origin_hash crossed 50 %, and cross-channel accounting keyed by origin_hash silently conflated them. Now widened to 64 bits at the application layer (EntityKeypair, EntityId, OriginStamp, CausalLink, EventMeta, ContinuityProof, ForkRecord, DaemonRegistry, daemon_factory, the SDK's public surface). The per-packet NetHeader.origin_hash deliberately stays u32 — that field is the routing fast path's pre-AEAD filter and width matters for cache-line packing; the with_origin(u64) setter downcasts to the routing-side projection. Wire-format constants: CAUSAL_LINK_SIZE 28 → 32, EVENT_META_SIZE 20 → 24, CONTINUITY_PROOF_SIZE 36 → 40.
  • The widening cascade flowed through the SDK, the Node binding (u64 → JS bigint, matching the existing node_id convention), the Python binding (pyo3 maps u64 to native int transparently), and the Go binding (uint32_t → uint64_t in include/net.go.h).

Compute orchestrator & merge

  • on_replay_complete synthesized target_head with parent_hash: 0 — downstream verifiers couldn't reconcile a chain head whose parent was the literal zero hash; reconciliation surfaced Forked against legitimate replay-completion messages. Now queries daemon_registry.with_host(...) for the real chain head and stamps the actual parent hash. The audit's separate report against consumer/merge.rs:384 (per-shard cap rolling the cursor backward on unclamped_per_shard > PER_SHARD_FETCH_CAP) was re-triaged as obsolete: the current code already advances the cursor to the last fetched event id; the audit was reading a prior revision. Pinned by a new regression test (poll_merger_does_not_stall_on_single_shard_filter_under_cap).

Mesh transport — mesh.rs deep-read audit

The 9 items the v0.10 release note flagged "queued for the next release" all land here.

  • spawn_heartbeat_loop held a DashMap shard guard across .await — the heartbeat broadcast loop iterated peers.iter() and awaited socket.send_to(...) (heartbeat then pingwave, twice per peer) while still holding the iterator's Ref. Every other task touching the same shard blocked for the cumulative round-trip. The loop now snapshots (node_id, addr, Arc<NetSession>) tuples into a Vec first and awaits with no iterator guard alive. (See the sketch after this list.)
  • accept / start mutual exclusion used AcqRel where the comment relied on SeqCst — the doc-comment argued correctness from "the SeqCst total order on these two atomics," but the accept_in_flight.fetch_add(1, AcqRel) and the matching fetch_sub in AcceptGuard::drop were not part of the SC total order. On x86 the LOCK'd RMW happened to fully fence so the race was unobservable; on AArch64 / RISC-V the dispatcher could race handshake_responder for the inbound msg1. Both increments now SeqCst.
  • Routed-handshake key rotation silently overwrote a live session — the replay guard only fired for the same remote_static_pub; a routed msg1 with a different static for the same peer_node_id fell through and peers.insert overwrote the existing legitimate session. The legitimate peer's subsequent AEAD packets (encrypted under the old session key) failed to verify and were silently dropped. The trusted-PSK threat model rationalised this only if PSK compromise was treated as "any node can DoS any other node's sessions" — which contradicted the rest of the auth surface (entity-ID TOFU pinning, signed capability announcements). Rotation is now refused while the existing session is still within its idle / heartbeat window.
  • handle_routed_handshake peers.getpeers.insert was not atomic — two concurrent routed handshakes for the same peer_node_id (e.g. a flaky peer retrying under a fresh ephemeral) could both pass the replay-guard existing.remote_static_pub check and race the insert; the loser's pending_handshakes initiator state stayed armed waiting for a msg2 now bound to the winner's session, until handshake_timeout fired. Decision and insert now hold a single peers.entry(peer_node_id) write guard.
  • commit_reclassify_observations torn (nat_class, reflex_addr) snapshot — when every probe failed, latest_reflex == None. The code still updated nat_class (typically to Unknown) but left reflex_addr at its previous value; subsequent announce_capabilities_with reads under traversal_publish_mu saw (fresh class, stale reflex). The whole traversal_publish_mu invariant was silently violated on this branch. reflex_addr is now reset to None when latest_reflex is None, keeping the pair coherent.
  • authorize_subscribe rejected idempotent re-subscribes with TooManyChannels — a peer at the channel cap that retransmitted/re-subscribed to a channel it already held was rejected even though SubscriberRoster is set-typed and the operation is a no-op. Now short-circuits (true, None) when the roster already contains the channel, before the cap-check fires.
  • publish_to_peer did not propagate the reliable flag to the packet header — every other sender (send_to_peer, send_routed, send_on_stream, mod.rs:1016/1063) computed if reliable { PacketFlags::RELIABLE } else { PacketFlags::NONE } and threaded it into the packet builder; publish_to_peer hard-coded PacketFlags::NONE and only fed reliable into open_stream_with. Latent today (the dispatch path doesn't yet inspect flags.is_reliable()) but the per-call-site inconsistency would silently bite when a receiver-side path consults the packet flag — proxy.rs / route.rs / router.rs already inspect is_priority / is_control. Same ternary as the other senders now applied.
  • process_local_packet migration loopback unbounded synchronous self-bounce — the in-place pending: VecDeque kept draining as long as the handler emitted self-bound follow-ups. A buggy or attacker-influenced trusted handler that always emitted a self-bound message would spin the dispatch task synchronously, starving every other peer's packets. Now caps loopback depth (tracing::warn! past it).
  • connect_via did not refresh addr_to_node after a successful direct upgrade — after connect_direct → connect_via(peer_reflex, …) succeeded, the upgraded session's dispatch fast path missed on peer_reflex and fell back to a linear peers.iter().find(|e| session_id == ...) per packet. Performance only, but it defeated the addr → nid index for exactly the sessions that benefit most from it. The connect_direct Ok path now inserts the (peer_reflex, peer_node_id) mapping; the relayed-session note in connect_via itself is unchanged (the upgrade is a separate caller).
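
The heartbeat fix's shape (the first item in this list), sketched with a stub session type; the real loop also sends the pingwave and writes through the node's socket.

    use std::net::SocketAddr;
    use std::sync::Arc;
    use dashmap::DashMap;

    struct NetSession; // stub

    async fn send_heartbeat(_node_id: u64, _addr: SocketAddr, _session: &NetSession) {
        // socket.send_to(...) in the real code
    }

    // Snapshot first, await after: no DashMap shard guard is held across .await, so
    // other tasks touching the same shard never block on the cumulative round-trips.
    async fn heartbeat_tick(peers: &DashMap<u64, (SocketAddr, Arc<NetSession>)>) {
        let snapshot: Vec<(u64, SocketAddr, Arc<NetSession>)> = peers
            .iter()
            .map(|e| (*e.key(), e.value().0, Arc::clone(&e.value().1)))
            .collect();
        for (node_id, addr, session) in snapshot {
            send_heartbeat(node_id, addr, &session).await;
        }
    }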

Behavior / safety / rate limiting

  • per_source.clear() minute-boundary RPM cap exceedance — the periodic sweep cleared the per-source rate-bucket map at the minute boundary, which momentarily zeroed every active source's count and let the next 60 seconds of traffic through unmetered before the budget gate observed it again. Replaced with a packed-atomic RateBucket carrying (window_floor: u32, count: u32) in a single AtomicU64; CAS-based atomic reset on window rollover, no clear-and-reinsert race, no stale-count window. gc_per_source_stale now sweeps stale entries based on observed window age rather than stomping the live state. try_acquire computes its Ok value from the CAS prev, not a racy reload — avoids a second lost-update window.
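
A self-contained sketch of the packed bucket; the field packing and constants are illustrative, while the CAS-based rollover and "decide from the CAS value, not a reload" points come from the fix above.

    use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

    // (window_floor, count) packed into one AtomicU64: high 32 bits = minute floor,
    // low 32 bits = count. Window rollover is a single CAS, never clear-and-reinsert.
    pub struct RateBucket {
        state: AtomicU64,
    }

    impl RateBucket {
        fn pack(window_floor: u32, count: u32) -> u64 {
            ((window_floor as u64) << 32) | count as u64
        }

        /// Returns true if the request fits this window's budget.
        pub fn try_acquire(&self, now_minute: u32, rpm_cap: u32) -> bool {
            let mut cur = self.state.load(SeqCst);
            loop {
                let (floor, count) = ((cur >> 32) as u32, cur as u32);
                let next = if floor == now_minute {
                    if count >= rpm_cap {
                        return false; // budget exhausted for this window
                    }
                    Self::pack(floor, count + 1)
                } else {
                    Self::pack(now_minute, 1) // rollover: atomic reset via the CAS below
                };
                match self.state.compare_exchange_weak(cur, next, SeqCst, SeqCst) {
                    Ok(_) => return true, // decided from the value we actually installed
                    Err(observed) => cur = observed,
                }
            }
        }
    }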

Cluster F triage (lower-severity items)

  • #81 adapter/redis.rs pipeline timeout duplicate hazard — config-deployment-shape issue; closed with a one-time-per-process tracing::warn! from RedisAdapter::init pointing at net_sdk::RedisStreamDedup so misconfigured deployments are surfaced at boot rather than as silent duplicate publishes under retry.
  • #125 behavior/safety.rs per-source RPM cap — closed via the packed-atomic RateBucket rework above.
  • #127 initiator handshake HandshakePacer — re-triaged as obsolete; the structural fix (per-(peer, us) in-flight handshake registry) is a separate refactor and the existing per-call timeout already bounds the worst case to a known floor.
  • #128 router.rs notify_one + permit-stash soundness — re-triaged as obsolete; the notify-with-stashed-permit pattern is sound vs notify_waiters for this use case (all waiters drain at most-once, no lost-wakeup window). Documented in-line so the design rationale survives the next reader.
  • #73 consumer/merge.rs per-shard cap rolling cursor backward — re-triaged as obsolete; current code advances. Pinned by poll_merger_does_not_stall_on_single_shard_filter_under_cap.
  • #118 behavior/rules.rs rate-limit reset semantics — re-triaged as obsolete; the current reset to 1 is the correct semantic (the audit's reset to 0 would allow max+1 firings per window).
  • #121 behavior/loadbalance.rs P2C with len == 2 — re-triaged as obsolete; the degenerate case IS the P2C algorithm with 2 inputs.

Test hygiene

  • HandleGuard race injection — five tests on the helper module: try_enter, post-free bail, drain-wait, drain-timeout, idempotent concurrent free. Three pinned tests per ported handle (post-free ShuttingDown, idempotent _free, _free waits for in-flight op).
  • Cortex applied_through_seq strict-prefix — five regression tests pinning the watermark advances only on Ok(())-and-immediate-successor; snapshot reflects the strict-prefix value; restore re-attempts the previously skipped event (so post-restore state matches what fold committed, not what fold attempted).
  • compute_checksum_with_meta v2 coverage — pins that v2 detects bit-flips in dispatch, flags, origin_hash, seq_or_ts; pins that v1 fallback still accepts pre-fix on-disk records; pins that v1 and v2 of the same input differ for typical tails (so the fold-side fallback can't accidentally accept a v2 record by numerical coincidence).
  • DaemonRegistry::Stale quiescing — five regression tests pinning that an in-flight mutator holding a now-orphan Arc surfaces DaemonError::Stale(u32) instead of mutating; that replace and unregister both trip the check; that the surviving in-flight Arc and the fresh registration don't produce two parallel writers.
  • durable_rename Windows behavior — three regression tests pinning the MoveFileExW(MOVEFILE_WRITE_THROUGH) path on Windows and the POSIX fast-path passthrough.
  • Identity envelope version-byte rejection — pins that envelopes with any leading byte other than IDENTITY_ENVELOPE_VERSION = 1 surface EnvelopeError::UnknownVersion and never reach the AEAD path.
  • Mesh-audit regression coverage — the heartbeat snapshot, accept/start SeqCst, routed-handshake atomic entry, NAT class/reflex coherence, idempotent re-subscribe, reliable flag propagation, loopback depth cap, and addr_to_node direct-upgrade refresh each carry a pinned regression test in tests/mesh_audit.rs.
  • JetStream msg-id sequence_start per-shard monotonicity — pins that within one bus instance, every shard's batches advance their sequence_start strictly monotonically AND gap-free (seq_start[n+1] == seq_start[n] + len(events[n])). A regression that introduced a gap would let (process_nonce, shard, seq, i) tuples be reused after the JetStream / Redis dedup window closes; an overlap would silently overlay a later batch on an earlier one's slot. Pinned by bus::tests::sequence_start_is_per_shard_monotonic_and_gap_free. The cross-restart variant (persistent next_sequence across process boots) remains feature-shaped and is not in this release; today's invariant relies on process_nonce rotating to disjoin the msg-id namespace.
  • Manifest-pointer crash-injection — 12 regression tests covering manifest codec round-trip + corruption rejection, brand-new-channel init, flat-layout migration, fallback when manifest is missing or torn, sweep of orphan newer / older generation directories, generation advancement + manifest atomicity, and recovery convergence in one open. Maps onto the 10-row crash-injection table in docs/misc/REDEX_MANIFEST_POINTER_DESIGN.md.

Triage decisions recorded in code

One audit item resolved as "no code change needed, but the rationale must live in code so a future contributor doesn't re-open the question":

  • apply_authoritative_grant clamp ordering — the audit recommended reordering the tx_bytes_sent bump and the tx_credit_remaining decrement. The current form uses a CAS-with-delta against max_consumed_seen and adds the delta to tx_credit_remaining via fetch_update; this composes atomically with the CAS in try_acquire_tx_credit and the fetch_update in refund_tx_credit. The audit's reorder presumed a .store()-based recompute from a racy snapshot of tx_bytes_sent — a shape the current code deliberately avoids. The rationale is documented in code at adapter/net/session.rs::apply_authoritative_grant and the codec-side abstract at adapter/net/subprotocol/stream_window.rs::StreamWindow.

Breaking changes

Wire format (v0.10 ↔ v0.11 do not interop)

This is the consequential upgrade. Three structural format changes land together; the wire-format pair are NOT backwards-compatible across the wire (v0.10 ↔ v0.11 do not interop), and the RedEX on-disk layout migrates automatically on first open per channel.

IdentityEnvelope v0 → v1 (208 B → 209 B)

IdentityEnvelope::to_bytes now writes a leading IDENTITY_ENVELOPE_VERSION = 1 byte; from_bytes rejects any other leading byte via EnvelopeError::UnknownVersion. The v0 fallback in open() is removed entirely. IDENTITY_ENVELOPE_SIZE is 1 + 32 + 80 + 32 + 64 = 209.

SNAPSHOT_VERSION bumps 1 → 2 because the snapshot wire format embeds the envelope at fixed offsets and the version byte shifts every subsequent field. v0.10's from_bytes_v0 is removed; from_bytes_v1 was renamed to from_bytes_v2.

Impact: v0.10 → v0.11 must upgrade in lockstep. A v0.10 sender to a v0.11 receiver will get UnknownVersion on every envelope; a v0.11 sender to a v0.10 receiver will fail signature verification because v0.10 doesn't account for the leading byte in its AAD construction.
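
A prefix-dispatch sketch using the constants above; the error shape and the body handling are illustrative, and the real parsing of the v1 fields is elided.

    const IDENTITY_ENVELOPE_VERSION: u8 = 1;
    const IDENTITY_ENVELOPE_SIZE: usize = 209; // 1 + 32 + 80 + 32 + 64

    #[derive(Debug)]
    enum EnvelopeError {
        UnknownVersion(u8),
        Truncated,
    }

    // Deterministic dispatch: reject any leading byte other than the known version
    // before any AEAD work happens; there is no v0 fallback path at all.
    fn check_envelope_prefix(bytes: &[u8]) -> Result<&[u8], EnvelopeError> {
        if bytes.len() != IDENTITY_ENVELOPE_SIZE {
            return Err(EnvelopeError::Truncated);
        }
        match bytes[0] {
            IDENTITY_ENVELOPE_VERSION => Ok(&bytes[1..]), // the v1 fields at fixed offsets
            other => Err(EnvelopeError::UnknownVersion(other)),
        }
    }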

origin_hash widening: u32 → u64

EntityKeypair::origin_hash(), EntityId::origin_hash(), and OriginStamp::origin_hash() now return u64 (the full 8-byte BLAKE2s value, not a 4-byte truncation). The struct fields CausalLink.origin_hash, EventMeta.origin_hash, ContinuityProof.origin_hash, and ForkRecord.origin_hash widen accordingly. The wire-format constants:

Type Old size New size
CAUSAL_LINK_SIZE 28 32
EVENT_META_SIZE 20 24
CONTINUITY_PROOF_SIZE 36 40

NetHeader.origin_hash deliberately stays u32. That field is the per-packet routing fast path's pre-AEAD filter and width matters for cache-line packing. The setter with_origin(u64) downcasts to the routing-side projection (as u32); the OriginStamp::origin_hash() doc explicitly notes this convention.

The DaemonRegistry's public surface (register, unregister, snapshot, deliver, with_host, stats, contains) and the daemon_factory::FactoryEntry map are keyed by u64. All SDK methods that take or return an origin_hash (DaemonRuntime::stop, snapshot, deliver, migration_phase, peek_migration_failure, inject_migration_failure, subscriptions, expect_migration, start_migration, etc.) take/return u64. The DaemonHandle.origin_hash, MigrationHandle.origin_hash, and CausalEvent.origin_hash fields widen accordingly.

Impact: on-disk RedEX files written by v0.10 cannot be read by v0.11's cortex adapters — the meta header layout shifts. Re-tail from the source of truth (the bus / publisher) on upgrade. The cortex per-event checksum's v1 fallback path keeps reading legacy checksums, but the meta-size shift means the byte slicing itself differs.

Cortex per-event checksum v1 → v2

Producers stamp compute_checksum_with_meta(&meta, tail) (header-covering). Readers try v2 first and fall back to v1 (compute_checksum(tail)) so pre-v0.11 records remain readable. New writes are v2-only. Downgrading to a pre-v0.11 binary will skip every event written by a v0.11 producer — the migration is one-way.

RedEX on-disk layout: flat → manifest-pointer + generation directories

Each channel's <base>/<channel>/{idx,dat,ts} files now live one level deeper at <base>/<channel>/v0000000001/{idx,dat,ts}, alongside a single <base>/<channel>/manifest pointer file (16 bytes) that names the live generation. Compactions roll the live generation by writing a fresh v<N+1>/ directory and atomically swapping the manifest.

Migration is automatic and transparent. On first open, a v0.10 / v0.11 channel with the flat layout is migrated by renaming each of {idx,dat,ts} into v0000000001/, then writing a manifest pointing at it. The migration is one-shot per channel and idempotent; failure mid-migration leaves the per-file moves in whichever state they reached and the next open re-runs the migration.

Tools that read RedEX files directly (rare; the supported access path is the RedexFile API) need to read the manifest first and follow it to the live generation directory. The 16-byte manifest format is documented in docs/misc/REDEX_MANIFEST_POINTER_DESIGN.md.

Rust core (net crate) — API surface

  • origin_hash types widen to u64 at every public API point listed above. The as u32 downcast at the routing-fast-path boundary (NetHeader::with_origin) is the only place in the new code where the projection survives.
  • DaemonError::Stale(u32) is a new variant. Match arms over DaemonError need to add it; #[non_exhaustive] was already in place so this is forward-compatible, but exhaustive match-on-variant code refuses to compile.
  • compute_checksum_with_meta(meta: &EventMeta, tail: &[u8]) -> u32 is a new public function. compute_checksum(tail: &[u8]) -> u32 remains and is now described as the v1 fallback used only on the read side; new writers must use compute_checksum_with_meta. Both are re-exported from adapter::net::cortex.
  • IDENTITY_ENVELOPE_VERSION: u8 = 1 is a new public constant re-exported from adapter::net::identity. Pin against this instead of literal 1 so a future bump auto-propagates.
  • CortexAdapter splits the watermark. applied_through_seq is the new strict-prefix watermark used by snapshot(); folded_through_seq is the live-progress watermark used by wait_for_seq. Existing snapshot consumers that read last_seq get the strict-prefix value automatically; tests asserting that wait_for_seq(seq) implied state was applied for seq need to be re-read against the new semantic (wait_for_seq indicates fold attempted; restore re-attempts skipped events).
  • HandleGuard is a new public module under ffi::handle_guard (pub mod handle_guard). Custom FFI wrappers built against the crate (rare — most consumers use the bundled bindings) need to embed HandleGuard and route every entry point through try_enter / begin_free to keep the same memory-safety guarantees the bundled bindings now have.

Rust SDK (net-sdk)

  • All origin_hash parameters and fields widen to u64. Identity::origin_hash() -> u64. DaemonHandle.origin_hash: u64. MigrationHandle.origin_hash: u64. Closures move |origin_hash: u64| in PostRestoreCallback, PreCleanupCallback, MigrationFailureCallback. DaemonRuntime::stop, snapshot, deliver, migration_phase, peek_migration_failure, inject_migration_failure, subscriptions, subscribe_channel, unsubscribe_channel, expect_migration, start_migration, start_migration_with. The groups/{fork,replica,standby} surface widens parent_origin / active_origin / route_event return types. group_id in groups/replica deliberately stays u32 — that's a group_seed hash, distinct from origin_hash.
  • The brute-force u32 collision fixture in compute_runtime.rs (spawn_from_snapshot_checks_full_entity_id_not_just_origin_hash) searches for a collision on the as u32 projection rather than the full u64 — the SDK's identity-mismatch guard fires on the routing-side u32 collision, so the test's intent (entity_id check, not origin_hash check) is preserved at the original ~2^16 birthday-bound runtime.

FFI / bindings

Binding Change
All Every FFI handle type (cortex, mesh, identity, redis-dedup) now embeds HandleGuard. _free is idempotent across all 11 types; entry points after _free return typed ShuttingDown instead of segfaulting. Behavior change for callers that depended on _free being one-shot or used double-free as a way to detect prior frees — those patterns now silently succeed where they previously crashed.
All EntityKeypair::origin_hash() and friends return u64. The bundled bindings handle the marshalling per-language; consumers that called these APIs via raw FFI need to widen the receiving type.
C (include/net.go.h) net_identity_origin_hash, net_compute_daemon_handle_origin_hash, net_compute_migration_handle_origin_hash, every net_compute_* function with an origin_hash parameter, all replica/fork/standby out-params, and the cortex net_tasks_adapter_open / net_memories_adapter_open origin_hash parameters are now uint64_t. C consumers must widen their typed pointers.
Node (@net/sdk) The TypeScript surface declares originHash: bigint (matching the existing nodeId: bigint convention). Existing callers using JS Number literals must switch to BigInt literals (0xabcdef01n) or wrap with BigInt(value). The auto-generated index.d.ts reflects the new types.
Python (net-py) Python int is arbitrary precision; the surface is unchanged for callers (PyO3 marshals u64int transparently). One pytest fixture literal was extended from 0xdead_beef to 0xdead_beef_dead_beef to actually exercise the upper 32 bits.
Go (compute-ffi) All origin_hash parameters and out-params are uint64_t in the cgo header; Go callers must use uint64 typed locals where they previously used uint32.

Behavioral fixes that may surface as test breakage

These aren't strictly API-breaking but tests that asserted the pre-fix behavior will need updating:

  • Cortex snapshot last_seq reflects applied_through_seq, not folded_through_seq — tests that asserted snapshots include sequence numbers for skipped events will fail. The strict-prefix semantic is the correct one; the assertion was reading the bug.
  • Cortex restore re-attempts the previously-skipped event — tests that asserted state was preserved verbatim across snapshot+restore (treating the skip as a permanent state change) will see the post-restore state include the re-attempted event. The asymmetric trade-off is documented on snapshot()'s rustdoc.
  • DaemonRegistry::replace / unregister followed by an in-flight mutator returns DaemonError::Stale(u32) — tests that asserted the mutation landed on the orphan host will see the typed error instead.
  • FFI _free is idempotent and returns success on second-call — tests that asserted second-call returned an error code will see success.
  • FFI entry points after _free return ShuttingDown — tests that asserted post-free behavior was undefined / panicked will see the typed error.
  • Per-event cortex checksum is the v2 header-covering hash — tests asserting meta.checksum == compute_checksum(tail) (v1) will fail; switch to compute_checksum_with_meta(&meta, tail). Two pinned tests under tests/integration_cortex_{tasks,memories}.rs already had this issue and were updated.
  • IdentityEnvelope::open rejects v0 envelopes outright — tests that asserted the v0 fallback path engaged will fail. The open_accepts_v0_envelope_for_rolling_upgrade_compat fixture from v0.10 has been removed (it explicitly pinned the now-removed fallback); the new equivalent pins EnvelopeError::UnknownVersion on a leading-byte mismatch.
  • Mesh accept / start use SeqCst on accept_in_flight — tests on AArch64 / RISC-V hardware that relied on the pre-fix race window to construct concurrent-accept-and-start state will see the documented mutual exclusion.
  • Mesh routed-handshake refuses key rotation while a session is live — tests that asserted the silent overwrite (e.g. simulating a Sybil swap-in via routed msg1) will see the rotation refused.
  • authorize_subscribe short-circuits idempotent re-subscribes ahead of the cap-check — tests that asserted at-cap re-subscribe surfaced TooManyChannels will see success instead.
  • RedEX poisoning error strings now reference "partial-write rollback could not restore on-disk state to match in-memory" — log alerting / string assertions that matched the prior "compact_to post-rename reopen failure" parenthetical (which described a setter the manifest-pointer rework deleted) need updating. The poisoning condition itself is unchanged: only the partial-write rollback paths set the flag, and the error wording now accurately names them.
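
The assertion-site change for the v2 checksum is the one-line fix referenced from the checksum item above. A sketch, with meta and tail being whatever the existing test already has in hand:

// Before (v1: checksum covers the tail only)
assert_eq!(meta.checksum, compute_checksum(tail));

// After (v2: checksum covers the header as well)
assert_eq!(meta.checksum, compute_checksum_with_meta(&meta, tail));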

How to upgrade

  1. Coordinate the upgrade across all peers in a deployment. v0.10 and v0.11 do not interop on the wire — the envelope version byte and the EventMeta size both changed. Stand the new version up across the fleet in one window rather than rolling upgrades.
  2. Re-tail from your source of truth (bus / publisher) for any RedEX channels carrying state you need to retain. v0.10's on-disk EventMeta layout (origin_hash at bytes [4..8], seq_or_ts at [8..16], checksum at [16..20]) does not match v0.11's (origin_hash at [4..12], seq_or_ts at [12..20], checksum at [20..24]). The cortex per-event checksum's v1 fallback path reads checksums from pre-v0.11 records, but the meta-size shift means the byte slicing itself is different.
  3. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.11 line. Recompile. The Rust signature changes (u32 → u64 on origin_hash, DaemonError::Stale variant, applied_through_seq watermark) will surface as compile errors at the exact call sites that need updating.
  4. JS / TypeScript callers: switch originHash literals to BigInt. 0xabcdef01 → 0xabcdef01n. The TypeScript surface declares originHash: bigint; existing call sites using Number will fail at runtime against the new declarations.
  5. Go callers: widen uint32 locals to uint64 for every origin_hash parameter, return value, or struct field. The cgo header (include/net.go.h) reflects the new ABI.
  6. Python callers need no source changes — int is arbitrary precision and PyO3 handles the marshalling transparently. Re-test fixtures that round-trip an origin_hash through external storage (databases, message queues) to confirm the upper 32 bits are preserved.
  7. C callers: widen uint32_t typed pointers to uint64_t for every origin_hash parameter and out-param. Anyone hand-rolling against include/net.go.h must regenerate their bindings.
  8. If your tests covered any of the items in Behavioral fixes that may surface as test breakage, update the assertions. The cortex applied_through_seq semantic and the v2 checksum migration each have a one-line fix at the assertion site; the v0 envelope removal requires deleting the fixture entirely.
  9. RedEX on-disk layout has changed. Each channel now stores its files under <channel>/v0000000001/{idx,dat,ts} plus a 16-byte <channel>/manifest pointer file, replacing the flat <channel>/{idx,dat,ts} layout. The migration runs automatically on first open of a v0.10 / v0.11 channel (one-shot, idempotent) — no code change required from callers. Tools or scripts that read RedEX files directly (rare; the supported access path is the RedexFile API) need to follow the manifest to the live generation directory.
  10. If you embed FFI handles in a custom Rust wrapper (rare), embed HandleGuard from the new ffi::handle_guard module and route every entry point through try_enter / begin_free. The recipe matches the bundled handles' implementation; the helper module's tests double as documentation, and a reduced sketch of the wrapper shape follows this list.
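
For item 10, a minimal sketch of the wrapper shape. The ffi::handle_guard path is per the notes above, but the try_enter / begin_free signatures, the error codes, and MyInner are placeholders; the helper module's own tests remain the authoritative recipe.

use net::ffi::handle_guard::HandleGuard; // path assumed; adjust to your crate layout

pub struct MyInner; // placeholder for the wrapped state

pub struct MyHandle {
    guard: HandleGuard,
    inner: MyInner,
}

#[no_mangle]
pub extern "C" fn my_handle_do_work(h: *mut MyHandle) -> i32 {
    let Some(h) = (unsafe { h.as_ref() }) else { return -1; };    // null guard
    let Some(_in_call) = h.guard.try_enter() else { return -2; }; // typed ShuttingDown after _free
    // ... operate on h.inner while the entry token is held ...
    0
}

#[no_mangle]
pub extern "C" fn my_handle_free(h: *mut MyHandle) {
    let Some(h) = (unsafe { h.as_ref() }) else { return; };
    if h.guard.begin_free() {
        // first caller wins the teardown; later frees and later entry points are no-ops / ShuttingDown
    }
}
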
v0.10.0 — Codename: Hex
2026.05.03

Addressed in this release

RedEX & CortEX (storage + folded state)

  • Compact temp-file leak on reopen failure — compact_to's cleanup path ran after the post-rename open_or_poison / clone_or_poison fallibles, so a reopen failure left three placeholder files behind in /tmp forever. Cleanup now runs before the fallible reopen.
  • Truncate-on-recovery without sync_all — torn-tail repair set_len was not durable; a crash before the next write reverted the recovery and the same torn bytes were re-read. Now sync_all + fsync_dir after the truncate.
  • Best-effort rollback silently swallowed open errors — if let Ok(f) = OpenOptions::new().write(true).open(...) quietly skipped rollback when the dat/idx open failed; subsequent appends produced permanent dat/idx divergence. Now propagated as RedexError.
  • In-memory index corruption on panic between drain and renormalizesweep_retention could leave rebased base_offset against absolute payload offsets if it panicked mid-rewrite. Now builds the renormalized index in a temp Vec and atomically replaces.
  • saturating_sub(dat_base) as u32 masks heap corruption — silently wrote offset 0 for stale heap entries. Now hardened so the cast never silently squashes a real offset error.
  • next_seq rollback skipped if disk is None — currently safe path; documented and pinned by an invariant comment.
  • Stale watermark advances past unfolded events under Stop policy — recoverable_decode published folded_through_seq.store(seq) for events whose state mutation never landed; wait_for_seq(seq) returned true incorrectly. Now gated on the actual fold result.
  • Snapshot persists last_seq for skipped events — when the watermark fix above lands, snapshot() no longer emits a last_seq for events whose state was never applied; the log remains the source of truth on restore.
  • Cortex WatermarkingFold saturates app_seq at u64::MAX — a peer publishing seq_or_ts == u64::MAX could pin our app_seq; the next fetch_add(1) panicked in debug or wrapped in release, breaking per-origin monotonicity. Inputs are now capped at u64::MAX - 1.
  • Memories upsert was asymmetric and tombstone-less — existing-id STORED partial-updated, missing-id inserted with pinned: false, and a STORED → DELETED → STORED sequence resurrected the deleted entry. Now consistent and tombstone-aware.
  • Memories empty-vec filter footgun — Some(vec![]) for require_any_tag excluded everything (any over empty = false); Some(vec![]) for require_all_tags excluded nothing (all over empty = true). UI forms emitting empty multi-selects broke silently. Both empty cases now treated as "no filter."
  • Cortex/memories watch strict-bound mismatch — doc said > / <, code used >= / <=. Strict-bound consumers received boundary events. Now matches the doc.
  • StoredEvent::Serialize round-trips bytes through Value — re-encoding through serde_json::Value discarded original whitespace, normalized number formatting (1.0 → 1), and reordered keys. Any downstream that hashed or signed the serialized form silently failed verification. Now passes the raw bytes through &serde_json::value::RawValue.

Bus, shards, and dispatch

  • remove_shard_internal awaited batch worker before drain — contradicting the function's own doc comment. Drain still owned a sender clone, so a wedged adapter pinned this function indefinitely (no tokio::time::timeout shell on this path). Order swapped to drain → batch and the same timeout the rollback path uses now wraps the await.
  • add_shard_internal rollback dispatched stranded batch with stale next_sequence after worker timeout — the still-detached worker may not have published its final flush, so the rollback emitted overlapping msg-ids. Rollback now refuses to dispatch on the timeout path; the JoinHandle leak is acknowledged in the comment.
  • manual_scale_up cooldown loop invariant violated whenever cooldown > 0 — each iteration bumped last_scaling = Instant::now(); iteration 1 immediately failed InCooldown (default 30 s), leaving the first shard half-added. Operator-initiated scale-up now bypasses the auto-scaling cooldown via a dedicated scale_up_provisioning_force path.
  • Scaling monitor and manual_scale_down raced finalize_draining — non-target qualifying Draining shards were silently transitioned to Stopped, dropped on the floor by the target.contains(&shard_id) filter, and leaked. Non-target ids are now still routed through remove_shard_internal.
  • flush() Phase 2 barrier satisfied by post-flush traffic — dispatched was a running counter, not a snapshot; with asymmetric per-shard latency the inequality could be satisfied while pre-flush events were still queued. Now snapshots dispatched + dropped at flush entry and gates on the delta.
  • shutdown() deadline path double-counted in-flight events — events_dropped += in_flight_ingests then the final two-pass sweep also drained those events into events_dispatched, violating events_ingested = events_dispatched + events_dropped on every deadline-triggered shutdown. Now subtracts the events the final sweep drained.
  • Drop did not surface stranded ring-buffer events — bus dropped without await shutdown() lost ring contents but never bumped events_dropped or set shutdown_was_lossy. Operators reading post-mortem stats saw no record of the loss. Now snapshots shard_stats() in Drop.
  • PollMerger topology swap had a lost-update race — concurrent add_shard_internal / remove_shard_internal could each read shard_ids() and serialize their store(...) in the wrong order, leaving the published merger view including a removed shard until the next topology change. The shard_ids() → store block is now serialized.
  • PollMerger::poll lost cursor context on stalled pollnext_id was None when no shards made progress, even with a valid request.from_id. Callers re-fetched from zero — silent pagination regression. Now echoes back the original from_id.
  • mapper.activate active_count.fetch_add outside the held write lock — three concurrent activates could pass the budget gate against a stale count and transiently overshoot max_shards. Increment moved before drop(shards).
  • mapper.finalize_draining read pushes_since_drain_start with Relaxed — the field's docstring required Acquire to pair with the writer's SeqCst reset. Now matches.
  • JoinHandle errors silently dropped in shutdown — let _ = futures::future::join_all(drains).await; ate panicked drain workers (default Tokio doesn't log task panics). Now captured and surfaced via events_dropped.
  • shutdown_via_ref and in-flight wait loops thrashed the runtime — bare tokio::task::yield_now re-queued the task without parking; tight loops under contention starved the workers they were waiting on. Switched to short tokio::time::sleep.
  • flush() held a sync parking_lot::Mutex inside async fn — replaced with the async-safe variant.
  • JSON cursor key "00" parsed to 0 — collided with shard 0 across rebuilds. Cursor codec now treats string keys as opaque.
  • std::time::Instant mixed with tokio time in shutdown — wall-clock 5s broke tokio::time::pause()-based tests. Now consistent.
  • Drain worker mem::replace/send ordering — swapped scratch before the awaited sender.send(batch); channel-close mid-await silently dropped the batch. Documented as load-bearing under shutdown ordering and pinned by a regression test.

Atomics, timestamps, and counters

  • raw_to_nanos(raw) quanta semantics — clarified to use delta_as_nanos(0, raw) consistently.
  • TimestampGenerator::next re-reads raw inside the CAS loop — pre-fix now was read once outside the loop; on contention, retries reused the stale now and the returned timestamp drifted as last + 1 arbitrarily far behind real time. A reduced sketch of the fixed loop follows this list.
  • shard/batch.rs current_batch_size * 3 + target overflow — debug panic / release wrap on adversarial config. BatchConfig::validate now bounds max_size <= 1_000_000.
  • shard/batch.rs velocity-window Instant - Duration underflow — Windows Instant is QPC-relative-to-boot; immediately-after-boot processes aborted the batch worker. Now checked_sub.
  • f64 → usize as cast in batch — added clamp first.
  • shard/mapper.rs next_shard_id.store(first_id + count)checked_add on the bump path.
  • shard/mapper.rs overloaded_count used stale-metric placeholders for freshly-added shards — newly-active shards no longer skew the load signal until they have at least one observation window.
  • record_flush / collect_and_reset latency-sum/count desync — two independent fetch_adds vs two independent swaps let avg_flush_latency = sum.checked_div(count).unwrap_or(0) silently zero out under sustained load, suppressing the scale-up flush-latency trigger. (sum, count) now packed into a single u128 and CAS'd together. Same fix applied to push_latency_sum_ns / push_count.
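
The shape of the TimestampGenerator fix, reduced to a free-standing monotonic generator; the struct and the clock closure are illustrative, not the crate's types:

use std::sync::atomic::{AtomicU64, Ordering};

struct MonotonicTs {
    last: AtomicU64,
}

impl MonotonicTs {
    fn next(&self, read_clock_nanos: impl Fn() -> u64) -> u64 {
        let mut prev = self.last.load(Ordering::Relaxed);
        loop {
            // Re-read the clock on every retry; reusing a stale reading here is
            // exactly the pre-fix drift (contended retries kept returning prev + 1).
            let now = read_clock_nanos();
            let candidate = now.max(prev + 1);
            match self.last.compare_exchange_weak(prev, candidate, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => return candidate,
                Err(observed) => prev = observed,
            }
        }
    }
}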

Adapters (JetStream / Redis / dedup)

  • JetStream Other PublishErrorKind classified as transient — auth failures, permission denied, malformed-subject all retried forever against a backend that would never succeed. Now enumerates the truly transient variants and treats Other as fatal.
  • JetStream "pipelined" publish was actually serial — loop awaited publish_with_headers per event before moving on; only the server-ack join was parallel. 1k-event batch on a 1 ms RTT cost 1 s instead of "1 RTT per batch." Now pushes the publish-future into the join set.
  • JetStream per-event serde_json::Value allocation — violated the per-event no-alloc contract. Now mirrors Redis's RawValue borrow + Bytes::copy_from_slice.
  • JetStream one RTT per sequence in steady state — direct_get(seq) per sequence on a 1 ms RTT cost ≥100 ms wall for a 100-event poll. Now direct_batch_get.
  • JetStream cold-stream bail enabled on transient info() failure — fallback fabricated first_seq = 0, enabling the cold-stream bail; populated streams returning NotFound in deletion gaps bailed after 64 NotFounds with events still ahead. Now propagates Transient.
  • JetStream Fatal decode discarded already-decoded prefix — function returned immediately, dropping the events accumulated so far without advancing the cursor; recovery re-emitted the prefix. Now returns Ok on the good prefix and surfaces the corruption on the next poll.
  • JetStream shutdown retained self.jetstream / self.client — post-shutdown on_batch proceeded against a drained client (typically erroring, sometimes hanging). Both fields now cleared.
  • JetStream init-after-shutdown silently overwrote client without drain() — losing in-flight publishes piggybacking on the prior client. Now drains first.
  • JetStream partial-failure produced duplicate publishes — mid-batch error dropped in-flight PublishAckFutures but bytes were already on the wire; retry re-published, and Nats-Msg-Id deduped only within the dedup window. Documented and pinned; retry path now wraps publish_with_headers in tokio::time::timeout to bound the cancellation surface.
  • JetStream missing r field stored b"null" — could surprise downstream consumers expecting either present-or-absent. Now passes through unchanged.
  • Redis cluster errors classified as fatal — MOVED / ASK / READONLY / CLUSTERDOWN / NOREPLICAS were not in the substring set; after any Redis Cluster failover, every batch failed permanently until process restart. Added; a sketch of the classification follows this list.
  • Redis is_healthy PING timeout cancellation — wrapped in command_timeout, with a dedicated health-check connection so a desynced ConnectionManager doesn't serve a stale PING reply on the next real command.
  • Redis poll_shard XRANGE had no command_timeout wrapperon_batch and is_healthy honored the timeout contract; poll_shard could block indefinitely. Now wrapped.
  • Redis shutdown didn't drop self.conn — pure advisory flag; get_conn ignored initialized = false. on_batch could write to Redis silently after shutdown. Connection now dropped, get_conn errors with Fatal when the adapter has shut down.
  • RedisStreamDedup 4096-entry default was two orders of magnitude too small — at 10 K events/sec that's a 0.4 s window; the doc described "~minutes of in-flight." Default raised; capacity required at construction.
  • dedup_state startup nonce non-cryptographic — xxh3_64 of (pid, tid, ns, stack_addr, ...) narrowed entropy on 32-bit targets. Now mixes a /dev/urandom seed.
  • limit + 1 overflow (Redis & JetStream poll request shaping) on adversarial limits — saturating_add(1).
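
A sketch of the substring classification from the Redis cluster item above; the error-kind enum is illustrative and the marker set is exactly the one listed there:

enum AdapterErrorKind {
    Transient, // retry: the cluster is redirecting or failing over
    Fatal,     // surface to the caller; retrying can never succeed
}

fn classify_redis_error(msg: &str) -> AdapterErrorKind {
    const CLUSTER_TRANSIENT: &[&str] = &["MOVED", "ASK", "READONLY", "CLUSTERDOWN", "NOREPLICAS"];
    if CLUSTER_TRANSIENT.iter().any(|m| msg.contains(m)) {
        AdapterErrorKind::Transient
    } else {
        AdapterErrorKind::Fatal
    }
}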

Mesh transport, sessions, routing

  • handle_routed_handshake Case 2 — replay nuked the live session, no rate limit — Noise NKpsk0's responder uses a fresh ephemeral on each reply, deriving a brand-new session key per replay; an attacker replaying a captured msg1 replaced the legitimate session keys, the legitimate sender kept the old keys, every subsequent packet failed AEAD. Now drops the replay when the live session matches the same remote_static_pub, and the HandshakePacer from the legacy adapter has been added.
  • Pingwave strict_progress permitted address-poisoning via the hops < n.hops arm — an attacker who had observed pingwaves could spoof (origin_id=Y, seq=K, hop_count=0) for K < n.last_seq and overwrite n.addr to their UDP source. The conditions are now AND'd: pw.seq >= n.last_seq AND hops <= n.hops.
  • ThreadLocalPool per-thread cache leaked forever — every connect/disconnect/NAT-rebind/mesh-rebuild cycle leaked ~16 KB × local_capacity × num_threads. Long-lived daemons OOM'd in proportion to peer-churn count. Now Drop walks every thread's LOCAL_BUILDERS to evict its pool_id slot.
  • MAX_PACKET_POOL_SIZE = 1<<20 was OOM-on-first-session — with_local_capacity pre-allocated size × ~16 KB ≈ 16 GiB up front. The cap was meant to prevent OOM. Lowered to a few thousand; remaining budget covered by lazy-on-first-use.
  • Anti-replay window forward-jump > 1024 zeroed state instead of refusing — MAX_FORWARD = 65_536, WINDOW_SIZE = 1024; a single authenticated jump in (1025, 65_536] zeroed the bitmap and left previously-seen counters in rx_counter - 1024 .. rx_counter replayable. The slide is now refused past WINDOW_SIZE; a fresh handshake is required (a reduced sketch follows this list).
  • Anti-replay received == u64::MAX — first authenticated packet at the boundary saturated rx_counter and rejected every subsequent counter; one hostile authenticated packet could permanently poison the receive path. Now rejected at is_valid.
  • TokenScope::contains(NONE) returned true unconditionally — (self.bits & 0) == 0. Compounded with authorizes(NONE, ch) returning unconditional true, so any token authorized the no-op action; callers building action: TokenScope from external input where the input masked to NONE saw true for every token. Short-circuits at the top of contains.
  • route.rs tie-break used <= — doc said "preserved if strictly better." Now <.
  • router.rs route_packet had no source/loop suppression — TTL exhaustion was the only loop-breaker; add_route_with_metric flap or a malicious peer could set up a 2-hop loop. Now drops when routing_header.src_id == routing_table.local_id and inspects a small (src_id, stream_id, sequence) LRU.
  • router.rs RouterError::TtlExpired recheck after forward() double-counted — both record_in and record_drop ran. record_in deferred until after the post-decrement TTL check.
  • linux.rs BatchedTransport::send_batch silently truncated above 64 — len.min(MAX_BATCH_SIZE) returned ≤ 64 unconditionally; reliable streams stashed the rest via on_send and only learned via NACK/RTO. Now returns InvalidInput over the cap; chunked-internally is a follow-up.
  • linux.rs iov_base: packet.as_ptr() as *mut _ provenance laundering — sound under the kernel-reads-only invariant, but documented at the call site so a future Miri pass doesn't have to re-derive it.
  • mod.rs handshake retry sleep had no upper bound — 100 * attempt over MAX_HANDSHAKE_RETRIES = 1024 summed to ~14 hours total with the last attempt sleeping ~102 s. Capped at 5 s per attempt.
  • mod.rs handshake recv loop allocated BytesMut::with_capacity(MAX_PACKET_SIZE) per iteration — allocator pressure under stray traffic. Buffer now reused across iterations.
  • session.rs evict_idle_streams LRU vs concurrent open racemin_by_key then remove was non-atomic; a freshly-opened stream could be torn down between selection and removal. Now uses remove_if with a freshness predicate.
  • session.rs verify_and_touch_heartbeat did not pre-check parsed.payload.len() == TAG_SIZE — AEAD caught the mismatch but a length check shortcuts cleartext-flood probes before they touch the cipher.
  • session.rs RxCreditState::on_bytes_consumed consumed/granted not jointly atomic — concurrent calls could publish consumed > granted transiently; observability/metrics showed flicker. Now packed u128 AcqRel CAS.
  • route.rs capability-announcement hop_count += 1 — every other hop-count increment in the crate uses saturating_add(1); this one was bounded today by the < MAX_CAPABILITY_HOPS - 1 = 15 guard but one constant change from a debug panic. Now matches the rest.
  • Static-mode select_shard_by_hash used raw modulo — dynamic-mode was already on Lemire's unbiased (hash * len) >> 64. Same bias, same fix; both paths now consistent.
  • gateway.rs ParentVisible over-permissive direction — predicate accepted both dest.is_ancestor_of(source) and source.is_ancestor_of(dest); the second clause leaked parent-region traffic down into descendants. Now strictly upward.
  • pool.rs (payload.len() - 16) as u16 truncation — currently safe under MAX_PAYLOAD_SIZE = 8112; debug_assert! added so a future cap-raise past u16::MAX + 16 doesn't silently mis-frame on the wire.
  • failure.rs unwrap() on poisoned std::sync::Mutex — the rest of the crate uses unwrap_or_else(|p| p.into_inner()); a single panic anywhere holding these locks would have turned every subsequent unwrap into a runtime panic that took down the failure-detection loop. Switched.
  • failure.rs RecoveryManager::on_failure overwrote FailedNodeState on insertfailed_at and retry_count reset to 0 each time; flapping peers never hit max_retries. Now entry().or_insert(...) and bumps retry_count.
  • failure.rs get_action returned Retry { delay_ms: 0 } for healthy nodes — busy-loop footgun for callers using the action on the healthy path. Now returns the no-op variant.
  • transport.rs BatchedPacketReceiver thread spun at 1 ms on persistent socket errors — EBADF / ENOTSOCK / permission-revoke ate a CPU forever. Now exponential backoff with hard-error early return.
  • proxy.rs telemetry counters incremented before send succeeded — counters drifted high under partial failure. Now incremented on success.
  • proximity.rs update_from_pingwave worse path overwrote better — high-seq pingwave through a long route demoted the cached direct route. Freshness (always take latest seq) is now separate from path quality (only update hops/addr/latency_us when new_hops <= self.hops).
  • proximity.rs self-edge insert_or_update_edge per-pingwave — hot-path noise; skipped.
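
A reduced sketch of the refused-slide behavior from the anti-replay items above; the field layout is illustrative and only the validity check is shown (commit omitted):

const WINDOW_SIZE: u64 = 1024;

struct ReplayWindow {
    highest: u64,                                // highest authenticated counter accepted so far
    bitmap: [u64; (WINDOW_SIZE / 64) as usize],  // bit i set => counter (highest - i) already seen
}

impl ReplayWindow {
    fn is_valid(&self, counter: u64) -> bool {
        if counter == u64::MAX {
            return false; // the boundary value would saturate the window state
        }
        if counter > self.highest {
            // Forward jumps are accepted only within one window; anything larger is
            // refused (fresh handshake required) rather than zeroing the bitmap.
            return counter - self.highest <= WINDOW_SIZE;
        }
        let behind = self.highest - counter;
        if behind >= WINDOW_SIZE {
            return false; // older than the window tracks
        }
        (self.bitmap[(behind / 64) as usize] & (1u64 << (behind % 64))) == 0
    }
}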

Compute, daemons, migration

  • start_migration always emitted a single SnapshotReady regardless of size — chunk_index: 0, total_chunks: 1 whether the snapshot was 12 B or 12 MB; the wire encoder rejected any chunk over MAX_SNAPSHOT_CHUNK_SIZE = 7000. Locally-initiated migration of any daemon whose serialized state exceeded 7 KB couldn't be sent. Now routes through chunk_snapshot(daemon_origin, snapshot_bytes, seq_through); a reduced chunking sketch follows this list. Breaking — see breaking-changes section.
  • Snapshot reassembly unbounded chunk hold via seq_through == latest — eviction only fired for strictly greater; an attacker could park up to ~4.3 GiB of unfinished reassembly per (origin, seq) and refresh forever. Per-entry byte cap (MAX_PENDING_REASSEMBLY_BYTES = 64 MiB) plus a per-entry age sweep (MAX_PENDING_REASSEMBLY_AGE = 5 min, opportunistic at the head of every feed plus a public sweep_stale for external timers) close the at-cap-and-quiet residual hole.
  • abort_migration_with_reason did not propagate to MigrationSourceHandler — source-side migrations map retained the entry; is_migrating() stayed true, buffer_event kept buffering into an undrained vector, retries tripped AlreadyMigrating. Now dispatched.
  • standby_group replaced standby marked healthy with synced_through = 0 — a subsequent active failure could promote the fresh zero-state standby and lose all pre-buffer state. Now keeps the replaced standby unhealthy until after a successful sync, and promote() candidates are filtered to last_sync.is_some().
  • migration_target::buffer_event had no phase guard — could insert/deliver post-cutover; combined with normal-path delivery yielded duplicate execution. Now guarded.
  • migration_source::start_snapshot was a contains_key → entry() race — two concurrent snapshots of the same origin could both call user-supplied MeshDaemon::snapshot() (DashMap entry guard was held across user I/O — a separate fix moves the entry-guard drop ahead of the snapshot). The trait API doesn't enforce idempotency; the race is now serialized.
  • migration_source::take_buffered_events had no phase guard — misuse-prone. Now guarded.
  • migration_target::abort did not clear completed index — minor leak. Cleared.
  • orchestrator returned MigrationError::TargetUnavailable(0) from auto-placement — surfaced "target node 0x0 unavailable" to operators when no specific node had ever been tried. Now typed NoTargetAvailable (variant addition).
  • orchestrator::buffer_event returned false at Cutover — downstream caller could route to source post-handoff. Now correctly buffers through Cutover.
  • migration.rs started_at: u64 saturated on clock jump backward — switched to Instant.
  • fork_group forks.pop() and coord.remove_last() invariant unenforced — brittle. Now enforced.
  • bindings.rs Vec::with_capacity from peer-supplied u32 — declared count of ~4 B entries → ~96 GiB allocation before truncation. Now bounded by data.len() / MIN_BINDING_SIZE.
  • reconcile.rs unreachable!() reachable on signed but divergent input — equal-length-equal-payload tiebreak panicked on the chain's reconciliation thread. Now a deterministic tiebreak on parent_hash.
  • reliability.rs silent reliability drop — when pending.len() >= max_pending, the oldest unacknowledged packet was popped; subsequent NACK could never recover that seq because the entry was gone. Now backpressures callers; doesn't drop tracking for in-flight packets.
  • router.rs NetRouter::start had no re-entry guard — a second call spawned a competing dequeue loop. Now compare_exchange on running.
  • continuity/chain (0, Some(non-empty payload)) accepted as genesis-shaped — chain reported Forked against junk. Now Unverifiable.
  • state/log genesis-shaped event with un-validated payload — peer-injected attacker-chosen anchor. Now pinned to the canonical genesis payload.
  • contested/correlation capability-index parent walk loops forever — defensive depth cap (matches the 4-level hierarchy).
  • contested/observation unbounded HashMap + seq_diff_sum overflow — long chains accumulated forever. LRU + saturating_add.
  • contested/superposition target_replayed only advanced from SuperposedSpreading — a target that caught up before advance(Replay) stalled forever; ReadyToCollapse never fired. Now both arms advance.
  • contested/propagation lossy f64 → u64 poisoned EWMA — a pathological RTT clamped per_hop to u64::MAX permanently. NaN check tightened.
  • contested/correlation Instant subtraction panickednow - correlation_window panicked if the window exceeded uptime. Now checked_sub.
  • partition.rs NaN >= threshold blocked healing — when other_side.is_empty() the ratio was NaN. Empty case now treated as "fully healed."
  • failure.rs RecoveryManager flapping peers (see Mesh transport, sessions, routing — the recovery and the failure detection both lived in this file).
  • identity/origin.rs origin_hash: u32 collision floor documented — ~65 K peer birthday collision; cross-channel accounting keyed by origin_hash aliases distinct entities. Documented as the boundary; the rename to origin_tag and the wire bump are deferred to the next phase.
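
A reduced sketch of the chunking behavior behind the start_migration item above; the chunk struct is illustrative, and the real entry point is chunk_snapshot(daemon_origin, snapshot_bytes, seq_through):

const MAX_SNAPSHOT_CHUNK_SIZE: usize = 7000;

struct SnapshotChunk {
    chunk_index: u32,
    total_chunks: u32,
    bytes: Vec<u8>,
}

fn chunk_snapshot_bytes(snapshot: &[u8]) -> Vec<SnapshotChunk> {
    if snapshot.is_empty() {
        // even an empty snapshot travels as one (empty) chunk
        return vec![SnapshotChunk { chunk_index: 0, total_chunks: 1, bytes: Vec::new() }];
    }
    let total = snapshot.len().div_ceil(MAX_SNAPSHOT_CHUNK_SIZE) as u32;
    snapshot
        .chunks(MAX_SNAPSHOT_CHUNK_SIZE)
        .enumerate()
        .map(|(i, c)| SnapshotChunk {
            chunk_index: i as u32,
            total_chunks: total,
            bytes: c.to_vec(),
        })
        .collect()
}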

Behavior, identity, security

  • safety.rs AuditOnly silently dropped violation logs — check_rate_limits only logged when mode == Enforce; the documented "log violations but don't block" stance simply didn't log. Now logs unconditionally; only the return Err is gated.
  • safety.rs Relaxed / AcqRel mismatch — the release side used Relaxed against the acquire side's AcqRel; observable counter drift on weakly-ordered cores. Both sides now AcqRel.
  • safety.rs audit-only token counter fetch_add without saturating — wraps under hostile traffic. Now saturating.
  • loadbalance.rs NaN slipped past total_weight <= 0.0 — switched to !(total_weight > 0.0) which captures NaN.
  • token.rs slot-cap race unbounded — contains_key then entry() overshoot bounded by concurrent calls, not shards. Now entry().or_insert_with() then drop on overflow.
  • token.rs signed_payload() allocated 95 bytes per verify — hot-path waste. Now stack-buffered.
  • channel/roster is_empty() → remove_if TOCTOU — idempotent today but fragile. Tightened.
  • channel/guard revoke() did not rebuild bloom — false-positive rate climbed until manual rebuild_bloom. Now triggers rebuild.
  • behavior/diff::to_bytes returned Vec::new() on cap-violation — indistinguishable from a legitimate empty diff; senders silently transmitted zero bytes, receiver dropped. Deprecated in favor of try_to_bytes.
  • crypto.rs ReplayWindow::commit — see Mesh transport, sessions, routing: received == u64::MAX poisoning fixed at is_valid instead of commit.

Bindings (Node, Python, Go, C) & FFI

  • net_poll buffer-too-small dropped already-consumed events — bus.poll(request) advanced the cursor before the response was serialized; an undersized buffer returned BufferTooSmall and dropped the entire response, but the next call started at the now-advanced cursor. Every event in the failed serialization was silently lost. Buffer is now size-checked first and the response is buffered so a retry can resume.
  • net_poll_ex allocation failure dropped the entire batch — Layout::array::<NetEvent>(count) and std::alloc::alloc(layout) failures returned Unknown and dropped the response. Now pre-validates count against a max event-count.
  • Panic across FFI on OOM in net_poll_ex — event.id.as_bytes().to_vec().into_boxed_slice() and event.raw.to_vec().into_boxed_slice() could panic mid-loop and leak earlier Box::into_raws plus the std::alloc::alloc(layout) array. Entry points now catch_unwind; panic = "abort" for the cdylib closes the residual. A reduced sketch of the entry-point guards follows this list.
  • slice::from_raw_parts(ptr, len) lacked len <= isize::MAX validation — a C caller passing sign-extended -1 triggered immediate UB before any guard fired. Affects every wide-input FFI entry point: net_ingest, net_ingest_raw, net_ingest_raw_batch, net_ingest_raw_ex, mesh.rs::collect_payloads, net_mesh_publish, net_redex_file_append, net_identity_sign, net_identity_install_token, net_parse_token. All now reject above the isize::MAX boundary.
  • net_generate_keypair / net_free_string feature-gated, header unconditional — consumers linking against a cdylib built without net got load-time missing-symbol errors despite the header promising the symbol. Stubs added.
  • net_free_poll_result not idempotent — frees events and next_id but left the struct fields holding the freed pointers. A defensive caller / destructor wrapper double-free'd. Now nulls fields after free; subsequent calls and null-pointer calls are no-ops.
  • bus_taken defense-in-depth claim was doc-only — doc said "FFI ops also check this," but the field was read only inside net_shutdown. Either gate or remove the doc; we gated.
  • Concurrent net_shutdown callers raced the bus_taken swap — a second/third caller returned Success while the first was still inside runtime.block_on(bus.shutdown()), falsely signaling completion. Now serialized.
  • runtime().block_on(...) panics unwound across extern "C" — Handle::try_current() guard added at every cortex.rs and mesh.rs block_on site; catch_unwind shim added.
  • FFI handle accessors &*handle without alignment check — misaligned *mut NetHandle from C is immediate UB before the null check. is_aligned_to::<HandleType>() now precedes every dereference.
  • Arc<InnerType>-wrapped FFI handles lacked compile-time Send + Sync auditstatic_assertions::assert_impl_all!(InnerType: Send + Sync); placed next to each handle.
  • c_str_to_str lifetime elision dangled — signature unsafe fn c_str_to_str(p: &*const c_char) -> Option<&str> bound the returned &str to the local stack slot, not the underlying C buffer. Today's call sites are stack-only, but a future refactor moving the result into tokio::spawn(async move { ... }) would have compiled cleanly and dangled. Now unsafe fn c_str_to_str<'a>(p: *const c_char) -> Option<&'a str> with explicit lifetime.
  • net_ingest_raw_batch silently dropped null and invalid-UTF-8 entries — function returned count - 1 accepted; bindings attributed the drop to backpressure, retried the wrong indices, and double-published the good ones. Now surfaces dropped indices via out_failed_indices: *mut size_t, out_failed_len: *mut size_t.
  • parse_config_json silently fell back to DropNewest on unknown backpressure_mode — "DropOldset" (typo) or "FailProduce" got a different durability profile with no error at deploy time. Now errors on unknown values; added the Sample { rate } arm with rate validation.
  • retention_max_* accepted zero, fsync params did not — retention_max_events = 0 meant "evict everything immediately on first append" — almost certainly a config mistake intended as "no limit." Now rejected at the same gate.
  • Net heartbeat_interval_ms / session_timeout_ms and mesh heartbeat_ms accepted zero — heartbeat-every-0ms busy-looped the heartbeat task and saturated a CPU. Now validated.
  • Cortex non-success paths didn't write *out_json/*out_len — pre-zero is the contract; some paths violated it. Fixed.
  • CString::new failure reported as InvalidUtf8 but caused by interior NUL — error variant retitled.
  • NetEvent / NetReceipt #[repr(C)] lacked cross-arch alignment pinning — const asserts on layout added.
  • TokioMutex held across JSON serialization in cortex FFI — per-cursor latency stall. Serialization now happens outside the held mutex.
  • Mesh FFI g.fp16_tflops_x10.map(|tf| tf as f32 / 10.0) lossy for u32 ≥ 2²⁴ — the neighboring tops_x10 already used saturating_u16_cap. Matched.
  • parse_modality_cap unknown modality strings silently fell back to Modality::Text — used for both capability announcements and capability filters; a typo in require_modalities returned wrong nodes with no error. Switched to Option<Modality> and surfaces NET_ERR_CHANNEL on unknown.
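
A reduced sketch of the guard ordering the FFI items above describe; the handle type, error codes, and entry-point name are placeholders. The null, alignment, and length checks run before any dereference, and the body runs under catch_unwind even though the cdylib additionally builds with panic = "abort":

use std::panic::{catch_unwind, AssertUnwindSafe};

struct NetExampleHandle; // placeholder for a real FFI handle type

#[no_mangle]
pub extern "C" fn net_example_ingest(handle: *mut NetExampleHandle, data: *const u8, len: usize) -> i32 {
    if handle.is_null()
        || (handle as usize) % std::mem::align_of::<NetExampleHandle>() != 0 // misaligned pointer
        || data.is_null()
        || len > isize::MAX as usize // from_raw_parts with a larger len is immediate UB
    {
        return -1;
    }
    let result = catch_unwind(AssertUnwindSafe(|| {
        let _h = unsafe { &*handle };
        let bytes = unsafe { std::slice::from_raw_parts(data, len) };
        // ... hand bytes to the runtime ...
        bytes.len() as i32
    }));
    result.unwrap_or(-2) // a panic becomes a defined error code instead of unwinding across the boundary
}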

Compute SDK error surface

  • MigrationError::TargetUnavailable(0) → NoTargetAvailable — variant addition; the integration test that asserted the pre-fix variant has been updated.
  • start_migration returns Vec<MigrationMessage> instead of single — see breaking changes.

Test hygiene

  • Migration chunked-snapshot regression — pins that locally-initiated migration of a daemon with a serialized state ≥ 7 KB chunks correctly, and the SDK's transport-identity seal path reassembles, seals, and rechunks in order.
  • Snapshot reassembly age-sweep regression — pins that the pending entry is evicted at the head of the next feed past the age cap.
  • active_count budget under concurrent activate — pins that three concurrent activates can't transiently overshoot max_shards.
  • PollMerger from_id echo on stalled poll — pins the cursor-context preservation.
  • flush() Phase 2 barrier delta-snapshot — pins that post-flush ingest can't satisfy the inequality.
  • shutdown_was_lossy no longer false-positives on deadline-triggered shutdown — pins that final-sweep drains are not counted against events_dropped.
  • next_seq observer consistency — committed_seq is the lock-free invariant readers see.
  • Anti-replay received == u64::MAX rejection — pins that one hostile authenticated packet can't poison the receive path.
  • TokenScope::contains(NONE) is false — pins the no-op-action authorization closure.
  • JetStream cold-stream bail gated only on first_seq == 0 — pins that populated sparse streams are walked past arbitrary deletion gaps.
  • net_free_poll_result idempotency — pins single + multiple + null-pointer free.
  • net_poll minimum-buffer rejection — pins that buffers below MIN_RESPONSE_BUFFER are rejected before the cursor is touched.

Known issues — queued for the next release

mesh.rs deep-read audit

A separate single-file audit of adapter/net/mesh.rs (~8 K LOC) surfaced 9 additional defects that are scoped to that file. None of them are addressed in this release; all are slated for the next phase. For consumers running production deployments, the most consequential are listed below — the full audit is in docs/misc/BUG_AUDIT_2026_05_03_MESH.md.

  • spawn_heartbeat_loop holds a DashMap shard guard across .await — the heartbeat broadcast loop iterates peers.iter() and awaits socket.send_to(...) (heartbeat + pingwave, twice per peer) while still holding the iterator's Ref guard. Every other task touching the same shard blocks for the cumulative round-trip.
  • accept / start mutual exclusion uses AcqRel where the comment relies on SeqCst — Dekker-style mutual exclusion needs both sides SC. On x86 the LOCK'd RMW happens to fully fence so the race is unobservable; on AArch64 / RISC-V the dispatcher can race handshake_responder for the inbound msg1.
  • Routed-handshake key rotation silently overwrites a live session — the replay guard only fires for the same remote_static_pub; a routed msg1 with a different static for the same peer_node_id falls through and peers.insert overwrites the existing legitimate session.
  • commit_reclassify_observations torn (nat_class, reflex_addr) snapshot — when every probe failed, nat_class is updated but reflex_addr keeps its previous value, violating the traversal_publish_mu invariant.
  • authorize_subscribe rejects idempotent re-subscribes with TooManyChannels — a peer at the cap re-subscribing to a channel it already holds is rejected even though SubscriberRoster is set-typed.
  • Routed-handshake peers.getpeers.insert not atomic — concurrent routed handshakes for the same peer_node_id race the insert; the loser's pending_handshakes initiator state is wedged until handshake_timeout.
  • publish_to_peer does not propagate the reliable flag to the packet header — every other sender (send_to_peer, send_routed, send_on_stream, etc.) computes if reliable { PacketFlags::RELIABLE } and threads it in. publish_to_peer hard-codes PacketFlags::NONE. Latent today (per-stream reliability is set on open) but the inconsistency will silently bite when a receiver-side path consults the packet flag.
  • process_local_packet migration loopback unbounded synchronous self-bounce — a buggy / attacker-influenced "trusted" handler that always emits a self-bound message can spin the dispatch task synchronously, starving every other peer's packets.
  • connect_via does not refresh addr_to_node after a successful direct upgrade — the upgraded session's dispatch fast path falls back to a linear peers.iter().find(...) per packet for exactly the sessions that benefit most from the addr → nid index. Performance only.

Items deferred from the main audit

The following remain open from BUG_AUDIT_2026_05_03.md and are tracked for the next release: #1 (Windows compact_to non-atomic — MoveFileExW/MOVEFILE_WRITE_THROUGH), #6 / #7 / #8 (cortex watermark + checksum coverage), #13 (registry replace in-flight quiescing), #23 / #24 / #25 (cortex / mesh handle-lifetime contract on FFI), #39 (msg-id sequence_start monotonicity test), #56 (origin_hash u32 collision boundary; rename / wire bump), #64 (orchestrator target_head parent-hash 0), #68 (registry::unregister in-flight Arc clones), #73 (per-shard cap clamps cursor advancement under filtered single-shard requests), #81 (adapter/redis.rs pipeline timeout duplicate hazard — depends on RedisStreamDedup wiring), #97 (session.rs racy tx_bytes_sent watermark — see notes about credit-window invariant), #102 (envelope v0/v1 prober), #118 (rule window reset), #121 (select_power_of_two degenerate on len == 2), #125 (per_source.clear() minute-boundary RPM cap exceedance), #127 (initiator handshake HandshakePacer), #128 (router.rs lost-wakeup window).


Breaking changes

Rust core (net crate)

MigrationOrchestrator::start_migration returns Vec<MigrationMessage>

start_migration now returns Result<Vec<MigrationMessage>, MigrationError> instead of Result<MigrationMessage, MigrationError>. The local-source path returns one or more SnapshotReady chunks (sized to MAX_SNAPSHOT_CHUNK_SIZE = 7000); the remote-source path returns a single-element vec![TakeSnapshot { .. }].

Why: pre-fix the orchestrator emitted chunk_index: 0, total_chunks: 1 regardless of payload size; the wire encoder rejected anything past 7 KB and locally-initiated migration of any stateful daemon with a non-trivial state vector simply could not be sent.

Migrate:

// Before
let msg: MigrationMessage = orchestrator.start_migration(origin, src, dst)?;
send_migration_message(dest_node, &msg).await?;

// After
let msgs: Vec<MigrationMessage> = orchestrator.start_migration(origin, src, dst)?;
for msg in &msgs {
    send_migration_message(dest_node, msg).await?;
}

If you opted into transport-identity sealing, reassemble all chunks → seal → chunk_snapshot(daemon_origin, sealed, seq_through) → re-dispatch in order. The SDK's start_migration_with and MigrationHandle::reinitiate_attempt route through a new maybe_seal_chunked_snapshot helper that does this for you.

MigrationError::NoTargetAvailable (variant addition)

start_migration_auto now returns MigrationError::NoTargetAvailable when the scheduler finds no candidate, instead of TargetUnavailable(0) (which surfaced "target node 0x0 unavailable" to operators).

Migrate: match arms over MigrationError need to add the new variant; with #[non_exhaustive] already in place this is forward-compatible, but exhaustive match-on-variant code will refuse to compile.
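
For exhaustive matches, the change is one added arm. A sketch; the start_migration_auto argument list and the arm bodies are placeholders:

match orchestrator.start_migration_auto(daemon_origin) {
    Ok(_msgs) => { /* dispatch the (possibly chunked) messages */ }
    Err(MigrationError::NoTargetAvailable) => { /* no candidate right now; back off and retry */ }
    Err(MigrationError::TargetUnavailable(_node_id)) => { /* a specific target refused */ }
    Err(other) => return Err(other.into()),
}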

ConsumeResponse::failed_shards

A new failed_shards: Vec<u16> field reports per-shard adapter errors that previously were silently swallowed at warn level (in contrast to stalled_shards, which was already surfaced).

Config validation rejects zero in places it used to accept

  • retention_max_events = 0, retention_max_bytes = 0, retention_max_age_ms = 0 are now rejected at the JSON-config gate (matching the existing fsync zero-rejection). Set them to null or omit the field for "no limit."
  • Net heartbeat_interval_ms = 0, session_timeout_ms = 0, mesh heartbeat_ms = 0 are now rejected. A 0-ms heartbeat saturates a CPU; this was almost always an unintended config.
  • BatchConfig max_size > 1_000_000 is now rejected. Default is 10_000; the cap closes the current_batch_size * 3 + target overflow path.
  • parse_config_json errors on unknown backpressure_mode values instead of silently selecting DropNewest.

BackpressureMode::Sample { rate }

New variant; existing match arms must add a wildcard or the new arm.

behavior::diff::to_bytes deprecated

Returns Vec::new() on cap-violation, indistinguishable from a legitimate empty diff. Migrate to try_to_bytes which returns Result.

WatermarkingFold caps inputs at u64::MAX - 1

A peer publishing seq_or_ts == u64::MAX previously poisoned per-origin monotonicity. Inputs at the boundary are now rejected. Operators feeding the watermarking fold with a synthetic max-seq must pick u64::MAX - 1.

consumer/merge::PollMerger failed/stalled shard surfacing

PollMerger::poll now echoes back the caller's from_id when no shards make progress (instead of None, which callers were interpreting as "no events" and re-fetching from zero). Callers that relied on None as the stall signal need to switch to next_id == request.from_id.
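
Callers that keyed stall detection off None can switch the check like this (a sketch; response and request are the poll response and request at your call site):

// Before: None was the stall signal
if response.next_id.is_none() {
    // interpreted as "no events"; callers re-fetched from zero
}

// After: a stalled poll echoes the caller's cursor back
if response.next_id == request.from_id {
    // no shard made progress; keep the cursor and retry later
}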

Cross-backend cursor migration enforced

compare_stream_ids's mixed-format lex fallback wedged the cursor across backend migrations (e.g. JetStream → Redis: "1700-0" < "42" lex-compared). The cursor format is now persisted alongside the cursor; cross-backend migration without explicit reset is refused.

StoredEvent serialization passes raw bytes through

Pre-fix StoredEvent::Serialize round-tripped self.raw through serde_json::Value, discarding original whitespace and key order, normalizing number formatting (1.0 → 1). Downstream signatures or hashes against the serialized form silently failed verification. Now uses &serde_json::value::RawValue passthrough — byte-equality is preserved.
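
A minimal illustration of why the passthrough matters, using serde_json directly; this is not the crate's internal code and the sample document is arbitrary:

fn main() {
    let original = r#"{"b": 1.0, "a": "x"}"#;

    // Round-tripping through Value re-serializes: whitespace and key order are not
    // preserved, so hashes or signatures computed over the bytes stop verifying.
    let v: serde_json::Value = serde_json::from_str(original).unwrap();
    let reencoded = serde_json::to_string(&v).unwrap();
    assert_ne!(original, reencoded);

    // Borrowing through RawValue keeps the original bytes intact.
    let raw: &serde_json::value::RawValue = serde_json::from_str(original).unwrap();
    assert_eq!(original, raw.get());
}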

Rust SDK (net-sdk)

The SDK's public surface is unchanged. The migration kickoff paths (DaemonRuntime::start_migration_with and MigrationHandle::reinitiate_attempt) handle the new chunked Vec<MigrationMessage> internally; if you call the orchestrator directly via DaemonRuntime::orchestrator_arc() (or equivalent) you must update to the new return shape.

FFI / bindings

Binding — Change

  • All — Every extern "C" body is now wrapped in catch_unwind; the cdylib uses panic = "abort" so a Rust panic does not unwind across the FFI boundary. Behavior change for callers that depended on a Rust panic partially completing the call before unwinding.
  • All — slice::from_raw_parts(ptr, len) rejects len > isize::MAX as usize. C callers passing sign-extended -1 previously hit immediate UB before any guard fired; they now hit a defined error return.
  • All — FFI handle accessors check alignment via is_aligned_to::<HandleType>(). A misaligned *mut Handle returned from a wrapper that allocated through a non-Rust allocator now returns an error instead of UB.
  • All — net_ingest_raw_batch surfaces dropped indices via two new out-parameters (out_failed_indices, out_failed_len). Bindings that called the function with nullptr for these still get the old "count returned" semantics.
  • All — net_free_poll_result is now idempotent. Callers that ran their own field-nulling defensively can drop it.
  • All — parse_modality_cap returns NET_ERR_CHANNEL on unknown modality strings instead of silently falling back to Modality::Text. Bindings that round-tripped capability announcements through arbitrary string fields will start surfacing errors at deploy time.
  • C — net.h now provides net_generate_keypair / net_free_string stubs in builds without net. Consumers linking against a net-less cdylib previously hit load-time missing-symbol errors despite the header.

Behavioral fixes that may surface as test breakage

These aren't strictly API-breaking, but tests that asserted the pre-fix behavior will need updating:

  • MigrationError::NoTargetAvailable: tests asserting TargetUnavailable(_) from start_migration_auto need to switch.
  • shutdown_was_lossy = false on a clean deadline-triggered shutdown: tests that asserted the false-positive behavior will fail.
  • PollMerger::poll echoes back from_id on stall: tests that asserted next_id == None on stall will see the input cursor instead.
  • active_count cannot transiently exceed max_shards: tests that relied on the budget overshoot to construct a degenerate state will need a different vector.
  • flush() Phase 2 barrier respects pre-flush ingest: tests that satisfied the inequality with post-flush traffic will hang to the deadline.
  • Anti-replay received == u64::MAX is rejected: tests that asserted the boundary was accepted will see the rejection.
  • TokenScope::contains(NONE) == false: tests that asserted the old true will need to flip.
  • JetStream Other PublishErrorKind is fatal: retry-loop tests that simulated Other and asserted retry will see the call return immediately.
  • Memories STORED → DELETED → STORED does not resurrect: tests that asserted resurrection will see the post-tombstone behavior.
  • gateway.rs::ParentVisible is now strictly upward; tests that asserted descendant-side leakage will fail.
  • route.rs route tie-break is strictly better, not equal-or-better: tests that asserted equal-metric overwrite will see preserved routes.

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.10 line.
  2. Recompile. The signature changes (start_migration → Vec, BackpressureMode::Sample, ConsumeResponse::failed_shards, MigrationError::NoTargetAvailable) will surface as compile errors at the exact call sites that need updating — follow the Migrate snippets above.
  3. Audit your config for fields that previously accepted zero where they shouldn't have (retention_max_*, heartbeat_interval_ms, session_timeout_ms, mesh heartbeat_ms). Replace zeros with null (or omit) for "no limit," or pick a small positive value for the heartbeat fields.
  4. Cross-backend cursor migrations require an explicit reset. If your deployment is migrating from JetStream to Redis (or vice-versa), drop the persisted cursor and let the consumer re-tail from the explicit start position.
  5. If you call MigrationOrchestrator directly (rather than through the SDK's DaemonRuntime::start_migration_with), update to the chunked Vec<MigrationMessage> return shape and reassemble + seal + rechunk on the transport-identity-sealing path.
  6. If your test suite covers the items in Behavioral fixes that may surface as test breakage, update the assertions.
  7. Re-run your full suite. The lib + binding suites run green; the FFI / bindings layer now uses catch_unwind + panic = "abort" so any unwind across the boundary that previously "worked" is now a hard failure pointing at an unhandled panic source.
v0.9.0 — Codename: First Blood
2026.05.02

v0.9 is a hardening release. No new features, no new transports, no new SDK surfaces — every commit on this branch is a bug fix, a regression test, or a documentation tightening. The conviction we shipped under v0.8 ("Killing Moon") was that distributed compute should not be a control-plane problem. v0.9 is the version where we stand behind that conviction by walking it through audit after audit and tightening every seam we found.

The work was driven by four parallel-pass internal audits totalling 102 items across the bus, the shard manager, the RedEX append log and its CortEX fold, the JetStream and Redis adapters, the mesh transport, the FFI surface, and every binding.


Addressed in this release

RedEX & CortEX (storage + folded state)

  • Lost events on partial replay failureMigrationTargetHandler::drain_pending returned on first delivery error without restoring the undelivered tail; everything past the failure was permanently lost. Fix preserves the tail for the next drain and a regression test pins both the resume and the prefix-not-redelivered invariant.
  • Silent eviction during tail backfill — backfill could miss the Lagged signal under retention rollover and silently drop events. Now signals correctly during backfill.
  • Index task exits permanently after Lagged — the tail task halted on Lagged and never recovered. Now clears the index, re-tails live-only with a 5/20/60/250 ms saturation backoff, and surfaces a lag_resets() counter so aggregating downstreams can detect lossy resets.
  • Snapshot-store retention drops high-water mark on remove — a stale producer could re-stage older snapshots after a remove. Added a per-entity high-water table that survives remove. forget() is now pub(crate) so the anti-rewind invariant can't be defeated externally.
  • Observable seq rollback via next_seq() — external readers could observe a temporarily-bumped next_seq mid-rollback. Now reads under the state lock.
  • new_heap accepts RedexFlags::INLINE — the heap path silently accepted the inline flag, breaking invariants. Now rejected.
  • append_batch empty-input returns plausible-looking seq (breaking) — returned 0 for both empty input and the legitimate seq-0 first write. Now Result<Option<u64>, _>. See breaking-changes section.
  • Age-retention off-by-one (breaking) — boundary was > (entries at exact cutoff dropped); now >= (retained). See breaking-changes section.
  • Stop policy halts without final changes_tx notify — subscribers got no signal on halt. Initial fix added notify_waiters + changes_tx.send(seq); the broadcast was later refined to NOT emit the failing seq, since changes_tx is documented as carrying successfully-folded sequences.
  • Cortex changes_tx broadcasts failing seq on Stop+non-recoverable halt — pre-fix subscribers could observe a phantom Seq(failing_seq), mis-routing state. Now drops the broadcast on halt; subscribers poll is_running().
  • RedexFile::Debug deadlock footgun — Debug called len() and next_seq(), both of which take the state lock. Now reads only the lock-free atomics.
  • RedexIndex::clear() on Lagged is silent — added the lag_resets() accessor as a public sentinel.
  • RedexIndex saturation-resume can hot-loop — under sustained burst with an under-sized tail_buffer_size the loop emitted a warn per cycle. Now backed off and rate-limited.

Bus, shards, and dispatch

  • Activation-failure abort drops drain-worker scratch buffer / Batch worker abort drops in-memory current_batch — abort() dropped events. Now graceful await + dispatch with bounded tokio::time::timeout(2 × adapter_timeout) so the rollback can't hang on a parked worker.
  • num_shards decremented on rollback that never incremented it — activate-failure rollback over-decremented num_shards for never-activated shards. Decrement is now gated on the shard's mapper state. A targeted remove_specific_stopped_shard replaces the bulk remove_stopped_shards() so sequential manual_scale_down doesn't prune sibling state under itself.
  • ShardManager::activate_shard double-counts on idempotent calls — repeated activates kept bumping num_shards. Now gated on the mapper's transitioned signal.
  • activate() budget gate — load-then-store is safe today because the held write lock on shards serializes both the load and the mutation. The lock-held invariant is now documented as the correctness gate (CAS would be belt-and-braces, not strictly required).
  • Shutdown drain race past in_flight_ingests — single zero-pass could miss late producers. Now requires two consecutive zero passes.
  • shutdown() returns Ok(()) after timeout-with-drops — lossy shutdown looked successful. Now surfaces via events_dropped + a dedicated shutdown_was_lossy flag.
  • drain_finalize_ready Release pairs only via an implicit fence on the in-flight spin's SeqCst; promoted to SeqCst at the store site so the happens-before is explicit. Deadline-break path documented as the data-loss escape hatch.
  • PollMerger default shard list is wrong after dynamic scale-down — polled from a stale 0..num_shards range, missing live shards. Now uses the live shard id set, propagated through both add and remove paths.
  • poll_merger ArcSwap leaves polls operating on stale topology — topology-snapshot semantics now documented on poll().
  • per_shard_limit silently capped at 10 000 — caller had no signal. Surfaced via truncated_at_per_shard_cap: bool in ConsumeResponse.
  • has_more=true from a stalled adapter is silently suppressed — stalled shards invisible to the caller. Now surfaced via stalled_shards: Vec<u16>.
  • Cursor::encode returns empty cursor on serialization failure — empty cursor restarted polling from zero (silent rewind). Initial fix used expect(...); later refined to return Result<String, ConsumerError> so an async poll() panic can't take down a runtime worker. Minor breaking change for direct callers.
  • PER_SHARD_FETCH_CAP made public — exposed an internal tuning knob as API. Now #[doc(hidden)]. Read truncated_at_per_shard_cap instead.
  • add_events(vec![]) flushes as a side effect — load-bearing for the rollback path. Documented and pinned by add_events_empty_can_flush_via_timeout.
  • flush() baseline excludes events flushed via remove_shard_internal — verified events_dispatched is bumped on stranded-flush; was already correct.
  • dispatch_batch final attempt collapses error reasons — all retries were tagged with one collapsed error. Now structured per-attempt reason.
  • dispatch_batch retry sleep has no jitter / backoff — synchronized retry storms across shards. Now jittered exponential via retry_backoff(shard_id, attempt); a sketch of the shape follows this list.
  • drain_finalize_ready ordering doc — clarified that the SeqCst happens-before only covers the non-deadline exit; deadline-path stranded events are exactly the ones surfaced via events_dropped + shutdown_was_lossy.
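
One plausible shape for retry_backoff(shard_id, attempt): exponential in the attempt, with deterministic per-shard jitter so shards stop retrying in lockstep. Constants and the jitter mix are illustrative, not the crate's tuning:

use std::time::Duration;

fn retry_backoff(shard_id: u16, attempt: u32) -> Duration {
    const BASE_MS: u64 = 10;
    const CAP_MS: u64 = 2_000;
    let exp = BASE_MS.saturating_mul(1u64 << attempt.min(10));
    // cheap deterministic jitter: hash (shard, attempt) into a 0..=50% bump
    let mut h = ((shard_id as u64) << 32) | attempt as u64;
    h ^= h >> 33;
    h = h.wrapping_mul(0xff51_afd7_ed55_8ccd);
    h ^= h >> 33;
    let jitter_pct = h % 51;
    let ms = exp.min(CAP_MS);
    Duration::from_millis(ms + ms * jitter_pct / 100)
}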

Atomics, timestamps, and counters

  • pushes_since_drain_start mismatched atomic ordering — producer used Relaxed, drain side used Acquire. Now both Acquire.
  • in_flight_ingests is AtomicU32 with no saturating semantics — pathological producer counts could wrap. Widened to AtomicU64.
  • TimestampGenerator uses hard-coded baseline 0 — TSC delta math wrong. Now captures baseline at construction.
  • TimestampGenerator monotonicity stalls before the documented panic — stalled spin instead of advertised panic. Now panics preemptively at u64::MAX.
  • velocity_samples VecDeque bounded only by time, not count — burst could grow unbounded. Now also count-capped.
  • Partition next_id reuses ID 0 on u64::MAX overflow — wrap-around silently re-issued IDs. Now saturates.

Adapters (JetStream / Redis / dedup)

  • JetStream as u16 truncates shard_id — values > 65 535 wrapped silently. Now rejected with Fatal (and poll_shard propagates the Fatal instead of log-and-skipping).
  • JetStream unwrap_or_default() on remote JSON — malformed r field re-serialized as empty bytes. Now propagated as Fatal.
  • JetStream cold-stream poll walks fetch_limit * 10 round-trips — ~1010 RTTs per poll on cold streams. Now bails after consecutive_not_found_cap, gated on first_seq == 0 so populated sparse streams (events at seq 1, 500, 1000) walk past arbitrary deletion gaps; the gate is sketched after this list.
  • JetStream from_id cursor seq + 1 overflows — wrapped to 0 at u64::MAX, silent restart. Now checked_add(1).unwrap_or(seq).
  • JetStream Fatal drops accumulated batch in poll_shard — documented; acceptable since Fatal is non-retryable.
  • Redis is_healthy PING has no enforced timeout — could hang indefinitely. Now wrapped in command_timeout.
  • Redis & JetStream limit + 1 overflow on adversarial limits — wrapped to 0, silent under-delivery. Now saturating_add(1).
  • RedisStreamDedup::new accepts unbounded capacity — clamped at MAX_CAPACITY = 1<<24.
  • RedisStreamDedup is FIFO eviction, not LRU as documented — docs were wrong. Updated to describe FIFO accurately.
  • dedup_state silently swallows fsync failures — let _ = f.sync_all() ignored disk-full errors. Now propagated; the cross-platform fix uses a single writable handle (File::open returned a read-only handle on Windows, so FlushFileBuffers failed silently).
  • dedup_state::create_new(true) poison after crash — a stale tempfile from a crashed prior run could break every subsequent save. Added fs::remove_file(&tmp).ok() before create_new.
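
A sketch of the cold-stream gate referenced above — function name hypothetical; consecutive_not_found_cap and first_seq are the identifiers from the entry:

fn should_bail_cold_stream(not_found_streak: u32, consecutive_not_found_cap: u32, first_seq: u64) -> bool {
    // bail early only when the stream reports first_seq == 0 (genuinely cold);
    // populated sparse streams keep walking past arbitrary deletion gaps
    first_seq == 0 && not_found_streak >= consecutive_not_found_cap
}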

Security & permissions

  • ttl_seconds = 0 mints an already-expired token — born-expired tokens with no diagnostic to the issuer. try_issue now returns TokenError::ZeroTtl.
  • Identity::issue_token panic on Duration::ZERO — first fix routed through try_issue.expect(...), which still aborted the process with a misleading "ReadOnly" message. Now soft-clamps to 1 second, debug_assert!s in dev builds, and the wrapper's panic messages match each try_issue variant precisely.
  • PermissionToken::issue panic message misattributes ZeroTtl as ReadOnly — fixed in tandem with the above.
  • Anti-replay window cleared on large legitimate jumps — whole bitmap zeroed silently. Now emits a structured warn before zeroing.
  • OriginStamp has no per-packet binding — threat model documented.
  • Untrusted-input panics in subnet config — added try_* fallible constructors for SDK callers.
  • Channel decoder accepts trailing bytes on UNSUBSCRIBE/ACK — decoder now requires cur.remaining() == 0 after the channel name + token.

Bindings (Node, Python, Go, C)

  • Node binding u32 → u8 truncation on member index — as u8 silently truncated values > 255. Switched to try_into with explicit > 255 rejection.
  • Python bindings hold the GIL across blocking compute ops — scale_to, on_node_failure, sync_standbys, and promote blocked the GIL during long operations. Now released via PyO3 0.28's py.detach.
  • Node-binding groups carry an unused kind: String field — removed dead field.
  • RedisStreamDedup stripped from generated Node binding surface — a regen-without-redis-feature dropped the class from index.d.ts and index.js. Re-ran NAPI generation with --features redis,….
  • Python parity test for append_batch([]) returns None — added so future binding regenerations don't silently drop the contract.
  • include_str! of go/net.h escapes the crate root — broke cargo publish and out-of-repo vendoring. Copied to in-crate include/net.go.h and updated the parity test.
  • C SDK README — fixed stale references to a removed bindings/go/net/net.h path.
  • Runtime::block_on from extern "C" shims unwinds across FFI — reentrancy hazard documented.

Behavior rules & evaluators

  • Lossy as_f64 for all numeric ordering in rules — big i64/u64 values lost precision through f64. Now compares i64/u64 directly with sign-aware mixed-type fallback.
  • compare_numbers brittle with serde_json/arbitrary_precision — a transitive dep enabling that feature would silently make rules fail closed. Added debug_assert! so the misuse is loud in dev.
  • Non-deterministic verdict ordering — window_failures ordering depended on iteration order. Now sorts and dedups for determinism.
  • record_execution window-reset across rule reload — counters mis-reset for non-rate-limited rules. Now skipped for those.
  • Stream tight-loop spin — zero poll_interval spun the loop. Clamped to non-zero.
  • Stream backoff overflow on absurd poll_interval — doubling overflowed. Now saturating.
  • Rule::new lossily casts u128 millis to u64 — long uptimes truncated. Now uses saturating u64::try_from.

Compute (daemons + migration)

  • Migration next_seq overflow — replayed_through + 1 could panic at u64::MAX. Now saturating_add.
  • DashMap entry guard held across registry I/O — start_snapshot held the entry guard across user-supplied snapshot code, which was deadlock-prone. The guard is now dropped before I/O. Two racing starts produce two MeshDaemon::snapshot() calls — non-idempotent daemons must single-flight at their layer; documented.
  • on_node_recovery does not break after first matching partition — documented as intentional for overlapping partitions.

Mesh transport & packet codec

  • Silent event_count truncation in packet builder — builder accepted oversized batches and truncated. Now rejects with explicit error.
  • StreamWindow.decode unbounded total_consumed — consumer-side clamp was already enforced; documented.
  • Modulo bias in equal-weight candidate selection — hash % len biased low for non-power-of-two lengths. Now Lemire's (hash * len) >> 64; see the sketch after this list.
  • cpus.saturating_mul(2) caps max_shards: u16 at 65 535 — documented as intentional.
  • mapper.rs cooldown check + scale mutation atomicity — RwLock-implicit serialization documented.
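
A sketch of the multiply-shift reduction named above (function name hypothetical; the mapping is the one the entry describes):

fn pick_candidate_index(hash: u64, len: usize) -> usize {
    // maps a uniform 64-bit hash onto 0..len without the low-value bias
    // that `hash % len` has for non-power-of-two lengths
    ((hash as u128 * len as u128) >> 64) as usize
}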

SDK & error surface

  • SdkError::Ingestion(String) flattens structured IngestionError — backpressure / sampled / unrouted all funnelled through one stringly-typed variant. Routed to structured Sampled / Unrouted / Backpressure. Breaking — see breaking-changes section.
  • SdkError enum is breaking and not #[non_exhaustive] — added #[non_exhaustive] so future variant additions are minor-version changes.
  • NetBuilder::identity() silently overrides entity_keypair — builder accepted both fields and silently dropped one; now rejects the conflict at build time.
  • NetAdapterConfig::validate accepts pathological values — added upper bounds + heartbeat floor.
  • Drop releases shutdown gates synchronously while workers hold Arc<Self> — no partial-destruction UB; documented.

Test hygiene

  • MigrationTargetHandler::drain_pending regression test — strengthened to also assert the prefix is NOT redelivered.
  • add_events_empty_can_flush_via_timeout — pins that empty input flushes after max_delay. Load-bearing for the rollback path.
  • retry_backoff jitter test — relaxed from >= 8 / 16 to >= 4 / 16 to stay robust against DefaultHasher distribution drift across toolchain versions.
  • debug_does_not_acquire_state_lock — pins the lock-free Debug invariant by holding state.lock() across format!("{:?}", file).
  • stop_policy_does_not_broadcast_failing_seq — pins the cortex broadcast contract.
  • cold_stream_bail_gate_only_fires_when_first_seq_is_zero — pins the JetStream sparse-stream gate.

Breaking changes

Rust core (net crate)

RedexFile::append_batch signature changed

append_batch and append_batch_ordered now return Result<Option<u64>, RedexError> instead of Result<u64, RedexError>.

Why: the prior shape returned Ok(0) for an empty batch, which collided with the legitimate "first event of a non-empty batch landed at seq 0" return — callers couldn't distinguish "I appended nothing" from "I appended one event at seq 0".

Migrate:

// Before
let first_seq: u64 = file.append_batch(&payloads)?;

// After
let first_seq: Option<u64> = file.append_batch(&payloads)?;

Same change cascaded through OrderedAppender::append_batch and TypedRedexFile::append_batch.
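
If a call site needs to distinguish the two cases, the shape is a plain match (variable names hypothetical):

match file.append_batch(&payloads)? {
    Some(first_seq) => { /* appended; the first event landed at first_seq */ }
    None => { /* empty batch; nothing appended, no sequence consumed */ }
}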

Retention boundary semantics

Age-based retention now uses >= instead of > for the cutoff. An entry whose timestamp equals the cutoff exactly is retained (was: evicted).

Why: the original > comparison was off-by-one — entries on the boundary lasted strictly less than the configured retention_max_age. Production deployments with tight age caps observed events expiring one tick early.

Migrate: no source change required, but tests that asserted exact-boundary entries were evicted will now fail. Update assertions to expect retention.
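
The new comparison, reduced to its shape (names hypothetical, illustrative only):

fn is_retained(entry_ts: u64, cutoff: u64) -> bool {
    entry_ts >= cutoff // was: entry_ts > cutoff; boundary entries expired one tick early
}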

Cursor::encode returns Result

CompositeCursor::encode now returns Result<String, ConsumerError> instead of String. Affects callers using the type directly; EventBus::poll() already handles the new shape.

Migrate: append .unwrap() (in tests) or ? (in production) to existing call sites.
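
In the same Before / After shape as above (composite is a hypothetical CompositeCursor in scope):

// Before
let cursor: String = composite.encode();

// After (production code)
let cursor: String = composite.encode()?;

// After (tests)
let cursor: String = composite.encode().unwrap();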

PollMerger::new signature

PollMerger::new takes Vec<u16> of active shard IDs instead of num_shards: u16. This is an internal-leaning type but pub; downstream wrappers may need to update.
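
A minimal Before / After, assuming direct construction (variable names hypothetical):

// Before
let merger = PollMerger::new(num_shards);

// After: pass the live shard id set (Vec<u16>)
let merger = PollMerger::new(active_shard_ids);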

ConsumeResponse struct fields

Added truncated_at_per_shard_cap: bool and stalled_shards: Vec<u16>. Callers that construct ConsumeResponse directly need to populate the new fields. Pattern matches with .. unaffected.
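
A small consumption sketch for the new fields, assuming a resp: ConsumeResponse already in hand:

if resp.truncated_at_per_shard_cap {
    // the per-shard cap clipped this batch: poll again rather than assume you're caught up
}
if !resp.stalled_shards.is_empty() {
    // these shards reported more data behind a stalled adapter; retry or surface per policy
}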

PER_SHARD_FETCH_CAP is #[doc(hidden)]

Still pub const (callable), but no longer documented as API. Callers checking truncation should read ConsumeResponse::truncated_at_per_shard_cap instead of comparing against the constant.

SnapshotStore::forget is pub(crate)

Was pub. The function defeats the high-water-mark anti-rewind invariant — exposing it publicly let any caller stage stale snapshots over fresh ones. No production callers existed; only test code referenced it.

Rust SDK (net-sdk)

SdkError is #[non_exhaustive] + new variants

SdkError now carries the #[non_exhaustive] attribute. Two new variants moved out of the stringly-typed Ingestion(String) fallback:

  • Sampled — event deliberately dropped by a sampling / decimation policy. Retry is pointless.
  • Unrouted — no routable shard for the event (typically a topology-transient state). Retry once topology stabilizes.

From<IngestionError> now routes IngestionError::Sampled and IngestionError::Unrouted to these structured variants. Code that string-matched on the content of Ingestion(String) for those causes silently stops matching.

Migrate:

// Match arms now must include a wildcard
match err {
    SdkError::Backpressure => { /* drop or retry */ }
    SdkError::Sampled => { /* accept the drop */ }
    SdkError::Unrouted => { /* retry after topology stabilizes */ }
    SdkError::NotConnected => { /* peer gone */ }
    _ => { /* future-proof catch-all */ }
}

If you were substring-matching on Ingestion(...) for "sampled" or "no shard", switch to the structured variants.

Identity::issue_token no longer panics on Duration::ZERO

Previously the panicking convenience wrapper aborted with a misleading "public-only keypair" message when ttl == Duration::ZERO. It now soft-clamps to 1 second and debug_assert!s in dev builds, so the misuse surfaces in tests but doesn't take down the process in release.

Identity::try_issue_token (the explicit fallible surface) still rejects zero-TTL with TokenError::ZeroTtl — bindings route through it.

Migrate: nothing strictly required. Tests that exercised the panic with #[should_panic(expected = "public-only keypair")] need updating — the new debug-assert message contains "Duration::ZERO".
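
For those tests, the only change needed is the expected message (attribute shown in isolation; the test body stays the same):

// Before
#[should_panic(expected = "public-only keypair")]

// After
#[should_panic(expected = "Duration::ZERO")]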

Bindings

  • Node — appendBatch(...) returns bigint | null (was bigint). Empty input → null.
  • Python — append_batch(...) returns int | None (was int). Empty input → None.
  • Node — the RedisStreamDedup class is back on the binding surface. It had been stripped by an earlier feature-incomplete regen; a regression repaired, not a breaking change for downstream npm consumers.
  • Go — IssueToken{TTLSeconds: 0} returns a non-nil error, surfaced from the FFI try_issue path (behavior unchanged). No source change required.

Behavioral fixes that may surface as test breakage

These aren't strictly API-breaking, but if your test suite asserted the old behavior they will need updating:

  • num_shards rollback: add_shard + failed activate_shard + rollback no longer over-decrements num_shards. Tests that expected the off-by-one will fail.
  • JetStream sparse-stream polling: poll_shard no longer breaks early on 64 consecutive NotFounds when info() reported a populated stream (first_seq > 0). Tests on populated sparse streams that asserted early-bail behavior will see longer walks.
  • Cortex changes_with_lag halt path: on Stop + non-recoverable error the failing seq is no longer broadcast on changes_tx. Subscribers need to poll is_running() to detect halt — pre-fix they could have observed (incorrectly) a ChangeEvent::Seq(failing_seq).
  • RedexFile::Debug: no longer acquires the state mutex; reads only the lock-free atomics. Output format changed (next_seq_atomic field name; len removed).
  • SnapshotStore::store: equal-seq concurrent-store linearization is now documented to be on the snapshots-side entry guard, not on the high-water mark. Behavior unchanged; doc clarified.

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.9 line.
  2. Recompile. The signature changes (append_batch → Result<Option<u64>>, Cursor::encode → Result, SdkError #[non_exhaustive]) will surface as compile errors at the exact call sites that need updating — follow the Migrate snippets above.
  3. If you have tests that assert pre-fix behavior on the items in Behavioral fixes that may surface as test breakage, update those assertions.
  4. Bindings consumers (Node / Python): no source change is required — the type-stub updates are forward-compatible — but treat the new null / None empty-input returns as the canonical "I appended nothing" signal in your call sites.
  5. Re-run your full suite. The lib + binding suites run green; if your suite covers integration paths not exercised by the audit, this is the right release to catch any drift.
v0.8.0 · Codename: Killing Moon
2026.05.01

Net is a mesh runtime. Identity is cryptographic, channels are hierarchical, state is causal, and compute moves. There is no broker, no leader, no central directory. Every node is its own keypair. Every event is signed into a chain you can verify without trusting the network underneath. The network is the substrate; the entities are what matter.

This is what we have to show on day one.

Mikoshi

The piece worth naming first.

A daemon in Net is a stateful event processor whose identity is its public key and whose location is the mesh. You don't address it by "node X, slot 3." You address it by its origin_hash, and that fingerprint doesn't change when the daemon moves.

Mikoshi is how it moves.

A running program on one node becomes a running program on another without losing its history, its pending work, or its place in the conversation. The source packages its state, the target unpacks it, and for a brief moment the entity exists on both nodes at once — spreading, superposed, then collapsed onto the target as routing cuts over. The daemon doesn't know it moved. Neither does anything talking to it. Observer nodes watching the stream see the same causal chain continue uninterrupted, the same sequence numbers, the same entity speaking. The hardware underneath shifted. The stream didn't notice.

What moved wasn't a copy. It was the thing itself, carried across.

Six phases, signed at every boundary, with continuity proofs that verify the chain didn't fork. Standby groups and replica groups compose on top — the active dies, the warmest standby promotes, the mesh keeps moving. The daemon is the object, and the object persists.

That is the headline of v0.8.

What's underneath

A non-localized event bus. Encrypted UDP transport with AEAD on every data packet, multi-hop forwarding, NAT traversal, and pingwave swarm discovery. ed25519 identity stamped on every header. Capability announcements that drive routing — a request for inference flows toward the nearest node with a matching GPU, not toward a fixed endpoint. Permission tokens with delegation chains. Bloom-filter authorization checks at sub-10ns per packet. Hierarchical subnets that keep observation cost bounded as the mesh grows.

A storage stack that is embedded, not a service: RedEX as the append-only log, CortEX folding the log into typed domain state, NetDB exposing it as queries and live watches. Disk persistence is a flag. Durability is a knob (Never, EveryN, Interval, IntervalOrBytes). Snapshots round-trip the whole stack in one blob. There is no database to run alongside the runtime. The runtime is the database.

Bindings for Node, Python, and Go. Ergonomic SDKs in TypeScript and Python. The same MeshDaemon interface whether the event came from this process, the next node over, or three hops away. Code written against a single-node prototype runs unmodified on a multi-hop mesh.

What this release means

Net is built on the conviction that distributed compute should not be a control-plane problem. No broker to provision, no orchestrator to fail over, no service registry to keep consistent with reality. The mesh routes around what's down. The chain proves what's true. The daemon is wherever it needs to be.

We chose the Cyberpunk frame because it's the right one. Mikoshi is the engram store — minds persisting outside the hardware that bore them. Net's daemons persist outside the nodes that host them. That is not a metaphor we are reaching for. It is what the migration state machine does, packet by packet, with cryptographic receipts.

v0.8 is the version of Net we are willing to put a name on. The codename does double duty. The song — Echo & the Bunnymen, 1984 — is about the part of yourself you don't get to negotiate with. The mission — Phantom Liberty's final act — is V carrying Songbird (Somi) to the Moon, where the system that would destroy her can't reach.

The release ships when it's ready, not when it's convenient. It happens to ship on May 1, 2026, under a full moon. We didn't plan that. We're taking it.

Codename

"Killing Moon" — Echo & the Bunnymen (1984) / Cyberpunk: Phantom Liberty (2023). Released May 1, 2026.

§12 / post-cloud

not anti-cloud.
post-cloud.

Cloud infrastructure solves the wrong problem. It moves compute closer to a central provider. Net decouples compute from hardware and location.

Cloud adds a trusted intermediary by definition. Net has no intermediaries. Relay nodes forward encrypted bytes they cannot read. There is no Cloudflare, no AWS, no Azure in the path because the path is yours.

Cloud was the right answer when compute was scarce and hardware was expensive. Compute is abundant. Hardware is cheap. The coordination layer should reflect that.

A manufacturing plant running on Net doesn't route sensor data to AWS us-east-1 and back. The sensor talks directly to the decision system on the factory floor. The latency is physics, not geography plus cloud overhead.

the mesh is already
running.
↓ Join the Net
░░░░▒▒▒▒▓▓▓▓████████▓▓▓▓▒▒▒▒░░░░ ░░░░▒▒▒▒▓▓▓▓████████▓▓▓▓▒▒▒▒░░░░ ░░░░▒▒▒▒▓▓▓▓████████▓▓▓▓▒▒▒▒░░░░