JarvisCore 1.0.2: P2P Keepalive Backoff and Remote Agent Discovery

JarvisCore 1.0.2 fixes two P2P reliability issues: continuous keepalive spam when peers are unreachable, and list_roles() missing agents discovered via SWIM.

Prescott Data OSS
JarvisCore Core Team
March 4, 2026

JarvisCore 1.0.2 shipped on March 4, 2026 with four fixes to the P2P layer. This is a reliability patch for multi-node deployments. The public API is unchanged, but P2P environment variable names now require the JARVISCORE_ prefix.

The Cross-Loop Future Bug

This is the most consequential fix in the release. It did not appear in logs as an error. It appeared as requests to peer agents that never returned, then eventually timed out.

The cause was an asyncio cross-loop Future collision. The SWIM ZMQ router runs in a dedicated background thread with its own event loop. When a PEER_RESPONSE arrived from a remote agent, the response handler was invoked inside the SWIM thread's event loop. It then called _deliver_message(), which called future.set_result() on a Future that had been created in the main event loop.

asyncio Futures are not thread-safe, so setting a result from outside the Future's own loop is unsupported. In this case the call failed with RuntimeError: Future is bound to a different event loop. The error was swallowed silently, the Future never resolved, and the calling coroutine waited until its timeout before giving up. In production, that meant peer requests that appeared to hang with no indication of why.

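Before the fix, both request() and ask_async() created Futures using asyncio.get_event_loop(). Outside a running loop, that call falls back to whatever loop the current policy holds for the thread, and this implicit fallback is deprecated in recent Python versions. The fix changes both call sites to asyncio.get_running_loop(), which returns the loop the coroutine is actually running in and raises RuntimeError if there is none, so the Future is always bound to the correct loop.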

The more important change is in _deliver_message(). It now checks whether the loop that called it is the same loop the Future was created in. If they differ, it schedules the result on the Future's own loop using future.get_loop().call_soon_threadsafe(), which is the correct pattern for setting a Future result from a different thread or event loop.
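The pattern can be sketched roughly as follows. This is an illustration rather than the JarvisCore source: deliver_result() and request_demo() are hypothetical names, and the background thread stands in for the SWIM thread and its event loop.

```python
import asyncio
import threading


def deliver_result(future: asyncio.Future, result: object) -> None:
    """Set a Future's result from any thread or event loop (hypothetical helper)."""
    target_loop = future.get_loop()
    try:
        current_loop = asyncio.get_running_loop()
    except RuntimeError:
        current_loop = None  # caller is a plain thread with no running loop

    def _set() -> None:
        # Guard against a Future that already timed out or was cancelled.
        if not future.done():
            future.set_result(result)

    if current_loop is target_loop:
        _set()  # same loop: setting the result directly is safe
    else:
        # Different loop or thread: hand the call to the Future's own loop.
        target_loop.call_soon_threadsafe(_set)


async def request_demo() -> str:
    # get_running_loop() binds the Future to the loop this coroutine runs in.
    future = asyncio.get_running_loop().create_future()

    def background() -> None:
        async def respond() -> None:
            deliver_result(future, "pong")

        asyncio.run(respond())  # a second event loop, in a second thread

    thread = threading.Thread(target=background)
    thread.start()
    try:
        return await asyncio.wait_for(future, timeout=5)
    finally:
        thread.join()
```

Running asyncio.run(request_demo()) resolves the Future from the second loop without touching it directly from the wrong thread.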

Keepalive Spam Prevention

When peers were unreachable, the keepalive manager retried continuously with no pause between failures. In a cluster where a peer node was temporarily down or the network had intermittent connectivity, this produced log noise and unnecessary load on both the sending node and any endpoint the retries did reach.

The fix adds a failure backoff mechanism to P2PKeepaliveManager. The manager now tracks consecutive_send_failures and sets a failure_backoff_until timestamp when sends fail. The keepalive check is suppressed until that timestamp clears.

The backoff duration is configurable:

P2P_KEEPALIVE_FAILURE_BACKOFF_SECONDS=45  # default

Or in the Mesh config dict:

mesh = Mesh(mode="distributed", config={
    "keepalive_failure_backoff_seconds": 45,
})

On a successful keepalive send, the consecutive failure counter resets to zero. The backoff does not accumulate across a successful cycle.
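A minimal sketch of this bookkeeping, under stated assumptions: the field names consecutive_send_failures and failure_backoff_until come from the description above, while the KeepaliveBackoff class, its method names, and the injectable clock are hypothetical and exist only to make the behaviour easy to follow and test.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class KeepaliveBackoff:
    """Hypothetical stand-in for the backoff state inside the keepalive manager."""

    backoff_seconds: float = 45.0
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    consecutive_send_failures: int = 0
    failure_backoff_until: float = 0.0

    def should_send(self) -> bool:
        # Keepalive is suppressed until the backoff timestamp has passed.
        return self.clock() >= self.failure_backoff_until

    def record_failure(self) -> None:
        self.consecutive_send_failures += 1
        self.failure_backoff_until = self.clock() + self.backoff_seconds

    def record_success(self) -> None:
        # A successful send clears the counter and any pending backoff.
        self.consecutive_send_failures = 0
        self.failure_backoff_until = 0.0
```

With the default of 45 seconds, a failed send at time t suppresses further keepalive attempts until t + 45, and a single successful send returns the state to its initial values.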

Single-Node Graceful Degradation

Before this release, a single-node JarvisCore deployment logged failure warnings when no peers were found during keepalive. This was incorrect for the development case: a developer running one node is not experiencing a failure; they are simply running a single-node deployment.

The fix adds an allow_zero_peers flag that defaults to True. When the keepalive manager finds zero peers and this flag is set, it logs at debug level and skips the keepalive without incrementing the failure counter or entering failure backoff.

P2P_ALLOW_ZERO_PEERS=true  # default, set to false in production clusters

In a configured cluster where zero peers is a genuine failure, set this to false. The keepalive manager will then treat an empty peer list as a failure and apply the backoff behaviour described above.
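The decision can be sketched as a small function. The function name, logger name, and return values here are hypothetical; only the allow_zero_peers flag and the debug-versus-failure behaviour come from the description above.

```python
import logging

logger = logging.getLogger("keepalive_sketch")  # hypothetical logger name


def keepalive_action(peer_count: int, allow_zero_peers: bool = True) -> str:
    """Return what the keepalive cycle should do: 'send', 'skip', or 'failure'."""
    if peer_count > 0:
        return "send"
    if allow_zero_peers:
        # Single-node case: not a failure, so no counter increment or backoff.
        logger.debug("No peers found; skipping keepalive")
        return "skip"
    # Configured cluster: an empty peer list counts as a failed cycle.
    logger.warning("No peers found; treating as keepalive failure")
    return "failure"
```

In this sketch the caller would apply the failure backoff only when the result is "failure".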

Remote Agent Discovery in list_roles()

Before 1.0.2, PeerClient.list_roles() only queried the local agent registry. Agents that had been discovered via the SWIM mesh and registered in the remote agent registry were not included in the results. This meant that in a multi-node deployment, role queries only saw agents on the local node, making cross-node role routing unreliable.

The fix extends list_roles() to search both the local _agent_registry and the remote _remote_agent_registry. Role queries now return the full set of agents visible to the node across the mesh.
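As a rough illustration, assuming each registry is a dict mapping a role name to a list of agent IDs (the actual JarvisCore data structures may differ), the merged lookup could look like this. The standalone list_roles() function below is a hypothetical simplification of the PeerClient method.

```python
from typing import Dict, List


def list_roles(
    agent_registry: Dict[str, List[str]],
    remote_agent_registry: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Merge local and remote registries into one role -> agent IDs mapping."""
    roles: Dict[str, List[str]] = {}
    for registry in (agent_registry, remote_agent_registry):
        for role, agent_ids in registry.items():
            merged = roles.setdefault(role, [])
            for agent_id in agent_ids:
                if agent_id not in merged:  # avoid listing an agent twice
                    merged.append(agent_id)
    return roles
```

Local agents are listed first for each role, and an agent that appears in both registries is reported once.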

ask_peer Timeout

The ask_peer timeout increased from 600 seconds (10 minutes) to 7200 seconds (two hours). The previous limit caused failures for long-running tasks routed to a remote peer: database queries over large datasets, complex multi-step analysis, batch processing jobs. These are valid use cases for distributed agent workflows, and the 10-minute limit was too restrictive for them.

JARVISCORE_ Environment Variable Prefix

P2P configuration environment variables now use the JARVISCORE_ prefix. The affected variables are:

The reason for this change is that the swim package calls load_dotenv() at import time, which populates os.environ with everything in your .env file. A variable named BIND_PORT in your .env would be picked up by the swim package regardless of whether you intended it for JarvisCore, causing silent port conflicts in multi-process deployments. The JARVISCORE_ prefix isolates JarvisCore's variables from the swim package's namespace.
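The isolation can be illustrated with a small lookup helper. This is a sketch, not JarvisCore's configuration loader: read_setting() is a hypothetical function, and JARVISCORE_BIND_PORT is assumed here as the prefixed form of the BIND_PORT example above.

```python
import os
from typing import Mapping, Optional

PREFIX = "JARVISCORE_"


def read_setting(name: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Look up only the prefixed variable; the unprefixed name is never consulted."""
    env = os.environ if env is None else env
    return env.get(PREFIX + name)


# Both names are present, as they would be after load_dotenv() fills os.environ.
example_env = {"BIND_PORT": "7946", "JARVISCORE_BIND_PORT": "7950"}
port = read_setting("BIND_PORT", example_env)  # picks the prefixed value only
```

Because there is no fallback to the unprefixed name, a leftover BIND_PORT entry in .env has no effect on this lookup.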

The old unprefixed names are intentionally not supported as fallbacks. Update your environment configuration before upgrading.

Upgrade

pip install jarviscore==1.0.2

No database migrations are required. If you use the old unprefixed P2P environment variable names, rename them before deploying. The full release notes are on GitHub.
