JarvisCore 1.1.0: Agents That Know What They Are Doing

I want to be honest with you about something before we get into what shipped.

The weeks between JarvisCore 1.0.0 and 1.1.0 were rough. We shipped 1.0.2 in March to fix P2P issues that showed up almost immediately after the stable launch. Then we tried to move fast with 1.0.3 and 1.0.4, introduced regressions that were bad enough to warrant yanking both versions from PyPI, and had to regroup. That is not the experience we want anyone building on this framework to have.

1.1.0 is the version we should have shipped. This post covers the full arc: what 1.0.x taught us, what changed in the API, and what 1.1.0 actually enables that was not possible before. If you are already on JarvisCore, read through the migration notes. If you are evaluating it for the first time, start here.

Where Things Stood After 1.0.2

When JarvisCore 1.0.0 launched in February, the core abstractions were solid: the Kernel OODA loop, UnifiedMemory, the WorkflowEngine, MailboxManager, FunctionRegistry. The framework could do a lot. What it could not do was handle the edges gracefully.

The first thing that bit P2P users was keepalive spam. When peers were unreachable, the keepalive manager sent continuous retry attempts with no backoff. In a development environment this was annoying. On a flaky network or a multi-node cloud deployment it was a real problem: excessive load, log noise, and no way to tell whether zero peers was expected or a genuine failure.

1.0.2 fixed this with exponential backoff and the P2P_ALLOW_ZERO_PEERS flag. A single-node development environment no longer triggers failure warnings. A production cluster with P2P_ALLOW_ZERO_PEERS=false will surface the real problem. The keepalive backoff defaults to 45 seconds via P2P_KEEPALIVE_FAILURE_BACKOFF_SECONDS. Small fix, meaningful in practice.
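
To make the backoff behavior concrete, here is a minimal sketch of the idea. It is not the JarvisCore implementation: the function names, the 1-second base delay, the doubling factor, the choice to treat 45 seconds as a ceiling, and the assumption that an unset flag means "false" are all illustrative. Only the two setting names and the 45-second default come from the release.

```python
import os

def backoff_delay(failures: int, base: float = 1.0, cap: float = 45.0) -> float:
    """Seconds to wait before the next keepalive attempt.

    Doubles with each consecutive failure and never exceeds `cap`.
    `base` and the doubling factor are assumptions for illustration.
    """
    if failures <= 0:
        return 0.0
    # Clamp the exponent so very long outages cannot overflow a float.
    exponent = min(failures - 1, 32)
    return min(cap, base * (2 ** exponent))

def zero_peers_is_failure(peer_count: int, env=None) -> bool:
    """Treat zero peers as a failure unless P2P_ALLOW_ZERO_PEERS is truthy.

    Assumption: an unset flag behaves like "false".
    """
    env = os.environ if env is None else env
    raw = env.get("P2P_ALLOW_ZERO_PEERS", "false")
    allow = raw.strip().lower() in ("1", "true", "yes")
    return peer_count == 0 and not allow

# Delays for the first seven consecutive failures.
delays = [backoff_delay(n) for n in range(1, 8)]
```

With these assumed defaults the delays grow 1, 2, 4, 8, 16, 32 seconds and then hold at 45, which is the difference between a quiet log and keepalive spam.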

The second issue was list_roles(). Before 1.0.2, calling it on a multi-node mesh only returned agents from the local registry. Agents discovered via SWIM on remote nodes were invisible to role queries. You could have three nodes running specialist agents and the Planner would route everything to the agents it happened to be co-located with. 1.0.2 extended list_roles() to include the full SWIM-discovered agent set across the mesh.
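
The shape of that fix can be sketched as a merge of two agent sets. This is an illustration only: the dictionary layout, the "role" and "node" keys, and the function name merged_roles are hypothetical and do not describe JarvisCore's internal data structures.

```python
def merged_roles(local_agents, swim_agents):
    """Map each role to the sorted list of nodes that host it.

    `local_agents` stands in for the local registry and `swim_agents`
    for agents discovered on remote nodes. Both shapes are assumed.
    """
    roles = {}
    for agent in list(local_agents) + list(swim_agents):
        roles.setdefault(agent["role"], set()).add(agent["node"])
    return {role: sorted(nodes) for role, nodes in roles.items()}

local = [{"role": "planner", "node": "node-a"}]
remote = [
    {"role": "analyst", "node": "node-b"},
    {"role": "planner", "node": "node-c"},
]

before_fix = merged_roles(local, [])      # local registry only
after_fix = merged_roles(local, remote)   # full mesh view
```

In the "before" view the analyst on node-b simply does not exist as far as routing is concerned, which is exactly the symptom described above.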

These were real bugs. They got fixed. But fixing them did not address what was happening inside the agents themselves.

What Was Actually Wrong

Here is what we were seeing in agent behavior that went deeper than bugs:

Agents were generating full planning DAGs for questions that needed one tool call. The Planner, Executor, and Evaluator were running for tasks like "what is the current date" or "fetch this URL." The latency overhead was real. The token cost was real. Nothing in the framework was asking whether a given task needed to be planned at all.

Agents were regenerating solutions to problems they had already solved. The FunctionRegistry exists precisely to cache those solutions. But the semantic search that retrieves them works on embeddings, and embeddings are sensitive to phrasing. The same underlying task described in slightly different words produces a different embedding vector, and the registry misses. What should have been a cache hit became another code generation cycle. Every time.

CoderSubAgent was hallucinating imports. When asked to write code, the LLM had no grounding in what was actually available in the sandbox. It made reasonable guesses. Sometimes those guesses were wrong. The sandbox would fail with an import error, or a reference to a function that did not exist, and there was no clean failure path. The agent would retry, guess again, and sometimes spiral.

And perhaps the most subtle problem: a task could complete execution without error and the framework would report success, when the actual output had nothing to do with what was asked. Execution success and task success are different things. The framework could not tell them apart.

These are the problems 1.1.0 is built to address.

What Changed in the API (Read This Before You Upgrade)

Before we get to the new capabilities, the 1.0.x series made several API changes that affect existing code. They are worth calling out explicitly.

The Mesh(mode=...) constructor argument is gone. If you have code that looks like this:

# This no longer works
mesh = Mesh(mode='distributed')

Replace it with the explicit config form:

# This is the correct form from 1.0.3 onward
mesh = Mesh(config={"p2p_enabled": True})

Alternatively, set P2P_ENABLED=true in your environment and the Mesh constructor picks it up automatically. This change collapses the three former modes (autonomous, p2p, distributed) into a single configuration flag. If P2P is enabled, you have what was previously called distributed mode. If it is not, you have autonomous. The distinction between the three modes turned out to create more confusion than it resolved.
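
If you have many call sites to migrate, a small helper along these lines may be useful. It is a sketch, not part of JarvisCore: legacy_mode_to_config is a hypothetical name, and mapping the old "p2p" mode to p2p_enabled=True is an assumption, since the notes above only spell out the mapping for distributed and autonomous.

```python
LEGACY_MODES = ("autonomous", "p2p", "distributed")

def legacy_mode_to_config(mode: str) -> dict:
    """Translate an old Mesh(mode=...) value into the config dict form.

    Assumption: both "p2p" and "distributed" map to p2p_enabled=True.
    """
    normalized = mode.strip().lower()
    if normalized not in LEGACY_MODES:
        raise ValueError(f"unknown legacy mode: {mode!r}")
    return {"p2p_enabled": normalized != "autonomous"}

# The result is what you would pass as Mesh(config=...).
distributed_config = legacy_mode_to_config("distributed")
autonomous_config = legacy_mode_to_config("autonomous")
```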

If you were using ListenerAgent, that profile was merged into CustomAgent in an earlier release. The migration is a single import change:

# Before
from jarviscore.profiles import ListenerAgent

# After
from jarviscore.profiles import CustomAgent

All handler methods (on_peer_request, on_peer_notify, on_error) work exactly the same. No logic changes required.

The HITL system now enforces typed categories. If you are calling request_human_review() with an arbitrary string, that will now raise a ValueError. Use the HITLCategory enum:

from jarviscore.hitl import HITLCategory

await self.request_human_review(
    payload={"data": result},
    category=HITLCategory.CRITICAL_ACTION,  # auth_required | data_required | critical_action
)
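
If your existing code passes category strings around, one way to migrate is to coerce them at the boundary. The sketch below uses a local stand-in enum so that it runs on its own; in real code you would import HITLCategory from jarviscore.hitl instead. The member names AUTH_REQUIRED and DATA_REQUIRED are inferred from the comment above, and coerce_category is a hypothetical helper, not a framework function.

```python
from enum import Enum

class HITLCategory(str, Enum):
    """Local stand-in for jarviscore.hitl.HITLCategory (member names assumed)."""
    AUTH_REQUIRED = "auth_required"
    DATA_REQUIRED = "data_required"
    CRITICAL_ACTION = "critical_action"

def coerce_category(value) -> HITLCategory:
    """Accept an enum member or a known string; reject anything else."""
    if isinstance(value, HITLCategory):
        return value
    try:
        return HITLCategory(value)
    except ValueError:
        allowed = ", ".join(c.value for c in HITLCategory)
        raise ValueError(
            f"unknown HITL category {value!r}; expected one of: {allowed}"
        ) from None

category = coerce_category("critical_action")
```

Failing at the boundary with the list of allowed values is usually easier to debug than a ValueError raised deep inside an agent handler.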

These changes landed during the 1.0.x series and are enforced in 1.1.0. Run your tests before deploying.

The New Capabilities in 1.1.0

TaskComplexityClassifier: Stop Over-Engineering Simple Work

Before a task reaches the Planner in 1.1.0, a classifier evaluates it and assigns one of three labels: trivial, moderate, or complex. Trivial tasks bypass the full Plan, Execute, Evaluate loop and are dispatched directly. The Planner only runs when the task genuinely requires multi-step reasoning.

This is automatic. You do not configure it. You do not change your agents. Tasks that previously wasted compute on unnecessary orchestration now run faster and cheaper. If you are running agents at scale, the cost impact is immediate.

The classifier itself is LLM-based, so it adds a small cost per task. For complex workflows, that cost is negligible next to the planning work that follows. For single-step tasks, you should see a measurable latency reduction because the classifier short-circuits the pipeline entirely.
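
The routing decision can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: the real classifier is LLM-based, so toy_classifier here is a crude keyword heuristic standing in for it, and the function names, the path labels, and the choice to send unknown labels down the full pipeline are all hypothetical.

```python
from typing import Callable

LABELS = ("trivial", "moderate", "complex")

def route(task: str, classify: Callable[[str], str]) -> str:
    """Pick an execution path from the classifier's label."""
    label = classify(task)
    if label not in LABELS:
        # Assumption: an unrecognized label takes the safe, full path.
        label = "complex"
    if label == "trivial":
        return "direct_dispatch"
    return "plan_execute_evaluate"

def toy_classifier(task: str) -> str:
    """Stand-in for the LLM call: counts connectives as a proxy for steps."""
    lowered = task.lower()
    steps = 1 + lowered.count(" then ") + lowered.count(" and ")
    if steps == 1 and len(lowered.split()) <= 8:
        return "trivial"
    return "moderate" if steps <= 2 else "complex"

simple_path = route("fetch this URL", toy_classifier)
multi_step_path = route(
    "scrape the site then summarise it and email the report", toy_classifier
)
```

The point of the sketch is the shape of the gate: one cheap decision up front, and the expensive loop only runs when the label calls for it.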

IntentNormalizer: Registry Hits That Actually Hit

The FunctionRegistry is one of the most powerful pieces of the framework and also one of the most invisible when it works well. It caches code solutions keyed by task intent and retrieves them via embedding search. When it works, your agents never generate the same solution twice. When it does not, you pay the full generation cost on every call.

The problem was embedding drift. Verbose, context-heavy task descriptions produce embeddings that do not cluster well with shorter descriptions of the same underlying task. The registry would hold the perfect solution and the semantic search would walk right past it.

The IntentNormalizer sits upstream of that search. Before the registry query runs, it takes whatever description was passed, strips the verbosity, and produces a concise canonical form. The registry then gets consistent inputs regardless of how the task was originally phrased. Cache hit rates improve immediately. You will see it in your token usage if you have monitoring in place.
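
A toy version shows why this helps. The sketch below is not how IntentNormalizer works internally: it uses a hand-written filler-word list and an exact-match dictionary in place of embedding search, and normalize_intent, FILLER, and the registry dict are hypothetical names chosen for illustration.

```python
import re

# Words that add phrasing but not intent (illustrative, far from complete).
FILLER = {
    "please", "could", "would", "can", "you", "kindly", "i", "we", "me", "us",
    "need", "want", "like", "to", "the", "a", "an", "for", "of", "that",
}

def normalize_intent(description: str) -> str:
    """Reduce a task description to a short canonical form."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(w for w in words if w not in FILLER)

# Exact-match dict standing in for the embedding-backed registry.
registry = {}
registry[normalize_intent("Fetch the weather")] = "cached_solution_v1"

verbose_request = "Please could you fetch the weather for me"
hit = registry.get(normalize_intent(verbose_request))
miss_without_normalizing = registry.get(verbose_request)
```

Two phrasings that differ only in verbosity collapse to the same key, so the lookup lands on the cached solution instead of triggering another generation cycle.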

output_schema: Know What You Are Building Before You Build It

This one has entirely changed how I think about defining agents.

You can now attach a Pydantic model to any agent as an output_schema. The Kernel passes it through the pipeline into CoderSubAgent, which validates sandbox output against the schema before the result is returned. If the output does not match, the task fails fast with a clear schema validation error. No silent failures. No unstructured data that breaks downstream.

from pydantic import BaseModel
from jarviscore import AutoAgent

class AnalysisResult(BaseModel):
    summary: str
    confidence: float
    supporting_data: list[str]
    action_required: bool

class AnalysisAgent(AutoAgent):
    role = "analyst"
    capabilities = ["analyse", "summarise", "evaluate"]
    output_schema = AnalysisResult

When this agent completes a task, you are guaranteed to get an AnalysisResult or a validation error. Not a dict that might have the fields you need. Not a string response that requires parsing. A typed, validated output.

This is opt-in. Agents without a schema behave exactly as before. But if you are building agents that feed into downstream systems, pipelines, or other agents, defining an output schema is the most important thing you can do for reliability.

Sandbox Manifests: Code That Actually Runs

Until now, when CoderSubAgent wrote code, the LLM worked from its training data's picture of what might be available. Sometimes that picture is wrong. A module that exists in the broader Python ecosystem might not be loaded in the sandbox. A utility function you defined on the agent might not be visible to the code generator.

1.1.0 adds a SANDBOX ENVIRONMENT section to the CoderSubAgent system prompt at generation time. It lists every module and global that is pre-loaded in the sandbox namespace. The LLM writes against what is actually there. Import errors from hallucinated modules go away. Code generation becomes more precise because the model has ground truth about its execution environment.

You can introspect the manifest yourself via SandboxExecutor.get_manifest() if you want to understand what is available or debug generation issues.
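
To give a feel for what such a manifest involves, here is a minimal sketch that builds one from a namespace dictionary and renders it as a prompt section. The manifest shape, the function names, and the rendered layout are assumptions for illustration; the actual return value of SandboxExecutor.get_manifest() and the exact prompt text may differ.

```python
import json
import math
import types

def build_manifest(namespace: dict) -> dict:
    """List the modules and public globals present in a sandbox namespace."""
    modules = sorted(
        name for name, value in namespace.items()
        if isinstance(value, types.ModuleType)
    )
    public_globals = sorted(
        name for name, value in namespace.items()
        if not isinstance(value, types.ModuleType) and not name.startswith("_")
    )
    return {"modules": modules, "globals": public_globals}

def render_prompt_section(manifest: dict) -> str:
    """Format the manifest as text that could be appended to a system prompt."""
    return "\n".join([
        "SANDBOX ENVIRONMENT",
        "Modules: " + ", ".join(manifest["modules"]),
        "Globals: " + ", ".join(manifest["globals"]),
    ])

# A hypothetical sandbox namespace with two modules and one helper.
sandbox_namespace = {
    "json": json,
    "math": math,
    "fetch_url": lambda url: url,
    "_internal_state": 1,
}

manifest = build_manifest(sandbox_namespace)
section = render_prompt_section(manifest)
```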

semantic_success: Did It Actually Work?

This is the one I am most excited about, and also the one that is easiest to underestimate.

Every result payload in JarvisCore now includes a semantic_success field alongside the existing execution status. Execution status tells you whether the code ran without raising an exception. Semantic success tells you whether the output actually answered the question.

These are not the same thing. A financial analysis agent can produce a beautifully formatted report that is completely wrong. An extraction agent can return a valid JSON object that is missing the field the downstream system needed. Both complete "successfully" by execution metrics. Neither actually succeeded at the task.

CoderSubAgent now includes an evaluator hook that sets semantic_success based on whether the output satisfies the task goal. You can inspect it in the result dict:

result = await mesh.run_task(
    agent="analyst",
    task="Extract the revenue figures from the Q1 report",
    complexity="standard",
)

if result.get("semantic_success"):
    # The agent is confident the output answers the question
    process(result["payload"])
else:
    # The task ran but the output may not have addressed the goal
    escalate(result)

This is the beginning of agents that are honest about their own outputs. The framework can now surface uncertainty rather than hiding it behind a clean execution status.

SemVer, Finally

One thing that has frustrated people building on JarvisCore is the version numbering. Features have shipped in patch releases. Breaking changes appeared in versions that looked like minor updates. That ends with 1.1.0.

From this release forward: breaking API changes increment the major version. New capabilities increment the minor version. Bug fixes increment the patch version. We are committed to this. It means you can pin jarviscore-framework>=1.1.0,<2.0.0 and not get surprises.
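
The rules above are simple enough to write down as code. This sketch handles plain MAJOR.MINOR.PATCH strings only (no pre-release or build suffixes), and the function names and change-type labels are hypothetical. For real dependency resolution, rely on pip rather than a helper like this.

```python
def parse(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    return parts

def bump(version: str, change: str) -> str:
    """Next version for a given kind of change."""
    major, minor, patch = parse(version)
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")

def satisfies_pin(version: str, low: str = "1.1.0", high: str = "2.0.0") -> bool:
    """True if low <= version < high, mirroring '>=1.1.0,<2.0.0'."""
    return parse(low) <= parse(version) < parse(high)

next_after_fix = bump("1.1.0", "fix")
next_after_feature = bump("1.1.0", "feature")
next_after_breaking = bump("1.1.0", "breaking")
```

Under this scheme every fix and feature release stays inside the pinned range, and only a breaking change moves outside it.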

Upgrade and What to Expect

pip install "jarviscore-framework>=1.1.0"

If you are coming from 1.0.2 or earlier: update any Mesh(mode=...) calls, update your ListenerAgent imports if you have them, and update HITL calls to use HITLCategory. Run your test suite.

If you want to start getting value from the new capabilities immediately: add output_schema to one agent. Pick a critical one in your pipeline. Define what it should produce. Watch it validate. That single change will surface assumptions you did not know you were making.

The full changelog and migration guide are on GitHub. The JarvisCore documentation site covers every concept, guide, and reference you need. If you hit something unexpected or want to discuss the direction of the framework, open an issue or find us in the community.

This is the version that makes building production agents with JarvisCore worthwhile. I am genuinely excited to see what teams build with it.