
Tiered Context Loading: Cut Agent Memory Costs by 60% Without Losing Recall


Every agent memory system has the same problem: it doesn't know how much context you actually need.

Ask "what color is Alice's car?" after a conversation about Alice, and the system runs full vector search, graph expansion, and index compression — burning 3,000 tokens of context to answer a question the last 5 messages already cover.

Ask "how does our pricing model relate to the Q3 engineering roadmap?" and you genuinely need cross-domain retrieval across your entire knowledge base.

Same API call. Wildly different context requirements. Same token bill.

The Cost of Always-On Full Recall

If you're running an agent loop with AllSource Prime embedded, every prime_context call currently does this:

query → generate compressed index (~800 tokens)
      → vector similarity search (top-k results)
      → BFS graph expansion (1-2 hops from matches)
      → serialize + return (~2000-5000 tokens)

Our recall bench shows this achieves 89% F1 on cross-domain queries. Impressive. But when you profile real agent conversations, you find:

  • ~30% of turns need no memory at all (tool calls, confirmations, clarifications)
  • ~40% of turns are follow-ups in the same conversation thread
  • ~30% of turns are genuinely new topics or cross-domain questions

You're paying the full retrieval cost on 100% of turns. The L2 bill on the 70% that don't need it is pure waste.
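To put a number on that waste, here's a back-of-the-envelope sketch of the blended per-turn cost. The per-tier budgets used (~150, ~900, ~3,200 tokens) are illustrative midpoints, not measured values:

```rust
// Blended retrieval cost per turn under the observed turn mix:
// 30% tool/confirmation turns (L0), 40% follow-ups (L1), 30% new topics (L2).
fn expected_tokens_per_turn(l0: f64, l1: f64, l2: f64) -> f64 {
    0.30 * l0 + 0.40 * l1 + 0.30 * l2
}

// Fraction of tokens saved versus paying full L2 recall on every turn.
fn savings_vs_always_l2(l0: f64, l1: f64, l2: f64) -> f64 {
    1.0 - expected_tokens_per_turn(l0, l1, l2) / l2
}
```

With those midpoints the blended cost comes to ~1,365 tokens per turn instead of 3,200, a ~57% reduction, in the same ballpark as the 61% the worked 20-turn example later in this post arrives at.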

Three Tiers, One API

AllSource Prime now supports a tier parameter on prime_context:

Tier  What it returns                                          Token budget  Latency  When to use
L0    Graph stats + domain list                                ~100-200      <0.1ms   Tool-only turns, orientation, health checks
L1    L0 + recent conversation nodes + 1-hop edges             ~500-1500     <0.5ms   Follow-ups in same conversation
L2    L1 + compressed index + vector search + graph expansion  ~2000-5000    ~2-5ms   New topics, cross-domain questions

The default is L2, so existing integrations don't change. But if you're building an agent loop and you control the orchestrator, you can now match retrieval depth to the actual need.

L0: Stats Only

{
  "name": "prime_context",
  "arguments": {
    "query": "",
    "tier": "L0"
  }
}

Returns:

{
  "tier": "L0",
  "stats": {
    "total_nodes": 847,
    "total_edges": 2103,
    "nodes_by_type": {"person": 42, "concept": 215, "project": 89, ...},
    "edges_by_relation": {"works_on": 156, "impacts": 89, ...}
  },
  "token_count": 127,
  "nodes": [],
  "vectors": [],
  "index": ""
}

127 tokens. No index generation, no vector search, no graph walk. Use this when the agent just needs to know the shape of memory — how many entities exist, which domains are populated, whether there's anything to query at all.

Use cases:

  • Agent startup orientation ("do I have any memory?")
  • Health checks in long-running loops
  • Decision point: "is this worth a deeper query?"
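As a sketch of that last decision point: the L0 stats payload alone is enough to short-circuit deeper queries. The `PrimeStats` struct below is a hand-rolled mirror of the JSON above for illustration, not the library's actual type:

```rust
use std::collections::HashMap;

// Hand-rolled mirror of the L0 "stats" payload shown above.
struct PrimeStats {
    total_nodes: u64,
    nodes_by_type: HashMap<String, u64>,
}

// Decide from stats alone whether an L2 query can possibly pay off.
fn worth_deeper_query(stats: &PrimeStats, wanted_type: &str) -> bool {
    if stats.total_nodes == 0 {
        return false; // empty memory: nothing to retrieve at any tier
    }
    // Escalate only if the node type this turn cares about is populated.
    stats.nodes_by_type.get(wanted_type).copied().unwrap_or(0) > 0
}
```

An orchestrator can run this check against a cached L0 response and skip the multi-thousand-token L2 round trip whenever it cannot help.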

L1: Conversation Context

{
  "name": "prime_context",
  "arguments": {
    "query": "what was the decision?",
    "tier": "L1",
    "conversation_id": "conv-2026-03-22-alice"
  }
}

Returns:

{
  "tier": "L1",
  "stats": { "total_nodes": 847, ... },
  "nodes": [
    {"id": "decision-1", "type": "decision", "properties": {"name": "Migrate to Prime", "rationale": "..."}},
    {"id": "alice", "type": "person", "properties": {"name": "Alice", "role": "architect"}},
    {"id": "core-api", "type": "service", "properties": {"name": "Core API"}}
  ],
  "edges": [
    {"source": "decision-1", "target": "core-api", "relation": "impacts"},
    {"source": "alice", "target": "decision-1", "relation": "authored"}
  ],
  "token_count": 890,
  "vectors": [],
  "index": ""
}

890 tokens. L1 pulls the 20 most recent nodes from the current conversation plus their immediate neighbors — no vector search, no index compression. The conversation_id parameter scopes retrieval to nodes tagged with that conversation.

Without a conversation_id, L1 returns the 20 most recently updated nodes across all conversations. Still useful — it gives the agent a "what was I just doing?" view.

Use cases:

  • Follow-up questions in an ongoing thread
  • "Remind me what we decided about X" (when X was discussed recently)
  • Agent loops that maintain conversation state across turns
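The recency selection described above (scope by conversation, sort by update time, take 20) amounts to a plain filter-sort-truncate. This is a standalone illustration with a simplified node shape, not the engine's actual code:

```rust
#[derive(Clone)]
struct Node {
    id: String,
    conversation_id: Option<String>,
    updated_at: u64, // e.g. unix millis
}

// Mirror of L1's selection: scope to a conversation if one is given,
// then keep the most recently updated nodes up to `limit`.
fn select_recent(nodes: &[Node], conversation_id: Option<&str>, limit: usize) -> Vec<Node> {
    let mut picked: Vec<Node> = nodes
        .iter()
        .filter(|n| match conversation_id {
            Some(cid) => n.conversation_id.as_deref() == Some(cid),
            None => true, // no scope: recency across all conversations
        })
        .cloned()
        .collect();
    picked.sort_by(|a, b| b.updated_at.cmp(&a.updated_at)); // newest first
    picked.truncate(limit);
    picked
}
```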

L2: Full Recall

{
  "name": "prime_context",
  "arguments": {
    "query": "how does the pricing model relate to engineering capacity?",
    "tier": "L2"
  }
}

Same as the existing prime_context behavior. Compressed index, vector search, graph expansion. Use when the question genuinely spans domains or introduces a new topic.

Integrating with Your Agent Loop

If you're embedding AllSource Core in Rust and running your own orchestrator, here's how to wire tiered loading into a typical ReAct loop.

Setup: Shared Projections

The key change is constructing RecallEngine with RecallDeps from Prime, so L0 and L1 tiers have access to the same projections Prime uses for graph queries:

use allsource_core::prime::Prime;
use allsource_core::prime::recall::{RecallEngine, IndexConfig};
 
let prime = Prime::open("~/.agent/memory").await?;
 
// Share Prime's projections with RecallEngine
let recall = RecallEngine::with_deps(
    prime.recall_deps(),
    &IndexConfig::default(),
);

Previously, RecallEngine::new() created its own standalone projections — they wouldn't see data ingested through Prime. with_deps() solves this by sharing NodeStateProjection, AdjacencyListProjection, GraphStatsProjection, and CrossDomainProjection from the Prime instance.

Tier Selection in a ReAct Loop

use allsource_core::prime::recall::{ContextTier, RecallContextQuery};
 
fn select_tier(
    turn: &AgentTurn,
    prev_conversation_id: Option<&str>,
) -> ContextTier {
    // No memory needed for pure tool execution
    if turn.is_tool_result() || turn.is_confirmation() {
        return ContextTier::L0;
    }
 
    // Follow-up in same conversation → recent context is enough
    if turn.conversation_id() == prev_conversation_id {
        return ContextTier::L1;
    }
 
    // New topic or cross-domain → full recall
    ContextTier::L2
}
 
// In the loop:
let tier = select_tier(&turn, last_conv_id.as_deref());
let context = recall.context(RecallContextQuery {
    query: turn.user_message().to_string(),
    tier,
    conversation_id: turn.conversation_id().map(String::from),
    ..Default::default()
}).await;
 
// Inject context into the system prompt
let system_prompt = match context.tier {
    ContextTier::L0 => format!(
        "You have {} entities in memory across {} types.",
        context.stats.as_ref().map_or(0, |s| s.total_nodes),
        context.stats.as_ref().map_or(0, |s| s.nodes_by_type.len()),
    ),
    ContextTier::L1 => format!(
        "Recent context ({} nodes, {} edges):\n{}",
        context.nodes.len(),
        context.edges.len(),
        format_nodes_for_prompt(&context.nodes),
    ),
    ContextTier::L2 => format!(
        "Knowledge index:\n{}\n\nRelevant nodes: {}",
        context.index,
        format_nodes_for_prompt(&context.nodes),
    ),
};

MCP Server: Already Wired

If you're running allsource-prime as an MCP server (stdio or HTTP), tiered loading is already available. The prime_context tool accepts tier and conversation_id parameters:

// In your MCP client (TypeScript, Python, etc.)
const result = await mcpClient.callTool("prime_context", {
  query: "what's the status?",
  tier: "L1",
  conversation_id: currentConversationId,
});
 
// result.content[0].text contains the tier, stats, nodes, edges

No SDK changes needed. The MCP wire format is the same — just pass the new parameters.

The Token Math

Here's what tiered loading looks like across a realistic 20-turn agent conversation:

Turn   Action                                Old (always L2)  With Tiers  Tier Used
1      "What projects are active?"           3,200            3,200       L2
2      "Tell me more about Project Alpha"    3,200            3,200       L2
3      "Who's the lead?"                     3,200            890         L1
4      "What's their background?"            3,200            920         L1
5      Run search tool                       3,200            127         L0
6      "Summarize the results"               3,200            890         L1
7      "How does this relate to Q3 goals?"   3,200            3,200       L2
8-12   Follow-ups on Q3                      16,000           4,450       L1
13     Confirmation ("yes, do it")           3,200            127         L0
14-18  More follow-ups                       16,000           4,450       L1
19     Tool execution                        3,200            127         L0
20     "Anything else I should know?"        3,200            3,200       L2
Total                                        64,000           24,781

61% reduction. Same accuracy on the turns that matter (L2 for new topics and cross-domain). Zero accuracy loss on follow-ups (L1 has the conversation context). Zero wasted tokens on tool calls (L0).

At $3/M input tokens (Claude Sonnet), that's the difference between $0.19 and $0.07 per conversation. Scale to 10K daily conversations and you're saving roughly $1,200/day.
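The arithmetic behind those figures, if you want to plug in your own model's pricing (the $3/M input rate is the one assumed above):

```rust
// Retrieval-token cost of one conversation at a $/1M-input-token rate.
fn conversation_cost(tokens: u64, usd_per_million: f64) -> f64 {
    tokens as f64 * usd_per_million / 1_000_000.0
}

// Daily savings from the per-conversation delta at a given volume.
fn daily_savings(old_tokens: u64, new_tokens: u64, usd_per_million: f64, conversations: u64) -> f64 {
    (conversation_cost(old_tokens, usd_per_million)
        - conversation_cost(new_tokens, usd_per_million))
        * conversations as f64
}
```

For the 20-turn example, that works out to $0.192 versus $0.074 per conversation, or $1,176.57/day at 10K conversations, which rounds to the $1,200 figure.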

When NOT to Use Lower Tiers

L1 and L0 have genuine limitations. Use L2 when:

  • The question spans domains. "How does X relate to Y?" where X and Y are in different domains. L1 only has conversation-scoped nodes — it won't find cross-domain edges unless both domains appeared in the conversation.
  • The user introduces a new topic. If the conversation was about engineering and the user asks about revenue, L1 will return engineering nodes. You need L2's vector search to find revenue context.
  • You need the compressed index. The index is a ~800-token summary of the entire knowledge base. L1 doesn't generate it. If the agent needs to orient across all domains, use L2.

A simple heuristic: if the user's question references something that wasn't in the last 20 messages, use L2.
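One naive way to implement that heuristic is a token-overlap check: if the query contains a content word that never appeared in the recent message window, treat it as a new topic and escalate to L2. This is an orchestrator-side sketch, not an AllSource API:

```rust
// Escalate to L2 if the query mentions something the recent window
// never did. Crude substring matching; a real orchestrator would use
// stemming, a stopword list, or embedding similarity instead.
fn references_unseen_topic(query: &str, recent_messages: &[&str]) -> bool {
    let recent = recent_messages.join(" ").to_lowercase();
    query
        .to_lowercase()
        .split_whitespace()
        .filter(|w| w.len() > 3) // skip stopword-sized tokens
        .any(|w| !recent.contains(w))
}
```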

Architecture: What Changed

For those embedding allsource-core directly, the changes are additive:

New types:

  • ContextTier enum (L0, L1, L2) — default L2
  • RecallDeps struct — bundles shared projection references
  • RecallContext gains stats: Option<PrimeStats> and tier: ContextTier fields

New constructor:

  • RecallEngine::with_deps(deps, config) — accepts RecallDeps from Prime::recall_deps()

New query fields:

  • RecallContextQuery.tier — which tier to use (default L2)
  • RecallContextQuery.conversation_id — scope L1 to a conversation

No breaking changes. RecallEngine::new() still works. RecallContextQuery::default() still returns L2. All existing tests pass.

Data Flow Per Tier

L0:  query → GraphStatsProjection.stats()
           → serialize → return                           (~0.1ms, ~150 tokens)

L1:  query → L0
           + NodeStateProjection.all_nodes()
             → filter by conversation_id
             → sort by updated_at, take 20
           + AdjacencyListProjection.outgoing(node_ids)
             → 1-hop edge expansion
           → serialize → return                           (~0.5ms, ~900 tokens)

L2:  query → IndexCompressor.compress()
           + vector_search(query_embedding, top_k)
           + BFS graph expansion(matches, depth)
           → serialize → return                           (~3ms, ~3000 tokens)

L0 does no I/O. L1 reads from in-memory DashMap projections only. L2 is the only tier that generates the compressed index and runs vector search.
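The cumulative relationship between the tiers can be sketched as a stage dispatch. The stage names here are shorthand for the projections and search steps above, and the enum is a local stand-in rather than the library's `ContextTier`:

```rust
enum ContextTier { L0, L1, L2 }

// Which pipeline stages run at each tier. Each tier is a strict
// superset of the one below it, matching the data flow above.
fn stages(tier: &ContextTier) -> Vec<&'static str> {
    let mut s = vec!["graph_stats"]; // L0: stats only, no I/O
    if matches!(tier, ContextTier::L1 | ContextTier::L2) {
        s.push("recent_nodes");      // L1: recency filter...
        s.push("one_hop_edges");     // ...plus 1-hop edge expansion
    }
    if matches!(tier, ContextTier::L2) {
        s.push("index_compression"); // L2 only: compressed index,
        s.push("vector_search");     // vector search,
        s.push("bfs_expansion");     // and BFS graph expansion
    }
    s
}
```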

Getting Started

If you're embedding Core in Rust:

// 1. Build RecallEngine with shared deps
let recall = RecallEngine::with_deps(prime.recall_deps(), &IndexConfig::default());
 
// 2. Query with tier
let ctx = recall.context(RecallContextQuery {
    query: "...".into(),
    tier: ContextTier::L1,
    conversation_id: Some("my-conv".into()),
    ..Default::default()
}).await;
 
// 3. Use ctx.stats, ctx.nodes, ctx.edges, ctx.tier

If you're using the MCP server:

{"method": "tools/call", "params": {
  "name": "prime_context",
  "arguments": {"query": "...", "tier": "L1", "conversation_id": "my-conv"}
}}

If you're using an HTTP client:

curl -X POST http://localhost:3905/api/v1/prime/context \
  -H "Content-Type: application/json" \
  -d '{"query": "...", "tier": "L1", "conversation_id": "my-conv"}'

The tier parameter is the same across all interfaces. Start with L2 everywhere, then optimize the hot path.

What's Next

Auto-tier selection (feature-gated, opt-in): Let the engine pick the tier based on conversation state. Same conversation + same domain → L1. New topic → L2. No query text → L0. This removes the orchestrator-side heuristic entirely.

Recall bench comparisons: Per-tier accuracy and token cost breakdowns in the CrossRef benchmark suite, so you can validate the accuracy/cost tradeoff for your specific knowledge domain.

The code is in apps/core/src/prime/recall/. The PRD is in docs/proposals/prd-tiered-context-loading.md.
