Every agent memory system has the same problem: it doesn't know how much context you actually need.
Ask "what color is Alice's car?" after a conversation about Alice, and the system runs full vector search, graph expansion, and index compression — burning 3,000 tokens of context to answer a question the last 5 messages already cover.
Ask "how does our pricing model relate to the Q3 engineering roadmap?" and you genuinely need cross-domain retrieval across your entire knowledge base.
Same API call. Wildly different context requirements. Same token bill.
## The Cost of Always-On Full Recall
If you're running an agent loop with AllSource Prime embedded, every `prime_context` call currently does this:

```
query → generate compressed index (~800 tokens)
      → vector similarity search (top-k results)
      → BFS graph expansion (1-2 hops from matches)
      → serialize + return (~2000-5000 tokens)
```
Our recall bench shows this achieves 89% F1 on cross-domain queries. Impressive. But when you profile real agent conversations, you find:
- ~30% of turns need no memory at all (tool calls, confirmations, clarifications)
- ~40% of turns are follow-ups in the same conversation thread
- ~30% of turns are genuinely new topics or cross-domain questions
You're paying the full retrieval cost on 100% of turns. The L2 bill on the 70% that don't need it is pure waste.
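The waste is easy to quantify. A minimal sketch, assuming illustrative per-tier token midpoints (150/900/3,200 — not measured values) and the turn mix above:

```rust
// Sketch: expected per-turn context tokens under a given turn mix.
// Each entry is (share of turns, tokens that tier costs).
fn expected_tokens(mix: &[(f64, u32)]) -> f64 {
    mix.iter().map(|(share, tokens)| share * f64::from(*tokens)).sum()
}
```

With the 30/40/30 split routed to L0/L1/L2 respectively, the expected cost per turn drops from 3,200 tokens to roughly 1,365 — the same shape of saving the worked table later in this post shows end to end.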
## Three Tiers, One API

AllSource Prime now supports a `tier` parameter on `prime_context`:
| Tier | What it returns | Token budget | Latency | When to use |
|---|---|---|---|---|
| L0 | Graph stats + domain list | ~100-200 | <0.1ms | Tool-only turns, orientation, health checks |
| L1 | L0 + recent conversation nodes + 1-hop edges | ~500-1500 | <0.5ms | Follow-ups in same conversation |
| L2 | L1 + compressed index + vector search + graph expansion | ~2000-5000 | ~2-5ms | New topics, cross-domain questions |
The default is L2, so existing integrations don't change. But if you're building an agent loop and you control the orchestrator, you can now match retrieval depth to the actual need.
### L0: Stats Only

```json
{
  "name": "prime_context",
  "arguments": {
    "query": "",
    "tier": "L0"
  }
}
```

Returns:

```json
{
  "tier": "L0",
  "stats": {
    "total_nodes": 847,
    "total_edges": 2103,
    "nodes_by_type": {"person": 42, "concept": 215, "project": 89, ...},
    "edges_by_relation": {"works_on": 156, "impacts": 89, ...}
  },
  "token_count": 127,
  "nodes": [],
  "vectors": [],
  "index": ""
}
```

127 tokens. No index generation, no vector search, no graph walk. Use this when the agent just needs to know the shape of memory — how many entities exist, which domains are populated, whether there's anything to query at all.
Use cases:
- Agent startup orientation ("do I have any memory?")
- Health checks in long-running loops
- Decision point: "is this worth a deeper query?"
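The last use case can be made concrete. A hypothetical sketch of an L0 call as a cheap gate before paying for L1/L2 — `PrimeStats` mirrors the stats shape shown above, and `should_query_deeper` is an illustrative helper, not part of the AllSource API:

```rust
// Illustrative: `PrimeStats` is a stand-in matching the L0 response shape.
struct PrimeStats {
    total_nodes: u64,
}

// Gate a deeper (L1/L2) query on whether memory has anything in it at all.
fn should_query_deeper(stats: &PrimeStats) -> bool {
    // An empty graph can't answer anything, so skip retrieval entirely.
    stats.total_nodes > 0
}
```

At ~127 tokens, this check costs about 4% of a full L2 call.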
### L1: Conversation Context

```json
{
  "name": "prime_context",
  "arguments": {
    "query": "what was the decision?",
    "tier": "L1",
    "conversation_id": "conv-2026-03-22-alice"
  }
}
```

Returns:

```json
{
  "tier": "L1",
  "stats": { "total_nodes": 847, ... },
  "nodes": [
    {"id": "decision-1", "type": "decision", "properties": {"name": "Migrate to Prime", "rationale": "..."}},
    {"id": "alice", "type": "person", "properties": {"name": "Alice", "role": "architect"}},
    {"id": "core-api", "type": "service", "properties": {"name": "Core API"}}
  ],
  "edges": [
    {"source": "decision-1", "target": "core-api", "relation": "impacts"},
    {"source": "alice", "target": "decision-1", "relation": "authored"}
  ],
  "token_count": 890,
  "vectors": [],
  "index": ""
}
```

890 tokens. L1 pulls the 20 most recent nodes from the current conversation plus their immediate neighbors — no vector search, no index compression. The `conversation_id` parameter scopes retrieval to nodes tagged with that conversation.
Without a conversation_id, L1 returns the 20 most recently updated nodes across all conversations. Still useful — it gives the agent a "what was I just doing?" view.
Use cases:
- Follow-up questions in an ongoing thread
- "Remind me what we decided about X" (when X was discussed recently)
- Agent loops that maintain conversation state across turns
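The L1 selection rule described above — optionally scope to a conversation, sort by recency, keep the 20 newest — can be sketched as follows. `Node` here is a simplified stand-in for the real node type, not the actual AllSource struct:

```rust
#[derive(Clone)]
struct Node {
    conversation_id: Option<String>,
    updated_at: u64,
}

// L1 node selection: filter to the conversation (when given), newest first, cap at 20.
fn l1_nodes(all: &[Node], conversation_id: Option<&str>) -> Vec<Node> {
    let mut picked: Vec<Node> = all
        .iter()
        .filter(|n| match conversation_id {
            Some(id) => n.conversation_id.as_deref() == Some(id),
            None => true, // unscoped: most recently updated across all conversations
        })
        .cloned()
        .collect();
    picked.sort_by(|a, b| b.updated_at.cmp(&a.updated_at)); // newest first
    picked.truncate(20);
    picked
}
```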
### L2: Full Recall

```json
{
  "name": "prime_context",
  "arguments": {
    "query": "how does the pricing model relate to engineering capacity?",
    "tier": "L2"
  }
}
```

Same as the existing `prime_context` behavior: compressed index, vector search, graph expansion. Use it when the question genuinely spans domains or introduces a new topic.
## Integrating with Your Agent Loop
If you're embedding AllSource Core in Rust and running your own orchestrator, here's how to wire tiered loading into a typical ReAct loop.
### Setup: Shared Projections

The key change is constructing `RecallEngine` with `RecallDeps` from Prime, so the L0 and L1 tiers read the same projections Prime uses for graph queries:
```rust
use allsource_core::prime::Prime;
use allsource_core::prime::recall::{RecallEngine, IndexConfig};

let prime = Prime::open("~/.agent/memory").await?;

// Share Prime's projections with RecallEngine
let recall = RecallEngine::with_deps(
    prime.recall_deps(),
    &IndexConfig::default(),
);
```

Previously, `RecallEngine::new()` created its own standalone projections — they wouldn't see data ingested through Prime. `with_deps()` solves this by sharing `NodeStateProjection`, `AdjacencyListProjection`, `GraphStatsProjection`, and `CrossDomainProjection` from the `Prime` instance.
### Tier Selection in a ReAct Loop
```rust
use allsource_core::prime::recall::{ContextTier, RecallContextQuery};

fn select_tier(turn: &AgentTurn, prev_conversation_id: Option<&str>) -> ContextTier {
    // No memory needed for pure tool execution
    if turn.is_tool_result() || turn.is_confirmation() {
        return ContextTier::L0;
    }
    // Follow-up in the same conversation → recent context is enough
    if turn.conversation_id() == prev_conversation_id {
        return ContextTier::L1;
    }
    // New topic or cross-domain → full recall
    ContextTier::L2
}

// In the loop:
let tier = select_tier(&turn, last_conv_id.as_deref());
let context = recall.context(RecallContextQuery {
    query: turn.user_message().to_string(),
    tier,
    conversation_id: turn.conversation_id().map(String::from),
    ..Default::default()
}).await;

// Inject context into the system prompt
let system_prompt = match context.tier {
    ContextTier::L0 => format!(
        "You have {} entities in memory across {} types.",
        context.stats.as_ref().map_or(0, |s| s.total_nodes),
        context.stats.as_ref().map_or(0, |s| s.nodes_by_type.len()),
    ),
    ContextTier::L1 => format!(
        "Recent context ({} nodes, {} edges):\n{}",
        context.nodes.len(),
        context.edges.len(),
        format_nodes_for_prompt(&context.nodes),
    ),
    ContextTier::L2 => format!(
        "Knowledge index:\n{}\n\nRelevant nodes: {}",
        context.index,
        format_nodes_for_prompt(&context.nodes),
    ),
};
```

## MCP Server: Already Wired
If you're running `allsource-prime` as an MCP server (stdio or HTTP), tiered loading is already available. The `prime_context` tool accepts `tier` and `conversation_id` parameters:
```typescript
// In your MCP client (TypeScript, Python, etc.)
const result = await mcpClient.callTool("prime_context", {
  query: "what's the status?",
  tier: "L1",
  conversation_id: currentConversationId,
});
// result.content[0].text contains the tier, stats, nodes, edges
```

No SDK changes needed. The MCP wire format is the same — just pass the new parameters.
## The Token Math
Here's what tiered loading looks like across a realistic 20-turn agent conversation:
| Turn | Action | Old (always L2) | With Tiers | Tier Used |
|---|---|---|---|---|
| 1 | "What projects are active?" | 3,200 | 3,200 | L2 |
| 2 | "Tell me more about Project Alpha" | 3,200 | 3,200 | L2 |
| 3 | "Who's the lead?" | 3,200 | 890 | L1 |
| 4 | "What's their background?" | 3,200 | 920 | L1 |
| 5 | Run search tool | 3,200 | 127 | L0 |
| 6 | "Summarize the results" | 3,200 | 890 | L1 |
| 7 | "How does this relate to Q3 goals?" | 3,200 | 3,200 | L2 |
| 8-12 | Follow-ups on Q3 | 16,000 | 4,450 | L1 |
| 13 | Confirmation ("yes, do it") | 3,200 | 127 | L0 |
| 14-18 | More follow-ups | 16,000 | 4,450 | L1 |
| 19 | Tool execution | 3,200 | 127 | L0 |
| 20 | "Anything else I should know?" | 3,200 | 3,200 | L2 |
| Total | | 64,000 | 24,781 | |
61% reduction. Same accuracy on the turns that matter (L2 for new topics and cross-domain). Zero accuracy loss on follow-ups (L1 has the conversation context). Zero wasted tokens on tool calls (L0).
At $3/M input tokens (Claude Sonnet), that's the difference between $0.19 and $0.07 per conversation. Scale to 10K daily conversations and you're saving $1,200/day.
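The table's arithmetic checks out; per-turn costs below are taken directly from the rows above:

```rust
// Totals for the 20-turn example: always-L2 vs. tiered.
fn conversation_totals() -> (u32, u32) {
    let old = 3_200 * 20; // every turn at L2
    let tiered: u32 = [
        3_200, 3_200, 890, 920, 127, 890, 3_200, // turns 1-7
        4_450,                                   // turns 8-12: five L1 follow-ups
        127,                                     // turn 13: confirmation at L0
        4_450,                                   // turns 14-18: five more L1 follow-ups
        127, 3_200,                              // turns 19-20
    ]
    .iter()
    .sum();
    (old, tiered)
}
```

At $3 per million input tokens, 64,000 tokens is $0.192 and 24,781 is $0.074 per conversation, matching the figures above.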
## When NOT to Use Lower Tiers
L1 and L0 have genuine limitations. Use L2 when:
- The question spans domains. "How does X relate to Y?" where X and Y are in different domains. L1 only has conversation-scoped nodes — it won't find cross-domain edges unless both domains appeared in the conversation.
- The user introduces a new topic. If the conversation was about engineering and the user asks about revenue, L1 will return engineering nodes. You need L2's vector search to find revenue context.
- You need the compressed index. The index is a ~800-token summary of the entire knowledge base. L1 doesn't generate it. If the agent needs to orient across all domains, use L2.
A simple heuristic: if the user's question references something that wasn't in the last 20 messages, use L2.
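A hypothetical sketch of that heuristic: escalate to L2 when the query mentions a content word that never appeared in the recent message window. `recent_text` stands in for the concatenated last ~20 messages; real implementations would want entity extraction rather than word matching:

```rust
// Crude escalation check: any unfamiliar content word in the query → full recall.
fn needs_full_recall(query: &str, recent_text: &str) -> bool {
    let recent = recent_text.to_lowercase();
    query
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| w.len() > 3) // crude proxy for content-bearing words
        .any(|w| !recent.contains(&w))
}
```

This errs toward L2 on false positives, which is the safe direction: the cost of an unnecessary L2 call is tokens, while the cost of a wrong L1 call is a bad answer.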
## Architecture: What Changed
For those embedding allsource-core directly, the changes are additive:
New types:
- `ContextTier` enum (`L0`, `L1`, `L2`) — default `L2`
- `RecallDeps` struct — bundles shared projection references
- `RecallContext` gains `stats: Option<PrimeStats>` and `tier: ContextTier` fields
New constructor:
- `RecallEngine::with_deps(deps, config)` — accepts `RecallDeps` from `Prime::recall_deps()`
New query fields:
- `RecallContextQuery.tier` — which tier to use (default `L2`)
- `RecallContextQuery.conversation_id` — scope L1 to a conversation
No breaking changes. `RecallEngine::new()` still works. `RecallContextQuery::default()` still returns L2. All existing tests pass.
### Data Flow Per Tier
```
L0: query → GraphStatsProjection.stats()
          → serialize → return (~0.1ms, ~150 tokens)

L1: query → L0
          + NodeStateProjection.all_nodes()
            → filter by conversation_id
            → sort by updated_at, take 20
          + AdjacencyListProjection.outgoing(node_ids)
            → 1-hop edge expansion
          → serialize → return (~0.5ms, ~900 tokens)

L2: query → IndexCompressor.compress()
          + vector_search(query_embedding, top_k)
          + BFS graph expansion(matches, depth)
          → serialize → return (~3ms, ~3000 tokens)
```
L0 does no I/O. L1 reads from in-memory DashMap projections only. L2 is the only tier that generates the compressed index and runs vector search.
## Getting Started
If you're embedding Core in Rust:
```rust
// 1. Build RecallEngine with shared deps
let recall = RecallEngine::with_deps(prime.recall_deps(), &IndexConfig::default());

// 2. Query with tier
let ctx = recall.context(RecallContextQuery {
    query: "...".into(),
    tier: ContextTier::L1,
    conversation_id: Some("my-conv".into()),
    ..Default::default()
}).await;

// 3. Use ctx.stats, ctx.nodes, ctx.edges, ctx.tier
```

If you're using the MCP server:
```json
{"method": "tools/call", "params": {
  "name": "prime_context",
  "arguments": {"query": "...", "tier": "L1", "conversation_id": "my-conv"}
}}
```

If you're using an HTTP client:
```bash
curl -X POST http://localhost:3905/api/v1/prime/context \
  -H "Content-Type: application/json" \
  -d '{"query": "...", "tier": "L1", "conversation_id": "my-conv"}'
```

The `tier` parameter is the same across all interfaces. Start with L2 everywhere, then optimize the hot path.
## What's Next
Auto-tier selection (feature-gated, opt-in): Let the engine pick the tier based on conversation state. Same conversation + same domain → L1. New topic → L2. No query text → L0. This removes the orchestrator-side heuristic entirely.
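Those three rules are simple enough to sketch. `ContextTier` is redeclared here purely for illustration, and whether a turn is "same conversation, same domain" is assumed to come from orchestrator-side state; the feature-gated engine-side version is not yet published:

```rust
#[derive(Debug, PartialEq)]
enum ContextTier {
    L0,
    L1,
    L2,
}

// Auto-tier rules as described: empty query → L0, same thread → L1, else L2.
fn auto_tier(query: &str, same_conversation: bool, same_domain: bool) -> ContextTier {
    if query.trim().is_empty() {
        ContextTier::L0 // no query text → stats only
    } else if same_conversation && same_domain {
        ContextTier::L1 // continuing the thread → recent context suffices
    } else {
        ContextTier::L2 // new topic or new domain → full recall
    }
}
```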
Recall bench comparisons: Per-tier accuracy and token cost breakdowns in the CrossRef benchmark suite, so you can validate the accuracy/cost tradeoff for your specific knowledge domain.
The code is in `apps/core/src/prime/recall/`. The PRD is in `docs/proposals/prd-tiered-context-loading.md`.

