
Context & Memory

KosmoKrator continuously manages the LLM's context window so that conversations can run indefinitely without hitting token limits. A multi-stage pipeline reduces context pressure progressively, from cheap local operations to full LLM-based summarization. A complementary memory system persists important knowledge across sessions.

Context Pipeline Overview

Every time the agent prepares an LLM call, the ContextManager runs a pre-flight check. If the estimated token count exceeds a warning threshold, the pipeline activates. Each stage runs in order; earlier stages handle cheap, fast reductions while later stages are progressively more aggressive. Note that output truncation and deduplication run outside the pre-flight check — truncation happens during tool execution and deduplication runs on session load. Only pruning, compaction, and trimming run during the pre-flight.

| Stage | What it does | When it runs |
| --- | --- | --- |
| Output Truncation | 2,000 lines / 50 KB cap on tool results | During tool execution |
| Deduplication | Exact duplicates, stale reads, subsumed grep results | On session load |
| Pruning | Score-based placeholder replacement of low-value messages | Pre-flight |
| LLM Compaction | Summarize old messages via LLM; extract memories | Auto-compact threshold crossed |
| Oldest-Turn Trimming | Drop the oldest message to reclaim token budget | Emergency fallback |

The pipeline is designed so that most sessions never reach the later stages. Output truncation and deduplication handle the bulk of token reduction silently, keeping the conversation lean without any loss of important context.

Output Truncation

The OutputTruncator is the first line of defense. It processes every tool result the moment it comes back, before the result enters the conversation history. This prevents a single oversized output (such as a large file read or a verbose shell command) from consuming a disproportionate share of the context window.

Limits

  • Line limit: 2,000 lines maximum
  • Byte limit: 50 KB (50,000 bytes) maximum
  • Whichever limit is hit first triggers truncation

Behavior

When truncation occurs, the full untruncated output is first saved to disk at ~/.kosmokrator/data/truncations/. The truncated version that enters the conversation ends with a notice pointing to the saved file:

[truncated - full output saved to ~/.kosmokrator/data/truncations/tool_abc123.txt;
 inspect with targeted grep/file_read rather than pasting it back into context]

This means nothing is ever truly lost. The agent can re-read the full output via file_read if it needs a specific section, rather than loading the entire thing into context. Saved truncation files are automatically cleaned up after 7 days.
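As a rough illustration, the cap logic can be sketched in Python (a hypothetical sketch, not the actual OutputTruncator code; the saved-path handling and notice text are simplified):

```python
MAX_LINES = 2_000   # line limit
MAX_BYTES = 50_000  # byte limit

def truncate_output(output: str, saved_path: str) -> str:
    """Cap a tool result at 2,000 lines / 50 KB, whichever is hit first,
    and append a notice pointing at the saved full copy."""
    lines = output.splitlines(keepends=True)
    truncated = False

    if len(lines) > MAX_LINES:                      # line limit hit
        lines = lines[:MAX_LINES]
        truncated = True

    kept = "".join(lines)
    if len(kept.encode("utf-8")) > MAX_BYTES:       # byte limit hit
        kept = kept.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
        truncated = True

    if truncated:
        kept += f"\n[truncated - full output saved to {saved_path}]"
    return kept
```

In the real system the full output would be written to the truncations directory before the capped version enters the conversation.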

Tip: If the agent is working with large files or verbose commands, output truncation keeps the context window healthy automatically. You do not need to configure anything — the truncator runs on every tool result by default.

Deduplication

The ToolResultDeduplicator scans the conversation history and identifies redundant tool outputs. Superseded results are replaced with compact placeholder strings, freeing tokens without losing any information that is still current.

Deduplication applies three tiers, checked in order:

Tier 1: Exact Duplicates

If the same tool was called with the same arguments and produced the same result, earlier occurrences are replaced with:

[Superseded -- identical result returned by later call]

This covers cases where the agent re-runs a search, re-reads a file that has not changed, or re-executes a command with identical output. Applies to file_read, grep, and glob tools.

Tier 2: Stale File Reads

When a file is read, then edited (via file_edit or file_write), and then read again, the pre-edit read is now stale. The deduplicator detects this pattern and replaces the older read with:

[Superseded -- file was re-read after modification]

This is particularly effective during iterative editing sessions where the agent reads a file, makes changes, and reads it again to verify — the old version of the file no longer needs to occupy context.

Tier 3: Grep Subsumed by File Read

If the agent ran grep on a specific file and later performed a full file_read of the same file, the grep result is redundant because the full file content already includes the matched lines. The grep result is replaced with:

[Superseded -- content included in later file_read of filename.php]
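The tier-1 pass can be sketched as follows (illustrative Python, assuming tool results are records with tool, args, and result fields; tiers 2 and 3 layer the stale-read and subsumption checks on top of the same walk):

```python
PLACEHOLDER = "[Superseded -- identical result returned by later call]"

def dedupe_exact(history: list[dict]) -> list[dict]:
    """Replace earlier occurrences of identical (tool, args, result)
    calls with a compact placeholder, keeping the most recent copy."""
    seen: set[tuple] = set()
    out = []
    # Walk newest-to-oldest so the latest occurrence survives intact.
    for msg in reversed(history):
        key = (msg["tool"], repr(msg["args"]), msg["result"])
        if key in seen:
            msg = {**msg, "result": PLACEHOLDER}
        else:
            seen.add(key)
        out.append(msg)
    out.reverse()
    return out
```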

Tip: Deduplication runs automatically and has no configuration options. It only replaces results that are provably redundant — current results are never touched.

Pruning

The ContextPruner is a fast, non-LLM reduction pass that replaces old tool result content with lightweight placeholder strings. It runs before LLM-based compaction and can often free enough tokens to avoid compaction entirely.

Protection Rules

The pruner never touches recent context. The last 2 user turns and all messages after them are fully protected. Additionally, a configurable token budget (default: 40,000 tokens) of the most recent tool output before the protection boundary is preserved. Only tool results that fall outside both protections are candidates for pruning.

Importance Scoring

Each candidate tool result is scored by importance. Lower-scoring results are pruned first. The score factors include:

  • Tool type weight — Tools with typically larger, less reusable output are weighted as more pruneable; a higher weight lowers the importance score. Weights range from bash (70 — most pruneable) down to glob (10 — least pruneable). File edits and writes weigh 20, since their results often reflect important decisions.
  • Reference by later reasoning — If an assistant message after the tool result references the file name or quotes part of the result, the score increases by 15 per reference, making it more likely to be kept.
  • Decision language — If subsequent assistant messages contain phrases like "based on", "I'll use", or "the issue is" that suggest the tool result influenced a decision, the score increases by 10.
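One plausible formulation of this scoring, sketched in Python (the exact internal formula is not documented here; inverting the pruneability weight and applying flat per-message bonuses are assumptions):

```python
# Higher weight = more pruneable; bonuses make a result more likely to be kept.
TOOL_WEIGHTS = {
    "bash": 70, "shell_read": 65, "web_fetch": 55, "grep": 50,
    "web_search": 40, "file_read": 30, "file_edit": 20,
    "file_write": 20, "glob": 10,
}
DECISION_PHRASES = ("based on", "i'll use", "the issue is")

def importance_score(tool: str, later_messages: list[str],
                     target: str, snippet: str) -> int:
    """Score a candidate tool result; lower scores are pruned first."""
    score = 100 - TOOL_WEIGHTS.get(tool, 50)
    for msg in later_messages:
        if target in msg or snippet in msg:
            score += 15          # referenced by later reasoning
        if any(p in msg.lower() for p in DECISION_PHRASES):
            score += 10          # decision language suggests influence
    return score
```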

Context-Aware Placeholders

When a tool result is pruned, it is replaced with a placeholder that preserves the tool type and target path, so the agent still knows what was done even though the full output is gone:

[Old file_read output cleared for src/Agent/ContextManager.php]
[Old grep output cleared for src/Tool/]
[Old shell output cleared; inspect truncation storage or rerun targeted commands if needed]
[Old glob output cleared]
[Old tool result content cleared]

Minimum Savings

Pruning only activates if the estimated token savings exceed 20,000 tokens (configurable). This prevents churn from pruning a handful of small results that would not meaningfully help.

| Tool | Weight | Pruning priority |
| --- | --- | --- |
| bash | 70 | Highest (pruned first) |
| shell_read | 65 | High |
| web_fetch | 55 | High |
| grep | 50 | Medium |
| web_search | 40 | Medium |
| file_read | 30 | Lower |
| file_edit / file_write | 20 | Low (often important) |
| glob | 10 | Lowest (kept longest) |

LLM-Based Compaction

When pruning and deduplication are not sufficient and token usage crosses the auto-compact threshold, the ContextCompactor performs a full LLM-based summarization of older messages.

How It Works

  1. The conversation is split into old messages (to be summarized) and recent messages (to be kept verbatim). By default, the most recent 3 message turns are always preserved.
  2. Old messages are formatted into a plain-text transcript and sent to the LLM with a dedicated compaction system prompt. The LLM is instructed to only summarize, not respond to questions in the conversation.
  3. The LLM produces a structured summary covering: the user's goal, key decisions made, work accomplished (with specific file paths), work in progress, and relevant files.
  4. The old messages are replaced with a single system message containing the summary, followed by the preserved recent messages.
  5. The summary is also stored as a working memory (with a 14-day expiration) so that the context persists even if the session ends.
  6. A second LLM call extracts durable memories from the summary — facts about the codebase, user preferences, and technical decisions. These are saved permanently for cross-session recall.
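Step 1 can be sketched as a split on user-turn boundaries (hypothetical Python; the real ContextCompactor does considerably more bookkeeping):

```python
def split_for_compaction(messages: list[dict], keep_turns: int = 3):
    """Split history into (old, recent): the last `keep_turns` user turns
    and everything after them are kept verbatim; the rest is handed to
    the LLM for summarization."""
    user_idxs = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(user_idxs) <= keep_turns:
        return [], messages          # nothing old enough to summarize
    cut = user_idxs[-keep_turns]
    return messages[:cut], messages[cut:]
```

The old half is then replaced by a single system message containing the summary, followed by the recent half unchanged.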

Summary Format

The compaction prompt instructs the LLM to produce a structured summary:

## Goal
[What the user is trying to accomplish]

## Key Decisions
[Important technical choices, constraints, user preferences]

## Accomplished
[Work completed -- specific file paths and changes]

## In Progress
[Current task and what remains]

## Relevant Files
[Files read, edited, or created]

Protected Context

Certain messages are always preserved before the summary and never summarized away. The ProtectedContextBuilder assembles runtime environment facts that the LLM must always see: the current agent mode, working directory, git branch, and agent type/depth (for sub-agents). This is injected as a system message that cannot be overridden.

Circuit Breaker

If the compaction LLM call fails three times consecutively, the circuit breaker activates. While active, the system skips compaction entirely and falls back to oldest-turn trimming when context pressure is critical. The circuit breaker resets automatically once context pressure drops below the warning threshold.
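A minimal sketch of such a circuit breaker (illustrative Python, not the actual implementation):

```python
class CompactionCircuitBreaker:
    """Trips after 3 consecutive compaction failures; while tripped,
    compaction is skipped in favor of oldest-turn trimming."""
    FAILURE_LIMIT = 3

    def __init__(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0           # any success resets the count

    @property
    def tripped(self) -> bool:
        return self.failures >= self.FAILURE_LIMIT

    def reset_if_pressure_ok(self, below_warning: bool):
        # Resets automatically once context pressure drops
        # below the warning threshold.
        if below_warning:
            self.failures = 0
```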

Settings

| Setting | Default | Description |
| --- | --- | --- |
| auto_compact | on | Toggle automatic compaction on or off |
| compact_threshold | 60% of context window | Percentage of the context window at which compaction triggers. Also bounded by the auto_compact_buffer_tokens budget if configured. |

Tip: You can trigger compaction manually at any time with the /compact slash command. This is useful if you know a long tool output is no longer relevant and want to reclaim context space proactively.

Oldest-Turn Trimming

Oldest-turn trimming is the emergency fallback. It activates when all other strategies are insufficient and the token count hits the blocking threshold, or when the compaction circuit breaker is active.

  • Drops the single oldest message from the conversation history
  • Runs exactly once per agent loop iteration
  • No LLM call required — purely mechanical
  • Context quality degrades because there is no summarization

In practice, trimming is rare. The combination of truncation, deduplication, pruning, and compaction handles context pressure in the vast majority of sessions. Trimming exists as a safety net to ensure the agent never gets stuck due to context overflow.

Token Budgets

The ContextBudget class defines four budget values that control when context-management interventions occur. All thresholds are derived from the model's context window size minus configurable buffer values.

| Budget | Default | Purpose |
| --- | --- | --- |
| reserve_output_tokens | 16,000 | Headroom reserved for the LLM's response. Subtracted from the raw context window to produce the effective context window — the usable input token budget. |
| warning_buffer_tokens | 24,000 | When remaining input tokens drop below this buffer, warning-level interventions begin (pruning, deduplication). |
| auto_compact_buffer_tokens | 12,000 | When remaining input tokens drop below this buffer, automatic LLM-based compaction is triggered. |
| blocking_buffer_tokens | 3,000 | Hard stop. When remaining input tokens drop below this buffer, oldest-turn trimming activates immediately. This is the last-resort threshold. |

How Budgets Are Calculated

The system continuously tracks three components that make up the total token usage:

  • System prompt tokens — The assembled system prompt including base instructions, injected memories, session recall, mode suffix, parent brief, and active tasks.
  • Conversation tokens — All messages in the conversation history (user, assistant, tool results, system messages).
  • Tool schema tokens — The JSON schema definitions of all registered tools.

Token counts are estimated using the TokenEstimator, which uses a fast character-based heuristic (roughly 1 token per 3.2 characters) rather than a full tokenizer. This is accurate enough for budget decisions while being orders of magnitude faster.
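A sketch of the heuristic (the exact rounding behavior is an assumption):

```python
CHARS_PER_TOKEN = 3.2  # fast character-based heuristic, not a real tokenizer

def estimate_tokens(text: str) -> int:
    """Approximate token count for budget decisions; rounds up so
    estimates err on the conservative side."""
    return int(len(text) / CHARS_PER_TOKEN) + 1
```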

The intervention thresholds, shown here for a 200K context window:

  • Usable context (160K): available for the system prompt, conversation, and tool schemas
  • Warning: pruning begins
  • Auto-compact: LLM summarization
  • Blocking: force-trim the oldest turn
  • Reserved for LLM output (16K)

Tip: The context bar in the TUI shows a real-time percentage of context used. When it turns yellow, you are approaching the warning threshold. When it turns red, compaction is imminent or active.

Memory System

Memories are persistent knowledge fragments that survive across conversations. They allow the agent to remember facts about the codebase, your preferences, and key decisions made in previous sessions — without those sessions needing to be active.

Saving Memories

The agent uses the memory_save tool to create new memories. Each memory has three required fields:

| Field | Description |
| --- | --- |
| type | Category of the memory: project (facts about the codebase, architecture, patterns, conventions), user (your preferences, workflow style, corrections you have given), or decision (key technical choices and the reasoning behind them) |
| title | Short descriptive label (used in the system prompt injection and memory listings) |
| content | Full memory content — the actual knowledge to be preserved |

You can ask the agent to remember something explicitly ("remember that I prefer tabs over spaces") and it will call memory_save automatically.

Searching Memories

The agent uses the memory_search tool to find relevant memories by query. This is used both explicitly (when you ask "what do you remember about X?") and implicitly during system prompt assembly.

Automatic Memory Extraction

During context compaction, the LLM is asked to extract durable knowledge from the conversation summary. This extraction produces memories categorized as project, user, or decision. Only non-obvious insights are extracted — things that would not be apparent from reading the code alone.

This means important context persists even when the conversation history is summarized away. A decision made in turn 5 of a long session will be captured as a memory and available in all future sessions, even though the original conversation turns have been compacted.

After extraction, the session manager runs memory consolidation to prune expired memories and trim old compaction memories to the 10 most recent, preventing the memory store from growing without bound.
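Consolidation can be sketched as follows (illustrative Python; the field names kind, expires_at, and created_at are assumptions, and pinned memories are shown surviving per the retention rules below):

```python
from datetime import datetime

def consolidate(memories: list[dict], now: datetime,
                max_compaction: int = 10) -> list[dict]:
    """Drop expired memories, then keep only the newest `max_compaction`
    compaction summaries; pinned memories are never removed."""
    alive = [m for m in memories
             if m.get("pinned")
             or m.get("expires_at") is None
             or m["expires_at"] > now]
    compaction = sorted((m for m in alive if m["kind"] == "compaction"),
                        key=lambda m: m["created_at"], reverse=True)
    keep = {id(m) for m in compaction[:max_compaction]}
    return [m for m in alive
            if m["kind"] != "compaction" or id(m) in keep or m.get("pinned")]
```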

Memory Retention Classes

Each memory belongs to a retention class that determines its lifecycle:

| Class | Behavior | Typical use |
| --- | --- | --- |
| priority | Always injected first in the system prompt | Critical context that must always be visible to the agent |
| durable | Persists indefinitely; default for user-created and extracted memories | Project facts, user preferences, key decisions |
| working | May be garbage-collected after a period of disuse (typically 14 days for compaction summaries) | Session continuity summaries, temporary context |

Additionally, any memory can be pinned by setting its pinned boolean flag. A pinned memory is never automatically removed during consolidation, regardless of its retention class. A memory can be both durable and pinned simultaneously.

Memory Injection into System Prompt

The MemoryInjector formats stored memories into structured sections that are appended to the system prompt. Memories are organized by class and type:

# Memories

## Priority Context
- Critical architecture note: The API gateway uses rate limiting per tenant...

## Project Knowledge
- Database schema: Uses PostgreSQL with UUID primary keys... (2026-03-15)
- Test conventions: All tests extend BaseTestCase... (2026-03-20)

## User Preferences
- Code style: Prefers early returns over nested conditionals...
- Communication: Concise responses, no filler text...

## Key Decisions
- Auth approach: JWT with refresh tokens, chosen over session cookies... (2026-03-18)

## Working Memory
- Previous session summary: Implemented the payment webhook handler...

## Previous Sessions
- [2026-04-01] Refactored the notification system...
- [2026-03-28] Added CSV export for user reports...

Up to 6 relevant memories are injected per turn (configurable). Working memory is capped at the 5 most recent entries to limit context size. Compaction summaries from previous sessions show at most 3 entries.

Memory Commands

| Command | Description |
| --- | --- |
| /memories | List all stored memories with their type, class, and creation date |
| /forget <id> | Delete a specific memory by its ID |

Memories are stored in a SQLite database at ~/.kosmokrator/data/kosmokrator.db. They are scoped to the current project directory, so memories saved while working in ~/projects/alpha will not appear when working in ~/projects/beta.

Session Management

Sessions provide continuity across terminal restarts. Every conversation is automatically saved, and you can resume any previous session with its full history, tool results, and context intact.

Session Commands

| Command | Description |
| --- | --- |
| /sessions | List recent sessions with dates and model names |
| /resume | Pick a session to resume interactively (shows conversation preview) |
| /new | Start a fresh session (the current session is auto-saved) |
| /compact | Manually trigger context compaction on the current session |

CLI Flags

| Flag | Description |
| --- | --- |
| --resume | Resume the most recent session automatically on startup |
| --session <id> | Resume a specific session by its ID |

Session History Recall

When the memory system is enabled, the agent can search across previous sessions for relevant context. This is different from resuming a session — it pulls in snippets from past conversations that are relevant to the current query, formatted as a "Session Recall" section in the system prompt. Up to 3 relevant session fragments are included.

System Prompt Assembly

The system prompt is assembled by the ContextManager from multiple layers, each adding domain-specific information:

  1. Base prompt — The core system prompt defining the agent's role, capabilities, tool usage conventions, and general behavior rules.
  2. Relevant memories — Selected by similarity to recent messages. The MemoryInjector formats up to 6 memories into structured sections (priority context, project knowledge, user preferences, key decisions, working memory, previous sessions). This memory block is built once on the first turn and then frozen for cache stability — it is not rebuilt on subsequent turns.
  3. Session history recall — Relevant fragments from previous sessions, found by searching the session history against the current user query (up to 3 results).
  4. Mode-specific suffix — Behavioral rules for the current agent mode:
    • Edit mode — Full tool access, write permissions, standard behavior
    • Plan mode — Read-only tools, no modifications, focused on analysis and planning
    • Ask mode — Conversational, no tool use, answers from existing knowledge and context
  5. Parent brief — When running as a subagent, the parent agent's task description and constraints are injected so the subagent understands its role in the broader workflow.
  6. Active tasks — A rendered tree of the current task tracking state, so the agent is aware of pending work items and their status.

While most of the prompt is rebuilt every turn (mode, tasks, and other layers can change between turns), the memory and session recall block is built once and then frozen for cache stability. The token cost of the system prompt is included in the ContextBudget calculations.

Tip: Memory injection is suppressed during token estimation to avoid side effects (such as marking memories as "surfaced"). The estimation uses a read-only pass of the prompt builder.