Warden by Bitmill
Documentation

Session Intelligence

Warden tracks multiple signals in real time to model session health:

| Signal | What it detects |
| --- | --- |
| Focus score | How concentrated the agent’s work is. Drops when scope widens or subsystems change rapidly. |
| Loop detection | Repeated command patterns that aren’t producing progress. |
| Verification debt | Edit-heavy sessions that aren’t running tests or builds. |
| Drift velocity | How quickly the session is diverging from its original goal. |
| Trust score | A composite of errors, verification debt, subsystem switches, and dead ends. High trust = fewer interventions. |
| Phase tracking | Whether the session is in exploration, implementation, testing, or finishing. |

When signals indicate degradation, Warden emits a targeted advisory. When the session is healthy, it stays completely silent.

Session Phases

Warden models every session as moving through five phases. The current phase determines intervention thresholds — how sensitive Warden is to signals and how aggressively it intervenes.

| Phase | Turns | Behavior |
| --- | --- | --- |
| Warmup | 0-5 | The agent is reading files, understanding the task, running initial commands. Warden is lenient — exploratory reads and broad searches are expected. |
| Productive | 6-30 | The core working phase. The agent is making edits, running builds, iterating. Warden watches for loops and drift but keeps a light touch. |
| Exploring | varies | The agent has shifted from its original task to investigate something tangential. This isn’t necessarily bad, but Warden tracks how far the exploration goes. |
| Struggling | varies | Error rates are climbing, commands are repeating, the same files are being edited without progress. Warden increases advisory frequency. |
| Late | 80+ | Deep into the session, context pressure is mounting. Warden tightens output compression, issues more targeted advisories, and may suggest wrapping up. |

Phases aren’t strictly sequential. A session can move from Productive to Struggling and back to Productive if the agent resolves its issues. Phase transitions are driven by the signals described below.

Signals in Detail

Focus score tracks how concentrated the agent’s work is across the file system. When the agent edits src/auth/login.ts, then src/auth/session.ts, then src/auth/token.ts, focus is high — it’s working in one subsystem. When it jumps to src/database/connection.ts, then src/ui/header.tsx, then package.json, focus drops. Rapid subsystem switching usually indicates the agent is flailing.
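One plausible way to compute such a score — a minimal sketch, not Warden’s actual implementation; the function names and the “first two path components define a subsystem” heuristic are assumptions — is to measure what share of recent edits land in the single most-edited subsystem:

```rust
use std::collections::HashMap;

/// Treat the first two path components (e.g. "src/auth") as the subsystem.
fn subsystem(path: &str) -> String {
    path.split('/').take(2).collect::<Vec<_>>().join("/")
}

/// Focus = share of recent edits in the dominant subsystem, scaled to 0-100.
fn focus_score(recent_edits: &[&str]) -> u32 {
    if recent_edits.is_empty() {
        return 100; // nothing edited yet: treat as fully focused
    }
    let mut counts: HashMap<String, usize> = HashMap::new();
    for path in recent_edits {
        *counts.entry(subsystem(path)).or_insert(0) += 1;
    }
    // Size of the largest cluster of edits in any one subsystem.
    let dominant = counts.values().max().copied().unwrap_or(0);
    (dominant * 100 / recent_edits.len()) as u32
}
```

Under this sketch, three edits inside src/auth score 100, while four edits scattered across four subsystems score 25 — matching the “jumping around” drop the text describes.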

Real scenario: The agent is asked to add a new API endpoint. It starts in the right directory, but after hitting a type error, it starts modifying the database schema, then the test fixtures, then the CI config. Focus score drops from 90 to 40. Warden injects: “Focus has dropped. Consider addressing the type error in the auth module before changing other subsystems.”

Loop detection identifies repeated command patterns that aren’t producing progress. The simplest loop is running the same command and getting the same error — but Warden also detects more subtle patterns like edit-build-fail cycles where the edits aren’t addressing the actual error.

Real scenario: The agent runs cargo build and gets a lifetime error. It edits a function signature, builds again, gets a different lifetime error. Edits again, builds again, gets the original error back. After 4 iterations of this cycle, Warden detects the loop: “Loop detected: 4 build-fail cycles on lifetime errors. Consider a different approach — the borrow structure may need redesigning rather than signature tweaks.”
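The simplest form of this check could be sketched as follows — an assumption-laden illustration, not Warden’s real detector: count repeats of each (command, error fingerprint) pair in a recent window and flag a loop when any pair recurs past a threshold.

```rust
use std::collections::HashMap;

/// Flag a loop when the same (command, error fingerprint) pair appears
/// `threshold` or more times in the recent window.
fn detect_loop(window: &[(&str, &str)], threshold: usize) -> bool {
    let mut seen: HashMap<(&str, &str), usize> = HashMap::new();
    for pair in window {
        let count = seen.entry((pair.0, pair.1)).or_insert(0);
        *count += 1;
        if *count >= threshold {
            return true;
        }
    }
    false
}
```

A real detector would also need the fuzzier cases the text mentions — edit-build-fail cycles where the error changes but no progress is made — which this pairwise counter does not capture.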

Verification debt accumulates when the agent edits files without running tests or builds. Editing 2-3 files before building is normal. Editing 8 files across 4 directories without a single build or test run is risky — any of those changes could be broken, and the agent won’t find out until much later.

Real scenario: The agent edits src/handler.rs, src/model.rs, src/routes.rs, src/middleware.rs, src/types.rs, src/config.rs, tests/integration.rs, and Cargo.toml — all without running cargo build or cargo test. Warden counts 8 unverified files and injects: “Verification debt: 8 files edited since last build. Run cargo build or cargo test to catch errors early.”
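The bookkeeping for this signal is essentially a set that grows on edits and clears on verification. A minimal sketch (type and method names are assumptions, not Warden’s API):

```rust
use std::collections::HashSet;

/// Files edited since the last build or test run. A verification command
/// clears the debt; the advisory fires at a phase-dependent threshold.
struct VerificationDebt {
    unverified: HashSet<String>,
}

impl VerificationDebt {
    fn new() -> Self {
        Self { unverified: HashSet::new() }
    }

    fn on_edit(&mut self, path: &str) {
        self.unverified.insert(path.to_string());
    }

    fn on_verify(&mut self) {
        self.unverified.clear();
    }

    fn advisory(&self, threshold: usize) -> Option<String> {
        if self.unverified.len() >= threshold {
            Some(format!(
                "Verification debt: {} files edited since last build.",
                self.unverified.len()
            ))
        } else {
            None
        }
    }
}
```

Using a set (rather than a counter) means re-editing the same file doesn’t inflate the debt — only distinct unverified files count.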

Drift velocity measures how quickly the session is diverging from its initial goal. It combines subsystem switching, file distance from the original working set, and whether the agent is still touching files related to the original request.
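As a rough sketch of one component — how Warden actually weights subsystem switching, path distance, and working-set overlap is not specified, so this is purely illustrative — drift could be approximated as the share of recently touched files outside the original working set:

```rust
use std::collections::HashSet;

/// Illustrative drift measure: fraction of recent files that fall outside
/// the session's original working set (0.0 = no drift, 1.0 = full drift).
fn drift_velocity(original_set: &HashSet<&str>, recent_files: &[&str]) -> f64 {
    if recent_files.is_empty() {
        return 0.0;
    }
    let outside = recent_files
        .iter()
        .filter(|f| !original_set.contains(*f))
        .count();
    outside as f64 / recent_files.len() as f64
}
```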

Trust score is a composite metric (0-100) that starts at 100 and decreases based on errors (-5), verification debt (-3 per file), subsystem switches (-2), and dead ends (-4). It’s used to control the advisory injection budget: high-trust sessions get minimal intervention, low-trust sessions get more guidance.
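The penalties above translate directly into arithmetic. A sketch using the stated weights (the function name and parameters are illustrative):

```rust
/// Trust score: start at 100, subtract 5 per error, 3 per unverified file,
/// 2 per subsystem switch, and 4 per dead end, clamping at 0.
fn trust_score(errors: u32, unverified_files: u32, switches: u32, dead_ends: u32) -> u32 {
    let penalty = errors * 5 + unverified_files * 3 + switches * 2 + dead_ends * 4;
    100u32.saturating_sub(penalty)
}
```

For example, a session with 2 errors, 3 unverified files, 1 subsystem switch, and 1 dead end would score 100 − 10 − 9 − 2 − 4 = 75.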

The Scorecard System

At the end of each session (or on demand via warden scorecard), Warden generates a quality scorecard with four dimensions:

| Dimension | What it measures |
| --- | --- |
| Safety | How many safety rules fired, how many were critical, whether any were bypassed |
| Efficiency | Token utilization, output compression ratio, unnecessary command count |
| Focus | Average focus score, number of subsystem switches, drift velocity |
| UX | How many advisories were emitted, whether loops were detected and broken, verification debt at end |

Each dimension scores 0-100. The overall session quality is the weighted average. Scorecard data feeds into the dream state for cross-session learning.
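The weighted average could look like the following sketch. The weights here are assumptions for illustration — the document does not specify Warden’s actual weighting:

```rust
/// Overall quality as a weighted average of the four dimension scores.
/// Weights are illustrative (safety weighted highest) and sum to 1.0.
fn overall_quality(safety: f64, efficiency: f64, focus: f64, ux: f64) -> f64 {
    let weighted = [(safety, 0.4), (efficiency, 0.2), (focus, 0.2), (ux, 0.2)];
    weighted.iter().map(|(score, w)| score * w).sum()
}
```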

Dream State (Between Sessions)

When a session ends, Warden’s background worker processes the session data through 10 learning tasks:

| Task | Purpose |
| --- | --- |
| LearnEffectiveness | Which rules and advisories preceded progress vs. which were ignored |
| BuildResumePacket | Compact summary of the session for the next one to pick up from |
| LearnSequences | Successful action sequences worth repeating |
| ClusterErrors | Group repeated errors into durable knowledge |
| LearnRepairPatterns | Map error types to fixes that worked |
| LearnConventions | Project conventions from recurring patterns |
| UpdateWorkingSetRanking | Which files/directories are most important by recency-frequency-outcome |
| BuildDeadEndMemory | Approaches that were tried and failed (so the next session avoids them) |
| ScoreArtifacts | Prune weak or outdated learning artifacts |
| ConsolidateEvents | Compress raw event logs into higher-level facts |

These tasks run in priority order during daemon idle time. The highest-value work (effectiveness learning, resume packet) runs first. Lower-priority housekeeping (event consolidation) runs last.

The dream state produces a resume packet — a compact summary of what the session learned. When the next session starts (or after context compaction), the resume packet is injected so the agent doesn’t lose hard-won context. This includes the working set (top files by importance), dead ends to avoid, and conventions discovered.

Adaptive Thresholds

Warden doesn’t use fixed thresholds for intervention. The session phase and trust score adjust them in real time:

| Setting | Warmup | Productive | Struggling | Late |
| --- | --- | --- | --- | --- |
| Output compression max lines | 80 | 80 | 60 | 40 |
| Advisory injection budget | 1 | 1-3 | 3-5 | uncapped |
| Loop detection sensitivity | low | medium | high | high |
| Verification debt warning threshold | 8 files | 5 files | 3 files | 2 files |
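A phase-keyed lookup like the sketch below is one natural way to encode this table. The type and field names are illustrative, not Warden’s configuration API; the advisory budget uses each range’s upper bound as a cap, with `None` standing in for “uncapped”:

```rust
enum Phase {
    Warmup,
    Productive,
    Struggling,
    Late,
}

struct Thresholds {
    max_output_lines: u32,
    advisory_budget: Option<u32>, // None = uncapped
    verification_debt_files: u32,
}

/// Intervention thresholds per phase, mirroring the table above.
fn thresholds(phase: Phase) -> Thresholds {
    match phase {
        Phase::Warmup => Thresholds {
            max_output_lines: 80,
            advisory_budget: Some(1),
            verification_debt_files: 8,
        },
        Phase::Productive => Thresholds {
            max_output_lines: 80,
            advisory_budget: Some(3), // upper bound of 1-3
            verification_debt_files: 5,
        },
        Phase::Struggling => Thresholds {
            max_output_lines: 60,
            advisory_budget: Some(5), // upper bound of 3-5
            verification_debt_files: 3,
        },
        Phase::Late => Thresholds {
            max_output_lines: 40,
            advisory_budget: None,
            verification_debt_files: 2,
        },
    }
}
```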

This means a fresh session tolerates broad exploration and messy output. A late, struggling session gets tight compression and aggressive guidance. The agent doesn’t need to know about any of this — the adjustments happen transparently based on what Warden observes.

How Interventions Work in Practice

Warden’s interventions are injected as text into the tool call response. The agent reads them as part of the output and can act on them (or ignore them). Here’s what a typical intervention sequence looks like in a real session:

Turns 1-10 (Warmup): The agent reads files, runs rg searches, examines the directory structure. Warden is silent — this is expected exploration behavior.

Turn 15 (Productive): The agent edits src/handler.rs and src/model.rs. No advisory — two files without a build is fine.

Turn 22 (Productive, verification debt rising): The agent has now edited 6 files without running cargo build or cargo test. Warden injects: “Verification debt: 6 files edited since last build. Consider running tests before continuing.”

Turn 28 (Struggling): The agent runs cargo build, gets an error, edits a file, builds again, gets the same error. After the third cycle, Warden injects: “Loop detected: 3 build-fail cycles on the same error. The approach may need rethinking rather than incremental fixes.”

Turn 35 (Productive): The agent takes a different approach, the build passes, and tests are green. Trust score recovers. Warden goes silent again.