Changelog

This changelog is maintained as a best-effort summary; for line-level detail and any gaps, see the commit history (git log) or the GitHub PR list.

[3.1.2] - 2026-06-17 — Maia

Hardware reach

AMD ROCm via llama.cpp — including RDNA4 / RX 9070 (gfx1200/gfx1201) and community-verified cards (#26)
Apple Silicon — native macOS hybrid Metal path (native llama-server for inference + Docker for the rest of the stack) (#32)
Vulkan universal fallback — one image covering AMD / Intel / Snapdragon / Apple-via-MoltenVK / CPU (#114)

Agent reliability — local-model tool loop

Tool results are rendered as user-role turns on the wire. Gemma's chat template has no tool role and silently dropped role:"tool" messages, so the model never saw any tool output (list_directory / read_file / run_command) and re-issued the same call until the repetition breaker fired. This was the root cause behind the "it can't see what it's reading / it just loops" reports. Model-agnostic (Qwen reads the [tool result] marker the same way).
Read-dedup false-negative fixed: fileContentInContext probed the raw longest line, but tool results are stored JSON-escaped, so any file whose longest line contained a quote (e.g. a Flask app's embedded HTML) was wrongly judged "trimmed" and re-served every read → read loop. Now probes the longest escape-free run.
Traceback → directed edit (#39 / option 3): a run_command crash extracts the deepest in-project frame, quotes the offending line, and steers a minimal edit_file; run tools are banned from the next decision's grammar so the model must edit rather than re-run.
move_file tool: relocations/renames (e.g. index.html → templates/) no longer require a read→write→delete dance; shell mv/cp point here. Refuses to clobber an existing destination.
Steers for common dead-ends: No module named X → pip install (instead of re-running), and a filename that differs only in case from a real workspace file → the correct name.
Per-turn max_tokens 32768 → 8192 (ATLAS_MAX_TOKENS) and a content-stream loop cut, bounding runaways that previously ran to the slot ceiling.
Conversation window sized to the per-slot context (with the active file pinned in the trim) instead of a flat cap that dropped the file under edit.

Sandbox — shell policy + isolation

run_command shell gate narrowed from "block every mutating verb" to catastrophic-only (whole-project/root rm -rf, fork bombs, device destruction), since the sandbox container (read-only rootfs, no-new-privileges, project-only writable mount, cwd jailed) is the real boundary. Ordinary mv/cp/mkdir/rm <file>/sed -i now run; bash -c/eval are unwrapped so a wrapped catastrophic command can't slip through.
Host-sized cgroup limits on the sandbox: pids_limit (kernel-level fork-bomb stop) and a memory cap (atlas init detects host RAM and writes ATLAS_SANDBOX_MEM ~75%); :-0 fallback keeps a raw docker compose up working uncapped.
Interactive wall-clock cap on the V3 pipeline (ATLAS_V3_TIMEOUT, default 180s) — a runaway falls back to the model's own (syntax-gated) content instead of hanging the session.

Geometric Lens — per-model thresholds + in-the-loop training data

G(x) operating thresholds are now per-model and ship with the lens artifact (gx_thresholds.json): the lens service loads them and returns them in each score response; the proxy uses them for its regression checks. The hardcoded 0.3 / 0.15 / 0.05 cutoffs were calibrated to one model's score scale and never fired for a model (e.g. Gemma) whose scores cluster higher. atlas lens build auto-emits the file, calibrated from the run's PASS-score percentiles.
ast_edit now matches <script> / <style> (tree-sitter parses them as dedicated script_element / style_element nodes, not generic elements — the old query matched 0).
In-the-loop lens-training data collection: each agent file-write is captured per pass; in the TUI, /good·/bad rate a pass and /review + /deny·/accept set per-file verdicts, which the proxy turns into labeled, weighted samples (a 👎 pass down-weights even its accepted files; a denial is a full-weight negative). /redo regenerates a rejected file. A one-time "lens retrain available" banner appears once enough balanced samples accrue.
atlas lens retrain trains the lens on that collected corpus (weighted G(x)) so it learns the user's own workloads, and emits fresh calibrated thresholds. New env: ATLAS_LENS_DATA_DIR, ATLAS_LENS_RETRAIN_MIN. TUI slash commands: /good /bad /review /deny /accept /redo.

Structural call-graph reasoning (#39, thanks @yogthos)

Intra-file call-graph neighborhood (calls: / called by: per symbol) rides on outline_file and whole-file read_file of a .py, gated by ATLAS_CALL_GRAPH. Surfaces structure at the localization decision point without a repo-wide scan. (PR #125 by Dmitri Sotnikov, integrated and extended.)

Documentation

Translated ARCHITECTURE.md to zh-CN / ja / ko (#25); added a language switcher to the English ARCHITECTURE.md

[3.1.0] - 2026-05-12 — Maia

Removed

Removed dead ATLAS_USE_FOX code paths in benchmark runner (#22)

Aider removed

proxy/aider_format.go (whole-file format translator), handleChatCompletions + handleStreamingChat, and the OpenAI-compat agent-loop wrapping are all deleted (~2000 lines). /v1/chat/completions on the proxy is now a transparent passthrough to llama-server via the catch-all handler.
.aider.model.settings.yml, .aider.model.metadata.json, the .aider* .gitignore exceptions, and the _find_aider/launch_aider paths in atlas/cli/repl.py are all gone. Bare atlas (interactive tty) now launches the TUI by default; pipe mode falls through to the built-in /solve REPL.
Proxy launcher (atlas/cli/repl.py) now reaps any pre-existing atlas-proxy-v2 process before spawning a fresh one and redirects proxy stdout/stderr to ~/.cache/atlas/proxy.log instead of /dev/null. Closes the "old binary in memory after rebuild" foot-gun.

Bubbletea TUI (PC-062)

New atlas tui subcommand launches a native Bubbletea terminal UI as the canonical chat client (and is now the default for plain atlas)
Five-pane layout: header (proxy/cwd/mode/spinner) + pipeline (live V3 stage table from /events) + chat (glamour-rendered markdown + inline tool calls) + events log + stats strip + textarea input
Hotkeys: Enter send, Shift+Enter newline, Ctrl+L clear, Ctrl+T cycle permission mode, Ctrl+R resend last, Ctrl+C cancel turn / quit, Ctrl+D quit
Slash commands inside the TUI: /add /drop /context /diff /commit /undo /run /help /quit
New atlas-proxy POST /cancel endpoint indexed by session_id — TUI cancels the in-flight /v1/agent turn on Ctrl+C as defense-in-depth alongside TCP disconnect
43 atlas-tui Go tests + 4 atlas-proxy /cancel tests, all green under go test -race
tui/ is a standalone Go module (github.com/itigges22/atlas-tui) — depends on bubbletea, lipgloss, bubbles, glamour

Documentation

Added multilingual documentation: Simplified Chinese (zh-CN), Japanese (ja), Korean (ko) for README, SETUP, and TROUBLESHOOTING
Added language selector badges to README
Added star history chart to Latest News section
Rewrote README contributing section to encourage issue reports and community feedback
Fixed V3_1_STATUS.md false claims about speed optimizations that were never applied to code
Documented RDNA4 (RX 9070 / 9070 XT, gfx1200/gfx1201) ROCm 7.x setup in SETUP.md and TROUBLESHOOTING.md — requires ATLAS_ROCM_TAG=7.2.3-complete; ATLAS_HSA_OVERRIDE_GFX_VERSION must stay unset (#119, thanks @Kaihui-AMD)
Corrected stale Metal/macOS docs: the macOS hybrid Metal path (#32) is now documented as shipping across README, SETUP.md, CONFIGURATION.md, and ARCHITECTURE.md (was mislabeled "V3.1.2 planned"); rewrote ARCHITECTURE.md §8.4 to describe the actual hybrid (native llama-server + Docker) rather than the never-shipped pure-native install
Restructured the README roadmap into V3.1.1 (hardware reach, landed), V3.1.2 (BYO-model + ROCm-on-K8s), and V3.2 (planning phase #120, structural+wavelet reasoning #39, reasoning-with-sampling #9), with a help-wanted backlog — all sourced from open issues
De-staled user-facing CLI strings: atlas init and atlas tier no longer print "Metal — V3.1.2 planned"; they report Metal as the supported macOS hybrid path (#32) — strings/comments only, no logic change
Synced zh-CN / ja / ko translations (README + SETUP.md) to the corrected English: Metal/macOS shown as shipping, multi-vendor GPU support table, V3.1.1/V3.1.2/V3.2 roadmap, and fixed NVIDIA-only requirements rows and SETUP_MACOS.md link paths

Code Accuracy Audit

Audited and corrected comments across 72 files for V3.0.1 accuracy
Updated model references: Qwen3-14B to Qwen3.5-9B, embedding dimensions 5120 to 4096
Renamed service references: rag-api to geometric-lens, Fox to llama-server
Corrected G(x) XGBoost status: deployed and active (was incorrectly described as removed)
Fixed normalization comments from "Fox 9B" to "Qwen3.5-9B C(x)"
Marked legacy Fox code paths as unused in benchmark runner and geo_learning

Test Fixes

Fixed embedding dimensions in test fixtures (5120 to 4096)
Fixed geometric-lens port in test conftest (8001 to 8099)
Updated DivSampling test assertions to match actual 4+4+4 perturbation counts
Corrected G(x) cost field parameter count: ~2.16M / 8.3MB (was ~2.7M / 10MB)
Finished the 3.0.1 api-portal cleanup: removed tests/integration/test_e2e_flow.py and tests/integration/test_e2e_training.py (616 lines). These depended on the test_api_key fixture which calls the deleted api-portal service, so every test in them errored on session setup. The 3.0.1 changelog claimed this cleanup was done but these two files survived it.
test_empty_messages_handled (tests/infrastructure/test_llm.py) now accepts 200/400/422/500. Current llama.cpp returns 500 for empty messages array; the test was hard-coded to 200 and broke against newer llama.cpp builds.
PC-061 step B: implemented _emit_event, _classify_stage, _logical_stage in v3-service/main.py. The test file (tests/v3-service/test_event_emission.py) was committed in c5216be ("Install observability") but the implementation never landed, leaving the test red on dev. The contract is now satisfied: legacy {stage, detail} frame always emitted, typed envelope opt-in, suffix-based stage classification (_pass/_skip/_done → stage_end success=true, _failed → stage_end success=false, _error → error event, _retry → fresh stage_start), and stage_start→stage_end pairing via logical-name parent_id + duration_ms.

Repo restructure

Renamed atlas-tui → tui and atlas-proxy → proxy at the repo level; moved ablation data under docs/reports. 362 reference updates across the tree.

Phase 0: first-run installer + model wizard

New atlas init command (atlas/cli/commands/init.py): interactive first-run wizard that probes hardware, picks the right tier (T0/T1/T2/T3), recommends a model, writes ~/.atlas/config.yaml.
New atlas model command (atlas/cli/commands/model.py) with list / verify / add / remove subcommands; backed by model_registry.py (add/get/list with SHA verification) and model_recommendations.py (per-tier defaults, split out from tier.py in PC-055.2).
atlas/cli/events.py (PC-061 step A): typed-event SSE protocol — Event dataclass, parse_envelope, iter_events, suffix-based stage classification. Schema documented in docs/PROTOCOL.md. Producer-side helpers in v3-service landed as PC-061 step B (see Test Fixes above).
atlas doctor extended for the same hardware probe used by the wizard.

Install + bootstrap hardening

Hardened fresh-VM install path against partial failures across RHEL 9, Ubuntu, Rocky; curl … | bash and curl … | sudo bash both work.
Auto-install NVIDIA driver libraries on RHEL 9 and put the Python CLI on $PATH.
Bootstrap now installs Go and pre-builds atlas-tui so first-run latency is download-bound, not compile-bound.

CI: lint + security + cross-distro

Added ruff (Python lint) and CodeQL (security scan) as GitHub workflows.
New PR-time test job that runs the full Python suite against a cross-distro install matrix (Ubuntu 22.04 / 24.04 / Rocky 9).
Fixed pip PEP 660 friction, Rocky curl conflict, and a CLI-wizard GPU-mock path that was breaking the matrix.

PC-159: surgical-edit gate (proxy)

New gate in proxy/agent.go that refuses an edit_file when the proposed change would rewrite more than a configured fraction of the target file. Forces the model to pick the right tool (write_file for new files, ast_edit for structural rewrites, edit_file only for actual surgical patches).

Chat history threading (proxy + tui)

/v1/agent now accepts full prior chat history from the TUI, replacing the per-call stateless wrapper. Assistant turns are re-wrapped in a JSON envelope so the proxy can tell user messages from prior model turns when rebuilding context.

Plan mode (May 5)

New /v3/plan endpoint on v3-service generates a structured plan (steps + verify step + adherence score) before the agent loop begins; Qwen3 reasoning extraction fixed in the same commit.
proxy/agent.go consumes the plan via a plan bridge, an agent-loop hook that pins the current step into each request, and an adherence gate that flags reasoning that drifts from the active step.
TUI renders plan_loaded / plan_adherence / plan_revise events live (tui/commands.go, tui/model.go).
New docs: docs/PLAN_MODE.md, docs/PROTOCOL.md.

Proxy reliability (May 5)

Output sanitiser strips reasoning preambles and dangling JSON fragments from model responses before parsing.
Shell-op gate refuses dangerous rm -rf / style commands and the bash -c bypass route.
System prompt hardened: clearer tool-use rules, fewer hallucinated fields.
Verification gate added before type=done (foundation that tonight's done-without-action gate composes with).
Host paths in tool-call arguments translated to container paths so the sandbox sees the right file when the model thinks in host-fs terms.
Fixed a conversation-history drop bug where the post-V3 trim was eating the user's prompt; V3 pipeline now fires on more edit shapes (not just write_file).
Lens-call timeout in v3-service bumped from 5s to 30s with structured fallback logging on miss.

Sandbox + execution stack

PC-188: every run_command now executes inside the sandbox container, not on the host. Closes the "model writes rm and the host runs it" risk.
PC-189: workspace-drift fix and a false-positive in the truncating-redirect detector (was rejecting legit > file.txt writes).
PC-190: sandbox verify stack pre-bakes common dev deps (pytest, ruff, etc.), uses tmpfs for the working tree, prints a "create a venv" hint when the model tries to install into the system Python.
PC-191/192/193: sandbox is language-agnostic — works on a working codebase (not just a single-file scratchpad). Detects Python, Node, Go, Rust, Java, C/C++ project layouts and uses the appropriate runner.

Anti-laziness gates

PC-194/195: write_file rejects empty content, single-line stubs, "TODO"-only files, files with pass-only bodies, and other lazy outputs.
PC-196: explicit run_background tool for long-running processes (e.g. python app.py); shell & backgrounding through run_command is detected and routed to run_background.
PC-197: completion-claim verification — when the model declares done, the gate checks the workspace state matches the claim (structural check, foundation that tonight's claim-check gate extends).
PC-198: trims boilerplate from the system prompt and strips host /workspace/ prefixes from model-emitted paths.
PC-199/200: detects "stops at the easy fix" pattern (one tweak then done); raises tier-aware turn caps so the model has runway to complete a real task.
PC-201: write_file is allowed to overwrite an existing file when that file is corrupted (e.g. truncated mid-write from a prior crashed turn) instead of failing with the usual "file exists" gate.

PC-202: per-layer residual hidden states from llama-server

Patched llama-server's /embedding endpoint to accept a layers: [int] parameter and return the residual-stream hidden state at each requested layer. Foundation for both PC-207 (per-token lens scoring) and tonight's ASA steering vector build.

PC-206 + PC-207: lens-as-PRM (per-step process reward)

PC-206: thinking-mode plumbing in v3-service/main.py LLMAdapter — thinking keyword resolves per-call against an instance default.
PC-207: lens computes per-token C(x) + G(x) scores during candidate generation; /internal/lens/score-per-step exposes aggregates (gx_min, gx_mean, off_rails_idx, cx_norm_max) the proxy and v3-service consume for early-exit and ranking. Wired into v3-service candidate generation, the agent loop (foundation for tonight's reasoning-repeat + path-aware detectors), with structured per-step logging across all three services.
Severe-score short-circuit: gx_min below 0.05 fires a corrective immediately without waiting for a second sample (calibrated against the May 7 dashboard.html stub-loop session).
V3↔lens alignment: lens now vetoes a sandbox-passing candidate when its gx_min indicates a stub or placeholder collapse — closes the "sandbox approves a stub V3 generated" loophole.

Agent loop reliability sweep (May 7)

Empty-response fallback: when the model returns nothing parseable, the loop emits a corrective hint instead of retrying the same prompt verbatim.
Plan-threshold guard: refuses to enter the agent loop on a plan with adherence score below threshold.
Tool-repeat detector: precursor to tonight's reasoning-repetition detector — catches verbatim tool-call repeats within a window.

GH #39: AST-aware surgical edits + tier-aware V3 routing (May 8)

v1 (5e44ffb): new ast_edit tool — friendly-selector AST node replacement using tree-sitter. Supports function:NAME, class:NAME, and <tag> selectors. The selector vocabulary is intentionally small in v1; nested selectors (e.g. <style> inside <head>) are NOT supported and produce a "0 nodes matched" error.
Point 1 (468a555): structural verification veto for V3 candidates — rejects candidates that pass sandbox but fail structural shape checks (e.g. removed a required import, lost the class definition).
Point 2 (b95f741): cyclomatic-complexity enrichment in tier classification — tier.py now considers logic density, not just line count, when assigning T0/T1/T2/T3.
Point 3 (2629652): Phase 3 repair receives call-chain context (callers + callees of the file being repaired) so the repair model can reason about cross-file effects.
Point 4 (bd0b02b): auto-injection of a reachability slice from the user's message — the lens picks the most relevant file regions and inlines them into the system context before the loop starts.
Plan generation made aware of ast_edit so plan steps suggest it when the target is a structural edit.
edit_file "string not found" error now suggests ast_edit as the recovery; write_file rejection on existing files also points to ast_edit.
Three follow-up fixes: encoding (HTML entities in selector args), trim-resilience (large content fields surviving the post-V3 trim), and parse-failure categorization in logs.
Jinja crash fix when symbol_index injects snippets: the snippet role was being set to system, which Jinja resolved as a template literal; changed to user role.

BiasBusters tool-selection mitigations

Tool descriptions rewritten to push the model toward the right tool for the task: edit_file framed as the surgical default, ast_edit marked REQUIRED for HTML/Python structural edits, write_file restricted to new-file creation only.
Conditional GBNF grammar built per turn: when the loop has already entered a step the model has just claimed done, the grammar bans re-emitting the same tool name token-side so the model can't loop on the same failed tool call.
Per-step tool-list filter (buildToolDescriptionsExcluding): the system prompt strips tools the loop has explicitly excluded for this step, so the model never sees them as options.
ASA (Activation Steering for Aast_edit) wired into the inference entrypoint: inference/entrypoint-v3.1-9b.sh auto-detects /models/ast_edit_steering.gguf and applies it always-on via llama.cpp --control-vector. Default scale 0.5, default layer range full-model, both overridable via env. PC-202's per-layer-residual /embedding patch is the upstream that makes this possible.

ASA steering vector

New geometric-lens/asa_calibration/ directory: 1000 contrast-pair prompts (50+ base templates × variation pools) cover function selectors (54%), HTML tags (27%), and CSS classes (19%). generate_pairs.py produces contrast_pairs.jsonl; build_steering_vector.py extracts residuals via the lens extract_per_layer_per_token endpoint at layer 27 (of 36 in Qwen3.5-9B), means across tokens/prompts/sign, and writes a llama.cpp-format GGUF control vector. Final vector: 16736 bytes, ‖v_global‖ = 8.6444 after 730s on 2000 prompts.

Agent loop hardening

Plan-progress reminder (proxy/plan_reminder.go): ephemeral system note injected into every step request rendering plan progress N/M — currently on step "sX": <action> <target> plus done/remaining sub-step IDs. Lazy-initializes ctx.PlanStepsSatisfied. Not persisted to ctx.Messages, so it survives the post-V3 conversation trim cycle.
Reasoning-repetition detector (proxy/reasoning_repeat.go): tracks the model's reasoning-stream opening; on 3 consecutive identical normalized openings (case-folded, whitespace-collapsed, 80-char snippet) the loop queues a corrective system message. Successfully broke a session-2 stuck loop in live testing.
Path-aware error breaker (extractFailurePath in proxy/lens_score.go, breaker logic in proxy/agent.go): tracks ctx.RecentFailurePaths per tool failure. Known limitation: the v1 implementation resets on intervening successes, so it can miss long stuck-loop sequences with sporadic productive turns in between.
Done-without-action gate (proxy/guardrails.go): refuses type=done when the user prompt is fix-intent and no successful verification command has run this loop. Action-intent words (rewrite, create, add, update, redesign) also trigger a productive-change check parallel to the existing verify check. Caught 4 false-success done attempts in live testing.
Truncation recovery shims (proxy/agent.go): recoverTruncatedAstEdit + recoverTruncatedEditFile + recoverTruncatedToolCall rescue malformed tool emissions from the model and re-pack into a well-formed shape. Each shim is targeted at a specific failure mode observed in production logs.
Conversation history error surfacing (proxy/agent.go): extractModelResponse now exposes the actual Unmarshal error path (directErr vs balancedErr) so debug logs distinguish parse-shape failures from content failures.
Removed ResponseHeaderTimeout from proxy/v3_bridge.go and removed all client-level timeouts on the V3 HTTP path. Long V3 chains (10+ minute passes) were getting bounced by the 10-minute response-header window even when the pipeline was making progress.
Removed absoluteMaxTurns ceiling from proxy/types.go. Turn caps now come solely from TierMaxTurns (T0:5, T1/T2/T3:0 = uncapped) with no override clamp. Reasoning: 8 detectors armed in the loop make a hard cap redundant — let the detectors decide when to break.

Surgical-edit hardening (V3 routing)

proxy/tools.go ast_edit executor: tier classification now uses max(oldTier, newTier) and the previous V3-tier floor for HTML was dropped (it was over-triggering V3 on the smallest CSS tweaks). Doctype dedup (leadingDoctypeRe + stripLeadingDoctype in proxy/guardrails.go) prevents the model's "" prefix from being inserted twice when ast_edit replaces the <body>.
Suspiciously-shrunk-edit guard (validateNotSuspiciouslyShrunk in proxy/guardrails.go): rejects an edit that shrinks an >100-byte file to <64 bytes. Final threshold tuned after a legitimate 80-byte one-liner refactor was false-rejected at 128. Triggered on a destructive 32-byte stub in pre-release testing.
Working-directory phantom-dir guard (validateWorkingDirReference + workspaceRefRe): catches model emissions that try to cd templates/workspace or similar nested-workspace references; legitimate cd /workspace at the sandbox root is allowed.
Action-intent gate (actionIntentWords + isActionIntentMessage + actionWithoutProductiveChangeMessage): companion to the verification gate, catches done declarations on rewrite/create/add/redesign-style prompts that don't include a productive edit this loop.

TUI reasoning stream visibility

tui/model.go adds a streamingReasoningText buffer and a reasoning_token event handler that renders with a ‹thinking› prefix so the user sees the model's reasoning stream live alongside its content. Both buffers reset on llm_call_start / llm_call_end.
tui/commands.go extended to forward the delta.ReasoningContent field from the SSE stream as reasoning_token events.
proxy/agent.go plumbs reasoning content through the agent loop: stashes ctx.LastTurnReasoning, captures pendingReasoningCorrective via recordReasoning, and re-emits reasoning deltas to the client mid-turn (with a sync.Mutex around the http.ResponseWriter to fix the SSE race that produced the "chunked line ends with bare LF" errors).

Tests

New Go tests: proxy/path_aware_test.go, proxy/reasoning_repeat_test.go, proxy/recover_truncated_test.go, proxy/step_restriction_test.go. Extended proxy/guardrails_test.go, proxy/plan_hook_test.go.
All go test ./... on both proxy/ and tui/ modules pass.
Full Python suite: 1055 passed / 4 skipped / 0 failed / 0 errors locally.

[3.0.1] - 2026-04-05

Tool-Call Agent Loop Architecture

Replaced Aider format-translation proxy with structured JSON tool-call agent loop
Grammar-constrained output via llama-server response_format:json_object — 100% valid JSON
8 tool definitions: read_file, write_file, edit_file, delete_file, run_command, search_files, list_directory, plan_tasks
Per-file tier classification: T1 (config/data) writes directly, T2 (logic/features) routes through V3 pipeline
3400+ lines new Go code across 12 files in proxy/

V3 Pipeline Integration

All 14 V3 steps wired into write_file/edit_file executors for T2/T3 files
PlanSearch → DivSampling → Budget Forcing → Build Verification → C(x)/G(x) Scoring → Best-of-K → S*/Blend-ASC → Failure Analysis → PR-CoT Repair → Refinement Loop → Derivation Chains → Metacognitive → Final Write
Per-file-type build verification: tsc, py_compile, gcc, go build, cargo check, bash -n
V3 service SSE streaming: pipeline progress visible in real-time

CLI Experience

atlas command: starts all services and launches Aider
Streaming progress: [Turn N/M] with tool call details, V3 pipeline steps, completion summary
Exploration budget: 4 consecutive read-only calls triggers nudge, prevents model from over-exploring
Pre-injected project context: model sees project file list in system prompt
File deletion via fast-path before tier classification
Truncation prevention: 32K context, reject write_file for existing files >100 lines, detect truncated args before execution

Deployment

Docker Compose (docker-compose.yml) for full stack orchestration
Podman compatible with host networking
.env.example with all configurable parameters
atlas script auto-detects Docker vs bare-metal and routes accordingly

Renames (362 total reference updates)

rag-api/ → geometric-lens/ (directory + all references)
ATLAS_RAG_URL → ATLAS_LENS_URL
ATLAS_FOX_URL → ATLAS_INFERENCE_URL
foxURL → inferenceURL (Go code)
ralph-loop → verify-repair loop
rag.py → pipeline.py (geometric-lens orchestration)

Reliability

8-level test × 3 iterations: 95.8% (23/24)
5-language integration: 100% (Shell, Python, Rust, C, Go)
L6 (add feature to existing project): 67% — marked as future improvement

Documentation Overhaul

ARCHITECTURE.md: Complete rewrite — 13 Mermaid diagrams (service topology, agent loop flow, V3 pipeline, module map, sequence diagrams), every component verified against source code
API.md: Complete rewrite — every endpoint across all 5 services verified against source, request/response formats, SSE stages
CLI.md: Complete rewrite — startup flow diagram, streaming format, workflow examples, troubleshooting, env vars, Aider config reference
CONFIGURATION.md: Complete rewrite — every env var across all services verified, internal constants, Docker Compose vs K3s differences
MAP.md: Complete rewrite — every file in repo with clickable tree, 150 file links, 18 description tables
SETUP.md: Complete rewrite — verified build steps, first-run guide, bare metal, K3s, hardware sizing, Lens training guide
TROUBLESHOOTING.md: Complete rewrite — quick diagnostics, 20+ issue scenarios with verified fixes
README.md: Honest 7-step setup with actual download command, prerequisites, model clarity (Qwen3-14B vs Qwen3.5-9B)
Reorganized historical docs into docs/reports/ (ablation studies, status tracking, migration guides)

Bug Fixes

geometric-lens Dockerfile port mismatch: Container was listening on 8001 but docker-compose expected 8099 — fresh Docker Compose deploys had broken Lens service. Fixed Dockerfile to use port 8099.
Python CLI default RAG port: atlas/cli/client.py defaulted to port 31144 (K3s NodePort) instead of 8099 (Docker Compose). Fixed default to match Docker Compose.
Missing Aider config files: .aider.model.settings.yml and .aider.model.metadata.json were not in the repo — the atlas launcher would fail without them. Restored both files and added .gitignore exceptions.
GitHub Issue #6: hostname -I → portable fallback chain (ip addr → hostname -I → hostname -i) for Arch Linux compatibility
GitHub Issue #10: rag-api/ → geometric-lens/ restructuring resolved missing models directory
GitHub Issue #11: Added Geometric Lens training documentation to SETUP.md with HuggingFace dataset link
GitHub Issue #12 / PR #13: docker image exists → docker image inspect in build script

Cleanup

Removed 62 stale test directories, old v1 proxy binary, dead G(x) metric tensor training scripts
Removed stale tests for deleted services (api-portal, dashboard, embedding-service, task-worker)
Removed root-level development artifacts (bubble_sort.py, snake_game.py, etc.)
All hardcoded /home/isaac/ paths replaced with $HOME or ATLAS_DIR env vars

[3.0] - 2026-03-05

V3.0 Benchmark Release

74.6% LCB pass@1 (447/599) on frozen Qwen3-14B
Full ablation study: conditions A–D with per-task results
Phase 1 (PlanSearch/DivSampling): +12.4pp
Phase 3 (PR-CoT/Refinement/Derivation): +7.3pp
Self-verified Phase 3 using model-generated test cases

[2.5.1] - 2026-02-23

Confirmation Ablation: Embedding Source Hypothesis — STRONG CONFIRMATION

H1: Self-embeddings restore C(x) discrimination: CONFIRMED (+39.5pp)
- C(x) selects passing candidate 87.8% on mixed-result tasks vs 48.3% random (p < 0.000001)
- V2.5 result (+0.6pp under nomic 768-dim) was an embedding source limitation, not architecture failure
- Reverse energy selects only 4.3%, proving strong directional signal
- Val AUC: 0.9934, energy separation: 21.75 (7.2x wider than V2.5)
H2: G(x) adds value beyond C(x): NEUTRAL (0.0pp)
- G(x) contributes zero at optimal alpha (0.001); monotonically degrades at higher alpha
- Zero corrections, zero breakages across all mixed-result tasks
Outcome B: Ship C(x)-only with self-embeddings, remove or redesign G(x)
Difficulty routing validated: Q1 (low energy) = 100% oracle, Q4 (high energy) = 0.3%
C(x) confirmed as both verifier (87.8% selection) and router (perfect difficulty stratification)
Runtime: 24h 42m on LiveCodeBench v5 (599 tasks, K=3, 4 epochs)
Infrastructure: Qwen3-14B with --embeddings (no spec decode, ~45 tok/s)
Risk R6 (Lens non-discriminating) RESOLVED; Risk R11 (no verifier) substantially mitigated

[2.5.0] - 2026-02-21

Ablation Study

Systematic ablation of Geometric Lens, router, and infrastructure components
Finding: C(x) energy scoring ≈ random for candidate selection under nomic embeddings (37.7% vs 37.1%, within 3.4pp seed variance) — V2.5.1 confirmed this was an embedding source limitation (87.8% accuracy restored with self-embeddings)
Finding: C(x) energy strongly correlates with task difficulty (58.5% vs 18.9% pass rate across tiers)
Finding: G(x) metric tensor confirmed dormant (5.2M params, zero impact)
Finding: Pattern cache bypassed entirely by benchmark runner

Architecture Change

Discovered --embeddings flag breaks speculative decoding (forces n_batch=512)
Migrated to two-server sidecar architecture: generation + spec decode on Server A, embeddings via nomic-embed-text-v1.5 on Server B
Recovered ~2.6x generation throughput (~38 tok/s → ~100 tok/s)
Net VRAM delta: approximately -230 MiB (sidecar cheaper than --embeddings overhead)

[2.0.0] - 2026-02-18

Architecture Changes

Replaced Qdrant vector DB + embedding service with PageIndex tree-based RAG
Added Geometric Lens (Cost Field + Metric Tensor) for candidate quality prediction
Added Confidence Router with difficulty-based adaptive-k selection
Added Pattern Cache (Redis + Ebbinghaus memory decay)
Added Best-of-K pipeline with parallel candidate generation
Added sandboxed code execution for benchmark evaluation
Added speculative decoding with Qwen3-0.6B draft model
Added KV cache quantization (q4_0)

Benchmark Results (Run ID: v2_run_20260217_125310)

LiveCodeBench: 36-41% pass@1 (across Lens training epochs, k=3)
GPQA Diamond: 47.0% (k=5)
SciCode: 14.7% sub-problems (341 tasks, k=1)
Geometric Lens: 0.968 Val AUC, ~80% first-pick accuracy (151/188)
Throughput: 109 tasks/hr on RTX 5060 Ti 16GB

Removed

Qdrant vector database
MiniLM-L6-v2 embedding service
LoRA nightly training pipeline (moved to v1_archived/, CronJob suspended)
V1 benchmark suite (HumanEval, MBPP, Custom)

Fixed Post-Release

mlock allocation failure — added LimitMEMLOCK=infinity systemd override for K3s
Speculative decode slot 1 failure — quantized draft KV cache to q4_0 (-ctkd/-ctvd)
Dashboard crash-loop — fixed missing Jinja2 default filters

Notes

IFBench evaluation incomplete (excluded from results)
All results from single benchmark run (variance unknown)

[1.0.0] - 2026-02-04

Initial release. See benchmark/v1_benchmark_report.md for V1 results.

Uh oh!

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

[3.1.2] - 2026-06-17 — Maia

Hardware reach

Agent reliability — local-model tool loop

Sandbox — shell policy + isolation

Geometric Lens — per-model thresholds + in-the-loop training data

Structural call-graph reasoning (#39, thanks @yogthos)

Documentation

[3.1.0] - 2026-05-12 — Maia

Removed

Aider removed

Bubbletea TUI (PC-062)

Documentation

Code Accuracy Audit

Test Fixes

Repo restructure

Phase 0: first-run installer + model wizard

Install + bootstrap hardening

CI: lint + security + cross-distro

PC-159: surgical-edit gate (proxy)

Chat history threading (proxy + tui)

Plan mode (May 5)

Proxy reliability (May 5)

Sandbox + execution stack

Anti-laziness gates

PC-202: per-layer residual hidden states from llama-server

PC-206 + PC-207: lens-as-PRM (per-step process reward)

Agent loop reliability sweep (May 7)

GH #39: AST-aware surgical edits + tier-aware V3 routing (May 8)

BiasBusters tool-selection mitigations

ASA steering vector

Agent loop hardening

Surgical-edit hardening (V3 routing)

TUI reasoning stream visibility

Tests

[3.0.1] - 2026-04-05

Tool-Call Agent Loop Architecture

V3 Pipeline Integration

CLI Experience

Deployment

Renames (362 total reference updates)

Reliability

Documentation Overhaul

Bug Fixes

Cleanup

[3.0] - 2026-03-05

V3.0 Benchmark Release

[2.5.1] - 2026-02-23

Confirmation Ablation: Embedding Source Hypothesis — STRONG CONFIRMATION

[2.5.0] - 2026-02-21

Ablation Study

Architecture Change

[2.0.0] - 2026-02-18

Architecture Changes

Benchmark Results (Run ID: v2_run_20260217_125310)

Removed

Fixed Post-Release

Notes

[1.0.0] - 2026-02-04