This changelog is maintained as a best-effort summary; for line-level detail and any gaps, see the commit history (
git log) or the GitHub PR list.
- AMD ROCm via llama.cpp — including RDNA4 / RX 9070 (gfx1200/gfx1201) and community-verified cards (#26)
- Apple Silicon — native macOS hybrid Metal path (native llama-server for inference + Docker for the rest of the stack) (#32)
- Vulkan universal fallback — one image covering AMD / Intel / Snapdragon / Apple-via-MoltenVK / CPU (#114)
- Tool results are rendered as user-role turns on the wire. Gemma's chat template has no
toolrole and silently droppedrole:"tool"messages, so the model never saw any tool output (list_directory/read_file/run_command) and re-issued the same call until the repetition breaker fired. This was the root cause behind the "it can't see what it's reading / it just loops" reports. Model-agnostic (Qwen reads the[tool result]marker the same way). - Read-dedup false-negative fixed:
fileContentInContextprobed the raw longest line, but tool results are stored JSON-escaped, so any file whose longest line contained a quote (e.g. a Flask app's embedded HTML) was wrongly judged "trimmed" and re-served every read → read loop. Now probes the longest escape-free run. - Traceback → directed edit (#39 / option 3): a
run_commandcrash extracts the deepest in-project frame, quotes the offending line, and steers a minimaledit_file; run tools are banned from the next decision's grammar so the model must edit rather than re-run. move_filetool: relocations/renames (e.g.index.html→templates/) no longer require a read→write→delete dance; shellmv/cppoint here. Refuses to clobber an existing destination.- Steers for common dead-ends:
No module named X→pip install(instead of re-running), and a filename that differs only in case from a real workspace file → the correct name. - Per-turn
max_tokens32768 → 8192 (ATLAS_MAX_TOKENS) and a content-stream loop cut, bounding runaways that previously ran to the slot ceiling. - Conversation window sized to the per-slot context (with the active file pinned in the trim) instead of a flat cap that dropped the file under edit.
run_commandshell gate narrowed from "block every mutating verb" to catastrophic-only (whole-project/rootrm -rf, fork bombs, device destruction), since the sandbox container (read-only rootfs, no-new-privileges, project-only writable mount, cwd jailed) is the real boundary. Ordinarymv/cp/mkdir/rm <file>/sed -inow run;bash -c/evalare unwrapped so a wrapped catastrophic command can't slip through.- Host-sized cgroup limits on the sandbox:
pids_limit(kernel-level fork-bomb stop) and a memory cap (atlas initdetects host RAM and writesATLAS_SANDBOX_MEM~75%);:-0fallback keeps a rawdocker compose upworking uncapped. - Interactive wall-clock cap on the V3 pipeline (
ATLAS_V3_TIMEOUT, default 180s) — a runaway falls back to the model's own (syntax-gated) content instead of hanging the session.
- G(x) operating thresholds are now per-model and ship with the lens artifact (
gx_thresholds.json): the lens service loads them and returns them in each score response; the proxy uses them for its regression checks. The hardcoded 0.3 / 0.15 / 0.05 cutoffs were calibrated to one model's score scale and never fired for a model (e.g. Gemma) whose scores cluster higher.atlas lens buildauto-emits the file, calibrated from the run's PASS-score percentiles. - ast_edit now matches
<script>/<style>(tree-sitter parses them as dedicatedscript_element/style_elementnodes, not genericelements — the old query matched 0). - In-the-loop lens-training data collection: each agent file-write is captured per pass; in the TUI,
/good·/badrate a pass and/review+/deny·/acceptset per-file verdicts, which the proxy turns into labeled, weighted samples (a 👎 pass down-weights even its accepted files; a denial is a full-weight negative)./redoregenerates a rejected file. A one-time "lens retrain available" banner appears once enough balanced samples accrue. atlas lens retraintrains the lens on that collected corpus (weighted G(x)) so it learns the user's own workloads, and emits fresh calibrated thresholds. New env:ATLAS_LENS_DATA_DIR,ATLAS_LENS_RETRAIN_MIN. TUI slash commands:/good /bad /review /deny /accept /redo.
- Intra-file call-graph neighborhood (
calls:/called by:per symbol) rides onoutline_fileand whole-fileread_fileof a.py, gated byATLAS_CALL_GRAPH. Surfaces structure at the localization decision point without a repo-wide scan. (PR #125 by Dmitri Sotnikov, integrated and extended.)
- Translated ARCHITECTURE.md to zh-CN / ja / ko (#25); added a language switcher to the English ARCHITECTURE.md
- Removed dead
ATLAS_USE_FOXcode paths in benchmark runner (#22)
proxy/aider_format.go(whole-file format translator),handleChatCompletions+handleStreamingChat, and the OpenAI-compat agent-loop wrapping are all deleted (~2000 lines)./v1/chat/completionson the proxy is now a transparent passthrough to llama-server via the catch-all handler..aider.model.settings.yml,.aider.model.metadata.json, the.aider*.gitignoreexceptions, and the_find_aider/launch_aiderpaths inatlas/cli/repl.pyare all gone. Bareatlas(interactive tty) now launches the TUI by default; pipe mode falls through to the built-in/solveREPL.- Proxy launcher (
atlas/cli/repl.py) now reaps any pre-existingatlas-proxy-v2process before spawning a fresh one and redirects proxy stdout/stderr to~/.cache/atlas/proxy.loginstead of/dev/null. Closes the "old binary in memory after rebuild" foot-gun.
- New
atlas tuisubcommand launches a native Bubbletea terminal UI as the canonical chat client (and is now the default for plainatlas) - Five-pane layout: header (proxy/cwd/mode/spinner) + pipeline (live V3 stage table from
/events) + chat (glamour-rendered markdown + inline tool calls) + events log + stats strip + textarea input - Hotkeys: Enter send, Shift+Enter newline, Ctrl+L clear, Ctrl+T cycle permission mode, Ctrl+R resend last, Ctrl+C cancel turn / quit, Ctrl+D quit
- Slash commands inside the TUI:
/add /drop /context /diff /commit /undo /run /help /quit - New atlas-proxy
POST /cancelendpoint indexed bysession_id— TUI cancels the in-flight/v1/agentturn on Ctrl+C as defense-in-depth alongside TCP disconnect - 43 atlas-tui Go tests + 4 atlas-proxy
/canceltests, all green undergo test -race tui/is a standalone Go module (github.com/itigges22/atlas-tui) — depends on bubbletea, lipgloss, bubbles, glamour
- Added multilingual documentation: Simplified Chinese (zh-CN), Japanese (ja), Korean (ko) for README, SETUP, and TROUBLESHOOTING
- Added language selector badges to README
- Added star history chart to Latest News section
- Rewrote README contributing section to encourage issue reports and community feedback
- Fixed V3_1_STATUS.md false claims about speed optimizations that were never applied to code
- Documented RDNA4 (RX 9070 / 9070 XT, gfx1200/gfx1201) ROCm 7.x setup in SETUP.md and TROUBLESHOOTING.md — requires
ATLAS_ROCM_TAG=7.2.3-complete;ATLAS_HSA_OVERRIDE_GFX_VERSIONmust stay unset (#119, thanks @Kaihui-AMD) - Corrected stale Metal/macOS docs: the macOS hybrid Metal path (#32) is now documented as shipping across README, SETUP.md, CONFIGURATION.md, and ARCHITECTURE.md (was mislabeled "V3.1.2 planned"); rewrote ARCHITECTURE.md §8.4 to describe the actual hybrid (native llama-server + Docker) rather than the never-shipped pure-native install
- Restructured the README roadmap into V3.1.1 (hardware reach, landed), V3.1.2 (BYO-model + ROCm-on-K8s), and V3.2 (planning phase #120, structural+wavelet reasoning #39, reasoning-with-sampling #9), with a help-wanted backlog — all sourced from open issues
- De-staled user-facing CLI strings:
atlas initandatlas tierno longer print "Metal — V3.1.2 planned"; they report Metal as the supported macOS hybrid path (#32) — strings/comments only, no logic change - Synced zh-CN / ja / ko translations (README + SETUP.md) to the corrected English: Metal/macOS shown as shipping, multi-vendor GPU support table, V3.1.1/V3.1.2/V3.2 roadmap, and fixed NVIDIA-only requirements rows and SETUP_MACOS.md link paths
- Audited and corrected comments across 72 files for V3.0.1 accuracy
- Updated model references: Qwen3-14B to Qwen3.5-9B, embedding dimensions 5120 to 4096
- Renamed service references: rag-api to geometric-lens, Fox to llama-server
- Corrected G(x) XGBoost status: deployed and active (was incorrectly described as removed)
- Fixed normalization comments from "Fox 9B" to "Qwen3.5-9B C(x)"
- Marked legacy Fox code paths as unused in benchmark runner and geo_learning
- Fixed embedding dimensions in test fixtures (5120 to 4096)
- Fixed geometric-lens port in test conftest (8001 to 8099)
- Updated DivSampling test assertions to match actual 4+4+4 perturbation counts
- Corrected G(x) cost field parameter count: ~2.16M / 8.3MB (was ~2.7M / 10MB)
- Finished the 3.0.1 api-portal cleanup: removed
tests/integration/test_e2e_flow.pyandtests/integration/test_e2e_training.py(616 lines). These depended on thetest_api_keyfixture which calls the deleted api-portal service, so every test in them errored on session setup. The 3.0.1 changelog claimed this cleanup was done but these two files survived it. test_empty_messages_handled(tests/infrastructure/test_llm.py) now accepts 200/400/422/500. Current llama.cpp returns 500 for empty messages array; the test was hard-coded to 200 and broke against newer llama.cpp builds.- PC-061 step B: implemented
_emit_event,_classify_stage,_logical_stageinv3-service/main.py. The test file (tests/v3-service/test_event_emission.py) was committed in c5216be ("Install observability") but the implementation never landed, leaving the test red on dev. The contract is now satisfied: legacy{stage, detail}frame always emitted, typed envelope opt-in, suffix-based stage classification (_pass/_skip/_done→ stage_end success=true,_failed→ stage_end success=false,_error→ error event,_retry→ fresh stage_start), and stage_start→stage_end pairing via logical-name parent_id + duration_ms.
- Renamed
atlas-tui→tuiandatlas-proxy→proxyat the repo level; moved ablation data underdocs/reports. 362 reference updates across the tree.
- New
atlas initcommand (atlas/cli/commands/init.py): interactive first-run wizard that probes hardware, picks the right tier (T0/T1/T2/T3), recommends a model, writes~/.atlas/config.yaml. - New
atlas modelcommand (atlas/cli/commands/model.py) withlist/verify/add/removesubcommands; backed bymodel_registry.py(add/get/listwith SHA verification) andmodel_recommendations.py(per-tier defaults, split out fromtier.pyin PC-055.2). atlas/cli/events.py(PC-061 step A): typed-event SSE protocol —Eventdataclass,parse_envelope,iter_events, suffix-based stage classification. Schema documented indocs/PROTOCOL.md. Producer-side helpers in v3-service landed as PC-061 step B (see Test Fixes above).atlas doctorextended for the same hardware probe used by the wizard.
- Hardened fresh-VM install path against partial failures across RHEL 9, Ubuntu, Rocky;
curl … | bashandcurl … | sudo bashboth work. - Auto-install NVIDIA driver libraries on RHEL 9 and put the Python CLI on
$PATH. - Bootstrap now installs Go and pre-builds
atlas-tuiso first-run latency is download-bound, not compile-bound.
- Added ruff (Python lint) and CodeQL (security scan) as GitHub workflows.
- New PR-time test job that runs the full Python suite against a cross-distro install matrix (Ubuntu 22.04 / 24.04 / Rocky 9).
- Fixed pip PEP 660 friction, Rocky curl conflict, and a CLI-wizard GPU-mock path that was breaking the matrix.
- New gate in
proxy/agent.gothat refuses anedit_filewhen the proposed change would rewrite more than a configured fraction of the target file. Forces the model to pick the right tool (write_filefor new files,ast_editfor structural rewrites,edit_fileonly for actual surgical patches).
/v1/agentnow accepts full prior chat history from the TUI, replacing the per-call stateless wrapper. Assistant turns are re-wrapped in a JSON envelope so the proxy can tell user messages from prior model turns when rebuilding context.
- New
/v3/planendpoint on v3-service generates a structured plan (steps + verify step + adherence score) before the agent loop begins; Qwen3 reasoning extraction fixed in the same commit. proxy/agent.goconsumes the plan via a plan bridge, an agent-loop hook that pins the current step into each request, and an adherence gate that flags reasoning that drifts from the active step.- TUI renders
plan_loaded/plan_adherence/plan_reviseevents live (tui/commands.go,tui/model.go). - New docs:
docs/PLAN_MODE.md,docs/PROTOCOL.md.
- Output sanitiser strips reasoning preambles and dangling JSON fragments from model responses before parsing.
- Shell-op gate refuses dangerous
rm -rf /style commands and thebash -cbypass route. - System prompt hardened: clearer tool-use rules, fewer hallucinated fields.
- Verification gate added before
type=done(foundation that tonight's done-without-action gate composes with). - Host paths in tool-call arguments translated to container paths so the sandbox sees the right file when the model thinks in host-fs terms.
- Fixed a conversation-history drop bug where the post-V3 trim was eating the user's prompt; V3 pipeline now fires on more edit shapes (not just write_file).
- Lens-call timeout in v3-service bumped from 5s to 30s with structured fallback logging on miss.
- PC-188: every
run_commandnow executes inside the sandbox container, not on the host. Closes the "model writesrmand the host runs it" risk. - PC-189: workspace-drift fix and a false-positive in the truncating-redirect detector (was rejecting legit
> file.txtwrites). - PC-190: sandbox verify stack pre-bakes common dev deps (pytest, ruff, etc.), uses tmpfs for the working tree, prints a "create a venv" hint when the model tries to install into the system Python.
- PC-191/192/193: sandbox is language-agnostic — works on a working codebase (not just a single-file scratchpad). Detects Python, Node, Go, Rust, Java, C/C++ project layouts and uses the appropriate runner.
- PC-194/195:
write_filerejects empty content, single-line stubs, "TODO"-only files, files withpass-only bodies, and other lazy outputs. - PC-196: explicit
run_backgroundtool for long-running processes (e.g.python app.py); shell&backgrounding throughrun_commandis detected and routed torun_background. - PC-197: completion-claim verification — when the model declares
done, the gate checks the workspace state matches the claim (structural check, foundation that tonight's claim-check gate extends). - PC-198: trims boilerplate from the system prompt and strips host
/workspace/prefixes from model-emitted paths. - PC-199/200: detects "stops at the easy fix" pattern (one tweak then
done); raises tier-aware turn caps so the model has runway to complete a real task. - PC-201:
write_fileis allowed to overwrite an existing file when that file is corrupted (e.g. truncated mid-write from a prior crashed turn) instead of failing with the usual "file exists" gate.
PC-202: per-layer residual hidden states from llama-server
- Patched llama-server's
/embeddingendpoint to accept alayers: [int]parameter and return the residual-stream hidden state at each requested layer. Foundation for both PC-207 (per-token lens scoring) and tonight's ASA steering vector build.
- PC-206: thinking-mode plumbing in
v3-service/main.pyLLMAdapter—thinkingkeyword resolves per-call against an instance default. - PC-207: lens computes per-token C(x) + G(x) scores during candidate generation;
/internal/lens/score-per-stepexposes aggregates (gx_min, gx_mean, off_rails_idx, cx_norm_max) the proxy and v3-service consume for early-exit and ranking. Wired into v3-service candidate generation, the agent loop (foundation for tonight's reasoning-repeat + path-aware detectors), with structured per-step logging across all three services. - Severe-score short-circuit: gx_min below 0.05 fires a corrective immediately without waiting for a second sample (calibrated against the May 7 dashboard.html stub-loop session).
- V3↔lens alignment: lens now vetoes a sandbox-passing candidate when its gx_min indicates a stub or placeholder collapse — closes the "sandbox approves a stub V3 generated" loophole.
- Empty-response fallback: when the model returns nothing parseable, the loop emits a corrective hint instead of retrying the same prompt verbatim.
- Plan-threshold guard: refuses to enter the agent loop on a plan with adherence score below threshold.
- Tool-repeat detector: precursor to tonight's reasoning-repetition detector — catches verbatim tool-call repeats within a window.
- v1 (5e44ffb): new
ast_edittool — friendly-selector AST node replacement using tree-sitter. Supportsfunction:NAME,class:NAME, and<tag>selectors. The selector vocabulary is intentionally small in v1; nested selectors (e.g.<style>inside<head>) are NOT supported and produce a "0 nodes matched" error. - Point 1 (468a555): structural verification veto for V3 candidates — rejects candidates that pass sandbox but fail structural shape checks (e.g. removed a required import, lost the class definition).
- Point 2 (b95f741): cyclomatic-complexity enrichment in tier classification —
tier.pynow considers logic density, not just line count, when assigning T0/T1/T2/T3. - Point 3 (2629652): Phase 3 repair receives call-chain context (callers + callees of the file being repaired) so the repair model can reason about cross-file effects.
- Point 4 (bd0b02b): auto-injection of a reachability slice from the user's message — the lens picks the most relevant file regions and inlines them into the system context before the loop starts.
- Plan generation made aware of
ast_editso plan steps suggest it when the target is a structural edit. edit_file"string not found" error now suggestsast_editas the recovery;write_filerejection on existing files also points toast_edit.- Three follow-up fixes: encoding (HTML entities in selector args), trim-resilience (large
contentfields surviving the post-V3 trim), and parse-failure categorization in logs. - Jinja crash fix when
symbol_indexinjects snippets: the snippet role was being set tosystem, which Jinja resolved as a template literal; changed touserrole.
- Tool descriptions rewritten to push the model toward the right tool for the task:
edit_fileframed as the surgical default,ast_editmarked REQUIRED for HTML/Python structural edits,write_filerestricted to new-file creation only. - Conditional GBNF grammar built per turn: when the loop has already entered a step the model has just claimed done, the grammar bans re-emitting the same tool name token-side so the model can't loop on the same failed tool call.
- Per-step tool-list filter (
buildToolDescriptionsExcluding): the system prompt strips tools the loop has explicitly excluded for this step, so the model never sees them as options. - ASA (Activation Steering for Aast_edit) wired into the inference entrypoint:
inference/entrypoint-v3.1-9b.shauto-detects/models/ast_edit_steering.ggufand applies it always-on via llama.cpp--control-vector. Default scale 0.5, default layer range full-model, both overridable via env. PC-202's per-layer-residual/embeddingpatch is the upstream that makes this possible.
- New
geometric-lens/asa_calibration/directory: 1000 contrast-pair prompts (50+ base templates × variation pools) cover function selectors (54%), HTML tags (27%), and CSS classes (19%).generate_pairs.pyproducescontrast_pairs.jsonl;build_steering_vector.pyextracts residuals via the lensextract_per_layer_per_tokenendpoint at layer 27 (of 36 in Qwen3.5-9B), means across tokens/prompts/sign, and writes a llama.cpp-format GGUF control vector. Final vector: 16736 bytes, ‖v_global‖ = 8.6444 after 730s on 2000 prompts.
- Plan-progress reminder (
proxy/plan_reminder.go): ephemeral system note injected into every step request renderingplan progress N/M — currently on step "sX": <action> <target>plus done/remaining sub-step IDs. Lazy-initializesctx.PlanStepsSatisfied. Not persisted toctx.Messages, so it survives the post-V3 conversation trim cycle. - Reasoning-repetition detector (
proxy/reasoning_repeat.go): tracks the model's reasoning-stream opening; on 3 consecutive identical normalized openings (case-folded, whitespace-collapsed, 80-char snippet) the loop queues a corrective system message. Successfully broke a session-2 stuck loop in live testing. - Path-aware error breaker (
extractFailurePathinproxy/lens_score.go, breaker logic inproxy/agent.go): tracksctx.RecentFailurePathsper tool failure. Known limitation: the v1 implementation resets on intervening successes, so it can miss long stuck-loop sequences with sporadic productive turns in between. - Done-without-action gate (
proxy/guardrails.go): refusestype=donewhen the user prompt is fix-intent and no successful verification command has run this loop. Action-intent words (rewrite,create,add,update,redesign) also trigger a productive-change check parallel to the existing verify check. Caught 4 false-success done attempts in live testing. - Truncation recovery shims (
proxy/agent.go):recoverTruncatedAstEdit+recoverTruncatedEditFile+recoverTruncatedToolCallrescue malformed tool emissions from the model and re-pack into a well-formed shape. Each shim is targeted at a specific failure mode observed in production logs. - Conversation history error surfacing (
proxy/agent.go):extractModelResponsenow exposes the actualUnmarshalerror path (directErr vs balancedErr) so debug logs distinguish parse-shape failures from content failures. - Removed
ResponseHeaderTimeoutfromproxy/v3_bridge.goand removed all client-level timeouts on the V3 HTTP path. Long V3 chains (10+ minute passes) were getting bounced by the 10-minute response-header window even when the pipeline was making progress. - Removed
absoluteMaxTurnsceiling fromproxy/types.go. Turn caps now come solely fromTierMaxTurns(T0:5, T1/T2/T3:0 = uncapped) with no override clamp. Reasoning: 8 detectors armed in the loop make a hard cap redundant — let the detectors decide when to break.
proxy/tools.goast_edit executor: tier classification now usesmax(oldTier, newTier)and the previous V3-tier floor for HTML was dropped (it was over-triggering V3 on the smallest CSS tweaks). Doctype dedup (leadingDoctypeRe+stripLeadingDoctypeinproxy/guardrails.go) prevents the model's "" prefix from being inserted twice when ast_edit replaces the<body>.- Suspiciously-shrunk-edit guard (
validateNotSuspiciouslyShrunkinproxy/guardrails.go): rejects an edit that shrinks an >100-byte file to <64 bytes. Final threshold tuned after a legitimate 80-byte one-liner refactor was false-rejected at 128. Triggered on a destructive 32-byte stub in pre-release testing. - Working-directory phantom-dir guard (
validateWorkingDirReference+workspaceRefRe): catches model emissions that try tocd templates/workspaceor similar nested-workspace references; legitimatecd /workspaceat the sandbox root is allowed. - Action-intent gate (
actionIntentWords+isActionIntentMessage+actionWithoutProductiveChangeMessage): companion to the verification gate, catchesdonedeclarations onrewrite/create/add/redesign-style prompts that don't include a productive edit this loop.
tui/model.goadds astreamingReasoningTextbuffer and areasoning_tokenevent handler that renders with a‹thinking›prefix so the user sees the model's reasoning stream live alongside its content. Both buffers reset onllm_call_start/llm_call_end.tui/commands.goextended to forward thedelta.ReasoningContentfield from the SSE stream asreasoning_tokenevents.proxy/agent.goplumbs reasoning content through the agent loop: stashesctx.LastTurnReasoning, capturespendingReasoningCorrectiveviarecordReasoning, and re-emits reasoning deltas to the client mid-turn (with async.Mutexaround thehttp.ResponseWriterto fix the SSE race that produced the "chunked line ends with bare LF" errors).
- New Go tests:
proxy/path_aware_test.go,proxy/reasoning_repeat_test.go,proxy/recover_truncated_test.go,proxy/step_restriction_test.go. Extendedproxy/guardrails_test.go,proxy/plan_hook_test.go. - All
go test ./...on bothproxy/andtui/modules pass. - Full Python suite: 1055 passed / 4 skipped / 0 failed / 0 errors locally.
- Replaced Aider format-translation proxy with structured JSON tool-call agent loop
- Grammar-constrained output via llama-server
response_format:json_object— 100% valid JSON - 8 tool definitions:
read_file,write_file,edit_file,delete_file,run_command,search_files,list_directory,plan_tasks - Per-file tier classification: T1 (config/data) writes directly, T2 (logic/features) routes through V3 pipeline
- 3400+ lines new Go code across 12 files in
proxy/
- All 14 V3 steps wired into
write_file/edit_fileexecutors for T2/T3 files - PlanSearch → DivSampling → Budget Forcing → Build Verification → C(x)/G(x) Scoring → Best-of-K → S*/Blend-ASC → Failure Analysis → PR-CoT Repair → Refinement Loop → Derivation Chains → Metacognitive → Final Write
- Per-file-type build verification: tsc, py_compile, gcc, go build, cargo check, bash -n
- V3 service SSE streaming: pipeline progress visible in real-time
atlascommand: starts all services and launches Aider- Streaming progress:
[Turn N/M]with tool call details, V3 pipeline steps, completion summary - Exploration budget: 4 consecutive read-only calls triggers nudge, prevents model from over-exploring
- Pre-injected project context: model sees project file list in system prompt
- File deletion via fast-path before tier classification
- Truncation prevention: 32K context, reject write_file for existing files >100 lines, detect truncated args before execution
- Docker Compose (
docker-compose.yml) for full stack orchestration - Podman compatible with host networking
.env.examplewith all configurable parametersatlasscript auto-detects Docker vs bare-metal and routes accordingly
rag-api/→geometric-lens/(directory + all references)ATLAS_RAG_URL→ATLAS_LENS_URLATLAS_FOX_URL→ATLAS_INFERENCE_URLfoxURL→inferenceURL(Go code)ralph-loop→verify-repair looprag.py→pipeline.py(geometric-lens orchestration)
- 8-level test × 3 iterations: 95.8% (23/24)
- 5-language integration: 100% (Shell, Python, Rust, C, Go)
- L6 (add feature to existing project): 67% — marked as future improvement
- ARCHITECTURE.md: Complete rewrite — 13 Mermaid diagrams (service topology, agent loop flow, V3 pipeline, module map, sequence diagrams), every component verified against source code
- API.md: Complete rewrite — every endpoint across all 5 services verified against source, request/response formats, SSE stages
- CLI.md: Complete rewrite — startup flow diagram, streaming format, workflow examples, troubleshooting, env vars, Aider config reference
- CONFIGURATION.md: Complete rewrite — every env var across all services verified, internal constants, Docker Compose vs K3s differences
- MAP.md: Complete rewrite — every file in repo with clickable tree, 150 file links, 18 description tables
- SETUP.md: Complete rewrite — verified build steps, first-run guide, bare metal, K3s, hardware sizing, Lens training guide
- TROUBLESHOOTING.md: Complete rewrite — quick diagnostics, 20+ issue scenarios with verified fixes
- README.md: Honest 7-step setup with actual download command, prerequisites, model clarity (Qwen3-14B vs Qwen3.5-9B)
- Reorganized historical docs into
docs/reports/(ablation studies, status tracking, migration guides)
- geometric-lens Dockerfile port mismatch: Container was listening on 8001 but docker-compose expected 8099 — fresh Docker Compose deploys had broken Lens service. Fixed Dockerfile to use port 8099.
- Python CLI default RAG port:
atlas/cli/client.pydefaulted to port 31144 (K3s NodePort) instead of 8099 (Docker Compose). Fixed default to match Docker Compose. - Missing Aider config files:
.aider.model.settings.ymland.aider.model.metadata.jsonwere not in the repo — theatlaslauncher would fail without them. Restored both files and added.gitignoreexceptions. - GitHub Issue #6:
hostname -I→ portable fallback chain (ip addr→hostname -I→hostname -i) for Arch Linux compatibility - GitHub Issue #10:
rag-api/→geometric-lens/restructuring resolved missing models directory - GitHub Issue #11: Added Geometric Lens training documentation to SETUP.md with HuggingFace dataset link
- GitHub Issue #12 / PR #13:
docker image exists→docker image inspectin build script
- Removed 62 stale test directories, old v1 proxy binary, dead G(x) metric tensor training scripts
- Removed stale tests for deleted services (api-portal, dashboard, embedding-service, task-worker)
- Removed root-level development artifacts (bubble_sort.py, snake_game.py, etc.)
- All hardcoded
/home/isaac/paths replaced with$HOMEorATLAS_DIRenv vars
- 74.6% LCB pass@1 (447/599) on frozen Qwen3-14B
- Full ablation study: conditions A–D with per-task results
- Phase 1 (PlanSearch/DivSampling): +12.4pp
- Phase 3 (PR-CoT/Refinement/Derivation): +7.3pp
- Self-verified Phase 3 using model-generated test cases
- H1: Self-embeddings restore C(x) discrimination: CONFIRMED (+39.5pp)
- C(x) selects passing candidate 87.8% on mixed-result tasks vs 48.3% random (p < 0.000001)
- V2.5 result (+0.6pp under nomic 768-dim) was an embedding source limitation, not architecture failure
- Reverse energy selects only 4.3%, proving strong directional signal
- Val AUC: 0.9934, energy separation: 21.75 (7.2x wider than V2.5)
- H2: G(x) adds value beyond C(x): NEUTRAL (0.0pp)
- G(x) contributes zero at optimal alpha (0.001); monotonically degrades at higher alpha
- Zero corrections, zero breakages across all mixed-result tasks
- Outcome B: Ship C(x)-only with self-embeddings, remove or redesign G(x)
- Difficulty routing validated: Q1 (low energy) = 100% oracle, Q4 (high energy) = 0.3%
- C(x) confirmed as both verifier (87.8% selection) and router (perfect difficulty stratification)
- Runtime: 24h 42m on LiveCodeBench v5 (599 tasks, K=3, 4 epochs)
- Infrastructure: Qwen3-14B with
--embeddings(no spec decode, ~45 tok/s) - Risk R6 (Lens non-discriminating) RESOLVED; Risk R11 (no verifier) substantially mitigated
- Systematic ablation of Geometric Lens, router, and infrastructure components
- Finding: C(x) energy scoring ≈ random for candidate selection under nomic embeddings (37.7% vs 37.1%, within 3.4pp seed variance) — V2.5.1 confirmed this was an embedding source limitation (87.8% accuracy restored with self-embeddings)
- Finding: C(x) energy strongly correlates with task difficulty (58.5% vs 18.9% pass rate across tiers)
- Finding: G(x) metric tensor confirmed dormant (5.2M params, zero impact)
- Finding: Pattern cache bypassed entirely by benchmark runner
- Discovered
--embeddingsflag breaks speculative decoding (forces n_batch=512) - Migrated to two-server sidecar architecture: generation + spec decode on Server A, embeddings via nomic-embed-text-v1.5 on Server B
- Recovered ~2.6x generation throughput (~38 tok/s → ~100 tok/s)
- Net VRAM delta: approximately -230 MiB (sidecar cheaper than --embeddings overhead)
- Replaced Qdrant vector DB + embedding service with PageIndex tree-based RAG
- Added Geometric Lens (Cost Field + Metric Tensor) for candidate quality prediction
- Added Confidence Router with difficulty-based adaptive-k selection
- Added Pattern Cache (Redis + Ebbinghaus memory decay)
- Added Best-of-K pipeline with parallel candidate generation
- Added sandboxed code execution for benchmark evaluation
- Added speculative decoding with Qwen3-0.6B draft model
- Added KV cache quantization (q4_0)
- LiveCodeBench: 36-41% pass@1 (across Lens training epochs, k=3)
- GPQA Diamond: 47.0% (k=5)
- SciCode: 14.7% sub-problems (341 tasks, k=1)
- Geometric Lens: 0.968 Val AUC, ~80% first-pick accuracy (151/188)
- Throughput: 109 tasks/hr on RTX 5060 Ti 16GB
- Qdrant vector database
- MiniLM-L6-v2 embedding service
- LoRA nightly training pipeline (moved to v1_archived/, CronJob suspended)
- V1 benchmark suite (HumanEval, MBPP, Custom)
- mlock allocation failure — added LimitMEMLOCK=infinity systemd override for K3s
- Speculative decode slot 1 failure — quantized draft KV cache to q4_0 (-ctkd/-ctvd)
- Dashboard crash-loop — fixed missing Jinja2 default filters
- IFBench evaluation incomplete (excluded from results)
- All results from single benchmark run (variance unknown)
Initial release. See benchmark/v1_benchmark_report.md for V1 results.