Curious Compass — Code Review¶

Branch: steve/prd-foundation-lock Date: 2026-03-04 Reviewer: Friday Scope: All Python in src/, services/, tests/, plus archived/, Docker, dependencies, architecture alignment

Executive Summary¶

Category	Verdict	Notes
Security	⚠️ PASS (with caveats)	No hardcoded secrets in production code. Static token pattern for SharePoint is a design risk. Archived `.env` committed (benign but sloppy).
Code Quality	✅ PASS	Clean, well-structured, DRY core. Service-level boilerplate is duplicated but acceptable at this stage.
Architecture	✅ PASS	Implementation matches `architecture-consolidated.md` faithfully. Pi-mom router, 4 agents, SQLite hot state, SharePoint cold store — all present.
Error Handling	⚠️ PASS (with caveats)	Core loop handles tool errors gracefully. Three broad `except Exception` swallowing errors silently in service configs. No max-iteration guard on agent loop.
Type Safety	⚠️ PASS (with caveats)	Good use of dataclasses and type hints overall. A few gaps: `callable` instead of `Callable`, `**kwargs: Any` throughout tools.
Testing	⚠️ NEEDS WORK	Core loop, LLM routing, and tools tested well. Zero tests for services, router, state manager CRUD, or Docker builds.
Dependencies	✅ PASS	Sensibly pinned with floor+ceiling ranges. Lock file present. Minimal dependency set.
Docker	❌ FAIL	Build context mismatch — `docker-compose.yml` specifies `build: ./services/bot` but Dockerfiles reference repo-root paths. Builds will fail.
Smells	⚠️ MINOR	Duplicated service boilerplate, mock tools that may linger, shared Docker volume for SQLite without WAL locking guarantees across containers.

Overall: Solid Week 2 foundation. The architecture is clean, the code is well-structured, and the core abstractions (agent loop, LLM adapter, tool system) are genuinely good. The Docker build config is broken and needs fixing before anyone tries to docker compose up. A few other items need attention before Week 3.

Detailed Findings¶

1. Security¶

S1. `GRAPH_ACCESS_TOKEN` — Static Token Pattern (P1)¶

File: src/compass/tools/sharepoint.py:58, docker-compose.yml:26,45,68

The SharePoint client uses a static GRAPH_ACCESS_TOKEN from environment. Graph access tokens expire after ~60-75 minutes. In production, this will silently fail mid-run.

The archived push_to_sharepoint.py (line 22-31) correctly implements OAuth client credentials flow with acquire_token(). The production code doesn't.

Fix: Implement token acquisition via client credentials flow in create_sharepoint_client(), using TENANT_ID, CLIENT_ID, CLIENT_SECRET (the same pattern used in archived/scripts/push_to_sharepoint.py). Add token refresh logic or short-lived caching.

S2. Archived `.env` committed to git (P2)¶

File: archived/.env

Contains a GCP service account email (curioushealth-worker@curious-473917.iam.gserviceaccount.com) and a local credential path. No actual secrets, but the .gitleaks.toml allowlist for archived/ means this was intentionally allowed. Still, best practice is to not commit .env files at all.

Fix: Remove archived/.env from git tracking (it's already allowlisted, but it sets a bad precedent). Add a note in the allowlist that this is intentional and benign.

S3. No authentication on service HTTP endpoints (P2)¶

Files: services/*/main.py

The agent HTTP endpoints (/run, /health) have no authentication. This is acceptable since they're on an internal Docker network with no port bindings to the host, but should be explicitly documented as a design decision. If the docker-compose topology changes (e.g., external port exposure), these become attack surfaces.

Fix: Add a comment in docker-compose.yml noting that agent services are internal-only. Consider adding a shared bearer token check as defense-in-depth when moving to production.

S4. No input validation on `/run` task text (P3)¶

Files: services/bot/main.py:53, services/owned-agent/main.py:22, services/paid-agent/main.py:22

The task field in RunRequest has no length limit. An attacker (or a misbehaving router) could send an enormous string. Pydantic model accepts any string.

Fix: Add max_length constraint to the task field: task: str = Field(max_length=50_000).

2. Code Quality¶

Q1. Duplicated service main.py boilerplate (P3)¶

Files: services/owned-agent/main.py, services/paid-agent/main.py

These two files are nearly identical (54 lines each, ~95% overlap). The only difference is the import of build_agent_config from different config.py files. The _extract_last_text logic also duplicated across bot/main.py.

Fix: Extract a shared create_service_app(config_builder) factory function into src/compass/service.py. Each service's main.py becomes 3 lines. Not urgent — fine for Week 2 scaffolding, but should be consolidated before Week 4 when more services come online.

Q2. Duplicated Mock SharePoint tools (P3)¶

Files: services/owned-agent/config.py:82-100, services/paid-agent/config.py:96-114

MockSharePointReadTool and MockSharePointUploadTool are copy-pasted between owned and paid agent configs. They're identical.

Fix: Move mock tools to a shared location (e.g., src/compass/tools/mocks.py) or simply reuse the real SharePointReadTool/SharePointUploadTool with a try/except fallback pattern (which is already done in _sharepoint_tools() — but the mocks are defined locally in each config).

Q3. `_to_block_dicts` in loop.py re-serializes response content (P3)¶

File: src/compass/loop.py:15-27

The function converts TextBlock/ToolUseBlock dataclasses back to dicts immediately after client.chat() returns them. This creates a dict→dataclass→dict round trip. Not a bug, but the intermediate dataclass step is unnecessary overhead. The design is intentional (typed response from LLM client, normalized storage), so this is a minor observation rather than a fix.

Q4. No dead code detected ✅¶

The codebase is tight. Everything defined is used. No unreachable branches or orphaned functions in the production code.

3. Architecture Alignment¶

A1. Implementation matches architecture doc ✅¶

Reference: docs/canon/architecture-consolidated.md

The implementation faithfully follows the architecture spec: - ✅ Pi-style agent loops (~82 lines in loop.py — matches "~100 lines" spec) - ✅ Pi-mom router pattern with dispatch tool (services/bot/router.py) - ✅ 6 Docker containers: caddy + bot + 4 agents - ✅ SQLite hot state (src/compass/state.py) - ✅ SharePoint cold store (src/compass/tools/sharepoint.py) - ✅ ENV-configurable model routing with prefix-based dispatch (src/compass/llm.py) - ✅ Three-phase model strategy (OpenRouter dev → eval → Azure prod) reflected in create_client() - ✅ Agent tool sets match spec (state_read, state_write, sharepoint_read/upload, plus domain-specific tools) - ✅ HTTP inter-container communication (Option B from spec) - ✅ Docker Compose topology matches spec exactly

A2. Spec mentions Teams SDK — not yet implemented (P3 / Expected)¶

Reference: Architecture §3

The bot service is currently a CLI harness + FastAPI stub, not a Teams SDK integration. This is correct for Week 2 (Teams is Week 7 per the implementation sequence). The architecture is designed for this — the agent loop is SDK-agnostic, and the bot's /run endpoint is a clean seam for future Teams integration.

A3. `brandwatch-agent` and `scoring-agent` are stubs with `profiles: [phase2]` ✅¶

Exactly as specified in the architecture. They return {"status": "phase2-stub"} health checks and are behind Docker Compose profiles. Clean.

A4. Router's agent list matches architecture ✅¶

File: services/bot/router.py:23-27

The dispatch tool's enum includes owned, paid, brandwatch, scoring — matching the four specialist agents in the architecture spec.

4. Error Handling¶

E1. Agent loop has no max-iteration guard (P1)¶

File: src/compass/loop.py:37

The while True loop continues until the LLM returns no tool calls. If the LLM repeatedly returns tool calls (hallucinated or real), this loop runs indefinitely, consuming API credits and potentially timing out the HTTP request.

Fix: Add a configurable max_iterations parameter (default 20-30) to agent_loop(). After exceeding it, raise a RuntimeError or return with an error message. The architecture spec doesn't mention this, but it's a standard safety mechanism for agent loops.

MAX_ITERATIONS = 25

async def agent_loop(config, messages, on_text=None, on_tool_call=None, max_iterations=MAX_ITERATIONS):
    for iteration in range(max_iterations):
        ...
    raise RuntimeError(f"Agent loop exceeded {max_iterations} iterations")

E2. Silent exception swallowing in `_sharepoint_tools()` (P2)¶

Files: services/owned-agent/config.py:97, services/paid-agent/config.py:102

def _sharepoint_tools() -> list[Tool]:
    try:
        sp_client = create_sharepoint_client()
        return [SharePointReadTool(sp_client), SharePointUploadTool(sp_client)]
    except Exception:
        return [MockSharePointReadTool(), MockSharePointUploadTool()]

If SharePoint config is missing, the code silently falls back to mocks with no logging. In production, this means the agent thinks it uploaded to SharePoint but actually did nothing. This is fine for Week 2 dev, but must be logged before production.

Fix: Add import logging; logger = logging.getLogger(__name__) and log a warning in the except block: logger.warning("SharePoint credentials not configured, using mock tools", exc_info=True).

E3. Silent exception in `StateManager._conn()` (P3)¶

File: src/compass/state.py:27

The context manager catches Exception and rolls back, but re-raises — which is correct. However, the _init_db method (line 31) uses a bare with sqlite3.connect(...) which swallows connection errors since it's called in __init__. If the DB path is invalid, the error only surfaces later.

Fix: This is fine as-is. The mkdir(parents=True, exist_ok=True) at line 19 prevents the most common failure. Low priority.

E4. `DispatchTool` has 600s timeout but no error message customization (P3)¶

File: services/bot/router.py:46

If a specialist agent takes >10 minutes, httpx.ReadTimeout propagates to the agent loop's generic except Exception, which returns Error: ReadTimeout: .... This is acceptable error handling — the loop catches it and feeds it back to the LLM. But the 10-minute timeout is worth documenting as a design decision.

5. Type Safety¶

T1. `callable` instead of `Callable` (P3)¶

File: src/compass/llm.py:59

model_transform: callable,

Should be Callable[[str], str] from collections.abc for proper type safety. callable (lowercase) is the builtin function, not a type annotation. Python 3.12 won't error on this, but type checkers will flag it.

Fix: from collections.abc import Callable and use model_transform: Callable[[str], str].

T2. Tool `execute()` uses `**kwargs: Any` throughout (P3)¶

Files: All tool implementations

Every tool's execute() method accepts **kwargs: Any and manually extracts/casts arguments:

async def execute(self, **kwargs: Any) -> str:
    run_id = int(kwargs["run_id"])

This means the type system provides no help for callers, and KeyError is the failure mode for missing arguments (which the generic except in the loop catches).

Fix: This is a deliberate design choice (tools are invoked dynamically by the LLM), so **kwargs makes sense. Consider adding TypedDict annotations as documentation. Low priority.

T3. `OnText` and `OnToolCall` type aliases could be more precise (P3)¶

File: src/compass/loop.py:10-11

OnText = Callable[[str], Awaitable[None]] | None
OnToolCall = Callable[[str, str, dict[str, Any]], Awaitable[None]] | None

These are good but could use Protocol for better IDE support. Very minor.

6. Testing¶

TE1. No tests for service endpoints (P1)¶

Coverage gap: services/bot/main.py, services/owned-agent/main.py, services/paid-agent/main.py

Zero test coverage for: - FastAPI route handlers (/run, /health) - Request/response serialization - build_router_config() integration - Service-level error handling

Fix: Add integration tests using httpx.AsyncClient with FastAPI's TestClient:

from httpx import AsyncClient, ASGITransport
from services.bot.main import app

async def test_health():
    async with AsyncClient(transport=ASGITransport(app=app)) as client:
        r = await client.get("/health")
        assert r.status_code == 200

TE2. No tests for `StateManager` CRUD operations (P2)¶

Coverage gap: src/compass/state.py

The state tools are tested (test_tools.py), but StateManager.enqueue_work(), dequeue_work(), update_run_status(), and write_tool_result() (with non-state_write tool names) are untested.

Fix: Add a test_state.py with direct StateManager tests covering the work queue and run lifecycle.

TE3. No tests for router dispatch (P2)¶

Coverage gap: services/bot/router.py

DispatchTool.execute() makes HTTP calls to agent services. No test mocks or verifies this behavior.

Fix: Add a test using httpx.MockTransport to verify dispatch routes correctly and handles errors.

TE4. Good existing test quality ✅¶

The three existing test files are well-written: - test_llm_routing.py — covers all provider prefixes, missing env errors, unsupported prefix rejection - test_loop.py — tests the full tool-call-then-text cycle and unknown tool handling - test_tools.py — tests state tools with real SQLite and SharePoint tools with mock HTTP

Clean use of pytest.mark.asyncio, monkeypatch, tmp_path, and httpx.MockTransport. No flaky patterns detected.

TE5. `asyncio_mode = "auto"` in pyproject.toml ✅¶

Good — avoids the common pytest-asyncio configuration trap.

7. Dependencies¶

D1. Dependency pinning is sensible ✅¶

File: pyproject.toml

dependencies = [
  "fastapi>=0.115,<1.0",
  "httpx>=0.27,<1.0",
  "pydantic>=2.7,<3.0",
  "uvicorn>=0.30,<1.0",
]

Floor + ceiling ranges. Lock file (uv.lock) present. Using uv for dependency management — modern and fast.

D2. Stub Dockerfiles use `pip install` instead of `uv` (P3)¶

Files: services/brandwatch-agent/Dockerfile, services/scoring-agent/Dockerfile

The stub services use FROM python:3.12-slim + pip install while the real services use ghcr.io/astral-sh/uv:python3.12-bookworm-slim + uv pip install. Inconsistent but fine for Phase 2 stubs.

Fix: Standardize when Phase 2 stubs get real implementations.

D3. No `requirements.txt` for stub services (P3)¶

The brandwatch-agent and scoring-agent Dockerfiles install fastapi uvicorn without version pins. These are stubs so it doesn't matter yet, but should be pinned when they get real implementations.

8. Docker¶

DK1. Build context mismatch — builds will fail (P0 🔴)¶

Files: docker-compose.yml:13,35,55, services/*/Dockerfile

The docker-compose build contexts are set to individual service directories:

build: ./services/bot
build: ./services/owned-agent
build: ./services/paid-agent

But the Dockerfiles reference paths relative to the repo root:

COPY pyproject.toml /app/pyproject.toml    # ← doesn't exist in ./services/bot/
COPY src /app/src                           # ← doesn't exist in ./services/bot/
COPY services/bot /app/services/bot         # ← doesn't exist in ./services/bot/

This means docker compose build will fail with COPY errors. None of these images can be built as configured.

Fix (Option A — recommended): Change docker-compose to use repo root as build context with per-service Dockerfiles:

bot:
  build:
    context: .
    dockerfile: services/bot/Dockerfile

Fix (Option B): Restructure Dockerfiles to use paths relative to their service directory and copy only what's needed. Less clean since src/ is shared.

DK2. No `.dockerignore` file (P2)¶

Missing file

Without .dockerignore, the Docker build context includes .git/ (large), archived/ (~extensive PoC code), test_home.png (2.1MB screenshot), docs/ (research docs), and uv.lock (37KB). This bloats build context transfer and layer cache.

Fix: Add a .dockerignore:

.git/
archived/
docs/
tests/
*.png
*.zip
run/
.env
.env.*
!.env.example

DK3. No health checks in docker-compose (P2)¶

File: docker-compose.yml

All services have /health endpoints, but no healthcheck directives in docker-compose. This means depends_on doesn't wait for service readiness, and Docker won't auto-restart unhealthy containers.

Fix: Add health checks:

owned-agent:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
    interval: 30s
    timeout: 5s
    retries: 3

(Or use python -c "import httpx; ..." since curl may not be in the slim image.)

DK4. `uv run` in CMD is unnecessary (P3)¶

Files: services/bot/Dockerfile:10, services/owned-agent/Dockerfile:8, services/paid-agent/Dockerfile:8

The Dockerfiles install packages system-wide with uv pip install --system, then use uv run uvicorn in CMD. Since packages are installed system-wide, uvicorn is already on PATH. uv run adds unnecessary overhead (it checks/creates virtual environments).

Fix: Change CMD to ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "XXXX"].

DK5. Shared Docker volume for SQLite across containers (P2)¶

File: docker-compose.yml — agent-data:/data mounted on bot, owned-agent, paid-agent

Multiple containers share the same Docker volume and potentially the same SQLite database file (/data/state.db). SQLite with WAL mode handles concurrent reads well, but concurrent writes from different processes can cause SQLITE_BUSY errors. The StateManager does set PRAGMA journal_mode=WAL, which helps.

Fix: Either: 1. Give each agent its own state DB path (e.g., /data/owned-state.db, /data/paid-state.db), or 2. Move to a shared SQLite accessed only by one service (the bot/router), with agents reporting back via HTTP responses.

Option 2 aligns better with the architecture — agents return results via HTTP, and the bot stores state.

9. Smells¶

SM1. `test_home.png` — 2.1MB screenshot checked into repo (P3)¶

File: test_home.png (root)

A 2.1MB screenshot in the repo root. Presumably a test artifact from the PoC. Bloats clone size permanently (even after deletion, it stays in git history).

Fix: Remove from tracking. Add *.png to .gitignore root (it's already there for runtime outputs via run/).

SM2. Mock tools may silently persist into production (P2)¶

Files: services/owned-agent/config.py, services/paid-agent/config.py

The build_agent_config() functions include mock tools (MockWebScrapeTool, MockScreenshotTool, etc.) alongside real tools. These mocks return fake data (json.dumps({"ads_found": 3})). When real tools are implemented, someone needs to remember to remove the mocks.

Fix: Add a TODO(week3) comment on each mock, or gate mocks behind an ENV flag:

if os.getenv("USE_MOCKS", "true").lower() == "true":
    tools.append(MockWebScrapeTool())

SM3. Caddy external hostname is hardcoded (P3)¶

File: Caddyfile

compass.agentflow.dev {
    reverse_proxy bot:3978
}

The hostname compass.agentflow.dev is hardcoded. For deployment to the client's Azure (Weeks 18-20), this needs to change. Not a problem now, but worth noting.

SM4. `archived/.env` contains GCP project ID and service account email (P3)¶

File: archived/.env

While not a secret, GCP_IMPERSONATE_SERVICE_ACCOUNT=curioushealth-worker@curious-473917.iam.gserviceaccount.com reveals infrastructure details. The .gitleaks.toml allowlists the entire archived/ directory, which is sensible since it's reference material.

10. Archived PoC Comparison¶

What should be carried forward:¶

PoC Component	Status	Notes
OAuth token acquisition (`push_to_sharepoint.py:22-31`)	⚠️ CARRY FORWARD	The `acquire_token()` function is the correct pattern for Graph API auth. Production `sharepoint.py` uses static tokens instead.
Chunked upload (`push_to_sharepoint.py:34-68`)	⚠️ CARRY FORWARD	Large file upload via upload sessions is needed for screenshots/HTML. Production `SharePointClient.upload_text()` only handles small text files.
Brand discovery & validation (`brand_discovery.py`)	✅ INFORMATIONAL	Rich Perplexity-powered discovery with HTTP validation. The production version uses mock tools, but this PoC code shows what the real owned-agent tools need to do.
Website capture with Playwright (`website_capture.py`)	✅ INFORMATIONAL	Sophisticated capture with stealth, WAF detection, cookie consent, fallback chain. Production mock tools need to eventually replicate this.
Config loader & models (`config_loader.py`, `models.py`)	✅ INFORMATIONAL	Well-structured YAML config system. Not needed in production (agents get config via system prompts + tool calls), but useful reference for project configuration.
Summary/flow logging (`summary.py`)	✅ INFORMATIONAL	Nice flow logging pattern. Production could adopt a similar structured logging approach.

What should NOT be carried forward:¶

PoC Component	Reason
`archived/app/main.py` — FastAPI web UI with upload form	Replaced by Teams interface. The web UI pattern doesn't apply.
`archived/Dockerfile` — Monolithic Dockerfile with Playwright	Replaced by per-service Dockerfiles. The monolithic approach doesn't fit the multi-container architecture.
Global mutable caches (`_API_KEY_CACHE`, `_PPLX_KEY_RESOLVED`) in `brand_discovery.py`	Global mutable state is an anti-pattern. The production code correctly uses environment lookups via `get_env()`.
GCP Secret Manager integration (various files)	Removed in production (env-only secrets). Correct decision — GCP is not in the Azure compliance boundary.
`requests` library usage (throughout PoC)	Production correctly uses `httpx` (async). Don't regress to synchronous HTTP.
`python-dotenv` / `load_dotenv` (PoC `__main__.py`)	Production uses pure env vars via Docker. Don't add dotenv as a dependency.

Prioritized Fix List¶

Priority	ID	Issue	Effort
P0	DK1	Docker build context mismatch — builds will fail	10 min
P1	E1	Agent loop has no max-iteration guard	15 min
P1	S1	`GRAPH_ACCESS_TOKEN` static token will expire in production	1-2 hrs
P1	TE1	No tests for service endpoints	2-3 hrs
P2	E2	Silent SharePoint mock fallback — no logging	10 min
P2	S3	No auth on internal service endpoints (document decision)	10 min
P2	DK2	Missing `.dockerignore`	10 min
P2	DK3	No Docker health checks	15 min
P2	DK5	Shared SQLite volume across containers	30 min
P2	SM2	Mock tools may persist into production silently	15 min
P2	TE2	No tests for StateManager CRUD	1 hr
P2	TE3	No tests for router dispatch	1 hr
P2	S2	Archived `.env` committed	5 min
P3	T1	`callable` instead of `Callable` type annotation	5 min
P3	Q1	Duplicated service boilerplate	30 min
P3	Q2	Duplicated mock SharePoint tools	15 min
P3	DK4	Unnecessary `uv run` in Docker CMD	5 min
P3	S4	No input length validation on `/run` endpoint	5 min
P3	SM1	2.1MB `test_home.png` in repo root	5 min
P3	SM3	Hardcoded Caddy hostname	5 min

Overall Assessment¶

This is a strong Week 2 foundation. The architecture is clean, the abstractions are well-chosen, and the code reads like it was written by someone who's built agent systems before. The pi-mom router, provider-agnostic LLM adapter, and tool-based agent loop are all solid patterns that will scale.

The one blocking issue is the Docker build context mismatch (P0 DK1) — docker compose up will fail as written. This is a 10-minute fix.

The one design concern to address before production is the static Graph API token (P1 S1). The PoC already has the right pattern (acquire_token() in push_to_sharepoint.py), it just needs to be ported into the production SharePointClient.

The agent loop's unbounded iteration (P1 E1) is a standard safety gap that should be patched early — it's the kind of thing that costs $50 in API calls at 3am when an LLM gets confused.

Test coverage is the biggest area for improvement. The existing tests are high quality, but they only cover the core library. The service layer, router, and state manager need tests before this codebase grows.

Line count: ~1,597 lines of production Python across src/ + services/ + tests/. Lean and purposeful. The architecture doc estimated ~2,500 total — on track.

Recommendation: Fix P0 (Docker) immediately. Address P1s this week. P2s before Week 4. P3s whenever convenient.