Ops platform · 2026 · Live

Rampart

Enforcement-first operational OS for field service ops. Deterministic workflow engine plus AI-augmented incident command.

GitHub

Tech stack

Python · FastAPI · Pydantic v2 · PostgreSQL · Redis Streams · APScheduler · React · Vite · Groq · Docker Compose

The problem

Field service operations sit on top of a workflow that has to survive incomplete closeouts, SLA games, manager overrides made for the wrong reason, and a 3am incident where nobody is sure who acknowledged what. Most teams reach for a CRM with a status dropdown and hope the audit story works itself out. It does not. The control system has to refuse a bad transition before it lands, capture every override with justification and approval, escalate on its own when an SLA breaks, and prove all of it after the fact.

Goals

Make the workflow a real state machine with guards, side effects, and atomic transitions
Centralize every authorization through one enforcement engine that returns allow, deny, allow_with_override, or escalate, with reason codes
Capture every override with actor, role, justification, supervisor approval, and expiry
Watch SLAs in the background and raise warning then breach events, escalating up the on-call ladder if nobody acknowledges
Open an incident room automatically on breach so the bridge has the job, events, responders, timeline, and chat in one place
Add an AI layer for triage, dispatch ranking, closeout drafting, and audit Q&A without letting it touch the deterministic core
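The second and third goals boil down to two small data shapes. A minimal sketch, assuming plain dataclasses rather than the project's actual Pydantic models; the class and field names (Outcome, Decision, OverrideRecord, approved_by, expires_at) are illustrative, and only the four outcome values and the R003_OVERRIDE_APPROVED code come from the write-up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional, Tuple


class Outcome(str, Enum):
    # The four outcomes named in the goals.
    ALLOW = "allow"
    DENY = "deny"
    ALLOW_WITH_OVERRIDE = "allow_with_override"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Decision:
    # Hypothetical shape: one outcome plus the reason codes that explain it.
    outcome: Outcome
    reason_codes: Tuple[str, ...] = ()

    @property
    def permits_transition(self) -> bool:
        return self.outcome in (Outcome.ALLOW, Outcome.ALLOW_WITH_OVERRIDE)


@dataclass(frozen=True)
class OverrideRecord:
    # Fields mirror the goal: actor, role, justification, approval, expiry.
    actor: str
    role: str
    justification: str
    approved_by: Optional[str]   # None until a supervisor signs off
    expires_at: datetime

    def is_usable(self, now: datetime) -> bool:
        # An override counts only once approved and before it expires.
        return self.approved_by is not None and now < self.expires_at


now = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
override = OverrideRecord(
    actor="tech-7",
    role="technician",
    justification="customer unavailable for sign-off",
    approved_by="sup-2",
    expires_at=now + timedelta(hours=1),
)
decision = Decision(Outcome.ALLOW_WITH_OVERRIDE, ("R003_OVERRIDE_APPROVED",))
```

Keeping the decision immutable means a caller cannot quietly upgrade a deny to an allow after the engine has answered.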

The solution

Declarative FSM in Python: states, transitions, guards run pre-transition, side effects run post-transition, transitions and audit rows committed in the same Postgres transaction
Enforcement engine with a versioned rule catalog (R001 closeout evidence, R003 override capture, more to come), returning structured decisions with reason codes
Redis Streams event bus that every subscriber (dashboard, SLA watcher, AI layer) reads from, decoupling the operational primitives
SLA watcher as a background worker that emits sla.warning then sla.breach and auto-opens an incident in the same transaction
Incident command engine with severity-based escalation ladder, on-call rotation lookup, responder tracking, and a system timeline
Provider abstraction for the AI layer: a deterministic Echo provider for tests and offline demos, a Groq provider (llama-3.3-70b-versatile) for real LLM use, swappable through a single env var
Four agents (triage, dispatch, closeout, audit chat) each writing to an ai_recommendations table the deterministic core never reads, so an LLM can suggest but never decide
React command-centre dashboard polling the API, surfacing live job board, event stream, incident bridge, triage card, and audit chat panel
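The declarative FSM in the first item can be sketched in memory. This is an illustration under assumptions, not the repository's code: the Transition and StateMachine names, the "closed" target state, and the MISSING_PHOTO and NO_SUCH_TRANSITION codes are hypothetical, and the Postgres transaction is left out. Guards run before the state changes, side effects run after.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Job = Dict[str, object]
Guard = Callable[[Job], List[str]]     # returns reason codes; empty list means pass
SideEffect = Callable[[Job], None]


@dataclass
class Transition:
    source: str
    target: str
    guards: List[Guard] = field(default_factory=list)
    side_effects: List[SideEffect] = field(default_factory=list)


class StateMachine:
    def __init__(self, transitions: List[Transition]) -> None:
        # Edges are keyed by (source, target) so lookups are declarative.
        self._edges: Dict[Tuple[str, str], Transition] = {
            (t.source, t.target): t for t in transitions
        }

    def apply(self, job: Job, target: str) -> List[str]:
        """Try to move the job to target; return reason codes (empty on success)."""
        edge = self._edges.get((str(job["state"]), target))
        if edge is None:
            return ["NO_SUCH_TRANSITION"]
        reasons = [code for guard in edge.guards for code in guard(job)]
        if reasons:
            return reasons             # denied: state is left untouched
        job["state"] = target
        for effect in edge.side_effects:
            effect(job)                # post-transition side effects
        return []


def require_photo(job: Job) -> List[str]:
    return [] if job.get("photo") else ["MISSING_PHOTO"]


events: List[str] = []
fsm = StateMachine([
    Transition(
        "closeout_pending",
        "closed",
        guards=[require_photo],
        side_effects=[lambda job: events.append("job.closed")],
    ),
])
```

Calling fsm.apply on a job without a photo returns the reason codes and leaves the job at closeout_pending; the same call after the photo is attached advances the state and fires the side effect.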

My role

→ Solo architect and engineer, system design through deploy
→ Five-layer architecture (engine, ops, AI, API, dashboard) and the JD-aligned module layout
→ FSM, enforcement engine, audit model, and override flow
→ SLA watcher, escalation ladder, and incident command bridge
→ Provider abstraction plus the four AI agents and their schemas
→ Phase plan, test strategy (51 passing tests against real Postgres), and screenshot-driven case study

UI direction

Operator-first command-centre dashboard. Left column is a colour-coded live job board, right column is the Redis Streams event tail. Click into a job and the incident room opens with timeline, responders, chat, triage card, and the action ladder. A floating audit chat panel sits bottom-right for natural-language questions over the audit log.

User flows

False closeout, denied and audited

1 Technician POSTs a closeout transition with missing photo, missing checklist, or out-of-radius geo
2 FSM consults the enforcement engine; R001 returns deny with the exact reason codes for each missing piece of evidence
3 Denied transition still writes to the audit log alongside a per-rule row listing what was missing
4 Job stays at closeout_pending; the audit story is complete (who tried, when, why blocked) even though state did not advance
5 Dashboard event stream shows transition.denied with the rule that fired
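Step 2 of this flow is a pure function over the submitted evidence. A sketch, assuming a 200 m service radius and per-item reason codes (MISSING_PHOTO, MISSING_CHECKLIST, GEO_OUT_OF_RADIUS) that I made up for illustration; the real R001 codes and thresholds may differ. Distance uses the standard haversine formula.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Closeout:
    # Hypothetical payload shape for a closeout attempt.
    photo_url: Optional[str]
    checklist_done: bool
    tech_lat: float
    tech_lon: float


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def r001_closeout_evidence(
    closeout: Closeout,
    site_lat: float,
    site_lon: float,
    radius_m: float = 200.0,   # assumed radius, not taken from the project
) -> Tuple[str, List[str]]:
    """Return ("allow" | "deny", reason codes), one code per missing item."""
    missing: List[str] = []
    if not closeout.photo_url:
        missing.append("MISSING_PHOTO")
    if not closeout.checklist_done:
        missing.append("MISSING_CHECKLIST")
    distance = haversine_m(closeout.tech_lat, closeout.tech_lon, site_lat, site_lon)
    if distance > radius_m:
        missing.append("GEO_OUT_OF_RADIUS")
    return ("deny" if missing else "allow", missing)
```

Collecting every failing check, rather than stopping at the first, is what lets the audit row list exactly which evidence was missing.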

SLA breach to incident bridge

1 SLA watcher (background worker) sees an open job approaching its deadline and emits sla.warning to the Redis stream
2 Deadline passes with no closeout; watcher emits sla.breach and the incident bridge opens a HIGH incident in the same transaction
3 On-call dispatcher is seated as the level-1 responder, a system message lands in the incident chat
4 Escalation to level 2 pulls the on-call supervisor in; every responder change goes through the ladder, every message persists
5 Supervisor approves a manual override (R003): justification recorded, expiry set, the override row links the denied transition to the new allow_with_override transition
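The timing decisions in steps 1 to 4 reduce to two small functions. A sketch with assumed numbers: the 30-minute warning window, the 10-minute acknowledgement timeout, and the third ladder rung (ops_manager) are hypothetical; only the event names and the dispatcher and supervisor levels come from the flow above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_WINDOW = timedelta(minutes=30)   # assumed lead time for sla.warning
ACK_TIMEOUT = timedelta(minutes=10)      # assumed wait before escalating a level
LADDER = ["dispatcher", "supervisor", "ops_manager"]  # levels 1, 2, 3


def sla_event(deadline: datetime, now: datetime, closed: bool) -> Optional[str]:
    """Return the event the watcher should emit for one job, if any."""
    if closed:
        return None
    if now >= deadline:
        return "sla.breach"
    if deadline - now <= WARNING_WINDOW:
        return "sla.warning"
    return None


def unacked_level(opened_at: datetime, now: datetime) -> int:
    """Ladder level (1-based) an unacknowledged incident should have reached."""
    steps = max(0, (now - opened_at) // ACK_TIMEOUT)
    return min(1 + steps, len(LADDER))


deadline = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(sla_event(deadline, deadline - timedelta(minutes=15), closed=False))  # sla.warning
print(LADDER[unacked_level(deadline, deadline + timedelta(minutes=10)) - 1])  # supervisor
```

A real watcher would also need to remember which events it has already emitted per job so a warning is not repeated on every tick; that bookkeeping is omitted here.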

AI triage and audit chat, deterministic core untouched

1 Incident opens; a POST to /ai/triage/incidents/{id} builds a structured context (job, recent events, severity, responders) in a single transaction
2 Provider.generate_json runs the triage schema; output (severity tier, recommended action, confidence, rationale) lands in ai_recommendations
3 Dashboard triage card polls /ai/recommendations/by-target and surfaces the recommendation with a re-run button
4 Operator asks an audit question in the audit chat panel; the agent hands a fixed-window slice (last 50 transitions, last 10 incidents, last 30 stream events) plus the question to the provider
5 Answer plus citations are written back; the deterministic core never imports the AI module, the LLM has no SQL access, and every output is auditable
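The provider seam behind step 2 can be sketched with a Protocol and a factory. Assumptions are loud here: the AI_PROVIDER variable name, the generate_json signature, the schema-as-dict representation, and the validate helper are all illustrative, and the Groq branch is deliberately left unimplemented rather than guessed at.

```python
import os
from typing import Any, Dict, Mapping, Protocol

# Triage output fields named in the flow; the Python types are assumed.
TRIAGE_FIELDS: Dict[str, type] = {
    "severity": str,
    "recommended_action": str,
    "confidence": float,
    "rationale": str,
}


class Provider(Protocol):
    def generate_json(self, prompt: str, schema: Mapping[str, type]) -> Dict[str, Any]: ...


class EchoProvider:
    """Deterministic provider: fills each field with a placeholder of its type."""

    def generate_json(self, prompt: str, schema: Mapping[str, type]) -> Dict[str, Any]:
        defaults = {str: f"echo: {prompt[:40]}", float: 0.5, int: 0, bool: False}
        return {name: defaults[tp] for name, tp in schema.items()}


def get_provider(env: Mapping[str, str] = os.environ) -> Provider:
    # Single switch point; AI_PROVIDER is a hypothetical variable name.
    name = env.get("AI_PROVIDER", "echo")
    if name == "echo":
        return EchoProvider()
    raise ValueError(f"provider {name!r} is not wired up in this sketch")


def validate(payload: Mapping[str, Any], schema: Mapping[str, type]) -> Dict[str, Any]:
    """Reject provider output with missing fields or wrong types."""
    for name, tp in schema.items():
        if not isinstance(payload.get(name), tp):
            raise ValueError(f"field {name!r} missing or not {tp.__name__}")
    return dict(payload)


provider = get_provider({"AI_PROVIDER": "echo"})
recommendation = validate(
    provider.generate_json("Job 42 breached SLA", TRIAGE_FIELDS), TRIAGE_FIELDS
)
```

Validating the output on the way in means a recommendation row is well-formed whichever provider produced it, and the Echo provider keeps that path testable without a network call.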

Screenshots


Phase 1 test suite: FSM edge map, R001 closeout-evidence rule (happy plus four denial paths), and end-to-end paths against a real Postgres.

False closeout, forensically recorded. The denied transition lands in the audit log with a per-rule row listing exactly which evidence was missing. Nothing happened, and the system can prove who tried.

Command-centre dashboard with four seeded jobs in four SLA states. Left column reads GET /board and colour-codes by deadline distance; right column tails the Redis Streams event bus.

Manager override unblocks a denied closeout. Event stream shows transition.applied R003_OVERRIDE_APPROVED right above the original transition.denied R001_INCOMPLETE_CLOSEOUT_EVIDENCE. The override authorises one bypass, never relaxes the rule.

Incident room with the command bridge. SLA breach auto-opened a HIGH incident, the on-call dispatcher was seated as level 1, an escalation pulled the supervisor in at level 2, and every responder, message, and state change is persisted.

Phase 4 AI surface: a triage agent card inside the command bridge (severity, action, confidence, rationale) and an audit chat panel that turns natural-language questions over the audit log into a cited timeline. Every output lands in ai_recommendations.

Key learnings

Atomic transition plus audit-row commit is the single design decision that buys the whole audit story; trying to log after the fact loses the ordering guarantee
Returning a structured decision (allow, deny, allow_with_override, escalate) with reason codes turns the enforcement engine into the one place to reason about authorization, which is worth more than the rules themselves
An Echo provider that produces plausible structured outputs is not a stub. It is the proof that swapping providers is a one-line factory change, and it keeps the test suite hermetic
Keeping the deterministic core blind to ai_recommendations is what lets the AI layer ship without changing the safety story: an LLM can suggest, a human commits, and the existing rules run on the commit
Phases as commits (Phase 1, 2, 3, 4 each a single commit on main) make the progression legible to a reviewer who scrolls git log before reading any code
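The first learning is easy to show in miniature. The sketch below uses SQLite from the standard library as a stand-in for Postgres, with a hypothetical two-table schema (jobs, audit_log) and a hypothetical "closed" state; the point is only that the state change and the audit row succeed or fail together.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE audit_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            job_id INTEGER NOT NULL,
            actor TEXT NOT NULL,
            from_state TEXT NOT NULL,
            to_state TEXT NOT NULL
        );
        INSERT INTO jobs (id, state) VALUES (1, 'closeout_pending');
        """
    )
    return conn


def apply_transition(
    conn: sqlite3.Connection, job_id: int, actor: Optional[str], to_state: str
) -> None:
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the UPDATE and the audit INSERT land together or not at all.
    with conn:
        (from_state,) = conn.execute(
            "SELECT state FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        conn.execute("UPDATE jobs SET state = ? WHERE id = ?", (to_state, job_id))
        conn.execute(
            "INSERT INTO audit_log (job_id, actor, from_state, to_state) "
            "VALUES (?, ?, ?, ?)",
            (job_id, actor, from_state, to_state),
        )


def job_state(conn: sqlite3.Connection, job_id: int) -> str:
    return conn.execute("SELECT state FROM jobs WHERE id = ?", (job_id,)).fetchone()[0]
```

Passing actor=None violates the NOT NULL constraint on the audit row, which rolls the state update back too: there is no way to end up with a moved job and no record of who moved it.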

Want something like Rampart?

I'm open to senior contract work. Let's talk about what you're building.

Get in touch