◆ ai operations, done right ● data-first // results, not excuses ● shipping since 2024
UNEXPECTED404
Services / Agentic Engineering

Agents that ship work,
not transcripts.

For RevOps, support, and engineering leaders who've seen the demos and want the real thing. We design, build, and harden agents that operate inside your systems, with evals, guardrails, and observability built in from day one.

For: RevOps · Support · IT · Eng leaders
Drives: Throughput · Coverage · Cost-per-task
Engagement: From 6 weeks · Per-agent or program retainer
Models: Anthropic · OpenAI · BYOM on your infra
Outcome: Production agents · eval suite · runbook · cost dashboard
What this service is

Get your team out of the queue.

Most "AI projects" stop at a chat box. We build agents that take actions: file the ticket, draft the deal review, reconcile the records, escalate when uncertain. They run inside your stack, with the same access controls and audit trails as a human teammate, and they get measurably better every week.

For leaders who've watched a slick demo and want to know what it takes to actually ship one, safely, into a system of record.

Field Note · No.05

A demo is a transcript. A production agent is a teammate, with permissions, evaluations, and a manager.

Operations Brief, UNEXPECTED404
◆ What's in scope

Eight things we ship, every engagement.

An agent program isn't a model call; it's a stack. The eight pieces below are what separate a Friday demo from a worker that runs on Monday and is still running in Q4.

01 ⌕

Task Framing

Pick the work the agent owns vs. the work it escalates. The job description before the build.

02 ⇄

Tool Layer

The functions the agent can actually call: scoped, typed, idempotent, and revocable.

03 ⌥

Eval Harness

A test set you trust. Every change runs against it. Regressions blocked before they reach prod.

04 ↻

Guardrails

Input validation, output checks, refusal patterns. The agent says "I don't know" before it hallucinates.

05 ◔

Memory & Context

Retrieval, scratchpads, and session state, sized to the task, not maxed out by default.

06 ⚠

Observability

Every trace, prompt, and tool call logged. Cost, latency, and quality on one dashboard.

07 ◈

Escalation Paths

When the agent isn't sure, who picks it up? Real human handoff, not a dead-end form.

08 ◰

Governance

Access scoping, PII boundaries, model versioning, audit trails. Compliance won't be the blocker.

Not in scope: shipping a research-grade demo, model fine-tuning programs, building a fully autonomous "AGI" anything. We build small, safe, useful agents.
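
To make "scoped, typed, idempotent, and revocable" concrete, here is a minimal sketch of what one tool in the tool layer could look like. It is an illustration under stated assumptions, not our production code: `RefundTool`, `RefundRequest`, the `refunds:write` scope name, and the in-memory result store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RefundRequest:
    """Typed input: the agent cannot pass arbitrary arguments."""
    order_id: str
    amount_cents: int
    idempotency_key: str


@dataclass
class RefundTool:
    """Hypothetical tool wrapper; all names here are illustrative."""
    granted_scopes: set[str]
    _completed: dict[str, str] = field(default_factory=dict)

    def revoke(self, scope: str) -> None:
        # Revocable: dropping the scope disables the tool immediately.
        self.granted_scopes.discard(scope)

    def issue_refund(self, req: RefundRequest) -> str:
        # Scoped: refuse unless the permission is still granted.
        if "refunds:write" not in self.granted_scopes:
            raise PermissionError("refunds:write scope not granted")
        # Idempotent: a retry with the same key returns the first result
        # instead of issuing a second refund.
        if req.idempotency_key in self._completed:
            return self._completed[req.idempotency_key]
        receipt = f"refund:{req.order_id}:{req.amount_cents}"
        self._completed[req.idempotency_key] = receipt
        return receipt
```

In a real deployment the completed-call store would live in a database and the scope check would be enforced by your identity provider; the shape of the contract is the point here.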
Anatomy of a working agent

It isn't the model. It's the stack around it.

Take a real task and trace what actually happens. The model is one of seven layers, and most of the work that makes an agent reliable lives in the other six.

Same stack. Different prompts, tools, evals.

Agent: Refund triage
  1. L1 Trigger (event): Inbound email · "I want a refund"
  2. L2 Context fetch (retrieval): Account · order history · refund policy v3.2
  3. L3 Reasoning (model): Plan: check eligibility → propose action → ask for confirmation if > $200
  4. L4 Tool calls (action): get_order(id) → check_eligibility() → issue_refund()
  5. L5 Guardrails (safety): PII redacted · refund cap $500 · policy quote required
  6. L6 Escalation (handoff): Confidence < 0.85 OR amount > $500 → human queue
  7. L7 Observability (ops): Trace · cost · latency · eval-set replay on every change
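
The routing rules in the refund-triage stack above can be sketched as a small pure function. The thresholds (0.85 confidence, $200 confirmation, $500 escalation) come from the layers listed; the function name, the `Route` enum, and the order in which the checks run are assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    AUTO_RESOLVE = "auto_resolve"
    ASK_CONFIRMATION = "ask_confirmation"
    HUMAN_QUEUE = "human_queue"


CONFIDENCE_FLOOR = 0.85      # below this, hand off to a human
CONFIRM_ABOVE_USD = 200.0    # above this, ask for confirmation first
ESCALATE_ABOVE_USD = 500.0   # above this, always hand off (refund cap)


def route_refund(confidence: float, amount_usd: float) -> Route:
    """Decide what happens to a proposed refund (hypothetical policy sketch)."""
    # Escalation is checked first so a low-confidence case never auto-resolves.
    if confidence < CONFIDENCE_FLOOR or amount_usd > ESCALATE_ABOVE_USD:
        return Route.HUMAN_QUEUE
    if amount_usd > CONFIRM_ABOVE_USD:
        return Route.ASK_CONFIRMATION
    return Route.AUTO_RESOLVE
```

Keeping the policy in a plain function like this, outside the prompt, is one way to make it testable and auditable without calling the model.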
Coverage: ~72% auto-resolved
Median latency: 8.4s end-to-end
Cost / task: $0.06
Eval pass rate: 94% / 200 cases
How a typical engagement runs

From whiteboard to production agent, in eight to twelve weeks.

Five phases. Five working agreements. We ship a real agent into a real system, small at first, broader as the eval set proves it out.

01
Wk 1–2

Frame

Pick the task. Define what counts as success. Write a 1-page job description for the agent.

02
Wk 2–4

Prototype

Wire the model to a thin tool layer. Hit it with 20 real cases. Iterate fast on prompt + tools.

03
Wk 4–7

Harden

Build the eval set. Add guardrails. Test the failure modes you don't want explained on a Monday.

04
Wk 7–10

Ship

Roll out to a single team or queue. Observability dashboard live. Escalation paths wired to humans.

05
Ongoing

Operate

Weekly eval review. Cost tuning. Expand scope as confidence grows. New tasks queued every month.

You should know

Five symptoms.
Five fixes. Side by side.

The left column is what most agent projects look like at week six. The right is what we ship: small surface area, high reliability, observable end-to-end.

Demo-grade · "looks good in the Slack thread" → Production · evaluated, observed, owned

  1. Today: The chatbot demo
     A wrapper around a model with no tools, no memory, no policy. It hallucinates on edge cases and never takes an action.
     After: A production agent
     Tool-calling, retrieval, evaluated on real cases. Makes decisions, files tickets, escalates when uncertain.
     72% tasks auto-resolved

  2. Today: Eval-by-vibes
     "Looks good, ship it." No baseline, no regression test. Every prompt change is a coin flip and the team is afraid to touch it.
     After: A real eval suite
     200+ frozen test cases. Every prompt or model swap runs against them. Regressions blocked at the door.
     94% pass rate · 200 cases

  3. Today: Black-box outputs
     You can't tell why the agent did what it did. Cost is a surprise on the monthly bill. Latency is whatever it is.
     After: Traced & observable
     Every prompt, tool call, and token logged. Cost, latency, and quality on one dashboard.
     MTTD < 5 min · per-trace cost

  4. Today: Hallucination as feature
     No guardrails. The agent confidently makes up policy, prices, and account details. PII flows wherever the prompt goes.
     After: Constrained & safe
     Output checks, refusal patterns, scoped permissions. The agent says "I don't know" before it makes things up.
     0 PII leaks · scoped tools

  5. Today: Vendor lock-in
     You shipped on one model. The vendor changes pricing or deprecates the API and your stack breaks overnight.
     After: Model-portable
     Agent logic in your code, not the vendor's console. Swap providers in a sprint, not a re-platform.
     Any model · same harness

Recognize three or more? Book the 30-minute intro call →

The work, broken out

What this actually entails.

Four phases. Each one starts with a real task and ends with something your team can operate, evaluate, and improve without us.

Phase 01 · Wk 1–2
  • Task selection
  • Success criteria
  • Escalation rules
01

Frame the Job

We pick the task, define what "done" means, and write a one-page job description for the agent. What it owns, what it escalates, and what counts as a win are all agreed before we touch a model.

Deliverable: Agent spec + success metrics
Phase 02 · Wk 2–4
  • Tool layer v0
  • Prompt + retrieval
  • Real-case dry runs
02

Prototype the Loop

Wire the model to a thin tool layer. Run it against twenty real cases, the messy ones, not the demo-friendly ones. Iterate the prompt, the tools, and the retrieval together until the loop is stable.

Deliverable: Working prototype + 20 trace baseline
Phase 03 · Wk 4–7
  • 200-case eval set
  • Output checks
  • Cost & latency tuning
03

Harden & Evaluate

Build a real eval set from 200+ historical cases. Add output guardrails. Tune for cost. Stress-test the failure modes you don't want explained on a Monday call. Every change runs the suite.

Deliverable: Eval suite + guardrail policy
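
"Every change runs the suite" can be pictured as a gate like the one below. This is a hedged sketch, not the harness we ship: `EvalCase`, `pass_rate`, and `gate` are hypothetical names, and scoring by exact string match is a simplifying assumption (real suites often use rubric or programmatic checks).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One frozen historical case with its expected outcome."""
    case_id: str
    prompt: str
    expected: str


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the agent's output matches the expected one."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if agent(case.prompt) == case.expected)
    return passed / len(cases)


def gate(candidate_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Return True only if the candidate does not regress below the baseline."""
    return candidate_rate >= baseline_rate - tolerance
```

Run in CI, a `False` from `gate` would fail the build, which is one way a prompt or model swap gets blocked before it reaches production.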
Phase 04 · Wk 7+
  • Phased rollout
  • Cost dashboard
  • Weekly eval review
04

Ship & Operate

Roll out to a single team or queue first. Observability dashboard live before traffic. Escalation paths wired to real humans. Weekly eval reviews follow, and improvements compound month over month.

Deliverable: Production agent + ops cadence

Four phases · one production agent · zero hand-waving

Common questions

Questions we hear every week.

If you don't see yours here, the easiest way to get an answer is to ask. We don't gate calls.

Email us a question →

Which models do you work with?

Anthropic Claude, OpenAI, and BYOM setups on your own infra. We pick the model that gets your eval set across the line at the cost and latency you can afford to run forever, not the one with the loudest launch event.

Which tasks are a good fit for an agent?

High-volume, well-bounded, and reversible. The work that's done the same way thousands of times a quarter, where a wrong answer can be caught and corrected. Big, ambiguous decisions stay with humans.

What happens when the agent gets it wrong?

Every action is traced. Confidence-scored decisions either auto-resolve, ask for confirmation, or escalate to a human queue with full context. Mistakes are visible inside an hour, not a month.

Can you work within our security policy?

Yes, we work to it, not around it. Scoped service accounts, encrypted secrets, PII redaction at the boundary, audit trails on every action. Your security team reviews the same tool layer the agent does.

Will we be locked in to one model vendor?

No. Agent logic, eval suite, and tool layer live in your code, not a vendor console. Swap models in a sprint without rewriting the agent. We've done it; it works.
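
One way to keep agent logic in your code and the vendor behind a seam is a thin interface like the one sketched here. Everything in it is hypothetical: `ModelClient`, `EchoModel`, `ShoutModel`, and `Agent` are stand-ins, and a real adapter would wrap a vendor SDK call inside `complete`.

```python
from typing import Protocol


class ModelClient(Protocol):
    """The only surface the agent depends on (illustrative interface)."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider A; a real adapter would call a vendor SDK here."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutModel:
    """Stand-in provider B with different behavior behind the same interface."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Agent:
    """Agent logic lives here and never imports a vendor SDK directly."""
    def __init__(self, model: ModelClient) -> None:
        self.model = model

    def draft_reply(self, ticket: str) -> str:
        return self.model.complete(f"Draft a reply to: {ticket}")
```

Swapping providers then means writing one new adapter and re-running the eval suite against it, rather than rewriting the agent.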

Pairs well with

Agents rarely ship alone.

Most agent programs pull in one or two of these. Workflow automation builds the rails the agent runs on, data makes its decisions trustworthy, and strategy decides which agents are worth shipping at all.

View all services →
01 / Strategy · Pairs well

Martech Strategy

Frame the bet, platform mix, operating model, and roadmap before the build queue.

  • Stack audit
  • Roadmap
  • TCO model
→
02 / Agents · Current

Agentic Engineering

Production agents with tools, evals, guardrails, observability, and human handoff.

  • Tools
  • Evals
  • Guardrails
●
03 / Leads · Pairs well

Lead Management

Capture, dedupe, score, route, and nurture without leaks between marketing and sales.

  • Scoring
  • Routing
  • SLA
→
04 / Programs · Pairs well

Marketing Automation

Behavior-triggered programs that ship at scale across lifecycle, nurture, and expansion.

  • Lifecycle
  • Triggers
  • Reporting
→
05 / Data · Pairs well

Data Management

Schemas, cleanup, enrichment, quality monitoring, and governance.

  • Schemas
  • Sync
  • Governance
→
06 / Workflows · Pairs well

Workflow Automation

Integrations and back-office workflows that remove manual handoffs.

  • Triggers
  • Routing
  • Audit
→
◆ Let's talk · 30-minute scoping call

Ship the first agent that earns its keep on day one.

A demo doesn't scale. A production agent, small, evaluated, observable, does. Bring us a real task and we'll tell you in thirty minutes whether it's a fit.

Book intro call → See case studies
72%: tasks auto-resolved by agents we shipped in 2025
94%: eval pass rate before any agent reaches production
8 wk: average from kickoff to first agent in production
$0.06: median cost-per-task on shipped agents (pick the model that fits)
UNEXPECTED404

Ops that think. Data that proves it.

© 2026 UNEXPECTED404, LLC