
When you ask ClarityQ to analyze your data, you see it work. It searches your semantic catalog, queries your data, checks its results, and creates rich visualizations. A typical question takes 7 to 9 steps. Deep analysis can span 30 or more.
Most AI tools feel like magic. Ours is engineering. We rebuilt our agent from scratch, twice, to make it reliable. Here's what we learned.
Our first version of ClarityQ was workflow-based. Predefined steps. Prompt optimization. If the user asks X, do Y, then Z. It worked. Until it didn't.
Problems started to surface: Unstable paths that couldn't handle unexpected data. Poor error recovery. One failed step derailed everything. And we were constantly patching prompts instead of fixing the underlying mechanism.
We were optimizing the wrong thing.
So we rebuilt. Not the prompts. The mechanism.
The industry calls this the agent harness: infrastructure around the model that wraps the agent loop, enforces rules, and injects context. Manus has rebuilt theirs five times since early 2024. Anthropic regularly rips out and simplifies Claude Code's harness as models improve.
Here's what we built into ours.
Traditional semantic layers are either ETL machines or SQL generation machines. Ours is a SQL generation machine too, but the engine is state-of-the-art AI that keeps getting better each quarter. We built an agentic semantic layer.
It's composed of three catalogs. A table catalog describes your data warehouse structure. An event catalog captures business events and their properties. A semantic catalog defines entities, dimensions, measures, metrics, segments, and their relationships. Together, these form the domain model: the environment the agent works in.
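To make that concrete, here is a rough sketch of the three catalogs as data structures. The class and field names are illustrative, not ClarityQ's actual schema.

```python
# Illustrative sketch of a three-catalog domain model. Names and fields are
# hypothetical, not ClarityQ's real schema.
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str                    # physical table in the warehouse
    columns: dict[str, str]      # column name -> type
    description: str = ""


@dataclass
class Event:
    name: str                    # business event, e.g. "order_placed"
    properties: dict[str, str]   # property name -> type
    description: str = ""


@dataclass
class Metric:
    name: str                    # e.g. "weekly_active_users"
    definition: str              # how the measure is computed
    dimensions: list[str] = field(default_factory=list)


@dataclass
class DomainModel:
    """All three catalogs together: the environment the agent works in."""
    tables: dict[str, Table] = field(default_factory=dict)
    events: dict[str, Event] = field(default_factory=dict)
    metrics: dict[str, Metric] = field(default_factory=dict)
```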
Before the agent writes any SQL, it explores. We built a full-text search that indexes all metadata across the three catalogs. The agent can search for concepts, find related metrics, and understand your data model before committing to a query approach.
Specific tools let the agent inspect table schemas, view metric definitions, and understand relationships. This is progressive disclosure: search first, inspect what looks relevant, then use what's needed.
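Building on the domain model sketch above, a minimal version of that progressive-disclosure tool surface might look like the following. The tool names and matching logic are illustrative; a production search index is far more sophisticated than substring matching.

```python
# Hypothetical tool surface for progressive disclosure: broad search first,
# then targeted inspection of whatever looked relevant.
def search_catalogs(domain: "DomainModel", query: str) -> list[str]:
    """Full-text search across table, event, and metric metadata."""
    q = query.lower()
    hits = []
    for name, table in domain.tables.items():
        if q in name.lower() or q in table.description.lower():
            hits.append(f"table:{name}")
    for name, event in domain.events.items():
        if q in name.lower() or q in event.description.lower():
            hits.append(f"event:{name}")
    for name, metric in domain.metrics.items():
        if q in name.lower() or q in metric.definition.lower():
            hits.append(f"metric:{name}")
    return hits


def inspect_entry(domain: "DomainModel", ref: str) -> dict:
    """Return full metadata for one catalog entry, e.g. 'metric:churn_rate'."""
    kind, _, name = ref.partition(":")
    catalog = {"table": domain.tables,
               "event": domain.events,
               "metric": domain.metrics}[kind]
    return vars(catalog[name])
```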
Direct database queries are the escape hatch. When data isn't available in the catalogs, or when the agent needs to drill deeper than the semantic layer allows, it can query directly. But it starts with the map, not the raw territory.
And guardrails are built in. The model cannot modify your data. It cannot pull too much at once. We estimate query size before execution and reject anything that would overwhelm your warehouse. The agent has freedom to explore, within boundaries that protect your infrastructure.
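A simplified version of that pre-flight check might look like the sketch below. The write-keyword filter, the row threshold, and the `estimate_rows` callback are illustrative stand-ins; the real guardrails depend on your warehouse.

```python
# Illustrative pre-flight guardrail: enforce read-only access and a cost
# estimate before any query reaches the warehouse. The threshold is made up.
import re

MAX_ESTIMATED_ROWS = 5_000_000
WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|merge)\b", re.IGNORECASE
)


def check_query(sql: str, estimate_rows) -> None:
    """Raise if the query could modify data or pull too much at once."""
    if WRITE_KEYWORDS.search(sql):
        raise PermissionError("agent queries must be read-only")
    rows = estimate_rows(sql)   # e.g. derived from the warehouse's EXPLAIN plan
    if rows > MAX_ESTIMATED_ROWS:
        raise ValueError(
            f"estimated {rows:,} rows exceeds the {MAX_ESTIMATED_ROWS:,} row cap"
        )
```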
Agents without explicit stop conditions run until they hit API limits or spiral into nonsense. We define stop conditions as composable rules over conversation state:
When ClarityQ finishes an analysis and presents results, it stops. When it needs clarification, it asks and waits. No runaway loops, no wasted compute.
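A minimal sketch of composable stop rules is below. The state fields, rule set, and step cap are illustrative, not our actual configuration.

```python
# Stop conditions as composable predicates over conversation state.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConversationState:
    steps_taken: int
    analysis_complete: bool
    awaiting_user: bool


StopRule = Callable[[ConversationState], bool]

STOP_RULES: list[StopRule] = [
    lambda s: s.analysis_complete,   # results presented -> stop
    lambda s: s.awaiting_user,       # clarification asked -> wait for the user
    lambda s: s.steps_taken >= 40,   # hard backstop against runaway loops (value is illustrative)
]


def should_stop(state: ConversationState) -> bool:
    return any(rule(state) for rule in STOP_RULES)
```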
Errors are normal. Models hallucinate. APIs time out. Data has edge cases. The question is whether your agent can recover.
Our agent uses checkpoint continuation. When something fails mid-analysis, we don't start over. We capture the conversation state, classify the error, and retry from the last good checkpoint. Rate limits, server errors, connection issues. All handled automatically with exponential backoff.
The model sees its own failures in context. This matters. When an approach doesn't work, the agent adjusts rather than repeating the same mistake.
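The sketch below shows the shape of that loop under some simplifying assumptions: conversation state is a plain dict, retryable failures map to two builtin exception types, and the backoff schedule is illustrative.

```python
# Checkpoint continuation: classify the failure, keep it visible in context,
# and retry from the last good state with exponential backoff.
import time

RETRYABLE = (TimeoutError, ConnectionError)   # stand-ins for rate limits, 5xx, etc.


def run_with_checkpoints(agent_step, state: dict, max_retries: int = 5) -> dict:
    attempt = 0
    while not state.get("done"):
        checkpoint = dict(state)              # last good conversation state
        try:
            state = agent_step(state)         # one step of the agent loop
            attempt = 0                       # reset after any successful step
        except RETRYABLE as err:
            attempt += 1
            if attempt > max_retries:
                raise                         # unrecoverable: surface the error
            state = checkpoint                # resume from the checkpoint, not from scratch
            state.setdefault("errors", []).append(repr(err))  # the model sees its own failure
            time.sleep(2 ** attempt)          # exponential backoff
    return state
```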
A typical ClarityQ analysis runs 7 to 15 steps. Deep analysis can hit 30. That's a lot of room to drift off-topic or forget the original goal.
We use step handlers: logic that runs between each agent step. Before the model decides its next action, we inject reminders, check context size, or trigger automatic summarization when the conversation gets too long.
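A sketch of that hook chain, with made-up thresholds and message shapes, might look like this:

```python
# Step handlers: hooks that run between agent steps. The token heuristic,
# message format, and summarizer are illustrative placeholders.
MAX_CONTEXT_TOKENS = 100_000


def compact(messages: list[dict]) -> list[dict]:
    """Placeholder summarizer: keep the original question and the recent tail."""
    return messages[:1] + messages[-10:]


def remind_of_goal(messages: list[dict]) -> list[dict]:
    goal = messages[0]["content"]   # assumes the original question is the first message
    return messages + [{"role": "system", "content": f"Reminder of the goal: {goal}"}]


def summarize_if_long(messages: list[dict]) -> list[dict]:
    approx_tokens = sum(len(m["content"]) // 4 for m in messages)   # rough token estimate
    return compact(messages) if approx_tokens > MAX_CONTEXT_TOKENS else messages


STEP_HANDLERS = [summarize_if_long, remind_of_goal]


def before_next_step(messages: list[dict]) -> list[dict]:
    for handler in STEP_HANDLERS:
        messages = handler(messages)
    return messages
```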
For complex analyses like root cause investigation, we inject structured plans. The agent writes out its steps and checks them off as it goes. This isn't just for show. It keeps the goal in the model's recent attention span, preventing it from spiraling out or cutting corners.
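The plan itself can be as simple as a checklist that gets re-rendered into the context at each step. The format below is illustrative.

```python
# Illustrative structured plan: the agent checks off steps as it goes, and the
# rendered checklist is re-injected so the goal stays in recent attention.
from dataclasses import dataclass, field


@dataclass
class Plan:
    steps: list[str]
    done: set[int] = field(default_factory=set)

    def check_off(self, index: int) -> None:
        self.done.add(index)

    def render(self) -> str:
        return "\n".join(
            f"[{'x' if i in self.done else ' '}] {step}"
            for i, step in enumerate(self.steps)
        )


plan = Plan(["Find the metric definition", "Segment by cohort", "Compare to last quarter"])
plan.check_off(0)
print(plan.render())
```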
One of our most important reliability patterns is the simplest: when the agent is uncertain, it asks.
We built clarification hooks into the harness. If the model encounters ambiguity (unclear metric definitions, multiple possible interpretations, missing context) it surfaces a question with structured options rather than guessing and producing wrong results.
This sounds obvious, but most AI systems default to confident-sounding answers even when uncertain. We'd rather pause and clarify than deliver an analysis based on wrong assumptions.
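Here is a rough sketch of the shape of such a hook. The option strings and the pause behavior are illustrative; the real hook is wired into the harness rather than a print statement.

```python
# Illustrative clarification hook: surface a structured question instead of
# guessing when more than one interpretation is plausible.
from dataclasses import dataclass


@dataclass
class Clarification:
    question: str
    options: list[str]


def resolve_or_ask(question: str, interpretations: list[str]) -> str | Clarification:
    """Proceed only when exactly one reasonable interpretation remains."""
    if len(interpretations) == 1:
        return interpretations[0]
    return Clarification(question=question, options=interpretations)


result = resolve_or_ask(
    "Which definition of 'active user' should I use?",
    ["logged in within 7 days", "performed a key action within 30 days"],
)
if isinstance(result, Clarification):
    # The harness pauses here and waits for the user instead of guessing.
    print(result.question, result.options)
```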
The shift from workflows to an agent harness changed how we think about reliability. We stopped trying to anticipate every path and started building mechanisms that handle uncertainty gracefully.
This is what "context engineering" actually means in practice. Not just what context to feed the model, but when, how, and with what guardrails. The harness manages all of it.
Three catalogs give the agent a map. Search and inspection tools let it explore before committing. Guardrails protect your data. Stop conditions prevent spiraling. Checkpoint continuation recovers from errors. Step handlers and reminders keep 30-step analyses on track. Clarification hooks catch ambiguity before it becomes a wrong answer.
Mechanism, not magic. That's why it works.