
Evidence Over Intuition: A Process for Building AI That Works

Most AI projects don't fail because the technology doesn't work. They fail because the team can't prove it works. Here is our process.

Framework · AI Engineering · Evaluation

By Patrick Creeden · Last reviewed April 2026 · 13 min read

Introduction

Most AI projects don’t fail because the technology doesn’t work. They fail because the team can’t tell whether it’s working, or worse, they think it’s working when it isn’t.

The gap between a promising prototype and a production system that stakeholders trust is where the real engineering happens. This paper describes a process we follow at OBLSK for building AI systems. It’s structured around a simple conviction: every decision, enhancement, and improvement should be traceable to evidence, not intuition. That conviction runs through every phase of a project, from requirements through implementation, testing, and the decisions you make about what to ship and what to hold back, and it’s anchored by an evaluation loop where each proposed change is tested against a hypothesis, measured, diagnosed, and kept or adjusted.

This process emerged from real work across multiple projects, and we’ll reference a recent engagement throughout to illustrate how it plays out in practice. But the process itself is repeatable and portable across domains.

Phase 1: Start With Questions, Not Specs

The Problem With Traditional Requirements

Most AI projects start with a requirements document that describes what the system should do. The problem is that these documents are usually written by someone who already has a mental model of the solution. Assumptions get baked in before anyone examines them, and the document ends up recording those assumptions rather than testing them.

Unchecked assumptions are where issues hide. “The system should return the right thing” sounds fine until you ask what that right thing should be and whether anyone has separated “we found the correct document” from “the answer is correct.” Those are two different things, and mixing them up creates downstream issues that are hard to unravel, especially as the system takes on evolving, more ambiguous use cases.

Our Approach: AI-Conducted Structured Interviews

We use AI tools to run structured requirements interviews, not as a shortcut around the thinking, but as a way to make the thinking more careful. The AI asks questions across set areas: user roles, data sources, limits, success criteria, failure modes, and compliance needs. The humans answer. The AI turns those answers into a product requirements document (PRD).

This produces two things. First, a thorough PRD that covers things a hand-drafted document often misses, especially around edge cases, role boundaries, and failure behavior. Second, and more importantly, a record of the reasoning behind each requirement. When a question comes up later about why a design decision was made, the answer is easy to trace.

On a recent engagement, this process produced a PRD with 40 user stories across four different roles, along with implementation decisions and testing criteria. It also captured something that shaped the whole build: retrieval quality matters more than breadth of features. That priority, set during requirements, kept scope from drifting in every later phase.

The Interrogation Step

The PRD alone is not enough. Before any implementation begins, we walk through the full plan step by step and challenge every assumption.

This is not a status meeting where someone presents slides and others nod. It’s a structured review where every choice gets a “what are the other options and what are the risks” conversation. The goal is to bring out the decisions that will be costly to undo later and make them on purpose rather than by default.

On the same engagement, this review highlighted other insights that directly shaped the build:

  • Lexical search should stay literal and clear. Adding semantic interpretation there would add risk without matching value.
  • Confidence scores help the system decide what to do next, but they weren’t reliable probabilities for this project, so they were used behind the scenes as one input to routing decisions rather than shown to end users.
  • Chunk size tuning should come after reranking is in place (not before), because reranking changes how chunk size affects results.
  • Formatting loss during text extraction is a real risk that varies by document type and should be checked before trusting any document category in production.

None of these findings were obvious from reading the plan. They came out of questioning it.

Phase 2: Develop by Deliverable, Not by Layer

A pattern in AI projects is to build layer by layer:

  1. Get all the data loaded
  2. Build all the retrieval logic
  3. Build all the UI
  4. Test everything at the end

This feels efficient but creates a dangerous feedback delay. You don’t learn whether the system actually works until a lot of work is already done.

We break the work into discrete deliverables instead: anything the system needs that can be built, tested, and evaluated end to end on its own. A deliverable can be:

  • a user-facing capability
  • a data pipeline
  • a search component
  • or the evaluation harness itself

Each deliverable has its own acceptance criteria and clear dependencies on the deliverables it builds on.
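To make this concrete, here is a minimal sketch of how a plan of deliverables with acceptance criteria and dependencies might be represented and ordered. The deliverable names and criteria are hypothetical, not taken from any actual engagement:

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

@dataclass(frozen=True)
class Deliverable:
    name: str
    acceptance_criteria: tuple[str, ...]
    depends_on: tuple[str, ...] = ()

def build_order(deliverables: list[Deliverable]) -> list[str]:
    # Topologically sort so each deliverable is built only after
    # everything it depends on is complete and evaluable.
    graph = {d.name: set(d.depends_on) for d in deliverables}
    return list(TopologicalSorter(graph).static_order())

# Hypothetical plan: the evaluation harness is a first-class deliverable
# that later deliverables depend on.
plan = [
    Deliverable("eval-harness", ("scores a fixed query set", "compares rounds")),
    Deliverable("keyword-search", ("Top-1 meets baseline on eval set",),
                ("eval-harness",)),
    Deliverable("confidence-routing", ("routes low-confidence queries to review",),
                ("keyword-search",)),
]

order = build_order(plan)
```

Representing the plan this way makes circular or missing dependencies fail loudly at planning time rather than surfacing mid-build.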

What This Looks Like in Practice

On a recent engagement, the implementation plan was organized into deliverables across six categories:

  • Foundation: application foundation, production hardening
  • Data pipelines: four ingestion paths (one per data source), refreshing the reference document index
  • Search components: keyword search, natural-language search
  • Routing: tier classification, confidence routing
  • Draft workflow: draft request, draft generation, expert review, output generation
  • Evaluation tooling: the evaluation harness itself

The key design choice was making the evaluation harness its own deliverable rather than treating testing as something tacked on at the end. Evaluation wasn’t something we’d do after the system was “done.” It was a real deliverable with its own acceptance criteria, built early enough to guide every later decision.

This structure also made dependencies visible. We could see that the merge strategy couldn’t be properly evaluated until all ingestion paths were complete. We could see that confidence routing depended on having enough retrieval data to stress-test edge cases. Those dependencies clarified what had to exist before each subsequent piece could be evaluated fairly.

Phase 3: Run an Evaluation Loop, Not One-Off Tests

The Core Principle

The evaluation loop means that no change to the system is accepted without measured evidence that it improved the thing it was supposed to improve and didn’t break something else.

This sounds obvious. In practice, it’s surprisingly rare. Most AI teams evaluate loosely: try a few queries, see if the results look better, ship it. That works until it doesn’t, and the cost of “until it doesn’t” can be very high.

Inside the Loop

Each evaluation round follows the same steps:

Hypothesis. What specific change are we testing, and why do we think it will improve a specific metric?

Controlled change. Only one thing changes per round. If two things change in the same round, you can’t tell which one caused the result.

Measurement. Results are scored against set metrics, not just “does it look better” but specific, comparable numbers.

Diagnosis. When results are unexpected, the round doesn’t end with the metrics. It ends with an explanation of why the metrics moved the way they did.

Decision. Based on the diagnosis, the change is kept, rolled back, or adjusted. The decision and the reasoning are recorded.
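The five steps above can be sketched as a small record plus a scoring function. This is an illustrative shape, not our actual harness; the field names and the Top-1 metric are stand-ins for whatever metrics a given project tracks:

```python
from dataclasses import dataclass

@dataclass
class EvalRound:
    hypothesis: str      # what single change is tested, and why
    change: str          # the one controlled variable for this round
    metrics: dict        # specific, comparable numbers, e.g. {"top1": 0.64}
    diagnosis: str = ""  # filled in after results are examined
    decision: str = ""   # "keep", "rollback", or "adjust", with reasoning

def top1_accuracy(results: dict[str, list[str]], gold: dict[str, str]) -> float:
    # Fraction of queries whose top-ranked document matches the expected ID.
    hits = sum(1 for q, ranked in results.items()
               if ranked and ranked[0] == gold[q])
    return hits / len(gold)
```

Keeping every round as a structured record (rather than scattered notes) is what makes round-over-round comparison and later auditing possible.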

Hot tip: keep an evaluation journal that records each round’s steps and notes; it’s invaluable, and coding tools and LLMs are great at maintaining it.

What 15 Rounds of Evaluation Taught Us

On a recent engagement, we ran 15 formal rounds of retrieval evaluation. The path was not a smooth upward climb. It was a series of gains, setbacks, reversals, and recoveries that each produced specific, useful insights. In some cases, fewer rounds may be needed; in others, many more.

Friendly baselines are misleading. The first round produced perfect scores on a set of simple queries. The second round made the queries harder with reworded versions and unclear phrasing, and Top-1 accuracy immediately dropped 15 points. The system was working, but only on easy questions. This is the failure mode that motivated our earlier piece on tying evaluation to business outcomes. Being correct on a friendly dataset is not the same as being useful in production.

Metric improvements can be traps. One round added a hand-tuned ranking rule that improved Top-1 accuracy by 14 points. We rolled it back the same day. The rule was built on assumptions about how document titles are structured in the current collection, and those assumptions would quietly break as new documents were added. A metric improvement that depends on fragile assumptions is not a real improvement. It’s technical debt that looks like progress.

Separate what you can measure from what you can’t. A mini-eval sequence showed that some retrieval misses were passage-level problems (the right document was ranked first, but the wrong chunk was shown) while others were document-level problems (the wrong document was ranked first entirely). These need completely different fixes, and treating them as one problem would have wasted effort. The eval framework’s ability to tell the two apart was what made the correct diagnosis possible.
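The distinction between the two failure types can be captured in a few lines of diagnostic logic. A hypothetical sketch, assuming retrieval results are (document ID, chunk ID) pairs ranked best first:

```python
def classify_miss(ranked_chunks, expected_doc, expected_chunk):
    """Tell passage-level misses from document-level misses.

    ranked_chunks: list of (doc_id, chunk_id) pairs, best first.
    """
    if not ranked_chunks:
        return "document-level"
    top_doc, top_chunk = ranked_chunks[0]
    if top_doc == expected_doc and top_chunk == expected_chunk:
        return "hit"
    if top_doc == expected_doc:
        # Right document ranked first, wrong chunk shown:
        # a chunking or passage-selection fix.
        return "passage-level"
    # Wrong document ranked first entirely: a ranking fix.
    return "document-level"
```

Tagging every miss this way turns a single aggregate accuracy number into two actionable buckets with different remedies.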

Removing a crutch reveals the real baseline. When a previous workaround was removed in one of the rounds, Top-1 dropped from 64.3% to 50%. That was painful but necessary. It showed what the core system could actually do without extra help, which made the later improvements (Reciprocal Rank Fusion, structured reranking) measurable against an honest starting point.
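Reciprocal Rank Fusion, one of the improvements mentioned above, is a standard technique for merging ranked lists from multiple retrievers. A minimal sketch (the constant k = 60 is the commonly used default, not a project-specific value):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each ranker contributes 1 / (k + rank)
    # for every document it returns; documents are re-ordered by the sum.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF is that it needs no score calibration between retrievers: only ranks matter, so a lexical and a semantic ranker can be fused even though their raw scores live on different scales.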

Data problems can look like system problems. One round produced results that looked much worse than expected. A closer look showed that several expected document IDs in the evaluation dataset were wrong: the system was being marked down for “missing” documents it was actually ranking correctly under different IDs. The eval framework caught this because results could be checked one query at a time. Without that detail, the team would have spent time “fixing” retrieval behavior that wasn’t actually broken, when the real fix belonged in the evaluation data, not the retrieval system.
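This class of error is cheap to guard against. A hypothetical pre-scoring audit that cross-checks every expected document ID against what is actually in the index:

```python
def audit_gold_labels(gold: dict[str, str], indexed_ids: set[str]) -> list[str]:
    # Flag queries whose expected document ID is absent from the index.
    # These are dataset errors, not retrieval failures; scoring them as
    # misses would make the system look worse than it actually is.
    return [query for query, doc_id in gold.items()
            if doc_id not in indexed_ids]
```

Running an audit like this before every evaluation round means a bad gold label surfaces as a named dataset issue instead of a mysterious metric drop.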

Phase 4: Document What Doesn’t Work, Not Just What Does

Open Issues as a Signal of Trust

Most teams document what their system can do. Far fewer document what it can’t do yet and why. We believe the open issues list is one of the most important artifacts a project produces, not as an admission of failure, but as a sign the team knows what it built and where it still has work to do.

Any team can list its features. A team that can name its open edge cases, with the reason for each and what would fix it, reveals something much harder to fake: whether it actually understands what it built.

What an Honest Issues List Looks Like

On a recent engagement, the final deliverable included a documented list of known open issues. A few examples, obfuscated for confidentiality:

  • Some internal confidence signals weren’t fully independent: a high-quality answer in one dimension could be overridden by an unrelated low signal elsewhere.
  • Ranking sometimes placed the wrong document first when multiple documents covered overlapping territory or used very similar language.
  • Retrieval coverage was uneven across document formats: certain tables and unreadable PDF pages weren’t indexed cleanly.

Each issue was documented with its root cause, its practical effect, and the conditions under which it would need to be fixed. None of them were blockers for the system’s intended use, so they could be handled in future enhancements.

This kind of transparency is especially important when stakeholders need to understand not just what the system does, but where its limits are.

Phase 5: Building Reusable Assets, Not Just Features

A well-run AI project should produce more than a working system. It should produce artifacts and capabilities that build up over time.

The Evaluation Framework as a Reusable Asset

The evaluation harness built for one engagement (the query sets, the metrics, the round-over-round comparison tooling) doesn’t have to be rebuilt from scratch for the next project. The approach is portable. The specific queries change, but the loop of hypothesis, controlled change, measurement, diagnosis, and decision applies to any retrieval, generation, or classification system.

Over multiple engagements, this creates a team advantage: each project makes the evaluation process sharper, and each client benefits from the rigor built up over every previous project.

Document Risk Assessment as a Repeatable Deliverable

Before building retrieval against any document collection, it’s worth checking what happens to those documents during ingestion. Which document types survive text extraction cleanly? Which ones lose meaningful structure? Where are the tables, the label-value pairs, the appendix pages that might not extract at all?

This check is quick compared to the total project effort. It prevents a class of problems that are expensive to diagnose later, when retrieval quality is poor and nobody can tell whether the problem is in the ranking logic or in the source material that was fed into it.
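A crude but useful version of this check can be automated. The sketch below is a hypothetical heuristic, not our actual tooling: it flags pages that extract little or no text, which often indicates scanned images, dense tables, or layouts that lost their structure. The thresholds are illustrative:

```python
def extraction_risk(extracted_pages: list[str], min_chars: int = 200) -> dict:
    # Pages that yield almost no text after extraction are likely scanned
    # images, dense tables, or label-value layouts that didn't survive.
    thin = [i for i, text in enumerate(extracted_pages)
            if len(text.strip()) < min_chars]
    ratio = len(thin) / max(len(extracted_pages), 1)
    # Illustrative cutoff: more than a quarter of pages thin = high risk.
    level = "high" if ratio > 0.25 else "low"
    return {"thin_pages": thin, "thin_ratio": ratio, "risk": level}
```

Even a heuristic this simple sorts a document collection into categories worth trusting early versus categories that need manual inspection first.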

On a recent engagement, this check found that some documents were low-risk for formatting loss, while tabular specification documents and data sheets were higher risk. That finding directly guided which document categories were trusted earliest in the retrieval pipeline.

Documentation as Onboarding Infrastructure

The review records, evaluation round reports, and architectural decision logs produced during a project are not just project artifacts. They’re onboarding materials. A new team member, or a client’s internal team taking over maintenance, can read the docs and understand not just what the system does, but why each decision was made and what the other options were.

This is especially valuable for AI systems, where the reasoning behind a design choice is often more important than the choice itself. Knowing that lexical retrieval was kept literal on purpose, and why, prevents a future maintainer from “improving” it in a way that brings back a problem that was already found and avoided.

The Role of AI Tooling (and Humans) in This Process

It’s worth addressing directly how AI development tools fit into this process, because the answer is more complicated than “AI makes everything faster.”

We use AI tooling heavily for structured requirements interviews, plan generation, implementation, and evaluation analysis. These tools meaningfully shrink the time between “we have an idea” and “we can evaluate that idea against a working system.” That matters because the faster you can get to real evaluation, the sooner you find out what actually works versus what you assumed would work.

But AI tooling does not shrink (and should not replace) the judgment part. The review where every assumption is questioned is still a human activity. The decision to roll back a metric improvement because it won’t generalize is still a human judgment. The call on which open issues are okay for launch and which are blockers is still a human call.

Two things need extra attention. First, a human has to review real AI outputs, not just the metrics that summarize them. Metrics miss problems a reviewer spots in seconds: wrong tone, small factual errors, something missing that changes the meaning, or formatting that looks right but breaks in context. No automated test catches everything a person sees in a sample of outputs. Second, when an evaluation gives an unexpected result, finding the real cause takes human judgment. A 15-point drop could be a ranking issue, a change in how documents were split up, a change in the test set, or a metric measuring the wrong thing. You can’t tell which from the number alone. It takes reading logs, following traces, and looking at outputs along the way.

The approach works because it uses AI to speed up the parts of the process that benefit from speed (drafting, generating, building, searching) while keeping human judgment for the parts that benefit from careful review (questioning assumptions, weighing tradeoffs, deciding what’s ready and what isn’t).

Summary: The Process at a Glance

Phase 1: Start With Questions, Not Specs. Use AI-conducted interviews to produce thorough, traceable requirements. Then challenge every assumption before building.

Phase 2: Develop by Deliverable, Not by Layer. Structure implementation as standalone, testable steps with clear dependencies. Make evaluation a real deliverable, not something tacked on at the end.

Phase 3: Run an Evaluation Loop, Not One-Off Tests. Test every change with a specific hypothesis, one controlled variable, measured results, a diagnosis, and a recorded decision. No change ships without evidence.

Phase 4: Document What Doesn’t Work, Not Just What Does. Document what works and what doesn’t with equal rigor. The open issues list is a trust signal, not a liability.

Phase 5: Reusable Assets. Build evaluation frameworks, work logs, decision records, and documentation that build up across projects.

AI Tooling. Use it to speed up the cycle from idea to evaluation. Don’t use it to skip the judgment that makes evaluation meaningful.