Your AI system's outputs are only as trustworthy as your logging. Most teams find this out the hard way.
When something goes wrong in an AI decision engine, like a prediction nobody can explain, or a status that says “cancelled” with no trail of why, you could end up with three teams pointing at each other. Not because anyone was careless, but because the system was never designed to answer that question. The output exists. The reason it exists doesn’t.
(By decision engine I mean any system that takes inputs from one or more models and produces a consequential output: a recommendation, a classification, a generated response.)
This is one of the oldest tensions in systems design, and I’m seeing AI systems make it urgent again. Traditional ML teams have model registries, prediction logs, and feature stores to trace an output back to its origin. LLM-based systems often log nothing but the final response: no prompt version, no model version, no confidence signal. When your decision engine is downstream of both, you need provenance from both. Not as an afterthought. As an operational requirement you define before the first model goes live.
Have you ever stared at a database record whose status said “cancelled” and had absolutely no idea why? No log entry, no audit trail, no error message: just a status that appeared sometime overnight and a morning standup where everyone shrugs. That feeling has a name. It’s called missing provenance, and if you don’t plan ahead, it’s about to become your AI team’s most expensive problem.
My software career spans more than two and a half decades, but my journey to this blog post started in earnest well over a decade ago. I wasn’t architecting these systems from scratch. I was the person who had to understand them well enough to know when they were broken, who to call, and what questions to ask. It turned out to be a useful vantage point. You see the seams between systems that the people who built them don’t always notice, because they only ever owned one piece. (And yes, those people were excellent at their jobs. The seams get stretched when no one owns designing a seamless Ops experience.)
What I’ve been doing lately is trying to map how these pieces connect. Taking the concepts I knew operationally and learning the vocabulary and theory underneath them. What struck me is how directly the old problems map to the new ones. The same architectural tensions that made database-plus-logs frustrating a decade ago are showing up again in AI pipelines today: now with more moving parts, more indeterminacy, and higher stakes.
So this post is structured as a vocabulary map. Not a tutorial, not a how-to. A guide to the terms, how they connect, and why they matter if you’re building or evaluating AI systems that need to be trusted in production. If you’ve worked in and around these systems without ever having to design one from scratch, this is the connective tissue.
If you’ve spent time around modern AI tooling, you’ve probably encountered the word “graph” used in a way that feels more architectural than mathematical. It’s not a chart. It’s an execution model. Once the vocabulary clicks, a lot of the design decisions in these systems start making sense.
A node is a discrete unit of work. A function, an agent call, a tool execution. It does one thing and returns a result. An edge is the connection between nodes. An edge defines what happens next, and it can be conditional. If the result is an error, go here. If the result is approved, go there. The thing traveling between nodes is called state: a structured data object that every node can read from and write to.
The word “state” (or status) comes up again, so stay with me. The data object traveling through the graph is one kind of state. But every entity in the system (a node, an agent, a work item) also has its own entity state: running, idle, waiting, failed. Entity states are tracked separately, and the two kinds of state are easy to conflate. When an agent is busy processing, it has an entity state of running. The data object it’s processing has its own fields and values. They interact but they aren’t the same.
One thing worth mentioning here: when people say “graph state” they don’t mean the whole graph is in one condition. They mean each individual run. The execution of the graph carries its own state object. Ten concurrent runs means ten independent state objects. The graph definition is the blueprint. State lives in the run instance.
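The mechanics can be sketched in a few lines of Python. Everything here is illustrative (the node names, the `EDGES` lookup, and the `run_graph` loop are mine, not any framework’s API), but it shows the shape: nodes as functions, edges as conditional lookups, and state as a per-run object.

```python
# Illustrative graph execution sketch - not any specific framework's API.

def validate(state):
    # A node: one discrete unit of work that reads and writes state.
    state["status"] = "valid" if state.get("input") else "error"
    return state

def process(state):
    state["result"] = state["input"].upper()
    return state

def handle_error(state):
    state["result"] = None
    return state

NODES = {"validate": validate, "process": process, "handle_error": handle_error}

# Conditional edges: (current node, outcome) -> next node. A missing
# entry ends the run.
EDGES = {
    ("validate", "valid"): "process",
    ("validate", "error"): "handle_error",
}

def run_graph(state, start="validate"):
    # The graph definition (NODES, EDGES) is the blueprint; each call
    # carries its own independent state object - one per run.
    node = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES.get((node, state.get("status")))
    return state

run_graph({"input": "hello", "status": None})
```

Ten concurrent runs would mean ten calls to `run_graph`, each with its own state dict; the `NODES` and `EDGES` definitions are shared and never mutated.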
Different frameworks expose these concepts with varying levels of transparency. Some make them explicit, while higher-level tools can hide them: sometimes to their detriment in production.
When you’re working in these systems you quickly accumulate a vocabulary challenge. The thing being worked on gets called an artifact, a deliverable, an asset, a payload, a message body, or event data. The technical world doesn’t have one universal term because it varies by domain. The terminology reflects who built the system and what decade they learned in. For our purposes these are the same thing: the object of work. A document, a prediction, a generated response, an image, a JSON blob, a compiled binary.
Funny thing: the artifact doesn’t know its own status. A document doesn’t know it’s “in review.” A prediction doesn’t know it’s “flagged for human review.” Something external has to track that. The structure that does it is called an envelope. Think of it exactly like a physical package: the artifact is the contents, the envelope is the wrapper that carries the metadata. Status, timestamps, who owns it, where it came from, where it’s going, why it was last touched.
In workflow systems this wrapper gets called a work item or task record. In messaging systems it’s a message envelope. The name changes but the pattern is the same: the artifact travels inside a structured container that knows things the artifact itself doesn’t.
This matters architecturally because your state object (that data envelope traveling through the graph) is typically this wrapper, not the artifact itself. The artifact lives inside it as one field among many. Which means when you’re designing what your state object looks like, you’re really designing the envelope: what metadata travels with the work, what gets written at each step, and crucially, what gets recorded when a status changes.
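A minimal sketch of that wrapper, with hypothetical field names: the artifact is one field among many, and the envelope’s status-change method is forced to record actor and reason.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical envelope: the artifact is the contents; the wrapper
# carries the metadata the artifact doesn't know about itself.
@dataclass
class Envelope:
    artifact: dict                 # the object of work (document, prediction, ...)
    status: str = "new"
    owner: str = "unassigned"
    history: list = field(default_factory=list)

    def set_status(self, status, actor, reason):
        # A status change records who, when, and why - not just the new value.
        self.history.append({
            "from": self.status,
            "to": status,
            "actor": actor,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.status = status

item = Envelope(artifact={"prediction": 0.87})
item.set_status("flagged", actor="risk-rules", reason="score above threshold")
```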
That last part is where most systems quietly fail. And it’s the bridge into the next topic.
Here’s a failure mode that anyone who’s worked in operations has hit. You open a database record. (We’re repeating the example from before, but going deeper this time.) Status says “cancelled.” You need to know why. So you go to the logs. The logs don’t have enough detail. So you ask engineering to add more logging. You wait for the next incident. Repeat.
This isn’t a tooling problem or a lazy developer problem. It’s a structural one. Databases and logs were built by different people for different purposes and bolted together operationally rather than designed as a unified system. The database answers “what is.” The log answers “what happened.” Nobody enforced that they had to speak to each other, and in most organizations they still don’t.
The root cause is that writing UPDATE jobs SET status='cancelled' is a state change with no enforced explanation. Whether a log entry gets written explaining why depends entirely on whether the developer thought to do it, whether the logging framework made it easy, whether the ticket required it. It’s discretionary by default.
One structural answer to this problem has existed for years: event sourcing. Instead of storing current state in a database and hoping the logs explain it, you store every event that ever happened in an append-only log and derive current state by replaying those events. The log is the database. “Cancelled” isn’t a value in a column. It’s the result of a JobCancelledEvent existing in the log with a timestamp, an actor, and a required reason field. You can’t cancel something without explaining why, because the explanation is structurally part of the operation.
Paired with CQRS (Command Query Responsibility Segregation) you get the read performance of a regular database too, via projections built by replaying the event log. You don’t have to choose between queryability and auditability.
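A toy version of the pattern, with invented event names and no real persistence, just to make the mechanics concrete: state is derived by replaying an append-only log, and a cancellation event cannot exist without a reason.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Minimal event-sourcing sketch (illustrative, not a library API).

@dataclass(frozen=True)
class Event:
    job_id: int
    kind: str          # "JobCreated", "JobCancelled", ...
    actor: str
    reason: str
    at: str

LOG: list[Event] = []  # append-only; nothing is ever updated in place

def cancel_job(job_id, actor, reason):
    # The explanation is structurally part of the operation.
    if not reason:
        raise ValueError("JobCancelledEvent requires a reason")
    LOG.append(Event(job_id, "JobCancelled", actor, reason,
                     datetime.now(timezone.utc).isoformat()))

def current_status(job_id):
    # A read-side projection: replay the log to derive "what is" from
    # "what happened". CQRS would precompute this into a query table.
    status = None
    for e in LOG:
        if e.job_id == job_id:
            status = {"JobCreated": "active", "JobCancelled": "cancelled"}[e.kind]
    return status

LOG.append(Event(1, "JobCreated", "api", "new order", "2024-01-01T00:00:00Z"))
cancel_job(1, actor="billing", reason="payment failed")
print(current_status(1))  # prints "cancelled" - with the why preserved in the log
```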
Event sourcing is one structural answer, not a default recommendation. But it makes the tradeoff explicit. In production it brings its own challenges: schema evolution pain, projection rebuild disasters, and teams that adopt it too early often regret it, because it’s a heavy operational lift when everything else is also new. When auditability is non-negotiable, though, it remains a coherent model rather than a collection of conventions.
Most enterprises don’t use event sourcing because it’s genuinely hard to build correctly and brutal to retrofit onto existing systems. What they do instead, at best, is maintain a structured audit log table. A database table written transactionally alongside every state change, recording who, what, when, and ideally why. It’s not as rigorous but it collapses the two-system lookup into one. It still fails when developers skip the “why” field under deadline pressure, which is most of the time.
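Here’s what that pattern can look like at its simplest, sketched with SQLite and a hypothetical `set_job_status` choke point. Real implementations would add database triggers or ORM hooks to make bypassing the choke point harder.

```python
import sqlite3

# Sketch of an audit log table written transactionally alongside every
# state change. Table and function names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("""CREATE TABLE job_audit (
    job_id INTEGER, old_status TEXT, new_status TEXT,
    actor TEXT NOT NULL, reason TEXT NOT NULL, at TEXT NOT NULL)""")

def set_job_status(job_id, new_status, actor, reason):
    # The "why" is non-discretionary: no reason, no status change.
    if not reason:
        raise ValueError("a status change requires a reason")
    (old,) = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
    # State change and its explanation commit in one transaction,
    # collapsing the two-system lookup into one.
    with conn:
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?",
                     (new_status, job_id))
        conn.execute(
            "INSERT INTO job_audit VALUES (?, ?, ?, ?, ?, datetime('now'))",
            (job_id, old, new_status, actor, reason))

conn.execute("INSERT INTO jobs VALUES (1, 'running')")
set_job_status(1, "cancelled", actor="scheduler", reason="budget exceeded")
```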
Modern observability platforms (Dynatrace, Datadog, Honeycomb, and Grafana among others) approach the problem from the logging side through distributed tracing: attaching a trace ID to every operation so you can follow a request across services, databases, and log lines with a single identifier. It doesn’t give you the “why” unless developers instrument it, but it solves the correlation problem structurally. That’s not nothing.
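The correlation idea can be sketched with nothing but the standard library: stamp one trace ID onto every log line a request produces, so the lines can be joined later with a single identifier. This is a single-process toy; real platforms propagate the ID across service boundaries, for example via W3C Trace Context headers.

```python
import logging
import uuid
from contextvars import ContextVar

# One trace ID per request, visible to every log call in that context.
trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceFilter(logging.Filter):
    def filter(self, record):
        # Stamp the current trace ID onto every record this handler sees.
        record.trace_id = trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
handler.addFilter(TraceFilter())

log = logging.getLogger("decision-engine")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

def handle_request(job_id):
    trace_id.set(uuid.uuid4().hex[:8])   # new ID for this request
    log.info("received job %s", job_id)
    log.info("job %s cancelled", job_id)  # same trace_id joins the lines

handle_request(1)
```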
The reason this matters for AI systems specifically is coming up next.
What do you do when your AI answer bot starts giving the wrong answers? If you’re like most teams, you dig into the logs, only to find they don’t have enough detail. So you ask engineering to add more logging. Then you wait for the next incident to test whether you added the right logging. The database-log gap didn’t go away when you added AI. It got a new layer on top.
The discipline that addresses this is called model lineage or model provenance. It answers three questions: where did this output come from, what produced it, and what state was the producing system in at the time? The operational umbrella is MLOps for traditional machine learning teams and LLMOps for LLM-based systems. These are the same concerns that DevOps brought to software deployment, applied to models.
The starting point is a model card, a structured document describing what a model does, how it was trained, what data it used, its known limitations, and its intended use cases. If a team you depend on doesn’t have one, asking for one is a reasonable operational requirement.
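As a sketch, a model card can be as simple as a structured record. The model name and field values below are invented, but the field set reflects the standard concerns: what it does, what it was trained on, known limitations, intended use, and who owns it.

```python
# Illustrative model card as structured data. All values are hypothetical;
# the fields are the ones an ops team downstream would reasonably ask for.
MODEL_CARD = {
    "name": "churn-classifier",   # hypothetical model
    "version": "v14",
    "task": "binary classification: will this customer churn in 30 days",
    "training_data": "accounts table snapshot, 2023-01 through 2024-06",
    "known_limitations": [
        "untested on accounts younger than 90 days",
        "scores drift after pricing changes; retrain quarterly",
    ],
    "intended_use": "ranking outreach lists; not for automated cancellation",
    "owner_team": "ml-platform",
}
```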
Beyond documentation, there are four artifacts every model team should be producing if their outputs are feeding a decision engine downstream.
If you’re receiving outputs from both a traditional ML model and an LLM and routing them through the same decision engine, ask for a source tag on every output. A source tag is a field identifying not just the model but the modality and the team. It sounds obvious. Teams don’t include it because from their side there’s only one model and it’s self-evident which one it is. Downstream, at the answer bot’s output stage, it isn’t.
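A source tag can be as small as three fields. The names below are hypothetical; the point is that the tag travels with every output into the decision engine, not that it lives in any particular schema.

```python
from dataclasses import dataclass

# Hypothetical source tag attached to every model output entering
# the decision engine; field and model names are illustrative.
@dataclass(frozen=True)
class SourceTag:
    model_id: str       # e.g. "churn-classifier-v14", not just "the model"
    modality: str       # "ml" (traditional model) or "llm"
    team: str           # who to call when this output is wrong

@dataclass
class TaggedOutput:
    value: object       # the actual prediction / response
    source: SourceTag   # provenance that the value itself doesn't carry

out = TaggedOutput(
    value=0.91,
    source=SourceTag("churn-classifier-v14", "ml", "ml-platform"),
)
```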
The framing that tends to land well when having this conversation: “When our decision engine produces a wrong outcome, we need to be able to determine within minutes whether the fault is in our deterministic logic or in our indeterminate (LLM) output, without involving multiple teams.” That framing works because it protects the model team from blame they don’t deserve. It’s a shared interest, not an audit.
Think of the concepts in this section as reliability primitives: disciplines that quietly determine whether your system survives production.
Observability is often reduced to “logging” in practice but the formal discipline has three components: logs, metrics, and traces. Each answers a different question. Logs tell you what happened. Metrics tell you how the system is behaving over time. Traces tell you where time was spent and how a request traveled across services. Knowing which one to reach for when a system misbehaves is a skill that takes deliberate practice to develop.
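A toy illustration of the three signals emitted from one operation. The in-memory metric store and span list are stand-ins for real backends (a metrics system like Prometheus, a trace collector like OpenTelemetry’s); the point is only that the three answer different questions.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

METRICS = {"requests_total": 0}   # stand-in for a metrics backend
SPANS = []                        # stand-in for a trace collector

def handle(job_id):
    start = time.perf_counter()
    METRICS["requests_total"] += 1                               # metric: behavior over time
    logging.getLogger("engine").info("job %s handled", job_id)   # log: what happened
    SPANS.append(("handle", time.perf_counter() - start))        # span: where time went

handle(1)
```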
Finally, if your decision engine ever spans multiple services or databases (and it will), spend time with distributed systems theory, specifically consistency tradeoffs and eventual consistency behavior. Understanding why consistency is hard to guarantee across systems explains behavior that otherwise looks like bugs. It won’t make the problems go away, but it will stop them from being surprises.
Every concept in this post is a different answer to the same question: when something goes wrong, can you reconstruct what happened and why?
Event sourcing, model lineage, distributed tracing, data contracts, observability triads: these aren’t separate problems owned by separate teams. They’re the same problem with different names, showing up at different layers of the stack. The teams that build AI systems that are trusted in production aren’t necessarily the ones with the best models. They’re the ones who designed for explainability before the first incident, not after.
If your team already has all of this in place, you’re ahead of the curve. Most teams will start with structured audit logs, prompt and model versioning, and trace IDs. Event sourcing is the endgame when you can afford it.
The outputs exist. Make sure the reasons do too.