Valuable Metrics: Numbers That Don’t Change Behavior Are Noise
By Darryl Brown
Every change we made over the last two years was an experiment. We had a hypothesis that simplifying our pipeline would increase throughput, that weekly sprints would reduce stress, and that automated code reviews would catch issues earlier. But a hypothesis without measurement is just an opinion. To know whether our experiments were working, we needed to establish what we were measuring before we changed anything, make the change, and then watch what happened. That discipline — measure, change, observe — is what turned our instincts into evidence. Metrics were how we replaced hope with information.
The framework we’ve been developing to structure that discipline looks like this:
Define the problem clearly:
- What is the problem or behavior we are trying to address?
- Why does it matter?
- What is the measurable current state (the baseline) we can track against?
Design the experiment:
- What is our theory on why the problem is occurring?
- What solution or change do we want to try?
- What does success look like, and by when?
That structure forces two things we used to skip: defining the metric before the experiment begins, and separating the problem from the solution. It’s easy to correctly execute an experiment and still have a negative impact on the thing you actually care about because you were measuring the implementation rather than the outcome. Defining the metric first removes that trap.
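One way to picture the framework is as a small record that cannot be created without a metric, a baseline, a target, and a deadline. The sketch below is only an illustration: the `Experiment` class, its field names, and the example numbers are hypothetical and do not describe any actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record mirroring the two halves of the framework."""

    # Define the problem clearly
    problem: str
    why_it_matters: str
    metric_name: str
    baseline: float  # measured before any change is made

    # Design the experiment
    theory: str
    change: str
    target: float  # what success looks like...
    deadline: date  # ...and by when
    lower_is_better: bool = True

    def succeeded(self, observed: float, on: date) -> bool:
        """Return True if the observed metric met the target by the deadline."""
        if on > self.deadline:
            return False
        if self.lower_is_better:
            return observed <= self.target
        return observed >= self.target


# Example with made-up numbers: the metric is named and baselined first,
# and the change under test is recorded separately from the problem.
example = Experiment(
    problem="Developer time lost to late requirement clarification",
    why_it_matters="Costly and disruptive once implementation has started",
    metric_name="late_clarification_rate",
    baseline=0.40,
    theory="Readiness criteria are implicit or inconsistent",
    change="Structured readiness criteria by work item type",
    target=0.10,
    deadline=date(2025, 12, 31),
)
```

Because every field is required, the record cannot exist until someone has named the outcome metric and measured its baseline, which is the habit the framework is meant to enforce.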
This framework is still a work in progress. Not all of our experiments have been this well-structured. We’ve tried things with less rigor, measured the wrong things, or defined success too loosely. But it’s the standard we’re working toward, and when we get it right, the difference is clear.
Here’s what it looks like when it’s done well. Our Director of Quality Engineering recently defined this goal:
Goal: Reduce late requirement clarification.
Problem: Developer time is lost to requirement clarification after implementation has already started — a costly and disruptive pattern.
Outcome: By year-end, decrease late requirement clarification by 75%, without increasing the time spent refining requirements upfront or reducing overall throughput.
Measures: Late Clarification Rate as the primary metric, plus a diagnostic metric we call the Definition of Ready False Confidence Rate, which tracks cases where a work item passed our readiness check but still generated clarification requests later.
Hypotheses: Readiness criteria for work items can be implicit or inconsistent; requirements are validated too late in the process; developers compensate for unclear work items during implementation rather than flagging them earlier.
Experiments: Introduce structured readiness criteria by work item type; implement and refine Definition of Ready enforcement mechanisms, including the AI-powered automation described in Post 4.
The problem is specific. The outcome is measurable and time-bound. The hypotheses are honest about uncertainty. The experiments are designed to test those hypotheses — not just to implement a solution and hope it works.
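To show how the two measures in that goal could be computed, here is a minimal sketch. The exact formulas are assumptions: Late Clarification Rate is taken to be the share of all work items that needed clarification after implementation started, and the False Confidence Rate is the share of items that passed the readiness check but still needed it. The `WorkItem` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WorkItem:
    passed_readiness_check: bool
    late_clarification: bool  # clarification requested after implementation began


def late_clarification_rate(items: Sequence[WorkItem]) -> float:
    """Share of all work items with a late clarification (assumed definition)."""
    if not items:
        return 0.0
    return sum(item.late_clarification for item in items) / len(items)


def false_confidence_rate(items: Sequence[WorkItem]) -> float:
    """Share of 'ready' items that still needed late clarification."""
    ready = [item for item in items if item.passed_readiness_check]
    if not ready:
        return 0.0
    return sum(item.late_clarification for item in ready) / len(ready)


def meets_reduction_target(baseline: float, current: float,
                           reduction: float = 0.75) -> bool:
    """True if the current rate is at least `reduction` below the baseline."""
    return current <= baseline * (1 - reduction)
```

The two rates answer different questions. The first says whether the outcome is improving. The second says whether the readiness check itself can be trusted, which is why it works as a diagnostic rather than a goal.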
Getting to that kind of clarity required some hard lessons about metrics themselves.
Early in this journey we made a serious attempt at building a proper data infrastructure. We set up a data warehouse and used automated connectors to pipe data in from our various tools: time tracking, project management, development workflow. The goal was a single place where we could see everything, compute trends, and make informed decisions about the team.
It was a good instinct. For a larger organization with dedicated analytics capacity, I still think that’s a great architecture. But for a twenty-person team, the cost-to-benefit didn’t hold up. Maintaining the connectors was more overhead than we could absorb, and when they broke — which happened with enough regularity to be a problem — the data became unreliable. We’d look at a metric, trust it, and make a decision based on it, only to discover later that the underlying data had been wrong for weeks.
Inaccurate metrics are more dangerous than no metrics. At least with no data we knew we were operating on instinct. With bad data we were operating on incorrect evidence.
So we pivoted. We moved our time tracking into our project management tool, which already held our workflow data, and leaned on what it provided out of the box. We traded sophistication for reliability. For our team at our size, it was the right call. A simple metric we trusted was worth more than a sophisticated one we didn’t.
We also learned that a metric isn’t a thing you add to a dashboard because it seems interesting. It’s a specific answer to a specific question: is this experiment working? For that, three things matter: the metric has to be easy to collect, it has to be accurate, and it has to reach the person whose behavior we want to influence. If any one of those is missing, we don’t have a metric. We have noise. And noise is worse than nothing.
We track throughput — the total number of completed tasks and their estimated points per month — as a trend line. When we add AI to our development process, the throughput trend tells us whether it’s moving the needle. We track production bug rates when we experiment with new quality engineering approaches. These are straightforward signals, not sophisticated dashboards. But they’re accurate, they’re current, and we trust them.
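A throughput trend of this kind needs very little machinery. The sketch below assumes completed tasks can be exported as (completion date, estimated points) pairs. It groups them by month and fits a simple least-squares slope. The function names and the input shape are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Sequence, Tuple


def monthly_throughput(
    completed: Iterable[Tuple[date, float]],
) -> Dict[Tuple[int, int], Tuple[int, float]]:
    """Map (year, month) to (tasks completed, points completed), in date order.

    Months with no completed tasks do not appear in the result.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for done_on, points in completed:
        key = (done_on.year, done_on.month)
        totals[key][0] += 1
        totals[key][1] += points
    return {key: (count, pts) for key, (count, pts) in sorted(totals.items())}


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against 0, 1, 2, ...: change per period."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator
```

A positive slope over the months after a change is a signal worth a closer look, not proof that the change caused it. The slope summarizes the trend line; it does not rule out other explanations.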
A number on a dashboard that nobody acts on is decorative. A piece of information that lands with the right person at the right time and changes what they do on a Tuesday afternoon — that’s a metric earning its keep.
The broader principle here connects to something Gene Kim illustrates throughout “The Phoenix Project.” In a factory, you can walk the floor and see exactly where work is piling up, where machines are idle, and where the bottlenecks are. In a software organization, that visibility doesn’t exist by default. You have to build it. That’s what our metrics are trying to do. Throughput trends, bug rates, and clarification rates are our factory floor. They make the state of our process visible so we can see what’s actually happening, not just what we assume is happening. Without them, we’re managing blind.
We are still building this out deliberately, adding metrics where we can answer the “what decision will this change” question clearly, and resisting the temptation to add them where we can’t. The principle we’re holding onto: every metric we add should either tell us something we’ll act on, or tell someone else something they’ll act on. If neither is true, we probably don’t need it.
Before you automate anything, make sure it’s worth doing in the first place.
Next up, we will look at what happened once we had a way to measure our experiments, and at what made developers genuinely want to show up.
