A Debugging Method That Beats Vibes

Reproduce first, force three ranked hypotheses, run the cheapest discriminating test. A systematic debugging method that finds root cause instead of guessing.

June 13, 2026 | 11 min read | debugging, systematic debugging, developer workflow, AI debugging, root cause analysis

MaxtDesign

Engineering

[Image: macro of a single bright cable pulled clear from a dense tangle of dark cables on slate, the drawn-out strand in sharp focus and catching cool light.]

You know the loop. The bug shows up. You change a line that looks related, refresh, still broken. You change another, add a log, refresh, still broken. An hour later you have touched nine files, you are not sure which changes you kept, and you could not explain what is actually wrong if someone asked. That is debugging by vibes, and the maddening part is that it sometimes works, which is exactly why we keep doing it.

It works often enough to feel like a method and fails badly enough to eat whole afternoons. The fix is not to be smarter or to know the codebase better. It is to run a method that makes the bug tell you where it lives instead of guessing at it. Here is the one I use. It is the same whether you are debugging by hand or asking an AI to help, and handing the AI this structure is the difference between a useful partner and a confident one that fixes the wrong thing.

Why guess-and-check fails

Guess-and-check has three failure modes baked in. The first is tunnel vision: you latch onto the first plausible cause and stop looking, so if the first guess is wrong (and on a non-trivial bug it usually is) you spend your energy proving a theory that was never going to hold. The second is fixing before reproducing: you change something, the symptom seems to go away, and you declare victory without ever confirming you could trigger the bug on demand, which means you cannot know whether you fixed it or just disturbed the timing. The third is no memory: each attempt overwrites the last, so you cannot tell which changes mattered and you often reintroduce a bug you already fixed.

A method beats vibes by attacking all three: it forces a confirmed reproduction before any theory, it forces more than one hypothesis so the first guess cannot tunnel you, and it keeps a written log so beliefs update instead of evaporating.

Step 1: reproduce before you theorize

Before a single hypothesis, write down three things:

Expected: what you wanted to happen, stated precisely.
Observed: what actually happened. The exact error text, the exact wrong value, the exact log line. Not "it crashes," but the stack trace.
Reproduction: the minimal sequence that triggers it, reliably. If you cannot trigger it on demand, that is your whole problem right now, and the work is to find a reliable repro, not a fix.

This step is non-negotiable, and it is the one people skip. The single most common way AI-assisted debugging goes wrong is the model proposing a confident fix for a bug nobody reproduced, so it patches a plausible-looking cause that has nothing to do with the actual failure. No confirmed repro, no fix. If you are asking an AI for help, give it the expected, observed, and repro up front and tell it not to propose a fix until the repro is confirmed. That one instruction removes most of the wasted motion.
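The three items above fit in a small record, and "confirmed" can be made mechanical by requiring the repro to trigger on every attempt. This is a minimal sketch, not a prescribed tool: `BugReport`, `repro_confirmed`, and the `parse_price` bug are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BugReport:
    expected: str              # what you wanted to happen, stated precisely
    observed: str              # exact error text, wrong value, or log line
    repro: Callable[[], bool]  # returns True when the bug is triggered


def repro_confirmed(report: BugReport, attempts: int = 5) -> bool:
    """True only if the repro triggers the bug on every attempt."""
    return all(report.repro() for _ in range(attempts))


# Hypothetical bug: a price parser that chokes on a thousands separator.
def parse_price(text: str) -> float:
    return float(text)  # buggy: raises ValueError on "1,200.50"


def triggers_bug() -> bool:
    try:
        parse_price("1,200.50")
    except ValueError:
        return True
    return False


report = BugReport(
    expected='parse_price("1,200.50") returns 1200.5',
    observed="ValueError: could not convert string to float: '1,200.50'",
    repro=triggers_bug,
)

if repro_confirmed(report):
    print("Repro confirmed. Now you may theorize.")
else:
    print("No reliable repro yet. Finding one is the whole job right now.")
```

The same three fields are what you would paste to an AI assistant before asking it anything else.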

Step 2: force three hypotheses

Now, and only now, theorize. The rule that does the work: name at least three hypotheses, even when one feels obvious. Forcing the third is not busywork. It is the specific thing that breaks tunnel vision. The obvious cause is the one you would have chased anyway. The second and third are where the real bug often hides.

Each hypothesis gets three parts: a claim ("the timestamp is wrong because we store local time and compare it as UTC"), a test (the cheapest experiment that would prove it false), and a prediction (what you will see if it is true versus false). Writing the prediction before you run the test is what keeps you honest. A test you cannot predict the outcome of is not discriminating between your theories.
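One way to hold yourself to the three parts is to make the prediction a required field. The sketch below is an assumption about how you might write it down, not a standard format: the `Hypothesis` class, its field names, and the wording of the test and predictions are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    claim: str     # what you think is wrong
    test: str      # the cheapest experiment that could prove the claim false
    if_true: str   # what you expect to see if the claim holds
    if_false: str  # what you expect to see if it does not

    def is_discriminating(self) -> bool:
        # Both outcomes must be written down, and they must differ.
        # Identical predictions mean the test cannot separate the theories.
        a, b = self.if_true.strip(), self.if_false.strip()
        return bool(a) and bool(b) and a != b


h = Hypothesis(
    claim="the timestamp is wrong because we store local time and compare it as UTC",
    test="print the stored value and the comparison value for one failing record",
    if_true="the stored value is off by exactly the local UTC offset",
    if_false="both values are identical instants",
)

print(h.is_discriminating())
```

If you cannot fill in `if_true` and `if_false` before running the test, that is the signal to pick a different test.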

Then rank them, and rank by the right axis. Not pure likelihood, but likelihood times ease of testing. A hypothesis that is probably right but takes four hours to test is a worse next move than one that is possibly right and takes thirty seconds, because cheap tests collapse the search space fast. You want the experiment that eliminates the most uncertainty per minute spent.

Figure: Rank by likelihood times ease of test. Plot each hypothesis by how likely it is and how cheap it is to test. Start in the top-right corner: the cheap experiments that would also explain the bug. They collapse the search space fastest, whether or not they turn out to be the cause.
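The ranking rule can be written as simple arithmetic. This sketch assumes "ease" means one divided by the estimated minutes to run the test, which is only one reasonable reading. The `Candidate` class, the claims, and every number below are made-up illustrations.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    likelihood: float    # rough gut estimate between 0 and 1
    test_minutes: float  # estimated time to run the test

    @property
    def score(self) -> float:
        # Likelihood times ease, with ease taken as 1 / minutes (an assumption).
        return self.likelihood / self.test_minutes


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Highest score first: the most uncertainty removed per minute."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("probably right, four-hour test", likelihood=0.6, test_minutes=240),
    Candidate("possibly right, thirty-second test", likelihood=0.3, test_minutes=0.5),
    Candidate("long shot, five-minute test", likelihood=0.1, test_minutes=5),
]

for c in rank(candidates):
    print(f"{c.score:.4f}  {c.name}")
```

With these numbers the thirty-second test ranks first even though it is half as likely, and the four-hour test ranks last. The estimates do not need to be precise. They only need to be good enough to order the list.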

Step 3: the cheapest discriminating test

Take the top-ranked hypothesis and run its test. The result does one of two things. If it confirms the hypothesis, you have your root cause and you move to the fix. If it falsifies it, you do not start over. You add what you learned to the log (a falsified test is information: it just told you where the bug is not) and re-rank the remaining hypotheses with the new constraint in hand. Often the falsified test reshapes the list rather than just crossing one off.

This is the loop that replaces guess-and-check: hypothesis, cheapest test, update, repeat. The difference is that every step narrows the space and is written down, so you converge instead of wandering, and you never test the same thing twice by accident.
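As a sketch, the loop looks like this. Everything here is illustrative: the `Hypothesis` class, the `debug_loop` function, the claims, and the lambdas standing in for real experiments are assumptions. In a real session the re-ranking step is done by you, with judgment, not by a sort call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    claim: str
    likelihood: float             # rough guess between 0 and 1
    test_minutes: float           # estimated cost of the test
    run_test: Callable[[], bool]  # returns True if the claim is confirmed

    @property
    def score(self) -> float:
        return self.likelihood / self.test_minutes


def debug_loop(hypotheses: List[Hypothesis], log: List[str]) -> Optional[Hypothesis]:
    remaining = list(hypotheses)
    while remaining:
        # Re-rank every round. This is where you would revise likelihoods
        # by hand using what the last test ruled out.
        remaining.sort(key=lambda h: h.score, reverse=True)
        top = remaining.pop(0)
        confirmed = top.run_test()
        # Every result is written down, confirmed or not.
        log.append(("CONFIRMED: " if confirmed else "FALSIFIED: ") + top.claim)
        if confirmed:
            return top
    return None  # everything falsified: the mental model is what is wrong


log: List[str] = []
cause = debug_loop(
    [
        Hypothesis("stale cache entry", 0.3, 0.5, lambda: False),
        Hypothesis("local time compared as UTC", 0.5, 5, lambda: True),
        Hypothesis("race in the background worker", 0.2, 240, lambda: False),
    ],
    log,
)

for line in log:
    print(line)
```

The expensive race-condition test never runs, because two cheaper tests settled the question first.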

Step 4: fix, then write the test that would have caught it

When the root cause is confirmed, resist the urge to refactor the neighborhood. Make the minimum change that fixes the confirmed cause. Then write the regression test: the test that would have failed before your fix and passes after it. That test goes in the same commit as the fix, because a fix without a test is a bug waiting to come back the next time someone touches that code.
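Here is what that pairing could look like for the timestamp example from Step 2. It is a sketch under assumptions: `is_expired`, `is_expired_buggy`, and the expiry scenario are hypothetical, and the buggy version is included only to show that the test would have failed before the fix.

```python
from datetime import datetime, timedelta, timezone


def is_expired_buggy(stored: datetime, now_utc: datetime) -> bool:
    # Pre-fix behavior: treats the local wall-clock time as if it were UTC.
    return stored.replace(tzinfo=timezone.utc) <= now_utc


def is_expired(stored: datetime, now_utc: datetime) -> bool:
    # Minimal fix: reject naive timestamps, compare the actual instants.
    if stored.tzinfo is None:
        raise ValueError("stored timestamp must be timezone-aware")
    return stored.astimezone(timezone.utc) <= now_utc


def test_expiry_compares_in_utc() -> None:
    # 12:00 at UTC+02:00 is 10:00 UTC, so at 11:00 UTC it has expired.
    plus_two = timezone(timedelta(hours=2))
    stored = datetime(2026, 6, 13, 12, 0, tzinfo=plus_two)
    now_utc = datetime(2026, 6, 13, 11, 0, tzinfo=timezone.utc)
    assert is_expired(stored, now_utc) is True


test_expiry_compares_in_utc()
```

Swap `is_expired_buggy` into the test and it fails, which is the property that makes it a regression test and not decoration.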

If the bug is genuinely hard to test (a UI rendering glitch, a real race condition), say so out loud and write a manual-test description instead of pretending a unit test covers it. An honest "here is how to check this by hand" is worth more than a test that asserts nothing.

The special case: it worked yesterday

When a bug is new (it worked yesterday, it is broken today) you have a shortcut that beats reasoning about the code: bisect. Find a commit where it worked and a commit where it is broken, then binary search between them with a deterministic test command. Git will walk you to the exact commit that introduced the bug in a handful of steps, which is almost always faster than rereading the diff and guessing which change did it. The hypothesis log still applies, but now your first hypothesis is "something in this one commit," which is a very cheap thing to test.
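In practice you let Git drive this with `git bisect start`, `git bisect good`, `git bisect bad`, and `git bisect run` plus your test command. The sketch below only illustrates why so few steps are needed. It is not Git's implementation, it assumes a simple linear history, and `first_bad`, the commit names, and the stand-in test are hypothetical.

```python
from typing import Callable, List, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at or before mid
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad]


# Sixteen hypothetical commits, with the bug introduced at c11.
commits = [f"c{i}" for i in range(16)]
tested: List[str] = []


def deterministic_test(commit: str) -> bool:
    tested.append(commit)
    return int(commit[1:]) >= 11  # stand-in for checking out and running the test


culprit = first_bad(commits, deterministic_test)
print(culprit, "found after", len(tested), "test runs")
```

Sixteen commits take four test runs here. The search only works if the test command gives the same answer every time for a given commit, which is why the repro from Step 1 comes first.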

Where one debugger is not enough

Sometimes the method stalls. You are four rounds into the hypothesis log, every test came back falsified, and you can feel the tunnel closing in. That is not a sign to try harder. It is a sign that your mental model of the code is the thing that is wrong, and you need eyes that have not already committed to your theories.

In a team you would grab a colleague and have them cold-read the suspect code without hearing your hypotheses first, precisely so their read is not biased by yours. You can reproduce that alone: ask a fresh reviewer to explain what the code does with no context about the bug, and compare their explanation to what you assumed. When the symptom looks like an invariant violation (two fields that should agree and do not, a type-level inconsistency) bring in someone thinking about the data model rather than the control flow. The same senior-review instinct applies here: the second perspective is most valuable exactly when you are most sure you already know the answer.

The method, packed

This hypothesis-driven method is one of the skills in the Senior Solo Coder Skillpack, a set of 31 AI skills for developers who ship solo. The version in the pack enforces the parts that are easy to skip when you are frustrated: it refuses to propose a fix before the repro is confirmed, it makes you name three hypotheses before chasing one, and when the log hits four rounds without converging it brings in a cold-read reviewer to give you a fresh take, because that is the moment a solo developer most needs a second set of eyes and is least likely to ask for one. You still own the diagnosis and the fix. The method above works on its own, in any tool. Write the observation log, force the three hypotheses, test the cheapest one first, and you are already debugging better than vibes.
