#2 Root Causing Issues with Craig’s Method 🎯

This is (almost) an ode to Craig, a former colleague, friend and mentor who put up with me during a couple of years of non-stop travel around the world while working at GSK. When I joined the team, Craig was much faster at tackling any technical problem than the rest of us. It did not matter whether he was the subject-matter expert (SME) in the area or whether the user had provided incorrect data to begin with. Craig would always follow the same logical mechanisms and, unsurprisingly, they always worked.

I observed what he did, and I came up with what I call the Hypothesis-Effort-Elimination (HEE) method. Craig followed a sound principle: narrow down the problem as fast as possible, systematically. In practice, when debugging processes, systems or code, this meant evaluating first the hypothesis most likely to eliminate the highest number of potential root causes with the least amount of effort, then repeating until the root cause is found. Easier said than done.

Example Time

Let’s start by reviewing a simple example. In today’s world, processes generally have hundreds of steps end-to-end. For example, let’s imagine a product was shipped with an incorrect label. Working backwards, the label was stuck on the product by a machine. Prior to that, the label must have been printed by a different machine, which, again, must have been calibrated before being used. For the printer to know which labels to print, data must come from the Product Lifecycle Management software where the SKU’s Master Data is stored. The error could have happened at any point, which allows us to formulate a few hypotheses. These can be ranked by the effort needed to evaluate them and by the number of root causes they would eliminate, depending on the outcome of each hypothesis.

| Hypothesis | Effort | Root Causes Eliminated | Score |
| Incorrect label was applied on the product due to machine error | High (hard to test remotely, needs in-person support) | If correct (50%), the root cause is that the machine applying the label malfunctioned. If incorrect, the error could be in the printer or the label master data. | Poor |
| The printer output the wrong labels | Medium (possible but tedious to check output from printers remotely) | If correct (50%), the root cause can be the printer or the master data. If incorrect, the root cause must be the machine applying the label. | OK |
| The label master data was incorrect | Low (Label Master Data is easily accessible) | If correct (50%), the error must be in the master data. If incorrect, the root cause must be downstream in either the printer or the label applicator. | Good |

For this first example, let’s assume we know nothing and it’s the first time we are working with this process and these systems. Hence, the probability of being right is 50% for each hypothesis. Consequently, all three hypotheses narrow down the problem to the same ‘expected’ outcome: 50% x 1 + 50% x 2 = 1.5 candidate root causes remaining after evaluating any one hypothesis. The deciding factor becomes the effort, so the lowest-effort hypothesis should be evaluated first. If that does not pin down the root cause, the Medium-effort hypothesis should be evaluated second to determine it. Hence the scores given.
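The ranking above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this example, not a definitive implementation of the method: the `Hypothesis` class, the numeric effort scale (1 = Low, 2 = Medium, 3 = High) and the tie-breaking rule are my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    effort: int              # assumed scale: 1 = Low, 2 = Medium, 3 = High
    p_true: float            # estimated probability that the hypothesis holds
    remaining_if_true: int   # candidate root causes left if it holds
    remaining_if_false: int  # candidate root causes left if it does not

    def expected_remaining(self) -> float:
        # Probability-weighted number of candidate root causes still open.
        return (self.p_true * self.remaining_if_true
                + (1 - self.p_true) * self.remaining_if_false)


def rank(hypotheses):
    # Fewest expected remaining causes first; effort breaks ties.
    return sorted(hypotheses, key=lambda h: (h.expected_remaining(), h.effort))


hypotheses = [
    Hypothesis("Label applicator malfunctioned", 3, 0.5, 1, 2),
    Hypothesis("Printer output the wrong labels", 2, 0.5, 2, 1),
    Hypothesis("Label master data was incorrect", 1, 0.5, 1, 2),
]

for h in rank(hypotheses):
    print(f"{h.name}: expected remaining = {h.expected_remaining():.2f}, "
          f"effort = {h.effort}")
```

With every probability at 50%, all three hypotheses tie on expected remaining causes, so the sort falls back to effort and puts the master data check first, then the printer, then the applicator.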

However, if we had recently seen errors with the same label applicator, this would push the odds of the High-effort hypothesis being correct to, say, 90%. That likely makes tackling the High-effort hypothesis first a viable alternative.
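To see how the odds change the picture, here is a minimal sketch that re-runs the expected-outcome arithmetic for the applicator hypothesis alone. The function name is hypothetical, and the sketch deliberately ignores that the odds of the other two hypotheses would shift as well.

```python
def expected_remaining(p_true, remaining_if_true, remaining_if_false):
    # Probability-weighted number of candidate root causes still open.
    return p_true * remaining_if_true + (1 - p_true) * remaining_if_false


# Applicator hypothesis: 1 cause left if correct, 2 left if incorrect.
baseline = expected_remaining(0.5, 1, 2)      # nothing known about the process
with_history = expected_remaining(0.9, 1, 2)  # recent applicator failures

print(f"no prior knowledge:         {baseline:.2f}")
print(f"recent applicator failures: {with_history:.2f}")
```

At 90%, the expected number of remaining causes drops from 1.5 to about 1.1 (0.9 x 1 + 0.1 x 2). Whether that gain justifies the higher effort is still a judgment call; the example does not define a formula that trades the two off.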

Reflecting on the Hypothesis-Effort-Elimination (HEE) Method

Overall, the example above is greatly simplified. In today’s tech-dominated world, we are surrounded by thousands of small services that are continuously talking to each other. However, by pairing the HEE algorithm with a fundamental understanding of how servers, access controls and networking work, it is possible to establish an order of investigation that minimizes the effort or time needed to find a solution for almost any technical problem.

I personally found the HEE algorithm super helpful in becoming what I would call a “professional root cause identifier”. The more you practice it consciously, the more you will find yourself formulating better hypotheses (those that eliminate more root causes with the least effort).

Over time, I found myself applying this algorithm more and enhancing it with the knowledge that work experience brings. For example, an unwritten rule I now follow is to think about simpler hypotheses that can be more easily evaluated (i.e. do a record count instead of a complete 1:1 validation). Similarly, it’s very rare to see systems failing intermittently, and intermittent failures are generally much harder to analyze. If I see an intermittent failure while solving a problem, I tend to look upstream for lower-effort and more optimal hypotheses. Anyway, a book could be written about this and, surprisingly, I cannot find much online!
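The record-count rule mentioned above can be sketched as follows. This is a minimal illustration and not tied to any particular system: `find_mismatch` and the sample rows are hypothetical, and it assumes both sides can be loaded as lists in the same order.

```python
def find_mismatch(source_rows, target_rows):
    # Cheap hypothesis first: do both sides hold the same number of records?
    if len(source_rows) != len(target_rows):
        return f"record count differs: {len(source_rows)} vs {len(target_rows)}"
    # Only when the counts agree, pay for the expensive 1:1 comparison.
    for index, (src, tgt) in enumerate(zip(source_rows, target_rows)):
        if src != tgt:
            return f"row {index} differs: {src!r} vs {tgt!r}"
    return None  # no difference found


# Hypothetical label master data on two sides of an interface.
source = [("SKU-1", "Label A"), ("SKU-2", "Label B"), ("SKU-3", "Label C")]
target = [("SKU-1", "Label A"), ("SKU-2", "Label B")]

print(find_mismatch(source, target))
```

The count check costs almost nothing and, when it fails, already points at a missing or duplicated record, so the full comparison only runs when the cheaper hypothesis has been ruled out.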


Picture of the Day

I am feeling beachy today. Koh Tao in Thailand is one of my favorite islands. It’s known for scuba diving and a great chilled atmosphere. Everything is a short 15-minute moped ride away, from beaches with small sharks and turtles to the amazing small island of Nang Yuan pictured above. Highly recommend it over Koh Phangan or Koh Samui!


Word of the Day

Bravado is a noun that represents bold courage, especially that shown by doing something unnecessary or dangerous with the intent to make people admire you.

Larry Bird’s bravado was second to none; he was the most amusing and savage trash talker the NBA has ever seen.
