Debug domino

We found a bug in some old code. And we fixed it. Then other stuff started failing. It turned out that there were some other code, using the same function, that worked because of the bug. That forced us to roll back the fix and thereby re-introduce the initial bug that weren’t as critical.

So – we had to find a way to find out which code-paths were working because of the bug so we could fix those.

This was a frequently used, but lightweight helper function, returning a calculated output based on the input. It turned out that the implementation had some flaws, so we wrote tests that failed and then fixed it. That’s when we got into new problems.

On our second attempt to put it into production, we put it in as a separate function, so that we had both the old and new available. We then rewrote the code in the original function, so that it did the original calculation and, optionally, called the new function, did a comparison of the result and then logged a warning if the results differed.

Since we didn’t know exactly how often the error would happen, resulting in a log-write to the disk, we put on a feature toggle to be able to start and stop the sampling.

The code, which we in lack of a better name, called shadow-code, ended up something like this (pseudo-code – not compileable):

begin
  l_old_result := old_function(input_data);
  if configuration_value(do_shadow_calculation) = 'YES' then
    l_shadow_result := new_function(input_data);
    check_and_log_if_different(l_old_result, l_shadow_result);
  end if;

  return l_old_result;
end;

The check_and_log_if_different were overloaded (there were several versions of the old_function), based on input-type and looked something like this:

procedure check_and_log_if_different( p_result1 ..., p_result2 ...) is
begin
  if nvl(p_result1, c_magic_value) <> nvl(p_result2, c_magic_value) then
    logger.log_warning(logger.sprintf(
      'New and old implementation values differ: %s/%s', l_result, l_shadow_result));
  end if;
end;

By doing it this way, it was transparent for the calling code and we could work in iterations: Turn on sampling, find the next inconsistency, fix that code and call the new function, repeat.

Since we’re using logger, we had the call stack from the log-statements, so we could easily find the spots that were relying on the wrong results, correct these and point these to the new function.

The configuration_value function were a perfect candidate for result-cache, since it was just reading a row from a static table. The table is only updated when we change a config, resulting in an invalidation of the result-cache.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s