STOP doing error checking!
February 06, 2017
Error codes are the problem, not the solution.
You probably don’t realize it, but all that code you add to “handle errors” is just making the problem worse. And no matter how much more time you devote to “error checking,” you’ll never end up with a system that’s smart enough to keep itself out of trouble.
So why do you keep trying? You do know the definition of insanity, right?
As it happens, there is a better way: just stop doing error checking. In fact, if you purge the concept of “error” from your system in the right way, you’ll end up with an implementation that works better in nominal cases and gives you more insight, and more opportunity to regain control, when something unexpected happens. Your code will be simpler, too.
To get this right we’ll need to reformulate our thinking a little. Instead of boring you with hypotheticals, however, I’d rather work through an example with you over the next few paragraphs.
Consider the humble, motorized valve. When working normally, the device simply converts an AC or DC voltage level into cross-sectional area: at maximum voltage, the valve is “open”; at minimum voltage, the valve is “shut”. There are countless variations (e.g., ball valves that can be reliably positioned at any point between fully open and fully shut, and solenoid valves that are either completely open or completely shut).
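To make the nominal behavior concrete, here is a minimal sketch of that voltage-to-opening relationship in C. The names (valve_model, valve_expected_opening), the voltage limits, and especially the linear response are my assumptions for illustration; a real valve's curve comes from its datasheet.

```c
/* Nominal valve model: maps a drive voltage to the fraction of the
 * valve's cross-sectional area that is open (0.0 = shut, 1.0 = open).
 * ASSUMPTION: the response is linear between v_min and v_max. */
typedef struct {
    double v_min;   /* volts at which the valve is fully shut */
    double v_max;   /* volts at which the valve is fully open */
} valve_model;

double valve_expected_opening(const valve_model *m, double volts)
{
    /* Outside the drive range the valve cannot shut or open any further. */
    if (volts <= m->v_min) return 0.0;
    if (volts >= m->v_max) return 1.0;

    /* Linear interpolation between fully shut and fully open. */
    return (volts - m->v_min) / (m->v_max - m->v_min);
}
```

With a hypothetical 0 V to 10 V actuator, a 5 V drive corresponds to a half-open valve under this model. A solenoid valve would replace the interpolation with a simple threshold.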
What can possibly go wrong with a device that’s so simple? Plenty, actually, which is precisely why mere error checking is an inadequate way of keeping the situation under control.
For example, if the mechanical linkage between the actuator and valve body breaks, then the actuator can move correctly but the valve gate itself won’t. Whether the valve gate naturally opens, closes, flutters, or sticks in its last position as a result of this failure depends on the type of valve and how it’s being used.
Another way for the valve system to fail is for something to become lodged in the valve body or actuator (e.g., contaminants, debris, or corrosion that blocks material flow and/or prevents the valve gate from moving properly). The actuator may still attempt to move the valve gate, but some portion of the valve gate’s overall range is lost until the obstruction clears. As with the previous failure mode, the valve’s overall design determines details like whether the actuator can continue to move properly even if the valve gate itself can’t.
Of course, the actuator might also be deprived of its control signal altogether, whether through electrical interference or a cut wire.
The failures I’ve listed above may all look distinct, but they have one key feature in common: a few ordinary laws of physics always apply to the valve system as a whole, even when the system itself is malfunctioning. In my experience, it’s rare for an embedded developer to be mindful of this – and even more unusual for them to exploit its implications in a way that successfully improves the quality of the overall system.
Remarkably, there’s actually one more thing that all of the failures I just listed have in common: we probably just don’t care. Or, rather, we shouldn’t care about them in detail until later, when it’s someone’s job to specifically diagnose a root cause. In the meantime, we want the assurance that if something doesn’t look right in the valve system, we can detect and react to it properly regardless of the nature or origin of the failure.
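One way to picture that requirement, purely as a sketch, is to compare what the valve was asked to do against what a sensor says it did, and to stop there. Everything in the block below is hypothetical: the valve_sample type, the existence of a feedback sensor, and the tolerance are assumptions of mine, and this is not necessarily the pattern the rest of this series arrives at.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one valve: what we asked for and what an
 * ASSUMED feedback sensor reports, both as open fractions (0.0 to 1.0). */
typedef struct {
    double commanded;   /* opening the control signal should produce */
    double measured;    /* opening inferred from the sensor          */
} valve_sample;

/* Size of the disagreement between command and measurement. */
double valve_residual(const valve_sample *s)
{
    double d = s->commanded - s->measured;
    return d < 0.0 ? -d : d;
}

/* True when the valve is doing what the nominal model predicts.
 * Note what is absent: no attempt to decide whether a linkage broke,
 * debris jammed the gate, or the wire was cut. */
bool valve_looks_nominal(const valve_sample *s, double tolerance)
{
    return valve_residual(s) <= tolerance;
}
```

A broken linkage, an obstruction, and a lost control signal would all show up here the same way: as a residual larger than the tolerance. The sketch ignores the time a real valve needs to travel to its commanded position, so a practical version would have to allow for that before drawing conclusions.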
Finally, and since we’re programmers, we want a solution that applies uniformly to a wide range of situations. That way, we can establish a trustworthy, uniform implementation pattern.
How do we do this? Don’t worry, now that I’ve opened the floodgates I’m not going to leave you to drown. Stay tuned. :-)
Bill Gatliff