Every Penny Counts in Embedded Design

By Shawn Prestridge

Senior Field Application Engineer & US FAE Manager

IAR Systems

September 21, 2022



Do more with less – this phrase, which captures Buckminster Fuller’s concept of ephemeralization, caught fire in the embedded space in the 1990s and never seems to go out of fashion. Managers constantly squeeze budgets and schedules to deliver products faster and cheaper, often with quality suffering as a result. Unlike Fuller’s vision of ever-increasing quality, this approach often shortens the test phase of the product lifecycle to meet aggressive schedule goals and occasionally means cutting features from the final product (perhaps to be added later as a version update).

Let’s explore techniques that will help developers find and fix defects more quickly, reduce the cost of the build material list (BML, often called the bill of materials or BOM), and perhaps avoid the challenges of ephemeralization. While the primary focus is on Arm-based cores, many of these techniques apply directly to other cores, since similar functionality exists in many embedded devices.

One of the easiest ways to quantify savings is in the BML: lower-cost parts mean the company spends less money to manufacture a product. On most embedded designs, the two most expensive parts are the screen (if the device has one; most IoT devices do not) and the processor. As you add more memory (flash and RAM) to a processor, the cost of the processor increases. While the specifics of how much the cost increases vary from semiconductor company to semiconductor company, a rough rule of thumb is that the per-unit cost of the processor increases by about one US dollar each time you double the memory.

What makes this problem worse is that embedded engineers are often not very good at forecasting memory requirements during the design phase of an application, yet these “guesstimates” of the amount of memory needed are key factors in processor selection. Given that many production runs are in the hundreds of thousands or millions of units per year, adding an unnecessary dollar to the BML has a deleterious impact on the company’s bottom line.
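To put rough numbers on that rule of thumb, here is a minimal sketch. The function name and the volumes are illustrative assumptions, and the one-dollar figure is only the rule of thumb quoted above, not any vendor's pricing.

```c
/* Rule of thumb from the text: roughly one US dollar of extra per-unit
   cost for each doubling of processor memory. Illustrative only. */
double extra_bml_cost_usd(unsigned doublings, unsigned long units_per_year)
{
    const double usd_per_doubling = 1.0; /* assumption: the rough rule of thumb */
    return usd_per_doubling * (double)doublings * (double)units_per_year;
}
```

Under these assumptions, one unnecessary doubling of memory on a run of 500,000 units per year adds about 500,000 US dollars annually to the cost of the product.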

As a result, innumerable projects “run tight on resources,” which is code for “we didn’t forecast our memory needs correctly.” What exacerbates this problem is that BMLs are often pitched to upper management towards the beginning of the project. Once this happens, the cost becomes inviolable. This leaves people scrambling to reduce the memory footprint or leaning on procurement to keep the BML costs the same as management expects by negotiating better prices on other components. To lower the memory footprint, teams often turn to their compiler’s optimization engine to reduce the size of the generated code.

Raising the Bar for Compiler Optimization

Some engineers are exceedingly reluctant to crank up the optimization because they perceive that optimization introduces bugs into the system. This is rarely the case: in my experience, only about 5 percent of suspected optimizer issues turn out to be a problem with the optimizer itself.

When the optimization level is raised, the compiler gets extremely picky about the semantics of the C and C++ languages. The optimization decisions are based on a strict interpretation of the language rules. Often, engineers are not fully aware of all the nuances of the language and write code in a way that seems natural to them.

For example, if a function call is written like this:

myFunc(varA, varB, varC, varD);

The natural assumption is that the variables will be read from left to right: varA will be read from memory, then varB, etc.

However, neither C nor C++ specifies the order in which function arguments are evaluated. If the memory is laid out, either purposefully or by happenstance, so that varB sits next to varD, then a high optimization level might use an index register to read successive memory locations to save code size and improve speed.

In most cases, this will not make a difference to the code. However, if you are depending on the variables being accessed in the order they are written, from left to right, then the code may run fine at low optimization but fail at high levels. This is where a good support structure from your tools vendor can help you spot these types of problems and rewrite sections of code so that they optimize better and work correctly, independent of the optimization settings.
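A minimal sketch of this pitfall follows. The names read_fifo, pack, fragile_read, and robust_read are hypothetical; read_fifo stands in for any read with a side effect, such as popping successive values from a peripheral FIFO.

```c
/* Hypothetical stand-in for a read with a side effect. */
static const int fifo_data[4] = {1, 2, 3, 4};
static unsigned fifo_pos = 0;

void reset_fifo(void)
{
    fifo_pos = 0;
}

int read_fifo(void)
{
    return fifo_data[fifo_pos++ % 4u];
}

/* Packs four digits so the order of the reads is visible in the result. */
int pack(int a, int b, int c, int d)
{
    return a * 1000 + b * 100 + c * 10 + d;
}

/* Fragile: the order in which the four arguments are evaluated is
   unspecified, so the result may differ between compilers or
   optimization levels. */
int fragile_read(void)
{
    return pack(read_fifo(), read_fifo(), read_fifo(), read_fifo());
}

/* Robust: each read is its own statement, so the order is fixed
   by the language regardless of optimization. */
int robust_read(void)
{
    int a = read_fifo();
    int b = read_fifo();
    int c = read_fifo();
    int d = read_fifo();
    return pack(a, b, c, d);
}
```

Starting from a reset FIFO, robust_read yields 1234 on any conforming compiler, while fragile_read may yield 1234, 4321, or another ordering. Sequencing the reads into local variables costs nothing at high optimization and removes the dependency on evaluation order.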

Moreover, if your code works the same at high optimization, it is more likely to be written correctly and is better tested. If the code does not work at higher optimizations, there is a good chance that a latent defect is waiting to “bite you.”

Good tools can save 10-40 percent on code size when set for high size optimization. However, not every optimization transformation is a good choice for every piece of code – some transformations might actually increase code size on certain types of code. This could be an article unto itself.

For now, there are resources available that address “getting the least out of your compiler,” meaning the smallest code size and the lowest execution time. Saving this amount of code space can be the difference between staying on track and stripping out functionality to fit within a device’s memory, missing schedules due to hand-optimizing your code, or going over budget on the BML.

While good code can operate the same at any level of optimization, debugging highly optimized code is tricky at best. For example, entire sections of code can be folded into other sections of code in a completely different place. This is why it is essential to debug your code at low or no optimization and to verify that it functions correctly before increasing the optimization to run the full battery of tests.

Debugging Out Cost in the BML

Part of what makes embedded debugging difficult is that most people simply don’t know of all the debugging tools in their arsenal. They tend to default to printf statements and code breakpoints. These defaults don’t help when trying to isolate a hard fault, find where a stack overflow is occurring, or find out why a variable keeps getting clobbered.

The good news is that exceptional tools exist that help find these types of problems.

Handling Hard Faults

Let’s start with the hard fault. Many modern MCUs have live instruction trace capabilities that allow you to follow the instruction flow. On Arm-based devices, the technology used to accomplish this is the Embedded Trace Macrocell (ETM). The device’s reference manual will indicate whether it supports ETM. If so, route the trace pins to your debug header and use a trace-enabled debugger, such as the IAR I-jet Trace, that can capture the live instruction flow and show it in the debugger window.

To find what caused the hard fault, simply scroll through the trace window and find the instruction that executed before you went to the fault handler. Voila! That instruction is the culprit. If the bug can be reliably reproduced, set a breakpoint at the fault handler and eliminate all the scrolling in the trace window – the culprit is the penultimate instruction in the trace window.

Now that the culprit is known, set a breakpoint on it and run through the test case again to see what is wrong with the code that is causing the exception.
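When stopped at a breakpoint in the fault handler, the registers that a Cortex-M core stacks on exception entry offer a cross-check on what the trace shows. The sketch below is a minimal illustration under stated assumptions: the type and function names are hypothetical, and sp is assumed to point at the exception frame (the value of MSP or PSP at exception entry, which a real handler would obtain with a few lines of assembly not shown here).

```c
#include <stdint.h>

/* Registers pushed by Cortex-M hardware on exception entry,
   lowest address first (basic frame, first eight words). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

/* Copies the stacked registers out of stack memory.
   Assumption: 'sp' points at the start of the exception frame. */
fault_frame_t decode_fault_frame(const uint32_t *sp)
{
    fault_frame_t f;
    f.r0   = sp[0];
    f.r1   = sp[1];
    f.r2   = sp[2];
    f.r3   = sp[3];
    f.r12  = sp[4];
    f.lr   = sp[5];  /* link register at the time of the fault */
    f.pc   = sp[6];  /* stacked return address */
    f.xpsr = sp[7];
    return f;
}
```

The stacked pc typically lies at or near the faulting instruction, although for some faults (such as imprecise bus faults) it can be several instructions later, which is one reason the instruction trace remains the more reliable guide.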

But what if you don’t have ETM? Most Arm-based devices have Serial Wire Output (SWO), which allows for a sampled, low-speed trace. While you do not get every single instruction, this can provide enough trace information to narrow down and locate the problem. Additionally, you can try to derate the MCU clock and/or adjust the SWO settings to get a finer granularity of trace information out of the debugger and home in on where the problem occurs.

Other device architectures have functionality similar to ETM or SWO, and high-quality tools can leverage that information to quickly isolate and eradicate the problem. Additionally, support resources are available that help wring extra performance out of the SWO to capture more trace data.

Stopping Stack Overflows

How about a stack overflow, or finding out why a variable mysteriously loses its contents? The same technique can diagnose both of these conditions.

In the Arm universe, most processors have a Data Watchpoint and Trace (DWT) block in their debug interface that can be used to quickly isolate these types of issues. In this case, use a data watchpoint to find out where the bad stuff is happening. A watchpoint is essentially a breakpoint that triggers whenever a piece of data gets touched.

You can configure the watchpoint to break execution only when the data is read, only when it is written, or on either access. You can even restrict it further so that it breaks only when the data matches a specific value under a certain bitmask. This is quite handy for keeping the debugger from stopping each and every time the data gets accessed.
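The trigger rule described above can be modeled in a few lines. This is a host-side sketch of the logic only, not the API of any debugger or the register layout of the DWT; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_READ = 1, ACCESS_WRITE = 2 } access_t;

/* Hypothetical description of a data watchpoint's options. */
typedef struct {
    unsigned access_mask; /* ACCESS_READ, ACCESS_WRITE, or both OR'ed together */
    bool     match_value; /* false: trigger on any value */
    uint32_t value;       /* value to compare against */
    uint32_t bitmask;     /* only the bits set here are compared */
} watch_cfg_t;

/* Returns true if an access of the given kind, carrying 'data',
   would trip the watchpoint. */
bool watch_hit(const watch_cfg_t *cfg, access_t access, uint32_t data)
{
    if ((cfg->access_mask & (unsigned)access) == 0u) {
        return false;  /* this kind of access is not being watched */
    }
    if (!cfg->match_value) {
        return true;   /* any value triggers */
    }
    return ((data ^ cfg->value) & cfg->bitmask) == 0u;
}
```

For example, a configuration of write-only access, value 0xFF, and bitmask 0xFF trips on a write of 0x123400FF (the low byte matches) but ignores every read and any write whose low byte differs.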

In the case of the stack overflow, set a data watchpoint at the top of the stack, meaning the far end of the stack region that the stack grows toward (on Arm Cortex-M devices, where the stack grows downward, this is the lowest address of the region). It does not matter whether the access is a read or a write, because the stack is already blown by the time that location is touched. The processor will halt execution at that access, preserving the call stack so you can see which piece of code is blowing the stack as well as how you arrived at that point. This is key to determining how to fix the bug.
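Which address to watch depends on the direction in which the stack grows. The sketch below shows one way to compute it; the structure and function names are hypothetical, and on a real target the base and size would come from the linker configuration rather than hard-coded numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a stack region. */
typedef struct {
    uintptr_t base;       /* lowest address of the region */
    size_t    size;       /* size of the region in bytes */
    int       grows_down; /* nonzero for a descending stack, as on Arm Cortex-M */
} stack_region_t;

/* Address of the last 32-bit word the stack can use before it overflows;
   an access here means the stack is full or already blown. */
uintptr_t stack_guard_address(const stack_region_t *s)
{
    if (s->grows_down) {
        return s->base;                             /* descending: lowest word */
    }
    return s->base + s->size - sizeof(uint32_t);    /* ascending: highest word */
}
```

For a descending 1 KB stack based at 0x20000000, the watchpoint goes on 0x20000000; an ascending stack in the same region would be watched at 0x200003FC.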

Cleaning Up Clobbered Data

With clobbered data, use essentially the same technique: set a data watchpoint that triggers when the variable is written. If the variable is always clobbered with the same value, narrow the watchpoint further so that it trips only when that value is written. Then run the test case one more time to find out whose code is causing the issue.

Again, many other architectures (such as the Renesas RL78 and RX, and devices from many other silicon vendors) have similar functionality that can be used to achieve the same results. With high-quality tools, finding these types of issues becomes easier, which increases the odds of meeting an aggressive schedule.

Let Procurement Know You Care

Doing more with less may seem to be a contradiction, but it can be easily accomplished by using the right tools. By using compiler optimizations, you can shoehorn your code into the smallest possible space in order to use the least expensive device for your application.

Optimization can also help you desk-check your code: if you confirm that it runs the same at high optimization, you have a chance to find potential defects before you check the code into a build (where every defect counts against your release metrics). Using your full debugging toolbox likewise helps you find bugs more quickly, thus shortening the test-and-fix cycle and getting your project out the door faster.

If you know what tools are in your toolbox (and how to use them properly), you can make every penny count for your organization.

Shawn Prestridge is a Senior Field Application Engineer & US FAE Manager at IAR Systems.