Getting more for less: designing smarter Edge AI

August 03, 2021

Identifying the information that is actually useful to you is a relentless challenge. What information do I need for my applications? And how does that relate to the claims coming from vendors competing for my business?

In this article, I’ll try to distill what I’ve learned to sort the truth from the bluster when it comes to automotive AI hardware platforms.

“We cannot solve our problems with the same thinking we used when we created them.”

—Albert Einstein

Why do we love big numbers?

As an engineer with more than 40 years of experience as both an R&D Director and CMO in the semiconductor business, I consider myself and my peers reasonably logical. However, how many of us can honestly say we have never been seduced by a claim like “my widget is faster than yours”? I’m afraid it’s just human nature, especially when we are not confident in our own expertise to probe the claims.

The problem is always one of definition: how do I define “faster”, “lower power” or “cheaper”? This is the problem benchmarks try to solve – having consistent context and external criteria to ensure you are comparing like with like. Anyone working with benchmarks knows this only too well (aiMotive was born from a leading GPU benchmarking company).

The need to cut through this bombardment of claims has never been more urgent than when comparing hardware platforms for automotive AI applications.

When is 10 TOPS not 10 TOPS?

Whether they have dedicated NPUs or not, most SoCs quote their capacity for executing NN workloads as TOPS: Tera Operations Per Second. This is simply the total number of arithmetic operations the NPU (or SoC as a whole) can, in principle, execute per second, whether all concentrated in a dedicated NPU or distributed across multiple computation engines such as GPUs, CPU vector co-processors, or other accelerators.
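
To make that arithmetic concrete, here is a minimal sketch of how a headline TOPS figure is typically derived. The MAC count and clock frequency below are hypothetical examples, not the specification of any real NPU:

```python
# Sketch: how a headline "TOPS" number is typically derived.
# num_macs and clock_hz are hypothetical, not any real NPU's spec.

def theoretical_tops(num_macs: int, clock_hz: float) -> float:
    """Peak Tera-Operations Per Second for a MAC-array accelerator.

    Each multiply-accumulate (MAC) is conventionally counted as two
    operations: one multiply and one add.
    """
    return num_macs * 2 * clock_hz / 1e12

# e.g. a hypothetical 2048-MAC array at 1 GHz -> ~4.1 TOPS headline figure
print(theoretical_tops(num_macs=2048, clock_hz=1.0e9))
```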

However, no hardware execution engine executes every aspect of any workload with 100% efficiency. For neural network inference, some layers (such as pooling or activations) are mathematically very different to convolution. Data must be rearranged or moved from one place to another before the convolution itself (or other layers such as pooling) can start. At other times the NPU might need to wait for new instructions or data from the host CPU controlling it, per layer or even per data tile. All of these leave compute resources idle, so the achieved throughput falls well short of the theoretical maximum capacity.

Hardware utilization – not what it appears

Many NPU suppliers will quote hardware utilization to indicate how well their NPU executes a given NN workload. This basically says, “This is how much of the theoretical capacity of my NPU is being used to execute the NN workload.” Surely that tells me what I need to know.

Unfortunately not. The problem with hardware utilization is one of definition: the number depends entirely on how the NPU vendor chooses to define it. Indeed, the problem with both hardware utilization and with TOPS is that they only tell you what the hardware engine is theoretically capable of achieving, not how well it achieves it.
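
To see how slippery the definition can be, here is a hedged sketch of two equally plausible ways a vendor might define utilization for the very same run. All the numbers are hypothetical:

```python
# Two plausible definitions of "utilization" for the same hypothetical run.

busy_cycles = 900_000     # cycles where the MAC array was doing something
total_cycles = 1_000_000  # total cycles taken by the workload
useful_ops = 1.2e9        # operations the CNN mathematically requires
issued_ops = 1.8e9        # operations actually issued (padding, re-layout, ...)

# Definition A: "the MAC array was busy 90% of the time"
utilization_a = busy_cycles / total_cycles   # 0.90

# Definition B: only two thirds of the issued work was mathematically useful
utilization_b = useful_ops / issued_ops      # ~0.67

print(f"Definition A: {utilization_a:.0%}, Definition B: {utilization_b:.0%}")
```

Both figures are honest by their own definition, yet they tell very different stories – which is exactly why a quoted utilization number is so hard to compare across vendors.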

This can lead to some highly misleading comparisons. Figure 1 below shows a comparison we performed between an aiWare3P NPU rated at 4 TOPS and another well-known NPU rated at 8 TOPS.

Figure 1: Utilization vs efficiency comparison for two automotive inference NPUs (Source: aiMotive using publicly available hardware and software tools)

For two different well-known benchmarks, the Competitor X NPU claims 8 TOPS of capacity compared to aiWare3P’s 4 TOPS. That should mean it delivers roughly 2x the fps of aiWare3P. In reality, it’s the reverse: aiWare3P delivered 2x-5x higher performance despite claiming only half the TOPS!

The conclusion: TOPS is a really bad way to measure AI hardware capacity, and hardware utilization is almost as misleading as TOPS.

NPU Efficiency and Autonomy: key for optimizing PPA

That’s why I believe you must base your assessment of NPU capability on efficiency when executing a set of representative workloads, not on raw theoretical hardware capacity. Efficiency is the number of operations mathematically required to execute a specific CNN for one frame, multiplied by the achieved frame rate, expressed as a percentage of the claimed TOPS. The operation count is derived solely from the underlying mathematical algorithms that define the CNN, regardless of how the NPU actually evaluates it. Efficiency therefore compares actual against claimed performance, and that’s what really matters.
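
As a minimal sketch, assuming efficiency is computed as (operations required per frame × achieved frames per second) divided by the claimed peak TOPS, the calculation looks like this. The workload size and frame rates below are hypothetical, not measured results:

```python
# Sketch of the efficiency metric described above, under the assumption:
# efficiency = (ops required per frame * achieved fps) / claimed peak TOPS.
# All numbers are hypothetical, not measured results.

def efficiency(ops_per_frame: float, achieved_fps: float,
               claimed_tops: float) -> float:
    """Fraction of the claimed peak spent on mathematically useful work."""
    effective_tops = ops_per_frame * achieved_fps / 1e12
    return effective_tops / claimed_tops

# A CNN needing 70 GOP per frame, run on two hypothetical NPUs:
print(f"{efficiency(70e9, 50, 4.0):.1%}")  # 4 TOPS part at 50 fps -> 87.5%
print(f"{efficiency(70e9, 20, 8.0):.1%}")  # 8 TOPS part at 20 fps -> 17.5%
```

Note how the hypothetical 4 TOPS part ends up delivering both more frames per second and far higher efficiency than the 8 TOPS part – the same kind of reversal shown in Figure 1.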

An NPU that demonstrates high efficiency makes the best use of every mm² of silicon used to implement it, which translates to lower chip cost and lower power consumption. Efficiency enables the best possible PPA (Performance, Power, and Area) for automotive SoCs or ASICs.

The autonomy of the NPU is another important factor. How much load does the NPU place on the host CPU to achieve its highest performance? What demands does it place on the memory subsystem? An NPU is a large block in any SoC or ASIC – its impact on the rest of the chip and subsystem cannot be ignored.

Conclusions

When designing any automotive SoC or ASIC, AI engineers must focus on building production platforms capable of executing their algorithms reliably while achieving exceptional PPA: lowest power, lowest cost, highest performance. They must also commit to their choice of hardware platform early in the design cycle, usually well before the final algorithms have been developed.

Efficiency is the best basis for achieving this; neither TOPS nor hardware utilization is a good measure. Assessing the NPU’s autonomy is also crucial if demanding production targets are to be met.


Tony King-Smith is Executive Advisor at aiMotive. He has more than 40 years’ experience in semiconductors and electronics, managing R&D strategy as well as hardware and software engineering teams for a number of multi-nationals including Panasonic, Renesas, British Aerospace and LSI Logic. He is also well-known globally as an inspirational technology marketer from his role as CMO for leading semiconductor IP vendor Imagination Technologies. Tony is based near London, UK.
