Three Ways to Achieve Tenfold Embedded Memory Performance for Heterogeneous Multicore

By Brandon Lewis


Embedded Computing Design

March 01, 2021



In theory, a heterogeneous multicore device can equip a compute block optimized for any type of operation a given use case can throw at it. A GPU for video processing, a neural network processor for object recognition, a CPU to run the OS, and so on. The different fit-for-purpose cores provide an SoC with more flexibility, and therefore greater performance and lower power consumption across a wider range of workloads, than a homogeneous processor of the same class. 

(Editor's Note: Read "2021 Embedded Processor Report: Beyond Moore's Law with Parallel Processing & Heterogeneous SoCs")

But as you start considering the requirements of applications like edge-based AI and computer vision, the truth is that I/O and memory become just as restrictive as raw processing performance, if not more so.

“Memory speed is only going up so fast, right?” poses Deepu Talla, Vice President and General Manager of Embedded & Edge Computing at Nvidia. “It’s not growing exponentially. The bit width is approximately the same because of the size: It’s either 16-, 32-, 64-, 128-bit, whatever. And most of the embedded processors typically have 32-bit or maybe even 16-bit interfaces, again because of cost and size reasons.

“The speed of memory is only growing 2x generation-over-generation, and that typically happens every three years,” he continues. “However, the compute requirements within the SoC have gone up probably 10x or 20x.”

How do you reconcile this disproportionate increase in compute performance against comparatively minor advances in memory technology? Particularly as processors evolve into unique collections of logic that all require their own access to resources like memory.

According to Talla, you give it to them. Here are three ways embedded memory architectures are advancing to meet the demands of next-generation heterogeneous multicore processors.

#1. Core-Specific SRAM

“If you look at a lot of these embedded processors, they’ve always had SRAM in the past,” Talla says. “Now, for each specific unit, we have local SRAM, which gets data from DRAM, stores it locally and processes it, and then sends back the final output.”

Core-specific SRAM offers a couple of advantages, starting with memory performance gains that result from not having to write temporary data back to off-chip DRAM.

This architecture also has the added benefit of reducing power consumption, because the very-low-voltage SRAM blocks reside near or adjacent to the corresponding logic IP within the SoC.

“If you go to DRAM, that’s probably an order of magnitude more power, so you’re actually saving power by using those techniques,” Talla explains.
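A rough back-of-envelope model shows why keeping intermediate results in local SRAM pays off. The energy figures below are illustrative assumptions, not vendor data; the only grounded ratio is Talla's point that a DRAM access costs roughly an order of magnitude more energy than an on-chip access:

```python
# Illustrative energy model (assumed figures; only the ~10x DRAM-vs-SRAM
# ratio comes from the article).
SRAM_PJ_PER_BYTE = 1.0    # assumed on-chip SRAM access energy
DRAM_PJ_PER_BYTE = 10.0   # ~an order of magnitude higher, per Talla

def transfer_energy_pj(num_bytes: int, intermediate_passes: int,
                       use_local_sram: bool) -> float:
    """Energy to process a buffer: one DRAM read in, one DRAM write out,
    plus `intermediate_passes` read+write round trips for temporary data."""
    io = 2 * num_bytes * DRAM_PJ_PER_BYTE  # initial fetch + final writeback
    per_pass = SRAM_PJ_PER_BYTE if use_local_sram else DRAM_PJ_PER_BYTE
    temps = 2 * num_bytes * intermediate_passes * per_pass
    return io + temps

buf = 1_000_000  # 1 MB working set, 4 intermediate processing passes
with_dram = transfer_energy_pj(buf, intermediate_passes=4, use_local_sram=False)
with_sram = transfer_energy_pj(buf, intermediate_passes=4, use_local_sram=True)
print(f"all-DRAM: {with_dram/1e6:.0f} uJ, local SRAM: {with_sram/1e6:.0f} uJ")
```

Even in this toy model, routing temporary data through local SRAM cuts transfer energy severalfold; the savings grow with the number of intermediate passes.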

#2. Increased System Memory

Embedded processors today feature as much as 4 MB to 8 MB of system memory. This system memory is not dedicated to any one specific core, and can be shared between elements like a CPU, GPU, and accelerator.

Similar to the dedicated SRAM, the primary benefit of more shared system memory is fewer DRAM accesses. For example, where a traditional video encoding sequence would look like this:

DRAM -> Video Encoder -> DRAM -> Additional Compute -> DRAM

Increased system memory enables this:

DRAM -> Video Encoder -> System Memory -> Additional Compute -> DRAM

The difference is that separate cores no longer have to continually fetch data from off-chip DRAM; the large system memory eliminates that intermediate round trip.
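Counting the off-chip hops makes the saving concrete. The sketch below, with hypothetical stage names, simply tallies how many times each pipeline above touches DRAM:

```python
# Tally off-chip DRAM transfers for the two pipelines above.
# Stage names are illustrative, mirroring the article's arrow diagrams.

def dram_transfers(pipeline: list[str]) -> int:
    """Count DRAM touches: each 'DRAM' hop is one off-chip transfer."""
    return sum(1 for stage in pipeline if stage == "DRAM")

traditional = ["DRAM", "VideoEncoder", "DRAM", "AdditionalCompute", "DRAM"]
with_system_mem = ["DRAM", "VideoEncoder", "SystemMemory",
                   "AdditionalCompute", "DRAM"]

print(dram_transfers(traditional))      # 3 off-chip touches
print(dram_transfers(with_system_mem))  # 2: the intermediate hop stays on chip
```

One fewer round trip per frame may look modest, but at video frame rates and resolutions it removes a substantial share of DRAM bandwidth demand.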

#3. Increased Cache Sizes

Finally, as newer process technologies make higher capacity memory more affordable, cache sizes will inevitably increase. Larger caches for CPUs, GPUs, DSPs, and other core architectures found on a heterogeneous SoC will also mitigate the amount of DRAM traffic.

And pairing increased cache sizes with the previous two advancements starts yielding some serious gains.

“More SRAM, system memory that’s common across, and then more high-capacity caches allows you to increase performance by 10x to 100x over the next three-to-five years even though the memory bandwidth has probably only doubled or quadrupled,” Talla points out.
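The arithmetic behind that claim can be sketched as follows. The reuse factor below is an assumed, hypothetical number; the point is only that on-chip reuse multiplies effective memory throughput well beyond raw DRAM bandwidth growth:

```python
# Illustrative scaling model (assumed numbers): effective memory performance
# = raw DRAM bandwidth growth x on-chip data reuse from SRAM/system memory/caches.
dram_bw_growth = 2.0   # ~2x per three-year generation, per the article
on_chip_reuse = 16.0   # hypothetical: share of accesses served on chip
                       # raises data reuse ~16x

effective_gain = dram_bw_growth * on_chip_reuse
print(f"effective memory performance gain: {effective_gain:.0f}x")  # 32x
```

A 16x reuse factor lands the combined gain at 32x, squarely inside the 10x-to-100x range Talla cites, despite bandwidth itself only doubling.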

Hopefully that will buy us some breathing room. For now.

Brandon is responsible for guiding content strategy, editorial direction, and community engagement across the Embedded Computing Design ecosystem. A 10-year veteran of the electronics media industry, he enjoys covering topics ranging from development kits to cybersecurity and tech business models. Brandon received a BA in English Literature from Arizona State University, where he graduated cum laude. He can be reached at [email protected].
