Transforming Edge AI with Clusters of Neural Processing Units

By Saumitra Jagdale

Freelance Technology Writer

June 13, 2022


Image Credit: Synopsys

As artificial intelligence gains traction, edge devices are becoming more compute-intensive and power-hungry. The processing load on these devices grows with the performance and complexity of the system architecture: higher-resolution images and more sophisticated algorithms must be handled, and designs must be optimized to deliver high TOPS (tera-operations per second) performance as demand for AI processing increases.


Synopsys has released a Neural Processing Unit (NPU) Intellectual Property (IP) core and toolchain to address the performance needs of increasingly complex neural network models in AI Systems on Chip (SoCs). Its new DesignWare ARC NPX6 and NPX6FS NPU IP handle the demands of real-time computing at ultra-low power for AI applications. Additionally, the company's new MetaWare MX development tools provide a full compilation environment with automated neural network algorithm partitioning, which maximizes resource utilization when developing application software for the new NPU.


Using the new DesignWare ARC NPX6 and NPX6FS NPU IP, together with the MetaWare MX development toolkit, designers can take advantage of the newest neural network models, meet escalating performance expectations, and bring their next intelligent SoCs to market faster. The ARC NPX6 NPU IP family comprises several products covering a broad range of deep learning algorithms, including computer vision tasks such as object recognition, image quality enhancement, and scene segmentation, as well as broader AI applications such as audio and natural language processing. A single NPU instance scales from 4K to 96K MACs, giving a single AI engine over 250 TOPS, or over 440 TOPS with sparsity.

The NPX6 NPU IP includes hardware and software support for multi-NPU clusters of up to eight NPUs, reaching 3,500 TOPS with sparsity. Scaling to a large MAC count is achievable thanks to advanced bandwidth-reduction features in hardware and software, as well as a memory hierarchy with L1 memory in each core and a high-performance, low-latency interconnect to shared L2 memory. For applications that benefit from BF16 or FP16 inside the neural network, an optional tensor floating-point unit is offered.
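As a rough sanity check on the figures above, peak throughput is commonly estimated as MAC count × 2 operations per MAC × clock frequency. The sketch below is a back-of-the-envelope calculation, not a Synopsys formula. The 1.3 GHz clock is an assumption (this article does not state a clock speed), ideal linear scaling across a cluster is also assumed, and the function names are illustrative.

```python
# Back-of-the-envelope check of the TOPS figures quoted above.
# ASSUMPTIONS (not from this article): a 1.3 GHz clock, and the common
# convention that one MAC counts as two operations (multiply + add).

ASSUMED_CLOCK_HZ = 1.3e9
OPS_PER_MAC = 2


def peak_tops(mac_count, clock_hz=ASSUMED_CLOCK_HZ):
    """Dense peak throughput in tera-operations per second."""
    return mac_count * OPS_PER_MAC * clock_hz / 1e12


def cluster_tops(per_npu_tops, npu_count):
    """Aggregate peak throughput, assuming ideal linear scaling."""
    return per_npu_tops * npu_count


smallest = peak_tops(4 * 1024)    # 4K-MAC configuration
largest = peak_tops(96 * 1024)    # 96K-MAC configuration
cluster = cluster_tops(440, 8)    # eight NPUs at the quoted sparse figure

print(f"4K MACs:  {smallest:.1f} TOPS")
print(f"96K MACs: {largest:.1f} TOPS")
print(f"8-NPU cluster with sparsity: {cluster:.0f} TOPS")
```

Under those assumptions, the 96K-MAC configuration works out to roughly 256 TOPS and the eight-NPU cluster to 3,520 TOPS, which is in line with the "over 250 TOPS" and 3,500 TOPS figures quoted in the text.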

Image Credits: Synopsys


The MetaWare MX development toolkit provides a software programming environment for application development, including a neural network software development kit (NN SDK) and virtual model support. The NN SDK automatically converts neural networks trained in popular frameworks such as PyTorch and TensorFlow, or exchanged in the ONNX format, into NPX-optimized executable code.

The idea is that the NPX6 NPU processor IP can be used to build a variety of products, ranging from a few TOPS to thousands of TOPS, all of which can be programmed with a single toolchain.

Key Features of the NPX6 NPU IP:

  • Scalable real-time AI / neural processor IP with performance of up to 3,500 TOPS, supporting CNNs, RNNs/LSTMs, transformers, recommender networks, and other neural networks

  • Industry-leading power efficiency of up to 30 TOPS/W

  • 1-24 cores of an enhanced convolution accelerator with 4K MACs per core

  • Tensor accelerator that provides flexible activation and supports the Tensor Operator Set Architecture (TOSA)

  • Software development kit

    • Automatic mixed-mode quantization tools

  • Architecture and software tool features that reduce bandwidth requirements

  • Reduced latency through parallel processing of individual layers

  • Seamless integration with DesignWare ARC VPX vector DSPs

  • High productivity: the MetaWare MX Development Toolkit supports the TensorFlow and PyTorch frameworks, as well as the ONNX exchange format
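To give a feel for what "mixed-mode quantization" means in general, the sketch below quantizes each layer's weights to 8 bits and falls back to 16 bits when the 8-bit error is too large. It is a generic illustration, not the MetaWare MX algorithm; the layer names, weight values, and error threshold are all made up for the example.

```python
# Generic mixed-precision quantization sketch (NOT the MetaWare MX tool).
# Layer names, weights, and the error threshold are hypothetical.


def quantize(values, bits):
    """Symmetric linear quantization; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


def choose_bit_widths(layers, max_error):
    """Pick 8 bits per layer when accurate enough, else 16 bits."""
    plan = {}
    for name, weights in layers.items():
        err8 = mean_abs_error(weights, quantize(weights, 8))
        plan[name] = 8 if err8 <= max_error else 16
    return plan


layers = {
    "conv1": [0.50, -0.25, 0.125, 0.75],       # similar magnitudes
    "fc_out": [100.0, 0.013, -0.021, 0.008],   # one outlier dominates the range
}
plan = choose_bit_widths(layers, max_error=0.005)
print(plan)
```

In this toy case the layer with a large outlier loses its small weights entirely at 8 bits, so it is assigned 16 bits, while the well-behaved layer stays at 8 bits. A production tool would typically measure the effect on network accuracy rather than raw weight error.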

Moreover, the ARC NPX6FS NPU IP meets ISO 26262 ASIL D requirements for random hardware fault detection and a systematic functional safety development flow. The processors include dedicated safety mechanisms for ISO 26262 compliance, come with comprehensive safety documentation, and address the mixed-criticality and virtualization needs of next-generation zonal architectures.

The ARC MetaWare MX development toolkit includes a neural network software development kit (SDK), compilers and a debugger, a virtual platforms SDK, runtimes and libraries, and advanced simulation models. It offers a unified toolchain environment to speed up application development and automatically partitions algorithms across MAC resources for efficient processing. The MetaWare MX development toolkit for safety adds a safety handbook and a safety guide to help developers meet ISO 26262 requirements and prepare for ISO 26262 compliance testing in safety-critical automotive applications.
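The article does not describe how the toolchain partitions work, so the following is only a conceptual sketch of the general idea: split one layer's output channels across several cores and let the busiest core set the layer latency. The even split, the idealized cycle model, and the example layer shape are all assumptions; only the 4K MACs per core figure comes from the feature list.

```python
# Illustrative only: splitting one layer's work across convolution cores.
# This is NOT how MetaWare MX partitions networks; the even split and the
# example layer shape are assumptions made for the sketch.
import math

MACS_PER_CORE = 4096  # 4K MACs per core, as listed in the key features


def split_channels(out_channels, cores):
    """Divide output channels as evenly as possible across cores."""
    base, extra = divmod(out_channels, cores)
    return [base + (1 if i < extra else 0) for i in range(cores)]


def layer_cycles(macs_per_channel, out_channels, cores):
    """Ideal cycle count: the slowest core determines layer latency."""
    busiest = max(split_channels(out_channels, cores))
    return math.ceil(busiest * macs_per_channel / MACS_PER_CORE)


# Hypothetical 3x3 convolution: 64 input channels, 56x56 output, 128 filters.
macs_per_channel = 3 * 3 * 64 * 56 * 56
for cores in (1, 4, 24):
    print(cores, "cores:", layer_cycles(macs_per_channel, 128, cores), "cycles")
```

In this idealized model, 24 cores give less than a 24x speedup because 128 channels do not divide evenly by 24, so some cores process six channels while others process five. Real partitioning also has to account for memory bandwidth and data movement, which this sketch ignores.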

Accelerating Edge AI Applications with Clusters of NPUs

To address the growing performance demands and complexity of AI applications, the NPX6 NPU IP offers high-performance, scalable, real-time AI and neural processing with up to 3,500 TOPS, supporting neural networks such as CNNs, RNNs/LSTMs, transformers, and recommender networks.

Further, it reduces latency through parallel processing of individual layers. Moreover, the high-productivity MetaWare MX Development Toolkit supports the TensorFlow and PyTorch frameworks and the ONNX exchange format.

For more information on Synopsys’s NPU IP, register for the session at the Embedded Vision Summit on May 19, where Synopsys will present on optimizing AI performance and power for deep learning neural network applications.


Saumitra Jagdale is a Backend Developer, Freelance Technical Author, Global AI Ambassador (SwissCognitive), Open-source Contributor to Python projects, Leader of TensorFlow Community India, and Passionate AI/ML Enthusiast.
