MLPerf Tiny Inference Benchmark Lays Foundation for TinyML Technology Evaluation, Commercialization

By Chad Cox

Production Editor

Embedded Computing Design

By Tiera Oliver

Associate Editor

Embedded Computing Design

By Brandon Lewis


Embedded Computing Design

July 02, 2021



The speed with which edge AI ecosystems like TinyML are evolving has made standardization difficult, let alone the creation of performance and resource utilization benchmarks that could simplify technology evaluation. Such benchmarks would be hugely beneficial to the ML industry, as they could help accelerate solution comparison, selection, and productization.

But standing in the way is the fundamentally distributed nature of the edge and the varied applications and systems that reside there, which means a benchmark of any value must account for:

  • Hardware heterogeneity, ranging from general-purpose MCUs and processors to the novel accelerators and emerging memory technologies that are commonplace in the TinyML ecosystem.
  • Software heterogeneity, which varies wildly across TinyML systems that often use their own inference stacks and deployment toolchains.
  • Cross-product support, as the heterogeneity mentioned above means that interchangeable components can be, and are being, used at every level of TinyML stacks.
  • Low power, profiled by measuring device/system power consumption and energy efficiency via a power analysis mechanism that considers factors like chip peripherals and any underlying firmware.
  • Limited memory, as different devices have different resource constraints, which in the case of edge AI is usually under a gigabyte.

In an effort to overcome these barriers, MLCommons, the organization behind the popular MLPerf family of AI training and inferencing benchmarks, recently released version 0.5 of the MLPerf Tiny benchmark. It’s an open-source, system-level inferencing benchmark designed to measure how quickly, accurately, and power-efficiently resource-constrained embedded technologies can execute trained neural networks of 100 kB or less.

Inside the MLPerf Tiny Edge Inferencing Benchmark

Developed in collaboration with EEMBC, the Embedded Microprocessor Benchmark Consortium, this iteration of MLPerf Tiny Inference consists of four separate tasks for measuring the latency and accuracy or power consumption of an ML technology:

  • Keyword Spotting (KWS) uses a neural network that detects keywords from a spectrogram
  • Visual Wake Words (VWW) is a binary image classification task for determining the presence of a person in an image
  • Tiny Image Classification (IC) is a small image classification benchmark with 10 classes
  • Anomaly Detection (AD) uses a neural network to identify abnormalities in machine operating sounds

These tasks are presented in four different scenarios that an edge device may encounter or be deployed in, namely single-stream queries, multiple-stream queries, server configuration, or offline mode. Each scenario requires approximately 60 seconds to complete, and some have latency constraints.

Figure 1. The MLPerf Tiny inferencing benchmark v0.5 presents each of the tasks in four different deployment scenarios. (Source: MLCommons)

This combination of tasks and scenarios makes it possible to analyze sensors, ML applications, ML datasets, ML models, training frameworks, graph formats, inference frameworks, libraries, operating systems, and hardware components. This is possible thanks to multi-layered test suites that look at the rationale, dataset, model, and quality targets (usually a measure of accuracy when executing the dataset and model).

Figure 2. The MLPerf Tiny inference benchmark test suite permits the evaluation of the end-to-end edge ML stack. (Source: MLCommons)

The test suite procedure is as follows:

  • Latency – The latency measurement is performed five times, with each run following these steps:
    1. Download the input stimulus.
    2. Load the tensor, converting the data as needed.
    3. Run the inference for a minimum of 10 seconds and at least 10 iterations.
    4. Measure the inferences per second (IPS).
    The median IPS of the five runs is reported as the latency score.
  • Energy – The energy test is identical to the latency test, but measures the total energy used during the compute timing window.
  • Accuracy – A single inference is performed on each input in the validation set, which varies depending on the model. The output tensor probabilities are then collected to calculate the percentage score.
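
The latency procedure above can be illustrated with a minimal Python sketch. It is not the official MLPerf Tiny test harness: the function names (measure_ips, latency_score, ips_to_latency_ms) and the run_inference callable are hypothetical, and the sketch assumes the input has already been downloaded and loaded, so only the timed inference loop and the median-of-five scoring are shown.

```python
import statistics
import time
from typing import Callable


def measure_ips(run_inference: Callable[[], None],
                min_seconds: float = 10.0,
                min_iterations: int = 10) -> float:
    """Repeat inference until both minimums are met; return inferences per second."""
    iterations = 0
    start = time.perf_counter()
    while True:
        run_inference()
        iterations += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds and iterations >= min_iterations:
            return iterations / elapsed


def latency_score(run_inference: Callable[[], None],
                  runs: int = 5,
                  min_seconds: float = 10.0,
                  min_iterations: int = 10) -> float:
    """Median IPS across `runs` independent measurements."""
    return statistics.median(
        measure_ips(run_inference, min_seconds, min_iterations)
        for _ in range(runs)
    )


def ips_to_latency_ms(ips: float) -> float:
    """Convert inferences per second to milliseconds per inference."""
    return 1000.0 / ips


if __name__ == "__main__":
    # Stand-in workload (hypothetical): a 1 ms sleep instead of a real model.
    # The timing window is shortened here so the demo finishes quickly.
    score = latency_score(lambda: time.sleep(0.001), min_seconds=0.05)
    print(f"median IPS: {score:.1f}, latency: {ips_to_latency_ms(score):.3f} ms")
```

Taking the median rather than the mean keeps a single slow or fast run from skewing the reported score. An energy test could reuse the same loop, recording total energy over the timing window instead of elapsed time.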

Modular, Open and Closed

Of course, there are also limitations around the MLPerf Tiny benchmark in the form of run rules that ensure components are analyzed accurately and reproducibly. The run rules are established via a modular benchmark design that addresses the end-to-end ML stack, as well as two divisions that permit different types of analysis.

  • Modular design allows hardware and software users to target specific components of the pipeline, like quantization, or complete solutions. Each benchmark within the MLPerf Tiny suite has a reference implementation that contains training scripts, a hardware platform, and more to provide a baseline result that a submitter can modify to show the performance of a single component.
  • Closed and Open divisions are stricter and more flexible, respectively, in the submissions they accept. The closed division offers a more direct comparison of systems, whereas the open division provides a broader scope that allows submitters to demonstrate performance, energy, and/or accuracy improvements in any stage of the ML pipeline. The open division also allows submitters to change the model, training scripts, and dataset.
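
As a rough illustration of how the two divisions differ, the following Python sketch decides which division a submission would fall into based on what it changes relative to the reference implementation. The component labels and the required_division function are hypothetical. The sketch assumes that changing the model, training scripts, or dataset is only permitted in the open division, and that any other change is acceptable in the closed division; the published run rules remain the authority on what each division actually allows.

```python
from typing import Iterable

# Components the open division lets submitters change, per the description above.
# The string labels themselves are illustrative, not taken from the run rules.
OPEN_ONLY_CHANGES = frozenset({"model", "training_scripts", "dataset"})


def required_division(changed_components: Iterable[str]) -> str:
    """Return "open" if any open-only component was changed, else "closed"."""
    changed = set(changed_components)
    return "open" if changed & OPEN_ONLY_CHANGES else "closed"


if __name__ == "__main__":
    # A submission that swaps only the (hypothetical) inference runtime.
    print(required_division(["inference_runtime"]))
    # A submission that also retrains a different model.
    print(required_division(["inference_runtime", "model"]))
```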

Figure 3. MLPerf Tiny’s two divisions provide a flexible way to test edge ML components against each other and a generic reference implementation. (Source: MLCommons)

The MLPerf Tiny inferencing benchmark rules are available on GitHub.

The first batch of submissions has already been published. It includes entries from Latent AI, Peng Cheng Laboratory, Syntiant and hls4ml, all of whom except hls4ml submitted to the Closed division.

In the Closed Division:

Figure 4. The MLPerf Tiny inferencing benchmark reference implementation is based on an STMicroelectronics Nucleo-L4R5ZI. (Source: MLCommons)

In the Open Division:

Measured on latency and energy consumption, these ML stack combinations ran the Visual Wake Word, Image Classification, Keyword Spotting, and Anomaly Detection workloads described in Table 1.


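  • Visual Wake Words
    • Dataset: Visual Wake Words Dataset
    • Model: MobileNetV1 (0.25x)
    • Quality target: 80% (top 1)
  • Image Classification
    • Quality target: 85% (top 1)
  • Keyword Spotting
    • Dataset: Google Speech Commands
    • Quality target: 90% (top 1)
  • Anomaly Detection
    • Dataset: ToyADMOS (ToyCar)
    • Model: FC AutoEncoder
    • Quality target: 0.85 (AUC)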

Table 1. Submitters to the MLPerf Tiny v0.5 inferencing benchmark put their solutions up against these workloads. (Source: MLCommons)
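
To make the quality targets in Table 1 concrete, here is a minimal Python sketch of the two metrics involved: top-1 accuracy for the classification workloads and area under the ROC curve (AUC) for anomaly detection. The function names and the task keys are hypothetical. The AUC is computed with the pairwise (Mann-Whitney) formulation, a standard method that is not necessarily how the official scripts compute it, and it assumes label 1 marks an anomalous sample.

```python
from typing import Dict, Sequence

# Quality targets from Table 1, expressed as fractions.
QUALITY_TARGETS: Dict[str, float] = {
    "VWW": 0.80,  # top-1 accuracy
    "IC": 0.85,   # top-1 accuracy
    "KWS": 0.90,  # top-1 accuracy
    "AD": 0.85,   # AUC
}


def top1_accuracy(probabilities: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> float:
    """Fraction of inputs whose highest-probability class matches the label."""
    correct = sum(
        1 for probs, label in zip(probabilities, labels)
        if max(range(len(probs)), key=probs.__getitem__) == label
    )
    return correct / len(labels)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the chance that an anomalous sample outscores a normal one.

    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_quality_target(task: str, value: float) -> bool:
    """Check a measured metric against the Table 1 target for `task`."""
    return value >= QUALITY_TARGETS[task]
```

For example, meets_quality_target("KWS", top1_accuracy(outputs, labels)) would report whether a keyword spotting model clears the 90% bar, where outputs and labels stand for the collected output tensors and ground-truth classes.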

Below are the results for each entrant:

  • Harvard (Reference)
    • Visual Wake Word Latency: 603.14 ms
    • Image Classification Latency: 704.23 ms
    • Keyword Spotting Latency: 181.92 ms
    • Anomaly Detection Latency: 10.40 ms
  • Latent AI LEIP Framework
    • Visual Wake Word Latency: 3.175 ms (avg)
    • Image Classification Latency: 1.19 ms (avg)
    • Keyword Spotting Latency: 0.405 ms (avg)
    • Anomaly Detection Latency: 0.18 ms (avg)
  • Peng Cheng Laboratory
    • Visual Wake Word Latency: 846.74 ms
    • Image Classification Latency: 1239.16 ms
    • Keyword Spotting Latency: 325.63 ms
    • Anomaly Detection Latency: 13.65 ms
  • Syntiant
    • Keyword Spotting Latency: 5.95 ms
  • hls4ml
    • Image Classification Latency: 7.9 ms
    • Image Classification Accuracy: 77%
    • Anomaly Detection Latency: 0.096 ms
    • Anomaly Detection Accuracy: 82%
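
One simple way to read these figures is as a ratio against the Harvard reference latency for the same workload. The Python sketch below does that arithmetic using only the latencies listed above; the dictionary layout and the speedup_vs_reference function are illustrative. The ratios should be read with care: the entries ran on different hardware, and hls4ml submitted to the open division with its own accuracy figures, so this is not a like-for-like ranking.

```python
from typing import Dict

# Latencies in milliseconds, copied from the results listed above.
REFERENCE_MS: Dict[str, float] = {
    "VWW": 603.14, "IC": 704.23, "KWS": 181.92, "AD": 10.40,
}

SUBMISSIONS_MS: Dict[str, Dict[str, float]] = {
    "Latent AI LEIP Framework": {"VWW": 3.175, "IC": 1.19, "KWS": 0.405, "AD": 0.18},
    "Peng Cheng Laboratory": {"VWW": 846.74, "IC": 1239.16, "KWS": 325.63, "AD": 13.65},
    "Syntiant": {"KWS": 5.95},
    "hls4ml": {"IC": 7.9, "AD": 0.096},
}


def speedup_vs_reference(submission_ms: Dict[str, float],
                         reference_ms: Dict[str, float] = REFERENCE_MS) -> Dict[str, float]:
    """Reference latency divided by submission latency, per workload.

    Values above 1 mean the submission was faster than the reference.
    """
    return {task: reference_ms[task] / ms for task, ms in submission_ms.items()}


if __name__ == "__main__":
    for name, results in SUBMISSIONS_MS.items():
        for task, ratio in speedup_vs_reference(results).items():
            print(f"{name:26s} {task:4s} {ratio:9.2f}x")
```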

Editor’s note: An expanded table containing the results can be found here:

New Classes of Edge AI

The MLPerf Tiny inferencing benchmark is a step in the right direction for the commercialization of edge AI technology and the new classes of applications it will bring. A product of collaboration between more than 50 organizations throughout industry and academia, the benchmark provides a fair measure of component and system-level ML technologies, with room to expand into other applications and higher-order benchmarks like MLPerf Inference Mobile, Edge, and Data Center.

For more information or to submit your results to the MLPerf Tiny inference benchmark, visit

Chad Cox, Production Editor for Embedded Computing Design, has responsibilities that include handling the news cycle, newsletters, social media, and advertising. Chad graduated from the University of Cincinnati with a B.A. in Cultural and Analytical Literature.


Tiera Oliver, Associate Editor for Embedded Computing Design, is responsible for web content edits, product news, and constructing stories. She also assists with newsletter updates and contributes to and edits content for ECD podcasts and the ECD YouTube channel. Before working at ECD, Tiera graduated from Northern Arizona University, where she received her B.S. in journalism and political science and worked as a news reporter for the university’s student-led newspaper, The Lumberjack.


Brandon is responsible for guiding content strategy, editorial direction, and community engagement across the Embedded Computing Design ecosystem. A 10-year veteran of the electronics media industry, he enjoys covering topics ranging from development kits to cybersecurity and tech business models. Brandon received a BA in English Literature from Arizona State University, where he graduated cum laude. He can be reached at [email protected].
