Software Considerations for Heterogeneous Arm Cores in Safety-Critical Applications

October 02, 2020


Embedded systems benefit from the use of multicore processors in terms of higher throughput and better size, weight, and power (SWaP). Processors with heterogeneous processor cores add the ability to match applications to the capabilities of each core type, further improving throughput and SWaP. These advantages come with increased complexity in the software architecture needed to maximize utilization of the processor cores. For real-time systems, particularly safety-critical systems, multicore processors create a significant challenge to tight determinism due to contention for resources shared among the processor cores. That challenge increases with heterogeneous cores, as the worst-case execution time can vary depending on which type of core the application executes on.

To explore this tradeoff in more detail, consider the heterogeneous cores in the NXP® i.MX 8QuadMax applications processor (Figure 1). With four Arm® Cortex®-A53 cores and two Cortex-A72 cores, the i.MX 8QuadMax enables power consumption optimization by matching the performance requirements of each application task to the performance capacities of the different cores. Compared to the A53 cores, the A72 cores provide roughly twice the performance but with higher power consumption.

(Figure 1: NXP i.MX 8 architecture)

To achieve the throughput and SWaP benefits of multicore solutions, the software architecture needs to support high utilization of the available processor cores. All multicore features must be supported, from enabling concurrent operation of cores (versus available cores being forced into an idle state or held in reset at startup) to providing a mechanism for deterministic load balancing. The more flexible the software multi-processing architecture, the more tools the system architect has to achieve high utilization.

Software Multi-Processing Architectures

Like multi-processor systems, the software architecture on multicore processors can be classified by the amount of sharing and coordination among cores. The simplest software architecture for a multicore-based system is Asymmetric Multi-Processing (AMP), where each core runs independently, each with its own OS or hypervisor/guest OS pair. Each core runs a different application with little or no meaningful coordination between the cores in terms of scheduling. This decoupling can result in underutilization due to lack of load balancing, difficulty mitigating shared resource contention, and the inability to perform coordinated activity across cores, such as that required for comprehensive built-in test.

The modern alternative to AMP is Symmetric Multi-Processing (SMP), where a single OS controls all the resources, including which application threads run on which cores. This architecture is easy to program because all cores access resources “symmetrically,” freeing the OS to assign any thread to any core. For a processor with heterogeneous cores, like the i.MX 8QuadMax, however, not knowing which type of core an application will run on can produce a wide range of execution times, which significantly degrades deterministic performance.

Directly addressing that issue, Bound Multi-Processing (BMP) is an enhanced and restricted form of SMP that statically binds an application’s tasks/threads to specific cores. That static binding allows the system architect to tightly control the concurrent operation of multiple cores.  
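The effect of static binding on the execution-time bound can be illustrated with a minimal sketch. The task name and timing numbers are hypothetical and are not measured data; the only figure taken from the discussion above is that an A72 core offers roughly twice the performance of an A53 core.

```python
# Hypothetical execution times (ms) on a Cortex-A72; not measured data.
A72_TIME_MS = {"flight_control": 2.0, "display": 5.0}

# Assumption from the text: an A53 takes roughly twice as long as an A72.
SLOWDOWN = {"A72": 1.0, "A53": 2.0}


def wcet_bound_ms(task, allowed_core_types):
    """Worst-case bound when the task may land on any allowed core type."""
    return max(A72_TIME_MS[task] * SLOWDOWN[c] for c in allowed_core_types)


def execution_time_spread_ms(task, allowed_core_types):
    """Gap between the slowest and fastest placement the OS could choose."""
    times = [A72_TIME_MS[task] * SLOWDOWN[c] for c in allowed_core_types]
    return max(times) - min(times)


# SMP: the OS may place the thread on either core type.
smp_spread = execution_time_spread_ms("flight_control", ["A72", "A53"])
# BMP: the architect binds the task to one core type.
bmp_spread = execution_time_spread_ms("flight_control", ["A72"])
```

Under these assumptions the SMP placement leaves a spread as large as the task's A72 execution time, while the bound placement removes the core-type component of the variation entirely. Interference from shared resources is a separate effect and is not modeled here.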

Ensuring Deterministic Behavior

In addition to achieving the throughput and SWaP goals for multicore processors, safety-critical systems need to maintain a predictable worst-case execution time (WCET) for each application. Using BMP to restrict the type of core paired with the application is an essential component of ensuring deterministic behavior in a heterogeneous system. The other techniques to ensure determinism are time and space partitioning as well as managing contention for shared resources.

In a single-core processor, multiple safety-critical applications may execute on the same processor by robustly partitioning the memory space between the hosted applications. Memory space partitioning dedicates a non-overlapping portion of memory to each application running at a given time, enforced by the processor’s memory management unit (MMU). Determinism can be enhanced further through the use of time partitioning, which divides a fixed time interval, called a major frame, into a sequence of fixed sub-intervals referred to as partition time windows. Each application is allocated one or more partition time windows, with the length and number of windows being driven by the application’s WCET and required repetition rate.
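A time-partitioned schedule of this kind can be sketched as a table of windows inside a repeating major frame. The partition names, the 100 ms frame, and the window lengths are hypothetical values chosen for illustration, and the helper functions are not taken from any particular RTOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    partition: str   # application (partition) that owns this window
    start_ms: int    # offset from the start of the major frame
    length_ms: int   # duration of the window


def validate(major_frame_ms, windows):
    """Check that windows do not overlap and fit inside the major frame."""
    end_of_previous = 0
    for w in sorted(windows, key=lambda w: w.start_ms):
        if w.length_ms <= 0 or w.start_ms < end_of_previous:
            return False
        end_of_previous = w.start_ms + w.length_ms
    return end_of_previous <= major_frame_ms


def active_partition(major_frame_ms, windows, t_ms):
    """Return the partition that owns time t_ms, or None if the slot is idle."""
    offset = t_ms % major_frame_ms  # the schedule repeats every major frame
    for w in windows:
        if w.start_ms <= offset < w.start_ms + w.length_ms:
            return w.partition
    return None


# Hypothetical 100 ms major frame; app_a needs a 50 ms repetition rate,
# so it receives two windows per frame.
MAJOR_FRAME_MS = 100
SCHEDULE = [
    Window("app_a", 0, 20),
    Window("app_b", 20, 30),
    Window("app_a", 50, 20),
    Window("app_c", 70, 25),
]
```

In this sketch each window length would be chosen to cover the owning application's WCET, and the number of windows per frame follows from its repetition rate, as described above.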

Multicore Interference Challenges Determinism

In a multicore environment, there can be multiple applications running concurrently across the different cores.  Those concurrent applications each need to access the processor’s resources. Each processing core has some dedicated resources, but most resources are shared among the processor cores, including memory controllers, I/O, shared cache, and the internal fabric that connects them. Contention for these shared resources results when multiple processor cores try to access the same resource concurrently. In safety-critical applications, such as avionics, the principal concern is how such shared resource contention can cause an application running on one core to interfere with an application running on another core, negatively affecting determinism, quality of service, and, ultimately, safety.

The effects of shared resource contention can be significant if left unmitigated. Examining just one of the shared resources, DDR memory, one might guess that the WCET could double when one other core is trying to access the same memory and both cores are running memory-constrained applications.  In reality, the WCET can increase by 8x instead of just 2x due to non-linear behaviors in the shared resource arbitration and scheduling algorithms. Additional cores attempting to access DDR memory or contending for other resources, such as the on-chip interconnect, can cause the WCET to grow even more significantly (Figure 2).

(Figure 2: Multicore interference increases faster than the number of cores.)


Multicore Interference Mitigation

One approach to mitigating multicore interference is to hand-schedule applications to minimize resource contention. Such an approach will not eliminate all the interference, and all the applications will need to be retested and validated any time any single application is modified or a new one is added. Another approach is to schedule only a single multi-tasking application to run at a time. Interference will still happen among the tasks, but there will be no interference with other applications. Such an approach is particularly ineffective on a processor with heterogeneous cores because of the variation in execution time on different core types.

A more general approach is to have the OS manage shared resource contention.  In the same way that the OS uses the hardware MMU to implement space partitioning by allocating different memory regions to different applications, the OS can allocate bandwidth to shared resources on a per-core basis. Addressing multicore interference in the OS provides the system integrator with an effective, flexible, and agile solution. It also simplifies the addition of new applications without significant changes to the system architecture and reduces re-verification activities.
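One generic way to picture per-core bandwidth allocation is a budget that is replenished every regulation period. The sketch below is a simplified software model under that assumption; the class, its method names, the core names, and the numbers are all hypothetical, and it does not describe how any specific RTOS or hardware mechanism enforces its allocations.

```python
class BandwidthRegulator:
    """Per-core access budget that is replenished every regulation period."""

    def __init__(self, total_accesses_per_period, share_by_core):
        if sum(share_by_core.values()) > 1.0:
            raise ValueError("shares must not exceed 100% of the bandwidth")
        self.budget = {
            core: int(total_accesses_per_period * share)
            for core, share in share_by_core.items()
        }
        self.used = {core: 0 for core in share_by_core}

    def request(self, core, accesses):
        """Grant as many accesses as the core's remaining budget allows."""
        granted = min(accesses, self.budget[core] - self.used[core])
        self.used[core] += granted
        return granted  # a shortfall means the core stalls until the next period

    def new_period(self):
        """Replenish every core's budget at the start of a regulation period."""
        for core in self.used:
            self.used[core] = 0


# Hypothetical allocation: the critical application's core gets half the bandwidth.
regulator = BandwidthRegulator(
    1000, {"core0": 0.5, "core1": 0.25, "core2": 0.25}
)
```

With this allocation, a greedy request from core1 is capped at its own share, so core0 can still obtain its full budget in the same period regardless of what the other cores attempt.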

Example Solution for Heterogeneous Cores in Avionics

The NXP i.MX 8QuadMax applications processor includes four Arm Cortex-A53 cores that share a 1 MB L2 cache and two Arm Cortex-A72 cores that share another 1 MB L2 cache. The processor also includes two Cortex-M4F cores for offloading system functions and two GPUs capable of running OpenCL, Vulkan, and OpenVX vision acceleration. One unique feature of the i.MX 8 is hardware resource partitioning, where the system controller commits peripherals and memory regions into specific customer-defined domains. Any communication between domains is forced to use messaging protocols running through hardware messaging units. The i.MX 8QuadMax targets a wide range of applications, including Industrial HMI (Human Machine Interface) and Control, electronics cockpit (eCockpit), heads-up displays, building automation, and single-board computers.

Green Hills Software’s INTEGRITY®-178 tuMP™ multicore RTOS is a unified operating system that runs across all 64-bit processor cores in the i.MX 8 and supports simultaneous combinations of AMP, SMP, and BMP. The RTOS’s Time-variant Unified Multi-Processing (tuMP) approach provides maximum flexibility for porting, extending, and optimizing safety-critical and security-critical applications to a multicore architecture. INTEGRITY-178 tuMP uses a time-partitioned kernel running across all cores that allows applications to be bound to a core or groups of cores called affinity groups. If required, each task of an application within an affinity group can be restricted further to run on a specific core. For the i.MX 8QuadMax processor, system architects can use affinity groups to ensure that the tasks of a given application execute only on the Cortex-A72 cores or only on the Cortex-A53 cores (Figure 3).

(Figure 3: Using Affinity Groups, one application is bound to the two Cortex-A72 cores while two other applications are bound to the sets of Cortex-A53 cores.)
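The arrangement in Figure 3 can be modeled as a small configuration table. This is only an illustrative sketch: the core numbering, application names, and helper functions are hypothetical and do not represent the INTEGRITY-178 tuMP configuration interface.

```python
# Hypothetical core numbering: four Cortex-A53 cores, then two Cortex-A72 cores.
CORE_TYPE = {0: "A53", 1: "A53", 2: "A53", 3: "A53", 4: "A72", 5: "A72"}

# Hypothetical affinity groups mirroring Figure 3.
AFFINITY_GROUPS = {
    "app_1": {4, 5},   # both A72 cores
    "app_2": {0, 1},   # first pair of A53 cores
    "app_3": {2, 3},   # second pair of A53 cores
}


def core_types(group):
    """Set of core types that an affinity group spans."""
    return {CORE_TYPE[core] for core in group}


def is_homogeneous(group):
    """True when every core in the group is the same type."""
    return len(core_types(group)) == 1


def pin_task(app, core):
    """Restrict one task to a single core, which must be in the app's group."""
    if core not in AFFINITY_GROUPS[app]:
        raise ValueError(f"core {core} is outside the affinity group of {app}")
    return {core}
```

Checking that each group spans a single core type is one way a system architect might confirm, at configuration time, that an application's tasks never migrate between A72 and A53 cores.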

Directly addressing multicore interference, INTEGRITY-178 tuMP includes a Bandwidth Allocation and Monitoring (BAM) capability developed to the strictest safety levels. BAM functionality monitors and enforces the bandwidth allocation to shared resources from each processor core. BAM emulates a high-rate hardware-based approach to ensure continuous allocation enforcement of each core’s use of shared resources. BAM regulates the bandwidth smoothly throughout the application’s execution time window, thereby allowing other applications in the same execution time window to acquire their allocated portion of the shared resources. Using the previous example of memory access interference, allocating 50% of the memory bandwidth to a high-criticality application results in a near-constant WCET even as the number of interfering cores increases, and an 8x lower WCET when there are multiple interfering cores (Figure 4). This capability effectively mitigates multicore interference and greatly lowers integration and certification risks while also enabling integrators to gain the maximum performance advantages of multicore processors.

(Figure 4: After allotting 50% of the shared resource bandwidth to the critical application using BAM, the WCET is nearly constant and greatly reduced.)

The NXP i.MX 8QuadMax presents a significant opportunity for optimizing SWaP in avionics and other embedded real-time systems. The combination of Cortex-A72 and Cortex-A53 cores gives the system architect the ability to emphasize performance or power efficiency to create the optimal system-level solution. The corresponding software architecture needs to have the flexibility and control to fully use those heterogeneous application cores while preserving tight determinism. Together, the capability to use Affinity Groups or some other form of BMP and a solution for multicore interference mitigation, such as BAM, enable effective use of the i.MX 8QuadMax in safety-critical applications.