More cores, less waiting
November 01, 2009
Optimizing multicore devices will be one of the biggest challenges facing developers in the future.
The majority of embedded systems developers today are accustomed to working with single-core processors. As the capabilities of end products increase, there is a rising demand for the increased processing power, lower cost, and robust power performance that multicore processors offer.
The multicore story is really about system integration. The increase in the number of cores has been mirrored by the increased integration of peripherals, bus fabrics, and multi-level memories. In general, there are two groups of multicore processors differentiated by their architectures: homogeneous multicore processors and heterogeneous multicore processors.
A homogeneous multicore processor has two or more identical programmable cores that share peripherals and memory. A heterogeneous processor features multiple unique processing elements, each tailored to a specific function, and each core might have only selective access to peripherals and memory. Both types of multicore processors offer potentially dramatic performance increases. Architects can tailor the design based on the needs of the end application and achieve a balance between performance, flexibility, and power consumption.
Developers working with either homogeneous or heterogeneous multicore systems will generally face two classes of problems: synchronization and timing between processing elements, and understanding performance.
Each processing element in a multicore device executes a portion of the device’s functionality. Coupled with improved coding, packaging, and integration practices, traditional debug techniques such as synchronous run, step, and halt are effective in multicore systems. However, when threads interact, timing- or synchronization-related problems often arise.
Multicore devices always have dependencies between processing elements. Often, one processing element has to wait on the results of another, and if they are out of sync, the handoffs between the two threads will be incorrect or inefficient. More subtle issues can also result. For example, in a frame-oriented processing scheme, a new frame could be missed because processing of the previous frame was not completed in time. While this might not cause an application to crash, it could degrade the quality of the application’s processing results. These issues are very hard to isolate because they result from the interaction of cores, bus fabrics, and shared peripherals. Chip-level tools with insight into the relationship between processing elements and peripherals are key to optimizing and debugging the multicore system (see Figure 1).
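The missed-frame scenario can be illustrated with a small timing model. The following is only a sketch under stated assumptions: the function name find_missed_frames, the single-buffer handoff, and the microsecond figures are all hypothetical and are not taken from any particular device or tool.

```python
def find_missed_frames(frame_period_us, processing_times_us):
    """Return the indices of frames lost because the consumer was still busy.

    Assumptions (hypothetical model): frames arrive at a fixed period, there
    is a single hand-off buffer, and a frame that arrives while the consuming
    element is still working on an earlier frame is dropped.
    """
    missed = []
    busy_until_us = 0  # time at which the consumer finishes its current frame
    for index, cost_us in enumerate(processing_times_us):
        arrival_us = index * frame_period_us
        if arrival_us < busy_until_us:
            missed.append(index)  # consumer not ready: the frame is lost
            continue
        busy_until_us = arrival_us + cost_us
    return missed


# Made-up per-frame processing costs against a 100 us frame period.
costs_us = [80, 120, 90, 90, 250, 50, 50, 50]
print(find_missed_frames(100, costs_us))  # [2, 5, 6]
```

Nothing crashes in this model. The output stream simply skips frames 2, 5, and 6, which is the kind of quality degradation that is hard to spot without timing visibility across processing elements.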
In embedded systems, a stalled processing element not only reduces performance, but also wastes power. To maximize performance and decrease power consumption, developers must minimize the amount of time each processing element spends waiting once its data is ready. When idle, the processing element can be powered down.
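A rough energy estimate shows why stall time matters. This sketch assumes a stalled element draws the same power as a busy one, while an idle element is powered down to a much lower level. The function name energy_mj and all power and time figures are hypothetical.

```python
def energy_mj(busy_ms, stalled_ms, idle_ms, active_mw, sleep_mw):
    """Estimate energy in millijoules for one processing element.

    Assumption (hypothetical model): a stalled element burns active power,
    and only a truly idle element can be powered down to sleep power.
    """
    microjoules = (busy_ms + stalled_ms) * active_mw + idle_ms * sleep_mw
    return microjoules / 1000  # mW * ms = uJ, so divide by 1000 for mJ


# Same 100 ms window and same useful work (60 ms busy), made-up power figures.
with_stalls = energy_mj(busy_ms=60, stalled_ms=30, idle_ms=10, active_mw=200, sleep_mw=5)
without_stalls = energy_mj(busy_ms=60, stalled_ms=0, idle_ms=40, active_mw=200, sleep_mw=5)
print(with_stalls, without_stalls)
```

Under these assumed numbers, converting 30 ms of stall into powered-down idle time lowers the estimate from about 18 mJ to about 12 mJ for the same amount of work.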
Performance at the element level is critical to understanding how to partition threads among the cores in a multicore device. Measured performance at key interfaces further aids developers in identifying system bottlenecks. This will help determine whether more can be done to optimize the use of the cores, bus fabrics, and peripherals. With this knowledge, opportunities to lower power consumption become apparent.
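As a sketch of how element-level measurements might be turned into a list of optimization candidates, the example below ranks processing elements by the fraction of cycles they spend stalled. The ElementSample type, the rank_by_stall function, the element names, and the cycle counts are all hypothetical. Real counters would come from whatever chip-level tooling the device provides.

```python
from dataclasses import dataclass


@dataclass
class ElementSample:
    """Cycle counts for one processing element over a measurement window."""
    name: str
    busy_cycles: int
    stall_cycles: int

    @property
    def stall_fraction(self):
        total = self.busy_cycles + self.stall_cycles
        return self.stall_cycles / total if total else 0.0


def rank_by_stall(samples):
    """Return the samples sorted so the most-stalled element comes first."""
    return sorted(samples, key=lambda s: s.stall_fraction, reverse=True)


# Made-up measurements for three elements of a heterogeneous device.
samples = [
    ElementSample("core0", busy_cycles=900, stall_cycles=100),
    ElementSample("dsp", busy_cycles=400, stall_cycles=600),
    ElementSample("dma", busy_cycles=700, stall_cycles=300),
]
for sample in rank_by_stall(samples):
    print(f"{sample.name}: {sample.stall_fraction:.0%} stalled")
```

The most-stalled element is often the victim rather than the cause. A high stall fraction usually points to an upstream element, bus fabric, or peripheral that is not delivering data in time, and that is where repartitioning threads is likely to pay off.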
Optimizing multicore devices will be one of the biggest challenges facing developers in the future. The ability to expose a multicore device’s tremendous processing potential will require insight into the synchronization and timing between processing elements and peripherals, as well as an understanding of the performance of processing elements and key interfaces. New multicore processors will also require robust chip-level tools that allow developers to exploit their potential performance.