Instant replay for multicore systems

November 01, 2010

Multicore developers need a tool that automatically analyzes and graphically depicts real-time events.


Real-time systems must react quickly to external and internal demands. When a system uses a multicore architecture, the speed and number of interactions rise sharply. While this improves system performance, it complicates real-time sequencing of application events, given that multicore system events can occur simultaneously over multiple independent processors instead of sequentially over a single processor.

For the multicore developer, the larger number of events and their simultaneous nature make the system considerably harder to design and debug. Diagnosing the cause of a system failure or inefficiency is much more difficult on a multicore system than on a single-processor system. With few multicore-ready tools available, developers are left with primitive print-statement techniques that leave bread crumbs throughout the system’s operation, each recording data about an event that has occurred. The developer must gather the crumbs, make sense of them, and infer the system’s state, a process that often requires re-instrumenting the code for finer granularity and then repeating the whole exercise.

To efficiently unravel the intricate sequence of operations on a multicore system, developers need an instant replay enabling them to examine the system’s operations that immediately precede an area of interest. As shown in Figure 1, a new type of debugging tool shows exactly what is going on in a multicore system across a particular period of time. A graphical analysis of all system events is displayed across a single timescale organized by application thread and grouped by processor core.


Figure 1: TraceX offers a graphical view of real-time events in a multicore system. In this example, Core-0 and Core-1 are seen simultaneously executing different threads.




Traditional approach to system-event analysis

Real-time programmers have long understood the importance of system behavior to the functionality and performance of their applications. The conventional approach is to generate data on system behavior whenever the code reaches a certain point, for example by toggling an I/O pin, calling printf, setting a variable, or writing a value to a file.

Inserting such responses requires a substantial amount of time, especially considering that the instrumentation code often doesn’t work exactly as expected the first time around and must be debugged. Once that part of the application is verified, the instrumentation code needs to be removed, and its removal needs to be debugged. Most of the instrumentation process is manual and thus time-consuming and prone to additional errors.

Besides instrumenting the code, the developer also needs to find a way to interpret the data generated. The volume of information generated by the instrumentation code complicates the task of determining what system events took place in what sequence.

New approach offers advantages

In contrast to the conventional method, TraceX automatically analyzes and graphically depicts system and application events captured on the target system during runtime. Events such as thread context switches, preemptions, suspensions, terminations, and system interrupts each leave a bread crumb that the debugging tool recognizes and displays. These bread crumbs describe what event just happened, which thread was involved, which core that thread was running on, when it occurred, and other relevant information.

With this tool, the user can log any desired application events using an Application Programming Interface (API). Event information is stored (logged) in a circular buffer on the target system with buffer size determined by the application. A circular buffer enables the most recent “n” events to be stored at all times and available for inspection in the case of a system malfunction or other significant event.
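The circular buffer described above can be sketched in a few lines. This is a minimal illustration in Python, not the tool's actual buffer format or API: the `EventLog` class, its method names, and the event tuple layout are all assumptions made for the example.

```python
from collections import deque


class EventLog:
    """Fixed-size circular event log (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full,
        # so the log always holds the most recent `capacity` events.
        self._events = deque(maxlen=capacity)

    def record(self, timestamp, core, thread, event):
        self._events.append((timestamp, core, thread, event))

    def snapshot(self):
        """Return the buffered events, oldest first."""
        return list(self._events)


log = EventLog(capacity=3)
for t in range(5):
    log.record(t, core=0, thread="worker", event=f"evt{t}")

recent = log.snapshot()  # only the three most recent events remain
```

Here the application chooses `capacity`, trading target memory for the length of history available after a malfunction.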

Good multicore debugging tools allow event logging to be stopped and started dynamically by the application program at a specific time, such as when an area of interest is encountered. This avoids cluttering the database and consuming target memory when the system is performing correctly. The event log can be uploaded to the host for analysis when encountering a breakpoint or system crash or after the application has finished running.
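As a rough sketch of dynamic start and stop, the logger below records events only while it is enabled. The `GatedLog` class and its `start`, `stop`, and `upload` methods are hypothetical names chosen for this example and do not correspond to any particular tool's interface.

```python
from collections import deque


class GatedLog:
    """Event log that records only while enabled (illustrative names)."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.enabled = False

    def start(self):
        self.enabled = True

    def stop(self):
        self.enabled = False

    def record(self, timestamp, event):
        # Events arriving while logging is stopped are dropped, so they
        # neither clutter the log nor push out events of interest.
        if self.enabled:
            self._events.append((timestamp, event))

    def upload(self):
        """Stand-in for transferring the log to the host."""
        return list(self._events)


trace = GatedLog(capacity=8)
for t in range(10):
    if t == 4:
        trace.start()  # area of interest begins
    if t == 7:
        trace.stop()   # area of interest ends
    trace.record(t, "tick")

captured = trace.upload()
```

Only the events at times 4, 5, and 6 are kept, which is the window the application marked as interesting.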

Once the event log is uploaded from target memory to the host, the tool displays the events graphically on the horizontal axis, which represents time (refer to Figure 1). The various application threads and system routines related to events are listed along the vertical axis, and the events themselves appear in the appropriate row. For multicore systems, the events are linked to their respective processor core and grouped together so that developers can easily see all the events for a core.
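The grouping step can be pictured as follows. This sketch assumes each uploaded event is a `(timestamp, core, thread, name)` tuple, a layout invented for the example; it arranges events into one row per thread, with rows ordered by core, roughly as a timeline display would.

```python
from collections import defaultdict

# Hypothetical uploaded events: (timestamp, core, thread, event name)
events = [
    (10, 0, "thread_a", "resume"),
    (12, 1, "thread_b", "resume"),
    (15, 0, "thread_a", "mutex_get"),
    (18, 1, "thread_c", "resume"),
    (20, 0, "isr", "interrupt"),
]


def group_by_core_and_thread(events):
    """Return {(core, thread): [(timestamp, name), ...]} ordered by core."""
    rows = defaultdict(list)
    for timestamp, core, thread, name in sorted(events):
        rows[(core, thread)].append((timestamp, name))
    # Sorting the keys puts all rows for a core next to each other.
    return dict(sorted(rows.items()))


rows = group_by_core_and_thread(events)
for (core, thread), row in rows.items():
    print(f"Core-{core} {thread:<10}", row)
```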

All events are also presented in the top summary row, regardless of core or thread, giving developers a handy way to obtain a complete picture of system events without scrolling down through all threads and cores. Events are represented by color-coded icons located at the point of occurrence along the horizontal timeline as well as to the right of the relevant thread or system routine. The axes can be expanded to show more event detail or collapsed to show more events. The timescale can be panned left (back) or right (ahead) to show any point in the trace buffer. When an individual event is selected, as shown in Figure 2, detailed information is provided for that event, including the core, context, event, thread pointer, new state, stack pointer, and next thread pointer.


Figure 2: Individual event details can be displayed by clicking on an event icon.




Solving priority inversion problems

One of the most challenging real-time problems is priority inversion. Priority inversions arise because Real-Time Operating Systems (RTOSs) employ a priority-based preemptive scheduler to ensure the highest-priority thread that is ready to run actually runs. The scheduler can preempt a lower-priority thread in mid-execution to meet this objective.

Problems can occur when high- and low-priority threads share resources, such as a memory buffer. If the lower-priority thread is using the shared resource when the higher-priority thread is ready to run, the higher-priority thread must wait for the lower-priority thread to finish. If the higher-priority thread must meet a critical deadline, then the maximum time it might have to wait for all its shared resources must be calculated to determine its worst-case performance. Priority inversions occur when a high-priority thread is forced to wait while the CPU serves a lower-priority thread.
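One simple, pessimistic way to bound that wait is sketched below. It assumes that each shared resource is held by at most one lower-priority thread at the moment the high-priority thread requests it, and that the holder is not itself preempted while it holds the resource. The thread names, mutex names, and timings are hypothetical.

```python
# Hypothetical longest critical sections, in microseconds, of the
# lower-priority threads that use each shared resource.
critical_sections = {
    "buffer_mutex": {"low_thread": 120, "logger_thread": 80},
    "uart_mutex": {"low_thread": 40},
}


def worst_case_blocking(resources_needed, critical_sections):
    """Sum, over each needed resource, the longest lower-priority hold time."""
    return sum(max(critical_sections[r].values()) for r in resources_needed)


execution_time_us = 300  # assumed run time of the high-priority thread
blocking_us = worst_case_blocking(["buffer_mutex", "uart_mutex"], critical_sections)
worst_case_response_us = execution_time_us + blocking_us
```

With these numbers the thread could wait up to 160 microseconds, for a worst-case response of 460 microseconds. The bound no longer holds if a holder can be preempted by an unrelated thread while it holds the resource.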

Priority inversions are difficult to identify and correct. Their symptom is normally poor performance, but poor performance stems from many potential causes. Compounding the challenge, a priority inversion can evade testing because it may be non-deterministic, appearing only under particular timing conditions.

A system event tool like TraceX makes it possible to easily and automatically identify priority inversions. The trace buffer clearly identifies which thread is running at any point in time and records any change in a thread’s readiness. Thus, it is easy to go back in time to determine whether a higher-priority thread is ready to run but blocked by a lower-priority thread that holds a resource it needs. Figure 3 shows a non-deterministic priority inversion.


Figure 3: The higher-priority thread must wait for the lower-priority thread to release a mutex in a non-deterministic priority inversion.




As shown in this graphic, Low_thread holds a mutex when it is preempted by High_thread. High_thread then seeks the same mutex, but must wait for Low_thread to release it. However, Medium_thread has intervened and can run for an indeterminate length of time, delaying not only Low_thread, but also High_thread. Only when Medium_thread yields enough time to Low_thread for it to complete its processing and release the mutex can High_thread resume.
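The same scenario can be checked mechanically once a trace is available. The sketch below is one possible approach, not a description of how any particular tool works: it assumes a simplified trace of `(time, thread, kind, mutex)` records with invented event kinds, and a priority table in which a lower number means higher priority (some schedulers use the opposite convention).

```python
def find_inversions(trace, priority):
    """Flag times when a thread runs while a higher-priority thread is
    blocked on a mutex that some third thread holds."""
    owner = {}    # mutex -> thread currently holding it
    blocked = {}  # waiting thread -> mutex it is blocked on
    findings = []
    for time, thread, kind, mutex in trace:
        if kind == "mutex_get":
            owner[mutex] = thread
            blocked.pop(thread, None)
        elif kind == "mutex_block":
            blocked[thread] = mutex
        elif kind == "mutex_put":
            owner.pop(mutex, None)
        elif kind == "run":
            for waiter, wanted in blocked.items():
                holder = owner.get(wanted)
                # The holder running is the expected, bounded wait; any
                # other lower-priority thread running is the problem.
                if (holder is not None and thread != holder
                        and priority[thread] > priority[waiter]):
                    findings.append((time, waiter, thread, holder))
    return findings


priority = {"High_thread": 1, "Medium_thread": 2, "Low_thread": 3}
trace = [
    (0, "Low_thread", "run", None),
    (1, "Low_thread", "mutex_get", "m"),
    (2, "High_thread", "run", None),         # preempts Low_thread
    (3, "High_thread", "mutex_block", "m"),  # must wait for Low_thread
    (4, "Low_thread", "run", None),
    (5, "Medium_thread", "run", None),       # intervenes
    (9, "Low_thread", "run", None),
    (10, "Low_thread", "mutex_put", "m"),
    (11, "High_thread", "mutex_get", "m"),
    (12, "High_thread", "run", None),
]

inversions = find_inversions(trace, priority)
```

The single finding reports that at time 5 Medium_thread ran while High_thread was waiting on a mutex held by Low_thread.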

Improving application performance

While most developers use multicore-enabled tools to understand and correct problems, the benefits don’t end there. These tools offer an execution profile for analyzing and improving system-level application performance. Using an execution profile, developers see the amount of CPU time used by each thread and system services (see Figure 4). The developer can easily drill down on specific events for diagnostic purposes.


Figure 4: An execution profile shows the CPU time used by each thread.
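An execution profile of this kind can be derived from context-switch events alone. The following sketch assumes a single core and a list of `(time, thread)` switch records in arbitrary tick units; the function name and data are illustrative.

```python
from collections import defaultdict


def execution_profile(switches, end_time):
    """Return {thread: (ticks, percent)} from ordered context switches."""
    totals = defaultdict(int)
    # Pair each switch with the next one; the last interval ends at end_time.
    following = switches[1:] + [(end_time, None)]
    for (start, thread), (stop, _) in zip(switches, following):
        totals[thread] += stop - start
    span = end_time - switches[0][0]
    return {t: (ticks, 100.0 * ticks / span) for t, ticks in totals.items()}


switches = [(0, "thread_0"), (40, "thread_1"), (70, "idle"), (90, "thread_0")]
profile = execution_profile(switches, end_time=100)
```

In this example thread_0 accounts for 50 of the 100 ticks, thread_1 for 30, and the idle loop for 20.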




Even more relevant to multicore system operation, balancing the processing load across all available cores can achieve greater system throughput. If a system profile provides information about which cores have greater idle time, as shown in Figure 4, the developer gets a strong clue regarding how to shift processing to an otherwise idle core.
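As a small illustration of that clue, the sketch below compares idle time per core over a fixed measurement window. The busy-tick figures are invented, and the decision about which work to move remains with the developer.

```python
def idle_ticks(busy_ticks, window):
    """Return {core: idle ticks} given busy ticks per core over a window."""
    return {core: window - busy for core, busy in busy_ticks.items()}


busy = {"Core-0": 900, "Core-1": 350}  # hypothetical profile data
idle = idle_ticks(busy, window=1000)

most_idle = max(idle, key=idle.get)  # candidate to receive more work
busiest = min(idle, key=idle.get)    # candidate to shed work
```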

A multicore-enabled debugging tool paints a graphical picture of a system in a way that standard debuggers can’t provide. It gives developers a clear view of interrupts, context switches, and other system events typically detected through time-consuming code instrumentation and tedious examination of the resulting data. Consequently, developers can find and fix bugs and optimize application performance in substantially less time than is required using standard debugging tools alone. With debugging taking up to 70 percent of application development time, these tools can significantly improve products while shortening development.


John A. Carbone is VP of marketing for Express Logic. He has 35 years of experience in real-time computer systems and software, ranging from embedded system developer and FAE to sales and marketing management roles. Prior to joining Express Logic, John was VP of marketing for Green Hills Software. He holds a BS in Mathematics from Boston College.

Express Logic 858-613-6640