Designing Safety Critical Embedded Systems: The Challenges of Detecting Faults in SRAM at Run-Time
June 24, 2022
When designing safety critical systems, the international safety standards require us to select appropriate processes and adequate techniques to detect and avoid dangerous faults in the end product. The standards help ensure that we do not fall into the same pits as the safety engineers before us.
However, the danger of the standards is that they assume that you have detailed knowledge about the underlying hardware, say a microcontroller, which may cause less experienced safety engineers to implement unsafe designs. As an example, the IEC (International Electrotechnical Commission) 60730 standard recommends the use of a checkerboard memory test to detect DC faults in variable memories for Class B software, which is more challenging than it may seem.
This article describes how the undocumented difference between the logical and physical layout of SRAM can cause us to inadvertently implement memory tests such as the checkerboard algorithm incorrectly. The necessary information is typically not available in the datasheet of standard microcontrollers, but fortunately, there are memory test algorithms which are not influenced by the difference between the logical and physical layout of the SRAM.
Testing SRAM for Defects at Run-Time
SRAM memories are obviously tested in production by the vendor of the IC and products with defects are not shipped to consumers. Still, random hardware defects can, and will, appear during the lifetime of the IC, which is one of the reasons that it is required to test the hardware in a microcontroller at run-time in safety critical applications.
The Checkerboard Memory Test
Safety standards such as the IEC 60730 (H.126.96.36.199) suggest that a checkerboard algorithm may be used to identify certain defects (DC faults) in SRAM for applications that must comply with the Class B safety level. The checkerboard test is often selected because it covers the most likely faults in an SRAM and is relatively fast, which helps minimize the performance impact on the application itself. In addition to DC faults, where a bit is permanently stuck high or low, the checkerboard algorithm can also detect defects where neighboring bits affect each other.
Logically, an SRAM consists of a number of bits organized in words. The words are typically 8, 16 or 32 bits wide, but can be longer as well. Physically, the bits are organized in arrays, where each bit typically has eight neighboring bits (see Figure 1). A physical defect can affect a single bit so that it is stuck high or low (DC fault), or the defect can be in the separation of two bits, in which case a neighboring aggressor cell (marked in purple in Figure 1) may influence a victim cell (marked in yellow in Figure 1). The aggressor-victim scenario is commonly referred to as a coupling fault. Statistically, the DC fault is more likely to occur, but it is still relevant to detect the most likely coupling faults.
Figure 1 - Potential coupling faults between neighboring bits.
If a fault affects a single bit so that it is stuck high or low, it can be revealed by writing the value one, verifying it by reading it back, then writing the value zero and verifying that by reading it back, as illustrated in Figure 1. If, on the other hand, the defect is a coupling fault between two neighboring bits, say bit columns 9 and 10 in row 2, certain patterns, such as all ones or all zeros, will not reveal the coupling fault because both cells hold the same value during the test.
A checkerboard pattern can reveal such coupling faults, as the neighboring cells (to the sides, above and below) hold opposite binary values. Figure 1 (lower right) illustrates that the one in bit 10 has contaminated bit 9, and the coupling fault is revealed since bit 9 does not hold the expected value, zero.
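The aggressor and victim scenario described above can be reproduced in a small host-side model. The following is a minimal sketch, not a model of any real device: the 16-cell row, the bit positions and the names faulty_row, row_write, row_read and pattern_passes are all assumptions made for illustration. It models a row in which a one stored in bit 10 leaks into bit 9.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fault model: one physical row of 16 cells in which a 1
   stored in the aggressor cell (bit 10) leaks into the victim (bit 9). */
#define AGGRESSOR_BIT 10u
#define VICTIM_BIT     9u

static uint16_t faulty_row;

void row_write(uint16_t value)
{
    faulty_row = value;
}

uint16_t row_read(void)
{
    uint16_t value = faulty_row;

    /* The coupling fault only shows when the aggressor holds a 1. */
    if (value & (1u << AGGRESSOR_BIT)) {
        value |= (uint16_t)(1u << VICTIM_BIT);
    }
    return value;
}

/* Write a pattern and report whether it reads back unchanged. */
bool pattern_passes(uint16_t pattern)
{
    row_write(pattern);
    return row_read() == pattern;
}
```

In this model the uniform patterns 0x0000 and 0xFFFF read back correctly, and so does 0xAAAA, where the aggressor holds a zero. Only 0x5555, which places a one in bit 10 next to a zero in bit 9, exposes the fault, which is why both the pattern and its inverse are needed.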
Physical vs Logical Layout of SRAM
The checkerboard algorithm only works if we know which bits are physical neighbors. This turns out to be a problem, as datasheets normally describe only the logical layout of the SRAM and not how the SRAM is physically organized.
To understand the physical layout of SRAM, one must differentiate between bit-oriented memories (BOM), in which one bit is accessed at a time, and word-oriented memories (WOM), in which an n-bit word is read or written at a time. While most real-world memories are implemented as WOM, the classic memory testing algorithms in scientific literature often assume BOM implementations.
For WOM memories, there are three main categories of physical organization of the bits constituting the word: adjacent, interleaved, and sub-arrays. While a logical layout places each word below the previous word in the same column (address space-like), adjacent memories place the words next to each other in the same row, as shown in Figure 2. Interleaved architectures distribute the bits of a word across different columns and rows of the SRAM array. Finally, the sub-array organization places each bit of a word in a different, physically separate block of the SRAM. The reality is that you do not know the physical layout, which is required to implement a checkerboard test correctly.
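To make the difference between the adjacent and interleaved organizations concrete, the sketch below maps a logical address and bit number to a physical row and column. The geometry (8-bit words, four words per row) and the names cell_pos_t, adjacent_pos and interleaved_pos are assumptions for illustration only. The interleaving scheme shown is just one simplified possibility, and real devices may scramble rows as well.

```c
#include <stdint.h>

/* Physical position of one bit cell in the SRAM array. */
typedef struct {
    uint32_t row;
    uint32_t column;
} cell_pos_t;

/* Hypothetical geometry: 8-bit words, 4 words per physical row. */
#define WORD_BITS      8u
#define WORDS_PER_ROW  4u

/* Adjacent layout: the bits of a word sit side by side, and consecutive
   words follow each other along the same row. */
cell_pos_t adjacent_pos(uint32_t address, uint32_t bit)
{
    cell_pos_t pos;
    pos.row    = address / WORDS_PER_ROW;
    pos.column = (address % WORDS_PER_ROW) * WORD_BITS + bit;
    return pos;
}

/* Interleaved layout (one simplified scheme): bit b of every word in a
   row is grouped together, so the bits of a single word end up
   WORDS_PER_ROW columns apart. */
cell_pos_t interleaved_pos(uint32_t address, uint32_t bit)
{
    cell_pos_t pos;
    pos.row    = address / WORDS_PER_ROW;
    pos.column = bit * WORDS_PER_ROW + (address % WORDS_PER_ROW);
    return pos;
}
```

Under these assumptions, bits 0 and 1 of the same word are physical neighbors in the adjacent layout but four columns apart in the interleaved layout, where bit 0 of address 0 instead sits next to bit 0 of address 1.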
Figure 2 - Examples of physical layout of word-oriented memories.
Properties and Shortcomings of the Checkerboard Test
The presumably straightforward approach for implementing a checkerboard algorithm is to alternately write the value 0xAA (assuming 8-bit data words) to the first address and 0x55 to the next address until all addresses under test have been filled with the checkerboard pattern of ones and zeros. The pattern is then verified to detect any DC or coupling faults between neighboring cells. The process is then repeated using the inverse pattern. As already indicated, there is a catch: a checkerboard pattern in the logical layout of the memory may not be a checkerboard pattern in the underlying physical layout, as shown in Figure 3.
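The straightforward approach described above could look roughly like the following sketch. The function names checkerboard_pass and checkerboard_test are hypothetical, and the code deliberately implements only the logical checkerboard, so it inherits the limitation discussed in this article. It is also destructive: on a real target the block contents would have to be saved and restored, and interrupts or other bus masters kept away from the block while it runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the block with alternating patterns, then verify them.
   Even addresses get even_pattern, odd addresses get its inverse. */
bool checkerboard_pass(volatile uint8_t *block, size_t length,
                       uint8_t even_pattern)
{
    const uint8_t odd_pattern = (uint8_t)~even_pattern;
    size_t i;

    for (i = 0; i < length; i++) {
        block[i] = (i % 2u == 0u) ? even_pattern : odd_pattern;
    }
    for (i = 0; i < length; i++) {
        const uint8_t expected = (i % 2u == 0u) ? even_pattern : odd_pattern;
        if (block[i] != expected) {
            return false;   /* DC or coupling fault detected */
        }
    }
    return true;
}

/* Run the pattern and then its inverse over the same block. */
bool checkerboard_test(volatile uint8_t *block, size_t length)
{
    return checkerboard_pass(block, length, 0xAAu) &&
           checkerboard_pass(block, length, 0x55u);
}
```

The volatile qualifier is there so that the compiler does not optimize away the read-back. Whether additional measures such as cache or write-buffer handling are needed depends on the device.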
Figure 3 - Data pattern of the logical vs. physical SRAM.
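The mismatch between the logical and physical pattern can be checked with a small host-side model. This sketch assumes a hypothetical adjacent layout with a configurable number of 8-bit words per physical row. The names cell_value and is_physical_checkerboard are illustrative and not taken from any vendor documentation.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORD_BITS 8u

/* Value of the physical cell at (row, column) after the logical fill
   0xAA, 0x55, 0xAA, ... assuming an adjacent layout with
   words_per_row words placed side by side in each row. */
unsigned cell_value(unsigned row, unsigned column, unsigned words_per_row)
{
    unsigned address = row * words_per_row + column / WORD_BITS;
    uint8_t word = (address % 2u == 0u) ? 0xAAu : 0x55u;

    return (word >> (column % WORD_BITS)) & 1u;
}

/* True if every horizontally and vertically adjacent pair of cells
   holds opposite values, i.e. the physical pattern is a checkerboard. */
bool is_physical_checkerboard(unsigned rows, unsigned words_per_row)
{
    unsigned columns = words_per_row * WORD_BITS;
    unsigned r, c;

    for (r = 0; r < rows; r++) {
        for (c = 0; c < columns; c++) {
            unsigned v = cell_value(r, c, words_per_row);

            if (c + 1u < columns && cell_value(r, c + 1u, words_per_row) == v) {
                return false;
            }
            if (r + 1u < rows && cell_value(r + 1u, c, words_per_row) == v) {
                return false;
            }
        }
    }
    return true;
}
```

With one word per row, which corresponds to the logical view, the fill is a true checkerboard. With four words per row, bit 7 of a 0xAA word and bit 0 of the following 0x55 word are both one, and each row starts with the same word as the row above it, so the physical pattern is no longer a checkerboard.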
It may seem obvious to compensate for the difference between the logical and physical layout, but the necessary information is rarely available in the datasheet of the device. So, what do you do? Accept the lower coverage, since the diagnostic will still cover DC faults and some coupling faults between neighboring bits? Request the layout from the IC vendor and make a custom implementation of the checkerboard test for each device? Or select another algorithm?
Now that you are aware of the potential shortcoming of the checkerboard test, you can make an informed decision.
Alternative Algorithms for Run-Time Testing of SRAM
The memory testing techniques proposed in IEC 60730 for the Class C safety level have higher fault detection coverage, but these algorithms fall into what can be considered production test algorithms: they take longer to run and also detect rarer fault types, but they will typically destroy the data stored in the SRAM because they operate on the entire SRAM rather than on sub-blocks.
In general, an embedded design does not tolerate losing its SRAM contents during operation. We therefore propose that you consider hybrid March algorithms adapted from the production test March algorithms: these algorithms are available in WOM-optimized implementations and provide high test coverage. Further, these hybrid March algorithms can be implemented so that they run on smaller overlapping sections of the SRAM, which avoids wiping all the data in the SRAM at once and means that a reboot of the embedded system can be avoided. The drawback of the March algorithms is that they are more computationally heavy than the traditional checkerboard algorithms, but that is an expense that may be required in safety critical systems.
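As an illustration of a block-wise March test, the sketch below runs the March C- element sequence over a small block and restores the original contents afterwards. It is a simplified sketch under stated assumptions and not the Microchip implementation: it uses a single data background (0x00/0xFF), whereas WOM-optimized variants add further backgrounds to cover coupling between bits of the same word. The names march_c_minus, march_test_block and MARCH_MAX_BLOCK are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MARCH_MAX_BLOCK 32u   /* hypothetical upper limit on block size */

/* March C- with a single data background:
   any(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); any(r0) */
bool march_c_minus(volatile uint8_t *mem, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {                 /* any(w0) */
        mem[i] = 0x00u;
    }
    for (i = 0; i < n; i++) {                 /* up(r0,w1) */
        if (mem[i] != 0x00u) return false;
        mem[i] = 0xFFu;
    }
    for (i = 0; i < n; i++) {                 /* up(r1,w0) */
        if (mem[i] != 0xFFu) return false;
        mem[i] = 0x00u;
    }
    for (i = n; i-- > 0u; ) {                 /* down(r0,w1) */
        if (mem[i] != 0x00u) return false;
        mem[i] = 0xFFu;
    }
    for (i = n; i-- > 0u; ) {                 /* down(r1,w0) */
        if (mem[i] != 0xFFu) return false;
        mem[i] = 0x00u;
    }
    for (i = 0; i < n; i++) {                 /* any(r0) */
        if (mem[i] != 0x00u) return false;
    }
    return true;
}

/* Test one block and restore its contents afterwards. The backup
   buffer is assumed to live in memory that has already been tested. */
bool march_test_block(volatile uint8_t *block, size_t n)
{
    static uint8_t backup[MARCH_MAX_BLOCK];
    size_t i;
    bool ok;

    if (n > MARCH_MAX_BLOCK) {
        return false;                         /* block too large to back up */
    }
    for (i = 0; i < n; i++) {
        backup[i] = block[i];
    }
    ok = march_c_minus(block, n);
    for (i = 0; i < n; i++) {
        block[i] = backup[i];
    }
    return ok;
}
```

A run-time scheduler would call march_test_block on successive blocks, typically with interrupts disabled for the duration of each call. Overlapping consecutive blocks by at least one word is one way to also exercise coupling across block boundaries.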
If you consider swapping a traditional checkerboard test for a March test, you can find such implementations from some microcontroller vendors. Microchip is one of the companies that offer a performance-optimized implementation of a March C- algorithm as part of their software diagnostic libraries. The Microchip implementation supports testing of the entire SRAM, normally done at start-up only to get maximum test coverage, and also testing of smaller memory blocks, intended to reduce the real-time impact on the application. The implementation can be downloaded for free from Microchip’s website as part of the IEC 60730 Class B library. The implementation is for PIC® and AVR® microcontrollers but can be ported to other Microchip MCUs.
For more information about IEC 60730 Class B tests: https://www.microchip.com/PIC-AVR-IEC60730.
Henrik Nyholm: Safety software engineer in the PIC and AVR applications group, responsible for developing products and software for safety critical systems targeting ISO 26262 and IEC 60730.
Jacob Lunn Lassen: Technical business development manager for safety critical systems, responsible for the market strategy and projects targeting ISO 26262, IEC 61508 and IEC 60730 with Microchip’s PIC and AVR microcontrollers.