Guidelines for Reliable RTOS Usage

By Ralph Moore

President

Micro Digital

March 13, 2025


There is a great deal of information about how to write reliable code, but very little about how to pick a reliable RTOS and how to use it reliably. An erroneous notion has flourished for decades: since all RTOSs are supposedly equivalent, the simplest one will do. While this might be acceptable when the alternative is bare metal, it is not acceptable for the complex embedded and IoT systems being developed today.

As more layers of software are placed on top of an RTOS, it is logical for people to assume that the underlying RTOS is as well engineered as a desktop OS. This is seldom the case. It is important to pick an RTOS that behaves as you expect it to and then to use it properly. Doing so will produce more rugged code and result in fewer problems. The following are some guidelines based upon my decades of RTOS experience:

  1. Never use binary semaphores for mutual exclusion. There are two problems with doing so:
    • If a task tests a semaphore twice, it will hang up. This can happen, for example, if function A tests the semaphore then calls function B, which also tests the semaphore. If function B is only called under unusual circumstances, this problem might escape testing and show up later in the field. Mutexes are designed to allow the same task to safely test them multiple times and therefore should be used instead.
    • No priority promotion. This means that there could be unbounded priority inversion for a task waiting at a semaphore, which could cause it to miss its deadline. This also might be a rare occurrence that escapes testing and shows up later in the field.
  2. Use finite timeouts on all waits. Most RTOSs provide timeouts for all or nearly all wait conditions. When it is not clear what the timeout should be, use a default timeout measured in seconds or minutes, but do not use infinite timeouts. Expected events do fail to happen in the real world; an interrupt, for example, may simply never occur. If the waiting task times out (even if minutes have elapsed), recovery is still possible. Otherwise, recovery is not possible, and the hung-up task may eventually cause the system to fail. When that happens, it can be difficult to figure out why the system failed.
  3. Use binary event semaphores for uncounted events. There are cases where the number of events that have occurred is not important. For example, a UART with an internal buffer may interrupt for each character received and the associated ISR may signal a semaphore each time it runs. A binary semaphore can only count up to 1 and additional signals are ignored. When a task services the semaphore, it will read in all characters in the UART buffer at once. Thus, using a counting event semaphore could cause wasted activity and time.
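The drain-everything pattern can be sketched in plain C (a single-file simulation, not a real RTOS API; uart_isr and uart_task are hypothetical names):

```c
/* Sketch: a binary "event" flag set by a UART ISR. Extra signals while
   the flag is already set are absorbed; the servicing task then drains
   every buffered character in one pass, so no event counting is needed. */
#include <string.h>

#define UBUF_SIZE 64
static char uart_buf[UBUF_SIZE];
static int  uart_len;     /* characters currently buffered */
static int  rx_flag;      /* binary event "semaphore" */

/* Called by the (simulated) UART ISR for each received character. */
void uart_isr(char c)
{
    if (uart_len < UBUF_SIZE)
        uart_buf[uart_len++] = c;
    rx_flag = 1;          /* signal; an already-set flag stays 1 */
}

/* Servicing task: one signal, drain everything. Returns chars read. */
int uart_task(char *out)
{
    int n = 0;
    if (rx_flag) {
        rx_flag = 0;
        n = uart_len;
        memcpy(out, uart_buf, (size_t)n);
        uart_len = 0;
    }
    return n;
}
```

Three interrupts produce one service pass that reads all three characters; a counting semaphore would have triggered two pointless extra passes.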
  4. Use event semaphores for events and resource semaphores for resources. Event semaphores typically count up, whereas resource semaphores start at a count equal to the number of resources available (e.g. blocks in a block pool) and count down. When a resource semaphore reaches 0, it starts suspending tasks that test it until it receives signals that indicate resources are being released. This is not a good fit for counting events such as the number of objects that have passed by on a conveyor belt. For this, use an event semaphore that counts up.
  5. One task per event semaphore. There should only be one task that waits on an event semaphore. Otherwise, the system is error-prone because the wrong task might get the wrong signal if things get out of sequence. Of course, if the tasks are clones of each other (i.e. all use the same main function and do the same thing), then it is ok for them to wait at the same event semaphore. This is a nice way to handle a situation where a task might hang up waiting for something else (e.g. a mutex) and not be able to get back in time for the next event. Then a clone task waiting at the semaphore could take over for it. Creative use of tasks, like this, can reduce system problems and complexities.
  6. Keep ISRs very short. An ISR should do only what must be done immediately. All else should be deferred to a task or other agent. An important reason for this is that the larger the ISR, the larger the attack surface for a hacker to exploit. Compromised ISRs are especially serious because they run in privileged mode, thus enabling the hacker to turn off the MPU, if one is being used, and then to access secret information, such as encryption keys. Calling RTOS services from ISRs makes the attack surface even larger and should be avoided, if possible. Another potential problem is that when all interrupts occur at nearly the same time and each ISR has its worst-case run time, new interrupts may be missed. This causes the dreaded “once in a blue moon”[1] failure.
  7. Always test mutexes in the same order. Not doing this runs the risk of deadlock. For example, if task1 tests mutexes A and B, and task2 tests mutexes B and A, it is possible that task1 will get A and task2 will get B. Then neither one can run. (This is a good example of where using finite timeouts might save the day.) To avoid deadlocks, make a list of all mutexes in the system and make sure to always get them in the order that they appear in the list.
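The ordering rule can be enforced mechanically. In this sketch the global order is simply the mutexes' addresses, with POSIX pthreads standing in for the RTOS API; any fixed order works as long as every task uses the same one:

```c
/* Sketch: acquire any pair of mutexes in one globally consistent
   order (here, by address), so task1 and task2 can never each hold
   one mutex while waiting for the other. */
#include <pthread.h>

/* Lock two mutexes in a fixed global (address) order. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }
    pthread_mutex_lock(a);     /* always lower address first */
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);   /* unlock order does not matter */
    pthread_mutex_unlock(b);
}
```

With this helper, lock_pair(&A, &B) and lock_pair(&B, &A) take the locks in the same physical order, so the crossed-acquisition deadlock cannot form.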
  8. Start with a small number of priorities, and add more only as necessary. This reduces preemptions since same-priority tasks cannot preempt each other. There may be a temptation to start with a large number of priorities, thinking that this is the best way to design a system. In such a case, not only is it difficult to figure out what priorities tasks should have, but more priorities do not necessarily produce better results. That is because a slightly higher priority task may preempt a longer waiting task, which is not necessarily desirable. Rate Monotonic Scheduling is more about time than priority, so it is best served with fewer priorities.
  9. Never number priorities. The reason is that if you need to add a priority, you must increase all of the priorities above it, which is a hugely error-prone activity. Priorities should always be named, and the best way to assign them numbers is by using an enum such as:

enum PRIORITIES {PRI_MIN, PRI_LO, PRI_NORM, PRI_HI, PRI_SYS};

Then, a new priority can be easily added as follows:

enum PRIORITIES {PRI_MIN, PRI_LO, PRI_NORM, PRI_HI_NORM, PRI_HI, PRI_SYS};

The new priority, PRI_HI_NORM, is obviously above PRI_NORM and below PRI_HI. Most importantly, unrelated code need not be changed. You can name priorities however you want, for example:

enum PRIORITIES {ZERO, ONE, ONEptFIVE, TWO, PRI_SYS};

There should be a system priority which is above all other priorities and is reserved for system actions such as shutting down compromised tasks and recovering from hacking attacks. The lowest priority should be reserved for the idle task. If your RTOS defines 0 as the highest priority, you can easily adapt the above scheme as follows:

enum PRIORITIES {PRI_SYS, PRI_HI, PRI_NORM, PRI_LO, PRI_MIN};

  10. If a task is getting too complicated, divide it into two or more tasks, and let RTOS services help you to deal with the complexity. Smaller tasks usually require smaller stacks because subroutine nesting is not as deep. In addition, it might be possible to make one or more of the tasks into one-shot tasks[2], which share stacks. Hence the increase in memory required due to more tasks may not be significant.
  11. What is the right number of tasks? Generally speaking, every asynchronous activity should be assigned to a separate task or independent agent, such as an ISR or an LSR[3]. This makes the code easier to write and to debug, and it makes the best use of RTOS services to deal with synchronization and coordination problems. However, many RTOSs place a heavy overhead on tasks. Their TCBs can be as large as 300 bytes, and task stacks often need to be 1000 bytes or more, depending upon the amount of subroutine nesting and how subroutines use the stack. (Some subroutines put large buffers on the stack.) In such cases, the number of tasks is likely to be severely limited if fast memory is in short supply.

Smaller tasks generally require smaller stacks because subroutine nesting is less. Some RTOSs have much smaller TCBs, and some RTOSs support one-shot tasks, which can share task stacks. Some RTOSs also support agents, which have very small control blocks and very small stacks, yet can operate independently. In general, the more a system can be broken down into tasks and independent agents, the easier it is to write the code and to get it to work correctly. A winning strategy is to keep tasks simple and to let the RTOS handle complex timing and coordination problems.

  12. Handle pipe (message queue) overruns. When a pipe is full, the next pipe put could overwrite the oldest entry in the pipe, thus causing loss of data, unless the RTOS suspends the writing task or aborts the put. However, for simple pipe puts that are designed to work with ISRs, suspending a task is not an option. In this case, it is necessary to monitor the pipe fill level during debug. In either case, pipes should be much longer than seems necessary, in order to protect against unusual or unexpected data buildups in the field.
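A sketch of such a pipe, with an abort-on-full put (safe to call from an ISR) and a high-water mark for debug-time monitoring; this is a hypothetical helper, not any particular RTOS's pipe API:

```c
/* Sketch: fixed-size pipe whose put aborts when full rather than
   overwriting, and which records the maximum fill level seen so the
   pipe length can be tuned during debug. */
#define PIPE_SIZE 8

typedef struct {
    int buf[PIPE_SIZE];
    int head, tail, count;
    int hiwater;                /* maximum fill level observed */
} pipe_t;

int pipe_put(pipe_t *p, int v)  /* returns 1 = ok, 0 = full (aborted) */
{
    if (p->count == PIPE_SIZE)
        return 0;               /* abort rather than overwrite oldest */
    p->buf[p->head] = v;
    p->head = (p->head + 1) % PIPE_SIZE;
    if (++p->count > p->hiwater)
        p->hiwater = p->count;  /* inspect this during debug */
    return 1;
}

int pipe_get(pipe_t *p, int *v) /* returns 1 = ok, 0 = empty */
{
    if (p->count == 0)
        return 0;
    *v = p->buf[p->tail];
    p->tail = (p->tail + 1) % PIPE_SIZE;
    p->count--;
    return 1;
}
```

If hiwater approaches PIPE_SIZE during testing, the pipe should be lengthened before deployment.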
  13. Check return values from RTOS services. Normally, return values from RTOS services should be checked before using them. For example:

task = TaskCreate(…);
if (task)
     use task
else
     error_manager(task_create_error);

This is the best way to handle RTOS calls. However, testing every result of every RTOS call is cumbersome and complicates the code. In some cases, it can be avoided as in the following:

task = TaskCreate(…);
if (!TaskStart(task))
     error_manager(task_start_error);

This works if the RTOS tests parameters passed to its services. In this case, TaskStart() detects that task == NULL and aborts operation. The error manager can determine that the cause of the abort was that the task failed to be created. However, in other cases, not testing a return value may result in dereferencing a NULL pointer, or similar, so be careful.

  14. Point-of-call vs. central error handling. In general, the latter is favored because point-of-call error handling tends to complicate code and can greatly increase its size. However, in some cases errors must be handled locally. For example, lack of a resource might be handled better by trying again to get it later. In the above examples, the central error manager is being explicitly called. In some RTOSs the central error manager is automatically called whenever an error is detected by an RTOS service. This simplifies the application code.

In either case, the central error manager should send an error message to the console. (There should always be a console to monitor errors during debug!) It should also log the error into an error buffer or an event buffer. It might also keep counts of error types, load the error number into the current task’s TCB and into a global error number, execute a callback function to allow error-specific operations, and shut the system down if the error is irrecoverable.
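A minimal sketch of such a central error manager; all names here are illustrative, and the callback and shutdown hooks would be filled in per system:

```c
/* Sketch: central error manager that reports to the console, logs to
   an error buffer, counts per error type, and runs an optional
   error-specific callback. */
#include <stdio.h>

enum err_t { TASK_CREATE_ERR, TASK_START_ERR, NUM_ERRS };

#define ERR_LOG_SIZE 16
static int         err_count[NUM_ERRS];     /* counts per error type */
static enum err_t  err_log[ERR_LOG_SIZE];   /* error/event buffer */
static int         err_log_len;
static void      (*err_callback)(enum err_t); /* optional hook */

void error_manager(enum err_t e)
{
    printf("error %d\n", e);            /* console message for debug */
    if (err_log_len < ERR_LOG_SIZE)
        err_log[err_log_len++] = e;     /* log the error */
    err_count[e]++;                     /* per-type count */
    if (err_callback)
        err_callback(e);                /* error-specific handling */
    /* irrecoverable errors would trigger shutdown here */
}
```

Application code then reduces to one-line calls such as error_manager(task_start_error), keeping error policy in one place.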

  15. Interlocked operations are recommended over open-ended operations; that is, verify that an intended operation actually succeeded rather than assuming that it did. For example, wait for an acknowledgement before sending the next message to another task; if the acknowledgement is not received, resend the previous message. This approach, commonly used in communication protocols, is also a good idea inside embedded systems. Given the explosion of hacking attacks, we can no longer consider the insides of embedded systems to be safe.
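A sketch of an interlocked send with retry; the "channel" here is a simulated lossy one, and send_and_wait_ack() stands in for real RTOS or driver calls:

```c
/* Sketch: resend a message until the receiver acknowledges it, up to
   a retry limit, instead of assuming the first send succeeded. */
static int drops;   /* number of attempts the simulated channel drops */

static int send_and_wait_ack(int msg)
{
    (void)msg;
    if (drops > 0) { drops--; return 0; }  /* no ack: message lost */
    return 1;                              /* acknowledged */
}

/* Returns the attempt number that succeeded, or 0 if all failed. */
int interlocked_send(int msg, int max_tries)
{
    for (int t = 1; t <= max_tries; t++)
        if (send_and_wait_ack(msg))
            return t;
    return 0;       /* escalate: receiver task may be hung */
}
```

A zero return is itself useful information: the receiving task may be hung and in need of restart, per the clone-task idea in guideline 5.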
  16. Always check that the values of variables are reasonable before using them, especially inputs from outside of the system. This makes the system more robust against noise and malware.
  17. Avoid unnecessary task preemptions. Watch for the case where a task is preempted by a higher-priority task, which then suspends itself to wait for the lower-priority task to finish. In this case, control goes back to the lower-priority task, which finishes, and then the higher-priority task runs. For example:

void taskA_main(u32)
{
     TaskStart(taskB);
     SemSignal(semA);
     …
}

void taskB_main(u32)
{
     SemTest(semA);
     …
}

Assuming task B has higher priority than task A, it will preempt task A as soon as it is started. But then it suspends itself on semA, waiting for a signal from taskA. Hence, unnecessary task switches occur: A -> B -> A -> B rather than just A -> B. The solution to this problem is shown below:

void taskA_main(u32)
{
     TaskLock();
     TaskStart(taskB);
     SemSignal(semA);
     TaskUnlock();
     …
}

void taskB_main(u32)
{
     SemTest(semA, 100);
     …
}

Locking taskA blocks taskB from preempting it until after taskA has signaled semA and unlocked itself. This is a simplistic example, but unnecessary task switches do commonly occur, and eliminating them improves performance and reduces confusion.

  18. When disabling interrupts, save the interrupt state first, then restore it afterward. For example:

CPU_FL ps;
ps = IntStateSaveDisable();
/* perform operation with interrupts disabled */
IntStateRestore(ps);

Assuming that you know the current interrupt enable state can have dangerous consequences if you are wrong or if the code later changes; restoring the saved state is safer.

  19. It is better to allocate buffers from a heap than to define them statically (e.g. buf[100]). A common problem with buffers is that overflow or underflow damages adjacent variables, causing problems that are difficult to find; buffer overflow is also a common tactic used by hackers. If a buffer is allocated from a heap, overflow or underflow will damage the heap chunk control block after or before the block. If the heap is frequently scanned for damaged links, this is likely to be caught and fixed before serious damage occurs. In the static case, damage is unlikely to be detected until it is too late.
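The heap-scanning idea can be sketched with fence (canary) words around each block. This simplified fixed pool is illustrative only; a real heap would check its chunk control blocks instead:

```c
/* Sketch: each block is bracketed by known "fence" words; a periodic
   scan detects overflow or underflow before the damage spreads. */
#include <stdint.h>

#define FENCE  0xAA55AA55u
#define NBLK   4
#define BLKSZ  16

typedef struct {
    uint32_t lo_fence;          /* catches underflow */
    uint8_t  data[BLKSZ];
    uint32_t hi_fence;          /* catches overflow */
} blk_t;

static blk_t pool[NBLK];

void heap_init(void)
{
    for (int i = 0; i < NBLK; i++) {
        pool[i].lo_fence = FENCE;
        pool[i].hi_fence = FENCE;
    }
}

/* Returns index of the first damaged block, or -1 if the pool is clean. */
int heap_scan(void)
{
    for (int i = 0; i < NBLK; i++)
        if (pool[i].lo_fence != FENCE || pool[i].hi_fence != FENCE)
            return i;
    return -1;
}
```

Running heap_scan() from the idle task or a timer gives early warning of overflow, rather than a mysterious failure much later.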
  20. If a task is locked, be aware that any RTOS service that might suspend the task will break the lock, regardless of whether the task is actually suspended or not. This is because the RTOS does not know, a priori, if the service will cause the task to wait. (Obviously the lock must be broken if the task waits, because otherwise no other task could run.) The problem is that downstream code will be unprotected even if the RTOS service succeeded and the task did not wait. For some RTOSs, specifying NO_WAIT avoids this problem by allowing the RTOS to decide to maintain the lock.
  21. If task locks are not counted, code may be unprotected. The RTOS should count locks and require an equal number of unlocks before the task is actually unlocked. If it does not, a failure such as the following can occur: a task’s main function locks the task, then calls a subroutine that also locks the task; when the subroutine finishes, it unlocks the task and returns to main(). From the appearance of the main code, one would think that the code after the subroutine call is protected, when it is not.
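A counted lock can be sketched as follows (illustrative names; a real RTOS would also disable the scheduler while the count is nonzero):

```c
/* Sketch: lock() increments a count and unlock() decrements it, so a
   subroutine's paired lock/unlock cannot strip protection from its
   caller; the task is only truly unlocked when the count reaches 0. */
static int lock_count;

void task_lock(void)  { lock_count++; }

int task_unlock(void) /* returns 1 if the task is now truly unlocked */
{
    if (lock_count > 0)
        lock_count--;
    return lock_count == 0;
}

int task_locked(void) { return lock_count > 0; }

/* A subroutine that brackets its own critical work with lock/unlock. */
void subroutine(void)
{
    task_lock();
    /* ... critical work ... */
    task_unlock();    /* caller's outer lock still holds */
}
```

After subroutine() returns, the caller's code is still protected, matching what a reader of main() would expect.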
  22. Deleting objects is hazardous. Some RTOSs release tasks and messages waiting at an object being deleted and also load NULL into the object handle so it cannot be reused. However, great care is still necessary to avoid delete problems. It is generally best to delete an object only from the lowest-priority task that uses it, and never from another agent. Even so, careful consideration is necessary to determine if deleting the object will cause other parts of the system to hang up or to malfunction. In some cases, resources are so scarce that deleting unused objects is necessary. However, the most common use for deleting objects is to recover from an attack without rebooting the entire system. Rebooting is undesirable because it may force a drastic action, such as shutting down an entire production line.
  23. Intrinsic paths. An intrinsic path is a path through the RTOS that the user assumes exists, but may not actually exist. Missing intrinsic paths can lead to problems that are difficult to diagnose. Examples of intrinsic paths are:
    • If the priority of a task is changed, is the task requeued in any queue it is waiting in? For example, if a task is having difficulty obtaining a mutex, a programmer might assume that increasing its priority will enable it to gain the mutex sooner. But this works only if the task is not already waiting for the mutex. Such inconsistent behavior causes problems that are difficult to find.
    • Is priority promotion propagated? If a task with higher priority than the current owner starts waiting at mutex A, priority promotion means that the priority of the mutex owner will be increased, so it will release the mutex sooner. But if that owner is itself waiting at mutex B, will mutex B’s owner also have its priority increased? For an application using very few mutexes, this may not be important, but for an application using many mutexes, lack of priority promotion propagation can cause inconsistent behavior, i.e. sometimes unbounded priority inversion, sometimes not.
    • If a task owns mutexes A and B, where A has a higher priority than B[4] and B is higher than the task’s normal priority, when the task releases A, will its priority drop to B’s or to its normal priority? In the latter case, priority promotion fails and the task may experience unbounded priority inversion.
    • If a task’s priority is increased, will priority promotion follow? Boosting the priority of a task waiting at a mutex and about to miss its deadline could help to avoid the missed deadline. However, if the raised priority is not promoted, this tactic could fail.
    • If a task is suspended indefinitely or deleted, does it automatically release all mutexes that it owns? If not, this must be done before it is suspended or deleted. If the RTOS does not keep track of which mutexes a task owns, then it is necessary to search through all mutexes in the system to find which ones it owns.
    • If a task is deleted, does it automatically give up all objects that it owns? If not, then it is necessary to find and release all such objects before the task is deleted. Some RTOSs have separate callbacks for when a task is first started and for when it is deleted. Using these callbacks helps to assure that all statically assigned resources are released. However, there is still a problem with dynamically acquired resources such as mutexes, blocks, and messages. If the RTOS does not keep track of these per task, then it may be necessary to add a monitor that does.

It is important to understand if intrinsic paths like the above exist or not in the RTOS being used. These are typically not well documented, so you may need to do some experimentation to find out. If a path does not exist, additional code may be required to achieve reliable operation.

Conclusion

Hopefully, the above pointers have given you some insight into where to look if you are having certain problems, as well as things to consider when picking an RTOS.


Ralph Moore is a graduate of Caltech. He and a partner started Micro Digital Inc. in 1975 as one of the first microprocessor design services. Now Ralph is primarily the Micro Digital RTOS innovator. His current focus is to improve the security of IoT and embedded devices through firmware partitioning. He believes that it is the most practical approach for achieving acceptable security for devices connected to networks. Ralph can be contacted at [email protected], or visit www.smxrtos.com/securesmx to learn more.


[1] The meaning of “once in a blue moon” is that the problem occurs so infrequently that it is impossible to replicate it in the lab. Hence tracking it down requires great ingenuity, and sometimes working around it is the only solution.

[2] A one-shot task normally runs straight through with no internal loop and gives up its stack when done, since there is no information to carry over to the next run. Thus, several equal-priority one-shot tasks can share one stack, thereby reducing memory usage considerably. One-shot tasks are compatible with round-robin scheduling and are useful for servers.

[3] A Link Service Routine (LSR) uses the current stack or a very small stack and runs ahead of all tasks. Hence it is not subject to preemption by higher-priority tasks nor to priority inversion, and thus it is able to do deferred interrupt processing with little jitter, unlike a task.

[4] A mutex has the priority of its highest-waiting task or of its owner.

