Making Legacy Code Safe in Multi-Threaded Environments

 

A technical brief by Wesley Steiner for Computer Results

Copyright © 2003 by Computer Results

7/09/03

 

Introduction

 

Writing thread-safe, or reentrant, code from scratch is relatively straightforward as long as you follow some strict but simple coding guidelines. Turning thousands of lines of legacy C and C++ code, combined with multiple third-party libraries, into a thread-safe package for use in modern multi-threaded environments is more difficult. This document points out the issues and proposes solutions for just such a task.

 

The Setup

 

Let's say that you are the programmer responsible for maintaining and enhancing a library of legacy code. When incorporated into a single-threaded monolithic application, this code performs some magical function or functions on the input data. To date, the application has worked flawlessly and has been in use for many years.

 

A request has just come in to package the functionality into a library for use in a multi-threaded dispatching environment where more than one thread will be calling your API concurrently. Since your package will be running on a server machine, it must be bulletproof and provide a performance benefit due to the concurrency.

 

A First Solution

 

Since you already know that your library works flawlessly when executing in a single thread, the quickest and simplest solution is to introduce a monolithic locking mechanism into your API that lets only one thread through at a time, effectively blocking all other threads until the first thread completes its task. When the first thread is finished it releases the lock and the next waiting thread is allowed to continue. The queuing and releasing of threads is handled automatically by the operating system via some implementation of a mutual exclusion device (a mutex). At this point you can point out to management that the library is now 100% thread-safe and compatible with all multi-threaded services.

Of course, if your API consists of more than one function, this solution may not work correctly, especially if the functions share resources or maintain state between calls: a lock that is held only for the duration of one call does not stop another thread from changing that state between two of your calls. Also, if your functions perform any type of lengthy task, then blocking all threads while one finishes will be unacceptable to your client and their users.
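
The monolithic lock described in this section can be sketched roughly as follows. This is a minimal illustration, assuming a C++11 compiler; the names legacy::process, api_process and g_api_mutex are hypothetical and stand in for whatever your library actually exposes.

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for the existing single-threaded legacy code.
namespace legacy {
    int g_call_count = 0;  // global state: the reason the code is not reentrant

    int process(const std::string& input) {
        ++g_call_count;
        return static_cast<int>(input.size());
    }
}

namespace {
    std::mutex g_api_mutex;  // the single, library-wide lock
}

// Public entry point: every caller takes the same lock, so only one thread
// at a time ever executes the legacy code underneath.
int api_process(const std::string& input) {
    std::lock_guard<std::mutex> lock(g_api_mutex);  // released on return
    return legacy::process(input);
}
```

Every exported function would have to take the same mutex for this scheme to hold, which is exactly why it serializes all callers.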

 

The First Step

 

Before you can even think about removing the top-level blocking noted above you must first identify and categorize all global resources used by the library. Global variables are reentrancy killers. If you live on another planet in a different universe you may find that there are no global variables in your library, or in any of the third-party libraries that it uses, in which case you can skip this step.

 

Still here?

 

After you have identified all global variables used in your code base you can quickly shorten the list by eliminating all those that are explicitly qualified as constant (const). Since these resources are declared as constants, their values will not change during the lifetime of your library (provided nothing casts the const away) and therefore they are not a concern with regard to thread safety. Take care with pointers, though: a const pointer can still point to data that is not const.

 

You can make a second short-list by identifying those resources that are not declared as constants but actually never change during the lifetime of your library. This may have been a simple oversight by the original programmer or the values may be assigned only once at startup and never change after that.

 

In the first case, rather than leaving things as they are, you should take the time now to add a const qualifier to those resources and address any areas of the code where the compiler complains about the newly added const-ness. This extra bit of effort will greatly assist maintenance programmers, including yourself, when you need to revisit this code in the future.

 

The second case, where the value is initialized only once at startup, requires a simple comment to that effect. This is often the case when startup values are read from a preference file.
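
The two short-lists might end up looking something like the sketch below. All names (kUnitNames, g_max_records, init_library, max_records) are hypothetical, and the assumption that initialization happens before any worker thread starts is stated in the comment, since the language does not enforce it.

```cpp
#include <cstddef>

// Case 1: never modified anywhere, so it is now const-qualified.
// Both the pointers and the characters they point to are const.
const char* const kUnitNames[] = {"mm", "cm", "m"};
const std::size_t kUnitCount = sizeof(kUnitNames) / sizeof(kUnitNames[0]);

// Case 2: assigned once at startup and read-only afterwards.
// THREADING: written only by init_library(), which is assumed to be called
// exactly once before any worker thread is started.
static int g_max_records = 0;

void init_library(int max_records_from_prefs) {
    g_max_records = max_records_from_prefs;
}

int max_records() {
    return g_max_records;
}
```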

 

So now you're left with the real globals that are used throughout your code for reading and writing. If one thread is allowed to change one of these values while another thread is reading it there is no telling what the results might be. Hopefully this list is a short one since there is no easy way around it. A solution for each global resource in this list must be considered individually. The questions to consider, in order, are:

 

  1. Does this resource need to be global, or was that just an oversight? If it does not, convert it to an automatic variable.
  2. Is this resource used only in non-production code or for diagnostic or debugging purposes? If so, it can be left as-is with a note that it may not behave correctly in a multi-threaded environment.
  3. Is this resource used only within code that is already known to be non-reentrant and will not be benefiting from the current reentrancy work, at least not in the immediate future? Since access to these types of resources will be wrapped in a thread block anyway, there is no need to do anything more than include a comment to that effect.
  4. Can the code be refactored or re-architected within the schedule to remove the global altogether? Answering this question requires an in-depth knowledge of your library and the interaction between its components. Even then the results can be surprising once you begin the task.
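
As an illustration of the first question, a global that exists only by oversight is often a scratch buffer. The sketch below shows a before and after; the function names and the "ID-" format are invented for the example.

```cpp
#include <cstdio>
#include <string>

// Before: a scratch buffer shared by every caller. Two threads formatting
// at the same time would overwrite each other's text.
static char g_scratch[32];

std::string format_id_legacy(int id) {
    std::snprintf(g_scratch, sizeof(g_scratch), "ID-%05d", id);
    return g_scratch;
}

// After: the buffer is an automatic variable, so each call (and therefore
// each thread) gets its own copy on its own stack.
std::string format_id(int id) {
    char scratch[32];
    std::snprintf(scratch, sizeof(scratch), "ID-%05d", id);
    return scratch;
}
```

Both versions return the same text in a single thread; only the second remains correct when called concurrently.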

 

If none of the above questions offers a way out, then you have no choice but to relegate the global to per-thread storage or re-architect your code to create a separate context for each thread that contains these global resources. Per-thread or thread-local storage is just that: a block of storage, allocated for each thread, that contains the global resources for that thread. Of course this can add a performance penalty to your code and may require modifications to your source to access the resources correctly. Most compilers offer a convenient method for qualifying thread-local variables ("__declspec(thread)" comes to mind) so that the storage is allocated and managed for you automatically when a thread is created. This works fine for static linking but breaks down for dynamic linking in the case of DLLs. Here you need to explicitly call OS functions (TlsAlloc, …) at the appropriate locations in your code to create and manage the thread-local storage.
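
A minimal sketch of per-thread storage follows. It assumes a C++11 or later compiler and uses the standard thread_local keyword rather than a vendor-specific qualifier; the variable g_last_error and the helper functions are hypothetical.

```cpp
#include <thread>

// Formerly a plain global: `int g_last_error = 0;`
// With thread_local, every thread gets its own independent copy.
thread_local int g_last_error = 0;

void set_last_error(int code) { g_last_error = code; }
int last_error() { return g_last_error; }

// Sets the error code on a second thread and returns the value that thread
// read back. The calling thread's own copy is left untouched.
int error_seen_by_other_thread(int code) {
    int seen = -1;
    std::thread worker([&] {
        set_last_error(code);
        seen = last_error();
    });
    worker.join();
    return seen;
}
```

The existing code that reads and writes the variable does not need to change, which is the main attraction of this approach over passing an explicit per-thread context around.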

 

To go beyond this point and actually see any benefit from multiple threads of execution requires the removal of the top-level thread blocks mentioned above. This is where the fun begins and is the subject of the next section.

 

The Second Step

 

Assuming you have addressed all global resources as outlined above, it is now time to begin removing or re-scoping the top-level monolithic thread blocks in turn and evaluating the effect of each. Here "evaluating the effect" refers to the heuristic process of reviewing the code and running multi-threaded test beds that exercise your code to its fullest capacity, waiting for something to fail.

 

Before you can determine the effect of these modifications you will need to assemble one or more test applications using your favorite IDE and framework that will let you dispatch an arbitrary number of threads so that each one will test the same function concurrently. In this way if something bad can happen due to non-reentrant code, it will.
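
One possible shape for such a test application is sketched below, assuming a C++11 compiler. The helper name hammer and its parameters are hypothetical; the function under test is passed in as a callable that returns false when it detects a wrong result.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Runs fn(thread_index, iteration) `iterations` times on each of
// `thread_count` threads and returns how many calls returned false.
template <typename Fn>
int hammer(int thread_count, int iterations, Fn fn) {
    std::atomic<bool> go(false);
    std::atomic<int> failures(0);
    std::vector<std::thread> threads;

    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&, t] {
            // Start gate: hold every thread until all have been created so
            // that the calls overlap as much as possible.
            while (!go.load()) std::this_thread::yield();
            for (int i = 0; i < iterations; ++i) {
                if (!fn(t, i)) ++failures;
            }
        });
    }

    go.store(true);                       // release all threads together
    for (auto& th : threads) th.join();   // wait for every thread to finish
    return failures.load();
}
```

A run that reports zero failures does not prove the function is reentrant; it only means this particular run did not expose a problem, so repeated runs with different thread counts are worthwhile.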

 

For the purposes of this discussion let's consider the orderly process for just one of the functions in your library. The same process will apply to every function exposed in your library.

 

  1. Can the thread block be eliminated altogether? If your function only allocates automatic variables, doesn't write to global storage and doesn't call down to any suspect library functions, then you can be pretty sure it is reentrant and hence thread-safe. Removing the block is all you have to do. Determining whether a function is reentrant is a matter of how deeply you understand your software library and all of its interactions.
  2. Can you modify your code at this function level to make it reentrant? Perhaps the original code was written without much thought given to reentrancy, in which case it may not be too difficult to perform some local refactoring and re-architecting to address the problem.
  3. Can the block be moved to a smaller scope within the same function? If you can determine that a specific block of code or a single call within your function is the source of the error, then you can refine the blocking level by moving it into the suspect code area, thereby reducing the scope over which the block is in effect.
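
The third question, moving the block to a smaller scope, might look like the following sketch. It assumes a C++11 compiler; the function and variable names are invented for the example.

```cpp
#include <mutex>
#include <numeric>
#include <vector>

namespace {
    std::mutex g_stats_mutex;
    long g_total_processed = 0;  // shared counter written by every call
}

long sum_and_record(const std::vector<int>& values) {
    // Reentrant part: touches only the argument and automatic variables,
    // so it runs outside the lock and can proceed concurrently.
    long sum = std::accumulate(values.begin(), values.end(), 0L);

    {
        // Non-reentrant part: the lock now covers only the write to the
        // shared counter instead of the whole function.
        std::lock_guard<std::mutex> lock(g_stats_mutex);
        g_total_processed += static_cast<long>(values.size());
    }
    return sum;
}

long total_processed() {
    std::lock_guard<std::mutex> lock(g_stats_mutex);
    return g_total_processed;
}
```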

 

At this point the above process needs to be repeated at the next deeper level. Assuming you have identified the call that generates the error in concurrent environments, apply the same three questions to the function being called so that the thread block can move out of the current level and into that function. By continuing to drill down in this way you arrive at finer and finer blocks of code that require thread blocking.

 

Where to from Here

 

Given enough time and resources, all code can be made reentrant. In the real world, however, we need to deal with deadlines and schedules. These factors will dictate how deeply you can drill down into your code, and third-party code, in your effort to claim that your library is thread-safe. Although your ultimate goal is 100% reentrancy, in reality you will need to apply thread blocking around areas of code that are known to be non-reentrant. In fact, many operating system calls incorporate thread blocking to get their work done, so using these functions in your code already eliminates the chance of ever achieving 100% reentrancy. In many cases this is not a bad solution, provided you can determine that not every thread will be calling into this code at the same time, which reduces the amount of time that threads spend waiting on a lock. In other cases a lock will be required because you are writing to a shared resource, unless you can replace the resource with a thread-safe version (which in many cases just encapsulates the blocking inside the implementation).
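
A thread-safe version of a shared resource that simply encapsulates the blocking could be sketched as below. The class SharedCache is hypothetical and assumes a C++11 compiler; it is one way to hide the lock, not a recommendation for any particular container.

```cpp
#include <map>
#include <mutex>
#include <string>

// A shared lookup table whose lock lives inside the implementation, so
// callers never see or manage the mutex themselves.
class SharedCache {
public:
    void put(const std::string& key, int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[key] = value;
    }

    // Returns true and fills `value` if the key is present.
    bool get(const std::string& key, int& value) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<std::string, int>::const_iterator it = entries_.find(key);
        if (it == entries_.end()) return false;
        value = it->second;
        return true;
    }

private:
    mutable std::mutex mutex_;            // mutable so get() can lock it
    std::map<std::string, int> entries_;
};
```

Callers still wait on each other while the lock is held; the gain is that the blocking is confined to the shortest possible region and cannot be forgotten at a call site.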

 

At this point it is time to perform some benchmarking and instrumentation on your code to determine whether it is performing adequately for the task. If so, you can move forward with work on new features and enhancements to the library, armed with your newly acquired knowledge of and skill for writing reentrant code from the start.
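
A very simple starting point for such a benchmark is to time the same workload at different thread counts. The helper below is a sketch under that assumption; seconds_for and its parameters are hypothetical names, and a real measurement would repeat each run several times.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Calls fn() `calls_per_thread` times on each of `thread_count` threads and
// returns the elapsed wall-clock time in seconds.
template <typename Fn>
double seconds_for(int thread_count, int calls_per_thread, Fn fn) {
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < calls_per_thread; ++i) fn();
        });
    }
    for (auto& th : threads) th.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Comparing the time for one thread against the time for several threads doing the same total number of calls gives a rough indication of how much the remaining locks are costing you.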