MSDN Magazine - December 2007 - (Page 18)

runtime that may need to allocate resources, but it should give you an idea of why it is not really practical to predict and mitigate the consequences of any specific allocation failure.

In the family of asynchronous exceptions, thread aborts have a special role. Thread aborts are not errors caused by resource exhaustion (like OOMs and SOs), but they too can happen at any time. If you abort a thread from a different thread, the point on the aborted thread where the exception will be raised is completely random.

Stack overflows also have their own idiosyncrasies. Stack space is reserved per thread and committed eagerly, so it should always be possible to avoid any competition for the resource. But there are problems with this. It is a guessing game to predict how much stack is enough for every application. The OS limits the amount of stack space per thread. There are issues with presenting an exception via Structured Exception Handling (SEH) when the thread is low on stack space due to unwinding issues. And reentrancy and recursion rule out computing a finite upper bound on the stack space required by a method.

For a high level of resiliency, you really need to find the part of the application that failed and recycle that part.

In a nutshell: it is practically impossible to forecast when OutOfMemoryException, StackOverflowException, and ThreadAbortException might occur. Thus, writing backout code to attempt to recover from asynchronous exceptions is not practical on a large scale.

You may be wondering whose responsibility it is to make the application reliable if the runtime cannot guarantee resilience to asynchronous exceptions. The ASP.NET example we just discussed hinted at the answer. While the application code is responsible for handling the common synchronous exceptions, it cannot handle asynchronous ones; these must be handled by the host process.
In the case of ASP.NET, the host contains the logic that triggers process recycling when memory consumption goes beyond a known threshold. In other more sophisticated cases, hosts like SQL Server™ 2005 utilize the CLR’s hosting APIs to decide whether to abort the managed thread running a transaction and roll it back, unload an AppDomain, or even suspend the execution of all managed code on the server. The default policy for an application is that the host process will be killed, but there are several techniques you can use to embrace and extend this approach or to override this behavior. For further reading on the CLR’s hosting APIs, see the August 2006 CLR Inside Out column (msdn.microsoft.com/msdnmag/issues/06/08/CLRInsideOut).

The first model is to make the process itself the unit of failure and isolate the managed code execution in one or more worker processes that can die and spawn liberally. The second is to keep two redundant processes working in parallel doing the processing, with one active and the other dormant. On failure, the dormant process takes over and spawns another dormant process to act as a backup in case of another failure. The third model is to make the AppDomain the unit of failure and ensure that the process is never affected by any failure occurring in managed code or in the runtime. We’ll look at these three approaches in more depth, analyzing the cost of implementation and the different ways to approach the design.

Recycling the Host Process

Say we accept the fact that resource exhaustion could tear down the process hosting the CLR. And say the system is built in such a way that the work is isolated in one or more child processes supervised by a master process whose job it is to manage the lifetime of the worker processes. Under these circumstances, we have a cheap and effective solution with which to provide a reliable system.
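
As a rough, language-neutral illustration of the master/worker arrangement just described, the following minimal sketch uses Python rather than managed code, purely for brevity. Everything in it is an assumption made for the example: the worker body, the simulated crash on the value 13, and the names WORKER_SOURCE, run_in_worker, and supervise are hypothetical. It also simplifies by starting one fresh worker per request, whereas a real host would typically keep long-lived workers and recycle them on failure or on a memory threshold.

```python
import subprocess
import sys

# Hypothetical worker: squares its input, but exits abruptly on the value 13
# to stand in for a failure that tears the whole process down.
WORKER_SOURCE = """
import os, sys
n = int(sys.argv[1])
if n == 13:
    os._exit(1)  # simulate the process dying without any cleanup
print(n * n)
"""


def run_in_worker(request):
    """Run one request in a fresh child process and return (ok, result)."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, str(request)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if proc.returncode != 0:
        # The worker died; the master process itself is unaffected.
        return False, None
    return True, int(proc.stdout.strip())


def supervise(requests):
    """Master loop: isolate each request and collect the ones that failed."""
    completed = {}
    to_resubmit = []
    for request in requests:
        ok, result = run_in_worker(request)
        if ok:
            completed[request] = result
        else:
            to_resubmit.append(request)
    return completed, to_resubmit


if __name__ == "__main__":
    done, retry = supervise([2, 13, 5])
    print("completed:", done)
    print("to resubmit:", retry)
```

The point of the design is that the master never runs request code itself, so a dying worker costs only the requests that worker was handling.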
The system will not be resilient, but it will be fully recoverable and predictable in its behavior. This approach is ideal whenever you are handling a large number of independent stateless requests, such as Web requests in ASP.NET, and you want to isolate the execution of their processing. If an asynchronous exception is raised, a worker process dies and the requests being worked on will simply not be serviced; they need to be resubmitted. This, however, makes the approach ill-suited for long and expensive operations in which the cost of resubmitting the job may be too high. Still, this is the cheapest way to provide a reliable service running managed code. You are effectively making use of the runtime’s default behavior.

Mirrored Processes

IT shops make extensive use of redundancy for everything from local disk drives to entire servers. If a disk or server fails, a second one that is in sync with the first quickly takes over. A similar approach can be taken with a process. You can design software that runs two copies of a process on the same machine, each receiving the same input and producing the same output. Under circumstances where the main process fails due to a temporary fault in that specific process (as opposed to a reproducible bug that will affect all instances of the process), this model will provide some resiliency. You should also use a transacted store in this scenario to ensure that any failed requests can be safely rolled back.

The National Aeronautics and Space Administration (NASA) uses a model like this for the computers on the space shuttle. When life-and-death operations depend on a computer, some degree of redundancy is a must. NASA actually ran into a problem when using just two computers in this scenario: a situation occurred where the two computers disagreed on the result of a computation. Which one was

Dealing with Failure

By now, you’ve probably identified the key to solving this problem.
A resilient application is one that isolates the work in units that can independently fail without affecting the other units. So far, three models have been proven successful at creating resilient, managed applications.
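
To make the second of these models more concrete, here is a minimal Python sketch of the active/dormant arrangement, in which the dormant process takes over on failure and a new dormant process is spawned as backup. It is only an illustration under stated assumptions: the names IDLE_WORKER, spawn, and MirroredPair are hypothetical, the workers merely sleep instead of doing real work, and the sketch covers failover only. It does not feed both processes the same input or use a transacted store.

```python
import subprocess
import sys

# Hypothetical stand-in for a real worker: it just stays alive for a while.
IDLE_WORKER = "import time; time.sleep(60)"


def spawn():
    """Start one worker process (assumption: workers are interchangeable)."""
    return subprocess.Popen([sys.executable, "-c", IDLE_WORKER])


class MirroredPair:
    """Keeps one active and one dormant process, promoting on failure."""

    def __init__(self):
        self.active = spawn()
        self.standby = spawn()

    def check(self):
        """Fail over if the active process has exited.

        Returns True when a failover happened, False otherwise.
        """
        if self.active.poll() is None:
            return False  # active process is still running
        # Promote the dormant process and spawn a fresh backup.
        self.active = self.standby
        self.standby = spawn()
        return True

    def shutdown(self):
        """Terminate both processes and wait for them to exit."""
        for proc in (self.active, self.standby):
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    pair = MirroredPair()
    pair.active.kill()   # simulate the active process failing
    pair.active.wait()
    print("failover happened:", pair.check())
    pair.shutdown()
```

A supervisor would call check periodically; how often to poll, and how to hand in-flight work to the promoted process, are design decisions this sketch leaves open.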