MSDN Magazine - December 2007 - (Page 21)

initialization), but this pessimistic heuristic is the tightest we have found that does not require an unacceptable additional level of discipline from users. You can safely use interlocked operations, because each one edits a single piece of shared state atomically. Knowing whether one succeeded is also easy: it either did or it did not.

Escalation Policy

The CLR’s escalation policy goes a bit further. We’d like to give user code a chance to clean up when threads are aborted, so the CLR attempts to run finally blocks and finalizers when aborting threads and before the AppDomain is unloaded. But there is tension between the user’s desire to run arbitrary cleanup code and the host’s availability needs. The host can impose a timeout on running finally blocks and finalizers and simply abort them if they do not finish within a reasonable time. A harder question is how to be resilient if an asynchronous exception occurs while accessing a process-wide, machine-wide, or cross-machine piece of state. We’ll discuss this in more detail later.

For a transacted host like SQL Server that guarantees its own consistency, all other concerns about user and library code (like correctness, performance, and maintainability) are secondary to the host’s ability to guarantee its availability. If user code has a bug that occasionally causes a crash, the server will live on as long as it can make forward progress and doesn’t degrade over time. Clean AppDomain unloading and careful resource management allow your code to fight against the CLR’s escalation policy, enabling you to identify code that must have an opportunity to run to ensure consistency of a resource or to ensure that no resources are leaked. In most cases, an asynchronous exception will have already happened or will be unpreventable; these are the tools your code will need to become resilient to failure.
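To make the earlier remark about interlocked operations concrete, here is a minimal C# sketch. The type and member names (RequestCounter, Increment, TryClaimInitialization) are hypothetical and chosen only for illustration; the Interlocked calls themselves are standard System.Threading APIs. Each call touches exactly one integer, so an interruption leaves that integer either fully updated or untouched.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration; not taken from the article.
public static class RequestCounter
{
    private static int _count;        // shared state: a single 32-bit integer
    private static int _initialized;  // 0 = unclaimed, 1 = claimed

    // Atomically adds one to the counter and returns the new value.
    public static int Increment()
    {
        return Interlocked.Increment(ref _count);
    }

    // Returns true for exactly one caller, even under contention.
    // CompareExchange writes 1 only if the current value is 0, and
    // returns the value that was stored before the call.
    public static bool TryClaimInitialization()
    {
        return Interlocked.CompareExchange(ref _initialized, 1, 0) == 0;
    }
}
```

The return value of CompareExchange is what makes success easy to detect: the caller compares it with the expected old value and knows immediately whether its write took effect.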
The most important advice for library writers is to use SafeHandle, but you also need to understand several other available features to fully grasp why SafeHandle matters.

Choosing a Reliability Bar

Not all code is equally critical. You should think about what level of reliability is required from a certain block of code. The techniques we describe can increase development costs, so a good engineer should determine how much resiliency is necessary for a block of code.

First, ask yourself what your code should do when a power failure occurs. As a starting point, all code must be able to restart and work correctly after a power failure. Even if client applications will lose work and encounter data corruption, they still must, at a bare minimum, be able to start back up. Designing mail servers to ensure they don’t lose e-mail when a power failure occurs is a significantly harder problem. Similarly, software controlling a nuclear power plant must be able to tolerate such failures with greater resilience than, say, a basic productivity app. Figuring out where your application falls on this continuum will help inform your decisions on how much to invest in resiliency.

For most client applications, the surprising answer is to do very little. Surviving an async exception is overkill for most client apps. Killing the process and restarting is often sufficient. When combined with the Windows Vista® Restart Manager APIs, this approach can also limit how much state is lost when the client app crashes. Outlook® offers a great example. If Outlook 2007 crashes on Windows Vista, it can recover and reopen all windows to the right spot. If you were composing a message when Outlook crashed, you might lose only the last minute or two of typing, rather than the entire message.
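Returning to the SafeHandle advice above, the following C# sketch shows the general shape of a SafeHandle subclass. It is an assumption-laden illustration rather than the article's own code: the class name SafeUnmanagedBuffer is hypothetical, and the resource is memory from Marshal.AllocHGlobal so the example stays self-contained. The base type SafeHandleZeroOrMinusOneIsInvalid and the ReleaseHandle override are the standard .NET extension points.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical wrapper for illustration; not taken from the article.
public sealed class SafeUnmanagedBuffer : SafeHandleZeroOrMinusOneIsInvalid
{
    // ownsHandle: true means this object is responsible for freeing
    // the resource when it is disposed or finalized.
    public SafeUnmanagedBuffer(int byteCount) : base(true)
    {
        // Assumption: allocating in the constructor keeps the sketch short.
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    // The runtime calls this at most once, and only when the stored
    // handle is valid, from either Dispose or the finalizer.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

A caller would typically hold the object in a using statement so the memory is freed deterministically. On the .NET Framework, SafeHandle derives from CriticalFinalizerObject, so the finalizer still serves as a backstop if Dispose is never reached. In library code the raw handle would more commonly come back from a P/Invoke declared to return the SafeHandle type, which avoids the window between allocation and SetHandle that this simplified constructor leaves open.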
For libraries, the reliability bar is determined by the most aggressive host in which your code will run. If your library is used by hosts that recycle processes, your reliability needs are lower than if it is used by a host that recycles AppDomains. However, if your library allows

CLR Inside Out, December 2007, page 21

Resiliency to Escalation

Escalation policy also imposes limits on code, whether it is written by users or by library authors. For SQL Server, the CLR allows stored procedures to be written in managed code, with some significant restrictions on what that code can express. For scalability, reliability, and security reasons, user code in SQL Server should not launch or terminate threads, it should minimize or completely avoid shared state, and it should not be allowed to access certain types of OS resources. However, trusted libraries such as the .NET Framework must access these resources, often on behalf of this relatively untrusted user code.

The CLR provides code access security as a first line of defense to restrict the set of permissions given to user code. However, the CLR does not include permissions for all interesting resource types. For this purpose, the CLR defines the HostProtectionAttribute, which can be used to mark methods that raise programming model concerns, such as the ability to kill threads.

These limitations on user code are actually a very good thing: because user code is restricted from accessing OS resources directly (along with other limitations), it is freed from the responsibility of tracking its use of these resources. In the process recycling world, whenever a process is killed, the OS frees all of the machine-wide resources used by the process. For AppDomain recycling to truly provide resiliency, AppDomain unloading must provide the same level of guarantees.
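As an illustration of the HostProtectionAttribute mentioned above, here is a minimal C# sketch of how a library method that starts threads might be annotated. The class and method names (BackgroundWork, StartWorker) are hypothetical. The attribute, its SecurityAction.LinkDemand argument, and the ExternalThreading property come from System.Security.Permissions in the .NET Framework; the annotation has an effect only when a host has opted in to enforcing that category.

```csharp
using System;
using System.Security.Permissions;
using System.Threading;

// Hypothetical library type for illustration; not taken from the article.
public static class BackgroundWork
{
    // ExternalThreading flags that this method creates or manipulates
    // threads, which a host such as SQL Server may choose to disallow
    // for the user code that calls it.
    [HostProtection(SecurityAction.LinkDemand, ExternalThreading = true)]
    public static Thread StartWorker(ThreadStart work)
    {
        Thread worker = new Thread(work);
        worker.IsBackground = true;  // do not keep the process alive
        worker.Start();
        return worker;
    }
}
```

Outside a host that enforces host protection, the method behaves like any other: the attribute is declarative metadata, not a runtime check written into the method body.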
Since an AppDomain is just a unit within the process, managed libraries that provide access to resources must fill the gap between the operating system’s ability to clean up on process exit and the demands of AppDomain unloading.

Writing Reliable Code

AppDomain unloading must be clean. This is the guiding principle for writing libraries resilient to async exceptions. For a trans-