MSDN Magazine - December 2007 - (Page 17)

Writing Reliable .NET Code
ALESSANDRO CATORCINI AND BRIAN GRUNKEMEYER

When we talk about something being reliable, we’re referring to it being dependable and predictable. When it comes to software, however, there are other key attributes that must also be present for the code to be considered reliable. Software needs to be resilient, meaning that it will continue to function in the face of internal and external disruptions. It must be recoverable, such that it knows how to restore itself to a previously known, consistent state. The software needs to be predictable, so it will provide timely and expected service. It must be undisruptable, meaning that changes and upgrades won’t affect its service. And, finally, the software must be production-ready, meaning that it contains a minimal number of bugs and will require only a limited number of updates. When these criteria are met, then the software can be considered truly reliable.

These key attributes of reliable code depend on various factors: some depend on the overall architecture of the software, some depend upon the OS on which the software will run, and others depend on the tools used to develop the application and the framework on which it is built. Resilience is an attribute that relies on every layer, and an application will only be as resilient as its weakest link.

Now consider Microsoft® .NET Framework-based applications. These apps delegate to the runtime certain operations that in a native environment either did not exist (such as just-in-time compilation of IL code) or were under the direct control of the developer (such as memory management). In terms of reliability, the platform itself can introduce its own points of failure that impact the reliability of the applications that run on top of it. It’s important to understand where these breakdowns can occur and what techniques you can use to create more reliable .NET-based apps.

A Look at Runtime Failure

There are certain exceptional events that can occur at any time and in any code section. These events, which we will call asynchronous exceptions, are resource exhaustions (out of memory and stack overflows), thread aborts, and access violations. (In the execution of managed code, access violations occur in the runtime.) This last case is not very interesting: if this event does actually occur, it means that a serious bug in the implementation of the common language runtime (CLR) is being exposed and should be fixed. The first two cases, however, deserve deeper analysis.

In theory, we would imagine that resource exhaustions would be gracefully managed by the runtime and that they would never affect the ability of application code to continue running. That’s just theory, though; reality is more complex. To explain, we’ll start by taking a look at how some popular server applications deal with out-of-memory (OOM) events.

Server applications, such as ASP.NET and Exchange Server 2007, that require very high availability have achieved this through AppDomain and process recycling. The operating system provides a very powerful mechanism to clean up memory and most other resources used by a process; all this is done for you when the process terminates.

A resilient application is one that isolates the work in units that can independently fail without affecting the other units.

In a client scenario, when memory pressure gets to the point that even small allocations fail, the overall system reaches such a level of unresponsiveness due to extensive thrashing and paging that the user is much more likely to reach for the reset button or the task manager than to allow any recovery code to run. In a sense, the user’s initial reaction is to perform the same action manually that ASP.NET or Exchange 2007 will do automatically.

Some OOMs may not even be caused by any particular issue with the running code.
Another process running on the machine or another AppDomain running in the process may be hogging the available resource pool and causing allocations to fail. In this sense, you should consider resource exhaustions to be asynchronous in that they can occur at any time in the execution of code and they may depend on environmental factors external to and independent from the running code.

This problem is exacerbated by the fact that the runtime may allocate memory to perform operations related to its own workings. Here are a few examples of allocations that happen in the runtime that could fail in a constrained resource environment:

• boxing and unboxing
• delayed class loading until the first use of the class
• remoted operations on MarshalByRef objects
• certain operations on strings
• security checks
• JITing methods

This is just a partial list of the many internal operations in the