Dr. Dobb's Journal - December 2008 - (Page 46) D12sutt_p3db 10/10/08 9:13 AM Page 46 Effective Concurrency by Herb Sutter Understanding Parallel Performance How do we know we’ve succeeded? LET’S SAY THAT we’ve slickly written our code to apply divide-and-conquer algorithms and concurrent data structures and parallel traversals and all our other cool tricks that make our code wonderfully scalable in theory. Question: How do we know how well we’ve actually succeeded? Do we really know, or did we just try a couple of tests on a quad-core that looked reasonable and call it good? What key factors must we measure to understand our code’s performance, and answer not only whether our code scales, but quantify how well under different circumstances and workloads? What costs of concurrency do we have to take into account? This month, I’ll summarize some key issues we need to keep in mind to accurately analyze the real performance of our parallel code. I’ll list some basic considerations, and then some common costs. Next month, I have a treat in store: We’ll take some real code and apply these techniques to analyze its performance in detail as we successively apply a number of optimizations and measure how much each one actually buys us, under what conditions and in what directions, and why. the greater the throughput. We can get a sense of how this particular algorithm scales, and in what directions, by examining how and where throughput grows and shrinks. In this example, we have good scalability up to a total of about 15 threads before we start to peak and realize no further gains, and we can see that scalability is better when there are somewhat more producers than consumers in the system. Figure 2 directly compares different candidate algorithms running the same kind of workload. Peak throughput occurs, naturally enough, at the peak of each curve. Scalability shows up directly as the left-hand ascending side of the curve; the steeper it is, and the farther to the right that it goes before topping out, the more scalable our code will be. Here, the blue algorithm demonstrates the horror of negative scalability; it actually manages to get less work done using additional cores, which probably won’t earn us our next raise. But both figures also let us see two basic penalties. Contention arises when different workers interfere with each other by fighting for resources, such as contending for mutexes, cache space, or cache lines via false sharing. In the most extreme case, adding a new worker might actually cost more in contention than it adds in extra work, resulting in less total work being done, and we can see this extreme effect in both graphs: In Figure 1, in several directions we reach areas where adding more workers makes total throughput actually go down. In Figure 2, we see the same effect in the form of the right-hand downslope where adding more work actually decreases throughput even when there are otherwise-idle cores. But Figure 2 also lets us more clearly see the effect of contention before it gets that far: On the left-hand upslope, even as throughout is still rising, the rate of increase is slowing down as the curve begins to bend. That’s a classic effect of growing contention. The other basic penalty is oversubscription, or having more CPU-bound work ready to execute than we have available hardware. These samples were taken from test runs on a 24-core machine; sure enough, in Figure 1 we see a faint diagonal line where #producers + #consumers = 24, above which throughput is noticeably thinner; sometimes the result is even more dramatic. Similarly, in Figure 2 we see even the best algorithms can’t scale beyond the available cores, and incur a penalty for trying to exceed that number by adding contention at least for CPU time and often also for other resources. With these fundamentals in mind, let’s consider a few specific costs that arise and impact scalability because of contention, oversubscription, and other effects. Fundamentals To understand our code’s scalability, we need to know what to measure and what to look for in the results. First, identify the workload variables: What are the key different kinds of work and/or data? For example, we may want to measure how well a producer-consumer system performs with varying numbers of producer threads and consumer threads, or measure a container by how well it scales when holding data items of different sizes. Second, use stress tests to measure throughput, or the total amount of work accomplished per unit time, while varying each of these dimensions so that we can measure their relative impact. Look for scalability trends, the change in throughput as we add hardware resources: Can we effectively use more cores to either get the answer faster or to get more work done? Figures 1 and 2 show two useful visualization tools that will help us understand our code’s parallel performance. Figure 1 shows a sample scatter graph that charts throughput results for a selected algorithm against different numbers of two kinds of workers: producer threads and consumer threads. The larger the bubble, 46 Dr. Dobb’s Journal l www.ddj.com l December 2008 Sources of Overhead, and Threads versus Pools We incur a basic concurrency overhead from just expressing code in a parallel way. Consider the following toy example that performs three independent subcomputations to generate a result: // Example 1: Sequential code in MyApp 1.0 // int NetSales() { int wholesale = CalcWholesale(); int retail = CalcRetail(); int returns = TotalReturns(); return wholesale + retail - returns; } http://www.ddj.com
Table of Contents Feed for the Digital Edition of Dr. Dobb's Journal - December 2008 Dr. Dobb's Journal - December 2008 Contents Friday Night Fish Fry Alia Vox Developer Diaries Conversations The Man Who Sold the Sky Performance on Rails LINQ-to-SQL and T-SQL A Remote Java RMI Registry Beyond B-Trees File Descriptors and Multithreaded Programs Effective Concurrency The Agile Edge Swaine's Flames Dr. Dobb's Journal - December 2008 Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page Cover1) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page Cover2) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 1) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 2) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 3) Dr. Dobb's Journal - December 2008 - Contents (Page 4) Dr. Dobb's Journal - December 2008 - Contents (Page 5) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 6) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 7) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 8) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 9) Dr. Dobb's Journal - December 2008 - Alia Vox (Page 10) Dr. Dobb's Journal - December 2008 - Alia Vox (Page 11) Dr. Dobb's Journal - December 2008 - Developer Diaries (Page 12) Dr. Dobb's Journal - December 2008 - Developer Diaries (Page 13) Dr. Dobb's Journal - December 2008 - Conversations (Page 14) Dr. Dobb's Journal - December 2008 - Conversations (Page 15) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 16) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 17) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 18) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 19) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 20) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 21) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 22) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 23) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 24) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 25) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 26) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 27) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 28) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 29) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 30) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 31) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 32) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 33) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 34) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 35) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 36) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 37) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 38) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 39) Dr. Dobb's Journal - December 2008 - Beyond B-Trees (Page 40) Dr. Dobb's Journal - December 2008 - Beyond B-Trees (Page 41) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 42) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 43) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 44) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 45) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 46) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 47) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 48) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 49) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 50) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 51) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page 52) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page Cover3) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page Cover4)
For optimal viewing of this digital publication, please enable JavaScript and then refresh the page. If you would like to try to load the digital publication without using Flash Player detection, please click here.