Dr. Dobb's Journal - December 2008 - (Page 48) D12sutt_p3db 10/10/08 9:13 AM Page 48 Effective Concurrency wait_all( wholesale, retail ); return wholesale.value() + retail.value() – returns; } Naturally, what matters most is not the total overhead, but how big it is compared to the total work. We want the cost of each chunk of work to be significantly larger than the cost of performing it asynchronously instead of synchronously. even just one. Even in sequential code, it’s typical to switch to something like bubble sort when the subrange’s size falls below a threshold; similarly in parallel code, for small ranges we should switch to a sequential sort. Example 6 incorporates both of these changes: // Example 6: An improved parallel quicksort, still // sorts subranges in parallel but more efficiently // void ParallelQuicksort( Iterator first, Iterator last ) { if( distance(first,last) <= threshold ) { SequentialSort( first, last ); } else { Iterator pivot = Partition( first, last ); future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } ); ParallelQuicksort( pivot, last ); f1.wait(); } } The Cost of Unrealized Concurrency A key cost in today’s environments is the cost of unrealized concurrency. What’s the cost of our parallel code compared to the sequential code in the case when the parallel algorithm actually ends up running sequentially? For example, what happens if our parallel-ready code executes on a machine with just one core, so that we don’t actually get to realize the concurrency because the tasks end up running sequentially anyway (e.g., there’s only one pool thread)? We’ve added the overhead to express concurrency, and we pay for that overhead even if we don’t get to benefit from it on a particular system. If Example 1 is the code we shipped in MyApp version 1.0 and Example 4 is what we’ll ship in MyApp 2.0, then an existing customer with a legacy single-core machine may find that the new application is actually slower than the old one, even though the new application will run better when more cores are available. To mitigate this on low- and single-core machines, we can reduce the overhead by adjusting granularity to use fewer and larger chunks of work, or even switch to a sequential implementation. In general, we want to slice the work as finely as possible, but not to the point where the work is comparable in size to the overhead. Forward-Looking Note: Work Stealing Future runtime systems will significantly drive down all of these costs, including the overhead per chunk and the cost of unrealized concurrency, to the point where we will often be able to ignore it and blithely write code like Example 5 without worrying about its performance most of the time. The basic idea is to use work stealing whereby default “potentially asynchronous work” is actually not shipped elsewhere, but rather queued up to be executed on the original thread. Only if another core runs out of work and sees that our thread has waiting queued work to be performed, will the work be “stolen” and efficiently shipped to the other thread. The idea is to drive down the cost of unrealized concurrency by only actually incurring the overhead of running on another core if it’s worth doing at that particular instant—on this hardware and with this amount of other work currently occupying the machine; and the very next execution of the very same function on the very same hardware might not steal, if all the cores are already busy. Sample current and upcoming technologies that feature work stealing runtimes include: Intel Threading Building Blocks; Microsoft’s Parallel Patterns Library, Task Parallel Library, and PLINQ; Java 7’s Fork/Join framework; and the granddaddy of them all, Cilk, which popularized the technique among implementers. The Double-Edged Sword of Fine-Grainedness Even in Example 4, the code only scales to at most three cores. Can we do better? Yes, if we can make the work more fine-grained, slicing the work to be done more finely into smaller chunks. One approach in Example 4 would be to apply similar techniques within CalcWholesale, CalcRetail, and TotalReturns to further decompose their work into concurrent subtasks. Even better is when we can exploit natural parallelism in algorithms (e.g., divide-and-conquer algorithms like quicksort) and data structures (e.g., trees and graphs) to subdivide our work in a way that scales with the amount of data. But now we encounter a fundamental tension between scalability and overhead: The smaller the chunks of work, the more chunks we have and the more easily we can distribute them to utilize larger number of cores to get an answer faster—but the more the overhead per chunk starts to dominate in our performance costs. Let’s consider quicksort as a common example. Can you spot the performance flaws in this code? // Example 5 (flawed, today): Naïve parallel quicksort, // sorts left and right subranges in parallel // void ParallelQuicksort( Iterator first, Iterator last ) { Iterator pivot = Partition( first, last ); future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } ); future f2 = pool.run( [&]{ ParallelQuicksort( pivot, last ); } ); wait_all( f1, f2 ); } On Deck To understand your code’s scalability, first identify the key variables that affect the workload, and then measure throughput for workloads with different combinations of those variables. Look for scalability, and how it hits the contention and oversubscription barriers. Prefer thread pools (today) and work stealing (tomorrow). Next month, we’re going to apply the tools we discussed this month to analyze the performance impact of specific optimizations on concrete code. Fasten your seat belts. DDJ Herb is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca. We can improve this in at least two ways. First, as noted earlier, we should take the tail chunk of work and do it ourselves to mitigate the overhead of shipping work to a pool thread in today’s environments. Second, we can notice that most of the spun-off work will be at the leaves of the computation containing subranges of a few elements or 48 Dr. Dobb’s Journal l www.ddj.com l December 2008 http://www.gotw.ca http://www.ddj.com
Table of Contents Feed for the Digital Edition of Dr. Dobb's Journal - December 2008 Dr. Dobb's Journal - December 2008 Contents Friday Night Fish Fry Alia Vox Developer Diaries Conversations The Man Who Sold the Sky Performance on Rails LINQ-to-SQL and T-SQL A Remote Java RMI Registry Beyond B-Trees File Descriptors and Multithreaded Programs Effective Concurrency The Agile Edge Swaine's Flames Dr. Dobb's Journal - December 2008 Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page Cover1) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page Cover2) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 1) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 2) Dr. Dobb's Journal - December 2008 - Dr. Dobb's Journal - December 2008 (Page 3) Dr. Dobb's Journal - December 2008 - Contents (Page 4) Dr. Dobb's Journal - December 2008 - Contents (Page 5) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 6) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 7) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 8) Dr. Dobb's Journal - December 2008 - Friday Night Fish Fry (Page 9) Dr. Dobb's Journal - December 2008 - Alia Vox (Page 10) Dr. Dobb's Journal - December 2008 - Alia Vox (Page 11) Dr. Dobb's Journal - December 2008 - Developer Diaries (Page 12) Dr. Dobb's Journal - December 2008 - Developer Diaries (Page 13) Dr. Dobb's Journal - December 2008 - Conversations (Page 14) Dr. Dobb's Journal - December 2008 - Conversations (Page 15) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 16) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 17) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 18) Dr. Dobb's Journal - December 2008 - The Man Who Sold the Sky (Page 19) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 20) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 21) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 22) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 23) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 24) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 25) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 26) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 27) Dr. Dobb's Journal - December 2008 - Performance on Rails (Page 28) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 29) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 30) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 31) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 32) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 33) Dr. Dobb's Journal - December 2008 - LINQ-to-SQL and T-SQL (Page 34) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 35) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 36) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 37) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 38) Dr. Dobb's Journal - December 2008 - A Remote Java RMI Registry (Page 39) Dr. Dobb's Journal - December 2008 - Beyond B-Trees (Page 40) Dr. Dobb's Journal - December 2008 - Beyond B-Trees (Page 41) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 42) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 43) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 44) Dr. Dobb's Journal - December 2008 - File Descriptors and Multithreaded Programs (Page 45) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 46) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 47) Dr. Dobb's Journal - December 2008 - Effective Concurrency (Page 48) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 49) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 50) Dr. Dobb's Journal - December 2008 - The Agile Edge (Page 51) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page 52) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page Cover3) Dr. Dobb's Journal - December 2008 - Swaine's Flames (Page Cover4)
For optimal viewing of this digital publication, please enable JavaScript and then refresh the page. If you would like to try to load the digital publication without using Flash Player detection, please click here.