Dr. Dobb's Journal - April 2008 - (Page 54) d04sutt_p5ds 2/13/08 9:53 AM Page 54 Effective Concurrency Parallel search naturally exploits nonuniform distributions of matches that there is another effect that can help: It’s when using more threads lets you use a bigger machine, literally. Running on a Bigger Machine Besides leveraging the superlinearity that can crop up naturally in parallel algorithms, how else could we ever achieve a superlinear speedup by using more parallelism on the same machine? After all, more parallelism means we can use more cores to speed up compute-bound work, but just using more cores in itself gives us only a linear speedup. Right? The fallacy is in the question’s final words: “ …on the same machine.” When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine—a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores typically also means getting access to more cache and/or more memory. To see why this is so, consider Figure 2 and Figure 3. Each of these shows a simplified block diagram of the cores and caches on two modern commodity CPUs: the current Intel “Kentsfield” processor, and the upcoming AMD “Barcelona” processor, respectively. The interesting feature in both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private L2 cache; in the Barcelona chip, each core has its own private L2 cache. In both cases, no core by itself has access to all the available L2 cache…and that means that code running on just one core is limited, not only to the one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a huge difference. In Figure 2, with only one thread running, as expected we see three of the Kentsfield’s four cores are unused. But by running a sequential algorithm we don’t only lose three cores; we also lose half the system’s available L2 cache. When we run a parallel algorithm using two workers (for example, on cores 1 and 3) or more, not only do we get to use more processors, but we also double the amount of L2 cache available—that is, we literally run on a bigger machine with more cache. Similarly on the Barcelona chip shown in Figure 3, each additional worker adds to the total cache size until we saturate the machine with four workers. A similar effect occurs on NUMA (nonuniform memory access) machines. Figure 4 shows a sample NUMA architecture where the machine is organized into nodes that each has its own processors and its own RAM. All the processors can use all the RAM in the machine, but the fly in the ointment is that not all RAM is created equal: RAM that’s attached to other processors is “further away” and more expensive to use, because it requires a trip through the internode interconnect. So this time the question isn’t one of making more RAM available, but rather of making faster RAM available. To illustrate, imagine that in Figure 4, the average time to access memory (in the case of a cache miss) is 200 clock cycles if the request goes to directly connected RAM, and 1000 cycles if it’s to RAM on another node. A program with only one thread running might experience an average memory access time of around 600 cycles, modulo other cache and locality effects. But a program that runs two or more threads that are spread fairly evenly across the two nodes might experience an average cache miss access time of only 200 cycles. Given that waiting for memory is already a common bottleneck for modern high-performance software, and that we could do an awful lot of computation in those 400 clock cycles if we could use them for something better than sitting stalled while waiting for RAM to respond, getting those cycles back can be a big deal. Well-written parallel code that is memory-bound and cachebound can get faster superlinearly to a point because more data fits into memory and cache, and so total computation time goes down because we get more page and cache hits. The more the code makes multiple passes over the data, the more this superlinear effect is amplified. On Deck There are two main ways into the superlinear stratosphere: • Do disproportionately less work. • Harness disproportionately more resources. To accomplish the first point, we saw that interruption was essential. But how should we interrupt (or, to use the C-word, “cancel”) a thread or a unit of work? More on that when we return… Notes [1] H. Sutter. “Going Superlinear” (Dr. Dobb’s Journal, March 2008; www.ddj.com/cpp/206100542. [2] Randomized algorithms do have their place in mitigating worstcase behaviors for sequential code, as noted in this reference: Si. Pi Ravikumar. Parallel Methods for VLSI Layout Design, 4.3.2 (Ablex/Greenwood, 1996). DDJ Figure 4: Sample NUMA processor and cache utilization: 1 thread versus 4 threads. 54 Dr. Dobb’s Journal l www.ddj.com l April 2008 http://www.ddj.com/cpp/206100542 http://www.ddj.com
Table of Contents Feed for the Digital Edition of Dr. Dobb's Journal - April 2008 Dr. Dobb's Journal - April 2008 Contents Hmmmm Alia Vox Developer Diaries Dr. Dobb's Excellence in Programming Award Conversations Fast String Search on Multicore Processors The Byzantine Generals Problem Optimizing Math-Intensive Applications with Fixed-Point Arithmetic Random Numbers in a Range Using Generic Programming The Agile Edge Effective Concurrency Swaine's Flames Dr. Dobb's Journal - April 2008 Dr. Dobb's Journal - April 2008 - Dr. Dobb's Journal - April 2008 (Page Cover1) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Journal - April 2008 (Page Cover2) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Journal - April 2008 (Page 1) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Journal - April 2008 (Page 2) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Journal - April 2008 (Page 3) Dr. Dobb's Journal - April 2008 - Contents (Page 4) Dr. Dobb's Journal - April 2008 - Contents (Page 5) Dr. Dobb's Journal - April 2008 - Hmmmm (Page 6) Dr. Dobb's Journal - April 2008 - Hmmmm (Page 7) Dr. Dobb's Journal - April 2008 - Hmmmm (Page 8) Dr. Dobb's Journal - April 2008 - Hmmmm (Page 9) Dr. Dobb's Journal - April 2008 - Alia Vox (Page 10) Dr. Dobb's Journal - April 2008 - Alia Vox (Page 11) Dr. Dobb's Journal - April 2008 - Developer Diaries (Page 12) Dr. Dobb's Journal - April 2008 - Developer Diaries (Page 13) Dr. Dobb's Journal - April 2008 - Developer Diaries (Page 14) Dr. Dobb's Journal - April 2008 - Developer Diaries (Page 15) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Excellence in Programming Award (Page 16) Dr. Dobb's Journal - April 2008 - Dr. Dobb's Excellence in Programming Award (Page 17) Dr. Dobb's Journal - April 2008 - Conversations (Page 18) Dr. Dobb's Journal - April 2008 - Conversations (Page 19) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 20) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 21) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 22) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 23) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 24) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 25) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 26) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 27) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 28) Dr. Dobb's Journal - April 2008 - Fast String Search on Multicore Processors (Page 29) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 30) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 31) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 32) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 33) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 34) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 35) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 36) Dr. Dobb's Journal - April 2008 - The Byzantine Generals Problem (Page 37) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 38) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 39) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 40) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 41) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 42) Dr. Dobb's Journal - April 2008 - Optimizing Math-Intensive Applications with Fixed-Point Arithmetic (Page 43) Dr. Dobb's Journal - April 2008 - Random Numbers in a Range Using Generic Programming (Page 44) Dr. Dobb's Journal - April 2008 - Random Numbers in a Range Using Generic Programming (Page 45) Dr. Dobb's Journal - April 2008 - Random Numbers in a Range Using Generic Programming (Page 46) Dr. Dobb's Journal - April 2008 - Random Numbers in a Range Using Generic Programming (Page 47) Dr. Dobb's Journal - April 2008 - Random Numbers in a Range Using Generic Programming (Page 48) Dr. Dobb's Journal - April 2008 - The Agile Edge (Page 49) Dr. Dobb's Journal - April 2008 - The Agile Edge (Page 50) Dr. Dobb's Journal - April 2008 - The Agile Edge (Page 51) Dr. Dobb's Journal - April 2008 - Effective Concurrency (Page 52) Dr. Dobb's Journal - April 2008 - Effective Concurrency (Page 53) Dr. Dobb's Journal - April 2008 - Effective Concurrency (Page 54) Dr. Dobb's Journal - April 2008 - Effective Concurrency (Page 55) Dr. Dobb's Journal - April 2008 - Swaine's Flames (Page 56) Dr. Dobb's Journal - April 2008 - Swaine's Flames (Page Cover3) Dr. Dobb's Journal - April 2008 - Swaine's Flames (Page Cover4)
For optimal viewing of this digital publication, please enable JavaScript and then refresh the page. If you would like to try to load the digital publication without using Flash Player detection, please click here.