Dr. Dobb's Journal - June 2008 - (Page 48) d06sutt_p3ds 4/11/08 11:19 AM Page 48 Effective Concurrency by Herb Sutter Maximize Locality, Minimize Contention Want to kill your parallel application’s scalability? Easy: Just add a dash of contention IN THE CONCURRENT WORLD, locality is a firstorder issue that trumps most other performance considerations. Now locality is no longer just about fitting well into cache and RAM, but to avoid scalability busters by keeping tightly coupled data physically close together and separately used data far, far apart. Now at least some of the threads’ work can be done in parallel. Of course, we still hit Amdahl’s Law [1]: Even on infinitely parallel hardware with an infinite number of workers, the maximum speedup is merely (s+p)/s, minus whatever overhead we incur to perform the locking and context switches. But if we fail to measure (ahem) and the time spent in p is much less than in s, we’re really gaining little and we’ve again written a convoy. We’d never willingly do this. But the ugly truth is that we do it all the time: Sometimes it happens unintentionally; for example, when some function we call might be taking a lock on a popular mutex unbeknownst to us. But often it happens invisibly, when hardware will helpfully take exactly such an exclusive lock automatically, and silently, on our behalf. Let’s see how, why, and what we can do about it. Of Course, You’d Never Convoy On a Global Lock Nobody would ever willingly write code like this in a tight loop. // Threads 1-N while( ) { globalMutex.lock(); DoWork(); globalMutex.unlock(); } Recap: Chunky Memory For high-performance code, we’ve always had to be aware of paging and caching effects. Now, hardware concurrency adds a whole new layer to consider. When you ask for a byte of memory, the system never retrieves just one byte. We probably all know that nearly all computer systems keep track—not of bytes, but of chunks of memory. There are two major levels at which chunking occurs: the operating system chunks virtual memory into pages, each of which is managed as a unit, and the cache hardware further chunks memory into cache lines, which again are each handled as a unit. Figure 1 shows a simplified view. (In a previous article, we considered some issues that arise from nonshared caches, where only subsets of processors share caches in common [2].) First, consider memory pages: How big is a page? That’s up to the OS and varies by platform, but on mainstream systems the page size is typically 4K or more (see Figure 2). So when you ask for just one byte on a page that’s not currently in memory, you incur two main costs: • Speed: A page fault where the OS has to load the entire page from disk. • Space: Memory overhead for storing the entire page in memory, even if you only ever touch one byte from the page. Second, consider cache lines: How big is a line? That’s up to the cache hardware and again varies, but on mainstream systems the line size is typically 64 bytes This is clearly foolish, because all the work is being done while holding a global lock and so only one thread can make progress at a time. We end up with a classic lock convoy: At any given time, all of the threads save one are piled up behind the lock as each waits idly for its turn to come. Convoys are a classic way to kill parallel scalability. In this example, we’ll get no parallel speedup at all because this is just a fancy way to write sequential code. In fact, we’ll probably get a minor performance hit because of taking and releasing the lock many times and incurring context switches, and so we would be better off just putting the lock around the whole loop and making the convoy more obvious. True, we sometimes still gain some parallel benefit when only part of each thread’s work is done while holding the lock: // Threads 1-N while( ) { DoParallelWork(); globalMutex.lock(); DoSequentialWork(); globalMutex.unlock(); } // p = time spent here, // parallel portion // s = time spent here, // sequential portion Figure 1: Chunking in the memory hierarchy. Figure 2: Load a byte = load a line + load a page. 48 Dr. Dobb’s Journal l www.ddj.com l June 2008 http://www.ddj.com
Table of Contents Feed for the Digital Edition of Dr. Dobb's Journal - June 2008 Dr. Dobb's Journal - June 2008 Contents Friday Night Fish Fry Alia Vox Developer Diaries There Must Be Contest Conversations Building a Test Harness for RTOS QT and Windows CE Software to Hardware Parallelization Performance Portable C++ Effective Concurrency The Agile Edge Swaine's Flames Dr. Dobb's Journal - June 2008 Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page Cover1) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page Cover2) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 1) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 2) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 3) Dr. Dobb's Journal - June 2008 - Contents (Page 4) Dr. Dobb's Journal - June 2008 - Contents (Page 5) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 6) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 7) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 8) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 9) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 10) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 11) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 12) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 13) Dr. Dobb's Journal - June 2008 - Developer Diaries (Page 14) Dr. Dobb's Journal - June 2008 - Developer Diaries (Page 15) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 16) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 17) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 18) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 19) Dr. Dobb's Journal - June 2008 - Conversations (Page 20) Dr. Dobb's Journal - June 2008 - Conversations (Page 21) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 22) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 23) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 24) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page IBM-1) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page IMB-2) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 25) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 26) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 27) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 28) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 29) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 30) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 31) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 32) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 33) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 34) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 35) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 36) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 37) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 38) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 39) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 40) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 41) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 42) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 43) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 44) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 45) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 46) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 47) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 48) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 49) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 50) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 51) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 52) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 53) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 54) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 55) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page 56) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page Cover3) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page Cover4)
For optimal viewing of this digital publication, please enable JavaScript and then refresh the page. If you would like to try to load the digital publication without using Flash Player detection, please click here.