Dr. Dobb's Journal - June 2008 - (Page 36) d06cass_p4ds 4/11/08 11:00 AM Page 36 State of the Art by Steve Casselman Software to Hardware Parallelization Optimizing algorithms for multicore hardware Steve is CTO and cofounder of DRC Computer and can be contacted at www.drccomputer.com. Most of the software written today involves instructions executed in sequence. To speed up execution, programmers have typically pushed hardware designers to build processors with increasingly higher clock rates, resulting in heavily pipelined processors that operate at clock rates of 3 GHz and more. To get the most out of every clock cycle, these processors resort to tricks such as large caches and out-oforder execution. However, faster processors generate lots of heat. As a result, clock speeds have, for the most part, leveled off since the heat generated by faster circuits ends up constraining clock speeds. To continue the march towards faster execution, hardware designers have switched from single processors on a chip to dual, quad, and even more CPU cores on a single chip. The operating system then allocates processors to different applications, all running in parallel. The next challenge is to find ways to parallelize the code running in each application, then run that parallelized code on multiple engines either within the CPU or in a companion coprocessor that’s optimized to execute that particular segment of parallelized code. Hardware designers have turned to the latest generation of field programmable gate arrays (FPGAs) and the new open-standard Torrenza coprocessor interface on Opteron system platforms defined by Advanced Micro Devices, and the Intel QuickAssist Technology Acceleration Abstraction Layer (AAL) for the hardware portion of the parallelization goal. By downloading configurations into an FPGA tied to either a Torrenza or QuickAssist motherboard platform, designers can accelerate computationally complex algorithms such as encryption, compression, search, and sort up to 1000 times over general-purpose processors. Also, any algorithm that needs billions of integer or floating-point operations per second (image and audio processing, seismic exploration, bioinformatics, and the like) can be accelerated by an FPGA-based coprocessor. To accelerate an algorithm, you must first identify the code within the application that can be parallelized, then figure out how to parallelize that code so it can be executed on an array of computational elements configured in an FPGA or ASIC. But the question is where to start. The logical starting point is to first profile the code to find its computationally intensive portions, then find ways to isolate the code so that it does not have many data dependencies. Once the code has been isolated, you need to find ways to optimize it so that it can be executed on the resources available on the coprocessor. It is difficult to optimize the code to fit on architectures like graphicsprocessing units and the Cell processor, where users are not given all the data for the device. FPGAs, on the other hand, make it easier for you to define an optimized architecture on a case-by-case basis. The next step to accelerating an algorithm is to examine the communication channels between hardware and software. Make sure the hardware is not data starved (not enough data reaches the hardware to keep the hardware continually performing its computations), and at the same time, ensure that the hardware can keep up with the pace at which the software feeds data to the hardware. Finding the right balance in the hardware/software partitioning between system inputs and outputs is critical to smooth operation. If it takes longer to transmit data to the accelerator than it takes the CPU to calculate the answer, make sure that the hardware design uses all memory accesses most efficiently. You may even consider adding another memory port to feed more data to the compute array. For example, if you are accelerating a fast-Fourier transform (FFT) followed by a phase shift and then followed by an inverse FFT in the software algorithm, you take the data out of a memory location, perform the FFT, and then put data back into memory. The data then comes out of the memory again and the algorithm executes the phase shift computations and places the data back into memory one more time. Finally, the inverse FFT retrieves the data, executes the algorithm, and then places the final result back in memory. Main memory accesses are expensive, so reusing data can dramatically improve performance by eliminating multiple store and access operations. 36 Dr. Dobb’s Journal l www.ddj.com l June 2008 http://www.drccomputer.com http://www.ddj.com
Table of Contents Feed for the Digital Edition of Dr. Dobb's Journal - June 2008 Dr. Dobb's Journal - June 2008 Contents Friday Night Fish Fry Alia Vox Developer Diaries There Must Be Contest Conversations Building a Test Harness for RTOS QT and Windows CE Software to Hardware Parallelization Performance Portable C++ Effective Concurrency The Agile Edge Swaine's Flames Dr. Dobb's Journal - June 2008 Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page Cover1) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page Cover2) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 1) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 2) Dr. Dobb's Journal - June 2008 - Dr. Dobb's Journal - June 2008 (Page 3) Dr. Dobb's Journal - June 2008 - Contents (Page 4) Dr. Dobb's Journal - June 2008 - Contents (Page 5) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 6) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 7) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 8) Dr. Dobb's Journal - June 2008 - Friday Night Fish Fry (Page 9) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 10) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 11) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 12) Dr. Dobb's Journal - June 2008 - Alia Vox (Page 13) Dr. Dobb's Journal - June 2008 - Developer Diaries (Page 14) Dr. Dobb's Journal - June 2008 - Developer Diaries (Page 15) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 16) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 17) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 18) Dr. Dobb's Journal - June 2008 - There Must Be Contest (Page 19) Dr. Dobb's Journal - June 2008 - Conversations (Page 20) Dr. Dobb's Journal - June 2008 - Conversations (Page 21) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 22) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 23) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 24) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page IBM-1) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page IMB-2) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 25) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 26) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 27) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 28) Dr. Dobb's Journal - June 2008 - Building a Test Harness for RTOS (Page 29) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 30) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 31) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 32) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 33) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 34) Dr. Dobb's Journal - June 2008 - QT and Windows CE (Page 35) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 36) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 37) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 38) Dr. Dobb's Journal - June 2008 - Software to Hardware Parallelization (Page 39) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 40) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 41) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 42) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 43) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 44) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 45) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 46) Dr. Dobb's Journal - June 2008 - Performance Portable C++ (Page 47) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 48) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 49) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 50) Dr. Dobb's Journal - June 2008 - Effective Concurrency (Page 51) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 52) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 53) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 54) Dr. Dobb's Journal - June 2008 - The Agile Edge (Page 55) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page 56) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page Cover3) Dr. Dobb's Journal - June 2008 - Swaine's Flames (Page Cover4)
For optimal viewing of this digital publication, please enable JavaScript and then refresh the page. If you would like to try to load the digital publication without using Flash Player detection, please click here.