Embedded Systems Design Europe - November 2007 - (Page 31)

multiprocessing

sustainable power budgets as well as physical size, thermal, and packaging constraints. Adding more devices to resolve performance issues forces other tradeoffs and adds yet another component to an already lengthy bill of materials.

Modern FPGAs, with their ability to integrate multiple processors and coprocessors in a single device, provide one solution to this problem. In a modern FPGA-based application, one processor may be used to run an operating system. Further integration may be achieved by adding coprocessors for noncritical algorithms. These processors can be integrated with dedicated hardware accelerators, all in the same programmable FPGA device. The result is a hybrid multiprocessing application with a reduced component count.

LEVERAGING PARALLELISM

Solving complex computational problems through integration and parallelism is not new. It’s long been recognized that many of the computing challenges in embedded and high-performance systems can be addressed using parallel-processing techniques. The use of dual- or quad-core processors, multiple processing boards, or even clustered PCs has become commonplace in many applications. In embedded applications, traditional processors can be paired with DSPs, which are often paired with custom or off-the-shelf hardware accelerators.

In recent years, the trend has been to combine multiple processing elements on one device. One example of this multicore approach is the Cell Broadband Engine Architecture, jointly designed by Sony, Toshiba, and IBM. The Cell architecture increases the performance of graphics and video applications by introducing system-level parallelism. It also supports flexible, programmable acceleration that’s highly optimized and provides for high clock frequencies while minimizing power.
The keys to the Cell architecture’s high performance are the Synergistic Processing Elements (SPEs) that provide coherent offload, abundant local memory, and asynchronous coherent DMA engines. End applications, such as multimedia and vector processing, benefit from the combination of the general-purpose processor core and streamlined coprocessing elements. (Editor’s note: see “Programming the Cell Broadband Engine,” Alex Chunghan Chow, June 2006, www.embedded.com/188101999 for more info on SPEs.)

Figure 1 shows Nvidia’s Compute Unified Device Architecture (CUDA), another type of parallel processing engine. It’s based on standard graphics processing units (GPUs), which are stream processors (highlighted in light green in the figure) that have been combined to form a general-purpose, streams-oriented parallel processing engine. CUDA provides access to the native instruction set and memory of the parallel computation elements in the GPUs. Like the Cell processor, the CUDA architecture promises higher performance than standard processors, while simplifying software development by using the standard C language for data-intensive problems.

These architectures accelerate performance by providing dedicated processing engines operating in parallel. Parallelism can exist at many levels:

• System level, using multiple CPUs and coprocessors
• Process level, via multiple threads or communicating processes within each processor
• Subroutine and loop levels, using unrolling and pipelining, for example
• Statement level via instruction