Stream processors form a new class of architectures that offer performance scalable to TeraOPS on a single chip, have cost and power efficiency comparable to ASICs, and are completely software programmable from high-level languages. Stream processor technology was developed through 8 years (1997-2004) of DARPA-funded research at Stanford University, which resulted in a working prototype IC, software tools, application software, and an evaluation board. Stream Processors Inc. (SPI) has commercialized this technology with the recently announced Storm-1 product family, which delivers performance of up to 112 GMACS at cost and power dissipation levels suitable for embedded systems. This technology provides a 4-20x improvement in GOPS and GOPS/Watt compared to other leading commercially available software-programmable DSPs. As this architecture is scaled to 45nm and smaller CMOS technologies, it will be feasible to integrate hundreds to thousands of ALUs on a single chip, providing TeraOPS of performance. In this proposal, we intend to explore the tradeoffs in how best to scale the architecture from current designs with 80 ALUs on a chip to novel stream processor architectures with hundreds of ALUs running at higher clock rates, for over 10x the available peak performance. The goal is to design an architecture capable of efficiently supporting thread-level, data-level, and instruction-level parallelism while retaining the advantages of the compiler-managed data movement and storage hierarchy afforded by stream processing. This stands in contrast to typical multi-core architectures, which rely primarily on programmer-managed thread-level parallelism for performance, rely on cache-based memory systems to manage data movement, and do not provide the efficiency advantages of stream processing.
Keywords: Stream Processors, High-Performance DSP, Parallel Processing, Signal Processing, Video Processing, Multi-Core