To support design optimization and parametric studies with a short total time-to-solution, simulation must be as personally and routinely available to the scientist or engineer as email or word processing. Environments such as Matlab and IDL provide this ease of use, but unless the underlying simulations are extremely fast, they cannot be used in this natural, interactive way. Many large-scale numerical calculations, including such core linear-algebra operations as solving dense linear systems and computing eigenvalues and eigenvectors, require storage and computation that grow as the square or cube of the number of variables. Fast algorithms such as the fast multipole method (FMM), coupled with iterative solvers, allow many problems of interest to be solved in near-linear time and memory. We have taken a leadership role in applying and extending the FMM to problems in acoustics, fluid flow, electromagnetics, function fitting, and machine learning.

Graphics Processing Units (GPUs), now ubiquitous in game consoles, workstations, and other devices, are special-purpose graphics processors predicted to soon deliver performance in the hundreds of gigaflops on specialized calculations, far faster than commodity (COTS) PCs, at low price points. It is now conceivable to equip a personal workstation with several CPUs and GPUs and, using fast algorithms, to solve problems with millions or billions of variables quickly.

We will take an important and widely applicable algorithm, the FMM, implement it on the widely available heterogeneous CPU/GPU architecture, and demonstrate the feasibility of accelerating it dramatically. The basis of our approach is a fundamental reconsideration of the algorithm that maps each of its pieces onto the part of the architecture best suited to it, as sketched below. The developed software will be tested and benchmark problems solved. We will also develop a software library to support porting the FMM and other scientific computations to the CPU/GPU architecture.
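The cost claim above is concrete: evaluating all pairwise interactions phi_i = sum_j q_j / |x_i - x_j| directly requires O(N^2) work, and it is exactly this kind of dense matrix-vector product that the FMM reduces to near-linear work at controlled accuracy. As a minimal illustration (our sketch, not the project's code; all names are ours), the following CUDA program performs the direct O(N^2) evaluation on the GPU:

    // Direct O(N^2) evaluation of phi_i = sum_{j != i} q_j / |x_i - x_j|.
    // Illustrative sketch only; the FMM yields the same sums in near-linear time.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void direct_sum(const float4* src,   // .x/.y/.z = position, .w = charge
                               float* phi, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 xi = src[i];
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {               // O(N) per target -> O(N^2) total
            if (j == i) continue;
            float dx = src[j].x - xi.x;
            float dy = src[j].y - xi.y;
            float dz = src[j].z - xi.z;
            acc += src[j].w * rsqrtf(dx*dx + dy*dy + dz*dz);  // q_j / r_ij
        }
        phi[i] = acc;
    }

    int main()
    {
        const int n = 1 << 14;                      // 16,384 points
        float4* src; float* phi;
        cudaMallocManaged(&src, n * sizeof(float4));
        cudaMallocManaged(&phi, n * sizeof(float));
        for (int i = 0; i < n; ++i)                 // random unit-cube points, unit charges
            src[i] = make_float4((float)drand48(), (float)drand48(),
                                 (float)drand48(), 1.0f);
        direct_sum<<<(n + 255) / 256, 256>>>(src, phi, n);
        cudaDeviceSynchronize();
        printf("phi[0] = %f\n", phi[0]);
        cudaFree(src); cudaFree(phi);
        return 0;
    }

Even this brute-force kernel suits the GPU, since every thread performs identical, regular arithmetic; it is the tree-structured far-field machinery of the FMM that motivates the CPU/GPU mapping described above.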
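One plausible form of that mapping (a sketch under our own assumptions; the fixed neighbor window standing in for real FMM interaction lists and the placeholder far_field_cpu() are hypothetical, not the project's design) overlaps the regular, compute-bound near-field sums on the GPU with the branchy tree-traversal work on the CPU:

    // Hypothetical CPU/GPU split for one FMM evaluation; names are ours.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <cuda_runtime.h>

    // Near-field (particle-to-particle) phase: dense, regular work that maps
    // well to the GPU. A fixed neighbor window stands in for real interaction lists.
    __global__ void near_field(const float4* src, float* phi, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 xi = src[i];
        float acc = 0.0f;
        for (int j = max(0, i - 16); j < min(n, i + 17); ++j) {
            if (j == i) continue;
            float dx = src[j].x - xi.x, dy = src[j].y - xi.y, dz = src[j].z - xi.z;
            acc += src[j].w * rsqrtf(dx*dx + dy*dy + dz*dz);
        }
        phi[i] = acc;
    }

    // Placeholder for the far-field phase (octree traversal, multipole-to-local
    // translations): irregular, branchy work kept on the CPU.
    void far_field_cpu(std::vector<float>& phi_far)
    {
        // A real FMM would fill phi_far from the translated local expansions.
    }

    int main()
    {
        const int n = 1 << 14;
        float4* src; float* phi_near;
        cudaMallocManaged(&src, n * sizeof(float4));
        cudaMallocManaged(&phi_near, n * sizeof(float));
        for (int i = 0; i < n; ++i)
            src[i] = make_float4((float)drand48(), (float)drand48(),
                                 (float)drand48(), 1.0f);

        near_field<<<(n + 255) / 256, 256>>>(src, phi_near, n);  // GPU half, asynchronous

        std::vector<float> phi_far(n, 0.0f);
        far_field_cpu(phi_far);             // CPU half runs while the kernel executes

        cudaDeviceSynchronize();            // join the two halves
        for (int i = 0; i < n; ++i) phi_near[i] += phi_far[i];
        printf("phi[0] = %f\n", phi_near[0]);

        cudaFree(src); cudaFree(phi_near);
        return 0;
    }

The design point the sketch captures: the GPU receives the arithmetic-dense, data-parallel piece, the CPU keeps the irregular pointer-chasing piece, and the asynchronous kernel launch lets the two proceed concurrently.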