SBIR-STTR Award

Intelligent, real-time migration of scientific computing applications on commercial cloud-based HPC platforms
Award last edited on: 1/5/2023

Sponsored Program
SBIR
Awarding Agency
DOE
Total Award Amount
$1,349,742
Award Phase
2
Solicitation Topic Code
C52-04a
Principal Investigator
Hakim Weatherspoon

Company Information

Exotanium Inc

350 Duffield Hall Suite N
Ithaca, NY 14850
   (607) 218-5948
   N/A
   www.exotanium.io
Location: Multiple
Congr. District: 23
County: Tompkins

Phase I

Contract Number: DE-SC0021862
Start Date: 6/28/2021    Completed: 12/27/2021
Phase I year
2021
Phase I Amount
$199,874
Businesses migrate IT infrastructure and operations from physical servers to the cloud intending to reduce computing workload costs, yet many find that leasing the same amount of server space in the cloud can prove even more costly, making cloud cost management a pain point. This is partly because businesses must overprovision to accommodate potential surges in server use and to ensure that stateful applications, which cannot tolerate any downtime, are not compromised. Furthermore, graphics processing units (GPUs) have spread through the cloud market because of their capabilities as computational accelerators, and demand for them in the cloud has grown accordingly. However, many cloud platforms are limited in their ability to support GPUs and still charge considerable amounts of money for their use. Exotanium plans to develop novel technologies (X-Spot and X-Consolidate) that consolidate idle workloads and over-sized software containers to take advantage of deeply discounted server space such as the Spot market. Additionally, the project will extend the X-Spot platform to support GPUs, processors designed to render images and provide fast graphics processing. This support would enable large, highly compute-intensive applications to run reliably and without interruption while taking advantage of the Spot market's significant cost savings. Exotanium will pursue this through the following three objectives: 1) demonstrating Exotanium support for DOE workloads in the public cloud and GovCloud of AWS, 2) providing GPU migration support, and 3) establishing hybrid cloud support. Exotanium can support DOE workloads in the public cloud at an unprecedented level of price, performance, and ease of use. Exotanium will support DOE HPC applications without requiring them to be changed at all.
Successfully establishing Exotanium’s X-Spot platform to support GPUs could result in a potentially large number of highly compute-intensive DOE applications being run in the Spot market of the AWS and Azure GovClouds at significant cost savings. Finally, demonstrating a successful transparent live migration of workloads between an on-premises private cloud and the public cloud in a hybrid setting could lead to significant cost savings without changing a line of application code, presenting a potential approach for modernization and migration to the cloud.

Phase II

Contract Number: DE-SC0021862
Start Date: 8/22/2022    Completed: 8/21/2024
Phase II year
2022
Phase II Amount
$1,149,868
Cloud computing has the potential to serve as a cost-effective and energy-efficient computing paradigm for scientists to accelerate discoveries. Extensive use of commercial cloud computing resources in the scientific community has the potential to lower costs, accelerate research, and enhance collaboration. However, cloud computing utilization is often suboptimal. Users typically overprovision to accommodate potential surges in server use, as well as to ensure that stateful applications, which cannot tolerate any downtime, are not interrupted. To reduce wasteful spending and enable more efficient usage of cloud resources, a technology is being developed that consolidates idle workloads and over-sized software containers to take advantage of deeply discounted server space such as the Spot market. The technology is a combination of two separate products. The first module spawns containers on discounted VM instances (Spot Instances) and dynamically relocates containers between such instances based on availability and price. The second module packs idle containers onto a small number of VMs during idle periods and relocates containers onto different VMs when workload increases, without any service interruption. This lack of service disruption is a fundamental departure from current market solutions that offer “cloud optimization” requiring manual re-architecting of cloud infrastructure, with significant downtime during testing and redeployment. In Phase I, live migration of government High-Performance Computing (HPC) workloads within a single public cloud was demonstrated. The measured savings were up to 80% compared to on-demand costs, with the same performance (i.e., 5x the amount of compute for the same cost). The ability to perform similar migrations with similar value in other public clouds is required to address substantial commercial opportunities and DOE user needs.
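The two figures quoted for Phase I (up to 80% savings, and 5x the compute for the same cost) are the same statement expressed two ways. As a minimal sketch of that arithmetic, assuming only that savings are measured as a fraction of the on-demand price (the function name is hypothetical and not part of any Exotanium product):

```python
def compute_multiplier(savings_fraction: float) -> float:
    """Return how much more compute a fixed budget buys at a given discount.

    Paying (1 - savings_fraction) of the on-demand price per unit of compute
    means the same budget buys 1 / (1 - savings_fraction) units.
    """
    if not 0 <= savings_fraction < 1:
        raise ValueError("savings_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - savings_fraction)


# 80% savings versus on-demand pricing corresponds to roughly 5x the compute.
print(round(compute_multiplier(0.80), 2))
```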
Also, demonstrating a successful live migration of workloads between an on-premises private cloud and the public cloud in a hybrid setting could lead to significant cost savings without changing a line of application code, presenting a potential approach for migrating to the cloud in an inexpensive and low-risk manner. Finally, successfully establishing the platform to support GPUs could result in a potentially large number of highly compute-intensive DOE applications being run in the Spot market of multiple public GovClouds at significant cost savings. During the Phase II award, multi-public cloud support will be developed, hybrid cloud support established, and capabilities extended to GPU processing.
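The consolidation idea described in the abstract (packing idle containers onto a small number of VMs, then spreading them out again when workload increases) can be illustrated with a simple bin-packing sketch. The abstract does not state which placement algorithm the product uses; the first-fit-decreasing heuristic below is a stand-in chosen for illustration, and the class, function, container names, and capacity units are all hypothetical. Live migration itself is not modeled here, only the placement decision.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """A hypothetical VM with a single resource dimension (e.g. vCPUs)."""
    capacity: float
    used: float = 0.0
    containers: list = field(default_factory=list)

    def fits(self, demand: float) -> bool:
        return self.used + demand <= self.capacity

    def place(self, name: str, demand: float) -> None:
        self.containers.append(name)
        self.used += demand


def consolidate(demands: dict[str, float], vm_capacity: float) -> list[VM]:
    """Pack containers onto as few VMs as the heuristic finds.

    First-fit decreasing: consider containers from largest to smallest
    demand and put each on the first VM with room, opening a new VM only
    when none fits. This is an assumed heuristic, not the vendor's method.
    """
    vms: list[VM] = []
    ordered = sorted(demands.items(), key=lambda kv: kv[1], reverse=True)
    for name, demand in ordered:
        if demand > vm_capacity:
            raise ValueError(f"{name} does not fit on any VM")
        target = next((vm for vm in vms if vm.fits(demand)), None)
        if target is None:
            target = VM(capacity=vm_capacity)
            vms.append(target)
        target.place(name, demand)
    return vms


# Idle period: small demands pack onto few VMs.
idle = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.5, "e": 0.5}
# Busy period: each container needs a full VM, so the placement spreads out.
busy = {name: 1.0 for name in idle}

print(len(consolidate(idle, vm_capacity=1.0)), "VMs while idle")
print(len(consolidate(busy, vm_capacity=1.0)), "VMs under load")
```

In a real system the difference between the two placements would be realized by live-migrating running containers rather than restarting them, which is the property the abstract emphasizes.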