SBIR-STTR Award

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing Systems
Award last edited on: 12/21/21

Sponsored Program
SBIR
Awarding Agency
DOE
Total Award Amount
$256,500
Award Phase
1
Solicitation Topic Code
02a
Principal Investigator
Donglai Dai

Company Information

X-ScaleSolutions LLC

750 Deer Run Drive
Columbus, OH 43230
   (614) 316-4209
   contactus@x-scalesolutions.com
   www.x-scalesolutions.com
Location: Single
Congr. District: 03
County: Franklin

Phase I

Contract Number: DE-SC0021587
Start Date: 2/22/21    Completed: 2/21/22
Phase I year
2021
Phase I Amount
$256,500
As the field of High-Performance Computing (HPC) heads towards Exascale with modern processing, networking and storage technologies, it is becoming increasingly important to provide enhanced I/O capabilities and scalable checkpoint-restart support for users of these systems. For example, I/O-intensive HPC, Deep Learning (DL) and Machine Learning (ML) applications are hitting the I/O wall on large-scale systems. These applications are also spending a lot of time in checkpoint-restart phases. The Scalable Checkpoint-Restart (SCR) project, designed and developed by researchers at the Lawrence Livermore National Laboratory (LLNL), has made considerable progress along these lines. This code has been deployed on many LLNL systems and has demonstrated benefits to many users. The current version of the SCR library needs enhancements and hardening to achieve cross-platform portability and applicability for a diverse range of supercomputers and HPC clouds. The core support within the current SCR library also needs multiple enhancements to satisfy the needs of next-generation Exascale systems and applications. The proposed project takes a systematic and comprehensive approach to enhance SCR for next-generation HPC, DL and ML applications using a set of new features, enhancements and capabilities. The project will be led by X-ScaleSolutions, in collaboration with research partners Dr. Kathryn Mohror and Adam Moody (the original developers and maintainers of the SCR library) of Lawrence Livermore National Laboratory. The project will develop a set of innovations in SCR-Exa to take advantage of the various features and mechanisms of next-generation networking, computing, and storage technologies.
These innovations include: 1) modular design with plug-and-play support for a diverse set of resource managers and job launchers, 2) enhanced user interface with Python support, 3) Python bindings for HPC/DL/ML applications and frameworks, 4) optimized designs with asynchronous data transfers, 5) offloading redundancy encoding schemes using multi-threading, 6) I/O acceleration through integration with UnifyFS, 7) designs for checkpoint-restart on sub-communicators, 8) applicability of the SCR library to cloud environments with virtualization support, and 9) integrated development and evaluation. Tasks 1, 2, 3, and relevant portions of 9 will be carried out as part of Phase I activities. The transformative impact of the proposed SCR-Exa product will be to achieve improved I/O acceleration and scalable fault-tolerance on emerging Exascale systems and cloud environments for a range of HPC, DL, and ML applications. We expect that the proposed solutions in the SCR-Exa library will boost the performance of I/O write operations (critical for many HPC and DL applications) by up to 50x. In addition, it will reduce the checkpoint-restart time for the applications by a factor of 10-20. This can result in a significant boost to Exascale applications along three dimensions: performance, scalability, and fault-tolerance. In cloud environments, the proposed designs will enable effective fault-tolerance for end-users with lower Service Level Agreement (SLA) and cost. Thus, SCR-Exa will be applicable to a range of supercomputer centers, data centers, cloud providers, and national/international laboratories.

Phase II

Contract Number: ----------
Start Date: 00/00/00    Completed: 00/00/00
Phase II year
----
Phase II Amount
----