Ensuring correctness in high-performance computing (HPC) applications is one of the fundamental challenges facing the HPC community today. Verification, testing, and debugging have advanced significantly in isolating software errors (or defects) in non-HPC software, but several factors make correctness much harder to achieve in HPC applications and systems than in general systems software: growing heterogeneity (architectures with CPUs, GPUs, and special-purpose accelerators), massive-scale computations (very high degrees of concurrency), combined parallel programming models (e.g., MPI+X), new scalable numerical algorithms (e.g., to leverage reduced precision in floating-point arithmetic), and aggressive compiler optimizations and transformations. The following report lays out the key challenges and research areas of HPC correctness: DOE Report of the HPC Correctness Summit.
As the complexity of future HPC architectures, algorithms, and applications increases, the ability to fully exploit exascale systems will be limited unless the software running on them can be trusted to be correct. Because HPC software is continuously used to advance scientific and technological capabilities, novel techniques and practical tools for software correctness in HPC are invaluable.
The goal of the Correctness Workshop is to bring together researchers and developers to present and discuss novel ideas to address the problem of correctness in HPC. The workshop will feature contributed papers and invited talks in this area.
Topics of interest include, but are not limited to:
Correctness in Scientific Applications and Algorithms
- Formal methods and rigorous mathematical techniques for correctness in HPC applications
- Frameworks to address the challenges of testing complex HPC applications (e.g., multiphysics applications)
- Approaches for the specification of numerical algorithms with the goal of correctness checking
- Error identification in the design and implementation of numerical algorithms using finite-precision floating-point numbers
Tools for Debugging, Testing, and Correctness Checking
- Tools to control the effect of non-determinism when debugging and testing HPC software
- Scalable debugging solutions for large-scale HPC applications
- Scalable tools for model checking, verification, certification, or symbolic execution
- Static and dynamic analysis to test and check correctness in the entire HPC software ecosystem
- Predictive debugging and testing approaches to forecast the occurrence of errors under specific conditions
- Machine learning and anomaly detection for bug detection and localization
Programming Models and Runtime Systems Correctness
- Correctness in emerging HPC programming models
- Analysis of software error propagation and error handling in HPC runtime systems and libraries
- Metrics to measure the degree of correctness of HPC software
- Specifications to check the correctness of runtime systems
- Large databases of bug reports and/or reproducible test cases of HPC software
- Benchmarks to test the effectiveness of HPC correctness tools