I am trying to get the newest GROMACS 2025.0 working on a Cray system and have run into what appear to be build issues that don’t lend themselves to an obvious fix (at least to me). I am hoping another pair of eyes might be able to help me pin this down.
A few points:
This is on a Cray system with AMD GPUs (gfx90a).
The build uses the ROCm-provided clang (to match the AdaptiveCpp build).
AdaptiveCpp was built using ROCm 6.2.4 (acpp-info data is provided below).
When using make I get 20 errors, but all of them look like the following:
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../include/c++/13/type_traits:2388:21: error: static assertion failed due to requirement '__declval_protector<mu::ParserCallback *>::__stop': declval() must not be used!
2388 | static_assert(__declval_protector<_Tp>::__stop,
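For context on what this assertion means: libstdc++ puts that static_assert inside the body of std::declval, so it only fires when something forces the body to be instantiated. Normally declval appears only inside decltype or sizeof, where that never happens. Below is a minimal sketch of the allowed use versus the kind of use that triggers the message. The Callback type is a made-up stand-in for mu::ParserCallback, and this is not a reproduction of what the ROCm wrapper headers do.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a type such as mu::ParserCallback.
struct Callback
{
    int arity() const { return 2; }
};

// OK: declval appears only inside decltype, an unevaluated context,
// so the body of std::declval is never instantiated.
using ArityType = decltype(std::declval<Callback*>()->arity());
static_assert(std::is_same_v<ArityType, int>, "arity() returns int");

// Not OK (left commented out so the sketch compiles): using declval in
// evaluated code instantiates its body, and with libstdc++ that body
// holds the "declval() must not be used!" static_assert.
// Callback* broken() { return std::declval<Callback*>(); }

// Small helper so the unevaluated use can also be checked at run time.
constexpr bool arityIsInt()
{
    return std::is_same_v<ArityType, int>;
}
```

Uncommenting broken() should give an error of the same form as the one quoted above, which suggests that something in the include chain is instantiating declval for real rather than only naming it.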
AdaptiveCpp (acpp-info) output looks like the following:
It should not be an issue with Clang per se (ROCm 6.2.4 works on a Cray system we have here), but something does go wrong with the way the compiler is invoked. Could you also show the output of module list, and perhaps run the build as VERBOSE=1 make and share the command line printed right before it spits out the stream of errors?
Nitpick: GMX_GPLUSPLUS is not a valid option; it should be GMX_GPLUSPLUS_PATH. But it picks up the correct headers anyway, so that should not be relevant to the problem.
[ 0%] Built target gmx_objlib
[ 0%] Built target scanner
[ 0%] Generating release version information
[ 0%] Built target release-version-info
[ 0%] Built target internal_rpc_xdr
[ 0%] Built target thread_mpi
[ 1%] Built target tng_io_obj
[ 3%] Built target tng_io_zlib
[ 3%] Built target lmfit_objlib
[ 6%] Built target colvars_objlib
[ 6%] Building CXX object _deps/muparser-build/CMakeFiles/muparser.dir/src/muParserBase.cpp.o
In file included from /lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/external/muparser/src/muParserBase.cpp:29:
In file included from /lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/external/muparser/include/muParserBase.h:33:
In file included from /opt/rocm-6.2.4/lib/llvm/lib/clang/18/include/openmp_wrappers/cmath:86:
In file included from /opt/rocm-6.2.4/lib/llvm/lib/clang/18/include/__clang_hip_cmath.h:20:
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../include/c++/13/type_traits:2388:21: error: static assertion failed due to requirement '__declval_protector<mu::ParserCallback *>::__stop': declval() must not be used!
2388 | static_assert(__declval_protector<_Tp>::__stop,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../include/c++/13/type_traits:903:10: note: in instantiation of function template specialization 'std::declval[device={arch(amdgcn)}, implementation={extension(match_any, allow_templates)}]<mu::ParserCallback *>' requested here
903 | auto declval() noexcept -> decltype(__declval<_Tp>(0));
| ^
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../include/c++/13/type_traits:1255:31: note: in instantiation of function template specialization 'std::declval<mu::ParserCallback *>' requested here
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[2]: *** [_deps/muparser-build/CMakeFiles/muparser.dir/build.make:90: _deps/muparser-build/CMakeFiles/muparser.dir/src/muParserBase.cpp.o] Error 1
make[2]: Leaving directory '/lustre/orion/bie123/proj-shared/GromacsInstalls/gromacs-2025.0/build'
make[1]: *** [CMakeFiles/Makefile2:4616: _deps/muparser-build/CMakeFiles/muparser.dir/all] Error 2
make[1]: Leaving directory '/lustre/orion/bie123/proj-shared/GromacsInstalls/gromacs-2025.0/build'
make: *** [Makefile:166: all] Error 2
Another note: when compiling with VkFFT instead of rocFFT and without HeFFTe, it seems to build (at least my colleague has told me this), but I’ve not had a chance to test it yet.
EDIT: I checked with my colleague and they also didn’t have the craype-accel-amd-gfx90a module loaded. Not sure if that is relevant.
I unloaded the craype-accel-amd-gfx90a module and was still running into issues. If I drop HeFFTe and rocFFT and just use VkFFT, everything does seem to compile nicely and run.
If I don’t use HeFFTe but still try to use rocFFT, I end up with the following errors:
[ 30%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/fft/gpu_3dfft_sycl_rocfft.cpp.o
acpp warning: No optimization flag was given, optimizations are disabled by default. Performance may be degraded. Compile with e.g. -O2/-O3 to enable optimizations.
/lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/gromacs/fft/gpu_3dfft_sycl_rocfft.cpp:326:50: error: expected expression
326 | impl_->queue_.submit(GMX_SYCL_DISCARD_EVENT[&](sycl::handler & cgh) {
| ^
/lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/gromacs/fft/gpu_3dfft_sycl_rocfft.cpp:326:26: error: use of undeclared identifier 'GMX_SYCL_DISCARD_EVENT'
326 | impl_->queue_.submit(GMX_SYCL_DISCARD_EVENT[&](sycl::handler & cgh) {
| ^
/lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/gromacs/fft/gpu_3dfft_sycl_rocfft.cpp:326:66: error: expected '(' for function-style cast or type construction
326 | impl_->queue_.submit(GMX_SYCL_DISCARD_EVENT[&](sycl::handler & cgh) {
| ~~~~~~~~~~~~~ ^
/lustre/orion/proj-shared/bie123/GromacsInstalls/gromacs-2025.0/src/gromacs/fft/gpu_3dfft_sycl_rocfft.cpp:326:68: error: use of undeclared identifier 'cgh'
326 | impl_->queue_.submit(GMX_SYCL_DISCARD_EVENT[&](sycl::handler & cgh) {
| ^
4 errors generated when compiling for gfx90a.
make[2]: *** [src/gromacs/CMakeFiles/libgromacs.dir/build.make:12023: src/gromacs/CMakeFiles/libgromacs.dir/fft/gpu_3dfft_sycl_rocfft.cpp.o] Error 1
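These four errors all have the shape you would expect if GMX_SYCL_DISCARD_EVENT is a macro that is not defined in this translation unit. I have not checked its definition, so what follows is an assumption based only on how it is used in the log: a macro that expands either to a leading argument plus a comma, or to nothing. The sketch below shows that pattern with mock types. Queue, Handler, Properties and MY_DISCARD_EVENT are all hypothetical names, not SYCL or GROMACS API.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for sycl::handler and sycl::queue; not the SYCL API.
struct Handler
{
    std::string log;
};

struct Properties
{
    bool discardEvent = false;
};

struct Queue
{
    // Overload taking only the command-group function.
    std::string submit(const std::function<void(Handler&)>& cgf)
    {
        Handler h;
        cgf(h);
        return h.log;
    }
    // Overload taking an extra property argument first.
    std::string submit(Properties props, const std::function<void(Handler&)>& cgf)
    {
        Handler h;
        cgf(h);
        return (props.discardEvent ? "discard:" : "keep:") + h.log;
    }
};

// Hypothetical macro modelled on the usage in the log: it expands to
// "argument plus trailing comma" so it can sit directly before the lambda.
#ifndef MY_DISCARD_EVENT
#    define MY_DISCARD_EVENT Properties{ true },
#endif

inline std::string runKernel(Queue& q)
{
    // With the macro defined this is submit(Properties{ true }, lambda).
    // Without it, the compiler sees an unknown identifier followed by
    // `[&]`, parses that as a subscript, and reports errors like those above.
    return q.submit(MY_DISCARD_EVENT[&](Handler& h) { h.log = "kernel"; });
}
```

If that reading is right, the thing to look for is where the macro is defined and whether that definition is reached for the rocFFT source file in this configuration.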
EDIT: I also have been able to get it to build now with the craype-accel-amd-gfx90a module loaded and using VkFFT. rocFFT builds still end up with the error listed above.
I noticed the new release and was able to get what appeared to be a functional build working (at least for my initial tests); however, when I now try to run any production-length simulations, i.e. ones that need to checkpoint, I run into the following error:
step 324900, remaining wall clock time: 24 s
-------------------------------------------------------
Program: gmx mdrun, version 2025.1
Source file: src/gromacs/mdlib/mdoutf.cpp (line 475)
Function: void write_checkpoint(const char *, gmx_bool, FILE *, const t_commrec *, int *, int, IntegrationAlgorithm, int, gmx_bool, LambdaWeightCalculation, int64_t, double, t_state *, ObservablesHistory *, const gmx::MDModulesNotifiers &, gmx::WriteCheckpointDataHolder *, bool, MPI_Comm)
System I/O error:
Cannot rename checkpoint file from state.cpt to state_prev.cpt; maybe you are
out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at https://manual.gromacs.org/current/user-guide/run-time-errors.html
-------------------------------------------------------
MPICH ERROR [Rank 0] [job id ] [Wed Mar 12 23:06:20 2025] [frontier09097] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
What’s strange is that I have plenty of disk space where I am running. I’ve built with and without HeFFTe support (using the same ROCM/AMD settings used in my initial post) and one using just VkFFT. All of them give the same problem. If I do not checkpoint, my test systems seem to run, but that isn’t ideal on a shared system using a queuing system. Thoughts?
{Edit: I did need to set the muParser option to NONE for the build to work.}
{Edit 2: I also tried without any GPUs and I get the same issue, building with the AMD clang and using my local resource’s cray-mpich.}
If you have time, could you elaborate a bit here: with the same set of loaded modules (and with muParser enabled), VkFFT was building fine, but HeFFTe/rocFFT build was failing (due to the muParser error)? If you load HeFFTe module but still try to build with VkFFT, does it work?
Does the issue happen at the very first/second checkpoint write or randomly later? What type of filesystem are you running on?
We use the standard file copying function, so it is not immediately obvious what can go wrong there. GPU offloading type should not matter here, but it’s nice that you checked thoroughly.
For the new 2025.1 code, I did my initial tests (with VkFFT and with HeFFTe/ROCm) with muParser disabled (set to NONE), since I had needed to disable it in my earlier 2025.0 build attempts with HeFFTe/ROCm.
A fresh build this morning with muParser enabled worked when using VkFFT. I also did a fresh build with muParser enabled but without HeFFTe, and that built as well. I was also able to get a build with muParser enabled together with HeFFTe and rocFFT, but only if the craype-accel-amd-gfx90a module is not loaded.
Once the craype-accel-amd-gfx90a module is loaded the builds fail unless I disable the muparser. Not using the craype-accel-amd-gfx90a module is a bit problematic as the user guide for the system I am on indicates that this module needs to be loaded (at build and runtime) in order to enable the GPU-aware MPICH.
Regarding the checkpoint/IO issue: it occurs whether or not the craype-accel-amd-gfx90a module is loaded at build time or runtime. I see it both with a single rank (1 GPU and 7 threads) and with multiple ranks (up to 128 ranks, 8 GPUs per node, so 16 nodes, and 7 CPU threads per rank).
The runs seem to work fine until the second checkpoint, when GROMACS tries to rename state.cpt to state_prev.cpt. If I restart from a job I have previously run (copied from a different computer that has a working GROMACS install), the error appears the moment it tries to make a new checkpoint.
Interestingly, I did find a workaround (of a sort). If I use the -cpnum 1 option, so that a unique checkpoint file is saved every time a checkpoint is made instead of the standard [name-here].cpt and [name-here]_prev.cpt pairing, I do not get the out-of-disk-space I/O error.
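To make the difference between the two schemes concrete, here is a rough sketch of what I understand them to be doing. This is not the GROMACS implementation: the function names are made up, the numbered file name pattern is an assumption, and I am assuming the backup to _prev is done by copying the existing file.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Write `contents` to `path`, replacing any existing file.
inline bool writeFile(const fs::path& path, const std::string& contents)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << contents;
    return static_cast<bool>(out);
}

// Paired scheme (assumed): back up the existing checkpoint to
// <stem>_prev.cpt, then write the new one. The backup step touches an
// existing file, so this is where a copy/rename failure would surface.
inline std::error_code rotateAndWrite(const fs::path& dir, const std::string& stem, const std::string& contents)
{
    const fs::path current  = dir / (stem + ".cpt");
    const fs::path previous = dir / (stem + "_prev.cpt");
    std::error_code ec;
    if (fs::exists(current, ec))
    {
        fs::copy_file(current, previous, fs::copy_options::overwrite_existing, ec);
        if (ec)
        {
            return ec;
        }
    }
    if (!writeFile(current, contents))
    {
        return std::make_error_code(std::errc::io_error);
    }
    return {};
}

// Numbered scheme (assumed, hypothetical file naming): every checkpoint
// gets its own file, so nothing existing has to be copied or renamed.
inline std::error_code writeNumbered(const fs::path& dir, const std::string& stem, long step, const std::string& contents)
{
    const fs::path target = dir / (stem + "_step" + std::to_string(step) + ".cpt");
    if (!writeFile(target, contents))
    {
        return std::make_error_code(std::errc::io_error);
    }
    return {};
}
```

In this picture the first checkpoint never hits the backup branch (there is no state.cpt yet) and the numbered scheme never hits it at all, which would match both the error first showing up at the second checkpoint and -cpnum avoiding it.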
Just an update: the checkpoint issue (from 2025.1) also occurs with a build that uses HIP instead of SYCL. I’m digging into this a bit more to see if I can reproduce the issue on a system with a non-Lustre filesystem, and I’ll edit this post if/when I can reproduce the error.
Edit:
Digging into the gmx_file_copy utility function and adding output of the error code (i.e. modifying gmx_file_copy) now results in the following additional information prior to the checkpoint error:
Error copying file: No data available
-------------------------------------------------------
Program: gmx mdrun, version 2025.1
Source file: src/gromacs/mdlib/mdoutf.cpp (line 475)
Function: void write_checkpoint(const char *, gmx_bool, FILE *, const t_commrec *, int *, int, IntegrationAlgorithm, int, gmx_bool, LambdaWeightCalculation, int64_t, double, t_state *, ObservablesHistory *, const gmx::MDModulesNotifiers &, gmx::WriteCheckpointDataHolder *, bool, MPI_Comm)
System I/O error:
Cannot rename checkpoint file from state.cpt to state_prev.cpt; maybe you are
out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at https://manual.gromacs.org/current/user-guide/run-time-errors.html
-------------------------------------------------------
I am still working on testing this on a non-Lustre filesystem setup.