GROMACS version: 2025.4-conda_forge
GROMACS modification: No
I have installed GROMACS 2025.4 with GPU support using conda, and when I started a run it gave me this message:
Compiled SIMD is AVX2_256, but CPU also supports AVX_512 (see log). The current CPU can measure timings more accurately than the code in gmx mdrun was configured to use. This might affect your simulation speed as accurate timings are needed for load-balancing.
Are there any versions on conda-forge that support AVX_512 CPUs? Would changing to one make the simulation run faster?
The gmx --version command gives this output:
:-) GROMACS - gmx, 2025.4-conda_forge (-:
Executable: /nfs/slurm/cu001/.conda/envs/gromacs/bin.AVX2_256/gmx
Data prefix: /nfs/slurm/cu001/.conda/envs/gromacs
Working dir: /nfs/slurm/cu001/data/iinsilico/strp
Command line:
gmx --version
GROMACS version: 2025.4-conda_forge
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NBNxM GPU setup: super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.10-sse2-avx
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /home/conda/feedstock_root/build_artifacts/gromacs_1764344666726/_build_env/bin/x86_64-conda-linux-gnu-cc GNU 14.3.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /home/conda/feedstock_root/build_artifacts/gromacs_1764344666726/_build_env/bin/x86_64-conda-linux-gnu-c++ GNU 14.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library: Internal
LAPACK library: Internal
CUDA compiler: /nfs/slurm/cu001/.conda/envs/gromacs/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2025 NVIDIA Corporation;Built on Tue_May_27_02:21:03_PDT_2025;Cuda compilation tools, release 12.9, V12.9.86;Build cuda_12.9.r12.9/compiler.36037853_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver: 12.20
CUDA runtime: 12.90
Okay, thank you.
I have tried a system of 31955 atoms, including water, using this GROMACS version with GPU. The available GPU is an NVIDIA A100-SXM4-80GB. After finishing the production run of 200 ns with a timestep of 2.0 femtoseconds, it showed this:
(ns/day) (hour/ns)
Performance: 91.946 0.261
Is this good performance?
This is the .sh file and the command I used:

#!/bin/sh
#SBATCH --job-name=strp
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2
#SBATCH --time=24:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

gmx mdrun -ntmpi 1 -ntomp 2 -v -deffnm step5_production
That is very slow. I get 1000 ns/day on 24000 atoms on my RTX 4070, which is roughly as fast as your GPU. Using more OpenMP threads will improve performance a little bit. So something seems to be sub-optimal with your setup.
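For example (a sketch, not tested on your cluster; the core count of 8 and the --cpus-per-task request are assumptions about what your nodes can provide), the batch script could request more cores and give them all to OpenMP:

#!/bin/sh
#SBATCH --job-name=strp
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

# one thread-MPI rank driving the GPU, 8 OpenMP threads for the CPU work
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm step5_production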
I have no clue what could be wrong. Can you post the table “R E A L C Y C L E A N D T I M E A C C O U N T I N G” that is printed at the end of the log file?
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
Pair Search distance check 468637.599312 4217738.394 0.0
NxN QSTab Elec. + LJ [F] 517469861.538432 27425902661.537 98.1
NxN QSTab Elec. + LJ [V&F] 5227012.353792 423388000.657 1.5
1,4 nonbonded interactions 90689.800638 8162082.057 0.0
Shift-X 3794.208880 22765.253 0.0
Bonds 18807.625584 1109649.909 0.0
Propers 89989.264079 20607541.474 0.1
Impropers 5829.888991 1212616.910 0.0
Virial 37995.232000 683914.176 0.0
Stop-CM 3794.208880 37942.089 0.0
Calc-Ekin 151767.108955 4097711.942 0.0
Lincs 16670.395404 1000223.724 0.0
Lincs-Mat 72380.862096 289523.448 0.0
Constraint-V 377149.885764 3394348.972 0.0
Constraint-Vir 36047.976360 865151.433 0.0
Settle 114603.031652 42403121.711 0.2
CMAP 2244.091689 3814955.871 0.0
Urey-Bradley 62929.555300 11516108.620 0.0
Total 27952726058.178 100.0
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 2 OpenMP threads
Activity: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
Neighbor search 1 2 118736 993.907 4174.465 4.5
Launch PP GPU ops. 1 2 23628266 403.707 1695.594 1.8
Force 1 2 11873501 4095.343 17200.677 18.4
PME GPU mesh 1 2 11873501 544.824 2288.292 2.4
Wait GPU NB local 1 2 11873501 518.225 2176.577 2.3
Wait GPU state copy 1 2 10686150 11955.031 50211.823 53.6
NB X/F buffer ops. 1 2 1187351 60.345 253.453 0.3
Write traj. 1 2 262 0.669 2.810 0.0
Update 1 2 11873501 781.552 3282.564 3.5
Constraints 1 2 11873501 1917.805 8054.890 8.6
Kinetic energy 1 2 4749401 791.874 3325.915 3.5
Rest 251.243 1055.237 1.1
Total 22314.525 93722.297 100.0
Breakdown of PME mesh activities
Wait PME GPU gather 1 2 11873501 96.371 404.765 0.4
Reduce GPU PME F 1 2 11873501 19.986 83.941 0.1
Launch PME GPU ops. 1 2 106861509 402.404 1690.120 1.8
Core t (s) Wall t (s) (%)
Time: 44629.048 22314.525 200.0
6h11:54
(ns/day) (hour/ns)
Performance: 91.946 0.261
cutoff-scheme   = Verlet
nstlist         = 20
vdwtype         = Cut-off
vdw-modifier    = Force-switch
rvdw_switch     = 1.0
rvdw            = 1.2
rlist           = 1.2
rcoulomb        = 1.2
coulombtype     = PME
DispCorr        = no ; Note that dispersion correction should be applied in the case of lipid monolayers, but not bilayers
All settings look reasonable. You could use a fourier-spacing of 0.15, but that will not help much. The timings don’t reveal much, as nearly everything runs on the GPU. I have no clue what could be the issue.
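For example, that is a single line in the .mdp file:

fourier-spacing = 0.15 ; coarser PME grid than the 0.12 default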
Maybe the best thing is to build GROMACS yourself and see if the performance improves. Maybe the Conda build has some sub-optimal configuration settings for CUDA.
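A source build could look something like this (a sketch, assuming you have downloaded the 2025.4 source tarball and have CUDA and a recent compiler available on the build node; the install prefix and the job count for make are placeholders):

tar xf gromacs-2025.4.tar.gz
cd gromacs-2025.4
mkdir build
cd build
# enable CUDA, target the AVX_512 SIMD level your CPU supports,
# and let GROMACS build its own FFTW
cmake .. -DGMX_GPU=CUDA \
         -DGMX_SIMD=AVX_512 \
         -DGMX_BUILD_OWN_FFTW=ON \
         -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2025.4
make -j 8
make install
source $HOME/gromacs-2025.4/bin/GMXRC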