GROMACS conda version that uses AVX_512 with GPU

GROMACS version: 2025.4-conda_forge
GROMACS modification: No
I have installed GROMACS 2025.4 with GPU support using conda, and when I started a run it gave me this message:

Compiled SIMD is AVX2_256, but CPU also supports AVX_512 (see log). The current CPU can measure timings more accurately than the code in gmx mdrun was configured to use. This might affect your simulation speed as accurate timings are needed for load-balancing.

Is there any version on conda-forge that supports AVX_512 CPUs? Does changing to this make the simulation run faster?

The gmx --version command gives this output:


               :-) GROMACS - gmx, 2025.4-conda_forge (-:

Executable:   /nfs/slurm/cu001/.conda/envs/gromacs/bin.AVX2_256/gmx
Data prefix:  /nfs/slurm/cu001/.conda/envs/gromacs
Working dir:  /nfs/slurm/cu001/data/iinsilico/strp
Command line:
gmx --version

GROMACS version:     2025.4-conda_forge
Precision:           mixed
Memory model:        64 bit
MPI library:         thread_mpi
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         CUDA
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx
GPU FFT library:     cuFFT
Multi-GPU FFT:       none
RDTSCP usage:        disabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /home/conda/feedstock_root/build_artifacts/gromacs_1764344666726/_build_env/bin/x86_64-conda-linux-gnu-cc GNU 14.3.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /home/conda/feedstock_root/build_artifacts/gromacs_1764344666726/_build_env/bin/x86_64-conda-linux-gnu-c++ GNU 14.3.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        Internal
LAPACK library:      Internal
CUDA compiler:       /nfs/slurm/cu001/.conda/envs/gromacs/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2025 NVIDIA Corporation;Built on Tue_May_27_02:21:03_PDT_2025;Cuda compilation tools, release 12.9, V12.9.86;Build cuda_12.9.r12.9/compiler.36037853_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver:         12.20
CUDA runtime:        12.90


Combined with a GPU, AVX2_256 is usually faster than AVX_512. And if not, the CPU doesn’t have much to do.
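If you want to confirm what the CPU actually exposes before worrying about the build, a quick check on a Linux node (assuming /proc/cpuinfo is readable, as it normally is on a cluster) is:

```shell
# List any AVX-512 feature flags the kernel reports for this CPU.
# Empty output means the CPU (or this VM/container) has no AVX-512.
grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
```

If this prints flags like avx512f while the log says "Compiled SIMD is AVX2_256", the binary is simply built for a lower SIMD level than the hardware supports.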

Okay, thank you. Another question, please: which is better in terms of running speed, installing GROMACS from conda-forge or building it from source?

If you install it yourself you will in most cases get an optimal build. But conda-forge also seems to do a pretty good job.
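For what it's worth, a minimal sketch of such a source build with CUDA and AVX-512 would look like this (GMX_GPU and GMX_SIMD are the documented CMake options; the tarball name, install prefix, and -j count are placeholders for your environment):

```shell
# Sketch: build GROMACS 2025.4 from source with CUDA and AVX-512 SIMD.
# Paths and parallelism below are assumptions; adjust for your node.
tar xf gromacs-2025.4.tar.gz
cd gromacs-2025.4
mkdir build && cd build
cmake .. -DGMX_GPU=CUDA \
         -DGMX_SIMD=AVX_512 \
         -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2025.4
make -j 8
make install
```

Leaving out -DGMX_SIMD entirely also works: CMake then detects the SIMD level of the build machine automatically.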

Okay, thank you.
I have tried a system of 31955 atoms including water using this GROMACS version with GPU. The available GPU is an NVIDIA A100-SXM4-80GB. After finishing the production run of 200 ns with a timestep of 2.0 femtoseconds it showed this:


             (ns/day)    (hour/ns)

Performance:       91.946        0.261

Is this good performance?
This is the .sh file and the command I used:
#!/bin/sh
#SBATCH --job-name=strp
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2
#SBATCH --time=24:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
gmx mdrun -ntmpi 1 -ntomp 2 -v -deffnm step5_production

That is very slow. I get 1000 ns/day on 24000 atoms on my RTX 4070, which is roughly as fast as your GPU. Using more OpenMP threads will improve performance a little bit. So something looks to be sub-optimal with your setup.
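Concretely, you could ask Slurm for more cores and hand them to mdrun; a sketch of the adjusted job script (8 threads is an assumption, tune it to the cores available per GPU on your node):

```shell
#!/bin/sh
#SBATCH --job-name=strp
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
# One thread-MPI rank driving the GPU, 8 OpenMP threads for the remaining CPU work
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm step5_production
```

Note that with --ntasks=2 and -ntomp 2 as in your original script, mdrun only ever sees 2 CPU threads in total.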

This is the output from the nvidia-smi command:


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:01:00.0 Off |                   On |
| N/A   32C    P0            118W /  500W |    6152MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:41:00.0 Off |                   On |
| N/A   25C    P0             47W /  500W |     119MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off |   00000000:81:00.0 Off |                   On |
| N/A   25C    P0             48W /  500W |     119MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off |   00000000:C1:00.0 Off |                   On |
| N/A   24C    P0             47W /  500W |     119MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    3   0   0  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    4   0   1  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    5   0   2  |            6064MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 2MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    6   0   3  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    3   0   0  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    4   0   1  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   2  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    6   0   3  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    3   0   0  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    4   0   1  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    5   0   2  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    6   0   3  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    3   0   0  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    4   0   1  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    5   0   2  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    6   0   3  |              30MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0    5    0           206411      C   python3                                6028MiB |
+-----------------------------------------------------------------------------------------+


I have no clue what could be wrong. Can you post the table "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G" that is printed at the end of the log file?

This is the last part of the log file:

    M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
W3=SPC/TIP3p  W4=TIP4p (single or pairs)
V&F=Potential and force  V=Potential only  F=Force only

Computing:                               M-Number         M-Flops  % Flops

Pair Search distance check          468637.599312     4217738.394     0.0
NxN QSTab Elec. + LJ [F]         517469861.538432 27425902661.537    98.1
NxN QSTab Elec. + LJ [V&F]         5227012.353792   423388000.657     1.5
1,4 nonbonded interactions           90689.800638     8162082.057     0.0
Shift-X                               3794.208880       22765.253     0.0
Bonds                                18807.625584     1109649.909     0.0
Propers                              89989.264079    20607541.474     0.1
Impropers                             5829.888991     1212616.910     0.0
Virial                               37995.232000      683914.176     0.0
Stop-CM                               3794.208880       37942.089     0.0
Calc-Ekin                           151767.108955     4097711.942     0.0
Lincs                                16670.395404     1000223.724     0.0
Lincs-Mat                            72380.862096      289523.448     0.0
Constraint-V                        377149.885764     3394348.972     0.0
Constraint-Vir                       36047.976360      865151.433     0.0
Settle                              114603.031652    42403121.711     0.2
CMAP                                  2244.091689     3814955.871     0.0
Urey-Bradley                         62929.555300    11516108.620     0.0

Total                                             27952726058.178   100.0

  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 2 OpenMP threads

Activity:              Num   Num      Call    Wall time         Giga-Cycles
                      Ranks Threads  Count      (s)         total sum    %

Neighbor search           1    2     118736     993.907       4174.465   4.5
Launch PP GPU ops.        1    2   23628266     403.707       1695.594   1.8
Force                     1    2   11873501    4095.343      17200.677  18.4
PME GPU mesh              1    2   11873501     544.824       2288.292   2.4
Wait GPU NB local         1    2   11873501     518.225       2176.577   2.3
Wait GPU state copy       1    2   10686150   11955.031      50211.823  53.6
NB X/F buffer ops.        1    2    1187351      60.345        253.453   0.3
Write traj.               1    2        262       0.669          2.810   0.0
Update                    1    2   11873501     781.552       3282.564   3.5
Constraints               1    2   11873501    1917.805       8054.890   8.6
Kinetic energy            1    2    4749401     791.874       3325.915   3.5
Rest                                            251.243       1055.237   1.1

Total                                         22314.525      93722.297 100.0

Breakdown of PME mesh activities

Wait PME GPU gather       1    2   11873501      96.371        404.765   0.4
Reduce GPU PME F          1    2   11873501      19.986         83.941   0.1
Launch PME GPU ops.       1    2  106861509     402.404       1690.120   1.8

           Core t (s)   Wall t (s)        (%)
   Time:    44629.048    22314.525      200.0
                     6h11:54
             (ns/day)    (hour/ns)

Performance:       91.946        0.261

I see nothing strange there. Using more OpenMP threads should give you some 10% more performance.

What are your cut-off and PME settings?

These are the parameters:

cutoff-scheme = Verlet
nstlist = 20
vdwtype = Cut-off
vdw-modifier = Force-switch
rvdw_switch = 1.0
rvdw = 1.2
rlist = 1.2
rcoulomb = 1.2
coulombtype = PME
DispCorr = no ; Note that dispersion correction should be applied in the case of lipid monolayers, but not bilayers

And your Fourier spacing?

It is automatically determined by GROMACS. The system was prepared using CHARMM-GUI.

This is the log file from the first equilibration step (NVT):

step4.1_equilibration.log (1.3 MB)

The fourier-spacing is 0.12.

All settings look reasonable. You could use a fourier-spacing of 0.15, but that will not help much. The timings don't reveal much, as nearly everything runs on the GPU. I have no clue what could be the issue.
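For reference, that change is a single line in the .mdp file (0.15 nm as suggested; grompp rounds the FFT grid dimensions to match the spacing):

fourier-spacing = 0.15   ; default is 0.12 nm; a coarser PME grid, slightly cheaper mesh part

Everything else (fourierspacing x/y/z, grid sizes) can stay automatic.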

Maybe the best thing is to build it yourself and see if the performance improves. Maybe the Conda build has some sub-optimal configuration settings for CUDA.