Performance Discrepancy in Equilibration Runs: Seeking Solutions for Efficient Production Run

GROMACS version: 2023.3 and 2023.2
GROMACS modification: No

Hello Gromacs community,

I am seeking help to improve GROMACS performance on our local supercomputer.
I compared the same 125 ns equilibration run on OSC (Ohio Supercomputer Center) and on our local machine: the local machine took 28 hours, whereas OSC took only 40 minutes. I have attached the output files of both simulations for your reference.

I utilized all the CPUs from our local supercomputer (2 sockets × 24 cores × 2 threads = 96 CPUs), and 48 CPUs from OSC (2 sockets × 24 cores × 1 thread = 48 CPUs). Please see below for the CPU comparison.

I believe the production run will take significantly longer, as indicated by the comparison between OSC and the local supercomputer, so this needs to be addressed before we proceed with the production run. Could you please share your ideas on how to deal with this issue?

The attempts I will make to fix this:

  1. Use one thread per core instead of two, to see if that is causing the problem.

  2. Manually change the domain decomposition, as suggested in line 2923 of the local supercomputer output file (a rough sketch of both attempts follows after the attached log files).
    step6.1_equilibration_test_GPU_server.log (135.0 KB)
    step6.1_equilibration_test_OSC.log (119.4 KB)
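
Roughly, the command lines I plan to test look like this (the thread counts are only a first guess, and the -dd grid and matching thread count are the ones suggested in the note in the log file):

gmx mdrun -nt 48 -ntomp 1 -v -deffnm step6.1_equilibration_test
gmx mdrun -nt 36 -ntomp 1 -dd 12 1 3 -v -deffnm step6.1_equilibration_test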

+------------------+------------------+--------------------------+
|                  | Local GPU server | OSC                      |
+------------------+------------------+--------------------------+
| CPU model        | AMD EPYC 7352    | Intel Xeon Platinum 8268 |
| Cores            | 24               | 24                       |
| Threads          | 48 (SMT)         | 48 (Hyper-Threading)     |
| Base clock speed | 2.3 GHz          | 2.9 GHz                  |
| L3 cache         | 16 MB            | Varies                   |
| Architecture     | Zen 2            | Cascade Lake             |
| Socket           | SP3 (PCIe 4.0)   | FCLGA3647 (PCIe 3.0)     |
| PCIe support     | PCIe 4.0         | PCIe 3.0                 |
| TDP              | Not specified    | Not specified            |
| Virtualization   | AMD-V            | VT-x, VT-d               |
| Manufacturer     | AMD              | Intel                    |
+------------------+------------------+--------------------------+

I would try using fewer ranks. One rank per core is often not optimal, especially when there are many cores. I would first try -ntmpi 4 or -ntmpi 8 to see if it helps.

Is there a reason why you’re not using the GPUs on your local supercomputer?

Hi MagnusL,

Thanks for your suggestions. Following the note in the output file, I used the -dd flag to manually change the domain decomposition and had to add -nt and -ntomp as well. This significantly improved the CPU-based performance; the job now completes in about the same time as on OSC (around 40 minutes, versus 28 hours previously). This is the command line:
gmx mdrun -nb cpu -bonded cpu -nt 36 -ntomp 1 -dd 12 1 3 -v -deffnm step6.1_equilibration_test_2

I later used GPUs as well (4 GPUs) and completed the same job in 12 minutes, using -ntmpi instead of -nt. I also discovered that the -dd flag is not necessary. My final command line with improved performance was:
gmx mdrun -ntmpi 48 -ntomp 1 -v -deffnm step7_production_2
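
Out of curiosity, an equivalent line with the GPU work spelled out explicitly would, I think, look something like the one below; I have been relying on mdrun's automatic task assignment, so I have not benchmarked it:
gmx mdrun -ntmpi 48 -ntomp 1 -nb gpu -gpu_id 0123 -v -deffnm step7_production_2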

I am happy with the current performance but still get the following message:

NOTE:
The number of threads is not equal to the number of (logical) cpus
and the -pin option is set to auto: will not pin threads to cpus.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).

I tried using -ntmpi 96, which removes the NOTE above but significantly reduces the performance.
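
If I understand the NOTE correctly, the pinned variant of my current command would be something like the following (I have not benchmarked this exact line), with -pinoffset added only if several jobs share the node:
gmx mdrun -ntmpi 48 -ntomp 1 -pin on -v -deffnm step7_production_2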

Pawan

Thanks for the update. Good to hear that you got speed-ups. Did you try with lower -ntmpi as well, as I suggested? I would try with 4 or 8. If gmx mdrun -nb cpu -bonded cpu -ntmpi 4 -v -deffnm step6.1_equilibration_test_2 does not help, you could try gmx mdrun -nb cpu -bonded cpu -ntmpi 4 -ntomp 12 -pin on -v -deffnm step6.1_equilibration_test_2.

When running on GPU I would suggest just using one at a time, e.g.:
gmx mdrun -ntmpi 4 -gpu_id 0 -v -deffnm step6.1_equilibration_test_2.

If you want to use more GPUs (which is a good idea for efficient resource usage), I would suggest running more jobs in parallel - one job per GPU and fewer CPU cores per job. That is usually the most efficient way. Communication between GPUs is getting better in GROMACS, especially if you have NVLink, but in most cases I’d still recommend running multiple parallel jobs.
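
As a rough sketch (the thread counts, pin offsets and file names are just placeholders for your 48-core, 4-GPU machine; adjust them to what works best), two such jobs could be started like this:

# one job per GPU, each pinned to its own set of 24 physical cores
gmx mdrun -ntmpi 4 -ntomp 6 -gpu_id 0 -pin on -pinoffset 0 -pinstride 2 -deffnm run_gpu0 &
gmx mdrun -ntmpi 4 -ntomp 6 -gpu_id 1 -pin on -pinoffset 48 -pinstride 2 -deffnm run_gpu1 &
wait

With all four GPUs you would start four jobs with 12 threads each and pin offsets 0, 24, 48 and 72.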

I hope that helps.

Hi MagnusL,

Thank you for the information. The best performance was with all 48 CPUs (any combination of -ntmpi and -ntomp whose product is 48; lower values of -ntomp were better). I think the reason I am getting the NOTE "The number of threads is not equal to the number of (logical) cpus" is that I built GROMACS without MPI support. I think there are 2 nodes in my system (96 CPUs in total, 48 in each node) that cannot be used efficiently without MPI support. The following lscpu output gives me the idea of 2 NUMA nodes:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7352 24-Core Processor
Stepping: 0
CPU MHz: 3148.522
CPU max MHz: 2300.0000
CPU min MHz: 1500.0000
BogoMIPS: 4599.92
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95

If you’ve got numactl installed you can try:
numactl --cpunodebind=0 --membind=0 gmx mdrun ...
to use only the first CPU (NUMA node 0).
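
With explicit thread counts it could look like this (24 is just the number of physical cores in one NUMA node; test what works best on your system):
numactl --cpunodebind=0 --membind=0 gmx mdrun -ntmpi 24 -ntomp 1 -v -deffnm step6.1_equilibration_test_2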

Hi Magnus,
Thank you for the suggestions. I tried them. These flags removed the NOTE "The number of threads is not equal to the number of (logical) cpus…", but they drastically reduced the performance. Both nodes were slow, although node 1 performed better than node 0. I think I will ignore the NOTE for now and keep the current settings. Thank you for your help. I will try to build GROMACS with MPI support in the future, which might let me use both nodes at once. Hopefully, I will be able to fix the tests that failed during "make check" of the MPI-enabled build. Do you have experience with the GROMACS build process? I want to build GROMACS with MPI support.
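
For reference, the kind of build I have in mind is roughly the following; apart from -DGMX_MPI=ON, the options and paths are guesses on my part and would need to be checked against the install guide:

# sketch of an MPI-enabled GROMACS build (install prefix, GPU option and -j width are assumptions)
tar xfz gromacs-2023.3.tar.gz
cd gromacs-2023.3
mkdir build && cd build
cmake .. -DGMX_MPI=ON -DGMX_GPU=CUDA -DGMX_BUILD_OWN_FFTW=ON \
         -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs-2023.3-mpi
make -j 24
make check
sudo make install
# the MPI build installs gmx_mpi, which is launched through mpirun, e.g.:
# mpirun -np <ranks> gmx_mpi mdrun -ntomp 1 -v -deffnm step7_production_2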