HREMD, GPU fallen off the bus

GROMACS version: 2019
GROMACS modification: Yes, plumed

Hi,

I would like to run HREMD simulations. I planned to test the setup on a single node with 4x RTX 2070 GPUs, an AMD Ryzen Threadripper 2950X 16-core processor, and a 1700W PSU, if I remember correctly.

I have Debian, Open MPI, PLUMED 2.6.1, GROMACS 2019.6, NVIDIA driver version 418.87.00, CUDA version 10.1.

mpirun -np 4 /opt/gromacs-2019.6-plumed261/bin/mdrun_mpi -v -plumed plumed.dat -multidir sim[0123] -replex 100 -hrex -dlb no

The simulation starts with no problem.
However, the last entry in md.log is:
Replica exchange at step 13800 time 27.60000

On the first run, I got this error message: GPU at 00000000:41:00.0 has fallen off the bus.
In this case nothing else changed in the system; the GPU simply became invisible, and nvidia-smi told me to reboot.
On the second run, after a successful simulation start, the computer simply restarted without any message in any log file.

My first suspicion is that the power supply is insufficient.
However, while monitoring GPU usage with nvidia-smi, I never saw more than 50-60% utilization.
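
A query along these lines (standard nvidia-smi query fields; the exact selection is my own choice) reports utilization together with the instantaneous power draw and the configured power limit once per second, which may be more telling than utilization alone:

nvidia-smi --query-gpu=index,utilization.gpu,power.draw,power.limit,temperature.gpu --format=csv -l 1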

My second thought is that the automatic GPU task assignment may be a problem, or CPU utilization may be an associated issue. From one of the log files:

This is simulation 0 out of 4 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process
Using 8 OpenMP threads

4 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 4 ranks on this node:

  PP:0,PME:0,PP:1,PME:1,PP:2,PME:2,PP:3,PME:3
  PP tasks will do (non-perturbed) short-ranged interactions on the GPU
  PME tasks will do all aspects on the GPU
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
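
One thing I considered is taking the automatic assignment out of the equation and specifying the mapping explicitly. This is only a sketch: the -gputasks string assumes one replica per GPU with PP and PME on the same device, and 4 OpenMP threads per rank to cover the 16 cores:

mpirun -np 4 /opt/gromacs-2019.6-plumed261/bin/mdrun_mpi -v -plumed plumed.dat \
    -multidir sim[0123] -replex 100 -hrex -dlb no \
    -nb gpu -pme gpu -ntomp 4 -pin on -gputasks 00112233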

Note that when I run regular MD using 16 cores and 4 GPUs, I do not have any problem.
How should I test this issue? How should I run this HREMD simulation?

Thanks for your help and suggestions,
Tamas

Hi,

That the GPU fell off the bus suggests a hardware issue, and so does the unexpected reboot. It could be a faulty PSU or possibly the motherboard. I think a 1700W PSU should be sufficient (provided it is of decent quality; it needs to be able to actually deliver >1000W).

The reason why you have not seen such errors before could be, for example, that the GPU load when parallelizing across 4 devices is lower than the load an ensemble run puts on each GPU.

I suggest checking the kernel log for messages prior to the GPU falling off the bus or the machine restarting. I'd start with that and try to diagnose and eliminate a possible hardware failure.
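
Something along these lines is what I would grep for (this assumes a systemd journal that persists across reboots; Xid and NVRM messages are the usual suspects):

dmesg -T | grep -iE 'xid|nvrm|fallen off|iommu'
journalctl -k -b -1 | grep -iE 'xid|nvrm|fallen off'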

Cheers,
Szilárd

Hi,

Thanks. Hopefully it will be the PSU.
If I run only 3 replicas using 3 GPUs in total, there is no problem.

The only thing I found in the logs associated with the failure:
[453289.242071] nvidia 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0000000729438040 flags=0x0020]
I found that it may be caused by the hardware IOMMU handling DMA, and that switching to the software implementation with the kernel parameter iommu=soft may help. It did not.
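
For reference, the change was roughly the following (assuming the stock Debian GRUB setup; the existing options in GRUB_CMDLINE_LINUX_DEFAULT of course stay as they are):

# /etc/default/grub: append iommu=soft to the existing options, e.g.
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=soft"
# then regenerate the GRUB config and reboot
sudo update-grub
sudo reboot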

Hopefully I will know more about the PSU in a few days.

Thanks, Tamas

Hi,

It seems that I have a PSU problem and I do not know how to correct it. The PSU seems to be undersized, and I have no idea what a sufficient wattage would be.

I have two identical Linux boxes: same motherboard, processors, and 1700W high-end PSU.

One has four GTX1080Ti, the other has four RTX2080Ti.

https://outervision.com/power-supply-calculator results in 1320W and suggests 1600W. If I run HREMD simulations using the four GTX1080Ti cards, everything is fine. If I move these cards to the other box, everything is still fine, so a failure of the PSU, motherboard, RAM, or anything else is excluded. If I run the same simulations using the four RTX2080Ti cards, the cards fall off the bus or the box reboots. If I limit the power to 200W per card using nvidia-smi, the same thing happens.
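
For reference, the 200W cap can be set like this (enabling persistence mode first so the limit sticks between runs):

sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0,1,2,3 -pl 200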

If I power two of the 2080Ti cards from the PSU of this computer and the other two from the PSU of the other computer (on my office desk, with open cases, so there is 2x1700W available), then the simulation also runs without any problem with the RTX2080Ti cards.

This is somewhat strange, since the RTX2080Ti is said to require less power than the 1080Ti.

Nevertheless, I do not know what size PSU would be sufficient. Would 2000W be enough, or should I go with 3000W?

Could it be that this phenomenon is a motherboard limitation, and it would work with 2 PSUs but not with 1 large one? (mobo: ROG ZENITH EXTREME)

Thanks for your help and suggestions,

Tamas

Hi,

Strange issue indeed. I have no definite recommendation; I can only suggest further experiments. But before that, could you just try to get a warranty claim on the PSU and have it swapped?

Since you have two identical systems with only the four GPUs differing, I'd try all four combinations of PSU + mobo, i.e. PSU_1 + mobo_1, PSU_1 + mobo_2, etc., with the four 2080Ti's. That could at least help clarify whether it is an issue with the suspected PSU or somehow a combination of one of the PSUs and one of the motherboards.

Have you also tried setting an even lower GPU power cap?

Having heard of the RTX 3000 series debacles, where similar issues arise due to (IIRC) spikes in power draw, I could imagine that if you were closer to the 1320W the calculator estimates, you could be seeing a similar issue. However, my gut feeling is that a quality 1700W PSU should be sufficient, and at least with 2080Ti's you should not need more.

Cheers,
Szilárd

I have tried all those combinations, so I am sure that swapping the PSU under warranty would not help. They are Enermax Platimax EPM1700EGT 1700W units.

Now I have tried limiting the power cap to 100W, and there is no problem.

It also came to my mind that running 3 cards is OK, so I likely need a PSU rated for 1700W plus the power for one more GPU (>>250W, so say 500W, totaling 2200-2500W).
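
As a rough budget, assuming the reference TDPs (~250W per 2080Ti, ~180W for the 2950X) plus some margin for the rest of the system: 4 x 250W + 180W + ~100W is about 1280W sustained. If the cards also show short power excursions above their rated TDP, as Szilárd mentioned for the RTX 3000 series, transient draw could get uncomfortably close to the 1700W rating, which would be consistent with three cards or two PSUs being fine while four cards on one PSU are not.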