GROMACS hangs when run from python multiprocess

GROMACS version: 2022.4
GROMACS modification: No

Dear all,
I have written a small Python program that uses multiprocessing to run a set of parallel simulations under Linux. The program then analyses the data from each produced trajectory.
Each thread runs a single instance of gmx and produces the trajectory as expected.

Unfortunately, once the trajectory is produced, the process doesn’t close and the jobs hang there waiting forever, or until I kill them, still using GPU memory even though they have finished.
I know there is the -multidir option but due to how the program is structured I can’t use it (it would mean rewriting a lot of code that is used by other classes and stuff).

So far I have tried:
gmx mdrun -gpu_id {GPU} -deffnm {self.par["Output"]}{trajCount}{walk_count}
or
mpirun -np 2 gmx_mpi mdrun -multidir {walk_count} -gpu_id {GPU} -deffnm {self.par["Output"]}{trajCount}{walk_count}
and finally:
mpiexec -np 2 gmx_mpi mdrun -ntomp 8 -gpu_id {GPU} -deffnm {self.par["Output"]}{trajCount}{walk_count}

They all work and produce the trajectory, but they all hang at the end. I suspect it’s how multiprocessing and MPI interact (a process that creates multiple threads?).
What can I do to “trigger” the release of the process after the production of the trajectory is complete?

I have seen that this question was asked on the forums about 18 years ago ([gmx-users] gromacs hanging at end of parallel run), but the answer was not clear and was never followed up.

Unfortunately, this issue seems to happen only with GROMACS (I also tested NAMD, ACEMD and OpenMM), as if it’s waiting for a signal to move forward and close the thread.

Thank you for any help or advice!

Ludovico

You mention “multiprocessing” but also “threads”. Are you using multiprocessing, one of the several threading-related modules, or something external, like mpi4py?

How are you launching your subprocesses? Some older mechanisms were susceptible to problems like you describe if you neglected to handle all of the output. If you haven’t migrated yet, use subprocess.run() to get a CompletedProcess instance so you know the process has ended.
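
For instance, a minimal sketch (the command string is only a placeholder for whatever your script builds):

import subprocess

# run() blocks until the child exits and returns a CompletedProcess,
# so there is no ambiguity about whether the process has ended.
result = subprocess.run(
    "gmx mdrun -deffnm prod -gpu_id 0",  # placeholder command
    shell=True,
    capture_output=True,  # drain stdout/stderr so a full pipe cannot stall the child
    text=True,
)
result.check_returncode()  # raises CalledProcessError on a non-zero exit code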

Still, it is possible that something in your MPI environment is not getting cleaned up. There are definitely weird possible interactions.

In part to address this sort of situation, the gmxapi Python package uses mpi4py and MPI subcommunicators, exclusively, to manage multiple cores, and you only use mpiexec on the initial call to the Python interpreter. This could be a good option if gmxapi-managed MPI subcommunicators look like a good fit for your concurrent simulations. Through gmxapi 0.4, though, tasks launch and get released synchronously, so resource allocation may not be sufficiently flexible.

Still, even without the “ensemble” feature, the Python bindings to mdrun through gmxapi never directly use MPI_COMM_WORLD, and clean up their MPI environment when the simulator completes, without finalizing the global communicator until the Python interpreter exits. This could be helpful if your problem is related to multiple mpirun calls from the same process messing up each other’s environment somehow.
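
As a rough sketch of that pattern, based on the gmxapi documentation (the .tpr name is a placeholder): the interpreter itself is launched under MPI, e.g. mpiexec -n 4 python -m mpi4py script.py, and the script declares and runs the simulation through the package:

import gmxapi as gmx

simulation_input = gmx.read_tpr("prod.tpr")  # placeholder .tpr file
md = gmx.mdrun(simulation_input)             # declares the simulation work
md.run()                                     # blocks until the simulator completes and cleans up its MPI resources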


Thank you very much for your reply!
Sorry for the confusion, I am using multiprocessing and I was referring to the spawned processes.
I use python’s multiprocessing with pool.apply_async to run each simulation.

The idea is to run “x” amount of multiple MDs in parallel, wait for them all to finish, analyse and extract some data, and choose the “best simulation” according to a metric → save checkpoint of the best → restart a new batch of “x” simulations from the previous “best’s” checkpoint → merge “winning” trajectories.

Specifically, each process is launched through pool.apply_async() via subprocess.Popen(), since I need all simulations to run in parallel and finish before I analyse them and choose the one I need.

E.g.:

with mp.Pool(processes=len(GPUbatches)) as pool:
    for GPUbatch in GPUbatches:
        results.append(pool.apply_async(self.runGPU_batch, args=(self.trajCount, walk_count, GPUbatch, q)))
        walk_count += len(GPUbatch)
    for result in results:
        result.get()
    print(f"Waiting for all processes to finish...")
    while not q.empty():
        q.get()
    print(f"All batches finished.")
pool.close()
pool.terminate()

the method called by pool is:

def runGPU_batch(self, trajCount, walk_count, GPUbatch, queue):
    processes = []
    for GPU in GPUbatch:
        os.chdir('tmp/walker_' + str(walk_count))
        command = self.lauchEngine(trajCount, walk_count, GPU,
                                   self.customProductionFile)
        process = subprocess.Popen(command, shell=True)
        processes.append(process)
        walk_count += 1
        os.chdir(self.folder)
        print(command)
    # Wait for all subprocesses to finish
    for process in processes:
        process.wait()
    for GPU in GPUbatch:
        queue.put((trajCount, walk_count, GPU))  # Notify completion
    return walk_count  # Return the updated walk_count value

Each walker represents a different folder where an instance of GROMACS is launched.
Once the batch of walkers (and GPUs) is complete, I move to the next one (walkers +=…).
while not q.empty() makes me wait for ALL the simulations to be over (necessary to compare them all, obviously) to proceed with the code.

From what I have seen from the -debug, it seems that everything should be clean. Both ACEMD and NAMD terminate the process normally in the queue and clean the environment, but somehow it seems like GROMACS is waiting for a communication signal.

I am also pretty sure I’m not running GROMACS in the best way (resource-wise and command-wise), as I am not familiar with the engine.
I’m running on a 4-GPU machine under Linux.

I will try subprocess.run() and see how it goes.

Should this fail I’ll move to gmxapi and see how it goes.

Again, any advice or help is super appreciated!
Thank you for your time!

Do all of the test cases use GPUs and the same version of GROMACS? It is conceivable that there is something in the GPU resource management that is getting confused or stuck under these circumstances. It would be interesting to know whether the same thing is triggered by, say, backgrounding an mdrun with & at the terminal and then launching a concurrent process for a different GPU, as you seem to be doing. It may be that this scenario is/was not well tested with the sort of process hierarchy you have. Normally, I think people use different GPUs from separate HPC jobs, but I could be wrong.

Can you share more details of your computing environment? I.e. is this an HPC environment with a job queue system?

Is it feasible to check other releases of gromacs?

I agree.
I fear that there is something in the MPI resource management (if you look at my initial post this is something that happened in the past on this forum).

I have tried to free the console with &, > gromacs.log, 2&1> gromacs.log or normally (no & or redirect to log) but it gets stuck anyway.
Notice that if I run the jobs manually outside of Python, everything goes well. I think the problem lies in how multiprocessing deals with MPI.
I can try >/dev/null 2>&1 but I fear that it won’t change much (I’ll try anyway!)

Executable: /share/apps/gromacs/bin/gmx
Data prefix: /share/apps/gromacs
Working dir: /home/pipitol
Command line:
gmx -version

GROMACS version: 2022.4
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /share/apps/gcc/devtoolset-9/root/usr/bin/cc GNU 9.3.1
C compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /share/apps/gcc/devtoolset-9/root/usr/bin/c++ GNU 9.3.1
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /share/apps/cuda/11.4/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2021 NVIDIA Corporation;Built on Wed_Jul_14_19:41:19_PDT_2021;Cuda compilation tools, release 11.4, V11.4.100;Build cuda_11.4.r11.4/compiler.30188945_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;;-mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 11.40

I am running via ssh on a Linux machine with 4 GPUs (no SLURM or any other queuing system).
It pretty much runs like a local host, just accessed via ssh.

I can ask to test other releases of gromacs.
Thank you again for your quick reply!

I have tried with subprocess.run() and it still doesn’t work.

I do think there’s something wrong with MPI and resource management (but that could depend on me as well):
whenever the pool runs multiple processes, even when using different GPUs, the speed is drastically reduced, as if they were all using the very same resources.

In summary:

  • &, 2&1> or not redirecting at all doesn’t make a difference
  • running via pool.apply_async() drastically reduces speed and leaves the processes hanging (although they complete successfully and the trajectories are written)
  • I’m probably not using the best command line to get the most out of GROMACS

Thanks!

Hi,

You seem to be running a thread-MPI build; you could try lib-MPI, see

https://manual.gromacs.org/documentation/current/install-guide/index.html#typical-installation

https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html

You may also be getting resource oversubscription unless you made sure to assign a different set of CPU cores and GPUs to each mdrun simulation. That would explain the very low performance, but likely not the hanging.

For an example of how manually scheduled node-sharing runs can be set up, e.g. with externally managed resources (with numactl and CUDA_VISIBLE_DEVICES) you can take a look at the examples here:

Cheers,
Szilárd


Thank you for your reply!
I hope it’ll be a matter of build.

In my case, I think I don’t need to share nodes. I am trying to recreate the simplest scenario: one folder, one GPU, one simulation running locally.
The script is meant to run short, independent MD runs using one GPU each via -gpu_id.

When I run in series (one after another), I have no performance issues whatsoever and the production goes well.
However, when I run parallel threads from Python’s multiprocessing, the performance is drastically reduced and it hangs. :(

It’s a shame because from what I’ve seen, GROMACS seems very fast.

Thank you!

By sharing I meant that multiple processes (not belonging to the same mdrun multi-run) share the same node. In this case the mdrun engine itself cannot take care of resource allocation, since there are multiple independent processes.

Resource assignment is critically important regardless of the use case or application, but if you run simulations (or any workload) side by side on the same node, you need to make sure that they use the resources you intended them to use and share only those resources you intended to be shared.

Concretely, mdrun, unless told otherwise, will use all CPU cores in a node (and with thread-MPI it will also try to make use of all GPUs by default).

I suggest taking a look at the previously shared example and at the thread-pinning and GPU command line options of mdrun.

Cheers
Szilárd

pipitoludovico wrote on May 3:
I agree.
I fear that there is something in the MPI resource management (if you look at my initial post this is something that happened in the past on this forum).

In the examples you gave, you include a thread-MPI version of gromacs, though. Where would the MPI interaction be in that case? (Note that the multiprocessing module does not use MPI.)

Similarly, have you tried without GPUs?

I have tried to free the console with &, > gromacs.log, 2&1> gromacs.log or normally (no & or redirect to log) but it gets stuck anyway.
Notice that if I run the jobs manually outside of Python, everything goes well. I think the problem lies in how multiprocessing deals with MPI.
I can try >/dev/null 2>&1 but I fear that it won’t change much (I’ll try anyway!)

I was specifically interested in testing & at the command line to reproduce a similar process hierarchy without involving Python. The experiment would have been more relevant, I think, if you were launching these tasks from within an HPC allocation. Otherwise, you might as well just try multiple terminal windows.

You might also try one of the other process launchers for the multiprocessing module: multiprocessing — Process-based parallelism — Python 3.12.0 documentation
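
For example, a minimal sketch using the “spawn” start method through a context object (the commands are placeholders): each worker then starts from a fresh interpreter rather than a fork of the parent, which avoids inheriting MPI or CUDA state.

import multiprocessing as mp
import subprocess

def run_one(cmd):
    # each worker simply blocks on its own mdrun invocation
    return subprocess.run(cmd, shell=True).returncode

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # or "forkserver"; the default on Linux is "fork"
    commands = [f"echo placeholder mdrun command {i}" for i in range(4)]
    with ctx.Pool(processes=len(commands)) as pool:
        return_codes = pool.map(run_one, commands)
    print(return_codes)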

Sorry, my bad, as always: I assumed that Python’s multiprocessing was just meant to spawn x number of “child” processes and that the resource distribution was handled by MPI through GROMACS on its own.

Yes. And the result is the same. Processes hang when running through pool.apply_async().

I think this is a good point (as well as what Szilárd suggested) to explain the performance reduction: if multiprocessing inherits the resources of the original parent process, this could be why performance is reduced when running multiple mdruns from Python.
From what I have understood, the multiprocessing main process takes a given amount of resources and splits them between all its “child” processes (please correct me if I’m wrong).

I will experiment with “spawn” and “fork” and see if performance improves.

I can’t run from multiple terminals as I’m using a single script that’s set to automatically run “x” mdruns. However, I tried running multiple mdruns one by one from different folders outside Python, and the performance is good and, as expected, nothing hangs.
The problem seems to be with multiprocessing.

Thanks!

Strictly, there is no “redistribution”. You can and should tell each mdrun instance where to run, e.g. assign each simulation num_total_cpu_cores / num_concurrent_simulations cores, and make sure that mdrun runs on these (be it using numactl or, preferably, the -pin/-pinoffset options). The same goes for GPUs. Neither Python nor the operating system does that for you.
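
As a rough sketch of that arithmetic (the simulation count here is a hypothetical one-per-GPU choice; the core count is whatever the OS reports):

import os

# divide the hardware threads evenly among the concurrent simulations
n_cores = os.cpu_count() or 1
n_sims = 4  # hypothetical: one concurrent simulation per GPU
cores_per_sim = n_cores // n_sims

# each simulation gets its own GPU and a disjoint block of cores
mdrun_opts = [
    f"-ntomp {cores_per_sim} -pin on -pinoffset {i * cores_per_sim} -gpu_id {i}"
    for i in range(n_sims)
]
# e.g. with 32 hardware threads and 4 GPUs:
#   -ntomp 8 -pin on -pinoffset 0 -gpu_id 0, -ntomp 8 -pin on -pinoffset 8 -gpu_id 1, ...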

Alternatively, a job scheduler or the mdrun -multidir option can do that for you.

It spawns child processes through various possible mechanisms. If you use the Pool, then it limits the number of direct subprocesses, but if those subprocesses in turn call mpiexec, there will be processes outside of the control of the multiprocessing module, and you can get into trouble with interactions between MPI libraries (or their dependency libraries) and kernel native process management like fork calls. (Best to choose one single scheme for managing processes and stick to it.)

The multiprocessing module does not do any resource division or resource management beyond that. (All it does is count the number of processes it has launched.) If you want to restrict the number of cores (or pin specific cores), GPUs, etc, you have to do that on your own. Szilard cited some references on how to tell gromacs more granularly what resources it should use. Otherwise, it does its best to detect and use all resources on the system. So even if you do restrict the resources available to the child processes (such as through Linux cgroups or something), you have to tell gromacs, or it will oversubscribe resources even within its own process space.

Yes. And the result is the same. Processes hang when running through pool.apply_async().

Thanks. This is important to understand. (Unfortunately, it means that if the process launch mechanism doesn’t address the problem, I may be out of ideas.)

Also (for the record), let me attempt to clarify: neither process.wait() nor subprocess.run() actually returned, right? So the process didn’t finish or terminate. It completed its work but never exited or produced an exit code. Meaning that it gets stuck somewhere within the program, and we could conceivably attach a debugger to find what line (or subroutine or library) it is stuck at (preferably with a CMAKE_BUILD_TYPE=Debug build of GROMACS). And this occurs

  • in gromacs 2022.4,
  • with or without a (external library / “real”) MPI implementation,
  • with or without a GPU (but only tested with CUDA-enabled builds),
  • with RDTSCP enabled (and without RDTSCP errors)
  • without HWLOC
  • on a Linux system without a queuing system or other execution manager.

100% accurate. I tried to “give it a nudge” by simulating a “hit enter” at the end of the process but it was still stuck.

Exactly. If I manually kill all the processes with a pkill -f gmx, everything runs as it should.

I can run with -debug and drop everything into a log file if you need it.

I don’t know what RDTSCP or HWLOC are; to the best of my knowledge, I’ve never heard of or used either of them. Consider that I’m running on a Linux machine via ssh with no queuing system or execution manager.

I have specified -pin on -pinoffset x, where x was a different number for each run, and -gpu_id I, where I was the GPU ID. Still, when running through pool.apply_async() the performance was drastically reduced. Looking at the output, GROMACS correctly uses the different GPUs on the machine.

I thought about using mdrun -multidir but I need to understand how to set it up right (from what I have understood, you need to prepare the input/output files for each subdirectory and then run -multidir a b c from the level above).

Thanks!

And btw thank you both for the explanations! :)

I don’t think that is likely to help. I meant that someone ought to try attaching gdb or dtrace, preferably after recompiling.

I don’t know what RDTSCP or HWLOC are; to the best of my knowledge, I’ve never heard of or used either of them. Consider that I’m running on a Linux machine via ssh with no queuing system or execution manager.

I’m just trying to establish information to help with reproducing the problem or narrowing down the scope.


That sounds good. You might also want to verify that things run where you expect them to run (e.g. using htop or hwloc).

Correct, that’s all you need to do.

Also note that the slight drawback of -multidir is that the multi-simulation will only complete when each member simulation has completed. If the individual simulations are of equal length this is not an issue, but if they are not, this will be inferior to a more sophisticated Python script which fills each “slot” on the machine as soon as a simulation completes.
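
If you do go the -multidir route, a minimal sketch of driving it from the same script might look like the following. It assumes a lib-MPI build providing gmx_mpi and that each walker directory already contains its own prod.tpr; the directory names, thread count and rank count are placeholders to adapt to your node.

import subprocess

walkers = ["walker_0", "walker_1", "walker_2", "walker_3"]  # one subdirectory per simulation
cmd = (
    "mpirun -np 4 gmx_mpi mdrun "            # one MPI rank per member simulation
    f"-multidir {' '.join(walkers)} "
    "-deffnm prod -ntomp 8 -pin on"          # mdrun then partitions cores and GPUs across members
)
subprocess.run(cmd, shell=True, check=True)  # returns only when the whole multi-simulation has exited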

Frankly speaking, what I’m doing with my script is, in a way, similar to what -multidir already does (although I run a set of independent simulations from one .cpt instead of setting up multiple inputs/outputs… but again, I need to understand how to work with GROMACS better).

I will look into that and check as you said. The thing that baffles me is that if I run, say, 4 simulations directly from the command line from each respective folder, everything works well! :\
If I automate the exact same thing via Python’s multiprocessing, it hangs… but completes everything.

It seems like GROMACS is waiting for a signal to terminate the process.
If I do pkill -f gmx to close all the mdrun processes, the processes are terminated, the script continues, the environment is clean and everything works fine.
I don’t know why this happens only with GROMACS unfortunately :(

Regarding the performance loss with multiprocessing, I will try setting its start method to “spawn” and see if this changes anything.

Thanks again for your patience!

Update:
I have tried setting the multiprocessing start method with:
mp.set_start_method('spawn')
but it was spawning multiple parent processes after calling the command line x times. Furthermore, the performance was still pretty bad compared to running in serial mode.

However, I have specified the resource usage more explicitly on the command line, and now the processes don’t hang anymore!
:-)

Specifically, I’m running mdrun with the following settings after trying various combinations:

gmx mdrun -s aptamer_0_1.tpr -v    -gpu_id 0 -npme -1 -ntmpi 0 -ntomp 0 -ntomp_pme 0 -pin on -pme gpu -nb gpu -bonded gpu -update gpu -pinoffset 0 -nstlist 40 &> gromacs.log

The speed, however, is still affected when running in parallel via multiprocessing.

Consider that Python’s pool.apply_async() is running multiple command lines from the terminal with different GPU ids and pin offsets.
e.g.:

gmx mdrun -s aptamer_0_2.tpr -v    -gpu_id 1 -npme -1 -ntmpi 0 -ntomp 0 -ntomp_pme 0 -pin on -pme gpu -nb gpu -bonded gpu -update gpu -pinoffset 1 -nstlist 40 &> gromacs.log

and so on (basically the script runs gmx mdrun from the command line, each time referencing a different tpr, a different GPU id, and a different pinoffset for each run).

When running each command individually outside of Python it takes about 30 seconds for a 400K atoms system (55 ns/day).
However, when run through Python’s multiprocessing it takes about 9 minutes and the speed is reduced to 5-7 ns/day.

In summary:

  • processes don’t hang anymore after specifying -npme -1 -ntmpi 0 -ntomp 0 -ntomp_pme 0
  • performance is still drastically reduced compared to running that very same command above outside of Python one by one

What should I try now?
Thank you very much for your advice and patience! They are really appreciated!

I don’t think that’s what you want to do. Not setting the number of threads and offsetting by 1 will lead to oversubscription (and most likely pinning will also fail, because each mdrun will launch as many threads as there are cores/hardware threads).

As I noted earlier, you need to partition both CPU and GPU resources between the mdrun instances you are trying to launch; please check the blog post I linked earlier.
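
For illustration, a hedged sketch of what the corrected launches might look like inside the existing Popen-based loop, assuming a hypothetical 32-core, 4-GPU node (adjust the thread count, offsets and .tpr names to the real hardware and inputs):

import subprocess

cores_per_sim = 8  # hypothetical: 32 hardware threads split over 4 concurrent runs
processes = []
for i, tpr in enumerate(["aptamer_0_1.tpr", "aptamer_0_2.tpr"]):  # placeholder .tpr names
    cmd = (
        f"gmx mdrun -s {tpr} -ntmpi 1 -ntomp {cores_per_sim} "    # fixed thread count instead of 0
        f"-pin on -pinoffset {i * cores_per_sim} -pinstride 1 "   # disjoint blocks of cores per run
        f"-gpu_id {i} -nb gpu -pme gpu -bonded gpu -update gpu "  # one GPU per run
        f"> walker_{i}.log 2>&1"
    )
    processes.append(subprocess.Popen(cmd, shell=True))
for p in processes:
    p.wait()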