Segmentation fault with REMD+GPU -- potential bug?

GROMACS version: gromacs/2024.1+oneapi-2023.1
GROMACS modification: No

Hi everyone,

I am currently running a replica exchange simulation of a 30-mer of the PNIPAM polymer with a modified OPLS-AA force field and SPC/E water (from this paper, except that I use a 30-mer instead of the 40-mer studied there: https://pubs.acs.org/doi/full/10.1021/acs.jpcb.9b01644), and I am running into segmentation faults when using REMD+GPU.

Before REMD, I performed a thorough equilibration of every replica (which ran with no errors), consisting of:
250 ps NVT with position restraints on polymer atoms
10 ns NPT with position restraints on polymer atoms
2 ns NPT
10 ns NVT
(I used the checkpoint files from each stage as input for the next stage; I can attach the .mdp files for these if needed. A sketch of the workflow follows.)
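For reference, the stages were chained through grompp/mdrun roughly as follows (a minimal sketch with placeholder file names; my actual scripts differ in the details):

# Stage 1: 250 ps NVT with position restraints (illustrative file names)
gmx grompp -f nvt_posres.mdp -c system.gro -r system.gro -p topol.top -o nvt_posres.tpr
gmx mdrun -deffnm nvt_posres

# Stage 2: 10 ns NPT with position restraints, continuing from the stage-1 checkpoint
gmx grompp -f npt_posres.mdp -c nvt_posres.gro -r nvt_posres.gro -t nvt_posres.cpt \
           -p topol.top -o npt_posres.tpr
gmx mdrun -deffnm npt_posres

# Stages 3 (2 ns NPT) and 4 (10 ns NVT) follow the same pattern, each taking
# the previous stage's .cpt via -t.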

I then attempt to run replica exchange with 32 temperatures spaced from 294 K to 325 K in 1 K increments (294, 295, …, 324, 325). The per-replica inputs were generated roughly as sketched below.
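Here is the rough shape of the setup loop (a sketch; the template file remd_template.mdp and its REF_TEMP placeholder are illustrative, not my literal setup):

# Build one directory, .mdp, and .tpr per temperature (illustrative names).
rm -f multidir.dat
for T in $(seq 294 1 325); do
    mkdir -p "replica_${T}"
    # The template's ref-t line reads "REF_TEMP REF_TEMP"; substitute both groups.
    sed "s/REF_TEMP/${T}/g" remd_template.mdp > "replica_${T}/remd_${T}.mdp"
    gmx grompp -f "replica_${T}/remd_${T}.mdp" -c "equil_${T}.gro" -t "equil_${T}.cpt" \
               -p topol.top -o "replica_${T}/remd_${T}.tpr"
    echo "replica_${T}" >> multidir.dat
done

However, I always run into a constraint failure in the water: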

step 67600: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
[midway2-0642:7092 :0:7092] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffc07111948)
==== backtrace (tid:   7092) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000afed0d _INTERNALedf52f4d::nbnxn_make_pairlist_part<NbnxnPairlistGpu>()  pairlist.cpp:0
 2 0x0000000000af8a16 PairlistSet::constructPairlists()  ???:0
 3 0x000000000013a1a3 __kmp_invoke_microtask()  ???:0
 4 0x00000000000b9070 __kmp_fork_call()  ???:0
 5 0x0000000000079ee5 __kmpc_fork_call()  ???:0
 6 0x0000000000af7e12 PairlistSet::constructPairlists()  ???:0
 7 0x0000000000af7815 PairlistSets::construct()  ???:0
 8 0x0000000000af76ab nonbonded_verlet_t::constructPairlist()  ???:0
 9 0x00000000010dbccb _INTERNALf783a4fd::doPairSearch()  sim_util.cpp:0
10 0x00000000010d8634 do_force()  ???:0
11 0x0000000001255fc9 gmx::LegacySimulator::do_md()  ???:0
12 0x00000000012504ed gmx::LegacySimulator::run()  ???:0
13 0x000000000127adac gmx::Mdrunner::mdrunner()  ???:0
14 0x000000000040be08 gmx::gmx_mdrun()  ???:0
15 0x000000000040b92a gmx::gmx_mdrun()  ???:0
16 0x0000000000b1ffef gmx::CommandLineModuleManager::run()  ???:0
17 0x000000000040a0da main()  ???:0
18 0x0000000000022555 __libc_start_main()  ???:0
19 0x0000000000409f7b _start()  ???:0
=================================

This constraint failure never happens immediately; it occurs on the order of 50-100 ps into the simulation. Additionally, it does not always happen right after an exchange, but can occur up to several hundred steps after one.

I have tried the following fixes, with the same outcome each time:

Changing the timestep to 1 fs: same outcome (and I would rather not use anything shorter than this, such as 0.5 fs, for sampling-efficiency reasons)
Lowering the exchange frequency: same issue with 1 ps, 2 ps, 5 ps, and 10 ps exchange intervals (though increasing the interval makes the error happen later)

I highly doubt it is poor initial structures, an inherent instability in the system, or an incorrect setup: each replica equilibrated smoothly without warnings, and running the same REMD command with exchanges turned off (-replex -1) did not produce the error.

What could be causing the error? Is it an inherent issue in my .mdp parameters (e.g., integrator, constraints)?
Is it poor temperature spacing? I did notice that the exchange probabilities between replicas were almost always 1 or 0, which is itself a problem, but I don't understand why it would lead to segmentation faults. I can attach my REMD output as well if needed; a sketch of how I pulled the statistics from the log follows these questions.
Is it a problem with the GPU scripting or the cluster? This is what I suspect is going wrong.
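The acceptance statistics I mention above came from the replica-exchange log, roughly like this (a sketch; if I'm reading the log format right, GROMACS writes "Repl ex" lines for each attempt and a summary block at the end of the first replica's log):

# Exchange attempts during the run (printed in the first replica's log):
grep "Repl ex" replica_294/remd.log | head
# Final per-pair acceptance summary:
grep -A 6 "Replica exchange statistics" replica_294/remd.log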

Additionally, how could I fix this? Would it be better to run REMD with GROMACS interfaced to PLUMED? Please let me know if there is anything more I can attach that would help; I have been debugging this for a very long time and am truly baffled that the replicas run perfectly fine on their own but always hit a segmentation fault once I turn exchange on.

(Note: since I am working on two different clusters, the .tpr files were generated with GROMACS 2022.4 on another cluster, but I don't think this should be an issue.)

The .mdp parameter files for each replica (300 K shown as an example; only the reference temperature differs between replicas) and the REMD submit script are attached below.

Thanks in advance!

Best,
Ian Bongalonta
University of Chicago

.mdp file for each replica (example with 1 fs time step)

; REMD Production Run for Replica (10 ns at the original 2 fs time step)

; Run control
integrator              = sd            ; Stochastic dynamics integrator
dt                      = 0.001         ; 1 fs time step (reduced from 2 fs as one attempted fix)
nsteps                  = 5000000       ; 5,000,000 steps
ld-seed                 = -1            ; Random seed (will be set per replica)
nstcomm                 = 100           ; Remove COM motion every 100 steps
comm_mode               = linear

; Output control
nstxout                 = 0             ; Don't save full precision coordinates
nstvout                 = 0             ; Don't save velocities
nstfout                 = 0             ; Don't save forces
nstcalcenergy           = 100           ; Calculate energies every 100 steps (needed for replica exchange)
nstenergy               = 5000          ; Save energies
nstlog                  = 5000          ; Log
nstxout-compressed      = 5000          ; Save compressed coordinates

; Neighbor searching
cutoff-scheme           = Verlet
nstlist                 = 20            ; Neighbor-list update interval (mdrun may increase this)
pbc                     = xyz
rlist                   = 1.2

; Electrostatics
coulombtype             = PME
rcoulomb                = 1.2

; van der Waals
vdwtype                 = Cut-off
vdw-modifier            = Force-switch
rvdw_switch             = 1.0
rvdw                    = 1.2
DispCorr                = no

; Temperature coupling
tc-grps                 = Other SOL     ; Default index groups: polymer (Other) and water (SOL)
tau-t                   = 2.0 2.0       ; Time constants for both groups
ref-t                   = 300 300       ; Will be replaced per replica for both groups

; No pressure coupling since this is NVT

; Bonds
constraints             = h-bonds
constraint_algorithm    = LINCS
continuation            = yes           ; Continue from equilibration

submit script:

#!/bin/bash
#SBATCH --job-name=XXX
#SBATCH --output=XXX
#SBATCH --error=XXX
#SBATCH --partition=XXX
#SBATCH --account=XXX
#SBATCH --qos=XXX
#SBATCH --time=1-12:00:00       # Time limit - maximum allowed (1.5 days)
#SBATCH --nodes=1               # Single node
#SBATCH --ntasks=32             # 32 tasks for 32 replicas
#SBATCH --cpus-per-task=1       # CPUs per task
#SBATCH --mem=64G              # Memory - increased but still under 179G limit
#SBATCH --gres=gpu:4            # 4 GPUs (V100)

# Load necessary modules
module load gromacs

# Set OpenMP threads and GPU communication
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
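# Note: GROMACS 2023+ reportedly replaced these two variables with
# GMX_ENABLE_DIRECT_GPU_COMM, so 2024.1 may silently ignore them.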

# Count number of replicas in multidir.dat
NUM_REPLICAS=$(wc -l < multidir.dat)
echo "Found $NUM_REPLICAS replicas in multidir.dat"

# Rename TPR files in each replica directory to match what GROMACS expects
echo "Preparing files for REMD run..."
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        # Extract temperature from directory name
        temp=$(echo "$replica_dir" | sed 's/replica_//')

        if [ -f "${replica_dir}/remd_${temp}.tpr" ]; then
            # Make a copy with the name that GROMACS expects
            cp "${replica_dir}/remd_${temp}.tpr" "${replica_dir}/remd.tpr"
            echo "Copied ${replica_dir}/remd_${temp}.tpr to ${replica_dir}/remd.tpr"
        else
            echo "Error: ${replica_dir}/remd_${temp}.tpr not found. Cannot proceed."
            exit 1
        fi
    fi
done < multidir.dat

# Create a checkpoint directory for recovery
mkdir -p checkpoints
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p "checkpoints/$timestamp"

# Only backup the directories listed in multidir.dat
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        cp -r "$replica_dir" "checkpoints/$timestamp/"
    fi
done < multidir.dat

echo "Backup of initial replica directories created in checkpoints/$timestamp/"

echo "Starting REMD simulation with $NUM_REPLICAS replicas across 4 GPUs..."

# Run GROMACS REMD simulation with GPU acceleration and improved stability
mpirun -np $NUM_REPLICAS gmx_mpi mdrun \
    -multidir $(cat multidir.dat) \
    -deffnm remd \
    -replex 5000 \
    -ntomp $SLURM_CPUS_PER_TASK

This should always work, independently of run settings, at least when the systems are stable in individual runs. At which temperatures do you get the constraint warnings? At the highest ones? Have you tried running the highest-temperature replica without REMD? Is that stable?
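For example, something along these lines (a sketch using your directory layout):

# Run only the hottest replica, with no -multidir/-replex:
mpirun -np 1 gmx_mpi mdrun -s replica_325/remd.tpr -deffnm single_325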

I guess you are using GPU update. If that is the case, please try -update cpu to check if the problem might be related to that and report back.
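That is, something like this (your command with coordinate/constraint updates forced onto the CPU):

mpirun -np $NUM_REPLICAS gmx_mpi mdrun \
    -multidir $(cat multidir.dat) \
    -deffnm remd \
    -replex 5000 \
    -ntomp $SLURM_CPUS_PER_TASK \
    -update cpu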