Segmentation fault with REMD+GPU--potential bug?

GROMACS version: gromacs/2024.1+oneapi-2023.1
GROMACS modification: No

Hi everyone,

I am currently running a replica exchange simulation on a 30-mer of the PNIPAM polymer with a modified OPLS-AA force field and SPC/E water (following this paper, though with a 30-mer instead of its 40-mer: https://pubs.acs.org/doi/full/10.1021/acs.jpcb.9b01644), and I am hitting segmentation faults when using REMD+GPU.

Before REMD, I perform a thorough equilibration on every replica (which ran with no errors), consisting of:
250 ps NVT with position restraints on polymer atoms
10 ns NPT with position restraints on polymer atoms
2 ns NPT
10 ns NVT
(I used the checkpoint files from each stage as input for the next stage, as in the sketch below; I can attach the .mdp files for these if needed)
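
For reference, a minimal sketch of how the stages were chained (file names are illustrative, not my actual ones):

gmx grompp -f nvt_posres.mdp -c em.gro -r em.gro -p topol.top -o nvt_posres.tpr
gmx mdrun -deffnm nvt_posres
# each later stage reads the previous stage's checkpoint via -t to carry velocities over
gmx grompp -f npt_posres.mdp -c nvt_posres.gro -r nvt_posres.gro -t nvt_posres.cpt -p topol.top -o npt_posres.tpr
gmx mdrun -deffnm npt_posres
# ...and likewise for the unrestrained NPT and NVT stages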

I then attempt to run replica exchange with 32 temperatures spaced from 294 K to 325 K in 1 K increments (294, 295, …, 324, 325); the ladder is a plain arithmetic sequence, generated as in the sketch below.
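
# Hypothetical one-liner generating the 1 K ladder:
replicas=($(seq 294 1 325))
echo "${#replicas[@]} replicas"   # prints: 32 replicas

However, I always run into a constraint failure with the water: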

step 67600: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
[midway2-0642:7092 :0:7092] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffc07111948)
==== backtrace (tid:   7092) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000afed0d _INTERNALedf52f4d::nbnxn_make_pairlist_part<NbnxnPairlistGpu>()  pairlist.cpp:0
 2 0x0000000000af8a16 PairlistSet::constructPairlists()  ???:0
 3 0x000000000013a1a3 __kmp_invoke_microtask()  ???:0
 4 0x00000000000b9070 __kmp_fork_call()  ???:0
 5 0x0000000000079ee5 __kmpc_fork_call()  ???:0
 6 0x0000000000af7e12 PairlistSet::constructPairlists()  ???:0
 7 0x0000000000af7815 PairlistSets::construct()  ???:0
 8 0x0000000000af76ab nonbonded_verlet_t::constructPairlist()  ???:0
 9 0x00000000010dbccb _INTERNALf783a4fd::doPairSearch()  sim_util.cpp:0
10 0x00000000010d8634 do_force()  ???:0
11 0x0000000001255fc9 gmx::LegacySimulator::do_md()  ???:0
12 0x00000000012504ed gmx::LegacySimulator::run()  ???:0
13 0x000000000127adac gmx::Mdrunner::mdrunner()  ???:0
14 0x000000000040be08 gmx::gmx_mdrun()  ???:0
15 0x000000000040b92a gmx::gmx_mdrun()  ???:0
16 0x0000000000b1ffef gmx::CommandLineModuleManager::run()  ???:0
17 0x000000000040a0da main()  ???:0
18 0x0000000000022555 __libc_start_main()  ???:0
19 0x0000000000409f7b _start()  ???:0
=================================

This constraint failure never happens immediately; it occurs on the order of 50-100 ps into the simulation. It also does not always happen right after an exchange, but can occur up to several hundred steps after one.

I have tried the following fixes, but receive the same outcome:

Changing the timestep to 1 fs: always the same outcome (I would rather not go below 1 fs, e.g. 0.5 fs, for the sake of sampling efficiency)
Lowering the exchange frequency: same issue with 1 ps, 2 ps, 5 ps, and 10 ps exchange intervals (though increasing the interval makes the error appear later)

I highly doubt this is due to poor initial structures, an inherent instability in the system, or an incorrect setup: each replica equilibrated smoothly without warnings, and running the same REMD setup with exchanges turned off (-replex -1) does not produce the error.

What could be some potential reasons for the error? Is it an inherent issue in my .mdp parameters (e.g., integrator, constraints, etc.)?
Is it from poor temperature spacing? I did notice that the exchange probabilities between neighboring replicas were almost always 1 or 0, which can be a problem in itself, but I don't understand why it would lead to segmentation faults. I can attach my REMD output as well if needed.
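
For context: a neighbor swap is accepted with probability P = min(1, exp[(1/(kB*T_i) - 1/(kB*T_j)) * (U_i - U_j)]), so averages pinned at 0 or 1 usually mean the potential-energy distributions of neighboring replicas either overlap almost completely or barely at all. The per-pair averages can be pulled from the log with something like this (log name assumed; the exchange records all start with "Repl"):

grep "^Repl" replica_294/remd.log | tail -n 20
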
Is it a problem with the GPU scripting or cluster? This is what I predict is going wrong.

Additionally, how could I fix this? Would it be better to run REMD with GROMACS interfaced with PLUMED? Please let me know if I can attach anything else that would help; I have been debugging this for a very long time and am truly baffled as to why the replicas run perfectly fine on their own but always hit a segmentation fault once I turn exchange on.

(note: since I am working on two different clusters, the .tpr files were generated with GROMACS 2022.4 on another cluster, but I don't think this should be an issue)

The .mdp parameter file for each replica (300 K shown as an example; only the temperatures differ between replicas) and the REMD submit script are attached below.

Thanks in advance!

Best,
Ian Bongalonta
University of Chicago

.mdp file for each replica (example with 1 fs time step)

; REMD Production Run (10 ns) for Replica

; Run control
integrator              = sd            ; Stochastic dynamics integrator
dt                      = 0.001         ; 1 fs timestep
nsteps                  = 5000000       ; 5,000,000 steps x 1 fs = 5 ns
ld-seed                 = -1            ; Random seed (will be set per replica)
nstcomm                 = 100           ; More frequent removal of COM motion
comm_mode               = linear

; Output control
nstxout                 = 0             ; Don't save full precision coordinates
nstvout                 = 0             ; Don't save velocities
nstfout                 = 0             ; Don't save forces
nstcalcenergy           = 100           ; Calculate energies every 100 steps (needed for replica exchange)
nstenergy               = 5000          ; Save energies
nstlog                  = 5000          ; Log
nstxout-compressed      = 5000          ; Save compressed coordinates

; Neighbor searching
cutoff-scheme           = Verlet
nstlist                 = 20            ; More frequent neighbor list updates
pbc                     = xyz
rlist                   = 1.2

; Electrostatics
coulombtype             = PME
rcoulomb                = 1.2

; van der Waals
vdwtype                 = Cut-off
vdw-modifier            = Force-switch
rvdw_switch             = 1.0
rvdw                    = 1.2
DispCorr                = no

; Temperature coupling
tc-grps                 = Other SOL     ; Exact molecule names from topology
tau-t                   = 2.0 2.0       ; Time constants for both groups
ref-t                   = 300 300       ; Will be replaced per replica for both groups

; No pressure coupling since this is NVT

; Bonds
constraints             = h-bonds
constraint_algorithm    = LINCS
continuation            = yes           ; Continue from equilibration

submit script:

#!/bin/bash
#SBATCH --job-name=XXX
#SBATCH --output=XXX
#SBATCH --error=XXX
#SBATCH --partition=XXX
#SBATCH --account=XXX
#SBATCH --qos=XXX
#SBATCH --time=1-12:00:00       # Time limit - maximum allowed (1.5 days)
#SBATCH --nodes=1               # Single node
#SBATCH --ntasks=32             # 32 tasks for 32 replicas
#SBATCH --cpus-per-task=1       # CPUs per task
#SBATCH --mem=64G              # Memory - increased but still under 179G limit
#SBATCH --gres=gpu:4            # 4 GPUs (V100)

# Load necessary modules
module load gromacs

# Set OpenMP threads and GPU communication
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true

# Count number of replicas in multidir.dat
NUM_REPLICAS=$(wc -l < multidir.dat)
echo "Found $NUM_REPLICAS replicas in multidir.dat"

# Rename TPR files in each replica directory to match what GROMACS expects
echo "Preparing files for REMD run..."
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        # Extract temperature from directory name
        temp=$(echo $replica_dir | sed 's/replica_//')

        if [ -f "${replica_dir}/remd_${temp}.tpr" ]; then
            # Make a copy with the name that GROMACS expects
            cp "${replica_dir}/remd_${temp}.tpr" "${replica_dir}/remd.tpr"
            echo "Copied ${replica_dir}/remd_${temp}.tpr to ${replica_dir}/remd.tpr"
        else
            echo "Error: ${replica_dir}/remd_${temp}.tpr not found. Cannot proceed."
            exit 1
        fi
    fi
done < multidir.dat

# Create a checkpoint directory for recovery
mkdir -p checkpoints
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p "checkpoints/$timestamp"

# Only backup the directories listed in multidir.dat
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        cp -r "$replica_dir" "checkpoints/$timestamp/"
    fi
done < multidir.dat

echo "Backup of initial replica directories created in checkpoints/$timestamp/"

echo "Starting REMD simulation with $NUM_REPLICAS replicas across 4 GPUs..."

# Run GROMACS REMD simulation with GPU acceleration and improved stability
mpirun -np $NUM_REPLICAS gmx_mpi mdrun \
    -multidir $(cat multidir.dat) \
    -deffnm remd \
    -replex 5000 \
    -ntomp $SLURM_CPUS_PER_TASK
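
For completeness, this is how I check where in the ladder the SETTLE warnings land (a sketch; it assumes the warning text is written to each replica's remd.log):

# which replica logs contain the SETTLE warning, and on which log line
grep -n "can not be settled" replica_*/remd.log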

This should always work, independently of run settings, at least when the systems are stable in individual runs. At which temperatures do you get the constraint warnings? At the highest ones? Have you tried running the highest temperature replica without REMD? Is that stable?

I guess you are using GPU update. If that is the case, please try -update cpu to check if the problem might be related to that and report back.

Hi hess,

Apologies for the late reply; I have been trying to debug for the past few weeks but am still unsuccessful.

I tried -update cpu and the system ran quite slowly, so I would prefer to keep -update gpu. The constraint failures happened on the higher end of the ladder, but not at the very highest temperatures (the 308-315 K range).

I got it to work for my original system (30-mer of PNIPAM with OPLS-AA/SPC/E), but it is now failing for different chain lengths with a different water model (a 22-mer and a 44-mer with OPLS-AA/TIP4P-Ew).

I am quite baffled, as each of the replicas runs perfectly fine when I turn the exchanges off. The crashes do not occur right after the exchanges, however; they always happen a few ps in. The replicas were also sufficiently equilibrated for over 20 ns beforehand, and each equilibration stage (EM / 500 ps NVT with position restraints / 10 ns NPT with position restraints / 5 ns NPT / 5 ns NVT) completed with no issues or warnings.

When I increase the exchange interval, the system still crashes with the same error message, but the longer the interval, the longer it takes to crash: 94 ps with a 10 ps interval, 316 ps with a 20 ps interval, and 989 ps with a 50 ps interval (roughly 9, 16, and 20 exchange attempts, respectively). The energy landscape of my system is extremely rugged, as the polymer undergoes a coil-to-globule collapse transition in the temperature range I am investigating. Could it just be poor equilibration between exchanges?

I am currently using these temperatures:

replicas=(270 275 280 283 285 287 290 293 295 297 300 302 304 305 306 308 310 312 315 317 320 323 325 330 335 340 345 350)

Should I just test different temperatures?
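
For instance, a geometrically spaced ladder over the same 270-350 K range could look like the sketch below (geometric spacing gives roughly uniform acceptance when the heat capacity is roughly constant, which may not hold across a collapse transition):

# 28 geometrically spaced temperatures from 270 K to 350 K
awk 'BEGIN { tmin = 270; tmax = 350; n = 28;
             for (i = 0; i < n; i++) printf("%.1f ", tmin * (tmax/tmin)^(i/(n-1)));
             print "" }'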

Best,
Ian

But with -update cpu there are no issues? In that case there is a bug.

Hi Hess,

With -update cpu, the simulation still crashes with the same error.

Also, I found that the longer -replex is, the longer it takes to crash: with a 5 ps exchange interval it crashed at 42.5 ps, with a 10 ps interval at 84.7 ps, and with a 20 ps interval at 120 ps (roughly 6-9 exchange attempts in each case). When I change the LINCS settings:

lincs_order = 6 ; Increase from 4 to 6 for better stability during exchanges
lincs_iter = 2 ; Increase from 1 to 2 to correct more accurately for rotational lengthening
lincs_warnangle = 45 ; Increase warning angle threshold from 30 to 45 degrees

the simulation takes much longer to crash, but with the same error. Sometimes I get just the segmentation fault, and sometimes I get “One or more water molecules could not be settled”; sometimes it gives me PDBs, and sometimes it does not.

Is it likely an inherent instability in my system/parameters, or a bug in GROMACS? I am still confused, since every replica runs perfectly fine on its own without exchanges. The equilibration process was also thorough, with over 20 ns per replica which all ran fine.
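
One isolation test I can still try is taking the GPU out of the loop entirely for a short run, keeping everything else identical (a sketch using mdrun's standard offload flags):

mpirun -np $NUM_REPLICAS gmx_mpi mdrun \
    -multidir $(cat multidir.dat) \
    -deffnm remd \
    -replex 5000 \
    -ntomp $SLURM_CPUS_PER_TASK \
    -nb cpu -pme cpu -bonded cpu -update cpu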

Here is my prep script containing my parameters:

#SBATCH --time=00:20:00         # Time limit - 20 min should be plenty
#SBATCH --nodes=1               # Single node
#SBATCH --ntasks=1              # Only one task for preparation
#SBATCH --cpus-per-task=2       # CPUs per task
#SBATCH --mem=4G                # Memory
#SBATCH --gres=gpu:1            # 1 GPU (V100)

# Load necessary modules
module load gromacs/2024.1+oneapi-2023.1

# Define replica temperatures
#replicas=(270 275 280 283 285 287 290 293 295 297 300 302 304 305 306 308 310 312 315 317 320 323 325 330 335 340 345 350 355 360 365 370)
replicas=(275 280 283 285 287 290 293 295 297 300 302 304 305 306 308 310 312 315 317 320 323 325 330 335)

# Create REMD MDP file template - with MD integrator and v-rescale thermostat
cat > remd_template.mdp << EOF
; REMD Production Run (10 ns) for Replica

; Run control
integrator              = md            ; Molecular dynamics integrator (leap-frog)
;integrator              = sd
dt                      = 0.001         ; 1 fs timestep
nsteps                  = 5000000       ; 5,000,000 steps x 1 fs = 5 ns
nstcomm                 = 100           ; More frequent removal of COM motion
comm_mode               = linear

; Output control
nstxout                 = 0             ; Don't save full precision coordinates
nstvout                 = 0             ; Don't save velocities
nstfout                 = 0             ; Don't save forces
nstcalcenergy           = 100           ; Calculate energies every 100 steps (needed for replica exchange)
nstenergy               = 5000          ; Save energies
nstlog                  = 5000          ; Log
nstxout-compressed      = 5000          ; Save compressed coordinates

; Neighbor searching
cutoff-scheme           = Verlet
nstlist                 = 20            ; More frequent neighbor list updates
pbc                     = xyz
rlist                   = 1.2

; Electrostatics
coulombtype             = PME
rcoulomb                = 1.2

; van der Waals
vdwtype                 = Cut-off
vdw-modifier            = Force-switch
rvdw_switch             = 1.0
rvdw                    = 1.2
DispCorr                = no

; Temperature coupling with v-rescale (improved stability)
tcoupl                  = v-rescale     ; v-rescale thermostat
tc-grps                 = Other SOL     ; Exact molecule names from topology
tau_t                   = 1.0 1.0       ; Time constants (shorter than SD for stability)
ref_t                   = 300 300       ; Will be replaced per replica for both groups

; No pressure coupling since this is NVT

; Bonds - improved constraint settings
constraints             = h-bonds
;constraints             = all-bonds    ; Alternative tested: constrain all bonds, not just h-bonds
constraint_algorithm    = LINCS        ; Keep using LINCS
lincs_order             = 6            ; Increase from 4 to 6 for better stability during exchanges
lincs_iter              = 2            ; Increase from 1 to 2 to correct more accurately for rotational lengthening
lincs_warnangle         = 45           ; Increase warning angle threshold from 30 to 45 degrees
continuation            = yes           ; Continue from equilibration

EOF

# Process each replica
for temp in "${replicas[@]}"; do
    replica_dir="replica_$temp"

    # Check if directory exists
    if [ ! -d "$replica_dir" ]; then
        echo "Error: Directory $replica_dir does not exist."
        continue
    fi

    echo "Preparing REMD files for replica at ${temp}K..."

    # Create REMD MDP file with temperature-specific parameters
    # Replace only the temperature references
    sed "s/ref_t                   = 300 300/ref_t                   = ${temp} ${temp}/g" remd_template.mdp > ${replica_dir}/remd_${temp}.mdp

    # Check for the NVT equilibration checkpoint file (step5)
    if [ -f "${replica_dir}/step5_${temp}.cpt" ]; then
        echo "Using NVT equilibration checkpoint for replica at ${temp}K..."

        # Generate TPR using the checkpoint file to maintain velocities
        gmx_mpi grompp -f ${replica_dir}/remd_${temp}.mdp \
                -c ${replica_dir}/step5_${temp}.gro \
                -p topol.top \
                -t ${replica_dir}/step5_${temp}.cpt \
                -o ${replica_dir}/remd_${temp}.tpr \
                -maxwarn 1

        if [ $? -ne 0 ]; then
            echo "Warning: Failed to create TPR with checkpoint file. Checking alternatives..."
        else
            echo "Successfully created TPR file with NVT checkpoint for ${temp}K."
            continue  # Move to next replica
        fi
    fi

    # If checkpoint not found or grompp failed, try using just the structure file
    if [ -f "${replica_dir}/step5_${temp}.gro" ]; then
        echo "Checkpoint not available. Using NVT equilibrated structure for replica at ${temp}K..."

        # Switch to regenerating velocities since we don't have the checkpoint
        cp ${replica_dir}/remd_${temp}.mdp ${replica_dir}/remd_${temp}.mdp.bak
        sed -i "s/continuation            = yes/continuation            = no/" ${replica_dir}/remd_${temp}.mdp
        echo "gen_vel                 = yes" >> ${replica_dir}/remd_${temp}.mdp
        echo "gen_temp                = ${temp}" >> ${replica_dir}/remd_${temp}.mdp
        echo "gen_seed                = ${temp}" >> ${replica_dir}/remd_${temp}.mdp

        gmx_mpi grompp -f ${replica_dir}/remd_${temp}.mdp \
                -c ${replica_dir}/step5_${temp}.gro \
                -p topol.top \
                -o ${replica_dir}/remd_${temp}.tpr \
                -maxwarn 1
    else
        # Last resort - try the energy minimized structure if that's all we have
        if [ -f "${replica_dir}/em_${temp}.gro" ]; then
            echo "NVT structure not found. Falling back to minimized structure for ${temp}K (not ideal)..."

            # Definitely need to generate velocities for minimized structures
            cp ${replica_dir}/remd_${temp}.mdp ${replica_dir}/remd_${temp}.mdp.bak
            sed -i "s/continuation            = yes/continuation            = no/" ${replica_dir}/remd_${temp}.mdp
            echo "gen_vel                 = yes" >> ${replica_dir}/remd_${temp}.mdp
            echo "gen_temp                = ${temp}" >> ${replica_dir}/remd_${temp}.mdp
            echo "gen_seed                = ${temp}" >> ${replica_dir}/remd_${temp}.mdp

            gmx_mpi grompp -f ${replica_dir}/remd_${temp}.mdp \
                    -c ${replica_dir}/em_${temp}.gro \
                    -p topol.top \
                    -o ${replica_dir}/remd_${temp}.tpr \
                    -maxwarn 1
        else
            echo "Error: No suitable structure file found for replica at ${temp}K."
            continue
        fi
    fi
done

# Create multidir.dat file for the REMD run
echo "Creating multidir.dat file..."
rm -f multidir.dat
for temp in "${replicas[@]}"; do
    if [ -f "replica_${temp}/remd_${temp}.tpr" ]; then
        echo "replica_${temp}" >> multidir.dat
    fi
done

# Count number of replicas ready for REMD
NUM_REPLICAS=$(wc -l < multidir.dat)
echo "Prepared $NUM_REPLICAS replicas for REMD simulation."

# Clean up template file
rm -f remd_template.mdp

echo "All REMD files have been prepared!"
echo "You can now submit the REMD job using the remd.sh script."

and here is my run script (I tried with and without -update cpu, and both failed)

#SBATCH --time=16:00:00         # Time limit (16 hours)
#SBATCH --nodes=2               # Two nodes
#SBATCH --ntasks=24             # 24 tasks for 24 replicas
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=2       # CPUs per task
#SBATCH --mem=32G               # Memory - still under the 179G limit
#SBATCH --gres=gpu:4            # 4 GPUs per node (V100)

# Load necessary modules
module load gromacs/2024.1+oneapi-2023.1

# Define replica temperatures
#replicas=(270 275 280 283 285 287 290 293 295 297 300 302 304 305 306 308 310 312 315 317 320 323 325 330 335 340 345 350)
replicas=(275 280 283 285 287 290 293 295 297 300 302 304 305 306 308 310 312 315 317 320 323 325 330 335)

# Set OpenMP threads and GPU communication
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GMX_GPU_DD_COMMS=false
export GMX_FORCE_UPDATE_DEFAULT_GPU=true

# Check if multidir.dat exists
if [ ! -f "multidir.dat" ]; then
    echo "Error: multidir.dat file not found. Run prep_remd_new.sh first."
    exit 1
fi

# Count number of replicas in multidir.dat
NUM_REPLICAS=$(wc -l < multidir.dat)
echo "Found $NUM_REPLICAS replicas in multidir.dat"

# Rename TPR files in each replica directory to match what GROMACS expects
echo "Preparing files for REMD run..."
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        # Extract temperature from directory name
        temp=$(echo $replica_dir | sed 's/replica_//')

        if [ -f "${replica_dir}/remd_${temp}.tpr" ]; then
            # Make a copy with the name that GROMACS expects
            cp "${replica_dir}/remd_${temp}.tpr" "${replica_dir}/remd.tpr"
            echo "Copied ${replica_dir}/remd_${temp}.tpr to ${replica_dir}/remd.tpr"
        else
            echo "Error: ${replica_dir}/remd_${temp}.tpr not found. Cannot proceed."
            exit 1
        fi
    fi
done < multidir.dat

# Create a checkpoint directory for recovery
mkdir -p checkpoints
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p "checkpoints/$timestamp"

# Only backup the directories listed in multidir.dat
while read replica_dir; do
    if [ -d "$replica_dir" ]; then
        cp -r "$replica_dir" "checkpoints/$timestamp/"
    fi
done < multidir.dat

echo "Backup of initial replica directories created in checkpoints/$timestamp/"

echo "Starting REMD simulation with $NUM_REPLICAS replicas across 4 GPUs..."

# Run GROMACS REMD simulation with GPU acceleration and improved stability
mpirun -np $NUM_REPLICAS gmx_mpi mdrun \
    -multidir $(cat multidir.dat) \
    -deffnm remd \
    -replex 5000 \
    -ntomp $SLURM_CPUS_PER_TASK \
    -dlb no \
    -update cpu

The errors that I get are also inconsistent: sometimes I get PDB files, but mostly it is segfaults.

Is my system just bad? I would love some help, as I have been stuck with this error for over a month, and I am baffled since it works perfectly fine without exchanges.

What are your thermostat and barostat settings?