Adding external GPU force to GROMACS results in energy conservation problem

GROMACS version:2025.3
GROMACS modification: Yes

I’m currently implementing a module that feeds an external neural network potential into GROMACS. For some reason I cannot use libtorch, so I wrote an external force module that adds my force into the GPU force buffer in sim_utils.cpp :: do_force(), something like this:

after postProcessForceWithShiftForces and before pme_receive_force_ener.

I have checked that stateGpu->getForces() is the force pointer used for integration. But after doing this, I find that the system energy is not conserved when running NVE.

After a bunch of trial and error, I found a simple but strange fix: removing one line in md.cpp :: do_md:

if ((simulationWork.useGpuPme && simulationWork.useCpuPmePpCommunication)
    || (!runScheduleWork_->stepWork.useGpuFBufferOps))
{
    // The PME forces were received on the host, and reduced on the CPU with the
    // rest of the forces computed on the GPU, so the final forces have to be
    // copied back to the GPU. Or the buffer ops were not offloaded this step,
    // so the forces are on the host and have to be copied
    // stateGpu->copyForcesToGpu(f.view().force(), AtomLocality::Local); // hclu: comment this line for external force mode
    // hclu: if -pme gpu is not set, gmx uses the default CPU PME and enters this branch,
    // which may copy zero forces into the stateGpu buffer?
}

If I comment out this line, the integration is correct when running NVE. I’m wondering whether this would affect other functionality, or whether there is a more elegant way to add my force to the GPU force buffer?

My system is a PBC box with 216 water molecules. Since I only use the NN potential to drive the dynamics, I faked a non-interacting topology for the water molecules:

; sol_nil.itp
; Topology for non-interacting water molecules (SOL).
; Goal: all interactions (LJ, electrostatic, bonded) are set to zero.

[ atomtypes ]
; name  at.num  mass    charge  ptype  sigma(nm)  epsilon(kJ/mol)
; 
OW      8       15.999  0.0     A      0.0        0.0
HW      1       1.008   0.0     A      0.0        0.0

[ moleculetype ]
; name     nrexcl
SOL        1

[ atoms ]
; nr  type  resnr  residue  atom  cgnr  charge  mass
; 
1     OW    1      SOL      OW    1     0.0     15.999
2     HW    1      SOL      HW1   1     0.0     1.008
3     HW    1      SOL      HW2   1     0.0     1.008

; Note: no [ bonds ], [ angles ], or [ dihedrals ] sections are needed

; Zero-Interaction System Topology
; This topology has been modified to use the custom 'SOL' zero force-field model.

[ defaults ]
; nbfunc  comb-rule  gen-pairs  fudgeLJ  fudgeQQ
1       2          no         0.0      0.0

; Include the custom water molecule topology (SOL)
; Make sure the moleculetype name in sol_nil.itp is now SOL
#include "sol_nil.itp"

[ system ]
; Name
Zero-Interaction Water Box

[ molecules ]
; Compound         #mols
; the name is now SOL
SOL                216

The mdp file:

; MD parameters for external force testing

integrator = md ; leap-frog integrator
dt = 0.0005 ; 0.5 fs time step
nsteps = 10000000 ; 10,000,000 steps = 5 ns
nstxout = 100 ; write coordinates every 100 steps
nstvout = 100 ; write velocities every 100 steps
nstfout = 100 ; write forces every 100 steps
nstlog = 100 ; write log every 100 steps
nstenergy = 100 ; write energies every 100 steps
nstxout-compressed = 100 ; write compressed trajectory every 100 steps

; ========================================
; ⭐ Verlet scheme + minimized nonbonded interactions
; ========================================

; DispCorr: no long-range dispersion correction
DispCorr = no

tcoupl = no
tc_grps = system
tau_t = 0.1
ref_t = 300

constraints = none ; no constraints

gen-vel = no ; do not generate initial velocities
continuation = no
gen-seed = 1 ; random seed

nstlist = 100
rcoulomb = 0.6
rlist = 0.6
rvdw = 0.6
pbc = xyz
periodic_molecules = yes

The gmx command includes my self-defined flags -extforcegpu and -extport; the rest is the same:

gmx mdrun -s test.tpr -ntmpi 1 -ntomp 1 -gpu_id 2 -update gpu -extforcegpu -extport 1098 -v

Hi! I’m guessing that you’ve seen the existing NNPot interface in GROMACS. You say that “for some reason you cannot use libtorch”, do you mean that your model is not trained in PyTorch?
In GROMACS, we have the MDModules/IForceProvider API, that should make something like this straightforward to implement (see e.g. the implementation of the NNPot interface). However, it is true that this API does not provide access to the GPU buffers, and the problem you encountered is precisely why. With all the work being scheduled on the GPU, including constraints and update, it is very tricky to ensure that e.g. the position and force buffers are valid at the point of access.

I can’t tell you why the fix you found solves your energy conservation problems, but I would say that doing this is pretty much guaranteed to break things in other simulations, e.g. when you want to do hybrid NNP/MM-style simulations where some part of the system is still treated at the MM level with PME.

However, if your neural network potential implementation is not extremely lightweight, I think taking the detour through the CPU will not affect the performance that much, because data transfer should be fast by comparison with NNP inference.
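As a rough sanity check of that claim, here is a back-of-the-envelope sketch; the ~16 GB/s effective host-device bandwidth and the 100k-atom system size are assumptions, not measurements:

```cpp
// Per-step PCIe traffic for the CPU detour: positions copied D2H and the NNP
// forces copied H2D, i.e. 3 floats (4 bytes each) per atom in each direction.
double transferMicroseconds(long nAtoms, double bandwidthBytesPerSec)
{
    const double bytesPerStep = 2.0 * nAtoms * 3 * 4; // both directions
    return bytesPerStep / bandwidthBytesPerSec * 1e6;
}
```

With the assumed numbers, transferMicroseconds(100000, 16e9) comes out to about 150 us per step, which is small next to NNP inference that will likely take milliseconds at that system size (transfer latency and launch overheads add a bit on top).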

As an aside, this should probably be moved to the developers discussions. @al42and , do you mind?

Done. Also, consider the developer meeting tomorrow (if that works for your timezone)


oh that would be great

The situation is that my model is inferred in a distributed form, so it cannot be converted to TorchScript, and I also try to avoid doing D2H or H2D transfers too frequently because the system might be large. I try to do only device-side memcpy, so I decided not to use the ForceProvider API and instead write directly into the GPU force buffer. I just wonder how the data is transferred during one MD step. Maybe I didn’t write my force at the right time or in the right place? Or could I write my NN force into the bonded force buffer on the GPU?

And I’m not sure under what conditions the do_md() function enters the H2D force copy branch. Is there any setting I can use to avoid this step?

I see. Yes, that’s of course the most scalable approach, but then you’ll have to be extremely careful about what you do here. A mistimed host-device copy is definitely the most likely thing to go wrong. In terms of GPU data and control flow, I’m afraid I can’t help you much, since I’d have to dig into the code myself first.

Thanks for the answers from Lukas, Andrey, Prof. Hess, and the other members of the GROMACS development team. Maybe the ForceProvider would be a better choice; I’ll try moving my implementation to a ForceProvider.

Thanks for joining the call! And sorry for the perhaps unsatisfying answer. For correctness and maintainability, the ForceProvider API is definitely the better choice. If you run into any problems, don’t hesitate to ask!

To summarize: I’m not saying your initial approach cannot work, but going via ForceProviders is safer and likely easier. It is more future-proof too: we sometimes reshuffle the inner schedule code even in patch releases.

Once you have the ForceProviders implementation, you can measure the overhead of moving data via the CPU. And if it turns out to be unacceptable, you can always go back to hacking do_md/do_force, but now with the known-good version as a reference. And maybe your experience will help with extending ForceProviders to support GPU buffers :)
