Increasing and excessive memory use with OpenCL on AMD GPUs

GROMACS version: 2023.1
GROMACS modification: Yes/No

I successfully compiled GROMACS on an AMD GPU cluster. Each node has 8 GPUs, so I'm running 8 simultaneous simulations per node. As the simulations proceed, memory usage keeps increasing until some of the jobs die. I noticed this issue came up back in 2018: memory leak in OpenCL runs with - Redmine #2470 (#2470) · Issues · GROMACS / GROMACS · GitLab. Per the suggestion in that issue, I set the GMX_DISABLE_GPU_TIMING variable, but the problem persists. Any suggestions to avoid the memory leak? Thanks!
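For reference, this is how I'm applying the workaround from that report, exported in the job environment before each mdrun instance starts (the `-deffnm` name and `-gpu_id` value below are illustrative placeholders, not my actual job script):

```shell
# Workaround suggested in issue #2470: disable GPU timing events,
# which were reported to leak in OpenCL runs.
export GMX_DISABLE_GPU_TIMING=1

# Launch one of the 8 per-node simulations (placeholder names/ids).
gmx mdrun -deffnm run1 -gpu_id 0
```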

GROMACS version: 2023.1
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: OpenCL
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: VkFFT internal (1.2.26-b15cb0ca3e884bdb6c901a12d87aa8aadf7637d8) with OpenCL backend
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/tce/bin/cc GNU 10.3.1
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/tce/bin/c++ GNU 10.3.1
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-variable -Wno-newline-eof -Wno-old-style-cast -Wno-zero-as-null-pointer-constant -Wno-unused-but-set-variable -Wno-sign-compare -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
OpenCL include dir: /opt/rocm-5.4.3/include
OpenCL library: /opt/rocm-5.4.3/lib/libOpenCL.so
OpenCL version: 2.2

Hi!

Thanks for reporting the problem.

Could you share a bit more detail about the system you're running and how fast the memory is leaking? Also, is it CPU or GPU memory that leaks?
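A quick way to tell the two apart, assuming ROCm's `rocm-smi` tool is available (your OpenCL paths suggest ROCm 5.4.3 is installed), is to watch both host and device memory while the jobs run:

```shell
# Host (CPU) side: resident set size (RSS, in kB) of each running gmx process
ps -C gmx -o pid,rss,cmd

# Device (GPU) side: per-GPU VRAM usage via ROCm's monitoring tool
rocm-smi --showmeminfo vram
```

If the RSS numbers grow over time while VRAM stays flat, the leak is in host memory.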

If you don’t mind testing, could you please check if the following patch resolves your issue?

diff --git src/gromacs/gpu_utils/devicebuffer_ocl.h src/gromacs/gpu_utils/devicebuffer_ocl.h
index 3d868a979a..b250418cab 100644
--- src/gromacs/gpu_utils/devicebuffer_ocl.h
+++ src/gromacs/gpu_utils/devicebuffer_ocl.h
@@ -268,9 +268,8 @@ void clearDeviceBufferAsync(DeviceBuffer<ValueType>* buffer,
     const int       pattern       = 0;
     const cl_uint   numWaitEvents = 0;
     const cl_event* waitEvents    = nullptr;
-    cl_event        commandEvent;
     cl_int          clError = clEnqueueFillBuffer(
-            deviceStream.stream(), *buffer, &pattern, sizeof(pattern), offset, bytes, numWaitEvents, waitEvents, &commandEvent);
+            deviceStream.stream(), *buffer, &pattern, sizeof(pattern), offset, bytes, numWaitEvents, waitEvents, NULL);
     GMX_RELEASE_ASSERT(clError == CL_SUCCESS,
                        gmx::formatString("Couldn't clear the device buffer (OpenCL error %d: %s)",
                                          clError,

(If you’re not familiar with the “diff” format: open the src/gromacs/gpu_utils/devicebuffer_ocl.h file, find line 273, and replace &commandEvent at the end of the line with NULL).

By the way, are there any reasons you’re using OpenCL and not the newer SYCL backend? Compiling it might be a bit more complicated, but it’s more optimized and that’s where most of the current development is focused as far as AMD GPUs are concerned.
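If you do try the SYCL backend later, the configure step for MI50-class GPUs looks roughly like the sketch below. The hipSYCL toolchain choice, the gfx906 target, and the compiler paths are assumptions based on your hardware and ROCm install; please check the GROMACS installation guide for the exact options for your setup.

```shell
# Sketch of a GROMACS 2023 SYCL build with hipSYCL targeting AMD MI50
# (gfx906). Paths are placeholders; consult the install guide before using.
cmake .. -DGMX_GPU=SYCL \
         -DGMX_SYCL_HIPSYCL=ON \
         -DHIPSYCL_TARGETS='hip:gfx906' \
         -DCMAKE_C_COMPILER=/opt/rocm-5.4.3/llvm/bin/clang \
         -DCMAKE_CXX_COMPILER=/opt/rocm-5.4.3/llvm/bin/clang++
```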

All compute nodes have AMD Rome processors with 48 cores/node. Each compute node has 8 AMD MI50 GPUs with 256 GB per node.

It is the CPU memory that leaks, at roughly ~100 MB per 8 minutes per simulation, and I'm running 8 simultaneous simulations per node. Note that at the start of a simulation, memory usage is ~120 MB.
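As a sanity check on those numbers: treating the leak as linear (an assumption), the per-node growth rate and the time until the node's 256 GB is exhausted work out as follows.

```python
# Rough arithmetic on the reported leak: ~100 MB per 8 minutes per simulation,
# 8 simulations per node, 256 GB of RAM per node, ~120 MB baseline per job.
leak_per_sim_mb_per_min = 100 / 8                   # 12.5 MB/min per simulation
node_leak_mb_per_min = leak_per_sim_mb_per_min * 8  # 100 MB/min per node
node_ram_mb = 256 * 1024
baseline_mb = 120 * 8
minutes_to_oom = (node_ram_mb - baseline_mb) / node_leak_mb_per_min
print(f"{node_leak_mb_per_min:.0f} MB/min per node, "
      f"~{minutes_to_oom / 60:.0f} hours until memory is exhausted")
```

In practice jobs die sooner, since other processes share the node's memory, but this gives the order of magnitude (roughly two days per node).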

In terms of the specific CPU specs (from /proc/cpuinfo):

vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7402 24-Core Processor
stepping : 0
microcode : 0x8301055
cpu MHz : 2800.000
cache size : 512 KB

I will try your suggestion and recompile. I’ll have to look more closely at the SYCL backend.

Thanks!

-Sergio

Hi,

Yes, the patch fixed the issue.

Thanks!!

-Sergio

Great, thanks for checking! The fix will be in the 2023.2 release.

@swong This has now been fixed in the release-2023 branch; you can build from that, or wait until the next patch version, 2023.2, is released (planned for next week).