Is "hyperthreading" possible and necessary?

GROMACS version: 2019
GROMACS modification: No

Our cluster can enable the hyperthreading function, but I am not familiar with computer architectures. So the simple question is: is it possible and necessary to enable hyperthreading for MD runs?

The MD commands I commonly use are

gmx mdrun -deffnm md_0_1 -cpi -append

gerun mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim[0123]

If hyperthreading is enabled, how should I adjust those two commands?

Hyperthreading typically improves performance, in some cases by a significant amount (up to 5-10%). In general, if you schedule mdrun to run on an entire node and use all cores and hardware threads of that node (and keep mdrun's internal thread pinning on), it will do the right thing. If you want to control things manually, e.g. to compare performance with and without HyperThreading, you can use the -pinstride option (e.g. -pinstride 2 will skip every second hardware thread, that is, use only one thread per core).
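
As a minimal sketch of what that comparison could look like for the first of the two commands above (assuming a single 40-core / 80-hardware-thread node and a thread-MPI build of gmx mdrun; the thread counts are illustrative only):

gmx mdrun -deffnm md_0_1 -cpi -append -nt 80 -pin on -pinstride 1   # both hardware threads of every core used (HyperThreading exploited)
gmx mdrun -deffnm md_0_1 -cpi -append -nt 40 -pin on -pinstride 2   # one thread per physical core (second hardware thread left idle)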

Szilárd

Hi Szilard, each of our cluster nodes has 40 physical cores.

The computational resource I request is

#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_rt=01:00:0
#$ -l mem=2G
#$ -N REMD
#$ -pe mpi 80
#$ -cwd

The mdrun command I use is
gerun mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim[0123]

You mentioned the -pinstride 2 option; should this go into the computational resource request, or into the mdrun command? Should I use 2 or some other number?

I am sorry I am not an expert on computer architecture.
My current understanding is as follows (is it correct?):

  • one node consists of multiple cores

  • one core commonly runs one thread (i.e. without hyperthreading) or two threads (i.e. with hyperthreading).

What does -pinstride 2 actually do?

Our cluster website also mentions MPI processes and settings like the ones below. What do they mean?

  • export OMP_NUM_THREADS=2

  • export OMP_NUM_THREADS=80

Also, if hyperthreading is good, why not just use it? In other words, why is it disabled by default?

Hi,

Your cluster’s CPUs do support HyperThreading, so the 40 cores will have 80 hardware threads to execute application threads on.

#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_rt=01:00:0
#$ -l mem=2G
#$ -N REMD
#$ -pe mpi 80
#$ -cwd

Here you request 80 MPI ranks (tasks), hence placing 2 MPI tasks per core; assuming that affinity setting (i.e. the mapping of the application threads to the CPU cores) is done correctly, this will make use of HyperThreading. It may not be the ideal setup (for the specific hardware and simulation settings), but it will likely be close.

As you do not explicitly request pinning, the default will be used; in this case, since all cores/threads of the CPU are used, mdrun will default to -pin on unless your job scheduler itself sets thread affinities. If you pass -pin on explicitly, you can make sure that mdrun does the pinning itself. You should see a note in the log stating whether pinning is done or skipped (and why).

Stride 1 or 2 are the only useful values for the type of hardware you have (with 2 hardware threads per core). A stride of 1 is used when placing 2 threads onto a single core (when HyperThreading / SMT is available), and a value of 2 is used when placing a single thread on a core.

Of course, you have to adjust the total number of threads you launch (across all ranks on a node) accordingly, i.e. in your case 40 total threads should be launched to not use HyperThreading and 80 to use it.
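
For example, with a generic mpirun launcher standing in for the site-specific gerun wrapper (only a sketch; it assumes the 4-way REMD command from above run on a single 40-core node, with one OpenMP thread per rank):

export OMP_NUM_THREADS=1
mpirun -np 80 mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim[0123] -pin on                # 80 ranks, 2 per core: HyperThreading used
mpirun -np 40 mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim[0123] -pin on -pinstride 2   # 40 ranks, 1 per core: HyperThreading not used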

Correct. Also note that, even if the hardware supports it, HyperThreading (more generally called simultaneous multithreading, SMT) can be disabled by a system administrator. Check the hardware detection report at the beginning of the log file, which shows the details of the CPU features detected.

See above.

These are OpenMP threads. MPI and OpenMP are two different means of parallelizing work across CPUs: the former distributes work across multiple processes, while the latter distributes it across threads within a single process. I recommend that you read up on these separately.
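
As a rough illustration of how the two combine (the numbers are just an assumed example for one 40-core / 80-hardware-thread node, again using a generic mpirun launcher rather than gerun):

export OMP_NUM_THREADS=10      # 10 OpenMP threads inside each MPI process
mpirun -np 8 mdrun_mpi ...     # 8 MPI ranks, i.e. 8 separate processes
# total application threads = 8 ranks x 10 threads = 80 = all hardware threads of the node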

It is not disabled by default; if mdrun is allowed to manage threading itself, it will make use of the hardware threads exposed by the HyperThreading feature.

Cheers
Szilárd

Hi Szilard,

Many thanks for your detailed explanation. Sorry I replied a bit late.

After confirming with our cluster staff, the situation is:
1 node = 40 physical cores = 80 virtual cores = 80 slots

Using -pe mpi 80 alone would NOT enable hyperthreading

To enable hyperthreading, we need to choose an optimal layout for dividing the work between MPI processes and threads. My simulation has 16 subfolders for the GROMACS REMD, so it seems that the number of MPI processes should be a multiple of 16. Here is what I have tested, each within one hour of walltime and with 80 cores requested:


1) no hyperthreading

step 126500

2) #$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=2

step 139200

3) #$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=80

Fatal error: The number of ranks (2) is not a multiple of the number of simulations (16)

4) #$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=10

step 156200


You can see that export OMP_NUM_THREADS=10 gives the best performance. So how can I further select the optimal split between MPI processes and threads?

Hyperthreading is not enabled by the user; it is either enabled at the system level, in which case you will see twice as many “CPUs” reported by the Linux system (see top or /proc/cpuinfo), corresponding to two hardware threads per real core (the “virtual cores” you refer to), or it is not.

All a user / job submission can do is select the number and placement of the application threads such that only one thread runs on each physical core (i.e. leaving one “slot”/hardware thread empty), thereby opting not to make use of the second hardware thread provided by the HyperThreading functionality.

Not sure if you are quoting someone else, but no, to make use of the HyperThreading feature you do not need to choose an optimal setup; you just need to place two application threads per core. The thread-to-rank division is irrelevant to whether HyperThreading is exploited or not.

That said, the optimal balance between MPI ranks and OpenMP threads is a related, but different, matter.

In addition, if you run 16-way REMD, you need to use at least 16 MPI ranks. The rest is flexible in the sense that you just need to use N x lcm(16, 80) = N x 80 threads in total (if you want to avoid wasting resources).
E.g. 80 ranks on a single node with 5 ranks x 1 thread per simulation (16 x 5 x 1 = 80), or 16 ranks on a single node with 1 rank per simulation and 5 threads each (16 x 1 x 5 = 80); but you can also use 640 ranks on 16 nodes with 40 ranks per simulation and 2 threads per rank (40 x 2 x 16 = 1280 threads).
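
As a sketch of the first of these layouts (assuming a generic mpirun launcher on one 40-core / 80-hardware-thread node, and assuming the 16 replica directories are named sim0 … sim15; your gerun/SGE setup may map onto this differently):

export OMP_NUM_THREADS=1       # 1 OpenMP thread per rank
mpirun -np 80 mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim{0..15}
# 16 simulations x 5 ranks each x 1 thread = 80 application threads = 80 hardware threads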

Please share the log files of the runs and make sure you launch simulations with valid resource requests (at least the second one seems to be invalid).

Cheers,
Szilárd

Hi Szilard,

Our HPC staff said “optimal layout” is needed. Not sure who is correct.

Here are my log files. I calculate the number of ranks as
#(-pe mpi) / #(OMP_NUM_THREADS)
e.g. 160/2 = 80 ranks; 160/10 = 16 ranks

1) no hyperthreading

step 126500

REMD.no_hyper.log (172.1 KB)

2) #$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=2

step 139200

REMD.80MPI.log (267.8 KB)

3) #$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=10

step 156200

REMD.16MPI.log (89.6 KB)


You can see that the last layout completed the most steps in one hour of HPC time.

Hi,

These are not log files but the standard output of the jobs; please share the mdrun logs. Also, we generally express performance as job throughput in “ns/day” (or walltime/step if you prefer a time-step-agnostic metric).
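
For reference, that throughput number is printed near the end of each mdrun log; assuming -deffnm md_0_1 and replica directories named sim0, sim1, … (the paths here are only an assumption based on the commands above), something like this prints it for every replica:

grep -H "Performance:" sim*/md_0_1.log    # reports ns/day and hour/ns per replica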

Also, you only ran the 80-rank layout with 2 and 1 threads per rank to test using / not using HyperThreading (but not the other layout), and you are relying on your job scheduler to do the right thing. Assuming that it does, HT does benefit performance, as expected.

What more would you like to learn? Are these two-node runs your target production settings?

Cheers,
Szilárd

Hi Szilard,

Thank you! Here are my md log files (I hope I attached the right ones this time).

16_MPI.log (407.4 KB) 80_MPI.log (383.6 KB) no_hyper.log (351.1 KB)

I basically want to know which value to assign to OMP_NUM_THREADS via export, as it could be 2 or 5 or 10, as long as 160/n is a multiple of 16.

Sorry, there are many terms that I still cannot fully understand, like ranks, threads, nodes, HT. Maybe I should leave it as it is and move on? Sorry for taking so much of your time.

If you want a concrete answer to that question, you must specify the total amount of resources you want to use. If you want to run on 2 nodes, you’ll get the best performance with 16 MPI ranks in total and 10 threads each (i.e. OMP_NUM_THREADS=10); that is, no domain decomposition within a member simulation of the REMD ensemble.

I cannot give you a universal recipe, and extrapolating from the above (to other systems or resource counts) won’t always work; e.g. if you want to use 16 nodes, OMP_NUM_THREADS=80 will surely not be optimal, so you will need to experiment further to find the best settings.

For that you should try to get a basic understanding of those concepts so you know what you are doing when you put -pe mpi or OMP_NUM_THREADS in your job scripts.
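
If it helps, one way such experimentation could be scripted (purely a sketch; remd_bench.sh is a hypothetical copy of your job script that keeps #$ -pe mpi 160 fixed and reads OMP_NUM_THREADS from the environment instead of exporting it):

for nt in 2 5 10; do
    qsub -v OMP_NUM_THREADS=$nt remd_bench.sh    # one short benchmark job per thread count
done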

Hi, yes, using -pe mpi 160 with OMP_NUM_THREADS=10 gives the fastest speed so far.

#$ -l threads=1
#$ -pe mpi 160
export OMP_NUM_THREADS=10

Thank you for your great help!