Hyperthreading typically improves performance, in some cases by a significant amount (up to 5-10%). In general, if you schedule mdrun to run on an entire node and use all cores and hardware threads of that node (and keep mdrun's internal thread pinning on), it will do the right thing. If you want to control things manually, e.g. to compare performance with and without Hyperthreading, you can use the -pinstride option (e.g. -pinstride 2 skips every second hardware thread, so that only one thread per core is used).
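To make the effect of the stride concrete, here is a minimal Python sketch of the idea, not mdrun's actual implementation. It assumes an illustrative numbering in which hardware threads 2k and 2k+1 are siblings on physical core k; the numbering reported by a real system may differ, and the function names are hypothetical.

```python
def pinned_hw_threads(n_threads, stride, offset=0):
    """Hardware-thread index each application thread would be pinned to.

    Assumption: indices 2k and 2k+1 are the two hardware threads of core k.
    """
    return [offset + i * stride for i in range(n_threads)]


def cores_used(hw_threads, threads_per_core=2):
    """Physical cores touched by a list of hardware-thread indices."""
    return sorted({t // threads_per_core for t in hw_threads})


# Toy machine: 4 cores x 2 hardware threads = 8 logical CPUs.
with_ht = pinned_hw_threads(8, stride=1)     # two threads per core
without_ht = pinned_hw_threads(4, stride=2)  # one thread per core

print(with_ht)                 # [0, 1, 2, 3, 4, 5, 6, 7]
print(without_ht)              # [0, 2, 4, 6]
print(cores_used(without_ht))  # [0, 1, 2, 3]
```

In both cases all four cores are used; the stride only decides whether the second hardware thread of each core gets an application thread.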
The mdrun command I use is gerun mdrun_mpi -v -deffnm md_0_1 -replex 100 -cpi -append -multidir sim[0123]
You mentioned the -pinstride 2 option. Should it go in the computational resources request or in the mdrun command? Should I just use 2, or some other number?
I am sorry I am not an expert on computer architecture.
My current understanding is as follows (is it correct?):
one node consists of multiple cores
one core commonly has either one hardware thread (i.e. without hyperthreading) or two hardware threads (i.e. with hyperthreading).
What does -pinstride 2 actually do?
Our cluster website also mentions MPI processes, like the below. What does it mean?
export OMP_NUM_THREADS=2
export OMP_NUM_THREADS=80
Also, if hyperthreading is good, why not just use it? In other words, why is it disabled by default?
Here you request 80 MPI ranks (tasks), which places 2 MPI tasks on each core; assuming that the affinity setting (i.e. the mapping of the application threads to the CPU cores) is done correctly, this will make use of HyperThreading. It may not be the ideal setup (for the specific hardware and simulation settings), but it will likely be close.
As you do not explicitly request pinning, the default will be used. In this case, because all cores/hardware threads of the CPU are used, mdrun will default to -pin on unless your job scheduler itself sets thread affinities. If you pass -pin on explicitly, you can make sure that mdrun does the pinning itself. You should see a note in the log on whether pinning was done or skipped (and why).
Stride 1 or 2 are the only useful values for the type of hardware you have (with 2 hardware threads per core). A stride of 1 is used when placing 2 threads onto each core (when HyperThreading / SMT is available), and a stride of 2 is used when placing a single thread on each core.
Of course, you have to adjust the total number of threads you launch (across all ranks in a node) accordingly, i.e. in your case 40 total threads should be launched to not use HyperThreading and 80 to use it.
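The relation between the stride and the thread count can be written down as a small Python sketch. The function name and the returned dictionary are hypothetical helpers for illustration only; the 40-core, 2-threads-per-core node is the one described in this thread.

```python
def launch_plan(physical_cores, hw_threads_per_core, use_smt):
    """Total threads to launch on one node and the matching pin stride."""
    if use_smt:
        # Fill every hardware thread: consecutive pinning, stride 1.
        stride = 1
        total_threads = physical_cores * hw_threads_per_core
    else:
        # One thread per core: skip the sibling hardware threads.
        stride = hw_threads_per_core
        total_threads = physical_cores
    return {"pinstride": stride, "total_threads": total_threads}


print(launch_plan(40, 2, use_smt=True))   # {'pinstride': 1, 'total_threads': 80}
print(launch_plan(40, 2, use_smt=False))  # {'pinstride': 2, 'total_threads': 40}
```

The total is summed over all ranks on the node, so it equals the number of ranks per node multiplied by the number of threads per rank.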
Correct. Also note that, even if the hardware supports it, HyperThreading (more generally called simultaneous multithreading, SMT) can be disabled by a system administrator. Check the hardware detection report at the beginning of the log file, which shows the details of the CPU features detected.
See above.
These are OpenMP threads. MPI and OpenMP are two different means to parallelize work across CPUs: the former relies on distributing work across multiple processes while the latter across threads within a single process. I recommend that you read up separately about these.
It is not disabled by default; if mdrun is allowed to manage threading itself, it will make use of the hardware threads exposed by the HyperThreading feature.
Many thanks for your detailed explanation. Sorry I replied a bit late.
After confirming with our cluster staff, the situation is:
1 node = 40 physical cores = 80 virtual cores = 80 slots
Using -pe mpi 80 alone would NOT enable hyperthreading
To enable hyperthreading, we need to choose an optimal layout for dividing up MPI processes vs threads. My simulation has 16 subfolders for the Gromacs REMD, so it seems that the number of MPI processes should be a multiple of 16. As I have tested, in one hour of wall time with 80 cores requested:
You can see that export OMP_NUM_THREADS=10 gives the best performance. So how can I further select the optimal layout of dividing up MPI processes vs threads?
Hyperthreading is not enabled by the user; it is enabled at the system level, in which case you will see twice as many “CPUs” reported by the Linux system (see top or /proc/cpuinfo), corresponding to two hardware threads per real core (or “virtual cores”, as you refer to these).
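One way to check this yourself is to compare the number of logical CPUs with the number of distinct physical cores. The Python sketch below parses /proc/cpuinfo-style text; it assumes the "processor", "physical id" and "core id" fields found on typical x86 Linux systems (other architectures may use different fields), and the sample text is made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical CPUs, distinct physical cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    # Each logical CPU is described by one block; blocks are separated by blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # A physical core is identified by its socket and its core id.
        cores.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(cores)


# Hypothetical 2-core machine with two hardware threads per core.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

logical, physical = count_cpus(SAMPLE)
print(logical, physical)  # 4 2
```

On a real node you could pass `open("/proc/cpuinfo").read()` instead of the sample; twice as many logical CPUs as physical cores suggests that two hardware threads per core are exposed.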
All a user / job submission can do is to select the number and placement of the application threads in a way that only one thread runs on one physical core (i.e. leaving one “slot”/hardware thread empty) and with that opting to not make use of the second hardware thread provided by the Hyperthreading functionality.
Not sure if you are quoting someone else, but no, to make use of the Hyperthreading feature you do not need to choose an optimal setup, you just need to place two application threads per core. The thread to rank division is irrelevant to whether Hyperthreading is exploited or not.
That said, the optimal balance between MPI ranks and OpenMP threads is a related, but different, matter.
In addition, if you run 16-way REMD, you need to use at least 16 MPI ranks. The rest is flexible, in the sense that the total number of threads (ranks × threads per rank) just needs to be N × lcm(16, 80) (if you want to avoid wasting resources).
E.g. 80 ranks on a single node with 5 ranks x 1 thread per simulation (i.e. 16x5x1=80), or 16 ranks on a single node with 1 rank per simulation and 5 threads each (16x1x5=80); you can also use 16 nodes with 40 ranks per simulation and 2 threads per rank, i.e. 640 ranks and 40x2x16=1280 threads in total.
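The arithmetic behind these examples can be checked with a short Python sketch: the number of simulations times the ranks per simulation times the threads per rank should equal the number of hardware threads on the allocated nodes. The function names are hypothetical, and 80 hardware threads per node is the value for the cluster discussed here.

```python
def total_threads(n_simulations, ranks_per_sim, threads_per_rank):
    """Threads launched in total across all member simulations."""
    return n_simulations * ranks_per_sim * threads_per_rank


def fills_nodes(n_simulations, ranks_per_sim, threads_per_rank,
                n_nodes, hw_threads_per_node=80):
    """True if the layout uses every hardware thread exactly once."""
    launched = total_threads(n_simulations, ranks_per_sim, threads_per_rank)
    return launched == n_nodes * hw_threads_per_node


# 16-way REMD layouts from the examples in this thread.
print(fills_nodes(16, 5, 1, n_nodes=1))    # True: 80 ranks x 1 thread
print(fills_nodes(16, 1, 5, n_nodes=1))    # True: 16 ranks x 5 threads
print(fills_nodes(16, 40, 2, n_nodes=16))  # True: 1280 threads on 16 nodes
print(fills_nodes(16, 1, 4, n_nodes=1))    # False: 64 threads leave 16 idle
```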
Please share the log files of the runs and make sure you launch simulations with valid resource requests (at least the second seems to be invalid).
These are not log files but the standard output of the jobs; please share the mdrun logs. Also, we generally report performance as throughput in “ns/day” (or wall time per step if you prefer a time-step-agnostic metric).
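For reference, both metrics follow directly from the number of steps, the time step and the elapsed wall time. The Python sketch below shows the conversion; the function names are hypothetical and the run parameters in the example are made-up numbers, not results from this thread.

```python
def ns_per_day(n_steps, dt_ps, wall_seconds):
    """Simulated nanoseconds per day of wall time."""
    simulated_ns = n_steps * dt_ps / 1000.0  # 1 ns = 1000 ps
    return simulated_ns * 86400.0 / wall_seconds


def wall_ms_per_step(n_steps, wall_seconds):
    """Wall time per MD step in milliseconds (independent of the time step)."""
    return 1000.0 * wall_seconds / n_steps


# Hypothetical run: 500000 steps of 2 fs (0.002 ps) completed in one hour.
print(round(ns_per_day(500_000, 0.002, 3600.0), 3))   # 24.0
print(round(wall_ms_per_step(500_000, 3600.0), 3))    # 7.2
```

Comparing runs by ns/day only makes sense when they use the same time step, which is why the per-step wall time is the more neutral metric.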
Also, you only ran the 80-rank setup with 2 and 1 threads each to test with and without HyperThreading (but not the other setups), and you are relying on your job scheduler to do the right thing. Assuming that it does, HT benefits performance, as expected.
What more would you like to learn? Are these two-node runs your target production settings?
I basically want to know which value to assign to OMP_NUM_THREADS in the export OMP_NUM_THREADS=n line, as it can be 2, 5 or 10, as long as 160/n is a multiple of 16.
Sorry, there are many terms that I still do not fully understand, like ranks, threads, nodes and HT. Maybe I should leave it as it is and move on? Sorry for taking so much of your time.
If you want a concrete answer to that question, you must specify the total amount of resources you want to use. If you want to run on 2 nodes, you’ll get the best performance with 16 MPI ranks in total and 10 threads each (i.e. OMP_NUM_THREADS=10); that is, with no domain decomposition within a member simulation of the REMD ensemble.
I cannot give you a universal recipe, and extrapolating from the above (to other systems or resource counts) won’t always work; e.g. if you want to use 16 nodes, OMP_NUM_THREADS=80 will surely not be optimal, so you will need to experiment further to find the best settings.
For that you should try to get a basic understanding of those concepts, so that you know what you are doing when you put -pe mpi or OMP_NUM_THREADS in your job scripts.
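As a starting point for such experiments, the candidate values can be enumerated mechanically. The Python sketch below lists every thread count per rank that fills the allocated hardware threads while keeping the number of ranks a multiple of the number of replicas. It only lists valid layouts and says nothing about which one is fastest; the function name is hypothetical.

```python
def candidate_omp_threads(n_nodes, hw_threads_per_node, n_simulations):
    """List (threads per rank, total ranks, ranks per simulation) layouts."""
    total = n_nodes * hw_threads_per_node
    layouts = []
    for n in range(1, total + 1):
        if total % n != 0:
            continue  # n threads per rank would not fill the nodes evenly
        ranks = total // n
        if ranks % n_simulations == 0:
            layouts.append((n, ranks, ranks // n_simulations))
    return layouts


# 2 nodes x 80 hardware threads, 16-way REMD.
for n, ranks, per_sim in candidate_omp_threads(2, 80, 16):
    print(f"OMP_NUM_THREADS={n}: {ranks} ranks, {per_sim} per simulation")
# OMP_NUM_THREADS=1: 160 ranks, 10 per simulation
# OMP_NUM_THREADS=2: 80 ranks, 5 per simulation
# OMP_NUM_THREADS=5: 32 ranks, 2 per simulation
# OMP_NUM_THREADS=10: 16 ranks, 1 per simulation
```

Each candidate still has to be benchmarked on the actual system, since the best split between ranks and threads depends on the hardware and the simulation.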