Our cluster has grown slowly over time with new additions, but the oldest group of nodes is the only one with InfiniBand, something we never got around to configuring after rebuilding the cluster the first time. Well, now the time had come to give it another shot. Using the OpenFabrics OFED distribution, I installed just the kernel drivers and the needed libraries, since I planned on building a different version of OpenMPI later. What's nice about this distribution is that it will build RPMs for you, so after testing on one node, I copied the RPMs to the head node and added them to the list of RPMs to install.
Then I picked a few more nodes to test the installation on, and this is where my troubles began. I could manually ssh to a node and run the OSU benchmarks without an issue, but what's the point of that if you can't run it distributed? So I made a job script and submitted it, only to find it crashing with the following:
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
Strangely, limits were being set even though /etc/security/limits.conf was empty. Thanks to the folks on the Rocks mailing list, I found I needed to add H_MEMORYLOCKED=infinity to execd_params in the cluster configuration, which is edited via qconf -mconf.
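A quick way to check whether the change took effect is to print the locked-memory limit from inside a submitted job rather than from an ssh session, since the two can see different limits (which is presumably why the manual runs worked). The following is a minimal sketch using Python's standard resource module; the function names describe and memlock_report are my own, and it assumes Python 3 is available on the compute nodes.

```python
import resource


def describe(limit):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


def memlock_report():
    """Return the soft and hard RLIMIT_MEMLOCK values seen by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"RLIMIT_MEMLOCK soft={describe(soft)} hard={describe(hard)}"


if __name__ == "__main__":
    # Run this from a job script so it reports the limits the scheduler
    # hands to jobs, not the limits of an interactive login.
    print(memlock_report())
```

Running it once over ssh and once through the scheduler should make any mismatch obvious. After the execd_params change, the job should no longer report the 32768-byte limit from the libibverbs warning.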