Building exabayes (1.2.1) for Rocks 6.1

To build exabayes (note: this is for version 1.2.1; 1.3 just came out and doesn't build for me just yet) on Rocks 6.1, which is based on CentOS 6.3, LLVM's clang and libc++ need to be installed. I have a previous blog post about this.

The available prebuilt binaries do not work on CentOS 6.3, but once clang and libc++ are installed, rebuilding it is fairly straightforward. Download and extract exabayes and go into its directory. Use the following commands to configure and build both the serial and parallel versions of exabayes:

CC=clang CXX=clang++ CXXFLAGS="-std=c++11 -stdlib=libc++" ./configure
make
OMPI_CC=clang OMPI_CXX=clang++ OMPI_CXXFLAGS="-std=c++11 -stdlib=libc++" CC=mpicc CXX=mpic++ ./configure --enable-mpi
make clean
OMPI_CC=clang OMPI_CXX=clang++ OMPI_CXXFLAGS="-std=c++11 -stdlib=libc++" CC=mpicc CXX=mpic++ make

mpicc and mpic++ are just wrappers around gcc, but those environment variables point them at another compiler without having to build a separate version of Open MPI. Once that is done, all the exabayes binaries are in the top-level directory. Ignore the ones in bin/bin; those are the prebuilt ones that don't work.


Building libc++ on CentOS 6

For the cluster I manage, a user needed exabayes (there will be another post on building that later), but the prebuilt binaries didn't work on Rocks 6.1, which is based on CentOS 6.3. GCC is too old to build it since exabayes uses C++11, but luckily clang 3.4 is available from EPEL. The only catch: it still wouldn't compile. I got the following two errors:

/usr/bin/../lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/exception_ptr.h:143:13: error:
unknown type name 'type_info'
const type_info*
./src/Density.cpp:79:34: error: use of undeclared identifier 'begin'
double sum = std::accumulate(begin(values), end(values), 0. );

While there was a potential workaround for the first error, nothing viable turned up for the second. But that research pointed in the next direction: building LLVM's libc++, since both errors stem from GCC's old version of the standard C++ library. It's a bit complicated and rather hackish, but it works, so here we go.

Download libc++ via svn, but instead of following their directions for building, do this:

cd libcxx/lib
TRIPLE=-linux- ./buildit

Thanks to this blog post (which is in Chinese, but the commands are easy to understand). After building the library, copy it to /usr/lib (or, since this is 64-bit, I put it in /usr/lib64) and create the needed symlinks. Then copy libcxx/include to /usr/include/c++/v1. Remember this, as we'll be replacing libc++ later with a rebuilt version.

Next is building libc++abi. Again, download it from svn, build it like above, copy the library to /usr/lib64, and make the symlinks. The include directory doesn't need to be copied. Now it's time to rebuild libc++ against libc++abi. This requires CMake; I opted for the newer version available from EPEL, so the command is cmake28. I also started with a fresh download of libc++:

cd libcxx
mkdir build
cd build
CC=clang CXX=clang++ cmake28 -G "Unix Makefiles" -DLIBCXX_CXX_ABI=libcxxabi -DLIBCXX_LIBCXXABI_INCLUDE_PATHS="<libc++abi-source-dir>/include" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ../

Since I don't like to mess with the system install, I used DESTDIR during the make install step. That lets me build an rpm package using rocks create package. I also created a package for libc++abi.

With this it's now possible to compile with clang and C++11. Test it out like so: clang++ -stdlib=libc++ -std=c++11 input.cpp

Monitoring Lustre via Ganglia and Collectl

This here is just a rudimentary outline of how I set up ganglia to post stats gathered by collectl about Lustre.

All the information used is available from these two sources: Roy Dragseth's page in the Rocks Clusters wiki and the Lustre Tutorial page on collectl's website. First and foremost, have ganglia, ganglia-gmond, ganglia-gmond-python, and collectl installed.

Edit /etc/collectl.conf and reconfigure the DaemonCommands with the following: -f /var/log/collectl -r00:00,7 -m -F60 -sl -P --export lexpr,f=/tmp/L This gathers the Lustre information and saves it to /tmp/L for reading by ganglia later. Now the important bit: whether this is a Lustre server or client determines the variable prefix exported by collectl. For the MDS it's lusmds, for the OSTs lusost, and for the clients lusclt.
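The lexpr export writes plain name/value pairs that the ganglia python module then polls. On a client, /tmp/L looks roughly like this (illustrative names and values, not captured output):

```
lusclt.reads 12
lusclt.readkbs 1536
lusclt.writes 4
lusclt.writekbs 256
```

On an MDS or OST the same file carries the lusmds.* or lusost.* prefixes instead.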

For an OST, collectl collects reads, writes, writekbs, and readkbs. For an MDS, it's gattrP, sattrP, sync, and unlink. Copy the python module and collectl.pyconf from the wiki page to their respective locations for ganglia, then edit and replace the relevant entries. The posted files are set up to monitor Lustre stats on a client, so for an OST replace lusclt.reads with lusost.reads, and so on. Once those files are updated for that server's needs, start the collectl daemon and restart gmond. Wait a few minutes and the new stats should be graphed on the server's ganglia page.

Rocks and 10gig hardware don’t mix

Well, at least that was my experience. I should mention I'm running Rocks 5.3, so just a point version behind. But here is the story:

A faculty member recently purchased some Dell blade servers for research work. These blades, and thus the chassis, came with 10gig ethernet hardware. Cool. Set up the hardware, check. Plug in all the cables (but no 10gig, since we don't have 10gig switching hardware), check. Set up software, uh oh.

The problem happened when I booted the nodes to have the Rocks installer image them. They pxebooted just fine off the first ethernet device (just a plain ol' 1gig connection). Linux loaded, the installer ran, it tried to find an IP, and failed. The installer was scanning eth0, eth2, then eth1. Turns out the kernel was numbering the 10gig nics eth0-3, and eth4 was the 1gig nic it should have been using.

A few days on the mailing list, to no avail. They just gave up on me, but I never give up. I narrowed it down to a problem with the kickstart script overriding options I set in the pxeboot config. I added IPAPPEND 2 and ksdevice=bootif, which tells the installer to use the same device it booted from. Well, that wasn't working. Only when I told it not to run the kickstart script, by removing the 'ks' option, was it able to use eth4 as it should have been. But the mailing list failed me and offered no solution. Drivers! Bios! Update!! No, no, no! But whatever, I just had to do it the hard way.
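For reference, the pxelinux.cfg entry I was fighting with looked roughly like this (label, kernel, and initrd names are examples, not the actual Rocks config):

```
default install
label install
  kernel vmlinuz-installer
  append initrd=initrd.img-installer ksdevice=bootif ks
  ipappend 2
```

ipappend 2 makes pxelinux append BOOTIF=<mac-of-boot-nic> to the kernel command line, and ksdevice=bootif tells anaconda to use that NIC. It was only with the ks option dropped from the append line that the node came up on eth4.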

Back to the server room: removed the 10gig cards from the nodes (eth0 was now the 1gig nic), installed the OS, reinstalled the hardware, and done. Luckily it was only two nodes, but there still should have been a software fix for this. Life goes on.