Yum Repository Priorities and Ansible

I’ve been investigating setting up a VPN gateway on my home server, but so far my search has only turned up implementations that run on Linux or BSD, not SmartOS/Illumos (Solaris). It’s planted some doubt about my choice of SmartOS as my server’s OS, but so what! I can still run a Linux VM via KVM. I’ve gone with my old standard of CentOS, a very solid OS that I trust to run my eventual VPN gateway securely.

When I get to it, I’ll let you all know how I go about setting up strongSwan, but I’m not there yet. I first need to lay some groundwork, and that’s installing the EPEL yum repo. I’m very familiar with how to do so manually, adding the repo and then using the priorities plugin, but it’s all about Ansible today. I’ve created an Ansible Galaxy role to install and configure the priorities plugin; go check it out.

davidmnoriega/yum-plugin-priorities
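For reference, the manual route boils down to installing yum-plugin-priorities and adding a priority= line to the repo sections you care about (values run from 1 to 99, and a lower number wins). Here is a minimal shell sketch of that second step. It is not what the Ansible role does internally; the helper name set_repo_priority, the epel section and the value 10 are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set priority=<n> inside [<section>] of a yum .repo file.
# Any existing priority line in that section is replaced, so it is safe to re-run.
set_repo_priority() {
    srp_file=$1
    srp_section=$2
    srp_prio=$3
    srp_tmp=$(mktemp) || return 1
    awk -v sec="[$srp_section]" -v prio="$srp_prio" '
        # Drop an old priority line, but only inside the target section.
        in_sec && /^priority[ \t]*=/ { next }
        # Track which section we are in.
        /^\[/ { in_sec = ($0 == sec) }
        { print }
        # Emit the new priority right after the target section header.
        in_sec && $0 == sec { print "priority=" prio }
    ' "$srp_file" > "$srp_tmp" && cat "$srp_tmp" > "$srp_file"
    srp_status=$?
    rm -f "$srp_tmp"
    return "$srp_status"
}

# Example (as root, after "yum install yum-plugin-priorities"):
#   set_repo_priority /etc/yum.repos.d/epel.repo epel 10
```

Writing through a temp file and then cat-ing it back keeps the original file's ownership and permissions intact.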

Building exabayes(1.2.1) for Rocks 6.1

To build exabayes (note: this is for version 1.2.1; 1.3 just came out and doesn’t build for me just yet) on Rocks 6.1, which is based on CentOS 6.3, LLVM’s clang and libc++ need to be installed. I have a previous blog post about this.

The available prebuilt binaries do not work on CentOS 6.3, but once clang and libc++ are installed, rebuilding it is fairly straightforward. Download and extract exabayes and go into its directory. Use the following commands to configure and build both the serial and parallel versions of exabayes:

CC=clang CXX=clang++ CXXFLAGS="-std=c++11 -stdlib=libc++" ./configure
make
OMPI_CC=clang OMPI_CXX=clang++ OMPI_CXXFLAGS="-std=c++11 -stdlib=libc++" CC=mpicc CXX=mpic++ ./configure --enable-mpi
make clean
OMPI_CC=clang OMPI_CXX=clang++ OMPI_CXXFLAGS="-std=c++11 -stdlib=libc++" CC=mpicc CXX=mpic++ make

mpicc and mpic++ are just wrappers for gcc, but by using those environment variables, they can be pointed at another compiler without having to build a separate version of openmpi. Now that that is done, all the exabayes binaries are in the top-level directory. Ignore the ones in bin/bin; those are the prebuilt ones that don’t work.
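If you want to double-check that the wrappers really picked up clang, Open MPI's wrappers can report the underlying compiler with --showme:command. A small sketch, assuming an Open MPI version that supports that flag; the function name show_wrapper_compiler is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the compiler an Open MPI wrapper would invoke.
# Relies on the wrapper's --showme:command option (assumed to be available).
show_wrapper_compiler() {
    swc_wrapper=${1:-mpicc}
    if ! command -v "$swc_wrapper" >/dev/null 2>&1; then
        echo "$swc_wrapper not found in PATH" >&2
        return 1
    fi
    "$swc_wrapper" --showme:command
}

# Example:
#   export OMPI_CC=clang OMPI_CXX=clang++
#   show_wrapper_compiler mpicc     # should now report clang rather than gcc
#   show_wrapper_compiler mpic++
```

If the output still says gcc, the OMPI_* variables were not exported into the environment the wrapper runs in.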

Building libc++ on CentOS 6

For the cluster I manage, a user needed exabayes (there will be another post on building that later), but their prebuilt binaries didn’t work on Rocks 6.1, which is based on CentOS 6.3. The system GCC is too old to build it, as they use C++11, but luckily clang 3.4 is available from EPEL. The only problem: it still wouldn’t compile. I got the following two errors:

/usr/bin/../lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/exception_ptr.h:143:13: error:
unknown type name 'type_info'
const type_info*
./src/Density.cpp:79:34: error: use of undeclared identifier 'begin'
double sum = std::accumulate(begin(values), end(values), 0. );

While there was a potential workaround for the first error, nothing viable was to be found for dealing with the second. But this research pointed in the next direction, building LLVM’s libc++, as these errors have to do with GCC’s old version of the standard C++ library. It’s a bit complicated and rather hackish, but it looks like it works, so here we go.

Download libc++ via svn, but instead of following their directions for building, do this:

cd libcxx/lib
./buildit

Thanks to this blog post, which is in Chinese, but the commands are easy to understand. After building the library, copy it to /usr/lib (or, because this is 64-bit, I put it in /usr/lib64) and create the needed symlinks. Then copy libcxx/include to /usr/include/c++/v1. Remember this, as we’ll be replacing libc++ later with a rebuilt version.
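The copy-and-symlink step can be sketched as a small shell function. This assumes the build leaves a versioned file such as libc++.so.1.0 (check what buildit actually produced on your system) and that the usual chain libc++.so -> libc++.so.1 -> libc++.so.1.0 is what is wanted; the function name install_versioned_lib is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy a versioned shared library into a lib directory
# and create the two usual symlinks next to it.
install_versioned_lib() {
    ivl_built=$1                      # e.g. libcxx/lib/libc++.so.1.0 (assumed name)
    ivl_dir=$2                        # e.g. /usr/lib64
    ivl_real=$(basename "$ivl_built") # libc++.so.1.0
    ivl_soname=${ivl_real%.*}         # libc++.so.1  (loaded at run time)
    ivl_link=${ivl_soname%.*}         # libc++.so    (found by the linker via -lc++)
    cp "$ivl_built" "$ivl_dir/$ivl_real" || return 1
    ln -sf "$ivl_real" "$ivl_dir/$ivl_soname" || return 1
    ln -sf "$ivl_soname" "$ivl_dir/$ivl_link"
}

# Example (as root):
#   install_versioned_lib libcxx/lib/libc++.so.1.0 /usr/lib64
```

Because the names are derived from the file itself, the same function should work for libc++abi too, as long as its build output follows the same name.so.X.Y pattern.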

Next is building libc++abi. Again, download it from svn, build it like above, copy the library to /usr/lib64 and make the symlinks. The include directory doesn’t need to be copied. Now it’s time to rebuild libc++ with libc++abi. This requires CMake, and I opted for the newer version available from EPEL; the command is then cmake28. I also started with a fresh download of libc++:

cd libcxx
mkdir build
cd build
CC=clang CXX=clang++ cmake28 -G "Unix Makefiles" -DLIBCXX_CXX_ABI=libcxxabi -DLIBCXX_LIBCXXABI_INCLUDE_PATHS="<libc++abi-source-dir>/include" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ../
make

Since I don’t like to mess with the system install, I used DESTDIR during the make install step. This then allows me to build an rpm package using rocks create package. I also created a package for libc++abi.

With this, it’s now possible to compile with clang and C++11. Test it out like so: clang++ -stdlib=libc++ -std=c++11 input.cpp

Setting up MoinMoin on CentOS

MoinMoin is a great and simple wiki package built out of Python. We previously had it set up on an older machine, but since the order was given to reorganize the wiki, I thought I’d take the time to set it up on a new machine with the latest version. Only problem is I don’t remember it being this difficult. The install documentation isn’t very clear from my point of view, but that doesn’t stop me!

So I downloaded the package, untarred it, and ran the setup script:

python setup.py install --record=install.log --prefix='/usr/local' --install-data=/var/www/moin

I set install-data to /var/www/ simply to keep web-related things in one place, but really didn’t have to. With that done, it’s time to set up the wiki’s directory structure.

I created it at /var/www/html/wiki. Copy wikiconfig.py, data/ and underlay/ from the install-data directory. The install-data directory also contains the moin.wsgi config file. Edit it so that it can find the MoinMoin Python files (e.g. /usr/local/lib/python2.4/). Then it’s time to edit the Apache config.
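The copy step can be sketched as follows. The layout under the install-data directory (share/moin/config/wikiconfig.py, share/moin/data, share/moin/underlay) is an assumption based on where moin.wsgi ended up on my install, so check your own tree first; the function name create_wiki_instance is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: populate a wiki instance directory from the MoinMoin
# share directory. Paths inside share_dir are assumed, verify them first.
create_wiki_instance() {
    cwi_share=$1    # e.g. /var/www/moin/share/moin
    cwi_wiki=$2     # e.g. /var/www/html/wiki
    mkdir -p "$cwi_wiki" || return 1
    cp "$cwi_share/config/wikiconfig.py" "$cwi_wiki/" || return 1
    cp -R "$cwi_share/data" "$cwi_share/underlay" "$cwi_wiki/" || return 1
}

# Example (as root):
#   create_wiki_instance /var/www/moin/share/moin /var/www/html/wiki
#   chown -R apache:apache /var/www/html/wiki   # same user as the WSGI daemon
```

The chown matters because the WSGI daemon process runs as apache and needs to write into data/.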

Add ‘WSGISocketPrefix /var/run/htpd’ to the main httpd config (not inside any VirtualHost section). The default setting doesn’t work. Then I added the following to my main site config:

<Directory /var/www/html/wiki/data>
        Order Deny,Allow
        Deny from all
</Directory>

WSGIScriptAlias /wiki /var/www/moin/share/moin/server/moin.wsgi
WSGIDaemonProcess moinwiki user=apache group=apache processes=5 threads=10 maximum-requests=1000 umask=0007
WSGIProcessGroup moinwiki

The first section protects the data/ directory; the WSGI directives then point to the config file and set up the daemon.

Now I configure things in the wikiconfig.py file. It’s mostly self-explanatory, though the part where it asks for the front page is a tricky step. There are two options, one for a single language and one for multiple languages. Go for the second option; when I went for the first, I would get messages in the website error log about missing pages from the underlay/ directory.

With the second option taken, restart httpd, then visit the wiki and create the superuser account. The final step is one I had to research to find: you have to go to the language setup page and tell it which help pages to set up, otherwise there is nothing there. You reach that page by going to /LanguageSetup?action=language_setup. It just wasn’t obvious to me that this was needed, plus there is no link to it on the wiki itself.

Now with that taken care of, the wiki is ready to go.

Rocks and 10gig hardware don’t mix

Well at least that was my current experience. I should mention I’m running Rocks 5.3, so just a point version behind. But here is the story:

A faculty member just recently purchased some Dell blade servers for some research work. These blades, and thus the chassis, came with 10gig ethernet hardware. Cool. Set up the hardware, check. Plug in all the cables (but no 10gig, since we don’t have any 10gig hardware to connect them to), check. Set up software, uh oh.

The problem happened when I booted the nodes to have the Rocks installer image them. They pxebooted just fine off of the first ethernet device (just a plain ol 1gig connection). Linux loaded, the installer ran, tried to find an IP and failed. The installer was scanning eth0, eth2, then eth1. It turns out the kernel was numbering the 10gig NICs eth0-3, and eth4 was the 1gig NIC it should have been using.

A few days on the mailing list got me nowhere. They just gave up on me, but I never give up. I narrowed it down to a problem with the kickstart script overriding options I set in the pxeboot config. I added IPAPPEND 2 and ksdevice=bootif, which tells the system to use the same device it booted from. Well, that wasn’t working. Not until I told it to not run the kickstart script, by removing the ‘ks’ option, was it able to use eth4 as it should have. But the mailing list failed me and offered no solution. Drivers! Bios! Update!! No no no! But whatev, just had to do it the hard way.

Back to the server room: removed the 10gig cards from the nodes, eth0 was now the 1gig nic, install os, reinstall hardware, and done. Luckily it was only two nodes. There should have been a software solution to fix this, but life goes on.
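For anyone debugging the same thing: with IPAPPEND 2, pxelinux appends BOOTIF=01-xx-xx-xx-xx-xx-xx to the kernel command line, where 01 is the hardware type and the rest is the MAC of the NIC that booted. A rough sketch for turning that value back into the interface name the kernel picked; the function name bootif_to_iface is made up, and the second argument exists only so it can be pointed at a directory other than /sys/class/net.

```shell
#!/bin/sh
# Hypothetical helper: map a pxelinux BOOTIF value to a kernel interface name
# by comparing it against the MAC addresses published in sysfs.
bootif_to_iface() {
    bti_bootif=$1
    bti_net=${2:-/sys/class/net}
    # Strip the leading hardware type ("01-"), lowercase, dashes to colons.
    bti_mac=$(printf '%s\n' "${bti_bootif#*-}" | tr 'A-Z' 'a-z' | sed 's/-/:/g')
    for bti_addr in "$bti_net"/*/address; do
        [ -r "$bti_addr" ] || continue
        if [ "$(cat "$bti_addr")" = "$bti_mac" ]; then
            basename "$(dirname "$bti_addr")"
            return 0
        fi
    done
    return 1
}

# Example (value taken from /proc/cmdline on the booted node):
#   bootif_to_iface 01-aa-bb-cc-dd-ee-ff
```

On my nodes this is the kind of check that would have shown the 1gig NIC sitting at eth4 instead of eth0.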

Setting up Lustre

I know I promised a Lustre update a while back, but no apologies. I have my own schedule and just now decided to do this, so be happy 😛

Basics

First things first is the Lustre vocabulary that one has to get familiar with. It’s one of the things that first put me off, as I was left scratching my head as to what the terms mean. With practice explaining things to the boss, you’ll get it.

OSS/OST = Object Storage Server/Object Storage Target: In Lustre, the magic comes from separating the metadata (ownership, file size, access times, etc.) from the actual data. The OST is typically a disk array attached to an OSS, and for full failover they come in pairs (i.e. in our case two disk arrays were attached to two servers with full multipathing).

MDS/MDT = Metadata Server/Metadata Target: As with the OSS/OST, we have disk storage for holding the metadata, attached to a server that serves the metadata.

MGS = Management Server: In most cases like ours, the MGS can live on the MDS/MDT as well. All it is is a place to hold information on all the servers and targets, etc.

Hardware

This is our setup: three disk arrays and four servers. All disk arrays are set up with RAID 5 and a hot spare. Two of the disk arrays are cross-connected to two servers (the OSSs) and the last disk array is cross-connected to the last two servers (MDS/MGS). This is all set up for failover: each server can take over for the other in case of a problem or for upgrades. Lustre currently supports multiple OSSs but only a single active MDS; this will change in the 2.x release.

Lustre is currently built on top of ext3 and as such only supports 8TB LUNs. Also, the traditional msdos partition table has issues with partitions greater than 2TB. To get around this, you have to label each LUN from the disk arrays as ‘gpt’. You can use parted to do so:

# parted /dev/sdX
GNU Parted 1.7.1
Using /dev/sdX
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel
New disk label type? gpt
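With several LUNs it gets tedious to do this interactively. parted also has a script mode (-s) that takes the command on the command line. Here is a cautious sketch that only prints the commands unless RUN=1 is set; the function name label_gpt and the RUN switch are my own conventions, and the device names in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: relabel each given device as GPT using parted's script mode.
# WARNING: mklabel wipes the existing partition table, so by default this only
# prints what it would run. Set RUN=1 to actually execute.
label_gpt() {
    for lg_dev in "$@"; do
        if [ "${RUN:-0}" = 1 ]; then
            parted -s "$lg_dev" mklabel gpt || return 1
        else
            echo "parted -s $lg_dev mklabel gpt"
        fi
    done
}

# Example:
#   label_gpt /dev/sdX /dev/sdY          # dry run, prints the commands
#   RUN=1 label_gpt /dev/sdX /dev/sdY    # really relabels (as root)
```

Doing the dry run first is cheap insurance against relabeling the system disk by mistake.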

Software

These steps are to be done on all servers.

First thing is to set up networking, again for failover, by setting up bonding (i.e. two NICs acting as one). This will increase bandwidth while tolerating a possible network failure. Let’s take eth0 and eth1 to be the interfaces to bond. First set up modprobe.conf:

alias bond0 bonding 
options bond0 mode=balance-alb miimon=100 

Then setup their respective network scripts:

# nano /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

# nano /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

# nano /etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETMASK=X.X.X.X
IPADDR=X.X.X.X
GATEWAY=X.X.X.X
USERCTL=no
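Since the two slave files differ only in the device name, and this has to be repeated on every server, a tiny generator saves some typing. A sketch that writes the same six lines shown above; the function name write_slave_cfg is made up, and the optional third argument is only there so the output directory can be changed.

```shell
#!/bin/sh
# Hypothetical helper: write an ifcfg file that enslaves <iface> to <bond>.
# The contents match the hand-written slave files above.
write_slave_cfg() {
    wsc_iface=$1
    wsc_bond=$2
    wsc_dir=${3:-/etc/sysconfig/network-scripts}
    cat > "$wsc_dir/ifcfg-$wsc_iface" <<EOF
DEVICE=$wsc_iface
BOOTPROTO=none
ONBOOT=yes
MASTER=$wsc_bond
SLAVE=yes
USERCTL=no
EOF
}

# Example (as root):
#   for nic in eth0 eth1; do write_slave_cfg "$nic" bond0; done
```

The ifcfg-bond0 file still needs to be written by hand, since it carries the per-server IP settings.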

Then reboot. Next is multipathing. This is handled by a daemon that monitors the fiber connections and will switch over should it discover a disconnect. Any drivers needed for the fiber cards must be installed first, but in our case they were part of the kernel.

# yum install device-mapper-multipath
# chkconfig multipathd on

Then edit /etc/multipath.conf for the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
        path_checker            rdac
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
} 

Comment out the following section in multipath.conf:

blacklist {
        devnode "*"
}

Reboot. Now we install the lustre software. From the lustre website, download and install: kernel, e2fsprogs, lustre, lustre-modules, lustre-ldiskfs. Reboot for the new kernel to take effect, and verify via ‘uname -a’. Next is to set up LNET, the lustre networking API that does even more magic. Edit modprobe.conf and add “options lnet networks=tcp0(bond0)”

MDS/MDT Setup

These steps are for the MDS only, and are to be done on a single node only; do not do these on both MDSs, as they both can see the same LUN from the disk array. On the LUN to be used as an MDT, we identify it as a physical volume, then add it to a volume group, then create a logical volume on it (i.e. we are using LVM2). Now that’s done, we format it to be a lustre partition. It’s a bit more complicated, as the format command also passes variables to be stored there for lustre.

# mkfs.lustre --fsname=lustre --mgs --mdt /dev/lustre-mdt-dg/lv1
# mount -t lustre /dev/lustre-mdt-dg/lv1 /mdt

Simply mounting a lustre filesystem starts all that’s needed for it to work; no daemons etc. are used. You can use dmesg to see lustre output messages.
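I skipped over the actual LVM commands for the MDT, so here is a sketch of what that physical volume / volume group / logical volume sequence looks like. It only prints the commands so they can be reviewed first. The device name /dev/mapper/mpath0 is an assumption (use whatever multipath calls your MDT LUN), as are the function name and the choice to give the logical volume all free space; the volume group and volume names are chosen to produce the /dev/lustre-mdt-dg/lv1 path used above.

```shell
#!/bin/sh
# Hypothetical helper: print the LVM2 commands that turn one LUN into
# /dev/<vg>/<lv>. Nothing is executed here; review, then pipe to sh.
mdt_lvm_commands() {
    mlc_lun=$1                    # multipath device backing the MDT (assumed name)
    mlc_vg=${2:-lustre-mdt-dg}    # volume group name
    mlc_lv=${3:-lv1}              # logical volume name
    echo "pvcreate $mlc_lun"
    echo "vgcreate $mlc_vg $mlc_lun"
    echo "lvcreate -l 100%FREE -n $mlc_lv $mlc_vg"
}

# Example (on ONE MDS only, as root):
#   mdt_lvm_commands /dev/mapper/mpath0          # look it over
#   mdt_lvm_commands /dev/mapper/mpath0 | sh -x  # then run it
```

The same three commands, with different names, apply to each OST LUN later on.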

OSS/OST Setup

These steps are for the OSS only, and again are to be done on a single node only. In this case, as we have more than one LUN, we will want to distribute them across the two OSSs. This is done by mounting two on OSS1 and the last two on OSS2, even though, since they are cross-connected, each OSS can see all the LUNs, or OSTs (see, it gets confusing). You can do all the logical volume and formatting stuff on a single node to keep things easy, but don’t forget to mount them on different nodes.

There is (or was?) a naming convention used by the vendor who came and did this, but it’s all up to you. Looking back at the MDT setup, and the following, this is our naming convention for the logical volumes: /dev/lustre-ostX-dgX/lv1. So we have two OSTs, each has two LUNs, so we would get something like /dev/lustre-ost2-dg2/lv1. Anyways, we ended up with four of them and used the following to format them for lustre:

# mkfs.lustre --fsname=lustre --ost --mgsnode=X.X.X.X@tcp0 /dev/lustre-ost1-dg1/lv1

So now that’s done, edit your fstab on each of the OSSs to mount two on the first one and the last two on the second. When failover is set up later, the running OSS will mount the OSTs that were on the failed server to keep that data available. This is where the LNET magic comes in: it keeps the client waiting, whereas with an NFS server, problems are immediately visible on the client, usually as stale NFS handles and the like.
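To make the fstab split concrete, here is a sketch that prints the two lines for a given OSS. It assumes the four volumes are named ost1-dg1, ost1-dg2, ost2-dg1 and ost2-dg2 (my reading of the naming convention above), that OSS N mounts the ostN volumes, and that the mount points live under /mnt; adjust all three to match your setup. The _netdev option keeps the mounts from being attempted before the network is up.

```shell
#!/bin/sh
# Hypothetical helper: print fstab entries for the two OSTs served by OSS <n>.
# Volume names and mount points are assumptions, see the text above.
ost_fstab_lines() {
    ofl_oss=$1
    for ofl_dg in 1 2; do
        printf '/dev/lustre-ost%s-dg%s/lv1 /mnt/ost%s-dg%s lustre defaults,_netdev 0 0\n' \
            "$ofl_oss" "$ofl_dg" "$ofl_oss" "$ofl_dg"
    done
}

# Example:
#   ost_fstab_lines 1 >> /etc/fstab    # on the first OSS
#   ost_fstab_lines 2 >> /etc/fstab    # on the second OSS
```

Remember to create the mount point directories before the first mount.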

Client Setup

You can set up a client without patching the kernel, but I found it was easier to use the patched kernel, as the other way requires a particular version of the kernel. So install the lustre kernel and the lustre and lustre-modules packages, and reboot to use the new kernel. Then to mount the lustre filesystem, use ‘mount -t lustre X.X.X.X@tcp0:/lustre /lustre’ where X.X.X.X is the IP of the MDS/MGS server.

That’s the basics. Failover hasn’t even been set up, because that’s harder, and I’ll detail that later.